Using kdirstat for clusters

1. Summary

kdirstat (also k4dirstat) is an exceptional tool for visualizing the layout and distribution of files, even on large filesystems (FSs). While it is best known as part of the Linux KDE Desktop and is therefore mostly used for relatively small filesystems (up to 100s of GB), it can be used productively on subdirs of larger systems to detect problem dirs and files and, through its ability to call other scripts and apps, to allow naive users to self-police and clean up their own directories.

2. Intro & Background

There are a number of such filesystem visualizers available for Linux. Two others are described in the Appendix at the end of the doc, but here I'll describe the use of kdirstat and how it can be modified to be more useful in a cluster context.

Note about KDE on CentOS

Many clusters use RHEL or CentOS as the base distribution. These distros have the advantage of stability and support from commercial software packages, but their repositories for user utilities, especially for the KDE Desktop, are fairly awful. That said, with persistence it is possible to find and install the applications and libs required for the utility described in this doc. The CentOS-base repo has a number of KDE packages, but not all the KDE utils.

2.1. kdirstat

kdirstat (and its follow-on, k4dirstat) is about half as fast as the system utility du or the visualizer duc (see below), but its interface is much more intuitive and, importantly, allows manipulation of the filesystem via add-on scripts.

kdirstat-example

The kdirstat interface is designed to allow arbitrary modification of the filesystem by launching scripts written in any language.

Especially when those scripts are integrated into the scheduling system of a cluster, this allows users to fork large cleanup jobs into the job queue and be notified by email when their jobs are done.

3. General Usage of kdirstat

To use kdirstat, you’ll need to be running an X11 client (and the best way to do this for a number of reasons is via x2go). Configure it for HPC following this stanza and start a terminal from x2go to launch kdirstat as below:

On the HPC system, kdirstat is launched by:

module load kdirstat; kdirstat &

which should pop up a window that requires that you specify a dir to start with. By default, it’s the dir from which you started it.

kdirstat: choose initial view .

After you click [OK], the indexing will start - a small yellow PacMan will wander over the top bar while the dir is recursively read. How long this takes depends on the size of the dir and the number of files: ~1m for 500K files in 12GB on a heavily used NFS-mounted FS, and about the same for ~300K files in 3.3GB on a heavily used BeeGFS distributed FS.

kdirstat: scanning .

Once the indexing has finished, a treemap is created that shows the relative sizes of all dirs and files from the starting point.

kdirstat: initial treemap

When you mouse-select a dir or file in either the listing above or treemap below, the analogous object in the other is highlighted in red. Hence, it’s easy to bounce back and forth to find big files, number of files, or old files, via re-sorting by clicking the appropriate header.

4. File and Dir Operations

These options can be used by both root and normal users. At the bottom is a command: More Info on these HPC-local commands, which starts firefox, directed to the document you’re currently reading.

If you mouse-select a file or dir and then right-click, you can open a panel that lists a number of Cleanup options.

kdirstat: cleanup options

The upper ones are the default built-ins, but the configuration file allows a number of user-written options and supplies some useful ways of composing them, ranging from simple commands to complex scripts. Here are some that we’ve added:

4.1. File Operations

These right-click actions only work with file selections.

kdirstat: file options

4.1.1. What is this file?

This command merely runs the file command on the selected file and shows the result in a kdialog pop-up: one of a number of identifiable types (see man file).
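A minimal sketch of what such a command could look like, assuming the selected path arrives as the first argument. The function name kds_whatis is hypothetical (it is not one of the scripts distributed with this doc), and the fallback to terminal output when no display or kdialog is available is my own addition.

```shell
#!/bin/bash
# Hypothetical helper: describe a file with file(1) and show the result.
kds_whatis() {
    local target="$1"
    local desc
    # -b drops the leading "filename:" so only the type description remains
    desc=$(file -b -- "$target")
    if command -v kdialog >/dev/null 2>&1 && [ -n "$DISPLAY" ]; then
        kdialog --title "What is this file?" --msgbox "$target: $desc"
    else
        # no GUI available: print to the terminal instead
        printf '%s: %s\n' "$target" "$desc"
    fi
}
```

Calling kds_whatis on a plain text file should report a text type; what file prints for other formats depends on the installed magic database.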

4.1.2. Verify Owner by LDAP

Extracts the owner (UCINETID) of a file or dir and queries our local LDAP server to see if it is a valid UCINETID. If it is, it returns Dept affiliation & title as well as full name. Useful to see if directories have been abandoned.
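As a rough sketch of the idea (not the distributed kds-askLDAP.sh), the lookup could be done as below. The server URI, search base, and the assumption that the login name maps to the LDAP uid attribute are all placeholders that must be replaced with your site's values; the function names are hypothetical. stat -c is the GNU coreutils form.

```shell
#!/bin/bash
# Placeholders, NOT real servers: substitute your own LDAP details.
LDAP_URI="ldap://ldap.example.edu"
LDAP_BASE="ou=people,dc=example,dc=edu"

# Print the login name that owns a file or dir (GNU stat).
kds_owner() {
    stat -c %U -- "$1"
}

# Ask the LDAP server for the owner's full name, title and department.
# Assumes the login name is stored in the 'uid' attribute.
kds_ask_ldap() {
    local owner
    owner=$(kds_owner "$1") || return 1
    # -x: simple auth, -LLL: terse LDIF output without comments
    ldapsearch -x -LLL -H "$LDAP_URI" -b "$LDAP_BASE" "(uid=$owner)" cn title ou
}
```

An empty result from kds_ask_ldap would suggest the owner is no longer a valid account, which is the abandoned-directory case described above.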

4.1.3. gzip file with pigz (2 cpus)

This immediately compresses the selected file with pigz -p2 so as not to overwhelm the machine on which it’s run. The job is run in the foreground, so multiple such jobs can’t be started at the same time. The display will refresh when it’s done.
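A minimal sketch of such a foreground compression, assuming the selected path is passed as the first argument. The function name is hypothetical, and the fallback to plain gzip when pigz is not installed is an assumption of mine, not something the original command does.

```shell
#!/bin/bash
# Hypothetical helper: compress one file in the foreground with 2 threads.
kds_gzip_file() {
    local target="$1"
    [ -f "$target" ] || { echo "not a regular file: $target" >&2; return 1; }
    if command -v pigz >/dev/null 2>&1; then
        pigz -p2 "$target"     # parallel gzip, limited to 2 threads
    else
        gzip "$target"         # assumption: single-threaded fallback
    fi
}
```

Either way the original file is replaced by a .gz file with the same name plus the suffix.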

4.1.4. View Tarchive Contents

When a tarchive is selected, this starts the KDE utility ark, which allows the user to view and extract files from various archive formats. There is no error checking, so any file can be selected as a target; ark will be confused if it's not a valid archive.

4.2. Directory Operations

These right-click actions only work with directory selections.

kdirstat: file options

4.2.1. Tarchive Dir Now

Invokes an immediate tarchiving & compression of the selected dir.

4.2.2. Tarchive Dir (qsub)

This action sets up a qsub script to be submitted to the Grid Engine scheduler that tarchives the selected dir and then uses pigz -p8 to compress it. Since it happens in the background, it immediately returns control to kdirstat, allowing you to fire off multiple such tarchiving actions. Once the job is completed, the user is notified by email of what dir was compressed and the amount of compression.
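The general shape of such a job can be sketched as follows. This is an illustration only, not the distributed kds-qsub-tarchive.sh: the function name, job name, and default queue are placeholders, and the parallel-environment request that pigz -p8 would normally need is left out because its name is site-specific.

```shell
#!/bin/bash
# Hypothetical generator: write an SGE job script that tars a dir and
# compresses it with pigz -p8, then echo the path of the script.
# usage: kds_make_tar_job DIR [QUEUE] [EMAIL]
kds_make_tar_job() {
    local dir="${1%/}" queue="${2:-pub64}" email="${3:-$USER}"
    local job
    job=$(mktemp "${TMPDIR:-/tmp}/kds-tar.XXXXXX") || return 1
    # Unquoted heredoc: $dir etc. expand now, \$ is deferred to job run time.
    cat > "$job" <<EOF
#!/bin/bash
#\$ -N kds_tarchive
#\$ -q $queue
#\$ -M $email
#\$ -m e
before=\$(du -sk "$dir" | cut -f1)
tar -C "$(dirname "$dir")" -cf - "$(basename "$dir")" | pigz -p8 > "$dir.tar.gz"
after=\$(du -sk "$dir.tar.gz" | cut -f1)
echo "$dir: \${before} KB -> \${after} KB"
EOF
    echo "$job"
}
# Submitting (requires a Grid Engine installation):
#   qsub "$(kds_make_tar_job /path/to/dir pub64 you@example.org)"
```

The -m e and -M directives ask the scheduler to send mail when the job ends. This sketch leaves the original dir in place; whether to remove it after a successful archive is a site decision.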

4.2.3. In Situ Compression (qsub)

Similar to the option above, but it compresses files in place instead of combining them into a tarchive. This is useful for lightly cleaning a dir by compressing files that can be compressed and deleting 0-length files (while logging them to a date-stamped list). When the job completes, the owner is emailed the name of the dir processed and the amount of compression.
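The core of such an in-place pass might look like the sketch below. It is not the distributed kds-pigzem.sh: the function name and argument order are hypothetical, only files already ending in .gz are treated as "can't be compressed", and the gzip fallback is my own assumption.

```shell
#!/bin/bash
# Hypothetical sketch of an in-place cleanup.
# usage: kds_insitu DIR LOGFILE [CPUS]
kds_insitu() {
    local dir="$1" log="$2" cpus="${3:-8}"
    local -a zipper=(gzip)
    # prefer pigz when it is installed; otherwise fall back to plain gzip
    command -v pigz >/dev/null 2>&1 && zipper=(pigz "-p$cpus")

    # 1. log, then delete, zero-length files (the log itself is skipped)
    find "$dir" -type f -empty ! -path "$log" -print -delete >> "$log"

    # 2. compress what is left, skipping files that already end in .gz
    find "$dir" -type f ! -empty ! -name '*.gz' ! -path "$log" \
        -exec "${zipper[@]}" {} +
}
# Example with a date-stamped log kept outside the dir being cleaned:
#   kds_insitu /some/dir "$HOME/zero-length-$(date +%Y%m%d).list"
```

Keeping the log outside the target dir is the safer choice, since the log is then never a candidate for deletion or compression.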

5. Required files & Setup

Although the configuration and script files are fairly trivial to parse and write, it may save you time to copy them rather than create them anew.

The kdirstat configuration file for these operations is available here. It needs to be copied to $HOME/.kde/share/config/kdirstatrc for all users who want to use it.

5.1. For HPC users

For UC Irvine HPC users logged into the HPC login node hpc-login-1-2 (hpc-login-1-3 isn’t configured for it) via x2go or running with an X11 client:

mkdir -p $HOME/.kde/share/config/ # if it doesn't exist
# if you have already used kdirstat, make a backup
mv $HOME/.kde/share/config/kdirstatrc $HOME/.kde/share/config/kdirstatrc.bak
# copy the new one into place
cp /data/hpc/share/kdirstatrc $HOME/.kde/share/config/kdirstatrc
# set it up and go
module load kdirstat; kdirstat &

5.2. For users at other sites

The script files required for the above options (some written for the SGE scheduler) are available here (right-click, select option to save):

kds-qsub-tarchive.sh: for submitting tarchive jobs to the SGE Q. This script has a CONFIGS section at the top that includes all the things you should have to change.
kds-pigzem.sh: for submitting in situ compression jobs to the SGE Q. This script has a CONFIGS section at the top that includes all the things you should have to change.
kds-askLDAP.sh: for submitting identification requests to your local LDAP server. This uses ldapsearch to make the request; the script has to be modified to use your own LDAP server.
scut and stats: perl support scripts. scut is described in its own scut document. stats has its own help (stats -h) output which is sufficient.

All these scripts need to be placed on a generally available FS where they can be referenced by all users. In addition, the PATH to this dir has to be in the users' PATH variable or referenced explicitly.
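One way to handle the PATH part, as a sketch for a user's shell startup file. KDS_BIN is a hypothetical variable name, and its default of $HOME/bin is only an example; point it at whatever shared dir your site uses.

```shell
#!/bin/bash
# Hypothetical location of the kds-* scripts; substitute your site's dir.
KDS_BIN="${KDS_BIN:-$HOME/bin}"
# Prepend only if it's not already there, so re-sourcing doesn't grow PATH.
case ":$PATH:" in
    *":$KDS_BIN:"*) ;;
    *) PATH="$KDS_BIN:$PATH" ;;
esac
export PATH
```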

To do this for yourself if you are NOT on HPC:

mkdir -p $HOME/bin
cd $HOME/bin    # so that wget saves the scripts here
for II in askLDAP pigzem qsub-tarchive; do
  wget http://moo.nac.uci.edu/~hjm/kdirstat/kds-${II}.sh
done
chmod u+x $HOME/bin/kds*.sh

# if you already have started kdirstat, you will already have a 'kdirstatrc'
# file, parts of which are overwritten at each execution. Make sure you save
# a copy to preserve any customizations that you've made.
cd $HOME/.kde/share/config/
mv $HOME/.kde/share/config/kdirstatrc $HOME/.kde/share/config/kdirstatrc.bak
wget http://moo.nac.uci.edu/~hjm/kdirstat/kdirstatrc

Note that all the scripts have to be edited to set local modifications; there is a stanza at the top for setting up various site-specific variables, similar to this:

################################################################
# only things that should be changed (by the admin)
CPUs=8                                   # how many cores should pigz use
KDSDIR="/fast-scratch/${USER}/kds-data"  # make a home for all these scripts
STAFFQUEUE='staff'                       # root users can run in 'staff' Q
USERQUEUE='pub64'                        # normal users can run in 'pub64' Q
NOTIFY='hmangala'                        # admin user to notify about job info
STAFF="root hmangala jfarran garru "     # staff accounts
#################################################################

Most of these are explained by the comments. If they are not, let me know and I’ll expand on them.
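To illustrate how the STAFF, STAFFQUEUE and USERQUEUE settings could fit together, here is a guess at the selection logic, written as a sketch. The function kds_pick_queue is hypothetical and the distributed scripts may do this differently.

```shell
#!/bin/bash
# Values copied from the example stanza above.
STAFFQUEUE='staff'
USERQUEUE='pub64'
STAFF="root hmangala jfarran garru "

# Hypothetical helper: echo the queue a given user should submit to.
kds_pick_queue() {
    local user="$1" member
    for member in $STAFF; do      # unquoted on purpose: split on whitespace
        if [ "$member" = "$user" ]; then
            echo "$STAFFQUEUE"
            return 0
        fi
    done
    echo "$USERQUEUE"
}
# e.g. qsub -q "$(kds_pick_queue "$USER")" some-job.sh
```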

6. Appendix

6.1. baobab

baobab comes from the GNOME desktop. It is very slow (2.5m to profile the example dir), and while you can view the output as either a treemap or a ring chart, you cannot do anything with that info. No point in continuing to describe it.

baobab-example

6.2. duc

The duc indexer is very fast (as fast as du - 10s for the example dir tree), but it is written in parts, so you first have to index the FS you wish to profile and then either use a GUI to view the index or write a PNG of the dir tree. (It does have a cgi interface which may be useful, but I did not try it.) Because of that 2-part approach, it is effectively no faster than kdirstat. It also does not allow any manipulation of the dir tree, and while the visualization is fast, it is also somewhat confusing and does not lend itself to visualizing lots of files.

duc-example