= Using kdirstat for clusters
by Harry Mangalam
v1.2 - Dec 11th, 2015
:icons:

//Harry Mangalam mailto:harry.mangalam@uci.edu[harry.mangalam@uci.edu]
// this file is converted to the HTML via the command:
// fileroot="/home/hjm/nacs/kdirstat-pix/kdirstat-for-clusters";
// asciidoc -a icons -a toc2 -b html5 -a numbered ${fileroot}.txt;
// scp ${fileroot}.html ${fileroot}.txt moo:~/public_html/kdirstat;
// refresh all files:
// scp /home/hjm/nacs/kdirstat-pix/* moo:~hjm/public_html/kdirstat
// don't forget that the HTML equiv of '~' = '%7e'
// asciidoc cheatsheet: http://powerman.name/doc/asciidoc
// asciidoc user guide: http://www.methods.co.nz/asciidoc/userguide.html

== Summary
http://kdirstat.sourceforge.net/[kdirstat] (also https://bitbucket.org/jeromerobert/k4dirstat/wiki/Home[k4dirstat]) is an exceptional tool for visualizing the layout and distribution of files, even on large filesystems (FSs). While it is best known as part of the Linux https://www.kde.org/workspaces/plasmadesktop/[KDE Desktop] and is therefore used mostly for relatively small filesystems (up to 100s of GB), it can be used productively on subdirs of larger systems to detect problem areas and, through its ability to call other scripts and apps, to let naive users self-police and clean up their own directories.

== Intro & Background
There are a number of such filesystem visualizers available for Linux. Two others are described in the Appendix at the end of the doc, but I'll be describing the use of 'kdirstat' and how it can be modified to be more useful in a cluster context.

.Note about KDE on CentOS
***************************************************
Many clusters use RHEL or CentOS as the base distribution. These distros have the advantage of stability and support from commercial software packages, but their repositories for user utilities, especially for the https://www.kde.org/workspaces/plasmadesktop/[KDE Desktop], are fairly awful.
That said, with persistence it is possible to find and install the applications and libs required for the utility described in this doc. The 'CentOS-base' repo has a number of KDE packages, but not all the KDE utils.
***************************************************

=== kdirstat
http://kdirstat.sourceforge.net/[kdirstat] (and its follow-on, https://bitbucket.org/jeromerobert/k4dirstat/wiki/Home[k4dirstat]) is about half as fast as the system utility 'du' or the visualizer 'duc' (link:#duc[see below]), but its interface is much more intuitive and, importantly, it allows manipulation of the filesystem via add-on scripts.

image:kdirstat-elread.png[kdirstat-example]

The 'kdirstat' interface is designed to allow arbitrary modification of the filesystem by launching scripts written in any language. When those scripts are integrated into the scheduling system of a cluster, users can fork large cleanup jobs into the job queue and be notified by email when their jobs are done.

== General Usage of kdirstat
To use 'kdirstat', you'll need to be running an X11 client (and the best way to do this, for a number of reasons, is via http://wiki.x2go.org/doku.php/download:start[x2go]). Configure it for HPC following http://moo.nac.uci.edu/~hjm/biolinux/Linux_Tutorial_12.html#_x2go[this stanza] and start a terminal from x2go.

On the HPC system, 'kdirstat' is launched by:
---------------------------------------------------------------
module load kdirstat; kdirstat &
---------------------------------------------------------------
which should pop up a window that requires that you specify a dir to start with. By default, it's the dir from which you started it.

image:kds-choose-initial-dir.png[kdirstat: choose initial view].

After you click [OK], the indexing will start - a small yellow 'PacMan' will wander over the top bar while the dir is recursively read.
How long this takes depends on the size of the dir and the number of files: \~1m for 500K files in 12GB on a heavily used NFS-mounted FS, and about the same for ~300K files in 3.3GB on a heavily used BeeGFS distributed FS.

image:kds-initial-scan.png[kdirstat: scanning].

Once the indexing has finished, a treemap is created that shows the relative sizes of all dirs and files from the starting point.

image:kds-initialview.png[kdirstat: initial treemap]

When you mouse-select a dir or file in either the listing above or the treemap below, the analogous object in the other is highlighted in red. Hence, it's easy to bounce back and forth to find big files, large numbers of files, or old files by clicking the appropriate column header to re-sort.

== File and Dir Operations
These options can be used by both root and normal users. If you mouse-select a file or dir and then right-click, you can open a panel that lists a number of 'Cleanup' options. At the bottom of that panel is a command, *More Info on these HPC-local commands*, which starts 'firefox', directed to the document you're currently reading.

image:kds-options.png[kdirstat: cleanup options]

The upper ones are the default 'built-ins', but the configuration file allows a number of user-written options and supplies some useful ways of composing them, ranging from simple commands to complex scripts. Here are some that we've added:

=== File Operations
These right-click actions only work with file selections.

image:kds-file-options.png[kdirstat: file options]

==== What is this file?
This command merely invokes the 'file' command on the selected file and shows the result in a kdialog pop-up: one of a number of identifiable types (see 'man file').

==== Verify Owner by LDAP
Extracts the owner (UCINETID) of a file or dir and queries our local LDAP server to see if it is a valid UCINETID. If it is, it returns the owner's Dept affiliation and title as well as full name. Useful to see if directories have been abandoned.

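
As a rough illustration of the owner lookup described above, here is a minimal sketch, assuming GNU 'stat' and the OpenLDAP 'ldapsearch' client. It is not the actual kds-askLDAP.sh: the server URI, search base, requested attributes, and the function names 'owner_of' and 'ldap_lookup' are all placeholders that would have to be adapted to your site.

------------------------------------------------------------------------
#!/bin/bash
# Hypothetical sketch, NOT the real kds-askLDAP.sh.
LDAP_URI="ldap://ldap.example.edu"        # placeholder: your LDAP server
LDAP_BASE="ou=people,dc=example,dc=edu"   # placeholder: your search base

# print the login name of the owner of a file or dir (GNU stat syntax)
owner_of() {
    stat -c '%U' -- "$1"
}

# ask the LDAP server for name, title and dept of a login name
# (the attribute names are assumptions; directories differ)
ldap_lookup() {
    ldapsearch -x -LLL -H "$LDAP_URI" -b "$LDAP_BASE" "uid=$1" cn title ou
}

# example use:
#   ldap_lookup "$(owner_of /some/dir)"
------------------------------------------------------------------------

If the lookup prints no entry for the owner, the account is probably no longer valid, which is the case the 'abandoned directory' check is after.
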
==== gzip file with pigz (2 cpus)
This immediately compresses the selected file with 'pigz -p2', limited to 2 threads so as not to overwhelm the machine on which it's run. The job is run in the foreground, so multiple such jobs can't be started at the same time. The display will refresh when it's done.

==== View Tarchive Contents
When a tarchive is selected, this starts the KDE utility https://utils.kde.org/projects/ark/[ark], which allows the user to view and extract files from various archive formats. There is no error checking, so any file can be used as a target; if it's not a valid archive, the result is a confused 'ark'.

=== Directory Operations
These right-click actions only work with directory selections.

image:kds-dir-options.png[kdirstat: dir options]

==== Tarchive Dir Now
Invokes an immediate tarchiving & compression of the selected dir.

==== Tarchive Dir (qsub)
This action sets up a qsub script to be submitted to the Grid Engine scheduler that tarchives the selected dir and then uses 'pigz -p8' to compress it. Since it happens in the background, it immediately returns control to 'kdirstat', allowing you to fire off multiple such tarchiving actions. Once the job is completed, the user is notified by email of which dir was compressed and the amount of compression.

==== In Situ Compression (qsub)
Similar to the option above, but it compresses files in-place instead of combining them into a tarchive. This is useful for 'light cleaning' of a dir: it compresses the files that can be compressed and deletes 0-length files (simultaneously logging them to a date-stamped list). When the job completes, the owner is emailed the name of the dir processed and the amount of compression.

== Required files & Setup
Although the configuration and script files are fairly trivial to parse and write, it may save you time to copy them rather than create them anew.

The *kdirstat configuration file* for these operations is http://moo.nac.uci.edu/~hjm/kdirstat/kdirstatrc[available here].
It needs to be copied to '$HOME/.kde/share/config/kdirstatrc' for all users who want to use it.

// scp /data/hpc/share/kdirstatrc hjm@moo.nac.uci.edu:~/public_html/kdirstat

=== For HPC users
These steps are for UC Irvine HPC users logged into the HPC login node *hpc-login-1-2* (hpc-login-1-3 isn't configured for it), via x2go or another X11 client.

------------------------------------------------------------------------
mkdir -p $HOME/.kde/share/config/ # if it doesn't exist
# if you have already used kdirstat, make a backup
mv $HOME/.kde/share/config/kdirstatrc $HOME/.kde/share/config/kdirstatrc.bak
# copy the new one into place
cp /data/hpc/share/kdirstatrc $HOME/.kde/share/config/kdirstatrc
# set it up and go
module load kdirstat; kdirstat &
------------------------------------------------------------------------

=== For users at other sites
The *script files* (some written for the *SGE scheduler*) required for the above options are available here (right-click, select option to save):

- http://moo.nac.uci.edu/~hjm/kdirstat/kds-qsub-tarchive.sh[kds-qsub-tarchive.sh], for submitting tarchive jobs to the SGE Q. This script has a *CONFIGS* section at the top that includes all the things you should have to change.
- http://moo.nac.uci.edu/~hjm/kdirstat/kds-pigzem.sh[kds-pigzem.sh], for submitting 'in situ' compression jobs to the SGE Q. This script has a *CONFIGS* section at the top that includes all the things you should have to change.
- http://moo.nac.uci.edu/~hjm/kdirstat/kds-askLDAP.sh[kds-askLDAP.sh], for submitting identification requests to your local 'LDAP' server. This uses http://goo.gl/PQqYo1[ldapsearch] to make the request; the script has to be modified to use your own LDAP server.
- http://moo.nac.uci.edu/~hjm/scut[scut] and http://moo.nac.uci.edu/~hjm/stats[stats], which are perl support scripts. 'scut' is described in its own http://moo.nac.uci.edu/~hjm/scut_cols_HOWTO.html[scut document]. 'stats' has its own help output ('stats -h'), which is sufficient.

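
To give a feel for what a tarchive-by-qsub script has to do, here is a minimal sketch that only generates an SGE job script for a given dir. It is not the actual kds-qsub-tarchive.sh: the names 'make_tar_job', 'CPUS', 'QUEUE' and 'MAILTO' are hypothetical, the queue name is simply borrowed from the sample config in this doc, and no parallel environment is requested because those names are site-specific.

------------------------------------------------------------------------
#!/bin/bash
# Hypothetical sketch, NOT the real kds-qsub-tarchive.sh.
CPUS=8                          # threads for pigz
QUEUE="pub64"                   # assumption: adjust to a queue at your site
MAILTO="${USER:-$(id -un)}"     # assumes mail to a bare login name is delivered

# print an SGE job script that tars DIR and compresses the stream with pigz
make_tar_job() {
    local dir="${1%/}"          # strip one trailing slash, if any
    local parent base
    parent="$(dirname -- "$dir")"
    base="$(basename -- "$dir")"
    cat <<EOF
#!/bin/bash
#\$ -N tarchive_${base}
#\$ -q ${QUEUE}
#\$ -m e
#\$ -M ${MAILTO}
cd "${parent}" || exit 1
tar -cf - "${base}" | pigz -p${CPUS} > "${base}.tar.gz"
EOF
}

# example use (qsub reads the job script from stdin):
#   make_tar_job /path/to/bigdir | qsub
------------------------------------------------------------------------

Generating the job text separately from submitting it makes it easy to inspect what would be run before anything reaches the queue. The sketch leaves the original dir in place and does not handle dir names containing spaces in the job name.
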
All these scripts need to be placed on a generally available FS where they can be referenced by all users. In addition, the path to this dir has to be in each user's PATH variable, or the scripts have to be referenced explicitly.

// scp /data/hpc/bin/kds* hjm@moo.nac.uci.edu:~/public_html/kdirstat

To do this for yourself if you are NOT on HPC:

------------------------------------------------------------------------
mkdir -p $HOME/bin
cd $HOME/bin   # so that wget saves the scripts here
for II in askLDAP pigzem qsub-tarchive; do
  wget http://moo.nac.uci.edu/~hjm/kdirstat/kds-${II}.sh
done
chmod u+x $HOME/bin/kds*.sh   # make the scripts executable
# if you already have started kdirstat, you will already have a 'kdirstatrc'
# file, parts of which are overwritten at each execution. Make sure you save
# a copy to preserve any customizations that you've made.
mkdir -p $HOME/.kde/share/config/ # if it doesn't exist
cd $HOME/.kde/share/config/
mv $HOME/.kde/share/config/kdirstatrc $HOME/.kde/share/config/kdirstatrc.bak
wget http://moo.nac.uci.edu/~hjm/kdirstat/kdirstatrc
------------------------------------------------------------------------

Note that all the scripts have to be edited for your site; there is a stanza at the top of each for setting the various site-specific variables, similar to this:

----------------------------------------------------------------------
################################################################
# only things that should be changed (by the admin)
CPUs=8                                  # how many cores should pigz use
KDSDIR="/fast-scratch/${USER}/kds-data" # make a home for all these scripts
STAFFQUEUE='staff'                      # root users can run in 'staff' Q
USERQUEUE='pub64'                       # normal users can run in 'pub64' Q
NOTIFY='hmangala'                       # admin user to notify about job info
STAFF="root hmangala jfarran garru "    # staff accounts
#################################################################
----------------------------------------------------------------------

Most of these are explained by the comments. If they are not, let me know and I'll expand on them.

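
As an illustration of how the 'STAFF', 'STAFFQUEUE' and 'USERQUEUE' settings might interact, here is a minimal sketch that picks a queue depending on whether the submitting user is in the staff list. The function name 'pick_queue' is hypothetical and the logic is a guess at the intent of the comments above, not an excerpt from the actual scripts.

------------------------------------------------------------------------
#!/bin/bash
# Hypothetical sketch; variable names and values copied from the stanza above.
STAFFQUEUE='staff'
USERQUEUE='pub64'
STAFF="root hmangala jfarran garru "

# print the queue that the given user should submit to
pick_queue() {
    local user="$1" s
    for s in $STAFF; do      # unquoted on purpose: split the list on whitespace
        if [ "$s" = "$user" ]; then
            echo "$STAFFQUEUE"
            return
        fi
    done
    echo "$USERQUEUE"
}

# queue for whoever is running this
pick_queue "${USER:-$(id -un)}"
------------------------------------------------------------------------

Comparing whole words rather than substrings means that a login such as 'hmang' is not mistaken for the staff account 'hmangala'.
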
== Appendix

=== baobab
https://wiki.gnome.org/Apps/Baobab[baobab] is the visualizer from the GNOME desktop, but it is very slow (2.5m to profile the example dir) and, while you can view the output as either a treemap or a ring chart, you cannot do anything with that info. No point in continuing to describe it.

image:baobab-elread.png[baobab-example]

[[duc]]
=== duc
The https://github.com/zevv/duc[duc] indexer is very fast (as fast as 'du' - 10s for the example dir tree), but it is split into parts: you first have to index the FS you wish to profile, and then either view that index in a GUI or write a PNG of the dir tree. (It also has a cgi interface which may be useful, but I did not try it.) Because of that 2-part approach, it is effectively no faster than kdirstat. It also does not allow any manipulation of the dir tree and, while the visualization is fast, it is also somewhat confusing and does not lend itself to visualizing lots of files.

image:duc-elread.png[duc-example]