= Some tools to munge data Harry Mangalam v1.0 Nov 12, 2017 :icons: // fileroot="/home/hjm/nacs/data-munging-tools"; asciidoc -a icons -a toc2 -a toclevels=3 -b html5 -a numbered ${fileroot}.txt; scp ${fileroot}.html ${fileroot}.txt moo:~/public_html 'Data munging' or https://en.wikipedia.org/wiki/Data_wrangling[Data Wrangling] is the process of cleansing, transforming, and/or extracting data from crude form to one that can be ingested by your analytical tools to yield progress to your PhD. This will most likely occupy most of your computational time during your degree. == Summary Data prep/cleansing is one of the biggest impediments to getting things done. When I was transitioning (from biology to computers), I'd get literal headaches from staring at ragged columns, trying to tell if the data was in the right shape, trying to detect outliers, trying to merge data, etc. These tools (all available on HPC) helped me (bc I wrote some of them specifically to save my eyes): - *scut*: a better 'cut', cut on any/multiple regexes, deletes, re-arranges cols 'scut -h' for help. - *stats*: emits descriptive stats on numbers supplied on STDIN, also can stream- transform input numbers for a set of popular transforms. 'stats -h' for help - tools to align columns, in a spreadsheet like output * *cols*: previews data in aligned columns, which are numbered from any start #. 'cols -h' for help * *column*: similar, but less useful for parameterizing; doesn't align all columns. 'man column' for help * *columns*: similar to column - what happens when GNU gets hold of a tool. 'columns --help' for short help, 'man columns' for the full monty. - *feedgnuplot*: takes STDIN & plots it with gnuplot; removes much of the overhead of learning the very powerful,but confusing gnuplot. Will plot to a term if no X11 graphics are availale, using the '--terminal "dumb 80,40"'. 'feedgnuplot -h' to dump help; 'feedgnuplot -h |& less' for scrollable help. Send me links to your favorite data munging tools and I'll include them as well. You may also be interested in http://moo.nac.uci.edu/~hjm/ManipulatingDataOnLinux.html[Manipulating Data on Linux], which I'm updating to reflect more data munging content. == Examples: See the use of all of them in one line: --------------------------------------------------------------------------- cd /usr/bin; ls -l | scut -f="1 4 6" | stats --xf="ln" --stdout --gfmt | cols --ch=5 | less # (insert a 'less # ' at each '|' above to see what each stage does; ie: cd /usr/bin; ls -l | less # scut -f="1 4 6" | stats --xf="ln" --stdout --gfmt | cols --ch=5 | less # ******** cd /usr/bin; ls -l | scut -f="1 4 6" | less # stats --xf="ln" --stdout --gfmt | cols --ch=5 | less # ******** etc --------------------------------------------------------------------------- If you have a graphics/X11 screen (such as http://moo.nac.uci.edu/~hjm/biolinux/Linux_Tutorial_12.html#_x2go[x2go]) --------------------------------------------------------------------------- cd /usr/bin; ls -l [efg]* | scut -f="1 4 6" | \ stats --xf="ln" --stdout --gfmt | feedgnuplot --------------------------------------------------------------------------- If you don't: --------------------------------------------------------------------------- cd /usr/bin; ls -l [efg]* | scut -f="1 4 6" | \ stats --xf="ln" --stdout --gfmt | feedgnuplot --terminal 'dumb 80,40' --exit --------------------------------------------------------------------------- === stats The stats that this tool produces: --------------------------------------------------------------------------- $ cd /usr/bin; ls -l | scut -f="1 4 6" | stats Sum 479240044 Number 9091 Mean 52715.8776812232 Median 13 Mode 1 NModes 3065 Min 1 Max 53409192 Range 53409191 Variance 634674127880.993 Std_Dev 796664.375933173 SEM 8355.44468788764 95% Conf 36339.2060929634 to 69092.549269483 (for a normal distribution - see skew) Skew 47.1264013134154 (skew is 0 for a symmetric dist) Std_Skew 1834.40185517179 Kurtosis 2747.58818352313 (K=3 for a normal dist) --------------------------------------------------------------------------- For many uses, the 'gfmt' format is better --------------------------------------------------------------------------- # ****** $ cd /usr/bin; ls -l | scut -f="1 4 6" | stats --gfmt Sum 4.7924e+08 Number 9091 Mean 52715.9 Median 13 Mode 1 NModes 3065 Min 1 Max 5.34092e+07 Range 5.34092e+07 Variance 6.34674e+11 Std_Dev 796664 SEM 8355.44 95% Conf 36339.2 to 69092.5 (for a normal distribution - see skew) Skew 47.1264 (skew = 0 for a symmetric dist) Std_Skew 1834.4 Kurtosis 2747.59 (K=3 for a normal dist) ---------------------------------------------------------------------------