Harry Mangalam <harry.mangalam@uci.edu> v1.0 Nov 12, 2017

Data munging or Data Wrangling is the process of cleansing, transforming, and/or extracting data from crude form to one that can be ingested by your analytical tools to yield progress to your PhD. This will most likely occupy most of your computational time during your degree.

1. Summary

Data prep/cleansing is one of the biggest impediments to getting things done. When I was transitioning (from biology to computers), I’d get literal headaches from staring at ragged columns, trying to tell if the data was in the right shape, trying to detect outliers, trying to merge data, etc.

These tools (all available on HPC) helped me (bc I wrote some of them specifically to save my eyes):

  • scut: a better cut, cut on any/multiple regexes, deletes, re-arranges cols scut -h for help.

  • stats: emits descriptive stats on numbers supplied on STDIN, also can stream- transform input numbers for a set of popular transforms. stats -h for help

  • tools to align columns, in a spreadsheet like output

    • cols: previews data in aligned columns, which are numbered from any start #. cols -h for help

    • column: similar, but less useful for parameterizing; doesn’t align all columns. man column for help

    • columns: similar to column - what happens when GNU gets hold of a tool. columns --help for short help, man columns for the full monty.

  • feedgnuplot: takes STDIN & plots it with gnuplot; removes much of the overhead of learning the very powerful,but confusing gnuplot. Will plot to a term if no X11 graphics are availale, using the --terminal "dumb 80,40". feedgnuplot -h to dump help; feedgnuplot -h |& less for scrollable help.

Send me links to your favorite data munging tools and I’ll include them as well.

You may also be interested in Manipulating Data on Linux, which I’m updating to reflect more data munging content.

2. Examples:

See the use of all of them in one line:

cd /usr/bin; ls -l | scut -f="1 4 6" | stats --xf="ln" --stdout --gfmt | cols --ch=5 | less

#    (insert a 'less # ' at each '|' above to see what each stage does; ie:

cd /usr/bin; ls -l | less # scut -f="1 4 6" | stats --xf="ln" --stdout --gfmt | cols --ch=5 | less
#                  ********
cd /usr/bin; ls -l | scut -f="1 4 6" | less # stats --xf="ln" --stdout --gfmt | cols --ch=5 | less
#                                    ********  etc

If you have a graphics/X11 screen (such as x2go)

cd /usr/bin; ls -l [efg]* | scut -f="1 4 6" | \
stats --xf="ln" --stdout --gfmt | feedgnuplot

If you don’t:

cd /usr/bin; ls -l [efg]* | scut -f="1 4 6" | \
stats --xf="ln" --stdout --gfmt | feedgnuplot --terminal 'dumb 80,40' --exit

2.1. stats

The stats that this tool produces:

$ cd /usr/bin; ls -l | scut -f="1 4 6" | stats
Sum       479240044
Number    9091
Mean      52715.8776812232
Median    13
Mode      1
NModes    3065
Min       1
Max       53409192
Range     53409191
Variance  634674127880.993
Std_Dev   796664.375933173
SEM       8355.44468788764
95% Conf  36339.2060929634 to 69092.549269483
          (for a normal distribution - see skew)
Skew      47.1264013134154
          (skew is 0 for a symmetric dist)
Std_Skew  1834.40185517179
Kurtosis  2747.58818352313
          (K=3 for a normal dist)

For many uses, the gfmt format is better

#                                              ******
$ cd /usr/bin; ls -l | scut -f="1 4 6" | stats --gfmt
Sum       4.7924e+08
Number    9091
Mean      52715.9
Median    13
Mode      1
NModes    3065
Min       1
Max       5.34092e+07
Range     5.34092e+07
Variance  6.34674e+11
Std_Dev   796664
SEM       8355.44
95% Conf  36339.2 to 69092.5
          (for a normal distribution - see skew)
Skew      47.1264
          (skew = 0 for a symmetric dist)
Std_Skew  1834.4
Kurtosis  2747.59
          (K=3 for a normal dist)