BigData and its Analysis on Linux
=================================
by Harry Mangalam
v0.1 - Nov 13, 2014
:icons:

// fileroot="/home/hjm/nacs/bigdata/BigData_Class"; asciidoc -a icons -a toc2 -b html5 -a numbered ${fileroot}.txt; scp ${fileroot}.html ${fileroot}.txt moo:~/public_html/biolinux; ssh -t moo 'scp ~/public_html/biolinux/BigData_Class.[ht]* hmangala@hpc.oit.uci.edu:/data/hpc/www/biolinux/'

== For Linux users, but novices at cluster data analysis and BigData

If you don't know your way around a Linux bash shell nor how a compute cluster
works, please review the previous
http://moo.nac.uci.edu/~hjm/biolinux/Linux_Tutorial_12.html[Introduction to Linux essay/tutorial here].
It reviews Linux, the bash shell, common commands & utilities, as well as the
SGE scheduler.

== Introduction

For those of you who are starting your graduate career and have spent most of
your life with MacOS or Windows, the Linux commandline can be mystifying. The
previous lecture and tutorial was meant to familiarize you with Linux on HPC,
with only a very fast introduction to handling and analyzing large amounts of
data.

You are among the first few waves of the 'Data Generation' - your lives and
research guided by, and awash with, ever-increasing rings of data. If you're
reading this, your research will almost certainly depend on large amounts of
data, data that does not map well to MS Word, Excel, or other such
pointy-clicky desktop apps. The de facto standard for large-scale data
processing, especially in research, is the bash prompt, backed by a Linux
compute cluster. It's not pretty, but it works exceedingly well. In the end, I
think you'll find that it is worth the transition cost to use a cluster where
most compute nodes each have 64 CPU cores, 1/2 TB of RAM, and access to 100s of
TB of disk space on a very fast parallel filesystem.

This is not a HOWTO on the analysis of BigData. That I'll leave to others who
have specific expertise. The aim of this document is to give you an idea of how
data 'lives' - how it is stored, how it is unstored, what path it travels, how
much time it takes to transition from one state to another, and the time costs
involved in doing so. I'll touch briefly on some general approaches to
analyzing data on the Linux OS, introducing some small-data approaches (which
may be all you need) and then expanding them to much larger data.

== The Data Flight Path

It may be a weak analogy, but you might think of your analysis as a long
airplane flight, say from LAX to Heathrow. One flight path is direct via the
polar route. Another might be a series of short commuter flights that hops
across the US to Florida and then up the east coast, eventually to Nova Scotia,
then to Gander, NFLD, then across to Dublin, on to Edinburgh, and finally to
LHR. The more times your plane has to land, the longer the trip will take. Data
has the same characteristic, where 'landing' is equivalent to being written to
disk.

== Where Data Lives and How Fast it Moves

=== CPU registers

In order to be manipulated by the CPU, binary data has to be in CPU registers -
very small, very high-speed storage locations in the CPU that hold the data
immediately being operated on: where the data comes from and where it's
supposed to go in RAM, the operations to be done, the state of the program, the
state of the CPU, etc. These registers operate in sync with the CPU at the CPU
clock speed, currently somewhere in the range of 3GHz, or one operation every
~0.3 nanoseconds.
=== CPU cache

CPU 'cache' is used to store the data coming into and out of the CPU registers
and is often tiered into multiple levels: primary, secondary, tertiary, and
even quaternary caches. These assist in staging the data that the CPU is asking
for, or even data that the CPU 'might' ask for, based on the structure of the
program it's running. The size of these caches has grown exponentially over the
last few years as fabs can jam more transistors onto a die and caching
algorithms have gotten better. In the Pentium era, a large primary cache was a
few KB. Now it's 32KB, multi-thread accessible, for the latest Haswell CPUs,
with 256KB of L2 cache and 2-20MB of L3 cache. These transistors are typically
on the same silicon die as the CPU, mere millimeters away from the registers,
so the access times are typically VERY fast. As the cache gets further away
from the CPU registers, the time to access the data stored there gets slower as
well. To use data from the primary cache takes \~4 clock tics; to use data from
the secondary cache takes \~11 tics, and so on down the line.

=== System RAM

System RAM, typically considered very fast, is actually very slow compared to
the L1, L2, and L3 caches. It is further away, both physically and
electronically, since it has to be referenced via a memory controller. To
refresh data from main RAM to L1 cache takes on the order of 1000 clock tics
(or a few hundred ns). So RAM is very slow compared to cache, BUT! System
memory access is 1000 to 1M times faster than disk access, in which the time to
swing round a disk platter, position the head, and initiate a new read is on
the order of several to several hundred milliseconds, depending on the quality
of the disk, the amount of external vibration, and anti-vibration features.

The preceding paragraphs are meant to impress upon you that when your data is
picked up off a disk, it should remain off disk as long as possible, until
you're done analyzing it. Reading data off a disk, doing a small operation, and
then writing it back to disk many times thru the course of an analysis is among
the worst offenses in data analysis. KEEP YOUR DATA IN FLIGHT.

=== Flash Memory / SSDs

In recent years, flash or non-volatile memory - which does not need power to
keep the data intact - has gotten cheaper as well. You're probably familiar
with this type of memory in thumbdrives, removable memory for your phones and
tablets, and especially SSDs. This memory is notable for HPC and BigData in
that, unlike spinning disks, the time to access a chunk of data is very fast -
on the order of microseconds. The speed at which data can be read off in bulk
is not much different than a spinning disk, but if your data is structured
discontinuously or needs to be read and written in 'non-streaming mode' (such
as with many relational database operations), SSDs can provide a very
significant performance boost.

=== RAM and the Linux Filecache

All modern OSs support the notion of a file cache - that is, if data has been
read recently, it's much more likely to be needed again. So instead of reading
data into RAM and then clearing that RAM when it's not needed, Linux keeps it
around - stuffing it into the RAM that's not needed for specific apps, their
data, and other OS functions. This is why Linux almost never has very much
'free RAM' - it's almost all used for the filecache and unsynced buffers, and
the OS only clears it when the cached data has expired or when an app or the OS
needs more. This is one reason why you should always prefer more RAM to a
faster CPU.
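If you want to see the filecache at work yourself, here's a minimal Python
sketch of the same kind of measurement that the RWbinary.pl demo below makes;
the file path is just a placeholder for any multi-GB file you have handy:

----------------------------------------------------------------------------
import time

PATH = "/tmp/big.bin"   # placeholder: any multi-GB file will do

def timed_read(path, chunk=1 << 20):
    """Read the whole file in 1 MB chunks; return (bytes read, elapsed seconds)."""
    t0 = time.time()
    nbytes = 0
    with open(path, "rb") as f:
        while True:
            buf = f.read(chunk)
            if not buf:
                break
            nbytes += len(buf)
    return nbytes, time.time() - t0

print("cold cache:", timed_read(PATH))   # first read comes off the disk
print("warm cache:", timed_read(PATH))   # immediate re-read comes out of RAM
----------------------------------------------------------------------------

The measurement below, made with a small Perl script (RWbinary.pl) on a 2GB
binary file, shows the same effect - a ~25X difference between the cold and
warm reads.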
----------------------------------------------------------------------------
# done immediately after clearing the filecache (echo 3 > /proc/sys/vm/drop_caches)
# RWbinary.pl just reads 2GB of binary data.
$ time RWbinary.pl
Read 2147483648 bytes

real    0m22.208s   <--- took 22s
user    0m0.546s
sys     0m1.789s

# repeated immediately afterwards, from the filecache.
$ time RWbinary.pl
Read 2147483648 bytes

real    0m0.851s    <--- took <1s!
user    0m0.272s
sys     0m0.579s
----------------------------------------------------------------------------

.Streaming Data Input/Output (IO)
[NOTE]
===========================================================================
Streaming data Input/Output (IO) is the process (reading or writing) where the
bytes are processed in a continuous stream. It usually refers to data on disk,
where the time to change an IO point is quite significant (milliseconds). We
often see this kind of data access in bioinformatics, in which a large chunk of
DNA needs to be read from end to end, or in other domains where large images or
binary data need to be read into RAM for processing.

The reverse of 'streaming IO' is 'discontinuous IO', where a few bytes need to
be read from one part of a file, processed, and then written to another (or the
same) file. This type of data IO is often seen in relational database
processing, where a complex table join to answer a query results in very
complicated data access patterns to locate all the data.

Data on disk that can be 'streamed' can be read much more efficiently than data
that needs a lot of disk head movement.

See also: link:#zotfiles[ZOTfiles]
===========================================================================

=== Disks and Filesystems

//insert image of uncovered disk

A hard disk is simply a few metal-coated platters spinning at 5K to 15K rpm,
with tiny magnetic heads moving back and forth on light arms driven by voice
coils, much like the cones of a speaker. The magnetic heads either write data
to the magnetic domains on the platters or read it back from those domains.
However, because this involves accelerating and decelerating a physical object
with great accuracy, as well as initiating the reads and writes at very precise
places on the disk, it takes much longer to initiate an IO operation than with
RAM or an SSD. Once the operation has begun, a disk can read or write data at
about the same speed as an SSD (but typically much slower than RAM).

In order to increase either the speed or the reliability of disks, they are
often grouped together in http://en.wikipedia.org/wiki/RAID[RAIDs] (Redundant
Arrays of Inexpensive, or Independent, Disks), in which the disks can be used
in parallel and/or with checksum parity to allow the failure of 1 or 2 disks in
a RAID without losing data. Such RAIDs are often used as the basis for a number
of different filesystems, both local to a computer and via a network. In the
latter case, the most popular protocol is the
http://en.wikipedia.org/wiki/Network_File_System[Network File System] ('NFS'),
which is reliable but complex and therefore somewhat slow. Individual RAIDs can
also be agglomerated into larger networked arrays using various kinds of
http://en.wikipedia.org/wiki/Clustered_file_system#Distributed_file_systems[Distributed Filesystems]
(aka DFSs), in which a process (often running on a separate metadata server)
tracks the requests for IO and tells the client where to store the actual
bytes.
Because the data can be read from or written to multiple arrays (each
consisting of 10s of disks) simultaneously, the performance of such DFSs can be
extremely high - 10s to 100s of times faster than individual disks. Examples of
such DFSs are Gluster, BeeGFS/Fraunhofer, Lustre, OrangeFS, etc.

// Don't forget - this is almost presentation-ready:
// http://moo.nac.uci.edu/~hjm/biolinux/BigData4Newbies.html
// Include all of this??

=== Data formats

There are a number of different formats in which data, Big or not, can be
stored.

==== Un/structured ASCII files

ASCII (also, but rarely, known as 'American Standard Code for Information
Interchange') is 'text', typically what you see on your screen as alphanumeric
characters, altho the full ASCII set is
http://www.ascii-code.com[considerably larger]. For example, in bioinformatics,
the protein and nucleic acid codes, as well as the confidence codes in FASTQ
files, are all ASCII characters. Most of this page is composed of ASCII
characters. It is simple, human-readable, fairly compact, and easily
manipulated by both naive and advanced programmers. It is the direct computer
equivalent of how we think about data - characters representing alphabetic and
numeric symbols.

The main thing about ASCII is that it is character data - each character is
encoded in 7-8 bits (a byte). There is little correspondence between the symbol
and its real value: the character '7' is represented by the ASCII value 55
dec, 67 oct, or 37 hex ('00110111' in binary), whereas the number 7 in binary
is simply '111'.

ASCII data is also notable for its use of 'tokens' or delimiters. These are
characters or strings that separate values in the data stream. These tokens are
usually single characters like tabs, commas, or spaces, but can also be longer
strings, such as '__'. These tokens not only separate values, but also take up
space - one byte for each delimiter. For long lines of short values, the
delimiters alone can eat up a lot of storage.

For many forms of data, and especially when that data gets very Big, ASCII
becomes less useful and takes longer to process. If you are a beginner in
programming, you will probably spend the next several years programming with
ASCII, regardless of what anyone tells you. But there are other, more compact
ways of encoding data.

==== Binary files

While all data on a computer is binary in one sense - the data is laid down in
binary streams - 'binary' format typically means that the data is laid down in
streams of defined data types, segregated by format definitions in the program
that does the reading. In this format, the bytes, ints, chars, floats, longs,
doubles, etc. of the internal representation of the data are written without
intervening delimiter characters, like writing a sentence with no spaces.

------------------------------------------------------------
163631823645618364912152ducksheepshark387ratthingpokemon
 \   \    \    \    \        \           \   \      \
%3d, %4d, %5d, %3d, %9.4f, %4s, %10s, %3d, %8s, %7s   # read format
------------------------------------------------------------

In the string of data above (translated into ASCII for readability), the stream
of bytes is broken into variables by the 'format string' - an ugly but very
useful definition which, for the data above, looks like the read-format line
beneath the data string.
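To make the ASCII-vs-binary distinction concrete, here is a small illustrative
Python sketch (the values and the format template are made up for the example)
that packs a few values into a fixed-width binary record and reads them back;
note that the packed size depends only on the declared types, not on how many
digits the numbers happen to have:

------------------------------------------------------------
import struct

# a hypothetical record: two ints, a float, and a fixed-width species name
record = (163, 23645, 2152.3874, b"ducksheepshark")
fmt = "<iif14s"          # little-endian: int32, int32, float32, 14-byte string

packed = struct.pack(fmt, *record)
print(len(packed))       # 26 bytes, always, for this format
print(struct.unpack(fmt, packed))   # the float comes back at float32 precision
------------------------------------------------------------

The equivalent ASCII line, with its delimiters, would be both larger and slower
to parse, and would grow with the number of digits in each value.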
There is an additional efficiency with binary representation - data encoding.
The first number, '163', can be coded not in 3 bytes but in only 1 (a byte can
hold 2^8 = 256 values). Not so important when you only have 10,000 numbers to
represent, but quite useful when you have a quadrillion numbers. Similarly, a
'double-precision float' (64 bits = 8 bytes) can provide a precision of 15-17
significant digits, while the ASCII representation of 15 digits takes 15 bytes.
For large data sets, the speed difference between using binary IO vs ASCII can
be significant.

============================================================================
TODO: Need a demo of this. Read a large 2GB ASCII data file (how long does it
take?), write it back out in ASCII (how long?), write it out in binary (how
long?), then read the binary back in one shot into a variable and parse it
in-memory (how long?). See
http://stackoverflow.com/questions/8920215/how-to-read-binary-file-in-perl
============================================================================

A rough sketch of such a timing harness in Perl:

--------------------------------------------------------------------
#!/usr/bin/perl -w
use strict;
use Devel::Size qw(total_size);
use Time::HiRes qw(time);

# read a whitespace-delimited ASCII file (named on the commandline or
# piped to stdin) into a 2D array
my $clock0 = time();
my @ba = ();
my $lc = 0;
while (<>) {
    chomp;
    $ba[$lc++] = [ split(/\s+/) ];   # store a reference to this row
}
my $clockd = time() - $clock0;
print STDERR "took [$clockd] s to load the file into a 2D array\n";
print STDERR "the array occupies ", total_size(\@ba), " bytes in RAM\n";

# write it back out in ASCII
$clock0 = time();
for my $r (0 .. $lc - 1) {
    print join(" ", @{ $ba[$r] }), "\n";
}
$clockd = time() - $clock0;
print STDERR "took [$clockd] s to write the file in ASCII to STDOUT\n";

# write it out in binary (assumes integer values; adjust the pack
# template for floats or strings)
$clock0 = time();
open(my $BO, '>', 'data.bin') or die "can't open file for binary write: $!";
binmode $BO;
for my $r (0 .. $lc - 1) {
    print $BO pack('l*', @{ $ba[$r] });
}
close $BO;
$clockd = time() - $clock0;
print STDERR "took [$clockd] s to write the file in binary\n";
--------------------------------------------------------------------

See: http://search.cpan.org/~zefram/Time-HiRes-1.9726/HiRes.pm#EXAMPLES

http://www.cs.umd.edu/class/sum2003/cmsc311/Notes/BitOp/asciiBin.html[Here's a
decent expansion] of the difference between ASCII and binary data.

==== Self-describing Data Formats

There are a variety of 'self-describing' data formats. 'Self-describing' can
mean a lot of things - at its most formal, something like the
http://en.wikipedia.org/wiki/Abstract_Syntax_Notation_One[ASN.1 specification]
- but it generally has a somewhat looser meaning: a format in which the
description of the file's contents - the
http://en.wikipedia.org/wiki/Metadata[metadata] - is transmitted with the file,
either as a header (as is the case with
http://en.wikipedia.org/wiki/NetCDF#Format_description[netCDF]) or as inline
tags and commentary (as is the case with
http://en.wikipedia.org/wiki/XML[Extensible Markup Language] (XML)).

// TODO: hdfview; examples of reading an HDF5 file in R, Python, and Perl and
// dumping part of it; pointer to SDCubes (HDF5 + XML):
// http://labrigger.com/blog/2011/05/28/hdf5-xml-sdcubes/
===== XML

XML is a variant of
http://en.wikipedia.org/wiki/Standard_Generalized_Markup_Language[Standard Generalized Markup Language]
(SGML), the most popular variant of which is the ubiquitous
http://en.wikipedia.org/wiki/HTML[Hypertext Markup Language] (HTML), by which a
good part of the web is described. A 'markup language' is a way of
http://en.wikipedia.org/wiki/Markup_language[annotating a document in a way
which is syntactically distinguishable from the text]. Put another way, it is a
way of interleaving metadata with the data to tell the markup processor 'what
the data is'. This makes it a good choice for small, heterogeneous data sets,
or to describe the semantics of larger chunks of data, as used by the
http://labrigger.com/blog/2011/05/28/hdf5-xml-sdcubes/[SDCubes format] and the
http://en.wikipedia.org/wiki/Extensible_Data_Format[eXtensible Data Format
(XDF)]. However, because XML is an extremely verbose and repetitive data format
(and therefore ~20x compressible), I'm not going to examine it in much detail
in this overview of BigData formats. Even if it is compressed on disk and
decompressed while being read into memory, the overhead of separating the data
from the metadata makes it a poor choice for large amounts of data.

===== HDF5/netCDF4

The http://en.wikipedia.org/wiki/Hierarchical_Data_Format[Hierarchical Data
Format], which now includes the HDF4, HDF5, and netCDF4 file formats, is a very
well-designed, self-describing data format for large arrays of numeric data and
has recently been adopted by some bioinformatics applications
(http://files.pacb.com/software/instrument/2.0.0/bas.h5%20Reference%20Guide.pdf[PacBio],
http://www.hdfgroup.org/projects/biohdf/[BioHDF]) dealing mostly with string
data as well. HDF files are supported by most widely used scientific software,
such as MATLAB, Scilab, Mathematica, and SAS, as well as by general-purpose
programming languages such as C/C++, Fortran, R, Python, Perl, Java, etc. R and
Python are notable in their support of the format.

HDF5/netCDF4 are particularly good for storing and reading multiple sets of
multi-dimensional data, especially timeseries data. HDF5 has 3 main types of
data storage:

- *Datasets*, arrays of homogeneous types (ints or floats of a particular
  precision)
- *Groups*, collections of 'Datasets' or other 'Groups', leading to the ability
  to store data in a hierarchical, directory-like structure - hence the name.
- *Attributes*, metadata about the 'Datasets', which can be attached to the
  data.

An advantage of the HDF5 format is that, due to the way the data is internally
organized, access to complete sets of data is much faster than via relational
databases; it is very much organized for streaming IO.

.Promotional Note
[NOTE]
===========================================================================
http://www.ess.uci.edu/~zender/[Charlie Zender], in Earth System Science, is
the author of http://nco.sf.net[nco, a popular set of utilities] for
manipulating and analyzing data in HDF5/netCDF format. This highly optimized
set of utilities can stride thru such datasets at close-to-theoretical speeds
and is a must-learn for anyone using this format.
===========================================================================

// TODO: demo as http://docs.h5py.org/en/latest/quick.html
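Here's a minimal sketch, along the lines of the
http://docs.h5py.org/en/latest/quick.html[h5py quickstart], of what writing and
reading an HDF5 file looks like from Python; the file name, dataset name, and
sizes are just placeholders:

----------------------------------------------------------------------------
import numpy as np
import h5py

# write: a year of hourly temperatures as one Dataset, with an Attribute
with h5py.File("timeseries.h5", "w") as f:
    dset = f.create_dataset("temperature", data=np.random.rand(365, 24),
                            compression="gzip")
    dset.attrs["units"] = "degC"
    f.create_group("station_42")          # Groups give you a directory-like layout

# read: only the requested slice is pulled off the disk
with h5py.File("timeseries.h5", "r") as f:
    print(list(f.keys()))                 # ['station_42', 'temperature']
    first_week = f["temperature"][:7, :]  # a (7, 24) numpy array
    print(f["temperature"].attrs["units"], first_week.mean())
----------------------------------------------------------------------------

The same file can be opened from R, MATLAB, or any of the other HDF5-aware
tools mentioned above.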
==== Relational Databases

If you have complex data that needs to be queried repeatedly to extract
interdependent or complex relationships, then storing your data in a Relational
Database may be the optimal approach. Such data, stored in collections called
Tables (rectangular arrays of predefined types such as ints, floats, character
strings, timestamps, booleans, etc.), can be queried for their relationships
via http://en.wikipedia.org/wiki/SQL[Structured Query Language] (SQL), a
special-purpose language designed for exactly this task. SQL has a particular
https://www.sqlite.org/lang.html[format and syntax] that is fairly ugly but
necessary to use. The previous link is from the SQLite site, but an advantage
of SQL is that it is quite portable across relational engines, allowing
transitions from a relational 'Engine' like https://www.sqlite.org/[SQLite] to
a fully networked 'Relational Database' like http://www.mysql.com/[MySQL] or
http://www.postgresql.org/[PostgreSQL].

==== SQL Engines & Servers (SQLite vs MySQL)

There are overlaps in capabilities, but the difference between a relational
engine like 'SQLite' and a networked database like 'MySQL' is that the latter
allows multiple users to read and write the database simultaneously from across
the internet, while the former is designed mostly to provide very fast
analytical capability for a single user on local data. For example, many of the
small and medium-sized blogs, content management systems, and social networking
sites on the internet use MySQL. Many client-side, single-user applications -
music and photo collections, email clients - use embedded SQL engines like
SQLite.

Here's a simple example of using the Perl interface to SQLite: the script
'recursive.filestats.db-sqlite.pl' (run as 'recursive.filestats.db-sqlite.pl
--db nacs.filestats --startdir nacs --maxfiles=2000') reads the metadata of a
specified number of files from your laptop, creates a few simple tables based
on that data and metadata, and then lets you query the resulting tables - in
effect, a crude filesystem search engine.

// TODO: add some incrementally more sophisticated example queries against the
// database generated above.

For an introduction to SQL itself, see http://www.w3schools.com/sql/default.asp
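If Perl isn't your thing, the same idea can be sketched with Python's
standard-library sqlite3 module; this is only an illustration (the database
name and the directory to index are placeholders), not the script described
above:

----------------------------------------------------------------------------
import os
import sqlite3

conn = sqlite3.connect("filestats.db")
conn.execute("CREATE TABLE IF NOT EXISTS files (path TEXT, bytes INT, mtime REAL)")

# walk a directory tree and record some metadata for each file
topdir = os.path.expanduser("~/nacs")        # placeholder directory
for root, dirs, names in os.walk(topdir):
    for name in names:
        path = os.path.join(root, name)
        try:
            st = os.stat(path)
        except OSError:
            continue                         # broken symlink, permission problem, etc.
        conn.execute("INSERT INTO files VALUES (?,?,?)",
                     (path, st.st_size, st.st_mtime))
conn.commit()

# an SQL query: the 10 biggest files indexed
for path, nbytes, mtime in conn.execute(
        "SELECT path, bytes, mtime FROM files ORDER BY bytes DESC LIMIT 10"):
    print("%12d  %s" % (nbytes, path))
conn.close()
----------------------------------------------------------------------------

Because SQL is portable, the same CREATE/INSERT/SELECT statements would work
nearly unchanged against MySQL or PostgreSQL.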
== Timing and Optimization

If you're going to be analyzing large amounts of data, you'll want your
programs to run as fast as possible. How do you know that they're running
efficiently? You can time them relative to other approaches, and you can
profile the code using oprofile, perf, or HPCToolkit.

=== time vs /usr/bin/time

The bash built-in 'time' reports wall-clock, user, and system time; the
standalone '/usr/bin/time' (especially with '-v') adds memory use, page faults,
context switches, and IO counts:

---------------------------------------------------------------------
$ time ./tacg -n6 -S -o5 -s < hg19/chr1.fa > out

real    0m10.599s
user    0m10.456s
sys     0m0.145s

$ /usr/bin/time ./tacg -n6 -S -o5 -s < hg19/chr1.fa > out

10.47user 0.14system 0:10.60elapsed 100%CPU (0avgtext+0avgdata 867984maxresident)k
0inputs+7856outputs (0major+33427minor)pagefaults 0swaps

$ /usr/bin/time -v ./tacg -n6 -S -o5 -s < hg19/chr1.fa > out

        Command being timed: "./tacg -n6 -S -o5 -s"
        User time (seconds): 10.46
        System time (seconds): 0.14
        Percent of CPU this job got: 100%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:10.60
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 867268
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 33147
        Voluntary context switches: 1
        Involuntary context switches: 1382
        Swaps: 0
        File system inputs: 0
        File system outputs: 7856
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

$ time ./tacg -n6 -S -o5 -s < hg19/chr1.fa > out

real    0m10.725s
user    0m10.549s
sys     0m0.147s
---------------------------------------------------------------------

=== Profiling with oprofile

http://oprofile.sourceforge.net/news/[Oprofile] is a fantastic, relatively
simple mechanism for profiling your code. Say you had an application called
'tacg' that you wanted to improve. You would first profile it to see where (in
which function) it was spending its time.

---------------------------------------------------------------------
$ operf ./tacg -n6 -S -o5 -s < hg19/chr1.fa > out
operf: Profiler started

$ opreport --exclude-dependent --demangle=smart --symbols ./tacg
Using /home/hjm/tacg/oprofile_data/samples/ for samples directory.
CPU: Intel Ivy Bridge microarchitecture, speed 2.501e+06 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (No unit mask) count 100000
samples  %        symbol name
132803   43.1487  Cutting
86752    28.1864  GetSequence2
49743    16.1619  basic_getseq
9098      2.9560  Degen_Calc
7522      2.4440  fp_get_line
7377      2.3968  HorribleAccounting
6560      2.1314  abscompare
4287      1.3929  Degen_Cmp
2600      0.8448  main
704       0.2287  basic_read
212       0.0689  BitArray
112       0.0364  PrintSitesFrags
3        9.7e-04  ReadEnz
3        9.7e-04  hash.constprop.2
2        6.5e-04  hash
1        3.2e-04  Read_NCBI_Codon_Data
1        3.2e-04  palindrome
---------------------------------------------------------------------

You can see from the above listing that the program was spending most of its
time in the 'Cutting' function. If you had been thinking about optimizing the
'BitArray' function, even if you improved it 10,000%, you could only improve
the runtime by ~0.06%. Beware premature optimization.
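oprofile works on compiled binaries; if the analysis code you're timing is a
Python script, the standard-library profiler gives you a similar per-function
breakdown. A minimal sketch (the 'analyze' function is just a stand-in for your
real code):

---------------------------------------------------------------------
import cProfile
import pstats

def analyze():
    # stand-in for your real analysis
    total = 0
    for i in range(10000000):
        total += i * i
    return total

cProfile.run("analyze()", "profile.out")        # run under the profiler
stats = pstats.Stats("profile.out")
stats.sort_stats("cumulative").print_stats(10)  # top 10 functions by cumulative time
---------------------------------------------------------------------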
== Compression

There are 2 types of compression: lossy and lossless. Lossless compression
means that a compression/decompression cycle will exactly reconstruct the data.
Lossy compression means that such a cycle results in some loss of the original
data's fidelity. Typically, for most things you want lossless compression, but
for some things lossy compression is OK. The 'market' has decided that
compressing music to 'mp3' format is fine for consumer usage. Similarly,
compressing images with jpeg compression is fine for most consumer use cases.
Both of these compression schemes exploit the tendency of distributions to be
smooth if sampled frequently enough, so that the original distribution of the
data can be approximated 'well enough' when the waveform or colorspace is
reconstructed. The 'well enough' part is why some people don't like mp3 sound,
and why jpeg-compressed images appear oddly pixellated at high magnification,
especially in low-contrast areas.

The amount of compression you can expect is also variable. Compression
typically works by looking for repetition in nearby blocks of data. Randomness
is the opposite of repetition, as the following examples show.

=== Random data

------------------------------------------------------------------------
$ time dd if=/dev/urandom of=urandom.1G count=1000000 bs=1000
1000000+0 records in
1000000+0 records out
1000000000 bytes (1.0 GB) copied, 82.8871 s, 12.1 MB/s

real    1m22.889s
====
$ time gzip urandom.1G

real    0m34.601s
====
$ ls -l urandom.1G.gz
-rw-r--r-- 1 hjm hjm 1000162044 Nov 14 12:03 urandom.1G.gz
------------------------------------------------------------------------

So compressing nearly random data actually results in INCREASING the file size.

=== Repetitive Data

However, compressing purely repetitive data from the '/dev/zero' device:

------------------------------------------------------------------------
$ ls -l zeros.1G
-rw-r--r-- 1 hjm hjm 1000000000 Nov 14 11:31 zeros.1G
====
$ time gzip zeros.1G

real    0m7.100s
====
$ ls -l zeros.1G.gz
-rw-r--r-- 1 hjm hjm 970510 Nov 14 11:32 zeros.1G.gz
------------------------------------------------------------------------

So compressing 1GB of zeros took considerably less time and resulted in a
compression ratio of ~1000X. Actually, it should compress even better than
that, since it's all zeros, and in fact bzip2 does a much better job:

------------------------------------------------------------------------
$ time bzip2 zeros.1G

real    0m10.106s
====
$ ls -l zeros.1G.bz2
-rw-r--r-- 1 hjm hjm 722 Nov 14 11:32 zeros.1G.bz2
------------------------------------------------------------------------

or about 1.3 *Million*-fold compression, about the same as 'Electronic Dance
Music'. For comparison's sake, a typical JPEG image easily achieves a
compression ratio of about 5-15X without much visual loss of accuracy, but you
can choose what compression ratio to specify.

The amount of time you should dedicate to compression/decompression should be
proportional to the amount of compression possible - i.e. don't try to compress
an mp3 of white noise.

== Moving BigData

I've written an entire document,
http://moo.nac.uci.edu/~hjm/HOWTO_move_data.html[How to transfer large amounts
of data via network], about moving bytes across networks, so I'll keep this
short.

As much as possible, *Don't move the data*. Find a place for it to live and
work on it from there. Hopefully, it's in compressed format so reads are more
efficient.

Everyone should know how to use http://samba.anu.edu.au/rsync[rsync], if
possible the simple commandline form:

--------------------------------------------------------------------
rsync -av /this/dir/ /that/DIR
#                  ^
# note that trailing '/'s matter.
# the above cmd will sync the CONTENTS of '/this/dir' to '/that/DIR' -
# generally what you want.

rsync -av /this/dir /that/DIR
#                 ^
# will sync '/this/dir' INTO '/that/DIR', so the contents of '/that/DIR' will
# INCLUDE '/this/dir' after the rsync.
--------------------------------------------------------------------

If you have lots of huge files in a few dirs, and it's not an incremental move
(the files don't already exist in some form on one of the 2 sides), then use
'bbcp', which excels at moving big streams of data very quickly using multiple
TCP streams. Note that rsync will almost always encrypt the data using the ssh
protocol, whereas bbcp will not. This may be an issue if you copy data outside
the cluster.

== Processing BigData in parallel

One key to processing BigData efficiently is that you HAVE to exploit
parallelism. From filesystems, to archiving and de-archiving files, to moving
data, to the actual processing - leveraging parallelism is the key to doing it
faster.

=== Embarrassingly Parallel solutions - the big win

The easiest parallelism to exploit is in what are called 'embarrassingly
parallel' (EP) operations - those which are independent of workflow, of other
data, and of order of operation. Such analyses can be done by subsetting the
data and farming it out in bulk to multiple processes with wrappers like
http://www.gnu.org/software/parallel/[Gnu Parallel] and
http://moo.nac.uci.edu/~hjm/clusterfork/[clusterfork]. These wrappers allow you
to write a single script or program and then directly shove it out to as many
other CPUs or nodes as are available to you, bypassing the need for a scheduler
(and they are therefore appropriate only for your own set of machines, not a
cluster).

On a cluster, you can exploit the scheduler (SGE in our case) to submit
hundreds or thousands of jobs to distribute the analysis to otherwise idle
cores. This approach requires a shared filesystem so that all the nodes have
access to the same file space, and it implies that the analysis (or at least
the part that you're parallelizing) is EP. And many analyses are - in
simulations, you can often sample parameter space in parallel; in
bioinformatics, you can usually search sequences in parallel; MATLAB and MMA
have a number of functions that are automatically (and aggressively)
parallelized. This often depends on who has written the application or workflow
you're using.

==== Hadoop and MapReduce

This is a combined technique; it is often referred to by either term, but
http://en.wikipedia.org/wiki/Apache_Hadoop[Hadoop] is the underlying storage or
filesystem that feeds the analytical technique of
http://en.wikipedia.org/wiki/MapReduce[MapReduce]. Rather than being a general
approach to BigData, this is a fairly specific technique that falls into the EP
area and is optimized for particular, usually fairly simple, operations. It is
a specific instance of the much more general cluster computing approach that we
use here. Wikipedia has a good explanation of the process:

* 'Map' step: Each worker node applies the 'map()' function to the local data,
  and writes the output to a temporary storage. A master node orchestrates
  that, for redundant copies of input data, only one is processed.
* 'Shuffle' step: Worker nodes redistribute data based on the output keys
  (produced by the 'map()' function), such that all data belonging to one key
  is located on the same worker node.
* 'Reduce' step: Worker nodes now process each group of output data, per key,
  in parallel.

However, for those operations, it can be VERY fast and especially VERY
scalable. One thing that argues against the use of Hadoop is that it is not a
POSIX filesystem. It has many of the characteristics of a POSIX filesystem, but
not all, especially a number related to permissions. It is also optimized for
very large files and gains some robustness from data replication, typically
3-fold.
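To make the three steps concrete, here is a tiny, purely illustrative
word-count in plain Python that mimics the map/shuffle/reduce pattern in a
single process - no Hadoop involved; on a real cluster each phase would be
spread across many worker nodes:

--------------------------------------------------------------------
from collections import defaultdict

documents = ["the quick brown fox", "the lazy dog", "the quick dog"]

# Map: each 'worker' turns its chunk of input into (key, value) pairs
mapped = []
for doc in documents:
    for word in doc.split():
        mapped.append((word, 1))

# Shuffle: group all values belonging to the same key together
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce: collapse each key's values independently (in parallel, on a real cluster)
counts = {key: sum(values) for key, values in groups.items()}
print(counts)   # {'the': 3, 'quick': 2, 'brown': 1, 'fox': 1, 'lazy': 1, 'dog': 2}
--------------------------------------------------------------------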
An improvement on the MapReduce approach is a related technology,
http://en.wikipedia.org/wiki/Apache_Spark[Spark], which provides a more
sophisticated in-memory query mechanism instead of the simpler 2-step system of
MapReduce. As mentioned above, anytime you can maintain data in RAM, your speed
increases by at least 1000X. Spark also provides APIs in Python and Scala as
well as Java.
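As an illustration of the Python API, here's a minimal PySpark sketch of the
same word count (the input and output paths are placeholders); the intermediate
results stay in RAM across the map and reduce stages rather than being written
to disk between steps:

--------------------------------------------------------------------
from pyspark import SparkContext

sc = SparkContext("local[*]", "wordcount")        # run locally on all cores
counts = (sc.textFile("big_corpus.txt")           # lazily read the input
            .flatMap(lambda line: line.split())   # map: line -> words
            .map(lambda word: (word, 1))          # map: word -> (word, 1)
            .reduceByKey(lambda a, b: a + b))     # shuffle + reduce
counts.saveAsTextFile("wordcount_output")
sc.stop()
--------------------------------------------------------------------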