BigData and its Analysis on Linux
=================================
by Harry Mangalam
v0.1 - Nov 13, 2014
:icons:

// fileroot="/home/hjm/nacs/bigdata/BigData_Class"; asciidoc -a icons -a toc2 -b html5 -a numbered ${fileroot}.txt; scp ${fileroot}.html ${fileroot}.txt moo:~/public_html/biolinux; ssh -t moo 'scp ~/public_html/biolinux/BigData_Class.[ht]* hmangala@hpc.oit.uci.edu:/data/hpc/www/biolinux/'

== For Linux users, but novices at cluster data analysis and BigData

If you don't know your way around a Linux bash shell nor how a compute cluster
works, please review the previous
http://moo.nac.uci.edu/~hjm/biolinux/Linux_Tutorial_12.html[Introduction to Linux essay/tutorial here].
It reviews Linux, the bash shell, common commands & utilities, as well as the
SGE scheduler.

== Introduction

For those of you who are starting your graduate career and have spent most of
your life with MacOS or Windows, the Linux commandline can be mystifying. The
previous lecture and tutorial was meant to familiarize you with Linux on HPC,
with only a very fast introduction to handling and analyzing large amounts of
data.

You are among the first few waves of the 'Data Generation' - your lives and
research guided by, and awash with, ever-increasing rings of data. If you're
reading this, your research will almost certainly depend on large amounts of
data, data that does not map well to MS Word, Excel, or other such
pointy-clicky desktop apps. The de facto standard for large-scale data
processing, especially in research, is the bash prompt, backed by a Linux
compute cluster. It's not pretty, but it works exceedingly well. In the end, I
think you'll find that it is worth the transition cost to use a cluster where
most compute nodes each have 64 CPU cores, 1/2 TB of RAM, and access to 100s of
TB of disk space on a very fast parallel filesystem.

This is not a HOWTO on the analysis of BigData. That I'll leave to others who
have specific expertise. The aim of this document is to give you an idea of how
data 'lives' - how it is stored, how it is unstored, what path it travels, how
much time it takes to transition from one state to another, and the time costs
involved in doing so. I'll touch briefly on some general approaches to
analyzing data on the Linux OS, introducing some small-data approaches (which
may be all you need) and then expanding them to much larger data.

== The Data Flight Path

It may be a weak analogy, but you might think of your analysis as a long
airplane flight, say from LAX to Heathrow. One flight path is direct via the
polar route. Another might be a series of short commuter flights that hops
across the US to Florida and then up the east coast, eventually to Nova Scotia,
then to Gander, NFLD, then across to Dublin, on to Edinburgh, and finally to
LHR. The more times your plane has to land, the longer the trip will take. Data
has the same characteristic, where 'landing' is equivalent to being written to
disk.

== Where Data Lives and How Fast it Moves

=== CPU registers

In order to be manipulated by the CPU, binary data has to be in CPU registers -
very small, very high-speed storage locations in the CPU that hold the data
immediately being operated on: where the data comes from and where it's
supposed to go in RAM, the operations to be done, the state of the program, the
state of the CPU, etc. These registers operate in sync with the CPU at the CPU
clock speed, currently somewhere in the range of 3GHz, or one operation every
~0.3 nanoseconds.
=== CPU cache

CPU 'cache' is used to store the data coming into and out of the CPU registers
and is often tiered into multiple levels: primary, secondary, tertiary, and
even quaternary caches. These assist in staging the data that the CPU is asking
for, or even data that the CPU 'might' ask for, based on the structure of the
program it's running. The size of these caches has grown exponentially over the
last few years as fabs can jam more transistors onto a die and caching
algorithms have gotten better. In the Pentium era, a large primary cache was a
few KB. Now it's 32KB, multi-thread accessible, for the latest Haswell CPUs,
with 256KB of L2 cache and 2-20MB of L3 cache. These transistors are typically
on the same silicon die as the CPU, mere millimeters away from the registers,
so the access times are typically VERY fast. As the cache gets further away
from the CPU registers, the time to access the data stored there gets slower as
well. To use data from the primary cache takes \~4 clock tics; to use data from
the secondary cache takes \~11 tics, and so on down the line.

=== System RAM

System RAM, typically considered very fast, is actually very slow compared to
the L1, L2, and L3 caches. It is further away, both physically and
electronically, since it has to be referenced via a memory controller. To
refresh data from main RAM to L1 cache takes on the order of 1000 clock tics
(or a few hundred ns). So RAM is very slow compared to cache, BUT! System
memory access is 1000 to 1M times faster than disk access, in which the time to
swing round a disk platter, position the head, and initiate a new read is on
the order of several to several hundred milliseconds, depending on the quality
of the disk, the amount of external vibration, and anti-vibration features.

The preceding paragraphs are meant to impress upon you that when your data is
picked up off a disk, it should remain off disk as long as possible, until
you're done analyzing it. Reading data off a disk, doing a small operation, and
then writing it back to disk many times thru the course of an analysis is among
the worst offenses in data analysis. KEEP YOUR DATA IN FLIGHT.

=== Flash Memory / SSDs

In recent years, flash or non-volatile memory - which does not need power to
keep the data intact - has gotten cheaper as well. You're probably familiar
with this type of memory in thumbdrives, removable memory for your phones and
tablets, and especially SSDs. This memory is notable for HPC and BigData in
that, unlike spinning disks, the time to access a chunk of data is very fast -
on the order of microseconds. The speed at which data can be read off in bulk
is not much different than a spinning disk, but if your data is structured
discontinuously or needs to be read and written in 'non-streaming mode' (such
as with many relational database operations), SSDs can provide a very
significant performance boost.

=== RAM and the Linux Filecache

All modern OSs support the notion of a file cache - that is, if data has been
read recently, it's much more likely to be needed again. So instead of reading
data into RAM and then clearing that RAM when it's not needed, Linux keeps it
around - stuffing it into the RAM that's not needed for specific apps, their
data, and other OS functions. This is why Linux almost never has very much
'free RAM' - it's almost all used for the filecache and unsynced buffers, and
the OS only clears it when the cached data has expired or when an app or the OS
needs more. This is one reason why you should always prefer more RAM to a
faster CPU.
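If you want to see the filecache at work yourself, here's a minimal Python
sketch of the same kind of measurement that the RWbinary.pl demo below makes;
the file path is just a placeholder for any multi-GB file you have handy:

----------------------------------------------------------------------------
import time

PATH = "/tmp/big.bin"   # placeholder: any multi-GB file will do

def timed_read(path, chunk=1 << 20):
    """Read the whole file in 1 MB chunks; return (bytes read, elapsed seconds)."""
    t0 = time.time()
    nbytes = 0
    with open(path, "rb") as f:
        while True:
            buf = f.read(chunk)
            if not buf:
                break
            nbytes += len(buf)
    return nbytes, time.time() - t0

print("cold cache:", timed_read(PATH))   # first read comes off the disk
print("warm cache:", timed_read(PATH))   # immediate re-read comes out of RAM
----------------------------------------------------------------------------

The measurement below, made with a small Perl script (RWbinary.pl) on a 2GB
binary file, shows the same effect - a ~25X difference between the cold and
warm reads.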
----------------------------------------------------------------------------
# done immediately after clearing the filecache (echo 3 > /proc/sys/vm/drop_caches)
# RWbinary.pl just reads 2GB of binary data.
$ time RWbinary.pl
Read 2147483648 bytes

real    0m22.208s   <--- took 22s
user    0m0.546s
sys     0m1.789s

# repeated immediately afterwards, from the filecache.
$ time RWbinary.pl
Read 2147483648 bytes

real    0m0.851s    <--- took <1s!
user    0m0.272s
sys     0m0.579s
----------------------------------------------------------------------------

.Streaming Data Input/Output (IO)
[NOTE]
===========================================================================
Streaming data Input/Output (IO) is the process (reading or writing) where the
bytes are processed in a continuous stream. It usually refers to data on disk,
where the time to change an IO point is quite significant (milliseconds). We
often see this kind of data access in bioinformatics, in which a large chunk of
DNA needs to be read from end to end, or in other domains where large images or
binary data need to be read into RAM for processing.

The reverse of 'streaming IO' is 'discontinuous IO', where a few bytes need to
be read from one part of a file, processed, and then written to another (or the
same) file. This type of data IO is often seen in relational database
processing, where a complex table join to answer a query results in very
complicated data access patterns to locate all the data.

Data on disk that can be 'streamed' can be read much more efficiently than data
that needs a lot of disk head movement.

See also: link:#zotfiles[ZOTfiles]
===========================================================================

=== Disks and Filesystems

//insert image of uncovered disk

A hard disk is simply a few metal-coated platters spinning at 5K to 15K rpm,
with tiny magnetic heads moving back and forth on light arms driven by voice
coils, much like the cones of a speaker. The magnetic heads either write data
to the magnetic domains on the platters or read it back from those domains.
However, because this involves accelerating and decelerating a physical object
with great accuracy, as well as initiating the reads and writes at very precise
places on the disk, it takes much longer to initiate an IO operation than with
RAM or an SSD. Once the operation has begun, a disk can read or write data at
about the same speed as an SSD (but typically much slower than RAM).

In order to increase either the speed or the reliability of disks, they are
often grouped together in http://en.wikipedia.org/wiki/RAID[RAIDs] (Redundant
Arrays of Inexpensive, or Independent, Disks), in which the disks can be used
in parallel and/or with checksum parity to allow the failure of 1 or 2 disks in
a RAID without losing data. Such RAIDs are often used as the basis for a number
of different filesystems, both local to a computer and via a network. In the
latter case, the most popular protocol is the
http://en.wikipedia.org/wiki/Network_File_System[Network File System] ('NFS'),
which is reliable but complex and therefore somewhat slow. Individual RAIDs can
also be agglomerated into larger networked arrays using various kinds of
http://en.wikipedia.org/wiki/Clustered_file_system#Distributed_file_systems[Distributed Filesystems]
(aka DFSs), in which a process (often running on a separate metadata server)
tracks the requests for IO and tells the client where to store the actual
bytes.
Because the data can be read from or written to multiple arrays (each
consisting of 10s of disks) simultaneously, the performance of such DFSs can be
extremely high - 10s to 100s of times faster than individual disks. Examples of
such DFSs are Gluster, BeeGFS/Fraunhofer, Lustre, OrangeFS, etc.

// Don't forget - this is almost presentation-ready:
// http://moo.nac.uci.edu/~hjm/biolinux/BigData4Newbies.html
// Include all of this??

=== Data formats

There are a number of different formats in which data, Big or not, can be
stored.

==== Un/structured ASCII files

ASCII (also, but rarely, known as 'American Standard Code for Information
Interchange') is 'text', typically what you see on your screen as alphanumeric
characters, altho the full ASCII set is
http://www.ascii-code.com[considerably larger]. For example, in bioinformatics,
the protein and nucleic acid codes, as well as the confidence codes in FASTQ
files, are all ASCII characters. Most of this page is composed of ASCII
characters. It is simple, human-readable, fairly compact, and easily
manipulated by both naive and advanced programmers. It is the direct computer
equivalent of how we think about data - characters representing alphabetic and
numeric symbols.

The main thing about ASCII is that it is character data - each character is
encoded in 7-8 bits (a byte). There is little correspondence between the symbol
and its real value: the character '7' is represented by the ASCII value 55
dec, 67 oct, or 37 hex ('00110111' in binary), whereas the number 7 in binary
is simply '111'.

ASCII data is also notable for its use of 'tokens' or delimiters. These are
characters or strings that separate values in the data stream. These tokens are
usually single characters like tabs, commas, or spaces, but can also be longer
strings, such as '__'. These tokens not only separate values, but also take up
space - one byte for each delimiter. For long lines of short values, the
delimiters alone can eat up a lot of storage.

For many forms of data, and especially when that data gets very Big, ASCII
becomes less useful and takes longer to process. If you are a beginner in
programming, you will probably spend the next several years programming with
ASCII, regardless of what anyone tells you. But there are other, more compact
ways of encoding data.

==== Binary files

While all data on a computer is binary in one sense - the data is laid down in
binary streams - 'binary' format typically means that the data is laid down in
streams of defined data types, segregated by format definitions in the program
that does the reading. In this format, the bytes, ints, chars, floats, longs,
doubles, etc. of the internal representation of the data are written without
intervening delimiter characters, like writing a sentence with no spaces.

------------------------------------------------------------
163631823645618364912152ducksheepshark387ratthingpokemon
 \   \    \    \    \        \           \   \      \
%3d, %4d, %5d, %3d, %9.4f, %4s, %10s, %3d, %8s, %7s   # read format
------------------------------------------------------------

In the string of data above (translated into ASCII for readability), the stream
of bytes is broken into variables by the 'format string' - an ugly but very
useful definition which, for the data above, looks like the read-format line
beneath the data string.
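To make the ASCII-vs-binary distinction concrete, here is a small illustrative
Python sketch (the values and the format template are made up for the example)
that packs a few values into a fixed-width binary record and reads them back;
note that the packed size depends only on the declared types, not on how many
digits the numbers happen to have:

------------------------------------------------------------
import struct

# a hypothetical record: two ints, a float, and a fixed-width species name
record = (163, 23645, 2152.3874, b"ducksheepshark")
fmt = "<iif14s"          # little-endian: int32, int32, float32, 14-byte string

packed = struct.pack(fmt, *record)
print(len(packed))       # 26 bytes, always, for this format
print(struct.unpack(fmt, packed))   # the float comes back at float32 precision
------------------------------------------------------------

The equivalent ASCII line, with its delimiters, would be both larger and slower
to parse, and would grow with the number of digits in each value.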
There is an additional efficiency with binary representation - data encoding.
The first number, '163', can be coded not in 3 bytes but in only 1 (a byte can
hold 2^8 = 256 values). Not so important when you only have 10,000 numbers to
represent, but quite useful when you have a quadrillion numbers. Similarly, a
'double-precision float' (64 bits = 8 bytes) can provide a precision of 15-17
significant digits, while the ASCII representation of 15 digits takes 15 bytes.
For large data sets, the speed difference between using binary IO vs ASCII can
be significant.

============================================================================
TODO: Need a demo of this. Read a large 2GB ASCII data file (how long does it
take?), write it back out in ASCII (how long?), write it out in binary (how
long?), then read the binary back in one shot into a variable and parse it
in-memory (how long?). See
http://stackoverflow.com/questions/8920215/how-to-read-binary-file-in-perl
============================================================================

A rough sketch of such a timing harness in Perl:

--------------------------------------------------------------------
#!/usr/bin/perl -w
use strict;
use Devel::Size qw(total_size);
use Time::HiRes qw(time);

# read a whitespace-delimited ASCII file (named on the commandline or
# piped to stdin) into a 2D array
my $clock0 = time();
my @ba = ();
my $lc = 0;
while (<>) {
    chomp;
    $ba[$lc++] = [ split(/\s+/) ];   # store a reference to this row
}
my $clockd = time() - $clock0;
print STDERR "took [$clockd] s to load the file into a 2D array\n";
print STDERR "the array occupies ", total_size(\@ba), " bytes in RAM\n";

# write it back out in ASCII
$clock0 = time();
for my $r (0 .. $lc - 1) {
    print join(" ", @{ $ba[$r] }), "\n";
}
$clockd = time() - $clock0;
print STDERR "took [$clockd] s to write the file in ASCII to STDOUT\n";

# write it out in binary (assumes integer values; adjust the pack
# template for floats or strings)
$clock0 = time();
open(my $BO, '>', 'data.bin') or die "can't open file for binary write: $!";
binmode $BO;
for my $r (0 .. $lc - 1) {
    print $BO pack('l*', @{ $ba[$r] });
}
close $BO;
$clockd = time() - $clock0;
print STDERR "took [$clockd] s to write the file in binary\n";
--------------------------------------------------------------------

See: http://search.cpan.org/~zefram/Time-HiRes-1.9726/HiRes.pm#EXAMPLES

http://www.cs.umd.edu/class/sum2003/cmsc311/Notes/BitOp/asciiBin.html[Here's a
decent expansion] of the difference between ASCII and binary data.

==== Self-describing Data Formats

There are a variety of 'self-describing' data formats. 'Self-describing' can
mean a lot of things - at its most formal, something like the
http://en.wikipedia.org/wiki/Abstract_Syntax_Notation_One[ASN.1 specification]
- but it generally has a somewhat looser meaning: a format in which the
description of the file's contents - the
http://en.wikipedia.org/wiki/Metadata[metadata] - is transmitted with the file,
either as a header (as is the case with
http://en.wikipedia.org/wiki/NetCDF#Format_description[netCDF]) or as inline
tags and commentary (as is the case with
http://en.wikipedia.org/wiki/XML[Extensible Markup Language] (XML)).

// TODO: hdfview; examples of reading an HDF5 file in R, Python, and Perl and
// dumping part of it; pointer to SDCubes (HDF5 + XML):
// http://labrigger.com/blog/2011/05/28/hdf5-xml-sdcubes/
===== XML

XML is a variant of
http://en.wikipedia.org/wiki/Standard_Generalized_Markup_Language[Standard Generalized Markup Language]
(SGML), the most popular variant of which is the ubiquitous
http://en.wikipedia.org/wiki/HTML[Hypertext Markup Language] (HTML), by which a
good part of the web is described. A 'markup language' is a way of
http://en.wikipedia.org/wiki/Markup_language[annotating a document in a way
which is syntactically distinguishable from the text]. Put another way, it is a
way of interleaving metadata with the data to tell the markup processor 'what
the data is'. This makes it a good choice for small, heterogeneous data sets,
or to describe the semantics of larger chunks of data, as used by the
http://labrigger.com/blog/2011/05/28/hdf5-xml-sdcubes/[SDCubes format] and the
http://en.wikipedia.org/wiki/Extensible_Data_Format[eXtensible Data Format
(XDF)]. However, because XML is an extremely verbose and repetitive data format
(and therefore ~20x compressible), I'm not going to examine it in much detail
in this overview of BigData formats. Even if it is compressed on disk and
decompressed while being read into memory, the overhead of separating the data
from the metadata makes it a poor choice for large amounts of data.

===== HDF5/netCDF4

The http://en.wikipedia.org/wiki/Hierarchical_Data_Format[Hierarchical Data
Format], which now includes the HDF4, HDF5, and netCDF4 file formats, is a very
well-designed, self-describing data format for large arrays of numeric data and
has recently been adopted by some bioinformatics applications
(http://files.pacb.com/software/instrument/2.0.0/bas.h5%20Reference%20Guide.pdf[PacBio],
http://www.hdfgroup.org/projects/biohdf/[BioHDF]) dealing mostly with string
data as well. HDF files are supported by most widely used scientific software,
such as MATLAB, Scilab, Mathematica, and SAS, as well as by general-purpose
programming languages such as C/C++, Fortran, R, Python, Perl, Java, etc. R and
Python are notable in their support of the format.

HDF5/netCDF4 are particularly good for storing and reading multiple sets of
multi-dimensional data, especially timeseries data. HDF5 has 3 main types of
data storage:

- *Datasets*, arrays of homogeneous types (ints or floats of a particular
  precision)
- *Groups*, collections of 'Datasets' or other 'Groups', leading to the ability
  to store data in a hierarchical, directory-like structure - hence the name.
- *Attributes*, metadata about the 'Datasets', which can be attached to the
  data.

An advantage of the HDF5 format is that, due to the way the data is internally
organized, access to complete sets of data is much faster than via relational
databases; it is very much organized for streaming IO.

.Promotional Note
[NOTE]
===========================================================================
http://www.ess.uci.edu/~zender/[Charlie Zender], in Earth System Science, is
the author of http://nco.sf.net[nco, a popular set of utilities] for
manipulating and analyzing data in HDF5/netCDF format. This highly optimized
set of utilities can stride thru such datasets at close-to-theoretical speeds
and is a must-learn for anyone using this format.
===========================================================================

// TODO: demo as http://docs.h5py.org/en/latest/quick.html
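Here's a minimal sketch, along the lines of the
http://docs.h5py.org/en/latest/quick.html[h5py quickstart], of what writing and
reading an HDF5 file looks like from Python; the file name, dataset name, and
sizes are just placeholders:

----------------------------------------------------------------------------
import numpy as np
import h5py

# write: a year of hourly temperatures as one Dataset, with an Attribute
with h5py.File("timeseries.h5", "w") as f:
    dset = f.create_dataset("temperature", data=np.random.rand(365, 24),
                            compression="gzip")
    dset.attrs["units"] = "degC"
    f.create_group("station_42")          # Groups give you a directory-like layout

# read: only the requested slice is pulled off the disk
with h5py.File("timeseries.h5", "r") as f:
    print(list(f.keys()))                 # ['station_42', 'temperature']
    first_week = f["temperature"][:7, :]  # a (7, 24) numpy array
    print(f["temperature"].attrs["units"], first_week.mean())
----------------------------------------------------------------------------

The same file can be opened from R, MATLAB, or any of the other HDF5-aware
tools mentioned above.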
==== Relational Databases

If you have complex data that needs to be queried repeatedly to extract
interdependent or complex relationships, then storing your data in a Relational
Database may be the optimal approach. Such data, stored in collections called
Tables (rectangular arrays of predefined types such as ints, floats, character
strings, timestamps, booleans, etc.), can be queried for their relationships
via http://en.wikipedia.org/wiki/SQL[Structured Query Language] (SQL), a
special-purpose language designed for exactly this task. SQL has a particular
https://www.sqlite.org/lang.html[format and syntax] that is fairly ugly but
necessary to use. The previous link is from the SQLite site, but an advantage
of SQL is that it is quite portable across relational engines, allowing
transitions from a relational 'Engine' like https://www.sqlite.org/[SQLite] to
a fully networked 'Relational Database' like http://www.mysql.com/[MySQL] or
http://www.postgresql.org/[PostgreSQL].

==== SQL Engines & Servers (SQLite vs MySQL)

There are overlaps in capabilities, but the difference between a relational
engine like 'SQLite' and a networked database like 'MySQL' is that the latter
allows multiple users to read and write the database simultaneously from across
the internet, while the former is designed mostly to provide very fast
analytical capability for a single user on local data. For example, many of the
small and medium-sized blogs, content management systems, and social networking
sites on the internet use MySQL. Many client-side, single-user applications -
music and photo collections, email clients - use embedded SQL engines like
SQLite.

Here's a simple example of using the Perl interface to SQLite: the script
'recursive.filestats.db-sqlite.pl' (run as 'recursive.filestats.db-sqlite.pl
--db nacs.filestats --startdir nacs --maxfiles=2000') reads the metadata of a
specified number of files from your laptop, creates a few simple tables based
on that data and metadata, and then lets you query the resulting tables - in
effect, a crude filesystem search engine.

// TODO: add some incrementally more sophisticated example queries against the
// database generated above.

For an introduction to SQL itself, see http://www.w3schools.com/sql/default.asp
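If Perl isn't your thing, the same idea can be sketched with Python's
standard-library sqlite3 module; this is only an illustration (the database
name and the directory to index are placeholders), not the script described
above:

----------------------------------------------------------------------------
import os
import sqlite3

conn = sqlite3.connect("filestats.db")
conn.execute("CREATE TABLE IF NOT EXISTS files (path TEXT, bytes INT, mtime REAL)")

# walk a directory tree and record some metadata for each file
topdir = os.path.expanduser("~/nacs")        # placeholder directory
for root, dirs, names in os.walk(topdir):
    for name in names:
        path = os.path.join(root, name)
        try:
            st = os.stat(path)
        except OSError:
            continue                         # broken symlink, permission problem, etc.
        conn.execute("INSERT INTO files VALUES (?,?,?)",
                     (path, st.st_size, st.st_mtime))
conn.commit()

# an SQL query: the 10 biggest files indexed
for path, nbytes, mtime in conn.execute(
        "SELECT path, bytes, mtime FROM files ORDER BY bytes DESC LIMIT 10"):
    print("%12d  %s" % (nbytes, path))
conn.close()
----------------------------------------------------------------------------

Because SQL is portable, the same CREATE/INSERT/SELECT statements would work
nearly unchanged against MySQL or PostgreSQL.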
== Timing and Optimization

If you're going to be analyzing large amounts of data, you'll want your
programs to run as fast as possible. How do you know that they're running
efficiently? You can time them relative to other approaches, and you can
profile the code using oprofile, perf, or HPCToolkit.

=== time vs /usr/bin/time

The bash built-in 'time' reports wall-clock, user, and system time; the
standalone '/usr/bin/time' (especially with '-v') adds memory use, page faults,
context switches, and IO counts:

---------------------------------------------------------------------
$ time ./tacg -n6 -S -o5 -s < hg19/chr1.fa > out

real    0m10.599s
user    0m10.456s
sys     0m0.145s

$ /usr/bin/time ./tacg -n6 -S -o5 -s < hg19/chr1.fa > out

10.47user 0.14system 0:10.60elapsed 100%CPU (0avgtext+0avgdata 867984maxresident)k
0inputs+7856outputs (0major+33427minor)pagefaults 0swaps

$ /usr/bin/time -v ./tacg -n6 -S -o5 -s < hg19/chr1.fa > out

        Command being timed: "./tacg -n6 -S -o5 -s"
        User time (seconds): 10.46
        System time (seconds): 0.14
        Percent of CPU this job got: 100%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:10.60
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 867268
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 33147
        Voluntary context switches: 1
        Involuntary context switches: 1382
        Swaps: 0
        File system inputs: 0
        File system outputs: 7856
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

$ time ./tacg -n6 -S -o5 -s < hg19/chr1.fa > out

real    0m10.725s
user    0m10.549s
sys     0m0.147s
---------------------------------------------------------------------

=== Profiling with oprofile

http://oprofile.sourceforge.net/news/[Oprofile] is a fantastic, relatively
simple mechanism for profiling your code. Say you had an application called
'tacg' that you wanted to improve. You would first profile it to see where (in
which function) it was spending its time.

---------------------------------------------------------------------
$ operf ./tacg -n6 -S -o5 -s < hg19/chr1.fa > out
operf: Profiler started

$ opreport --exclude-dependent --demangle=smart --symbols ./tacg
Using /home/hjm/tacg/oprofile_data/samples/ for samples directory.
CPU: Intel Ivy Bridge microarchitecture, speed 2.501e+06 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (No unit mask) count 100000
samples  %        symbol name
132803   43.1487  Cutting
86752    28.1864  GetSequence2
49743    16.1619  basic_getseq
9098      2.9560  Degen_Calc
7522      2.4440  fp_get_line
7377      2.3968  HorribleAccounting
6560      2.1314  abscompare
4287      1.3929  Degen_Cmp
2600      0.8448  main
704       0.2287  basic_read
212       0.0689  BitArray
112       0.0364  PrintSitesFrags
3        9.7e-04  ReadEnz
3        9.7e-04  hash.constprop.2
2        6.5e-04  hash
1        3.2e-04  Read_NCBI_Codon_Data
1        3.2e-04  palindrome
---------------------------------------------------------------------

You can see from the above listing that the program was spending most of its
time in the 'Cutting' function. If you had been thinking about optimizing the
'BitArray' function, even if you improved it 10,000%, you could only improve
the runtime by ~0.06%. Beware premature optimization.
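oprofile works on compiled binaries; if the analysis code you're timing is a
Python script, the standard-library profiler gives you a similar per-function
breakdown. A minimal sketch (the 'analyze' function is just a stand-in for your
real code):

---------------------------------------------------------------------
import cProfile
import pstats

def analyze():
    # stand-in for your real analysis
    total = 0
    for i in range(10000000):
        total += i * i
    return total

cProfile.run("analyze()", "profile.out")        # run under the profiler
stats = pstats.Stats("profile.out")
stats.sort_stats("cumulative").print_stats(10)  # top 10 functions by cumulative time
---------------------------------------------------------------------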
== Compression

There are 2 types of compression: lossy and lossless. Lossless compression
means that a compression/decompression cycle will exactly reconstruct the data.
Lossy compression means that such a cycle results in some loss of the original
data's fidelity. Typically, for most things you want lossless compression, but
for some things lossy compression is OK. The 'market' has decided that
compressing music to 'mp3' format is fine for consumer usage. Similarly,
compressing images with jpeg compression is fine for most consumer use cases.
Both of these compression schemes exploit the tendency of distributions to be
smooth if sampled frequently enough, so that the original distribution of the
data can be approximated 'well enough' when the waveform or colorspace is
reconstructed. The 'well enough' part is why some people don't like mp3 sound,
and why jpeg-compressed images appear oddly pixellated at high magnification,
especially in low-contrast areas.

The amount of compression you can expect is also variable. Compression
typically works by looking for repetition in nearby blocks of data. Randomness
is the opposite of repetition, as the following examples show.

=== Random data

------------------------------------------------------------------------
$ time dd if=/dev/urandom of=urandom.1G count=1000000 bs=1000
1000000+0 records in
1000000+0 records out
1000000000 bytes (1.0 GB) copied, 82.8871 s, 12.1 MB/s

real    1m22.889s
====
$ time gzip urandom.1G

real    0m34.601s
====
$ ls -l urandom.1G.gz
-rw-r--r-- 1 hjm hjm 1000162044 Nov 14 12:03 urandom.1G.gz
------------------------------------------------------------------------

So compressing nearly random data actually results in INCREASING the file size.

=== Repetitive Data

However, compressing purely repetitive data from the '/dev/zero' device:

------------------------------------------------------------------------
$ ls -l zeros.1G
-rw-r--r-- 1 hjm hjm 1000000000 Nov 14 11:31 zeros.1G
====
$ time gzip zeros.1G

real    0m7.100s
====
$ ls -l zeros.1G.gz
-rw-r--r-- 1 hjm hjm 970510 Nov 14 11:32 zeros.1G.gz
------------------------------------------------------------------------

So compressing 1GB of zeros took considerably less time and resulted in a
compression ratio of ~1000X. Actually, it should compress even better than
that, since it's all zeros, and in fact bzip2 does a much better job:

------------------------------------------------------------------------
$ time bzip2 zeros.1G

real    0m10.106s
====
$ ls -l zeros.1G.bz2
-rw-r--r-- 1 hjm hjm 722 Nov 14 11:32 zeros.1G.bz2
------------------------------------------------------------------------

or about 1.3 *Million*-fold compression, about the same as 'Electronic Dance
Music'. For comparison's sake, a typical JPEG image easily achieves a
compression ratio of about 5-15X without much visual loss of accuracy, but you
can choose what compression ratio to specify.

The amount of time you should dedicate to compression/decompression should be
proportional to the amount of compression possible - i.e. don't try to compress
an mp3 of white noise.

== Moving BigData

I've written an entire document,
http://moo.nac.uci.edu/~hjm/HOWTO_move_data.html[How to transfer large amounts
of data via network], about moving bytes across networks, so I'll keep this
short.

As much as possible, *Don't move the data*. Find a place for it to live and
work on it from there. Hopefully, it's in compressed format so reads are more
efficient.

Everyone should know how to use http://samba.anu.edu.au/rsync[rsync], if
possible the simple commandline form:

--------------------------------------------------------------------
rsync -av /this/dir/ /that/DIR
#                  ^
# note that trailing '/'s matter.
# the above cmd will sync the CONTENTS of '/this/dir' to '/that/DIR' -
# generally what you want.

rsync -av /this/dir /that/DIR
#                 ^
# will sync '/this/dir' INTO '/that/DIR', so the contents of '/that/DIR' will
# INCLUDE '/this/dir' after the rsync.
--------------------------------------------------------------------

If you have lots of huge files in a few dirs, and it's not an incremental move
(the files don't already exist in some form on one of the 2 sides), then use
'bbcp', which excels at moving big streams of data very quickly using multiple
TCP streams. Note that rsync will almost always encrypt the data using the ssh
protocol, whereas bbcp will not. This may be an issue if you copy data outside
the cluster.

== Processing BigData in parallel

One key to processing BigData efficiently is that you HAVE to exploit
parallelism. From filesystems, to archiving and de-archiving files, to moving
data, to the actual processing - leveraging parallelism is the key to doing it
faster.

=== Embarrassingly Parallel solutions - the big win

The easiest parallelism to exploit is in what are called 'embarrassingly
parallel' (EP) operations - those which are independent of workflow, of other
data, and of order of operation. Such analyses can be done by subsetting the
data and farming it out in bulk to multiple processes with wrappers like
http://www.gnu.org/software/parallel/[Gnu Parallel] and
http://moo.nac.uci.edu/~hjm/clusterfork/[clusterfork]. These wrappers allow you
to write a single script or program and then directly shove it out to as many
other CPUs or nodes as are available to you, bypassing the need for a scheduler
(and they are therefore appropriate only for your own set of machines, not a
cluster).

On a cluster, you can exploit the scheduler (SGE in our case) to submit
hundreds or thousands of jobs to distribute the analysis to otherwise idle
cores. This approach requires a shared filesystem so that all the nodes have
access to the same file space, and it implies that the analysis (or at least
the part that you're parallelizing) is EP. And many analyses are - in
simulations, you can often sample parameter space in parallel; in
bioinformatics, you can usually search sequences in parallel; MATLAB and MMA
have a number of functions that are automatically (and aggressively)
parallelized. This often depends on who has written the application or workflow
you're using.

==== Hadoop and MapReduce

This is a combined technique; it is often referred to by either term, but
http://en.wikipedia.org/wiki/Apache_Hadoop[Hadoop] is the underlying storage or
filesystem that feeds the analytical technique of
http://en.wikipedia.org/wiki/MapReduce[MapReduce]. Rather than being a general
approach to BigData, this is a fairly specific technique that falls into the EP
area and is optimized for particular, usually fairly simple, operations. It is
a specific instance of the much more general cluster computing approach that we
use here. Wikipedia has a good explanation of the process:

* 'Map' step: Each worker node applies the 'map()' function to the local data,
  and writes the output to a temporary storage. A master node orchestrates
  that, for redundant copies of input data, only one is processed.
* 'Shuffle' step: Worker nodes redistribute data based on the output keys
  (produced by the 'map()' function), such that all data belonging to one key
  is located on the same worker node.
* 'Reduce' step: Worker nodes now process each group of output data, per key,
  in parallel.

However, for those operations, it can be VERY fast and especially VERY
scalable. One thing that argues against the use of Hadoop is that it is not a
POSIX filesystem. It has many of the characteristics of a POSIX filesystem, but
not all, especially a number related to permissions. It is also optimized for
very large files and gains some robustness from data replication, typically
3-fold.
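To make the three steps concrete, here is a tiny, purely illustrative
word-count in plain Python that mimics the map/shuffle/reduce pattern in a
single process - no Hadoop involved; on a real cluster each phase would be
spread across many worker nodes:

--------------------------------------------------------------------
from collections import defaultdict

documents = ["the quick brown fox", "the lazy dog", "the quick dog"]

# Map: each 'worker' turns its chunk of input into (key, value) pairs
mapped = []
for doc in documents:
    for word in doc.split():
        mapped.append((word, 1))

# Shuffle: group all values belonging to the same key together
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce: collapse each key's values independently (in parallel, on a real cluster)
counts = {key: sum(values) for key, values in groups.items()}
print(counts)   # {'the': 3, 'quick': 2, 'brown': 1, 'fox': 1, 'lazy': 1, 'dog': 2}
--------------------------------------------------------------------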
An improvement on the MapReduce approach is a related technology,
http://en.wikipedia.org/wiki/Apache_Spark[Spark], which provides a more
sophisticated in-memory query mechanism instead of the simpler 2-step system of
MapReduce. As mentioned above, anytime you can maintain data in RAM, your speed
increases by at least 1000X. Spark also provides APIs in Python and Scala as
well as Java.
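As an illustration of the Python API, here's a minimal PySpark sketch of the
same word count (the input and output paths are placeholders); the intermediate
results stay in RAM across the map and reduce stages rather than being written
to disk between steps:

--------------------------------------------------------------------
from pyspark import SparkContext

sc = SparkContext("local[*]", "wordcount")        # run locally on all cores
counts = (sc.textFile("big_corpus.txt")           # lazily read the input
            .flatMap(lambda line: line.split())   # map: line -> words
            .map(lambda word: (word, 1))          # map: word -> (word, 1)
            .reduceByKey(lambda a, b: a + b))     # shuffle + reduce
counts.saveAsTextFile("wordcount_output")
sc.stop()
--------------------------------------------------------------------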