Big Data for Biologists
=======================
by Harry Mangalam
v1.00, May 4, 2010
:icons:

//Harry Mangalam mailto:harry.mangalam@uci.edu[harry.mangalam@uci.edu]

// this file is converted to the HTML with the command:
// export fileroot="/home/hjm/nacs/HTS_talk/BigBioData"; asciidoc -a toc -a numbered -a toclevels=2 ${fileroot}.txt; scp ${fileroot}.[ht]* moo:~/public_html/Bio;

// update svn from BDUC
// scp ${fileroot}.txt hmangala@claw1:~/bduc/trunk/sge; ssh hmangala@bduc-login 'cd ~/bduc/trunk/sge; svn update; svn commit -m "new mods to Bio_Software_on_BDUC"'

// and push it to Wordpress:
// blogpost.py update -c HowTos ${fileroot}.txt

// don't forget that the HTML equiv of '~' = '%7e'

// asciidoc cheatsheet: http://powerman.name/doc/asciidoc
// asciidoc user guide: http://www.methods.co.nz/asciidoc/userguide.html

The Problem with Digital Biology
--------------------------------
Since the transition to genomics, biological data has grown relentlessly larger.

.Growth of Genbank
image:genbankgrowth_s.jpg[Growth of Genbank]

Fortunately, until the mid-90's, disk size, network bandwidth & CPU speeds were
increasing much faster than the data, so everyone was pretty happy.  Now, with HTS, some
bottlenecks are starting to crop up.

'Moore's Law' predicts that CPU power increases by 60% per year, but 'CPU clock speeds'
are essentially static, and computational power is now being increased by multicore CPUs
and GPUs, which require considerably different programming techniques.

'Network speeds' are fairly fast, but are not increasing very fast locally because copper
is reaching its upper limit, and even if it increases a few more steps, it will be over
very short distances.  Optical networks hold huge promise, but the technology is
comparatively expensive and temperamental.

'Disk capacity' is continuing to increase exponentially as well, but large RAIDs are
starting to approach the
http://www.ecs.umass.edu/ece/koren/architecture/Raid/reliability.html[Uncorrectable Bit Error Rate]
problem, so the amount of data that we can store in contiguous devices is starting to hit
a limit.  And if you want that data backed up, the problem/expense doubles.

*So what do we do?  Learn to live with it.*

.Short Rant: The commandline is your friend, not your enemy
***************************************************
If you don't learn to use the commandline, you'll be 1-3 years behind the cutting edge of
any computer-aided analysis.  Building a good User Interface is hard.  Building a good
commercial one is even harder.

Example of a commandline to search human chromosome 1 (254MB) for a pattern with one
degeneracy and extract the match plus flanking sequence:

 time tacg -S -p bubba,gtgaggacttaga,1 -X 22,22,1 out

 real    0m27.448s
 user    0m19.105s
 sys     0m1.312s     (9.4 MB/s)

For many analytical tools there simply are no GUI equivalents.

For a good introduction to Linux and the bash shell, spend a few hours with
http://software-carpentry.org/[Software Carpentry].
***************************************************

How much Data can you afford?
-----------------------------
It really boils down to the question above.

* What is the minimum data you need to keep around?
* Do you really need to back up everything?
* Where do you put this data?
- on large RAIDs, preferably RAID6 (any OS) or ZFS (Solaris only), or be prepared to give
a LOT of money to BlueArc, Isilon, NetApp, EMC, Sun, IBM, etc.
- Open Source parallel filesystems are starting to be reliable enough to use in
production as well (http://wiki.lustre.org/index.php/Main_Page[Lustre],
http://www.pvfs.org/[PVFS], http://www.gluster.org/[Gluster], etc).
- if on your Desktop, use reliable disks & RAID them if possible.
* How do you move this data?
- As much as possible, *don't*.  If you must, use
http://en.wikipedia.org/wiki/Rsync[rsync] (elegant incremental copy); a minimal example
is sketched below.
- for full copies, use http://www.slac.stanford.edu/~abh/bbcp/[bbcp] or
http://monalisa.cern.ch/FDT/[fdt].
- see http://moo.nac.uci.edu/~hjm/HOWTO_move_data.html[How to transfer large amounts of data via network].

Note that for most HTS data, using encryption is not necessary (do you care if someone
intercepts your short-read file?), so for most files, don't bother with it.  On the other
hand, compression can help a lot on slower connections (plaintext sequence data
compresses to 1/4 to 1/3 of its uncompressed size).
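As a minimal sketch of such a transfer (the hostname and paths are placeholders; the '-z'
compression flag is mainly worth using on slower links):

---------------------------------------------------
# incremental copy of a run directory to a remote RAID, compressing in transit
# (user@bigraid.example.edu and the paths below are placeholders -- substitute your own)
rsync -av --partial --progress -z -e ssh \
      /data/hts_run_42/  user@bigraid.example.edu:/raid6/hts_run_42/

# re-running the same command later only transfers the files that changed,
# which is why rsync is so useful for keeping two copies in sync
---------------------------------------------------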
.Don't forget:
****************************************
* 100Mb network  --->  ~10MB/s
* 1Gb network    --->  ~80MB/s

so 200GB of files -> ~45min over a 1Gb net, if the storage can handle that speed.
****************************************

.A somewhat wooden analogy
image:MAPLEBOWL1_s.jpg[link="MAPLEBOWL1.jpg",align="left"]

The above image of a wooden dish spinning on a lathe is a (strained) analogy to a disk
drive.  The faster the dish spins, the more wood (data) can be scraped off it per second;
the closer to the outside, the more wood (data) can be scraped off per second.  If you
have multiple lathes (disks), each additional lathe increases the amount of wood (data)
that can be scraped off per second.

Not all Big Data is the same
----------------------------
There is hope, especially for some types of data.

Types of data
~~~~~~~~~~~~~

Plaintext
^^^^^^^^^
- ASCII 'plaintext' vs 'binary' vs 'MS Word doc format'.  For editing plaintext files,
*use a plaintext editor*, such as 'Notepad' on Windows or http://www.jedit.org[Jedit]
(excellent, and available for all platforms).
- Mac and Linux plaintext now use the same end-of-line character (line feed; '\n');
Windows is 'still' different (carriage return + line feed; '\r\n').

Structured Text
^^^^^^^^^^^^^^^
- http://en.wikipedia.org/wiki/XML[XML] (like HTML) link:XML_example.txt[Example]
- http://en.wikipedia.org/wiki/FASTA_format[FASTA]
- http://en.wikipedia.org/wiki/FASTQ_format[FASTQ]
- http://en.wikipedia.org/wiki/Protein_Data_Bank_(file_format)[Protein Data Bank]
- Other structural formats

Because these formats are plaintext with a simple, regular structure, the basic
utilities described later can give you a quick sanity check on them (see the sketch just
below).
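As a hedged illustration (the filenames are placeholders), two one-liners that are often
enough to check that a FASTA or FASTQ file looks sane:

---------------------------------------------------
# count the sequences in a FASTA file (every record starts with a '>' line)
grep -c '^>' reads.fasta

# peek at the first two records of a FASTQ file (each record is 4 lines)
head -n 8 reads.fastq
---------------------------------------------------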
Binary
^^^^^^
- images
- compressed data
- executables, object files, libraries
- specially constructed, application-specific data

Databases
^^^^^^^^^
- flatfile (FASTA sequences)
- relational (MySQL, PostgreSQL)
- key:value type DBs (color:red, fruit:bananas)

Self-describing Data
^^^^^^^^^^^^^^^^^^^^
Structured (strided, self-describing) binary files, such as
http://www.unidata.ucar.edu/software/netcdf/[NetCDF], http://www.hdfgroup.org/HDF5/[HDF5],
& http://www.geospiza.com/research/biohdf/[BioHDF].

'Self-describing Data Formats' are a special case of data file, and one that holds great
promise for large Biological data.  They have been used for many kinds of large data for
more than a decade in the Physical Sciences, and since part of Biology is now
transitioning into a Physical Science, it is appropriate to adopt those proven approaches.

These types of files:

* can represent N-dimensional data
* have unlimited extents (such as open-ended time series)
* are internally organized like an entire filesystem, so groups of related data can be
bundled together
* are indexed (or strided) so that the offset to a particular piece of data can be found
very quickly (microseconds)
* are dynamically ranged, so the file uses only as many bits to represent a range of data
as it really needs ('int's can be stored in as few bits as the data range requires)

The http://www.hdfgroup.org/projects/biohdf/[BioHDF project] is already starting to use
this approach, specifically for representing HTS data, and we should start seeing code
support later this year.  It happens that http://www.ess.uci.edu/~zender/[Charlie Zender]
(author of http://nco.sf.net[nco], a popular utility suite for manipulating NetCDF files)
is in UCI's ESS department.

There is another technology that has been used quite successfully in the Climate
Modelling field: the http://opendap.org/[OpenDAP server].  This software keeps the data
on a central server and allows researchers to query it, returning various chunks of the
data and transforming them in various ways.  In some ways it is very much like a web
system such as the http://genome.ucsc.edu/[UCSC Genome Browser], but it allows much more
computationally heavy transforms.

// These are files that are optimally organized to compress data just enough to
// accurately represent the dynamic range of a set of data (base + offset, or
// base + offset + transform).
// The entire dataset is usually represented as a single file which has a
// filesystem-like internal structure (aka groups)
//   variables  - N-dimensional arrays of data
//   Dimensions - describe the axes of the data arrays (including 'unlimited')
//   Attributes - annotate variables or files with small notes or supplementary metadata.
//                Attributes are always scalar values or 1D arrays
// examples are netCDF4, HDF5, bioHDF
//
// specialized tools (unrelated to biology)
//   nco - NetCDF operators
//   pytables - HDF5 <-> NumPy extractor
//   idv - Java visualization tool
//   others
//
// The size of the problem is new to biology but not new to other physical sciences.
// LHC - generates 10Gb/s continuously (~1GB/s, or a human genome every 3s); initial
// storage ~4PB = 4000 TB, or ~2000 of these disks; at 16 per 4U box that is ~15 racks of
// nothing but disk, and just spinning the disks will take ~26 kW.
//
// NASA sat images - TBs per day.
// ESS - climate models - 100s of GB/day
//
// How to move it?
// Preferably, don't.  If you do, consider rsync.
// http://moo.nac.uci.edu/~hjm/HOWTO_move_data.html[How to transfer large amounts of data via network.]
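As a quick, hedged illustration of what 'self-describing' buys you (this assumes the
standard HDF5 and netCDF command-line utilities, h5ls and ncdump, are on your path; the
filenames are placeholders), you can inspect the structure of such a file without reading
any of the bulk data:

---------------------------------------------------
# recursively list the groups and datasets inside an HDF5 file
h5ls -r results.h5

# print only the header of a NetCDF file: its dimensions, variables and attributes
ncdump -h model_output.nc
---------------------------------------------------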
Some useful tools
-----------------

Basic utilities
~~~~~~~~~~~~~~~

.Short rant: MS Windows is bad for research. Why?
********************************************
'Mac OSX' is fine (based on Unix).  The OS and UI are fairly stable.

'Linux' is best (a re-written Unix).  The OS and UI are fairly stable.

'Windows' is a strange, bastard hybrid system designed to transport dollars from your
wallet to Microsoft, and one that also makes writing software very difficult.  Every
other year Microsoft replaces the system you've learned to tolerate with one that bears
little similarity to it, and charges you for that loss of productivity.  But most damning
for research is that it makes it 'hard to write portable software'.

If you want an amusing read on the development of the Mac, Linux and Windows, Neal
Stephenson has a great (if long) take on it called
http://www.cryptonomicon.com/beginning.html[In the Beginning was the Command Line].
********************************************

The following are basic Linux utilities.

ls
^^
'ls' lists the files in a directory.

grep
^^^^
'grep' (and its relatives egrep, fgrep, agrep, nrgrep) searches for
http://en.wikipedia.org/wiki/Regular_expression[regular expressions] in files.  A regular
expression (aka regex) is a very powerful pattern-matching approach.
http://proquest.safaribooksonline.com/0596528124[Read THE regex book here.]

head & tail
^^^^^^^^^^^
'head' shows you the top of a file; 'tail' shows you the bottom of the file.

cut, scut & cols
^^^^^^^^^^^^^^^^
'cut' and 'scut' let you slice out and re-arrange columns in a file, separated by a
defined delimiter.  'cols' lines up those columns so you can make sense of them.  A short
example of chaining these utilities together follows below.

These and more such utilities are discussed at much greater length in
http://moo.nac.uci.edu/~hjm/ManipulatingDataOnLinux.html[Manipulating Data on Linux].
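Here is a minimal sketch of how these utilities chain together on a tab-delimited
annotation file; the filename, column numbers and gene name are placeholders, not a
prescribed workflow.

---------------------------------------------------
# look at the first few lines to see what the columns are
head -n 5 annotations.tsv

# slice out columns 1, 4 and 5 (e.g. name, start, end; cut splits on TABs by default)
# and pipe to head so you only see the first screenful
cut -f1,4,5 annotations.tsv | head

# count how many lines mention a gene of interest
grep -c 'BRCA1' annotations.tsv
---------------------------------------------------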
Applications on the BDUC Cluster
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

----------------------------------------------------------------------
*R/2.10.0         *gaussian/3.0      *mosaik/1.0.1388    readline/5.2
*R/2.11.0          gnuplot/4.2.4      mpich/1.2.7       *samtools/0.1.7
*R/2.8.0          *gromacs_d/4.0.7    mpich2/1.1.1p1     scilab/5.1.1
*R/dev            *gromacs_s/4.0.7   *msort/20081208     sge/6.2
*abyss/1.1.2       hadoop/0.20.1     *namd/2.6          *soap/2.20
 antlr/2.7.6       hdf5-devel/1.8.0  *namd/2.7b1         sparsehash/1.6
*autodock/4.2.3    hdf5/1.8.0         ncl/5.1.1          sqlite/3.6.22
*bedtools/2.6.1    imagej/1.41        nco/3.9.6          subversion/1.6.9
*bfast/0.6.3c      interviews/17      netCDF/3.6.3      *sva/1.02
 boost/1.410       java/1.6           neuron/7.0        *tacg/4.5.1
*bowtie/0.12.3    *maq/0.7.1         *nmica/0.8.0        tcl/8.5.5
*bwa/0.5.7         matlab/R2008b      octave/3.2.0       tk/8.5.5
*cufflinks/0.8.1   matlab/R2009b      open64/4.2.3      *tophat/1.0.13
 fsl/4.1          *mgltools/1.5.4     openmpi/1.4       *triton/4.0.0
*gapcloser        *modeller/9v7       python/2.6.1       visit/1.11.2

* = of possible interest to Biologists.

example:

$ module load abyss
ABySS - assemble short reads into contigs
See: /apps/abyss/1.1.2/share/doc/abyss/README
  or 'ABYSS --help'
  or 'man /apps/abyss/1.1.2/share/man/man1/ABYSS.1'
  or 'man /apps/abyss/1.1.2/share/man/man1/abyss-pe.1'
----------------------------------------------------------------------

Also installed: ClustalW2/X, SATE (including MAFFT, MUSCLE, OPAL, PRANK, RAxML), readseq,
phylip, etc.

Other Genomic Visualization Tools
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Some other genomic visualization tools
http://moo.nac.uci.edu/~hjm/ManipulatingDataOnLinux.html#_genomic_visualization_tools[are listed here].
(Some of them are installed on BDUC; some are web tools; some still need to be installed.)

Which programming language is best for Biology?
-----------------------------------------------
Sometimes what you need to do just isn't handled by an existing utility and you need to
write it yourself.  For those situations that call for data *mangling/massaging*, I'd
recommend either http://en.wikipedia.org/wiki/Perl[Perl] &
http://www.bioperl.org/wiki/Main_Page[BioPerl] (or
http://en.wikipedia.org/wiki/Python[Python] & http://biopython.org/wiki/Main_Page[BioPython]).

Perl has an elegant syntax for handling much of the simple, line-based data mangling you
might run into:

---------------------------------------------------
#!/usr/bin/perl -w
while (<>) {        # while there are still lines to process, read the next one into $_
   chomp;           # chew off the trailing 'newline' character
   $N = @L = split; # split the line on whitespace into the array 'L'; $N holds the column count
   # do what you need to do to the data
}                   # continue to the end of the file
---------------------------------------------------

For genomic data *analysis*, I'd very strongly recommend http://www.r-project.org/[R] &
http://www.bioconductor.org/[BioConductor].  I wrote a very
http://moo.nac.uci.edu/~hjm/AnRCheatsheet.html[gentle introduction to R] and cite many
more advanced tutorials therein.  Both R and BioConductor have been advancing extremely
rapidly and are now the 'de facto' language of Genomics as well as Statistics.  While R
lacks a GUI, it has many advanced visualization tools and forms of graphic output.

Also, there is a website called
http://sysbio.harvard.edu/csb/resources/computational/scriptome/[Scriptome] which has
started to gather and organize scripts for the slicing, dicing, reformatting, and joining
of sequence and other data related to Bioinformatics, Computational Biology, etc.

Parting Shots
-------------

BDUC Advancement
~~~~~~~~~~~~~~~~
* $ for 64GB RAM - ~$2400
* letters to Dana for more Broadcom nodes

Latest Version
~~~~~~~~~~~~~~
The latest version of this document moo:~/public_html/Bio/BigBioData.html[should be here].