Research Computing for the non-Geek
===================================
by Harry Mangalam <harry.mangalam@uci.edu>
v1.00 - Aug 18th, 2011
:icons:


//Harry Mangalam mailto:harry.mangalam@uci.edu[harry.mangalam@uci.edu]
// this file is converted to the HTML via the command:

// fileroot="/home/hjm/nacs/ResearchComputingForTheNonGeek"; asciidoc -a toc -a numbered ${fileroot}.txt; scp ${fileroot}.html ${fileroot}.txt  moo:~/public_html;

// update svn from BDUC
// scp ${fileroot}.txt  hmangala@claw1:~/bduc/trunk/sge; ssh hmangala@bduc-login 'cd ~/bduc/trunk/sge; svn update; svn commit -m "new mods to ResearchComputingHOWTO"'

// and push it to Wordpress:
// blogpost.py  update  -c HowTos ${fileroot}.txt

// don't forget that the HTML equiv of '~' = '%7e'
// asciidoc cheatsheet: http://powerman.name/doc/asciidoc
// asciidoc user guide: http://www.methods.co.nz/asciidoc/userguide.html


Philosophy for the course
-------------------------
This course covers what you need to know to use Unix-like computers (Linux, Mac OS X, or http://www.cygwin.com[Cygwin] on Windows) for large-scale analysis with the latest software tools.  If the tools you need do not exist, you can write your own.  It is not a programming course per se, but we will briefly cover many programming concepts.

This is explicitly an 'overview approach': it will shine light into many dark corners of computing, but full illumination will require the student to explore those corners using Google, Wikipedia, and the reference list provided.  It will explain many of the concepts of research computing that are foreign to a beginning graduate student, especially one coming from a non-physical science.  It will attempt to be domain-neutral in the topics covered and will explain as much as possible in non-technical terms, although some technical vocabulary is unavoidable given the topic.

Some of this curriculum addresses aspects of computing that are arguably more related to computer administration than to research, but on both shared and solo machines there are often times when you need to discern what the machine is doing when it appears to have gone into a trance.  This is part of the general debugging process and is applicable to many other situations.

This course owes a great deal to Greg Wilson's excellent http://software-carpentry.org[Software Carpentry site] (content in video, PDF and HTML), from which much of this course is borrowed and by which it is inspired.  Many of the topics covered here are expanded upon in the http://moo.nac.uci.edu/~hjm/bduc/BDUC_USER_HOWTO.html[BDUC Users HOWTO guide], which is updated frequently.

The skeleton of the course is described below.  For comparison, see the Software Carpentry curriculum.
       
Introduction
------------
- Open Source Software & Open Source Science
- Data transparency and availability
- Replication, Versioning, & Provenance
- The importance of Statistics
   

The basics: Shells, X11, & system tools
---------------------------------------
- terminal programs
- bash (& alternatives like tcsh, zsh)
- the command line; how to communicate with the OS without a mouse.
- the shell environment and variables
- text vs X11 graphics and how to deal with it using nx
- X11 on Windows and Macs, X11 tunneling
- quoting, IO redirection, backgrounding, virtual screens & byobu
- ssh, key exchange, sshouting
- top, ps, [ah]top, dstat, type, uniq, sudo/su, du/df, ls, tree
- text editors and how they differ from word processors.
  

Files and Storage
-----------------
- Network Filesystems: NFS, sshfs, network traffic considerations
- File formats: bits/bytes, ascii, binary, SQL DBs, specialized formats (HDF5, XML), audio, video, logs, etc.
- Compression, Encryption, Checksums: gzip/bzip2/compress/zip, md5, crc
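
As a rough illustration of why compression and checksums go together, the sketch below compresses some data with gzip, restores it, and uses MD5 and CRC-32 to confirm that nothing changed along the way.  The sample data and the helper names `md5_hex` and `crc32_hex` are invented for this example; only the Python standard library is assumed.

```python
import gzip
import hashlib
import zlib


def md5_hex(data: bytes) -> str:
    """Return the MD5 checksum of data as a hex string."""
    return hashlib.md5(data).hexdigest()


def crc32_hex(data: bytes) -> str:
    """Return the CRC-32 checksum of data as 8 hex digits."""
    return format(zlib.crc32(data) & 0xFFFFFFFF, "08x")


# Hypothetical sample data: highly repetitive text compresses well.
original = b"ACGT" * 10_000

compressed = gzip.compress(original)
restored = gzip.decompress(compressed)

print(len(original), "bytes before,", len(compressed), "bytes after gzip")
print("md5 before:", md5_hex(original))
print("md5 after :", md5_hex(restored))  # same as above if the round trip was lossless
print("crc32     :", crc32_hex(original))
```

Matching checksums before and after are the same check you would make with `md5sum` on the command line after copying a large file between machines.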
  
  
Moving data & watching bandwidth
--------------------------------
- cp, scp, rsync, bbcp, tar, sshfs, netcat, iftop, ifstat, iptraf, etherape, etc


Basic file and text tools
-------------------------
- ls, cp, wc, find, locate, rm, join, cut/scut/cols, sort, uniq, head/tail, sed, stats, less
- grep and regexes are important enough to require their own section
- more on text editors vs word processors.


Simple/Practical Regular Expressions
------------------------------------
grep/egrep/agrep and the like
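
To give a flavor of what a practical regular expression looks like, here is a small sketch using Python's `re` module, which follows much the same pattern syntax as egrep.  The log lines are made up for the example, and the grep commands in the comments are only approximate equivalents.

```python
import re

# Hypothetical log lines, invented for this example.
lines = [
    "2011-08-18 job 1042 finished in 12.5 s",
    "2011-08-18 job 1043 FAILED after 3.0 s",
    "comment line without a job number",
]

# Roughly `grep -i failed`: keep lines that match anywhere, ignoring case.
failed = [ln for ln in lines if re.search(r"failed", ln, re.IGNORECASE)]

# Capture groups: the parentheses pull out just the digits after "job ".
job_ids = [int(m.group(1)) for ln in lines for m in re.finditer(r"job (\d+)", ln)]

# Anchors and repetition: ^ ties the pattern to the start of the line,
# and \d{4} means exactly four digits.
dated = [ln for ln in lines if re.match(r"^\d{4}-\d{2}-\d{2} ", ln)]

print(failed)
print(job_ids)
print(len(dated), "lines start with a date")
```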


Special purpose software
------------------------
(first, don't write your own software; use someone else's)


Finding & installing binary packages
------------------------------------
- apt, yum, google, synaptic, etc


Building packages from source
-----------------------------
- download, untar, configure, make


Writing your own programs
-------------------------


Versioning
----------
To help with collaborative development, data & code provenance, and backup, we'll cover the concepts of code versioning and some integrated systems:

- git, svn, cvs
- trac, redmine, github


Simple Data Structures
----------------------
How do programming languages represent and store information?

- chars, strings, ints, floats, doubles, arrays, vectors, matrices, lists,
  scalars, hashes/dictionaries, linked lists, structs, R data.frames, etc.
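
A few of these structures are easier to recognize once you have seen them written down.  The sketch below shows how Python spells some of them; the variable names and values (including the `Sample` record) are invented for illustration, and other languages use different names for the same ideas.

```python
from dataclasses import dataclass

# Scalars: one value each.
count = 42            # int
ratio = 0.25          # float (stored as a C double in the standard interpreter)
label = "sample_A"    # string, a sequence of characters

# List: ordered, indexed from 0, and able to grow.
reads = [12, 7, 30]
reads.append(5)

# Dictionary (hash): look up a value by key instead of by position.
lengths = {"chr1": 248, "chr2": 242}   # made-up numbers
lengths["chrX"] = 156

# Matrix as a list of rows; an element is addressed as [row][column].
matrix = [[1, 2, 3],
          [4, 5, 6]]


# Struct-like record: named fields of different types grouped together.
@dataclass
class Sample:
    name: str
    reads: int
    quality: float


s = Sample(name=label, reads=sum(reads), quality=ratio)
print(s)
print(matrix[1][2], lengths["chrX"], reads[-1])
```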


Programming languages: bash, Python or Perl, R, and Scilab/Matlab
-----------------------------------------------------------------
- differences among these languages & why we use them as opposed to C/C++, Fortran, etc.
- declarations, variables, loops, flow control, file io, subroutines, system interactions
- how to add options to your programs to make them more flexible.
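
As one possible illustration of adding options to a program, the sketch below uses Python's standard `argparse` module.  The script name `colsum`, its `--column` and `--verbose` options, and the sample rows are all hypothetical; the argument list is passed explicitly so the example runs without a real command line.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Define the options a hypothetical 'colsum' script accepts."""
    parser = argparse.ArgumentParser(
        prog="colsum",
        description="Sum one column of whitespace-separated numbers.")
    parser.add_argument("-c", "--column", type=int, default=1,
                        help="1-based column to sum (default: 1)")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="report how many rows were read")
    return parser


def column_sum(rows, column):
    """Sum the given 1-based column over an iterable of text lines."""
    return sum(float(row.split()[column - 1]) for row in rows)


# The explicit list stands in for what a user would type after the script name.
args = build_parser().parse_args(["--column", "2", "-v"])
rows = ["1 10", "2 20", "3 30"]   # made-up input
total = column_sum(rows, args.column)
if args.verbose:
    print("read", len(rows), "rows")
print(total)
```

In a real script you would call `parse_args()` with no argument so that it reads the actual command line, and `--help` output is then generated for free from the `help` strings.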

  
Debugging
---------
- print/printf
- debug flags
- using debuggers (ipython, perl -d, ddd, gdb)
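
The first two items can be combined in a very small pattern: print statements that stay in the code but only speak when a debug flag is switched on.  This is a minimal sketch, assuming a module-level `DEBUG` variable and a helper named `debug`, both invented here; larger programs often use Python's `logging` module for the same purpose.

```python
import sys

DEBUG = False  # set to True (or wire it to a command-line option) to see the trace


def debug(*parts):
    """Print a trace message to stderr, but only when DEBUG is on."""
    if DEBUG:
        print("DEBUG:", *parts, file=sys.stderr)


def mean(values):
    """Return the arithmetic mean of a non-empty sequence of numbers."""
    total = 0.0
    for i, v in enumerate(values):
        total += v
        debug("step", i, "value", v, "running total", total)
    debug("dividing", total, "by", len(values))
    return total / len(values)


print(mean([2, 4, 9]))
```

Writing the trace to stderr keeps it out of the program's real output, so results can still be redirected to a file while the debugging messages appear on the terminal.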


Code Optimization & Profiling
-----------------------------
- simple timing with 'time', '/usr/bin/time'
- simple profiling with oprofile
- descriptions of lower level profiling with PAPI, PerfSuite, HPCTools
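
The shell's `time` measures a whole program; inside a program you can apply the same idea to a single function.  The sketch below times two ways of computing the same sum using `time.perf_counter` from the standard library.  The function names and the problem size are invented for this example, and the timings will differ from machine to machine.

```python
import time


def sum_of_squares_loop(n):
    """Add up 0^2 + 1^2 + ... + (n-1)^2 one term at a time."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def sum_of_squares_formula(n):
    """Same sum via the closed form (n-1) * n * (2n-1) / 6."""
    return (n - 1) * n * (2 * n - 1) // 6


def timed(func, *args):
    """Run func(*args) once; return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


n = 200_000
loop_result, loop_secs = timed(sum_of_squares_loop, n)
formula_result, formula_secs = timed(sum_of_squares_formula, n)

print(f"loop:    {loop_secs:.6f} s")
print(f"formula: {formula_secs:.6f} s")
print("results agree:", loop_result == formula_result)
```

Checking that both versions return the same answer before comparing their speed is part of the exercise: a fast wrong answer is not an optimization.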
   

Big computation
---------------
- schedulers like SGE, slurm, PBS
- data flow considerations, data locality
- quick introduction to pros & cons of parallel programming with MPI, OpenMP, GPUs
- big data and how to handle it.