1. Philosophy for the course

Most scientists are not programmers, but almost all of them must use computers to do their work. Many scientists in domains that previously had little to do with computation now find that they must become de facto programmers, or at least large-scale data analysts, to meet the requirements of their field. This curriculum aims to ease some of the problems faced by scientists confronting large data. A good 2-page essay describing the situation is Where's the Real Bottleneck in Scientific Computing?

This course will cover what you need to know to use Unix-like computers (Linux, Mac OS X, Cygwin on Windows) for large-scale analysis, using recent software tools that we'll show you how to find. If you can't find the tools, we'll show you how to write simple utilities to fill the gaps. It's not a programming course per se, but we will briefly cover many programming concepts.

This is explicitly an overview: it will flash light into many dark corners of computing, but full illumination will require the student to explore those corners using Google, Wikipedia, and the Bibliography provided. It will explain many of the concepts of research computing that are foreign to a beginning graduate student, especially one coming from a non-physical science. It will attempt to be domain-neutral in the topics covered and will explain as much as possible in non-technical terms, although given the subject some technical vocabulary is unavoidable.

Some of this curriculum addresses aspects of computing that are arguably more related to computer administration than research, but on both shared and solo machines there are times when you need to discern what the machine is doing when it appears to have gone into a trance. This is part of the general debugging process and is applicable to many other situations.

1.1. Acknowledgements

This course owes a great deal to Greg Wilson's excellent Software Carpentry site (content in video, PDF, and HTML), from which much of this course is borrowed and by which it was inspired.

The skeleton of the course is described below. For comparison, see the Software Carpentry curriculum.

Many of the topics covered here are expanded upon in some detail in the BDUC Users HOWTO guide, which is updated frequently.

2. Introduction

  • Open Source Software & Open Source Science

  • Data transparency and availability

  • Replication, Versioning, & Provenance

  • The importance of Statistics

  • Ask Google

3. The basics: Shells, X11, & system tools

  • terminal programs

  • bash (& alternatives like tcsh, zsh)

  • the command line; how to communicate with the OS without a mouse.

  • the shell environment and variables

  • text vs X11 graphics and how to deal with the latter remotely using nx

  • X11 on Windows and Macs, X11 tunneling

  • STDOUT, STDIN, STDERR, I/O redirection (<, >, >>, |), quoting, tee, time, strace, backgrounding/foregrounding, virtual screens & byobu (see the sketch after this list)

  • ssh, key exchange, sshouting

  • top & the top family, ps, dstat, type, which, uniq, sudo/su, du/df, ls, tree

  • text editors and how they differ from word processors.

  • executing commands at a particular time: at, cron
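
For a taste of what these bullets mean in practice, here is a minimal sketch of a bash session; the directory and file names are made up for illustration.

    # set and inspect a shell variable, then export it into the environment
    MYDIR=/tmp/demo
    mkdir -p "$MYDIR"
    export MYDIR
    echo "working in $MYDIR"

    # redirect STDOUT to a file, append to it, and capture STDERR separately
    ls "$MYDIR"  >  listing.txt       # '>'  overwrites listing.txt
    ls "$MYDIR"  >> listing.txt       # '>>' appends to it
    ls /no/such/dir 2> errors.txt     # '2>' redirects STDERR

    # pipe one program's STDOUT into another's STDIN, and tee a copy to a file
    ls "$MYDIR" | tee listing.txt | wc -l

    # run a long job in the background, then keep an eye on things
    sleep 300 &                       # '&' puts the job in the background
    jobs                              # list background jobs
    time ls "$MYDIR"                  # how long did that take?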

4. Files and Storage

  • Network Filesystems: NFS, sshfs, network traffic considerations

  • File formats: bits/bytes, ASCII, binary, SQL DBs, specialized formats (HDF5, XML), audio, video, logs, etc.

  • Compression, Encryption, Checksums: gzip/bzip2/compress/zip, md5, crc (see the sketch after this list)

  • Permissions: what they mean and how to set them

  • Data management, sharing, security implications: How do HIPAA, FISMA, and related requirements impact how you can store and analyze data?
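
A brief sketch of the compression, checksum, and permission tools above; bigdata.csv is a hypothetical file and md5sum is the GNU coreutils form.

    # compress a copy of the file and compare sizes
    gzip -c bigdata.csv > bigdata.csv.gz   # '-c' writes to STDOUT, keeping the original
    ls -lh bigdata.csv bigdata.csv.gz

    # record a checksum before moving the file, and verify it afterwards
    md5sum bigdata.csv.gz > bigdata.md5
    md5sum -c bigdata.md5             # prints 'OK' if the file is intact

    # permissions: readable by everyone, writable only by you
    chmod 644 bigdata.csv
    ls -l bigdata.csv                 # shows -rw-r--r--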

5. Moving data & watching bandwidth

  • cp, scp, rsync, bbcp, tar, sshfs, netcat, iftop, ifstat, iptraf, etherape

  • see the How to move data doc for more detail, and the sketch below
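
A minimal sketch of two common transfers and one way to watch them; the user name, host, and paths are made up.

    # copy a single file to a remote machine over ssh
    scp results.tar.gz someuser@cluster.example.edu:/data/results/

    # mirror a whole directory, sending only what changed;
    # -a preserves permissions/times, -v is verbose, -z compresses in transit
    rsync -avz --progress myproject/ someuser@cluster.example.edu:/data/myproject/

    # watch what the transfer is doing to the network (iftop needs root)
    sudo iftop -i eth0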

6. Basic file and text tools

  • ls, file, cp, wc, find, locate, rm, join, cut/scut/cols, sort, uniq, head/tail, sed, stats, less/more (a typical pipeline is sketched after this list)

  • grep and regexes are important enough to require their own section

  • more on text editors vs word processors.

  • see this for more detail
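
A typical text-tool pipeline, sketched on a hypothetical tab-delimited log file.

    # the classic 'how many of each?' pipeline:
    # pull out column 3, sort it, count duplicates, and show the top 10
    cut -f3 access.log | sort | uniq -c | sort -rn | head -n 10

    # quick sanity checks on a file
    wc -l access.log                  # how many lines?
    file access.log                   # what kind of file is it?
    head -n 5 access.log              # peek at the first few lines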

7. Simple/Practical Regular Expressions

grep/egrep/agrep and the like
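
A few simple examples, sketched on a hypothetical app.log.

    # lines containing the string 'error', ignoring case
    grep -i error app.log

    # lines that START with a date like 2011-03-05 (egrep = extended regexes)
    egrep '^[0-9]{4}-[0-9]{2}-[0-9]{2}' app.log

    # count matches, show line numbers, invert the match
    grep -c error app.log             # how many matching lines
    grep -n error app.log             # show line numbers with the matches
    grep -v error app.log             # everything EXCEPT the matching lines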

8. Special purpose software

(first, don’t write your own software; use someone else’s)

9. Finding & installing binary packages

  • apt, yum, google, synaptic, etc
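
A sketch of finding and installing a binary package from the command line; the package names are examples and will differ between distributions.

    # Debian/Ubuntu-style systems
    apt-cache search bioperl          # is there a package for it?
    sudo apt-get install bioperl      # install it and its dependencies

    # RedHat/CentOS/Fedora-style systems
    yum search hdf5
    sudo yum install hdf5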

10. Building packages from source

  • download, untar, configure, make
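
The canonical GNU-style build cycle, sketched with a hypothetical package sometool-1.0.

    wget http://example.org/sometool-1.0.tar.gz
    tar xvzf sometool-1.0.tar.gz
    cd sometool-1.0
    ./configure --prefix=$HOME/local  # install under your home dir; no root needed
    make
    make install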

11. Writing your own programs

12. Versioning

To help with collaborative development, data & code provenance, and backup, we'll cover the concepts of code versioning, along with some integrated systems for larger groups.
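
The specific tools aren't named here, but as one widely used example, a minimal git cycle looks like this:

    git init myanalysis               # start tracking a project directory
    cd myanalysis
    git add analyze.sh data_notes.txt # stage the files to be recorded
    git commit -m "first working version of the analysis script"
    # ... edit files ...
    git diff                          # what changed since the last commit?
    git commit -a -m "fix off-by-one in the binning loop"
    git log --oneline                 # the project's history, one line per commit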

13. Simple Data Structures

How do programming languages represent and store information? Chars, strings, ints, floats, doubles, arrays, vectors, matrices, lists, scalars, hashes/dictionaries, linked lists, trees, graphs, structs, objects, R data.frames, etc.
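
bash's own data structures are limited, but they illustrate a few of these; a minimal sketch (associative arrays need bash 4+, and the values are invented for illustration):

    # scalars (bash treats everything as a string unless told otherwise)
    count=42
    name="E. coli"

    # an indexed array (like a list/vector)
    samples=(ctrl_1 ctrl_2 treat_1 treat_2)
    echo "${samples[0]}"              # first element
    echo "${#samples[@]}"             # how many elements

    # an associative array (a hash/dictionary)
    declare -A genome_size
    genome_size[ecoli]=4600000
    genome_size[yeast]=12100000
    echo "${genome_size[yeast]}"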

14. Programming languages

  • bash

  • Python or Perl

  • R

  • Scilab or [Matlab]

  • differences among these languages & why these rather than C/C++, Fortran, etc.

  • declarations, variables, loops, flow control, file I/O, subroutines, system interactions

  • how to add options to your programs to make them more flexible (see the sketch after this list)
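
As a minimal sketch of several of these concepts in bash (options via getopts, a subroutine, a loop, file I/O), here is a hypothetical line-counting script:

    #!/bin/bash
    # count lines in each input file; -v turns on extra chatter

    verbose=0
    while getopts "v" opt; do         # parse command-line options
        case $opt in
            v) verbose=1 ;;
            *) echo "usage: $0 [-v] file [file ...]" >&2; exit 1 ;;
        esac
    done
    shift $((OPTIND - 1))             # drop the parsed options, keep the file names

    count_lines () {                  # a subroutine
        wc -l < "$1"
    }

    for f in "$@"; do                 # flow control: loop over the remaining args
        [ $verbose -eq 1 ] && echo "processing $f" >&2
        echo -e "$f\t$(count_lines "$f")"
    done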

15. Debugging

16. Code Optimization & Profiling

17. Big computation

18. Bibliography

19. Release information & Latest version

This document is released under the GNU Free Documentation License.

The latest version of this document should always be HERE.