1. Philosophy for the course

Most scientists are not programmers, but almost all of them must use computers to do their work. Many scientists in domains that previously had little to do with computation now find that they must become de facto programmers, or at least large-scale data analysts, to meet the requirements of their field. This curriculum aims to ease some of the problems faced by scientists confronting large data. A good 2-page essay describing the situation is Where's the Real Bottleneck in Scientific Computing?

This course will cover what you need to know to use Unix-like computers (Linux, Mac OS X, Cygwin on Windows) for large-scale analysis, using recent software tools that we'll show you how to find. If you can't find the tools, we'll show you how to write simple utilities to fill the gaps. It's not a programming course per se, but we will briefly cover many programming concepts.

This is explicitly an overview: it will flash light into many dark corners of computing, but full illumination will require the student to explore those corners using Google, Wikipedia, and the Bibliography provided. It will explain many of the concepts of research computing that are foreign to a beginning graduate student, especially one coming from a non-physical science. It will attempt to be domain-neutral in the topics covered and will explain as much as possible in non-technical terms, although given the subject some technical vocabulary is unavoidable.

Some of this curriculum addresses aspects of computing that are arguably more related to computer administration than research, but on both shared and solo machines there are times when you need to discern what the machine is doing when it appears to have gone into a trance. This is part of the general debugging process and is applicable to many other situations.

1.1. Acknowledgements

This course owes a great deal to Greg Wilson's excellent Software Carpentry site (content in video, PDF, and HTML), from which much of this course is borrowed and by which it was inspired.

The skeleton of the course is described below. For comparison, see the Software Carpentry curriculum.

Many of the topics covered here are expanded upon in some detail in the BDUC Users HOWTO guide, which is updated frequently.

2. Introduction

  • Open Source Software & Open Source Science

  • Data transparency and availability

  • Replication, Versioning, & Provenance

  • The importance of Statistics

  • Ask Google

3. The basics: Shells, X11, & system tools

  • terminal programs

  • bash (& alternatives like tcsh, zsh)

  • the command line; how to communicate with the OS without a mouse.

  • the shell environment and variables

  • text vs X11 graphics and how to deal with the latter remotely using nx

  • X11 on Windows and Macs, X11 tunneling

  • STDOUT, STDIN, STDERR, I/O redirection (<, >, >>, |), quoting, tee, time, strace, backgrounding/foregrounding, virtual screens & byobu (see the sketch after this list)

  • ssh, key exchange, sshouting

  • top & the top family, ps, dstat, type, which, uniq, sudo/su, du/df, ls, tree

  • text editors and how they differ from word processors.

  • executing commands at a particular time: at, cron
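
For a taste of what these bullets mean in practice, here is a minimal sketch of a bash session; the directory and file names are made up for illustration.

    # set and inspect a shell variable, then export it into the environment
    MYDIR=/tmp/demo
    mkdir -p "$MYDIR"
    export MYDIR
    echo "working in $MYDIR"

    # redirect STDOUT to a file, append to it, and capture STDERR separately
    ls "$MYDIR"  >  listing.txt       # '>'  overwrites listing.txt
    ls "$MYDIR"  >> listing.txt       # '>>' appends to it
    ls /no/such/dir 2> errors.txt     # '2>' redirects STDERR

    # pipe one program's STDOUT into another's STDIN, and tee a copy to a file
    ls "$MYDIR" | tee listing.txt | wc -l

    # run a long job in the background, then keep an eye on things
    sleep 300 &                       # '&' puts the job in the background
    jobs                              # list background jobs
    time ls "$MYDIR"                  # how long did that take?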

4. Files and Storage

  • Network Filesystems: NFS, sshfs, network traffic considerations

  • File formats: bits/bytes, ASCII, binary, SQL DBs, specialized formats (HDF5, XML), audio, video, logs, etc.

  • Compression, Encryption, Checksums: gzip/bzip2/compress/zip, md5, crc (see the sketch after this list)

  • Permissions: what they mean and how to set them

  • Data management, sharing, security implications: How do HIPAA, FISMA, and related requirements impact how you can store and analyze data?
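
A brief sketch of the compression, checksum, and permission tools above; bigdata.csv is a hypothetical file and md5sum is the GNU coreutils form.

    # compress a copy of the file and compare sizes
    gzip -c bigdata.csv > bigdata.csv.gz   # '-c' writes to STDOUT, keeping the original
    ls -lh bigdata.csv bigdata.csv.gz

    # record a checksum before moving the file, and verify it afterwards
    md5sum bigdata.csv.gz > bigdata.md5
    md5sum -c bigdata.md5             # prints 'OK' if the file is intact

    # permissions: readable by everyone, writable only by you
    chmod 644 bigdata.csv
    ls -l bigdata.csv                 # shows -rw-r--r--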

5. Moving data & watching bandwidth

  • cp, scp, rsync, bbcp, tar, sshfs, netcat, iftop, ifstat, iptraf, etherape

  • see the How to move data doc for more detail, and the sketch below
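
A minimal sketch of two common transfers and one way to watch them; the user name, host, and paths are made up.

    # copy a single file to a remote machine over ssh
    scp results.tar.gz someuser@cluster.example.edu:/data/results/

    # mirror a whole directory, sending only what changed;
    # -a preserves permissions/times, -v is verbose, -z compresses in transit
    rsync -avz --progress myproject/ someuser@cluster.example.edu:/data/myproject/

    # watch what the transfer is doing to the network (iftop needs root)
    sudo iftop -i eth0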

6. Basic file and text tools

  • ls, file, cp, wc, find, locate, rm, join, cut/scut/cols, sort, uniq, head/tail, sed, stats, less/more (a typical pipeline is sketched after this list)

  • grep and regexes are important enough to require their own section

  • more on text editors vs word processors.

  • see this for more detail
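
A typical text-tool pipeline, sketched on a hypothetical tab-delimited log file.

    # the classic 'how many of each?' pipeline:
    # pull out column 3, sort it, count duplicates, and show the top 10
    cut -f3 access.log | sort | uniq -c | sort -rn | head -n 10

    # quick sanity checks on a file
    wc -l access.log                  # how many lines?
    file access.log                   # what kind of file is it?
    head -n 5 access.log              # peek at the first few lines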

7. Simple/Practical Regular Expressions

grep/egrep/agrep and the like
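
A few simple examples, sketched on a hypothetical app.log.

    # lines containing the string 'error', ignoring case
    grep -i error app.log

    # lines that START with a date like 2011-03-05 (egrep = extended regexes)
    egrep '^[0-9]{4}-[0-9]{2}-[0-9]{2}' app.log

    # count matches, show line numbers, invert the match
    grep -c error app.log             # how many matching lines
    grep -n error app.log             # show line numbers with the matches
    grep -v error app.log             # everything EXCEPT the matching lines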

8. Special purpose software

(first, don’t write your own software; use someone else’s)

9. Finding & installing binary packages

  • apt, yum, google, synaptic, etc
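
A sketch of finding and installing a binary package from the command line; the package names are examples and will differ between distributions.

    # Debian/Ubuntu-style systems
    apt-cache search bioperl          # is there a package for it?
    sudo apt-get install bioperl      # install it and its dependencies

    # RedHat/CentOS/Fedora-style systems
    yum search hdf5
    sudo yum install hdf5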

10. Building packages from source

  • download, untar, configure, make
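
The canonical GNU-style build cycle, sketched with a hypothetical package sometool-1.0.

    wget http://example.org/sometool-1.0.tar.gz
    tar xvzf sometool-1.0.tar.gz
    cd sometool-1.0
    ./configure --prefix=$HOME/local  # install under your home dir; no root needed
    make
    make install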

11. Writing your own programs

12. Versioning

To help with collaborative development, data & code provenance, and backup, we'll cover the concepts of code versioning, along with some integrated systems for larger groups.
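
The specific tools aren't named here, but as one widely used example, a minimal git cycle looks like this:

    git init myanalysis               # start tracking a project directory
    cd myanalysis
    git add analyze.sh data_notes.txt # stage the files to be recorded
    git commit -m "first working version of the analysis script"
    # ... edit files ...
    git diff                          # what changed since the last commit?
    git commit -a -m "fix off-by-one in the binning loop"
    git log --oneline                 # the project's history, one line per commit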

13. Simple Data Structures

How do programming languages represent and store information? Chars, strings, ints, floats, doubles, arrays, vectors, matrices, lists, scalars, hashes/dictionaries, linked lists, trees, graphs, structs, objects, R data.frames, etc.
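
bash's own data structures are limited, but they illustrate a few of these; a minimal sketch (associative arrays need bash 4+, and the values are invented for illustration):

    # scalars (bash treats everything as a string unless told otherwise)
    count=42
    name="E. coli"

    # an indexed array (like a list/vector)
    samples=(ctrl_1 ctrl_2 treat_1 treat_2)
    echo "${samples[0]}"              # first element
    echo "${#samples[@]}"             # how many elements

    # an associative array (a hash/dictionary)
    declare -A genome_size
    genome_size[ecoli]=4600000
    genome_size[yeast]=12100000
    echo "${genome_size[yeast]}"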

14. Programming languages

  • bash

  • Python or Perl

  • R

  • Scilab or [Matlab]

  • differences among these languages & why these rather than C/C++, Fortran, etc.

  • declarations, variables, loops, flow control, file I/O, subroutines, system interactions

  • how to add options to your programs to make them more flexible (see the sketch after this list)
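
As a minimal sketch of several of these concepts in bash (options via getopts, a subroutine, a loop, file I/O), here is a hypothetical line-counting script:

    #!/bin/bash
    # count lines in each input file; -v turns on extra chatter

    verbose=0
    while getopts "v" opt; do         # parse command-line options
        case $opt in
            v) verbose=1 ;;
            *) echo "usage: $0 [-v] file [file ...]" >&2; exit 1 ;;
        esac
    done
    shift $((OPTIND - 1))             # drop the parsed options, keep the file names

    count_lines () {                  # a subroutine
        wc -l < "$1"
    }

    for f in "$@"; do                 # flow control: loop over the remaining args
        [ $verbose -eq 1 ] && echo "processing $f" >&2
        echo -e "$f\t$(count_lines "$f")"
    done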

15. Debugging

16. Code Optimization & Profiling

17. Big computation

18. Bibliography

19. Release information & Latest version

This document is released under the GNU Free Documentation License.

The latest version of this document should always be HERE.