1. Philosophy for the course
Most scientists are not programmers, but almost all of them must use computers to do their work. Many scientists in domains that previously had little to do with computation now find that they must become de facto programmers, or at least large-scale data analysts, to meet the requirements of their fields. This curriculum aims to alleviate some of the problems faced by scientists confronting large data. A good 2-page essay on the subject is Where’s the Real Bottleneck in Scientific Computing?
This course will cover what you need to know to use Unix-like computers (Linux, Mac OS X, Cygwin on Windows) to do large-scale analysis with recent software tools that we’ll show you how to find; where no suitable tool exists, we’ll show you how to write simple utilities to fill the gaps. It’s not a programming course per se, but we will briefly cover many programming concepts.
This is explicitly an overview course: it will flash light into many dark corners of computing, but full illumination will require the student to explore those corners using Google, Wikipedia, and the Bibliography provided. It will explain many concepts of research computing that are foreign to a beginning graduate student, especially one coming from a non-physical science. It will attempt to be domain-neutral in the topics covered and will explain as much as possible in non-technical terms, although, given the subject, some technical vocabulary is unavoidable.
Some of this curriculum addresses aspects of computing that are arguably more related to system administration than research, but on both shared and solo machines there are times when you need to discern what the machine is doing when it appears to have gone into a trance. This is part of the general debugging process and is applicable to many other situations.
1.1. Acknowledgements
This course owes a great deal to Greg Wilson’s excellent Software Carpentry site (content in video, PDF, and HTML), which inspired much of this course and from which much material is borrowed.
The skeleton of the course is described below. For comparison, see the Software Carpentry curriculum.
Many of the topics covered here are expanded upon in some detail in the BDUC Users HOWTO guide which is updated frequently.
2. Introduction
- Open Source Software & Open Source Science
- Data transparency and availability
- Replication, Versioning, & Provenance
- The importance of Statistics
3. The basics: Shells, X11, & system tools
- terminal programs
- bash (& alternatives like tcsh, zsh)
- the command line: how to communicate with the OS without a mouse
- the shell environment and variables
- text vs X11 graphics and how to deal with it using nx
- X11 on Windows and Macs, X11 tunneling
- STDOUT, STDIN, STDERR, I/O redirection (<, >, >>, |), quoting, tee, time, strace, backgrounding/foregrounding, virtual screens & byobu
- ssh, key exchange, sshouting
- top (and its family), ps, dstat, type, which, uniq, sudo/su, du/df, ls, tree
- text editors and how they differ from word processors
- executing commands at a particular time: at, cron
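The redirection and job-control operators above can be seen in a short bash sketch (file names here are arbitrary):

```shell
# Redirect STDOUT to a file (> overwrites, >> appends):
echo "first line"  > out.txt
echo "second line" >> out.txt

# Pipes connect STDOUT of one command to STDIN of the next;
# tee writes its input to a file AND passes it downstream:
cat out.txt | tee copy.txt | wc -l

# STDERR (file descriptor 2) is a separate stream from STDOUT (fd 1):
ls /no/such/dir 2> errors.txt   # capture only the error text

# Run a long job in the background; list background jobs with jobs:
sleep 10 &
jobs
kill %1   # stop it again
```

The same redirections work identically for any command, which is what makes small tools composable.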
4. Files and Storage
- Network filesystems: NFS, sshfs, network traffic considerations
- File formats: bits/bytes, ASCII, binary, SQL DBs, specialized formats (HDF5, XML), audio, video, logs, etc.
- Compression, Encryption, Checksums: gzip/bzip2/compress/zip, md5, crc
- Permissions: what they mean and how to set them
- Data management, sharing, security implications: How do HIPAA, FISMA, and related requirements impact how you can store and analyze data?
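A few of these ideas in a minimal sketch, assuming GNU gzip (for the -k flag) and GNU coreutils:

```shell
# Compression shrinks a file; a checksum fingerprints its contents:
echo "some results" > data.txt
gzip -k data.txt             # -k keeps data.txt alongside data.txt.gz (GNU gzip)
md5sum data.txt data.txt.gz  # change one bit and the checksum changes completely

# Permissions are user/group/other x read/write/execute.
# Numeric modes are octal (r=4, w=2, x=1), so 640 = rw- r-- ---:
chmod 640 data.txt
ls -l data.txt               # first column now reads -rw-r-----
```

Checksums are how you verify that a large transfer or an old archive hasn’t been silently corrupted.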
5. Moving data & watching bandwidth
- cp, scp, rsync, bbcp, tar, sshfs, netcat, iftop, ifstat, iptraf, etherape
- see the How to move data doc for more detail
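One classic technique from the list above: tar a directory tree to STDOUT and untar it through a pipe. The remote host names below are hypothetical.

```shell
mkdir -p src dst
echo "alpha" > src/a.txt

# tar the contents of src to STDOUT, untar them into dst:
tar -C src -cf - . | tar -C dst -xf -

# The same pipe works across machines through ssh (hypothetical host):
#   tar -cf - src | ssh user@remote.example.org 'tar -xf - -C /backups'

# For repeated transfers, rsync is usually better: it sends only what changed:
#   rsync -av src/ user@remote.example.org:/path/to/dst/
```

The tar pipe preserves the whole tree in one stream, which is why it remains a staple for bulk moves.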
6. Basic file and text tools
- ls, file, cp, wc, find, locate, rm, join, cut/scut/cols, sort, uniq, head/tail, sed, stats, less/more
- grep and regexes are important enough to require their own section
- more on text editors vs word processors
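These tools chain into pipelines; a sketch counting the most common values in a small file (the data is invented for illustration):

```shell
printf "dog\ncat\ndog\nbird\ndog\ncat\n" > animals.txt

# sort groups identical lines so uniq -c can count them;
# the second sort ranks the counts numerically, highest first:
sort animals.txt | uniq -c | sort -rn
# prints dog (3), then cat (2), then bird (1)

# cut extracts one column from delimited data:
printf "alice,42\nbob,17\n" | cut -d, -f2
```

This sort | uniq -c | sort -rn idiom is one of the most common one-liners in data triage.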
7. Simple/Practical Regular Expressions
grep/egrep/agrep and the like
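A sketch of the difference between fixed-string matching and an extended regex; the log lines are invented for illustration:

```shell
printf "error: disk full\nwarning: low memory\nerror: timeout\n" > app.log

# Plain grep matches a substring anywhere on the line:
grep error app.log

# grep -E (egrep) enables extended regexes: anchors, alternation, repetition.
grep -Ec '^(error|warning):' app.log   # -c counts matching lines: 3 here
grep -v error app.log                  # -v inverts the match
```

Learning a handful of regex operators (^, $, ., *, |, []) covers most day-to-day filtering.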
8. Special purpose software
(first, don’t write your own software; use someone else’s)
9. Finding & installing binary packages
- apt, yum, google, synaptic, etc.
10. Building packages from source
- download, untar, configure, make
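The canonical source-build sequence, with a hypothetical package name, followed by a toy Makefile showing what make itself actually does:

```shell
# The classic dance (package name and URL are hypothetical):
#   wget http://example.org/pkg-1.0.tar.gz
#   tar xzf pkg-1.0.tar.gz
#   cd pkg-1.0
#   ./configure --prefix=$HOME/local   # probe the system, write Makefiles
#   make                               # compile
#   make install                       # copy results into --prefix

# make just runs a rule when its target is older than its prerequisites.
# A toy Makefile (note: recipe lines must start with a TAB):
printf 'hello.txt:\n\techo "built by make" > hello.txt\n' > Makefile
make hello.txt
cat hello.txt
```

Installing under $HOME/local avoids needing root and keeps your builds separate from system packages.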
11. Writing your own programs
12. Versioning
To help with collaborative development, data & code provenance, and backup, we’ll cover the concepts of code versioning, as well as some integrated systems suitable for larger groups.
13. Simple Data Structures
How do programming languages represent and store information? Chars, strings, ints, floats, doubles, arrays, vectors, matrices, lists, scalars, hashes/dictionaries, linked lists, trees, graphs, structs, objects, R data.frames, etc.
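In bash itself (other languages offer richer, typed versions of these), a few of the simpler structures look like this; the associative array requires bash 4+, and the variable names are made up for illustration:

```shell
# Scalar: bash stores everything as a string until context says otherwise:
count=3

# Indexed array:
samples=(wt mutant1 mutant2)
echo "${samples[1]}"      # mutant1 (indices start at 0)
echo "${#samples[@]}"     # 3 elements

# Associative array, i.e. a hash/dictionary (bash 4+):
declare -A genome_size
genome_size[human]=3200
genome_size[yeast]=12
echo "${genome_size[yeast]}"
```

Hashes/dictionaries are worth internalizing early: most data-wrangling problems reduce to "look things up by name."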
14. Programming languages
16. Code Optimization & Profiling
17. Big computation
18. Bibliography
- see the Further Reading section at the bottom of the Software Carpentry home page. These are 1-10 page essays on how scientists do use and should use computers.
- see the Suggested Readings For Computational Scientists section in Software Carpentry: Getting Scientists to Write Better Code by Making Them More Productive.
- Unix Power Tools, Third Edition; Jerry Peek, Shelley Powers, Tim O’Reilly, Mike Loukides; 2002. (Online edition via UCI Safari subscription.)
- Advanced Bash-Scripting Guide, an excellent online tutorial describing how to fully exploit the bash programming language.
19. Release information & Latest version
This document is released under the GNU Free Documentation License.
The latest version of this document should always be HERE.