Research Computing for Non-Geeks
================================
by Harry Mangalam v1.03 - Aug 24th, 2011
:icons:
//Harry Mangalam mailto:harry.mangalam@uci.edu[harry.mangalam@uci.edu]

// this file is converted to HTML via the command:
// fileroot="/home/hjm/nacs/ResearchComputingForNonGeeks"; asciidoc -a toc -a numbered ${fileroot}.txt; scp ${fileroot}.html ${fileroot}.txt moo:~/public_html;
// update svn from BDUC
// scp ${fileroot}.txt hmangala@claw1:~/bduc/trunk/sge; ssh hmangala@bduc-login 'cd ~/bduc/trunk/sge; svn update; svn commit -m "new mods to ResearchComputingHOWTO"'
// and push it to Wordpress:
// blogpost.py update -c HowTos ${fileroot}.txt
// don't forget that the HTML equiv of '~' = '%7e'
// asciidoc cheatsheet: http://powerman.name/doc/asciidoc
// asciidoc user guide: http://www.methods.co.nz/asciidoc/userguide.html

Philosophy for the course
-------------------------
Most scientists are not programmers, but almost all of them must use computers to do their work. Many scientists in domains that previously had little to do with computation are now finding that they must become de facto programmers, or at least large-scale data analysts, to meet the requirements of their chosen field. This curriculum hopes to alleviate some of the problems faced by scientists forced to confront large datasets. A good 2-page essay that describes the situation is http://software-carpentry.org/articles/amsci-swc-2006.pdf[Where's the Real Bottleneck in Scientific Computing?]

This course will cover what you need to know to use Unix-like computers (Linux, Mac OS X, http://www.cygwin.com[Cygwin] on Windows) to do large-scale analysis with recent software tools that we'll show you how to find. If you can't find the right tools, we'll show you how to write simple utilities to fill the gaps. It is not a programming course per se, but we will briefly cover many programming concepts.

This is an explicitly 'overview' approach, and will flash light into many dark corners of computing (but full illumination will require those corners to be explored by the student using Google, Wikipedia, and the link:#biblio[Bibliography] provided). It will explain many of the concepts of research computing that are foreign to a beginning graduate student, especially one coming from a non-physical science. It will attempt to be domain-neutral in the topics covered and to explain as much as possible in non-technical terms, although, given the subject, some technical vocabulary will be unavoidable.

Some of this curriculum addresses aspects of computing that are arguably more related to system administration than to research, but on both shared and solo machines there are often times when you need to discern what the machine is doing when it appears to have gone into a trance. This is part of the general debugging process and is applicable to many other situations.

Acknowledgements
~~~~~~~~~~~~~~~~
This course owes a great deal to Greg Wilson's excellent http://software-carpentry.org[Software Carpentry site] (content in video, PDF, and HTML), from which much of this course is borrowed and by which it was inspired. The skeleton of the course is described below; for comparison, see the Software Carpentry curriculum. Many of the topics covered here are expanded upon in some detail in the http://moo.nac.uci.edu/~hjm/bduc/BDUC_USER_HOWTO.html[BDUC Users HOWTO guide], which is updated frequently.

Introduction
------------
- Open Source Software & Open Source Science
- Data transparency and availability
- Replication, Versioning, & Provenance
- The importance of Statistics
- http://moo.nac.uci.edu/~hjm/FixITYourselfWithGoogle.html[Ask Google]

The basics: Shells, X11, & system tools
---------------------------------------
- terminal programs
- bash (& alternatives like tcsh, zsh)
- the command line: how to communicate with the OS without a mouse
- the shell environment and variables
- text vs X11 graphics and how to deal with them using nx
- X11 on Windows and Macs, X11 tunneling
- STDOUT, STDIN, STDERR, I/O redirection (<, >, >>, |), quoting, tee, time, strace, backgrounding/foregrounding, virtual screens & byobu (see the short sketch after this list)
- ssh, key exchange, 'sshouting'
- top & the top family, ps, dstat, type, which, uniq, sudo/su, du/df, ls, tree
- text editors and how they differ from word processors
- executing commands at a particular time: at, cron
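
To make the redirection bullet above concrete, here is a minimal 'bash' sketch of I/O redirection, pipes, 'tee', and backgrounding. The command 'myprog' and the file names are hypothetical placeholders, not tools covered elsewhere in this outline.

[source,bash]
---------------------------------------------------------------------
# 'myprog' and the file names below are hypothetical placeholders
myprog  > out.txt      # send STDOUT to a file (overwriting it)
myprog >> out.txt      # append STDOUT to the end of a file
myprog 2> err.txt      # send only STDERR to a file
myprog  < in.txt       # read STDIN from a file instead of the keyboard

# chain commands with pipes; 'tee' keeps a copy of an intermediate stage
sort out.txt | uniq -c | tee counts.txt | head

# run a long job in the background, logging both STDOUT and STDERR
myprog > run.log 2>&1 &
---------------------------------------------------------------------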
Files and Storage
-----------------
- Network filesystems: NFS, sshfs, network traffic considerations
- File formats: bits/bytes, ASCII, binary, SQL DBs, specialized formats (HDF5, XML), audio, video, logs, etc.
- Compression, encryption, checksums: gzip/bzip2/compress/zip, md5, crc
- Permissions: what they mean and how to set them
- Data management, sharing, and security implications: how do http://en.wikipedia.org/wiki/Health_Insurance_Portability_and_Accountability_Act[HIPAA], http://en.wikipedia.org/wiki/Federal_Information_Security_Management_Act_of_2002[FISMA], and related requirements affect how you can store and analyze data?

Moving data & watching bandwidth
--------------------------------
- cp, scp, rsync, bbcp, tar, sshfs, netcat, iftop, ifstat, iptraf, etherape
- see the http://moo.nac.uci.edu/~hjm/HOWTO_move_data.html[How to move data] doc for more detail.

Basic file and text tools
-------------------------
- ls, file, cp, wc, find, locate, rm, join, cut/scut/cols, sort, uniq, head/tail, sed, stats, less/more
- grep and regexes are important enough to require their own section
- more on text editors vs word processors
- http://goo.gl/UcTyb[see this for more detail]

Simple/Practical Regular Expressions
------------------------------------
grep/egrep/agrep and the like

Special purpose software
------------------------
(first, don't write your own software; use someone else's)

Finding & installing binary packages
------------------------------------
- apt, yum, google, synaptic, etc.

Building packages from source
-----------------------------
- download, untar, configure, make

Writing your own programs
-------------------------

Versioning
----------
To help with collaborative development, data & code provenance, and backup, we'll cover the concepts of code versioning (a minimal 'git' sketch follows the lists below) with:

- http://git-scm.com/[git]
- http://subversion.tigris.org/[subversion]

and some integrated systems for larger groups:

- http://trac.edgewall.org/[Trac]
- http://www.redmine.org/[Redmine]
- https://github.com/[Github]
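
As a concrete preview of the versioning tools above, here is a minimal 'git' sketch, assuming a purely local repository; the directory name 'myproject' and the file 'analysis.R' are hypothetical placeholders.

[source,bash]
---------------------------------------------------------------------
# 'myproject' and 'analysis.R' are hypothetical placeholders
git init myproject            # create a new, empty repository
cd myproject
# ...write some code, e.g. analysis.R, then...
git add analysis.R            # tell git to track the file
git commit -m "first working version of the analysis"
# ...edit the file, then review and record the change...
git diff                      # what changed since the last commit?
git commit -am "handle missing values in the input"
git log --oneline             # the project's history, one line per commit
---------------------------------------------------------------------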
Simple Data Structures
----------------------
How do programming languages represent and store information?

- chars, strings, ints, floats, doubles, arrays, vectors, matrices, lists, scalars, hashes/dictionaries, linked lists, trees, graphs, structs, objects, R data.frames, etc.

Programming languages
---------------------
- http://www.gnu.org/software/bash/manual/bashref.html[bash]
- http://en.wikipedia.org/wiki/Python_(programming_language)[Python] or http://en.wikipedia.org/wiki/Perl_(programming_language)[Perl]
- http://en.wikipedia.org/wiki/R_(programming_language)[R]
- http://www.scilab.org/[Scilab] or Matlab
- differences among these languages & why these ones rather than C/C++, Fortran, etc.
- declarations, variables, loops, flow control, file I/O, subroutines, system interactions
- how to add options to your programs to make them more flexible

Debugging
---------
- print/printf
- debug flags
- using debuggers
* http://ipython.org/[ipython]
* http://perldoc.perl.org/perldebug.html[perl -d]
* http://www.gnu.org/s/ddd/[ddd]
* http://www.gnu.org/s/gdb/[gdb]

Code Optimization & Profiling
-----------------------------
- simple timing with 'time' and '/usr/bin/time'
- simple profiling with http://oprofile.sourceforge.net/news/[oprofile]
- descriptions of lower-level profiling with http://icl.cs.utk.edu/papi/[PAPI], http://perfsuite.ncsa.illinois.edu/[PerfSuite], http://www2.cs.uh.edu/~hpctools[HPCTools]

Big computation
---------------
- schedulers like http://wikis.sun.com/display/GridEngine/Home[SGE], https://computing.llnl.gov/linux/slurm/[slurm], http://en.wikipedia.org/wiki/Portable_Batch_System[PBS] (a sketch of a simple job script follows this list)
* data flow considerations, data locality
* automating your analysis via scripting with
** http://www.gnu.org/software/bash/manual/bashref.html[bash]
** http://en.wikipedia.org/wiki/Make_(software)[make]
** visual dataflow applications
*** http://en.wikipedia.org/wiki/Kepler_scientific_workflow_system[kepler]
*** http://pipeline.loni.ucla.edu/[loni pipeline]
*** http://www.vistrails.org/index.php/Main_Page[vistrails]
*** http://en.wikipedia.org/wiki/Taverna_workbench[taverna]
- a quick introduction to the pros & cons of parallel programming with http://en.wikipedia.org/wiki/Message_Passing_Interface[MPI], http://en.wikipedia.org/wiki/OpenMP[OpenMP], and http://en.wikipedia.org/wiki/Gpgpu[GPUs] using http://en.wikipedia.org/wiki/CUDA[CUDA]
- big data and how to handle it
- specific utilities for big data:
* http://www.pytables.org/moin[pytables] & http://vitables.berlios.de/[vitables]
* http://nco.sourceforge.net/[nco - the NetCDF Operators]
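
To give a flavor of what a scheduler expects, here is a hypothetical SGE job script; the job name, queue name, 'module' command, and R invocation are placeholders that will differ from cluster to cluster, so check your site's documentation for real values.

[source,bash]
---------------------------------------------------------------------
#!/bin/bash
# hypothetical SGE job script -- names and queue are placeholders
#$ -N myjob                # job name, as shown by qstat
#$ -q long                 # which queue to submit to (site-specific)
#$ -cwd                    # run the job from the submission directory
#$ -o myjob.out            # file to collect the job's STDOUT
#$ -e myjob.err            # file to collect the job's STDERR

module load R              # load site-provided software (if environment modules are used)
R CMD BATCH analysis.R     # the actual analysis
---------------------------------------------------------------------

Submit it with 'qsub myjob.sh' and monitor it with 'qstat'.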
[[biblio]]
Bibliography
------------
- See the 'Further Reading' section at the bottom of the http://software-carpentry.org[Software Carpentry] home page. These are 1-10 page essays on how scientists 'do use' and 'should use' computers.
- See the 'Suggested Readings For Computational Scientists' section in http://software-carpentry.org/articles/cise-swc-2006.pdf[Software Carpentry: Getting Scientists to Write Better Code by Making Them More Productive].
- 'Unix Power Tools, Third Edition', Jerry Peek, Shelley Powers, Tim O'Reilly, Mike Loukides; 2002. http://proquest.safaribooksonline.com/book/operating-systems-and-server-administration/unix/0596003307[(Online edition via the UCI Safari subscription)]
- http://tldp.org/LDP/abs/html/[Advanced Bash-Scripting Guide], an excellent online tutorial describing how to fully exploit the bash programming language.

Release information & Latest version
------------------------------------
This document is released under the http://www.gnu.org/licenses/fdl.txt[GNU Free Documentation License]. The latest version of this document should always be available http://moo.nac.uci.edu/~hjm/ResearchComputingForNonGeeks.html[HERE].