BigData classes Fall 2014.
2 sessions in Oct., maybe when Marian is gone to Aus. Oct 6-10?
Padhraic wants it after Oct 24th (also fine).

Links to check out:
    http://www.r-statistics.com/2013/09/a-speed-test-comparison-of-plyr-data-table-and-dplyr/

Each day covers:
    Intro to Linux (somewhat advanced compared to BioLinux version)
    Some concepts on OpenSource: How to compile and install software on your own. [new]
    How to analyze data on Linux (update
    Dealing with BigData
        New, but incorporates

Class 1: An introduction to Linux on the HPC cluster
====================================================
[For Linux and Cluster novices]
Introduction
    Philosophy on Research Computing
    Getting help: Google, and more
    How to ask a question.
The HPC cluster
    Cluster Computing
    Big, shared filesystems
    ssh, X11 graphics, logging in
Linux
    bash and commandline editing
    Directories & Files, permissions, utils
    STDIN, STDOUT, STDERR
    Jobs, foreground, background
    Pipes and pipe processing
    Basic utilities
    Regular expressions
    ASCII, newlines, and text editors
Moving data
    Some cautions
    archiving, compression, encryption, backup
    utilities - wget, scp, rsync, bbcp, tar, zip
How to find the appro app
    Module system
Grid Engine scheduler
    Commands
    Queues
    Specifying resources
Programming in 1 Minute
    What is a program?
    Which language to use...
    bash: variables, loops, tests
    qsub scripts
    Perl / Python
    R
Version Control

Class 2: Analyzing Data/BigData on Linux
========================================
[For Linux Users, but novices at cluster data analysis, BigData.]
Analyzing Data On Linux
    Viewing & Editing data: warnings and recs
        Office Docs on Linux
        Editors specialized for Data
    Stream editing
    Simple Slicing, Dicing, Pasting, Joining, Rearrangement
    Identifying File and Directory differences
    Searching files with regexes
Data formats
    Un/structured ascii files
    Binary files
    Self-describing Data Formats
        HDF5/netCDF4 (demo as http://docs.h5py.org/en/latest/quick.html)
        XML
    Relational Databases: SQL Engines & Servers
    Some example applications
Data manipulation Systems
    R / big.data
    Python / Numpy / Pylab / pytables
    nco
    Mathematical Modeling Systems
Data visualization
    Inspiration and Examples
BigData
    Scale of Data
    Data on disk
        Inodes, sectors, bytes, zotfiles
        IOPS, Streaming IO
    Moving BigData
        Checksums
    Processing BigData in parallel
        Embarrassingly Parallel solutions
        GNU parallel
        clusterfork
    Backing up BigData
        Generally, you can't. So what else to do.
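
Possible demo for the "Checksums" item under moving BigData: a minimal Python sketch, assuming the point to make in class is that a copied file should be verified by comparing digests, and that big files have to be hashed in chunks rather than read into memory at once. The function names (file_checksum, same_contents), the SHA-256 default, and the 1 MiB chunk size are illustrative choices, not part of the class notes.

```python
import hashlib


def file_checksum(path, algorithm="sha256", chunk_size=1024 * 1024):
    """Return the hex digest of a file, reading it in fixed-size chunks.

    Reading chunk by chunk keeps memory use flat no matter how big the file is.
    """
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps calling read() until it returns b"" (EOF).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_contents(source_path, copy_path, algorithm="sha256"):
    """True if both files produce the same digest, i.e. the copy looks intact."""
    return file_checksum(source_path, algorithm) == file_checksum(copy_path, algorithm)
```

Usage would be along the lines of same_contents("original.dat", "transferred.dat") after a transfer, with both paths being placeholders. When the two files live on different machines, the digest would instead be computed on each side and the two hex strings compared.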