BigData classes Fall 2014.

2 sessions in Oct., maybe when Marian is gone to Aus. Oct 6-10?
Padhraic wants it after Oct 24th (also fine).

Links to check out:
  http://www.r-statistics.com/2013/09/a-speed-test-comparison-of-plyr-data-table-and-dplyr/

Each day covers:
  Intro to Linux (somewhat advanced compared to BioLinux version)
  Some concepts on OpenSource: How to compile and install software on your own. [new]
  How to analyze data on Linux (update
  Dealing with BigData New, but incorporates

Class 1: An introduction to Linux on the HPC cluster
====================================================
[For Linux and Cluster novices]

Introduction
  Philosophy on Research Computing
  Getting help: Google, and more
  How to ask a question.
The HPC cluster
  Cluster Computing
  ssh, X11 graphics, logging in
Where Data Lives and How fast it moves
  Filesystems
    RAM vs SSD vs spinning disks
    Local filesystems vs NFS vs distributed FS
  RAID types 0 1 5 6 10
  Hardware vs software RAID
Linux
  bash and commandline editing
  Directories & Files, permissions, utils
  STDIN, STDOUT, STDERR
  Jobs, foreground, background
  Pipes and pipe processing
  Basic utilities
  Regular expressions
  ASCII, newlines, and text editors
Moving data
  Some cautions
  archiving, compression, encryption, backup
  utilities - wget, scp, rsync, bbcp, tar, zip
How to find the appro app
  Module system
Grid Engine scheduler
  Commands
  Queues
  Specifying resources
Programming in 1 Minute
  What is a program?
  Which language to use...
  bash: variables, loops, tests
  qsub scripts
  Perl / Python
  R
Version Control
  SVN
  GIT

Class 2: Analyzing Data/BigData on Linux
========================================
[For Linux Users, but novices at cluster data analysis, BigData.]

Where Data Lives and How fast it moves
  CPU registers
  CPU cache
  System RAM
  Flash Memory / SSDs
  Spinning Disks
  RAID types 0 1 5 6 10
  Hardware vs software RAID
Data formats
  Un/structured ascii files
  Binary files
  Self-describing Data Formats
    HDF5/netCDF4 (demo as http://docs.h5py.org/en/latest/quick.html)
    XML - don't.
  Relational Databases: demo of sucking filesystem data into a RDB
    and then querying it. (crude search engine)
    SQL Engines & Servers (SQLite vs MySQL)
Analyzing Data On Linux
  Viewing & Editing data: warnings and recommendations
    Office Docs on Linux
    Editors specialized for Data
    Stream editing
  Simple Slicing, Dicing, Pasting, Joining, Rearrangement
    cut, scut, grep, paste, join, viewing that data
  Identifying File and Directory differences
    diff -r, md5sum, tree, ls -lR
  Searching files with regexes
    big expansion of grep, pcre
  Some example applications
  Data manipulation Systems
    R / big.data
    Python / Numpy / Pylab / pytables
    nco
  Mathematical Modeling Systems
  Data visualization
    Inspiration and Examples
BigData
  Scale of Data
  Data on disk
    Inodes, sectors, bytes, zotfiles
    IOPS, Streaming IO
  Moving Bigdata
    Checksums
  Processing BigData in parallel
    Embarrassingly Parallel solutions
      GNU parallel
      clusterfork & friends
    hadoop & map-reduce vs other approaches.
    nosql (mongodb)
  Backing up BigData
    Generally, you can't. So what else to do.
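
Possible sketch for the Class 2 "Relational Databases" demo (sucking
filesystem data into a RDB, then querying it as a crude search engine).
This is only one way it might look, assuming Python with the sqlite3
module from the standard library. The table name `files`, its columns,
and the helper names index_tree() and search() are made up for
illustration and are not part of the class materials.

```python
import os
import sqlite3


def index_tree(root, db_path=":memory:"):
    """Walk `root` and load one row per regular file into a SQLite table.

    `files` and its columns are hypothetical names chosen for this sketch.
    """
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS files ("
        "path TEXT PRIMARY KEY, name TEXT, size INTEGER, mtime REAL)"
    )
    rows = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            try:
                st = os.stat(full)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            rows.append((full, name, st.st_size, st.st_mtime))
    # One batched insert is much faster than one INSERT per file.
    conn.executemany("INSERT OR REPLACE INTO files VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return conn


def search(conn, pattern, min_size=0):
    """Return (path, size) for files whose name matches a SQL LIKE pattern,
    largest first."""
    cur = conn.execute(
        "SELECT path, size FROM files "
        "WHERE name LIKE ? AND size >= ? ORDER BY size DESC",
        (pattern, min_size),
    )
    return cur.fetchall()
```

Usage would be something like conn = index_tree("/some/dir") followed by
search(conn, "%.txt", min_size=1024). The default ":memory:" database is
handy for a demo; passing a filename instead keeps the index on disk so
it can be queried again later with the sqlite3 command-line client.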