Harry Mangalam <harry.mangalam@uci.edu>
v1.0, May 13, 2016

1. Announcement

HPC has never had backups of user data for lack of funds. However, thanks to an allocation from OIT, we will soon have hardware that enables a SUBSET of the HPC data to be backed up. That word SUBSET is important.

We currently plan on being able to back up:

  • 100% of user HOME dirs (50GB/user)

  • 1TB per user on /pub

  • 1TB per user on each of the rental storage dirs:

    • /bio

    • /som

    • /cbcl

    • /dabdub

    • /edu

    • /elread

    • /tw

For you to take advantage of this service for files outside of your $HOME dir, you HAVE to initialize your backup files and provide guidance on what you want backed up. If you do not want your files to be backed up, don’t do anything.

If you DO want backup, at minimum you have to edit the files backup.include and backup.exclude that exist in your top-level dirs on /pub, /som, /bio, etc. We have placed them there with some guidance text.
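
As a loose sketch only (the guidance text we placed inside the files is authoritative; the one-pattern-per-line syntax below is an assumption, and the dir and file names are placeholders), a filled-in pair on /pub might look like:

$ cat /pub/$USER/backup.include
# back these up (assumed syntax: one path or glob per line,
# relative to /pub/$USER)
thesis-data/
scripts/
*.conf

$ cat /pub/$USER/backup.exclude
# never back these up
scratch/
*.bam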

We have ordered hardware for the secondary storage (aka incomplete backup) and it should arrive in early June. Following a primary fill to initialize the backups, we will begin doing incremental backups; the cycle time will depend on how long the backup runs take.

2. Not all files will be backed up

Because we lack the space to back up everything, it is up to you to decide what you want backed up if you have more data than the quota limit allows (this applies to very few of you, but the exceptions are notable).

We will also NOT back up certain files and sets of files, described in more detail below.

We will read specific include/exclude files in your top-level dirs on each of the filesystems, i.e.:

  • $HOME/backup.include

  • $HOME/backup.exclude

  • /pub/$USER/backup.include

  • /pub/$USER/backup.exclude

  • /som/$USER/backup.include

  • /som/$USER/backup.exclude

  • etc

    to decide what to back up and what to ignore. If you don't indicate your preferences, we will simply back up your data arbitrarily up to the quota and stop there.

3. How much data do you have?

Before you make decisions about what data to back up, you need to figure out how much data you have on the different filesystems. Garr regenerates this every weekend and places the results here, which allows you to see if you're over the quota for complete backup. You will have to check every filesystem to see if you're over; just search the page (usually Ctrl+F) for your UCINETID.

There are also a number of utilities on HPC that allow you to see how much space your files are taking. We recommend these:

  • kdirstat (fast, graphical, very comprehensive, only on hpc-login-1-2 & requires x2go). Described in more detail here.

  • gt5 - very slick, simple, fast terminal app, all nodes.

To use gt5 most effectively, qrsh to a non-login node, and then launch gt5 with the name of the top dir of interest, e.g. gt5 ~. Be careful: if you launch it on a huge branch of the filesystem, you'll wait a long time for it to complete and it will slow the filesystem as it works.
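
For example, a minimal session might look like this (the dir is a placeholder; point gt5 at your own):

# get an interactive shell on a compute node so the scan
# doesn't load the login nodes, then scan the tree of interest
$ qrsh
$ gt5 /pub/$USER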

4. Please help us NOW

by examining your data with the tools above and determining which dirs and files you need and which you do not. Even if you are below quota, the backups will run faster if you exclude data that you don't need backed up.

5. Files we will NOT back up

  • Zillions of Tiny Files (ZOTfiles), defined as >5000 files or dirs in a single dir (a quick way to check this is sketched after this list). This number may be reduced in the future. We encourage the tarchiving of small files into large archives, even if there is little space-saving.

  • coredump files, usually named core.#.

  • SGE error and output files

  • any file in a /scratch or /tmp dir

  • object files

  • .rpm, .deb, or .iso package files which we assume are available from an internet archive

  • files or dirs beginning with a . unless they are explicitly mentioned in your include file.

  • anything, upper or lower case, with trash, junk, or test in the name.

  • this list may be expanded if we find additional files to exclude in bulk.
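
As an unofficial aside, a quick way to check a dir against the ZOTfiles limit above is to count the entries directly inside it:

# count the files and dirs immediately inside some-dir
# ('some-dir' is a placeholder - substitute your own);
# a count over ~5000 means the dir will be skipped as ZOTfiles
$ find some-dir -mindepth 1 -maxdepth 1 | wc -l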

6. How to tarchive your data

We use tarchive to describe the process of archiving your files in the tar format, often compressed into a tar.gz; how much the compression helps depends on the compressibility of the data.

  • cd to the dir above the one you want to archive.

  • add the dir to the archive via the command:

tar -czvf the-tarchive-name.tar.gz  the-target-dir

For example, tarchiving the dir tacg-4.6.0-src:

$ ls -w 65 -CF tacg-4.6.0-src
AUTHORS             MatrixMatch.o  SeqFuncs.o      out
COPYING             NEWS           Seqs/           rebase.data
COPYRIGHT           ORF.c          SetFlags.c      seqio.c
ChangeLog           ORF.o          SetFlags.o      seqio.h
Cutting.c           Proximity.c    SlidWin.c       seqio.o
Cutting.o           Proximity.o    SlidWin.o       tacg*
Data/               README         config.guess*   tacg.c
Docs/               ReadEnzFile.c  config.log      tacg.h
GelLadSumFrgSits.c  ReadEnzFile.o  config.status*  tacg.o
GelLadSumFrgSits.o  ReadMatrix.c   config.sub*     tacg.sapp*
INSTALL             ReadMatrix.o   configure*      tacg.sspec
Makefile            ReadRegex.c    configure.in    tacg.sspec~
Makefile.am         ReadRegex.o    control*        tacgi4/
Makefile.in         RecentFuncs.c  install-sh*     test/
Makefile~           RecentFuncs.o  missing
MatrixMatch.c       SeqFuncs.c     mkinstalldirs*

# how much space does that dir take up?
$ du -sh tacg-4.6.0-src
32M     tacg-4.6.0-src

# let's tarchive it - see 'man tar' for the whole manual, but
# '-czf' means:
#   c=create the tarfile,
#   z=compress it as you create it,
#   f=use the following name for the tarchive
# (add 'v' for a verbose listing, as in the '-czvf' used above)
$ tar -czf tacg-4.6.0-src.tar.gz tacg-4.6.0-src

# how big is the tarchive?
$ ls -lh tacg-4.6.0-src.tar.gz
-rw-r--r-- 1 root root 11M May 13 14:49 tacg-4.6.0-src.tar.gz
                       ^^^

# the tarchive file is both smaller in size (11MB vs 32MB) and
# is a single file rather than the 135 files and dirs in the original
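
# (suggested, not required) before deleting the original, verify
# that the tarchive is readable; '-t' lists contents without extracting
$ tar -tzf tacg-4.6.0-src.tar.gz > /dev/null && echo OK
OK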

# now delete the original
$ rm -rf tacg-4.6.0-src

# you can always restore the original by extracting the tarchive:
$ tar -xzvf tacg-4.6.0-src.tar.gz
tacg-4.6.0-src/
tacg-4.6.0-src/Data/
tacg-4.6.0-src/Data/codon.data
tacg-4.6.0-src/Data/matrix.data
tacg-4.6.0-src/Data/rebase.dam+dcm.data
tacg-4.6.0-src/Data/rebase.dam.data
tacg-4.6.0-src/Data/rebase.data
tacg-4.6.0-src/Data/rebase.dcm.data
...

# is it still the same?
$ du -sh  tacg-4.6.0-src
32M     tacg-4.6.0-src

# yup, at least to a 1st approximation.
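
If you want a check stronger than comparing du totals, you could, before the rm -rf above, extract the tarchive into a temp dir and diff it against the original. This is a suggested extra step, not something the backup system does for you:

# hypothetical verification, done while the original still exists;
# diff -r prints nothing when the two trees are identical
$ mkdir /tmp/verify
$ tar -xzf tacg-4.6.0-src.tar.gz -C /tmp/verify
$ diff -r tacg-4.6.0-src /tmp/verify/tacg-4.6.0-src && echo identical
identical
$ rm -rf /tmp/verify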