by Harry Mangalam <harry.mangalam@uci.edu> v1.00 Nov 2, 2017 :icons:

1. Short(er) Version

When you submit a qsub job, you're requesting that your job be placed into the 1st available slot in the Grid Engine (GE) scheduler queue. Those slots are distributed in a pseudo-exponential curve: there are many more slots for small serial jobs than for huge multicore, 100GB-RAM jobs, so unless you want your jobs to languish at the bottom of the job Q, size them correctly. Here are some parameters that you should consider:

  • Unless you know that your application will run in parallel, don't request multiple cores. Requesting more cores for a serial (single-CPU) job will simply increase the amount of time it takes to find an execution slot on the cluster. How do you find out if the application is parallel or not? There's this new thing called Google…. Or ask us. We will start adding this information to the appropriate module files, so that when you load one, it will note whether the application can run in parallel.
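
One crude check you can do yourself on a running job: count its threads. This sketch assumes Linux procps `ps` (the `nlwp` column is "number of light-weight processes", i.e. threads); a serial program will show a single thread, and it uses `sleep` as a stand-in for your program:

```shell
# count the threads of a running process; a serial job shows 1
sleep 5 &                 # stand-in for your running program
pid=$!
nthreads=$(ps -o nlwp= -p "$pid" | tr -d ' ')
kill "$pid"
wait "$pid" 2>/dev/null || true
echo "threads: $nthreads"
```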

  • (in other words) don't request resources that you can't use. If you request 32 cores and your program is a serial job, you've just prevented 31 other jobs from executing on the node you finally get. There may be an infinitesimal increase in speed for YOUR program due to reduced memory or IO contention, but your program will take MUCH longer to find an open slot.

  • alternatively, DO request required and appropriately sized resources. If you need a GPU to run your Tensorflow jobs, you have to request a queue (Q) that maps to nodes with GPUs. If you know that your jobs need 700GB RAM, please feel free to request that resource. But do not request an entire 64-core/512GB node for a tiny single-core 10GB job. We see a surprising number of these jobs running.
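
As a sketch, a right-sized GE job script might look like the one below. The queue, parallel-environment, and complex names (`pub64`, `openmp`, `mem_free`) are assumptions for illustration; check `qconf -sql` and `qconf -spl` on your cluster for the real names:

```shell
#!/bin/bash
# Hypothetical right-sized job script.  Queue, PE, and complex names below
# are assumptions; substitute your cluster's actual names.
#$ -q pub64              # a public queue (assumed name)
#$ -pe openmp 8          # ONLY if the program really uses 8 threads
#$ -l mem_free=10G       # request what profiling showed, not a whole node
#$ -cwd

# stand-in for the real command so this sketch runs anywhere
request="8 cores, 10G RAM"
echo "requested: $request"
```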

  • figure out how much RAM your program will consume over the entire run and request that amount. This is often difficult, since RAM use will often vary with the size of your input files, and programs differ in how many bytes they retain in in-memory data structures. The only way to figure this out is to profile your program with increasing amounts of data, so that at least you can estimate what the RAM use will be. You can prefix your program with /usr/bin/time -v to get a crude profile of the runtime and maximum memory usage (check the Maximum resident set size (kbytes) field).

# if you don't use the '-v' flag, it will produce a short version of the output,
# which is really all you need.  Check the user% (86.60 - the app got 86.6% of a core),
# elapsed runtime (1:30.11) and the maxresident value (2092260), which should be enough
# to size the job.

$ /usr/bin/time tacg -L -n4 < chr1.fa > hugejunk
86.60user 2.19system 1:30.11elapsed 98%CPU (0avgtext+0avgdata 2092260maxresident)k
0inputs+2860832outputs (0major+271627minor)pagefaults 0swaps

# if you use the '-v' flag, it will format and label the fields better and add a
# few more fields for better scope.

 $ /usr/bin/time -v tacg -L -n4 < chr1.fa > hugejunk
        Command being timed: "tacg -L -n4"
        User time (seconds): 86.10
        System time (seconds): 1.95
        Percent of CPU this job got: 98%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 1:29.19
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 2092700  <- the MAX RAM used during the run
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 272916
        Voluntary context switches: 3064
        Involuntary context switches: 7246
        Swaps: 0
        File system inputs: 0
        File system outputs: 2860832  <- lot of output (blocks of 512 bytes)
        Socket messages sent: 0       <- no network traffic sent
        Socket messages received: 0   <- or received
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0
  • or you can watch your program's progress with the top utility:

jobs_at compute-X-X   # launches 'top' with only your jobs shown to reduce clutter
  • log how large your input files are, so you can try to correlate those sizes with how much RAM your program will take.
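
A minimal sketch of such a log, assuming GNU time's `-f '%M'` format (peak resident set size in kbytes) and made-up names (`chr1.fa`, `run_log.csv`, with `sort` standing in for your program):

```shell
# log input size and peak RAM side by side so they can be correlated later;
# the file names and CSV layout are made up for illustration
input=chr1.fa
printf '>chr1\nACGTACGT\n' > "$input"          # tiny stand-in input file
size_bytes=$(wc -c < "$input" | tr -d ' ')
# GNU time's %M prints the maximum resident set size in kbytes (on stderr)
max_rss=$( { /usr/bin/time -f '%M' sort "$input" > /dev/null; } 2>&1 || true )
echo "$input,$size_bytes,$max_rss" >> run_log.csv
cat run_log.csv
```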

If you want to get your jobs executing on cores ASAP, you have to size them correctly. That generally means that you have to request the minimum resources your jobs need, so that they will fit into the most open slots in the scheduler queue.

  • estimate how long your jobs will run. If your jobs will run for less than 6 hrs, there's no need for (nor benefit from) running them under checkpointing.
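
One hedged way to estimate: time a representative test run, then request roughly double that as the hard runtime limit. `h_rt` is the standard GE runtime complex, but the 2x padding factor and `myjob.sh` are illustrative assumptions:

```shell
# time a short representative run, then pad the request for safety
start=$(date +%s)
sleep 1                          # stand-in for a test run of your program
elapsed=$(( $(date +%s) - start ))
request=$(( elapsed * 2 ))       # ~2x headroom, in seconds
echo "would submit with: qsub -l h_rt=$request myjob.sh"
```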

2. Array Jobs

  • if you have a set of Embarrassingly Parallel (EP) jobs, run them as an array job - it's much more convenient for you and less work for the scheduler. Your input and/or output files need to be arranged so that they can be indexed by a series of integers.
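
A minimal array-job sketch, assuming integer-indexed input files (`input_1.dat`, `input_2.dat`, …) and a made-up queue name; GE sets `$SGE_TASK_ID` for each task:

```shell
#!/bin/bash
# Hypothetical array-job script: one task per integer-indexed input file.
#$ -q pub64          # assumed queue name
#$ -t 1-100          # 100 tasks; GE sets SGE_TASK_ID to 1..100
#$ -cwd

# outside Grid Engine SGE_TASK_ID is unset; default to 1 so this runs standalone
task=${SGE_TASK_ID:-1}
input="input_${task}.dat"
output="output_${task}.dat"
echo "task $task: $input -> $output"
```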

3. Longer Version

3.1. Queues/Qs

3.1.1. Private Qs

3.1.2. Public Qs

3.2. Scheduling Groups

3.2.1. Overlapping Groups

3.3. Program Resources

3.3.1. CPU

3.3.2. RAM

3.3.3. IO

Types of IO
Local Disk
Network Disk
Network

3.4. How to tell what is going on?

3.5. What resources are your program consuming?

3.5.1. CPU

top family
  • top - original tool

  • htop - adds support to multicore/cpu

  • iotop - input/output monitoring

  • iftop - network monitoring

  • atop - merges previous elements into a single overview

  • gtop - fancy visuals of system stats (not worth it)

  • slabtop - displays a listing of the top caches (not sure if this is useful for users)

vmstat
mpstat
pidstat
iostat
free -g
sar
tiptop (measures IPCs per application)
ps
/usr/bin/time (esp. with -v)
perf
oprofile (worth mentioning if perf is so much better?)
strace (e.g. strace -c -p <PID>)

3.5.2. RAM

top
free

3.5.3. Networking

iostat
ifstat
iftop

3.5.4. Disk

iostat