by Harry Mangalam <harry.mangalam@uci.edu> v1.00 Nov 2, 2017 :icons:

1. Short(er) Version

When you submit a qsub job, you're requesting that your job be placed into the 1st available slot in the Grid Engine (GE) scheduler queue. Those slots are distributed in a pseudo-exponential curve: there are many more slots for small serial jobs than for huge multicore, 100GB-RAM jobs, so unless you want your jobs to languish at the bottom of the job Q, size them correctly. Here are some parameters that you should consider:

  • Unless you know that your application will run in parallel, don't request multiple cores. Requesting more cores for a serial (single-CPU) job will simply increase the amount of time it takes to find an execution slot on the cluster. How do you find out if the application is parallel or not? There's this new thing called Google…. Or ask us. We will start adding this information to the appropriate module files, so that when you load one, it will note whether the application can run in parallel.
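
One crude check you can do yourself on a running job: count its threads. This sketch assumes Linux procps `ps` (the `nlwp` column is "number of light-weight processes", i.e. threads); a serial program will show a single thread, and it uses `sleep` as a stand-in for your program:

```shell
# count the threads of a running process; a serial job shows 1
sleep 5 &                 # stand-in for your running program
pid=$!
nthreads=$(ps -o nlwp= -p "$pid" | tr -d ' ')
kill "$pid"
wait "$pid" 2>/dev/null || true
echo "threads: $nthreads"
```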

  • (in other words) don't request resources that you can't use. If you request 32 cores and your program is a serial job, you've just prevented 31 other jobs from executing on the node you finally get. There may be an infinitesimal increase in speed for YOUR program due to reduced memory or IO contention, but your program will take MUCH longer to find an open slot.

  • alternatively, DO request required and appropriately sized resources. If you need a GPU to run your Tensorflow jobs, you have to request a queue (Q) that maps to nodes with GPUs. If you know that your jobs need 700GB RAM, please feel free to request that resource. But do not request an entire 64-core/512GB node for a tiny single-core 10GB job. We see a surprising number of these jobs running.
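
As a sketch, a right-sized GE job script might look like the one below. The queue, parallel-environment, and complex names (`pub64`, `openmp`, `mem_free`) are assumptions for illustration; check `qconf -sql` and `qconf -spl` on your cluster for the real names:

```shell
#!/bin/bash
# Hypothetical right-sized job script.  Queue, PE, and complex names below
# are assumptions; substitute your cluster's actual names.
#$ -q pub64              # a public queue (assumed name)
#$ -pe openmp 8          # ONLY if the program really uses 8 threads
#$ -l mem_free=10G       # request what profiling showed, not a whole node
#$ -cwd

# stand-in for the real command so this sketch runs anywhere
request="8 cores, 10G RAM"
echo "requested: $request"
```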

  • figure out how much RAM your program will consume over the entire run and request that amount. This is often difficult, since RAM use will often vary with the size of your input files, and programs differ in how many bytes they retain in in-memory data structures. The only way to figure this out is to profile your program with increasing amounts of data, so that at least you can estimate what the RAM use will be. You can prefix your program with /usr/bin/time -v to get a crude profile of the runtime and maximum memory usage (check the Maximum resident set size (kbytes) field).

# if you don't use the '-v' flag, it will produce a short version of the output,
# which is really all you need.  Check the user% (86.60 - the app got 86.6% of a core),
# elapsed runtime (1:30.11) and the maxresident value (2092260), which should be enough
# to size the job.

$ /usr/bin/time tacg -L -n4 < chr1.fa > hugejunk
86.60user 2.19system 1:30.11elapsed 98%CPU (0avgtext+0avgdata 2092260maxresident)k
0inputs+2860832outputs (0major+271627minor)pagefaults 0swaps

# if you use the '-v' flag, it will format and label the fields better and add a
# few more fields for better scope.

 $ /usr/bin/time -v tacg -L -n4 < chr1.fa > hugejunk
        Command being timed: "tacg -L -n4"
        User time (seconds): 86.10
        System time (seconds): 1.95
        Percent of CPU this job got: 98%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 1:29.19
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 2092700  <- the MAX RAM used during the run
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 272916
        Voluntary context switches: 3064
        Involuntary context switches: 7246
        Swaps: 0
        File system inputs: 0
        File system outputs: 2860832  <- lot of output (blocks of 512 bytes)
        Socket messages sent: 0       <- no network traffic sent
        Socket messages received: 0   <- or received
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0
  • or you can watch your program's progress with the top utility:

jobs_at compute-X-X   # launches 'top' with only your jobs shown to reduce clutter
  • log how large your input files are, so you can try to correlate those sizes with how much RAM your program will take.
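
A minimal sketch of such a log, assuming GNU time's `-f '%M'` format (peak resident set size in kbytes) and made-up names (`chr1.fa`, `run_log.csv`, with `sort` standing in for your program):

```shell
# log input size and peak RAM side by side so they can be correlated later;
# the file names and CSV layout are made up for illustration
input=chr1.fa
printf '>chr1\nACGTACGT\n' > "$input"          # tiny stand-in input file
size_bytes=$(wc -c < "$input" | tr -d ' ')
# GNU time's %M prints the maximum resident set size in kbytes (on stderr)
max_rss=$( { /usr/bin/time -f '%M' sort "$input" > /dev/null; } 2>&1 || true )
echo "$input,$size_bytes,$max_rss" >> run_log.csv
cat run_log.csv
```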

If you want to get your jobs executing on cores ASAP, you have to size them correctly. That generally means that you have to request the minimum resources your jobs need, so that they will fit into the most open slots in the scheduler queue.

  • estimate how long your jobs will run. If your jobs will run for less than 6 hrs, there's no need for (nor benefit from) running them under checkpointing.
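
One hedged way to estimate: time a representative test run, then request roughly double that as the hard runtime limit. `h_rt` is the standard GE runtime complex, but the 2x padding factor and `myjob.sh` are illustrative assumptions:

```shell
# time a short representative run, then pad the request for safety
start=$(date +%s)
sleep 1                          # stand-in for a test run of your program
elapsed=$(( $(date +%s) - start ))
request=$(( elapsed * 2 ))       # ~2x headroom, in seconds
echo "would submit with: qsub -l h_rt=$request myjob.sh"
```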

2. Array Jobs

  • if you have a set of Embarrassingly Parallel (EP) jobs, run them as an array job - it's much more convenient for you and less work for the scheduler. Your input and/or output files need to be arranged so that they can be indexed by a series of integers.
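
A minimal array-job sketch, assuming integer-indexed input files (`input_1.dat`, `input_2.dat`, …) and a made-up queue name; GE sets `$SGE_TASK_ID` for each task:

```shell
#!/bin/bash
# Hypothetical array-job script: one task per integer-indexed input file.
#$ -q pub64          # assumed queue name
#$ -t 1-100          # 100 tasks; GE sets SGE_TASK_ID to 1..100
#$ -cwd

# outside Grid Engine SGE_TASK_ID is unset; default to 1 so this runs standalone
task=${SGE_TASK_ID:-1}
input="input_${task}.dat"
output="output_${task}.dat"
echo "task $task: $input -> $output"
```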

3. Longer Version

3.1. Queues/Qs

3.1.1. Private Qs

3.1.2. Public Qs

3.2. Scheduling Groups

3.2.1. Overlapping Groups

3.3. Program Resources

3.3.1. CPU

3.3.2. RAM

3.3.3. IO

Types of IO
Local Disk
Network Disk
Network

3.4. How to tell what is going on?

3.5. What resources are your program consuming?

3.5.1. CPU

top family
  • top - original tool

  • htop - adds support to multicore/cpu

  • iotop - input/output monitoring

  • iftop - network monitoring

  • atop - merges previous elements into a single overview

  • gtop - fancy visuals of system stats (not worth it)

  • slabtop - displays a listing of the top caches (not sure if this is useful for users)

vmstat
mpstat
pidstat
iostat
free -g
sar
tiptop (measures IPCs per application)
ps
/usr/bin/time (esp. with -v)
perf
oprofile (worth mentioning if perf is so much better?)
strace (e.g. strace -c -p <PID>)

3.5.2. RAM

top
free

3.5.3. Networking

iostat
ifstat
iftop

3.5.4. Disk

iostat