This document describes a scheme to raise support money for development of the Broadcom Distributed Unified Cluster (BDUC) by allowing faculty to "Pay for Priority" (P4P). The primary way I propose to do this is to use the admin/config features of the Sun Grid Engine (SGE) prioritization scheduler to give researchers who pay an additional amount increased priority for their jobs. For those unfamiliar with SGE, there's a good introduction here. For the purposes of this discussion, the relevant part is Chapter 2 (page 9 onwards).

The latest information I have is that the cost of participating in P4P is $300 for a node's worth of priority. A node is defined as a 2U, 2-CPU Opteron 252 (2.6 GHz) box with 1 MB cache per CPU, at least 8 GB RAM, a local scratch disk, networked access to a /home dir on a RAID5 system (not backed up), racked, powered, with 1 Gb ethernet connectivity to the rest of the cluster, Linux OS, SGE integration, and admin support on a best-effort basis. This is a one-time cost. A rack contains 20 such nodes, for 40 Opteron cores/rack at $6000.

Note also that the nodes of the cluster will be continually refreshed as long as Broadcom continues its policy of sending us its data-center outflow, so while the nodes are 2-3 years old, they will keep being updated. Nodes you buy yourself will not defy aging in a like manner.

1. The ShareTree Policy

The most useful of SGE's prioritization schemes is called ShareTree, which sets the priority of a job based on the number of priority tickets, or weight units, allocated to a user or project (a project is like a unix group). I will call a Priority Ticket a picket, since ticket implies something that can be used up. A picket is a weighting measure that is not consumed; it is used to calculate the priority of a job when resources are limiting. If resources are not limiting (i.e. there are idle cores), all jobs will be launched immediately.
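As a minimal sketch of that weighting (my own illustration with hypothetical users and picket counts, not SGE code): when cores are contested, each user's entitlement is simply their fraction of the pickets in play; when cores are idle, pickets have no effect.

  # Toy illustration of picket weighting; pickets are never consumed,
  # they only set relative entitlement when a resource is contested.
  def contested_shares(pickets_by_user):
      """Fraction of a contested resource each user is entitled to."""
      total = sum(pickets_by_user.values())
      return {user: p / total for user, p in pickets_by_user.items()}

  print(contested_shares({"user_a": 2, "user_b": 100}))
  # {'user_a': 0.0196..., 'user_b': 0.9803...}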

Using ShareTree, we can consider BDUC users and priority as leaves of a tree, as below. Assume a total of 1000 pickets for the entire cluster and a priority balance between Public and P4P of 60:40.

.
+-- Public - (600)
|   |-- hjm (2)
|   |-- cesar (2)
|   |-- frank (2)
|   |-- hans (2)
|   |-- garr (2)
|   |-- [244 other users] (2)
|   `-- scott (2)
|
+-- Priority (400)
    |-- bob (100)
    |-- mary (30)
    |   |-- jenny (5)
    |   |-- barack (30)
    |   |-- colleen (15)
    |   `-- terry (5)
    |
    |-- burt (70)
    |   |-- anne (66)
    |   `-- boomer (44)
    |
    `-- vladimir (20)
        |-- jack (2)
        |-- shannon (20)
        |-- jack (30)
        |-- cal (5)
        `-- kerry (45)

NB: at any point in the tree, the leaf values DO NOT have to sum to the total allocated to the parent node; if they don't, they are normalized to the parent node's value.

i.e. there are 400 pickets allocated to P4P, but only 220 pickets are suballocated ( 100 (bob) + 30 (mary) + 70 (burt) + 20 (vladimir) ). Each allocation X is normalized to (X/220) * 400.

Burt buys 70 pickets, which normalize to 127.3, so his group gets 127.3/400 = 0.318 of the Priority branch. His students' values add up to 110 (66 for Anne and 44 for Boomer), but the scheduler normalizes that to 100%, giving Anne a priority value of (66/110) * (127.3/1000) = 0.0764 of the TOTAL Cluster Priority Weight. By comparison, a Public user (1 of 250) has (600/1000) / 250 = 0.0024, or about 1/30 of the TOTAL Cluster Priority Weight that Anne has.
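For the curious, here is the same arithmetic as a small Python sketch (my own illustration using the example tree above; it is not SGE code):

  # Normalize raw picket values so they sum to the parent's allocation.
  def normalize(children, parent_total):
      raw_sum = sum(children.values())
      return {name: (pickets / raw_sum) * parent_total
              for name, pickets in children.items()}

  TOTAL = 1000                                    # pickets for the whole cluster
  branches = normalize({"Public": 600, "Priority": 400}, TOTAL)

  # The Priority branch: 220 raw pickets are scaled back up to 400.
  p4p = normalize({"bob": 100, "mary": 30, "burt": 70, "vladimir": 20},
                  branches["Priority"])
  print(round(p4p["burt"], 1))                    # 127.3

  # Burt's students: 110 raw pickets scaled to Burt's 127.3.
  burt_group = normalize({"anne": 66, "boomer": 44}, p4p["burt"])
  print(round(burt_group["anne"] / TOTAL, 4))     # 0.0764

  # One of 250 equal Public users:
  print(round(branches["Public"] / TOTAL / 250, 4))   # 0.0024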

The ShareTree policy also provides temporal accounting: a person or group can exceed their allocation or priority for a period, and then pay back the resource pool by letting others use their priority. The scheduler does this automatically over the configured time window.
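As a rough illustration of that payback behaviour (a deliberately simplified model of my own, not SGE's actual bookkeeping; the half-life value is an assumption), past usage can be thought of as decaying over time, and a user's effective weight as their target share discounted by recent usage:

  HALF_LIFE_HOURS = 168.0        # assumed one-week accounting window

  # usage_events: list of (timestamp_in_hours, cpu_hours) pairs
  def decayed_usage(usage_events, now):
      return sum(cpu * 0.5 ** ((now - t) / HALF_LIFE_HOURS)
                 for t, cpu in usage_events)

  # Users who recently over-consumed are weighted down until the debt decays away.
  def effective_weight(target_share, my_usage, everyone_usage, now):
      used = decayed_usage(my_usage, now) / max(decayed_usage(everyone_usage, now), 1e-9)
      return max(target_share - used, 0.0)

  # A user entitled to 10% who just consumed 30% of recent usage is weighted
  # to zero until that usage decays.
  print(effective_weight(0.10, [(0.0, 30.0)], [(0.0, 100.0)], now=1.0))   # 0.0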

See the ShareTree explanation on page 10 in this SGE document.

2. Some other issues, clarifications

2.1. Hardware binding of jobs

We have determined that the ShareTree approach works and applies cluster-wide. However, it does not operate on a per-Q or per-hardware basis. With a few exceptions (noted below), this means that P4P customers will have their priority set up cluster-wide and will usually not be able to run their jobs specifically on the hardware they buy. Instead, they buy hardware that increases the capability of the cluster by X and get X amount of priority to run anywhere on the cluster, but their jobs will typically NOT run on their own hardware.

We can jigger this somewhat by setting up Qs that ARE locked to hardware (say, one rack) and allowing only one group priority access to such a Q. This would let the sponsoring group run on their Q on their hardware, with overall priority on the cluster inclusive of this resource. So if one rack equaled 10% of the cluster resource, they would get 10% of the cluster pickets, but only they would be able to run on this hardware. This has some DISadvantages for the P4P client: he could only do unrestricted runs on HIS hardware. He could submit jobs to public Qs, but since those Qs are time-limited, he would be subject to the same runtime limits as everyone else. In the absence of specialized hardware (GPGPUs/FPGAs, very large RAM), it would be better for the P4P client to use long-run Qs that are NOT locked to hardware, allowing him to use his share of the cluster across ANY hardware.

If the P4P client insisted on using his own hardware, a public Q with a low maximum runtime would be set up pointing to the same hardware, to harvest the unused cycles. The result would be that the P4P group would be able to run their jobs (with long runtimes if needed) on their hardware, with high probability of starting. Public users could also start their jobs on this hardware (as long as they required a fairly short runtime), but if the P4P group submitted a job to this Q, it would get priority.

If the whole thing ran under ShareTree, any user could be allocated more than his share for a time, with the understanding that he would pay it back over the course of the time window.

2.2. More SGE FAQ-like issues

A job submitted to a Q that cannot run immediately due to full utilization will increase in urgency the longer it waits in the Q until it can run.

A parallel job that requires exclusive use of a certain number of CPUs will typically wait longer to run, since the resources it needs take longer to become available; but as above, its urgency will increase until other jobs are prevented from starting, holding resources free so the parallel job can run.
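As a sketch of that behaviour (a simplified model, not SGE's exact priority formula; the weight value is an assumption, roughly analogous to a sched_conf tuning knob):

  WEIGHT_WAITING_TIME = 0.01     # assumed tuning knob, not a real BDUC setting

  # The longer a job sits pending, the higher its urgency climbs.
  def urgency(base_urgency, wait_seconds):
      return base_urgency + WEIGHT_WAITING_TIME * wait_seconds

  # A job that has waited an hour outranks an otherwise identical job submitted now:
  print(urgency(1000, 3600) > urgency(1000, 0))   # True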

In unusual circumstances, we can use the Policy configurator to apply special overrides to these policies, and we can add users to the Deadline Users group to accelerate their priority if someone desperately needs to run some jobs.

Allocation of Priority follows the total nodes (or cores, or whatever measure we want to use). If the total cluster has 10 racks, we will allocate a set number of pickets to the cluster, say 1M (100K/rack). Following Broadcom's request that the cluster be used in general support of research, 60% of its compute priority goes towards public Qs, that is, Qs for people who do not have to pay for them directly.

That leaves 40% of the cluster priority to be sold as P4P. People can buy in at any level; they're buying priority, not nodes, but the priority can be thought of as roughly proportional to physical nodes and, in fact, we will probably charge them in terms of nodes. For their money they are allocated pickets so that, in rough equivalence, they receive computational power proportional to the number of nodes they buy, modulo the priority allocated to the Public.

So, if a PI buys a rack of Priority for the cluster, bringing the cluster up to 10 racks (with a total cluster picket value of 1M), he will receive 40K pickets of priority, calculated in this way: one rack corresponds to 1M / 10 = 100K pickets, and 40% of that compute priority is sold as P4P, so 100K * 0.40 = 40K pickets.
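Or, as a tiny Python sketch of the same arithmetic (my own illustration; the function name is made up):

  def pickets_for_purchase(racks_bought, total_racks, total_pickets, p4p_fraction):
      # pickets follow the fraction of the cluster bought, scaled by the P4P share
      pickets_per_rack = total_pickets / total_racks
      return racks_bought * pickets_per_rack * p4p_fraction

  print(pickets_for_purchase(1, 10, 1_000_000, 0.40))   # 40000.0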

This does NOT mean that he CANNOT use more CPU than his 40% of a rack; it means that if resources are restricted, his priority gets weighted at this level. At least in the beginning, I suspect that most Public users will not be running SGE jobs, so most of the Public priority will go unused and each P4P participant will get much more than his paid-for share: if Public users are not asking for resources, the P4P users get 100% of the available resources, sub-allocated according to their relative picket weights.

3. Priority, not Exclusivity

A researcher who engages in P4P will be allocated priority, not exclusivity. That means that if his job cannot start immediately due to resource exhaustion, it has a better probability of executing sooner than a job from a user with fewer pickets. It does not mean that his job is guaranteed to start immediately. His jobs can be assigned higher /urgency/ via manual intervention, or we can generally assign more weight to urgency so that jobs holding in the Q go up in priority more rapidly.

This is different from the current scheme with MPC, the other major campus cluster. In that scheme, the client PI buys a number of nodes and gets exclusive use of 3/4 of them, with 1/4 going to support Public research computing. From long-term sampling, this has led to an average of ~25% wasted utilization (cores not assigned jobs).

I propose the Priority scheme to recover that wasted ~25% as well as to provide more flexible utilization for those who need it more urgently.

The underlying scheduler algorithm and resource contribution can be adjusted (man sched_conf) but hopefully we won't have to do that soon.

4. Consumables

SGE can also moderate the use of consumables such as software licenses, holding jobs that require the use of MATLAB or other licensed applications and libraries so that the jobs queue properly and do not simply die because the license is unavailable. We will be adding licenses for MATLAB, SAS, and JMP (but not JMP/Genomics) very soon. Please suggest others.
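For example (a sketch of my own; the consumable name "matlab" and the job script name are assumptions, since the actual complex name depends on how we define it on BDUC), a job that needs a MATLAB license would request it at submission so SGE holds the job until a license is free:

  import subprocess

  # Request one unit of an assumed "matlab" license consumable at submission
  # time; SGE then queues the job until a license is available rather than
  # letting it start and die on a failed license checkout.
  subprocess.run(["qsub", "-l", "matlab=1", "my_matlab_job.sh"], check=True)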

5. Latest Version

The latest version of this document should always be here.

Thanks to Thomas Reuter aka Reuti, Chris Dagdigian aka craffi, and others for Docs, HOWTOs, and tireless answering of SGE-related questions on the gridengine list.