1. Scope of Document
This document is a skeleton to make sure we (RCS) understand what we're being asked to do. Until we're clear on that and have had time to estimate the scope of that work, we're not committing to do anything. We would appreciate as much feedback and correction as possible on this document, especially information that will allow us to answer the Unknown: queries below.
If it’s more convenient to view as a plain text file, you can download the asciidoc src here.
The CoSSci service is currently running on a Centos5 instance at UCI:
This site hosts a version of the Python-based Galaxy user interface with a customized set of pages that present Social Science (SS) modeling possibilities, using 5 sets of SS data (WNAI, SCCS, EA, LRB, and XC). We understand that these data sets are non-proprietary and unrestricted, are no larger than ~10GB in aggregate, are hosted at 3rd party sites, and are read via network calls into the analytical routines.
Unknown: Where are these data sets located?
Addressed: They are somewhere offsite but discoverable via the code in the system. (DRW)
Unknown: Can we download all the data sets to make access faster?
Addressed: We can if we want (the data are in the public domain), but they appear to transfer fast enough normally via the network to leave them in place. (DRW)
These routines are mostly written in the R statistical language, using standard modules or self-written scripts, and have run times (on the current VM) ranging from 1 to 10 minutes, although a Bayesian routine currently under development may take much longer.
These routines then return either tables of data or internally generated graphics which are presented to the users as additional web pages.
Addressed: The graphics are static. (DRW)
3. Desired Outcome
While maintaining the current version and image, we (RCS) will convert it to a recent Long Term Support (LTS) Linux distribution such as CentOS7, Debian 8 (Jessie), or Ubuntu LTS (we prefer Debian versions for the depth of their repositories), with associated updated backing applications and scripts. If you have test suites or routines for validating the installation, we would appreciate the use of them.
In addition, the desired outcome is to include:
an Amazon image of the updated instance so that users could run it stand-alone on AWS. (Stu)
possibly other VM images (VMware, VirtualBox) so that others can download it and run it locally. (Stu)
with the compute jobs remotely executed on:
The output data files would have to be returned to the webserver for post-processing and presentation, and if there were a significant time delay in processing, the user would have to be notified via an emailed link to return and continue the processing.
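That notification step could be as simple as constructing a "job finished" email on the web host. A minimal sketch, assuming Python on the web side; the sender address, job ID format, and results URL are all hypothetical, not part of the current system:

```python
from email.message import EmailMessage

def build_completion_mail(user_addr, job_id, results_url):
    """Build the 'your job is done' email with a link back to the webserver."""
    msg = EmailMessage()
    msg["To"] = user_addr
    msg["From"] = "cossci-noreply@example.edu"  # placeholder sender address
    msg["Subject"] = f"CoSSci job {job_id} finished"
    msg.set_content(
        f"Your analysis job {job_id} has completed.\n"
        f"Return to {results_url} to continue post-processing.\n"
    )
    return msg

# The webserver would hand this to its local MTA (e.g. via smtplib).
mail = build_completion_mail("student@example.edu", "job-0042",
                             "https://cossci.example.edu/results/job-0042")
```

The same function could be reused for the delayed-start case below, with different wording.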
Alternatives & Interaction
The processes that we describe being created by the Galaxy back end are batch jobs that are forked off to various computational back ends, which then complete non-deterministically, depending on the load on the system executing them. It could be a minute before a job starts to execute, or it could be a day. The amount of compute-side status information available to the submitting user would be quite restricted. We could offer an estimate of the actual run time based on previous testing, but the only updates available from the compute side are emails emitted from the scheduler - typically when the job starts, when it ends, and if it is aborted or suspended. We cannot tell the user even approximately when a given job will end without quite a bit of system hacking.
Additionally, in a class situation where up to hundreds of students are launching lots of jobs, sometimes without a good idea of what they are requesting, it may very well be the case that the queue or queues assigned to the class get saturated with waiting jobs. Clearing such queues would require an administrator to log in and explicitly delete the jobs submitted in error.
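That cleanup could at least be scripted rather than done job-by-job. A minimal sketch, assuming a Slurm-like scheduler where `scancel` can target a user's pending jobs in a given partition; the user and partition names are hypothetical:

```python
import subprocess

def cancel_pending_jobs(user, partition, dry_run=True):
    """Build (and optionally run) a command dropping a user's queued jobs."""
    cmd = ["scancel", f"--user={user}", f"--partition={partition}",
           "--state=PENDING"]
    if not dry_run:
        subprocess.run(cmd, check=True)  # actually cancel on the cluster
    return cmd

# An administrator could loop this over every student account in the class.
cmd = cancel_pending_jobs("student01", "class_queue")
```

Targeting only PENDING jobs leaves already-running work untouched, which is usually what you want when clearing a saturated queue.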
We also came up with two alternatives that may not have been considered:
The options for any of the above choices can be set via a configuration file on the web host with the default being to execute the analysis on the same machine as the webserver (so the basic image will contain all of the code necessary to support a fully functioning service). The full support for the above would require complete separation of the front end web system from the analytical components, but in a Linux environment that should be possible. For the back end, we would have to provide a set of RPMs or Debs to install and an installation script to set it up correctly.
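A minimal sketch of that configuration-driven dispatch, assuming Python on the web host; the file layout, key names, and the plain-ssh invocation are all assumptions, not a description of the current code:

```python
import configparser
import shlex

# Hypothetical config shipped with the image; 'local' is the default backend.
DEFAULT_CONF = """
[compute]
# 'local' runs analyses on the same machine as the webserver;
# 'remote' ships them to the host named below.
backend = local
host = hpc.example.edu
"""

def build_command(conf, analysis_cmd):
    """Return the command the web host should run for one analysis job."""
    backend = conf.get("compute", "backend", fallback="local")
    if backend == "local":
        return shlex.split(analysis_cmd)
    host = conf.get("compute", "host")
    return ["ssh", host, analysis_cmd]

conf = configparser.ConfigParser()
conf.read_string(DEFAULT_CONF)
cmd = build_command(conf, "Rscript sccs_model.R")
```

Flipping `backend` to `remote` in the config is then the only change needed to move the analytical load off the web host.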
However, depending on the interactivity of the output (see above cautions), there might be some problems if recalculations have to be done based on user manipulation of the presented data.
With the architecture above, anyone can spin up a web interface on a local instance or via AWS, change the config to point to a new compute back-end and, after installing that back-end, use it to handle a much heavier load such as a class or parameter space sweeps.
For the Social Science-side people, please fill in as much missing information as possible.
For the RCS-side people, please ask the relevant questions that I've forgotten.
Upgrade of current system
Dump a list of all current RPMs from the CentOS5 version
Commission a new VM and provision it with either CentOS7 or Debian LTS, based on that list of RPMs
Install all packages needed to bring it up to spec with the requirements (R, Galaxy, Python, etc.)
Ensure that the updated version works comparably to the old version
Decouple the CGI analysis from localhost and package it so we can run it on other hosts
Package form options into complete, portable workflows that can be completed on other hosts
Execute that workflow via scp, ssh, and rsync directly on an alternative host
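The last step above might look like the following sketch: push inputs out with rsync, run the packaged workflow over ssh, and pull results back. The host name, directory layout, and workflow entry-point script are all hypothetical:

```python
import subprocess

HOST = "compute.example.edu"  # hypothetical back-end host

def run_remote_workflow(workdir, workflow="run_analysis.sh", dry_run=True):
    """Ship inputs out, run the packaged workflow, and pull results back."""
    steps = [
        ["rsync", "-a", f"{workdir}/", f"{HOST}:work/"],            # push inputs
        ["ssh", HOST, f"cd work && ./{workflow}"],                  # execute
        ["rsync", "-a", f"{HOST}:work/results/", f"{workdir}/results/"],  # pull
    ]
    if not dry_run:
        for cmd in steps:
            subprocess.run(cmd, check=True)
    return steps

steps = run_remote_workflow("/srv/cossci/job1")
```

With key-based ssh authentication set up between the web host and the back end, nothing here needs interactive input, so the whole sequence can be driven by the Galaxy job runner.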