This is a strawman version intended to provoke discussion. It is not the final version.

1. Computation

1.1. The Problem

Broadly defined, Research Computing uses computational resources ranging from tiny systems-on-chip, to smartphones and tablets, to laptops and desktops, and on up to High Performance Compute (HPC) clusters, which often include accelerators such as Graphics Processing Units (GPUs).

The Campus can support a wide swath of large-scale computation via GreenPlanet and OIT’s HPC clusters, but it cannot provide every kind of hardware support for every research domain. A central issue is how to manage the age, number, and type of CPUs to benefit the most researchers, especially if these clusters are to be used for instruction as well as primary research. Many researchers prefer local compute clusters to the national centers because of convenience, the size of their runs, or sysadmin/software support. The administration must decide how far to fund these clusters centrally, keeping them at an optimal size and configuration, versus requiring faculty support via grants.

However, hardware by itself solves few problems. Any utility derived from the hardware will have to be coupled with software that allows researchers to exploit the hardware, and the ability to do that depends on what kind of training researchers receive.

Thus….

1.2. A Substantial, though Partial, Solution: Linux …

Because the range of this hardware is so large, the best OIT can do is to support standardization on software that runs across as much of it as possible.

Increasingly, that standard is the Linux operating system, which runs on all of this hardware. Even Microsoft increasingly supports Linux, from its Azure Cloud Switch, which is based on Linux, to its new Visual Studio development environment.

1.2.1. New Research Software is written first for Linux.

Because of the low cost & high scalability of Linux-based clusters, the ease of writing software for the command-line (vs graphical applications), and the speed of Open Source Software (OSS) propagation, the overwhelming majority of research software is written first for Linux and later ported to other platforms. This means that researchers with knowledge of (and access to) Linux platforms will have a 1-3 year lead on those who require a Windows or MacOSX version of the software.

Researchers must be able to download such software, install it, possibly compile it, modify it (or the data format it expects), and combine it with other tools into a repeatable workflow. Given the importance of these skills and the relatively small expense of teaching them, it is bewildering that training in these areas is so rare.
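As a tiny illustration of what "combining tools into a repeatable workflow" means in practice, the sketch below chains standard Unix utilities (awk, sort, cut, uniq) into a script; the file names, field names, and threshold are hypothetical placeholders, not part of any real analysis:

```shell
#!/bin/sh
# A minimal, repeatable command-line workflow: generate some data,
# then filter, transform, and summarize it with standard Unix tools.
# (File and field names here are hypothetical placeholders.)
set -e                                   # stop at the first error

# A toy input file: name,count pairs.
printf 'gene_a,5\ngene_b,12\ngene_c,3\ngene_b,7\n' > raw.csv

# Keep rows whose count exceeds 4, sort them, take the name column,
# and tally how often each name survives the filter.
awk -F, '$2 > 4' raw.csv | sort | cut -d, -f1 | uniq -c > summary.txt

cat summary.txt
```

Because the whole pipeline lives in a script, rerunning the analysis on new data is a one-line command rather than a sequence of manual steps — which is precisely the habit such training would instill.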

However, for local researchers to make use of that software, they must have a platform on which to execute it; hence the emphasis on generic HPC clusters, which make these resources as widely useful as possible.

1.3. And the Research Remote Desktop

To support faculty who simply want to do their analyses, the most efficient mechanism in most cases is the Research Remote Desktop: an exportable display from a large server or cluster that can be brought up on any device, anywhere. This centralizes storage for convenience, cost, backup, and security, and reduces administration time per user. It also enables virtual instructional labs, because it breaks a class’s dependence on a physical lab of PCs. This approach can provide applications on both native Windows and Linux, but the Linux approach is more secure and, because it is cheaper, more scalable.

There are some applications that do not work well over Remote Desktop: those demanding very rapid 3D transformations (gaming, 3D visualization, image reconstruction, etc.), those requiring dedicated disk and/or network bandwidth (audio/video editing), and those using data so sensitive that no network connection can be allowed. In academia, these are fairly rare cases.

1.4. Parallel Processing

Many of today’s research problems involve computational analyses that would take weeks, months, or years to solve on a single CPU. The idea behind parallel processing is that a job that takes X hours on a single CPU can, ideally, be done in one hour on X CPUs. When X > 1000 (as on HPC clusters), the time savings can be dramatic.
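At toy scale, this idea needs nothing more than xargs, which ships with every Linux system: the sketch below runs up to 4 of 8 independent tasks at once, the same divide-and-conquer pattern an HPC scheduler applies across thousands of cores (the echo here is a stand-in for a real computation):

```shell
#!/bin/sh
# Sketch of the "1 hour on X CPUs" idea at toy scale: xargs -P runs
# up to 4 of these 8 independent tasks concurrently on a multi-core
# machine. The echo stands in for a real, long-running computation.
set -e

seq 1 8 | xargs -P 4 -I{} sh -c 'echo "task {} done"' > results.txt

wc -l < results.txt    # one line per completed task
```

Note that this only helps when the tasks are independent; the next section discusses why coordinating non-independent work across many servers is far harder.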

1.4.1. Complexity of Parallel Processing

In past decades, the only option for parallel computing was special hardware, such as a Cray supercomputer. Thanks to advances in networking, multi-core CPUs, and improved software tools, it is now practical to run programs in parallel across inexpensive commodity servers aggregated into HPC clusters.

HPC clusters are attractive because they are affordable and scalable; researchers can buy additional hardware as needed to meet their computational needs. The components are usually commodity hardware of varying types and vendors, and not being tied to a single proprietary vendor means competition helps keep costs low. However, minimum standards must be met to maintain speed across such clusters, and clusters should be upgraded in chunks large enough to preserve the homogeneity that lets parallelized jobs compute efficiently. The minimum size (and cost allocation) of these chunks must be decided among the stakeholders.

Some institutions, such as Purdue, install completely new clusters every few years to maintain this homogeneity; we maintain that this is neither necessary nor even desirable, but the chunks of a cluster must be large enough to let jobs of this granularity run efficiently. Adding to a cluster (versus creating a new one) reduces admin costs and the learning curve.

1.4.2. Technical Expertise

Coercing servers to work together effectively is a complex process requiring broad system-administration expertise. In addition to high-speed network connections and parallel file systems, special scheduling software is needed to ensure that jobs from different researchers are assigned appropriately and efficiently to combinations of CPU cores and servers. The expertise needed to develop and debug parallel applications is more specialized still.
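To make the scheduling layer concrete: many Linux clusters use a batch scheduler such as SLURM, to which researchers submit job scripts rather than running programs directly. The sketch below is a minimal such script, not a working configuration — the partition name and the program it launches are hypothetical:

```shell
#!/bin/bash
#SBATCH --job-name=demo          # name shown in the queue
#SBATCH --partition=standard     # hypothetical partition name
#SBATCH --nodes=2                # spread the job across two servers
#SBATCH --ntasks-per-node=16     # 16 tasks per node
#SBATCH --time=01:00:00          # wall-clock limit enforced by the scheduler

# The scheduler chooses which nodes to allocate; srun then launches
# one copy of the (hypothetical) program per allocated task.
srun ./my_parallel_program
```

The scheduler's job is exactly the fairness-and-efficiency problem described above: queueing requests like this one from many researchers and packing them onto the cluster's cores without conflicts.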

1.5. Recommendation

The university should encourage Linux competence in its technology staff and research ranks, and attempt to hire more people who can assist with the problems described above. The additional upside of this approach is that the hardware costs for general-purpose compute clusters are very low and the software and licensing costs are negligible, since most of the software is OSS. Thus HPC clusters for both instruction and research are one of the most cost-effective investments the university can make, especially since money spent this way is not zero-sum: idle instructional resources can be used extremely efficiently by research computing via shared queues. Faculty and the administration must agree on how best to share the chunk upgrade costs.

Training researchers at all levels in the use of Linux cluster-based applications and utilities (in the mold of the Software Carpentry courses) would not only make UCI stand out; it is also a low-cost way of increasing the expertise and desirability of its graduates.