1. World-wide

Beyond UC, there are few technical challenges that prevent UCI from interacting with the University of Geneva as easily as with NYU. To move UCI toward the top of the list, we should do the following:

1.1. UCGrid participation

UCI should continue to support Grid technologies. Many on-campus ORUs already use this infrastructure through other support structures (BIC, Beckman Laser, PS, CalIT2, the CAMERA project, etc.). Doing so involves some physical infrastructure, but mostly it means layering software and additional expertise on top of existing physical infrastructure.

Note: A Caution

It is not clear that the resources and energy going into the Grid will be repaid in actual usage. Use of local clusters like the MPC is well understood, and those resources are well used on campus. The problem with extending this to the Grid is the extra overhead in terms of connection, authentication/authorization, and other complexity. That overhead is worthwhile for very heavy users of CPU cycles who cannot obtain cycles via direct allocations from Supercomputer Centers (SCs). The advantage of the Grid is that it is supposed to provide a large surplus of cycles that can be shared or harvested by others. This model fails if there is no large surplus of cycles to be harvested, or if there are not enough nodes in the same compute domain to allow parallel jobs to execute. If there is no such surplus, the Grid is a highly complex method of inducing frustration and, dare I say it, gridlock.

In spite of this caution, there are additional reasons to contribute to the Grid infrastructure. Both the Grid and other UC resources need a global authentication/authorization system, and the UCGrid is implementing the GridShib project, which would provide one. Other parts of the Grid stack are useful in their own right, such as GridFTP for fast file transfer.
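As a sketch of what GridFTP offers in practice, the example below drives a transfer through the Globus globus-url-copy client from Python; the host, paths, and the choice of four parallel streams are hypothetical, and the flags are listed as I recall them from the Globus Toolkit documentation.

  # Hypothetical example: a GridFTP transfer via the globus-url-copy CLI.
  # The hostname and paths below are placeholders.
  import subprocess

  src = "file:///scratch/bigdata/run42.tar"                 # local file (placeholder)
  dst = "gsiftp://gridftp.example.uci.edu/data/run42.tar"   # remote GridFTP URL (placeholder)

  cmd = [
      "globus-url-copy",
      "-vb",        # report bytes moved and transfer rate
      "-p", "4",    # parallel TCP streams, the main source of GridFTP's speed
      src,
      dst,
  ]

  # A valid grid proxy (e.g. from grid-proxy-init) must exist before this runs.
  subprocess.check_call(cmd)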

On the whole, it's probably worthwhile to continue to support the UCGrid, but I'm not sure that it will fulfill the promises that have been made in its name.

2. Among UC Campuses

3. Within-UCI

3.1. Addressing Ongoing Research Computing problems

We want to address problems in Research Computing the way the NACS Response Center does: fix the problem as soon as possible, either directly or by cooperating with the nearest support person. After the problem is fixed or addressed, we can figure out whether it is of general interest. If it is, we can evaluate whether it's worthwhile to create a generalized solution that can be scaled. If not, drop it. But first, fix the problem.

3.1.1. The General Cost/Benefit approach

Analyze solutions in terms of long-term risk, cost, and benefit. If a project is worth undertaking beyond a one-off fix of a single researcher's problem, it's worth considering the longer-term consequences of addressing the problem by breaking the cost estimate into its component parts.

We would also need to evaluate the risk profile: What risk does each solution imply?

3.2. Green Computing

3.2.1. High Performance Computing

How do we decrease the energy cost of computing? For some forms of HPC we may be able to leverage the very low cost and very high speed of alternative computing approaches, such as GPGPUs, FPGAs, or BOINC (to create a Virtual Campus Supercomputing Center or additions to UCGrid). We will then have to promote these approaches to the people who need them.

The use of GPGPUs in scientific computing can have dramatic effects on both power consumption and HPC performance. A single $500 graphics card contains 240 GPU cores, and a number of groups have shown accelerations on the order of 100X on codes of interest to UCI researchers, including molecular dynamics, astrophysics, and sequence analysis.

3.2.2. Thin clients

We should support thin clients for those who want them. Thin client computing reduces support and hardware costs and, over time, energy costs, since thin clients use much less energy than PCs.

Thin client models:
  • compute on the server; export the display only. This allows Linux and Windows apps to share a desktop. [thin client]

  • load OS, apps, and data from the server; compute on the client. [meso client]

  • load OS from local disk; load apps and data from the server. [fat client]

  • load OS, apps, and data from local disk; no server. [current obese client]

It's worth noting that Red Hat has returned to desktop computing with the Kernel Virtual Machine (KVM) virtualization technology and a thin client product based on SolidICE/SPICE.

3.3. BDUC / BEAR

For general-purpose computing, using pooled resources on BDUC as a compute cluster and on the proposed BEAR (Broadcom EA Replacement) for access to interactive applications is one of the best approaches we could exploit. This would give users access to rapidly deployable applications (via BEAR) and computational power (via BDUC), as well as decreasing the cost of using commercially available software such as MATLAB, Mathematica, SAS, and SPSS.

3.4. Research Programming & Expertise Matching

We would use the EEE model, in which a manager interviews a researcher together with the presumptive programmer, then allocates and supervises programming talent for the job over a set duration. The details of this program have yet to be fleshed out.

There are bushels of low-hanging fruit to be harvested simply by matching researchers who need sophisticated advice with those who can supply it. The Catalyst system that Harvard wrote is a great way of doing this, though it will require a fair amount of human resources to implement.

3.5. Open Source Advocacy and Expertise

3.5.1. Advocacy

The use of Open Source Software (OSS) is often promoted because of its initial cost: free. While free licensing makes it possible to evaluate the full package without time-consuming negotiations with commercial vendors, initial cost is not a large part of the total cost of implementing a software infrastructure. By far the larger costs associated with implementation are:

In most of the above points, OSS has some advantage, if the OSS package is chosen with some care. Rather than reiterating the arguments that others have convincingly made, I refer the reader to David Wheeler's OSS site.

Arguments for the superiority of commercial support for proprietary packages are breaking down relatively fast due to the distributed, asynchronous nature of various web technologies, and especially to improved search engines. Most support issues can be addressed by typing a short description of the problem into Google.

One additional case can be made for using OSS: it reduces the personnel costs associated with negotiating licenses, selling the software, restricting access to it, tracking licenses and compliance, and enforcing upgrades. With OSS there are no such costs, allowing more personnel to be assigned to supporting the software if necessary. We will not be able to do away with proprietary software completely, but reducing the number of proprietary packages can significantly reduce the number of people required to support them.

3.5.2. Expertise

While there is a huge amount of OSS expertise at UC and UCI, access to that expertise is largely hidden or available only via personal contact. As a result, people bypass good Open Source alternatives to commercial products because they either don't realize such projects exist or are unaware of how OSS works.

UCI (and NACS in particular) should have an OSS hotline that would act as a clearinghouse for such information. I would be happy to lead that endeavor.

3.6. Storage

This is a tough call because there is a very wide spectrum of expectations about storage.

At one end of the spectrum is highly volatile temporary storage (aka scratch). This is both relatively cheap and easy to provide: $20K would probably supply enough research scratch storage for the entire campus for a year or more ($10K for the chassis and $10K for disks: 48 x 1.5TB disks at $200 each = $9,600, which is 72TB raw, or, configured as 4 x 12-disk RAID6 sets, 4 x 10 x 1.5TB = 60TB formatted). That works out to about $333/TB (about 2x the price of raw storage), but includes few management tools beyond Logical Volume Management via LVM2. Such a system can be deployed as a multifunction fileserver by itself, as an iSCSI target (see Openfiler for an example of a plug & play fileserver of this type), or as a component of a higher-order system.
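A quick back-of-the-envelope check of that arithmetic, as a sketch using only the figures quoted above:

  # Sanity check of the scratch-storage estimate above.
  chassis_cost = 10000            # dollars
  disk_cost = 48 * 200            # 48 disks at $200 each = $9,600
  raw_tb = 48 * 1.5               # 72 TB raw
  # 4 x 12-disk RAID6 sets: each set gives up 2 disks to parity.
  formatted_tb = 4 * (12 - 2) * 1.5   # 60 TB formatted

  total_cost = chassis_cost + disk_cost
  print(round(total_cost / formatted_tb))  # ~327 $/formatted TB (~$333/TB with the rounded $20K figure)
  print(round(total_cost / raw_tb))        # ~272 $/raw TB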

At the other end is the full Data Center (DC) model with high-end storage from BlueArc, Isilon, NetApp, or EMC, which will cost about $3,000/TB (cost estimate from UCLA). The DC model addresses the full spectrum of data replication, expansion, migration, and hardware upgrade or transition, and could possibly include disaster recovery.

There are some interesting metastorage products appearing that we should be following. One is Caringo CAStor (Content Addressable Storage), a software-only product that claims effectively unlimited storage scaling across generic x86 storage devices; I will be testing this approach as soon as the new Broadcom nodes arrive.
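To show the general idea behind content-addressable storage (this is a conceptual sketch, not Caringo's actual API, which I have not yet tested): objects are stored and retrieved by a cryptographic hash of their content, which makes identical data self-deduplicating and location-independent.

  # Minimal sketch of content-addressable storage: the key IS the content hash.
  import hashlib

  store = {}   # stand-in for a pool of generic x86 storage nodes

  def put(data):
      """Store a blob and return its content address."""
      key = hashlib.sha256(data).hexdigest()
      store[key] = data    # identical content hashes to the same key, so duplicates collapse
      return key

  def get(key):
      """Retrieve a blob by its content address."""
      return store[key]

  addr = put(b"some instrument output")
  assert get(addr) == b"some instrument output"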

TB devices - instruments that can produce terabytes of data in a day - will start to appear on campus within a year, at least in the form of NextGen DNA sequencing machines and possibly some confocal microscopes and other imaging systems. These will require 10Gb networks connecting them to the support infrastructure, plus (relatively cheap) short-term but high-performance disk space.
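A rough sketch of why 10Gb networking matters for these instruments, assuming wire-speed transfers and ignoring protocol overhead:

  # Time to move 1 TB of instrument output at different link speeds,
  # assuming wire speed and no protocol overhead.
  tb_bits = 1e12 * 8    # 1 TB expressed in bits

  for name, gbps in [("1 GbE", 1), ("10 GbE", 10)]:
      hours = tb_bits / (gbps * 1e9) / 3600
      print(name, round(hours, 2), "hours per TB")

  # 1 GbE : ~2.2 hours per TB, so a multi-TB daily run saturates the link
  # 10 GbE: ~0.2 hours per TB, so the same run moves in well under an hour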

The Broadcom donation of NetApp devices adds another variable to the equation, though probably a welcome one. If the heads arrive with licenses, we can use them as fast scratch storage devices, but their disks are quite small, so they are not very applicable to bulk storage. If the heads do not have licenses, we can still use the disk shelves as generic storage under the control of a vanilla Linux head with LVM2.

4. Latest Version

The latest version of this document is available in both HTML and AsciiDoc source form.

AsciiDoc is a free, flexible, Python-based plaintext-to-HTML/PDF/DocBook translator. The "source" text file is easily human-readable, as is the generated output.