UCI CyberInfrastructure & Research Computing
============================================
by Harry Mangalam
v0.4, 25 Feb 2009

// Convert this file to HTML & move it to its final destination with the command:
// asciidoc -a toc -a numbered /home/hjm/nacs/UCI_CI_Research_Computing.txt; scp /home/hjm/nacs/UCI_CI_Research_Computing.* moo:~/public_html

World-wide
----------
Beyond UC, there are few technical challenges that stop UCI from interacting
with the University of Geneva as easily as with NYU.  In order to move UCI
towards the top of the list, we should be doing the following:

[[ucgrid]]
UCGrid participation
~~~~~~~~~~~~~~~~~~~~
UCI should continue to support Grid technologies.  Many on-campus ORUs are
using this infrastructure through other support structures (BIC, Beckman
Laser, PS, CalIT2, the CAMERA project, etc.).  This involves some physical
infrastructure, but mostly software layered on top of existing physical
infrastructure, plus more expertise.

[NOTE]
.A Caution
============================================================================
It is not clear that the amount of resources and energy going into the Grid
will be paid back in actual usage.  Use of local clusters like the MPC is
well understood and these resources are well-used on campus.  The problem
with extending this to the Grid is the amount of extra overhead in terms of
connection, authentication/authorization, and other complexity.  This
overhead is worthwhile for very heavy users of CPU cycles who cannot obtain
cycles via direct allocation from Supercomputer Centers (SCs).  The
advantage of the Grid is that it is supposed to provide a large surplus of
cycles that can be shared or harvested by others.  This model fails if there
is not a large surplus of cycles to be harvested, or if there are not enough
nodes in the same compute domain to allow parallel jobs to execute.  If
there is no such surplus, the Grid is a highly complex method of inducing
frustration and, dare I say it, gridlock.
============================================================================

In spite of this caution, there are additional reasons for contributing to
the Grid infrastructure.  Both the Grid and other UC resources need a global
authentication/authorization system, and the http://www.ucgrid.org/[UCGrid]
is implementing the http://gridshib.globus.org[GridShib] project, which
would provide such a system.  Other parts of the Grid are useful for other
purposes, such as the
http://www.globus.org/grid_software/data/gridftp.php[GridFTP] system for
fast file transfer.  On the whole, it is probably worthwhile to continue to
support the UCGrid, but I am not sure that it will fulfill the promises that
have been made in its name.

Among UC Campuses
-----------------
- UCI should lead by example by developing and sharing applications that
leverage web or internet technologies to ease both research and
infrastructure bottlenecks.  Examples of such efforts are:
* an electronic software distribution solution that could be used UC-wide.
* a general-purpose software licensing scheme that allows commercial
licenses to be tracked and validated/invalidated by user time, CPU time, CPU
count, OS, etc., so that we can enforce whatever licensing terms commercial
vendors request (but see link:#oss[comments on Open Source Software below]).
* a robust, inexpensive, easy-to-administer Open Source backup system.
At least three such systems are used in enterprise deployments:
http://www.amanda.org/[Amanda] (commercialized into http://zmanda.com/[Zmanda]),
http://www.bacula.org/en/[Bacula], and
http://backuppc.sourceforge.net[BackupPC].
* although the EEE team is remarkably well-organized, EEE seems to be
largely redundant with http://moodle.org[Moodle] and
http://sakaiproject.org/portal[Sakai].  EEE needs either to adopt the Moodle
infrastructure, adding EEE functionality as Moodle plug-ins, or to untangle
the EEE infrastructure so that EEE can be offered as a viable alternative to
Moodle or Sakai.  Using Moodle this way is clearly possible;
http://www.oid.ucla.edu/units/tec/tectutorials/tecmoodle[UCLA is doing just that].
* all of the above would promote UCI and NACS across UC and the world.
- create an Open Source information clearinghouse.  Especially in these
times of tight budgets, we need to conserve money and time by evaluating the
best Open Source technologies, which can also be rolled out to other
campuses.
- embrace Grid technology as much as makes sense (see link:#ucgrid[above]
and link:#costbenefit[below]).
- use SDSC as a UC megacluster center.

Within-UCI
----------

Addressing Ongoing Research Computing problems
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
We want to address problems in Research Computing the way the NACS Response
Center does: fix the problem as soon as possible, directly or by cooperating
with the nearest support person.  After the problem is fixed or addressed,
*then* we can figure out whether the problem is of general interest.  If it
is, we can evaluate whether it is worthwhile to create a generalized
solution that can be scaled.  If not, drop it.  But first, fix the problem.

[[costbenefit]]
The General Cost/Benefit approach
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Analyze solutions in terms of long-term risk/cost/benefit.  If a project is
worth undertaking beyond a one-off fix of a single researcher's problem, it
is worth considering the longer-term consequences of addressing the problem
by breaking the cost estimate into:

* cost to analyze
* initial cost to implement
* features-to-cost ratio: can we live with a less-than-best solution if that
solution provides a lower long-term cost?

We would also need to evaluate the risk profile: what risk does each
solution imply?

* If it can be implemented quickly but costs much more to support, is it a
good solution?
* If it is cheap for the test instance, will it be cheap when it has to
scale?  What are the scaling issues?
* What are the support issues?
* Is an OSS solution that works initially but that no one else understands a
workable support model?
* Is commercial support fundable over the long term?
* How long do we have to work with a piece of software until we understand
it?
* How much trouble is it to evaluate it in a real situation?

Green Computing
~~~~~~~~~~~~~~~

High Performance Computing
^^^^^^^^^^^^^^^^^^^^^^^^^^
How do we decrease the energy cost of computing?  For some forms of HPC we
may be able to leverage the very low cost and very high speed of alternative
computing approaches such as http://en.wikipedia.org/wiki/GPGPU[GPGPUs],
http://en.wikipedia.org/wiki/Field-programmable_gate_array[FPGAs], or
http://en.wikipedia.org/wiki/Berkeley_Open_Infrastructure_for_Network_Computing[BOINC],
which could be used to create a Virtual Campus Supercomputing Center or
additions to UCGrid.  We will then have to promote these approaches to the
people who need them.
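
To make the GPGPU model concrete, the sketch below moves a simple array
calculation onto a graphics card and copies the result back.  It is only a
minimal illustration under assumed conditions: it presumes a CUDA-capable
card plus the PyCUDA and NumPy packages (neither is mentioned elsewhere in
this document), and the arrays and arithmetic are arbitrary placeholders,
not one of the research codes discussed below.

[source,python]
----------------------------------------------------------------------
# Minimal sketch of offloading an element-wise computation to a GPU.
# Assumes a CUDA-capable card and the PyCUDA + NumPy packages; the data
# and the formula are placeholders, not a real research code.
import numpy as np
import pycuda.autoinit              # initialize the first available GPU
import pycuda.gpuarray as gpuarray

n = 4000000
a = np.random.randn(n).astype(np.float32)   # single precision suits GPUs
b = np.random.randn(n).astype(np.float32)

a_gpu = gpuarray.to_gpu(a)          # copy the data to the card
b_gpu = gpuarray.to_gpu(b)
c_gpu = a_gpu * b_gpu + 2.0         # element-wise math, runs on the GPU

c = c_gpu.get()                     # copy the result back to host memory
print("max difference vs CPU result: %g" %
      np.max(np.abs(c - (a * b + 2.0))))
----------------------------------------------------------------------

The 100X-class speedups reported in the next paragraph come from kernels
that do far more arithmetic per byte moved than this toy example, but the
workflow is the same: copy data to the card, compute there, and copy the
results back.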
The use of http://en.wikipedia.org/wiki/GPGPU[GPGPUs] in scientific
computing can have dramatic effects on both power consumption and HPC.  A
single $500 graphics card contains 240 stream-processing cores, and a number
of groups have shown acceleration on the order of 100X on some codes of
interest to UCI researchers, including
http://www.ks.uiuc.edu/Research/gpu/[molecular dynamics],
http://www.gpgpu.org/cgi-bin/blosxom.cgi/2007/07/27[astrophysics], and
http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2222658[sequence analysis].

Thin clients
^^^^^^^^^^^^
We should support http://en.wikipedia.org/wiki/Thin_client[thin clients] for
those who want them.  Thin client computing reduces support and hardware
costs and, over time, energy costs, since thin clients use much less energy
than PCs.

Thin client models::
- compute on the server; export the display only.  This allows Linux and
Windows apps to share a desktop.  [*thin client*]
- load OS, apps, and data from the server; compute on the client.  [*meso client*]
- load OS from local disk; load apps and data from the server.  [*fat client*]
- load OS, apps, and data from local disk; no server.  [the current *obese client*]

It is worth noting that RedHat has returned to desktop computing using the
http://en.wikipedia.org/wiki/Kernel-based_Virtual_Machine[Kernel-based Virtual Machine]
(KVM) virtualization technology and a thin client product based on
http://www.qumranet.com/products-and-solutions[SolidICE/SPICE].

BDUC / BEAR
~~~~~~~~~~~~
For general-purpose computing, one of the best approaches we could exploit
is to use pooled resources:
http://moo.nac.uci.edu/~hjm/BDUC_USER_HOWTO.html[BDUC] as a compute cluster
and the proposed BEAR (Broadcom EA Replacement) for access to interactive
applications.  This will give users access to rapidly deployable
applications (via BEAR) and computational power (via BDUC), as well as
decrease the cost of using commercially available software such as MATLAB,
Mathematica, SAS, and SPSS.

Research Programming & Expertise Matching
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
We could use the EEE model, in which a manager interviews a researcher
together with the presumptive programmer and then allocates and supervises
programming talent for a job of a certain duration.  The details of this
program have yet to be fleshed out.

There are bushels of low-hanging fruit to be harvested simply by matching
researchers who need sophisticated advice to those who can supply it.  The
http://catalyst.harvard.edu[Catalyst] system that Harvard wrote is a great
way of doing this, though it will require a fair amount of human resources
to implement.

[[oss]]
Open Source Advocacy and Expertise
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Advocacy
^^^^^^^^
The use of Open Source Software is often promoted because of its initial
cost: free.  While that zero cost is an advantage, allowing the full package
to be evaluated without time-consuming negotiations with commercial vendors,
initial cost is not a large part of the total cost of implementing a
software (SW) infrastructure.  By far the larger costs associated with
implementation are:

* training
* customization
* long-term support
* debugging problems
* security problems / patches / audits
* making the SW interact with other SW, such as existing databases
* scaling out the SW to other locations once it is proven in a single
location.

On most of the above points, OSS has some advantage, if the OSS package is
chosen with care.  Rather than reiterating the arguments that others have
convincingly made, I refer the reader to
http://www.dwheeler.com/oss_fs_why.html[David Wheeler's OSS site].
Arguments for the superiority of commercial support of proprietary packages
are breaking down relatively fast due to the distributed, asynchronous
nature of various web technologies, and especially due to improved search
engines.  Most support issues can be addressed by typing a short description
into a Google search.

One additional case can be made for using OSS: reducing the personnel costs
associated with negotiating licenses, selling the software, restricting
access to it, tracking licenses or compliance, and enforcing upgrades.  With
OSS there are no such costs, allowing more personnel to be assigned to
supporting the software if necessary.  We will not be able to do away with
proprietary software completely, but reducing the number of proprietary
packages can significantly reduce the number of people required to support
them.

Expertise
^^^^^^^^^
While there is a huge amount of OSS expertise at UC and UCI, access to that
expertise is largely hidden or available only via personal contact.  This
results in people bypassing good Open Source alternatives to commercial
products because they either do not realize that such projects exist or are
unaware of how OSS works.  UCI (and NACS in particular) should have an *OSS
hotline* that would act as a clearinghouse for such information.  I would be
happy to lead that endeavor.

Storage
~~~~~~~
This is a tough call because there is a very wide spectrum of expectations
about storage.  At one end of the spectrum is highly volatile temporary
storage (aka scratch).  This is both relatively cheap and easy to provide:
$20K would probably buy enough scratch research storage to supply the entire
campus for a year or more ($10K for the chassis; $10K for disks: 48 x 1.5TB
drives at $200 each = $9,600, giving 72TB raw, or, configured as 4 x 12-disk
RAID6 arrays, 4 x 10 x 1.5TB = 60TB of formatted storage; this arithmetic is
worked through in the short sketch below).  That is about $333/TB of
formatted storage (roughly 2x the per-TB cost of the bare drives), but it
includes few management tools beyond
http://en.wikipedia.org/wiki/Logical_Volume_Manager_(Linux)[Logical Volume Management]
via LVM2.  Such a box can be deployed as a multifunction fileserver by
itself, as an iSCSI target (see http://www.openfiler.com/[Openfiler] as an
example of a plug & play fileserver for this model), or as a component of a
higher-order system.

At the other end is the full Data Center model with high-end storage from
BlueArc, Isilon, NetApp, or EMC, which will cost about $3000/TB (cost
estimate from UCLA).  The DC model addresses the full spectrum of data
replication, expansion, migration, and hardware upgrade or transition, and
could possibly include disaster recovery.

There are some interesting 'metastorage' products appearing that we should
be following.  One is http://www.caringo.com/[Caringo CAStor] (Content
Addressable Storage), a software-only product that claims essentially
unlimited storage scaling across generic x86 storage devices.  I will be
testing this approach as soon as the new Broadcom nodes arrive.

'TB devices', instruments that can produce terabytes of data in a day, will
start to appear on campus within a year, at least in the form of
http://www.politigenomics.com/next-generation-sequencing-informatics[NextGen DNA Sequencing machines]
and possibly some
http://tinyurl.com/7mu6pm[confocal microscopes and other imaging systems].
These will require 10Gb networks connecting them to the support
infrastructure and (relatively cheap) short-term but high-performance disk
space.
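
As promised above, here is the scratch-storage arithmetic worked out
explicitly.  It is a back-of-the-envelope sketch using only the assumptions
quoted in that paragraph (2009 prices, 1.5TB drives, 4 x 12-disk RAID6
sets); the figures are estimates, not vendor quotes.

[source,python]
----------------------------------------------------------------------
# Back-of-the-envelope version of the scratch-storage estimate above.
# All figures are the assumptions from the text, not vendor quotes.
chassis_cost  = 10000.0         # $10K for the chassis
disks         = 48
disk_size_tb  = 1.5             # TB per drive
disk_cost     = 200.0           # $ per drive

disk_total = disks * disk_cost              # $9,600 for the drives
raw_tb     = disks * disk_size_tb           # 72 TB raw

# 4 x 12-disk RAID6 sets: each set gives up 2 drives to parity
sets          = 4
disks_per_set = 12
formatted_tb  = sets * (disks_per_set - 2) * disk_size_tb   # 60 TB usable

total_cost = chassis_cost + disk_total      # ~$19,600, i.e. the ~$20K figure

print("drives: $%.0f  raw: %.0f TB  formatted: %.0f TB" %
      (disk_total, raw_tb, formatted_tb))
print("cost per formatted TB: $%.0f (bare drives alone: $%.0f per raw TB)" %
      (total_cost / formatted_tb, disk_cost / disk_size_tb))
----------------------------------------------------------------------

Run as written, this prints roughly $327 per formatted TB (the $333/TB
figure above rounds the total up to $20K) against about $133 per raw TB for
the bare drives, which is where the "about 2x" comparison comes from.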
The Broadcom donation of NetApp devices adds another variable to the
equation, though probably a welcome one.  If the heads arrive with licenses,
we can use them as fast scratch storage devices, but their disks are quite
small, so they are not well suited to bulk storage.  If the heads do not
have licenses, we can still use the disk shelves as generic storage under
the control of a vanilla Linux head with LVM2.

Latest Version
--------------
The latest version of this document should be
http://moo.nac.uci.edu/%7Ehjm/UCI_CI_Research_Computing.html[here for HTML]
and
http://moo.nac.uci.edu/%7Ehjm/UCI_CI_Research_Computing.txt[here for AsciiDoc source].
http://www.methods.co.nz/asciidoc/[AsciiDoc] is a free, flexible,
Python-based plaintext-to-HTML/PDF/DocBook translator.  The "source" text
file is easily human-readable, as is the output.