UCI CyberInfrastructure & Research Computing
============================================
by Harry Mangalam
v0.4, 25 Feb 2009

// Convert this file to HTML & move it to its final destination with the command:
// asciidoc -a toc -a numbered /home/hjm/nacs/UCI_CI_Research_Computing.txt; scp /home/hjm/nacs/UCI_CI_Research_Computing.* moo:~/public_html

World-wide
----------
Beyond UC, there are few technical challenges that stop UCI from interacting
with the University of Geneva as easily as with NYU.  In order to move UCI
towards the top of the list, we should be doing the following:

[[ucgrid]]
UCGrid participation
~~~~~~~~~~~~~~~~~~~~
UCI should continue to support Grid technologies.  Many on-campus ORUs are
using this infrastructure through other support structures (BIC, Beckman
Laser, PS, CalIT2, the CAMERA project, etc.).  This involves some physical
infrastructure, but mostly software layered on top of existing physical
infrastructure, plus more expertise.

[NOTE]
.A Caution
============================================================================
It is not clear that the amount of resources and energy going into the Grid
will be paid back in actual usage.  Use of local clusters like the MPC is
well understood and these resources are well-used on campus.  The problem
with extending this to the Grid is the amount of extra overhead in terms of
connection, authentication/authorization, and other complexity.  This
overhead is worthwhile for very heavy users of CPU cycles who cannot obtain
cycles via direct allocation from Supercomputer Centers (SCs).  The
advantage of the Grid is that it is supposed to provide a large surplus of
cycles that can be shared or harvested by others.  This model fails if there
is not a large surplus of cycles to be harvested, or if there are not enough
nodes in the same compute domain to allow parallel jobs to execute.  If
there is no such surplus, the Grid is a highly complex method of inducing
frustration and, dare I say it, gridlock.
============================================================================

In spite of this caution, there are additional reasons for contributing to
the Grid infrastructure.  Both the Grid and other UC resources need a global
authentication/authorization system, and the http://www.ucgrid.org/[UCGrid]
is implementing the http://gridshib.globus.org[GridShib] project, which
would provide such a system.  Other parts of the Grid are useful for other
purposes, such as the
http://www.globus.org/grid_software/data/gridftp.php[GridFTP] system for
fast file transfer.  On the whole, it is probably worthwhile to continue to
support the UCGrid, but I am not sure that it will fulfill the promises that
have been made in its name.

Among UC Campuses
-----------------
- UCI should lead by example by developing and sharing applications that
leverage web or internet technologies to ease both research and
infrastructure bottlenecks.  Examples of such efforts are:
* an electronic software distribution solution that could be used UC-wide.
* a general-purpose software licensing scheme that allows commercial
licenses to be tracked and validated/invalidated by user time, CPU time, CPU
count, OS, etc., so that we can enforce whatever licensing terms commercial
vendors request (but see link:#oss[comments on Open Source Software below]).
* a robust, inexpensive, easy-to-administer Open Source backup system.
At least three such systems are used in enterprise deployments:
http://www.amanda.org/[Amanda] (commercialized into http://zmanda.com/[Zmanda]),
http://www.bacula.org/en/[Bacula], and
http://backuppc.sourceforge.net[BackupPC].
* although the EEE team is remarkably well-organized, EEE seems to be
largely redundant with http://moodle.org[Moodle] and
http://sakaiproject.org/portal[Sakai].  EEE needs either to adopt the Moodle
infrastructure, adding EEE functionality as Moodle plug-ins, or to untangle
the EEE infrastructure so that EEE can be offered as a viable alternative to
Moodle or Sakai.  Using Moodle this way is clearly possible;
http://www.oid.ucla.edu/units/tec/tectutorials/tecmoodle[UCLA is doing just that].
* all of the above would promote UCI and NACS across UC and the world.
- create an Open Source information clearinghouse.  Especially in these
times of tight budgets, we need to conserve money and time by evaluating the
best Open Source technologies, which can also be rolled out to other
campuses.
- embrace Grid technology as much as makes sense (see link:#ucgrid[above]
and link:#costbenefit[below]).
- use SDSC as a UC megacluster center.

Within-UCI
----------

Addressing Ongoing Research Computing problems
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
We want to address problems in Research Computing the way the NACS Response
Center does: fix the problem as soon as possible, directly or by cooperating
with the nearest support person.  After the problem is fixed or addressed,
*then* we can figure out whether the problem is of general interest.  If it
is, we can evaluate whether it is worthwhile to create a generalized
solution that can be scaled.  If not, drop it.  But first, fix the problem.

[[costbenefit]]
The General Cost/Benefit approach
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Analyze solutions in terms of long-term risk/cost/benefit.  If a project is
worth undertaking beyond a one-off fix of a single researcher's problem, it
is worth considering the longer-term consequences of addressing the problem
by breaking the cost estimate into:

* cost to analyze
* initial cost to implement
* features-to-cost ratio: can we live with a less-than-best solution if that
solution provides a lower long-term cost?

We would also need to evaluate the risk profile: what risk does each
solution imply?

* If it can be implemented quickly but costs much more to support, is it a
good solution?
* If it is cheap for the test instance, will it be cheap when it has to
scale?  What are the scaling issues?
* What are the support issues?
* Is an OSS solution that works initially but that no one else understands a
workable support model?
* Is commercial support fundable over the long term?
* How long do we have to work with a piece of software until we understand
it?
* How much trouble is it to evaluate it in a real situation?

Green Computing
~~~~~~~~~~~~~~~

High Performance Computing
^^^^^^^^^^^^^^^^^^^^^^^^^^
How do we decrease the energy cost of computing?  For some forms of HPC we
may be able to leverage the very low cost and very high speed of alternative
computing approaches such as http://en.wikipedia.org/wiki/GPGPU[GPGPUs],
http://en.wikipedia.org/wiki/Field-programmable_gate_array[FPGAs], or
http://en.wikipedia.org/wiki/Berkeley_Open_Infrastructure_for_Network_Computing[BOINC],
which could be used to create a Virtual Campus Supercomputing Center or
additions to UCGrid.  We will then have to promote these approaches to the
people who need them.
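
To make the GPGPU model concrete, the sketch below moves a simple array
calculation onto a graphics card and copies the result back.  It is only a
minimal illustration under assumed conditions: it presumes a CUDA-capable
card plus the PyCUDA and NumPy packages (neither is mentioned elsewhere in
this document), and the arrays and arithmetic are arbitrary placeholders,
not one of the research codes discussed below.

[source,python]
----------------------------------------------------------------------
# Minimal sketch of offloading an element-wise computation to a GPU.
# Assumes a CUDA-capable card and the PyCUDA + NumPy packages; the data
# and the formula are placeholders, not a real research code.
import numpy as np
import pycuda.autoinit              # initialize the first available GPU
import pycuda.gpuarray as gpuarray

n = 4000000
a = np.random.randn(n).astype(np.float32)   # single precision suits GPUs
b = np.random.randn(n).astype(np.float32)

a_gpu = gpuarray.to_gpu(a)          # copy the data to the card
b_gpu = gpuarray.to_gpu(b)
c_gpu = a_gpu * b_gpu + 2.0         # element-wise math, runs on the GPU

c = c_gpu.get()                     # copy the result back to host memory
print("max difference vs CPU result: %g" %
      np.max(np.abs(c - (a * b + 2.0))))
----------------------------------------------------------------------

The 100X-class speedups reported in the next paragraph come from kernels
that do far more arithmetic per byte moved than this toy example, but the
workflow is the same: copy data to the card, compute there, and copy the
results back.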
The use of http://en.wikipedia.org/wiki/GPGPU[GPGPUs] in scientific
computing can have dramatic effects on both power consumption and HPC.  A
single $500 graphics card contains 240 stream-processing cores, and a number
of groups have shown acceleration on the order of 100X on some codes of
interest to UCI researchers, including
http://www.ks.uiuc.edu/Research/gpu/[molecular dynamics],
http://www.gpgpu.org/cgi-bin/blosxom.cgi/2007/07/27[astrophysics], and
http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2222658[sequence analysis].

Thin clients
^^^^^^^^^^^^
We should support http://en.wikipedia.org/wiki/Thin_client[thin clients] for
those who want them.  Thin client computing reduces support and hardware
costs and, over time, energy costs, since thin clients use much less energy
than PCs.

Thin client models::
- compute on the server; export the display only.  This allows Linux and
Windows apps to share a desktop.  [*thin client*]
- load OS, apps, and data from the server; compute on the client.  [*meso client*]
- load OS from local disk; load apps and data from the server.  [*fat client*]
- load OS, apps, and data from local disk; no server.  [the current *obese client*]

It is worth noting that RedHat has returned to desktop computing using the
http://en.wikipedia.org/wiki/Kernel-based_Virtual_Machine[Kernel-based Virtual Machine]
(KVM) virtualization technology and a thin client product based on
http://www.qumranet.com/products-and-solutions[SolidICE/SPICE].

BDUC / BEAR
~~~~~~~~~~~~
For general-purpose computing, one of the best approaches we could exploit
is to use pooled resources:
http://moo.nac.uci.edu/~hjm/BDUC_USER_HOWTO.html[BDUC] as a compute cluster
and the proposed BEAR (Broadcom EA Replacement) for access to interactive
applications.  This will give users access to rapidly deployable
applications (via BEAR) and computational power (via BDUC), as well as
decrease the cost of using commercially available software such as MATLAB,
Mathematica, SAS, and SPSS.

Research Programming & Expertise Matching
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
We could use the EEE model, in which a manager interviews a researcher
together with the presumptive programmer and then allocates and supervises
programming talent for a job of a certain duration.  The details of this
program have yet to be fleshed out.

There are bushels of low-hanging fruit to be harvested simply by matching
researchers who need sophisticated advice to those who can supply it.  The
http://catalyst.harvard.edu[Catalyst] system that Harvard wrote is a great
way of doing this, though it will require a fair amount of human resources
to implement.

[[oss]]
Open Source Advocacy and Expertise
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Advocacy
^^^^^^^^
The use of Open Source Software is often promoted because of its initial
cost: free.  While that zero cost is an advantage, allowing the full package
to be evaluated without time-consuming negotiations with commercial vendors,
initial cost is not a large part of the total cost of implementing a
software (SW) infrastructure.  By far the larger costs associated with
implementation are:

* training
* customization
* long-term support
* debugging problems
* security problems / patches / audits
* making the SW interact with other SW, such as existing databases
* scaling out the SW to other locations once it is proven in a single
location.

On most of the above points, OSS has some advantage, if the OSS package is
chosen with care.  Rather than reiterating the arguments that others have
convincingly made, I refer the reader to
http://www.dwheeler.com/oss_fs_why.html[David Wheeler's OSS site].
Arguments for the superiority of commercial support of proprietary packages
are breaking down relatively fast due to the distributed, asynchronous
nature of various web technologies, and especially due to improved search
engines.  Most support issues can be addressed by typing a short description
into a Google search.

One additional case can be made for using OSS: reducing the personnel costs
associated with negotiating licenses, selling the software, restricting
access to it, tracking licenses or compliance, and enforcing upgrades.  With
OSS there are no such costs, allowing more personnel to be assigned to
supporting the software if necessary.  We will not be able to do away with
proprietary software completely, but reducing the number of proprietary
packages can significantly reduce the number of people required to support
them.

Expertise
^^^^^^^^^
While there is a huge amount of OSS expertise at UC and UCI, access to that
expertise is largely hidden or available only via personal contact.  This
results in people bypassing good Open Source alternatives to commercial
products because they either do not realize that such projects exist or are
unaware of how OSS works.  UCI (and NACS in particular) should have an *OSS
hotline* that would act as a clearinghouse for such information.  I would be
happy to lead that endeavor.

Storage
~~~~~~~
This is a tough call because there is a very wide spectrum of expectations
about storage.  At one end of the spectrum is highly volatile temporary
storage (aka scratch).  This is both relatively cheap and easy to provide:
$20K would probably buy enough scratch research storage to supply the entire
campus for a year or more ($10K for the chassis; $10K for disks: 48 x 1.5TB
drives at $200 each = $9,600, giving 72TB raw, or, configured as 4 x 12-disk
RAID6 arrays, 4 x 10 x 1.5TB = 60TB of formatted storage; this arithmetic is
worked through in the short sketch below).  That is about $333/TB of
formatted storage (roughly 2x the per-TB cost of the bare drives), but it
includes few management tools beyond
http://en.wikipedia.org/wiki/Logical_Volume_Manager_(Linux)[Logical Volume Management]
via LVM2.  Such a box can be deployed as a multifunction fileserver by
itself, as an iSCSI target (see http://www.openfiler.com/[Openfiler] as an
example of a plug & play fileserver for this model), or as a component of a
higher-order system.

At the other end is the full Data Center model with high-end storage from
BlueArc, Isilon, NetApp, or EMC, which will cost about $3000/TB (cost
estimate from UCLA).  The DC model addresses the full spectrum of data
replication, expansion, migration, and hardware upgrade or transition, and
could possibly include disaster recovery.

There are some interesting 'metastorage' products appearing that we should
be following.  One is http://www.caringo.com/[Caringo CAStor] (Content
Addressable Storage), a software-only product that claims essentially
unlimited storage scaling across generic x86 storage devices.  I will be
testing this approach as soon as the new Broadcom nodes arrive.

'TB devices', instruments that can produce terabytes of data in a day, will
start to appear on campus within a year, at least in the form of
http://www.politigenomics.com/next-generation-sequencing-informatics[NextGen DNA Sequencing machines]
and possibly some
http://tinyurl.com/7mu6pm[confocal microscopes and other imaging systems].
These will require 10Gb networks connecting them to the support
infrastructure and (relatively cheap) short-term but high-performance disk
space.
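
As promised above, here is the scratch-storage arithmetic worked out
explicitly.  It is a back-of-the-envelope sketch using only the assumptions
quoted in that paragraph (2009 prices, 1.5TB drives, 4 x 12-disk RAID6
sets); the figures are estimates, not vendor quotes.

[source,python]
----------------------------------------------------------------------
# Back-of-the-envelope version of the scratch-storage estimate above.
# All figures are the assumptions from the text, not vendor quotes.
chassis_cost  = 10000.0         # $10K for the chassis
disks         = 48
disk_size_tb  = 1.5             # TB per drive
disk_cost     = 200.0           # $ per drive

disk_total = disks * disk_cost              # $9,600 for the drives
raw_tb     = disks * disk_size_tb           # 72 TB raw

# 4 x 12-disk RAID6 sets: each set gives up 2 drives to parity
sets          = 4
disks_per_set = 12
formatted_tb  = sets * (disks_per_set - 2) * disk_size_tb   # 60 TB usable

total_cost = chassis_cost + disk_total      # ~$19,600, i.e. the ~$20K figure

print("drives: $%.0f  raw: %.0f TB  formatted: %.0f TB" %
      (disk_total, raw_tb, formatted_tb))
print("cost per formatted TB: $%.0f (bare drives alone: $%.0f per raw TB)" %
      (total_cost / formatted_tb, disk_cost / disk_size_tb))
----------------------------------------------------------------------

Run as written, this prints roughly $327 per formatted TB (the $333/TB
figure above rounds the total up to $20K) against about $133 per raw TB for
the bare drives, which is where the "about 2x" comparison comes from.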
The Broadcom donation of NetApp devices adds another variable to the
equation, though probably a welcome one.  If the heads arrive with licenses,
we can use them as fast scratch storage devices, but their disks are quite
small, so they are not well suited to bulk storage.  If the heads do not
have licenses, we can still use the disk shelves as generic storage under
the control of a vanilla Linux head with LVM2.

Latest Version
--------------
The latest version of this document should be
http://moo.nac.uci.edu/%7Ehjm/UCI_CI_Research_Computing.html[here for HTML]
and
http://moo.nac.uci.edu/%7Ehjm/UCI_CI_Research_Computing.txt[here for AsciiDoc source].
http://www.methods.co.nz/asciidoc/[AsciiDoc] is a free, flexible,
Python-based plaintext-to-HTML/PDF/DocBook translator.  The "source" text
file is easily human-readable, as is the output.