UC Irvine CyberInfrastructure Plan - 2013
=========================================
by Harry Mangalam v1.10, Mar 5th, 2013
//Harry Mangalam mailto:harry.mangalam@uci.edu[harry.mangalam@uci.edu]
// Convert this file to HTML & move to its final dest with the command:
// export fileroot="/home/hjm/nacs/UCI_CyberInfrastructure_plan_2013"; asciidoc -a icons -a toc2 -b html5 -a numbered ${fileroot}.txt; scp ${fileroot}.[ht]* moo:~/public_html

== The 4-Sigma target

Many CyberInfrastructure (CI) plans target the upper 1% of research computing users, whose requirements strain the current network with extremely large demands on bandwidth, storage, and compute. While addressing the needs of the 1% sometimes leads to improvements for the 99%, often it does not. Our immediate CI plans are to improve life for the other '99%' (actually, the middle ~95% - the '± 2 Σs' of the research computing population), on the premise that if an overall infrastructure change makes it easier for the mid-range of the population to do their work, more support time can be dedicated to the top 1%, who pose the greatest challenges (but who are also often the best positioned to help themselves).

Generally, there are 2 phases to any major upgrade or change in infrastructure: the 'seed' funding and the 'sustain' funding. Seed funding usually derives from a major administration-funded initiative or from a grant; sustain funding is usually more problematic, since it is hard to convince users to contribute and to convince them that what they are funding has a perceivable return to them. We are using the 'Condo model' to provide sustained funding for this approach. This model starts from an infrastructure core, generally provided by a central authority, and then builds on that infrastructure with contributions from the community it benefits. Our central cluster computing core is based on this model (see link:#condocluster[below]).

[[condocluster]]
== The Condo Cluster Model

We have 2 large compute clusters at UCI: Green Planet (GP), open only to the Physical Sciences, and the High Performance Computing (HPC) cluster, open to all members of the University. The growth of both systems has been successfully funded via the Condo model, especially since HPC has a scheduler that gives condo owners immediate use of their own nodes while automatically allowing others access when those nodes are not fully used. This has resulted in nearly 100% usage of the cluster under load, with most of the excess capacity consumed by users who cannot afford their own nodes.

== Storage

According to repeated faculty surveys over the years, the most requested (and now more 'shrilly' requested) resource is storage for faculty research. The details vary as to what kind of storage is most desirable (backed up, easy to access, reliable, high performance, web-available, for active data or for archiving), but it is always storage that leads the list.

The Condo model for storage has the University seed a small storage cluster and provide the data center space, system administration, and especially the networking to support it. Interested parties would pay to expand the storage according to their requirements; their contributions would be allocated as quotas, with the actual total storage being monitored and increased in a very cost-effective way, staying just above critical levels.

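To make the quota-and-monitoring idea concrete, the following is a minimal Python sketch of how group contributions might map onto quotas and how the pool could be watched so that capacity is purchased only when usage approaches a critical level. The group names, sizes, and thresholds are hypothetical, and a real deployment would set and read quotas through whatever distributed filesystem is ultimately chosen.

[source,python]
----
#!/usr/bin/env python
# Hypothetical sketch: map condo contributions to per-group quotas and warn
# when the shared pool nears a critical utilization level. Group names,
# sizes, and thresholds are illustrative only, not real UCI figures.

POOL_TB = 200.0               # assumed current pool capacity (TB)
CRITICAL_UTILIZATION = 0.90   # expand the pool before usage crosses this fraction

# TB contributed (and therefore granted as quota) by each group - hypothetical
contributions_tb = {"physics": 50, "genomics": 80, "engineering": 30}

# TB actually in use per group; would come from filesystem accounting
usage_tb = {"physics": 22, "genomics": 71, "engineering": 9}

def quota_report():
    """Print each group's quota (equal to its contribution) and current usage."""
    for group, quota in sorted(contributions_tb.items()):
        used = usage_tb.get(group, 0)
        print(f"{group:12s} quota {quota:5.1f} TB  used {used:5.1f} TB "
              f"({100.0 * used / quota:4.1f}% of quota)")

def pool_needs_expansion():
    """True when total usage is close enough to capacity to order more disk."""
    return sum(usage_tb.values()) >= CRITICAL_UTILIZATION * POOL_TB

if __name__ == "__main__":
    quota_report()
    if pool_needs_expansion():
        print("Pool above critical utilization - schedule the next expansion.")
----
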
Typically, smaller groups buy excess storage to allow for growth, but this is wasted money, since the cost of disk-based storage halves about every 14 months and a scalable distributed filesystem lets us increase storage only as needed. Therefore, instead of carrying a large overhead of unused storage, we keep the storage pool only slightly above critical levels, increasing it in ~50TB chunks, which are fairly cheap since we use high-quality but COTS technology.

Such a large storage pool can be used for a number of purposes, depending on how it is configured and the type of networking that interconnects the infrastructure. The networking is the critical point for this application. With a campus backbone upgrade to 10Gb and Long Haul Infiniband, we can use this campus storage pool both for general-purpose storage (Office documents, lab results) and for fairly high-performance storage connecting directly to the campus compute clusters and to other high-bandwidth sources and sinks such as genomic sequencers, imaging machines, motion capture, video storage, satellite sensing archives, and other Internet-based data archives that provide both static and streaming data. (See the backbone diagram.)

We are positioning such storage as the central part of our CI plan going forward, to handle the following types of storage:

=== Small Active Data

- Office documents (spreadsheets, word processing docs, presentations)
- text files, instrument log files
- lab notebooks, XML
- idle databases (not for hosting active databases, due to latency)
- personal and research websites
- file sharing among faculty and staff

=== Big Active Data

- Genomic Sequencing (fasta, fastq, bam files)
- Imaging Data (MRI & CAT scans, microscope output)
- video (from the Arts, computer vision research, surveillance, etc.)
- Earth System Science (remote sensing data from NASA and other sources)
- Large Physics data sets (from CERN, ICECUBE, and other large data sources)

=== Long Term Storage

- Backups (faculty desktops/laptops, critical servers)
- Digital Library (specialty archives, special formats)
- Archiving (longer-term storage as per funding requirements)

== Human Resources

One of the problems with an increasingly digital workplace, and the increasing rate of change in that workplace, is that human ability often lags the state of the art. To address this problem, UCI's CI plans involve improving the human resources available to assist with these problems, especially on the research side. We will be increasing the available HR to support:

=== Scientific Programming

With the increased amount of digital data, there is widespread desire to mine and analyze it efficiently. We are now well past the point where these datasets can be manipulated with 'Microsoft Office' and similar applications; the tools of Big Data are required to exploit them (see the brief sketch below). To this end, UCI needs people who can document these approaches and toolkits, make them available to researchers, and help them get their workflows ... working. This is specialized knowledge, and we are competing against Google, Amazon, and the like to hire and keep such people.

=== Web Programming

While not strictly a research computing problem, many faculty are using the web for publishing papers and primary results, and for lab promotion. Faculty often need help in setting up a particular piece of software or module. This is yet another area where we need a small team of dedicated programmers with a specialty in web programming and databases.

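As a small illustration of the scripted, out-of-core style of analysis referred to under Scientific Programming above - the kind of task that defeats desktop applications - the following Python sketch streams through a large tab-delimited results file and accumulates summary statistics without ever holding the whole dataset in memory. The file name and column layout are assumptions made for the example, not a real UCI dataset.

[source,python]
----
#!/usr/bin/env python
# Hypothetical sketch: summarize one numeric column of a multi-GB
# tab-delimited file by streaming it a line at a time, so memory use stays
# constant no matter how large the file grows. The column index is assumed.

import csv
import sys

def summarize(path, value_column=2):
    """Stream the file once, tracking row count, mean, min, and max."""
    count, total = 0, 0.0
    minimum, maximum = float("inf"), float("-inf")
    with open(path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            try:
                value = float(row[value_column])
            except (IndexError, ValueError):
                continue  # skip header or malformed rows
            count += 1
            total += value
            minimum = min(minimum, value)
            maximum = max(maximum, value)
    mean = total / count if count else 0.0
    return count, mean, minimum, maximum

if __name__ == "__main__":
    n, mean, lo, hi = summarize(sys.argv[1])
    print(f"rows={n}  mean={mean:.3f}  min={lo}  max={hi}")
----

The same pattern - read, reduce, discard - is what most Big Data toolkits automate at larger scale.
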
=== Archiving Assistance

This is generally the purview of the Library system, but because many granting agencies now require specific types of archival storage and integration with specific retrieval software, there is a need either for more training of Library personnel or for hiring more people who have that expertise.

== Hardware and Software Issues

=== Hardware

While keeping up with hardware advances is an ongoing requirement, a number of staff are already required by their jobs to do so, so that while it would be nice to have hardware-specific experts, this is one area where we are not as stressed as in others.

=== GPU acceleration

One area of very rapid recent change has been compute accelerators in the form of General Purpose Graphics Processing Units (GPGPUs). These accelerators (mostly from Nvidia and, recently, Intel) require special approaches to programming and a fairly deep understanding of parallel programming to extract decent performance. When the proper techniques are applied correctly, the acceleration can be dramatic, but that does require special training.

=== Software

Since CI is at base a software-mediated infrastructure, careful policy at this level is among the most critical decisions that can be made. Software has huge possibilities for improvement: in robustness, capability, and especially scalability. Over the past few years, the number of 'Generally Regarded As Mature' (GRAM) software packages in the Open Source Software (OSS) world has increased by orders of magnitude. While there are still a few proprietary software packages that have no corresponding Open Source equivalent, the Open Source world is rapidly reducing that disparity. Especially for anything that acts as a 'server', there is an excellent chance that there is Open Source Software that is as good as or better than the proprietary version. While it may not be possible to replace existing proprietary packages with their Open Source equivalents, due to the long tail of entanglement that commercial packages encourage, we will endeavor to use only Open Source packages going forward where there is not a compelling (and documented) reason to use the proprietary one. This reverses the past habit of considering only proprietary packages because of support and lifetime issues; those issues have largely been obviated by the ease of debugging problems with the help of the Internet community. This is an especially notable step since the University of California has 10 campuses and good channels to spread information across them. We should be, and will be, using these channels to provide 'best-practices recommendations' and especially 'configuration' of GRAM Open Source packages for particular uses.

=== Scalability

We are always looking for ways to make CI more scalable. We have done this well with our compute cluster architecture, and we will re-use the same techniques in approaching a campus storage system. While hardware can only scale so far, it is also falling in price, so we are able to buy more capacity for less money over time. Hardware also has a fairly short lifetime - 3-5 years between refresh cycles. Software is different: it tends to persist for a decade or even longer, since it often takes a huge amount of effort to make it work well, and once it works well, it tends to hang on until the hardware or other infrastructure can no longer support it. Because of this long tail, decisions regarding software policy are critical.

==== Software

Some software (both proprietary and OSS) can be troublesome to install, but overwhelmingly the real time sink is the configuration. Once a configuration has been set, it is much easier to scale out the deployment of a particular package, especially if the application has a single configuration file and allows in-line comments, as many OSS packages do. If it takes a day to configure an application to work well, but that investment of knowledge can scale over 100 Open Source deployments, then the overall time per deployment becomes trivial (a minimal sketch of this scale-out pattern appears at the end of this document). As more services are delivered as server offerings, either as cloud services or via an institutional server, the time per configuration drops dramatically.

==== Cluster-based instruction

Most lab-based computer instruction still takes place in a classroom, mostly because of the need to have specific client machines licensed for particular software. However, as instructors find OSS that fulfills a particular need, they are requesting server-based support for running the software for their classes. Currently UCI is using externalized compute nodes from our HPC cluster to provide this on an ad hoc basis, but going forward we will have to formalize this approach to provide instructional computing. This allows classes themselves to be distributed if desired, and the OSS remote display software 'x2go' allows students to connect to Graphical User Interfaces (GUIs) generated on the server, even if they are separated from the server by 10 or more network hops. This server-based approach reduces the requirement of supporting computer labs, since almost all students have their own laptops and can use the server software from anywhere. This seems to be the future of lab-based computer instruction, although the one area that still needs work is software that requires hardware-accelerated graphics.

=== Licensing

This is a continuing problem, since most proprietary software companies have different licensing requirements, methods of software distribution and updating, mechanisms of licensing, discounting by user numbers, licensing periods, optional add-ons, prorating of licenses, and support mechanisms. For all of these reasons, it would simplify things significantly if we could use OSS, which allows unlimited ... everything. It costs roughly 2 FTEs per campus just to administer the licensing and manage the various license servers; at least 1 of those FTEs could be switched to direct user support.

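As noted in the Software subsection under Scalability above, the payoff of careful configuration comes when a single vetted configuration is scaled out over many deployments. The following minimal Python sketch shows one hypothetical way to do that by copying one well-commented configuration file to a list of servers over ssh; the host names, paths, and the use of plain scp are assumptions for illustration, and a production deployment would more likely use an established configuration management tool.

[source,python]
----
#!/usr/bin/env python
# Hypothetical sketch: push one vetted configuration file to many servers.
# Host names and paths are invented for illustration; a production setup
# would normally use a configuration management tool rather than raw scp.

import subprocess

CONFIG_FILE = "app.conf"                       # the single, well-commented config
HOSTS = ["server01", "server02", "server03"]   # hypothetical deployment targets
REMOTE_PATH = "/etc/app/app.conf"              # assumed destination path

def deploy(config, hosts, remote_path):
    """Copy the config to each host with scp and return any hosts that failed."""
    failures = []
    for host in hosts:
        result = subprocess.run(["scp", config, f"{host}:{remote_path}"])
        if result.returncode != 0:
            failures.append(host)
    return failures

if __name__ == "__main__":
    failed = deploy(CONFIG_FILE, HOSTS, REMOTE_PATH)
    if failed:
        print("Deployment failed on: " + ", ".join(failed))
    else:
        print(f"Configuration pushed to {len(HOSTS)} hosts.")
----

The point is the ratio, not the tool: the day spent getting the configuration right is amortized over every host in the list.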