UC Irvine CyberInfrastructure Plan - 2013
=========================================
by Harry Mangalam, v1.10, Mar 5th, 2013
//Harry Mangalam mailto:harry.mangalam@uci.edu[harry.mangalam@uci.edu]
// Convert this file to HTML & move it to its final destination with the command:
// export fileroot="/home/hjm/nacs/UCI_CyberInfrastructure_plan_2013-1"; asciidoc -a icons -a toc2 -b html5 -a numbered ${fileroot}.txt; scp ${fileroot}.[ht]* moo:~/public_html

== The 4-Sigma target

Many CyberInfrastructure (CI) plans and funding streams target the upper 1% of research computing users, whose extreme demands on network bandwidth, storage, and compute put the current infrastructure under strain. While addressing the needs of the 1% quite often leads to improvements for the 99%, http://srcs.ucop.edu/[often it does not]. Our immediate CI plans are to improve computational resources for the middle ~95% of researchers - the '±2 σ' of the research computing population. By making it easier for most of the population to do their work with an overall infrastructure change, not only will they become more productive, but there will be more time to dedicate support to the top 1% who pose the most challenges (but who are also often the best positioned to help themselves).

Generally, there are 2 phases to any major upgrade in infrastructure. The first is 'Seed' funding and the second is 'Sustain' funding. 'Seed' funding usually derives from a major administration-funded initiative or from a grant. 'Sustain' funding is usually more problematic, since it is hard to convince any group to contribute and to convince them that what they are funding has a perceptible return to them. However, we have good reason to believe that the 'Condo model', proven in sustaining our compute clusters, can convince a majority of stakeholders to fund this approach.

Our compute clusters have been successfully funded via the Condo model, in which users contribute self-purchased resources to a common pool, and the resources are distributed to all users based on contribution and immediate need, or willingness to wait for a lower-priority resource. The result is a pool of resources that can be allocated flexibly, buffering spikes in usage and creating more 'apparent' resources for everyone.

== Storage

From repeated faculty surveys, the most requested (and now most 'shrilly' requested) resource is storage for faculty research. The details vary as to what kind of storage is most desirable (backups, ease of access, reliability, high performance, web-available, for active data or for archiving), but it is always *storage* that leads the list.

The Condo model for storage has the University seed a small storage cluster and provide the data center space, system administration, and especially the networking to support it. Interested parties would pay to expand the storage according to their requirements, and their contribution would be allocated via automatic 'quotas', with the actual total storage being monitored and increased in a very cost-effective way, staying just above critical levels. Typically, smaller groups buy excess storage to allow for growth, but this is wasted money: the cost of disk-based storage halves roughly every 14 months, and a scalable distributed filesystem lets us increase storage only as needed. Therefore, instead of carrying a large overhead of unused storage, we keep the storage pool at just under critical levels, increasing it in ~50 TB chunks, which are fairly cheap since we use high-quality but Commodity, Off-The-Shelf (COTS) technology.
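To illustrate how the condo bookkeeping might work, here is a minimal sketch of the quota and 'just-in-time' expansion logic, in Python. The group names, purchases, chunk size, and threshold are hypothetical examples; in practice the quotas would be enforced by the distributed filesystem's own quota tools.

[source,python]
---------------------------------------------------------------------
#!/usr/bin/env python
# Minimal sketch of the storage-condo bookkeeping described above.
# Group names, purchases, and thresholds are hypothetical examples;
# real quotas would be enforced by the filesystem's own quota tools.

CHUNK_TB = 50          # size of each COTS expansion increment
EXPAND_AT = 0.85       # expand when the pool is 85% committed

# TB of storage each condo member has paid for (hypothetical)
purchases = {"genomics_lab": 40, "ess_climate": 25, "physics": 30}

def quotas(purchased):
    """Each group's quota is simply what it has contributed."""
    return dict(purchased)

def chunks_to_order(pool_tb, used_tb, chunk_tb=CHUNK_TB, threshold=EXPAND_AT):
    """How many expansion chunks keep utilization below the threshold?"""
    chunks = 0
    while used_tb > threshold * (pool_tb + chunks * chunk_tb):
        chunks += 1
    return chunks

if __name__ == "__main__":
    pool_tb = 100                      # raw pool currently deployed
    used_tb = sum(purchases.values())  # space already committed to groups
    print("quotas: %s" % quotas(purchases))
    n = chunks_to_order(pool_tb, used_tb)
    if n:
        print("order %d x %d TB chunk(s) to stay under %d%% utilization"
              % (n, CHUNK_TB, EXPAND_AT * 100))
    else:
        print("pool utilization OK; no expansion needed")
---------------------------------------------------------------------

Whatever the eventual implementation, the principle is the same: contributions map directly to quotas, and capacity is bought in small chunks only when utilization approaches a threshold, which is what keeps the condo model cheap.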
Such a large storage pool can be used for a number of purposes, depending on configuration and networking. The networking is the critical point for this application. With a campus backbone upgrade to 10 Gbs and Long Haul Infiniband, we can use this campus storage pool both for general-purpose storage and for fairly high-performance storage connecting directly to the campus compute clusters and to other high-bandwidth sources and sinks such as genomic sequencers, imaging machines, remote sensing repositories, and other Internet-based data archives that provide both static and streaming data. (See backbone diagram.) We are positioning such storage as the central part of our CI plan going forward to handle:

- *Small Active Data* (Office documents, text files, instrument logs, lab notebooks, XML, inactive databases, research websites, and file sharing among faculty & staff)
- *Big Active Data* (genomic sequence files, imaging data, video, Earth System Science climate models/data, and large Physics data sets)
- *Long Term Storage* (backups, Digital Library files, and longer-term storage as per funding requirements)

== Networking

A top-tier research university requires first-rate networking, not only to support research, but also to support the increasingly multimedia-based teaching which can efficiently expand the scope of UCI's pedagogy.

=== Current Status

As of 2011, the UCI campus network (UCInet) consisted of ~1600 network routers and switches with approximately 36,000 active network connections in 175 buildings on campus. There is also a campus-wide WiFi network composed of _____ Wireless Access Points, most running at ____ Mbs.

UCInet has a 10 Gbs redundant backbone linking all buildings together, but there are no direct 10 Gbs links to end-user devices. Only two buildings (Calit2 and Bren Hall, both Computer Science-centric) have 100% 1 Gbs connections to end users. Other buildings have 100 Mbs to the majority of end users.

The campus is currently connected to the Internet, via our Internet Service Provider CENIC, with a total capacity of 8 Gbs. This includes two 1 Gbs links to the CENIC High Performance Research Network (HPR), which has direct high-speed connections //HOW HIGH SPEED??// to national Research & Education networks such as Internet2 and National Lambda Rail (NLR).

=== The Plan to Improve Network Capacity

==== Bandwidth Upgrade within Campus

To maintain the productivity of scientists with high data-rate requirements, and to attract and retain productive faculty, we plan to take the following steps (the transfer-time sketch after this list illustrates why these bandwidth jumps matter):

- Increase bandwidth to selected research buildings (Rowland Hall, BioSci 1 & 2) from 1 Gbs to 10 Gbs.
- Increase bandwidth to 1 Gbs to the desktop for researchers who run data-intensive applications, by installing a 1 Gbs switch on each floor of the selected buildings.
- Enhance the bandwidth capacity of the campus wireless core infrastructure //in what way??//
- *Budget: Total $639K*. ~$200K should be invested within 1 year.
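To make these bandwidth steps concrete, the following illustrative calculation (Python; the 1 TB dataset size is just an example) shows the ideal time to move a dataset at the three link speeds discussed in this plan, ignoring protocol and disk overhead:

[source,python]
---------------------------------------------------------------------
#!/usr/bin/env python
# Illustrative only: ideal transfer times for an example 1 TB dataset
# at the link speeds discussed in this plan, ignoring protocol and
# disk overhead.

DATASET_TB = 1.0
dataset_bits = DATASET_TB * 1e12 * 8       # terabytes -> bits

link_speeds_gbs = {"100 Mbs": 0.1, "1 Gbs": 1.0, "10 Gbs": 10.0}

for name, gbs in sorted(link_speeds_gbs.items(), key=lambda kv: kv[1]):
    hours = dataset_bits / (gbs * 1e9) / 3600.0
    print("%8s : %5.1f hours" % (name, hours))
---------------------------------------------------------------------

At 100 Mbs a researcher waits the better part of a day for a single terabyte; at 10 Gbs it is a coffee break.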
==== Bandwidth Upgrade for UCInet External Network Connections

In order to share Big Data at reasonable speeds with remote sites and to take advantage of grid-distributed resources such as the Large Hadron Collider Tier 1 and Tier 2 data nodes, we plan to:

- Add a 10 Gbs connection between UCInet and the CENIC High Performance Research (HPR) Network.
- Upgrade the campus border firewall capacity // to what?? why??//
- *Budget: $103K*.

==== Networking Performance Improvement at the Academic Data Center (ADC)

In order to alleviate network congestion at the ADC (where bandwidth utilization is consistently above 80% on all of our 1 Gbs paths; an illustrative utilization-monitoring sketch follows this list), we will improve bandwidth to 10 Gbs to each of our 2 large compute clusters, which contain most of the campus research storage (~1 PB, not including the nascent campus storage project). To do this, we will:

- Replace current 1 Gbs switches with 10 Gbs switches
- Upgrade ethernet bandwidth to 10 Gbs to endpoints
- Add Long Haul Infiniband capability to integrate the compute clusters with the rest of the campus storage project
- Increase fiber capacity //by how much??// from the ADC to the campus CORE sites //what is a campus CORE site??//
- *Budget: $352K*. Priority items are the upgrades of the ADC router and the co-lo router, which is $195K //which one is $195K? each?//
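As a rough illustration of how link-utilization figures like the 80% above can be measured, here is a minimal Python sketch that samples a Linux host's /proc/net/dev byte counters. The interface name, sampling interval, and link speed are example values; in production such numbers would normally come from the switches' and routers' own counters (e.g. via SNMP), but the arithmetic is the same.

[source,python]
---------------------------------------------------------------------
#!/usr/bin/env python
# Illustrative sketch: estimate link utilization on a Linux host by
# sampling /proc/net/dev byte counters.  Interface name, interval,
# and link speed are example values.

import time

IFACE = "eth0"          # example interface
LINK_GBS = 1.0          # nominal link speed in Gbs
INTERVAL = 10           # seconds between samples

def read_bytes(iface):
    """Return (rx_bytes, tx_bytes) for iface from /proc/net/dev."""
    with open("/proc/net/dev") as f:
        for line in f:
            if line.strip().startswith(iface + ":"):
                fields = line.split(":", 1)[1].split()
                return int(fields[0]), int(fields[8])
    raise ValueError("interface %s not found" % iface)

if __name__ == "__main__":
    rx1, tx1 = read_bytes(IFACE)
    time.sleep(INTERVAL)
    rx2, tx2 = read_bytes(IFACE)
    bits_per_sec = (max(rx2 - rx1, tx2 - tx1) * 8.0) / INTERVAL
    utilization = 100.0 * bits_per_sec / (LINK_GBS * 1e9)
    print("%s: %.1f%% of a %.0f Gbs link" % (IFACE, utilization, LINK_GBS))
---------------------------------------------------------------------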
==== Support for New Technology

- *IPv6*: Although we have not heard any requests from faculty for IPv6 addressing, we have obtained an IPv6 /32 allocation (prefix: 2607:f538::/32) from ARIN. As noted, implementing IPv6 will require upgrading all of our 8-year-old backbone routers as well as many building routers.
- *Software Defined Networking (SDN)*: OpenFlow-based SDN will be implemented in accordance with researcher requests. All network equipment acquired going forward will be compatible with SDN. //what good is SDN?? Are there any good examples of how it has been used to good effect??//
- *InCommon Federation*: UCI is already a member of the InCommon Federation, which uses a standard protocol for establishing trust relationships. We subscribe to the InCommon Digital Certificate service and will be certified in the InCommon Assurance Program at the Bronze and Silver levels.

== Human Resources

One of the problems with an increasingly digital workplace and its increasing rate of change is that human ability often lags the state of the art. To address this problem, UCI's CI plans involve improving the human resources available to assist with these problems, especially the research aspects. We will be increasing the available HR to support:

- *Scientific Programming*: With the increased amount of digital data, there is widespread desire to mine and analyze it efficiently. We are now well past the point where these datasets can be manipulated with Office applications; the tools of Big Data are required to exploit them. To this end, UCI needs people who know these approaches and toolkits, can make them available to researchers, and can help them get their workflows ... working. This is specialized knowledge, and we are competing against Google, Amazon, and the like to hire and keep such people.
- *Web Programming*: While not strictly a research computing problem, many faculty are using the web for publishing papers and primary results, and increasingly for teaching. Faculty often need help in setting up a particular piece of software or module. This is yet another area where we need a small team of dedicated programmers with a specialty in web programming and databases.
- *Archiving Assistance*: This is generally the purview of the Library system, but especially because many granting agencies now require specific types of archival storage and integration with specific retrieval software, there is a need for either more training of the Library personnel or hiring more people who have that expertise.

== Hardware

While this is an ongoing requirement, a number of people are already required by their jobs to keep up with hardware advances, so while it would be nice to have hardware-specific experts, this is one area where we are not as stressed as in others.

== Software

Since CI is at base a software-mediated infrastructure, careful policy at this level is among the most critical decisions that can be made. Software has huge possibilities for improvement: in robustness, capability, and especially scalability.

Over the past few years, the number of mature software packages available as Open Source Software (OSS) has increased by orders of magnitude. While there are still a few proprietary software packages that have no corresponding OSS equivalent, the OSS world is rapidly reducing that disparity. Especially for anything that acts as a 'server', there is an excellent chance that there is OSS that is as good as or better than the proprietary version. While it may not be possible to replace existing proprietary packages with their OSS equivalents due to the long tail of entanglement that commercial packages encourage, we will endeavor to use only Open Source packages going forward where there is not a compelling (and documented) reason to use the proprietary one. This is an especially notable step since the University of California has 10 campuses and good channels to spread information across them. We should be, and will be, using these channels to provide 'best-practices recommendations' and especially 'configurations' of mature OSS packages for particular uses.

=== Scalability

We are always looking for ways to make CI more scalable. We have done it well with our compute cluster architecture, and will re-use the same techniques to approach a campus storage system. While hardware can only scale so far, it is also continually falling in price, so we are able to buy more capacity for less money over time. Hardware also has a fairly short lifetime, typically 3-5 years between refresh cycles.

==== Software

Software is different. Unlike hardware, it tends to persist for a decade and longer, since it often takes a huge amount of effort to make it work well, and once it works well, it tends to hang on until major infrastructure changes can no longer support it. Because of this long tail, decisions regarding software policy are critical.

Some software (both proprietary and OSS) can be troublesome to install, but overwhelmingly the real time cost is in the configuration. Once a configuration has been set, it is much easier to scale out the deployment of a particular package, especially if an application has a single configuration file and allows in-line comments, as many OSS packages do. If it takes a day to configure an application to work well, but that investment of knowledge can scale over 100 OSS deployments, then the overall time per deployment is trivial - roughly 5 minutes of the original day's effort per deployment. As more services are based on server offerings, either as cloud services or via an institutional server, the time per configuration drops dramatically. This is what we are seeing.
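As a concrete (if simplified) illustration of 'configure once, deploy many', the sketch below stamps out per-host copies of a single, well-commented configuration template. The template text, host names, and output directory are hypothetical; a real deployment would push such files out with a configuration-management tool.

[source,python]
---------------------------------------------------------------------
#!/usr/bin/env python
# Simplified illustration of scaling one careful configuration across
# many deployments.  The template, host names, and output directory
# are hypothetical examples.

import os
from string import Template

# A single well-understood config, written once, with inline comments.
CONFIG_TEMPLATE = Template("""\
# storage-client.conf for $host (generated - do not hand-edit)
server   = storage.campus.example.edu
mount    = /share/$group
quota_tb = $quota_tb
""")

# Hypothetical per-host parameters; everything else is shared.
hosts = [
    {"host": "rowland-ws01", "group": "chem",     "quota_tb": 5},
    {"host": "biosci-ws07",  "group": "genomics", "quota_tb": 20},
    {"host": "ps-ws03",      "group": "physics",  "quota_tb": 10},
]

outdir = "generated-configs"
if not os.path.isdir(outdir):
    os.makedirs(outdir)

for params in hosts:
    path = os.path.join(outdir, params["host"] + ".conf")
    with open(path, "w") as f:
        f.write(CONFIG_TEMPLATE.substitute(params))
    print("wrote %s" % path)
---------------------------------------------------------------------

Whether the mechanism is a small script like this or a full configuration-management system, the point is the same: the day spent understanding the configuration is paid once, and each additional deployment costs minutes.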
==== Cluster-based instruction

Most lab-based computer instruction still takes place in a classroom, mostly because of the need to have specific client machines licensed for particular software. However, as instructors find OSS that fulfills a particular need, they are requesting server-based support for running the software for the class.

Currently UCI is using externalized compute nodes from our HPC cluster to provide this on an ad hoc basis, but going forward we will have to formalize this approach to providing instructional computing. This allows classes themselves to be distributed if desired, and OSS remote display software such as 'VNC, NX, and x2go' allows students to connect to Graphical User Interfaces (GUIs) generated on the server, even if they are separated from the server by 10 or more network hops. This server-based approach reduces the requirement of supporting computer labs, since almost all students have their own laptops and can use the server software from anywhere. This seems to be the future of lab-based computer instruction, although the one area that still needs work is software that requires hardware-accelerated graphics.

=== Licensing

This is a continuing problem, since most proprietary software companies have different licensing requirements, methods of software distribution and updating, mechanisms of licensing, discounting by user numbers, licensing periods, optional add-ons, prorating of licenses, and support mechanisms. For all of these reasons, it would simplify things significantly if we could use OSS, which allows unlimited ... everything. There is a ~2 FTE/campus cost just to administer the licensing and manage the various license servers; at least 1 of these FTEs could be switched to direct user support.