Big Data, ASAP
==============
Harry Mangalam
v1.0, Feb 12, 2013
:icons:
// fileroot="/home/hjm/nacs/nsf-net-infrastructure-grant"; asciidoc -a icons -a toc2 -a toclevels=3 -b html5 -a numbered ${fileroot}.txt; scp ${fileroot}.html ${fileroot}.txt moo:~/public_html/nsf-cc-nie

== Acquire, Store, Analyze, and Present

With the increasing number and capability of digital sensors and channels for acquiring data, UCI needs a network infrastructure that is able to 'Acquire, Store, Analyze, and Present (ASAP)' that data in a timely, secure, and easy-to-use manner. Data is getting big. From ubiquitous smartphone sensors, to genomic sequencers, confocal microscopes, and PET/CAT scanners, to national-scale image and data collectors such as satellite, census, and other government data sources, and even commercial social networking data (Google, Facebook, Twitter), research data is growing in both volume and variety. With that increase in size comes the requirement to store and process the data in reasonable time. We are already changing the mechanics and protocols for high-speed data within our Data Centers, but in order to support research widely, we will have to extend those mechanisms outside the Data Centers. This funding will support that aim directly by supplying the very high-speed transport needed for data replication and for the back-end movement of large data sets to the analysis centers and back again.

== Campus data storage

UCI is embarking on a project to develop a campus-wide storage system that will provide redundant, geographically dispersed storage for all researchers, covering all of the ASAP requirements. The system has two faces: one for general researcher access over the existing, fairly slow (100/1000 Mb/s) campus network, and another over a higher-speed 10-40 Gb/s network that lets compute clusters and other specialized machinery transfer bulk data to and from the same underlying storage at maximum speed. The latter high-speed network is the subject of this proposal.

We consider two types of data: the first is large, raw, unprocessed, cheap-to-reproduce data; the second is analyzed, reduced data that is worth considerably more in terms of the labor used to produce it. That is: cheap, big, raw data ('¢data') and expensive, small, processed data ('$data'). The point of research is to transform the former into the latter, with useful information extracted, concentrated, and hopefully published in the process. 'Publishing' in this case refers to peer-reviewed journals, to making the raw data available as a supplement to the journal publication, and to archiving the data for longer periods for public access. The last point is especially relevant since it is a state directive to the university and a requirement of many funding agencies.

In order to supply researchers with storage to process the '¢data', we have to consider very cheap, fairly reliable storage as opposed to very expensive, very reliable storage (from EMC, NetApp, DDN), since the price differential between the two is easily 10X or more (a rough cost sketch follows the list below). Since it is not the primary point of this submission, we do not describe the storage system in detail here, but it is based on the http://www.gluster.org/[Gluster Distributed, Replicated file system] and will be used to provide:

- a very large (up to Petabyte scale) intermediate data store for small and large data sets that can be accessed simultaneously by both desktop and compute cluster tools
- limited backup, including geographically dispersed backup
- access to read-enabled directories to distribute data to the web via browser
- data sharing among collaborators
- staging for submission to formal archives such as the https://merritt.cdlib.org/[Merritt repository of the California Digital Library]
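To make the '¢data' economics concrete, the following is a minimal back-of-the-envelope sketch. The per-TB prices are placeholders chosen only for illustration (the roughly 10X commodity-vs-enterprise differential comes from the discussion above), and the 2x replication factor is an assumption based on the replicated, geographically dispersed Gluster design.

[source,python]
----
#!/usr/bin/env python
# Back-of-the-envelope sketch of the cheap-vs-expensive storage argument.
# The per-TB prices below are placeholders, not quotes; only the ~10X
# commodity-vs-enterprise differential comes from the text, and the 2x
# replication factor is an assumption based on the replicated Gluster design.

COMMODITY_USD_PER_TB  = 100.0    # assumed commodity (Gluster brick) price, placeholder
ENTERPRISE_USD_PER_TB = 1000.0   # assumed ~10X enterprise (EMC/NetApp/DDN) price, placeholder
REPLICAS              = 2        # each file stored twice for geographic redundancy (assumption)
USABLE_PB             = 1        # target usable capacity: 1 Petabyte

def system_cost(usd_per_tb, usable_pb, replicas=1):
    """Total cost to provide `usable_pb` PB of usable space at `replicas` copies."""
    usable_tb = usable_pb * 1000.0
    return usd_per_tb * usable_tb * replicas

if __name__ == "__main__":
    cheap = system_cost(COMMODITY_USD_PER_TB, USABLE_PB, REPLICAS)
    pricy = system_cost(ENTERPRISE_USD_PER_TB, USABLE_PB)   # single-copy enterprise array
    print("1 PB usable, commodity + 2x replication : ${:>12,.0f}".format(cheap))
    print("1 PB usable, enterprise, single copy    : ${:>12,.0f}".format(pricy))
    print("enterprise / commodity cost ratio       : {:.1f}X".format(pricy / cheap))
----

Under these assumed prices, even after paying for a second, geographically separated copy of every file, the commodity approach comes out roughly 5X cheaper per usable Petabyte than a single-copy enterprise array.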
== Fast, Very Large Data Transfer

While there are ways to http://goo.gl/XKFEp[avoid & improve some network data transfer], eventually bytes need to move, and the larger the pipe, the easier the transfer. We are now seeing unprecedented amounts of data [Jessica, how much IO goes thru CENIC?] flow in and out of UCI, especially genomic, satellite, and CERN data. In order to move that data to and from the primary analysis engines in our Data Center, we minimally need 10Gb Ethernet, and preferably higher-speed, lower-latency mechanisms such as http://www.mellanox.com/page/long_haul_overview[long-haul Infiniband].

Data traversing UCI's primary internet connection (CENIC) tends to be packaged in ways that do not require short latencies (i.e. it arrives in archives which can be streamed in large chunks). However, internal to UCI, where the data tends to be unpacked into separate, often small files, low latency is almost as important as high bandwidth for moving data or files around (see the transfer-time sketch at the end of this section). In this scenario, it may be better to use a technology based on long-haul Infiniband instead of 10Gb Ethernet.

Intra-campus, these are the bottlenecks with our present network:

- data archives to/from the internet via CENIC (http://goo.gl/1qzD8[Thornton], http://goo.gl/Um5DS[Long], http://goo.gl/tH3U5[Tobias], http://goo.gl/DIkAU[Steele], http://goo.gl/Wd9Km[Mortazavi], http://goo.gl/N76QS[Zender], http://goo.gl/Po9s1[Small])
- data movement from genomic sequencers to processing clusters (http://goo.gl/7MtMN[Sandmeyer] / http://ghtf.biochem.uci.edu/[GHTF], http://goo.gl/kERxN[Barbour], http://goo.gl/Wd9Km[Mortazavi], http://goo.gl/JtM0q[Hertel])
- data movement from imaging centers (http://goo.gl/MZpiA[Potkin], http://goo.gl/mUTb9[Van Erp]) to processing clusters
- going forward, data movement to and from cloud services such as http://aws.amazon.com/ec2/[Amazon EC2] and https://cloud.google.com/products/compute-engine[Google Compute]. The main barrier to using these services is not their computational cost, but the cost and time of data IO to and from them.

image:NSF_NIE_diagram1.png[UCIResearchNet Diagram]

http://moo.nac.uci.edu/~hjm/nsf-cc-nie/NSF_NIE_diagram.png[(Larger Image)]
http://moo.nac.uci.edu/%7ehjm/nsf-cc-nie/NSF_NIE_diagram.pdf[(PDF version)]

*Figure 1* shows a block layout of the connectivity that the UCI Research/ASAP Network would provide. The [yellow gray-background]#existing 1Gb Ethernet backbone is shown in yellow#. Should this grant be funded, we would be able to provide the network infrastructure for the [green]#new Long Haul Infiniband (LHIB) network in green#, as well as the [blue]#10Gb backbone tree in blue#. The very low latency LHIB network would provide the replication service between the 'Replicated Campus Storage' sites as well as the transport from that storage to the compute clusters in the Data Center. The 10Gb Ethernet transport would also extend to the campus border router, providing a 10Gb path to CENIC and to the remote Medical facilities in Orange, where substantial imaging and genomic data is generated.
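To make the bandwidth-versus-latency argument above concrete, the following is a minimal sketch comparing the time to move an example 10 TB data set as one streamed archive versus as millions of 1 MB files, at several link speeds. The data-set size, file size, and per-file overhead are illustrative assumptions, not measurements of any UCI link.

[source,python]
----
#!/usr/bin/env python
# Minimal sketch: time to move the same data as one streamed archive vs. as
# many small files, at several link speeds. The data-set size, file size, and
# per-file overhead below are illustrative assumptions, not UCI measurements.

DATASET_TB          = 10      # example data set size (assumption)
SMALL_FILE_MB       = 1       # typical size of an unpacked file (assumption)
PER_FILE_OVERHEAD_S = 0.005   # protocol/metadata latency per file, seconds (assumption)

def bulk_hours(tb, gbit_per_s):
    """Hours to stream one large archive, bandwidth-limited only."""
    bits = tb * 1e12 * 8
    return bits / (gbit_per_s * 1e9) / 3600.0

def small_file_hours(tb, file_mb, gbit_per_s, overhead_s):
    """Hours to move the same data as many small files, adding per-file overhead."""
    n_files = (tb * 1e12) / (file_mb * 1e6)
    return bulk_hours(tb, gbit_per_s) + n_files * overhead_s / 3600.0

if __name__ == "__main__":
    for gbps in (1, 10, 40):
        print("{:2d} Gb/s : one archive {:5.1f} h, as 1 MB files {:5.1f} h".format(
            gbps,
            bulk_hours(DATASET_TB, gbps),
            small_file_hours(DATASET_TB, SMALL_FILE_MB, gbps, PER_FILE_OVERHEAD_S)))
----

Under these assumptions, a single streamed archive benefits fully from a faster link, but once the data is unpacked into small files the fixed per-file overhead dominates and a bigger pipe alone helps little; this is why a very low-latency transport such as long-haul Infiniband is attractive for intra-campus movement of unpacked data.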