= Research Computing Infrastructure Requirements: immediate and longer term
Harry Mangalam, Research Computing Support, OIT
Ver. 1.0; November 16, 2015

// export fileroot="/home/hjm/nacs/rci_storage.recs"; asciidoc -a icons \
// -a toc2 -b html5 -a numbered ${fileroot}.txt; scp ${fileroot}.* \
// moo:~/public_html

*This is a strawman version intended to provoke discussion. It is not the final version.*

== Storage

This is a summary of a http://moo.nac.uci.edu/~hjm/CloudServicesForAcademics/[more extensive description of the problem], but that document provides no explicit recommendations. This one does.

== The Problem

As machines of all kinds become more digital, they produce more digital data, and the techniques used to digest and analyze that data require additional storage as well, often 10X the size of the raw data. Especially in research, this creates data-handling problems for people who are experts in their fields but not in Information Technology. It therefore falls to the organizations that do understand these problems to provide solutions with the optimum balance of speed, flexibility, cost, scalability, reliability, and especially, ease of use.

== The Recommendation

UCI should immediately provide a local storage system similar to https://www.rcac.purdue.edu/storage/depot/[Purdue's Data Depot] of 120TB (2 half-populated storage servers + a metadata server + an Input/Output node (IO node)), where local researchers can rent local storage for:

- active files
- short-term backup
- web distribution of files
- sync & share with collaborators

OIT should buy the required chassis hardware, with the actual disks being paid for by storage rental (at a slight premium to cover the next chassis and maintenance costs). The cost for raw disk is currently ~$50/TB for disks of decent reliability.

This storage can be accessed in a variety of ways via the https://en.wikipedia.org/wiki/Page_cache[caching] IO nodes that provide the actual services, so that users rarely access the fileservers directly. The filesystem should probably be a commercial appliance, but the IO nodes can be run by OIT to provide the services required by researchers. The IO node(s) should provide CIFS & NFS access as well as web services.

Optimally, the same storage would support spinning disks for bulk storage, SSDs for more complex access patterns, https://en.wikipedia.org/wiki/Dm-crypt[encrypted volumes] for data needing extra security, parallel access for high-speed RW, and support for https://en.wikipedia.org/wiki/Server_Message_Block[SMB/CIFS], https://en.wikipedia.org/wiki/Network_File_System[Network File System], https://en.wikipedia.org/wiki/WebDAV[WebDAV], and other protocols as needed. Users who require very fast parallel access to the filesystem can purchase the specialized clients to do so. Such clients could be the compute clusters on campus or specialized machines at the microscope facility.

== The Rationale

From both internal and external surveys, by far the most critical resource that faculty desire is storage in various forms. Storage is needed for actively used files, short-term backup, and long-term archives. This storage needs to be 'medium to high performance' on many metrics, from streaming reads & writes (RW) as in bioinformatics & video editing, to small, high-jitter RW operations with rapidly changing offsets, as in relational databases and access to http://moo.nac.uci.edu/~hjm/Job.Array.ZOT.html[zillions of tiny files (ZOTfiles)].
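To make these two regimes concrete, here is a minimal, illustrative sketch (Python 3 is assumed to be available on the client; the file sizes, counts, and the test-directory name are arbitrary placeholders, not a benchmark of any particular system) that times one large streaming write against the creation of many tiny files on a candidate filesystem.

[source,python]
----
#!/usr/bin/env python3
# Illustrative comparison of the two I/O regimes described above:
# one large streaming write vs. many tiny ("ZOT") files.
# Point it at a directory on the filesystem being evaluated.
import os, shutil, sys, time

target = sys.argv[1] if len(sys.argv) > 1 else "."
testdir = os.path.join(target, "iopattern_test")
os.makedirs(testdir, exist_ok=True)

# Regime 1: streaming -- one 1 GiB file written in 1 MiB chunks.
chunk = b"\0" * (1 << 20)
t0 = time.time()
with open(os.path.join(testdir, "big.dat"), "wb") as f:
    for _ in range(1024):
        f.write(chunk)
    f.flush()
    os.fsync(f.fileno())
t_stream = time.time() - t0
print(f"streaming 1 GiB      : {t_stream:6.1f} s  ({1024/t_stream:7.0f} MiB/s)")

# Regime 2: ZOT files -- 10,000 files of 1 KiB each (10 MiB total).
t0 = time.time()
for i in range(10000):
    with open(os.path.join(testdir, f"zot{i:05d}"), "wb") as f:
        f.write(b"\0" * 1024)
t_zot = time.time() - t0
print(f"10,000 x 1 KiB files : {t_zot:6.1f} s  ({10000/t_zot:7.0f} files/s)")

shutil.rmtree(testdir)  # clean up the test directory
----

Run in turn against a local disk, an NFS/CIFS mount, and a parallel filesystem, the two printed rates show how a system can excel at one regime (MiB/s of streaming) while performing poorly at the other (files/s of metadata-heavy creation), which is why the recommendation above calls for both bulk spinning disks and faster media behind the same IO nodes.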
Researchers particularly need storage for terms appropriate to their publishing cycles, which are typically 1-2 years in the life sciences and longer in some social sciences. They also need backup for these files, since their loss can terminate a project, with the fiscal loss to the campus often in the range of $10K to multiple $M. The administrative contribution to this system is provided by the overhead of successful grants, themselves considerably assisted by the availability of this storage.

Research now often involves access to these files both on campus and off, by lab members and by external collaborators. Some of this data is restricted by intellectual property agreements or privacy concerns and therefore requires special protections: permissions, local firewalls, and even on-disk encryption.

Commercial cloud storage can provide some of these resources and thereby offload the cost of technical oversight. Services like https://aws.amazon.com/glacier/[Amazon Glacier] or http://www.oscer.ou.edu/petastore.php[the University of Oklahoma's Petastore] are hard to beat for archiving data, which would otherwise require a ~$100K investment in hardware. For unrestricted data that has been published and no longer has to be accessed quickly, cloud archiving is the recommended solution.

However, locality is attractive for data that has critical latency, bandwidth, security, and legal/ownership requirements. Even peering with another UC campus (such as https://idre.ucla.edu/cass[UCLA's CASS]) cannot provide the kind of resources that many local users require. For example, it cannot provide https://en.wikipedia.org/wiki/Tunneling_protocol[untunneled] SMB/CIFS file services to local users due to the insecurity of that protocol, and the encryption required for tunneling adds even more latency to the connection. Additionally, any operation that requires frequent communication between client and server will take longer because of the longer packet https://en.wikipedia.org/wiki/Round-trip_delay_time[round trip times] caused by the increased number of network hops. While opening a single small file for editing from CASS is barely noticeably slower, unpacking a 100MB archive from a client at UCI to CASS at UCLA takes about 700X longer than the same local operation (a rough model of this effect is sketched at the end of this document).

The latest version of this document http://moo.nac.uci.edu/~hjm/rci_storage.recs.html[can be found here].
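As a back-of-the-envelope illustration of the round-trip effect described in the rationale above, the short sketch below models the time spent purely waiting on network round trips while unpacking an archive of many small files. Every number in it (file count, round trips per file, and both RTTs) is an assumed placeholder rather than a measurement of UCI, CASS, or any real network.

[source,python]
----
#!/usr/bin/env python3
# Back-of-the-envelope model of why unpacking many small files onto a
# remote network filesystem is much slower than the same local operation.
# All numbers are assumed placeholders, not measurements.

n_files          = 5000     # small files in a hypothetical 100MB archive
rtts_per_file    = 4        # e.g. create, write, set attributes, close
rtt_campus_lan   = 0.0002   # ~0.2 ms round trip on the local campus LAN
rtt_inter_campus = 0.005    # ~5 ms round trip between distant campuses

for label, rtt in (("local", rtt_campus_lan), ("remote", rtt_inter_campus)):
    t = n_files * rtts_per_file * rtt
    print(f"{label:6s}: ~{t:5.0f} s spent waiting on round trips")

# For latency-bound operations the slowdown is roughly the ratio of the
# RTTs; tunneling/encryption overhead and server-side latency (omitted
# here) push the real-world factor higher still.
print(f"ratio : ~{rtt_inter_campus / rtt_campus_lan:.0f}x from RTT alone")
----

Even under these mild assumptions the remote unpack spends minutes rather than seconds waiting on the network, and the factor grows with every additional hop and layer of encryption, which is the crux of the locality argument above.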