= A Vision for Research Cyberinfrastructure at UCI
Draft 4.1 'Valentine's Day', February 14, 2016
:icons:

// fileroot="/home/hjm/nacs/rci-vision/RCI-VIsion-Draft.4+"; asciidoc -a icons -a toc2 -a toclevels=3 -b html5 -a numbered ${fileroot}.txt;
// rsync -av /home/hjm/nacs/rci-vision moo:~/public_html/

== Structure of Edit

I've tried to retain the intent of the previous text, but have also tried to rationalize it by moving sections, removing duplicates, re-phrasing, and yes, deleting (all the quotes, for example). The major structural change is combining the Vision and Exec Summary, since they're related and, at less than 2 pages, not horribly long. The rest have been edited down, and each section now has a 'Vision', 'Current', 'Competitive Risk', and 'Recommendation' part, since that's the way they seemed to fall out after the first few; they can be knocked out of the ToC if they don't help. Also, as I wrote this, the 'RCI Staffing' and 'Education and Training' parts seem to have a fair bit of overlap and are fairly wordy; we could merge and chop them down quite a bit if necessary.

I've kept significant comments that were in the Google Doc and moved them along with the bits to which they referred, to see if each commenter feels that his/her comment has been addressed.

- DM - David Mobley
- SBS - Suzanne Sandmeyer
- SDF - Steve Franklin
- LS - Laura Smart
- DR - Dana Roode

There are still missing bits, typos, thinkos, and parts where I just didn't know what to say. I referred a lot of the budgeting to the actual Budget section, and maybe that's where all financial material should go rather than in-line. I've removed the 'Campus Storage Pool technical diagram' and linked to it; the Venn diagram could be treated the same way. Some other sections might also benefit from external pages, but I haven't done that yet beyond linking to existing ones.
With the ToC, this stanza, and all the comments removed, it comes in at about 13 pages, printed directly in large print. I'm still going through the text to make sure that the links align and that I've represented what I think you all meant. Give me your feedback when you can; the easiest way is to copy/paste the offending text and then comment on it - I can find it easily with a text search. I'm not sure how you want the report presented, but if it's not like this (Asciidoc; the source text has the same name, with '.html' replaced by '.txt'), then the text can be imported into MS Word or Gdocs and reformatted, costing only a bottle of Tylenol.

== Vision & Executive Summary

RCI capability at UCI is well below that of other R1 universities, and campus research & scholarship is already impaired by the lack of investment in Storage, Computation, & Staff. Without substantial new investment in this area, we will increasingly fall behind and lose the ability even to sustain current levels of research, much less accelerate and expand it.

.DM
[IMPORTANT]
I don't think this is a strong enough statement for the first paragraph. What about adding an additional sentence which says, "Campus research and scholarship is already impaired by the lack of investment in RCI, and without substantial new investment in this area we will increasingly fall behind and lose the ability to even sustain current levels of research, much less grow our research to the level we plan," or something along those lines.

Existing RCI staff and facilities have provided a large amount of quality service to campus researchers since 2013, when the recommendations from the http://goo.gl/Zf5xw6[Faculty Assessment of the State of Research Computing (FASRC)] were issued. OIT, with the support of faculty and the Office of Research, has received two NSF grants to enhance RCI.
The first funded http://sites.uci.edu/rcna/uci-lightpath/[UCI LightPath], a 10 Gb/s https://fasterdata.es.net/science-dmz/[Science DMZ] restricted to research data. 'LightPath' now connects the two largest campus compute clusters with labs in seven additional buildings. The second grant will fund a Cyberinfrastructure Engineer for two years. The UCI Libraries launched the http://previous.lib.uci.edu/dss/[Digital Scholarship Services] unit to support data curation and promote Open Access to data produced by UCI researchers. The two largest compute clusters (https://hpc.oit.uci.edu[HPC] and https://greenplanet.ps.uci.edu/[GreenPlanet]), underfunded and aging as they are, have been used to produce https://hpc.oit.uci.edu/publications[tens of papers] in multiple domains.

However, in spite of these isolated successes, RCI at this university remains a distinct weakness, to the extent that some researchers still rely on RCI at other institutions. Specialized computational and storage resources are notably underfunded at UCI, and some facilities, such as HIPAA/FISMA-secure research computing, are completely absent.

Theory & experimentation have been supplemented over the last decade by modeling and data. Those four foundational pillars of science require a correspondingly robust Research CyberInfrastructure (RCI), implemented so that researchers can exploit it in a manner that fits naturally with how they work. The goal of the RCI Workgroup is to provide recommendations that significantly advance and accelerate UCI research as widely and as economically as possible. Because RCI impacts every aspect of research and scholarship, it must be addressed as a campus priority. Besides computation 'per se', it includes networking, storage, data curation & management, and the support services required by all disciplines. And since requirements continue to expand, there is a critical need for long-term RCI planning as well as immediate support.
To support the competitiveness of UCI researchers in the evolving cyber environment, our vision is to:

- Change how RCI is coordinated, funded, and delivered by placing the responsibility for it under a separate organization, the UCI RCI Center (RCIC). To concentrate attention and responsiveness in this area, we propose to place RCI direction under the control of its end users: that is, to break the current Research Computing Support out of OIT and establish it under the direction of a supervisory group of interested faculty who will set direction and staffing, act as co-PIs on grant applications, and provide feedback on staff proposals for improved performance. The RCIC would also coordinate with other like-minded units on campus (the Data Science Initiative, the UCI/SDSC Computation program(?)) that need RCI and support for research and instruction, since requirements for both are growing rapidly. Center & shared-equipment grants would also be coordinated via the collaboration between the RCIC and other such units. link:#RCIC[Expanded Description]

- Initiate construction of a scalable Petabyte storage system that can be accessed by all researchers and leveraged to provide the multiple types of storage and data sharing that assist most research endeavors. This includes centralized active file storage & backup, easier sharing of even large data sets, secure web distribution of data, file syncing if necessary, and tiered data archiving locally and to cloud archives. link:#researchstorage[Expanded Description]

- Initiate the upgrade and renewal of UCI's compute clusters to bring UCI into parity with similar R1 institutions. link:#computation[Expanded Description]

- Provide a baseline or 'birthright' level of storage, connectivity, and computing for faculty in all disciplines.
If the funding requested is provided, within a year we can provide >1TB of robust, secure, central storage, 1 Gigabit/second network connectivity, and 100,000 CPU-hours/yr of 64-bit computing using Open Source Software (OSS) to each faculty member who requests it. These allocations should increase over time and could be augmented to support research projects through grant funding.
//link:#label[Expanded Description]

- Establish a widely available & scalable Research Desktop Computing Environment (RDCE) to facilitate computational & data-science research and teaching. This environment would include access to shared software (both proprietary and OSS), high-performance computing resources, visualization tools, tools for data sharing and collaboration, assisted access to external UC and national facilities, and appropriate cloud resources. While the RDCE would be more secure than traditional desktop computing, still more secure computational and storage environments would be provided for compliance with Data Use Agreements, other information-security frameworks (e.g. HIPAA/FISMA), and Data Sharing policies (e.g. for Genomic Data Sharing). link:#researchdesktop[Expanded Description]

- Provide increased support for research Data Management and Curation, not only because funding agencies demand it, but also to increase the re-use of data created at UCI and to use the resulting ease of access to encourage cross-domain collaborations among researchers at UCI and elsewhere. link:#curation[Expanded Description]

- Hire staff to support the RCI and provide much more assistance to researchers in fully leveraging the hardware, software, and services. None of the projects under discussion will advance without staff expertise, which is not cheap, but with UCI at the bottom of RCI staffing by most measures, this is critical.
Career staff would maintain, upgrade, and expand RCI operations; train students, other staff, & faculty in current computational techniques; document processes; provide catalytic programming services; assist with grant preparation; work with existing staff in other units to provide statistical, analytical, and advanced computing support; and assist in maintaining compliance with federal requirements for data robustness, curation, backup, retention, re-use, archiving, sharing, and security. link:#staffing[Expanded Description]

Executing this vision will speed the ramp-up of research programs, increase productivity by offloading in-lab administration & support, provide a much higher baseline of RCI services in all Schools, and offer much-increased data security and access to tools for all researchers.

[[RCIC]]
== Putting Researchers in Charge - the RCI Center

=== Vision

Researchers tend to know best when they need a novel tool or function, so our recommendation is to have 'them' provide the guidance for how RCI is delivered. A small, rotating committee of faculty with a strong interest in how RCI is provided will consult with the larger campus research community and drive how the staff and budget of the RCIC are used to pursue better support. This should replace the current feedback process of hierarchical complaint loops, which has historically taken multiple years to produce any changes.

.DM
*****************
*COMMENT:* See also my comment on Center for Computational Science above. One aspect that would be helpful about having such a thing is that you have a visible faculty center for computation, then it's easy to have a list of at least "computational" faculty which covers a reasonable fraction of RCI. Maybe that's the point of the RCIC, except the name sounds more like RCIC is about providing the cyberinfrastructure and not about the research per se.
I guess what I'm saying is that to some degree, an RCIC makes sense as you need something which supports all RCI on campus, even that used by everyone (not just computational researchers). But on the other hand, a big fraction of your heaviest users are going to be computational people, and they're going to need a lot of the training and support. So if you recognized that and organized them somehow, it would make it a lot easier to deal with the training/outreach aspects for a big chunk of campus.

and also: To some degree my concern about the RCIC is that it could be seen as sort of, "staff people who support computation" and hence decoupled from the actual "research using computation". Or, maybe there is an alternate model. i.e., what if the name were changed to something that included both, and you had a director who would manage both campus cyberinfrastructure and coordinate computational science efforts? By that I don't really mean "coordinate research efforts", but more "bring people together to try and go after major infrastructure grants and center grants", and things like coordinating with faculty on which classes being taught ought to be of broad interest to other computational faculty and/or how existing classes can be adapted to address what's needed. Maybe this is all within the scope of what you have in mind for the RCIC, but if that's the case it should probably be made more clear and maybe the name broadened?
**********

=== Current

The Office of Information Technology (OIT) currently supports RCI through its http://www.oit.uci.edu/research-computing-support/[Research Computing Support] group, with assistance from OIT data center and operations staff. The UCI Libraries provide RCI services through their new http://previous.lib.uci.edu/dss/[Digital Scholarship Services] unit and other mechanisms. UCI Health supports researchers through http://hirc.hs.uci.edu/default.aspx[secure access to clinical data] and staff support.
Some schools, notably Physical Sciences, have support teams dedicated to RCI. In addition, numerous staff are involved in RCI as components of research units and other entities. Any organizational model for RCI services must be flexible and able to leverage expertise across the university (and beyond). As there will be numerous funding and priority tradeoffs that evolve over time, effective faculty governance is also imperative.

=== Competitive Risk

The RCIC is proposed to speed decisions and implementations that support research computing. If it is not created and funded to appropriate levels, the risk is that support for RCI will proceed at the same unacceptable rate as before, with follow-on impacts on research, publication rates, recruitment, and retention.

=== Recommendations

We propose to establish the UCI Research CyberInfrastructure Center (RCIC) to manage and coordinate campus RCI. The RCIC would have a full-time staff director reporting to the Chief Information Officer, with a dotted line to the Vice Chancellor for Research. The Office of Information Technology would house the RCIC administratively. A faculty panel, including representatives from schools, the UCI Libraries, research units, and institutes, would provide oversight and prioritize investments.

The RCIC would directly manage a subset of campus RCI and facilitate coordination of, and access to, RCI resources within UCI, UC, and beyond. UCI staff who support RCI, whether in a central RCI unit, other units such as the UCI Libraries, a school, a research group, or a research unit, would be considered part of an extended UCI RCI support team. The RCIC Director would provide the leadership and coordination to enable and leverage this approach. Team integration would be further strengthened through joint central/school assignments where appropriate.
.DM
[IMPORTANT]
What about coupling this with a campus-wide Center for Computational Science or something similar, which would bring together faculty interested in/using these areas, making it easier to do shared grant proposals, center grants, etc.?

It is critical that RCI staff be closely aligned with the faculty and research groups they support. While it will not be possible to co-locate RCI staff with research groups in all cases, we should do so as much as possible. Where co-location is not practical, other mechanisms should be employed to make RCI staff function as collegial members of extended research teams.

==== Cost

See link:#budget[Budget].

==== Delay

Should the RCIC be approved, it can start to operate quickly with an interim Director. Since it is a virtual organization, there are no requirements for new physical infrastructure, although it would be desirable in the future to bring all its staff together in a fairly close setting so that discussions and decisions could be made face to face.

[[researchstorage]]
== Research Data Storage

=== Vision

Researchers should be able to interact with large data sets as easily as they interact with email and desktop documents. Tools to compose, share, back up & archive, forward, edit, analyze, and visualize multi-TB datasets should be available to all faculty. A requirement underlying all of those aims is the secure & reliable physical storage needed to contain the data.

=== Current

.DM
[IMPORTANT]
I get where this section is going, but I think it needs a strong statement of why it matters at the very beginning of the section. Otherwise, we're forcing people to read four paragraphs of relatively technical background before we tell them, "look, we have a problem". I suggest we say there is a problem first and then explain why there is one and what it is.
Much research at UCI generates or uses vast amounts of data; researchers are largely left on their own to manage it and to prevent catastrophic loss of sometimes critical data. This situation is inefficient and unsustainable, exposes UCI to liability, and is thus highly risky for research, legal, and fiduciary reasons.

All research organizations are seeing data storage requirements increase dramatically as more devices produce higher-resolution digital data. Without access to robust, scalable, 'medium to high' performance storage, modern research simply does not work. The various types of storage, the metrics by which they are distinguished, and the rationale for them are link:campus-storage-pool.html[discussed in the Technical Diagram Legend], but universal access to storage for recording, writing, analysis, backup, archiving, dispersal, and sharing is the 'de facto' *papyrus* of this age.

On-campus availability of the data is not enough. The data must be available globally to those who have a valid need for it, and in many cases secured against unauthorized access for reasons of privacy, sensitivity, intellectual property, or other legal prohibition. Such storage systems also require automatic backup, since data loss can unexpectedly abort a project, with substantial fiscal loss as well as long-term penalties from funding agencies.

While some of this storage can be outsourced to Cloud providers, much of the storage a research university requires is not amenable to remote Clouds. Much research storage must be 'medium to high' performance, from streaming reads and writes as required in video editing and bioinformatics, to small, high-rate Input/Output (IO) operations per second, as with relational databases.
These characteristics require a local Campus Storage Pool link:campus-storage-pool.html[(see Technical Diagram)], which can be leveraged to provide much of the storage described above via specialized, highly cached 'IO nodes' communicating in parallel with the storage pool. Such 'IO nodes' could provide desktop file services, web services, file syncing and sharing, archival services, and some kinds of backup. The minimum standard for a useful Campus Storage Pool is one with:

- Large capacity, scalable to multi-Petabyte size. https://idre.ucla.edu/cass[UCLA's CASS] is an example of such a Campus Storage Pool, though an expensive one. †
- Low-latency, high-bandwidth access to Compute Clusters and other analytical engines.
- Backup and mirroring to multiple locations (including off-campus).
- Physical security, with appropriate authentication/authorization fences to enable secure file sharing and collaboration among project teams internationally. This storage should minimally match the security requirements for HIPAA/FISMA and other federally mandated access.
- Accessibility via a range of protocols. As an example, https://www.rcac.purdue.edu/storage/depot/[Purdue's Data Depot] is available to Windows and Mac machines as a network drive on campus, and accessible by SCP/SFTP/Globus from anywhere. (**)

† 'CASS' could even act as a cross-campus Storage Pool, but packet latency to it is poor, and bandwidth to it, while decent for word processing, is unacceptably slow for HPC and 'many-file' operations. CASS also does not allow the kind of file access used by Macs and Windows, nor does it allow shell access to set up other services.

The Campus Storage Pool would be available to all faculty as a baseline, no-cost service. Additional storage needs would be funded through a cost-recharge model in which the administration supports the cost of the storage server chassis and researchers buy the disk-equivalents.
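The backup and mirroring requirement means that raw capacity must considerably exceed the usable capacity offered to researchers. A minimal sketch of that relationship follows; the overhead factors are illustrative assumptions only, since the actual values depend on the RAID/erasure-coding scheme and backup policy chosen during the technical review:

```python
# Sketch of how raw capacity maps to usable, backed-up capacity.
# The overhead factors are illustrative assumptions, not a design.

def raw_needed(usable_tb: float, parity_overhead: float = 1.25,
               backup_copies: int = 1) -> float:
    """Raw TB required for a given usable capacity, counting parity
    overhead plus full backup copies mirrored to other locations."""
    return usable_tb * parity_overhead * (1 + backup_copies)

# e.g. 500 TB usable with ~25% parity overhead and one mirrored backup:
print(f"{raw_needed(500):.0f} TB raw")  # 1250 TB, i.e. ~1.25 PB
```

This kind of arithmetic is why a 500TB usable pool plus its matching backup system lands in the ~1PB-raw range discussed in the Recommendations below.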
The following diagram shows the various requirements of an academic storage system and how the Campus Storage Pool would implement them in software, mostly running on the IO nodes that provide specific services. link:campus-storage-pool.html[See also the Campus Storage Pool Technical Diagram].

.DM
[IMPORTANT]
Add a brief sentence on the point of this graphic? Is it that, "As this makes clear, no single service covers all storage requirements, so a multifaceted solution is necessary."?

.LS
[IMPORTANT]
DMP Tool is not correctly placed here. It's not a service for long-term back-up or an archiving solution. It's a document template to assist PIs with writing curation plans that fulfill funding mandates. Replace with a specific named tool, like the DPN nodes in the national consortium of preservation networks, or with something generic like "Digital curation services" to cover the suite of tools and support available. Under File Sharing, I suggest adding wording about Open Access Data Repositories or DASH. hjm - I haven't edited the diagram, but will shortly.

.SBS
[IMPORTANT]
As I see it Laura is saying that DMP should not be here at all; see table suggestion; I started but cannot deconvolute the Venn diagram to complete…I admit the diagram looks nicer but it depends what we want readers to get out of this

image:images/storage-services.venn.png[Storage Services Venn diagram]
//image:kdirstat-main.png[kdirstat screenshot]

.Figure 1.
**********
Overlapping service requirements by Application and Service (Cloud or local). Labels at the outer lobe of each ellipse denote the generic service and some local examples. Black labels show commercial/cloud services for comparison. Bold blue names are Open Source services that could be implemented locally to replace the commercial service.
**********

=== Competitive Risk

There are three risks in not implementing this.
First is the risk of not providing what is increasingly considered a university 'birthright' - part of the basic research infrastructure. This results in non-competitive grant applications and an inability to compete for attractive hires. The second is the financial risk of not providing backup of research data. Currently treated as the very poor cousin of administrative data, research data is the data that actually 'brings in' money, although the 'dollar density per byte' is usually much lower. The third risk is the fiduciary risk of not protecting data that must be shielded for intellectual property, legal, or security reasons.

=== Recommendations

We advise that this is Priority #1. We recommend the immediate funding of a baseline 500TB Campus Storage Pool and matching backup system, based on a review of the technical details link:hpc-backup-system.html[similar to the smaller system described in detail here].

==== Cost

Based on the technical details mentioned above, this system would cost on the order of $300,000 for ~1PB of raw storage & networking hardware, and another $200,000 for additional software, support, documentation & faculty assistance, networking support, and software integration, for a total of about $500,000.

==== Delay

The delay largely depends on funding, but also on review. OIT has already started an internal review of aspects of the backup system, but the technical review for the Campus Storage Pool has not begun. We estimate the time to review, spin up a test platform, and agree on such a plan at 3 months, given the equivalent of 2 FTEs for that period. The delay from that point until implementation is about 4 months, depending on how quickly agreements with vendors are signed.
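As a quick sanity check on the cost estimate above, the unit cost can be worked out directly; a minimal sketch using the draft's round numbers (estimates, not vendor quotes):

```python
# Unit-cost arithmetic for the proposed Campus Storage Pool.
# All figures are the draft's rough estimates, not vendor quotes.

hardware_usd = 300_000   # ~1 PB raw storage + networking hardware
support_usd = 200_000    # software, support, documentation, integration
raw_tb = 1000            # ~1 PB raw capacity, expressed in TB

total_usd = hardware_usd + support_usd
per_tb = total_usd / raw_tb

print(f"Total:      ${total_usd:,}")   # $500,000
print(f"Per raw TB: ${per_tb:,.0f}")   # $500/TB
```

At roughly $500 per raw TB including support and integration, the recharge rate for researcher-purchased disk-equivalents can be derived directly from figures like these.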
[[curation]]
== Curation

.SBS
[IMPORTANT]
This is fine but would be good to try to make the format a bit more consistent with other sections, so need followed by action-items statement, to allow people not in the curation or library business to understand how they would interface with these action items.

.DM
[IMPORTANT]
Start with a single sentence explaining what the need here is, like I proposed in the Data Storage section above? I think if I were reading this without being part of the discussions I would wonder what data curation even means at this stage. Are you after something like, "As a public university generating a vast amount of important research data and results, we have an obligation to make this data available to others, but efforts in this area are again left primarily to individual faculty. We need RCI investment to support data curation and sharing."

=== Vision

In the same way that researchers need web browsing or analytical tools such as Firefox or MATLAB, the data management, archiving, and re-use requirements of funding and publishing organizations demand a similar set of tools and support. These tools not only decrease time-to-publication, but also increase UCI's visibility with respect to data availability and its competitiveness for new grants.

=== Current

.DM
[IMPORTANT]
Maybe it's necessary because curation involves so many different things, but this section/set of recommendations keeps seeming particularly nebulous to me. In the data storage section there's a particular system we'd be building/providing, but here it reads more like "we're going to do stuff to try and help people". Is it possible to make this more concrete at all, or does the nature of curation mean it has to be left a bit vague?

UCI researchers are increasingly being asked to manage and share their research data in order to comply with funding agency requirements link:#ref1[Ref 1] and system-wide Open Access policies link:#ref2[Ref 2].
Grant agencies direct researchers to document plans, and demonstrate implementation, for disseminating and preserving their work.

.Example 1.
[TIP]
The http://www.lib.uci.edu[UCI Library] is working with http://www.icts.uci.edu/[ICTS] to http://previous.lib.uci.edu/dss/index.html[develop procedures] to identify appropriate publication formats for deposit, verify citation information, and document required persistent link identifiers.

.Example 2.
[TIP]
Librarian expertise is integral to creating scholarship tools based on digital collections, such as the current NEH-funded project creating http://partners.lib.uci.edu/newsletters/15_spring/10.html[linked data and visualization tools] for http://calisphere-test.cdlib.org/collections/26147/[analyzing digital representations of artists' books].

The wait time grows to fulfill project requests such as: designing infrastructure and digitizing content for an international online https://mellon.org/grants/grants-database/grants/university-of-california-at-berkeley/11500130/[Critical Theory Archive]; integrating crowdsourced translation tools for Ming dynasty organizational names within the http://isites.harvard.edu/icb/icb.do?keyword=k16229[China Biographical Database]; and assisting a doctoral student in procuring social science and humanities data (local crime statistics, transcripts of Dragnet radio shows) to map perceived and actual crime locations over time and to create novel publishing mechanisms for interacting with the model.

Campus support for the increasing load of data and digital asset management is currently distributed, loosely coordinated, and not staffed to the level of peer institutions. link:#ref6[Ref 6]

=== Competitive Risk

Failure to provide such implementation plans risks the loss of current funding and the inability to win subsequent grants. For example, the NIH recently announced that it will suspend funding for awardees who do not document deposit of papers into PubMed Central. link:#ref3[Ref 3]
Professional staff are also needed to perform the enabling activities that keep data usable when large discipline-specific repositories aren't available or suitable. link:#ref4[Ref 4] This is especially germane for Arts and Humanities data. The Libraries administer a number of digital repositories and metadata services which preserve content and make it discoverable. link:#ref5[Ref 5] Repositories and metadata alone, however, won't fulfill a vision of cyber-research. There are new modes of inquiry and research in which scholars (individually and collaboratively) engage with online libraries/archives using methods like augmented editions, data mining, visualization, deep textual analysis, concept network analysis, and mapping.

=== Recommendations

The key missing piece for data curation and management is staffing. Staff are required to assist faculty directly, to develop the tools and workflows required for data management, and to manage campus training programs on data management plans and open access distribution as funder mandates increase. In providing data storage for curation and management there is considerable overlap with the Research Data Storage section above, but much of the functionality for data curation is entirely separate. We recommend:

- Immediate funding for a Library Data Curation Specialist to support funder compliance, manage collections of campus-produced data, work with the Office of Research to implement a data management training program, and promote open access.
- Designing a campus space to bring together and highlight the data and digital content management tools and services now distributed across the Libraries, OIT, Humanities Commons, and other units.
- In the longer term, funding a Digital Humanities Librarian and additional programmer/analysts to develop scholarship-enabling tools over digital collections.

==== Cost

Laura: Can you estimate a cost for each of the things you request above?
==== Delay

Laura: Can you address how long you think it would take to

.References
****************************
[[ref1]] 1 - A growing number of government and private funding bodies have Data Management requirements. For example, DOE, https://www.nsf.gov/eng/general/dmp.jsp[NSF], NIH, NEH, the Gates Foundation, the Howard Hughes Medical Institute, the MacArthur Foundation, and the Gordon and Betty Moore Foundation mandate preservation and sharing of results. These plans are often part of the competitive process, and a lack of verifiable Data Management support will decrease the fundability of the applicant.

[[ref2]] 2 - http://osc.universityofcalifornia.edu/open-access-policy/index.html[UC Open Access Policies]

[[ref3]] 3 - "For non-competing continuation grants with a start date of July 2013 or beyond NIH will delay processing of an award if publications arising from it are not in compliance with the http://grants.nih.gov/grants/guide/notice-files/NOT-OD-13-042.html[NIH public access policy]"

[[ref4]] 4 - Such as a http://bit.ly/1o34oOR[small set of microscopy images] used to devise new methods for evaluating cytoskeletal orientation, deposited in the UCI DASH data-sharing repository.

[[ref5]] 5 - Calisphere, DASH data sharing, eScholarship, EZID, and Merritt are provided centrally by UCOP/CDL. UCI contributes to the development, governance, and promotion of CDL-hosted repositories and tools while managing local configurations and workflow. UCI fully administers the UCISpace repository.

[[ref6]] 6 - UCI Digital Scholarship Services has 3 FTE. Purdue has 9 FTE. U. Oregon has 10 FTE.
****************************

[[computation]]
== Research Computation

=== Vision

As noted in the Executive Summary, computation is an integral aspect of modern science, and computation requires CPU cores: the more, the better.
Medicine and biology, domains that until recently used little computation, are now the HPC cluster's https://hpc.oit.uci.edu/accounting/cpu-usage/usage.summary.txt[largest consumers of CPU cycles], and other domains that previously had almost no large-scale computational requirements (Arts, Social Sciences) are using massive social media databases to study trends and relationships in every conceivable arena. Analyses of this scope, whether in social & legal networks, molecular dynamics, healthcare, business intelligence, economics, or the increasingly creative arts, are computationally intensive.

Faculty are very supportive of the idea of a 'baseline' RCI, and that baseline computational allocation increases CPU requirements that much more. In addition, the notion of a link:#researchdesktop[Virtual Research Computing Desktop - see below] backed by the power of a large computational resource adds further requirements on the number of cores that must be available to provide reasonable response time.

Increasingly, as has already started in web and commercial projects, we see more (semi-mobile) devices providing the interfaces, connecting to large central compute resources that provide the analytic power required to run these increasingly large compute-bound jobs. These heterogeneous mobile devices will increasingly be bought by the end users but, via standard protocols, will be used to access web-based or native graphical user interfaces running on powerful multi-processor back ends in *secure* environments, allowing the analysis of restricted and/or proprietary data.

While centralized resources like a compute cluster are an attractive mechanism for addressing multiple requirements for raw processing, a large number of needs are not well served by them.
Efforts that require real-time processing, those that require specialized data pipes (as for multimedia editing), those that require dedicated hardware for 3D visualization, etc. are not well served by an over-emphasis on clusters. The optimal research environment would provide enough resources that compute jobs do not languish for days in wait queues, that computationally intensive jobs can run to completion instead of having to be diced into smaller ones, and that the infrastructure is renewed regularly (with additional resources provided by Investigators as needed) to maintain the state of the art. // Research computation provides the computational power used by our research, and campus support for this area is far behind that of peer institutions. Computation is essential to collecting, processing, and analyzing increasingly large and complex data with appropriate speed. Broadly defined, RCI includes computational resources that support these endeavors, from smartphones and tablets, laptops and desktops, and shared servers and clusters, to national and international facilities. Scientific computing needs are growing exponentially, and High Performance Computing (HPC) resources are required to address this.

.SBS
[IMPORTANT]
This strikes me as a section that might suggest some goal for the increase in computation? Some metric of increase. Otherwise it just sounds like we need a little more and sustainable and we can keep up. We said Purdue was 10X GP.

.DM
[IMPORTANT]
I agree. How can this be projected to grow? Doubling in computer power every N years? Financial investment to track with total research outlays? Some sort of rule of thumb?

// 7.8 GFLOPS/core = ~500 GFLOPS/node (64 cores) (http://goo.gl/h4L55v)
// 7000 cores * 0.0078 TFLOPS/core = 54.6 TFLOPS for HPC.

=== Current

.SBS
[IMPORTANT]
Pls rephrase; not clear if shortfall now is 100 tFLOPS? And the shortfall is increasing, or we have 100 tFLOPS? Is this combining GP and HPC, or?
*Dana:* FLOPS is usually a rate – operations per second; what is the context in this case?

.DM
[IMPORTANT]
This won't mean anything to most people. Comment on how much this is, i.e. is it twice the size of our current resources? And maybe provide a "roughly the equivalent of X desktop computers" number?

The convergence of data-driven experimental science coupled with high-throughput technologies and computer-driven simulations has created a huge need for computing power which is currently not met at UCI. Despite the growing demand for scientific computation, UCI has only two major computational facilities, the Physical Sciences https://greenplanet.ps.uci.edu/[GreenPlanet] and campus https://hpc.oit.uci.edu/[HPC] clusters. Both of these facilities are operating with aging hardware. HPC currently has a theoretical peak speed of about 55 TeraFLOPS (TeraFLOPS = 1 trillion Floating Point Operations per Second; a modern desktop CPU core delivers roughly 10 billion FLOPS), and GreenPlanet is smaller. UCI presently faces a shortfall of at least several *hundred* TeraFLOPS in computing capability. This shortfall is limiting the ability of UCI faculty to perform their research and to compete for extramural funds. Many competing institutions are much better equipped; for example, Purdue, which is of comparable size to UCI, has the https://www.rcac.purdue.edu/compute/conte/[Conte Community Compute Cluster] providing an aggregate 943 TeraFLOPS (including 2 Phi accelerators per node) - more than 17 times the speed of HPC. In terms of non-cluster resources, some of the problems reported in the humanities are addressable by the improvements in the Research Data Storage section, but there are also specific needs that do not map well onto compute clusters. These fall mostly into the areas of multimedia work, real-time processing of input data (a la the Internet of Things), and improvements in networking for large-scale collaboration.
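To make the TeraFLOPS figures concrete, the arithmetic behind such peak-performance estimates can be sketched in a few lines of Python (the core counts and per-core rates below are illustrative assumptions, not audited benchmarks):

```python
def peak_tflops(cores, gflops_per_core):
    """Theoretical peak performance in TeraFLOPS:
    number of cores x per-core rate (GFLOPS) / 1000."""
    return cores * gflops_per_core / 1000.0

# ~7000 cores at ~7.8 GFLOPS/core puts the HPC cluster near 55 TFLOPS peak
hpc = peak_tflops(cores=7000, gflops_per_core=7.8)

# a 4-core desktop at ~10 GFLOPS/core is ~0.04 TFLOPS, so a
# several-hundred-TFLOPS shortfall equals thousands of desktop machines
desktop = peak_tflops(cores=4, gflops_per_core=10)
```

Note that sustained performance on real workloads is always some fraction of these theoretical peaks, so they are best read as relative comparisons between facilities rather than absolute capability.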
=== Competitive Risk

Simulations on tens of thousands of cores are becoming the new 'de facto' standard for computing-enabled and data-driven science. Science at this scale simply is not possible at UCI right now. Single-investigator funded node purchases help maintain the status quo, but their volume is too small to shift UCI's competitive position. In the past, local compute resources were used as testbeds for optimizing codes for very large analytical runs, which were then moved to National Supercomputer Centers; while this still happens, the scale of all computation is increasing to the point where that model is also being overwhelmed. This need was highlighted in the recent 'Research CyberInfrastructure Vision Symposium', at which new faculty described their surprise at UCI's limited computational resources, especially the complete lack of *secure* computing facilities where HIPAA/FISMA and other forms of restricted data can be analyzed. In fact, many are continuing to rely on resources at their previous institutions (University of Washington, University of North Carolina, UC Berkeley) through professional contacts. This is not only detrimental to our reputation, but creates potential security risks, and is closely related to the section on link:#researchstorage[Research Data Storage] above. The obvious risk in allowing UCI's computational resources to wither is that our computational scientists will no longer be competitive nationally, that incoming grant dollars will have to be shared with those institutions that have the facilities, and that research involving restricted data cannot easily be performed at UCI.

.SDF
[IMPORTANT]
As I recall, the reliance on non-UCI resources was more in terms of storage than computational cycles and this is in keeping with the comment about security risks. Thus, we might want to move the last 3 sentences to the section on storage rather than bundle it with compute cycles.
=== Recommendations

The maintenance of basic research facilities such as buildings, lab space, and shared research facilities including the Laboratory of Electron and X-Ray Instrumentation (LEXI), Transgenic Mouse Facility, Greenhouse, Optical Biology Core Facility, and Genomics High-Throughput Facility is essential to UCI's success. Computational facilities should be considered a comparable and essential aspect of UCI's basic research facilities, and maintained accordingly. Adequate computational hardware is just as important and, where lacking, can limit the impact of these other research resources. A major investment in support of maintenance and expansion of high-performance compute clusters is needed. Annual funding must also be identified for the appropriate level of support staffing and to enable the computational infrastructure to remain current. This does not require wholesale renewal each year, but it should provide basic capabilities for all researchers and a framework that can be augmented by grant funding. A baseline compute-hour allocation on Linux compute clusters should be made available to all researchers, with additional computation required by specific projects addressed through grant funding and other mechanisms. Additional capacity must be allocated for educational use to facilitate teaching of modern parallel computing and related techniques.
An increasing amount of social, medical (as opposed to specifically patient), and physical research data has special security requirements to satisfy federal funding agencies. In order to remain competitive, campus investment is also needed to establish demonstrably HIPAA/FISMA-compliant secure cluster resources.

==== Cost
See link:#budget[Budget].

==== Delay
Additions to clusters can be completed on the order of 2 months. Providing secure compute facilities will require a careful review of the current Data Center and advice from UCI's security team as to how to provide more security without making access so difficult as to be unusable.

== RCI Working Environment

=== Vision

.DM
[IMPORTANT]
In keeping with my suggestions above, might want a single sentence summary to motivate this section before giving background.

Researchers use, and feel quite strongly about, the interfaces to their computing devices. RCI should strive to provide as much functionality as possible, as unobtrusively as possible. However, there are areas where the native interface either does not exist or cannot be scaled economically; in those cases, the RCI environment should be driven by long-term functionality, with instruction to bring researchers to effective use of it. //I speak of the dreaded 'commandline'. Besides lacking the underlying hardware to accomplish a task, the other major impediment to RCI is software. Many researchers, for reasons of history, dependence on certain libraries, familiarity with interfaces, or functionality, will require access to proprietary software. When this is the case, UCI should strive to provide that software at the lowest price via bulk or network licensing. Where the demands of the work require proprietary tools that cannot be economically licensed, it is not unfair to require the small number of beneficiaries to contribute to the cost.
Working environments also include data sources (licensed at a cost, open source, and even locally developed) which may be broadly used or specific to particular disciplines. Whether such components are part of a centrally directed campus-wide cyberinfrastructure or are best viewed in terms of locally directed components, all should be coordinated and integrated at a campus level. Again, a secure link:#researchstorage[Campus Storage Pool] goes a long way toward providing the infrastructure to do this. In many other situations, https://en.wikipedia.org/wiki/Open-source_software[Open Source Software] (OSS) should be strongly considered and, where appropriate, promoted. This makes services much more scalable, both by reducing legal exposure and by reducing the human support needed for license management. While most OSS is available for Linux, it exists in almost comparable amounts for Mac OS X and in a surprising amount for Windows. OSS is advantageous not only because it is free, but because analytical software tends to appear first on Linux as OSS and only (sometimes years) later is wrapped into proprietary form for Mac OS X and Windows; the ability to use this software in its command-line form on Linux therefore confers a months-to-years time advantage, as well as a financial one. // The working environment that researchers use varies considerably across and within disciplines. In almost every case, the most immediate aspect is the operating system, its user interface, and the application software present on personal computer systems and, increasingly, on various other 'personal devices.' While the majority of faculty find themselves preferring Apple OS X or Microsoft Windows, as discussed above, Linux is the system of choice for high performance computing and for those who are using specialized or open-source software (including software they develop). Campus RCI must include support for all of these working environments. //.DM //[IMPORTANT] // I still don't totally get the point of this paragraph/section.
Is it just that we have to support faculty using different OSs? Or is it that we need to provide the software faculty need? I get that people use some particular computing environment, but I think we might need to get to the aspect of this that requires campus support more quickly (or at least give an example of it) to avoid losing people. Right now I still read this first section down to the underlined "...Research remote..." text as basically saying, "People use lots of different stuff for research. We need to help them collaborate, share, and visualize data." //In addition to operating systems with the user interfaces which researchers most easily interact, One class of applications and services that deserves special mention is the class that enables collaboration, both on campus and with colleagues worldwide. While there are significant variations in collaboration practices and preferences across and within disciplines, a very important aspect of RCI is facilitating collaboration, from interpersonal interactions, to data sharing, to information dissemination.

.SBS
[IMPORTANT]
Good if Crista Lopes could look at this particular para - she has commented that software for interactions is key.

Visualization software and facilities are additional key requirements in RCI working environments.

[[researchdesktop]]
=== The Research Desktop Computing Environment

Maintaining research computing environments with the requisite software requires significant administration and upkeep. One method that has proven effective here and elsewhere is the provision of standardized 'virtual desktop' environments using Remote Desktop protocols.

.SBS
[IMPORTANT]
Here we say that we have and are using this effectively; just below we say we will establish it. Could we have an example of how this happens currently? I am only aware of admin software unless we mean like Galaxy or HPC software? How would the new system improve or differ?
This is an efficient mechanism that provides an exportable display from a large server or cluster which can be brought up on any device, anywhere. It centralizes storage for convenience, cost, backup, and security, and it reduces administration costs. This approach can be used to provide applications for native Windows, Mac OS X, and Linux. It also allows sharing of research software licenses across a large set of users who make occasional use of a particular title, lessening overall campus costs for software that is not being used constantly. The implementation mechanism is as simple as placing an icon on the Desktop of a personal computer, regardless of OS. Activating that icon starts a Remote Desktop application that presents another Desktop as another application window. In that application window, all the required research applications would be presented as further icons or in the familiar nested menu system. The applications started on that Desktop would execute on the CPUs of the cluster and would have access to both interactive and batch sessions. These Desktops are long-lived - they can be closed and then re-activated at another location, doing exactly what was being done previously. The Research Desktops have access to the same or similar facilities and collaboration tools as the native desktop.

=== Current

UCI does host significant components of a robust RCI: the campus wireless and wired network with good connectivity to CENIC and Internet2, the http://www.oit.uci.edu/network/lightpath/[LightPath] high-speed science network, systems (including compute clusters) housed in the http://www.oit.uci.edu/oit-data-center/[OIT Data Center], and staff in OIT and the UCI Libraries.

.DM
[IMPORTANT]
I am unclear as to the point of this section. Is it just that networking is important for access to external resources? Also, the LightPath references make it sound like networking is good, whereas really we need much better networking for most researchers.
Only a few are on LightPath. In addition to network connectivity, providing access to off-campus resources includes addressing contractual agreements, security measures, and access permission (authentication and authorization). UCI's participation in Internet2's federated identity management confederation (InCommon) allows UCInetID credentials to be used to access external resources. These range from the worldwide WiFi access provided by Eduroam to InCommon's 'Research & Scholarship' service providers (e.g., the GENI Experimenter Portal, the Gravitational-wave Candidate Event Database). However, in terms of addressing RCI requirements, UCI has left this largely in the hands of the individual Schools. Some schools (Physical Sciences) have internal staff to address and assist with research-related problems, but many have only 'Computer Support Coordinators' to address Desktop-level and 'MS Office'-level issues. One of the persistent problems faced by faculty in the latter group is that they do not know whom to ask for advice with their research computing problems. While the http://www.oit.uci.edu/research-computing-support/[Research Computing Support] group has been active for almost a decade, this is still an issue that needs to be addressed. The http://datascience.uci.edu/[Data Sciences Initiative], via its http://datascience.uci.edu/education/short-courses/[Short Courses] outreach program, has been a significant driver in educating the UCI research community in the use of various techniques, and especially in the use of OSS for data analysis. These courses are often over-subscribed and the feedback is extremely positive.

=== Competitive Risk

Like other resources, if the applications, and especially instruction in those applications, are not supported, UCI researchers will not have (or will not be able to use) some of the very powerful tools that they could exploit.
One way of looking at this is in the preparation of future scientists: they can be taught to use these powerful OSS tools and thereby be freed for the rest of their careers from licensing costs, or they can spend those funds on licensing, which locks them into a proprietary system that requires funding every license period. Similarly, since we cannot ignore some of the very powerful proprietary tools, we can make access to them, and to the OSS ones, as simple as possible by bundling the interface into the virtual desktops.

=== Recommendations

We propose to make a new Research Virtual Desktop service available to the campus to facilitate access to well-maintained software in an integrated environment.

==== Cost

The software cost for implementing the Linux version is zero, since it is OSS. The hardware costs are fairly low, since we can run much of this software on older servers that we have in excess. Licensing costs for the proprietary software depend on whatever deals can be made with the vendors. However, providing better OSS tools and making them more easily available to students will put more pressure on proprietary vendors to decrease prices, as we have seen with the cost of operating systems going to near zero. Windows-based remote desktops will cost more, though how much more depends on the agreements with Microsoft as well as with the actual software vendors. The main cost of implementation is the human cost of setting it up and documenting it to the level that it becomes easy to use. After that, the human costs are for setting up the software on the back-end cluster, which has to be done anyway.
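The 'long-lived Desktop' behavior described above can be approximated today with standard tools; as a minimal client-side sketch (the host name is a hypothetical placeholder, and a production service would use a graphical remote-desktop protocol rather than tmux):

```python
# Sketch only: build the command that attaches to (or creates) a
# persistent named session on a remote login host. Closing the window
# does not end the session, so it can be resumed from another machine,
# picking up exactly where the user left off.
def desktop_command(user, host="rdesktop.example.edu", session="research"):
    # 'tmux new-session -A -s NAME' attaches if the session already
    # exists, otherwise creates it -- the text-mode analogue of a
    # persistent graphical Research Desktop.
    return ["ssh", "-t", f"{user}@{host}",
            "tmux", "new-session", "-A", "-s", session]

# e.g.: subprocess.run(desktop_command("uci_netid"))
```

The same resume-anywhere property is what the graphical Research Desktop provides, with the display protocol carrying a full desktop instead of a terminal.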
==== Delay

There will be some delay in testing and selecting the best Desktop software to use, but since this project is already ongoing in RCS, we can move forward fairly quickly; test versions could be ready for use from the HPC cluster within 2 months, with widespread availability in 6 months, based on availability of login servers.

[[staffing]]
== RCI Staffing Needs

=== Vision

.DM
[IMPORTANT]
Provide sense of scale? i.e., "Even if we conclude that Berkeley is bigger than UCI (1600 FTEs vs our 1300?) and we ought to have correspondingly fewer staff devoted to RCI, we still ought to have more than 10 FTEs supporting RCI."?

*People* are the part of RCI that enables it to be used effectively by researchers. 'Research Computing Support' is different from most 'Computer Support' in that staff are not only asked to debug strictly computational problems; the problems require an in-depth knowledge of how CPUs, file systems and formats, schedulers, the Linux OS, the various cache levels, utilities, networks, compilers, libraries, provisioning systems, data flow, and of course the applications all work... and/or don't. Added to this are the vagaries and requirements of the underlying science. Add the requirement of being able to clearly document complex procedures, and the necessity of teaching these techniques, and you may understand why well-qualified people are hard to find. Much RCI support is catalytic - a little knowledge can make an insurmountable problem disappear - but a good part of it is ongoing and quite demanding, such as programming support, investigating new techniques for improving and maintaining RCI, and especially 'answering questions from researchers'. Optimally, there would be enough RCI staff to maintain the RCI - the clusters, the networks, the storage systems, computer security - as well as to engage much more with researchers, which is currently not possible beyond directly answering their most immediate questions.
Domain specialists in the areas of highest RCI use, such as bioinformatics, engineering, and physics, with particular training in the most popular applications such as MATLAB, R, PyLab, etc., would be an enormous help to computational researchers, whose numbers are increasing steadily (HPC alone has almost 2000 registered users, of whom about 500 use the cluster every month). // The Linux OS is widely regarded as the most important environment for scientific computing. Because of the low cost and high scalability of Linux-based clusters, the ease of writing software for the command line (vs graphical applications), and the speed of Open Source Software (OSS) propagation, the overwhelming majority of research software is written first for Linux. It is necessary to train and retain Linux-savvy system administrators and programmers who are able to set up and maintain Linux-based software, servers, and large filesystems, and to configure environments.

=== Current

UCI's current level of RCI staffing is substantially below that of comparable peer institutions (UCB, UCLA, Purdue) and others, http://goo.gl/FxgfOm[based on a variety of metrics]. We have approximately 3 FTEs assigned to supporting the GreenPlanet and HPC clusters across Physical Sciences and OIT, and a similar number assigned to supporting researchers in other ways. Purdue and Indiana University have 20 or more staff assigned to these tasks. UCLA has an RCI team of 21 FTE, 11 of whom focus on compute clusters and other high-performance services. UCB has 21 individuals (15 FTE) on their Research IT team. Current Physical Sciences and OIT RCI staff support is insufficient to cover the areas for which they are completely or partially responsible, including computing architecture and operation, research data storage, transfer, and programming support, and Geographic Information Systems. Additionally, there are 3 FTEs supporting RCI within the Digital Scholarship Services unit of the UCI Libraries.
As an exemplar, Purdue has a dedicated data curation center and 7 FTEs supporting data curation work: 4 data specialists, a software developer, a data curator, and an administrator.

.DM
[IMPORTANT]
Is this saying that library FTEs count separately from those mentioned in the above paragraph?

Current UCI Libraries staffing is insufficient to cover the areas for which they are responsible, which include: promoting researcher compliance with funder data usage requirements and best practices in data curation; training in data curation; promoting Open Access repositories; development of tools to describe data domain ontologies; development of robust institutional repositories and exhibits, particularly with humanities applications; and negotiating software licensing from external sources.

=== Competitive Risk

There are two vectors of risk. The most obvious is that with so few RCI support people, it is difficult for researchers to obtain more than glancing assistance from any of them. The second is that the actual infrastructure is at risk, since there are no spare cycles for proactive work on the RCI itself. This means that providing more hardware or proprietary software without matching FTEs will result in little improvement in usable RCI. //The third risk is that those RCI people already here may decide that their talents are better used in a place that supports them better. //It takes several months or more to train a person into a position in which s/he saves more time than s/he absorbs and

=== Recommendations

We recommend adding staff to assist with:

- sysadmin support for cluster, filesystem, and backup maintenance and upgrades
- installation and upgrades of existing software (>1000 packages and versions now on HPC)
- installation and networking of workstations for advanced instrumentation
- user training to introduce users
* to the Linux OS, cluster computing, and optimal data handling techniques
* to bash, Perl, Python, Jupyter, R, & MATLAB
* to Open Source visualization tools
- installation of, documentation about, and training with other Open Source tools
- answering researcher questions about techniques, errors, obtaining and using data sources, etc.
- assisting investigators with establishing appropriate levels of security to comply with funding agency requirements.

Beyond these general necessities, we strongly advocate adding the following 'Discipline-Oriented Specialists'.

.DM
[IMPORTANT]
**********************
Seems somewhat incongruent with the rest of the vision in that here, we're talking about filling a specific position and what expertise would be required -- something we do nowhere else above. *hjm*: Agreed. The 'Library Sciences Specialist' especially reads more like a job advertisement and as such, somewhat out of place here. The 'RCI Specialists' are described a little more generally, but still sound too specific for this kind of document. Comments?
**********************

*Library Sciences Specialist:* The library identified a need for an additional data science librarian with vision and leadership skills to grow library-related data management consulting services and administer data repository systems. This position would coordinate activities with OIT and Office of Research staff supporting data management, and liaise with staff within the schools. The individual filling this position should have expertise in research funder requirements for data preservation, project management, functional requirements specification and application development, metadata, and digital preservation.
In addition, the position requires experience administering the common open source repository systems for the management, discovery, and access of UCI-produced data.

*RCI Specialists:* Full RCI support requires professional-level specialists with computer science skills, but also disciplinary backgrounds (e.g. math, chemistry, biology, social sciences, engineering, humanities and the arts), working in partnership with research teams. These staff positions might be jointly sponsored by RCI and the Schools. They could include accomplished part-time graduate or even undergraduate appointees. We refer to these as RCI Specialists. Examples of important skills, in addition to the discipline-specific domain knowledge stipulated above, would be: high-performance programming skills and techniques (OpenMP, MPI, GPGPUs, etc.); high-performance networking, data transfer, and storage; statistical and mathematical computing tool utilization; scientific visualization; Linux and open source software; database design, construction, and utilization; and website development. We expect to increase RCI support service staffing, for both mission-critical RCI services and RCI Specialists, incrementally over the next several years, based on annual assessment of unmet needs and the effectiveness of current services.

*Outreach Specialist / Concierge:* The increasing complexity of RCI requires systematic outreach so that the full potential of the RCI investment is realized. An RCI staff specialist will be tasked with coordinating outreach to the campus community regarding RCIC facilities and resources, and will be the main point of contact for anyone wanting guidance on any aspect of RCI. As such, this person must be familiar with most aspects of RCI on campus and know the principals well.
This person, or another person familiar with the software, resources, and techniques, will provide outreach to schools and distributed research staff, and will be responsible for maintaining an RCI Program Website containing: a directory and links/descriptions of relevant services and resources; federal guidelines and templates to facilitate compliance with data use policies; links/descriptions of shared software and the means to request assistance with new acquisitions; a comprehensive calendar of RCI-related workshops and events; and a current campus RCI site map. Such a specialist would also develop a recommended minimum training program for incoming faculty, students, and staff to ensure baseline awareness of resources and best practices.

*The team:* All staff at UCI who support RCI, whether they are in the central RCI unit, other units such as the UCI Libraries, a school, a research group, or an ORU/Institute, would be considered part of an extended UCI RCI support team. The RCIC Director would facilitate this approach, which would be further strengthened through joint central/school assignments where appropriate.

==== Cost
See link:#budget[Budget].

==== Delay
Depending on the administrative level of this unit, it could be in place fairly rapidly or be delayed for a decade. *HJM* I have no experience to estimate this at all.

== Education and Training

The need for computational and data analysis skills is increasing rapidly and impacting almost every research discipline. There are very strong statistics and computer science departments at UCI in which students systematically learn from experts and a thoughtful curriculum. However, no solution is currently in place for incoming students in disciplines that have not been traditionally computationally oriented, or for postdocs, faculty, and others who have limited time for formal classes but who, going forward, require working knowledge in these areas.
=== Vision

Incoming students, undergraduate as well as graduate, have execrable computer skills, if any. At most, they have been trained on 'MS Office', which is close to useless for modern data analysis and visualization. Since the university cannot turn them away for not knowing about computers, we must teach them, and that requires instructors, time, and classrooms. The classrooms are fairly easy to find; instructors with time are not. This is where the RCIC staff can perform what can honestly be termed a transformational service: to transform computer-naive students into those who have a chance of dealing with modern data. Such training will require learning new applications on their personal devices, but mostly learning Linux and how to use it for data analysis. What we aspire to is the graduation of a group of data-savvy students: they know about modern data formats, cleansing, regular expressions, transformation, effects of caching, IO problems, parallel operations (if not programming), how to move data effectively, encryption, compression, checksums, and of course the analysis, statistics, and visualization of large data sets. //Training facilities: Currently, most classroom buildings have wireless access that can continue to update. However, there is also a need for specialized training in HPC computing that is not currently met. Some number of classrooms should be available to enable actual teaching in the HPC environment for state-of-the-art education, but some require upgrades. Regular and Computer Classrooms are in great demand, and bringing the new Instructional building online is an important goal. A subset of computer classrooms should be equipped with visualization and scientific software to facilitate RCI training. RCI training will include:

- the Linux OS, basic bash commands, utilities, bash programming, internet tools, etc. (Software Carpentry courses provide a framework).
- leveraging Macintosh and Windows environments for scholarly and scientific applications, using the macOS Terminal and the Cygwin environment on Windows
- cluster computing: use of the scheduler, debugging, batch scripts, filesystems, data movement
- data analysis and visualization techniques, and basic interpreted languages such as Perl, Python, and Julia
- installing, compiling, and debugging Open Source software
- data asset management
- scientific and scholarly software applications, including Matlab, Mathematica, R, SAS, ArcGIS, Perl, and Python
- collaboration and sharing: available tools, how they can be leveraged in collaborative research, and policies regarding security, confidentiality, open access, and ownership of intellectual property
- procedural training: from how one accesses various components of cyberinfrastructure to the licensing or purchasing of software or devices. Such procedural training needs to be closely integrated with educational outreach and information dissemination about existing capabilities
- using cloud services for academic analysis
- collaborative training with specialized groups, for example using:
* statistical software with ICTS / the Department of Statistics
* genomics open-source software with the Genomics High-Throughput Facility

.DM
[IMPORTANT]
In my area, python is mostly replacing matlab/mathematica and can cover a lot of what R does.

=== Current

Whether this proficiency is demanded as a program requirement or simply made available on an ad hoc basis, these introductory courses are critical to launch students on the right path to competency. Generally, with a few such short courses under their belts, students can Google their way to proficiency and even expertise. For students entering a numerically based program, it is a good idea to make such courses compulsory; without them, later courses will be a nightmare of catch-up and backtracking.
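To make the target skill set concrete, the following is a minimal sketch of the kind of command-line stream processing such short courses aim to teach. The data file and column names are invented for the example; it compresses a tiny CSV, records a checksum so the file can be verified after transfer, and summarizes one column without ever loading the whole file into memory:

```shell
# Create a small, hypothetical compressed data file for the demonstration.
printf 'sample,species\ns1,mouse\ns2,human\ns3,mouse\n' | gzip > samples.csv.gz

# Record a checksum so collaborators can verify the file after a transfer.
sha256sum samples.csv.gz > samples.csv.gz.sha256
sha256sum -c samples.csv.gz.sha256

# Stream-process: decompress, skip the header, select column 2,
# then count occurrences of each value, most frequent first.
zcat samples.csv.gz | tail -n +2 | cut -d, -f2 | sort | uniq -c | sort -rn
```

Because each stage (decompress, skip header, select column, sort, count) runs as a stream, the same pipeline applies unchanged to files far larger than memory.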
There are some good introductory and advanced classes already being taught outside official channels, mostly by student instructors. Since faculty have little incentive to contribute to this effort, the natural alternative is an organization like the RCIC: RCIC staff can teach the introductory classes and act as dedicated sysadmins, adjuncts, and TAs for other courses with substantial computational depth.

.DM
[IMPORTANT]
I don't totally get this. If I need software I bring it in (on my laptop), and I wouldn't use it even if it were on the computer already in the classroom. It seems like most faculty in the sciences, or at least in my slice of the sciences, are this way. Who actually wants this?

// Education and training for base-level proficiency in the use of computational tools is a necessary aspect of preparedness to function in the modern world. At UCI this training should be integrated into the undergraduate and graduate curriculum. The university should establish a proficiency requirement for a minimum skill set in computation and use of computational tools for undergraduates at the earliest possible time.

Formal graduate or undergraduate courses related to interdisciplinary computation (i.e., not only computation/data courses for majors) will be listed on the RCIC website. Examples could include graduate courses in statistics for biologists from several departments, and in bioinformatics from ICS and from Epidemiology.

The RCI working group specifically addressed training in the context of faculty, undergraduate, graduate, postdoctoral, and visiting-scholar researchers. Education and training targeted to this group is typically either remedial and intensive, or targeted to specific and relatively immediate needs. Researchers find it difficult to accommodate a formal class schedule and expect to do significant amounts of learning independently. We acknowledge this and focus on three approaches:
. individual or small-group targeted training, initiated by investigators or prompted by RCI computing staff and domain specialists in response to perceived need;
. relatively short, intensive workshops (e.g., day- or week-long) for groups of ten to thirty, combining lecture and hands-on use of tools, with rolling online sign-up so that instruction begins as soon as a critical number of students is reached; and
. online training.

The RCIC would not be responsible for all of these modes of training but would ensure that the RCIC website lists the available modules, with links to sign-ups and schedules. Examples of currently available workshops that would be advertised and scheduled are statistics workshops with ICTS, Big Data with the Data Science Initiative, and BioLinux through the GHTF. Online courses in HPC operations currently exemplify the online training mode. Possible outcomes for researchers, in addition to enhanced skills, could include certification (as occurs for Data Science Initiative courses) or access to more specialized commercial software (in the case of the GHTF).

=== Competitive Risk

The glaringly apparent risk is that our students are unable to do research using the most basic of modern numerical tools. Excel is a useful tool, but not for assembling genomes; nor is it capable of social-network analysis on Facebook logs. For those kinds of approaches, we need students with experience on Linux using modern stream-processing tools.

=== Recommendations

RCI resources must be made available for teaching, both for education on RCI tools and techniques and as a platform for other subject matter. This includes providing access to compute clusters and RCI working environments for classroom use, and equipping instructional labs with visualization and other RCI capabilities.

==== Cost

Overwhelmingly, the cost here is human. The current RCS personnel are teaching multiple courses, but this is essentially a labor of love and is not scalable.
This works in the reverse direction as well: with campus RCI being used for instruction, instructional monies can be used to support RCI. See link:#Budget[Budget]

==== Delay

Since a few of the courses are already being taught in cooperation with the Data Science Initiative, the spin-up time for those courses will be short, but new courses are extremely labor-intensive and will probably take new hires months to develop or learn.

[[budget]]
== Budgetary Requirements and Funding

.SBS
[IMPORTANT]
Indicate how this relates to the amount spent at other organizations, if it is possible to find that information.

.DM
[IMPORTANT]
Clarify that more than just the FTEs will be recurring (?), i.e. we need regular refreshes of hardware.

=== Vision

*Planning Budget Scenario:* More planning and prioritization review is required to flesh out an augmented RCI budget, but we present the following scenario to facilitate discussion. It represents an additional investment of approximately $2.3M yearly and would fund the items itemized under Cost below (all salaries are approximate and include benefits at 50%).

=== Current

OIT's current annual RCI budget is approximately $700k, covering the costs of 3.8 FTE staff, hardware maintenance, and other operational expenses. Additional funding will be required to establish and maintain baseline services made available to faculty across disciplines. These services will form the foundation of UCI's RCI: guaranteeing access to all faculty, facilitating collaboration and data safety, and providing resources that lead to funded projects.

=== Competitive Risk

=== Recommendations

In addition to providing core funding to build and maintain UCI's RCI foundation, a critical goal for the first year of UCI's RCI program is to develop recharge models to fully leverage grant funding. Fee structures would fund access to cluster-computing cycles, storage, and staff services above baseline allocations.
==== Cost

--------------------------------------------------------------------------
Research Cyberinfrastructure Director ............................ ($180k)
Campus-Wide Storage System:
  1 FTE storage programmer/administrator ......................... ($130k)
  Storage hardware ............................................... ($200k)
Compute Cluster Support:
  2 FTE system administrators .................................... ($250k)
  Hardware refresh ............................................... ($500k)
Research Cyberinfrastructure Specialists:
  2 FTE research computing staff ................................. ($250k)
  1 FTE Data Curation Specialist ................................. ($130k)
Scientific/scholarly software licensing .......................... ($100k)
Research Information Security Compliance Engineer ................ ($150k)
Networking:
  UCI-Lightpath connectivity in additional campus buildings
    or UCInet enhancement in support of research ................. ($400k)

Total .......................................................... ($2,290k)
--------------------------------------------------------------------------

==== Delay

??