= A unified approach for a Secure Research Computing Environment
Harry Mangalam
v1.00, Dec 6th, 2017
:icons:
// fileroot="/home/hjm/nacs/Secure-Computing"; asciidoc -a icons -a toc2 -a toclevels=3 -b html5 -a numbered ${fileroot}.txt; scp ${fileroot}.html ${fileroot}.txt ${fileroot}.png moo:~/public_html

== Requirements for a Research Computing platform

Research Computing is undergoing some of the same pressures that are affecting computing in general, with some special considerations. Those are (or can be):

- support for Windows, Macs, and (increasingly) Linux as a research platform.
- support for Graphical User Interfaces (GUIs) as well as naked shell access.
- requirements for large numbers of specialized applications, many of which are Open Source and run on Linux, reflecting how research computation evolves.
- access to well-defined licensed software such as MATLAB, Mathematica, SPSS, SAS, and Systat.
- the ability to restrict access to some applications due to licensing or security concerns.
- the need for large amounts of hardware, including CPU & GPU multicores, long CPU runtimes, large RAM, very high network bandwidth, and 'especially' storage well above what is available on even a large Desktop.
- conflicting requirements for sharing and restricting normal research data to/from collaborators/competitors, sometimes in the same department.
- requirements for sharing or acquiring large amounts of data across the Internet (e.g. from sequencing centers, NASA image repositories, and global infrastructure projects like LIGO or the LHC).
- defined (but not 'well'-defined) security requirements for certain kinds of data, such as HIPAA, FISMA, and the many NIST directives, and the degree to which such data should be protected 'at rest', 'in flight' on different networks, and 'in RAM', if at all.
- increasing use of computational resources for teaching, itself increasingly virtual.
- the need for repeatable analyses, using tools and education to instruct students (and faculty) how to accomplish this.
- the need for increased security overall, to decrease the malware load on sysadmins and to decrease liability exposure to the institution.
- equitable and efficient resource allocation & usage.
- detection of network intrusions and of unwanted external and internal access.
- centralized data protection (backups); the lack of backups, besides aggravation, can cause massive monetary loss and liability.

There are others, but let's take those as a good base set and see what we can do with them.

== Not yet integrated by current UCI platforms

At RCIC, we are already dealing with most of the above issues on the HPC cluster, although via somewhat disparate and diffuse approaches. The ones that we are 'NOT' yet addressing well, and that are especially applicable to our discussion about secure computing, are described in the following sections. The ways in which these issues might be addressed within the framework of a next-generation 'Common Research Computing Platform' (CRCP) are as follows:

=== Running native Windows & Mac applications

Clusters are composed of the same Hardware Building Blocks (HBBs) as Google's or Amazon's clouds. We could use the cluster scheduler to make sets of those HBBs available to users who want to use Windows or even https://goo.gl/ScdFEM[Macs]. Many academic sites are doing this now, for example XSEDE sites like SDSC's Comet and IU/TACC's https://jetstream-cloud.org/[Jetstream].

Such VMs can be allocated their own virtual filesystems or can connect to specific filesystems, both open and restricted, depending on how the user wants them set up. Graphics access to these VMs would be via ssh-encrypted tunnels, in addition to 2-key authentication. There is a well-defined mechanism for doing this called https://www.openstack.org/[OpenStack], as well as other Virtual Machine managers such as https://www.proxmox.com/en/proxmox-ve[ProxMox Virtual Environment] (which we're currently testing on HPC).
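As a sketch of what the ssh-encrypted graphics tunnel could look like in practice (the login node, VM name, and port below are placeholders, not existing services):

[source,bash]
----
# Run on the user's laptop: forward a local port through a login node to a
# hypothetical Windows VM's RDP port, so all graphics traffic rides inside ssh.
# 'login-node' and 'win-vm-01' are placeholder hostnames.
ssh -f -N -L 3389:win-vm-01:3389 username@login-node

# An RDP client pointed at localhost:3389 now reaches the VM over the
# encrypted tunnel; only the ssh connection crosses the campus firewall.
----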
We have resisted implementing this approach so far due to the emphatically stated desire of most of our HPC users for a single, reliable platform on which to run their applications, rather than OSs and configurations that they have to control themselves. However, a combined CRCP would make such an approach viable, especially since it allows the same cloud technologies to be deployed both on- and off-premises.

=== Strong separation of data types

The current HPC cluster filesystem directories at the top level are segregated only by the Unix permission concepts of User, Group, and Other (everyone else). HPC filesystems are not encrypted at rest, although we have made encrypted pseudo-filesystems for one-off requirements. For a more secure computing environment, we could use encrypted filesystems as well as stronger segregation of users - i.e. you wouldn't even be able to mount a protected filesystem unless you had been authorized, and each user could have their own private keyset to decrypt data via http://ecryptfs.org/[ecryptfs], currently https://help.ubuntu.com/community/EncryptedPrivateDirectory[deployed by Ubuntu] as an option to encrypt each home directory. Also, for smaller data sets (<1TB), we could incorporate those filesystems into the VM image so that the data and code are bundled & encrypted together.
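On an Ubuntu/Debian-style node, the per-user encrypted directory mentioned above can be set up with the stock ecryptfs-utils tooling; a minimal sketch (how this would integrate with cluster authentication and key management is still an open question):

[source,bash]
----
# Install the ecryptfs user tools (Ubuntu/Debian package name).
sudo apt-get install ecryptfs-utils

# Create an encrypted ~/Private directory for the current user; the user
# chooses a mount passphrase that only they know.
ecryptfs-setup-private

# Unlock and re-lock the directory on demand; without the passphrase (and
# the right Unix account), the backing ~/.Private files are only ciphertext.
ecryptfs-mount-private
ecryptfs-umount-private
----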
=== Large secure storage devices

Going forward, it's clear that the university needs a strongly protected, 'possibly' encrypted-at-rest, user-segregated, large, high-performance storage system. Such a storage system would probably have to be physically separated from the rest of the system. However, even that requirement may be eased, since it would be almost impossible to extract data from it unless you broke into the Data Center and removed the ENTIRE storage system - the data would be both encrypted and striped across multiple chassis and disks. If the storage did have to be segregated, we do have another room internal to the Data Center (LabC) which could be hardened to protect the higher-security storage. This storage system could be made available both to the Linux-based compute nodes and to VMs running custom OSs, including other flavors of Unix, Windows, and possibly MacOSX.

=== Data encryption at rest and in flight

Modern filesystems allow encryption and compression at rest, so even if disks were stolen, the contents would not be readable. Data going over internal fiber networks is well protected against surveillance, but we could also use SSL/TLS encryption so that even that communication would be protected (though again, I'm not convinced that this is a good bang-for-buck security expenditure). For cross-Internet traffic, data would be encrypted by normal ssh-mediated transport such as rsync, scp, sftp, etc.

=== Computational separation

Do we also need a completely separate compute cluster to analyze secure data? I would argue that we don't, since memory protection in modern OSs prevents differently owned applications from reading RAM that isn't assigned to them; it would also be an incredibly difficult and inefficient way to steal data. If we needed secured computation, we could bundle some servers into the same racks in LabC as the secure data, but this would be a major increase in both dollar and human cost. As Francisco pointed out, this depends on what constraints the data supplier demands and how flexible they are.

=== Data Center changes

Besides the costs of some physical changes in the Data Center, this would require a *major difference* in how the Data Center is managed, since HIPAA compliance for physical security is extremely non-trivial. Unlike the current situation, random people could not be allowed to wander in and out, construction crews would have to be supervised, transit would have to be logged, video surveillance would have to be copied remotely, logged, and supervised, etc.

== Why is this better than separate efforts?

Rather than having two or more separate 'Secure Research' systems, each with its own set of requirements and support FTEs, it would be much more efficient to have a common infrastructure, especially if it were much more scalable, arguably more secure, and much cheaper than separate ones. On the other hand, there has to be a rational, fiduciary argument for the increased complexity, since nothing kills progress like complexity. Certainly, if UCI only needs a small set of Windows users to access relatively small data sets, we probably shouldn't be having this conversation. But if we're going to need very large amounts of 'on-prem' secure storage, then it becomes a necessary approach. In terms of funding, such a project would be more attractive to a wider range of entities on both the funding and user sides.

== What does this entail?

The core of this approach is a cluster-like resource which would be physically similar to, but organizationally much different from, the current HPC cluster. It would include:

- a cluster core, including:
** a high-speed interconnect such as InfiniBand
** multiple multicore nodes, including specialized hardware such as GPUs and large RAM
- a modern scheduler to allow both HPC parallel (//) jobs and virtual machines to be run on the cores together, as well as reserving machines for regular classes.
- large-memory, high core-count login machines that can support full Desktop GUI logins, so that even Linux-naive users can use commercial research applications like MATLAB, Mathematica, SPSS, SYSTAT, and SAS.
- integration with SAMBA/CIFS file sharing back and forth to users' desktops/laptops (for unrestricted data). Restricted data would only migrate outside the secure storage under transfer rules designated by the data provider, with all transfers logged. This would address about 90% of research users and use cases. (I am already experimenting with such a system.)
- a general-purpose parallel (//) filesystem for open data. (We already have considerable experience with this technology.)
- a dedicated, physically separate parallel filesystem for restricted data (technically similar to the above storage, but in a physically separate, access-restricted location such as LabC).
- dedicated LAN transport for the restricted data to the compute nodes, possibly encrypted.
- much increased security and firewalling efforts, including dual-key and shared-key authentication, documentation, & education.
- depending on data restrictions, additional constraints on data ingress/egress, such as timed IO; with encrypted, passwordless file transfer this should be less of an issue.

Data transfer almost certainly will have to be logged thoroughly, but this can be done fairly easily using standard Unix logging protocols. The issue will be having the security people sign off on it.
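One way that standard Unix logging could be combined with ssh-encrypted transport is a small wrapper script; the following is only a sketch, with hypothetical paths, tag, and arguments:

[source,bash]
----
#!/bin/bash
# Hypothetical wrapper for moving data out of the restricted area:
# rsync over ssh provides encryption in flight, and 'logger' writes a
# provenance record to syslog, which can be forwarded to a central log host.
SRC="$1"      # e.g. /restricted/projx/results/
DEST="$2"     # e.g. collaborator@remote.site:/incoming/
TAG="restricted-xfer"

logger -t "$TAG" "START user=$USER src=$SRC dest=$DEST"
if rsync -a -e ssh "$SRC" "$DEST"; then
    logger -t "$TAG" "OK user=$USER src=$SRC dest=$DEST"
else
    logger -t "$TAG" "FAIL user=$USER src=$SRC dest=$DEST rc=$?"
fi
----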
== How this might work

.Secure Research Computing Infrastructure
image::Secure-Computing.png[Secure Computing Cluster]

- Clients of all kinds first gain access to the login nodes via ssh, with or without shared ssh keys or 2-factor authentication, as we currently support. Note that the diagrammed parts are protected by UCI's campus firewall and no off-campus access is allowed except via VPN. Graphics presentation is handled via x2go.
- Users can run entire graphical Desktops on the login nodes to run graphical programs such as MATLAB.
- They can also run large jobs on the cluster via the scheduler, as would normally be the case.
- Cluster users with no special data access needs would read/write to the parallel filesystem servers labeled 'Storage'.
- 'Restricted Data Users' (RDUs) will be identified by 'group' affiliation; without that affiliation, no user can even see the encrypted filesystem (see the sketch after this list).
- RDUs will additionally have to enter a per-user password to unlock their restricted data.
- 'groups' will also define the ability to use specialized or otherwise restricted data.
- RDUs may analyze that data on regular cluster machines or on 'Secured Compute Nodes' in the LabC area, if required by the data or grant provider.
- Users can also request Virtual Machines (VMs) running any supported OS, made available or activated via the Scheduler, which interacts with the VM Manager to provide the resources for doing so. The Secured Compute Nodes could also be equipped with the same extended resources (GPUs, large RAM, Phi processors) that the main cluster has, if the user group were willing to pay for them.
- VMs containing or requiring access to restricted data would run on compute resources in the secure area in LabC (if necessary). Those not requiring restricted data access could run on servers in the main Data Center racks.
- Data traversing the LabC area may need to be additionally encrypted via SSL/TLS/SSH, depending on requirements.
- Data decryption, and movement of data into and out of the Secure Storage by specific users, can be logged to dedicated log files for provenance tracking.
- Not shown in the diagram are 'backup services', which may also have to be duplicated if the data providers require it. However, since the data is already encrypted, it may be possible to use a single backup system.
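A minimal sketch of how that group-based gating could be enforced at the filesystem level, assuming a hypothetical group 'rdu-projx', user 'panteater', and mount point '/restricted/projx'; the per-user decryption step itself would be handled by something like ecryptfs, as described earlier:

[source,bash]
----
# Create the group that marks Restricted Data Users for this project and
# add an approved user to it (group, user, and path are hypothetical).
sudo groupadd rdu-projx
sudo usermod -aG rdu-projx panteater

# Lock down the mount point: only root and members of rdu-projx can even
# traverse the directory, so the encrypted filesystem stays invisible to others.
sudo chgrp rdu-projx /restricted/projx
sudo chmod 0750 /restricted/projx

# A membership check that a mount wrapper could run before allowing the
# per-user decryption step.
id -nG panteater | grep -qw rdu-projx && echo "authorized" || echo "denied"
----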
== How might this be paid for?

The HPC and GreenPlanet clusters are due for replacement, and the combined admins are already sketching out how this might be done. It would not be difficult to modify our current plans to allow for the architecture shown above, although the implementation and administration time would be significant. For example, we are already planning for partial virtualization of the cluster to support OSs other than Linux, and the idea of a secure storage system has been floated previously. We would need additional FTEs to support the project as well, but almost certainly fewer than if we had to support multiple separate projects.

This approach would also have much more buy-in as an NSF https://goo.gl/f2JFej[Major Research Instrumentation] proposal, since it would serve a significantly wider user base. It might also score higher in review, since it proposes to accomplish what the NSF has been urging.