1. Introduction
Increasing interest in analyzing annotated genomes on the HPC cluster and elsewhere on campus has led RCIC to strengthen the security protocols on HPC. While some of these controls still fall short of the dbGaP requirements, we will be tightening the remaining problem areas so that they are congruent with the dbGaP requirements.
2. Restricted Data on HPC
2.1. dbGaP
"The Database of Genotypes and Phenotypes (dbGaP) was developed to archive and distribute the data and results from studies that have investigated the interaction of genotype and phenotype in Humans." By definition therefore, it contains human subject data that, while somewhat anonymized, is still considered sensitive and therefore requires additional electronic safeguards for use. These requirements are described in the NIH Security Best Practices for Controlled-Access Data Subject to the NIH Genomic Data Sharing (GDS) Policy paper.
The requirements listed therein have been re-worked into a summary mapping paper by John Denune of the UCI Security Team. Many of the proposed security enhancements in it are described in more detail below.
(NB: John still has to provide the actual document.)
2.2. TCGA
The Cancer Genome Atlas (TCGA) is a data collection similar to dbGaP in scope and complexity, but focused specifically on cancer. As such, it also contains human subject data that, while similarly anonymized, unavoidably includes personally identifiable traits. Since it is also an NIH project, the release and analysis of its data are covered by the same security restrictions as dbGaP (see above).
3. Other Restricted Data
The following types of restricted data are included for reference, but we explicitly and emphatically do not host such data on HPC. As we plan the next-generation cluster, we may include such data on that resource if we can address its security requirements.
3.1. HIPAA
The Health Insurance Portability and Accountability Act of 1996 (HIPAA) is a federal law whose relevant part is Title II (unironically called the Administrative Simplification provisions), which describes the computer protocols and security requirements that govern the transfer of medical records, and especially the penalties for their breach. The information in question is Protected Health Information (PHI), defined as "any information that is held by a covered entity regarding health status, provision of health care, or health care payment that can be linked to any individual". While "covered entity" originally referred to health insurance and billing companies and health care providers, the scope has since been expanded to any organization that holds PHI. And since PHI often covers genomic information, this is of concern to UC Irvine.
There are levels of anonymization that can apply to such data: PHI generally refers to information in Electronic Medical Records (EMR), as opposed to raw data such as the aggregated (though de-identified) genomic information in dbGaP and TCGA. There is an inherent problem with this dichotomy, since genomic information, when coupled with the kind of metadata provided by dbGaP and TCGA, statistically narrows the possibilities and, when cross-matched with other information, can be used to re-identify the supposedly anonymous data. So, while it is not as sensitive as actual EMR data, it cannot be treated as Open Source Data and must therefore be protected.
3.2. FISMA
The Federal Information Security Management Act (FISMA) is the regulatory act that created several levels of computer security. FISMA is not a single set of security guidelines but a requirement that each Federal government agency define the security regulations that must be implemented to host the various data types under its control. As such, it is complex, often contradictory, and expensive to implement and audit. It calls not only for a census of the hardware and software that make up the computer systems and networks under an agency's control, but also for the classification and protection of all the data types. Some of the documentation describing this process is listed here.
The two data types below are examples of how various Federal and State governments have addressed the FISMA requirements.
3.2.1. Census Data
While much of the Census data is presented by region in ways that cannot be exploited at the personal level, and can be directly downloaded and used as Open Source Data, any part of it that contains personally identifiable traits is restricted to specific Federal Statistical Research Data Centers, where such data can be studied under the proper computer security.
3.2.2. Corrections
Similar to the Census information, some data on incarceration/corrections/prisons is available directly from the agencies involved, such as the Federal Bureau of Prisons and the California Department of Corrections and Rehabilitation, but access to any personally identifiable information is severely restricted.
4. HPC Current Controls
The HPC cluster is a general-purpose research cluster composed of about 10,000 64-bit cores, with three large parallel filesystems totaling about 2 PB, and about 40 NFS mounts of various sizes. It is nominally an open cluster with no restricted domains, except that some proprietary software packages are restricted by group.
We use the following controls to restrict access to the HPC cluster. Security techniques under consideration but not yet implemented cluster-wide are also described, and are marked as such.
4.1. Login & Direct Access
- ssh (passwords, shared key, ssh-agent, X11). ssh is the de facto standard for secure communication with Unix-like (*nix) operating systems. We currently allow the use of passwords synced by Kerberos to our campus LDAP server, so that even sysadmins do not have access to the password hashes. We encourage the use of ssh keys for convenience and security, but do not insist on it. For key users, we encourage passphrase-protected keys managed with ssh-agent, so that a stolen laptop does not yield working credentials (a connection sketch follows this list).
- Two-Factor Authentication. The sysadmins use a commercial two-factor authentication system embedded in a smartphone app called Duo; however, this technology has not been licensed for the entire campus. We are also experimenting with YubiKey, a USB-based cryptographic hardware key, but it has not yet been rolled out to the user population (a sketch of the underlying TOTP computation follows this list).
- screen / byobu are mechanisms for maintaining multiple secure terminal sessions to HPC that survive a client shutdown (such as sleeping or hibernating a laptop to physically change location). Once an internet connection has been re-established, as long as a connection can be made to the HPC cluster (or to the screen server, if different), all the sessions can be instantly recalled.
- x2go is an X11 graphics-compression utility that provides functionality very similar to the screen utility above, except that it allows disconnection and reconnection to graphical sessions (including entire desktop GUIs) rather than plain-text terminal sessions.
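To make the key-based ssh workflow concrete, here is a minimal sketch of a scripted connection using the third-party paramiko library. The hostname, username, and key path are placeholders, not actual HPC values.

```python
# Minimal sketch of a scripted, key-based ssh connection using the
# third-party paramiko library. Hostname, username, and key path are
# placeholders; adapt them to your own account and key location.
import paramiko

client = paramiko.SSHClient()
client.load_system_host_keys()
# Refuse hosts whose keys are not already in known_hosts, which
# protects against man-in-the-middle attacks.
client.set_missing_host_key_policy(paramiko.RejectPolicy())

client.connect(
    "hpc.example.edu",                           # placeholder login node
    username="someuser",                         # placeholder account
    key_filename="/home/someuser/.ssh/id_rsa",   # passphrase-protected key
)

stdin, stdout, stderr = client.exec_command("hostname")
print(stdout.read().decode().strip())
client.close()
```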
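Phone-app second factors such as Duo's passcodes are typically time-based one-time passwords (TOTP, RFC 6238). As background, the sketch below shows the core computation using only the Python standard library; the base32 secret is a made-up example, not a real credential.

```python
# Sketch of the RFC 6238 (TOTP) computation that underlies most
# phone-app second factors. Standard library only; the secret below
# is a made-up example, not a real credential.
import base64
import hashlib
import hmac
import struct
import time

def totp(secret_b32: str, period: int = 30, digits: int = 6) -> str:
    key = base64.b32decode(secret_b32)
    counter = int(time.time()) // period          # 30-second time step
    msg = struct.pack(">Q", counter)              # 8-byte big-endian counter
    digest = hmac.new(key, msg, hashlib.sha1).digest()
    offset = digest[-1] & 0x0F                    # dynamic truncation
    code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)

print(totp("JBSWY3DPEHPK3PXP"))  # prints a 6-digit, time-based code
```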
4.2. Kernel
- Meltdown & Spectre. We have updated the Linux kernel on our public-facing nodes to a patched version (4.4.115 at the time of writing). This provides increased protection at the cost of slower execution. Since our internal storage servers are not accessible to users other than root, we have not patched their kernels, so as to maintain performance. (A sketch for checking a kernel's mitigation status follows this list.)
- Browser bugs. Following the announcement of the Meltdown/Spectre bugs, we disabled all our Javascript-capable browsers until we could provide one that does not execute Javascript (dillo), used mostly to download data and view images generated on the cluster. We have since wrapped a patched version of firefox in a Singularity container that allows reasonably safe execution of the browser.
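On kernels that carry the Meltdown/Spectre mitigation patches, the kernel exposes one status file per issue under /sys/devices/system/cpu/vulnerabilities. The short Python check below reports that status; the directory's absence itself indicates a kernel that predates the backported patches.

```python
# Report the kernel's Meltdown/Spectre mitigation status. Patched
# kernels expose one file per vulnerability under this sysfs path;
# on unpatched kernels the directory does not exist at all.
import os

VULN_DIR = "/sys/devices/system/cpu/vulnerabilities"

if not os.path.isdir(VULN_DIR):
    print("no vulnerabilities directory: kernel predates the mitigation patches")
else:
    for name in sorted(os.listdir(VULN_DIR)):
        with open(os.path.join(VULN_DIR, name)) as f:
            print(name + ": " + f.read().strip())
```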
4.2.1. Firewall
The HPC cluster is protected on the UCI network side by the Campus Stateful Firewall, run by the UCI Security Team. All HPC users have to connect through UCI's VPN, or via an open host inside the UCI network, before logging into HPC. There is no direct inbound connection to HPC from the larger Internet.
4.3. LightPath DMZ
UCI's LightPath 10GbE data network was funded by the NSF starting in 2016, and one of the criteria for funding was that firewalls were forbidden as too constricting for high-speed data transfer. This leaves only Access Control Lists (ACLs) at the border router to constrain ingress.
However, while HPC does have a direct connection to the LightPath network, no HPC nodes accept inbound connections on the LightPath DMZ; only outbound connections are allowed.
RCIC is investigating high-speed firewalls that could sustain full 10Gb/s wirespeed and is testing some. Implementation will depend on agreement and synchronization with the UCI Security Team.
4.4. Data Encryption
We can create specific encrypted partitions on the HPC filesystems if required, but entire filesystems are not Encrypted At Rest (EAR) by default, since theft of a multi-rack, multi-chassis distributed parallel filesystem is a fairly rare event. Usually the encrypted partitions are created on private filesystems on the owners' servers, where they are invisible to other groups. We can provide ecryptfs partitions so that the owner decrypts the data on login with a passphrase; not even the sysadmins can view the data. If data requires backup, it is backed up encrypted, so that clear-text data is never exposed.
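As an illustration of the backed-up-encrypted principle (separate from the ecryptfs mechanism itself), the sketch below encrypts a file with the third-party cryptography package before it leaves the host. The file names are placeholders.

```python
# Sketch of client-side encryption before backup, so that clear-text
# data never leaves the host. Uses the third-party "cryptography"
# package; this illustrates the principle, not the ecryptfs mechanism
# HPC actually uses. File names are placeholders.
from cryptography.fernet import Fernet

key = Fernet.generate_key()   # store this key somewhere safe, NOT with the backup
fernet = Fernet(key)

with open("results.dat", "rb") as f:       # placeholder input file
    ciphertext = fernet.encrypt(f.read())

with open("results.dat.enc", "wb") as f:   # this is what gets backed up
    f.write(ciphertext)

# Decryption requires the key, which the backup operator never sees:
# plaintext = Fernet(key).decrypt(ciphertext)
```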
4.5. Separation by Virtualization
- Virtual Machines. While HPC uses VMs for cluster services such as web and database servers, we do not support user-available VMs like those on some XSEDE machines, such as SDSC's Comet or PSC's Bridges.
- Containers. We do support containers via Singularity, and so can provide more isolated execution than normal if required (a usage sketch follows this list).
- VLANs / Networking. We are investigating the use of VLANs to provide additional isolation of applications, since the Meltdown & Spectre bugs in all modern CPUs can be exploited to access previously protected memory contents of both VMs and containers. Such an approach would use small blade servers to run secured applications with sensitive data. (This is only at the proposal stage; it is not working at present.)
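As a usage sketch for the container isolation noted above, the following Python snippet launches a workload inside a Singularity image via the standard singularity exec invocation. The image path and workload are placeholders.

```python
# Sketch of running a command inside a Singularity container for more
# isolated execution. The image path and workload are placeholders;
# "singularity exec <image> <command>" is the standard invocation.
import subprocess

result = subprocess.run(
    ["singularity", "exec",
     "--containall",               # do not share host environment or mounts
     "/path/to/analysis.simg",     # placeholder container image
     "python", "analyze.py"],      # placeholder workload
    capture_output=True,
    text=True,
    check=True,
)
print(result.stdout)
```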
4.6. Web services
HPC does not provide sophisticated web services such as Galaxy, PacBio's SmrtAnalysis web interface, or Partek's FLOW system. We do provide a single containerized instance of The Virtual Brain to a research group, with a reverse-proxied connection.
HPC currently hosts a single web server that provides access to static documentation and to files that users want to share with collaborators. The latter service is provided to all users as part of a web-available but non-browseable part of the filesystem. That is, the user has to send the collaborator an explicit link to the file being shared; visitors cannot browse the filesystem, although they can download explicitly linked files without a password.
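A common way to implement this kind of link-only sharing is to publish each file under a directory named with a long random token, so that URLs cannot be guessed. The sketch below shows the general scheme; the paths and base URL are illustrative, not HPC's actual layout.

```python
# Sketch of link-only file sharing: the file is published under a
# random, unguessable token, so only someone holding the explicit
# link can fetch it. Paths and base URL are illustrative.
import secrets
import shutil
from pathlib import Path

WEB_ROOT = Path("/var/www/share")        # served without directory listings
BASE_URL = "https://files.example.edu/share"

def publish(path: str) -> str:
    token = secrets.token_urlsafe(32)    # ~256 bits of randomness
    dest_dir = WEB_ROOT / token
    dest_dir.mkdir(parents=True)
    shutil.copy2(path, dest_dir)
    return BASE_URL + "/" + token + "/" + Path(path).name

print(publish("results.tar.gz"))         # send this link to the collaborator
```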
This system is in the process of being moved from an HPC-associated server to one on another network, to isolate the HPC system. It will provide the same services, but not directly from an HPC-associated IP address. The required hardware has been ordered, and we anticipate that separation of the service will be complete in September 2018.
4.7. Intrusion Detection
We are testing a number of systems that monitor changes to specific files by comparing checksums against a database of previously verified files. (The archetype of such systems is tripwire, which has become a commercial product and is therefore quite expensive.) This work is in progress.
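The core of such tripwire-style monitoring is simple to sketch: record a checksum for every file of interest, then periodically recompute and compare. A minimal standard-library illustration follows; the watched tree and database path are placeholders.

```python
# Minimal tripwire-style integrity check: build a baseline of SHA-256
# checksums, then compare the current state against it. The watched
# directory and database path are placeholders.
import hashlib
import json
import os

WATCHED = "/etc"                      # placeholder: tree to monitor
BASELINE = "/var/lib/integrity.json"  # placeholder: stored checksums

def checksums(root):
    sums = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as f:
                    sums[path] = hashlib.sha256(f.read()).hexdigest()
            except OSError:
                pass                  # unreadable files are skipped
    return sums

current = checksums(WATCHED)
if not os.path.exists(BASELINE):
    with open(BASELINE, "w") as f:
        json.dump(current, f)         # first run: record the baseline
else:
    with open(BASELINE) as f:
        baseline = json.load(f)
    for path in sorted(set(baseline) | set(current)):
        if baseline.get(path) != current.get(path):
            print("CHANGED: " + path)
```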
5. Timeline
The timeline is still under discussion with the RCIC team.