Public and private clouds provide an array of virtual machines, available with a range of cores, RAM and specialized hardware. Research faculty at small and medium-sized institutions have a variety of requirements for computing and data resources, but need to use these resources efficiently. Three current best-practice tools (Terraform, Docker and AWS S3) together form an integrated toolset able to provision compute and data resources and to configure compute and data services in a customizable and efficient manner. The primary objective of the Tailored Research Environments (directed study) project is to create a web-based product that enables its user to run R, Python and Spark code in Jupyter notebooks on a user-configured computing environment. Importantly, the user's notebooks (code) and datasets are stored separately from the computing environment. In addition, the environment's compute resources can be added and removed as needed, which will significantly reduce the overall cost of using the product.
The project addresses the problem of providing researchers with individually tailored compute resources in a cost-efficient manner. We propose to use an integrated set of technical tools to easily create tailored research resources in the cloud (AWS, Google, Microsoft and OpenStack). This integrated toolset consists of AWS S3, Docker and Terraform.
AWS S3 provides inexpensive, long-term persistent storage for datasets, programs, results and associated reports. Terraform provides the means to easily specify, record and share the configurations of an array of technical services (primarily virtual machines, but there are others) from the four providers listed above. In addition, these services can be easily created and destroyed, which is the primary reason this solution is cost-efficient. Docker provides an extensive selection of mostly open-source software components that can be run on virtual machines.
The design of this solution follows a pattern in which the data and code are persistent, but the computing resources are short-lived and impermanent. Keeping compute short-lived either reduces the cost of its use or facilitates the sharing of these resources.
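The persistent-data, short-lived-compute pattern can be illustrated with a minimal Python sketch. It is a simulation only, not the project's implementation: `PersistentStore` is a hypothetical in-memory stand-in for a storage service such as an S3 bucket, `ComputeEnvironment` is a hypothetical stand-in for a virtual machine, and the object keys are invented for illustration.

```python
from contextlib import contextmanager


class PersistentStore:
    """Hypothetical in-memory stand-in for durable storage (e.g. an S3 bucket)."""

    def __init__(self):
        self._objects = {}

    def put(self, key, value):
        self._objects[key] = value

    def get(self, key):
        return self._objects[key]


class ComputeEnvironment:
    """Hypothetical stand-in for a short-lived virtual machine with no durable state."""

    def __init__(self, cores):
        self.cores = cores
        self.running = True

    def run(self, func, data):
        if not self.running:
            raise RuntimeError("environment has been destroyed")
        return func(data)


@contextmanager
def ephemeral_compute(cores):
    """Create a compute environment and always tear it down afterwards."""
    env = ComputeEnvironment(cores)
    try:
        yield env
    finally:
        env.running = False  # teardown: the environment is no longer usable


# Data lives in the store before any compute exists.
store = PersistentStore()
store.put("datasets/measurements", [3, 1, 4, 1, 5])

# Compute exists only for the duration of the analysis.
with ephemeral_compute(cores=4) as env:
    result = env.run(sum, store.get("datasets/measurements"))
    store.put("results/total", result)  # write results back before teardown
```

After the `with` block exits, the environment is gone but the dataset and the result remain in the store, which is the property the design relies on.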
Specific objectives of the project are to create a data analysis environment that:
- is created and managed using "infrastructure as code" techniques
- provides distributed computing to the user
- decouples code, data and compute capacity
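As a rough illustration of the "infrastructure as code" objective, the sketch below records a machine specification as data and renders it in Terraform's JSON configuration syntax, which Terraform reads from files ending in `.tf.json`. The `MachineSpec` class, the resource name `notebook_host`, the instance type and the AMI placeholder are all assumptions made for this example; a real configuration would also need a provider block and a valid image ID.

```python
import json
from dataclasses import dataclass


@dataclass
class MachineSpec:
    """Hypothetical record of one virtual machine's configuration."""
    name: str
    instance_type: str
    ami: str


def to_terraform_json(spec):
    """Render the spec as a Terraform JSON document with one aws_instance resource."""
    config = {
        "resource": {
            "aws_instance": {
                spec.name: {
                    "ami": spec.ami,
                    "instance_type": spec.instance_type,
                }
            }
        }
    }
    return json.dumps(config, indent=2, sort_keys=True)


# All values below are illustrative placeholders, not project settings.
spec = MachineSpec(
    name="notebook_host",
    instance_type="t3.large",
    ami="ami-PLACEHOLDER",
)
rendered = to_terraform_json(spec)
print(rendered)
```

Because the specification is plain text, it can be kept under version control and shared, and the same file can be used to create the machine and later destroy it.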
The specific objectives of students working on the project are to:
- Create a working product
- Create product documentation
- Develop a working knowledge of infrastructure-as-code techniques
- Develop an understanding of distributed computing techniques and a working knowledge of Spark configuration
- Present a single tutorial and demo of the product to the Data Lab and incorporate feedback into the product and documentation