First Steps in a Data Strategy for Science

Chris Dwan September 19, 2017 blog, Scientific Data, Data Storage, Public Cloud, Life Sciences

The healthcare and life sciences industry generates a lot of data. Genomics in particular has seen remarkable growth in its need for data storage over the past decade. Radical reductions in the cost to sequence DNA, coupled with massive increases in the speed with which we can do so, have left us awash in petabytes of sequence data and terabytes of variants and other derived results. With the advent of precision medicine, genomic data is only a part of the picture, albeit a large and important one.

Fortunately, that same decade of unprecedented data growth was also the first decade of the public cloud.

Cloud and Scientific Computing

In the 11 years since Amazon Web Services (AWS) launched its SQS, S3, and EC2 services in 2006, public clouds and cloud-based services have become an essential and natural part of the landscape of IT infrastructure. Organizations have adopted software as a service (SaaS) and platform as a service (PaaS) solutions to manage key business functions like human resources, finance, office productivity, document sharing, and email. In 2017, while there is still certainly a conversation to be had around “buy vs. build,” we would be remiss to dismiss cloud-based services out of hand.

Scientific computing, particularly high performance computing, has lagged behind this trend. This is not for lack of trying. Most of the research computing leaders that I know have adopted a “cloud strategy,” to a greater or lesser extent. Many of us have ported specific workloads to one public cloud or another, sometimes to great scientific impact. On the whole, it has been harder than most of us expected, myself included, to make the jump to running our HPC workloads natively on Infrastructure as a Service (IaaS) platforms like AWS, Google Cloud Platform (GCP), or Microsoft’s Azure.

Files are the Backbone of Scientific Data Storage

In large part, this is due to inertia, the sheer “resting mass,” of the core file systems that host our scientific data. All of those lab instruments needed to write their data somewhere over the last decade, and we in IT were often only days or weeks ahead of the demand for space. Once written, petabytes of data are heavy and ungainly to move. Network filesystems are still the primary repositories of data from DNA sequencers, mass spectrometers, imaging platforms, and every other sort of lab. They provide the backbone of data storage and accessibility to scientific analysis.

One reason for the inertia is that our workflows tend to be architected with the assumption that data resides on a Network Attached Storage (NAS) system running the Network File System (NFS). This is in contrast both to the structured, relational model that we find in databases and to “cloud native” S3-compatible object stores. While some teams are re-architecting their pipelines toward the S3 standard, it is a long road. Making the jump away from NFS entails much more than just changing a path or a protocol. It requires a fundamental re-thinking at a systems level, as well as modifications to every tool that hopes to make use of the data.
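To make the "more than a path or a protocol" point concrete, here is a minimal sketch in Python. It assumes a toy, in-memory ToyObjectStore class that I made up for illustration; it is not the S3 API, and the function names are hypothetical. The one property it borrows from real object stores is that an object is written whole rather than patched in place, so a tool that seeks into a file and overwrites a few bytes has to become a read-modify-write of the entire object.

```python
import os
import tempfile


class ToyObjectStore:
    """Hypothetical in-memory stand-in for an object store (not the S3 API)."""

    def __init__(self):
        self._objects = {}

    def put(self, key, data):
        # Objects are written whole; there is no partial update.
        self._objects[key] = bytes(data)

    def get(self, key, start=0, end=None):
        # Ranged reads are allowed; `end` is exclusive in this toy model.
        return self._objects[key][start:end]


def patch_file_in_place(path, offset, new_bytes):
    """File-style update: seek to an offset and overwrite a few bytes."""
    with open(path, "r+b") as handle:
        handle.seek(offset)
        handle.write(new_bytes)


def patch_object(store, key, offset, new_bytes):
    """Object-style update: fetch everything, modify, write it all back."""
    data = bytearray(store.get(key))
    data[offset:offset + len(new_bytes)] = new_bytes
    store.put(key, data)


def demo():
    """Apply the same four-byte edit through both interfaces."""
    original = b"ACGTACGTACGT"

    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "reads.txt")
        with open(path, "wb") as handle:
            handle.write(original)
        patch_file_in_place(path, 4, b"NNNN")
        with open(path, "rb") as handle:
            file_result = handle.read()

    store = ToyObjectStore()
    store.put("reads.txt", original)
    patch_object(store, "reads.txt", 4, b"NNNN")
    return file_result, store.get("reads.txt")
```

Both paths end with the same bytes, but the object version moved the whole payload twice to change four bytes. On a twelve-byte toy that is free; on a multi-terabyte result file it is the sort of cost that forces a redesign of the tool rather than a find-and-replace on its paths.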

Object storage performs best when individual data transactions are small and when transfers can be routinely aborted and restarted. Aggregate performance comes from extremely wide parallelism rather than optimizing for single “elephant” flows. None of these is a sweet spot for the network attached filers we use in scientific computing. Over decades, we have tuned our HPC NAS systems for an unruly and mixed workload, with a bias towards supporting modest numbers of concurrent large block streaming I/O transactions. HPC software, as currently written, expects these capabilities from the system hosting its data.
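The access pattern that suits object storage can be sketched in a few lines of Python. Everything here is an assumption for illustration: FlakyStore is a made-up in-memory store that deliberately fails the first read of every byte range, and fetch_with_retry and parallel_download are hypothetical helpers, not any vendor's client library. The point is the shape of the work: many small ranged reads, each cheap to abort and retry, issued in parallel.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class FlakyStore:
    """Hypothetical in-memory object whose ranged reads fail at first."""

    def __init__(self, data, failures_per_range=1):
        self._data = data
        self._failures_per_range = failures_per_range
        self._attempts = {}
        self._lock = threading.Lock()

    def size(self):
        return len(self._data)

    def get_range(self, start, end):
        # Count attempts per range so the failures are deterministic.
        with self._lock:
            seen = self._attempts.get((start, end), 0)
            self._attempts[(start, end)] = seen + 1
        if seen < self._failures_per_range:
            raise ConnectionError("simulated aborted transfer")
        return self._data[start:end]


def fetch_with_retry(store, start, end, max_attempts=3):
    """Retry one small range; an aborted part costs little to redo."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return store.get_range(start, end)
        except ConnectionError as error:
            last_error = error
    raise last_error


def parallel_download(store, part_size=4, workers=8):
    """Split the object into small parts and fetch them concurrently."""
    total = store.size()
    ranges = [
        (start, min(start + part_size, total))
        for start in range(0, total, part_size)
    ]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in input order, so the parts join correctly.
        parts = pool.map(lambda r: fetch_with_retry(store, *r), ranges)
        return b"".join(parts)
```

Calling parallel_download(FlakyStore(data)) reassembles the original bytes even though every range fails once along the way. A typical HPC tool does the opposite: it opens one large file and streams through it, which is exactly the workload our filers have been tuned for.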

Transitioning a scientific computing workflow from NAS to an object representation is a heavy technical lift that touches all parts of the pipeline. It requires coordination among the scientific domain experts, who ensure that the results remain valid; the infrastructure technologists, who focus on performance, cost, and sustainability; and the data curators and stewards, who are tasked with maintaining order and appropriate usage.

In a research institute, university, or scientific computing core of a large company, this complexity is magnified by the intellectual diversity of the scientific community. The core filers are shared across teams, domains, and disciplines. The data repository is one of the major institute-wide integration points. Until all of the stakeholders are cloud-ready, piecewise change to the data repository is quite disruptive. Incremental approaches like containerizing legacy applications and providing file system “snapshots” pulled from an object repository are patches, rather than fundamental solutions.

Because of this, I expect scientific workflows, and the research institutes that host them, to continue to rely on network attached file systems for many years to come. While some datasets will be made available in cloud-native formats, the assumption of a shared network file system is deeply ingrained in the scientific computing worldview. We will need NFS or NFS-like solutions for the foreseeable future.

Planning for a Multi-Cloud and Cross-Cloud Future

Confounding matters further, there are many clouds, and we will not all pick the same provider. I think it highly unlikely that a single IaaS provider will emerge as the universal winner while the others close up shop. This means that, in addition to the need to take a first step, we need to architect for a time when some services and collaborators are native on AWS, others reside on Google Cloud, still others are hosted on Azure, and the rest are scattered across the dozen-plus smaller and more specialized providers. In terms of cross-cloud interoperability, I expect the S3 standard to remain reliable. For other services, however, the devil is in the DevOps details.

So, where to begin? As scientific computing leaders, we are tasked with providing data storage solutions that allow for high performance computing, collaboration, security, and reliability, all while managing cost and embracing forward-looking technology strategies. How do we take a step into the IaaS world without substantial disruption?

Elastifile’s software-based file system has the potential to be a good first step in this space. It is a scalable, high performance NFS system that works on any of the major IaaS providers, and also works seamlessly on in-house private clouds like VMware. Elastifile’s CloudConnect software allows ingest and migration of data from existing NAS systems to Elastifile systems, which also allows cross-cloud mobility. Because the NFS interface has not changed, user workflows can be shifted to public or private clouds as-is.

To be clear, there will still be plenty of work to be done beyond making the data natively available in a public or private cloud. Questions of workflow, orchestration, and so on will demand substantial attention. For all that, Elastifile seems to me to be a decent first step towards breaking the on-premises NAS logjam.