Data Integration in a Multi-Cloud World
My last blog post dealt with file storage primarily as a challenge of legacy and technical debt. Genome centers like the Broad Institute have accumulated petabyte-scale data on network-attached storage (NAS) systems. The ecosystem of tools and workflows built around these systems, as well as the sheer number of bytes involved, has made it hard to find a first step into public clouds.
In retrospect, that’s a fairly gloomy and pessimistic way to look at the world.
One of the great privileges of my career is that I get to work side by side with scientists and clinicians who are bringing new diagnostic techniques and therapies into being. These people work in a space of potential and possibility, which means that they serve as a functionally limitless source of the very best technical challenges.
This has the wonderful property of being both inspiring and maddening at the same time.
One of these interesting and forward-looking challenges is data integration across not just teams but whole organizations. We need to bridge research and clinical information systems, and bring together data from groups who are guaranteed to have made wildly divergent infrastructure choices.
Specifically, while I have confidence that “cloud” is the future, I am also very confident that we will not all choose exactly the same provider. A forward-looking infrastructure must plan for the fact that some groups will use Amazon, others will use Google, still others will use Microsoft, and so on.
Imagine, for example, a pre-competitive collaboration between a drug discovery startup, a major pharmaceutical company, and a medical center. This sort of thing happens all the time in Boston, where I live. The opportunities to drive novel therapies from “bench to bedside” are abundant, but only if the researchers and clinicians can share data.
Technical vs Social Challenges
Many of the challenges in data integration are going to be non-technical. The regulatory landscape around clinical and personally identifiable data is complex, to say the least. People own the data derived from their bodies. We must respect not just information privacy, but also people’s wishes about how their data are used. This is true whether we’re in a research or a clinical context. The cost of breaking the rules is very high: HIPAA violations are the “extinction level event” of healthcare companies. This creates a very reasonable culture of risk avoidance that makes any sort of data integration effort challenging.
Put bluntly, it’s unlikely that hospitals are simply going to throw open the doors to their data, nor should they.
If we are to integrate these data into research and development, we will need a safe harbor in which to do the work, and mature processes for shifting the bytes around. For many organizations, public clouds can provide those harbors. This, of course, raises the question of cloud mobility for data.
In working with dozens of organizations over the years, I’ve seen a lot of teams get stuck when a problem has both technical and social challenges. This is the case with the vast majority of truly interesting problems. In these cases, one subset of stakeholders will say “it’s technically challenging, so we don’t need to dig into the regulations just yet.” Another set will say “well, the regulations don’t support what we’re trying to do, so there’s no point in building it just yet.”
In these cases, a technical solution can change the situation and get things moving again. It should be no surprise that I think that Elastifile’s cloud-agnostic filesystem has the potential to be just that catalyst. By solving the problem of shifting subsets of data between clouds without changing representation or format, I hope that we can advance the conversation about what is appropriate, since many more things will be possible.
In the cross-institutional collaboration example above, each partner would still certainly have to do the work of selecting the correct data and ensuring that they were respecting the applicable rules and policies. Once the data have been selected and potentially de-identified, tools like Elastifile’s Cloud Connect take care of the mechanics of shifting files to and between clouds.
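As a toy sketch of that selection and de-identification step, a partner might strip a direct-identifier column from a flat file before staging it for transfer. The file layout, column names, and values here are invented for illustration, and real de-identification involves far more than dropping a single field:

```shell
# Toy input standing in for selected study results (hypothetical columns):
# a medical record number, a gene name, and an expression value.
printf 'mrn,gene,expression\n12345,BRCA1,0.92\n67890,TP53,0.41\n' > results.csv

# Drop the first column (the identifier) before the data leave the
# institution; the column position is an assumption for this sketch.
cut -d',' -f2- results.csv > deidentified.csv

cat deidentified.csv
```

The de-identified file, rather than the original, would then be what a tool like Cloud Connect moves into the shared environment.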
Containerization Delivers Application Mobility
Containerization technologies like Docker and Singularity bundle up an application with all of its system-level dependencies into a single unit. The promise is that we no longer need to spend a bunch of time making sure that the operating system has the required libraries to support our code. This has the potential to radically expand the portability of software, since tools can bring along the context they need in order to run. Containerization opens the possibility of sharing whole execution environments in addition to merely copying data.
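As a sketch of what that bundling looks like in practice, a minimal Dockerfile declares the base operating system, the system-level dependencies, and the tool itself in one recipe. The image name, package, and script here are hypothetical:

```shell
# Hypothetical example: bundle a genomics tool with its OS-level
# dependencies so that collaborators need only a container runtime.
cat > Dockerfile <<'EOF'
FROM ubuntu:22.04
RUN apt-get update && apt-get install -y --no-install-recommends samtools
COPY analyze.sh /usr/local/bin/analyze.sh
ENTRYPOINT ["/usr/local/bin/analyze.sh"]
EOF

# Building and sharing the image would then be:
#   docker build -t variant-caller .
```

Anyone who can run containers can now run this tool, without ever installing samtools or worrying about library versions on their own hosts.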
Of course, this creates something of a security headache for the systems administration crowd. In traditional environments we were able to make a distinction between configuration parameters that were accessible to end-users, as opposed to those requiring a higher level of authorization. Among the most critical of these more secure functions are the methods by which we identify and authorize user accounts.
In our collaboration example above, there are undoubtedly toolkits already in use at each of the biotech, the pharma, and possibly even the hospital. Any change in the structure or representation of the data between these three amounts to friction in the relationship: yet another challenge to be overcome before the data can be used productively. A common representation for both methods and data allows researchers to proceed with the work of science, rather than struggling against the technology.
Secure Collaboration Requires Data Mobility
With containers, developers and collaborators have the ability to change system-level configurations. While this is transformative in terms of their ability to relocate and manage code, it also means that the people responsible for restricted or sensitive data can no longer rely on the mechanics of user identity and authorization. Most NAS systems have exactly this assumption built in: that systems allowed to connect to the NAS can be trusted to handle their user identities in a consistent and secure way.
As with opening the doors to the data, I think it’s unlikely that organizations will allow arbitrary code to run behind their firewalls. It is much more likely that collaborative teams will use public clouds as a neutral meeting ground.
As above, Elastifile provides a path forward. By placing subsets of data in protected cloud enclaves, we can limit the data that is exposed and accessible to a particular container. While we certainly could have done this in the past by creating new volumes on our NAS, the cloud makes it much simpler. We can create a purpose-built data environment for a particular collaboration or experiment, and tear it down once we’re done. And since Elastifile presents containerized apps with a familiar POSIX-compliant file system interface, they can run on-cloud with no refactoring required.
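To make that concrete, suppose the enclave’s file share is mounted on a cloud VM at an ordinary path. A container can then be handed just the subset it needs, read-only, through a standard bind mount. The paths, image name, and file names below are assumptions for illustration, not an Elastifile-specific API:

```shell
# Hypothetical: the shared filesystem is mounted on the cloud VM at
# /mnt/collab. Expose only one project directory to the container,
# read-only, so the app sees a plain POSIX directory at /data.
docker run --rm \
  -v /mnt/collab/project-a:/data:ro \
  variant-caller /data/sample.bam
```

The containerized tool reads and writes ordinary files, unaware that the directory behind the mount is a cloud enclave that can be torn down when the collaboration ends.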
The possibilities of genomic medicine are truly remarkable, and data integration will be key to seeing them through. Software like Elastifile will be a key enabler that we technologists can bring to the table.