Can CompSci Get Out of the Way of LifeSci?

Dave Paye March 27, 2017 blog, lifesci, compsci, parallel computing

Ninety. Years.  

That’s the new estimated life expectancy in some countries. New research published in The Lancet estimates that by 2030 we’ll see life expectancies hit 90 in many countries around the world. We’ve added roughly 17 years since my birth in 1976, when it was merely ~72. That’s roughly 5 additional months of life expectancy for every year I’ve been alive. Good job, science! Imagine: if each human on earth gained just 1 more day of lifespan, the aggregate increase would be about 20 million additional years of life enjoyed. The contribution to the human experience would be incalculable, and that’s merely 1 extra day per human. Imagine the impact of an extra week, a month, or a year. We can attribute much of this achievement to the amazing work of our life scientists.
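For the curious, the aggregate figure is easy to sanity-check. The snippet below is a minimal back-of-the-envelope sketch in Python, not a demographic model: the population constant (roughly 7.4 billion, an approximate 2017 figure) and the function name are assumptions made for illustration and do not come from the Lancet research.

```python
# Back-of-the-envelope check: total years of life gained if everyone lives a bit longer.
# ASSUMPTION: world population of ~7.4 billion (approximate 2017 figure, not from the article).
WORLD_POPULATION = 7_400_000_000
DAYS_PER_YEAR = 365.25  # average year length, including leap years


def aggregate_years_gained(extra_days_per_person, population=WORLD_POPULATION):
    """Return the total person-years gained across the whole population."""
    return extra_days_per_person * population / DAYS_PER_YEAR


if __name__ == "__main__":
    # One extra day, week, (30-day) month, and year per person.
    for label, days in [("day", 1), ("week", 7), ("month", 30), ("year", 365.25)]:
        millions = aggregate_years_gained(days) / 1_000_000
        print(f"1 extra {label} each: about {millions:,.0f} million years")
```

Under that population assumption, one extra day per person works out to roughly 20 million person-years, and one extra year per person is simply the population itself expressed in person-years.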

I myself am no scientist. I’m merely a guy who builds computing systems that support the real geniuses using them for their work. That said, I like to think I have played a part, albeit a very, very small one, in helping the cause of life science through my work in computer science. For example, I’ll never forget the sense of pride and gratitude I felt years ago at my last company when we discovered that a cancer research scientist had recovered her life’s work from a failed computer using backup software that my company created, sold, and supported. Now, we didn’t contribute to the cancer research, but we prevented it from being lost. Win! The tech helped win that battle; hopefully science now has a better chance to win the war.

Life scientists have gone far by taking advantage of the exponentially increasing ability to programmatically store and process vast amounts of data. I’ve met several people in the life sciences field recently and have been impressed by how many have deep backgrounds and advanced degrees in both the life sciences and computer science. It seems it takes a computer expert to be a life scientist these days.

Which is precisely what I’m taking issue with in this article. Life scientists should be able to focus as much of their energy as possible on the science. But the dependency on computer science often translates into painful wrestling with computer systems: the servers, storage, operating systems, programming platforms, and all the interconnecting networks and software that piece them together. Any time a scientist spends managing computer systems is time taken away from science, and is generally time wasted in utter distraction. It’s like a race car driver stuck in the pit doing repairs during a race; what could be more frustrating?! We must keep our scientists out of the pits and in the race. Our lives literally depend on it!

The secrets scientists are seeking are locked inside the data. But there’s so much data, being generated so fast, that we struggle to see it all. Data that can’t be read, processed, analyzed, and applied is merely a useless heap of bits. Frankly, our existing computer systems are overwhelmed. They were not designed with this level of scale in mind.

There exists an inverse relationship between the amount of data on one hand and, on the other, the speed of access to it and the integrity of its contents. Basically, with traditional computing systems, the more data you have, the slower access gets and the more likely the data is to be corrupted.

[Image: 90 years_chart1_alt.jpg]

We must do our part as computer scientists and systems engineers to reverse this trend, while relieving our life scientist colleagues of this burden. We must free them from the metaphorical computer “pit stops” so they can take all the data, apply science, and extract life-saving knowledge from it.

What we need are data management systems that reverse the trend. As data increases, they should get faster. As they grow, they should get more reliable, not less.

[Image: 90 years_chart2_alt.jpg]

Life scientists have long employed techniques to achieve this type of trend. They’ve used parallel computing and distributed storage systems to get there. While these approaches have brought us this far, as the adage goes, “what brought you here won’t get you there.” For all their advantages, the parallel computing and distributed file systems of the past have major downsides. To frame the challenge, today’s systems are too:

  1. Expensive
  2. Complex
  3. Rigid

Let’s briefly look at each limitation:

  1. Expensive: Dedicated storage systems require costly appliances and proprietary hardware. First comes a SAN for big databases and storage persistence. Then come more layers of distributed file systems on top of the SAN for file workloads. Then yet another dedicated system is deployed for “cheap and deep” object storage. None of them handle compatible data types. All of them silo data into closed, proprietary systems. It’s a lot of expensive work and frustration to piece them together for an end result that still leaves the user wanting.
  2. Complex: There’s a reason you need a Computer Science degree to accompany the Microbiology PhD. For example, one of the leading file systems in high performance computing is IBM GPFS. With all due respect to GPFS (it’s a solid product), just go ahead and read the product documentation. It’s over 1000 pages of thick technical jargon. And this is just one layer, from one vendor, of a massively complex stack of compute products. What genomics researcher really wants to invest that kind of time in a commodity component of their systems?
  3. Rigid: Once I have all this data stuck on a “box,” what happens when I find that I would benefit from using it somewhere else? How about putting some of it in the cloud for an ad hoc analytics project? Or sharing it with a partner institution for a joint collaboration? Or simply finding a cheaper source of hardware for storing all of it somewhere else? These are all challenging at best, and impossible at worst, with many of today’s state-of-the-art data storage products. If we can break the tight bond between hardware and data, a world of possibilities opens up.

The modern approach from my company, Elastifile, is designed to transcend these limitations and bring cloud-like capabilities (elasticity, scale, high performance, self-service) to life sciences applications, whether they live in on-premises datacenters, in the public cloud, or somewhere in between. Elastifile’s purpose is to free data from storage infrastructure, and ultimately to free scientists from the headache of storage management. Elastifile employs techniques similar to those of the hyperscale public cloud providers, along with a few (patented) tricks of its own, to deliver on this promise. We aim to free data from dependence on rigid proprietary storage hardware.

We’re aiming to meet the needs of today’s life scientist:

[Image: Screen Shot 2017-03-27 at 5.12.29 PM.png]

90 years.

I see this new bar as a challenge to all of us. I personally not only want to make it to 90, but far exceed it. And when I get there, I want my faculties to be sound enough to employ my accumulated wisdom for some greater purpose. I want the same for all 7+ billion of us living together on earth. As our life scientists seek new ways to win this race against time, it’s up to us computer engineers to support them, while keeping them out of the “pits.” Let’s stop trying to merely evolve our pit stop times with yesterday’s tech...instead, let’s change the race with a new and better approach to computing.