A Data-Centric Approach to Data Mobility

Shahar Frank October 24, 2016 blog

This sounds obvious and simple, but shouldn’t the critical requirements of data mobility across sites, private clouds, and the hybrid cloud be delivered with a true data-centric approach? Yet the current approaches are dominated by VM-centric techniques, where the data is still tied to the underlying infrastructure (whether it is virtualized, physical, or public cloud) and its underlying mechanics. Chris Evans covered the requirements, current options, and challenges beautifully in his Hybrid Cloud and Data Mobility blog. And he clearly articulates what’s required:

If a hybrid cloud is to be of any use, we need the ability to move applications and data around on demand, but retain the following characteristics for our applications and data:

  • Consistency – Ensure we know which version of our data is the most current at any one time and ensure we protect it and can restore from the most current backup.
  • Independence – Have the ability to move data to any platform of choice and not be tied to the underlying hardware platform of the infrastructure provider.
  • Abstraction – Operate on logical constructs, like files, objects, VMs or databases/data objects, rather than having to think in terms of LUNs or volumes.
  • Security – Implement security features to protect data wherever it is, either in-flight or at rest.
  • Ease of Management – Be able to manage the data consistently wherever it sits in the ecosystem.

I want to discuss the root cause challenges for “Independence” and “Ease of Management”—and how a new approach that is truly data-centric could create new options.

Data Types

First let’s consider the underlying data types and their different I/O and mobility requirements. From the storage perspective, application solutions are composed of 4 typical data types: 1) system data (OS binaries and settings), 2) ephemeral data (swap and temporaries), 3) application binaries and settings, and 4) application/user data.

In their simplest form, all of these data types can be packed together, e.g. on the same local file system stored on a single local storage device. However, for most applications this may be inefficient, because the storage management and I/O patterns differ significantly between the types. For example, system data is typically set at system install time and is patched/updated with OS-level tools, but otherwise it is rarely changed directly by the server owner. Application data, in contrast, is managed and frequently modified by the application. Therefore, when considering backup strategies, you probably want to back up the application data often, while backing up the system data much less frequently. Moreover, system data does not produce significant I/O on most of its dataset (excluding swap), but application data may produce significant I/O. Another difference is in capacity requirements and elasticity. For example, application binaries consume a relatively small amount of capacity and, in most cases, do not change their capacity requirement over time, but application data can consume large amounts of capacity and can be very dynamic over time. A final key difference is the “shareability” of the data. For example, ephemeral data and some parts of the system/application settings are mostly not shared (i.e. they are per host/application), while system and application binaries may be shared for READs, and application data may be shared for READs and WRITEs.
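The differences described above can be summarized in a small model. The following is only an illustrative sketch: the class and field names are hypothetical, the "low"/"high" values are coarse simplifications of the text, and the attributes given to ephemeral data (and the rule that it needs no backup) are assumptions rather than something stated above.

```python
from dataclasses import dataclass
from enum import Enum


class Sharing(Enum):
    NONE = "not shared (per host/application)"
    READ = "shared for READs"
    READ_WRITE = "shared for READs and WRITEs"


@dataclass(frozen=True)
class DataTypeProfile:
    """Coarse, hypothetical description of one of the four data types."""
    name: str
    change_rate: str         # "low" or "high": how often the data is modified
    io_load: str             # "low" or "high": typical I/O on the dataset
    elastic: bool            # does the capacity requirement change over time?
    sharing: Sharing
    persistent: bool = True  # assumption: ephemeral data need not survive


PROFILES = [
    DataTypeProfile("system data", "low", "low", False, Sharing.READ),
    # Values for ephemeral data are assumptions made for this sketch.
    DataTypeProfile("ephemeral data", "high", "high", True, Sharing.NONE,
                    persistent=False),
    DataTypeProfile("application binaries and settings", "low", "low", False,
                    Sharing.READ),
    DataTypeProfile("application/user data", "high", "high", True,
                    Sharing.READ_WRITE),
]


def backup_policy(profile: DataTypeProfile) -> str:
    """Derive a backup frequency from how often the data changes."""
    if not profile.persistent:
        return "none"
    return "frequent" if profile.change_rate == "high" else "infrequent"


for p in PROFILES:
    print(f"{p.name}: backup={backup_policy(p)}, sharing={p.sharing.value}")
```

Even in this simplified form, the four types end up with different backup and sharing needs, which is the reason packing them into one container tends to be inefficient.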

The VM-Centric Approach

Modern enterprise IT relies heavily on virtualization. In the early days, virtualization systems mimicked the physical world, but over time new mechanisms and methodologies were developed for and adapted to the virtualized world, and these also affected data management methodologies. Basically, the common application data setup methods used in such environments are:

  1. Physical-like: Treat a VM as a physical machine, install it from boot media, patch and maintain it using OS tools, etc.
  2. Clone/Template-based: Use templates that are pre-installed and preconfigured systems and clone them to create system instances (i.e. copy the template and customize the clone). This mode is typically the main method used for modern enterprises.
  3. Real-time image composition: Start with some base image (~template) and use application streaming/virtualization and/or automated customization techniques (Chef, Puppet) to compose/customize the final image, i.e. adding applications, changing settings, etc. This mode is mainly used for VDI and for new-generation applications.

Note that the above methods are also applicable and used in non-virtualized worlds, but each environment has its own methods. In the virtualized data center case, the virtualization management systems (such as VMware’s vCenter) provide comprehensive lifecycle management for cloning processes, making cloning very convenient and simple. As a consequence, the cloning method encourages the system admin to package the application as a VM (i.e. a set of virtual disks and settings). The benefit of this approach is that it detaches the application from the storage and enables the admin to deploy the VM on different storage servers and even to easily migrate it among the storage servers. This reduces the risk of vendor lock-in and solves data migration issues.

VM-Centric Issues

Even though the VM-centric approach has significant advantages, it also has some significant disadvantages: 1) it depends heavily on the abilities of the virtualization platform and locks your entire workflow and infrastructure to it, and 2) it builds everything around block services (i.e. virtual disks), and therefore the application dataset is broken into many “islands” of storage. This is essentially another infrastructure-centric approach (more like traditional arrays) with very limited central visibility and control across the mobility use cases. For example, to know how much free space is available for an application you may need to access the application VM OS and/or use VM-side agent logic (VMware Tools, for example) because the storage block services are not aware of the application objects (usually files). In fact, most if not all of the data management tasks become much more complex and cumbersome due to the fragmentation of the dataset into a typically large mix of virtual disks. For example, it is hard to know which users are creating data, what types of data are being stored, what data is accessed most frequently, performance SLAs, etc. This means that the system admin has very limited information on the data, resulting in inefficient storage resource utilization and compromised business processes. Finally, for the evolving hybrid cloud data mobility use cases, this means data will remain accessible only when compatible hypervisor and virtual disk combinations are used across sites and clouds.

A Data-Centric Approach

So how can you have the best of both worlds, with data mobility “Independence” and “Data Manageability”, especially for complex hybrid cloud environments? It really comes down to how you can enable data mobility without all the restrictions of data tightly bound to its underlying VM and storage infrastructure. Due to the diversity of data attributes and characterizations, system and application administrators often prefer to manage these data types on different data containers and/or devices. For example, a very common practice is to have a system disk, a swap device and one or more application stores (file, block or even object). This can be considered a “data-centric” approach, where the solution state is composed according to the different data attributes (capacity, performance, shareability, etc.), which in turn determine how the admin manages the data and data services. Done properly and persistently, such data containers can be the heart of a truly independent data mobility offering. But they will need to evolve to address the hybrid cloud requirements for dynamic workflows and universal access, all while maintaining enterprise availability, security, performance, etc.

Such data-centric approaches are usually implemented via file services, as these provide the best data sharing and management abilities. In most cases, file servers provide the best and most efficient storage provisioning, capacity tracking and control, user control, data sharing and data analytics abilities. In other words, file systems allow you to manage data and reduce the need to manage storage. Note that block and object services have their own advantages and use cases, but for this fundamental goal of data mobility, a file system approach brings big advantages (which we’ll discuss more in my next blog!).
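As a small illustration of the capacity tracking and data analytics point, here is a minimal sketch of what becomes possible once the data is visible as files rather than as opaque virtual disks. It uses only the Python standard library against a mounted directory tree; the function names are hypothetical, and grouping by file extension is just one assumed way to approximate “what types of data are being stored”.

```python
import os
import shutil
from collections import Counter


def capacity_by_type(root):
    """Return the bytes used per file extension under the directory `root`."""
    usage = Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                size = os.lstat(path).st_size
            except OSError:
                continue  # file vanished or is unreadable; skip it
            # Files without an extension are grouped under "(none)".
            ext = os.path.splitext(name)[1].lower() or "(none)"
            usage[ext] += size
    return usage


def free_bytes(root):
    """Return the free bytes on the file system that holds `root`."""
    return shutil.disk_usage(root).free


if __name__ == "__main__":
    target = "."  # hypothetical application store mount point
    for ext, size in capacity_by_type(target).most_common(5):
        print(f"{ext}: {size} bytes")
    print(f"free: {free_bytes(target)} bytes")
```

With block-only services, answering the same questions would require looking inside each virtual disk from the guest OS or through a VM-side agent.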

Current Software-Defined/Server Based Storage Implementations

In the last few years, several SDS/server-based storage solutions were developed to replace the legacy (hardware) appliance-based storage solutions, aiming to dramatically improve storage elasticity, reduce cost, and increase effectiveness. However, virtually all of these new solutions are focused on VM-centric data services by design. Why? Such a design is much simpler to implement and requires far fewer native data services (since the solutions can just borrow the VM services). In other words, most of these solutions implement dedicated image store services for virtualization systems, and provide none of the mobility requirements for data/app independence or easy data management services.

New Options

Yet software-defined is the only way to enable the hybrid cloud requirements for cross-site coverage, scale, elasticity and unified access. We need a new generation of SDS to deliver this potential, leveraging a global distributed file system approach without compromise for massive consolidation and hybrid cloud use cases. Stay tuned for more on what we’re doing about this!