File services have been a critical part of the modern computing world for several decades. UNIX introduced the IT world to the concept of the file (a random access array of bytes) as a simple and generic interface for data objects stored within a dynamic, hierarchical namespace (i.e. supporting nested directories, with the ability to perform dynamic CREATE, MOVE, and DELETE operations). The power of that model made UNIX-style files the de facto standard interface for storage administrators handling application data and tasked with overall data management. Even today, most applications use files directly as the primary data interface (or at least as a supplementary interface) for configuration, logging, etc. In addition, file is the primary (and in many cases the only) data service used by any modern OS (Linux, Windows, Solaris, etc.), including virtualization platforms (VMware, Hyper-V, KVM).
[For a quick primer on the basics of storage architectures, including the pros and cons of file, block, and object storage, check out this previous blog.]
Shared Block vs. Shared File System
Most enterprises believe that storage should be shared and managed by storage administrators. This means that most data shouldn’t be stored on local devices (DAS), but rather on a shared storage solution, so that the storage can be centrally managed, protected, and monitored. Without shared storage it is much more complex to design, implement, and enforce critical enterprise policies (whether business or regulatory) regarding data high availability, backup, versioning, disaster recovery (DR), etc.
For a long time, there was a debate amongst storage architects over the best shared storage architecture. The candidates were shared block (“SAN”) or shared file system (“NAS”), with the main difference being what gets exposed (i.e. made network accessible) to the server: a block entity (“volume”) or a file entity (“file share” or just “share”). Still, since file is the primary application storage interface, the difference between shared block architectures and shared file architectures is not which storage interface the application uses, but rather how the application-facing file services are implemented: as a local server service on top of shared block volumes (i.e. using a local file system such as NTFS, ext3, or ZFS) or as a networked service using a networked file system. The two options are depicted below:
File services in shared block and shared file architectures
Ultimately, as I’ll now describe, the key differences between shared block and shared file architectures lie in 1) their differing divisions of IT responsibility and 2) the differing levels of data agility that they can deliver.
File Services and IT Responsibility
Even today, the combination of servers, applications, and virtualization (which we’ll refer to collectively as “servers” moving forward) is commonly managed separately from storage (and often also from networking). When a shared block architecture is used, the storage administrator manages block entities (“volumes”) and the server administrators are responsible for managing the file system layer on top of these volumes. In such an architecture, the storage admin mainly focuses on the low-level storage tasks (i.e. tasks that are not directly visible to the end user) such as provisioning volumes, assigning them to servers, and managing the associated data protection, backup, and disaster recovery (DR). The server admin formats the file system and manages the free space, users, quotas, etc.
When a shared file architecture is used, the file services and the low-level storage tasks are managed by the storage admin (see the figure below).
Data and storage responsibilities in shared block and shared file storage architectures
Even though this shifts some work and responsibility away from the server admin to the storage admin, many storage admins still prefer it over a shared block architecture, since many tasks are much simpler and more efficient when done at the file level. A few examples:
- Capacity management - In SAN architectures, capacity management is hierarchical and has two (or more) levels. At one level, the storage admin is providing volumes. At another level, the server admin is managing data within the volumes. This means that each level has to maintain its own spare space (double reservation). As a result, the storage admin doesn’t really know how much space he/she has, because a big part of the free capacity is hidden from him/her. This also means that the storage admin has much less control over capacity usage, because volume-level capacity changes are complex (due to the need for multi-level adjustments).
- Copy services - Snapshots, backup, DR, etc. are easier to use and more powerful at the file level. For example, at the file level you can specify which files to back up, e.g. only certain file types, only files owned by specific users, etc. The same is true when restoring files - you can restore a file or a sub-namespace from a snapshot/backup without restoring the entire volume! Moreover, such restore operations can be performed as “self service” operations, making both the data user and the storage admin much happier.
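To make the double reservation point concrete, here is a minimal sketch of the arithmetic. The pool size, volume names, and numbers are all made up for illustration; the point is only that free space inside provisioned volumes is invisible at the block level.

```python
from dataclasses import dataclass


@dataclass
class Volume:
    name: str
    provisioned_tb: int  # carved out of the pool by the storage admin
    used_tb: int         # actually consumed by files (known only to the server admin)


def capacity_report(pool_tb, volumes):
    """Compare the free space the storage admin sees with the real free space."""
    provisioned = sum(v.provisioned_tb for v in volumes)
    used = sum(v.used_tb for v in volumes)
    return {
        "pool_free_tb": pool_tb - provisioned,  # visible at the block level
        "hidden_free_tb": provisioned - used,   # spare space stranded inside volumes
        "true_free_tb": pool_tb - used,         # what a file-level view would show
    }


# Hypothetical 100 TB pool with three volumes.
volumes = [Volume("db", 40, 22), Volume("logs", 30, 9), Volume("home", 20, 14)]
report = capacity_report(100, volumes)
print(report)
```

In this made-up case the storage admin sees 10 TB free, while another 45 TB sits unused inside the volumes.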
As described above, two key reasons for the superiority of file management over volume management are the management granularity level (files instead of volumes) and the presence of extensive file metadata (access control information, type, dynamic size, etc.).
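As a rough illustration of what file-level granularity and metadata buy you, the sketch below picks backup candidates by file type and owner using only standard file metadata. The function name `select_for_backup` and the sample files are hypothetical, not part of any particular backup product.

```python
import os
import tempfile
from pathlib import Path


def select_for_backup(root, suffixes=None, owner_uid=None):
    """Yield files under root that match the given suffixes and/or owner UID."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            if suffixes is not None and path.suffix.lower() not in suffixes:
                continue
            if owner_uid is not None and path.stat().st_uid != owner_uid:
                continue
            yield path


# Demo on a throwaway directory standing in for a file share.
with tempfile.TemporaryDirectory() as share:
    for rel in ("report.txt", "debug.log", "projects/notes.txt"):
        path = Path(share, rel)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text("sample data\n")

    # Only certain file types.
    txt_only = sorted(p.relative_to(share).as_posix()
                      for p in select_for_backup(share, suffixes={".txt"}))

    # Only files owned by a specific user (here: whoever created the directory).
    my_uid = os.stat(share).st_uid
    mine = sorted(p.relative_to(share).as_posix()
                  for p in select_for_backup(share, owner_uid=my_uid))

print(txt_only)
print(mine)
```

Nothing comparable is possible at the volume level, where the storage system sees only blocks and has no notion of file type or owner.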
Put another way: when a shared block architecture is used, the storage admin mainly manages storage entities (volumes). When a shared file architecture is used, the storage admin instead mainly manages data (data access, data life cycle, data users, etc.).
Most enterprises would likely agree that their most valuable IT asset is the data that they create, store, and process. Whether delivered over shared block or shared file architectures, file services can satisfy the data’s storage requirements. However, the architecture options (i.e. block or file) differ greatly in their ability to deliver the desired data agility.
One of the most important differences relates to the ability to share data. Server-local file services are restricted to their local environment. If you want to share or synchronize the data with a different data source / data set, you need to do that explicitly (and in many cases manually) using task-specific tools and mechanisms. So, if you want every server to have the most up-to-date data source, you must copy it to each server. On the other hand, when using a shared file system (a “share”), you simply update one location and all servers using the share are immediately synchronized. In addition, the file-level copy services enabled by the shared file architecture are critical for efficient synchronization of distinct data sets stored on different systems and/or in different locations.
Another important attribute of shared data is mobility. Shared data also means “network accessible data”. In other words, any server connected to the network can access the data. Shared data not only allows data mobility and data distribution, but also supports efficient parallel processing, data exchange, and “data flow”-oriented applications.
Just to clarify, block volumes can be shared, but this is completely different. Shared block volumes allow storage resource sharing, but NOT data sharing. In fact, shared block volumes are almost exclusively used as the backend for a distributed file system such as VMware’s VMFS, IBM’s GPFS, Red Hat’s GFS, etc.
The table below summarizes the main advantages and disadvantages of shared and local file systems:
File in the Cloud
I think that most people will agree that the concept of modern (public) cloud really formed when Amazon introduced AWS as a new IT model, derived from their own internal IT. In its initial state it was very similar to web-scale IT, where the focus was on scalability and flexibility of the compute resources. The pure concept was simple: detach the application from the data, implement the application on a stateless VM (i.e. the OS state is deleted after the VM is stopped), and push the persistent data to a new object (blob) store named S3 (“Simple Storage Service”). For applications that need persistent transactional storage, a persistent (non-shareable) block service was added - EBS. In addition, Amazon pushed a new compute model where the application should be implemented on top of cloud services and shouldn’t directly use low level infrastructure such as files. Implicitly, that model hinted that shared file is not required or even desired in the cloud.
This new cloud model made and is still making a lot of noise in the IT world and is challenging many legacy IT processes, mechanisms, and architectures. On the other hand, as the public cloud is starting to be enterprise production ready, it is also changing rapidly and adapting itself to the enterprise world. I think that it is apparent today that:
- In most cases, the cloud is not going to completely replace the on-premises data center in the foreseeable future. Instead, it will extend, complement, and coexist with it.
- Due to the point above (and due to many other reasons), the vast majority of existing enterprise applications will not be rewritten as cloud native applications.
- The prior point means that the cloud needs to provide an application IT environment (compute, storage, networking) that is compatible with the legacy IT world.
The points mentioned above have implications for many areas of IT. Limiting the discussion to storage, I would raise the following points:
- Local, ephemeral storage and persistent block services (volumes, disks) are implemented reasonably well in the cloud.
- The object store concept failed to replace files as the main application primary storage interface. There are many reasons why, and this probably deserves a different blog post...but I think this is evident enough to declare it as such, even if only because replacing file with object would necessitate a rewrite of many (most?) applications. On the other hand, objects do have significant success as a data store providing secondary/cold/“active archive” storage (and this deserves another post too...).
- Due to the failure of the object store concept as a primary interface, there is no real shared primary data (or even shared storage) solution in the cloud.
The Role of Shared Storage in the Cloud
The core architecture of the cloud promotes the separation of application logic and storage, the ability to spin up applications very fast, anywhere (within the cloud domain), and the ability to scale the application’s processing capability very quickly. All of the above require efficient shared storage (i.e. storage accessible via the network). The persistent block storage options (EBS volume, PD volume, etc.) are, in fact, networked volumes similar to enterprise shared block volumes (SAN volumes). They are redundant and can be attached to any (single) VM. They also provide basic copy services such as snapshots and backup to objects.
The Role of Shared File in the Cloud
As already mentioned, clouds have an inherent shared data mechanism, the object store, and it is used by many services. So the real question is...do you need primary shared file? Amazon seems to think so and, therefore, added a shared file service (EFS) to its portfolio of services. My view is that it depends on the use case, but, in most cases, effective cloud integration does require shared file today…and it will definitely be needed in the future, as I will now explain.
Since enterprise cloud usage is still very new, the cloud-related use cases are still evolving. One of the first enterprise use cases was “backup to the cloud”. In this use case the object store is used as a target for enterprise backup data. Though enabling backup is a nice benefit, you’ll probably agree that this doesn’t fully utilize the potential of the cloud.
Another use case is “batch processing in the cloud”. This use case is typically about pushing some dataset to the cloud, processing it using the extensive, cloud-based compute power, and pushing the results back. Will shared file help? Probably yes, especially if you want to concurrently process the data. For example, big data analytics and rendering jobs are much easier to implement with a shared file platform. From what I hear and understand, this is the main use case for Amazon’s EFS.
The next use case of note is “burst to the cloud”. In this case, you have some dynamic activity in your on-premises data center and occasionally you need more resources to perform the activity. Therefore, you want to dynamically utilize cloud resources to temporarily extend your data center. In this use case, the separation between compute and data is almost critical since, in many cases, you need to scale the compute at a different rate (and/or in a different location) than the data. File-level data granularity and file sharing capabilities enable very efficient synchronization amongst sites and compute processing entities.
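To show why file-level granularity helps synchronization between sites, here is a minimal sketch that copies only new or changed files from one directory tree to another. The `manifest` and `sync` helpers and the `on_prem`/`cloud` directories are illustrative assumptions; a real sync service would also handle deletions, conflicts, and network transfer, which this sketch ignores.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def manifest(root: Path) -> dict:
    """Map each file's path (relative to root) to a SHA-256 digest of its content."""
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def sync(source: Path, target: Path) -> list:
    """Copy only new or changed files from source to target; return their paths."""
    src, dst = manifest(source), manifest(target)
    changed = [rel for rel, digest in src.items() if dst.get(rel) != digest]
    for rel in changed:
        dest = target / rel
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(source / rel, dest)
    return changed


# Demo: two local directories stand in for the on-premises site and the cloud.
with tempfile.TemporaryDirectory() as tmp:
    on_prem, cloud = Path(tmp, "on_prem"), Path(tmp, "cloud")
    (on_prem / "frames").mkdir(parents=True)
    cloud.mkdir()
    (on_prem / "scene.cfg").write_text("quality=high\n")
    (on_prem / "frames" / "0001.dat").write_text("frame 1\n")

    first = sorted(sync(on_prem, cloud))    # initial burst: everything is new
    (on_prem / "scene.cfg").write_text("quality=draft\n")
    second = sorted(sync(on_prem, cloud))   # afterwards: only the edited file moves

print(first)
print(second)
```

With a block volume, the equivalent operation would have to work on opaque blocks (or the whole volume), without knowing which files they belong to.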
The Future Cloud Data Fabric
File services, and the data agility that they deliver, are already beneficial for many existing cloud-related use cases, but I believe they will be critical to the emerging, mainstream cloud-related use cases involving either continuous hybrid cloud workflows or cloud-only workflows. Though these use cases are not yet common practice, the buzz around them is very high and some implementations already exist. Most agree that they will be much more common in the (not so distant) future.
The main difference between the earlier use cases and these last two use cases is the in-cloud data lifespan. In the previous use cases, the cloud data was either a cold copy of the on-premises data and/or a temporary data repository that was probably deleted once the batch/burst was done. In such cases, the data lifecycle management is much less important and all requirements related to backup, archiving, disaster recovery, versioning, etc. are satisfied by the persistent copy at the on-premises origin. On the other hand, for continuous hybrid cloud and cloud-only use cases, the data may be generated in the cloud and/or managed in the cloud for long durations or even permanently. In such cases, the in-cloud data services must support all required enterprise data services, management, and policies, including backup, disaster recovery, archival of cold data, user/group access policies, versioning, anti-virus scanning, encryption, etc. Furthermore, as the data sets span several geographically distributed locations and/or clouds and are accessed concurrently across them, these enterprise-level data services will need to operate across sites. As a result, new mechanisms to synchronize these geographically distributed data sets will be required.
Elastifile calls this elastic, scalable, universal data layer a cross-cloud data fabric (as depicted in the image below).
Multi site/cloud data fabric
In the data fabric model there are 3 types of data classes:
- Stateless images: Each compute instance can run the enterprise image that is loaded by the cloud vendor, on demand. The image itself is stateless and/or uses some local ephemeral storage, such that scaling the compute services up/down is very quick and effective. Some instances can be long-lasting, and some can be very short-lived. Note that an instance can actually be a VM, a container, or some serverless entity. Once spawned, each instance is connected to one or more data sets in the data fabric (described below).
- Data fabric: The primary data repository and working space. All primary data services are implemented in that layer, including versioning (snapshots), site-to-site (geo) sync, disaster recovery, access control, antivirus, backup, tiering, and archiving. In addition, the data fabric provides a single pane of glass across sites and clouds to create global visibility and enable unified management of data, policies, users, etc.
- Long term (object) store: The deep data lake containing cold and archived data. Also acts as a versioned backup/“active archive”.
As we move forward, the data fabric, powered by a shared file architecture, will provide the data backbone for most enterprise activities across the different data centers and clouds.
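One way to picture the relationship between the data fabric and the long term store is a simple tiering policy. The sketch below moves files that have not been modified for a given number of days out of a working directory and into an archive directory. The 90-day threshold, the `tier_cold_files` name, and the use of a plain directory in place of an object store are all assumptions made for illustration.

```python
import os
import shutil
import tempfile
import time
from pathlib import Path

COLD_AFTER_DAYS = 90  # hypothetical policy threshold
SECONDS_PER_DAY = 86400


def tier_cold_files(fabric: Path, archive: Path, now: float,
                    cold_after_days: int = COLD_AFTER_DAYS) -> list:
    """Move files not modified within cold_after_days from fabric to archive."""
    cutoff = now - cold_after_days * SECONDS_PER_DAY
    moved = []
    for path in sorted(fabric.rglob("*")):  # list is built before anything moves
        if path.is_file() and path.stat().st_mtime < cutoff:
            rel = path.relative_to(fabric)
            dest = archive / rel
            dest.parent.mkdir(parents=True, exist_ok=True)
            shutil.move(str(path), str(dest))
            moved.append(rel.as_posix())
    return moved


# Demo: one recently written file and one whose mtime is set 200 days back.
with tempfile.TemporaryDirectory() as tmp:
    fabric, archive = Path(tmp, "fabric"), Path(tmp, "archive")
    (fabric / "old").mkdir(parents=True)
    archive.mkdir()
    (fabric / "active.db").write_text("hot data\n")
    cold = fabric / "old" / "report.csv"
    cold.write_text("cold data\n")

    now = time.time()
    stale = now - 200 * SECONDS_PER_DAY
    os.utime(cold, (stale, stale))  # pretend the file was last touched 200 days ago

    moved = tier_cold_files(fabric, archive, now)
    hot_still_in_fabric = (fabric / "active.db").exists()
    cold_in_archive = (archive / "old" / "report.csv").exists()

print(moved)
```

A real fabric would presumably apply such policies transparently and keep the namespace intact; the sketch only shows that the decision is naturally expressed per file, using file metadata.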