Creating a distributed, scale-out file system is a challenging task. You need to scale read/write I/O as well as metadata I/O, while still preserving the consistency of the entire file system. Scaling the metadata I/O is the truly hard part, and solving it is what gives Elastifile its performance edge.
Elastifile’s metadata services are fully distributed, running on every node in the system. Each file or directory can be owned by a different metadata service, and can migrate between them dynamically, according to the current workload. This dynamic ownership is especially beneficial in the hyperconverged deployment mode, since it allows us to colocate the accessor of a file (e.g. a VMDK) with its owner. By colocating accessor and owner, we reduce the network activity required to support the various NAS operations. In some cases, such as the getattr NFS operation, we can serve the request without ever going to the network.
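To make the colocation idea concrete, here is a minimal sketch of dynamic ownership: a table that migrates a file's metadata ownership toward the node accessing it, so that subsequent metadata operations (such as getattr) become node-local. The names (`OwnershipTable`, `access`) and the migrate-on-access policy are illustrative assumptions, not Elastifile's actual API or policy.

```python
# Hypothetical sketch: ownership migrates toward the accessing node,
# so repeated accesses are served without a network hop.
class OwnershipTable:
    def __init__(self):
        self._owner = {}  # file id -> owning metadata service (node)

    def access(self, file_id, accessor_node):
        """Record an access and migrate ownership to the accessor.

        Returns True when the access is node-local (owner == accessor),
        i.e. the metadata operation needs no network round-trip.
        """
        if self._owner.get(file_id) == accessor_node:
            return True  # accessor and owner colocated: served locally
        self._owner[file_id] = accessor_node  # dynamic migration
        return False  # this access still crossed the network

table = OwnershipTable()
print(table.access("vm1.vmdk", "node-A"))  # False: first access triggers migration
print(table.access("vm1.vmdk", "node-A"))  # True: now colocated, e.g. getattr is local
```

In a hyperconverged deployment, this is why a VM's VMDK and its metadata owner tend to end up on the same node: the steady-state access pattern pulls ownership to the accessor.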
The dynamic ownership is managed by the Ownership Repository Coordinator (a.k.a. ORC). The ORC is a crucial piece: it can sit on the I/O datapath (so it must be highly performant), and it must keep operating through failures. The combination of high availability and strong consistency is usually handled by consensus algorithms such as Paxos or Raft. However, these algorithms were designed for basic high availability and are not optimized to maintain consistent performance during the failure event itself, making them a poor match for high-performance, scale-out, distributed file systems.
Existing consensus algorithms are built upon a (serialized) distributed log. Every operation is written to the log, and can be applied only once a majority of the nodes acknowledge it. This design has several limitations:
- If one log entry is slow (e.g. due to a slow disk I/O or packet loss), it blocks all subsequent log entries from being applied until it completes, even when those entries touch unrelated data.
- When recovering from a failure, the entire log must be recovered before the state is reconstructed and I/O can be served again.
- Because recovery is expensive, the failure-detection timeout is set high (to avoid false detection), which lengthens the window without service during real failures.
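The first limitation, head-of-line blocking, can be modeled in a few lines. The sketch below is not any real Raft or Paxos implementation; it only simulates the rule that a log entry may be applied only after every earlier entry has been applied.

```python
# Illustrative model of a serialized distributed log: entry i can be
# applied only once entries 0..i-1 have been applied, so one slow
# entry delays everything behind it.
def apply_times(ack_times):
    """ack_times[i] = time entry i is acknowledged by a majority.
    Returns the time each entry can actually be applied."""
    applied = []
    frontier = 0
    for t in ack_times:
        frontier = max(frontier, t)  # must wait for every earlier entry
        applied.append(frontier)
    return applied

# Entries 0 and 1 ack quickly; entry 2 is slow (e.g. packet loss);
# entries 3 and 4 ack fast but are stuck behind the slow entry.
print(apply_times([1, 1, 50, 2, 2]))  # -> [1, 1, 50, 50, 50]
```

One slow entry (acknowledged at t=50) drags the apply time of every later entry up to 50, regardless of how fast those entries were acknowledged.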
For these reasons, we developed a specialized key-value consensus algorithm called Bizur. Bizur overcomes the aforementioned limitations by allowing operations on independent keys to be applied concurrently. A detailed description of the Bizur algorithm is available in our full paper, where we compare Bizur with two well-known key-value systems: ZooKeeper and etcd.
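The scheduling consequence of per-key independence can be shown with the same kind of toy model. This sketch captures only the property that each key waits only for earlier operations on the same key; it does not model the actual Bizur protocol (leader election, buckets, replication).

```python
# Toy model of key-value consensus with independent keys: an operation
# waits only for earlier operations on the *same* key, not for the
# whole log. Compare with the single global frontier of a serialized log.
def apply_times_per_key(ops):
    """ops: list of (key, ack_time) in submission order.
    Returns the apply time of each operation."""
    frontier = {}  # per-key frontier instead of one global frontier
    applied = []
    for key, t in ops:
        frontier[key] = max(frontier.get(key, 0), t)
        applied.append(frontier[key])
    return applied

# Same workload as the serialized-log example, now spread across keys:
# the slow operation on key "b" no longer delays keys "c" and "d".
print(apply_times_per_key([("a", 1), ("a", 1), ("b", 50), ("c", 2), ("d", 2)]))
# -> [1, 1, 50, 2, 2]
```

Under the serialized log, the same five operations all waited for the slow one; with per-key independence, only the slow key pays the penalty. This is also what limits recovery cost: unaffected keys can keep serving while a failed key's state is rebuilt.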
Below are graphs depicting the behavior of the different systems (etcd, Bizur, and ZooKeeper) under packet drop and when the leader fails. As shown below, Bizur handles these failures much more gracefully than the existing systems.
Effect of packet drop (0% to 5%)
Effect of Leader Failure