With standards efforts moving forward rapidly at the Storage Networking Industry Association, Non-Volatile Memory Express (NVMe) over Fabrics looks likely to be just over the horizon. NVMeF promises both a huge gain in system performance and new ways to configure systems. To understand what this means, we have to separate NVMe from its current implementation, which uses direct PCIe connections. The PCIe approach arose a few years back when the then top-tier persistent storage device, the flash card, hit the limits of SCSI-based operation because of IO stack overhead and interrupt-handling problems.
NVMe replaced the SCSI stack in the operating system with a simpler and faster circular queue system, in the process adding the ability to build a huge number of queues, which could then be tied to CPU cores, or even to applications or tools, to tune operations. As NVMe was originally intended for PCIe, it took advantage of PCIe's direct memory access (DMA) and made operations a "pull" system, where the downstream devices take entries from the queues whenever they free up resources.
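To make the circular-queue idea concrete, here is a minimal sketch of an NVMe-style submission ring in C. The structure names, field layout, and queue depth are illustrative assumptions for this article, not the actual NVMe data structures (real NVMe commands are 64-byte entries and queues can be up to 64K deep); the point is the producer-consumer "pull" pattern, where the host advances a tail index and the device drains from the head at its own pace.

```c
/* Illustrative sketch of an NVMe-style circular submission queue.
 * Names and sizes are simplified stand-ins, not the real NVMe layout. */
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define QUEUE_DEPTH 8   /* real NVMe queues can hold up to 64K entries */

struct sq_entry {        /* stand-in for the 64-byte NVMe command */
    uint8_t  opcode;
    uint16_t command_id;
    uint64_t lba;
};

struct sub_queue {
    struct sq_entry ring[QUEUE_DEPTH];
    uint32_t tail;       /* host writes commands here, then rings a doorbell */
    uint32_t head;       /* device advances this as it pulls commands */
};

/* Host side: enqueue a command; returns 0 if the ring is full. */
static int sq_submit(struct sub_queue *q, const struct sq_entry *cmd)
{
    uint32_t next = (q->tail + 1) % QUEUE_DEPTH;
    if (next == q->head)
        return 0;                 /* full: device hasn't caught up yet */
    q->ring[q->tail] = *cmd;
    q->tail = next;               /* a doorbell register write would go here */
    return 1;
}

/* Device side: pull the next command whenever resources free up. */
static int sq_pull(struct sub_queue *q, struct sq_entry *out)
{
    if (q->head == q->tail)
        return 0;                 /* empty: nothing to do */
    *out = q->ring[q->head];
    q->head = (q->head + 1) % QUEUE_DEPTH;
    return 1;
}
```

Because each queue is a simple lock-free ring with one producer and one consumer, a system can create one per core, or even per application, with no shared locks between them, which is exactly why per-core queues tune so well.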
The other neat feature is interrupt consolidation, which substantially reduces system overhead: saving state, handling the interrupt, and then restoring state. With NVMe, completions are parked in "completion queues" that the OS driver can read as a large block and service together. The result is that flash cards achieve millions of IOPS and gigabytes per second of bandwidth.
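The batching behavior can be sketched in the same style. In this illustration (again a simplification, with hypothetical names, not the real driver structures), the device posts completions into a ring with a phase bit that flips on each wrap; the driver drains every new entry in a single pass per interrupt instead of taking one interrupt per I/O.

```c
/* Illustrative sketch of completion-queue batching: the device posts
 * completions tagged with a phase bit; the driver drains all new entries
 * in one pass. Simplified stand-in, not the real NVMe layout. */
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define CQ_DEPTH 8

struct cq_entry {
    uint16_t command_id;
    uint8_t  phase;      /* device flips this value each time the ring wraps */
};

struct comp_queue {
    struct cq_entry ring[CQ_DEPTH];
    uint32_t head;       /* driver's consumer index */
    uint8_t  phase;      /* phase value the driver expects for new entries */
};

/* Drain every completion posted since the last pass; in a real driver this
 * runs once per (coalesced) interrupt rather than once per I/O. */
static int cq_drain(struct comp_queue *q, uint16_t *done, int max)
{
    int n = 0;
    while (n < max && q->ring[q->head].phase == q->phase) {
        done[n++] = q->ring[q->head].command_id;
        q->head = (q->head + 1) % CQ_DEPTH;
        if (q->head == 0)
            q->phase ^= 1;          /* wrapped: expected phase flips */
    }
    return n;    /* one doorbell write acknowledges all n completions */
}
```

One interrupt, one state save/restore, many I/Os retired: that amortization is where the overhead reduction comes from.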
A natural evolution of this approach is to extend connectivity from a PCIe slot to at least accommodate some drive form factors. PCIe uses the same physical-layer drivers as SAS and SATA, which are also used for LAN communications. As a result, we see SATA Express drive bays that can handle both PCIe/NVMe SSDs and SATA drives.
Clearly, though, the industry can take the performance benefits of NVMe further. Expanding the protocol to cater for a multi-host environment and running NVMe over a variety of fabrics is an intriguing prospect. This isn't a trivial change, however: longer cable lengths make signal timing critical, while the inevitable switching modules complicate transaction handling and collision management.
More importantly, there's been conflict over the choice of fabrics. On the one hand, there are purists who see PCIe itself as the fabric solution, with low overhead and no latency from intervening adapters. The other camp promotes solutions based on existing Ethernet fabrics using RDMA, with the advantage of reusing existing infrastructure plus the addition of RDMA NICs. We also have InfiniBand and even Fibre Channel as contenders for a piece of the fabric, but PCIe and Ethernet look strongest today, and products are starting to emerge.
Of course, these competing solutions are a result of the specific interests of vendors and their user communities. The result is that almost all approaches to the networked storage area will be offering NVMe solutions within a year, which clearly is going to create some confusion. To make some sense of this, let’s look at use cases.
First, NVMe is almost perfectly suited for the all-flash array model. This means the likely choices are Ethernet and Fibre Channel as carriers for NVMeF. On the surface, this looks like Fibre Channel turf, since most AFAs are installed in SANs to accelerate them. One problem, though, is that RDMA over Fibre Channel is a new, untested concept, while Ethernet has had RDMA technology for years and it is well understood and characterized. Another issue is link performance. Ethernet is already at 25 Gbps for a single link and 40 Gbps for a quad. The Fibre Channel alternative is just 16 Gbps, and quads are on the roadmap for next year.
Mellanox is already talking up its 50 Gbps link, so it looks like Ethernet will continue to lead the race. This probably means Ethernet will win, especially if scale-out is a consideration.
Object storage solutions recently had a performance growth spurt orchestrated by SanDisk and Red Hat, which together extended Ceph to handle SSDs better. Ceph still is bottlenecked by back-end inter-node communications, but RDMA is already showing a major boost in performance. This, in the end, is almost all Ethernet territory and, with object storage a rapidly growing sector of the storage market, will strengthen the case for Ethernet.
When it comes to clustering systems and storage, at least one vendor, startup X-IO Technologies, is bringing an NVMe-over-PCIe fabric cluster to market. X-IO's performance claims are impressive, but overall, PCIe currently lacks the infrastructure needed for scale.
In many ways, PCIe fabrics illustrate the problem with using anything but Ethernet. The fabric components for scale-out just won't exist in anything else for at least a year or two, so the question you should ask is, "Why would I use anything but Ethernet to speed up my system?"
There are some further complications to the NVMeF story. Intel is bringing a flash alternative, 3D XPoint, to market in late 2016. Intel touts it as much faster than flash and plans to deliver it in both NVDIMM and NVMe drive solutions. However, Intel describes the fabric interface as based on Omni-Path, its proprietary interconnect, which will add yet another player to the fabrics debate.
NVMe will inevitably weaken the SAS drive business, since it performs so much better and adds little to the cost of a drive. It may well be that the inevitable desire of drive and system makers to use NVMe as a price differentiator, just as SAS and SATA were used, will delay broad and inexpensive deployment. Even so, NVMe is by far the better solution for the future server farm and should, in the end, replace SATA.
NVMe over Fabrics is inevitable in the near future and should boost the performance of server farms by a significant amount. I’m looking forward to it, and its potential for software-defined storage.