Like many new technologies, server SANs bring a mix of promise and risk. Replacing the arcane protocols of traditional SANs -- and the associated high priesthood of storage administrators -- with software running on servers full of SSDs and disks is pretty compelling, but it also has a dark side. If you can mix and match storage hardware and software to create a server SAN, you may discover they don’t play together as well as you might like.
At the DeepStorage lab, we ran into this problem when building a cluster to test VMware’s VSAN. We bought servers with Intel AHCI SATA controllers when AHCI was on the VSAN beta hardware compatibility list, but when VSAN shipped, VMware determined that AHCI was too slow.
Last month, VMware once again declared that VSAN requires more disk controller horsepower and removed several low-end SAS host bus adapters (HBAs) from its hardware compatibility list (HCL).
This change was based in no small part on the experience of Jason Gill, who reported on Reddit that when he replaced a failed storage node in his VSAN, the rebuild process went horribly wrong. In fact, it went so horribly wrong that all of his virtual machines went offline.
Even with his constant attention and VMware’s level 147 support on the phone, and even after shutting down less-critical VMs to manage load, his critical servers were offline for seven hours, and it was a full 24 hours before the system was fully up and running.
VMware’s root-cause analysis pointed the finger at the Dell H310 SAS HBAs in Mr. Gill’s storage nodes. Here's the official statement on the VMware blog: "Although Virtual SAN has a built-in throttling mechanism for rebuild operations, it is designed to make minimal progress in order to avoid Virtual SAN objects from being exposed to double component failures for a long time. In configurations with low queue-depth controllers, even this minimal progress can cause the controllers to get saturated, leading to high latency and IO time outs."
In plain English, that means, "Sure, VSAN throttles, but we set a minimum progress level as well, and that minimum progress plus the VMs doing their normal I/O was more than the poor underpowered HBA with a maximum queue depth of 25 could handle." That’s a reasonable statement, except for the fact that the Dell H310 was at that time on the VSAN hardware compatibility list.
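To put rough numbers on that, here is a back-of-the-envelope sketch in Python. It is a toy model, not VMware's throttling logic: it assumes the controller services at most its queue depth in I/Os at once, that each I/O takes a fixed service time, and it uses Little's law to estimate latency once demand exceeds the queue depth. Only the queue depth of 25 comes from the incident; the service time and the VM and rebuild I/O counts are made-up figures for illustration.

```python
def estimated_latency_ms(queue_depth, outstanding, service_ms):
    """Toy model of a controller that services at most `queue_depth` I/Os
    at once, each taking `service_ms` milliseconds.

    While demand fits in the queue, latency is just the service time.
    Beyond that, throughput is capped at queue_depth / service_ms, and
    Little's law gives latency = outstanding / throughput.
    """
    if outstanding <= queue_depth:
        return service_ms
    throughput_per_ms = queue_depth / service_ms
    return outstanding / throughput_per_ms


QUEUE_DEPTH = 25     # the H310 figure cited above
SERVICE_MS = 5.0     # assumption: per-I/O service time
VM_IO = 20           # assumption: outstanding I/Os from normal VM activity
REBUILD_FLOOR = 15   # assumption: outstanding I/Os from "minimal progress" rebuild

normal = estimated_latency_ms(QUEUE_DEPTH, VM_IO, SERVICE_MS)
rebuilding = estimated_latency_ms(QUEUE_DEPTH, VM_IO + REBUILD_FLOOR, SERVICE_MS)

print(f"normal load: {normal:.1f} ms, during rebuild: {rebuilding:.1f} ms")
```

With these assumed figures the VMs alone fit inside the queue, but the rebuild floor pushes total demand past 25, and every extra outstanding I/O adds directly to latency. A rebuild floor that cannot drop below a fixed level means the only way to relieve the controller is to shed VM load, which is roughly what Mr. Gill ended up doing by hand.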
Like most steely-eyed storage guys, I consider a vendor’s HCL a significant endorsement, if not a holy document. When VMware said that a hardware component on its HCL was the cause of a failure, it just felt wrong. It seems obvious that VMware didn’t test the edge cases, like a storage node failing and being replaced, on lower-end hardware like the H310.
While the $300 or so additional cost for a faster SAS HBA, or RAID controller, is a small fraction of the $15,000 cost of a VSAN server and its software licenses, it makes me wonder whether the whole “no RAID controller required” selling point for VSAN was that good an idea. I, for one, have nothing against a good hardware RAID controller.
If the rumors are true, VMware is picking up the tab for at least some customers who have to buy faster HBAs, like Dell’s H710. I hope they’re true. It would be nice to see a vendor stand behind its HCL like that.