At the recent Storage Field Day 7, I got a chance to discuss the resiliency features of VSAN with VMware’s Christos Karamanolis, VSAN’s primary architect. As I reflected on the conversation, I realized that my differences with Christos were rooted in the basic difference in perception between folks that have spent their careers managing storage and those that have lived in the compute world.
The truth is that I, like most steely-eyed storage guys, am paranoid. While the compute and network teams can, and do, joke about the storage team’s paranoia, this paranoia is hard earned. The difference between storage and the rest of the data center is persistence. Just as storage itself is persistent, so are storage screw-ups.
If the network team loads a bad config into the core routers at an organization and takes the whole network down, they can fix it in a matter of minutes by reloading the old config file. Sure, it may take hours to update every device, and someone has to drive to the Kalamazoo office where the out-of-band access failed, but at the end of the admittedly very long day the network can be exactly where it was before the screw-up occurred.
Every storage professional I know has not only feared that a mistake they just made would cost them their job, but has also been afraid, at least for a moment, that they’d put the company out of business.
Assume for a minute that you have been a customer of vendor E’s storage for years but have more recently been buying newer systems from vendor H as well. You get a message that drive 9 in one of your arrays has failed. Since you know that array was set up by your predecessor as RAID5, you jump in the car and drive into the office on a Sunday to swap out the failed drive for a spare.
You get to the data center and replace the third drive from the left on the top row. Then you realize that this was a vendor H system, and you just replaced drive 2 instead of drive 9, crashing the ERP system and destroying the data. Many hours later, you’ve rebuilt the RAIDset and restored from the last backup. But things aren’t back to the way they were before your little accident. Instead you’re back to the state your data was in when your last successful backup was made. Everything since is lost, and your screw-up persists.
Given this background, it’s easy to understand how Christos and I had different opinions around the VSAN failures to tolerate (FTT) parameter. As VSAN is currently implemented, setting FTT to 2 requires a minimum of five host servers, and the system writes three copies of each data object. Three of those hosts will hold the data replicas, and the other two serve as witnesses. The witnesses allow the system to know which partial set of nodes should continue to serve data in the event of a node failure or network division.
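The arithmetic behind that five-host figure can be sketched in a few lines of Python. This is a back-of-the-envelope model, not anything taken from VSAN itself: the function name `ftt_layout` is made up for illustration, and the general rule it encodes (FTT+1 replicas plus FTT witnesses, each on its own host) is an assumption extrapolated from the FTT=2 case described above.

```python
def ftt_layout(ftt: int) -> dict:
    """Return the assumed mirrored layout for a failures-to-tolerate value.

    Assumption (extrapolated from the FTT=2 case): the system keeps
    ftt + 1 data replicas and ftt witnesses, each on a separate host.
    """
    if ftt < 0:
        raise ValueError("ftt must be non-negative")
    replicas = ftt + 1
    witnesses = ftt
    return {
        "replicas": replicas,
        "witnesses": witnesses,
        "min_hosts": replicas + witnesses,  # works out to 2 * ftt + 1
    }


if __name__ == "__main__":
    for ftt in (1, 2):
        print(ftt, ftt_layout(ftt))
```

Under that assumption, FTT=2 works out to three replicas, two witnesses, and five hosts, which is why a three- or four-node cluster can't be set to FTT=2 today.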
I argued that I wanted the system to support three-way replication on a three-node cluster. If two hosts could see each other, they would continue and the witnesses wouldn’t be needed. I could even run with three-way resiliency on an EVO:RAIL that has 4 nodes.
Christos argued that in my model I wouldn’t be maintaining the FTT=2 level after the first failure with just three nodes, because there’s no place to rebuild to. More significantly, after the first failure there would be no way for the remaining two nodes to tell the difference between a second drive or node failure and a network problem that left both nodes running but unable to see each other.
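Christos’s objection is, at heart, a quorum argument, and a toy majority-vote model makes it easier to see. The Python below is a simplified sketch, not VSAN’s actual voting logic: the names `has_quorum` and `partition_state` are hypothetical, and it assumes every replica and every witness carries exactly one vote.

```python
def has_quorum(visible_votes: int, total_votes: int) -> bool:
    """True if this partition can see a strict majority of all votes."""
    return 2 * visible_votes > total_votes


def partition_state(visible_replicas: int, visible_witnesses: int,
                    total_votes: int) -> str:
    """Classify what one partition of the cluster can safely do.

    Assumption: one vote per replica and per witness (a simplification).
    """
    if visible_replicas == 0:
        return "no data"  # witnesses alone hold no copy of the object
    if has_quorum(visible_replicas + visible_witnesses, total_votes):
        return "online"  # safe to keep serving reads and writes
    # A copy exists here, but this side can't rule out a network split.
    return "offline, data intact"


if __name__ == "__main__":
    # Five hosts (3 replicas + 2 witnesses), two hosts lost:
    print(partition_state(1, 2, total_votes=5))  # online
    # Three replicas, no witnesses, one node left on its own:
    print(partition_state(1, 0, total_votes=3))  # offline, data intact
```

In this model the five-host layout stays online after two failures, while the three-node layout I was asking for drops offline after the second event because a lone node can't tell a dead peer from a broken link. The data is still there, though, and that outcome is the one I care about.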
He’s trying to maintain the computer scientist’s view of having the system deal with all possible scenarios. I’m worried about data loss. I want the minimum configuration that will survive two failures without data loss, even if it goes offline.
Hopefully I made my point, and the folks at VMware will add support for higher levels of data resilience on smaller clusters. I’d even like to see support for replicas on multiple disk groups in the same node if enough nodes aren’t available.
You can see the entire conversation in this video.