A typical morning in IT: the phone rings first thing and a mail server is down. Since e-mail is the lifeblood of every organization, this is the type of problem that can seriously ruin your day. The cause: a VMware ESX snapshot gone awry, with the disk files having used up all available space on the datastore. On powering up the virtual machine, I was greeted with an error message that the "RedoLog" was corrupt and the machine needed to be powered off. My subsequent investigation into what went wrong and how it could be fixed showed me that the net was woefully inadequate in describing the problem and providing a remedy.
Snapshots, according to the VMware admin guides, are a short-term preventive measure for otherwise risky server maintenance tasks. They are meant to be taken immediately before a particularly risky task and kept for the bare minimum time needed to make sure that the server is indeed stable and ready to provide services. The upside to snapshots is that they provide a near-instantaneous method of reverting to a previous configuration. For network administrators and consultants used to fragile servers and dangerous tasks, they are a godsend. However, very few VMware administrators use snapshots correctly, and we're asking for trouble by taking snapshots gratuitously. It's easy to take a snapshot of a favored virtual machine for a rainy day, but there are caveats and they can quickly get serious.
Snapshots wreak havoc on the underlying file structure of VMFS file systems. Each time you snapshot a virtual machine, ESX stops writing to the current .vmdk file and starts a new file. The changes, called a delta, are written to the new file instead of the original .vmdk. The problem is that as multiple snapshots are taken, the ESX host must consult each file in the chain of snapshots to ascertain the state of a given VM. This not only hurts performance while the machine is in use, it also breaks the original size constraints of the source .vmdk file: snapshots can continue to grow even beyond the original disk size. This becomes a problem on servers that undergo a high rate of data change because, day in and day out, the server keeps eating free disk space on the datastore until a low space condition occurs. After a while, the virtual machine's folder holds a whole chain of delta files alongside the original .vmdk.
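As a rough illustration of such a chain, here is a toy Python model. It is not VMware's on-disk format: the `SnapshotChain` and `DiskFile` classes are hypothetical, and the file names only approximate the numbered delta files ESX creates. What it does show is why a read may have to walk back through every file in the chain, and why the files together can hold more data than the original disk.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class DiskFile:
    """One file in the chain: the base disk or a snapshot delta (toy model)."""
    name: str
    blocks: Dict[int, bytes] = field(default_factory=dict)


class SnapshotChain:
    """Hypothetical copy-on-write chain; block numbers stand in for disk sectors."""

    def __init__(self, base_name: str) -> None:
        self.base_name = base_name
        self.files: List[DiskFile] = [DiskFile(f"{base_name}.vmdk")]

    def take_snapshot(self) -> None:
        # The current top file is frozen; new writes land in a fresh delta.
        n = len(self.files)
        self.files.append(DiskFile(f"{self.base_name}-{n:06d}-delta.vmdk"))

    def write(self, block: int, data: bytes) -> None:
        # Writes only ever touch the newest file in the chain.
        self.files[-1].blocks[block] = data

    def read(self, block: int) -> Tuple[Optional[bytes], int]:
        """Return (data, files_consulted), searching the newest file first."""
        consulted = 0
        for disk_file in reversed(self.files):
            consulted += 1
            if block in disk_file.blocks:
                return disk_file.blocks[block], consulted
        return None, consulted

    def total_blocks(self) -> int:
        # Space used by the whole chain, counted in blocks.
        return sum(len(f.blocks) for f in self.files)


chain = SnapshotChain("SERVERNAME")
chain.write(0, b"v1")
chain.take_snapshot()
chain.write(0, b"v2")      # same block, but stored again in the first delta
chain.take_snapshot()
chain.write(1, b"mail")    # only the second delta knows about block 1

print([f.name for f in chain.files])
print(chain.read(0))       # has to look past the newest delta to find block 0
print(chain.total_blocks())
```

Block 0 is stored twice here, once in the base disk and once in the first delta, which is the toy version of a busy mail server rewriting the same data day after day while every old copy stays on the datastore.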
If a datastore runs out of free space, you will get an error that "The RedoLog for 'SERVERNAME' has been detected to be corrupt. The virtual machine needs to be powered off. If this problem persists, you need to discard the RedoLog." This is a typical message when a datastore has run out of space while VMware attempts to commit writes to the disk file. If you can free up some space on the datastore, you can delete all the snapshots on the virtual machine, which reconciles the change files against the original .vmdk and joins the separate disk files into a single file. This can take forever, so don't be alarmed if it takes twelve hours to complete.
To solve this problem, you need as much free space on the VMware datastore as the total thick-provisioned size of all the snapshotted disks in question. If you don't have it, VMware will remove all the .vmsn snapshot files but disk consolidation will fail. You can fix this, but it's much more labor intensive and can cause data loss if the disk geometry gets mangled.
In other words, if you have a 20GB operating system partition and an 80GB data partition, you should have 100GB free on the datastore before you delete all snapshots. If you don't have this type of free space, you can use vCenter Converter Standalone to migrate VMs to another datacenter, or even download the virtual machine files to an administrative workstation with free storage. Remember, you can always re-upload them later, then browse the datastore, right-click on the .vmx file and select "Add to Inventory" to bring a dead VM back to life.
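The free-space rule of thumb above is simple arithmetic, and a minimal Python sketch makes it easy to check before you start. The function names are made up for illustration and the numbers are just the 20GB plus 80GB example; this does not query ESX or vCenter, so you supply the sizes yourself.

```python
from typing import Sequence

GIB = 1024 ** 3  # bytes per GiB


def required_free_bytes(provisioned_disk_sizes: Sequence[int]) -> int:
    """Rule of thumb from the text: free space >= sum of the provisioned disk sizes."""
    return sum(provisioned_disk_sizes)


def can_delete_all_snapshots(free_bytes: int,
                             provisioned_disk_sizes: Sequence[int]) -> bool:
    """True if the datastore has enough headroom under that rule of thumb."""
    return free_bytes >= required_free_bytes(provisioned_disk_sizes)


# The example from the text: a 20GB OS disk and an 80GB data disk.
disks = [20 * GIB, 80 * GIB]
print(required_free_bytes(disks) // GIB)           # 100
print(can_delete_all_snapshots(60 * GIB, disks))   # False: migrate or free space first
print(can_delete_all_snapshots(120 * GIB, disks))  # True
```

If the check comes back False, that is the cue to move the VM elsewhere first rather than start a consolidation that cannot finish.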