Recovering From RedoLog Corrupt Errors On VMware ESX/ESXi
March 11, 2010
In my last entry, I discussed basic best practices for using snapshots in VMware environments. Today I want to get a little more technical by talking about recovery options for virtual machines which will not boot because of snapshot errors.
When you issue a "delete all snapshots" from the context menu of the Virtual Infrastructure (VI) Client and there is not enough disk space to complete the operation, VMware has a nasty tendency to remove the snapshot files (.vmsn) and leave you with a non-functional VM that has no snapshots listed. When you try to power on the virtual machine in question, you will get the familiar: "The RedoLog for "SERVERNAME" has been detected to be corrupt. The virtual machine needs to be powered off. If this problem persists, you need to discard the RedoLog."
Unfortunately, because you already tried to reconcile the snapshots with insufficient disk space, there are no longer any .vmsn files on the datastore and no snapshots are listed in the snapshot manager. As a result, you can no longer issue the "remove all snapshots" command, and the problem can't be fixed from the VI Client GUI.
Start by freeing up space on the datastore at least equal to the total size of the disks attached to the VM. Then sit down at the console or open an SSH session to your ESX host and change directory to the virtual machine's folder on the datastore. The disks for the fragmented VM will be split into as many different files as there are snapshots. To repair them, we need to clone the fragmented disks into a new file. Run the command:
vmkfstools -i vmname.vmdk vmname-repaired.vmdk
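The free-space check and the clone step can be sketched as a small shell helper. This is only an illustration: the function names has_enough_space and clone_command are made up for this example, and the size figures are values you would read yourself from the datastore (for instance from the output of ls -l and df -k in the VM folder). It prints the clone command rather than running it, so you can review it first.

```shell
#!/bin/sh
# Hypothetical helper (not a VMware tool): succeeds when the free space
# on the datastore covers the space the cloned disk will need.
# Both arguments are whole numbers in the same unit, e.g. kilobytes.
has_enough_space() {
    required_kb=$1
    free_kb=$2
    [ "$free_kb" -ge "$required_kb" ]
}

# Hypothetical helper: print the clone command for a disk descriptor,
# using the same "-repaired" naming as in the text above.
clone_command() {
    src=$1
    printf 'vmkfstools -i %s %s-repaired.vmdk\n' "$src" "${src%.vmdk}"
}

# Example usage with made-up sizes (required 40 GB, free 60 GB, in KB):
# if has_enough_space 41943040 62914560; then
#     clone_command vmname.vmdk
# else
#     echo "Free up more space on the datastore first" >&2
# fi
```

Printing the command instead of executing it is a deliberate choice here: on a VM that is already broken, it is worth reading the exact source and destination file names before letting vmkfstools write anything.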