Let's say you have $50,000 in your bank account. One key to making sure your accounting system and your bank agree that you really have that amount is ensuring that every transaction in your accounting system uses atomic writes.
An atomic write, in this case, means that a transaction is indivisible. If a transaction deducts $50,000 from your Bank of America account and deposits $50,000 in your account at the Grand Cayman Community Bank, either all of the transaction or none of it should be saved to the database.
Traditional database engines ensure atomicity by writing all the parts of the transaction to a journal, then posting them to the main database. If something goes wrong between removing the money from BofA and posting it to Grand Cayman, the system can reapply the change from the journal once the problem is corrected and the system is restarted.
If the system didn't use a journal, a double-write buffer or some other method for ensuring atomicity, the $50,000 would just disappear.
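To make the journaling idea concrete, here is a minimal Python sketch. It is not how any particular database engine works; the class, the account names and the crash_midway flag are all hypothetical, and the "database" is just a dictionary in memory. The journal records the new balance of every account a transaction touches, so replaying an entry after a crash is safe even if part of it was already posted.

```python
class SimulatedCrash(Exception):
    """Stands in for a power failure partway through posting."""


class JournaledLedger:
    """Toy ledger with a redo journal (illustrative only)."""

    def __init__(self, balances):
        self.balances = dict(balances)  # the "main database"
        self.journal = []               # transactions journaled but not yet fully posted

    def transfer(self, src, dst, amount, crash_midway=False):
        # Step 1: journal the new value of every account in the transaction.
        new_values = {
            src: self.balances[src] - amount,
            dst: self.balances[dst] + amount,
        }
        self.journal.append(new_values)
        # Step 2: post the parts to the main database one at a time.
        self._post(new_values, crash_midway)
        # Step 3: everything is posted, so the journal entry can be retired.
        self.journal.pop()

    def _post(self, new_values, crash_midway=False):
        for i, (account, value) in enumerate(new_values.items()):
            if crash_midway and i == 1:
                raise SimulatedCrash()  # the debit is posted, the credit is not
            self.balances[account] = value

    def recover(self):
        # At restart, replay whatever is still in the journal. Because each
        # entry holds final values, replaying a half-posted entry is harmless.
        while self.journal:
            self._post(self.journal[0])
            self.journal.pop(0)


ledger = JournaledLedger({"bofa": 50_000, "cayman": 0})
try:
    ledger.transfer("bofa", "cayman", 50_000, crash_midway=True)
except SimulatedCrash:
    pass  # at this point both balances are 0: the money is in limbo
ledger.recover()  # replaying the journal posts the missing deposit
```

Without the journal entry there would be nothing to replay after the crash, which is exactly the disappearing $50,000.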
Atomic Writes and Flash
All that journaling and such is needed to ensure atomicity on spinning disks. When you tell a spinning disk to write data to logical blocks 124-145, it overwrites the data that was previously in those locations. If the system crashes after blocks 124-130 are written, the system had better be able to write blocks 131-145 at restart time, or we'll have corrupted data that's part new and part old.
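The torn write described above is easy to reproduce in a few lines of Python. This is only a toy model: the "disk" is a dictionary keyed by logical block number, and the crash_after parameter is a made-up way of stopping the write partway through.

```python
def overwrite_in_place(disk, start, new_blocks, crash_after=None):
    """Overwrite blocks beginning at `start`; stop early to mimic a crash.

    Returns True if every block was written, False if the write was cut short.
    """
    for i, data in enumerate(new_blocks):
        if crash_after is not None and i == crash_after:
            return False  # power is lost here; later blocks keep their old data
        disk[start + i] = data
    return True


# Blocks 124-145 (22 blocks) all hold old data to begin with.
disk = {lba: "old" for lba in range(124, 146)}

# The crash hits after 7 blocks, so 124-130 are new and 131-145 are still old.
completed = overwrite_in_place(disk, 124, ["new"] * 22, crash_after=7)

torn = sorted(lba for lba, data in disk.items() if data == "old")
```

After the simulated crash, torn lists blocks 131 through 145, the part of the write that a journal would have to redo at restart.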
Flash is a whole different animal. The flash controller in an SSD is always writing to blank pages across the flash it manages. After it writes new data to fresh pages, the flash controller updates the metadata that maps the logical block addresses it presents to the outside world to the physical pages that hold the data.
Because the flash controller doesn't overwrite the current contents when it writes to blocks 124-130, it can simply hold off on updating the metadata until all the changes for a transaction are complete. If something goes wrong partway through, none of the transaction is posted, and the controller continues to return the old data.
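Here is a rough Python sketch of that remapping trick, under some loud assumptions: ToyFTL, atomic_write and crash_before_commit are invented names, the flash is an append-only list, and the final mapping update is treated as a single indivisible step, which a real controller would have to guarantee by its own means.

```python
class ToyFTL:
    """Toy flash translation layer: logical blocks map to append-only pages."""

    def __init__(self):
        self.pages = []    # physical flash pages; new data always goes to a fresh page
        self.mapping = {}  # metadata: logical block number -> index into self.pages

    def read(self, lba):
        return self.pages[self.mapping[lba]]

    def atomic_write(self, updates, crash_before_commit=False):
        # Write every block of the transaction to fresh pages first.
        staged = {}
        for lba, data in updates.items():
            self.pages.append(data)
            staged[lba] = len(self.pages) - 1
        if crash_before_commit:
            # The new pages exist but nothing points at them, so readers
            # still see the old data for every block in the transaction.
            return False
        # Only now is the metadata switched over (assumed to be one atomic step).
        self.mapping.update(staged)
        return True


ftl = ToyFTL()
ftl.atomic_write({124: "old-124", 125: "old-125"})

# A failed transaction leaves both blocks untouched from the reader's view.
ftl.atomic_write({124: "new-124", 125: "new-125"}, crash_before_commit=True)
still_old = (ftl.read(124), ftl.read(125))

# A completed transaction flips both blocks together.
ftl.atomic_write({124: "new-124", 125: "new-125"})
now_new = (ftl.read(124), ftl.read(125))
```

The orphaned pages from the failed attempt would be reclaimed by garbage collection in a real SSD; the sketch just leaves them in the list.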
Efforts are under way in the storage community to better integrate atomic writes and flash. For instance, Fusion-io promotes a dedicated atomic write API for its products. Several MySQL implementations, including the MariaDB fork, now support the use of Fusion-io's atomic write API. Meanwhile, the INCITS T10 committee, which defines the SCSI command set, is working on an extension of the SCSI standard to support atomic writes.
Both the Fusion-io API and the T10 SCSI command set extensions simply provide a mechanism for applications to tell the flash controller which set of updates should be performed as an atomic entity. The flash controller can then hold its metadata updates until all those writes are complete.
Benchmark data shows that with MariaDB, atomic writes can provide a 30% to 50% performance boost and reduce CPU utilization. In addition to the performance boost, using atomic writes eliminates the need to write data twice, which should significantly extend SSD life.
I hope the T10 committee finishes its work on a standard set of atomic write commands as quickly as possible, and that enterprise SSD vendors get on the atomic write bandwagon right away. Then we can start thinking about how SAN vendors, whose systems use the same SCSI commands as disk drives, can implement atomicity.