Hot Flash: Researchers Use Heat to Counter NAND Flash Wear-n-Tear
December 10, 2012
The limited write endurance of NAND flash storage is a significant drawback of the technology, second only to its high cost per gigabyte. The very idea that SSDs will fail after 10,000 write/erase cycles rubs storage administrators the wrong way. Now, according to an article in IEEE Spectrum, engineers at Taiwan's Macronix have uncovered a way to extend flash to 100 million or more write/erase cycles.
The Macronix group figured out how to use heat to repair the insulating layers of the flash chip, which degrade with each erasure. Researchers already knew that heat could do this, but previous attempts heated the whole chip to 250°C (482°F) for several hours. The Macronix advance uses itty-bitty heaters, derived from the ones the company builds for phase change memory, that heat small groups of flash pages to 500°C. Macronix also discovered that the elevated temperature speeds up erasures, which wasn't predicted by the materials science geeks. (Before you attempt to revive an old SSD or CF card in the pizza oven, note that the solder that holds the components of an SSD together melts at about 185°C.)
Macronix hasn't announced any product using the technology.
There's certainly some appeal to the idea of using built-in heaters to reset the write endurance odometer after every 50 or 100 write/erase cycles in SSDs based on TLC or even QLC (Quad Level Cell, flash that stores 4 bits per cell). However, I don't think flash's limited write endurance is that big a problem. Instead, our management processes need to account for the fact that SSDs wear out.
Many people think SSDs just up and stop working, like a dead hard drive, when the 10,000th write/erase cycle completes. That's not true. While SSDs occasionally fail without warning (just like everything else), those failures aren't due to write exhaustion.
The flash controllers in each SSD monitor how often each page is erased, and distribute the wear as evenly as possible across all their flash. Array controllers and host OSes can use SMART (Self-Monitoring, Analysis and Reporting Technology) to check parameter 231, SSD Life Left, which reports what percentage of the SSD's rated life remains. If customers would accept it, array vendors could stop using expensive SLC SSDs, which can be written to as fast as they accept data, and start using MLC flash, which should last for five years. MLC flash should satisfy the performance needs of 80% of array vendors' customers; the others, who need SLC, could get new SSDs shipped to arrive 60 days before the old ones reach the end of their rated life.
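To make the 60-day replacement idea concrete, here is a minimal sketch of the arithmetic, assuming you already have two dated readings of the SSD Life Left percentage (however you collect them) and assuming wear continues at the same average rate. The function names and the linear-wear model are illustrative assumptions, not anything a particular array vendor is known to use.

```python
from datetime import date, timedelta

def projected_end_of_life(first_date, first_pct, last_date, last_pct):
    """Linearly extrapolate the date when SSD Life Left would reach 0%.

    Assumes wear continues at the average rate seen between the two
    readings. Returns None if no wear was observed in the interval.
    """
    elapsed_days = (last_date - first_date).days
    consumed_pct = first_pct - last_pct
    if elapsed_days <= 0 or consumed_pct <= 0:
        return None  # not enough data to project a wear rate
    days_per_percent = elapsed_days / consumed_pct
    return last_date + timedelta(days=round(last_pct * days_per_percent))

def replacement_arrival_date(end_of_life, lead_days=60):
    """Date the replacement should arrive: lead_days before end of rated life."""
    return end_of_life - timedelta(days=lead_days)

# Hypothetical readings: 100% on Jan 1, 95% one hundred days later.
first = date(2012, 1, 1)
last = first + timedelta(days=100)
eol = projected_end_of_life(first, 100, last, 95)
arrival = replacement_arrival_date(eol)
```

With these made-up readings the drive burns 1% of its rated life every 20 days, so the projection lands 1,900 days after the second reading, and the replacement would be scheduled to arrive 60 days before that.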
Of course, the flash in an SSD doesn't self-destruct on erase 10,001, although at least one controller vendor allows SSD makers to switch the device to read-only when a threshold is reached. Ten thousand cycles is just the point where the flash has degraded enough that the flash manufacturer no longer wants to guarantee it will work. As the flash insulating layers break down, individual cells get stuck and will no longer hold data properly. At some point after 10,000 cycles (and there's no knowing if it's 10,317 or 30,000) there will be too many broken cells on a given page for the controller to correct, and the controller will mark that page as bad. Once too many pages go bad, the SSD will not have any place left to write new data. But this is a gradual, monitorable degradation, not a fatal failure with data loss.
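The degradation path described above can be sketched as a toy model. This is only an illustration of the logic, not how any real controller is implemented: the Page class, the function names and the error-correction limit of 8 stuck cells per page are all made-up assumptions.

```python
from dataclasses import dataclass

@dataclass
class Page:
    stuck_cells: int = 0   # cells that no longer hold data properly
    bad: bool = False      # set once the controller retires the page

def retire_uncorrectable(pages, ecc_limit):
    """Mark pages with more stuck cells than error correction can fix.

    Returns the number of pages still usable afterward.
    """
    for page in pages:
        if page.stuck_cells > ecc_limit:
            page.bad = True
    return sum(not page.bad for page in pages)

def can_accept_writes(pages, ecc_limit, pages_needed):
    """The drive can take new data only while enough good pages remain."""
    return retire_uncorrectable(pages, ecc_limit) >= pages_needed

# Hypothetical drive with four pages in various states of wear.
pages = [Page(0), Page(3), Page(9), Page(40)]
good = retire_uncorrectable(pages, ecc_limit=8)
```

In this example two of the four pages exceed the assumed correction limit and are retired, leaving two good pages. The count of good pages shrinks one page at a time, which is why the decline is something you can watch coming rather than a sudden failure.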
We should treat SSDs like the timing belts in our cars. They're just parts we replace every 60,000 miles. We know when 60,000 miles is coming, and we can plan for it.