Special Coverage Series

Network Computing

Commentary

Howard Marks Network Computing Blogger

SSDs And The Write Endurance Boogeyman

How I learned to stop worrying and love the flash.

A drawback of flash-based SSDs is their limited write endurance. In my previous blog post I looked at how, at first glance, that limit might cause problems for RAID-based data protection systems. The idea is that multiple SSDs with the same write endurance rating could, theoretically, fail at the same time and cause data loss. I don't think it's a significant concern. To show you why, I'll take a closer look at flash write endurance and how it causes SSD wear.

A flash cell is controlled by a floating gate that is surrounded by two insulating layers of silicon oxide dielectric. When high voltage is applied to erase flash (NAND is erased a whole block of pages at a time), the charge in each cell tunnels through the dielectric layer, which causes some damage.

Over time, one of two things happens. Either the dielectric layer refuses to allow the tunneling and the bit sticks as a 0, or a lattice disruption from the high voltage shorts the oxide and the cell fails to program, leaving it stuck in the 1 state.

The rate at which cell failures occur varies based on the flash technology. SLC flash is typically rated at 100,000 write-erase cycles, MLC at 10,000 and eMLC at 30,000.
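
To put those ratings in perspective, here is a rough back-of-the-envelope sketch of how rated write-erase cycles might translate into drive lifetime. Only the cycle counts come from the ratings above. The 256GB of raw flash, the 500GB of daily host writes, the write amplification factor of 3 and the assumption of perfect wear leveling are all illustrative inputs, not vendor figures.

```python
# Rated write-erase cycles quoted above. Every other number in this sketch
# is an assumed, illustrative input rather than a vendor specification.
RATED_CYCLES = {"SLC": 100_000, "MLC": 10_000, "eMLC": 30_000}


def years_to_rated_endurance(raw_gb, cycles, host_gb_per_day, write_amplification):
    """Years until the average cell reaches its rated cycle count,
    assuming writes are spread perfectly evenly across all raw flash."""
    total_flash_writes_gb = raw_gb * cycles
    flash_gb_per_day = host_gb_per_day * write_amplification
    return total_flash_writes_gb / flash_gb_per_day / 365


if __name__ == "__main__":
    for kind, cycles in RATED_CYCLES.items():
        years = years_to_rated_endurance(
            raw_gb=256,               # assumed raw flash capacity
            cycles=cycles,
            host_gb_per_day=500,      # assumed workload
            write_amplification=3.0,  # assumed controller overhead
        )
        print(f"{kind}: about {years:.1f} years to rated endurance")
```

Change the workload or the write amplification and the answer moves in direct proportion, which is why a cycle rating alone says little about how long a given drive will last in a given array.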

These vendor ratings fall somewhere between a minimum guarantee and a stated MTBF (mean time between failures); they actually represent something like a 1st or 5th percentile failure number. That is, at the rated number of write-erase cycles, some small percentage of the cells on the chip will have failed, and the rate at which failures will occur through further write-erase cycles is great enough that the vendor suggests you not count on the flash any further.

Intel, Micron and Toshiba are like Nissan or Ford when it comes to SSDs. When they say the timing belt in your car will last 60,000 miles, or the flash will last through 30,000 write-erase cycles, they're not saying it will break at 61,000. They are saying it could break, or fail to hold new data, at that point. Most will last longer, but I wouldn't want to be the guy who finds out exactly how much longer during rush hour on the Brooklyn-Queens Expressway.

That said, flash devices don't hit their magic endurance numbers and kick the bucket. As flash ages, the error rate for writing data to each page increases as the cells in that page fail. Flash controllers have ECC and DSP technologies built in to handle individual cell failures; as long as the number of failed cells in a page is low, the controller can just correct the errors and use the page.

Eventually, the error rate rises to the point that the flash controller is no longer confident it's getting the right data, so the controller marks that page as bad.
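
The decision described in the last two paragraphs can be sketched as a toy model. This is not how any particular controller works: the `Page` class, the `read_page` function and the limit of 40 correctable bits are all hypothetical, and a real controller would track error rates over time rather than a simple count of failed cells.

```python
from dataclasses import dataclass

# Hypothetical limit on how many failed cells per page the ECC can repair.
ECC_CORRECTABLE_BITS = 40


@dataclass
class Page:
    failed_cells: int = 0
    retired: bool = False


def read_page(page, correctable=ECC_CORRECTABLE_BITS):
    """Return 'ok', 'corrected' or 'retired' for one simulated read."""
    if page.retired:
        return "retired"
    if page.failed_cells == 0:
        return "ok"
    if page.failed_cells <= correctable:
        # Few enough errors: correct them and keep using the page.
        return "corrected"
    # Too many errors to trust the data: mark the page as bad.
    page.retired = True
    return "retired"


if __name__ == "__main__":
    for failed in (0, 12, 41):
        print(failed, "failed cells ->", read_page(Page(failed_cells=failed)))
```

The point of the sketch is that wear shows up as a gradual slide from "ok" to "corrected" to "retired" on individual pages, not as a single drive-wide event.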

Because a typical enterprise SSD will be overprovisioned, allowing user access to 200GB of its 256GB or more of flash, a moderate number of failed pages just reduces the amount of overprovisioned flash the controller can use for housekeeping. This might affect performance but won't cause the SSD as a whole to fail. Only when the pool of overprovisioned flash is used up replacing bad pages does the SSD fail, and even then the data it holds is still readable.
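
Using the 200GB and 256GB figures above, the arithmetic works out to 56GB of spare flash, or 28 percent of the user-visible capacity. The short sketch below shows how that pool shrinks as pages are retired. The function names and the 20GB of retired flash are illustrative assumptions, and the model ignores everything else a controller uses spare flash for.

```python
def overprovisioning_ratio(raw_gb, user_gb):
    """Spare flash expressed as a fraction of user-visible capacity."""
    return (raw_gb - user_gb) / user_gb


def spare_pool_gb(raw_gb, user_gb, retired_gb=0.0):
    """Spare flash left after replacing retired (bad) pages."""
    return raw_gb - user_gb - retired_gb


def drive_state(raw_gb, user_gb, retired_gb):
    """Worn out only once the spare pool is completely used up."""
    if spare_pool_gb(raw_gb, user_gb, retired_gb) > 0:
        return "in service"
    return "worn out"


if __name__ == "__main__":
    print(f"Overprovisioning: {overprovisioning_ratio(256, 200):.0%}")
    print(f"Spare pool after 20GB retired: {spare_pool_gb(256, 200, 20):.0f}GB")
    print("State after 20GB retired:", drive_state(256, 200, 20))
    print("State after 56GB retired:", drive_state(256, 200, 56))
```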

While today's semiconductor manufacturing processes are incredibly precise, when you're dealing with cell geometries of 20nm or less it's just not possible that every block, page and cell of an entire flash chip, let alone a batch of thousands, is exactly the same.

When the boffins at Toshiba or Micron say the oxide layer in their flash is 170 atoms thick, that's going to be an average. Some will be 150 and others 200, and as the oxide layers age some cells are going to fail earlier than others.

Actually testing flash devices to see just how variable the failures are would be destructive, take a long time and ultimately require a large number of chips to be destroyed, and I haven't seen any studies that show how variable the rate is. My discussions with flash, SSD and array vendors lead me to believe that it's variable enough that we don't have to worry about a second SSD wearing out while the first is rebuilding.

Consider also that SSDs rebuild five to 10 times faster than HDDs, and that unlike HDDs, new generations of SSDs get faster as well as bigger. When I put all these factors together, I think the near-simultaneous SSD wearout problem is a boogeyman: big, really scary but ultimately not very real.
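
If you want to play with the argument yourself, a small Monte Carlo sketch can help. Since there are no published figures on how variable wearout really is, every number here is an assumption: the normal distribution, the five-year mean, the 10 percent spread and the one-day rebuild window are placeholders to experiment with, not measurements.

```python
import random


def simultaneous_wearout_probability(mean_days, spread, rebuild_days,
                                     trials=100_000, seed=1):
    """Estimate how often two identically loaded drives wear out within
    one rebuild window of each other.

    spread is the assumed standard deviation of wearout time, expressed
    as a fraction of the mean. A normal distribution is assumed.
    """
    rng = random.Random(seed)
    sigma = mean_days * spread
    hits = 0
    for _ in range(trials):
        first = rng.gauss(mean_days, sigma)
        second = rng.gauss(mean_days, sigma)
        if abs(first - second) <= rebuild_days:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    for spread in (0.01, 0.05, 0.10):
        p = simultaneous_wearout_probability(
            mean_days=5 * 365,  # assumed mean time to wearout
            spread=spread,      # assumed drive-to-drive variation
            rebuild_days=1,     # assumed rebuild window
        )
        print(f"spread {spread:.0%}: overlap probability about {p:.4f}")
```

Under this model, the wider the assumed spread and the shorter the rebuild window, the less often the two wearout dates land close enough together to matter.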

Does flash write endurance have you worried? Is it enough to keep you from adopting flash? I'd like to get your input. Use the comments section to share your feedback.


