Hadoop And Enterprise Storage?


Howard Marks

May 10, 2011

3 Min Read

Both NetApp and EMC have announced that they're turning the turrets of their marketing battleships toward the Apache Hadoop marketplace that provides the back end to many Web 2.0 implementations. While I understand how Hadoop is attractive to these storage vendors--after all, a typical Hadoop cluster will have hundreds of gigabytes of data--I'm not sure I buy that Hadoop users need enterprise-class storage.

EMC's Greenplum division is introducing its own distributions of Hadoop with an all-open-source community edition and a ruggedized enterprise edition. These will be available both as software and preinstalled on the Greenplum HD Data Computing Appliance, which uses SATA drives in a JBOD configuration. However, since it's from EMC, it will certainly cost more than using Supermicro servers and Western Digital drives from NewEgg.

NetApp is pitching the concept of shared DAS, connecting the Engenio RAID arrays it just bought from LSI (now renamed the E-Series) to cluster nodes via SAS. NetApp is pushing the low-end E2600 array for Hadoop clusters.

The key to these announcements may be in Informatica CTO James Markarian's statement from a stage in the EMC World pressroom that some companies are more willing to adopt new technologies like Hadoop if they can buy them from trusted suppliers such as EMC.

Personally, I'm not so sure. To get the full benefits of the Web 2.0 architecture, organizations may have to--for those applications where it's appropriate--adopt the whole Web 2.0 toolkit and design model. The Hadoop Distributed File System (HDFS) is designed to distribute data across multiple nodes so it can survive node failures without losing data, or even data availability. This lets Web 2.0 site operators use large clusters of very inexpensive nodes with SATA JBODs to store their data and process it at a very low cost per gigabyte.

Enterprise storage, on the other hand, is based more on a "failure is not an option" model than on a fault-tolerant model. Controllers, drives and even drive enclosures are designed to have long mean times between failures. This reliability, of course, costs money, so you pay more per gigabyte for a Vplex (or even a Clariion) than Google does for its MicroATX motherboards with SATA drives all but duct-taped to them.
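To make that contrast concrete, here's a minimal sketch (my own illustration, not anything from the vendors' announcements) of how HDFS redundancy is just a per-file parameter in the standard Hadoop FileSystem API. The NameNode address and file path below are hypothetical placeholders.

```java
// Minimal sketch: writing a file to HDFS with an explicit replication factor.
// The cluster address and path below are hypothetical placeholders.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicatedWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode; normally this comes from core-site.xml.
        conf.set("fs.default.name", "hdfs://namenode.example.com:8020");

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/data/clickstream/part-00000");

        // Keep three copies of every block, spread across different DataNodes
        // (and racks, if rack awareness is configured), so losing a node or a
        // disk costs nothing but a background re-replication.
        short replication = 3;
        long blockSize = 64L * 1024 * 1024; // 64MB blocks, the classic default
        int bufferSize = 4096;

        FSDataOutputStream out =
                fs.create(file, true, bufferSize, replication, blockSize);
        try {
            out.writeUTF("cheap nodes, redundant data\n");
        } finally {
            out.close();
            fs.close();
        }
    }
}
```

Setting dfs.replication in hdfs-site.xml does the same thing cluster-wide; either way, the point is that the redundancy lives in software spread across cheap boxes, not in the drive shelf.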

To understand the very different enterprise and Web 2.0 models, think for a moment of an engineering school egg drop contest. The rules of the contest state that teams must get a dozen eggs unbroken from the roof of the engineering building to a team cooking omelets on the quad. Teams will be judged on cost, speed and originality.

The enterprise team builds a dumbwaiter to gently lower the eggs in a supermarket package down to the quad. The Hadoop team buys three dozen eggs and a roll of bubble wrap, wraps each egg in the bubble wrap, and throws the eggs off the roof. As long as one-third of the Hadoop team's eggs arrive unbroken, it has solved the problem and spent $30 to $40 (compared with the hundreds of dollars the enterprise team needed for dumbwaiter parts).
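To put a rough number on that bet (my own back-of-envelope arithmetic; the per-egg survival rate is an assumption, not a figure from any real contest): if each bubble-wrapped egg has, say, a 70 percent chance of surviving the throw, the odds of fewer than a dozen of the 36 arriving intact are vanishingly small.

```java
// Back-of-envelope sketch: probability that at least 12 of 36 independently
// thrown eggs survive, for an assumed per-egg survival probability p.
public class EggDropOdds {

    // P(X >= k) for X ~ Binomial(n, p), summed directly from the pmf.
    static double atLeast(int n, int k, double p) {
        double total = 0.0;
        for (int i = k; i <= n; i++) {
            total += binomialPmf(n, i, p);
        }
        return total;
    }

    static double binomialPmf(int n, int i, double p) {
        // Binomial coefficient computed in log space to avoid overflow.
        double logCoeff = 0.0;
        for (int j = 1; j <= i; j++) {
            logCoeff += Math.log(n - i + j) - Math.log(j);
        }
        return Math.exp(logCoeff + i * Math.log(p) + (n - i) * Math.log(1 - p));
    }

    public static void main(String[] args) {
        double p = 0.70; // assumed per-egg survival rate
        System.out.printf("P(at least 12 of 36 eggs survive) = %.7f%n",
                atLeast(36, 12, p));
    }
}
```

Lowering p shows how quickly the margin erodes, which is exactly the knob the Hadoop team is turning when it buys more eggs instead of a better dumbwaiter.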

I can just see some application group deciding that Hadoop will help them process the deluge of data in their data center. The proposal finally comes to the storage group, which looks at the low-cost and--to the storage guy's eye--low-reliability storage in the proposal and says, "This should go on our SAN so we can provide the five-nines reliability enterprise applications require." The project goes ahead with storage on the Symmetrix, and, while it works fine, the organization doesn't see the cost savings it expected because it's spending several times as much for storage as it needed to.
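For anyone who wants to see where that gap comes from, a quick back-of-envelope comparison makes the point. The per-gigabyte prices below are placeholders I've assumed purely for illustration, not quotes from any vendor, so plug in your own numbers.

```java
// Back-of-envelope cost comparison: triple-replicated data on commodity SATA
// JBODs versus the same data set on enterprise SAN storage. All prices are
// assumed placeholders for illustration, not vendor quotes.
public class StorageCostSketch {
    public static void main(String[] args) {
        double dataTerabytes = 100.0;    // raw data the cluster must hold

        double sataDollarsPerGB = 0.10;  // assumed commodity SATA JBOD price
        double sanDollarsPerGB = 5.00;   // assumed enterprise SAN price
        int hdfsReplication = 3;         // HDFS default: three copies

        double gigabytes = dataTerabytes * 1024;
        double hdfsCost = gigabytes * hdfsReplication * sataDollarsPerGB;
        double sanCost = gigabytes * sanDollarsPerGB; // array supplies its own redundancy

        System.out.printf("HDFS on SATA JBOD: $%,.0f%n", hdfsCost);
        System.out.printf("Same data on SAN:  $%,.0f%n", sanCost);
        System.out.printf("SAN premium:       %.1fx%n", sanCost / hdfsCost);
    }
}
```

Whatever the exact figures, that multiple is the cost saving the application group's business case was counting on.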

About the Author(s)

Howard Marks

Network Computing Blogger

Howard Marks is founder and chief scientist at DeepStorage LLC, a storage consultancy and independent test lab based in Santa Fe, N.M., concentrating on storage and data center networking. In more than 25 years of consulting, Marks has designed and implemented storage systems, networks, management systems and Internet strategies at organizations including American Express, J.P. Morgan, Borden Foods, U.S. Tobacco, BBDO Worldwide, Foxwoods Resort Casino and the State University of New York at Purchase. The testing at DeepStorage Labs is informed by that real-world experience.

He has been a frequent contributor to Network Computing and InformationWeek since 1999 and a speaker at industry conferences including Comnet, PC Expo, Interop and Microsoft's TechEd since 1990. He is the author of Networking Windows and co-author of Windows NT Unleashed (Sams).

He is co-host, with Ray Lucchesi, of the monthly Greybeards on Storage podcast, where the voices of experience discuss the latest issues in the storage world with industry leaders. You can find the podcast at http://www.deepstorage.net/NEW/GBoS
