Hadoop And Enterprise Storage?
May 10, 2011
Both NetApp and EMC have announced that they're turning the turrets of their marketing battleships toward the market for Apache Hadoop, which provides the back end to many Web 2.0 implementations. While I understand how Hadoop is attractive to these storage vendors--after all, a typical Hadoop cluster will have hundreds of gigabytes of data--I'm not sure I buy that Hadoop users need enterprise-class storage.
EMC's Greenplum division is introducing its own distributions of Hadoop with an all-open-source community edition and a ruggedized enterprise edition. These will be available as software and installed on the Greenplum HD Data Computing Appliance, which uses SATA drives in a JBOD configuration. However, since it's from EMC, it will certainly cost more than using Supermicro servers and Western Digital drives from Newegg.
NetApp is pitching the concept of shared DAS, using SAS to connect the Engenio RAID arrays it just bought from LSI (now renamed the E-Series). For Hadoop clusters, NetApp is pushing the low-end E2600 array.
The key to these announcements may be in Informatica CTO James Markarian's statement, made from a stage in the EMC World pressroom, that some companies are more willing to adopt new technologies like Hadoop if they can buy them from trusted suppliers such as EMC.
Personally, I'm not so sure. To get the full benefits of the Web 2.0 architecture, organizations may have to--for those applications where it's appropriate--adopt the whole Web 2.0 toolkit and design model. The Hadoop Distributed File System (HDFS) is designed to distribute data across multiple nodes so it can survive node failures without losing data or even access to that data. This enables Web 2.0 site operators to store and process their data on large clusters of very inexpensive nodes with SATA JBODs, at a very low cost per gigabyte.
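
To make the replication idea above concrete, here is a minimal Python sketch. It is a toy model, not HDFS code: the function names place_blocks and readable_blocks, the six-node cluster and the simple round-robin placement are all assumptions made for illustration (real HDFS placement is rack-aware and more involved). The replication factor of 3 is HDFS's default.

```python
def place_blocks(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin.

    Hypothetical helper: a simplified stand-in for HDFS block placement.
    """
    if replication > len(nodes):
        raise ValueError("replication factor exceeds node count")
    placement = {}
    for block in range(num_blocks):
        # Consecutive nodes (wrapping around) hold the replicas of this block.
        placement[block] = {
            nodes[(block + i) % len(nodes)] for i in range(replication)
        }
    return placement


def readable_blocks(placement, failed_nodes):
    """Return the blocks that still have a replica on some live node."""
    failed = set(failed_nodes)
    return [block for block, replicas in placement.items() if replicas - failed]


# Assumed toy cluster: six cheap nodes holding twelve blocks.
nodes = [f"node{i}" for i in range(6)]
placement = place_blocks(12, nodes)

# With three replicas per block, losing any two nodes leaves every block readable.
survivors = readable_blocks(placement, ["node0", "node1"])
print(f"{len(survivors)} of {len(placement)} blocks readable after two failures")
```

In this model, data becomes unavailable only when every node holding a given block fails at once, which is why cheap nodes with plain JBOD disks can be good enough: the redundancy lives in the file system rather than in the storage hardware.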