In late July, EMC announced the acquisition of Greenplum, the company's entry into the data warehousing and analytics market. Greenplum uses a massively parallel processing (MPP) architecture and Scatter/Gather Streaming technologies to help computing keep up with ever larger data flows.

David Hill

October 25, 2010

Hopefully, by now, skeptics of the Greenplum deal will have lost their deer-in-the-headlights look when trying to comprehend why EMC is moving into an unfamiliar space in the IT market. Understand that EMC is no longer simply a data storage vendor but an information infrastructure company with a strong storage foundation. The company's growth imperative, which is common to businesses that want to increase share price, cannot be met through organic growth alone, but depends upon continued successful acquisitions. In general, becoming a successful IT vendor giant requires the ability to manage a wide diversity of products and services, and EMC has been very successful in branching out from its roots.

That explains why EMC is interested in acquisitions, but does not explain why the company is interested in data warehousing in general and Greenplum in particular. Data warehousing is a large market space that EMC tapped into previously, but only indirectly. The reason for acquiring Greenplum is that data warehousing requires specialized technology tools. Let's examine the data warehousing market first.

Data warehousing and associated analytics is now a large market with, to paraphrase McDonald's statement about how many hamburgers it has sold, billions and billions to be made by the market as a whole. Why is that the case, especially since data warehousing was once a niche market?

A brief review of history may help to put things in perspective. Early IT departments called their work data processing, which was quite appropriate, since the combination of data with computing is essential. Later on, an attempt was made to upgrade the name and importance of the data processing organization by renaming it management information systems (MIS). However, early applications, such as payroll, accounts payable, and accounts receivable, were really clerical information systems. The pervasiveness of mission-critical online transaction processing (OLTP) systems, aided and abetted by relational database management systems, increased the importance of applications, but these were still business-process oriented and not focused on management decision-making processes. So the M was dropped to create IS, and later the name was changed to information technology (IT).

During all this time there was a parallel track of activities that worked to use information to aid management decision making. Early attempts were limited to specific domains using limited amounts of data. Operations research/management science focused on algorithms and models, such as economic order sizes for products. Statistical techniques, such as least squares analysis, were used to do early forecasting and predictive analysis.

Eventually, decision support systems arose that were less focused on algorithms and more focused on organizing information in a way that would give insight into a specific decision, such as adding plant capacity, or ongoing decisions, such as working with suppliers to improve their performance. Then executive information systems (EIS) were developed to collect information from several sources to allow senior executives to manage their subordinates better. This was a precursor of data warehousing and managed scorecards, but the process was not easy to accomplish.

Finally, data warehousing solutions evolved that collected information from multiple sources and made it available as one version of the truth. Getting to that point was difficult because of data quality issues as well as disagreements within enterprises over such issues as the basic nature of customers and products. If you have never been there, you cannot appreciate how difficult such an apparently simple task can be! Additionally, getting enterprises to understand how they could exploit all the richness of information has been anything but easy and still has a long way to go. But now the market has blossomed.

So we are now approaching the era of true MIS, although it will never be called that given its archaic roots. Still, let's think of MIS for just a minute. First, senior managers can now better understand what is happening in their business as a whole -- balanced scorecard, anyone? At the next level down, managers can also understand their own areas of interest better and faster, meaning that they can get the information that they need to avoid getting blindsided by their bosses. However, usage has expanded to individual contributors who do not manage people but need to manage information to do their jobs better. For example, gauging the success of retail uplift activities like price discounts, coupons, and advertising across hundreds of stores, and designing effective sales and marketing campaigns, involves a lot of people and a lot of work. When effective data usage expands to a couple of thousand people, it's easy to see how that mushrooming growth is a key reason why EMC viewed entry into such a key and growing market as a must.

Data warehousing is not an IT infrastructure business as usual. Although it uses storage, servers, and networking hardware, as is true with all of IT, the requirements are different. For example, on the storage side, traditional OLTP applications are all about IOPS, whereas data warehousing is about storage bandwidth. OLTP applications do a lot of random reads and writes of data. Analytic-based SQL queries, on the other hand, may require sequential or "clustered" reads that span a large part of an entire petabyte-scale warehouse.
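To make the IOPS-versus-bandwidth distinction concrete, here is a rough back-of-the-envelope sketch in Python. Every figure in it (the 1 TB table, the 8 KiB I/O size, 200 IOPS, 100 MB/s) is a hypothetical round number chosen for illustration, not a measurement of any EMC, Greenplum, or other vendor's product.

```python
def random_read_seconds(total_bytes, io_size_bytes, iops):
    """Seconds to read total_bytes as small random I/Os at a fixed IOPS rate."""
    number_of_ios = total_bytes / io_size_bytes
    return number_of_ios / iops


def sequential_read_seconds(total_bytes, bandwidth_bytes_per_s):
    """Seconds to stream total_bytes at a fixed sequential bandwidth."""
    return total_bytes / bandwidth_bytes_per_s


# Hypothetical, illustrative figures only (assumptions, not vendor specs).
TABLE_BYTES = 10**12           # a 1 TB fact table to be scanned
IO_SIZE_BYTES = 8 * 1024       # 8 KiB per random I/O, an OLTP-style page size
DEVICE_IOPS = 200              # random I/Os per second the device sustains
DEVICE_BANDWIDTH = 100 * 10**6 # 100 MB/s of sequential throughput

random_s = random_read_seconds(TABLE_BYTES, IO_SIZE_BYTES, DEVICE_IOPS)
sequential_s = sequential_read_seconds(TABLE_BYTES, DEVICE_BANDWIDTH)

print(f"random-I/O scan:  {random_s / 3600:.1f} hours")
print(f"sequential scan:  {sequential_s / 3600:.1f} hours")
```

Under these assumed numbers, the same scan takes far longer when it is limited by IOPS than when it is limited by bandwidth, which is why a system tuned for many small random transactions is not automatically a good fit for large analytic scans.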

Bottom line: how fast information can be pushed out to the computing engines (i.e., storage bandwidth), and not how fast it can be "crunched," is a critical factor in data warehousing. Interestingly, even though SATA hard drives have slower rotational speeds than FC or SAS drives, their areal density is greater, allowing more data to be stored in the same two-dimensional space. That means that, in at least some cases, more data can be pushed out per second via SATA solutions than with FC or SAS technologies.

That said, computing speed is also important. In the largest queries, the computing engine often becomes a bottleneck because it can't read the data flooding in from storage fast enough. Greenplum's use of massively parallel processing (MPP) technologies, which leverage virtualized industry-standard components to scale systems up and down to match workload requirements, can help IT keep up with highly variable data flows.
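The general idea behind MPP can be sketched in a few lines of Python: rows are spread across independent workers, each worker aggregates only its own slice, and a coordinator merges the partial results. This is a toy illustration of the concept under stated assumptions, not Greenplum's implementation; the function names (`distribute`, `local_aggregate`, `merge`), the sales data, and the use of a thread pool to stand in for separate segment servers are all hypothetical.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def distribute(rows, num_segments):
    """Assign each (store_id, amount) row to a segment based on its key."""
    segments = [[] for _ in range(num_segments)]
    for store_id, amount in rows:
        segments[store_id % num_segments].append((store_id, amount))
    return segments


def local_aggregate(segment_rows):
    """Each segment sums sales per store using only the rows it holds."""
    totals = defaultdict(int)
    for store_id, amount in segment_rows:
        totals[store_id] += amount
    return dict(totals)


def merge(partials):
    """The coordinator combines the per-segment partial sums."""
    merged = defaultdict(int)
    for partial in partials:
        for store_id, subtotal in partial.items():
            merged[store_id] += subtotal
    return dict(merged)


def sales_by_store(rows, num_segments=4):
    segments = distribute(rows, num_segments)
    # Threads stand in for separate segment servers here; in CPython they
    # illustrate the structure rather than deliver real CPU parallelism.
    with ThreadPoolExecutor(max_workers=num_segments) as pool:
        partials = list(pool.map(local_aggregate, segments))
    return merge(partials)


# Hypothetical sample data: (store_id, sale_amount)
sales = [(1, 100), (2, 250), (1, 50), (3, 75), (2, 25), (5, 10)]
result = sales_by_store(sales, num_segments=4)
print(result)
```

Because each worker touches only its own slice of the data, adding workers shrinks the slice each one has to read and aggregate, which is the scaling property the MPP approach relies on.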

But that is not the only necessary difference between Greenplum and other technologies. Database management systems were originally built for write-intensive OLTP applications. Data warehousing in these environments is a retrofit that requires a substantially different focus. In contrast, EMC recently announced Greenplum Database 4.0, the latest edition of this enterprise-oriented product, which is specifically optimized for next-generation data warehousing and analytics, including the ability to analyze all of an organization's structured and unstructured data.

EMC also announced the Greenplum Data Computing Appliance (DCA), which packages software and hardware, including EMC storage, naturally, in one box. Appliances are a hot topic in data warehousing today. They offer workable options for organizations that need a simpler deployment that avoids the difficult integration of software and hardware that custom platforms require, and for those seeking the ultimate in performance through tight hardware-software integration. Additionally, the Greenplum DCA offers a way to offload disruptive analytic workloads from the enterprise data warehouse (EDW), since EDWs are optimized for reporting, not advanced analytics.

Overall, market potential and the need for technology required to compete in the data warehousing space are likely the key factors that led EMC to acquire Greenplum. Greenplum has been named as the foundation of EMC's new Data Computing Division, and the company is already shifting resources its way: the division has grown by 30% in personnel since EMC acquired Greenplum.

By acquiring Greenplum, EMC has opened up another front in the information infrastructure space. The data warehousing platform market share battle is likely to be fierce, metaphorically bloody, and very public since a lot of people including analysts and investors will be watching. Competitors include Oracle, which is strongly pushing its Exadata products based on Sun hardware assets. IBM offers a strong portfolio of products including the Smart Analytics Systems and the Smart Analytics Optimizer based on columnar database technology, and recently acquired Netezza, a well-regarded player in this market. Moreover, long-time players like Teradata are not standing idly by, and other players -- big such as HP and small such as Aster and Kognitio -- will not willingly cede the market to their rivals.

That said, for years now EMC has proven to have a green thumb when it comes to acquisitions. That is reflected not only in the companies it chooses to acquire, but also in how successfully EMC helps its acquisitions achieve a growth potential that they would not have been able to reach on their own. The process has tended to be easier when EMC acquired leaders in their fields, such as VMware and RSA, which did not face the same degree of competition as there is in data warehousing. If EMC can prevail in the face of such fierce, well-funded efforts by competitors, then Greenplum will provide a rich harvest for EMC's information infrastructure strategy.

EMC is currently a client of David Hill and the Mesabi Group.
