According to an EMC-sponsored IDC report, the amount of data amassed by consumers and businesses is expected to grow 44-fold over this decade. A lot of that information will be what many, including EMC, call big data. Big data requires storage and other products and services that the company provides, so it should come as no surprise that, in its recent blizzard of announcements, EMC targeted big data as one of its key markets. Let's try to understand what big data is and what it means, and then briefly illustrate how EMC is addressing the big data market through its recent acquisitions of Isilon and Greenplum.
EMC's working definition of big data is "data sets, or information, whose scale, distribution, location in separate silos or timeliness require customers to employ new architectures to capture, store, integrate (into one data set), manage and analyze to realize business value." Now, that is quite a mouthful and takes some time to digest and, of course, it is shaped around what EMC can or wants to do. However, the definition covers the essence of the subject and makes some valid points. Let's look at some examples to gain a better perspective on where big data resides in the real world:
- Medical information--including medical images, such as MRIs, as well as electronic health records (EHRs);
- Increased use of broadband on the Web--including the 2 billion photos each month that Facebook users currently upload, as well as the innumerable videos uploaded to YouTube and other multimedia sites;
- Video surveillance--this is a booming business with a need for enormous volumes of storage, as well as the advanced analytics to make sense of it;
- Increased global use of mobile devices--the torrent of texting is not likely to cease;
- Smart devices--sensor-based collection of information has a tremendous future enabling smart electric grids, smart buildings and many other public and industry infrastructures;
- Non-traditional IT devices--including the use of RFID readers and GPS navigation systems;
- Non-traditional use of traditional IT information--including the transformation of OLTP data into, say, a data warehouse for applying analytics, e-discovery and Web-generated information tools; and
- Industry-specific requirements, including high-performance computing solutions in genomic research, oil and gas exploration, entertainment media, etc.
Now, a critic might say that there is nothing new here. For example, medical images and broadband Web access have been around for a long time. The reply is that big data-related changes are mostly a matter of degree but also, to some extent, a matter of kind. The matter of degree comes about because of more intensive usage and greater scale--the sheer volume of petabytes of storage--than we have ever had. The matter of kind relates to the transformation of data from analog to digital and the need to get business value in new ways. But the key point to remember is that big data is a huge market that translates into "big money." From an IT business perspective, that is why big data matters.
There have been, roughly speaking, three major waves in the kind of structure that information has had from an IT perspective. Note that a new wave does not replace the older ones, which continue to grow. All three types of data structure have always been present, but at any given time one type tends to dominate the others:
- Structured information. This is the information that finds a home in relational databases and has dominated the use of IT for many years. It is still the home of the mission-critical OLTP systems that businesses depend upon; among other things, structured database information can be sorted as well as queried.
- Semi-structured information. This was the second major wave in IT and includes e-mails, word processing documents and a lot of information stored and presented on the Web. Semi-structured information is content-based and can be searched, which is the raison d'etre of Google.
- Unstructured information. This can be thought of as primarily bit-mapped data in its native form. The data has to be put in a form that can be sensed (such as seen or heard in audio, video and multimedia files). A lot of big data is unstructured, and its sheer size and complexity require advanced analytics to create or impose structure that makes perceiving and interacting with it easier for humans.
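To make the three categories concrete, here is a minimal Python sketch. Everything in it is hypothetical and chosen only for illustration: the `orders` table, the sample e-mails, the eight-pixel "image" and the helper names `search` and `bright_fraction` do not come from EMC or any product mentioned here. It simply shows that structured rows can be queried and sorted, semi-structured content can be searched, and unstructured bits need some analysis step to impose structure on them.

```python
import sqlite3

# Structured: rows with a fixed schema can be queried and sorted.
# (The "orders" table and its rows are made-up sample data.)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "acme", 250.0), (2, "globex", 75.5), (3, "acme", 120.0)],
)
structured_result = conn.execute(
    "SELECT id, amount FROM orders WHERE customer = ? ORDER BY amount",
    ("acme",),
).fetchall()

# Semi-structured: there is no schema, but the content can be searched.
# (Two made-up e-mails stand in for documents and Web pages.)
documents = {
    "mail-1": "Subject: MRI results\nThe scan is attached.",
    "mail-2": "Subject: lunch\nNoon on Friday?",
}

def search(term):
    """Return the names of documents whose text contains the term."""
    return [name for name, text in documents.items()
            if term.lower() in text.lower()]

# Unstructured: raw bit-mapped data has no fields to query or words to search.
# Structure has to be imposed by analysis; here, a trivial brightness measure
# over eight made-up grayscale pixel values.
bitmap = bytes([0, 0, 255, 255, 255, 0, 128, 255])

def bright_fraction(pixels, threshold=200):
    """Return the fraction of pixels at or above the brightness threshold."""
    return sum(1 for p in pixels if p >= threshold) / len(pixels)

print(structured_result)        # [(3, 120.0), (1, 250.0)]
print(search("mri"))            # ['mail-1']
print(bright_fraction(bitmap))  # 0.5
```

The analysis step in the last part is deliberately trivial; real advanced analytics on images or video is far more involved, but the shape of the problem is the same: the bits carry no queryable structure until something computes one.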
Unfortunately, this classification scheme is not perfect. First, there are numerous hybrid and composite forms, such as a photo embedded in a word processing document. Second, while "records" is a term that applies to databases, and much of the semi-structured information is stored in files, other information resides in streams, such as those captured by a video camera. And then there is the entirely separate concept of objects.
The bottom line, though, is that traditional IT infrastructures--including servers, storage and networks--were built around structured information and bent to adapt to semi-structured information. However, they were not really designed for the multifaceted structures, scale and analytical demands of big data.
That is why EMC underlined new architectures in its definition of big data, and that is also why it acquired Isilon and Greenplum. Much has been written about these acquisitions, so I will focus briefly on how the companies illustrate the need for different architectures for big data.