EMC's working definition of big data is "data sets, or information, whose scale, distribution, location in separate silos or timeliness require customers to employ new architectures to capture, store, integrate (into one data set), manage and analyze to realize business value." That is quite a mouthful and takes some time to digest, and, of course, it is shaped around what EMC can or wants to do. Still, the definition covers the essence of the subject and makes some valid points. A few examples give a better perspective on the breadth of where big data resides in the real world:
- Medical information--including medical images, such as MRIs, as well as electronic health records (EHRs);
- Increased use of broadband on the Web--including the 2 billion photos each month that Facebook users currently upload, as well as the innumerable videos uploaded to YouTube and other multimedia sites;
- Video surveillance--this is a booming business with a need for enormous volumes of storage, as well as the advanced analytics to make sense of it;
- Increased global use of mobile devices--the torrent of texting is not likely to cease;
- Smart devices--sensor-based collection of information has a tremendous future enabling smart electric grids, smart buildings and many other public and industry infrastructures;
- Non-traditional IT devices--including the use of RFID readers and GPS navigation systems;
- Non-traditional use of traditional IT information--including the transformation of OLTP data into, say, a data warehouse for applying analytics, e-discovery and Web-generated information tools; and
- Industry-specific requirements--including high-performance computing solutions in genomic research, oil and gas exploration, and entertainment media.
There have been, roughly speaking, three major waves in the kind of structure that information has had from an IT perspective. Note that a new wave does not replace the older ones, which continue to grow. All three types of data structure have always been present, but at any given time one type tends to dominate the others:
- Structured information. This is the information that finds a home in relational databases and has dominated the use of IT for many years. It is still the home of the mission-critical OLTP systems that businesses depend upon. Among other things, structured database information can be sorted as well as queried;
- Semi-structured information. This was the second major wave in IT and includes e-mails, word processing documents and a lot of information stored and presented on the Web. Semi-structured information is content-based and can be searched, which is the raison d'etre of Google;
- Unstructured information. This can be thought of as primarily bit-mapped data in its native form. The data has to be rendered in a form that can be sensed, that is, seen or heard, as with audio, video and multimedia files. A lot of big data is unstructured, and its sheer size and complexity require advanced analytics to create or impose structure that makes it easier for humans to perceive and interact with.
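To make the distinction among the three types a little more concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the function names, the `orders` table, the sample documents and the idea of treating raw bytes as rows of grayscale pixels are all hypothetical, and none of it reflects how any EMC, Isilon or Greenplum product works.

```python
import sqlite3


def top_orders(rows, min_amount):
    """Structured: a fixed schema lets us query and sort directly."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
    result = conn.execute(
        "SELECT customer, amount FROM orders "
        "WHERE amount >= ? ORDER BY amount DESC",
        (min_amount,),
    ).fetchall()
    conn.close()
    return result


def search_documents(documents, keyword):
    """Semi-structured: no schema, but the content itself can be searched."""
    keyword = keyword.lower()
    return [name for name, text in documents.items()
            if keyword in text.lower()]


def mean_brightness_per_row(pixels, width):
    """Unstructured: raw bytes mean nothing until structure is imposed.

    Assumption (hypothetical): each byte is one grayscale pixel and the
    image is `width` pixels wide.
    """
    if width <= 0 or len(pixels) % width != 0:
        raise ValueError("pixel count must be a multiple of a positive width")
    rows = [pixels[i:i + width] for i in range(0, len(pixels), width)]
    return [sum(row) / width for row in rows]


if __name__ == "__main__":
    # Hypothetical sample data for each of the three types.
    orders = [("acme", 120.0), ("globex", 75.5), ("initech", 300.0)]
    documents = {
        "memo.txt": "Quarterly storage budget review",
        "mail.eml": "Lunch on Friday?",
    }
    pixels = bytes([0, 0, 255, 255, 10, 20, 30, 40])

    print(top_orders(orders, 100))
    print(search_documents(documents, "storage"))
    print(mean_brightness_per_row(pixels, 4))
```

The point of the sketch is the asymmetry: the structured case needs only a query, the semi-structured case needs a content search, and the unstructured case needs an outside assumption (here, the image width) before any analysis is possible.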
The bottom line, though, is that traditional IT infrastructures--including servers, storage and networks--were built around structured information and then bent to adapt to semi-structured information. They were not designed for the multifaceted structural requirements, scale and analytical demands of big data.
That is why EMC emphasized new architectures in its definition of big data, and it is also why the company acquired Isilon and Greenplum. Much has been written about these acquisitions, so I will focus briefly on how the two companies illustrate the need for different architectures for big data.