GridIron Systems: Mining Big Data 'Gold' in a Flash
January 12, 2012
Trends in the IT industry sometimes resemble gold rushes as vendors pan for revenue "nuggets." The use of solid state devices (SSDs), most notably flash memory, is at the center of one such rush, but just as with the real 19th century gold rushes in California and Alaska, not all prospectors (that is, vendors) will be successful. Where the claims are staked can make all the difference in the world, and GridIron Systems is staking its claim on accelerating big data analyses.
The IT industry loves to give trends labels ("cloud," anyone?) and "big data" is the buzz label for one recent trend. Three characteristics often used to distinguish big data are volume, variety and velocity. Volume is the quantity of data. Variety describes the fact that big data is not merely structured information (the familiar SQL type found in relational databases) but also includes semi-structured data (which is content-searchable) and unstructured data (which is bit-mapped data, such as video surveillance files). Velocity relates to the speed required both to capture big data and to process it analytically.
Now, big data is nothing new. Large data warehouses have been around for quite a while, and specialized vertical market information, such as seismic data, has been captured and analyzed for years. But large volumes of new sensor-based information (such as the additional utility readings captured with "smart" meters) and new sources of semi-structured and unstructured information (such as that generated by the Web and by the Human Genome Project) have led to big data being added to the IT lexicon.
But big data is complex for reasons beyond volume, variety and velocity. Some of it is transient: it is captured, analyzed and deleted quickly, as with very frequent RFID sensor readings, whose value is extracted almost immediately and which have no ongoing value or need for storage or archiving. Some of it is persistent: it is kept for a long period of time, as with historical sales information. Where a data set falls along continua such as this one has serious implications for how value is derived from it through processing (such as the speed of processing and when it needs to take place). It also affects how the data is best stored. Standard relational data warehouses were not designed to handle many of the new big data workloads or analytical processes; this is why the term "virtual data warehouse" is coming into vogue.
Another major problem in working with big data is the "I/O gap." While server performance has continued to improve, storage performance has remained essentially flat. For example, while the capacity of disk drives has increased dramatically, their rotational speed (measured in revolutions per minute) has stayed roughly constant. In practice, this means that when servers can process more I/Os than the storage can deliver, processing slows because the servers cannot process data that they don't have.
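To put rough numbers on the I/O gap, here is a minimal sketch in Python. It uses a common rule of thumb: a spinning drive can serve about one random I/O per average seek plus half a rotation. The function names, the 3.5 ms seek time, the 24-drive array and the 50,000 IOPS server demand are illustrative assumptions only; they are not figures from GridIron Systems or from any particular drive.

```python
def rotational_latency_ms(rpm):
    """Average rotational latency in ms: the time for half a revolution."""
    return 0.5 * 60_000.0 / rpm


def disk_iops(rpm, avg_seek_ms):
    """Rule-of-thumb random IOPS for one drive (seek + half a rotation per I/O)."""
    return 1000.0 / (avg_seek_ms + rotational_latency_ms(rpm))


def fraction_of_demand_met(demand_iops, drives, iops_per_drive):
    """Share of the servers' I/O demand that the drive array can deliver."""
    return min(1.0, drives * iops_per_drive / demand_iops)


if __name__ == "__main__":
    # Hypothetical 15,000 RPM drive with an assumed 3.5 ms average seek.
    per_drive = disk_iops(rpm=15_000, avg_seek_ms=3.5)
    print(round(per_drive, 1))  # 181.8

    # Hypothetical servers that could consume 50,000 IOPS from 24 such drives.
    print(round(fraction_of_demand_met(50_000, 24, per_drive), 3))  # 0.087
```

Under these assumed figures the array delivers under a tenth of what the servers could consume. Because rotational latency depends only on spindle speed, adding capacity per drive does nothing to close that gap.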
This I/O bottleneck is a performance problem that is not exclusive to big data; increasingly, robust enterprise applications run into the same problem. Because SSDs promise to solve it (they have no mechanical parts that impact I/O), they are a hot topic across enterprise storage generally. For its part, GridIron Systems focuses on the big data aspects of the I/O gap. Let's see what it has to offer.
GridIron Systems targets businesses that need to accelerate the processing of big data workloads, which demand high bandwidth, high IOPS (I/O operations per second) or both. A workable solution must also enable highly concurrent data access, since a large number of users and applications may be active simultaneously. In addition, volatility is a big issue, as the queries that access the data may be rapidly changing, and there is often a high data ingest rate.
This concurrent access requirement and increasing volatility render traditional caching and tiering solutions, even those utilizing SSDs, ineffective when dealing with performance-constrained big data stores. Traditional caching is designed for read/write production systems and makes certain assumptions, such as a relatively fixed size data set, which benefits uniformly from lower latency.
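The fixed-data-set assumption can be illustrated with a toy simulation. The sketch below is not GridIron's method or any vendor's cache; it is a plain least-recently-used (LRU) cache, chosen here only as a familiar stand-in for "traditional caching," and both workloads are deliberately extreme, made-up cases. The names lru_hit_rate, stable_workload and shifting_workload are hypothetical.

```python
from collections import OrderedDict


def lru_hit_rate(accesses, capacity):
    """Replay block accesses through an LRU cache and return the hit rate."""
    cache = OrderedDict()
    hits = 0
    total = 0
    for block in accesses:
        total += 1
        if block in cache:
            hits += 1
            cache.move_to_end(block)       # mark as most recently used
        else:
            cache[block] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used
    return hits / total if total else 0.0


def stable_workload(working_set, passes):
    """The same blocks are re-read on every pass (a fixed-size data set)."""
    return [b for _ in range(passes) for b in range(working_set)]


def shifting_workload(working_set, passes):
    """Each pass touches fresh blocks (changing queries, high ingest)."""
    return [p * working_set + b
            for p in range(passes) for b in range(working_set)]


if __name__ == "__main__":
    print(lru_hit_rate(stable_workload(100, 10), capacity=100))    # 0.9
    print(lru_hit_rate(shifting_workload(100, 10), capacity=100))  # 0.0
```

With a stable working set that fits in the cache, every pass after the first is served from cache. When the set of hot blocks keeps moving, the same cache never gets a second chance at any block, however fast the underlying media may be.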