It has taken the better part of a decade for big data analytics to become more mainstream, a time in which Hadoop was born (and died) and public clouds went from theoretical to ubiquitous. Today, there are hundreds of data applications that, while not exactly shrink-wrapped, are readily available to most enterprise IT and engineering teams who want them.
Applications such as recommendation engines, fraud detection, sales and marketing analytics, behavior analysis, predictive maintenance, and portfolio/market analysis are now widely understood and implemented. We are no longer as stupefied by the sheer massiveness of the datasets and the vast clusters of computers that are employed to pound out these new insights.
But even with our greater understanding and comfort with big data concepts and practices, there is no getting around the non-trivial costs - in time, effort, and budget - that these engineering projects represent. To mitigate some of these costs, data teams are now considering two key methods of reducing the time and expense of getting actionable insights out of their huge datasets: data virtualization and precomputation.
Data virtualization focuses on immediacy and agility (query faster) by:
- Performing analytics directly from the data lake
- Minimizing ETL and skipping the data warehouse
- Assembling query results from back-end systems as needed
Precomputation focuses on query response time (faster queries) by:
- Precomputing query results and making them available in pre-aggregated tables
- Computing results once and serving them many times - thereby reducing the number of queries
- Enabling fast response times for querying data in data lakes or data warehouses
Both approaches have merit, and there may be good reasons to consider both depending on your requirements. Do you need to query faster? Or do you need faster queries? In the cloud era, where literally every operation has an associated cost, it is important to know the difference.
If there ever were two symbiotic technologies in modern computing, it would be data analytics and cloud computing. They were both more or less invented at the same time and by some of the same players. The unlimited, elastic scale of cloud infrastructure naturally pairs with the phenomenal compute, network, and storage requirements of big data algorithms. But one must consider the economic realities of running vast analytical workloads in the cloud.
In their recent Cloud Research Report, Capita found that the primary driver for companies moving to the cloud was lowering costs, yet only 27% have actually realized lower costs. In fact, more companies (34%) have experienced an increase in their costs than a decrease. This has spawned a new industry of experts and tools that focus solely on optimizing cloud costs.
The mechanics of big data analytics in the cloud are defined and constrained by some very basic physical and systemic realities. First, and most obviously, processing petabytes of data takes longer than processing terabytes of data. Even the simplest query operations on huge datasets can be prohibitively expensive. Second, data must be prepared for analysis: cleansed, de-duped, and structured so that today's BI tools can make sense of it. This work is typically done by data engineers using various ETL tools, with the results stored in a data warehouse or data mart. Both of these factors are part of the calculus when considering data virtualization and precomputation.
Query Faster or Faster Queries
Data virtualization technology focuses on rapid deployment and reducing or eliminating, as much as possible, the ETL and data engineering work typically associated with creating a cloud data warehouse. Instead, the data virtualization layer connects directly to the various data sources - files, tables, or even NoSQL DBs - that form the data lake and creates a virtual view that is exposed to the front-end BI layer. All queries happen on demand the moment a user clicks on their dashboard. In other words, they are able to perform queries faster - as in sooner.
The first drawback of data virtualization strategies is that query performance is gated by your slowest query engine, since results are assembled from the back-end sources at query time. For ad hoc analysis, or for datasets that are not extremely large, this may not be a big issue. But as you add more data, or more active users working concurrently, query performance may suffer. Second, skipping the data engineering and ETL work means that you cannot be as sure of the quality of the data, even though you are more certain that the data is coming fresh from the source. With data virtualization, only rudimentary checking and formatting is conducted at query time. While the effort of ETL is eliminated, the chance to cleanse and conform the data is also missed.
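The assemble-at-query-time behavior can be sketched in a few lines of Python. The source names, latencies, and numbers below are illustrative assumptions, not from any real system; the point is only that the virtual view's response time is gated by its slowest back end.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical back-end sources with different response times.
def query_orders_lake():
    time.sleep(0.05)                      # fast columnar file scan
    return {"2024-01": 120, "2024-02": 95}

def query_crm_nosql():
    time.sleep(0.40)                      # slower document-store aggregation
    return {"2024-01": 30, "2024-02": 42}

def virtual_view():
    """Assemble the 'virtual view' on demand, at query time.

    Even with the sources queried in parallel, total latency is
    roughly that of the slowest back end - the core trade-off of
    data virtualization.
    """
    start = time.perf_counter()
    combined = {}
    with ThreadPoolExecutor() as pool:
        for partial in pool.map(lambda f: f(), (query_orders_lake, query_crm_nosql)):
            for month, value in partial.items():
                combined[month] = combined.get(month, 0) + value
    return combined, time.perf_counter() - start

results, elapsed = virtual_view()
print(results)                 # combined per-month totals from both sources
print(f"query took {elapsed:.2f}s")  # ~0.40s: the slowest source, not the sum
```

Swapping the fast file scan for a slow one would slow every dashboard click by the same amount, which is why concurrency and data growth hit virtualized queries directly.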
Precomputation strategies - also sometimes referred to as cloud OLAP - focus on making queries faster. Queries are performed once with results stored in the precomputation layer or data cloud. With today’s intelligent cloud OLAP platforms, the precomputation layer can even be pre-populated by examining SQL history and past user behavior. Precomputation strategies work on both data warehouses and data lakes, so they naturally take advantage of the data engineering work that you have already done and will continue to do.
Precomputing tends to put less stress on the source systems because most queries are served as simple lookups of tables in the data cloud. This makes query response times much faster - often in the sub-second range - and more predictable, even with very large datasets. These faster and lighter-weight queries also enable many more concurrent users without degrading performance.
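The compute-once, serve-many idea reduces to a simple pattern; the table names and figures below are illustrative only, a minimal sketch rather than how any particular cloud OLAP platform is implemented.

```python
# Hypothetical raw fact rows: (month, region, value).
RAW_EVENTS = [
    ("2024-01", "us", 120), ("2024-01", "eu", 30),
    ("2024-02", "us", 95),  ("2024-02", "eu", 42),
]

def build_aggregate(rows):
    """Compute once: roll raw rows up into a 'pre-aggregated table'."""
    agg = {}
    for month, _region, value in rows:
        agg[month] = agg.get(month, 0) + value
    return agg

# The expensive scan runs once, ahead of query time...
MONTHLY_TOTALS = build_aggregate(RAW_EVENTS)

def query_month(month):
    """...and every subsequent query is served as a cheap lookup
    instead of re-scanning the raw data."""
    return MONTHLY_TOTALS.get(month, 0)

print(query_month("2024-01"))   # 150
```

Because each query is a lookup rather than a scan, response time stays flat as the raw dataset grows, and many concurrent users hit the aggregate table rather than the source systems.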
Ramifications in a Pay-Per-Query World
For data virtualization, it is hard to argue with the benefits of being able to do real analytics against big data without the labor-intensive ETL and data engineering practices associated with cloud data warehouses. With that immediacy, you may pay a penalty in performance and scale with larger datasets and larger user bases. But the flexibility and immediacy you gain may outweigh those concerns.
With precomputation, the focus is on query performance and reducing the strain on back-end systems by precomputing results. So you may get much more analytical mileage out of your data while reducing the number of expensive queries you are running. In the pay-per-query world of cloud analytics, reducing the number of queries will make a demonstrable difference in cloud costs.
Li Kang is the head of Kyligence US operations.