Still waters run deep, the old proverb tells us. The same can be said of data lakes, storage repositories that hold vast amounts of raw data in its native format until an application, such as predictive analytics, requires it.
Like still water, data lakes can be dark and mysterious. This has led to several misconceptions about the technology, some of which can prove damaging or even fatal to new data lake projects.
Before diving in, here are five key things you need to know about data lakes.
1. Data lakes and data warehouses are not the same thing
A data warehouse contains data that has been loaded from source systems based on predefined criteria. "A data lake, on the other hand, houses raw data that has not been manipulated in any way prior to entering the lake and enables multiple teams within an organization to analyze the data," noted Sue Clark, senior CTO and architect at Sungard Availability Services.
Although separate entities, data lakes and data warehouses can be packaged into a hybrid model. "This combined approach enables companies to stream incoming data into a data lake, but then move select subsets into relational structures," said Ashish Verma, a managing director at Deloitte Consulting. "When data ages past a certain point or falls into disuse, dynamic tiering functionality can automatically move it back to the data lake for cheaper storage in the long term."
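The tiering behavior Verma describes can be pictured with a small sketch. Everything here is an illustrative assumption rather than any vendor's actual feature: the `Dataset` record, the `retier` function, and the 90-day threshold are hypothetical names and values chosen only to show the idea of moving disused data back to cheaper lake storage.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical threshold: data untouched for this long counts as "cold".
COLD_AFTER = timedelta(days=90)


@dataclass
class Dataset:
    name: str
    last_accessed: datetime
    tier: str  # either "warehouse" or "lake"


def retier(datasets, now, cold_after=COLD_AFTER):
    """Move warehouse datasets not accessed within cold_after back to the lake.

    Returns the names of the datasets that were moved.
    """
    moved = []
    for ds in datasets:
        if ds.tier == "warehouse" and now - ds.last_accessed > cold_after:
            ds.tier = "lake"
            moved.append(ds.name)
    return moved
```

In practice the threshold would come from the organization's own usage patterns and storage costs, not from a fixed constant.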
2. Don't treat a data lake like a digital dump
Although a data lake can store structured, unstructured, and semi-structured data in raw form, it should never be regarded as a data dumping ground. "Since data is not processed or analyzed before entering the lake, it’s important that the data lake is maintained and updated on a routine basis, and that all users know the sources of the data in the lake to ensure it’s analyzed appropriately," Clark explained.
From a data scientist's point of view, the most important component in creating a data lake is the process of adding data while ensuring the accompanying catalogs are updated, current, and accessible, observed Brandon Haynie, chief data scientist at Babel Street, a data discovery and analysis platform provider. Otherwise, potentially useful datasets may be set adrift and lost. "The catalog will provide the analyst with an inventory of the sources available, the data’s purpose, its origin, and its owner," he said. "Knowing what the lake contains is critical to generating the value to support decision-making and allows data to be used effectively instead of generating more questions surrounding its quality or purpose."
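As a rough illustration of the catalog Haynie describes, the sketch below records the four items he names (source, purpose, origin, and owner) for each dataset and reports anything in the lake that has no entry. The `CatalogEntry` and `Catalog` classes and the example paths are hypothetical, not part of any particular catalog product.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class CatalogEntry:
    source: str   # system the data was pulled from
    purpose: str  # why the data is kept
    origin: str   # where the data was originally produced
    owner: str    # person or team responsible for it


class Catalog:
    def __init__(self):
        self._entries = {}

    def register(self, dataset_path, entry):
        """Add or update an entry, rejecting entries with blank fields."""
        missing = [k for k, v in asdict(entry).items() if not v.strip()]
        if missing:
            raise ValueError(f"missing catalog fields: {missing}")
        self._entries[dataset_path] = entry

    def uncataloged(self, lake_paths):
        """Return the lake paths that have no catalog entry, sorted."""
        return sorted(p for p in lake_paths if p not in self._entries)
```

Running `uncataloged` against a listing of the lake on a regular schedule is one simple way to spot datasets that are at risk of being set adrift.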
3. A data lake requires constant management
It’s important to define management approaches in advance to ensure data quality, accessibility, and necessary data transformations. "If a data lake isn’t properly managed from conception, it will turn into a 'data swamp,' or a lake with low-quality, poorly cataloged data that can't be easily accessed," Verma said.
It's important for IT leaders to know that data governance is critical for ensuring data is consistent, accurate, contextualized, accessible, and protected, noted Jitesh S. Ghai, vice president and general manager of data quality, security, and governance, at software development company Informatica. "With a crystal-clear data lake, organizations are able to capitalize on their vast data to deliver innovative products and services, better serve customers, and create unprecedented business value in the digital era," he explained.
4. Don't become a data hoarder
Many organizations feel they must store everything in order to create an endless supply of valuable data. "Unless someone decides to keep reprocessing all of the data continuously, it is sufficient to create a 'digestible' version of the data," observed Dheeraj Ramella, chief technologist at VoltDB, a firm that offers an in-memory database to support applications requiring real-time decisions on streaming data. "This way, you can refine the model with any new training data." Once training has been completed and the information that's meaningful to the enterprise has been captured, it should be possible to purge any data that falls outside compliance and regulatory retention timeframes.
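One way to picture this purge step is the sketch below, which flags raw files that have already been condensed into a digestible version and are older than their retention period. The categories, the retention periods, and the `purge_candidates` function are all hypothetical; real retention periods come from the regulations and compliance policies that apply to the organization.

```python
from datetime import date, timedelta

# Hypothetical retention periods per data category, in days.
RETENTION_DAYS = {"transactions": 7 * 365, "clickstream": 365}


def purge_candidates(files, today, retention_days=RETENTION_DAYS):
    """Return paths of digested raw files that are past their retention period.

    Each item in files is a tuple: (path, category, created, digested).
    """
    expired = []
    for path, category, created, digested in files:
        keep_for = retention_days.get(category)
        if keep_for is None:
            continue  # unknown category: keep the file rather than guess
        if digested and today - created > timedelta(days=keep_for):
            expired.append(path)
    return expired
```

The function only reports candidates. Deleting them is left as a separate, deliberate step so that someone can review the list first.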
5. A data lake is not a "prophet-in-a-box"
The truth is that gaining meaningful insights or creating accurate forecasts still requires a significant amount of analytical work and problem-solving using a tool that's capable of accessing and working with the stored data, Haynie advised. "The data lake is just a step in the overall problem-solving process."
Staying competitive in today’s data-driven world requires a modern analytics platform that can turn information into insight, and both data lakes and data warehouses have an essential role to play, Verma said. "By developing a clear understanding of where they each make sense, IT leaders can help their organizations invest wisely and maximize the value of their information assets."