If you’re an enterprise data management professional, there’s a good chance you’ve run across the term “data lake” in recent months.
Though it might sound like an esoteric concept, a data lake is simply a repository where enterprises can store data from disparate sources in its original, native format.
Big Data Deluge
Companies these days generate large volumes of structured and unstructured data in the form of multimedia files, telemetry and sensor data, spreadsheets, emails, web logs and system logs. Some of that data is used. Far more remains untapped, usually because extracting value from it is too complicated and costly.
A data lake provides a central destination for all of that data. It allows companies to ingest and store data in any format from any source without having to transform or structure it first, as a traditional data warehouse or relational database requires.
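To make the "store first, structure later" idea concrete, here is a minimal sketch in Python. It is an illustration only: the `ingest` function, the folder-per-source layout and the `catalog.jsonl` file are assumptions made for this example, not part of any product mentioned in this article. The point is that the payload is written byte-for-byte, with only a small catalog entry recorded so the file can be found again.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def ingest(lake_root: Path, source: str, filename: str, payload: bytes) -> Path:
    """Store payload byte-for-byte under lake_root/source/ and log it in a catalog."""
    target_dir = lake_root / source
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / filename
    target.write_bytes(payload)  # no parsing, no schema, no transformation

    # Append one catalog line so the file can be located later (hypothetical format).
    entry = {
        "source": source,
        "path": str(target.relative_to(lake_root)),
        "bytes": len(payload),
        "sha256": hashlib.sha256(payload).hexdigest(),
    }
    with (lake_root / "catalog.jsonl").open("a", encoding="utf-8") as catalog:
        catalog.write(json.dumps(entry) + "\n")
    return target


if __name__ == "__main__":
    lake = Path(tempfile.mkdtemp(prefix="lake_"))
    # Two very different formats land side by side, unchanged.
    ingest(lake, "weblogs", "access.log", b'10.0.0.1 - - "GET /index.html" 200\n')
    ingest(lake, "sensors", "reading.json", b'{"device": "t-17", "temp_c": 21.4}\n')
    print((lake / "catalog.jsonl").read_text(encoding="utf-8"))
```

A local folder stands in for the lake here. In practice the same pattern would sit on top of a distributed file system or object store.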
(If all that sounds similar to a description of a Hadoop environment, that’s because it is. More on that in a bit.)
The idea behind a data lake is straightforward enough. Instead of placing data into multiple, purpose-built data stores, a company grappling with massive amounts of disparate data can dump everything into a data lake without modification, says Gartner analyst Nick Heudecker. It offers the perfect landing zone for enterprises to park and integrate all their valuable and untapped data while they figure out what to do with it.
A data lake eliminates the need for independently managed information silos and makes data easier to find, use and share. It offers a more cost-efficient alternative to transforming and force-fitting multiple data types into traditional data warehouses and relational databases, Heudecker said in an interview with Datamation.
He points to at least two other immediate use cases. A data lake offers a great resource for data scientists to develop and test new analytics models. It can also be used as a place where companies can extract and transform data before loading it into a downstream data warehouse.
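The second use case, extracting and transforming data before loading it into a downstream warehouse, might look roughly like the Python sketch below. Everything in it is assumed for illustration: the `RAW_ORDERS` sample, the `orders` table, and the use of an in-memory SQLite database as a stand-in for a real data warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical raw export as it might sit in the lake, untouched.
RAW_ORDERS = """order_id,customer,amount,currency
1001, Acme Corp ,250.00,usd
1002,Globex,not-available,usd
1003,Initech,75.5,USD
"""


def extract_transform(raw_text: str) -> list[tuple[int, str, float, str]]:
    """Parse raw CSV text, clean the fields, and skip rows that fail conversion."""
    rows = []
    for record in csv.DictReader(io.StringIO(raw_text)):
        try:
            amount = float(record["amount"])
        except ValueError:
            continue  # the bad row stays in the lake; it is not loaded downstream
        rows.append((
            int(record["order_id"]),
            record["customer"].strip(),
            amount,
            record["currency"].strip().upper(),
        ))
    return rows


def load(rows, connection: sqlite3.Connection) -> None:
    """Load cleaned rows into a typed warehouse table."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS orders ("
        "order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL, currency TEXT)"
    )
    connection.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)", rows)
    connection.commit()


if __name__ == "__main__":
    warehouse = sqlite3.connect(":memory:")  # stand-in for a real warehouse
    load(extract_transform(RAW_ORDERS), warehouse)
    print(warehouse.execute("SELECT * FROM orders ORDER BY order_id").fetchall())
```

The raw export is never modified. Only the cleaned, typed rows move downstream, so the original remains available if the transformation rules change later.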
Because data is stored in its original format in a data lake, analysts can run different kinds of analyses on it, with far greater flexibility than is possible with data in a structured data store. “Previous approaches to broad-based data integration have forced all users into a common predetermined schema, or data model,” a PricewaterhouseCoopers report on data lakes noted.
“Unlike this monolithic view of a single enterprise-wide data model, the data lake relaxes standardization and defers modeling, resulting in a nearly unlimited potential for operational insight and data discovery. As data volumes, data variety, and metadata richness grow, so does the benefit.”
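Deferring the data model in this way is often called schema-on-read. A minimal Python sketch of the idea follows, under stated assumptions: the `RAW_EVENTS` sample and the helper names are invented for this example, and real lakes would typically use a query engine rather than hand-written loops. The same raw records serve two unrelated analyses, each applying only the structure it needs at read time.

```python
import json
from collections import Counter

# Hypothetical raw events, kept exactly as they arrived (one JSON document per line).
RAW_EVENTS = [
    '{"user": "ana", "page": "/pricing", "ms": 120, "device": "mobile"}',
    '{"user": "ben", "page": "/pricing", "ms": 340}',
    '{"user": "ana", "page": "/docs", "ms": 95, "device": "desktop"}',
]


def read_with_schema(raw_lines, fields, defaults=None):
    """Apply a schema at read time: project each raw record onto the given fields."""
    defaults = defaults or {}
    for line in raw_lines:
        record = json.loads(line)
        yield {name: record.get(name, defaults.get(name)) for name in fields}


def page_views(raw_lines):
    """Analysis 1: only cares about which page was requested."""
    return Counter(row["page"] for row in read_with_schema(raw_lines, ["page"]))


def mean_latency_by_device(raw_lines):
    """Analysis 2: cares about device and latency, tolerating a missing device."""
    totals, counts = Counter(), Counter()
    rows = read_with_schema(raw_lines, ["device", "ms"], {"device": "unknown"})
    for row in rows:
        totals[row["device"]] += row["ms"]
        counts[row["device"]] += 1
    return {device: totals[device] / counts[device] for device in totals}


if __name__ == "__main__":
    print(page_views(RAW_EVENTS))
    print(mean_latency_by_device(RAW_EVENTS))
```

Neither analysis had to be anticipated when the events were stored, and the record with no `device` field did not have to be rejected or repaired at load time.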
Not a Technology
But it is important for enterprises to realize that a data lake is more a concept than a specific technology, says Curt Monash, principal at Monash Research, in a conversation with Datamation.
“We can talk about the phenomenon to which the term is applied,” Monash says. “But there is no defining characteristic” that would make something a data lake, he notes.
To be sure, companies need to have the server, storage and network resources to implement a data lake. They also need the right management and querying tools to extract and analyze the data so they can derive some business value from it, he says.
The Hadoop Data Lake
In fact, the reason Big Data vendors like Pivotal, Cloudera, MapR and Hortonworks are the ones talking about data lakes the most is that their Hadoop technologies offer many of the capabilities needed for a data lake.
The Hadoop Distributed File System (HDFS) allows enterprises to capture data from anywhere and store it in multiple formats for future use. For many companies, their Hadoop environment is in fact also their data lake, though they may not necessarily recognize it as one. By coining terms like data lake and data hub and data reservoir, what the vendors are essentially doing is trying to capture the value proposition of their technologies more evocatively, Heudecker says.
“The data lake use case is a great way for companies to get to know Hadoop,” says Mike Olson, chief strategy officer at Cloudera, in a blog post. “It’s simple and easy to implement, and it delivers easily measured return on the initial investment.”