If you’re an enterprise data management professional, there’s a good chance you’ve run across the term “data lake” in recent months.
Though it might sound like an esoteric concept, a data lake is simply a repository where enterprises can store data from disparate sources in their original, native formats.
Big Data Deluge
Companies these days generate a lot of structured and unstructured data in the form of multimedia files, telemetry and sensor data, spreadsheets, emails, web logs and system logs. A lot of the data is used. A lot more remains unused and untapped, usually because it is too complicated and costly to extract value from it.
A data lake provides a sort of central destination for all of that data. It allows companies to ingest and store data in any format from any source without having to worry about transforming or structuring it first as is needed with a traditional data warehouse or relational database.
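To make the "store it as it arrives" idea concrete, here is a minimal sketch in Python. It assumes a plain directory stands in for the lake; the `ingest_raw` function, the `raw/<source>/<date>` layout and the sample file names are hypothetical, chosen only for illustration and not taken from any vendor's product.

```python
import shutil
import tempfile
from datetime import date
from pathlib import Path


def ingest_raw(source_file: Path, lake_root: Path, source_name: str, day: date) -> Path:
    """Copy a file into the lake unchanged, under <lake>/raw/<source>/<YYYY-MM-DD>/."""
    target_dir = lake_root / "raw" / source_name / day.isoformat()
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / source_file.name
    # Byte-for-byte copy: nothing is parsed, validated or restructured on the way in.
    shutil.copy2(source_file, target)
    return target


# Demo in a throwaway directory (hypothetical sample data).
workdir = Path(tempfile.mkdtemp())
lake = workdir / "lake"

log_file = workdir / "web.log"
log_file.write_text('10.0.0.1 - - "GET /index.html" 200\n')
csv_file = workdir / "sales.csv"
csv_file.write_text("region,amount\nEMEA,120\n")

# Two very different formats land side by side with no upfront modeling.
stored_log = ingest_raw(log_file, lake, "webserver", date(2015, 1, 15))
stored_csv = ingest_raw(csv_file, lake, "finance", date(2015, 1, 15))
print(stored_log.relative_to(lake))
print(stored_csv.relative_to(lake))
```

The point of the sketch is what it leaves out: there is no schema, no type checking and no transformation step between the source and the store.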
(If all that sounds similar to a description of a Hadoop environment, that’s because it is. But more on that in a bit.)
The idea behind a data lake is straightforward enough. Instead of placing data into multiple, purpose-built data stores, a company grappling with massive amounts of disparate data can dump everything into a data lake without modification, says Gartner analyst Nick Heudecker. It offers the perfect landing zone for enterprises to park and integrate all their valuable and untapped data while they figure out what to do with it.
A data lake eliminates the need for independently managed information silos and makes data easier to find, use and share. It offers a more cost-efficient alternative to transforming and force-fitting multiple data types into traditional data warehouses and relational databases, Heudecker said in an interview with Datamation.
He points to at least two other immediate use cases. A data lake offers a great resource for data scientists to develop and test new analytics models. It can also be used as a place where companies can extract and transform data before loading it into a downstream data warehouse.
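The second use case, extracting and transforming data in the lake before loading it into a warehouse, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: an in-memory SQLite database stands in for the downstream warehouse, and the sample CSV, the `sales` table and the `extract`, `transform` and `load` helpers are all hypothetical.

```python
import csv
import io
import sqlite3

# Raw data as it might sit in the lake: untyped text with inconsistent formatting.
RAW_SALES_CSV = """region,amount
emea, 120
 APAC,80
emea,30.5
"""


def extract(raw_text):
    """Read the raw CSV text into a list of dictionaries, one per row."""
    return list(csv.DictReader(io.StringIO(raw_text)))


def transform(rows):
    """Normalize region names and convert amounts to numbers."""
    return [(row["region"].strip().upper(), float(row["amount"])) for row in rows]


def load(conn, records):
    """Write the cleaned records into a structured warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales (region TEXT NOT NULL, amount REAL NOT NULL)"
    )
    conn.executemany("INSERT INTO sales (region, amount) VALUES (?, ?)", records)
    conn.commit()


warehouse = sqlite3.connect(":memory:")  # stand-in for a real data warehouse
load(warehouse, transform(extract(RAW_SALES_CSV)))

totals = dict(warehouse.execute("SELECT region, SUM(amount) FROM sales GROUP BY region"))
print(totals)
```

Only the cleaned, typed records reach the warehouse; the messy original stays in the lake and can be reprocessed later with different rules.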
Because data is stored in its original format in a data lake, analysts can run different kinds of analyses on it, with far greater flexibility than is possible with data in a structured data store. “Previous approaches to broad-based data integration have forced all users into a common predetermined schema, or data model,” a PricewaterhouseCoopers report on data lakes noted.
“Unlike this monolithic view of a single enterprise-wide data model, the data lake relaxes standardization and defers modeling, resulting in a nearly unlimited potential for operational insight and data discovery. As data volumes, data variety, and metadata richness grow, so does the benefit.”
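Deferring the data model in this way is often called "schema on read": structure is applied when the data is queried, not when it is stored. The Python sketch below illustrates the idea under simple assumptions. The sample events, their field names and the `read_with_schema` helper are hypothetical and do not come from the PricewaterhouseCoopers report.

```python
import json
from collections import Counter

# Raw events stored exactly as they arrived; records need not share the same fields.
RAW_EVENTS = """\
{"ts": "2015-01-15T10:00:00", "user": "ann", "page": "/pricing", "ms": 120}
{"ts": "2015-01-15T10:00:05", "user": "bob", "page": "/docs", "ms": 340, "referrer": "search"}
{"ts": "2015-01-15T10:01:10", "user": "ann", "page": "/docs", "ms": 95}
"""


def read_with_schema(raw_text, fields):
    """Apply a schema at read time: keep only the requested fields (None if absent)."""
    for line in raw_text.splitlines():
        record = json.loads(line)
        yield {name: record.get(name) for name in fields}


# Analysis 1: a marketing view that only cares about which pages were visited.
page_views = Counter(r["page"] for r in read_with_schema(RAW_EVENTS, ["page"]))

# Analysis 2: an operations view of the slowest response seen per user.
slowest = {}
for r in read_with_schema(RAW_EVENTS, ["user", "ms"]):
    slowest[r["user"]] = max(slowest.get(r["user"], 0), r["ms"])

print(page_views)
print(slowest)
```

Both analyses read the same untouched records, and neither had to agree on a shared data model in advance.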
Not a Technology
But it is important for enterprises to realize that a data lake is more a concept than a specific technology, says Curt Monash, principal at Monash Research, in a conversation with Datamation.
“We can talk about the phenomenon to which the term is applied,” Monash says. “But there is no defining characteristic” that would make something a data lake, he noted.
To be sure, companies need to have the server, storage and network resources to implement a data lake. They also need the right management and querying tools to extract and analyze the data so they can derive some business value from it, he says.
The Hadoop Data Lake
In fact, the reason Big Data vendors like Pivotal, Cloudera, MapR and Hortonworks are the ones talking about data lakes the most is that their Hadoop technologies offer many of the capabilities needed for a data lake.
The Hadoop Distributed File System (HDFS) allows enterprises to capture data from anywhere and store it in multiple formats for future use. For many companies, their Hadoop environment is in fact also their data lake, though they may not necessarily recognize it as one. By coining terms like data lake and data hub and data reservoir, what the vendors are essentially doing is trying to capture the value proposition of their technologies more evocatively, Heudecker says.
“The data lake use case is a great way for companies to get to know Hadoop,” says Mike Olson, chief strategy officer at Cloudera, in a blog post. “It’s simple and easy to implement, and it delivers easily measured return on the initial investment.”
“From that starting point, building new analytic and processing applications using Apache HBase, Apache Hive, Apache Pig, Impala, Presto, Apache Spark and other ecosystem components can squeeze new value out of the data,” he notes.
It is too early to say for certain whether it is the IT organization or the business units that are taking ownership of data lake initiatives, says Heudecker. But a lot will depend on why a company wants to implement a data lake. “If you are collocating data it is probably the developers” who will want a data lake, he says. “If you are talking about a sandbox for testing analytics models, it is going to be the data scientists. If you are doing discovery, IT will own some of it.”
Data Lake Caveats
While a data lake might help a company integrate and store disparate data, it does not address the broader problem of how the data will be analyzed and used. Putting all usable and potentially usable data into one vast reservoir solves only a part of the problem. Extracting actionable insight and value from that data will still require specialized skills.
Once a company has a good idea of the problem it wants to solve with a data lake, it needs to get the relevant data into the lake and find the skills to capitalize on the data, Heudecker says. But in order to do that, users need to know the context in which the data was captured, the sources it came from and how to merge it with other data sets. They need to have a good idea of data quality and data provenance. Often the business analysts and data scientists required for the task will not be easy to find.
The real work begins only after the data has landed in the lake, agrees Monash. Different kinds of value extraction will require different types of skills. For example, extracting, transforming and loading data from the data lake to another environment will require one set of skills. Companies looking to do non-relational analysis with the data will need another, while those hoping to do predictive analytics will need yet another.
There are other challenges as well. With most data lakes, the data is uncurated and arrives with little vetting. Because a data lake accepts any data without any oversight, companies can have a hard time determining data quality and lineage, Gartner notes in an alert. “Without descriptive metadata and a mechanism to maintain it, the data lake risks turning into a data swamp,” Gartner cautioned. “And without metadata, every subsequent use of data means analysts start from scratch.”
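What "descriptive metadata and a mechanism to maintain it" might look like in its simplest form is sketched below in Python. This is an assumption-laden illustration, not Gartner's recommendation or any product's API: the `Catalog` and `CatalogEntry` classes, their fields and the sample paths are all hypothetical.

```python
import hashlib
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class CatalogEntry:
    """Descriptive metadata recorded for one data set in the lake."""
    path: str
    source: str
    description: str
    sha256: str          # fingerprint of the content, useful for tracing lineage
    ingested_at: str     # UTC timestamp in ISO 8601 format
    tags: list = field(default_factory=list)


class Catalog:
    """A tiny in-memory catalog; a real one would be persisted and access-controlled."""

    def __init__(self):
        self._entries = {}

    def register(self, path, content, source, description, tags=()):
        entry = CatalogEntry(
            path=path,
            source=source,
            description=description,
            sha256=hashlib.sha256(content).hexdigest(),
            ingested_at=datetime.now(timezone.utc).isoformat(),
            tags=list(tags),
        )
        self._entries[path] = entry
        return entry

    def find_by_tag(self, tag):
        return [e for e in self._entries.values() if tag in e.tags]


catalog = Catalog()
catalog.register(
    "raw/finance/2015-01-15/sales.csv",
    b"region,amount\nEMEA,120\n",
    source="finance export",
    description="Daily sales totals by region",
    tags=["sales", "daily"],
)
catalog.register(
    "raw/webserver/2015-01-15/web.log",
    b'10.0.0.1 - - "GET /index.html" 200\n',
    source="web servers",
    description="HTTP access log",
    tags=["clickstream"],
)

sales_sets = catalog.find_by_tag("sales")
for entry in sales_sets:
    print(entry.path, "from", entry.source)
```

With even this much recorded at ingest time, an analyst can find a data set by topic and see where it came from, instead of starting from scratch.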
Security and access control are other issues that enterprises have to contend with. Just as there is little oversight over how data gets in, there are few controls over who has access to data in a data lake. This can be especially problematic for companies that plan to store confidential or sensitive information in it.
“Data lakes typically begin as ungoverned data stores,” the Gartner report notes. “Meeting the needs of wider audiences require curated repositories with governance, semantic consistency and access controls — elements already found in a data warehouse.”