“From that starting point, building new analytic and processing applications using Apache HBase, Apache Hive, Apache Pig, Impala, Presto, Apache Spark and other ecosystem components can squeeze new value out of the data,” he notes.
It is too early to say for certain whether the IT organization or the business units are taking ownership of data lake initiatives, says Heudecker. But a lot will depend on why a company wants to implement a data lake. “If you are collocating data it is probably the developers” who will want a data lake, he says. “If you are talking about a sandbox for testing analytics models, it is going to be the data scientists. If you are doing discovery, IT will own some of it.”
Data Lake Caveats
While a data lake might help a company integrate and store disparate data, it does not address the broader problem of how the data will be analyzed and used. Putting all usable and potentially usable data into one vast reservoir solves only a part of the problem. Extracting actionable insight and value from that data will still require specialized skills.
Once a company has a good idea of the problem it wants to solve with a data lake, it needs to get the relevant data into the lake and find the skills to capitalize on the data, Heudecker says. But in order to do that, users need to know the context in which the data was captured, the sources it came from and how to merge it with other data sets. They need to have a good idea of data quality and data provenance. Often the business analysts and the data scientists who will be required for the task will not be easy to find.
The real work begins only after the data has landed in the lake, agrees Monash. Different kinds of value extraction will require different types of skills. For example, extracting, transforming and loading data from the data lake to another environment will require one set of skills. Companies looking to do non-relational analysis with the data will require another, while those hoping to do predictive analytics will require yet another.
There are other challenges as well. With most data lakes, the data is uncurated and arrives with little vetting. Because a data lake accepts any data without any oversight, companies can have a hard time determining data quality and lineage, Gartner notes in an alert. “Without descriptive metadata and a mechanism to maintain it, the data lake risks turning into a data swamp,” Gartner cautions. “And without metadata, every subsequent use of data means analysts start from scratch.”
Security and access control are other issues that enterprises have to contend with. Just as there is little oversight over how data gets in, there are few controls over who has access to data in a data lake. This can be especially problematic for companies that plan to store confidential or sensitive information in it.
“Data lakes typically begin as ungoverned data stores,” the Gartner report notes. “Meeting the needs of wider audiences require curated repositories with governance, semantic consistency and access controls, elements already found in a data warehouse.”