If there's a product or concept, there's probably a buzzword around it. The latest term to rise out of the Big Data trend is "Data Lake," which is more of a concept than a product. In essence, it might be your way to go from processing old data to real-time data analytics.
A Data Lake is simply an unstructured store where you can put any kind of data coming in from your many data streams and sources to do some form of processing on it later, rather than when it comes in.
That may sound like a data warehouse, an idea that has been around for decades, but it is not. The difference is that in a data warehouse, the data is processed and pre-categorized at the point of entry into the store, which dictates how it will be analyzed. In a Data Lake, the data is left in its raw form for later processing.
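The contrast can be made concrete with a minimal sketch. Everything in it is an assumption for illustration: the sample events, the field names, and the helper functions `warehouse_ingest`, `lake_ingest` and `lake_query` are hypothetical and do not come from any product mentioned here. The warehouse path shapes records at entry and discards what does not fit; the lake path keeps every record raw and applies a schema only when a question is asked.

```python
import json

# Hypothetical raw events from two different sources; field names are illustrative.
RAW_EVENTS = [
    '{"source": "web", "user": "a1", "ms": 120}',
    '{"source": "sensor", "device": "t-7", "temp_c": 21.5}',
]


def warehouse_ingest(lines):
    """Schema-on-write: shape each record at entry and drop what does not fit."""
    table = []
    for line in lines:
        record = json.loads(line)
        # The schema (user, ms) is fixed before any data arrives.
        if "user" in record and "ms" in record:
            table.append((record["user"], int(record["ms"])))
    return table


def lake_ingest(lines):
    """Schema-on-read, step 1: keep every record exactly as it arrived."""
    return list(lines)


def lake_query(lake, wanted_fields):
    """Schema-on-read, step 2: apply a schema only when a question is asked."""
    rows = []
    for line in lake:
        record = json.loads(line)
        if all(field in record for field in wanted_fields):
            rows.append(tuple(record[field] for field in wanted_fields))
    return rows


warehouse = warehouse_ingest(RAW_EVENTS)  # the sensor event is lost at entry
lake = lake_ingest(RAW_EVENTS)            # both events are retained
sensor_rows = lake_query(lake, ["device", "temp_c"])
```

In this toy version the sensor reading never reaches the warehouse because nobody planned for it, while the lake can still answer a question about it later.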
"It's a response to a very challenging problem: the volume and variety and velocity of data and the fear I will miss something if I don't collect everything I can," said Tom Fountain, CTO of Pneuron, developer of a Big Data distributed analytics platform. "The problem is how do you begin to process it all? Until that's solved, you need to store it and hold it in some kind of meaningful form until you can process it."
Big Data means data from all kinds of sources and is usually unstructured. But for now, the most common use cases gaining traction are around ad hoc or advanced analytics for data, as well as a landing zone for data of an unknown value, according to Nick Heudecker, research director for information management at Gartner.
"The advantage is instead of having to define schema ahead of time, you can keep everything without having to keep that schema and then decide if you want to load it into a data warehouse before committing to doing so," he said.
So at the moment, the Data Lake is undefined and has no standard, notes Julien Sauvage, director of product marketing at Talend, a Big Data integration developer and specialist. "Our idea is that it's flat, meaning you don’t have any hierarchy the way you should store and structure data because it's multipurpose. So you put everything in that flat format without transforming anything," he said.
Hadoop and other Big Data implementations already handle massive amounts of data, so why add a new function for your developers to wrestle with? In short, data comes in so fast and in such volume that analyzing it on arrival is not easy.
"It enables the IT guys to unify all their data," said Sauvage. "There is so much data coming in at once. With the lake, there is no priority. You don’t transform data. You put everything in there and then decide what to do with it. It was created in the first place because there was no alternative. Storing data has become so cheap that it's cheaper to store data than to get rid of it."
One way of thinking about it is that data stored without structure can be shaped into whatever form is needed more quickly than data whose previous structure must first be disassembled and then reassembled, notes Bernard Marr, author, consultant and speaker on all things Big Data.
"Another advantage is that the data is available to anyone in the organization, and can be analyzed and interrogated via different tools and interfaces as appropriate for each job. It also means that all of an organization’s data is kept in one place – rather than having separate data stores for individual departments or applications, as is often the case," said Marr.
Most organizations embracing the Data Lake are letting the reservoir fill up before they apply advanced analytics to generate insight. Letting it fill for too long risks turning the lake into a swamp, or a deluge, that yields no insight or ROI. Wait too long and the information becomes stale, overwhelming, or both.
"At the same time you start storing, you need a data scientist to start analyzing the data. They might find one source is good but maybe another source is not so good. That would help it prioritize the source they need in the lake. Some sources might not be integral. I might not need six months of data. Then you stop storing data that's too old," said Sauvage.
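The pruning Sauvage describes can be sketched as a simple retention pass. This is only an illustration under stated assumptions: the record layout, the source names, the 180-day window (roughly six months) and the `prune_lake` helper are all hypothetical, and a real lake would apply such a policy in its storage layer, not to an in-memory list.

```python
from datetime import datetime, timedelta


def prune_lake(records, now, max_age_days, trusted_sources):
    """Keep records that are recent enough and come from a source worth storing."""
    cutoff = now - timedelta(days=max_age_days)
    return [
        r for r in records
        if r["source"] in trusted_sources and r["arrived"] >= cutoff
    ]


# Hypothetical lake contents; "payload" stands in for the raw data itself.
now = datetime(2024, 7, 1)
records = [
    {"source": "web", "arrived": datetime(2024, 6, 20), "payload": "raw"},
    {"source": "web", "arrived": datetime(2023, 11, 1), "payload": "raw"},
    {"source": "legacy_feed", "arrived": datetime(2024, 6, 25), "payload": "raw"},
]

# Assume analysis showed that only the "web" source is worth keeping.
kept = prune_lake(records, now, max_age_days=180, trusted_sources={"web"})
```

Here the old web record is dropped for age and the legacy feed is dropped because analysis found it unhelpful, leaving one record.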
However, that's also the start of insight. "What's happening is when you start understanding the correlation and data relationships to true insight that gives you predictive power about an event to happen. That's incredibly valuable because it gives you lead time to prepare an optimal response," said Fountain.
Once you discover those insights, you can go upstream from the lake, because you now know which specific variables have that predictive power. Eventually you will have a fabric built on those variables, and you can predict events or catch them in real time and act accordingly.
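One simple way to look for variables with predictive power is to rank them by how strongly they correlate with the outcome of interest. The sketch below uses plain Pearson correlation as a stand-in; it is not a method attributed to anyone quoted here, the column names and numbers are invented, and correlation alone does not establish that a variable truly predicts an event.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no signal
    return cov / sqrt(var_x * var_y)


def rank_variables(columns, outcome):
    """Order variables by the absolute strength of their correlation with the outcome."""
    scores = {name: pearson(values, outcome) for name, values in columns.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


# Hypothetical variables pulled from the lake, and a hypothetical outcome (1.0 = failure).
columns = {
    "vibration": [1.0, 2.0, 3.0, 4.0, 5.0],
    "room_temp": [20.0, 22.0, 19.0, 21.0, 20.0],
}
failures = [0.0, 0.0, 1.0, 1.0, 1.0]

ranking = rank_variables(columns, failures)  # strongest candidate first
```

The top-ranked variables are the ones worth capturing upstream and monitoring as data arrives.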
A Concept, Not a Product
The Data Lake idea is still young. The term is believed to have been coined in 2011 by Pentaho CTO James Dixon. Many of the technologies used in Data Lakes, including Hadoop, are only five or six years old, compared with traditional data warehouse and DBMS products that have been around for 35 years, so they are still evolving, said Heudecker.
The other side is skills. The tooling story is not necessarily there yet for democratized analytics. People using BI tools today don't have the data science expertise to realize value from data that has a different structure. And you need human eyes to do critical analysis because you may not trust all those data sources equally. Adjusting your analytics accordingly is a challenge.
"When you are talking about doing analysis over a lot of unstructured, uncurated data, those skills are not just uncommon in most industries, they are extremely rare. You need data manipulation experience with one or more programming languages, statistical knowledge, machine learning, data discovery, and search. Those are things programmers are happy to do because they like to write code. But if you are talking about a population that uses BI tools or dashboards, you aren't going to turn them loose on an unstructured data store and say go for it," said Heudecker.
Because of this, he says, the tools still need to mature. IT departments also need a very broad scope of governance and have to develop metrics on data and usage as well as security. Finally, you need a way to take the insights gleaned from the data lake and move them into production. None of this is yet available in the Big Data tools out there, meaning IT shops will have to build it themselves.
Fountain said CIOs leading the way on Data Lakes are already well down the track in understanding what information has high value potential, and are already restricting the inflow of non-valuable data into the lake. They let in more of the data they are fairly confident will produce something valuable, typically in the form of new insight they can convert into action.
"So the lake will be crucial to consistently generating new insight and new forms of potential value. That combination of lake or large scale processing and storage with advanced analytics piece is one corner of a leading triangle that IT needs to connect the dots on, to take full advantage of this new era we're entering," he said.