Big Data: Ready to Fill Your Data Lake?

Written By

Andy Patrizio

Jan 22, 2015

If there’s a product or concept, there’s probably a buzzword around it. The latest term to rise out of the Big Data trend is “Data Lake,” which is more of a concept than a product. In essence, it might be your way to go from processing old data to real-time data analytics.

A Data Lake is simply an unstructured store where you can put any kind of data arriving from your many streams and sources, deferring processing until later rather than doing it on arrival.

That may sound like a data warehouse, an idea that has been around for decades, but it is not. The difference is that in a data warehouse, the data is processed and pre-categorized at the point of entry into the store, which dictates how it will be analyzed. In a Data Lake, the data is left in its raw form for later processing.

“It’s a response to a very challenging problem: the volume and variety and velocity of data and the fear I will miss something if I don’t collect everything I can,” said Tom Fountain, CTO of Pneuron, developer of a Big Data distributed analytics platform. “The problem is how do you begin to process it all? Until that’s solved you need to store it and hold it in some kind of meaningful form until you can process it.”

Big Data comes from all kinds of sources and is usually unstructured. For now, the use cases gaining the most traction are ad hoc or advanced analytics and a landing zone for data of unknown value, according to Nick Heudecker, research director for information management at Gartner.

“The advantage is instead of having to define schema ahead of time, you can keep everything without having to keep that schema and then decide if you want to load it into a data warehouse before committing to doing so,” he said.
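The idea Heudecker describes is often called schema-on-read. As a minimal sketch of it, assuming a tiny in-memory "lake" of raw JSON strings (the sample records, field names, and the read_with_schema helper are all hypothetical, not taken from any product mentioned here), the structure is imposed only when someone asks a question of the data:

```python
import json

# Hypothetical raw events, kept exactly as they arrived (one JSON string each).
# Nothing was validated or reshaped when they were stored.
RAW_LAKE = [
    '{"source": "web", "user": "u1", "page": "/pricing", "ms": 120}',
    '{"source": "sensor", "device": "d7", "temp_c": 21.5}',
    '{"source": "web", "user": "u2", "page": "/docs"}',
]


def read_with_schema(raw_records, source, fields):
    """Apply a schema at read time: keep records from one source and
    project only the requested fields, filling missing ones with None."""
    rows = []
    for line in raw_records:
        record = json.loads(line)
        if record.get("source") != source:
            continue
        rows.append({name: record.get(name) for name in fields})
    return rows


# Two different "schemas" applied to the same raw store, decided after the fact.
web_rows = read_with_schema(RAW_LAKE, "web", ["user", "page", "ms"])
sensor_rows = read_with_schema(RAW_LAKE, "sensor", ["device", "temp_c"])
```

A data warehouse would instead reject or reshape the sensor record at load time if it did not fit the web schema; here both kinds of record sit side by side until a reader decides which fields matter.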

At the moment, the Data Lake has no formal definition and no standard, notes Julien Sauvage, director of product marketing at Talend, a Big Data integration developer and specialist. “Our idea is that it’s flat, meaning you don’t have any hierarchy the way you should store and structure data because it’s multipurpose. So you put everything in that flat format without transforming anything,” he said.

Why Lakes?

Hadoop and other Big Data implementations already handle massive amounts of data, so why add a new function for your developers to wrestle with? In short, data comes in so fast, and in such a gusher, that analyzing it on arrival is not easy.

“It enables the IT guys to unify all their data,” said Sauvage. “There is so much data coming in at once. With the lake, there is no priority. You don’t transform data. You put everything in there and then decide what to do with it. It was created in the first place because there was no alternative. Storing data has become so cheap that it’s cheaper to store data than to get rid of it.”

One way of thinking about it is that data stored without structure can be shaped into whatever form is needed more quickly than data whose previous structure must first be disassembled and then reassembled, notes Bernard Marr, an author, consultant and speaker on all things Big Data.

“Another advantage is that the data is available to anyone in the organization, and can be analyzed and interrogated via different tools and interfaces as appropriate for each job. It also means that all of an organization’s data is kept in one place – rather than having separate data stores for individual departments or applications, as is often the case,” said Marr.

Most organizations embracing the Data Lake are letting the reservoir fill up before they apply advanced analytics to generate insight. But letting the lake fill for too long risks turning it into a swamp, or a deluge, that yields no insight or ROI: the information becomes stale, overwhelming, or both.

“At the same time you start storing, you need a data scientist to start analyzing the data. They might find one source is good but maybe another source is not so good. That would help IT prioritize the sources they need in the lake. Some sources might not be integral. I might not need six months of data. Then you stop storing data that’s too old,” said Sauvage.
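The pruning Sauvage describes can be sketched as a simple retention rule. This is only an illustration under stated assumptions: the record layout, the source names, the per-source age limits, and the prune_lake helper are all hypothetical, and a real lake would apply such a policy to files or partitions rather than to an in-memory list.

```python
from datetime import datetime, timedelta


def prune_lake(records, now, max_age_by_source, default_max_age):
    """Keep only records still inside the retention window for their source.

    Sources without an explicit limit fall back to default_max_age.
    """
    kept = []
    for record in records:
        max_age = max_age_by_source.get(record["source"], default_max_age)
        if now - record["ingested_at"] <= max_age:
            kept.append(record)
    return kept


# Hypothetical contents of the lake and a fixed "today" for repeatability.
now = datetime(2015, 1, 22)
records = [
    {"source": "clickstream", "ingested_at": datetime(2015, 1, 10)},
    {"source": "clickstream", "ingested_at": datetime(2014, 6, 1)},
    {"source": "billing", "ingested_at": datetime(2014, 6, 1)},
]

# Assumed policy: clickstream data loses its value after 30 days,
# everything else is kept for a year.
kept = prune_lake(
    records,
    now,
    max_age_by_source={"clickstream": timedelta(days=30)},
    default_max_age=timedelta(days=365),
)
```

The per-source limits are the part a data scientist would tune over time, shortening the window for sources that turn out to carry little insight.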

However, that’s also the start of insight. “What’s happening is when you start understanding the correlation and data relationships to true insight that gives you predictive power about an event to happen. That’s incredibly valuable because it gives you lead time to prepare an optimal response,” said Fountain.

When you discover those insights, you can then go upstream from the lake, because now you know which specific variables have that predictive power. Eventually you will have a fabric built on those variables, and you can predict events, or catch them in real time, and act accordingly.

A Concept, Not a Product

The idea of Data Lakes is still young. The term is believed to have been coined in 2011 by Pentaho CTO James Dixon. Many of the technologies used in Data Lakes, including Hadoop, are only five or six years old, compared with 35-year-old traditional data warehouse and DBMS products, so they are still evolving, said Heudecker.

The other side is skills. The tooling story is not necessarily there yet for democratized analytics. People using BI tools today don’t have the data science expertise to realize value from data that has a different structure. And you need human eyes to do critical analysis because you may not trust all those data sources equally. Adjusting your analytics accordingly is a challenge.

“When you are talking about doing analysis over a lot of unstructured, uncurated data, those skills are not just uncommon in most industries, they are extremely rare. You need data manipulation experience with one or more programming languages, statistical knowledge, machine learning, data discovery, and search. Those are things programmers are happy to do because they like to write code. But if you are talking about a population that uses BI tools or dashboards, you aren’t going to turn them loose on an unstructured data store and say go for it,” said Heudecker.

Because of this, he says the tools still need to mature. IT departments also need a very broad scope of governance and have to develop metrics on data and usage, as well as security. Finally, you need a way to take the insights gleaned from the Data Lake and move them into production. None of this has happened yet in the Big Data tools out there, meaning IT shops will have to create their own.

Fountain said CIOs leading the way on Data Lakes are already well down the track in understanding what information has high-value potential and are already restricting inflows of non-valuable data into the lake. They admit more of the data they are fairly confident will produce something valuable, typically in the form of new insight they can convert into action.

“So the lake will be crucial to consistently generating new insight and new forms of potential value. That combination of lake or large scale processing and storage with advanced analytics piece is one corner of a leading triangle that IT needs to connect the dots on, to take full advantage of this new era we’re entering,” he said.

Andy Patrizio

Andy Patrizio is a freelance journalist based in southern California who has covered the computer industry for 20 years and has built every x86 PC he’s ever owned, laptops not included.
