A data lake is a centralized location where very large amounts of structured and unstructured data are stored. Its ability to scale is one of the primary differences between a data lake and a data warehouse. Data warehouses can be big, but data lakes are often used to centralize the information that was formerly retained within a great many data warehouses. Data lakes can simply function as large storage repositories, but their main use case is to serve as a place where data is fed, and where it can be used for analytics as a way to guide better decisions.
This article looks at data lakes in detail and explores their benefits and basic architecture.
What Is A Data Lake?
A data lake is not so much a physical object as a description of several things working together. Essentially, it is a conceptual way of thinking about a collection of storage instances that hold various data assets, each stored as an exact or near-exact copy of its source format. In short, data lakes are storage repositories that hold a vast amount of raw data in its native format until it is processed.
Unlike the more-structured data warehouse, which uses hierarchical data structures like folders, rows and columns, a data lake typically uses a flat file structure that preserves the original structure of the data as it was input. But this doesn’t mean the data lake makes the data warehouse obsolete. Each has its own place.
A data warehouse is essentially a massive relational database—its architecture is optimized for the analysis of relational data from business applications and transactional systems, like major financial systems. SQL queries are generally used to provide organizational data for use in reporting and analysis of key metrics. As such the data must be cleaned and sometimes enriched to act as the organization’s “single source of truth.”
A data lake, on the other hand, is more flexible. As well as relational data, it can store a great many forms of non-relational data. Depending on the organization, this might be social media feeds, user files, information from a variety of mobile apps, and metrics and data from Internet of Things (IoT) devices and sensors. Because data structure and schema are not as rigidly defined as in a data warehouse, the data lake can deal with all kinds of queries. Beyond SQL queries, it can also handle questions from big data analytics systems, full-text search, machine learning (ML), and artificial intelligence (AI) systems.
To achieve this, each data element in a data lake is assigned a unique identifier and tagged with a set of extended metadata tags. When someone runs a business query against particular metadata, all of the data carrying the matching tags is then analyzed to answer the query or question.
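The tagging idea described above can be sketched in a few lines of Python. This is a minimal, in-memory illustration rather than any particular product's catalog: the `LakeObject` and `Catalog` names, the tag keys, and the file paths are all hypothetical.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class LakeObject:
    """One raw item in the lake plus its descriptive metadata tags."""
    path: str                 # where the raw data lives (hypothetical path)
    tags: dict[str, str]      # extended metadata, e.g. source and format
    object_id: str = field(default_factory=lambda: str(uuid.uuid4()))


class Catalog:
    """In-memory stand-in for a data lake metadata catalog (assumption)."""

    def __init__(self) -> None:
        self._objects: dict[str, LakeObject] = {}

    def register(self, path: str, **tags: str) -> LakeObject:
        """Assign a unique identifier to the item and store its tags."""
        obj = LakeObject(path=path, tags=tags)
        self._objects[obj.object_id] = obj
        return obj

    def find(self, **wanted: str) -> list[LakeObject]:
        """Return every object whose tags match all of the requested values."""
        return [
            obj for obj in self._objects.values()
            if all(obj.tags.get(key) == value for key, value in wanted.items())
        ]


catalog = Catalog()
catalog.register("raw/iot/2023-01-01.json", source="iot", format="json")
catalog.register("raw/social/feed.csv", source="social", format="csv")
catalog.register("raw/iot/2023-01-02.json", source="iot", format="json")

# A query on metadata selects the tagged objects to analyze.
iot_objects = catalog.find(source="iot")
print([obj.path for obj in iot_objects])
```

The raw files themselves are never touched here; only the tags are consulted to decide which data a query should read.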
The rise of data lakes is being driven by the increasingly massive amounts of data enterprises are collecting and analyzing and the need for someplace to store it.
“The historical storage medium was a relational database, but these technologies just don’t work well for all these data fragments we’re collecting from all over the place,” said Avi Perez, CTO of BI and analytics software vendor Pyramid Analytics. “They’re too structured, too expensive, and they typically require an enormous amount of prior setup.”
Data lakes are more forgiving, more affordable, and can accommodate unstructured data. However, the flip side of the ability to store that much data is that they can become cluttered as everything is dumped inside them. Some call this the “data graveyard effect,” because the data becomes inaccessible and unusable—there’s too much of it, and there is a lack of differentiation to determine what data has real value in analysis.
Benefits Of Data Lakes
The data lake is a response to the challenge of massive data inflow. Internet data, sensor data, machine data, and IoT data all come in many forms and from many sources, and as fast as servers are these days, not everything can be processed in real time. Here are some of the main benefits of data lakes:
- Original data. The volume, variety, and velocity of incoming data make it easy to miss things as the data arrives. Storing data from multiple sources and in multiple formats in the data lake provides the option to go back later and look more closely.
- Easy analysis. Because the data is stored raw, without a fixed structure, you can apply whatever analytics or schema you need at the time of analysis. With a data warehouse, the data is preprocessed. If you want to run a search or type of query that the data wasn’t prepared for, you might have to start the processing all over again, if you can do it at all.
- Availability. The data is available to anyone in the organization. Something stored in a data warehouse might be only accessible to the business analysts.
- Business performance. According to research from Aberdeen Group, organizations that implemented a data lake saw 9 percent higher revenue growth than their peers because they were able to detect trends more quickly and guide business decision-making more accurately.
- Scalability. Data can be collected from multiple sources at scale without the need to define data structures, schema, or transformations.
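The "Easy analysis" point above is commonly called schema-on-read: the schema is chosen when the data is read, not when it is stored. A minimal sketch follows, assuming newline-delimited JSON events; the sample events, field names, and the `read_with_schema` helper are hypothetical.

```python
import json

# Raw events kept exactly as they arrived (hypothetical sample data).
RAW_EVENTS = [
    '{"device": "sensor-1", "temp_c": "21.5", "ts": "2023-01-01T00:00:00"}',
    '{"device": "sensor-2", "temp_c": "19.0", "ts": "2023-01-01T00:05:00", "battery": "87"}',
    '{"device": "sensor-1", "temp_c": "22.1", "ts": "2023-01-01T00:10:00"}',
]


def read_with_schema(raw_lines, schema):
    """Apply a schema at read time: pick the listed fields and cast them."""
    for line in raw_lines:
        record = json.loads(line)
        yield {
            name: cast(record[name]) if name in record else None
            for name, cast in schema.items()
        }


# Two different questions, two different schemas, same untouched raw data.
temperature_schema = {"device": str, "temp_c": float}
battery_schema = {"device": str, "battery": int}

temps = list(read_with_schema(RAW_EVENTS, temperature_schema))
batteries = list(read_with_schema(RAW_EVENTS, battery_schema))

average_temp = sum(row["temp_c"] for row in temps) / len(temps)
print(round(average_temp, 2))
print(batteries)
```

Because the raw events are never rewritten, a question nobody anticipated (here, battery levels) needs only a new schema, not a new load.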
Data Lake Architecture
Data lakes have a deep end and shallow end, according to Gartner—the deep end is for data scientists and engineers who know how to manipulate and massage the data, and the shallow end is for more general users doing less specific searches.
No special hardware is needed to build a data lake; its storage mechanism is a flat file system. You could use a mainframe and move the data to other servers for processing, but most data lakes are more likely to be built on the Hadoop Distributed File System (HDFS), a distributed, scale-out file system that supports faster processing of large data sets.
A data lake still needs some kind of structure, or order, and its data must be timely: when users need immediate access, they can get it. The lake must also be flexible enough to give users their choice of tools to process and analyze the data. There must be some integrity and quality to the data, because the old adage about garbage-in, garbage-out applies here. Finally, the lake must be easily searchable.
Experts recommend multiple tiers that start with the source data, or the flat file repository. Other tiers include the ingestion tier, where data is input based on the query, the unified operations tier where it is processed, the insights tier where the answers are found, and the action tier, where decisions and actions are made on the findings.
Building A Data Lake
While data lakes are structurally more open than data warehouses, users are advised to build zones that separate data according to its cleanliness. To catalog everything in the lake, you have to group and organize it based on how clean the data is and how mature it might be.
Some data architects recommend four zones. The first is completely raw data, unfiltered and unexamined. Second is the ingestion zone, where early standardization around categories is done—does it fit into finance, security, or customer information, for example? Third is data that’s ready for exploration. Finally, the consumption layer—this is the closest match to a data warehouse, with a defined schema and clear attributes.
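The four zones can be sketched as a chain of small promotion steps. The following Python is illustrative only: the `Record` type, the keyword rules used for categorization, the decision to hold back uncategorized records, and the consumption schema are all assumptions, not a prescribed design.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical keyword rules for early standardization in the ingestion zone.
CATEGORY_KEYWORDS = {
    "finance": ("invoice", "payment"),
    "security": ("login", "firewall"),
    "customer": ("signup", "support"),
}


@dataclass
class Record:
    payload: dict
    zone: str = "raw"                 # zone 1: unfiltered and unexamined
    category: Optional[str] = None


def categorize(record: Record) -> Record:
    """Raw -> ingestion: attach a coarse category based on the event name."""
    event = record.payload.get("event", "")
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(word in event for word in keywords):
            return Record(record.payload, "ingestion", category)
    return Record(record.payload, "ingestion", "uncategorized")


def ready_for_exploration(record: Record) -> Optional[Record]:
    """Ingestion -> exploration: hold back records that were not categorized."""
    if record.category == "uncategorized":
        return None
    return Record(record.payload, "exploration", record.category)


def to_consumption(record: Record) -> Record:
    """Exploration -> consumption: enforce a fixed, warehouse-like schema."""
    shaped = {
        "event": str(record.payload["event"]),
        "amount": float(record.payload.get("amount", 0)),
    }
    return Record(shaped, "consumption", record.category)


raw = [
    Record({"event": "invoice_paid", "amount": "120.50"}),
    Record({"event": "login_failed"}),
    Record({"event": "heartbeat"}),
]
ingested = [categorize(r) for r in raw]
explorable = [r for r in map(ready_for_exploration, ingested) if r is not None]
consumable = [to_consumption(r) for r in explorable]

for r in consumable:
    print(r.zone, r.category, r.payload)
```

Each step returns a new record, so the raw copy stays available even after cleaner versions exist in later zones.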
Between each pair of zones, the data goes through some kind of ingestion and transformation. While this allows for a more freewheeling method of data processing, it can also get expensive if you have to reprocess the data every time you use it. Generally speaking, you will pay less if you define the layout up front, because much of the cost depends on how you organize the information in your data lake. Repartitioning data later carries its own cost.
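To make the organization cost concrete, here is a small sketch that counts how many records a query has to scan depending on how the data is partitioned. It uses in-memory lists in place of files, and the event data and the `partition_by` and `query` helpers are hypothetical.

```python
from collections import defaultdict

# Hypothetical events; in a real lake each partition would be a directory of files.
EVENTS = [
    {"date": "2023-01-01", "device": "a", "value": 1},
    {"date": "2023-01-01", "device": "b", "value": 2},
    {"date": "2023-01-02", "device": "a", "value": 3},
    {"date": "2023-01-02", "device": "b", "value": 4},
    {"date": "2023-01-03", "device": "a", "value": 5},
    {"date": "2023-01-03", "device": "b", "value": 6},
]


def partition_by(records, key):
    """Group records under one partition per distinct value of `key`."""
    partitions = defaultdict(list)
    for record in records:
        partitions[record[key]].append(record)
    return dict(partitions)


def query(partitions, partition_key, field, wanted):
    """Return matching records and how many records had to be scanned."""
    if field == partition_key:
        # Partition pruning: only the one matching partition is read.
        scanned = partitions.get(wanted, [])
    else:
        # No pruning possible: every partition has to be read.
        scanned = [r for part in partitions.values() for r in part]
    matches = [r for r in scanned if r[field] == wanted]
    return matches, len(scanned)


by_date = partition_by(EVENTS, "date")

_, scanned_date = query(by_date, "date", "date", "2023-01-02")
_, scanned_device = query(by_date, "date", "device", "a")
print(scanned_date, scanned_device)

# Repartitioning to serve device queries means rewriting every record once.
by_device = partition_by([r for part in by_date.values() for r in part], "device")
_, scanned_after = query(by_device, "device", "device", "a")
print(scanned_after)
```

A layout chosen to match the expected queries keeps scans small, while changing the layout afterwards requires reading and rewriting the whole data set.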
Bottom Line: Data Lakes
The growing volume of data has created a demand for a better means of storing and accessing it. The simple database evolved into the data warehouse, and then the data lake.
Tools for data lake preparation and processing generally take two forms: those released by traditional business intelligence (BI) and data warehousing vendors who have added the data lake to their product line, and those from startups and open source projects where much of the early data lake technology originated. Many of the larger companies, including Amazon, Microsoft, Google, Oracle, and IBM, offer data lake tools, and enterprises already well invested in technology from these providers will find a variety of tools for data ingestion, transformation, examination, and reporting. Some data lakes are available for on-premises deployment; others are cloud-based services.
Now a mature technology, data lakes do more than just provide a repository for massive amounts of data. They also facilitate its analysis, meaning that organizations with data lakes are able to efficiently manage and analyze larger volumes of data to aid in decision making.