We’ve had data storage pools for years. Now we have data lakes. Other than drowning in marketing hype, what does the term “data lake” actually mean?
This is a loaded question, because definitions of data lakes are all over the map. At its core, a data lake is a large storage repository that stores multiple data types in their native formats, and creates data constructs to present to analytical and business intelligence (BI) tools.
Data lakes are usually built on commodity hardware for massive scalability; for this reason they are often cloud-based, although nothing requires them to be unless they are delivered as a SaaS offering. To put it another way, a data lake stores multi-source data as analytics-ready objects. Typical data types include application data, files, sensor data, customer-generated data, and data from Internet-attached devices.
By far the best known and most widely deployed data lakes are on the Hadoop platform, which is closely identified with the phrase (but not identical to it). Apache Hadoop started life as a highly scalable batch processing platform running on commodity storage and compute. Hadoop has become an enterprise platform that efficiently provides massive data storage and processing, and hosts native and third-party tools to transform raw data into actionable intelligence. Redundancy is built in, with replication across clusters.
Data Lake vs. Data Warehouse?
Data warehouses typically house relational database management systems (RDBMS) data and subject incoming data to schema-on-write. Companies run queries across collected database data stored in the warehouse repository.
Warehouses enable analytics on massive database data by storing a limited set of data attributes and storing data in specific, unchanging structures. Warehouses return consistent and predictable analysis across mission- and business-critical databases, and are highly valuable to the organizations that use them.
Why change what is working so well?
There are issues with flexibility and knowledge depth in warehouses. Pre-categorized structures support performance but only allow a fixed set of queries on pre-selected attributes. Another issue is that warehouses process data using schema-on-write, which can slow down ingestion because every record must be shaped to the schema before it is stored.
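To make the schema-on-write limitation concrete, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for a warehouse. The table name, column names, and sample records are all hypothetical; the point is only that an attribute left out of the schema at ingestion cannot be queried later.

```python
import sqlite3

# Hypothetical warehouse table: only the pre-selected attributes exist.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales ("
    "order_id INTEGER PRIMARY KEY, region TEXT NOT NULL, amount REAL NOT NULL)"
)

WAREHOUSE_COLUMNS = ("order_id", "region", "amount")


def load(record):
    """Schema-on-write: shape each record to the table schema at ingestion."""
    # Attributes outside the schema (here, "device") are dropped.
    row = tuple(record[col] for col in WAREHOUSE_COLUMNS)
    conn.execute("INSERT INTO sales VALUES (?, ?, ?)", row)


incoming = [
    {"order_id": 1, "region": "EMEA", "amount": 120.0, "device": "mobile"},
    {"order_id": 2, "region": "APAC", "amount": 80.5, "device": "kiosk"},
]
for rec in incoming:
    load(rec)

# A query on pre-selected attributes works.
total_by_region = dict(
    conn.execute("SELECT region, SUM(amount) FROM sales GROUP BY region")
)

# A query on an attribute that was never modeled fails.
try:
    conn.execute("SELECT device, COUNT(*) FROM sales GROUP BY device")
    device_query_ok = True
except sqlite3.OperationalError:
    device_query_ok = False
```

In this sketch the per-region totals are available, but the device question cannot be answered without changing the schema and reloading the data.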
In contrast, a data lake stores data from multiple sources in raw form without subjecting it to the schema that data warehouses apply on ingestion. Instead, the data lake applies a schema-on-read, which processes data as it is needed for query searches. Users can issue a single query or set of queries that take in data from many different sources and formats. They do not need to issue the same query multiple times.
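As a rough illustration of schema-on-read, the sketch below keeps two raw objects in their native formats (JSON lines and CSV) and normalizes them only when a query runs. The sample data, field names, and the `read_records` and `query` helpers are assumptions made for this example; they are not part of Hadoop or any vendor's product.

```python
import csv
import io
import json

# Raw objects kept in their native formats (hypothetical sample data).
RAW_OBJECTS = [
    {
        "format": "json",
        "body": '{"device": "pump-1", "temp_c": 71.5, "site": "A"}\n'
                '{"device": "pump-2", "temp_c": 64.0, "site": "B"}\n',
    },
    {
        "format": "csv",
        "body": "device,temp_f,site\npump-3,170.6,A\n",
    },
]


def read_records(obj):
    """Schema-on-read: parse and normalize only when a query needs the data."""
    if obj["format"] == "json":
        for line in obj["body"].splitlines():
            rec = json.loads(line)
            yield {
                "device": rec["device"],
                "temp_c": float(rec["temp_c"]),
                "site": rec["site"],
            }
    elif obj["format"] == "csv":
        for rec in csv.DictReader(io.StringIO(obj["body"])):
            # This source reports Fahrenheit; convert while reading.
            temp_c = (float(rec["temp_f"]) - 32) * 5 / 9
            yield {"device": rec["device"], "temp_c": temp_c, "site": rec["site"]}


def query(predicate):
    """Run one query across every raw object, whatever its format."""
    return [
        rec
        for obj in RAW_OBJECTS
        for rec in read_records(obj)
        if predicate(rec)
    ]


# One query spans both the JSON and the CSV source.
hot_at_site_a = query(lambda r: r["site"] == "A" and r["temp_c"] > 70)
```

Nothing is reshaped at ingestion, so adding a new source only means adding a new branch to the reader rather than redesigning a stored schema.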
This environment provides fast processing speeds even for massive sensor data. Nor does the data lake associate the data with a specific, rigid category. This flexible platform lets users cross-analyze different data types with many different tools, including native tools such as Apache Mahout, HBase, Hive, and Pig, as well as Hadoop MapReduce. A number of third-party analytics packages also run on Hadoop data lakes.
Some data lake champions talk about replacing warehouses with lakes. For now this is going way over the edge: both have their uses and there is no need to replace a working, valuable warehouse. However, data lakes will offer levels of flexibility and data point integration that warehouses cannot.
Another advantage is low cost. Cost-effective data lakes make it economically feasible to copy massive amounts of data into the lake for future analysis. In the case of Hadoop data lakes, Hadoop deliberately keeps its software and support costs per node low. Redundancy is built into the platform, which saves on third-party replication products. Compute and storage scale independently, upgrades are cost-effective, and licensing is simple.
Cautions Regarding Data Lakes
Data lakes have great promise for BI but they are not Utopia. The technology still needs to create data schemas and integration points between different data formats. This takes processing resources. Organizations must also be prepared to purchase storage; “commodity storage” is cost-effective but IT still needs to buy it, install it, provision it, and manage it.
Nor are data lakes graveyards for big data. Replacing data silos is a benefit only if the organization applies highly integrated and flexible data mining across multiple data sources.
Security is also an issue. Although Hadoop distributions have built-in redundancy, they may not have robust security, especially if they are hosted in the cloud. Vendors do offer security measures for Hadoop, such as data tokenization, encryption, key management, and security audits. Users must practice due diligence around sensitive BI results, and be exceptionally careful about securing data both within the Hadoop repository and in transit.
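For readers unfamiliar with tokenization, the following is a minimal sketch of the idea: a sensitive value is swapped for a random surrogate before the record lands in the repository, and the mapping lives in a separate, protected store. The `TokenVault` class and the sample record are hypothetical and kept in memory for illustration; a production system would rely on a vendor's hardened tokenization service rather than anything like this.

```python
import secrets


class TokenVault:
    """Hypothetical in-memory vault mapping tokens back to sensitive values."""

    def __init__(self):
        self._token_to_value = {}
        self._value_to_token = {}

    def tokenize(self, value):
        # Reuse the existing token so the same value always maps to one token.
        if value in self._value_to_token:
            return self._value_to_token[value]
        token = "tok_" + secrets.token_hex(8)  # random, not derived from the value
        self._token_to_value[token] = value
        self._value_to_token[value] = token
        return token

    def detokenize(self, token):
        return self._token_to_value[token]


vault = TokenVault()
record = {"customer": "Jane Doe", "card_number": "4111111111111111", "amount": 42.0}

# Only the tokenized copy would be written to the data lake.
safe_record = dict(record, card_number=vault.tokenize(record["card_number"]))
```

Because the token is random rather than computed from the card number, analysts can still join and count on the tokenized field without being able to recover the original value from the lake alone.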
Vendors Involved with Data Lakes
The vendor market is tricky to define as it crosses Hadoop distribution, value-added software for data lakes, software developers, and SaaS providers. The majority of vendors work with the Hadoop platform but not all of them; some providers offer their own data lakes to customers.
The Apache Software Foundation develops open source Apache Hadoop and its native toolset. Distributors may include analytics and BI tools in their deployments, and will frequently partner with other technology providers such as Cloudera and Pivotal for EMC Isilon. Top Hadoop distributors include MapR, Cloudera, Pivotal, Hortonworks, and IBM.
ThingWorx is a type of data lake that combines a graphical database structure and tagging to preserve as much metadata as possible. Analytics uses the tags and data relationships among logical collections. Users can drill as far down as the device level. Splunk Analytics is a search engine developed for machine data. It works with several different applications including Hadoop and NoSQL data lakes. Splunk extracts real-time intelligence from the huge mass of information generated by machines: social media, sensors, applications, website data, and more.
Zaloni Bedrock is based on the Hadoop platform and is distribution agnostic. Bedrock organizes Hadoop for extreme performance including high-speed ingestion, workflow management, metadata management and more. Pivotal is a joint venture spun out from EMC and VMware and has its hand in many cloud computing / big data pies. Specific to big data repositories and analytics, its product suite includes Hadoop distribution platform PivotalHD, and Pivotal Big Data Suite of analytics tools.
Teradata acquired Revelytix in June 2014. Revelytix Loom cuts complexity in Hadoop environments by automatically discovering datasets, generating their metadata, and tracking operation workflows. Pentaho apparently invented the “data lake” phrase back in 2010. The vendor operates on the Hadoop platform to perform fast queries and extraction.
DataRPM concentrates on data discovery within its own data lake. It treats Hadoop as a data source but automatically creates its own data lakes on commodity hardware. Additional data sources include RDBMS, some file types, and CRM applications. Users issue queries in natural language, and security from distributed binary index files is a competitive differentiator.
EMC maintains a strong commitment to Hadoop data lakes. (“EMC Data Lakes” is the name of the storage vendor’s big data strategy.) In Oct. 2014 at Hadoop World, EMC announced a collaboration with Cloudera to integrate EMC Isilon Scale-Out NAS with Cloudera’s Enterprise Data Hub. EMC also updated an existing bundled offering partnership with Pivotal.
Data Lakes and Diverse Sources
Remember that a data lake is not simply a big virtualized storage repository: it is a distinct data repository structure that stores data in near-native format and supports queries across different data sources and formats. And since data lakes take advantage of commodity hardware, they can be an inexpensive way to store corporate data that you think you would like to analyze but don’t need to analyze immediately. If it turns out you never need to, the storage will not have been a costly waste of resources.
Data lakes are not appropriate for all corporate storage; the claim that they are is one of the silliest I have heard in a long time. Nor will they replace high-value data warehouses any time soon.
What data lakes do well, and very well, is improve business analytics and BI by integrating results from multiple knowledge sources. This is a valuable offering; it doesn’t need to mop your kitchen and make a moon shot at the same time.