Most large enterprise IT organizations have invested in building data lakes to serve as a repository for their most critical data and leverage that data to enable business objectives and competitive advantage. Naturally, most of these initiatives will be cloud-centric as organizations adjust to a new normal brought on by the COVID-19 pandemic.
Too often, though, data lakes are constructed without proper data context or governance controls, and without the ability to evolve as quickly as the business generates and consumes data. Building and maintaining a data lake that delivers digital insight requires strategic planning, advanced technical skills and knowledge, and proper digital maintenance.
To help ensure you are building a data lake and not a data swamp, here are seven key insights for maximizing its value:
Data Lineage: The Foundation
Data, much like anything of value, has provenance. Data lineage is the foundation on which every automated business process depends. Lineage is key when it comes to successfully making informed decisions based on data. Knowing where, when, how and by whom any data was created is just as critical as what that data represents.
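As a rough illustration of what capturing lineage can look like, the sketch below records the where, when, how, and by whom for a single data set. The class name, field names, and sample values are hypothetical and not tied to any particular lineage tool.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class LineageRecord:
    """Hypothetical provenance record attached to one data set."""
    dataset: str          # what the data represents
    source_system: str    # where it was created
    created_at: datetime  # when it was created
    method: str           # how it was created (e.g. batch export, API call)
    created_by: str       # who (person or service) created it


def describe(record: LineageRecord) -> str:
    """Render the record as a one-line provenance statement."""
    return (
        f"{record.dataset}: created by {record.created_by} "
        f"via {record.method} from {record.source_system} "
        f"on {record.created_at.date().isoformat()}"
    )


# Sample values are invented for illustration only.
orders = LineageRecord(
    dataset="orders_2021",
    source_system="erp",
    created_at=datetime(2021, 3, 1, tzinfo=timezone.utc),
    method="nightly batch export",
    created_by="etl-service",
)
print(describe(orders))
```

Making the record immutable (frozen) reflects the idea that provenance, once captured, should not be silently rewritten.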
Metadata: Enabling the Dewey Decimal System for Your Data
One of the earliest forms of metadata was the Dewey Decimal System, around which librarians organized the card catalogs used to find books. Metadata is your way of describing and categorizing data.
Types of metadata in use today include descriptive, structural, administrative, reference and statistical metadata. Without metadata, organizations wind up with a massive amount of data that has little to no value for the business because nobody can find it.
Data Catalog: Unlocking the Power for Your Clients
The most effective way to make business data stored in a lake accessible to end users is by creating a data catalog based on the metadata an organization defines. If a data lake is the repository where data is normalized, then the data catalog is the equivalent of the maps and charts required to navigate it.
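A minimal sketch of the idea, assuming a catalog is simply a searchable index over descriptive metadata: the CatalogEntry and DataCatalog names, the tags, and the storage locations below are all hypothetical, and a real catalog product would offer far more.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """Hypothetical metadata describing one data set in the lake."""
    name: str
    location: str
    description: str
    tags: set[str] = field(default_factory=set)


class DataCatalog:
    """Toy in-memory catalog: register entries, then search their metadata."""

    def __init__(self) -> None:
        self._entries: dict[str, CatalogEntry] = {}

    def register(self, entry: CatalogEntry) -> None:
        self._entries[entry.name] = entry

    def search(self, term: str) -> list[str]:
        """Return names of entries whose description or tags match the term."""
        term = term.lower()
        return sorted(
            e.name
            for e in self._entries.values()
            if term in e.description.lower()
            or term in {t.lower() for t in e.tags}
        )


catalog = DataCatalog()
# Locations are made-up placeholders, not real buckets.
catalog.register(CatalogEntry(
    "orders_2021", "s3://example-lake/orders/2021/",
    "Customer orders exported from the ERP", {"sales", "finance"},
))
catalog.register(CatalogEntry(
    "web_clicks", "s3://example-lake/clicks/",
    "Raw clickstream events", {"marketing"},
))
print(catalog.search("sales"))
```

The point of the sketch is that end users search by business terms (tags and descriptions) and never need to know the storage layout.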
Data Mapping: The Key to Successful Construction
A data catalog is only as useful as the quality of the data mapping techniques that were employed to construct it. Data mapping establishes the relationships between the data sets that are described in a data catalog. Once data sets are mapped, it becomes possible to create the data model on which the data catalog is layered. Unless all the relationships between data sets are mapped, the data catalog essentially becomes useless.
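One way to picture data mapping is as a list of explicit relationships between columns in different data sets, plus a check for data sets that no relationship touches. The sketch below assumes that simplified model; the Mapping type, the unmapped_datasets helper, and the sample data set names are hypothetical.

```python
from typing import Iterable, List, NamedTuple


class Mapping(NamedTuple):
    """Hypothetical relationship: a column in one data set joins to another."""
    left_dataset: str
    left_column: str
    right_dataset: str
    right_column: str


def unmapped_datasets(datasets: Iterable[str],
                      mappings: Iterable[Mapping]) -> List[str]:
    """Return the data sets that appear in no mapping at all."""
    mapped = set()
    for m in mappings:
        mapped.add(m.left_dataset)
        mapped.add(m.right_dataset)
    return sorted(set(datasets) - mapped)


datasets = ["customers", "orders", "web_clicks"]
mappings = [Mapping("orders", "customer_id", "customers", "id")]

# Data sets listed here have no declared relationship to anything else.
print(unmapped_datasets(datasets, mappings))
```

A check like this could run whenever a new data set is added, so gaps in the mapping surface before they reach the catalog.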
Data Lake Maintenance: Tools and Process Matter
Data lakes do not magically maintain themselves. New types of data are being added all the time. Compliance regulations change. Data often needs to be moved from one location to another. The total cost of a data lake is deeply intertwined with how simple, over time, it is to manage.
The Security Factor
Security must be considered when designing a sound data lake architecture and the processes around it. Clearly defined mapping of data makes it possible to apply granular controls over who in the organization has permission to view and update data. At a time when data privacy and sovereignty have risen to the top of any CISO’s or compliance officer’s concerns, ensuring data security is critical.
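To make "granular controls" concrete, here is a small sketch that assumes permissions are granted per role and per data set, with view and update treated separately. The roles, data set names, and policy table are invented for illustration; production systems would rely on the access controls of their cloud platform or data lake tooling.

```python
from enum import Flag


class Permission(Flag):
    NONE = 0
    VIEW = 1
    UPDATE = 2


# Hypothetical policy table: (role, data set) -> granted permissions.
POLICY = {
    ("analyst", "orders"): Permission.VIEW,
    ("data_engineer", "orders"): Permission.VIEW | Permission.UPDATE,
}


def is_allowed(role: str, dataset: str, needed: Permission) -> bool:
    """True only if every requested permission was granted; default is deny."""
    granted = POLICY.get((role, dataset), Permission.NONE)
    return (granted & needed) == needed


print(is_allowed("analyst", "orders", Permission.VIEW))
print(is_allowed("analyst", "orders", Permission.UPDATE))
```

Defaulting to no access for any role and data set pair that is not listed keeps newly added data sets closed until someone deliberately opens them.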
The Need for an Executive Champion
Given the size and criticality of a data lake project, there really is no substitute for an executive-level project champion. Whether that champion is the CEO, the chief data officer or a line-of-business executive, data lakes are too important to the business to be left solely to an IT team to implement.
Data lakes, like any waterway, need to be maintained if they are to remain navigable. Data lakes, in the absence of ongoing maintenance, will inevitably become swamps, unusable and unhelpful to your organization.
To extend the value of your data lake even further, consider partnering with experienced data lake engineers who can help with the technical aspects and best practices of your initiatives while you focus on the business benefits.
About the author:
Rob Whelan, Data Engineering & Analytics Practice Manager, 2nd Watch.