Extract, Transform, Load (ETL) is a mature process that enables organizations to make sense out of disparate data, regardless of where the data resides or the format in which it has been stored. Over time, ETL has evolved with business requirements to support streaming data and unstructured data in NoSQL databases as well as structured data in relational databases.
Why is ETL so important? Data quality impacts the reliability of business decisions whether those decisions are based on spreadsheets, business intelligence systems, analytics dashboards or machine learning.
How ETL Works
ETL involves three discrete processes: Extract, Transform, and Load. The workflow tends to be implemented as an ongoing process.
The process begins by extracting data from the source systems, which may include an ERP, a CRM, other enterprise systems, and third-party sources. These systems tend to use different data formats, which are normalized into a common format for further processing.
The transformation stage ensures the data is usable for its intended purpose. Transformation processes may include:
· Aggregation – summarizing the data
· Applying business rules – for example, ensuring that the data is fit for its intended purpose
· Cleansing – identifying and correcting errors
· Deduping – removing duplicate records
· Filtering – selecting and/or excluding data based on some criteria, such as residents of Palo Alto
· Mapping – determining the relationship between two pieces of data from different systems, such as J. Smith and John Smith
· Joining – combining data from two or more records
· Sorting – numerically, alphabetically, chronologically, reverse chronologically, by region, etc.
· Transposing – rotating rows into columns or vice versa
· Splitting – dividing one column into two columns or more
· Validating – ensuring the data has been cleansed and meets pre-defined data quality standards
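The transformation steps above can be sketched with a few lines of Python. The records and field names ("customer_id", "city", "amount") are hypothetical, and only a handful of the steps (deduping, filtering, cleansing, aggregation) are shown:

```python
records = [
    {"customer_id": 1, "city": "Palo Alto", "amount": "100.50"},
    {"customer_id": 1, "city": "Palo Alto", "amount": "100.50"},  # duplicate
    {"customer_id": 2, "city": "San Jose",  "amount": "75.00"},
    {"customer_id": 3, "city": "Palo Alto", "amount": "200.00"},
]

# Deduping: remove exact duplicate records
seen, deduped = set(), []
for r in records:
    key = tuple(sorted(r.items()))
    if key not in seen:
        seen.add(key)
        deduped.append(r)

# Filtering: select records based on some criteria (residents of Palo Alto)
filtered = [r for r in deduped if r["city"] == "Palo Alto"]

# Cleansing: correct the data type of the amount field
for r in filtered:
    r["amount"] = float(r["amount"])

# Aggregation: summarize the filtered data
total = sum(r["amount"] for r in filtered)
print(len(filtered), total)  # 2 300.5
```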
During the Load phase, the data is loaded into its target destination which could be a data warehouse, a cloud environment, or a delimited flat file.
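As one illustration of the Load phase, transformed rows can be written to a delimited flat file using Python's standard csv module. The rows, column names, and output filename here are assumptions for the sketch:

```python
import csv

# Hypothetical transformed rows, ready to be loaded
rows = [
    {"customer_id": 1, "city": "Palo Alto", "amount": 100.5},
    {"customer_id": 3, "city": "Palo Alto", "amount": 200.0},
]

# Load into a pipe-delimited flat file, one possible target destination
with open("load_target.csv", "w", newline="") as f:
    writer = csv.DictWriter(
        f, fieldnames=["customer_id", "city", "amount"], delimiter="|"
    )
    writer.writeheader()
    writer.writerows(rows)
```

Loading into a data warehouse would replace the file write with bulk-insert calls against the target tables, but the shape of the step is the same.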
Temporary vs. Permanent Data
ETL systems can use temporary data or permanent data, depending on the use case. For example, temporary data (a subset of data) may be used for reporting or analytics. Permanent data might be migrated from one database to another or from one data warehouse to another.
The ETL cycle is processed in the following manner:
· Initiate cycle
· Build reference data
· Extract data from sources
· Audit reports
· Publish to target tables
· Clean up
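The cycle steps above can be sketched as a simple ordered pipeline. Each stage function below is a hypothetical placeholder standing in for real work:

```python
def build_reference_data():
    return {"regions": {"CA": "California"}}

def extract(ref):
    return [{"id": 1, "region": "CA"}]

def audit(rows):
    # Audit: verify the extracted rows before publishing
    assert all("id" in r for r in rows)

def publish(rows):
    return len(rows)  # stand-in for writing to target tables

def clean_up():
    pass  # drop staging artifacts

def run_cycle():
    ref = build_reference_data()   # build reference data
    rows = extract(ref)            # extract data from sources
    audit(rows)                    # audit
    published = publish(rows)      # publish to target tables
    clean_up()                     # clean up
    return published

print(run_cycle())  # 1
```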
Data migration is the movement of data from one computer, database, application or storage device to another. It is commonly done when organizations modernize their IT capabilities, replace one vendor's offerings with another vendor's offerings, or consolidate technology ecosystems, such as during mergers and acquisitions. In a consolidation scenario, data mapping is necessary since the existing and new systems tend to use different data formats and organize data differently.
Although data migration tends to be automated to accelerate the process and reduce the likelihood of manual errors, it still has to be handled carefully to avoid unnecessary business disruption. Before executing a migration, any technical constraints should be understood and dealt with to avoid unnecessary delays or failures. Meanwhile, end users should be given ample advance notice, and the migration should be scheduled for a time when it will have the least impact on the business (e.g., in the middle of the night or over the weekend).
Data Flow and Control Flow
Data flows are the movement of data from one place to another. In the ETL process, data flows from the source systems through transformations to its intended destination.
Control flows ensure that processing tasks are completed in the right order. The tasks are prioritized in a linear fashion, which requires one process to complete before another initiates.
Although control flows execute data flows as tasks, a task within a data flow (a data flow task) can trigger another task before it has completed, which means that data flow tasks can execute in parallel.
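The distinction can be sketched in Python: a sequential control flow step runs to completion first, then independent data flow tasks run concurrently. The task names and return values are illustrative assumptions:

```python
from concurrent.futures import ThreadPoolExecutor

def extract_orders():
    return [1, 2, 3]

def extract_customers():
    return ["a", "b"]

def control_flow():
    # Control flow step: must finish before any data flow task starts
    status = "stage tables ready"

    # Data flow tasks: independent extracts run in parallel
    with ThreadPoolExecutor() as pool:
        orders = pool.submit(extract_orders)
        customers = pool.submit(extract_customers)
        return status, orders.result(), customers.result()

status, orders, customers = control_flow()
print(status, orders, customers)
```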
Data flow mapping tools have become important as a result of the EU's General Data Protection Regulation (GDPR) because organizations need to explain what personally identifiable information (PII) they are processing and how they are processing it.
The scalability of data stores has become important given the rapid growth of data, so it's not surprising that ETL software now takes advantage of parallel processing as well, increasing speed and scalability in light of growing data volumes. Parallel processing enables:
· Parallel data access
· Parallel task execution
· Parallel process execution
Data warehousing procedures take failures into account so ETL processes can be rerun and recover from run errors. Rather than running one huge ETL process, the process is decomposed into smaller parts that run in sequence or in parallel. Rows and the elements of a process are tagged for each data flow, so if a failure occurs, the failure point is known. If a failure occurs, then the ETL process is rolled back to the point of failure so the process can be rerun from that point.
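A minimal sketch of that recovery pattern: each small step records a checkpoint on success, so a rerun skips completed steps and resumes from the failure point. The step names and the in-memory checkpoint set are assumptions; a real system would persist checkpoints to a table or file:

```python
completed = set()  # checkpoint store; persisted in a real system

def run_step(name, fn):
    if name in completed:   # finished in a previous run: skip on rerun
        return
    fn()                    # may raise; checkpoint only on success
    completed.add(name)

def extract():
    pass

def load():
    pass

attempts = {"transform": 0}
def transform():
    attempts["transform"] += 1
    if attempts["transform"] == 1:
        raise RuntimeError("transform failed")  # simulate a run error

steps = [("extract", extract), ("transform", transform), ("load", load)]

def run_pipeline():
    for name, fn in steps:
        run_step(name, fn)

try:
    run_pipeline()          # first run fails in "transform"
except RuntimeError:
    pass

run_pipeline()              # rerun resumes from the failure point
print(sorted(completed))    # ['extract', 'load', 'transform']
```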
The Importance of Keys
Relational databases use keys to establish and identify relationships between tables. Keys are also used to identify a record or row of data inside a table:
· A unique key is a column (or set of columns) whose values identify a particular entity
· Composite keys comprise several columns
· A foreign key is a column in one table that refers to the primary key of another table
· The primary key is a single attribute, or a combination of attributes, that uniquely identifies a row in a table
In a data warehouse, keys are important because data is loaded into the data warehouse from different data sources. The different data sources may use different attributes as the primary key. If they do, the data warehouse may require them to be consolidated into a single dimension that includes all of the attributes represented by the various primary keys.
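The key types described above can be sketched with SQLite via Python's standard sqlite3 module. The table and column names are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,  -- primary key
        email TEXT UNIQUE                 -- unique key
    );
    CREATE TABLE orders (
        order_id INTEGER,
        line_no  INTEGER,
        -- foreign key referring to the customers table's primary key
        customer_id INTEGER REFERENCES customers(customer_id),
        PRIMARY KEY (order_id, line_no)   -- composite key
    );
""")
conn.execute("INSERT INTO customers VALUES (1, 'j.smith@example.com')")
conn.execute("INSERT INTO orders VALUES (100, 1, 1)")
print(conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0])  # 1
```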
Virtual ETL takes advantage of virtual data to overcome some of the challenges of data migration and application integration when dealing with multiple data sources. The objects or entities collected from various data sources are abstracted for the purpose of creating an in-memory or persistent metadata repository. A persistent metadata repository enables ETL tools to persist as middleware so data harmonization and profiling can be performed continuously, in near-real time.
ETL Versus ELT
ETL and ELT involve the same processes, albeit in a different order. ETL is the traditional order of processes that has been used since before data warehouses existed. Big data analysis has fueled the popularity of ELT, as has data diversity.
ELT loads the extracted data into a data warehouse, where the transformation takes place. Unlike ETL, ELT does not use a separate transformation engine; instead, it uses the processing power of the data store, which simplifies the architecture and accelerates the transformation process.
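The ELT pattern can be sketched with SQLite standing in for the data store: raw extracted rows are loaded first, and the transformation then runs inside the store as SQL. The table and column names are hypothetical:

```python
import sqlite3

db = sqlite3.connect(":memory:")

# Load: raw extracted rows go straight into the store, untransformed
db.execute("CREATE TABLE raw_sales (region TEXT, amount REAL)")
db.executemany("INSERT INTO raw_sales VALUES (?, ?)",
               [("west", 100.0), ("west", 50.0), ("east", 75.0)])

# Transform: the store's own engine does the aggregation after loading
db.execute("""
    CREATE TABLE sales_by_region AS
    SELECT region, SUM(amount) AS total
    FROM raw_sales
    GROUP BY region
""")

result = db.execute(
    "SELECT region, total FROM sales_by_region ORDER BY region"
).fetchall()
print(result)  # [('east', 75.0), ('west', 150.0)]
```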
ETL involves a lot of different technologies and many types of data. Moreover, the challenges are not just data and tool related. Following are some of the challenges one may encounter:
· Business requirements are not reflected in the business rules
· Tools do not meet, or no longer meet, the needs of the organization
· Data sources may become unavailable
· Transformation is more difficult and time-consuming than expected
· Test data is not available
· Data might be lost
· Requirements and budgets are misaligned
· Customer requirements are not well-understood
· User expectations are unmet
· The focus is on technology rather than business outcomes
· ETL processes are or have become brittle
· Insufficient focus on data governance
· Insufficient investment in master data management (MDM)
ETL tool types have expanded over the years. The original tools were installed on-premises and were capable of batch processing. Later, cloud-native tools became available, some of which still do batch processing while others offer real-time or near real-time capabilities. A number of ETL testing tools are also available that address the error-prone, time-consuming nature of SQL scripting and spreadsheets. They ensure:
· The correctness and validity of data
· The ETL process has executed properly
· The integrity of the data has been maintained throughout the ETL process
· The source data and data loaded into the data warehouse match
· Business rules have been followed
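A basic version of such a test can be sketched in a few lines: verify that the rows loaded into the target match the source rows in both count and content. The row data and function name here are illustrative assumptions:

```python
# Hypothetical source rows and the rows read back from the target
source_rows = [(1, "J. Smith"), (2, "A. Jones")]
target_rows = [(2, "A. Jones"), (1, "J. Smith")]

def check_load(source, target):
    # Row counts must match between source and target
    assert len(source) == len(target), "row counts differ"
    # Content must match regardless of load order
    assert sorted(source) == sorted(target), "data integrity violated"
    return True

print(check_load(source_rows, target_rows))  # True
```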
Real-time tools are enabling faster access to usable data, reflecting the increasingly real-time nature of data-driven businesses. Meanwhile, open source equivalents of all of the above have become increasingly popular, mirroring the greater enterprise adoption of open source tools generally.