ETL (extract-transform-load) is a common method organizations of all sizes use to move and blend data from multiple sources into a common repository. For decades, companies have used ETL to consolidate data from transactional databases into a target data store used for analytics and reporting.
In the last few years, ETL has undergone major changes, evolving to encompass new data sources and accommodate new use cases. No longer is ETL limited to batch processing of homogeneous datasets in on-premises environments. The evolution of ETL has been fast, with no signs of slowing down any time soon. Some of the recent changes are described below.
Data lakes
As organizations look to extract more value from their business data, they have made significant investments in building out data lakes. With social media, IoT, and other Big Data drivers taking center stage, ETL has become essential for consolidating transactional data in Hadoop (or equivalent) environments and transforming it for data warehouses that operate at massive scale.
Streaming data
Business applications and connected devices produce continuous flows of data. ETL has had to evolve from batch processing of predictable, well-structured data to handling semi-structured and nested data objects, such as JSON and XML, in micro-batches and in real time.
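The shift from flat, tabular inputs to nested objects can be illustrated with a minimal sketch. The `flatten` helper and the event shape below are illustrative, not taken from any particular ETL tool:

```python
import json

def flatten(record, parent_key="", sep="_"):
    """Recursively flatten a nested dict into a single-level dict,
    joining nested keys with `sep` (e.g. device.loc.lat -> device_loc_lat)."""
    items = {}
    for key, value in record.items():
        new_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            items.update(flatten(value, new_key, sep=sep))
        else:
            items[new_key] = value
    return items

# A nested JSON event, as a connected device might emit it.
event = json.loads('{"id": 1, "device": {"type": "sensor", "loc": {"lat": 40.7, "lon": -74.0}}}')
print(flatten(event))
# {'id': 1, 'device_type': 'sensor', 'device_loc_lat': 40.7, 'device_loc_lon': -74.0}
```

In a micro-batch pipeline, a step like this would run over each small batch of events before loading the flattened rows into a columnar target.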
Cloud storage and computing
Looking to streamline operations, reduce costs, and efficiently handle spikes in data volume, companies have augmented, or altogether replaced, their legacy on-premises data platforms with cloud data warehouses and applications. ETL has been instrumental in enabling cloud migration, moving on-premises data into public and private clouds.
ELT
Historically, ETL ran as an independent process: data was extracted from transactional systems into ETL servers that performed the transformations and then loaded the data into target systems. A newer variation is ELT (extract-load-transform), in which data is extracted from an operational system and immediately loaded into a target data warehouse that can perform the transformations itself, reducing the need for dedicated ETL infrastructure.
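A minimal sketch of the ELT pattern, using an in-memory SQLite database as a stand-in for a cloud warehouse (the table names and sample rows are invented for illustration):

```python
import sqlite3

# Stand-in "warehouse": sqlite3 plays the role of a cloud data warehouse
# that can run transformations itself.
warehouse = sqlite3.connect(":memory:")

# Extract: rows pulled from an operational system (hard-coded here).
orders = [(1, "2024-01-05", 120.0), (2, "2024-01-05", 80.0), (3, "2024-01-06", 50.0)]

# Load: land the raw data in the warehouse untransformed.
warehouse.execute("CREATE TABLE raw_orders (id INTEGER, order_date TEXT, amount REAL)")
warehouse.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", orders)

# Transform: the warehouse itself aggregates the raw rows into an analytics table,
# so no separate ETL server is needed.
warehouse.execute("""
    CREATE TABLE daily_revenue AS
    SELECT order_date, SUM(amount) AS revenue
    FROM raw_orders
    GROUP BY order_date
""")
print(warehouse.execute("SELECT * FROM daily_revenue ORDER BY order_date").fetchall())
# [('2024-01-05', 200.0), ('2024-01-06', 50.0)]
```

The key design point is that the transform step is expressed in SQL and executed by the target system, which is what lets ELT shed the dedicated transformation tier.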
Schema drift
New data sources introduce new data schemas, and semi-structured data has no fixed schema to begin with. ETL has evolved to support schema detection, as well as the ability to handle schema changes automatically to some extent.
AI and ML
Finally, ETL has expanded the list of use cases it supports. Where ETL was once used for database replication, BI, and analytics, today it also facilitates artificial intelligence and machine learning. Indeed, data preparation has become mission-critical for data scientists.
While we’re seeing the emergence of hybrid transactional-analytical processing solutions that negate the need to move some data (the same system handles both data capture and analysis), ETL is here to stay. We can expect, however, that ETL will continue to evolve at a rapid clip. Here’s what we can envision:
Automated data pipelines
As ETL continues to handle a growing number of data sources and use cases, it is unlikely that IT organizations will be able to manually manage all of the underlying data pipelines. Instead, organizations should manage transformation definitions declaratively, in easily digestible forms, and have their ETL solutions automatically convert these definitions into ETL jobs and pipelines.
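A rough sketch of what compiling a declarative definition into a runnable pipeline might look like. The step vocabulary (`filter`, `select`) and the `compile_pipeline` helper are invented for illustration, not drawn from any existing ETL product:

```python
# Hypothetical declarative definition: each step names an operation and its
# parameters; the ETL solution compiles this into a runnable pipeline.
definition = [
    {"op": "filter", "field": "status", "equals": "active"},
    {"op": "select", "fields": ["id", "amount"]},
]

def compile_pipeline(steps):
    """Turn declarative step definitions into a single callable pipeline."""
    def run(rows):
        for step in steps:
            if step["op"] == "filter":
                rows = [r for r in rows if r.get(step["field"]) == step["equals"]]
            elif step["op"] == "select":
                rows = [{f: r[f] for f in step["fields"]} for r in rows]
        return rows
    return run

pipeline = compile_pipeline(definition)
rows = [{"id": 1, "status": "active", "amount": 10},
        {"id": 2, "status": "closed", "amount": 5}]
print(pipeline(rows))  # [{'id': 1, 'amount': 10}]
```

Because the definition is plain data, it can be reviewed by non-engineers, versioned, and regenerated into jobs whenever the underlying engine changes.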
Collaborative data transformations
One of the biggest roadblocks in ETL is deciphering data requirements and iterating on the transformation logic until the business is happy with the results. Data engineers often do not understand the semantics of the data that the business wants to report on, and the business does not understand how the source data is modeled. A collaborative environment is needed, in which business analysts can initiate new data requests, define their needs, and progressively iterate with data engineers on the hierarchy of transformation steps, all the way down to the source data.
Ingestion of any kind
Future ETL workflows will have to handle a blend of data formats and types, including real-time ingestion, event streams from both databases and applications, and data extraction via APIs and publish/subscribe models.
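One way to picture such blended ingestion is a normalizer that maps every source format onto a common internal record shape. The source types, payload formats, and field names below are hypothetical:

```python
import json

def normalize(source_type, payload):
    """Reduce records from different ingestion paths to one internal shape."""
    if source_type == "batch_csv":
        # A line from a batch file extract.
        fields = payload.split(",")
        return {"id": int(fields[0]), "value": float(fields[1])}
    if source_type == "event_json":
        # A message from an event stream or pub/sub topic.
        event = json.loads(payload)
        return {"id": event["id"], "value": event["value"]}
    raise ValueError(f"unknown source type: {source_type}")

print(normalize("batch_csv", "1,3.5"))                     # {'id': 1, 'value': 3.5}
print(normalize("event_json", '{"id": 2, "value": 7.0}'))  # {'id': 2, 'value': 7.0}
```

Downstream transformation and loading then only ever see one record shape, regardless of how the data arrived.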
Finally, as ETL becomes a collaborative process between business and IT, it will progress from being handled by a single ETL team to a set of services that different development groups can run on their own, each handling the datasets they need for their own applications without waiting on a central process to establish new data pipelines.