Data is a weird commodity. It surrounds our companies, which have no shortage of data-producing machines, yet it's typically useless in its raw form. The good data, the data we need for our reports and our research, the really good stuff that's clean, self-explanatory and trustworthy, is hard to come by. Meaningful data that successfully describes our business events is the stuff of legends, not of mere mortals. Why does it have to be so difficult?
It doesn't have to be. The way we see it, there are at least three areas of data ETL (extract-transform-load) that are to blame for this predicament, three phases in the data creation process that must be rethought right away:
1. Requirements gathering
2. Transformation modeling
3. Code deployment
Let’s look at each area separately and pinpoint some root causes.
The effort to address a new data request starts with requirements gathering. Typically, a data engineer will interview the business team to understand what data they are looking for. Then a back-and-forth process between the business and the development team will ensue to clarify the request. Information will be conveyed verbally, discussed in emails, and tracked in spreadsheets and in project management tools.
The reason requirements gathering takes so long (apart from the plethora of tools used) is communication breakdown. The two parties – business and IT – don’t speak the same data language. The business team cares about high-level business concepts. Things like transactions and customers and revenue. These concepts are typically difficult to articulate, let alone to calculate, as their semantics are not universal, but rather highly customized to the organization. IT isn’t familiar with the business semantics.
The business team, on the other hand, is not familiar with the source data, where it is stored and how it’s modeled.
Hence the communication breakdown: each team understands their domain but not the other’s. A lot of time is wasted, therefore, on attempting to bridge the data knowledge gap.
Transforming raw data into high-level objects and events that the business team can understand and use is difficult. Depending on the structure of the source data and the business use cases, many transformation steps may be required. Here’s a quick sample:
- Column manipulations: The source data may require substantial reorganization across numerous columns. Data might need to be aggregated, disaggregated, transposed or pivoted. A single column might need to be split into several distinct columns, or several columns merged into one.
- Value manipulations: Some data points may be ready for use, while others may be encoded for privacy or brevity purposes and must be decoded first. Some use cases may require the creation of new calculated metrics that derive their values from other metrics, or the generation of surrogate key values. Datasets might accidentally or unnecessarily repeat themselves, requiring de-duplication.
- Finding Common Keys: Some use cases, such as customer analytics, may require laboriously joining disparate datasets. A unique and trustworthy key by which to stitch the datasets together may not always be available.
- Lack of Data Lineage: Data may undergo several transformations over time in support of different business use cases. Data pros may not have adequate data lineage to understand the changes that may have occurred in the data, which can result in erroneous transformations.
- Validation and Quality: Datasets may include unexpected columns and values that were not taken into consideration during the design phase. Additional validation steps must be therefore incorporated to ensure adequate data quality and reliability.
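A few of the steps above can be sketched in pandas. This is a minimal illustration, not a production pipeline; the dataset, the column names, the `id|country` encoding and the 10% fee rule are all hypothetical:

```python
import pandas as pd

# Hypothetical raw "transactions" extract, with an accidental duplicate row
# and a customer field encoded as "id|country" for brevity.
raw = pd.DataFrame({
    "txn_id":   ["t1", "t2", "t2", "t3"],
    "customer": ["c1|US", "c2|DE", "c2|DE", "c1|US"],
    "amount":   [100.0, 250.0, 250.0, 40.0],
})

# Value manipulation: decode the packed customer field into two columns.
raw[["customer_id", "country"]] = raw["customer"].str.split("|", expand=True)

# De-duplication: drop the accidentally repeated transaction.
txns = raw.drop_duplicates(subset="txn_id").drop(columns="customer")

# Calculated metric: amount net of a flat 10% fee (an invented business rule).
txns["net_amount"] = txns["amount"] * 0.9

# Common key: join to a reference dataset on the decoded customer_id.
customers = pd.DataFrame({
    "customer_id": ["c1", "c2"],
    "segment":     ["retail", "enterprise"],
})
enriched = txns.merge(customers, on="customer_id", how="left")

# Validation: fail fast on unexpected values instead of loading bad data.
assert enriched["amount"].ge(0).all(), "negative amounts in source"
assert enriched["segment"].notna().all(), "transaction with unknown customer"

# Column manipulation: aggregate net revenue per segment for the report.
revenue = enriched.groupby("segment")["net_amount"].sum()
```

Even this toy version shows why the real thing is hard: every decoding rule, derived metric and join key encodes business knowledge that has to be discovered, agreed upon and maintained.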
Procedural ETL operations are arduous. As data engineers address new data requests, they must build and expand data pipelines and ETL code. It's common to see ETL code deployed across several distinct environments, which makes maintenance especially difficult.
Making enhancements to existing pipelines to address new data requests is challenging, because changing a tangle of spaghetti code may result in serious service degradation.
As a result, data engineers may be forced to push back against new requests that may introduce a higher level of risk into existing pipelines.
Given all of these challenges, it is easy to understand why business teams often lack the high-level data they need: existing ETL operations are based on methodologies that are thirty or forty years old. Procedures that were designed to handle data that was batched, local and predictable are still used today to handle massive datasets with unpredictable schemas, often in real time.
A new approach to ETL is needed and, luckily, available. Talk to us at Lore IO to find out more.