For years, enterprise executives and their legions of data pros have been heeding the call to stand up enterprise data lakes as the cure for all reporting ills. They’ve been shoring up their data by amassing files from transactional systems, databases of all kinds, feeds from business applications and third party data sources — all into vast data lakes. They’ve been on a mission, and they’ve been relentless.
But now, taking a moment to catch their breath and survey the reams of stuff they piled together, these data pros find themselves lost, stuck in uncharted territories that some folks call data swamps. The lakes have turned into dumping grounds of disparate datasets of all shapes and sizes, with different schemas (or no schemas), that no one can trust, let alone use.
The picture of beautifully organized data that everyone intuitively understands and readily consumes has turned out to be a mirage or, in some cases, dangerous quicksand. Enter at your own risk.
Why did it happen?
Primarily because everyone and their cousin had unmitigated access to the staging areas and was free to dump their data at will. Sure, when the data lake project was launched, the company put someone in charge to oversee and supervise, but he has since left the company and nobody has volunteered to take his place. So, yes, lack of oversight and coordination is the reason why.
And the problem only got worse over time when more data sources were tapped. First party, second party, third party -- it didn’t matter: if it looked like data, then data pros wanted to store it, even before they knew what to do with it. Data became the new oil, and everybody was rolling their barrels willy-nilly to the lake.
So how does one get out of this sticky mess? Conventional thinking would have you believe that one has but two choices: either abandon the swamp altogether and start anew, or bite the bullet, spend however much time it takes, and rebuild (drain?) the swamp. Neither option is appetizing, and rightfully so: both are expensive, laborious, and demoralizing, with prolonged time-to-value.
Keep your swamp
Yup, there is a third option. It may sound crazy, but often your best bet is to leave things the way they are. Let data be data, in whatever state it is, and apply your organizing effort someplace else: in Lore IO.
You start things off by selecting a simple business process with a few events: a low-hanging fruit with good ROI. You identify the sources and have Lore IO scan the physical files in the data swamp. Lore IO parses them and pulls their metadata into a searchable catalog. Each data source is represented in Lore IO as a relational view; even nested objects like JSON and XML get flattened.
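To make the flattening step concrete, here is a minimal sketch of how nested JSON can be projected into flat, dotted column names suitable for a relational view. This is an illustration of the general technique, not Lore IO's actual implementation; the `flatten` function and its dotted-name convention are assumptions for the example.

```python
import json

def flatten(record, parent_key="", sep="."):
    """Recursively flatten a nested dict into dotted column names."""
    flat = {}
    for key, value in record.items():
        col = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested objects, carrying the path prefix.
            flat.update(flatten(value, col, sep))
        else:
            flat[col] = value
    return flat

row = flatten(json.loads('{"order": {"id": 7, "customer": {"name": "Acme"}}, "total": 19.5}'))
# yields flat columns: order.id, order.customer.name, total
```

The same idea extends to XML: walk the tree, and turn each leaf's path into a column name.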
Lore IO looks for table relationships in DDL statement outputs, uses an innovative recommendation engine to suggest them, or has stakeholders fill in the gaps.
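The DDL-scanning idea is straightforward to picture: foreign-key constraints in `CREATE TABLE` statements already encode table relationships. The sketch below extracts them with a regular expression; it is a simplified illustration of the concept, not Lore IO's parser, and the table names are made up.

```python
import re

DDL = """
CREATE TABLE orders (
    id INT PRIMARY KEY,
    customer_id INT,
    FOREIGN KEY (customer_id) REFERENCES customers (id)
);
"""

# Matches: FOREIGN KEY (local_col) REFERENCES table (remote_col)
FK_PATTERN = re.compile(
    r"FOREIGN\s+KEY\s*\((\w+)\)\s*REFERENCES\s+(\w+)\s*\((\w+)\)",
    re.IGNORECASE,
)

def find_relationships(ddl):
    """Return (local_column, referenced_table, referenced_column) tuples."""
    return FK_PATTERN.findall(ddl)

# find_relationships(DDL) -> [('customer_id', 'customers', 'id')]
```

Where the DDL carries no constraints, a recommendation engine can fall back on heuristics such as matching column names and value overlaps, with people confirming or correcting the guesses.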
While this is going on, you invite your business team to define their target views inside Lore IO. They use a declarative language to express their desired transformations without having to know anything about the source data. Just have them document their needs: if they don't know which table or column to use, have them simply invent a name for it.
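A declarative view definition might look something like the sketch below: the team names the output columns and the logic they want, and any source column they cannot identify yet gets a placeholder name to be mapped later. This spec format and all the field names are hypothetical, not Lore IO's actual syntax.

```python
# A hypothetical declarative spec (NOT Lore IO's actual syntax).
target_view = {
    "name": "monthly_revenue",
    "columns": {
        "month":   {"expr": "date_trunc('month', order_date)"},
        "revenue": {"expr": "sum(order_total)"},
        "region":  {"expr": "sales_region"},  # placeholder name, source column TBD
    },
    "group_by": ["month", "region"],
}
```

The point of the exercise is that the business team records *what* they want first; binding each placeholder to a real source column happens afterwards, once the catalog is loaded.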
Once the metadata is loaded into the catalog, you and your business team can use it to map the source tables and columns into the desired transformations. You can test and validate incrementally, fixing errors and problems as you encounter them.
Once the modeling is done, your business team is ready to generate views. Lore IO transforms the declarations into ETL code that runs on your source data, and the transformed views can be materialized into your data warehouse.
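To illustrate the final declarations-to-ETL step, here is a self-contained sketch that compiles a small declarative spec into a `CREATE VIEW` statement. The spec layout, the `compile_view` helper, and the `orders` source table are all assumptions made for the example; Lore IO's actual code generator is its own.

```python
# Hypothetical declarative spec and compiler (not Lore IO's actual code).
spec = {
    "name": "monthly_revenue",
    "columns": {
        "month": "date_trunc('month', order_date)",
        "revenue": "sum(order_total)",
    },
    "group_by": ["month"],
}

def compile_view(spec, source_table):
    """Compile a declarative spec into a CREATE VIEW statement."""
    cols = ",\n  ".join(
        f"{expr} AS {name}" for name, expr in spec["columns"].items()
    )
    sql = f"CREATE VIEW {spec['name']} AS\nSELECT\n  {cols}\nFROM {source_table}"
    if spec.get("group_by"):
        sql += "\nGROUP BY " + ", ".join(spec["group_by"])
    return sql

sql = compile_view(spec, "orders")
print(sql)
```

Materializing the result is then a matter of running the generated SQL (or `CREATE TABLE ... AS SELECT ...`) against the warehouse on a schedule.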