Data Vault | Automation in Data Ingestion

By Zach Showalter

In a previous blog post on data vault, we discussed how it could be tailored for rapid iteration by abstracting the creation of vault objects as code and using automation to execute that code when new information arrives in the vault. Before any of those features can be leveraged, however, there needs to be a robust system for handling the raw data: performing any necessary preprocessing of files and staging them to be loaded into the vault. In keeping with this theme of flexibility and simplicity, UDig’s data team has been using automation to streamline data ingestion, further reducing the overhead of a data vault implementation.

Ingestion Pipeline 

Ingesting flat-file data requires a storage environment to stage the data and a means to execute and control the workflow of transformations performed before the data is ready to insert into the vault. With AWS S3 for file storage and AWS Lambda functions for the preprocessing code, an event-driven architecture can be built in which S3 events kick off a series of Lambda functions whenever a file is uploaded to a “Source” S3 bucket. These functions perform file type validation, filename cleanup, timestamping, and chunking of large files before landing them in a staging bucket to be loaded into the vault. The result is quick, automated, serverless file preprocessing that leaves less infrastructure for the development team to maintain.

[Figure: Data Ingestion Pipeline]
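
To make the flow above concrete, below is a minimal sketch of one such preprocessing function, written with boto3 against an S3 put event. The bucket name, allowed file types, and handler name are placeholder assumptions; the real pipeline’s validation, cleanup, and chunking logic is more involved.

```python
# Hypothetical S3-triggered Lambda: validates the file type, cleans and
# timestamps the filename, then copies the object into a staging bucket.
# Bucket name and the allowed-extension list are placeholders.
import datetime
import os
import urllib.parse

import boto3

s3 = boto3.client("s3")

ALLOWED_EXTENSIONS = {".csv", ".txt", ".json"}  # assumed accepted file types
STAGING_BUCKET = "example-staging-bucket"       # assumed staging bucket name


def handler(event, context):
    # S3 put events carry the source bucket and object key in Records.
    record = event["Records"][0]["s3"]
    source_bucket = record["bucket"]["name"]
    key = urllib.parse.unquote_plus(record["object"]["key"])

    name, ext = os.path.splitext(os.path.basename(key))
    if ext.lower() not in ALLOWED_EXTENSIONS:
        # File type validation: skip unsupported formats.
        return {"status": "skipped", "reason": f"unsupported file type {ext}"}

    # Filename cleanup and timestamping before landing in staging.
    clean_name = "".join(c if c.isalnum() or c in "-_" else "_" for c in name)
    timestamp = datetime.datetime.now(datetime.timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    staged_key = f"{clean_name}_{timestamp}{ext.lower()}"

    s3.copy_object(
        Bucket=STAGING_BUCKET,
        Key=staged_key,
        CopySource={"Bucket": source_bucket, "Key": key},
    )
    return {"status": "staged", "key": staged_key}
```

Chunking of large files would follow the same pattern: read the staged object, split it, and write the pieces back to the staging bucket for downstream loading.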

Landing Table Creation 

For data to be piped into the data vault, there needs to be an existing landing table for the pipe to target. During our exploration of the data vault, our team found this to be a potential blocker when onboarding a new data source: a database administrator (DBA) must manually create a table to land the data in, which requires knowing the new source’s column names and data types to construct the table’s schema. In short, not a very agile process. To remedy this, our team has been exploring and developing ways to handle landing table creation programmatically, up front, as part of file preprocessing, preserving the automation of the ingestion pipeline.

Although there are several tools available for profiling data, Capital One’s recently released open-source Data Profiler [1] caught the eye of our data team. Available as a Python library, Data Profiler can automatically generate a schema, data types, and statistics for many different data formats. It can also leverage machine learning libraries to identify PII/NPI data within a given file. For more information about using Data Profiler, see the documentation on its GitHub page [2].
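
As a rough illustration of how the library can be used (the file name below is a placeholder, and the exact report layout can vary between Data Profiler versions), profiling a file and reading back the inferred column types looks something like this:

```python
# Sketch of profiling a flat file with Capital One's Data Profiler.
# "example.csv" is a placeholder; the report layout may differ by version.
import dataprofiler as dp

data = dp.Data("example.csv")    # auto-detects CSV, JSON, Parquet, etc.
profile = dp.Profiler(data)      # profiles every column in the file
report = profile.report(report_options={"output_format": "compact"})

# Each entry in data_stats describes one column: its name, the inferred
# data type, and summary statistics.
for column in report["data_stats"]:
    print(column["column_name"], column["data_type"])
```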

By hosting an application that serves as a mediator between the ingestion pipeline and the data vault, our team integrated the table creation process into the existing ingestion workflow. Whenever a new file landed in flat-file storage, it triggered an event that pushed a message onto a queue polled by the landing table application. Upon consuming a message, the application pulled the corresponding file and ran it through Data Profiler to infer its schema and column data types. These were then used to generate and execute queries that created a landing table for the data, added rows to the relevant metadata tables, and ran stored procedures in the data vault environment to kick off ingestion into the vault.
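
The step from profile to landing table can be sketched as a small piece of type mapping and DDL generation. The type mapping, the "landing" schema name, and the report keys below are illustrative assumptions rather than the actual implementation:

```python
# Hypothetical translation of a Data Profiler report into a CREATE TABLE
# statement for a landing table. The type mapping, "landing" schema name,
# and report keys are illustrative assumptions.
TYPE_MAP = {
    "int": "BIGINT",
    "float": "DOUBLE PRECISION",
    "datetime": "TIMESTAMP",
    "string": "VARCHAR",
}


def build_landing_ddl(source_name: str, report: dict) -> str:
    """Build a CREATE TABLE statement from a profiler report's data_stats."""
    columns = []
    for col in report["data_stats"]:
        sql_type = TYPE_MAP.get(col["data_type"], "VARCHAR")
        columns.append(f'    "{col["column_name"]}" {sql_type}')
    return (
        f'CREATE TABLE landing."{source_name}" (\n'
        + ",\n".join(columns)
        + "\n);"
    )


# The generated DDL would then be executed against the vault environment,
# followed by the metadata inserts and stored procedure calls described above.
```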

 

Sources:

[1]: https://www.capitalone.com/tech/open-source/basics-of-data-profiler/ 

[2]: https://github.com/capitalone/DataProfiler

 

About The Author

Zach Showalter is a Consultant on the Data team.