Data Vault | Automation in Data Ingestion

By Zach Showalter

In a previous blog post on data vault, we discussed how it could be tailored for rapid iteration by abstracting the creation of vault objects as code and using automation to execute that code when new information arrives in the vault. Before any of those features can be leveraged, however, there needs to be a robust system for handling the raw data: performing any necessary preprocessing of files and staging them to be loaded into the vault. In keeping with this theme of flexibility and simplicity, UDig’s data team has been using automation to streamline data ingestion, further reducing the overhead of a data vault implementation.

Ingestion Pipeline 

Ingesting flat-file data requires a storage environment to stage the data and a means to execute and control the workflow of transformations performed before the data is ready to insert into the vault. With AWS S3 for file storage and AWS Lambda functions for the preprocessing code, an event-driven architecture can be built in which S3 events kick off a series of Lambda functions whenever a file is uploaded to a “Source” S3 bucket. These functions perform file type validation, filename cleanup, timestamping, and chunking of large files before landing them in a staging bucket to be loaded into the vault. The result is quick, automated, serverless file preprocessing that leaves less infrastructure for the development team to maintain.

[Figure: Data Ingestion Pipeline]
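
To make the flow above concrete, below is a minimal sketch of one such preprocessing function, written with boto3 against an S3 put event. The bucket name, allowed file types, and handler name are placeholder assumptions; the real pipeline’s validation, cleanup, and chunking logic is more involved.

```python
# Hypothetical S3-triggered Lambda: validates the file type, cleans and
# timestamps the filename, then copies the object into a staging bucket.
# Bucket name and the allowed-extension list are placeholders.
import datetime
import os
import urllib.parse

import boto3

s3 = boto3.client("s3")

ALLOWED_EXTENSIONS = {".csv", ".txt", ".json"}  # assumed accepted file types
STAGING_BUCKET = "example-staging-bucket"       # assumed staging bucket name


def handler(event, context):
    # S3 put events carry the source bucket and object key in Records.
    record = event["Records"][0]["s3"]
    source_bucket = record["bucket"]["name"]
    key = urllib.parse.unquote_plus(record["object"]["key"])

    name, ext = os.path.splitext(os.path.basename(key))
    if ext.lower() not in ALLOWED_EXTENSIONS:
        # File type validation: skip unsupported formats.
        return {"status": "skipped", "reason": f"unsupported file type {ext}"}

    # Filename cleanup and timestamping before landing in staging.
    clean_name = "".join(c if c.isalnum() or c in "-_" else "_" for c in name)
    timestamp = datetime.datetime.now(datetime.timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    staged_key = f"{clean_name}_{timestamp}{ext.lower()}"

    s3.copy_object(
        Bucket=STAGING_BUCKET,
        Key=staged_key,
        CopySource={"Bucket": source_bucket, "Key": key},
    )
    return {"status": "staged", "key": staged_key}
```

Chunking of large files would follow the same pattern: read the staged object, split it, and write the pieces back to the staging bucket for downstream loading.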

Landing Table Creation 

For data to be piped into the data vault, there needs to be an existing landing table for the pipe to target. During our exploration of the data vault, our team found this to be a potential blocker when onboarding a new data source: a database administrator (DBA) must manually create a table to land the data in, which requires knowing the new source’s column names and data types to construct the table’s schema. In short, not a very agile process. To remedy this, our team has been exploring and developing ways to handle landing table creation programmatically, up front, as part of file preprocessing, preserving the automation of the ingestion pipeline.

Although there are several tools available for profiling data, Capital One’s recently released open-source Data Profiler [1] caught the eye of our data team. Available as a Python library, Data Profiler can automatically generate a schema, data types, and statistics for many different data formats. It can also leverage machine learning libraries to identify PII/NPI data within a given file. For more information about using Data Profiler, see the documentation on its GitHub page [2].
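
As a rough illustration of how the library can be used (the file name below is a placeholder, and the exact report layout can vary between Data Profiler versions), profiling a file and reading back the inferred column types looks something like this:

```python
# Sketch of profiling a flat file with Capital One's Data Profiler.
# "example.csv" is a placeholder; the report layout may differ by version.
import dataprofiler as dp

data = dp.Data("example.csv")    # auto-detects CSV, JSON, Parquet, etc.
profile = dp.Profiler(data)      # profiles every column in the file
report = profile.report(report_options={"output_format": "compact"})

# Each entry in data_stats describes one column: its name, the inferred
# data type, and summary statistics.
for column in report["data_stats"]:
    print(column["column_name"], column["data_type"])
```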

By hosting an application that serves as a mediator between the ingestion pipeline and the data vault, our team integrated the table creation process into the existing ingestion workflow. Whenever a new file landed in flat-file storage, it triggered an event that pushed a message onto a queue polled by the landing table application. Upon consuming a message, the application pulled the corresponding file and ran it through Data Profiler to infer its schema and column data types. These were then used to generate and execute queries that created a landing table for the data, added rows to the relevant metadata tables, and ran stored procedures in the data vault environment to kick off ingestion into the vault.
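
The step from profile to landing table can be sketched as a small piece of type mapping and DDL generation. The type mapping, the "landing" schema name, and the report keys below are illustrative assumptions rather than the actual implementation:

```python
# Hypothetical translation of a Data Profiler report into a CREATE TABLE
# statement for a landing table. The type mapping, "landing" schema name,
# and report keys are illustrative assumptions.
TYPE_MAP = {
    "int": "BIGINT",
    "float": "DOUBLE PRECISION",
    "datetime": "TIMESTAMP",
    "string": "VARCHAR",
}


def build_landing_ddl(source_name: str, report: dict) -> str:
    """Build a CREATE TABLE statement from a profiler report's data_stats."""
    columns = []
    for col in report["data_stats"]:
        sql_type = TYPE_MAP.get(col["data_type"], "VARCHAR")
        columns.append(f'    "{col["column_name"]}" {sql_type}')
    return (
        f'CREATE TABLE landing."{source_name}" (\n'
        + ",\n".join(columns)
        + "\n);"
    )


# The generated DDL would then be executed against the vault environment,
# followed by the metadata inserts and stored procedure calls described above.
```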

 

Sources:

[1]: https://www.capitalone.com/tech/open-source/basics-of-data-profiler/ 

[2]: https://github.com/capitalone/DataProfiler

 

About The Author

Zach Showalter is a Consultant on the Data team.