Data Vault | Automation in Data Ingestion

In a previous blog post on data vault, we discussed how it can be tailored for rapid iteration by abstracting the creation of vault objects as code and using automation to execute that code when new information arrives in the vault. However, before any of these features can be leveraged, there needs to be a robust system for handling the raw data: performing any necessary preprocessing of files and staging them to be loaded into the vault. In keeping with this theme of flexibility and simplicity, UDig’s data team has been using automation to streamline data ingestion, further reducing the overhead of a data vault implementation.

Ingestion Pipeline

Ingesting flat-file data requires a storage environment to stage the data and a means to execute and control the workflow of transformations performed before the data is ready to insert into the vault. Using AWS S3 for file storage and AWS Lambda functions for the preprocessing code, an event-driven architecture can be built in which S3 events kick off a series of Lambda functions whenever a file is uploaded to a “Source” S3 bucket. These functions perform file type validation, filename cleanup, timestamping, and chunking of large files before landing them in a staging bucket to be loaded into the vault. The result is quick, automated, serverless file preprocessing with less infrastructure for the development team to maintain.

[Figure: Data Ingestion Pipeline]
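To make this concrete, here is a minimal sketch of what one of these preprocessing Lambda functions might look like. It is an illustration only: the staging bucket name, allowed extensions, and file naming convention are assumptions, and the chunking of large files is omitted for brevity.

```python
import urllib.parse
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")

# Assumed staging bucket name and accepted flat-file formats (illustrative only).
STAGING_BUCKET = "example-staging-bucket"
ALLOWED_EXTENSIONS = {"csv", "txt", "json"}


def lambda_handler(event, context):
    """Triggered by an S3 ObjectCreated event on the source bucket.

    Validates the file type, cleans up the filename, appends a timestamp,
    and lands the object in the staging bucket. Chunking of large files
    would also happen here but is left out to keep the sketch short.
    """
    for record in event["Records"]:
        source_bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

        # File type validation: ignore anything that isn't an expected flat file.
        stem, _, ext = key.rpartition(".")
        if ext.lower() not in ALLOWED_EXTENSIONS:
            print(f"Skipping unsupported file type: {key}")
            continue

        # Filename cleanup and timestamping.
        clean_stem = stem.replace(" ", "_").lower()
        timestamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
        staged_key = f"{clean_stem}_{timestamp}.{ext.lower()}"

        # Land the preprocessed file in the staging bucket for loading into the vault.
        s3.copy_object(
            Bucket=STAGING_BUCKET,
            Key=staged_key,
            CopySource={"Bucket": source_bucket, "Key": key},
        )
```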

Landing Table Creation

For data to be piped into the data vault, there needs to be an existing landing table for the pipe to target. Our team found this to be a potential blocking point for onboarding a new source of data during our exploration into the data vault. A database administrator (DBA) must manually create a new table to land the data in, requiring knowledge of the new source’s column names and their data types to construct the table’s schema. In short, not a very agile process. To remedy this, our team has been exploring and developing ways to programmatically handle the landing table creation upfront as a part of file preprocessing, thus preserving the automation of the ingestion pipeline. 

Although several tools are available for profiling data, Capital One recently released its own open-source Data Profiler [1], which caught the eye of our data team. Available as a Python library, Data Profiler can automatically generate a schema, data types, and statistics for many different data formats. It can also leverage various machine learning libraries to identify PII/NPI data within a given file. For more information about using Data Profiler, see the documentation on its GitHub page [2].
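Profiling a file and reading back the inferred column types looks roughly like the sketch below; the filename is just a placeholder for the file pulled from staging.

```python
import dataprofiler as dp

# "sample.csv" is a placeholder; in the pipeline this would be the file
# pulled from the staging bucket.
data = dp.Data("sample.csv")   # Auto-detects the format (CSV, JSON, Parquet, ...)
profile = dp.Profiler(data)    # Profiles every column in the dataset

report = profile.report(report_options={"output_format": "compact"})

# Each entry in data_stats describes one column: its name, inferred data
# type, and summary statistics.
for column in report["data_stats"]:
    print(column["column_name"], column["data_type"])
```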

By hosting an application that serves as a mediator between the ingestion pipeline and the data vault, our team integrated the table creation process into the existing ingestion pipeline workflow. Whenever a new file landed in flat-file storage, it would trigger an event that pushed a message to a queue polled by the landing table application. Upon consuming the message, the application pulled the appropriate file and ran it through Data Profiler to infer its schema and column data types. These were then used to generate and execute queries for creating a landing table for the data, adding rows to the relevant metadata tables, and running stored procedures in the data vault environment to kick off data ingestion into the vault.
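The query generation step can be sketched as follows. The type mapping, schema name, and table naming here are hypothetical stand-ins for whatever conventions the vault environment uses; the input is the data_stats list produced by a Data Profiler report like the one above.

```python
# Hypothetical mapping from Data Profiler's primitive types to SQL column types.
TYPE_MAP = {
    "int": "BIGINT",
    "float": "DOUBLE PRECISION",
    "datetime": "TIMESTAMP",
    "string": "VARCHAR",
}


def build_landing_table_ddl(table_name: str, data_stats: list) -> str:
    """Builds a CREATE TABLE statement from a Data Profiler report's data_stats."""
    columns = []
    for column in data_stats:
        sql_type = TYPE_MAP.get(column["data_type"], "VARCHAR")
        columns.append(f'    "{column["column_name"]}" {sql_type}')
    column_defs = ",\n".join(columns)
    # "landing" is a placeholder schema name.
    return f'CREATE TABLE landing."{table_name}" (\n{column_defs}\n);'


# Example with a hand-written data_stats snippet; in the application this
# comes straight from the profiler, and the resulting DDL is executed before
# the metadata inserts and stored procedure calls that kick off ingestion.
example_stats = [
    {"column_name": "customer_id", "data_type": "int"},
    {"column_name": "signup_date", "data_type": "datetime"},
    {"column_name": "email", "data_type": "string"},
]
print(build_landing_table_ddl("customer_source", example_stats))
```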

 

Sources:

[1]: https://www.capitalone.com/tech/open-source/basics-of-data-profiler/ 

[2]: https://github.com/capitalone/DataProfiler

 
