

Robust Entity Deduplication Process Developed to Reduce Manual Effort for the Virginia Public Access Project

The Virginia Public Access Project (VPAP) provides political contribution information to the public. VPAP receives contribution records from the Virginia Board of Elections and screens each incoming record for matches against a database of past campaign contributors. The process was largely manual, relying on individuals with subject matter expertise to compare incoming records with those in the database; that reliance on manual effort was time-consuming and sometimes imprecise. VPAP needed a system that could run daily during peak production periods to perform the matching work. UDig developed a machine learning deduplication system that evaluates each new record and calculates the probability that it matches any other record in the dataset. More than 70 percent of records are now matched without human intervention, freeing analysts’ time for tasks that require more critical thinking and add more value.

How We Went from Ideas to Impact

  • Most of the records are processed without the need for human intervention, freeing up analysts’ time for tasks that require more critical thinking and add more value.

The Idea

Clean data is a necessity for companies to get the most out of robust analytics. Bad data can lead to bad conclusions, so when an organization seeks to become data-driven, it’s important that the data is trustworthy. Machine learning can help organizations increase the integrity of their data. A common issue is entity matching: recognizing the same company or person across records in internal data. For instance, when your analysts report that you received loan applications from 1,000 distinct individuals over the past quarter, how are they measuring distinct individuals? Can you account for the one customer who submits 20 separate loan applications, or is there no indication that those applications are connected?

Most organizations attempt to address this problem with simple rules that check whether the name and address match, but this approach is messy and imprecise. What if a person abbreviates part of the address, uses a nickname, or gets married and changes their last name? These are not edge cases. The methodology used to handle them can materially change the data, and with it analysts’ conclusions, reports, and business strategy.
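
To make that limitation concrete, here is a minimal sketch of such an exact-match rule. The records and the rule itself are illustrative assumptions, not VPAP’s actual logic:

```python
# Hypothetical illustration of a naive exact-match rule; not VPAP's real code.

def naive_match(a: dict, b: dict) -> bool:
    """Declare a match only when name and address are identical."""
    return (
        a["first_name"].lower() == b["first_name"].lower()
        and a["last_name"].lower() == b["last_name"].lower()
        and a["address"].lower() == b["address"].lower()
    )

existing = {"first_name": "Jacob", "last_name": "Ferraiolo",
            "address": "1234 Sesame Street Henrico Va"}
incoming = {"first_name": "Jake", "last_name": "Ferraiolo",
            "address": "1234 Sesame St Va"}

# A nickname ("Jake") and an abbreviation ("St") defeat the rule,
# even though a human would call these the same person.
print(naive_match(existing, incoming))  # False
```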

The Impact

The table below illustrates the kind of issue UDig was tasked with solving for VPAP. VPAP is an organization that “connects Virginians to nonpartisan information about Virginia politics in easily understood ways.” One way it does this is by tracking political contribution and expenditure data for Virginia politics. When Bob Frank from Norfolk, VA donates $100 to his favorite politician, VPAP needs to be able to say whether he is the same Robert Frank of Norfolk who donated $100 to the same politician the prior year.

VPAP’s existing system was like many organizations’: a rule-based system that checked for matching fields and accounted for a handful of commonly encountered nicknames and abbreviations. Looking at the example below, it’s clear that this approach quickly becomes messy. Should you create rules that explicitly check for every possible nickname, misspelling, and abbreviation? Clearly, that is not practical.

| First Name | Last Name | Company Name | Industry | Address | Match (1 = same person) |
| --- | --- | --- | --- | --- | --- |
| Jacob | Ferraiolo | ABC Consulting | Consulting | 1234 Sesame Street Henrico Va | 1 |
| Jake | Ferraiolo | UDig | Consulting | 1234 Sesame St Va | 1 |
| Jacob | Ferraiolo | UDig | Consulting | 1234 Sasame St Glen Allen, Virginia | 1 |
| Jacob | Ferraiolo | UDig | Consulting | 3241 North St Norfolk, Va | 0 |
| Luke | Ferraiolo | Joe’s Bar | Bartender | 1234 Sesame Street Henrico Va | 0 |

Sample data that shows the slight differences between records that can make entity matching challenging. 

Looking at the above examples, a human analyst could tell you that “Jacob” and “Jake” are similar enough that they most likely refer to the same person. Conversely, “3241 North St Norfolk, Va” is nothing like “1234 Sesame Street Henrico Va,” so those records are much less likely to be a match. A human analyst also knows that not every field carries the same weight: a matching last name matters more than a matching company name. These are rules humans know intuitively but that are cumbersome to program into a computer by hand. This is where machine learning comes into play; instead of explicitly programming all these rules, we can let the machine learn them from years of historical data.
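
As an illustration, here is a minimal sketch of turning a pair of records into per-field similarity features. The field names and the nickname table are hypothetical, and Python’s standard-library SequenceMatcher stands in for whatever string metric a production system would use:

```python
# Hypothetical feature extraction for a candidate record pair.
from difflib import SequenceMatcher

NICKNAMES = {"jake": "jacob", "bob": "robert"}  # illustrative lookup table

def sim(a: str, b: str) -> float:
    """Normalized character-level similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def name_sim(a: str, b: str) -> float:
    """String similarity after canonicalizing known nicknames."""
    a = NICKNAMES.get(a.lower(), a.lower())
    b = NICKNAMES.get(b.lower(), b.lower())
    return sim(a, b)

def features(rec_a: dict, rec_b: dict) -> list[float]:
    """One similarity score per field; a model learns how to weight them."""
    return [
        name_sim(rec_a["first_name"], rec_b["first_name"]),
        sim(rec_a["last_name"], rec_b["last_name"]),
        sim(rec_a["company"], rec_b["company"]),
        sim(rec_a["address"], rec_b["address"]),
    ]

a = {"first_name": "Jacob", "last_name": "Ferraiolo",
     "company": "ABC Consulting", "address": "1234 Sesame Street Henrico Va"}
b = {"first_name": "Jake", "last_name": "Ferraiolo",
     "company": "UDig", "address": "1234 Sesame St Va"}
print(features(a, b))  # names and address score high; company does not
```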

Hundreds of decision trees check the similarity of each field and combine those checks into an overall confidence level. Scoring similarity, rather than demanding exact equality, accounts for the abbreviations, typos, and nicknames common in such data without having to explicitly program millions of exceptions. If the last name, first name, and address meet similarity thresholds, the system can safely conclude that the records describe the same person.
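
Here is a minimal sketch of that ensemble idea using scikit-learn’s RandomForestClassifier. The tiny training set is a stand-in for years of historical analyst decisions, and the feature columns are the per-field similarity scores sketched above; none of this is VPAP’s actual model:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Columns: first-name, last-name, company, address similarity (illustrative).
X_train = np.array([
    [1.00, 1.00, 0.30, 0.74],  # nickname + abbreviated address -> match
    [1.00, 1.00, 1.00, 0.55],  # typo in address                -> match
    [1.00, 1.00, 1.00, 0.30],  # same name, different address   -> no match
    [0.35, 1.00, 0.20, 1.00],  # different first name           -> no match
])
y_train = np.array([1, 1, 0, 0])  # stand-in for historical analyst decisions

# "Hundreds of decision trees" voting on each candidate pair.
model = RandomForestClassifier(n_estimators=500, random_state=0)
model.fit(X_train, y_train)

# Probability that a new candidate pair refers to the same entity.
candidate = np.array([[1.0, 1.0, 0.4, 0.7]])
print(model.predict_proba(candidate)[0, 1])
```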

VPAP now has a system that runs daily to perform the work previously handled manually by an analyst. It evaluates each new record and calculates the probability that the record matches any other record in the dataset. Records the process is unsure about are flagged for human review, but most records are processed without human intervention, which frees up analysts’ time for tasks that require more critical thinking and add more value.
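
That triage step can be as simple as the following sketch, assuming illustrative probability thresholds (real cutoffs would be tuned against historical data):

```python
# Illustrative thresholds; not VPAP's actual cutoffs.
HIGH, LOW = 0.90, 0.10

def triage(match_probability: float) -> str:
    """Route a scored record pair: auto-accept, auto-reject, or human review."""
    if match_probability >= HIGH:
        return "auto-match"
    if match_probability <= LOW:
        return "new-entity"
    return "flag-for-review"

for p in (0.97, 0.50, 0.03):
    print(p, "->", triage(p))
```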

  • How We Did It
    Data Identification & Enrichment, Data Mining
  • Tech Stack
    Python, SQL Server
