Data quality is crucial to any organization’s ability to make timely, competent decisions that all stakeholders can trust. Even in today’s data-centric economy, however, organizations still can’t seem to get ahead of their “find and fix” data quality woes. In fact, according to an article by Tom Redman (AKA the ‘Data Doc’), only 3% of companies’ data meets basic quality standards. Further, IBM estimated in 2016 that the cost of bad data in the US alone was $3.1 trillion. That’s trillion with a T.
In this blog, I will outline at a high level the steps UDig takes when performing data quality work with clients who do not yet have a robust data quality process. I prefer to cover “A to Z” with a narrow scope, then apply the lessons learned to future iterations. Time and again, I have seen this approach provide a tangible benefit (i.e., the actual quality work itself) as well as an organizational knowledge and capability benefit (learning HOW to apply the process to future data quality work).
Step 1: Identify the Scope
Before beginning any data quality project, it’s crucial to narrow your focus. Your organization probably has gobs of data in a variety of disparate systems. It’s very easy to bite off more than you can chew when determining what data to focus on, so typically I’ve found that a single entity in a single system (e.g., “customer” in your e-commerce system) is the best place to start. It can be tempting to pick the most impactful entity, and indeed that makes a lot of sense for obtaining buy-in for any data quality effort. I must caution, however, that the most impactful entities are typically also the most afflicted with complex quality issues! In some cases, it may be worth starting with an entity that is easier to correct and tackling the most impactful ones once some data quality “chops” have been established in the org.
Step 2: Perform Data Profiling
Or, as I like to call it, “know thy enemy.” There are a variety of data profiling tools on the market, but at their core they all offer you the same capability: knowing your data. A typical output of data profiling will include descriptive information about the data attributes from which you might begin to identify patterns. You might discover, for example, that 80% of the data in your “Customer Number” field is formatted as nine digits, while 15% is only five digits, and the remaining 5% is null. Right off the bat we now have information about the state of that data. Why is that field mostly nine digits? Why is a large chunk five digits? Why, for a field that is likely meant to be a unique business key, are 5% of our records null?
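Even without a commercial profiling tool, a basic format profile like the one described above is easy to sketch. Here is a minimal, hypothetical example in Python (the field name and format buckets are assumptions based on the “Customer Number” scenario, not a real client dataset):

```python
import re
from collections import Counter

def profile_customer_numbers(values):
    """Bucket a column's values by format pattern and return percentages."""
    def classify(v):
        if v is None or v == "":
            return "null"
        if re.fullmatch(r"\d{9}", v):
            return "nine digits"
        if re.fullmatch(r"\d{5}", v):
            return "five digits"
        return "other"

    counts = Counter(classify(v) for v in values)
    total = len(values)
    return {pattern: round(100 * n / total, 1) for pattern, n in counts.items()}

# Tiny illustrative sample; a real profile would run against the full table.
sample = ["123456789", "12345", None, "555667788"]
print(profile_customer_numbers(sample))
```

A real profiling pass would also capture min/max lengths, distinct counts, and value frequency distributions, but even this simple bucketing surfaces the “why is a chunk five digits?” question immediately.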
Step 3: Business Rules Identification
In an ideal world, your organization would have every single business rule well documented and well socialized. But alas, we live in the real world, and if your organization has a SINGLE business rule well documented, you’re ahead of the curve! Find a subject matter expert for the entity you’ve just profiled. Show them some of the odd things the profiling has uncovered and cut them loose. Very quickly you’ll be hearing information like “Oh! Well, our original e-commerce system back in ’03 only utilized a five-digit customer number. Those were the days when half of our orders still came in by fax! I thought we had decided to pad them with zeroes when we migrated to the new system…” If you listen closely, clear business rules will emerge from these conversations, and with diligence, this tacit knowledge can be converted into organizational knowledge. This is a highly iterative process and isn’t for the faint of heart. Patience wins the day here.
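Once a rule emerges from an SME conversation, it pays to write it down as something executable, not just prose. A minimal sketch, assuming the hypothetical nine-digit customer number rule from the anecdote above (the rule IDs and checks are illustrative, not a real rule catalog):

```python
import re

# Hypothetical rules captured from the SME conversation: the legacy '03
# system used five-digit customer numbers that should have been
# zero-padded to nine digits during migration.
BUSINESS_RULES = {
    "CUST-001": ("Customer Number must be exactly nine digits",
                 lambda v: v is not None and re.fullmatch(r"\d{9}", v) is not None),
    "CUST-002": ("Customer Number must not be null",
                 lambda v: v is not None and v != ""),
}

def violations(value):
    """Return the IDs of every rule a single value violates."""
    return [rule_id for rule_id, (_, check) in BUSINESS_RULES.items()
            if not check(value)]

print(violations("12345"))  # a legacy five-digit value fails the nine-digit rule
```

Keeping the rule text and the check together means the “documentation” can never drift from what is actually enforced.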
Step 4: Quality Metrics & Data Monitoring
You can’t improve what you can’t measure! This part of the process is all about quantifying the issues you’re facing and tracking your progress. By implementing monitoring capabilities (creating visualizations highlighting the percentage of records in violation of business rules, for example) the scope and magnitude of your quality issues can be easily socialized. Tracking over time is also crucial in order to celebrate your progress and help prove the value of the entire data quality exercise.
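The monitoring described above can start as simply as computing the violation percentage on a schedule and appending it to a history for trend charts. A minimal sketch (the rule and record values are hypothetical, carried over from the customer number example):

```python
from datetime import date

def violation_rate(records, check):
    """Percentage of records failing a business-rule check."""
    if not records:
        return 0.0
    failing = sum(1 for r in records if not check(r))
    return round(100 * failing / len(records), 1)

# Append each scheduled run to a simple history; a dashboard or
# visualization tool can chart this over time.
history = []
records = ["123456789", "12345", None, "987654321"]
is_valid = lambda v: v is not None and v.isdigit() and len(v) == 9
history.append({
    "date": date.today().isoformat(),
    "rule": "nine-digit customer number",
    "pct_in_violation": violation_rate(records, is_valid),
})
print(history)
```

The point is less the mechanics than the habit: a number recorded every day is what lets you celebrate progress later.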
Step 5: “Find and Fix” Remediation
This is all about “firefighting”. Working with SMEs, Data Stewards, and other people crazy enough to get pulled into your data cleansing frenzy, this stage focuses on repairing what you can and documenting what you can’t. Maybe you uncover an egregious error that occurred years ago in an old conversion, but you can churn through the records and fix them. Maybe you have duplicate customers with different identifiers and you’d like to merge those records (we recently tackled this for Virginia Public Access Project). This stage is about prioritizing, then blocking and tackling the issues you can fix.
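Continuing the customer number example, a remediation pass often splits into “auto-fixable” and “needs a human.” A hypothetical sketch, assuming the zero-padding rule uncovered in Step 3 (the rule and dispositions are illustrative):

```python
def remediate_customer_number(value):
    """Repair what we can, document what we can't.

    Legacy five-digit numbers are zero-padded to nine digits per the
    (hypothetical) migration rule; anything else is flagged for manual
    review rather than silently "fixed".
    """
    if value is not None and value.isdigit() and len(value) == 5:
        return value.zfill(9), "auto-fixed: zero-padded legacy number"
    if value is not None and value.isdigit() and len(value) == 9:
        return value, "ok"
    return value, "needs manual review"

print(remediate_customer_number("12345"))
```

Recording the disposition alongside the fix gives you an audit trail, and the “needs manual review” pile is exactly the documented backlog this step calls for.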
Step 6: Root Cause Remediation
By now you should be well versed in the state of your data. Your monitoring is tracking records in violation, you’ve got people helping fix issues, and your business rules are documented. Now you get to play detective and seek the root cause of issues. The causes of data quality issues are many, but some usual suspects are insufficient validations in data-entry programs, duplicate entries, data integration issues, and good old-fashioned manual-entry errors. By narrowing down the source of an issue, you gain the power to affect the people creating the data, the processes they follow, and the technology they utilize.
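When the root cause turns out to be insufficient validation at entry time, the fix is to enforce the rule where records are created. A minimal, hypothetical guard for a data-entry program (the exception name and nine-digit rule are assumptions from the running example):

```python
import re

class CustomerNumberError(ValueError):
    """Raised when a customer number fails validation at entry time."""

def validate_customer_number(value):
    """Reject bad data at the point of entry instead of cleaning it later.

    Enforcing the (hypothetical) nine-digit rule where records are
    created shrinks the downstream "find and fix" workload at its source.
    """
    if value is None or not re.fullmatch(r"\d{9}", value):
        raise CustomerNumberError(f"invalid customer number: {value!r}")
    return value

validate_customer_number("123456789")  # passes silently
```

The same check can be reused in integration pipelines, so every path into the system applies one definition of “valid.”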
In closing, data quality is not easy. It’s an iterative process that isn’t for the faint of heart. I liken it to exercise; you know you need to do it and the benefits are many… but that doesn’t mean you wake up every day excited to go to the gym! Still, we all know how important data is to organizations now, and it’s time we all start caring about one of our most valuable organizational assets. You can get started right now! Good luck out there, data quality warriors.