Integrating Data Quality Improvement into Your Data Import Processes
In the world of data management, maintaining clean and accurate databases is key to the success of any organization. Consistently maintaining high-quality data is a challenge, however, when typographical errors and other inconsistencies are prevalent, creating major roadblocks in the data management process. Simbex was tasked with matching entries between two large customer lists: one maintained by a commercial partner, the other part of a large Simbex data gathering and analysis system. In practice, this meant matching the partner's ~56,000-row spreadsheet, riddled with typos and inconsistencies, against the internal Simbex database.
To match the contents of this spreadsheet with the existing PostgreSQL database, John Stephens, a Senior Software Engineer at Simbex, first ran a series of SQL scripts to clean up the data. This involved removing blank spaces at the start or end of an address and identifying records with missing fields (for example, records that had a name, address, city, and zip but were missing a state code). He then looked for entries that matched exactly on name, address, city, state, and zip. Those exact matches accounted for only 4% of the entries.
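The cleanup pass described above, which John performed in SQL, can be sketched in Python. The field names and the required-field list here are illustrative assumptions, not Simbex's actual schema:

```python
# Illustrative sketch of the cleanup step: trim stray whitespace and
# flag records with missing required fields. Field names are assumed.
REQUIRED_FIELDS = ("name", "address", "city", "state", "zip")

def clean_record(record: dict) -> dict:
    """Strip leading/trailing whitespace from every text field."""
    return {k: v.strip() if isinstance(v, str) else v
            for k, v in record.items()}

def missing_fields(record: dict) -> list:
    """Return the required fields that are empty or absent."""
    return [f for f in REQUIRED_FIELDS if not record.get(f, "").strip()]

record = clean_record({"name": " Jane Doe ", "address": "1 Main St",
                       "city": "Lebanon", "state": "", "zip": "03766"})
print(missing_fields(record))  # a record with no state code is flagged
```

In SQL the same trimming would typically use `trim()` and the missing-field check a `WHERE` clause on empty columns; the Python version just makes the logic easy to read in isolation.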
Looking at the remaining 96%, there were many understandable but inexact matches. For example, New England zip codes that begin with zero but had been stored as numbers instead of text had the leading 0 dropped, causing otherwise identical records to miss each other. Abbreviations were another frequent source of mismatches: RD instead of Road, or LN instead of Lane. And there were more complicated cases, like Four E 42 St vs. 4 East Forty-second Street, or 1100 W Rt 100 vs. 1100 West US Hwy 100.
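The simpler of these problems yield to straightforward normalization rules applied before comparison. A minimal sketch, with an abbreviation table that is illustrative and far smaller than a production one would be:

```python
# Re-pad zips that lost their leading zero to numeric storage, and expand
# common abbreviations so "RD" and "Road" compare equal. The SUFFIXES
# table is an illustrative assumption, not an exhaustive mapping.
SUFFIXES = {"RD": "ROAD", "LN": "LANE", "ST": "STREET",
            "W": "WEST", "E": "EAST", "HWY": "HIGHWAY"}

def normalize_zip(zip_code) -> str:
    """Restore a 5-digit US zip truncated by numeric storage."""
    return str(zip_code).strip().zfill(5)

def normalize_address(address: str) -> str:
    """Uppercase, drop periods, and expand abbreviated tokens."""
    tokens = address.upper().replace(".", "").split()
    return " ".join(SUFFIXES.get(t, t) for t in tokens)

print(normalize_zip(3766))                    # "03766"
print(normalize_address("1100 W Maple Rd."))  # "1100 WEST MAPLE ROAD"
```

Token-level rules like these handle RD/Road and W/West; the harder cases in the examples above (spelled-out numbers, Rt vs. US Hwy) are what motivate the fuzzy matching described next.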
To address cases like these, John used fuzzy logic to find phonetically similar entries; the last name Smith, for example, may also be spelled Smythe. Phonetic encoding systems convert a word or set of words into a string of letters representing the phonemes (sounds) that make up the text, and some of these algorithms have been used in US Census research since the 1930s. Then, using difference-measuring algorithms, John calculated a likelihood score (between 0 and 1) that the text in one database sounded similar to the text in the other, and made a rule that any pair of entries with a score over a chosen threshold be considered a match. By combining several different phonetic encodings and likelihood calculations, John increased the match rate from 4% to over 85%.
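Soundex is the classic example of such an encoding (it is the one associated with US Census indexing). The sketch below pairs a simplified Soundex with Python's standard-library `difflib.SequenceMatcher` as the 0-to-1 likelihood measure; the source does not name the exact algorithms or thresholds John used, so both choices here are illustrative stand-ins:

```python
import difflib

# Simplified American Soundex: consonants map to digit groups,
# vowels and h/w/y are dropped, runs of the same code collapse.
CODES = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
         **dict.fromkeys("dt", "3"), "l": "4",
         **dict.fromkeys("mn", "5"), "r": "6"}

def soundex(word: str) -> str:
    """Encode a word as its first letter plus three digit codes."""
    word = "".join(ch for ch in word.lower() if ch.isalpha())
    if not word:
        return ""
    out = word[0].upper()
    prev = CODES.get(word[0], "")
    for ch in word[1:]:
        code = CODES.get(ch, "")
        if code and code != prev:
            out += code
        if ch not in "hw":  # h/w do not break a run of same-coded letters
            prev = code
        if len(out) == 4:
            break
    return out.ljust(4, "0")

def phonetic_score(a: str, b: str) -> float:
    """Likelihood (0..1) that two strings sound alike."""
    return difflib.SequenceMatcher(None, soundex(a), soundex(b)).ratio()

MATCH_THRESHOLD = 0.8  # illustrative cutoff; tuned empirically in practice

print(soundex("Smith"), soundex("Smythe"))  # both encode to S530
print(phonetic_score("Smith", "Smythe") >= MATCH_THRESHOLD)  # True
```

Because Smith and Smythe produce the identical code S530, their score is 1.0 and they clear the threshold even though the raw strings differ. In PostgreSQL itself, the `fuzzystrmatch` extension provides `soundex()`, `metaphone()`, and `levenshtein()` for doing this kind of comparison directly in SQL.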
John prepared these scripts and assessed their effectiveness in a test environment before integrating them into the production workflow. As a result, the matching scripts are now part of the data import process, running as new or updated data flows from the commercial partner into Simbex's databases.
This case study demonstrates the value of a robust and flexible data management system, especially when dealing with messy data. By using fuzzy logic and phonetic encoding algorithms, John was able to significantly increase the match rate and ensure the accuracy of Simbex’s database.
At Simbex, we understand the importance of clean and accurate data management, and we are committed to providing our clients with the best tools and technology to achieve this goal. Using advanced data management techniques like these, we can help your organization maintain a high level of accuracy and reliability, even when dealing with messy data. Whether you need help cleaning up your existing database or managing new data flows, our team of experts is here to help with data management solutions that are designed to be flexible, scalable, and customizable to meet the unique needs of your business.