Why Address Standardization and Validation Matters & What You Can Do About It - Data Science Central

Why Address Standardization and Validation Matters & What You Can Do About It
Posted by Farah Kim on June 5, 2020 at 1:54am

Poor address data is a complex data quality challenge that affects customers, businesses, and mailing services. Each year, millions of
dollars are wasted in resolving the consequences of poor address data. Mailers spend over $20 billion on UAA (undeliverable-as-addressed)
mail, while direct costs to the USPS are over $1.5 billion per year. All this unnecessary cost is the result of poor, mismanaged,
unvalidated address data.
Over the years, working with Fortune 500 clients, we have seen the consequences of poor address data - disgruntled customers,
ballooning costs, inefficient operations, marketing blunders, embarrassing mistakes... the list goes on.
Take a moment and imagine this.

You have over a million customer records and nearly 23% of those records are either incomplete or inaccurate - and that is before
counting records that are duplicated or unstructured.

That's nearly 230,000 records that may be rendered useless or may cost you thousands of dollars in managing returned mail. This is a
situation most companies face today, regardless of the controls they've put in place. When data is entered by humans, it will always
contain a significant number of errors.

What Does Bad Address Data Really Look Like? 


Here’s an image of what typical unstructured, raw address data looks like. Poor address data is a challenge that causes a severe strain on
businesses and their employees. Imagine having to fix these very basic issues for every mailing campaign, promotional activity, and every
customer report that you have to run. It’s not only mind-bogglingly frustrating but also counter-productive as you try to match and verify
each address to ensure it’s accurate and complete. Data scientists and analysts or business users in need of this data must spend days
and months fixing these issues.

Sure, it’s human nature to make mistakes. Most of the time, consumers are lax when it comes to providing their address information on
physical or web forms. They may misspell a state name, write abbreviations, miss out a street number or forget their ZIP Code. It’s
inevitable that some mistakes will be made and incorrect data will be entered.
Does that mean companies are helpless? Should poor address data be resolved via manual means - like calling up customers or
using other records like bank statements and bills to verify? You could do that, but it's going to cost you time and effort - not to mention,
you're not addressing the core problem: address data that has not been standardized and validated according to the USPS or the
equivalent authority in your country.

Let me elaborate on this further. 


Address Standardization and Validation Limitations 
If your CRM data looks anything like the image above, you have a significant address standardization problem. According to the USPS
guidelines (given below), address data is supposed to be in a format like this:

Unless you place strict data entry controls on your web form or physical form, there is very little chance your data will be in this perfect
state. So the first limitation here is address standardization, and there is no way you can do this manually for hundreds of thousands of
rows of data.
Here are the USPS guidelines: 

Always put the address and the postage on the same side of your mailpiece.
On a letter, the address should be parallel to the longest side.
All capital letters.
No punctuation.
At least 10-point type.
One space between city and state.
Two spaces between state and ZIP Code.
Simple type fonts.
Left justified.
Black ink on white or light paper.
No reverse type (white printing on a black background).
If your address appears inside a window, make sure there is at least 1/8-inch clearance around the address. Sometimes parts of the
address slip out of view behind the window and mail processing machines can’t read the address.
If you are using address labels, make sure you don’t cut off any important information. Also make sure your labels are on straight.
Mail processing machines have trouble reading crooked or slanted information.
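The text-level rules in this list (all capital letters, no punctuation, one space between city and state, two spaces between state and ZIP Code) can be sketched in a few lines of Python. This is only a minimal illustration, not a USPS tool: the function names `usps_style` and `last_line` are hypothetical, and keeping the hyphen so that ZIP+4 codes survive is an assumption of this sketch.

```python
import re


def usps_style(text: str) -> str:
    """Uppercase a piece of an address and strip punctuation.

    Assumption of this sketch: hyphens are kept so ZIP+4 codes stay intact.
    """
    text = re.sub(r"[^\w\s-]", "", text.upper())
    # Collapse any run of whitespace into a single space.
    return re.sub(r"\s+", " ", text).strip()


def last_line(city: str, state: str, zip_code: str) -> str:
    """Build the city/state/ZIP line: one space after city, two before ZIP."""
    return f"{usps_style(city)} {usps_style(state)}  {usps_style(zip_code)}"


if __name__ == "__main__":
    print(usps_style("123 Main St., Apt. 4"))
    print(last_line("St. Louis,", "mo", "63101"))
```

The print and layout rules (font size, ink color, window clearance) concern the physical mailpiece and are outside what a text-cleaning function can enforce.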
Next, let's talk about validation. 
The USPS maintains the official database of addresses in the United States. If you want to check the validity of your address data, you're
going to have to match it against the USPS database. To do that, you will need access to a CASS Certified vendor who will validate your
addresses by matching them against the USPS database. These vendors have updated CASS files, which means any new address or
change in location recorded by the USPS will be available to the vendor.

Here's the tricky part. 
To validate this data, you have to standardize it.

To standardize it, you have to clean and dedupe this data. 
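As a rough sketch of that first step, the snippet below cleans raw address strings and then drops duplicates that differ only in case or spacing. The helper names `clean` and `dedupe` are hypothetical, and treating two records as duplicates when their cleaned text matches case-insensitively is a simplifying assumption; real deduplication usually needs fuzzier matching.

```python
import re
import unicodedata


def clean(value: str) -> str:
    """Normalize whitespace and drop non-printable characters."""
    # Turn tabs, newlines and non-breaking spaces into plain spaces first,
    # so removing control characters does not glue words together.
    value = re.sub(r"\s", " ", value)
    # Drop control and format characters (Unicode categories starting with "C").
    value = "".join(ch for ch in value if unicodedata.category(ch)[0] != "C")
    return re.sub(r" +", " ", value).strip()


def dedupe(records: list[str]) -> list[str]:
    """Keep the first occurrence of each record, comparing cleaned text case-insensitively."""
    seen: set[str] = set()
    unique: list[str] = []
    for record in records:
        cleaned = clean(record)
        key = cleaned.casefold()
        if cleaned and key not in seen:
            seen.add(key)
            unique.append(cleaned)
    return unique


if __name__ == "__main__":
    raw = ["123 Main St\t", "123  main st", "45 Oak Ave\x00", "   "]
    print(dedupe(raw))
```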


Note though that address standardization tools can only validate records based on certain geographical parameters. They cannot, for
instance, validate addresses that are:
Valid, but no longer exist
Structurally right, but do not belong to the customer
Not registered in the USPS database

Then again, once you've cleansed, standardized, and validated the data, the number of invalid or non-existent records goes down
significantly. You can filter those records, verify the legitimacy of the entity and, if necessary, call them up to ask for accurate information.

So How Can You Manage this Dilemma Smartly? 


In our experience, companies already understand the problem. They are just not sure of the right solution, so instead they hire data
specialists or analysts, who are then tasked with the responsibility of cleaning up this data.
Let me be clear - a data scientist's job is not to clean dirty data - it's to study data, improve data acquisition, and make efficient use of this
data. To make a data scientist spend 80% of their time cleaning data is to waste their talent. 
A better strategy would be to equip the data scientist or even a business user with the right address standardization software to help them
manage this better. Most validation tools today are pretty much DIY and do not require a user to learn a new language or be technically
proficient. They do come with a learning curve, as is the case with most software, but it's not something that is out of reach.

It's important though to choose a tool that lets you tackle all three aspects of this problem:
1. Cleaning: The ability to clean up data by identifying common data quality errors (typos, format, non-printable characters, negative
spacing etc)
2. Standardization: Turn this data into an acceptable USPS format. 
3. Validation or Verification: Verifying this data by matching it with the USPS database. 
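To make the third aspect concrete, here is a toy sketch of validation as a lookup against a reference list. The reference list is a hypothetical stand-in for the USPS database, which in practice is reached through a CASS Certified vendor, not a Python set. The function names `standardize` and `split_by_validity` are likewise assumptions made for illustration.

```python
import re


def standardize(address: str) -> str:
    """Uppercase, strip punctuation (hyphens kept), and collapse whitespace."""
    address = re.sub(r"[^\w\s-]", "", address.upper())
    return re.sub(r"\s+", " ", address).strip()


def split_by_validity(addresses: list[str], reference: list[str]) -> tuple[list[str], list[str]]:
    """Split addresses into those found in the reference list and those needing review."""
    reference_keys = {standardize(entry) for entry in reference}
    valid: list[str] = []
    needs_review: list[str] = []
    for address in addresses:
        if standardize(address) in reference_keys:
            valid.append(address)
        else:
            needs_review.append(address)
    return valid, needs_review


if __name__ == "__main__":
    # Hypothetical reference data, used only to illustrate the lookup.
    reference = ["123 MAIN ST SPRINGFIELD IL 62701"]
    customers = [
        "123 Main St., Springfield, IL 62701",
        "999 Nowhere Rd, Springfield, IL 62701",
    ]
    valid, needs_review = split_by_validity(customers, reference)
    print("valid:", valid)
    print("needs review:", needs_review)
```

Records that land in the review list are the ones you would follow up on manually.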

Most address verification software does not have strong data matching capabilities, which is at the heart of this function. Your choice of
software should be able to match your address data and give a 100% accuracy rate. If it misses matches because the content is not exact
in nature, it is not the right solution for you. 
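To illustrate what matching on inexact content means, the sketch below scores similarity with `difflib.SequenceMatcher` from Python's standard library. It is not how any particular address verification product works. The `best_match` helper and the 0.85 threshold are assumptions chosen for the example.

```python
from difflib import SequenceMatcher
from typing import Optional


def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity ratio, ignoring case."""
    return SequenceMatcher(None, a.upper(), b.upper()).ratio()


def best_match(address: str, candidates: list[str], threshold: float = 0.85) -> Optional[str]:
    """Return the most similar candidate, or None if nothing reaches the threshold.

    The threshold value is an arbitrary assumption for this sketch.
    """
    best_score, best_candidate = 0.0, None
    for candidate in candidates:
        score = similarity(address, candidate)
        if score > best_score:
            best_score, best_candidate = score, candidate
    return best_candidate if best_score >= threshold else None


if __name__ == "__main__":
    known = ["123 MAIN STREET", "45 OAK AVENUE"]
    print(best_match("123 Main Stret", known))   # typo still finds a match
    print(best_match("900 Elm Blvd", known))     # nothing close enough
```

An exact-match comparison would reject the misspelled "Stret" record outright, which is the failure mode described above.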
At the end of the day though, tools and gadgets can only do so much. You will need to implement certain business strategies that can help
you take care of this problem. These could be: 
Training:
The first step towards quality is training – make sure people who are handling, interacting, using, and entering data know the impact they
have in the process and on downstream applications. They need to understand the consequences of bad data on the entire organization
and not just on one member or customer. Employees practicing data quality rules should be rewarded and appreciated.
Involve Business Users in the Quality Process:
Data is not just an IT problem. Business users are equally responsible for managing data. In fact, they are the sole owners of customer
data that is often used for marketing and sales purposes. This is why they need to be involved in the process and also need to be trained
to use data management tools.
Data Governance:
Set up a data governance team to create a data management plan and ensure that the organization follows the plan where each employee
understands the plan, their role within the plan, and the expectations that come along with the role.
Lock Down Data & User Roles:
If anyone in your team can open up the CRM or the data source, muddle around with data and leave no footprints, you are in for serious
trouble. It’s necessary to create master data holders who have the right to access, enter, or process critical data. This should come in the
data management plan.
Remember though, you don't have to do a blanket address quality upgrade. Start small. Identify departments or activities that require
address data to carry out tasks such as mail or package delivery, newsletters, or billing, and start optimizing the data for each process.
You're not a victim of bad data. With plenty of tools and solutions now available, you can sort your data and prevent negative outcomes. 
What are you doing in your organization to manage bad address data? 

This post is a condensed version of an original guide published here.
