Professional Documents
Culture Documents
Data Cleaning Checklist: Checklist Examples in Action Potential Solutions Data Constraints Problems
Data Cleaning Checklist: Checklist Examples in Action Potential Solutions Data Constraints Problems
Data cleaning takes up 80% of the data science workflow. Use this
checklist to identify and resolve any quality issues with your data
Uniqueness Constraints Example A duplicate row where the name and Keep only one of the exact
phone_number columns are identical, but not duplicate rows
Ensuring that there are no exact or the height_cm column
Merge rows that have non-exact
almost exact duplicates within your duplicate rows
rows. name height_cm phone_number
Carl Rosseel 177 (555) 200-5598
Length violation for text data Example A phone_number column that is 9 characters Drop rows that are affected by
instead of 14 length violation
Ensuring that text columns that follow Set affected observations to
a specific standard have the same name height_cm phone_number missing
string length
Carl Rosseel 177 (555) 200-5598
Text data inconsistent Example A phone_number number column that Standardize formatting for affected
contains different phone number formats observations
formatting
Drop rows that are affected by the
Ensuring that text columns that follow name height_cm phone_number inconsistency
a specific standard have the same Carl Rosseel 177 (555) 200-5598
string formatting
Carl Rosseel 178 (555) 200-5598
Crossfield validation for date economy first class total Dropping rows where sanity checks
fail
numeric columns 05-18-2022 250 50 300
Apply rules from domain knowledge
Crossfield validation is when we use 05-19-2022 200 50 200 based on knowing the data
multiple fields in a dataset to ensure
the validity of another. For example,
ensuring that part to whole columns
add to a relevant total (Flight
bookings per class add up to the total
recorded bookings)
Crossfield validation for date name date of birth age Dropping rows where sanity checks
fail
columns John Doe 02-07-1994 27
Apply rules from domain knowledge
Ensuring that date and temporal Jane Doe 10-12-2000 34 based on knowing the data
columns pass sanity checks (for
example, ensuring that webinar
registration dates always precede
webinar attendance dates)