Download as pdf or txt
Download as pdf or txt
You are on page 1of 1

Data Cleaning Checklist

Data cleaning takes up 80% of the data science workflow. Use this
checklist to identify and resolve any quality issues with your data

Checklist Examples in Action Potential Solutions

Data Constraints Problems


Data type constraints Example A revenue_usd column that is a string, and Convert to the correct data type
not a numeric data type.
Ensuring that different columns have
the correct data type before revenue_usd revenue_usd
beginning analysis.
$1,000 1,000
$3,210 3,210

Data range constraints Example A gpa column should be constrained


Check for typos, like a decimal
to [0.0, 4.0] point in the wrong place.
Ensuring that different columns have Drop rows where data points break
the correct range. This is especially gpa range constraints
the case for columns that have limits.
3.1 Set the data point that breaks
range constraints to the maximum,
4.0
or minimum
5.1 Treat the data point that breaks
range constraints to missing, and
impute it

Uniqueness Constraints Example A duplicate row where the name and Keep only one of the exact
phone_number columns are identical, but not duplicate rows
Ensuring that there are no exact or the height_cm column
Merge rows that have non-exact
almost exact duplicates within your duplicate rows
rows. name height_cm phone_number
Carl Rosseel 177 (555) 200-5598

Carl Rosseel 178 (555) 200-5598

Text and Categorical Data Problems


Membership constraints for Example Two different entries for “New York” in the Drop rows that are affected by
city column inconsistent categories
categorical data
Remap inconsistent categories to
Ensuring that categorical columns name city the correct category name
have correct and consistent Carl Rosseel New York Infer categories based on other
categories data points if it’s not clear how it
Sara Billen ew York City
N
should be remapped

Length violation for text data Example A phone_number column that is 9 characters Drop rows that are affected by
instead of 14 length violation
Ensuring that text columns that follow Set affected observations to
a specific standard have the same name height_cm phone_number missing
string length
Carl Rosseel 177 (555) 200-5598

Carl Rosseel 178 (555) 200-5598

Text data inconsistent Example A phone_number number column that Standardize formatting for affected
contains different phone number formats observations
formatting
Drop rows that are affected by the
Ensuring that text columns that follow name height_cm phone_number inconsistency
a specific standard have the same Carl Rosseel 177 (555) 200-5598
string formatting
Carl Rosseel 178 (555) 200-5598

Data Uniformity Problems


Unit uniformity for numeric Example A temperature column in celsius that has Dropping rows where no context on
absurdly high or low temperature values units appears and don’t pass a
columns
sanity check
Ensuring that numeric columns have date city temperature Standardize the units where
the same units (Temperature being 05-18-2022 New York 27 possible
Celsius, or Fahrenheit across all
observations. This is especially 05-18-2022 New York 80.6
relevant when joining datasets from
different countries or sources.) * please note that absurdly high or low temperature
values can be caused by other data quality issues, such
as sensor malfunctions

Unit uniformity for date date birthday Standardizing datetime formats


where possible
columns 05-18-2022 Carl Rosseel
Dropping rows where no context on
Ensuring that date columns have the 05-19-2022 Sara Billen datetime format appears and don’t
same datetime format pass a sanity check
20-05-2022 Isabella Leslie-Miller

Crossfield validation for date economy first class total Dropping rows where sanity checks
fail
numeric columns 05-18-2022 250 50 300
Apply rules from domain knowledge
Crossfield validation is when we use 05-19-2022 200 50 200 based on knowing the data
multiple fields in a dataset to ensure
the validity of another. For example,
ensuring that part to whole columns
add to a relevant total (Flight
bookings per class add up to the total
recorded bookings)

Crossfield validation for date name date of birth age Dropping rows where sanity checks
fail
columns John Doe 02-07-1994 27
Apply rules from domain knowledge
Ensuring that date and temporal Jane Doe 10-12-2000 34 based on knowing the data
columns pass sanity checks (for
example, ensuring that webinar
registration dates always precede
webinar attendance dates)

Missing Data Problems


Missing Completely at There is no observed relationship
Random Data between missing data and other
values within the dataset
When there is no systematic
relationship between missing values
and other values within the dataset

Drop missing rows


Missing at Random Data Missing census data from a specific region, because the
postal service doesn’t have full coverage in that region
When there is a systematic Impute missing rows with measures
relationship between missing data and Missing survey respondents by region of centrality such as median or
other observed values mean

Impute missing rows with


algorithmic, machine-learning
Regio R R R
n A egion B egion C egion D based approaches

Collect new data points and


Missing Not at Random Data Temperature readings from a sensor missing because features
temperate was too low, or high
When there is a systematic
relationship between missing data and
other unobserved values Temperature (°C) —
Sorted lowest to highest
40 °C
Missing data after
40 degrees celsius

You might also like