
Fall 2023

DIME Analytics

Reproducible Research Fundamentals
September 25-29, 2023

Ensuring Data Quality

Maria Ruth Jones & Marc-Andrea Fiorina


Key takeaways

Empirical research results are only as trustworthy as the data used
• “High-quality” data faithfully reflect the reality on the ground

Teams should develop data quality protocols *before* receiving data
• Then data quality checks can happen in real time (upon collection / receipt)

Data quality checks are not just for surveys!
• Equally important for secondary and big data
• Same principles apply
Data quality protocol

• What checks will be performed?
• Who will run the checks?
• When and how often will checks be run?
• What platform will be used to share results with the research team?
• In what way will results be shared with the data provider?
Data Quality Checks

What to check

Completeness
• Do all expected data points actually appear in the dataset?
• Are there duplicates that need to be resolved?

Distribution
• Do all data points fall within an expected range?
• Are there any unusual, implausible, or unexpected missing values?

Consistency
• Do values align across variables in a way that is internally consistent?
• Are values consistent with expectations, given knowledge of the local context?

Survey-specific checks
• Surveyor effects: are there significant differences in key variables across data collectors?
• Is the instrument programmed to maximize data quality?
Completeness checks

First, verify the data has a unique identifier: a variable (or combination of variables) that distinguishes each entity described in the dataset
✓ ID variable is unique (different for every row)
✓ ID variable is never missing
✓ Any duplicates in ID variable are identified and resolved

Then, verify that data is complete across time, space, and units of observation: each observation must be unique and fully identified


Completeness checks - Surveys

Compare actual responses to the sampling frame
• Compare server data to field logs
• Check for refusals and attrition

Potential issues
• Surveys not sent to the server
• Miscommunication on sample assignments
• Attrition
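A sketch of the comparison against the sampling frame. The file names `sample_frame.csv` and `submissions.csv` and the `hhid` key are assumptions about how the project stores its data.

```python
import pandas as pd

frame = pd.read_csv("sample_frame.csv")       # one row per household assigned to the sample
submissions = pd.read_csv("submissions.csv")  # one row per interview received on the server

# Sampled households with no interview on the server yet
missing = frame[~frame["hhid"].isin(submissions["hhid"])]
print(f"{len(missing)} of {len(frame)} sampled households not yet surveyed")

# Interviews that are not in the sampling frame (possible mis-assignment)
extra = submissions[~submissions["hhid"].isin(frame["hhid"])]
print(f"{len(extra)} submissions outside the sampling frame")
```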



Completeness checks – Secondary

Check completeness compared to expected coverage, e.g. time, space, units


Visuals are very helpful
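As an illustration, with regularly spaced secondary data the observed unit-period combinations can be compared against the full expected grid. This sketch assumes a hypothetical daily sensor file with `station` and `date` columns.

```python
import pandas as pd

df = pd.read_csv("sensor_readings.csv", parse_dates=["date"])

# Full grid of station-day combinations expected over the study period
all_days = pd.date_range(df["date"].min(), df["date"].max(), freq="D")
expected = pd.MultiIndex.from_product(
    [df["station"].unique(), all_days], names=["station", "date"]
)

# Station-days that should be present but are not
observed = pd.MultiIndex.from_frame(df[["station", "date"]].drop_duplicates())
gaps = expected.difference(observed)
print(f"{len(gaps)} missing station-days out of {len(expected)} expected")
```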
Distribution checks

Assess variance
• Either too much or too little variance may indicate issues

Flag any unusual or implausible values (outliers)
• All observations should fall within the expected range
• Expectations are based on contextual knowledge and data from other sources

Check for missing values
• Are there unexpected patterns? Questions that aren’t being asked?
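A compact sketch of these checks for the numeric variables in a dataset; the 20% missingness threshold is purely illustrative.

```python
import pandas as pd

def distribution_report(df: pd.DataFrame) -> pd.DataFrame:
    """Summarise spread and missingness for every numeric variable."""
    numeric = df.select_dtypes("number")
    report = pd.DataFrame({
        "share_missing": numeric.isna().mean(),
        "std_dev": numeric.std(),
        "min": numeric.min(),
        "max": numeric.max(),
    })
    # No variance at all, or heavy missingness, deserves a closer look
    report["flag"] = (report["std_dev"] == 0) | (report["share_missing"] > 0.2)
    return report.sort_values("share_missing", ascending=False)
```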
Distribution checks - Surveys

Variable             Flagging constraint
Plot size (ha)       < 0.1 or > 3
Production (kg)      > 1000
Hourly labor wage    < 20 or > 200

Range checks can be programmed into survey instruments, but other distributional checks need to be done in data checks
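The same flagging constraints, expressed as a data check. The column names `plot_size_ha`, `production_kg`, and `hourly_wage` are placeholders for whatever the survey export actually uses.

```python
import pandas as pd

def flag_ranges(df: pd.DataFrame) -> pd.DataFrame:
    """Return observations that fall outside the expected ranges above."""
    flags = pd.DataFrame(index=df.index)
    flags["plot_size"] = (df["plot_size_ha"] < 0.1) | (df["plot_size_ha"] > 3)
    flags["production"] = df["production_kg"] > 1000
    flags["wage"] = (df["hourly_wage"] < 20) | (df["hourly_wage"] > 200)

    # Keep only the flagged rows, to be reviewed with the field team
    return df[flags.any(axis=1)]
```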
Distribution checks - Secondary
Consistency checks

❑ Do values of related variables within the same survey contradict each other?
❑ Are values from the same respondent / unit consistent over time?
❑ Are values consistent with relevant data from other sources (if available)?
Consistency checks - Surveys

Across time periods for the same respondent:

HHID   Survey     Plot   Plot size (ha)
1001   Baseline   1      0.5
1001   Endline    1      3

Across variables for the same respondent:

HHID   Plot   Plot used   Crops cultivated
1001   1      Yes         No

Consistency checks are typically difficult to program into a survey instrument and are best captured in data quality checks
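A sketch of both checks from the example, assuming hypothetical long-format files and an illustrative factor-of-two threshold for plot-size changes between waves.

```python
import pandas as pd

plots = pd.read_csv("plot_panel.csv")  # hhid, plot, survey (Baseline/Endline), plot_size_ha
crops = pd.read_csv("plot_crops.csv")  # hhid, plot, plot_used, crops_cultivated

# Across time: plots whose reported size changes by more than a factor of two
wide = plots.pivot_table(index=["hhid", "plot"], columns="survey",
                         values="plot_size_ha").reset_index()
wide["ratio"] = wide["Endline"] / wide["Baseline"]
size_flags = wide[(wide["ratio"] > 2) | (wide["ratio"] < 0.5)]

# Across variables: plots reported as used but with no crops cultivated
use_flags = crops[(crops["plot_used"] == "Yes") & (crops["crops_cultivated"] == "No")]
```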
Consistency checks - Secondary
Survey-specific checks: surveyor effects

Check for meaningful performance differences across data collectors
• Number of surveys completed per day
• Number of refusals
• Survey duration (in total or by module)
• “Don’t know” responses
• Number of flags in the data quality check process (outliers, inconsistencies)
• Responses to questions that determine loops

Useful to provide targeted feedback and identify poor performers
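A sketch of a per-enumerator summary; the column names (`enumerator`, `submission_date`, `refused`, `duration_min`, `n_dont_know`) are assumptions about how the submissions data is structured, and the two-standard-deviation cutoff is illustrative.

```python
import pandas as pd

subs = pd.read_csv("submissions.csv", parse_dates=["submission_date"])

by_enum = subs.groupby("enumerator").agg(
    surveys_per_day=("submission_date", lambda d: d.count() / d.nunique()),
    refusal_rate=("refused", "mean"),
    mean_duration_min=("duration_min", "mean"),
    dont_know_per_survey=("n_dont_know", "mean"),
)

# Enumerators more than two standard deviations from the team average on any measure
z_scores = (by_enum - by_enum.mean()) / by_enum.std()
print(by_enum[(z_scores.abs() > 2).any(axis=1)])
```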
Survey-specific checks: smart programming

Smart questionnaire design (link)
• Include skip patterns
• Use clever constraints (link)
• Remember to test the survey form extensively!

Audio audits (link)
• Audio recordings that take place during a survey interview, without an indication that the recording has been initiated

Text audits (link)
• Details on how much time was spent on each question and the sequence in which the survey was filled in

Speed limits (link)
• Define the minimum number of seconds that should be spent on any given field, using the minimum_seconds column

Sensor metadata
• Collect additional metadata such as light and sound levels, or whether a conversation is taking place
Data quality checks: when?

When? Surveys

Daily (or as often as responses are received from the field), starting from the first week of data collection
• An early start is critical to demonstrate that data quality will be taken very seriously
• Detect misunderstandings before they become bad habits
• Identify any fraud early and ensure there are clear consequences
• Allows for revisits to respondents while enumerators are still in the field

Later? Revisits are more expensive, and there is potential recall bias
When? Secondary

As soon as data is received; prompt clarifications ensure data is fully usable

Why not wait until final analysis?
• The person familiar with the database may leave the data provider, or the provider may cease operating altogether
• The code for data extraction or aggregation may be deleted by the counterpart that wrote it
• The website that was scraped may change or be taken offline
• Servers may have been wiped
Real-time checks improve secondary data

After setting up a microsensor to collect air quality data, our team checked the incoming data every day. One day we saw that no data had been collected. Upon checking the sensor in person, we discovered that after a power outage the sensor had not properly resumed collecting data and needed to be reset. Only one day of data was lost, as opposed to weeks if data checks had been less frequent.
Field validation for surveys

• Back-checks: verify a subset of information from the full survey through a brief follow-up survey with the original respondent, for a randomized subset of the sample
• Spot-checks: unannounced interview accompaniments, to confirm first-hand that the enumerator is following survey protocols and understands the survey questions well
• Random audio audits: record parts (or all) of the interview for independent verification, with respondent consent
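A minimal back-check comparison; the file names, the `hhid` key, and the list of re-asked variables are all hypothetical.

```python
import pandas as pd

original = pd.read_csv("survey_data.csv")
backcheck = pd.read_csv("backcheck_data.csv")
check_vars = ["hh_size", "plot_size_ha", "owns_phone"]  # variables re-asked at back-check

merged = original.merge(backcheck, on="hhid", suffixes=("_orig", "_bc"))

# Mismatch rate per variable (missing values count as mismatches in this simple version)
for var in check_vars:
    mismatch = (merged[f"{var}_orig"] != merged[f"{var}_bc"]).mean()
    print(f"{var}: {mismatch:.0%} of back-checked answers differ")
```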
Further reading

❑ Development Research in Practice, ch. 4
❑ https://dimewiki.worldbank.org/wiki/Identity_Checks
❑ How can we improve the quality of big data for development economics research? (worldbank.org)
❑ https://dimewiki.worldbank.org/Monitoring_Data_Quality
❑ https://dimewiki.worldbank.org/High_Frequency_Checks
❑ https://osf.io/54dbn (Continuing Education Session on Data Quality)
❑ https://dimewiki.worldbank.org/Data_Quality_Assurance_Plan
❑ https://dimewiki.worldbank.org/Duplicates_and_Survey_Logs
❑ https://dimewiki.worldbank.org/wiki/Back_Checks
❑ https://www.povertyactionlab.org/resource/data-quality-checks
❑ https://www.surveycto.com/blog/audio-audits-best-practices/


Thank you!
