
Fall 2023

DIME Analytics

Reproducible Research Fundamentals
September 25-29, 2023

Ensuring Data Quality

Maria Ruth Jones & Marc-Andrea Fiorina


Key takeaways

Empirical research results are only as trustworthy as the data used
• “High-quality” data faithfully reflect the reality on the ground

Teams should develop data quality protocols *before* receiving data
• Then data quality checks can happen in real time (upon collection / receipt)

Data quality checks are not just for surveys!
• Equally important for secondary and big data
• Same principles apply
Data quality protocol

• What checks will be performed?
• Who will run the checks?
• When and how often will checks be run?
• What platform will be used to share results with the research team?
• In what way will results be shared with the data provider?
Data Quality Checks

What to check

Completeness
• Do all expected data points actually appear in the dataset?
• Are there duplicates that need to be resolved?

Distribution
• Do all data points fall within an expected range?
• Are there any unusual, implausible, or unexpected missing values?

Consistency
• Do values align across variables in a way that is internally consistent?
• Are values consistent with expectations, given knowledge of the local context?

Survey-specific checks
• Surveyor effects: are there significant differences in key variables across data collectors?
• Is the instrument programmed to maximize data quality?
Completeness checks

First, verify the data has a unique identifier: a variable (or combination of variables) that distinguishes each entity described in the dataset
✓ ID variable is unique (different for every row)
✓ ID variable is never missing
✓ Any duplicates in ID variable are identified and resolved

Then, verify that data is complete across time, space, and units of observation: each observation must be unique and fully identified


Completeness checks - Surveys

Compare actual responses to the sampling frame
• Compare server data to field logs
• Check for refusals and attrition

Potential issues
• Surveys not sent to the server
• Miscommunication on sample assignments
• Attrition
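A sketch of the comparison against the sampling frame. The file names `sample_frame.csv` and `submissions.csv` and the `hhid` key are assumptions about how the project stores its data.

```python
import pandas as pd

frame = pd.read_csv("sample_frame.csv")       # one row per household assigned to the sample
submissions = pd.read_csv("submissions.csv")  # one row per interview received on the server

# Sampled households with no interview on the server yet
missing = frame[~frame["hhid"].isin(submissions["hhid"])]
print(f"{len(missing)} of {len(frame)} sampled households not yet surveyed")

# Interviews that are not in the sampling frame (possible mis-assignment)
extra = submissions[~submissions["hhid"].isin(frame["hhid"])]
print(f"{len(extra)} submissions outside the sampling frame")
```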



Completeness checks – Secondary

Check completeness compared to expected coverage, e.g. time, space, units


Visuals are very helpful
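As an illustration, with regularly spaced secondary data the observed unit-period combinations can be compared against the full expected grid. This sketch assumes a hypothetical daily sensor file with `station` and `date` columns.

```python
import pandas as pd

df = pd.read_csv("sensor_readings.csv", parse_dates=["date"])

# Full grid of station-day combinations expected over the study period
all_days = pd.date_range(df["date"].min(), df["date"].max(), freq="D")
expected = pd.MultiIndex.from_product(
    [df["station"].unique(), all_days], names=["station", "date"]
)

# Station-days that should be present but are not
observed = pd.MultiIndex.from_frame(df[["station", "date"]].drop_duplicates())
gaps = expected.difference(observed)
print(f"{len(gaps)} missing station-days out of {len(expected)} expected")
```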
Distribution checks

Assess variance
• Either too much or too little variance may indicate issues

Flag any unusual or implausible values (outliers)
• All observations should fall within the expected range
• Expectations are based on contextual knowledge and data from other sources

Check for missing values
• Are there unexpected patterns? Questions that aren’t being asked?
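A compact sketch of these checks for the numeric variables in a dataset; the 20% missingness threshold is purely illustrative.

```python
import pandas as pd

def distribution_report(df: pd.DataFrame) -> pd.DataFrame:
    """Summarise spread and missingness for every numeric variable."""
    numeric = df.select_dtypes("number")
    report = pd.DataFrame({
        "share_missing": numeric.isna().mean(),
        "std_dev": numeric.std(),
        "min": numeric.min(),
        "max": numeric.max(),
    })
    # No variance at all, or heavy missingness, deserves a closer look
    report["flag"] = (report["std_dev"] == 0) | (report["share_missing"] > 0.2)
    return report.sort_values("share_missing", ascending=False)
```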
Distribution checks - Surveys

Variable             Flagging constraint
Plot size (ha)       < 0.1 or > 3
Production (kg)      > 1000
Hourly labor wage    < 20 or > 200

Range checks can be programmed into survey instruments, but other distributional checks need to be done in data checks
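The same flagging constraints, expressed as a data check. The column names `plot_size_ha`, `production_kg`, and `hourly_wage` are placeholders for whatever the survey export actually uses.

```python
import pandas as pd

def flag_ranges(df: pd.DataFrame) -> pd.DataFrame:
    """Return observations that fall outside the expected ranges above."""
    flags = pd.DataFrame(index=df.index)
    flags["plot_size"] = (df["plot_size_ha"] < 0.1) | (df["plot_size_ha"] > 3)
    flags["production"] = df["production_kg"] > 1000
    flags["wage"] = (df["hourly_wage"] < 20) | (df["hourly_wage"] > 200)

    # Keep only the flagged rows, to be reviewed with the field team
    return df[flags.any(axis=1)]
```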
Distribution checks - Secondary
Consistency checks

❑ Do values of related variables within the same survey contradict each other?
❑ Are values from the same respondent / unit consistent over time?
❑ Are values consistent with relevant data from other sources (if available)?
Consistency checks - Surveys

Across time periods for the same respondent:

HHID   Survey     Plot   Plot size (ha)
1001   Baseline   1      0.5
1001   Endline    1      3

Across variables for the same respondent:

HHID   Plot   Plot used   Crops cultivated
1001   1      Yes         No

Consistency checks are typically difficult to program into a survey instrument and are best captured in data quality checks
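A sketch of both checks from the example, assuming hypothetical long-format files and an illustrative factor-of-two threshold for plot-size changes between waves.

```python
import pandas as pd

plots = pd.read_csv("plot_panel.csv")  # hhid, plot, survey (Baseline/Endline), plot_size_ha
crops = pd.read_csv("plot_crops.csv")  # hhid, plot, plot_used, crops_cultivated

# Across time: plots whose reported size changes by more than a factor of two
wide = plots.pivot_table(index=["hhid", "plot"], columns="survey",
                         values="plot_size_ha").reset_index()
wide["ratio"] = wide["Endline"] / wide["Baseline"]
size_flags = wide[(wide["ratio"] > 2) | (wide["ratio"] < 0.5)]

# Across variables: plots reported as used but with no crops cultivated
use_flags = crops[(crops["plot_used"] == "Yes") & (crops["crops_cultivated"] == "No")]
```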
Consistency checks - Secondary
Survey-specific checks: surveyor effects

Check for meaningful performance differences across data collectors
• Number of surveys completed per day
• Number of refusals
• Survey duration (in total or by module)
• “Don’t know” responses
• Number of flags in the data quality check process (outliers, inconsistencies)
• Responses to questions that determine loops

Useful to provide targeted feedback and identify poor performers
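A sketch of a per-enumerator summary; the column names (`enumerator`, `submission_date`, `refused`, `duration_min`, `n_dont_know`) are assumptions about how the submissions data is structured, and the two-standard-deviation cutoff is illustrative.

```python
import pandas as pd

subs = pd.read_csv("submissions.csv", parse_dates=["submission_date"])

by_enum = subs.groupby("enumerator").agg(
    surveys_per_day=("submission_date", lambda d: d.count() / d.nunique()),
    refusal_rate=("refused", "mean"),
    mean_duration_min=("duration_min", "mean"),
    dont_know_per_survey=("n_dont_know", "mean"),
)

# Enumerators more than two standard deviations from the team average on any measure
z_scores = (by_enum - by_enum.mean()) / by_enum.std()
print(by_enum[(z_scores.abs() > 2).any(axis=1)])
```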
Survey-specific checks: smart programming

Smart questionnaire design (link)
• Include skip patterns
• Use clever constraints (link)
• Remember to test the survey form extensively!

Audio audits (link)
• Audio recordings that take place during a survey interview, without an indication that the recording has been initiated

Text audits (link)
• Details on how much time was spent on each question and the sequence in which the survey was filled in

Speed limits (link)
• Define the minimum number of seconds that should be spent on any given field, using the minimum_seconds column

Sensor metadata
• Collect additional metadata such as light and sound levels, or whether a conversation is taking place
Data quality checks: when?

When? Surveys

Daily (or as often as responses are received from the field), starting from the first week of data collection
• An early start is critical to demonstrate that data quality will be taken very seriously
• Detect misunderstandings before they become bad habits
• Identify any fraud early and ensure there are clear consequences
• Allows for revisits to respondents while enumerators are still in the field

Later? Revisits are more expensive, and there is potential recall bias
When? Secondary

As soon as data is received; prompt clarifications ensure data is fully usable

Why not wait until final analysis?
• The person familiar with the database may leave the data provider, or the provider may cease operating altogether
• The code for data extraction or aggregation may be deleted by the counterpart that wrote it
• The website that was scraped may change or be taken offline
• Servers may have been wiped
Real-time checks improve secondary data

After setting up a microsensor to collect air quality data, our team checked the incoming data every day. One day we saw that no data had been collected. Upon checking the sensor in person, we discovered that after a power outage the sensor had not properly resumed collecting data and needed to be reset. Only one day of data was lost, as opposed to weeks if data checks had been less frequent.
Field validation for surveys

• Back-checks: verify a subset of information from the full survey through a brief follow-up survey with the original respondent, for a randomized subset of the sample
• Spot-checks: unannounced interview accompaniments, to confirm first-hand that the enumerator is following survey protocols and understands the survey questions well
• Random audio audits: record parts (or all) of the interview for independent verification, with respondent consent
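A minimal back-check comparison; the file names, the `hhid` key, and the list of re-asked variables are all hypothetical.

```python
import pandas as pd

original = pd.read_csv("survey_data.csv")
backcheck = pd.read_csv("backcheck_data.csv")
check_vars = ["hh_size", "plot_size_ha", "owns_phone"]  # variables re-asked at back-check

merged = original.merge(backcheck, on="hhid", suffixes=("_orig", "_bc"))

# Mismatch rate per variable (missing values count as mismatches in this simple version)
for var in check_vars:
    mismatch = (merged[f"{var}_orig"] != merged[f"{var}_bc"]).mean()
    print(f"{var}: {mismatch:.0%} of back-checked answers differ")
```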
Further reading

❑ Development Research in Practice, ch. 4
❑ https://dimewiki.worldbank.org/wiki/Identity_Checks
❑ How can we improve the quality of big data for development economics research? (worldbank.org)
❑ https://dimewiki.worldbank.org/Monitoring_Data_Quality
❑ https://dimewiki.worldbank.org/High_Frequency_Checks
❑ https://osf.io/54dbn (Continuing Education Session on Data Quality)
❑ https://dimewiki.worldbank.org/Data_Quality_Assurance_Plan
❑ https://dimewiki.worldbank.org/Duplicates_and_Survey_Logs
❑ https://dimewiki.worldbank.org/wiki/Back_Checks
❑ https://www.povertyactionlab.org/resource/data-quality-checks
❑ https://www.surveycto.com/blog/audio-audits-best-practices/


Thank you!
