Data Cleaning
Data cleaning (also called data cleansing or scrubbing) is the act of identifying and correcting (or
removing) inaccurate data from a table, database, or dataset. This process removes or corrects
unreliable, incomplete, irrelevant, or inaccurate parts of the data and restores, remodels, or
removes crude or dirty data. Failure to do so severely undermines the validity and reliability of the
analysis, because cleaning ensures that only high-quality data is used. I followed the
The CPS dataset is very large with 92 variables and over a million observations as shown
below:
The first step involved removing irrelevant data. The aim of the study was to understand the
impact of Medicaid Expansion under the ACA on the Financial Security of Low-Income
Households. This aim involves the study of three variables: fringe product (dependent variable),
newly insured, and Medicaid expansion. Therefore, out of the 92 variables, only three were
considered relevant to the study, namely: bmorder12m: anyone in the household went to a
place other than a bank to purchase a money order; inctot: total personal income; and
inctot identified low-income earners, defined as those who earned less than $49,500 and thus
benefited from the expansion; caidnw identified the beneficiaries of the expansion (the newly
insured). For descriptive statistics, gqtype (household type), metarea (metropolitan area), race (race) and
After dropping the unnecessary columns, it was time to drop the missing values using the
missing values in occupation and race. At this point the dataset had 118,231 entries.
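The two steps described above (keeping only the study variables, then dropping incomplete rows) could be sketched in Stata as follows. This is a minimal illustration under stated assumptions, not the exact commands used in the study: the keep list is assembled from the variable names mentioned in this section, the use of missing() on occ2010 and race is an assumed approach, and the toy rows and the extra_var column are hypothetical.

```stata
* Toy data (hypothetical values) standing in for the CPS extract
clear
input bmorder12m inctot caidnw gqtype metarea race occ2010 extra_var
1 25000 1 0 3360 100 4700 7
2 18000 2 0 1120 200 .    7
2 30000 1 0 5600 .   5000 7
end

* Keep only the variables named in this section (assumed list)
keep bmorder12m inctot caidnw gqtype metarea race occ2010

* Drop rows where occupation or race is missing (assumed approach)
drop if missing(occ2010, race)

* Report how many observations remain
count
```

Running count after each step makes it easy to confirm how many entries a given rule removed.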
The next step involved dropping “not in universe” responses. This response means that
the individual to whom the question was directed was not a member of the target population,
which is known as the universe. “Not in universe” may also mean that the
observation in question does not fulfill the eligibility requirements for inclusion in the analysis
and needs to be deleted. The “not in universe” response was a problem under the fringe products
The command “drop if bmorder12m == 99” deleted all the “not in universe” responses. 40,528
“Not in universe” entries were also a problem in occupation. The command “drop if
occ2010 == 9999” deleted 38,300 observations containing the same. “niu” in metarea were
dropped by “drop if metarea == 9998”. “drop if metarea == 9999” removed the “missing” in
metarea.
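The drop rules quoted above can be collected into one short do-file. The sketch below applies them to a handful of hypothetical rows so that it runs on its own; only the variable names and codes (99, 9999, 9998) come from the text, and the count step before dropping is an added suggestion rather than part of the original workflow.

```stata
* Toy data (hypothetical rows); the codes are the ones quoted in the text
clear
input bmorder12m occ2010 metarea
1  4700 3360
99 4700 3360
2  9999 3360
1  5000 9998
2  5000 9999
end

* Inspect how many rows a rule will remove before applying it
count if bmorder12m == 99

drop if bmorder12m == 99       // "not in universe" for money-order use
drop if occ2010 == 9999        // "not in universe" for occupation
drop if metarea == 9998        // "niu" for metropolitan area
drop if metarea == 9999        // "missing" for metropolitan area

* Only the first toy row survives all four rules
count
```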
According to the table below, the minimum total income is -13,000 and the maximum is
1.00e+09, or 1 billion. These are abnormal values that need to be deleted, because someone
cannot earn -13,000 or 1 billion annually. Also, there are many cases of zero total income.
. summarize inctot
. drop if inctot == 0
(2,845 observations deleted)
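The income screening in this section (removing negative, zero, and implausibly large values, and restricting the sample to the $49,500 low-income threshold) could be expressed as a single range filter. This is a sketch under assumptions, not the commands actually run: the toy incomes are hypothetical, and treating the upper bound as inclusive is an assumption.

```stata
* Toy data (hypothetical incomes) covering each case discussed above
clear
input double inctot
-13000
0
100
22000
49500
60000
1000000000
end

* Keep strictly positive incomes up to the $49,500 threshold (assumed inclusive).
* inrange() is false for missing values, so such rows would be dropped as well.
keep if inrange(inctot, 1, 49500)

* Check the resulting range
summarize inctot
```

A single keep if statement leaves the accepted range visible in one place, whereas separate drop if statements make it easier to see how many observations each rule removes.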
inctot represented low-income earners if they earned less than $49,500 and thus
benefited from the expansion. Therefore, individuals who earned more than $49,500 were
. use "C:\Users\richl\OneDrive\Desktop\Veronique.dta"
. summarize Medicaid_Expansion
Clean Dataset
. summarize
Descriptive Statistics
brief summary of the samples and measures used in a particular research study. It
meaningful way, which in turn enables a simplified interpretation of the data. This study’s aim was to
understand the impact of Medicaid Expansion under the ACA on the Financial Security of Low-
Income Households. It involved the study of three variables: fringe product (dependent variable),
newly insured, and Medicaid expansion. For demographic statistics, household type, metropolitan
area, race, and occupation were retained. The following tables show the descriptive statistics:
Household Type
. tabulate Household_Type
After the cleaning was finished, the dataset had 12,980 individuals. Of these, 59.21% (n = 7,686)
were from husband/wife (neither armed forces) households, 12.33% (n = 1,600) were from
civilian male primary individual households, and 12.00% (n = 1,557) were from households
where the primary family householder was an unmarried civilian male. A further 10.53%
(n = 1,367) were from civilian female primary individual households, and 5.73% (n = 744)
came from households where the primary family householder was an unmarried civilian female.
Only 0.17% (n = 22) were from
Fringe Product
. tabulate Fringe_Product
Fringe Product      Freq.
yes                 1,402
no                 11,578
The majority of participants (89.20%, n = 11,578) have never used a fringe banking
product, while 10.80% (n = 1,402) have used a fringe banking product before.
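The reported shares can be recomputed from the frequencies as a quick consistency check. The sketch below uses only the two counts given above; the local macro names are illustrative.

```stata
* Recompute the reported shares from the tabulated frequencies
local n_yes   = 1402
local n_no    = 11578
local n_total = `n_yes' + `n_no'

display "Total observations: " `n_total'
display "Used a fringe product (%): " %5.2f 100*`n_yes'/`n_total'
display "Never used one (%): " %5.2f 100*`n_no'/`n_total'
```

The total matches the 12,980 individuals in the cleaned dataset, and the two shares round to 10.80% and 89.20%.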
Race
. tabulate race
Race                                   Freq.
white                                 10,405
black                                  1,291
american indian/aleut/eskimo              96
asian only                               894
hawaiian/pacific islander only            52
white-black                               52
white-american indian                     85
white-asian                               52
white-hawaiian/pacific islander           17
black-american indian                     10
black-asian                                3
asian-hawaiian/pacific islander            9
white-black-american indian                8
white-american indian-asian                1
white-asian-hawaiian/pacific isl           5
Medicaid Expansion
The distribution of Medicaid Expansion is roughly bell-shaped. Therefore, even though the
bell shape is not perfect, we can conclude that the variable is approximately normally
distributed.
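A visual judgment of normality like the one above can be complemented with a numeric check. The sketch below is not part of the original analysis: it simulates hypothetical incomes so that it runs on its own, then draws a histogram with a normal curve overlaid and runs Stata's skewness/kurtosis test. With the real data, only the last two commands would be needed.

```stata
* Toy data (hypothetical): 500 simulated incomes between $100 and $49,500
clear
set seed 12345
set obs 500
generate Medicaid_Expansion = 100 + 49400 * runiform()

* Histogram with counts on the y-axis and a normal density overlaid
histogram Medicaid_Expansion, frequency normal

* Skewness/kurtosis test of normality as a complement to the plot
sktest Medicaid_Expansion
```

The simulated values are uniform, so the test would be expected to reject normality here. The point is the workflow, not the toy result.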
[Figure: histogram of Medicaid Expansion; y-axis: Frequency, 0 to 1,000]
. summarize Medicaid_Expansion
According to the summary statistics, the lowest earner in the dataset earned $100 per year
and the highest earner earned $49,500. The average income was $22,573.17.
Newly Insured
. tabulate Newly_Insured
Newly Insured       Freq.
no                 10,826
yes                 2,154