Data Cleaning

Data cleaning (also called cleansing or scrubbing) is the act of identifying and correcting (or
removing) inaccurate data from a table, database or dataset. This process removes or corrects
unreliable, unfinished, non-relevant, or inaccurate parts of the data and restores, remodels, or
removes crude or dirty data. Skipping this step hugely undermines the validity and reliability of
the results, because cleaning is what makes sure that only the best quality of data is used for
analysis. I followed the steps below to clean the dataset.
Dropping Irrelevant Columns
The CPS dataset is very large with 92 variables and over a million observations as shown
below:
The first step involved removing irrelevant data. The aim of the study was to understand the
impact of Medicaid Expansion under the ACA on the Financial Security of Low-Income
Households. This aim involves three variables: fringe product (the dependent variable),
newly insured and Medicaid expansion. Therefore, out of the 92 variables, only three were
considered central to the study, namely: bmorderm12m: anyone in the household went to a
place other than a bank to purchase a money order; inctot: total personal income; and
caidnw: current Medicaid coverage. Fringe product was represented by bmorderm12m;
inctot identified low-income earners, defined as those who earned less than $49,500 and thus
benefited from the expansion; caidnw represented the beneficiaries of the expansion (newly
insured). For descriptive statistics, gqtype (household type), metarea (metropolitan area),
race (race) and occ2010 (occupation) were retained.
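The column selection described above can be sketched in Stata as follows. This is a minimal illustration on a toy dataset, not the original do-file: the three observations and the extra variable age are invented, and the money-order variable is spelled bmorder12m, as it appears in the drop commands in the next step.

```stata
* Toy stand-in for the CPS extract (all values are invented for illustration)
clear
input gqtype metarea race occ2010 bmorder12m inctot caidnw age
1 3360 100 4700 1 25000 2 34
1 5600 200 9999 2 61000 1 51
3 9998 100 2310 99 0 1 27
end

* Keep the three study variables plus the four descriptive variables;
* every other column (here only the hypothetical "age") is discarded
keep bmorder12m inctot caidnw gqtype metarea race occ2010

* Confirm that only the seven retained variables remain
describe
```

Using keep with an explicit variable list, instead of dropping the other 85 variables one by one, keeps the command short and makes the retained set easy to audit.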
Dropping Empty Rows
After dropping the unnecessary columns, the next step was to drop observations with missing
values using the following commands: drop if missing(gqtype), drop if missing(metarea), drop if
missing(bmorder12m), drop if missing(inctot), and drop if missing(caidnw). There were no
missing values in occupation and race. At this point the dataset had 118,231 entries.
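A minimal sketch of this step on invented data is shown below. The four toy observations are assumptions made for illustration. The misstable summarize call is an optional check that reports how many values are missing per variable before anything is dropped, and the single missing() call with several arguments is an equivalent shorthand for the five separate drop commands.

```stata
* Toy data with some missing values (invented for illustration)
clear
input gqtype metarea bmorder12m inctot caidnw
1 3360 1 25000 2
. 5600 2 61000 1
3 9998 . 18000 1
1 4480 2 . .
end

* Report how many values are missing in each variable before dropping
misstable summarize gqtype metarea bmorder12m inctot caidnw

* missing() returns 1 if ANY of its arguments is missing, so this single
* command is equivalent to the five separate drop commands
drop if missing(gqtype, metarea, bmorder12m, inctot, caidnw)

* Only the first (complete) observation is left in this toy example
count
```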
Dropping Unnecessary Rows
The next step involved dropping “not in universe” responses. This response means that
the individual to whom the question was directed was not a member of the target population,
which is known as the universe. In other words, the observation in question does not fulfill the
eligibility requirements to be included in the analysis and needs to be deleted. The “not in
universe” response was a problem in the fringe products variable, as shown below:

The command “drop if bmorder12m == 99” deleted all the “not in universe” responses. 40,528
observations were deleted.
Fringe Product     Freq.   Percent      Cum.
yes                8,916     11.47     11.47
no                68,787     88.53    100.00
Total             77,703    100.00
“Not in universe” entries were also a problem in occupation. The command “drop if
occ2010 == 9999” deleted 38,300 such observations. The “niu” entries in metarea were
dropped with “drop if metarea == 9998”, and “drop if metarea == 9999” removed the “missing”
entries in metarea.
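The code-based drops in this step can be sketched on a toy dataset as follows. The six observations are invented. The codes 99, 9999 and 9998 are the ones quoted in the text, and inlist() is used only as a compact way of writing the two metarea commands together.

```stata
* Toy data containing the special codes named in the text (values invented)
clear
input bmorder12m occ2010 metarea
1 4700 3360
2 9999 5600
99 2310 4480
2 5240 9998
1 430 9999
2 4220 1120
end

* Tabulate first to see how many observations carry each special code
tabulate bmorder12m

* Drop "not in universe" responses in the fringe product variable (code 99)
drop if bmorder12m == 99
* Drop "not in universe" responses in occupation (code 9999)
drop if occ2010 == 9999
* Drop "niu" (9998) and "missing" (9999) codes in metropolitan area
drop if inlist(metarea, 9998, 9999)

* Two observations (the first and the last) survive in this toy example
count
```

Tabulating each variable before dropping makes it possible to confirm that the number of deleted observations reported by Stata matches the frequency of the special code.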
Dealing with Outliers
According to the table below, the minimum total income is -13,000 and the maximum is
1.00e+09, or 1 billion. These are abnormal values that need to be deleted, because a person
cannot plausibly earn -13,000 or 1 billion annually. There are also many cases of zero total
income.
. summarize inctot
Variable           Obs        Mean   Std. Dev.     Min       Max
inctot          29,803    2.12e+08    4.09e+08  -13000  1.00e+09
[Figure: histogram titled “Total Personal Income”; x-axis “Medicaid Expansion”, 0 to 1,000,000,000; y-axis “Frequency”, 0 to 25,000]
. drop if inctot == 0
(2,845 observations deleted)
. drop if inctot == 999999999
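The outlier checks above can be sketched on invented values as shown below. Two points are assumptions rather than part of the original output: the toy variable is stored as long so that the equality test against 999999999 is exact (a float cannot hold that value exactly), and the final drop of negative incomes is a guess at how the negative values were removed, since that command does not appear in the output.

```stata
* Toy income values, including the extremes discussed in the text (invented)
clear
input long inctot
-13000
0
0
18000
42000
75000
999999999
end

* Percentiles and extreme values make the implausible entries visible
summarize inctot, detail

* Count each suspicious group before removing it
count if inctot == 999999999
count if inctot == 0
count if inctot < 0

drop if inctot == 0
drop if inctot == 999999999
* Assumption: negative incomes are removed with a command like this one
drop if inctot < 0

* Only the three plausible incomes remain
summarize inctot
```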

Dropping High Income Earners
inctot represented low-income earners if they earned less than $49,500 and thus
benefited from the expansion. Therefore, individuals who earned more than $49,500 were
dropped from the dataset. 7,164 observations were deleted.
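A small sketch of the income cutoff on invented values follows. It also illustrates a Stata detail worth keeping in mind: a missing value (.) is treated as larger than any number, so a "greater than" condition would silently drop missing incomes as well if any were still in the data.

```stata
* Toy incomes around the cutoff, plus one missing value (all invented)
clear
input inctot
12000
49500
49501
88000
.
end

* Drops 49501 and 88000, and also the missing row, because Stata treats
* missing (.) as larger than any number
drop if inctot > 49500

* Two observations remain: 12000 and 49500 (the cutoff itself is kept)
list inctot
```

Because the condition is strictly greater than 49,500, an income of exactly $49,500 stays in the data, which is consistent with the maximum of 49500 in the summary output.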

. use "C:\Users\richl\OneDrive\Desktop\Veronique.dta"
. drop if Medicaid_Expansion > 49500

. summarize Medicaid_Expansion
Variable           Obs        Mean   Std. Dev.     Min       Max
Medicaid_E~n    12,980    22573.17    13269.86     100     49500
Clean Dataset
. summarize
Variable           Obs        Mean   Std. Dev.     Min       Max
Household_~e    12,980    2.727581    2.285918       1        10
Metropolitan    12,980    4699.918    2648.198      80      9997
Fringe_Pro~t    12,980    1.891988    .3104079       1         2
race            12,980    164.6957    168.6873     100       813
Occupation      12,980    3842.953    2607.213      10      9750
Medicaid_E~n    12,980    22573.17    13269.86     100     49500
Newly_Insu~d    12,980    1.165948    .3720479       1         2
Descriptive Statistics
Descriptive statistics aim to describe the variables involved in a study. They provide a
brief summary of the sample and the measures used in a particular research study. They
facilitate data visualization and enable data to be presented in an understandable and
meaningful way, which in turn simplifies data interpretation. This study’s aim was to
understand the impact of Medicaid Expansion under the ACA on the Financial Security of
Low-Income Households. It involved three variables: fringe product (the dependent variable),
newly insured and Medicaid expansion. For demographic statistics, household type, metropolitan
area, race and occupation were retained. The following tables show the descriptive statistics:
Household Type
[Figure: horizontal bar chart of Household Type; x-axis: frequency, 0 to 8,000]
. tabulate Household_Type
Household Type                             Freq.   Percent      Cum.
husband/wife primary family (neither ar    7,685     59.21     59.21
husband/wife primary family (either/bot       22      0.17     59.38
unmarried civilian male - primary famil      744      5.73     65.11
unmarried civilian female - primary fam    1,557     12.00     77.10
civilian male primary individual           1,600     12.33     89.43
civilian female primary individual         1,367     10.53     99.96
group quarters with family                     4      0.03     99.99
group quarters without family                  1      0.01    100.00
Total                                     12,980    100.00
After the cleaning was finished, the dataset had 12,980 individuals. Of these, 59.21% (n = 7,685)
were from the husband/wife (neither armed forces) household type. 12.33% (n = 1,600) were from
the civilian male primary individual household type and 10.53% (n = 1,367) were from the civilian
female primary individual household type. 12.00% (n = 1,557) came from households where the
primary family householder was an unmarried civilian female, and 5.73% (n = 744) came from
households where the primary family householder was an unmarried civilian male. Only 0.17%
(n = 22) were from the husband/wife (either or both armed forces) household type.
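As a quick arithmetic check, the percentages quoted for household type can be recomputed from the frequencies and the total of 12,980 in the table. The short sketch below uses Stata's display command as a calculator; it needs no data in memory.

```stata
* Recompute selected percentages from the frequencies in the table
* husband/wife, neither in armed forces: 7,685 of 12,980
display %5.2f 100 * 7685 / 12980
* unmarried civilian female, primary family: 1,557 of 12,980
display %5.2f 100 * 1557 / 12980
* unmarried civilian male, primary family: 744 of 12,980
display %5.2f 100 * 744 / 12980
```

The three results, 59.21, 12.00 and 5.73, match the Percent column of the tabulate output.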
Fringe Product
. tabulate Fringe_Product
Fringe Product     Freq.   Percent      Cum.
yes                1,402     10.80     10.80
no                11,578     89.20    100.00
Total             12,980    100.00

[Figure: horizontal bar chart of Fringe Product; x-axis: frequency, 0 to 15,000]
The majority of the participants (89.20%, n = 11,578) did not report using a fringe banking
product, while 10.80% of the individuals (n = 1,402) reported using one.
Race
. tabulate race
race                                       Freq.   Percent      Cum.
white                                     10,405     80.16     80.16
black                                      1,291      9.95     90.11
american indian/aleut/eskimo                  96      0.74     90.85
asian only                                   894      6.89     97.73
hawaiian/pacific islander only                52      0.40     98.14
white-black                                   52      0.40     98.54
white-american indian                         85      0.65     99.19
white-asian                                   52      0.40     99.59
white-hawaiian/pacific islander               17      0.13     99.72
black-american indian                         10      0.08     99.80
black-asian                                    3      0.02     99.82
asian-hawaiian/pacific islander                9      0.07     99.89
white-black-american indian                    8      0.06     99.95
white-american indian-asian                    1      0.01     99.96
white-asian-hawaiian/pacific islander          5      0.04    100.00
Total                                     12,980    100.00

[Figure: horizontal bar chart of Race; x-axis: frequency, 0 to 10,000]
80.16% (n = 10,405) of the participants were white, followed by black (9.95%, n =
1,291), and Asian only (6.89%, n = 894).
Medicaid Expansion
The histogram of Medicaid Expansion is roughly bell-shaped. Therefore, even though the
bell shape is not perfect, the variable can be treated as approximately normally
distributed.
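A visual impression of a bell shape can be backed up with a few standard checks. The sketch below is only an illustration on simulated data: the variable income, the seed, the sample size and the parameters of the simulated distribution are all assumptions, chosen to loosely resemble the range reported for Medicaid Expansion.

```stata
* Simulated incomes (invented for illustration), restricted to 100-49500
clear
set seed 12345
set obs 500
generate income = rnormal(22500, 13000)
keep if inrange(income, 100, 49500)

* Skewness and kurtosis (a normal distribution has skewness 0, kurtosis 3)
summarize income, detail

* Histogram with a normal density overlaid for visual comparison
histogram income, frequency normal

* Skewness-kurtosis test of normality
sktest income
```

Running the same three commands on Medicaid_Expansion would show how far the variable departs from normality once incomes are cut off at $49,500.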
[Figure: histogram of Medicaid Expansion; x-axis “Medicaid Expansion”, 0 to 50,000; y-axis “Frequency”, 0 to 1,000]
The table below shows the summary statistics of the variable.
. summarize Medicaid_Expansion
Variable           Obs        Mean   Std. Dev.     Min       Max
Medicaid_E~n    12,980    22573.17    13269.86     100     49500
According to the summary statistics, the lowest earner in the dataset earned $100 per year
and the highest earner earned $49,500. The average income was $22,573.17.
Newly Insured
. tabulate Newly_Insured
Newly Insured      Freq.   Percent      Cum.
no                10,826     83.41     83.41
yes                2,154     16.59    100.00
Total             12,980    100.00

[Figure: horizontal bar chart of Newly Insured; x-axis: frequency, 0 to 10,000]
