Data Cleaning


Data cleaning (also called data cleansing or scrubbing) is the act of identifying and correcting (or removing) inaccurate data from a table, database, or dataset. The process removes or corrects unreliable, incomplete, irrelevant, or inaccurate parts of the data. Skipping it undermines the validity and reliability of the analysis, because cleaning is what ensures that only good-quality data are used. I followed the steps below to clean the dataset.

Dropping Irrelevant Columns

The CPS dataset is very large, with 92 variables and over a million observations.

The first step was to remove irrelevant data. The aim of the study was to understand the impact of Medicaid Expansion under the ACA on the Financial Security of Low-Income Households. This aim involves three variables: fringe product (the dependent variable), newly insured, and Medicaid expansion. Therefore, out of the 92 variables, only three were central to the study: bmorderm12m (anyone in the household went to a place other than a bank to purchase a money order), inctot (total personal income), and caidnw (current Medicaid coverage). Fringe product use was represented by bmorderm12m. inctot identified low-income earners, defined as those earning up to $49,500, who therefore benefited from the expansion. caidnw identified the beneficiaries of the expansion (the newly insured). For descriptive statistics, gqtype (household type), metarea (metropolitan area), race (race), and occ2010 (occupation) were also retained.
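The column selection described above can be sketched in Stata with keep, which retains the listed variables and drops all others. The variable names follow the text; the file name cps_extract.dta is a hypothetical placeholder. The money-order variable is written bmorderm12m as in this paragraph, although the commands quoted later spell it bmorder12m, so use whichever name the extract actually contains.

```stata
* Sketch only: the file name is a hypothetical placeholder.
use "cps_extract.dta", clear

* Confirm the starting size (the text reports 92 variables).
describe, short

* Keep the three study variables plus the four descriptive ones.
keep bmorderm12m inctot caidnw gqtype metarea race occ2010

* Confirm that seven variables remain.
describe, short
```

Listing the variables to keep, rather than the 85 to drop, keeps the command short and makes the retained set explicit.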

Dropping Empty Rows

After dropping the unnecessary columns, the next step was to drop observations with missing values using the following commands: drop if missing(gqtype), drop if missing(metarea), drop if missing(bmorder12m), drop if missing(inctot), and drop if missing(caidnw). There were no missing values in occupation and race. At this point the dataset had 118,231 entries.
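The five separate commands can also be written as one, since Stata's missing() function accepts several arguments and returns true if any of them is missing. The sketch below first reports the missing counts per variable and then applies the combined rule. It assumes the variable names exactly as they appear in the commands quoted above.

```stata
* Report how many values are missing in each variable before dropping.
misstable summarize gqtype metarea bmorder12m inctot caidnw race occ2010

* Count the observations that have a missing value in any of the five.
count if missing(gqtype, metarea, bmorder12m, inctot, caidnw)

* Drop them in one pass; equivalent to the five separate drop commands.
drop if missing(gqtype, metarea, bmorder12m, inctot, caidnw)

* The text reports 118,231 observations remaining at this point.
count
```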

Dropping Unnecessary Rows

The next step was to drop “not in universe” responses. This response means that the individual to whom the question was directed was not a member of the target population, which is known as the universe. “Not in universe” can also mean that the observation does not meet the eligibility requirements for inclusion in the analysis, so it needs to be deleted. The “not in universe” response was a problem in the fringe products variable. The command “drop if bmorder12m == 99” deleted all the “not in universe” responses; 40,528 observations were deleted, leaving the 77,703 tabulated below.

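Fringe Product |      Freq.     Percent        Cum.
---------------+-----------------------------------
           yes |      8,916       11.47       11.47
            no |     68,787       88.53      100.00
---------------+-----------------------------------
         Total |     77,703      100.00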

“Not in universe” entries were also a problem in occupation. The command “drop if occ2010 == 9999” deleted 38,300 such observations. The “niu” entries in metarea were dropped with “drop if metarea == 9998”, and “drop if metarea == 9999” removed the “missing” entries in metarea.
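Before deleting coded responses, it helps to confirm which numeric codes sit behind the value labels and how many observations each rule would remove. The following is a minimal sketch of that check, using the codes quoted above (99, 9999, and 9998/9999); it is an illustration, not the exact sequence that was run.

```stata
* Show the labels, then the numeric codes behind them.
tabulate bmorder12m, missing
tabulate bmorder12m, nolabel missing

* Count what each rule would remove. These counts are taken before any
* drop, so they can overlap and need not match the sequential totals.
count if bmorder12m == 99
count if occ2010 == 9999
count if inlist(metarea, 9998, 9999)

* Apply the same rules as the commands quoted above.
drop if bmorder12m == 99
drop if occ2010 == 9999
drop if inlist(metarea, 9998, 9999)
```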

Dealing with Outliers

According to the table below, the minimum total income is -13,000 and the maximum is 1.00e+09, or 1 billion. These are abnormal values that need to be deleted, because a person cannot plausibly earn -13,000 or 1 billion annually. There are also many cases of zero total income.

. summarize inctot

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
      inctot |     29,803    2.12e+08    4.09e+08     -13000   1.00e+09

[Figure: histogram “Total Personal Income” (y-axis: Frequency, 0 to 25,000; x-axis: Medicaid Expansion, 0 to 1,000,000,000).]

. drop if inctot == 0
(2,845 observations deleted)

. drop if inctot == 999999999
(6,328 observations deleted)
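The log above shows only the zero and 999999999 values being dropped. The later summary reports a minimum of 100, so negative incomes must also have been removed at some point, although that command is not shown. A minimal sketch that handles all three cases in one rule could look like this; dropping negatives here is an assumption, not a logged step.

```stata
* Look at the tails of the distribution before deleting anything.
summarize inctot, detail
count if inctot < 0
count if inctot == 0
count if inctot == 999999999

* Assumption: negative and zero incomes are excluded along with the
* 999999999 value.
drop if inctot <= 0 | inctot == 999999999

* Guard against leftovers; assert stops with an error if any observation
* violates the condition (missing values were already dropped earlier).
assert inctot > 0 & inctot < 999999999
```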

Dropping High Income Earners

inctot identified low-income earners, defined as those earning up to $49,500, who therefore benefited from the expansion. Individuals who earned more than $49,500 were therefore dropped from the dataset; 7,164 observations were deleted.


. use "C:\Users\richl\OneDrive\Desktop\Veronique.dta"

. drop if Medicaid_Expansion > 49500
(7,164 observations deleted)

. summarize Medicaid_Expansion

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
Medicaid_E~n |     12,980    22573.17    13269.86        100      49500
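The log refers to Medicaid_Expansion, whereas the earlier commands use inctot, so the variable appears to have been renamed in between. The sketch below makes that step explicit and stores the income cutoff in a local macro. The rename, the label text, and the macro name income_cap are assumptions made for illustration. Run it as a single do-file so the local macro stays defined.

```stata
* Assumption: inctot was renamed before the log excerpt above.
rename inctot Medicaid_Expansion
label variable Medicaid_Expansion "Medicaid Expansion"

* Store the cutoff once so the rule and the check cannot drift apart.
local income_cap 49500
drop if Medicaid_Expansion > `income_cap'

* Missing values sort above every number in Stata, so the drop above
* would also remove any remaining missing incomes.
assert Medicaid_Expansion <= `income_cap'
summarize Medicaid_Expansion
```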

Clean Dataset

. summarize

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
Household_~e |     12,980    2.727581     2.285918         1         10
Metropolitan |     12,980    4699.918     2648.198        80       9997
Fringe_Pro~t |     12,980    1.891988     .3104079         1          2
        race |     12,980    164.6957     168.6873       100        813
  Occupation |     12,980    3842.953     2607.213        10       9750
-------------+---------------------------------------------------------
Medicaid_E~n |     12,980    22573.17     13269.86       100      49500
Newly_Insu~d |     12,980    1.165948     .3720479         1          2

Descriptive Statistics

Descriptive statistics describe the variables involved in a study. They provide a brief summary of the sample and the measures used, support data visualization, and present the data in an understandable and meaningful way, which in turn simplifies interpretation. This study’s aim was to understand the impact of Medicaid Expansion under the ACA on the Financial Security of Low-Income Households. It involved three variables: fringe product (the dependent variable), newly insured, and Medicaid expansion. For demographic statistics, household type, metropolitan area, race, and occupation were retained. The following tables show the descriptive statistics:

Household Type

[Figure: horizontal bar chart “Household Type” showing the frequency of each category (x-axis: frequency, 0 to 8,000); the counts are the same as in the tabulation that follows.]

. tabulate Household_Type

                         Household Type |      Freq.     Percent        Cum.
----------------------------------------+-----------------------------------
husband/wife primary family (neither ar |      7,685       59.21       59.21
husband/wife primary family (either/bot |         22        0.17       59.38
unmarried civilian male - primary famil |        744        5.73       65.11
unmarried civilian female - primary fam |      1,557       12.00       77.10
       civilian male primary individual |      1,600       12.33       89.43
     civilian female primary individual |      1,367       10.53       99.96
             group quarters with family |          4        0.03       99.99
          group quarters without family |          1        0.01      100.00
----------------------------------------+-----------------------------------
                                  Total |     12,980      100.00

After the cleaning was finished, the dataset had 12,980 individuals. Of these, 59.21% (n = 7,685) were from the husband/wife (neither armed forces) household type. 12.33% (n = 1,600) were from the civilian male primary individual household type, and 12.00% (n = 1,557) were from households where the primary family householder was an unmarried civilian female. 10.53% (n = 1,367) were from the civilian female primary individual household type, and 5.73% (n = 744) came from households where the primary family householder was an unmarried civilian male. Only 0.17% (n = 22) were from the husband/wife (either or both armed forces) household type, and 0.04% (n = 5) lived in group quarters.
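The frequency tables and bar charts in this section can be produced with two commands per variable. The sketch below uses Household_Type as the example and assumes a recent Stata version in which graph hbar accepts the (count) statistic without a variable list. The same pattern would apply to Fringe_Product, race, and Newly_Insured.

```stata
* Frequency table with counts, percentages, and cumulative percentages.
tabulate Household_Type

* Horizontal bar chart of the same counts, with the count printed at the
* end of each bar.
graph hbar (count), over(Household_Type) blabel(bar) ytitle("frequency")

* Spot-check one percentage from the table by hand.
display %5.2f 100 * 7685 / 12980
```

Checking a reported percentage against the raw counts is a quick way to catch transcription slips between the table and the write-up.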

Fringe Product

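. tabulate Fringe_Product

Fringe Product |      Freq.     Percent        Cum.
---------------+-----------------------------------
           yes |      1,402       10.80       10.80
            no |     11,578       89.20      100.00
---------------+-----------------------------------
         Total |     12,980      100.00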


[Figure: horizontal bar chart “Fringe Product” (x-axis: frequency, 0 to 15,000): yes = 1,402; no = 11,578.]

The majority of the participants (89.20%, n = 11,578) had not used a fringe banking product, while 10.80% of the individuals (n = 1,402) had used one.

Race

. tabulate race

                                 race |      Freq.     Percent        Cum.
--------------------------------------+-----------------------------------
                                white |     10,405       80.16       80.16
                                black |      1,291        9.95       90.11
         american indian/aleut/eskimo |         96        0.74       90.85
                           asian only |        894        6.89       97.73
       hawaiian/pacific islander only |         52        0.40       98.14
                          white-black |         52        0.40       98.54
                white-american indian |         85        0.65       99.19
                          white-asian |         52        0.40       99.59
      white-hawaiian/pacific islander |         17        0.13       99.72
                black-american indian |         10        0.08       99.80
                          black-asian |          3        0.02       99.82
      asian-hawaiian/pacific islander |          9        0.07       99.89
          white-black-american indian |          8        0.06       99.95
          white-american indian-asian |          1        0.01       99.96
white-asian-hawaiian/pacific islander |          5        0.04      100.00
--------------------------------------+-----------------------------------
                                Total |     12,980      100.00


[Figure: horizontal bar chart “Race” showing the frequency of each category (x-axis: frequency, 0 to 10,000); the counts are the same as in the tabulation above.]

Most of the participants were white (80.16%, n = 10,405), followed by black (9.95%, n = 1,291) and Asian only (6.89%, n = 894).

Medicaid Expansion

The distribution of Medicaid Expansion is roughly bell-shaped. Although the shape is not a perfect bell, the variable can be treated as approximately normally distributed.
[Figure: histogram “Medicaid Expansion” (y-axis: Frequency, 0 to 1,000; x-axis: Medicaid Expansion, 0 to 50,000).]
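A visual impression of normality can be supplemented with a few standard checks. The commands below are a sketch of what one might run; the original analysis does not report them. With nearly 13,000 observations, formal tests tend to reject normality even for small departures, so the plots and the skewness and kurtosis values are usually more informative than the p-value.

```stata
* Histogram on a frequency scale with a normal density overlaid.
histogram Medicaid_Expansion, frequency normal

* Detailed summary, including skewness and kurtosis
* (a normal distribution has skewness 0 and kurtosis 3).
summarize Medicaid_Expansion, detail

* Skewness and kurtosis test of normality.
sktest Medicaid_Expansion

* Quantile-normal plot: points close to the diagonal suggest normality.
qnorm Medicaid_Expansion
```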

The table below shows the summary statistics of the variable.

. summarize Medicaid_Expansion

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
Medicaid_E~n |     12,980    22573.17     13269.86       100      49500

According to the summary statistics, the lowest earner in the dataset earned $100 per year and the highest earner earned $49,500. The average income was $22,573.17.

Newly Insured
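. tabulate Newly_Insured

      Newly |
    Insured |      Freq.     Percent        Cum.
------------+-----------------------------------
         no |     10,826       83.41       83.41
        yes |      2,154       16.59      100.00
------------+-----------------------------------
      Total |     12,980      100.00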

[Figure: horizontal bar chart “Newly Insured” (x-axis: frequency, 0 to 10,000): no = 10,826; yes = 2,154.]
