Data Preparation

Description of Data sources:

The dataset we use was extracted from the U.S. Census database for the year 1994. The database is
updated once every decade and records general characteristics of the country's population, such as
age, gender, marital status, education, and employment.

We are trying to predict the income level of a person, that is, whether it is above or below
50 thousand a year. The prediction is based on the demographic attributes of the citizens.
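The prediction target can be encoded as a binary indicator. The following is a minimal sketch in R that uses only the first ten income labels shown in the structure listing at the end of this section; the variable name high_income is hypothetical and not part of the dataset.

```r
# First ten income labels, copied from the structure listing of the dataset.
income <- factor(
  c("<=50K", "<=50K", ">50K", ">50K", "<=50K",
    "<=50K", "<=50K", ">50K", "<=50K", "<=50K"),
  levels = c("<=50K", ">50K")
)

# Hypothetical 0/1 target: 1 when income is above 50 thousand a year.
high_income <- as.integer(income == ">50K")

# Class counts and the share of the positive class in this small sample.
print(table(income))
print(mean(high_income))
```

A 0/1 encoding like this is one common choice for a two-class problem; many R modelling functions also accept the two-level factor directly.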

Age: the age of the citizen as recorded in the database. This is a continuous variable.

Work class: the type of employer or employment arrangement of the citizen. This is a categorical
variable with the following categories:

Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay,


Final Weight is the number of units in the target population that the responding unit represents. This
is a continuous variable.
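To illustrate what a weight of this kind means, the sketch below compares an unweighted and a weighted share of the higher income class. It uses only the first ten fnlwgt and income values from the structure listing at the end of this section. Whether the weights should be used this way in the actual analysis is an assumption, and the variable names are hypothetical.

```r
# First ten final weights and income labels, copied from the structure listing.
fnlwgt <- c(226802, 89814, 336951, 160323, 103497,
            198693, 227026, 104626, 369667, 104996)
income <- c("<=50K", "<=50K", ">50K", ">50K", "<=50K",
            "<=50K", "<=50K", ">50K", "<=50K", "<=50K")

# Unweighted share: every respondent counts once.
unweighted_share <- mean(income == ">50K")

# Weighted share: every respondent counts as many times as the number of
# people in the target population that he or she represents.
weighted_share <- sum(fnlwgt[income == ">50K"]) / sum(fnlwgt)

print(c(unweighted = unweighted_share, weighted = weighted_share))
```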

Education of the citizen is also a categorical variable. The categories, along with the numeric code
given for each, are listed below:

1. Preschool
2. 1st-4th
3. 5th-6th
4. 7th-8th
5. 9th
6. 10th
7. 11th
8. 12th
9. HS-grad
10. Some-college
11. Assoc-voc
12. Assoc-acdm
13. Bachelors
14. Masters
15. Prof-school
16. Doctorate
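Because the codes above follow the order of the education levels, the variable can be treated as an ordered factor. The sketch below is one possible way to do this in R, using the first ten educational.num values from the structure listing at the end of this section; the object names are hypothetical.

```r
# Education categories in the order of their numeric codes (1 to 16).
education_levels <- c(
  "Preschool", "1st-4th", "5th-6th", "7th-8th", "9th", "10th", "11th", "12th",
  "HS-grad", "Some-college", "Assoc-voc", "Assoc-acdm", "Bachelors",
  "Masters", "Prof-school", "Doctorate"
)

# First ten numeric codes, copied from the structure listing.
educational_num <- c(7, 9, 12, 10, 10, 6, 9, 15, 10, 4)

# Look up the label for each code and keep the ordering of the levels.
education <- factor(education_levels[educational_num],
                    levels = education_levels, ordered = TRUE)

print(as.character(education))

# With an ordered factor, comparisons between levels are meaningful.
print(education >= "HS-grad")
```

An ordered factor keeps the labels readable while still allowing comparisons such as "at least HS-grad"; using the numeric code directly is an equally valid alternative.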

Marital Status is a categorical variable categorized into:

Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-

Occupation is a categorical variable categorizing occupation into:

Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners,
Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv,

Relationship is a categorical variable split into:

Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.

Race is also a categorical variable:

White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.

Sex (gender) is also a categorical variable, classified into Male and Female.

Capital-gain is the income earned from the sale of an asset. This is a continuous variable.

Capital-loss is the difference between a lower selling price and a higher purchase price, resulting
in a financial loss for the seller. This is also a continuous variable.

Hours-per-week is the number of hours each citizen works in a week.

Native country is a categorical variable with the following categories:

United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan,
Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland,
France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland,
Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.

The income level, recorded as either at most 50 thousand (<=50K) or more than 50 thousand (>50K), is
also present in the dataset.

## 'data.frame': 48842 obs. of 15 variables:
## $ age : int 25 38 28 44 18 34 29 63 24 55 ...
## $ workclass : Factor w/ 9 levels "?","Federal-gov",..: 5 5 3 5 1 5 1 7 5 5 ...
## $ fnlwgt : int 226802 89814 336951 160323 103497 198693 227026 104626 369667 104996 ...
## $ education : Factor w/ 16 levels "10th","11th",..: 2 12 8 16 16 1 12 15 16 6 ...
## $ educational.num: int 7 9 12 10 10 6 9 15 10 4 ...
## $ marital.status : Factor w/ 7 levels "Divorced","Married-AF-spouse",..: 5 3 3 3 5 5 5 3 5 3 ...
## $ occupation : Factor w/ 15 levels "?","Adm-clerical",..: 8 6 12 8 1 9 1 11 9 4 ...
## $ relationship : Factor w/ 6 levels "Husband","Not-in-family",..: 4 1 1 1 4 2 5 1 5 1 ...
## $ race : Factor w/ 5 levels "Amer-Indian-Eskimo",..: 3 5 5 3 5 5 3 5 5 5 ...
## $ gender : Factor w/ 2 levels "Female","Male": 2 2 2 2 1 2 2 2 1 2 ...
## $ capital.gain : int 0 0 0 7688 0 0 0 3103 0 0 ...
## $ capital.loss : int 0 0 0 0 0 0 0 0 0 0 ...
## $ hours.per.week : int 40 50 40 40 30 30 40 32 40 10 ...
## $ : Factor w/ 42 levels "?","Cambodia",..: 40 40 40 40 40 40 40 40 40 40 ...
## $ income : Factor w/ 2 levels "<=50K",">50K": 1 1 2 2 1 1 1 2 1 1 ...
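Several factors in the listing above include "?" as a level. Assuming that "?" marks a missing value (the text does not say so explicitly), the sketch below shows one way such entries could be recoded as NA. It runs on a small toy data frame patterned on the first rows of the listing rather than on the full data, and the object name adult is hypothetical.

```r
# Toy data frame patterned on the first five rows of the listing above.
# In practice the full data would be read from a file, for example with read.csv().
adult <- data.frame(
  age = c(25L, 38L, 28L, 44L, 18L),
  workclass = c("Private", "Private", "Local-gov", "Private", "?"),
  income = c("<=50K", "<=50K", ">50K", ">50K", "<=50K"),
  stringsAsFactors = TRUE
)

# Recode "?" as NA in every factor column and drop the now unused level.
is_fac <- vapply(adult, is.factor, logical(1))
adult[is_fac] <- lapply(adult[is_fac], function(x) {
  x[x == "?"] <- NA
  droplevels(x)
})

# Inspect the structure and count the missing values per column.
str(adult)
print(colSums(is.na(adult)))
```

Recoding to NA makes the missing entries visible to functions such as is.na() and na.omit(); keeping "?" as its own category is another option if missingness itself may be informative.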

