Data Preparation
The dataset we used was extracted from the U.S. Census database for the year 1994. The
database is updated once every decade to record general parameters of the country's population, such as age,
gender, marital status, education, and employment.
We are trying to predict the income level of a person, that is, whether the person earns above or below
50 thousand a year. This prediction is based on the demographic attributes of the citizens.
Work class describes the type of employer or employment of the citizen (for example private,
self-employed, or government). It is a categorical variable.
Final Weight is the number of units in the target population that the responding unit represents. This
is a continuous variable.
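To illustrate how the final weight can be used, the following minimal sketch estimates the share of the target population that satisfies a condition by summing weights rather than counting rows. The field name `fnlwgt`, the label strings, and the two sample records are hypothetical and chosen only for illustration.

```python
def weighted_share(records, predicate):
    """Share of the target population matching `predicate`, using final weights."""
    total = sum(r["fnlwgt"] for r in records)
    matching = sum(r["fnlwgt"] for r in records if predicate(r))
    return matching / total


# Hypothetical respondents: each row stands for `fnlwgt` people in the population.
sample = [
    {"fnlwgt": 100, "income": ">50K"},
    {"fnlwgt": 300, "income": "<=50K"},
]

share = weighted_share(sample, lambda r: r["income"] == ">50K")
```

With these two made-up rows, half of the respondents are in the higher income group, but the weighted share is smaller because the second respondent represents more people.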
Education of the citizen is also a categorical variable. The categories, with the code assigned to
each, are listed below:
1. Preschool
2. 1st-4th
3. 5th-6th
4. 7th-8th
5. 9th
6. 10th
7. 11th
8. 12th
9. HS-grad
10. Some-college
11. Assoc-voc
12. Assoc-acdm
13. Bachelors
14. Masters
15. Prof-school
16. Doctorate
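The coding listed above can be expressed as a simple lookup table. This is a minimal sketch; the names `EDUCATION_LEVELS`, `EDUCATION_CODE`, and `encode_education` are hypothetical helpers, not part of the dataset.

```python
# Education categories in the order of their codes (1 = Preschool ... 16 = Doctorate).
EDUCATION_LEVELS = [
    "Preschool", "1st-4th", "5th-6th", "7th-8th", "9th", "10th", "11th", "12th",
    "HS-grad", "Some-college", "Assoc-voc", "Assoc-acdm", "Bachelors", "Masters",
    "Prof-school", "Doctorate",
]

# Map each category label to its code, starting at 1.
EDUCATION_CODE = {name: code for code, name in enumerate(EDUCATION_LEVELS, start=1)}


def encode_education(label):
    """Return the code for an education label, or None if the label is unknown."""
    return EDUCATION_CODE.get(label.strip())
```

For example, `encode_education("HS-grad")` returns 9, and an unrecognised label returns `None` so that missing or misspelled values can be handled explicitly.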
Sex or gender is also a categorical variable, classified into male and female.
Capital-gain is the income earned from the sale of an asset. This is a continuous variable.
Capital loss is the difference between a lower selling price and a higher purchase price, resulting
in a financial loss for the seller. This is also represented as a continuous variable.
The income level, recorded as either greater than 50 thousand or less than or equal to 50 thousand,
is also present in the dataset. This is the variable we are trying to predict.
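For modelling, the income level is usually turned into a binary target. The sketch below assumes the raw labels are the strings `>50K` and `<=50K`, possibly with surrounding spaces or a trailing period; this label format and the function name `encode_income` are assumptions and should be checked against the actual file.

```python
def encode_income(label):
    """Return 1 for income above 50 thousand, 0 otherwise (assumed label format)."""
    cleaned = label.strip().rstrip(".")
    if cleaned == ">50K":
        return 1
    if cleaned == "<=50K":
        return 0
    # Fail loudly on anything unexpected instead of guessing a class.
    raise ValueError(f"unexpected income label: {label!r}")


labels = [" >50K", "<=50K", "<=50K."]
targets = [encode_income(x) for x in labels]
```

Raising an error on unknown labels is a deliberate choice here, so that unexpected values surface during data preparation rather than silently distorting the target.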