Data Processing


B

Let us see the effect of turning the price variable into a categorical variable. First, create a new
variable that categorizes price into 20 bins. Then repartition the data, keeping Binned_Price
instead of Price. Run a classification tree (CT) with the same set of input variables as in the
regression tree (RT), and with Binned_Price as the output variable. Set the minimum number of
records in a terminal node to 1.
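
The terminal-node requirement corresponds to the minbucket setting if the tree is fit with rpart, as in the code further down. The following is a minimal sketch of how that setting could be passed, using the built-in iris data purely as a stand-in; the object names fit_default and fit_min1 are illustrative and not part of the exercise.

```r
library(rpart)

# iris is a built-in dataset used only for illustration (not the car data)
fit_default <- rpart(Species ~ ., data = iris, method = "class")

# Assumption: "minimum records in a terminal node" maps to minbucket
fit_min1 <- rpart(Species ~ ., data = iris, method = "class",
                  control = rpart.control(minbucket = 1))

# The control settings actually used are stored on the fitted object
default_minbucket <- fit_default$control$minbucket  # rpart default: round(20 / 3) = 7
min1_minbucket    <- fit_min1$control$minbucket     # 1, as requested
```

Note that minsplit and cp still limit how far the tree grows, so setting minbucket = 1 alone does not guarantee single-record leaves.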

# create bins based on the car prices as categories
# determine the bin breakpoints from the min and max prices
bins <- seq(min(car.df$Price),
            max(car.df$Price),
            (max(car.df$Price) - min(car.df$Price)) / 20)
bins
##  [1]  4350.0  5757.5  7165.0  8572.5  9980.0 11387.5 12795.0 14202.5 15610.0
## [10] 17017.5 18425.0 19832.5 21240.0 22647.5 24055.0 25462.5 26870.0 28277.5
## [19] 29685.0 31092.5 32500.0
# use .bincode to determine the bin assignments
# NOTE: if you view Binned_Price you can see the assignments
Binned_Price <- .bincode(car.df$Price,
                         bins,
                         include.lowest = TRUE)
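
To see why include.lowest = TRUE matters, here is a small sketch on made-up prices (toy_price and toy_breaks are hypothetical, not taken from car.df). By default .bincode uses right-closed intervals, so without include.lowest the minimum value falls outside the first bin and is coded NA.

```r
# Hypothetical toy prices, not taken from car.df
toy_price  <- c(100, 150, 200, 250, 300)

# Five breakpoints define four equal-width bins: 100, 150, 200, 250, 300
toy_breaks <- seq(min(toy_price), max(toy_price), length.out = 5)

# Intervals are (100,150], (150,200], (200,250], (250,300];
# include.lowest = TRUE also closes the first interval on the left
with_lowest    <- .bincode(toy_price, toy_breaks, include.lowest = TRUE)
without_lowest <- .bincode(toy_price, toy_breaks, include.lowest = FALSE)

with_lowest     # 1 1 2 3 4
without_lowest  # NA 1 2 3 4
```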

# convert Binned_Price to a factor for classification
Binned_Price <- as.factor(Binned_Price)
Binned_Price
## [1] 7 7 7 8 7 7 9 11 13 7 12 12 11 13 13 13 14 10 9 9 9 9 9 9
## [25] 9 9 10 9 9 10 7 9 9 8 8 9 9 8 9 8 7 9 7 9 9 11 10 9
## [49] 10 13 10 9 12 13 8 7 8 8 11 9 8 9 11 10 10 9 11 8 13 9 9 7
## [73] 11 9 12 9 11 11 9 8 11 10 8 10 9 10 8 10 9 13 9 13 12 9 11 12
## [97] 9 9 11 10 11 9 11 11 11 9 11 10 10 20 19 20 15 15 14 15 13 10 11 13
## [121] 11 12 9 11 9 13 9 9 10 9 9 9 9 9 9 9 11 9 14 12 9 14 12 11
## [145] 11 9 12 15 11 12 10 12 11 11 13 9 11 11 11 11 11 12 11 11 10 12 12 12
## [169] 12 10 10 14 11 11 13 12 11 12 13 13 11 11 12 13 10 10 2 4 6 3 6 1
## [193] 1 6 7 6 6 8 4 6 6 5 5 5 7 6 6 5 6 6 7 8 6 6 7 5
## [217] 7 5 5 7 6 6 6 8 6 7 6 6 6 6 6 7 6 7 6 6 5 7 7 6
## [241] 5 6 6 7 6 7 6 7 7 6 6 5 6 8 4 7 7 6 6 7 6 6 7 6
## [265] 6 6 6 6 8 5 7 7 7 7 7 6 7 6 6 8 7 7 7 7 6 7 6 4
## [289] 6 7 6 7 5 6 7 5 7 7 7 7 6 6 7 6 7 6 4 7 6 6 7 7
## [313] 6 6 4 7 7 5 4 6 6 5 7 5 7 6 5 7 7 6 5 7 6 6 6 6
## [337] 7 6 6 6 6 6 8 6 7 8 7 7 7 6 6 4 6 6 8 7 6 8 6 8
## [361] 7 6 6 7 7 5 5 6 6 7 5 7 6 7 7 6 6 7 2 2 2 3 4 3
## [385] 4 4 5 4 3 4 3 3 4 1 4 4 4 6 5 5 4 5 1 5 4 4 5 6
## [409] 4 6 3 5 4 6 5 4 4 5 4 4 5 4 4 6 4 4 6 6 5 7 6 5
## [433] 5 5 5 5 6 4 5 6 6 5 6 6 6 5 6 5 6 4 5 6 6 6 6 4
## [457] 5 5 4 5 4 6 5 4 4 6 4 6 7 5 5 4 4 6 5 4 5 4 5 6
## [481] 6 6 6 4 4 5 5 4 6 4 5 5 4 6 6 5 6 5 5 4 4 6 4 5
## [505] 4 6 6 6 5 5 6 6 7 5 5 5 6 5 5 6 4 6 4 11 6 5 6 4
## [529] 5 7 4 5 5 6 7 6 5 4 5 6 5 6 5 5 7 5 6 4 5 6 5 5
## [553] 7 5 6 5 6 7 5 7 5 5 4 7 4 5 5 5 5 7 7 6 5 6 4 6
## [577] 6 6 6 6 6 5 4 5 5 7 4 7 4 4 5 5 4 5 5 5 5 5 5 7
## [601] 5 3 4 2 3 2 3 3 2 1 2 3 3 3 3 2 4 2 3 3 4 2 4 4
## [625] 3 4 4 4 3 3 3 4 4 4 4 4 5 3 5 4 4 4 3 5 4 4 3 2
## [649] 3 4 4 3 4 4 2 3 4 3 4 5 3 4 4 4 4 3 4 4 4 4 2 3
## [673] 3 4 2 4 4 4 4 4 3 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4
## [697] 6 4 5 4 3 4 3 5 3 4 5 3 4 4 4 3 4 4 3 3 4 4 4 3
## [721] 3 3 4 3 2 3 5 4 4 4 6 5 5 4 5 5 4 4 4 4 3 5 4 4
## [745] 3 3 3 5 4 4 5 5 3 4 4 4 4 4 3 5 3 3 4 4 5 5 4 4
## [769] 5 3 3 3 4 5 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3
## [793] 3 4 6 4 6 4 4 3 4 5 4 5 4 4 4 3 3 4 3 4 4 5 4 4
## [817] 3 4 4 5 4 3 4 5 2 4 2 4 4 4 4 4 5 5 4 4 5 4 4 5
## [841] 5 4 5 4 3 4 5 5 4 4 3 4 3 4 4 4 4 3 3 4 4 5 4 3
## [865] 4 4 5 4 4 5 4 4 5 4 4 4 4 4 3 4 5 4 4 3 4 4 5 4
## [889] 5 4 3 6 4 3 5 4 3 4 4 4 3 4 4 4 4 4 4 4 3 3 4 4
## [913] 4 7 4 5 3 5 4 4 5 4 4 3 4 4 5 4 4 5 4 5 5 4 4 5
## [937] 5 4 4 5 4 4 5 3 5 5 6 4 3 4 3 4 3 4 4 4 5 4 4 4
## [961] 4 4 4 4 5 4 4 4 4 5 4 5 4 3 5 4 4 4 4 4 5 4 4 3
## [985] 4 4 3 4 5 4 5 3 4 4 3 4 4 4 4 5 4 4 3 5 4 4 5 5
## [1009] 4 4 4 4 4 3 5 5 4 4 5 4 6 3 5 5 5 4 5 4 5 5 4 5
## [1033] 5 5 5 6 4 5 4 5 4 5 5 4 2 2 2 1 1 2 3 2 2 1 4 2
## [1057] 2 2 5 3 3 2 2 2 1 2 4 2 3 3 3 3 2 3 2 1 2 2 3 4
## [1081] 3 4 4 2 3 3 2 3 2 3 4 3 3 1 3 2 3 3 3 3 3 2 2 3
## [1105] 3 3 3 3 3 4 3 3 3 1 2 2 2 3 4 3 3 3 3 4 3 2 2 4
## [1129] 3 3 3 4 2 4 3 2 2 2 4 3 2 3 3 4 3 2 2 3 2 3 4 3
## [1153] 3 3 2 3 2 4 2 4 3 3 3 4 4 4 3 2 3 4 2 2 3 2 3 4
## [1177] 4 3 3 4 3 2 4 3 4 2 3 3 3 3 2 3 2 3 3 4 4 4 3 4
## [1201] 4 3 2 3 3 2 3 3 3 3 3 3 3 2 4 4 3 3 4 3 3 3 3 3
## [1225] 4 3 2 3 3 4 4 2 3 3 4 3 3 2 3 2 4 4 3 3 2 3 3 2
## [1249] 3 3 4 3 3 2 3 3 3 3 3 4 3 4 4 3 3 4 2 3 4 4 3 2
## [1273] 3 2 4 3 3 4 3 3 3 3 3 4 4 3 3 3 4 3 3 3 3 3 2 3
## [1297] 3 2 3 4 3 2 3 3 3 4 3 4 3 4 4 4 4 2 3 4 3 3 3 3
## [1321] 4 3 4 4 3 2 3 4 2 3 4 2 3 5 2 4 3 4 3 4 4 2 3 3
## [1345] 4 3 3 3 4 2 3 2 3 3 4 2 3 3 3 4 4 2 3 2 3 3 3 4
## [1369] 4 3 4 4 2 3 4 3 3 4 4 3 4 3 2 5 4 3 4 3 4 4 3 4
## [1393] 3 3 3 4 4 3 4 4 3 4 5 2 3 3 4 3 4 3 3 3 4 4 3 2
## [1417] 4 4 3 3 3 3 3 3 3 3 4 4 3 4 3 3 5 3 3 2
## Levels: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 19 20
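
The Levels line lists 17 levels rather than 20 because as.factor keeps only the codes that actually occur (bins 16 to 18 are empty here). The sketch below, on made-up bin codes, shows that behavior and one possible way to keep every bin as a level and summarize the counts instead of printing each record. The names toy_codes, f_observed and f_all are illustrative only.

```r
# Hypothetical bin codes for a 5-bin scheme; bins 3 and 4 are empty
toy_codes <- c(1L, 2L, 2L, 5L)

# as.factor() creates levels only for the codes that occur
f_observed <- as.factor(toy_codes)

# factor(..., levels = ) keeps every bin, including empty ones
f_all <- factor(toy_codes, levels = 1:5)

nlevels(f_observed)  # 3
nlevels(f_all)       # 5

# Counts per bin are usually easier to read than the full vector
bin_counts <- table(f_all)
bin_counts
```

Whether empty bins should be kept as levels is a modeling choice; a classification tree cannot predict a class it never sees in training either way.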
#add the binned price to the data frame based on the indexes
#this will allow us to identify the training and validation data frames
train.df$Binned_Price <- Binned_Price[train.index]
valid.df$Binned_Price <- Binned_Price[-train.index]

B.i
Compare the tree generated by the CT with the one generated by the RT. Are they different?
(Look at structure, the top predictors, size of tree, etc.) Why?

# Binned_Price is a factor, so rpart fits a classification tree
tr.binned <- rpart(Binned_Price ~ Age_08_04 + KM + Fuel_Type +
                     HP + Automatic + Doors + Quarterly_Tax +
                     Mfr_Guarantee + Guarantee_Period + Airco +
                     Automatic_airco + CD_Player + Powered_Windows +
                     Sport_Model + Tow_Bar, data = train.df)
# plot the tree (prp comes from the rpart.plot package)
prp(tr.binned)
t(t(tr.binned$variable.importance))
##                        [,1]
## Age_08_04        140.372139
## KM                72.439706
## CD_Player         25.003200
## Automatic_airco   19.178101
## Airco             18.794593
## Quarterly_Tax     17.627902
## Sport_Model       15.479258
## HP                 8.338291
## Powered_Windows    7.547571
## Fuel_Type          4.051562
## Mfr_Guarantee      3.774845
## Doors              3.273578
## Guarantee_Period   1.002471
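
To put the CT and RT side by side on structure, top predictor and size, one option is to pull the same few summaries from each fitted rpart object. The sketch below uses the built-in mtcars data as a stand-in for the car data. The helper tree_summary and all object names are hypothetical, and this is only one possible way to make the comparison.

```r
library(rpart)

# mtcars is a built-in dataset standing in for the car data (illustration only)
d <- mtcars
breaks <- seq(min(d$mpg), max(d$mpg), length.out = 5)  # 4 equal-width bins
d$mpg_bin <- as.factor(.bincode(d$mpg, breaks, include.lowest = TRUE))

# Regression tree on the numeric outcome, classification tree on the binned one
rt <- rpart(mpg ~ cyl + disp + hp + wt, data = d, method = "anova")
ct <- rpart(mpg_bin ~ cyl + disp + hp + wt, data = d, method = "class")

# Hypothetical helper: size, root split and top-ranked predictor of a tree
tree_summary <- function(fit) {
  list(
    leaves         = sum(fit$frame$var == "<leaf>"),      # terminal nodes
    root_split     = as.character(fit$frame$var[1]),      # variable at the root
    top_importance = names(fit$variable.importance)[1]    # NULL if no splits
  )
}

rt_summary <- tree_summary(rt)
ct_summary <- tree_summary(ct)
str(rt_summary)
str(ct_summary)
```

Applied to the trees from this exercise, the same helper would take the RT object and tr.binned as arguments.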
