457 Final Project
What’s cooking?
An, Xingyue
20034032
Introduction:
The goal of this project is to predict the category of a dish's cuisine from its recipe. First, we download the training data and test data that we need. The training data contains 39774 dish id numbers with their ingredients and type of cuisine; the test data consists of 9944 dish id numbers with the lists of their ingredients. The response variable "cuisine" takes twenty different values, including Italian, Spanish, Korean, Brazilian, Russian, Jamaican, Irish, Filipino, British, Vietnamese, and Moroccan; of these, Italian, Mexican, and Southern_us account for the vast majority of recipes. There are also 6714 different ingredients in the dataset. Starting from this dataset, we need to do data preprocessing and then fit and compare classification models.
First of all, I read the JSON dataset into R and obtain the training data and test data. Then I create a vocabulary of all words and a word-occurrence matrix in which each column corresponds to a word (an ingredient) and each row corresponds to a recipe; to be more specific, the (i, j)-th entry is the number of occurrences of ingredient j in the i-th recipe. Next, I remove punctuation and drop all rare ingredients from the dataset to simplify the prediction. Then I convert the training data and test data into matrix and vector form, named Xtrain, Ytrain, Xtest, and testID, and save the new data file for later analysis. A minimal sketch of this step is shown below.
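The following sketch assumes the Kaggle files are named train.json and test.json, that jsonlite is used for parsing, and that a rare-ingredient cutoff of 5 stands in for whatever threshold was actually used; make_X is an illustrative helper, not taken from the original code.

```r
library(jsonlite)

# Assumed file names from the Kaggle competition; adjust paths as needed.
train <- fromJSON("train.json")   # columns: id, cuisine, ingredients (list)
test  <- fromJSON("test.json")    # columns: id, ingredients (list)

# Lower-case and strip punctuation from the ingredient strings
clean <- function(x) gsub("[[:punct:]]", "", tolower(x))
train$ingredients <- lapply(train$ingredients, clean)
test$ingredients  <- lapply(test$ingredients, clean)

# Vocabulary over the training recipes, dropping rare ingredients
# (the cutoff of 5 is illustrative, not the value used in the report)
freq  <- table(unlist(train$ingredients))
vocab <- names(freq)[freq >= 5]

# Occurrence matrix: entry (i, j) = count of ingredient j in recipe i
make_X <- function(recipes, vocab) {
  t(vapply(recipes,
           function(r) tabulate(match(r, vocab), nbins = length(vocab)),
           integer(length(vocab))))
}
Xtrain <- make_X(train$ingredients, vocab); colnames(Xtrain) <- vocab
Xtest  <- make_X(test$ingredients,  vocab); colnames(Xtest)  <- vocab
Ytrain <- factor(train$cuisine)
testID <- test$id
save(Xtrain, Ytrain, Xtest, testID, file = "whats_cooking.RData")
```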
What I did in the preprocessing part can be divided into three steps. First, I convert the Xtrain data and Xtest data into data frames. Second, I remove the columns that differ between Xtrain and Xtest by finding the shared columns first and then dropping the rest. Third, I combine the preprocessed training data with the response factor Cuisine. That is all I did in the preprocessing procedure; a sketch of it follows.
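This column-alignment step might look roughly like the following, under the same assumed object names; train_df and the Cuisine column label are mine, chosen to match the description above rather than the exact code used.

```r
# Keep only the ingredient columns that Xtrain and Xtest share,
# then attach the response factor to the training frame.
Xtrain_df <- as.data.frame(Xtrain)
Xtest_df  <- as.data.frame(Xtest)

common    <- intersect(colnames(Xtrain_df), colnames(Xtest_df))
Xtrain_df <- Xtrain_df[, common]
Xtest_df  <- Xtest_df[, common]

train_df  <- cbind(Xtrain_df, Cuisine = Ytrain)
```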
In this project, I use three different models to predict the data: rpart, random forest, and xgboost.
The rpart program builds classification or regression models of a very general structure using a two-stage procedure, and the resulting models can be represented as binary trees. For this model, I call set.seed(1112) and use the classification method; I pass control = list(cp = 0) to build a tree that is as large as possible and then view model.tree$cptable.
In the next step, I need to prune the tree I built before by finding the cp value at which the relative error or the cross-validated error (xerror) is minimized; I choose the cp value of the minimum xerror and prune the tree. Then I use the new (pruned) tree to predict the response variable of the test data. Finally, I combine the final outcome in a new file and write it out in CSV format. The full workflow is sketched below.
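This sketch reuses the assumed train_df, Xtest_df, and testID objects from the earlier sketches; the seed and the cp = 0 control come from the text, while the submission file name is illustrative.

```r
library(rpart)
set.seed(1112)

# Grow a maximal classification tree (cp = 0), then inspect the cp table
model.tree <- rpart(Cuisine ~ ., data = train_df, method = "class",
                    control = list(cp = 0))
print(model.tree$cptable)

# Prune at the cp value with the smallest cross-validated error (xerror)
best.cp <- model.tree$cptable[which.min(model.tree$cptable[, "xerror"]), "CP"]
pruned  <- prune(model.tree, cp = best.cp)

# Predict the test recipes and write a Kaggle-style submission file
pred <- predict(pruned, newdata = Xtest_df, type = "class")
write.csv(data.frame(id = testID, cuisine = pred),
          "rpart_submission.csv", row.names = FALSE)
```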
Random forest works by creating multiple decision trees and then combining the output generated by each of the decision trees. It can also be used in an unsupervised mode for assessing proximities among data points.
For this model, I use the "ranger" package for classification since it runs much faster than other packages, and I build the ranger model with reference to the "ranger package" R code on onQ. First, I create the data frames as required: in the training data there are 442 feature variables, and the 443rd variable is the response. Then I add one more column to the test data as the response factor. To build the ranger model, I set num.trees equal to 700; the value of mtry should be the square root of the number of feature variables, √442 ≈ 21.02, which lies between 21 and 22, and I choose 22 as the mtry value. Then I use the model to predict the test set; the outcome of ranger.pred is a ranger.prediction object rather than a plain vector, so I extract the predictions, convert them to CSV format, and submit on Kaggle. I also tried mtry = 21, but the score on Kaggle changed only slightly. A sketch of this fit follows.
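The following sketch uses the same assumed object names as above (note that ranger spells the argument num.trees); the seed and output file name are illustrative, not from the original run.

```r
library(ranger)
set.seed(1112)  # for reproducibility of this sketch only

# 700 trees, mtry = 22 ≈ sqrt(442) as chosen above
rf.model <- ranger(Cuisine ~ ., data = train_df,
                   num.trees = 700, mtry = 22)

# predict() returns a ranger.prediction object; the class labels
# live in its $predictions component
ranger.pred <- predict(rf.model, data = Xtest_df)
write.csv(data.frame(id = testID, cuisine = ranger.pred$predictions),
          "ranger_submission.csv", row.names = FALSE)
```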
Extreme Gradient Boosting (xgboost) is similar to the gradient boosting framework but more efficient; it has both a linear model solver and tree learning algorithms. I use it as a prediction model since it is very high in predictive power, although its running speed is relatively low.
I build the xgboost model with reference to the R code on onQ. First, I convert the cuisine factor to an integer class beginning at 0 and transform the training dataset and test dataset into xgb.DMatrix form. Then I set the important parameters, train the model, and write the predictions out for submission; a sketch follows.
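This sketch uses the same assumed objects as before; the eta, max_depth, and nrounds values below are placeholders, not the tuned settings behind the reported score.

```r
library(xgboost)

# Multi-class labels must be 0-based integers
ylabel <- as.integer(Ytrain) - 1
dtrain <- xgb.DMatrix(data = as.matrix(Xtrain_df), label = ylabel)
dtest  <- xgb.DMatrix(data = as.matrix(Xtest_df))

params <- list(objective = "multi:softmax",        # predict a single class id
               num_class = length(levels(Ytrain)),
               eta       = 0.3,                    # illustrative values
               max_depth = 6)
xgb.model <- xgb.train(params = params, data = dtrain, nrounds = 100)

# Map the 0-based class ids back to cuisine names and write the submission
pred <- predict(xgb.model, dtest)
write.csv(data.frame(id = testID, cuisine = levels(Ytrain)[pred + 1]),
          "xgboost_submission.csv", row.names = FALSE)
```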
The score of xgboost on Kaggle is 0.74004, which is better than both the rpart model and the random forest model.
Score on Kaggle:
rpart 0.59583
xgboost 0.74004
References:
https://www.analyticsvidhya.com/blog/2016/01/xgboost-algorithm-easy-steps/
https://www.r-bloggers.com/how-to-implement-random-forests-in-r/