
STAT 457

Final Project
What’s cooking?

An,Xingyue
20034032

Introduction:

The goal of this project is to predict the category of a dish's cuisine from its recipe ingredients; the dataset is provided by Yummly.

From the Kaggle website https://www.kaggle.com/c/whats-cooking/data, we can download the training and test data that we need. The training data contains 39,774 dish IDs together with their ingredients and type of cuisine; the test data contains 9,944 dish IDs with their lists of ingredients. The response variable "cuisine" takes twenty different values: Italian, Mexican, Southern_us, Indian, Chinese, French, Cajun_creole, Japanese, Thai, Greek, Spanish, Korean, Brazilian, Russian, Jamaican, Irish, Filipino, British, Vietnamese and Moroccan. Among these, Italian, Mexican and Southern_us account for the vast majority of recipes. There are also 6,714 different ingredients in the dataset, and "salt" is the most frequent one.

Starting from this dataset, we first preprocess the data and then discuss the classification models.

Preprocessing the dataset:

Preprocessing following the code on onQ:

First of all, I read the JSON dataset into R and obtain the training and test data. Then I create a vocabulary of all words and build word-occurrence matrices, in which each column corresponds to a word and each row corresponds to a recipe; more precisely, the (i, j)-th entry is the number of occurrences of word j in the i-th recipe. Next, I remove punctuation and drop all rare ingredients from the dataset to simplify the prediction. Finally, I convert the training and test data into matrix and vector form, named Xtrain, Ytrain, Xtest and testID, and save the new data file for later analysis.
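A minimal sketch of these steps is shown below. The onQ code itself is not reproduced here, so the file names train.json and test.json, the rare-word threshold of 10 recipes, and the helper names to_words and occurrence_matrix are assumptions rather than the exact code I ran.

library(jsonlite)

# Read the Yummly JSON files (file names are assumed)
train <- fromJSON("train.json")
test  <- fromJSON("test.json")

# Split each recipe's ingredient strings into lower-case words,
# dropping punctuation
to_words <- function(ings) {
  words <- unlist(strsplit(tolower(gsub("[[:punct:]]", " ", ings)), "\\s+"))
  words[words != ""]
}
train_words <- lapply(train$ingredients, to_words)
test_words  <- lapply(test$ingredients, to_words)

# Word-occurrence matrix: the (i, j)-th entry counts word j in recipe i
occurrence_matrix <- function(word_list) {
  vocab <- sort(unique(unlist(word_list)))
  m <- matrix(0L, nrow = length(word_list), ncol = length(vocab),
              dimnames = list(NULL, vocab))
  for (i in seq_along(word_list)) {
    counts <- table(word_list[[i]])
    m[i, names(counts)] <- as.integer(counts)
  }
  m
}
Xtrain <- occurrence_matrix(train_words)
Xtest  <- occurrence_matrix(test_words)

# Drop rare words (a threshold of 10 recipes is an assumption)
Xtrain <- Xtrain[, colSums(Xtrain > 0) >= 10]
Xtest  <- Xtest[, colSums(Xtest > 0) >= 10]

Ytrain <- factor(train$cuisine)
testID <- test$id
save(Xtrain, Ytrain, Xtest, testID, file = "whats_cooking_data.RData")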

Further preprocessing:

My further preprocessing can be divided into three parts. First, I convert Xtrain and Xtest into data frames. Second, I remove the columns that do not appear in both Xtrain and Xtest by finding the common columns and discarding the rest. Third, I combine the preprocessed training data with the response factor cuisine. That completes the preprocessing; in the next step I build the classification models and make predictions.
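A minimal sketch of these three steps, assuming the objects Xtrain, Ytrain and Xtest produced above (the data-frame names train_df and test_df are placeholders):

# Convert the occurrence matrices to data frames
train_df <- as.data.frame(Xtrain)
test_df  <- as.data.frame(Xtest)

# Keep only the columns (words) that appear in both data sets
common   <- intersect(colnames(train_df), colnames(test_df))
train_df <- train_df[, common]
test_df  <- test_df[, common]

# Attach the response factor to the preprocessed training data
train_df$cuisine <- Ytrain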

Descriptions and discussions on the classification models:

In this project, I use three different models to predict the cuisines: rpart, random forest (ranger) and xgboost.

First of all, the rpart model:

The rpart program builds classification or regression models of a very general structure using a two-stage procedure, and the resulting models can be represented as binary trees.

Here is the R code for the rpart model:
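(A sketch reconstructed from the description that follows; the exact submitted code is not reproduced here, so the data-frame names and the output file name are placeholders.)

library(rpart)

set.seed(1112)

# Grow a classification tree as large as possible (cp = 0)
model.tree <- rpart(cuisine ~ ., data = train_df, method = "class",
                    control = list(cp = 0))

# Inspect the complexity-parameter table
print(model.tree$cptable)

# Prune at the cp value with the smallest cross-validated error (xerror)
best.cp <- model.tree$cptable[which.min(model.tree$cptable[, "xerror"]), "CP"]
pruned.tree <- prune(model.tree, cp = best.cp)

# Predict the cuisine of each test recipe and write the submission file
rpart.pred <- predict(pruned.tree, newdata = test_df, type = "class")
submission <- data.frame(id = testID, cuisine = rpart.pred)
write.csv(submission, "rpart_submission.csv", row.names = FALSE)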

In this model, I call set.seed(1112) and fit a classification tree. I use control = list(cp = 0) to grow the tree as large as possible and then inspect model.tree$cptable, which reports the complexity parameter (CP), the relative error and the cross-validated error (xerror) at each tree size.

In the next step, I prune the tree by finding the cp value at which either the relative error or the cross-validated error (xerror) is minimized; I choose the cp value that minimizes xerror and prune the tree at it. Then I use the pruned tree to predict the response variable on the test data. Finally, I combine the predictions with the test IDs in a new file and write it in csv format.

The score of rpart on Kaggle is 0.59583.

Secondly, I will discuss the random forest model:

Random forest is a powerful ensemble machine learning algorithm which works by creating multiple decision trees and then combining the output generated by each of the decision trees. It can also be used in unsupervised mode for assessing proximities among data points.

Here is the R code for the random forest model:
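(A sketch reconstructed from the description that follows; the placeholder response value added to the test data and the output file name are assumptions rather than the exact submitted code.)

library(ranger)

# The training data frame has the word features plus cuisine as its last
# column; give the test data a dummy response column of the same factor
# type so both data frames have the same structure
test_df$cuisine <- factor(levels(train_df$cuisine)[1],
                          levels = levels(train_df$cuisine))

# 700 trees and mtry = 22 (roughly the square root of the feature count)
ranger.model <- ranger(cuisine ~ ., data = train_df,
                       num.trees = 700, mtry = 22)

# predict() returns a ranger prediction object; the class labels are in
# its $predictions component
ranger.pred <- predict(ranger.model, data = test_df)
submission  <- data.frame(id = testID, cuisine = ranger.pred$predictions)
write.csv(submission, "ranger_submission.csv", row.names = FALSE)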

In this model, I use the "ranger" package for classification, since it runs much faster than other packages, and I build the ranger model with reference to the "ranger package" R code on onQ. First, I create the required data frames: the training data has 442 feature variables, and the 443rd variable is the response. I also add one more column to the test data as a response factor so that it has the same structure. To build the ranger model, I set num.trees equal to 700; mtry should be roughly the square root of the number of feature variables, and √442 ≈ 21.02 lies between 21 and 22, so I choose 22 as the mtry value. I then use the model to predict the test set. The output of predict() is a ranger prediction object rather than a plain vector, so I extract the predicted classes, write them in csv format and submit the file on Kaggle.

The score of ranger on Kaggle is 0.72918.

I then tried mtry = 21 for the prediction, but the score on Kaggle changed only slightly.

My last model is xgboost.

Extreme Gradient Boosting (xgboost) is similar to the gradient boosting framework but more efficient. It has both a linear model solver and tree learning algorithms, and what makes it fast is its capacity to do parallel computation on a single machine. I chose it as my prediction model because of its high predictive power, although its running speed in this project was still relatively slow.

Here is the R code for xgboost:
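(A sketch reconstructed from the description that follows; the learning rate, tree depth and number of boosting rounds are assumed values, not necessarily those of the submitted model.)

library(xgboost)

# Class labels must be integers starting at 0
label_train <- as.integer(train_df$cuisine) - 1
num_class   <- length(levels(train_df$cuisine))

# Wrap the feature matrices as xgb.DMatrix objects
feature_cols <- setdiff(colnames(train_df), "cuisine")
dtrain <- xgb.DMatrix(data = as.matrix(train_df[, feature_cols]),
                      label = label_train)
dtest  <- xgb.DMatrix(data = as.matrix(test_df[, feature_cols]))

# Multiclass softmax objective; eta, max_depth and nrounds are assumptions
params <- list(objective   = "multi:softmax",
               eval_metric = "mlogloss",
               num_class   = num_class,
               eta         = 0.3,
               max_depth   = 6)
xgb.model <- xgb.train(params = params, data = dtrain, nrounds = 200)

# Map the predicted integer classes back to cuisine names and write the csv
xgb.pred     <- predict(xgb.model, dtest)
cuisine_pred <- levels(train_df$cuisine)[xgb.pred + 1]
submission   <- data.frame(id = testID, cuisine = cuisine_pred)
write.csv(submission, "xgboost_submission.csv", row.names = FALSE)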


In xgboost classification, I build the model by imitating the code_xgboost_classification example on onQ. First, I convert the cuisine factor to integer class labels beginning at 0 and transform the training and test datasets into xgb.DMatrix objects. Then I set the important parameters and fit the model.

The score of xgboost on Kaggle is 0.74004, which is better than both the rpart and random forest models.

Score on Kaggle:

Classification model    Score on Kaggle
rpart                   0.59583
random forest           0.72918
xgboost                 0.74004

The screenshot of my best performance on Kaggle is shown below:

References:

https://www.analyticsvidhya.com/blog/2016/01/xgboost-algorithm-easy-steps/

https://www.r-bloggers.com/how-to-implement-random-forests-in-r/
