
STAT 457

Final Project
What’s cooking?

An,Xingyue
20034032

Introduction:

The goal of this project is to predict the category of a dish's cuisine from its recipe ingredients; the dataset is provided by Yummly.

From the Kaggle website https://www.kaggle.com/c/whats-cooking/data, we can download the training and test data that we need. The training data contains 39,774 dish IDs together with their ingredients and type of cuisine; the test data contains 9,944 dish IDs with their lists of ingredients. The response variable "cuisine" takes twenty different values: Italian, Mexican, Southern_us, Indian, Chinese, French, Cajun_creole, Japanese, Thai, Greek, Spanish, Korean, Brazilian, Russian, Jamaican, Irish, Filipino, British, Vietnamese and Moroccan. Among these, Italian, Mexican and Southern_us account for the vast majority of recipes. There are also 6,714 different ingredients in the dataset, and "salt" is the most frequent one.

Starting from this dataset, we first preprocess the data and then discuss the classification models.

Preprocessing the dataset:

Preprocessing following the code on onQ:

First of all, I read the JSON dataset into R and obtain the training and test data. Then I create a vocabulary of all words and build word-occurrence matrices, in which each column corresponds to a word and each row corresponds to a recipe; more precisely, the (i, j)-th entry is the number of occurrences of word j in the i-th recipe. Next, I remove punctuation and drop all rare ingredients from the dataset to simplify the prediction. Finally, I convert the training and test data into matrix and vector form, named Xtrain, Ytrain, Xtest and testID, and save the new data file for later analysis.
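A minimal sketch of these steps is shown below. The onQ code itself is not reproduced here, so the file names train.json and test.json, the rare-word threshold of 10 recipes, and the helper names to_words and occurrence_matrix are assumptions rather than the exact code I ran.

library(jsonlite)

# Read the Yummly JSON files (file names are assumed)
train <- fromJSON("train.json")
test  <- fromJSON("test.json")

# Split each recipe's ingredient strings into lower-case words,
# dropping punctuation
to_words <- function(ings) {
  words <- unlist(strsplit(tolower(gsub("[[:punct:]]", " ", ings)), "\\s+"))
  words[words != ""]
}
train_words <- lapply(train$ingredients, to_words)
test_words  <- lapply(test$ingredients, to_words)

# Word-occurrence matrix: the (i, j)-th entry counts word j in recipe i
occurrence_matrix <- function(word_list) {
  vocab <- sort(unique(unlist(word_list)))
  m <- matrix(0L, nrow = length(word_list), ncol = length(vocab),
              dimnames = list(NULL, vocab))
  for (i in seq_along(word_list)) {
    counts <- table(word_list[[i]])
    m[i, names(counts)] <- as.integer(counts)
  }
  m
}
Xtrain <- occurrence_matrix(train_words)
Xtest  <- occurrence_matrix(test_words)

# Drop rare words (a threshold of 10 recipes is an assumption)
Xtrain <- Xtrain[, colSums(Xtrain > 0) >= 10]
Xtest  <- Xtest[, colSums(Xtest > 0) >= 10]

Ytrain <- factor(train$cuisine)
testID <- test$id
save(Xtrain, Ytrain, Xtest, testID, file = "whats_cooking_data.RData")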

Further preprocessing:

My further preprocessing can be divided into three parts. First, I convert Xtrain and Xtest into data frames. Second, I remove the columns that do not appear in both Xtrain and Xtest by finding the common columns and discarding the rest. Third, I combine the preprocessed training data with the response factor cuisine. That completes the preprocessing; in the next step I build the classification models and make predictions.
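A minimal sketch of these three steps, assuming the objects Xtrain, Ytrain and Xtest produced above (the data-frame names train_df and test_df are placeholders):

# Convert the occurrence matrices to data frames
train_df <- as.data.frame(Xtrain)
test_df  <- as.data.frame(Xtest)

# Keep only the columns (words) that appear in both data sets
common   <- intersect(colnames(train_df), colnames(test_df))
train_df <- train_df[, common]
test_df  <- test_df[, common]

# Attach the response factor to the preprocessed training data
train_df$cuisine <- Ytrain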

Descriptions and discussions on the classification models:

In this project, I use three different models to predict the cuisines: rpart, random forest (ranger) and xgboost.

First of all, the rpart model:

The rpart program builds classification or regression models of a very general structure using a two-stage procedure, and the resulting models can be represented as binary trees.

Here is the R code for the rpart model:
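(A sketch reconstructed from the description that follows; the exact submitted code is not reproduced here, so the data-frame names and the output file name are placeholders.)

library(rpart)

set.seed(1112)

# Grow a classification tree as large as possible (cp = 0)
model.tree <- rpart(cuisine ~ ., data = train_df, method = "class",
                    control = list(cp = 0))

# Inspect the complexity-parameter table
print(model.tree$cptable)

# Prune at the cp value with the smallest cross-validated error (xerror)
best.cp <- model.tree$cptable[which.min(model.tree$cptable[, "xerror"]), "CP"]
pruned.tree <- prune(model.tree, cp = best.cp)

# Predict the cuisine of each test recipe and write the submission file
rpart.pred <- predict(pruned.tree, newdata = test_df, type = "class")
submission <- data.frame(id = testID, cuisine = rpart.pred)
write.csv(submission, "rpart_submission.csv", row.names = FALSE)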

In this model, I call set.seed(1112) and fit a classification tree. I use control = list(cp = 0) to grow the tree as large as possible and then inspect model.tree$cptable, which reports the complexity parameter (CP), the relative error and the cross-validated error (xerror) at each tree size.

In the next step, I prune the tree by finding the cp value at which either the relative error or the cross-validated error (xerror) is minimized; I choose the cp value that minimizes xerror and prune the tree at it. Then I use the pruned tree to predict the response variable on the test data. Finally, I combine the predictions with the test IDs in a new file and write it in csv format.

The score of rpart on Kaggle is 0.59583.

Secondly, I will discuss the random forest model:

Random forest is a powerful ensemble machine learning algorithm which works by creating multiple decision trees and then combining the output generated by each of the decision trees. It can also be used in unsupervised mode for assessing proximities among data points.

Here is the R code for the random forest model:
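(A sketch reconstructed from the description that follows; the placeholder response value added to the test data and the output file name are assumptions rather than the exact submitted code.)

library(ranger)

# The training data frame has the word features plus cuisine as its last
# column; give the test data a dummy response column of the same factor
# type so both data frames have the same structure
test_df$cuisine <- factor(levels(train_df$cuisine)[1],
                          levels = levels(train_df$cuisine))

# 700 trees and mtry = 22 (roughly the square root of the feature count)
ranger.model <- ranger(cuisine ~ ., data = train_df,
                       num.trees = 700, mtry = 22)

# predict() returns a ranger prediction object; the class labels are in
# its $predictions component
ranger.pred <- predict(ranger.model, data = test_df)
submission  <- data.frame(id = testID, cuisine = ranger.pred$predictions)
write.csv(submission, "ranger_submission.csv", row.names = FALSE)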

In this model, I use the "ranger" package for classification, since it runs much faster than other packages, and I build the ranger model with reference to the "ranger package" R code on onQ. First, I create the required data frames: the training data has 442 feature variables, and the 443rd variable is the response. I also add one more column to the test data as a response factor so that it has the same structure. To build the ranger model, I set num.trees equal to 700; mtry should be roughly the square root of the number of feature variables, and √442 ≈ 21.02 lies between 21 and 22, so I choose 22 as the mtry value. I then use the model to predict the test set. The output of predict() is a ranger prediction object rather than a plain vector, so I extract the predicted classes, write them in csv format and submit the file on Kaggle.

The score of ranger on Kaggle is 0.72918.

I then tried mtry = 21 for the prediction, but the score on Kaggle changed only slightly.

My last model is xgboost.

Extreme Gradient Boosting (xgboost) is similar to the gradient boosting framework but more efficient. It has both a linear model solver and tree learning algorithms, and what makes it fast is its capacity to do parallel computation on a single machine. I chose it as my prediction model because of its high predictive power, although its running speed in this project was still relatively slow.

Here is the R code for xgboost:
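(A sketch reconstructed from the description that follows; the learning rate, tree depth and number of boosting rounds are assumed values, not necessarily those of the submitted model.)

library(xgboost)

# Class labels must be integers starting at 0
label_train <- as.integer(train_df$cuisine) - 1
num_class   <- length(levels(train_df$cuisine))

# Wrap the feature matrices as xgb.DMatrix objects
feature_cols <- setdiff(colnames(train_df), "cuisine")
dtrain <- xgb.DMatrix(data = as.matrix(train_df[, feature_cols]),
                      label = label_train)
dtest  <- xgb.DMatrix(data = as.matrix(test_df[, feature_cols]))

# Multiclass softmax objective; eta, max_depth and nrounds are assumptions
params <- list(objective   = "multi:softmax",
               eval_metric = "mlogloss",
               num_class   = num_class,
               eta         = 0.3,
               max_depth   = 6)
xgb.model <- xgb.train(params = params, data = dtrain, nrounds = 200)

# Map the predicted integer classes back to cuisine names and write the csv
xgb.pred     <- predict(xgb.model, dtest)
cuisine_pred <- levels(train_df$cuisine)[xgb.pred + 1]
submission   <- data.frame(id = testID, cuisine = cuisine_pred)
write.csv(submission, "xgboost_submission.csv", row.names = FALSE)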


In xgboost classification, I build the model by imitating the code_xgboost_classification example on onQ. First, I convert the cuisine factor to integer class labels beginning at 0 and transform the training and test datasets into xgb.DMatrix objects. Then I set the important parameters and fit the model.

The score of xgboost on Kaggle is 0.74004, which is better than both the rpart and random forest models.

Score on Kaggle:

Classification model    Score on Kaggle
rpart                   0.59583
random forest           0.72918
xgboost                 0.74004

The screenshot of my best performance on Kaggle is shown below:

References:

https://www.analyticsvidhya.com/blog/2016/01/xgboost-algorithm-easy-steps/

https://www.r-bloggers.com/how-to-implement-random-forests-in-r/
