
Applied Machine Learning

Project Report

Submitted By:
Akshay Alok (1510110033)
Dhananjay Dutt (1510110124)
Shreya Tyagi (1510110371)
List of Contents

1. Overview
2. Implementation
3. Data Processing
4. Algorithm Application
5. Results
6. Processed Data
Overview
The dataset provided to us was a rating sheet of several breakfast cereals consumed by people (most
probably in the USA). The data has 77 instances (N) and each instance is populated with 14 distinct
features like sodium and sugar content. The ratings provided are most probably a (weighted) 1 average
of ratings given by a set of people that are thought to be an accurate representation of the population in
general.
Our objective is to analyze the ratings (out of 100) and figure out which feature affects the ratings and
how. Our attempt will be to model a regressor such that it is able to predict (with “sufficient” accuracy)
what the rating of the cereal will be given a set of features.
The approach and implementation of the project have been catalogued in the following report.

1
A weighted average is a better estimate than a simple average because the people in the observed sample may not accurately represent their actual proportions in the population (the data may be imbalanced).
Implementation
The implementation of this machine learning project follows the same cycle as most machine learning work: data processing, algorithm application and parameter tuning.

In the (very crude) infographic above, we first start with data processing (or preprocessing, before any algorithm is applied). Once the data is ready we can start applying algorithms. After observing the results and efficiency of the different algorithms, we can tune each algorithm's parameters to achieve better results. As the diagram shows, the progression from one step to the next is not a hard and fast one; we need to shuttle back and forth between steps 1 and 2 and between steps 2 and 3 (essentially trial and error) to get the best results.
Data Processing
Missing values:

Across the 14 features of the 77 instances there were 4 missing values in total. These blank entries cannot be read or analyzed by our algorithms until they are filled in. One option is to completely remove any instance with a missing feature, but with N = 77 we cannot afford to lose more data points than we already have. This means we need to fill the missing values with a numerical value we perceive to be the most apt. While placing values of our own accord, we need to keep in mind that the introduced value should produce the least amount of "aberration" in the data.

Approaches:
(i) Deleting the entire row: We did not do this, as the dataset already has very few instances.
(ii) Replacing with the mean of the feature vector: Although very simple, this method can produce acceptable results.
(iii) Assuming the value of the feature of the "closest" instance: This method makes intuitive sense; we find the most similar instance (smallest Euclidean distance over the known features) and populate the missing features with that instance's values. This technique is also used in industrial projects.
After applying the above approaches we concluded that we would use the mean value rather than the Euclidean-distance (ED) imputation: although the error was lower for ED, its complexity is significantly higher. Both strategies are sketched below.
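Assuming the cereal data sits in a 77x14 numeric matrix X with NaN marking the missing entries (the variable names are ours, not from the original code), the two strategies look roughly like this in MATLAB:

```matlab
% (ii) Mean imputation: replace each NaN with the mean of its feature column.
colMeans = mean(X, 'omitnan');
Xmean = X;
for j = 1:size(X, 2)
    Xmean(isnan(Xmean(:, j)), j) = colMeans(j);
end

% (iii) Nearest-instance (Euclidean distance) imputation: copy the missing
% features from the complete instance that is closest on the known features.
Xed = X;
for i = find(any(isnan(X), 2))'
    known = ~isnan(X(i, :));                       % features present in instance i
    cand  = find(all(~isnan(X), 2));               % fully observed instances
    d     = sqrt(sum((X(cand, known) - X(i, known)).^2, 2));
    [~, k] = min(d);
    Xed(i, ~known) = X(cand(k), ~known);
end
```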

Outliers:
The definition and handling of outliers varies greatly with the application and the data. In our case we can safely rule out deleting the outliers (because of the small value of N), so we have decided to treat an outlying feature value of any instance as a missing value. The definition of an outlier has been taken as:
X is an outlier if X > mean + NUM * (standard deviation), where NUM is a constant that can be tuned to flag fewer or more outliers (typically NUM lies in [2, 3.5]).
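A minimal sketch of this rule, again assuming a numeric matrix X (NUM and the variable names are illustrative):

```matlab
NUM = 3.5;                               % tunable constant, typically in [2, 3.5]
mu = mean(X, 'omitnan');                 % per-feature mean
sd = std(X, 0, 'omitnan');               % per-feature standard deviation
isOutlier = X > mu + NUM * sd;           % rule: X is an outlier if X > mean + NUM*std
Xclean = X;
Xclean(isOutlier) = NaN;                 % treat outliers as missing values and re-impute
```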

Processing variant                                          MSE
Raw data                                                    20.6625
Normalized data, missing values filled by mean              2.1465
Normalized data, missing values filled by ED                1.9946
Data with outliers replaced with mean value (NUM = 3.5)     8.1877
Data with outliers replaced with ED value (NUM = 3.5)       3.5676
Data with outliers replaced with ED value (NUM = 4.5)       3.5461

Feature Reduction:
Approaches:
(i) Covariance matrix: We calculated the covariance within the features and between each feature and the label. The covariance matrix of the features comments on the independence of the features, whereas the covariance between a feature and the label tells us how strongly the label depends on that feature. This was implemented on the basis of the covariance feature-reduction theory from the research paper "Feature Reduction Based on Analysis of Covariance Matrix", 2008 International Symposium on Computer Science and Computational Technology (ISCSCT '08).

X1 -> MFR X2 -> Type


X3 -> Calories X4 -> Protein
X5 -> Fat X6 -> Sodium
X7 -> Fiber X8 -> Carbo
X9 -> Sugar X10 -> Potass
X11 -> Vitamins X12 -> Shelf
X13 -> Weight X14 -> Cups
Y -> Rating

The result can be read as follows: where a value of the above matrix tends to zero, the corresponding features are independent; where a feature's covariance with the label tends to zero, the label has virtually negligible dependence on that feature and the feature can be ignored. A sketch of this check is given below.
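Assuming the normalized feature matrix is Z (77x14) and the rating vector is y (names are ours), the check can be done roughly as:

```matlab
C = cov(Z);                              % 14x14 feature-feature covariance matrix
covWithLabel = zeros(1, size(Z, 2));
for j = 1:size(Z, 2)
    c = cov(Z(:, j), y);                 % 2x2 covariance of feature j and the rating
    covWithLabel(j) = c(1, 2);
end
% Off-diagonal entries of C near zero -> the two features are (linearly) independent.
% Entries of covWithLabel near zero -> the feature barely influences the rating
% and is a candidate for removal.
```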
(ii) Correlation: In order to find the relation between two features we also used the (Pearson) correlation coefficient, r(X, Y) = cov(X, Y) / (sigma_X * sigma_Y). Here we considered each feature to be a vector, but the result obtained was not what we expected and was not of much use in this case, as it only measures the dependence of one feature on another across all instances. This method may be suitable for time-series problems, but here it was not a good choice. The technique was implemented after confirming the usage on Stack Overflow.
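For reference, this check reduces to a single call to corrcoef in MATLAB (same assumed Z and y as above):

```matlab
R = corrcoef([Z y]);                     % Pearson correlations between all columns
featureCorr = R(1:end-1, 1:end-1);       % feature-vs-feature correlations
labelCorr   = R(1:end-1, end);           % feature-vs-rating correlations
```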

(iii) PCA: Principal component analysis transforms the variables into a new set of variables, the principal components (PCs), which are orthogonal and ordered so that the variation retained from the original variables decreases as we move down the order. The results obtained with the covariance matrix were much better than with this technique, so PCA was discarded.
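A minimal sketch of the PCA attempt, using pca from the Statistics and Machine Learning Toolbox (the number of retained components below is illustrative):

```matlab
[coeff, score, latent, ~, explained] = pca(Z);   % Z is the normalized feature matrix
% 'score' holds the data in the principal-component basis; 'explained' gives the
% percentage of variance captured by each component.
Zpca = score(:, 1:13);                           % e.g. drop only the last component
```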

(iv) Gram-Schmidt: While going through research papers on feature reduction we came across "Unsupervised feature selection through Gram-Schmidt orthogonalization" (ScienceDirect). To reduce the number of features we applied the Gram-Schmidt orthogonalization technique, but it could not reduce the 77x14 matrix any further than the covariance-matrix approach did. Because the notation is cumbersome to type, a photograph depicting the formulation of the orthonormal vectors is attached; a code sketch of the orthogonalization is also given below.
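A classical Gram-Schmidt pass over the feature columns, under the same assumed Z (the selection rule in the comments is the standard one: small residuals flag nearly redundant features):

```matlab
[n, p] = size(Z);                        % Z: normalized 77x14 feature matrix
U = zeros(n, p);
resid = zeros(1, p);
for k = 1:p
    v = Z(:, k);
    for j = 1:k-1
        v = v - (U(:, j)' * Z(:, k)) * U(:, j);   % subtract the component along u_j
    end
    resid(k) = norm(v);                  % small residual => feature adds little new information
    U(:, k) = v / norm(v);               % normalize to obtain an orthonormal column
end
```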

Feature set                                                 Test Error (MSE)
All features                                                2.0091
Leaving X3                                                  2.3228
Leaving X5                                                  3.8771
Leaving X8                                                  2.27
Leaving X9                                                  4.4139
Leaving X10                                                 2.05462
Leaving X11                                                 23.2681
Leaving X12                                                 2.0945
Leaving X13                                                 2.0092
Leaving X14                                                 2.0094
Leaving X3, X8, X10, X12, X13, X14                          2.6236
Leaving X3, X5, X8, X10, X12, X13, X14                      14.9458
Leaving X8, X10, X12, X13, X14                              2.508
Leaving X10, X12, X13, X14                                  5.1005
Gram-Schmidt, leaving X3, X8, X10, X12, X13, X14            2.6036
PCA with all features (one feature reduced)                 0.0204
PCA without X3, X8 and so on (one feature reduced)          155.6688
As a final conclusion we will use the fully processed data, i.e. the normalized data with missing values filled with the mean value, with the features reduced to 8.
Note: To calculate the above errors we used multivariate linear regression. The test points are 10% of the total data, selected randomly. The error is calculated over all instances of the data, i.e. the test and training data together. There are several error measures to choose from; we chose MSE (mean squared error) to standardize our report.

Algorithm Application
The task is to be solved as a regression problem, since the expected output is a prediction from a continuous range (1 – 100) of rating values. For this, the algorithms applied are:
 Multivariate regression
 Support Vector regression
 Neural Networks

Data:
Instances (N) = 77; Features = 14; Rating -> 1 to 100

Multivariate Regression:

This algorithm isolates each feature and attempts to form a linear model against the ratings, giving an output of 14 linear functions of feature vs. rating; it then determines an array of weights for these linear functions to form a hypothesis function h(x):
h(x) = (w * x) + w0

Example3: Here the legend on the top left denotes all the features, with their numerical values normalized from 0 to 5 plotted on the Y axis and the outputs on the X axis.

Figure 1: Example from mathworks

The multivariate regression function in MATLAB inherently returns a vector with length equal to the number of features, so the coefficient vector does NOT include w0; to account for this, an extra redundant feature with value = 1 has been introduced for each instance, as sketched below.
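A sketch using the Statistics and Machine Learning Toolbox function regress, assuming Ztrain/ytrain and Ztest/ytest hold the 70/7 split described under Results (names are ours):

```matlab
Xtrain = [ones(size(Ztrain, 1), 1) Ztrain];        % prepend the redundant "1" feature for w0
w = regress(ytrain, Xtrain);                       % w(1) = w0, w(2:end) = feature weights
yhatTest = [ones(size(Ztest, 1), 1) Ztest] * w;    % h(x) = w*x + w0 on the test set
mseTest  = mean((yhatTest - ytest).^2);
```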

Support Vector Regression:

SVM predicts a decision boundary (a hyperplane) that divides the data into effectively 2 spaces and thus helps in binary classification, but with very minor tuning this algorithm can also fit hyperplanes through the data points and thereby solve a regression problem. The working is somewhat the same: the points in the data space dictate the extreme regression hyperplanes, which then help make the final regressor (the hyperplane in between the extremes, with maximum distance from both extreme hyperplanes).

Figure 2: SVM hyperplane

3
The image has been taken from https://in.mathworks.com/ and is NOT a result derived by the author; it has been used just as a placeholder to help visualize how the regression algorithm works. For actual results from the dataset, jump to Results.
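A sketch of the regression fit with fitrsvm from the Statistics and Machine Learning Toolbox (the linear kernel shown is an assumption; the actual kernel and parameters were chosen during tuning):

```matlab
svrModel = fitrsvm(Ztrain, ytrain, 'KernelFunction', 'linear', 'Standardize', true);
yhatTest = predict(svrModel, Ztest);               % predicted ratings on the test set
mseTest  = mean((yhatTest - ytest).^2);
```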

Neural Network:

The MATLAB toolbox helps deploy a network of neurons, a "Neural Struct", that predicts the rating values given the nutritional values of any cereal. Below is a schematic showing the network's (net) formulation in a diagrammatic representation (taken from MATLAB), followed by a code sketch of the setup.

Figure 3: Example from mathworks
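A sketch of the network setup with the Deep Learning Toolbox function fitnet (the hidden-layer size and split ratios below are assumptions, not the tuned values from the report):

```matlab
net = fitnet(10);                          % one hidden layer with 10 neurons
net.divideParam.trainRatio = 0.70;         % training / validation / test split
net.divideParam.valRatio   = 0.15;
net.divideParam.testRatio  = 0.15;
[net, tr] = train(net, Z', y');            % fitnet expects samples as columns
yhat = net(Z');                            % predicted ratings
```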

Results
The data has been split into 2 parts: 70 instances for training, while 7 instances have been reserved for testing purposes.
Raw Data

This section contains results from the data without processing; the missing values had to be replaced with a numerical value for the code to work as expected, so they have been replaced by the mean.
1) Multivariate regression:
a) Coefficient matrix: -0.681044629, 0, -56.93817436, 17.35460355, -13.37703685, -13.1316941,
18.04773171, -3.432490937, -26.50535155, -2.166480845, 0, -2.76928648, 46.30273358, -
2.432566664, 67.31917941
b) Mean Squared error (testing) = 1.4276
c) Accuracy (testing) taken with 1% tolerance = 97% (approx.)
d) Mean Squared Error (training) = 7.14
e) Accuracy (training) taken with 5% tolerance = 81% (approx.)
Note: The above feature reduction results were computed based on Linear Regression.

2) Support Vector Regression:


a) Mean Squared Error (testing) = 2.2564
b) Accuracy (testing) taken with 1% tolerance = 96% (approx.)
c) Mean Squared Error (training) = 94.2996
d) Accuracy (training) with 5% tolerance = 66.2% (approx.)

3) Neural Network: The error and number of iterations reported are averages over 5 different sets of training/validation/testing data. The one highlighted is the best with respect to computational cost and prediction error.
Processed Data

Exploratory Processing
An ambitious attempt to train a model to predict some sort of "health index" (HI) that takes readily available "features" as input from the user (the cereal consumer), essentially creating a new fitting problem from the ground up and trying to provide a solution for it.

Objective:
The aim of this processing extension is to provide the consumer with a sort of HI which will help in diet planning and give much-needed information to the individual concerned.

Prerequisites:
The input for any prediction has to be kept reasonable, i.e. the consumer cannot be expected to know
certain nutritional attributes like amount of vitamin or potassium in a cereal. What the model does
expect the user to know is:
 The cereal manufacturer
 The type of cereal – Hot/Cold
 The position of that cereal’s box in the supermarket’s aisles
 The weight per serving
 The number of cups (of milk) in one serving (?)
 Rating of the cereal
So obviously, if a cereal had been rated 68.402 in a survey, the consumer cannot be expected to know that, or whether the amount of cereal in a serving is 1.33 ounces or 1.25; but it is fair to assume that they do know whether the serving is 0.5 ounce (light) or 1.33 ounces (heavy), and whether the cereal tasted "good" or "bad".

Limitations:
The obvious limitation of the extension is also its USP: the model is directed towards a certain niche of individuals, a demographic that knows something about what they are eating but does not actually have the box's nutritional values, and hence cannot determine how healthy or unhealthy a particular cereal is for them.
Also, the health metric will change with the person's 'biological features' such as age, BMI and gender.
Because of the above-mentioned limitations and a lack of resources, the extension does not attempt to devise an accurate function of 'health' in terms of nutritional values.

Data Processing work arounds:


For the model to work, some degree of awareness has to be assumed on the part of the consumer. To make the assumptions more reasonable the data has been divided into bins; the number of bins and their ranges have been chosen such that most users would not have to take a "wild" guess.
To reduce aberrations in the HI due to inaccurate predictions, insignificant nutritional contributions from the cereals have been neglected altogether; for example, one cannot expect to acquire any significant protein or vitamin content from a bowl of cereal.

Implementation:
The selected features are first placed in bins with sufficiently broad ranges so as to reduce the user's input error. Example: the original dataset had rating values ranging from 0 to 100, which are altered to just three discrete values (bad, okay, tasty -> 1, 2, 3). The bins are then normalized again, as in the sketch below.
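A sketch of the binning step with discretize (the bin edges for the three taste categories are illustrative assumptions):

```matlab
edges = [0 40 70 100];                     % "bad" / "okay" / "tasty" rating ranges (assumed)
taste = discretize(rating, edges);         % map each rating to 1, 2 or 3
taste = (taste - 1) / 2;                   % re-normalize the bin indices to [0, 1]
```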
The binned, normalized data is then used to predict several (but not all) nutritional values, which will be used to formulate a hypothesis function for the HI.
Ideal nutritional (tentative) values for a breakfast meal were then catalogued4:
Calories: 350-500 (https://www.livestrong.com/article/298939-how-many-calories-should-i-eat-at-breakfast/)
Carbs: 75 g (https://www.livestrong.com/article/427735-number-of-carbohydrates-needed-per-meal/)
Sodium: 200 mg (estimated sodium intake of adults at breakfast)
Fiber: 20 g (https://www.nutrition.org.uk/nutritionscience/nutrients-food-and-ingredients/dietary-fibre.html)

4
The ideal breakfast values have been taken from the sources linked above.

The deviation of the nutritional content from these values, and the deviation's parameters, can help train an HI model, as in the rough sketch below.
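As a rough, heavily simplified sketch of this idea (the predicted-nutrition variables and the unweighted averaging are assumptions on our part; the report deliberately stops short of fixing an accurate health function):

```matlab
ideal = [425 75 200 20];                   % midpoint calories, carbs (g), sodium (mg), fiber (g)
pred  = [calHat carbHat sodHat fibHat];    % predicted per-serving nutrition (assumed names)
dev   = abs(pred - ideal) ./ ideal;        % relative deviation from the ideal breakfast values
HI    = 100 * (1 - mean(dev, 2));          % one possible (unweighted) health index
```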

Results obtained, feasibility and application:


As of now the model to be formed is not as feasible as one would like it to be, primarily because of the lack of technical know-how needed to obtain the hypothesis function for the HI. Still, the accuracy of multivariate regression in predicting calorie content, one of the most important metrics of a product's healthiness, was observed to be around 92%, a good result that could realistically be used in the HI calculation.
The initial regression problem used the data to build a model that helps the manufacturer determine how much the population in general will like the taste of their cereal given its nutritional parameters; this modelling, on the other hand, uses the same data to build a model that outputs a health index value based on parameters of any cereal that are readily available to the consumer, such as the manufacturer (of the cereal), the type (eaten hot or cold), etc.
