
Classification: usually used to predict labels.
Regression: usually used to predict numerical values.
Hierarchical Clustering: groups items using a hierarchical clustering algorithm.
k-Means: groups items using the k-Means clustering algorithm.
Building Basic Learning Models
This is where we load our data table into
Orange and define the data type of each
feature. In this case we are loading an
Excel table file (.xlsx) with our 94 features (95 if we
include the target variable 'ROCK').
As mentioned previously, the data in our
features can come in the form of numeric,
categorical, text or time series. To access the
widget, double-click on it, navigate to and
open the file, then define each
feature's (column's) data type. In this case
ROCK will be categorical and the remainder
will be numeric. Under the 'Role' column,
ROCK also needs to be set as our target
variable, as it's what we are trying to classify.
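For anyone who prefers scripting, the same step can be reproduced with Orange's Python API. This is a minimal sketch, assuming a hypothetical file name geochem.xlsx and a categorical ROCK column; the File widget does the equivalent work behind the scenes.

```python
from Orange.data import Table, Domain

# Load the spreadsheet; Orange infers numeric/categorical types per column.
data = Table("geochem.xlsx")  # hypothetical file name

# Promote the ROCK column from an ordinary attribute to the target (class) role.
rock = data.domain["ROCK"]
features = [a for a in data.domain.attributes if a.name != "ROCK"]
data = data.transform(Domain(features, class_vars=rock))

print(data.domain.class_var)        # ROCK, the categorical target
print(len(data.domain.attributes))  # 94 numeric features
```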
Rank assesses the relationship between the
features and the target variable and tells us how
well they correlate. As geologists we know
that the biggest differentiating features
between mafic and felsic rocks are the
magnesium and silica contents. Rank is
telling us that MgO varies the most between
the 4 lithologies, followed by SiO2, Al2O3
and so on. Here we can decide which features
go into our model. There is a goldilocks
zone for how many features should be
included to make an optimal model,
and it is determined on a case-by-case basis.
You don't want too few and you don't want
too many; thankfully, in Orange it is easy to
select how many pass through the
workflow just by highlighting them. In this
case I've selected the top 10 features, shown
below.
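A rough scripting equivalent of the Rank widget, assuming the data table from the previous step; information gain is just one of several scorers Rank offers, so treat this as a sketch rather than the exact ranking in the screenshot.

```python
from Orange.preprocess import SelectBestFeatures
from Orange.preprocess.score import InfoGain

# Score every feature against the ROCK target (higher = more informative).
scorer = InfoGain()
ranked = sorted(((scorer(data, attr), attr.name)
                 for attr in data.domain.attributes), reverse=True)
for score, name in ranked[:10]:
    print(f"{name}: {score:.3f}")

# Keep only the top 10 features, mirroring the highlight-and-pass-through step.
data = SelectBestFeatures(method=InfoGain(), k=10)(data)
```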
Train and Test Data
THE JUICY BITS
These 5 pink widgets are Machine Learning algorithms, each with its own way of mathematically
classifying/predicting our target variable from the geochem data. In the beer example we used a single
regression algorithm to create our model; in Orange we can use many different algorithms at once and
compare them. There is an ever-growing list of ML algorithms, but in this workflow I have used k-Nearest-
Neighbour (kNN), Support Vector Machines (SVM), Naive Bayes, Random Forest and Adaptive Boosting
(AdaBoost). Each algorithm has parameters that can be tweaked in an attempt to increase model accuracy.
Machine Learning algorithms can be quite (definitely are) mathematically intense and difficult to break
down into simple terms. There are many documents online that outline each specific algorithm's function,
but the Orange documentation is usually pretty good. This is a good summary for selecting the right
algorithm too.
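In scripting terms, each pink widget corresponds to a learner object. Here is a sketch with placeholder parameters standing in for the ones you would tweak in each widget; exact module paths (the AdaBoost one in particular) can vary a little between Orange versions.

```python
from Orange.classification import (
    KNNLearner, SVMLearner, NaiveBayesLearner, RandomForestLearner,
)
from Orange.ensembles import SklAdaBoostClassificationLearner

# One learner per pink widget; each argument mirrors a setting in its widget.
learners = [
    KNNLearner(n_neighbors=5),
    SVMLearner(),
    NaiveBayesLearner(),
    RandomForestLearner(n_estimators=10),
    SklAdaBoostClassificationLearner(n_estimators=50),
]
```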
TEST AND SCORE
When you create a Machine Learning model you need a way
to make sure it actually works. We can do this by
randomly splitting the data set into 'training' and 'test' data.
The training data is used to create the model and the test data
is used to determine the accuracy of the model. The 5
different models created from the training data ignore the
target variable in our test data (they look only at the chemistry,
not the rock type) and attempt to classify/predict the rock
type of each instance. The predicted rock type is then
compared to the actual known value in our test data, and each
model is scored on accuracy. In this case the split is
set at 80% training and 20% test, representing 320 and
80 instances respectively.
To put it simply: we, the operators, know the target
variables for both our training and our test data, but our
model only sees the training data. E.g. the model learns
what Homer Simpson looks like from the training data and
then has to predict whether each test instance is Homer or not.
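A bare-bones sketch of that split-train-score loop, assuming the data and learners objects from earlier; the Test and Score widget automates all of this (and offers cross-validation as well), so the numbers here will differ from the screenshots.

```python
import numpy as np

# Shuffle, then split 80% / 20% (320 / 80 instances for a 400-row table).
rng = np.random.default_rng(0)  # hypothetical seed, for repeatability
idx = rng.permutation(len(data))
cut = int(0.8 * len(data))
train, test = data[idx[:cut]], data[idx[cut:]]

# Fit each model on the training data only, then score it on the held-out
# test data: predicted rock types versus the known ROCK values.
for learner in learners:
    model = learner(train)
    accuracy = float(np.mean(model(test) == test.Y))
    print(f"{type(learner).__name__}: {accuracy:.2%}")
```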
PREDICTION
