Classification: usually used to predict labels
Regression: usually used to predict numerical values
Groups items using a hierarchical clustering algorithm.
Groups items using the k-Means clustering algorithm.

Building Basic Models of Learnings

This is where we load our data table into Orange and define what data type our features are. In this case we are loading an Excel table file (.xlsx) with our 94 features (95 if we include the target variable ‘ROCK’). As mentioned previously, the data in our features can come in the form of numeric, categorical, text or time series. To access the widget, double click on it, navigate to and open the file, then define what each feature’s (column’s) data type is. In this case ROCK will be categorical and the remainder will be numeric. Under the ‘Role’ column, ROCK also needs to be set as our target variable, as it’s what we are trying to classify.

Rank assesses the relationship between the features and the target variable and tells us how well they correlate. As geologists we know that the biggest differentiating features between mafic and felsic rocks are the magnesium and silica content. Rank is telling us that MgO varies the most between the 4 lithologies, followed by SiO2, Al2O3 and so on.

Here we can decide which features go into our model. There is a Goldilocks zone for how many features should be included to make an optimum model, and it is determined on a case-by-case basis. You don’t want too few and you don’t want too many. Thankfully, in Orange it is easy to select how many pass through the workflow by highlighting them. In this case I’ve selected the top 10 features shown below.

Train and Test Data

THE JUICY BITS

These 5 pink widgets are Machine Learning algorithms, each with its own way of mathematically classifying/predicting our target variable using the geochem data. In the beer example we used a single regression algorithm to create our model; in Orange we can use many different algorithms at once and compare them.
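To make the Rank step described earlier a little more concrete, here is a minimal sketch in plain Python of one way a feature could be scored against a categorical target: a one-way ANOVA F statistic, which compares how much a feature varies between classes with how much it varies within them. This is an assumption for illustration only; the scoring method actually used in the workflow is not stated here. The oxide values and the two class labels are made up, and the helper names `anova_f` and `rank_features` are hypothetical.

```python
from statistics import mean


def anova_f(values, labels):
    """One-way ANOVA F statistic of one numeric feature against class labels."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    grand_mean = mean(values)
    n, k = len(values), len(groups)
    # Spread of the class means around the overall mean.
    between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups.values())
    # Spread of the samples around their own class mean.
    within = sum((v - mean(g)) ** 2 for g in groups.values() for v in g)
    if within == 0:
        return float("inf")
    return (between / (k - 1)) / (within / (n - k))


def rank_features(table, labels, top_n):
    """Return the names of the top_n features, highest F score first."""
    scores = {name: anova_f(column, labels) for name, column in table.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


# Made-up toy values (wt%), NOT taken from the data set in this article.
labels = ["mafic"] * 3 + ["felsic"] * 3
table = {
    "MgO": [8.0, 9.0, 10.0, 0.5, 1.0, 1.5],
    "SiO2": [45.0, 50.0, 55.0, 66.0, 72.0, 78.0],
    "TiO2": [1.0, 0.4, 0.9, 0.5, 1.1, 0.6],
}

print(rank_features(table, labels, top_n=2))
```

Selecting the "top 10" in the workflow corresponds to `top_n=10` here: features whose values barely differ between classes (like the toy TiO2 column) score low and are left out.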
There is an ever-growing list of ML algorithms, but in this workflow I have used k-Nearest-Neighbour (kNN), Support Vector Machines (SVM), Naive Bayes, Random Forest and Adaptive Boosting (AdaBoost). Each algorithm has parameters that can be tweaked in an attempt to increase model accuracy. Machine Learning algorithms can be quite (definitely are) mathematically intense and difficult to break down into simple terms. There are many documents online that outline each specific algorithm’s function, but the Orange documentation is usually pretty good. This is a good summary for selecting the right algorithm too.

TEST AND SCORE

When you create a Machine Learning model you need a way to make sure your model actually works. We can do this by randomly splitting the data set into ‘training’ and ‘test’ data. The training data is used to create the model and the test data is used to determine the accuracy of the model. The 5 different models created from the training data ignore the target variable in our test data (they just look at the chemistry and not the rock type) and attempt to classify/predict what rock type each instance would be. The predicted rock type is then compared to the actual known value in our test data and each model is scored based on accuracy. In this case the split is set to 80% training and 20% test, representing 320 and 80 instances respectively.

To put it simply: we, the operator, know what the target variables are for both our training and our test data, but our model only knows the training data. E.g. the model knows that the training data is Homer Simpson and now has to predict whether the test data is or not.

PREDICTION
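Before moving on, the split, train, predict and score loop from Test and Score can be sketched in plain Python. This is a rough illustration under stated assumptions, not what Orange does internally: the 400 instances are synthetic, only two made-up classes and two features are used instead of the real 4 lithologies and 10 selected features, and kNN is hand-written with hypothetical helper names (`train_test_split`, `knn_predict`, `accuracy`).

```python
import random
from collections import Counter
from math import dist


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle the rows and split them into (training, test) lists."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = round(len(rows) * test_fraction)
    return rows[n_test:], rows[:n_test]


def knn_predict(train, features, k=3):
    """Predict a label by majority vote among the k nearest training rows."""
    nearest = sorted(train, key=lambda row: dist(row[0], features))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def accuracy(train, test, k=3):
    """Fraction of test rows whose predicted label matches the known label."""
    hits = sum(knn_predict(train, feats, k) == label for feats, label in test)
    return hits / len(test)


# Synthetic stand-in data: 400 rows of ((MgO, SiO2), label). All values are invented.
rng = random.Random(1)
rows = [((rng.gauss(9.0, 1.0), rng.gauss(50.0, 2.0)), "mafic") for _ in range(200)]
rows += [((rng.gauss(1.0, 0.5), rng.gauss(72.0, 2.0)), "felsic") for _ in range(200)]

train, test = train_test_split(rows, test_fraction=0.2)
print(len(train), len(test))  # 400 * 0.8 = 320 training rows, 400 * 0.2 = 80 test rows
print(accuracy(train, test, k=3))
```

Note that `knn_predict` only ever receives the test row's features; the known test label is used solely inside `accuracy` to check the prediction, which mirrors the point that the model only knows the training data. The `k` argument is an example of a parameter that can be tweaked to try to improve accuracy.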