Data Mining Primer
Overview
This document is intended as a basic guide to interpreting the results of data mining algorithms and making improvements to get the best results possible. It is not a comprehensive treatment of every aspect of the data mining process – topics not addressed here are covered in the book or in other materials. Treat it as a starting point for understanding and improving your analysis.
Results
Depending on which options you’ve chosen, multiple tabs will appear with results. When evaluating the model, we want to focus on the “last” pass of the model – the Test or Validation step, not the Training results.
Lift Curves
Lift curves are a generic method of judging the quality of a model. A lift chart shows how large a share of the total population we would need to examine to reach a certain level of confidence in our prediction. If we measured every single case, we would know the results with certainty; if we measured half, we would know 50%, and so on – this is the diagonal line representing no model. The purpose of the predictive model is to let us predict what happens with confidence without having to measure every single case, and how well we are doing is represented by the lift line.
The details of the shape of the curves aren’t normally critical – what we are looking for is to maximize the area between the model’s curve and the default line. More lift = better model.
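The cumulative-gains calculation behind a lift chart can be sketched in a few lines. This is an illustrative implementation, not Solver’s own; the toy scores and labels are made up for the example:

```python
import numpy as np

def cumulative_gains(y_true, y_score):
    """Sort cases by predicted score (descending) and return, for each
    share of the population examined, the share of all positives found."""
    order = np.argsort(-np.asarray(y_score))
    hits = np.cumsum(np.asarray(y_true)[order])
    pop_share = np.arange(1, len(y_true) + 1) / len(y_true)
    gain = hits / hits[-1]  # fraction of all positives captured so far
    return pop_share, gain

# Toy example: a model that ranks most positives near the top.
y_true  = [1, 1, 0, 1, 0, 0, 0, 0]
y_score = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
pop, gain = cumulative_gains(y_true, y_score)
# With no model, examining 50% of cases finds about 50% of positives;
# here, examining the top 50% already finds all three (gain = 1.0).
```

The “no model” diagonal is simply `gain = pop`; the area between the computed curve and that diagonal is the lift.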
Error Reports
Every algorithm returns some data showing how accurate it is; the details vary by model.
Feature Selection
The feature selection tool can help select variables for inclusion based on their relative independence. High chi-squared or F-statistic values indicate a likely good contributor to a model; low values indicate that the variable might be eligible for elimination. The one potentially confusing area here is the two “Discretize” options and the outputs that are circled below. The guideline is that if you’re predicting a number (continuous value), you want the F-statistic; if you’re classifying a group (discrete value), you want chi-squared. Manipulate the discretize options to get the desired statistic.
This can be done before building any models, and can inform your choice to remove variables later in the process.
Note: This is a very light overview of the feature selection tool – there is much more to it, but it requires quite a bit of math, so we only touch on it lightly here.
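For reference, the same statistics can be computed with scikit-learn, used here as a stand-in for Solver’s dialog (the iris dataset is just an illustration):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import chi2

# Classifying a group (discrete target), so chi-squared is the right score;
# for a continuous target, sklearn.feature_selection.f_regression gives the
# analogous F-statistic instead.
X, y = load_iris(return_X_y=True)
chi2_scores, p_values = chi2(X, y)

# Higher scores suggest a likely good contributor; low scores mark
# candidates for elimination.
for name, score in zip(load_iris().feature_names, chi2_scores):
    print(f"{name}: {score:.1f}")
```

On this dataset the petal measurements score far higher than the sepal measurements, flagging them as the stronger contributors.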
Classification
All Classification Models
1. Look at Error Rates
All the classification algorithms will produce a similar error table:
Interpreting the table is relatively simple: higher accuracy is good. The other measures are more specific measures of accuracy; they are detailed in the book, but probably won’t be too critical unless you have a good background in statistics.
The key value of the confusion matrix is that it lets you see whether your model is making mistakes that are skewed in one direction. The simplest way to adjust for errors that lean hard in one direction is to change the success-probability cutoff, requiring the model to be more or less confident before assigning a “success” to an individual prediction. This data is also the source of the other accuracy measures on the error rate table.
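The effect of moving the cutoff can be sketched with scikit-learn standing in for Solver’s output; the probabilities and labels below are invented for illustration:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predicted success probabilities.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
proba  = np.array([0.2, 0.4, 0.55, 0.3, 0.6, 0.8, 0.52, 0.45, 0.9, 0.1])

for cutoff in (0.5, 0.6):
    y_pred = (proba >= cutoff).astype(int)      # apply the cutoff
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    print(f"cutoff={cutoff}: false positives={fp}, false negatives={fn}")
# Raising the cutoff trades false positives for false negatives,
# which is exactly the lever for correcting a skewed error pattern.
```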
Classification Tree
1. Look at the Feature Importance
In the output sheet the feature importance table shows the relative importance of the different inputs to the result
of the model. Higher values indicate that the variable is more significant in determining the outcome.
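A rough scikit-learn equivalent of reading that table, using the iris dataset purely as an example:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# Higher values mean the variable does more to determine the outcome;
# the importances sum to 1 across all inputs.
for name, imp in zip(load_iris().feature_names, tree.feature_importances_):
    print(f"{name}: {imp:.3f}")
```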
Neural Networks
Neural networks produce output that looks a bit different from other algorithms, but the measures are similar. Again, we want to focus on the error rate.
Neural networks are probably the least transparent to examination; making improvements is largely guess and test. There are several options on the setup screen – test multiple combinations and see which performs best.
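That guess-and-test loop can be sketched in code. This uses scikit-learn’s `MLPClassifier` and a sample dataset as stand-ins for Solver’s dialog; the candidate hidden-layer sizes are arbitrary choices:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
scaler = StandardScaler().fit(X_train)  # networks train better on scaled inputs

best = None
for hidden in [(5,), (10,), (25,)]:     # candidate setup options to try
    net = MLPClassifier(hidden_layer_sizes=hidden, max_iter=2000,
                        random_state=0).fit(scaler.transform(X_train), y_train)
    acc = net.score(scaler.transform(X_test), y_test)
    if best is None or acc > best[1]:
        best = (hidden, acc)
print("best setting:", best)
```

The same pattern extends to any other setup option (learning rate, epochs, and so on): vary one thing, compare test-set error, keep the winner.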
K-Nearest Neighbors
One of the setup options in the kNN model allows Solver to vary the number of k values tried. One way to improve the model is to use the “best” value of k. While Solver will automatically select the best k from those possible, in some cases we might need to increase that limit if we notice a trend:
In this example the error rate trends lower as k increases, so we might want to allow higher k values to see whether the trend continues and yields an improvement.
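The same sweep can be done by hand with scikit-learn (the dataset and range of 1–20 are illustrative); if the error is still falling at the top of the range, widen the range:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Test-set error rate for each candidate k.
errors = {}
for k in range(1, 21):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    errors[k] = 1 - knn.score(X_test, y_test)

best_k = min(errors, key=errors.get)
print("best k:", best_k, "error:", round(errors[best_k], 3))
```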
Prediction
All Prediction Algorithms
1. Investigate R Squared
R-squared is a statistical measure of how close the data are to the fitted regression line. It is also known as the coefficient of determination, or the coefficient of multiple determination for multiple regression. The definition of R-squared is fairly straightforward; it is the percentage of the response variable variation that is explained by a linear model. Or:
R-squared = Explained variation / Total variation
In general, the higher the R-squared, the better the model fits your data. [1]
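Computed directly from that definition, with made-up observed and fitted values for illustration:

```python
import numpy as np

y     = np.array([3.0, 5.0, 7.0, 9.0, 11.0])   # observed values
y_hat = np.array([3.2, 4.8, 7.1, 8.9, 11.0])   # fitted values (hypothetical)

ss_res = np.sum((y - y_hat) ** 2)              # unexplained (residual) variation
ss_tot = np.sum((y - y.mean()) ** 2)           # total variation
r2 = 1 - ss_res / ss_tot                       # share of variation explained
print(round(r2, 4))                            # -> 0.9975
```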
Linear Regression
1. Check the Intercept
Solver provides the option of allowing the intercept’s value to “float” (for lack of a better term). You will usually want to leave this option unchecked, because the intercept can sometimes dominate the model and make it inaccurate. The easiest way to see whether the intercept is causing problems is to look at the subsets: if every subset seems to include the intercept and the R-squared is low, the intercept is likely causing an issue.
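One way to check the intercept’s influence is to fit the model both ways and compare R-squared. This sketch uses scikit-learn’s `fit_intercept` flag as a rough analogue of Solver’s checkbox, on synthetic data generated for the example:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 2.0 * X.ravel() + 5.0 + rng.normal(0, 1, 100)   # true intercept of 5

with_int    = LinearRegression().fit(X, y)
without_int = LinearRegression(fit_intercept=False).fit(X, y)

# Compare the fit quality with and without an estimated intercept;
# a large gap tells you how much the intercept is driving the model.
print("R^2 with intercept:   ", round(with_int.score(X, y), 3))
print("R^2 without intercept:", round(without_int.score(X, y), 3))
```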
K-Nearest Neighbors
See the classification section for how to deal with k values.
Regression Tree
1. Look at the Feature Importance
See the classification section for investigating the feature importance.