Data Mining Primer
Overview
This document is intended as a basic guide to interpreting the results of data mining algorithms and making improvements to get the best results possible. It is not a comprehensive treatment of every aspect of the data mining process – topics not addressed here are covered in the book or in other materials. Treat it as a starting point for understanding and improving your analysis.
Results
Depending on which options you’ve chosen, multiple tabs will appear with results. When evaluating the model, we want to focus on the “last” pass of the model – the Test or Validation step, not the Training results.
Lift Curves
Lift curves are a generic method of judging the quality of a model. A lift chart shows how large a share of the total population we would need to examine to reach a certain level of confidence in our prediction. If we measured every single case, we would know the results with certainty; if we measured half, we would know 50%, and so on – this is the diagonal line representing no model. The purpose of the predictive model is to let us predict what happens with confidence without having to measure every single case, and how well we are doing is represented by the lift line.
The details of the shape of the curves aren’t normally critical – what we are looking for is to maximize the area between the model’s curve and the default line. More lift = better model.
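The cumulative-gains calculation behind a lift chart can be sketched in a few lines. This is an illustrative implementation, not Solver’s own; the toy scores and labels are made up for the example:

```python
import numpy as np

def cumulative_gains(y_true, y_score):
    """Sort cases by predicted score (descending) and return, for each
    share of the population examined, the share of all positives found."""
    order = np.argsort(-np.asarray(y_score))
    hits = np.cumsum(np.asarray(y_true)[order])
    pop_share = np.arange(1, len(y_true) + 1) / len(y_true)
    gain = hits / hits[-1]  # fraction of all positives captured so far
    return pop_share, gain

# Toy example: a model that ranks most positives near the top.
y_true  = [1, 1, 0, 1, 0, 0, 0, 0]
y_score = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
pop, gain = cumulative_gains(y_true, y_score)
# With no model, examining 50% of cases finds about 50% of positives;
# here, examining the top 50% already finds all three (gain = 1.0).
```

The “no model” diagonal is simply `gain = pop`; the area between the computed curve and that diagonal is the lift.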
Error Reports
Every algorithm returns some data showing how accurate it is; the details vary by model.
Feature Selection
The feature selection tool can help select variables for inclusion based on their relative independence. High chi-squared or F-statistic values indicate a likely good contributor to a model; low values indicate that the variable might be eligible for elimination. The one potentially confusing area here is the two “Discretize” options and the outputs that are circled below. The guideline is that if you’re predicting a number (continuous value), you want the F-statistic; if you’re classifying a group (discrete value), you want chi-squared. Manipulate the discretize options to get the desired statistic.
This can be done before building any models, and can inform your choice to remove variables later in the process.
Note: This is a very light overview of the feature selection tool – there is much more to it, but it requires quite a bit of math, so we only touch on it lightly here.
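For reference, the same statistics can be computed with scikit-learn, used here as a stand-in for Solver’s dialog (the iris dataset is just an illustration):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import chi2

# Classifying a group (discrete target), so chi-squared is the right score;
# for a continuous target, sklearn.feature_selection.f_regression gives the
# analogous F-statistic instead.
X, y = load_iris(return_X_y=True)
chi2_scores, p_values = chi2(X, y)

# Higher scores suggest a likely good contributor; low scores mark
# candidates for elimination.
for name, score in zip(load_iris().feature_names, chi2_scores):
    print(f"{name}: {score:.1f}")
```

On this dataset the petal measurements score far higher than the sepal measurements, flagging them as the stronger contributors.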
Classification
All Classification Models
1. Look at Error Rates
All the classification algorithms will produce a similar error table:
Interpreting the table is relatively simple: higher accuracy is good. The other measures are more specific measures of accuracy; they are detailed in the book, but probably won’t be too critical unless you have a good background in statistics.
The key value of the confusion matrix is that it lets you see whether your model is making mistakes that are skewed in one direction. The simplest way to adjust for errors that lean hard in one direction is to change the success-probability cutoff, requiring the model to be more or less confident before assigning a “success” to an individual prediction. This data is also the source of the other accuracy measures on the error rate table.
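The effect of moving the cutoff can be sketched with scikit-learn standing in for Solver’s output; the probabilities and labels below are invented for illustration:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predicted success probabilities.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
proba  = np.array([0.2, 0.4, 0.55, 0.3, 0.6, 0.8, 0.52, 0.45, 0.9, 0.1])

for cutoff in (0.5, 0.6):
    y_pred = (proba >= cutoff).astype(int)      # apply the cutoff
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    print(f"cutoff={cutoff}: false positives={fp}, false negatives={fn}")
# Raising the cutoff trades false positives for false negatives,
# which is exactly the lever for correcting a skewed error pattern.
```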
Classification Tree
1. Look at the Feature Importance
In the output sheet the feature importance table shows the relative importance of the different inputs to the result
of the model. Higher values indicate that the variable is more significant in determining the outcome.
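A rough scikit-learn equivalent of reading that table, using the iris dataset purely as an example:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# Higher values mean the variable does more to determine the outcome;
# the importances sum to 1 across all inputs.
for name, imp in zip(load_iris().feature_names, tree.feature_importances_):
    print(f"{name}: {imp:.3f}")
```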
Neural Networks
Neural networks produce output that looks a bit different from other algorithms, but the measures are similar. Again, we want to focus on the error rate.
Neural networks are probably the least transparent to examination; making improvements is largely guess and test. There are several options on the setup screen – test multiple combinations and see which performs best.
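That guess-and-test loop can be sketched in code. This uses scikit-learn’s `MLPClassifier` and a sample dataset as stand-ins for Solver’s dialog; the candidate hidden-layer sizes are arbitrary choices:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
scaler = StandardScaler().fit(X_train)  # networks train better on scaled inputs

best = None
for hidden in [(5,), (10,), (25,)]:     # candidate setup options to try
    net = MLPClassifier(hidden_layer_sizes=hidden, max_iter=2000,
                        random_state=0).fit(scaler.transform(X_train), y_train)
    acc = net.score(scaler.transform(X_test), y_test)
    if best is None or acc > best[1]:
        best = (hidden, acc)
print("best setting:", best)
```

The same pattern extends to any other setup option (learning rate, epochs, and so on): vary one thing, compare test-set error, keep the winner.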
K-Nearest Neighbors
One of the setup options in the kNN model allows Solver to vary the number of k values tried. One way to improve the model is to use the “best” value of k. While Solver will automatically select the best k from those possible, in some cases we might need to increase that limit if we notice a trend:
In this example the error rate trends lower as k increases, so we might want to allow higher k values to see whether the trend continues and yields an improvement.
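The same sweep can be done by hand with scikit-learn (the dataset and range of 1–20 are illustrative); if the error is still falling at the top of the range, widen the range:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Test-set error rate for each candidate k.
errors = {}
for k in range(1, 21):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    errors[k] = 1 - knn.score(X_test, y_test)

best_k = min(errors, key=errors.get)
print("best k:", best_k, "error:", round(errors[best_k], 3))
```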
Prediction
All Prediction Algorithms
1. Investigate R Squared
R-squared is a statistical measure of how close the data are to the fitted regression line. It is also known as the coefficient of determination, or the coefficient of multiple determination for multiple regression. The definition of R-squared is fairly straightforward; it is the percentage of the response variable variation that is explained by a linear model. Or:
R-squared = Explained variation / Total variation
In general, the higher the R-squared, the better the model fits your data. [1]
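Computed directly from that definition, with made-up observed and fitted values for illustration:

```python
import numpy as np

y     = np.array([3.0, 5.0, 7.0, 9.0, 11.0])   # observed values
y_hat = np.array([3.2, 4.8, 7.1, 8.9, 11.0])   # fitted values (hypothetical)

ss_res = np.sum((y - y_hat) ** 2)              # unexplained (residual) variation
ss_tot = np.sum((y - y.mean()) ** 2)           # total variation
r2 = 1 - ss_res / ss_tot                       # share of variation explained
print(round(r2, 4))                            # -> 0.9975
```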
Linear Regression
1. Check the Intercept
Solver provides the option of allowing the intercept’s value to “float” (for lack of a better term). You will usually want to leave this option unchecked, because the intercept can sometimes dominate the model and make it inaccurate. The easiest way to see whether the intercept is causing problems is to look at the subsets: if every subset seems to include the intercept and the R-squared is low, the intercept is likely causing an issue.
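One way to check the intercept’s influence is to fit the model both ways and compare R-squared. This sketch uses scikit-learn’s `fit_intercept` flag as a rough analogue of Solver’s checkbox, on synthetic data generated for the example:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 2.0 * X.ravel() + 5.0 + rng.normal(0, 1, 100)   # true intercept of 5

with_int    = LinearRegression().fit(X, y)
without_int = LinearRegression(fit_intercept=False).fit(X, y)

# Compare the fit quality with and without an estimated intercept;
# a large gap tells you how much the intercept is driving the model.
print("R^2 with intercept:   ", round(with_int.score(X, y), 3))
print("R^2 without intercept:", round(without_int.score(X, y), 3))
```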
K-Nearest Neighbors
See the classification section for how to deal with k values.
Regression Tree
1. Look at the Feature Importance
See the classification section for investigating the feature importance.