

An Investigation of Commercial Data Mining

Emily Davis
Supervisor: John Ebden

Abstract: This paper describes an investigation of a commercial data mining suite,
specifically that of Oracle9i. The investigation was conducted to determine the type
of results achieved when data mining models created with Oracle's data mining
components are applied to data. Two types of model in the same category, Naïve Bayes
and Adaptive Bayes Network, were built and their results compared to determine
whether the results supported each other or differed in any way. It was concluded
that only one of six comparisons showed very similar results for the two
classification algorithms, and that the choice of modelling algorithm can therefore
have a significant impact on results from the same data even when the category of
data mining technique is the same.

1. Introduction

This paper describes the method and results of an investigation of Oracle9i data
mining, specifically of algorithms that fall into the classification category, in order
to determine the type of results produced and whether the results from the different
models support each other. Four models were initially built using classification
algorithms in the Oracle9i data mining suite. The two algorithms used were Naïve
Bayes and Adaptive Bayes Network. The algorithms were used to build, test and apply
the models with different combinations of parameter settings, and the results were
documented.

2. Methodology

2.1 Preparation

An Oracle database was configured, and the tools and software for data mining were
installed and configured for use with the database. The Oracle Data Mining (ODM)
suite is made up of two components, the data mining Java API and the Data Mining
Server (DMS) [Oracle9i Data Mining Concepts Release 2 (9.2), 2002]. The DMS provides
a repository of metadata of the input and result objects of data mining. For the
purposes of this investigation JDeveloper 10g provides access to the Java API and the
DMS. The data mining itself is performed using DM4J 9.0.4, an extension of
JDeveloper that provides the user with a number of wizards. These wizards
automatically create Java programs that perform the data mining when they are run
[Oracle Data Mining Tutorial, Release 9.0.4, 2004].

2.2 Algorithms

According to Berry and Linoff [2000], directed data mining or supervised learning
involves using data to build a model that describes one particular variable of interest
in terms of the rest of the data. This category includes techniques such as
Classification, Estimation and Prediction.

Roiger and Geatz [2003] define input variables as independent variables and output
variables as dependent variables. In supervised learning a predictive, dependent
variable is produced as output.

Roiger and Geatz [2003] describe classification as a technique where the dependent or
output variable is categorical. The emphasis of the model is to assign new instances of
data to categorical classes.

ODM supports the following classification algorithms, which were selected for this
experiment, as stated by Oracle9i Data Mining Concepts Release 2 (9.2) [2002]:

• Adaptive Bayes Network supporting decision trees (classification)

• Naïve Bayes (classification)

2.3 Investigation
In order to perform comparisons during the evaluation of ODM it was necessary to
select two data mining algorithms that fall into the same categories, that is,
supervised learning and classification. For this reason Naïve Bayes for
Classification and Adaptive Bayes Classification were selected, as the format of the
results they produce is comparable. Both algorithms allow for building the model,
testing the model, computing model lift (a measure of how quickly the model finds
actual positive target values) and application of the model to new data.

2.3.1 Data

The data used in this experiment consists of three tables that are stored in the Oracle
database. The tables are MINING_DATA_BUILD, MINING_DATA_TEST and
MINING_DATA_APPLY and are distributed as part of a DM4J tutorial [Oracle Data
Mining Tutorial, Release 9.0.4, 2004]. The data represents the demographics of
customers of an electronics shop chain that would like to offer loyalty cards
(AFFINITY_CARD) to customers that are expected to increase their buying. The
tables have identical structure, as required by the data mining tasks, and each consists
of 1500 records, none of which are identical. The table structure is shown in Table 1:

Name Data Type Size Nulls?


Table 1 Mining Data Table Structure

MINING_DATA_BUILD is used for the building of the data mining models for both
algorithms.

MINING_DATA_TEST is used as the test data to evaluate the effectiveness of the
models created from the build data. Roiger and Geatz [2003] state that evaluation of
supervised learning models involves determining the level of predictive accuracy.
Such models can be evaluated by comparing their test set error rates with expected
rates obtained from historical data of a similar form, in order to determine the
accuracy of the models and which model to apply if need be. [Roiger and Geatz, 2003]

MINING_DATA_APPLY is the data to which the built and tested model is applied
in order to make classifications. The results of applying the models to the data are
stored by DM4J for inspection and use. It is also possible to export the results to a
spreadsheet format, which was done in this case to allow for comparison.

2.3.2 Testing Models

The test model results produced by DM4J are depicted in confusion matrices and lift
charts. Confusion matrices can be used to determine the accuracy of classification
models and to show the number of false negative or false positive predictions made by
the model on the test data. Confusion matrices are best used for evaluating the
accuracy of models using categorical data which is being used in this case. [Roiger
and Geatz, 2003]
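
As a minimal sketch of how a confusion matrix and an accuracy figure can be derived for a binary target such as AFFINITY_CARD, the following Python fragment counts the four possible outcomes. The function names and the sample labels are hypothetical and are not taken from DM4J or from the tutorial data.

```python
from collections import Counter


def confusion_matrix(actual, predicted):
    """Count (actual, predicted) outcomes for a binary 0/1 target."""
    counts = Counter(zip(actual, predicted))
    return {
        "true_positive": counts[(1, 1)],
        "false_negative": counts[(1, 0)],
        "false_positive": counts[(0, 1)],
        "true_negative": counts[(0, 0)],
    }


def accuracy(matrix):
    """Fraction of test records that were classified correctly."""
    correct = matrix["true_positive"] + matrix["true_negative"]
    return correct / sum(matrix.values())


# Hypothetical test labels and predictions (illustrative only).
actual = [1, 0, 0, 1, 0, 1, 0, 0]
predicted = [1, 0, 0, 0, 0, 1, 1, 0]

matrix = confusion_matrix(actual, predicted)
print(matrix)
print(f"accuracy = {accuracy(matrix):.2f}")
```

In this illustrative sample one positive case is missed (a false negative) and one negative case is wrongly flagged (a false positive), so six of the eight records are classified correctly.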

2.3.3 Building the Models

Four models were built using the build data, two of the Naïve Bayes form and two of
the Adaptive Bayes Network form. The models were named nb, nbw, abn and abnw.
All the models were built using AFFINITY_CARD as the target attribute, that is, the
attribute that would be predicted. The algorithms aim to use the other attribute values
in a record to predict whether a customer is likely to increase spending if offered an
affinity card.

Naïve Bayes works by looking at the build data and calculating conditional
probabilities for the target value, AFFINITY_CARD. This is done by observing the
frequency of certain attribute values and combinations thereof. [Oracle Data Mining
Tutorial, Release 9.0.4, 2004] The two parameters that must be supplied to the Naïve
Bayes build wizard indicate how outliers in the data should be treated; occurrences
below the threshold values are ignored when creating the model. [Oracle Data Mining
Tutorial, Release 9.0.4, 2004]
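
The frequency counting described above can be illustrated with a textbook Naïve Bayes sketch in Python. This is not Oracle's implementation: it ignores the singleton and pairwise thresholds, uses add-one smoothing as an assumption, and the attribute names AGE_BAND and OWNS_PC are hypothetical (only AFFINITY_CARD comes from the data described here).

```python
from collections import Counter, defaultdict


def train(records, target):
    """Count class frequencies and per-class attribute value frequencies."""
    class_counts = Counter(r[target] for r in records)
    value_counts = defaultdict(Counter)   # (attribute, class) -> value counts
    attr_values = defaultdict(set)        # attribute -> distinct values seen
    for r in records:
        for attr, value in r.items():
            if attr != target:
                value_counts[(attr, r[target])][value] += 1
                attr_values[attr].add(value)
    return class_counts, value_counts, attr_values


def predict(model, record):
    """Return normalised class probabilities for one record."""
    class_counts, value_counts, attr_values = model
    total = sum(class_counts.values())
    scores = {}
    for cls, n in class_counts.items():
        score = n / total  # prior probability of the class
        for attr, value in record.items():
            # Add-one smoothing (an assumption) avoids zero probabilities.
            seen = value_counts[(attr, cls)][value]
            score *= (seen + 1) / (n + len(attr_values[attr]))
        scores[cls] = score
    norm = sum(scores.values())
    return {cls: s / norm for cls, s in scores.items()}


# Hypothetical build data (illustrative only).
build = [
    {"AGE_BAND": "young", "OWNS_PC": "yes", "AFFINITY_CARD": 1},
    {"AGE_BAND": "young", "OWNS_PC": "yes", "AFFINITY_CARD": 1},
    {"AGE_BAND": "old", "OWNS_PC": "no", "AFFINITY_CARD": 0},
    {"AGE_BAND": "old", "OWNS_PC": "yes", "AFFINITY_CARD": 0},
]
model = train(build, "AFFINITY_CARD")
print(predict(model, {"AGE_BAND": "young", "OWNS_PC": "yes"}))
```

The record passed to `predict` contains only the input attributes; the class with the larger normalised probability would be reported as the prediction.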

The singleton threshold value provides a threshold for the count of items that occur
frequently in the data. Given k as the number of times the item occurs in the data, P as
the number of data profiles or records and t as the singleton threshold expressed as a
percentage of P, the item is considered to occur frequently if k >= t*P. [Oracle
Help for Java, 1997-2004]

The pairwise threshold provides a threshold for the count of pairs of items that occur
frequently in the data. Given k as the number of times two items appear together in
the profiles and P and t as above, a pair is frequent if k > t*P. [Oracle Help for
Java, 1997-2004]
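
The two threshold tests can be sketched in Python as follows. The function names are hypothetical, and the comparison operators simply mirror the inequalities as stated in the text above.

```python
def is_frequent_item(k, num_records, threshold):
    """Singleton test: an item is frequent if k >= t * P."""
    return k >= threshold * num_records


def is_frequent_pair(k, num_records, threshold):
    """Pairwise test: a pair of items is frequent if k > t * P."""
    return k > threshold * num_records


# With P = 1500 records and t = 0.01, the cut-off count is t * P = 15.
P, t = 1500, 0.01
print(is_frequent_item(40, P, t))   # an item seen 40 times
print(is_frequent_item(9, P, t))    # an item seen 9 times
print(is_frequent_pair(22, P, t))   # a pair seen together 22 times
```

With the threshold of 0.01 used for the nb and nbw models and 1500 build records, items or pairs that occur in fewer than about 15 records would fall below the threshold.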

Adaptive Bayes Network works by ranking the attributes and then building a Naïve
Bayes model in order of the ranked attributes. The algorithm then builds a set of
features or 'trees', each of which is tested against the model to determine whether it
improves the accuracy of the model. If no improvement is found the feature is
discarded. When the number of discarded features reaches a certain level the building
stops, and the model consists of those features that remain. [Oracle Data Mining
Tutorial, Release 9.0.4, 2004] The detail of the various classification models is shown
below in Table 2:

Model  Algorithm       Weighting                Parameters                                           Features
nb     Naïve Bayes     none                     Singleton threshold: 0.01; Pairwise threshold: 0.01  NA
nbw    Naïve Bayes     3.0 for false negatives  Singleton threshold: 0.01; Pairwise threshold: 0.01  NA
abn    Adaptive Bayes  none                     default parameters                                   Multi-feature
abnw   Adaptive Bayes  3.0 for false negatives  default parameters                                   Multi-feature
Table 2 Model Detail

Training and Tuning the Model

Using ODM it is possible to assign weights to the target values when using Naïve
Bayes or Adaptive Bayes, so that the model predicts more of one kind of value if
testing shows a large number of false predictions of a certain kind. [Oracle Data
Mining Tutorial, Release 9.0.4, 2004] In this way bias can be built into the model to
increase predictions of the desired target value. In this investigation weighting was
used to introduce this bias because, when testing the abn model, it was apparent from
the confusion matrix that the model predicted 0 (no) in every case, and these
predictions were false in 346 of the cases. This level of false negative predictions
was very high, so weighting was a viable way to reduce it. nbw was then weighted in
the same way for purposes of comparison.

Although the weighting is chosen by trial and error, a weighting of 3.0 was used, as
suggested by the Oracle Data Mining Tutorial, Release 9.0.4 [2004]. The weighting is
associated with a certain type of false prediction, false negative or false positive,
and the model then treats a false prediction of that kind as three times as costly as
an error of the other kind. This forces the model to make more predictions in the
other direction. [Oracle Data Mining Tutorial, Release 9.0.4, 2004]
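
One common way to realise such a weighting is to choose the prediction with the lower expected cost. The Python sketch below illustrates that idea under the assumption of a simple expected-cost rule; it is not taken from the ODM documentation, and the function names and sample probabilities are hypothetical.

```python
def expected_costs(p_positive, fn_weight=3.0, fp_weight=1.0):
    """Return (cost of predicting 0, cost of predicting 1) for one record."""
    cost_predict_0 = fn_weight * p_positive          # wrong if the true value is 1
    cost_predict_1 = fp_weight * (1.0 - p_positive)  # wrong if the true value is 0
    return cost_predict_0, cost_predict_1


def weighted_prediction(p_positive, fn_weight=3.0, fp_weight=1.0):
    """Pick the prediction with the lower expected cost."""
    cost_0, cost_1 = expected_costs(p_positive, fn_weight, fp_weight)
    return (1, cost_1) if cost_1 < cost_0 else (0, cost_0)


# Illustrative probabilities of a positive target value (not from the paper's data).
for p in (0.10, 0.30, 0.60):
    unweighted, _ = weighted_prediction(p, fn_weight=1.0)
    weighted, cost = weighted_prediction(p)
    print(f"P(1)={p:.2f}  unweighted={unweighted}  weighted={weighted}  cost={cost:.2f}")
```

Under this assumed rule, a weighting of 3.0 on false negatives lowers the probability at which a positive prediction is made from 0.5 to 0.25, which is why a record with a probability of 0.30 switches from a negative to a positive prediction.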

2.3.4 Results

Testing the models on the test data set, MINING_DATA_TEST, produced confusion
matrices which were used to determine the accuracy of the model when tested on the
test data. The accuracy of the respective models is depicted in Table 5.

The models were then applied to the new data in the MINING_DATA_APPLY set.
The results were depicted by customer id and showed a prediction, 1 meaning yes or 0
meaning no, of whether a customer was likely to increase spending if offered an
affinity card. The probability of this prediction was also depicted as shown in a
sample below in Table 3. The results in this extract can be interpreted as the customer
with ID 100408 is predicted to increase spending and this prediction is given with a
probability of 0.9598. Customer 100413 is predicted not to increase spending with a
probability of 0.7854.


Prediction  Probability  Customer ID
1           0.9598       100408
0           0.7854       100413
Table 3. Extract of results from model nb.

Those models that were weighted provided predictions and cost figures. This cost
figure is provided instead of probability as the model makes predictions based on cost
in terms of the weighting in these cases. An extract from these types of results is
shown in Table 4. This extract can be interpreted as customer 100408 is predicted to
use an affinity card and the cost of such a prediction being incorrect is 0.0401.
Customer 100413 is predicted not to use an affinity card and if this prediction is
incorrect the cost is higher at 0.6437. Low cost can be interpreted as higher
probability as can be seen from the extract but it is not possible to directly calculate
probability from cost. See Tables 6 and 7 for comparative cost and probability figures
for the models. [Oracle Data Mining Tutorial, Release 9.0.4, 2004]


Prediction  Cost    Customer ID
1           0.0401  100408
0           0.6437  100413
Table 4. Extract of results from model nbw.

2.3.5 Interpretation of Results

It was possible to compare the results between the four models as shown in Table 5
and in some of the cases compare the difference in average probability or difference
in average cost as shown in Tables 6 and 7.

     Models        Accuracy of Model       Percentage Positive  Number of Predictions      Percentage Agreeing
     Compared      (Test)                  Predictions          in Agreement (total 1500)  Predictions
1    nb vs nbw     79.93333% vs 78.86667%  30.33% vs 33.80%     824                        54.9333%
2    nb vs abn     79.93333% vs 76.93333%  30.33% vs 0.00%      1045                       69.6667%
3    nb vs abnw    79.93333% vs 73.06667%  30.33% vs 43.33%     767                        51.1333%
4    abn vs nbw    76.93333% vs 78.86667%  0.00% vs 33.80%      993                        66.2000%
5    nbw vs abnw   78.86667% vs 73.06667%  33.80% vs 43.33%     1279                       85.2667%
6    abn vs abnw   76.93333% vs 73.06667%  0.00% vs 43.33%      850                        56.6667%
Table 5 Comparison of Model Results on sample data set, MINING_DATA_APPLY, of 1500 records.
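
The agreement figures in Table 5 are counts of records for which two models give the same prediction, expressed also as a percentage of the 1500 records. A minimal Python sketch of that calculation follows. The function name is hypothetical, and the prediction lists are synthetic: they only reproduce the reported totals for Comparison 2 (about 30.33% of 1500, that is 455, positive predictions for nb and none for abn), not the actual per-customer results.

```python
def agreement(predictions_a, predictions_b):
    """Return (number of matching predictions, percentage of matches)."""
    if len(predictions_a) != len(predictions_b):
        raise ValueError("both models must score the same records")
    matches = sum(1 for a, b in zip(predictions_a, predictions_b) if a == b)
    return matches, 100.0 * matches / len(predictions_a)


# Synthetic stand-ins for the nb and abn predictions on 1500 records.
nb_like = [1] * 455 + [0] * 1045   # 455 positive predictions
abn_like = [0] * 1500              # no positive predictions at all

matches, percentage = agreement(nb_like, abn_like)
print(matches, round(percentage, 4))   # 1045 69.6667
```

Because the second list contains no positive predictions, the number of agreeing predictions here is simply the number of negative predictions in the first list.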

                                              nb        abn
Average probability for positive predictions  0.893733  0
Average probability for negative predictions  0.968798  0.761749
Table 6. Comparison of average probabilities for unweighted models

                                       nbw       abnw
Average cost for positive predictions  0.159508  0.533948
Average cost for negative predictions  0.039525  0.191902
Table 7. Comparison of average costs for weighted models

Comparisons 2, 4 and 6 can be excluded from any noteworthy results, as it is apparent
that the use of Adaptive Bayes with no weighting gives no positive predictions for the
target attribute when the model is applied to the new data. This is unrealistic:
although the model abn showed an accuracy of 76.93333% during testing, this is not
corroborated during application of the model, so abn appears unreliable.

Comparison 1 shows the effect that weighting has when using the same algorithm to
build two different models. As expected nbw has a higher percentage of positive
predictions due to the weighting but the percentage of agreeing predictions of
54.9333% shows little corroboration. However, the two models built using Naïve
Bayes show the highest accuracy during testing.

Comparison 3 also shows a low level of corroboration at only a little over 51%. This
is possibly due to the fact that two different algorithms are used and weighting is used
in abnw and not nb.

The most interesting comparison is between nbw and abnw (Comparison 5). This
has a vastly higher percentage of agreement, 85.2667%, than the other comparisons.
This is interesting because two different algorithms are used but the weighting is the
same for both. The difference in test accuracy between the two models is a little
under six percentage points (78.86667% versus 73.06667%). Weighting had a smaller
effect on accuracy for Naïve Bayes (nb to nbw, a drop of about one percentage point)
than for Adaptive Bayes (abn to abnw, a drop of almost four percentage points),
although abn made no positive predictions and its accuracy can be deemed unreliable.

3. Conclusion

After building, testing and applying the models to the data it was possible to conduct a
comparison of the results.

It is possible to conclude that the only case in which the results of a Naïve Bayes
model and an Adaptive Bayes model seem to corroborate each other is when a
weighting of 3.0 for false negatives is set for both models. This is possibly due to the
fact that the Adaptive Bayes model only provides realistic results in this case and the
results of Naïve Bayes are affected by the weighting to show similar results to the
Adaptive Bayes model.

The results are not what was expected, as it was expected that the two types of model
would show more similarity in most of the cases. For this reason it appears that the
choice of modelling algorithm and parameters can have a significant impact on results
from the same data even when the category of data mining technique is the same.

4. Future Work

As an extension to this investigation it is hoped that a similar comparison may be
performed on data that is of interest to the university. The data being considered
documents students' school performance and their subsequent performance at
university. It will be of interest to determine whether a pattern is present in such
data, as well as to compare the results of the data mining given by the different
models in DM4J.

• Michael J.A. Berry and Gordon S. Linoff [2000], Mastering Data Mining: The Art and
Science of Customer Relationship Management, USA, Wiley Computer Publishing.
• Richard J. Roiger and Michael W. Geatz [2003], Data Mining: A Tutorial-Based Primer,
Boston, Massachusetts, Addison Wesley.
• Oracle9i Data Mining Concepts Release 2 (9.2), Oracle Technology Network, March 2002.
• Oracle Data Mining Tutorial, Release 9.0.4, Oracle Technology Network, February 2004.
• Oracle Help for Java, Version, Copyright 1997-2004.
