
Technical report

Performance on large datasets: Clementine® Server

Gain productivity throughout the data mining process


Clementine® measures data mining productivity by time-to-value: the time required for users to get results that deliver value in an organization. To shorten this time-to-value, Clementine introduced the concept of a visual workflow interface. With Clementine's visual workflow, users are able to move interactively through all steps in the data mining process, applying their business knowledge to guide them to meaningful results more quickly.

Other approaches to data mining have stressed raw processing power rather than productivity throughout the entire data mining process. Optimizing raw processing power comes at a price: models, for example, can be built quickly, but the models may not deliver results that offer any value to an organization. You get better performance with a data mining approach that minimizes time-to-value.

[Figure: Clementine's approach to scalability supports the CRISP-DM data mining process model.]

In the past, Clementine’s approach worked best with sampled data. With the release of
Clementine Server, Clementine’s interactive data mining approach can be used on much
larger datasets. This is because Clementine Server scales the entire data mining process.
For example, visualization techniques are scaled for data understanding. Data preparation
steps such as field and record operations also see significant gains, as do modeling processes
that include pre-processing steps. Finally, model evaluation and deployment can be performed
more efficiently.

How Clementine Server improves performance on large datasets


Clementine Server improves performance while mining large datasets by leveraging your investment in a database management system (DBMS). It does this by maximizing in-database mining: delegating as many operations as possible to the DBMS, thereby taking advantage of database indexing and optimized data operations.
The architecture has three tiers: the DBMS, or database tier; the application server tier; and the client tier. The client tier includes Clementine's visual workflow interface, with data mining "streams" that show all steps in your data mining processes. In previous versions, when a Clementine stream was executed, the client's processing power was used to perform data analysis.

[Figure: Clementine has a three-tier distributed architecture consisting of the DBMS, the Clementine Server application server and Clementine clients.]


With Clementine Server, however, the stream processing is pushed back onto the DBMS via
SQL queries. Any operation that cannot be represented as SQL queries is performed in a
more powerful application server tier. Only relevant results are passed back to the client tier.
This approach takes advantage of optimized operations in a DBMS and increased processing
power found at the application server tier to deliver predictable, scalable performance
against large datasets.
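As an illustration of this division of labor (a toy sketch, not Clementine's actual SQL generation; the node names and translation table are invented), the longest prefix of SQL-capable nodes in a stream can be collapsed into a query executed in the DBMS, with the remainder handled by application-tier code:

```python
import sqlite3

# In-memory database standing in for the DBMS tier.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (region TEXT, amount REAL)")
con.executemany("INSERT INTO sales VALUES (?, ?)",
                [("north", 10.0), ("north", 30.0), ("south", 5.0)])

# A "stream" of operations; the first two have SQL equivalents,
# the last (a custom Python transform) does not.
SQL_CAPABLE = {
    "select_region": "SELECT region, amount FROM sales WHERE region = 'north'",
    "aggregate":     "SELECT region, SUM(amount) AS total FROM sales "
                     "WHERE region = 'north' GROUP BY region",
}
stream = ["select_region", "aggregate", "custom_transform"]

# Push back the longest prefix of SQL-capable nodes: run the last
# translatable node's query in-database, then finish in "application" code.
prefix = []
for node in stream:
    if node in SQL_CAPABLE:
        prefix.append(node)
    else:
        break
rows = con.execute(SQL_CAPABLE[prefix[-1]]).fetchall()   # in-database part

# The remaining node runs in the application tier on the extracted rows.
result = [(region, total * 1.1) for region, total in rows]
print(result)
```

Only the aggregated rows cross the tier boundary; the raw table never leaves the database.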
Clementine gives you feedback when in-database mining is activated. During execution, nodes turn purple if the operations represented by the node are executed in-database. At the point of the last purple node in the stream, the remaining data is extracted and processed at the application server tier. Since in-database mining is almost always faster than application server processing, the more nodes that are pushed back to the database, the better. Clementine maximizes in-database mining by using rules of thumb to order operations automatically. You don't have to worry about the mechanics of stream building, because operation reordering is automatic. Instead, you can focus on the business problem at hand. Operations will not be reordered if the reordering would change the results of the stream.

[Figure: A Clementine stream with many nodes shown in purple during execution rather than the usual blue. Purple nodes mean that the operations they represent are being translated into SQL and executed in-database.]
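The reordering constraint can be sketched as a dependency check (the node structure and rule here are hypothetical, not Clementine's internals): a SQL-capable node may move ahead of a non-SQL node only if it does not read any field that node creates, so the stream's results are unchanged:

```python
# Toy reordering: a stream where a SQL-capable filter follows a
# Python-only derive. Filtering on an existing column commutes with
# deriving a new one, so the filter can safely move first and be
# pushed into the database. (Node names and fields are illustrative.)
stream = [
    {"op": "derive", "sql": False, "uses": ["amount"], "adds": ["amount2"]},
    {"op": "filter", "sql": True,  "uses": ["region"], "adds": []},
]

def can_swap(earlier, later):
    # Safe to move `later` before `earlier` only if `later` does not
    # read any field that `earlier` creates.
    return not set(later["uses"]) & set(earlier["adds"])

reordered = list(stream)
if (not reordered[0]["sql"] and reordered[1]["sql"]
        and can_swap(reordered[0], reordered[1])):
    reordered[0], reordered[1] = reordered[1], reordered[0]

print([n["op"] for n in reordered])   # the filter now runs first, in-database
```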

Benchmark testing

The purpose of this paper is to provide test results that give you an indication of the degree of scalability achieved with Clementine Server. Of course, there are many variables that can affect the performance you achieve with Clementine Server. Your computing system architecture, hardware, software and data mining solutions each play a role in the performance you achieve. For example, for the best performance you should exceed the system recommendations for RAM and free disk space, especially since additional disk space is required depending on the amount of data that is processed. Data volume is proportional to both the number of columns and the number of rows in your dataset.

Clementine Server divides the work as follows:

Database tier (DBMS)
Data processing:
■ Select, sort and aggregate records
■ Merge records by key
■ Filter and derive fields
Visualization:
■ Point or line plots
■ Distribution bar graphs
■ Web association graphs

Application server tier (Clementine Server application)
Any processing that cannot be performed in the database:
■ Other data processing and visualization steps
■ Modeling
■ Flat file access

Client tier
Relevant results:
■ Graphs and other output
■ Just the data being viewed (e.g. only the rows in a table that can be viewed at one time)

[Figure: Clementine Server's three-tier approach pushes back many data processing and visualization techniques into the DBMS, leaving other processing to be handled by the application server. Only relevant results are passed to the client.]
Also, more space is needed if you do not push data processing back into the database. When
processing cannot be done in the database because the operations cannot be expressed
as SQL queries or you are mining flat files, operations are done in the application server tier.
In these instances, use of the Aggregate, Distinct, Merge, Sort, Table or any modeling node
will create temporary disk copies of some or all of the data, requiring additional disk space.
A good rule of thumb for allocating additional disk space for data is to measure the size of
the largest table to be mined as a flat file and multiply by at least three.
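That rule of thumb is simple arithmetic; a small helper (the function name and figures are ours, not part of any Clementine API) makes the calculation explicit:

```python
# Rule of thumb from the text: allocate at least three times the size of
# the largest table (measured as a flat file) for temporary working copies.
def recommended_scratch_space(largest_table_bytes, factor=3):
    return factor * largest_table_bytes

largest = 2 * 1024**3          # a hypothetical 2GB flat-file extract
print(recommended_scratch_space(largest) / 1024**3)  # gigabytes needed
```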
Our intention with this white paper is to provide test results from common operations conducted with Clementine Server. These operations were selected because they are typical of the different stages of the data mining process. The results of these tests should give you an understanding of Clementine Server's performance on large datasets. The operations included are:

Data processing
This involves accessing two data sources, a customer data table and a transaction table.
The transactions are aggregated to a customer key and then merged with the customer data.
Two fields are derived.
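Every step in this benchmark has a SQL equivalent, which is why such a stream can execute fully in-database. A minimal sqlite3 sketch with made-up customer and transaction tables shows the same aggregate-merge-derive shape in one query:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customers (cust_id INTEGER, segment TEXT);
    INSERT INTO customers VALUES (1, 'retail'), (2, 'business');
    CREATE TABLE transactions (cust_id INTEGER, amount REAL);
    INSERT INTO transactions VALUES (1, 10.0), (1, 40.0), (2, 25.0);
""")

# Aggregate transactions to the customer key, merge with customer data,
# and derive two new fields (total and average spend) -- all in-database.
rows = con.execute("""
    SELECT c.cust_id,
           c.segment,
           SUM(t.amount) AS total_spend,   -- derived field 1
           AVG(t.amount) AS avg_spend      -- derived field 2
    FROM customers c
    JOIN transactions t ON t.cust_id = c.cust_id
    GROUP BY c.cust_id, c.segment
    ORDER BY c.cust_id
""").fetchall()
print(rows)
```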
Modeling
A new field is derived and then a C&RT decision tree model is built. C&RT is used because performance for this algorithm is a good indicator of overall model-building performance. Neural networks, for example, tend to take longer to train and GRI tends to take less time. The time taken to build models always depends on the data and the parameter settings of the model. Default settings in Clementine attempt to build a more accurate model, so if speed is more important, you may need to change the parameters.
Scoring
Unlike model building, which almost never requires using all the available cases to achieve a good result, scoring often requires that the whole population be scored. For example, a response rate for a mailing may be one percent; building a model of who will respond requires that the data be "balanced," meaning the number of responders roughly equals the number of non-responders. This means the data for training the model could be about two percent of the total population. Scoring, on the other hand, often needs to be done for the whole customer base (or at least a whole potential mailing population). These benchmarks show real-time scoring of a few cases and batch scoring (scoring a large batch of cases).
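The balancing arithmetic in that example is easy to make concrete (the population figure is hypothetical):

```python
# With a 1% response rate, a balanced training set (responders roughly
# equal to non-responders) is about 2% of the population, while scoring
# still has to touch every record.
population = 1_000_000           # hypothetical mailing population
response_rate = 0.01

responders = int(population * response_rate)      # 10,000 responders
balanced_training = responders * 2                # add equal non-responders
print(balanced_training / population)             # fraction used for training
print(population)                                 # rows touched by scoring
```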
The benchmark testing presented in this paper was performed with the following client and
server specifications:

Client                          Server
Windows 2000                    Windows NT Server
Dell Latitude CPt C400GT        Dell PowerEdge 6300
Intel Celeron 400               4 x 500MHz CPUs
130MB RAM                       1GB RAM
6GB disk                        36GB disk
10Mb Ethernet                   SQL Server


The dataset used in most of the testing has millions of records – from one million to 13 million.
One of the datasets in the data preparation benchmark has 16 fields — eight symbolic and
eight numeric — and the other dataset has eight fields — four symbolic and four numeric.
The model building dataset has nine fields, five of which are symbolic, and the table written
for scoring has eight fields with an equal number of symbolic and numeric variables.
All the figures shown (except real-time scoring) are from a testing environment in which a database is used and SQL optimization is enabled using the appropriate check box in the Clementine interface. Using a database but disabling SQL optimization means that the data being processed is still pulled from a database; this is different from reading flat files located in the application server tier. You will get better performance using a database even without SQL optimization enabled, because Clementine must read in all data from a flat file but only the relevant columns from a database. Using a database with Clementine is always strongly encouraged for the best performance, but it is even more important if you have a large number of fields. Using flat files in the server tier is still faster than using them in the desktop tier, however.
Caching, creating an optimized copy of the data, can help with flat file performance in the server
tier. Caching creates a performance hit (tests show that it takes about twice as long to read the
data) the first time a stream is run. However, benchmarks have shown subsequent runs of the
stream to be as much as eight times faster than without caching.
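The caching behavior described here can be sketched with a simple file cache (a toy stand-in, not Clementine's cache format): the first run pays the extra cost of writing an optimized copy, and later runs read it back instead of recomputing:

```python
import os
import pickle
import tempfile

# Toy node cache: first run computes and writes; later runs read the copy.
cache_path = os.path.join(tempfile.mkdtemp(), "node.cache")

def expensive_read():
    # Stand-in for reading and preparing a large flat file.
    return [i * 2 for i in range(100_000)]

def run_stream():
    if os.path.exists(cache_path):          # subsequent runs: fast path
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    data = expensive_read()                 # first run: compute...
    with open(cache_path, "wb") as f:       # ...plus the one-time caching hit
        pickle.dump(data, f)
    return data

first = run_stream()    # computes and writes the cache
second = run_stream()   # served from the cache
print(first == second)  # identical results either way
```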

Benchmark testing results: data processing

The average increase in time required to process one million records is consistent at around 69 seconds, demonstrating linear scalability.

This stream is an example of the Data Preparation Phase of the CRISP-DM process. It shows several common preparation steps, including aggregation, merging and deriving new fields that are necessary to prepare data for modeling. Data preparation typically takes 75 percent to 90 percent of the time required for an entire data mining project.

[Chart: data processing time in seconds (0 to 800) for datasets of 1 million to 13 million records, increasing linearly.]

The purple nodes show data steps performed in the database.


Benchmark testing results: modeling

The average increase in time required to process one million records increases slightly as millions of records are added, which means model building scales well.

This stream is an example of the Modeling Phase of the CRISP-DM process. It shows deriving a field and generating a C&RT model. The dataset includes eight fields, with a mix of continuous and categorical data. To get the highest performing models you need to try many models quickly. To accomplish this, use samples to try many models before building models on larger datasets.

[Chart: modeling time in seconds (0 to 4500) for datasets of 1 million to 13 million records.]
In addition to increased performance in model building due to in-database processing of the data preparation steps in model-building streams, Clementine Server also delivers improved performance for model building itself. This gain is not accomplished with in-database mining; instead, it comes from the increased processing power and efficiency of the application server.
Performance in training models depends on a number of factors. First, some types of models inherently consume more processing power than others; neural networks, for example, are more processor-intensive than regression models. The number of records and fields in your dataset and your computing system architecture, hardware and software can also affect model-building speed.
Keep in mind that models built quickly are not necessarily good models. You can speed up model building in Clementine by adjusting the model training default settings, but there is often a tradeoff between speed and accuracy. When working with large datasets, it may be best to try models tuned for speed first and then opt for more accurate settings as you determine which models are most appropriate for the task at hand.
Benchmark testing results: scoring

The average increase in time required to process one million records increases slightly as millions of records are added, which means scoring scales well.

[Chart: scoring time in seconds (0 to 1400) for datasets of 1 million to 13 million records.]


This stream is an example of


the Deployment Phase of the
CRISP-DM process. It shows
data access, applying generated
model scores and confidence
values, and writing back to a
new database table.
The second set of tests is designed to show interactive scoring results: single-case scoring with a complex stream, and several iterations of multiple concurrent tests of the same stream. This type of test is important for applications like Web personalization, where many concurrent hits arrive at the same time and this pattern repeats a number of times in a row.
Each “run” of this application involves the following:
■ Executable loads
■ Reads data from file of 50+ “special combination offer” products
■ Reads single case (data that’s been entered through form) from file
■ Combines case/offer data
■ Applies model to each case/offer
■ Applies other logic needed to create list of “best 10” offers for this customer
■ Writes “best 10” list to file
■ Executable exits

The scoring was run 1000 times with five concurrent processes. The test was performed on a small laptop with 128MB of memory. Each run took 0.22 seconds, an average of 255 runs per minute.

[Chart: runs per minute (0 to 1200) at concurrency levels x2 through x5.]
Conclusion
The ever-growing amount of data created by organizations presents opportunities and
challenges for data mining. Growing data warehouses that integrate all information about
customer interactions present new possibilities for delivering personalization, resulting in
increased profits and better service delivery. The challenge lies in making this vision a reality.
Scaling the entire data mining process with Clementine Server makes mining large datasets
more efficient, shortening the time needed to turn data into better customer relationships.


About SPSS
SPSS helps people solve business problems using statistics and data mining. This predictive
technology enables our customers in the commercial, higher education and public sectors to
make better decisions and improve results. SPSS software and services are used successfully
in a wide range of applications, including customer attraction and retention, cross-selling,
survey research, fraud detection, enrollment management, Web site performance, forecasting
and scientific research. SPSS' market-leading products and product lines include SPSS,®
Clementine,® AnswerTree,® DecisionTime,® SigmaPlot® and LexiQuest.™ For more information,
visit our Web site at www.spss.com.

SPSS is a registered trademark and the other SPSS products named are trademarks of SPSS Inc.
All other names are trademarks of their respective owners. Printed in the U.S.A. © Copyright 2002 SPSS Inc.

CLEMPERWP-0802
