
The current issue and full text archive of this journal is available at

www.emeraldinsight.com/0368-492X.htm

Review of data, text and web mining software
Qingyu Zhang and Richard S. Segall
Department of Computer and Information Technology, College of Business,
Arkansas State University, Jonesboro, Arkansas, USA

Abstract
Purpose – The purpose of this paper is to review and compare selected software for data mining, text
mining (TM), and web mining that are not available as free open-source software.
Design/methodology/approach – Selected software packages are compared in terms of their common
and unique features. The software for data mining are SAS® Enterprise Miner™, Megaputer
PolyAnalyst® 5.0, NeuralWare Predict®, and BioDiscovery GeneSight®. The software for TM are
CompareSuite, SAS® Text Miner, TextAnalyst, VisualText, Megaputer PolyAnalyst® 5.0, and WordStat.
The software for web mining are Megaputer PolyAnalyst®, SPSS Clementine®, ClickTracks, and QL2.
Findings – This paper discusses and compares the existing features, characteristics, and algorithms
of selected software for data mining, TM, and web mining, respectively. The software packages are
also applied to available data sets.
Research limitations/implications – The limitations are the inclusion of selected software and
datasets rather than considering the entire realm of these. This review could be used as a framework
for comparing other data, text, and web mining software.
Practical implications – This paper can be helpful for an organization or individual when choosing
proper software to meet their mining needs.
Originality/value – Each of the software packages selected for this research has its own unique
characteristics, properties, and algorithms. No other paper compares these selected packages both
visually and descriptively for all three types of mining: data, text, and web.
Keywords Cybernetics, Data collection, Computer software, Internet, Database management systems
Paper type Research paper

Introduction
In the data mining community, there are three types of mining: data mining, text
mining (TM), and web mining (Zhang and Segall, 2008). Data mining primarily deals
with structured data organized in a database. TM mostly handles unstructured
data/text. Web mining lies in between and copes with semi-structured data and/or
unstructured data. The mining process includes:
(1) information selection and preprocessing;
(2) patterns analysis, recognition, and visualization; and
(3) validation and interpretation.

To effectively mine data, software with sufficient functionality should be used.
Currently, there are many different software packages, commercial or free, available on the
market. A comprehensive list of mining software is available on the web page of KDnuggets
(www.kdnuggets.com/software/index.html).

Kybernetes, Vol. 39 No. 4, 2010, pp. 625-655. © Emerald Group Publishing Limited, 0368-492X.
DOI 10.1108/03684921011036835

The authors would like to acknowledge the support provided by a 2009 Summer Faculty
Research Grant, awarded to them by the College of Business of ASU and without whose program
and support this paper could not have been done. The authors also want to acknowledge each of
the software manufacturers for their support of this paper.

This paper discusses selected software for data mining, TM, and web mining that are
not available as free open-source software. The selected software for data mining are
SAS® Enterprise Miner™, Megaputer PolyAnalyst® 5.0, NeuralWare Predict®, and
BioDiscovery GeneSight®. The selected software for TM are CompareSuite, SAS® Text
Miner, TextAnalyst, VisualText, Megaputer PolyAnalyst® 5.0, and WordStat. The
selected software for web mining are Megaputer PolyAnalyst®, SPSS Clementine®,
ClickTracks, and QL2.
These software packages are described and compared in terms of their existing features,
characteristics, and algorithms, and are also applied to available data sets.
Background on related literature and software is also presented. Screen shots of each
of the selected software packages are presented, as are conclusions and future directions.

Literature review
The Data Intelligence Group (1995) defined data mining as the extraction of hidden
predictive information from large databases. According to The Data Intelligence Group
(1995), “data mining tools scour databases for hidden patterns, finding predictive
information that experts may miss because it lies outside their expectations.”
Algorithms according to StatSoft (2006) are operations or procedures that will produce a
particular outcome with a completely defined set of steps or operations. This is opposed
to heuristics that are general recommendations or guides based upon theoretical
reasoning or statistical evidence such as “data mining can be a useful tool if used
appropriately.” Data mining and algorithms are widely implemented and rapidly
developed (Kim et al., 2008; Nayak, 2008; Segall and Zhang, 2006).
The growing accessibility of textual knowledge applications and online textual
sources has caused a boost in TM and web mining research. Hearst (2003) defines TM as
“the discovery of new, previously unknown information, by automatically extracting
information from different written sources.” Simply put, TM is the discovery of useful
and previously unknown “gems” of information from textual document repositories.
Hearst distinguishes TM from data mining by noting that in “text mining the patterns are
extracted from natural language rather than from structured database of facts.”
Amir et al. (2005) describe a new tool called maximal associations that allows
discovering interesting associations often lost by regular association rules. Spasic et al.
(2005) discuss ontologies and TM to automatically extract information and facts,
discover hidden associations and generate hypotheses germane to user needs.
Ontologies specify the interpretations of terms, echo the structure of the domain, and
thus can be used to support automatic semantic analysis of textual information. Seewald
et al. (2006) describe an application for relevance assessment for multi-document
summarization. To characterize certain document collections by a list of pertinent terms,
they have proposed a term utility function, which allows a user to define parameters for
continuous trade-off between precision and recall.
Srinivasan (2006) develops an algorithm to generate interesting hypotheses from a
set of text collections using Medline database. This is a fruitful path to ranking new
terms representing novel relationships and making scientific discoveries by TM. Metz
(2003) describes TM as a field in which “applications are clever enough to run conceptual
searches, locating, say, all the phone numbers and places names buried in a collection of
intelligence communiqués.” More impressive, “the software can identify relationships,
patterns, and trends involving words, phrases, numbers, and other data.”
Web mining is the application of data mining techniques to discover patterns from
the web and can be classified into three types: web content mining, web
usage mining, and web structure mining (Pabarskaite and Raudys, 2007; Sanchez et al.,
2008). Web content mining is the process of discovering useful information from the
content of a web page, which may consist of text, image, audio, or video data;
web usage mining is the application of data mining to analyze and discover
interesting patterns in users' usage of data on the web; and web structure mining is the
process of using graph theory to analyze the node and connection structure of a web
site (Wikipedia, 2007). An example of the latter would be discovering the authorities
and hubs of any web document, e.g. identifying the most appropriate web links for a
web page. Segall and Zhang (2009b) discussed web content and usage mining for
customer and marketing survey data using three web mining software packages.
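
To make the idea of "authorities and hubs" concrete, the following is a minimal sketch of one well-known way to compute them, the HITS iteration. The paper does not name a specific algorithm, so the choice of HITS is an assumption, and the page names in the usage note are hypothetical.

```python
from typing import Dict, List, Tuple


def hits(links: Dict[str, List[str]],
         iterations: int = 50) -> Tuple[Dict[str, float], Dict[str, float]]:
    """Return (hub, authority) scores for a directed link graph.

    `links` maps each page to the list of pages it links to.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority of a page: sum of the hub scores of pages linking to it.
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for t in targets:
                auth[t] += hub[src]
        # Hub score of a page: sum of the authority scores of pages it links to.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        # Normalize both score vectors to unit Euclidean length.
        for scores in (auth, hub):
            norm = sum(v * v for v in scores.values()) ** 0.5 or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth
```

For a hypothetical site where `home`, `dept`, and `library` all link to `admissions`, and `home` also links to `library`, the `admissions` page receives the highest authority score and `home` the highest hub score.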

Software background
There is a wealth of software today for data, text, and web mining, such as that presented by
the American Association for Artificial Intelligence (AAAI, 2002) and Ducatelle (2006) for
teaching data mining, Nisbet (2006) for customer relationship management (CRM), and the
software review of Deshmukah (1997). StatSoft (2006) presents screen shots of several
software packages that are used for exploratory data analysis and various data mining
techniques. Kim et al. (2008) classify software changes in data mining and Ceccato et al.
(2006) combine three mining techniques. Nayak (2008) develops and applies data
mining techniques in web services discovery and monitoring.
Lazarevic et al. (2006) discussed a software system for spatial data analysis and
modeling. Leung (2004) compares microarray data mining software. The National Center for
Biotechnology Information (2006), referred to as NCBI, provides tools for data mining,
including those specifically for nucleotide sequence analysis, protein sequence analysis
and proteomics, genome analysis, and gene expression. Lawrence Livermore National
Laboratory (LLNL, 2005) describes its Center for Applied Scientific Computing (CASC),
which is developing computational tools and techniques to help automate the exploration
and analysis of large scientific data sets.
Davi et al. (2005) review two TM packages, SAS Text Miner and WordStat. Chou et al.
(2008) apply a TM approach to internet abuse detection and Lau et al. (2005) discuss TM
for the hotel industry.
Sanchez et al. (2008) integrate software engineering and web mining techniques
in the development of an e-commerce recommender system capable of predicting the
preferences of its users and presenting them with a personalised catalogue. Chang and
Lee (2006) find frequent item sets using online data streams. Pabarskaite and Raudys
(2007) review the knowledge discovery process from web log data. Younas et al. (2006)
predict user behavior patterns in mobile web systems. Ganapathy et al. (2004) discuss
visualization strategies and tools for enhancing CRM.

Data mining software


This research compares four selected software packages for data mining:
SAS® Enterprise Miner™, Megaputer PolyAnalyst® 5.0, NeuralWare Predict®, and
BioDiscovery GeneSight®. The data mining algorithms performed include those
for neural networks, genetic algorithms, clustering, and decision trees. As
Table I shows, both SAS Enterprise Miner™ and PolyAnalyst®
5.0 offer more algorithms than GeneSight™ or NeuralWorks Predict®. GeneSight™
offers mainly cluster analysis, and NeuralWorks Predict® offers mainly neural network
applications, using statistical analysis and prediction to support these data mining
results. PolyAnalyst® 5.0 is the only software of these that provides link analysis
algorithms for both numerical and text data.

1. BioDiscovery GeneSight®
GeneSight™ is a product of BioDiscovery, Inc. of El Segundo, CA that focuses on
cluster analysis using two main techniques, hierarchical and partitioning clustering, both of
which are discussed in Prakash and Hoff (2002) for data mining of microarray gene
expressions.
Figure 1 shows the k-means clustering of global variations using the Pearson
correlation. Figure 2 shows the two-dimensional self-organizing map (SOM) for the 11
variables for all of the data using the Chebychev distance metric.
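
The two measures named above can be written out briefly. The sketch below assumes the common convention of turning the Pearson correlation r into a dissimilarity as 1 - r; the paper does not state how GeneSight defines it, so that convention and the function names are assumptions.

```python
import math
from typing import Sequence


def pearson_distance(x: Sequence[float], y: Sequence[float]) -> float:
    """Dissimilarity 1 - r, where r is the Pearson correlation of x and y."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("Pearson correlation is undefined for constant profiles")
    return 1.0 - cov / (sd_x * sd_y)


def chebyshev_distance(x: Sequence[float], y: Sequence[float]) -> float:
    """Largest absolute difference over all coordinates."""
    return max(abs(a - b) for a, b in zip(x, y))
```

Under this convention two expression profiles that rise and fall together have a Pearson distance near 0 even when their magnitudes differ, whereas the Chebychev distance is driven entirely by the single variable on which the two profiles differ most.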

2. Megaputer PolyAnalyst® 5.0
PolyAnalyst® 5.0 is a product of Megaputer Intelligence, Inc. of Bloomington, Indiana
and contains 16 advanced knowledge discovery algorithms.
Figure 3 shows the input data window for the forest cover type data in PolyAnalyst® 5.0.
The link diagram in Figure 4 illustrates the links among the six forest cover types,
the five elevations present, and the 40 soil types. The decision tree report
indicates a classification probability of 80.19 percent with a total classification error of
19.81 percent. Per the PolyAnalyst® output, the decision tree has a tree depth of 100 with 210
leaves, a depth of constructed tree of 16, and a classification efficiency of 47.52 percent.

Table I. Data mining software
Software (columns): GeneSight™, PolyAnalyst® 5.0, SAS Enterprise Miner™, NeuralWorks Predict®

Algorithms
Statistical analysis: X X X X
Neural networks: X X X
Decision trees: X X
Regression analysis: X X
Cluster analysis: X X X
Self-organizing map: X X
Association and sequence analysis (market basket analysis): X X
Link analysis: X X
Text analysis: X; in a different package, SAS Text Miner®

Figure 1. K-means clustering of global variations with the Pearson correlation using GeneSight®

3. SAS® Enterprise Miner™
SAS Enterprise Miner™ is a product of SAS Institute of Cary, NC and is based on the
SEMMA approach, the process of sampling (S), exploring (E), modifying (M),
modeling (M), and assessing (A) large amounts of data. SAS Enterprise Miner™ utilizes
a workspace with a drag-and-drop icon approach to constructing data mining
models. SAS Enterprise Miner™ utilizes algorithms for decision trees, regression, neural
networks, cluster analysis, and association and sequence analysis.
Figure 5 shows the workspace of SAS Enterprise Miner™ that was used in the data
mining of the human lung dataset. Figure 6 shows a partial view of the decision tree
diagram obtained by data mining using SAS Enterprise Miner™ as specified for a
depth of six from the initial node of NL279. Figure 7 shows that the normalized means
for the cluster proximities of the gene-type variables are scattered and not uniform.
The SOM illustrates the characteristics of the clusters, the importance of each variable, and
cluster proximities.

4. NeuralWorks Predict®
NeuralWorks Predict® is a product of NeuralWare of Carnegie, Pennsylvania. This
software relies on neural networks. NeuralWorks Predict® has a direct interface with
Microsoft Excel that allows display and execution of the Predict® commands as a
drop-down column within Microsoft Excel. Figure 8 shows the NeuralWare Predict®
window indicating completion of the neural network training algorithm. Figure 9 shows
the NeuralWare Predict® window indicating completion of model building using selected
parameters.

Figure 2. SOM with the Chebychev distance metric using GeneSight®

Figure 3. Input data window for the forest cover type data in PolyAnalyst® 5.0

Figure 4. Link diagram for each of the 40 soil types using PolyAnalyst® 5.0

Figure 5. Work space of SAS Enterprise Miner™ for human lung project

Figure 6. Decision tree diagram obtained upon applying SAS Enterprise Miner™ for specified depth of six from node ID = 1

Figure 7. SOM/Kohonen for human lung data using SAS Enterprise Miner™

Figure 8. NeuralWare Predict® window indicating completion of neural network training algorithm

Figure 9. NeuralWare Predict® window indicating completion of model building using selected parameters

Text mining software


Some of the popular software currently available for TM includes Compare Suite, SAS
Text Miner, Megaputer TextAnalyst, VisualText by Text Analysis International, Inc.
(TextAI), Megaputer PolyAnalyst, and WordStat. These packages provide a variety of
graphical views and analysis tools with powerful capabilities to discover knowledge
from text databases. The main focus here is to compare and discuss them, and to provide
sample output for each as a visual comparison.
As a visual comparison of the features of these six selected TM packages, the authors
of this paper constructed Table II, where essential functions are indicated as being either
present or absent with regard to data preparation, data analysis, results reporting, and
unique features. As shown in Table II, adapted from Segall and Zhang (2009a), Compare
Suite and TextAnalyst have minimal TM capabilities while Megaputer PolyAnalyst,
SAS Text Miner, and WordStat have extensive TM capabilities. VisualText has add-ons
that make this software versatile.

1. Compare Suite
Compare Suite is TM software developed by AKS-Labs, whose headquarters are
located in Raleigh, North Carolina, USA. The software can compare files of any type
with one another, including text files, MS Word, MS Excel, PDF, web pages, zip
archives, and binary files. It can compare two files character by character, word
by word, or by key words. Two folders can also be compared to find the changes made and
the files they contain. A report with detailed comparison information can be created
after a comparison. Documents can be compared online by server-side comparison.

Table II. Text mining software
Software (columns): Compare Suite, SAS Text Miner, TextAnalyst, VisualText, Megaputer PolyAnalyst, WordStat

Data preparation
Text parsing and extraction: X X X X X X
Define dictionary: X X X X
Automatic text cleaning: X, Add-on

Data analysis
Categorization: X X X X
Filtering: X X X
Concept linking: X, Add-on, X
Text clustering: X X, Add-on, X X
Dimension reduction techniques: X X
Natural language query: X X

Results reporting
Interactive results window: X X X X X
Support for multiple languages: X X X X

Unique features
Report generation, compare two folders; multi-path multi-paradigm feature analyzer; export any table to Excel

Figure 10 shows two animal files compared by words, with the results reported.
In the results, words that appear in both files are highlighted in
green and words that appear in only one of the two files are highlighted
in purple. Figure 11 shows the window for the “file comparison” option of Compare
Suite.
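
The word-by-word comparison described above can be illustrated with a few lines of set arithmetic. This is only a sketch of the idea, not Compare Suite's implementation; the function name and the tokenization rule are assumptions.

```python
import re
from typing import Dict, Set


def compare_by_words(text_a: str, text_b: str) -> Dict[str, Set[str]]:
    """Split two texts into lowercase words and report shared and unique words."""
    words_a = set(re.findall(r"[a-z0-9']+", text_a.lower()))
    words_b = set(re.findall(r"[a-z0-9']+", text_b.lower()))
    return {
        "common": words_a & words_b,   # would be highlighted in green
        "only_a": words_a - words_b,   # would be highlighted in purple
        "only_b": words_b - words_a,   # would be highlighted in purple
    }
```

A character-by-character comparison would instead need a sequence-alignment method, since it must preserve order; the set-based version here ignores word order and repetition.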

2. SAS Text Miner


SAS Text Miner is actually an “add-on” to SAS Enterprise Miner with the inclusion of an
extra icon in the “explore” section of the tool bar (Woodfield, 2004). SAS Text Miner
performs simple statistical analysis, exploratory analysis of textual data, clustering, and
predictive modeling of textual data. SAS Text Miner uses the “drag-and-drop” principle:
the selected icon is dragged from the tool set and dropped into the workspace. The
workspace of SAS Text Miner was constructed with a data icon of selected animal data
that was provided by SAS in their Instructor’s Trainer Kit, as shown in Figure 12.
Figure 13 shows the results of using SAS Text Miner with individual plots for “role by
frequency,” “number of documents by frequency,” “frequency by weight,” “attribute
by frequency,” and “number of documents by frequency scatter plot.” Figure 14 shows
the “concept linking figure” as generated by SAS Text Miner using the SASPDFSYNONYMS
text file.

Figure 10. Compare Suite with two animal text files
Source: Segall and Zhang (2009a)

3. Megaputer TextAnalyst
TextAnalyst has an ActiveX suite for dealing with text and semantic analysis.
According to the Megaputer web page (Ananyan and Kiselev, 2010), TextAnalyst uses a
semantic network similar to a molecular structure, determines the relative
importance of a text concept solely by analyzing its connections to other concepts in the
text, and implements algorithms similar to those used for text analysis in the
human brain.
Figure 15 shows a representative screen shot of Megaputer TextAnalyst, which
consists of a view pane of the file directory in the top left, a results pane in the top right,
and a text pane in the bottom part of the window. The view pane shows each of the nodes in the
semantic tree, each of which can be expanded. Megaputer TextAnalyst uses a semantic
search window where a query can be entered as full sentences or questions
instead of having to determine key words or phrases. A summary file in the top left pane
of the Megaputer TextAnalyst window can list the most important sentences in the
context of the original document. The summary chooses the sentences on the basis of
concepts and relationships between concepts in the full text.

Figure 11. Compare Suite window

Figure 12. Workspace of SAS Text Miner for animal text

Figure 13. Interactive window of SAS Text Miner for animal text

Figure 14. Concept links for term of “statistical” in SAS Text Miner using SASPDFSYNONYMS text file
Source: Woodfield (2004)

Figure 15. List of documents window in TextAnalyst

4. VisualText by TextAI
VisualText by TextAI (Text Analysis International, Inc.) uses natural language
processing, including “analyzers,” for extracting information from text repositories.
Some applications include databases of resumes, web pages, e-mails, or web
chat databases. VisualText allows users to create their own text analyzer, and also
includes TAI Parse for tagging parts of speech and chunking. Voice input needs
to be converted to text before text processing can be performed.
According to the VisualText web page, it can be used for combating terrorism, narcotic
espionage, nuclear proliferation, filtering documents, test grading, and automatic
coding. VisualText also allows natural language query, which is the ability to ask a
computer questions using plain language.
Figures 16 and 17 show screen shots of VisualText. Figure 16 shows the analyzer on
the left and the root of the text zone on the right. Figure 17 shows parsing trees.
A window in VisualText can show the dictionary with an expansion of the root of the text
zone in the right part of the window.

Figure 16. Screen-shot of VisualText window

Figure 17. Dictionary in VisualText

5. Megaputer PolyAnalyst
Previous work by the authors (Segall and Zhang, 2006) has utilized Megaputer
PolyAnalyst for data mining. The new release, PolyAnalyst Version 6.0, includes TM
and specifically new features for text online analytical processing and taxonomy-based
categorization, which Megaputer Intelligence (2007) cites as useful when dealing with
large collections of unstructured documents, such as tracking the number of known issues
in product repair notes and customer support letters.
According to Megaputer Intelligence (2007), PolyAnalyst “provides simple means for
creating, importing, and managing taxonomies, and carries out automated categorization of
text records against existing taxonomies.” Megaputer Intelligence (2007) provides examples
of applications to executives, customer support specialists, and analysts. According to
Megaputer Intelligence (2007), “executives are able to make better business decisions upon
viewing a concise report on the distribution of tracked issues during the latest observation
period.”
This paper provides several screen shots of Megaputer PolyAnalyst
Version 6.0 for TM. Figure 18 shows the TM workspace of Megaputer PolyAnalyst,
Figure 19 shows the “suffix tree clustering” report for the text cluster of (desk; front),
and Figure 20 shows the “link term” report for hotel customer survey text.
Megaputer PolyAnalyst can also provide drill-down text analysis and
histogram plots of text analysis.

Figure 18. Workspace for TM in Megaputer PolyAnalyst
Source: Segall and Zhang (2009a)

Figure 19. Clustering results in Megaputer PolyAnalyst

Figure 20. “Link term” report using text analysis in Megaputer PolyAnalyst

6. WordStat
WordStat is developed by Provalis Research. It is a text analysis software module that runs
on top of a base product, either SimStat or QDA Miner. It can be used to study textual
information such as interviews, answers to open-ended questions, journal articles, electronic
communications, and so on. WordStat may also be used for categorizing text
automatically using a dictionary approach, or for developing and validating new
categorization dictionaries or taxonomies.
WordStat incorporates many data analysis and graphical tools that can be used to
explore relationships between document contents and information stored in categorical
or numeric variables. Hierarchical clustering and multidimensional scaling analysis
can be used to identify relationships among categories and document similarity.
Correspondence analysis and plots can be used to explore relationships between keywords
and different groups.
An input file (e.g. an Excel file) can be imported into the software for analysis. An
important preliminary to WordStat analysis is to create a categorization dictionary
(which requires domain knowledge), as shown in the screen shot of Figure 21. WordStat
analysis consists of many tabulations or cross-tabulations of different categories.
Figure 22 shows 3D plots by WordStat. More detailed flash demos are available at:
www.provalisresearch.com/wordstat/wordstatflashdemo.html
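
The dictionary approach and cross-tabulation described above can be sketched in a few lines. This is an illustration of the general technique rather than WordStat's own procedure; the function names, the two-category dictionary, and the hotel-style records in the usage note are hypothetical.

```python
import re
from collections import Counter
from typing import Dict, Iterable, Set, Tuple


def categorize(text: str, dictionary: Dict[str, Set[str]]) -> Counter:
    """Count how many words of the text fall into each dictionary category."""
    counts: Counter = Counter()
    for word in re.findall(r"[a-z']+", text.lower()):
        for category, keywords in dictionary.items():
            if word in keywords:
                counts[category] += 1
    return counts


def crosstab(records: Iterable[Tuple[str, str]],
             dictionary: Dict[str, Set[str]]) -> Dict[str, Counter]:
    """Tabulate category counts by group for (group, text) records."""
    table: Dict[str, Counter] = {}
    for group, text in records:
        table.setdefault(group, Counter()).update(categorize(text, dictionary))
    return table
```

With a dictionary such as `{"service": {"staff", "desk", "friendly"}, "room": {"bed", "clean", "room"}}`, each record's text is reduced to category counts, and `crosstab` sums those counts per group (for example, per traveler type), giving the kind of table that a correspondence analysis would then take as input.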

Web mining software


Four selected software packages are reviewed and compared in terms of data preparation, data
analysis, and results reporting (Table III). As shown in Table III, Megaputer
PolyAnalyst® has the unique feature of a data and TM tool integrated with web site data source
input, while SPSS Clementine® takes a linguistic approach rather than a statistics-based
approach, ClickTracks has visitor analysis to provide robot and funnel reports, and QL2 is
data extraction software. Table III gives a visual summary of the differences and
similarities among these four selected packages.

Figure 21. Categorization dictionary using WordStat

Figure 22. 3D concepts maps using WordStat

1. Megaputer PolyAnalyst
Megaputer PolyAnalyst is an enterprise analytical system that integrates web mining
together with data and TM because it does not have a separate module for web mining.
Web pages or sites can be input directly to Megaputer PolyAnalyst as data source
nodes.
Megaputer PolyAnalyst has the standard data and TM functionalities such as
categorization, clustering, prediction, link analysis, keyword and entity extraction,
pattern discovery, and anomaly detection. These different functional nodes can be
directly connected to the web data source node for performing web mining analysis.
The Megaputer PolyAnalyst user interface allows the user to develop complex data analysis
scenarios without loading data in the system, thus saving the analyst's time.
According to Megaputer Intelligence (2007), whatever data sources are used,
PolyAnalyst provides means for loading and integrating these data. PolyAnalyst can
load data from disparate data sources including all popular databases, statistical, and
spreadsheet systems. In addition, it can load collections of documents in html, doc, PDF,
and txt formats, as well as load data from an internet web source. PolyAnalyst offers
visual “on-the-fly integration” and merging of data coming from disparate sources to
create data marts for further analysis. It supports incremental data appending and
referencing data sets in previously created PolyAnalyst projects.
Figures 23-25 are screen shots showing the application of Megaputer PolyAnalyst
for web mining to available data sets. Figure 23 shows an expanded view of the PolyAnalyst
workspace. Figure 24 shows a screen shot of PolyAnalyst using the web site of Arkansas
State University (ASU) as the web data source. Figure 25 shows a keyword extraction
report from the undergraduate admission web page of the ASU web site.
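
As a rough illustration of what a keyword extraction report from a web page involves, the sketch below strips markup from an HTML string and counts the remaining words. It is not PolyAnalyst's method; the stop-word list, class name, and function name are assumptions, and fetching the page over the network is deliberately left out.

```python
import re
from collections import Counter
from html.parser import HTMLParser
from typing import List, Tuple

# A deliberately small, hypothetical stop-word list.
STOPWORDS = {"the", "and", "for", "are", "with", "that", "this", "from"}


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of script and style tags."""

    def __init__(self) -> None:
        super().__init__()
        self.parts: List[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def extract_keywords(html: str, top_n: int = 5) -> List[Tuple[str, int]]:
    """Return the top_n most frequent non-stop-words in the page text."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    words = re.findall(r"[a-z]+", " ".join(parser.parts).lower())
    counts = Counter(w for w in words if len(w) > 2 and w not in STOPWORDS)
    return counts.most_common(top_n)
```

Raw frequency is the simplest possible ranking; a production tool would typically add stemming, phrase detection, and weighting against a background corpus.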

Table III. Web mining software
Software (columns): Megaputer PolyAnalyst®, SPSS Clementine®, ClickTracks, QL2

Data preparation
Data extraction: PolyAnalyst: X (web site as data source input); Clementine: import server files; ClickTracks: import web log files; QL2: Word documents, e-mails, spreadsheets, databases, PowerPoint files, HTML, images, and PDFs
Automatic data cleaning: X X X X

Data analysis
User segmentation: X X X
Detect users’ sequences: X
Understand product and content affinities (link analysis): X X
Predict user propensity to convert, buy, or churn: X X
Navigation report: X X
Keyword and search engine: X X X X
Visitor labeling: X
Robot report: X
Funnel report: X
E-mail campaign tracking: X
Revenue tracking: X

Results reporting
Interactive results window: X X X
Visual presentation: X X X X
Support for multiple languages: X X

Unique features
PolyAnalyst: data and TM tool integrated with web site data source input; Clementine: linguistic approach rather than statistics-based approach; ClickTracks: visitor analysis to provide robot and funnel reports; QL2: a data-extraction software

Figure 23. PolyAnalyst workspace with internet data source

Figure 24. PolyAnalyst using www.astate.edu as web data source

Figure 25. Keyword extraction report

2. SPSS Clementine
“Web Mining for Clementine is an add-on module that makes it easy for analysts to
perform ad hoc predictive web analysis within Clementine’s intuitive visual workflow
interface.” Web Mining for Clementine combines both Web Analytics (2007) and data
mining with SPSS analytical capabilities to transform raw web data into “actionable
insights.” It enables business decision makers to take more effective actions in real
time. SPSS (2007) cites examples such as automatically discovering user segments,
detecting the most significant sequences, understanding product and content affinities,
and predicting user intention to convert, buy, or churn.
SPSS (2007) claims four key data mining capabilities: segmentation, sequence
detection, affinity analysis, and propensity modeling. Specifically, SPSS (2007) indicates
six web analysis application modules within SPSS Clementine: search engine
optimization, automated user and visit segmentation, web site activity and user
behavior analysis, home page activity, activity sequence analysis, and propensity
analysis.
Unlike other platforms used for web mining that provide only simple frequency
counts (e.g. number of visits, ad hits, top pages, total purchase visits, and top click
streams), SPSS Clementine (SPSS, 2007) provides more meaningful customer intelligence
such as: likelihood to convert by individual visitor, likelihood to respond by individual
prospect, content clusters by customer value, missed cross-sell opportunities, and
event sequences by outcome (Segall and Zhang, 2009b).
Figures 26-28 are screen shots showing the applications of SPSS Clementine for web
mining to available data sets. Figure 26 shows the SPSS Clementine workspace with
251,998 records with seven fields extracted from a web log file. The user modes can be
defined including research mode, shopping mode, search mode, evaluation mode, and
so on. Figure 27 shows web data for different campaigns and classifier results using
different model types (e.g. CHAID, logistic, and neural net). Figure 28 shows decision
tree results.

Figure 26. SPSS Clementine workspace

Figure 27. Campaign a model use checklist

Figure 28. Decision tree results

3. ClickTracks by Web Analytics


ClickTracks by Web Analytics is a web metrics tool that makes online behavior
visible. Unlike other web statistical tools, ClickTracks shows information in context to
the user. ClickTracks shows where visitors go and what motivates them to take the
paths they take. According to the ClickTracks web site, ClickTracks unites visitor
information with the web site and lets web site owners know how people get to their
sites, where they click, and where they exit.
According to Web Analytics (2007), ClickTracks lets the user know more about buyers
and thus gives insights on how to turn more web site visitors into buyers:
ClickTracks lets user see buyers from many different aspects, such as identifying their entry
points, the paths they take and things they do on the way to the checkout. ClickTracks thus
gives valuable information to the user that he (she) can put into action.
Figures 29 and 30 are screen shots showing the applications of ClickTracks for web mining
to available data sets. Figure 29 shows search keywords and popular search engines
used, and statistical results for the Google search engine. Figure 30 shows the user’s
click sequence results and path view for item “Apple.”
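The entry points, exit points, and path views described above can be illustrated with a small Python sketch. This is only an assumption-laden illustration, not ClickTracks code: the session data, page names, and function names are hypothetical.

```python
from collections import Counter

# Hypothetical sessions: each is the ordered list of pages one visitor viewed.
SESSIONS = [
    ["/home", "/apple", "/checkout"],
    ["/search", "/apple", "/checkout"],
    ["/home", "/apple"],
    ["/home", "/pear"],
]

def entry_points(sessions):
    """Count the first page of each non-empty session."""
    return Counter(s[0] for s in sessions if s)

def exit_points(sessions):
    """Count the last page of each non-empty session."""
    return Counter(s[-1] for s in sessions if s)

def paths_through(sessions, page):
    """Count (previous page, next page) pairs around `page`.

    None marks the start or the end of a session.
    """
    pairs = Counter()
    for s in sessions:
        for i, p in enumerate(s):
            if p == page:
                prev_page = s[i - 1] if i > 0 else None
                next_page = s[i + 1] if i + 1 < len(s) else None
                pairs[(prev_page, next_page)] += 1
    return pairs

apple_paths = paths_through(SESSIONS, "/apple")
```

Here `apple_paths` summarizes where visitors came from and went to around a single item page, which is roughly the information a path view for one item conveys.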

4. QL2 by QL2 Software, Inc


QL2 is web data extraction software. According to QL2 (2007), it completely automates
the process of extracting information from any web site, even if the data is behind a
firewall, a subscription log-in, or a search form. It deploys intelligent agents to
automatically fetch information from the web. These intelligent agents navigate
complex web sites, log-in to subscription and password-protected sites, fill out forms
and input specific criteria to generate dynamic web pages. Intelligent agents can reach
any content accessible with a web browser in a fully automated fashion.

Figure 29. Statistics results for using Google search engine

Figure 30. Click sequence results

QL2 Software extracts the data regardless of format: Word documents, e-mails,
spreadsheets, databases, PowerPoint files, HTML, images, and PDFs. QL2 can also add
structure to the data it collects and output the information in an actionable format such as a
spreadsheet, database, or XML feed so data can be sorted, filtered, and queried with ease.
QL2 Software can extract data/information from both the world wide web and
unstructured documents, and integrate it into business intelligence in real time for a
360° view of business and the market. QL2 can be used to automatically mine
competitor web sites, online catalogs, news feeds, and regulatory filings, or to extract
data from PDFs, PowerPoint presentations, Word documents, and e-mail archives.
Figures 31 and 32 are screen shots showing the applications of QL2 for web mining
to available data sets. Figure 31 shows a screen shot of the QL2 workspace. Figure 32
shows a screen shot of the expanded inner queries for Best Buy data.
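The general idea of adding structure to extracted web content and writing it out in a sortable format can be sketched in Python using only the standard library. This is not QL2's query language or API; the HTML fragment, class name, and column names are hypothetical, and a real agent would first have to fetch the page.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical page fragment, invented for illustration.
HTML = """
<table>
  <tr><td>Laptop</td><td>799.99</td></tr>
  <tr><td>Camera</td><td>249.50</td></tr>
</table>
"""

class TableRows(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data.strip()

def extract_rows(html):
    """Parse an HTML string and return its table rows as lists of strings."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows

def to_csv(rows, header):
    """Serialize rows as CSV text so they can be sorted and filtered elsewhere."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()

rows = extract_rows(HTML)
csv_text = to_csv(rows, ["item", "price"])
```

The sketch covers only the final step, turning markup into rows; logging in, filling out forms, and navigating sites as described above are outside its scope.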

Conclusion and future research


This research shows that each of the software packages selected has its own unique
characteristics and properties, which are displayed when the software is applied to
the available data sets. As indicated, each software has its own set of
algorithm types that can be applied.
Comparing the four data mining software, NeuralWare Predict® focuses on neural
network algorithms and BioDiscovery GeneSight® focuses on cluster analysis.

Figure 31. Workspace for QL2
Source: Zhang and Segall (2008)

Figure 32. Inner query with records shown using Best Buy data

NeuralWare Predict and BioDiscovery GeneSight have fewer data mining functions than
SAS and PolyAnalyst do. NeuralWare Predict is used with Microsoft Excel, which makes it
easy to use. It produces results comparable with those of other software in terms of
prediction using neural networks. BioDiscovery GeneSight is primarily for clustering
analysis and is able to provide a variety of data mining visualization charts and colors.
Both SAS Enterprise Miner™ and Megaputer PolyAnalyst® employ each of the same
algorithms as illustrated in Table II, except that SAS has separate software, SAS Text
Miner®, for text analysis. The regression results obtained using
NeuralWare Predict® and Megaputer PolyAnalyst® are comparable. The cluster analysis
results for SAS Enterprise Miner™, BioDiscovery GeneSight®, and Megaputer PolyAnalyst®
differ in how each software represents its results. SAS Enterprise
Miner™ and NeuralWare Predict® both utilize SOM while the other two do not. In
conclusion, SAS Enterprise Miner™ and Megaputer PolyAnalyst® offer the greatest
diversity of data mining algorithms.
Comparing the six TM software, CompareSuite and TextAnalyst have minimal TM
capabilities while Megaputer PolyAnalyst, SAS Text Miner, and WordStat have
extensive TM capabilities. VisualText has add-ons that make it versatile.
SAS Text Miner is an add-on to base SAS Enterprise Miner that inserts an
additional Text Miner icon on the SAS Enterprise Miner workspace toolbar. SAS Text
Miner tags parts of speech and performs transformations, such as those using singular
value decompositions, to generate a term-document frequency matrix for viewing in the
Text Miner node. Megaputer PolyAnalyst® similarly combines both
data mining and TM, but also includes web mining capabilities. Megaputer also has
stand-alone TextAnalyst software for TM. Visual link analysis figures and their term
interactions, such as those generated using SAS Text Miner, are quite informative and more
exhibitive than those generated by Megaputer PolyAnalyst, and they provide views that the
other TM software cannot.
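For readers unfamiliar with the term-document frequency matrix mentioned above, the following minimal Python sketch builds one from a toy corpus. It is an illustration under stated assumptions, not SAS Text Miner's procedure: the documents and function name are hypothetical, tokenization is plain whitespace splitting, and no part-of-speech tagging or singular value decomposition is performed.

```python
from collections import Counter

# Hypothetical toy corpus, invented for illustration.
DOCS = [
    "web mining extracts patterns from web data",
    "text mining extracts patterns from text",
    "data mining finds patterns in data",
]

def term_document_matrix(docs):
    """Return (terms, matrix) where matrix[i][j] counts terms[i] in docs[j]."""
    counts = [Counter(doc.lower().split()) for doc in docs]
    terms = sorted(set().union(*counts))
    # A Counter returns 0 for a term that does not occur in a document.
    matrix = [[c[t] for c in counts] for t in terms]
    return terms, matrix

terms, matrix = term_document_matrix(DOCS)
row = dict(zip(terms, matrix))  # look up a term's counts across documents
```

A singular value decomposition would typically be applied to a matrix of this kind with a numerical library in order to reduce its dimensionality; that step is deliberately left out of the sketch.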
New features of WordStat 5.1 include: an improved phrase extraction routine with
built-in cross-tabulation, overlap identification, and filtering; the addition of more flexible
feature selection and classification routines to the automatic document classification
module; and visualization tools (bar charts, pie charts, correspondence plots).
Regarding web mining software, PolyAnalyst® can mine web data integrated within
a data mining enterprise analytical system and provide visual tools such as link analysis
of the critical terms of the text. SPSS Clementine® can be used for graphical illustrations
of customer web activities as well as for link analysis of different data categories
such as campaign, age, gender, and income. The ClickTracks workspace is more
object-oriented in construction and usage. Web mining using ClickTracks can include
cost of visitors, revenue from visitors, average time on site, and page views per visitor.
ClickTracks can also provide search reports for pay-per-click and regular search for each
selected item. QL2 is web data extraction software. Many of the web mining technologies
are applicable to text, web and click stream data. The selection of appropriate web
mining software should be based on both its available web mining technologies and
the type of data to be encountered.
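The per-visitor measures listed above (revenue from visitors, average time on site, and page views per visitor) are simple ratios, as the following Python sketch shows. The visit records, field order, and metric names are hypothetical and are not drawn from ClickTracks output; the function assumes a non-empty list of visits.

```python
# Hypothetical visit records, invented for illustration:
# (visitor_id, seconds_on_site, page_views, revenue)
VISITS = [
    ("v1", 120, 4, 0.0),
    ("v2", 300, 9, 49.99),
    ("v1", 60, 2, 19.99),
    ("v3", 30, 1, 0.0),
]

def site_metrics(visits):
    """Aggregate visit records into site-level ratios (assumes visits is non-empty)."""
    visitors = {v for v, _, _, _ in visits}
    total_seconds = sum(s for _, s, _, _ in visits)
    total_views = sum(p for _, _, p, _ in visits)
    total_revenue = sum(r for _, _, _, r in visits)
    return {
        "visitors": len(visitors),
        "avg_seconds_per_visit": total_seconds / len(visits),
        "page_views_per_visitor": total_views / len(visitors),
        "revenue_per_visitor": total_revenue / len(visitors),
    }

metrics = site_metrics(VISITS)
```

Whether time on site is averaged per visit or per visitor is a design choice; the sketch averages per visit and divides the other totals by the number of distinct visitors.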
Future research will investigate other mining software and their
available mining technologies for analyzing various types of data, and compare
the capabilities of these software packages with one another. This
future research would also include the acquisition of other data sets to perform these
new analyses and comparisons, and the application
of the software to other databases of different dimensionalities.

References
AAAI (2002), American Association for Artificial Intelligence (AAAI) Spring Symposium on
Information Refinement and Revision for Decision Making: Modeling for Diagnostics,
Prognostics, and Prediction, Software and Data, available at: www.cs.rpi.edu/~goebel/ss02/
software-and-data.html
Amir, A., Aumann, Y., Feldman, R. and Fresko, M. (2005), “Maximal association rules: a tool for
mining associations in text”, Journal of Intelligent Information Systems, Vol. 25 No. 3,
pp. 333-45.
Ananyan, S. and Kiselev, M. (2010), “Automated analysis of unstructured texts: technology and
implementations”, TextAnalyst Whitepaper, Megaputer, www.megaputer.com/down/
textanalyst_whitepaper.pdf
Ceccato, M., Marin, M., Mens, K., Moonen, L., Tonella, P. and Tourwé, T. (2006), “Applying and
combining three different aspect mining techniques”, Software Quality Journal, Vol. 14
No. 3, pp. 209-14.
Chang, J. and Lee, W. (2006), “Finding frequent itemsets over online data streams”, Information
and Software Technology, Vol. 48 No. 7, pp. 606-19.
Chou, C., Sinha, A. and Zhao, H. (2008), “A text mining approach to internet abuse detection”,
Information Systems and eBusiness Management, Vol. 6 No. 4, pp. 419-40.
Data Intelligence Group (1995), “An overview of data mining at Dun & Bradstreet”, DIG
White Paper 95/01, available at: www.thearling.com/text/wp9501/wp9501.htm
Davi, A., Haughton, D., Nasr, N., Shah, G., Skaletsky, M. and Spack, R. (2005), “A review of two
text-mining packages: SAS TextMining and WordStat”, The American Statistician, Vol. 59
No. 1, pp. 89-104.
Deshmukah, A.V. (1997), “Software review: ModelQuest Expert 1.0”, ORMS Today, December,
available at: www.lionhrtpub.com/orms/orms-12-97/software-review.html
Ducatelle, F. (2006), “Software for the data mining course”, School of Informatics, The University
of Edinburgh, Edinburgh, available at: www.inf.ed.ac.uk/teaching/courses/dme/html/
software2.html
Ganapathy, S., Ranganathan, C. and Sankaranarayanan, B. (2004), “Visualization strategies and
tools for enhancing customer relationship management”, Communications of the ACM,
Vol. 47 No. 11, pp. 92-8.
Hearst, M.A. (2003), “What is data mining?”, available at: www.ischool.berkeley.edu/~hearstr/
text_mining.html
Kim, S., Whitehead, E.J. Jr and Zhang, Y. (2008), “Classifying software changes: clean or buggy?”,
IEEE Transactions on Software Engineering, Vol. 34 No. 2, pp. 181-97.
Lau, K., Lee, K. and Ho, Y. (2005), “Text mining for the hotel industry”, Cornell Hotel &
Restaurant Administration Quarterly, Vol. 46 No. 3, pp. 344-63.
Lazarevic, A., Fiea, T. and Obradovic, Z. (2006), “A software system for spatial data analysis and
modeling”, available at: www.ist.temple.edu/~zoran/papers/lazarevic00.pdf
Leung, Y.F. (2004), “My microarray software comparison – data mining software”, Chinese
University of Hong Kong, Ma Liu Shui, September, available at: www.ihome.cuhk.edu.hk/
~b400559/arraysoft mining specific.html
LLNL (2005), “Scientific data mining and pattern recognition: overview”, Lawrence Livermore
National Laboratory, The Center for Applied Scientific Computing (CASC), available at:
www.llnl.gov/CASC/sapphire/overview/html
Megaputer Intelligence (2007), “Data mining, text mining, and web mining software”, Megaputer
Intelligence, Bloomington, IN, available at: www.megaputer.com
Metz, C. (2003), “Software: text mining”, PC Magazine, July 1, available at: www.pcmag.com/
print_article2/0,1217.a=43573,00.asp
National Center for Biotechnology Information (2006), “NCBI tools for data mining”, National
Library of Medicine, National Institutes of Health, Bethesda, MD, available at: www.ncbi.
nlm.nih.gov/Tools/
Nayak, R. (2008), “Data mining in web services discovery and monitoring”, International Journal
of Web Services Research, Vol. 5 No. 1, pp. 63-82.
Nisbet, R.A. (2006), “Data mining tools: which one is best for CRM? Part 3”, DM Review, March 21,
available at: www.dmreview.com/editorial/dmreview/print_action.cfm? articleId=
1049954.
Pabarskaite, Z. and Raudys, A. (2007), “A process of knowledge discovery from web log data:
systematization and critical review”, Journal of Intelligent Information Systems, Vol. 28
No. 1, pp. 79-105.
Prakash, P. and Hoff, B. (2002), “Microarray gene expression data mining with cluster analysis
using GeneSight™”, Application Note GS10, BioDiscovery, Marina del Rey, CA, available
at: http://signal.biosi.cf.ac.uk/wbg/pdf/appnotegs10.pdf
QL2 (2007), “QL2 Software”, available at: www.ql2.com (accessed June 5, 2007).
Sanchez, M., Moreno, M., Segrera, S. and Lopez, V. (2008), “Framework for the development of a
personalised recommender system with integrated web-mining functionalities”,
International Journal of Computer Applications in Technology, Vol. 33 No. 4, pp. 312-27.
Seewald, A., Holzbaur, C. and Widmer, G. (2006), “Evaluation of term utility functions for very
short multidocument summaries”, Applied Artificial Intelligence, Vol. 20, pp. 57-77.
Segall, R.S. and Zhang, Q. (2006), “Data visualization and data mining of continuous numerical
and discrete nominal-valued microarray databases for biotechnology”, Kybernetes:
International Journal of Systems and Cybernetics, Vol. 35 Nos 9/10, pp. 1538-66.
Segall, R.S. and Zhang, Q. (2009a), “A survey of selected software technologies for text mining”,
in Song, M. and Wu, Y. (Eds), Handbook of Research on Text and Web Mining
Technologies, Ch. XLIV, IGI Global, Hershey, PA, pp. 766-84.
Segall, R.S. and Zhang, Q. (2009b), “Web mining technologies for customer and marketing
surveys”, Kybernetes: The International Journal of Cybernetics, Systems and Management
Sciences, Vol. 38 No. 6, pp. 929-53.
Spasic, I., Ananiadou, S., McNaught, J. and Kumar, A. (2005), “Text mining and ontologies in
biomedicine: making sense of raw text”, Briefings in Bioinformatics, Vol. 6 No. 3, pp. 239-51.
SPSS (2007), “Web mining for Clementine”, available at: www.spss.com/web_mining_for_
clementine (accessed May 16, 2007).
Srinivasan, P. (2006), “Text mining: generating hypotheses from MEDLINE”, Journal of the
American Society for Information Science and Technology, Vol. 55 No. 5, pp. 396-413.
StatSoft (2006), Electronic Textbook, StatSoft, Tulsa, OK, available at: www.statsoft.com/
textbook/glosa.html
Web Analytics (2007), available at: www.clicktracks.com/ (accessed October 25, 2007).
Wikipedia (2007), “Web mining”, available at: http://en.wikipedia.org/wiki/Web_mining
Woodfield, T. (2004), Mining Textual Data Using SAS Text Miner for SAS9 Course Notes, SAS
Institute, Cary, NC.
Younas, M., Shakshuki, E. and Chao, K. (2006), “Efficient mining and prediction of user behavior
patterns in mobile web systems”, Information and Software Technology, Vol. 48 No. 6,
pp. 357-66.
Zhang, Q. and Segall, R.S. (2008), “Web mining: a survey of current research, techniques, and
software”, International Journal of Information Technology & Decision Making, Vol. 7
No. 4, pp. 683-720.

Corresponding author
Richard S. Segall can be contacted at: rsegall@astate.edu

