www.emeraldinsight.com/0368-492X.htm
Review of data, text and web mining software
Qingyu Zhang and Richard S. Segall
Department of Computer and Information Technology, College of Business,
Arkansas State University, Jonesboro, Arkansas, USA
Abstract
Purpose – The purpose of this paper is to review and compare selected software for data mining, text
mining (TM), and web mining that are not available as free open-source software.
Design/methodology/approach – Selected software packages are compared in terms of their common and
unique features. The data mining packages are SAS® Enterprise Miner™, Megaputer PolyAnalyst® 5.0,
NeuralWare Predict®, and BioDiscovery GeneSight®. The TM packages are CompareSuite, SAS®
Text Miner, TextAnalyst, VisualText, Megaputer PolyAnalyst® 5.0, and WordStat. The web mining
packages are Megaputer PolyAnalyst®, SPSS Clementine®, ClickTracks, and QL2.
Findings – This paper discusses and compares the existing features, characteristics, and algorithms
of selected software for data mining, TM, and web mining, respectively. The software packages are also
applied to available data sets.
Research limitations/implications – The limitations are the inclusion of selected software and
datasets rather than considering the entire realm of these. This review could be used as a framework
for comparing other data, text, and web mining software.
Practical implications – This paper can be helpful for an organization or individual when choosing
proper software to meet their mining needs.
Originality/value – Each of the software packages selected for this research has its own unique
characteristics, properties, and algorithms. No other paper compares these selected packages both
visually and descriptively for all three types of mining: data, text, and web mining.
Keywords Cybernetics, Data collection, Computer software, Internet, Database management systems
Paper type Research paper
Introduction
In the data mining community, there are three types of mining: data mining, text
mining (TM), and web mining (Zhang and Segall, 2008). Data mining primarily deals
with structured data organized in a database. TM mostly handles unstructured
data/text. Web mining lies in between and copes with semi-structured data and/or
unstructured data. The mining process includes:
(1) information selection and preprocessing;
(2) patterns analysis, recognition, and visualization; and
(3) validation and interpretation.
Literature review
The Data Intelligence Group (1995) defined data mining as the extraction of hidden
predictive information from large databases. According to The Data Intelligence Group
(1995), “data mining tools scour databases for hidden patterns, finding predictive
information that experts may miss because it lies outside their expectations.”
According to StatSoft (2006), algorithms are operations or procedures that produce a
particular outcome through a completely defined set of steps. Heuristics, by contrast,
are general recommendations or guides based upon theoretical
reasoning or statistical evidence, such as “data mining can be a useful tool if used
appropriately.” Data mining and algorithms are widely implemented and rapidly
developed (Kim et al., 2008; Nayak, 2008; Segall and Zhang, 2006).
The growing accessibility of textual knowledge applications and online textual
sources has caused a boost in TM and web mining research. Hearst (2003) defines TM as
“the discovery of new, previously unknown information, by automatically extracting
information from different written sources.” Simply put, TM is the discovery of useful
and previously unknown “gems” of information from textual document repositories. Hearst
distinguishes TM from data mining by noting that in “text mining the patterns are
extracted from natural language rather than from structured database of facts.”
Amir et al. (2005) describe a new tool called maximal associations that allows
discovering interesting associations often lost by regular association rules. Spasic et al.
(2005) discuss ontologies and TM to automatically extract information and facts,
discover hidden associations and generate hypotheses germane to user needs.
Ontologies specify the interpretations of terms, echo the structure of the domain, and
thus can be used to support automatic semantic analysis of textual information. Seewald
et al. (2006) describe an application for relevance assessment for multi-document
summarization. To characterize certain document collections by a list of pertinent terms,
they have proposed a term utility function, which allows a user to define parameters for
continuous trade-off between precision and recall.
Srinivasan (2006) develops an algorithm to generate interesting hypotheses from a
set of text collections using the Medline database. This is a fruitful path to ranking new
terms representing novel relationships and making scientific discoveries by TM. Metz
(2003) describes TM as a field in which “applications are clever enough to run conceptual
searches, locating, say, all the phone numbers and place names buried in a collection of
intelligence communiqués.” More impressive, “the software can identify relationships,
patterns, and trends involving words, phrases, numbers, and other data.”
Web mining is the application of data mining techniques to discover patterns from
the web and can be classified into three types: web content mining, web
usage mining, and web structure mining (Pabarskaite and Raudys, 2007; Sanchez et al.,
2008). Web content mining is the process of discovering useful information from the
content of a web page, which may consist of text, image, audio, or video data;
web usage mining is the application of data mining to analyze and discover
interesting patterns in users’ usage of data on the web; and web structure mining is the
process of using graph theory to analyze the node and connection structure of a web
site (Wikipedia, 2007). An example of the latter would be discovering the authorities
and hubs of any web document, e.g. identifying the most appropriate web links for a
web page. Segall and Zhang (2009b) discussed web content and usage mining for
customer and marketing survey data using three web mining software packages.
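The notion of authorities and hubs in web structure mining can be illustrated with a minimal sketch of the HITS iteration, a widely used link-analysis procedure. The toy link graph, the page names, and the function name `hits` are hypothetical and are not taken from any of the software packages reviewed here.

```python
import math


def hits(links, iterations=50):
    """Return (hub, authority) score dicts for a directed link graph.

    `links` maps each page to the list of pages it links to.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of pages linking to the page.
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for dst in targets:
                auth[dst] += hub[src]
        norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        auth = {p: v / norm for p, v in auth.items()}
        # Hub score: sum of the authority scores of the pages it links to.
        hub = {p: sum(auth[d] for d in links.get(p, [])) for p in pages}
        norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        hub = {p: v / norm for p, v in hub.items()}
    return hub, auth


# Hypothetical toy site: two index pages point at two content pages.
toy_links = {
    "index_a": ["page_x", "page_y"],
    "index_b": ["page_x"],
    "page_x": [],
    "page_y": [],
}
hub_scores, auth_scores = hits(toy_links)
best_authority = max(auth_scores, key=auth_scores.get)
best_hub = max(hub_scores, key=hub_scores.get)
```

In this toy graph the page linked from both index pages receives the highest authority score, and the index page that links to the most content receives the highest hub score.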
Software background
There is a wealth of software today for data, text, and web mining, such as that presented by the
American Association for Artificial Intelligence (AAAI, 2002) and Ducatelle (2006) for
teaching data mining, Nisbet (2006) for customer relationship management (CRM), and
the software review of Deshmukah (1997). StatSoft (2006) presents screen shots of several
software packages that are used for exploratory data analysis and various data mining
techniques. Kim et al. (2008) classify software changes in data mining and Ceccato et al.
(2006) combine three mining techniques. Nayak (2008) develops and applies data
mining techniques in web services discovery and monitoring.
Lazarevic et al. (2006) discussed a software system for spatial data analysis and
modeling. Leung (2004) compares microarray data mining software. The National Center for
Biotechnology Information (2006), referred to as NCBI, provides tools for data mining,
including tools specific to each of the following categories: nucleotide sequence
analysis, protein sequence analysis and proteomics, genome analysis, and gene
expression. Lawrence Livermore National Laboratory (LLNL, 2005) describes its Center
for Applied Scientific Computing (CASC), which is developing computational tools and
techniques to help automate the exploration and analysis of large scientific data sets.
Davi et al. (2005) review two TM packages, SAS Text Miner and WordStat. Chou et al.
(2008) apply a TM approach to internet abuse detection and Lau et al. (2005) discuss TM
for the hotel industry.
Sanchez et al. (2008) integrate software engineering and web mining techniques
in the development of an e-commerce recommender system capable of predicting the
preferences of its users and presenting them with a personalised catalogue. Chang and
Lee (2006) find frequent item sets using online data streams. Pabarskaite and Raudys
(2007) review the knowledge discovery process from web log data. Younas et al. (2006)
predict user behavior patterns in mobile web systems. Ganapathy et al. (2004) discuss
visualization strategies and tools for enhancing CRM.
1. BioDiscovery GeneSight®
GeneSight™ is a product of BioDiscovery, Inc. of El Segundo, CA that focuses on
cluster analysis using two main techniques, hierarchical and partitioning clustering, both of
which are discussed in Prakash and Hoff (2002) for data mining of microarray gene
expressions.
Figure 1 shows the k-means clustering of global variations using the Pearson
correlation. Figure 2 shows the two-dimensional self-organizing map (SOM) for the 11
variables for all of the data using the Chebychev distance metric.
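The two similarity measures named above can be written out in a few lines. The sketch below is illustrative only: how GeneSight computes these measures internally is not described here, the convention of using one minus the correlation as a distance is an assumption, and the function names and expression profiles are hypothetical.

```python
import math


def pearson_correlation(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def pearson_distance(x, y):
    """Assumed convention: distance = 1 - correlation (0 for identical shape)."""
    return 1.0 - pearson_correlation(x, y)


def chebyshev_distance(x, y):
    """Chebyshev distance: the largest absolute coordinate difference."""
    return max(abs(a - b) for a, b in zip(x, y))


# Hypothetical expression profiles: same shape, different magnitude.
profile_a = [1.0, 2.0, 3.0, 4.0]
profile_b = [2.0, 4.0, 6.0, 8.0]
d_pearson = pearson_distance(profile_a, profile_b)
d_chebyshev = chebyshev_distance(profile_a, profile_b)
```

The example shows why the choice of metric matters for clustering: the two profiles are perfectly correlated, so their Pearson distance is zero, while their Chebyshev distance is 4.0 because the last coordinates differ by four.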
Table I. Data mining software
Algorithms  Software (columns): GeneSight™; PolyAnalyst® 5.0; SAS Enterprise Miner™; NeuralWorks Predict®
Statistical analysis X X X X
Neural networks X X X
Decision trees X X
Regression analysis X X
Cluster analysis X X X
Self-organizing map X X
Association and sequence analysis (market basket analysis) X X
Link analysis X X
Text analysis X; in a different package, SAS Text Miner®
Figure 1. K-means clustering of global variations with the Pearson correlation using GeneSight®
modeling (M), and assessing (A) large amounts of data. SAS Enterprise Miner™ utilizes
a workspace with a drag-and-drop approach to placing icons when constructing data mining
models. SAS Enterprise Miner™ utilizes algorithms for decision trees, regression, neural
networks, cluster analysis, and association and sequence analysis.
Figure 5 shows the workspace of SAS Enterprise Miner™ that was used in the data
mining of the human lung dataset. Figure 6 shows a partial view of the decision tree
diagram obtained by data mining using SAS Enterprise Miner™ as specified for a
depth of six from the initial node of NL279. Figure 7 shows that the normalized means
for the cluster proximities of the gene-type variables are scattered and not uniform.
The SOM illustrates the characteristics of the clusters, importance of each variable, and
cluster proximities.
4. NeuralWorks Predict®
NeuralWorks Predict® is a product of NeuralWare of Carnegie, Pennsylvania. This
software relies on neural networks. NeuralWorks Predict® has a direct interface with
Microsoft Excel that allows display and execution of the Predict® commands as a
drop-down column within Microsoft Excel. Figure 8 shows the NeuralWare Predict®
window indicating completion of the neural network training algorithm. Figure 9 shows
the NeuralWare Predict® window indicating completion of model building using selected
parameters.
K
39,4
Figure 2. SOM with the Chebychev distance metric using GeneSight®
Figure 3. Input data window for the forest cover type data in PolyAnalyst® 5.0
Figure 4. Link diagram for each of the 40 soil types using PolyAnalyst® 5.0
Figure 5. Work space of SAS Enterprise Miner™ for human lung project
Figure 6. Decision tree diagram obtained upon applying SAS Enterprise Miner™ for specified depth of six from node ID = 1
Figure 7. SOM/Kohonen for human lung data using SAS Enterprise Miner™
Figure 8. NeuralWare Predict® window indicating completion of neural network training algorithm
Figure 9. NeuralWare Predict® window indicating completion of model building using selected parameters
1. Compare Suite
Compare Suite is TM software developed by AKS-Labs, whose headquarters are
located in Raleigh, North Carolina, USA. The software allows any file to be compared with any
other file, including formats such as text files, MS Word, MS Excel, PDF, web pages, zip
archives, and binary files. It can compare two files character by character, word
by word, or by key words. Two folders can also be compared to find changes made to the folders
and the files they contain. A report with detailed comparison information can be created after
a comparison. Documents can also be compared online by server-side comparison.
Table II. Text mining software
Features  Software (columns): Compare Suite; SAS Text Miner; TextAnalyst; VisualText; Megaputer PolyAnalyst; WordStat
Data preparation
Text parsing and extraction X X X X X X
Define dictionary X X X X
Automatic text cleaning X Add-on
Data analysis
Categorization X X X X
Filtering X X X
Concept linking X Add-on X
Text clustering X X Add-on X X
Dimension reduction techniques X X
Natural language query X X
Results reporting
Interactive results window X X X X X
Support for multiple languages X X X X
Unique features: Report generation, compare two folders; Multi-path multi-paradigm feature analyzer; Export any table to Excel
Two animal files are shown in Figure 10 compared by words, with the results reported.
In the results, words that appear in both files are highlighted in
green and words that appear in only one of the two files are highlighted
in purple. Figure 11 shows the window that Compare Suite provides for the
“file comparison” option for text.
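The word-by-word comparison described above can be approximated with set operations. This is only a minimal sketch of the general idea, not Compare Suite's actual algorithm; the function name `compare_by_words` and the two sample sentences are hypothetical.

```python
import re


def compare_by_words(text_a, text_b):
    """Split two texts into lowercase words and report shared and unique words."""
    words_a = set(re.findall(r"[a-z']+", text_a.lower()))
    words_b = set(re.findall(r"[a-z']+", text_b.lower()))
    return {
        "common": sorted(words_a & words_b),      # would be highlighted in one color
        "only_in_a": sorted(words_a - words_b),   # unique to the first file
        "only_in_b": sorted(words_b - words_a),   # unique to the second file
    }


# Hypothetical stand-ins for the two animal text files.
result = compare_by_words("The cat chased the mouse.", "The dog chased the cat.")
```

Here `result["common"]` holds the words found in both texts, and the other two entries hold the words that appear in only one of them, mirroring the two highlight colors in the report.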
Figure 10. Compare Suite with two animal text files
Source: Segall and Zhang (2009a)
3. Megaputer TextAnalyst
TextAnalyst has an ActiveX suite for dealing with text and semantic analysis.
According to the Megaputer web page (Ananyan and Kiselev, 2010), TextAnalyst uses a
semantic network similar to a molecular structure, determines the relative
importance of a text concept solely by analyzing its connections to other concepts in the
text, and implements algorithms similar to those used for text analysis in the
human brain.
Figure 15 shows a representative screen shot of Megaputer TextAnalyst, which
consists of a view pane of the file directory at the top left, a results pane at the top right,
and a text pane in the bottom part of the window. The view pane shows each of the nodes in the
semantic tree, each of which can be expanded. Megaputer TextAnalyst uses a semantic
search window where a query can be entered either as full sentences or questions
instead of having to determine key words or phrases. A summary file in the top left pane
of the Megaputer TextAnalyst window can list the most important sentences in the
context of the original document. The summary chooses the sentences on the basis of
concepts and relationships between concepts in the full text.
4. VisualText by TextAI
VisualText by TextAI (Text Analysis International, Inc.) uses natural language
processing including “analyzers” for extracting information from text repositories.
Some applications include databases of resumes, web pages, or even e-mails or web
chat databases. VisualText allows the user to create their own text analyzer, and also
includes a TAI Parse for tagging parts of speech and chunking. Voice input needs
to be converted to text first before text processing can be performed.
Figure 11. Compare Suite window
Figure 12. Workspace of SAS Text Miner for animal text
Figure 13. Interactive window of SAS Text Miner for animal text
Figure 14. Concept links for term of “statistical” in SAS Text Miner using SASPDFSYNONYMS text file
Source: Woodfield (2004)
Figure 15. List of documents window in TextAnalyst
According to the VisualText web page, it can be used for combating terrorism, narcotic
espionage, nuclear proliferation, filtering documents, test grading, and automatic
coding. VisualText also allows natural language query, which is the ability to ask a
computer questions using plain language.
Figures 16 and 17 show screen shots of VisualText. Figure 16 shows the analyzer on
the left and the root of the text zone on the right. Figure 17 shows parsing trees.
A window in VisualText can show a dictionary with an expansion of the root of the text
zone on the right part of this window.
5. Megaputer PolyAnalyst
Previous work by the authors (Segall and Zhang, 2006) utilized Megaputer
PolyAnalyst for data mining. The new release, PolyAnalyst Version 6.0, includes TM
and specifically new features for text online analytical processing and taxonomy-based
categorization. As discussed in Megaputer Intelligence (2007),
taxonomy-based classifications are useful when dealing with large collections of
unstructured documents, such as when tracking the number of known issues in product repair
notes and customer support letters.
According to Megaputer Intelligence (2007), PolyAnalyst “provides simple means for
creating, importing, and managing taxonomies, and carries out automated categorization of
text records against existing taxonomies.” Megaputer Intelligence (2007) provides examples
of applications to executives, customer support specialists, and analysts. According to
Megaputer Intelligence (2007), “executives are able to make better business decisions upon
viewing a concise report on the distribution of tracked issues during the latest observation
period.”
Figure 16. Screen-shot of VisualText window
Figure 17. Dictionary in VisualText
This paper provides several figures of actual screen shots of Megaputer PolyAnalyst
Version 6.0 for TM. Figure 18 shows the TM workspace of Megaputer PolyAnalyst,
Figure 19 is the “suffix tree clustering” report for the text cluster of (desk; front),
and Figure 20 is a screen-shot of the “link term” report of hotel customer survey text.
Megaputer PolyAnalyst can also provide screen shots with drill-down text analysis and
histogram plots of text analysis.
6. WordStat
WordStat is developed by Provalis Research. It is a text analysis software module run
on a base product of SimStat or QDA Miner. It can be used to study textual information
such as interviews, answers to open-ended questions, journal articles, electronic
communications, and so on. WordStat may also be used for categorizing text
automatically using a dictionary approach, or for developing and validating new
categorization dictionaries or taxonomies.
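The dictionary approach to categorization mentioned above can be sketched as counting, for each category, how many words of a document appear in that category's keyword list. The dictionary contents, the function name `categorize`, and the sample comment are hypothetical; this is not WordStat's implementation.

```python
import re
from collections import Counter

# Hypothetical categorization dictionary: category -> set of keywords.
DICTIONARY = {
    "service": {"staff", "friendly", "helpful"},
    "room": {"bed", "clean", "bathroom"},
}


def categorize(text, dictionary):
    """Count keyword matches per category for one document."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter()
    for token in tokens:
        for category, keywords in dictionary.items():
            if token in keywords:
                counts[category] += 1
    return counts


counts = categorize(
    "Friendly and helpful staff, but the bathroom was not clean.", DICTIONARY
)
```

The example also hints at why building the dictionary needs domain knowledge: a plain keyword match counts "clean" toward the room category even though the comment says the bathroom was not clean.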
WordStat incorporates many data analysis and graphical tools that can be used to
explore relationships between document contents and information amassed in categorical
or numeric variables. Hierarchical clustering and multidimensional scaling analysis
can be used to identify relationships among categories and document similarity.
Figure 18. Workspace for TM in Megaputer PolyAnalyst
Source: Segall and Zhang (2009a)
Figure 19. Clustering results in Megaputer PolyAnalyst
Figure 20. “Link term” report using text analysis in Megaputer PolyAnalyst
Correspondence analysis and plots can be used to explore relationships between keywords
and different groups.
An input file (e.g. Excel file) can be imported into the software for analysis. An
important preliminary to WordStat analysis is to create a categorization dictionary
(which needs domain knowledge) as shown by the screen-shot in Figure 21. WordStat
analysis consists of many tabulations or cross-tabulations of different categories.
Figure 22 shows 3D plots by WordStat. There are more detailed flash demos at:
www.provalisresearch.com/wordstat/wordstatflashdemo.html
1. Megaputer PolyAnalyst
Megaputer PolyAnalyst is an enterprise analytical system that integrates web mining
together with data and TM because it does not have a separate module for web mining.
Web pages or sites can be input directly to Megaputer PolyAnalyst as data source
nodes.
Figure 21. Categorization dictionary using WordStat
Figure 22. 3D concepts maps using WordStat
2. SPSS Clementine
“Web Mining for Clementine is an add-on module that makes it easy for analysts to
perform ad hoc predictive web analysis within Clementine’s intuitive visual workflow
interface.” Web Mining for Clementine combines both Web Analytics (2007) and data
mining with SPSS analytical capabilities to transform raw web data into “actionable
insights.” It enables business decision makers to take more effective actions in real
time. SPSS (2007) claims examples of automatically discovering user segments,
detecting the most significant sequences, understanding product and content affinities,
and predicting user intention to convert, buy, or churn.
Table III. Web mining software
Features  Software (columns): Megaputer PolyAnalyst®; SPSS Clementine®; ClickTracks; QL2
Data preparation
Data extraction: X (web site as data source input); Import server files; Import web log files; Word documents, e-mails, spreadsheets, databases, PowerPoint files, HTML, images, and PDFs
Automatic data cleaning X X X X
Data analysis
User segmentation X X X
Detect users’ sequences X
Understand product and content affinities (link analysis) X X
Predict user propensity to convert, buy, or churn X X
Navigation report X X
Keyword and search engine X X X X
Visitor labeling X
Robot report X
Funnel report X
E-mail campaign tracking X
Revenue tracking X
Results reporting
Interactive results window X X X
Visual presentation X X X X
Support for multiple languages X X
Unique features: Data and TM tool integrated with web site data source input; Linguistic approach rather than statistics-based approach; Visitor analysis to provide robot and funnel reports; A data-extraction software
Figure 23. PolyAnalyst workspace with internet data source
Figure 24. PolyAnalyst using www.astate.edu as web data source
Figure 25. Keyword extraction report
SPSS (2007) claims four key data mining capabilities: segmentation, sequence
detection, affinity analysis, and propensity modeling. Specifically, SPSS (2007) indicates
six web analysis application modules within SPSS Clementine: search engine
optimization, automated user and visit segmentation, web site activity and user
behavior analysis, home page activity, activity sequence analysis, and propensity
analysis.
Unlike other platforms used for web mining that provide only simple frequency
counts (e.g. number of visits, ad hits, top pages, total purchase visits, and top click
streams), SPSS Clementine provides more meaningful customer intelligence
such as: likelihood to convert by individual visitor, likelihood to respond by individual
prospect, content clusters by customer value, missed cross-sell opportunities, and
event sequences by outcome (SPSS, 2007; Segall and Zhang, 2009b).
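For contrast, the “simple frequency counts” referred to above amount to little more than tallying log records. The sketch below assumes a hypothetical, heavily simplified log format of (visitor, page) pairs; the record layout and the function name are illustrative and do not reflect any particular product.

```python
from collections import Counter

# Hypothetical, simplified web log records: (visitor_id, requested_page).
LOG_RECORDS = [
    ("v1", "/home"),
    ("v1", "/products"),
    ("v2", "/home"),
    ("v3", "/home"),
    ("v3", "/checkout"),
]


def simple_frequency_counts(records):
    """Tally hits, unique visitors, and the most requested page."""
    page_hits = Counter(page for _, page in records)
    visitors = {visitor for visitor, _ in records}
    return {
        "total_hits": len(records),
        "unique_visitors": len(visitors),
        "top_pages": page_hits.most_common(1),
    }


summary = simple_frequency_counts(LOG_RECORDS)
```

Counts of this kind describe what happened on the site in aggregate, but they say nothing about an individual visitor's likelihood to convert, which is the kind of modeling output the passage above attributes to SPSS Clementine.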
Figures 26-28 are screen shots showing the applications of SPSS Clementine for web
mining to available data sets. Figure 26 shows the SPSS Clementine workspace with
251,998 records with seven fields extracted from a web log file. The user modes can be
defined including research mode, shopping mode, search mode, evaluation mode, and
so on. Figure 27 shows web data for different campaigns and classifier results using
different model types (e.g. CHAID, logistic, and neural net). Figure 28 shows decision
tree results.
Figure 26. SPSS Clementine workspace
Figure 27. Campaign a model use checklist
Figure 28. Decision tree results
Figure 29. Statistics results for using Google search engine
Figure 30. Click sequence results
and input specific criteria to generate dynamic web pages. Intelligent agents can reach
any content with a web browser in a fully automated fashion.
QL2 Software extracts the data regardless of format: Word documents, e-mails,
spreadsheets, databases, PowerPoint files, HTML, images, and PDFs. QL2 can also add
structure to the data it collects and output the information in an actionable format such as a
spreadsheet, database, or XML feed so data can be sorted, filtered, and queried with ease.
QL2 Software can extract data/information from both the world wide web and
unstructured documents, and integrate it into business intelligence in real time for a
360° view of business and the market. QL2 can be used to automatically mine
competitor web sites, online catalogs, news feeds, and regulatory filings, or to extract
data from PDFs, PowerPoint presentations, Word docs and e-mail archives.
Figures 31 and 32 are screen shots showing the applications of QL2 for web mining
to available data sets. Figure 31 shows a screen shot of the QL2 workspace. Figure 32
shows a screen shot of expanded inner queries for Best Buy data.
Figure 31. Workspace for QL2
Source: Zhang and Segall (2008)
Figure 32. Inner query with records shown using Best Buy data
NeuralWare Predict and BioDiscovery GeneSight have fewer data mining functions than
SAS and PolyAnalyst do. NeuralWare Predict is used with Microsoft Excel, which makes it
easy to use. It produces results comparable with those of other software in terms of prediction
using neural networks. BioDiscovery GeneSight is primarily for clustering analysis and
is able to provide a variety of data mining visualization charts and colors.
Both SAS Enterprise Miner™ and Megaputer PolyAnalyst® employ each of the same
algorithms as illustrated in Table I, except that SAS has a separate software, SAS Text
Miner®, for text analysis. The regression results are comparable for those obtained using
NeuralWare Predict® and Megaputer PolyAnalyst®. The cluster analysis results for
SAS Enterprise Miner™, BioDiscovery GeneSight®, and Megaputer PolyAnalyst® are each
unique to the software in how the results are represented. SAS Enterprise
Miner™ and NeuralWare Predict® both utilize SOM while the other two do not. In
conclusion, SAS Enterprise Miner™ and Megaputer PolyAnalyst® offer the greatest
diversification of data mining algorithms.
Comparing the six TM software packages, Compare Suite and TextAnalyst have minimal TM
capabilities while Megaputer PolyAnalyst, SAS Text Miner, and WordStat have
extensive TM capabilities. VisualText has add-ons that make this software versatile.
SAS Text Miner is an add-on to base SAS Enterprise Miner, installed by inserting an
additional Text Miner icon on the SAS Enterprise Miner workspace toolbar. SAS Text
Miner tags parts of speech and performs transformations, such as those using singular
value decompositions, to generate a term-document frequency matrix for viewing in the
Text Miner node. Megaputer PolyAnalyst® is similarly a software package that combines both
data mining and TM, but it also includes web mining capabilities. Megaputer also has
stand-alone TextAnalyst software for TM. Visual link analysis figures and their term
interactions, such as those generated using SAS Text Miner and the more
expressive ones generated by Megaputer PolyAnalyst, are quite informative and provide insight
that the other TM software cannot.
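A term-document frequency matrix of the kind mentioned above for SAS Text Miner can be illustrated in a few lines. This sketch builds only the raw count matrix; part-of-speech tagging and the singular value decomposition step are omitted, and the function name and the three sample documents are hypothetical.

```python
import re


def term_document_matrix(documents):
    """Return (vocabulary, matrix) where matrix[i][j] counts term i in document j."""
    tokenized = [re.findall(r"[a-z]+", doc.lower()) for doc in documents]
    vocabulary = sorted({token for doc in tokenized for token in doc})
    matrix = [[doc.count(term) for doc in tokenized] for term in vocabulary]
    return vocabulary, matrix


# Hypothetical miniature document collection.
docs = ["the cat sat", "the dog sat down", "the cat saw the dog"]
vocab, matrix = term_document_matrix(docs)
the_row = matrix[vocab.index("the")]
```

Each row of the matrix corresponds to a term and each column to a document; a dimension reduction technique such as singular value decomposition would then be applied to a matrix of this shape.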
New features of WordStat 5.1 include: an improved phrase extraction routine with
built-in cross-tabulation, overlap identification, and filtering; the addition of more flexible
feature selection and classification routines to the automatic document classification
module; and visualization tools (bar charts, pie charts, correspondence plots).
Regarding web mining software, PolyAnalyst® can mine web data integrated within
a data mining enterprise analytical system and provide visual tools such as link analysis
of the critical terms of the text. SPSS Clementine® can be used for graphical illustrations
of customer web activities as well as for link analysis of different data categories
such as campaign, age, gender, and income. The ClickTracks workspace is more
object-oriented in construction and usage. Web mining using ClickTracks can include
cost of visitors, revenue from visitors, average time on site, and page views per visitor.
ClickTracks also can provide search reports for pay-per-click and regular search for each
selected item. QL2 is web data extraction software. Many of the web mining technologies
are applicable to text, web and click stream data. The selection of appropriate web
mining software should be based on both its available web mining technologies and also
the type of data to be encountered.
The future direction of the research is to investigate other mining software and their
available mining technologies for analyzing various types of data, and to make
comparisons of the capabilities of these software packages. This
future research would also include the acquisition of other data sets to perform these
new analyses and comparisons. The future directions would also include applications
of these software packages to other databases of different dimensionalities.
References
AAAI (2002), American Association for Artificial Intelligence (AAAI) Spring Symposium on
Information Refinement and Revision for Decision Making: Modeling for Diagnostics,
Prognostics, and Prediction, Software and Data, available at: www.cs.rpi.edu/~goebel/ss02/
software-and-data.html
Amir, A., Aumann, Y., Feldman, R. and Fresko, M. (2005), “Maximal association rules: a tool for
mining associations in text”, Journal of Intelligent Information Systems, Vol. 25 No. 3,
pp. 333-45.
Ananyan, S. and Kiselev, M. (2010), “Automated analysis of unstructured texts: technology and
implementations”, TextAnalyst Whitepaper, Megaputer, www.megaputer.com/down/
textanalyst_whitepaper.pdf
Ceccato, M., Marin, M., Mens, K., Moonen, L., Tonella, P. and Tourwé, T. (2006), “Applying and
combining three different aspect mining techniques”, Software Quality Journal, Vol. 14
No. 3, pp. 209-14.
Chang, J. and Lee, W. (2006), “Finding frequent itemsets over online data streams”, Information
and Software Technology, Vol. 48 No. 7, pp. 606-19.
Chou, C., Sinha, A. and Zhao, H. (2008), “A text mining approach to internet abuse detection”,
Information Systems and eBusiness Management, Vol. 6 No. 4, pp. 419-40.
Data Intelligence Group (1995), “An overview of data mining at Dun & Bradstreet”, DIG
White Paper 95/01, available at: www.thearling.com.text/wp9501/wp9501.htm
Davi, A., Haughton, D., Nasr, N., Shah, G., Skaletsky, M. and Spack, R. (2005), “A review of two
text-mining packages: SAS TextMining and WordStat”, The American Statistician, Vol. 59
No. 1, pp. 89-104.
Deshmukah, A.V. (1997), “Software review: ModelQuest Expert 1.0”, ORMS Today, December,
available at: www.lionhrtpub.com/orms/orms-12-97/software-review.html
Ducatelle, F. (2006), “Software for the data mining course”, School of Informatics, The University
of Edinburgh, Edinburgh, available at: www.inf.ed.ac.uk/teaching/courses/dme/html/
software2.html
Ganapathy, S., Ranganathan, C. and Sankaranarayanan, B. (2004), “Visualization strategies and
tools for enhancing customer relationship management”, Communications of the ACM,
Vol. 47 No. 11, pp. 92-8.
Hearst, M.A. (2003), “What is data mining?”, available at: www.ischool.berkeley.edu/~hearstr/
text_mining.html
Kim, S., Whitehead, E.J. Jr and Zhang, Y. (2008), “Classifying software changes: clean or buggy?”,
IEEE Transactions on Software Engineering, Vol. 34 No. 2, pp. 181-97.
Lau, K., Lee, K. and Ho, Y. (2005), “Text mining for the hotel industry”, Cornell Hotel &
Restaurant Administration Quarterly, Vol. 46 No. 3, pp. 344-63.
Lazarevic, A., Fiea, T. and Obradovic, Z. (2006), “A software system for spatial data analysis and
modeling”, available at: www.ist.temple.edu/~zoran/papers/lazarevic00.pdf
Leung, Y.F. (2004), “My microarray software comparison – data mining software”, Chinese
University of Hong Kong, Ma Liu Shui, September, available at: www.ihome.cuhk.edu.hk/
~b400559/arraysoft mining specific.html
LLNL (2005), “Scientific data mining and pattern recognition: overview”, Lawrence Livermore
National Laboratory, The Center for Applied Scientific Computing (CASC), available at:
www.llnl.gov/CASC/sapphire/overview/html
Megaputer Intelligence (2007), “Data mining, text mining, and web mining software”, Megaputer
Intelligence, Bloomington, IN, available at: www.megaputer.com
Metz, C. (2003), “Software: text mining”, PC Magazine, July 1, available at: www.pcmag.com/
print_article2/0,1217.a=43573,00.asp
National Center for Biotechnology Information (2006), “NCBI tools for data mining”, National
Library of Medicine, National Institutes of Health, Bethesda, MD, available at: www.ncbi.
nlm.nih.gov/Tools/
Nayak, R. (2008), “Data mining in web services discovery and monitoring”, International Journal
of Web Services Research, Vol. 5 No. 1, pp. 63-82.
Nisbet, R.A. (2006), “Data mining tools: which one is best for CRM? Part 3”, DM Review, March 21,
available at: www.dmreview.com/editorial/dmreview/print_action.cfm?articleId=1049954
Pabarskaite, Z. and Raudys, A. (2007), “A process of knowledge discovery from web log data:
systematization and critical review”, Journal of Intelligent Information Systems, Vol. 28
No. 1, pp. 79-105.
Prakash, P. and Hoff, B. (2002), “Microarray gene expression data mining with cluster analysis
using GeneSight™”, Application Note GS10, BioDiscovery, Marina del Rey, CA, available
at: http://signal.biosi.cf.ac.uk/wbg/pdf/appnotegs10.pdf
QL2 (2007), “QL2 Software”, available at: www.ql2.com (accessed June 5, 2007).
Sanchez, M., Moreno, M., Segrera, S. and Lopez, V. (2008), “Framework for the development of a
personalised recommender system with integrated web-mining functionalities”,
International Journal of Computer Applications in Technology, Vol. 33 No. 4, pp. 312-27.
Seewald, A., Holzbaur, C. and Widmer, G. (2006), “Evaluation of term utility functions for very
short multidocument summaries”, Applied Artificial Intelligence, Vol. 20, pp. 57-77.
Segall, R.S. and Zhang, Q. (2006), “Data visualization and data mining of continuous numerical
and discrete nominal-valued microarray databases for biotechnology”, Kybernetes:
International Journal of Systems and Cybernetics, Vol. 35 Nos 9/10, pp. 1538-66.
Segall, R.S. and Zhang, Q. (2009a), “A survey of selected software technologies for text mining”,
in Song, M. and Wu, Y. (Eds), Handbook of Research on Text and Web Mining
Technologies, Ch. XLIV, IGI Global, Hershey, PA, pp. 766-84.
Segall, R.S. and Zhang, Q. (2009b), “Web mining technologies for customer and marketing
surveys”, Kybernetes: The International Journal of Cybernetics, Systems and Management
Sciences, Vol. 38 No. 6, pp. 929-53.
Spasic, I., Ananiadou, S., McNaught, J. and Kumar, A. (2005), “Text mining and ontologies in
biomedicine: making sense of raw text”, Briefings in Bioinformatics, Vol. 6 No. 3, pp. 239-51.
SPSS (2007), “Web mining for Clementine”, available at: www.spss.com/web_mining_for_
clementine (accessed May 16, 2007).
Srinivasan, P. (2006), “Text mining: generating hypotheses from MEDLINE”, Journal of the
American Society for Information Science and Technology, Vol. 55 No. 5, pp. 396-413.
StatSoft (2006), Electronic Textbook, StatSoft, Tulsa, OK, available at: www.statsoft.com/
textbook/glosa.html
Web Analytics (2007), available at: www.clicktracks.com/ (accessed October 25, 2007).
Wikipedia (2007), “Web mining”, available at: http://en.wikipedia.org/wiki/Web_mining
Woodfield, T. (2004), Mining Textual Data Using SAS Text Miner for SAS9 Course Notes, SAS
Institute, Cary, NC.
Younas, M., Shakshuki, E. and Chao, K. (2006), “Efficient mining and prediction of user behavior
patterns in mobile web systems”, Information and Software Technology, Vol. 48 No. 6,
pp. 357-66.
Zhang, Q. and Segall, R.S. (2008), “Web mining: a survey of current research, techniques, and
software”, International Journal of Information Technology & Decision Making, Vol. 7
No. 4, pp. 683-720.
Corresponding author
Richard S. Segall can be contacted at: rsegall@astate.edu