Professional Documents
Culture Documents
A Comparison of Contemporary Data Mining Tools: October 2017
A Comparison of Contemporary Data Mining Tools: October 2017
net/publication/320911740
CITATIONS READS
0 342
5 authors, including:
Some of the authors of this publication are also working on these related projects:
Unapređenje konkurentnosti Srbije u procesu pristupanja Srbije Evropskoj uniji View project
All content following this page was uploaded by Marko Arsenovic on 07 November 2017.
Abstract
Corresponding to the raising importance of data mining in today's highly competitive business
environment, the number of available data mining tools continues to grow. Consequently, competition
between data mining software developers increases as well and the choice of the most suitable tool
becomes increasingly difficult. Therefore, comparison of data mining tools becomes important. In this
paper, several, frequently used open-source data mining tools and tools with open-source algorithms
implementations are selected and compared against user groups, data structures, algorithms
included, visualization capabilities, platforms, programming languages, and import and export options.
In addition, evaluation of publicly available datasets has been performed by using selected tools. By
performing actions such as data format conversions, data input and output, data transformation,
feature selection, classification tasks, simple data mining tool selection algorithm has been developed
and presented.
Key words: data mining, data mining tools, tool selection algorithm
1. INTRODUCTION
A problem for managers, developers, data scientists
Data mining is a core step in the knowledge discovery and other stakeholders might occur while choosing
from databases (KDD) process, that consists of suitable open-source tool, as there is a wide variety of
applying data analysis and discovery algorithms on data tools on the market and many of them are not yet well-
in order to discover useful patterns [1]. Increasing known in data mining community. A need for
power of technology and complexity of datasets has comparisons of different tools is rising, as decision
lead data mining to evolve from static to more dynamic making process becomes more complex.
and proactive information deliveries. Over the time, the
volume of data that needs to be managed became Therefore, in this paper five contemporary, open-source
impossible to manually analyse in order to get valuable data mining tools are presented and compared against
decision making information. Therefore, an urgent and their important features in order to aid scientists in
constant need for new, improved generations of data decision making process. Among the others,
mining tools appeared, in order to assist in extracting contribution of this paper is a simple tool selection
useful information from the rapidly growing amount of algorithm, which enables selection of data mining tool
data [1]. according to users needs.
As a result, the number of available data mining tools The rest of the paper is organized as follows: Section 2
continues to grow, with an accent on open-source data deals with the related work, applied methodology is
mining software. Open-source tools represented a new described in Section 3, Section 4 deals with the results
trend in data mining, especially in small and medium and appropriate discussion and finally Section 5
enterprises in early 2000s [2]. Nowadays it is an addresses the conclusions and further research.
established trend, as open-source data mining tools are
constantly being developed and renewed, offering 2. RELATED WORK
bigger flexibility and extensive development community
Over the years, a number of authors have written about
[3].
comparison of data mining tools, all suggesting similair
IS'17
2 Dakic et al.
IS'17
Dusanka Dakic et al. 3
architecture will also be included as possible value for Apache Spark have implementation of J48 (C4.5)
architecture of data mining tools [6]. decision tree, which is used in classification process in
this tools. On the other hand, Microsoft Azure ML
3.2.2 Data management Studio and H2O Flow do not offer multiclass decision
It is unrealistic to expect one data mining system to trees for classification, but decision forests instead.
mine all kinds of data, given the diversity of data types After classification is performed, in order to test
and data mining agendas [13]. Data mining tools performance of data mining tools on larger dataset, Iris
require integration with database systems or data dataset was enlarged up to 20MB (around 1000000
warehouses for data selection, pre-processing, records) and the same procedure was repeated.
transformation, etc. Not all tools have the same
database characteristics in terms of data model, 3.3 Tool selection algorithm
possible data size, data format and import options.
After performing the evaluation process and reviewing
This comparison includes possible data sources for the results of comparison, a simple algorithm is
data mining tools, such as connections to databases. developed with the aim to help decision-making
Also, possible data formats are determined, considering process when it comes to choosing open-source data
if data can be stored in ARFF, CSV or Excel. Last mining tools.
feature important for data management is the size of
Finally, algorithm included only attributes relevant for
the dataset. For every data mining tool it will be
decision making process and excluded attributes that
determined whether it is suited for small or large
have same values for each tool.
datasets (Big data).
4. RESULTS AND DISCUSSION
3.2.3 Functionality
At the core of the KDD process are data mining 4.1 General characteristics
methods for extracting patterns from data. These General characteristics of selected data mining tools
methods can be very diverse and have different goals are presented in Table 1.
[5]. In the study presented, open-source data mining
tools will be compared in regard to the following WEKA’s architecture can be standalone or client/server,
methods/tasks: Data Pre-processing [14], Prediction, and it can operate on Windows, MAC and Linux OS.
Regression, Classification, Clustering, Text Analytics WEKA is developed in Java and the algorithms can
and Model Visualization [5]. either be applied directly to a dataset or called from
Java code, because of the Java API.
3.2.4 Usability
Microsoft Azure Machine Learning Studio has web-
Usability is, generally, ease of use and learnability and based architecture. Operating systems on which Azure
can refer to human interaction with the tool. Data ML operates are Windows and Linux. Azure Machine
mining process should be highly interactive. Thus, it is Learning Studio includes hundreds of built-in packages
important to build flexible user interfaces and an and support for custom R and Python code, although
exploratory mining environment, facilitating user’s Python can only be used for pre-processing tasks.
interaction with the system [13]. Hence, in this paper,
RapidMiner has client/server, standalone and cloud
usability features include existence of GUI or CLI and
user groups. User groups are divided into Business architecture, as it consists of RapidMiner Studio, Server
application, Applied research and Education. Business and Cloud. It can operate on Windows, MAC and Linux,
application users use data mining as a tool for mining and is written in Java, but can use R and Python
large volumes of business data. Applied research users scripts.
apply data mining to research problems and are mainly H2O uses web-based open-source user interface. It
interested in tools with well proven methods, a can run on Windows, OS X and Ubuntu operating
graphical user interface (GUI), and interfaces to systems and allows programming in R, Python, Scala,
relevant databases [6]. Education users use data Java and JSON. There is a possibility of running H2O
mining tools for education at universities. These data on Hadoop cluster. H2O is supported on a number of
mining tools should be very intuitive, with a comfortable cloud environments.
interactive graphical user interface.
Apache Spark is a fast and general-purpose cluster
3.2.5 Classification computing system, meaning that Spark applications run
as independent sets of processes on a cluster
In order to perform simple classification in each data
(standalone, Apache Mesos or Hadoop YARN). It is
mining tool, decision tree and decision forest models
also possible to submit applications from a distant
are build using well-known Iris dataset [15]. Iris is
computer, so Apache Spark supports client/server
multivariate dataset, with four real attributes: sepal
architecture, or to run applications on cloud platform. It
length in cm, sepal width in cm, petal length in cm, petal
provides high-level APIs in Java, Scala, Python and R.
width in cm. The dataset contains 3 classes of 50
Apache Spark can be run on Windows, Linux Ubuntu
instances each, where each class refers to a type of iris
and MAC operating systems.
plant: Iris Setosa, Iris Versicolor, and Iris Virginica.
Initial Iris dataset has 4KB size. Weka, RapidMiner and
IS'17
4 Dakic et al.
IS'17
Dusanka Dakic et al. 5
instance, following with model evaluation on test RapidMiner, Microsoft Azure ML Studio and H2O do not
instances. offer AUC for multiclass classification. Moreover, Weka
and RapidMiner have an accent (*) in Column L, with
Table 5. Decision tree classification results
Correctly reference to their limitations in ability to process
Product AUC TPR TNR L enlarged dataset. When Iris dataset was enlarged up to
classified
Weka 0.97 98% 0.98 0.99 * 20 MB, WEKA was not comfortably able to process it,
RapidMiner x 95% 0.95 0.96 * unless the heap size was enlarged and CLI is used.
Apache Furthermore, open-source version of RapidMiner
0.97 0.98 0.97 0.98
Spark cannot process that volume of data free of charge, as
its limit is 10000 records and one logic processor.
Table 6. Decision forest classification results
Correctly On the contrary, Apache Spark, H2O and Microsoft
Product AUC TPR TNR L
classified Azure ML Studio were able to process enlarged data
Azure ML set in seconds.
x 97% 0.97 0.98
Studio
H2O x 97% 0.97 0.98 4.6 Developed tool selection algorithm
According to the presented research results, several
AUC – Area under the curve TPR – True positive rate characteristics are common for every evaluated tool
(sensitivity) TNR – True negative rate (specificity) L – and they will not be included in the algorithm. The tool
Enlarged Iris dataset selection algorithm is build from differences between
the tools and has several different starting nods, in
After observing tables 5. and 6. it can be concluded that
order to include all of them. The algorithm is presented
there is a minor difference in classification results
in Figure 1.
among tools presented. However, it is prominent that
IS'17
6 Dakic et al.
6. REFERENCES
[1] Fayyad U, Piatetsky-Shapiro G, Smyth P.
(1996) "From data mining to knowledge
discovery in databases" AI Mag 1996, 17:37–
54.
[2] Chen X, Ye Y, Williams G, Xu X. (2007) "A
survey of open source data mining systems",
Lecture Notes in Computer Science 2007,
4819:3–14.
[3] Almeida, P., & Bernardino, J. (2016) "A survey
on open source Data Mining Tools for SMEs",
In New Advances in Information Systems and
Technologies (pp. 253-262). Springer, Cham.
[4] Wang J, Hu X, Hollister K, Zhu D. (2008) "A
comparison and scenario analysis of leading
data mining software". Int J Knowl Manage
2008, 4:17–34.
[5] Goebel, M., Gruenwald, L. (1999) "A survey of
data mining and knowledge discovery software
tools“, In: SIGKDD Explorations. Volume 1.,
ACM SIGKDD (1999) 20–33
[6] Mikut, R., & Reischl, M. (2011) "Data mining
tools", Wiley Interdisciplinary Reviews: Data
Mining and Knowledge Discovery, 1(5), 431-
443.
[7] Hall, Mark, et al. (2009) "The WEKA data
mining software: an update." ACM SIGKDD
explorations newsletter 11.1 (2009): 10-18.
[8] Barga, Roger, Valentine Fontama, and Wee
Hyong To (2015) "Introducing Microsoft Azure
Machine Learning." Predictive Analytics with
Microsoft Azure Machine Learning. Apress,
2015. 21-43.
[9] Jovic A., Brkic K., Bogunovic N. (2014) "An
overview of free software tools for general data
IS'17