Download as pdf or txt
Download as pdf or txt
You are on page 1of 7

See discussions, stats, and author profiles for this publication at: https://www.researchgate.

net/publication/320911740

A Comparison of Contemporary Data Mining Tools

Conference Paper · October 2017

CITATIONS READS

0 342

5 authors, including:

Dusanka Dakic Srdjan Sladojevic


University of Novi Sad University of Novi Sad
5 PUBLICATIONS   3 CITATIONS    36 PUBLICATIONS   227 CITATIONS   

SEE PROFILE SEE PROFILE

Marko Arsenovic Teodora Lolić


University of Novi Sad University of Novi Sad
13 PUBLICATIONS   201 CITATIONS    16 PUBLICATIONS   6 CITATIONS   

SEE PROFILE SEE PROFILE

Some of the authors of this publication are also working on these related projects:

Unapređenje konkurentnosti Srbije u procesu pristupanja Srbije Evropskoj uniji View project

All content following this page was uploaded by Marko Arsenovic on 07 November 2017.

The user has requested enhancement of the downloaded file.


XVII International Scientific Conference on Industrial Systems (IS'17)
Novi Sad, Serbia, October 4. – 6. 2017.
University of Novi Sad, Faculty of Technical Sciences,
Department for Industrial Engineering and Management
Available online at http://www.iim.ftn.uns.ac.rs/is17

A Comparison of Contemporary Data Mining Tools


Dakić Dušanka
(Teaching Assistant, University of Novi Sad, Faculty of Technical Sciences, Serbia, dakic.dusanka@uns.ac.rs)
Stefanović Darko
(Assistant Professor, University of Novi Sad, Faculty of Technical Sciences, Serbia, darkoste@uns.ac.rs)
Sladojević Srdjan
(Assistant Professor, University of Novi Sad, Faculty of Technical Sciences, Serbia, Sladojevic@uns.ac.rs)
Arsenović Marko
(Teaching Assistant, University of Novi Sad, Faculty of Technical Sciences, Serbia, arsenovic@uns.ac.rc)
Lolić Teodora
(Teaching Assistant, University of Novi Sad, Faculty of Technical Sciences, Serbia, teodora.lolic@uns.ac.rs)

Abstract
Corresponding to the raising importance of data mining in today's highly competitive business
environment, the number of available data mining tools continues to grow. Consequently, competition
between data mining software developers increases as well and the choice of the most suitable tool
becomes increasingly difficult. Therefore, comparison of data mining tools becomes important. In this
paper, several, frequently used open-source data mining tools and tools with open-source algorithms
implementations are selected and compared against user groups, data structures, algorithms
included, visualization capabilities, platforms, programming languages, and import and export options.
In addition, evaluation of publicly available datasets has been performed by using selected tools. By
performing actions such as data format conversions, data input and output, data transformation,
feature selection, classification tasks, simple data mining tool selection algorithm has been developed
and presented.

Key words: data mining, data mining tools, tool selection algorithm

1. INTRODUCTION
A problem for managers, developers, data scientists
Data mining is a core step in the knowledge discovery and other stakeholders might occur while choosing
from databases (KDD) process, that consists of suitable open-source tool, as there is a wide variety of
applying data analysis and discovery algorithms on data tools on the market and many of them are not yet well-
in order to discover useful patterns [1]. Increasing known in data mining community. A need for
power of technology and complexity of datasets has comparisons of different tools is rising, as decision
lead data mining to evolve from static to more dynamic making process becomes more complex.
and proactive information deliveries. Over the time, the
volume of data that needs to be managed became Therefore, in this paper five contemporary, open-source
impossible to manually analyse in order to get valuable data mining tools are presented and compared against
decision making information. Therefore, an urgent and their important features in order to aid scientists in
constant need for new, improved generations of data decision making process. Among the others,
mining tools appeared, in order to assist in extracting contribution of this paper is a simple tool selection
useful information from the rapidly growing amount of algorithm, which enables selection of data mining tool
data [1]. according to users needs.
As a result, the number of available data mining tools The rest of the paper is organized as follows: Section 2
continues to grow, with an accent on open-source data deals with the related work, applied methodology is
mining software. Open-source tools represented a new described in Section 3, Section 4 deals with the results
trend in data mining, especially in small and medium and appropriate discussion and finally Section 5
enterprises in early 2000s [2]. Nowadays it is an addresses the conclusions and further research.
established trend, as open-source data mining tools are
constantly being developed and renewed, offering 2. RELATED WORK
bigger flexibility and extensive development community
Over the years, a number of authors have written about
[3].
comparison of data mining tools, all suggesting similair

IS'17
2 Dakic et al.

criteria for comparison and important features of data 3. METHODOLOGY


mining systems. Selecting software is a complex
problem, due to many criteria and frequent technology 3.1 Open-source data mining tools
changes, as new tools or new versions of existing tools Data mining tools and tools with open-source
are being released rapidly [4]. Current literature might algorithms implementations compared in this paper are
be useful in terms of defining criteria of comparison, but WEKA, Microsoft Azure ML Studio, RapidMiner, H2O
not useful in regard to technology (tools) described in and Apache Spark.
them, as they quickly become outdated. Nevertheless,
in this chapter several papers will be mentioned, with WEKA, a well-known open-source Java application
accent on comparison criteria used in this paper. produced by the University of Waikato in New Zealand,
Wang et al. (2008) in their comparison of leading data is a collection of machine learning algorithms for data
mining software packages, compared them against mining tasks [7].
several software quality criterias, such as portability,
reliability, efficiency, human engineering, Microsoft Azure Machine Learning Studio is critical part
understanding, modifiability, price, training and support of the Cortana Analytics suite and provides an
[4]. From their reserach, several criterias were selected interactive visual workspace that enables data
as relevant, such as portability, in therms of supported scientists and developers to easily build, test and
platforms and software architectures, human deploy predictive analytic models [8]. It does not require
engineering, in therms of ease of use and installation, as it is used from web based GUI.
understanding, with focus on graphical user interface RapidMiner is a powerful visual workflow designer for
(GUI). building predictive analytic workflows, with a user-
Chen et al. (2007) in a survey of open-source data friendly GUI [9]. RapidMiner consists of RapidMiner
mining systems claim that each data mining process is Studio, where users can build and edit analytic
composed of a sequence of data mining operations, processes, RapidMiner Cloud, where users can store
each implementing a data mining function or algorithm, their data, RapidMiner Server, which connects and
where these operations (data understanding, data interacts with other services and RapidMiner
preprocessing, data modeling, evaluation and RapidMiner Radoop, which is a code-free environment
deployment) are base for data mining tool's comparison for designing advanced analytic processes that push
[2]. To understand the characteristics of these diverse computations down to Hadoop cluster [10]. RapidMiner
open-source data mining systems and to evaluate Studio is free, altough it has some limitations regarding
them, they looked into the following important features: logical processors and the amount of data used.
ability to access various data sources, data pre- H2O in is an open-source tool for big data analyis, and
processing capability, integration of different it enables in-memory, distributed, fast, and scalable
techniques, ability to operate on large datasets and machine learning support. It offers H2O Flow, a web
good data and model visualization [2]. Among their based, open-source user interface [11]. One of it's most
criteria, data sources, data pre-processing and regarding cappabilities is training a model on complete
operation on large datasets were chosen as significant datasets and in-memory distributed parallel processing.
and applied in this research.
Similairly, Goebel and Gruenwald, in 1999, proposed Apache Spark is a popular open-source platform for
feature classification scheme, divided into following large-scale data processing that is well-suited for
sections: general characteristics, database conectivity iterative machine learning tasks [12].
and data mining characteristics [5]. These categories
can be used as sections in which all other criteria can 3.2 Criteria for comparison
be distributed. Criteria for comparison of open-source data mining
Mikut and Reishl, 2011, singled out nine significant tools can be divided into general characteristics, data
criteria for comparing data mining software. These management, functionality, usability and classification.
criteria are based on user groups, data structures, data
mining tasks and methods, import and export options 3.2.1 General characteristics
and license models, where every criteria is further
By general characteristics, some general product
explained [6]. In this research, user groups and data
features are considered, including product name,
mining tasks are singled out as criteria relevant for
architecture, operating systems and programming
comparison.
languages. As far as architecture is considered, data
According to above mentioned several criteria can be mining tools can initially be subdivided into standalone
separated as important and repeating, such as possible and client/server architecture. Tools that have
data sources, data mining tasks (pre-processing, standalone architecture do not require any software
prediction, classification, etc.) and human engineering. other than the operating system to run, unlike
Selected criterias are further grouped into three client/server architecture. One of the emerging trends is
categories: general characteristics, data management an increasing number of web interfaces providing data
and functionality. mining as SaaS, therefore web based, cloud service

IS'17
Dusanka Dakic et al. 3

architecture will also be included as possible value for Apache Spark have implementation of J48 (C4.5)
architecture of data mining tools [6]. decision tree, which is used in classification process in
this tools. On the other hand, Microsoft Azure ML
3.2.2 Data management Studio and H2O Flow do not offer multiclass decision
It is unrealistic to expect one data mining system to trees for classification, but decision forests instead.
mine all kinds of data, given the diversity of data types After classification is performed, in order to test
and data mining agendas [13]. Data mining tools performance of data mining tools on larger dataset, Iris
require integration with database systems or data dataset was enlarged up to 20MB (around 1000000
warehouses for data selection, pre-processing, records) and the same procedure was repeated.
transformation, etc. Not all tools have the same
database characteristics in terms of data model, 3.3 Tool selection algorithm
possible data size, data format and import options.
After performing the evaluation process and reviewing
This comparison includes possible data sources for the results of comparison, a simple algorithm is
data mining tools, such as connections to databases. developed with the aim to help decision-making
Also, possible data formats are determined, considering process when it comes to choosing open-source data
if data can be stored in ARFF, CSV or Excel. Last mining tools.
feature important for data management is the size of
Finally, algorithm included only attributes relevant for
the dataset. For every data mining tool it will be
decision making process and excluded attributes that
determined whether it is suited for small or large
have same values for each tool.
datasets (Big data).
4. RESULTS AND DISCUSSION
3.2.3 Functionality
At the core of the KDD process are data mining 4.1 General characteristics
methods for extracting patterns from data. These General characteristics of selected data mining tools
methods can be very diverse and have different goals are presented in Table 1.
[5]. In the study presented, open-source data mining
tools will be compared in regard to the following WEKA’s architecture can be standalone or client/server,
methods/tasks: Data Pre-processing [14], Prediction, and it can operate on Windows, MAC and Linux OS.
Regression, Classification, Clustering, Text Analytics WEKA is developed in Java and the algorithms can
and Model Visualization [5]. either be applied directly to a dataset or called from
Java code, because of the Java API.
3.2.4 Usability
Microsoft Azure Machine Learning Studio has web-
Usability is, generally, ease of use and learnability and based architecture. Operating systems on which Azure
can refer to human interaction with the tool. Data ML operates are Windows and Linux. Azure Machine
mining process should be highly interactive. Thus, it is Learning Studio includes hundreds of built-in packages
important to build flexible user interfaces and an and support for custom R and Python code, although
exploratory mining environment, facilitating user’s Python can only be used for pre-processing tasks.
interaction with the system [13]. Hence, in this paper,
RapidMiner has client/server, standalone and cloud
usability features include existence of GUI or CLI and
user groups. User groups are divided into Business architecture, as it consists of RapidMiner Studio, Server
application, Applied research and Education. Business and Cloud. It can operate on Windows, MAC and Linux,
application users use data mining as a tool for mining and is written in Java, but can use R and Python
large volumes of business data. Applied research users scripts.
apply data mining to research problems and are mainly H2O uses web-based open-source user interface. It
interested in tools with well proven methods, a can run on Windows, OS X and Ubuntu operating
graphical user interface (GUI), and interfaces to systems and allows programming in R, Python, Scala,
relevant databases [6]. Education users use data Java and JSON. There is a possibility of running H2O
mining tools for education at universities. These data on Hadoop cluster. H2O is supported on a number of
mining tools should be very intuitive, with a comfortable cloud environments.
interactive graphical user interface.
Apache Spark is a fast and general-purpose cluster
3.2.5 Classification computing system, meaning that Spark applications run
as independent sets of processes on a cluster
In order to perform simple classification in each data
(standalone, Apache Mesos or Hadoop YARN). It is
mining tool, decision tree and decision forest models
also possible to submit applications from a distant
are build using well-known Iris dataset [15]. Iris is
computer, so Apache Spark supports client/server
multivariate dataset, with four real attributes: sepal
architecture, or to run applications on cloud platform. It
length in cm, sepal width in cm, petal length in cm, petal
provides high-level APIs in Java, Scala, Python and R.
width in cm. The dataset contains 3 classes of 50
Apache Spark can be run on Windows, Linux Ubuntu
instances each, where each class refers to a type of iris
and MAC operating systems.
plant: Iris Setosa, Iris Versicolor, and Iris Virginica.
Initial Iris dataset has 4KB size. Weka, RapidMiner and

IS'17
4 Dakic et al.

Table 1. General characteristics 4.4 Usability results


Product Architecture OS Language
Client/Server, Windows, The result of usability comparison is presented in Table
Weka Java 4. As it is presented, every tool except Apache Spark
Standalone MAC, Linux
Azure ML Web based, Windows, has GUI. Therefore only Apache Spark is not suitable
R, Python
Studio Cloud service Linux for educational purposes. As business application and
Standalone, applied research is considered, every tool can be used.
Windows, Java, R,
RapidMiner Client/Server, CLI is available in WEKA, H2O, Apache Spark and
MAC, Linux Python
Cloud service RapidMiner.
Windows,
Web based, R, Python, Table 4. Usability
OS X,
H2O client/server, Scala, Java,
Linux Product GUI CLI BA AR E
Cloud service JSON
Ubuntu Weka     
Standalone (on Windows,
Apache a cluster), Linux R, Python, Azure ML Studio    
Spark Client/Server, Ubuntu, Scala, Java RapidMiner     
Cloud MAC H2O     
4.2 Data management results Apache Spark   
Table 2. presents which data sources and formats tools GUI – Graphic User Interface, CLI – Command Line
support. In the last column is determined if certain data Interface, BA – Business Application, AR – Applied
mining tool is suitable for processing big data. In the Research, E – Education
last column of the Table 2 (Column S), it is marked that
4.5 Classification results
every tool can be used for processing of big data,
although there is a certain limitation considering WEKA In Table 5. are presented classification results
and RapidMiner. WEKA can only process big data if it is performed in WEKA, RapidMiner and Apache Spark,
used through CLI and not GUI and it is necessary to using C4.5 algorithm (or J48 implementation). In Table
manually enlarge heap size. RapidMiner can also 6. are results of classification performed in Microsoft
process big data, but not free of charge. Azure ML Studio and H2O, using random decision
Table 2. Data Management forest algorithm. Results are including area under the
Product J/O S E OR P C A U S curve (AUC), percentage of correctly classified
Weka        * instances, true positive ratio (sensitivity) and true
negative ratio (specificity).
Azure ML
Studio
        
Classification in WEKA is performed through simple
RapidMiner       * GUI where user imports data, trains and tests learning
H2O        schemes and performs model metrics visualization.
Apache Performing classification in Microsoft Azure ML Studio
Spark
        
is simple, because of intuitive, drag and drop graphic
J – JDBC O – ODBC S - MS SQL SERVER E - MS EXCEL user interface. Once iris.arff dataset is uploaded, it can
OR – Oracle P – PostgreSQL C – CSV A – ARFF U - WEB easily be converted to other file formats, such as CSV.
URL S – SIZE Next step is splitting the dataset and training model with
machine learning algorithm, then scoring and
4.3 Functionality results
evaluating the model.
As Table 3 presents, each tool enables every data
RapidMiner has user-friendly GUI which enables user
mining task listed, increasing amount of acceptable and
to perform classification by choosing blocks that
appropriate tools and the list of their capabilities,
represent data and operations performed on data. For
making it difficult to choose the most suitable tool.
building classification model, it is necessary to retrieve
Table 3. Functionality Iris dataset, set role for attribute the model should
Product DP P R CS C T L M E predict, split data into training and test set, train data
Weka          using decision tree and then apply model on the rest of
Azure ML Studio          the dataset. Finally, it was necessary to include
RapidMiner          performance block to inspect performance of the model.
H2O         
Apache Spark          In H2O Flow, classification process is realised as a flow
of actions, where the flow starts with importing and
parsing iris dataset (to ARFF, CSV, XLS, etc.), then
DP – Data pre-processing, P – Prediction, R -
splitting the frame, building and evaluating the model.
Regression, CS - Classification, C - Clustering, T – Text
Mining, L – Link Analysis, M – Model visualization, E - Apache Spark classification was performed in spark-
Exploratory Data Analysis shell in CLI, using Scala programming language. Iris
dataset is loaded and split into training set and test set.
Then a model is created by generating decision tree

IS'17
Dusanka Dakic et al. 5

instance, following with model evaluation on test RapidMiner, Microsoft Azure ML Studio and H2O do not
instances. offer AUC for multiclass classification. Moreover, Weka
and RapidMiner have an accent (*) in Column L, with
Table 5. Decision tree classification results
Correctly reference to their limitations in ability to process
Product AUC TPR TNR L enlarged dataset. When Iris dataset was enlarged up to
classified
Weka 0.97 98% 0.98 0.99 * 20 MB, WEKA was not comfortably able to process it,
RapidMiner x 95% 0.95 0.96 * unless the heap size was enlarged and CLI is used.
Apache Furthermore, open-source version of RapidMiner
0.97 0.98 0.97 0.98 
Spark cannot process that volume of data free of charge, as
its limit is 10000 records and one logic processor.
Table 6. Decision forest classification results
Correctly On the contrary, Apache Spark, H2O and Microsoft
Product AUC TPR TNR L
classified Azure ML Studio were able to process enlarged data
Azure ML set in seconds.
x 97% 0.97 0.98 
Studio
H2O x 97% 0.97 0.98  4.6 Developed tool selection algorithm
According to the presented research results, several
AUC – Area under the curve TPR – True positive rate characteristics are common for every evaluated tool
(sensitivity) TNR – True negative rate (specificity) L – and they will not be included in the algorithm. The tool
Enlarged Iris dataset selection algorithm is build from differences between
the tools and has several different starting nods, in
After observing tables 5. and 6. it can be concluded that
order to include all of them. The algorithm is presented
there is a minor difference in classification results
in Figure 1.
among tools presented. However, it is prominent that

Figure 1. Tool selection algorithm

IS'17
6 Dakic et al.

mining." Information and Communication


5. CONCLUSION Technology, Electronics and Microelectronics
(MIPRO), 37th International Convention on.
In the study presented, several contemporary open- IEEE, 2014.
source data mining tools were presented, quick [10] Rapid Miner user manual available at:
evaluation process performed using well-known https://docs.rapidminer.com/downloads/RapidM
dataset, tools performance compared and a simple tool iner-v6-user-manual.pdf (accessed:
selection algorithm is developed. Data mining software 02.06.2017.)
developers are answering to users paramount needs, [11] H2O documentation available at:
resulting in minor differences between tools. Each tool http://docs.h2o.ai/h2o/latest-stable/h2o-
offers standard platforms, data mining tasks and data docs/index.html (accessed: 01.06.2017.)
sources. Evaluation of classification process showed [12] Meng, Xiangrui, et al. (2016) "Mllib: Machine
that there are minor differences when processing small learning in apache spark." The Journal of
datasets, compared to classification of larger volumes Machine Learning Research 17.1: 1235-1241.
of data. Apache Spark, Microsoft Azure ML Studio and [13] Han, Jiawei, Jian Pei, and Micheline
H2O performed classification of larger dataset in Kamber.(2011) Data mining: concepts and
seconds, while WEKA and RapidMiner had certain techniques. Elsevier, 2011.
limitations. Therefore, big data processing remains [14] Kotsiantis, S. B., D. Kanellopoulos, and P. E.
most distinctive quality of these tools, as certain open- Pintelas.(2006) "Data preprocessing for
source tools have in-memory data processing, parallel supervised leaning." International Journal of
programming and iterative machine learning tasks. Computer Science 1.2 2006: 111-117.
Future work should focus on including more [15] Iris dataset is available at:
contemporary open-source data mining tools in the https://archive.ics.uci.edu/ml/datasets/iris
research, such as R, KNIME and Orange, as well as (accessed: 10.06.2017.)
closer examinating differences in their functionality.

6. REFERENCES
[1] Fayyad U, Piatetsky-Shapiro G, Smyth P.
(1996) "From data mining to knowledge
discovery in databases" AI Mag 1996, 17:37–
54.
[2] Chen X, Ye Y, Williams G, Xu X. (2007) "A
survey of open source data mining systems",
Lecture Notes in Computer Science 2007,
4819:3–14.
[3] Almeida, P., & Bernardino, J. (2016) "A survey
on open source Data Mining Tools for SMEs",
In New Advances in Information Systems and
Technologies (pp. 253-262). Springer, Cham.
[4] Wang J, Hu X, Hollister K, Zhu D. (2008) "A
comparison and scenario analysis of leading
data mining software". Int J Knowl Manage
2008, 4:17–34.
[5] Goebel, M., Gruenwald, L. (1999) "A survey of
data mining and knowledge discovery software
tools“, In: SIGKDD Explorations. Volume 1.,
ACM SIGKDD (1999) 20–33
[6] Mikut, R., & Reischl, M. (2011) "Data mining
tools", Wiley Interdisciplinary Reviews: Data
Mining and Knowledge Discovery, 1(5), 431-
443.
[7] Hall, Mark, et al. (2009) "The WEKA data
mining software: an update." ACM SIGKDD
explorations newsletter 11.1 (2009): 10-18.
[8] Barga, Roger, Valentine Fontama, and Wee
Hyong To (2015) "Introducing Microsoft Azure
Machine Learning." Predictive Analytics with
Microsoft Azure Machine Learning. Apress,
2015. 21-43.
[9] Jovic A., Brkic K., Bogunovic N. (2014) "An
overview of free software tools for general data

IS'17

View publication stats

You might also like