
2015 IEEE International Congress on Big Data

Big Data Open Source Platforms

Pedro Daniel Coimbra de Almeida
Department of Computer Engineering and Systems
ISEC – Coimbra Institute of Engineering
Coimbra, Portugal
a21180299@isec.pt

Jorge Bernardino
Polytechnic of Coimbra
ISEC-CISUC
Coimbra, Portugal
jorge@isec.pt

978-1-4673-7278-7/15 $31.00 © 2015 IEEE
DOI 10.1109/BigDataCongress.2015.45

Abstract— In a global market, the capacity to mine and analyze user data is one way for companies to stay close, in time and accuracy, to the needs of their users. Big Data Platforms are one solution for companies to overcome the challenges involved in acquiring these capacities. Unfortunately, the number of challenges that need to be addressed, allied with the high number of different solutions proposed, has led to the creation of a high number of different platforms, making it hard to name one definitive and adequate platform for companies. In this paper we compare six of the most important Big Data Open Source Platforms to help companies or organizations choose the most adequate one for their needs. We analyze the following open source platforms - Apache Mahout, MOA, R Project, Vowpal Wabbit, PEGASUS and GraphLab Create™.

Keywords- Big Data, Open Source, Data Mining, Data Analysis

I. INTRODUCTION
Big Data Platforms are the combination of hardware infrastructures and software tools developed to acquire, store and analyze data in a timely manner. They allow individuals and enterprises to extract insight and value from data generated both inside and outside their business or area of interest. In business environments these platforms help managers have a better overview of their business and lead to potentially enhanced productivity, greater innovation and a stronger position in the market.
The potential value and knowledge inside data is huge and of interest to all enterprises. Unfortunately, very few managers are educated to acknowledge such importance, and even fewer enterprises have the necessary budgets and resources to tap into that source of power. In 2014 the OECD (Organization for Economic Co-operation and Development) registered that 95% of enterprises are Small and Medium Enterprises, which do not have resources to spend studying the potential of data [1]. These enterprises need to be educated about this potential, and this has to be done by showing practical solutions that allow them to see results and value in the moment. To complicate this problem, the offer of systems that work with Big Data is too vast. This happens because the process of working with Big Data, or data in general, is not a straightforward one-step task. It is a complex, never-ending process, in which some steps may require repetition until results are satisfactory, and which goes from the point where data is generated until its analysis. To add complexity to the question, each developed platform uses different approaches to tackle both the conceptual and the technical challenges associated with each phase of the process. And we shall not forget that, because this field of research is constantly evolving, solutions deployed today may not be useful in the near future; what is considered Big Data today may not be considered so tomorrow. Companies wanting to optimize their resources cannot afford to waste time exploring all the available platforms and need help and education in finding the most suitable solution for their specific set of needs. For most it is not even a matter of wanting or not wanting, but a matter of absolute impossibility due to lack of resources and know-how. It has to be understood that it is impossible to name one definitive platform as the best one and most suited for everyone's needs. One size fits all is not possible in this area.
A great deal of research has already been done on the field of Big Data and Big Data Platforms [2, 3, 4, 5]. This research can essentially be done in two ways. The first one is to study the needs of one organization and create a set of tools that answer the needs of that organization, and that one only. This approach requires a large amount of investment and time that only big enterprises are able to afford. The second way works by studying the existing platforms and choosing the one that is closest in characteristics and capabilities to the requirements of each enterprise. This approach is preferred for Small and Medium Enterprises because it requires no further investment or development. Proprietary solutions can require a certain amount of investment, but nowadays there are a lot of open source solutions that are not only free but also more flexible to adapt. In this paper we study and compare not only the technical characteristics of Big Data Open Source Platforms but also, and most importantly, their suitability for the segment of Small and Medium Enterprises. We compare this capability through the analysis of parameters such as the size of the hardware infrastructure necessary for deployment, the ease of use of the interfaces and the availability of programmers for the language required to manage and use the platform. To keep the focus of this work on platforms that mine and analyze data, we left out of the study the distributed computing platforms on which Big Data Platforms are built, such as Hadoop [6], Spark [7] and Storm [8].

The remainder of this paper is structured as follows. Section 2 explains the Big Data paradigm/model. Section 3 overviews the features and requirements involved in Big Data Open Source Platforms. Section 4 describes the six platforms. Section 5 compares the tools and establishes connections between their features and the necessities of companies. Finally, section 6 presents concluding remarks and future work.
II. THE BIG DATA MODEL EXPLAINED
There is not one final definition of what exactly Big Data is. In a very simple way, Big Data can be defined just as a volume of data that is bigger than traditional volumes [9]. Still, this definition is more relative than concrete while being very reductive at the same time. It does not work because the volumes of data are growing day by day, making the definition constantly outdated. This is why newer definitions consider Big Data as the combination of two things: large volumes of data, no matter what real size they are as long as they are big enough that they cannot be managed with traditional architectures, and the set of techniques and technologies used to extract value from such information [10]. Volume and value are only two of the dimensions that need to be considered when treating Big Data. The most recent definitions have evolved from the 3V's concept, which considered only Volume, Velocity and Variety [11], to a more complex 6V's concept that adds Value, Variability and Veracity to the list of characteristics inherent to the model, as can be seen in Figure 1:
• Volume represents a real quantity of generated data that can be measured.
• Velocity is a combination of both the speed at which data is generated and the speed at which it is processed.
• Variety consists of the several forms, but also dimensions, in which data can be represented – the structured/unstructured division is the most common one.
• Value is the knowledge that can be extracted from data analysis; it varies depending on the needs of whoever is analyzing the data.
• Variability is the inconsistency that can be present in data because of the high number of distributed autonomous data sources that have no centralized control whatsoever.
• Veracity evaluates the quality of the data, as a metric of how much that data can be trusted to provide acceptable information for the goals of the analysts.

Figure 1: The 6V of Big Data

These characteristics have to be ever present in the process of developing technologies and platforms, whether they act upon one or more steps of the Big Data chain of value [12], as seen in Figure 2. This chain of value is what divides the different phases that data goes through from its generation to its eventual death. Data Generation is the first step, followed by the Data Acquisition, Data Storage and finally Data Analysis steps.

Figure 2: Big Data chain of value

For any individual or group wanting to put a Big Data Platform to work, it is important to note that before the platform itself is chosen and deployed it is advisable to study and implement a Big Data strategy that helps managers understand exactly what they want from the eventual platform they are going to use [13]. A Big Data strategy consists mainly of three different areas:
• Big Data Basics, which usually represent the acknowledgment of the diverse types of data available, such as social data, preprocessed data or unstructured data.
• Big Data Assessment, an area that evaluates several data aspects such as the source, the potential uses, volumes, estimated future growth and privacy regulations.
• The Big Data Strategy in itself, which studies the impacts of Big Data on the organization, opportunities to be taken and business cases where Big Data can be of use. It also studies economic impacts such as the potential return on investment.
Only after that strategy is documented and thoroughly analyzed are organizations and managers able to effectively choose what functionalities they will have to look for in a Big Data Platform.

III. OPEN SOURCE BIG DATA PLATFORMS OVERVIEW
Any platform capable of supporting the large kind of datasets that are not manageable by traditional database tools can be considered a Big Data Platform [14].

A generic goal of such platforms can be formulated as granting the ability to integrate privately acquired and publicly available Big Data with data generated within an enterprise, and to analyze the combined set for value extraction [15]. In further detail, Big Data Platforms should also possess a group of additional features:
• Being comprehensive and ready for enterprise use.
• Being easily scalable and extensible.
• Providing capabilities for updating data with low latency.
• Being robust and capable of supporting fault tolerance.
• Being corporate or open source, depending on the interest of the development and investment teams.
• Requiring as little maintenance as possible.
But Big Data Platforms aren't just features; they are also technologies. A platform doesn't have to be made of the newest and most powerful technology – sometimes the only thing it takes is a new configuration or approach to an old one. It all depends on the needs of the organization or person wanting to use the platform. Unfortunately, this path is not widely followed because modifying decades-old systems for the needs of the present comes at a very high cost that very few are able to afford. This is why new platforms are arising to answer new requirements. Big Data Platform requirements can be divided into three main phases – Data Acquisition, Data Organization and Data Analysis – as seen in Figure 3. In the Data Acquisition phase, systems are called on to provide lower latency on data capture, shorter execution of data queries, capacity for distributed environments and support for the most varied and flexible types of data structures. NoSQL databases are currently the solution closest to such specifications because they prioritize data capture over data categorization, thus eliminating the overhead caused by the existence of a data schema [16, 17]. When it comes to Data Organization, or data integration, the previously preferred solution of organizing all the data in one single location is not possible anymore in the Big Data era. Moving significant amounts of data around while keeping its integrity is mandatory for big companies. In this area the main requirements for the systems are the capacity to scale both vertically and horizontally, to allow distributed programming, to provide high throughput of data across the entire infrastructure and to support both structured and unstructured types of data. Hadoop [6] and its Hadoop Distributed File System (HDFS) are currently the most used solutions for this phase of data treatment, but as the paradigm shifts from batch processing to real-time stream processing, Hadoop is being abandoned in favor of more suitable platforms such as Spark, Storm or S4 [18]. Data Analysis, just like data integration, demands the use of distributed environments to carry out its tasks of deep analytics and statistics on a broad variety of data types. Platforms must cope with facts such as data being stored in different systems and scaling up in terms of volume. They need to be faster at delivering answers and at reacting to changes in data behavior. In some scenarios, where moving data from one place to another raises privacy concerns as well as costs that are prohibitive for organizations to support, the data mining and analysis tasks performed by Big Data Platforms are encouraged to be carried out in multiple locations, with intermediate results then sent to a central location that groups them back for a final analysis process. Still, this brings even more complexity, especially for development at the algorithm level, because intermediate results are not as precise as raw data – noise and arbitrariness can be introduced for privacy maintenance – and reading and integrating them can cause both loss of important data and incorrect interpretation [19]. A deliberately simplified sketch of this multi-location scheme is given at the end of this section.

Figure 3: The three phases of work of Big Data Platforms
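The following Java fragment is not code from any of the surveyed platforms; it is a minimal toy sketch, under invented names (Site, noisyPartialSumAndCount), of the multi-location analysis just described: each hypothetical site releases only a noisy partial aggregate, and a central location groups the intermediate results into a global estimate, which shows both the privacy benefit and the loss of precision discussed above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Random;

public class MultiSiteAggregation {

    // A hypothetical site that will not ship its raw values elsewhere.
    static class Site {
        final double[] values;
        Site(double... values) { this.values = values; }

        // The site releases only an intermediate result: a partial sum with
        // random noise added for privacy, plus a count. Precision is lost here.
        double[] noisyPartialSumAndCount(Random rng, double noiseScale) {
            double sum = 0.0;
            for (double v : values) sum += v;
            return new double[] { sum + rng.nextGaussian() * noiseScale, values.length };
        }
    }

    public static void main(String[] args) {
        Random rng = new Random(42);
        List<Site> sites = Arrays.asList(
                new Site(1.0, 2.0, 3.0),
                new Site(10.0, 20.0),
                new Site(5.0, 5.0, 5.0, 5.0));

        // Central location: groups the noisy intermediate results back together.
        double totalSum = 0.0;
        double totalCount = 0.0;
        for (Site site : sites) {
            double[] partial = site.noisyPartialSumAndCount(rng, 0.5);
            totalSum += partial[0];
            totalCount += partial[1];
        }
        System.out.printf("approximate global mean: %.3f%n", totalSum / totalCount);
    }
}
```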
IV. OPEN SOURCE BIG DATA PLATFORMS
Based on the work of [20] we chose the following Big Data Open Source Platforms: Apache Mahout, MOA, R Project, Vowpal Wabbit, PEGASUS and GraphLab Create™. In this section we describe these tools in some detail.

A. Apache Mahout
Apache Mahout [21] is an open source project aiming to build a comprehensive library of machine learning and data mining algorithms. The main feature of Mahout is that all the algorithms in its library are highly scalable and able to perform well both on standalone machines and in distributed environments. It runs on top of the Hadoop environment and makes use of technologies such as HDFS and MapReduce. Unlike other platforms that fail to do so, Mahout provides elevated strength and stability in the management of large data sets. Because most machine learning algorithms are iterative and require multiple loading steps from disk, which causes a lot of processing overhead, the Mahout developers are being encouraged to abandon Hadoop in favor of Spark, since Spark not only maintains the distribution of the file system but also provides more modern and powerful uses of parallel-processing systems [22].
Mahout currently has implementations of and support for most of the machine learning tasks of supervised and
unsupervised nature, such as Recommendation Mining, Clustering and Classification. Recommendation Mining covers the algorithms that mine and analyze user behavior and build recommendations of similar items the user might like (a small usage sketch is given at the end of this subsection). Clustering is characterized as the set of algorithms aimed at analyzing text documents and grouping them into topic-related groups of texts. Classification comprises the set of algorithms that use previously obtained information from already categorized documents to assign new ones to the most suitable category. Other tasks available are Collaborative Filtering, Dimension Reduction, Topic Models and Frequent Pattern Mining [21].
It is important to note that Mahout is only a library. It does not have its own server or user interface, thus requiring the use of an external programming IDE with support for Java for development and testing. It has gained a lot of popularity among Data Mining developers because of the freedom of implementation it provides. It still has one critical issue: it is constantly under development and not very well documented.
It is licensed under the Apache License 2.0. It needs Java 1.6 or greater and Maven 3.0 or greater to be installed on the machine(s) that compose the system. The system requirements are dependent on the particular algorithm to be run. The latest stable version is 0.9 [21].
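To make the Recommendation Mining task concrete, the sketch below uses the Taste API, Mahout 0.9's single-machine collaborative filtering component. The class names are Mahout's own; the data file name, user ID and neighborhood size are assumptions invented for this example, with the input expected as userID,itemID,rating lines.

```java
import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;

public class MahoutRecommenderSketch {
    public static void main(String[] args) throws Exception {
        // ratings.csv is a hypothetical file of "userID,itemID,rating" lines.
        DataModel model = new FileDataModel(new File("ratings.csv"));
        // User-based collaborative filtering: similarity + a 10-user neighborhood.
        PearsonCorrelationSimilarity similarity = new PearsonCorrelationSimilarity(model);
        NearestNUserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
        GenericUserBasedRecommender recommender =
                new GenericUserBasedRecommender(model, neighborhood, similarity);
        // Top-3 recommendations for (hypothetical) user 42.
        List<RecommendedItem> items = recommender.recommend(42L, 3);
        for (RecommendedItem item : items) {
            System.out.println(item.getItemID() + " -> " + item.getValue());
        }
    }
}
```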
B. Massive Online Analysis (MOA)
Massive Online Analysis [23] is open source software oriented toward mining data streams that present concept drift. It allows both building and experimenting with machine learning and data mining algorithms, to provide fast answers to the evolving nature of data streams. It is a very user-friendly platform – having its own Graphical User Interface (GUI), as seen in Figure 4, a command line interface and a Java API – which makes it suitable for users with distinct levels of experience (a small API sketch is given at the end of this subsection). One of its most recognized abilities is its modularity: besides the core program, a user can choose to add only the modules of interest, thus saving time and effort in exploring features not relevant to his/her work.
The main goal of MOA is to provide a framework for benchmarking existing machine learning algorithms that operate on real-time big data streams. This allows the community to discard less efficient algorithms early, while placing more investment in the development of newer solutions. Unlike the WEKA [24] platform, from which MOA is derived, it does not work with batch processing. To perform its tasks it provides a set of essential tools such as real and synthetic examples of data streams for testing, a library of existing algorithms and measures of comparison between them. It does not only allow working with the content shipped with the platform: it provides a framework for the user to insert new types of streams, algorithms and methods. It also permits the storage of previously run benchmark results, thus enabling the creation of scenarios to be run against newer algorithms [25]. The current version of MOA provides collections able to perform such tasks as Classification, Regression, Clustering, Outlier Detection, Recommender Systems, Frequent Pattern Mining and Change Detection. MOA is a framework that works mostly with algorithms that run on single node machines, and it studies the capability of algorithms to scale up rather than scale out. It has three simulated environments, each defined according to different memory requirements – Sensor Network, Handheld Computer and Server – that use 100Kb, 32Mb and 400Mb of memory respectively. Custom environments can be created with their own specifications, although the 400Mb imposed by the Server environment has proven to be enough for most of the algorithms run on MOA.
It is freely licensed under the GNU GPL license and is written in Java. The latest stable version is 14.11 [23].

Figure 4: MOA Graphical User Interface (source: www.moa.cms.waikato.ac.nz/getting-started/)
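The Java API mentioned above can be exercised in a few lines. The sketch below follows the test-then-train (prequential) loop from MOA's own documentation; note that package names and the return type of nextInstance() have varied across MOA releases, so treat this as a sketch for a recent release rather than verbatim code for version 14.11.

```java
import com.yahoo.labs.samoa.instances.Instance;
import moa.classifiers.Classifier;
import moa.classifiers.trees.HoeffdingTree;
import moa.streams.generators.RandomRBFGenerator;

// Prequential evaluation of a Hoeffding Tree on a synthetic MOA stream,
// adapted from the "Using the API" example in the MOA documentation.
public class MoaPrequentialSketch {
    public static void main(String[] args) {
        Classifier learner = new HoeffdingTree();
        RandomRBFGenerator stream = new RandomRBFGenerator();
        stream.prepareForUse();
        learner.setModelContext(stream.getHeader());
        learner.prepareForUse();

        int correct = 0;
        int numInstances = 10000; // illustrative stream length
        for (int i = 0; i < numInstances && stream.hasMoreInstances(); i++) {
            Instance inst = stream.nextInstance().getData();
            if (learner.correctlyClassifies(inst)) { // test on the instance first...
                correct++;
            }
            learner.trainOnInstance(inst); // ...then train on it
        }
        System.out.println("accuracy: " + 100.0 * correct / numInstances + "%");
    }
}
```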
C. R Project
R Project [26] is the combination of a programming language and an environment for statistical and graphics computing. It was designed with influence taken from the programming languages S and Scheme but, unlike these two, it is completely open source [27].
So far R Project provides a variety of graphical and statistical techniques that include linear and non-linear modelling, classic statistical tests, time-series analysis and more traditional features like classification and clustering. It intends to be a fully planned system built with coherency, rather than a basic suite where different tools are simply added for the extension of functionality [28].
Among its strengths are high extensibility through the use of packages for the addition of new models, easy integration with code written in C, C++ or Fortran for the execution of intensive tasks, and the ability to produce robust, well-designed statistical plots for publication. Because R is in itself a programming language, it allows users who feel comfortable with coding to add new functionalities to the suite.

It is an integrated suite of software that allows performing a full circle of data treatment, including manipulation, calculation and finally display. Besides including effective methods for data storage and handling, it includes a group of operators to provide calculation with arrays and matrices. Other features include an integrated collection of tools for intermediate data analysis and graphical facilities for visualizing such analysis. The R programming language also provides a simple way to develop functions, with the inclusion of conditionals, loops and input and output facilities, which, as we mentioned before, can be integrated with functions written in other languages that perform better at computationally intensive tasks.
R Project is so popular among the community working with statistical data analysis that this has led to the creation of several tools to make it more user-friendly and appealing. One of them is RStudio [29], an IDE specifically aimed at working with R and developed by the company of the same name. An example of the RStudio interface is shown in Figure 5.
It is licensed under the terms of the GNU General Public License and can be run on a variety of UNIX distributions but also Windows and MacOS. The latest stable version is 3.2 [26].

Figure 5: Interface of RStudio (source: www.rprogramming.net/download-and-install-rstudio/)
D. Vowpal Wabbit
Vowpal Wabbit [30] is a project sponsored by Microsoft Research with the goal of developing a single machine learning algorithm that is inherently fast, able to run both on standalone machines and in parallel processing environments, and capable of handling datasets on the scale of terabytes.
The creators and developers of Wabbit decided from the very beginning to focus on building a single strong multi-purpose algorithm rather than a library with many algorithms, and also to start at a high level of development rather than wasting investment on bottom-up development from slower and less efficient algorithms. The development of the algorithm takes advantage of four main features that, when used in combination, can achieve better results – input formats of data, speed of learning, scalability of the data sets analyzed and feature pairing. Current efforts aim not just at improving the algorithm in itself but also at improving some more basic features such as the speed of the input/output operations.
Wabbit presents a group of features that are so far unique to its system. It is optimized by default to support online learning rather than batch learning, through a series of modifications made to the stochastic gradient descent methods that allow a more robust analysis on data sets of considerable size [30]. It presents Feature Hashing, a method that reduces the necessary pre-processing of data, making the analysis process not only faster but more accurate. The combination of these two features allows Wabbit to learn effectively from any amount of information made available to the algorithm, no matter how small or big it is (both ideas are sketched at the end of this subsection). The implementation of a reduction stack within its core also allows it to provide a solution for large scale advanced problems. As we mentioned before, Wabbit runs mainly as a library or a standalone daemon service, but it is fully ready to be deployed in cloud environments.
Wabbit is licensed under BSD and the latest stable version is 7.10 [30].
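Wabbit itself is driven from the command line or linked as a C++ library, so the following is not Vowpal Wabbit code or its API. It is an independent toy re-implementation, in Java, of the two ideas just described – feature hashing and online learning by stochastic gradient descent – intended only to show why they remove the dictionary-building pre-processing pass and keep memory constant.

```java
import java.util.Map;

// Toy sketch of feature hashing plus online SGD for logistic regression.
// This is NOT Vowpal Wabbit code; it only mimics the two concepts.
public class HashedOnlineLearner {
    private final double[] weights;
    private final double learningRate;

    HashedOnlineLearner(int bits, double learningRate) {
        this.weights = new double[1 << bits]; // fixed memory footprint
        this.learningRate = learningRate;
    }

    // Feature hashing: a feature name maps straight to a weight index,
    // so no feature dictionary has to be built in advance.
    private int index(String feature) {
        return Math.floorMod(feature.hashCode(), weights.length);
    }

    double predict(Map<String, Double> features) {
        double z = 0.0;
        for (Map.Entry<String, Double> f : features.entrySet()) {
            z += weights[index(f.getKey())] * f.getValue();
        }
        return 1.0 / (1.0 + Math.exp(-z)); // probability of the positive class
    }

    // One online SGD step: the model is updated example by example,
    // so the full dataset never has to fit in memory.
    void learn(Map<String, Double> features, int label) {
        double error = predict(features) - label;
        for (Map.Entry<String, Double> f : features.entrySet()) {
            weights[index(f.getKey())] -= learningRate * error * f.getValue();
        }
    }

    public static void main(String[] args) {
        HashedOnlineLearner model = new HashedOnlineLearner(18, 0.1);
        Map<String, Double> spam = Map.of("word:viagra", 1.0, "word:free", 1.0);
        Map<String, Double> ham = Map.of("word:meeting", 1.0, "word:agenda", 1.0);
        for (int i = 0; i < 100; i++) { // tiny illustrative stream
            model.learn(spam, 1);
            model.learn(ham, 0);
        }
        System.out.printf("p(spam)=%.3f p(ham)=%.3f%n", model.predict(spam), model.predict(ham));
    }
}
```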

E. PEGASUS
PEGASUS [31] is an open source platform developed by the data mining group at Carnegie Mellon University and designed specifically for data mining on graph structures of sizes ranging from a few gigabytes to petabytes (see logo in Figure 6). Because datasets of such size can no longer be processed on single node machines, PEGASUS resorts to parallel programming by being implemented on top of Hadoop.

Figure 6: Project PEGASUS logo (source: www.cs.cmu.edu/~pegasus/)

It is a more specific data mining platform than the others because it works with data that comes in the form of graphs and networks with billions of nodes and connections. This is a considerable leap over previous systems, which can only work at the scale of millions. Graph structures are seeing a rise in number and importance, coming from fields as varied as mobile networks, social networks or medical fields such as protein regulation [32].
PEGASUS unifies a vast number of graph mining tasks, such as computing the graph diameter, computing the radius of each node or finding connections between the graph nodes, through a generalization of matrix-vector multiplication called GIM-V [32] (a serial sketch of GIM-V is given at the end of this subsection). GIM-V is carefully implemented and optimized, with built-in graph mining operations such as PageRank, Random Walk with Restart and diameter estimation [32]. It provides linear scaling with respect to the number of edges to analyze, making it suitable to work on any number of machines available.
The system is not fully developed yet, and current efforts are being put in place to expand the library with newer data mining and machine learning algorithms but also to provide more efficient methods for graph indexing.
Written in Java, it runs on any operating system able to run Hadoop, with a preference for UNIX machines. It is quite heavy in terms of software requirements, as it calls for not only Hadoop but also Apache Ant, Java, Python and Gnuplot. It is licensed under the Apache License version 2.0 and the latest stable version is 2.0 [32].
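As described in [32], GIM-V factors a graph algorithm into three operations: combine2 (pair a matrix element with a vector element), combineAll (aggregate the partial results arriving at a node) and assign (write the node's new value). The sketch below is a serial, in-memory instantiation of those three operations for PageRank; it is illustrative only, since PEGASUS actually executes the same abstraction as Hadoop MapReduce jobs over huge edge files.

```java
// Serial, in-memory sketch of the GIM-V abstraction from the PEGASUS
// paper [32], instantiated for PageRank.
public class GimVPageRank {
    public static void main(String[] args) {
        // Tiny directed graph: edges[i] lists the targets of node i.
        int[][] edges = { {1, 2}, {2}, {0}, {2} };
        int n = edges.length;
        double damping = 0.85;

        double[] rank = new double[n];
        java.util.Arrays.fill(rank, 1.0 / n);

        for (int iter = 0; iter < 30; iter++) {
            double[] next = new double[n];
            for (int src = 0; src < n; src++) {
                for (int dst : edges[src]) {
                    // combine2: matrix element (1/outDegree) times vector element;
                    // the += also plays the role of combineAll, summing the
                    // partial results that arrive at each destination node.
                    next[dst] += rank[src] / edges[src].length;
                }
            }
            for (int v = 0; v < n; v++) {
                // assign: mix the aggregated value with the teleport term.
                rank[v] = (1.0 - damping) / n + damping * next[v];
            }
        }
        for (int v = 0; v < n; v++) {
            System.out.printf("node %d: %.4f%n", v, rank[v]);
        }
    }
}
```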
F. GraphLab Create™
GraphLab Create™ [33] was formerly known simply as GraphLab, before Dato, Inc. acquired it. It is a platform for the development of machine learning applications working at various scales of dataset sizes. It aims at providing the means for applications to have all the necessary iterative steps to make predictions based on data mining results.
The main goal is to design and implement machine learning algorithms that are efficient, accurate and able to keep data consistency, while taking the most advantage of parallel processing [34]. Development focuses on approaching existing common patterns in machine learning and putting effort into those, while ignoring more exotic approaches that provide less utility. Some of the already developed algorithms are belief propagation, Gibbs sampling and Co-EM; all of these have been optimized from their previous versions for parallel processing. The platform works mainly with three areas of data processing – Data Engineering, Data Intelligence and Data Deployment. One of its strengths is its ease of use for both beginners and experts in the area of data science. Its main components are scalable data structures, machine learning modules, methods for data visualization and the capacity to easily integrate with many data sources of different types. In the field of data engineering it provides an easy way to run the ETL (Extract, Transform and Load) process on data, as a means to clean data and save time during the analysis process. For this it provides tools to perform basic tasks such as data sort, group or dice in a fast way on datasets of terabytes in size. It also provides intuitive visualization of the generated tables and graphs through the auxiliary tool GraphLab Canvas™, an example of whose interface can be seen in Figure 7. For data intelligence there are tools to build recommenders with Python code and to perform data analysis on images through deep learning techniques that take advantage of powerful GPU computing capabilities. There are also tools for the analysis of unstructured text, graph analysis and supervised learning for outcome prediction. For data deployment it provides tools to easily deploy coded services for prediction of various types.
Written in C++ with a Python interface, it runs on all the main operating systems: Mac OS X, Windows virtual machines and UNIX distributions such as Ubuntu, Debian or RHEL. It is licensed under the Apache License 2.0 and the latest stable version is 1.3 [33].

Figure 7: GraphLab Canvas™ Interface (source: www.blog.dato.com/)

V. OPEN SOURCE BIG DATA PLATFORMS COMPARISON
To perform our comparison of the previously detailed platforms we chose six parameters that we consider relevant for business managers to analyze when choosing an open source Big Data Platform. These parameters are a mix of technical characteristics and other defining characteristics to consider when finding out whether a platform is suitable for Small and Medium Enterprise environments. We start with the programming paradigm - this is directly related to the size of the company and the size of the computer infrastructure where the system may eventually be installed. The same applies to another characteristic we analyze - the supported sizes of datasets. Next we analyze the required programming languages, because we consider it important for a manager to know what kind of programmers s/he will need to run the systems. To this end we also provide a comparison of the user interface each platform provides, since the required level of experience and the ease of use and adoption of the platform are related to how user friendly the available interfaces are. We also compare the types of data supported, as a means to categorize platforms as more generic or more specific. Last but not least, we compare the variety of algorithms available as a way to tell which platforms are more complete than the others. In Table 1 we show the detailed listing of these characteristics for each platform.
Analyzing all the characteristics present in Table 1, we can draw the following conclusions. Both MOA and R Project are the best tools for serial computing running on a single node machine; the others are best for companies with larger infrastructures that allow parallel computing.
Most of the platforms, except for R Project, make use of common languages that have no lack of programmers available.

MOA, R Project and GraphLab can be considered the most user friendly as they provide Graphical User Interfaces where the remaining ones do not, which makes working with the latter considerably harder and with a bigger learning curve, at least for less experienced users. PEGASUS and GraphLab are graph oriented, where the other platforms are made to work with all generic types of datasets. Working with graphs only is not a limitation in itself, as this type of data structure is very common these days; it is up to each company to make this evaluation according to its own needs. Lastly, when comparing available algorithms it is complicated to draw clear conclusions about the platforms. Mahout and R Project are the ones with more participation from the community and therefore the ones with more diverse algorithms provided. GraphLab, PEGASUS and MOA also have a significant number of algorithms ready to be used. Here is where Wabbit is the weakest because, as we already explained, it only works with a single algorithm that can be powerful under specific conditions but useless for most cases. Mahout, PEGASUS and GraphLab are the ones that support bigger data sets, while MOA is the weakest in this field, supporting only datasets with a size of a few megabytes.
In summary, MOA is definitely the best platform for small companies with modest computer infrastructures that work with smaller amounts of data. For larger companies with bigger infrastructures that require parallel computing it is hard to decide between Mahout and GraphLab: while the first has the advantage of being broader in terms of data types supported, the second provides more ease of use due to the existence of a GUI that helps track the work being done.
Table 1. Open Source Big Data Platforms Comparison

| Parameter | Mahout | MOA | Wabbit | R Project | PEGASUS | GraphLab |
| --- | --- | --- | --- | --- | --- | --- |
| Programming Paradigm | Parallel Computing | Serial Computing | Parallel Computing | Serial Computing | Parallel Computing | Parallel Computing |
| Programming Language(s) | Java | Java | C, C++ | R, S, C, C++, Fortran | Java | C++, Python |
| User Interface | N/A | GUI, command line, Java API | N/A | N/A (but GUI through RStudio exists) | N/A | GUI through GraphLab Canvas™ |
| Data Types | All | All | All | All | Graphs | Graphs |
| Available Algorithms | Recommendation Mining, Classification, Clustering and others | Classification, Regression, Clustering and others | Own single algorithm | Undefined | Page Rank, Random Walk with Restart and others | Belief propagation, Gibbs sampling, Co-EM and others |
| Scale of Supported Datasets | Up to Petabytes | Few Megabytes | Up to Terabytes | Up to Gigabytes | Up to Petabytes | Up to Petabytes |

VI. CONCLUSIONS AND FUTURE WORK
With the rise of Big Data, and with more individuals and organizations gaining awareness of the potential and opportunities it brings, the number of Big Data Platforms of both open source and proprietary nature will keep increasing over the next years. Implementing a platform that solves all the questions inherent to Big Data is too heavy and expensive, if not impossible at all. So the path more commonly followed is for each organization to invest in developing its own platform, close to its specific needs, either in-house or by calling on the community. This creates a high number of available solutions, which not only makes the choice difficult for people wanting to select a platform for their enterprise but also creates a high level of redundancy among the solutions presented. Educating not only business managers but also people in general about the potential of Big Data is crucial to guarantee the future of research and platform deployment.
In this paper we studied and analyzed six Big Data Open Source Platforms and concluded that the best platform for small companies with trivial computer infrastructures is MOA, while Mahout and GraphLab are the best for companies with larger computer infrastructures that require the use of parallel computing. Mahout supports more types of data, but GraphLab is more user friendly because it has a GUI and Mahout does not. As future work we intend to analyze these six tools in real environments and to explore other available Big Data tools.

ACKNOWLEDGMENTS
This work was partially financed by iCIS – Intelligent Computing in the Internet Services (CENTRO-07-ST24-FEDER-002003), Portugal.
This work was also made possible with the help of ISEC-Polytechnic of Coimbra, which provided some of the facilities and infrastructures.

REFERENCES
[1] OECD (2014), "Small and medium-sized enterprises", in OECD Factbook 2014: Economic, Environmental and Social Statistics, OECD Publishing, Paris, doi:10.1787/factbook-2014-en.
[2] V. Mayer-Schönberger, K. Cukier, "Big Data: A Revolution That Will Transform How We Live, Work and Think", John Murray Publishers, United Kingdom, 2013, ISBN: 1848547927, 9781848547926.
[3] O'Reilly Media, "Big Data Now: 2012 Edition", O'Reilly Media Inc., October 2012, ASIN: B0097E4EBQ.
[4] M. Barata, J. Bernardino, P. Furtado, "YCSB and TPC-H: Big Data and Decision Support Benchmarks", IEEE BigData Congress 2014, pp. 800-801.
[5] M. Barata, J. Bernardino, P. Furtado, "Survey on Big Data and Decision Support Benchmarks", Database and Expert Systems Applications – 25th International Conference, DEXA 2014, pp. 174-182.
[6] Hadoop, https://hadoop.apache.org/.
[7] Spark, https://spark.apache.org/.
[8] Storm, https://storm.apache.org/.
[9] C. Snijders, U. Matzat, U.-D. Reips, ""Big Data": Big Gaps of Knowledge in the Field of Internet Science", International Journal of Internet Science, vol. 7, 2012, pp. 1-5.
[10] I. Hashem et al., "The rise of "big data" on cloud computing: Review and open research issues", Information Systems, vol. 47, January 2015, pp. 98-115, doi:10.1016/j.is.2014.07.006.
[11] D. Laney, "3D Data Management: Controlling Data Volume, Velocity and Variety", META Group, File 949, February 2001.
[12] H. Hu, Y. Wen, T. Chua, X. Li, "Toward Scalable Systems for Big Data Analytics: A Technology Tutorial", IEEE Access, vol. 2, June 2014, pp. 652-687, doi:10.1109/ACCESS.2014.2332453.
[13] M. Huddar, M. Ramannavar, "A Survey on Big Data Analytical Tools", International Journal of Latest Trends in Engineering and Technology (IJLTET), 2013, pp. 85-91.
[14] R. Gupta, S. Gupta, A. Singhal, "Big Data: Overview", International Journal of Computer Trends and Technology (IJCTT), vol. 9, no. 5, March 2014, pp. 266-268, ISSN: 2231-2803.
[15] J. Dijcks, "Big Data for the Enterprise", An Oracle White Paper, September 2014.
[16] V. Abramova, J. Bernardino, "NoSQL databases: MongoDB vs Cassandra", Sixth International C* Conference on Computer Science & Software Engineering (C3S2E), 2013, pp. 14-22.
[17] J. Lourenço, V. Abramova, M. Vieira, B. Cabral, J. Bernardino, "NoSQL Databases: A Software Engineering Perspective", WorldCIST'15 – 3rd World Conference on Information Systems and Technologies.
[18] Apache S4, http://incubator.apache.org/s4/.
[19] X. Wu, X. Zhu, G. Wu, W. Ding, "Data Mining with Big Data", IEEE Transactions on Knowledge & Data Engineering, vol. 26, January 2014, pp. 97-107.
[20] A. Bifet, "Mining Big Data in Real Time", Informatica, vol. 37, 2013, pp. 15-20.
[21] Apache Mahout, http://mahout.apache.org.
[22] A. Fernández et al., "Big Data with Cloud Computing: an insight on the computing environment, MapReduce, and programming frameworks", WIREs Data Mining and Knowledge Discovery, 2014, 4:380-409, doi:10.1002/widm.1134.
[23] MOA, http://moa.cs.waikato.ac.nz.
[24] WEKA, http://www.cs.waikato.ac.nz/ml/weka/.
[25] A. Bifet et al., "MOA: Massive Online Analysis, a Framework for Stream Classification and Clustering", Journal of Machine Learning Research (JMLR) Workshop and Conference Proceedings, Volume 11: Workshop on Applications of Pattern Analysis, 2010.
[26] R, http://www.r-project.org.
[27] K. Hornik, "The R FAQ", http://cran.r-project.org/doc/FAQ/R-FAQ, 2015.
[28] F. Morandat, B. Hill, L. Osvald, J. Vitek, "Evaluating the design of the R language: objects and functions for data analysis", ECOOP'12 – Proceedings of the 26th European Conference on Object-Oriented Programming, pp. 104-131, doi:10.1007/978-3-642-31057-7_6.
[29] RStudio, http://www.rstudio.com/products/rstudio.
[30] Vowpal Wabbit, https://github.com/JohnLangford/vowpal_wabbit/wiki.
[31] PEGASUS, http://www.cs.cmu.edu/~pegasus.
[32] U. Kang, C. Tsourakakis, C. Faloutsos, "PEGASUS: A Peta-Scale Graph Mining System – Implementation and Observations", ICDM '09 – Ninth IEEE International Conference on Data Mining, 2009, pp. 229-238, doi:10.1109/ICDM.2009.14.
[33] GraphLab Create™, https://dato.com/products/create.
[34] Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, J. Hellerstein, "GraphLab: A New Framework For Parallel Machine Learning", Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence (UAI), 2010, arXiv:1408.2041.
