Distributed Parallel Architecture For "Big Data"
Informatica Economică vol. 16, no. 2/2012
This paper is an extension of the "Distributed Parallel Architecture for Storing and Processing Large Datasets" paper presented at the WSEAS SEPADS '12 conference in Cambridge. In its original version, the paper went over the benefits of using a distributed parallel architecture to store and process large datasets. This paper analyzes the problem of storing, processing and retrieving meaningful insight from petabytes of data. It provides a survey of current distributed and parallel data processing technologies and, based on them, proposes an architecture that can be used to solve the analyzed problem. This version puts more emphasis on distributed file systems and on the ETL processes involved in a distributed environment.
Keywords: Large Dataset, Distributed, Parallel, Storage, Cluster, Cloud, MapReduce, Hadoop
1 Introduction
A look back at recent history shows that data storage capacity is continuously increasing, while the cost per GB stored is decreasing, figure 1. The first hard disk came from IBM in 1956; it was called the IBM 350 Disk Storage and it had a capacity of 5 MB, [10-11]. In 1980, the IBM 3380 broke the gigabyte-capacity limit, providing storage for 2.52 GB. After 27 years, Hitachi GST, which acquired the IBM drive division in 2003, delivered the first terabyte hard drive. After only two years, in 2009, Western Digital launched the industry's first two-terabyte hard drive. In 2011, Seagate introduced the world's first 4 TB hard drive, [11].

Fig. 1. Average HDD capacity (based on [13])

In terms of price [12], the cost per gigabyte decreased from an average of 300,000 $ to merely an average of 11 cents in the last 30 years, figure 2. In fact, in 1981 one had to use 200 Seagate units, each having a five-megabyte capacity and costing 1,700 $, to store one gigabyte of data.

Fig. 2. Average $ cost per GB (based on [13])

The reduced cost of data storage has been the main premise of the current data age, in which it is possible to record and store almost everything, from business to personal data. As an example, in the field of digital photos, it is estimated that over 3.5 trillion photos have been taken around the world, and in 2011 alone around 360 billion new snapshots were made, [16]. Also, the availability and speed of Internet connections around the world have generated increased data traffic, coming from a wide range of mobile devices and desktop computers. For mobile data alone, Cisco expects that traffic will grow from
0.6 exabytes (EB) in 2011 to 6.3 EB in 2015, [8]. The expansion of data communications has promoted different data services, social, economic or scientific, to central nodes for storing and distributing large amounts of data. For example, the Facebook social network hosts more than 140 billion photos, more than double the 60 billion pictures hosted at the end of 2010. In terms of storage, all these snapshots take up more than 14 petabytes. In other fields, like research, the Large Hadron Collider particle accelerator near Geneva will produce about 15 petabytes of data per year. The SETI project records around 30 terabytes of data each month, which are processed by over 250,000 computers each day, [14]. The supercomputer of the German Climate Computing Center (DKRZ) has a storage capacity of 60 petabytes of climate data. In the financial sector, records of everyday financial operations generate huge amounts of data: the New York Stock Exchange alone records about one terabyte of trade data per day, [15].
Despite this spectacular evolution of storage capacities and deposit sizes, the problem that arises is being able to process all this data. The issue is generated by the available computing power, algorithm complexity and access speeds. This paper surveys different technologies used to manage and process large data volumes and proposes a distributed and parallel architecture used to acquire, store and process large datasets. The objective of the proposed architecture is to implement a cluster analysis model.

2 Processing and storing large datasets
As professor Anand Rajaraman observed, "more data usually beats better algorithms" [15]. The remark was used to highlight the efficiency of a proposed algorithm for the Netflix Challenge, [17]. Although the statement is still debatable, it brings up a true point: a given data mining algorithm yields better results with more data, and it can reach the same accuracy as a better or more complex algorithm. In the end, the objective of a data analysis and mining system is to process more data with better algorithms, [15].
In many fields more data is important because it provides a more accurate description of the analyzed phenomenon. With more data, data mining algorithms are able to extract a wider group of influence factors and more subtle influences.
Today, large datasets mean volumes of hundreds of terabytes or petabytes, and these are real scenarios. The problem of storing these large datasets is generated by the impossibility of having a single drive of that size and, more important, by the large amount of time required to access the data.
The access speed of large data volumes is limited by disk performance characteristics [22], Table 1: internal data transfer, external data transfer, cache memory, access time and rotational latency all generate delays and bottlenecks.
Despite the rapid evolution of drive capacity, described in figure 1, large datasets of up to one petabyte can only be stored on multiple disks. Using 4 TB drives requires 250 of them to store 1 PB of data. Once the storage problem is solved, another question arises regarding how easy or hard it is to read that data. Considering an optimal transfer …
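The drive count and the raw read time can be checked with a short calculation; the 100 MB/s sustained transfer rate used here is an illustrative assumption, not a figure from the paper:

```python
TB = 10 ** 12  # decimal terabyte, in bytes

dataset = 1000 * TB          # 1 PB
drive_capacity = 4 * TB      # one 4 TB drive
drives = dataset // drive_capacity
print(drives)                # 250 drives are needed to hold 1 PB

# Assumed sustained transfer rate per drive: 100 MB/s (illustrative).
rate = 100 * 10 ** 6         # bytes per second

serial_hours = dataset / rate / 3600      # one drive reading everything
parallel_hours = serial_hours / drives    # all 250 drives reading at once
print(round(serial_hours), round(parallel_hours, 1))
```

Even with all drives reading in parallel, a full petabyte scan still takes on the order of hours, which is the motivation for distributing both storage and computation.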
…(sub-)processes due to a collaborative development platform, and provide central scheduling;
- Portability: usually the ETL code can be developed on a specific target database and ported later to other supported databases.
When developing the ETL process there are two options: either take advantage of an existing ETL tool (some key players in this domain are IBM DataStage, Ab Initio and Informatica) or custom code it. Both approaches have benefits and pitfalls that need to be carefully considered when selecting what better fits the specific environment.
The benefits of an in-house, custom built ETL process are:
- Flexibility: the custom ETL process can be designed to solve requirements specific to the organization that some of the ETL tools may have limitations with;
- Performance: the custom ETL process can be finely tuned for better performance;
- Tool agnostic: the custom ETL process can be built using skills available in-house;
- Cost efficient: custom ETL processes usually use resources already available in the organization, eliminating the additional costs of licensing an ETL tool and of training the internal resources on that specific ETL tool.

The framework groups all intermediate subsets associated with the same intermediate key and sends them to the Reduce function.
The Reduce function also accepts an intermediate key and its subsets. The function merges these subsets together to form a possibly smaller set of values; normally, zero or one output value is produced per Reduce invocation.
In [26] it is highlighted that many real-world tasks use the MapReduce model: it is used for web search services, for sorting and processing data, for data mining, for machine learning and for a large number of other systems.
The entire framework manages how data is split among the nodes and how intermediary results are aggregated.
A general MapReduce architecture can be illustrated as in Figure 4.
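The Map and Reduce contract described above can be illustrated with a minimal, single-process sketch (plain Python, not the Hadoop API); word counting is the customary example:

```python
from collections import defaultdict

def map_fn(key, value):
    # key: document name (unused here), value: document text.
    # Emits an intermediate (word, 1) pair for every word.
    for word in value.split():
        yield word, 1

def reduce_fn(key, values):
    # key: a word, values: all counts emitted for that word.
    # Merges the intermediate subsets into a single output value.
    yield key, sum(values)

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Shuffle phase: group intermediate pairs by intermediate key --
    # the grouping the framework performs between Map and Reduce.
    groups = defaultdict(list)
    for key, value in inputs.items():
        for ikey, ivalue in map_fn(key, value):
            groups[ikey].append(ivalue)
    result = {}
    for ikey, ivalues in groups.items():
        for okey, ovalue in reduce_fn(ikey, ivalues):
            result[okey] = ovalue
    return result

counts = run_mapreduce({"doc1": "big data beats big algorithms"},
                       map_fn, reduce_fn)
print(counts["big"])  # 2
```

In a real deployment the framework additionally splits the input among nodes and runs many Map and Reduce tasks in parallel; the contract between the two functions stays the same.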
The advantages of the MapReduce approach include:
- the processed data does not require a predefined structure; in fact the architecture can use data that has an arbitrary format;
- fault tolerance;
- the framework is available from high level programming languages; one such solution is the open-source Apache Hadoop project, which is implemented in Java;
- the query language allows record-level manipulation;
- projects such as Pig and Hive [34] provide a rich interface that allows programmers to join datasets without repeating simple MapReduce code fragments.

Hadoop is a distributed computing platform, which is an open source implementation of the MapReduce framework proposed by Google [28]. It is based on Java and uses the Hadoop Distributed File System (HDFS). HDFS is the primary storage system used by Hadoop applications; it creates multiple replicas of data blocks for reliability, distributing them around the cluster and splitting the task into small blocks. The relationship between Hadoop, HBase and HDFS can be illustrated as in Figure 5. HBase is a distributed database: an open source, distributed, versioned, column-oriented database modeled after Google Bigtable [25].

Fig. 5. The relationship between Hadoop, HBase and HDFS (based on [28])

Some of the features of HBase, as listed at [25], are:
- convenient base classes for backing Hadoop MapReduce jobs with HBase tables, including Cascading, Hive and Pig source and sink modules;
- query predicate push down via server side scan and get filters;
- optimizations for real time queries;
- a Thrift gateway and a REST-ful Web service that supports XML, Protobuf and binary data encoding options;
- an extensible JRuby-based (JIRB) shell;
- support for exporting metrics via the Hadoop metrics subsystem to files or Ganglia, or via JMX.

The HBase database stores data in labeled tables. In this context [28], the table is designed to have a sparse structure: data is stored in table rows, and each row has a unique key with an arbitrary number of columns.

3.4 Parallel database systems
A distributed database (DDB) is a collection of multiple, logically interconnected databases distributed over a computer network. A distributed database management system (distributed DBMS) is the software system that permits the management of the distributed database and makes the distribution transparent to the users. A parallel DBMS is a DBMS implemented on a multiprocessor computer, [21]. The parallel DBMS implements the concept of horizontal partitioning [24] by distributing parts of a large relational table across multiple nodes, to be processed in parallel. This requires a partitioned execution of the SQL operators. Some basic operations, like a simple SELECT, can be executed independently on all the nodes. More complex operations are executed through a multiple-operator pipeline. Different multiprocessor parallel system architectures [21], like shared-memory, shared-disk or shared-nothing, define possible strategies to implement a parallel DBMS, each with its own advantages and drawbacks. The shared-nothing approach distributes data across independent nodes and has been implemented by many commercial systems, as it provides extensibility and availability.
Based on the above definitions, we can conclude that parallel database systems improve the performance of data processing by parallelizing loading, indexing and querying data.
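Horizontal partitioning and the partitioned execution of a simple SELECT can be sketched as follows; the hash-based placement and the three-node cluster are illustrative assumptions, not details from the paper:

```python
NODES = 3  # illustrative shared-nothing cluster size

def partition(rows, key, nodes=NODES):
    # Horizontal partitioning: each row of the relational table is
    # placed on exactly one node, chosen by hashing the partition key.
    shards = [[] for _ in range(nodes)]
    for row in rows:
        shards[hash(row[key]) % nodes].append(row)
    return shards

def parallel_select(shards, predicate):
    # A simple SELECT runs independently on every node; the final
    # result is just the union of the per-node results.
    result = []
    for shard in shards:
        result.extend(r for r in shard if predicate(r))
    return result

sales = [{"customer": c, "amount": a}
         for c, a in [("anna", 10), ("bob", 25), ("anna", 5), ("carl", 40)]]
shards = partition(sales, "customer")
big_sales = parallel_select(shards, lambda r: r["amount"] > 20)
print(sorted(r["customer"] for r in big_sales))  # ['bob', 'carl']
```

Because the placement is a function of the partition key, all rows for the same customer land on the same node, which is exactly the co-location property that later lets a node answer customer-level queries without transferring data between nodes.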
In distributed database systems, data is stored in different DBMSs that can function independently. Because parallel database systems may distribute data to increase the architecture's performance, there is a fine line that separates the two concepts in real implementations.
Despite the differences between parallel and distributed DBMSs, most of their advantages are common to a simple DBMS, [20], [35]:
- stored data conforms to a well-defined schema; this validates the data and provides data integrity;
- data is structured in a relational paradigm of rows and columns;
- SQL queries are fast;
- the SQL query language is flexible, easy to learn and read, and allows programmers to implement complex operations with ease;
- hash or B-tree indexes speed up access to data;
- they can efficiently process datasets of up to two petabytes of data.
Known commercial parallel databases such as Teradata, Aster Data, Netezza [9], DATAllegro, Vertica, Greenplum, IBM DB2 and Oracle Exadata have proven successful because they:
- allow linear scale-up, [21]: the system can maintain constant performance as the database size increases, by adding more nodes to the parallel system;
- allow linear speed-up, [21]: for a database with a constant size, the performance can be increased by adding more components like processors, memory and disks;
- implement inter-query, intra-query and intra-operation parallelism, [21];
- reduce the implementation effort;
- reduce the administration effort;
- offer high availability.
In a massively parallel processing (MPP) architecture, adding more hardware allows for more storage capacity and increases query speeds. An MPP architecture, implemented as a data warehouse appliance, reduces the implementation effort, as the hardware and software are preinstalled and tested to work on the appliance prior to the acquisition. It also reduces the administration effort, as it comes as a single-vendor, out-of-the-box solution. Data warehouse appliances offer high availability through built-in fail-over capabilities, using data redundancy for each disk.
Ideally, each processing unit of the data warehouse appliance should process the same amount of data at any given time. To achieve that, the data should be distributed uniformly across the processing units. Data skew is a measure that evaluates how data is distributed across the processing units: a data skew of 0 means that the same number of records is distributed to each processing unit, and it is the ideal value.
By having each processing unit do the same amount of work, all processing units finish their task at about the same time, minimizing waiting times.
Another aspect with an important impact on query performance is keeping all related data on the same processing unit, so that the time required to transfer data between processing units is eliminated. For example, if the user requires a sales-by-country report, having both the sales data for a customer and his geographic information on the same processing unit will ensure that the processing unit has all the information it needs and is able to perform its tasks independently.
The way data is distributed across the parallel database nodes influences the overall performance. Though the power of the parallel DBMS is given by the number of nodes, this can also be a drawback. For simple queries, the actual processing time can be much smaller than the time needed to launch the parallel operation. Also, nodes can become hot spots or bottlenecks and delay the entire system.

4 Proposed architecture
The proposed architecture is used to process large financial datasets. The results of the data mining analysis help economists to identify patterns in economic clusters that validate existing economic models or help to define
Informatica Economic vol. 16, no. 2/2012 123
new ones. The bottom-up approach is more efficient, but more difficult to implement, because it can suggest, based on real economic data, relations between economic factors that are specific to the cluster model. The difficulty comes from the large volume of economic data that needs to be analyzed.
The architecture, described in figure 6, has three layers:
- the input layer implements the data acquisition processes; it gets data from different sources: reports, data repositories and archives managed by governmental and public structures, economic agencies and institutions, and NGO projects; some global sources of statistical economic data are Eurostat, the International Monetary Fund and the World Bank; the problem with these sources is that they use independent data schemes, and bringing them to a common format is an intensive data processing stage; taking into consideration national data sources or crawling the Web for free data, the task becomes a very complex one;
- the data layer stores and processes large datasets of economic and financial records; this layer implements distributed, parallel processing;
- the user layer provides access to data and manages requests for analysis and reports.
The ETL intermediary layer, placed between the first two main layers, collects data from the data crawler and harvester component, converts it into a new form and loads it into the parallel DBMS data store. The ETL normalizes data, transforms it based on a predefined structure and discards unneeded or inconsistent information.
The ETL layer inserts data into the parallel distributed DBMS that implements the Hadoop and MapReduce framework. The objective of the layer is to normalize data and bring it to the common format requested by the parallel DBMS.
Using the ETL process, data collected by the data crawler & harvester gets consolidated, transformed and loaded into the parallel DBMS, using a data model optimized for data retrieval and analysis.
It is important that the ETL server also supports parallel processing, allowing it to transform large datasets in a timely manner. ETL tools like Ab Initio, DataStage and Informatica have this capability built in. If the ETL server does not support parallel processing, then it should just define the transformations and push the processing to the target parallel DBMS.

Fig. 6. Proposed architecture

The user will submit his inquiries through a front end application server, which will convert them into queries and submit them to the parallel DBMS for processing.
The front end application server will include a user friendly metadata layer that will allow the user to query the data ad-hoc; it will also include canned reports and dashboards.
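The normalize-transform-discard step performed by the ETL layer described above can be sketched as below; the source field names and the target schema are hypothetical, since the paper does not fix a record layout:

```python
# Hypothetical common format requested by the parallel DBMS.
TARGET_FIELDS = ("country", "year", "gdp_eur")

def transform(raw):
    # Normalize one crawled record to the predefined structure;
    # inconsistent or incomplete records are discarded (None).
    try:
        return {
            "country": raw["country"].strip().upper(),
            "year": int(raw["year"]),
            "gdp_eur": float(raw["gdp"]),
        }
    except (KeyError, ValueError):
        return None

def etl(extracted):
    # Load phase: keep only records that passed the transformation.
    return [r for r in (transform(raw) for raw in extracted) if r is not None]

crawled = [
    {"country": " ro ", "year": "2011", "gdp": "131.5"},
    {"country": "de", "year": "n/a", "gdp": "2609.9"},  # discarded: bad year
]
print(etl(crawled))
# [{'country': 'RO', 'year': 2011, 'gdp_eur': 131.5}]
```

In the proposed architecture this logic would run on the ETL server (or be pushed down to the parallel DBMS), applied independently to each partition of the crawled data.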
124 Informatica Economic vol. 16, no. 2/2012
The objective of the proposed architecture is to separate the layers that are data processing intensive and to link them by ETL services that will act as data buffers or cache zones. Also, the ETL will support the transformation effort.
For the end user the architecture is completely transparent; all he will experience is the look and feel of the front end application.
Based on extensive comparisons between MR and parallel DBMS, [6], [20], [24], we conclude that there is no all-scenarios good solution for large scale data analysis, because:
- both solutions can be used for the same processing task; any parallel processing task can be implemented either as a combination of queries or as a set of MR jobs;
- the two approaches are complementary; each solution gives better performance over the other in particular data scenarios, so one must decide which approach saves time for the needed data processing task;
- the process of configuring and loading data into the parallel DBMS is more complex and time consuming than setting up the MR architecture; one reason is that the DBMS requires complex schemas to describe data, whereas MR can process data in an arbitrary format;
- the performance of the parallel DBMS is given by how finely the system is tuned; the system must be configured according to the tasks that need to be completed and to the available resources;
- common MR implementations take full advantage of the reduced complexity by processing data with a simple structure, because the entire MR model is built on the key-value pair format; an MR system can be used to process more complex data, but the input data structure must be handled by a custom parser in order to obtain appropriate semantics; not relying on a commonly recognized data structure has another drawback, as data is not validated by default by the system; this can lead to situations in which modified data violates integrity or other constraints; in contrast, the SQL query language used by any parallel DBMS takes full advantage of the data schema in order to obtain a full description of the data, and the same schema is used to validate data;
- DBMSs use B-tree indexes to achieve fast searching times; indexes can be defined on any attributes and are managed by the system; the MR framework does not implement this built-in facility, so similar functionality must be implemented by the programmers, who control the data fetching mechanism;
- DBMSs provide high-level query languages, like SQL, which are easy to read and write; instead, MR uses code fragments, seen as algorithms, to process records; projects such as Pig and Hive [34] provide a rich interface based on high-level programming languages that allows programmers to reuse code fragments;
- parallel DBMSs have proven more efficient in terms of speed, but they are more vulnerable to node failures.
A decision on which approach to take must be made taking into consideration:
- performance criteria;
- the internal structure of the processed data;
- the available hardware infrastructure;
- maintenance and software costs.
A combined MR-parallel DBMS solution, [36], can be a possibility, as it benefits from the advantages of each approach.

5 Conclusion
Processing large datasets obtained from multiple sources is a daunting task, as it requires tremendous storing and processing capacities. Also, processing and analyzing large volumes of data becomes non-feasible using a traditional serial approach. Distributing the data across multiple processing units and processing it in parallel yields linearly improved processing speeds.
When distributing the data, it is critical that each processing unit is allocated the same number of records and that all related datasets reside on the same processing unit.
Using a multi-layer architecture to acquire, transform, load and analyze the data ensures that each layer can use the best of breed for its specific task. For the end user, the experi…
…Proceedings of the 2009 ACM SIGMOD International Conference, 2009.
[21] M. Tamer Özsu, P. Valduriez, Distributed and Parallel Database Systems, ACM Computing Surveys, vol. 28, 1996, pp. 125-128
[22] Seagate, Performance Considerations, http://www.seagate.com/www/en-us/support/before_you_buy/speed_considerations
[23] P. Vassiliadis, A Survey of Extract-Transform-Load Technology, International Journal of Data Warehousing & Mining, Vol. 5, No. 3, pp. 1-27, 2009
[24] M. Stonebraker et al., MapReduce and Parallel DBMSs: Friends or Foes?, Communications of the ACM, 53(1):64-71, 2010
[25] Apache HBase, http://hbase.apache.org/
[26] H. Yang, A. Dasdan, R. Hsiao, D. Parker, Map-reduce-merge: simplified relational data processing on large clusters, SIGMOD 2007, ACM, pp. 1029-1040, http://www.mendeley.com/research/mapreducemergesimplified-relational-data-processing-on-large-clusters/
[27] J. Dean, S. Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, OSDI '04: 6th Symposium on Operating Systems Design and Implementation, USENIX Association, http://static.usenix.org/event/osdi04/tech/full_papers/dean/dean.pdf
[28] X. Yu, Estimating Language Models Using Hadoop and HBase, MSc in Artificial Intelligence, University of Edinburgh, 2008, http://homepages.inf.ed.ac.uk/miles/msc-projects/yu.pdf
[29] ETL Architecture Guide, http://www.ipcdesigns.com/etl_metadata/ETL_Architecture_Guide.html
[30] W. Dumey, A Generalized Lesson in ETL Architecture, Durable Impact Consulting, Inc., June 11, 2007, http://www.scribd.com/doc/55883818/A-Generalized-Lesson-in-ETL-Architecture
[31] A. Albrecht, METL: Managing and Integrating ETL Processes, VLDB '09, August 24-28, 2009, Lyon, France, VLDB Endowment, http://www.vldb.org/pvldb/2/vldb09-1051.pdf
[32] R. Davenport, ETL vs. ELT, Insource IT Consultancy, Insource Data Academy, June 2008, http://www.dataacademy.com/files/ETL-vs-ELT-White-Paper.pdf
[33] S. Ghemawat, H. Gobioff, Shun-Tak Leung (Google), The Google File System, http://www.cs.brown.edu/courses/cs295-11/2006/gfs.pdf
[34] C. Olston, B. Reed, U. Srivastava, R. Kumar, A. Tomkins, Pig Latin: A Not-So-Foreign Language for Data Processing, SIGMOD '08, pp. 1099-1110, 2008
[35] S. Pukdesree, V. Lacharoj, P. Sirisang, Performance Evaluation of Distributed Database on PC Cluster Computers, WSEAS Transactions on Computers, Issue 1, Vol. 10, January 2011, pp. 21-30, ISSN 1109-2750
[36] A. Abouzeid, K. Bajda-Pawlikowski, D.J. Abadi, A. Silberschatz, A. Rasin, HadoopDB: An architectural hybrid of MapReduce and DBMS technologies for analytical workloads, In Proceedings of the Conference on Very Large Databases, 2009
[37] C. Boja, A. Pocovnicu, Distributed Parallel Architecture for Storing and Processing Large Datasets, 11th WSEAS Int. Conf. on Software Engineering, Parallel and Distributed Systems (SEPADS '12), Cambridge, UK, February 22-24, 2012, ISBN: 978-1-61804-070-1, pp. 125-130