Implementing Algorithmic Skeletons over Hadoop

Dimitrios Mouzopoulos

Master of Science
Computer Science
School of Informatics
University of Edinburgh
2011

Abstract
In the past few years, there has been a growing interest in storing and processing vast amounts of data that often exceed the Petabyte limit. To that end, MapReduce, a computational paradigm that was introduced by Google in 2003, has become particularly popular. It provides a simple interface with two functions, map and reduce, for developing and implementing scalable parallel applications. The goal of this project is to enhance Hadoop, the open source implementation of Google's MapReduce and its accompanying distributed file system, so that it supports additional computational paradigms. By providing more parallel patterns to the user of Hadoop, we believe that the task of dealing with specific kinds of problems becomes simpler and easier. To this end, we present our design of four Algorithmic Skeletons over Hadoop. Algorithmic skeletons are structured parallel programming models that allow programmers to develop applications over parallel and distributed systems. We implement these skeleton operations and, along with a streaming mechanism, offer them in a library of skeleton operations. Using these operations on problems that are a good fit for the abstract parallel processing pattern they encapsulate results in more concise and efficient programs.

Acknowledgements
Special thanks are dedicated to Dr. Stratis Viglas, my thesis supervisor, not only for his constant and meaningful guidance and his expert opinion on my project, but also for his thoughtful support during the development and completion of this thesis. I wish to acknowledge the work of the Apache Software Foundation and all the individuals who were involved in the implementation of Hadoop, as well as all the groups and researchers who contributed with their work to Algorithmic Skeletons. To my family, for their continuous support and encouragement during my studies.


Declaration
I declare that this thesis was composed by myself, that the work contained herein is my own except where explicitly stated otherwise in the text, and that this work has not been submitted for any other degree or professional qualification except as specified.

(Dimitrios Mouzopoulos)


Table of Contents

1 Introduction
  1.1 Motivation
  1.2 Objectives
  1.3 Structure of the Report

2 On Algorithmic Skeletons and MapReduce
  2.1 Algorithmic Skeletons
  2.2 MapReduce
    2.2.1 Hadoop
  2.3 Skeletons over Hadoop
  2.4 Related Work
    2.4.1 Overview of Algorithmic Skeletons
    2.4.2 Related work for MapReduce
  2.5 Summary

3 Background
  3.1 Hadoop's MapReduce Implementation
    3.1.1 Setting up a MapReduce job
  3.2 Parallel For
  3.3 Parallel Sort
  3.4 Parallel While
    3.4.1 Condition in Parallel While
  3.5 Parallel If
    3.5.1 Condition in Parallel If
  3.6 Summary

4 Design
  4.1 Designing Parallel Skeletons over Hadoop
  4.2 Designing Parallel For over Hadoop
  4.3 Designing Parallel Sort over Hadoop
  4.4 Designing Parallel While over Hadoop
  4.5 Designing Parallel If over Hadoop
  4.6 Designing a Streaming API for Parallel Skeletons over Hadoop
  4.7 Summary

5 Implementation
  5.1 General Guidelines for Implementing Algorithmic Skeletons over Hadoop
  5.2 Implementing Parallel For over Hadoop
  5.3 Implementing Parallel Sort over Hadoop
  5.4 Implementing Parallel While over Hadoop
  5.5 Implementing Parallel If over Hadoop
  5.6 Implementing a Streaming API for Parallel Skeletons over Hadoop
  5.7 Summary

6 Evaluation
  6.1 Information regarding the process of Evaluation
    6.1.1 Environment of Evaluation
    6.1.2 Input
  6.2 Execution Time
    6.2.1 Evaluation of For
    6.2.2 Evaluation of Sort
    6.2.3 Evaluation of While
    6.2.4 Evaluation of If
    6.2.5 Evaluation of the Streaming API
  6.3 Level of Expressiveness
  6.4 Summary

7 Conclusion
  7.1 Summary
  7.2 Challenges
  7.3 Lessons Learned
  7.4 Future Work

A Setting up Hadoop so that it supports the implemented skeletons
  A.1 How to use the Skeletons library
    A.1.1 Alternative methods
  A.2 Examples of setting up Skeleton jobs
    A.2.1 Example of setting up a For job
    A.2.2 Example of setting up a Sort job
    A.2.3 Example of setting up an If job
    A.2.4 Example of setting up a Streaming job

B The API of the new package
  B.1 API of Skeleton Job
  B.2 API of For
  B.3 API of Sort
  B.4 API of While
  B.5 API of If
  B.6 API of Streaming

Bibliography


List of Figures

2.1 The MapReduce process in the form of a Diagram. [3]
2.2 The infrastructure of Hadoop. [4]
4.1 The Design of Algorithmic Skeletons over Hadoop.
6.1 Comparison between Skeleton For and MapReduce
6.2 Comparison between Skeleton Sort and MapReduce
6.3 Comparison between Skeleton While and MapReduce
6.4 Comparison between Skeleton If and MapReduce
6.5 Comparison of the Streaming, Non-Streaming and MapReduce implementations


List of Tables

5.1 Most Important methods of the SkeletonJob API
5.2 Most Important methods of the For API
5.3 Most Important methods of the Sort API
5.4 Most Important methods of the While API
5.5 Most Important methods of the If API
5.6 Most Important methods of the Streaming API
6.1 Execution time of the For and the equivalent MapReduce implementation.
6.2 Execution time of the Sort and the equivalent MapReduce implementation.
6.3 Execution time of the While and the equivalent MapReduce implementation.
6.4 Execution time of the If and the equivalent MapReduce implementation.
6.5 Execution time of the Streaming implementation.
6.6 Comparison between MapReduce and Skeletons against lines of code.
B.1 SkeletonJob API
B.2 For API
B.3 Sort API
B.4 While API
B.5 If API
B.6 Streaming API


Chapter 1 Introduction
In the past few years there has been a growing interest in parallel and distributed computing, mainly due to the vast amounts of data produced. Most of the time, data is organized and stored in structured clusters of computers that may not even be at the same site. In any case, however, it needs to be processed in a quick and efficient way by various applications and for various reasons. Algorithmic Skeletons [1] are an approach in which the complexity of parallel programming is abstracted away through a library of skeleton operations. Each skeleton captures a particular pattern of computation and interaction. Thus, it provides an interface for each pattern to the programmer without exposing the implementation of the pattern itself. This results in a far less complex and more efficient way to write programs in a parallel or a distributed environment. However, algorithmic skeletons have not really been used broadly to this end, especially by commercial companies, but have rather remained more of an academic concept. The MapReduce framework [2] can be considered an exception to the above. It was developed by Google in 2003 for processing large data sets and has grown in popularity since. While inspired by functions commonly used in functional programming, the MapReduce framework is not identical to these functions. MapReduce is used for processing petabytes of data stored across a large number of nodes organized over a distributed file system like Google's GFS or Hadoop's HDFS. A natural question that follows is why not have additional skeleton operations implemented over a MapReduce system like Hadoop and, more importantly, whether or not something might be gained from doing so.


1.1 Motivation

The main reason why a significant number of skeleton operations exist is that each performs better and more efficiently on specific kinds of problems. For instance, when we have lines of numbers as input and we need to compute the average of each line, an algorithmic skeleton like Divide and Conquer is not useful at all, and others like MapReduce may prove inefficient due to redundant computation. In such a case a single Map or For operation will perform far more efficiently. In other words, an algorithmic skeleton performs better at a problem the more naturally that problem fits the abstract pattern of the skeleton. Hadoop is an open source implementation of Google's MapReduce system and, in addition to other things, it offers a way to store large data sets in a distributed setting and schedule MapReduce jobs over them. Finding a way to offer additional parallel programming models over Hadoop, other than MapReduce, will help the user write far more efficient programs for certain kinds of problems. Moreover, another purpose of this project is to do this in a more user-friendly way than the original interface of MapReduce. Setting up a MapReduce job requires knowledge of the framework, and thus it would surely be useful if certain details could be hidden from the user when dealing with skeleton jobs. After all, algorithmic skeletons are all about hiding complex information from the user and providing him or her with a simple API with which he or she can write programs that run in parallel, without knowledge of the underlying infrastructure and of the process of actually setting up and configuring the job that is submitted to the framework.

1.2 Objectives

The key idea behind this project is to implement a selection of algorithmic skeletons over Hadoop. The main concept is to provide the users of Hadoop with more options regarding the way they can organize and parallelise the computations they need to perform over Big Data, beyond what the MapReduce programming framework offers. This will result in an improved data processing model with more capabilities. More specifically, four skeletons have been implemented: For, If, Sort and While. Additionally, an API for setting up streamed skeleton and/or MapReduce jobs as a pipeline has also been implemented. Implementing structured algorithmic skeletons over a MapReduce system like Hadoop is not a trivial task. In order for this to be carried out, the implementation of MapReduce over Hadoop and HDFS must be examined. This will be the guide for designing and implementing additional skeleton patterns over Hadoop. As a result, before attempting to design the skeletons, comprehending the way MapReduce is organized and implemented may prove more than useful.

1.3 Structure of the Report

This chapter aimed to provide the reader with an understanding of the scope of this project. Moreover, it provided a description of the project objectives and motivation. Chapter 2 provides information on related work within the field of algorithmic skeletons and MapReduce. In Chapter 3, we give more details concerning the background of the concepts related to this project. More specifically, we give a short description of the Hadoop software framework. We also present the package org.apache.hadoop.mapreduce, which is the newest package of Hadoop that realises MapReduce. Moreover, we introduce the reader to the parallel algorithmic skeletons (For, Sort, While and If) which we have implemented for the purposes of this project. In Chapter 4 we describe the main ideas behind the design of the four skeletons over Hadoop and the interface for streaming Hadoop jobs, along with alternative approaches and the reasons behind the implementation choices. What is more, in this chapter we mention issues that need addressing together with ways of dealing with them. Moreover, in Chapter 5 we present a detailed description of the implementation. In this description, low level details of the implementation, how issues were dealt with, what is offered to the user of Hadoop and more are thoroughly presented. Chapter 6 is all about evaluation. More specifically, we present the metrics that are used for evaluating the implementation of the four skeleton operations and the streaming API, and the results, along with relevant comments. Finally, Chapter 7 concludes this thesis and this project. It offers the opportunity to summarize what was achieved and what is finally offered to the user of Hadoop. Information regarding possible future work is also given.

Chapter 2 On Algorithmic Skeletons and MapReduce


This project deals with a variety of computational concepts and systems. It is more than necessary to introduce the reader to these concepts and outline the overall scope of our project. A significant amount of related work has already been carried out regarding both Algorithmic Skeletons and MapReduce, but none so far, to the best of our knowledge, has tried to provide a library with a few skeleton operations over a software framework that implements MapReduce.

2.1 Algorithmic Skeletons

Algorithmic skeletons are structured parallel programming models that allow programmers to develop applications over parallel and distributed systems. The main idea behind them is to provide the user with an abstract framework for programming in parallel and distributed settings, where many details regarding the underlying architecture of the system, but also the implementation of the parallel pattern itself, remain hidden. What separates Algorithmic Skeletons from many other parallel programming models is that synchronization between the different tasks is defined by the skeleton itself and the programmer need not worry about it. They were introduced by Murray Cole in 1989 [1] as a way to offer the programmer a framework that appears non-parallel to him while its execution takes place in parallel. Cole presented four initial skeletons: divide and conquer, iterative combination, cluster and task queue. Afterwards, other research groups proposed additional algorithmic skeletons and developed many algorithmic skeleton frameworks based on different techniques such as functional, imperative and object-oriented languages.

2.2 MapReduce

MapReduce [2] can be described as a parallel programming model that serves for manipulating vast amounts of data. It has become rather popular over the past few years, mainly because it provides a simple interface with two functions for developing and implementing scalable parallel applications. Perhaps the major reason behind the great success of MapReduce is that it supports auto-parallelization of programs on large clusters of commodity machines. Moreover, its fault tolerance and scalability are invaluable when the size of the data that needs processing grows larger and larger. In essence, MapReduce is just one of many algorithmic skeletons (a combination of the skeleton Map and the skeleton Reduce), implemented over a distributed infrastructure.

Figure 2.1: The MapReduce process in the form of a Diagram. [3]

However, MapReduce comes with certain shortcomings, and the primary reason for this is that its original purpose was not performing structured data analytics. What is more, this model does not support many features that would be useful for developers. Assertions have been made regarding the limitations, and thus the breadth, of problems that MapReduce can be used for. From a software engineering point of view, MapReduce systems lack the features that other parallel algorithmic structures can offer to programmers.

2.2.1 Hadoop

Hadoop is a software framework that supports storing and processing large amounts of data. Basically, Hadoop can be considered the open source realisation of Google's MapReduce system and is used by a number of companies and organizations like Yahoo and Facebook. It consists of a distributed, scalable, and portable file system (HDFS), in which petabytes of data can be stored, and an implementation of the computational paradigm of MapReduce. HDFS has a master/slave architecture, similar to that of Google's GFS. An HDFS cluster consists of a single NameNode, which is the master server and is responsible for managing the file system namespace and providing clients with access to the files stored in the cluster. Additionally, a number of DataNodes exist, usually one per node in the cluster, which manage the storage attached to the node or nodes that they run on.

Figure 2.2: The infrastructure of Hadoop. [4]

HDFS is designed to store and handle files larger than a few Gigabytes or even Petabytes, across the machines of a cluster. In addition, the files are replicated so as to offer a high level of fault tolerance. Every file is split into blocks of equal size, except the last block, and is then stored as a sequence of blocks. The files are written only once and by only one writer at any time. It has to be noted that the block size and the replication factor can be configured separately for every file.
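As an aside, these two parameters really are per-file settings rather than cluster-wide constants. The following is a minimal sketch (the path and values are hypothetical, not taken from the thesis) of how a client can choose both when creating a file through the FileSystem API.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PerFileSettings {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Hypothetical file: two replicas and 128 MB blocks for this file only.
        Path file = new Path("/user/demo/large-input.txt");
        short replication = 2;
        long blockSize = 128L * 1024 * 1024;
        FSDataOutputStream out = fs.create(file, true, 4096, replication, blockSize);
        out.writeBytes("example record\n");
        out.close();
    }
}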


Hadoop Common provides access to the file system supported by Hadoop. In short, the Hadoop Common package contains all the necessary JAR files and scripts needed to start Hadoop. The package also provides source code, documentation, and a contribution section that includes projects from the Hadoop community. The MapReduce engine runs on top of the file system. It consists of one Job Tracker and multiple Task Trackers. When a user wants to submit a MapReduce job to the framework, his or her client submits it to the Job Tracker. The Job Tracker then pushes the necessary work to the available Task Tracker nodes. It is of the highest priority to move the computation to the data and not the other way around. This depends heavily upon the Job Tracker, which knows which node contains the data and the nearby machines (it is preferable to use the node where the data resides or, if this is not possible, machines of the same rack). This results in reduced network traffic and more efficient jobs. By default Hadoop uses a first-in-first-out principle to schedule jobs from a work queue. In version 0.19 of Hadoop the job scheduler was refactored out of Hadoop, which added the ability to use an alternate scheduler. Companies that use Hadoop took this opportunity to develop their own schedulers that better suit their needs. Most notably, Facebook developed the Fair scheduler and Yahoo the Capacity scheduler.

2.3 Skeletons over Hadoop

Hadoop's infrastructure can be used for developing more algorithmic skeletons over it. Providing developers with more parallel programming models, in a framework that is widely used, will of course strongly enhance it, but it will also allow programmers to implement applications that cover a broader range of problems and needs. This will ultimately result in a more expressive programming framework with more capabilities. There is great research interest in MapReduce as it is a rather new technology and there is ground to further explore and optimize it. There are many papers and projects that focus on enhancing MapReduce systems. These attempts are mainly focused on the open source MapReduce system, Hadoop. This project aims at enhancing Hadoop by offering the user additional parallel programming models other than MapReduce. More specifically, four algorithmic skeletons are to be designed and implemented over Hadoop and HDFS, which will co-exist with MapReduce, along with a streaming mechanism for MapReduce and/or Skeleton jobs. The programmer will be able to choose among these programming models to implement his application, depending on the needs of the task he has at hand.

2.4 Related Work

2.4.1 Overview of Algorithmic Skeletons

Regarding Algorithmic Skeletons, there has been quite a lot of research and work at an academic level. For various reasons that are beyond the scope of this project, companies have not really taken an interest in using them in their applications. Most of the work that has been done is focused on providing programming frameworks that implement a number of algorithmic skeletons with generic parallel functionality, which can be used by the user to implement parallel programs [5]. There are three types of algorithmic skeletons: data-parallel, task-parallel and resolution [5]. Data-parallel skeletons operate on data structures. Task-parallel skeletons work on tasks and their functionality heavily depends on the interaction between the tasks. Resolution skeletons represent an algorithmic way of dealing with a given group of problems. Many frameworks have been developed that provide sets of algorithmic skeletons to users. These algorithmic skeleton frameworks (ASkF) can be split into four categories according to their programming paradigm:
- Coordination ASkFs
- Functional ASkFs
- Object-oriented ASkFs
- Imperative ASkFs
ASSIST [6] can be classified as a coordination ASkF. Parallel programs are expressed as graphs of software modules by using a structured coordination language. The execution language is C++ and, while it supports type safety, it does not support skeleton nesting. The skeletons that are offered in ASSIST are seq and parmod. Skandium [7] is another ASkF that supports both data-parallel and task-parallel skeletons. It is a re-implementation of an older ASkF, Calcium, with multi-core computing in mind. The execution language is Java and both type safety and skeleton nesting are supported. For more information regarding the numerous ASkFs that have been developed, [5] provides a very good description of the most important ones, along with references to papers for more information.

2.4.2 Related work for MapReduce

Even though MapReduce systems are a relatively new idea (or, perhaps better, a new implementation of an old idea), there is a growing interest in them and a lot of effort from research groups aims towards enhancing them. The majority of these attempts are built on top of Hadoop due to the fact that it is the most popular open source realization of MapReduce systems. Hive is a data warehousing application in Hadoop. It was originally developed by Facebook a few years ago but is now open source [8], [9]. Hive organises the data in three ways that are analogous to well-known database concepts: tables, partitions and buckets. Hive provides a SQL-like query language called HiveQL (HQL) [8], [9]. HQL supports project, select, join, union, aggregate expressions and subqueries in the from-clause, like SQL. Hive translates HQL statements into a syntax tree. This syntax tree is then compiled into an execution plan of MapReduce jobs. Finally, these jobs are executed by Hadoop. It can be concluded that with Hive, the developer has in his possession a declarative query language which is close to SQL and supports quite a few functionalities that are essential for many data analytic jobs and are pretty repetitive [8], [9]. Pig, on the other hand, is a large scale dataflow system that is built on top of Hadoop. The idea behind Pig is similar to that of Hive. Pig programs are parsed and compiled into MapReduce jobs, which are then executed by the MapReduce framework on the cluster [10]. A Pig program goes through a number of intermediate stages before execution. First of all, it is parsed and checked for errors. A logical plan is produced, which is then optimized and compiled into a series of MapReduce jobs, after which it passes another optimization phase. Finally, the jobs are sorted and submitted to Hadoop for execution [10]. The programs that are given as input to Pig are written in a specifically designed script language, Pig Latin. Pig Latin is a script programming language in which the user specifies a number of consecutive steps that implement a specific task. Each step is equivalent to a single, high level data transformation. This is different from the declarative approach of SQL, where only the constraints that define the final result are declared. As Pig Latin was developed with processing web-scale data in mind, only the parallelisable primitives were included in it [11].


Another extension of MapReduce is Twister [12], which aims to make MapReduce suitable for a wider range of applications. What the runtime of Twister offers compared to similar MapReduce runtimes is the ability to support iterative MapReduce computations. Being a runtime itself, it is distinguished from Hadoop as the infrastructure is different but, most importantly, the programming model of MapReduce is extended in a way that supports broadcast and scatter type data transfers. All the above make Twister far more efficient for iterative MapReduce computations. Although the architecture of Twister is different from that of Hadoop, on top of which this project is built, the way in which the programming model of MapReduce is extended using communication concepts from parallel computing, like scatter and broadcast, so as to support iteration is quite interesting.

2.5 Summary

MapReduce has proven to be an extremely powerful tool for analysing and processing vast amounts of data. Hadoop, its open source realisation, is used by a significant number of companies and it is no coincidence that large corporations like Yahoo, Amazon and Facebook, among others, use it for storing and analysing data. However, Hadoop offers a complex and detailed API for writing MapReduce jobs. It is evident that if a programmer wants to write a program to perform simple computations over large data sets, he has to write long, complex programs of many lines. It comes as no surprise that many projects that aim to offer a higher level API to facilitate large-data processing are under development or have been developed in the past few years. On the other hand, algorithmic skeletons have been known for being provided in various frameworks as libraries (most of the time). Their outstanding feature, that the synchronisation of the parallel activities is implicitly defined by the abstract skeleton patterns, helps programmers write parallel programs in an easier but, most importantly, sequential way. It is no surprise that providing a library of skeleton operations other than MapReduce over Hadoop may well prove valuable for its users. Especially if the provided skeleton operations are offered at a higher level than MapReduce, hiding many of its details, this will undoubtedly lead to a more expressive and easier to configure software framework.

Chapter 3 Background
Having provided all the necessary information regarding Algorithmic Skeletons, MapReduce and MapReduce's open source realization, Hadoop, more details should be given about the scope of this project. This project aims to implement a number of skeleton operations and ultimately offer a library of skeleton operations to the user of Hadoop. To this end, a number of matters should be discussed. First and foremost, we must describe the package that implements MapReduce over Hadoop. The skeletons library will use this package to offer, in a sense, a level of indirection. In essence, when a skeleton operation is used, a MapReduce job will run underneath. This makes the study and understanding of the package more than important. Moreover, the algorithmic skeletons that will be implemented are presented thoroughly. Comprehending the pattern of each skeleton is the first step for designing and implementing them on top of any software framework. The reader needs to understand the pattern that every single one of them offers, if he is to move on to the following chapters, which give more details regarding far more complex aspects of the design and the implementation.

3.1 Hadoop's MapReduce Implementation

Before moving on to the description of how MapReduce is organised and functions in Hadoop [13], we should note that currently two different packages exist which provide the user of Hadoop with the parallel algorithmic framework of MapReduce: mapred and mapreduce. The differences between these two packages are beyond the scope of this report. However, it must be brought to attention that mapreduce is the newest implementation and is meant to completely replace mapred in the following releases.

As a result, the implementation of the skeleton operations is based on the most recent package of MapReduce. It is due to this fact that we give a description of the package mapreduce [14]. This description will aid the reader in his comprehension of how the system works and will ultimately lead to a better understanding of the implementation of the parallel algorithmic skeletons, as classes of the mapreduce API are used. A MapReduce job has two phases of operation. Firstly, the input data is split into chunks which are processed by the Mappers in parallel. The output of the Mappers is sorted, grouped and then used as input for the Reducers. It should be noted that both the input and the output of a MapReduce job are stored in the distributed file system of Hadoop (HDFS), whereas the intermediate results of the Mappers are stored in the local file system of the Mappers. The framework takes care of scheduling tasks, monitoring them and re-executing the failed ones. Now, let us take a closer look at the implementation. The MapReduce framework operates exclusively on (Key, Value) pairs, that is, the framework views the input to the job as a set of (Key, Value) pairs and produces a set of (Key, Value) pairs as the output of the job, conceivably of different types. The main classes of the API are Job, Mapper and Reducer. The Job class is an extension of the class JobContext. It allows the user to configure and submit the job and it offers him the ability to check and control the state of the execution. To this end, it contains certain set methods with which a user can configure the MapReduce job before it is submitted. Moreover, the classes Mapper and Reducer provide the API for creating the functions that realize the map and the reduce stages. They both contain an internal class Context that extends MapContext and ReduceContext respectively. Its purpose is to provide the context to both the Mapper and the Reducer. A user who wants to create a MapReduce job needs to create a new instance of the class Job. Moreover, he has to create two new classes that extend the existing classes Mapper and Reducer. Additionally, in the extended Mapper class he needs to override the method map according to the task he has to deal with. In the extended Reducer class the user overrides the method reduce depending on his needs. Finally, the programmer has to configure the Job instance he created with the two extended classes and submit the job.


3.1.1 Setting up a MapReduce job

Here is an example of a program that creates, configures and submits a MapReduce Job for counting the number of occurrences of each individual word.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {

  // The Mapper tokenizes each input line and emits a (word, 1) pair per token.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // The Reducer sums the counts emitted for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length != 2) {
      System.err.println("Usage: wordcount <in> <out>");
      System.exit(2);
    }
    Job job = new Job(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

3.2 Parallel For

Generally speaking, Parallel For [5], [7] represents finite iteration. The algorithmic skeleton of Parallel For is used when a user wants to do some work on all the elements of a specific input. The input needs to be partitioned first so that the work can be parallelized. In essence, the work is done on the different partitions of the data in parallel. A piece of pseudo-code follows that describes the pattern of Parallel For.


int i = ...;
Skeleton<P, R> nested = ...;
Skeleton<P, R> forSkel = new For<P, R>(nested, i);

3.3 Parallel Sort

Sorting is a well known concept in Computer Science. A significant number of algorithms have been introduced in the past decades, each of them serving its own purposes and designed for different needs. Hadoop, as an open source MapReduce system, deals with massive data sets that are stored on the various nodes of the cluster. It is rather obvious that a user of the cluster will need to sort the data for his own purposes. For instance, it may well be the case that the user wants to find the maximum value of the data sets or sort some Text entries alphabetically. As mentioned above, a large number of algorithms exist for sorting. As far as parallel sorting [15] is concerned, the most famous categories of sorting are the Bucket Sorts, the Exchange Sorts and the Partition Sorts. In the context of parallel sorting in Hadoop, it can be said that we need not deal with any type of parallel sorting algorithm. In a MapReduce job the intermediate results of the Mappers are sorted according to the value of the key. It is only natural that, when designing a sort skeleton over Hadoop, this feature of MapReduce is going to be used. It should be pointed out that an implementation of sorting currently exists in the Hadoop examples jar file. Nonetheless, we designed a different one for the purposes of this project so as to offer the additional parallel skeleton of Sort. The main difference is that the skeleton Sort we implemented targets a more specific group of problems; it produces one sorted output file. Another difference between the two implementations is that the parallel sorting included in the examples jar file is based on the old implementation of MapReduce, mapred, whereas the implementation described in the following chapters is based on the new mapreduce package.

3.4 Parallel While

Parallel While [5], [7] represents conditional iteration, where a function (or possibly another skeleton) is applied to the data while a condition holds. This condition may or may not relate to the value that is read from the input. In Parallel While it may well


be the case that the condition is checked more than once. While the condition is true, a function is applied to the input data. However, the user may require a different action to be performed when the condition returns false, which means that the loop has been exited. A piece of pseudo-code follows that describes the pattern of Parallel While.

Condition<P> condition = ...;
Skeleton<P, R> nested = ...;
Skeleton<P, R> whileSkel = new While<P, R>(nested, condition);

From the pseudo-code it becomes obvious that the function or skeleton nested is executed while the condition is true. The data is partitioned and the processing takes place in parallel for each shard of input data. Perhaps the most important feature of Parallel While (and of Parallel If, which follows in the next section) is the definition of the Condition, for which we give details in the following sub-section.

3.4.1 Condition in Parallel While

It is evident that the condition that is to be checked in Parallel While could relate to the value read, but it could also be independent of it. Whichever the case, the condition will most likely be checked again and the data may have changed. As a result, a way is needed for storing the results of the nested function and providing them during the next step of the loop. The fact that the condition is checked more than once in the majority of cases is perhaps the key aspect that will guide the design and the implementation of this skeleton over Hadoop. It is only natural that, in the context of providing a more expressive framework that is also simple for the programmer to use, an abstract concept of a condition will be provided, which will be up to the user to implement according to his specific needs.
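To make this concrete, the following is a minimal sketch of what such a user-implemented condition could look like; the interface name, method signature and example class are illustrative assumptions, not the exact API offered by the library.

// Illustrative sketch: an abstract condition the user implements for Parallel While.
public interface Condition {
    // Decide whether another iteration should run for the current value.
    // The optional arguments can carry extra state, e.g. a threshold or an
    // iteration counter kept by the framework between steps.
    boolean evaluate(String currentValue, String[] args);
}

// Example: keep iterating while the numeric value is still below a threshold.
class BelowThreshold implements Condition {
    public boolean evaluate(String currentValue, String[] args) {
        double threshold = Double.parseDouble(args[0]);
        return Double.parseDouble(currentValue) < threshold;
    }
}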

3.5 Parallel If

Parallel If [5], [7] can be described as conditional branching, where the choice of which computation to apply on what subset of the data is solely based on a condition specified by the user. In essence, for each data set of the input a condition is checked. This condition can be either about the data set itself or even independent of it, according to


the needs of the user. In any case, the input is split into various shards and, for the values contained in each shard, it is deduced whether the condition is met or not. Depending on the outcome of this action (the result is either true or false), a different function is applied to the data set. Below is a simple and abstract description of this algorithmic skeleton.

Condition<P> Condition = ...;
Skeleton<P, R> TrueCase = ...;
Skeleton<P, R> FalseCase = ...;
Skeleton<P, R> If = new If<P, R>(Condition, TrueCase, FalseCase);

The parallelization of this model heavily depends on the fact that the input data is partitioned and the processing then occurs in parallel. The function TrueCase is executed if Condition returns true, whereas FalseCase is executed if Condition returns false. As per the description, it is possible that these two functions are themselves other skeletons, leading to skeleton nesting. What is more, as in the previously described skeleton (Parallel While), the Condition is of great importance and more details should be given regarding it.

3.5.1 Condition in Parallel If

The condition in Parallel If is quite similar to the one defined in Parallel While. Of course, an important difference exists that distinguishes them. As we saw before, the condition in Parallel While is, by definition, checked more than once. After all, While in its essence provides the notion of a loop. As a result, we may have to check the condition numerous times. On the other hand, in Parallel If the condition is checked strictly once. Depending on the outcome, a different function (or skeleton) is executed. This constitutes a major difference between the two skeletons: in Parallel While a single computation is executed a number of times (one or more), while in Parallel If one of two possible computations is executed exactly once. In the following chapter, we will describe how all these differences affect the design of these skeletons over the framework of Hadoop.


3.6 Summary

This chapter introduced the reader to the implementation of MapReduce over Hadoop, along with the four skeleton operations that are to be included in the skeletons library. Understanding the MapReduce package of Hadoop is extremely important, as we will design the skeletons in a similar way. Furthermore, as it is our purpose to offer the skeleton operations at a higher level, the MapReduce package will be used underneath. In essence, one of the goals of this project is to offer another level of indirection above MapReduce in the context of providing a library of algorithmic skeletons. The ultimate goal of this project, after all, is to enhance Hadoop so as to provide a more expressive software framework for processing large amounts of data, and both the implementation of the algorithmic skeletons and the streaming API serve this end.

Chapter 4 Design
4.1 Designing Parallel Skeletons over Hadoop

When providing another parallel algorithmic model over Hadoop, a way must be found to offer the programmer a new API that implements the new algorithmic skeleton using the existing API of MapReduce. A number of approaches exist that can be used for tackling this specific problem. For example, one may attempt to build an individual package that implements the algorithmic skeleton. Such a package would need to communicate with the distributed file system underneath and contain a number of low level methods to this end. This approach would require a better understanding of the whole Hadoop system. One may argue that the package of MapReduce already contains ways of communicating with HDFS. What is more, its functionality supports configuring, executing and monitoring MapReduce jobs. As a result, using parts of the existing functionality of MapReduce, and masking parts of it according to the task at hand, may prove an easier but, more importantly, a far more efficient way of implementing another parallel algorithmic framework over Hadoop. In addition, by following this specific approach we can further hide many of Hadoop's details, resulting in skeleton operations that offer a higher level of interaction with the framework. In a way, the user will have not only another programming model but also a less complex and easier to use one. This was the key idea which led our implementation. More specifically, after careful inspection of the MapReduce API and implementation we deduced a methodology for offering different parallel models using classes of the MapReduce API. As mentioned in the previous chapter, the main classes of the existing API are the following: Job, Mapper and Reducer. The class Job instantiates a MapReduce Job and has


complete control over it. What is required from the new API is a class that will extend the existing class Job, offer some of Job's functionality to the user, but coordinate a number of things internally. It is important that the new extended class fits the new model accordingly. This raises the question of which functionalities are to be hidden from the user. The answer to this question depends heavily on the specific parallel skeleton that is to be implemented. In the following sections, we are going to give more details regarding the concepts behind the implementation of the algorithmic skeletons that we implemented over Hadoop. An indicative design of Algorithmic Skeletons over Hadoop is shown in Figure 4.1. In the context of this project, we followed this abstract design in order to implement the Algorithmic Skeletons we chose over Hadoop. This figure will provide the reader with a better understanding of the methodology we followed for accomplishing our objectives.

Figure 4.1: The Design of Algorithmic Skeletons over Hadoop.
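To illustrate the idea, the following sketch shows how such an extended job class could look; the class name, constructor and method are our own assumptions for illustration, not necessarily the exact API of the implemented library.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrative sketch: a skeleton job that extends Hadoop's Job, fixes the
// configuration that every skeleton job shares and hides the rest.
public class SkeletonJob extends Job {

    public SkeletonJob(Configuration conf, String name) throws IOException {
        super(conf, name);
        setNumReduceTasks(0);           // skeletons such as For need no reduce phase
        setOutputKeyClass(Text.class);  // a single, generic text-based output
        setOutputValueClass(Text.class);
    }

    // The user only supplies the worker class; everything else stays internal.
    public void setWorker(Class<? extends Mapper> worker) {
        setMapperClass(worker);
    }
}

A class of this kind exposes only what the specific skeleton needs, which is exactly the question of which functionalities are to be hidden that was raised above.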

4.2 Designing Parallel For over Hadoop

Parallel For is the application of a function to all the data elements of the input. This fact distinguishes it from MapReduce on certain key points. Firstly, there is no need for a reduce phase. As only a certain computation is applied to the input, there is no sense in having an additional stage that will not perform any computation of any kind. Moreover, in a Parallel For no grouping or sorting should be performed in the output of


the job. It needs to be noted that in MapReduce the output of the map phase is sorted and grouped according to the key of the (Key, Value) pairs it produces as output. This needs to be avoided. Furthermore, the output of the Mappers is usually written to the local file system of the nodes; we require that the output of the job is written to HDFS. Lastly, the fact that the MapReduce framework operates on pairs must be taken into consideration. A Mapper takes as input a (Key, Value) pair and produces another (Key, Value) pair as output. In Parallel For we must find a way to hide the pairs and present the user with an easy way to handle and manipulate the input. Having spotted the key differences between a MapReduce job and a Parallel For job, we can now proceed to designing the new parallel model using the existing one. In this part of the report we underline the ways that are best fit for dealing with the issues mentioned in the previous paragraph. To begin with, we need to remove the reduce phase from the new job class, an extension of the existing Job class, that we are to create. This can be done internally by setting the number of reduce tasks to zero and offering an API in which only a Mapper (soon to be renamed) can be declared by the user. Even though Parallel For's functionality is close to that of a Map, we need to offer a new class. This class will be an extension of the Mapper and it will hide the input and output pairs, offering the user a single input and output. Furthermore, the data types of Hadoop are of no use in the context of Parallel For and thus both the input and output will be manipulated as String data types. Additionally, by specifying zero reduce tasks, the output of the map phase is written to HDFS instead of the local file system of the nodes. The final obstacles that need to be dealt with are the grouping and sorting of the map output. Thankfully, the package of MapReduce has been designed in such a way that when no reduce tasks exist, the sorting and grouping of the map phase is omitted. Thus, all the issues have been dealt with. In Chapter 5, more details regarding the implementation of the new API are going to be presented.
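As a rough illustration of these choices, the sketch below (the class name and method are illustrative assumptions, not the thesis's exact API) shows a For-style worker that extends Mapper, hides the key/value pairs and exposes a single String-to-String function to the user.

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrative sketch: a For worker that wraps Mapper and exposes one
// String-in/String-out function.
public abstract class ForWorker extends Mapper<Object, Text, Text, Text> {

    // The only method the user implements: process one input record.
    public abstract String doWork(String input);

    @Override
    protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // Hide the (Key, Value) pair: hand the value over as a plain String
        // and write the result back with an empty value.
        String result = doWork(value.toString());
        context.write(new Text(result), new Text(""));
    }
}

With zero reduce tasks configured by the enclosing job, the output of each worker is written directly to HDFS without being sorted or grouped.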

4.3 Designing Parallel Sort over Hadoop

Implementing Parallel Sort over Hadoop raises a number of issues that need to be addressed. First of all, all common data types should be supported. More specifically, the skeleton offered to the user of Hadoop should be able to sort integers, doubles, floats, strings and long integers. Moreover, the sorting implementation must be able to perform sorting both in descending and in ascending order. Recall that after the map phase,


the intermediate results are sorted in ascending order. This natural sorting of MapReduce is used for developing the Sort skeleton, but it is apparent that, when sorting in descending order is wanted, this feature of MapReduce needs tweaking. Finding the best possible ways of dealing with these issues ultimately leads to the best fitting implementation of Parallel Sort over Hadoop. To begin with, there are two approaches for the skeleton to support five data types. The dynamic approach is for the program to dynamically determine the data type and select the relevant Mapper and Reducer that support that specific data type. The other approach can be considered static, as the main idea is to have separate pairs of Mappers and Reducers already predefined for each data type. By entering a parameter, the user can define the type of the data that is to be sorted. The final obstacle for designing an efficient and complete Sort skeleton is to support sorting in descending order besides sorting in ascending order. For the implementation to support descending order we need to find a way to reverse the sorting of the intermediate results. In order for us to accomplish such a thing, we have to look in the MapReduce package, deduce how the sorting occurs, and then try to add the functionality of sorting in descending order. Once these two issues are dealt with, all we have to do is create a class that the user can use to execute a sort over an input of his own. A MapReduce job must be internally set up and configured with the appropriate parameters. This MapReduce job will use the fitting pair of the Mappers and Reducers that are already predefined and perform the sorting. The user will have the option to specify the output order. The predefined Mappers and Reducers need only perform their default functionality, which is to read the input pair and write it. The sorting occurs in the intermediate phase, performed by the framework itself. What should be noted is that the Mappers should isolate the value upon which the sorting will occur and use it as the key of their own output pair. An important factor to take into consideration is that many sort algorithms already exist over Hadoop, so why implement a new one, or even why use the skeleton Sort instead of another? The answer to these questions is that the operation of the skeleton Sort results in a single sorted output file, no matter what the input is or how many reducers there are. As a result, a final note is that in our implementation we must guarantee that the output of the skeleton Sort will reside in a single file. These details conclude the design of the parallel algorithmic skeleton of Sort over Hadoop.
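For the descending case, one standard way to tweak the framework's natural ordering is to plug a reversed comparator into the job; the sketch below (class name ours, long integer keys assumed) illustrates the idea.

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// Illustrative sketch: invert the natural ascending ordering that MapReduce
// applies to intermediate keys, giving a descending sort for long integers.
public class DescendingLongComparator extends WritableComparator {

    public DescendingLongComparator() {
        super(LongWritable.class, true);  // true: create key instances for comparison
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        return -super.compare(a, b);      // flip the comparison result
    }
}

// Inside the Sort skeleton's internal job setup (hidden from the user),
// descending order could then be requested with:
//   job.setSortComparatorClass(DescendingLongComparator.class);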


4.4 Designing Parallel While over Hadoop

The semantics of Parallel While are closer to those of Parallel For than to those of Parallel Sort. As a result, designing Parallel While will lead us down a path that is similar to the one we followed for Parallel For. As is the case for Parallel For, there is no need for a reduce phase in Parallel While. The processing occurs on all the data sets of the input and only a specific function is applied to them. Moreover, not only is there no need for sorting and/or grouping of the output values, but it should be avoided if we are to talk about correct semantics. Furthermore, the concept of Parallel While (and of the majority of algorithmic skeletons, for that matter) functions on a single input and output basis and not on (Key, Value) pairs. It is due to this fact that we need to offer the user an API that has single inputs and outputs. It would be best if we could do that with a generic data type like String that is also known to all programmers. Last but not least, the output of Parallel While must be written into the distributed file system of Hadoop (HDFS) and, as we stated earlier, when omitting the reduce phase this is taken care of by the framework internally. What led our design strategy in Parallel For was to use extensions of the already defined classes Job and Mapper. The key idea is to use the already implemented functionality of the mapreduce package for our purposes. In this case the extended class SkeletonJob is going to be used for creating, configuring and submitting the Parallel While job. This class specifies internally that no reduce task will exist and, as a consequence, various issues are addressed, such as avoiding sorting, grouping and unnecessary computation. As the concept behind designing SkeletonJob was described in a previous section, we can move on to the next step, which is designing the Parallel While class. Two things differentiate the skeleton While from the skeleton For. First of all, a condition exists whose binary result determines whether a specific function will be applied or not. This decision depends only upon the condition and its result. In other words, there is the notion of a loop where a computation may take place more than once and the stopping point is defined by the condition. Additionally, we must find a way to use the results of the previous step of the While in the next step of the same loop concerning the same record. All these should be provided to the user of Hadoop in an abstract yet simple way, which will offer the programmer another parallel algorithmic model, more expressive and simpler for a number of problems.
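A rough driver-side sketch of the loop follows (names and the stopping rule are illustrative assumptions): each iteration is a map-only Hadoop job whose output folder becomes the input of the next iteration, which is also how results from one step can be carried over to the next.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Illustrative sketch: the iteration loop a While skeleton could run on the client.
public class WhileDriver {

    public static void run(Configuration conf, Path input, Path output,
                           Class<?> jarClass, Class<? extends Mapper> worker,
                           int maxIterations) throws Exception {
        Path current = input;
        for (int i = 0; i < maxIterations; i++) {
            Path next = new Path(output, "iter-" + i);   // temporary iteration output
            Job job = new Job(conf, "while-iteration-" + i);
            job.setJarByClass(jarClass);
            job.setMapperClass(worker);
            job.setNumReduceTasks(0);                    // map-only: no sort/group
            FileInputFormat.addInputPath(job, current);
            FileOutputFormat.setOutputPath(job, next);
            if (!job.waitForCompletion(true)) {
                throw new RuntimeException("While iteration " + i + " failed");
            }
            current = next;  // the next step reads what this one wrote
            // In the real skeleton, the user-defined condition (evaluated, for
            // example, from counters or a flag written to 'next') would decide
            // here whether another iteration is needed.
        }
        // 'current' now points to the output of the last completed iteration.
    }
}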


4.5 Designing Parallel If over Hadoop

Parallel If also uses the boolean result of a condition for deducing which one of two computations will take place. Designing the way that the condition will be realized in the framework and how it will interact with the rest of the classes is of the highest importance. As was the case with the skeletons For and While, If is close to a functional map. It is different in a number of ways that raise issues regarding its design and implementation over Hadoop. The unique feature of If is that it contains two functions or skeletons that can be applied to the data. More specifically, depending on the outcome of the condition checked, a different computation occurs on the input data sets. It is only natural that in the class that will provide If, these two functions will be defined separately. An important aspect that distinguishes If from While is that there is no concept of iteration but rather the concept of conditional branching. We should provide this notion of branching to the user in an abstract and general way, and most details of it should remain hidden from him. Moreover, we need to provide details regarding how the user can define the condition and the arguments needed for evaluating it. For computing the outcome of the condition we need the current value, but it may well be the case that additional arguments are needed. As a result, we should offer an interface for providing these arguments along with a way to evaluate the condition using them. Furthermore, Parallel If also needs to hide the (Key, Value) pairs used for input and output and provide a single input and output in a single and generic data type. To this end, the key idea will remain the same as in the cases of the other skeletons that are designed over Hadoop for the purposes of this project. Additionally, we should take a closer look at the code and see how the call to the functions is made for each value. We need to introduce the evaluation of the condition there and hide the decision branching from the user. The user need only specify the condition, a number of optional arguments that are used for computing the result of the condition, and the two separate functions that are to be used for processing the input data.
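The branching itself can then live inside the map call, as in the sketch below; the class and method names are illustrative assumptions rather than the library's exact API.

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrative sketch: an If worker that evaluates the condition per record
// and applies one of two user-supplied functions, hiding the branching.
public abstract class IfWorker extends Mapper<Object, Text, Text, Text> {

    // Supplied by the user: the condition and the two alternative computations.
    public abstract boolean condition(String input, String[] args);
    public abstract String trueCase(String input);
    public abstract String falseCase(String input);

    // Optional extra arguments for the condition, e.g. read from the job
    // configuration; empty by default in this sketch.
    protected String[] conditionArgs() { return new String[0]; }

    @Override
    protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        String input = value.toString();
        String result = condition(input, conditionArgs())
                ? trueCase(input)
                : falseCase(input);
        context.write(new Text(result), new Text(""));
    }
}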

4.6 Designing a Streaming API for Parallel Skeletons over Hadoop

Having designed a number of parallel algorithmic skeletons over Hadoop, and having deduced a strategy for designing more, the matter of composing them arises. With the newly implemented algorithmic skeletons, the user of Hadoop now has at his disposal more expressive parallel models for specific kinds of tasks. The question now is what can be offered next so that the user is able to take full advantage of Hadoop and of all the parallel programming models that it supports. To this end, two possible approaches exist.

The first approach is to offer the feature of skeleton nesting. As we saw in the previous chapter, in the context of any skeleton the function that is to be applied to the data sets of the input may well be another algorithmic skeleton, which leads to skeleton nesting. Of course, this would require a careful study and deep understanding of certain parts of Hadoop, specifically its scheduler. It is an open question whether Hadoop can support nesting of MapReduce jobs and consequently of Skeleton jobs. It should be noted that in the latest releases of Hadoop the scheduler has become a separate, replaceable component, and at least two major schedulers have already been developed: the fair scheduler from Facebook and the capacity scheduler from Yahoo. Using one of these schedulers, or even designing and implementing a new one, might ease the task of nesting skeletons.

Whichever the case, in the context of this project we decided to follow the second approach of providing an API for streaming Skeleton and MapReduce jobs. The main idea is for the user to be able to specify a number of jobs along with the priority of each one of them, essentially defining their sequence in an easy and simple way, and, after submitting them, to wait for the final results without any further effort. In a way this is close to the notion of a pipeline, where the output of one job is used as input for the job that is next in the sequence. The user need only specify an input folder for the first job and an output folder for the last job. The intermediate results are stored in temporary folders, about which the user holds no information whatsoever. Additionally, it needs to be noted that this kind of pipeline is going to be static. This means that the result of one job needs to be fully materialised and the job finished before it is used as input for the next job. Finally, it is apparent that, apart from supporting the parallel skeletons already implemented for the purposes of this project, MapReduce should also be supported.

Moreover, if MapReduce is supported in this streaming API, then any skeleton implemented in a similar way to the ones described in this thesis will also be supported. Details regarding this point will be given in the next chapter, where the implementation of the streaming API is discussed.

Let us now talk about how we will actually design an API that offers the user the ability to stream an arbitrary number of Skeleton and MapReduce jobs. There are a number of possible ways to tackle this problem and here we present one of them. The most important aspects of the proposed design are that it is generic, quite simple and easy to use. The user needs only to specify the input and output folders. Then, he has to create the jobs he wants, as he would have done if he were configuring only one job. Using a method from the streaming API, he should be able to pass as parameters each job along with its number in the sequence. After passing all the jobs, the user need only call a method that signals the start of the pipeline, in other words start the execution of the jobs in the order defined, and wait for the final output to be produced in the output folder he specified. As a result, we must implement a class that provides these methods and, most importantly, coordinates internally all the actions that need to be executed for the streaming of the jobs to complete successfully. In short, the class needs to be able to execute every job in the correct order. Moreover, it needs to specify temporary input and output folders for the intermediate jobs and use the input and output folders specified by the user only for the first and last job respectively. The data must follow the sequence specified by the user and, finally, the temporary folders must be deleted along with the files that they include. The key idea is to offer a transparent way of running jobs one after the other, using the output of one as the input of the job next in the sequence. In Chapter 5, we describe more in the context of the implementation.

4.7 Summary

Designing algorithmic skeletons over Hadoop is not an easy task. It requires careful consideration of the already implemented computational paradigm of MapReduce and a deep understanding of the pattern that each skeleton represents. We decided to provide four skeleton operations in a skeletons library for Hadoop: Parallel For, Parallel Sort, Parallel If and Parallel While. Every one of them has its own unique features that need to be taken into account. Our decision was to use the MapReduce paradigm and change the data flow according to the semantics of each skeleton. In essence, what we did was to provide another level of indirection, with the aim of aiding the programmer in dealing with the specific kinds of problems that are best tackled by one of the parallel patterns offered by the algorithmic skeletons.

Moreover, we also decided to offer a streaming class that helps programmers set up multiple Skeleton and MapReduce jobs in a pipeline. This hides many details of setting up sequential MapReduce and Skeleton jobs and basically makes it easier for the user to specify a sequence of jobs and let the framework deal with the issues of controlling the data flow. The next step is to actually implement this skeletons library and deal with all the potential problems that may arise.

Chapter 5 Implementation
In the previous chapter we gave details regarding the design of different parallel algorithmic models other than MapReduce. In this chapter, we provide details regarding the actual implementation of the new classes needed to offer the user of Hadoop more expressive frameworks for specific kinds of problems. Before moving on to the description of how the skeletons For, Sort, While and If were implemented for the purposes of this project, a general walk-through is provided of the approach needed for implementing parallel algorithmic models over Hadoop. An important factor that guided our implementation was our effort to mask many of Hadoop's implementation details. The user receives every line of the input files as a String and writes an output line as a String, with no knowledge of input and output pairs and the internal context. Moreover, many aspects of the configuration of the jobs are taken care of by the skeletons library, and the user need not worry about them.

5.1 General Guidelines for Implementing Algorithmic Skeletons over Hadoop

The most important part of implementing any algorithmic skeleton over Hadoop, or even providing any high-level API for setting up MapReduce jobs, is a deep comprehension of how MapReduce is implemented and functions over Hadoop. In essence, everything needs to be translated into regular MapReduce jobs, as only applications of this kind can be executed over Hadoop. Of course, the possibility of implementing another computational paradigm over Hadoop exists, but for it to be independent and as effective and generic as MapReduce is an extremely difficult task.

Since underneath a MapReduce job will be executed in a way that best represents the skeleton pattern, one has to determine how to change the data-flow so as to ultimately offer the user a new pattern. For instance, certain algorithmic skeletons do not require two phases of computation and thus the reduce phase is redundant. After determining how the data-flow should be changed, we move to the implementation. The best approach is to use the current classes of the MapReduce package of Hadoop as superclasses and extend them according to our needs. The classes to be used as the basis for our own are Mapper, Reducer and Job; at least, these are likely to be the most frequently used. Depending on the skeleton pattern implemented, we may need to create another class by extending a different original class. For example, if the way the input files are split and fed to the framework must change for the purposes of a particular skeleton, a fitting class should be extended (e.g. InputSplit). Finally, we should note that by extending the relevant classes, certain details regarding setting up, configuring and submitting the job can be hidden, in an effort to provide a higher level of interaction between the user and the framework. To this end, the data types that Hadoop deals with can be hidden, providing the user with the single data type of String for a simpler and more generic approach.
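To make the approach concrete, the following minimal sketch shows the general shape of such an extension; the class name and the exact constructor used are illustrative assumptions rather than the thesis' actual source code.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;

    // Illustrative map-only job wrapper: the reduce phase is switched off, so the
    // output of the mappers is written straight to HDFS with no sorting or grouping.
    public class MapOnlySkeletonJob extends Job {

        public MapOnlySkeletonJob(Configuration conf, String jobName) throws IOException {
            super(conf, jobName);
            setNumReduceTasks(0);   // no reduce tasks: the map output becomes the final output
        }

        // The user's skeleton class is plugged in where a Mapper would normally go.
        public void setSkeletonClass(Class<? extends Mapper> cls) {
            setMapperClass(cls);
        }
    }

The actual SkeletonJob class follows this general idea while additionally masking several configuration methods, as described in the following sections.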

5.2 Implementing Parallel For over Hadoop

As mentioned previously, for implementing Parallel For with the use of the existing MapReduce package, only two new classes need to be created: a SkeletonJob class that extends the existing Job class, and a For class that extends the existing Mapper class. The idea behind the new classes is to offer the user an API, additional to the existing MapReduce one, for programming Parallel For jobs instead of just MapReduce jobs. The SkeletonJob class internally sets the number of reduce tasks to zero and sets a dummy (do-nothing) Reducer class on the job. Apart from that, it just masks certain methods of the Job class and presents different ones to the user. These methods can then be used for setting parameters for the new job. In addition, a new method, createSkeletonJob(), is defined that can internally set up Skeleton jobs if the necessary arguments are passed. The basic concept behind these methods is to hide much of the information regarding setting up jobs, for example the creation of the configuration file and so on.

Class SkeletonJob
Method Summary
SkeletonJob createSkeletonJob(String jobName, String[] args, Class<?> jarCls, Class<? extends Mapper> cls)
    Creates a Skeleton Job according to the parameters.
SkeletonJob createSkeletonJob(String jobName, Class<?> jarCls, Class<? extends Mapper> cls)
    Creates a Skeleton Job according to the parameters.
Table 5.1: Most Important methods of the SkeletonJob API

The extension of the Mapper class is the trickiest part, as there are a number of issues that need to be addressed. Firstly, the input and output pairs must be hidden internally and be presented to the user of the new API as a single input and output of type String. Since the framework functions with pairs, they are still going to be used, but the user need not worry about them.

Class For
Method Summary
String For(String value)
    Called once for each value in the input split.
void WriteOut(String value)
    Writes a value at the output.
Table 5.2: Most Important methods of the For API

The extended class For uses dummy Text data types for masking the pairs and provides every line of the input file to the user as a String. Furthermore, the method WriteOut() is defined and replaces the method context.write() of the Mapper, so that the user need not know of the notion of the context and how it works; he just writes an output line as a String using the method WriteOut(). By taking a closer look at the class Mapper, we see that a method map() is defined that is ultimately implemented by the user. This method specifies the task of the Map phase. In the context of Parallel For, the method For() is defined for the same purpose. The programmer who wants to use Parallel For needs to override this method depending on his needs.

Finally, the method run() of the Mapper is overridden in the extended class For, because its functionality has to change in a way that suits the new parallel algorithmic skeleton. However, this method should be of no concern to the user of the skeletons library, as it is called internally during the execution of the Skeleton job.
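A hypothetical end-to-end use of this API, based on the method summaries in Tables 5.1 and 5.2, might look as follows; whether createSkeletonJob() is static and how the input and output paths are attached to the job are assumptions made only for the sake of the example.

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // User code: override For() to process one input line and emit one output line.
    public class UpperCaseFor extends For {
        @Override
        public String For(String value) {
            String result = value.toUpperCase();   // the per-line computation
            WriteOut(result);                      // write one line to the output
            return result;                         // return value matches the signature in Table 5.2
        }
    }

    // Driver: one call creates and configures the skeleton job.
    public class UpperCaseForDriver {
        public static void main(String[] args) throws Exception {
            SkeletonJob job = SkeletonJob.createSkeletonJob("upper-case",
                    UpperCaseFor.class, UpperCaseFor.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));    // SkeletonJob extends Job, so the
            FileOutputFormat.setOutputPath(job, new Path(args[1]));  // standard path helpers still apply
            job.waitForCompletion(true);
        }
    }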

5.3 Implementing Parallel Sort over Hadoop

The main difference between Sort and all the other skeletons is that its implementation requires no extension of any existing class. As the natural sorting phase of MapReduce is used, one need only configure a fitting MapReduce job for sorting the specific input that the user wants sorted. The first task is defining the Mapper and Reducer pairs for the five data types that are supported (integers, floats, doubles, long integers and strings). The only difference between these pairs is the data type of the key of the pairs produced by both the Mappers and the Reducers. More specifically, the value of the output of both the Mapper and the Reducer is going to be the line read from the input. The Mapper will extract the key upon which the sort is going to be performed and use it as the key for the Reducer. The intermediate results are going to be sorted and used as input by the Reducer. The Reducer only reads from the input and writes to the output the sort key and the whole value. All the pairs of Mappers and Reducers perform the above work, but on different types of keys that need to be sorted.

Upon creating the Sorter class that contains the above Mappers and Reducers, another class needs to be created that will be the one used by the user; the Sort class. The purpose of this class is to set up a MapReduce job that will perform the sorting that the user requested, configure it appropriately and return it to the user, who submits it to the framework after adding the input and output folders. Another thing that the Sort class should contain is a way to reverse the order of the sorting from the default of the MapReduce framework. The main method of the Sort class, PSort(), receives two parameters from the user and sets up internally the fitting MapReduce job that will realise the appropriate sorting. The first parameter concerns the data type of the values that will be sorted; the second parameter shows whether the result will be produced in ascending or descending order. The PSort() method will then create a new MapReduce job and configure it with the relevant pair of Mapper and Reducer, depending on the data type that the user specified.

Class Sort
Method Summary
Job PSort(String dataType, String sortType)
    Creates, configures and returns an appropriate sort job.
Table 5.3: Most Important methods of the Sort API

The next step is setting the input and output folders of the job. Finally, before submitting the job we must find a way of reversing the sort order when the user has specified descending order of the results. For this, we first need to describe how the sorting occurs in the framework. Every data type contains an internal class named Comparator, which is an extension of the class WritableComparator. At this point, we remind the reader that the data types of Hadoop are declared in separate classes in the org.apache.hadoop.io package. Five data types of Hadoop are supported by our sort skeleton: IntWritable (integer), FloatWritable (float), DoubleWritable (double), Text (string) and LongWritable (long integer). All these classes contain a Comparator class that is used by the framework for sorting in ascending order. So, for reversing the sort outcome, we need only create five new comparator classes, one for every supported data type, that produce the results that we require. Basically, the five new comparators are the same as the existing ones with only the outcome of the method compare() changed: the inequality check is reversed and so the order of the outcome is also reversed. After the newly created MapReduce job is configured accordingly, the job is submitted to the framework and the sorted outcome is produced in the specified output folder. It should be noted that only one Reducer is used, which results in the production of a single sorted output file but raises the execution time.
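As an illustration of the comparator reversal described above, a descending-order comparator for DoubleWritable keys could look like the sketch below. The thesis reverses the inequality checks inside compare() itself; here the parent comparator's result is simply negated, which has the same effect. The class name is illustrative.

    import org.apache.hadoop.io.DoubleWritable;

    // Compares serialized DoubleWritable keys and inverts the natural (ascending) order.
    public class DescendingDoubleComparator extends DoubleWritable.Comparator {
        @Override
        public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
            return -super.compare(b1, s1, l1, b2, s2, l2);   // flip the sign to sort descending
        }
    }

A sort job could then register such a comparator, for example through Job's setSortComparatorClass() method, so that the framework's sorting phase produces the keys in descending order.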

5.4 Implementing Parallel While over Hadoop

In order to provide the user of Hadoop with the additional skeleton of While, and since our purpose is to reuse parts of the existing MapReduce API over Hadoop, two classes and an interface should be offered. One of them aims to initialise, configure and submit a job that will be executed by the framework. The other class provides the user with the appropriate tools to program, in an easy and simple way, a parallel program that follows the pattern of Parallel While. Finally, an important factor in implementing While is how to embed the functionality of the condition into the Parallel While class.

As for the class that offers the appropriate methods for configuring and submitting the skeleton job, the extended class SkeletonJob will be used, which was also used for Parallel For. The reason behind this decision is twofold. First of all, the already defined class SkeletonJob is more than sufficient for providing the required functionality for While as well as For, so there is no reason to implement an additional class that offers the same functionality but is used only for Parallel While jobs. What is more, having only one class for setting up skeleton jobs other than MapReduce aids our purpose of offering a simple, easy to use and abstract model to the user for programming in different parallel patterns. SkeletonJob is an extension of the class Job that sets up MapReduce jobs. Basically, it specifies the lack of any reduce phase by setting the number of reduce tasks to zero. Moreover, it masks several methods that concern the map phase, so that the user has an API with the necessary and appropriate methods for configuring a skeleton job of the parallel algorithmic pattern he wishes. Additionally, it offers a method to the user that sets up and configures the skeleton job internally.

Before moving on to the implementation of the main class, let us take a look at the way the concept of the condition is provided to the user. As mentioned earlier, an interface will be offered for implementing the concept of the condition. In this interface, three methods are defined. The first method, initializeArguments(), initialises various arguments that may be used for evaluating the condition. The second method, checkCondition(), takes as parameters the current value and the optional arguments and returns a boolean value (true or false) depending on the evaluation of the condition. Lastly, the third method, setArgs(), optionally stores any changes to the arguments of the condition for the next step of the loop concerning the specific value read. It should be noted that for each value read, the arguments regarding the condition should be reset to their initial values.

Providing the class that realises the parallel model of While is the most important part of the implementation. As in the case of Parallel For, the class While will be an extension of the Mapper class. In this class the appropriate tools must be offered to the user for programming with the parallel pattern of While, while hiding its main functionality, which is the implementation of the condition check and of the loop.

Class While
Method Summary
String PWhile(String record, boolean condition)
    Called for each value in the input split as long as the condition holds true.
void WriteOut(String value)
    Writes a value at the output.
public void initializeArguments()
    Initializes the arguments that may be needed for the condition check.
public boolean checkCondition(String record, String[] args)
    Returns the result from the condition check.
public String[] setArgs(String[] args)
    Changes the arguments accordingly after the end of each iteration.
Table 5.4: Most Important methods of the While API

As a first step, this class should implement the condition interface. In this class a default functionality is specified for the three methods of the condition interface; it was our choice for the default to provide just one iteration of the loop for each value. However, these three methods can be overridden by the user to match his own needs. The method run() of the Mapper is overridden so that it offers the functionality of Parallel While. In essence, we tried to encapsulate many of the details of the framework. The user need not worry about concepts like the context to which the writes are done; he simply uses the method WriteOut() to write a String to the output. The methods of the interface concerning the implementation of the condition are used in run() in a way that realises the algorithmic pattern of While: the condition is evaluated and the loop either continues or exits. In our design, we decided to offer the user the ability to perform an action even when the loop is exited. To this end the main method of the While class, PWhile(), also takes the value of the condition as a parameter. In any case, it returns the new value, as defined by the user, that is to be used for the next step of the loop. At the end of the day, the user need only specify the condition, the optional additional arguments used for its evaluation along with the way they change at each iteration of the loop, and the main method that performs the processing on the current value in each iteration of the loop.
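Collecting the three methods just described, the condition can be pictured as an interface of the following shape; the interface name and package are not given in the text, so they are assumptions made purely for illustration.

    // Illustrative shape of the condition interface used by While (and reused by If).
    public interface SkeletonCondition {

        // Reset the optional arguments to their initial values for each record read.
        void initializeArguments();

        // Evaluate the condition for the current value, using the optional arguments.
        boolean checkCondition(String record, String[] args);

        // Optionally update the arguments for the next iteration of the loop on this record.
        String[] setArgs(String[] args);
    }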

5.5 Implementing Parallel If over Hadoop

The skeleton of Parallel If is similar to that of While and, as a result, their implementations are similar. More specifically, in the case of If a condition is evaluated and, depending on the boolean result produced, a different computation occurs. There is no notion of a loop, as in While, but that of conditional branching, which characterises this skeleton. In both If and While a condition must be implemented. In short, three classes are needed for adding the pattern of Parallel If to the Hadoop framework. The already defined and described SkeletonJob class is going to be used for setting up and submitting the If job. Moreover, the interface that was used for the condition in While will also be used in If. As a note, in the case of If the optional arguments that may be used do not change values. As a result the method setArgs() of the interface, which carries out this work, is of no use, so implementing a dummy method with only the signature and the appropriate return command is more than adequate.

The If class, the final class to be implemented, is the most important one, as it is the one that offers the functionality of the If pattern. In this class the condition interface is implemented and then overridden by the user for his own purposes. The unique feature of the If pattern is that the outcome of the condition decides which function will be applied to the current value read from the input. As a result two separate methods are offered, one for each case (IfTrueCase() and IfFalseCase()), in the context of the map() method of the Mapper class. The user need only specify the contents of these two methods along with the condition: evaluating it and optionally initialising its arguments. The framework will then decide for each value which method to call and execute. This functionality is implemented in the overridden method run() of the If class. At this point, it is worth outlining that, as in the cases of the classes For and While, If is an extension of the class Mapper. If this were not the case, the class SkeletonJob could not be used for setting up the job and, most importantly, parts of the already implemented mapreduce package could not be reused.

Class If
Method Summary
void IfTrueCase(String value, Context context)
    Called for each value in the input split if the condition is true.
void IfFalseCase(String value, Context context)
    Called for each value in the input split if the condition is false.
void WriteOut(String value)
    Writes a value at the output.
public void initializeArguments()
    Initializes the arguments that may be needed for the condition check.
public boolean checkCondition(String record, String[] args)
    Returns the result from the condition check.
Table 5.5: Most Important methods of the If API

5.6 Implementing a Streaming API for Parallel Skeletons over Hadoop

Implementing the API that provides a form of composition and streaming of all the parallel skeletons that now exist over Hadoop means dealing with two kinds of problems. The first is how the streaming itself will be implemented. The other concerns the way the streaming functionality will be presented to the user; in other words, how the programmer will be able to define his own pipeline of Skeleton and/or MapReduce jobs. As far as the first problem is concerned, the system must essentially take care of issues like the coordination of all the jobs internally. Coordinating a number of jobs is all about specifying the order between them and defining the appropriate input and output paths for each of them, so that the data-flow is semantically correct, i.e. the one specified by the user. The execution of the jobs must be controlled in such a way that a job and its output are finalised before the job that follows it starts. Finally, when all the jobs finish their execution, only the initial input folder, which contains the input of the pipeline, and the user-defined output folder, which contains the output of the complete pipeline, should exist. All the intermediate folders should be deleted.

For tackling both these issues, we defined a new class, called Streaming, and implemented it so that it offers the user an API for creating pipes of jobs, while their coordination takes place hidden from the user. The first question raised is how the various skeleton jobs will be stored and fed to the framework for execution. We remind the reader that the skeletons implemented, apart from Sort, do not use the already defined Job class but the newly implemented SkeletonJob, and that Sort and MapReduce use Job. The user creates the jobs and then passes them as parameters to an object of the class Streaming. Both kinds of jobs are stored in a list of type Job, as Job is the superclass of SkeletonJob. The only thing that needs to be taken into consideration is that when passing a SkeletonJob to the object of type Streaming it must be cast to Job, or else an error will be produced.

Class Streaming
Constructor Summary
Streaming(int num)
Method Summary
void setNumOfJobs(int num)
    Set the number of jobs before submitting the stream of jobs.
void setInputPath(String path)
    Set the input folder for the sequence of jobs before submitting them.
void setOutputPath(String path)
    Set the output folder for the sequence of jobs before submitting them.
void Stream(int priority, Job job)
    Pass a Job along with its number in the sequence. If the job is of type SkeletonJob, it must be cast to Job.
void cleanup()
    Called once at the end of the task.
boolean executeStreamingJobs()
    Executes the stream of the jobs submitted by the user.
Table 5.6: Most Important methods of the Streaming API

When creating a new object of the class Streaming, the user can specify the exact number of jobs that will be contained in the configuration. Alternatively, he can create an object of the class Streaming and specify the number of MapReduce and/or Skeleton jobs afterwards through the method setNumOfJobs().

The user then needs to create the jobs as he would have done in any other case. However, the input and output paths for the collection of jobs are passed to the Streaming object through the corresponding methods (setInputPath() and setOutputPath()), and the coordination of the folders is taken care of internally. The user then has to pass these jobs as parameters along with an identifier of the place that each one of them holds in the sequence of the stream. This is done by calling the method Stream() multiple times: for each job the user passes the job along with a single integer that shows the place that job holds in the sequence. Finally, the user calls the method executeStreamingJobs() to start the stream with the execution of the first job. All the jobs are executed in the sequence defined by the user, until the result of the last job resides in the output folder specified by the user.

Internally, the coordination of the various jobs is not a very complex matter. The jobs are stored in a list of type Job and the input and output paths are stored in fields of the class. As the jobs are passed along with an identifier regarding their order, the class responsible for the streaming can arrange their execution. It creates temporary folders for the output of the intermediate jobs and uses the output folder of one job as the input folder of the job next in the sequence. When the last job produces its results and finishes, all the intermediate folders and files are deleted. Finally, it should be noted that the Streaming API supports the four skeletons developed in the context of this project as well as the already implemented parallel model of MapReduce. The key idea is to offer the user of Hadoop not only additional parallel programming models, but also a way to compose an arbitrary number of parallel jobs in an easy and simple way. Additionally, any other skeleton that is implemented in a similar way and executed as an underlying MapReduce job will be supported by this streaming mechanism.
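Based on the methods of Table 5.6, a typical use of the class might look like the following sketch; the path strings and the job variables (forJob, sortJob, whileJob) are placeholders for jobs created earlier, rather than code taken verbatim from the library.

    // Chain three jobs so that each one consumes the output of the previous one.
    Streaming stream = new Streaming(3);
    stream.setInputPath("/user/data/input");      // input of the first job
    stream.setOutputPath("/user/data/output");    // output of the last job
    stream.Stream(1, (Job) forJob);               // SkeletonJob instances are cast to Job
    stream.Stream(2, sortJob);                    // a Sort/MapReduce job is already a Job
    stream.Stream(3, (Job) whileJob);
    stream.executeStreamingJobs();                // run the pipeline; temporary folders are handled internally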

5.7 Summary

In this chapter, we discussed the implementation of the skeletons library, including the four different algorithmic skeletons that can be used as patterns for writing programs to run over a Hadoop cluster, and the streaming class that aims to help the programmer run an arbitrary number of MapReduce and Skeleton jobs in sequence as a pipeline. In designing and implementing the library we took into consideration how the existing MapReduce package functions over Hadoop and how to change the data-flow to suit the needs of each skeleton.

Moreover, we wanted to offer a simpler API than that of Hadoop's MapReduce and abstract a lot of information and details behind a more user-friendly interface. To this end we also implemented a streaming API that can be used for defining multiple MapReduce and Skeleton jobs in a specific sequence.

Chapter 6 Evaluation
Up to this point, we have delved into the details of how our skeletons library over Hadoop was designed and implemented. In this chapter, we evaluate the skeletons library. We perform the evaluation against the objectives that we set in Chapter 1 as the criteria. We thoroughly describe the environment and the process of the evaluation and finally present the results along with the conclusions that can be drawn from them.

6.1 Information regarding the process of Evaluation

For evaluating the four skeletons and the streaming interface, we wrote five programs, one testing each separate implementation. Moreover, we made a comparison with corresponding MapReduce programs. Our purpose is to deduce whether or not the new framework is more efficient for certain kinds of problems compared to MapReduce. The metrics that we are going to use are two: the execution time and the level of expressiveness. These two metrics, along with the corresponding comparison with MapReduce, are the basis of the evaluation process. Of course, an important factor of the evaluation process is the input and its size. For a more thorough evaluation of the implementation, we used a variety of input sizes; more specifically, inputs of 1GB, 5GB, 10GB and 15GB. Moreover, it needs to be noted that both the Skeleton and the MapReduce programs perform the same computation on the same type of input. The programs that were written for carrying out the evaluation are simple programs that perform transformations on double numbers, and consequently the input files consist of double numbers. As mentioned before, the metrics are the execution time and the level of expressiveness.

The latter basically translates to the number of lines of code needed to program the same computation and the level of its complexity. In other words, we want to evaluate the newly implemented library of skeleton operations on how much more easily and in how many fewer lines of code a programmer may write the operations equivalent to MapReduce. Additionally, it should be noted that we monitor the behaviour of the skeleton operations for the variety of input sizes and, for each of them, make a comparison with the corresponding MapReduce program. A major difference between the two implementations (Skeletons versus MapReduce) is that MapReduce has the Reduce phase even when it is not needed. This results in unnecessary computation and network traffic, and in the end the execution time of the MapReduce program is greater than that of the programs using the skeleton operations. It can be argued that a MapReduce program can be set to skip the reduce phase entirely (not even have the default reducer), but this raises two issues. Firstly, it requires a great level of knowledge of the MapReduce library of Hadoop. Furthermore, it results in more lines of code and more complexity, when by using Skeletons the same result is produced in fewer lines of code and with lower complexity. To sum up, the evaluation process is the following: we compare the number of lines of code and their clarity between programs using the skeletons library and ones using MapReduce that perform the same computation on the same input, and we compare the two types of implementation as far as the execution time is concerned.

6.1.1 Environment of Evaluation

Before moving on to the description of the evaluation and the presentation of the results, it is necessary to provide the reader with more information regarding the environment in which the experiments were performed. All the programs that we designed and implemented for evaluating the implemented skeletons ran on the Hadoop cluster of the School of Informatics at the University of Edinburgh. The cluster contains two broad categories of machines: rack-mounted servers and spare desktop machines.

The rack-mounted servers are further split into two categories. There are 24 nodes, Dell PowerEdge SC1425 servers, which were bought by the University in July 2005. Each one has one dual core 3.2GHz Intel Xeon CPU; one 80GB hard disk, 7200rpm, SATA, model number Maxtor 6Y080M0, with 22GB of space available for use by Hadoop; and 4GB of memory.

The other rack-mounted nodes, including the namenode and the jobtracker, are Dell PowerEdge SC1425 servers purchased in January 2006. Each one has two dual core 3.2GHz Intel Xeon CPUs; one 80GB hard disk, 7200rpm, SATA, model number WDC WD800JD-75MS, with 22GB of space available for use by Hadoop; one 250GB hard disk, 7200rpm, SATA, model number Maxtor 7L250S0, with 204GB of space available for use by Hadoop; and 8GB of memory. All of the nodes run Scientific Linux 5.5. Furthermore, a number of different desktop machines are attached to the cluster. The number of machines varies, as machines are taken away or added regularly. The machines are a mix of all the models of desktop machine that the School of Informatics has in use, including Dell Optiplex GX620, GX745, 755 and 780, and HP dc7900. Processor specifications vary; the memory is either 2GB or 4GB depending on the age of the machine. Disk size also varies. All in all, it is apparent that the cluster is not balanced. Various types of machines are hooked up to the cluster and, although the average number of nodes in the cluster is about 70, up to 125 machines can be supported by the Hadoop cluster of the University. More information concerning the cluster and its configuration can be found on the wiki page of the School of Informatics, in the section about the Hadoop cluster (https://wiki.inf.ed.ac.uk/DICE/HadoopCluster).

6.1.2 Input

It is helpful to mention some information concerning the input that we used for the experiments we conducted for evaluating the implementation. First of all, we used a variety of inputs, as it was our intention to test and measure the scalability of the skeletons library. Consequently, we used four different input sizes for the evaluation: 1GB, 5GB, 10GB and 15GB. However, it should be noted that since each test program used a different skeleton operation for dealing with a different type of problem, even though the inputs have the same size, the data is organised differently and a different kind of computation is applied, so the processing time and the size of the output may differ. As a result, the times that were measured are valuable for comparing a program using a particular skeleton operation with the equivalent one using MapReduce; they are not meant for comparing the skeleton operations with one another, or for providing definitive execution times of the skeleton operations for the specific input sizes.

It is natural that, as the problem tackled changes and with it the way the input data is organised and processed, the execution times will vary. For completeness, we note that in the skeleton and MapReduce programs that we used for evaluating Parallel For, Parallel Sort and the streaming API we used the same input files. Moreover, the same input files were used for performing the evaluation of Parallel If and Parallel While. Thus, we constructed two different types of input in four different sizes; in total, eight different input files were created, in two different formats and four different sizes.

6.2 Execution Time

The first and probably most important metric that we are going to use for evaluating the skeletons that we implemented for the purposes of this project is the execution time. For a more thorough examination, every implemented skeleton operation is evaluated for a variety of input sizes. However, by measuring the execution time of the skeletons alone, no real conclusion can be reached about the efficiency of the newly implemented skeleton operations. It is to this end that we compare the execution time of the skeleton operations with the execution time of MapReduce. In essence this means that we had to write two types of programs that perform the exact same computation over the same input: one program sets up a Skeleton job for processing the data sets and the other sets up the equivalent MapReduce job. We compared the execution times of these groups of programs in the context of separate types of problems and drew conclusions from the results. It is especially important to note that, in the way we implemented the skeletons, they basically offer another level of indirection: when running a skeleton job, a MapReduce job is executed underneath. As a result, the programmer has the ability to write a MapReduce program that is as efficient as the skeletons. However, this requires great knowledge of MapReduce and of Hadoop and, what is more, one of the purposes of this project is to hide all this information behind a more user-friendly API, with which any programmer will find it easier to perform the computations he requires on the data sets. For the purposes of the comparison, the MapReduce programs are the regular ones, which means that no reducer is declared if none is needed and thus the default one is used by the framework.

6.2.1 Evaluation of For

The first skeleton operation that we decided to test and evaluate was Parallel For. For carrying out the evaluation we had to come up with a problem that is fit for the abstract parallel pattern of For, write the corresponding program that is executed on Hadoop and uses the skeleton operation For from the skeletons library, and finally create a variety of inputs so as to determine the scalability of the newly implemented skeleton operation. The problem that we decided to test For on was the simple task of calculating the average of the double numbers on every line of an input file. In essence, we want to read the file line by line, calculate the average and write it to the output file. As a first step we needed to create an input file that has on each line a sequence of double numbers. Furthermore, we wanted the input file to have a fixed size. To this end, we developed a Java application that uses the Java random generator and creates such a file with numbers ranging from 0 to 500. After creating the required files, it was only a matter of uploading them to the cluster from a local machine for the inputs to be ready. The next step was to develop an application that uses the For operation and computes the average of each line. It is helpful to note that in this skeleton operation we specify what to do with every line of the input file; this line is of type String and it is up to the programmer to manipulate it in the best possible way. The required program is simple enough: it breaks up the line into its individual parts, parses them as doubles, computes the average and uses the method WriteOut() to write the value to the output. All that is left is to run this program on the cluster using the input files that we created before. Since our purpose is to evaluate our implementation, it is only natural to want results that are as reliable as possible. To this end, for every input file we executed the program five times, each time recording the execution time. Moreover, we also implemented a MapReduce program that performs the same computation and executed it on the cluster exactly as many times and on the same inputs.
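The For() override at the heart of the skeleton version of this program can be sketched as follows, assuming, for the sake of the sketch, that the doubles on a line are whitespace-separated; the class name is illustrative.

    // Computes the average of the double numbers found on one input line.
    public class AverageFor extends For {
        @Override
        public String For(String value) {
            String[] tokens = value.trim().split("\\s+");   // split the line into its numbers
            double sum = 0.0;
            for (String token : tokens) {
                sum += Double.parseDouble(token);
            }
            String average = String.valueOf(sum / tokens.length);
            WriteOut(average);                              // one average written per input line
            return average;
        }
    }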
6.2.1.1 Results of Evaluating For

The results from executing the For implementation are shown in Table 6.1. Note that for every input file we present the average execution time over the five runs that we made.

Input Size (GB)    For Execution Time (sec)    M.R. Execution Time (sec)
1                  29                          39
5                  42                          70
10                 52                          103
15                 67                          145

Table 6.1: Execution time of the For and the equivalent MapReduce implementation.

From Table 6.1, we can say that the program that uses the For skeleton operation performs efficiently for all the input sizes. The execution time does not rise excessively as the input grows larger, which shows that the implementation can effectively be used for data-processing jobs over very large inputs. However, the true effectiveness of our implementation shows when it is compared with the equivalent MapReduce program. The right column of Table 6.1 gives the execution times of the equivalent MapReduce implementation for the same inputs; again, the average execution time over the five runs is presented. The skeleton operation is far more efficient for all the input sizes. This should not come as a surprise, as the kind of problem that we are dealing with here is a better fit for the parallel algorithmic pattern of For than for MapReduce. Even though no reducer was specified in the MapReduce implementation, the framework uses the default one, which just copies the input to the output. Even though this specific reducer has no complexity from a computational point of view, it still requires the output of the mappers to be grouped and sorted, potentially the data to be transmitted to another machine or node, and the additional cost of the write to the final output files. All these operations are redundant and raise the execution time of the program. As a result, it can safely be said that the implemented For indeed performs better for specific kinds of problems. What is more, it does so in a cleaner way, with a much simpler API. In Figure 6.1, we present a comparison between the two implementations in the form of a chart.

Figure 6.1: Comparison between Skeleton For and MapReduce

6.2.2 Evaluation of Sort

Sort is the most peculiar algorithmic skeleton that we implemented in the skeletons library. The main reason for this is that it is perhaps the most popular algorithm that has been implemented over Hadoop. A lot of implementations exist and, tellingly, one of them is contained in the examples jar file that comes with the Hadoop release. It was this fact that guided our implementation of the algorithmic skeleton of Sort: we decided to provide a sort operation that focuses on specific kinds of problems and inputs. More specifically, we designed and implemented a sort operation that produces only one output file. At this point, we remind the reader that the number of output files depends on the number of Reducers, so it is possible for a MapReduce job to have multiple output files. The sort operation of the skeletons library produces only one output file. However, this leads to a high execution time, especially for large inputs, as only one reducer is used. The main idea behind the algorithmic skeleton of Sort in our library is to sort the input into a single output file. It is not as generic as many of the existing sort algorithms over Hadoop, but this fact is also an advantage when talking about specific kinds of problems. This operation should only be used when the user wants the result in only one output file, whatever his reason.

Evaluating Sort proved to be straightforward, as the inputs also used for For were reused. Sort reads every line of the input, splits it and sorts it according to its first token. The user need only specify the data type of the first token, which will be the key of the sort, and whether he requires a descending or ascending sort. As was the case before, we also implemented a MapReduce program that sorts the output into a single output file. Of course, we could have used any one of the already implemented sort algorithms, but for a fairer evaluation we decided to compare programs that perform equivalent computations.
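A driver for the skeleton side of this comparison, following the PSort() summary in Table 5.3, could be sketched as follows; the exact strings accepted for the data type and sort order, and the fact that PSort() is called on a Sort instance, are assumptions made only for illustration.

    // Ask the library for a job that sorts lines by their leading double key, descending.
    Job sortJob = new Sort().PSort("double", "descending");
    FileInputFormat.addInputPath(sortJob, new Path("/user/sort/in"));    // the user adds the folders
    FileOutputFormat.setOutputPath(sortJob, new Path("/user/sort/out"));
    sortJob.waitForCompletion(true);                                     // submit and wait for the single sorted output file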
6.2.2.1 Results of Evaluating Sort

First, we present the results we obtained from executing on the cluster the program that sorts lines according to double numbers by using the skeleton operation Sort. We ran the program five times for each of the four input sizes and show the average numbers.

Input Size (GB)    Sort Execution Time (min:sec)    M.R. Execution Time (min:sec)
1                  5:07                             8:34
5                  36:30                            32:26
10                 47:39                            50:19
15                 68:36                            65:29

Table 6.2: Execution time of the Sort and the equivalent MapReduce implementation.

Table 6.2 shows that the Sort skeleton works better for smaller input sizes; the difference between sorting 1 Gigabyte and 5 Gigabytes is major. In any case, the main purpose of the operation is achieved, as only one output file is produced. After all, even if another sorting algorithm was used and the user then concatenated the results into a single output, one reducer would still be needed, raising the execution time excessively, as the sorting would occur twice. The equivalent MapReduce program uses the natural sorting phase of the framework and uses only one reducer. The right column of Table 6.2 contains the average execution times of all the runs that were made. It needs to be noted that this implementation is close enough to that of the corresponding sort operation in the skeletons library; as a result we would not expect to notice any major difference in the execution times.

As was expected, the differences in the execution times are small. Even though only the average execution time is shown to the reader and not all the individual times, it is useful to state that the execution times often deviated substantially from the average. The main reason behind this is basically the load of the cluster and the node that was chosen as the reducer. After all, the cluster on which we carried out the experiments is not balanced; three different kinds of machines exist with different capabilities and thus which node acts as the reducer plays an important role in the performance of the program. All in all, the implemented sort operation in the skeletons library proved to fulfil its specifications. In Figure 6.2, we present a comparison between the two implementations in the form of a chart.

Figure 6.2: Comparison between Skeleton Sort and MapReduce

6.2.3 Evaluation of While

Parallel While proved to be the most difficult skeleton as far as its testing and evaluation is concerned. This was mainly due to the fact that finding a fitting problem for this specific pattern, and one that can also be implemented using MapReduce, was not a trivial task. We came up with a problem that is a natural fit for the parallel algorithmic skeleton of While: the computation divides any given input number by two until the result is smaller than 0.25. Instead of 0.25, any small double number could have been used.

From the above, it is evident that the input files should contain double numbers. The difference from the input used in For, or even Sort, is that the computation occurs on each separate double. This means that each input line should contain only one double instead of a collection of doubles, so new input files had to be created. There are two possible ways of accomplishing this. The first is to use the output of the For implementation; after all, it results in an output that contains on every line the average of that line, or in other words one double per line. A different approach is to tweak the Java application that produced the input files for the experiments of the previous sections, so that it produces files appropriate for While's evaluation. As before, we have four kinds of inputs that differ in size: 1GB, 5GB, 10GB and 15GB.

The program that uses the skeleton operation While for tackling this problem is simple enough. The skeleton operation is called for every line of the input file. This line contains only a double number, but it is passed as a String parameter, so the first step is to parse the String variable to a double. If the number is smaller than 0.25, it is written to the output using the method WriteOut(). If not, then the skeleton operation is called again with the new number as its parameter; this, however, is taken care of by the skeletons library internally, and the user need only return the new number. Moreover, we developed a MapReduce program that performs the same computation on the same input. A reduce phase is not necessary for this kind of data processing, so only a Mapper is defined, which parses the input pair and implements a while loop that performs the necessary computation until the produced double number is smaller than 0.25. It has to be noted that, apart from the many details regarding the setup and configuration of the job, the pattern is also not hidden from the user, which may lead to logic errors in the program. Both these implementations were executed on the cluster a sufficient number of times to reach definitive and reliable results.
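Put together, the user's part of this program can be sketched as below; the exact interplay between checkCondition() and the condition flag passed to PWhile() follows the description in Section 5.4, and the class name is illustrative.

    // Halve each double until it drops below 0.25, then write it out.
    public class HalvingWhile extends While {
        @Override
        public boolean checkCondition(String record, String[] args) {
            return Double.parseDouble(record) >= 0.25;       // keep looping while still too large
        }

        @Override
        public String PWhile(String record, boolean condition) {
            if (!condition) {                                // loop exited: emit the final value
                WriteOut(record);
                return record;
            }
            double halved = Double.parseDouble(record) / 2.0;
            return String.valueOf(halved);                   // fed back into the next iteration
        }
    }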
6.2.3.1 Results of Evaluating While

Table 6.3 shows the results from executing the program using the While skeleton operation. Note that for every input file we present the average execution time over a total of five executions. The results presented in Table 6.3 show that for this specific problem a high processing cost is paid regardless of the input size.

Input Size (GB)    While Execution Time (min:sec)    M.R. Execution Time (min:sec)
1                  5:41                              2:19
5                  6:12                              7:28
10                 12:45                             14:24
15                 14:20                             17:34

Table 6.3: Execution time of the While and the equivalent MapReduce implementation.

This is evident from the fact that the input of 1 Gigabyte and that of 5 Gigabytes differ by less than a minute, even though processing the 1GB input takes more than 5 minutes. This can only be explained by the fact that this specific pattern has a high overhead at the start of the job. As a result, the skeleton operation of While that we implemented in the skeletons library is best suited to large inputs, as for relatively small inputs there is high latency. However, whether our implementation is capable of becoming a useful data-processing model is shown by its comparison with the equivalent MapReduce program. The right column of Table 6.3 shows the execution times of the equivalent MapReduce implementation that performs the same computation for the exact same inputs; again, we present the average execution time of the five runs that were made. By comparing the two columns, it can easily be seen that, apart from the input of size 1GB, in all other cases the While skeleton proves to be more efficient. As mentioned earlier, this specific pattern appears to have a high setup overhead, which makes it a better fit for larger inputs. Of course, this is not a problem, as Hadoop is commonly used for processing large data sets. The fact that the MapReduce implementation uses the default reducer leads to redundant computation and network traffic; the same applies to the grouping and sorting of the mappers' output. These factors make the MapReduce implementation less efficient than using the corresponding skeleton operation. However, for a small input size of 1GB the MapReduce program performs better than Parallel While, leaving space for potential improvement: the cause of the initial high overhead of the pattern could be detected and potentially resolved, leading to a skeleton operation that functions more efficiently than MapReduce in all cases and can also be used for processing small data sets. In Figure 6.3, we present a comparison between the two implementations in the form of a chart.

Figure 6.3: Comparison between Skeleton While and MapReduce

6.2.4 Evaluation of If

For evaluating the algorithmic skeleton of If, a simple decision-making problem must be identified. We decided to implement the simplest kind of decision making when dealing with arithmetic numbers: for each double number read from the input, we need only check whether it is odd or even. Depending on the outcome, we either increment the number by one or decrement it by one. The above computation is perhaps one of the simplest ones we could come up with, but it serves our purposes more than well. As the required input for the skeleton operation is a line that contains only one double number, the input files used for the evaluation of While can also be used for evaluating If. All that is necessary is for every line of the file to contain only one double and to create files of different sizes; more specifically, we again use files of size 1GB, 5GB, 10GB and 15GB. Before presenting the results, we describe in short the programs we developed for performing the testing and evaluation of Parallel If. As was the case for the evaluation of the other skeletons, we developed two programs, one that uses the skeleton operation of If and one that sets up the equivalent MapReduce job. Both of these programs were executed on the cluster for the selection of inputs.

For more reliable results, we executed each program five times for each input. The program that uses the skeleton operation If basically specifies one condition and then two functions. The condition is checked for every number and returns either true or false (either the number is even or it is not). Depending on the outcome of the condition, one of the two functions is applied to the number: we decided to increment the number if it is even and otherwise decrement it. The equivalent MapReduce program uses only a mapper for carrying out the computation. It implements an if statement directly, exposing the pattern to the user, and depending on the decision it increments or decrements the number and writes it to the output.
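The skeleton side of this comparison can be sketched as follows; since the inputs are doubles, parity is taken here to mean the parity of the truncated integer part, and the class name is illustrative.

    // Increment "even" numbers by one, decrement "odd" ones by one.
    public class ParityIf extends If {
        @Override
        public boolean checkCondition(String record, String[] args) {
            return ((long) Double.parseDouble(record)) % 2 == 0;        // even?
        }

        @Override
        public void IfTrueCase(String value, Context context) {
            WriteOut(String.valueOf(Double.parseDouble(value) + 1.0));  // even: increment
        }

        @Override
        public void IfFalseCase(String value, Context context) {
            WriteOut(String.valueOf(Double.parseDouble(value) - 1.0));  // odd: decrement
        }
    }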
6.2.4.1 Results of Evaluating If

Table 6.4 presents the results from executing the If implementation. Note that for every input file we present the average execution time over a total of five executions.

Input Size (GB)    If Execution Time (min:sec)    M.R. Execution Time (min:sec)
1                  1:09                           2:18
5                  1:23                           4:53
10                 2:13                           9:36
15                 3:17                           13:35

Table 6.4: Execution times of the If skeleton and the equivalent MapReduce implementation.

It is no surprise that the skeleton operation performed well in all cases. The abstract parallel pattern of If is similar to that of For: for every line of input, one function is applied. The difference is that a condition is checked to decide which one of two functions will be applied. Parallel For performed well, and so does If; we can conclude that the algorithmic skeleton of If is a natural fit for Hadoop. Of course, the true capabilities of the implemented parallel skeleton If are shown by comparing it with the equivalent MapReduce program. The right column of Table 6.4 shows the results of the program that used the MapReduce library to perform the same computation. Comparing these results to those of the program that uses the skeleton operation If, it is clear that using the skeleton operation results in more efficient code.


This is mainly because executing a MapReduce job instead of a Skeleton job requires additional computation that, in the context of the problem at hand, is redundant. More specifically, the grouping and sorting phase that the outputs of the mappers are subjected to, plus the unnecessary reduce phase, raise the execution time considerably. The user of Hadoop may well specify in the MapReduce program that there is no reduce phase, which also removes the sorting and grouping (a sketch of this configuration is given after Figure 6.4). However, such a thing requires good knowledge of the software framework on the part of the user, whereas the skeletons do all of it internally in a cleaner implementation. To sum up, Hadoop is clearly enhanced by the addition of the skeleton operation If. In Figure 6.4, we present a comparison between the two implementations in the form of a chart.

Figure 6.4: Comparison between Skeleton If and MapReduce
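As noted above, a user who knows the framework well can get the same effect in plain MapReduce by declaring a map-only job. A minimal sketch of such a configuration is shown below; it reuses the ForMapper of the MRForExample listing given in Section 6.3, and the class name MapOnlyExample is ours.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MapOnlyExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "MapOnlyExample");
        job.setJarByClass(MapOnlyExample.class);
        job.setMapperClass(MRForExample.ForMapper.class);   // any per-line mapper will do
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setNumReduceTasks(0);   // no reducers: map output is written directly,
                                    // so no shuffle, sort or group phase is executed
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.waitForCompletion(true);
    }
}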

6.2.5 Evaluation of the Streaming API

This project is mainly about enhancing Hadoop by providing its user with additional algorithmic patterns for writing more efficient code in an easier, cleaner and more concise way. In the context of providing a higher level at which the user can program jobs over the framework, we designed an API for streaming consecutive MapReduce and Skeleton jobs. This API essentially allows the creation of a pipeline of MapReduce and/or


Skeleton jobs. To perform the evaluation we did three things. First, we took the programs we created for evaluating the skeleton operations of For, If and While and, with the aid of the streaming API, created a stream of these three jobs. We then executed this program on the cluster five times for every input size and calculated the average execution time. Second, we developed a program that chains these three jobs from scratch, without the streaming API; in essence, we did what anyone would do if they wanted to run these three (or any three) jobs in sequence, and we measured its execution time on the cluster. This represents the work someone would have to do if they used the skeleton operations for these computations but the streaming API were not part of the skeletons library. Third, we created a program that chains the equivalent MapReduce jobs and ran it on the cluster; this program represents what a programmer would do today to perform these three computations using the current libraries of Hadoop. Finally, these three implementations were tested and the results are presented in the following section.
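For illustration, the non-streaming version chains the three jobs roughly as follows. This is only a sketch: it reuses the Average, choice and differ classes of the StreamExample listing in Appendix A, and the intermediate folder names and clean-up code are our own additions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import skeletons.SkeletonJob;

public class ManualChainSketch {
    public static void main(String[] args) throws Exception {
        String in = args[0], out = args[1];
        String tmp1 = out + "_tmp1", tmp2 = out + "_tmp2";   // hypothetical intermediate folders

        // Each job must finish before the next one can read its output folder.
        SkeletonJob forJob = SkeletonJob.createSkeletonJob("for",
                new String[] { in, tmp1 }, ManualChainSketch.class, StreamExample.Average.class);
        forJob.waitForCompletion(true);

        SkeletonJob ifJob = SkeletonJob.createSkeletonJob("if",
                new String[] { tmp1, tmp2 }, ManualChainSketch.class, StreamExample.choice.class);
        ifJob.waitForCompletion(true);

        SkeletonJob whileJob = SkeletonJob.createSkeletonJob("while",
                new String[] { tmp2, out }, ManualChainSketch.class, StreamExample.differ.class);
        whileJob.waitForCompletion(true);

        // The intermediate folders have to be removed by hand; the streaming API does this internally.
        FileSystem fs = FileSystem.get(new Configuration());
        fs.delete(new Path(tmp1), true);
        fs.delete(new Path(tmp2), true);
    }
}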
6.2.5.1 Results of Evaluating Streaming

Table 6.5 presents the results of the three implementations that were used for the evaluation of the Streaming API. The input files that were used for evaluating For were also used here.

Input Size (GB)   Streaming Execution Time (min:sec)   Non-Streaming Execution Time (min:sec)   MapReduce Execution Time (min:sec)
1                 1:38                                 1:43                                     1:58
5                 1:52                                 1:57                                     4:26
10                2:45                                 2:45                                     6:24
15                3:13                                 3:12                                     7:28

Table 6.5: Execution times of the Streaming, Non-Streaming and MapReduce implementations.

We have already evaluated the three skeleton operations of For, If and While, so it


comes as no surprise that the overall execution time is low. What this essentially shows is that the streaming API works: the jobs are coordinated well, after the execution of the program no intermediate output folder remains, and the results are semantically correct, all with very good performance. Of course, how good that performance is can only be judged against a version that does not use the streaming API; only then can we determine for sure whether the streaming API introduces any kind of overhead. The numbers in the Non-Streaming column of Table 6.5 show that not only does the streaming API introduce no overhead in pipelining the jobs, but in some cases it even results in faster programs. Admittedly, the difference is so small that over repeated runs it holds no real substance, but it is evident that the streaming API aids the user in streaming MapReduce and/or Skeleton jobs without any performance cost whatsoever. The only thing that remains is to test the implementation that uses the original MapReduce library offered by Hadoop. The rightmost column of Table 6.5 contains its results. By now, these results should come as no surprise: using the algorithmic skeletons on problems that naturally fit their pattern has proved to be more efficient than using traditional MapReduce, the only exception being a Parallel While job on a very small input. As a result, creating a streamed job out of MapReduce jobs results in an excessive amount of code that is not as efficient as the cleaner, more concise program based on the skeletons library. In Figure 6.5, we present a comparison between the three implementations in the form of a chart.


Figure 6.5: Comparison of the Streaming, Non-Streaming and MapReduce implementations

6.3 Level of Expressiveness

An important factor in the evaluation of the skeletons library is the level of expressiveness offered to the user. In other words, we want to determine whether, with the skeletons library, more high-level programming models are offered to the user. Apart from providing skeleton operations that result in more efficient programs, we want the user to achieve this in an easier and more user-friendly way than programming with MapReduce. In the previous chapters on design and implementation, we described thoroughly our approach to providing the user with a far easier to use API for writing programs. What is needed now is to test whether this was achieved. First of all, Appendix B of this report includes a detailed API of the skeletons library, from which the reader can see that the API of the skeleton operations is considerably simpler than the corresponding MapReduce one. We include here, however, two small programs, one that uses MapReduce and one that uses a skeleton operation, for a quick comparison. These


programs perform the same computation and we can see that the skeleton program is cleaner, more concise and has fewer lines of code.
import java.io.IOException;
import java.util.StringTokenizer;
import skeletons.SkeletonJob;
import skeletons.For;

public class ForExample {

    public static class Average extends For {
        public String For(String value) throws IOException, InterruptedException {
            int count = 0;
            double num, sum = 0, average;
            StringTokenizer itr = new StringTokenizer(value.toString().toUpperCase());
            while (itr.hasMoreTokens()) {
                num = Double.parseDouble(itr.nextToken());
                sum += num;
                count++;
            }
            average = sum / count;
            value = "" + average;
            WriteOut(value);
            return value;
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.err.println("Usage: ForExample <in> <out>");
            System.exit(2);
        }
        SkeletonJob job = SkeletonJob.createSkeletonJob("ForExample", args,
                ForExample.class, Average.class);
        job.waitForCompletion(true);
    }
}

Sample program that uses a Skeleton operation.


import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MRForExample {

    public static class ForMapper extends Mapper<Object, Text, Text, Text> {
        private final static Text dummy = new Text("");
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            int count = 0;
            double num, sum = 0, average;
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                num = Double.parseDouble(itr.nextToken());
                sum += num;
                count++;
            }
            average = sum / count;
            word.set("" + average);
            context.write(word, dummy);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = args;
        if (otherArgs.length != 2) {
            System.err.println("Usage: MRForExample <in> <out>");
            System.exit(2);
        }
        Job job = new Job(conf, "MRForExample");
        job.setJarByClass(MRForExample.class);
        job.setMapperClass(ForMapper.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        job.waitForCompletion(true);
    }
}

Sample program that uses MapReduce.

To begin with, the skeletons API does not need to be parameterised with the data types of the input and output pairs. More specifically, the user of the skeleton operations does not deal with input and output pairs at all, and the same applies to the data types of Hadoop. In the skeleton operations the input is a String: the main skeleton operation is called for every line of the file, and the output is also given as a String that represents a line of the final output. In the original MapReduce implementation, however, the user must specify the data types of the input and output pairs and must manipulate the input and output as pairs. What is more, the data types should be those of Hadoop and not those of Java. For instance, if the user wants the output to be a pair of Strings, the pair must be defined as Text, since Text is Hadoop's data type for String. This can easily confuse programmers, especially those without much experience in Hadoop, and results in more complex code than using the skeleton operations. The skeletons library provides a far higher-level programming model for specifying the computation needed to process large amounts of data; of course, this computation should fit one of the abstract parallel patterns of the provided skeletons. Next, we examine whether setting up, configuring and submitting the job that realises this computation is easier and cleaner with the skeleton operations or with MapReduce. In MapReduce, to set up a job, an object of the class Job must be defined and initialised along with a configuration object. Then, the classes of the Mapper, the


Reducer and the main class of the jar must be passed as parameters via specific methods of the Job class. In addition, the data types of the output pairs must also be set with specific methods. Finally, the input and output folders must be specified. All this results in excessive and complex lines of code, with a high possibility of error on the part of the programmer. On the other hand, setting up a Skeleton job requires less effort and fewer lines of code. More specifically, the user need only declare a SkeletonJob object and call the appropriate method that creates a skeleton job, passing the necessary parameters. The SkeletonJob class then takes care of the rest, setting up and configuring the job; the user need only submit it to the framework. As a result, one line of code achieves what takes approximately ten in the case of MapReduce, and it is also less error-prone. One may argue that by abstracting away so much information we gain in expressiveness but lose in configurability; however, the new SkeletonJob API also provides the equivalent methods, so the user can configure a job in a similar way as in MapReduce. Overall, the new API provides a more concise and simpler way to specify computations and to set up and configure jobs. Table 6.6 shows the lines of code that were needed to program the simple programs we used for evaluation in the previous sections. It is clear that by using the API of the skeletons library the resulting program has fewer lines of code and is simpler and cleaner.

Type of program   Lines of code with Skeletons   Lines of code with MapReduce
For               45                             58
Sort              38                             101
While             47                             55
If                51                             62
Streaming         123                            143

Table 6.6: Comparison between Skeletons and MapReduce in terms of lines of code.


6.4 Summary

In this chapter, we evaluated our skeletons library against the existing MapReduce implementation using the criteria we identified in the earlier sections. We also discussed shortcomings of our initial implementation concerning the While skeleton and presented improvements. While the skeletons library may not be complete, we hope that our implementation will lead to a library that contains more algorithmic skeletons, resulting in a far more expressive framework.

Chapter 7 Conclusion
Over the course of this project, we analysed work that has been done in the field of large-scale data processing, focusing on MapReduce and its open source implementation Hadoop. We found that the computational paradigm of MapReduce, although generic and a proven, powerful tool for analysing and processing vast amounts of data, is just one of many abstract parallel algorithmic patterns that could be implemented over Hadoop. It was our feeling that providing more algorithmic skeletons over Hadoop would enhance it and result in a far more expressive and powerful software framework. Our purpose was to accomplish this by providing the user with a far easier to use API than the one of traditional MapReduce. In a way, we wanted to add another level of indirection, and in this context we also implemented a streaming API for setting up MapReduce and/or Skeleton jobs in sequence. In this chapter, we present the outcomes, challenges and lessons we have learnt from this project. We end by suggesting improvements which we feel would make the skeletons library we have developed more complete and perhaps lead to it being included in a future release of Hadoop.

7.1 Summary

At the beginning of this project, we first identified what a framework that supports storing and processing large amounts of data offers its user. We then clearly stated the goals of this project and exactly what we wanted to implement. In Chapters 2 and 3 we gave extensive background on algorithmic skeletons in general and on the ones we implemented in particular. In addition, we thoroughly described Hadoop and MapReduce, since this is the software framework on top of which we designed

and developed the skeletons library. Particularly important is Section 3.1, where the MapReduce package of Hadoop is described. It has been mentioned a significant number of times in this report that the skeletons library offers another level of indirection: internally, the framework still executes a MapReduce job. As a result, studying and understanding the package that implements MapReduce is more than useful when someone wants to design a higher level of interaction with the system. Moreover, in Chapter 4 we discussed the design of the skeleton operations and the streaming mechanism over Hadoop. We took into consideration the features of the four skeletons that we implemented along with the features of Hadoop. In essence, what we wanted to accomplish in that chapter was to present our design approach in enough detail that it could be used by someone else who wants to implement the skeletons library. Additionally, we explained certain design decisions we made and the reasoning behind them, and at times presented alternative strategies along with the reasons why we did not follow them. In Chapter 5 we moved a step further by presenting the final implementation of the skeletons library; furthermore, we outlined problems that arose and how we dealt with them, at a lower level than in Chapter 4. Ultimately we developed a skeletons library with four skeleton operations and a streaming API. This library was evaluated thoroughly, as shown in Chapter 6, and proved to be more than just a new programming model: it proved to be more efficient than MapReduce on numerous occasions and, what is more, with a more user-friendly, more concise and less error-prone API.

7.2 Challenges

We began this project with limited knowledge about Hadoop and its infrastructure. While we had used it before for a number of data-processing computations, we had no knowledge of MapReduce's implementation over it, and understanding the way MapReduce functions over Hadoop proved to be a challenging task. The official API of Hadoop was invaluable in the early phases of the project in ensuring we understood the subject matter in enough depth to begin designing a solution. Developing the skeletons library was a big step up from developing traditional Java applications. The decision to create a library on top of an already implemented computational paradigm raised various issues, such as inconsistencies between the algorithmic pattern of MapReduce and the parallel patterns of other skeletons. We chose to implement skeleton operations that felt natural over Hadoop but, more importantly,


patterns which we could, in a manner of speaking, transform into a MapReduce job. We had to consult the official Hadoop site for information regarding the framework and its implementation, and various websites and papers for information on Algorithmic Skeletons and details concerning their patterns. Upon deciding which skeleton operations were best fit for implementation, we looked carefully into the implementation of these skeletons in different libraries and frameworks. This gave us an idea of the possible approaches we could follow for the task at hand.

7.3 Lessons Learned

Perhaps the most valuable lesson to take away from this project is a greater appreciation of the various aspects of software projects. Throughout this project, we have thoroughly studied the implementation of an extremely advanced and complex software framework that is used for data storage and data processing by many large companies and corporations. We have read numerous published papers and articles on MapReduce, Hadoop and Algorithmic Skeletons, their implementation and their uses. We have gained a deeper understanding of concepts like parallel and distributed programming, as well as programming models and how to offer one that is versatile and easy for the user to work with. Furthermore, we understood and managed to change the data flow of MapReduce to suit our own purposes. Moreover, studying what can only be considered advanced code has taught us more about both the Java programming language and object-oriented programming. Finally, we acquired the skills necessary to offer our implementation in the form of a Java package, as a library of operations.

7.4 Future Work

This project resulted in a library containing four skeleton operations (For, If, Sort and While) along with a streaming mechanism over Hadoop. Although the implementation proved to be more effective than the existing MapReduce for certain kinds of problems, there is still room for improvement, and in this section we present our suggestions.


The first and probably most obvious suggestion for future work is the implementation of even more algorithmic skeletons over Hadoop and their inclusion in the skeletons library. There are numerous parallel algorithmic skeletons, from divide and conquer to task farm, fork and many more. In the limited time that was available to us, we managed to implement four skeletons, which leaves many more that can be implemented over Hadoop. This would ultimately result in a far more expressive framework and thus an even better tool for processing Big Data. Moreover, the original description of algorithmic skeletons provides for nesting: instead of a computation on the data, another skeleton can be applied, and so on. Achieving the nesting of skeletons over Hadoop would be a great advancement of this tool. Of course, nesting, and especially generic nesting, is far from a trivial task. Many aspects must be carefully examined to determine whether it is even possible in the current releases, because nesting relies heavily on the scheduler and on whether it can support nested jobs. As was mentioned in Section 2.2.1, the scheduler was refactored out of Hadoop, which added the ability to use an alternative scheduler; perhaps a scheduler can be designed and implemented that supports generic nesting. At this point it is useful to note why we refer to generic nesting and not simple nesting: certain types of nesting can already be achieved. An example is the streaming mechanism we implemented and included in the skeletons library. Essentially, this mechanism is a nested For skeleton; for every line of the input, a user-defined number and sequence of skeleton operations is applied to it. Similar nested mechanisms can probably be implemented, but the real breakthrough would be the support of generic nesting. Finally, it may prove helpful to implement skeleton operations over software frameworks other than Hadoop that are designed for storing and processing vast amounts of data. Hadoop may be the most popular open source realisation of Google's GFS and MapReduce, but other similar frameworks that have been developed or are under development could also gain functionality from a library of skeleton operations.

Appendix A Setting up Hadoop so that it supports the implemented skeletons


This appendix aims to provide the reader with all the necessary information for setting up a Skeleton job. We will present how to compile and run a program that uses one or several of the skeleton operations we implemented. In addition, we will describe some example programs that set up Skeleton jobs.

A.1 How to use the Skeletons library

The skeletons library that we implemented for the purposes of this project is not included in the Hadoop release and, as a result, a programmer who wants to use a skeleton operation needs to take a number of specific steps. Firstly, the programmer must obtain the skeletons library (skeletons.jar) and use its API to write a program that uses skeleton operations. In the following section sample skeleton programs are presented, and in Appendix B the complete API of the skeletons library is given, providing all the necessary information for writing skeleton jobs. The skeletons library and the Hadoop core jar file must be used for compiling the program. This can be done with the javac command, using the -classpath flag to define two classpaths (the Hadoop core and the skeletons library). Afterwards, the class or classes of the program must be packaged into an executable jar file. This jar file must also contain a folder with the skeleton classes; the easiest way to achieve this is to extract the skeletons library jar file and use the skeletons folder contained in it.

Finally, the executable jar file created in the previous steps can now be used on the Hadoop cluster. In essence, the Skeleton program is executed and run like any other MapReduce program on the Hadoop cluster. The skeletons library is contained in the executable file and thus the framework takes care of the rest.

A.1.1 Alternative methods

It is worth noting that alternative ways of providing the skeletons library to the user exist. The obvious one is to include the skeletons package in the Hadoop release; of course, this must be done by the Apache Software Foundation and it is up to them whether or not they carry it out. Moreover, the jar file can be installed on the cluster, which is up to the cluster administrator. The easiest way is to place the JAR into the $HADOOP_HOME/lib directory, as everything in this directory is included when a Hadoop daemon starts. Since the code that launches the Hadoop job uses the same library, the programmer will need to include the JAR in the HADOOP_CLASSPATH environment variable as well. A key note here is that the library jar must exist on every node of the cluster, as the execution of the program is distributed across the nodes of the cluster.

A.2 Examples of setting up Skeleton jobs

In this section we present examples of programs that set up Skeleton jobs. These examples show how much cleaner and more concise writing a program that uses a skeleton operation is. Moreover, they can be used as a walk-through by programmers wishing to use the skeletons library to write their own programs.

A.2.1 Example of setting up a For job

import java.io.IOException;
import java.util.StringTokenizer;
import skeletons.SkeletonJob;
import skeletons.For;

public class ForExample {

    public static class Average extends For {
        public String For(String value) throws IOException, InterruptedException {
            int count = 0;
            double num, sum = 0, average;
            StringTokenizer itr = new StringTokenizer(value.toString().toUpperCase());
            while (itr.hasMoreTokens()) {
                num = Double.parseDouble(itr.nextToken());
                sum += num;
                count++;
            }
            average = sum / count;
            value = "" + average;
            WriteOut(value);
            return value;
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.err.println("Usage: ForExample <in> <out>");
            System.exit(2);
        }
        SkeletonJob job = SkeletonJob.createSkeletonJob("ForExample", args,
                ForExample.class, Average.class);
        job.waitForCompletion(true);
    }
}

A.2.2 Example of setting up a Sort job

import java.io.IOException;
import skeletons.Sort;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.fs.Path;

public class SortExample {

    public static void main(String[] args) throws Exception {
        Job job;
        if (args.length != 4) {
            System.out.println("ERROR: Sort needs 4 parameters!");
            System.exit(-1);
        }
        job = Sort.PSort(args[0], args[1]);
        FileInputFormat.addInputPath(job, new Path(args[2]));
        FileOutputFormat.setOutputPath(job, new Path(args[3]));
        job.waitForCompletion(true);
    }
}

A.2.3 Example of setting up an If job

import java.io.IOException;
import java.util.StringTokenizer;
import skeletons.SkeletonJob;
import skeletons.If;

public class IfExample {

    public static class choice extends If {

        public void IfTrueCase(String value) throws IOException, InterruptedException {
            double num = Double.parseDouble(value.toString());
            num++;
            value = "" + num;
            WriteOut(value);
        }

        public void IfFalseCase(String value) throws IOException, InterruptedException {
            double num = Double.parseDouble(value.toString());
            num--;
            value = "" + num;
            WriteOut(value);
        }

        public boolean checkCondition(String record, String[] args) {
            double num = Double.parseDouble(record);
            if (num % 2 == 0)
                return true;
            else
                return false;
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("Usage: IfExample <in> <out>");
            System.exit(2);
        }
        SkeletonJob job = SkeletonJob.createSkeletonJob("IfExample", args,
                IfExample.class, choice.class);
        job.waitForCompletion(true);
    }
}

A.2.4 Example of setting up a Streaming job

import java.io.IOException;
import java.util.StringTokenizer;
import skeletons.SkeletonJob;
import org.apache.hadoop.mapreduce.Job;
import skeletons.For;
import skeletons.If;
import skeletons.While;
import skeletons.Streaming;

public class StreamExample {

    public static class Average extends For {
        public String For(String value) throws IOException, InterruptedException {
            int count = 0;
            double num, sum = 0, average;
            StringTokenizer itr = new StringTokenizer(value.toString().toUpperCase());
            while (itr.hasMoreTokens()) {
                num = Double.parseDouble(itr.nextToken());
                sum += num;
                count++;
            }
            average = sum / count;
            value = "" + average;
            WriteOut(value);
            return value;
        }
    }

    public static class choice extends If {

        public void IfTrueCase(String value) throws IOException, InterruptedException {
            double num = Double.parseDouble(value.toString());
            num++;
            value = "" + num;
            WriteOut(value);
        }

        public void IfFalseCase(String value) throws IOException, InterruptedException {
            double num = Double.parseDouble(value.toString());
            num--;
            value = "" + num;
            WriteOut(value);
        }

        public boolean checkCondition(String record, String[] args) {
            double num = Double.parseDouble(record);
            if (num % 2 == 0)
                return true;
            else
                return false;
        }
    }

    public static class differ extends While {

        public String PWhile(String record, boolean condition) throws IOException, InterruptedException {
            if (condition) {
                double num = Double.parseDouble(record.toString());
                num = num / 2;
                record = "" + num;
            } else {
                WriteOut(record);
            }
            return record;
        }

        public boolean checkCondition(String record, String[] args) {
            double num = Double.parseDouble(record);
            if (Math.abs(num - 0) > 0.25)
                return true;
            else
                return false;
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("Usage: StreamExample <in> <out>");
            System.exit(2);
        }
        SkeletonJob job;
        Streaming streamJobs = new Streaming(3);
        streamJobs.setInputPath(args[0]);
        streamJobs.setOutputPath(args[1]);
        job = SkeletonJob.createSkeletonJob("for", StreamExample.class, Average.class);
        streamJobs.Stream(1, (Job) job);
        job = SkeletonJob.createSkeletonJob("if", StreamExample.class, choice.class);
        streamJobs.Stream(2, (Job) job);
        job = SkeletonJob.createSkeletonJob("while", StreamExample.class, differ.class);
        streamJobs.Stream(3, (Job) job);
        streamJobs.executeStreamingJobs();
    }
}

Appendix B The API of the new package


In this appendix, we present the API of the four implemented skeletons, along with the one for setting up streaming jobs. It aims to aid programmers who wish to use the newly implemented parallel algorithmic models for developing programs and applications. Note that only the classes and methods that may be of use to the programmer are presented here.

B.1 API of Skeleton Job

The class SkeletonJob is an extension of the existing class Job. It is used by the parallel skeletons For, While and If. When the programmer has to set up a job for one of these three skeletons, a SkeletonJob must first be set up and configured. There are two ways to accomplish this. The first is to use the method createSkeletonJob() with the necessary parameters, and everything is taken care of by the class internally. The alternative way is similar to setting up a MapReduce job, and the corresponding skeleton methods are provided. Since SkeletonJob is an extension of the class Job, all the methods of Job can also be used, although this is likely to prove useful only in rare cases.


Class SkeletonJob

Constructor Summary:
  SkeletonJob()
  SkeletonJob(Configuration conf)
  SkeletonJob(Configuration conf, String jobName)

Method Summary:
  void setSkeletonClass(Class<? extends Mapper> cls)
      Sets the Skeleton class for the skeleton job.
  SkeletonJob createSkeletonJob(String jobName, String[] args, Class<?> jarCls, Class<? extends Mapper> cls)
      Creates a Skeleton Job according to the parameters.
  SkeletonJob createSkeletonJob(String jobName, Class<?> jarCls, Class<? extends Mapper> cls)
      Creates a Skeleton Job according to the parameters.
  void setInputPath(SkeletonJob job, String path)
      Specifies the input path of the Skeleton Job.
  void setOutputPath(SkeletonJob job, String path)
      Specifies the output path of the Skeleton Job.
Table B.1: SkeletonJob API
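As a rough illustration of the second, more manual way, the following is a sketch based only on the methods listed in Table B.1. The Average class is the one from the ForExample listing in Appendix A, and we assume setInputPath and setOutputPath are static, since they take the job as a parameter.

import org.apache.hadoop.conf.Configuration;
import skeletons.SkeletonJob;

public class ManualSkeletonSetup {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        SkeletonJob job = new SkeletonJob(conf, "ForExample");
        job.setSkeletonClass(ForExample.Average.class);   // the user's For extension
        job.setJarByClass(ManualSkeletonSetup.class);     // inherited from Job
        SkeletonJob.setInputPath(job, args[0]);
        SkeletonJob.setOutputPath(job, args[1]);
        job.waitForCompletion(true);
    }
}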

B.2 API of For

This is the first parallel skeleton that we implemented for the purposes of this project. Basically, the user needs to create an extension of the class, overriding the main method that represents For's functionality.


Class For

Method Summary:
  void setup(Context context)
      Called once at the beginning of the task.
  String For(String value)
      Called once for each value in the input split.
  void WriteOut(String value)
      Writes a value at the output.
  void run(Context context)
      Expert users can override this method for more complete control over the execution of the Skeleton.
  void cleanup()
      Called once at the end of the task.
Table B.2: For API

B.3 API of Sort

Sort differs from the other skeletons in a number of ways. The most significant one is that its job type is not SkeletonJob but Job. Moreover, the user does not need to set it up and configure it, but rather calls a method of the class Sort, passing as parameters the type of data and the sort order. An object of the class Job is then returned, which corresponds to the appropriate sort job. The user then has to add the input and output paths using methods of the class Job.

Class Sort

Method Summary:
  Job PSort(String dataType, String sortType)
      Creates, configures and returns an appropriate sort job.
Table B.3: Sort API


B.4 API of While

In the skeleton While, we implemented the condition interface; as a result, the methods of that interface must be implemented as well. As was the case for For, when programmers want to create a While job, they must first create and configure a SkeletonJob and then define their own extended class of While, overriding the methods they require.

Class While

Method Summary:
  void setup(Context context)
      Called once at the beginning of the task.
  String PWhile(String record, boolean condition)
      Called for each value in the input split as long as the condition holds true.
  void WriteOut(String value)
      Writes a value at the output.
  void run(Context context)
      Expert users can override this method for more complete control over the execution of the Skeleton.
  void cleanup()
      Called once at the end of the task.
  public void initializeArguments()
      Initializes the arguments that may be needed for the condition check.
  public boolean checkCondition(String record, String[] args)
      Returns the result of the condition check.
  public String[] setArgs(String[] args)
      Changes the arguments accordingly after the end of each iteration.
Table B.4: While API
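Appendix A does not include a standalone While example, so for completeness the following sketch shows a minimal While job; the differ logic is the one used in the streaming example of Appendix A, and the job setup follows the same pattern as ForExample.

import java.io.IOException;
import skeletons.SkeletonJob;
import skeletons.While;

public class WhileExample {
    public static class differ extends While {
        public String PWhile(String record, boolean condition)
                throws IOException, InterruptedException {
            if (condition) {
                double num = Double.parseDouble(record);
                num = num / 2;          // keep halving while the condition holds
                record = "" + num;
            } else {
                WriteOut(record);       // condition no longer holds: emit the final value
            }
            return record;
        }

        public boolean checkCondition(String record, String[] args) {
            double num = Double.parseDouble(record);
            return Math.abs(num) > 0.25;   // iterate until the value is close enough to zero
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("Usage: WhileExample <in> <out>");
            System.exit(2);
        }
        SkeletonJob job = SkeletonJob.createSkeletonJob("WhileExample", args,
                WhileExample.class, differ.class);
        job.waitForCompletion(true);
    }
}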

B.5 API of If

The If skeleton is similar to the While skeleton in the sense that it also uses the notion of a condition. As a result, the condition interface must be implemented here as well.


What is more, If has two methods providing the main functionality: one is executed when the condition is true and the other when it is false. As before, a SkeletonJob must first be configured and then an extension of the class If defined, with the user overriding the methods necessary for their purposes.

Class If

Method Summary:
  void setup(Context context)
      Called once at the beginning of the task.
  void IfTrueCase(String value, Context context)
      Called for each value in the input split if the condition is true.
  void IfFalseCase(String value, Context context)
      Called for each value in the input split if the condition is false.
  void WriteOut(String value)
      Writes a value at the output.
  void run(Context context)
      Expert users can override this method for more complete control over the execution of the Skeleton.
  void cleanup()
      Called once at the end of the task.
  public void initializeArguments()
      Initializes the arguments that may be needed for the condition check.
  public boolean checkCondition(String record, String[] args)
      Returns the result of the condition check.
  public String[] setArgs(String[] args)
      Changes the arguments accordingly after the end of each iteration.
Table B.5: If API

B.6 API of Streaming

This is the final API to be presented. It is not another skeleton but rather an effort to put them all together. More specifically, this API aims to offer a way for a programmer to schedule a sequence of Skeleton jobs by defining them along


with the input and output folders of the whole collection. In a sense, what is offered is a way to compose an arbitrary number of Skeleton jobs in an easy and simple manner. A final note is that, apart from the four skeletons we implemented in this project (For, Sort, While and If), MapReduce jobs are also supported for streaming. The key idea is to be able to combine any kind of existing Job, and even skeletons that may be implemented in the future along the same lines as the ones presented in this thesis.

Class Streaming

Constructor Summary:
  Streaming(int num)
  Streaming()

Method Summary:
  void setNumOfJobs(int num)
      Set the number of jobs before submitting the stream of jobs.
  void setInputPath(String path)
      Set the input folder for the sequence of jobs before submitting them.
  void setOutputPath(String path)
      Set the output folder for the sequence of jobs before submitting them.
  void Stream(int priority, Job job)
      Pass a Job along with its number in the sequence. If the job is of type SkeletonJob, it must be cast to Job.
  void cleanup()
      Called once at the end of the task.
  boolean executeStreamingJobs()
      Executes the stream of jobs submitted by the user.
Table B.6: Streaming API
