
GLOBAL INSTITUTE OF TECHNOLOGY

Subject Name with code: Big Data Analytics (BDA) (7AID4-01)

Presented by: Mr. Hemant Mittal

Assistant Professor, CSE

Branch and SEM - IT/ VII SEM

Department of

Computer Science & Engineering

UNIT-3

Hadoop Input/output

Outline

3.1 The Writable Interface

3.2 Comparable and Comparators

3.3 Writable Classes

3.4 Writable Wrappers for Java primitives

3.5 BytesWritable

3.6 NullWritable

3.7 Implementing a Custom Writable
UNIT 3

INTRODUCTION TO HADOOP AND JAVA

Hadoop is an Apache open-source framework written in Java that allows distributed processing
of large datasets across clusters of computers using simple programming models.
A Hadoop framework application works in an environment that provides distributed storage
and computation across clusters of computers.

Requirements
Hadoop Java versions: Apache Hadoop 2.7 and later requires Java 7. It is built and
tested on both OpenJDK and Oracle (HotSpot) JDK/JRE. Earlier versions (2.6 and earlier)
support Java 6.

Hadoop Java Versions


 Apache Hadoop 3.3 and later supports Java 8 and Java 11 (runtime only). Please compile
Hadoop with Java 8; compiling Hadoop with Java 11 is not supported (HADOOP-16795: Java 11
compile support, open).
 Apache Hadoop 3.0 to 3.2 supports only Java 8.
 Apache Hadoop 2.7.x and later 2.x releases support Java 7 and 8.
Hence, to learn Java for Hadoop, you need to focus on the following basic concepts
of Java:
1. Object-Oriented Programming concepts like Objects and Classes.
2. Error/Exception Handling.
3. Reading and Writing files – this is the most important for Hadoop.
4. Arrays.
5. Collections.
6. Control Flow Statements.
7. Serialization

Reducer class

The Reducer in Hadoop MapReduce reduces a set of intermediate values that share a key to a
smaller set of values. In the MapReduce job execution flow, the Reducer takes the set of
intermediate key-value pairs produced by the mapper as its input.
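A minimal sketch of such a reducer, assuming the familiar word-count style of summing IntWritable values per key (the class name SumReducer is illustrative, not taken from these notes):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();                       // unwrap the Writable to a Java int
        }
        context.write(key, new IntWritable(sum));     // emit the reduced value for this key
    }
}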

Writable and its Importance in Hadoop
Writable is an interface in Hadoop. Writable wrappers exist for almost all the primitive data
types of Java; that is how Java's int becomes IntWritable in Hadoop and Java's String becomes
Text in Hadoop.
Writables are used for creating serialized data types in Hadoop. So, let us start by understanding
what data types, interfaces and serialization are.

Data Type
A data type is a set of data with values having predefined characteristics. There are several kinds
of data types in Java, for example int, short, byte, long, char, etc. These are called
primitive data types. Each of these primitive data types is bound to a class called a wrapper class;
for example, int is wrapped by the Integer class, short by Short, and so on. These
wrapper classes are predefined in Java.

Interface in Java
An interface in Java is a completely abstract type. The methods within an interface are abstract
methods that have no body, and the fields within an interface are public, static and final,
which means that the fields cannot be modified. The structure of an interface resembles that of
a class, but we cannot create an object of an interface; the only way to use an interface is to
implement it in another class using the 'implements' keyword.

Serialization
Serialization is nothing but converting raw data into a stream of bytes which can travel across
networks and can reside on different systems. Serialization is not the only concern of the
Writable interface; it also has to support comparison and sorting operations in Hadoop.

Why are Writable Introduced in Hadoop?


Now the question is whether Writables are necessary for Hadoop. The Hadoop framework definitely
needs the Writable type of interface in order to perform the following tasks:

 Implement serialization
 Transfer data between clusters and networks
 Store the deserialized data on the local disk of the system

Implementing Writable is similar to implementing any interface in Java: it is done by
simply writing the keyword 'implements' and overriding the Writable methods. Writable is
a compact serialization format in Hadoop that greatly reduces the data size,
so that data can be exchanged easily across the network. It has separate readFields and write methods to
read data from the network and write data to the local disk, respectively. Every data type used
as a key or value in Hadoop should implement the Writable (and, for keys, Comparable) interface.

How can Writable be implemented in Hadoop?


When we write a key as IntWritable in the Mapper class and send it to the Reducer class, there is
an intermediate phase between the Mapper and Reducer, i.e., shuffle and sort, where each
key has to be compared with many other keys. If the keys are not comparable, the shuffle and
sort phase cannot be executed, or may be executed with a high amount of overhead.

The steps to make a custom type in Java are as follows:

public class Add {

    int a;
    int b;

    public Add(int a, int b) {
        this.a = a;
        this.b = b;
    }
}

public interface Writable {

    void readFields(DataInput in);

    void write(DataOutput out);
}

Here, readFields reads the data from the network and write writes the data to the local disk. Both
are necessary for transferring data through the cluster. The DataInput and DataOutput interfaces (part of
java.io) contain methods to serialize the most basic types of data. Suppose we want to make a
composite key in Hadoop by combining two fields; then follow the steps below:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

public class Add implements Writable {

    public int a;
    public int b;

    public Add(int a, int b) {
        this.a = a;
        this.b = b;
    }

    public void write(DataOutput out) throws IOException {
        out.writeInt(a);
        out.writeInt(b);
    }

    public void readFields(DataInput in) throws IOException {
        a = in.readInt();
        b = in.readInt();
    }

    public String toString() {
        return Integer.toString(a) + ", " + Integer.toString(b);
    }
}

Thus we can create our custom Writable in a way similar to custom types in Java, but with two
additional methods, write and readFields. The custom Writable can travel across the network and
can reside on other systems. Instances of this custom type cannot be compared with each other by
default, so we also need to make them comparable. As explained above, if a key is declared
as IntWritable, it is comparable by default because a RawComparator registered for that type
compares the key with the other keys during the shuffle; without it the comparison phase cannot
run. By default, IntWritable, LongWritable and Text have a RawComparator which can execute
this comparison phase for them. Then, will a RawComparator help our custom Writable? The answer
is no. So, we need WritableComparable. WritableComparable is a subinterface of Writable which
also has the features of Comparable.

We need to make our custom type comparable if we want to compare it with other instances of the same type.

If we want to use our custom type as a key, then we should definitely make the key type
WritableComparable rather than simply Writable. This enables the custom type to be compared
with other keys and sorted accordingly. Otherwise, the keys will not be compared with
each other and are just passed through the network.

What happens if Writable Comparable is not present?


If we make our custom type Writable rather than WritableComparable, our data will not be
compared with other data. There is no compulsion for a custom type to be
WritableComparable unless it is used as a key, because values do not need to be compared with
each other the way keys do. If our custom type is a key, then it should be WritableComparable
or else the data will not be sorted.

How can Writable Comparable be implemented in Hadoop?


The implementation of WritableComparable is similar to Writable but with an additional
compareTo method inside it.

public interface WritableComparable extends Writable, Comparable {
    void readFields(DataInput in);
    void write(DataOutput out);
    int compareTo(WritableComparable o);
}

How to make our custom type, Writable Comparable?


We can make a custom type WritableComparable by following the method below:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

public class Add implements WritableComparable<Add> {

    public int a;
    public int b;

    public Add(int a, int b) {
        this.a = a;
        this.b = b;
    }

    public void write(DataOutput out) throws IOException {
        out.writeInt(a);
        out.writeInt(b);
    }

    public void readFields(DataInput in) throws IOException {
        a = in.readInt();
        b = in.readInt();
    }

    public int compareTo(Add c) {
        int presentValue = this.a;
        int compareValue = c.a;
        return (presentValue < compareValue ? -1 : (presentValue == compareValue ? 0 : 1));
    }

    public int hashCode() {
        return (31 * a) ^ b;
    }
}

These readFields and write methods make the exchange and comparison of data across the network faster. With
Writable and WritableComparable in Hadoop, we can build our serialized custom types
with little difficulty, which makes it easy for developers to create custom types based on their
requirements. Writable is the interface mechanism for serializing and de-serializing your data: it is an
interface that has a write method and a readFields method.

* write is for writing your data to the output stream or network.

* readFields is for reading data from the input stream.

Introduction: Hadoop MapReduce computations use Writable interface based classes
as their data types. These Writable data types are used throughout the MapReduce
data flow, starting from reading the input data, transferring intermediate data
between Map and Reduce, and then writing the output data. Several Writable data types are available,
so we have to choose appropriate types for the input, intermediate and output data. Choosing the
right data type will enhance the performance and programmability of your MapReduce programs.

Writable Interface Functions

* A data type must implement the org.apache.hadoop.io.Writable interface in order to be
used as a Value data type of a MapReduce computation.

* It is the only interface that defines how a value should be serialized and de-serialized in
Hadoop for transmitting and storing data.

Coding

package org.apache.hadoop.io;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

public interface Writable {

    void write(DataOutput out) throws IOException;

    void readFields(DataInput in) throws IOException;
}

Writable Comparable Interfaces

* A data type must implement the org.apache.hadoop.io.WritableComparable<T> interface
in order to be used as a Key data type of a MapReduce computation.

* It adds comparison functionality to the Writable interface for sorting purposes.

Example

public interface WritableComparable<T> extends Writable, Comparable<T> { }

The comparison contract comes from the Comparable interface, shown below.

public interface Comparable<T> {
    int compareTo(T obj);
}

The compareTo() method compares the current object with the given object: it returns a negative
value if the current object is less than the given one, zero if they are equal, and a positive
value if it is greater.

The Writable and WritableComparable interfaces are provided in the org.apache.hadoop.io package.

Data Type constraints for Key-Value pair in Map Reduce

There are two basic constraints that must be satisfied by the data types used for the Key and
Value fields in Hadoop MapReduce. The Writable interface must be implemented by any
data type used for a Value field in a Mapper or Reducer. The WritableComparable interface (which
extends Writable) must be implemented by any data type used for a Key field in a Mapper or
Reducer, so that keys of this type can be compared with each other for sorting purposes.
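As a sketch of how these constraints appear in code (the class name WordMapper and the word-splitting logic are illustrative assumptions), every key type parameter below is a WritableComparable and every value type parameter is a Writable:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Keys (LongWritable, Text) are WritableComparable; values (Text, IntWritable) are Writable.
public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        for (String word : line.toString().split("\\s+")) {
            context.write(new Text(word), ONE);   // Text key, IntWritable value
        }
    }
}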

Writable Classes – Hadoop Data Types

1. Primitive Writable Classes

Hadoop provides classes that implement the Writable and WritableComparable
interfaces by wrapping the Java primitive types. These classes are provided in the
org.apache.hadoop.io package, and each of these Hadoop wrapper classes has a get() and a set()
method to fetch and store the wrapped value.

Hadoop provides the following primitive writable data types:

IntWritable

VIntWritable

FloatWritable

LongWritable

VLongWritable

DoubleWritable

BooleanWritable

ByteWritable

Note: After serialization, the Hadoop primitive writables have the same size as the corresponding
Java data types; an IntWritable occupies 4 bytes and a LongWritable 8 bytes.
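A small sketch of the get()/set() accessors mentioned above, using IntWritable (the value is arbitrary):

import org.apache.hadoop.io.IntWritable;

public class IntWritableDemo {
    public static void main(String[] args) {
        IntWritable writable = new IntWritable();   // empty wrapper
        writable.set(163);                          // store the wrapped int
        System.out.println(writable.get());         // prints 163
    }
}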

2. Array Writable Classes

There are two array writable classes available in Hadoop, one for single-dimensional
and another for two-dimensional arrays:

ArrayWritable

TwoDArrayWritable

3. Map Writable Classes

The three data types listed below are map writable classes, which implement the
java.util.Map interface.

AbstractMapWritable – acts as the abstract base class for the other map writable classes.

MapWritable – a general-purpose class for mapping Writable keys to Writable values.

SortedMapWritable – a variant of MapWritable that implements the SortedMap interface.
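A minimal usage sketch of MapWritable; the key and value contents here are made up for illustration:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

public class MapWritableDemo {
    public static void main(String[] args) {
        MapWritable map = new MapWritable();
        map.put(new Text("count"), new IntWritable(3));   // Writable key -> Writable value
        Writable value = map.get(new Text("count"));
        System.out.println(value);                        // prints 3
    }
}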

4. Other Writable Classes

4.1 NullWritable

NullWritable represents a null value in MapReduce. When we do not want to read or
write a key or a value, that field can be declared as NullWritable; if a field is
declared as NullWritable, no bytes are read or written for it.

4.2 Text

The Text writable class is the Hadoop equivalent of java.lang.String; unlike Java's String data type, Text in Hadoop
is mutable. Its maximum size is 2 GB.

4.3 BytesWritable

It is a wrapper for an array of binary data.

4.4 ObjectWritable

This is a general-purpose object wrapper.

It can store any object such as Java primitives, Writables, Strings, nulls, or arrays.

4.5 GenericWritable

It is similar to the ObjectWritable class but supports only a few, predefined data types.

“That’s all about the Writable and WritableComparable interfaces.”

Comparable and Comparator

Java provides two interfaces to sort objects using data members of the class:
1. Comparable
2. Comparator
Using Comparable Interface
A comparable object is capable of comparing itself with another object. The class itself must
implement the java.lang.Comparable interface to compare its instances.
Consider a Movie class that has members such as rating, name and year. Suppose we wish to sort a list
of Movies based on year of release. We implement the Comparable interface in the Movie
class and override its compareTo() method.
// A Java program to demonstrate use of Comparable
import java.io.*;
import java.util.*;

// A class 'Movie' that implements Comparable


class Movie implements Comparable<Movie>
{
private double rating;
private String name;
private int year;

// Used to sort movies by year


public int compareTo(Movie m)
{
return this.year - m.year;
}

// Constructor
public Movie(String nm, double rt, int yr)
{
this.name = nm;
this.rating = rt;
this.year = yr;
}

// Getter methods for accessing private data


public double getRating() { return rating; }
public String getName() { return name; }
public int getYear() { return year; }
}

// Driver class
class Main

{
public static void main(String[] args)
{
ArrayList<Movie> list = new ArrayList<Movie>();
list.add(new Movie("Force Awakens", 8.3, 2015));
list.add(new Movie("Star Wars", 8.7, 1977));
list.add(new Movie("Empire Strikes Back", 8.8, 1980));
list.add(new Movie("Return of the Jedi", 8.4, 1983));

Collections.sort(list);

System.out.println("Movies after sorting : ");


for (Movie movie: list)
{
System.out.println(movie.getName() + " " +
movie.getRating() + " " +
movie.getYear());
}
}
}
Movies after sorting:
Star Wars 8.7 1977
Empire Strikes Back 8.8 1980
Return of the Jedi 8.4 1983
Force Awakens 8.3 2015
Now, suppose we want to sort movies by their rating and name as well. When we make a collection
element comparable (by having it implement Comparable), we get only one chance to implement
the compareTo() method.
Using Comparator
Unlike Comparable, Comparator is external to the element type we are comparing. It’s a separate
class. We create multiple separate classes (that implement Comparator) to compare by different
members.
The Collections class has a second sort() method that takes a Comparator. This sort() method
invokes compare() to sort objects.
To compare movies by rating, we need to do 3 things:
1. Create a class that implements Comparator (and thus the compare() method that does the
work previously done by compareTo()).
2. Make an instance of the Comparator class.
3. Call the overloaded sort () method, giving it both the list and the instance of the class that
implements Comparator.
//A Java program to demonstrate Comparator interface
import java.io.*;
import java.util.*;

// A class 'Movie' that implements Comparable


class Movie implements Comparable<Movie>
{
private double rating;
private String name;
private int year;

// Used to sort movies by year


public int compareTo(Movie m)
{
return this.year - m.year;
}

// Constructor
public Movie(String nm, double rt, int yr)
{
this.name = nm;
this.rating = rt;
this.year = yr;
}

// Getter methods for accessing private data


public double getRating() { return rating; }
public String getName() { return name; }
public int getYear() { return year; }
}

// Class to compare Movies by ratings


class RatingCompare implements Comparator<Movie>
{
public int compare(Movie m1, Movie m2)
{
if (m1.getRating() < m2.getRating()) return -1;
if (m1.getRating() > m2.getRating()) return 1;
else return 0;
}
}

// Class to compare Movies by name


class NameCompare implements Comparator<Movie>
{
public int compare(Movie m1, Movie m2)
{
return m1.getName().compareTo(m2.getName());

}
}

// Driver class
class Main
{
public static void main(String[] args)
{
ArrayList<Movie> list = new ArrayList<Movie>();
list.add(new Movie("Force Awakens", 8.3, 2015));
list.add(new Movie("Star Wars", 8.7, 1977));
list.add(new Movie("Empire Strikes Back", 8.8, 1980));
list.add(new Movie("Return of the Jedi", 8.4, 1983));

// Sort by rating : (1) Create an object of ratingCompare


// (2) Call Collections.sort
// (3) Print Sorted list
System.out.println("Sorted by rating");
RatingCompare ratingCompare = new RatingCompare();
Collections.sort(list, ratingCompare);
for (Movie movie: list)
System.out.println(movie.getRating() + " " +
movie.getName() + " " +
movie.getYear());

// Call overloaded sort method with NameCompare


// (Same three steps as above)
System.out.println("\nSorted by name");
NameCompare nameCompare = new NameCompare();
Collections.sort(list, nameCompare);
for (Movie movie: list)
System.out.println(movie.getName() + " " +
movie.getRating() + " " +
movie.getYear());

// Uses Comparable to sort by year


System.out.println("\nSorted by year");
Collections.sort(list);
for (Movie movie: list)
System.out.println(movie.getYear() + " " +
movie.getRating() + " " +
movie.getName()+" ");
}
}
Output :

Sorted by rating
8.3 Force Awakens 2015
8.4 Return of the Jedi 1983
8.7 Star Wars 1977
8.8 Empire Strikes Back 1980

Sorted by name
Empire Strikes Back 8.8 1980
Force Awakens 8.3 2015
Return of the Jedi 8.4 1983
Star Wars 8.7 1977

Sorted by year
1977 8.7 Star Wars
1980 8.8 Empire Strikes Back
1983 8.4 Return of the Jedi
2015 8.3 Force Awakens
 Comparable is meant for objects with a natural ordering, which means the object itself must
know how it is to be ordered; for example, roll numbers of students. With the Comparator
interface, sorting is done through a separate class.
 Logically, the Comparable interface compares the "this" reference with the specified object,
while a Comparator in Java compares two different objects of the provided class.
 If any class implements the Comparable interface in Java, then a collection of that object
(either a List or an array) can be sorted automatically by using Collections.sort() or Arrays.sort(),
and the objects will be sorted based on the natural order defined by the compareTo
method.

Difference between comparable and comparator

Java Wrapper Classes

Wrapper classes provide a way to use primitive data types (int, boolean,
etc.) as objects.

Sometimes you must use wrapper classes, for example when working with Collection objects,
such as ArrayList, where primitive types cannot be used (the list can only store objects):

Example

ArrayList<int> myNumbers = new ArrayList<int>();          // Invalid

ArrayList<Integer> myNumbers = new ArrayList<Integer>();  // Valid

Creating Wrapper Objects

To create a wrapper object, use the wrapper class instead of the primitive type. To get the value,
you can just print the object:

Example

public class MyClass {

    public static void main(String[] args) {

        Integer myInt = 5;
        Double myDouble = 5.99;
        Character myChar = 'A';

        System.out.println(myInt);
        System.out.println(myDouble);
        System.out.println(myChar);
    }
}

Wrapper Classes in Java


A Wrapper class is a class whose object wraps or contains primitive data types. When we create
an object to a wrapper class, it contains a field and in this field, we can store primitive data types.
In other words, we can wrap a primitive value into a wrapper class object.
Need of Wrapper Classes
1. They convert primitive data types into objects. Objects are needed if we wish to modify the
arguments passed into a method (because primitive types are passed by value).
2. The classes in java.util package handle only objects and hence wrapper classes help in this
case also.
3. Data structures in the Collection framework, such as ArrayList and Vector, store only
objects (reference types) and not primitive types.
4. An object is needed to support synchronization in multithreading.

Primitive Data types and their Corresponding Wrapper class

Autoboxing: Automatic conversion of primitive types to objects of their corresponding
wrapper classes is known as autoboxing. For example, conversion of int to Integer, long to
Long, double to Double, etc.
Example:
// Java program to demonstrate Autoboxing

import java.util.ArrayList;

class Autoboxing {

public static void main(String[] args) {

char ch = 'a';

// Autoboxing- primitive to Character object conversion

Character a = ch;

ArrayList<Integer> arrayList = new ArrayList<Integer>();

// Autoboxing because ArrayList stores only objects

arrayList.add(25);

// printing the values from object

System.out.println(arrayList.get(0)); } }

Unboxing: It is just the reverse process of autoboxing. Automatically converting an object of a
wrapper class to its corresponding primitive type is known as unboxing. For example,
conversion of Integer to int, Long to long, Double to double, etc.
// Java program to demonstrate Unboxing

import java.util.ArrayList;

class Unboxing {

public static void main(String[] args) {

Character ch = 'a';

// unboxing - Character object to primitive conversion

char a = ch;

ArrayList<Integer> arrayList = new ArrayList<Integer>();

arrayList.add(24);

// unboxing because get method returns an Integer object

int num = arrayList.get(0);

// printing the values from primitive data types

System.out.println(num); } }

Output:
24

// Java program to demonstrate wrapping and unwrapping
// in Java classes

class WrappingUnwrapping {

    public static void main(String args[]) {

        // byte data type
        byte a = 1;
        // wrapping around Byte object
        Byte byteobj = new Byte(a);

        // int data type
        int b = 10;
        // wrapping around Integer object
        Integer intobj = new Integer(b);

        // float data type
        float c = 18.6f;
        // wrapping around Float object
        Float floatobj = new Float(c);

        // double data type
        double d = 250.5;
        // wrapping around Double object
        Double doubleobj = new Double(d);

        // char data type
        char e = 'a';
        // wrapping around Character object
        Character charobj = e;

        // printing the values from objects
        System.out.println("Values of Wrapper objects (printing as objects)");
        System.out.println("Byte object byteobj: " + byteobj);
        System.out.println("Integer object intobj: " + intobj);
        System.out.println("Float object floatobj: " + floatobj);
        System.out.println("Double object doubleobj: " + doubleobj);
        System.out.println("Character object charobj: " + charobj);

        // objects to data types (retrieving data types from objects)
        // unwrapping objects to primitive data types
        byte bv = byteobj;
        int iv = intobj;
        float fv = floatobj;
        double dv = doubleobj;
        char cv = charobj;

        // printing the values from data types
        System.out.println("Unwrapped values (printing as data types)");
        System.out.println("byte value, bv: " + bv);
        System.out.println("int value, iv: " + iv);
        System.out.println("float value, fv: " + fv);
        System.out.println("double value, dv: " + dv);
        System.out.println("char value, cv: " + cv);
    }
}
Output:
Values of Wrapper objects (printing as objects)
Byte object byteobj: 1
Integer object intobj: 10
Float object floatobj: 18.6
Double object doubleobj: 250.5
Character object charobj: a
Unwrapped values (printing as data types)
byte value, bv: 1
int value, iv: 10
float value, fv: 18.6
double value, dv: 250.5
char value, cv: a

Writable data types are meant for writing data to the local disk, and Writable is a serialization format.
Just as Java has data types for variables (int, float, long, double, etc.), Hadoop has
its own equivalent data types called Writable data types. These Writable data types are passed as
parameters (input and output key-value pairs) to the mapper and reducer. The Writable data
types discussed below implement the WritableComparable interface: the Comparable part is used
for comparison when the reducer sorts the keys, and the Writable part writes the result to the local disk.
Hadoop does not use Java Serializable because it is too heavyweight for Hadoop;
Writable serializes Hadoop objects in a very lightweight way. WritableComparable is a
combination of the Writable and Comparable interfaces.

Below is the list of few data types in Java along with the equivalent Hadoop variant:

Integer –> IntWritable: the Hadoop variant of Integer, used to pass integer numbers as keys or values.

Float –> FloatWritable: the Hadoop variant of Float, used to pass floating-point numbers as keys or values.

Long –> LongWritable: the Hadoop variant of Long, used to store long values.

Short –> ShortWritable: the Hadoop variant of Short, used to store short values.

Double –> DoubleWritable: the Hadoop variant of Double, used to store double values.

String –> Text: the Hadoop variant of String, used to pass character strings as keys or values.

Byte –> ByteWritable: the Hadoop variant of byte, used to store a sequence of bytes.

Null –> NullWritable: the Hadoop variant of null, used to pass null as a key or value. Usually
NullWritable is used as the data type for the output key of the reducer when the output key is not
important in the final result.
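A short sketch of moving between Java types and their Hadoop Writable counterparts (the values chosen are arbitrary):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

public class ConversionDemo {
    public static void main(String[] args) {
        IntWritable count = new IntWritable(42);   // Java int -> IntWritable
        int plain = count.get();                   // IntWritable -> Java int

        Text name = new Text("hadoop");            // Java String -> Text
        String str = name.toString();              // Text -> Java String

        System.out.println(plain + " " + str);
    }
}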

Data types

Some data types are used to store numbers, some are used to store text, and some are used for much more
complicated kinds of data.
The data types to know are:
 String (or str or text)
 Character (or char)
 Integer (or int)
 Float (or Real)
 Boolean (or bool)

Data Types

Data types control the type of information a column can contain and determine how the Data
Manager carries out particular operations. By assigning the correct data type to each column in a
table, you can implement and execute many operations more easily. Data types are significant in
working with a relational database because when you add or change data in a column, the new
data must be of the type specified for that column when the table was created. For example, you
cannot accidentally store a date in a salary field designated for the money data type.

There are four fundamental data types:

 Character (or text)


 Numeric
 Date
 Money

These data types may not all be compatible with Enterprise Access products. The character and
numeric data types are broken into subsets. For example, integers can be stored as 1-byte, 2-byte,
and 4-byte integers. Data types are named differently in SQL and QUEL. For more information,
see the appendix “Data Types” for Open SQL and QUEL data types.

c(n)

Fixed-length string of up to n printable ASCII characters, with non-printable characters
converted to blanks; n represents the lesser of the maximum configured row size and 32,000.

char(n)

Fixed-length string of up to n ASCII characters, including any non-printable
characters; n represents the lesser of the maximum configured row size and 32,000.

varchar(n)

Variable-length ASCII character string of up to n characters; n represents the lesser of the
maximum configured row size and 32,000.

text(n)

Variable-length string of up to n ASCII characters; n represents the lesser of the maximum
configured row size and 32,000.

float(n)

n-byte floating point; converted to a 4-byte or 8-byte floating-point data type.

float4

4-byte floating point; for numbers including decimal fractions, from 0.29x10**-38 to
1.7x10**38 with 7 digit precision.

float8

8-byte floating point; for numbers including decimal fractions, from 0.29x10**-38 to
1.7x10**38 with 16 digit precision.

Decimal

Exact numeric data type defined by its precision (total number of digits) and scale (number of
digits to the right of the decimal point). Precision must be between 1 and 31. Scale can be zero
(0) up to the maximum scale.

integer1

1-byte integer; for whole numbers ranging from -128 to +127.

integer2 or smallint

2-byte integer; for whole numbers ranging from -32,768 to +32,767.

integer4 or integer

4-byte integer; for whole numbers ranging from -2,147,483,648 to +2,147,483,647.

Money

8 byte monetary data; from -$999999999999.99 to +$999999999999.99.

Date

12 bytes; dates ranging from 1/1/1582 to 12/31/2382 for absolute dates and -800 years to +800
years for time intervals. A user-defined data type (UDT) is perceived and treated as a character
string. Some forms utilities do not support the long varchar, byte, byte varying, and long byte
data types. For more information, see each specific product for a discussion of how long varchar,
byte, byte varying, and long byte are handled.

How Character Data Is Displayed

Character data includes any of the character data types (varchar, char, c, text). You can display
character data types only in:

 Character (c) display format (in display-only and data entry fields)

 String template display format (in data entry fields only)

For the character (c) data display format, the default is left justification. To justify the contents of
a single line field to the right, left, or center, precede the data display symbol with a plus sign
(+), asterisk (*), or minus sign (-), respectively. For example, entering +c5 specifies a right
justified text field of five or fewer characters. You can use justification symbols with any data
type. If you specify a value for w as well as n, you can display text in column format. When
character fields contain more than one line, the first line of the field is filled with characters and
the second line is filled. This can produce line breaks in the middle of a word. For display only
fields, you can use the f or j option to specify that lines must break at the end of a word instead
of in the middle. For example, cf80.20 specifies a text field containing a total of 80 characters
with no line containing more than 20 characters and with breaks between words. To right justify
the contents of a display only multi-line character field, use the j flag. The j flag performs the
same function as the f flag, except that it pads the line with blanks between words to make the
right margin of each line come out even, like a column of text in a newspaper.

Wrapper class

What is the use of wrapper class?


A Wrapper class is a class which contains a primitive data type (int, char, short, byte, etc.). In
other words, wrapper classes provide a way to use primitive data types (int, char, short, byte, etc.)
as objects. These wrapper classes are part of the java.lang package.

What is a wrapper class give two examples of a wrapper class?


The eight primitive data types byte, short, int, long, float, double, char and boolean are not
objects. Wrapper classes are used for converting primitive data types into objects, such as int to
Integer, etc.
Wrapper class in Java.

Primitive Wrapper class

short Short

int Integer

long Long

float Float

How do you create a wrapper class?


Wrapper class Example: Primitive to Wrapper
1. //Java program to convert primitive into objects.
2. //Autoboxing example of int to Integer.
3. public class WrapperExample1 {
4. public static void main(String args[]) {
5. //Converting int into Integer.
6. int a = 20;
7. Integer i = Integer.valueOf(a); //converting int into Integer explicitly.
8. } }

String is not a wrapper class, simply because there is no parallel primitive type that it wraps.
A string is a representation of a char sequence but not necessarily a 'wrapper'. Autoboxing and
unboxing, for example, do not apply to String, but they do apply to primitives such as int, long,
etc.
Why are wrapper classes immutable?
The wrapper classes are immutable because it just makes no sense to be mutable. Consider
following code: int n = 5; n = 6; Integer N = new Integer(n); At first, it looks straightforward if
you can change the value of N, just like you can change the value of n.

Java data types

Primitive Data Types

The Java programming language is statically-typed, which means that all variables must first be
declared before they can be used. This involves stating the variable's type and name, as you've
already seen:

int gear = 1;

Doing so tells your program that a field named "gear" exists, holds numerical data, and has an
initial value of "1". A variable's data type determines the values it may contain, plus the
operations that may be performed on it. In addition to int, the Java programming language
supports seven other primitive data types. A primitive type is predefined by the language and is
named by a reserved keyword. Primitive values do not share state with other primitive values.
The eight primitive data types supported by the Java programming language are:

 Byte: The byte data type is an 8-bit signed two's complement integer. It has a minimum
value of -128 and a maximum value of 127 (inclusive). The byte data type can be useful
for saving memory in large arrays, where the memory savings actually matters. They can
also be used in place of int where their limits help to clarify your code; the fact that a
variable's range is limited can serve as a form of documentation.

Short: The short data type is a 16-bit signed two's complement integer. It has a minimum value
of -32,768 and a maximum value of 32,767 (inclusive). As with byte, the same guidelines apply:

you can use a short to save memory in large arrays, in situations where the memory savings
actually matters.

Int: By default, the int data type is a 32-bit signed two's complement integer, which has a
minimum value of -2^31 and a maximum value of 2^31-1. In Java SE 8 and later, you can use
the int data type to represent an unsigned 32-bit integer, which has a minimum value of 0 and a
maximum value of 2^32-1. Use the Integer class to use the int data type as an unsigned integer. See the
section The Number Classes for more information. Static methods like compareUnsigned,
divideUnsigned etc. have been added to the Integer class to support the arithmetic operations for
unsigned integers.

Long: The long data type is a 64-bit two's complement integer. The signed long has a minimum
value of -2^63 and a maximum value of 2^63-1. In Java SE 8 and later, you can use the long data
type to represent an unsigned 64-bit long, which has a minimum value of 0 and a maximum
value of 2^64-1. Use this data type when you need a range of values wider than those provided
by int. The Long class also contains methods like compareUnsigned, divideUnsigned etc. to
support arithmetic operations for unsigned long.

Float: The float data type is a single-precision 32-bit IEEE 754 floating point. Its range of values
is beyond the scope of this discussion, but is specified in the Floating-Point Types, Formats, and
Values section of the Java Language Specification. As with the recommendations
for byte and short, use a float (instead of double) if you need to save memory in large arrays of
floating point numbers. This data type should never be used for precise values, such as currency.
For that, you will need to use the java.math.BigDecimal class instead. Numbers and
Strings covers BigDecimal and other useful classes provided by the Java platform.

Double: The double data type is a double-precision 64-bit IEEE 754 floating point. Its range of
values is beyond the scope of this discussion, but is specified in the Floating-Point Types,
Formats, and Values section of the Java Language Specification. For decimal values, this data
type is generally the default choice. As mentioned above, this data type should never be used for
precise values, such as currency.
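A tiny illustration of why binary floating point is unsuitable for currency while java.math.BigDecimal is exact (the amounts are arbitrary):

import java.math.BigDecimal;

public class CurrencyDemo {
    public static void main(String[] args) {
        System.out.println(0.1 + 0.2);   // prints 0.30000000000000004 (binary rounding error)

        BigDecimal sum = new BigDecimal("0.1").add(new BigDecimal("0.2"));
        System.out.println(sum);         // prints exactly 0.3
    }
}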

Boolean: The Boolean data type has only two possible values: true and false. Use this data type
for simple flags that track true/false conditions. This data type represents one bit of information,
but its "size" isn't something that's precisely defined.

Char: The char data type is a single 16-bit Unicode character. It has a minimum value of
'\u0000' (or 0) and a maximum value of '\uffff' (or 65,535 inclusive).

In addition to the eight primitive data types listed above, the Java programming language also
provides special support for character strings via the java.lang.String class. Enclosing your
character string within double quotes will automatically create a new String object; for
example, String s = "this is a string";. String objects are immutable, which means that once
created, their values cannot be changed. The String class is not technically a primitive data type,
but considering the special support given to it by the language, you'll probably tend to think of it
as such. You'll learn more about the String class in Simple Data Objects.

Default Values

It's not always necessary to assign a value when a field is declared. Fields that are declared but
not initialized will be set to a reasonable default by the compiler. Generally speaking, this default
will be zero or null, depending on the data type. Relying on such default values, however, is
generally considered bad programming style.

The following chart summarizes the default values for the above data types.

Data Type Default Value (for fields)

byte 0

short 0

int 0

long 0L

float 0.0f

double 0.0d

char '\u0000'

String (or any object) null

boolean false

Advantages of using Null Writable in Hadoop

NullWritable is a special type of Writable, as it has a zero-length serialization. No bytes are
written to, or read from, the stream. It is used as a placeholder; for example, in MapReduce, a
key or a value can be declared as a NullWritable when you don't need to use that position; it
effectively stores a constant empty value. NullWritable can also be useful as a key in a
SequenceFile when you want to store a list of values, as opposed to key-value pairs. It is an
immutable singleton: the instance can be retrieved by calling NullWritable.get().
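A minimal sketch of a mapper that uses NullWritable for the value position (the class name LineKeyMapper and the choice of input types are assumptions for illustration):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits each input line as the key with an empty (NullWritable) value.
public class LineKeyMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        context.write(line, NullWritable.get());   // singleton instance, zero bytes serialized
    }
}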


Object Writable and Generic Writable

ObjectWritable is a general-purpose wrapper for the following: Java primitives, String, enum,
Writable, null, or arrays of any of these types. It is used in Hadoop RPC to marshal and
unmarshal method arguments and return types. ObjectWritable is useful when a field can be of
more than one type. For example, if the values in a SequenceFile have multiple types, you can
declare the value type as an ObjectWritable and wrap each type in an ObjectWritable. Being a
general-purpose mechanism, it wastes a fair amount of space because it writes the class name of
the wrapped type every time it is serialized. In cases where the number of types is small and
known ahead of time, this can be improved by having a static array of types and using the index
into the array as the serialized reference to the type. This is the approach that GenericWritable
takes, and you have to subclass it to specify which types to support.
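A hedged sketch of subclassing GenericWritable to declare a fixed set of supported types; the class name MyGenericWritable and the chosen types are assumptions, not taken from the source:

import org.apache.hadoop.io.GenericWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

// The position of each class in this array is what gets serialized,
// instead of the full class name.
public class MyGenericWritable extends GenericWritable {

    private static final Class<? extends Writable>[] TYPES =
            new Class[] { IntWritable.class, Text.class };

    @Override
    protected Class<? extends Writable>[] getTypes() {
        return TYPES;
    }
}

To wrap a value, call set() with the Writable instance and emit the GenericWritable; on the receiving side, get() returns the wrapped Writable.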

Writable collections

The org.apache.hadoop.io package includes six Writable collection types: ArrayWritable,
ArrayPrimitiveWritable, TwoDArrayWritable, MapWritable, SortedMapWritable, and
EnumSetWritable. ArrayWritable and TwoDArrayWritable are Writable implementations for arrays and
two-dimensional arrays (arrays of arrays) of Writable instances. All the elements of an
ArrayWritable or a TwoDArrayWritable must be instances of the same class, which is specified at
construction as follows: ArrayWritable writable = new ArrayWritable(Text.class); In contexts where the
Writable is defined by type, such as in SequenceFile keys or values or as input to MapReduce
in general, you need to subclass ArrayWritable (or TwoDArrayWritable, as appropriate) to set
the type statically, as in the example below.
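For instance, a subclass that fixes the element type to Text might look like the following sketch (the class name TextArrayWritable is illustrative):

import org.apache.hadoop.io.ArrayWritable;
import org.apache.hadoop.io.Text;

// Fixes the element type to Text so the framework can create instances reflectively.
public class TextArrayWritable extends ArrayWritable {
    public TextArrayWritable() {
        super(Text.class);
    }
}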

Writable classes for Java types

Writable is an interface in Hadoop, and types used in Hadoop must implement this interface. Hadoop
provides writable wrappers for almost all Java primitive types and some other types, but
sometimes we need to pass custom objects, and these custom objects should implement Hadoop's
Writable interface. Hadoop MapReduce uses implementations of Writable for interacting with
user-provided Mappers and Reducers.

To implement the Writable interface we require two methods:

public interface Writable {
    void readFields(DataInput in);
    void write(DataOutput out);
}

Why use Hadoop Writable(s)?

As we already know, data needs to be transmitted between different nodes in a distributed
computing environment. This requires serialization and deserialization of data to convert data
that is in a structured format to a byte stream and vice-versa. Hadoop therefore uses a simple and
efficient serialization protocol to serialize data between the map and reduce phases, and these
types are called Writable(s). Some examples of Writables, as already mentioned before, are
IntWritable, LongWritable, BooleanWritable, and FloatWritable.

The WritableComparable interface is just a subinterface of the Writable and java.lang.Comparable
interfaces. For implementing a WritableComparable we must have a compareTo method apart
from the readFields and write methods, as shown below:

public interface WritableComparable extends Writable, Comparable {
    void readFields(DataInput in);
    void write(DataOutput out);
    int compareTo(WritableComparable o);
}

Comparison of types is crucial for MapReduce, where there is a sorting phase during which keys
are compared with one another.

Implementing a comparator for WritableComparables, such as the org.apache.hadoop.io.RawComparator
interface, will definitely help speed up your Map/Reduce (MR) jobs. As you may
recall, an MR job is composed of receiving and sending key-value pairs. The process looks like
the following:

(K1, V1) –> Map –> (K2, V2)
(K2, List[V2]) –> Reduce –> (K3, V3)

The key-value pairs (K2, V2) are called the intermediary key-value pairs. They are passed from the
mapper to the reducer. Before these intermediary key-value pairs reach the reducer, a shuffle and
sort step is performed.

Raw Comparator

If you still want to optimize the time taken by a MapReduce job, then you have to use a
RawComparator. Intermediate key-value pairs are passed from the Mapper to the Reducer; before
these values reach the Reducer, the shuffle and sort steps are performed.

Sorting is improved because the RawComparator compares the keys byte by byte. If we did not
use a RawComparator, the intermediary keys would have to be completely deserialized to
perform a comparison.
Example:

public class IndexPairComparator extends WritableComparator {

    protected IndexPairComparator() {
        super(IndexPair.class);
    }

    @Override
    public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        int i1 = readInt(b1, s1);
        int i2 = readInt(b2, s2);

        int comp = (i1 < i2) ? -1 : (i1 == i2) ? 0 : 1;
        if (0 != comp)
            return comp;

        int j1 = readInt(b1, s1 + 4);
        int j2 = readInt(b2, s2 + 4);
        comp = (j1 < j2) ? -1 : (j1 == j2) ? 0 : 1;

        return comp;
    }
}

In the above example, we did not directly implement RawComparator. Instead we extended
WritableComparator, which internally implements RawComparator.

The default implementation of compare() in WritableComparator simply deserializes both keys
and then compares the objects:

public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
    try {
        buffer.reset(b1, s1, l1);   // parse key1
        key1.readFields(buffer);

        buffer.reset(b2, s2, l2);   // parse key2
        key2.readFields(buffer);
    } catch (IOException e) {
        throw new RuntimeException(e);
    }
    return compare(key1, key2);     // compare them
}

The shuffle is the assignment of the intermediary keys (K2) to reducers and the sort is the sorting
of these keys. By implementing a RawComparator to compare the intermediary keys, this extra
effort greatly improves sorting. Sorting is improved because the RawComparator compares the
keys byte by byte. If we did not use a RawComparator, the intermediary keys would have to be
completely deserialized to perform a comparison.

1) WritableComparables can be compared to each other, typically via Comparators. Any type
which is to be used as a key in the Hadoop MapReduce framework should implement this
interface.

2) Any type which is to be used as a value in the Hadoop MapReduce framework should
implement the Writable interface.

Custom Writable in Hadoop


Today we are going to see how to implement a custom Writable in Hadoop. But before we get
into that, let us understand some basics and get the motivation behind implementing a custom
Writable.

What is a Writable in Hadoop?

If you have gone through the "Hello World" of MapReduce (the word count program), or any other Hadoop program,
you must have seen data types different from the regular Java data types. In the word count
program, you must have seen LongWritable, IntWritable and Text. It is fairly easy to understand the
relation between them and Java's primitive types: LongWritable is equivalent to long,
IntWritable to int, and Text to String.


Limitation of primitive Hadoop Writable classes

In the word count example we emit Text as the key and IntWritable as the value from the
Mappers and Reducers. Although Hadoop provides many primitive Writables that can be used in
simple applications like word count, clearly these cannot serve our purpose all the time.

Consider a scenario where we would like to transmit a 3-D point as a value from the
Mappers/Reducers. The structure of the 3D point would be like,

class Point3D {

    public float x;

    public float y;

    public float z;
}

Now if you still want to use the primitive Hadoop Writable(s), you would have to convert the
value into a string and transmit it. However, it gets very messy when you have to deal with
string manipulations.

Also, what if you want to transmit this as a key? Since Hadoop does the
sorting and shuffling automatically, the point would then get sorted based on string values,
which would not be correct. So clearly we need to write custom data types that can be used in
Hadoop.

Custom Writable

Any user-defined class that implements the Writable interface is a custom Writable. So let
us first look into the structure of the Writable interface:

public interface Writable {

    void readFields(DataInput in);

    void write(DataOutput out);
}

The class implementing this interface must provide the implementation of these two methods
at the very least. Let us now look into these two methods in detail. write(DataOutput out)
is used to serialize the fields of the object to 'out'.
readFields(DataInput in) is used to deserialize the fields of the object from 'in'. However,
we need a custom WritableComparable if our custom data type is going to be used as a key rather
than a value. We then need the class to implement the WritableComparable interface. The
WritableComparable interface extends the Writable interface and the Comparable interface;
its structure is as given below:

public interface WritableComparable extends Writable, Comparable {

    void readFields(DataInput in);

    void write(DataOutput out);

    int compareTo(WritableComparable o);
}

compareTo(WritableComparable o) is inherited from the Comparable interface, and it allows
Hadoop to sort the keys in the sort and shuffle phase.

Bigram Count Example

Let us now look at the bigram count example, which will solidify the concepts that we
have learnt till now in this post. This example is a good extension of the word count
example, and will also teach us how to write a custom Writable. A sketch of such a custom key
type is given below.
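The full bigram count code is not reproduced in these notes, so the following is only a hedged sketch of what the custom key might look like: a pair of adjacent words implemented as a WritableComparable. The class name BigramWritable and its details are assumptions, not taken from the source.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;

// A pair of adjacent words used as a composite MapReduce key.
public class BigramWritable implements WritableComparable<BigramWritable> {

    private Text first = new Text();
    private Text second = new Text();

    public BigramWritable() { }

    public BigramWritable(String first, String second) {
        this.first.set(first);
        this.second.set(second);
    }

    public void write(DataOutput out) throws IOException {
        first.write(out);          // delegate serialization to the wrapped Text objects
        second.write(out);
    }

    public void readFields(DataInput in) throws IOException {
        first.readFields(in);
        second.readFields(in);
    }

    public int compareTo(BigramWritable other) {
        int cmp = first.compareTo(other.first);   // order by first word, then second
        return (cmp != 0) ? cmp : second.compareTo(other.second);
    }

    @Override
    public int hashCode() {
        return first.hashCode() * 163 + second.hashCode();   // used by the default partitioner
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof BigramWritable)) return false;
        BigramWritable other = (BigramWritable) o;
        return first.equals(other.first) && second.equals(other.second);
    }

    @Override
    public String toString() {
        return first + " " + second;
    }
}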

Common Rules for creating custom Hadoop Writable Data Type

A custom Hadoop writable data type which needs to be used as a value field in MapReduce
programs must implement the Writable interface, org.apache.hadoop.io.Writable. MapReduce key
types should have the ability to compare against each other for sorting purposes. A custom
Hadoop writable data type that can be used as a key field in MapReduce programs must
implement the WritableComparable interface, which in turn
extends Writable (org.apache.hadoop.io.Writable) and Comparable (java.lang.Comparable).
So a data type created by implementing the WritableComparable interface can be
used as either a key or a value field data type. Since a data type implementing WritableComparable
can be used for key or value fields in MapReduce programs, let us define a custom
data type which can be used for both key and value fields. In this post, let us create a custom data
type to process web logs from a server and count the occurrences of each IP address. In this
sample, let us consider a web log record with five fields: Request Number, Site URL, Request Date,
Request Time and IP address. A sample record from the web log file is shown below.

1127248 /rr.html 2014-03-10 12:32:08 42.416.153.181

We can treat the fields of the above record as built-in Writable data types forming a new
custom data type. We can consider the Request Number as IntWritable and the other four fields as Text
data types. The complete input file Web_Log.txt used in this post is attached here.
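The notes do not include the actual class, so here is a hedged sketch of how such a composite type might be written; the class name WebLogWritable and its field names are assumptions for illustration.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;

// One web log record: request number plus four text fields.
public class WebLogWritable implements WritableComparable<WebLogWritable> {

    private IntWritable requestNo = new IntWritable();
    private Text siteUrl = new Text();
    private Text requestDate = new Text();
    private Text requestTime = new Text();
    private Text ipAddress = new Text();

    public void set(int no, String url, String date, String time, String ip) {
        requestNo.set(no);
        siteUrl.set(url);
        requestDate.set(date);
        requestTime.set(time);
        ipAddress.set(ip);
    }

    public void write(DataOutput out) throws IOException {
        requestNo.write(out);        // serialize each wrapped field in order
        siteUrl.write(out);
        requestDate.write(out);
        requestTime.write(out);
        ipAddress.write(out);
    }

    public void readFields(DataInput in) throws IOException {
        requestNo.readFields(in);    // deserialize in the same order
        siteUrl.readFields(in);
        requestDate.readFields(in);
        requestTime.readFields(in);
        ipAddress.readFields(in);
    }

    // Compare on the IP address so records can be grouped and counted per IP.
    public int compareTo(WebLogWritable other) {
        return ipAddress.compareTo(other.ipAddress);
    }

    @Override
    public int hashCode() {
        return ipAddress.hashCode();   // keep hashCode consistent with compareTo/equals
    }

    @Override
    public boolean equals(Object o) {
        return (o instanceof WebLogWritable)
                && ipAddress.equals(((WebLogWritable) o).ipAddress);
    }
}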

Map/Reduce (MR) Jobs

Implementing the org.apache.hadoop.io.RawComparator interface will definitely help speed
up your Map/Reduce (MR) jobs. As you may recall, an MR job is composed of receiving and
sending key-value pairs. The process looks like the following.

(K1, V1) –> Map –> (K2, V2)

(K2, List[V2]) –> Reduce –> (K3, V3)

The key-value pairs (K2, V2) are called the intermediary key-value pairs. They are passed
from the mapper to the reducer. Before these intermediary key-value pairs reach the reducer,
a shuffle and sort step is performed. The shuffle is the assignment of the intermediary keys
(K2) to reducers and the sort is the sorting of these keys. By implementing the
RawComparator to compare the intermediary keys, this extra effort will greatly improve
sorting. Sorting is improved because the RawComparator compares the keys byte by byte. If
we did not use a RawComparator, the intermediary keys would have to be completely
deserialized to perform a comparison.

Two ways you may compare your keys are by implementing
the org.apache.hadoop.io.WritableComparable interface or by implementing the RawComparator
interface. In the former approach, you compare (deserialized) objects, but in
the latter approach, you compare the keys using their corresponding raw bytes.

I conducted an empirical test to demonstrate the advantage of RawComparator over
WritableComparable. Let's say we are processing a file that has a list of pairs of indexes {i, j}. These
pairs of indexes could refer to the i-th and j-th matrix elements. The input data (file) will look
something like the following.

1, 2
3, 4
5, 6
...
0, 0

What we want to do is simply count the occurrences of each {i, j} pair of indexes. Our MR job
will look like the following.

(LongWritable, Text) –> Map –> ({i, j}, IntWritable)

({i, j}, List[IntWritable]) –> Reduce –> ({i, j}, IntWritable)

Method

The first thing we have to do is model our intermediary key K2 = {i, j}. Below is a snippet of the
IndexPair. As you can see, it implements WritableComparable. Also, we are sorting the keys
ascending by the i-th and then the j-th indexes.

public class IndexPair implements WritableComparable<IndexPair> {

    private IntWritable i;
    private IntWritable j;

    // ... write, readFields and compareTo omitted in this snippet
}

Below is a snippet of the RawComparator. As you notice, it does not directly implement
RawComparator. Rather, it extends WritableComparator (which implements RawComparator). We
could have directly implemented RawComparator, but by extending WritableComparator,
depending on the complexity of our intermediary key, we may use some of the utility methods of
WritableComparator.

public class IndexPairComparator extends WritableComparator {

    protected IndexPairComparator() {
        super(IndexPair.class);
    }

    public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        int i1 = readInt(b1, s1);
        int i2 = readInt(b2, s2);

        int comp = (i1 < i2) ? -1 : (i1 == i2) ? 0 : 1;
        if (0 != comp)
            return comp;

        int j1 = readInt(b1, s1 + 4);
        int j2 = readInt(b2, s2 + 4);
        comp = (j1 < j2) ? -1 : (j1 == j2) ? 0 : 1;

        return comp;
    }
}

As you can see in the above code, for the two objects we are comparing there are two
corresponding byte arrays (b1 and b2), the starting positions of the objects in the byte arrays, and
the length of the bytes they occupy. Please note that the byte arrays themselves represent other
things as well, not only the objects we are comparing. That is why the starting position and length are
also passed in as arguments. Since we want to sort ascending by i then j, we first compare the
bytes representing the i-th indexes and, if they are equal, we then compare the j-th indexes. You can
also see that we use the utility method readInt(byte[], start), inherited from WritableComparator.
This method simply converts the 4 consecutive bytes beginning at start into a
primitive int (a primitive int in Java is 4 bytes). If the i-th indexes are equal, then we shift the
starting point by 4, read in the j-th indexes and then compare them.

public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {

    String[] tokens = value.toString().split(",");

    int i = Integer.parseInt(tokens[0].trim());
    int j = Integer.parseInt(tokens[1].trim());

    IndexPair indexPair = new IndexPair(i, j);
    context.write(indexPair, ONE);
}

A snippet of the reducer is shown below.

public void reduce(IndexPair key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {

    int sum = 0;
    for (IntWritable value : values) {
        sum += value.get();
    }

    context.write(key, new IntWritable(sum));
}

The snippet of code below shows how I wired up the MR job that does NOT use the raw byte
comparator.

public int run(String[] args) throws Exception {

    Configuration conf = getConf();
    Job job = new Job(conf, "raw comparator example");
    job.setJarByClass(RcJob1.class);

    job.setMapOutputKeyClass(IndexPair.class);
    job.setMapOutputValueClass(IntWritable.class);
    job.setOutputKeyClass(IndexPair.class);
    job.setOutputValueClass(IntWritable.class);

    job.setMapperClass(RcMapper.class);
    job.setReducerClass(RcReducer.class);

    job.waitForCompletion(true);

    return 0;
}

The snippet of code below shows how I wired up the MR job using the raw byte comparator.

public int run(String[] args) throws Exception {

    Configuration conf = getConf();
    Job job = new Job(conf, "raw comparator example");
    job.setJarByClass(RcJob1.class);

    job.setSortComparatorClass(IndexPairComparator.class);

    job.setMapOutputKeyClass(IndexPair.class);
    job.setMapOutputValueClass(IntWritable.class);
    job.setOutputKeyClass(IndexPair.class);
    job.setOutputValueClass(IntWritable.class);

    job.setMapperClass(RcMapper.class);
    job.setReducerClass(RcReducer.class);

    job.waitForCompletion(true);

    return 0;
}

As you can see, the only difference is that in the MR job using the raw comparator, we explicitly
set its sort comparator class. I ran the MR jobs (without and with raw byte comparison) 10
times on a dataset of 4 million rows of {i, j} pairs. The runs were against Hadoop v0.20 in
standalone mode on Cygwin. The average running time for the MR job without raw byte
comparison was 60.6 seconds, and the average running time for the job with raw byte comparison
was 31.1 seconds. A two-tail paired t-test showed p < 0.001, meaning there is a statistically
significant difference between the two implementations in terms of empirical running time. I
then ran each implementation on datasets of increasing record sizes from 1, 2, ..., and 10 million
records. At 10 million records, the run without raw byte comparison took 127 seconds (over 2
minutes) to complete, while the run using raw byte comparison took 75 seconds (1 minute and 15
seconds) to complete. Below is a line graph.

I talked about how these intermediary key-value pairs are a bottleneck in your MR Job if there
are many of them emitted from the mapper and how to mitigate this issue using certain design
patterns. In this blog, I talked about another form of optimization dealing with intermediary key-
value pairs, and in particular, with the intermediary keys using raw byte comparison so as to
improve sorting.

1.3 Custom Comparators and Hashing

Frequently, objects in one Tuple are compared to objects in a second Tuple. This is especially
true during the sort phase of GroupBy and CoGroup in Cascading Hadoop mode. By default,
Hadoop and Cascading use the native Object methods equals() and hashCode() to compare two
values and get a consistent hash code for a given value, respectively.

To override this default behavior, you can create a custom java.util.Comparator class to perform
comparisons on a given field in a Tuple. For instance, to secondary-sort a collection of
custom Person objects in a GroupBy, use the Fields.setComparator () method to designate the
custom Comparator to the Fields instance that specifies the sort fields.

Alternatively, you can set a default Comparator to be used by a Flow, or used locally on a
given Pipe instance. There are two ways to do this.
Call FlowProps.setDefaultTupleElementComparator () on a Properties instance, or use the
property key cascading.flow.tuple.element.comparator.

If the hash code must also be customized, the custom Comparator can implement the
interface cascading.tuple.Hasher. For more information, see the Javadoc.

Comparator

A comparison function, which imposes a total ordering on some collection of objects.


Comparators can be passed to a sort method to allow precise control over the sort order.
Comparators can also be used to control the order of certain data structures (such as Sorted
Set or Sorted Map), or to provide an ordering for collections of objects that don't have
a Comparable. The ordering imposed by a comparator c on a set of elements S is said to
be consistent with equals if and only if c.compare (e1, e2) ==0 has the same boolean value
as e1.equals (e2) for every e1 and e2 in S. Caution should be exercised when using a comparator
capable of imposing an ordering inconsistent with equals to order a sorted set (or sorted map).
Suppose a sorted set (or sorted map) with an explicit comparator c is used with elements (or
keys) drawn from a set S. If the ordering imposed by c on S is inconsistent with equals, the
sorted set (or sorted map) will behave "strangely." In particular the sorted set (or sorted map) will
violate the general contract for set (or map), which is defined in terms of equals.

For example, suppose one adds two elements a and b such that (a.equals(b) && c.compare(a,
b) != 0) to an empty TreeSet with comparator c. The second add operation will return true (and
the size of the tree set will increase) because a and b are not equivalent from the tree set's
perspective, even though this is contrary to the specification of the Set.add method.
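A small, hedged illustration of this behavior, using an intentionally bad comparator based on identity hash codes (purely for demonstration):

import java.util.Comparator;
import java.util.TreeSet;

public class InconsistentComparatorDemo {
    public static void main(String[] args) {
        // Orders by object identity, which is inconsistent with String.equals().
        Comparator<String> byIdentity = Comparator.comparingInt(System::identityHashCode);
        TreeSet<String> set = new TreeSet<>(byIdentity);

        String a = new String("hadoop");
        String b = new String("hadoop");   // a.equals(b) is true, but compare(a, b) != 0

        set.add(a);
        set.add(b);
        System.out.println(set.size());    // normally prints 2, violating the usual Set contract
    }
}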

Note: It is generally a good idea for comparators to also implement java.io.Serializable, as they
may be used as ordering methods in serializable data structures (like TreeSet, TreeMap). In
order for the data structure to serialize successfully, the comparator (if provided) must
implement Serializable.
