Chapter 1

TABLE OF CONTENTS

1 Introduction
1.1 Existing Solutions
1.2 Motivation
1.3 Problem Statement
1.4 Contribution
1.5 Dissertation Outline

1 INTRODUCTION
At a time when the world's digital data exceeds a zettabyte (i.e., 10²¹ bytes), it is both a challenge and a necessity to develop powerful and efficient systems with enough storage to accommodate it. Even a very dense storage medium such as deoxyribonucleic acid (DNA) has so far been shown to hold no more than 700 terabytes per gram [1]. Considering that more than a ton of genetic material would be required to store a zettabyte of information in DNA, the argument can be made that data storage needs to be distributed for quantitative reasons alone. Among the qualitative arguments for distributed storage systems are, of course, reliability and availability. That is to say, distribution, possibly across geographically remote locations, and concurrent accessibility by millions of users are just as important as raw storage capacity.
Web giants like Google, Amazon, Facebook, and LinkedIn are among the industries that use distributed storage systems. To fulfill their requirements they have deployed thousands of data centers globally, so that they can not only keep their data available at all times but also scale at any time. Furthermore, at such scale, failures are inevitable, whether attributed to software or to hardware components that have been operating continuously for a long time. They need to be considered when planning solutions and handled in implementations.
In addition to inheriting all the challenges of large-scale distributed storage, there is also the concern of what kind of data is being stored and how it is accessed. Data stored in a distributed manner may be of any size, which poses many challenges for conventional data storage systems; these new horizons bring big data onto the scene.
There is no single, clear definition of big data; still, a widely accepted one is given by Edd Dumbill: "Big data is data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or doesn't fit the structures of your database architectures. To gain value from this data, you must choose an alternative way to process it."
The challenge is to handle big data when the available resources and naïve solutions no longer suffice.
When it comes to access, workloads are typically categorized as online transaction processing (OLTP)¹ and online analytical processing (OLAP)². These concepts are mostly used for structured database systems, although modern applications require more powerful and efficient solutions to process petabytes of data. This is well exemplified by MapReduce [2], originally implemented by Google, which today forms the basis of many open-source systems, such as Apache's Hadoop framework.
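
To make the model concrete, the following is a minimal single-process sketch of the MapReduce idea in Python. The function names and the toy word-count task are illustrative only; they correspond neither to Google's implementation nor to Hadoop's API.

    from collections import defaultdict

    def map_phase(document):
        # Map step: emit a (word, 1) pair for every word in the input.
        for word in document.split():
            yield (word.lower(), 1)

    def reduce_phase(key, values):
        # Reduce step: combine all values emitted for the same key.
        return key, sum(values)

    def mapreduce(documents):
        # Shuffle step: group intermediate pairs by key, as the framework
        # would do across many machines; here everything is one process.
        groups = defaultdict(list)
        for doc in documents:
            for key, value in map_phase(doc):
                groups[key].append(value)
        return dict(reduce_phase(k, v) for k, v in groups.items())

    print(mapreduce(["big data moves fast", "big data is big"]))
    # -> {'big': 3, 'data': 2, 'moves': 1, 'fast': 1, 'is': 1}

In a real deployment, the map and reduce steps would run on many machines in parallel, which is what makes the model suitable for petabyte-scale datasets.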

¹ OLTP is used for transaction-based data entry and retrieval.

² OLAP is used for analytical workloads.

The important aspect of this framework is that it operates in parallel on large datasets and long-running tasks. These datasets may be structured, unstructured, or semi-structured, and therefore require a new kind of database system that can deal with all kinds of data. To handle such heterogeneous datasets, NoSQL (Not Only SQL) databases were developed. NoSQL has the capability to process big data that has the potential to be mined for valuable information.
Individual industries have developed different NoSQL systems according to their requirements, for example Facebook's Cassandra [3], Amazon's Dynamo [4], Yahoo!'s PNUTS [5], Google's BigTable [6], or Riak and MongoDB [7]. Each of them uses shards to store its data. Since the main challenge is to handle big data across all globally deployed shards, NoSQL databases have come up with their own solutions.

1.1 EXISTING SOLUTIONS

In the following section, some of the solutions provided by NoSQL databases are described.


Facebook's Cassandra [12] [3] is a distributed database designed to store structured data as key-value pairs indexed by a key. Cassandra is highly scalable in both respects, storage and request throughput, while avoiding a single point of failure. Additionally, Cassandra stores data in the form of tables that are very similar to distributed multi-dimensional maps, likewise indexed by a key; it belongs to the column family, like BigTable [6]. It provides atomic per-replica operations on a single row key. In Cassandra, consistent hashing [13] is used for data partitioning, mapping keys to nodes in a manner similar to the Chord distributed hash table (DHT). The partitioned data is stored in a Cassandra cluster, which can be pictured as nodes placed on a ring, and the DHT over its keys facilitates load balancing.
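
As an illustration, the following is a minimal Python sketch of consistent hashing over a ring. The class and helper names are hypothetical and greatly simplified compared to Cassandra's actual partitioner.

    import hashlib
    from bisect import bisect_right

    def ring_hash(key):
        # Place a key (or node name) at a position on the hash ring.
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    class ConsistentHashRing:
        def __init__(self, nodes):
            # Each node occupies one point on the ring, sorted by position.
            self.points = sorted((ring_hash(n), n) for n in nodes)

        def node_for(self, key):
            # A key belongs to the first node clockwise from its position;
            # adding or removing one node only remaps neighboring keys.
            positions = [p for p, _ in self.points]
            idx = bisect_right(positions, ring_hash(key))
            return self.points[idx % len(self.points)][1]

    ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
    print(ring.node_for("user:42"))   # maps the key to one of the nodes

The property that matters here is locality: when a node joins or leaves, only the keys in its neighboring ring segment move, rather than the whole keyspace being reshuffled.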
Amazon's Dynamo [4] is a distributed key-value store that sacrifices consistency guarantees for scalability and availability. It uses a data-partitioning scheme similar to Cassandra's but does not use hashes to store chunks. Furthermore, Dynamo addresses the non-uniformity of node distribution on the ring with virtual nodes (vnodes) and provides a slightly different partition-to-vnode assignment strategy, which achieves a better distribution of load across the vnodes and thus over the physical nodes.
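
Extending the ring sketch above, a hedged illustration of the virtual-node idea: each physical node is hashed at many ring positions, which evens out the key ranges a physical node is responsible for. The names are again illustrative, not Dynamo's code.

    class VNodeRing(ConsistentHashRing):
        def __init__(self, nodes, vnodes_per_node=32):
            # Each physical node is hashed at many ring positions, so the
            # key ranges it owns are smaller and more evenly distributed.
            self.points = sorted(
                (ring_hash(f"{node}#vnode-{i}"), node)
                for node in nodes
                for i in range(vnodes_per_node)
            )

    ring = VNodeRing(["node-a", "node-b", "node-c"])
    print(ring.node_for("user:42"))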
Scatter [14] is a highly decentralized, distributed, consistent key-value store that provides linearizable consistency in the face of (some) failures. From the storage perspective, it uses the uniform key distribution through consistent hashing that is typical of DHTs, and it employs two mechanisms for load balancing (the first of which is sketched below). The first policy directs a node that newly joins the system to randomly sample k groups and then join the one handling the largest number of operations. The second policy allows neighboring groups to trade off their responsibility ranges based on the load distribution.
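
The first policy might look as follows; the group representation and the load metric here are hypothetical simplifications of the mechanism in the Scatter paper.

    import random

    def choose_group_to_join(groups, k=5):
        # Policy 1: sample k groups uniformly at random and join the one
        # handling the largest number of operations.
        sampled = random.sample(groups, min(k, len(groups)))
        return max(sampled, key=lambda g: g["ops_per_sec"])

    groups = [{"id": i, "ops_per_sec": random.randint(100, 10_000)}
              for i in range(20)]
    print(choose_group_to_join(groups)["id"])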
MongoDB [15] [7] is a schema-free, document-oriented database written in C++. MongoDB uses replication to provide data availability, and sharding to provide partition tolerance and to manage data across the distributed environment. Data is stored in the form of chunks, and a balancer is used to keep the chunks evenly distributed across all servers in the cluster. The balancer waits for a threshold of uneven chunk counts to be reached: if the difference in chunks between the least and the most loaded shard reaches 8, the balancer redistributes chunks until the difference between any two shards is down to 2 chunks [16].
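
A minimal sketch of this balancing rule follows. The thresholds 8 and 2 are taken from the description above, while the shard representation and the one-chunk-at-a-time loop are an illustrative approximation of MongoDB's balancer, not its source code.

    def balance(shards):
        # shards: dict mapping shard name -> number of chunks it holds.
        migrations = 0
        # The balancer only acts once the spread reaches the trigger
        # threshold of 8 chunks between the most and least loaded shard.
        if max(shards.values()) - min(shards.values()) < 8:
            return migrations
        # Migrate one chunk at a time from the most to the least loaded
        # shard until the difference between any two shards is down to 2.
        while max(shards.values()) - min(shards.values()) > 2:
            src = max(shards, key=shards.get)
            dst = min(shards, key=shards.get)
            shards[src] -= 1
            shards[dst] += 1
            migrations += 1
        return migrations

    shards = {"shard-1": 30, "shard-2": 10, "shard-3": 12}
    print(balance(shards), shards)
    # -> 12 {'shard-1': 18, 'shard-2': 17, 'shard-3': 17}

Note how many migrations even this small example triggers; the number of such chunk movements is exactly the cost this dissertation seeks to reduce.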


1.2 MOTIVATION
It is clear from the elaboration above that the development of flexible, scalable, and reliable distributed storage systems is a necessity. Of course, a large number of products are already available today, some RDBMSs and some NoSQL. Among RDBMSs, MySQL can be deployed in a distributed manner with replication and shards, or as MySQL Cluster, but its scalability is limited. Some have had to put in considerable effort to make their systems work with large datasets, as Facebook has done [8], for example.
On the NoSQL side, there are many systems, such as Google's BigTable, Riak³, Amazon's Dynamo, Facebook's Cassandra, and MongoDB. These systems scale very well by trading off consistency, availability, and partition tolerance according to the CAP theorem [11]. However, NoSQL databases do not provide transactional support like classical RDBMSs do, and another major problem faced by these systems is low utilization of the storage (shards). The load balancing techniques used in these systems do not treat chunk migration as a performance indicator; rather, they prefer high availability and partition tolerance as their key indicators. In this work, we aim to bridge this gap by proposing an improvement over existing load balancing techniques that takes into account shard utilization and data migration to increase the efficiency of NoSQL databases.

³ Riak is a NoSQL database that offers high availability, fault tolerance, operational simplicity, and scalability, and implements principles from Amazon's Dynamo and Google's BigTable.

1.3 PROBLEM STATEMENT

All NoSQL databases have an implicit load balancing mechanism to make sure that the load is evenly distributed among all the shards in the cluster. If any shard gets overloaded, the balancer redistributes chunks among the underloaded shards so that all the shards in the cluster become evenly loaded. In this work, we propose an improvement over the original MongoDB load balancing [12] that minimizes the number of chunk migrations and achieves better memory utilization of the shards.

1.4 CONTRIBUTION
In this dissertation, we design an algorithm for load balancing in NoSQL databases. Although load balancing algorithms exist for all NoSQL systems, we propose a new technique for MongoDB [12], a document-oriented type of NoSQL database. The load that reaches a MongoDB system is stored on multiple commodity servers that hold the data in a distributed manner. However, every host has a fixed amount of storage capacity. Therefore, the data needs to be spread over all hosts so that no single host becomes overloaded.
In our prototype implementation, we therefore simulate both algorithms, MongoDB's automatic load balancing algorithm and our improved version of it, and then compare the results with the help of charts and tables.
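
As an illustration of the methodology, a hypothetical skeleton of such a simulation is sketched below. All names are invented and the workload is synthetic; the improved balancer (detailed in Chapter 3) would simply be passed in place of the baseline `balance` function from the sketch in Section 1.1.

    import random

    def simulate(balancer, n_shards=5, n_rounds=100, seed=1):
        # Feed the balancer a synthetic, skewed insert workload and count
        # how many chunk migrations it performs in total.
        rng = random.Random(seed)
        shards = {f"shard-{i}": 0 for i in range(n_shards)}
        migrations = 0
        for _ in range(n_rounds):
            # New chunks arrive on random shards, unbalancing the cluster.
            shards[rng.choice(sorted(shards))] += rng.randint(1, 5)
            migrations += balancer(shards)
        return migrations

    # `balance` is the baseline sketch from Section 1.1; an improved
    # balancer is passed the same way and the migration counts compared.
    print(simulate(balance))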

1.5 DISSERTATION OUTLINE


The layout of the dissertation is as follows:

Chapter 1 discusses the purpose of this study, the problems posed by big data, some existing solutions, and our contribution.
Chapter 2 discusses NoSQL databases and their classification. Furthermore, MongoDB, a type of NoSQL database, is elaborated.
Chapter 3 describes some basic concepts related to distributed systems and the modified load balancing algorithm for MongoDB in detail.
Chapter 4 describes the implementation and the results.
Chapter 5 contains concluding remarks as well as future work.
