Chapter 1 Final
TABLE OF CONTENTS
1 Introduction
1.1 Existing solutions
1.2 Motivation
1.3 Problem statement
1.4 Contribution
1.5 Dissertation Outline
1 INTRODUCTION
At a time when the world's digital data exceeds a zettabyte (i.e. 10^21 bytes), it is both a challenge and a necessity to develop powerful, efficient systems with enough storage to accommodate that data. Even a very dense storage medium such as deoxyribonucleic acid (DNA) has so far been shown to hold no more than 700 terabytes per gram [1]. Considering that it would take more than a ton of genetic material to store a zettabyte of information in DNA, the argument can be made that data storage needs to be distributed for quantitative reasons alone. Among the qualitative arguments for distributed storage systems are reliability and availability. That is to say, distribution, possibly across geographically remote locations, and concurrent accessibility by millions of users are just as important as raw storage capacity.
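As a sanity check on the figures above, the mass of DNA needed for a zettabyte follows from simple arithmetic (the density value is taken from [1]; the rest is unit conversion):

```python
# Back-of-the-envelope check of the storage figures above.
ZETTABYTE_BYTES = 10**21           # 1 ZB
DENSITY_BYTES_PER_GRAM = 700e12    # ~700 TB per gram of DNA [1]

grams_needed = ZETTABYTE_BYTES / DENSITY_BYTES_PER_GRAM
tonnes_needed = grams_needed / 1e6  # 1 metric tonne = 10**6 g

print(f"{grams_needed:.3e} g  =  {tonnes_needed:.2f} tonnes")
# Roughly 1.4e6 g, i.e. well over a tonne of DNA for one zettabyte.
```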
Web giants such as Google, Amazon, Facebook and LinkedIn are among the industries that rely on distributed storage systems. To meet their requirements they have deployed thousands of datacenters globally, so that their data is not only available at all times but can also scale on demand. Furthermore, at such scales failures are inevitable, whether attributed to faults in the software or to hardware components operating continuously for long periods. These failures need to be considered when planning solutions and handled in their implementations.
In addition to inheriting all the challenges of large-scale distributed storage, there is also the question of what kind of data is being stored and how it is accessed. Data stored in a distributed manner may be of any size, which poses many challenges for conventional data storage systems; these new horizons bring big data onto the scene.
There is no single clear definition of big data, but a widely accepted one is given by Edd Dumbill: "Big data is data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or doesn't fit the strictures of your database architectures. To gain value from this data, you must choose an alternative way to process it."
The challenge is to handle big data with the available resources, where naïve solutions fall short. When it comes to access, workloads are typically categorized as online transaction processing (OLTP) and online analytical processing (OLAP). These concepts are mostly used for structured database systems; modern applications, however, require more powerful and efficient solutions to process petabytes of data. This is well exemplified by MapReduce [2], originally implemented by Google but today the basis for many open-source systems, such as Apache's Hadoop framework.
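The MapReduce model mentioned above can be sketched in a few lines. This is a minimal in-memory illustration of the map, shuffle and reduce phases; real frameworks such as Hadoop run these phases in parallel across a cluster:

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    # map: emit a (word, 1) pair for every word in one input document
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    # shuffle: group all emitted values by their key
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # reduce: aggregate the grouped values for each key
    return {key: sum(values) for key, values in groups.items()}

docs = ["big data moves too fast", "big data is big"]
counts = reduce_phase(shuffle(chain.from_iterable(map_phase(d) for d in docs)))
print(counts["big"])  # 3
```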
1.1 EXISTING SOLUTIONS
Amazon's Dynamo [14] arranges the moving nodes of the cluster on a ring; to facilitate load balancing, it uses a DHT over its keys.
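The ring-based key placement just described can be sketched as follows. This is a hedged, illustrative take on Dynamo-style consistent hashing, not Dynamo's actual implementation; all names here are assumptions:

```python
import hashlib
from bisect import bisect

def ring_position(value: str) -> int:
    # Hash a node name or key onto a fixed circular space.
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, nodes):
        # Each node sits at its hashed position on the ring.
        self.ring = sorted((ring_position(n), n) for n in nodes)

    def node_for(self, key: str) -> str:
        # A key belongs to the first node clockwise from its hash;
        # the modulo wraps around past the end of the ring.
        positions = [p for p, _ in self.ring]
        idx = bisect(positions, ring_position(key)) % len(self.ring)
        return self.ring[idx][1]

ring = HashRing(["node-a", "node-b", "node-c"])
owner = ring.node_for("user:42")
print(owner)  # stable for a given key; one of the three nodes
```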
Drawbacks:
1.2 MOTIVATION
It is clear from the elaboration above that the development of flexible, scalable and reliable distributed storage systems is a necessity. Of course, a large number of products are already available today; some are RDBMSs and some are NoSQL systems. An RDBMS such as MySQL can be deployed in a distributed manner with replication and shards, or as MySQL Cluster, but its scalability is limited. Some have put considerable effort into making their systems work with large datasets, as Facebook has done [8], for example.
On the NoSQL side, there are many systems, such as Google's BigTable, Riak, Amazon's Dynamo, Facebook's Cassandra and MongoDB. These systems scale very well by trading off consistency, availability and partition tolerance according to the CAP theorem [11]. However, NoSQL databases do not provide transactional support like classical RDBMSs, and another major problem they face is low utilization of the storage (shards). The load balancing techniques used in these systems do not consider chunk migration as a performance indicator; rather, they treat high availability and partition tolerance as their key indicators. In this work, we aim to bridge this gap by proposing an improvement over existing load balancing techniques that takes shard utilization and data migration into account to increase the efficiency of NoSQL databases.
1.3 PROBLEM STATEMENT
Every NoSQL database has an implicit load balancing mechanism to ensure that load is evenly distributed among all the shards in the cluster. If any shard becomes overloaded, the balancer redistributes chunks among the underloaded shards so that all shards in the cluster are evenly loaded. In this work, we propose an improvement over the original MongoDB load balancer [12] that minimizes the number of chunk migrations and makes better use of the shards' memory.
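The redistribution behaviour described above can be sketched as follows. MongoDB's real balancer is considerably more involved; the threshold, data layout and names here are illustrative assumptions:

```python
def balance(shards, threshold=2):
    """shards maps a shard name to its list of chunk ids; mutated in place.
    Returns the number of chunk migrations performed."""
    migrations = 0
    while True:
        most = max(shards, key=lambda s: len(shards[s]))
        least = min(shards, key=lambda s: len(shards[s]))
        # Stop once the most- and least-loaded shards are close enough.
        if len(shards[most]) - len(shards[least]) <= threshold:
            return migrations
        shards[least].append(shards[most].pop())  # migrate one chunk
        migrations += 1

cluster = {"shard0": list(range(8)), "shard1": [8], "shard2": [9]}
migrations = balance(cluster)
print(migrations)                                # 4
print({s: len(c) for s, c in cluster.items()})  # {'shard0': 4, 'shard1': 3, 'shard2': 3}
```

Note that the number of migrations, not just the final balance, matters: each migration consumes network bandwidth and I/O on both shards, which is exactly the cost this work seeks to reduce.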
1.4 CONTRIBUTION
In this dissertation, we design an algorithm for load balancing in NoSQL databases. Although load balancing algorithms exist for all NoSQL systems, we propose a new technique for MongoDB [12], a document-oriented NoSQL database. The load that reaches a MongoDB system is stored across multiple commodity servers that hold the data in a distributed manner. However, every host has a fixed amount of storage capacity; therefore, data must be spread across all hosts so that no single host becomes overloaded.
In the prototype implementation, we simulate both algorithms, MongoDB's automatic load balancing algorithm and our improved version, and then compare the results with the help of charts and tables.
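A comparison of this kind can be set up as a small harness that runs two placement policies over the same workload and tabulates how evenly the shards fill up. The policies below are hypothetical stand-ins, not the actual algorithms evaluated in this work:

```python
import random

def place_round_robin(sizes, n_shards):
    # Baseline: assign chunks to shards in rotation, ignoring their load.
    used = [0] * n_shards
    for i, size in enumerate(sizes):
        used[i % n_shards] += size
    return used

def place_least_used(sizes, n_shards):
    # Utilization-aware: always assign the next chunk to the emptiest shard.
    used = [0] * n_shards
    for size in sizes:
        used[used.index(min(used))] += size
    return used

random.seed(7)
workload = [random.randint(1, 64) for _ in range(1000)]  # chunk sizes in MB
for name, policy in [("round-robin", place_round_robin),
                     ("least-used", place_least_used)]:
    used = policy(workload, 4)
    print(f"{name:11s} spread = {max(used) - min(used)} MB")
```

The "spread" (difference between the fullest and emptiest shard) is one simple metric such a simulation can chart; migration counts, as discussed above, are another.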
1.5 DISSERTATION OUTLINE
Chapter 1 discusses the purpose of this study, the problems posed by big data, some existing solutions, and our contribution.
Chapter 2 discusses NoSQL databases and their classification. Furthermore, MongoDB, a document-oriented NoSQL database, is elaborated.
Chapter 3 describes some basic concepts related to distributed systems and details the modified load balancing algorithm for MongoDB.
Chapter 4 describes the implementation and the results.
Chapter 5 contains the concluding remarks as well as future work.