
1. Explain the term ‘NoSQL’. Describe vertical and horizontal scaling.

NoSQL is an approach to database design that can accommodate a wide variety of data models, including
key-value, document, columnar and graph formats. NoSQL, which stands for "not only SQL," is an alternative
to traditional relational databases, in which data is placed in tables and the data schema is carefully designed
before the database is built. NoSQL databases are especially useful for working with large sets of distributed
data.
NoSQL systems are not required to follow an established relational schema. Large-scale web
organizations such as Google and Amazon use NoSQL databases to focus on narrow operational goals and
employ relational databases as adjuncts where high-grade data consistency is necessary.

Horizontal scaling means that you scale by adding more machines into your pool of resources whereas
Vertical scaling means that you scale by adding more power (CPU, RAM) to an existing machine.

An easy way to remember this is to think of a machine on a server rack: we add more machines in the
horizontal direction, and we add more resources to a single machine in the vertical direction.

Horizontal Scaling - also referred to as "scale-out" - is basically the addition of more machines, or setting
up a cluster or distributed environment for your software system. This usually requires a load-balancer
program, which is a middleware component in the standard 3-tier client-server architectural model.

Vertical Scaling - also referred to as the "scale-up" approach - is an attempt to increase the capacity of a single
machine by adding more processing power (CPU), more storage, more memory, etc.
2. Why does normalization fail in data analytics scenario?

Data analytics is always associated with big data, and when we say big data, we always have to remember
the "three V's" of big data: volume, velocity and variety. NoSQL databases are designed keeping these
three V's in mind. RDBMSs, in contrast, are strict: they have to follow a predefined schema, and schemas
are designed by normalizing the various attributes of the data. The downside of many relational data warehousing
approaches is that they are rigid and hard to change. You start by modeling the data and creating a schema,
but this assumes you know all the questions you will need to answer. When new data sources and new
questions arise, the schema and the related ETL and BI applications have to be updated, which usually requires
an expensive, time-consuming effort. This is not a problem in the big data scenario, because NoSQL systems are
made to handle the "variety" of data: there is no fixed schema, and attributes can be added dynamically.
Normalization is done so that duplicates are minimized as far as possible, but NoSQL and big data systems do
not care as much about duplicates and storage. This is because, unlike an RDBMS, NoSQL storage is distributed
over multiple clusters, and storage capacity is rarely the limiting factor: we can easily configure and add new
nodes if performance and storage demand it. This facility, provided by distributed frameworks such as Hadoop,
is popularly known as horizontal scaling. In an RDBMS, by contrast, most deployments are single-node; multi-node
parallel databases do exist, but they are limited to just a few nodes and cost much more. For these reasons, the
normalization approach often fails in a data analytics scenario.

3. Explain CAP theorem and its implications on Distributed Databases like NoSQL.

The CAP theorem, formally proved by Seth Gilbert and Nancy Lynch of MIT in 2002, states that a distributed
data store can provide at most two of the following three guarantees at the same time: Consistency,
Availability, and Partition tolerance. In distributed databases like NoSQL systems, network partitioning is very
likely: at some point machines will fail and cause others to become unreachable, and packet loss is nearly
inevitable. This leads to the conclusion that a distributed database system must do its best to continue
operating in the face of network partitions (to be Partition Tolerant), leaving only two real options to choose
from: Availability and Consistency. The theorem is often drawn as three overlapping circles, with no region
where all three properties are obtainable at once.

Consistency in CAP is not the same as consistency in ACID (that would be too easy). According to CAP,
consistency in a database means that whenever data is written, everyone who reads from the database will
always see the latest version of the data. A database without strong consistency means that when the data is
written, not everyone who reads from the database will see the new data right away; this is usually called
eventual-consistency or weak consistency.
Availability in a database according to CAP means you can always expect the database to be there and
respond whenever you query it for information. High availability is usually accomplished through large
numbers of physical servers acting as a single database, through sharding (splitting the data between various
database nodes) and replication (storing multiple copies of each piece of data on different nodes).
Partition tolerance in a database means that the database can still be read from and written to when parts of
it are completely inaccessible. Situations that would cause this include things like the network link
between a significant number of database nodes being interrupted. Partition tolerance can be achieved through
some sort of mechanism whereby writes destined for unreachable nodes are sent to nodes that are still
accessible. Then, when the failed nodes come back, they receive the writes they missed.

4. List and explain the types of NoSQL databases.

There are 4 basic types of NoSQL databases:

1. Key-Value Store – It has a big hash table of keys and values. {Example: Riak, Amazon S3 (Dynamo)}
2. Document-based Store – It stores documents made up of tagged elements. {Example: CouchDB}
3. Column-based Store – Each storage block contains data from only one column. {Example: HBase, Cassandra}
4. Graph-based Store – A network database that uses edges and nodes to represent and store data. {Example: Neo4J}
1. Key Value Store NoSQL Database

The schema-less format of a key-value database like Riak may be just what you need for your storage
needs. The key can be synthetic or auto-generated, while the value can be a String, JSON, a BLOB (binary
large object), etc.

The key-value type basically uses a hash table in which there is a unique key and a pointer to a
particular item of data. A bucket is a logical group of keys, but buckets don't physically group the data;
identical keys can exist in different buckets.

• Riak and Amazon's Dynamo are the most popular key-value store NoSQL databases.
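To make the bucket/key/value idea concrete, here is a minimal sketch in plain Java. It models the concept only - it is not the Riak or Dynamo API - and the bucket names, keys and values are made up for illustration.

import java.util.HashMap;
import java.util.Map;

// A toy key-value "store": buckets are logical groups of keys,
// and the same key can exist independently in different buckets.
public class ToyKeyValueStore {
    private final Map<String, Map<String, String>> buckets = new HashMap<>();

    public void put(String bucket, String key, String value) {
        buckets.computeIfAbsent(bucket, b -> new HashMap<>()).put(key, value);
    }

    public String get(String bucket, String key) {
        return buckets.getOrDefault(bucket, Map.of()).get(key);
    }

    public static void main(String[] args) {
        ToyKeyValueStore store = new ToyKeyValueStore();
        store.put("users", "1001", "{\"name\":\"Alice\"}");    // the value can be a JSON string, a BLOB, etc.
        store.put("sessions", "1001", "expires=2024-01-01");   // identical key, but in a different bucket
        System.out.println(store.get("users", "1001"));
        System.out.println(store.get("sessions", "1001"));
    }
}

The store itself never inspects the value; it only maps a (bucket, key) pair to an opaque piece of data, which is what makes key-value stores so easy to partition and scale out.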
2. Document Store NoSQL Database

A document store is quite similar to a key-value store: the data, a collection of key-value pairs, is stored as a
document. The difference is that the values stored (referred to as "documents") provide some structure and
encoding of the managed data. XML, JSON (JavaScript Object Notation) and BSON (a binary encoding of JSON
objects) are some common standard encodings.

The following example shows data values collected as a "document" representing specific office
locations. Note that while the three examples all represent locations, the representative models are
different.

{officeName: "3Pillar Noida",
 address: {Street: "B-25", City: "Noida", State: "UP", Pincode: "201301"}
}
{officeName: "3Pillar Timisoara",
 address: {Boulevard: "Coriolan Brediceanu No. 10", Block: "B, 1st Floor", City: "Timisoara", Pincode: "300011"}
}
{officeName: "3Pillar Cluj",
 address: {Latitude: "40.748328", Longitude: "-73.985560"}
}
One key difference between a key-value store and a document store is that the latter embeds attribute
metadata associated with stored content, which essentially provides a way to query the data based on
the contents. For example, in the above example, one could search for all documents in which "City" is
"Noida", which would deliver a result set containing all documents associated with any "3Pillar Office"
in that particular city.

Apache CouchDB is an example of a document store. CouchDB uses JSON to store data, JavaScript as its
query language using MapReduce, and HTTP for an API. Data and relationships are not stored in tables,
as is the norm with conventional relational databases; instead they are a collection of independent
documents.

The fact that document-style databases are schema-less makes adding fields to JSON documents a
simple task, without having to define changes first.

• Couchbase and MongoDB are the most popular document-based databases.
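As a rough illustration of querying a document store on an embedded attribute, here is a hedged sketch using the MongoDB Java (sync) driver; the connection string, database name ("company") and collection name ("offices") are assumptions for this example, not something taken from the text above.

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoDatabase;
import org.bson.Document;
import static com.mongodb.client.model.Filters.eq;

public class OfficeLookup {
    public static void main(String[] args) {
        // Connect to a local MongoDB instance (connection string is an assumption).
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoDatabase db = client.getDatabase("company");                 // hypothetical database
            MongoCollection<Document> offices = db.getCollection("offices");  // hypothetical collection

            // Insert one of the documents from the example above.
            offices.insertOne(new Document("officeName", "3Pillar Noida")
                    .append("address", new Document("Street", "B-25")
                            .append("City", "Noida")
                            .append("State", "UP")
                            .append("Pincode", "201301")));

            // Query on the embedded attribute: find every office whose City is "Noida".
            for (Document doc : offices.find(eq("address.City", "Noida"))) {
                System.out.println(doc.toJson());
            }
        }
    }
}

The same lookup against a plain key-value store would require fetching and parsing every value, since a key-value store has no visibility into the structure of what it stores.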
3. Column Store NoSQL Database

In a column-oriented NoSQL database, data is stored in cells grouped in columns of data rather than as
rows of data. Columns are logically grouped into column families. Column families can contain a
virtually unlimited number of columns, which can be created at runtime or while defining the schema.
Reads and writes are done using columns rather than rows.

In comparison, most relational DBMSs store data in rows; the benefit of storing data in columns is fast
search/access and data aggregation. Relational databases store a single row as a continuous disk entry,
and different rows are stored in different places on disk, while columnar databases store all the cells
corresponding to a column as a continuous disk entry, which makes search/access faster.

For example, querying the titles of a million articles would be a painstaking task with a relational
database, as it would have to go over each row's location to get the item titles. With a columnar store,
on the other hand, the titles of all the items can be obtained with just one disk access.

Data Model

• ColumnFamily: ColumnFamily is a single structure that can group Columns and SuperColumns with ease.
• Key: the permanent name of the record. Keys have different numbers of columns, so the database can scale in an irregular way.
• Keyspace: this defines the outermost level of an organization, typically the name of the application. For example, '3PillarDataBase' (database name).
• Column: an ordered list of elements, aka a tuple, with a name and a value defined.
The best-known examples are Google's BigTable, and HBase and Cassandra, which were inspired by
BigTable.

BigTable, for instance, is a high-performance, compressed, proprietary data storage system owned
by Google. It has the following attributes:

• Sparse – some cells can be empty
• Distributed – data is partitioned across many hosts
• Persistent – stored to disk
• Multidimensional – more than one dimension
• Map – key and value
• Sorted – maps are generally not sorted, but this one is
A two-dimensional table comprising rows and columns is part of the relational database system:

City        Pincode     Strength   Projects
Noida       201301      250        20
Cluj        400606      200        15
Timisoara   300011      150        10
Fairfax     VA 22033    100        5
For the above RDBMS table, a BigTable map can be visualized as shown below.

{
  3PillarNoida: {
    address: {
      city: Noida
      pincode: 201301
    },
    details: {
      strength: 250
      projects: 20
    }
  },
  3PillarCluj: {
    address: {
      city: Cluj
      pincode: 400606
    },
    details: {
      strength: 200
      projects: 15
    }
  },
  3PillarTimisoara: {
    address: {
      city: Timisoara
      pincode: 300011
    },
    details: {
      strength: 150
      projects: 10
    }
  },
  3PillarFairfax: {
    address: {
      city: Fairfax
      pincode: VA 22033
    },
    details: {
      strength: 100
      projects: 5
    }
  }
}

• The outermost keys 3PillarNoida, 3PillarCluj, 3PillarTimisoara and 3PillarFairfax are analogous to rows.
• 'address' and 'details' are called column families.
• The column family 'address' has columns 'city' and 'pincode'.
• The column family 'details' has columns 'strength' and 'projects'.

Columns can be referenced using the column family.

• Google's BigTable, HBase and Cassandra are the most popular column-store databases.
4. Graph-Based NoSQL Database

In a graph-based NoSQL database, you will not find the rigid format of SQL or the tables-and-columns
representation; a flexible graphical representation is used instead, which is well suited to addressing
scalability concerns. Graph structures are used with edges, nodes and properties, which provides
index-free adjacency. Data can easily be transformed from one model to the other using a graph-based
NoSQL database.

• These databases use edges and nodes to represent and store data.
• The nodes are organised by relationships with one another, which are represented by the edges between the nodes.
• Both the nodes and the relationships have some defined properties.

The following are some of the features of a graph-based database:

Labeled, directed, attributed multi-graph: the graph contains nodes which are labelled with some
properties, and these nodes are connected to one another by directional edges. For example, the
relationship "Alice knows Bob" is represented by an edge between the two nodes, and that edge can
itself carry properties.

While relational database models can replicate the graphical ones, traversing an edge would require a
join, which is a costly proposition.
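The sketch below is a toy in-memory model, in plain Java, of a labeled, directed, attributed multi-graph; it is meant only to illustrate index-free adjacency and the "Alice knows Bob" example, and is not the API of Neo4j or any real graph database.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// A tiny in-memory model of a labeled, directed, attributed multi-graph.
public class TinyGraph {
    static class Node {
        final String label;                                   // e.g. "Person"
        final Map<String, Object> properties = new HashMap<>();
        final List<Edge> outgoing = new ArrayList<>();        // index-free adjacency: edges hang off the node
        Node(String label) { this.label = label; }
    }

    static class Edge {
        final String type;                                    // e.g. "KNOWS"
        final Node from, to;
        final Map<String, Object> properties = new HashMap<>();
        Edge(String type, Node from, Node to) {
            this.type = type; this.from = from; this.to = to;
            from.outgoing.add(this);                          // traversal is pointer chasing, not a join
        }
    }

    public static void main(String[] args) {
        Node alice = new Node("Person");
        alice.properties.put("name", "Alice");
        Node bob = new Node("Person");
        bob.properties.put("name", "Bob");

        // "Alice knows Bob" as a directed edge that carries its own properties.
        Edge knows = new Edge("KNOWS", alice, bob);
        knows.properties.put("since", 2015);

        for (Edge e : alice.outgoing) {
            System.out.println(alice.properties.get("name") + " " + e.type + " "
                    + e.to.properties.get("name") + " " + e.properties);
        }
    }
}

Because each node holds direct references to its edges, following "Alice knows Bob" is simple pointer chasing, whereas the equivalent relational model would need a join against a relationship table.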

5. Explain eventual consistency and tunable consistency in the context of Cassandra.

Cassandra is the right choice for managing large amounts of structured, semi-structured, and unstructured data
across multiple data centers and the cloud, when you need scalability and high availability without
compromising performance. Consistency, technically, refers to a situation where all the replica
nodes have the exact same data at the exact same point in time.

Eventual consistency is a consistency model used in distributed computing to achieve high availability that
informally guarantees that, if no new updates are made to a given data item, eventually all accesses to that item
will return the last updated value. Eventual consistency, also called optimistic replication, is widely deployed in
distributed systems, and has origins in early mobile computing projects. A system that has achieved eventual
consistency is often said to have converged, or achieved replica convergence. Eventual consistency is a weak
guarantee – most stronger models, like linearizability, are trivially eventually consistent, but a system that is
merely eventually consistent does not usually fulfill these stronger constraints. Eventually-consistent services are
often classified as providing BASE (Basically Available, Soft state, Eventual consistency) semantics, in contrast to
traditional ACID (Atomicity, Consistency, Isolation, Durability) guarantees. Eventual consistency is sometimes
criticized as increasing the complexity of distributed software applications. This is partly because eventual
consistency is purely a liveness guarantee (reads eventually return the same value) and does not make safety
guarantees: an eventually consistent system can return any value before it converges.

Tunable consistency: To ensure that Cassandra can provide the proper levels of consistency for its reads and
writes, Cassandra extends the concept of eventual consistency by offering tunable consistency. You can tune
Cassandra's consistency level per-operation, or set it globally for a cluster or datacenter. You can vary the
consistency for individual read or write operations so that the data returned is more or less consistent, as
required by the client application. This allows you to make Cassandra act more like a CP (consistent and partition
tolerant) or AP (highly available and partition tolerant) system according to the CAP theorem, depending on the
application requirements. There is a tradeoff between operation latency and consistency: higher consistency
incurs higher latency, lower consistency permits lower latency. You can control latency by tuning consistency.
The consistency level determines the number of replicas that need to acknowledge the read or write operation
success to the client application. For read operations, the read consistency level specifies how many replicas
must respond to a read request before returning data to the client application. If a read operation reveals
inconsistency among replicas, Cassandra initiates a read repair to update the inconsistent data. For write
operations, the write consistency level specifies how many replicas must respond to a write request before the
write is considered successful. Even at low consistency levels, Cassandra writes to all replicas of the partition key,
including replicas in other datacenters. The write consistency level just specifies when the coordinator can
report to the client application that the write operation is considered completed. Write operations will use
hinted handoffs to ensure the writes are completed when replicas are down or otherwise not responsive to the
write request.
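A hedged sketch of per-operation tunable consistency, assuming the DataStax Java driver 4.x and a reachable local Cassandra node; the keyspace (demo_ks), table and column names are hypothetical.

import com.datastax.oss.driver.api.core.ConsistencyLevel;
import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.ResultSet;
import com.datastax.oss.driver.api.core.cql.SimpleStatement;

public class TunableConsistencyDemo {
    public static void main(String[] args) {
        // Connects to a Cassandra node on localhost:9042 by default.
        try (CqlSession session = CqlSession.builder().build()) {
            // Write at QUORUM: a majority of replicas must acknowledge before success is reported.
            SimpleStatement write = SimpleStatement
                    .builder("INSERT INTO demo_ks.users (id, name) VALUES (?, ?)")
                    .addPositionalValues(1, "Alice")
                    .setConsistencyLevel(ConsistencyLevel.QUORUM)
                    .build();
            session.execute(write);

            // Read at ONE: only a single replica has to respond, trading consistency for latency.
            SimpleStatement read = SimpleStatement
                    .builder("SELECT name FROM demo_ks.users WHERE id = ?")
                    .addPositionalValues(1)
                    .setConsistencyLevel(ConsistencyLevel.ONE)
                    .build();
            ResultSet rs = session.execute(read);
            System.out.println(rs.one().getString("name"));
        }
    }
}

A common rule of thumb: with replication factor N, choosing read and write consistency levels such that (replicas read) + (replicas written) > N (for example QUORUM reads and QUORUM writes with N = 3) guarantees that a read overlaps the most recent acknowledged write.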

6. What is Lucene? Describe the typical components involved in a search application.

Apache Lucene is a free and open-source search engine software library, originally written completely in Java by
Doug Cutting. It is supported by the Apache Software Foundation and is released under the Apache Software
License. Lucene has been ported to other programming languages including Object Pascal, Perl, C#, C++, Python,
Ruby and PHP. Lucene is a simple yet powerful, high-performance, scalable Java-based search library: it can be
used in any Java application to add document search capability in a simple and efficient way, and it can index and
search virtually any kind of text. The Lucene library provides the core operations required by any search
application: indexing and searching (a minimal indexing-and-searching sketch is shown after the component list
below). The typical components involved in a search application are:
Query parsers: Separate query terms from query operators and create the query structure (a query tree) to be
sent to the search engine.

Analyzers: Perform lexical analysis on query terms. This process can involve transforming, removing, or
expanding query terms.

Index: An efficient data structure used to store and organize searchable terms extracted from indexed
documents.

Search engine: Retrieves and scores matching documents based on the contents of the inverted index.
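The following is a minimal indexing-and-searching sketch assuming a recent Lucene release (roughly the 8.x API); the index path, field names and example text are made up for illustration.

import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class LuceneHelloWorld {
    public static void main(String[] args) throws Exception {
        Directory dir = FSDirectory.open(Paths.get("/tmp/lucene-index")); // index location is an assumption

        // Indexing: analyze the text and add it to the inverted index.
        StandardAnalyzer analyzer = new StandardAnalyzer();
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
            Document doc = new Document();
            doc.add(new TextField("title", "Lucene in Action", Field.Store.YES));
            doc.add(new TextField("body", "Lucene is a Java library for indexing and searching text.", Field.Store.YES));
            writer.addDocument(doc);
        }

        // Searching: parse a query, run it, and print the stored title of each hit.
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            Query query = new QueryParser("body", analyzer).parse("indexing AND searching");
            TopDocs hits = searcher.search(query, 10);
            for (ScoreDoc sd : hits.scoreDocs) {
                System.out.println(searcher.doc(sd.doc).get("title") + "  score=" + sd.score);
            }
        }
    }
}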

7. What are the primary analyzers available in Lucene? Describe each of them with examples.

Lucene Analyzers split the text into tokens. Analyzers mainly consist of tokenizers and filters. Different analyzers
consist of different combinations of tokenizers and filters. The primary analyzers available in Lucene are:

• Whitespace analyzer

• Simple analyzer

• Stop analyzer

• Keyword analyzer

• Standard analyzer

Standard Analyzer:

The Standard Analyzer removes stop words and lowercases the generated tokens. Note that it can also
recognize URLs and emails.

Stop Analyzer:

The Stop Analyzer consists of a Letter Tokenizer, a Lowercase Filter, and a Stop Filter: the Letter Tokenizer
splits text on non-letter characters, while the Stop Filter removes stop words from the token list. However,
unlike the Standard Analyzer, the Stop Analyzer isn't able to recognize URLs.
Simple Analyzer:

The Simple Analyzer consists of a Letter Tokenizer and a Lowercase Filter. It does not remove stop words,
and it does not recognize URLs.

Whitespace Analyzer:

The Whitespace Analyzer uses only a Whitespace Tokenizer, which splits text by whitespace characters.

Keyword Analyzer:

The Keyword Analyzer tokenizes the entire input into a single token; it is useful for fields like ids and zip
codes.
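A small comparison sketch, assuming roughly the Lucene 8.x API (in older releases StopAnalyzer also had a no-argument constructor), that runs the same input through each analyzer and prints the tokens it produces; the field name and sample text are arbitrary.

import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.KeywordAnalyzer;
import org.apache.lucene.analysis.core.SimpleAnalyzer;
import org.apache.lucene.analysis.core.StopAnalyzer;
import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class AnalyzerComparison {
    // Runs the given analyzer over the text and collects the tokens it emits.
    static List<String> tokensOf(Analyzer analyzer, String text) throws Exception {
        List<String> tokens = new ArrayList<>();
        try (TokenStream stream = analyzer.tokenStream("field", new StringReader(text))) {
            CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
            stream.reset();
            while (stream.incrementToken()) {
                tokens.add(term.toString());
            }
            stream.end();
        }
        return tokens;
    }

    public static void main(String[] args) throws Exception {
        String text = "The quick Brown-Fox visited baeldung.com";
        System.out.println("Standard:   " + tokensOf(new StandardAnalyzer(), text));
        System.out.println("Stop:       " + tokensOf(new StopAnalyzer(EnglishAnalyzer.ENGLISH_STOP_WORDS_SET), text));
        System.out.println("Simple:     " + tokensOf(new SimpleAnalyzer(), text));
        System.out.println("Whitespace: " + tokensOf(new WhitespaceAnalyzer(), text));
        System.out.println("Keyword:    " + tokensOf(new KeywordAnalyzer(), text));
    }
}

For this sample sentence, the Whitespace Analyzer keeps "Brown-Fox" intact and does not lowercase anything, the Simple and Stop Analyzers lowercase and split on every non-letter character (with the Stop Analyzer also dropping "the"), and the Keyword Analyzer returns the whole string as a single token.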

8. Describe HBase architecture in detail.

In HBase, tables are split into regions and are served by the region servers. Regions are vertically divided by
column families into "Stores". Stores are saved as files in HDFS. The main elements of this architecture are
described below.

Note: the term 'store' is used for regions to explain the storage structure.

HBase has three major components: the client library, a master server, and region servers. Region servers can
be added or removed as per requirement.

Master Server

The master server:

• Assigns regions to the region servers, and takes the help of Apache ZooKeeper for this task.
• Handles load balancing of the regions across region servers. It unloads the busy servers and shifts the regions to less occupied servers.
• Maintains the state of the cluster by negotiating the load balancing.
• Is responsible for schema changes and other metadata operations such as creation of tables and column families.

Regions
Regions are nothing but tables that are split up and spread across the region servers.

Region Server

The region servers have regions that:

• Communicate with the client and handle data-related operations.
• Handle read and write requests for all the regions under them.
• Decide the size of the region by following the region size thresholds.

When we take a deeper look into a region server, it contains regions and stores. The store contains a memstore
and HFiles. The memstore is just like a cache memory: anything that is entered into HBase is stored here
initially. Later, the data is transferred and saved in HFiles as blocks, and the memstore is flushed.

ZooKeeper

• ZooKeeper is an open-source project that provides services like maintaining configuration information, naming, providing distributed synchronization, etc.
• ZooKeeper has ephemeral nodes representing different region servers. Master servers use these nodes to discover available servers.
• In addition to availability, the nodes are also used to track server failures or network partitions.
• Clients communicate with region servers via ZooKeeper.
• In pseudo-distributed and standalone modes, HBase itself will take care of ZooKeeper.
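To show how a client interacts with this architecture, here is a hedged sketch using the standard HBase Java client API; the table name ("offices"), column family ("address") and row key are hypothetical, and an hbase-site.xml pointing at the cluster's ZooKeeper quorum is assumed to be on the classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseClientSketch {
    public static void main(String[] args) throws Exception {
        // Picks up hbase-site.xml from the classpath; the ZooKeeper quorum is resolved from there.
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("offices"))) { // hypothetical table

            // Write one cell into the 'address' column family of row '3PillarNoida'.
            Put put = new Put(Bytes.toBytes("3PillarNoida"));
            put.addColumn(Bytes.toBytes("address"), Bytes.toBytes("city"), Bytes.toBytes("Noida"));
            table.put(put);

            // Read it back; the client locates the owning region server via ZooKeeper and the META table.
            Result result = table.get(new Get(Bytes.toBytes("3PillarNoida")));
            byte[] city = result.getValue(Bytes.toBytes("address"), Bytes.toBytes("city"));
            System.out.println("city = " + Bytes.toString(city));
        }
    }
}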

9. What are the components of Hadoop?

Hadoop Distributed File System (HDFS): HDFS is the distributed storage system that is designed to provide high-
performance access to data across multiple nodes in a cluster. HDFS is capable of storing huge amounts of data
(100+ terabytes) and streaming it at high bandwidth to big data analytics applications.

MapReduce: MapReduce is a programming model that enables distributed processing of large data sets on compute
clusters of commodity hardware. Hadoop MapReduce first performs the map step, which involves splitting a large
file into pieces and transforming each piece into another set of (key, value) data. After mapping comes the reduce
task, which takes the output from mapping and assembles the results into a consumable solution. Hadoop can run
MapReduce programs written in many languages, like Java, Ruby, Python, and C++. Owing to the parallel nature of
MapReduce programs, Hadoop easily facilitates large-scale data analysis using multiple machines in the cluster.
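As an illustration, below is the classic word-count job written against the Hadoop MapReduce Java API (org.apache.hadoop.mapreduce); the input and output paths are supplied as command-line arguments and are assumed to be HDFS paths.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // Map: emit (word, 1) for every word in the input split.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: sum the counts emitted for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input path
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output path
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}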

YARN: Yet Another Resource Negotiator, or YARN, is a large-scale, distributed operating system for big data
applications. YARN is considered to be the next generation of Hadoop's computing platform. It brings to the table a
clustering platform that helps manage resources and schedule tasks. YARN was designed to set up both global and
application-specific resource management components. YARN improves utilization over the more static MapReduce
rules of early versions of Hadoop through dynamic allocation of cluster resources.
