
Module 2

Introduction to NoSQL
What is NoSQL?
The term ‘NoSQL’ refers to nonrelational types of databases, which store data in a format that differs from
relational tables. NoSQL databases can still be queried using idiomatic language APIs, declarative structured query
languages, and query-by-example languages, which is why they are also referred to as “not only SQL” databases.

What is a NoSQL database used for?


NoSQL databases are widely used in real-time web applications and big data, because their main advantages are high
scalability and high availability.

NoSQL databases are also the preferred choice of developers, as they naturally lend themselves to an agile development
paradigm by rapidly adapting to changing requirements. NoSQL databases allow the data to be stored in ways that are more
intuitive and easier to understand, or closer to the way the data is used by applications—with fewer transformations
required when storing or retrieving using NoSQL-style APIs. Moreover, NoSQL databases can take full advantage of the cloud
to deliver zero downtime.

SQL versus NoSQL

SQL databases are relational, while NoSQL databases are non-relational. The relational database management system
(RDBMS) is the basis for structured query language (SQL), which lets users access and manipulate data in highly structured
tables. This is the foundational model for database systems such as MS SQL Server, IBM DB2, Oracle, and MySQL. With NoSQL
databases, by contrast, the data access syntax can differ from database to database.

When to choose a NoSQL database?

With businesses and organizations needing to innovate rapidly, being able to stay agile and continue operating at any scale is
the name of the game. NoSQL databases offer flexible schemas and also support a variety of data models that are ideal for
building applications that require large data volumes and low latency or response times—for example, online gaming and
ecommerce web applications.

Benefits of a NoSQL database

Flexibility

With SQL databases, data is stored in a much more rigid, predefined structure. But with NoSQL, data can be stored in a more
free-form fashion without those rigid schemas. This design enables innovation and rapid application development.
Developers can focus on creating systems to better serve their customers without worrying about schemas. NoSQL databases
can handle structured, semi-structured, and unstructured data in a single data store.

Scalability
Instead of scaling up by adding more resources to a single server, NoSQL databases can scale out by adding more servers
built from commodity hardware. This makes it possible to support increased traffic and meet demand with zero downtime. By
scaling out, NoSQL databases can become larger and more powerful, which is why they have become the preferred option for
evolving data sets.

High performance
The scale-out architecture of a NoSQL database can be particularly valuable when data volume or traffic increases. This
architecture can deliver fast and predictable single-digit millisecond response times. NoSQL databases can also ingest data
and deliver it quickly and reliably, which is why they are used in applications that collect terabytes of data every day
while also requiring a highly interactive user experience. The accompanying graphic shows an incoming rate of 300 reads per
second (blue line) with a 95th-percentile latency in the 3-4 ms range, and an incoming rate of 150 writes per second (green
line) with a 95th-percentile latency in the 4-5 ms range.

Availability
NoSQL databases automatically replicate data across multiple servers, data centers, or cloud resources. In turn, this
minimizes latency for users, no matter where they’re located. This feature also works to reduce the burden of database
management, which frees up time to focus on other priorities.

Highly Functional
NoSQL databases are designed for distributed data stores that have extremely large data storage needs. This is what
makes NoSQL the ideal choice for big data, real-time web apps, customer 360, online shopping, online gaming, Internet of
things, social networks, and online advertising applications.

Aggregate Data models

An aggregate-oriented database is a NoSQL database that generally does not support ACID transactions spanning multiple
aggregates, and in that sense it gives up part of the ACID guarantees. Operations on aggregates differ from relational
database operations. We can perform OLAP operations on an aggregate-oriented database. An aggregate-oriented database is
most efficient when transactions and interactions take place within the same aggregate. Fields of data that are commonly
accessed together can be placed in the same aggregate. We can manipulate only a single aggregate at a time in an atomic
way; we cannot manipulate multiple aggregates atomically.
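
A minimal sketch of this idea, assuming a hypothetical order aggregate held in a Python dictionary. The field names and the update_aggregate helper are illustrative and do not come from any particular database. Everything that is accessed together lives under one key, and an update swaps in one whole aggregate as a unit.

```python
import copy

# Hypothetical store: one aggregate per key. Field names are illustrative.
orders = {
    "order-1": {
        "customer": "Joe",
        "items": [{"sku": "A1", "qty": 2, "price": 10.0}],
        "shipping_address": {"city": "Pune"},
    }
}

def update_aggregate(store, key, change):
    """Apply `change` to a copy of one aggregate, then swap it in as a unit.

    If `change` raises, the stored aggregate is left untouched, which mimics
    atomicity within a single aggregate. Nothing here coordinates updates
    across two different keys.
    """
    draft = copy.deepcopy(store[key])
    change(draft)
    store[key] = draft

def add_item(order):
    order["items"].append({"sku": "B2", "qty": 1, "price": 5.0})

update_aggregate(orders, "order-1", add_item)
total = sum(i["qty"] * i["price"] for i in orders["order-1"]["items"])
```

Updating two orders together would need two separate calls, and a failure between them would leave only one of them changed. That is the multi-aggregate limitation described above.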

Types of NoSQL databases / Aggregate Models

Aggregate-oriented databases are classified into four major data models. They are as follows:

Key-value
Document
Column family
Graph-based

Each of the Data models above has its own query language.

Key-value Data Model: Key-value and document databases are strongly aggregate-oriented. The key-value data model uses a
key or ID to access the data of an aggregate. The aggregate is opaque to the database: it is stored as a big blob of bits
that the database does not interpret and that can be retrieved only by its key or ID. The advantage of this opacity is that
we can place data of any structure and data type in the aggregate. The disadvantage is that the database may impose
general size limits, so only a limited amount of data can be stored in each aggregate.
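
A minimal sketch of a key-value store, assuming an in-memory dictionary and a made-up size limit. The KeyValueStore class and the 1024-byte limit are hypothetical. The store only ever sees a key and a blob of bytes, and lookups are possible only by key.

```python
import json

class KeyValueStore:
    """Toy in-memory key-value store; values are opaque bytes to the store."""

    def __init__(self, max_value_bytes=1024):
        # The size limit is an assumed figure for illustration only.
        self._data = {}
        self._max = max_value_bytes

    def put(self, key, value: bytes):
        if len(value) > self._max:
            raise ValueError("value exceeds the store's size limit")
        self._data[key] = value

    def get(self, key) -> bytes:
        return self._data[key]

store = KeyValueStore()
# The application decides the format (JSON here); the store never parses it.
store.put("user:1", json.dumps({"name": "Joe", "age": 30}).encode("utf-8"))
user = json.loads(store.get("user:1").decode("utf-8"))
```

Because the store never parses the value, it cannot answer a question such as "all users older than 25". The application would have to fetch each aggregate by key and decode it.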

Document Data Model: In the document data model, the database can see the structure of the aggregate, so we can access
parts of an aggregate rather than only the whole. The data can be accessed in a flexible manner: we can submit queries to
the database based on the fields in the aggregate. In exchange, there are restrictions on the structure and data types of
the data that can be placed in this data model.
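
A minimal sketch of field-based access, assuming a collection modeled as a list of Python dictionaries. The find and project helpers are hypothetical and only imitate what a document database offers through its own query API.

```python
# Hypothetical collection: each document is a dict whose structure is visible.
people = [
    {"_id": 1, "name": "Joe", "age": 30, "interests": ["football"]},
    {"_id": 2, "name": "Kate", "age": 25},
]

def find(collection, **criteria):
    """Return documents whose fields equal every given criterion."""
    return [
        doc for doc in collection
        if all(doc.get(field) == value for field, value in criteria.items())
    ]

def project(doc, fields):
    """Return only the requested fields that the document actually has."""
    return {f: doc[f] for f in fields if f in doc}

kate = find(people, name="Kate")
names = [project(doc, ["name"]) for doc in people]
```

Here the query looks inside each aggregate (find) and returns only part of it (project), which a pure key-value lookup cannot do.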

Column family Data Model: The column-family model can be viewed as a two-level map. However we think about the structure,
it is a model that has influenced databases such as HBase and Cassandra. Databases with a Bigtable-style data model are
often referred to as column stores. Column-family models divide the aggregate into column families. The column-family
model is a two-level aggregate structure: the first level consists of keys that act as row identifiers and select the
aggregate, and the second-level values are referred to as columns.
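
A minimal sketch of the two-level structure, assuming nested Python dictionaries. The row key selects the aggregate and the column name selects a value within it, with columns grouped into families. Row keys, family names, and column names are made up for illustration.

```python
# Row key -> column family -> {column name: value}.
# Row keys, family names, and columns are made up for illustration.
table = {
    "customer:1234": {
        "profile": {"name": "Martin", "billing_city": "Chicago"},
        "orders": {"ODR1001": "2 items", "ODR1002": "1 item"},
    }
}

def get_column(table, row_key, family, column):
    """The row key selects the aggregate; the column selects a value in it."""
    return table[row_key][family][column]

def get_family(table, row_key, family):
    """Read one column family without touching the others."""
    return table[row_key][family]

city = get_column(table, "customer:1234", "profile", "billing_city")
order_ids = sorted(get_family(table, "customer:1234", "orders"))
```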

Graph Data Model: In a graph data model, the data is stored in nodes that are connected by edges. This model is preferred
for storing large amounts of complex aggregates and multidimensional data with many interconnections between them. For
example, we can store Facebook user accounts in the nodes and find the friends of a particular user by following the edges
of the graph.
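
A minimal sketch of the friends example, assuming an undirected graph stored as adjacency sets. The user names and helper functions are hypothetical. A real graph database would expose this through its own query language.

```python
# Nodes hold user data; edges are stored as adjacency sets (undirected).
users = {1: {"name": "Joe"}, 2: {"name": "Kate"}, 3: {"name": "Sam"}}
friend_edges = {1: set(), 2: set(), 3: set()}

def add_friendship(edges, a, b):
    edges[a].add(b)
    edges[b].add(a)

def friends_of(user_id):
    """Follow the edges leaving one node."""
    return sorted(users[f]["name"] for f in friend_edges[user_id])

def friends_of_friends(user_id):
    """Follow edges two hops out, excluding the user and direct friends."""
    direct = friend_edges[user_id]
    two_hops = set()
    for f in direct:
        two_hops |= friend_edges[f]
    return sorted(users[u]["name"] for u in two_hops - direct - {user_id})

add_friendship(friend_edges, 1, 2)
add_friendship(friend_edges, 2, 3)
```

With these two friendships, friends_of(2) gives Joe and Sam, and friends_of_friends(1) gives Sam.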

Schemaless database
Traditional relational databases are well-defined, using a schema to describe every functional element, including tables,
rows, views, indexes, and relationships. By exerting a high degree of control, the database administrator can improve
performance and prevent capture of low-quality, incomplete, or malformed data.

A schemaless database, like MongoDB, does not have these up-front constraints, mapping to a more ‘natural’ database. Even
when sitting on top of a data lake, each document is created with a partial schema to aid retrieval. Any formal schema is
applied in the code of your applications; this layer of abstraction protects the raw data in the NoSQL database and allows for
rapid transformation as your needs change.

Any data, formatted or not, can be stored in a non-tabular NoSQL type of database. At the same time, using the right tools in
the form of a schemaless database can unlock the value of all of your structured and unstructured data types.

How does a schemaless database work?

In schemaless databases, information is stored in JSON-style documents which can have varying sets of fields with different
data types for each field. So, a collection could look like this:

{ name : "Joe", age : 30, interests : "football" }

{ name : "Kate", age : 25 }

As you can see, the data itself normally has a fairly consistent structure. With the schemaless MongoDB database, there is
some additional structure — the system namespace contains an explicit list of collections and indexes. Collections may be
implicitly or explicitly created — indexes must be explicitly declared.
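
A minimal sketch of this behaviour, assuming a collection modeled as a plain Python list. The insert and is_valid helpers and the REQUIRED mapping are hypothetical. The store accepts documents with different sets of fields, and any formal schema is checked in application code.

```python
collection = []  # stands in for a schemaless collection

def insert(doc):
    """Accept any dict; no table definition is consulted."""
    collection.append(dict(doc))

# The "schema" lives in application code: required fields and their types.
REQUIRED = {"name": str, "age": int}

def is_valid(doc):
    return all(isinstance(doc.get(f), t) for f, t in REQUIRED.items())

insert({"name": "Joe", "age": 30, "interests": "football"})
insert({"name": "Kate", "age": 25})
insert({"name": "Sam"})  # stored as-is even though "age" is missing

valid_names = [d["name"] for d in collection if is_valid(d)]
field_sets = [sorted(d) for d in collection]
```

All three documents are stored untouched. Only the application-side check decides that the third one is incomplete.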

What are the benefits of using a schemaless database?

Greater flexibility over data types

By operating without a schema, schemaless databases can store, retrieve, and query any data type — perfect for big data
analytics and similar operations that are powered by unstructured data. Relational databases apply rigid schema rules to
data, limiting what can be stored.

No pre-defined database schemas

The lack of schema means that your NoSQL database can accept any data type — including those that you do not yet use.
This future-proofs your database, allowing it to grow and change as your data-driven operations change and mature.

No data truncation

A schemaless database makes almost no changes to your data; each item is saved in its own document with a partial schema,
leaving the raw information untouched. This means that every detail is always available and nothing is stripped to match the
current schema. This is particularly valuable if your analytics needs to change at some point in the future.

Suitable for real-time analytics functions

With the ability to process unstructured data, applications built on NoSQL databases are better able to process real-time
data, such as readings and measurements from IoT sensors. Schemaless databases are also ideal for use with machine
learning and artificial intelligence operations, helping to accelerate automated actions in your business.

Enhanced scalability and flexibility

With NoSQL, you can use whichever data model is best suited to the job. Graph databases allow you to view relationships
between data points, or you can use traditional wide table views with an exceptionally large number of columns. You can
query, report, and model information however you choose. And as your requirements grow, you can keep adding nodes to
increase capacity and power.

When a record is saved to a relational database, anything (particularly metadata) that does not match the schema is
truncated or removed. Deleted at write, these details cannot be recovered at a later point in time.

Database Sharding

What is database sharding?


Sharding is a method for distributing a single dataset across multiple databases, which can then be stored on multiple
machines. This allows for larger datasets to be split into smaller chunks and stored in multiple data nodes, increasing the
total storage capacity of the system.

Similarly, by distributing the data across multiple machines, a sharded database can handle more requests than a single
machine can.

Sharding is a form of scaling known as horizontal scaling or scale-out, as additional nodes are brought on to share the load.
Horizontal scaling allows for near-limitless scalability to handle big data and intense workloads. In contrast, vertical
scaling refers to increasing the power of a single machine or single server through a more powerful CPU, increased RAM, or
increased storage capacity.

Do you need database sharding?

Database sharding, as with any distributed architecture, does not come for free. There is overhead and complexity in setting
up shards, maintaining the data on each shard, and properly routing requests across those shards. Before you begin sharding,
consider if one of the following alternative solutions will work for you.

Vertical scaling

By simply upgrading your machine, you can scale vertically without the complexity of sharding. Adding RAM, upgrading your
processor (CPU), or increasing the storage available to your database are simple solutions that do not require you to
change the design of either your database architecture or your application.

Replication

If your data workload is primarily read-focused, replication increases availability and read performance while avoiding some
of the complexity of database sharding. By simply spinning up additional copies of the database, read performance can be
increased either through load balancing or through geo-located query routing. However, replication introduces complexity
on write-focused workloads, as each write must be copied to every replicated node.
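
A minimal sketch of this trade-off, assuming a toy in-memory model in which every write is copied synchronously to all replicas and reads rotate round-robin. The ReplicatedStore class is hypothetical. Real systems often replicate asynchronously and route reads in more sophisticated ways.

```python
import itertools

class ReplicatedStore:
    """Toy model: every write goes to all replicas, reads rotate among them."""

    def __init__(self, replica_count):
        self.replicas = [{} for _ in range(replica_count)]
        self._next = itertools.cycle(range(replica_count))
        self.reads_served = [0] * replica_count

    def write(self, key, value):
        # Write cost grows with the number of replicas.
        for replica in self.replicas:
            replica[key] = value
        return len(self.replicas)  # number of copies made

    def read(self, key):
        # Round-robin load balancing across replicas.
        i = next(self._next)
        self.reads_served[i] += 1
        return self.replicas[i][key]

store = ReplicatedStore(replica_count=3)
copies = store.write("user:1", "Joe")
values = [store.read("user:1") for _ in range(6)]
```

Six reads are spread evenly over three replicas, while the single write had to be applied three times.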

Advantages of sharding

Sharding allows you to scale your database to handle increased load to a nearly unlimited degree by providing increased
read/write throughput, storage capacity, and high availability. Let’s look at each of those in a little more detail.

Increased read/write throughput: By distributing the dataset across multiple shards, both read and write
operation capacity is increased, as long as read and write operations are confined to a single shard.

Increased storage capacity: Similarly, by increasing the number of shards, you can also increase overall total
storage capacity, allowing near-infinite scalability.

High availability: Finally, shards provide high availability in two ways. First, when each shard is deployed as a
replica set, every piece of data is replicated. Second, because the data is distributed, the database as a whole
remains partially functional even if an entire shard becomes unavailable: only the data on that shard is affected.

Disadvantages of sharding

Sharding does come with several drawbacks, namely overhead in query result compilation, complexity of
administration, and increased infrastructure costs.

Query overhead: Each sharded database must have a separate machine or service which understands how to
route a querying operation to the appropriate shard. This introduces additional latency on every operation.
Furthermore, if the data required for the query is horizontally partitioned across multiple shards, the router must
then query each shard and merge the results together. This can make an otherwise simple operation quite expensive
and slow down response times.

Complexity of administration: With a single unsharded database, only the database server itself requires upkeep
and maintenance. With every sharded database, on top of managing the shards themselves, there are additional
service nodes to maintain. Plus, in cases where replication is being used, any data updates must be mirrored across
each replicated node. Overall, a sharded database is a more complex system which requires more administration.

Increased infrastructure costs: Sharding by its nature requires additional machines and compute power over a
single database server. While this allows your database to grow beyond the limits of a single machine, each
additional shard comes with higher costs. The cost of a distributed database system, especially if it is missing the
proper optimization, can be significant.
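
The query overhead described above can be sketched as follows, assuming three hypothetical in-memory shards and a routing rule of ID modulo shard count. The rule was chosen only for this sketch. A lookup by ID is routed to one shard, while a query on any other field must be sent to every shard and the results merged.

```python
# Three hypothetical shards; records are placed by ID modulo shard count.
shards = {
    "A": {3: {"id": 3, "total": 55}, 6: {"id": 6, "total": 15}},
    "B": {1: {"id": 1, "total": 40}, 4: {"id": 4, "total": 70}},
    "C": {2: {"id": 2, "total": 90}},
}
SHARD_NAMES = sorted(shards)

def shard_for(record_id):
    # Assumed routing rule for this sketch only.
    return SHARD_NAMES[record_id % len(SHARD_NAMES)]

def get_by_id(record_id):
    """Targeted query: the router contacts exactly one shard."""
    name = shard_for(record_id)
    return shards[name][record_id], [name]

def find_total_over(threshold):
    """Broadcast query: every shard is asked, then results are merged."""
    contacted, merged = [], []
    for name in SHARD_NAMES:
        contacted.append(name)
        merged.extend(r for r in shards[name].values() if r["total"] > threshold)
    return sorted(merged, key=lambda r: r["id"]), contacted

record, one_shard = get_by_id(4)
big_orders, all_shards = find_total_over(50)
```

The second query touches all three shards even though only three records match, which is where the extra latency and merge cost come from.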

How does sharding work?

Sharding architectures and types

While there are many different sharding methods, we will consider four main kinds: ranged/dynamic sharding,
algorithmic/hashed sharding, entity/relationship-based sharding, and geography-based sharding.

Ranged/dynamic sharding

Ranged sharding, or dynamic sharding, takes a field on the record as an input and, based on a predefined range, allocates
that record to the appropriate shard. Ranged sharding requires there to be a lookup table or service available for all queries
or writes. For example, consider a set of data with IDs that range from 0-50. A simple lookup table might look like the
following:

Range        Shard ID
[0, 20)      A
[20, 40)     B
[40, 50]     C

The field on which the range is based is also known as the shard key. Naturally, the choice of shard key, as well as of the
ranges, is critical in making range-based sharding effective. A poor choice of shard key will lead to unbalanced shards,
which leads to decreased performance. An effective shard key will allow queries to be targeted to a minimum number of
shards. In our example above, if we query for all records with IDs 10-30, then only shards A and B will need to be queried.
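
A minimal sketch of the lookup for the example ranges, assuming the table is held in memory as a sorted list of lower bounds and searched with Python's bisect module. The function names are hypothetical.

```python
import bisect

# Lower bound of each range and the shard that owns it:
# [0, 20) -> A, [20, 40) -> B, [40, 50] -> C
LOWER_BOUNDS = [0, 20, 40]
SHARD_IDS = ["A", "B", "C"]
MAX_ID = 50

def shard_for(record_id):
    """Look up the shard whose range contains record_id."""
    if not 0 <= record_id <= MAX_ID:
        raise ValueError(f"ID {record_id} is outside the range 0-{MAX_ID}")
    return SHARD_IDS[bisect.bisect_right(LOWER_BOUNDS, record_id) - 1]

def shards_for_range(low, high):
    """Return the shards a query for IDs low..high (inclusive) must contact."""
    first = SHARD_IDS.index(shard_for(low))
    last = SHARD_IDS.index(shard_for(high))
    return SHARD_IDS[first:last + 1]
```

For the query in the text, shards_for_range(10, 30) returns shards A and B, so shard C is never contacted.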

Algorithmic/hashed sharding

Algorithmic sharding, or hashed sharding, takes a record as an input and applies a hash function or algorithm to it, which
generates an output or hash value. This output is then used to allocate each record to the appropriate shard.

The function can take any subset of values on the record as inputs. Perhaps the simplest example of a hash function is to use
the modulus operator with the number of shards, as follows:

Hash Value = ID % Number of Shards

This is similar to range-based sharding — a set of fields determines the allocation of the record to a given shard. Hashing the
inputs allows more even distribution across shards even when there is not a suitable shard key, and no lookup table needs to
be maintained. However, there are a few drawbacks.

First, query operations for multiple records are more likely to get distributed across multiple shards. Whereas ranged
sharding reflects the natural structure of the data across shards, hashed sharding typically disregards the meaning of the
data. This shows up as more frequent broadcast operations.

Second, resharding can be expensive. Any change to the number of shards is likely to require rebalancing all shards by
moving records around. It is difficult to do this while avoiding a system outage.
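
Both drawbacks can be seen in a minimal sketch that uses the modulus rule given above, assuming twelve records with IDs 0 to 11 and a change from three shards to four. The helper names are hypothetical.

```python
def shard_for(record_id, shard_count):
    """Hash value = ID % number of shards."""
    return record_id % shard_count

def moved_fraction(ids, old_count, new_count):
    """Fraction of records whose shard changes when the shard count changes."""
    moved = sum(
        1 for i in ids
        if shard_for(i, old_count) != shard_for(i, new_count)
    )
    return moved / len(ids)

ids = range(12)
placement = {i: shard_for(i, 3) for i in ids}

# Resharding from 3 to 4 shards: most records land on a different shard.
fraction = moved_fraction(ids, old_count=3, new_count=4)

# A query for three neighbouring IDs is spread over every shard.
shards_hit = sorted({shard_for(i, 3) for i in range(4, 7)})
```

In this small case 9 of the 12 records (75%) change shard when a fourth shard is added, and a query for IDs 4 to 6 has to contact all three shards.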

Entity-/relationship-based sharding

Entity-based sharding keeps related data together on a single physical shard. In a relational database (such as PostgreSQL,
MySQL, or SQL Server), related data is often spread across several different tables.

For instance, consider the case of a shopping database with users and payment methods. Each user has a set of payment
methods that is tied tightly with that user. As such, keeping related data together on the same shard can reduce the need for
broadcast operations, increasing performance.
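
A minimal sketch of the shopping example, assuming four hypothetical shards and routing both users and payment methods by the owning user's ID. The modulo rule and all names are assumptions made for illustration.

```python
SHARD_COUNT = 4  # assumed number of shards for this sketch
shards = [{"users": {}, "payment_methods": []} for _ in range(SHARD_COUNT)]

def shard_for_user(user_id):
    # Both entity types are routed by the owning user's ID.
    return shards[user_id % SHARD_COUNT]

def add_user(user_id, name):
    shard_for_user(user_id)["users"][user_id] = {"name": name}

def add_payment_method(user_id, method):
    shard_for_user(user_id)["payment_methods"].append(
        {"user_id": user_id, "method": method}
    )

def user_with_payment_methods(user_id):
    """Served by one shard because related rows share the user's shard."""
    shard = shard_for_user(user_id)
    methods = [
        p["method"] for p in shard["payment_methods"]
        if p["user_id"] == user_id
    ]
    return shard["users"][user_id]["name"], methods

add_user(7, "Joe")
add_payment_method(7, "visa")
add_payment_method(7, "paypal")
add_user(3, "Kate")  # 7 % 4 == 3 % 4, so both users share a shard
add_payment_method(3, "mastercard")
result = user_with_payment_methods(7)
```

The lookup for user 7 reads a single shard and never needs a broadcast, because the payment methods were written to the same shard as the user.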

Geography-based sharding

Geography-based sharding, or geosharding, also keeps related data together on a single shard, but in this case, the data is
related by geography. This is essentially ranged sharding where the shard key contains geographic information and the
shards themselves are geo-located.

For example, consider a dataset where each record contains a “country” field. In this case, we can both increase overall
performance and decrease system latency by creating a shard for each country or region, and storing the appropriate data on
that shard. This is a simple example, and there are many other ways to allocate your geoshards which are beyond the scope
of this article.
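
A minimal sketch of the "country" example, assuming three hypothetical regional shards, a made-up country-to-region mapping, and an assumed fallback shard for unmapped countries.

```python
# Hypothetical mapping from the record's "country" field to a regional shard.
COUNTRY_TO_SHARD = {
    "IN": "asia", "JP": "asia",
    "DE": "europe", "FR": "europe",
    "US": "americas",
}
DEFAULT_SHARD = "americas"  # assumed fallback for unmapped countries

shards = {"asia": [], "europe": [], "americas": []}

def shard_for(record):
    return COUNTRY_TO_SHARD.get(record.get("country"), DEFAULT_SHARD)

def insert(record):
    shards[shard_for(record)].append(record)

def find_by_country(country):
    """A query filtered by country touches only that country's shard."""
    name = COUNTRY_TO_SHARD.get(country, DEFAULT_SHARD)
    return [r for r in shards[name] if r.get("country") == country]

for rec in [{"id": 1, "country": "IN"}, {"id": 2, "country": "DE"},
            {"id": 3, "country": "JP"}, {"id": 4, "country": "BR"}]:
    insert(rec)
```

In a real deployment each regional shard would also be hosted near its users, which is where the latency benefit comes from. This sketch only shows the routing.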
