Unit 1

Modern Databases – 19SSP50
Unit – 1
Examining Big Data Types, Big Data Management
&
Operational Databases
Prof. G. S. Karthick., M.Sc., DST Fellow (IND)., NET., Ph.D

Assistant Professor
Department of Software Systems
PSG College of Arts and Science
Coimbatore, Tamil Nadu, India
Defining Structured Data
• The term structured data generally refers to data that has a
defined length and format.
• Examples of structured data include numbers, dates, and groups
of words and numbers called strings (for example, a customer’s
name, address, and so on).
• Structured data is the data that we are probably used to dealing
with, and it is usually stored in a traditional database.
• We can query it using a language like Structured Query Language
(SQL).
• These might include your Customer Relationship Management
(CRM) data, Enterprise Resource Planning (ERP) data, and
financial data.
Wednesday, August 10, 2022 2
Exploring Sources of Big Structured Data
• The sources of data are divided into two categories:
• Computer or Machine-Generated: Machine-generated data
generally refers to data that is created by a machine without
human intervention.
✓ Sensor Data
✓ Web Log Data
✓ Point-of-Sale Data
✓ Financial Data
• Human-Generated: This is data that humans, in interaction with
computers, supply.

• Examples of structured human-generated data might include the
following:
✓ Input Data
✓ Click Stream Data
✓ Gaming Related Data

Role of Relational Databases in Big Data
• Data persistence refers to how a database retains versions of
itself when modified.
• Persistent data is stored in a relational database management
system (RDBMS).
• In a relational model, the data is stored in tables.
• The database contains a schema, that is, a structural
representation of what is in the database.
• For example, in a relational database, the schema defines the
tables, the fields in the tables, and the relationships between the
two.
• The data is stored in columns, one for each specific attribute.
• Each row holds one record, with a value for each column.
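The relational ideas above (schema, tables, fields, relationships, rows, and columns) can be illustrated with a minimal sketch. It uses Python's built-in sqlite3 module only because it needs no server. The table names, column names, and sample values are hypothetical and are not taken from the slides.

```python
import sqlite3

# In-memory database; nothing is written to disk.
conn = sqlite3.connect(":memory:")

# The schema defines the tables, their fields (columns), and the
# relationship between them (orders.customer_id refers to customers).
# All names below are illustrative assumptions.
conn.executescript("""
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL,
        city        TEXT
    );
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
        order_date  TEXT NOT NULL,
        amount      REAL NOT NULL
    );
""")

# Each tuple becomes one row; each position fills one column.
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "Asha", "Coimbatore"), (2, "Ravi", "Chennai")],
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [(10, 1, "2022-08-10", 250.0),
     (11, 1, "2022-08-11", 100.0),
     (12, 2, "2022-08-11", 75.5)],
)
conn.commit()

# SQL query that follows the relationship between the two tables.
rows = conn.execute("""
    SELECT c.name, SUM(o.amount)
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.customer_id
    GROUP BY c.customer_id
    ORDER BY c.name
""").fetchall()
print(rows)
```

The query returns one row per customer with the total of that customer's orders, which is the kind of structured question SQL is designed for.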

Defining Unstructured Data
• Unstructured data is data that does not follow a specified format.
• Unstructured data is either machine generated or human
generated.
• Here are some examples of machine-generated unstructured
data:
✓ Satellite Images
✓ Scientific Data
✓ Photographs and Video
✓ Radar or Sonar Data

• The following list shows a few examples of human-generated
unstructured data:
✓ Text Internal to Company
✓ Social Media Data
✓ Mobile Data
✓ Website Content

Role of a CMS in Big Data Management
• Organizations store some unstructured data in databases.
• However, they also utilize enterprise content management
systems (CMSs) that can manage the complete life cycle of
content.
• This can include web content, document content, and other
forms of media.
• However, new technologies are also evolving to help support
unstructured data and the analysis of unstructured data.
• Some of these support both structured and unstructured data.
Some support real-time streams.
• These include technologies like Hadoop, MapReduce, and
streaming.

Real-Time and Non-Real-Time Requirements
• The real-time aspects of big data can be revolutionary when companies
need to solve significant problems.
• In general, this real-time approach is most relevant when the answer to
a problem is time sensitive and business critical.
• This may involve a threat to something important, such as
monitoring the performance of hospital equipment or anticipating a
potential intrusion risk.
• The following list shows examples of when a company wants to
leverage this real-time data to gain a quick advantage:
✓ Monitoring for an exception with a new piece of information, like
fraud/intelligence
✓ Monitoring news feeds and social media to determine events that may
impact financial markets, such as a customer reaction to a new product
announcement
• Sometimes streaming data is coming in really fast and does not
include a wide variety of sources, sometimes a wide variety exists,
and sometimes it is a combination of the two.
• However, the following list highlights a few things to consider
regarding a system’s capability to ingest data, process it, and
analyze it in real time:
✓ Low Latency: Latency is the time lag between a request and the
service’s response. Real-time applications need this lag to be small.
✓ Scalability: Scalability is the capability to sustain a certain level of
performance even under increasing loads.
✓ Versatility: The system must support both structured and
unstructured data streams.
✓ Native Format: Use the data in its native form. Transformation
takes time and money.
Managing Different Data Types

Integrating Data Types into a Big Data Environment
• This data may be coming from all internal systems, from both
internal and external sources, or from entirely external sources.
• Much of this data may have been siloed before.
• Data need not be coming to you in real time. You just may have a
lot of it and it is disparate in nature.
• The point is that the business value will be lost if you deal with a
variety of data sources as a set of disconnected silos of
information.
• Components you need include connectors and metadata.
✓ Connectors: Enable you to pull data in from various big data sources.
✓ Metadata: Contains the definitions, mappings, and other
characteristics used to describe how to find, access, and use the
data.
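As a rough illustration of how connectors and metadata work together, the sketch below pulls records from two differently shaped sources and uses a metadata table to map them onto common field names. The source names, field names, and sample records are hypothetical, and a real integration layer would be considerably more involved.

```python
import csv
import io
import json

# Hypothetical raw inputs standing in for two separate (siloed) sources.
CRM_CSV = "cust_id,full_name\n1,Asha\n2,Ravi\n"
WEB_JSON = '[{"uid": 1, "page": "/home"}, {"uid": 2, "page": "/cart"}]'


def csv_connector(text):
    """Connector for CSV text: returns a list of dicts, one per row."""
    return list(csv.DictReader(io.StringIO(text)))


def json_connector(text):
    """Connector for JSON text: returns the decoded list of dicts."""
    return json.loads(text)


# Metadata: for each source, how to access it (connector) and how its
# fields map to common names and types (definitions and mappings).
METADATA = {
    "crm": {
        "connector": csv_connector,
        "fields": {"cust_id": ("customer_id", int), "full_name": ("name", str)},
    },
    "weblog": {
        "connector": json_connector,
        "fields": {"uid": ("customer_id", int), "page": ("page", str)},
    },
}


def load(source, raw_text):
    """Pull records through the source's connector and apply its mappings."""
    meta = METADATA[source]
    unified = []
    for record in meta["connector"](raw_text):
        row = {}
        for field, value in record.items():
            target_name, cast = meta["fields"][field]
            row[target_name] = cast(value)
        unified.append(row)
    return unified


crm = load("crm", CRM_CSV)
web = load("weblog", WEB_JSON)

# Once both sources share field names, they can be combined.
names = {c["customer_id"]: c["name"] for c in crm}
combined = [{"name": names[w["customer_id"]], "page": w["page"]} for w in web]
print(combined)
```

Keeping the mappings in data rather than in code means a new source needs only a connector and a metadata entry.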
Operational Databases
• Big data is becoming an important element in the way
organizations are leveraging high-volume data at the right speed
to solve specific data problems.
• However, big data does not live in isolation.
• To be effective, companies often need to be able to combine the
results of big data analysis with the data that exists within the
business.
• There are a variety of important operational data services.
• One of the most important services provided by operational
databases (also called data stores) is persistence.
• Persistence guarantees that the data stored in a database won’t
be changed without permission and that it will be available as long
as it is important to the business.
RDBMSs Are Important in a Big Data Environment
• In companies both small and large, most of their important
operational information is probably stored in RDBMSs.
• Many companies have different RDBMSs for different areas of
their business.
• Although many different commercial relational databases are
available from companies like Oracle, IBM, and Microsoft, you
need to understand an open source relational database called
PostgreSQL.

PostgreSQL relational database
• PostgreSQL is one of the most widely used open source relational
databases.
• PostgreSQL also supports many features found in expensive
proprietary RDBMSs, including the following:
✓ Efficient handling of data within the schema
✓ Foreign keys (referencing keys from one table in another)
✓ Triggers (events used to automatically start a stored
procedure)
✓ Complex queries (sub-queries and joins across discrete
tables)
✓ Transactional integrity
✓ Multi-version concurrency control
• The real power of PostgreSQL is its extensibility.
• Users and database programmers can add new capabilities
without affecting the fundamental operation or reliability of the
database.
• Possible extensions include
✓ Data types
✓ Operators
✓ Functions
✓ Indexing methods
✓ Procedural languages
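Three of the PostgreSQL features listed in this section (foreign keys, triggers, and transactional integrity) can be demonstrated in a small sketch. Note the assumption: the code uses Python's built-in sqlite3 module as a stand-in so that it runs without a server. PostgreSQL's own trigger syntax differs (a trigger there calls a separately defined function), and the table and trigger names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")

conn.executescript("""
    CREATE TABLE accounts (
        account_id INTEGER PRIMARY KEY,
        balance    REAL NOT NULL
    );
    CREATE TABLE transfers (
        transfer_id INTEGER PRIMARY KEY,
        account_id  INTEGER NOT NULL REFERENCES accounts(account_id),
        amount      REAL NOT NULL
    );
    -- Trigger: every inserted transfer automatically updates the balance.
    CREATE TRIGGER apply_transfer AFTER INSERT ON transfers
    BEGIN
        UPDATE accounts SET balance = balance + NEW.amount
        WHERE account_id = NEW.account_id;
    END;
    INSERT INTO accounts VALUES (1, 100.0);
""")

# A valid transfer: the trigger raises the balance from 100.0 to 125.0.
conn.execute("INSERT INTO transfers (account_id, amount) VALUES (1, 25.0)")
conn.commit()
balance = conn.execute(
    "SELECT balance FROM accounts WHERE account_id = 1").fetchone()[0]

# Transactional integrity: the second insert violates the foreign key
# (account 99 does not exist), so the whole transaction is rolled back,
# including the first insert and the balance change made by its trigger.
rejected = False
try:
    with conn:  # commits on success, rolls back on an exception
        conn.execute("INSERT INTO transfers (account_id, amount) VALUES (1, 10.0)")
        conn.execute("INSERT INTO transfers (account_id, amount) VALUES (99, 10.0)")
except sqlite3.IntegrityError:
    rejected = True

final_balance = conn.execute(
    "SELECT balance FROM accounts WHERE account_id = 1").fetchone()[0]
transfer_count = conn.execute("SELECT COUNT(*) FROM transfers").fetchone()[0]
print(balance, rejected, final_balance, transfer_count)
```

The balance is unchanged after the failed transaction, which is what "all or nothing" means in practice.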

Nonrelational Databases
• Nonrelational databases do not rely on the table/key model
endemic to RDBMSs.
• One emerging, popular class of nonrelational database is called
not only SQL (NoSQL).
• Originally, its creators envisioned databases that did not
require the relational model and SQL.
• Nonrelational database technologies have the following
characteristics in common:
✓ Scalability
✓ Data and query model
✓ Persistence design
✓ Interface diversity
✓ Eventual consistency

Key-Value Pair Databases
• KVP databases do not require a schema (as RDBMSs do) and offer
great flexibility and scalability.
• KVP databases do not offer ACID (Atomicity, Consistency,
Isolation, Durability) capability.
• Implementers must think about data placement, replication, and
fault tolerance, as these are not expressly controlled by the
technology itself.
• In KVP, most of the data is stored as strings.
• As the number of users increases, keeping track of precise keys
and related values can be challenging.
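The key-value model described above can be sketched in a few lines. This is a hypothetical in-memory store, not the API of any real KVP product. It shows that no schema is declared, that values are kept as strings, and that the application alone decides how keys are named.

```python
import json


class KeyValueStore:
    """Minimal in-memory key-value store; every value is kept as a string."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        # Any JSON-compatible value is serialized to a string.
        # No schema is checked: the store does not know what a value means.
        self._data[key] = json.dumps(value)

    def get(self, key, default=None):
        raw = self._data.get(key)
        return default if raw is None else json.loads(raw)

    def delete(self, key):
        self._data.pop(key, None)


store = KeyValueStore()
# The key naming convention ("user:<id>:<attribute>") is an assumption
# made by the application, not something the store enforces.
store.put("user:1:name", "Asha")
store.put("user:1:cart", ["pen", "notebook"])
store.put("user:2:theme", {"color": "dark"})

print(store.get("user:1:cart"))
print(store.get("user:3:name", default="unknown"))
```

Because the store cannot list "all attributes of user 1" without knowing the naming convention, keeping track of precise keys becomes harder as the number of users grows.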

Riak Key-Value Database
• One widely used open source key-value pair database is called
Riak.
• Riak is a very fast and scalable implementation of a key-value
database.
• It supports a high-volume environment with fast-changing data
because it is lightweight.
• Riak is particularly effective at real-time analysis of trading in
financial services.
• It uses “buckets” as an organizing mechanism for collections of
keys and values.
• Riak implementations are clusters of physical or virtual nodes
arranged in a peer-to-peer fashion.

• No master node exists, so the cluster is resilient and highly
scalable.
• All data and operations are distributed across the cluster.
• Communication in the cluster is implemented via a special
protocol called Gossip.
• Gossip stores status information about the cluster and shares
information about buckets.
• Features:
– Parallel Processing – Splits operations and runs them across the cluster
– Links and Link Walking – Links act as one-way connections between key-value pairs
– Search – Distributed full-text searching capability
– Secondary Indexes – Tag values with one or more additional index fields
• Riak implementations are best suited for
✓ User data for social networks, communities, or gaming
✓ High-volume, media-rich data gathering and storage
✓ Caching layers for connecting RDBMS and NoSQL databases
✓ Mobile applications requiring flexibility and dependability

Document Databases
• There are two kinds of document databases.
• One is often described as a repository for full document-style
content (Word files, complete web pages, and so on).
• The other is a database for storing document components for
permanent storage.
• The structure of the documents and their parts is provided by
JavaScript Object Notation (JSON) and/or Binary JSON (BSON).
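A short sketch shows what a JSON-structured document looks like, using Python's standard json module. The field names and values are hypothetical. BSON is a binary encoding of the same kind of structure and is not shown here because it needs a third-party package.

```python
import json

# A document: named fields, nested sub-documents, and arrays.
# No table definition is required before storing it.
document = {
    "title": "Unit 1 notes",
    "author": {"name": "Asha", "department": "Software Systems"},
    "tags": ["big data", "nosql"],
    "sections": [
        {"heading": "Structured data", "pages": 3},
        {"heading": "Unstructured data", "pages": 2},
    ],
}

# Serialize to JSON text (what would be stored or sent) and read it back.
encoded = json.dumps(document)
decoded = json.loads(encoded)

# Parts of the document can be addressed by field name.
total_pages = sum(section["pages"] for section in decoded["sections"])
print(decoded["author"]["name"], total_pages)
```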

MongoDB
• MongoDB is growing in popularity and may be a good choice for
the data store supporting your big data implementation.
• MongoDB is composed of databases containing “collections.”
• A collection is composed of “documents,” and each document is
composed of fields.
• MongoDB is also an ecosystem consisting of the following
elements:
✓ High-availability and replication services for scaling across local
and wide-area networks
✓ A grid-based file system (GridFS), enabling the storage of large
objects by dividing them among multiple documents
✓ MapReduce to support analytics and aggregation of different
collections/documents
✓ A sharding service that distributes a single database across a
cluster of servers in a single or in multiple data centers
✓ A querying service that supports ad hoc queries, distributed
queries, and full-text search
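The idea behind GridFS, dividing a large object among multiple documents, can be sketched without a database. This is an illustration of the chunking idea only, not MongoDB driver code. The chunk size is deliberately tiny so the result is easy to inspect, and the helper names are hypothetical.

```python
CHUNK_SIZE = 4  # bytes; unrealistically small, chosen only for the demo


def split_into_chunks(file_id, payload, chunk_size=CHUNK_SIZE):
    """Divide a bytes payload into numbered chunk documents."""
    return [
        {"files_id": file_id, "n": index, "data": payload[start:start + chunk_size]}
        for index, start in enumerate(range(0, len(payload), chunk_size))
    ]


def reassemble(chunks):
    """Rebuild the payload by joining chunks in sequence-number order."""
    ordered = sorted(chunks, key=lambda chunk: chunk["n"])
    return b"".join(chunk["data"] for chunk in ordered)


payload = b"large binary object"
chunks = split_into_chunks("file-1", payload)

print(len(chunks))
# Chunks may be stored or fetched in any order; "n" restores the sequence.
print(reassemble(list(reversed(chunks))))
```

Each chunk is an ordinary document, so the large object benefits from the same replication and sharding services as any other data.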

• Effective MongoDB implementations include
✓ High-volume content management
✓ Social networking
✓ Archiving
✓ Real-time analytics

CouchDB
• Like MongoDB, CouchDB is open source.
• It is maintained by the Apache Software Foundation
(www.apache.org) and is made available under the Apache
License v2.0.
• CouchDB databases are composed of documents consisting of
fields and attachments as well as a “description” of the document
in the form of metadata that is automatically maintained by the
system.
• The underlying technology features all ACID capabilities.

• CouchDB is also an ecosystem with the following capabilities:
✓ Compaction: The databases are compressed to eliminate wasted
space when a certain level of emptiness is reached.
✓ View Model: A mechanism for filtering, organizing, and reporting
on data utilizing a set of definitions that are stored as documents
in the database.
✓ Replication and Distributed Services

• Effective CouchDB implementations include
✓ High-volume content management
✓ Scaling from smartphone to data center
✓ Applications with limited or slow network connectivity

Columnar Databases
• Relational databases are row oriented, as the data in each row of
a table is stored together.
• In a columnar, or column-oriented, database, the data in each
column is stored together, across rows.
• It is very easy to add columns, and they may be added row by
row, offering great flexibility, performance, and scalability.
• When you have volume and variety of data, you might want to
use a columnar database.
• It is very adaptable; you simply continue to add columns.
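The difference between row-oriented and column-oriented layouts can be sketched with plain Python structures. The records and the `to_columns` helper are hypothetical and only illustrate how the same data is grouped differently.

```python
# Row-oriented layout: all values of one record are kept together.
rows = [
    {"id": 1, "city": "Coimbatore", "amount": 250.0},
    {"id": 2, "city": "Chennai", "amount": 75.5},
    {"id": 3, "city": "Coimbatore", "amount": 100.0},
]


def to_columns(records):
    """Column-oriented layout: all values of one attribute are kept together."""
    names = sorted({name for record in records for name in record})
    return {name: [record.get(name) for record in records] for name in names}


columns = to_columns(rows)

# An aggregate over one attribute touches a single column only,
# instead of reading every field of every row.
total = sum(columns["amount"])

# Adding a column does not require rewriting the existing ones.
columns["discount"] = [0.0, 5.0, 0.0]

print(columns["city"], total)
```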


HBase Columnar Database
• One of the most popular columnar databases is HBase.
• HBase uses the Hadoop file system and MapReduce engine for its
core data storage needs.
• The design of HBase is modeled on Google’s BigTable (an efficient
form of storing nonrelational data).
• Therefore, implementations of HBase are highly scalable, sparse,
distributed, persistent multidimensional sorted maps.
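The phrase "sparse, multidimensional sorted map" can be made concrete with a toy model. The class below is a hypothetical in-memory sketch, not the HBase client API, and it leaves out the distributed and persistent aspects entirely. Each value is addressed by a row key, a column name, and a timestamp, and only cells that actually exist are stored.

```python
class SparseSortedMap:
    """Toy map from (row key, column, timestamp) to a value."""

    def __init__(self):
        self._cells = {}

    def put(self, row_key, column, timestamp, value):
        # Multidimensional: three coordinates identify one cell.
        self._cells[(row_key, column, timestamp)] = value

    def get(self, row_key, column):
        """Return the most recent version of a cell, or None if absent."""
        versions = [
            (ts, value)
            for (row, col, ts), value in self._cells.items()
            if row == row_key and col == column
        ]
        if not versions:
            return None
        return max(versions, key=lambda pair: pair[0])[1]

    def scan(self, start_row, stop_row):
        """Return cells with start_row <= row key < stop_row, in sorted key order."""
        return [
            (key, self._cells[key])
            for key in sorted(self._cells)
            if start_row <= key[0] < stop_row
        ]


table = SparseSortedMap()
# Column names such as "info:name" are illustrative assumptions.
table.put("user1", "info:name", 1, "Asha")
table.put("user1", "info:city", 1, "Chennai")
table.put("user1", "info:city", 2, "Coimbatore")  # newer version of the same cell
table.put("user3", "info:name", 1, "Ravi")        # sparse: no city stored for user3

print(table.get("user1", "info:city"))
print(table.get("user3", "info:city"))
print(table.scan("user1", "user2"))
```

Missing cells take no space, which is why a table with many possible columns but few filled values stays compact.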

• Important characteristics of HBase include the following:
• Consistency
• Sharding
• High Availability
• Client API

Graph Databases
• The fundamental structure for graph databases is called
“node-relationship.”
• This structure is most useful when you must deal with highly
interconnected data.
• Nodes and relationships support properties, key-value pairs
where the data is stored.
• This kind of storage and navigation is difficult in RDBMSs because of
the rigid table structures and the cost of following connections
between the data.
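The node-relationship structure with properties can be sketched with ordinary Python collections. The node identifiers, relationship types, and property values are hypothetical, and a real graph database would index these structures rather than scan a list.

```python
from collections import deque

# Nodes: each has an identifier and a set of properties (key-value pairs).
nodes = {
    "asha": {"label": "Person", "name": "Asha"},
    "ravi": {"label": "Person", "name": "Ravi"},
    "meena": {"label": "Person", "name": "Meena"},
    "psg": {"label": "College", "name": "PSG College of Arts and Science"},
}

# Relationships: (source, type, target, properties). They are directed
# and carry their own properties, just like nodes.
relationships = [
    ("asha", "KNOWS", "ravi", {"since": 2019}),
    ("ravi", "KNOWS", "meena", {"since": 2021}),
    ("meena", "STUDIES_AT", "psg", {"year": 2}),
]


def neighbors(node_id, rel_type=None):
    """Nodes directly reachable from node_id, optionally by one relationship type."""
    return [
        target
        for source, kind, target, _props in relationships
        if source == node_id and (rel_type is None or kind == rel_type)
    ]


def reachable(start, rel_type=None):
    """Breadth-first traversal: every node reachable from start, in visit order."""
    seen, queue, order = {start}, deque([start]), []
    while queue:
        current = queue.popleft()
        for nxt in neighbors(current, rel_type):
            if nxt not in seen:
                seen.add(nxt)
                order.append(nxt)
                queue.append(nxt)
    return order


print(reachable("asha", "KNOWS"))
print([nodes[n]["name"] for n in reachable("asha")])
```

Following connections here is a matter of walking from node to node, whereas a relational design would typically need one join per hop.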

Neo4J Graph Database
• One of the most widely used open source graph databases is
Neo4J.
• Neo4J is an ACID transaction database offering high availability
through clustering.
• It is a trustworthy and scalable database that is easy to model
because of its fundamental node-relationship-properties
structure.
• It does not require a schema, nor does it require data typing, so it
is inherently very flexible.
• Important characteristics of Neo4J include the following:
✓ Integration with other databases
✓ Resiliency: Neo4J supports cold (that is, when the database is not
running) and hot (when it is running) backups, as well as a
high-availability clustering mode.
✓ Standard alerts are available for integration with existing
operations management systems.
✓ Query language: Neo4J supports a declarative language called
Cypher.
Spatial Databases
• Whether you know it or not, you may interact with spatial data every day.
• If you use a smartphone or Global Positioning System (GPS) for directions
to a particular place, or if you ask a search engine for the locations of
seafood restaurants near a physical address or landmark, you are using
applications relying on spatial data.
• Spatial data is associated with geographic locations such as cities and
towns.
• A spatial database is optimized to store and query data representing
objects defined in a geometric space.

• Example: A road map is a visualization of geographic information.
• A road map is a 2-dimensional object which contains points, lines,
and polygons that can represent cities, roads, and political
boundaries such as states or provinces.
• In general, spatial data can be of two types:
✓ Vector Data: This data is represented as discrete points, lines, and
polygons.
✓ Raster Data: This data is represented as a matrix of square cells.
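The two spatial data types can be contrasted in a small sketch. The coordinates are made-up grid units, not real geography, and the helper functions are hypothetical. The vector part stores shapes as coordinate lists, while the raster part stores a matrix of cells.

```python
# Vector data: discrete points, lines, and polygons given by coordinates.
city = (2, 1)                                  # a point (x, y)
road = [(0, 0), (2, 1), (4, 1)]                # a line as a sequence of points
district = [(0, 0), (4, 0), (4, 3), (0, 3)]    # a polygon as its corner points


def polygon_area(vertices):
    """Area of a simple polygon using the shoelace formula."""
    total = 0
    for (x1, y1), (x2, y2) in zip(vertices, vertices[1:] + vertices[:1]):
        total += x1 * y2 - x2 * y1
    return abs(total) / 2


def rasterize_points(points, width, height):
    """Raster data: a matrix of cells, with 1 where a point falls and 0 elsewhere."""
    grid = [[0] * width for _ in range(height)]
    for x, y in points:
        grid[y][x] = 1  # row index is y, column index is x
    return grid


area = polygon_area(district)
grid = rasterize_points(road, width=5, height=4)

print(area)
for row in grid:
    print(row)
```

Vector data keeps exact shapes in little space, while raster data trades precision for a uniform grid that is simple to process cell by cell.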

