
Modern Databases – 19SSP50

Unit – 1
Examining Big Data Types, Big Data Management
&
Operational Databases

Prof. G. S. Karthick., M.Sc., DST Fellow (IND)., NET., Ph.D


Assistant Professor
Department of Software Systems
PSG College of Arts and Science
Coimbatore, Tamil Nadu, India
Defining Structured Data
• The term structured data generally refers to data that has a
defined length and format.
• Examples of structured data include numbers, dates, and groups
of words and numbers called strings (for example, a customer’s
name, address, and so on).
• Structured data is the data that we are probably used to dealing
with, and it is usually stored in a traditional database.
• We can query it using a language like Structured Query Language
(SQL).
• These might include your Customer Relationship Management
(CRM) data, Enterprise Resource Planning (ERP) data, and
financial data.
Wednesday, August 10, 2022 2
Exploring Sources of Big Structured Data
• The sources of data are divided into two categories:
• Computer or Machine-Generated: Machine-generated data
generally refers to data that is created by a machine without
human intervention.
✓ Sensor Data
✓ Web Log Data
✓ Point-of-Sale Data
✓ Financial Data
• Human-Generated: This is data that humans, in interaction with
computers, supply.

• Examples of structured human-generated data might include the
following:
✓ Input Data
✓ Click Stream Data
✓ Gaming Related Data


Role of Relational Databases in Big Data
• Data persistence refers to how a database retains versions of
itself when modified.
• Persistent data is stored in a relational database management
system (RDBMS).
• In a relational model, the data is stored in tables.
• The database contains a schema, that is, a structural
representation of what is in the database.
• For example, in a relational database, the schema defines the
tables, the fields in the tables, and the relationships between the
two.
• The data is stored in columns, one for each specific attribute.
• Each row holds one record, with a value for each column.
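
The idea of a schema (tables, fields, and the relationship between them) can be tried out with a small sketch. It uses Python's built-in sqlite3 module purely as a convenient stand-in for an RDBMS; the table names, column names, and sample rows are hypothetical and not taken from the slides.

```python
import sqlite3

# In-memory database, so the sketch leaves nothing behind on disk.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

# The schema: two tables, their fields (columns), and the relationship
# between them (purchase.customer_id refers to customer.customer_id).
conn.executescript("""
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL,
    city        TEXT NOT NULL
);
CREATE TABLE purchase (
    purchase_id INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
    amount      REAL NOT NULL
);
""")

# Each row is one record; each column is one attribute.
conn.executemany(
    "INSERT INTO customer VALUES (?, ?, ?)",
    [(1, "Asha", "Coimbatore"), (2, "Ravi", "Chennai")],
)
conn.executemany(
    "INSERT INTO purchase VALUES (?, ?, ?)",
    [(10, 1, 250.0), (11, 1, 100.0), (12, 2, 75.5)],
)

# Structured data can be queried with SQL, here joining the two tables.
rows = conn.execute("""
SELECT c.name, SUM(p.amount)
FROM customer AS c
JOIN purchase AS p ON p.customer_id = c.customer_id
GROUP BY c.customer_id
ORDER BY c.name
""").fetchall()
print(rows)
```

The join works only because the schema records how the two tables relate, which is the role the schema plays in the relational model.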

Defining Unstructured Data
• Unstructured data is data that does not follow a specified format.
• Unstructured data is either machine generated or human
generated.
• Here are some examples of machine-generated unstructured
data:
✓ Satellite Images
✓ Scientific Data
✓ Photographs and Video
✓ Radar or Sonar Data

• The following list shows a few examples of human-generated
unstructured data:
✓ Text Internal to Company
✓ Social Media Data
✓ Mobile Data
✓ Website Content


Role of a CMS in Big Data Management
• Organizations store some unstructured data in databases.
• However, they also utilize enterprise content management
systems (CMSs) that can manage the complete life cycle of
content.
• This can include web content, document content, and other
forms of media.
• However, new technologies are also evolving to help support
unstructured data and the analysis of unstructured data.
• Some of these support both structured and unstructured data.
Some support real-time streams.
• These include technologies like Hadoop, MapReduce, and
streaming.


Real-Time and Non-Real-Time Requirements
• The real-time aspects of big data can be revolutionary when companies
need to solve significant problems.
• In general, this real-time approach is most relevant when the answer to
a problem is time sensitive and business critical.
• This may be related to a threat to something important like detecting
the performance of hospital equipment or anticipating a potential
intrusion risk.
• The following list shows examples of when a company wants to
leverage this real-time data to gain a quick advantage:
✓ Monitoring for an exception with a new piece of information, like
fraud/intelligence
✓ Monitoring news feeds and social media to determine events that may
impact financial markets, such as a customer reaction to a new product
announcement
• Sometimes streaming data is coming in really fast and does not
include a wide variety of sources, sometimes a wide variety exists,
and sometimes it is a combination of the two.
• However, the following list highlights a few things to consider
regarding a system’s capability to ingest (take in) data, process it,
and analyze it in real time:
• Low Latency: Latency is the time lag between a request and the
response to it. Real-time services need this lag to be low.
• Scalability: Scalability is the capability to sustain a certain level of
performance even under increasing loads.
• Versatility: The system must support both structured and
unstructured data streams.
• Native Format: Use the data in its native form. Transformation
takes time and money.

Integrating Data Types into a Big Data Environment
• This data may be coming from all internal systems, from both
internal and external sources, or from entirely external sources.
• Much of this data may have been siloed before.
• Data need not be coming to you in real time. You just may have a
lot of it and it is disparate in nature.
• The point is that the business value will be lost if you deal with a
variety of data sources as a set of disconnected silos of
information.
• Components you need include connectors and metadata.
• Connectors: Enable you to pull data in from various big data sources.
• Metadata: Contains the definitions, mappings, and other
characteristics used to describe how to find, access, and use the
data.
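
As a rough illustration of how connectors and metadata work together, the sketch below pulls records from two differently shaped sources and uses a metadata mapping to bring them onto shared field names. Everything in it is an assumption made for illustration: the two sample sources, the field names, and the helper functions `csv_connector`, `json_connector`, and `normalize` are hypothetical.

```python
import csv
import io
import json

# Hypothetical raw inputs standing in for two previously siloed sources.
CRM_CSV = "cust_id,full_name\n1,Asha\n2,Ravi\n"
WEB_JSON = '[{"userId": 1, "page": "/home"}, {"userId": 2, "page": "/cart"}]'

# Metadata: for each source, how each field maps to a shared name and type.
METADATA = {
    "crm": {"cust_id": ("customer_id", int), "full_name": ("name", str)},
    "web": {"userId": ("customer_id", int), "page": ("page", str)},
}

def csv_connector(text):
    """Connector for CSV sources: returns one dict per row."""
    return list(csv.DictReader(io.StringIO(text)))

def json_connector(text):
    """Connector for JSON sources: returns the decoded list of dicts."""
    return json.loads(text)

def normalize(source, records):
    """Apply the metadata mapping so every source uses the same field names."""
    mapping = METADATA[source]
    return [
        {new_name: cast(record[old_name])
         for old_name, (new_name, cast) in mapping.items()}
        for record in records
    ]

crm = normalize("crm", csv_connector(CRM_CSV))
web = normalize("web", json_connector(WEB_JSON))

# Once both sources share customer_id, they can be combined.
names = {record["customer_id"]: record["name"] for record in crm}
combined = [
    {"name": names[visit["customer_id"]], "page": visit["page"]}
    for visit in web
]
print(combined)
```

Without the metadata mapping, the two sources would stay disconnected, because nothing would say that `cust_id` and `userId` describe the same thing.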
Operational Databases
• Big data is becoming an important element in the way
organizations are leveraging high-volume data at the right speed
to solve specific data problems.
• However, big data does not live in isolation.
• To be effective, companies often need to be able to combine the
results of big data analysis with the data that exists within the
business.
• There are a variety of important operational data services.
• One of the most important services provided by operational
databases (also called data stores) is persistence.
• Persistence guarantees that the data stored in a database won’t
be changed without permissions and that it will be available as long
as it is important to the business.
RDBMSs Are Important in a Big Data Environment
• In companies both small and large, most of their important
operational information is probably stored in RDBMSs.
• Many companies have different RDBMSs for different areas of
their business.
• Although many different commercial relational databases are
available from companies like Oracle, IBM, and Microsoft, you
need to understand an open source relational database called
PostgreSQL.


PostgreSQL relational database
• PostgreSQL is one of the most widely used open source relational
databases.
• PostgreSQL also supports many features otherwise associated with
expensive proprietary RDBMSs, including the following:
✓ Efficient handling of data within the schema
✓ Foreign keys (referencing keys from one table in another)
✓ Triggers (events used to automatically start a stored
procedure)
✓ Complex queries (sub-queries and joins across discrete
tables)
✓ Transactional integrity
✓ Multi-version concurrency control
• The real power of PostgreSQL is its extensibility.
• Users and database programmers can add new capabilities
without affecting the fundamental operation or reliability of the
database.
• Possible extensions include
✓ Data types
✓ Operators
✓ Functions
✓ Indexing methods
✓ Procedural languages
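
Three of the features listed for PostgreSQL (foreign keys, triggers, and transactional integrity) can be tried out in a small sketch. Note the assumption: it runs on Python's built-in sqlite3 module rather than on PostgreSQL, so that it needs no server, and the tables and values are hypothetical. The SQL ideas carry over, but the syntax details differ between the two systems.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only on request
conn.executescript("""
CREATE TABLE account (
    account_id INTEGER PRIMARY KEY,
    balance    REAL NOT NULL CHECK (balance >= 0)
);
CREATE TABLE transfer (
    transfer_id INTEGER PRIMARY KEY,
    account_id  INTEGER NOT NULL REFERENCES account(account_id),
    amount      REAL NOT NULL
);
CREATE TABLE audit_log (message TEXT NOT NULL);

-- Trigger: every inserted transfer automatically writes an audit row.
CREATE TRIGGER log_transfer AFTER INSERT ON transfer
BEGIN
    INSERT INTO audit_log VALUES ('transfer on account ' || NEW.account_id);
END;

INSERT INTO account VALUES (1, 100.0);
""")

# Foreign key: a transfer that refers to a missing account is rejected.
try:
    conn.execute("INSERT INTO transfer (account_id, amount) VALUES (99, 10.0)")
    fk_rejected = False
except sqlite3.IntegrityError:
    fk_rejected = True
conn.rollback()

def withdraw(amount):
    """Record a transfer and debit the account as one all-or-nothing unit."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "INSERT INTO transfer (account_id, amount) VALUES (1, ?)", (amount,)
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE account_id = 1",
                (amount,),
            )
        return True
    except sqlite3.IntegrityError:
        return False  # the CHECK failed, so the transfer row was undone as well

too_much = withdraw(500.0)   # would make the balance negative
allowed = withdraw(40.0)

transfers = conn.execute("SELECT COUNT(*) FROM transfer").fetchone()[0]
balance = conn.execute("SELECT balance FROM account").fetchone()[0]
log_rows = conn.execute("SELECT message FROM audit_log").fetchall()
print(fk_rejected, too_much, allowed, transfers, balance, log_rows)
```

The failed withdrawal leaves no transfer row and no audit row behind, which is the practical meaning of transactional integrity.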


Nonrelational Databases
• Nonrelational databases do not rely on the table/key model
endemic to RDBMSs.
• One emerging, popular class of nonrelational database is called
not only SQL (NoSQL).
• Originally, its creators envisioned databases that did not
require the relational model and SQL.
• Nonrelational database technologies have the following
characteristics in common:
✓ Scalability
✓ Data and Query Model
✓ Persistence Design
✓ Interface Diversity
✓ Eventual Consistency


Key-Value Pair Databases
• Key-value pair (KVP) databases, unlike RDBMSs, do not require a
schema and offer great flexibility and scalability.
• KVP databases generally do not offer ACID (Atomicity, Consistency,
Isolation, Durability) capability.
• Implementers must think about data placement, replication, and
fault tolerance, as they are not expressly controlled by the
technology itself.
• In KVP databases, most of the data is stored as strings.
• As the number of users increases, keeping track of precise keys
and related values can be challenging.
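
A minimal sketch of the key-value idea follows. It is a toy in-memory class, not the client of any real KVP product; the class name `KeyValueStore`, its methods, and the key naming convention are all hypothetical.

```python
class KeyValueStore:
    """Toy in-memory key-value store: no schema, values kept as strings."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        # No schema check of any kind: any key may hold any value.
        self._data[key] = str(value)

    def get(self, key, default=None):
        return self._data.get(key, default)

    def keys_with_prefix(self, prefix):
        # The store knows nothing about structure, so the application must
        # rely on its own key naming convention to find related values.
        return sorted(k for k in self._data if k.startswith(prefix))


store = KeyValueStore()
store.put("user:1001:name", "Asha")
store.put("user:1001:last_login", 20220810)   # stored as the string "20220810"
store.put("user:1002:name", "Ravi")

print(store.get("user:1001:last_login"))
print(store.keys_with_prefix("user:1001:"))
```

The prefix lookup hints at why keeping track of precise keys gets harder as users grow: the naming convention lives in the application, not in the database.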


Riak Key-Value Database
• One widely used open source key-value pair database is called
Riak.
• Riak is a very fast and scalable implementation of a key-value
database.
• It supports a high-volume environment with fast-changing data
because it is lightweight.
• Riak is particularly effective at real-time analysis of trading in
financial services.
• It uses “buckets” as an organizing mechanism for collections of
keys and values.
• Riak implementations are clusters of physical or virtual nodes
arranged in a peer-to-peer fashion.

• No master node exists, so the cluster is resilient and highly
scalable.
• All data and operations are distributed across the cluster.
• Communication in the cluster is implemented via a special
protocol called Gossip.
• Gossip stores status information about the cluster and shares
information about buckets.
• Features:
– Parallel Processing – Uses MapReduce to split queries up and run the parts across the cluster
– Links and Link Walking – One-way links between key-value pairs that can be followed to map relationships
– Search – Distributed full-text searching capability
– Secondary Indexes – Tag values with one or more additional keys for lookup
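
One way to picture how data can be spread over a cluster with no master node is consistent hashing, the general technique usually associated with peer-to-peer stores of this kind. The sketch below is an illustration only and not Riak's actual code: the `HashRing` class, the node names, the use of SHA-1, and the number of ring points per node are assumptions.

```python
import bisect
import hashlib

def ring_position(text):
    """Map a string to a position on the ring (SHA-1 chosen for illustration)."""
    return int(hashlib.sha1(text.encode("utf-8")).hexdigest(), 16)

class HashRing:
    """Every node owns several points on a ring; a key belongs to the node
    that owns the first point after the key's own position."""

    def __init__(self, nodes, points_per_node=64):
        self._ring = sorted(
            (ring_position(f"{node}#{i}"), node)
            for node in nodes
            for i in range(points_per_node)
        )
        self._positions = [position for position, _ in self._ring]

    def node_for(self, bucket, key):
        position = ring_position(f"{bucket}/{key}")
        # Wrap around to the first point when the key lies past the last one.
        index = bisect.bisect_right(self._positions, position) % len(self._ring)
        return self._ring[index][1]


ring = HashRing(["node-a", "node-b", "node-c"])
owners = {key: ring.node_for("users", key) for key in ("asha", "ravi", "meena")}
print(owners)
```

Any peer can compute the owner of a key by itself, so no master is needed, and adding a node moves only the keys that the new node takes over.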
Riak implementations are best suited for

✓ User data for social networks, communities, or gaming

✓ High-volume, media-rich data gathering and storage

✓ Caching layers for connecting RDBMS and NoSQL databases

✓ Mobile applications requiring flexibility and dependability


Document Databases
• There are two kinds of document databases.
• One is often described as a repository for full document-style
content (Word files, complete web pages, and so on).
• The other is a database for storing document components for
permanent storage.
• The structure of the documents and their parts is provided by
JavaScript Object Notation (JSON) and/or Binary JSON (BSON).
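
To make the JSON structure concrete, here is a small sketch of one document with nested parts, written and read back with Python's standard json module. The document contents and field names are hypothetical, and BSON is not shown because it needs a third-party library.

```python
import json

# One self-describing document: the structure travels with the data.
document = {
    "_id": "order-1001",
    "customer": {"name": "Asha", "city": "Coimbatore"},
    "items": [
        {"sku": "BK-01", "qty": 2, "price": 150.0},
        {"sku": "PN-07", "qty": 1, "price": 20.0},
    ],
}

text = json.dumps(document, indent=2)   # the form in which it could be stored
restored = json.loads(text)             # the form the application works with

# Parts of the document are reached by name, with no table join involved.
total = sum(item["qty"] * item["price"] for item in restored["items"])
print(restored["customer"]["city"], total)
```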


MongoDB
• MongoDB is growing in popularity and may be a good choice for
the data store supporting your big data implementation.

• MongoDB is composed of databases containing “collections.”

• A collection is composed of “documents,” and each document is
composed of fields.
• MongoDB is also an ecosystem consisting of the following
elements:
✓ High-availability and replication services for scaling across local
and wide-area networks
✓ A grid-based file system (GridFS), enabling the storage of large
objects by dividing them among multiple documents.
✓ MapReduce to support analytics and aggregation of different
collections/documents.
✓ A sharding service that distributes a single database across a
cluster of servers in a single or in multiple data centers.
✓ A querying service that supports ad hoc queries, distributed
queries, and full-text search.
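
The MapReduce style of aggregation over a collection of documents can be sketched in plain Python. This is not MongoDB's API and needs no server; the sample collection and the functions `map_phase`, `reduce_phase`, and `map_reduce` are hypothetical names chosen for the illustration.

```python
from collections import defaultdict

# A "collection" of "documents", each made up of fields.
orders = [
    {"customer": "Asha", "amount": 250.0},
    {"customer": "Ravi", "amount": 75.5},
    {"customer": "Asha", "amount": 100.0},
]

def map_phase(document):
    """Map: emit (key, value) pairs for one document."""
    yield document["customer"], document["amount"]

def reduce_phase(key, values):
    """Reduce: combine all values emitted for one key."""
    return sum(values)

def map_reduce(documents, mapper, reducer):
    grouped = defaultdict(list)
    for document in documents:
        for key, value in mapper(document):
            grouped[key].append(value)      # group the emitted values by key
    return {key: reducer(key, values) for key, values in grouped.items()}

totals = map_reduce(orders, map_phase, reduce_phase)
print(totals)
```

Because the map step looks at one document at a time, it could run on many servers at once, which is what makes the pattern a fit for sharded data.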

Effective MongoDB implementations include

✓ High-volume content management

✓ Social networking

✓ Archiving

✓ Real-time analytics


CouchDB
• Like MongoDB, CouchDB is open source.

• It is maintained by the Apache Software Foundation
(www.apache.org) and is made available under the Apache
License v2.0.
• CouchDB databases are composed of documents consisting of
fields and attachments as well as a “description” of the document
in the form of metadata that is automatically maintained by the
system.
• The underlying technology features all ACID capabilities.

• CouchDB is also an ecosystem with the following capabilities:

• Compaction: The databases are compressed to eliminate wasted
space when a certain level of emptiness is reached.
• View Model: A mechanism for filtering, organizing, and reporting
on data utilizing a set of definitions that are stored as documents
in the database.

• Replication and Distributed Services

• Effective CouchDB implementations include

✓ High-volume content management

✓ Scaling from smartphone to data center

✓ Applications with limited or slow network connectivity


Columnar Databases
• Relational databases are row oriented, as the data in each row of
a table is stored together.
• In a columnar, or column-oriented, database, the data is stored
by column: the values of each column are kept together across rows.
• It is very easy to add columns, and they may be added row by
row, offering great flexibility, performance, and scalability.
• When you have volume and variety of data, you might want to
use a columnar database.
• It is very adaptable; you simply continue to add columns.
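
The difference between the two layouts can be seen in a few lines of Python. This is only a sketch of the storage idea using in-memory lists; the sample records and the helper `to_columns` are hypothetical.

```python
# Row-oriented layout: all the values of one record are kept together.
rows = [
    {"id": 1, "name": "Asha", "amount": 250.0},
    {"id": 2, "name": "Ravi", "amount": 75.5},
    {"id": 3, "name": "Meena", "amount": 100.0},
]

def to_columns(records):
    """Column-oriented layout: all the values of one attribute are kept together."""
    fields = dict.fromkeys(field for record in records for field in record)
    return {field: [record.get(field) for record in records] for field in fields}

columns = to_columns(rows)

# An aggregate over one attribute reads a single column, not every record.
total = sum(columns["amount"])

# Adding a column does not touch the existing ones; rows without a value stay empty.
columns["city"] = ["Coimbatore", None, None]

print(columns["amount"], total)
```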




HBase Columnar Database
• One of the most popular columnar databases is HBase.
• HBase uses the Hadoop file system and MapReduce engine for its
core data storage needs.
• The design of HBase is modeled on Google’s BigTable (an efficient
form of storing nonrelational data).
• Therefore, implementations of HBase are highly scalable, sparse,
distributed, persistent multidimensional sorted maps.
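
The phrase "sparse multidimensional sorted map" can be unpacked with a toy model in which a value is found by row key, column, and timestamp. This is not the HBase client API; the class `SparseSortedMap`, its methods, and the sample keys are assumptions made for the sketch, and it covers neither distribution nor persistence.

```python
class SparseSortedMap:
    """Toy model: (row key, column, timestamp) -> value, held in memory."""

    def __init__(self):
        self._cells = {}   # (row, column) -> {timestamp: value}

    def put(self, row, column, timestamp, value):
        self._cells.setdefault((row, column), {})[timestamp] = value

    def get(self, row, column):
        """Return the newest version of a cell, or None if it was never written."""
        versions = self._cells.get((row, column))
        if not versions:
            return None                  # sparse: absent cells take no space
        return versions[max(versions)]

    def scan(self, start_row, stop_row):
        """Yield the cells whose row key lies in [start_row, stop_row), in key order."""
        for row, column in sorted(self._cells):
            if start_row <= row < stop_row:
                yield row, column, self.get(row, column)


table = SparseSortedMap()
table.put("user#001", "info:name", 1, "Asha")
table.put("user#001", "info:name", 2, "Asha K")    # a newer version of the same cell
table.put("user#002", "info:city", 1, "Chennai")

print(table.get("user#001", "info:name"))
print(list(table.scan("user#001", "user#002")))
```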

• Important characteristics of HBase include the following:

• Consistency

• Sharding

• High Availability

• Client API


Graph Databases
• The fundamental structure for graph databases is called “node-
relationship.”

• This structure is most useful when you must deal with highly
interconnected data.

• Nodes and relationships support properties, key-value pairs
where the data is stored.
• This kind of storage and navigation is difficult in RDBMSs due
to the rigid table structures and the cost of following connections
between the data.
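
The node-relationship structure, with properties on both, can be sketched with ordinary Python data structures. This is an in-memory illustration rather than a graph database; the sample nodes, the relationship types, and the functions `neighbors` and `shortest_path` are hypothetical.

```python
from collections import deque

# Nodes with properties (key-value pairs).
nodes = {
    "asha": {"label": "Person", "name": "Asha"},
    "ravi": {"label": "Person", "name": "Ravi"},
    "meena": {"label": "Person", "name": "Meena"},
    "college": {"label": "Organization", "name": "Example College"},
}

# Directed relationships, each with a type and its own properties.
relationships = [
    ("asha", "FRIEND_OF", "ravi", {"since": 2019}),
    ("ravi", "FRIEND_OF", "meena", {"since": 2021}),
    ("meena", "WORKS_AT", "college", {"role": "Lecturer"}),
]

def neighbors(node_id, rel_type=None):
    """Nodes reachable in one step, optionally only via one relationship type."""
    return [
        target
        for source, kind, target, _properties in relationships
        if source == node_id and (rel_type is None or kind == rel_type)
    ]

def shortest_path(start, goal):
    """Breadth-first traversal that follows relationships from node to node."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for following in neighbors(path[-1]):
            if following not in seen:
                seen.add(following)
                queue.append(path + [following])
    return None

print(shortest_path("asha", "college"))
print(neighbors("asha", "FRIEND_OF"))
```

Each step of the traversal goes straight from a node to its neighbours, whereas a table-based design would typically need one join per step.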


Neo4J Graph Database
• One of the most widely used open source graph databases is
Neo4J.

• Neo4J is an ACID transaction database offering high availability
through clustering.
• It is a trustworthy and scalable database that is easy to model
because of its fundamental node-relationship-properties
structure.
• It does not require a schema, nor does it require data typing, so it
is inherently very flexible.
• Important characteristics of Neo4J include the following:
✓ Integration with other databases
✓ Resiliency: Neo4J supports cold (that is, when database is not
running) and hot (when it is running) backups, as well as a
high-availability clustering mode.
✓ Standard alerts are available for integration with existing
operations management systems.
✓ Query language: Neo4J supports a declarative language called
Cypher.
Spatial Databases
• Whether you know it or not, you may interact with spatial data every day.

• If you use a smartphone or Global Positioning System (GPS) for directions
to a particular place, or if you ask a search engine for the locations of
seafood restaurants near a physical address or landmark, you are using
applications relying on spatial data.
• Spatial data is associated with geographic locations such as cities and
towns.
• A spatial database is optimized to store and query data representing
objects defined in a geometric space.


• Example:
✓ A road map is a visualization of geographic information.
✓ A road map is a 2-dimensional object which contains points, lines,
and polygons that can represent cities, roads, and political
boundaries such as states or provinces.
• In general, spatial data can be of two types:
• Vector Data: This data is represented as discrete points, lines, and
polygons.
• Raster Data: This data is represented as a matrix of square cells.
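
The two representations can be compared with a small sketch: a polygon and a point held as vector data, a spatial query (is the point inside the polygon?), and the same polygon converted to a raster grid. The coordinates and the functions `point_in_polygon` and `rasterize` are hypothetical; a real spatial database would use spatial indexes instead of this direct test.

```python
def point_in_polygon(point, polygon):
    """Ray casting: count the polygon edges crossed by a ray going right from the point."""
    x, y = point
    inside = False
    for i in range(len(polygon)):
        (x1, y1), (x2, y2) = polygon[i], polygon[(i + 1) % len(polygon)]
        if (y1 > y) != (y2 > y):                       # the edge spans the ray's height
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside                    # each crossing flips the state
    return inside

def rasterize(polygon, width, height):
    """Raster form: 1 where the centre of a square cell lies inside the polygon, else 0."""
    return [
        [1 if point_in_polygon((col + 0.5, row + 0.5), polygon) else 0
         for col in range(width)]
        for row in range(height)
    ]

# Vector data: discrete points and polygons.
district = [(1, 1), (4, 1), (4, 3), (1, 3)]   # a rectangular boundary
city = (2.0, 2.0)                             # a point

# Raster data: a matrix of square cells covering the same area.
grid = rasterize(district, width=5, height=4)

print(point_in_polygon(city, district))
for row in grid:
    print(row)
```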
