Topic 3 - Big Data Characteristics

CHINHOYI UNIVERSITY OF TECHNOLOGY
Entrepreneurship & Business Sciences
Graduate Business School

MSc. Data Analytics
Big Data Analytics

[MSCDA 6-8]
Eng. N.F Thusabantu - Shoniwa

TOPIC 3: Big Data Characteristics
Volume, Velocity, Variety and Veracity
Major Modern Data Base Types
Database Systems Architecture
Relational
Federated
Map/Reduce
Eng. N.F Thusabantu
Content
1. Introduction
2. Volume, Velocity, Variety, Veracity, Value
3. What is a database
4. Modern database types
o MongoDB
o Cassandra
5. Database Systems Architecture
6. Relational
7. Federated
8. Map Reduce

1. Introduction
• The exponential growth in the amount of data produced each day
has given rise to the popularity of Big Data Analytics tools and
techniques.

Big Data Tools and Techniques

• Big Data requires specialised techniques to process large
volumes of data efficiently within limited run times. These
techniques come from several disciplines and often overlap.
• Optimization methods are used to solve quantitative problems
in sectors such as biology, economics, and engineering.
• Statistics involves the collection, organization, and
interpretation of data. Statistical techniques are used to
describe the correlation between different variables.
• Data Mining is a technique for extracting valuable information
from data. It includes clustering analysis and classification.

• Machine Learning is used to design algorithms that let systems
learn behaviour from data, so that businesses can make
intelligent decisions.
• Artificial Neural Networks (ANNs) are an advanced technique
used in pattern recognition, adaptive control, image analysis
and more.
• Visualization approaches are used to create tables, diagrams
and other representations that make data easier to understand.
• Social Network Analysis (SNA) is an important technique in
modern sociology. It views social relationships as networks of
nodes and ties.
• Higher-level Big Data technologies include distributed
computing systems, distributed file systems, data mining, and
cloud-based storage and computing.

ACID Properties
• Reliable transaction processing requires that transactions
observe the ACID properties:
A - Atomicity
C - Consistency
I - Isolation
D - Durability
➢ Atomicity: A database follows the all-or-nothing rule, i.e., it
treats all the operations of a transaction as one whole unit, or atom.
• Thus, when a database processes a transaction, it is either fully
completed or not executed at all.
➢ Consistency: Ensures that only valid data, following all rules
and constraints, is written to the database.
• When a transaction results in invalid data, the database reverts
to its previous state, which abides by all rules and constraints.
➢ Isolation: Ensures that transactions running at the same time are
processed securely and independently, without interfering with each
other. It does not guarantee the order of transactions.
• For example, user A withdraws $100 and user B withdraws $250 from
user Z’s account, which has a balance of $1000. Since both A and B
draw from Z’s account, one of the users is required to wait until the
other user’s transaction is completed, avoiding inconsistent data. If B
is required to wait, then B must wait until A’s transaction is
completed and Z’s account balance changes to $900. B can then withdraw
$250 from this $900 balance, leaving $650.
➢ Durability: Ensures that once a transaction is committed, its
changes are recorded permanently and survive a system failure. In the
above example, user B may withdraw $250 only after user A’s transaction
is completed and recorded in the database.
• If the system fails before A’s transaction is logged in the database,
A’s withdrawal does not take effect, and Z’s account returns to its
previous consistent state.
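The withdrawal example above can be sketched with Python's built-in sqlite3 module. This is a minimal illustration, not a banking implementation: the accounts table, the withdraw and balance helpers, and the CHECK rule that forbids negative balances are all assumptions made for the example.

```python
import sqlite3

# In-memory database; table and column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "  owner TEXT PRIMARY KEY,"
    "  balance INTEGER NOT NULL CHECK (balance >= 0))"  # consistency rule
)
conn.execute("INSERT INTO accounts VALUES ('Z', 1000)")
conn.commit()


def withdraw(conn, owner, amount):
    """Withdraw `amount` inside one transaction; return True on success."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE owner = ?",
                (amount, owner),
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint rejected the update, so nothing was written.
        return False


def balance(conn, owner):
    """Return the current balance of `owner`."""
    row = conn.execute(
        "SELECT balance FROM accounts WHERE owner = ?", (owner,)
    ).fetchone()
    return row[0]


ok_a = withdraw(conn, "Z", 100)   # user A: 1000 -> 900
ok_b = withdraw(conn, "Z", 250)   # user B: 900 -> 650
ok_c = withdraw(conn, "Z", 5000)  # would go negative: rejected, rolled back
print(ok_a, ok_b, ok_c, balance(conn, "Z"))
```

The two valid withdrawals are applied one after the other, while the third is undone as a whole, so the balance stays at a value that satisfies the constraint.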
2. The Vs in Big Data
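• Volume – the amount of data
• Velocity – the speed at which data is generated, transmitted
and processed
• Variety – the different types and formats of data (structured,
semi-structured, unstructured)
• Veracity – the uncertainty or doubt in the data
• Value – the worth of the data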

3. What is a database
• A database is a collection of information that is organized so
that it can be easily accessed, managed and updated.
• Data is organized into rows, columns and tables, and it is
indexed to make it easier to find relevant information.
• Data gets updated, expanded and deleted as new information
is added.
• Databases process workloads that create and update the data,
query the data they contain and run applications against it.

• In computing, databases are sometimes classified according to
their organizational approach.
• There are many different kinds of databases, ranging from the
most prevalent approach, the relational database, to
distributed, cloud and NoSQL databases.

Python to SQL
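import pyodbc

# server_name, db_name and table_name are placeholders for your own values.
conn = pyodbc.connect('Driver={SQL Server};'
                      'Server=server_name;'
                      'Database=db_name;'
                      'Trusted_Connection=yes;')
cursor = conn.cursor()
# The database is already selected by the connection string above.
cursor.execute('SELECT * FROM table_name')
for row in cursor:
    print(row)  # one result row per iteration
conn.close()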

4. Modern Database Types
• A NoSQL (originally referring to "non SQL" or "non relational")
database provides a mechanism for storage and retrieval of data that
is modelled by means other than the tabular relations used in
relational databases.
• Such databases have existed since the late 1960s, but did not obtain the
"NoSQL" moniker until a surge of popularity in the early 21st
century, triggered by the needs of Web 2.0 companies.
• NoSQL databases are increasingly used in big data and real-time
web applications. NoSQL systems are also sometimes called "Not only
SQL" to emphasize that they may support SQL-like query languages, or
sit alongside SQL databases in a polyglot persistence architecture.
➢ Types
• There are various ways to classify NoSQL databases, with different categories
and subcategories, some of which overlap. What follows is a basic
classification by data model, with examples:
➢ COLUMN: Accumulo, Cassandra, Scylla, Druid, HBase, Vertica
➢ DOCUMENT: Apache CouchDB, ArangoDB, BaseX, Clusterpoint, Couchbase,
Cosmos DB, IBM Domino, MarkLogic, MongoDB, OrientDB, Qizx, RethinkDB
➢ KEY-VALUE: Aerospike, Apache Ignite, ArangoDB, Berkeley DB, Couchbase,
Dynamo, FoundationDB, InfinityDB, MemcacheDB, MUMPS, Oracle NoSQL
Database, OrientDB, Redis, Riak, SciDB, SDBM/Flat File dbm, ZooKeeper
➢ GRAPH: AllegroGraph, ArangoDB, InfiniteGraph, Apache Giraph,
MarkLogic, Neo4J, OrientDB, Virtuoso
5. Database Architectures
• The design of a DBMS depends on its architecture, which can be
centralized, decentralized or hierarchical.
• The architecture of a DBMS can be seen as either single-tier or
multi-tier.
• An n-tier architecture divides the whole system into n related but
independent modules, which can be independently modified or
replaced.

a. 1-Tier
• In a 1-tier architecture, the DBMS is the only entity: the user
works directly on the DBMS.
• Any changes made here are applied directly to the DBMS
itself.
• It does not provide handy tools for end-users.
• Database designers and programmers normally prefer to use
single-tier architecture.

Basically, a one-tier architecture keeps all of the elements of an
application, including the interface, middleware and back-end
data, in one place. Developers see these systems as the simplest
and most direct approach.

b. 2-Tier
• If the architecture of the DBMS is 2-tier, then there must be an
application through which the DBMS is accessed.
• Programmers use 2-tier architecture where they access the
DBMS by means of an application.
• Here the application tier is entirely independent of the
database in terms of operation, design, and programming.

The two-tier architecture is based on the client-server model.
Communication takes place directly between client and server;
there is no intermediary between them.

c. 3-Tier
• Database (Data) Tier: The database resides at this tier, along
with its query processing languages. The relations that define
the data and their constraints are also at this level.
• Application (Middle) Tier: The application server and the
programs that access the database reside at this tier. For a
user, this tier presents an abstracted view of the database.
End-users are unaware of any existence of the database beyond
the application. At the other end, the database tier is not
aware of any user beyond the application tier. Hence, the
application layer sits in the middle and acts as a mediator
between the end-user and the database.
• User (Presentation) Tier: End-users operate on this tier and
know nothing about any existence of the database beyond this
layer. At this layer, the application can provide multiple views
of the database. All views are generated by applications that
reside in the application tier.
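The three tiers described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the students table, the sample names and marks, and the functions make_database, top_students and render are all hypothetical, and a real deployment would run each tier as a separate process or server.

```python
import sqlite3


# --- Data tier: the database, its relations and their constraints ---
def make_database():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE students ("
        "  id INTEGER PRIMARY KEY, name TEXT NOT NULL, mark INTEGER)"
    )
    conn.executemany(
        "INSERT INTO students (name, mark) VALUES (?, ?)",
        [("Tariro", 78), ("Farai", 64), ("Rudo", 91)],
    )
    conn.commit()
    return conn


# --- Application (middle) tier: the only code that talks to the database ---
def top_students(conn, minimum_mark):
    rows = conn.execute(
        "SELECT name, mark FROM students WHERE mark >= ? ORDER BY mark DESC",
        (minimum_mark,),
    ).fetchall()
    return [{"name": name, "mark": mark} for name, mark in rows]


# --- User (presentation) tier: formats results, never sees SQL or tables ---
def render(students):
    return "\n".join(f"{s['name']}: {s['mark']}" for s in students)


conn = make_database()
report = render(top_students(conn, 70))
print(report)
```

Because the presentation code only receives plain dictionaries, the database behind the application tier could be swapped without changing what the end-user sees.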
6. Relational Databases
• A relational database, invented by E.F. Codd at IBM in 1970,
is a tabular database in which data is defined so that it can be
reorganized and accessed in a number of different ways.
• Relational databases are made up of a set of tables, each
holding data that fits into a predefined category.
• Each table has at least one data category in a column, and each
row holds a data instance for the categories defined by the
columns.

Characteristics of RDB
• Each table, sometimes called a relation, contains one or more
data categories in columns, also called attributes.
• Each row, also called a record or tuple, contains a unique
instance of data for the categories defined by the columns.
• Each table has a primary key, which uniquely identifies each
row in the table.
• The relationships between tables can then be set via foreign
keys: a foreign key is a field in a table that links to the
primary key of another table.
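As a small illustration of primary and foreign keys, the sketch below uses Python's built-in sqlite3 module. The customers and orders tables and their contents are hypothetical. Note that SQLite only enforces foreign keys after the PRAGMA shown in the code has been issued.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.execute(
    "CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT NOT NULL)"
)
conn.execute(
    "CREATE TABLE orders ("
    "  order_id INTEGER PRIMARY KEY,"
    "  customer_id INTEGER NOT NULL REFERENCES customers(customer_id),"
    "  amount REAL NOT NULL)"
)
conn.execute("INSERT INTO customers VALUES (1, 'Acme Ltd')")
conn.execute("INSERT INTO orders VALUES (10, 1, 250.0)")
conn.execute("INSERT INTO orders VALUES (11, 1, 99.5)")

# Join the two relations through the foreign key.
rows = conn.execute(
    "SELECT c.name, o.order_id, o.amount "
    "FROM orders AS o JOIN customers AS c ON c.customer_id = o.customer_id "
    "ORDER BY o.order_id"
).fetchall()
print(rows)

# An order that points at a non-existent customer violates the foreign key.
try:
    conn.execute("INSERT INTO orders VALUES (12, 999, 10.0)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print(rejected)
```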
➢ Understanding SQL
• The Structured Query Language (SQL) is the standard user
and application program interface for a relational database.
• Relational databases are easy to extend: a new data category
can be added after the original database creation without
having to modify all the existing applications.
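The claim about extending a relational database can be illustrated with a short sqlite3 sketch. The products table and the price column are hypothetical. The example assumes the existing application names its columns explicitly; queries that rely on SELECT * or on column positions may still need attention after a schema change.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE products (product_id INTEGER PRIMARY KEY, name TEXT NOT NULL)"
)
conn.execute("INSERT INTO products (name) VALUES ('Laptop')")

# A query from an "existing application", written before the schema change.
existing_query = "SELECT product_id, name FROM products"
before = conn.execute(existing_query).fetchall()

# Extend the schema with a new data category (column).
conn.execute("ALTER TABLE products ADD COLUMN price REAL")
conn.execute("UPDATE products SET price = 850.0 WHERE name = 'Laptop'")

# The old query still runs unchanged and returns the same rows.
after = conn.execute(existing_query).fetchall()
# New code can use the added column.
extended = conn.execute("SELECT name, price FROM products").fetchall()
print(before == after, extended)
```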
➢ Types
• Oracle: Oracle Database (commonly referred to as Oracle RDBMS
or simply as Oracle) is a multi-model database management system
produced and marketed by Oracle Corporation.
• MySQL: MySQL is an open-source relational database management
system (RDBMS) based on Structured Query Language (SQL).
MySQL runs on virtually all platforms, including Linux, UNIX, and
Windows.
• Microsoft SQL Server: Microsoft SQL Server is an RDBMS that
supports a wide variety of transaction processing, business
intelligence, and analytics applications in corporate IT environments.
• PostgreSQL: PostgreSQL, often simply Postgres, is an
object-relational database management system (ORDBMS) with an
emphasis on extensibility and standards compliance.
• DB2: DB2 is an RDBMS from IBM designed to store, analyze, and
retrieve data efficiently.
➢ Advantages of relational databases
• The main advantages of relational databases are that they
enable users to easily categorize and store data that can later
be queried and filtered to extract specific information for
reports.
• Relational databases are also easy to extend and aren't
reliant on physical organization.
• After the original database creation, a new data category can
be added without all existing applications being modified.
• Accurate: Data is stored just once, which eliminates data
duplication.
• Flexible: Complex queries are easy for users to carry out.
• Collaborative: Multiple users can access the same database.
• Trusted: Relational database models are mature and
well-understood.
• Secure: Access to data in tables within relational database
management systems (RDBMSes) can be limited to particular
users.

Disadvantages
• RDBMSes do not work well, or at all, with unstructured
or semi-structured data, due to schema and type constraints.
This makes them ill-suited for large analytics or IoT event
loads.
• The tables in a relational database will not necessarily map
one-to-one to the objects or classes that represent the same
data in application code.
• When migrating from one RDBMS to another, schemas and types
must generally be identical between source and destination
tables for the migration to work (schema constraint). For many of
the same reasons, extremely complex datasets or those
containing variable-length records are generally difficult to
handle with an RDBMS schema.

Understanding NoSQL
• There are 4 basic types of NoSQL databases:
➢ Key-Value Store: a big hash table of keys and
values. {Examples: Riak, Amazon Dynamo}
➢ Document-based Store: stores documents made up of
tagged elements. {Example: CouchDB}
➢ Column-based Store: each storage block contains data
from only one column. {Examples: HBase, Cassandra}
➢ Graph-based Store: a network database that uses edges and
nodes to represent and store data. {Example: Neo4J}

[Figures: Key-Value Store, Document-based Store, Column-based Store]
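The four NoSQL data models listed above can be compared by writing the same small facts in each shape. The sketch below uses only plain Python structures as stand-ins; it does not use any real NoSQL product, and the keys, names and values are invented for illustration.

```python
import json

# Key-value: an opaque value is stored and fetched by a single key.
key_value = {"user:1": '{"name": "Alice", "age": 30}'}

# Document: a nested, self-describing record; fields may differ per document.
document = {"_id": 1, "name": "Alice", "age": 30, "address": {"city": "Harare"}}

# Column-based: values of one column are stored together,
# which makes scans and aggregates over a single column cheap.
columns = {"name": ["Alice", "Bob"], "age": [30, 25]}

# Graph: nodes plus edges that describe relationships between them.
nodes = {1: {"name": "Alice"}, 2: {"name": "Bob"}}
edges = [(1, "FOLLOWS", 2)]

# Typical access pattern for each model.
name_from_kv = json.loads(key_value["user:1"])["name"]     # lookup by key
city_from_doc = document["address"]["city"]                # navigate a document
average_age = sum(columns["age"]) / len(columns["age"])    # aggregate a column
followed_by_alice = [
    nodes[dst]["name"] for src, rel, dst in edges if src == 1 and rel == "FOLLOWS"
]                                                          # traverse edges
print(name_from_kv, city_from_doc, average_age, followed_by_alice)
```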

7. Federated Databases
• A federated database system is a type of meta-database
management system (DBMS), which transparently
integrates multiple autonomous database systems into a
single federated database.
• The constituent databases are interconnected via a computer
network and may be geographically decentralized.
• Since the constituent database systems remain autonomous,
a federated database system is an alternative to the
(sometimes daunting) task of merging together several
disparate databases.
➢ Accessing a FDB
• A federated database, or virtual database, is the fully integrated, logical
composite of all constituent databases in a federated database system.
• Through data abstraction, federated database systems can provide a
uniform user interface, enabling users and clients to store and retrieve
data in multiple non-contiguous databases with a single query, even if
the constituent databases are heterogeneous.
• To this end, a federated database system must be able to decompose the
query into subqueries for submission to the relevant constituent
DBMSs, after which the system must combine the result sets of the
subqueries.
• Because various database management systems employ different query
languages, federated database systems can apply wrappers to the
subqueries to translate them into the appropriate query languages.
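The decompose, wrap and combine steps described above can be sketched with two in-memory SQLite databases standing in for autonomous constituent systems. Everything here is hypothetical: the two schemas, the sample data, and the function names harare_wrapper, bulawayo_wrapper and federated_customers_in. A real federated DBMS would also handle networking, differing query languages and failures.

```python
import sqlite3

# Two autonomous constituent databases with different schemas.
harare = sqlite3.connect(":memory:")
harare.execute("CREATE TABLE customers (name TEXT, city TEXT)")
harare.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [("Tariro", "Harare"), ("Farai", "Chinhoyi")],
)

bulawayo = sqlite3.connect(":memory:")
bulawayo.execute("CREATE TABLE clients (full_name TEXT, town TEXT)")
bulawayo.executemany(
    "INSERT INTO clients VALUES (?, ?)",
    [("Rudo", "Bulawayo"), ("Tendai", "Chinhoyi")],
)


# Wrappers: translate the federated request into each source's own query.
def harare_wrapper(city):
    sql = "SELECT name FROM customers WHERE city = ?"
    return [row[0] for row in harare.execute(sql, (city,))]


def bulawayo_wrapper(city):
    sql = "SELECT full_name FROM clients WHERE town = ?"
    return [row[0] for row in bulawayo.execute(sql, (city,))]


WRAPPERS = [harare_wrapper, bulawayo_wrapper]


def federated_customers_in(city):
    """Send one subquery per constituent database and combine the results."""
    results = []
    for wrapper in WRAPPERS:
        results.extend(wrapper(city))
    return sorted(results)


chinhoyi_customers = federated_customers_in("Chinhoyi")
print(chinhoyi_customers)
```

The caller issues a single request and never needs to know that the answer came from two databases with different table and column names.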
➢ Characteristics of FDS

a. Characteristics - Autonomy
• Degree to which an individual DBMS can operate independently:
o Transaction control
o Query processing
o Distribution of control
[Figure: autonomy scale from low to high: logically integrated
multiple DBMS, federated DBMS, multidatabase system]
b. Characteristics - Distribution
• Deals with data
o Single DBS
o Many DBSs in a local area network
o Many DBSs in a wide area network
[Figure: distribution scale from local (single DBS) to distributed
(multiple sites)]

• Data and the Federated Database System (FDS)
o Databases may be on the same computer
o Databases may be geographically separate
o Systems must be able to communicate
• Benefits of distribution
o Improved access times
o Improved availability
o Improved reliability

c. Characteristics - Heterogeneity
• Data models
o Structures
o Constraints
o Query languages
➢ Types of FDB

➢ Coupling

8. Map Reduce Paradigm
• MapReduce is a programming framework that abstracts the
complexity of parallel applications. Its management architecture
is based on the master/worker model, while worker-to-worker data
exchange follows a peer-to-peer (P2P) model.
• This programming paradigm enables massive scalability across
hundreds or thousands of servers in a Hadoop cluster.
• The MapReduce concept is fairly simple to understand for those
who are familiar with clustered scale-out data processing solutions.
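The map, shuffle and reduce steps of the paradigm can be sketched as a word count in plain Python. This is a single-process illustration only; it does not use Hadoop, and the function names map_phase, shuffle and reduce_phase and the two input strings are made up for the example. In a real cluster the map and reduce calls would run in parallel on different nodes.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


# Each string stands for one input split handled by one mapper.
splits = ["big data needs big tools", "data tools for big data"]

mapped = [pair for split in splits for pair in map_phase(split)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(counts)
```

The programmer writes only the map and reduce functions; grouping by key and distributing the work are the framework's job.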

➢ Advantages of MapReduce
• Distributed data and computation: keeping computation local to the data
prevents network overload.
• Linear scaling in the ideal case. It is designed for cheap, commodity
hardware.
• Simple programming model: the end-user programmer only writes the map
and reduce tasks.
• Portability across heterogeneous commodity hardware and operating systems.
• Economy, by distributing data and processing across clusters of commodity
personal computers.
• Efficiency, by distributing data and the logic to process it in parallel
on the nodes where the data is located.

Practical Exercise

HADOOP ECOSYSTEM