
CHINHOYI UNIVERSITY OF TECHNOLOGY

Entrepreneurship & Business Sciences

Graduate Business School


MSc. Data Analytics

Big Data Analytics


[MSCDA 6-8]

Eng. N.F Thusabantu - Shoniwa


TOPIC 3: Big Data Characteristics
Volume, Velocity, Variety and Veracity

Major Modern Database Types

Database Systems Architecture

Relational

Federated

Map/Reduce
Eng. N.F Thusabantu
Content
1. Introduction
2. Volume, Velocity, Variety, Veracity, Value
3. What is a database
4. Modern database types
o MongoDB
o Cassandra

5. Database Systems Architecture


6. Relational
7. Federated
8. Map Reduce



1. Introduction

• The exponential growth of data produced per day has given
rise to the popularity of Big Data Analytics Tools and
Techniques.



Big Data Tools and Techniques



• Big Data needs extraordinary techniques to efficiently
process a large volume of data within limited run times. These
techniques come from several disciplines, and they overlap
with each other.
• Optimization Methods can be used for solving quantitative
problems in different sectors such as biology, economics, and
engineering.
• Statistics involves the collection, organization, and
interpretation of data. Statistical techniques are used to
describe the correlation between different objectives.
• Data Mining is a technique that is used for extracting valuable
information from data. It involves clustering analysis and
classification.



• Machine Learning is used for designing algorithms that let
systems learn behaviours from data, so that businesses can
make intelligent decisions.
• Artificial Neural Network (ANN) is an advanced technique
that is found in pattern recognition, adaptive control, image
analysis and more.
• Visualization Approaches are useful to create tables,
diagrams and other representations to understand data.
• Social Network Analysis (SNA) is an important technique
in modern sociology that views social relationships in terms
of nodes and ties.
• Higher level Big Data technologies include distributed
computational systems, file systems, data mining, cloud-based
storage, and computing.



ACID Properties
• Database transactions are expected to observe the ACID properties
(some NoSQL systems relax them in exchange for scalability):
A - Atomicity
C - Consistency
I - Isolation
D - Durability



• Atomicity: A database follows the all-or-nothing rule, i.e., the
database considers all transaction operations as one whole unit
or atom.
• Thus, when a database processes a transaction, it is either fully
completed or not executed at all.

• Consistency: Ensures that only valid data following all rules
and constraints is written in the database.
• When a transaction results in invalid data, the database reverts
to its previous state, which abides by all customary rules and
constraints.



• Isolation: Ensures that transactions are securely and independently
processed at the same time without interference, but it does not ensure
the order of transactions.
• For example, user A withdraws $100 and user B withdraws $250 from
user Z’s account, which has a balance of $1000. Since both A and B
draw from Z’s account, one of the users is required to wait until the
other user transaction is completed, avoiding inconsistent data. If B is
required to wait, then B must wait until A’s transaction is completed,
and Z’s account balance changes to $900. Now, B can withdraw $250
from this $900 balance.

• Durability: Once a transaction has been committed, its changes persist
even if the system fails afterwards. In the above example, user B may
withdraw $250 only after user A’s transaction is completed and is
updated in the database.
• If the system fails before A’s transaction is logged in the database, A
cannot withdraw any money, and Z’s account returns to its previous
consistent state.
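
The withdrawal example above can be sketched with Python's built-in sqlite3 module. This is only an illustration under stated assumptions: the `accounts` and `ledger` tables, the `withdraw` helper and the 5000 overdraft are hypothetical, and a single in-memory connection is used, so it demonstrates atomicity and consistency (via a CHECK constraint) rather than concurrent isolation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The CHECK constraint is the consistency rule: no negative balances.
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.execute("CREATE TABLE ledger (account TEXT, amount INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('Z', 1000)")
conn.commit()


def withdraw(conn, account, amount):
    """Record and apply a withdrawal as one all-or-nothing transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute("INSERT INTO ledger VALUES (?, ?)", (account, amount))
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, account),
            )
        return True
    except sqlite3.IntegrityError:
        # The ledger row inserted above is rolled back together with the update.
        return False


ok_a = withdraw(conn, "Z", 100)    # user A: balance becomes 900
ok_b = withdraw(conn, "Z", 250)    # user B: balance becomes 650
ok_c = withdraw(conn, "Z", 5000)   # hypothetical overdraft: rejected

balance = conn.execute(
    "SELECT balance FROM accounts WHERE name = 'Z'"
).fetchone()[0]
ledger_rows = conn.execute("SELECT COUNT(*) FROM ledger").fetchone()[0]
print(ok_a, ok_b, ok_c, balance, ledger_rows)
```

The rejected withdrawal leaves no trace in either table, which is the all-or-nothing behaviour described under Atomicity.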
2. The Vs in Big Data
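• Volume – the amount of data

• Velocity – the speed at which data is generated, transmitted and processed

• Variety – the types of data (structured, semi-structured, unstructured)

• Veracity – the uncertainty or doubt in the data

• Value – the worth of the data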



3. What is a database
• A database is a collection of information that is organized so
that it can be easily accessed, managed and updated.

• Data is organized into rows, columns and tables, and it is
indexed to make it easier to find relevant information.

• Data gets updated, expanded and deleted as new information
is added.

• Databases process workloads to create and update themselves,
querying the data they contain and running applications
against it.



• In computing, databases are sometimes classified according to
their organizational approach.

• There are many different kinds of databases, ranging from the
most prevalent approach, the relational database, to
a distributed database, cloud database or NoSQL database.



Python to SQL
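import pyodbc

# server_name, db_name and table_name are placeholders for your own values.
conn = pyodbc.connect('Driver={SQL Server};'
                      'Server=server_name;'
                      'Database=db_name;'
                      'Trusted_Connection=yes;')

cursor = conn.cursor()
# The database is already selected in the connection string.
cursor.execute('SELECT * FROM table_name')

# Iterating over the cursor yields one row at a time.
for row in cursor:
    print(row)

conn.close()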



4. Modern Database Types
• A NoSQL (originally referring to "non SQL" or "non relational")
database provides a mechanism for storage and retrieval of
data that is modelled by means other than the tabular relations used
in relational databases.

• Such databases have existed since the late 1960s, but did not obtain the
"NoSQL" moniker until a surge of popularity in the early 21st
century, triggered by the needs of Web 2.0 companies. 

• NoSQL databases are increasingly used in big data and real-time
web applications. NoSQL systems are also sometimes called "Not only
SQL" to emphasize that they may support SQL-like query languages, or
sit alongside SQL databases in a polyglot persistence architecture.



Types
• There are various ways to classify NoSQL databases, with different categories
and subcategories, some of which overlap. What follows is a basic
classification by data model, with examples:

o COLUMN: Accumulo, Cassandra, Scylla, Druid, HBase, Vertica
o DOCUMENT: Apache CouchDB, ArangoDB, BaseX, Clusterpoint, Couchbase,
Cosmos DB, IBM Domino, MarkLogic, MongoDB, OrientDB, Qizx, RethinkDB
o KEY-VALUE: Aerospike, Apache Ignite, ArangoDB, Berkeley DB, Couchbase,
Dynamo, FoundationDB, InfinityDB, MemcacheDB, MUMPS, Oracle NoSQL
Database, OrientDB, Redis, Riak, SciDB, SDBM/Flat File dbm, ZooKeeper
o GRAPH: AllegroGraph, ArangoDB, InfiniteGraph, Apache Giraph,
MarkLogic, Neo4J, OrientDB, Virtuoso
5. Database Architectures
• The design of a DBMS depends on its architecture. It can be
centralized, decentralized, or hierarchical.

• The architecture of a DBMS can be seen as either single tier or
multi-tier.

• An n-tier architecture divides the whole system into n related but
independent modules, which can be independently modified,
altered, changed, or replaced.



a. 1-Tier
• In 1-tier architecture, the DBMS is the only entity: the user
works directly on the DBMS.

• Any changes made here are applied directly to the DBMS
itself.

• It does not provide handy tools for end-users.

• Database designers and programmers normally prefer to use
single-tier architecture.



Basically, a one-tier architecture keeps all of the elements of an
application, including the interface, middleware and back-end
data, in one place. Developers see these types of systems as the
simplest and most direct approach.



b. 2-Tier
• If the architecture of DBMS is 2-tier, then it must have an
application through which the DBMS can be accessed.

• Programmers use 2-tier architecture where they access the
DBMS by means of an application.

• Here the application tier is entirely independent of the
database in terms of operation, design, and programming.



The two-tier architecture is based on the client-server model.
Communication takes place directly between client and server,
with no intermediary between them.



c. 3-Tier

• Database (Data) Tier − At this tier, the database resides along
with its query processing languages. We also have the
relations that define the data and their constraints at this level.
• Application (Middle) Tier − At this tier reside the application
server and the programs that access the database. For a user,
this application tier presents an abstracted view of the
database. End-users are unaware of any existence of the
database beyond the application. At the other end, the database
tier is not aware of any other user beyond the application tier.
Hence, the application layer sits in the middle and acts as a
mediator between the end-user and the database.
• User (Presentation) Tier − End-users operate on this tier and
they know nothing about any existence of the database beyond
this layer. At this layer, multiple views of the database can be
provided by the application. All views are generated by
applications that reside in the application tier.
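
The three tiers described above can be sketched in a few lines of Python. This is a minimal single-process illustration, not a deployment pattern: the `students` table, the `StudentService` class, the `render` function and the sample marks are all hypothetical, and an in-memory SQLite database stands in for the data tier.

```python
import sqlite3


# Data tier: the database, its relations and their constraints.
def make_data_tier():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE students ("
        "id INTEGER PRIMARY KEY, name TEXT NOT NULL, mark INTEGER)"
    )
    conn.executemany(
        "INSERT INTO students (name, mark) VALUES (?, ?)",
        [("Tariro", 72), ("Farai", 58)],
    )
    conn.commit()
    return conn


# Application (middle) tier: the only code that talks to the database.
class StudentService:
    def __init__(self, conn):
        self._conn = conn

    def passed(self, pass_mark):
        rows = self._conn.execute(
            "SELECT name FROM students WHERE mark >= ? ORDER BY name",
            (pass_mark,),
        )
        return [name for (name,) in rows]


# User (presentation) tier: sees names only, never tables or SQL.
def render(names):
    return "Passed: " + ", ".join(names)


service = StudentService(make_data_tier())
view = render(service.passed(pass_mark=60))
print(view)
```

Because the presentation function only receives a list of names, the data tier could be swapped for another DBMS without changing it.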
6. Relational Databases
• A relational database, invented by E.F. Codd at IBM in 1970,
is a tabular database in which data is defined so that it can be
reorganized and accessed in a number of different ways.

• Relational databases are made up of a set of tables with data
that fits into a predefined category.

• Each table has at least one data category in a column, and each
row has a certain data instance for the categories which are
defined in the columns.



Characteristics of RDB
• Each table, which is sometimes called a relation, in a relational
database contains one or more data categories in columns, also
called attributes.

• Each row, also called a record or tuple, contains a unique
instance of data, or key, for the categories defined by the
columns.

• Each table has a unique primary key, which identifies each
row in the table.

• The relationship between tables can then be set via the use
of foreign keys -- a field in a table that links to the primary key
of another table.
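
Primary and foreign keys can be illustrated with Python's built-in sqlite3 module. The `departments` and `employees` tables and their rows are hypothetical examples. Note that SQLite enforces foreign keys only after `PRAGMA foreign_keys = ON`; other RDBMSs typically enforce them by default.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enable enforcement in SQLite
conn.execute(
    "CREATE TABLE departments (dept_id INTEGER PRIMARY KEY, name TEXT NOT NULL)"
)
conn.execute(
    "CREATE TABLE employees ("
    "emp_id INTEGER PRIMARY KEY, name TEXT NOT NULL, "
    "dept_id INTEGER NOT NULL REFERENCES departments(dept_id))"
)
conn.execute("INSERT INTO departments VALUES (1, 'Analytics')")
conn.execute("INSERT INTO employees VALUES (10, 'Ada', 1)")
conn.commit()

# The foreign key lets us relate each employee row to its department row.
rows = conn.execute(
    "SELECT e.name, d.name FROM employees AS e "
    "JOIN departments AS d ON d.dept_id = e.dept_id"
).fetchall()

# A row pointing at a department that does not exist is rejected.
try:
    conn.execute("INSERT INTO employees VALUES (11, 'Bob', 99)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

print(rows, rejected)
```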
Understanding SQL
• The Structured Query Language (SQL) is the standard user
and application program interface for a relational database.

• Relational databases are easy to extend, and a new data
category can be added after the original database creation
without requiring that you modify all the existing applications.
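
The point about extensibility can be sketched as follows, again with sqlite3. The `customers` table, the `email` column and the `list_names` function are hypothetical. The existing query keeps working after the new column is added because it names the columns it needs; code that relies on `SELECT *` and column positions would be more fragile.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO customers (name) VALUES ('Rudo')")
conn.commit()


def list_names(conn):
    """An 'existing application' query written before the schema change."""
    rows = conn.execute("SELECT name FROM customers ORDER BY id")
    return [name for (name,) in rows]


before = list_names(conn)

# Add a new data category (column) after the original database creation.
conn.execute("ALTER TABLE customers ADD COLUMN email TEXT")
conn.execute(
    "INSERT INTO customers (name, email) VALUES ('Tendai', 'tendai@example.com')"
)
conn.commit()

after = list_names(conn)  # the old query runs unchanged
print(before, after)
```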



Types
• Oracle: Oracle Database (commonly referred to as Oracle RDBMS
or simply as Oracle) is a multi-model database management system
produced and marketed by Oracle Corporation.
• MySQL: MySQL is an open-source relational database management
system (RDBMS) based on Structured Query Language (SQL).
MySQL runs on virtually all platforms, including Linux, UNIX, and
Windows.
• Microsoft SQL Server: Microsoft SQL Server is an RDBMS that
supports a wide variety of transaction processing, business
intelligence, and analytics applications in corporate IT environments.
• PostgreSQL: PostgreSQL, often simply Postgres, is an object-
relational database management system (ORDBMS) with an
emphasis on extensibility and standards compliance.
• DB2: DB2 is an RDBMS designed to store, analyze, and retrieve data
efficiently.



Advantages of relational databases

• The main advantages of relational databases are that they
enable users to easily categorize and store data that can later
be queried and filtered to extract specific information for
reports.

• Relational databases are also easy to extend and aren't
reliant on physical organization.

• After the original database creation, a new data category can
be added without all existing applications being modified.



• Accurate: Data is stored just once, which eliminates data
duplication.

• Flexible: Complex queries are easy for users to carry out.

• Collaborative: Multiple users can access the same database.

• Trusted: Relational database models are mature and well-understood.

• Secure: Data in tables within relational database
management systems (RDBMSes) can be limited to allow
access by only particular users.



Disadvantages
• RDBMSes do not work well — or at all — with unstructured
or semi-structured data, due to schema and type constraints.
This makes them ill-suited for large analytics or IoT event
loads.

• The tables in your relational database will not necessarily map
one-to-one with an object or class representing the same data.

• When migrating one RDBMS to another, schemas and types
must generally be identical between source and destination
tables for migration to work (schema constraint). For many of
the same reasons, extremely complex datasets or those
containing variable-length records are generally difficult to
handle with an RDBMS schema.

Understanding NoSQL
• There are 4 basic types of NoSQL databases:
o Key-Value Store - It has a big hash table of keys and
values {Example- Riak, Amazon Dynamo}
o Document-based Store - It stores documents made up of
tagged elements. {Example- CouchDB}
o Column-based Store - Each storage block contains data
from only one column. {Example- HBase, Cassandra}
o Graph-based - A network database that uses edges and
nodes to represent and store data. {Example- Neo4J}
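
The four data models can be mimicked with plain Python structures to see how they differ. This is a conceptual sketch only: the sample records and the helper functions are hypothetical, and none of this is the API of any product named above.

```python
# Key-value: an opaque value is stored and fetched by its key.
kv_store = {"user:42": '{"name": "Rudo", "age": 31}'}

# Document: self-describing records with tagged, possibly nested, fields.
document_store = [
    {"_id": 42, "name": "Rudo", "age": 31, "address": {"city": "Chinhoyi"}},
    {"_id": 43, "name": "Tendai", "age": 27},  # fields may differ per document
]

# Column-based: each column's values are stored together.
column_store = {"name": ["Rudo", "Tendai"], "age": [31, 27]}

# Graph: nodes plus edges that describe relationships between them.
graph_edges = [("Rudo", "FOLLOWS", "Tendai"), ("Tendai", "FOLLOWS", "Farai")]


def average_age(columns):
    """Reads a single column, which is what column stores are optimised for."""
    ages = columns["age"]
    return sum(ages) / len(ages)


def followed_by(edges, person):
    """Traverses edges starting from one node."""
    return [dst for (src, rel, dst) in edges if src == person and rel == "FOLLOWS"]


def cities(documents):
    """Queries inside documents, tolerating records without the field."""
    return [d["address"]["city"] for d in documents if "address" in d]


print(kv_store["user:42"])
print(average_age(column_store), followed_by(graph_edges, "Rudo"), cities(document_store))
```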



Key-Value Store



Document-based Store



Column-based Store



7. Federated Databases
• A federated database system is a type of meta-database
management system (DBMS), which transparently
integrates multiple autonomous database systems into a
single federated database.

• The constituent databases are interconnected via a computer
network and may be geographically decentralized.

• Since the constituent database systems remain autonomous,
a federated database system is an alternative to the
(sometimes daunting) task of merging together several
disparate databases.
Accessing an FDB
• A federated database, or virtual database, is the fully integrated, logical
composite of all constituent databases in a federated database system.
• Through data abstraction, federated database systems can provide a
uniform user interface, enabling users and clients to store and retrieve
data in multiple non-contiguous databases with a single query, even if
the constituent databases are heterogeneous.
• To this end, a federated database system must be able to decompose the
query into subqueries for submission to the relevant constituent
DBMSs, after which the system must combine the result sets of the
subqueries.

• Because various database management systems employ different query
languages, federated database systems can apply wrappers to the
subqueries to translate them into the appropriate query languages.
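
The decompose, wrap and combine steps can be sketched with two independent in-memory SQLite databases standing in for autonomous constituent systems. Everything here is a hypothetical illustration: the table and column names, the sample rows, the `WRAPPERS` list and the `customers_in` function. A real federated DBMS would also handle different query languages, networks and failures.

```python
import sqlite3


def make_source(ddl, insert_sql, rows):
    """Create one autonomous constituent database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(ddl)
    conn.executemany(insert_sql, rows)
    conn.commit()
    return conn


# Two heterogeneous sources: same kind of data, different schemas.
source_a = make_source(
    "CREATE TABLE customers (name TEXT, city TEXT)",
    "INSERT INTO customers VALUES (?, ?)",
    [("Rudo", "Harare"), ("Tendai", "Chinhoyi")],
)
source_b = make_source(
    "CREATE TABLE clients (full_name TEXT, town TEXT)",
    "INSERT INTO clients VALUES (?, ?)",
    [("Sipho", "Bulawayo"), ("Nomsa", "Chinhoyi")],
)

# Wrappers: translate the one logical request into each source's own query.
WRAPPERS = [
    (source_a, "SELECT name FROM customers WHERE city = ?"),
    (source_b, "SELECT full_name FROM clients WHERE town = ?"),
]


def customers_in(city):
    """Single federated query: run the subqueries, then combine the results."""
    combined = []
    for conn, subquery in WRAPPERS:
        combined.extend(name for (name,) in conn.execute(subquery, (city,)))
    return sorted(combined)


result = customers_in("Chinhoyi")
print(result)
```

The caller issues one query and never sees that two databases with different schemas were involved.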
Characteristics of FDS



a. Characteristics - Autonomy

• Distribution of control: the degree to which an individual DBMS can
operate independently
o Transaction control
o Query processing

Autonomy scale, from low to high: logically integrated multiple DBMS,
federated DBMS, multidatabase system


b. Characteristics - Distribution

• Deals with data

o Single DBS
o Many DBSs in a local area network
o Many DBSs in a wide area network

Distribution scale, from local to distributed: single DBS, multiple sites



• Data and the Federated Database System (FDS)
o Databases may be on the same computer
o Databases may be geographically separate
o Systems must be able to communicate

• Benefits of distribution
o Improved access times
o Improved availability
o Improved reliability



c. Characteristics - Heterogeneity

• Data models

o Structures
o Constraints
o Query languages



Types of FDB



Coupling



8. Map Reduce Paradigm
• MapReduce is a programming framework that abstracts the
complexity of parallel applications. The management
architecture is based on the master/worker model, while
worker-to-worker data exchange requires a P2P model.

• This programming paradigm enables massive scalability across
hundreds or thousands of servers in a Hadoop cluster.

• The MapReduce concept is fairly simple to understand for those
who are familiar with clustered scale-out data processing solutions.
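
The paradigm itself can be sketched in plain Python with the classic word count. This runs in a single process purely to show the map, shuffle and reduce phases; it is not Hadoop code, and the function names and sample documents are hypothetical. In a real cluster the map and reduce calls would run in parallel on the nodes that hold the data.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (key, value) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)


documents = ["big data needs big tools", "data tools scale"]

mapped = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(counts)
```

The programmer writes only `map_phase` and `reduce_phase`; grouping and distribution are the framework's job.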

Advantages of MapReduce
• Distributed data and computation. Running the computation where the data
resides avoids overloading the network.

• Linear scaling in the ideal case. It is designed for cheap, commodity
hardware.

• Simple programming model. The end-user programmer only writes
map-reduce tasks.

• Portability across heterogeneous commodity hardware and operating systems

• Economy by distributing data and processing across clusters of commodity
personal computers

• Efficiency by distributing data and logic to process it in parallel on nodes
where data is located



Practical Exercise



HADOOP ECOSYSTEM
