Professional Documents
Culture Documents
Bda Mod 1
Bda Mod 1
Unstructured Data
→ Unstructured data does not possess data features such as - a table
or- a database.
→ Unstructured data are found in file types such as TXT, .CSV.- key-
value pairs,- e-mails.
→Unstructured data do not reveal relationships, hierarchy
relationships or object-oriented features.
→The relationships, schema and features need to be separately
established.
→Growth in data today can be characterized as mostly unstructured
data
Big Data Characteristics
Characteristics of Big Data, called 3Vs (and 4Vs also used) are:
Volume:
- is related to size of the data
-Size defines the amount or quantity of data, which is generated from
an application(s).
- The size determines the processing considerations needed for
handling that data.
Velocity:
- The term velocity refers to the speed of generation of data.
- Velocity is a measure of how fast the data generates and
processes
- To meet the demands and the challenges of processing Big Data, the
velocity of generation of data plays a crucial role.
Variety:
-comprises of a variety of data.
-Data is generated from multiple sources in a system.
-Data is in various forms and formats.
-The variety is due to the availability of a large number of
heterogeneous platforms in the industry.
Veracity:
-takes into account the quality of data captured, which can vary
greatly, affecting its accurate analysis.
-odd data generated and not matching with others.
Big Data Classification
1. Data sources
Data storage such as records, RDBMs, distributed databases, row-
oriented In-memory data tables,
2. Data formats (traditional)
Structured and semi-structured
3. Big Data sources
data warehouse, NoSQL database
4. Big Data formats
Unstructured, semi-structured and multi-structured data
5. Processing data rates
near-time, real-time, streaming
6. Processing Big Data rates
High volume, velocity, variety and veracity,
7. Data analysis methods
machine learning algorithms, clustering algorithms,
Analytics Scalability to Big Data (6m)
Vertical Scalability:
Means scaling up the given system's resources and increasing
the system's analytics, reporting and visualization capabilities.
Scaling up also means designing the algorithm according to the
architecture that uses resources efficiently.
For example, x terabyte of data take time t for processing, code
size with increasing complexity increase by factor n, then
scaling up means that processing takes equal, less or much less
than (n x t).
Horizontal scalability:
Means increasing the number of systems working in coherence and
scaling out or distributing the workload.
Processing different datasets of a large dataset deploys horizontal
scalability.
Scaling out is basically using more resources and distributing the
processing and storage tasks in parallel.
Massively Parallel Processing Platforms (6m)
Cloud Computing
• "Cloud computing is a type of Internet-based computing that
provides shared processing resources and data to the computers and
other devices on demand.“
• One of the best approach for data processing is to perform parallel
and distributed computing in a cloud-computing environment.
• It offers high data security compared to other distributed
technologies.
• Cloud resources can be Amazon Web Service (AWS), Digital
Ocean, Elastic,Compute Cloud (EC2),
Cloud computing features are:
(i) on-demand service
(ii) resource pooling,
(iii)scalability,
(iv)accountability, and
(v) broad network access.
Cloud services can be classified into three fundamental types:
1. Infrastructure as a Service (IaaS): Provides access to resources,
such as hard disks, network connections,databases storage,data center
Some examples are
Tata Communications, Amazon data centers and
2. Platform as a Service (PaaS): • Provides the runtime environment
to allow developers to build applications and services
Examples are Hadoop Cloud Service (Microsoft Azure)
3. Software as a Service (SaaS): Provides software applications as a
service to end-users is known as Software asa Service.
Software applications are hosted by a service provider and made
available to customers over the Internet.
• Some examples are GoogleSQL,
Cluster Computing:
• A cluster is a group of computers connected by a network.
• The group works together to accomplish the same task.
• Clusters are used mainly for load balancing.
• They shift processes between nodes to keep an even load on the
group of connected computers.
Volunteer Computing (not imp)
Data Integrity
Outliers
An outlier in data refers to data, which appears to not belong to the
dataset.
• For example, data that is outside an expected range.
• Actual outliers need to be removed from the dataset, else the result
will be effected by a small or large amount.
if valid data is identified as outlier, then also the results will be
affected.
The outliers are a result of human data-entry errors, programming
bugs
Duplicate Values:
Another factor effecting data quality is duplicate values. Duplicate
value implies the same data appearing two or more times in a dataset.
Data Pre-processing
• Data pre-processing is an important step at the ingestion layer.
• Pre-processing is a must before data mining and analytics.
• Pre-processing is also a must before running a Machine Learning
(ML) algorithm.
• Necessary when data is being exported to a cloud service or data
store needs pre-processing.
(i) Dropping out of range, inconsistent and outlier values
(ii) Filtering unreliable, irrelevant and redundant information
(ii) Data cleaning, editing, reduction and/or wrangling
(iv) Data validation and transformation
(v) ELT processing.
Data Cleaning
Data Enrichment:
Data enrichment refers to operations or processes which refine,
enhance or improve the raw data.
Data Editing:
Data editing refers to the process of reviewing and adjusting by
performing additional computation on the acquired datasets.
Data reduction
refers to brining the information into, correct and simplified form.
Data Wrangling:
Data wrangling refers to the process of transforming and mapping the
data from one format to another format,
Berkeley Data Analytics Stack (BDAS) (8m)