
BDA MOD 1

Need for Big Data / Evolution of Big Data (6m)


→The advancement in technology has led to an increase in the
production and storage of voluminous amounts of data.
→Earlier, megabytes (10^6 B) were used.
→Nowadays, petabytes (10^15 B) are used for processing, analysis,
discovering new facts and generating new knowledge.
→Conventional systems for storage, processing and analysis face
challenges from the large growth in the volume of data, the variety of
data, its various forms and formats, its increasing complexity, the
faster generation of data and the need to process, analyze and use it quickly.

→As the size and complexity of data increase, the proportion of
unstructured data types also increases.
→An example of a traditional tool for structured data storage and
querying is an RDBMS.
→The volume, velocity and variety (3Vs) of data require the use of a
number of programs and tools for analyzing and processing it at a very
high speed.
Data Model:
➢ Map or schema
➢ Represents the inherent properties of the data.
Data Repository:
➢ Collection of data.
Data Store:
➢ Data repository of a set of objects.
Distributed Data Store:
➢ A data store distributed over multiple nodes.
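
To make the idea of a distributed data store concrete, here is a minimal sketch (not from the notes; the node names and routing rule are assumptions) of a key-value store spread over multiple nodes, where each key is hashed to pick the node that holds it:

```python
# Minimal sketch of a distributed key-value data store: each key is routed
# to one of several nodes by hashing. Node names are illustrative only.
import hashlib

NODES = ["node-0", "node-1", "node-2"]      # hypothetical storage nodes
stores = {node: {} for node in NODES}       # one in-memory dict per node

def node_for(key: str) -> str:
    """Pick a node deterministically from the key's hash."""
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return NODES[digest % len(NODES)]

def put(key: str, value):
    stores[node_for(key)][key] = value

def get(key: str):
    return stores[node_for(key)].get(key)

put("user:42", {"name": "Asha"})
print(get("user:42"), "is stored on", node_for("user:42"))
```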

Data: is information that can be stored and used by a computer program.
Web data: is the data present on web servers (or enterprise servers)
in the form of text, images, videos, audio and multimedia files for
web users.
Eg: Wikipedia, Google Maps

Big Data: is a high-volume, high-velocity and high-variety information
asset that requires new forms of processing for enhanced decision
making, insight discovery and process optimization.
Classification of Data/ formats of data
Structured data
➢Structured data conform and associate with data schemas and data
models.
➢Structured data are found in tables (rows and columns).
➢Data in RDBMS
➢Nearly 15–20% of data is in structured or semi-structured form.
➢Structured data enables the following:
• Data insert, delete, update and append
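
As an illustration of these operations on structured (row/column) data, here is a small sketch using Python's built-in sqlite3 module; the table and column names are assumptions made for the example:

```python
# Illustrative only: a tiny relational table showing insert/append, update
# and delete on structured data. Table and column names are made up.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (roll INTEGER PRIMARY KEY, name TEXT, marks REAL)")

conn.execute("INSERT INTO student VALUES (1, 'Ravi', 78.5)")    # insert/append
conn.execute("UPDATE student SET marks = 81.0 WHERE roll = 1")  # update
conn.execute("DELETE FROM student WHERE marks < 40")            # delete

print(conn.execute("SELECT * FROM student").fetchall())
conn.close()
```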
Semi-structured
➢Semi-structured data does not conform and associate with formal
data model structures, such as the relational database and table models.
➢Examples of semi-structured data are XML and JSON documents.
➢It contains tags and other markers which separate semantic elements.
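
A brief illustration of semi-structured data: a JSON document whose field names act as the tags/markers separating semantic elements (the record below is made up for the example), parsed with Python's json module:

```python
# Semi-structured data: no fixed relational schema, but tags/field names
# mark the semantic elements. The record below is a made-up example.
import json

doc = '{"title": "Big Data Notes", "tags": ["BDA", "module-1"], "author": {"name": "Unknown"}}'
record = json.loads(doc)

print(record["title"])           # fields are addressed by name, not by column
print(record["author"]["name"])  # nesting is allowed, unlike a flat table row
```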
Multi-Structured Data
➢ Consists of multiple formats of data, viz. structured, semi-structured
and/or unstructured data.
➢ Multi-structured data sets can have many formats.
➢ They are found in non-transactional systems.
➢ Examples: streaming data on customer interactions, Chess Championship data.

Unstructured Data
→ Unstructured data does not possess data features such as a table or
a database.
→ Unstructured data are found in file types such as .TXT and .CSV files,
key-value pairs and e-mails.
→ Unstructured data do not reveal relationships, hierarchical
relationships or object-oriented features.
→ The relationships, schema and features need to be separately
established.
→ Growth in data today can be characterized as mostly unstructured data.
Big Data Characteristics
Characteristics of Big Data, called the 3Vs (and sometimes 4Vs), are:
Volume:
- is related to size of the data
-Size defines the amount or quantity of data, which is generated from
an application(s).
- The size determines the processing considerations needed for
handling that data.

Velocity:
- The term velocity refers to the speed of generation of data.
- Velocity is a measure of how fast the data is generated and processed.
- To meet the demands and the challenges of processing Big Data, the
velocity of generation of data plays a crucial role.

Variety:
- Big Data comprises a variety of data.
-Data is generated from multiple sources in a system.
-Data is in various forms and formats.
-The variety is due to the availability of a large number of
heterogeneous platforms in the industry.

Veracity:
-takes into account the quality of data captured, which can vary
greatly, affecting its accurate analysis.
- It also accounts for odd or anomalous data that does not match the rest of the data.
Big Data Classification
1. Data sources
Data storage such as records, RDBMSs, distributed databases, row-oriented
in-memory data tables
2. Data formats (traditional)
Structured and semi-structured
3. Big Data sources
Data warehouse, NoSQL databases
4. Big Data formats
Unstructured, semi-structured and multi-structured data
5. Processing data rates
Near-time, real-time, streaming
6. Processing Big Data rates
High volume, velocity, variety and veracity
7. Data analysis methods
Machine learning algorithms, clustering algorithms
Analytics Scalability to Big Data (6m)
Vertical Scalability:
➢ Means scaling up the given system's resources and increasing the
system's analytics, reporting and visualization capabilities.
➢ Scaling up also means designing the algorithm according to the
architecture so that it uses the resources efficiently.
➢ For example, if x terabytes of data take time t for processing and the
code size with increasing complexity increases by a factor n, then
scaling up means that processing takes equal to, less than or much less
than (n × t).

Horizontal scalability:
Means increasing the number of systems working in coherence and
scaling out or distributing the workload.
Processing the different subsets of a large dataset in parallel deploys
horizontal scalability.
Scaling out is basically using more resources and distributing the
processing and storage tasks in parallel.
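
A minimal sketch of scaling out: the dataset is split into chunks and each chunk is processed in parallel by a separate worker. Here the workers are simulated with Python's multiprocessing pool on one machine; in a real cluster each worker would run on a different node:

```python
# Scaling out (horizontal scalability), sketched on a single machine:
# the dataset is partitioned and the partitions are processed in parallel.
from multiprocessing import Pool

def process_chunk(chunk):
    """Stand-in for any per-partition computation."""
    return sum(chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))
    chunks = [data[i::4] for i in range(4)]   # 4 partitions of the dataset
    with Pool(processes=4) as pool:
        partials = pool.map(process_chunk, chunks)
    print("total =", sum(partials))           # combine the partial results
```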
Massively Parallel Processing Platforms (6m)
Cloud Computing
• "Cloud computing is a type of Internet-based computing that
provides shared processing resources and data to computers and
other devices on demand."
• One of the best approaches for data processing is to perform parallel
and distributed computing in a cloud-computing environment.
• It offers high data security compared to other distributed
technologies.
• Cloud resources can be Amazon Web Services (AWS), DigitalOcean or
Elastic Compute Cloud (EC2).
Cloud computing features are:
(i) on-demand service,
(ii) resource pooling,
(iii) scalability,
(iv) accountability, and
(v) broad network access.
Cloud services can be classified into three fundamental types:
1. Infrastructure as a Service (IaaS): Provides access to resources,
such as hard disks, network connections, database storage and data centers.
Some examples are Tata Communications and Amazon data centers.
2. Platform as a Service (PaaS): Provides the runtime environment
to allow developers to build applications and services.
An example is the Hadoop cloud service on Microsoft Azure.
3. Software as a Service (SaaS): Provides software applications as a
service to end-users.
The software applications are hosted by a service provider and made
available to customers over the Internet.
• An example is GoogleSQL.

Grid and Cluster Computing


Grid Computing
• Grid Computing refers to distributed computing, in which a group of
computers from several locations are connected with each other to
achieve a common task.
• The computer resources are heterogeneous and geographically
dispersed. A group of computers that might be spread over remote
locations comprises a grid.
• Grid computing provides large-scale resource sharing which is
flexible, coordinated and secure among its users.
Features of Grid Computing:
Grid computing is similar to cloud computing and is scalable.
Drawbacks of Grid Computing:
Grid computing suffers from a single point of failure, which leads to
failure of the entire grid.

Cluster Computing:
• A cluster is a group of computers connected by a network.
• The group works together to accomplish the same task.
• Clusters are used mainly for load balancing.
• They shift processes between nodes to keep an even load on the
group of connected computers.
Volunteer Computing (not imp)

• Volunteer computing is a distributed computing paradigm which uses
the computing resources of volunteers.
• Volunteers are organizations or members who own personal computers.
Some issues with volunteer computing systems are:
1. Heterogeneity of the volunteered computers
2. Drop-outs from the network over time
3. Their irregular availability
4. Incorrect results from volunteers are unaccountable, as they
essentially come from anonymous volunteers.
DESIGNING DATA ARCHITECTURE/5 layers of DATA
ARCHITECTURE

Data processing architecture consists of five layers:


(i) identification of data sources,
(ii) acquisition, ingestion, extraction, pre-processing, transformation
of data,
(iii) Data storage at files, servers, cluster or cloud,
(iv) data-processing, and
(v) data consumption by a number of programs and tools, such as
business intelligence, data mining, artificial intelligence (AI),
machine learning (ML), text analytics and data visualization.
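
A highly simplified sketch of the five layers as stages of a single pipeline; every function body below is a placeholder assumed for illustration, not part of the notes:

```python
# The five layers of a data processing architecture, as placeholder stages.

def identify_sources():          # Layer 1: identification of data sources
    return ["sensor.csv", "weblogs.json"]

def ingest(sources):             # Layer 2: acquisition, ingestion, pre-processing
    return [{"source": s, "rows": []} for s in sources]

def store(datasets):             # Layer 3: storage on files/servers/cluster/cloud
    return {d["source"]: d for d in datasets}

def process(storage):            # Layer 4: data processing
    return {name: len(d["rows"]) for name, d in storage.items()}

def consume(results):            # Layer 5: consumption by BI/ML/visualization tools
    for name, count in results.items():
        print(name, "->", count, "rows processed")

consume(process(store(ingest(identify_sources()))))
```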

**Write same thing from boxes and explain layers**(5+5)


**Till here it is enough FOR 20m** + Learn Case Study
Data Quality
• High quality means data which enables all the required
operations, analysis, decisions, planning and knowledge
discovery correctly.
• A definition for high quality data, especially for artificial
intelligence applications, can be data with five R's as follows:
Relevancy, Recent, Range, Robustness and Reliability.
Relevancy is of utmost importance.
• A uniform definition of data quality is difficult.
• A reference can be made to a set of values of quantitative or
qualitative conditions, which must be specified to say whether data
quality is high or low.

Data Integrity

➢ Data integrity refers to the maintenance of consistency and
accuracy in data over its usable life.
➢ Software which stores, processes or retrieves the data should
maintain the integrity of data.
➢ Data should be incorruptible.

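One common way software maintains data integrity is to store a checksum alongside the data and verify it on every read; below is a minimal sketch using Python's hashlib (the payload is a made-up record, not from the notes):

```python
# Integrity check: recompute a checksum on read and compare it with the one
# stored at write time. Any corruption changes the digest and is detected.
import hashlib

def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

payload = b"temperature=21.5;humidity=60"   # made-up record
stored_digest = checksum(payload)           # saved together with the data

# later, on retrieval:
assert checksum(payload) == stored_digest, "data was corrupted"
print("integrity verified")
```
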
Factors Affecting Data Quality

Data Noise
• Noise is one of the factors affecting data quality.
• Noise in data refers to data giving additional meaningless
information besides the true (actual/required) information.
• Noisy data means data having a large amount of such additional information.
• The result of data analysis is adversely affected by noisy data.

Outliers
An outlier in data refers to data which appears not to belong to the dataset.
• For example, data that is outside an expected range.
• Actual outliers need to be removed from the dataset, else the result
will be affected by a small or large amount.
• If valid data is identified as an outlier, then also the results will be
affected.
• Outliers are a result of human data-entry errors or programming bugs.
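
A small sketch of flagging values outside an expected range as outliers before analysis; the readings and the valid range [0, 50] are assumptions made for the example:

```python
# Flag values outside an expected range as outliers before analysis.
readings = [21.4, 22.0, 19.8, 250.0, 20.5, -40.0]   # made-up sensor readings

LOW, HIGH = 0.0, 50.0
outliers = [x for x in readings if not (LOW <= x <= HIGH)]
cleaned  = [x for x in readings if LOW <= x <= HIGH]

print("outliers removed:", outliers)   # [250.0, -40.0]
print("data kept:", cleaned)
```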

Missing Values
Another factor affecting data quality is missing values. A missing value
implies data not appearing in the dataset.

Duplicate Values:
Another factor affecting data quality is duplicate values. A duplicate
value implies the same data appearing two or more times in a dataset.
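
A brief sketch of handling missing and duplicate values with pandas (one common tool, not prescribed by the notes; the small table is made up, and whether to drop or fill missing values is the analyst's choice):

```python
# Handling missing and duplicate values, sketched with pandas on a made-up table.
import pandas as pd

df = pd.DataFrame({
    "roll":  [1, 2, 2, 3],
    "marks": [78.5, None, None, 91.0],   # None marks a missing value
})

df = df.drop_duplicates()                              # remove repeated rows
df["marks"] = df["marks"].fillna(df["marks"].mean())   # or df.dropna() to discard them
print(df)
```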

Data Pre-processing
• Data pre-processing is an important step at the ingestion layer.
• Pre-processing is a must before data mining and analytics.
• Pre-processing is also a must before running a Machine Learning
(ML) algorithm.
• It is also necessary when data is being exported to a cloud service or
data store.
Pre-processing needs include the following (a small combined sketch follows this list):
(i) Dropping out-of-range, inconsistent and outlier values
(ii) Filtering unreliable, irrelevant and redundant information
(iii) Data cleaning, editing, reduction and/or wrangling
(iv) Data validation and transformation
(v) ELT processing
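
A small combined sketch of these pre-processing steps on a made-up list of records; the field names, valid range and unit conversion are all assumptions for illustration:

```python
# Combined pre-processing sketch: drop out-of-range values, filter irrelevant
# fields, and transform units before loading. All field names are made up.
raw = [
    {"id": 1, "temp_f": 98.6,  "debug": "x"},
    {"id": 2, "temp_f": 500.0, "debug": "y"},   # out-of-range value
    {"id": 3, "temp_f": 101.2, "debug": "z"},
]

clean = []
for rec in raw:
    if not (90.0 <= rec["temp_f"] <= 110.0):    # (i) drop out-of-range values
        continue
    rec.pop("debug")                            # (ii) filter irrelevant information
    rec["temp_c"] = round((rec.pop("temp_f") - 32) * 5 / 9, 1)   # (iv) transformation
    clean.append(rec)

print(clean)   # ready for loading into a data store or for analytics
```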

Data Cleaning
• Data cleaning refers to the process of removing or correcting
incomplete, incorrect, inaccurate or irrelevant parts of the data after
detecting them.
• Data cleaning tools: OpenRefine and DataCleaner. (v.imp)

Data Enrichment:
Data enrichment refers to operations or processes which refine,
enhance or improve the raw data.

Data Editing:
Data editing refers to the process of reviewing and adjusting the
acquired datasets by performing additional computations.

Data Reduction:
Data reduction refers to bringing the information into a correct and
simplified form.

Data Wrangling:
Data wrangling refers to the process of transforming and mapping the
data from one format to another format.
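
A small sketch of wrangling: mapping records from CSV format to JSON format with Python's standard csv and json modules (the two-row CSV content is made up for the example):

```python
# Data wrangling: transform/map data from one format (CSV) to another (JSON).
import csv, io, json

csv_text = "roll,name,marks\n1,Ravi,78.5\n2,Asha,91.0\n"

rows = list(csv.DictReader(io.StringIO(csv_text)))   # parse CSV into dict records
json_text = json.dumps(rows, indent=2)               # re-emit in JSON format

print(json_text)
```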
Berkeley Data Analytics Stack (BDAS) (8m)
