
BIG DATA ANALYTICS

By: Syed Nawaz Pasha @ SR University


Professional Elective-5
B.Tech IV-II SEM
Preferred Text Book

Authors: Seema Acharya and Subhashini Chellappan
Book Title: Big Data and Analytics
Publisher: Wiley
Types of digital data
 Data is a precious and irreplaceable asset.
 Data -> information
 Information -> valuable insights
Classification of digital data
 Digital data can be classified into three forms:
– Unstructured
– Semi-structured
– Structured
Unstructured data

 This is data that does not conform to a data model or is not in a form that can be used easily by a computer program.
 About 80-90% of an organization's data is in this format.
 Examples: PowerPoint presentations, images, videos, letters, the body of an email, etc.
Semi-structured data
 This is data that does not conform to a data model but has some structure.
 However, it is not in a form that can be used easily by a computer program.
 Examples: emails, XML, markup languages like HTML, etc.
Structured data
 This is data that is in an organized form (e.g., in rows and columns) and can be easily used by a computer program.
 Relationships exist between entities of data, such as classes and their objects.
 Data stored in databases is an example of structured data.
Structured data (continued)
 When do we say that data is structured?
 When the data conforms to a predefined schema/structure.
 Example: an RDBMS
 Relational data model, where data is stored as rows and columns.
Sources of structured data
Ease of working with structured data
 Insert/update/delete
 Indexing
 Scalability
 Transaction processing
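The operations listed above can be sketched with Python's built-in sqlite3 module. This is only a minimal illustration: the `books` table, its columns and the sample rows are assumptions made up for the example, and scalability is not shown because it depends on the database product.

```python
import sqlite3

# In-memory database; the table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE books ("
    "id INTEGER PRIMARY KEY, title TEXT NOT NULL, publisher TEXT, year INTEGER)"
)

# Insert / update / delete: every row must fit the predefined schema.
conn.execute("INSERT INTO books VALUES (9, 'BDA', 'Wiley India', 2011)")
conn.execute("INSERT INTO books VALUES (10, 'Draft title', 'Unknown', 2000)")
conn.execute("UPDATE books SET title = 'Big Data and Analytics' WHERE id = 9")
conn.execute("DELETE FROM books WHERE id = 10")
conn.commit()

# Indexing: speeds up lookups on a frequently queried column.
conn.execute("CREATE INDEX idx_books_publisher ON books (publisher)")

# Transaction processing: either every statement in the block commits, or none does.
try:
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute("INSERT INTO books VALUES (11, 'Another book', 'Wiley India', 2015)")
        # Reusing id 9 violates the PRIMARY KEY constraint and aborts the block.
        conn.execute("INSERT INTO books VALUES (9, 'Duplicate id', 'Wiley India', 2016)")
except sqlite3.IntegrityError:
    pass  # the insert of id 11 is rolled back together with the failing one

rows = conn.execute("SELECT id, title FROM books ORDER BY id").fetchall()
print(rows)
```

Because the second insert in the transaction fails, the first one is undone as well, so only the row with id 9 remains.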
Semi-structured data
 Also referred to as having a self-describing structure.
 Characteristics of semi-structured data:
– Inconsistent structure
– Self-describing (label/value pairs)
– Schema information is often blended with data values
– Data objects may have different attributes that are not known beforehand


Sources of semi structured data
 XML: used by web services developed using SOAP
 JSON: used to transmit data between a server and a web application using REST
 HTML
 JSON example:
{
  "Id": 9,
  "BookTitle": "BDA",
  "Author": "Seema Acharya",
  "Publisher": "Wiley India",
  "YOP": "2011"
}
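As a rough sketch of what "self-describing" means in practice, the snippet below parses two JSON objects with Python's standard json module. The second record and its `Edition` attribute are invented for illustration; they show that objects in the same collection may carry different attributes.

```python
import json

# Two hypothetical records; the second has an attribute the first lacks.
raw = """
[
  {"Id": 9, "BookTitle": "BDA", "Author": "Seema Acharya",
   "Publisher": "Wiley India", "YOP": "2011"},
  {"Id": 10, "BookTitle": "Example title", "Edition": 2}
]
"""

records = json.loads(raw)

# Each object carries its own labels, so the structure is discovered from the data.
for record in records:
    print(sorted(record.keys()))

# Attributes are not known beforehand: collect the union of labels
# and fall back to a default when a label is missing.
all_labels = sorted({label for record in records for label in record})
authors = [record.get("Author", "unknown") for record in records]
print(all_labels)
print(authors)
```

A program reading such data has to tolerate missing or extra labels, which is why `.get()` with a default is used instead of direct indexing.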
Unstructured data
 Does not conform to any predefined data model.
 Issues with the terminology of unstructured data:
– Structure can be implied despite not being formally defined.
– Data with some structure may still be labeled unstructured if the structure doesn't help with the processing task at hand.
– Data may have some structure, or may even be highly structured, in ways that are unanticipated or unannounced.
Dealing with unstructured data
 Data mining
 Natural language processing (NLP)
 Text analytics
 Noisy text analytics
Techniques used to find patterns in or interpret unstructured data
 1. Data Mining
 Association rule mining
 Regression analysis
 Collaborative filtering
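To make association rule mining concrete, here is a small sketch that computes the two usual measures, support and confidence, for item pairs. The market-basket transactions are hypothetical, and a real miner (for example Apriori) would also prune candidates by a minimum support threshold, which is omitted here.

```python
from itertools import combinations

# Hypothetical market-basket transactions (not from the slides).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent):
    """Estimate of P(consequent | antecedent): support(A and C) / support(A)."""
    return support(antecedent | consequent) / support(antecedent)

# Report support and one-directional confidence for every pair of items.
items = sorted(set().union(*transactions))
for a, b in combinations(items, 2):
    pair = {a, b}
    print(
        f"{a} & {b}: support={support(pair):.2f}, "
        f"confidence({a} -> {b})={confidence({a}, {b}):.2f}"
    )
```

A rule such as "butter -> bread" is considered interesting when both its support and its confidence are high.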
 Text analytics or text mining: text categorization, text clustering, sentiment analysis, etc.
 Noisy text analytics: extracting structured or semi-structured information from noisy unstructured data such as chats, blogs, wikis and emails.
 Part-of-speech tagging (also called POS or POST)
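As a minimal sketch of noisy text analytics, the snippet below pulls a structured record (email addresses, dates, times) out of an informal chat fragment using regular expressions. The chat text, the `extract` helper and the patterns are illustrative assumptions; the patterns are deliberately simple and would not cover every real-world format.

```python
import re

# Hypothetical chat snippet with typos and informal spelling.
chat = """
raj: pls send teh report to raj.k@example.com by 12/05/2024!!
meena: ok.. will cc meena_s@example.org, call me @ 10:30
"""

# Deliberately simple patterns; real data would need more robust ones.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
DATE = re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b")
TIME = re.compile(r"\b\d{1,2}:\d{2}\b")

def extract(text):
    """Return a structured record pulled out of free-form text."""
    return {
        "emails": EMAIL.findall(text),
        "dates": DATE.findall(text),
        "times": TIME.findall(text),
    }

record = extract(chat)
print(record)
```

The output is a dictionary with fixed keys, i.e. structured data recovered from unstructured input.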
Remind me
 Structured data
 Semi-structured data
 Unstructured data
Test me (segregate the items below as structured, semi-structured or unstructured)
 Email
 MS Access
 Images
 Database
 Chat conversation
 Relations
 Facebook
 Videos
 MS Excel
 XML
Sources of unstructured data
Definition of big data (by Gartner)
 Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.
Big Data Definition
 Big data refers to a huge amount of data that is difficult to store and process using on-hand database system tools or traditional data processing applications.
 Characteristics of big data (the 5 V's of big data):
 Volume
 Velocity
 Variety
 Value
 Veracity
5 V’s of big data
Challenges with big data
 Exponential growth of data: today's "big" may be normal tomorrow.
 Cloud computing and virtualization complicate the decision to host big data solutions outside the enterprise.
 Deciding the period of retention of big data.
 Shortage of skilled data science professionals.
 It is difficult to capture, store, prepare, search, analyze, transfer, secure and visualize big data.
 Shortage of data visualization experts.
Big data analytics
 The process of examining large data sets to uncover hidden patterns and unknown correlations, understand the rationale behind market trends, recognize customer preferences, and obtain other business information.
Classification of analytics
 Descriptive analytics
 Diagnostic analytics
 Predictive analytics
 Prescriptive analytics
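The difference between these kinds of analytics can be illustrated on a tiny data set. The sketch below uses made-up monthly sales figures: descriptive analytics summarizes the past, predictive analytics extrapolates a least-squares line, and prescriptive analytics turns the forecast into an action through an arbitrary illustrative rule. Diagnostic analytics (explaining why something happened) is not shown.

```python
from statistics import mean

# Hypothetical monthly sales figures for months 1..5.
months = [1, 2, 3, 4, 5]
sales = [10, 12, 14, 16, 18]

# Descriptive analytics: summarize what has already happened.
average_sales = mean(sales)
growth = sales[-1] - sales[0]

# Predictive analytics: fit a least-squares line to past data and extrapolate.
x_bar, y_bar = mean(months), mean(sales)
slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(months, sales)) / sum(
    (x - x_bar) ** 2 for x in months
)
intercept = y_bar - slope * x_bar
forecast_month_6 = intercept + slope * 6

# Prescriptive analytics: recommend an action based on the forecast.
# The stock level and the rule are arbitrary assumptions for illustration.
current_stock = 15
action = "reorder" if forecast_month_6 > current_stock else "hold"

print(average_sales, growth, forecast_month_6, action)
```

Each stage builds on the previous one: the recommendation depends on the forecast, which in turn depends on the historical data being summarized.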
Analytics 1.0, 2.0, 3.0
 Analytics 1.0 (mid-1950s to 2009): descriptive statistics (report on events, occurrences, etc.)
 Analytics 2.0 (2005 to 2012): descriptive statistics + predictive statistics (use data from the past to make predictions for the future)
 Analytics 3.0 (2012 to present): descriptive + predictive + prescriptive
Terminologies used in big data environments
 In-memory analytics
 In-database processing
 Symmetric multiprocessor system (SMP)
 Massively parallel processing (MPP)
Distributed vs parallel systems
Shared-nothing architectures
 Shared memory (SM)
 Shared disk (SD)
 Shared nothing (SN)
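One way to picture the shared-nothing option is a cluster in which every node owns a private store and each key is routed to exactly one node. The sketch below is a toy model under that assumption; the `Node` and `SharedNothingCluster` classes, the node names and the hash-modulo routing are all hypothetical and not taken from any particular product.

```python
import zlib

class Node:
    """A node with its own private storage; nothing is shared with peers."""

    def __init__(self, name):
        self.name = name
        self.local_store = {}  # visible only to this node

class SharedNothingCluster:
    """Routes each key to exactly one owning node (hash partitioning)."""

    def __init__(self, node_names):
        self.nodes = [Node(name) for name in node_names]

    def _owner(self, key):
        # crc32 is stable across runs, unlike the salted built-in hash() for str.
        index = zlib.crc32(key.encode("utf-8")) % len(self.nodes)
        return self.nodes[index]

    def put(self, key, value):
        self._owner(key).local_store[key] = value

    def get(self, key):
        return self._owner(key).local_store.get(key)

cluster = SharedNothingCluster(["node-0", "node-1", "node-2"])
for i in range(9):
    cluster.put(f"customer-{i}", {"id": i})

# Each key lives on exactly one node.
for node in cluster.nodes:
    print(node.name, sorted(node.local_store))
```

In a shared-memory or shared-disk design, by contrast, the nodes would read and write a common memory or disk instead of private stores.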
CAP Theorem
 It is impossible for a distributed system to simultaneously provide all three of the following guarantees (pick any two):
 1. Consistency: all nodes see the same data at the same time, i.e. a read returns the latest value written by any client.
 2. Availability: every request receives a response; the system allows operations all the time, and operations return quickly.
 3. Partition tolerance: the system continues to operate despite arbitrary partitioning due to network failures.
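The trade-off can be illustrated with a toy two-replica store. This is only a sketch: the `TwoNodeStore` class and its `mode` switch are hypothetical, and real systems use far more elaborate replication protocols. During a partition, the "CP" variant refuses writes (it stays consistent but is not available), while the "AP" variant accepts them (it stays available but the replicas diverge).

```python
class Replica:
    def __init__(self):
        self.value = None

class TwoNodeStore:
    """Two replicas; mode is "CP" or "AP" (a hypothetical switch for illustration)."""

    def __init__(self, mode):
        self.mode = mode
        self.a, self.b = Replica(), Replica()
        self.partitioned = False  # True when the replicas cannot reach each other

    def write(self, replica, value):
        other = self.b if replica is self.a else self.a
        if not self.partitioned:
            replica.value = other.value = value  # replicate synchronously
            return True
        if self.mode == "CP":
            return False  # refuse the write: consistent, but not available
        replica.value = value  # AP: accept locally, replicas may now disagree
        return True

cp = TwoNodeStore("CP")
cp.write(cp.a, "v1")
cp.partitioned = True
cp_accepted = cp.write(cp.a, "v2")

ap = TwoNodeStore("AP")
ap.write(ap.a, "v1")
ap.partitioned = True
ap_accepted = ap.write(ap.a, "v2")

print("CP:", cp_accepted, cp.a.value, cp.b.value)
print("AP:", ap_accepted, ap.a.value, ap.b.value)
```

Neither variant gets all three guarantees once the network is partitioned, which is the point of the theorem.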
Examples of databases that follow one of the three possible combinations: AP, CP, CA
