Big Data Analytics, by Syed Nawaz Pasha, SR University. Professional Elective-5, B.Tech IV-II Sem
Inconsistent structure
Dealing with unstructured data: data mining, NLP, text analytics, noisy text analytics
Techniques used to find patterns in or interpret unstructured data
1. Data mining
Association rule mining
Regression analysis
Collaborative filtering
2. Text analytics or text mining: text categorization, text clustering, sentiment analysis.
3. Noisy text analytics: extracting structured or semi-structured information from noisy unstructured data such as chats, blogs, wikis, and email.
4. Part-of-speech tagging (POS or POST)
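As a rough illustration of association rule mining from the list above, the sketch below computes the two standard measures of a rule, support and confidence. The shopping-basket transactions and the helper names `support` and `confidence` are made up for this example; a real system would use a full algorithm such as Apriori over far larger data.

```python
# Hypothetical market-basket transactions (illustrative data only).
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter", "jam"},
]


def count_containing(itemset, transactions):
    """Number of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)


def support(itemset, transactions):
    """Fraction of all transactions that contain the itemset."""
    return count_containing(itemset, transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Of the transactions containing the antecedent, the fraction that
    also contain the consequent. Returns 0.0 if the antecedent never occurs."""
    base = count_containing(antecedent, transactions)
    if base == 0:
        return 0.0
    both = count_containing(set(antecedent) | set(consequent), transactions)
    return both / base


# Evaluate the rule {bread} -> {milk}.
rule_support = support({"bread", "milk"}, transactions)
rule_confidence = confidence({"bread"}, {"milk"}, transactions)
print(f"support={rule_support:.2f}, confidence={rule_confidence:.2f}")
```

A rule is usually kept only if both values exceed thresholds chosen by the analyst.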
Remind me
Structured data
Semi-structured data
Unstructured data
Test me (segregate the items below as structured, semi-structured, or unstructured)
Email
MS Access
Images
Database
Chat conversation
Relations
Facebook
Videos
MS Excel
XML
Sources of unstructured data
Definition of big data by Gartner
Big data is high-volume, high-velocity, and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.
Big Data Definition
Big data refers to a huge amount of data that is difficult to store and process using on-hand database system tools or traditional data processing applications.
Characteristics of big data (the 5 V's of big data)
Volume
Velocity
Variety
Value
Veracity
5 V’s of big data
Challenges with big data
Exponential growth of data: today's BIG may be normal tomorrow.
Cloud computing and virtualization, which complicate the decision to host big data solutions outside the enterprise.
Period of retention of big data.
Shortage of skilled data science professionals.
Difficulty in capturing, storing, preparing, searching, analyzing, transferring, securing, and visualizing big data.
Shortage of data visualization experts.
Big data analytics
The process of examining large big data sets to uncover hidden patterns and unknown correlations, understand the rationale behind market trends, and recognize customer preferences and other business information.
Classification of analytics
Descriptive analytics
Diagnostic analytics
Predictive analytics
Prescriptive analytics
Analytics 1.0 (mid 1950s to 2009): descriptive statistics (report on events, occurrences, etc.)
Analytics 2.0 (2005 to 2012): descriptive statistics + predictive statistics (use data from the past to make predictions for the future)
Analytics 3.0 (2012 to present): descriptive + predictive + prescriptive
Analytics 1.0, 2.0, 3.0
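To make the difference between descriptive, predictive, and prescriptive analytics concrete, here is a minimal sketch. The monthly sales figures, the `stock_capacity` threshold, and the decision rule are assumptions invented for this example; the predictive step is a plain least-squares line, which is only one of many possible techniques.

```python
from statistics import mean

# Hypothetical monthly sales (illustrative data only).
months = [1, 2, 3, 4, 5, 6]
sales = [100, 110, 120, 130, 140, 150]

# Descriptive: summarize what has already happened.
average_sales = mean(sales)

# Predictive: fit a least-squares line y = intercept + slope * x
# to past data and extrapolate to the next month.
mean_x, mean_y = mean(months), mean(sales)
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(months, sales)) / sum(
    (x - mean_x) ** 2 for x in months
)
intercept = mean_y - slope * mean_x
forecast_month_7 = intercept + slope * 7

# Prescriptive: turn the prediction into a recommended action.
# The capacity value and the rule itself are assumptions for this sketch.
stock_capacity = 150
recommendation = (
    "increase stock" if forecast_month_7 > stock_capacity else "hold stock"
)

print(average_sales, forecast_month_7, recommendation)
```

The same data feeds all three steps; what changes is the question asked of it: what happened, what is likely to happen, and what should be done about it.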
Terminologies used in big data environment
In-memory analytics
In-database processing
Symmetric multiprocessor system
Massively parallel processing
Distributed vs. parallel systems
Shared-nothing architecture
Shared Memory (SM)
Shared Disk (SD)
Shared Nothing (SN)
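The shared-nothing idea can be sketched as a set of nodes that each own a private store, with a router that sends every key to exactly one node. The class names `Node` and `SharedNothingCluster` and the hash-modulo placement rule are assumptions for this illustration; production systems typically use more elaborate schemes such as consistent hashing.

```python
import hashlib


class Node:
    """A node with its own private storage; nothing is shared with peers."""

    def __init__(self, name):
        self.name = name
        self.store = {}


class SharedNothingCluster:
    """Routes each key to exactly one node by hashing the key."""

    def __init__(self, node_names):
        self.nodes = [Node(name) for name in node_names]

    def _owner(self, key):
        # A stable hash is used so placement is the same on every run.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % len(self.nodes)
        return self.nodes[index]

    def put(self, key, value):
        self._owner(key).store[key] = value

    def get(self, key):
        return self._owner(key).store.get(key)


cluster = SharedNothingCluster(["node-0", "node-1", "node-2"])
user_ids = ["u1", "u2", "u3", "u4", "u5", "u6"]
for user_id in user_ids:
    cluster.put(user_id, {"id": user_id})

for node in cluster.nodes:
    print(node.name, sorted(node.store))
```

Because no memory or disk is shared, each node can work on its own partition independently, which is what lets this design scale out by adding nodes.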
CAP Theorem
It is impossible for a distributed system to simultaneously provide all three of the following guarantees (pick any two):
1. Consistency: all nodes see the same data at the same time; a read returns the latest value written by any client.
2. Availability: every request receives a response; the system allows operations all the time, and operations return quickly.
3. Partition tolerance: the system continues to operate despite arbitrary partitioning due to network failures.
Samples of databases that follow one of the three possible combinations: AP, CP, CA
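The trade-off behind the CAP theorem can be illustrated with a toy two-replica store. Everything here (the `TinyStore` class, its `mode` flag, and the `partitioned` switch) is a hypothetical simplification, not a model of any real database: in "CP" mode the store refuses writes during a partition to stay consistent, while in "AP" mode it accepts them and lets the replicas diverge.

```python
class TinyStore:
    """Toy store with two replicas, used only to illustrate CAP trade-offs."""

    def __init__(self, mode):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.replicas = [{}, {}]
        self.partitioned = False  # True means the replicas cannot communicate

    def write(self, replica_index, key, value):
        """Return True if the write was accepted, False if it was refused."""
        if not self.partitioned:
            # Healthy network: apply the write to every replica.
            for replica in self.replicas:
                replica[key] = value
            return True
        if self.mode == "CP":
            # Keep consistency by giving up availability.
            return False
        # AP: stay available, but only the reachable replica is updated.
        self.replicas[replica_index][key] = value
        return True

    def read(self, replica_index, key):
        return self.replicas[replica_index].get(key)


cp = TinyStore("CP")
cp.write(0, "x", 1)
cp.partitioned = True
cp_accepted = cp.write(0, "x", 2)  # refused: replicas still agree on x

ap = TinyStore("AP")
ap.write(0, "x", 1)
ap.partitioned = True
ap_accepted = ap.write(0, "x", 2)  # accepted: replicas now disagree on x

print(cp_accepted, cp.read(0, "x"), cp.read(1, "x"))
print(ap_accepted, ap.read(0, "x"), ap.read(1, "x"))
```

In the CP run both replicas still agree after the partition, at the cost of a rejected request; in the AP run every request succeeds, but a read may return a stale value depending on which replica answers.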