Big Data
By
Dr. Asadi Srinivasulu
Professor of IT, M.Tech (IIIT), Ph.D.
Big Data Analytics deals with the 3 V's of big data, for example the data generated by Facebook.
Fig: Big Data - Characteristics
✓ Big Data Analytics is the process of examining large
amounts of data of a variety of types (big data) to uncover
hidden patterns, unknown correlations and other useful
information.
Fig: The Evolution of BIG DATA
✓ Big data is a collection of data sets so large and complex that they are difficult to handle using traditional DBMS tools.
✓ Types of Data:
1. Structured Data - This data is organized in a highly mechanized and manageable way. Ex: tables, transactions, legacy data, etc.
2. Unstructured Data - This data is raw and unorganized; it varies in its content and can change from entry to entry. Ex: videos, images, audio, text data, graph data, social media, etc.
3. Semi-Structured Data - Roughly 50% structured and 50% unstructured. Ex: XML databases.
✓ Why Big Data Matters
✓ Data won't fit on a single system, which is why distributed storage is used.
✓ Big data has become too complex and too dynamic to process, store, analyze and manage with traditional data tools.
Big Data File Formats
✓ Videos
✓ Audios
✓ Images
✓ Photos
✓ Logs
✓ Click Trails
✓ Text Messages
✓ E-Mails
✓ Documents
✓ Books
✓ Transactions
✓ Public Records
✓ Flat Files
✓ SQL Files
✓ DB2 Files
✓ MySQL Files
✓ Teradata Files
✓ MS-Access Files
Characteristics of Big Data
✓ There are seven characteristics of Big Data: volume, velocity, variety, veracity, value, validity and visibility. Earlier, data was assessed in megabytes and gigabytes, but now the assessment is made in terabytes and beyond.
1. Volume: Data size, the amount of data, data quantity, or data at rest.
2. Velocity: The speed at which data is generated and processed, or data in motion.
3. Variety: Data types, the range of data types and sources, or data with multiple formats.
4. Veracity: The uncertainty or trustworthiness of the data.
5. Value: Data alone is not enough; the question is how value can be derived from it.
6. Validity: The correctness and accuracy of the data for its intended use.
7. Visibility: The extent to which the data is accessible and viewable.
Applications
✓ Social Networks and Relationships
✓ Cyber-Physical Models
✓ Internet of Things (IoT)
✓ Retail Market
✓ Retail Banking
✓ Real Estate
✓ Fraud detection and prevention
✓ Telecommunications
✓ Healthcare and Research
✓ Automotive and production
✓ Science and Research
✓ Trading Analytics
Fig: Applications of Big Data Analytics
Contents
✓ Spark Basics
✓ Spark Transformations
✓ Spark Actions
✓ Hands on Spark
Why Spark
✓ Spark is faster than MapReduce when the same data is processed multiple times, rather than unloaded and reloaded for each pass.
✓ Spark is simpler and usually much faster than MapReduce for typical machine learning and data analytics applications.
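✓ A minimal sketch of why in-memory reuse matters, assuming a hypothetical log file path and search terms (not from the slides): caching lets several passes over the same data skip the reload that MapReduce would repeat.
val logs = sc.textFile("hdfs:///data/logs.txt").cache()  // hypothetical path; cache() keeps the RDD in memory
val errors = logs.filter(line => line.contains("ERROR")).count()   // first pass reads the file and caches it
val warnings = logs.filter(line => line.contains("WARN")).count()  // second pass reuses the cached data
println(s"errors=$errors, warnings=$warnings")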
✓ The MLlib library has implementations of various common machine learning algorithms:
1. Clustering: K-means
2. Classification: Naïve Bayes, logistic regression, SVM
3. Decomposition: Principal Component Analysis (PCA) and Singular Value Decomposition (SVD)
4. Regression: Linear Regression
5. Collaborative Filtering: Alternating Least Squares for recommendations
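✓ A minimal, hedged sketch of calling one of these MLlib algorithms (logistic regression) from the spark-shell; the training points and feature values below are invented for illustration.
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors

// Tiny illustrative training set: a label (0 or 1) plus a two-feature vector
val training = sc.parallelize(Seq(
  LabeledPoint(0.0, Vectors.dense(0.0, 1.1)),
  LabeledPoint(1.0, Vectors.dense(2.0, 1.0))))
val model = new LogisticRegressionWithLBFGS().setNumClasses(2).run(training)
println(model.predict(Vectors.dense(1.9, 0.9)))  // predicts a class label for a new point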
✓ Language Support in Apache Spark
✓ The Apache Spark ecosystem is built on top of the core execution engine, which has extensible APIs in different languages.
✓ RDD (Resilient Distributed Dataset) is the main logical data unit in Spark. An RDD is a distributed collection of objects. Quoting from the Learning Spark book: "In Spark all work is expressed as creating new RDDs, transforming existing RDDs, or calling operations on RDDs to compute a result."
✓ Resilient Distributed Datasets overcome the disk I/O overhead of Hadoop MapReduce by allowing fault-tolerant 'in-memory' computations.
✓ RDDs can contain any kind of object from Python, Scala or Java, including user-defined class objects.
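✓ A minimal sketch of that create / transform / act workflow in the spark-shell (the input strings are invented):
val lines = sc.parallelize(Seq("spark is fast", "hadoop scales well"))  // create an RDD
val words = lines.flatMap(line => line.split(" "))                      // transform it into a new RDD
println(words.count())                                                  // call an action: computes and prints 6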
• Properties / Traits of RDD:
✓ Spark will divide RDD dependencies into stages and tasks and then send those to workers for execution.
1. map(): Pass each element of the RDD through the supplied function.
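✓ The slides give no snippet for map(), so here is a minimal sketch in the same style as the examples below (values invented):
val x = sc.parallelize(Array(1,2,3))
val y = x.map(n => n * 2)            // apply the function to every element
println(y.collect().mkString(", "))
Output: y: [2, 4, 6]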
2. filter(): Creates a new RDD containing only the elements for which the supplied predicate function returns true.
val x = sc.parallelize(Array(1,2,3))
val y = x.filter(n => n % 2 == 1)    // keep the odd numbers
println(y.collect().mkString(", "))
Output: y: [1, 3]
3. flatMap(): Similar to map, but each input item can be mapped to 0 or more output items.
val x = sc.parallelize(Array(1,2,3))
val y = x.flatMap(n => Array(n, n*100, 42))  // three outputs per input
println(y.collect().mkString(", "))
Output: y: [1, 100, 42, 2, 200, 42, 3, 300, 42]
4. groupBy(): Groups the elements of the RDD by the key returned when the supplied function is applied to each element, producing (K, Iterable&lt;V&gt;) pairs.
val x = sc.parallelize(Array("John","Fred","Anna","James"))  // illustrative input consistent with the output below
val y = x.groupBy(w => w.charAt(0))                          // group names by first letter
println(y.collect().mkString(", "))
Output: y: [('A',['Anna']),('J',['John','James']),('F',['Fred'])]
5. groupByKey(): When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable&lt;V&gt;) pairs.
val x = sc.parallelize(Array(('B',5),('B',4),('A',3),('A',2),('A',1)))  // illustrative input
val y = x.groupByKey()
println(y.collect().mkString(", "))
Output: y: [('A',[3, 2, 1]),('B',[5, 4])]
6. sample(): Returns a random sample subset RDD of the input RDD.
val x = sc.parallelize(Array(1,2,3,4,5))  // illustrative input
val y = x.sample(false, 0.4)              // without replacement, roughly 40% of the elements
println(y.collect().mkString(", "))
Output: y: [1, 3] (one possible result; sample() is random)
7. union(): Returns the union of two RDDs (duplicates are kept).
val x = sc.parallelize(Array(1,2,3), 2)
val y = sc.parallelize(Array(3,4), 1)
val z = x.union(y)
val zOut = z.glom().collect()  // glom() shows the elements partition by partition
Output: zOut: [[1], [2, 3], [3, 4]]
8. join(): If you have relational database experience, this will be easy: it joins two key-value datasets on their keys.
val x = sc.parallelize(Array(('a',1),('b',2)))          // illustrative input consistent with the output below
val y = sc.parallelize(Array(('a',3),('a',4),('b',5)))
val z = x.join(y)
println(z.collect().mkString(", "))
Output: z: [('a', (1, 3)), ('a', (1, 4)), ('b', (2, 5))]
9.distinct() : Return a new RDD with distinct elements within a
source RDD.
val x= sc.parallelize(Array(1,2,3,3,4))
val y= x.distinct()
println(y.collect().mkString(", "))
Output: y: [1, 2, 3, 4]
10) keyBy(): Constructs two-component tuples (key-value pairs) by applying a function to each data item to derive its key.
val x = sc.parallelize(Array("John","Fred","Anna","James"))  // illustrative input consistent with the output below
val y = x.keyBy(w => w.charAt(0))                            // key each name by its first letter
println(y.collect().mkString(", "))
Output: y: [('J','John'),('F','Fred'),('A','Anna'),('J','James')]
11) partitionBy(): Repartitions a key-value RDD using its keys. The partitioner implementation is supplied as the first argument.
import org.apache.spark.Partitioner
val x = sc.parallelize(Array(('J',"James"),('F',"Fred"),('A',"Anna"),('J',"John")), 3)
val y = x.partitionBy(new Partitioner() {
  val numPartitions = 2
  def getPartition(k: Any) = {
    if (k.asInstanceOf[Char] < 'H') 0 else 1  // keys before 'H' go to partition 0
  }
})
12) zip(): Joins two RDDs element-wise into key-value pairs; the two RDDs must have the same number of partitions and elements.
val x = sc.parallelize(Array(1,2,3))
val y = x.map(n => n*n)              // the squares of x
val z = x.zip(y)
println(z.collect().mkString(", "))
Output: z: [(1, 1), (2, 4), (3, 9)]
✓ Task 2: Practice the Spark Actions, i.e. getNumPartitions(), collect(), reduce(), aggregate(), max(), sum(), mean(), stdev() and countByKey().
✓ Actions: Unlike transformations, which produce RDDs, action functions return a value to the Spark driver program.
✓ Actions trigger the evaluation of previously constructed, lazy RDDs; a short illustration follows the list below.
1. collect()
2. reduce()
3. aggregate()
4. mean()
5. sum()
6. max()
7. stdev()
8. countByKey()
9. getNumPartitions()
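✓ As promised above, a minimal sketch of this laziness (values invented): the transformation alone computes nothing; the action triggers evaluation.
val x = sc.parallelize(1 to 1000000)
val y = x.map(n => n * 2)   // transformation only: no computation happens yet
println(y.count())          // action: evaluation is triggered here and prints 1000000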
1) collect(): Returns the elements of the dataset as an array back to the driver program.
val x = sc.parallelize(Array(1,2,3), 2)
val y = x.collect()
Output: y: [1, 2, 3]
2) reduce(): Aggregates the elements of a dataset through the supplied function.
val x = sc.parallelize(Array(1,2,3,4))
val y = x.reduce((a, b) => a + b)  // sum all the elements
Output: y: 10
3) aggregate(): Allows the user to apply two different reduce functions to the RDD: one within each partition and one to combine partition results. The zero value is applied once per partition and once more in the combine step.
val inputrdd = sc.parallelize(List(("maths", 21), ("english", 22), ("science", 31)), 3)  // illustrative input with 3 partitions, consistent with the sum below
val result = inputrdd.aggregate(3)((acc, value) => acc + value._2, (acc1, acc2) => acc1 + acc2)
With 3 partitions plus the final combine, the zero value 3 is added 4 times, so we get 21 + 22 + 31 + (4 * 3) = 86.
Output: result: Int = 86
4) max(): Returns the largest element in the RDD.
val x = sc.parallelize(Array(2,4,1))
val y = x.max()
Output: y: 4
5) count(): Returns the number of elements in the RDD.
val x = sc.parallelize(Array(2,4,1))
val y = x.count()
Output: y: 3
6) sum(): Returns the sum of the elements of the RDD (as a Double).
val x = sc.parallelize(Array(2,4,1))
val y = x.sum()
Output: y: 7.0
7) mean(): Returns the mean of the elements of the RDD.
val x = sc.parallelize(Array(2,4,1))
val y = x.mean()
Output: y: 2.3333333
8) stdev(): An aggregate function that computes the (population) standard deviation of a set of numbers.
val x = sc.parallelize(Array(2,4,1))
val y = x.stdev()
Output: y: 1.2472191
9) countByKey(): Only available on RDDs of (K, V) pairs; returns a hashmap of (K, count of K).
val x = sc.parallelize(Array(('J',"James"),('F',"Fred"),('A',"Anna"),('J',"John")))  // illustrative input
val y = x.countByKey()
Output: y: Map('A' -> 1, 'J' -> 2, 'F' -> 1)
10) getNumPartitions(): Returns the number of partitions of the RDD (in Scala, via x.partitions.size).
val x = sc.parallelize(Array(1,2,3), 2)
val y = x.partitions.size
Output: y: 2
1. Spark was initially developed at which university?
Ans) UC Berkeley
2. What are the characteristics of Big Data?
Ans) Volume, Velocity and Variety
3. The main focus of the Hadoop ecosystem is on?
Ans) Batch Processing
4. Streaming data tools available in the Hadoop ecosystem are?
Ans) Apache Spark and Storm
5. Spark has APIs in how many languages?
Ans) Four: Java, Scala, R and Python
6. Which kinds of data can be processed by Spark?
Ans) Stored Data and Streaming Data
7. Spark can store its data in?
Ans) HDFS, MongoDB and Cassandra
8. How does the Spark engine run?
Ans) Integrated with Hadoop or Standalone
9. In Spark, data is represented as?
Ans) RDDs
10. Which kinds of data can be handled by Spark?
Ans) Structured, Unstructured and Semi-Structured
11. Which among the following are the challenges in MapReduce?
Ans) Every problem has to be broken into a Map and a Reduce phase;
Collection of key/value pairs;
High throughput
12. Apache Spark is a framework with?
Ans) Scheduling, Monitoring and Distributing Applications
13. Which are the features of Apache Spark?
Ans) DAG, RDDs and In-Memory processing
14) How much faster is the processing in Spark when compared to Hadoop?
Ans) 10-100x
15) In Spark, data is represented as?
Ans) RDDs
16) List of Transformations
Ans) map(), filter(), flatMap(), groupBy(), groupByKey(), sample(), union(), join(), distinct(), keyBy(), partitionBy() and zip().
17) List of Actions
Ans) getNumPartitions(), collect(), reduce(), aggregate(), max(), sum(), mean(), stdev(), countByKey().
18. Spark is developed in?
Ans) Scala
19. Which types of processing can Apache Spark handle?
Ans) Batch Processing, Interactive Processing, Stream Processing and Graph Processing
20. List two statements about Spark
Ans) Spark can run on top of Hadoop;
Spark can process data stored in HDFS;
Spark can use YARN as its resource management layer
31. Identify the correct Action
Ans) reduce()
32. Choose the correct statement
Ans) Execution starts with the call of an Action
33. Choose the correct statement about SparkContext
Ans) It interacts with the cluster manager and specifies how Spark accesses the cluster
34. Spark caches data automatically in memory as and when needed? (T/F)
Ans) False
35. For resource management, Spark can use
Ans) YARN, Mesos and the Standalone cluster manager
36. An RDD cannot be created from data stored on
Ans) Oracle
37. An RDD can be created from data stored on
Ans) Local FS, S3 and HDFS
38. Who is the father of Big Data Analytics?
✓ Doug Cutting
43. What is the heart of Hadoop?
✓ MapReduce
58. What is Sqoop?
✓ A tool for transferring bulk data between Hadoop and structured data stores.
72. Can you give a detailed overview of the Big Data being generated by Facebook?
✓ As of December 31, 2012, there were 1.06 billion monthly active users on Facebook and 680 million mobile users. On average, 3.2 billion likes and comments are posted every day on Facebook, and 72% of the web audience is on Facebook. This is no surprise given how many activities take place on Facebook: wall posts, sharing images and videos, writing comments, liking posts, and so on.
73. What are the three characteristics of Big Data?
✓ The three characteristics of Big Data are:
Volume: Facebook generates 500+ terabytes of data per day.
Velocity: Analyzing 2 million records each day to identify the reasons for losses.
Variety: Images, audio, video, sensor data, log files, etc.
77. What is Hadoop?
✓ Hadoop is a framework that allows for distributed
processing of large data sets across clusters of commodity
computers using a simple programming model.
82. Give examples of some companies that are using the Hadoop framework.
✓ A lot of companies use Hadoop, such as Cloudera, EMC, MapR, Hortonworks, Amazon, Facebook, eBay, Twitter, Google and so on.
84. What is structured and unstructured data?
✓ Structured data is the data that is easily identifiable as it is
organized in a structure. The most common form of
structured data is a database where specific information is
stored in tables, that is, rows and columns.
✓ Unstructured data refers to any data that cannot be
identified easily. It could be in the form of images, videos,
documents, email, logs and random text.
86. What is HDFS?
✓ HDFS is a file system designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.
88. What is Fault Tolerance?
✓ Suppose you have a file stored in a system, and due to some technical problem that file gets destroyed; without fault tolerance there is no way to get that data back. Fault tolerance means the system withstands such failures; for example, HDFS replicates each block on several data nodes so the data survives the loss of one.
94. What is a Name node?
✓ The Name node is the master node on which the job tracker runs; it holds the metadata. It maintains and manages the blocks that are present on the data nodes.
101. What is a task tracker?
✓ A task tracker is also a daemon that runs on data nodes. Task trackers manage the execution of individual tasks on slave nodes.
103. What is a heartbeat in HDFS?
✓ A heartbeat is a signal indicating that a node is alive. A data node sends heartbeats to the Name node, and a task tracker sends its heartbeats to the job tracker.
104. Are the Name node and the job tracker on the same host?
✓ No; in a practical environment, the Name node and the job tracker run on separate hosts.
114. On what basis does the Name node decide which data node to write on?
✓ As the Name node has the metadata (information) about all the data nodes, it knows which data node is free.
115. Doesn't Google have its very own version of DFS?
✓ Yes, Google owns a DFS known as the "Google File System (GFS)", developed by Google Inc. for its own use.
122. What if rack 2 and a datanode fail?
✓ If both rack 2 and the datanode present in rack 1 fail, then there is no chance of getting the data back.
128. What is the difference between the MapReduce engine and the HDFS cluster?
✓ The HDFS cluster is the name given to the whole configuration of master and slaves where data is stored. The MapReduce engine is the programming module used to retrieve and analyze data.
129. Do we require two servers for the Name node and the data nodes?
✓ Yes, we need two different servers for the Name node and the data nodes.
130. Why is the number of splits equal to the number of maps?
✓ The number of maps is equal to the number of input splits because we want the key-value pairs of all the input splits.
131. Which are the two types of 'writes' in HDFS?
✓ There are two types of writes in HDFS: posted and non-posted writes. A posted write is when we write it and forget about it, without worrying about the acknowledgement.
133. Is a job split into maps?
✓ No, a job is not split into maps. A split is created for the file. The file is placed on data nodes in blocks. For each split, a map is needed.
Hadoop vs Spark
Feature | Hadoop | Spark
Failure tolerance | Continues from the point it left off | Starts the processing from the beginning
Cost | Hard-disk space cost | Memory space cost
Run everywhere | - | Runs on Hadoop (among other platforms)
Memory | HDFS uses MapReduce to process and analyse data; MapReduce takes a backup of all the data on a physical server after each operation | This is called in-memory operation, because the data is stored in RAM
Speed | Hadoop works slower than Spark | Spark works faster than Hadoop: up to 100x faster for in-memory batch processing and 10x faster on disk
File management system | Has its own FMS (File Management System), HDFS | Does not come with its own FMS; it supports cloud-based data platforms (Spark was designed to work with Hadoop)
CLUSTERING
✓ What is Clustering: Clustering is an unsupervised learning technique, i.e. there are no predefined classes; it forms groups of similar objects that differ significantly from objects in other groups.
✓ The process of grouping a set of physical or abstract objects into classes of similar objects is called clustering.
✓ Clustering is "the process of organizing objects into groups whose members are similar in some way".
✓ The cluster property is that intra-cluster distances are minimized and inter-cluster distances are maximized.
10. Variables of Mixed Types: A database may contain all six types of variables: symmetric binary, asymmetric binary, nominal, ordinal, interval and ratio.
✓ Ex: 11121A1201
Similarity Measure
✓ Euclidean distance: Distances are normally used to measure the similarity or dissimilarity between two data objects.
✓ Major clustering approaches:
1. Partitioning Methods
2. Hierarchical Methods
3. Density-Based Methods
4. Grid-Based Methods
5. Model-Based Clustering Methods
6. Clustering High-Dimensional Data
7. Constraint-Based Cluster Analysis
8. Outlier Analysis
✓ Examples of Clustering Applications
1. Marketing
2. Land use
3. Insurance
4. City-planning
5. Earth-quake studies
✓ Issues of Clustering
1. Accuracy,
2. Training time,
3. Robustness,
4. Interpretability, and
5. Scalability
6. Find top 'n' outlier points
Applications
✓ Pattern Recognition
✓ Spatial Data Analysis
✓ GIS(Geographical Information System)
✓ Cluster Weblog data to discover groups
✓ Credit approval
✓ Target marketing
✓ Medical diagnosis
✓ Fraud detection
✓ Weather forecasting
✓ Stock Marketing
2. Classification vs Clustering
Classification | Clustering
1. Classification is "the process of organizing objects into groups whose members are not similar." | 1. Clustering is "the process of organizing objects into groups whose members are similar in some way".
Classification | Clustering
7. Classification approaches are of two types: | 7. Clustering approaches are eight:
1. Predictive Classification | 1. Partition Method
2. Descriptive Classification | 2. Hierarchical Method
 | 3. Density-Based Methods
 | 4. Grid-Based Methods
 | 5. Model-Based Clustering Methods
 | 6. Clustering High-Dimensional Data
 | 7. Constraint-Based Cluster Analysis
 | 8. Outlier Analysis
8. Issues of Classification: | 8. Issues of Clustering:
1. Accuracy | 1. Accuracy
2. Training time | 2. Training time
3. Robustness | 3. Robustness
4. Interpretability | 4. Interpretability
5. Scalability | 5. Scalability
 | 6. Find top 'n' outlier points
9. Examples: | 9. Examples:
1. Marketing | 1. Marketing
2. Land use | 2. Land use
3. Insurance | 3. Insurance
4. City-planning | 4. City-planning
5. Earth-quake studies | 5. Earth-quake studies
10. Techniques: | 10. Techniques:
1. Decision Tree | 1. K-Means Clustering
2. Bayesian classification | 2. DIANA (DIvisive ANAlysis)
3. Rule-based classification | 3. AGNES (AGglomerative NESting)
4. Prediction and accuracy and error measures | 4. BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies)
 | 5. DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
11. Applications: | 11. Applications:
1. Credit approval | 1. Pattern Recognition
2. Target marketing | 2. Spatial Data Analysis
3. Medical diagnosis | 3. WWW (World Wide Web)
4. Fraud detection | 4. Weblog data to discover groups
5. Weather forecasting | 5. Credit approval
6. Stock Marketing | 6. Target marketing
 | 7. Medical diagnosis
 | 8. Fraud detection
 | 9. Weather forecasting
 | 10. Stock Marketing
k-Means Clustering
✓ It is a partitioning cluster technique.
✓ It is a centroid-based cluster technique.
✓ Clustering is an unsupervised learning technique, i.e. there are no predefined classes; it forms groups of similar objects that differ significantly from other objects.
✓ The Euclidean distance between objects i and j is
d(i, j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2}
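✓ As a quick numeric check of this formula with invented points: for the 2-dimensional objects x_i = (1, 2) and x_j = (4, 6),
d(i, j) = \sqrt{|1 - 4|^2 + |2 - 6|^2} = \sqrt{9 + 16} = 5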
✓ It then creates the first k initial clusters (k = the number of clusters needed) by choosing k rows of data randomly from the dataset.
✓ The k-Means algorithm calculates the arithmetic mean of each cluster formed in the dataset.
✓ Square-error criterion:
E = \sum_{i=1}^{k} \sum_{p \in C_i} |p - m_i|^2
✓ Where
– E is the sum of the square error for all objects in the data set;
– p is the point in space representing a given object; and
– m_i is the mean of cluster C_i (both p and m_i are multidimensional).
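✓ To make the criterion concrete with an invented one-cluster example: for C_1 = \{(1,1), (2,2)\} with mean m_1 = (1.5, 1.5),
E = (0.5^2 + 0.5^2) + (0.5^2 + 0.5^2) = 1.0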
✓ Algorithm: The k-means algorithm for partitioning,
where each cluster’s center is represented by the mean
value of the objects in the cluster.
✓ Input:
– k: the number of clusters,
– D: a data set containing n objects.
✓ Output: A set of k clusters.
k-Means Clustering Method
✓ Example (k = 2): arbitrarily choose k objects as the initial cluster centers; assign each object to the most similar center; update the cluster means; then reassign objects and update the means repeatedly until the assignments no longer change.
Fig: Clustering of a set of objects based on the k-means
method. (The mean of each cluster is marked by a “+”.)
Steps
✓ The k-Means algorithm is implemented in four steps:
1. Partition the objects into k nonempty subsets.
2. Compute seed points as the centroids of the clusters of the current partition (the centroid is the center, i.e., mean point, of the cluster).
3. Assign each object to the cluster with the nearest seed point.
4. Go back to Step 2; stop when there are no more new assignments.
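✓ A minimal, hedged sketch of these steps using Spark MLlib's built-in K-means (the data points are invented; KMeans.train performs steps 1-4 internally):
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Two visually obvious groups of 2-D points (illustrative data)
val data = sc.parallelize(Seq(
  Vectors.dense(1.0, 1.0), Vectors.dense(1.5, 2.0),
  Vectors.dense(8.0, 8.0), Vectors.dense(9.0, 8.5)))
val model = KMeans.train(data, 2, 10)  // k = 2 clusters, at most 10 iterations
model.clusterCenters.foreach(println)  // prints the two learned centroids (cluster means)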
✓ Model usage: the trained model is applied to testing data, then to unseen data such as (Jeff, Professor, 4) to answer "Tenured?".
NAME | RANK | YEARS | TENURED
Tom | Assistant Prof | 2 | no
Merlisa | Associate Prof | 7 | no
George | Professor | 5 | yes
Joseph | Assistant Prof | 7 | yes
1. Decision Tree
2. Bayesian classification
3. Rule-based classification
4. Prediction
Fig: Example for Model Construction and Usage of Model
2 - Decision Tree
✓ A decision tree is a flowchart-like tree structure, where each internal node (nonleaf node) denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node (or terminal node) holds a class label.
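✓ A minimal, hedged sketch of training such a tree with Spark MLlib on data shaped like the tenure table above (the label and feature encodings are invented for illustration):
import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors

// Illustrative encoding: label 1.0 = tenured; features = (rank code, years)
val data = sc.parallelize(Seq(
  LabeledPoint(0.0, Vectors.dense(0.0, 2.0)),   // Assistant Prof, 2 years, not tenured
  LabeledPoint(0.0, Vectors.dense(1.0, 7.0)),   // Associate Prof, 7 years, not tenured
  LabeledPoint(1.0, Vectors.dense(2.0, 5.0)),   // Professor, 5 years, tenured
  LabeledPoint(1.0, Vectors.dense(0.0, 7.0))))  // Assistant Prof, 7 years, tenured
val model = DecisionTree.trainClassifier(data, 2, Map[Int, Int](), "gini", 3, 32)
println(model.predict(Vectors.dense(2.0, 4.0)))  // classify the unseen tuple (Professor, 4 years)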
✓ Issues of Classification and Prediction
1. Accuracy,
2. Training time,
3. Robustness,
4. Interpretability, and
5. Scalability
✓ Typical applications
1. Credit approval
2. Target marketing
3. Medical diagnosis
4. Fraud detection
5. Weather forecasting
6. Stock Marketing