
Talk

28th May 2019

BIG DATA USING SPARK

By
Dr.Asadi Srinivasulu
Professor of IT, M.Tech(IIIT), Ph.D.

SREE VIDYANIKETHAN ENGINEERING COLLEGE


(AUTONOMOUS)
(AFFILIATED TO JNTUA,ANANTAPUR)
2018-2019
✓ FOSS : Free Open Source Software: FOSS allows users and
programmers to edit, modify or reuse the software's source code.
This gives developers the opportunity to improve program
functionality by modifying it.
✓ Software:
✓ LINUX
✓ HADOOP
✓ SPARK
✓ NextCloud
✓ HBase
✓ JAVA
✓ FreeRDP
✓ WEKA
Contents

✓ Big Data Fundamentals


✓ Big Data Architecture
✓ Spark Fundamentals
✓ Spark Ecosystem
✓ Spark Transformations
✓ Spark Actions
✓ Spark – MLlib
✓ Classification using Spark
✓ Clustering using Spark
✓ Spark Challenges
IoT: Internet of Things

Big Data Analytics: Deals with the 3 V's of big data (e.g., Facebook)

Data Mining: Extracting meaningful data

Data Warehouse: Collection of data marts / OLAP

Data Mart: Subset of a DWH

Database System: Combination of data + DBMS

DBMS: Collection of software

Database: Collection of inter-related data

Information: Processed data

Data: Raw material / facts / images

Fig: Pre-requisite of Big Data


Fig: Big Data Word Cloud 7
✓The Myths about Big Data

❑ Big Data Is New


❑ Big Data Is Only About Massive Data Volume
❑ Big Data Means Hadoop
❑ Big Data Needs a Data Warehouse
❑ Big Data Means Unstructured Data
❑ Big Data Is for Social Media & Sentiment Analysis
Big Data is …….
✓Big data is high-volume, high-velocity and high-
variety information assets that demand cost-
effective, innovative forms of information
processing for enhanced insight and decision
making.

✓Big data is data which is too large, complex and


dynamic for any conventional data tools to
capture, store, manage and analyze.

9
Fig: Big Data - Characteristics 10
✓ Big Data Analytics is the process of examining large
amounts of data of a variety of types (big data) to uncover
hidden patterns, unknown correlations and other useful
information.

✓ Why do organizations care about Big Data?

✓ More knowledge leads to better customer engagement,


fraud prevention and new products.
11
✓ Data evolution: roughly 10% of data is structured and 90% is unstructured, such as emails, videos, Facebook posts, website clicks, etc.

12
Fig: The Evolution of BIG DATA
✓ Big data is a collection of data sets which is so large and
complex that it is difficult to handle using DBMS tools.

✓ Facebook alone generates more than 500 terabytes of data daily, while many other organizations such as Jet Air and the stock exchange market generate terabytes of data every hour.

✓ Types of Data:
1. Structured Data: data organized in a highly mechanized and manageable way. Ex: tables, transactions, legacy data, etc.
2. Unstructured Data: raw and unorganized data; it varies in its content and can change from entry to entry. Ex: videos, images, audio, text data, graph data, social media, etc.
3. Semi-Structured Data: roughly 50% structured and 50% unstructured. Ex: XML databases. 13
✓ Big Data Matters…….

✓ Data Growth is huge and all that data is valuable.

✓ Data won’t fit on a single system, which is why distributed data is used.

✓ Distributed data = Faster Computation.

✓ More knowledge leads to better customer engagement, fraud


prevention and new products.

✓ Big Data Matters for Aggregation, Statistics, Indexing, Searching,


Querying and Discovering Knowledge.
14
Fig: Measuring the Data in Big Data System
✓ Big Data Sources: Big data is everywhere and it can help organisations in any industry in many different ways.

✓ Big data has become too complex and too dynamic to be able to
process, store, analyze and manage with traditional data tools.

✓ Big Data Sources are


✓ ERP Data
✓ Transactions Data
✓ Public Data
✓ Social Media Data
✓ Sensor Media Data
✓ Big Data in Marketing
✓ Big Data in Health & Life Sciences
✓ Cameras Data
✓ Mobile Devices
✓ Machine sensors
✓ Microphones 16
Fig: Big Data Sources 17
Structure of Big Data
✓ Big Data Processing: So-called big data technologies are about discovering patterns in semi-structured and unstructured data and about developing big data standards and (open source) software, commonly driven by companies such as Google, Facebook, Twitter and Yahoo!.

18
Big Data File Formats
✓ Videos
✓ Audios
✓ Images
✓ Photos
✓ Logs
✓ Click Trails
✓ Text Messages
✓ E-Mails
✓ Documents
✓ Books
✓ Transactions
✓ Public Records
✓ Flat Files
✓ SQL Files
✓ DB2 Files
✓ MYSQL Files
✓ Tera Data Files
✓ MS-Access Files 19
Characteristics of Big Data
✓ There are seven characteristics of Big Data: volume, velocity, variety, veracity, value, validity and visibility. Earlier, data was assessed in megabytes and gigabytes, but now the assessment is made in terabytes.

20
1. Volume: Data size or the amount of Data or Data quantity or Data
at rest..

2. Velocity: Data speed or Speed of change or The content is changing


quickly or Data in motion.

3. Variety: Data types or The range of data types & sources or Data
with multiple formats.

4. Veracity: Fuzziness or messiness of the data; can we trust the data? 

5. Value: Data alone is not enough; how can value be derived from it?

6. Validity: Ensure that the interpreted data is sound.

7. Visibility: Data from diverse sources need to be stitched together.


21
Advantages
✓ Flexible schema
✓ Massive scalability
✓ Cheaper to setup
✓ Improving Healthcare and Public Health
✓ Financial Trading
✓ Improving Security and Law Enforcement
✓ No declarative query language
✓ Higher performance
✓ Detect risks and check frauds
✓ Reduce Costs
Fig: Advantages of Big Data 23
Disadvantages
✓ Big data violates the privacy principle.
✓ Data can be used for manipulating customers.
✓ Big data may increase social stratification.
✓ Big data is not useful in the short run.
✓ Faces difficulties in parsing and interpreting.
✓ Big data is difficult to handle (requires more programming).
✓ Eventual consistency - fewer guarantees
Big Data Challenges
✓ Data Complexity
✓ Data Volume
✓ Data Velocity
✓ Data Variety
✓ Data Veracity
✓ Capture data
✓ Curation data
✓ Performance
✓ Storage data
✓ Search data
✓ Transfer data
✓ Visualization data
✓ Data Analysis
✓ Privacy and Security
✓ Big Data Challenges include analysis, capture, data curation,
search, sharing, storage, transfer, visualization, and information
privacy.
Fig: Challenges of Big Data 26
✓ Research issues in Big Data Analytics
1. Sentiment Analysis in Big Data Hadoop using Mahout
Machine Learning Algorithms
2. Opinion mining Analysis in Big Data Hadoop using Mahout
Machine Learning Algorithms
3. Predictive mining Analysis in Big Data Hadoop using Mahout
Machine Learning Algorithms
4. Post-Clustering Analysis in Big Data Hadoop using Mahout
Machine Learning Algorithms
5. Pre-Clustering Analysis in Big Data Hadoop using Mahout
Machine Learning Algorithms
6. How can we capture and deliver data to the right people in real time?
7. How can we handle the variety of forms of data?
8. How can we store and analyze data given its size and the computational capacity required? 27
Big Data Tools
✓ Big Data Tools are Hadoop, Cloudera, Datameer, Splunk, Mahout, Hive, HBase,
LucidWorks, R, MapR, Ubuntu and Linux flavors.

28
Applications
✓ Social Networks and Relationships
✓ Cyber-Physical Models
✓ Internet of Things (IoT)
✓ Retail Market
✓ Retail Banking
✓ Real Estate
✓ Fraud detection and prevention
✓ Telecommunications
✓ Healthcare and Research
✓ Automotive and production
✓ Science and Research
✓ Trading Analytics 29
Fig: Applications of Big Data Analytics 30
Contents
✓ Spark Basics

✓ RDD(Resilient Distributed Dataset)

✓ Spark Transformations

✓ Spark Actions

✓ Spark with Machine Learning

✓ Hands on Spark

✓ Research Challenges in Spark


✓ Note: "Spark" is also the name of an unrelated free and open source Java web application micro-framework (an alternative to other Java web frameworks such as JAX-RS, the Play framework and Spring MVC); that project is distinct from Apache Spark, which is the subject of this talk.

✓ Apache Spark is a general-purpose, in-memory cluster computing system.
✓ Provides high-level APIs in Java, Scala and Python and
an optimized engine that supports general execution
graphs.
✓ Provides various high level tools like Spark SQL for
structured data processing, MLlib for Machine Learning
and more…. 32
33
✓ Spark is a successor of MapReduce.

✓ Map Reduce is the ‘heart‘ of Hadoop that consists of two


parts – ‘map’ and ‘reduce’.

✓ Maps and reduces are programs for processing data.

✓ ‘Map’ processes the data first to give some intermediate


output which is further processed by ‘Reduce’ to generate
the final output.

✓ Thus, MapReduce allows for distributed processing of the


map and reduction operations.
34
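✓ As an illustration only, a minimal Spark word-count sketch in Scala written in this map/reduce style (it assumes an existing SparkContext sc, as in spark-shell, and a hypothetical input file name):

val lines = sc.textFile("input.txt")                 // hypothetical input file
val words = lines.flatMap(line => line.split(" "))   // 'map' side: split each line into words
val pairs = words.map(word => (word, 1))             // emit (word, 1) pairs
val counts = pairs.reduceByKey((a, b) => a + b)      // 'reduce' side: sum the counts per word
counts.collect().foreach(println)                    // bring the results back to the driver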
✓ Apache Spark is the Next-Gen Big Data Tool i.e.
Considered as future of Big Data and successor of
MapReduce.

✓ Features of Spark are


– Speed.
– Usability.
– In - Memory Computing.
– Pillar to Sophisticated Analytics.
– Real Time Stream Processing.
– Compatibility with Hadoop & existing Hadoop Data.
– Lazy Evaluation.
– Active, progressive and expanding community. 35
✓ Spark is a Distributed data analytics engine, generalizing
MapReduce.

✓ Spark is a core engine, with streaming, SQL, Machine


Learning and Graph processing modules.

36
Precisely

37
Why Spark
✓ Spark is faster than MR when it comes to processing the
same amount of data multiple times rather than unloading
and loading new data.
✓ Spark is simpler and usually much faster than MapReduce
for the usual Machine learning and Data Analytics
applications.

5 Reasons Why Spark Matters to Business


1. Spark enables use cases “traditional” Hadoop can’t handle.
2. Spark is fast
3. Spark can use your existing big data investment
4. Spark speaks SQL
5. Spark is developer-friendly 38
✓ Spark Ecosystem: Parts of the Spark ecosystem are still work in progress, with some components not yet past their beta releases.

Components of Spark Ecosystem


✓ The components of Spark ecosystem are getting developed and
several contributions are being made every now and then.

✓ Primarily, Spark Ecosystem comprises the following components:


1) Spark SQL (SQL)
2) Spark Streaming (Streaming)
3) MLlib (Machine Learning)
4) GraphX (Graph Computation)
5) SparkR (R on Spark)
6) BlinkDB (Approximate SQL)
46
47
✓ Spark's official ecosystem consists of the following major
components.

✓ Spark DataFrames - Similar to a relational table


✓ Spark SQL - Execute SQL queries or HiveQL
✓ Spark Streaming - An extension of the core Spark API
✓ MLlib - Spark's machine learning library
✓ GraphX - Spark for graphs and graph-parallel computation
✓ Spark Core API - provides APIs in R, SQL, Python, Scala and Java

48
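✓ For illustration only, a small sketch of the DataFrame and Spark SQL components (it assumes a SparkSession named spark, as in spark-shell; the data and column names are made up):

import spark.implicits._
val people = Seq(("Anna", 31), ("Fred", 24)).toDF("name", "age")   // build a DataFrame, similar to a relational table
people.createOrReplaceTempView("people")                           // register it as a SQL view
val adults = spark.sql("SELECT name FROM people WHERE age > 25")   // run a SQL query on it
adults.show()                                                      // display the result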
✓MLlib library has implementations for various
common machine learning algorithms
1. Clustering: K-means
2. Classification: Naïve Bayes, logistic regression,
SVM
3. Decomposition: Principal Component Analysis
(PCA) and Singular Value Decomposition (SVD)
4. Regression : Linear Regression
5. Collaborative Filtering: Alternating Least Squares for
Recommendations
49
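✓ As an illustrative sketch of using MLlib, the example below trains K-means on a tiny, made-up set of 2-D points (it assumes an existing SparkContext sc, as in spark-shell):

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val points = sc.parallelize(Seq(
  Vectors.dense(1.0, 1.0), Vectors.dense(1.5, 2.0),
  Vectors.dense(8.0, 8.0), Vectors.dense(9.0, 9.5)))   // toy data: two obvious groups
val model = KMeans.train(points, 2, 20)                // k = 2 clusters, at most 20 iterations
model.clusterCenters.foreach(println)                  // the learned centroids
println(model.predict(Vectors.dense(8.5, 9.0)))        // cluster index assigned to a new point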
50
✓ Language Support in Apache Spark
✓ Apache Spark ecosystem is built on top of the core
execution engine that has extensible API’s in different
languages.

✓ A recent (2016) Spark survey, covering 62% of Spark users, evaluated the languages in use:

✓ 71% were using Scala,
✓ 58% were using Python,
✓ 31% of the respondents were using Java, and
✓ 18% were using the R programming language.
51
✓ What is Scala?: Scala is a general-purpose programming
language, which expresses the programming patterns in a
concise, elegant, and type-safe way.

✓ The name is basically short for “Scalable Language”.


✓ Scala is an easy-to-learn language and supports both
Object Oriented Programming as well as Functional
Programming.

✓ It is getting popular among programmers, and is being


increasingly preferred over Java and other programming
languages.

✓ It seems much in sync with the present and future Big


Data frameworks, like Scalding, Spark, Akka, etc. 52
✓Why is Spark Programmed in Scala?
✓ Scala is a pure object-oriented language, in which conceptually every
value is an object and every operation is a method-call. The language
supports advanced component architectures through classes and
traits.

✓ Scala is also a functional language. It supports functions, immutable


data structures and gives preference to immutability over mutation.

✓ Scala can be seamlessly integrated with Java

✓ It is already being widely used for Big Data platforms and


development of frameworks like Akka, Scalding, Play, etc.

✓ Being written in Scala, Spark can be embedded in any JVM-based


operational system.
53
54
✓ Procedure: Spark Installation in Ubuntu

✓ Apache Spark is a fast and general engine for large-scale data


processing.
✓ Apache Spark is a fast and general-purpose cluster computing
system. It provides high-level APIs in Java, Scala, Python and R,
and an optimized engine that supports general execution graphs.
✓ It also supports a rich set of higher-level tools including Spark
SQL for SQL and structured data processing, MLlib for machine
learning, GraphX for graph processing, and Spark Streaming.

Step 1: Installing Java.


java -version

Step 2: Installing Scala and SBT.


sudo apt-get update
sudo apt-get install scala
55
Step 3: Installing the Maven plug-in.
✓ Maven is used to compile Java programs for Spark. Type the command below to install Maven:
sudo apt-get install maven
Step 4: Installing Spark.
✓ Download “tgz” file of spark by selecting specific version from
below link http://spark.apache.org/downloads.html
✓ Extract it and remember its path where ever it stored.
✓ Edit the .bashrc file by adding the lines below (terminal command: gedit ~/.bashrc):
export SPARK_HOME=/path_to_spark_directory
export PATH=$SPARK_HOME/bin:$PATH
✓ Replace path_to_spark_directory above with the location of your Spark directory.
✓ Save and close the file, then reload it by typing "source ~/.bashrc" in the terminal.
✓ If it doesn’t work restart system. Thus we installed spark
successfully.
✓ Type spark-shell in terminal to start spark shell. 56
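✓ A quick way to confirm the installation works is a small smoke test inside spark-shell (a sketch; the numbers are arbitrary):

val nums = sc.parallelize(1 to 100)   // distribute a small range across the cluster
println(nums.sum())                   // should print 5050.0
println(sc.version)                   // shows the installed Spark version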
Spark Installation on Windows

Step 1: Install Java (JDK)


Download and install java from https://java.com/en/download/
Step 2: Set java environment variable
✓ Open “control panel” and choose “system & security” and select
“system”.
✓ Select “Advanced System Settings” located at top right.
✓ Select “Environmental Variables” from pop-up.
✓ Next select new under system variables (below), you will get a pop-up.
✓ In variable name field type JAVA_HOME
✓ In variable value field provide installation directory of java, say
C:\Program Files\Java\jdk1.8.0_25
✓ Or you can simply choose the directory by selecting browse directory.
✓ Now close everything by choosing ok every time.
✓ Check whether the Java variable is set by running "javac -version" in the command prompt. If we get the Java version details, we are done. 57
Step 3: Installing SCALA
✓ Download scala.msi file from https://www.scala-
lang.org/download/
✓ Set scala environment variable just like java done above.
Variable name = SCALA_HOME
Variable value = path to scala installed directory, say
C:\Program Files (x86)\scala

Step 4: Installing SPARK


✓ Download and extract spark from
http://spark.apache.org/downloads.html
✓ You can set SPARK_HOME just like java.
✓ Note: On Windows, spark-shell can only be run from the bin folder inside the Spark folder.
58
Step 5: Installing SBT Download and install sbt.msi from
http://www.scala-sbt.org/0.13/docs/Installing-sbt-on-
Windows.html

Step 6: Installing Maven


✓ Download maven from http://maven.apache.org/
download.cgi and unzip it to the folder you want to install
Maven.
✓ Add both M2_HOME and MAVEN_HOME variables in
the Windows environment, and point it to your Maven
folder.
✓ Update the PATH variable by appending the Maven bin folder (%M2_HOME%\bin), so that you can run Maven commands everywhere.
✓ Test Maven by running "mvn -version" in the command prompt. 59
✓ Practice on Spark Framework with Transformations and
Actions: You can run Spark using its standalone cluster mode,
on EC2, on Hadoop YARN, or on Apache Mesos. Access data
in HDFS, Cassandra, HBase, Hive, Tachyon, and any Hadoop data
source.
✓ Spark powers a stack of libraries including SQL and
DataFrames, MLlib for machine learning, GraphX, and Spark
Streaming. You can combine these libraries seamlessly in the same
application.

60
✓ RDD (Resilient Distributed Dataset) is main logical data unit
in Spark. An RDD is distributed collection of objects. ... Quoting
from Learning Spark book, "In Spark all work is expressed as
creating new RDDs, transforming existing RDDs, or calling
operations on RDDs to compute a result.“

✓ Spark performs Transformations and Actions:

61
✓ Resilient Distributed Datasets overcome this drawback of Hadoop
MapReduce by allowing - fault tolerant ‘in-memory’ computations.

✓ RDD in Apache Spark.


✓ Why RDD is used to process the data ?
✓ What are the major features/characteristics of RDD (Resilient
Distributed Datasets) ?

✓ Resilient Distributed Datasets are immutable, partitioned collection


of records that can be operated on - in parallel.

✓ RDDs can contain any kind of objects Python, Scala, Java or even
user defined class objects.

✓ RDDs are usually created by either transformation of existing RDDs


or by loading an external dataset from a stable storage like HDFS or
HBase.
62
Fig: Process of RDD Creation 63
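✓ To make the two creation routes concrete, a small sketch (it assumes a SparkContext sc; the HDFS path is only an example):

val fromCollection = sc.parallelize(Array(1, 2, 3, 4, 5))    // from an in-memory collection
val fromStorage = sc.textFile("hdfs:///data/sample.txt")     // from external storage such as HDFS (example path)
val derived = fromCollection.map(n => n * 2)                 // from an existing RDD via a transformation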
✓ Operations on RDDs

✓ i) Transformations: Coarse grained operations like


join, union, filter or map on existing RDDs which
produce a new RDD, with the result of the operation,
are referred to as transformations. All transformations
in Spark are lazy.

✓ ii) Actions: Operations like count, first and reduce


which return values after computations on existing
RDDs are referred to as Actions.

64
• Properties / Traits of RDD:

✓ Immutable (read-only; cannot be changed or modified): data is safe to share across processes.

✓ Partitioned: the partition is the basic unit of parallelism in an RDD.

✓ Coarse-grained operations: an operation is applied to all elements in the dataset through map, filter or groupBy operations.

✓ Actions/Transformations: all computations on RDDs are actions or transformations.

✓ Fault Tolerant: as the name Resilient implies, an RDD can reconcile, recover or get back all of its data using the lineage graph.

✓ Cacheable: it can hold data in persistent storage.

✓ Persistence: option of choosing which storage will be used, either in-memory or on-disk. 65
66
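✓ A small sketch of the cacheable / persistence traits (it assumes a SparkContext sc; the log lines are made up):

import org.apache.spark.storage.StorageLevel

val logs = sc.parallelize(Seq("INFO ok", "ERROR disk", "ERROR net"))
val errors = logs.filter(line => line.contains("ERROR"))
errors.persist(StorageLevel.MEMORY_ONLY)   // choose in-memory storage; MEMORY_AND_DISK would spill to disk
println(errors.count())                    // the first action computes and caches the RDD
println(errors.count())                    // the second action reuses the cached data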
How Spark Works - RDD Operations 67
✓ Task 1: Practice on Spark Transformations i.e. map(), filter(),
flatmap(), groupBy(), groupByKey(), sample(), union(), join(),
distinct(), keyBy(), partitionBy and zip().

✓ RDD (Resilient Distributed Dataset) is main logical data unit


in Spark. An RDD is distributed collection of objects. ... Quoting
from Learning Spark book, "In Spark all work is expressed as
creating new RDDs, transforming existing RDDs, or calling
operations on RDDs to compute a result.“

✓ Transformations are lazy evaluated operations on RDD that create


one or many new RDDs, e.g. map, filter, reduceByKey, join,
cogroup, randomSplit.

✓ Transformations are lazy, i.e. are not executed immediately.

✓ Transformations can be executed only when actions are called.


68
✓ Transformations are lazy operations on a RDD that create one or
many new RDDs.

✓ Ex: map , filter , reduceByKey , join , cogroup , randomSplit .

✓ In other words, transformations are functions that take a RDD as the


input and produce one or many RDDs as the output.

✓ RDD allows you to create dependencies between RDDs.

✓ Dependencies are the steps for producing results i.e. a program.

✓ Each RDD in the lineage chain (the string of dependencies) has a function for operating on its data and a pointer (dependency) to its ancestor RDD.

✓ Spark will divide RDD dependencies into stages and tasks and then send those to workers for execution. 69
70
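✓ One way to see the lineage chain that Spark builds before dividing work into stages is toDebugString (a sketch, assuming a SparkContext sc):

val base = sc.parallelize(1 to 10)
val doubled = base.map(_ * 2)              // one dependency link
val evens = doubled.filter(_ % 4 == 0)     // another dependency link
println(evens.toDebugString)               // prints the chain of parent RDDs (the lineage)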
1.map(): Pass each element of the RDD through the supplied function.

val x = sc.parallelize(Array("b", "a", "c"))

val y = x.map(z => (z, 1))

println(y.collect().mkString(", "))

Output: y: [('b', 1), ('a', 1), ('c', 1)]

71
2.filter(): Filter creates a new RDD by passing in the supplied function
used to filter the results.

val x = sc.parallelize(Array(1,2,3))
val y = x.filter(n => n%2 == 1)

println(y.collect().mkString(", "))

Output: y: [1, 3]

72
3.flatmap() : Similar to map, but each input item can be mapped to 0 or
more output items.

val x = sc.parallelize(Array(1,2,3))
val y = x.flatMap(n => Array(n, n*100, 42))

println(y.collect().mkString(", "))

Output: y: [1, 100, 42, 2, 200, 42, 3, 300, 42]

73
4.groupBy() : Groups the elements of the RDD by the key returned by the supplied function, producing a dataset of (K, Iterable<V>) pairs.

val x = sc.parallelize( Array("John", "Fred", "Anna", "James"))


val y = x.groupBy(w => w.charAt(0))

println(y.collect().mkString(", "))

Output: y: [('A',['Anna']),('J',['John','James']),('F',['Fred'])]

74
5.groupByKey() : When called on a dataset of (K, V) pairs, returns a
dataset of (K, Iterable<V>) pairs.

val x = sc.parallelize( Array(('B',5),('B',4),('A',3),('A',2),('A',1)))


val y = x.groupByKey()

println(y.collect().mkString(", "))

Output: y: [('A', [3, 2, 1]),('B',[5, 4])]

75
6.sample() : Return a random sample subset RDD of the input RDD.

val x= sc.parallelize(Array(1, 2, 3, 4, 5))


val y= x.sample(false, 0.4)

// omitting seed will yield different output

println(y.collect().mkString(", "))

Output: y: [1, 3]

76
7.union() : Returns the union of two RDDs.

val x= sc.parallelize(Array(1,2,3), 2)
val y= sc.parallelize(Array(3,4), 1)
val z= x.union(y)
val zOut= z.glom().collect()

Output z: [[1], [2, 3], [3, 4]]

77
8.join() : If you have relational database experience, this will be
easy. It’s joining of two datasets.

val x= sc.parallelize(Array(("a", 1), ("b", 2)))


val y= sc.parallelize(Array(("a", 3), ("a", 4), ("b", 5)))
val z= x.join(y)

println(z.collect().mkString(", "))

Output z: [('a', (1, 3)), ('a', (1, 4)), ('b', (2, 5))]

78
9.distinct() : Return a new RDD with distinct elements within a
source RDD.

val x= sc.parallelize(Array(1,2,3,3,4))
val y= x.distinct()

println(y.collect().mkString(", "))

Output: y: [1, 2, 3, 4]

79
10) keyBy() : Constructs two-component tuples (key-value pairs) by
applying a function on each data item.

val x= sc.parallelize(Array("John", "Fred", "Anna", "James"))


val y= x.keyBy(w => w.charAt(0))

println(y.collect().mkString(", "))

Output: y: [('J','John'),('F','Fred'),('A','Anna'),('J','James')]

80
11) partitionBy() : Repartitions a key-value RDD using its keys. The partitioner implementation can be supplied as the first argument.

import org.apache.spark.Partitioner

val x = sc.parallelize(Array(('J', "James"), ('F', "Fred"), ('A', "Anna"), ('J', "John")), 3)
val y = x.partitionBy(new Partitioner() {
  val numPartitions = 2
  def getPartition(k: Any) = {
    if (k.asInstanceOf[Char] < 'H') 0 else 1
  }
})

val yOut = y.glom().collect()

Output: y: Array(Array((F,Fred), (A,Anna)), Array((J,John), (J,James)))


81
12) zip() : Joins two RDDs by combining the i-th of either partition
with each other.

val x= sc.parallelize(Array(1,2,3))
val y= x.map(n=>n*n)
val z= x.zip(y)

println(z.collect().mkString(", "))

Output: z: [(1, 1), (2, 4), (3, 9)]

82
✓ Task 2: Practice on Spark Actions i.e. getNumPartitions(), collect(),
reduce(), aggregate(), max(), sum(), mean(), stdev(), countByKey().

✓ RDD (Resilient Distributed Dataset) is main logical data unit


in Spark. An RDD is distributed collection of objects. ... Quoting
from Learning Spark book, "In Spark all work is expressed as
creating new RDDs, transforming existing RDDs, or calling
operations on RDDs to compute a result.“
✓ Actions returns final result of RDD computations / operation.

✓ Action produces a value back to the Spark driver program. It may


trigger a previously constructed, lazy RDD to be evaluated.

✓ Action function materialize a value in a Spark program. So basically


an action is RDD operation that returns a value of any type but
RDD[T] is an action.

83
✓ Actions: Unlike Transformations which produce RDDs,
action functions produce a value back to the Spark driver
program.
✓ Actions may trigger a previously constructed, lazy RDD to
be evaluated.

1. collect()
2. reduce()
3. aggregate ()
4. mean()
5. sum()
6. max()
7. stdev()
8. countByKey()
9. getNumPartitions() 84
1) collect() : collect returns the elements of the dataset as an array
back to the driver program.

val x= sc.parallelize(Array(1,2,3), 2)

val y= x.collect()

Output: y: [1, 2, 3]

85
2) reduce() : Aggregate the elements of a dataset through function.

val x= sc.parallelize(Array(1,2,3,4))

val y= x.reduce((a,b) => a+b)

Output: y: 10

86
3) aggregate(): The aggregate function allows the user to apply two different reduce functions to the RDD.

val inputrdd = sc.parallelize(List(("maths", 21), ("english", 22), ("science", 31)), 3)

val result = inputrdd.aggregate(3)((acc, value) => acc + value._2, (acc1, acc2) => acc1 + acc2)

Partition 1 : Sum(all elements) + 3 (zero value)
Partition 2 : Sum(all elements) + 3 (zero value)
Partition 3 : Sum(all elements) + 3 (zero value)

✓ Result = Partition1 + Partition2 + Partition3 + 3 (zero value)

So we get 21 + 22 + 31 + (4 * 3) = 86

✓ Output: result: Int = 86

87
4) max() : Returns the largest element in the RDD.

val x= sc.parallelize(Array(2,4,1))

val y= x.max()

Output: y: 4

88
5) count() : Number of elements in the RDD.

val x= sc.parallelize(Array("apple", "beatty", "beatrice"))

val y=x.count()

Output: y: 3

89
6) sum() : Sum of the RDD.

val x= sc.parallelize(Array(2,4,1))

val y= x.sum()

Output: y: 7

90
7) mean() : Mean of given RDD.

val x= sc.parallelize(Array(2,4,1))

val y= x.mean()

Output: y: 2.3333333

91
8) stdev() : An aggregate function that standard deviation of a set of
numbers.

val x= sc.parallelize(Array(2,4,1))

val y= x.stdev()

Output: y: 1.2472191

92
9) countByKey() : This is only available on RDDs of (K, V) pairs and returns a hashmap of (K, count of K).

val x = sc.parallelize(Array(('J', "James"), ('F', "Fred"), ('A', "Anna"), ('J', "John")))

val y = x.countByKey()

Output: y: {'A': 1, 'J': 2, 'F': 1}

93
10) getNumPartitions()

val x= sc.parallelize(Array(1,2,3), 2)

val y= x.partitions.size

Output: y: 2

94
1. Spark is initially developed by which university
Ans) Berkley
2. What are the characteristics of Big Data?
Ans) Volume, Velocity and Variety
3. The main focus of Hadoop ecosystem is on
Ans ) Batch Processing
4. Streaming data tools available in Hadoop ecosystem
are?
Ans ) Apache Spark and Storm
5. Spark has API's in? How many languages it supports
Ans ) Java, Scala, R and Python
6. Which kind of data can be processed by spark?
Ans) Stored Data and Streaming Data
7. Spark can store its data in?
Ans) HDFS, MongoDB and Cassandra 103
8. How spark engine runs?
Ans) Integrating with Hadoop and Standalone
9. In spark data is represented as?
Ans ) RDDs
10. Which kind of data can be handled by Spark ?
Ans) Structured, Unstructured and Semi-Structured
11.Which among the following are the challenges in Map
reduce?
Ans) Every Problem has to be broken into Map and
Reduce phase
Collection of Key / Value pairs
High Throughput

104
12. Apache spark is a framework with?
Ans) Scheduling, Monitoring and Distributing Applications
13. Which of the features of Apache spark
Ans) DAG, RDDs and In- Memory processing
14) How much faster is the processing in spark when compared to
Hadoop?
ANS) 10-100X
15) In spark data is represented as?
Ans) RDDs
16) List of Transformations
Ans) map(), filter(), flatmap(), groupBy(), groupByKey(),
sample(), union(), join(), distinct(), keyBy(), partitionBy and
zip().
17) List of Actions
Ans) getNumPartitions(), collect(), reduce(), aggregate(),
max(), sum(), mean(), stdev(), countByKey().

105
18. Spark is developed in
Ans) Scala
19. Which type of processing Apache Spark can handle
Ans) Batch Processing, Interactive Processing, Stream
Processing and Graph Processing
20. List two statements of Spark
Ans) Spark can run on the top of Hadoop
Spark can process data stored in HDFS
Spark can use Yarn as resource management layer

21. Spark's core is a batch engine? True OR False


Ans) True
22) Spark is 100x faster than MapReduce due to
Ans) In-Memory Computing
23) MapReduce program can be developed in spark? T / F
Ans) True
106
24. Programming paradigm used in Spark
Ans) Generalized
25. Spark Core Abstraction
Ans) RDD
26. Choose correct statement about RDD
Ans) RDD is a distributed data structure
27. RDD is
Ans) Immutable, Recomputable and Fault-tolerant
28. RDD operations
Ans) Transformation, Action and Caching
29. We can edit the data of RDD like conversion to uppercase? T/F
Ans) False
30. Identify correct transformation
Ans) Map, Filter and Join

107
31. Identify Correct Action
Ans) Reduce
32. Choose correct statement
Ans) Execution starts with the call of Action
33. Choose correct statement about Spark Context
Ans) Interact with cluster manager and Specify spark how to
access cluster
34. Spark cache the data automatically in the memory as and when
needed? T/F
Ans) False
35. For resource management spark can use
Ans) Yarn, Mesos and Standalone cluster manager
36. RDD can not be created from data stored on
Ans) Oracle
37. RDD can be created from data stored on
Ans) Local FS, S3 and HDFS
108
38. Who is father of Big data Analytics
✓ Doug Cutting

39. What are major Characteristics of Big Data


✓ Volume, Velocity and Variety(3 V’s)
40. What is Apache Hadoop
✓ Open-source Software Framework

41. Who developed Hadoop


✓ Doug Cutting

42. Hadoop supports which programming framework


✓ Java

109
43. What is the heart of Hadoop
✓ MapReduce

44. What is MapReduce


✓ Programming Model for Processing Large Data Sets.

45. What are the Big Data Dimensions


✓ 4 V’s

46. What is the caption of Volume


✓ Data at Scale

47. What is the caption of Velocity


✓ Data in Motion
110
48. What is the caption of Variety
✓ Data in many forms

49. What is the caption of Veracity


✓ Data Uncertainty

50. What is the biggest Data source for Big Data


✓ Transactions

51. What is the biggest Analytic capability for Big Data


✓ Query and Reporting

52. What is the biggest Infrastructure for Big Data


✓ Information integration
111
53. What are the Big Data Adoption Stages
✓ Educate, Explore, Engage and Execute

54. What is Mahout


✓ Algorithm library for scalable machine learning on Hadoop
55. What is Pig
✓ Creating MapReduce programs used with Hadoop.
56. What is HBase
✓ Non-Relational Database

57. What is the biggest Research Challenge for Big Data


✓ Heterogeneity , Incompleteness and Security

112
58. What is Sqoop
✓ Transferring bulk data between Hadoop to Structured data.

59. What is Oozie


✓ Workflow scheduler system to manage Hadoop jobs.

60. What is Hue


✓ Web interface that supports Apache Hadoop and its ecosystem

61. What is Avro


✓ Avro is a data serialization system.

62. What is Giraph


✓ Iterative graph processing system built for high scalability.
63. What is Cassandra
✓ Cassandra does not support joins or sub queries, except
for batch analysis via Hadoop

64. What is Chukwa


✓ Chukwa is an open source data collection system for
monitoring large distributed systems

65. What is Hive


✓ Hive is a data warehouse on Hadoop

66. What is Apache drill


✓ Apache Drill is a distributed system for interactive
analysis of large-scale datasets.

67. What is HDFS


✓ Hadoop Distributed File System ( HDFS )
68. Facebook generates how much data per day
✓ 25TB

69. What is BIG DATA?


✓ Big Data is nothing but an assortment of such a huge and
complex data that it becomes very tedious to capture, store,
process, retrieve and analyze it with the help of on-hand
database management tools or traditional data processing
techniques.

70. What is HUE expansion


✓ Hadoop User Interface
71.Can you give some examples of Big Data?
✓ There are many real life examples of Big Data! Facebook is
generating 500+ terabytes of data per day, NYSE (New York Stock
Exchange) generates about 1 terabyte of new trade data per day, a
jet airline collects 10 terabytes of censor data for every 30 minutes
of flying time.

72. Can you give a detailed overview about the Big Data being
generated by Facebook?
✓ As of December 31, 2012, there are 1.06 billion monthly active
users on Facebook and 680 million mobile users. On an average,
3.2 billion likes and comments are posted every day on Facebook.
72% of web audience is on Facebook. And why not! There are so
many activities going on Facebook from wall posts, sharing images,
videos, writing comments and liking posts, etc.

116
73. What are the three characteristics of Big Data?
✓ The three characteristics of Big Data
are: Volume: Facebook generating 500+ terabytes of data
per day. Velocity: Analyzing 2 million records each day
to identify the reason for losses. Variety: images, audio,
video, sensor data, log files, etc.

74. How Big is ‘Big Data’?


✓ With time, data volume is growing exponentially. Earlier
we used to talk about Megabytes or Gigabytes. But time
has arrived when we talk about data volume in terms of
terabytes, petabytes and also zettabytes! Global data
volume was around 1.8ZB in 2011 and is expected to be
7.9ZB in 2015. 117
75. How analysis of Big Data is useful for organizations?
✓ Effective analysis of Big Data provides a lot of business
advantage as organizations will learn which areas to focus
on and which areas are less important.

76. Who are ‘Data Scientists’?


✓ Data scientists are experts who find solutions to analyze
data. Just as web analysis, we have data scientists who
have good business insight as to how to handle a business
challenge.

118
77. What is Hadoop?
✓ Hadoop is a framework that allows for distributed
processing of large data sets across clusters of commodity
computers using a simple programming model.

78. Why the name ‘Hadoop’?


✓ Hadoop doesn’t have any expanding version like ‘OOPS’.
The charming yellow elephant you see is basically named
after Doug’s son’s toy elephant!

79. Why do we need Hadoop?


✓ Everyday a large amount of unstructured data is getting
dumped into our machines.
119
80. What are some of the characteristics of Hadoop framework?
✓ Hadoop framework is written in Java. It is designed to solve
problems that involve analyzing large data (e.g. petabytes). The
programming model is based on Google’s MapReduce. The
infrastructure is based on Google’s Big Data and Distributed File
System.

81. Give a brief overview of Hadoop history.


✓ In 2002, Doug Cutting created an open source, web crawler project.
In 2004, Google published MapReduce, GFS papers.
✓ In 2006, Doug Cutting developed the open source, MapReduce and
HDFS project. In 2008, Yahoo ran 4,000 node Hadoop cluster and
Hadoop won terabyte sort benchmark.
✓ In 2009, Facebook launched SQL support for Hadoop.

120
82. Give examples of some companies that are using Hadoop
structure?
✓ A lot of companies are using the Hadoop structure such as
Cloudera, EMC, MapR, Horton works, Amazon, Facebook, eBay,
Twitter, Google and so on.

83. What is the basic difference between traditional RDBMS and


Hadoop?
✓ RDBMS is used for transactional systems to report and archive the
data.
✓ Hadoop is an approach to store huge amount of data in the
distributed file system and process it.
✓ RDBMS will be useful when you want to seek one record from Big
data, whereas.
✓ Hadoop will be useful when you want Big data in one shot and
perform analysis on that later.

121
84. What is structured and unstructured data?
✓ Structured data is the data that is easily identifiable as it is
organized in a structure. The most common form of
structured data is a database where specific information is
stored in tables, that is, rows and columns.
✓ Unstructured data refers to any data that cannot be
identified easily. It could be in the form of images, videos,
documents, email, logs and random text.

85. What are the core components of Hadoop?


✓ Core components of Hadoop are HDFS and MapReduce.
HDFS is basically used to store large data sets and
MapReduce is used to process such large data sets.

122
86. What is HDFS?
✓HDFS is a file system designed for storing very
large files with streaming data access patterns,
running clusters on commodity hardware.

87. What are the key features of HDFS?


✓HDFS is highly fault-tolerant, with high
throughput, suitable for applications with large
data sets, streaming access to file system data and
can be built out of commodity hardware.

123
88. What is Fault Tolerance?
✓ Suppose you have a file stored in a system, and due to
some technical problem that file gets destroyed. Then
there is no chance of getting the data back present in that
file.

89. Replication causes data redundancy then why is


pursued in HDFS?
✓ HDFS works with commodity hardware (systems with
average configurations) that has high chances of getting
crashed any time. Thus, to make the entire system highly
fault-tolerant, HDFS replicates and stores data in different
places.
124
90. Since the data is replicated thrice in HDFS, does it
mean that any calculation done on one node will also
be replicated on the other two?
✓ Since there are 3 nodes, when we send the MapReduce
programs, calculations will be done only on the original
data. The master node will know which node exactly has
that particular data.

91. What is throughput? How does HDFS get a good


throughput?
✓ Throughput is the amount of work done in a unit time. It
describes how fast the data is getting accessed from the
system and it is usually used to measure performance of
the system. 125
92. What is streaming access?
✓ As HDFS works on the principle of ‘Write Once, Read
Many‘, the feature of streaming access is
extremely important in HDFS. HDFS focuses not so
much on storing the data but how to retrieve it at the
fastest possible speed, especially while analyzing logs.

93. What is a commodity hardware? Does commodity


hardware include RAM?
✓ Commodity hardware is a non-expensive system which is
not of high quality or high-availability. Hadoop can be
installed in any average commodity hardware.

126
94. What is a Name node?
✓ Name node is the master node on which job tracker runs
and consists of the metadata. It maintains and manages
the blocks which are present on the data nodes.

95. Is Name node also a commodity?


✓ No. Name node can never be
a commodity hardware because the entire HDFS rely on
it. It is the single point of failure in HDFS. Name node
has to be a high-availability machine.

96. What is a metadata?


✓ Metadata is the information about the data stored in data
nodes such as location of the file, size of the file and so
on. 127
97. What is a Data node?
✓Data nodes are the slaves which are deployed on
each machine and provide the actual storage.
These are responsible for serving read and write
requests for the clients.

98. Why do we use HDFS for applications having


large data sets and not when there are lot of
small files?
✓HDFS is more suitable for large amount of data
sets in a single file as compared to small amount
of data spread across multiple files.
128
99. What is a daemon?
✓Daemon is a process or service that runs in
background. In general, we use this word in UNIX
environment.

100. What is a job tracker?


✓Job tracker is a daemon that runs on a name node
for submitting and tracking MapReduce jobs in
Hadoop. It assigns the tasks to the different task
tracker.

129
101. What is a task tracker?
✓ Task tracker is also a daemon that runs on data nodes.
Task Trackers manage the execution of individual tasks
on slave node.

102. Is Name node machine same as data node machine


as in terms of hardware?
✓ It depends upon the cluster you are trying to create. The
Hadoop VM can be there on the same machine or on
another machine.

130
103. What is a heartbeat in HDFS?
✓ A heartbeat is a signal indicating that it is alive. A data
node sends heartbeat to Name node and task tracker will
send its heart beat to job tracker.

104. Are Name node and job tracker on the same host?
✓ No, in practical environment, Name node is on a separate
host and job tracker is on a separate host.

105. What is a ‘block’ in HDFS?


✓ A ‘block’ is the minimum amount of data that can be read
or written. In HDFS, the default block size is 64 MB as
contrast to the block size of 8192 bytes in Unix/Linux.
106. What are the benefits of block transfer?
✓ A file can be larger than any single disk in the network.
Blocks provide fault tolerance and availability.

107. If we want to copy 10 blocks from one machine to


another, but another machine can copy only 8.5
blocks, can the blocks be broken at the time of
replication?
✓ In HDFS, blocks cannot be broken down. Before copying
the blocks from one machine to another, the Master node
will figure out what is the actual amount of space
required, how many block are being used, how much
space is available, and it will allocate the blocks
accordingly.
132
108. How indexing is done in HDFS?
✓ Hadoop has its own way of indexing. Depending upon the
block size, once the data is stored, HDFS will keep on
storing the last part of the data which will say where the
next part of the data will be.

109. If a data Node is full how it’s identified?


✓ When data is stored in data node, then the metadata of
that data will be stored in the Name node. So Name node
will identify if the data node is full.

110.If data nodes increase, then do we need to upgrade


Name node?
✓ While installing the Hadoop system, Name node is
determined based on the size of the clusters. 133
111.Are job tracker and task trackers present in separate
machines?
✓ Yes, job tracker and task tracker are present in different machines.
The reason is job tracker is a single point of failure for the Hadoop
MapReduce service.

112. When we send a data to a node, do we allow settling in time,


before sending another data to that node?
✓ Yes, we do.

113. Does Hadoop always require digital data to process?


✓ Yes. Hadoop always require digital data to be processed.

114. On what basis Name node will decide which data node to
write on?
✓ As the Name node has the metadata (information) related to all the
data nodes, it knows which data node is free.
134
115. Doesn’t Google have its very own version of DFS?
✓ Yes, Google owns a DFS known as “Google File System
(GFS)” developed by Google Inc. for its own use.

116. Who is a ‘user’ in HDFS?


✓ A user is like you or me, who has some query or who needs some
kind of data.

117. Is client the end user in HDFS?


✓ No, Client is an application which runs on your machine, which is
used to interact with the Name node (job tracker) or data node (task
tracker).

118. What is the communication channel between client and name


node/ data node?
✓ The mode of communication is SSH. 135
119. What is a rack?
✓ Rack is a storage area with all the data nodes put together.
Rack is a physical collection of data nodes which are
stored at a single location.

120. On what basis data will be stored on a rack?


✓ When the client is ready to load a file into the cluster, the
content of the file will be divided into blocks.

121. Do we need to place 2nd and 3rd data in rack 2


only?
✓ Yes, this is to avoid data node failure.

136
122. What if rack 2 and datanode fails?
✓ If both rack2 and datanode present in rack 1 fails then
there is no chance of getting data from it.

123. What is a Secondary Namenode? Is it a substitute to


the Namenode?
✓ The secondary Namenode constantly reads the data from
the RAM of the Namenode and writes it into the hard disk
or the file system.

124. What is the difference between Gen1 and Gen2


Hadoop with regards to the Namenode?
✓ In Gen 1 Hadoop, Namenode is the single point of failure.
In Gen 2 Hadoop, we have what is known as Active and
Passive Namenodes kind of a structure. 137
125. What is MapReduce?
✓ Map Reduce is the ‘heart‘ of Hadoop that consists of two
parts – ‘map’ and ‘reduce’. Maps and reduces are
programs for processing data.

126. Can you explain how do ‘map’ and ‘reduce’ work?


✓ Name node takes the input and divide it into parts and
assign them to data nodes.

127. What is ‘Key value pair’ in HDFS?


✓ Key value pair is the intermediate data generated by maps
and sent to reduces for generating the final output.

138
128. What is the difference between MapReduce engine and HDFS
cluster?
✓ HDFS cluster is the name given to the whole configuration of
master and slaves where data is stored. Map Reduce Engine is the
programming module which is used to retrieve and analyze data.

129. Do we require two servers for the Name node and the data
nodes?
✓ Yes, we need two different servers for the Name node and the data
nodes.

130. Why are the number of splits equal to the number of maps?
✓ The number of maps is equal to the number of input splits because
we want the key and value pairs of all the input splits.
131. Which are the two types of ‘writes’ in HDFS?
✓ There are two types of writes in HDFS: posted and non-
posted write. Posted Write is when we write it and forget
about it, without worrying about the acknowledgement.

132. Why ‘Reading‘ is done in parallel and ‘Writing‘ is


not in HDFS?
✓ Reading is done in parallel because by doing so we can
access the data fast. But we do not perform
the write operation in parallel. The reason is that if we
perform the write operation in parallel, then it might result
in data inconsistency.

140
133. Is a job split into maps?
✓ No, a job is not split into maps. Spilt is created for the
file. The file is placed on data nodes in blocks. For
each split, a map is needed.

134.Can Hadoop be compared to NOSQL database like


Cassandra?
✓ Though NOSQL is the closet technology that can be
compared to Hadoop, it has its own pros and cons. There
is no DFS in NOSQL. Hadoop is not a database.

135. How can I install Cloudera VM in my system?


✓ When you enroll for the Hadoop course at Edureka, you
can download the Hadoop Installation steps.pdf file from
our dropbox. 141
Relational DB’s vs. Big Data (Spark)

Relational DB’s:
1. Deal with Gigabytes to Terabytes
2. Centralized
3. Deal with structured data
4. Stable data model
5. Known, complex inter-relationships
6. Tools: SQL, MySQL, DB2
7. Access: interactive and batch
8. Updates: read and write many times
9. Integrity: high
10. Scaling: nonlinear

Big Data (Spark):
1. Deals with Petabytes to Zettabytes
2. Distributed
3. Deals with semi-structured and unstructured data
4. Unstable data model
5. Flat schemas with few interrelationships
6. Tools: Hadoop, R, Mahout
7. Access: batch
8. Updates: write once, read many times
9. Integrity: low
10. Scaling: linear 142
HADOOP vs. SPARK

Performance: Hadoop processes data on disk; Spark processes data in-memory.
Ease of use: Hadoop requires proficiency in Java and MapReduce; Spark offers Java, Scala, R and Python and is more expressive and intelligent.
Data processing: Hadoop needs other platforms for streaming and graphs; Spark has MLlib, Streaming and graph processing built in.
Failure tolerance: Hadoop continues from the point it left off; Spark starts the processing from the beginning.
Cost: Hadoop has hard-disk space cost; Spark has memory space cost.
Run everywhere: Spark runs on Hadoop.
Memory: Hadoop (HDFS with MapReduce) takes a backup of all the data on a physical server after each operation; Spark keeps the data in RAM, which is called in-memory operation.
Speed: Hadoop works less fast than Spark; Spark works up to 100x faster than Hadoop (and about 10x faster on disk for batch jobs).
Version: Hadoop 2.6.5 release notes; Spark 2.3.1.
Software: Hadoop is an open source, reliable, scalable, distributed computing Big Data tool; Spark is a fast and general engine for large-scale data processing.
Execution engine: Spark uses an advanced DAG execution engine that supports acyclic data flow and in-memory computing.
Hardware cost: Hadoop more; Spark less.
Library: Hadoop relies on an external machine learning library; Spark has a built-in machine learning library.
Recovery: Hadoop recovery is easier than Spark; in Spark, failure recovery is more difficult, but checkpoints are available and still good.
File management system: Hadoop has its own file management system (HDFS); Spark does not come with its own and supports cloud-based data platforms (Spark was designed to work with Hadoop).
Support: Hadoop supports HDFS, Hadoop YARN and Apache Mesos; Spark supports RDDs.
Technologies: Hadoop works with Cassandra, HBase and Hive (batch processing); Spark supports all Hadoop systems, Tachyon and any Hadoop data source.
Use cases: Hadoop for marketing analysis, computing analysis and cyber-security analytics; Spark for online product analytics.
Run clusters: Hadoop on databases and servers; Spark on cloud-based systems and data sets. 145
Reach me @
Srinivasulu Asadi: srinu_asadi@yahoo.com , + 91-9490246442 146
Fig: International Conference on Emerging Research In Computing,
Information, Communication and Applications- ERCICA-2014.
Fig: Keynote Speaker @GITM, Proddatur, 2016
Fig: Resource Person for Workshop @MITS, Madanapalle, 2016
151
Fig: Jury Member @ MITS, Madanapalle, 2015
Fig: Keynote Speaker @ SITM, Renigunta, 2017
Fig: Doctorate Award @ JNTUA, Anantapur, 2018
Fig: Bharath Vidya Rathan Award @ Delhi, 2018
Fig: Best Local Chapter Award @ IITM, 2019
Thank You
Business Intelligence

158
159
CLUSTERING
✓ What is Clustering: Clustering is an unsupervised learning task (no predefined classes) that groups similar objects which differ significantly from the objects in other groups.
✓ The process of grouping a set of physical or abstract objects into classes of similar objects is called clustering.
✓ Clustering is “the process of organizing objects into groups whose members are similar in some way”.
✓ The cluster property is that intra-cluster distances are minimized and inter-cluster distances are maximized.

✓ What is Good Clustering: A good clustering method will produce


high quality clusters with
✓ high intra-class similarity
✓ low inter-class similarity 160
1. Nominal Variables allow for only qualitative classification. A
generalization of the binary variable in that it can take more than 2
states, e.g., red, yellow, blue, green
Ex: { male, female},{yes, no},{true, false}
2. Ordinal Data are categorical data where there is a logical
ordering to the categories.
Ex: 1=Strongly disagree; 2=Disagree; 3=Neutral; 4=Agree;
3. Categorical Data represent types of data which may be divided
into groups.
Ex: race, sex, age group, and educational level.
4. Labeled Data share the class labels or the generative distribution of the data.
5. Unlabeled Data do not share the class labels or the generative distribution of the labeled data.
6. Numerical Values: the data values are purely numerical. Ex: 1, 2, 3, 4, … 161
7. Interval-valued variables: variables whose values are ranges of numbers.
✓ Ex: 10-20, 20-30, 30-40, …

8. Binary Variables: variables whose values are combinations of 0 and 1.
✓ Ex: 1, 0, 001, 010, …

9. Ratio-Scaled Variables: A positive measurement on a nonlinear scale, approximately an exponential scale, such as Ae^(Bt) or Ae^(−Bt).
✓ Ex: ½, 2/4, 4/8,…..

10. Variables of Mixed Types: A database may contain all the six
types of variables symmetric binary, asymmetric binary,
nominal, ordinal, interval and ratio.
✓ Ex: 11121A1201
162
Similarity Measure
✓ Euclidean distance: Distances are normally used to measure the
similarity or dissimilarity between two data objects.

✓ Euclidean distance: Euclidean distance is the distance between two


points in Euclidean space.
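✓ As a small illustration, a plain Scala helper for the Euclidean distance between two points (a sketch; it assumes both points have the same number of coordinates):

// Euclidean distance between two points given as coordinate arrays.
def euclideanDistance(a: Array[Double], b: Array[Double]): Double = {
  require(a.length == b.length, "points must have the same dimension")
  math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)
}

println(euclideanDistance(Array(1.0, 2.0), Array(4.0, 6.0)))   // prints 5.0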
Major Clustering Approaches

1. Partitioning Methods
2. Hierarchical Methods
3. Density-Based Methods
4. Grid-Based Methods
5. Model-Based Clustering Methods
6. Clustering High-Dimensional Data
7. Constraint-Based Cluster Analysis
8. Outlier Analysis 164
✓Examples of Clustering Applications
1. Marketing
2. Land use
3. Insurance
4. City-planning
5. Earth-quake studies

✓Issues of Clustering
1. Accuracy,
2. Training time,
3. Robustness,
4. Interpretability, and
5. Scalability
6. Find top ‘n’ outlier points 165
Applications
✓ Pattern Recognition
✓ Spatial Data Analysis
✓ GIS(Geographical Information System)
✓ Cluster Weblog data to discover groups
✓ Credit approval
✓ Target marketing
✓ Medical diagnosis
✓ Fraud detection
✓ Weather forecasting
✓ Stock Marketing 166
2. Classification Vs Clustering

1. Classification is the process of organizing objects into groups whose members are not necessarily similar; Clustering is “the process of organizing objects into groups whose members are similar in some way”.

2. Classification is supervised learning; Clustering is unsupervised learning.

3. Classification uses predefined classes; Clustering has no predefined classes.

4. Classification has labels for some points; Clustering has no labels.

5. Classification requires a “rule” that will accurately assign labels to new points; Clustering groups points into clusters based on how “near” they are to one another.

6. (Illustrative figures of classification and clustering.)

7. Classification approaches are of two types: predictive classification and descriptive classification. Clustering approaches are of eight types: partitioning methods, hierarchical methods, density-based methods, grid-based methods, model-based clustering methods, clustering high-dimensional data, constraint-based cluster analysis and outlier analysis.

8. Issues of classification: accuracy, training time, robustness, interpretability and scalability. Issues of clustering: the same, plus finding the top ‘n’ outlier points.

9. Examples (both): marketing, land use, insurance, city planning and earthquake studies.

10. Classification techniques: decision tree, Bayesian classification, rule-based classification, and prediction with accuracy and error measures. Clustering techniques: k-means clustering, DIANA (DIvisive ANAlysis), AGNES (AGglomerative NESting), BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) and DBSCAN (Density-Based Spatial Clustering of Applications with Noise).

11. Classification applications: credit approval, target marketing, medical diagnosis, fraud detection, weather forecasting and stock marketing. Clustering applications: pattern recognition, spatial data analysis, WWW (World Wide Web), weblog data to discover groups, credit approval, target marketing, medical diagnosis, fraud detection, weather forecasting and stock marketing.

171
k-Means Clustering
✓ It is a Partitioning cluster technique.
✓ It is a Centroid-Based cluster technique
✓ Clustering is a Unsupervised learning i.e. no predefined
classes, Group of similar objects that differ significantly
from other objects.
d(i, j) = sqrt( |x_i1 − x_j1|^2 + |x_i2 − x_j2|^2 + … + |x_ip − x_jp|^2 )
✓ It then creates the first k initial clusters (k= number of
clusters needed) from the dataset by choosing k rows of
data randomly from the dataset.
✓ The k-Means algorithm calculates the Arithmetic Mean
of each cluster formed in the dataset.
172
✓ Square-error criterion: E = Σ (i = 1..k) Σ (p ∈ Ci) |p − mi|^2

✓ Where
– E is the sum of the square error for all objects in the data set;
– p is the point in space representing a given object; and
– mi is the mean of cluster Ci (both p and mi are multidimensional).
✓ Algorithm: The k-means algorithm for partitioning,
where each cluster’s center is represented by the mean
value of the objects in the cluster.
✓ Input:
– k: the number of clusters,
– D: a data set containing n objects.
✓ Output: A set of k clusters. 173
k-Means Clustering Method

✓Example
Fig: k-means example with k = 2. Arbitrarily choose k objects as the initial cluster centers, assign each object to the most similar center, update the cluster means, then reassign the objects and update the means again, repeating until the assignments no longer change. 174
Fig: Clustering of a set of objects based on the k-means
method. (The mean of each cluster is marked by a “+”.)

175
Steps
✓k - Means algorithm is implemented in four
steps:
1. Partition objects into k nonempty subsets.
2. Compute seed points as the centroids of the clusters
of the current partition (the centroid is the center, i.e.,
mean point, of the cluster).
3. Assign each object to the cluster with the nearest seed
point.
4. Go back to Step 2; stop when there are no more new assignments.

177
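✓ A compact plain-Scala sketch of these four steps on made-up 1-D data (illustrative only; a real Spark job would use MLlib's KMeans instead):

val data = Seq(1.0, 2.0, 1.5, 8.0, 9.0, 8.5)        // toy data
var centroids = Seq(1.0, 8.0)                       // step 1: pick k = 2 initial seeds
for (_ <- 1 to 10) {                                // repeat steps 2-4 a fixed number of times
  val clusters = data.groupBy(p => centroids.minBy(c => math.abs(p - c)))  // step 3: assign to nearest seed
  centroids = clusters.values.map(ps => ps.sum / ps.size).toSeq            // step 2: recompute the means
}
println(centroids)                                  // roughly List(1.5, 8.5)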
✓Issues of Clustering
1. Accuracy,
2. Training time,
3. Robustness,
4. Interpretability, and
5. Scalability
6. Find top ‘n’ outlier points

✓Examples of Clustering Applications


1. Marketing
2. Land use
3. Insurance
4. City-planning
5. Earth-quake studies
178
Applications
✓ Pattern Recognition
✓ Spatial Data Analysis
✓ GIS(Geographical Information System)
✓ Image Processing
✓ WWW (World Wide Web)
✓ Cluster Weblog data to discover groups
✓ Credit approval
✓ Target marketing
✓ Medical diagnosis
✓ Fraud detection
✓ Weather forecasting
✓ Stock Marketing
179
Classification and Prediction
✓ Classification and Prediction: Classification is supervised learning, i.e. we can predict output values from input values; the data is divided into predefined groups that do not necessarily share similar properties.
✓ Classification is a two step process
1. Build the Classifier/ Model
2. Use Classifier for Classification.
1. Build the Classifier / Model: Describing a set of predetermined
classes
✓ Each tuple / sample is assumed to belong to a predefined class,
as determined by the class label attribute.
✓ Also called as Learning phase or training phase.
✓ The set of tuples used for model construction is training set.
✓ The model is represented as classification rules, decision trees,
or mathematical formulae. 180
Classification algorithms applied to the Training Data:

NAME | RANK | YEARS | TENURED
Mike | Assistant Prof | 3 | no
Mary | Assistant Prof | 7 | yes
Bill | Professor | 2 | yes
Jim | Associate Prof | 7 | yes
Dave | Assistant Prof | 6 | no
Anne | Associate Prof | 3 | no

Classifier (Model): IF rank = 'professor' OR years > 6 THEN tenured = 'yes'

Fig: Model Construction


2. Using Classifier: for classifying future or unknown
objects. It estimate accuracy of the model.

✓ The known label of test sample is compared with the


classified result from the model.

✓ Accuracy rate is the percentage of test set samples that are


correctly classified by the model.

✓ Test set is independent of training set, otherwise over-


fitting will occur.

✓ If the accuracy is acceptable, use the model to classify


data tuples whose class labels are not known.
182
The Classifier applied to the Testing Data:

NAME | RANK | YEARS | TENURED
Tom | Assistant Prof | 2 | no
Merlisa | Associate Prof | 7 | no
George | Professor | 5 | yes
Joseph | Assistant Prof | 7 | yes

Unseen Data: (Jeff, Professor, 4) -> Tenured?

Fig: Using the Model in Prediction

TYPES OF CLASSIFICATION TECHNIQUES

1. Decision Tree

2. Bayesian classification

3. Rule-based classification

4. Prediction

5. Classifier Accuracy and Prediction error measures

184
Fig: Example for Model Construction and Usage of Model
185
2 - Decision Tree
✓ Decision tree is a flowchart-like tree structure, where
each internal node (nonleaf node)denotes a test on an
attribute, each branch represents an outcome of the test,
and each leaf node (or terminal node) holds a class label.

✓ Decision Tree Induction is developed by Ross Quinlan,


decision tree algorithm known as ID3 (Iterative
Dichotomiser).

✓ Decision tree is a classifier in the form of a tree structure


✓ Decision node: specifies a test on a single attribute
✓ Leaf node: indicates the value of the target attribute
✓ Arc/edge: split of one attribute
✓ Path: a disjunction of test to make the final decision 186
EX: Data Set in All Electronics Customer Database 187
Fig: A Decision tree for the concept buys computer, indicating
whether a customer at AllElectronics is likely to purchase a
computer. Each internal (nonleaf) node represents a test on an
attribute. Each leaf node represents a class (either buys computer
= yes or buys computer = no). 188
Ex: For age attribute

189
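✓ For the "Classification using Spark" theme, a minimal sketch of training a decision tree with Spark ML on toy data (it assumes a SparkSession named spark, as in spark-shell; the labels and feature values are made up, not the AllElectronics data):

import org.apache.spark.ml.classification.DecisionTreeClassifier
import org.apache.spark.ml.linalg.Vectors
import spark.implicits._

// Toy rows: label 1.0 = buys computer, 0.0 = does not; two numeric features.
val training = Seq(
  (1.0, Vectors.dense(0.0, 1.0)),
  (0.0, Vectors.dense(1.0, 0.0)),
  (1.0, Vectors.dense(0.0, 0.0)),
  (0.0, Vectors.dense(1.0, 1.0))
).toDF("label", "features")

val dt = new DecisionTreeClassifier().setLabelCol("label").setFeaturesCol("features")
val model = dt.fit(training)                                      // build the classifier (learning phase)
println(model.toDebugString)                                      // print the learned tree structure
model.transform(training).select("label", "prediction").show()    // use the classifier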
✓Issues of Classification and Prediction
1. Accuracy,
2. Training time,
3. Robustness,
4. Interpretability, and
5. Scalability
✓Typical applications
1. Credit approval
2. Target marketing
3. Medical diagnosis
4. Fraud detection
5. Weather forecasting
6. Stock Marketing
190
