28th May 2019


Dr.Asadi Srinivasulu
Professor of IT, M.Tech(IIIT), Ph.D.


✓ FOSS (Free and Open Source Software): FOSS allows users and
programmers to edit, modify or reuse the software's source code.
This gives developers the opportunity to improve program
functionality by modifying it.
✓ Software:
✓ NextCloud
✓ HBase
✓ FreeRDP

✓ Big Data Fundamentals

✓ Big Data Architecture
✓ Spark Fundamentals
✓ Spark Ecosystem
✓ Spark Transformations
✓ Spark Actions
✓ Spark – MLlib
✓ Classification using Spark
✓ Clustering using Spark
✓ Spark Challenges
IoT: Internet of Things
Big Data Analytics: Deals with the 3 V's of big data, e.g. Facebook
Data Mining: Extracting meaningful data
Data Warehouse: Collection of data marts / OLAP
Data Mart: Subset of a data warehouse (DWH)
Database System: Combination of data + DBMS
DBMS: Collection of software
Database: Collection of inter-related data
Information: Processed data
Data: Raw material / facts / images

Fig: Pre-requisite of Big Data

Fig: Big Data Word Cloud
✓The Myth about Big Data

❑ Big Data Is New

❑ Big Data Is Only About Massive Data Volume
❑ Big Data Means Hadoop
❑ Big Data Needs A Data Warehouse
❑ Big Data Means Unstructured Data
❑ Big Data Is for Social Media & Sentiment Analysis
Big Data is …….
✓Big data is high-volume, high-velocity and high-variety
information assets that demand cost-effective, innovative
forms of information processing for enhanced insight and
decision making.

✓Big data is data which is too large, complex and
dynamic for any conventional data tools to
capture, store, manage and analyze.

Fig: Big Data - Characteristics
✓ Big Data Analytics is the process of examining large
amounts of data of a variety of types (big data) to uncover
hidden patterns, unknown correlations and other useful
information.

✓ Why do organizations care about Big Data?

✓ More knowledge leads to better customer engagement,
fraud prevention and new products.
✓ Data Evolution: 10% of data is structured and 90% is
unstructured, such as emails, videos, Facebook posts and website clicks.

Fig: The Evolution of BIG DATA
✓ Big data is a collection of data sets so large and
complex that it is difficult to handle using DBMS tools.

✓ Facebook alone generates more than 500 terabytes of
data daily, whereas many other organizations like Jet Air
and the Stock Exchange Market generate terabytes of data
every hour.

✓ Types of Data:
1. Structured Data- This data is organized in a highly mechanized
and manageable way. Ex: Tables, Transactions, Legacy Data etc.
2. Unstructured Data- This data is raw and unorganized; it varies in
its content and can change from entry to entry. Ex: Videos, images,
audio, Text Data, Graph Data, social media etc.
3. Semi-Structured Data- This data is partly structured and partly
unstructured. Ex: XML Database.
✓ Big Data Matters…….

✓ Data growth is huge and all that data is valuable.

✓ Data won't fit on a single system, which is why distributed data is used.

✓ Distributed data = Faster Computation.

✓ More knowledge leads to better customer engagement, fraud
prevention and new products.

✓ Big Data matters for Aggregation, Statistics, Indexing, Searching,
Querying and Discovering Knowledge.
Fig: Measuring the Data in Big Data System
✓ Big Data Sources: Big data is everywhere and it can help
organisations in any industry in many different ways.

✓ Big data has become too complex and too dynamic to be able to
process, store, analyze and manage with traditional data tools.

✓ Big Data Sources are

✓ ERP Data
✓ Transactions Data
✓ Public Data
✓ Social Media Data
✓ Sensor Media Data
✓ Big Data in Marketing
✓ Big Data in Health & Life Sciences
✓ Cameras Data
✓ Mobile Devices
✓ Machine sensors
✓ Microphones
Fig: Big Data Sources
Structure of Big Data
✓ Big Data Processing: So-called big data technologies are about discovering
patterns in semi-structured and unstructured data. The development of big data
standards and (open source) software is commonly driven by companies such as
Google, Facebook, Twitter, Yahoo! …

Big Data File Formats
✓ Videos
✓ Audios
✓ Images
✓ Photos
✓ Logs
✓ Click Trails
✓ Text Messages
✓ E-Mails
✓ Documents
✓ Books
✓ Transactions
✓ Public Records
✓ Flat Files
✓ SQL Files
✓ DB2 Files
✓ MYSQL Files
✓ Tera Data Files
✓ MS-Access Files
Characteristics of Big Data
✓ The seven characteristics of Big Data are volume,
velocity, variety, veracity, value, validity and visibility.
Earlier, data was measured in megabytes and gigabytes, but
now it is measured in terabytes.

1. Volume: Data size, the amount or quantity of data, or data at rest.

2. Velocity: Data speed, the speed of change (the content is changing
quickly), or data in motion.

3. Variety: Data types, the range of data types & sources, or data
with multiple formats.

4. Veracity: Fuzzy and cloudy data, messiness, or "can we trust the
data?"

5. Value: Data alone is not enough; how can value be derived from it?

6. Validity: Ensure that the interpreted data is sound.

7. Visibility: Data from diverse sources needs to be stitched together.

✓ Flexible schema
✓ Massive scalability
✓ Cheaper to setup
✓ Improving Healthcare and Public Health
✓ Financial Trading
✓ Improving Security and Law Enforcement
✓ No declarative query language
✓ Higher performance
✓ Detect risks and check frauds
✓ Reduce Costs
Fig: Advantages of Big Data
✓ Big data violates the privacy principle.
✓ Data can be used for manipulating customers.
✓ Big data may increase social stratification.
✓ Big data is not useful in short run.
✓ Faces difficulties in parsing and interpreting.
✓ Big data is difficult to handle -more programming
✓ Eventual consistency - fewer guarantees
Big Data Challenges
✓ Data Complexity
✓ Data Volume
✓ Data Velocity
✓ Data Variety
✓ Data Veracity
✓ Capture data
✓ Curation data
✓ Performance
✓ Storage data
✓ Search data
✓ Transfer data
✓ Visualization data
✓ Data Analysis
✓ Privacy and Security
✓ Big Data Challenges include analysis, capture, data curation,
search, sharing, storage, transfer, visualization, and information
privacy.
Fig: Challenges of Big Data
✓ Research issues in Big Data Analytics
1. Sentiment Analysis in Big Data Hadoop using Mahout
Machine Learning Algorithms
2. Opinion mining Analysis in Big Data Hadoop using Mahout
Machine Learning Algorithms
3. Predictive mining Analysis in Big Data Hadoop using Mahout
Machine Learning Algorithms
4. Post-Clustering Analysis in Big Data Hadoop using Mahout
Machine Learning Algorithms
5. Pre-Clustering Analysis in Big Data Hadoop using Mahout
Machine Learning Algorithms
6. How can we capture and deliver data to the right people in real time?
7. How can we handle the variety of forms and data?
8. How can we store and analyze data given its size and
computational capacity?
Big Data Tools
✓ Big Data Tools are Hadoop, Cloudera, Datameer, Splunk, Mahout, Hive, HBase,
LucidWorks, R, MapR, Ubuntu and Linux flavors.

✓ Social Networks and Relationships
✓ Cyber-Physical Models
✓ Internet of Things (IoT)
✓ Retail Market
✓ Retail Banking
✓ Real Estate
✓ Fraud detection and prevention
✓ Telecommunications
✓ Healthcare and Research
✓ Automotive and production
✓ Science and Research
✓ Trading Analytics
Fig: Applications of Big Data Analytics
✓ Spark Basics

✓ RDD(Resilient Distributed Dataset)

✓ Spark Transformations

✓ Spark Actions

✓ Spark with Machine Learning

✓ Hands on Spark

✓ Research Challenges in Spark

✓ Note: The name "Spark" is also used by an unrelated free and open
source Java web application framework and domain-specific language,
an alternative to other Java web application frameworks such as
JAX-RS, Play framework and Spring. It should not be confused with
Apache Spark, which is the subject here.
✓ Apache Spark is a general-purpose, in-memory cluster
computing system.
✓ Provides high-level APIs in Java, Scala and Python and
an optimized engine that supports general execution graphs.
✓ Provides various high-level tools like Spark SQL for
structured data processing, MLlib for Machine Learning
and more….
✓ Spark is a successor of MapReduce.

✓ MapReduce is the 'heart' of Hadoop and consists of two
parts: 'map' and 'reduce'.

✓ Maps and reduces are programs for processing data.

✓ 'Map' processes the data first to give some intermediate
output, which is further processed by 'Reduce' to generate
the final output.

✓ Thus, MapReduce allows for distributed processing of the
map and reduction operations.
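
The map and reduce phases described above can be sketched with a word count. This is a minimal illustration using plain Scala collections on a single machine, not the Hadoop MapReduce API; the input lines and the names `mapped`, `grouped` and `wordCounts` are assumptions made for the example.

```scala
// Hypothetical input, already split into lines as a mapper would receive them.
val lines = List("big data spark", "spark data", "spark")

// Map phase: emit an intermediate (word, 1) pair for every word.
val mapped: List[(String, Int)] =
  lines.flatMap(line => line.split(" ").map(word => (word, 1)))

// Shuffle: group the intermediate pairs by key (the word).
val grouped: Map[String, List[Int]] =
  mapped.groupBy(_._1).map { case (word, pairs) => (word, pairs.map(_._2)) }

// Reduce phase: sum the counts collected for each word.
val wordCounts: Map[String, Int] =
  grouped.map { case (word, ones) => (word, ones.sum) }
```

In a real cluster the map and reduce steps would run on different machines, with the grouping step moving data between them.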
✓ Apache Spark is the next-generation Big Data tool, i.e. it is
considered the future of Big Data and the successor of
MapReduce.
✓ Features of Spark are

– Speed.
– Usability.
– In-Memory Computing.
– Pillar to Sophisticated Analytics.
– Real Time Stream Processing.
– Compatibility with Hadoop & existing Hadoop Data.
– Lazy Evaluation.
– Active, progressive and expanding community.
✓ Spark is a distributed data analytics engine, generalizing
MapReduce.

✓ Spark is a core engine, with streaming, SQL, Machine
Learning and Graph processing modules.


Why Spark
✓ Spark is faster than MapReduce (MR) when it comes to processing the
same data multiple times rather than unloading
and loading new data.
✓ Spark is simpler and usually much faster than MapReduce
for the usual Machine Learning and Data Analytics
applications.
5 Reasons Why Spark Matters to Business

1. Spark enables use cases “traditional” Hadoop can’t handle.
2. Spark is fast
3. Spark can use your existing big data investment
4. Spark speaks SQL
5. Spark is developer-friendly
✓ Spark Ecosystem: The Spark ecosystem is still a work in
progress, with some Spark components not even in their
beta releases.

Components of Spark Ecosystem

✓ The components of Spark ecosystem are getting developed and
several contributions are being made every now and then.

✓ Primarily, Spark Ecosystem comprises the following components:

1) Spark SQL (SQL)
2) Spark Streaming (Streaming)
3) MLlib (Machine Learning)
4) GraphX (Graph Computation)
5) SparkR (R on Spark)
6) BlinkDB (Approximate SQL)
✓ Spark's official ecosystem consists of the following major
components:

✓ Spark DataFrames - Similar to a relational table
✓ Spark SQL - Execute SQL queries or HiveQL
✓ Spark Streaming - An extension of the core Spark API
✓ MLlib - Spark's machine learning library
✓ GraphX - Spark for graphs and graph-parallel computation
✓ Spark Core API - provides APIs for R, SQL, Python, Scala, Java

✓MLlib library has implementations for various
common machine learning algorithms
1. Clustering: K-means
2. Classification: Naïve Bayes, logistic regression
3. Decomposition: Principal Component Analysis
(PCA) and Singular Value Decomposition (SVD)
4. Regression: Linear Regression
5. Collaborative Filtering: Alternating Least Squares for
recommendations
✓ Language Support in Apache Spark
✓ Apache Spark ecosystem is built on top of the core
execution engine that has extensible APIs in different
languages.

✓ A 2016 Spark Survey, in which 62% of Spark users
evaluated the Spark languages, found:

✓ 58% were using Python in 2017

✓ 71% were using Scala
✓ 31% of the respondents were using Java and
✓ 18% were using R programming language.
✓ What is Scala?: Scala is a general-purpose programming
language, which expresses common programming patterns in a
concise, elegant, and type-safe way.

✓ The name is a blend of "Scalable Language" (it is not an acronym).

✓ Scala is an easy-to-learn language and supports both
Object Oriented Programming as well as Functional
Programming.

✓ It is getting popular among programmers, and is being
increasingly preferred over Java and other programming
languages.

✓ It seems much in sync with present and future Big
Data frameworks, like Scalding, Spark, Akka, etc.
✓Why is Spark Programmed in Scala?
✓ Scala is a pure object-oriented language, in which conceptually every
value is an object and every operation is a method-call. The language
supports advanced component architectures through classes and
traits.

✓ Scala is also a functional language. It supports functions and immutable
data structures, and gives preference to immutability over mutation.

✓ Scala can be seamlessly integrated with Java.

✓ It is already being widely used for Big Data platforms and
development of frameworks like Akka, Scalding, Play, etc.

✓ Being written in Scala, Spark can be embedded in any JVM-based
operational system.
✓ Procedure: Spark Installation in Ubuntu

✓ Apache Spark is a fast and general engine for large-scale data
processing.
✓ Apache Spark is a fast and general-purpose cluster computing
system. It provides high-level APIs in Java, Scala, Python and R,
and an optimized engine that supports general execution graphs.
✓ It also supports a rich set of higher-level tools including Spark
SQL for SQL and structured data processing, MLlib for machine
learning, GraphX for graph processing, and Spark Streaming.

Step 1: Installing Java.

java -version

Step 2: Installing Scala and SBT.

sudo apt-get update
sudo apt-get install scala
Step 3: Installing Maven plug-in.
✓ Maven plug-in is used to compile java program for spark. Type
below command to install maven.
sudo apt-get install maven
Step 4: Installing Spark.
✓ Download the "tgz" file of Spark by selecting a specific version from
the link below.
✓ Extract it and remember the path where it is stored.
✓ Edit the .bashrc file by adding the lines below (terminal command: gedit
.bashrc):
export SPARK_HOME=/path_to_spark_directory
export PATH=$PATH:$SPARK_HOME/bin
✓ Replace path_to_spark_directory in the above line with the path of your
Spark directory.
✓ Save and close .bashrc, then reload it by typing ". .bashrc" in
the terminal.
✓ If it doesn't work, restart the system. Spark is now installed.
✓ Type spark-shell in the terminal to start the Spark shell.
Spark Installation on Windows

Step 1: Install Java (JDK)

Download and install java from
Step 2: Set java environment variable
✓ Open "Control Panel", choose "System & Security" and select
"System".
✓ Select "Advanced System Settings" located at top right.
✓ Select "Environment Variables" from the pop-up.
✓ Next select New under System variables (below); you will get a pop-up.
✓ In the variable name field type JAVA_HOME
✓ In the variable value field provide the installation directory of Java, say
C:\Program Files\Java\jdk1.8.0_25
✓ Or you can simply choose the directory by selecting Browse Directory.
✓ Now close everything by choosing OK every time.
✓ Check whether the Java variable is set by running javac -version in the command
prompt. If we get Java version details then we are done.
Step 3: Installing SCALA
✓ Download scala.msi file from https://www.scala-
✓ Set scala environment variable just like java done above.
Variable name = SCALA_HOME
Variable value = path to scala installed directory, say
C:\Program Files (x86)\scala

Step 4: Installing SPARK

✓ Download and extract spark from
✓ You can set SPARK_HOME just like java.
✓ Note: On Windows, spark-shell can only be run from the bin folder
inside the Spark folder.
Step 5: Installing SBT
✓ Download and install sbt.msi from

Step 6: Installing Maven

✓ Download maven from
download.cgi and unzip it to the folder you want to install
✓ Add both M2_HOME and MAVEN_HOME variables in
the Windows environment, and point them to your Maven
folder.
✓ Update the PATH variable: append the Maven bin folder,
%M2_HOME%\bin, so that you can run Maven
commands everywhere.
✓ Test Maven by running mvn -version in the command prompt.
✓ Practice on Spark Framework with Transformations and
Actions: You can run Spark using its standalone cluster mode,
on EC2, on Hadoop YARN, or on Apache Mesos. Access data
in HDFS, Cassandra, HBase, Hive, Tachyon, and any Hadoop data
source.
✓ Spark powers a stack of libraries including SQL and
DataFrames, MLlib for machine learning, GraphX, and Spark
Streaming. You can combine these libraries seamlessly in the same
application.

✓ RDD (Resilient Distributed Dataset) is the main logical data unit
in Spark. An RDD is a distributed collection of objects. Quoting
from the Learning Spark book: "In Spark all work is expressed as
creating new RDDs, transforming existing RDDs, or calling
operations on RDDs to compute a result."

✓ Spark performs Transformations and Actions:

✓ Resilient Distributed Datasets overcome this drawback of Hadoop
MapReduce by allowing - fault tolerant ‘in-memory’ computations.

✓ RDD in Apache Spark.

✓ Why RDD is used to process the data ?
✓ What are the major features/characteristics of RDD (Resilient
Distributed Datasets) ?

✓ Resilient Distributed Datasets are immutable, partitioned collection

of records that can be operated on - in parallel.

✓ RDDs can contain any kind of objects Python, Scala, Java or even
user defined class objects.

✓ RDDs are usually created by either transformation of existing RDDs
or by loading an external dataset from stable storage like HDFS or
HBase.
Fig: Process of RDD Creation
✓ Operations on RDDs

✓ i) Transformations: Coarse-grained operations like
join, union, filter or map on existing RDDs which
produce a new RDD, with the result of the operation,
are referred to as transformations. All transformations
in Spark are lazy.

✓ ii) Actions: Operations like count, first and reduce
which return values after computations on existing
RDDs are referred to as Actions.

• Properties / Traits of RDD:

✓ Immutable (read-only, cannot be changed or modified): Data is safe to
share across processes.

✓ Partitioned: The partition is the basic unit of parallelism in an RDD.

✓ Coarse-grained operations: Operations are applied to all elements in a
dataset through map, filter or group-by operations.

✓ Action/Transformations: All computations in RDDs are actions or
transformations.

✓ Fault Tolerant: As the name "Resilient" suggests, an RDD is able to
recover lost data using its lineage graph.

✓ Cacheable: It can hold data in persistent storage (memory or disk).

✓ Persistence: Option of choosing which storage will be used, either
in-memory or on-disk.
How Spark Works - RDD Operations
✓ Task 1: Practice on Spark Transformations i.e. map(), filter(),
flatMap(), groupBy(), groupByKey(), sample(), union(), join(),
distinct(), keyBy(), partitionBy() and zip().

✓ Transformations are lazily evaluated operations on an RDD that create
one or many new RDDs, e.g. map, filter, reduceByKey, join,
cogroup, randomSplit.

✓ Transformations are lazy, i.e. they are not executed immediately;
they are executed only when an action is called.

✓ In other words, transformations are functions that take an RDD as the
input and produce one or many RDDs as the output.

✓ RDD allows you to create dependencies between RDDs.

✓ Dependencies are the steps for producing results i.e. a program.

✓ Each RDD in the lineage chain (string of dependencies) has a function
for operating on its data and a pointer (dependency) to its ancestor
RDD.

✓ Spark will divide RDD dependencies into stages and tasks and then
send those to workers for execution.
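
Lazy evaluation can be illustrated without a cluster. The sketch below is an analogy using a plain Scala collection view, not the Spark API: building the view plays the role of a transformation, and `sum` plays the role of an action. The counter `evaluated` and the other names are assumptions introduced only for the illustration.

```scala
// Counts how many times the mapping function has actually run.
var evaluated = 0

val numbers = List(1, 2, 3, 4)

// "Transformation": builds a lazy view; the function has not run yet.
val doubled = numbers.view.map { n => evaluated += 1; n * 2 }
val before = evaluated

// "Action": asking for a result forces the computation.
val total = doubled.sum
val after = evaluated
```

Here `before` is 0 because nothing was computed until `sum` was called. A view, like an uncached RDD, recomputes its elements each time a result is requested.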
1.map(): Pass each element of the RDD through the supplied
function.

val x = sc.parallelize(Array("b", "a", "c"))

val y = x.map(z => (z,1))

println(y.collect().mkString(", "))

Output: y: [('b', 1), ('a', 1), ('c', 1)]

2.filter(): Creates a new RDD containing only the elements for which
the supplied function returns true.

val x = sc.parallelize(Array(1,2,3))
val y = x.filter(n => n%2 == 1)

println(y.collect().mkString(", "))

Output: y: [1, 3]

3.flatMap() : Similar to map, but each input item can be mapped to 0 or
more output items.

val x = sc.parallelize(Array(1,2,3))
val y = x.flatMap(n => Array(n, n*100, 42))

println(y.collect().mkString(", "))

Output: y: [1, 100, 42, 2, 200, 42, 3, 300, 42]

4.groupBy() : Groups the elements of the RDD by the key that the supplied
function computes for each element, returning a dataset of (K, Iterable<T>) pairs.

val x = sc.parallelize( Array("John", "Fred", "Anna", "James"))

val y = x.groupBy(w => w.charAt(0))

println(y.collect().mkString(", "))

Output: y: [('A',['Anna']),('J',['John','James']),('F',['Fred'])]

5.groupByKey() : When called on a dataset of (K, V) pairs, returns a
dataset of (K, Iterable<V>) pairs.

val x = sc.parallelize( Array(('B',5),('B',4),('A',3),('A',2),('A',1)))

val y = x.groupByKey()

println(y.collect().mkString(", "))

Output: y: [('A', [3, 2, 1]),('B',[5, 4])]

6.sample() : Return a random sample subset RDD of the input RDD.

val x= sc.parallelize(Array(1, 2, 3, 4, 5))

val y= x.sample(false, 0.4)

// omitting seed will yield different output

println(y.collect().mkString(", "))

Output: y: [1, 3]

7.union() : Return the union of two RDDs (duplicates are kept).

val x= sc.parallelize(Array(1,2,3), 2)
val y= sc.parallelize(Array(3,4), 1)
val z= x.union(y)
val zOut= z.glom().collect()

Output z: [[1], [2, 3], [3, 4]]

8.join() : Joins two datasets of key-value pairs by key, like an inner
join in a relational database. It returns (K, (V, W)) pairs.

val x= sc.parallelize(Array(("a", 1), ("b", 2)))

val y= sc.parallelize(Array(("a", 3), ("a", 4), ("b", 5)))
val z= x.join(y)

println(z.collect().mkString(", "))

Output z: [('a', (1, 3)), ('a', (1, 4)), ('b', (2, 5))]
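
The inner-join semantics can be checked on ordinary Scala lists. This is a minimal sketch of the pairing rule only, using a for-comprehension instead of Spark's `join`; the name `joined` is an assumption for the example.

```scala
val x = List(("a", 1), ("b", 2))
val y = List(("a", 3), ("a", 4), ("b", 5))

// Pair every value of x with every value of y that has the same key.
val joined: List[(String, (Int, Int))] =
  for {
    (kx, vx) <- x
    (ky, vy) <- y
    if kx == ky
  } yield (kx, (vx, vy))
```

Key "a" appears once in `x` and twice in `y`, so it produces two output pairs. A key present on only one side would produce none.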

9.distinct() : Return a new RDD with distinct elements within a
source RDD.

val x= sc.parallelize(Array(1,2,3,3,4))
val y= x.distinct()

println(y.collect().mkString(", "))

Output: y: [1, 2, 3, 4]

10) keyBy() : Constructs two-component tuples (key-value pairs) by
applying a function on each data item.

val x= sc.parallelize(Array("John", "Fred", "Anna", "James"))

val y= x.keyBy(w => w.charAt(0))

println(y.collect().mkString(", "))

Output: y: [('J','John'),('F','Fred'),('A','Anna'),('J','James')]

11) partitionBy() : Repartitions as key-value RDD using its keys. The
partitioner implementation can be supplied as the first argument.

import org.apache.spark.Partitioner

val x = sc.parallelize(Array(('J',"James"),('F',"Fred"),('A',"Anna"),('J',"John")), 3)

// Custom partitioner: keys before 'H' go to partition 0, all others to partition 1
val y = x.partitionBy(new Partitioner() {
  val numPartitions = 2
  def getPartition(k: Any) = {
    if (k.asInstanceOf[Char] < 'H') 0 else 1
  }
})

val yOut = y.glom().collect()

Output:y: Array(Array((F,Fred), (A,Anna)), Array((J,John), (J,James)))
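
The placement rule of the custom partitioner can be tried out on its own. The sketch below applies the same "keys before 'H' go to partition 0" rule to a plain Scala list; `getPartition` here is a standalone function written for the illustration, not Spark's `Partitioner` class.

```scala
val records = List(('J', "James"), ('F', "Fred"), ('A', "Anna"), ('J', "John"))

// Same rule as the partitioner above: keys before 'H' go to 0, the rest to 1.
def getPartition(key: Char): Int = if (key < 'H') 0 else 1

// Group the records by the partition index they would be sent to.
val byPartition: Map[Int, List[(Char, String)]] =
  records.groupBy { case (key, _) => getPartition(key) }
```

The order of records inside a real Spark partition may differ from this local grouping.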

12) zip() : Joins two RDDs by pairing the i-th element of one RDD
with the i-th element of the other.

val x= sc.parallelize(Array(1,2,3))
val y= x.map(n => n*n)
val z= x.zip(y)

println(z.collect().mkString(", "))

Output: z: [(1, 1), (2, 4), (3, 9)]

✓ Task 2: Practice on Spark Actions i.e. getNumPartitions(), collect(),
reduce(), aggregate(), max(), sum(), mean(), stdev(), countByKey().

✓ Actions return the final result of RDD computations / operations.

✓ Unlike transformations, which produce RDDs, an action produces a
value back to the Spark driver program. It may trigger a previously
constructed, lazy RDD to be evaluated.

✓ Action functions materialize a value in a Spark program. So basically,
any RDD operation that returns a value of any type other than
RDD[T] is an action.

1. collect()
2. reduce()
3. aggregate ()
4. mean()
5. sum()
6. max()
7. stdev()
8. countByKey()
9. getNumPartitions()
1) collect() : collect returns the elements of the dataset as an array
back to the driver program.

val x= sc.parallelize(Array(1,2,3), 2)

val y= x.collect()

Output: y: [1, 2, 3]

2) reduce() : Aggregate the elements of a dataset using the supplied
function, which should be commutative and associative.

val x= sc.parallelize(Array(1,2,3,4))

val y= x.reduce((a,b) => a+b)

Output: y: 10

3) aggregate(): The aggregate function allows the user to apply two different
reduce functions to the RDD: one within each partition and one to combine the
per-partition results. Both start from the supplied zero value.

val inputrdd = sc.parallelize(List(("maths",21),("english",22),("science",31)),3)

val y = inputrdd.aggregate(3)((acc, value) => acc + value._2, (acc1, acc2) => acc1 + acc2)

Partition 1 : Sum(all Elements) + 3 (Zero value)
Partition 2 : Sum(all Elements) + 3 (Zero value)
Partition 3 : Sum(all Elements) + 3 (Zero value)

✓ Result = Partition1 + Partition2 + Partition3 + 3(Zero value)

So we get 21 + 22 + 31 + (4 * 3) = 86

✓ Output: y: Int = 86
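
The arithmetic above, where the zero value is added once per partition and once more when combining, can be reproduced with plain Scala folds. This is a sketch of the aggregation rule rather than Spark's implementation; the one-element-per-partition layout and the names `seqOp` and `combOp` are assumptions for the example.

```scala
// Assumed layout: three partitions holding one element each.
val partitions = List(
  List(("maths", 21)),
  List(("english", 22)),
  List(("science", 31)))

val zero = 3

// Applied within each partition: add the mark to the running total.
val seqOp = (acc: Int, value: (String, Int)) => acc + value._2
// Applied across partitions: add the partial totals.
val combOp = (a: Int, b: Int) => a + b

// Each partition starts from the zero value.
val perPartition = partitions.map(p => p.foldLeft(zero)(seqOp))
// The final combine also starts from the zero value.
val result = perPartition.foldLeft(zero)(combOp)
```

Because the zero value is applied four times, a non-neutral zero such as 3 changes the result. A zero of 0 would give the plain sum of 74.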

4) max() : Returns the largest element in the RDD.

val x= sc.parallelize(Array(2,4,1))

val y= x.max()

Output: y: 4

5) count() : Number of elements in the RDD.

val x= sc.parallelize(Array("apple", "beatty", "beatrice"))

val y=x.count()

Output: y: 3

6) sum() : Sum of the RDD.

val x= sc.parallelize(Array(2,4,1))

val y= x.sum()

Output: y: 7

7) mean() : Mean of given RDD.

val x= sc.parallelize(Array(2,4,1))

val y= x.mean()

Output: y: 2.3333333

8) stdev() : Returns the (population) standard deviation of the elements of the RDD.

val x= sc.parallelize(Array(2,4,1))

val y= x.stdev()

Output: y: 1.2472191
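
The mean and standard deviation values shown for the array (2, 4, 1) can be checked by hand. The sketch below computes the population standard deviation (dividing by n, not n - 1) with plain Scala; the variable names are assumptions for the illustration.

```scala
val values = List(2.0, 4.0, 1.0)

// Mean: sum of the values divided by their count.
val mean = values.sum / values.size

// Population variance: average squared distance from the mean.
val variance = values.map(v => math.pow(v - mean, 2)).sum / values.size

// Standard deviation: square root of the variance.
val stdev = math.sqrt(variance)
```

Dividing by n - 1 instead would give the sample standard deviation, which is a different value.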

9) countByKey() : This is only available on RDDs of (K, V) pairs and
returns a hashmap of (K, count of K).

val x= sc.parallelize(Array(('J',"James"),('F',"Fred"), ('A',"Anna"),('J',"John")))

val y= x.countByKey()

Output: y: {'A': 1, 'J': 2, 'F': 1}

10) getNumPartitions()

val x= sc.parallelize(Array(1,2,3), 2)

val y= x.partitions.size

Output: y: 2

1. Spark was initially developed at which university?
Ans) UC Berkeley
2. What are the characteristics of Big Data?
Ans) Volume, Velocity and Variety
3. The main focus of Hadoop ecosystem is on
Ans ) Batch Processing
4. Streaming data tools available in Hadoop ecosystem
Ans ) Apache Spark and Storm
5. In which languages does Spark provide APIs?
Ans ) Java, Scala, R and Python
6. Which kind of data can be processed by spark?
Ans) Stored Data and Streaming Data
7. Spark can store its data in?
Ans) HDFS, MongoDB and Cassandra
8. How spark engine runs?
Ans) Integrating with Hadoop and Standalone
9. In spark data is represented as?
Ans ) RDDs
10. Which kind of data can be handled by Spark ?
Ans) Structured, Unstructured and Semi-Structured
11.Which among the following are the challenges in MapReduce?
Ans) Every Problem has to be broken into Map and
Reduce phase
Collection of Key / Value pairs
High Throughput

12. Apache spark is a framework with?
Ans) Scheduling, Monitoring and Distributing Applications
13. Which of the features of Apache spark
Ans) DAG, RDDs and In- Memory processing
14) How much faster is the processing in Spark when compared to
Hadoop MapReduce?
ANS) 10-100X
15) In spark data is represented as?
Ans) RDDs
16) List of Transformations
Ans) map(), filter(), flatMap(), groupBy(), groupByKey(),
sample(), union(), join(), distinct(), keyBy(), partitionBy() and
zip().
17) List of Actions
Ans) getNumPartitions(), collect(), reduce(), aggregate(),
max(), sum(), mean(), stdev(), countByKey().

18. Spark is developed in
Ans) Scala
19. Which type of processing Apache Spark can handle
Ans) Batch Processing, Interactive Processing, Stream
Processing and Graph Processing
20. List two statements of Spark
Ans) Spark can run on the top of Hadoop
Spark can process data stored in HDFS
Spark can use Yarn as resource management layer

21. Spark's core is a batch engine? True OR False

Ans) True
22) Spark is 100x faster than MapReduce due to
Ans) In-Memory Computing
23) MapReduce program can be developed in spark? T / F
Ans) True
24. Programming paradigm used in Spark
Ans) Generalized
25. Spark Core Abstraction
Ans) RDD
26. Choose correct statement about RDD
Ans) RDD is a distributed data structure
27. RDD is
Ans) Immutable, Recomputable and Fault-tolerant
28. RDD operations
Ans) Transformation, Action and Caching
29. We can edit the data of RDD like conversion to uppercase? T/F
Ans) False
30. Identify correct transformation
Ans) Map, Filter and Join

31. Identify Correct Action
Ans) Reduce
32. Choose correct statement
Ans) Execution starts with the call of Action
33. Choose correct statement about Spark Context
Ans) Interact with cluster manager and Specify spark how to
access cluster
34. Spark cache the data automatically in the memory as and when
needed? T/F
Ans) False
35. For resource management spark can use
Ans) Yarn, Mesos and Standalone cluster manager
36. RDD can not be created from data stored on
Ans) Oracle
37. RDD can be created from data stored on
Ans) Local FS, S3 and HDFS
38. Who is father of Big data Analytics
✓ Doug Cutting

39. What are major Characteristics of Big Data

✓ Volume, Velocity and Variety(3 V’s)
40. What is Apache Hadoop
✓ Open-source Software Framework

41. Who developed Hadoop

✓ Doug Cutting

42. Hadoop supports which programming framework

✓ Java

43. What is the heart of Hadoop
✓ MapReduce

44. What is MapReduce

✓ Programming Model for Processing Large Data Sets.

45. What are the Big Data Dimensions

✓ 4 V’s

46. What is the caption of Volume

✓ Data at Scale

47. What is the caption of Velocity

✓ Data in Motion
48. What is the caption of Variety
✓ Data in many forms

49. What is the caption of Veracity

✓ Data Uncertainty

50. What is the biggest Data source for Big Data

✓ Transactions

51. What is the biggest Analytic capability for Big Data

✓ Query and Reporting

52. What is the biggest Infrastructure for Big Data

✓ Information integration
53. What are the Big Data Adoption Stages
✓ Educate, Explore, Engage and Execute

54. What is Mahout

✓ Algorithm library for scalable machine learning on Hadoop
55. What is Pig
✓ Creating MapReduce programs used with Hadoop.
56. What is HBase
✓ Non-Relational Database

57. What is the biggest Research Challenge for Big Data

✓ Heterogeneity , Incompleteness and Security

58. What is Sqoop
✓ Transferring bulk data between Hadoop and structured data stores.

59. What is Oozie

✓ Workflow scheduler system to manage Hadoop jobs.

60. What is Hue

✓ Web interface that supports Apache Hadoop and its ecosystem

61. What is Avro

✓ Avro is a data serialization system.

62. What is Giraph

✓ Iterative graph processing system built for high scalability.
63. What is Cassandra
✓ Cassandra does not support joins or sub queries, except
for batch analysis via Hadoop

64. What is Chukwa

✓ Chukwa is an open source data collection system for
monitoring large distributed systems

65. What is Hive

✓ Hive is a data warehouse on Hadoop

66. What is Apache drill

✓ Apache Drill is a distributed system for interactive
analysis of large-scale datasets.

67. What is HDFS

✓ Hadoop Distributed File System ( HDFS )
68. Facebook generates how much data per day
✓ 25TB

69. What is BIG DATA?

✓ Big Data is an assortment of data so huge and
complex that it becomes very tedious to capture, store,
process, retrieve and analyze it with the help of on-hand
database management tools or traditional data processing
applications.
70. What is the expansion of HUE

✓ Hadoop User Experience
71.Can you give some examples of Big Data?
✓ There are many real-life examples of Big Data! Facebook is
generating 500+ terabytes of data per day, NYSE (New York Stock
Exchange) generates about 1 terabyte of new trade data per day, a
jet airline collects 10 terabytes of sensor data for every 30 minutes
of flying time.

72. Can you give a detailed overview about the Big Data being
generated by Facebook?
✓ As of December 31, 2012, there are 1.06 billion monthly active
users on Facebook and 680 million mobile users. On an average,
3.2 billion likes and comments are posted every day on Facebook.
72% of web audience is on Facebook. And why not! There are so
many activities going on Facebook from wall posts, sharing images,
videos, writing comments and liking posts, etc.

73. What are the three characteristics of Big Data?
✓ The three characteristics of Big Data
are: Volume: Facebook generating 500+ terabytes of data
per day. Velocity: Analyzing 2 million records each day
to identify the reason for losses. Variety: images, audio,
video, sensor data, log files, etc.

74. How Big is ‘Big Data’?

✓ With time, data volume is growing exponentially. Earlier
we used to talk about Megabytes or Gigabytes. But time
has arrived when we talk about data volume in terms of
terabytes, petabytes and also zettabytes! Global data
volume was around 1.8ZB in 2011 and is expected to be
7.9ZB in 2015.
75. How analysis of Big Data is useful for organizations?
✓ Effective analysis of Big Data provides a lot of business
advantage as organizations will learn which areas to focus
on and which areas are less important.

76. Who are ‘Data Scientists’?

✓ Data scientists are experts who find solutions to analyze
data. Just as we have web analysts, we have data scientists
who have good business insight into how to handle a business.

77. What is Hadoop?
✓ Hadoop is a framework that allows for distributed
processing of large data sets across clusters of commodity
computers using a simple programming model.

78. Why the name ‘Hadoop’?

✓ Hadoop is not an acronym (unlike, say, ‘OOPS’).
The name, like the yellow elephant logo, comes from a toy
elephant that belonged to Doug Cutting’s son.

79. Why do we need Hadoop?

✓ Every day a large amount of unstructured data is getting
dumped into our machines.
80. What are some of the characteristics of the Hadoop framework?
✓ The Hadoop framework is written in Java. It is designed to solve
problems that involve analyzing large data (e.g. petabytes). The
programming model is based on Google’s MapReduce. The
infrastructure is based on Google’s Big Data and Distributed File
System.

81. Give a brief overview of Hadoop history.

✓ In 2002, Doug Cutting created an open source web crawler project.
✓ In 2003 and 2004, Google published the GFS and MapReduce papers.
✓ In 2006, Doug Cutting developed the open source MapReduce and
HDFS project.
✓ In 2008, Yahoo ran a 4,000-node Hadoop cluster and Hadoop won the
terabyte sort benchmark.
✓ In 2009, Facebook launched SQL support for Hadoop.

82. Give examples of some companies that are using Hadoop
✓ A lot of companies are using Hadoop, such as
Cloudera, EMC, MapR, Hortonworks, Amazon, Facebook, eBay,
Twitter, Google and so on.

83. What is the basic difference between traditional RDBMS and
Hadoop?

✓ RDBMS is used for transactional systems to report and archive the
data.
✓ Hadoop is an approach to store huge amounts of data in a
distributed file system and process it.
✓ RDBMS is useful when you want to seek one record from Big
Data, whereas Hadoop is useful when you want Big Data in one
shot and perform analysis on it later.

84. What is structured and unstructured data?
✓ Structured data is the data that is easily identifiable as it is
organized in a structure. The most common form of
structured data is a database where specific information is
stored in tables, that is, rows and columns.
✓ Unstructured data refers to any data that cannot be
identified easily. It could be in the form of images, videos,
documents, email, logs and random text.

85. What are the core components of Hadoop?

✓ Core components of Hadoop are HDFS and MapReduce.
HDFS is basically used to store large data sets and
MapReduce is used to process such large data sets.

86. What is HDFS?
✓HDFS is a file system designed for storing very
large files with streaming data access patterns,
running on clusters of commodity hardware.

87. What are the key features of HDFS?

✓HDFS is highly fault-tolerant, with high
throughput, suitable for applications with large
data sets, streaming access to file system data and
can be built out of commodity hardware.

88. What is Fault Tolerance?
✓ Suppose you have a file stored in a system, and due to
some technical problem that file gets destroyed. Then
there is no chance of getting back the data that was present
in that file. HDFS avoids this by keeping replicated copies
of the data on other nodes.

89. Replication causes data redundancy, so why is it
pursued in HDFS?
✓ HDFS works with commodity hardware (systems with
average configurations) that has a high chance of
crashing at any time. Thus, to make the entire system highly
fault-tolerant, HDFS replicates and stores data in different
locations.
90. Since the data is replicated thrice in HDFS, does it
mean that any calculation done on one node will also
be replicated on the other two?
✓ No. Although the data is stored on 3 nodes, when we send the
MapReduce programs, the calculation is done on only one
copy of the data. The master node knows exactly which
nodes hold that particular data.

91. What is throughput? How does HDFS get a good
throughput?

✓ Throughput is the amount of work done per unit time. It
describes how fast data can be accessed from the
system and is usually used to measure the performance of
the system.
92. What is streaming access?
✓ As HDFS works on the principle of ‘Write Once, Read
Many’, the feature of streaming access is
extremely important in HDFS. HDFS focuses not so
much on storing the data as on retrieving it at the
fastest possible speed, especially while analyzing logs.

93. What is commodity hardware? Does commodity
hardware include RAM?
✓ Commodity hardware is an inexpensive system which is
not of high quality or high availability. Hadoop can be
installed on any average commodity hardware.

94. What is a Name node?
✓ Name node is the master node on which job tracker runs
and consists of the metadata. It maintains and manages
the blocks which are present on the data nodes.

95. Is Name node also a commodity?

✓ No. The Name node can never be commodity hardware
because the entire HDFS relies on it. It is the single
point of failure in HDFS, so the Name node has to be a
high-availability machine.

96. What is metadata?

✓ Metadata is the information about the data stored in data
nodes, such as the location of the file, the size of the file
and so on.
97. What is a Data node?
✓Data nodes are the slaves which are deployed on
each machine and provide the actual storage.
These are responsible for serving read and write
requests for the clients.

98. Why do we use HDFS for applications having
large data sets and not when there are a lot of
small files?
✓HDFS is more suitable for a large amount of data
in a single file than for a small amount of data
spread across multiple files, because the Name node
keeps the metadata for every file in memory.
99. What is a daemon?
✓Daemon is a process or service that runs in the
background. The term is generally used in UNIX.

100. What is a job tracker?

✓Job tracker is a daemon that runs on a name node
for submitting and tracking MapReduce jobs in
Hadoop. It assigns the tasks to the different task
trackers.

101. What is a task tracker?
✓ Task tracker is also a daemon that runs on data nodes.
Task Trackers manage the execution of individual tasks
on slave node.

102. Is the Name node machine the same as the data node
machine in terms of hardware?
✓ It depends upon the cluster you are trying to create. The
Hadoop VM can be there on the same machine or on
another machine.

103. What is a heartbeat in HDFS?
✓ A heartbeat is a signal indicating that a node is alive. A data
node sends its heartbeat to the Name node, and a task tracker
sends its heartbeat to the job tracker.

104. Are Name node and job tracker on the same host?
✓ No, in practical environment, Name node is on a separate
host and job tracker is on a separate host.

105. What is a ‘block’ in HDFS?

✓ A ‘block’ is the minimum amount of data that can be read
or written. In HDFS, the default block size is 64 MB (128 MB
from Hadoop 2.x onward), in contrast to the block size of
8192 bytes in Unix/Linux.
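A minimal sketch of the block arithmetic, assuming the 64 MB default block size quoted above. The helper names `hdfs_block_count` and `last_block_size` are illustrative only and are not part of any Hadoop API.

```python
MB = 1024 * 1024

def hdfs_block_count(file_size_bytes, block_size_bytes=64 * MB):
    """Number of blocks needed to hold a file (ceiling division)."""
    return (file_size_bytes + block_size_bytes - 1) // block_size_bytes

def last_block_size(file_size_bytes, block_size_bytes=64 * MB):
    """Size in bytes of the final, possibly partial, block."""
    remainder = file_size_bytes % block_size_bytes
    if remainder:
        return remainder
    # Exact multiple of the block size (or an empty file).
    return min(file_size_bytes, block_size_bytes)

# A 200 MB file is split into three full 64 MB blocks plus one 8 MB block.
blocks = hdfs_block_count(200 * MB)
tail = last_block_size(200 * MB)
print(blocks, tail // MB)
```

The last block only occupies as much space as the data it actually holds, which is why the tail is computed separately.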
106. What are the benefits of block transfer?
✓ A file can be larger than any single disk in the network.
Blocks provide fault tolerance and availability.

107. If we want to copy 10 blocks from one machine to
another, but the other machine can copy only 8.5
blocks, can the blocks be broken up during the copy?
✓ In HDFS, blocks cannot be broken down. Before copying
the blocks from one machine to another, the master node
will figure out the actual amount of space required,
how many blocks are being used and how much
space is available, and it will allocate the blocks
accordingly.
108. How is indexing done in HDFS?
✓ Hadoop has its own way of indexing. Depending upon the
block size, once the data is stored, HDFS keeps
storing the last part of the data, which says where the
next part of the data will be.

109. How is a full data node identified?

✓ When data is stored in a data node, the metadata of
that data is stored in the Name node. So the Name node
will identify if the data node is full.

110. If data nodes increase, do we need to upgrade the
Name node?
✓ While installing the Hadoop system, the Name node is
sized based on the size of the cluster.
111. Are job tracker and task trackers present in separate
machines?
✓ Yes, job tracker and task tracker are present in different machines.
The reason is that the job tracker is a single point of failure for the
Hadoop MapReduce service.

112. When we send data to a node, do we allow settling time
before sending more data to that node?
✓ Yes, we do.

113. Does Hadoop always require digital data to process?

✓ Yes. Hadoop always requires digital data to be processed.

114. On what basis will the Name node decide which data node to
write on?
✓ As the Name node has the metadata (information) related to all the
data nodes, it knows which data node is free.
115. Doesn’t Google have its very own version of DFS?
✓ Yes, Google owns a DFS known as “Google File System
(GFS)” developed by Google Inc. for its own use.

116. Who is a ‘user’ in HDFS?

✓ A user is like you or me, who has some query or who needs some
kind of data.

117. Is client the end user in HDFS?

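✓ No, the client is an application which runs on your machine and is
used to interact with the Name node (job tracker) or data node (task
tracker).

118. What is the communication channel between client and name
node / data node?
✓ Clients talk to the Name node and the data nodes through Hadoop's
RPC and data-transfer protocols over TCP/IP. SSH is used only by the
cluster start and stop scripts, not for client communication.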
119. What is a rack?
✓ Rack is a storage area with all the data nodes put together.
Rack is a physical collection of data nodes which are
stored at a single location.

120. On what basis data will be stored on a rack?

✓ When the client is ready to load a file into the cluster, the
content of the file will be divided into blocks.

121. Do we need to place the 2nd and 3rd copies of the data in rack 2?

✓ Yes, this is to guard against data node failure.

122. What if rack 2 and the data node in rack 1 fail?
✓ If both rack 2 and the data node present in rack 1 fail, then
there is no chance of getting the data from them.

123. What is a Secondary Namenode? Is it a substitute for
the Namenode?
✓ The secondary Namenode constantly reads the data from
the RAM of the Namenode and writes it into the hard disk
or the file system. It is not a substitute for the Namenode.

124. What is the difference between Gen1 and Gen2
Hadoop with regards to the Namenode?
✓ In Gen 1 Hadoop, the Namenode is the single point of failure.
In Gen 2 Hadoop, we have an Active and Passive Namenode
structure.
125. What is MapReduce?
✓ Map Reduce is the ‘heart‘ of Hadoop that consists of two
parts – ‘map’ and ‘reduce’. Maps and reduces are
programs for processing data.

126. Can you explain how do ‘map’ and ‘reduce’ work?

✓ The Name node takes the input, divides it into parts and
assigns them to data nodes.

127. What is ‘Key value pair’ in HDFS?

✓ Key value pair is the intermediate data generated by maps
and sent to reduces for generating the final output.
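The flow of key-value pairs from maps to reduces can be sketched in plain Python. This is only a single-process illustration of the idea, not Hadoop's API; the function names `map_phase`, `shuffle` and `reduce_phase` and the two sample splits are assumptions made for the example.

```python
from collections import defaultdict

def map_phase(split):
    """Map: emit one (word, 1) key-value pair per word in an input split."""
    return [(word.lower(), 1) for word in split.split()]

def shuffle(pairs):
    """Group the intermediate values by key before they reach the reduces."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: combine all values of a key into the final output."""
    return {key: sum(values) for key, values in grouped.items()}

# Two hypothetical input splits, so two map calls are made.
splits = ["big data big", "data hadoop"]
intermediate = [pair for split in splits for pair in map_phase(split)]
word_counts = reduce_phase(shuffle(intermediate))
print(word_counts)
```

One map call is made per input split, which matches the rule that the number of maps equals the number of splits.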

128. What is the difference between the MapReduce engine and the HDFS
cluster?
✓ HDFS cluster is the name given to the whole configuration of
master and slaves where data is stored. The MapReduce engine is the
programming module which is used to retrieve and analyze data.

129. Do we require two servers for the Name node and the data
nodes?
✓ Yes, we need two different servers for the Name node and the data
nodes.

130. Why is the number of splits equal to the number of maps?
✓ The number of maps is equal to the number of input splits because
we want the key and value pairs of all the input splits.
131. Which are the two types of ‘writes’ in HDFS?
✓ There are two types of writes in HDFS: posted and non-
posted. A posted write is when we write and forget
about it, without worrying about the acknowledgement.

132. Why is ‘Reading’ done in parallel and ‘Writing’
not in HDFS?
✓ Reading is done in parallel because by doing so we can
access the data fast. But we do not perform
the write operation in parallel, because doing so
might result in data inconsistency.

133. Is a job split into maps?
✓ No, a job is not split into maps. A split is created for the
file. The file is placed on data nodes in blocks. For
each split, a map is needed.

134. Can Hadoop be compared to a NoSQL database?

✓ Though NoSQL is the closest technology that can be
compared to Hadoop, it has its own pros and cons. There
is no DFS in NoSQL. Hadoop is not a database.

135. How can I install Cloudera VM in my system?

✓ When you enroll for the Hadoop course at Edureka, you
can download the Hadoop Installation steps.pdf file from
our dropbox.
Relational DB’s vs. Big Data (Spark)

Relational DB’s:
1. It deals with Gigabytes to Terabytes
2. It is centralized
3. It deals with structured data
4. It has a stable Data Model
5. It deals with known complex interrelationships
6. Tools are Relational DB’s: SQL, MySQL, DB2
7. Access is Interactive and batch
8. Updates are Read and write many times
9. Integrity is high
10. Scaling is Nonlinear

Big Data (Spark):
1. It deals with Petabytes to Zettabytes
2. It is distributed
3. It deals with semi-structured and unstructured data
4. It has an unstable Data Model
5. It deals with flat schemas and few interrelationships
6. Tools are Hadoop, R, Mahout
7. Access is Batch
8. Updates are Write once, read many times
9. Integrity is low
10. Scaling is Linear
Aspect | Hadoop | Spark
Performance | Processes data on disk | Processes data in memory
Ease of use | Need to be proficient in Java MapReduce | Java, Scala, R, Python; more expressive
Data processing | Needs other platforms for streaming, graphs | MLlib, Streaming, Graphs
Failure tolerance | Continues from the point it left off | Starts the processing from the beginning
Cost | Hard disk space cost | Memory space cost
Run | Everywhere | Runs on Hadoop
Memory | HDFS uses MapReduce to process and analyse data; MapReduce takes a backup of all the data in a physical server after each operation, which is done because of the data stored in RAM | This is called in-memory operation
Speed | Hadoop is slower than Spark | Spark is faster than Hadoop (100 times for batch files, 10x faster on ...)

Version | Hadoop 2.6.5 | Spark 2.3.1
Software | Open source software for reliable, scalable, distributed computing | Fast and general engine for large-scale data processing
Execution engine | DAG | It is a Big Data tool
Big Data frameworks | Hadoop | Advanced DAG, support for acyclic data flow and in-memory computing
Hardware cost | More | Less
Library | External machine learning library | Internal machine learning library
Recovery | Easier than Spark; checkpoints are present | Failure recovery is difficult but still good
File management system | Its own FMS (File Management System) | Does not come with its own FMS; it supports cloud-based data platforms; Spark was designed for Hadoop
Support | HDFS, Hadoop YARN, Apache | Support for RDD
Technologies | Cassandra, HBase, Hive, Tachyon, and any Hadoop source | Supports all systems in Hadoop; processed by batch system
Use places | Marketing analysis, computing analysis, cyber security | Online product
Run | Clusters, databases, server | Cloud-based systems and data sets
Reach me @
Srinivasulu Asadi: +91-9490246442
Business Intelligence

✓ What is Clustering: Clustering is unsupervised learning, i.e. there
are no predefined classes. A cluster is a group of similar objects
that differ significantly from other objects.
✓ The process of grouping a set of physical or abstract objects into
classes of similar objects is called clustering.
✓ Clustering is “the process of organizing objects into groups whose
members are similar in some way”.
✓ The cluster property is that intra-cluster distances are minimized
and inter-cluster distances are maximized.

✓ What is Good Clustering: A good clustering method will produce
high quality clusters with
✓ high intra-class similarity
✓ low inter-class similarity
1. Nominal Variables allow for only qualitative classification. A
nominal variable is a generalization of the binary variable in that
it can take more than 2 states, e.g., red, yellow, blue, green.
Ex: {male, female}, {yes, no}, {true, false}
2. Ordinal Data are categorical data where there is a logical
ordering to the categories.
Ex: 1=Strongly disagree; 2=Disagree; 3=Neutral; 4=Agree;
3. Categorical Data represent types of data which may be divided
into groups.
Ex: race, sex, age group, and educational level.
4. Labeled Data share the class labels or the generative
distribution of data.
5. Unlabeled Data do not share the class labels or the
generative distribution of the labeled data.
6. Numerical Values: The data values are purely numerical.
Ex: 1, 2, 3, 4, ...
7. Interval-valued variables: These are variables expressed as ranges.
✓ Ex: 10-20, 20-30, 30-40, ...

8. Binary Variables: These are variables that take the values 0 and 1
or combinations of them.
✓ Ex: 1, 0, 001, 010, ...

9. Ratio-Scaled Variables: A positive measurement on a nonlinear
scale, approximately at exponential scale, such as $Ae^{Bt}$ or $Ae^{-Bt}$.
✓ Ex: ½, 2/4, 4/8, ...

10. Variables of Mixed Types: A database may contain all six
types of variables: symmetric binary, asymmetric binary,
nominal, ordinal, interval and ratio.
✓ Ex: 11121A1201
Similarity Measure
✓ Distance: Distances are normally used to measure the
similarity or dissimilarity between two data objects.

✓ Euclidean distance: Euclidean distance is the straight-line
distance between two points in Euclidean space.
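As a small illustration of the similarity measure, the sketch below computes the Euclidean distance between two points given as equal-length tuples. The function name `euclidean_distance` is chosen for this example only.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two points of equal dimension."""
    if len(a) != len(b):
        raise ValueError("points must have the same number of dimensions")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# The classic 3-4-5 right triangle.
print(euclidean_distance((0, 0), (3, 4)))  # 5.0
```

A smaller distance means the two objects are more similar; a distance of 0 means they are identical on every attribute.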
Major Clustering Approaches

1. Partitioning Methods
2. Hierarchical Methods
3. Density-Based Methods
4. Grid-Based Methods
5. Model-Based Clustering Methods
6. Clustering High-Dimensional Data
7. Constraint-Based Cluster Analysis
8. Outlier Analysis
✓Examples of Clustering Applications
1. Marketing
2. Land use
3. Insurance
4. City-planning
5. Earth-quake studies

✓Issues of Clustering
1. Accuracy,
2. Training time,
3. Robustness,
4. Interpretability, and
5. Scalability
6. Find top ‘n’ outlier points
✓ Pattern Recognition
✓ Spatial Data Analysis
✓ GIS(Geographical Information System)
✓ Cluster Weblog data to discover groups
✓ Credit approval
✓ Target marketing
✓ Medical diagnosis
✓ Fraud detection
✓ Weather forecasting
✓ Stock Marketing
2. Classification Vs Clustering

Classification:
1. Classification is “the process of organizing objects into
groups whose members are not similar”.
2. It is Supervised Learning.
3. Predefined classes.
4. Have labels for some points.
5. Requires a “rule” that will accurately assign labels to
new points.
6. Classification approaches are of two types:
   1. Predictive Classification
   2. Descriptive Classification
7. Issues of Classification:
   1. Accuracy,
   2. Training time,
   3. Robustness,
   4. Interpretability, and
   5. Scalability
8. Examples:
   1. Marketing
   2. Land use
   3. Insurance
   4. City-planning
   5. Earth-quake studies
9. Techniques:
   1. Decision Tree
   2. Bayesian classification
   3. Rule-based classification
   4. Prediction, accuracy and error
10. Applications:
   1. Credit approval
   2. Target marketing
   3. Medical diagnosis
   4. Fraud detection
   5. Weather forecasting
   6. Stock Marketing

Clustering:
1. Clustering is “the process of organizing objects into
groups whose members are similar in some way”.
2. It is Unsupervised Learning.
3. No predefined classes.
4. No labels in Clustering.
5. Groups points into clusters based on how “near” they
are to one another.
6. Clustering approaches are eight:
   1. Partition Method
   2. Hierarchical Method
   3. Density-Based Methods
   4. Grid-Based Methods
   5. Model-Based Clustering Methods
   6. Clustering High-Dimensional Data
   7. Constraint-Based Cluster Analysis
   8. Outlier Analysis
7. Issues of Clustering:
   1. Accuracy,
   2. Training time,
   3. Robustness,
   4. Interpretability, and
   5. Scalability
   6. Find top ‘n’ outlier points
8. Examples:
   1. Marketing
   2. Land use
   3. Insurance
   4. City-planning
   5. Earth-quake studies
9. Techniques:
   1. K-Means Clustering
   2. DIANA (DIvisive ANAlysis)
   3. AGNES (AGglomerative NESting)
   4. BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies)
   5. DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
10. Applications:
   1. Pattern Recognition
   2. Spatial Data Analysis
   3. WWW (World Wide Web)
   4. Weblog data to discover groups
   5. Credit approval
   6. Target marketing
   7. Medical diagnosis
   8. Fraud detection
   9. Weather forecasting
   10. Stock Marketing

k-Means Clustering
✓ It is a Partitioning cluster technique.
✓ It is a Centroid-Based cluster technique
✓ Clustering is unsupervised learning, i.e. there are no
predefined classes. A cluster is a group of similar objects
that differ significantly from other objects.
✓ Euclidean distance between objects i and j, each described by p attributes:
$$d(i, j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2}$$
✓ It then creates the first k initial clusters (k= number of
clusters needed) from the dataset by choosing k rows of
data randomly from the dataset.
✓ The k-Means algorithm calculates the Arithmetic Mean
of each cluster formed in the dataset.
✓ Square-error criterion

$$E = \sum_{i=1}^{k} \sum_{p \in C_i} |p - m_i|^2$$

✓ Where
– E is the sum of the square error for all objects in the data set;
– p is the point in space representing a given object; and
– m_i is the mean of cluster C_i (both p and m_i are multidimensional).
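To make the criterion concrete, here is a minimal sketch that computes E for a given partition. It assumes each cluster is a non-empty list of numeric tuples; the function name `square_error` and the sample points are illustrative only.

```python
def square_error(clusters):
    """Sum over all clusters of squared distances from each point to its cluster mean."""
    total = 0.0
    for points in clusters:
        dims = len(points[0])
        # m_i: the mean of cluster C_i, computed per dimension.
        mean = tuple(sum(p[d] for p in points) / len(points) for d in range(dims))
        for p in points:
            total += sum((p[d] - mean[d]) ** 2 for d in range(dims))
    return total

# Cluster 1 has mean (2, 1), so each of its points contributes 1.
# Cluster 2 has a single point, which contributes 0.
clusters = [[(1, 1), (3, 1)], [(10, 10)]]
print(square_error(clusters))  # 2.0
```

A lower E means the clusters are more compact around their means.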
✓ Algorithm: The k-means algorithm for partitioning,
where each cluster’s center is represented by the mean
value of the objects in the cluster.
✓ Input:
– k: the number of clusters,
– D: a data set containing n objects.
✓ Output: A set of k clusters.
k-Means Clustering Method

[Figure panels, k=2: arbitrarily choose K objects as the initial cluster centers; assign each object to the most similar center; update the cluster means; reassign; update the cluster means; reassign.]
Fig: Clustering of a set of objects based on the k-means
method. (The mean of each cluster is marked by a “+”.)

✓k-Means algorithm is implemented in four steps:
1. Partition objects into k nonempty subsets.
2. Compute seed points as the centroids of the clusters
of the current partition (the centroid is the center, i.e.,
mean point, of the cluster).
3. Assign each object to the cluster with the nearest seed point.
4. Go back to Step 2; stop when there are no more new assignments.
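The loop can be sketched in plain Python as follows. This is a minimal illustration, not Spark MLlib's implementation: it starts by arbitrarily choosing k objects as the initial centers (as in the figure), and the names `k_means`, `mean_point`, `squared_distance`, the `seed` parameter and the sample data are all assumptions made for the example.

```python
import random

def mean_point(points):
    """Centroid (per-dimension mean) of a non-empty list of points."""
    dims = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dims))

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def k_means(data, k, seed=0, max_iterations=100):
    rng = random.Random(seed)
    centers = rng.sample(data, k)  # arbitrarily choose k objects as initial centers
    assignment = None
    for _ in range(max_iterations):
        # Assign each object to the cluster with the nearest center.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in data
        ]
        if new_assignment == assignment:
            break  # no more new assignments: stop
        assignment = new_assignment
        # Update each center to the mean of its current members.
        for c in range(k):
            members = [p for p, a in zip(data, assignment) if a == c]
            if members:
                centers[c] = mean_point(members)
    clusters = [[p for p, a in zip(data, assignment) if a == c] for c in range(k)]
    return centers, clusters

data = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centers, clusters = k_means(data, k=2)
print(clusters)
```

On this small, well-separated data set the two groups near (1, 1) and (8, 8) are recovered; on real data the result can depend on the initial centers, so several runs with different seeds are common.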

✓Issues of Clustering
1. Accuracy,
2. Training time,
3. Robustness,
4. Interpretability, and
5. Scalability
6. Find top ‘n’ outlier points

✓Examples of Clustering Applications

1. Marketing
2. Land use
3. Insurance
4. City-planning
5. Earth-quake studies
✓ Pattern Recognition
✓ Spatial Data Analysis
✓ GIS(Geographical Information System)
✓ Image Processing
✓ WWW (World Wide Web)
✓ Cluster Weblog data to discover groups
✓ Credit approval
✓ Target marketing
✓ Medical diagnosis
✓ Fraud detection
✓ Weather forecasting
✓ Stock Marketing
Classification and Prediction
✓ Classification and Prediction: Classification is a form of supervised
learning, i.e. the model is learned from examples whose input and
output values are known. Objects are divided into predefined groups
(classes) whose members do not necessarily share similar properties.
✓ Classification is a two step process
1. Build the Classifier / Model
2. Use Classifier for Classification.
1. Build the Classifier / Model: Describing a set of predetermined
classes.
✓ Each tuple / sample is assumed to belong to a predefined class,
as determined by the class label attribute.
✓ Also called the learning phase or training phase.
✓ The set of tuples used for model construction is the training set.
✓ The model is represented as classification rules, decision trees,
or mathematical formulae.


Training data:
Name   Rank             Years   Tenured
Mike   Assistant Prof   3       no
Mary   Assistant Prof   7       yes
Bill   Professor        2       yes
Jim    Associate Prof   7       yes
Dave   Assistant Prof   6       no
Anne   Associate Prof   3       no

Classifier (model) learned from the training data:
IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’

Fig: Model Construction
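The rule from the model-construction figure can be written as a small function and checked against the six training tuples. This is only a sketch: the function name `predict_tenured` is illustrative, and the rule is hand-coded here rather than learned by an algorithm.

```python
# Training tuples from the figure: (name, rank, years, tenured).
training_set = [
    ("Mike", "Assistant Prof", 3, "no"),
    ("Mary", "Assistant Prof", 7, "yes"),
    ("Bill", "Professor", 2, "yes"),
    ("Jim", "Associate Prof", 7, "yes"),
    ("Dave", "Assistant Prof", 6, "no"),
    ("Anne", "Associate Prof", 3, "no"),
]

def predict_tenured(rank, years):
    """IF rank = 'professor' OR years > 6 THEN tenured = 'yes'."""
    return "yes" if rank == "Professor" or years > 6 else "no"

# Names of training tuples that the rule gets wrong.
mismatches = [
    name for name, rank, years, tenured in training_set
    if predict_tenured(rank, years) != tenured
]
print(mismatches)  # [] means the rule fits every training tuple
```

Fitting the training set perfectly says nothing yet about unseen data, which is why the next step evaluates the model on an independent test set.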

2. Using Classifier: for classifying future or unknown
objects. First estimate the accuracy of the model.

✓ The known label of each test sample is compared with the
classified result from the model.

✓ Accuracy rate is the percentage of test set samples that are
correctly classified by the model.

✓ The test set must be independent of the training set; otherwise
the accuracy estimate is over-optimistic because of over-fitting.

✓ If the accuracy is acceptable, use the model to classify
data tuples whose class labels are not known.

Testing data:
Name      Rank             Years   Tenured
Tom       Assistant Prof   2       no
Merlisa   Associate Prof   7       no
George    Professor        5       yes
Joseph    Assistant Prof   7       yes

Unseen data: (Jeff, Professor, 4) → Tenured?

Fig: Using the Model in Prediction
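A short sketch of the accuracy estimate and of classifying the unseen tuple, using the four test tuples from the figure and the rule "IF rank = 'professor' OR years > 6 THEN tenured = 'yes'". The names `predict_tenured`, `test_set` and `accuracy` are illustrative.

```python
def predict_tenured(rank, years):
    """IF rank = 'professor' OR years > 6 THEN tenured = 'yes'."""
    return "yes" if rank == "Professor" or years > 6 else "no"

# Test tuples from the figure: (name, rank, years, known label).
test_set = [
    ("Tom", "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George", "Professor", 5, "yes"),
    ("Joseph", "Assistant Prof", 7, "yes"),
]

# Compare the known label of each test sample with the model's result.
correct = sum(
    1 for _, rank, years, label in test_set
    if predict_tenured(rank, years) == label
)
accuracy = correct / len(test_set)
print(accuracy)  # 0.75: Merlisa (7 years, not tenured) is misclassified

# Classify the unseen tuple (Jeff, Professor, 4).
jeff_prediction = predict_tenured("Professor", 4)
print(jeff_prediction)  # yes
```

Whether 75% is acceptable depends on the application; if it is, the model is then used on tuples such as Jeff's whose labels are not known.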


1. Decision Tree

2. Bayesian classification

3. Rule-based classification

4. Prediction

5. Classifier Accuracy and Prediction error measures

Fig: Example for Model Construction and Usage of Model
2 - Decision Tree
✓ Decision tree is a flowchart-like tree structure, where
each internal node (nonleaf node)denotes a test on an
attribute, each branch represents an outcome of the test,
and each leaf node (or terminal node) holds a class label.

✓ Decision Tree Induction: Ross Quinlan developed the
decision tree algorithm known as ID3 (Iterative Dichotomiser).

✓ Decision tree is a classifier in the form of a tree structure

✓ Decision node: specifies a test on a single attribute
✓ Leaf node: indicates the value of the target attribute
✓ Arc/edge: split of one attribute
✓ Path: a conjunction of tests leading to the final decision
Ex: Data set in the AllElectronics customer database
Fig: A decision tree for the concept buys_computer, indicating
whether a customer at AllElectronics is likely to purchase a
computer. Each internal (nonleaf) node represents a test on an
attribute. Each leaf node represents a class (either buys_computer
= yes or buys_computer = no).
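A decision tree of this kind can be represented as nested dictionaries and used by following one path from the root to a leaf. The tree shape below (age at the root, then student or credit_rating) is an assumption based on the common textbook version of this example, since the figure itself is not reproduced here; the function name `classify` is illustrative.

```python
# Assumed tree: internal nodes test an attribute, leaves hold the class label.
tree = {
    "attribute": "age",
    "branches": {
        "youth": {
            "attribute": "student",
            "branches": {"no": "no", "yes": "yes"},
        },
        "middle_aged": "yes",
        "senior": {
            "attribute": "credit_rating",
            "branches": {"excellent": "no", "fair": "yes"},
        },
    },
}

def classify(node, record):
    """Follow the branch matching each tested attribute until a leaf is reached."""
    while isinstance(node, dict):
        node = node["branches"][record[node["attribute"]]]
    return node

customer = {"age": "youth", "student": "yes", "credit_rating": "fair"}
print(classify(tree, customer))  # yes
```

Each root-to-leaf path corresponds to one rule, for example "IF age = youth AND student = yes THEN buys_computer = yes".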
Ex: For age attribute

✓Issues of Classification and Prediction
1. Accuracy,
2. Training time,
3. Robustness,
4. Interpretability, and
5. Scalability
✓Typical applications
1. Credit approval
2. Target marketing
3. Medical diagnosis
4. Fraud detection
5. Weather forecasting
6. Stock Marketing

