Data Analysis and Modelling


ATAL Online FDP on Data Science

Data Analysis and Modelling

Dr. Bhuvaneswari Amma N.G.


Assistant Professor
School of Computing
Indian Institute of Information Technology Una
Himachal Pradesh
Overview
• Introduction
• Why Data Science?
• Data Science Applications
• Data Analysis
– Data Collection
– Data Cleaning
– Data Integration
– Data Reduction
– Data Transformation
• Data Modelling
– Classification
– Clustering
• Research Problems
• Future Research Directions
• Summary
Introduction
Data
• Lots of data is being collected and stored
– Web data, e-commerce
– Financial transactions, bank/credit transactions
– Online trading and purchasing
– Social Network



Introduction … Contd.
Big Data
• Data that is expensive to manage and hard to extract value from
– Volume
• The size of the data
– Velocity
• The latency of data processing relative to the growing demand
for interactivity
– Variety and Complexity
• The diversity of sources, formats, quality, and structures





Why Data Science?
Data Science
• Science which
– uses computer science, statistics and machine learning, visualization, and
human-computer interactions
– to collect, clean, integrate, analyze, visualize, and interact with data to create
data products

Goal
Turn data into data products





Data Science Applications
• Transaction Databases → Recommender systems (NetFlix), Fraud Detection
(Security and Privacy)
• Wireless Sensor Data → Smart Home, Real-time Monitoring, Internet of Things
• Text Data, Social Media Data → Product Review and Consumer Satisfaction
(Facebook, Twitter, LinkedIn), E-discovery
• Software Log Data → Automatic Troubleshooting (Splunk)
• Genotype and Phenotype Data → Epic, Patient-Centered Care, Personalized
Medicine



Data Sources
It's all happening online: every click, ad impression, billing event, fast-forward/pause,
server request, transaction, network message, and fault generates data.
• User generated (web & mobile)
• Internet of Things / M2M
• Health / scientific computing


Databases Vs. Data Science

              Databases                             Data Science
Data value    "Precious"                            "Cheap"
Data volume   Modest                                Massive
Examples      Bank records, personnel records,      Online clicks, GPS logs, tweets,
              census, medical records               building sensor readings
Priorities    Consistency, error recovery,          Speed, availability, query richness
              auditability
Structured    Strongly (schema)                     Weakly or none (text)
Properties    Transactions, ACID (Atomicity,        CAP theorem (Consistency, Availability,
              Consistency, Isolation, Durability)   Partition tolerance), eventual consistency
Realizations  SQL                                   NoSQL: MongoDB, CouchDB, HBase, Cassandra,
                                                    Riak, Memcached, Apache River, …
Business Intelligence Vs. Data Science

Business Intelligence: querying the past
Data Science: querying the past, present, and future


Machine Learning Vs. Data Science

Machine Learning                                 Data Science
Develop new (individual) models                  Explore many models; build and tune hybrids
Prove mathematical properties of models          Understand empirical properties of models
Improve/validate on a few relatively clean,      Develop/use tools that can handle
small datasets                                   massive datasets


Data Scientist
• Realizes the opportunities presented by the data
• Brings structure to it
• Finds compelling patterns in it
• Advises executives on the implications for products, processes, and
decisions
• What do they do?
• National Security
• Cyber Security
• Business Analytics
• Engineering
• Healthcare
• And more ….



Data Scientist … Contd.
Data Scientist's Practice (figure)


What is hard about Data Science?
• Overcoming assumptions
• Making ad-hoc explanations of data patterns
• Overgeneralizing
• Communication
• Not checking enough (validate models, data pipeline integrity, etc.)
• Using statistical tests correctly
• Prototype → Production transitions
• Data pipeline complexity



Gartner's Hype Cycle (figure)


Characteristics of Data
• Dimensionality
✓ Curse of dimensionality
• Sparsity
✓ Only presence counts
• Resolution
✓ Patterns depend on the scale
• Distribution
✓ Centrality and dispersion



Types of Data
• Record
  – Relational records
  – Data matrix, e.g., numerical matrix, crosstabs
  – Document data: text documents as term-frequency vectors
  – Transaction data
• Graph and network
  – World Wide Web
  – Social or information networks
  – Molecular structures
• Ordered
  – Video data: sequence of images
  – Temporal data: time-series
  – Sequential data: transaction sequences
  – Genetic sequence data
• Spatial, image and multimedia
  – Spatial data: maps
  – Image data
  – Video data


Data Objects
• Data sets are made up of data objects
• A data object represents an entity
• E.g.,
– sales database: customers, store items, sales
– medical database: patients, treatments
– university database: students, professors, courses
• Also called samples, examples, instances, data points, objects, tuples
• Data objects are described by attributes
• Database rows → data objects
• Database columns → attributes
Attributes
Attribute Types
• Nominal: categories, states, or “names of things”
– Hair_color = {auburn, black, blond, brown, grey, red, white}
– marital status, occupation, ID numbers, zip codes
• Binary
– Nominal attribute with only 2 states (0 and 1)
– Symmetric binary: both outcomes equally important
• E.g., gender
– Asymmetric binary: outcomes not equally important.
• E.g., medical test (positive vs. negative)
• Convention: assign 1 to most important outcome (E.g., HIV positive)
• Ordinal
– Values have a meaningful order (ranking) but magnitude between
successive values is not known.
– Size = {small, medium, large}, grades, army rankings
Numeric Attributes
• Quantity (integer or real-valued)
• Interval
  – Measured on a scale of equal-sized units
  – Values have order
    • E.g., temperature in °C or °F, calendar dates
  – No true zero-point
• Ratio
  – Inherent zero-point
  – We can speak of values as being an order of magnitude larger than the unit of
    measurement (10 K is twice as high as 5 K)
  – E.g., temperature in Kelvin, length, counts, monetary quantities
Discrete Vs. Continuous Attributes
• Discrete Attribute
– Has only a finite or countably infinite set of values
• E.g., zip codes, profession, or the set of words in a collection of documents
– Sometimes, represented as integer variables
– Note: Binary attributes are a special case of discrete attributes
• Continuous Attribute
– Has real numbers as attribute values
• E.g., temperature, height, or weight
– Practically, real values can only be measured and represented using a finite number
of digits
– Continuous attributes are typically represented as floating-point variables



Graphic Displays of Statistical Data
• Boxplot: graphic display of five-number summary (minimum, first quartile (Q1),
median, third quartile (Q3), and maximum)

• Histogram: the x-axis shows values, the y-axis represents frequencies


• Quantile Plot: each value x_i is paired with f_i, indicating that approximately
  100·f_i % of the data are ≤ x_i
• Quantile-Quantile (Q-Q) Plot: graphs the quantiles of one univariate
  distribution against the corresponding quantiles of another
• Scatter Plot: each pair of values is a pair of coordinates and plotted as points
in the plane

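As a quick illustration, the sketch below (a minimal example assuming NumPy and Matplotlib are available; the data array is made up) computes the five-number summary and draws a boxplot and a histogram:

import numpy as np
import matplotlib.pyplot as plt

data = np.random.default_rng(42).normal(50, 10, 500)   # hypothetical sample

# Five-number summary: minimum, Q1, median, Q3, maximum
print(np.percentile(data, [0, 25, 50, 75, 100]))

fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.boxplot(data)        # boxplot: graphic display of the five-number summary
ax2.hist(data, bins=20)  # histogram: values on the x-axis, frequencies on the y-axis
plt.show()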


Data Analysis
• Data Collection
• Data Cleaning
• Data Integration
• Data Reduction
• Data Transformation



Data Collection
• Data
▪ Quantitative → Numbers, tests, counting, measuring
▪ Qualitative → Words, images, observations, conversations, photographs
• Data Collection Techniques
▪ Observations
▪ Tests
▪ Surveys
▪ Document analysis



Data Quality
• Why preprocess the data?
• Measures for data quality: A multidimensional view
– Accuracy: correct or wrong, accurate or not
– Completeness: not recorded, unavailable, …
– Consistency: some modified but some not, dangling, …
– Timeliness: timely update?
– Believability: how much of the data can be trusted to be correct?
– Interpretability: how easily the data can be understood?
Major Tasks in Data Preprocessing
• Data cleaning
– Fill in missing values, smooth noisy data, identify or remove outliers, and resolve
inconsistencies
• Data integration
– Integration of multiple databases, data cubes, or files
• Data reduction
– Dimensionality reduction
– Numerosity reduction
– Data compression
• Data transformation and data discretization
– Normalization
– Concept hierarchy generation
Data Cleaning
• Data in the Real World is Dirty: Lots of potentially incorrect data, e.g., instrument faulty,
human or computer error, transmission error
– Incomplete: lacking attribute values, lacking certain attributes of interest, or containing
only aggregate data
• E.g., Occupation=“ ” (missing data)
– Noisy: containing noise, errors, or outliers
• E.g., Salary=“−10” (an error)
– Inconsistent: containing discrepancies in codes or names e.g.,
• E.g., Age=“42”, Birthday=“03/07/2021”
• Was rating “1, 2, 3”, now rating “A, B, C”
• discrepancy between duplicate records
– Intentional (e.g., disguised missing data)
• Jan. 1 as everyone’s birthday?
Incomplete (Missing) Data
• Data is not always available
– E.g., many tuples have no recorded value for several attributes, such
as customer income in sales data
• Missing data may be due to
– equipment malfunction
– inconsistent with other recorded data and thus deleted
– data not entered due to misunderstanding
– certain data may not be considered important at the time of entry
– not register history or changes of the data
• Missing data may need to be inferred
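A minimal pandas sketch of common ways to handle such missing values (the income column and the fill choices are hypothetical):

import numpy as np
import pandas as pd

df = pd.DataFrame({"income": [58000, np.nan, 72000, np.nan, 61000]})

print(df.dropna())                                # ignore tuples with missing values
print(df["income"].fillna(df["income"].mean()))   # fill with the attribute mean
print(df["income"].interpolate())                 # infer from neighbouring values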
Noisy Data
• Noise: random error or variance in a measured variable
• Incorrect attribute values may be due to
– faulty data collection instruments
– data entry problems
– data transmission problems
– technology limitation
– inconsistency in naming convention
• Other data problems which require data cleaning
– duplicate records
– incomplete data
– inconsistent data
How to handle noisy data?
• Binning
– first sort data and partition into (equal-frequency) bins
– then one can smooth by bin means, smooth by bin median, smooth
by bin boundaries, etc.
• Regression
– smooth by fitting the data into regression functions
• Clustering
– detect and remove outliers
• Combined computer and human inspection
– detect suspicious values and check by human
– E.g., deal with possible outliers
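A small sketch of binning-based smoothing, assuming a sorted list of nine hypothetical price values partitioned into three equal-frequency bins:

import numpy as np

prices = np.array([4, 8, 15, 21, 21, 24, 25, 28, 34])   # sorted (hypothetical) data

bins = prices.reshape(3, 3)    # partition into equal-frequency bins of size 3
means = bins.mean(axis=1)      # bin means: [ 9. 22. 29.]

# Smooth by bin means: replace every value with the mean of its bin
smoothed = np.repeat(means, 3)
print(smoothed)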
Data Cleaning as a Process
• Data discrepancy detection
– Use metadata (e.g., domain, range, dependency, distribution)
– Check field overloading
– Check uniqueness rule, consecutive rule and null rule
– Use commercial tools
• Data scrubbing: use simple domain knowledge (e.g., postal codes, spell-checking) to
detect errors and make corrections
• Data auditing: by analyzing data to discover rules and relationship to detect
violators (e.g., correlation and clustering to find outliers)
• Data migration and integration
– Data migration tools: allow transformations to be specified
– ETL (Extraction/Transformation/Loading) tools: allow users to specify
transformations through a graphical user interface
• Integration of the two processes
– Iterative and interactive (e.g., Potter's Wheel)
Data Integration
• Combines data from multiple sources into a coherent store
• Schema integration
– e.g., A.cust-id ≡ B.cust-#
– Integrate metadata from different sources
• Entity identification problem
– Identify real world entities from multiple data sources, e.g., Bill Clinton = William
Clinton
• Detecting and resolving data value conflicts
– For the same real world entity, attribute values from different sources are different
– Possible reasons: different representations, different scales, e.g., metric vs. British units



Handling Redundancy in Data Integration
• Redundant data often occur when integrating multiple databases
– Object identification: The same attribute or object may have different names in
different databases
– Derivable data: One attribute may be a “derived” attribute in another table, e.g.,
annual revenue
• Redundant attributes may be able to be detected by correlation analysis and
covariance analysis
• Careful integration of the data from multiple sources may help reduce/avoid
redundancies and inconsistencies and improve mining speed and quality



Correlation Analysis – Nominal Data
• Χ² (chi-square) test:

  Χ² = Σ (Observed − Expected)² / Expected
• The larger the Χ2 value, the more likely the variables are related
• The cells that contribute the most to the Χ2 value are those whose actual count is
very different from the expected count
• Correlation does not imply causality
– # of hospitals and # of car-theft in a city are correlated
– Both are causally linked to the third variable: population

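A sketch of the Χ² test with SciPy's chi2_contingency (the 2×2 contingency table is hypothetical):

import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x2 contingency table of two nominal attributes
observed = np.array([[250, 200],
                     [50, 1000]])

chi2, p, dof, expected = chi2_contingency(observed, correction=False)
print(chi2, p)   # a large chi-square value (tiny p) suggests the variables are related
print(expected)  # cells far from these expected counts contribute most to chi-square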


Correlation Analysis – Numeric Data
• Correlation coefficient (also called Pearson's product-moment coefficient):

  r_{A,B} = Σ_{i=1}^{n} (a_i − Ā)(b_i − B̄) / ((n − 1) σ_A σ_B)
          = (Σ_{i=1}^{n} a_i b_i − n Ā B̄) / ((n − 1) σ_A σ_B)

  where n is the number of tuples, Ā and B̄ are the respective means of A and B, σ_A and σ_B
  are the respective standard deviations of A and B, and Σ a_i b_i is the sum of the AB
  cross-products.
• If r_{A,B} > 0, A and B are positively correlated (A's values increase as B's do).
  The higher the value, the stronger the correlation.
• r_{A,B} = 0: uncorrelated; r_{A,B} < 0: negatively correlated

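A quick NumPy check of this formula, reusing the two stock series from the covariance example that follows; np.corrcoef should agree with the hand computation:

import numpy as np

A = np.array([2, 3, 5, 4, 6])
B = np.array([5, 8, 10, 11, 14])

n = len(A)
r = ((A - A.mean()) * (B - B.mean())).sum() / ((n - 1) * A.std(ddof=1) * B.std(ddof=1))
print(r)                        # ~0.94: r > 0, so A and B are positively correlated
print(np.corrcoef(A, B)[0, 1])  # NumPy's built-in gives the same value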


Covariance – Numeric Data
• Covariance:

  Cov(A, B) = E[(A − Ā)(B − B̄)] = (1/n) Σ_{i=1}^{n} (a_i − Ā)(b_i − B̄)

• Covariance is similar to correlation:

  r_{A,B} = Cov(A, B) / (σ_A σ_B)

  where n is the number of tuples, Ā and B̄ are the respective means (expected values) of
  A and B, and σ_A and σ_B are the respective standard deviations of A and B.
• Positive covariance: If CovA,B > 0, then A and B both tend to be larger than their expected
values.
• Negative covariance: If CovA,B < 0 then if A is larger than its expected value, B is likely to
be smaller than its expected value.
• Independence: CovA,B = 0 but the converse is not true:
– Some pairs of random variables may have a covariance of 0 but are not independent.
Only under some additional assumptions (e.g., the data follow multivariate normal
distributions) does a covariance of 0 imply independence
Covariance – An Example
It can be simplified in computation as Cov(A, B) = E(A·B) − Ā B̄
Suppose two stocks A and B have the following values in one week: (2, 5), (3, 8), (5, 10),
(4, 11), (6, 14).
If the stocks are affected by the same industry trends, will their prices rise or fall together?
E(A) = (2 + 3 + 5 + 4 + 6)/ 5 = 20/5 = 4
E(B) = (5 + 8 + 10 + 11 + 14) /5 = 48/5 = 9.6
Cov(A,B) = (2×5+3×8+5×10+4×11+6×14)/5 − 4 × 9.6 = 4
Thus, A and B rise together since Cov(A, B) > 0.

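The same computation in NumPy (note that np.cov defaults to the sample covariance, so ddof=0 is needed to match the formula above):

import numpy as np

A = np.array([2, 3, 5, 4, 6])
B = np.array([5, 8, 10, 11, 14])

cov = (A * B).mean() - A.mean() * B.mean()   # E(A.B) - E(A)E(B)
print(cov)                                   # 4.0, so the stocks rise together
print(np.cov(A, B, ddof=0)[0, 1])            # same value from NumPy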


Data Reduction
• Obtain a reduced representation of the data set that is much smaller in volume but yet
produces the same (or almost the same) analytical results
• Why data reduction? A database/data warehouse may store terabytes of data, and complex
data analysis may take a very long time to run on the complete data set
• Data reduction strategies
– Dimensionality reduction, e.g., remove unimportant attributes
• Wavelet transforms
• Principal Components Analysis (PCA)
• Feature subset selection, feature creation
– Numerosity reduction
• Regression and Log-Linear Models
• Histograms, clustering, sampling
• Data cube aggregation
– Data compression



Dimensionality Reduction
• Curse of dimensionality
– When dimensionality increases, data becomes increasingly sparse
– Density and distance between points, which are critical to clustering and outlier analysis,
become less meaningful
– The possible combinations of subspaces will grow exponentially
• Dimensionality reduction
– Avoid the curse of dimensionality
– Help eliminate irrelevant features and reduce noise
– Reduce time and space required in data mining
– Allow easier visualization
• Dimensionality reduction techniques
– Wavelet transforms
– Principal Component Analysis
– Supervised and nonlinear techniques (e.g., feature selection)
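A minimal scikit-learn sketch of dimensionality reduction with PCA (the 5-attribute data is randomly generated for illustration):

import numpy as np
from sklearn.decomposition import PCA

X = np.random.default_rng(0).normal(size=(100, 5))   # 100 objects, 5 attributes

pca = PCA(n_components=2)               # keep only the 2 principal components
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                  # (100, 2)
print(pca.explained_variance_ratio_)    # variance retained by each component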
Numerosity Reduction
• Reduce data volume by choosing alternative, smaller forms of data
representation
• Parametric methods (e.g., regression)
– Assume the data fits some model, estimate model parameters, store only the
parameters, and discard the data (except possible outliers)
– Ex.: Log-linear models—obtain value at a point in m-D space as the product
on appropriate marginal subspaces
• Non-parametric methods
– Do not assume models
– Major families: histograms, clustering, sampling, etc.



Data Compression
• String compression
– There are extensive theories and well-tuned algorithms
– Typically lossless, but only limited manipulation is possible without
expansion
• Audio/video compression
– Typically lossy compression, with progressive refinement
– Sometimes small fragments of signal can be reconstructed without
reconstructing the whole
• Time sequences are not audio
  – Typically short and varying slowly with time
• Dimensionality and numerosity reduction may also be considered as forms of
data compression
Data Transformation
• A function that maps the entire set of values of a given attribute to a new set of replacement
values such that each old value can be identified with one of the new values
• Methods
– Smoothing: Remove noise from data
– Attribute/feature construction
• New attributes constructed from the given ones
– Aggregation: Summarization, data cube construction
– Normalization: Scaled to fall within a smaller, specified range
• min-max normalization
• z-score normalization
• normalization by decimal scaling
– Discretization: Concept hierarchy climbing



Normalization
• Min-max normalization: to [new_min_A, new_max_A]

  v' = ((v − min_A) / (max_A − min_A)) (new_max_A − new_min_A) + new_min_A

  – E.g., let income range from $12,000 to $98,000 be normalized to [0.0, 1.0]. Then $73,600
    is mapped to ((73,600 − 12,000) / (98,000 − 12,000)) (1.0 − 0) + 0 = 0.716

• Z-score normalization (μ_A: mean, σ_A: standard deviation):

  v' = (v − μ_A) / σ_A

  – E.g., let μ_A = 54,000 and σ_A = 16,000. Then (73,600 − 54,000) / 16,000 = 1.225

• Normalization by decimal scaling:

  v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
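The three normalization methods as short Python computations, reproducing the income example above:

import math

v = 73600.0

# Min-max normalization to [0.0, 1.0]
print((v - 12000) / (98000 - 12000) * (1.0 - 0.0) + 0.0)   # 0.716

# Z-score normalization
print((v - 54000) / 16000)                                 # 1.225

# Decimal scaling: j is the smallest integer such that max(|v'|) < 1
j = math.ceil(math.log10(abs(v)))
print(v / 10 ** j)                                         # 0.736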
Discretization
• Types of attributes
– Nominal—values from an unordered set, e.g., color, profession
– Ordinal—values from an ordered set, e.g., military or academic rank
– Numeric—real numbers, e.g., integer or real numbers
• Discretization: divide the range of a continuous attribute into intervals
  – Interval labels can then be used to replace actual data values
  – Reduce data size by discretization
  – Supervised vs. unsupervised
  – Split (top-down) vs. merge (bottom-up)
  – Discretization can be performed recursively on an attribute
  – Prepare for further analysis, e.g., classification
• Discretization methods: binning, histogram analysis, cluster analysis, decision tree
  analysis, correlation analysis, concept hierarchy
Data Modelling
• Classification
• Clustering



Machine Learning
• Machine Learning (ML) is a branch of artificial intelligence:
– Uses computing based systems to make sense out of data
• Extracting patterns, fitting data to functions, classifying data, etc.
– ML systems can learn and improve
• With historical data, time, and experience
– Bridges theoretical computer science and real, noisy data



Machine Learning Techniques

Learning Type            Data Processing Tasks          Specialization              Learning Algorithms
Supervised Learning      Classification / Regression /  Computational classifiers   Support Vector Machine
                         Estimation                     Statistical classifiers     Naïve Bayes, Bayesian Networks
                                                        Connectionist classifiers   Neural Networks
Unsupervised Learning    Clustering / Prediction        Parametric                  K-means, Gaussian Mixture Model
                                                        Non-parametric              Dirichlet process mixture model, X-means
Reinforcement Learning   Decision-making                Model-free                  Q-learning, R-learning
                                                        Model-based                 TD learning, Sarsa learning


Classification Steps
• Model construction: describing a set of predetermined classes
– Each tuple/sample is assumed to belong to a predefined class, as determined by the class label
attribute
– The set of tuples used for model construction is training set
– The model is represented as classification rules, decision trees, or mathematical formulae
• Model usage: for classifying future or unknown objects
– Estimate accuracy of the model
• The known label of test sample is compared with the classified result from the model
• Accuracy rate is the percentage of test set samples that are correctly classified by the model
• Test set is independent of training set (otherwise overfitting)
– If the accuracy is acceptable, use the model to classify new data



Classification Steps … Contd.
Model construction



Classification Steps … Contd.
Using the model in prediction



Classification Techniques
• Decision Tree Induction
• Bayesian Classification
• Classification by Backpropagation
• Support Vector Machines
• Associative Classification
• Lazy Learners
• Other Classification Methods



Decision Tree Induction
• Basic algorithm (a greedy algorithm)
– Tree is constructed in a top-down recursive divide-and-conquer manner
– At start, all the training examples are at the root
– Attributes are categorical (if continuous-valued, they are discretized in advance)
– Examples are partitioned recursively based on selected attributes
– Test attributes are selected on the basis of a heuristic or statistical measure (e.g.,
information gain)
• Conditions for stopping partitioning
– All samples for a given node belong to the same class
– There are no remaining attributes for further partitioning – majority voting is
employed for classifying the leaf
– There are no samples left



Decision Tree Induction … Contd.
Attribute Selection Measure: Information Gain (ID3)
• Select the attribute with the highest information gain
• Let pi be the probability that an arbitrary tuple in D belongs to class Ci,
estimated by |Ci, D|/|D|
• Expected information (entropy) needed to classify a tuple in D:

  Info(D) = − Σ_{i=1}^{m} p_i log2(p_i)

• Information needed (after using A to split D into v partitions) to classify D:

  Info_A(D) = Σ_{j=1}^{v} (|D_j| / |D|) × Info(D_j)

• Information gained by branching on attribute A:

  Gain(A) = Info(D) − Info_A(D)

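A sketch computing Info(D), Info_age(D), and Gain(age) for the 14-tuple buys_computer data shown on the next slide (the class counts are read off that table):

import numpy as np

def info(counts):
    # Entropy of a class distribution given as a list of counts
    p = np.array(counts) / sum(counts)
    return -(p[p > 0] * np.log2(p[p > 0])).sum()

info_D = info([9, 5])        # 9 "yes" and 5 "no" tuples: 0.940 bits

# Partitions induced by age (<=30, 31..40, >40) with their yes/no counts
partitions = [[2, 3], [4, 0], [3, 2]]
info_age = sum(sum(c) / 14 * info(c) for c in partitions)   # 0.694 bits

print(info_D - info_age)     # Gain(age) = 0.246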




Decision Tree Induction … Contd.

Training data (buys_computer):

age     income   student   credit_rating   buys_computer
<=30    high     no        fair            no
<=30    high     no        excellent       no
31…40   high     no        fair            yes
>40     medium   no        fair            yes
>40     low      yes       fair            yes
>40     low      yes       excellent       no
31…40   low      yes       excellent       yes
<=30    medium   no        fair            no
<=30    low      yes       fair            yes
>40     medium   yes       fair            yes
<=30    medium   yes       excellent       yes
31…40   medium   no        excellent       yes
31…40   high     yes       fair            yes
>40     medium   no        excellent       no

Resulting tree: the root splits on age; for age <=30, split on student (no → no, yes → yes);
for age 31…40, predict yes; for age >40, split on credit_rating (excellent → no, fair → yes).


Decision Tree Induction … Contd.
Computing Information Gain for Continuous-Valued Attributes
• Let attribute A be a continuous-valued attribute
• Must determine the best split point for A
• Sort the value A in increasing order
• Typically, the midpoint between each pair of adjacent values is considered as a possible
split point
• (ai+ai+1)/2 is the midpoint between the values of ai and ai+1
• The point with the minimum expected information requirement for A is selected as the
split-point for A
• Split
• D1 is the set of tuples in D satisfying A ≤ split-point, and D2 is the set of tuples in D
satisfying A > split-point



Decision Tree Induction … Contd.
Attribute Selection Measure: Gain Ratio (C4.5)
• Information gain measure is biased towards attributes with a large number of values
• C4.5 (a successor of ID3) uses gain ratio to overcome the problem (normalization to
information gain)
  SplitInfo_A(D) = − Σ_{j=1}^{v} (|D_j| / |D|) × log2(|D_j| / |D|)

  GainRatio(A) = Gain(A) / SplitInfo_A(D)

Ex.

gain_ratio(income) = 0.029/1.557 = 0.019


• The attribute with the maximum gain ratio is selected as the splitting attribute



Decision Tree Induction … Contd.
Attribute Selection Measure: Gini Index (CART – Classification and Regression Trees)
If a data set D contains examples from n classes, the gini index, gini(D), is defined as

  gini(D) = 1 − Σ_{j=1}^{n} p_j²

where p_j is the relative frequency of class j in D.
If a data set D is split on A into two subsets D1 and D2, the gini index gini_A(D) is
defined as

  gini_A(D) = (|D1| / |D|) gini(D1) + (|D2| / |D|) gini(D2)

Reduction in impurity: Δgini(A) = gini(D) − gini_A(D)

The attribute that provides the smallest gini_split(D) (or the largest reduction in impurity)
is chosen to split the node (need to enumerate all possible splitting points for
each attribute)



Decision Tree Induction … Contd.
• D has 9 tuples in buys_computer = "yes" and 5 in "no":

  gini(D) = 1 − (9/14)² − (5/14)² = 0.459

• Suppose the attribute income partitions D into 10 tuples in D1: {low, medium} and 4 in
  D2: {high}:

  gini_{income ∈ {low,medium}}(D) = (10/14) gini(D1) + (4/14) gini(D2) = 0.443

  Gini_{low,high} is 0.458 and Gini_{medium,high} is 0.450. Thus, split on {low, medium} (and
  {high}), since it has the lowest Gini index.
• All attributes are assumed continuous-valued
• May need other tools, e.g., clustering, to get the possible split values
• Can be modified for categorical attributes

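The same Gini computations as a short sketch (the 7/3 and 2/2 class counts are read off the buys_computer table shown earlier):

def gini(counts):
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

print(gini([9, 5]))    # gini(D) = 0.459

# income in {low, medium}: D1 has 10 tuples (7 yes, 3 no); D2 = {high} has 4 (2 yes, 2 no)
g = 10 / 14 * gini([7, 3]) + 4 / 14 * gini([2, 2])
print(g)               # ~0.443, the lowest Gini among the income splits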


Decision Tree Induction … Contd.
Comparing Attribute Selection Measures
The three measures, in general, return good results but
• Information gain
• biased towards multivalued attributes
• Gain ratio
• tends to prefer unbalanced splits in which one partition is much
smaller than the others
• Gini index
• biased to multivalued attributes
• has difficulty when # of classes is large
• tends to favor tests that result in equal-sized partitions and purity in
both partitions



Bayesian Classification
• A statistical classifier: performs probabilistic prediction, i.e., predicts class membership
probabilities
• Foundation: Based on Bayes’ Theorem.
• Performance: A simple Bayesian classifier, Naïve Bayesian classifier, has comparable
performance with decision tree and selected neural network classifiers
• Incremental: Each training example can incrementally increase/decrease the probability
that a hypothesis is correct — prior knowledge can be combined with observed data
• Standard: Even when Bayesian methods are computationally intractable, they can
provide a standard of optimal decision making against which other methods can be
measured



Bayesian Classification … Contd.
Bayes’ Theorem Basics
• Total probability theorem:

  P(B) = Σ_{i=1}^{M} P(B | A_i) P(A_i)

• Bayes' theorem:

  P(H | X) = P(X | H) P(H) / P(X)

– Let X be a data sample (“evidence”): class label is unknown


– Let H be a hypothesis that X belongs to class C
– Classification is to determine P(H|X), (i.e., posteriori probability): the probability that the hypothesis
holds given the observed data sample X
– P(H) (prior probability): the initial probability
• E.g., X will buy computer, regardless of age, income, …
– P(X): probability that sample data is observed
– P(X|H) (likelihood): the probability of observing the sample X, given that the hypothesis holds
• E.g., Given that X will buy computer, the prob. that X is 31..40, medium income
Bayesian Classification … Contd.
Prediction based on Bayes’ Theorem
• Given training data X, posteriori probability of a hypothesis H, P(H|X), follows the
Bayes’ theorem
  P(H | X) = P(X | H) P(H) / P(X)
• Informally, this can be viewed as
posteriori = likelihood x prior/evidence
• Predicts X belongs to Ci iff the probability P(Ci|X) is the highest among all the
P(Ck|X) for all the k classes
• Practical difficulty: It requires initial knowledge of many probabilities, involving
significant computational cost



Bayesian Classification … Contd.
Classification is to derive the Maximum Posteriori
• Let D be a training set of tuples and their associated class labels, and each tuple is
represented by an n-D attribute vector X = (x1, x2, …, xn)
• Suppose there are m classes C1, C2, …, Cm.
• Classification is to derive the maximum posteriori, i.e., the maximal P(Ci|X)
• This can be derived from Bayes' theorem:

  P(C_i | X) = P(X | C_i) P(C_i) / P(X)

• Since P(X) is constant for all classes, only

  P(C_i | X) ∝ P(X | C_i) P(C_i)

  needs to be maximized



Bayesian Classification … Contd.
Naïve Bayes Classifier
• A simplified assumption: attributes are conditionally independent (i.e., no dependence
  relation between attributes):

  P(X | C_i) = Π_{k=1}^{n} P(x_k | C_i) = P(x_1 | C_i) × P(x_2 | C_i) × … × P(x_n | C_i)

• This greatly reduces the computation cost: only counts the class distribution
• If A_k is categorical, P(x_k | C_i) is the # of tuples in C_i having value x_k for A_k,
  divided by |C_i,D| (# of tuples of C_i in D)
• If A_k is continuous-valued, P(x_k | C_i) is usually computed based on a Gaussian
  distribution with mean μ and standard deviation σ:

  g(x, μ, σ) = (1 / (√(2π) σ)) e^(−(x − μ)² / (2σ²))

  and P(x_k | C_i) = g(x_k, μ_{C_i}, σ_{C_i})



Bayesian Classification … Contd.
Illustration of Naïve Bayes Classifier
• Classes:
  C1: buys_computer = "yes"
  C2: buys_computer = "no"
• Training data: the 14-tuple buys_computer table shown earlier (age, income, student,
  credit_rating, buys_computer)
• Data to be classified:
  X = (age <= 30, income = medium, student = yes, credit_rating = fair)


Bayesian Classification … Contd.
• P(Ci): P(buys_computer = "yes") = 9/14 = 0.643
         P(buys_computer = "no") = 5/14 = 0.357
• Compute P(X|Ci) for each class:
  P(age = "<=30" | buys_computer = "yes") = 2/9 = 0.222
  P(age = "<=30" | buys_computer = "no") = 3/5 = 0.6
  P(income = "medium" | buys_computer = "yes") = 4/9 = 0.444
  P(income = "medium" | buys_computer = "no") = 2/5 = 0.4
  P(student = "yes" | buys_computer = "yes") = 6/9 = 0.667
  P(student = "yes" | buys_computer = "no") = 1/5 = 0.2
  P(credit_rating = "fair" | buys_computer = "yes") = 6/9 = 0.667
  P(credit_rating = "fair" | buys_computer = "no") = 2/5 = 0.4
• X = (age <= 30, income = medium, student = yes, credit_rating = fair)
  P(X|Ci):          P(X | buys_computer = "yes") = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
                    P(X | buys_computer = "no") = 0.6 × 0.4 × 0.2 × 0.4 = 0.019
  P(X|Ci) × P(Ci):  P(X | buys_computer = "yes") × P(buys_computer = "yes") = 0.028
                    P(X | buys_computer = "no") × P(buys_computer = "no") = 0.007
Therefore, X belongs to class "buys_computer = yes"

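The same computation as a short script, with the priors and conditional probabilities copied from the slide:

# P(x_k | C) for X = (age<=30, income=medium, student=yes, credit_rating=fair)
cond = {"yes": [2/9, 4/9, 6/9, 6/9],
        "no":  [3/5, 2/5, 1/5, 2/5]}
prior = {"yes": 9/14, "no": 5/14}

for c in ("yes", "no"):
    likelihood = 1.0
    for p in cond[c]:
        likelihood *= p              # conditional-independence assumption
    print(c, likelihood * prior[c])  # yes: ~0.028, no: ~0.007, so predict "yes"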


Bayesian Classification … Contd.
• Naïve Bayesian prediction requires each conditional probability to be non-zero;
  otherwise, the predicted probability will be zero:

  P(X | C_i) = Π_{k=1}^{n} P(x_k | C_i)

• Ex. Suppose a dataset with 1000 tuples, income=low (0), income= medium (990), and
income = high (10)
• Use Laplacian correction (or Laplacian estimator)
– Adding 1 to each case
Prob(income = low) = 1/1003
Prob(income = medium) = 991/1003
Prob(income = high) = 11/1003
– The “corrected” prob. estimates are close to their “uncorrected” counterparts
Bayesian Classification … Contd.
Advantages
• Easy to implement
• Good results obtained in most of the cases
Disadvantages
• Assumption: class conditional independence, therefore loss of accuracy
• Practically, dependencies exist among variables
• E.g., hospital patients: profile (age, family history, etc.), symptoms (fever,
  cough, etc.), disease (lung cancer, diabetes, etc.)
• Dependencies among these cannot be modeled by Naïve Bayes
Classifier
• How to deal with these dependencies? Bayesian Belief Networks



Classification by Backpropagation
Backpropagation: A neural network learning algorithm
• Started by psychologists and neurobiologists to develop and test
computational analogues of neurons
• A neural network: A set of connected input/output units where each
connection has a weight associated with it
• During the learning phase, the network learns by adjusting the weights so
as to be able to predict the correct class label of the input tuples
• Also referred to as connectionist learning due to the connections between
units



Classification by Backpropagation… Contd.
Neural Network as a Classifier
• Weakness
– Long training time
– Require a number of parameters typically best determined empirically, e.g., the network topology
or “structure.”
– Poor interpretability: Difficult to interpret the symbolic meaning behind the learned weights and of
“hidden units” in the network
• Strength
– High tolerance to noisy data
– Ability to classify untrained patterns
– Well-suited for continuous-valued inputs and outputs
– Successful on an array of real-world data, e.g., hand-written letters
– Algorithms are inherently parallel
– Techniques have been developed for the extraction of rules from trained neural networks



Classification by Backpropagation… Contd.
A Multilayer Feed-Forward Neural Network (figure)


Classification by Backpropagation… Contd.
• The inputs to the network correspond to the attributes measured for each training tuple
• Inputs are fed simultaneously into the units making up the input layer
• They are then weighted and fed simultaneously to a hidden layer
• The number of hidden layers is arbitrary, although usually only one
• The weighted outputs of the last hidden layer are input to units making up the output
layer, which emits the network's prediction
• The network is feed-forward: None of the weights cycles back to an input unit or to an
output unit of a previous layer
• From a statistical point of view, networks perform nonlinear regression: Given enough
hidden units and enough training samples, they can closely approximate any function



Classification by Backpropagation… Contd.
Defining a Network Topology
• Decide the network topology: Specify # of units in the input layer, # of hidden layers (if >
1), # of units in each hidden layer, and # of units in the output layer
• Normalize the input values for each attribute measured in the training tuples to [0.0, 1.0]
• One input unit per domain value, each initialized to 0
• Output: for classification with more than two classes, one output unit per class is used
• If the accuracy of a trained network is unacceptable, repeat the training process with a
  different network topology or a different set of initial weights



Classification by Backpropagation… Contd.
Neuron: A Hidden/Output Layer Unit

• An n-dimensional input vector x is mapped into variable y by means of the scalar product
and a nonlinear function mapping
• The inputs to the unit are the outputs from the previous layer
• They are multiplied by their corresponding weights to form a weighted sum, which is added
  to the bias associated with the unit
• Then a nonlinear activation function is applied to it



Classification by Backpropagation… Contd.
Backpropagation Algorithm
• Iteratively process a set of training tuples & compare the network's prediction with the actual known target
value
• For each training tuple, the weights are modified to minimize the mean squared error between the network's
prediction and the actual target value
• Modifications are made in the “backwards” direction: from the output layer, through each hidden layer down
to the first hidden layer, hence “backpropagation”
• Steps
✓ Initialize weights to small random numbers, associated with biases
✓ Propagate the inputs forward (by applying activation function)
✓ Backpropagate the error (by updating weights and biases)
✓ Terminating condition (when error is very small, etc.)

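A minimal NumPy sketch of these four steps for a single hidden layer, using a sigmoid activation on the XOR toy problem; the layer sizes, learning rate, and epoch limit are illustrative choices, not from the slides:

import numpy as np

rng = np.random.default_rng(1)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)   # toy inputs
y = np.array([[0], [1], [1], [0]], dtype=float)               # XOR targets

# Step 1: initialize weights to small random numbers, with biases
W1, b1 = rng.normal(scale=0.5, size=(2, 4)), np.zeros(4)
W2, b2 = rng.normal(scale=0.5, size=(4, 1)), np.zeros(1)
sigmoid = lambda z: 1 / (1 + np.exp(-z))

for epoch in range(20000):
    # Step 2: propagate the inputs forward through the activation function
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)

    # Step 3: backpropagate the error by updating weights and biases
    d_out = (out - y) * out * (1 - out)       # squared-error gradient x sigmoid'
    d_h = (d_out @ W2.T) * h * (1 - h)
    W2 -= 0.5 * h.T @ d_out;  b2 -= 0.5 * d_out.sum(axis=0)
    W1 -= 0.5 * X.T @ d_h;    b1 -= 0.5 * d_h.sum(axis=0)

    # Step 4: terminating condition (error is very small)
    if ((out - y) ** 2).mean() < 1e-3:
        break

print(out.round(2).ravel())   # approaches [0, 1, 1, 0]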


Classification by Backpropagation… Contd.
• Efficiency of backpropagation: Each epoch (one iteration through the training set) takes
O(|D| * w), with |D| tuples and w weights, but # of epochs can be exponential to n, the
number of inputs, in worst case
• For easier comprehension: Rule extraction by network pruning
• Simplify the network structure by removing weighted links that have the least effect on
the trained network
• Then perform link, unit, or activation value clustering
• The set of input and activation values are studied to derive rules describing the
relationship between the input and hidden unit layers
• Sensitivity analysis: assess the impact that a given input variable has on a network output.
The knowledge gained from this analysis can be represented in rules



Classification by Backpropagation… Contd.
Illustration of Backpropagation Algorithm (figures)


Support Vector Machines
• The line that maximizes the minimum margin is a good bet.
  – The model class of "hyper-planes with a margin of m" has a low VC dimension if m is big.
• This maximum-margin separator is determined by a subset of the datapoints.
  – Datapoints in this subset are called "support vectors" (indicated in the figure by the
    circles around them).
  – Computationally useful if only a small fraction of the datapoints are support vectors,
    because we use the support vectors to decide which side of the separator a test case is on.



Support Vector Machines … Contd.
Training a Linear SVM
• To find the maximum-margin separator, we have to solve the following optimization problem:

  w·x_c + b ≥ +1 for positive cases
  w·x_c + b ≤ −1 for negative cases
  and ||w||² is as small as possible
• This is tricky but it’s a convex problem.
– There is only one optimum and we can find it without fiddling with learning rates or weight decay or
early stopping.
– Don’t worry about the optimization problem. It has been solved. It is called quadratic programming.
– It takes time proportional to N^2 which is really bad for very big datasets
• so for big datasets we end up doing approximate optimization!



Support Vector Machines … Contd.
Testing a Linear SVM
• The separator is defined as the set of points for which w·x + b = 0,
  so if w·x_c + b > 0, say it's a positive case,
  and if w·x_c + b < 0, say it's a negative case

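A sketch of training and testing a linear SVM with scikit-learn on hypothetical toy data; the quadratic-programming step is hidden inside fit:

import numpy as np
from sklearn.svm import SVC

# Hypothetical linearly separable data: two small clusters
X = np.array([[1, 1], [2, 1], [1, 2], [5, 5], [6, 5], [5, 6]], dtype=float)
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1.0).fit(X, y)      # quadratic programming happens here
print(clf.support_vectors_)                      # the datapoints defining the separator
print(clf.decision_function([[3, 3]]))           # sign of w.x + b decides the class
print(clf.predict([[3, 3]]))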


Advanced Classification Methods (figure)


Classifier Evaluation Metrics
• Confusion Matrix

• Classifier accuracy, or recognition rate: percentage of test set tuples that are correctly
  classified: Accuracy = (TP + TN) / All
• Error rate: 1 − Accuracy = (FP + FN) / All
• Class imbalance problem:
– One class may be rare, e.g. fraud or COVID-positive
– Significant majority of the negative class and minority of the positive class
– Sensitivity: True Positive recognition rate Sensitivity = TP/P
– Specificity: True Negative recognition rate Specificity = TN/N

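These metrics from a confusion matrix, as a short sketch with hypothetical counts for a rare positive class:

TP, FN, FP, TN = 90, 10, 140, 9760   # hypothetical counts, rare positive class

P, N = TP + FN, FP + TN
accuracy = (TP + TN) / (P + N)       # high (0.985) yet misleading under imbalance
sensitivity = TP / P                 # true positive recognition rate: 0.90
specificity = TN / N                 # true negative recognition rate: ~0.986
print(accuracy, sensitivity, specificity)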


Classifier Evaluation Metrics … Contd.
K Fold Cross Validation
• Resampling procedure used to evaluate machine learning models on a limited data
sample.
• Parameter k: the number of groups that a given data sample is to be split into.
• When a specific value for k is chosen, it may be used in place of k in the name of the
  method (k = 10 becoming 10-fold cross-validation).
Steps
1. Shuffle the dataset randomly.
2. Split the dataset into k groups
3. For each unique group:
a) Take the group as a hold out or test data set
b) Take the remaining groups as a training data set
c) Fit a model on the training set and evaluate it on the test set
d) Retain the evaluation score and discard the model
4. Summarize the skill of the model using the sample of model evaluation scores

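The same procedure with scikit-learn; the classifier and dataset are illustrative choices:

from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
kf = KFold(n_splits=10, shuffle=True, random_state=0)            # steps 1-2: shuffle, split
scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=kf)  # step 3: fit/evaluate per fold
print(scores.mean(), scores.std())                               # step 4: summarize model skill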


Issues and Challenges - Classification
• Classifiers trained on a particular labeled dataset or data domain may not be suitable for
  another dataset or data domain, so the classification may not be robust across different
  datasets or domains
• Classifiers are trained with a certain number of class types, so the large variety of class
  types found in a dynamically growing dataset will lead to inaccurate classification results
• Classifiers are developed for a single learning task, and thus are not suitable for today's
  multiple learning tasks and the knowledge-transfer requirements of data analytics

Critical issues of Learning Methods for Today’s Data


• Learning for large scale data
• Learning for different types of data
• Learning for high speed of streaming data
• Learning for uncertain and incomplete data
• Learning for data with low value density and meaning diversity
Clustering
• Cluster: A collection of data objects
• similar (or related) to one another within the same group
• dissimilar (or unrelated) to the objects in other groups
• Cluster analysis (or clustering, data segmentation, …)
• Finding similarities between data according to the characteristics found in the
data and grouping similar data objects into clusters
• Unsupervised learning: no predefined classes
• Typical applications
• As a stand-alone tool to get insight into data distribution
• As a preprocessing step for other algorithms



Considerations for Cluster Analysis
Partitioning criteria
• Single level vs. hierarchical partitioning (often, multi-level hierarchical
partitioning is desirable)
Separation of clusters
• Exclusive (e.g., one customer belongs to only one region) vs. non-exclusive (e.g.,
one document may belong to more than one class)
Similarity measure
• Distance-based (e.g., Euclidean, road network, vector) vs. connectivity-based
(e.g., density or contiguity)
Clustering space
• Full space (often when low dimensional) vs. subspaces (often in high-dimensional
clustering)
Requirements and Challenges
Scalability
• Clustering all the data instead of only on samples
Ability to deal with different types of attributes
• Numerical, binary, categorical, ordinal, linked, and mixture of these
Constraint-based clustering
• User may give inputs on constraints
• Use domain knowledge to determine input parameters
Interpretability and usability
Others
• Discovery of clusters with arbitrary shape
• Ability to deal with noisy data
• Incremental clustering and insensitivity to input order
• High dimensionality



Challenges with Big Data Clustering
Volume
• The amount of data stored on most networks is growing exponentially
• As the volume of data grows, it becomes more difficult to extract it
• Backing up data can also amplify these problems
Velocity
• New patterns will be constantly emerging from known data sets
• If data is created faster than it can be extracted, trends may change before analysts can
  collect and act on it
Variety
• Clustered data is stored in many different forms, which can make it difficult to make
accurate comparisons
• Some data is stored in structured formats, while other data sets are completely
unstructured



Partitioning Algorithms
• Partitioning a database D of n objects into a set of k clusters, such that the sum of
squared distances is minimized (where ci is the centroid or medoid of cluster Ci)
  E = Σ_{i=1}^{k} Σ_{p ∈ C_i} (p − c_i)²
• Given k, find a partition of k clusters that optimizes the chosen partitioning criterion
• Global optimal: exhaustively enumerate all partitions
• Heuristic methods: k-means and k-medoids algorithms
• k-means: Each cluster is represented by the center of the cluster
• k-medoids or PAM (Partition around medoids): Each cluster is represented by one
of the objects in the cluster



K-Means Clustering Algorithm
• First it selects k number of objects at random from the set of n objects.
• These k objects are treated as the centroids or center of gravities of k clusters.
• Each of the remaining objects is assigned to the closest centroid.
• Thus, the collection of objects assigned to each centroid forms a cluster.
• Next, the centroid of each cluster is updated (by calculating the mean of the attribute
  values of the objects in the cluster).
• The assignment and update procedure repeats until some stopping criterion is reached (such
  as a set number of iterations, or the centroids/assignments no longer changing).



K-Means Clustering Algorithm … Contd.
Input: D, a dataset containing n objects; k, the number of clusters
Output: A set of k clusters
Steps:
1. Randomly choose k objects from D as the initial cluster centroids

2. for each of the objects in D do


Compute distance between the current objects and k cluster centroids
Assign the current object to that cluster to which it is closest

3. Compute the “cluster centers” of each cluster, these become the new cluster centroids

4. Repeat step 2-3 until the convergence criterion is satisfied

5. Stop

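A compact NumPy sketch of these steps (the 2-D sample data is synthetic, and for simplicity the sketch assumes no cluster becomes empty):

import numpy as np

rng = np.random.default_rng(7)
D = np.vstack([rng.normal(c, 0.5, (20, 2)) for c in ([0, 0], [5, 5], [0, 5])])
k = 3

centroids = D[rng.choice(len(D), k, replace=False)]   # step 1: random initial centroids
while True:
    dist = np.linalg.norm(D[:, None] - centroids, axis=2)
    labels = dist.argmin(axis=1)                      # step 2: assign to closest centroid
    new = np.array([D[labels == j].mean(axis=0) for j in range(k)])   # step 3: new centers
    if np.allclose(new, centroids):                   # step 4: convergence criterion
        break
    centroids = new

print(centroids)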


Illustration of K-Means Clustering Algorithm (figure)


Illustration of K-Means Clustering Algorithm … Contd.
• Suppose, k=3.
• Three objects are chosen at random shown as circled. These three centroids are shown below:
Initial Centroids chosen randomly

• Let us consider the Euclidean distance measure (L2 Norm) as the distance measurement
• Let d1, d2, and d3 denote the distance from an object to c1, c2, and c3 respectively
• Assignment of each object to the respective centroid is shown in the right-most column and the
clustering so obtained





Illustration of K-Means Clustering Algorithm … Contd.
• The calculation of the new centroids of the three clusters, using the means of the
  attribute values A1 and A2, is shown in the table below
• The clusters with the new centroids are shown



Illustration of K-Means Clustering Algorithm … Contd.
• Reassign the 16 objects to three clusters by determining which
centroid is closest to each one
• This gives the revised set of clusters shown
• Note that point p moves from cluster C2 to cluster C1



Illustration of K-Means Clustering Algorithm … Contd.
• The newly obtained centroids after the second iteration are given in the table below. Note
  that centroid c3 remains unchanged, while c1 and c2 changed a little.
• With respect to the newly obtained centroids, the 16 points are reassigned again. These
  are the same clusters as before, so their centroids also remain unchanged.
• Taking this as the termination criterion, the k-means algorithm stops here. The final
  clustering is therefore the same as in the previous iteration.



Clustering Evaluation Strategies
Clustering Tendency
• Check whether the dataset has clustering tendency, i.e., does not consist of uniformly
  distributed points
• If the data does not have clustering tendency, then the clusters identified by any
  state-of-the-art clustering algorithm may be irrelevant
• The Hopkins test, a statistical test for spatial randomness of a variable, can be used to
  measure the probability that the data points were generated by a uniform distribution
• Null hypothesis (H0): the data points are generated by a uniform, random distribution
  (no meaningful clusters)
• Alternative hypothesis (Ha): the data points are not generated by a uniform, random
  distribution (presence of clusters)
• If H > 0.5, the null hypothesis can be rejected, and it is very likely that the data
  contains clusters
• If H is close to 0, the dataset does not have clustering tendency



Clustering Evaluation Strategies … Contd.
Number of Clusters, k
• If k is too high, each point starts to represent its own cluster; if k is too low, data
  points are incorrectly grouped into too-coarse clusters
• Finding the optimal number of clusters gives the right granularity in clustering
• Finding right number of cluster depends upon
– Distribution shape
– Scale in the data set
– Clustering resolution required by user
• Major approaches to find optimal number of clusters
– Domain knowledge
– Data driven approach
• Empirical Method
• Elbow Method
• Statistical Method
Clustering Evaluation Strategies … Contd.
Clustering Quality
• Ideal clustering is characterised by minimal intra cluster distance and maximal inter
cluster distance
• Extrinsic Measures
– Require ground truth labels
– Examples are Adjusted Rand index, Fowlkes-Mallows scores, Mutual information based scores,
Homogeneity, Completeness, and V-measure
• Intrinsic Measures
– Do not require ground truth labels
– Some of the clustering performance measures are Silhouette Coefficient, Calinski-Harabasz Index,
Davies-Bouldin Index, etc.

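A sketch of one extrinsic and one intrinsic measure with scikit-learn; the data and clustering are illustrative:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score, silhouette_score

X, y_true = make_blobs(n_samples=300, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

print(adjusted_rand_score(y_true, labels))   # extrinsic: needs ground-truth labels
print(silhouette_score(X, labels))           # intrinsic: no ground truth required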


Research Problems
Community Detection
• Identifying similar groups of people in social media such as Facebook, LinkedIn, etc.
• Useful to detect frauds in social networks
Genetic Mapping
• Linkage groups combine genetic markers on a chromosome
• Useful to compute group-wise similarities
Image Segmentation
• Used in classification of pixels of an image
• Helps to identify image characteristics and its regions of similarity
• But it is not an easy task due to the variations in color coding and image complexities
Load Balancing
• Large amounts of data, processors, operating systems, software, and other components exist
  on virtual machines in cloud environments
• Focus on load balancing algorithms for big data will produce useful outcomes
Future Research Directions (figure)


Summary
Discussed about
• Introduction
• Why Data Science?
• Data Science Applications
• Data Analysis
– Data Collection
– Data Cleaning
– Data Integration
– Data Reduction
– Data Transformation
• Data Modelling
– Classification
– Clustering
• Research Problems
• Future Research Directions



Thank You!

