Data Analytics - Machine Learning Methods
Solution
Big Data: new driver for digital
economy & society
What is Big Data?
● Big Data can bring “big value” to almost every aspect of our lives.
● Technologically, Big Data is bringing about changes in our lives because it
allows diverse and heterogeneous data to be fully integrated and analyzed to
help us make decisions.
Application Areas
● Health and well-being
● Policy making and public opinion
● Smart cities and a more efficient society
● New online educational models: MOOCs and student-teacher modeling
● Robotics and human-robot interaction
In a go...
Supervised Learning
Classification
● Classification starts with a training set of prelabeled observations and learns
how the attributes of these observations contribute to the classification of
future unlabeled observations.
Example: Decision tree
● A tree structure to specify sequences of decisions and consequences
● Also called a prediction tree
● Given input X = {x1, x2, ..., xn}, the goal is to predict a response or output
variable Y. Each member of the set {x1, x2, ..., xn} is called an input variable.
The prediction can be achieved by constructing a decision tree with test
points and branches.
● Terminology:
○ A branch refers to the outcome of a decision
○ Internal (decision) nodes are the decision or test points. Each internal node refers to
an input variable or an attribute
○ The top internal node is called the root
○ Leaf nodes are at the end of the last branches on the tree. They represent class labels,
the outcome of all the prior decisions
○ The depth of a node is the minimum number of steps required to reach the node from the root.
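To make the terminology concrete, a very small decision tree can be written as nested tests. This is only an illustrative sketch: the input variables (outlook, windy), the class labels and the function name predict_tree are hypothetical and do not come from the slides.

```r
# Hypothetical tree with two input variables; all names and labels are
# illustrative assumptions.
predict_tree <- function(outlook, windy) {
  # Root node (depth 0): test on the input variable 'outlook'
  if (outlook == "sunny") {
    "play"                 # leaf node at depth 1
  } else {
    # Internal node (depth 1): test on the input variable 'windy'
    if (windy) {
      "stay"               # leaf node at depth 2
    } else {
      "play"               # leaf node at depth 2
    }
  }
}

predict_tree("sunny", windy = TRUE)
predict_tree("rainy", windy = TRUE)
```

Each if/else outcome corresponds to a branch, and each returned string is a leaf carrying a class label.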
Example: Decision tree - ID3 algorithm
● Iterative Dichotomiser 3 (ID3)
● Let A be a set of categorical input variables, P be the output variable (or the
predicted class), and T be the training set. The ID3 algorithm is:
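At each node, ID3 selects the input variable in A whose split gives the largest information gain, that is, the largest reduction in the entropy of P. The following is a minimal sketch of those two quantities in base R, not a full ID3 implementation; the toy training set and the helper names entropy() and info_gain() are assumptions made for illustration.

```r
# Shannon entropy (in bits) of a vector of class labels
entropy <- function(labels) {
  p <- table(labels) / length(labels)
  p <- p[p > 0]                 # drop empty classes to avoid log2(0)
  -sum(p * log2(p))
}

# Information gain obtained by splitting 'labels' on a categorical 'attribute'
info_gain <- function(attribute, labels) {
  groups  <- split(labels, attribute)
  weights <- sapply(groups, length) / length(labels)
  entropy(labels) - sum(weights * sapply(groups, entropy))
}

# Hypothetical training set T with two categorical inputs and output P
T_train <- data.frame(
  outlook = c("sunny", "sunny", "rainy", "rainy"),
  windy   = c("yes",   "no",    "yes",   "no"),
  P       = c("play",  "play",  "stay",  "stay")
)

gains <- sapply(T_train[c("outlook", "windy")], info_gain, labels = T_train$P)
gains
names(which.max(gains))   # the attribute chosen for the first split
```

In this toy data, outlook separates the classes perfectly while windy carries no information, so outlook is chosen as the root.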
Regression
● Explains the influence that a set of variables has on the outcome of another
variable of interest
● The outcome variable is called a dependent variable because the outcome
depends on the other variables (the independent or input variables)
● Regression analysis is a useful explanatory tool that can identify the input
variables that have the greatest statistical influence on the outcome.
● Applications of regression include:
○ Sales forecasting
○ Generating insights on consumer behaviour and understanding the business and the factors
influencing profitability
○ Medical research and predictive food microbiology, for example to describe the
bacterial growth/no growth interface
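As a minimal sketch of regression used as an explanatory tool, the base R function lm() fits a linear model. The built-in mtcars data set and the choice of mpg as outcome with wt and hp as inputs are assumptions made only for illustration.

```r
# Fit a linear regression: fuel economy (mpg) explained by weight and horsepower
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Estimated coefficients: an intercept plus one slope per input variable
coef(fit)

# Standard errors and p-values indicate which inputs have a
# statistically significant influence on the outcome
summary(fit)$coefficients
```

A negative slope for wt, for example, indicates that heavier cars tend to have lower fuel economy when horsepower is held fixed.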
Regression Models
Classification And Regression
● Classification is the task of predicting a discrete class label
● In a classification problem, data is labelled in two or more classes
● A classification problem with two classes is called binary classification, and
one with more than two classes is called multi-class classification
● Example: classifying an email as spam or non-spam is a classification problem
● Regression is the task of predicting a continuous quantity
● A regression problem requires the prediction of a quantity
● A regression problem with multiple input variables is called a multivariate
regression problem
● Example: predicting the price of a stock over a period of time is a
regression problem
Choosing a suitable classifier
Measuring Performance: Confusion Matrix
● Confusion matrix is a specific table layout that allows visualization of the
performance of a classifier.
● True positives (TP) are the number of positive instances the classifier
correctly identified as positive.
● False positives (FP) are the number of instances the classifier identified as
positive but that are in reality negative.
● True negatives (TN) are the number of negative instances the classifier
correctly identified as negative.
● False negatives (FN) are the number of instances the classifier identified as
negative but that are in reality positive.
Measuring Performance: Confusion Matrix (contd)
The accuracy (or the overall success rate) is a metric defining the rate at which a
model has classified the records correctly.
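In terms of the four counts of the confusion matrix, this definition can be written as:

```latex
\[
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
\]
```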
Confusion Matrix : Example
● There are two possible predicted
classes: "yes" and "no". If we
were predicting the presence of a
disease, for example, "yes" would
mean they have the disease, and
"no" would mean they don't have
the disease.
● The classifier made a total of 165
predictions (e.g., 165 patients
were being tested for the
presence of that disease).
● Out of those 165 cases, the
classifier predicted "yes" 110
times, and "no" 55 times.
● In reality, 105 patients in the
sample have the disease, and 60
patients do not.
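The totals listed above do not by themselves determine the four cells of the matrix. The sketch below therefore assumes, purely for illustration, that 100 of the 110 "yes" predictions are true positives; the remaining cells then follow from the stated totals. The variable names are likewise illustrative.

```r
n_total       <- 165   # total predictions (from the text)
predicted_yes <- 110   # times the classifier predicted "yes" (from the text)
actual_yes    <- 105   # patients who really have the disease (from the text)

TP <- 100                       # ASSUMPTION: this count is not stated in the text
FP <- predicted_yes - TP        # predicted "yes" but actually "no"
FN <- actual_yes - TP           # predicted "no" but actually "yes"
TN <- n_total - TP - FP - FN    # predicted "no" and actually "no"

confusion <- matrix(c(TN, FP, FN, TP), nrow = 2, byrow = TRUE,
                    dimnames = list(actual = c("no", "yes"),
                                    predicted = c("no", "yes")))
confusion

# Share of all predictions that were correct
accuracy <- (TP + TN) / n_total
accuracy
```

Under that assumption the row and column sums reproduce the totals given above (60 and 105 actual, 55 and 110 predicted).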
Clustering
● Clustering is the use of unsupervised techniques for grouping similar objects
● The data scientist does not determine, in advance, the labels to apply to the
clusters
● The structure of the data describes the objects of interest and determines how
best to group the objects
● Clustering methods find the similarities between objects according to the
object attributes and group the similar objects into clusters.
● Clustering techniques are utilized in marketing, economics, and various
branches of science.
Example: k-means
Given a collection of objects each with n measurable attributes, k-means is an
analytical technique that, for a chosen value of k, identifies k clusters of objects
based on the objects' proximity to the center of the k groups.
Example : k-means
Flowchart
Example : k-means (contd.)
Algorithm:
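A minimal sketch of the standard k-means iteration (often called Lloyd's algorithm) is given below, assuming Euclidean distance and randomly chosen initial centroids. The function name simple_kmeans and the synthetic data are hypothetical; in practice base R's kmeans() would normally be used instead.

```r
# Minimal k-means with Euclidean distance.
# x: numeric matrix with one object per row; k: number of clusters.
simple_kmeans <- function(x, k, max_iter = 100) {
  # Choose k distinct objects as the initial centroids
  centers <- x[sample(nrow(x), k), , drop = FALSE]
  cluster <- integer(nrow(x))
  for (iter in seq_len(max_iter)) {
    # Squared distance from every object to every centroid (n x k matrix)
    d <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2))
    # Assign each object to its nearest centroid
    new_cluster <- max.col(-d, ties.method = "first")
    # Stop once the assignments no longer change
    if (all(new_cluster == cluster)) break
    cluster <- new_cluster
    # Move each centroid to the mean of the objects assigned to it
    for (j in seq_len(k)) {
      members <- x[cluster == j, , drop = FALSE]
      if (nrow(members) > 0) centers[j, ] <- colMeans(members)
    }
  }
  list(cluster = cluster, centers = centers)
}

set.seed(1)
# Two well-separated hypothetical groups of 10 objects with 2 attributes each
x <- rbind(matrix(rnorm(20, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(20, mean = 5, sd = 0.3), ncol = 2))
fit <- simple_kmeans(x, k = 2)
table(fit$cluster)
```

The loop alternates between assigning objects to the nearest centroid and recomputing the centroids until the assignments are stable.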
Example : k-means (contd.)
To use k-means properly, it is important to do the following:
● Properly scale the attribute values to prevent certain attributes from dominating
the other attributes.
● Ensure that the concept of distance between the assigned values within an
attribute is meaningful.
● Choose the number of clusters, k, such that the sum of the Within Sum of
Squares (WSS) of the distances is reasonably minimized.
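The scaling and WSS points can be sketched with base R's scale() and kmeans(). The use of the built-in iris measurements, the range k = 1 to 6 and nstart = 25 are arbitrary choices made for illustration.

```r
# Standardize each attribute (mean 0, sd 1) so that no attribute dominates
x <- scale(iris[, 1:4])

set.seed(42)
# Total within-cluster sum of squares (WSS) for k = 1, ..., 6
wss <- sapply(1:6, function(k) kmeans(x, centers = k, nstart = 25)$tot.withinss)
data.frame(k = 1:6, wss = wss)

# Plotting wss against k, e.g. plot(1:6, wss, type = "b"), shows an "elbow":
# the value of k beyond which adding clusters reduces WSS only slightly
```

WSS always shrinks as k grows, so the aim is a k at which further reduction becomes small rather than the k with the absolute minimum.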
Association Rules
● The goal with association rules is to discover interesting relationships among
the items.
● The relationship occurs too frequently to be random and is meaningful from a
business perspective, which may or may not be obvious.
● Each of the uncovered rules is in the form X → Y, meaning that when item X is
observed, item Y is also observed. In this case, the left-hand side (LHS) of the
rule is X, and the right-hand side (RHS) of the rule is Y.
● Applications of association rules are:
○ Broad-scale approaches to better merchandising: what products should be included in or
excluded from the inventory each month
○ Cross-merchandising between products and high-margin or high-ticket items
○ Physical or logical placement of product within related categories of products
○ Promotional programs: multiple-product purchase incentives managed through a loyalty card
program
Example: Apriori Algorithm
● Support (occurrence frequency) of an itemset is the number of transactions
that contain the itemset.
● Min_sup: minimum support threshold
● Confidence: how often the rule has been found to be true
● Min_conf: minimum confidence threshold
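The two measures can be computed directly in base R. The sketch below illustrates support and confidence only, not the Apriori candidate-generation procedure itself; the transactions, thresholds and helper names support_count() and confidence() are hypothetical.

```r
# Hypothetical market-basket data: one character vector of items per transaction
transactions <- list(
  c("bread", "milk"),
  c("bread", "butter"),
  c("bread", "milk", "butter"),
  c("milk")
)

# Support: number of transactions that contain every item of the itemset
support_count <- function(itemset, transactions) {
  sum(sapply(transactions, function(tr) all(itemset %in% tr)))
}

# Confidence of the rule lhs -> rhs: support(lhs and rhs together) / support(lhs)
confidence <- function(lhs, rhs, transactions) {
  support_count(c(lhs, rhs), transactions) / support_count(lhs, transactions)
}

support_count(c("bread", "milk"), transactions)
confidence("bread", "milk", transactions)

# A rule is kept only if it meets both thresholds (values chosen arbitrarily)
min_sup  <- 2
min_conf <- 0.6
support_count(c("bread", "milk"), transactions) >= min_sup &&
  confidence("bread", "milk", transactions) >= min_conf
```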
install.packages("party")
The package "party" provides the function ctree(), which is used to create and analyze decision trees.
Syntax:
ctree(formula, data)
Where formula describes the response and predictor variables and data is the data set used.
print(relation)
Functions in R : Logistic regression
Syntax:
glm(formula, data, family)
Example:
input <- mtcars[,c("am","cyl","hp","wt")]
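One way the example might be completed, as a sketch: fit a logistic regression of the transmission type am on the other three selected columns by passing family = binomial. The model formula and the object name am_model are assumptions, since the source shows only the data selection step.

```r
# Select the outcome (am: 0 = automatic, 1 = manual) and three predictors
input <- mtcars[, c("am", "cyl", "hp", "wt")]

# family = binomial requests logistic regression (logit link by default)
am_model <- glm(am ~ cyl + hp + wt, data = input, family = binomial)

# Coefficients are reported on the log-odds scale
summary(am_model)$coefficients

# Predicted probability that each car has a manual transmission
head(predict(am_model, type = "response"))
```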