
Data Analytics

Machine Learning Methods

OE4201: Big Data and Analytics

Compiled by: Dr. Neeta Maitre


Agenda
*** Please note this presentation is meant only for reference

● Need and what is big data?


● In a go.. Where we can put what?
● Supervised Vs Unsupervised
● Classification
● Regression
● Clustering
● Association Rules
● Time Series analysis
● Comparison table
● Functions in R
Need
The world's technological per-capita capacity to store information has roughly
doubled every 40 months since the 1980s.

Solution
Big Data: new driver for digital
economy & society
What is Big Data?
● Big Data can bring “big values” to our lives in almost every aspect.
● Technologically, Big Data is bringing about changes in our lives because it
allows diverse and heterogeneous data to be fully integrated and analyzed to
help us make decisions.
Application Areas
● Health and well-being
● Policy making and public opinions
● Smart cities and a more efficient society
● New online educational models: MOOC and Student-Teacher modeling
● Robotics and human-robot interaction
In a go...
Supervised
Learning
Classification
● It starts with a training set of prelabeled observations to learn how likely the
attributes of these observations may contribute to the classification of future
unlabeled observations
Example: Decision tree
● A tree structure to specify sequences of decisions and consequences
● Also called a prediction tree
● Given input X = {x1, x2, ..., xn}, the goal is to predict a response or output
variable Y. Each member of the set {x1, x2, ..., xn} is called an input variable.
The prediction can be achieved by constructing a decision tree with test
points and branches.
● Terminologies:
○ Branch refers to the outcome of a decision
○ Internal (decision) nodes are the decision or test points. Each internal node refers to
an input variable or an attribute
○ The top internal node is called the root
○ Leaf nodes are at the end of the last branches on the tree. They represent class labels:
the outcome of all the prior decisions
○ Depth of a node is the minimum number of steps required to reach the node from the root.
Example: Decision tree
Example: Decision tree- ID3 algorithm
● Iterative Dichotomiser 3 (ID3)
● Let A be a set of categorical input variables, P be the output variable (or the
predicted class), and T be the training set. The ID3 algorithm is :
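The algorithm listing itself is not reproduced here. As a partial illustration only: ID3 picks, at each node, the attribute in A with the highest information gain (the reduction in entropy of P after splitting T on that attribute). The sketch below computes that quantity in base R. The function names `entropy` and `info_gain` and the toy data set are assumptions made up for this example, not part of the original slides.

```r
# Entropy (in bits) of a vector of class labels.
entropy <- function(labels) {
  p <- table(labels) / length(labels)
  p <- p[p > 0]                      # drop empty classes so log2(0) never occurs
  -sum(p * log2(p))
}

# Information gain from splitting `labels` on a categorical `attribute`.
info_gain <- function(attribute, labels) {
  groups  <- split(labels, attribute, drop = TRUE)
  weights <- sapply(groups, length) / length(labels)
  entropy(labels) - sum(weights * sapply(groups, entropy))
}

# Hypothetical toy training set T (made up for illustration).
toy <- data.frame(
  outlook = c("sunny", "sunny", "rain", "rain", "overcast", "overcast"),
  windy   = c("yes", "no", "yes", "no", "yes", "no"),
  play    = c("no", "no", "no", "yes", "yes", "yes")
)

# Gain of each candidate attribute with respect to the output variable `play`.
gains <- sapply(toy[c("outlook", "windy")], info_gain, labels = toy$play)
best_attribute <- names(which.max(gains))   # attribute ID3 would split on first
print(gains)
print(best_attribute)
```

In the full algorithm this selection step is repeated recursively on each resulting subset until the subset is pure or no attributes remain.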
Regression
● Explains the influence that a set of variables has on the outcome of another
variable of interest
● Outcome variable is called a dependent variable because the outcome
depends on the other variable (independent /input variables)
● Regression analysis is a useful explanatory tool that can identify the input
variables that have the greatest statistical influence on the outcome.
● Applications of regression are:
○ Sales forecast
○ generate insights on consumer behaviour, understanding business and factors influencing
profitability
○ widely used in medical research, in the field of predictive food microbiology, to describe
bacterial growth/no growth interface
Regression Models
Thus,
Classification And Regression
Classification
● Classification is the task of predicting a discrete class label
● In a classification problem, data is labelled in two or more classes
● A classification problem with two classes is called binary classification and
with more than two classes is called multi-class classification
● Example: classifying an email as spam or non-spam is classification
Regression
● Regression is the task of predicting a continuous quantity
● A regression problem requires a prediction of a quantity
● A regression problem with multiple input variables is called a multiple
(often loosely, multivariate) regression problem
● Example: predicting the price of a stock over a period of time is a
regression problem
Choosing a suitable classifier
Measuring Performance: Confusion Matrix
● Confusion matrix is a specific table layout that allows visualization of the
performance of a classifier.
● True positives (TP) are the number of positive instances the classifier
correctly identified as positive.
● False positives (FP) are the number of instances the classifier identified as
positive but that in reality are negative.
● True negatives (TN) are the number of negative instances the classifier
correctly identified as negative.
● False negatives (FN) are the number of instances classified as negative but
that in reality are positive.
Measuring Performance: Confusion Matrix (contd)
The accuracy (or the overall success rate) is a metric defining the rate at which a
model has classified the records correctly.
Measuring Performance: Confusion Matrix (contd)
Confusion Matrix : Example
● There are two possible predicted
classes: "yes" and "no". If we
were predicting the presence of a
disease, for example, "yes" would
mean they have the disease, and
"no" would mean they don't have
the disease.
● The classifier made a total of 165
predictions (e.g., 165 patients
were being tested for the
presence of that disease).
● Out of those 165 cases, the
classifier predicted "yes" 110
times, and "no" 55 times.
● In reality, 105 patients in the
sample have the disease, and 60
patients do not.
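The totals above do not fix the four cells of the matrix on their own, so the sketch below ASSUMES 100 true positives (a figure not stated in the text) and derives the remaining cells from the stated totals. It then computes accuracy as (TP + TN) / (TP + TN + FP + FN). The variable names are illustrative.

```r
# Totals stated in the example.
n_total       <- 165
predicted_yes <- 110
actual_yes    <- 105

# ASSUMPTION (not stated in the text): 100 true positives.
TP <- 100
FP <- predicted_yes - TP          # predicted "yes" but actually "no"
FN <- actual_yes - TP             # predicted "no" but actually "yes"
TN <- n_total - TP - FP - FN      # everything else

# Rows are the actual class, columns the predicted class.
confusion <- matrix(c(TN, FP, FN, TP), nrow = 2, byrow = TRUE,
                    dimnames = list(actual = c("no", "yes"),
                                    predicted = c("no", "yes")))
print(confusion)

# Accuracy: share of all predictions that were correct.
accuracy <- (TP + TN) / n_total
print(round(accuracy, 3))
```

Under that assumption the column sums reproduce the 55 "no" and 110 "yes" predictions, and the row sums reproduce the 60 and 105 actual cases.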
Clustering
● Use of unsupervised techniques for grouping similar objects
● Data scientist does not determine, in advance, the labels to apply to the
clusters
● Structure of the data describes the objects of interest and determines how
best to group the objects
● Clustering methods find the similarities between objects according to the
object attributes and group the similar objects into clusters.
● Clustering techniques are utilized in marketing, economics, and various
branches of science.
Example : k-means
Given a collection of objects each with n measurable attributes, k-means is an
analytical technique that, for a chosen value of k, identifies k clusters of objects
based on the objects' proximity to the center of the k groups
Example : k-means
Flowchart
Example : k-means (contd.)
Algorithm:
Example : k-means (contd.)
To use k-means properly, it is important to do the following:

• Properly scale the attribute values to prevent certain attributes from dominating
the other attributes.

• Ensure that the concept of distance between the assigned values within an
attribute is meaningful.

• Choose the number of clusters, k, such that the sum of the Within Sum of
Squares (WSS) of the distances is reasonably minimized.
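The three recommendations above can be sketched with the `kmeans()` function from base R's stats package. The built-in `iris` measurements, the range k = 1..8, and the final choice k = 3 are assumptions made for illustration only.

```r
set.seed(42)                                  # k-means starts from random centers

# Built-in iris measurements stand in for "objects with n attributes".
features <- scale(iris[, 1:4])                # center and scale each attribute

# Total within-cluster sum of squares (WSS) for k = 1..8.
wss <- sapply(1:8, function(k) {
  kmeans(features, centers = k, nstart = 25)$tot.withinss
})
print(round(wss, 1))

# Look for the "elbow" where adding clusters stops reducing WSS by much.
plot(1:8, wss, type = "b", xlab = "Number of clusters k", ylab = "WSS")

# Fit the final model; k = 3 is chosen here purely for illustration.
fit <- kmeans(features, centers = 3, nstart = 25)
print(table(fit$cluster))                     # cluster sizes
```

Scaling matters here because the raw attributes are on different ranges; without it, the attribute with the largest spread would dominate the Euclidean distances.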
Association Rules
● The goal with association rules is to discover interesting relationships among
the items.
● The relationship occurs too frequently to be random and is meaningful from a
business perspective, which may or may not be obvious.
● Each of the uncovered rules is in the form X → Y, meaning that when item X is
observed, item Y is also observed. In this case, the left-hand side (LHS) of the
rule is X, and the right-hand side (RHS) of the rule is Y.
● Applications of association rules are:
○ Broad-scale approaches to better merchandising: what products should be included in or
excluded from the inventory each month
○ Cross-merchandising between products and high-margin or high-ticket items
○ Physical or logical placement of product within related categories of products
○ Promotional programs: multiple-product purchase incentives managed through a loyalty card
program
Example: Apriori Algorithm
● Support (occurrence frequency) of an itemset is the number of transactions
that contain the itemset.
● Min_sup: minimum support threshold
● Confidence: how often the rule has been found to be true
● Min_conf: minimum confidence threshold

“Rules that satisfy both a minimum support threshold (min_sup) and a minimum
confidence threshold (min_conf) are called strong”
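A minimal base R sketch of these measures follows. It takes the confidence of a rule X → Y to be the support of X and Y together divided by the support of X, which is the usual way to quantify "how often the rule has been found to be true". The transactions, the thresholds, and the helper names `support_count` and `confidence` are all hypothetical.

```r
# Hypothetical market-basket transactions (made up for illustration).
transactions <- list(
  c("bread", "milk"),
  c("bread", "butter", "milk"),
  c("bread", "butter"),
  c("milk"),
  c("bread", "butter", "jam")
)

# Support count: number of transactions containing every item of the itemset.
support_count <- function(itemset) {
  sum(sapply(transactions, function(items) all(itemset %in% items)))
}

# Confidence of the rule lhs -> rhs: support(lhs and rhs together) / support(lhs).
confidence <- function(lhs, rhs) {
  support_count(c(lhs, rhs)) / support_count(lhs)
}

min_sup  <- 2      # assumed minimum support threshold (as a count)
min_conf <- 0.7    # assumed minimum confidence threshold

sup  <- support_count(c("bread", "butter"))
conf <- confidence("bread", "butter")

# The rule bread -> butter is "strong" only if it clears both thresholds.
is_strong <- sup >= min_sup && conf >= min_conf
print(c(support = sup, confidence = conf))
print(is_strong)
```

This only scores one candidate rule. Apriori itself additionally prunes the search by extending only itemsets that already meet min_sup; for real data, the `apriori()` function in the arules package is a common choice.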
Time series analysis
● Time series analysis attempts to model the underlying structure of
observations taken over time.
● A time series, denoted y_t for t = 1, 2, ..., n, is an ordered sequence of
equally spaced values over time.
● Time series analysis has many applications in finance, economics, biology,
engineering, retail, and manufacturing.
○ Retail sales: For various product lines, a clothing retailer is looking to forecast future monthly
sales. An appropriate time series model needs to account for fluctuating demand over the
calendar year.
○ Spare parts planning: Companies' service organizations have to forecast future spare part
demands to ensure an adequate supply of parts to repair customer products. Time series
analysis can provide accurate short-term forecasts based simply on prior spare part demand
history.
Time series analysis : components
● Trend - The long-term movement in a time series. It indicates whether the
observation values are increasing or decreasing over time. For example, a
steady increase in sales month over month.
● Seasonality - The fixed, periodic fluctuation in the observations over time.
As the name suggests, the seasonality component is often related to the
calendar. For example, monthly retail sales can fluctuate over the year due
to the weather and holidays.
● Cyclic - A periodic fluctuation, but one that is not as fixed as in the case
of a seasonality component. For example, retail sales are influenced by the
general state of the economy. Thus, a retail sales time series can often
follow the lengthy boom-bust cycles of the economy.
● Random - After accounting for the other three components, the random
component is what remains.
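As a rough illustration of these components, base R's `decompose()` splits a series into trend, seasonal, and random parts. It does not estimate a separate cyclic component; any cyclic behaviour ends up in the moving-average trend or in the remainder. The built-in `AirPassengers` series and the additive model are assumptions chosen for the example.

```r
# AirPassengers: monthly airline passenger counts shipped with R (frequency 12).
parts <- decompose(AirPassengers, type = "additive")

# Each component is itself a time series of the same length as the input.
print(head(parts$trend, 12))     # NA at the edges: the moving average needs neighbours
print(parts$figure)              # the 12 estimated seasonal effects, one per month

# For an additive model: observed = trend + seasonal + random.
reconstructed <- parts$trend + parts$seasonal + parts$random

# Plot the observed series and its three estimated components.
plot(parts)
```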
Time series analysis : Box-Jenkins methodology
Time series analysis : Models
● ARIMA (Autoregressive Integrated Moving Average) is a class of models that
explains a given time series based on its own past values, that is, its own
lags and the lagged forecast errors, so that the resulting equation can be
used to forecast future values.
● Spectral analysis is commonly used for signal processing and other
engineering applications. Example: speech recognition software.
● Generalized Autoregressive Conditionally Heteroscedastic (GARCH) is a
useful model for addressing time series with nonconstant variance or volatility.
GARCH is used for modeling stock market activity and price fluctuations.
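For the ARIMA case, a minimal sketch using `arima()` and `predict()` from base R's stats package is shown below. The data set (`AirPassengers`), the log transform, and the model orders are assumptions picked for illustration, not a tuned or recommended specification.

```r
# AirPassengers: monthly airline passenger counts shipped with R.
series <- log(AirPassengers)          # log transform to stabilize the variance

# Seasonal ARIMA(0,1,1)(0,1,1)[12]; the orders are an assumption, not a tuned choice.
fit <- arima(series, order = c(0, 1, 1),
             seasonal = list(order = c(0, 1, 1), period = 12))
print(fit)

# Forecast the next 12 months; `pred` holds point forecasts, `se` their standard errors.
forecast <- predict(fit, n.ahead = 12)

# Undo the log transform to get forecasts on the original scale.
print(round(exp(forecast$pred)))
```

In practice the orders would be chosen by inspecting the autocorrelation plots or comparing candidate models, as the Box-Jenkins methodology prescribes.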
Time series analysis : Models (contd)
● Kalman filtering is useful for analyzing real-time inputs about a system that
can exist in certain states. For example, a Kalman filter in a vehicle navigation
system can process various inputs, such as speed and direction, and update
the estimate of the current location.
● Multivariate time series analysis examines multiple time series and their
effect on each other. For example, marketing analyses that examine the time
series related to a company's price and sales volume as well as related time
series for the competitors.
Comparison- to sum up
Functions in R : Decision Tree
Prerequisite:

install.packages("party")
The package "party" provides the function ctree(), which is used to create and analyze a decision tree.

Syntax:

ctree(formula, data)
Where

● formula is a formula describing the predictor and response variables.
● data is the name of the data set used.
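A short usage sketch follows, assuming the party package is installed. The built-in `iris` data set and the formula `Species ~ .` are choices made for this illustration. The code is guarded so that it prints a hint rather than failing when the package is missing.

```r
if (requireNamespace("party", quietly = TRUE)) {
  library(party)

  # Predict the species from all four measurements.
  tree <- ctree(Species ~ ., data = iris)
  print(tree)

  # Compare predictions on the training data with the known labels.
  predicted <- predict(tree, newdata = iris)
  print(table(actual = iris$Species, predicted = predicted))
} else {
  message("Package 'party' is not installed; run install.packages(\"party\") first.")
}
```

Evaluating on the training data, as here, overstates performance; a held-out test set would give a fairer confusion matrix.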
Functions in R : Linear regression
Syntax:

lm(formula, data)
Where

● formula is a symbol representing the relation between x and y.
● data is the data set (typically a data frame) containing the variables to
which the formula will be applied.

Example

x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)

# Apply the lm() function.
relation <- lm(y ~ x)

print(relation)
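Once fitted, the model can be used to predict y for a new x with `predict()`. The sketch below refits the same model so that it runs on its own; the new value x = 170 is an arbitrary choice for illustration.

```r
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)

relation <- lm(y ~ x)

# Intercept and slope of the fitted line.
print(coef(relation))

# New data must be a data frame whose column name matches the predictor.
new_point <- data.frame(x = 170)
predicted <- predict(relation, newdata = new_point)
print(predicted)
```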
Functions in R : Logistic regression
Syntax:

glm(formula, data, family)
Where

● formula is the symbol presenting the relationship between the variables.
● data is the data set giving the values of these variables.
● family is an R object that specifies the details of the model. Its value is
binomial for logistic regression.

Example

# Select the response (am) and predictors from the built-in mtcars data set.
input <- mtcars[, c("am", "cyl", "hp", "wt")]

am.data <- glm(formula = am ~ cyl + hp + wt, data = input, family = binomial)

print(summary(am.data))
Functions in R : Time Series Analysis
Syntax:

timeseries.object.name <- ts(data, start, end, frequency)
Where

● data is a vector or matrix containing the values used in the time series.
● start specifies the start time for the first observation in time series.
● end specifies the end time for the last observation in time series.
● frequency specifies the number of observations per unit time.

Example

# Get the data points in form of a R vector.
rainfall <- c(799, 1174.8, 865.1, 1334.6, 635.4, 918.5, 685.5, 998.6, 784.2, 985, 882.8, 1071)

# Convert it to a time series object.
rainfall.timeseries <- ts(rainfall, start = c(2012, 1), frequency = 12)

# Print the timeseries data.
print(rainfall.timeseries)

# Give the chart file a name.
png(file = "rainfall.png")

# Plot a graph of the time series.
plot(rainfall.timeseries)

# Close the graphics device so the file is written.
dev.off()
Major References
● Data Science & Big Data Analytics: Discovering, Analyzing, Visualizing and
Presenting Data by EMC Education Services (Wiley Publications)
● Data Mining Concepts and Techniques (Third Edition) by Jiawei Han,
Micheline Kamber
● www.tutorialspoint.com
● Other web resources
