BCSE352E EDA CAT 2 Mod 1,2,5 PDF
Data Analytics
Topics in Module-1
• Linear regression: simple linear regression
• Regression Modelling
• Correlation
• ANOVA
• Time Series Forecasting
• Autocorrelation
Regression predicts a numerical value, instead of a “class”.
[Figure: types of analysis: Classification, Clustering, Regression, Associative]
Module-1 Topic-1: Linear regression
What is Regression?
Assumptions of Regression analysis
• Linear regression with standard estimation technique makes numerous
assumptions about the independent variables and dependent variables.
• Following is the list of major assumptions made by linear regression model:
• Linearity: The linear regression model assumes that the dependent variable is
a linear combination of the regression coefficients and independent
variables.
• Lack of perfect multicollinearity in the independent variables:
multicollinearity causes a problem for linear regression models when estimating
the relationship between a dependent variable and independent variables, because
correlated independent variables change simultaneously.
• Multicollinearity reduces the power of linear regression models to
identify significantly important independent variables.
• Constant variance (homoscedasticity): different values of the dependent
variable have the same variance in their errors.
Regression analysis
The regression equation is given by $\hat{y} = ax + b$, where a is the slope and b is the intercept.
Module-1 Topic-2: Regression Modelling
Mathematics behind Regression analysis
Step-1: $a = r\,\dfrac{s_y}{s_x}$, where r = correlation coefficient of x and y, $s_y$ = standard deviation of y, $s_x$ = standard deviation of x
Step-2: $y = ax + b \;\Rightarrow\; b = \bar{y} - a\bar{x}$
Step-3: $b = \bar{y} - r\,\dfrac{s_y}{s_x}\,\bar{x}$
Step-4: $\hat{y} = ax + b = r\,\dfrac{s_y}{s_x}\,x + \bar{y} - r\,\dfrac{s_y}{s_x}\,\bar{x}$
Step-5: $\hat{y} = r\,\dfrac{s_y}{s_x}\,(x - \bar{x}) + \bar{y}$
• A low correlation coefficient gives a flatter slope (small value of a).
• Large spread of y, i.e. high standard deviation, results in a steeper slope (high value of a).
• Large spread of x, i.e. high standard deviation, results in a flatter slope (small value of a).
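The steps above can be checked numerically. The following is a minimal sketch, assuming a small made-up data set (`xs`, `ys` are illustrative values, not taken from the slides); the helper names `pearson_r` and `fit_line` are hypothetical.

```python
from statistics import mean, stdev

def pearson_r(xs, ys):
    """Correlation coefficient r of x and y."""
    mx, my = mean(xs), mean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = (sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys)) ** 0.5
    return num / den

def fit_line(xs, ys):
    """Slope and intercept via Step-1 and Step-2."""
    r = pearson_r(xs, ys)
    a = r * stdev(ys) / stdev(xs)   # Step-1: a = r * s_y / s_x
    b = mean(ys) - a * mean(xs)     # Step-2: b = y_bar - a * x_bar
    return a, b

# Illustrative data (assumption, not from the slides)
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b = fit_line(xs, ys)
print(f"y_hat = {a:.2f} * x + {b:.2f}")
```

By Step-5, the fitted line always passes through the point of means, so predicting at x = mean(xs) returns mean(ys).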
Regression
[Figure: regression line with its 95% confidence band; the error is the gap between an observed value and the predicted value]
• If data points fall outside the 95% confidence band, outliers are present,
i.e. some data points may be predicted better or worse than others.
• Outliers have either a very low or a very high value in comparison with the
other observed values, which may hamper the result.
Model Estimation and Evaluation
Module-1 Topic-3: Correlation
Correlation
• Both covariance and correlation measure linear relationships between
variables. Examples: relationship between height and weight of children and
relationship between speed and weight of cars, etc.
• Since covariance is affected by a change in scale, it can take values between
−∞ and ∞. However, the correlation coefficient always lies between -1 and 1,
and it can be used to make statements and compare correlations.
• When the correlation coefficient is positive, an increase in one variable is
associated with an increase in the other.
• When the correlation coefficient is negative, an increase in one variable is
associated with a decrease in the other (i.e. the change happens in the opposite direction).
• A zero correlation coefficient indicates there is no linear relationship between the
two variables.
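The scale sensitivity of covariance can be seen in a short sketch. The height and weight values below are made up for illustration, and the function names are hypothetical: converting heights from metres to centimetres multiplies the covariance by 100 but leaves the correlation coefficient unchanged.

```python
from statistics import mean

def covariance(xs, ys):
    """Sample covariance of x and y."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def correlation(xs, ys):
    """Covariance rescaled by both standard deviations; lies in [-1, 1]."""
    return covariance(xs, ys) / (covariance(xs, xs) * covariance(ys, ys)) ** 0.5

# Illustrative data (assumption): children's heights and weights
heights_m = [1.10, 1.20, 1.30, 1.40, 1.50]
weights_kg = [19.0, 22.0, 27.0, 31.0, 36.0]
heights_cm = [h * 100 for h in heights_m]   # same data, different scale

print(covariance(heights_m, weights_kg), covariance(heights_cm, weights_kg))
print(correlation(heights_m, weights_kg), correlation(heights_cm, weights_kg))
```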
Basics of Scattergrams
[Figure: scattergrams of Y against X]
• ANOVA tests for differences among population means by examining the amount of
variation within each sample, relative to the amount of variation between the samples.
• Analysis of variance tests the hypothesis that the means of two or more populations are equal.
Module-1 Topic-4: ANOVA
When to use Analysis of Variance (ANOVA)
How to use Analysis of Variance (ANOVA)
Analysis of Variances (ANOVA)
• Assumptions made:
(i) Samples are independent and randomly drawn from respective populations,
(ii) Populations are normally distributed, and
(iii) Variances of the population are equal
F0 = MSB / MSW (mean square between groups divided by mean square within groups)
Sum-of-Squares-Treatments (SSTR)
Sum-of-Squares-Error (SSE)
Sum-of-Squares-Total (SST)
• SST gives the overall variation in the data, SSTR gives the part of the variation within the data due to
differences among the groups, and SSE gives the part of the variation within the data due to error.
• Note that SST = SSTR + SSE
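A minimal sketch of these quantities for a one-way layout follows. The three groups are made-up values, not the data of the solved example, and `one_way_anova` is a hypothetical helper. The mean squares are obtained by dividing SSTR by k − 1 and SSE by n − k, where k is the number of groups and n the total number of observations.

```python
def one_way_anova(groups):
    """Return SSTR, SSE, SST and the F statistic for a list of samples."""
    all_values = [v for g in groups for v in g]
    n, k = len(all_values), len(groups)
    grand_mean = sum(all_values) / n
    # Between-group variation: group means around the grand mean
    sstr = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Within-group variation: observations around their own group mean
    sse = sum((v - sum(g) / len(g)) ** 2 for g in groups for v in g)
    # Total variation: observations around the grand mean
    sst = sum((v - grand_mean) ** 2 for v in all_values)
    f0 = (sstr / (k - 1)) / (sse / (n - k))
    return {"SSTR": sstr, "SSE": sse, "SST": sst, "F0": f0}

# Illustrative data (assumption): three groups of three observations
result = one_way_anova([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(result)
```

The computed F0 is then compared with the critical value of the F distribution for (k − 1, n − k) degrees of freedom.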
ANOVA-Solved Example
Three different techniques, namely medication, exercise and a special diet, are randomly assigned to
individuals diagnosed with high blood pressure to lower their blood pressure. After four weeks the reduction
in each person’s blood pressure is recorded. Test at the 5% level (level of significance α = 0.05) whether there
is a significant difference in the mean reduction of blood pressure among the three techniques.
Step 1 : Hypotheses
Null hypothesis H0: μ1 = μ2 = μ3, i.e. the mean reduction in blood pressure is the same for all three techniques.
Alternative hypothesis H1: μi ≠ μj for at least one pair (i, j).
That is, there is a significant difference in the average reduction in blood pressure in at least one pair of
treatments.
Test statistic F0 = MST / MSE
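Step 6 : Critical value
$f_{(2,\,12),\,0.05} = 3.89$ (numerator degrees of freedom 2, denominator degrees of freedom 12).
Step 7 : Decision
As $F_0 = 9.17 > f_{(2,\,12),\,0.05} = 3.89$, the null hypothesis is rejected: the mean reduction in blood
pressure differs significantly among the three techniques at the 5% level.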
Module-1 Topic-5: Time Series Forecasting
Time Series Forecasting
• Time series forecasting means to forecast or to predict the future value over a period of time.
• It entails developing models based on previous data and applying them to make observations and guide future
strategic decisions.
• Key factors to be considered while using time series forecasting are
• Volume of data available — more data is often more helpful, offering greater opportunity for exploratory
data analysis, model testing and tuning, and model fidelity.
• Required time horizon of predictions — shorter time horizons are often easier to predict — with higher
confidence — than longer ones.
• Forecast update frequency — Forecasts might need to be updated frequently over time or might need to be
made once and remain static (updating forecasts as new information becomes available often results in
more accurate predictions).
• Forecast temporal frequency — Often forecasts can be made at lower or higher frequencies, which allows
harnessing downsampling and up-sampling of data (this in turn can offer benefits while modeling).
1) Seasonality: A pattern that repeats over a fixed period. For example, in a particular domain the output
value peaks in certain months compared with the other months.
2) Trend: A sustained increase or decrease in the time series, for example the value of an organization
or its sales rising or falling over a period of time.
3) Unexpected Events: Dynamic changes in an organization or in the market that cannot be captured
by the model.
Time Series Forecasting-ARIMA Model
• Autoregressive integrated moving average (ARIMA) models predict future values based on past values.
• ARIMA makes use of lagged moving averages to smooth time series data.
• They are widely used in technical analysis to forecast future security prices.
• Autoregressive models implicitly assume that the future will resemble the past.
For ARIMA models, a standard notation would be ARIMA with p, d, and q, where integer values
substitute for the parameters to indicate the type of ARIMA model used. The parameters can be defined as:
p: the number of lag observations in the model, also known as the lag order.
d: the number of times the raw observations are differenced; also known as the degree of differencing.
q: the size of the moving average window, also known as the order of the moving average.
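To make p and d concrete, the sketch below differences a series d times and then builds the lagged observations that an autoregressive part of order p would use as inputs. It is not a full ARIMA fit; the `sales` values and both function names are assumptions for illustration only.

```python
def difference(series, d=1):
    """Apply first differencing d times (the 'integrated' part, parameter d)."""
    for _ in range(d):
        series = [b - a for a, b in zip(series, series[1:])]
    return series

def lagged_rows(series, p):
    """Pair each value with its p preceding values (lag order p)."""
    return [(series[t - p:t], series[t]) for t in range(p, len(series))]

# Illustrative series (assumption): sales with an accelerating trend
sales = [10, 12, 15, 19, 24, 30]
once = difference(sales, d=1)    # [2, 3, 4, 5, 6]
twice = difference(sales, d=2)   # [1, 1, 1, 1]
rows = lagged_rows(once, p=2)    # inputs and target for a lag order of 2
print(once, twice, rows)
```

Here the trend disappears after two rounds of differencing, which is the situation in which d = 2 would be a natural choice.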
Module-1 Topic-6: Autocorrelation
Autocorrelation
• Autocorrelation, also known as serial correlation, refers to the degree of correlation of the same
variables between two successive time intervals.
• The value of autocorrelation ranges from -1 to 1.
• A value between -1 and 0 represents negative autocorrelation. A value between 0 and 1 represents
positive autocorrelation.
[Figure: three panels showing Negative Autocorrelation, Positive Autocorrelation and No Autocorrelation]
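The sign of the autocorrelation can be illustrated with a small sketch. The estimator below is the usual sample autocorrelation at a given lag; the two series are made up, and the function name is hypothetical.

```python
def autocorrelation(series, lag=1):
    """Sample autocorrelation between the series and itself shifted by `lag`."""
    n = len(series)
    m = sum(series) / n
    denom = sum((x - m) ** 2 for x in series)
    num = sum((series[t] - m) * (series[t - lag] - m) for t in range(lag, n))
    return num / denom

# Illustrative series (assumption)
trending = [1, 2, 3, 4, 5, 6, 7, 8]           # neighbours move together
alternating = [1, -1, 1, -1, 1, -1, 1, -1]    # neighbours move in opposite directions

r_trend = autocorrelation(trending, lag=1)
r_alt = autocorrelation(alternating, lag=1)
print(r_trend, r_alt)
```

A steadily rising series gives a positive lag-1 value, while a series that flips sign at every step gives a negative one.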
Module-1 Summary
• Regression: Regression analysis is a set of statistical processes for estimating
the relationships between a dependent variable and one or more independent variables.
• Correlation: Measures the linear relationship between 2 variables
• ANOVA: Compares the means of two or more populations (uses the F-statistic)
• Time Series Forecasting: Analysis and prediction of time-based data
• Autocorrelation: Measures the linear relationship between a series and its lagged values
Topics in Module-2: Classification
• Logistic Regression
• Decision Trees
• Random Forest
• SVM Classifier
Module-2 Introduction to Classification
What is Classification?
• The Classification algorithm is a Supervised Learning technique that is used to identify the
category of new observations on the basis of training data.
• In Classification, a program learns from the given dataset or observations and then classifies new
observation into a number of classes or groups.
Types of Classification?
• Types of Classifiers: The algorithm which implements the classification on a dataset is known as a
classifier. There are two types of classifiers:
• Binary Classifier: If the classification problem has only two possible outcomes, it is
called a Binary Classifier. Examples: YES or NO, MALE or FEMALE, SPAM or NOT
SPAM, CAT or DOG, etc.
• Multi-class Classifier: If a classification problem has more than two outcomes, it is called
a Multi-class Classifier. Examples: classification of types of crops, classification of types of
music.
• Types of learners: In classification problems, there are two types of learners:
• Lazy Learners: A lazy learner first stores the training dataset and waits until it receives the
test dataset. Classification is then done on the basis of the most related data
stored in the training dataset. It takes less time in training but more time for predictions.
• Example: K-NN algorithm, Case-based reasoning
• Eager Learners: Eager learners develop a classification model based on a training dataset
before receiving a test dataset. In contrast to lazy learners, an eager learner takes more time in
learning and less time in prediction.
• Example: Decision Trees, Naïve Bayes, ANN.
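The lazy-learner behaviour can be seen in a minimal K-NN sketch: `fit` only stores the data, and all the work happens at prediction time. The class name, the toy points and the labels are assumptions for illustration.

```python
from collections import Counter

class KNNClassifier:
    """Minimal k-nearest-neighbours classifier (a lazy learner)."""

    def __init__(self, k=3):
        self.k = k

    def fit(self, X, y):
        # "Training" just stores the dataset; no model is built.
        self.X, self.y = list(X), list(y)
        return self

    def predict_one(self, x):
        # All computation happens here: squared distance to every stored point.
        dists = sorted(
            (sum((a - b) ** 2 for a, b in zip(row, x)), label)
            for row, label in zip(self.X, self.y)
        )
        votes = Counter(label for _, label in dists[: self.k])
        return votes.most_common(1)[0][0]

# Illustrative data (assumption): two well-separated groups
X = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
y = ["A", "A", "A", "B", "B", "B"]
model = KNNClassifier(k=3).fit(X, y)
print(model.predict_one((2, 2)), model.predict_one((9, 9)))
```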
• Types of classification:
• Supervised: The set of possible classes is known in advance.
• Unsupervised: Set of possible classes is not known. After classification we can try to assign a
name to that class. Unsupervised classification is called clustering.
• Types of Classification algorithms: Classification algorithms can be further divided into
two main categories:
• Linear Models
• Logistic Regression
• Support Vector Machines
• Non-linear Models
• K-Nearest Neighbours
• Kernel SVM
• Naïve Bayes
• Decision Tree Classification
• Random Forest Classification
Evaluation of Classification model?
• Log loss or cross entropy loss:
• It is used for evaluating the performance of a classifier, whose output is a probability value
between the 0 and 1.
• For a good binary Classification model, the value of log loss should be near to 0.
• The value of log loss increases if the predicted value deviates from the actual value.
• The lower log loss represents the higher accuracy of the model.
• Confusion Matrix:
• The confusion matrix provides us a matrix/table as output and describes the performance of the
model.
• It is also known as the error matrix.
• The matrix consists of predictions result in a summarized form, which has a total number of correct
predictions and incorrect predictions.
• AUC-ROC curve:
• ROC curve stands for Receiver Operating Characteristics Curve and AUC stands for Area Under
the Curve.
• It is a graph that shows the performance of the classification model at different thresholds.
• The AUC-ROC curve is used to visualize the performance of a binary classification model; for a
multi-class model it is drawn one class at a time (one class versus the rest).
• The ROC curve is plotted with TPR and FPR, where TPR (True Positive Rate) is on the Y-axis and
FPR (False Positive Rate) is on the X-axis.
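The three evaluation tools above can be sketched for the binary case as follows. The labels and predicted probabilities are made up, and the function names are hypothetical; each ROC point is simply the (FPR, TPR) pair obtained from the confusion-matrix counts at one threshold.

```python
import math

def log_loss(y_true, p_pred, eps=1e-15):
    """Average cross-entropy; near 0 for confident, correct predictions."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)   # avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

def confusion_counts(y_true, p_pred, threshold=0.5):
    """Return (TP, FP, TN, FN) at the given probability threshold."""
    tp = fp = tn = fn = 0
    for y, p in zip(y_true, p_pred):
        pred = 1 if p >= threshold else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

def roc_point(y_true, p_pred, threshold):
    """One point of the ROC curve: (FPR, TPR) at the given threshold."""
    tp, fp, tn, fn = confusion_counts(y_true, p_pred, threshold)
    return fp / (fp + tn), tp / (tp + fn)

# Illustrative data (assumption)
y_true = [1, 0, 1, 1, 0, 0]
p_pred = [0.9, 0.2, 0.6, 0.4, 0.7, 0.1]
print(log_loss(y_true, p_pred))
print(confusion_counts(y_true, p_pred))
print([roc_point(y_true, p_pred, t) for t in (0.0, 0.5, 1.1)])
```

Sweeping the threshold from above 1 down to 0 traces the curve from (0, 0) to (1, 1).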
Module-2 Topic-1: Logistic Regression
Logistic Regression
Logistic regression is an extension of simple linear regression for the case where the
dependent variable is dichotomous or binary in nature, so that simple linear regression
cannot be used.
Logistic regression is the statistical technique used to model the relationship
between one or more predictors (independent variables) and a predicted
variable (the dependent variable) where the dependent variable is binary.
Logistic regression estimates the probability of an event occurring, based on a given
dataset of independent variables. Since the outcome is a probability, the predicted
value is bounded between 0 and 1.
[Figure: Logistic Regression: An Illustration]
• Logistic regression estimates the probability of a certain event occurring by
modelling the logarithm of the odds as a linear function of the predictors.
• The coefficients are estimated by maximum likelihood estimation (MLE). Because
the probability of an event is transformed into its odds, the model is nonlinear in the probability.
What Logistic Regression predicts?
• Probability of Y occurring given known values for X(s).
• In Logistic Regression, the Dependent Variable is transformed into the natural log of the odds.
This is called logit (short for logistic probability unit).
• Probabilities, which range between 0.0 and 1.0, are transformed into
odds, which range between 0 and infinity. Conversely, the predicted probability is a sigmoid
function applied to a linear combination of input features, so it stays in the range 0 to 1.
• If the probability for group membership in the modeled category is above
some cut point (the default is 0.50), the subject is predicted to be a member
of the modeled group. Example: the customer defaults on their payment.
• If the probability is below the cut point, the subject is predicted to be a
member of the other group. Example: the customer does not default on their payment.
• For any given case, logistic regression computes the probability that a case
with a particular set of values for the independent variable is a member of
the modeled category.
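The relationship between probability, odds, logit and the cut point can be sketched as follows. The coefficients `b0` and `b1` passed to `predict_class` are arbitrary illustrative values, and all function names are hypothetical.

```python
import math

def sigmoid(z):
    """Map a linear score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def odds(p):
    """Odds of an event with probability p; ranges from 0 to infinity."""
    return p / (1.0 - p)

def logit(p):
    """Natural log of the odds; the inverse of the sigmoid."""
    return math.log(odds(p))

def predict_class(x, b0, b1, cut=0.5):
    """Return (predicted class, probability) for one predictor value x."""
    p = sigmoid(b0 + b1 * x)
    return (1 if p >= cut else 0), p

# Illustrative coefficients (assumption)
label_high, p_high = predict_class(3.0, b0=-2.0, b1=1.0)   # score +1
label_low, p_low = predict_class(1.0, b0=-2.0, b1=1.0)     # score -1
print(label_high, round(p_high, 3), label_low, round(p_low, 3))
```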
Assumptions with its explanation for Logistic Regression
• No outliers in the data. An outlier can be identified by analyzing the
independent variables.
• No correlation (multicollinearity) between the independent variables.
Measure how well the algorithm performs using the weights w in the hypothesis
function $h(x) = G(w^{\top}x)$, with $G(z) = \dfrac{1}{1 + e^{-z}}$
• Here G is the logistic function, whose graph is the sigmoid curve. Its values
on the y-axis lie between 0 and 1, and the curve crosses the y-axis at 0.5.
• The classes can be divided into positive and negative. The output, which lies
between 0 and 1, is interpreted as the probability of the positive class.
• The output of the hypothesis function is interpreted as positive if it is
≥ 0.5, otherwise negative.
• Loss Function:
Applications of Logistic Regression
Logistic Regression-Solved Example#1
A dataset consists of women and men Instagram users with a sample size of
1069. Let the probabilities of men and women using Instagram
be $P_{men}$ and $P_{women}$ respectively. The sample proportion of women who
are Instagram users is given as 61.08%, and the sample proportion for men
is 43.98%. The difference is 0.170951, and the 95% confidence interval is
(0.111429, 0.2292). Establish a logistic regression model that specifies the
relationship between p and x.
$\text{Odds} = \dfrac{P_0}{1 - P_0} = \dfrac{\text{success}}{\text{failure}}$
Solution
Logistic regression equation for women: $\log\left(\dfrac{P_{women}}{1 - P_{women}}\right) = \beta_0 + \beta_1$
Logistic regression equation for men: $\log\left(\dfrac{P_{men}}{1 - P_{men}}\right) = \beta_0$
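Logistic Regression-Solved Example#1 (Contd.)
Odds for women $= \dfrac{P_0}{1 - P_0} = \dfrac{0.6108}{1 - 0.6108} = 1.5694$
Odds for men $= \dfrac{P_0}{1 - P_0} = \dfrac{0.4398}{1 - 0.4398} = 0.7851$
Log of odds for women $= \log\left(\dfrac{P_{women}}{1 - P_{women}}\right) = \log(1.5694) = 0.4507 = \beta_0 + \beta_1$
Log of odds for men $= \log\left(\dfrac{P_{men}}{1 - P_{men}}\right) = \log(0.7851) = -0.2419 = \beta_0$
(log denotes the natural logarithm)
Intercept $b_0 = -0.2419$
Slope $b_1$ = Log(odds for women) − Log(odds for men) = 0.4507 − (−0.2419) = 0.6926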
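The arithmetic of this example can be reproduced in a few lines. The sketch assumes the indicator coding x = 1 for women and x = 0 for men, which is what the two equations (β0 + β1 for women, β0 for men) imply; the variable and function names are illustrative.

```python
import math

# Sample proportions from the example
p_women, p_men = 0.6108, 0.4398

odds_women = p_women / (1 - p_women)
odds_men = p_men / (1 - p_men)

b0 = math.log(odds_men)            # log odds for men (x = 0)
b1 = math.log(odds_women) - b0     # difference in log odds (women vs men)

def predicted_probability(x):
    """Fitted probability for x = 1 (women) or x = 0 (men)."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

print(round(odds_women, 4), round(odds_men, 4))
print(round(b0, 4), round(b1, 4))
print(predicted_probability(1), predicted_probability(0))
```

Plugging the coefficients back into the sigmoid recovers the two sample proportions, which is a useful consistency check. Small differences in the fourth decimal place can arise depending on whether the odds are rounded before taking logarithms.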
Note: To decide the values of b0 and b1 in the logistic regression
line, use scattergrams that show a positive correlation, such that
b0 can have a negative coefficient and b1 can have a
positive coefficient.
[Figure: scattergrams of Y against X]
Module-2 Topic-3: Naïve Bayes-conditional probability
• If a patient has a stiff neck (S), what’s the probability he/she has meningitis (M)?
$P(M \mid S) = \dfrac{P(S \mid M)\,P(M)}{P(S)} = \dfrac{0.5 \times 1/50000}{1/20} = 0.0002$
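The same calculation as a minimal sketch; the function name `posterior` is hypothetical, and the three inputs are the values that appear in the equation above.

```python
def posterior(likelihood, prior, evidence):
    """Bayes' rule: P(H | E) = P(E | H) * P(H) / P(E)."""
    return likelihood * prior / evidence

# P(S | M) = 0.5, P(M) = 1/50000, P(S) = 1/20
p_meningitis_given_stiff_neck = posterior(0.5, 1 / 50000, 1 / 20)
print(p_meningitis_given_stiff_neck)
```

Even though half of meningitis patients have a stiff neck, the posterior stays small because the prior P(M) is so low.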
Naïve Bayes Classification model
• The Naïve Bayes Classifier is one of the simplest and most effective classification algorithms; it helps
in building fast machine learning models that can make quick predictions.
• It is a probabilistic classifier, which means it predicts on the basis of the probability of an object.
• Naïve: It is called Naïve because it assumes that the occurrence of a certain feature is independent
of the occurrence of other features. For example, if a fruit is identified on the basis of color, shape,
and taste, then a red, spherical, and sweet fruit is recognized as an apple. Hence each feature
individually contributes to identifying it as an apple without depending on the others.
• Bayes: It is called Bayes because it depends on the principle of Bayes' Theorem.
• Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used to determine the probability
of a hypothesis with prior knowledge. It depends on the conditional probability.
• The formula for Bayes' theorem is given as: $P(A \mid B) = \dfrac{P(B \mid A)\,P(A)}{P(B)}$
• Where,
• P(A) is Prior Probability: Probability of the hypothesis before observing the evidence.
• P(B) is Marginal Probability: Probability of the evidence.
• P(A|B) is Posterior probability: Probability of hypothesis A given the observed event B.
• P(B|A) is Likelihood probability: Probability of the evidence given that the
hypothesis is true.
Naïve Bayes Classification model Solved Example#1
If the weather is sunny, should the player play or not?
Step-1 Frequency table for the Weather Conditions:
Step-3 Applying Bayes Theorem
Naïve Bayes Classification Model
Advantages of Naïve Bayes Classifier:
• Naïve Bayes is one of the fastest and easiest ML algorithms for predicting the class of
data points.
• It can be used for Binary as well as Multi-class Classifications.
• It performs well in Multi-class predictions as compared to the other Algorithms.
• It is the most popular choice for text classification problems.
Disadvantages of Naïve Bayes Classifier:
• Naive Bayes assumes that all features are independent or unrelated, so it cannot
learn the relationship between features.
Applications of Naïve Bayes Classifier:
• It is used for Credit Scoring.
• It is used in medical data classification.
• It can be used in real-time predictions because Naïve Bayes Classifier is an
eager learner.
• It is used in Text classification such as Spam filtering and Sentiment analysis
Summary of Naïve Bayes Classification Model
• Naïve Bayes algorithm is a supervised learning algorithm, which is based
on Bayes theorem and used for solving classification problems.
• Bayes’ rule can be turned into a classifier
• Maximum A Posteriori (MAP) hypothesis estimation incorporates prior
knowledge; Max Likelihood (ML) doesn’t
• Naive Bayes Classifier is a simple but effective Bayesian classifier for
vector data (i.e. data with several attributes) that assumes that attributes
are independent given the class.
• Bayesian classification is a generative approach to classification
• It is a probabilistic classifier, which means it predicts on the basis of the
probability of an object.
Module-2 Topic-5: SVM Classifier
Support Vector Machine (SVM)
• The goal of the SVM algorithm is to create the best line or decision boundary that
can segregate n-dimensional space into classes so that we can easily put the new
data point in the correct category in the future. This best decision boundary is
called a hyperplane.
Terminologies in Support Vector Machine (SVM)
• Hyperplane: There can be multiple
lines/decision boundaries to segregate the
classes in n-dimensional space, but we need
to find out the best decision boundary that
helps to classify the data points. This best
boundary is known as the hyperplane of
SVM.
• The dimensions of the hyperplane depend on
the number of features present in the dataset. With 2
features, the hyperplane is a straight line; with 3
features, it is a two-dimensional plane.
• Support Vectors: The data points or
vectors that are the closest to the hyperplane
and which affect the position of the
hyperplane are termed Support Vectors.
[Figure: the line ax + by − c = 0 represents the decision boundary]
Types of Support Vector Machine (SVM)
• Linear SVM: Linear SVM is used for linearly separable data, which means if a dataset
can be classified into two classes by using a single straight line, then such data is
termed as linearly separable data.
• Non-linear SVM: Non-linear SVM is used for non-linearly separable data, which
means a dataset that cannot be classified into two classes by using a single straight line.
An example for SVM
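Data: $\langle x_i, y_i \rangle$, $i = 1, \dots, l$, with $x_i \in \mathbb{R}^d$ and $y_i \in \{-1, +1\}$
[Figure: scatter plot of Temperature against Humidity; the two marker types denote "play tennis" and "do not play tennis", with f(x) = +1 on one side of the line and f(x) = −1 on the other]
All hyperplanes in $\mathbb{R}^d$ are parameterized by a vector (w) and a constant b.
A hyperplane can be expressed as $w \cdot x + b = 0$ (remember the equation for a hyperplane from algebra!).
Our aim is to find such a hyperplane, $f(x) = \mathrm{sign}(w \cdot x + b)$, that
correctly classifies our data.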
Formulation of Margin
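Define the hyperplane H such that:
$x_i \cdot w + b \ge +1$ when $y_i = +1$
$x_i \cdot w + b \le -1$ when $y_i = -1$
H1 and H2 are the planes:
H1: $x_i \cdot w + b = +1$
H2: $x_i \cdot w + b = -1$
The points on the planes H1 and H2 are the Support Vectors.
[Figure: hyperplane H between the parallel planes H1 and H2, with the distances d+ and d− marked]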
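A small numerical sketch of these constraints follows. The weight vector, the offset and the four points are made-up values chosen so that two points lie exactly on H1 and H2; the function names are hypothetical. The sketch also uses the standard result that the distance between H1 and H2 is 2/‖w‖.

```python
import math

def decision_value(w, b, x):
    """Compute w . x + b for one point."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def classify(w, b, x):
    """f(x) = sign(w . x + b), returned as +1 or -1."""
    return 1 if decision_value(w, b, x) >= 0 else -1

def satisfies_margin(w, b, points, labels):
    """Check y_i * (w . x_i + b) >= 1 for every training point."""
    return all(y * decision_value(w, b, x) >= 1 for x, y in zip(points, labels))

def margin_width(w):
    """Distance between the planes H1 and H2."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))

# Illustrative hyperplane and data (assumption): the line x1 = 3
w, b = (1.0, 0.0), -3.0
points = [(4.0, 1.0), (5.0, 2.0), (2.0, 1.0), (1.0, 3.0)]
labels = [1, 1, -1, -1]

# Points with |w . x + b| == 1 lie on H1 or H2, i.e. they are the support vectors
support_vectors = [x for x in points if abs(decision_value(w, b, x)) == 1.0]
print(satisfies_margin(w, b, points, labels), margin_width(w), support_vectors)
```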
Support Vector Machine (SVM)-Illustration
Support Vector Machine (SVM)-Solved Example#1
Suppose, we have positively labeled data points
2 The hyperplane driving SVM is given as
Non Linear Support Vector Machine (SVM)-Solved
Example#2
Suppose, we have positively labeled data points
2 There are two support vectors
4 The above equation reduces to
Support Vector Machine (SVM)-Pros and Cons
Advantages:
•Effective in high dimensional spaces.
•Still effective in cases where number of dimensions is
greater than the number of samples.
•Uses a subset of training points in the decision function
(called support vectors), so it is also memory efficient.
•Versatile: different Kernel functions can be specified for
the decision function. Common kernels are provided, but it
is also possible to specify custom kernels.
Disadvantages:
•If the number of features is much greater than the number
of samples, avoiding over-fitting through the choice of Kernel
function and regularization term is crucial.
•SVMs do not directly provide probability estimates; these
are calculated using an expensive five-fold cross-validation.
Module-2 Topic-2: Decision Tree
Decision tree
• A Decision tree is a flowchart-like tree structure, where each internal node denotes a
test on an attribute, each branch represents an outcome of the test, and each leaf node
(terminal node) holds a class label.
• A tree can be “learned” by splitting the
source set into subsets based on an
attribute value test.
• This process is repeated on each derived
subset in a recursive manner called
recursive partitioning.
• The recursion is completed when the
subset at a node all has the same value of
the target variable, or when splitting no
longer adds value to the predictions.
• The construction of a decision tree
classifier does not require any domain
knowledge or parameter setting, and
therefore is appropriate for exploratory
knowledge discovery.
• Decision trees can handle high-dimensional data.
In general, a decision tree classifier has good accuracy.
• Decision tree induction is a typical
inductive approach to learn knowledge on
classification.
• Decision trees classify instances by
sorting them down the tree from the root
to some leaf node, which provides the
classification of the instance.
• An instance is classified by starting at the
root node of the tree, testing the attribute
specified by this node, then moving down
the tree branch corresponding to the value
of the attribute.
Strength:
• Decision trees are able to generate understandable rules.
• Decision trees perform classification without requiring much computation.
• Decision trees are able to handle both continuous and categorical variables.
• Decision trees provide a clear indication of which fields are most important for prediction
or classification.
Disadvantage:
• Decision trees are less appropriate for estimation tasks where the goal is to predict the
value of a continuous attribute.
• Decision trees are prone to errors in classification problems with many classes and a
relatively small number of training examples.
• Decision trees can be computationally expensive to train. At each node, each candidate
splitting field must be sorted before its best split can be found. In some algorithms,
combinations of fields are used and a search must be made for optimal combining weights.
Pruning algorithms can also be expensive since many candidate sub-trees must be formed
and compared.
Decision tree Solved Example
Consider a dataset based on which we will determine whether to play football or
not.
There are 4 independent variables - Outlook, Temperature, Humidity, and Wind to determine
the dependent variable-whether to play football or not.
1 Calculation of Information gain(difference between parent entropy and average weighted entropy)
and Entropy (determines how a decision tree chooses to split data)
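Entropy and information gain as defined in this step can be sketched as follows. The four rows are made up for illustration and are not the dataset of the solved example; the function names are hypothetical. Entropy is computed in bits (log base 2).

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Parent entropy minus the weighted average entropy after the split."""
    parent = entropy([r[target] for r in rows])
    weighted = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        weighted += len(subset) / len(rows) * entropy(subset)
    return parent - weighted

# Illustrative rows (assumption): Wind separates the classes, Humidity does not
rows = [
    {"Wind": "Weak", "Humidity": "High", "Play": "Yes"},
    {"Wind": "Weak", "Humidity": "Normal", "Play": "Yes"},
    {"Wind": "Strong", "Humidity": "High", "Play": "No"},
    {"Wind": "Strong", "Humidity": "Normal", "Play": "No"},
]
gain_wind = information_gain(rows, "Wind", "Play")
gain_humidity = information_gain(rows, "Humidity", "Play")
print(gain_wind, gain_humidity)
```

The attribute with the highest information gain (here Wind) is the one chosen for the split.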
4 Initial Decision tree diagram
5 Determine for the decision tree whether Temperature, Humidity or Wind has the higher information gain.
6 Determine for the decision tree whether Temperature or Humidity has the higher information gain.
6 Determine for the decision tree whether humidity is normal or high, based on the higher information gain.
6 Determine for the decision tree whether wind is strong or not, based on the higher information gain.
6 Final Decision tree
Module-2 Topic-4: Random Forest
Random Forest
• Random forests (RF) are a
combination of tree predictors
such that each tree depends on
the values of a random vector
sampled independently and with
the same distribution for all
trees in the forest.
• The generalization error of a
forest of tree classifiers depends
on the strength of the
individual trees in the forest and
the correlation between them.
• Improvements in classification accuracy have resulted from
growing an ensemble of trees and letting them vote for
the most popular class.
• To grow these ensembles, often random vectors are
generated that govern the growth of each tree in the
ensemble.
• A random forest is a collection of
decision trees. Each tree estimates a
classification, and this is called a “vote”. Ideally,
we consider the vote from every tree and choose
the most voted classification (Majority-Voting).
• Random Forest follows the same bagging process
as bagged decision trees, but each time a split is to be
performed, the search for the split variable is
limited to a random subset of m of the p
attributes (variables or features), also known as
Split-Attribute Randomization:
• classification trees: m = √p
• regression trees: m = p/3
• Random Forests produce many unique trees.
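Majority voting and split-attribute randomization can be sketched as below. Rounding m down to an integer is an assumption of this sketch (the slide only gives m = √p and m = p/3), and the feature names and function names are hypothetical.

```python
import math
import random
from collections import Counter

def majority_vote(votes):
    """Return the class predicted by the most trees."""
    return Counter(votes).most_common(1)[0][0]

def split_subset_size(p, task):
    """Number m of attributes considered at each split (rounded down)."""
    if task == "classification":
        return max(1, math.isqrt(p))   # m = sqrt(p)
    return max(1, p // 3)              # m = p / 3 for regression

def candidate_features(features, task, rng):
    """Random subset of m attributes searched for the best split."""
    return rng.sample(features, split_subset_size(len(features), task))

# Illustrative feature names (assumption)
features = [f"x{i}" for i in range(9)]
rng = random.Random(0)
subset = candidate_features(features, "classification", rng)
print(subset, majority_vote(["yes", "no", "yes"]))
```

Because a fresh subset is drawn at every split, different trees end up using different attributes, which is what makes the trees in the forest diverse.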
Bagging: Bootstrap Aggregating
3. Average predictions
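A minimal sketch of bagging follows, assuming the usual recipe of drawing bootstrap samples, fitting one model per sample, and averaging the predictions. To keep it short, each "model" here is just the mean of its bootstrap sample, standing in for a decision tree; the data values and function names are illustrative assumptions.

```python
import random
from statistics import mean

def bootstrap_sample(data, rng):
    """Draw len(data) observations with replacement."""
    return [rng.choice(data) for _ in data]

def bagged_prediction(data, n_models, rng):
    """Fit one stand-in model per bootstrap sample and average the predictions."""
    predictions = [mean(bootstrap_sample(data, rng)) for _ in range(n_models)]
    return mean(predictions)

# Illustrative target values (assumption)
data = [3.0, 5.0, 4.0, 6.0, 8.0, 7.0]
rng = random.Random(0)
sample = bootstrap_sample(data, rng)
prediction = bagged_prediction(data, n_models=25, rng=rng)
print(sample, prediction)
```

For classification, the averaging step is replaced by a majority vote over the predicted classes.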
https://victorzhou.com/blog/intro-to-random-forests/
• Within the first branch, find the best split again for that part of the data.
This is the first sub-node (decision node).
• Continue creating decision nodes until splitting no longer improves the fit.
The average value of the target variable is then assigned to the leaf (terminal node).
• Continue until only leaf nodes are left or until a minimum value (for example,
a minimum number of samples per node) is reached.
• The decision tree can now be used to make a prediction based on the input
features.
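The tree-growing steps above can be sketched as a small recursive procedure. This is a minimal illustration, not an optimized implementation: it assumes that the "best split" is the one with the lowest total sum of squared errors in the two children, and the names `best_split`, `build_tree`, `predict` and the `min_samples` parameter are hypothetical.

```python
def sse(ys):
    """Sum of squared errors of ys around their mean."""
    if not ys:
        return 0.0
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)


def best_split(rows, ys):
    """Return (feature, threshold, child SSE) of the best split, or None."""
    best = None
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, ys) if r[f] <= t]
            right = [y for r, y in zip(rows, ys) if r[f] > t]
            if not left or not right:
                continue  # a split must send data to both branches
            score = sse(left) + sse(right)
            if best is None or score < best[2]:
                best = (f, t, score)
    return best


def build_tree(rows, ys, min_samples=2):
    """Grow a regression tree; each leaf stores the mean target value."""
    split = best_split(rows, ys) if len(ys) >= min_samples else None
    if split is None or split[2] >= sse(ys):
        return sum(ys) / len(ys)  # leaf: splitting no longer helps
    f, t, _ = split
    left = [(r, y) for r, y in zip(rows, ys) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, ys) if r[f] > t]
    return {
        "feature": f,
        "threshold": t,
        "left": build_tree([r for r, _ in left], [y for _, y in left], min_samples),
        "right": build_tree([r for r, _ in right], [y for _, y in right], min_samples),
    }


def predict(tree, row):
    """Follow decision nodes down to a leaf and return its value."""
    while isinstance(tree, dict):
        branch = "left" if row[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree


if __name__ == "__main__":
    rows = [[1], [2], [10], [11]]   # toy data, one feature
    ys = [1.0, 1.0, 5.0, 5.0]
    tree = build_tree(rows, ys)
    print(tree)
    print(predict(tree, [1.5]), predict(tree, [12]))
```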
Feature Importance
• Feature importance is calculated as
• the reduction in the sum of squared errors whenever a variable is chosen to split,
• weighted by the probability of reaching that node.
• However,
• variable importance measures are not reliable when potential predictor variables vary in
their scale of measurement or their number of categories.
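As a rough illustration of the calculation described above, the sketch below scores each split by its reduction in sum of squared errors, weights it by the fraction of training samples that reach the node, and adds the scores up per variable. The function names and the tuple layout for a split are hypothetical, and normalizing the totals so they sum to 1 is an added assumption (a common convention, not something stated on the slide).

```python
def sse(ys):
    """Sum of squared errors of ys around their mean."""
    if not ys:
        return 0.0
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)


def split_importance(parent, left, right, n_total):
    """SSE reduction of one split, weighted by P(reaching the node)."""
    reduction = sse(parent) - (sse(left) + sse(right))
    return (len(parent) / n_total) * reduction


def feature_importances(splits, n_total):
    """splits: list of (feature_name, parent_ys, left_ys, right_ys)."""
    totals = {}
    for name, parent, left, right in splits:
        gain = split_importance(parent, left, right, n_total)
        totals[name] = totals.get(name, 0.0) + gain
    grand_total = sum(totals.values())
    if grand_total == 0:
        return totals
    # Normalization to sum to 1 is an assumption made for readability.
    return {name: value / grand_total for name, value in totals.items()}


if __name__ == "__main__":
    # Toy tree on targets [1, 3, 7, 9]: root splits on "a", both children on "b".
    splits = [
        ("a", [1.0, 3.0, 7.0, 9.0], [1.0, 3.0], [7.0, 9.0]),
        ("b", [1.0, 3.0], [1.0], [3.0]),
        ("b", [7.0, 9.0], [7.0], [9.0]),
    ]
    print(feature_importances(splits, n_total=4))
```

In this toy tree the root split on "a" removes most of the error and is reached by every sample, so "a" receives almost all of the importance even though "b" is used twice.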
Processing the ensemble of trees, called the Random Forest
• Take a set of variables
• Run them through every decision tree
• Determine a predicted target variable for each of the
trees
• Average the result of all trees
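The four steps above amount to a short loop. In the sketch below each "tree" is just a hypothetical hand-written function standing in for a trained regression tree; the feature names and the returned values are made up for illustration and do not come from any real dataset.

```python
def forest_predict(trees, features):
    """Run one input through every tree and average the predictions."""
    predictions = [tree(features) for tree in trees]
    return sum(predictions) / len(predictions)


# Hypothetical stand-ins for trained trees (values in thousands of dollars).
def tree_a(x):
    return 30.0 if x["rooms"] >= 7 else 20.0


def tree_b(x):
    return 26.0 if x["low_income_pct"] < 10 else 16.0


def tree_c(x):
    return 28.0 if x["rooms"] >= 6 else 18.0


if __name__ == "__main__":
    house = {"rooms": 6.5, "low_income_pct": 8.0}
    print(forest_predict([tree_a, tree_b, tree_c], house))
```

Each individual tree can only output one of two values here, but the averaged forest output can take values in between, which is why the ensemble gives smoother regression predictions than a single tree.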
Random Forest: Tuning
• Random forest introduces randomness into the rows and columns of the data
• Combined, this provides a more diverse set of trees that almost always lowers our prediction error.
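A minimal sketch of the two sources of randomness, assuming the usual scheme in which rows are bootstrapped (sampled with replacement) once per tree and a subset of columns is drawn at each split. The helper names and the sizes used in the demo (8 rows, 13 columns, m = 4) are illustrative assumptions.

```python
import random


def bootstrap_rows(n_rows, rng):
    """Row randomness: sample n_rows row indices with replacement."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]


def random_columns(n_cols, m, rng):
    """Column randomness: pick m distinct column indices for one split."""
    return sorted(rng.sample(range(n_cols), m))


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed only to make the demo repeatable
    rows = bootstrap_rows(8, rng)
    cols = random_columns(13, 4, rng)
    print("rows used by this tree:", rows)
    print("columns searched at this split:", cols)
```

Because rows are drawn with replacement, some rows typically appear more than once in a tree's sample and others not at all, so each tree sees a slightly different version of the training data.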
Applications
• Banking Industry
• Credit Card Fraud Detection
• Customer Segmentation
• Predicting Loan Defaults on LendingClub.com
• Healthcare and Medicine
• Cardiovascular Disease Prediction
• Diabetes Prediction
• Breast Cancer Prediction
• Stock Market
• Stock Market Prediction
• Stock Market Sentiment Analysis
• Bitcoin Price Detection
• E-Commerce
• Product Recommendation
• Price Optimization
• Search Ranking
Case Study
• Let us consider the example of the Boston Housing dataset. This is a well-
known dataset of information about different houses in Boston. For each
house, 13 values are known, such as the crime rate in that area,
industrialization value, average age of residents, and so on. Our task is to
train a model to predict the value of a house given these values.
• Training the dataset
• We can note that of the 13 original
features, this decision tree has used
only LSTAT (the percentage of the
population in low income groups) and
RM (average number of rooms per
dwelling) to generate a prediction.
• The four leaf nodes show us that this
single tree can produce only four
possible outputs: $30k, $44k, $22k and
$14k, even though we are solving a
regression problem and the true
value could be any of many
continuous values.
• This simple decision tree has a mean
absolute error of $3.6k on the training
set, and $3.8k on the test set. This
means that although it is not a
powerful model, it performs similarly
on seen and unseen data, and so it has
generalized well and has not overfit the
training data.
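The comparison of training and test error described above relies on the mean absolute error (MAE). A minimal sketch of that check follows; the house values and predictions in the demo are made-up toy numbers (in thousands of dollars), not figures from the Boston Housing dataset.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between true and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


if __name__ == "__main__":
    # Hypothetical true values and predictions, in $k.
    train_true, train_pred = [24.0, 31.0, 15.0, 44.0], [22.0, 30.0, 14.0, 44.0]
    test_true, test_pred = [20.0, 35.0], [22.0, 30.0]

    train_mae = mean_absolute_error(train_true, train_pred)
    test_mae = mean_absolute_error(test_true, test_pred)
    print("train MAE:", train_mae, "test MAE:", test_mae)
    print("gap:", test_mae - train_mae)
```

A small gap between the two errors suggests the model generalizes; a test error far above the training error is the usual sign of overfitting.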
• Making a random forest ensemble
model
• Performance of the model
• Feature Importance in RF
Random forest Vs Decision tree
Module-2 Summary
Summary
• Logistic regression: Modeling the probability that the response Y belongs to a
particular category, using a logistic function, on the basis of single or multiple
variables.
• Bayes’ theorem for classification: Bayes’ classifier using conditional independence
• Decision trees and random forests: A non-parametric, ‘information-based learning’
approach which is easy to interpret.
• Hyperplane for classification: maximal margin classifier and SVC.
• Support Vector Machines (SVMs): Extension of SVC to handle ‘non-linear boundaries’
between classes. Uses kernels for computational efficiency. RBF kernel exhibits ‘local
behavior’.
• Random forests are an effective tool in prediction. Forests give results competitive
with boosting and adaptive bagging, yet do not progressively change the training set.
Random inputs and random features produce good results in classification, less so in
regression. For larger data sets, we can gain accuracy by combining random features
with boosting.
Topics in Module-5
• Comply with organization’s
current health, safety and security
policies and procedures
• Report any identified breaches in
health, safety, and security
policies and procedures to the
designated person
• Identify and correct any hazards
that they can deal with safely,
competently and within the limits
of their authority
• Report any hazards that they are
not competent to deal with to the
relevant person in line with
organizational procedures and
warn other people who may be
affected.
Module-5 Topic-1: Comply with organization’s current health, safety and security policies and procedures
Hazards
• A hazard can be defined as any source of potential harm or danger to someone, or any adverse
health effect produced under certain conditions.
• Hazards to an organization include loss of property or equipment, while hazards to an individual
involve harm to health or body.
• Examples of potential hazards:
• (i) Materials such as knives or sharp-edged nails can cause cuts;
• (ii) Substances such as benzene can cause suffocation from fumes. Flammable substances like
petrol can cause fire;
• (iii) Naked wires or electrodes can result in electric shocks;
• (iv) Conditions such as a wet floor can cause slipping;
• (v) Objects falling on workers; and
• (vi) Clothes becoming entangled in rotating objects.
Workplace Security Procedures
• Hazard Identification and Risk
Assessment Procedures: Hazard
identification and risk assessment procedures
are vital for workplace health and safety,
preventing potential risks from being
overlooked. They familiarize personnel with
their duties in hazardous environments and with
the risk assessment steps.
Module-5 Topic-2: Report any identified breaches in health, safety, and security policies and procedures
to the designated person
Report any identified breaches in health, safety, and security policies
and procedures to the designated person
• Ensuring laboratory safety
and security requires
effective enforcement by
organizational leaders and
compliance by managers
and workers through
incentives.
• Organizations must identify
and address cultural
barriers to chemical
laboratory safety and
security to ensure the safety
and security of their
employees.
Initiation and maintenance of an effective compliance system are important to
• give organization leaders useful information about the effectiveness of safety and security
systems and about needs for improvements.
• give designated safety and security personnel authority to collect incident reports and
report incidents to higher authorities for action.
• discern patterns of unsafe behavior and facilities (based on statistics from reports and
inspections), find methods to improve safety and security, and initiate new rules and
regulations to protect workers.
• increase awareness of safety issues in the organization so that a culture of improved safety
and security is encouraged.
• give current information to safety officers so that training of all laboratory workers can be
improved and specific guidance can be given to individual workers.
• give information to laboratory leaders so that they can learn how to use, test, and procure
appropriate personal protective equipment (PPE) and other types of equipment to improve
safety.
Module-5 Topic-3: Identify and correct any hazards that they can deal with safely, competently and within the limits of
their authority
What is an “incident” and what is a “hazard”?
5 signs the report system is failing
Module-5 Topic-4: WORKPLACE HAZARD REPORTING
WORKPLACE HAZARD REPORTING
• Employee training in
hazard recognition
and avoidance is
crucial for preventing
accidents. Hazard
reporting is essential
for employees to
know what to do when
encountering
uncorrected hazards.
• Training can be in-
person, on-the-job, or
a safety meeting, with
annual online or email
reminders for low-
hazard jobs.
• What is an unsafe condition that should be reported? Workplace hazards include rusted
tools, inadequate PPE, unlabeled containers, insufficient lighting, broken machine guards, and a
leaking refrigerator, which can lead to potential incidents causing harm to people, equipment, or
property.
• What is an unsafe act that should be reported? Unsafe acts, such as careless use of
equipment or inadequate use of personal protective equipment, can lead to incidents causing
harm to people, equipment, or property.
• What should be done if an unsafe condition or act is witnessed in the workplace? The
hazard reporting procedure in your workplace should be specific and clearly communicate the
steps employees should take, such as filling out a form or communicating verbally with a
supervisor.
• When should a hazard be reported? Any unsafe condition or act should be reported
immediately, or at the next available safe opportunity that the employee has to do so.
• Where can employees find a copy of the Hazard Reporting Procedure? Are hard copies of
procedures kept at headquarters, or is the Safety Manual found online on the company’s
intranet? It’s important that employees know how they can access all company policies and
procedures on their own.
WORKPLACE HAZARD REPORTING-Examples
Example-1
Example-2
Example-3
Example-4
Example-5
Module-5
Summary
• Performance criteria (7 in total)
• Basic workplace safety guidelines: fire safety, first-aid kit,
electrical safety, etc.
• Types of accidents: trips, slips, injuries/accidents due to
falling/moving items, etc.
• Types of emergencies: medical, structural, natural disaster, etc.
• Hazards: sources of potential harm (notified using signage boards)