FDS - Module 4

Algorithms for Data Science


Types of Learning
• Supervised (inductive) learning
o Given: training data + desired outputs (labels)
• Unsupervised learning
o Given: training data (without desired outputs)
• Reinforcement learning
o Given: rewards from a sequence of actions

Supervised learning
• Supervised learning attempts to find a
relationship between the predictors and the
response in order to make a prediction
• The biggest drawback of supervised machine
learning is that it needs labelled data
• Some possible applications of supervised
learning include:
o Stock price prediction
o Weather prediction
o Crime prediction
• Types
o Classification - predict a categorical response
o Regression - predicts a continuous response
• List of Common Algorithms
o Nearest Neighbour
o Naive Bayes
o Decision Trees
o Linear Regression
o Support Vector Machines (SVM)
o Neural Networks
• Here the human expert acts as the teacher: we feed the computer training data
containing the inputs/predictors, show it the correct answers (outputs), and from
this data the computer learns the patterns.
• Supervised learning algorithms try to model relationships and dependencies
between the target prediction output and the input features such that we can
predict the output values for new data based on those relationships which it
learned from the previous data sets.
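To make this concrete, here is a minimal, hypothetical scikit-learn sketch (the iris dataset and the decision-tree choice are illustrative, not part of the notes): the model learns from labelled examples and is then scored on unseen data.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Labelled data: predictors X and desired outputs y
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)              # learn patterns from the labelled data
print(model.score(X_test, y_test))       # accuracy on new, unseen data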

Unsupervised learning
• The input is a set of predictors only; the algorithm exploits relationships
between the predictors, for example:
o Reducing the dimension - dimensionality reduction
o Similar groups – clustering
• The advantage over Supervised Learning is that, here, no labels are required.
• Drawback is that it merely suggests
differences and similarities, which then
requires a human’s interpretation.
• List of Common Algorithms
o k-means clustering
o Association Rules
• These algorithms are mainly used in pattern detection and descriptive
modeling. There are no output categories or labels here based on which the
algorithm can try to model relationships. Instead, they apply techniques to the
input data to mine for rules, detect patterns, and summarize and group the data
points, which helps in deriving meaningful insights and describing the data
better to the users.
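As a minimal sketch of the idea (the toy data and the choice of k = 2 are illustrative assumptions), k-means groups unlabelled points into clusters whose meaning a human must then interpret:

import numpy as np
from sklearn.cluster import KMeans

# Unlabelled data only: two visually obvious groups around x = 1 and x = 10
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)            # suggested group for each point, no ground truth
print(kmeans.cluster_centers_)   # the two discovered group centres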
Reinforcement learning
• It uses observations gathered from the interaction with the environment to take
actions that would maximize the reward or minimize the risk.
• It allows machines and software agents to automatically determine the ideal
behavior within a specific context, in order to maximize their performance.
• Simple reward feedback is required for the agent to learn its behavior; this is
known as the reinforcement signal.
• Using this algorithm, the machine is trained to make specific decisions.
• It works this way: the machine is exposed to an environment where it trains itself
continually using trial and error.
• The machine learns from past experience and tries to capture the best possible
knowledge in order to make accurate decisions.
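A minimal, hypothetical Q-learning sketch of this trial-and-error loop (the 1-D track, rewards, and hyperparameters are illustrative assumptions): the agent starts out erratic and gradually learns that moving right reaches the rewarded state.

import random

n_states, actions = 5, [-1, +1]               # a 1-D track; move left or right
Q = {(s, a): 0.0 for s in range(n_states) for a in actions}
alpha, gamma, epsilon = 0.5, 0.9, 0.2         # learning rate, discount, exploration

for episode in range(200):
    s = 0
    while s != n_states - 1:
        # Explore occasionally, otherwise take the best-known action
        if random.random() < epsilon:
            a = random.choice(actions)
        else:
            a = max(actions, key=lambda act: Q[(s, act)])
        s_next = min(max(s + a, 0), n_states - 1)
        r = 1.0 if s_next == n_states - 1 else 0.0    # the reinforcement signal
        best_next = max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        s = s_next

# The learned policy: every state should now prefer moving right (+1)
print({s: max(actions, key=lambda act: Q[(s, act)]) for s in range(n_states - 1)})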

Pros and Cons


• Supervised machine learning
o Pros
▪ It can make future predictions
▪ It can quantify relationships between predictors and response variables
▪ It can show us how variables affect each other and how much
o Cons
▪ It requires labelled data (which can be difficult to get)
• Unsupervised machine learning
o Pros
▪ It can find groups of similarly behaving data points that a human would
never have noticed
▪ It can be a preprocessing step for supervised learning
▪ Think of clustering a bunch of data points and then using these clusters
as the response
▪ It can use unlabelled data, which is much easier to find
o Cons
▪ It has zero predictive power
▪ It can be hard to determine if we are on the right track
▪ It relies much more on human interpretation
• Reinforcement learning
o Pros
▪ Very complex reward systems can create very capable AI systems
▪ It can learn in almost any environment, including our own Earth
o Cons
▪ The agent is erratic at first and makes many terrible choices before
realizing that these choices carry negative rewards
▪ For example, a car might crash into a wall and not know that this is
unacceptable until the environment negatively rewards it
▪ It can take a while before the agent learns to avoid such decisions
▪ The agent might play it safe, only ever choose one action, and be "too
afraid" to try anything else for fear of being punished

Ensemble Learning Methods


• Ensemble methods are techniques that aim at improving the accuracy of results in
models by combining multiple models instead of using a single model.
• The combined models increase the accuracy of the results significantly by
offsetting each individual model’s variances and biases.
• This has boosted the popularity of ensemble methods in machine learning.
• The main idea behind ensemble learning is to group weak learners together to
form one strong learner that has better accuracy than any individual weak learner.
• 3 main types of ensemble learning:
1. Bagging
▪ It considers homogeneous weak learners and focuses on reducing variance
▪ It involves fitting many decision trees on different samples of the same
dataset and averaging the predictions.
2. Boosting
▪ It considers homogeneous weak learners and focuses on reducing bias
▪ It involves adding ensemble members sequentially that correct the
predictions made by prior models and outputs a weighted average of
the predictions.
3. Stacking
▪ It considers heterogeneous weak learners
▪ It involves fitting many different model types on the same data and
using another model to learn how best to combine the predictions.
• Alternatively, ensemble learning can also be classified as:
1. Parallel ensemble methods:
▪ In this kind of ensemble method, the base learners are generated in
parallel, so there is no data dependency between them.
▪ Each base learner is trained on independently generated data.
▪ Example: Stacking
2. Sequential ensemble methods:
▪ In this kind of ensemble method, the base learners are generated
sequentially, so there is a data dependency between them.
▪ Each base learner's data depends on the previous ones.
▪ Previously mislabelled data points are re-weighted so that the
performance of the overall system improves.
▪ Example: Boosting

Bagging (Bootstrap Aggregation)


• Bagging is used when our objective is to reduce the variance of a decision tree.
• Here the concept is to create several subsets of data from the training sample,
chosen randomly with replacement.
• Each subset is then used to train its own decision tree; thus, we end up with an
ensemble of various models.
• The average of the predictions from the numerous trees is used, which is more
robust than a single decision tree.
• Random Forest is an extension of bagging. It takes one additional step: besides
drawing a random subset of the data, it also makes a random selection of
features rather than using all the features to develop the trees.
• When we have numerous random trees, it is called the Random Forest.
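A minimal scikit-learn sketch of the difference (the dataset choice is illustrative): BaggingClassifier resamples the data with replacement for each tree, while RandomForestClassifier additionally picks a random subset of features at each split.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Bagging: many trees, each fit on a bootstrap sample of the same dataset
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, random_state=0)
# Random Forest: bagging plus random feature selection at each split
forest = RandomForestClassifier(n_estimators=100, random_state=0)

print(cross_val_score(bagging, X, y).mean())
print(cross_val_score(forest, X, y).mean())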
Stacking
• It is an ensemble method that seeks a diverse group of members by varying the
model types fit on the training data and using a model to combine predictions.

• Stacking mainly differs from bagging and boosting in that it considers
heterogeneous weak learners (different learning algorithms are combined),
whereas bagging and boosting mainly consider homogeneous weak learners.
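A minimal sketch with scikit-learn's StackingClassifier (the base learners and dataset are illustrative choices): heterogeneous models are combined by a meta-model that learns how to weigh their predictions.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Two different learning algorithms as base learners...
base_learners = [('knn', KNeighborsClassifier()),
                 ('tree', DecisionTreeClassifier(random_state=0))]
# ...combined by a logistic-regression meta-model
stack = StackingClassifier(estimators=base_learners,
                           final_estimator=LogisticRegression(max_iter=1000))
print(cross_val_score(stack, X, y).mean())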

Boosting (Sequential Learning Technique)


• The key property of boosting ensembles is the idea of correcting prediction
errors. The models are fit and added to the ensemble sequentially such that the
second model attempts to correct the predictions of the first model, the third
corrects the second model, and so on.
• It is a method that in general decreases the bias error and builds strong predictive
models. The term ‘Boosting’ refers to a family of algorithms which converts a weak
learner to a strong learner.
• Ex: AdaBoost algorithm: It commences by training a decision tree in which
every observation is assigned an equal weight.
• After analysing the first tree, we raise the weights of the observations that
were difficult to classify and lower the weights of the ones that were easy to
classify.
• The second tree is therefore grown on this weighted data, the idea being to
improve upon the first tree's predictions.
• The final ensemble model's prediction is the weighted sum of the predictions
made by the previous tree models.
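A minimal AdaBoost sketch in scikit-learn (the dataset choice is illustrative): shallow trees are added sequentially, each trained on data re-weighted toward the previous trees' mistakes, and the final prediction is their weighted combination.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Each of the 100 weak learners (decision stumps by default) corrects the
# weighted errors of the ones before it
boost = AdaBoostClassifier(n_estimators=100, random_state=0)
print(cross_val_score(boost, X, y).mean())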

Time Series Modelling


• Time series analysis is a specific way of analyzing a sequence of data points
collected over an interval of time.
• In other words, the arrangement of data in accordance with their time
of occurrence is a time series. It is the chronological arrangement of data.
• In time series analysis, analysts record data points at consistent intervals over a
set period of time (like a day, week, month or year) rather than just recording the
data points intermittently or randomly.
• However, this type of analysis is not merely the act of collecting data over time.
• What sets time series data apart from other data is that the analysis can show
how variables change over time.
• A time series depicts the relationship between two variables. Time is one of those
variables and the second is any quantitative variable.
Importance of Time Series Modelling
1. Business Forecasting
o Prediction of stock market for the next day
o Prediction of product sales of a company for the next year
2. Understand Past Behaviour
o Using TSM, one can analyze the sales of an item and answer questions like:
▪ During which period were the sales high?
▪ During which period were the sales low?
o Often, time series analysis helps us uncover trends that previously
went unnoticed; these insights might allow us to make successful
predictions about the future.
3. Plan for the Future:
o Companies and industries make use of TSM to make and plan decisions for
the future based on previous or historical data.
o When organizations analyze data over consistent intervals, they can also use
time series forecasting to predict the likelihood of future events.
4. Evaluate Current Targets:
o TSM helps companies analyze whether the targets set by them were achieved
or not. It helps answer questions like:
▪ Did the company achieve its forecasted sales?
▪ If yes, what was the reason? If not, what were the reasons?
5. Weather Forecasting:
o TSM is ideal for forecasting weather changes, helping meteorologists predict
everything from tomorrow’s weather report to future years of climate
change.

Components of TSM
1. Trend
o The trend shows the general tendency of the data to increase or decrease
during a long period of time.
o A trend could be:
▪ Uptrend: If the TSM shows a general pattern that is upward, then it is
an uptrend.
▪ Downtrend: If the TSM shows a pattern that is downward then, it is a
downtrend.
▪ Horizontal or Stationary trend: If no pattern is observed, then it is
called a Horizontal or stationary trend.
o Ex: Suppose there is a hardware shop near a construction site. Until the
construction is completed, the hardware shop would see an uptrend in sales
and once the construction is completed, there would be a downtrend in sales
o Hence, we can say that trend is something which happens for some time and
then it disappears.
2. Seasonality
o Seasonality is a repeating pattern within a fixed period.
o It is a predictable pattern that recurs or repeats over regular intervals.
o Seasonality is often observed within a year or less.
o Ex: Every year during the holidays or Christmas season, sales revenue
increases. And during the off-season, the sales go down.
o Ex: During the monsoon season, sales of umbrellas and raincoats increase,
and they decrease once the season is over.
3. Irregularity:
o It is also known as noise; these fluctuations are erratic and unsystematic.
o This happens only for a short duration, and it is non-repeating.
o It is a random part of the data.
o There is no specific pattern, which makes it difficult to be factored into an
analysis.
o It does not repeat often and is caused due to unforeseen or highly random
events like natural disasters.
4. Cyclicity:
o It is somewhat like seasonality, but in cyclicity, the duration is unfixed, and
the gap length of time between two cycles can be much longer.
o Ex: Suppose a recession happened in 2002, then one in 2008, and then
another one in 2012.
o So, it is not every year but for a much longer time over the years.
o The duration of the event varies, and the gap between the durations too.
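The trend, seasonal, and irregular components above can be separated programmatically; here is a minimal, hypothetical sketch using statsmodels' seasonal_decompose on a synthetic monthly series (the series and its parameters are illustrative assumptions):

import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

idx = pd.date_range('2015-01', periods=72, freq='MS')               # 6 years, monthly
rng = np.random.default_rng(0)
series = pd.Series(0.5 * np.arange(72)                              # upward trend
                   + 10 * np.sin(2 * np.pi * np.arange(72) / 12)    # yearly seasonality
                   + rng.normal(0, 1, 72),                          # irregular noise
                   index=idx)

result = seasonal_decompose(series, model='additive')
print(result.seasonal.head(12))      # the repeating within-year pattern
print(result.trend.dropna().head())  # the smoothed long-run tendency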
Time Series Models
1. Autoregression (AR)
o The autoregression (AR) method models the next step in the sequence as a
linear function of the observations at prior time steps.
o Number of AR (Auto-Regressive) terms (p):
▪ p is the parameter associated with the auto-regressive part of the
model, which incorporates past values, i.e., lags of the dependent
variable.
▪ For instance, if p is 5, the predictors for x(t) will be
x(t − 1), …, x(t − 5).
2. Moving Average (MA)
o The Moving Average (MA) method models the next step in the sequence as
a linear function of the forecast errors at prior time steps.
o Number of MA (Moving Average) terms (q):
▪ q is the size of the moving-average window of the model, i.e., the
number of lagged forecast errors in the prediction equation.
▪ For instance, if q is 5, the predictors for x(t) will be
e(t − 1), …, e(t − 5), where e(i) is the difference between the
moving average at the i-th instant and the actual value.
3. Autoregressive Moving Average (ARMA)
o In the ARMA model, there are 2 parameters:
o Number of AR (Auto-Regressive) terms (p):
▪ p is the parameter associated with the auto-regressive part of the
model, which incorporates past values, i.e., lags of the dependent
variable.
▪ For instance, if p is 5, the predictors for x(t) will be
x(t − 1), …, x(t − 5).
o Number of MA (Moving Average) terms (q):
▪ q is the size of the moving-average window of the model, i.e., the
number of lagged forecast errors in the prediction equation.
▪ For instance, if q is 5, the predictors for x(t) will be
e(t − 1), …, e(t − 5), where e(i) is the difference between the
moving average at the i-th instant and the actual value.
4. Autoregressive Integrated Moving Average (ARIMA)
o In an ARIMA model there are 3 parameters that are used to model the
major aspects of a time series, such as trend and noise.
o These parameters are labelled p, d and q.
o Number of AR (Auto-Regressive) terms (p): p is the parameter associated
with the auto-regressive part of the model, which incorporates past values,
i.e., lags of the dependent variable.
o Number of Differences (d): d is the parameter associated with the integrated
part of the model, which determines the amount of differencing to apply to
the time series.
o Number of MA (Moving Average) terms (q): q is the size of the moving-
average window of the model.
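A minimal, hypothetical sketch of fitting these models with statsmodels (the synthetic series and the chosen order are illustrative): ARIMA(p, d, q) covers AR (d = q = 0), MA (p = d = 0), and ARMA (d = 0) as special cases.

import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)
series = pd.Series(rng.normal(0, 1, 200)).cumsum()   # a random-walk-like series

model = ARIMA(series, order=(5, 1, 0))   # p = 5 AR lags, d = 1 difference, q = 0
fitted = model.fit()
print(fitted.aic)                        # fit quality (lower is better)
print(fitted.forecast(steps=3))          # next 3 predicted values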
When not to use TSM
1. When the values are constant - this means they do not depend on time, so first
of all, the data is not time series data and, secondly, the analysis is pointless as
the values never change.
2. When the values can be expressed by a function - for example, sin(x), cos(x), etc.
It is, again, pointless to use time series analysis, as you can calculate the values
directly from the function.

Stationarity in the Data


• By stationarity, we mean that the behaviour of the data over time follows a
specific pattern that is stationary, i.e., constant.
• If stationarity is observed in the data, then there is a very high possibility that the
same pattern would be seen in future data too.
• The criteria for stationarity are:
o The mean of the time series should be constant
o The variance of the time series should be constant
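A minimal, hypothetical sketch of checking these criteria (the synthetic series are illustrative): compare the rolling mean over time and run statsmodels' Augmented Dickey-Fuller test, where a p-value below about 0.05 suggests stationarity.

import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(0)
stationary = pd.Series(rng.normal(0, 1, 300))   # constant mean and variance
trending = stationary.cumsum()                  # mean drifts over time

for name, s in [('stationary', stationary), ('trending', trending)]:
    # A stable rolling mean hints at a constant mean over time
    print(name, 'std of rolling mean:', round(float(s.rolling(50).mean().std()), 2))
    print(name, 'ADF p-value:', round(float(adfuller(s)[1]), 4))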
Linear Regression - Python Implementation
Dataset: data.csv
Duration,Average_Pulse,Max_Pulse,Calorie_Burnage,Hours_Work,Hours_Sleep
60,110,130,409,0,8
45,117,148,406,0,6.5
.
.
.
60,102,127,300,0,7.5
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats

data = pd.read_csv("data.csv", header=0, sep=",")

x = data ["Average_Pulse"]
y = data ["Calorie_Burnage"]

slope, intercept, r, p, std_err = stats.linregress(x, y)


# r ⟶ correlation coefficient
# p ⟶ p-value of the hypothesis test whose H0 is slope = 0
# std_err ⟶ standard error of the estimated slope

plt.scatter(x, y)                 # draw a scatter plot of x vs actual y
plt.plot(x, slope*x + intercept)  # draw a line plot of x vs predicted y

plt.ylim(0, 2000)
plt.xlim(0, 200)
plt.xlabel("Average_Pulse")
plt.ylabel("Calorie_Burnage")
plt.show()
Output: a scatter plot of Average_Pulse against Calorie_Burnage with the fitted
regression line drawn through the points.
KNN - Python Implementation


An SUV car manufacturer wants to advertise their new SUV to potential buyers. They
have a dataset of customer details. For this problem, use ‘Estimated Salary’ and ‘Age’
as independent variables and ‘Purchased’ column as dependent variable.

1. Data Preprocessing

import numpy as np
import pandas as pd

data = pd.read_csv('user_data.csv')

# Extracting independent and dependent variables
x = data.iloc[:, [2, 3]].values   # columns 2 and 3: 'Age' and 'Estimated Salary'
y = data.iloc[:, 4].values        # column 4: 'Purchased'
# Splitting the dataset into training and test set.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)
# There are a total of 400 rows.
# Hence, as test_size is 25%, the size of x_test and y_test is 100

# Feature Scaling will normalize the columns so that each column will have μ = 0
# and σ = 1
from sklearn.preprocessing import StandardScaler
ss = StandardScaler()
x_train = ss.fit_transform(x_train) # fit_transform for train
x_test = ss.transform(x_test) # transform for test

2. Fitting K-NN classifier to the Training data

from sklearn.neighbors import KNeighborsClassifier


classifier = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)
# n_neighbors ⟶ k value
# p ⟶ Power parameter for the Minkowski metric
# p = 1 ⟶ Manhattan distance
# p = 2 ⟶ Euclidean distance
classifier.fit(x_train, y_train)

3. Predicting the Test Result

y_pred = classifier.predict(x_test)

4. Creating the Confusion Matrix

from sklearn.metrics import confusion_matrix


cm = confusion_matrix(y_test, y_pred)

• We can see there are 64 + 29 = 93 correct predictions and 3 + 4 = 7 incorrect
predictions. Hence the performance of the KNN model is satisfactory.
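As a quick sanity check of the numbers above (this snippet assumes the cm array from the previous step), the correct predictions lie on the diagonal of the confusion matrix:

import numpy as np

accuracy = np.trace(cm) / cm.sum()   # (64 + 29) / 100 = 0.93
print(accuracy)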
