FDS L1 to L8 slides


ZG 536
Foundations of Data Science
Pravin Mhaske
BITS Pilani, Pilani Campus

M1 Data Science Foundations


Lecture 1 - Introduction
Course Objectives

No   Objective
CO1  Get introduced to the field of Data Science, roles, process and challenges involved therein
CO2  Explore and experience the steps involved in data preparation and exploratory data analysis
CO3  Learn to select and apply the proper analytics technique for various scenarios, assess the model's performance and interpret the results of the predictive model
CO4  Get familiar with the general deployment considerations of predictive models
CO5  Appreciate the importance of techniques like data visualization and storytelling with data for the effective presentation of outcomes to stakeholders

BITS Pilani, Pilani Campus


Evaluation

No   Name                                     Type                 Duration  Weight  Day, Date, Session, Time
EC1  Experiential Learning Assignments 1 & 2  Take Home - Online             25%     To be announced
EC2  Mid-Semester Exam                        Open or Closed Book  2 hours   30%
EC3  Comprehensive Exam                       Open Book            2 hours   45%


The instructor

Pravin Mhaske

Qualification       Bachelor of Engineering (Mechanical)
                    Master of Science (Business Analytics)
Experience          21 years (Industry), 6 years (Teaching)
Teaching interests  Statistics, Data Science, Machine Learning, Business Analytics
Pedagogy            Concepts, foundation, intuition, hands-on practice, experiential learning


What exactly is Data Science?

• An interdisciplinary field that uses algorithms, procedures, and processes to examine large amounts of data
• Study of data to extract meaningful insights for business
• Using data to solve problems and make decisions!
• Applied Statistics!

Breaking it down:
• Data: Everything is data. Structured, unstructured.
• Scientific methods: Scientific approach, questions, data collection, analyze,
interpret, conclusion
• Statistics: Patterns, trends, insights
• Domain expertise: SME, actionable and relevant insights
• Programming: Process and manipulate data



Applications
• Every domain!
• Healthcare: Better operations, early detection, prevention
• Retail: Customer behavior, STP, customer experience
• Banking and Finance: Financial advice and planning, predictions, fraud detection
• Transportation: Optimization, better planning
• Manufacturing: Fault detection, IoT, operations and process improvement
• Meteorology: Weather, seismic, geospatial data
• Social media/TC: Sentiment analysis, demand
• Energy and utilities: Consumption, control
• Public services: Planning, development
• Sports, Entertainment: Strategy, content creation, demand analysis
• Politics?


Some Examples
• Recommender systems: Amazon, Netflix, YouTube

• Personalization: Learning, ads, promotions and discounts

• Decision making: Google maps

• Fraud detection: transactions

• Dynamic pricing: Surge pricing

• Smart homes, voice assistants

• Social media trends

• Spam mail filters

• Traffic lights

• Online dating



Why learn Data Science?

• Career opportunities

• Rapid digital evolution

• Data is growing

• Flexibility – all industries, freelancing

• Demand-Supply gap

• Analytical, scientific approach

• Being logical and sensible

• Life skill - Solving real life problems



DS, AI, ML, DL, Analytics?
• Data Science: Processing, analyzing, insights
• Business Analytics: Solving problems, making decisions
• Artificial Intelligence: Machines simulate human behavior
• Machine Learning: Computers learn by themselves
• Deep Learning: Artificial neural networks



DS/ML project flow



Popular Roles and Skills

Data Engineer: SQL, Python, Hive, Pig, Java, Hadoop, Spark, Kafka, Azkaban, Airflow, AWS, GCP, Azure. Data warehousing; ability to write, analyze, and debug SQL queries; Big Data platforms like Hadoop, Spark, Kafka, Flume, Pig, Hive, etc.; experience in handling data pipeline and workflow management tools like Azkaban, Luigi, Airflow, etc.; strong communication skills.

Data/BI Analyst: SQL, Excel, Python/R, Tableau/PowerBI/QlikView, basics of Big Data, basics of Cloud. Programming skills in Python/R; solid understanding of database management systems; proficient SQL/HQL skills; good data visualization skills and proficiency with Tableau/PowerBI/QlikView, etc.; basic understanding of predictive modelling.

ML Engineer: Python, Machine Learning algorithms, DL/NLP, Java, DBMS, Cloud Architecture, Big Data architectures, AWS/GCP/Azure. Understanding of data structures, data modeling and software architecture; deep knowledge of math, probability, statistics and algorithms; ability to write robust code in Python, Java and R; familiarity with machine learning frameworks (like Keras or PyTorch) and libraries (like scikit-learn).

Business Analyst: Excel, Visio, SQL, Tableau. Domain understanding; requirement gathering; requirement elicitation; process excellence; user acceptance testing; documentation prowess; basic data analysis skills.


Data Scientist

Wears many hats!


1. Data Acquisition and Preparation: Data Sources, Cleaning,
Preprocessing, Integration, Wrangling
2. Data Analysis: EDA, insights, patterns
3. Modeling: Statistical/hypothesis testing, ML models - Building,
testing, tuning, deploying
4. Communication: Story-telling, visualization, audience
5. Collaboration: Stakeholders
6. Solutions: Practical, relevant



Data Scientist Skills



Data Science Vs other domains

1. Interdisciplinary: Statistics, Mathematics, Computer Science,


Programming, and domain-specific expertise
2. Focus: Data
3. Problem solving: Real world challenges. Always new.
4. Evolution: Tools, techniques, algorithms
5. Lifelong learning: No crash course!
6. All industries
7. No defined scope
8. No single correct solution
9. Answer to many questions is ‘depends!’



Challenges

1. Data: Acquisition, access, quality, volume


2. Technical: Tools, algorithms
3. Explainable AI: Interpretability and explainability
4. Communication: stakeholders
5. Privacy and Security
6. Continuous learning



Prerequisites for the course

STATISTICS   PYTHON PROGRAMMING   EXCEL   ANACONDA INSTALLATION


Exercise



ZG 536
Foundations of Data Science
Pravin Mhaske
BITS Pilani, Pilani Campus

M1 Data Science Foundations


Lecture 2 Data Science Process
Data Scientist’s Toolbox




Data Scientist

Wears many hats!


1. Data Acquisition and Preparation: Data Sources, Cleaning,
Preprocessing, Integration, Wrangling
2. Data Analysis: Exploratory Data Analysis (EDA), insights
3. Modeling: Statistical/hypothesis testing, ML models - Building,
testing, tuning, deploying
4. Communication: Story-telling, visualization, audience
5. Collaboration: Stakeholders
6. Solutions: Practical, relevant



Data Science/ML Process

Business Analyst

Data Engineer

Data/BI Analyst

ML Engineer

Data Scientist

Source: https://techcommunity.microsoft.com/t5/azure-developer-community-blog/the-data-science-process-with-azure-machine-learning/ba-p/336162





BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Data Science Process

1. Business Understanding
• What is the problem?
• What is the objective?
• What is causing the problem?

2. Data Understanding
• What data do we have?
• What data do we need?



Data Science Process

3. Data Preparation
• Data Collection – Sources, format
• How to get the data?
• Where to store the data? In what format?
• Is the data clean and complete?
• Data Cleaning
• EDA – data to insights
• Feature Engineering
4. Modeling
• What kind of a problem?
• What kind of an algorithm?



Data Science Process

5. Evaluation (and tuning)


• Building models
• Testing the model - performance
• Model selection
• Interpretability-Accuracy tradeoff
• Tuning the model hyperparameters
• Findings and conclusion
6. Deployment
• Environment
• Architecture



Data Scientist’s Toolbox

1. Data Collection
• Hadoop Ecosystem (HDFS, Hive, Pig)
2. Data Preparation
• SQL
• Python and Python libraries - pandas
3. EDA
• Excel
• RStudio
• Power BI
• Tableau
• Python libraries – matplotlib, pandas, seaborn



Data Scientist’s Toolbox

4. Statistical Analysis
• RStudio
• MATLAB
• SAS
• SPSS
5. Model building
• Jupyter Notebook
• Python libraries – NumPy, SciPy, scikit-learn
• TensorFlow
• PyTorch
• AWS/Azure/GCP



ZG 536
Foundations of Data Science
Pravin Mhaske
BITS Pilani, Pilani Campus

M1 Data Science Foundations


Lecture 3 Types of Data and Datasets
Data Quality
Data Preprocessing
Types of Data

Data

Qualitative Quantitative
(Categorical) (Numeric)

Nominal Ordinal Discrete Continuous



Categorical Data
• Characteristics or attributes
• Non-numeric. Cannot be computed

Nominal
• No specific order
• All categories are equal
• Cannot be measured
• Gender, colors, divisions

Ordinal
• Natural order
• Categories can be compared
• High-Medium-Low, First-Second-Third, etc.



Numeric Data
• Numbers
• Measurable or countable
• Calculations can be performed

Discrete
• Only certain values
• Typically, whole numbers
• Countable
• Runs, goals, marks

Continuous
• All possible values within a range
• Typically, with fractions and decimals
• Measurable
• Height, weight, temperature
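
The four types map neatly onto pandas dtypes. A minimal sketch (the column names and values are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "color":  pd.Categorical(["red", "blue", "red"]),      # nominal: no order
    "rating": pd.Categorical(["Low", "High", "Medium"],
                             categories=["Low", "Medium", "High"],
                             ordered=True),                # ordinal: natural order
    "goals":  [2, 0, 3],                                   # discrete: countable
    "height": [1.72, 1.80, 1.65],                          # continuous: measurable
})

print(df.dtypes)
print(df["rating"].min())   # ordered categories can be compared
```

Marking a column as an ordered categorical lets pandas compare and sort its values without pretending they are numbers.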



Types of Datasets

• Set of data as a collection


• Structured, unstructured, semi-structured
• Used for a meaningful activity, say, analysis

Formats
• Tabular – rows and columns (XLS, CSV)
• Web data – JSON, XML
• Time series dataset
• Image dataset
• Bivariate
• Multivariate
Why Data Quality?

1. Better decisions
2. Correct analysis and insights
3. Better problem-solving
4. Reliable results
5. Less ambiguity
6. Customer experience
7. Compliance
8. Cost



Data Quality



Data Preprocessing

[Figure: model, histogram, cluster, sample]



Handling missing values

1. Delete the row


2. Drop the column
3. Impute by mean/median (numeric)
4. Impute by mode (category)
5. Use algorithm
6. Forward and backward fill
7. Build a model and guess the appropriate value
8. Create new value (missingness as a feature)
9. Use libraries
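
A few of these strategies in pandas, on a tiny invented DataFrame (options 3, 4 and 8 from the list):

```python
import pandas as pd

df = pd.DataFrame({
    "age":  [25, None, 32, 41, None],
    "city": ["Pune", "Delhi", None, "Delhi", "Pune"],
})

# 8. Record missingness as a feature before filling anything in
df["age_missing"] = df["age"].isna().astype(int)

# 3. Impute a numeric column by its median
df["age"] = df["age"].fillna(df["age"].median())

# 4. Impute a categorical column by its mode
df["city"] = df["city"].fillna(df["city"].mode()[0])

print(df.isna().sum().sum())  # 0 – no missing values remain
```

Which strategy is right depends on why the values are missing; blind imputation can bias the later analysis.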



ZG 536
Foundations of Data Science
Pravin Mhaske
BITS Pilani, Pilani Campus

M2 Data Science Foundations


Lecture 4 Descriptive Analytics
Data Visualizations
Storytelling with data
Types of Analytics

• Descriptive analytics: "What happened?" – summarizes historical data
• Diagnostic analytics: "Why did it happen?" – analyzes reasons for the trends and patterns identified
• Predictive analytics: "What will happen?" – uses statistical models and machine learning to predict future outcomes
• Prescriptive analytics: "What to do?" – recommends what to do next based on insights from historical and predictive data


Descriptive Analytics and Descriptive
Statistics

Statistical interpretation used to analyze historical data to identify patterns and relationships.

Standalone applications – charts, tables, figures, plots, reports, dashboards

As an input to predictive analytics – Exploratory Data Analysis


Descriptive Statistics

• Measures of Central Tendency


– Mathematical averages – Mean (Arithmetic, Harmonic, Geometric…)
– Positional averages – Median, Mode, Quartiles, Percentiles

• Measures of Dispersion
– Range
– Standard Deviation/Variance
– IQR
– Coefficient of Variation

• Measures of Association
– Covariance
– Correlation
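
All of these measures are one-liners in NumPy; a sketch on invented data:

```python
import numpy as np

data = np.array([4, 8, 6, 5, 3, 7, 9, 5])

mean   = data.mean()                      # central tendency
median = np.median(data)
rng    = data.max() - data.min()          # dispersion: range
s      = data.std(ddof=1)                 # sample standard deviation (n - 1)
q1, q3 = np.percentile(data, [25, 75])
iqr    = q3 - q1                          # interquartile range
cv     = s / mean                         # coefficient of variation

print(mean, median, rng, round(s, 3), iqr, round(cv, 3))
```

Note `ddof=1`: NumPy defaults to the population formula (divide by n), so the sample version must be requested explicitly.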



Measures of Dispersion

• Dispersion indicates the variability, scatter, or spread of data; it shows how stretched or squeezed the underlying distribution is.
• Extremely useful for a number of
practical applications: quality control,
reliability analysis, banking, insurance
and portfolio management, etc.
• Range, interquartile range, variance,
standard deviation, and coefficient of
variation



Measures of Dispersion

Variance/standard deviation
• Numerous applications in descriptive statistics, statistical inference, hypothesis testing, Monte Carlo simulation, and analysis of variance.
• Wide applications in physics, biology, chemistry, economics, and finance.

Sample standard deviation: s = √( Σᵢ₌₁ⁿ (xᵢ − x̄)² / (n − 1) )

Coefficient of Variation (CV)
• A relative measure of dispersion.
• It has enormous applications in quality assurance studies.
• Useful in comparing the dispersion of two distributions having different measurement units.

Coefficient of variation (sample): CV = s / x̄
Coefficient of variation (population): CV = σ / μ
Measures of Dispersion

• Range: difference between the largest and the smallest values in a dataset.
• Interquartile range: difference between the third (upper) quartile and first (lower)
quartile. IQR = Q3 – Q1



Box and whisker plot, Five number summary

• The box-and-whisker plot (box plot) is a graphical representation of a set of observations, based on the five-number summary.
• It is a very useful tool in detecting outliers, and in
summarizing the distribution of data.
• Five number summary
• Min
• Q1
• Q2 (Median)
• Q3
• Max
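
The five-number summary is exactly what `np.percentile` at [0, 25, 50, 75, 100] returns (toy data assumed):

```python
import numpy as np

data = np.array([7, 15, 36, 39, 40, 41])

# Min, Q1, Q2 (median), Q3, Max in one call
five_num = np.percentile(data, [0, 25, 50, 75, 100])
print(five_num)
```

Box-plot libraries such as matplotlib compute the same five numbers internally (`plt.boxplot(data)`).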



Outlier

• An outlier (spurious data point) is an observation point that is distant from other
observations.

• Outlier detection methods:


• Standardized values (z-scores)
• Using quartiles and IQR:
• Find lower limit = Q1 – 1.5 (IQR) and upper limit = Q3 + 1.5 (IQR)
• Data outside this range could be flagged as outliers.
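
The IQR rule as a short NumPy sketch (the data are invented, with one planted outlier):

```python
import numpy as np

data = np.array([12, 14, 15, 15, 16, 18, 19, 45])   # 45 looks suspicious

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = data[(data < lower) | (data > upper)]
print(outliers)   # only the planted value is flagged
```

Flagged points deserve inspection, not automatic deletion; an outlier may be an error or a genuinely interesting observation.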



Skew and Kurtosis



Measures of Association

Covariance
• An absolute measure of how much two variables change together.
• The sign of the covariance shows the tendency in the linear relationship between the variables; the magnitude of the covariance does not carry much meaning on its own.
• If two variables tend to show similar behaviour, the covariance is positive, otherwise negative. Zero covariance implies the variables are not linearly related.

Sample covariance: s_xy = Σᵢ₌₁ⁿ (xᵢ − x̄)(yᵢ − ȳ) / (n − 1)

Population covariance: σ_xy = Σᵢ₌₁ᴺ (xᵢ − x̄)(yᵢ − ȳ) / N
Measures of Association

Correlation
• Correlation is a normalized covariance. It lies between −1 and +1.
• It provides a measure of the linear relationship or association between two variables.
• If two variables tend to show similar behaviour, the correlation is positive, otherwise negative.

Sample correlation: r_xy = s_xy / (s_x s_y) = Σᵢ₌₁ⁿ (xᵢ − x̄)(yᵢ − ȳ) / √( Σᵢ₌₁ⁿ (xᵢ − x̄)² · Σᵢ₌₁ⁿ (yᵢ − ȳ)² )

Population correlation: ρ_xy = σ_xy / (σ_x σ_y)
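
NumPy implements both measures directly; a sketch on two invented, perfectly linear vectors:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2.0 * x                                # perfectly linear relationship

s_xy = np.cov(x, y, ddof=1)[0, 1]          # sample covariance (off-diagonal entry)
r_xy = np.corrcoef(x, y)[0, 1]             # correlation, scale-free

print(s_xy, r_xy)
```

Doubling the units of y would double s_xy but leave r_xy at 1, which is why correlation is the preferred measure for comparing associations.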


Correlation



Exploratory Data Analysis (EDA)

• An approach to analyze the data using visual techniques. Initial investigation.

Objectives:
1. Explore data to become familiar with data
2. Discover patterns, trends, relationships
3. Spot anomalies
4. Test hypotheses or assumptions
5. Summarize data
6. Missing/Null values
7. Explain outcomes or results of analysis
8. Tell a story with data
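
A first EDA pass in pandas usually starts with a handful of standard calls (toy data assumed):

```python
import pandas as pd

df = pd.DataFrame({
    "region": ["North", "South", "West", "North", "South"],
    "sales":  [12000, 45678, 34567, 23456, 12000],
})

df.info()                            # column types and non-null counts
print(df.describe())                 # summary statistics for numeric columns
print(df["region"].value_counts())   # distribution of a categorical column
print(df.isna().sum())               # missing values per column
```

These few calls cover objectives 1, 5 and 6 above before any plotting begins.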



Types of Visualization

Text:
• Simple text
• Tables
• Heatmap

Region  Q1 Sales  Q2 Sales  Q3 Sales  Q4 Sales
North   12000     23456     20000     12345
South   45678     12000     67890     12346
West    34567     12345     12000     45678


Types of Visualization

• Graphs
• Points (Scatter)
• Lines
• Slopegraph

[Example charts: a scatter plot, a line chart, and a slopegraph]


Types of Visualization

◦ Bars
◦ Horizontal / Vertical
◦ Stacked
◦ Waterfall
◦ Area

[Example charts: a vertical bar chart, a 100% stacked bar chart, a horizontal stacked bar chart, and an area chart]


Types of Visualization

Univariate - distribution

Bivariate - relationships

• Categorical Vs Categorical
• Continuous Vs Continuous
• Continuous Vs Categorical

Multivariate



Visualization Cheat Sheet



Visualization Cheat Sheet



Visualizations to be avoided

• Pie charts/Donut charts
• 3D charts
• Dual Axis charts

[Example charts: a pie chart and a 3D chart of the kind to avoid]


Bad Design



Good Design



Decluttering



Storytelling with data



How to lie?



Visualization tools

• Excel

• Tableau

• Power BI

• Python (Matplotlib, Seaborn, Bokeh)

• R/RStudio

• QlikView/Qlik Sense…

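
As a minimal example of the Python route, a bar chart with Matplotlib (data invented; the Agg backend renders without a display):

```python
import matplotlib
matplotlib.use("Agg")            # render off-screen, no GUI needed
import matplotlib.pyplot as plt

regions = ["North", "South", "West"]
sales = [12000, 45678, 34567]

fig, ax = plt.subplots()
ax.bar(regions, sales)
ax.set_xlabel("Region")
ax.set_ylabel("Q1 Sales")
ax.set_title("Q1 Sales by Region")
fig.savefig("q1_sales.png")
```

The same dozen lines in Seaborn or a drag-and-drop in Tableau/Power BI produce equivalent output; the tool matters less than the choice of chart.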


ZG 536
Foundations of Data Science
Pravin Mhaske
BITS Pilani, Pilani Campus

M4 Predictive Modeling
Lecture 5 Linear Regression
What is Machine Learning?

According to Tom M. Mitchell, Chair of Machine Learning at Carnegie Mellon University and author of the book Machine Learning (McGraw-Hill):

"A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E."

We now have a set of objects to define machine learning: Task (T), Experience (E), and Performance (P).
With a computer running a set of tasks, the experience should lead to performance increases (to satisfy the definition).

Many data mining tasks are executed successfully with the help of machine learning.


Types of Machine Learning



Regression

Predict the value of a given continuous variable based on the values of other variables, assuming a linear or nonlinear model of dependency.
Greatly studied in statistics and neural network fields.
Examples:
◦ Predicting sales amounts of new product based on advertising expenditure.
◦ Predicting wind velocities as a function of temperature, humidity, air pressure, etc.
◦ Predicting price of a house based on its attributes



Variables

Sales is the dependent variable
• Also known as the Response or Target
• Generically referred to as Y

TV, Radio and Paper are the independent variables
• Also known as features, inputs, or predictors
• Generically referred to as X (or X1, X2, X3)

#   TV     Radio  Paper  Sales
1   230.1  37.8   69.2   22.1
2   44.5   39.3   45.1   10.4
3   17.2   45.9   69.3   9.3
4   151.5  41.3   58.5   18.5
5   180.8  10.8   58.4   12.9
6   8.7    48.9   75.0   7.2


Matrix X and Vector y

The Advertising data set has 4 variables and 6 observations.
The variable names are "TV", "Radio", "Paper" and "Sales".

p = 3 (the number of independent variables)
n = 6 (the number of observations)

X represents the input data set; X is a 6 × 3 matrix.
y represents the output variable; y is a 6 × 1 vector.

#   TV     Radio  Paper  Sales
1   230.1  37.8   69.2   22.1
2   44.5   39.3   45.1   10.4
3   17.2   45.9   69.3   9.3
4   151.5  41.3   58.5   18.5
5   180.8  10.8   58.4   12.9
6   8.7    48.9   75.0   7.2


Matrix X and Vector y

#   TV     Radio  Paper  Sales
1   230.1  37.8   69.2   22.1
2   44.5   39.3   45.1   10.4
3   17.2   45.9   69.3   9.3
4   151.5  41.3   58.5   18.5
5   180.8  10.8   58.4   12.9
6   8.7    48.9   75.0   7.2

X is a 6 × 3 matrix (X_6×3) and y is a 6 × 1 vector (y_6×1).

x_i represents the ith observation; x_i is the vector (x_i1, x_i2, ..., x_ip).
x_j represents the jth variable; x_j is the vector (x_1j, x_2j, ..., x_nj).
y_i represents the ith observation of the output variable; y is the vector (y_1, y_2, ..., y_n).


A Linear Model

The linear model is an important example of a parametric model:

f(X) = β0 + β1 X1 + β2 X2 + · · · + βp Xp

• A linear model is specified in terms of p + 1 parameters: β0, β1, . . . , βp.
• We estimate the parameters by fitting the model to training data.
• Although it is almost never correct, a linear model often serves as a good and interpretable approximation to the unknown true function f(X).



A Linear Model (Parametric)

The linear model is an example of a parametric model:

f(X) = β0 + β1 X1 + β2 X2 + · · · + βp Xp

• A linear model is specified in terms of p + 1 parameters: β0, β1, . . . , βp.
• We estimate the parameters by fitting the model to training data.

• Simple Linear Regression: only one x variable
• Multiple Linear Regression: many x variables



A Linear Model

We want to predict Y for a given value of x

Is there an ideal f (X)?


• What is a good value for f (X) at any selected value of X , say X = 4? There
can be many Y values at X = 4
A good value is f (4) = E(Y |X = 4), the expected value of Y given X = 4.
This ideal f (x) = E(Y |X = x) is called the regression function.



A Linear Model

Y = β0 + β1 X1 + β2 X2 + · · · + βp Xp + ε

• β's: Unknown constants, known as coefficients or parameters
• βj: The average effect on Y of a unit increase in Xj, holding all other predictors fixed
• ε is the error term – it captures measurement errors and missing variables
• ε is a random variable independent of X, with E(ε) = 0

• In the advertising example, the model becomes
  sales = β0 + β1 × TV + β2 × radio + β3 × newspaper + ε

f is said to represent the systematic information that X provides about Y
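
Fitting this model to the six advertising rows from the earlier slides takes a few lines with scikit-learn (a sketch; real work would use the full dataset):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# TV, radio, newspaper spend per observation
X = np.array([[230.1, 37.8, 69.2],
              [ 44.5, 39.3, 45.1],
              [ 17.2, 45.9, 69.3],
              [151.5, 41.3, 58.5],
              [180.8, 10.8, 58.4],
              [  8.7, 48.9, 75.0]])
y = np.array([22.1, 10.4, 9.3, 18.5, 12.9, 7.2])   # sales

model = LinearRegression().fit(X, y)    # estimates beta_0 ... beta_3
print(model.intercept_, model.coef_)
```

`model.coef_[j]` is the estimate of βj+1: the average change in sales for a unit increase in that channel's spend, holding the others fixed.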



Goodness of Fit



Coefficient of Determination (R-squared)

• Proportion of the variance in the dependent variable that can be explained by the independent variable(s)

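
R-squared can be computed directly from its definition, 1 − SS_res / SS_tot (the observed and predicted values here are invented for illustration):

```python
import numpy as np

y     = np.array([3.0, 5.0, 7.0, 9.0])    # observed
y_hat = np.array([2.8, 5.2, 7.1, 8.9])    # predicted by some model

ss_res = np.sum((y - y_hat) ** 2)         # unexplained variation
ss_tot = np.sum((y - y.mean()) ** 2)      # total variation
r2 = 1 - ss_res / ss_tot
print(r2)                                  # close to 1: a good fit
```

scikit-learn's `r2_score(y, y_hat)` performs exactly this computation.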


Regression Assumptions

1. E(ε) = 0
2. The model adequately captures the relationship
3. Var(ε) = σ2 for all values of the independent variables (Homoscedasticity)
4. ε is normally distributed
5. The values of ε are independent (No Serial Correlation or Autocorrelation)
6. There is no (or little) multicollinearity among the independent variables



Multicollinearity and VIF
• X1 and X2 are significant when included separately, but together the effect of both variables shrinks. Multicollinearity exists when there is correlation between multiple independent variables in a multiple regression model. This can adversely affect the regression results.
• Multicollinearity does not reduce the explanatory power of the model; it does reduce the statistical significance of the
independent variables.
• Test for Multicollinearity: Variance Inflation Factor

• VIF equal to 1 = variables are not correlated


• VIF between 1 and 5 = variables are moderately correlated
• VIF greater than 5 = variables are highly correlated

Solutions to multicollinearity
1. Drop unnecessary variables
2. Advanced techniques: Ridge / Lasso / Stepwise / Principal Components Regression
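
The VIF of column j is 1 / (1 − R²_j), where R²_j comes from regressing X_j on the remaining predictors. A NumPy sketch (synthetic data, with one nearly collinear pair planted):

```python
import numpy as np

def vif(X, j):
    """VIF of column j: regress X[:, j] on the other columns plus an
    intercept, then return 1 / (1 - R^2)."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    r2 = 1 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)
    return 1.0 / (1.0 - r2)

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.1, size=200)   # nearly collinear with x1
x3 = rng.normal(size=200)
X = np.column_stack([x1, x2, x3])

print(vif(X, 0), vif(X, 1), vif(X, 2))   # first two large, third near 1
```

statsmodels ships the same computation as `variance_inflation_factor` if you prefer not to hand-roll it.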



Residual Analysis

• The red line should be approximately horizontal at zero.


• There is no pattern in the first residual plot. The presence of a pattern may indicate a problem with some
aspect of the linear model (case 2)
• E(ε) = 0
• Var(ε) = σ2 for all values of the independent variables (Homoscedasticity)



Homoscedasticity Vs Heteroscedasticity
• Are the residuals spread equally along
the ranges of predictors?
• The plot should have a horizontal line
with equally spread points.

• In the second plot this is not the case: the variability (variance) of the residual points increases with the value of the fitted outcome variable, suggesting non-constant variance in the residual errors (heteroscedasticity).


ZG 536
Foundations of Data Science
BITS Pilani Pravin Mhaske
Pilani Campus
BITS Pilani
Pilani Campus

M4 Predictive Modeling
Lecture 5 Linear Regression
What is Machine Learning?

According to Tom M. Mitchell, Chair of Machine Learning at Carnegie Mellon University and
author of the book Machine Learning (McGraw-Hill),
A computer program is said to learn from experience E with respect to some class of tasks T and
performance measure P, if its performance at tasks in T, as measured by P, improves with the experience
E.
We now have a set of objects to define machine learning:
Task (T), Experience (E), and Performance (P)
With a computer running a set of tasks, the experience should be leading to performance increases (to
satisfy the definition)

Many data mining tasks are executed successfully with help of machine learning

BITS Pilani, Pilani Campus


Types of Machine Learning

BITS Pilani, Pilani Campus


Regression

Predict a value of a given continuous valued variable based on the values


of other variables, assuming a linear or nonlinear model of dependency.
Greatly studied in statistics, neural network fields.
Examples:
◦ Predicting sales amounts of new product based on advertising expenditure.
◦ Predicting wind velocities as a function of temperature, humidity, air pressure, etc.
◦ Predicting price of a house based on its attributes

BITS Pilani, Pilani Campus


Variables

# TV Radio Paper Sales


Sales is the Dependent Variable 1 230.1 37.8 69.2 22.1
• Also known as the Response or Target
• Generically referred to as Y 2 44.5 39.3 45.1 10.4
3 17.2 45.9 69.3 9.3
TV, Radio and Paper are the independent variables
• Also known as features, or inputs, or predictors
4 151.5 41.3 58.5 18.5
• Generically referred to as X (or X1, X2, X3) 5 180.8 10.8 58.4 12.9
6 8.7 48.9 75 7.2

BITS Pilani, Pilani Campus


Matrix X and Vector y

The Advertising data set has 4 variables and 6 # TV Radio Paper Sales
observations 1 230.1 37.8 69.2 22.1
2 44.5 39.3 45.1 10.4
The variable names are “TV”, “Radio”, “Paper” and 3 17.2 45.9 69.3 9.3
“Sales”
4 151.5 41.3 58.5 18.5
p = 3 (the number of independent variables)
5 180.8 10.8 58.4 12.9
n = 6 (the number of observations) 6 8.7 48.9 75 7.2

X represents the input data set; X is a 6 * 3 matrix


y represents the output variable; y is a 6 * 1 vector

BITS Pilani, Pilani Campus


Matrix X and Vector y

# TV Radio Paper Sales


1 230.1 37.8 69.2 22.1
2 44.5 39.3 45.1 10.4
3 17.2 45.9 69.3 9.3
4 151.5 41.3 58.5 18.5
5 180.8 10.8 58.4 12.9
6 8.7 48.9 75 7.2
X is a 6 * 3 matrix or X6*3 & y is a 6 * 1 vector or y6*1

xi represents the ith observation. xi is a vector represented as (xi1 xi2 …..xip)

xj represents the jth variable. xj is a vector represented as (x1j x2j …..xnj)


yi represents the ith observation of the output variable. y is the vector (y1 y2 ….. yp)

BITS Pilani, Pilani Campus


A Linear Model

The linear model is an important example of a parametric model:


f ( X ) = β0 + β 1 X 1 + β 2 X 2 + . . . β p X p .

• A linear model is specified in terms of p + 1 parameters: β 0 , β 1 , . . . , β p .


• We estimate the parameters by fitting the model to training data.
• Although it is almost never correct, a linear model often serves as a good and interpretable
approximation to the unknown true function f (X).

BITS Pilani, Pilani Campus


A Linear Model (Parametric)

The linear model is an example of a parametric model


f ( X ) = β0 + β 1 X 1 + β 2 X 2 + . . . β p X p

• A linear model is specified in terms of p + 1 parameters: β 0 , β 1 , . . . , β p .

• The linear model: f ( X ) = β 0 + β 1 X 1 + β 2 X 2 + . . . β p X p has (p + 1) parameters


• We estimate the parameters by fitting the model to training data.

• Simple Linear Regression: Only one x variable


• Multiple Linear Regression: Many x variables

BITS Pilani, Pilani Campus


A Linear Model

We want to predict Y for a given value of x

Is there an ideal f (X)?


• What is a good value for f (X) at any selected value of X , say X = 4? There
can be many Y values at X = 4
A good value is f (4) = E(Y |X = 4), the expected value of Y given X = 4.
This ideal f (x) = E(Y |X = x) is called the regression function.

BITS Pilani, Pilani Campus


A Linear Model

Y = β 0 + β 1 X1 + β 2 X2 + · · · + β p Xp + ε

• β’s: Unknown constants, known as coefficients or parameters


• βj: The average effect on Y of a unit increase in Xj , holding all other predictors fixed.

• ε is the error term – captures measurement errors and missing variables


• ε is a random variable independent of X
• E(ε) = 0

• In the advertising example, the model becomes


sales = β0 + β1 × TV + β2 × radio + β3 × newspaper + ε

f is said to represent the systematic information that X provides about Y

BITS Pilani, Pilani Campus


BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Regression Assumptions

1. E(ε) = 0
2. The model adequately captures the relationship
3. Var(ε) = σ² for all values of the independent variables (Homoscedasticity)
4. ε is normally distributed
5. The values of ε are independent (No Serial Correlation or Autocorrelation)
6. There is no (or little) multicollinearity among the independent variables



Multicollinearity and VIF
• X1 and X2 may each be significant when included separately, but when included together the effect of both
variables shrinks. Multicollinearity exists when there is correlation among the independent variables in a
multiple regression model. This can adversely affect the regression results.
• Multicollinearity does not reduce the explanatory power of the model; it does reduce the statistical significance of the
independent variables.
• Test for multicollinearity: Variance Inflation Factor (VIF)

• VIF equal to 1: variables are not correlated
• VIF between 1 and 5: variables are moderately correlated
• VIF greater than 5: variables are highly correlated

Solutions to multicollinearity
1. Drop unnecessary variables
2. Advanced techniques: Ridge / Lasso / Stepwise / Principal Components Regression
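The VIF check can be sketched directly from its definition: VIF_j = 1 / (1 − R²_j), where R²_j comes from regressing predictor j on the remaining predictors. The data below are invented for illustration; in practice statsmodels' `variance_inflation_factor` computes the same quantity.

```python
import numpy as np

def vif(X):
    """Variance Inflation Factor for each column of X (shape n x p).

    VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing
    column j on the remaining columns (with an intercept).
    """
    n, p = X.shape
    out = []
    for j in range(p):
        y = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        r2 = 1 - resid.var() / y.var()
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + 0.1 * rng.normal(size=200)   # nearly collinear with x1
x3 = rng.normal(size=200)              # independent predictor
X = np.column_stack([x1, x2, x3])
print(vif(X))  # x1 and x2 get large VIFs; x3 stays near 1
```

The first two VIFs come out far above the "highly correlated" threshold of 5, flagging the near-duplicate predictors, while the independent predictor's VIF stays close to 1.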



Residual Analysis

• The red line should be approximately horizontal at zero.


• There is no pattern in the first residual plot. The presence of a pattern may indicate a problem with some
aspect of the linear model (case 2).
• E(ε) = 0
• Var(ε) = σ² for all values of the independent variables (Homoscedasticity)



Homoscedasticity Vs Heteroscedasticity
• Are the residuals spread equally along
the ranges of predictors?
• The plot should have a horizontal line
with equally spread points.

• In the second plot, this is not the case:
the variability (variance) of the
residual points increases with the value
of the fitted outcome variable,
suggesting non-constant variance in
the residual errors
(heteroscedasticity).



Types of Regression Models
Simple Regression

(Education) x  →  y (Income)

Multiple Regression

(Education)   x1 ─┐
(Soft Skills) x2 ─┼─→  y (Income)
(Experience)  x3 ─┤
(Age)         x4 ─┘



Direct Solution Method
Least Squares Method (Ordinary Least Squares or OLS)

• Slope for the Estimated Regression Equation

b1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)²

• y-Intercept for the Estimated Regression Equation

b0 = ȳ − b1x̄

where:
xi = value of the independent variable for the ith observation
yi = value of the dependent variable for the ith observation
x̄ = mean value of the independent variable
ȳ = mean value of the dependent variable



Exercise
Kumar’s Electronics periodically has a special week-long sale. As part of the advertising
campaign Kumar runs one or more TV commercials during the weekend preceding the
sale. Data from a sample of 5 previous sales are shown below.

# of TV Ads # of Cars Sold


(x) (y)
1 14
3 24
2 18
1 17
3 27



Solution

# of TV Ads   # of Cars Sold
    (x)            (y)         xi − x̄   yi − ȳ   (xi − x̄)(yi − ȳ)   (xi − x̄)²
     1              14           −1       −6             6               1
     3              24            1        4             4               1
     2              18            0       −2             0               0
     1              17           −1       −3             3               1
     3              27            1        7             7               1
Sum  10            100            0        0            20               4
Mean  2             20

• Slope for the Estimated Regression Equation: b1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)² = 20/4 = 5

• y-Intercept for the Estimated Regression Equation: b0 = ȳ − b1x̄ = 20 − 10 = 10

• Estimated Regression Equation: ŷ = b0 + b1x = 10 + 5x
• Predict Sales if Ads run = 5? 15?
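As a quick check, the two OLS formulas can be applied to this data in a few lines of plain Python:

```python
# OLS slope and intercept for the TV-ads data, using the formulas above.
x = [1, 3, 2, 1, 3]
y = [14, 24, 18, 17, 27]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n          # 2.0 and 20.0
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) \
     / sum((xi - x_bar) ** 2 for xi in x)      # 20 / 4
b0 = y_bar - b1 * x_bar                        # 20 - 5 * 2
predict = lambda ads: b0 + b1 * ads
print(b1, b0, predict(5))  # 5.0 10.0 35.0
```

Note that predicting for 15 ads extrapolates far beyond the observed range of the data (1–3 ads), so that estimate should be treated with caution.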
Evaluation of Regression Model



Goodness of Fit



Coefficient of Determination (R-squared)

• The proportion of the variance in the dependent variable that can be explained by the independent
variable(s)

• R² = 1 − SSres / SStot, where SSres is the residual sum of squares and SStot is the total sum of squares
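As a worked example, R² can be computed by hand from the residual and total sums of squares; the sketch below reuses the TV-ads data and the fitted line ŷ = 10 + 5x from the earlier exercise.

```python
# R^2 for the fitted line y_hat = 10 + 5x on the TV-ads data.
x = [1, 3, 2, 1, 3]
y = [14, 24, 18, 17, 27]
y_bar = sum(y) / len(y)
y_hat = [10 + 5 * xi for xi in x]
ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))   # residual sum of squares = 14
ss_tot = sum((yi - y_bar) ** 2 for yi in y)                # total sum of squares = 114
r2 = 1 - ss_res / ss_tot
print(round(r2, 3))  # 0.877
```

So about 88% of the variation in cars sold is explained by the number of TV ads.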



Exercise

• Simple Linear Regression using Excel


• Multiple Linear Regression using Excel
• Multiple Linear Regression using statsmodels
• Multiple Linear Regression using scikit-learn
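A minimal sketch of the scikit-learn exercise (assuming scikit-learn is installed; the TV/radio spend and sales numbers below are invented for illustration):

```python
# Multiple linear regression with scikit-learn's LinearRegression.
from sklearn.linear_model import LinearRegression

X = [[230, 38], [44, 39], [17, 46], [151, 41], [180, 11]]  # [TV, radio] spend
y = [22.1, 10.4, 9.3, 18.5, 12.9]                          # sales

model = LinearRegression().fit(X, y)
print(model.intercept_, model.coef_)   # b0 and [b1, b2]
print(model.predict([[100, 25]]))      # predicted sales for new spend
```

The same model fitted with statsmodels' `OLS` additionally reports p-values and confidence intervals for each coefficient.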



ZG 536
Foundations of Data Science
BITS Pilani Pravin Mhaske
Pilani Campus

M4 Predictive Modeling
Lecture 6 Classification, Logistic Regression
Classification

Here the response variable Y is Qualitative/Categorical


• Email Spam: the email is one of Y = {spam, not spam}
• Handwritten Digit Recognition: Digit class is one of Y = {0, 1, . . . , 9}.

Our goals are:


1. Prediction
• Build a classifier f( X ) that assigns a class label to a future unlabeled observation X
• Estimate the probability that X belongs to each category in Y
Example: We may be more interested in an estimate of the probability that a transaction is fraudulent
than in a plain fraudulent / not-fraudulent classification
2. Inference
• Understand the roles of the different predictors among X = (X 1 , X 2 , . . . , X p )



Regression Vs Classification

Variables can either be Quantitative or Qualitative (Categorical)


• Quantitative variables take on numerical values – Income, Bill amount
• Categorical (qualitative) variables take values in one of K different classes – Gender, Digit

Regression Problem: The response variable is quantitative


Classification Problem: The response variable is categorical



Classification Algorithms

• Naïve Bayes
• K-nearest Neighbour
• Logistic Regression
• Discriminant Analysis
• Decision Trees
• Support Vector Machine



Estimator and Error Rate
We have seen y = f(X).
f is a function that best maps an input x to output y. We wish to estimate this f.

The accuracy of the estimated f̂ is usually defined by the Error Rate

• The proportion of mis-classifications:

Error Rate = Ave( I(yi ≠ ŷi) )

where I is the Indicator function: I(yi ≠ ŷi) = 1 if yi ≠ ŷi, and 0 otherwise

There are two error rates


• Training Error Rate
• Test Error Rate

A good classifier is one for which the test error rate is the smallest



Regression Revision
Relationships between a numerical response and numerical / categorical predictors
• Hypothesis tests for all regression parameters together – Testing the Model
• Model coefficient interpretation
• Hypothesis tests for each regression parameter
• Confidence intervals for regression parameters
• Confidence and prediction intervals for predicted means and values
• Model diagnostics, residuals plots, outliers
• RSS, MSE, R2
• Interpreting computer outputs



Classification – why and how?

• Regression gives a number. What if I want to identify a class or category and not a
number?
• Say I want to separate genuine emails from spam emails, or genuine transactions from
fraudulent transactions. Here the outcomes are text labels, but models understand only
numbers.
• How do I handle this? I replace the two classes with numbers – say, one class as 1 and the
other as 0 – and train a model to predict the outcome value 0 or 1.
• But models can't give discrete values 0 and 1.
• We can instead make the model give a continuous value between 0 and 1.
• If the value is closer to 1 (i.e. >= 0.5), I treat it as 1, otherwise 0.



Classification – why and how?
• Can I use the concepts of linear regression here? How?
• Linear regression can output any value between −∞ and +∞.
• However, I want to map or convert that range (−∞, +∞) to (0, 1).
• We need a link function to do this.
• The most appropriate one is a sigmoid or logistic function.



Odds and Logit Function

Odds are commonly used in gambling (and logistic regression)

For an event E,
• If we know P(E), then

Odds(E) = P(E) / P(~E) = P(E) / (1 − P(E))

• If the odds of E are “x to y”, then P(E) = x / (x + y)

Logit function:

logit(p) = ln( p / (1 − p) ), 0 < p < 1

log Odds(E) = ln( P(E) / (1 − P(E)) )

logit can be interpreted as the log odds of a success



Logit Function and Logistic (Sigmoid) Function

• Logistic regression is a Generalized Linear Model (GLM)

• Uses the logistic or sigmoid function.

The logit function

logit(p) = ln( p / (1 − p) ), 0 < p < 1

• Converts (0, 1) to the range (−∞, +∞)

The inverse function is known as the Sigmoid function (or Logistic function)

S(x) = e^x / (1 + e^x) = 1 / (1 + e^(−x)), −∞ < x < +∞

• Converts (−∞, +∞) to the range (0, 1)
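A small sketch showing that the two functions are inverses of each other:

```python
import math

def sigmoid(x):
    """Logistic function: maps (-inf, +inf) to (0, 1)."""
    return 1 / (1 + math.exp(-x))

def logit(p):
    """Log-odds: maps (0, 1) to (-inf, +inf)."""
    return math.log(p / (1 - p))

print(sigmoid(0))           # 0.5 (the decision boundary)
print(logit(sigmoid(2.0)))  # recovers 2.0, since logit inverts sigmoid
```

In logistic regression the linear predictor β0 + β1X1 + … + βpXp is passed through the sigmoid, so the output can be read as a probability.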



Sigmoid Curve



Evaluating the Model

Confusion Matrix

The performance of f can also be described by a confusion matrix.

A confusion matrix is a table that is used to describe the performance of a
classification model (or "classifier") on a set of data for which the true values are
known.

The confusion matrix gives strong clues as to where f̂ is going wrong.



Example

Consider a classical problem of predicting spam and non-spam email.


The objective is to identify Spams.
The training set consists of 15 emails that are Spam, and 85 emails that are Not Spam
The model correctly classified 95 emails
• All 85 Non-Spams were correctly classified
• 10 Spams were correctly classified
• 5 Spams were classified as Non-Spams (False Negative if Target is Spam).



The Matrix

The objective is to identify Spams.


True Class
Predicted Class Spam Non-Spam
Spam 10 0
Non-Spam 5 85

• “true positive” for correctly identifying target event


• “true negative” for correctly identifying non-target event
• “false positive” for incorrectly identifying a non-target event as a target event
• “false negative” for incorrectly identifying a target event as a non-target event

TP = 10, FP = 0, TN = 85, FN = 5

True Class
Predicted Class Target Non-Target
Target TP FP
Non-Target FN TN



Is accuracy a good metric always?

Building a high-accuracy but useless classifier



What’s the right metric?

1. Many classifiers are designed to optimize error/accuracy


2. This tends to bias the performance towards the majority class
3. Anytime there is an imbalance in the data this can happen
4. It is particularly severe, though, when the imbalance is large
5. Accuracy is not the right measure of classifier performance in such cases
6. What are other metrics?
1. Precision
2. Recall (Sensitivity or TPR or True Positive Rate = TP/P)
3. F1-score?

Also check*
1. Specificity (TNR or True Negative Rate = TN/N)
2. False Positive Rate (FPR) = FP/N = 1 – TNR
3. And others…

Refer https://en.wikipedia.org/wiki/Confusion_matrix



The Metrics

1 is the same as Positive
0 is the same as Negative
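Using the spam example above (TP = 10, FP = 0, TN = 85, FN = 5), the common metrics work out as:

```python
# Accuracy, precision, recall, and F1 from the spam confusion matrix.
TP, FP, TN, FN = 10, 0, 85, 5

accuracy  = (TP + TN) / (TP + FP + TN + FN)
precision = TP / (TP + FP)                     # of predicted spam, how many were spam
recall    = TP / (TP + FN)                     # sensitivity / TPR: spam actually caught
f1        = 2 * precision * recall / (precision + recall)

print(accuracy, precision, round(recall, 3), round(f1, 3))  # 0.95 1.0 0.667 0.8
```

Accuracy looks strong at 95%, but recall shows the classifier misses a third of the spam, which is exactly the kind of gap accuracy alone can hide.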



Strategies for Imbalanced Data

1. Under-sampling
2. Over-sampling
3. Optimize AUC



Under-sampling (majority class)

Create a new training data set by:


• Include all k “positive” examples
• randomly pick k “negative” examples

Pros:
Easy to implement
Training becomes much more efficient (smaller training set)
For some domains, can work very well

Cons:
Throwing away a lot of data/information
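The recipe above, sketched in plain Python on a made-up 15-positive / 85-negative training set:

```python
# Random under-sampling: keep all k positives, draw k negatives at random.
import random

random.seed(0)
data = [(x, 1) for x in range(15)] + [(x, 0) for x in range(85)]  # (features, label)

positives = [d for d in data if d[1] == 1]
negatives = [d for d in data if d[1] == 0]

balanced = positives + random.sample(negatives, len(positives))
random.shuffle(balanced)
print(len(balanced))  # 30 examples, 15 per class
```

Libraries such as imbalanced-learn provide ready-made samplers for the same idea.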



Over-sampling (minority class)

Create a new training data set by:


- including all m negative examples
- include m positive examples:
- repeat each example a fixed number of times, or
- sample with replacement

Pros:
Easy to implement
Utilizes all of the training data
Tends to perform well in a broader set of circumstances than subsampling

Cons:
Computationally expensive to train a classifier



Multiclass Classification

Suppose the possible responses are A, B & C.


f was run on the training set and the following Confusion Matrix was generated

True Class
Predicted Class A B C
A 30 20 10
B 50 60 10
C 20 20 80

The confusion matrix gives strong clues as to where f is going wrong.


• For true class A, f incorrectly predicted label B for the majority of the mislabelled cases.
Perhaps features need to be added to improve classification of label A.

The more zeroes or smaller the numbers on all cells but the diagonal, the better the
classifier is doing.



Multiclass Classification

Suppose the possible responses are A, B & C.


f was run on the training set and the following Confusion Matrix was generated
True Class
Predicted Class A B C
A 30 20 10
B 50 60 10
C 20 20 80

True Positive are those observations of a particular class that were classified correctly
False Positive are those observations that were incorrectly mapped to one class
False Negative are those observations of a particular class that were classified incorrectly
True Negative: Applicable for a Two-Class scenario

TP_A = 30, TP_B = 60, TP_C = 80


FP_A = 30, FP_B = 60, FP_C = 40
FN_A = 70, FN_B = 40, FN_C = 20
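These counts can be read off the matrix programmatically: for each class, TP is the diagonal cell, FP is the rest of its predicted row, and FN is the rest of its true column. A sketch:

```python
# Per-class TP/FP/FN (and precision/recall) from the 3-class matrix above.
labels = ["A", "B", "C"]
cm = [[30, 20, 10],
      [50, 60, 10],
      [20, 20, 80]]   # rows = predicted class, columns = true class

stats = {}
for i, lab in enumerate(labels):
    tp = cm[i][i]                          # diagonal cell
    fp = sum(cm[i]) - tp                   # rest of the predicted row
    fn = sum(row[i] for row in cm) - tp    # rest of the true column
    stats[lab] = (tp, fp, fn)
    print(lab, "TP =", tp, "FP =", fp, "FN =", fn,
          "precision =", round(tp / (tp + fp), 2),
          "recall =", round(tp / (tp + fn), 2))
```

The printed values match the slide: for example, class A has precision 0.5 but recall only 0.3, confirming that most true A's are being mislabelled.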



Handling categorical columns

• ML models don’t understand categories/non-numeric values.


• We need to convert those to a number.
• Categorical data can be nominal or ordinal.
• Two methods – One Hot Encoding (dummy variables) and Label Encoding.

                          One Hot Encoding               Label Encoding

Variable type             Nominal                        Ordinal
                          (all values are equivalent)    (values have an order)
Example                   Red, Green, Blue               High, Medium, Low
                          Male, Female
Number of output columns  No. of distinct values − 1     1
Output values             0 and 1                        0, 1, 2, 3, …



One Hot Encoding (Dummy Variables)

1. Find out count of distinct values, say n, in the column.


2. Create n new columns – each with a name of the distinct value.
3. Encode values 0 and 1 under those columns depending upon the value in that observation.
4. Avoid when the count of distinct values is high.



Label Encoding

1. Find out count of distinct values, say n, in the column.


2. Find the order.
3. Encode values 0, 1, 2….based on the order.
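Both recipes in miniature, in plain Python (in practice pandas' `get_dummies` or scikit-learn's encoders do this; the colour/size values below are invented):

```python
# One-hot (dummy) encoding for a nominal column, label encoding for an ordinal one.
colors = ["Red", "Green", "Blue", "Green"]   # nominal: no order
sizes  = ["Low", "High", "Medium", "Low"]    # ordinal: Low < Medium < High

# One-hot / dummy encoding: n distinct values -> n-1 indicator columns
# (one value, here "Red", becomes the implicit baseline).
dummy_cols = ["Green", "Blue"]
one_hot = [[int(c == col) for col in dummy_cols] for c in colors]

# Label encoding: map the ordered categories to 0, 1, 2, ...
order = {"Low": 0, "Medium": 1, "High": 2}
encoded_sizes = [order[s] for s in sizes]

print(one_hot)        # [[0, 0], [1, 0], [0, 1], [1, 0]]
print(encoded_sizes)  # [0, 2, 1, 0]
```

Dropping one dummy column avoids the perfect collinearity (the "dummy variable trap") that would otherwise inflate VIFs in a regression.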



Exercise

Building a Logistic Regression Classifier using sklearn

