FDS L1 to L8 slides


ZG 536
Foundations of Data Science
Pravin Mhaske
BITS Pilani, Pilani Campus

M1 Data Science Foundations


Lecture 1 - Introduction
Course Objectives

No   Objective
CO1  Get introduced to the field of Data Science, roles, process and challenges involved therein
CO2  Explore and experience the steps involved in data preparation and exploratory data analysis
CO3  Learn to select and apply the proper analytics technique for various scenarios, assess the model's performance and interpret the results of the predictive model
CO4  Get familiar with the general deployment considerations of predictive models
CO5  Appreciate the importance of techniques like data visualization and storytelling with data for the effective presentation of outcomes to stakeholders

BITS Pilani, Pilani Campus


Evaluation

No   Name                                     Type                 Duration  Weight  Day, Date, Session, Time
EC1  Experiential Learning Assignments 1 & 2  Take Home - Online             25%     To be announced
EC2  Mid-Semester Exam                        Open or Closed Book  2 hours   30%
EC3  Comprehensive Exam                       Open Book            2 hours   45%


The instructor

Pravin Mhaske

Qualification       Bachelor of Engineering (Mechanical)
                    Master of Science (Business Analytics)
Experience          21 years (Industry), 6 years (Teaching)
Teaching interests  Statistics, Data Science, Machine Learning, Business Analytics
Pedagogy            Concepts, foundation, intuition, hands-on practice, experiential learning


What exactly is Data Science?

• An interdisciplinary field that uses algorithms, procedures, and processes to examine large amounts of data
• Study of data to extract meaningful insights for business
• Using data to solve problems and make decisions!
• Applied Statistics!

Breaking it down:
• Data: Everything is data. Structured, unstructured.
• Scientific methods: Scientific approach, questions, data collection, analyze,
interpret, conclusion
• Statistics: Patterns, trends, insights
• Domain expertise: SME, actionable and relevant insights
• Programming: Process and manipulate data



Applications
• Every domain!
• Healthcare: Better operations, early detection, prevention
• Retail: Customer behavior, STP, customer experience
• Banking and Finance: Financial advice and planning, predictions, fraud detection
• Transportation: Optimization, better planning
• Manufacturing: Fault detection, IoT, operations and process improvement
• Meteorology: Weather, seismic, geospatial data
• Social media/TC: Sentiment analysis, demand
• Energy and utilities: Consumption, control
• Public services: Planning, development
• Sports, Entertainment: Strategy, content creation, demand analysis
• Politics?


Some Examples
• Recommender systems: Amazon, Netflix, YouTube

• Personalization: Learning, ads, promotions and discounts

• Decision making: Google maps

• Fraud detection: transactions

• Dynamic pricing: Surge pricing

• Smart homes, voice assistants

• Social media trends

• Spam mail filters

• Traffic lights

• Online dating



Why learn Data Science?

• Career opportunities

• Rapid digital evolution

• Data is growing

• Flexibility – all industries, freelancing

• Demand-Supply gap

• Analytical, scientific approach

• Being logical and sensible

• Life skill - Solving real life problems



DS, AI, ML, DL, Analytics?
• Data Science: Processing, analyzing, insights
• Business Analytics: Solving problems, making decisions
• Artificial Intelligence: Machines simulate human behavior
• Machine Learning: Computers learn by themselves
• Deep Learning: Artificial neural networks



DS/ML project flow



Popular Roles and Skills

Data Engineer: SQL, Python, Hive, Pig, Java, Hadoop, Spark, Kafka, Azkaban, Airflow, AWS, GCP, Azure. Data warehousing; ability to write, analyze, and debug SQL queries; Big Data platforms like Hadoop, Spark, Kafka, Flume, Pig, Hive, etc.; experience in handling data pipeline and workflow management tools like Azkaban, Luigi, Airflow, etc.; strong communication skills.

Data/BI Analyst: SQL, Excel, Python/R, Tableau/PowerBI/QlikView, basics of Big Data, basics of Cloud. Programming skills in Python/R; solid understanding of database management systems; proficient SQL/HQL skills; good data visualization skills and proficiency with Tableau/PowerBI/QlikView, etc.; basic understanding of predictive modelling.

ML Engineer: Python, Machine Learning algorithms, DL/NLP, Java, DBMS, Cloud Architecture, Big Data architectures, AWS/GCP/Azure. Understanding of data structures, data modeling and software architecture; deep knowledge of math, probability, statistics and algorithms; ability to write robust code in Python, Java and R; familiarity with machine learning frameworks (like Keras or PyTorch) and libraries (like scikit-learn).

Business Analyst: Excel, Visio, SQL, Tableau. Domain understanding; requirement gathering; requirement elicitation; process excellence; user acceptance testing; documentation prowess; basic data analysis skills.


Data Scientist

Wears many hats!


1. Data Acquisition and Preparation: Data Sources, Cleaning,
Preprocessing, Integration, Wrangling
2. Data Analysis: EDA, insights, patterns
3. Modeling: Statistical/hypothesis testing, ML models - Building,
testing, tuning, deploying
4. Communication: Story-telling, visualization, audience
5. Collaboration: Stakeholders
6. Solutions: Practical, relevant



Data Scientist Skills



Data Science Vs other domains

1. Interdisciplinary: Statistics, Mathematics, Computer Science,


Programming, and domain-specific expertise
2. Focus: Data
3. Problem solving: Real world challenges. Always new.
4. Evolution: Tools, techniques, algorithms
5. Lifelong learning: No crash course!
6. All industries
7. No defined scope
8. No single correct solution
9. Answer to many questions is ‘depends!’



Challenges

1. Data: Acquisition, access, quality, volume


2. Technical: Tools, algorithms
3. Explainable AI: Interpretability and explainability
4. Communication: stakeholders
5. Privacy and Security
6. Continuous learning



Prerequisites for the course

STATISTICS   PYTHON PROGRAMMING   EXCEL   ANACONDA INSTALLATION


Exercise



ZG 536
Foundations of Data Science
Pravin Mhaske
BITS Pilani, Pilani Campus

M1 Data Science Foundations


Lecture 2 Data Science Process
Data Scientist’s Toolbox




Data Scientist

Wears many hats!


1. Data Acquisition and Preparation: Data Sources, Cleaning,
Preprocessing, Integration, Wrangling
2. Data Analysis: Exploratory Data Analysis (EDA), insights
3. Modeling: Statistical/hypothesis testing, ML models - Building,
testing, tuning, deploying
4. Communication: Story-telling, visualization, audience
5. Collaboration: Stakeholders
6. Solutions: Practical, relevant



Data Science/ML Process

Business Analyst

Data Engineer

Data/BI Analyst

ML Engineer

Data Scientist

Source: https://techcommunity.microsoft.com/t5/azure-developer-community-blog/the-data-science-process-with-azure-machine-learning/ba-p/336162





BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Data Science Process

1. Business Understanding
• What is the problem?
• What is the objective?
• What is causing the problem?

2. Data Understanding
• What data do we have?
• What data do we need?



Data Science Process

3. Data Preparation
• Data Collection – Sources, format
• How to get the data?
• Where to store the data? In what format?
• Is the data clean and complete?
• Data Cleaning
• EDA – data to insights
• Feature Engineering
4. Modeling
• What kind of a problem?
• What kind of an algorithm?



Data Science Process

5. Evaluation (and tuning)


• Building models
• Testing the model - performance
• Model selection
• Interpretability-Accuracy tradeoff
• Tuning the model hyperparameters
• Findings and conclusion
6. Deployment
• Environment
• Architecture



Data Scientist’s Toolbox

1. Data Collection
• Hadoop Ecosystem (HDFS, Hive, Pig)
2. Data Preparation
• SQL
• Python and Python libraries - pandas
3. EDA
• Excel
• RStudio
• Power BI
• Tableau
• Python libraries – matplotlib, pandas, seaborn



Data Scientist’s Toolbox

4. Statistical Analysis
• RStudio
• MATLAB
• SAS
• SPSS
5. Model building
• Jupyter Notebook
• Python libraries – NumPy, SciPy, scikit-learn
• TensorFlow
• PyTorch
• AWS/Azure/GCP



ZG 536
Foundations of Data Science
Pravin Mhaske
BITS Pilani, Pilani Campus

M1 Data Science Foundations


Lecture 3 Types of Data and Datasets
Data Quality
Data Preprocessing
Types of Data

Data

Qualitative Quantitative
(Categorical) (Numeric)

Nominal Ordinal Discrete Continuous



Categorical Data
• Characteristics or attributes
• Non-numeric. Cannot be computed

Nominal
• No specific order
• All categories are equal
• Cannot be measured
• Gender, colors, divisions

Ordinal
• Natural order
• Categories can be compared
• High-Medium-Low, First-Second-Third, etc.



Numeric Data
• Numbers
• Measurable or countable
• Calculations can be performed

Discrete
• Only certain values
• Typically, whole numbers
• Countable
• Runs, goals, marks

Continuous
• All possible values within a range
• Typically, with fractions and decimals
• Measurable
• Height, weight, temperature
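
The four types map neatly onto pandas dtypes. A minimal sketch (the column names and values are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "color":  pd.Categorical(["red", "blue", "red"]),      # nominal: no order
    "rating": pd.Categorical(["Low", "High", "Medium"],
                             categories=["Low", "Medium", "High"],
                             ordered=True),                # ordinal: natural order
    "goals":  [2, 0, 3],                                   # discrete: countable
    "height": [1.72, 1.80, 1.65],                          # continuous: measurable
})

print(df.dtypes)
print(df["rating"].min())   # ordered categories can be compared
```

Marking a column as an ordered categorical lets pandas compare and sort its values without pretending they are numbers.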



Types of Datasets

• Set of data as a collection


• Structured, unstructured, semi-structured
• Used for a meaningful activity, say, analysis

Formats
• Tabular – rows and columns (XLS, CSV)
• Web data – JSON, XML
• Time series dataset
• Image dataset
• Bivariate
• Multivariate
Why Data Quality?

1. Better decisions
2. Correct analysis and insights
3. Better problem-solving
4. Reliable results
5. Less ambiguity
6. Customer experience
7. Compliance
8. Cost



Data Quality



Data Preprocessing

[Figure: model, histogram, cluster, sample]



Handling missing values

1. Delete the row


2. Drop the column
3. Impute by mean/median (numeric)
4. Impute by mode (category)
5. Use algorithm
6. Forward and backward fill
7. Build a model and guess the appropriate value
8. Create new value (missingness as a feature)
9. Use libraries
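
A few of these strategies in pandas, on a tiny invented DataFrame (options 3, 4 and 8 from the list):

```python
import pandas as pd

df = pd.DataFrame({
    "age":  [25, None, 32, 41, None],
    "city": ["Pune", "Delhi", None, "Delhi", "Pune"],
})

# 8. Record missingness as a feature before filling anything in
df["age_missing"] = df["age"].isna().astype(int)

# 3. Impute a numeric column by its median
df["age"] = df["age"].fillna(df["age"].median())

# 4. Impute a categorical column by its mode
df["city"] = df["city"].fillna(df["city"].mode()[0])

print(df.isna().sum().sum())  # 0 – no missing values remain
```

Which strategy is right depends on why the values are missing; blind imputation can bias the later analysis.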



ZG 536
Foundations of Data Science
Pravin Mhaske
BITS Pilani, Pilani Campus

M2 Data Science Foundations


Lecture 4 Descriptive Analytics
Data Visualizations
Storytelling with data
Types of Analytics

• Descriptive analytics: "What happened?" – summarizes historical data
• Diagnostic analytics: "Why did it happen?" – analyzes reasons for the trends and patterns identified
• Predictive analytics: "What will happen?" – uses statistical models and machine learning to predict future outcomes
• Prescriptive analytics: "What to do?" – recommends what to do next based on insights from historical and predictive data


Descriptive Analytics and Descriptive
Statistics

Statistical interpretation used to analyze historical data to identify patterns and relationships.

Standalone applications – charts, tables, figures, plots, reports, dashboards

As an input to predictive analytics – Exploratory Data Analysis


Descriptive Statistics

• Measures of Central Tendency


– Mathematical averages – Mean (Arithmetic, Harmonic, Geometric…)
– Positional averages – Median, Mode, Quartiles, Percentiles

• Measures of Dispersion
– Range
– Standard Deviation/Variance
– IQR
– Coefficient of Variation

• Measures of Association
– Covariance
– Correlation
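
All of these measures are one-liners in NumPy; a sketch on invented data:

```python
import numpy as np

data = np.array([4, 8, 6, 5, 3, 7, 9, 5])

mean   = data.mean()                      # central tendency
median = np.median(data)
rng    = data.max() - data.min()          # dispersion: range
s      = data.std(ddof=1)                 # sample standard deviation (n - 1)
q1, q3 = np.percentile(data, [25, 75])
iqr    = q3 - q1                          # interquartile range
cv     = s / mean                         # coefficient of variation

print(mean, median, rng, round(s, 3), iqr, round(cv, 3))
```

Note `ddof=1`: NumPy defaults to the population formula (divide by n), so the sample version must be requested explicitly.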



Measures of Dispersion

• Dispersion indicates the variability, scatter, or spread of data; it shows how stretched or squeezed the underlying distribution is.
• Extremely useful for a number of
practical applications: quality control,
reliability analysis, banking, insurance
and portfolio management, etc.
• Range, interquartile range, variance,
standard deviation, and coefficient of
variation



Measures of Dispersion

Variance/standard deviation
• Numerous applications in descriptive statistics, statistical inference, hypothesis testing, Monte Carlo simulation, and analysis of variance.
• Wide applications in physics, biology, chemistry, economics, and finance.

Sample standard deviation: s = √( Σᵢ₌₁ⁿ (xᵢ − x̄)² / (n − 1) )

Coefficient of Variation (CV)
• A relative measure of dispersion.
• It has enormous applications in quality assurance studies.
• Useful in comparing the dispersion of two distributions having different measurement units.

Coefficient of variation (sample): CV = s / x̄
Coefficient of variation (population): CV = σ / μ
Measures of Dispersion

• Range: difference between the largest and the smallest values in a dataset.
• Interquartile range: difference between the third (upper) quartile and first (lower)
quartile. IQR = Q3 – Q1



Box and whisker plot, Five number summary

• The box-and-whisker plot (box plot) is a graphical representation of a set of observations, based on the five-number summary.
• It is a very useful tool in detecting outliers, and in
summarizing the distribution of data.
• Five number summary
• Min
• Q1
• Q2 (Median)
• Q3
• Max
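
The five-number summary is exactly what `np.percentile` at [0, 25, 50, 75, 100] returns (toy data assumed):

```python
import numpy as np

data = np.array([7, 15, 36, 39, 40, 41])

# Min, Q1, Q2 (median), Q3, Max in one call
five_num = np.percentile(data, [0, 25, 50, 75, 100])
print(five_num)
```

Box-plot libraries such as matplotlib compute the same five numbers internally (`plt.boxplot(data)`).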



Outlier

• An outlier (spurious data point) is an observation point that is distant from other
observations.

• Outlier detection methods:


• Standardized values (z-scores)
• Using quartiles and IQR:
• Find lower limit = Q1 – 1.5 (IQR) and upper limit = Q3 + 1.5 (IQR)
• Data outside this range could be flagged as outliers.
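
The IQR rule as a short NumPy sketch (the data are invented, with one planted outlier):

```python
import numpy as np

data = np.array([12, 14, 15, 15, 16, 18, 19, 45])   # 45 looks suspicious

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = data[(data < lower) | (data > upper)]
print(outliers)   # only the planted value is flagged
```

Flagged points deserve inspection, not automatic deletion; an outlier may be an error or a genuinely interesting observation.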



Skew and Kurtosis



Measures of Association

Covariance
• An absolute measure of how much two variables change together.
• The sign of the covariance shows the tendency in the linear relationship between the variables; the magnitude of the covariance does not carry much meaning on its own.
• If two variables tend to show similar behaviour, the covariance is positive, otherwise negative. Zero covariance implies the variables are not linearly related.

Sample covariance: s_xy = Σᵢ₌₁ⁿ (xᵢ − x̄)(yᵢ − ȳ) / (n − 1)

Population covariance: σ_xy = Σᵢ₌₁ᴺ (xᵢ − x̄)(yᵢ − ȳ) / N
Measures of Association

Correlation
• Correlation is a normalized covariance. It lies between −1 and +1.
• It provides a measure of the linear relationship or association between two variables.
• If two variables tend to show similar behaviour, the correlation is positive, otherwise negative.

Sample correlation: r_xy = s_xy / (s_x s_y) = Σᵢ₌₁ⁿ (xᵢ − x̄)(yᵢ − ȳ) / √( Σᵢ₌₁ⁿ (xᵢ − x̄)² · Σᵢ₌₁ⁿ (yᵢ − ȳ)² )

Population correlation: ρ_xy = σ_xy / (σ_x σ_y)
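
NumPy implements both measures directly; a sketch on two invented, perfectly linear vectors:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2.0 * x                                # perfectly linear relationship

s_xy = np.cov(x, y, ddof=1)[0, 1]          # sample covariance (off-diagonal entry)
r_xy = np.corrcoef(x, y)[0, 1]             # correlation, scale-free

print(s_xy, r_xy)
```

Doubling the units of y would double s_xy but leave r_xy at 1, which is why correlation is the preferred measure for comparing associations.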


Correlation



Exploratory Data Analysis (EDA)

• An approach to analyze the data using visual techniques. Initial investigation.

Objectives:
1. Explore data to become familiar with data
2. Discover patterns, trends, relationships
3. Spot anomalies
4. Test hypotheses or assumptions
5. Summarize data
6. Missing/Null values
7. Explain outcomes or results of analysis
8. Tell a story with data
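
A first EDA pass in pandas usually starts with a handful of standard calls (toy data assumed):

```python
import pandas as pd

df = pd.DataFrame({
    "region": ["North", "South", "West", "North", "South"],
    "sales":  [12000, 45678, 34567, 23456, 12000],
})

df.info()                            # column types and non-null counts
print(df.describe())                 # summary statistics for numeric columns
print(df["region"].value_counts())   # distribution of a categorical column
print(df.isna().sum())               # missing values per column
```

These few calls cover objectives 1, 5 and 6 above before any plotting begins.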



Types of Visualization

Text:
• Simple text
• Tables
• Heatmap

Region  Q1 Sales  Q2 Sales  Q3 Sales  Q4 Sales
North   12000     23456     20000     12345
South   45678     12000     67890     12346
West    34567     12345     12000     45678


Types of Visualization

• Graphs
• Points (Scatter)
• Lines
• Slopegraph

[Example charts: a scatter plot, a line chart, and a slopegraph]


Types of Visualization

◦ Bars
◦ Horizontal / Vertical
◦ Stacked
◦ Waterfall
◦ Area

[Example charts: a vertical bar chart, a 100% stacked bar chart, a horizontal stacked bar chart, and an area chart]


Types of Visualization

Univariate - distribution

Bivariate - relationships

• Categorical Vs Categorical
• Continuous Vs Continuous
• Continuous Vs Categorical

Multivariate



Visualization Cheat Sheet



Visualization Cheat Sheet



Visualizations to be avoided

• Pie charts/Donut charts
• 3D charts
• Dual Axis charts

[Example charts: a pie chart and a 3D chart of the kind to avoid]


Bad Design



Good Design



Decluttering



Storytelling with data



How to lie?



Visualization tools

• Excel

• Tableau

• Power BI

• Python (Matplotlib, Seaborn, Bokeh)

• R/RStudio

• QlikView/Qlik Sense…

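
As a minimal example of the Python route, a bar chart with Matplotlib (data invented; the Agg backend renders without a display):

```python
import matplotlib
matplotlib.use("Agg")            # render off-screen, no GUI needed
import matplotlib.pyplot as plt

regions = ["North", "South", "West"]
sales = [12000, 45678, 34567]

fig, ax = plt.subplots()
ax.bar(regions, sales)
ax.set_xlabel("Region")
ax.set_ylabel("Q1 Sales")
ax.set_title("Q1 Sales by Region")
fig.savefig("q1_sales.png")
```

The same dozen lines in Seaborn or a drag-and-drop in Tableau/Power BI produce equivalent output; the tool matters less than the choice of chart.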


ZG 536
Foundations of Data Science
Pravin Mhaske
BITS Pilani, Pilani Campus

M4 Predictive Modeling
Lecture 5 Linear Regression
What is Machine Learning?

According to Tom M. Mitchell, Chair of Machine Learning at Carnegie Mellon University and author of the book Machine Learning (McGraw-Hill):

"A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E."

We now have a set of objects to define machine learning: Task (T), Experience (E), and Performance (P).
With a computer running a set of tasks, the experience should lead to performance increases (to satisfy the definition).

Many data mining tasks are executed successfully with the help of machine learning.


Types of Machine Learning



Regression

Predict the value of a given continuous variable based on the values of other variables, assuming a linear or nonlinear model of dependency.
Greatly studied in statistics and neural network fields.
Examples:
◦ Predicting sales amounts of new product based on advertising expenditure.
◦ Predicting wind velocities as a function of temperature, humidity, air pressure, etc.
◦ Predicting price of a house based on its attributes



Variables

Sales is the dependent variable
• Also known as the Response or Target
• Generically referred to as Y

TV, Radio and Paper are the independent variables
• Also known as features, inputs, or predictors
• Generically referred to as X (or X1, X2, X3)

#   TV     Radio  Paper  Sales
1   230.1  37.8   69.2   22.1
2   44.5   39.3   45.1   10.4
3   17.2   45.9   69.3   9.3
4   151.5  41.3   58.5   18.5
5   180.8  10.8   58.4   12.9
6   8.7    48.9   75.0   7.2


Matrix X and Vector y

The Advertising data set has 4 variables and 6 observations.
The variable names are "TV", "Radio", "Paper" and "Sales".

p = 3 (the number of independent variables)
n = 6 (the number of observations)

X represents the input data set; X is a 6 × 3 matrix.
y represents the output variable; y is a 6 × 1 vector.

#   TV     Radio  Paper  Sales
1   230.1  37.8   69.2   22.1
2   44.5   39.3   45.1   10.4
3   17.2   45.9   69.3   9.3
4   151.5  41.3   58.5   18.5
5   180.8  10.8   58.4   12.9
6   8.7    48.9   75.0   7.2


Matrix X and Vector y

#   TV     Radio  Paper  Sales
1   230.1  37.8   69.2   22.1
2   44.5   39.3   45.1   10.4
3   17.2   45.9   69.3   9.3
4   151.5  41.3   58.5   18.5
5   180.8  10.8   58.4   12.9
6   8.7    48.9   75.0   7.2

X is a 6 × 3 matrix (X_6×3) and y is a 6 × 1 vector (y_6×1).

x_i represents the ith observation; x_i is the vector (x_i1, x_i2, ..., x_ip).
x_j represents the jth variable; x_j is the vector (x_1j, x_2j, ..., x_nj).
y_i represents the ith observation of the output variable; y is the vector (y_1, y_2, ..., y_n).


A Linear Model

The linear model is an important example of a parametric model:

f(X) = β0 + β1 X1 + β2 X2 + · · · + βp Xp

• A linear model is specified in terms of p + 1 parameters: β0, β1, . . . , βp.
• We estimate the parameters by fitting the model to training data.
• Although it is almost never correct, a linear model often serves as a good and interpretable approximation to the unknown true function f(X).



A Linear Model (Parametric)

The linear model is an example of a parametric model:

f(X) = β0 + β1 X1 + β2 X2 + · · · + βp Xp

• A linear model is specified in terms of p + 1 parameters: β0, β1, . . . , βp.
• We estimate the parameters by fitting the model to training data.

• Simple Linear Regression: only one x variable
• Multiple Linear Regression: many x variables



A Linear Model

We want to predict Y for a given value of x

Is there an ideal f (X)?


• What is a good value for f (X) at any selected value of X , say X = 4? There
can be many Y values at X = 4
A good value is f (4) = E(Y |X = 4), the expected value of Y given X = 4.
This ideal f (x) = E(Y |X = x) is called the regression function.



A Linear Model

Y = β0 + β1 X1 + β2 X2 + · · · + βp Xp + ε

• β's: Unknown constants, known as coefficients or parameters
• βj: The average effect on Y of a unit increase in Xj, holding all other predictors fixed
• ε is the error term – it captures measurement errors and missing variables
• ε is a random variable independent of X, with E(ε) = 0

• In the advertising example, the model becomes
  sales = β0 + β1 × TV + β2 × radio + β3 × newspaper + ε

f is said to represent the systematic information that X provides about Y
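
Fitting this model to the six advertising rows from the earlier slides takes a few lines with scikit-learn (a sketch; real work would use the full dataset):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# TV, radio, newspaper spend per observation
X = np.array([[230.1, 37.8, 69.2],
              [ 44.5, 39.3, 45.1],
              [ 17.2, 45.9, 69.3],
              [151.5, 41.3, 58.5],
              [180.8, 10.8, 58.4],
              [  8.7, 48.9, 75.0]])
y = np.array([22.1, 10.4, 9.3, 18.5, 12.9, 7.2])   # sales

model = LinearRegression().fit(X, y)    # estimates beta_0 ... beta_3
print(model.intercept_, model.coef_)
```

`model.coef_[j]` is the estimate of βj+1: the average change in sales for a unit increase in that channel's spend, holding the others fixed.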



Goodness of Fit



Coefficient of Determination (R-squared)

• Proportion of the variance in the dependent variable that can be explained by the independent variable(s)

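
R-squared can be computed directly from its definition, 1 − SS_res / SS_tot (the observed and predicted values here are invented for illustration):

```python
import numpy as np

y     = np.array([3.0, 5.0, 7.0, 9.0])    # observed
y_hat = np.array([2.8, 5.2, 7.1, 8.9])    # predicted by some model

ss_res = np.sum((y - y_hat) ** 2)         # unexplained variation
ss_tot = np.sum((y - y.mean()) ** 2)      # total variation
r2 = 1 - ss_res / ss_tot
print(r2)                                  # close to 1: a good fit
```

scikit-learn's `r2_score(y, y_hat)` performs exactly this computation.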


Regression Assumptions

1. E(ε) = 0
2. The model adequately captures the relationship
3. Var(ε) = σ2 for all values of the independent variables (Homoscedasticity)
4. ε is normally distributed
5. The values of ε are independent (No Serial Correlation or Autocorrelation)
6. There is no (or little) multicollinearity among the independent variables



Multicollinearity and VIF
• X1 and X2 are significant when included separately, but together the effect of both variables shrinks. Multicollinearity exists when there is correlation between multiple independent variables in a multiple regression model. This can adversely affect the regression results.
• Multicollinearity does not reduce the explanatory power of the model; it does reduce the statistical significance of the
independent variables.
• Test for Multicollinearity: Variance Inflation Factor

• VIF equal to 1 = variables are not correlated


• VIF between 1 and 5 = variables are moderately correlated
• VIF greater than 5 = variables are highly correlated

Solutions to multicollinearity
1. Drop unnecessary variables
2. Advanced techniques: Ridge / Lasso / Stepwise / Principal Components Regression
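
The VIF of column j is 1 / (1 − R²_j), where R²_j comes from regressing X_j on the remaining predictors. A NumPy sketch (synthetic data, with one nearly collinear pair planted):

```python
import numpy as np

def vif(X, j):
    """VIF of column j: regress X[:, j] on the other columns plus an
    intercept, then return 1 / (1 - R^2)."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    r2 = 1 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)
    return 1.0 / (1.0 - r2)

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.1, size=200)   # nearly collinear with x1
x3 = rng.normal(size=200)
X = np.column_stack([x1, x2, x3])

print(vif(X, 0), vif(X, 1), vif(X, 2))   # first two large, third near 1
```

statsmodels ships the same computation as `variance_inflation_factor` if you prefer not to hand-roll it.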



Residual Analysis

• The red line should be approximately horizontal at zero.


• There is no pattern in the first residual plot. The presence of a pattern may indicate a problem with some
aspect of the linear model (case 2)
• E(ε) = 0
• Var(ε) = σ2 for all values of the independent variables (Homoscedasticity)



Homoscedasticity Vs Heteroscedasticity
• Are the residuals spread equally along
the ranges of predictors?
• The plot should have a horizontal line
with equally spread points.

• In the second plot this is not the case: the variability (variance) of the residual points increases with the value of the fitted outcome variable, suggesting non-constant variance in the residual errors (heteroscedasticity).


ZG 536
Foundations of Data Science
BITS Pilani Pravin Mhaske
Pilani Campus
BITS Pilani
Pilani Campus

M4 Predictive Modeling
Lecture 5 Linear Regression
What is Machine Learning?

According to Tom M. Mitchell, Chair of Machine Learning at Carnegie Mellon University and
author of the book Machine Learning (McGraw-Hill),
A computer program is said to learn from experience E with respect to some class of tasks T and
performance measure P, if its performance at tasks in T, as measured by P, improves with the experience
E.
We now have a set of objects to define machine learning:
Task (T), Experience (E), and Performance (P)
With a computer running a set of tasks, the experience should be leading to performance increases (to
satisfy the definition)

Many data mining tasks are executed successfully with help of machine learning

BITS Pilani, Pilani Campus


Types of Machine Learning

BITS Pilani, Pilani Campus


Regression

Predict a value of a given continuous valued variable based on the values


of other variables, assuming a linear or nonlinear model of dependency.
Greatly studied in statistics, neural network fields.
Examples:
◦ Predicting sales amounts of new product based on advertising expenditure.
◦ Predicting wind velocities as a function of temperature, humidity, air pressure, etc.
◦ Predicting price of a house based on its attributes

BITS Pilani, Pilani Campus


Variables

# TV Radio Paper Sales


Sales is the Dependent Variable 1 230.1 37.8 69.2 22.1
• Also known as the Response or Target
• Generically referred to as Y 2 44.5 39.3 45.1 10.4
3 17.2 45.9 69.3 9.3
TV, Radio and Paper are the independent variables
• Also known as features, or inputs, or predictors
4 151.5 41.3 58.5 18.5
• Generically referred to as X (or X1, X2, X3) 5 180.8 10.8 58.4 12.9
6 8.7 48.9 75 7.2

BITS Pilani, Pilani Campus


Matrix X and Vector y

The Advertising data set has 4 variables and 6 # TV Radio Paper Sales
observations 1 230.1 37.8 69.2 22.1
2 44.5 39.3 45.1 10.4
The variable names are “TV”, “Radio”, “Paper” and 3 17.2 45.9 69.3 9.3
“Sales”
4 151.5 41.3 58.5 18.5
p = 3 (the number of independent variables)
5 180.8 10.8 58.4 12.9
n = 6 (the number of observations) 6 8.7 48.9 75 7.2

X represents the input data set; X is a 6 * 3 matrix


y represents the output variable; y is a 6 * 1 vector

BITS Pilani, Pilani Campus


Matrix X and Vector y

# TV Radio Paper Sales


1 230.1 37.8 69.2 22.1
2 44.5 39.3 45.1 10.4
3 17.2 45.9 69.3 9.3
4 151.5 41.3 58.5 18.5
5 180.8 10.8 58.4 12.9
6 8.7 48.9 75 7.2
X is a 6 * 3 matrix or X6*3 & y is a 6 * 1 vector or y6*1

xi represents the ith observation. xi is a vector represented as (xi1 xi2 …..xip)

xj represents the jth variable. xj is a vector represented as (x1j x2j …..xnj)


yi represents the ith observation of the output variable. y is the vector (y1 y2 ….. yp)

BITS Pilani, Pilani Campus


A Linear Model

The linear model is an important example of a parametric model:


f ( X ) = β0 + β 1 X 1 + β 2 X 2 + . . . β p X p .

• A linear model is specified in terms of p + 1 parameters: β 0 , β 1 , . . . , β p .


• We estimate the parameters by fitting the model to training data.
• Although it is almost never correct, a linear model often serves as a good and interpretable
approximation to the unknown true function f (X).

BITS Pilani, Pilani Campus


A Linear Model (Parametric)

The linear model is an example of a parametric model


f ( X ) = β0 + β 1 X 1 + β 2 X 2 + . . . β p X p

• A linear model is specified in terms of p + 1 parameters: β 0 , β 1 , . . . , β p .

• The linear model: f ( X ) = β 0 + β 1 X 1 + β 2 X 2 + . . . β p X p has (p + 1) parameters


• We estimate the parameters by fitting the model to training data.

• Simple Linear Regression: Only one x variable


• Multiple Linear Regression: Many x variables

BITS Pilani, Pilani Campus


A Linear Model

We want to predict Y for a given value of x

Is there an ideal f (X)?


• What is a good value for f (X) at any selected value of X , say X = 4? There
can be many Y values at X = 4
A good value is f (4) = E(Y |X = 4), the expected value of Y given X = 4.
This ideal f (x) = E(Y |X = x) is called the regression function.

BITS Pilani, Pilani Campus


A Linear Model

Y = β 0 + β 1 X1 + β 2 X2 + · · · + β p Xp + ε

• β’s: Unknown constants, known as coefficients or parameters


• βj: The average effect on Y of a unit increase in Xj , holding all other predictors fixed.

• ε is the error term – captures measurement errors and missing variables


• ε is a random variable independent of X
• E(ε) = 0

• In the advertising example, the model becomes


sales = β0 + β1 × TV + β2 × radio + β3 × newspaper + ε

f is said to represent the systematic information that X provides about Y

BITS Pilani, Pilani Campus


BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Regression Assumptions

1. E(ε) = 0
2. The model adequately captures the relationship
3. Var(ε) = σ² for all values of the independent variables (Homoscedasticity)
4. ε is normally distributed
5. The values of ε are independent (No Serial Correlation or Autocorrelation)
6. There is no (or little) multicollinearity among the independent variables



Multicollinearity and VIF
• X1 and X2 may each be significant when included separately, but when included together the effect of both
variables shrinks. Multicollinearity exists when there is correlation among the independent variables in a
multiple regression model. This can adversely affect the regression results.
• Multicollinearity does not reduce the explanatory power of the model; it does reduce the statistical significance of the
independent variables.
• Test for multicollinearity: Variance Inflation Factor (VIF)

• VIF equal to 1: variables are not correlated
• VIF between 1 and 5: variables are moderately correlated
• VIF greater than 5: variables are highly correlated

Solutions to multicollinearity
1. Drop unnecessary variables
2. Advanced techniques: Ridge / Lasso / Stepwise / Principal Components Regression
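The VIF check can be sketched directly from its definition: VIF_j = 1 / (1 − R²_j), where R²_j comes from regressing predictor j on the remaining predictors. The data below are invented for illustration; in practice statsmodels' `variance_inflation_factor` computes the same quantity.

```python
import numpy as np

def vif(X):
    """Variance Inflation Factor for each column of X (shape n x p).

    VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing
    column j on the remaining columns (with an intercept).
    """
    n, p = X.shape
    out = []
    for j in range(p):
        y = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        r2 = 1 - resid.var() / y.var()
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + 0.1 * rng.normal(size=200)   # nearly collinear with x1
x3 = rng.normal(size=200)              # independent predictor
X = np.column_stack([x1, x2, x3])
print(vif(X))  # x1 and x2 get large VIFs; x3 stays near 1
```

The first two VIFs come out far above the "highly correlated" threshold of 5, flagging the near-duplicate predictors, while the independent predictor's VIF stays close to 1.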



Residual Analysis

• The red line should be approximately horizontal at zero.


• There is no pattern in the first residual plot. The presence of a pattern may indicate a problem with some
aspect of the linear model (case 2).
• E(ε) = 0
• Var(ε) = σ² for all values of the independent variables (Homoscedasticity)



Homoscedasticity Vs Heteroscedasticity
• Are the residuals spread equally along
the ranges of predictors?
• The plot should have a horizontal line
with equally spread points.

• In the second plot, this is not the case:
the variability (variance) of the
residual points increases with the value
of the fitted outcome variable,
suggesting non-constant variance in
the residual errors
(heteroscedasticity).



Types of Regression Models
Simple Regression

(Education) x  →  y (Income)

Multiple Regression

(Education)   x1 ─┐
(Soft Skills) x2 ─┼─→  y (Income)
(Experience)  x3 ─┤
(Age)         x4 ─┘



Direct Solution Method
Least Squares Method (Ordinary Least Squares or OLS)

• Slope for the Estimated Regression Equation

b1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)²

• y-Intercept for the Estimated Regression Equation

b0 = ȳ − b1x̄

where:
xi = value of the independent variable for the ith observation
yi = value of the dependent variable for the ith observation
x̄ = mean value of the independent variable
ȳ = mean value of the dependent variable



Exercise
Kumar’s Electronics periodically has a special week-long sale. As part of the advertising
campaign Kumar runs one or more TV commercials during the weekend preceding the
sale. Data from a sample of 5 previous sales are shown below.

# of TV Ads # of Cars Sold


(x) (y)
1 14
3 24
2 18
1 17
3 27



Solution

# of TV Ads   # of Cars Sold
    (x)            (y)         xi − x̄   yi − ȳ   (xi − x̄)(yi − ȳ)   (xi − x̄)²
     1              14           −1       −6             6               1
     3              24            1        4             4               1
     2              18            0       −2             0               0
     1              17           −1       −3             3               1
     3              27            1        7             7               1
Sum  10            100            0        0            20               4
Mean  2             20

• Slope for the Estimated Regression Equation: b1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)² = 20/4 = 5

• y-Intercept for the Estimated Regression Equation: b0 = ȳ − b1x̄ = 20 − 10 = 10

• Estimated Regression Equation: ŷ = b0 + b1x = 10 + 5x
• Predict Sales if Ads run = 5? 15?
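As a quick check, the two OLS formulas can be applied to this data in a few lines of plain Python:

```python
# OLS slope and intercept for the TV-ads data, using the formulas above.
x = [1, 3, 2, 1, 3]
y = [14, 24, 18, 17, 27]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n          # 2.0 and 20.0
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) \
     / sum((xi - x_bar) ** 2 for xi in x)      # 20 / 4
b0 = y_bar - b1 * x_bar                        # 20 - 5 * 2
predict = lambda ads: b0 + b1 * ads
print(b1, b0, predict(5))  # 5.0 10.0 35.0
```

Note that predicting for 15 ads extrapolates far beyond the observed range of the data (1–3 ads), so that estimate should be treated with caution.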
Evaluation of Regression Model



Goodness of Fit



Coefficient of Determination (R-squared)

• The proportion of the variance in the dependent variable that can be explained by the independent
variable(s)

• R² = 1 − SSres / SStot, where SSres is the residual sum of squares and SStot is the total sum of squares
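As a worked example, R² can be computed by hand from the residual and total sums of squares; the sketch below reuses the TV-ads data and the fitted line ŷ = 10 + 5x from the earlier exercise.

```python
# R^2 for the fitted line y_hat = 10 + 5x on the TV-ads data.
x = [1, 3, 2, 1, 3]
y = [14, 24, 18, 17, 27]
y_bar = sum(y) / len(y)
y_hat = [10 + 5 * xi for xi in x]
ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))   # residual sum of squares = 14
ss_tot = sum((yi - y_bar) ** 2 for yi in y)                # total sum of squares = 114
r2 = 1 - ss_res / ss_tot
print(round(r2, 3))  # 0.877
```

So about 88% of the variation in cars sold is explained by the number of TV ads.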



Exercise

• Simple Linear Regression using Excel


• Multiple Linear Regression using Excel
• Multiple Linear Regression using statsmodels
• Multiple Linear Regression using scikit-learn
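A minimal sketch of the scikit-learn exercise (assuming scikit-learn is installed; the TV/radio spend and sales numbers below are invented for illustration):

```python
# Multiple linear regression with scikit-learn's LinearRegression.
from sklearn.linear_model import LinearRegression

X = [[230, 38], [44, 39], [17, 46], [151, 41], [180, 11]]  # [TV, radio] spend
y = [22.1, 10.4, 9.3, 18.5, 12.9]                          # sales

model = LinearRegression().fit(X, y)
print(model.intercept_, model.coef_)   # b0 and [b1, b2]
print(model.predict([[100, 25]]))      # predicted sales for new spend
```

The same model fitted with statsmodels' `OLS` additionally reports p-values and confidence intervals for each coefficient.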



ZG 536
Foundations of Data Science
BITS Pilani Pravin Mhaske
Pilani Campus

M4 Predictive Modeling
Lecture 6 Classification, Logistic Regression
Classification

Here the response variable Y is Qualitative/Categorical


• Email Spam: the email is one of Y = {spam, not spam}
• Handwritten Digit Recognition: Digit class is one of Y = {0, 1, . . . , 9}.

Our goals are:


1. Prediction
• Build a classifier f( X ) that assigns a class label to a future unlabeled observation X
• Estimate the probability that X belongs to each category in Y
Example: We may be more interested in an estimate of the probability that a transaction is fraudulent
than in a plain fraudulent / not-fraudulent classification
2. Inference
• Understand the roles of the different predictors among X = (X 1 , X 2 , . . . , X p )



Regression Vs Classification

Variables can either be Quantitative or Qualitative (Categorical)


• Quantitative variables take on numerical values – Income, Bill amount
• Categorical (qualitative) variables take values in one of K different classes – Gender, Digit

Regression Problem: The response variable is quantitative


Classification Problem: The response variable is categorical



Classification Algorithms

• Naïve Bayes
• K-nearest Neighbour
• Logistic Regression
• Discriminant Analysis
• Decision Trees
• Support Vector Machine



Estimator and Error Rate
We have seen y = f(X).
f is a function that best maps an input x to output y. We wish to estimate this f.

The accuracy of the estimated f̂ is usually defined by the Error Rate

• The proportion of mis-classifications:

Error Rate = Ave( I(yi ≠ ŷi) )

where I is the Indicator function: I(yi ≠ ŷi) = 1 if yi ≠ ŷi, and 0 otherwise

There are two error rates


• Training Error Rate
• Test Error Rate

A good classifier is one for which the test error rate is the smallest



Regression Revision
Relationships between a numerical response and numerical / categorical predictors
• Hypothesis tests for all regression parameters together – Testing the Model
• Model coefficient interpretation
• Hypothesis tests for each regression parameter
• Confidence intervals for regression parameters
• Confidence and prediction intervals for predicted means and values
• Model diagnostics, residuals plots, outliers
• RSS, MSE, R2
• Interpreting computer outputs



Classification – why and how?

• Regression gives a number. What if I want to identify a class or category and not a
number?
• Say I want to separate genuine emails from spam emails, or genuine transactions from
fraudulent transactions. Here the outcomes are text labels, but models understand only
numbers.
• How do I handle this? I replace the two classes with numbers – say, one class as 1 and the
other as 0 – and train a model to predict the outcome value 0 or 1.
• But models can't give discrete values 0 and 1.
• We can instead make the model give a continuous value between 0 and 1.
• If the value is closer to 1 (i.e. >= 0.5), I treat it as 1, otherwise 0.



Classification – why and how?
• Can I use the concepts of linear regression here? How?
• Linear regression can output any value between −∞ and +∞.
• However, I want to map or convert that range (−∞, +∞) to (0, 1).
• We need a link function to do this.
• The most appropriate one is a sigmoid or logistic function.



Odds and Logit Function

Odds are commonly used in gambling (and logistic regression)

For an event E,
• If we know P(E), then

Odds(E) = P(E) / P(~E) = P(E) / (1 − P(E))

• If the odds of E are “x to y”, then P(E) = x / (x + y)

Logit function:

logit(p) = ln( p / (1 − p) ), 0 < p < 1

log Odds(E) = ln( P(E) / (1 − P(E)) )

logit can be interpreted as the log odds of a success



Logit Function and Logistic (Sigmoid) Function

• Logistic regression is a Generalized Linear Model (GLM)

• Uses the logistic or sigmoid function.

The logit function

logit(p) = ln( p / (1 − p) ), 0 < p < 1

• Converts (0, 1) to the range (−∞, +∞)

The inverse function is known as the Sigmoid function (or Logistic function)

S(x) = e^x / (1 + e^x) = 1 / (1 + e^(−x)), −∞ < x < +∞

• Converts (−∞, +∞) to the range (0, 1)
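A small sketch showing that the two functions are inverses of each other:

```python
import math

def sigmoid(x):
    """Logistic function: maps (-inf, +inf) to (0, 1)."""
    return 1 / (1 + math.exp(-x))

def logit(p):
    """Log-odds: maps (0, 1) to (-inf, +inf)."""
    return math.log(p / (1 - p))

print(sigmoid(0))           # 0.5 (the decision boundary)
print(logit(sigmoid(2.0)))  # recovers 2.0, since logit inverts sigmoid
```

In logistic regression the linear predictor β0 + β1X1 + … + βpXp is passed through the sigmoid, so the output can be read as a probability.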



Sigmoid Curve



Evaluating the Model

Confusion Matrix

The performance of f can also be described by a confusion matrix.

A confusion matrix is a table that is used to describe the performance of a
classification model (or "classifier") on a set of data for which the true values are
known.

The confusion matrix gives strong clues as to where f̂ is going wrong.



Example

Consider a classical problem of predicting spam and non-spam email.


The objective is to identify Spams.
The training set consists of 15 emails that are Spam, and 85 emails that are Not Spam
The model correctly classified 95 emails
• All 85 Non-Spams were correctly classified
• 10 Spams were correctly classified
• 5 Spams were classified as Non-Spams (False Negative if Target is Spam).



The Matrix

The objective is to identify Spams.


True Class
Predicted Class Spam Non-Spam
Spam 10 0
Non-Spam 5 85

• “true positive” for correctly identifying target event


• “true negative” for correctly identifying non-target event
• “false positive” for incorrectly identifying a non-target event as a target event
• “false negative” for incorrectly identifying a target event as a non-target event

TP = 10, FP = 0, TN = 85, FN = 5

True Class
Predicted Class Target Non-Target
Target TP FP
Non-Target FN TN



Is accuracy a good metric always?

Building a high-accuracy but useless classifier



What’s the right metric?

1. Many classifiers are designed to optimize error/accuracy


2. This tends to bias the performance towards the majority class
3. Anytime there is an imbalance in the data this can happen
4. It is particularly severe, though, when the imbalance is large
5. Accuracy is not the right measure of classifier performance in such cases
6. What are other metrics?
1. Precision
2. Recall (Sensitivity or TPR or True Positive Rate = TP/P)
3. F1-score?

Also check*
1. Specificity (TNR or True Negative Rate = TN/N)
2. False Positive Rate (FPR) = FP/N = 1 – TNR
3. And others…

Refer https://en.wikipedia.org/wiki/Confusion_matrix



The Metrics

1 is the same as Positive
0 is the same as Negative
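Using the spam example above (TP = 10, FP = 0, TN = 85, FN = 5), the common metrics work out as:

```python
# Accuracy, precision, recall, and F1 from the spam confusion matrix.
TP, FP, TN, FN = 10, 0, 85, 5

accuracy  = (TP + TN) / (TP + FP + TN + FN)
precision = TP / (TP + FP)                     # of predicted spam, how many were spam
recall    = TP / (TP + FN)                     # sensitivity / TPR: spam actually caught
f1        = 2 * precision * recall / (precision + recall)

print(accuracy, precision, round(recall, 3), round(f1, 3))  # 0.95 1.0 0.667 0.8
```

Accuracy looks strong at 95%, but recall shows the classifier misses a third of the spam, which is exactly the kind of gap accuracy alone can hide.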



Strategies for Imbalanced Data

1. Under-sampling
2. Over-sampling
3. Optimize AUC



Under-sampling (majority class)

Create a new training data set by:


• Include all k “positive” examples
• randomly pick k “negative” examples

Pros:
Easy to implement
Training becomes much more efficient (smaller training set)
For some domains, can work very well

Cons:
Throwing away a lot of data/information
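The recipe above, sketched in plain Python on a made-up 15-positive / 85-negative training set:

```python
# Random under-sampling: keep all k positives, draw k negatives at random.
import random

random.seed(0)
data = [(x, 1) for x in range(15)] + [(x, 0) for x in range(85)]  # (features, label)

positives = [d for d in data if d[1] == 1]
negatives = [d for d in data if d[1] == 0]

balanced = positives + random.sample(negatives, len(positives))
random.shuffle(balanced)
print(len(balanced))  # 30 examples, 15 per class
```

Libraries such as imbalanced-learn provide ready-made samplers for the same idea.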



Over-sampling (minority class)

Create a new training data set by:


- including all m negative examples
- include m positive examples:
- repeat each example a fixed number of times, or
- sample with replacement

Pros:
Easy to implement
Utilizes all of the training data
Tends to perform well in a broader set of circumstances than subsampling

Cons:
Computationally expensive to train a classifier



Multiclass Classification

Suppose the possible responses are A, B & C.


f was run on the training set and the following Confusion Matrix was generated

True Class
Predicted Class A B C
A 30 20 10
B 50 60 10
C 20 20 80

The confusion matrix gives strong clues as to where f is going wrong.


• For true class A, f incorrectly predicted label B for the majority of the mislabelled cases.
Perhaps features need to be added to improve classification of label A.

The more zeroes or smaller the numbers on all cells but the diagonal, the better the
classifier is doing.



Multiclass Classification

Suppose the possible responses are A, B & C.


f was run on the training set and the following Confusion Matrix was generated
True Class
Predicted Class A B C
A 30 20 10
B 50 60 10
C 20 20 80

True Positive are those observations of a particular class that were classified correctly
False Positive are those observations that were incorrectly mapped to one class
False Negative are those observations of a particular class that were classified incorrectly
True Negative: Applicable for a Two-Class scenario

TP_A = 30, TP_B = 60, TP_C = 80


FP_A = 30, FP_B = 60, FP_C = 40
FN_A = 70, FN_B = 40, FN_C = 20
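These counts can be read off the matrix programmatically: for each class, TP is the diagonal cell, FP is the rest of its predicted row, and FN is the rest of its true column. A sketch:

```python
# Per-class TP/FP/FN (and precision/recall) from the 3-class matrix above.
labels = ["A", "B", "C"]
cm = [[30, 20, 10],
      [50, 60, 10],
      [20, 20, 80]]   # rows = predicted class, columns = true class

stats = {}
for i, lab in enumerate(labels):
    tp = cm[i][i]                          # diagonal cell
    fp = sum(cm[i]) - tp                   # rest of the predicted row
    fn = sum(row[i] for row in cm) - tp    # rest of the true column
    stats[lab] = (tp, fp, fn)
    print(lab, "TP =", tp, "FP =", fp, "FN =", fn,
          "precision =", round(tp / (tp + fp), 2),
          "recall =", round(tp / (tp + fn), 2))
```

The printed values match the slide: for example, class A has precision 0.5 but recall only 0.3, confirming that most true A's are being mislabelled.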



Handling categorical columns

• ML models don’t understand categories/non-numeric values.


• We need to convert those to a number.
• Categorical data can be nominal or ordinal.
• Two methods – One Hot Encoding (dummy variables) and Label Encoding.

                          One Hot Encoding               Label Encoding

Variable type             Nominal                        Ordinal
                          (all values are equivalent)    (values have an order)
Example                   Red, Green, Blue               High, Medium, Low
                          Male, Female
Number of output columns  No. of distinct values − 1     1
Output values             0 and 1                        0, 1, 2, 3, …



One Hot Encoding (Dummy Variables)

1. Find out count of distinct values, say n, in the column.


2. Create n new columns – each with a name of the distinct value.
3. Encode values 0 and 1 under those columns depending upon the value in that observation.
4. Avoid when the count of distinct values is high.



Label Encoding

1. Find out count of distinct values, say n, in the column.


2. Find the order.
3. Encode values 0, 1, 2….based on the order.
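Both recipes in miniature, in plain Python (in practice pandas' `get_dummies` or scikit-learn's encoders do this; the colour/size values below are invented):

```python
# One-hot (dummy) encoding for a nominal column, label encoding for an ordinal one.
colors = ["Red", "Green", "Blue", "Green"]   # nominal: no order
sizes  = ["Low", "High", "Medium", "Low"]    # ordinal: Low < Medium < High

# One-hot / dummy encoding: n distinct values -> n-1 indicator columns
# (one value, here "Red", becomes the implicit baseline).
dummy_cols = ["Green", "Blue"]
one_hot = [[int(c == col) for col in dummy_cols] for c in colors]

# Label encoding: map the ordered categories to 0, 1, 2, ...
order = {"Low": 0, "Medium": 1, "High": 2}
encoded_sizes = [order[s] for s in sizes]

print(one_hot)        # [[0, 0], [1, 0], [0, 1], [1, 0]]
print(encoded_sizes)  # [0, 2, 1, 0]
```

Dropping one dummy column avoids the perfect collinearity (the "dummy variable trap") that would otherwise inflate VIFs in a regression.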



Exercise

Building a Logistic Regression Classifier using sklearn

