Python Business Analytics 1
1
Recap of Machine Learning Steps
2
Class Objectives
3
Class Objectives
4
How Does Supervised Learning Work?
5
Linear Regression (Only 1 Input Variable)
Revisit the scatter example from our last session:
6
Linear Regression (Only 1 Input Variable)
How does linear regression work?
Linear regression models are often fitted using the least-squares approach, which
finds the parameter values of the ‘best-fit line’ by minimizing the sum of
squared residuals.
7
Linear Regression (Only 1 Input Variable)
Let’s get hands-on with our first linear regression model in Python
Import the modules you need, e.g., SciPy for scientific
computation library
Create the arrays that represent the values of the x and y
axis
Execute the stats.linregress() method, which returns
several key values of the linear regression. It calculates
a linear least-squares regression for x and y, and returns
the slope, the intercept, the Pearson correlation
coefficient, the p-value, and the standard error of the
estimated slope
Draw the original scatter plot and the line of linear
regression
Display the diagram with plt.show()
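The steps above can be sketched as follows; the x and y values here are made-up example data, not the slide's dataset.

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# Made-up example data for the x and y axes
x = np.array([5, 7, 8, 7, 2, 17, 2, 9, 4, 11, 12, 9, 6])
y = np.array([99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86])

# Linear least-squares regression: slope, intercept, r, p-value, stderr
result = stats.linregress(x, y)
print(result.slope, result.intercept, result.rvalue, result.pvalue)

# Draw the original scatter plot and the fitted regression line
plt.scatter(x, y)
plt.plot(x, result.intercept + result.slope * x)
plt.show()
```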
8
Linear Regression (Only 1 Input Variable)
Interpretation of the estimation results
R for Relationship
It is important to know the strength of the relationship
between x and y; if there is no relationship, linear
regression cannot be used to predict anything
This relationship - the coefficient of correlation - is called r
The r value ranges from -1 to 1, where 0 means no
relationship, and 1 (or -1) means a perfect relationship
The result -0.85 shows that there is a relationship - not
perfect, but strong enough that we could use linear
regression in future predictions
Here, the p-value is smaller than 0.01, so we can reject
the null hypothesis that the slope is zero
9
Linear Regression (Only 1 Input Variable)
Interpretation of the estimation results
R for Relationship
Example of a perfect correlation and a bad correlation
10
Linear Regression (Only 1 Input Variable)
Another approach with sklearn package
Import LinearRegression, i.e., ordinary least-squares
linear regression, from sklearn, a free machine
learning library
Create the arrays that represent the values of the x and y
axis
Use LinearRegression.fit() function to fit the x and y
Notice that we transformed x into a 2-D array using
the function .reshape(-1, 1)
reg.coef_ returns an array of estimated coefficients;
since we only have one x variable, there is only one
item in the returned array
reg.intercept_ returns the estimated intercept
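A minimal sketch of the sklearn steps above, again on made-up example data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up example data for the x and y axes
x = np.array([5, 7, 8, 7, 2, 17, 2, 9, 4, 11, 12, 9, 6])
y = np.array([99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86])

# sklearn expects a 2-D feature matrix, hence the reshape(-1, 1)
reg = LinearRegression().fit(x.reshape(-1, 1), y)

print(reg.coef_)       # one estimated coefficient: we have one x variable
print(reg.intercept_)  # the estimated intercept
```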
11
Linear Regression (Only 1 Input Variable)
A comparison of the two approaches
12
Linear Regression (Only 1 Input Variable)
Make predictions with the estimated model
13
Polynomial Regression (Only 1 Input Variable)
What if the relationship is not linear, but polynomial?
14
Polynomial Regression (Only 1 Input Variable)
Let’s get hands-on with our first polynomial regression model in Python
Import the modules you need, here we use numpy, which has a
method that lets us make a polynomial model
Create the arrays that represent the values of the x and y axis
numpy.polyfit() is a method that takes the inputs and returns the
polynomial coefficients, highest power first (for a cubic fit: β₃, β₂, β₁, β₀)
numpy.poly1d() is a method that constructs the polynomial
equation from the given coefficients, equivalent to y = β₃x³ + β₂x² + β₁x + β₀
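The two numpy calls above can be sketched like this; the data points are illustrative:

```python
import numpy as np

# Illustrative data with a clearly non-linear (U-shaped) relationship
x = [1, 2, 3, 5, 6, 7, 8, 9, 10, 12, 13, 14, 15, 16, 18, 19, 21, 22]
y = [100, 90, 80, 60, 60, 55, 60, 65, 70, 70, 75, 76, 78, 79, 90, 99, 99, 100]

# Fit a degree-3 polynomial; coefficients come back highest power first
coeffs = np.polyfit(x, y, 3)

# Build a callable polynomial from those coefficients
model = np.poly1d(coeffs)
print(model(17))  # predicted y at x = 17
```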
15
Polynomial Regression (Only 1 Input Variable)
R-squared measures how well the polynomial model fits
the data; it ranges from 0 to 1
The result 0.94 shows that the model fits very well!
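One way to obtain the R-squared value is sklearn's r2_score (an assumption: the course may compute it differently); the data below are illustrative, so the exact value need not be the slide's 0.94.

```python
import numpy as np
from sklearn.metrics import r2_score

x = np.array([1, 2, 3, 5, 6, 7, 8, 9, 10, 12, 13, 14, 15, 16, 18, 19, 21, 22])
y = np.array([100, 90, 80, 60, 60, 55, 60, 65, 70, 70, 75, 76, 78, 79, 90, 99, 99, 100])

# Fit the cubic polynomial, then compare its predictions with the true y
model = np.poly1d(np.polyfit(x, y, 3))
print(r2_score(y, model(x)))  # close to 1 means a good fit
```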
17
Polynomial Regression (Only 1 Input Variable)
An example of a bad fit with low R-Squared
18
Multiple Regression (Multiple Input Variables)
Multiple Regression
Multiple regression is like linear regression, but with more than one independent variable,
meaning that we try to predict a value based on two or more variables
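A sketch of the idea with two input variables; the column names echo the CO2 cars example on a later slide, but the numbers here are made up.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up cars table: weight (kg), engine volume (cm3), CO2 emissions (g/km)
df = pd.DataFrame({
    "Weight": [790, 1160, 929, 865, 1140, 929, 1109, 1365],
    "Volume": [1000, 1200, 1000, 900, 1500, 1000, 1600, 1600],
    "CO2":    [99, 95, 95, 90, 105, 105, 109, 109],
})

X = df[["Weight", "Volume"]]  # two independent variables
y = df["CO2"]

reg = LinearRegression().fit(X, y)
print(reg.coef_)  # one coefficient per input variable
```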
19
Multiple Regression (Multiple Input Variables)
Hands on the Python code
20
Multiple Regression (Multiple Input Variables)
Coefficient(s)
The coefficient is a factor that describes the relationship between an input variable and the outcome
In the equation y = β₀ + β₁x₁ + β₂x₂, the coefficients are β₁ and β₂; we call β₀ the constant term
In the CO2 example, we can print out all the coefficients
21
Multiple Regression (Multiple Input Variables)
Feature Scaling/ data normalization
Feature scaling is a method used to normalize the range of independent variables or features of
data
When your data has different values, and even different measurement units, it can be difficult to
compare them
For example, suppose the volume column in our last example contained values in liters instead of cm³
(1.0 instead of 1000)
22
Multiple Regression (Multiple Input Variables)
Feature Scaling/ data normalization
Standardization: z = (x − μ) / σ, where μ is the mean and σ is the standard deviation of the feature
23
Multiple Regression (Multiple Input Variables)
StandardScaler() for feature normalization
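A minimal StandardScaler sketch on made-up numbers: each column is rescaled to mean 0 and standard deviation 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up feature matrix: two columns on very different scales
X = np.array([[790.0, 1000.0],
              [1160.0, 1200.0],
              [929.0, 1000.0],
              [865.0, 900.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print(X_scaled.mean(axis=0))  # approximately 0 for each column
print(X_scaled.std(axis=0))   # 1 for each column
```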
24
Multiple Regression (Multiple Input Variables)
Training vs. Testing Split
Split the dataset into two sets: a training set and a testing set
80% for training, and 20% for testing
Train the model using the training set and evaluate the model on the testing set
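The 80/20 split above can be done with sklearn's train_test_split; the 20 samples here are toy data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(-1, 1)  # 20 toy samples, one feature each
y = np.arange(20)

# Hold out 20% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

print(len(X_train), len(X_test))  # 16 training samples, 4 testing samples
```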
25
Multiple Regression (Multiple Input Variables)
26
Multiple Regression (Multiple Input Variables)
Training vs. Testing Split and evaluation
80% for training, and 20% for testing
27
Logistic Regression
Logistic Regression
Logistic regression aims to solve classification problems
It does this by predicting categorical outcomes, unlike
linear regression that predicts a continuous outcome
In the simplest case there are two outcomes, which is
called binomial, an example of which is predicting if a
tumor is malignant or benign
Other cases have more than two outcomes to classify, in
this case it is called multinomial. A common example for
multinomial logistic regression would be predicting the
class of an iris flower between 3 different species
28
Logistic Regression
Theory Background
Logistic(x) = 1 / (1 + exp(−x))
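The logistic function above, evaluated numerically as a quick sanity check:

```python
import numpy as np

def logistic(x):
    # The logistic (sigmoid) function: squashes any real x into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

print(logistic(0))   # 0.5, the midpoint of the S-curve
print(logistic(5))   # close to 1
print(logistic(-5))  # close to 0
```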
29
Logistic Regression
Theory Background
The step from linear regression to logistic regression is fairly straightforward. In the linear
regression model, we modelled the relationship between outcome and features with a linear
equation:
y⁽ⁱ⁾ = β₀ + β₁x₁⁽ⁱ⁾ + … + β_p x_p⁽ⁱ⁾
For classification, we prefer probabilities between 0 and 1, so we wrap the right side of the
equation into the logistic function. This forces the output to assume only values between 0 and 1:
p(y⁽ⁱ⁾ = 1) = 1 / (1 + exp(−(β₀ + β₁x₁⁽ⁱ⁾ + … + β_p x_p⁽ⁱ⁾)))
30
Logistic Regression
Theory Background
ln( p(y⁽ⁱ⁾ = 1) / (1 − p(y⁽ⁱ⁾ = 1)) ) = β₀ + β₁x₁⁽ⁱ⁾ + … + β_p x_p⁽ⁱ⁾
odds = p(y⁽ⁱ⁾ = 1) / (1 − p(y⁽ⁱ⁾ = 1)) = exp(β₀ + β₁x₁⁽ⁱ⁾ + … + β_p x_p⁽ⁱ⁾)
p(y⁽ⁱ⁾ = 1) = exp(β₀ + β₁x₁⁽ⁱ⁾ + … + β_p x_p⁽ⁱ⁾) / (1 + exp(β₀ + β₁x₁⁽ⁱ⁾ + … + β_p x_p⁽ⁱ⁾))
p(y⁽ⁱ⁾ = 1) = 1 / (1 / exp(β₀ + β₁x₁⁽ⁱ⁾ + … + β_p x_p⁽ⁱ⁾) + 1)
p(y⁽ⁱ⁾ = 1) = 1 / (1 + exp(−(β₀ + β₁x₁⁽ⁱ⁾ + … + β_p x_p⁽ⁱ⁾))) - You get it!
31
Logistic Regression
Let us revisit the tumor size example again
32
Logistic Regression
Hands on your first logistic regression model in Python
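A sketch of a binomial logistic regression in sklearn; the tumor sizes and labels below are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up tumor sizes (cm) and labels: 1 = malignant, 0 = benign
X = np.array([3.78, 2.44, 2.09, 0.14, 1.72, 1.65,
              4.92, 4.37, 4.96, 4.52, 3.69, 5.88]).reshape(-1, 1)
y = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1])

clf = LogisticRegression().fit(X, y)
print(clf.predict([[3.46]]))        # predicted class for a 3.46 cm tumor
print(clf.predict_proba([[3.46]]))  # probability of each class
```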
33
Class Objectives
34
What is Categorical Outcome/Variable?
A categorical variable (also called qualitative variable) is a variable that can take on one
of a limited, and usually fixed, number of possible values
Nominal Data: used to label different classifications; it does not imply a quantitative value or order
Ordinal Data: a data type with a set order or scale to it
35
What is Categorical Outcome/Variable?
Features
Continuous Categorical
36
The Iris Dataset for Classification
Categorical
Sepal Length Sepal Width Petal Length Petal Width Class
37
The Iris Dataset for Classification
Reference: https://en.wikipedia.org/wiki/Iris_flower_data_set
38
The Iris Dataset for Classification
39
The Iris Dataset for Classification
Iris dataset:
150 samples of 3 different species
of iris (50 each), therefore, it is a
balanced dataset
Four features for each record:
sepal length, sepal width, petal
length, petal width
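One convenient way to load this dataset is sklearn's bundled copy (an assumption: the course may instead download the CSV from the UCI page):

```python
from sklearn.datasets import load_iris

iris = load_iris()
print(iris.data.shape)    # (150, 4): 150 samples with 4 features each
print(iris.target_names)  # the 3 species
```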
40
The Iris Dataset for Classification
https://archive.ics.uci.edu/ml/datasets/iris
41
The Iris Dataset for Classification
Our goal is to build and train a model to predict the species of Iris for any given new
data
Sepal Length Sepal Width Petal Length Petal Width Class
Input → Trained Model → Predict: Iris-setosa!
43
The Iris Classification With Python
Training Dataset
Testing Dataset
44
The Iris Classification With Python
45
The Iris Classification With Python
46
The Iris Classification With Python
Decision Tree:
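A minimal decision-tree sketch on the Iris data (the hyperparameters are illustrative defaults, not necessarily the course's settings):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Fit a fully grown tree on all 150 samples
tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# Predict the class of one flower: sepal length/width, petal length/width
print(tree.predict([[5.1, 3.5, 1.4, 0.2]]))  # a setosa-like measurement
```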
47
The Iris Classification With Python
K-Nearest Neighbors:
48
The Iris Classification With Python
K-Nearest Neighbors:
49
The Iris Classification With Python
KNN:
Step 2: Take the K nearest neighbors of the new data point according to some distance measure (e.g., Euclidean distance)
Step 3: Among these K neighbors, count the number of data points in each category
Step 4: Assign the new data point to the category with the most counted neighbors
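The steps above, sketched with sklearn's KNeighborsClassifier on the Iris data (K = 5 and the split seed are illustrative choices):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# K = 5 neighbors; Euclidean distance is sklearn's default metric
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print(knn.score(X_test, y_test))  # accuracy on the testing set
```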
50
The Iris Classification With Python
K-Nearest Neighbors:
51
The Iris Classification With Python
K-Nearest Neighbors:
52
The Iris Classification With Python
K-Nearest Neighbors:
Category Green: 7
Category Blue: 3
53
The Iris Classification With Python
K-Nearest Neighbors:
Category Green: 7
Category Blue: 3
New Data
54
The Iris Classification With Python
56
The Iris Classification With Python
Step 5: Make predictions for the test data with the trained model
57
The Iris Classification With Python
Accuracy of the predictions on the testing set
58
The Iris Classification With Python
59
Jupyter Notebook
60