Python Business Analytics 1


ISOM 3400 – PYTHON FOR BUSINESS ANALYTICS

8. Python in Business Analytics 1


Yingpeng Robin Zhu

JUL 08, 2022

1
Recap of Machine Learning Steps

 Step 1: Define features and response columns


 Step 2: Split data into training vs. testing data
 Step 3: Import the model/algorithm you want to use
 Step 4: Fit/Train your model with training data
 Step 5: Make prediction for the test data with the trained model
 Step 6: Evaluate the performance of the model

2
Class Objectives

Making predictions of a continuous (or binary) outcome

Revisit important concepts of machine learning


 What is linear regression and how does it work?
 What are the evaluation metrics for linear regression?
 How to train and interpret a linear regression model using Scikit-learn?

3
Class Objectives

Making predictions of a continuous (or binary) outcome

4
How Does Supervised Learning Work?

 Step One: Model training (train a machine learning model using labeled data)

 Step Two: Model prediction on new data (for which the label is unknown)

 Step Three: Evaluate the accuracy of the model (percentage of correct predictions using labeled data)

5
Linear Regression (Only 1 Input Variable)
 Let's revisit the scatter plot example from our last session:

6
Linear Regression (Only 1 Input Variable)
 How does linear regression work?

 Linear regression models are often fitted using the least-squares approach, which requires finding the values of the parameters of the 'best-fit line'; this is achieved by minimizing the sum of squared residuals.

 The linear regression algorithm's goal is to find a line (in this case, the red line) that has as many observations as close to the line as possible.
 This is what minimizing the sum of squared residuals means.

7
Linear Regression (Only 1 Input Variable)
 Let's get hands-on with our first linear regression model in Python
 Import the modules you need, e.g., SciPy, the scientific computing library
 Create the arrays that represent the values of the x and y axes
 Execute the stats.linregress() method, which returns some important key values of linear regression. In this case, it calculates a linear least-squares regression for x and y, and returns the slope, the intercept, the Pearson correlation coefficient, the p-value, and the standard error of the estimated slope
 Draw the original scatter plot and the fitted regression line
 Display the diagram with plt.show()
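The steps above can be sketched as a minimal, self-contained example; the x/y data here are made up for illustration (an age-of-car vs. speed style dataset), not the exact numbers from the slide:

```python
from scipy import stats

# Hypothetical data: age of a car (x, years) vs. its speed (y, km/h)
x = [5, 7, 8, 7, 2, 17, 2, 9, 4, 11, 12, 9, 6]
y = [99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86]

# linregress computes a linear least-squares regression and returns the
# slope, intercept, Pearson correlation coefficient (r), p-value, and
# standard error of the estimated slope
result = stats.linregress(x, y)
print(f"slope={result.slope:.3f}, intercept={result.intercept:.3f}")
print(f"r={result.rvalue:.3f}, p={result.pvalue:.4f}")

# The fitted line can then be used for prediction
def predict(x_new):
    return result.slope * x_new + result.intercept

print(predict(10))
```

Drawing the scatter plot and the fitted line (plt.scatter, plt.plot, plt.show()) then works exactly as the bullet points describe.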

8
Linear Regression (Only 1 Input Variable)
 Interpretation of the estimation results
 R for Relationship
 It is important to know how strong the relationship between x and y is; if there is no relationship, linear regression cannot be used to predict anything
 This relationship - the coefficient of correlation - is called r
 The r value ranges from -1 to 1, where 0 means no relationship and 1 (or -1) means 100% related
 The result -0.85 shows that there is a relationship - not perfect, but strong enough that we could use linear regression in future predictions
 Here, the p-value is small (below 0.01), so we can reject the null hypothesis that the slope is zero

9
Linear Regression (Only 1 Input Variable)
 Interpretation of the estimation results
 R for Relationship
 Example of a perfect correlation and a bad correlation

10
Linear Regression (Only 1 Input Variable)
 Another approach, with the sklearn package
 Import LinearRegression, i.e., ordinary least-squares linear regression, from sklearn, a free machine-learning library
 Create the arrays that represent the values of the x and y axes
 Use the LinearRegression.fit() method to fit the model to x and y
 Notice that we transformed x into a 2-D array with .reshape(-1, 1), because sklearn expects a 2-D feature matrix
 reg.coef_ returns an array of estimated coefficients; since we only have one x variable, there is only one item in the returned array
 reg.intercept_ returns the estimated intercept
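A minimal sketch of the sklearn approach, again with made-up data rather than the slide's own numbers:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: one input variable x, one outcome y
x = np.array([5, 7, 8, 7, 2, 17, 2, 9, 4, 11, 12, 9, 6])
y = np.array([99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86])

# sklearn expects a 2-D feature matrix, hence reshape(-1, 1)
X = x.reshape(-1, 1)

reg = LinearRegression()
reg.fit(X, y)

print(reg.coef_)       # one coefficient, since there is only one feature
print(reg.intercept_)  # the estimated intercept

# Predictions use the same 2-D shape: one row per sample
print(reg.predict([[10]]))
```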

11
Linear Regression (Only 1 Input Variable)
 A comparison of the two approaches

12
Linear Regression (Only 1 Input Variable)
 Make predictions with the estimated model

13
Polynomial Regression (Only 1 Input Variable)
 What if the relationship is not linear, but polynomial?

14
Polynomial Regression (Only 1 Input Variable)
 Let's get hands-on with our first polynomial regression model in Python
 Import the modules you need; here we use numpy, which has methods that let us build a polynomial model
 Create the arrays that represent the values of the x and y axes
 numpy.polyfit() is a method that takes the inputs and returns the polynomial coefficients, highest power first: β3, β2, β1, β0
 numpy.poly1d() is a method that constructs the polynomial function from the given coefficients
 numpy.linspace() specifies the x-values at which the fitted line is drawn; here we start at position 1 and end at position 22
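These steps can be sketched as follows; the data are hypothetical (e.g., hour of day vs. cars passing a tollbooth), chosen only so that a cubic fits well:

```python
import numpy as np

# Hypothetical data with a clear non-linear (roughly cubic) shape
x = [1, 2, 3, 5, 6, 7, 8, 9, 10, 12, 13, 14, 15, 16, 18, 19, 21, 22]
y = [100, 90, 80, 60, 60, 55, 60, 65, 70, 70, 75, 76, 78, 79, 90, 99, 99, 100]

# polyfit returns the best-fit coefficients, highest power first: b3, b2, b1, b0
coeffs = np.polyfit(x, y, 3)

# poly1d turns those coefficients into a callable polynomial
model = np.poly1d(coeffs)

# linspace gives evenly spaced x-values from position 1 to position 22,
# used for drawing the fitted curve
line_x = np.linspace(1, 22, 100)
line_y = model(line_x)

print(model(17))  # the model's prediction at x = 17
```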

15
Polynomial Regression (Only 1 Input Variable)
 R-Squared: how well the data fit the regression
 The result 0.94 shows that there is a very good fit!

 Currently we use a polynomial of order 3. What if we change it to 2? Try to plot the figure and calculate the R-squared, and see what you can conclude from the results
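One way to work through this exercise is sklearn's r2_score; the data below are hypothetical stand-ins for the slide's dataset:

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical non-linear data
x = [1, 2, 3, 5, 6, 7, 8, 9, 10, 12, 13, 14, 15, 16, 18, 19, 21, 22]
y = [100, 90, 80, 60, 60, 55, 60, 65, 70, 70, 75, 76, 78, 79, 90, 99, 99, 100]

r2 = {}
for degree in (2, 3):
    model = np.poly1d(np.polyfit(x, y, degree))
    # R-squared compares the model's fitted values with the observed y
    r2[degree] = r2_score(y, model(x))
    print(degree, round(r2[degree], 3))
```

Note that a degree-3 fit can never have a lower in-sample R-squared than a degree-2 fit on the same data, since every quadratic is also a (degenerate) cubic.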
16
Polynomial Regression (Only 1 Input Variable)
 Make a prediction with the estimated polynomial regression model

17
Polynomial Regression (Only 1 Input Variable)
 An example of a bad fit with a low R-Squared

 The result 0.00995 indicates a very bad relationship and tells us that this data set is not suitable for polynomial regression

18
Multiple Regression (Multiple Input Variables)
 Multiple Regression
 Multiple regression is like linear regression, but with more than one independent variable, meaning that we try to predict a value based on two or more variables

Example: predict the CO2 emission of a car

19
Multiple Regression (Multiple Input Variables)
 Let's get hands-on with the Python code
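The slide's code is a screenshot; here is a minimal sketch of the same idea, with an inline hypothetical car dataset instead of the CSV file the slide presumably loads:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical cars: [weight (kg), volume (cm3)] -> CO2 emission (g/km)
X = np.array([
    [790, 1000], [1160, 1200], [929, 1000], [865, 900],
    [1140, 1500], [929, 1000], [1109, 1400], [1365, 1500],
    [1112, 1500], [1150, 1600], [980, 1100], [990, 1300],
])
y = np.array([99, 95, 95, 90, 105, 105, 92, 108, 98, 99, 99, 101])

reg = LinearRegression()
reg.fit(X, y)

print(reg.coef_)       # one coefficient per feature (weight, volume)
print(reg.intercept_)  # the constant term

# Predict CO2 for a 1300 kg car with a 1300 cm3 engine
print(reg.predict([[1300, 1300]]))
```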

20
Multiple Regression (Multiple Input Variables)
 Coefficient(s)

 A coefficient is a factor that describes the relationship with an unknown variable
 In the above equation, the coefficients are β1, β2, …; we call β0 the constant term
 In the CO2 example, we can print out all the coefficients (e.g., 0.0078)

21
Multiple Regression (Multiple Input Variables)
 Feature Scaling / data normalization
 Feature scaling is a method used to normalize the range of independent variables or features of data
 When your data has different values, and even different measurement units, it can be difficult to compare them
 For example, suppose that in our last example the volume column contained values in liters instead of cm3 (1.0 instead of 1000)
 It can be difficult to compare the volume 1.0 with the weight 790
 But if we scale them both into comparable values, we can easily see how one value compares to the other
 Standardization: z = (x − u) / s, where u is the column mean and s is its standard deviation

22
Multiple Regression (Multiple Input Variables)
 Feature Scaling / data normalization
 Standardization: z = (x − u) / s

 Now you can compare -1.57 and -2.07

23
Multiple Regression (Multiple Input Variables)
 StandardScaler() for feature normalization
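A minimal sketch of StandardScaler on two features with very different scales (the values are illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical features: volume in liters vs. weight in kg
X = np.array([
    [1.0, 790], [1.2, 1160], [1.0, 929], [0.9, 865], [1.5, 1140],
])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# After scaling, each column has mean 0 and standard deviation 1,
# so the two features become directly comparable
print(X_scaled.round(2))
print(X_scaled.mean(axis=0).round(6))
print(X_scaled.std(axis=0).round(6))
```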

24
Multiple Regression (Multiple Input Variables)
 Training vs. Testing Split
 Split the dataset into two sets: a training set and a testing set
 80% for training, and 20% for testing
 Train the model using the training set and evaluate the model on the testing set
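The split itself is one call to sklearn's train_test_split; a sketch with toy data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 10 toy samples with 2 features each, plus one label per sample
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# test_size=0.2 holds out 20% of the rows for testing;
# random_state makes the split reproducible across runs
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(len(X_train), len(X_test))  # 8 2
```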

25
Multiple Regression (Multiple Input Variables)

26
Multiple Regression (Multiple Input Variables)
 Training vs. Testing Split and evaluation
 80% for training, and 20% for testing

27
Logistic Regression
 Logistic Regression
 Logistic regression aims to solve classification problems
 It does this by predicting categorical outcomes, unlike linear regression, which predicts a continuous outcome
 In the simplest case there are two outcomes; this is called binomial, and an example is predicting whether a tumor is malignant or benign
 Other cases have more than two outcomes to classify; this is called multinomial. A common example for multinomial logistic regression would be predicting the class of an iris flower among 3 different species

Bigger size for malignant tumor?

28
Logistic Regression
 Theory Background

Logistic(x) = 1 / (1 + exp(−x))
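The function above is easy to check numerically; a short sketch:

```python
import math

def logistic(x):
    """The logistic (sigmoid) function squashes any real x into (0, 1)."""
    return 1 / (1 + math.exp(-x))

print(logistic(0))    # 0.5, the midpoint
print(logistic(6))    # close to 1
print(logistic(-6))   # close to 0
```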

29
Logistic Regression
 Theory Background
 The step from linear regression to logistic regression is fairly straightforward. In the linear regression model, we modelled the relationship between outcome and features with a linear equation:

y^(i) = β0 + β1·x1^(i) + … + βp·xp^(i)

 For classification, we prefer probabilities between 0 and 1, so we wrap the right-hand side of the equation in the logistic function. This forces the output to take only values between 0 and 1:

p(y^(i) = 1) = 1 / (1 + exp(−(β0 + β1·x1^(i) + … + βp·xp^(i))))

Do you know how to transform?

ln( p(y^(i) = 1) / (1 − p(y^(i) = 1)) ) = β0 + β1·x1^(i) + … + βp·xp^(i)
30
Logistic Regression
 Theory Background

Starting from the log-odds (logit) form of the model:

ln( p(y^(i) = 1) / (1 − p(y^(i) = 1)) ) = β0 + β1·x1^(i) + … + βp·xp^(i)

Exponentiating both sides gives the odds:

odds = p(y^(i) = 1) / (1 − p(y^(i) = 1)) = exp(β0 + β1·x1^(i) + … + βp·xp^(i))

Multiplying through by the denominator and collecting p(y^(i) = 1):

p(y^(i) = 1) · (1 + exp(β0 + β1·x1^(i) + … + βp·xp^(i))) = exp(β0 + β1·x1^(i) + … + βp·xp^(i))

p(y^(i) = 1) = exp(β0 + β1·x1^(i) + … + βp·xp^(i)) / (1 + exp(β0 + β1·x1^(i) + … + βp·xp^(i)))

Dividing numerator and denominator by the exponential:

p(y^(i) = 1) = 1 / (1 / exp(β0 + β1·x1^(i) + … + βp·xp^(i)) + 1)

p(y^(i) = 1) = 1 / (1 + exp(−(β0 + β1·x1^(i) + … + βp·xp^(i))))    You get it!
31
Logistic Regression
 Let us revisit the tumor size example again

32
Logistic Regression
 Let's get hands-on with your first logistic regression model in Python
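A minimal sketch of a binary logistic regression in sklearn, using hypothetical tumor-size data (the slide's own numbers are in its screenshot):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: tumor size in cm (X); 1 = malignant, 0 = benign (y)
X = np.array([3.78, 2.44, 2.09, 0.14, 1.72, 1.65, 4.92,
              4.37, 4.96, 4.52, 3.69, 5.88]).reshape(-1, 1)
y = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1])

model = LogisticRegression()
model.fit(X, y)

# Predicted class and class probabilities for a 3.46 cm tumor
print(model.predict([[3.46]]))
print(model.predict_proba([[3.46]]))
```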

33
Class Objectives

Making predictions (classification) of a categorical outcome

As we have shown, logistic regression can also be used for classification

34
What is Categorical Outcome/Variable?

 A categorical variable (also called a qualitative variable) is a variable that can take on one of a limited, and usually fixed, number of possible values

Nominal data: used to label different classifications; does not imply a quantitative value or order
Ordinal data: a data type with a set order or scale to it
35
What is Categorical Outcome/Variable?

Features: Attendance, GPA, and Grade are continuous; the Outcome is categorical

Student ID    Attendance    GPA    Grade    Outcome
20465532      0.95          4.0    95       Pass
20339901      0.82          3.8    88       Pass
20567789      0.5           2.2    60       Fail
20339912      1.0           3.5    98       Pass

36
The Iris Dataset for Classification
The four measurements are continuous features; Class is the categorical outcome

Sepal Length    Sepal Width    Petal Length    Petal Width    Class
5.1             3.5            1.4             0.2            Iris-setosa
4.9             3.0            1.4             0.2            Iris-setosa
4.7             3.2            1.3             0.2            Iris-setosa
7.0             3.2            4.7             1.4            Iris-versicolor
6.4             3.2            4.5             1.5            Iris-versicolor
6.9             3.1            4.9             1.5            Iris-versicolor
6.3             3.3            6.0             2.5            Iris-virginica
5.8             2.7            5.1             1.9            Iris-virginica
7.1             3.0            5.9             2.1            Iris-virginica
…               …              …               …              …

37
The Iris Dataset for Classification

Reference: https://en.wikipedia.org/wiki/Iris_flower_data_set

38
The Iris Dataset for Classification

39
The Iris Dataset for Classification

 Iris dataset:
 150 samples of 3 different species
of iris (50 each), therefore, it is a
balanced dataset
 Four features for each record:
sepal length, sepal width, petal
length, petal width

40
The Iris Dataset for Classification
https://archive.ics.uci.edu/ml/datasets/iris

41
The Iris Dataset for Classification

 Our goal is to build and train a model to predict the species of iris for any given new data

Sepal Length    Sepal Width    Petal Length    Petal Width    Class
4.7             3.2            1.3             0.2            ?
5.6             2.9            3.6             1.3            ?
6.8             3.0            5.5             2.1            ?
6.2             3.4            5.4             2.3            ?

Input [4.7, 3.2, 1.3, 0.2] → Trained Model → Predict: Iris-setosa!


42
The Iris Classification With Python

Step 1 – Load the data:

Again, it is a balanced dataset!
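sklearn ships the Iris dataset, so loading it is one call; a sketch:

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

print(X.shape)                  # (150, 4): 150 samples, 4 features each
print(iris.feature_names)       # sepal/petal length and width
print(list(iris.target_names))  # the 3 species
print(np.bincount(y))           # [50 50 50] - a balanced dataset
```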

43
The Iris Classification With Python

Step 2 – Splitting the dataset:

Training Dataset

Testing Dataset

44
The Iris Classification With Python

Step 2 – Splitting the dataset:
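A sketch of the 80/20 split on Iris; stratify=y is our addition here (the slide may not use it) and keeps the three classes balanced in both subsets:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# 80% for training, 20% for testing, reproducible via random_state
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(X_train.shape, X_test.shape)  # (120, 4) (30, 4)
print(np.bincount(y_test))          # [10 10 10] thanks to stratify
```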

45
The Iris Classification With Python

Step 3 – Build/Select the model:


Different Classification Algorithms/Models

46
The Iris Classification With Python

Decision Tree:
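A minimal decision-tree sketch on Iris (trained on the full dataset just to show the API; the slides split the data first):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# The tree learns a hierarchy of if/else splits on the four features
tree = DecisionTreeClassifier(random_state=0)
tree.fit(X, y)

# This sample is a setosa-like flower from the dataset itself
print(tree.predict([[4.7, 3.2, 1.3, 0.2]]))  # [0] -> Iris-setosa
```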

47
The Iris Classification With Python

K-Nearest Neighbors:

48
The Iris Classification With Python

K-Nearest Neighbors:

Which class does it belong to?


New Data

49
The Iris Classification With Python

KNN:

Step 1: Choose the number K of neighbors

Step 2: Take the K nearest neighbors of the new data point according to some distance measure (e.g., Euclidean distance)

Step 3: Among these K neighbors, count the number of data points in each category

Step 4: Assign the new data point to the category where you counted the most neighbors
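The four steps above can be written out directly; a from-scratch sketch (the helper name knn_predict and the toy points are ours):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k):
    # Step 2: Euclidean distance from the new point to every training point
    dists = np.linalg.norm(np.asarray(X_train) - np.asarray(x_new), axis=1)
    # ...and the indices of the k closest points (Step 1: k is the caller's choice)
    nearest = np.argsort(dists)[:k]
    # Step 3: count the labels among those k neighbors
    votes = Counter(np.asarray(y_train)[nearest])
    # Step 4: assign the majority category
    return votes.most_common(1)[0][0]

# Toy 2-D data: class 0 clusters near the origin, class 1 near (5, 5)
X_train = [[0, 0], [1, 0], [0, 1], [1, 1], [5, 5], [6, 5], [5, 6], [6, 6]]
y_train = [0, 0, 0, 0, 1, 1, 1, 1]
print(knn_predict(X_train, y_train, [0.5, 0.5], k=3))  # 0
print(knn_predict(X_train, y_train, [5.5, 5.5], k=3))  # 1
```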

50
The Iris Classification With Python

K-Nearest Neighbors:

Choose the number K to be K=10, for example


New Data

51
The Iris Classification With Python

K-Nearest Neighbors:

Take the K-nearest neighbors of the new


datapoint according to Euclidean Distance
New Data

52
The Iris Classification With Python

K-Nearest Neighbors:

Category Green: 7
Category Blue: 3

Among the 10 nearest neighbors, count the


number of data points in each category
New Data

53
The Iris Classification With Python

K-Nearest Neighbors:

Category Green: 7
Category Blue: 3

New Data

54
The Iris Classification With Python

Step 3 – Build the model:


Different Classification Algorithms/Models

Step 3: Import the model/algorithm you want to use
55


The Iris Classification With Python

Step 4 – Fit/Train the model on training dataset:

Parameters of the model. We keep them at their default values.

56
The Iris Classification With Python

 Step 5– Make predictions:

Step 5: Make prediction for the test data with the trained model
57
The Iris Classification With Python

Step 6– Evaluate the Model Performance:

Confusion matrix from the notebook output (counts 12, 9, and 1) and the resulting accuracy
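Steps 4-6 fit together in a few lines; a sketch using KNN with K=10 (the slide's exact counts of 12, 9, and 1 come from its own split, so our numbers may differ):

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Steps 3-4: choose a model and fit it on the training data
model = KNeighborsClassifier(n_neighbors=10)
model.fit(X_train, y_train)

# Step 5: predict labels for the held-out test data
y_pred = model.predict(X_test)

# Step 6: evaluate - fraction of correct predictions,
# plus a per-class breakdown in the confusion matrix
print(accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
```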

58
The Iris Classification With Python

59
Jupyter Notebook

60
