Python Business Analytics 1


ISOM 3400 – PYTHON FOR BUSINESS ANALYTICS

8. Python in Business Analytics 1


Yingpeng Robin Zhu

JUL 08, 2022

1
Recap of Machine Learning Steps

 Step 1: Define features and response columns


 Step 2: Split data into training vs. testing data
 Step 3: Import the model/algorithm you want to use
 Step 4: Fit/Train your model with training data
 Step 5: Make prediction for the test data with the trained model
 Step 6: Evaluate the performance of the model

2
Class Objectives

Making predictions of a continuous (or binary) outcome

Revisit important concepts of machine learning


 What is linear regression and how does it work?
 What are the evaluation metrics for linear regression?
 How to train and interpret a linear regression model using Scikit-learn?

3
Class Objectives

Making predictions of a continuous (or binary) outcome

4
How Does Supervised Learning Work?

 Step One: Model training (train a machine learning model using labeled data)

 Step Two: Model prediction on new data (for which the label is unknown)

 Step Three: Evaluate the accuracy of the model (percentage of correct predictions using labeled data)

5
Linear Regression (Only 1 Input Variable)
 Let's revisit the scatter plot example from our last session:

6
Linear Regression (Only 1 Input Variable)
 How does linear regression work?

 Linear regression models are often fitted using the least-squares approach, which requires finding the values of the parameters of the 'best-fit line'; this is achieved by minimizing the sum of squared residuals.

 The linear regression algorithm's goal is to find a line (in this case, the red line) that has as many observations as close to the line as possible.
 This is what minimizing the sum of squared residuals means.

7
Linear Regression (Only 1 Input Variable)
 Let's get hands-on with our first linear regression model in Python
 Import the modules you need, e.g., SciPy, the scientific computing library
 Create the arrays that represent the values of the x and y axes
 Execute the stats.linregress() method, which returns some important key values of linear regression. In this case, it calculates a linear least-squares regression for x and y, and returns the slope, the intercept, the Pearson correlation coefficient, the p-value, and the standard error of the estimated slope
 Draw the original scatter plot and the fitted regression line
 Display the diagram with plt.show()
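The steps above can be sketched as a minimal, self-contained example; the x/y data here are made up for illustration (an age-of-car vs. speed style dataset), not the exact numbers from the slide:

```python
from scipy import stats

# Hypothetical data: age of a car (x, years) vs. its speed (y, km/h)
x = [5, 7, 8, 7, 2, 17, 2, 9, 4, 11, 12, 9, 6]
y = [99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86]

# linregress computes a linear least-squares regression and returns the
# slope, intercept, Pearson correlation coefficient (r), p-value, and
# standard error of the estimated slope
result = stats.linregress(x, y)
print(f"slope={result.slope:.3f}, intercept={result.intercept:.3f}")
print(f"r={result.rvalue:.3f}, p={result.pvalue:.4f}")

# The fitted line can then be used for prediction
def predict(x_new):
    return result.slope * x_new + result.intercept

print(predict(10))
```

Drawing the scatter plot and the fitted line (plt.scatter, plt.plot, plt.show()) then works exactly as the bullet points describe.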

8
Linear Regression (Only 1 Input Variable)
 Interpretation of the estimation results
 R for Relationship
 It is important to know how strong the relationship between x and y is; if there is no relationship, linear regression cannot be used to predict anything
 This relationship - the coefficient of correlation - is called r
 The r value ranges from -1 to 1, where 0 means no relationship and 1 (or -1) means 100% related
 The result -0.85 shows that there is a relationship - not perfect, but strong enough that we could use linear regression in future predictions
 Here, the p-value is small (below 0.01), so we can reject the null hypothesis that the slope is zero

9
Linear Regression (Only 1 Input Variable)
 Interpretation of the estimation results
 R for Relationship
 Example of a perfect correlation and a bad correlation

10
Linear Regression (Only 1 Input Variable)
 Another approach, with the sklearn package
 Import LinearRegression, i.e., ordinary least-squares linear regression, from sklearn, a free machine-learning library
 Create the arrays that represent the values of the x and y axes
 Use the LinearRegression.fit() method to fit the model to x and y
 Notice that we transformed x into a 2-D array with .reshape(-1, 1), because sklearn expects a 2-D feature matrix
 reg.coef_ returns an array of estimated coefficients; since we only have one x variable, there is only one item in the returned array
 reg.intercept_ returns the estimated intercept
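A minimal sketch of the sklearn approach, again with made-up data rather than the slide's own numbers:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: one input variable x, one outcome y
x = np.array([5, 7, 8, 7, 2, 17, 2, 9, 4, 11, 12, 9, 6])
y = np.array([99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86])

# sklearn expects a 2-D feature matrix, hence reshape(-1, 1)
X = x.reshape(-1, 1)

reg = LinearRegression()
reg.fit(X, y)

print(reg.coef_)       # one coefficient, since there is only one feature
print(reg.intercept_)  # the estimated intercept

# Predictions use the same 2-D shape: one row per sample
print(reg.predict([[10]]))
```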

11
Linear Regression (Only 1 Input Variable)
 A comparison of the two approaches

12
Linear Regression (Only 1 Input Variable)
 Make predictions with the estimated model

13
Polynomial Regression (Only 1 Input Variable)
 What if the relationship is not linear, but polynomial?

14
Polynomial Regression (Only 1 Input Variable)
 Let's get hands-on with our first polynomial regression model in Python
 Import the modules you need; here we use numpy, which has methods that let us build a polynomial model
 Create the arrays that represent the values of the x and y axes
 numpy.polyfit() is a method that takes the inputs and returns the polynomial coefficients, highest power first: β3, β2, β1, β0
 numpy.poly1d() is a method that constructs the polynomial function from the given coefficients
 numpy.linspace() specifies the x-values at which the fitted line is drawn; here we start at position 1 and end at position 22
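These steps can be sketched as follows; the data are hypothetical (e.g., hour of day vs. cars passing a tollbooth), chosen only so that a cubic fits well:

```python
import numpy as np

# Hypothetical data with a clear non-linear (roughly cubic) shape
x = [1, 2, 3, 5, 6, 7, 8, 9, 10, 12, 13, 14, 15, 16, 18, 19, 21, 22]
y = [100, 90, 80, 60, 60, 55, 60, 65, 70, 70, 75, 76, 78, 79, 90, 99, 99, 100]

# polyfit returns the best-fit coefficients, highest power first: b3, b2, b1, b0
coeffs = np.polyfit(x, y, 3)

# poly1d turns those coefficients into a callable polynomial
model = np.poly1d(coeffs)

# linspace gives evenly spaced x-values from position 1 to position 22,
# used for drawing the fitted curve
line_x = np.linspace(1, 22, 100)
line_y = model(line_x)

print(model(17))  # the model's prediction at x = 17
```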

15
Polynomial Regression (Only 1 Input Variable)
 R-Squared: how well the data fit the regression
 The result 0.94 shows that there is a very good fit!

 Currently we use a polynomial of order 3. What if we change it to 2? Try to plot the figure and calculate the R-squared, and see what you can conclude from the results
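One way to work through this exercise is sklearn's r2_score; the data below are hypothetical stand-ins for the slide's dataset:

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical non-linear data
x = [1, 2, 3, 5, 6, 7, 8, 9, 10, 12, 13, 14, 15, 16, 18, 19, 21, 22]
y = [100, 90, 80, 60, 60, 55, 60, 65, 70, 70, 75, 76, 78, 79, 90, 99, 99, 100]

r2 = {}
for degree in (2, 3):
    model = np.poly1d(np.polyfit(x, y, degree))
    # R-squared compares the model's fitted values with the observed y
    r2[degree] = r2_score(y, model(x))
    print(degree, round(r2[degree], 3))
```

Note that a degree-3 fit can never have a lower in-sample R-squared than a degree-2 fit on the same data, since every quadratic is also a (degenerate) cubic.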
16
Polynomial Regression (Only 1 Input Variable)
 Make a prediction with the estimated polynomial regression model

17
Polynomial Regression (Only 1 Input Variable)
 An example of a bad fit with a low R-Squared

 The result 0.00995 indicates a very bad relationship and tells us that this data set is not suitable for polynomial regression

18
Multiple Regression (Multiple Input Variables)
 Multiple Regression
 Multiple regression is like linear regression, but with more than one independent variable, meaning that we try to predict a value based on two or more variables

Example: predict the CO2 emission of a car

19
Multiple Regression (Multiple Input Variables)
 Let's get hands-on with the Python code
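The slide's code is a screenshot; here is a minimal sketch of the same idea, with an inline hypothetical car dataset instead of the CSV file the slide presumably loads:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical cars: [weight (kg), volume (cm3)] -> CO2 emission (g/km)
X = np.array([
    [790, 1000], [1160, 1200], [929, 1000], [865, 900],
    [1140, 1500], [929, 1000], [1109, 1400], [1365, 1500],
    [1112, 1500], [1150, 1600], [980, 1100], [990, 1300],
])
y = np.array([99, 95, 95, 90, 105, 105, 92, 108, 98, 99, 99, 101])

reg = LinearRegression()
reg.fit(X, y)

print(reg.coef_)       # one coefficient per feature (weight, volume)
print(reg.intercept_)  # the constant term

# Predict CO2 for a 1300 kg car with a 1300 cm3 engine
print(reg.predict([[1300, 1300]]))
```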

20
Multiple Regression (Multiple Input Variables)
 Coefficient(s)

 A coefficient is a factor that describes the relationship with an unknown variable
 In the above equation, the coefficients are β1, β2, …; we call β0 the constant term
 In the CO2 example, we can print out all the coefficients (e.g., 0.0078)

21
Multiple Regression (Multiple Input Variables)
 Feature Scaling / data normalization
 Feature scaling is a method used to normalize the range of independent variables or features of data
 When your data has different values, and even different measurement units, it can be difficult to compare them
 For example, suppose that in our last example the volume column contained values in liters instead of cm3 (1.0 instead of 1000)
 It can be difficult to compare the volume 1.0 with the weight 790
 But if we scale them both into comparable values, we can easily see how one value compares to the other
 Standardization: z = (x − u) / s, where u is the column mean and s is its standard deviation

22
Multiple Regression (Multiple Input Variables)
 Feature Scaling / data normalization
 Standardization: z = (x − u) / s

 Now you can compare -1.57 and -2.07

23
Multiple Regression (Multiple Input Variables)
 StandardScaler() for feature normalization
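A minimal sketch of StandardScaler on two features with very different scales (the values are illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical features: volume in liters vs. weight in kg
X = np.array([
    [1.0, 790], [1.2, 1160], [1.0, 929], [0.9, 865], [1.5, 1140],
])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# After scaling, each column has mean 0 and standard deviation 1,
# so the two features become directly comparable
print(X_scaled.round(2))
print(X_scaled.mean(axis=0).round(6))
print(X_scaled.std(axis=0).round(6))
```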

24
Multiple Regression (Multiple Input Variables)
 Training vs. Testing Split
 Split the dataset into two sets: a training set and a testing set
 80% for training, and 20% for testing
 Train the model using the training set and evaluate the model on the testing set
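The split itself is one call to sklearn's train_test_split; a sketch with toy data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 10 toy samples with 2 features each, plus one label per sample
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# test_size=0.2 holds out 20% of the rows for testing;
# random_state makes the split reproducible across runs
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(len(X_train), len(X_test))  # 8 2
```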

25
Multiple Regression (Multiple Input Variables)

26
Multiple Regression (Multiple Input Variables)
 Training vs. Testing Split and evaluation
 80% for training, and 20% for testing

27
Logistic Regression
 Logistic Regression
 Logistic regression aims to solve classification problems
 It does this by predicting categorical outcomes, unlike linear regression, which predicts a continuous outcome
 In the simplest case there are two outcomes; this is called binomial, and an example is predicting whether a tumor is malignant or benign
 Other cases have more than two outcomes to classify; this is called multinomial. A common example for multinomial logistic regression would be predicting the class of an iris flower among 3 different species

Bigger size for malignant tumor?

28
Logistic Regression
 Theory Background

Logistic(x) = 1 / (1 + exp(−x))
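The function above is easy to check numerically; a short sketch:

```python
import math

def logistic(x):
    """The logistic (sigmoid) function squashes any real x into (0, 1)."""
    return 1 / (1 + math.exp(-x))

print(logistic(0))    # 0.5, the midpoint
print(logistic(6))    # close to 1
print(logistic(-6))   # close to 0
```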

29
Logistic Regression
 Theory Background
 The step from linear regression to logistic regression is fairly straightforward. In the linear regression model, we modelled the relationship between outcome and features with a linear equation:

y^(i) = β0 + β1·x1^(i) + … + βp·xp^(i)

 For classification, we prefer probabilities between 0 and 1, so we wrap the right-hand side of the equation in the logistic function. This forces the output to take only values between 0 and 1:

p(y^(i) = 1) = 1 / (1 + exp(−(β0 + β1·x1^(i) + … + βp·xp^(i))))

Do you know how to transform?

ln( p(y^(i) = 1) / (1 − p(y^(i) = 1)) ) = β0 + β1·x1^(i) + … + βp·xp^(i)
30
Logistic Regression
 Theory Background

Starting from the log-odds (logit) form of the model:

ln( p(y^(i) = 1) / (1 − p(y^(i) = 1)) ) = β0 + β1·x1^(i) + … + βp·xp^(i)

Exponentiating both sides gives the odds:

odds = p(y^(i) = 1) / (1 − p(y^(i) = 1)) = exp(β0 + β1·x1^(i) + … + βp·xp^(i))

Multiplying through by the denominator and collecting p(y^(i) = 1):

p(y^(i) = 1) · (1 + exp(β0 + β1·x1^(i) + … + βp·xp^(i))) = exp(β0 + β1·x1^(i) + … + βp·xp^(i))

p(y^(i) = 1) = exp(β0 + β1·x1^(i) + … + βp·xp^(i)) / (1 + exp(β0 + β1·x1^(i) + … + βp·xp^(i)))

Dividing numerator and denominator by the exponential:

p(y^(i) = 1) = 1 / (1 / exp(β0 + β1·x1^(i) + … + βp·xp^(i)) + 1)

p(y^(i) = 1) = 1 / (1 + exp(−(β0 + β1·x1^(i) + … + βp·xp^(i))))    You get it!
31
Logistic Regression
 Let us revisit the tumor size example again

32
Logistic Regression
 Let's get hands-on with your first logistic regression model in Python
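A minimal sketch of a binary logistic regression in sklearn, using hypothetical tumor-size data (the slide's own numbers are in its screenshot):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: tumor size in cm (X); 1 = malignant, 0 = benign (y)
X = np.array([3.78, 2.44, 2.09, 0.14, 1.72, 1.65, 4.92,
              4.37, 4.96, 4.52, 3.69, 5.88]).reshape(-1, 1)
y = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1])

model = LogisticRegression()
model.fit(X, y)

# Predicted class and class probabilities for a 3.46 cm tumor
print(model.predict([[3.46]]))
print(model.predict_proba([[3.46]]))
```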

33
Class Objectives

Making predictions (classification) of a categorical outcome

As we have shown, logistic regression can also be used for classification

34
What is Categorical Outcome/Variable?

 A categorical variable (also called a qualitative variable) is a variable that can take on one of a limited, and usually fixed, number of possible values

Nominal data: used to label different classifications; does not imply a quantitative value or order
Ordinal data: a data type with a set order or scale to it
35
What is Categorical Outcome/Variable?

Features: Attendance, GPA, and Grade are continuous; the Outcome is categorical

Student ID    Attendance    GPA    Grade    Outcome
20465532      0.95          4.0    95       Pass
20339901      0.82          3.8    88       Pass
20567789      0.5           2.2    60       Fail
20339912      1.0           3.5    98       Pass

36
The Iris Dataset for Classification
The four measurements are continuous features; Class is the categorical outcome

Sepal Length    Sepal Width    Petal Length    Petal Width    Class
5.1             3.5            1.4             0.2            Iris-setosa
4.9             3.0            1.4             0.2            Iris-setosa
4.7             3.2            1.3             0.2            Iris-setosa
7.0             3.2            4.7             1.4            Iris-versicolor
6.4             3.2            4.5             1.5            Iris-versicolor
6.9             3.1            4.9             1.5            Iris-versicolor
6.3             3.3            6.0             2.5            Iris-virginica
5.8             2.7            5.1             1.9            Iris-virginica
7.1             3.0            5.9             2.1            Iris-virginica
…               …              …               …              …

37
The Iris Dataset for Classification

Reference: https://en.wikipedia.org/wiki/Iris_flower_data_set

38
The Iris Dataset for Classification

39
The Iris Dataset for Classification

 Iris dataset:
 150 samples of 3 different species
of iris (50 each), therefore, it is a
balanced dataset
 Four features for each record:
sepal length, sepal width, petal
length, petal width

40
The Iris Dataset for Classification
https://archive.ics.uci.edu/ml/datasets/iris

41
The Iris Dataset for Classification

 Our goal is to build and train a model to predict the species of iris for any given new data

Sepal Length    Sepal Width    Petal Length    Petal Width    Class
4.7             3.2            1.3             0.2            ?
5.6             2.9            3.6             1.3            ?
6.8             3.0            5.5             2.1            ?
6.2             3.4            5.4             2.3            ?

Input [4.7, 3.2, 1.3, 0.2] → Trained Model → Predict: Iris-setosa!


42
The Iris Classification With Python

Step 1 – Load the data:

Again, it is a balanced dataset!
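sklearn ships the Iris dataset, so loading it is one call; a sketch:

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

print(X.shape)                  # (150, 4): 150 samples, 4 features each
print(iris.feature_names)       # sepal/petal length and width
print(list(iris.target_names))  # the 3 species
print(np.bincount(y))           # [50 50 50] - a balanced dataset
```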

43
The Iris Classification With Python

Step 2 – Splitting the dataset:

Training Dataset

Testing Dataset

44
The Iris Classification With Python

Step 2 – Splitting the dataset:
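A sketch of the 80/20 split on Iris; stratify=y is our addition here (the slide may not use it) and keeps the three classes balanced in both subsets:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# 80% for training, 20% for testing, reproducible via random_state
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(X_train.shape, X_test.shape)  # (120, 4) (30, 4)
print(np.bincount(y_test))          # [10 10 10] thanks to stratify
```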

45
The Iris Classification With Python

Step 3 – Build/Select the model:


Different Classification Algorithms/Models

46
The Iris Classification With Python

Decision Tree:
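A minimal decision-tree sketch on Iris (trained on the full dataset just to show the API; the slides split the data first):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# The tree learns a hierarchy of if/else splits on the four features
tree = DecisionTreeClassifier(random_state=0)
tree.fit(X, y)

# This sample is a setosa-like flower from the dataset itself
print(tree.predict([[4.7, 3.2, 1.3, 0.2]]))  # [0] -> Iris-setosa
```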

47
The Iris Classification With Python

K-Nearest Neighbors:

48
The Iris Classification With Python

K-Nearest Neighbors:

Which class does it belong to?


New Data

49
The Iris Classification With Python

KNN:

Step 1: Choose the number K of neighbors

Step 2: Take the K nearest neighbors of the new data point according to some distance measure (e.g., Euclidean distance)

Step 3: Among these K neighbors, count the number of data points in each category

Step 4: Assign the new data point to the category where you counted the most neighbors
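The four steps above can be written out directly; a from-scratch sketch (the helper name knn_predict and the toy points are ours):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k):
    # Step 2: Euclidean distance from the new point to every training point
    dists = np.linalg.norm(np.asarray(X_train) - np.asarray(x_new), axis=1)
    # ...and the indices of the k closest points (Step 1: k is the caller's choice)
    nearest = np.argsort(dists)[:k]
    # Step 3: count the labels among those k neighbors
    votes = Counter(np.asarray(y_train)[nearest])
    # Step 4: assign the majority category
    return votes.most_common(1)[0][0]

# Toy 2-D data: class 0 clusters near the origin, class 1 near (5, 5)
X_train = [[0, 0], [1, 0], [0, 1], [1, 1], [5, 5], [6, 5], [5, 6], [6, 6]]
y_train = [0, 0, 0, 0, 1, 1, 1, 1]
print(knn_predict(X_train, y_train, [0.5, 0.5], k=3))  # 0
print(knn_predict(X_train, y_train, [5.5, 5.5], k=3))  # 1
```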

50
The Iris Classification With Python

K-Nearest Neighbors:

Choose the number K to be K=10, for example


New Data

51
The Iris Classification With Python

K-Nearest Neighbors:

Take the K-nearest neighbors of the new


datapoint according to Euclidean Distance
New Data

52
The Iris Classification With Python

K-Nearest Neighbors:

Category Green: 7
Category Blue: 3

Among the 10 nearest neighbors, count the


number of data points in each category
New Data

53
The Iris Classification With Python

K-Nearest Neighbors:

Category Green: 7
Category Blue: 3

New Data

54
The Iris Classification With Python

Step 3 – Build the model:


Different Classification Algorithms/Models

Step 3: Import the model/algorithm you want to use
55


The Iris Classification With Python

Step 4 – Fit/Train the model on training dataset:

Parameters of the model. We keep them at their default values.

56
The Iris Classification With Python

 Step 5– Make predictions:

Step 5: Make prediction for the test data with the trained model
57
The Iris Classification With Python

Step 6– Evaluate the Model Performance:

Confusion matrix from the notebook output (counts 12, 9, and 1) and the resulting accuracy
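Steps 4-6 fit together in a few lines; a sketch using KNN with K=10 (the slide's exact counts of 12, 9, and 1 come from its own split, so our numbers may differ):

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Steps 3-4: choose a model and fit it on the training data
model = KNeighborsClassifier(n_neighbors=10)
model.fit(X_train, y_train)

# Step 5: predict labels for the held-out test data
y_pred = model.predict(X_test)

# Step 6: evaluate - fraction of correct predictions,
# plus a per-class breakdown in the confusion matrix
print(accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
```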

58
The Iris Classification With Python

59
Jupyter Notebook

60
