Team9 Lab3
LAB 3 REPORT
SUBJECT: DATA ANALYSIS IN BUSINESS
LAB 3
I. Multivariable Linear Regression:
Requirement :
Explanation (What, How and Why) and example of Multivariable
Linear Regression.
What is Multivariable Linear Regression?
Multivariable Linear Regression is a linear regression model with
more than one independent variable.
How does Multivariable Linear Regression work?
Multivariable Linear Regression works by fitting a linear equation
to the data. The equation is of the form:
Y = B0 + B1X1 + B2X2 + … + BkXk + e
Where:
+ Y is the dependent variable.
+ X1,…., Xk are the independent (explanatory) variables.
+ B0 is the intercept term.
+ B1,…., Bk are the regression coefficients for the
independent variables.
+ e is the error term
- The regression coefficients (called partial regression coefficients)
indicate the anticipated shift in the dependent variable when one
of the independent variables is increased by one unit, keeping all
other independent variables unchanged.
- The ANOVA test for the significance of the entire model uses these
hypotheses:
+ The null hypothesis (H0) states that no linear
relationship exists between the dependent variable and any of the
independent variables (B1 = B2 = … = Bk = 0).
+ The alternative hypothesis (H1) states that the
dependent variable has a linear relationship with at least one
independent variable (at least one Bj is not 0).
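The overall significance test can be illustrated with a short Python sketch. It is only an illustration on synthetic data (the sample size, coefficients and noise level are assumptions made up for the example), and the helper name overall_f_statistic is hypothetical. It computes the ANOVA F statistic as (SSR / k) / (SSE / (n - k - 1)), where k is the number of independent variables and n the number of observations.

```python
import numpy as np

def overall_f_statistic(X, y):
    """Return the ANOVA F statistic for H0: B1 = ... = Bk = 0."""
    n, k = X.shape
    X1 = np.column_stack([np.ones(n), X])           # add the intercept column
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)   # least-squares coefficients
    fitted = X1 @ beta
    ssr = np.sum((fitted - y.mean()) ** 2)          # regression sum of squares
    sse = np.sum((y - fitted) ** 2)                 # error sum of squares
    return (ssr / k) / (sse / (n - k - 1))

# Synthetic data (assumption): y depends linearly on two variables plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 1.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=50)

f_stat = overall_f_statistic(X, y)
print(f"F = {f_stat:.1f}")
```

A large F value is evidence against H0. To turn it into a p-value, the statistic is compared with an F distribution with k and n - k - 1 degrees of freedom, for example with scipy.stats.f.sf if SciPy is available.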
Why is Multivariable Linear Regression useful?
- Multiple linear regression (MLR) is useful for a variety of reasons.
It can be used to:
+ Identify the relationship between multiple independent
variables and a dependent variable
+ Quantify the strength of the relationships between the
variables
+ Make predictions about the value of the dependent variable
+ Control for confounding variables
- MLR is used in a wide variety of fields, including business,
economics, finance, and science. Some examples of how MLR is
used include:
+ Predicting the price of a house based on its square footage,
number of bedrooms, and location
+ Predicting the number of customers who will visit a store on a
given day based on the day of the week, the time of year, and
the weather forecast
+ Predicting the risk of a customer defaulting on a loan based on
their credit score, income, and employment history
Example of Multivariable Linear Regression:
- Suppose we are interested in predicting the price of a house. We
have data on the following variables:
+ Price (dependent variable)
+ Square footage
+ Number of bedrooms
+ Number of bathrooms
+ Age of the house
- We can use MLR to build a model that predicts the price of a
house based on these variables. The model would be of the form:
Price = b0 + b1*SquareFootage + b2*NumBedrooms +
b3*NumBathrooms + b4*Age + e
- We can then predict the price of any house by plugging the values
of its independent variables into the fitted model. For example, for
a house with 2,000 square feet, 3 bedrooms, 2 bathrooms, and an age
of 10 years, the predicted price is
b0 + b1*2000 + b2*3 + b3*2 + b4*10.
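The house price example can be sketched in Python as follows. This is a minimal sketch, not the team's code: the data are synthetic, and the "true" coefficients used to generate them are assumptions chosen only so that the example runs. The model is fitted by least squares with numpy and then used to predict the price of the house described above.

```python
import numpy as np

# Synthetic data set (assumption): 200 houses with made-up characteristics.
rng = np.random.default_rng(42)
n = 200
sqft = rng.uniform(800, 3500, n)
beds = rng.integers(1, 6, n)
baths = rng.integers(1, 4, n)
age = rng.uniform(0, 50, n)

# Hypothetical coefficients (b0..b4) used only to generate the toy prices.
true_b = np.array([50_000.0, 120.0, 8_000.0, 12_000.0, -900.0])
X = np.column_stack([np.ones(n), sqft, beds, baths, age])  # column of 1s = intercept
price = X @ true_b + rng.normal(scale=5_000, size=n)

# Fit Price = b0 + b1*SquareFootage + b2*NumBedrooms + b3*NumBathrooms + b4*Age.
b, *_ = np.linalg.lstsq(X, price, rcond=None)

# Predict a house with 2,000 square feet, 3 bedrooms, 2 bathrooms, 10 years old.
house = np.array([1.0, 2000, 3, 2, 10])
predicted = house @ b
print(f"Predicted price: {predicted:,.0f}")
```

Because the prices are generated from known coefficients, the prediction can be compared with the noise-free value 50,000 + 120*2000 + 8,000*3 + 12,000*2 - 900*10 = 329,000.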
Here, h is an appropriate function that depends on the
explanatory variables and the parameters, which we summarize as
a vector of explanatory variables and a vector of parameters. The
error terms are assumed to be independent.[3]
Why is Multivariable Nonlinear Regression used?
+ It is of great importance in agricultural research, because many
crop and soil processes are better captured by nonlinear models
than by linear ones.[4]
- Even though gold prices are stable to a great extent, they are
affected by inflation, crude oil prices, and other factors. The most
important of these is inflation, and gold prices can in turn help
control inflation instability. Therefore, a deep understanding of the
relationship between inflation and gold prices is a prerequisite.
- Logistic regression seeks to predict the probability that the
output variable will fall into a category based on the values of
the independent (predictor) variables. This probability is used
to classify an observation into a category. [James R. Evans,
Business Analytics, 2nd edition, p. 354]
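How does Logistic Regression work?
- Logistic regression is generally used when the dependent
variable is binary, that is, when it takes on two values, 0 or 1.
- You may recall that a multiple linear regression model has the
form:
Y = B0 + B1X1 + B2X2 + … + BkXk + e
- Logistic regression instead passes the linear combination
B0 + B1X1 + … + BkXk through the logistic transform to obtain a
probability p. This transform ensures that p stays between 0 and 1.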
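The effect of the transform can be checked numerically. The sketch below is an illustration only; the function names logistic and logit are chosen for this example. It maps a few values of the linear predictor y to p = 1 / (1 + e^(-y)), which is the same as e^y / (1 + e^y), and then maps them back with the log-odds ln(p / (1 - p)).

```python
import numpy as np

def logistic(y):
    """Map a linear predictor y to a probability p = 1 / (1 + e^(-y))."""
    return 1.0 / (1.0 + np.exp(-y))

def logit(p):
    """Inverse transform: the log-odds ln(p / (1 - p))."""
    return np.log(p / (1.0 - p))

# The linear predictor can be any real number...
y = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
p = logistic(y)

# ...but the resulting probabilities always lie strictly between 0 and 1.
print(p)
print(logit(p))  # recovers the original values of y
```

A linear predictor of 0 corresponds to p = 0.5, large negative values push p towards 0, and large positive values push p towards 1.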
Why is Logistic Regression useful?
- It is well suited to binary classification tasks. Logistic
regression is designed to predict the probability of a binary
outcome, such as yes/no, true/false, or spam/not spam.
- It is simple to understand and implement, even for people with
no prior machine learning experience.
- It is relatively robust to overfitting. Overfitting occurs when a
model fits the training data so closely that it is unable to
generalize to new data. Logistic regression has few parameters
(one coefficient per independent variable plus an intercept), so it
is less prone to this problem than more flexible models, provided
the number of independent variables is small relative to the
number of observations.
- It is interpretable. The sign and size of each coefficient show
how the corresponding feature changes the log-odds of the outcome,
which makes it possible to identify the features that matter most
and to understand the relationships between the features and the
outcome variable.
Examples of Logistic Regression
- Spam filtering: Logistic regression can be used by email
providers to filter out spam emails.
- Fraud detection: Logistic regression can be used by credit card
companies to detect fraudulent transactions.
- Medical diagnosis: Logistic regression can be used to estimate
the risk of diseases such as cancer and heart disease, which
supports doctors in making a diagnosis.
- Marketing campaigns: Logistic regression can be used by
companies to identify customers who are most likely to
respond to their marketing campaigns.
IV. Conducting Multivariable Linear Regression
Conducting by using Python language:
To conduct the multivariable linear regression, we use the
Colleges and Universities dataset.
… 318⋅AcceptanceRate − 0.00013564986⋅ExpendituresStudent − 0.162764489⋅Top10HS
To calculate the multivariable linear regression, we will use the
Colleges and Universities dataset.
Figure 4. Calculating the multivariable linear regression.
B = (X^T X)^(-1) X^T Y
- X: the matrix of all the independent variables
- Y: the vector of the dependent variable
- X^T: the transpose of X
Before the calculation, we add a column of 1s to the data set to
represent the “input” value of the intercept term.
Finally, we can find the regression coefficient values (B) by using
the =MMULT((X^T X)^(-1), X^T Y) function, where the two arguments are
the cell ranges that hold (X^T X)^(-1) and X^T Y.
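The matrix formula can be cross-checked outside Excel. The following Python sketch is only an illustration: it uses a small synthetic data set (an assumption, not the Colleges and Universities data), and compares the normal-equation result with numpy's built-in least-squares solver.

```python
import numpy as np

# Synthetic data (assumption): 30 observations, 2 independent variables.
rng = np.random.default_rng(1)
n = 30
features = rng.normal(size=(n, 2))

# Add a column of 1s so that the first coefficient is the intercept term.
X = np.column_stack([np.ones(n), features])
Y = X @ np.array([3.0, 1.5, -2.0]) + rng.normal(scale=0.1, size=n)

# Normal equation: B = (X^T X)^(-1) X^T Y
B = np.linalg.inv(X.T @ X) @ (X.T @ Y)

# Cross-check with numpy's least-squares solver.
B_check, *_ = np.linalg.lstsq(X, Y, rcond=None)

print(B)
print(B_check)
```

Both approaches give the same coefficients here. For larger or nearly collinear data sets, a least-squares solver is usually preferred over an explicit matrix inverse because it is numerically more stable.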
- After clicking the OK button, we get the result below:
Interpreting the result:
+ R Square: 0.99, meaning that the model explains 99% of the variation
in the dependent variable.
+ Adjusted R Square: 0.98. After adjusting for the number of
independent variables, the model explains 98% of the variation in
the dependent variable.
+ The p-values of the intercept (1.46820887627434E-191 < 0.05), the
first independent variable (1.50543308588807E-236 < 0.05) and the
second independent variable (2.90054289426972E-245 < 0.05) are all
below 0.05. So, for each coefficient, we reject H0 (that the
coefficient equals zero) and conclude that the nonlinear regression
model fits the population.
+ So we have the model:
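log10(TotalFare) = -13.0512780830302 + 0.187337099*log10(Base Fare)
+ 3.095833417*log10(Tax)
(The intercept indicates base-10 logarithms: -13.0512780830302 * ln(10)
≈ -30.0517, which matches the intercept -30.051678 of the natural-log
model reported later in this section.)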
+ Conclusion: There is enough evidence to conclude that the nonlinear
regression function is suitable and that the two independent variables
(Base Fare and Tax) have an effect on the dependent variable (TotalFare).
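The two goodness-of-fit measures interpreted above can be computed directly from the observed and fitted values. The sketch below is an illustration with made-up numbers (the arrays y and fitted are assumptions, not values from the flight data): R Square is 1 - SSE / SST, and Adjusted R Square is 1 - (1 - R²)(n - 1) / (n - k - 1), where n is the number of observations and k the number of independent variables.

```python
import numpy as np

def r_squared(y, fitted):
    """Share of the variation in y that the fitted values explain."""
    sse = np.sum((y - fitted) ** 2)       # unexplained (error) variation
    sst = np.sum((y - np.mean(y)) ** 2)   # total variation
    return 1.0 - sse / sst

def adjusted_r_squared(r2, n, k):
    """R squared penalized for the number of independent variables k."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

# Toy numbers (assumption): observed values and values fitted by some model
# with k = 2 independent variables.
y = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
fitted = np.array([1.1, 1.9, 3.2, 3.8, 5.1, 5.9])

r2 = r_squared(y, fitted)
adj = adjusted_r_squared(r2, n=len(y), k=2)
print(f"R squared = {r2:.4f}, adjusted R squared = {adj:.4f}")
```

Adjusted R Square is never larger than R Square, and the gap between them shrinks as the number of observations grows relative to the number of independent variables.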
Interpreting the result:
+ R-squared: 0.9825, meaning that the model explains about 98% of the
variation in the dependent variable.
+ Adjusted R-squared: 0.9825. After adjusting for the number of
independent variables, the model still explains about 98% of the
variation in the dependent variable.
+ The p-values of the intercept (<2e-16), the first independent
variable (<2e-16) and the second independent variable (<2e-16) are
all below 0.05. So, for each coefficient, we reject H0 (that the
coefficient equals zero) and conclude that the nonlinear regression
model fits the population.
+ We have the model:
Ln(TotalFare) = -30.051678 + 0.187371*ln(BaseFare) + 3.095833*ln(Tax)
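Using Python for conducting multivariable nonlinear
regression
- To conduct multivariable nonlinear regression, we first
import the necessary libraries and read the csv file.
- Then we take the logarithm of the independent variables and the
dependent variable by using np.log10(). Note that np.log10() computes
base-10 logarithms, while np.log() computes the natural logarithm (ln).
The slope coefficients are the same in either base, but the intercept
differs by a factor of ln(10).
- In the next step, we put the three variables into arrays and reshape
them.
- Finally, we use statsmodels.api and model.summary() to show the
result: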
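The steps above can be sketched as follows. This is not the team's original code: the data are synthetic (the fare values and the power-law relationship are assumptions made up for the example), and numpy.linalg.lstsq is used in place of statsmodels so that the sketch only depends on numpy. It fits the log-log model once with natural logarithms and once with base-10 logarithms to show how the choice of base affects the coefficients.

```python
import numpy as np

# Synthetic stand-in for the flight data (assumption): a power-law relation
# TotalFare = 2 * BaseFare^0.6 * Tax^0.4 with multiplicative noise.
rng = np.random.default_rng(7)
n = 500
base_fare = rng.uniform(100, 1000, n)
tax = rng.uniform(50, 300, n)
total_fare = 2.0 * base_fare**0.6 * tax**0.4 * np.exp(rng.normal(scale=0.05, size=n))

def fit_loglog(log):
    """Fit log(TotalFare) = b0 + b1*log(BaseFare) + b2*log(Tax) by least squares."""
    X = np.column_stack([np.ones(n), log(base_fare), log(tax)])
    coef, *_ = np.linalg.lstsq(X, log(total_fare), rcond=None)
    return coef

b_ln = fit_loglog(np.log)       # natural logarithm
b_log10 = fit_loglog(np.log10)  # base-10 logarithm

print("ln:   ", b_ln)
print("log10:", b_log10)
```

The two fits give the same slopes, and the natural-log intercept equals the base-10 intercept multiplied by ln(10). This is why models fitted with different logarithm bases can report different intercepts for the same data.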
VI. Conducting Logistic Regression on Bank
Customer Data in Vietnam:
Rows of data: 42639.
Goal: predict whether a customer will subscribe to a term
deposit or not.
Explanation:
- Term_deposit = 1: deposit registered.
- Term_deposit = 0: deposit not registered.
- Housing = 1: has housing loan.
- Housing = 0: Does not have housing loan.
- Loan = 1: has personal loan.
- Loan = 0: Does not have personal loan.
Dependent variable: term_deposit.
Independent variables: age, housing, loan.
Logistic response function:
p = e^y / (1 + e^y), where y = B0 + B1*age + B2*housing + B3*loan
Analysis of the result:
- The logistic regression model is built to predict the
dependent variable “term_deposit” by using the
independent variables “age”, “housing” and “loan”.
- Intercept:
+ Coefficient value: -1.718146
+ Standard error: 0.073262
+ z value: -23.452
+ p-value: <2e-16, which is smaller than 0.05. Therefore the
intercept is statistically significant.
- Age:
+ Coefficient value: -0.003325
+ Standard error: 0.001626
+ z value: -2.045
+ p-value: 0.0408, which is smaller than 0.05. Therefore
“age” has a statistically significant effect on the dependent
variable “term_deposit”.
- Housing:
+ Coefficient value: -0.693244
+ Standard error: 0.034395
+ z value: -20.156
+ p-value: <2e-16, which is smaller than 0.05. Therefore
“housing” has a statistically significant effect on the dependent
variable “term_deposit”.
- Loan:
+ Coefficient value: -0.527837
+ Standard error: 0.053222
+ z value: -9.918
+ p-value: <2e-16, which is smaller than 0.05. Therefore
“loan” has a statistically significant effect on the dependent
variable “term_deposit”.
- All three coefficients are negative, so the model estimates that
older customers and customers with a housing loan or a personal loan
are less likely to register a term deposit.
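The fitted coefficients can be turned into a predicted probability for a single customer. The sketch below plugs the coefficient values reported above into the logistic response function p = e^y / (1 + e^y). The customer profile (40 years old, has a housing loan, no personal loan) is a hypothetical example, and the function name predict_probability is chosen for this illustration.

```python
import math

# Coefficient values reported in the analysis above.
coef = {
    "intercept": -1.718146,
    "age": -0.003325,
    "housing": -0.693244,
    "loan": -0.527837,
}

def predict_probability(age, housing, loan):
    """Probability of term_deposit = 1 under the fitted logistic model."""
    y = (coef["intercept"] + coef["age"] * age
         + coef["housing"] * housing + coef["loan"] * loan)
    return math.exp(y) / (1.0 + math.exp(y))

# Hypothetical customer: 40 years old, has a housing loan, no personal loan.
p = predict_probability(age=40, housing=1, loan=0)

# Odds ratio for housing: how the odds change when housing goes from 0 to 1.
odds_ratio_housing = math.exp(coef["housing"])

print(f"p = {p:.3f}, odds ratio (housing) = {odds_ratio_housing:.2f}")
```

For this hypothetical customer the predicted probability is about 0.07. The odds ratio for housing is about 0.50, meaning that, with age and loan held fixed, having a housing loan roughly halves the odds of registering a term deposit.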
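Step 2: Using the Data Analysis tool and its result:
Step 4: Calculating the probability of term_deposit.
(Probability = e^y / (1 + e^y), where y is the linear combination of
the coefficients and the independent variables.)
Step 8: Using the Solver add-in in Excel.
Set Objective to the cell that holds the sum of the log-likelihoods
(to be maximized), and set By Changing Variable Cells to the cells
that hold the regression coefficients.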
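What the Solver step does can be illustrated in Python: it searches for the coefficients that maximize the sum of the log-likelihoods. The sketch below is an illustration under stated assumptions. The data are synthetic, the function names are chosen for this example, and Newton's method is used as a stand-in for whatever algorithm Solver applies internally.

```python
import numpy as np

def log_likelihood(beta, X, y):
    """Sum of y*ln(p) + (1 - y)*ln(1 - p), with p = e^z / (1 + e^z), z = X @ beta."""
    z = X @ beta
    return np.sum(y * z - np.logaddexp(0.0, z))  # logaddexp(0, z) = ln(1 + e^z)

def fit_logistic(X, y, iterations=25):
    """Maximize the log-likelihood with Newton's method, starting from zeros."""
    beta = np.zeros(X.shape[1])
    for _ in range(iterations):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        gradient = X.T @ (y - p)                         # slope of the log-likelihood
        hessian = X.T @ (X * (p * (1.0 - p))[:, None])   # curvature (negated)
        beta = beta + np.linalg.solve(hessian, gradient) # Newton step
    return beta

# Synthetic data (assumption): intercept, one continuous and one 0/1 variable.
rng = np.random.default_rng(3)
n = 2000
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.integers(0, 2, n)])
true_beta = np.array([-1.0, 0.8, -0.7])
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-(X @ true_beta)))).astype(float)

beta_hat = fit_logistic(X, y)
print("coefficients:", beta_hat)
print("log-likelihood at start:", log_likelihood(np.zeros(3), X, y))
print("log-likelihood at optimum:", log_likelihood(beta_hat, X, y))
```

The log-likelihood at the fitted coefficients is higher than at the starting point, which is the same criterion Solver uses when it is asked to maximize the objective cell.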
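Conducting by using Python language:
Step 1: Importing libraries.
from sklearn.linear_model import LogisticRegression
import numpy as np
import pandas as pd
Step 2: Reading the file.
df = pd.read_csv("BankCustomerData.csv")
print(df.head())  # preview the first rows
Step 3: Splitting the independent and dependent arrays.
# x: one row per customer with the columns age, housing, loan
x = np.array(df[["age", "housing", "loan"]])
# y: 1-D array of labels, the shape scikit-learn expects
y = np.array(df["term_deposit"]).ravel()
Step 4: Using LogisticRegression() and fitting x, y.
# scikit-learn applies L2 regularization by default (C=1.0), so the
# coefficients may differ slightly from an unregularized fit.
model = LogisticRegression()
model.fit(x, y)
Step 5: Result.
print(model.intercept_, model.coef_)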
VII. Work assignment
Members: Trường Vũ | Tăng Quốc Tuấn | Lương Anh Tuấn
Questions
Task 1
Multivariable Linear Regression: X
Multivariable Nonlinear Regression: X
Logistic Regression: X
Task 2
Using MS Excel, R language and Python language to perform
Multivariable Linear Regression
R X
Excel X
Python X
Using MS Excel, R language and Python language to perform
Multivariable Nonlinear Regression
R X
Excel X
Python X
Using MS Excel, R language and Python language to perform Logistic
Regression
R X
Excel X
Python X
VIII. References:
James R. Evans, Business Analytics: Methods, Models, and Decisions,
1st edition (multiple linear regression definition, chapter 8,
page 276)
[1] [2] https://www.imsl.com/blog/what-is-regression-model
[3] https://stat.ethz.ch/~stahel/courses/cheming/nlreg10E.pdf
[4] [5] https://www.wallstreetmojo.com/nonlinear-regression/
IX. Source:
The R, Excel and Python files are attached at the following link:
https://drive.google.com/drive/folders/1ErzQgCAyIcRY_tFHmVYHvtlIu2o0stN3?usp=sharing
Dataset links:
https://www.kaggle.com/datasets/tomculihiddleston/bank-customer-data-in-vietnam
https://www.kaggle.com/datasets/dangquangvu/vietnam-flights-in-1st-quarter-of-2020