Download as docx, pdf, or txt
Download as docx, pdf, or txt
You are on page 1of 29

VIETNAM NATIONAL UNIVERSITY

UNIVERSITY OF INFORMATION TECHNOLOGY


INFORMATION SYSTEMS FACULTY

LAB 3 REPORT
SUBJECT: DATA ANALYSIS IN BUSINESS

LECTURER: Assoc. Prof. Dr. Nguyễn Đình Thuân


TA: Nguyễn Minh Nhựt
Class: STAT3013.O12.CTTT
Team 9
Đỗ Lập Trường Vũ – 21522795
Tăng Quốc Tuấn – 21522761
Lương Anh Tuấn – 21522752
INDEX

I. Multivariable Linear Regression:.........................................................................................2


What is Multivariable Linear Regression ?.............................................................................2
How Multivariable Linear Regression work ?.........................................................................2
Why is Multivariable Linear Regression useful ?...................................................................3
Example of Multivariable Linear Regression:.........................................................................3
II. Multivariable Nonlinear Regression.....................................................................................4
What is Multivariable Nonlinear Regression ?.......................................................................4
How does Multivariable Nonlinear Regression work? :........................................................4
Why Multivariable Nonlinear Regression is used ?...............................................................5
Example of Multivariable Nonlinear Regression....................................................................6
III. Logistic Regression...........................................................................................................6
What is Logistic Regression ?.................................................................................................6
How Logistic Regression work ?.............................................................................................7
Why is Logistic Regression useful ?.......................................................................................8
Example of Logistic Regression..............................................................................................8
IV. Conducting Multivariable Linear Regression..................................................................8
Conducting by using Python Language :...............................................................................9
Conducting by using R Language :.......................................................................................10
Conducting by using Excel Language :................................................................................12
V. Conducting Multivariable Non-Linear Regression...........................................................15
Conducting multivariable using Ms Excel............................................................................15
Using R to perform the multivariable nonlinear regression................................................17
Using python for conducting multivariable nonlinear regression.....................................18
VI. Conducting Logistic Regression on Bank Customer Data in Vietnam:......................20
Conducting by using R Language :.......................................................................................20
Conducting by using Excel Language :................................................................................22
Conducting by using Python Language :.............................................................................26
VII. Work assignment.............................................................................................................27
VIII. Reference:.........................................................................................................................27
IX. Source:..............................................................................................................................28

1
LAB 3
I. Multivariable Linear Regression:
Requirement :
Explanation (What, How and Why) and example of Multivariable
Linear Regression.
What is Multivariable Linear Regression ?
Multivariable Linear Regression is a linear regression model with
more than one independent variable.
How Multivariable Linear Regression work ?
Multivariable Linear Regression works by fitting a linear equation
to the data. The equation is of the form:

Where:
+ Y is the dependent variable.
+ X1,…., Xk are the independent (explanatory) variables.
+ B0 is the intercept term.
+ B1,…., Bk are the regression coefficients for the
independent variables.
+ e is the error term
- The regression coefficients (called partial regression coefficients)
indicate the anticipated shift in the dependent variable when one
of the independent variables is increased by one unit, keeping all
other independent variables unchanged.
- The ANOVA test for significance of the entire model is:

2
+ The null (H0) hypothesis indicates that no linear
relationship exists between the dependent and any of the
independent variables.
+ The alternative hypothesis (H1) indicates that the
dependent variable has a linear relationship with at least one
independent variable.
Why is Multivariable Linear Regression useful ?
- Multiple linear regression (MLR) is useful for a variety of reasons.
It can be used to:
 Identify the relationship between multiple independent
variables and a dependent variable
 Quantify the strength of the relationships between the
variables
 Make predictions about the value of the dependent variable
 Control for confounding variables
- MLR is used in a wide variety of fields, including business,
economics, finance, and science. Some examples of how MLR is
used include:
 Predicting the price of a house based on its square footage,
number of bedrooms, and location
 Predicting the number of customers who will visit a store on a
given day based on the day of the week, the time of year, and
the weather forecast
 Predicting the risk of a customer defaulting on a loan based on
their credit score, income, and employment history
Example of Multivariable Linear Regression:
- Suppose we are interested in predicting the price of a house. We
have data on the following variables:
 Price (dependent variable)
 Square footage
 Number of bedrooms
 Number of bathrooms
 Age of the house
3
- We can use MLR to build a model that predicts the price of a
house based on these variables. The model would be of the form:
Price = b0 + b1SquareFootage + b2NumBedrooms +
b3NumBathrooms + b4Age + e
- We can then use the model to predict the price of any house by
plugging in the values for the independent variables. For
example, if we have a house with 2,000 square feet, 3 bedrooms,
2 bathrooms, and is 10 years old, we can use the model to
predict the price of the house.

II. Multivariable Nonlinear Regression


What is Multivariable Nonlinear Regression ?
- First we need to know the definition of Regression Model :
+ A regression model provides a function that describes the
relationship between one or more independent variables and a
response, dependent, or target variable.[1]
+ For example, the relationship between height and weight may
be described by a linear regression model. A regression analysis
is the basis for many types of prediction and for determining the
effects on target variables[2]
- So what is Multivariable Nonlinear Regression :
Multivariable Nonlinear Regression is a form of regression
analysis in which observational data are modeled by a function
which is a nonlinear combination of the model parameters and
depends on more independent variables. The data are fitted by a
method of successive approximations.

How does Multivariable Nonlinear Regression work? :


Regression studies the relationship between a variable of
interest Y and one or more explanatory or predictor variables x
^(j). The general model is

4
Here, h is an appropriate function that depends on the
explanatory variables and parameters, that we want to
summarize with vectors and

The unstructured deviations from the


function h are described via the random errors Ei. The normal
distribution is assumed for the distribution of this random error, so

, independent.[3]
Why Multivariable Nonlinear Regression is used ?

The nonlinear regression models are predominantly used for


prediction, financial modeling and forecasting purposes. The
nonlinear model is used in many fields and sectors like insurance,
agriculture, finance, investing, machine learning AI, and
understanding broader markets. Let’s look into some of the
significant applications:

+ Since most biological processes are nonlinear in nature, we can


find nonlinear model applications in forestry research. A simple
power function to relate tree volume or weight in relation to its
diameter or height is an example.
+ The use of a nonlinear model in developing a wide-range
colorless gas, HCFC-22 formulation is an example from the field
of Chemistry.
+ In research and development, it is used in the process of
formulation of the problem and deriving statistical solutions to the
calibration problem.
+ It is used in the insurance domain. For example, its usage can
be seen in the computation of IBNR reserves.

5
+ It is of great importance in agricultural research. Because many
crops and soil processes are better captured by nonlinear than
linear models.[4]

Example of Multivariable Nonlinear Regression


- Gold investment is an effective hedge against inflation and
currency depreciation in many countries. Hence analyzing the
gold price movement is of great importance.

- Even if the gold prices are stable to a great extent, they are
affected by inflation, crude oil, etc. But the important one is the
impact of inflation, and at the same time, gold prices can control
the inflation instability. Therefore, a deep understanding of the
relationship between inflation and gold price is a prerequisite.

- In this case, Multivariable Nonlinear Regression analysis is


employed for analyzing data. The dependent variable is gold
price, and the independent variable is inflation. The regression
analysis results revealed that inflation impacts the gold price. The
possible explanation why gold does not always move with CPI
increases is that gold is considerably affected by large spikes in
inflation. Still, small increases in inflation or dropping inflation
have little impact on gold’s price upward or downward trajectory.

III. Logistic Regression


What is Logistic Regression ?
- Logistic regression is a variation of ordinary regression in
which the dependent variable is categorical. The independent
variables may be continuous or categorical, as in the case of
ordinary linear regression.[ Business Analytics second edition
by James Evans - p.354]

6
- logistic regression seeks to predict the probability that the
output variable will fall into a category based on the values of
the independent (predictor) variables. This probability is used
to classify an observation into a category.[ Business Analytics
second edition by James Evans - p.354]
How Logistic Regression work ?
- Logistic regression is generally used when the dependent
variable is binary—that is, takes on two values, 0 or 1.
- You may recall that a multiple linear regression model has the
form:

- In logistic regression, we use a different dependent variable,


called the logit, which is the natural logarithm of odds, which
is also the ratio p/(1 - p). Thus, the form of a logistic
regression model is:

- where p is the probability that the dependent variable Y = 1,


and X1, X2,…, Xk are the independent variables (predictors).
The parameters b0, b1, b2,…, bk are the unknown
regression coefficients.
- The logit is continuous over the range from - ∞ to +∞ and from
a linear function of the predictor variables. The values of this
predictor variable are then transformed into probabilities by a
logistic function called Logistics Response function:

7
- This transform ensures that the p stays between 0 and 1.
Why is Logistic Regression useful ?
- It is well-suited for binary classification tasks. Logistic
regression is designed to predict the probability of a binary
outcome, such as yes/no, true/false, or spam/not spam.
- It is simple to understand and implement. The logistic
regression algorithm is relatively simple to understand and
implement, even for people with no prior machine learning
experience.
- It is relatively robust to overfitting. Overfitting is a problem
that occurs when a machine learning model learns the
training data too well and is unable to generalize to new data.
Logistic regression is relatively robust to overfitting, which
means that it is less likely to have this problem.
- It is interpretable. Logistic regression models are
interpretable, which means that it is possible to identify the
features that are most important for predicting the outcome
variable. This can be useful for understanding the
relationships between the features and the outcome variable.
Example of Logistic Regression
- Spam filtering: Logistic regression is used by email
providers to filter out spam emails.
- Fraud detection: Logistic regression is used by credit card
companies to detect fraudulent transactions.
- Medical diagnosis: Logistic regression is used by doctors to
diagnose diseases such as cancer and heart disease.
- Marketing campaigns: Logistic regression is used by
companies to identify customers who are most likely to
respond to their marketing campaigns.

8
IV. Conducting Multivariable Linear Regression
 Conducting by using Python Language :
To conduct the multivariable Linear Regression, we use
Colleges and Universities dataset.

Figure 1. Data set.

We use the LinearRegression function to perform


Multivariable Linear Regression. In this context, 'fit x'
represents the independent variables, and 'fit y'
represents the dependent variable.

Figure 2. calculating the MultLinearRegression.

Base on the result, we have the equation:


Graduation=17.92095587+0.07200628⋅MedianSAT−0.248592

9
318⋅AcceptanceRate−0.00013564986⋅ExpendituresStudent−0.
162764489⋅Top10HS

Interpreting the results:


After performing Multivariable Linear Regression, we can see
the first result is Intercept holding value 17.92095587 indicates
the value of dependent variable when all of independent
variables is zero. So it also shows that the predicted
percentage of graduation is 17.92095587% when Median SAT,
Acceptence rate, Expenditures/Student, Top 10 HS are zero.

In the next result, we can see 4 coefficients of independent


variables, which are Median SAT, Acceptence Rate,
Expenditures/Student, Top 10 HS respectively. The first is
coefficient for Median SAT, which is 7.20062848e-02. It shows
that for every one-unit increase in SAT score, the predicted
graduation rate increases by 0.07200628 units. The second is
coefficient for Acceptence Rate, which is -2.48592318e-01.
This coefficient indicates that every one-unit increase in
acceptence rate, the predicted graduation rate decreases by
0.248592318 units.The third is coefficient for
Expenditures/Student, which is -1.35649860e-04. It shows that
when increasing one-unit in Expenditures/Student only
decreases graduation percentage by 0.0001356 units. The
final is coefficient for Top 10 HS, which is -1.62764489e-01
indicates that the percentage of graduation will be decreased
by 0.16276449 units after increasing one-unit in Top 10 HS.
Finally, The value of R (0.5344260406481331) indicates
that 53,44260406481331% of the variation in the dependent
variable is explained by these independent variables.The
remaining ( 46,55739593518669) of the variation in
Graduation Percentage cannot be explained by this model.
 Conducting by using R Language :

10
To calculate multivariable linear regression we will use colleges
and universities dataset.

Figure 3. Data Set.

- We use the command reg1 = lm(Graduation ~ Median.SAT +


Acceptance.Rate + Expenditures.Student +Top.10.HS ) and
the summary(reg1) command to calculate multivariable linear
regression

11
Figure 4. Calculating the multivariable linear regression.

after using the function reg1 = lm(Graduation ~ Median.SAT +


Acceptance.Rate + Expenditures.Student +Top.10.HS )
summary(reg1) we get the results as above table of
Intercept,Median.SAT,Acceptance.Rate,Expenditures.Student,
Top.10.HS
 Conducting by using Excel Language :
We can conduct the Multivariable Linear Regression with data
file: Colleges and Universities by using the equation below:
Y = B0 + B1X1 + B2X2 + B3X3 +B4X4
With:
- Y: The value of the Graduation% value.
- B0: The intercept value (the predicted percentage of when
Median SAT, Acceptence rate, Expenditures/Student, Top 10
HS are zero)
- B1: The regression coefficient of Median SAT
- B2: The regression coefficient of Acceptence rate
- B3: The regression coefficient of Expenditures/Student
- B4: The regression coefficient of Top 10% HS
12
- X1: The value of Median SAT
- X2: The value of Acceptence rate
- X3: The value of Expenditures/Student
- X4: The value of Top 10% HS
And we can find the The regression coefficient (B) of all the
independent variables with the matrix formula:

B= ( XT.X)-1 XT.Y
- X: the matrix of all the independent variables
- Y: the matrix of the dependent variables
Before calculation, we add a column full of 1 in the data set to
represent the “input” value of the intercept term.

We then use =TRANSPOSE(array of indepent var) function to find


XT.
Next we calculate (XT.X) by using =MMULT(array of XT,array of
indepent var) function to find (XT.X) and reverse the result with
=MINVERSE(result of (XT.X)) function.

The (XT.Y) can also be calculated by using the =MMULT(array of XT,


array of dependent var) function.

13
Finally, we can find each regression coefficient values (B) by using
=MMUL(( XT.X)-1 , (XT.Y)) function.

Figure 5. Result of calculation through EXCEL.

With all the regression coefficient values (B) we just calculated, we


have the equation of Graduation% of the Colleges and Universities
Data:
Y = 17,92 + 0,072.X1 – 24,859.X2 – 0,000136.X3 –
0,163.X4
14
V. Conducting Multivariable Non-Linear
Regression
- Problem statement: With 95% confidence, is it possible to find out the
relationship between Base and Tax and TotalFare of the flight? Data is
presented below:

 Independent variables: Base and Tax


 Dependent variable: Total Fare
- Requirements: The impact of Base and Tax on Total Fare
- We have the sample of regression model:
Total Fare = B0 + B1*Base Fare + B2*Tax
- Base on a confidence level of 95%, the significance level is alpha =
0.05
- Hypothesis H0: Nonlinear regression model is not suitable.
- Hypothesis H1: Nonlinear regression model is suitable.
Conducting multivariable using Ms Excel
- In the first step we calculate the logarithm of Base Fare , Tax , which
are independent variables and Total Fare ( dependent variable ) by
using log function.
- Next we use data analyst tool, which placed in Data and click to
Regression, then we enter the input Y (logarithm of Total Fare) and X
( logarithm of Base Fare and Tax)

15
- After clicking to OK button , we get the result below :

Interpret result:
+ R square : 0.99 , its mean that 99% data fit with the model.
+ Adjusted R Square : 0.98, The model explains 98% of the variation in
the dependent variable.

16
+ P-value of Intercept ( 1.46820887627434E-191 < 0.05 ) and first
independent variable ( 1.50543308588807E-236 < 0.05 ) and second
independent variable ( 2.90054289426972E-245 < 0.05 ). So we reject
H0 and the non linear regression model fit with population.
+ So we have the model:
Ln(TotalFare) = -13.0512780830302 + 0.187337099*ln(Base Fare)
+3.095833417*ln(Tax)
+ Conclusion : It is enough evidence to conclude that the nonlinear
regression function is suitable or two independent variables ( Base Fare
and Tax ) effect on the dependent variable ( TotalFare )

Using R to perform the multivariable nonlinear regression


- For conducting the multivariable nonlinear regression , we logarithm
the X and Y, so the model will be :

Interpreting result:
+ R-squared : 0.9825 , meaning that there is 98% of data matching
with the model
+ Adjusted Rsquared: 0.9825, The model explains 98% of the
variation in the dependent variable.

17
+ P-value of Intercept ( 2e-16 < 0.05 ) and first independent variable (
2e-16<0.05) and second independent variable ( 2e-16 < 0.05 ).So we
reject H0 and the non linear regression model fit with population.
+ We have the model:
Ln(TotalFare) = -30.051678 + 0.187371*ln(BaseFare) + 3.095833*ln(Tax)
Using python for conducting multivariable nonlinear
regression
- For condcuting multivariable nonlinear regression , first we need to
import some necessary libraries and read csv file.
- Then we logarithm independent variables and dependent variable by
using np.log10()

- In the next step, we put three variables to array and reshape them.

- Then we use the LinearRegression() function from sklearn library and


fit the model with two variables X and Y. Next we show the
intercept,coefficient and score of model by using mode.intercept_,
model.coef, model.score()

18
- Finally, we use the statsmodel.api and model.summary() to show the
result :

- Now we have the model:


Ln(TotalFare) = -13.0512780830302 + 0.187337099*ln(Base
Fare) +3.095833417*ln(Tax)

19
VI. Conducting Logistic Regression on Bank
Customer Data in Vietnam:
Rows of data: 42639 rows.
Goal: predict whether a customer will subscribed a term
deposit or not.
Explanation:
- Term_deposit = 1: deposit registered.
- Term_deposit = 0: deposit not registered.
- Housing = 1: has housing loan.
- Housing = 0: Does not have housing loan.
- Loan = 1: has personal loan.
- Loan = 0: Does not have personal loan.
Dependent variable: term_deposit.
Independent variables: age ,housing ,loan
Logistics Response function:

term_deposit = 1/(1 + e–(B0 + B1age + B2housing + B3loan))


 Conducting by using R Language :
Step 1: Reading file.
df <-
read.csv('C:/Users/Administrator/Desktop/BankCustomerDat
a.csv')
Step 2: Use the glm() function to analyze logistic regression.
model <- glm(term_deposit~age+housing+loan, data = df,
family = "binomial")
Step3: Result.

20
Analysis of the result:
- The Logistic regression model are built to predict the
dependent variable “term_deposit” by using the
independent variables “age”, “ housing” and “loan”.
- Intercept:
+The Coefficient value: -1.718146
+Standard Error: 0.073262
+z value: -23.452
+p-value: <2e-16, which is smaller than 0.05.Therefore
“intercept” has a significant impact on predicting the dependent
value “term_deposit”.
- Age:
+The Coefficient value: -0.003325
+Standard Error: 0.001626
+z value: -2.045

21
+p-value: 0.0408, which is smaller than 0.05.Therefore
“age” has a significant impact on predicting the dependent value
“term_deposit”.
- Housing:
+The Coefficient value: -0.693244
+Standard Error: 0.034395
+z value: -20.156
+p-value: <2e-16, which is smaller than 0.05.Therefore
“housing” has a significant impact on predicting the dependent
value “term_deposit”.
- Loan:
+The Coefficient value: -0.527837
+Standard Error: 0.053222
+z value: -9.918
+p-value: <2e-16
, which is smaller than 0.05.Therefore “intercept” has a significant
impact on predicting the dependent value “term_deposit”.

In conclusion, the logistic regression model has shown that


“age”, “housing” and “loan” have a significant impact on predicting
the dependent value “term_deposit”.
Finally, we have the Logistic regression function of Bank
Customer Data in Vietnam :

 Conducting by using Excel Language :


Step 1: grouping Independent variables and dependent
variable.

22
Step 2: using Data Analysis tool and result:

The coefficients here is the coefficients of LINEAR REGRESSION.


Step 3: calculating the “term_deposit” with Linear
Regression formula.(term_deposit = B0 + B1age + B2housing
+ B3loan)

23
Step 4: Calculating the Probability of term_deposit.
(Probability = ey/(1+ey))

Step 5: Calculating the likelihood of term_deposit.


(IF(term_deposit = 1 then Probability ,else 1- Probability)

Step 6: Calculating the log of likelihood (LN(likelihood))

Step 7: Calculating the sum of the log of likelihood

24
Step 8: Using Solver tool from Excel Analysis ToolPak
Set Object is sum of log-likelihood and By changing variable cells
is the Linear regression coefficients.

Step 9: The result of Logistic Regression Coefficients.

25
 Conducting by using Python Language :
Step 1: Importing libaries.
from sklearn.linear_model import LogisticRegression
import numpy as np
import pandas as pd
Step 2: Reading file.
df = pd.read_csv("BankCustomerData.csv")
df
Step 3: Splitting independent and dependent array.
x, y = \
np.array(df[["age", "housing", "loan"]]).reshape((-1,3)), \
np.array(df["term_deposit"]).reshape((-1,1))
Step 4: using LogisticRegression() and fit X,Y
model = LogisticRegression()
model.fit(x,y)
Step 5: Result

26
VII. Work assignment
Members Trường Vũ Tăng Quốc Lương Anh
Tuấn Tuấn
Questions
Task 1
Multivariable Linear X
Regression.
Multivariable X
Nonlinear
Regression
Logistic Regression X
Task 2
Using MS Excel, R language and Python language to perform
Multivariable Linear Regression
R X
Excel X
Python X
Using MS Excel, R language and Python language to perform
Multivariable Nonlinear Regression
R X
Excel X
Python X
Using MS Excel, R language and Python language perform Logistic
Regression
R X
Excel X
Python X

VIII.Reference:
Business Analytics: Method, Models, and Decisions, 1st edition
James R.Evans ( multiple linear regression definition chapter 8
page 276)
[1] [2] https://www.imsl.com/blog/what-is-regression-
model

27
[3]
https://stat.ethz.ch/~stahel/courses/cheming/nlreg10E.
pdf\
[4][5] https://www.wallstreetmojo.com/nonlinear-
regression/

IX. Source:
The R , Excel, Python files are attached to the following link :
https://drive.google.com/drive/folders/
1ErzQgCAyIcRY_tFHmVYHvtlIu2o0stN3?usp=sharing
Dataset link:
https://www.kaggle.com/datasets/tomculihiddleston/bank-
customer-data-in-vietnam
https://www.kaggle.com/datasets/dangquangvu/vietnam-flights-in-1st-
quarter-of-2020

28

You might also like