
26/10/2023

By: Karla Tamayo.

Final Project: Data analysis using simple linear regression.

Simple linear regression is the most widely used and simplest form of regression
analysis. Its aim is to study the effect of an independent variable on a single
variable that depends on it, or that we have at least theoretical grounds to consider
dependent. Using the simple linear regression equation, estimates can then be made
from the observed data.
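The least-squares fit behind that equation can be sketched directly from its
definition. A minimal illustration (in Python rather than R, with invented data;
not part of the original exercise):

```python
# Least-squares estimates for simple linear regression: y = b0 + b1*x.
# Hypothetical data: hours studied (x) vs. exam score (y).
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Sxy: sum of cross-products of deviations; Sxx: sum of squared x deviations.
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
sxx = sum((xi - mean_x) ** 2 for xi in x)

b1 = sxy / sxx             # slope: change in y per unit of x
b0 = mean_y - b1 * mean_x  # intercept: fitted y when x = 0

print(b0, b1)  # approximately 0.05 and 1.99
```

The estimate for any new x is then simply b0 + b1*x.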

Types of linear regression:

 The first type is simple linear regression, in which only a single predictor is used.
For example, it can be used to predict fatal traffic accidents in a country: the
response variable Y would be the number of fatal accidents, and the country's
population would be the predictor variable X.
 Multiple linear regression allows the creation of models that use several
predictors together to explain the response Y.
 With multivariate linear regression it is possible to build models that respond to
several variables at once; in this case there are multiple Y's, and several
formulas apply when expressing it. For instance, the incidence of influenza could
be estimated in nine regions of the United States (the response variables Y) based
on the week of the year (the predictor variable X).
 Machine-learning systems extend these ideas to handle much larger amounts of
data, something needed, for example, when piloting a vehicle: the road presents a
large number of variables that must be taken into account and predicted in order
to, for example, avoid a crash.
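The multiple-regression idea above can be sketched with a small example. This is a
hedged illustration in Python with invented data (two predictors, not part of the
original text), fitting the coefficients by least squares:

```python
import numpy as np

# Hypothetical data: response y modeled from two predictors x1 and x2.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
# y is generated exactly as 1 + 2*x1 + 0.5*x2 so the fit is easy to verify.
y = 1.0 + 2.0 * x1 + 0.5 * x2

# Design matrix with a column of ones for the intercept term.
X = np.column_stack([np.ones_like(x1), x1, x2])

# Least-squares solution for the coefficients (b0, b1, b2).
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coef)  # approximately [1.0, 2.0, 0.5]
```

In R the equivalent model would be fitted with lm(y ~ x1 + x2).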

Delving a little deeper into the subject, I chose simple linear regression and
multiple linear regression as examples, the first of which I detail in the following example.

Example 1: Study of computing performance as a function of the number of buffers.

Linear regression model. Software R.


The following table shows the variable Y, the performance of a computer system,
against the regressor variable X1, the number of buffers:
X1 5 10 15 20 25 5 10 15 20 25 5 10 15 20 25
Y 9.6 20.1 29.9 39.1 50.0 9.6 19.4 29.7 40.3 49.9 10.7 21.3 30.7 41.8 51.2

From the table above, we want to fit the variable Y as a function of the variable X1.

1. Reproduce the results obtained with R's lm() function, to which we will
add the graph of residuals versus estimated values.
2. Comment on the following results:
1. Regression line of the performance of the computer system against
the number of buffers and interpretation of the coefficients.
2. Scatter plot with the fit to the line.
3. Testing hypotheses about the slope of the line.
4. Coefficient of determination and linear correlation coefficient.
5. Graph of residuals vs. estimated values.

Solutions
a) To begin with, let's enter the data. Here's how:
> x1 <- c(5, 10, 15, 20, 25, 5, 10, 15, 20, 25, 5, 10, 15, 20, 25)
> y <- c(9.6, 20.1, 29.9, 39.1, 50.0, 9.6, 19.4, 29.7, 40.3, 49.9, 10.7, 21.3, 30.7, 41.8, 51.2)

Now we can calculate the parameters using the lm() function of R. This function
estimates linear models, which is what we want to do.

> lm_11 <- lm(y ~ x1)
> summary(lm_11)

Call:
lm(formula = y ~ x1)

Residuals:
Min 1Q Median 3Q Max
-1.2133 -0.4700 -0.3200 0.5733 1.4867

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.06000 0.47687 -0.126 0.902
x1 2.01867 0.02876 70.198 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.7875 on 13 degrees of freedom
Multiple R-squared: 0.9974, Adjusted R-squared: 0.9972
F-statistic: 4928 on 1 and 13 DF, p-value: < 2.2e-16

Therefore, the regression line of system performance (Y) vs. number of buffers
(X1) is Y = -0.06 + 2.0187 X1.
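As a cross-check, the coefficients reported by lm() can be reproduced outside R.
A sketch in Python (numpy.polyfit on the same data from the table; this mirrors,
rather than reproduces, the R workflow):

```python
import numpy as np

# Data from the table: number of buffers (x1) and system performance (y).
x1 = [5, 10, 15, 20, 25, 5, 10, 15, 20, 25, 5, 10, 15, 20, 25]
y = [9.6, 20.1, 29.9, 39.1, 50.0, 9.6, 19.4, 29.7, 40.3, 49.9,
     10.7, 21.3, 30.7, 41.8, 51.2]

# Degree-1 polynomial fit: returns (slope, intercept), matching lm()'s estimates.
slope, intercept = np.polyfit(x1, y, 1)
print(round(slope, 5), round(intercept, 5))  # 2.01867 -0.06
```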
We can plot the residuals versus the fitted values using these R
instructions:
> plot(residuals(lm_11) ~ fitted.values(lm_11), main = "Residual plot",
xlab = "Fitted values", ylab = "Residuals")
> abline(0, 0) # draws a horizontal line at 0
[Figure: residual plot, residuals vs. fitted values, scattered around 0]

 The regression line of computing performance (Y) versus number of buffers (X1)
is: Y = -0.06 + 2.0187 X1.
 Interpretation of the coefficients:
 Slope of the line (2.02): this is the increase in computing performance for
each additional buffer.
 Intercept (-0.06): it makes little sense to interpret it in this case, since it
would represent the performance of the system when we don't have any
buffers.
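With the fitted coefficients, point predictions follow directly from the line. A
minimal sketch in Python (the input of 12 buffers is purely illustrative, not part
of the original exercise):

```python
# Coefficients from the fitted line Y = -0.06 + 2.01867 * X1.
b0, b1 = -0.06, 2.01867

def predict_performance(buffers):
    """Estimated system performance for a given number of buffers."""
    return b0 + b1 * buffers

print(predict_performance(12))  # about 24.16
```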

To get the scatter plot with the fit to the line, with R we have to do the following:

> plot(y~x1)
> abline(lm_11)
[Figure: scatter plot of y vs. x1 with the fitted regression line]

b.3)
The hypothesis test on the slope can be carried out automatically, making use of
the summary(lm_11) command in R. The output is:

> summary(lm_11)

Call:
lm(formula = y ~ x1)

Residuals:
Min 1Q Median 3Q Max
-1.2133 -0.4700 -0.3200 0.5733 1.4867

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.06000 0.47687 -0.126 0.902
x1 2.01867 0.02876 70.198 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.7875 on 13 degrees of freedom


Multiple R-squared: 0.9974, Adjusted R-squared: 0.9972
F-statistic: 4928 on 1 and 13 DF, p-value: < 2.2e-16

The key value is the p-value for the slope (the x1 row, Pr(>|t|) column). Since
this p-value (<2e-16) is very close to 0, we reject the null hypothesis at any
level of significance (that's why *** appears, see legend) and accept that the
slope of the line is significant, i.e. non-zero.
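The t-statistic and residual standard error behind that conclusion can be
reproduced by hand. A sketch in Python, assuming the standard formulas
t = b1 / SE(b1) and SE(b1) = s / sqrt(Sxx):

```python
import math

# Data from the table (x1 pattern 5..25 repeated three times).
x1 = [5, 10, 15, 20, 25] * 3
y = [9.6, 20.1, 29.9, 39.1, 50.0, 9.6, 19.4, 29.7, 40.3, 49.9,
     10.7, 21.3, 30.7, 41.8, 51.2]

n = len(x1)
mx = sum(x1) / n
my = sum(y) / n
sxx = sum((xi - mx) ** 2 for xi in x1)
sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x1, y))

b1 = sxy / sxx       # slope
b0 = my - b1 * mx    # intercept

# Residual standard error with n - 2 degrees of freedom.
rss = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x1, y))
s = math.sqrt(rss / (n - 2))

# t-statistic for the null hypothesis "slope = 0".
t = b1 / (s / math.sqrt(sxx))
print(round(s, 4), round(t, 3))  # 0.7875 70.198
```

Both values match the summary(lm_11) output above.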

b.4)
The coefficient of determination is R-squared = 99.74%, that is, 0.9974, and the
linear correlation coefficient is therefore r = √0.9974 ≈ 0.9987.
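These two coefficients can likewise be recomputed from their definitions,
R² = Sxy² / (Sxx · Syy) and r = √R². A sketch in Python using the table's data:

```python
import math

# Data from the table.
x1 = [5, 10, 15, 20, 25] * 3
y = [9.6, 20.1, 29.9, 39.1, 50.0, 9.6, 19.4, 29.7, 40.3, 49.9,
     10.7, 21.3, 30.7, 41.8, 51.2]

n = len(x1)
mx, my = sum(x1) / n, sum(y) / n
sxx = sum((xi - mx) ** 2 for xi in x1)
syy = sum((yi - my) ** 2 for yi in y)
sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x1, y))

r2 = sxy ** 2 / (sxx * syy)  # coefficient of determination
r = math.sqrt(r2)            # linear correlation coefficient
print(round(r2, 4), round(r, 4))  # 0.9974 0.9987
```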

b.5)
 The graph of residuals vs. fitted values shows no structure whatsoever and is
scattered around the value 0.
In conclusion, the regression model is a good model in this case.
