Professional Documents
Culture Documents
XSTK Project PDF
XSTK Project PDF
UNIVERSITY OF TECHNOLOGY
Semester 211
Project
Contents
1 Introduction 3
5 Conclusion 25
Assignment for Probability and Statistics – Academic year 2020 – 2021 Page 1/25
University of Technology, Ho Chi Minh City
Assignment for Probability and Statistics – Academic year 2020 – 2021 Page 2/25
University of Technology, Ho Chi Minh City
1 Introduction
The field of statistics is the science of learning from data. Statistical knowledge
helps you use the proper methods to collect the data, employ the correct analyses,
and effectively present the results. Statistics is a crucial process behind how we make
discoveries in science, make decisions based on data, and make predictions. Statistics
allows you to understand a subject much more deeply.
The statistics help you predict the possible outcomes so you can Work out a
solution if it is bad or or optimize it if it is good.
Assignment for Probability and Statistics – Academic year 2020 – 2021 Page 3/25
University of Technology, Ho Chi Minh City
• Dependent Variable: This is the main factor that you’re trying to understand
or predict.
• Independent Variables: These are the factors that you hypothesize have an
impact on your dependent variable.
In our application training example above, attendees’ satisfaction with the event
is our dependent variable. The topics covered, length of sessions, food provided, and
the cost of a ticket are our independent variables.
Assignment for Probability and Statistics – Academic year 2020 – 2021 Page 4/25
University of Technology, Ho Chi Minh City
• Univariate visualization of each field in the raw dataset, with summary statistics.
• Bivariate visualizations and summary statistics that allow you to assess the
relationship between each variable in the dataset and the target variable you’re
looking at.
Assignment for Probability and Statistics – Academic year 2020 – 2021 Page 5/25
University of Technology, Ho Chi Minh City
11
Before executing any of these lines, we must set the working directory to source
file location with setwd(<Your source file location>).
The next step is to import the data, but after completing this step, filter all the
data, select the variables needed for the analysis, and the corresponding variable name
instead of the individual rows with the individual information. You need to create a
table with. To cover it up, visualize it with head() for further investigations if needed:
1 > head(grade_csv)
2 sex age studytime failures higher absences G1 G2 G3
3 1 F 18 2 0 yes 6 5 6 6
4 2 F 17 2 0 yes 4 5 NA 6
5 3 F 15 2 3 yes 10 7 8 10
6 4 F 15 3 0 yes 2 15 14 15
7 5 F 16 2 0 yes 4 6 10 10
8 6 M 16 2 0 yes 10 15 NA 15
The data we collect is not organized so that the model can be fully analyzed. The
first thing you notice is that the table seems to have an N / A (not available) value,
so I’d like to use the most practical method to resolve it.
The above code is used to calculate the number of table cells with N/A values.
The output to the screen for that particular line is 5. This means that there are 5
values that are not defined in this table.
See the output printed above. You can see that the number of cells containing N/A
values is relatively small compared to the number of related data. For this reason, we
are considering removing rows with corresponding N/A values and, in the worst case,
5 rows with 5 different N/A values.
Assignment for Probability and Statistics – Academic year 2020 – 2021 Page 6/25
University of Technology, Ho Chi Minh City
3.2.2 Transformation
The linear regression model works well when its independent variable follows the
shape of the line. Therefore, if some do not meet this requirement, use a transforma-
tion to place it in the model. This may seem like a mandatory step, but if you can’t,
there are workarounds to resolve this issue.
With this model, it is very difficult to transform a given data. So if you’re doing
too much work, don’t touch this variable, but include it in your model. After analyzing
each independent variable in terms of the dependent variable, we find that most of
the data here cannot be resolved, except for G1 and G2. When I plot the correlation
between these two variables and the dependent variable, I see a graph that is almost
a line and is already linear.
For continuous variables, calculate the mean, median, standard deviation, mini-
mum, and maximum. For median, the input data is in the form of discrete values, so
the median calculated by the R function when sorting is the median (or the average
of the two medians). This also applies to discrete variables, but it can also be used to
approximate continuous variables.
Assignment for Probability and Statistics – Academic year 2020 – 2021 Page 7/25
University of Technology, Ho Chi Minh City
15 con_table
We put all the analyzed statistics in a data frame and print out the data for organized
result:
1 absences G1 G2 G3
2 mean 5.715385 10.925641 10.717949 10.412821
3 median 4.000000 11.000000 11.000000 11.000000
4 sd 8.034215 3.290886 3.737868 4.568962
5 min 0.000000 3.000000 0.000000 0.000000
6 max 75.000000 19.000000 19.000000 20.000000
For categorical variables, we make a simple table consists of categories and its
corresponding quantity.
8 cat_sex
9 cat_age
10 cat_studytime
11 cat_failures
12 cat_higher
Assignment for Probability and Statistics – Academic year 2020 – 2021 Page 8/25
University of Technology, Ho Chi Minh City
1 > cat_sex
2 F M
3 205 185
4
5 > cat_age
6 15 16 17 18 19 20 21 22
7 81 101 97 82 24 3 1 1
8
9 > cat_studytime
10 1 2 3 4
11 105 194 64 27
12
13 > cat_failures
14 0 1 2 3
15 307 50 17 16
16
17 > cat_higher
18 no yes
19 20 370
3.3 Visualization
For this purpose, we use specifically gridExtra and ggplot2 packages to graph the
data.
Three required graphing methods are: histogram, box plot and pairs. For his-
togram, we can apply to every variables for visualization:
Assignment for Probability and Statistics – Academic year 2020 – 2021 Page 9/25
University of Technology, Ho Chi Minh City
1 # Histogram plot
2 grid.arrange(
3 ggplot(grade_csv, aes(sex)) + geom_histogram(stat = "count"),
4 ggplot(grade_csv, aes(age)) + geom_histogram(stat = "count"),
5 ggplot(grade_csv, aes(studytime)) + geom_histogram(stat =
"count"),
6 ggplot(grade_csv, aes(failures)) + geom_histogram(stat =
"count"),
7 ggplot(grade_csv, aes(higher)) + geom_histogram(stat = "count"),
8 ggplot(grade_csv, aes(absences)) + geom_histogram(stat =
"count"),
9 ggplot(grade_csv, aes(G1)) + geom_histogram(stat = "count"),
10 ggplot(grade_csv, aes(G2)) + geom_histogram(stat = "count"),
11 ggplot(grade_csv, aes(G3)) + geom_histogram(stat = "count"),
12
13 ncol = 3
14 )
For boxplots, you should only use it for categorical variables. Due to the nature
of the boxplot, each category is contained as one box, so the categorical variables
provide enough categories.
Assignment for Probability and Statistics – Academic year 2020 – 2021 Page 10/25
University of Technology, Ho Chi Minh City
As each row is executed, one plot is displayed on the screen one at a time. Here,
the results displayed from different times are summarized in frames to facilitate the
visualization procedure.
Assignment for Probability and Statistics – Academic year 2020 – 2021 Page 11/25
University of Technology, Ho Chi Minh City
Assignment for Probability and Statistics – Academic year 2020 – 2021 Page 12/25
University of Technology, Ho Chi Minh City
Assignment for Probability and Statistics – Academic year 2020 – 2021 Page 13/25
University of Technology, Ho Chi Minh City
Regression analysis includes several options such as linear, multiple linear and
nonlinear regression. The most common models are simple linear regression and mul-
tiple linear regression.
For linear regression, a relationship between multiple variables using a linear line
is defined. Linear regression attempts to draw a line that comes closest to the data
by finding the slope and intercept that define the line and minimize regression errors.
yi = β0 + β1 xi + ϵi
Here, the random error component ϵi represents the fact that our data won’t fit
the model perfectly. In fact, ϵi follows a normal distribution with mean 0 and variance
σ 2 : ϵi ≃ N (0, σ 2 ). Also note that the intercept β0 , the coefficient β1 , and the random
error component variance ϵi are all treated as fixed (i.e., deterministic) but unknown
quantities.
To gain more insight into this model, suppose that we can fix the value of x and
observe the value of the random variable y. Now that x is fixed, the random component
ϵ on the right-hand side of the model determines the properties of y. Then
E(yi ) = E(β0 + β1 x1 + ϵi )
= β0 + β1 x1 + E(ϵi )
= β 0 + β1 x 1
y = β1 x1 + β2 x2 + · · · + βp xp
Even though it’s still linear, this representation is very versatile, here are just a
few things we can represent with it:
Assignment for Probability and Statistics – Academic year 2020 – 2021 Page 14/25
University of Technology, Ho Chi Minh City
where minβ just means “find values of β that minimize the following”, and Xi refers
to row i of the matrix X.
We can use some linear algebra to solve this problem and find the optimal esti-
mates:
−1
β̂ = (X T X) XT y
which R will do for you. We can also obtain confidence intervals and/or hypothesis
tests for each coefficient.
It’s crucial to run a blind test to see if all of the coefficients are bigger than
zero. But, before you do that, check to see if the model explains a large portion of
the variance in the data: if it doesn’t, it’s probably not worth evaluating any of the
coefficients separately.
Typically, we use ANOVA test to measure this. If the ANOVA test determines
that the model explains a significant portion of the variability in the data. After that,
we may test each of the hypotheses and adjust for multiple comparisons.
We can also ask about which features have the most effect: if a coefficient of feature
is 0 or close to 0, then that feature has little to no impact on the final result. We need
to avoid the effect of scale: for example, if one feature is measured in feet and another
in inches, even if they’re the same, the coefficient for the feet feature will be twelve
times larger. In order to avoid this problem, we’ll usually look at the standardized
ˆ
coefficients sββk
k
Assignment for Probability and Statistics – Academic year 2020 – 2021 Page 15/25
University of Technology, Ho Chi Minh City
Assignment for Probability and Statistics – Academic year 2020 – 2021 Page 16/25
University of Technology, Ho Chi Minh City
Assignment for Probability and Statistics – Academic year 2020 – 2021 Page 17/25
University of Technology, Ho Chi Minh City
Obtain the result containing information, parameters for the regression program
when we call the summary() function:
1 Call:
2 lm(formula = G3 ~ sex + age + studytime + failures + higher +
3 absences + G1 + G2, data = grade_csv)
4
5 Residuals:
6 Min 1Q Median 3Q Max
7 -9.1217 -0.4473 0.3160 0.9743 3.6379
8
9 Coefficients:
10 Estimate Std. Error t value Pr(>|t|)
11 (Intercept) 0.61310 1.51569 0.405 0.686068
12 sexM 0.19679 0.20836 0.945 0.345511
13 age -0.15235 0.08108 -1.879 0.061000 .
14 studytime -0.13934 0.12477 -1.117 0.264810
15 failures -0.19862 0.14784 -1.344 0.179909
16 higheryes 0.26384 0.47490 0.556 0.578836
17 absences 0.04208 0.01233 3.413 0.000711 ***
18 G1 0.16637 0.05696 2.921 0.003701 **
19 G2 0.96039 0.04994 19.231 < 2e-16 ***
20 ---
21 Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
22
- Observing the p-value that is the value of Pr(> |t|) in the Coefficients part. It’s
a probability that coefficient is insignificance in your model so smaller is better.
Then we have t-test for:
Assignment for Probability and Statistics – Academic year 2020 – 2021 Page 18/25
University of Technology, Ho Chi Minh City
• We try to compare the model that contain all the independent variable with the
model that we have just remove some insignificance factors from above to see
whether 2 model that is perform the same effectiveness.
- We create model M 1 contain the dependent variables are age, absences, G1, G2
and do ANOVA test to recommend the better linear regression model:
- With the null-hypothesis H0 : These two models M1 and M_all perform the
same effectiveness. With alpha = 5% and observing from the ANOVA table we
see that p − value = 0.3162 > 0.05 then we fail to reject null-hypothesis.
From that we can conclude that 2 model are perform the same effectiveness.
Therefore, we choose model M1 to continue because it more simpler than
M_all to reduce cost of computation.
Assignment for Probability and Statistics – Academic year 2020 – 2021 Page 19/25
University of Technology, Ho Chi Minh City
• Then we examine the variable age that have effect on your model or not. We
continue comparing model M1 with the model have been removed the factor
age called M2.
- With the null-hypothesis H0 : These two models M1 and M2 perform the same
effectiveness instead of M2 lacking the factor age. With alpha = 5% and ob-
serving from the ANOVA table we see that p − value = 0.01724 < 0.05 then
we reject null-hypothesis. We can conclude that 2 model are perform the
different effectiveness or significance. Which is also said that factor age might
affect the final grade G3. Therefore, we are decide to choose model M1
to be the most suitable model.
- We can see that if we decided to remove factor age from the beginning, we can
nearly drop out some important factor from our datasets that might be affected
to your model a lot.
Assignment for Probability and Statistics – Academic year 2020 – 2021 Page 20/25
University of Technology, Ho Chi Minh City
- Model’s prediction is quite good, the linear regression (red line) is neighboring
Residual = 0, the observed value concentrate around the read line and the
residual plot do not have specific pattern. This is indicated that the model M1
we choose is sure-enough appropriate.
1 summary(M1)
Assignment for Probability and Statistics – Academic year 2020 – 2021 Page 21/25
University of Technology, Ho Chi Minh City
1 Call:
2 lm(formula = G3 ~ age + absences + G1 + G2, data = grade_csv)
3
4 Residuals:
5 Min 1Q Median 3Q Max
6 -9.0758 -0.4018 0.3342 1.0086 3.6469
7
8 Coefficients:
9 Estimate Std. Error t value Pr(>|t|)
10 (Intercept) 1.02355 1.35932 0.753 0.451920
11 age -0.18710 0.07822 -2.392 0.017239 *
12 absences 0.04173 0.01227 3.401 0.000742 ***
13 G1 0.17481 0.05617 3.112 0.001994 **
14 G2 0.96720 0.04983 19.409 < 2e-16 ***
15 ---
16 Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
17
- For assessing the effects of factor on the final grade G3, we must consider about
each p-value for each factor. We see that the p-value corresponding to factor G2
and absences (***) < 0.001, this says that the second period grade and number
of school absences have extremely impact on the final grade G3. We see that
the p-value of first period grade G1(**) < 0.01 and student’s age(*) < 0.05 also
have impact on final grade G3 but it is lesser than G2 and absences. The rest
of factor are sex, studytime, failures, higher have no significance impact on G3
so we do not consider in inside model.
- The other side, the coefficient of independent variable might affect predictor.
Assuming when you increase 1 unit of any independent variables and the rest
of independent variable is constant then the bigger of coefficient the bigger
expected value of predictor.
Assignment for Probability and Statistics – Academic year 2020 – 2021 Page 22/25
University of Technology, Ho Chi Minh City
- We create function to check student fail or pass with input is the grade
- With the most appropriate model that we have choose is M1. We create a new
table and use the command predict() to get the predicted value of final grade
G3 with the input from model M1 and evaluate it with our fuction:
Let’s see what new_grade contain after binding predict_evaluate column into it:
Assignment for Probability and Statistics – Academic year 2020 – 2021 Page 23/25
University of Technology, Ho Chi Minh City
Let’s see what grade_csv contain after binding evaluate column into it:
- We are assessing the precision through out the prediction by creating a table to
compare the result between G3 and predict_G3. By using the above function
to check whether student can fail or pass, we can compare the proportion of
Fail and Pass between G3 and predict_G3
The proportion of Fail and Pass between the observed and predicted value:
Assignment for Probability and Statistics – Academic year 2020 – 2021 Page 24/25
University of Technology, Ho Chi Minh City
5 Conclusion
Assignment for Probability and Statistics – Academic year 2020 – 2021 Page 25/25