
VIETNAM NATIONAL UNIVERSITY, HO CHI MINH CITY

UNIVERSITY OF TECHNOLOGY
FACULTY OF COMPUTER SCIENCE AND ENGINEERING

PROBABILITY AND STATISTICS (MT2013)

R STUDIO PROJECT
Semester: 212 - Group 03 Class CC02

Under the guidance of: Dr. Phan Thị Hường

Submitted by: Thái Tài ——————— 2052246


Nguyễn Trọng Chuẩn —– 2052046
Phạm Hữu Đức ———— 2052452
Nguyễn Lê Gia Khánh — 1952774
Bùi Ngọc Đức Anh ——- 2052832

Ho Chi Minh City, 04/2022


University of Technology, Ho Chi Minh City
Faculty of Computer Science and Engineering

Contents
1 Introduction
2 Theoretical basis
  2.1 T-test
    2.1.1 One-sample t-test
    2.1.2 Two-sample t-test - Paired samples
    2.1.3 Two-sample t-test - Independent samples
  2.2 One-way ANOVA
  2.3 Two-way ANOVA
  2.4 Prediction model - Multiple Linear Regression
3 Activity 1
  3.1 Import data
  3.2 Data Cleaning
  3.3 Data Visualization
    3.3.1 Descriptive statistics for each variable
    3.3.2 Graphs: Boxplot
    3.3.3 Conclusion
  3.4 T-test: pre.weight & weight6weeks
  3.5 One-way ANOVA: What is the best diet for weightLoss?
  3.6 Two-way ANOVA: Do Diet and gender affect weightLoss?
4 Activity 2
  4.1 Import data
  4.2 Data Cleaning
  4.3 Data Visualization
    4.3.1 Descriptive statistics for all attributes
    4.3.2 Graph: Published CPU Performance By Vendors
    4.3.3 Graph: Preparation for One-way ANOVA
  4.4 One-way ANOVA
  4.5 Prediction model
    4.5.1 Data transformation
    4.5.2 Choosing an appropriate model
    4.5.3 Building a multiple linear regression model
      4.5.3.a Determine variables
      4.5.3.b Preparing data sets for training and testing
      4.5.3.c Building the model
      4.5.3.d Outcome evaluation
5 Conclusion
6 Libraries used, GitHub code link and references

Assignment for Probability and Statistics - Academic Year 2021 - 2022 Page 1/29
1 Introduction
Probability and statistics are useful topics in science, with applications in a variety of industries including engineering, medicine, biology, economics, physics, and more. This report focuses on using probability and statistics models to analyze and investigate how choosing the proper diet might affect weight loss after 6 weeks, as well as on developing prediction models to compare estimated and announced CPU hardware performance. In this project, we use RStudio as the sole software to analyze the data provided and Overleaf to prepare the report.

In Activity 2, the chosen dataset is about computer hardware, a topic related to Computer Science, our major of study. In this section, we analyze and visualize the data, and make predictions of CPU performance from the available features by applying basic knowledge of data visualization, ANOVA and prediction models.

The prediction model we use in Activity 2 is the linear regression model, a simple and long-standing method (about 200 years old) applied to research in many scientific fields. Linear regression assumes a linear relationship between the input variables and the single output variable. Details about this model are given in the Theoretical basis section.

A critical method employed throughout this project is ANOVA, a very useful and powerful statistical technique for determining how one or more factors impact a response variable. ANOVA is applied in a wide variety of real-life situations, most commonly in retail, medicine and the environmental sciences.

As for the structure, the report consists of 6 parts:

• Introduction

• Theoretical basis

• Activity 1

• Activity 2

• Conclusion

• References

We would like to express our thanks to Dr. Phan Thi Huong and the Faculty of Applied Science for their continued support throughout this project.

Team members,
2022

2 Theoretical basis
2.1 T-test
• Null hypothesis (H0): the initial claim, which is assumed true until the data provide evidence against it.

• Alternative hypothesis (H1): a hypothesis contradictory to H0. For a one-sided alternative hypothesis, we should normally put what is important for making a strong conclusion in H1.
For example: A factory claims that its pencils have a mean length of more than 10 cm. −→ H0: µ = 10 cm & H1: µ > 10 cm.

• The t-test and the z-test are two common parametric tests used for testing the null hypothesis. We use the z-test (based on the Normal distribution) for a normal distribution with known σ or a large sample size (n ≥ 30); we use the t-test (based on the t-distribution) for a small sample size (n < 30) drawn from a normal distribution with unknown σ, replacing σ with the sample standard deviation s.

• Before diving into the most common t-test applications, we should note that the p-value is the smallest significance level α at which the null hypothesis can be rejected. In this project, if p-value < 0.05, we reject the null hypothesis H0, and fail to reject it otherwise. We can also reject H0 if the test statistic falls into the rejection region.

2.1.1 One sample T-test


We use the one-sample t-test when we have a set of data (x1, x2, ..., xn) ∼ N(µ, σ²) and we wish to test the null hypothesis µ = µ0 (µ0 is the value we want to test).
For example: A group of scientists wants to test whether or not their new medication has an effect on IQ. They randomly select 20 people from a population with an average IQ of 150 and find that the sample average IQ after taking the medication is 200, with s = 20. They may use a one-sample t-test in this case.

• Test statistic: t = (x̄ − µ0)/(s/√n), where x̄ is the sample mean, µ0 is the value we wish to test, n is the sample size and s is the sample standard deviation.

Alternative hypothesis | Rejection region    | P-value
H1: µ ≠ µ0             | |t0| > t(n−1, α/2)  | pv = 2P(T > |t0|)
H1: µ > µ0             | t0 > t(n−1, α)      | pv = P(T > t0)
H1: µ < µ0             | t0 < −t(n−1, α)     | pv = P(T < t0)
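As an illustration of the formulas above, the one-sample t statistic can be computed by hand in R and checked against the built-in t.test(); the numbers below are hypothetical, not taken from the project's data.

```r
# Hypothetical sample: test H0: mu = 10 against H1: mu > 10
x <- c(10.2, 10.5, 9.8, 10.7, 10.4, 10.1, 10.6, 10.3)
n <- length(x)
t0 <- (mean(x) - 10) / (sd(x) / sqrt(n))      # test statistic
pv <- pt(t0, df = n - 1, lower.tail = FALSE)  # one-sided p-value
tt <- t.test(x, mu = 10, alternative = "greater")
# t0 and pv should match tt$statistic and tt$p.value
```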

2.1.2 Two samples T-test - Paired samples


We use the paired-samples t-test when we have two sets of data from the same experimental
unit. If we take the difference between the two datasets, then the difference dataset is
assumed to follow N(µ, σ²). After that, we carry out the same steps as in the one-sample t-test.
For example: A car manufacturer may want to evaluate the average parking time of two new car models with different turning radii. They may invite 25 volunteers and record their parking time with each type of car. Then, the manufacturer may use the paired-samples t-test to study which model is better (faster in parking time).

• Test statistic: t = (D̄ − µD)/(sD/√n), where D̄ is the mean of the difference sample, µD is the value that we wish to test, n is the sample size and sD is the sample standard deviation of the difference sample.

• The rejection region table is the same as in the one-sample t-test case.

2.1.3 Two samples T-test - Independent samples


We use the independent-samples t-test when we have two datasets from two different samples and we wish to determine which group has the higher mean. Both samples are assumed to follow normal distributions.
For example: Suppose that Ms. Huong is interested in whether or not students' performance in taking the Probability and Statistics course online is higher than in conventional classes. She may randomly select 2 groups of 10 students and record their final scores for the course. After that, she may use the t-test for independent samples.
Case 1: Equal variances (s1/s2 ∈ [0.5; 2])

• Test statistic: t = ((x̄ − ȳ) − ∆)/se, with se = √(sp²(1/m + 1/n)) and sp² = [(m−1)s1² + (n−1)s2²]/(m + n − 2).
In the formula, s1 and s2 are the sample standard deviations; m and n are the sample sizes; x̄ and ȳ are the sample means.

Alternative hypothesis | Rejection region
µ1 − µ2 ≠ ∆            | |t| > t(α/2, m+n−2)
µ1 − µ2 < ∆            | t < −t(α, m+n−2)
µ1 − µ2 > ∆            | t > t(α, m+n−2)

Case 2: Unequal variances (s1/s2 ∉ [0.5; 2])

• Test statistic: t = ((x̄ − ȳ) − ∆)/se, with se = √(s1²/m + s2²/n).
In the formula, s1 and s2 are the sample standard deviations; m and n are the sample sizes; x̄ and ȳ are the sample means.

The degrees of freedom are v = (s1²/m + s2²/n)² / [(s1²/m)²/(m−1) + (s2²/n)²/(n−1)] (round down if v ∉ Z+).

Alternative hypothesis | Rejection region
µ1 − µ2 ≠ ∆0           | |t| > t(α/2, v)
µ1 − µ2 < ∆0           | t < −t(α, v)
µ1 − µ2 > ∆0           | t > t(α, v)
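The Welch degrees of freedom above can be sketched in R on hypothetical data; t.test() with its default var.equal = FALSE carries out the same computation.

```r
# Two hypothetical samples with clearly different spreads
x <- c(5.1, 4.9, 5.4, 5.0, 5.2, 4.8)
y <- c(6.0, 7.5, 5.2, 8.1, 6.9, 7.7, 6.4, 5.8)
se <- sqrt(var(x) / length(x) + var(y) / length(y))
t0 <- (mean(x) - mean(y)) / se                 # test statistic (Delta = 0)
v  <- (var(x) / length(x) + var(y) / length(y))^2 /
      ((var(x) / length(x))^2 / (length(x) - 1) +
       (var(y) / length(y))^2 / (length(y) - 1))
df <- floor(v)                                 # round down if v is not an integer
tt <- t.test(x, y)                             # Welch test for comparison
```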

2.2 One-way ANOVA
One-way ANOVA is a hypothesis test used for testing the equality of three or more population means simultaneously using variance.
For example: In one laboratory, a team studied whether changes in CO2 concentration affected the germination rate of soybean seeds by gradually increasing the CO2 concentration and recording the height of the bean sprouts after 1 day.
• Statistical problem: Comparing the height means between groups of CO2 concentration.
Assumptions for using one-way ANOVA:
• The populations are normally distributed. To test the normality, we use the normal probability plot of the residuals (mentioned in the Prediction model section).
• The samples are random and independent.
• The populations have equal variances.

An observed dataset can be generalized as in the table below:

Treatment | Observations      | Total | Average
1         | y11 y12 ... y1n   | y1.   | ȳ1.
2         | y21 y22 ... y2n   | y2.   | ȳ2.
...       | ...               | ...   | ...
a         | ya1 ya2 ... yan   | ya.   | ȳa.
          |                   | y.. = Σi Σj yij | ȳ.. = y../(an)

Model considered: Yij = µ + τi + ϵij (i = 1, 2, ..., a; j = 1, 2, ..., n)

• Where µ is the overall mean, τi is the i-th treatment effect, and ϵij is the random error component.

Null and alternative hypotheses:
H0: τ1 = τ2 = ... = τa = 0
H1: τi ≠ 0 for at least one i

Source    | Sum of squares (SS)               | Degrees of freedom (df) | Mean square (MS)
Treatment | SStreatment = n Σi (ȳi. − ȳ..)²   | a − 1                   | MStreatment = SStreatment/(a − 1)
Error     | SSE = Σi Σj (yij − ȳi.)²          | a(n − 1)                | MSE = SSE/[a(n − 1)]
Total     | SST = SStreatment + SSE           | an − 1                  |

Test statistic: F0 = MStreatment/MSE = [SStreatment/(a − 1)] / {SSE/[a(n − 1)]}

• F0 follows a Fisher distribution with (a − 1) and a(n − 1) degrees of freedom: F0 ∼ F(a−1, a(n−1)).
• Given α, H0 is rejected if f0 > f(α, a−1, a(n−1)).
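The table above can be verified by hand on a toy balanced design (hypothetical data) and compared with R's aov():

```r
# Toy balanced one-way layout: a = 3 treatments, n = 4 observations each
y <- c(4.1, 3.9, 4.5, 4.2, 5.0, 5.3, 4.8, 5.1, 6.2, 5.9, 6.4, 6.1)
g <- factor(rep(c("A", "B", "C"), each = 4))
a <- 3; n <- 4
means <- tapply(y, g, mean)                    # treatment means
SStr  <- n * sum((means - mean(y))^2)          # treatment sum of squares
SSE   <- sum((y - means[g])^2)                 # error sum of squares
F0    <- (SStr / (a - 1)) / (SSE / (a * (n - 1)))
summary(aov(y ~ g))                            # should report the same F value
```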

2.3 Two-way ANOVA
Two-way ANOVA is a statistical technique used for examining the effect of two factors on a continuous dependent variable. It also studies the interaction between the two independent variables in influencing the values of the dependent one.
For example: In an arithmetic test, several male and female students of different ages participated, and the exam results were recorded. In this case, two-way ANOVA could be used to determine whether gender and age affected the scores.
• Statistical problem: Comparing the score means according to gender and age.
Assumptions for using two-way ANOVA are similar to those of one-way ANOVA (section 2.2).
The table of a dataset for two-way ANOVA can be generalized as follows:

Factor 1 \ Factor 2 | 1   | 2   | ... | K
1                   | X11 | X21 | ... | XK1
2                   | X12 | X22 | ... | XK2
...                 | ... | ... | ... | ...
H                   | X1H | X2H | ... | XKH

The mean values:

• Mean of each column: X̄i = (Σ_{j=1..H} Xij)/H, for i = 1, 2, ..., K
• Mean of each row: X̄j = (Σ_{i=1..K} Xij)/K, for j = 1, 2, ..., H
• Total mean: X̄ = (Σi Σj Xij)/n = (Σi X̄i)/K = (Σj X̄j)/H

Variance analysis factors:

Source  | Sum of squares                    | Degrees of freedom | Mean square              | F-ratio
Group i | SSK = H Σ_{i=1..K} (X̄i − X̄)²     | K − 1              | MSK = SSK/(K − 1)        | F1 = MSK/MSE
Group j | SSH = K Σ_{j=1..H} (X̄j − X̄)²     | H − 1              | MSH = SSH/(H − 1)        | F2 = MSH/MSE
Error   | SSE = SST − SSK − SSH             | (H − 1)(K − 1)     | MSE = SSE/[(H−1)(K−1)]   |
Total   | SST = Σi Σj (Xij − X̄)²           | KH − 1             |                          |

        | Factor 1                                     | Factor 2
H0      | No difference in means of group i            | No difference in means of group j
H1      | At least one difference in means of group i  | At least one difference in means of group j
Given α | Reject H0 if f1 > f(α, K−1, (K−1)(H−1))      | Reject H0 if f2 > f(α, H−1, (K−1)(H−1))
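As a check on the formulas above, the two F-ratios for a two-way layout without replication can be computed by hand in R on a small hypothetical table and compared with aov():

```r
# Hypothetical K = 3 x H = 4 table, one observation per cell
X <- matrix(c(12, 14, 11,
              15, 16, 13,
              13, 15, 12,
              16, 18, 14), nrow = 4, byrow = TRUE)  # rows j = 1..H, cols i = 1..K
K <- ncol(X); H <- nrow(X)
SST <- sum((X - mean(X))^2)
SSK <- H * sum((colMeans(X) - mean(X))^2)   # factor with K levels
SSH <- K * sum((rowMeans(X) - mean(X))^2)   # factor with H levels
SSE <- SST - SSK - SSH
MSE <- SSE / ((K - 1) * (H - 1))
F1 <- (SSK / (K - 1)) / MSE
F2 <- (SSH / (H - 1)) / MSE
# Equivalent model fit on the same data:
d <- data.frame(y   = as.vector(X),
                col = factor(rep(1:K, each = H)),
                row = factor(rep(1:H, times = K)))
summary(aov(y ~ col + row, data = d))       # reports the same F values
```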

2.4 Prediction model - Multiple Linear Regression
Regression analysis is the collection of statistical tools used to model and explore relationships between variables that are related in a non-deterministic manner.
Multiple linear regression is a critical technique deployed to study the linearity and dependency between a group of independent variables and a dependent one. The general formula for multiple linear regression can be expressed as:

Y = β0 + β1x1 + ... + βkxk + ϵ

• β0, β1, ..., βk are the regression coefficients. Each parameter represents the change in the mean response, E(Y), per unit increase in the associated predictor variable when all the other predictors are held constant.
• ϵ is called the random error and follows N(0, σ²).
The assumptions of the multiple linear regression model:
• A linear relationship between the dependent and independent variables (can be tested using a scatter diagram). Note that, in some cases where the independent variables are not in compatible formats or are not linearly related, we can use data transformation to make them better fitted and better organized.
• The independent variables are not highly correlated with each other.
• The variance of the residuals is constant.
• Independence of observations.
• Multivariate normality (the residuals are normally distributed).
Predicted values and residuals:
• A predicted value is calculated as ŷi = b0 + b1x1 + ... + bkxk, where the b values come from statistical software and the x-values are specified by us.
• A residual (error) term is calculated as ei = yi − ŷi, the difference between an actual and a predicted value of y.
Analysis of variance for testing significance of regression in multiple regression:

Source     | Sum of squares      | df    | Mean square        | F0
Regression | SSR = Σi (ŷi − ȳ)²  | k     | MSR = SSR/k        | MSR/MSE
Residual   | SSE = Σi (yi − ŷi)² | n − p | MSE = SSE/(n − p)  |
Total      | SST = Σi (yi − ȳ)²  | n − 1 |                    |

with the hypotheses for F0:
H0: β1 = β2 = ... = βk = 0
H1: βi ≠ 0 for at least one i

R² and adjusted R². We may also use the coefficient of multiple determination R², or the adjusted R², as a global statistic to assess the fit of the model. Computationally,

R² = SSR/SST = 1 − SSE/SST,   R²adj = 1 − [SSE/(n − p)] / [SST/(n − 1)]
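Both quantities can be checked against lm() on a small simulated dataset (illustrative only):

```r
# Simulated data with two predictors
set.seed(1)
x1 <- runif(30); x2 <- runif(30)
y  <- 2 + 3 * x1 - x2 + rnorm(30, sd = 0.3)
fit <- lm(y ~ x1 + x2)
n <- length(y); p <- length(coef(fit))   # p counts the intercept as well
SSE <- sum(residuals(fit)^2)
SST <- sum((y - mean(y))^2)
R2    <- 1 - SSE / SST
R2adj <- 1 - (SSE / (n - p)) / (SST / (n - 1))
# R2 and R2adj should match summary(fit)$r.squared and summary(fit)$adj.r.squared
```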

3 Activity 1
3.1 Import data
First of all, we use the read.csv() function to import our data, the "Diet.csv" file, into our R working environment; setting header = TRUE reads the header of the csv file, and sep = "," sets the field separator. After that, we print nrow(dataset) and its structure to determine whether we have successfully loaded the data.
# Import the Diet.csv file into RStudio
dataset <- read.csv("Diet.csv", header = TRUE, sep = ",")
nrow(dataset)
str(dataset)
Here is the result. It shows that our dataset has 78 rows and the following attributes: Person, gender, Age, Height (cm), Diet type, pre.weight (kg) and weight6weeks (kg).
> nrow(dataset)
[1] 78
> str(dataset)
'data.frame': 78 obs. of 7 variables:
$ Person : int 25 26 1 2 3 4 5 6 7 8 ...
$ gender : int NA NA 0 0 0 0 0 0 0 0 ...
$ Age : int 41 32 22 46 55 33 50 50 37 28 ...
$ Height : int 171 174 159 192 170 171 170 201 174 176 ...
$ pre.weight : int 60 103 58 60 64 64 65 66 67 69 ...
$ Diet : int 2 2 1 1 1 1 1 1 1 1 ...
$ weight6weeks: num 60 103 54.2 54 63.3 61.1 62.2 64 65 60.5 ...

3.2 Data Cleaning


Next, we carry out data cleaning in 4 steps: omitting NA values; removing duplicated data rows; adding a new column, weightLoss; and finally changing the types of the "gender" and "Diet" columns for further analysis.
1. Omitting NA values: We will use the command na.omit()
dataset <- na.omit(dataset)  # Omit the NA values
nrow(dataset)

After printing nrow(dataset) again, we see that the result is now 76. This means that there were 2 rows containing NA values in our dataset, and we have omitted them.

> nrow(dataset)
[1] 76

2. Removing duplicated data rows: We count the unique persons in the Person column and compare the count to nrow(dataset). If they are equal, our dataset is already unique; otherwise, we remove the duplicates.
if (nrow(dataset) == length(unique(dataset$Person))) {
  print("No person is recorded more than 1 time.")
} else { dataset <- dataset[!duplicated(dataset), ] }

[1] "No person is recorded more than 1 time."

3. Adding a new column: We take the difference of the pre.weight and weight6weeks columns and use the cbind() function to append it to our dataset, naming it weightLoss.

dataset <- cbind(dataset, weightLoss = dataset$pre.weight - dataset$weight6weeks)

4. Factorizing data labels: To make further analysis easier, we change the labels of the "gender" and "Diet" columns with the help of the factor() function.

dataset$gender <- factor(dataset$gender, levels = c(0, 1), labels = c("F", "M"))
dataset$Diet <- factor(dataset$Diet, levels = c(1, 2, 3), labels = c("Diet 1", "Diet 2", "Diet 3"))

Now, let us print the first 5 rows to see the changes we have made.

> dataset[1:5,]
Person gender Age Height pre.weight Diet weight6weeks weightLoss
3 1 F 22 159 58 Diet 1 54.2 3.8
4 2 F 46 192 60 Diet 1 54.0 6.0
5 3 F 55 170 64 Diet 1 63.3 0.7
6 4 F 33 171 64 Diet 1 61.1 2.9
7 5 F 50 170 65 Diet 1 62.2 2.8

3.3 Data Visualization


3.3.1 Descriptive statistics for each variable
We begin by using the summary() function, which gives us basic information about every column in our dataset. However, this function does not provide the standard deviation, so we use the sd() function for each column.

summary(dataset)
sd(dataset$pre.weight)
sd(dataset$weight6weeks)
sd(dataset$Age)
sd(dataset$weightLoss)
sd(dataset$Height)

Person gender Age Height pre.weight Diet


Min. : 1.00 F:43 Min. :16.00 Min. :141.0 Min. :58.00 Diet 1:24
1st Qu.:19.75 M:33 1st Qu.:32.50 1st Qu.:163.8 1st Qu.:66.00 Diet 2:25
Median :40.50 Median :39.00 Median :169.0 Median :72.00 Diet 3:27
Mean :39.87 Mean :39.22 Mean :170.8 Mean :72.29
3rd Qu.:59.25 3rd Qu.:47.25 3rd Qu.:175.2 3rd Qu.:78.00
Max. :78.00 Max. :60.00 Max. :201.0 Max. :88.00
weight6weeks weightLoss
Min. :53.00 Min. :-2.100 > sd(dataset$pre.weight) : [1] 7.974653
1st Qu.:61.95 1st Qu.: 2.300 > sd(dataset$weight6weeks) : [1] 8.058938
Median :68.95 Median : 3.700 > sd(dataset$Age) : [1] 9.908379
Mean :68.34 Mean : 3.946 > sd(dataset$weightLoss) : [1] 2.505803
3rd Qu.:73.67 3rd Qu.: 5.650 > sd(dataset$Height) : [1] 11.41998
Max. :84.50 Max. : 9.200

Next, we partition our dataset into smaller datasets according to their Diet.

diet1 = dataset[(dataset$Diet == "Diet 1"), ]
diet2 = dataset[(dataset$Diet == "Diet 2"), ]
diet3 = dataset[(dataset$Diet == "Diet 3"), ]

After that, we apply the same approach as above.

summary(diet1)  # Diet 1 analysis
sd(diet1$pre.weight)
sd(diet1$weight6weeks)
sd(diet1$Age)
sd(diet1$weightLoss)
sd(diet1$Height)

Person gender Age Height pre.weight Diet


Min. : 1.00 F:14 Min. :22.00 Min. :156.0 Min. :58.00 Diet 1:24
1st Qu.: 6.75 M:10 1st Qu.:36.00 1st Qu.:164.5 1st Qu.:66.75 Diet 2: 0
Median :12.50 Median :40.50 Median :167.5 Median :72.00 Diet 3: 0
Mean :12.50 Mean :40.88 Mean :170.3 Mean :72.88
3rd Qu.:18.25 3rd Qu.:48.50 3rd Qu.:173.2 3rd Qu.:80.00
Max. :24.00 Max. :60.00 Max. :201.0 Max. :88.00
weight6weeks weightLoss
Min. :54.00 Min. :-0.600 > sd(diet1$pre.weight) :[1] 8.383796
1st Qu.:63.83 1st Qu.: 1.975 > sd(diet1$weight6weeks) :[1] 8.398356
Median :69.25 Median : 3.050 > sd(diet1$Age) :[1] 9.728097
Mean :69.58 Mean : 3.300 > sd(diet1$weightLoss) :[1] 2.240148
3rd Qu.:74.83 3rd Qu.: 3.950 > sd(diet1$Height) :[1] 10.94841
Max. :84.50 Max. : 9.000

3.3.2 Graphs: Boxplot


In addition to descriptive analysis, we also draw boxplots for better visualization. Let us draw the boxplot representing the weightLoss after 6 weeks for each type of Diet.

boxplot(weightLoss ~ Diet, data = dataset, horizontal = TRUE,
        main = "Boxplot weight loss after 6 weeks for each diet",
        xlab = "Weight loss after 6 weeks", ylab = "Diet",
        col = c("red", "yellow", "green"), las = 1)

We are also interested in analysing each particular Diet, so we may draw the boxplot for each Diet too. We present the code for Diet 1; Diet 2 and Diet 3 are similar.

boxplot(weightLoss ~ gender, data = diet1, horizontal = TRUE,
        main = "Boxplot weight loss of Diet 1 after 6 weeks",
        xlab = "Weight loss after 6 weeks (kg)", ylab = "Gender",
        las = 1, col = c("pink", "skyblue"))

Apart from drawing boxplots, the normal Q-Q plot also gives us information about the normality of our dataset. We present the code for drawing the normal Q-Q plot of the dataset; doing the same for each type of diet is similar.

qqnorm(dataset$weightLoss, main = "Normal Q-Q plot of dataset")
qqline(dataset$weightLoss)

3.3.3 Conclusion
• Our dataset and the diet1, diet2, and diet3 "sub-datasets" all appear to follow a Normal distribution.

• The largest weightLoss belongs to Diet 3 and the smallest weightLoss is from Diet 2 (which is negative).

• In our sample, no one following Diet 3 suffers an adverse effect (weight gain), while there are some adverse effects in Diet 1 and Diet 2.

• The range of Diet 2 is the largest & that of Diet 1 (excluding outliers) is the smallest.

• The median of Diet 3 is the largest and the median of Diet 1 is the smallest.

• In the sample, there is a higher proportion of females compared to males, and the numbers joining each diet are roughly equal and less than 30.

3.4 T-test: pre.weight & weight6weeks

In this step, we use a t-test to determine whether the people following the 3 diets really lose weight after 6 weeks. To do that, we focus on the "pre.weight" and "weight6weeks" columns. This is a paired-sample dataset because it records the weight of the same experimental unit before and after 6 weeks of following these diets, so we use the paired t-test.
However, before applying the t-test, we must verify that our dataset satisfies the paired-samples t-test requirement: normality. We use the shapiro.test() function to help us.
shapiro.test(dataset$weightLoss)

Shapiro-Wilk normality test
data: dataset$weightLoss
W = 0.9895, p-value = 0.7903

We can see that shapiro.test() returns a p-value of 0.7903 (> 0.05), so we have no evidence against the normality of our sample. In addition, nrow() of our dataset is 76, which is large enough for the sample mean to be assumed approximately normal.
Next, we apply t.test() function to study.

H0 : µpre.weight − µweight6weeks = 0
H1 : µpre.weight − µweight6weeks ̸= 0

t.test(dataset$pre.weight, dataset$weight6weeks, paired = TRUE)

Paired t-test
data: dataset$pre.weight and dataset$weight6weeks
t = 13.728, df = 75, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval: 3.373452 4.518653
sample estimates: mean of the differences 3.946053

=> Since the p-value of the t-test is less than 2.2e-16 (< 0.05), we have strong evidence to reject the null hypothesis and confirm the effectiveness of the 3 diets.

3.5 One-way ANOVA: What is the best diet for weightLoss?
Previously, we used the t-test to establish the effectiveness of the three diets; now, we try to determine which diet gives the best weightLoss by applying one-way ANOVA.

one_way_anova = aov(weightLoss ~ Diet, data = dataset)
summary(one_way_anova)  # ANOVA table for the dataset
TukeyHSD(one_way_anova)  # Pairwise comparisons
plot(TukeyHSD(one_way_anova, conf.level = .95), las = 1)

> summary(one_way_anova)
Df Sum Sq Mean Sq F value Pr(>F)
Diet 2 60.5 30.264 5.383 0.0066 **
Residuals 73 410.4 5.622
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> TukeyHSD(one_way_anova)
Tukey multiple comparisons of means
95% family-wise confidence level

Fit: aov(formula = weightLoss ~ Diet, data = dataset)

$Diet diff lwr upr p adj


Diet 2-Diet 1 -0.032000 -1.6530850 1.589085 0.9987711
Diet 3-Diet 1 1.848148 0.2567422 3.439554 0.0188047
Diet 3-Diet 2 1.880148 0.3056826 3.454614 0.0152020

It is evident that the adjusted p-values of Diet 3-Diet 1 and Diet 3-Diet 2 are much smaller than 0.05 and the corresponding mean differences are positive. Also, Pr(>F) is smaller than 0.05. => We can conclude that Diet 3 gives the largest weightLoss, i.e., Diet 3 is the most effective one.

Finally, we must test whether our dataset satisfies the conditions of normality and equal variances. We use shapiro.test() and bartlett.test().

shapiro.test(x = residuals(object = one_way_anova))
bartlett.test(weightLoss ~ Diet, data = dataset)

> shapiro.test(x = residuals(object = one_way_anova))


Shapiro-Wilk normality test

data: residuals(object = one_way_anova)


W = 0.99175, p-value = 0.9088

Bartlett test of homogeneity of variances


data: weightLoss by Diet
Bartlett's K-squared = 0.21811, df = 2, p-value = 0.8967

Since all p-values are larger than 0.05, we do not have enough evidence of any violation of normality or of unequal variances. Therefore, applying one-way ANOVA is sensible.


3.6 Two-way ANOVA: Do Diet and gender affect weightLoss?


After finding that Diet 3 is the most effective of the three, we are interested in identifying how gender and Diet affect weightLoss and how they interact with each other.

• We can summarize the information of gender and Diet using table().

• Or we can draw the boxplot to illustrate the weightLoss for each gender and each type of diet.

• Also, we can draw interaction plots to determine whether the "gender" and "Diet" factors interact. We observe that the lines are not parallel, which is a strong indication of interaction between the gender and Diet factors.

table(dataset$gender, dataset$Diet)  # Table
boxplot(weightLoss ~ gender * Diet, data = dataset,
        main = "Boxplot weight loss after 6 weeks for each gender and each diet",
        ylab = "WeightLoss after 6 weeks", xlab = "", las = 1,
        col = c("pink", "lightblue", "pink", "lightblue", "pink", "lightblue"))  # Boxplot
interaction.plot(dataset$Diet, dataset$gender, dataset$weightLoss)
interaction.plot(dataset$gender, dataset$Diet, dataset$weightLoss)

> table(dataset$gender,dataset$Diet)
Diet 1 Diet 2 Diet 3
F 14 14 15
M 10 11 12


Next, let us use the two-way ANOVA technique to analyze further.


two_way_anova = aov(weightLoss ~ gender * Diet, data = dataset)
summary(two_way_anova)
TukeyHSD(two_way_anova)

> summary(two_way_anova)
Df Sum Sq Mean Sq F value Pr(>F)
gender 1 0.3 0.278 0.052 0.82062
Diet 2 60.4 30.209 5.619 0.00546 **
gender:Diet 2 33.9 16.952 3.153 0.04884 *
Residuals 70 376.3 5.376
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

> TukeyHSD(two_way_anova)
Tukey multiple comparisons of means 95% family-wise confidence level
Fit: aov(formula = weightLoss ~ gender * Diet, data = dataset)

$gender diff lwr upr p adj


M-F 0.1221283 -0.9480861 1.192343 0.8206233
$Diet diff lwr upr p adj
Diet 2-Diet 1 -0.03484966 -1.6215073 1.551808 0.9984761
Diet 3-Diet 1 1.84475570 0.2871469 3.402365 0.0162482
Diet 3-Diet 2 1.87960536 0.3385771 3.420634 0.0128844
$`gender:Diet` diff lwr upr p adj
M:Diet 1-F:Diet 1 0.6000000 -2.2129628 3.4129628 0.9887997
F:Diet 2-F:Diet 1 -0.4428571 -3.0107291 2.1250148 0.9958151
M:Diet 2-F:Diet 1 1.0590909 -1.6782698 3.7964516 0.8656520
F:Diet 3-F:Diet 1 2.8300000 0.3052886 5.3547114 0.0191170
M:Diet 3-F:Diet 1 1.1833333 -1.4893925 3.8560592 0.7855223
F:Diet 2-M:Diet 1 -1.0428571 -3.8558199 1.7701056 0.8852416
M:Diet 2-M:Diet 1 0.4590909 -2.5093998 3.4275816 0.9975014
F:Diet 3-M:Diet 1 2.2300000 -0.5436187 5.0036187 0.1863470

M:Diet 3-M:Diet 1 0.5833333 -2.3256625 3.4923292 0.9915569
M:Diet 2-F:Diet 2 1.5019481 -1.2354126 4.2393087 0.5963201
F:Diet 3-F:Diet 2 3.2728571 0.7481458 5.7975685 0.0040103
M:Diet 3-F:Diet 2 1.6261905 -1.0465354 4.2989163 0.4833188
F:Diet 3-M:Diet 2 1.7709091 -0.9260048 4.4678230 0.3965102
M:Diet 3-M:Diet 2 0.1242424 -2.7117126 2.9601974 0.9999949
M:Diet 3-F:Diet 3 -1.6466667 -4.2779524 0.9846191 0.4513580

From the given result, we can observe that the Pr(>F) of Diet, the Pr(>F) of gender:Diet, the p adj of F:Diet 3-F:Diet 2 and the p adj of F:Diet 3-F:Diet 1 are all less than 0.05. Therefore, we can conclude that:

• The gender factor alone has no significant effect on weightLoss; however, Diet 3 gives its followers the greatest weightLoss.

• For females, Diet 3 is more effective than Diet 1 and Diet 2, while all three diets are equally effective for males.
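The gender:Diet interaction behind these conclusions can also be inspected visually. Below is a minimal sketch using base R's interaction.plot(), run on a simulated toy data frame (an assumption, not the project data) so that the snippet is self-contained; with the real dataset, toy would simply be replaced by dataset.

```r
# Toy stand-in for the Activity 1 data: 60 simulated observations with the
# same gender and Diet factor structure (assumption: not the real data).
set.seed(1)
toy <- data.frame(
  gender     = rep(c("F", "M"), each = 30),
  Diet       = rep(rep(c("Diet 1", "Diet 2", "Diet 3"), each = 10), times = 2),
  weightLoss = rnorm(60, mean = 3)
)

# One line per gender across the three diets; non-parallel lines hint at a
# gender:Diet interaction like the one detected by the two-way ANOVA.
interaction.plot(x.factor = toy$Diet, trace.factor = toy$gender,
                 response = toy$weightLoss, xlab = "Diet",
                 ylab = "Mean weightLoss", trace.label = "Gender")
```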

Finally, we also test the normality and the equality of variances in the same way as for the one-way ANOVA. We can use leveneTest() instead of bartlett.test() to test the equality of variances.
shapiro.test(x = residuals(object = two_way_anova))
leveneTest(weightLoss ~ gender * Diet, data = dataset)

> shapiro.test(x = residuals(object = two_way_anova))

        Shapiro-Wilk normality test

data:  residuals(object = two_way_anova)
W = 0.97738, p-value = 0.1923

> leveneTest(weightLoss ~ gender*Diet, data = dataset)
Levene's Test for Homogeneity of Variance (center = median)
      Df F value Pr(>F)
group  5  0.3867 0.8563
      70

Both p-values are greater than 0.05, so the assumptions of normality and homogeneity of variances are satisfied. We also plot the 95 percent confidence interval graph using the following code:

plot(TukeyHSD(two_way_anova, conf.level = .95), las = 1, cex.axis = 0.75)
# D1, D2, D3 are short for Diet 1, Diet 2, Diet 3

4 Activity 2
Computer technology has come a long way. From giant, bulky machines with messy wires all over the place to tiny smartphones that slip straight into our pockets, it was the work of numerous scientists and engineers that continuously simplified our interaction with these mysterious machines. Nowadays, even kids can pick up a phone and find their favorite media content - music, cartoons, games - that may entertain them for days.

The accessibility work of computer scientists and engineers has proved extremely successful: today, almost no one knows how a computer operates, yet anyone can use any form of computer with ease. The complex underlying machinery gets abstracted away, leaving a simple interface on top for the users to interact with. But as Computer Science students, we need to explore computers exhaustively. Dealing with computers at the atomic level allows us to make improvements that cascade through multiple aspects of the users' experience. Observing how computing power has changed throughout the years greatly fascinates us as well, as it gives us a chance to ponder the difference between the past and the future of computer evolution.

With the same mindset, we chose this Computer Hardware dataset to study the progress
of computer development. It is also an opportunity for us to appreciate the very much
unappreciated technological feat of our predecessors.

Women of the WRNS operate the Colossus, the world’s first electronic programmable computer
at Bletchley Park in Buckinghamshire. (Photo by SSPL/Getty Images)

The dataset contains several technical features of CPUs that circulated in the market from 1981 to 1984. The data is represented with 10 columns - including MYCT for machine cycle time, MMIN and MMAX for main memory, CACH for cache, CHMIN and CHMAX for channels, PRP for published relative performance and ERP for estimated relative performance - for the features, and 209 rows for the CPUs from 30 vendors.

4.1 Import data
First, we use the read.csv() function to import the "machine_data.csv" file and set header
= FALSE because the data does not have an existing header line.
dataset <- read.csv("machine_data.csv", header = FALSE)
dataset[1:5,]

V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1 adviser 32/60 125 256 6000 256 16 128 198 199
2 amdahl 470v/7 29 8000 32000 32 8 32 269 253
3 amdahl 470v/7a 29 8000 32000 32 8 32 220 253
4 amdahl 470v/7b 29 8000 32000 32 8 32 172 253
5 amdahl 470v/7c 29 8000 16000 32 8 16 132 132

Next, to enhance the data's readability, we will change the column names based on the information from the accompanying "machine_name.csv" file.
column_names <- c("NAME", "MODEL", "MYCT", "MMIN", "MMAX", "CACH", "CHMIN", "CHMAX", "PRP", "ERP")
names(dataset) <- column_names
dataset[1:5,]

NAME MODEL MYCT MMIN MMAX CACH CHMIN CHMAX PRP ERP
1 adviser 32/60 125 256 6000 256 16 128 198 199
2 amdahl 470v/7 29 8000 32000 32 8 32 269 253
3 amdahl 470v/7a 29 8000 32000 32 8 32 220 253
4 amdahl 470v/7b 29 8000 32000 32 8 32 172 253
5 amdahl 470v/7c 29 8000 16000 32 8 16 132 132

With the importation steps completed, we can move on to Data Cleaning.

4.2 Data Cleaning


1. Omitting NA values: Once again, we’d like to detect and remove any faulty record
in our data. Luckily, our data does not contain any NA value.
cat("The count of NA values in the dataset is", sum(is.na(dataset)))

The count of NA values in the dataset is 0

2. Removing duplications: Since a manufacturer may have multiple product models, it is sensible to count the number of unique models rather than the number of unique manufacturers.

> length(unique(dataset$MODEL))
[1] 209
> nrow(dataset)
[1] 209

The number of unique models and the number of rows in our data are equal, indi-
cating that our data contains no duplications.
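The same uniqueness check can be expressed more compactly with base R's anyDuplicated(), which returns 0 when no element repeats. A small sketch on hypothetical model names (not the real MODEL column):

```r
# Hypothetical model names standing in for dataset$MODEL.
models <- c("32/60", "470v/7", "470v/7a", "470v/7b")
anyDuplicated(models)               # 0 means every model name is unique
anyDuplicated(c(models, "470v/7"))  # a repeat makes it return the duplicate's index
```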

4.3 Data Visualization
4.3.1 Descriptive statistics for all attributes
We will utilize the handy summary() function once again for our data in Activity 2. Each
column also requires us to call sd() to print its standard deviation. The outputs were
manually formatted for legibility.
> summary(dataset)
NAME MODEL MYCT MMIN
Length:209 Length:209 Min. : 17.0 Min. : 64
Class :character Class :character 1st Qu.: 50.0 1st Qu.: 768
Mode :character Mode :character Median : 110.0 Median : 2000
Mean : 203.8 Mean : 2868
3rd Qu.: 225.0 3rd Qu.: 4000
Max. :1500.0 Max. :32000

MMAX CACH CHMIN CHMAX


Min. : 64 Min. : 0.00 Min. : 0.000 Min. : 0.00
1st Qu.: 4000 1st Qu.: 0.00 1st Qu.: 1.000 1st Qu.: 5.00
Median : 8000 Median : 8.00 Median : 2.000 Median : 8.00
Mean :11796 Mean : 25.21 Mean : 4.699 Mean : 18.27
3rd Qu.:16000 3rd Qu.: 32.00 3rd Qu.: 6.000 3rd Qu.: 24.00
Max. :64000 Max. :256.00 Max. :52.000 Max. :176.00

      PRP               ERP
Min.   :   6.0    Min.   :  15.00
1st Qu.:  27.0    1st Qu.:  28.00
Median :  50.0    Median :  45.00
Mean   : 105.6    Mean   :  99.33
3rd Qu.: 113.0    3rd Qu.: 101.00
Max.   :1150.0    Max.   :1238.00

> sd(dataset$MYCT): [1] 260.2629
> sd(dataset$MMIN): [1] 3878.743
> sd(dataset$MMAX): [1] 11726.56
> sd(dataset$CACH): [1] 40.62872
> sd(dataset$CHMIN): [1] 6.816274
> sd(dataset$CHMAX): [1] 25.99732
> sd(dataset$PRP): [1] 160.8307
> sd(dataset$ERP): [1] 154.7571

4.3.2 Graph: Published CPU Performance By Vendors


We’d like to study the range of CPU performance of each vendor, as well as the median of
the vendor’s product performance. This will allow us to acquire a deeper understanding
of each vendor’s strengths and weaknesses.
vendor_data <- dataset %>%
  group_by(NAME) %>%
  summarise(lower = min(PRP), upper = max(PRP), p = median(PRP))

ggplot(data = vendor_data, mapping = aes(x = reorder(NAME, -upper), y = p, color = factor(NAME))) +
  ggtitle("Published CPU Performance By Vendors") +
  xlab("Vendors") +
  ylab("Performance") +
  theme(legend.position = "none") +
  geom_pointrange(size = 0.3, mapping = aes(ymin = lower, ymax = upper)) +

  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))

Conclusion: Each line represents the range of CPU performances of a vendor's models, and the dot depicts the median of that range. It is clear that vendors with higher CPU power tend to be more varied in terms of performance. Given the median points, we can conclude that the upper halves of the CPU performance ranges of big vendors are significantly stronger than the lower halves. Moving to the right side of the graph, smaller vendors usually possess less diverse CPUs with similar performances.

4.3.3 Graph: Preparation for One-way ANOVA


In the subsequent section, we would like to perform a one-way ANOVA test to find out if any vendor has notably superior products, performance-wise. Hence, we need to make sure that our data follows a normal distribution. Using the density() function, we construct a graph that represents the distribution of our desired attribute.

plot(density(dataset$PRP), main = "Distribution of CPU performances", xlab = "")


Unfortunately, the initial distribution of PRP is clearly not normal. Performing ANOVA on this data will not yield any legitimate result, so data transformation is required. Given the multiplicative nature of CPU progression throughout history, we can execute a basic transformation by taking the logarithm of PRP in the hope that it will shift closer to a normal distribution. We achieve this by appending a new column, called LOGPRP, containing the logarithm of PRP.

dataset <- cbind(dataset, LOGPRP = log(dataset$PRP))
plot(density(dataset$LOGPRP), main = "Distribution of CPU performances", xlab = "")

The transformed data better resembles a normal distribution. Further testing confirms the
normality of the distribution of our new attribute, so we can proceed to perform ANOVA
on our dataset.
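The effect of the log transform can be illustrated on simulated data. The sketch below draws a toy log-normal sample (an assumption, not the CPU data), which is skewed like raw PRP; the Shapiro-Wilk p-value of the raw sample is far below that of its logarithm:

```r
# Simulated log-normal sample: skewed like raw PRP (toy data, not the dataset).
set.seed(42)
raw <- rlnorm(200, meanlog = 4, sdlog = 1)

p_raw <- shapiro.test(raw)$p.value       # tiny: raw sample is clearly non-normal
p_log <- shapiro.test(log(raw))$p.value  # much larger: log(raw) is N(4, 1) by construction
```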

4.4 One-way ANOVA
With the dataset containing CPUs on the market from 1981 to 1984, we, as Computer Science students, are curious to know whether any contemporary CPU vendor dominated the market. Since there are 30 vendors, and some vendors only have a few models, we will perform ANOVA selectively with only the vendors that own the majority of the data's CPUs. We start by taking a look at the number of CPUs each vendor had.

sort(table(dataset$NAME), decreasing = TRUE)

ibm nas honeywell ncr sperry siemens


32 19 13 13 13 12
amdahl cdc burroughs dg harris hp
9 9 8 7 7 7
c.r.d dec ipl magnuson cambex formation
6 6 6 6 5 5
prime gould nixdorf perkin-elmer apollo basf
5 3 3 3 2 2
bti wang adviser four-phase microdata sratus
2 2 1 1 1 1

The first row contains vendors that had more than 10 CPUs, namely "ibm", "nas", "honeywell", "ncr", "sperry" and "siemens". The CPU performance of these vendors should be reasonable input for the ANOVA test. We then extract the models of the aforementioned vendors from the dataset.

chosen_vendors <- names(sort(table(dataset$NAME), decreasing = TRUE)[1:6])
anova_dataset <- dataset[dataset$NAME %in% chosen_vendors,]
Now, let us carry out the one-way ANOVA method to determine whether or not the vendor name has a significant effect on the PRP and, if yes, which vendor among the chosen ones has the best performance.

one_way_anova <- aov(LOGPRP ~ NAME, data = anova_dataset)
summary(one_way_anova)
TukeyHSD(one_way_anova)
plot(TukeyHSD(one_way_anova, conf.level = .95), las = 1, cex.axis = 0.4)
Before we move on to the results of the ANOVA test, we need to make sure that no assumption of the test gets violated. Once again, we rely on shapiro.test() and bartlett.test().

shapiro.test(x = residuals(object = one_way_anova))
bartlett.test(LOGPRP ~ NAME, data = anova_dataset)

Shapiro-Wilk normality test

data:  residuals(object = one_way_anova)
W = 0.98444, p-value = 0.2764

Bartlett test of homogeneity of variances


data: LOGPRP by NAME
Bartlett's K-squared = 8.817, df = 5, p-value = 0.1166

All p-values are larger than our threshold of 0.05, so we can safely proceed with our test.
...
Df Sum Sq Mean Sq F value Pr(>F)
NAME 5 20.44 4.087 3.199 0.0103 *
Residuals 96 122.65 1.278
...
$NAME
                         diff        lwr        upr     p adj
ibm-honeywell -0.02240313 -1.10348356 1.05867731 0.9999999
nas-honeywell 1.08469044 -0.09841902 2.26779991 0.0917570
ncr-honeywell -0.04541042 -1.33467407 1.24385323 0.9999984
siemens-honeywell 0.46050567 -0.85534353 1.77635487 0.9109468
sperry-honeywell 0.74097437 -0.54828928 2.03023802 0.5539065
nas-ibm 1.10709357 0.15510504 2.05908209 0.0129661
ncr-ibm -0.02300729 -1.10408773 1.05807315 0.9999999
siemens-ibm 0.48290879 -0.62974267 1.59556025 0.8046088
sperry-ibm 0.76337750 -0.31770294 1.84445793 0.3205724
ncr-nas -1.13010086 -2.31321032 0.05300861 0.0698523
siemens-nas -0.62418477 -1.83621050 0.58784096 0.6665900
sperry-nas -0.34371607 -1.52682554 0.83939339 0.9582035
siemens-ncr 0.50591608 -0.80993312 1.82176529 0.8727632
sperry-ncr 0.78638479 -0.50287886 2.07564844 0.4874012
sperry-siemens 0.28046870 -1.03538050 1.59631790 0.9893234

Due to the limitations of our dataset, the results from ANOVA are not particularly convincing. We can see that "nas" models were generally better than "ibm" models, along with a few other similar observations, but apart from that, no remarkable result is yielded.

Building prediction models, on the other hand, is what the dataset was originally used for. Using a range of features like machine cycle time, memory, cache and channels, we can attempt to predict the power of the CPUs and compare our predictions with the actual values.

4.5 Prediction model
4.5.1 Data transformation:
Our model's ultimate aim is to predict the value of Published Relative Performance based on other technical factors like MYCT, MMAX, MMIN, etc. However, the given data set contains a lot of information, with 6 different components that can be used for making predictions. Therefore, we decided to reduce the number of independent variables to simplify our model, using the concept of data transformation. From the original data, we will declare 3 new variables based on the old ones:

• Channel average (CH_average): Calculated by taking the average value of CHMAX and CHMIN (Unit: channels).

• Frequency (F): The lower the cycle time is, the higher the CPU's frequency will be (and the higher the performance). Due to this inversely proportional characteristic, frequency is calculated by taking the inverse of MYCT (Unit: cycles/nanosecond).

• Average memory size (M_average): Calculated by taking the average value of MMAX and MMIN (Unit: kilobytes).

Then, we will add these new variables into the data set for later use, as well as remove the unused columns.
dataset$CH_average <- (dataset$CHMIN + dataset$CHMAX) / 2
dataset$F <- 1 / dataset$MYCT
dataset$M_average <- (dataset$MMIN + dataset$MMAX) / 2
dataset <- dataset[-c(1:5, 7:8, 10)]
str(dataset)

'data.frame': 209 obs. of 5 variables:


$ CACH : int 256 32 32 32 32 64 64 64 64 128 ...
$ PRP : int 198 269 220 172 132 318 367 489 636 1144 ...
$ CH_average: num 72 20 20 20 12 20 24 24 24 48 ...
$ F : num 0.008 0.0345 0.0345 0.0345 0.0345 ...
$ M_average : num 3128 20000 20000 20000 12000 ...

After execution, we get the new, reduced data set shown above.
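As a quick sanity check, the transformed values of the first row (the adviser 32/60, with MYCT = 125, MMIN = 256, MMAX = 6000, CHMIN = 16 and CHMAX = 128 in the imported data shown earlier) can be recomputed by hand and compared against the str() output:

```r
# First-row figures taken from the imported data shown earlier.
ch_average <- (16 + 128) / 2     # 72, matching CH_average in str()
f          <- 1 / 125            # 0.008, matching F
m_average  <- (256 + 6000) / 2   # 3128, matching M_average
```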

4.5.2 Choosing appropriate model:


After the transformation step, we need to choose a predictive model suitable to the provided data set in order to obtain the most precise predictions. A visible factor that can be taken into account is the correlation between variables.

ggpairs(data = dataset, columns = 1:5, title = "Correlation of data")

The ggpairs() function allows us to plot the relationship between pairs of variables as well as calculate the correlation score of each pair.

After execution, ggpairs() returns a combination of various plots and numeric data.


It is obvious that the pair of M_average and PRP suggests a linear relationship (as its plot follows a straight line) and has a high correlation score of 0.887. Other pairs like PRP:CH_average, PRP:F and PRP:CACH also show the shape of a straight line to some extent. Moreover, their correlation scores are relatively high (0.657, 0.622 and 0.663 respectively). In other words, they have a decent impact on the PRP value. From this observation, we build a multiple linear regression model to estimate the value of PRP based on M_average, CH_average, F and CACH.
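The correlation scores reported by ggpairs() are ordinary Pearson correlations, which base R's cor() computes as well. A toy sketch (simulated columns as an illustration, not the real 0.887 figure):

```r
# Two strongly linearly related toy columns (assumption: illustrative data only).
set.seed(7)
x <- 1:50
y <- 2 * x + rnorm(50, sd = 1)
r <- cor(x, y)   # Pearson correlation; close to 1 for near-linear data
```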

4.5.3 Building multiple linear regression model:


4.5.3.a Determine variables:
As stated above, the prediction model’s aim is to calculate the value of PRP based on M_average,
CH_average, F and CACH. Thus, the dependent variable should be PRP and the independent
variables should be CH_average, M_average, CACH and F.

4.5.3.b Preparing data sets for training and testing:


One of the most crucial parts of building any model is the process of training, which is the foundation for the predictions the model can make. Another vital component is the process of testing, which evaluates the quality of the model. In order to perform these tasks, we need separate data sets taken from the original data. Commonly, the data set is split such that the train data makes up 70% and the test data 30%; this 70-30 ratio is a widely used rule of thumb.

1 seed <- sample(c(rep(0, 0.7 * nrow(dataset)), rep(1, 0.3 * nrow(dataset))))
2 seed
3 data.train <- dataset[seed == 0,]
4 data.test <- dataset[seed == 1,]
The first 2 lines are responsible for creating "seed" (a sequence of 0s and 1s in random order) and printing out the result. This indicator is used to divide the data into 2 different sets.

[1] 0 1 1 1 0 0 0 0 0 0 1 1 0 1 0 0 0 0 1 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 1 1 0
[39] 0 0 0 0 0 1 0 0 0 1 0 0 1 0 0 1 0 0 1 1 0 0 0 1 0 0 0 0 1 0 0 1 0 1 0 1 0 0
[77] 0 0 0 1 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 1 1 0 0 1 0 0 1 0 0 0 1 0 0 0 0
[115] 1 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 1 0 0 1 1 1 1 1 0 1 0 0 0 1 0 0 0 0 1 1
[153] 1 0 1 1 0 0 1 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 1 0 1 1 1 1

[191] 0 0 1 0 0 0 1 1 0 0 0 0 1 1 0 0 0 0

The next 2 lines classify every row of the data set into the train data set or the test data set. Each row is assigned the value at the position of the sequence corresponding to its order (the 1st row is assigned the 1st value of the sequence, the 2nd row the 2nd value, and so on).

The sample() function generates the sequence of 0s and 1s randomly. In other words, we will probably get new train and test data every time we run this function. This variety of data sets is a great help in indicating whether the model's accuracy is maintained in different situations.
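One subtlety worth noting: rep() truncates fractional counts, so with 209 rows 0.7 × 209 = 146.3 yields 146 zeros and 0.3 × 209 = 62.7 yields 62 ones, i.e. an indicator of length 208 for 209 rows. A variant sketch that ties the indicator length exactly to the row count:

```r
# n stands in for nrow(dataset); 209 is this dataset's row count.
n <- 209
n_train <- round(0.7 * n)   # 146 rows for training
# Exactly n_train zeros and (n - n_train) ones, shuffled.
seed <- sample(c(rep(0, n_train), rep(1, n - n_train)))
```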

4.5.3.c Building model:


Using the lm() function, we can build and train the multiple linear regression model in a single line of code.

model <- lm(PRP ~ F + M_average + CACH + CH_average, data = data.train)
Next, we need to know the details of our model (its intercept and coefficients, its errors, etc.). We can easily access this information using the summary() function.

summary(model)

Call:
lm(formula = PRP ~ F + M_average + CACH + CH_average, data = data.train)

Residuals:
Min 1Q Median 3Q Max
-205.887 -36.494 9.193 33.534 309.843

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -51.816536 8.789465 -5.895 2.60e-08 ***
F 424.632670 748.268699 0.567 0.57128
M_average 0.014919 0.001184 12.596 < 2e-16 ***
CACH 0.562222 0.167711 3.352 0.00103 **
CH_average 2.859298 0.507979 5.629 9.37e-08 ***
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 68.34 on 142 degrees of freedom


Multiple R-squared:  0.8591,    Adjusted R-squared:  0.8552
F-statistic: 216.5 on 4 and 142 DF, p-value: < 2.2e-16

Based on the returned data, we can establish the equation for our model: y = −51.816536 + 424.632670x1 + 0.014919x2 + 0.562222x3 + 2.859298x4, where x1, x2, x3 and x4 stand for F, M_average, CACH and CH_average respectively. Furthermore, we can also estimate the accuracy of the model through the adjusted R2 value. For our model, this value equals 0.8552, which indicates a very good fit.
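The reported adjusted R2 can be cross-checked from the multiple R2 via adjusted R2 = 1 − (1 − R2)(n − 1)/(n − p − 1), where n = 147 training rows (inferred from the 142 residual degrees of freedom plus 5 estimated coefficients) and p = 4 predictors. A one-line verification:

```r
# Recompute adjusted R^2 from the summary() figures above.
r2 <- 0.8591   # Multiple R-squared
n  <- 147      # training rows: 142 residual df + 5 coefficients
p  <- 4        # predictors: F, M_average, CACH, CH_average
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)   # within rounding of the reported 0.8552
```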

4.5.3.d Outcome evaluation:


First of all, we need to calculate the predicted PRP using the predict() function and then add this value to the test data table.

data.test$Predicted <- predict(model, data.test)

Then, we will calculate the absolute error between the real value of PRP and the predicted one to see whether the difference is small enough for the model to give precise predictions.

plot(abs(data.test$Predicted - data.test$PRP), pch = 19, xlab = "i^(th) testcase", ylab = "Error", main = "Absolute error in predictions of model")


The graph illustrates the high accuracy of the predictions, as most of the errors are below 50. However, there are also a few cases whose errors are greater than 50, peaking at an absolute error above 150.
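The per-case errors in the plot can also be summarized by a single number, such as the mean absolute error (MAE) or root mean squared error (RMSE). A sketch on hypothetical values (the real numbers depend on the random split, so these figures are illustrative only):

```r
# Hypothetical actual and predicted PRP values (not from a real model run).
actual    <- c(198, 269, 220, 132)
predicted <- c(180, 250, 260, 120)

mae  <- mean(abs(predicted - actual))        # average absolute error
rmse <- sqrt(mean((predicted - actual)^2))   # penalizes large errors more
```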

Besides the absolute error, we can plot a linear graph to assess the model's fit.

ggplot(data.test, aes(x = PRP, y = Predicted)) + geom_point() + stat_smooth(method = "lm", col = "red")

The linear graph demonstrates quite a good fit to the data, as most points are close to the red line, showing good accuracy.

For all the above reasons, we can conclude that a multiple linear regression model is a good choice for building a prediction model on our initial data, with relatively low error and precise predictions for most cases.

5 Conclusion
Through the two activities of the Assignment, we have learnt and improved our skills in using RStudio's tools, based on basic knowledge of Probability and Statistics, to visualize, analyze and make predictions on a given data set. Eventually, after dealing with the two data sets, we end the report with the following conclusions:

• In activity 1:
– All three diets have certain effects on weightLoss.
– Diet 3 is the most effective of the three diets.
– Gender does not have an impact on weightLoss.
– Diet 3 was more effective for females than the other diets, while for males all three diets have a similar effect.

• In activity 2:
– Vendors with higher CPU power tend to be more varied in terms of performance. Smaller vendors have less diverse CPUs with similar performances.
– The performance of nas is generally better than that of ibm.
– After building the model and considering the outcome evaluation, our model equation for the relationship between the dependent variable PRP and the independent variables F, M_average, CACH and CH_average is:
y = −51.816536 + 424.632670x1 + 0.014919x2 + 0.562222x3 + 2.859298x4

6 Libraries used, Github code link and References


All the R libraries used in this Assignment:

library(carData)
library(car)
library(ggplot2)
library(dplyr)
library(GGally)

Our Github code link can be found here: ‡Code link

[1] Applied Statistics and Probability for Engineers, Douglas C. Montgomery, George C. Runger, 5th ed.

[2] Statistics and Computing: Introductory Statistics with R, Peter Dalgaard, Springer, 2nd ed.

[3] A Beginner's Guide to R (2009), Alain F. Zuur, Elena N. Ieno, Erik H.W.G. Meesters, Springer.

[4] Two-way ANOVA (2020), Rebecca Bevans, Scribbr.

