ASSIGNMENT REPORT
PROBABILITY AND STATISTICS
Semester: 222
Class: CC07 – Group: 05
Lecturer: Dr. NGUYEN TIEN DUNG
No.  Last name  Student ID  Tasks  Percentage of work
1 Lê Thị Thu Ngọc 2053271 Code, model analysis 20%
2 Trần Nguyễn Bảo Ngọc 2153626 Code, data analyzer 20%
3 Huỳnh Ngọc Minh Anh 2153156 Code, data visualisation 20%
4 Nguyễn Thị Ngọc Ánh 2153180 Code, data visualisation 20%
5 Trần Anh Khoa 2153472 Theory, dataset overview 20%
CONTENTS
1. THEORY
2.1. Objectives
2.2. Methods
3.5.1. Plot histograms showing the distribution of quantitative variables, and plot statistical bar plots for each classifier
3.5.2. Plot a histogram showing the distribution of pH of high/low milk quality
3.5.4. Plot a histogram showing the distribution of Colour of high/low milk quality
3.5.5. Plot a barplot chart with quantitative statistics of Taste and Colour of high/low milk grade
3.5.6. Plot a barplot chart with quantitative statistics of Fat and Turbidity of high/low milk grade
3.6. Build up a logistic regression model to evaluate the milk quality
4. CONCLUSION
5. REFERENCES
1. THEORY
1.1. Logistic Regression Analysis
Logistic regression is a classification algorithm used to predict the probability of a categorical
dependent variable based on one or more independent variables. The dependent variable
in logistic regression is binary, meaning it can take on one of two values, usually
represented as 0 and 1. In many cases, the dependent variable is not a continuous
measurement but a binary outcome: yes/no, ill/healthy, deceased/alive, occurred/did not
occur, etc., while the independent variables can be continuous or discrete.
In practice, these 0s and 1s code for the two classes, indicating whether the event has
happened or not (no/yes). Given an event recorded x times among n subjects, we can
calculate the probability of that event as p = x/n.
The odds of an event are defined as the ratio of the probability of the
event occurring to the probability of the event not occurring:

O = P / (1 − P) = (probability of event occurring) / (probability of event not occurring)

or, equivalently, P = O / (1 + O), where O is the odds and P is the probability that the event occurs.
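As a quick numeric check of the two conversions (the probability value here is a toy example, not taken from the dataset):

```r
p <- 0.75              # probability of the event
O <- p / (1 - p)       # odds: 0.75 / 0.25 = 3
p_back <- O / (1 + O)  # back to probability: 3 / 4 = 0.75
c(O, p_back)
```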
The logistic regression model is based on the logit transformation:

logit(p) = log(p / (1 − p))

Then we can graph the diagram for this function due to the continuous relationship
between p and logit(p).
Given an independent variable x (x can be continuous or discrete), α and β
are two linear parameters that need to be estimated from the sample data. In summary,
we have the following two equations for the simple logistic regression model:

logit(p) = log(p / (1 − p)) = α + βx

odds(p) = p / (1 − p) = e^(α + βx)

When the odds equal 1, the two outcomes have equal probability. If the odds are lower than 1, the
negative outcome is favored, and conversely.

Moreover, if we have more than one independent variable (x1, x2, ..., x(p−1)), we get
the multiple logistic regression model, expressed as

log(p / (1 − p)) = β0 + β1·x1 + … + β(p−1)·x(p−1)
the most common dairy product, which is milk. This product is widely produced all over
the world, which creates many sources of milk with different quality, used for various
purposes. Thus, the classification step has to be fast and accurate.
In our dataset, there are many factors which can impact the milk grade, such as pH,
Temperature, Odor, Fat, Turbidity, Colour, and Taste. We will then write an R program
which can predict and classify the grade of milk.
For this assignment, we will build the best model for our dataset and then make
predictions to confirm how accurately this model classifies the milk's grade.
2.2. Methods
2.2.1. Model selection
In this report, we choose logistic regression since it is usually used when the dependent
variable is binary or dichotomous, meaning it can take on only two values, such as 0 or 1,
Yes or No, True or False. In particular, our dataset has only two classes of milk grade:
High and Low.
After the logistic regression model is chosen, we use R to find the best model
by removing factors which have no statistically significant effect on the milk's quality
(p-value > 0.05).
2.2.2. Evaluate the overall meaning of the model
The logistic regression model is built from the data of a sample taken from the
population, so it can be affected by sampling error. Therefore, we must perform the
hypothesis testing to conclude that there is a statistically significant relationship between the
predictor variable (x) and the response variable (y). Let’s denote that the null hypothesis and
the alternative hypothesis are
H0 : β1 = 0; H1 : β1 ≠ 0
Then, we calculate the overall chi-square value of the model. If the resulting p-value is
less than the significance level (p < 0.05), the model is useful for predicting the
probability of the outcome for a given observation. Conversely, if the p-value is greater
than the significance level (p > 0.05), the predictor variable (x) and the response
variable (y) do not have a statistically significant relationship.
Accuracy = (TP + TN) / (TP + FP + FN + TN)
Specificity = TN / (TN + FP)
TPR = Sensitivity = TP / (TP + FN)
FPR = 1 − Specificity
3.2. Import Data
Firstly, let’s import our data with the readr library
> library(readr)
> milkgrade <- read_csv("BACH KHOA UNIVERSITY/SECOND
YEAR/PROBABILITY AND STATISTICS/milkgrade.csv")
3.3. Data cleaning
Practically, we rename the columns to make it easier to see how the
attributes relate to the goal. Since Grade is our target variable according to our data, the
other attributes can be interpreted as Xi and Grade as Y:
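The renaming step itself is not shown in the extract; a minimal sketch follows, where the column order (pH, Temperature, Taste, Odor, Fat, Turbidity, Colour, Grade) is an assumption based on how X1–X7 are used later, and the two-row frame is a toy stand-in for the real data:

```r
# Toy frame standing in for the imported milkgrade data; the column
# order below is an assumption, not taken from the dataset itself
milkgrade <- data.frame(pH = c(6.6, 3.0), Temperature = c(35, 70),
                        Taste = c(1, 0), Odor = c(1, 0), Fat = c(1, 0),
                        Turbidity = c(0, 1), Colour = c(255, 246),
                        Grade = c("high", "low"))
# Rename the predictors to X1..X7 and the response Grade to Y
colnames(milkgrade) <- c("X1", "X2", "X3", "X4", "X5", "X6", "X7", "Y")
colnames(milkgrade)
```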
Before beginning to analyze the data, we need first to determine whether any NA
(Not Available) values are present.
> apply(is.na(milkgrade), 2, sum) #Count the NA values in each column
Next, we check whether there are any duplicated rows
> sum(duplicated(milkgrade)) #give the number of duplicated rows
Output:
Then, we remove the duplicates using the unique() command. The unique()
function in R eliminates the duplicate values or rows present in a
vector, data frame, or matrix.
Comment: After cleaning the dataset, we have a new dataset with
49 rows and 8 variables, so the duplicated rows have been removed.
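The removal call itself is not shown in the extract; a minimal sketch of unique() on a toy frame with one exact duplicate row:

```r
df <- data.frame(a = c(1, 1, 2), b = c("x", "x", "y"))  # one duplicated row
df <- unique(df)   # keep only the first occurrence of each row
nrow(df)           # 2 rows remain
```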
We can make the data numeric to solve this problem. The Grade column, which only
accepts two values (high or low), has been renamed to the Y column. Because of
this, the idea is to convert it to a binary number (1 for 'high' and 0 for 'low'). Thus,
we might say that we are "encoding" the data. So:
> milkgrade$Y[milkgrade$Y == 'low'] <- 0
> milkgrade$Y[milkgrade$Y == 'high'] <- 1
Next, we check the datatype of each variable with the sapply() command:
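A minimal sketch of the type check and the numeric conversion (toy two-column frame standing in for the real data, with Y still stored as text after the recoding above):

```r
# Toy frame: Y is still a character column after the 'low'/'high' recoding
milkgrade <- data.frame(X1 = c(6.6, 3.0), Y = c("1", "0"))
sapply(milkgrade, class)                # "numeric" for X1, "character" for Y
milkgrade$Y <- as.numeric(milkgrade$Y)  # coerce the encoded grade to numeric
sapply(milkgrade, class)                # Y is now "numeric"
```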
3.4.1. Calculate the descriptive statistics
For the continuous variables “pH”, “Temperature”, “Colour”, we perform
descriptive statistics and present the results in table form.
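The descriptive-statistics code is not shown in the extract; one way to build such a table in base R (toy values stand in for pH (X1), Temperature (X2) and Colour (X7)):

```r
# Toy values standing in for the real continuous columns
milkgrade <- data.frame(X1 = c(6.5, 6.8, 3.0), X2 = c(35, 40, 70),
                        X7 = c(255, 254, 246))
vars <- c("X1", "X2", "X7")
desc <- data.frame(
  mean   = sapply(milkgrade[vars], mean),
  sd     = sapply(milkgrade[vars], sd),
  median = sapply(milkgrade[vars], median),
  min    = sapply(milkgrade[vars], min),
  max    = sapply(milkgrade[vars], max)
)
desc  # one row of statistics per continuous variable
```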
> table(milkgrade$X3)
Figure 12: R code and results when performing quantitative statistics for the variable "Taste"
Comments:
• There are 20 samples that do not satisfy the optimal taste condition.
• 29 samples satisfy the optimal taste condition.
> table(milkgrade$X4)
Figure 13: R code and results when performing quantitative statistics for the variable "Odor"
Comments:
• 21 samples do not satisfy the optimal odor condition.
• 28 samples satisfy the optimal odor condition.
> table(milkgrade$X5)
Figure 14: R code and results when performing quantitative statistics for the variable "Fat"
Comments:
• 10 samples do not satisfy the optimal fat condition.
• 39 samples satisfy the optimal fat condition.
> table(milkgrade$X6)
Figure 15: R code and results when performing quantitative statistics for the variable
"Turbidity"
Comments:
• 19 samples do not satisfy the optimal turbidity condition.
• 30 samples satisfy the optimal turbidity condition.
> table(milkgrade$Y)
Figure 16: R code and results when performing quantitative statistics for the variable "Grade"
Comments:
• 26 samples have low-quality milk.
• 23 samples have high-quality milk.
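The histograms in the following figures can be reproduced with base R's hist(); a minimal sketch with toy pH values in place of the real milkgrade$X1 column:

```r
set.seed(1)
pH <- runif(49, min = 3, max = 9.5)  # 49 toy pH values, one per sample
h <- hist(pH, xlab = "pH", ylab = "Frequency",
          main = "Histogram graph for frequency of pH", col = "skyblue")
sum(h$counts)  # every one of the 49 samples falls into some bin
```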
Figure 17: The result when plotting the histogram of the variable “pH”
Figure 18: The result when plotting the histogram of the variable “Temperature”
Figure 19: The result when plotting the histogram of the variable “Colour”
Comments:
• In the "Histogram graph for frequency of pH" clearly shows that the frequency
is unevenly distributed. The pH range 6-7 shows the highest frequency, which is 8 times
higher than the others.
• In the "Histogram graph for frequency of Temperature", the frequency decreases
from 30 degrees Celsius to 90 degrees Celsius. The frequencies are high in the range
13
between 30 and 50 degrees Celsius, but substantially lower between 50 and 90 degrees
Celsius. Furthermore, there is an outlier at temperatures ranging from 70 to 80 degrees.
• In the "Histogram graph for colour frequency," there is a significant contrast
between the color frequencies. It is typically in the range of 254 to 256, on the other
hand, there is an outlier in the color code range of 250 to 252.
> par(mfrow = c(1,2)) #Set the 1x2 matrix for both graphs
> barplot(table(milkgrade$X3), xlab = "Taste", ylab = "Frequency",
main = "Barplot of Taste", col= "salmon1") #Plot a barplot to
illustrate the distribution for variable Taste
> barplot(table(milkgrade$X4), xlab = "Odor", ylab = "Frequency",
main = "Barplot of Odor", col= "aquamarine2") #Plot a barplot to
illustrate the distribution for variable Odor
Figure 20: The result when plotting the barplots of the variable “Taste” and “Odor”
> par(mfrow = c(1,3)) #Set the 1x3 matrix for the graphs
> barplot(table(milkgrade$X5), xlab = "Fat", ylab = "Frequency", main
= "Barplot of Fat", col= "midnightblue") #Plot a barplot to
illustrate the distribution for variable Fat
> barplot(table(milkgrade$X6), xlab = "Turbidity", ylab =
"Frequency", main = "Barplot of Turbidity", col= "lemonchiffon2")
#Plot a barplot to illustrate the distribution for variable Turbidity
> barplot(table(milkgrade$Y), xlab = "Grade", ylab = "Frequency",
main = "Barplot of Grade", col = "steelblue") #Plot a barplot to
illustrate the distribution for variable Grade
Figure 21: The result when plotting the barplots of the variable “Fat”, “Turbidity”
and “Grade”
3.5.2. Plot a histogram showing the distribution of pH of high/low milk quality
> library(ggplot2)
> library(plyr)
> mu_pH <- ddply(milkgrade, "Y", summarise, grp.mean=mean(X1))
> ggplot(milkgrade, aes(x=X1, color= as.factor(Y), fill=
as.factor(Y))) + geom_histogram(position = "identity", alpha=0.5)
+ geom_vline(data = mu_pH, aes(xintercept=grp.mean,
color=as.factor(Y)), linetype= "twodash") +
scale_color_manual(values = c("rosybrown2",
"lightblue2","palegreen4")) + scale_fill_manual(values =
c("rosybrown2", "lightblue2","palegreen4")) + labs(title =
"Histogram of pH for Grade of milk", x="pH", y="Frequency") +
theme_light()
Figure 22: Histogram results show the distribution of pH of good/bad milk quality
Figure 23: Histogram results show the distribution of Temperature of good/bad milk quality
Comments:
• The average temperature of low-grade milk is higher than that of high-grade milk.
• The temperature of high-grade milk is approximately normally distributed around
35–45 degrees.
3.5.4. Plot a histogram showing the distribution of Colour of high/low milk quality
Figure 24: Histogram results show the distribution of Colour of good/bad milk quality
Comments:
• The colour of milk fluctuates mildly between 246 and 255.
• At colour value 255, the frequency of high-grade milk is higher than that of
low-grade milk.
• At 246, the frequency of high-grade milk is significantly lower than that of low-grade milk.
3.5.5. Plot a barplot chart with quantitative statistics of Taste and Colour of
high/low milk grade
> par(mfrow = c(1,2)) #Set the 1x2 matrix for both graphs
> barplot(table(milkgrade$Y, milkgrade$X3), xlab = "Taste", ylab =
"Frequency", main = "Barplot of Taste for milk grade", col=
c("honeydew2", "tan2"), legend = rownames(table(milkgrade$Y,
milkgrade$X3)), beside= TRUE, cex.main=0.9) #Plot a barplot to
illustrate the distribution for the variable Taste
> barplot(table(milkgrade$Y, milkgrade$X4), xlab = "Odor", ylab =
"Frequency", main = "Barplot of Odor for milk grade", col=
c("honeydew2", "tan2"), legend = rownames(table(milkgrade$Y,
milkgrade$X4)), beside= TRUE, cex.main=0.9) #Plot a barplot to
illustrate the distribution for variable Odor
Figure 25: Result of the barplot graph of the quantity of two variables “Taste”, “Odor”
Comments:
• For good taste, the frequency of high-grade milk is higher than for bad taste; a good
flavor corresponds to a larger number of high-grade samples.
• For appealing odor, the frequency of low-grade milk is smaller than that of high-grade
milk. As a result, an appealing odor corresponds to more high-grade milk.
3.5.6. Plot a barplot chart with quantitative statistics of Fat and Turbidity of
high/low milk grade
> par(mfrow = c(1,2)) #Set the 1x2 matrix for both graphs
> barplot(table(milkgrade$Y, milkgrade$X5), xlab = "Fat", ylab =
"Frequency", main = "Barplot of Fat for milk grade", col=
c("honeydew2", "tan2"), legend = rownames(table(milkgrade$Y,
milkgrade$X5)), beside= TRUE, cex.main=0.9) #Plot a barplot to
illustrate the distribution for variable Fat
> barplot(table(milkgrade$Y, milkgrade$X6), xlab = "Turbidity", ylab
= "Frequency", main = "Barplot of Turbidity for milk grade", col=
c("honeydew2", "tan2"), legend = rownames(table(milkgrade$Y,
milkgrade$X6)), beside= TRUE, cex.main=0.9) #Plot a barplot to
illustrate the distribution for variable Turbidity
Figure 26: Result of the barplot graph of the quantity of two variables “Fat”, “Turbidity”
Comment:
• At the high fat level, the frequency of low-grade milk is lower than the
frequency of high-grade milk. At the low fat level, the frequency of
low-grade milk is significantly greater than that of high-grade milk.
Consequently, high fat content corresponds to more high-grade milk, and fat
has a direct effect on the milk grade.
• At high turbidity, the frequency of low-grade milk is larger than that of
high-grade milk, and the numbers are comparable at low turbidity. This suggests
that turbidity has little effect on the milk grade.
3.6. Build up a logistic regression model to evaluate the milk quality
Firstly, we load the “caTools” library for splitting the data and for ROC–AUC
calculations. After that, we set the seed to a fixed number (“1000” is chosen in this
case) so that the results are reproducible.
> library(caTools)
> set.seed(1000)
Next, we use the split method to separate the sample into “train” and “test” datasets
with a split ratio of 0.85. This means 85% of our dataset goes into the training dataset
and 15% into the testing dataset.
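The split call itself is not shown in the extract; a sketch, assuming caTools' sample.split on the response column (the toy response below only mimics the dataset's class sizes of 26 low and 23 high):

```r
library(caTools)
set.seed(1000)
Y <- rep(c(0, 1), times = c(26, 23))         # toy response, 49 samples
split <- sample.split(Y, SplitRatio = 0.85)  # TRUE = row goes to the training set
table(split)                                 # roughly 85% of the flags are TRUE
```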
The results after splitting are shown as two values: TRUE and FALSE.
> split
[1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE
[14]  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
[27]  TRUE FALSE  TRUE  TRUE  TRUE  TRUE FALSE FALSE  TRUE  TRUE  TRUE FALSE  TRUE
[40]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE
Figure 27: Results after the sample is split with a ratio of 0.85
Let’s use the “subset” command so that the train dataset gets all the data points
flagged “TRUE” after the split, and similarly the test dataset gets all the data points
flagged “FALSE”.
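A minimal sketch of this subsetting step (a toy frame and toy flags stand in for the real data and the real split vector):

```r
milkgrade <- data.frame(X2 = c(35, 40, 70, 45), Y = c(1, 1, 0, 1))  # toy data
split <- c(TRUE, TRUE, FALSE, TRUE)                                 # toy split flags
train <- subset(milkgrade, split == TRUE)   # rows kept for training
test  <- subset(milkgrade, split == FALSE)  # rows held out for testing
c(nrow(train), nrow(test))                  # 3 and 1
```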
Moving on to the next step, we use the training dataset to create the logistic
regression model with the glm (generalized linear model) function. Then, we use the
“summary” command to show the different statistical values for our independent
variables after the model is generated.
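A self-contained sketch of this step; the real call fits Y ~ X1 + ... + X7 on the train set, while here a single synthetic predictor stands in so the block can run on its own:

```r
set.seed(1)
train <- data.frame(X2 = runif(40, 30, 90))            # synthetic temperatures
train$Y <- rbinom(40, 1, plogis(10 - 0.2 * train$X2))  # synthetic binary grade
models <- glm(Y ~ X2, family = "binomial", data = train)  # logistic regression fit
summary(models)  # coefficient estimates, z values and Pr(>|z|)
```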
Call:
glm(formula = Y ~ X1 + X2 + X3 + X4 + X5 + X6 + X7, family = "binomial",
data = train)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.88573 -0.58048 -0.01818 0.59463 1.87784
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -21.65420 32.99509 -0.656 0.5116
X1 0.61270 0.39169 1.564 0.1178
X2 -0.22091 0.09451 -2.337 0.0194 *
X3 0.12670 1.10510 0.115 0.9087
X4 0.20436 1.13409 0.180 0.8570
X5 3.59796 1.85357 1.941 0.0522 .
X6 -0.54890 1.11432 -0.493 0.6223
X7 0.09618 0.12381 0.777 0.4373
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Figure 28: The different statistical values for the logistic regression model
Comments:
• In this model, only X2 (Temperature) has a significant impact on the milk grade
(p = 0.0194 < 0.05). The other variables are not statistically significant.
• Since we only have one predictor variable (X2) and one response variable (Y),
we can use simple logistic regression, which uses the following formula to estimate the
relationship between the variables:
log(p / (1 − p)) = β0 + β1·X
Then, we can express the formula for the most optimal model as

log(p / (1 − p)) = −21.6542 − 0.2209·X2
However, we need to perform hypothesis testing in order to conclude with certainty that
there is a statistically significant relationship between the predictor variable (x) and the
response variable (y). First, the null and alternative hypotheses are determined by
H0 : β1 = 0; H1 : β1 ≠ 0
#Hypothesis testing
[1] 0.0006509515
Comment: Since the p-value is less than the significance level of 0.05, the
null hypothesis can be rejected. In other words, our model is highly useful for predicting
the probability of the outcome for a given observation.
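The code producing this p-value is garbled in the extract; one standard way to compute the overall chi-square p-value from a fitted glm object is via the drop in deviance (synthetic data again, so the resulting number differs from the report's 0.0006509515):

```r
set.seed(1)
train <- data.frame(X2 = runif(40, 30, 90))            # synthetic stand-in data
train$Y <- rbinom(40, 1, plogis(10 - 0.2 * train$X2))
models <- glm(Y ~ X2, family = "binomial", data = train)
chisq <- models$null.deviance - models$deviance  # drop in deviance
df <- models$df.null - models$df.residual        # number of predictors
pchisq(chisq, df, lower.tail = FALSE)            # overall model p-value
```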
After the model is created and fitted, we make predictions with the “predict”
function to obtain the probability determined from the equation

log(p / (1 − p)) = −21.6542 − 0.2209·X2
Initially, we make predictions on the training dataset and use the summary
command to get statistical values.
#Prediction on train dataset
> pred_train <- predict(models, type = "response", newdata = train)
> summary(pred_train)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0000913 0.1031407 0.5217235 0.4761905 0.8248221 0.9432828
Next, we predict on the testing dataset, which contains the unseen data values.
After the predictions on the test dataset are made, we build a confusion matrix with a threshold value of 0.5.
FALSE TRUE
0 3 1
1 1 2
Comment: The rows of the matrix show the observed values of the milk grade and the
columns show the predicted values. The values 0/FALSE and 1/TRUE correspond to
low-grade and high-grade milk, respectively.
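A sketch of how such a confusion matrix can be built from fitted probabilities; the probabilities and labels below are hypothetical stand-ins, not the real test set:

```r
pred_test <- c(0.9, 0.2, 0.6, 0.1, 0.4, 0.8, 0.3)  # hypothetical fitted probabilities
obs       <- c(1, 0, 1, 0, 1, 1, 0)                # hypothetical observed grades
cm <- table(obs, pred_test > 0.5)  # rows: observed grade, columns: prediction > 0.5
cm
```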
Hence, to make the result easier to read, let’s change the names of each component in
this confusion matrix. After that, we use the “t(table)” command to see the result
> t(table)
Observed low-grade, predicted low-grade (0, FALSE): 3
Observed high-grade, predicted low-grade (1, FALSE): 1
Observed low-grade, predicted high-grade (0, TRUE): 1
Observed high-grade, predicted high-grade (1, TRUE): 2
Recall that the standard confusion matrix form will be expressed below:
Hence, our confusion matrix states that the true positives (TP), true negatives (TN), false
positives (FP), and false negatives (FN) are 2, 3, 1, and 1, respectively. From the formula
mentioned below, we calculate the accuracy, the true positive rate (TPR) or sensitivity, the
specificity and the false positive rate (FPR). Recall some crucial equations
Accuracy = (TP + TN) / (TP + FP + FN + TN)
Specificity = TN / (TN + FP)
TPR = Sensitivity = TP / (TP + FN)
FPR = 1 − Specificity
In particular, the accuracy shows how much we predicted correctly, and it should be as
high as possible. In this case, with the split ratio of 0.85, the model achieves an
accuracy of up to 71.4%.
[1] 0.7142857
Figure 33: The accuracy of this model
The true positive rate (TPR), or sensitivity is a measure of the probability that an
actual positive instance will be classified as positive. Similarly, the false positive rate
(FPR) is essentially a measure of how often an actual negative instance will be classified
as positive. The FPR is calculated by (1 – Specificity) in which the specificity measures
the proportion of actual negatives that are correctly identified. Thus, TPR and
specificity should be as high as possible and inversely, FPR should be reduced as much
as possible in order to get a good predicted model.
[1] 0.6666667
Figure 34: The true positive rate
[1] 0.75
[1] 0.25
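The four rates can be reproduced directly from the confusion-matrix counts reported above:

```r
TP <- 2; TN <- 3; FP <- 1; FN <- 1              # counts from the confusion matrix
accuracy    <- (TP + TN) / (TP + FP + FN + TN)  # 0.7142857
sensitivity <- TP / (TP + FN)                   # TPR = 0.6666667
specificity <- TN / (TN + FP)                   # 0.75
fpr         <- 1 - specificity                  # 0.25
round(c(accuracy, sensitivity, specificity, fpr), 7)
```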
Comments:
• TPR, specificity and FPR are determined as 66.7%, 75% and 25%, respectively.
• With the split ratio of 0.85, the model gives a good accuracy of over 71%.
Besides, the TPR and FPR values are reasonably good for a predictive model.
Let’s load the “ROCR” library to visualize the performance of scoring classifiers,
such as ROC graphs and sensitivity/specificity curves.
> library(ROCR)
The ROC (Receiver Operating Characteristic) curve can help in deciding the best
threshold value. A high threshold value gives high specificity and low sensitivity; in
reverse, a low threshold value gives low specificity and high sensitivity. Then, we use
the plot command to draw the ROC curve for the test data of Grade (Y), with FPR on the
x-axis and TPR on the y-axis.
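The plotting code is not shown in the extract; a sketch with ROCR's prediction() and performance(), using hypothetical probabilities and labels in place of the real test-set values:

```r
library(ROCR)
pred_test <- c(0.9, 0.2, 0.6, 0.1, 0.4, 0.8, 0.3)  # hypothetical probabilities
test_Y    <- c(1, 0, 1, 0, 1, 1, 0)                # hypothetical observed grades
rocr_pred <- prediction(pred_test, test_Y)
rocr_perf <- performance(rocr_pred, measure = "tpr", x.measure = "fpr")
plot(rocr_perf, colorize = TRUE)  # ROC curve: FPR on the x-axis, TPR on the y-axis
```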
Here is the graph of our predicted model
AUC (Area Under the ROC Curve) measures the entire two-dimensional area below the ROC
curve. This metric also reflects the quality of the model’s predictions, regardless of the
classification threshold chosen. The AUC range is [0, 1], and an AUC greater than 0.8
generally indicates a good model. In this case, the AUC for Grade (Y) is determined as 83.3%.
[1] 0.8333333
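The AUC itself can be extracted from the same ROCR objects with the "auc" measure; a sketch on the same hypothetical values as above (so the number here is not the report's 0.8333333):

```r
library(ROCR)
pred_test <- c(0.9, 0.2, 0.6, 0.1, 0.4, 0.8, 0.3)  # hypothetical probabilities
test_Y    <- c(1, 0, 1, 0, 1, 1, 0)                # hypothetical observed grades
auc <- performance(prediction(pred_test, test_Y), measure = "auc")
auc@y.values[[1]]  # area under the ROC curve
```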
4. CONCLUSION
• In this assignment, we visualized the milk grade dataset via descriptive
statistics and graphs.
• Then, we successfully built a logistic regression model and chose the best
one by comparing the p-value of each factor and removing factors with p > 0.05.
• Lastly, we created a predictive model to evaluate how well our model works
and to assess its ability to distinguish the classes.
5. REFERENCES
2. Tuan, N.V. (2015). Phân tích dữ liệu với R [Data Analysis with R]. Ho Chi Minh
City General Publishing House.