Assignment 2 Group 1 Report


Statistical Methods in Life Sciences – Fall 2022.

Instructor: Tri Thanh Pham

BIOL 520/820

Assignment 2

Report

Group Number: 1

Kuanysh Dossybayeva, Aiym Duisen, Nurgul Seksenbayeva, Arailym Ziyat

Due: November 20, 2022.


Abstract
Our group analysed the low birth weight dataset provided by Hosmer and Lemeshow (2000).
The data cover 189 births and 10 variables, including the babies' birth weight. Our
hypothesis is that the mother's uterine irritability and smoking are associated with low
birth weight. We first screened factors that could affect birth weight: the mother's age,
race, smoking status, uterine irritability and history of hypertension. We checked the
data for normality using a QQ plot and Shapiro-Wilk tests, and then performed a two-way
ANOVA. Uterine irritability and smoking each had a significant effect on babies' birth
weight, but the interaction between the two factors was not significant.
Keywords Birth Weight · Two-way-ANOVA · Smoke · Uterine irritability

1 Introduction
The low birth weight data describe maternal risk factors associated with low birth
weight of neonates and were provided by Hosmer and Lemeshow (2000). A birth weight
under 2500 g is classed as low; otherwise it is classed as normal. The dataset includes 10
variables: low, an indicator of low versus normal birth weight; smoke, maternal smoking
status (smoker or non-smoker); race (white, black, or other); age, the age of the mother
(14 to 45 years); lwt, the mother's weight at the last menstrual period (80 to 250 lbs);
ptl, the number of previous premature labours (0 to 3); ht, history of hypertension (yes
or no); ui, uterine irritability (yes or no); ftv, the number of physician visits in the
first trimester (0 to 6); and bwt, the baby's birth weight (709 to 4990 g). Data were
collected at Baystate Medical Center, Springfield, Massachusetts in 1986.
Our hypothesis is that smoking and uterine irritability are associated with neonates'
weight at birth.
To test this hypothesis we used several data analysis techniques, which are explained
below.
Linear regression models the relationship between an outcome (dependent) variable and
one or more predictor (independent) variables in the form of a linear equation. It is a
well-established statistical technique that is readily available in statistical software [1].
Companies use it to turn raw data into business intelligence and useful analytics.
Scientists in many fields, including biology and the behavioral, environmental and social
sciences, use linear regression to conduct preliminary data analysis and predict future
trends. Many data science methods, such as machine learning and artificial intelligence,
build on linear regression to solve complex problems.
Linear regression makes several assumptions about the data [2]:
1. Linearity of the data. The relationship between the predictor (x) and the outcome
(y) is assumed to be linear.
2. Normality of residuals. Normality tests are used to determine whether a dataset is
well modeled by a normal distribution, that is, how plausible it is that the random
variable underlying the dataset is normally distributed. Normality tests include
D'Agostino's K-squared test, the Jarque–Bera test, the Kolmogorov–Smirnov test
adjusted for an estimated mean and variance (the Lilliefors test), the Shapiro–Wilk
test and Pearson's chi-squared test [3].
Among them, the Shapiro–Wilk test is useful for determining whether a given data set
comes from a normal distribution, which is a common assumption in many statistical
tests, including regression, analysis of variance, t-tests, and many others [4].
The QQ (quantile-quantile) plot is a graphical method for comparing two distributions
by plotting their quantiles against each other. Here it compares the quantiles of the
sample (empirical) distribution with those of a theoretical distribution. If the two
distributions being compared are similar, the points on the QQ plot lie approximately
on the y=x line. The main step in constructing a QQ plot is the calculation or
estimation of the quantiles [4].
3. Homogeneity of residual variance. The residuals are assumed to have a constant
variance (homoscedasticity). Homoscedasticity means that the variance of the data
about the regression line is constant along that line. It is one of the conditions for
the efficiency of the regression model [5]. In statistics, a sequence or vector of
random variables is homoscedastic if all random variables in the sequence or vector
have the same variance.
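The three assumptions above can be checked from the residuals of a fitted model. The sketch below is a minimal base-R illustration, not the analysis reported later in this report. It assumes that the birthwt data frame shipped with the MASS package is the same Hosmer and Lemeshow dataset, and the single-predictor model bwt ~ lwt is chosen only for brevity.

```r
# Minimal sketch: residual diagnostics for a simple linear model.
# Assumption: MASS::birthwt holds the same low birth weight data as lbw.csv.
library(MASS)

fit <- lm(bwt ~ lwt, data = birthwt)
res <- residuals(fit)

# Assumptions 1 and 3: residuals vs fitted values should show no trend
# and a roughly constant spread.
plot(fitted(fit), res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Assumption 2: QQ plot of the residuals plus the Shapiro-Wilk test.
qq <- qqnorm(res)   # returns the theoretical (x) and sample (y) quantiles
qqline(res)
sw <- shapiro.test(res)
sw$p.value          # p > 0.05 gives no evidence against normality
```

The plots are judged by eye, so the Shapiro-Wilk p-value is best read together with the QQ plot rather than on its own.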
Analysis of variance (ANOVA) is used to study the influence of one or more qualitative
variables (factors) on one dependent quantitative variable.
The main purpose of ANOVA is to test the significance of the difference between group
means by comparing variances [6]. Partitioning the total variance into several sources
makes it possible to compare the variance caused by differences between groups with the
variance caused by within-group variability. If the null hypothesis is true (the means of
the groups of observations drawn from the population are equal), the estimate of the
between-group variance should be close to the estimate of the within-group variance.
When only the means of two samples are compared, the analysis of variance gives the
same result as the usual t-test for independent samples (if two independent groups of
objects or observations are compared) or the t-test for dependent samples (if two
variables are compared on the same set of objects or observations).
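The equivalence between a two-group ANOVA and the independent-samples t-test can be checked numerically: with equal variances assumed, the F statistic equals the squared t statistic. The following is a minimal sketch, assuming that MASS::birthwt matches the data used in this report.

```r
# Assumption: MASS::birthwt is the same dataset as lbw.csv.
library(MASS)

# Two-sample t-test with pooled variance, and one-way ANOVA on the same groups.
tt <- t.test(bwt ~ smoke, data = birthwt, var.equal = TRUE)
av <- anova(lm(bwt ~ factor(smoke), data = birthwt))

t_squared <- unname(tt$statistic)^2
f_value   <- av[["F value"]][1]

all.equal(t_squared, f_value)   # F = t^2 when there are only two groups
```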
The essence of variance analysis is to divide the total variance of the studied trait into
individual components due to the influence of specific factors, and to test hypotheses about
the significance of the influence of these factors on the trait under study. By comparing the
variance components with each other by means of Fisher's F-test, it is possible to
determine what proportion of the total variability of the response is due to the action
of the controlled factors [6].
The input for the analysis of variance is data from three or more samples
x_1, ..., x_n, which can be equal or unequal in size and either related or
independent [6]. According to the number of controlled factors, the analysis of
variance can be one-factor (studying the influence of one factor on the results of the
experiment), two-factor (studying the influence of two factors) or multifactorial
(evaluating not only the influence of each of the factors separately, but also their
interaction).
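The partition of the total variance described above can be written out by hand for a one-factor design. The sketch below uses R's built-in PlantGrowth data purely as an illustration (it is not part of this study) and compares the hand-computed F statistic with the one returned by anova().

```r
# Built-in example data: plant weight under three conditions (10 plants each).
y <- PlantGrowth$weight
g <- PlantGrowth$group

grand_mean  <- mean(y)
group_means <- tapply(y, g, mean)
n_per_group <- tapply(y, g, length)

# Partition the total sum of squares into between- and within-group parts.
ss_between <- sum(n_per_group * (group_means - grand_mean)^2)
ss_within  <- sum((y - group_means[as.character(g)])^2)
ss_total   <- sum((y - grand_mean)^2)

df_between <- nlevels(g) - 1
df_within  <- length(y) - nlevels(g)

# F is the ratio of the between-group to the within-group mean square.
f_manual <- (ss_between / df_between) / (ss_within / df_within)
f_anova  <- anova(lm(weight ~ group, data = PlantGrowth))[["F value"]][1]

all.equal(f_manual, f_anova)
```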
- The one-way analysis of variance (ANOVA) tests a quantitative dependent variable
against a single factor (independent) variable and can also estimate the effect
size. It is used to test the hypothesis that several means, corresponding to
different groups or levels of the factor variable, are equal [6]. This method is an
extension of the two-sample t-test.
- The two-way analysis of variance is used to determine whether there is a
statistically significant difference between the means of independent groups
defined by two categorical variables (sometimes called "factors"). This type of
analysis of variance is used when you want to find out how two factors affect a
response variable and whether there is an interaction effect between the two
factors on the response variable [6].

Pair comparison is the process of comparing objects in pairs to determine which one is
preferred, or has a greater amount of some quantitative property, or whether two objects
are identical. The pair comparison method is used in the scientific study of preferences,
relationships, voting systems, social choice, public choice, requirements engineering and
multi-agent systems. Pairwise multiple comparisons check the differences between each pair
of means and output a matrix in which asterisks indicate group means that differ
significantly at an alpha level of 0.05 [7].
When group means are compared in pairs with the Bonferroni procedure, each pair is
compared with a t-test, and the overall (family-wise) error rate is controlled by dividing
the significance level by the total number of comparisons. Equivalently, the confidence
intervals and p-values are adjusted to take into account the multiple comparisons being
made [8].
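As a small illustration of the Bonferroni adjustment, the sketch below multiplies each raw p-value by the number of comparisons (capped at 1) and checks the result against R's built-in p.adjust(). The three p-values are hypothetical and are not results from this study.

```r
# Hypothetical raw p-values from three pairwise comparisons (illustrative only).
p_raw <- c(0.010, 0.020, 0.300)

# Bonferroni: multiply each p-value by the number of tests, capped at 1.
p_manual <- pmin(1, p_raw * length(p_raw))
p_adj    <- p.adjust(p_raw, method = "bonferroni")

all.equal(p_manual, p_adj)
p_adj < 0.05   # only the first comparison stays significant after adjustment
```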
Logistic regression is a data analysis technique that models the relationship between a
binary outcome and one or more predictor variables. This relationship is then used to
predict the probability of the outcome from the predictors. The prediction usually has a
finite number of outcomes, such as "yes" or "no" [9].
Logistic regression is useful for situations in which you want to be able to predict the
presence or absence of a characteristic or outcome based on the values of a set of predictor
variables. It is similar to the linear regression model, but is suitable for models where the
dependent variable has only two values. Logistic regression coefficients can be used to
estimate the odds ratio for each independent variable of the model. Logistic regression is
applicable to a wider range of situations than discriminant analysis [10].

2 Evaluation of two-way interaction between uterine irritability and smoking status
explaining the birth weight of neonates

2.1 Linear Model


In this study, we evaluated maternal risk factors and found that smoking and uterine
irritability were significantly associated with birthweight.

Build the linear model of all risk factors


> lbwlm <- lm(formula = bwt ~ age + lwt + smoke + ht + ui + ftv + race, data = lbw)
> summary(lbwlm)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 3134.501    343.492   9.125  < 2e-16 ***
age           -1.051      9.486  -0.111 0.911924
lwt            3.533      1.685   2.096 0.037453 *
smoke       -368.067    105.340  -3.494 0.000598 ***
ht          -603.241    203.793  -2.960 0.003488 **
ui          -524.253    136.869  -3.830 0.000176 ***
ftv          -14.500     46.810  -0.310 0.757105
race        -191.082     57.496  -3.323 0.001076 **
Residual standard error: 655.5 on 181 degrees of freedom
Multiple R-squared: 0.2217, Adjusted R-squared: 0.1916
F-statistic: 7.365 on 7 and 181 DF, p-value: 8.841e-08

H0: β1 = β2 = ... = β7 = 0 (no predictor is related to birthweight)
H1: at least one βj ≠ 0
The overall F-test of our model gives a p-value (8.841e-08) < 0.05. This leads us to
reject the null hypothesis and conclude that there is strong evidence of a relationship
between birthweight and smoking, uterine irritability (the most significant predictor)
and other risk factors.

Collinearity diagnostics: Condition Index < 16, Tolerance > 0.2, VIF < 10.

We then analyzed whether there is a two-way interaction between uterine irritability and
smoking status in explaining the birth weight of neonates.
Two-way ANOVA is used to evaluate simultaneously the effect of two different
grouping variables on a continuous outcome variable. Here the grouping variables are
smoking status (smoker or non-smoker) and the presence or absence of uterine
irritability, and the outcome is birthweight.
To perform the two-way ANOVA we recoded the risk factors as categorical variables.

smoke <- factor(lbw$smoke, levels = 0:1, labels = c("Not smoke","Smoke"))
ht <- factor(lbw$ht, levels = 0:1, labels = c("No","Yes"))
ui <- factor(lbw$ui, levels = 0:1, labels = c("No","Yes"))
low <- factor(lbw$low, levels = 0:1, labels = c("Normal","Low bw"))
race <- factor(lbw$race, levels = 1:3, labels = c("White","Black", "Other"))

We built the linear model of the two-way interaction between uterine irritability and smoking
status in relation to birthweight. In the R formula, the asterisk expands to both main effects
plus their interaction (smoke + ui + smoke:ui).

lbwmodel <- lm(bwt ~ smoke*ui, data = lbw)
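To make the meaning of the asterisk concrete, the sketch below fits the model twice, once with smoke * ui and once with the expanded formula, and checks that the coefficients agree. It assumes that MASS::birthwt matches the lbw data used above; the object names m_star and m_long are illustrative.

```r
# Assumption: MASS::birthwt is the same dataset as lbw.csv.
library(MASS)

m_star <- lm(bwt ~ smoke * ui, data = birthwt)
m_long <- lm(bwt ~ smoke + ui + smoke:ui, data = birthwt)

names(coef(m_star))                     # intercept, two main effects, interaction
all.equal(coef(m_star), coef(m_long))   # both formulas specify the same model
```

The smoke:ui coefficient is the extra change in mean birthweight when both risk factors are present, beyond the sum of the two main effects.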

2.2 Visualization
The box plot of birthweight by the mothers' uterine irritability status, faceted by
smoking status (Fig. 1), shows differences in birthweight between the groups. We were
specifically interested in whether the combination of smoking and UI is associated
with low birthweight.
Fig. 1 A box plot of the birthweight by uterine irritability levels of mothers, faceted by
smoking status

2.3 Two-way ANOVA assumptions

To perform the two-way ANOVA we checked the following assumptions:
1. The observations are independent.
2. There are no extreme outliers.
> lbw %>%
+   group_by(smoke, ui) %>%
+   identify_outliers(bwt)
 [1] smoke      ui         ...1       low        race       age        lwt
 [8] ptl        ht         ftv        bwt        is.outlier is.extreme
<0 rows> (or 0-length row.names)
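identify_outliers() from rstatix flags values with the boxplot rule: a value is an outlier if it lies more than 1.5 × IQR beyond the quartiles, and extreme if it lies more than 3 × IQR beyond them. The base-R sketch below reproduces that rule on a small made-up vector; flag_outliers is a hypothetical helper written for this illustration, not an rstatix function.

```r
# Hypothetical helper mirroring the 1.5 x IQR / 3 x IQR boxplot rule.
flag_outliers <- function(x) {
  q   <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  iqr <- q[[2]] - q[[1]]
  data.frame(
    value      = x,
    is.outlier = x < q[[1]] - 1.5 * iqr | x > q[[2]] + 1.5 * iqr,
    is.extreme = x < q[[1]] - 3 * iqr   | x > q[[2]] + 3 * iqr
  )
}

# Made-up values: nine typical observations and one very large one.
x   <- c(10, 12, 11, 13, 12, 11, 10, 13, 12, 50)
out <- flag_outliers(x)
out[out$is.outlier, ]   # only the value 50 is flagged, and it is also extreme
```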

2.4 Normality
In the QQ plot (Fig. 2), all residual points fall approximately along the reference
line, so we can assume a normal distribution. This conclusion is supported by the
Shapiro-Wilk test on the model residuals: the p-value is not significant (p = 0.41), so we
can assume normality.
Fig. 2 QQ plot of the model residuals

We computed the Shapiro-Wilk test for each combination of factor levels.

> lbw %>%
+   group_by(smoke, ui) %>%
+   shapiro_test(bwt)
# A tibble: 4 × 5
  smoke    ui variable statistic     p
  <dbl> <dbl> <chr>        <dbl> <dbl>
1     0     0 bwt          0.986 0.389
2     0     1 bwt          0.987 0.997
3     1     0 bwt          0.988 0.824
4     1     1 bwt          0.936 0.413

The bwt data are normally distributed (p > 0.05) for each group, as assessed by the
Shapiro-Wilk test of normality.

2.5 Homogeneity of variances

H0: the variances are equal across all groups.
H1: the variances are NOT equal across all groups.

> leveneTest(bwt~smoke*ui, data=lbw, center=mean)
Levene's Test for Homogeneity of Variance (center = mean)
       Df F value Pr(>F)
group   3  0.4651  0.707
      185
Levene's test is not significant (p > 0.05). Therefore, we can assume the homogeneity of
variances in the different groups.
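Levene's test with center = mean is a one-way ANOVA on the absolute deviations of each observation from its group mean. The sketch below shows that computation in base R on the built-in ToothGrowth data (used only because it also has a two-factor layout). It is an illustration of the idea rather than the test reported above, and it should give the same table as car::leveneTest(len ~ supp * factor(dose), data = ToothGrowth, center = mean).

```r
# Built-in example data with two grouping variables (supp and dose).
tg <- ToothGrowth

# One group per combination of the two factors (2 x 3 = 6 cells).
tg$cell <- interaction(tg$supp, factor(tg$dose))

# Absolute deviation of each observation from its own cell mean.
tg$abs_dev <- abs(tg$len - ave(tg$len, tg$cell))

# Levene's test (center = mean) is the ANOVA F-test on these deviations.
lev <- anova(lm(abs_dev ~ cell, data = tg))
lev
```

A large F (small p-value) would mean that the spread differs between groups, which would violate the homogeneity assumption.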

2.6 Two-way ANOVA computation

A two-way ANOVA was performed to analyze whether the maternal risk factors smoking
and uterine irritability are associated with the birth weight of neonates.
Residual analysis was conducted to test the assumptions of the two-way ANOVA.
Outliers were checked by box plot, normality was evaluated by the Shapiro-Wilk test
and the homogeneity of variances was evaluated by Levene's test.
There were no extreme outliers, residuals were normally distributed (p > 0.05) and
the variances were homogeneous (p > 0.05).
We did not find a statistically significant interaction between smoking status and uterine
irritability in their association with birthweight.

    Effect DFn DFd      F        p p<.05   ges
1    smoke   1 185  6.215 0.014000     * 0.033
2       ui   1 185 15.455 0.000119     * 0.077
3 smoke:ui   1 185  0.933 0.335000       0.005

H0: there is no difference in means.
H1: at least two means are different.
For the smoke:ui interaction the p-value (0.335) > 0.05, so this test is not significant
and we fail to reject H0.
Conclusion: there is no evidence of a difference in the combined smoke:ui means. The
main effects of smoke (p = 0.014) and ui (p = 0.000119) are both significant.
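The interaction test can also be sketched in base R. The example below assumes that MASS::birthwt matches the lbw data. Note that base anova() reports sequential (type I) sums of squares, while rstatix's anova_test() reports type II by default, so the main-effect rows can differ slightly in an unbalanced design; the test of the highest-order term (smoke:ui) is the same under both.

```r
# Assumption: MASS::birthwt is the same dataset as lbw.csv.
library(MASS)

# Recode the two 0/1 risk factors as factors.
bw <- transform(birthwt, smoke = factor(smoke), ui = factor(ui))

# Two-way ANOVA with interaction (sequential sums of squares).
tab <- anova(lm(bwt ~ smoke * ui, data = bw))

tab["smoke:ui", c("Df", "F value", "Pr(>F)")]   # interaction row
tab["Residuals", "Df"]                          # 189 observations - 4 cell means
```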
2.7 Pairwise comparison test (Procedure for non-significant two-way interaction: Inspect
main effects)
Because the two-way interaction was not significant, we next examined the main effects
from the ANOVA output using pairwise comparisons between the levels of each factor.

> # pairwise comparisons
> lbw %>%
+   pairwise_t_test(
+     bwt ~ smoke,
+     p.adjust.method = "bonferroni")
  .y.   group1    group2    n1    n2       p p.signif   p.adj p.adj.signif
1 bwt   Not smoke Smoke    115    74 0.00892 **       0.00892 **
> lbw %>%
+   pairwise_t_test(
+     bwt ~ ui,
+     p.adjust.method = "bonferroni")
# A tibble: 1 × 9
  .y.   group1 group2    n1    n2         p p.signif     p.adj p.adj.signif
1 bwt   No     Yes      161    28 0.0000783 ****     0.0000783 ****

The main effects of smoking and UI were analysed separately, with statistical
significance assessed after a Bonferroni adjustment. The pairwise comparison test was
significant for both maternal risk factors, smoking and uterine irritability.
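In the output above the adjusted and unadjusted p-values are identical. This is expected: each factor has only two levels, so there is a single comparison and the Bonferroni multiplier is 1. A minimal base-R sketch of this point, assuming that MASS::birthwt matches the lbw data:

```r
# Assumption: MASS::birthwt is the same dataset as lbw.csv.
library(MASS)

# Pairwise t-tests between smokers and non-smokers, with and without adjustment.
pw_bonf <- pairwise.t.test(birthwt$bwt, birthwt$smoke,
                           p.adjust.method = "bonferroni")
pw_none <- pairwise.t.test(birthwt$bwt, birthwt$smoke,
                           p.adjust.method = "none")

# With one comparison the Bonferroni adjustment leaves the p-value unchanged.
all.equal(pw_bonf$p.value, pw_none$p.value)
```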

2.8 Logistic regression analysis

Table 1 Mother's age and babies' low birthweight

Table 1 shows that the highest number of low-birthweight babies (8) were born to
mothers aged 20. At ages 33, 35, 46 and 45 no babies were born with low birthweight. In
the table, 0 means normal birthweight and 1 indicates low birthweight.

Table 2 Mother's race and babies' birthweight

Birthweight   White   Black   Other
Normal           73      15      42
Low              23      11      35

Table 3 Smoking history and babies' birthweight

Birthweight   Non-smoker   Smoker
Normal                86       44
Low                   29       30

Table 4 Mother's uterine irritability (UI) and babies' birthweight

Birthweight   No UI   Yes UI
Normal          116       14
Low              45       14

Table 5 Mother's history of hypertension (HT) during pregnancy and babies' birthweight

Birthweight   No HT   Yes HT
Normal          125        5
Low              52        7
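The cross-tabulations can be summarised with an unadjusted odds ratio and a chi-squared test of independence. The sketch below uses the counts from Table 3 (smoking history) as an example. It is a simple unadjusted check, not part of the analysis reported here, and it does not account for the other risk factors the way the logistic regression does.

```r
# Counts copied from Table 3: rows are birthweight, columns are smoking status.
smoke_tab <- matrix(c(86, 44,
                      29, 30),
                    nrow = 2, byrow = TRUE,
                    dimnames = list(birthweight = c("Normal", "Low"),
                                    smoking     = c("Non-smoker", "Smoker")))

# Unadjusted odds ratio of low birthweight for smokers vs non-smokers.
odds_ratio <- (smoke_tab["Low", "Smoker"] * smoke_tab["Normal", "Non-smoker"]) /
              (smoke_tab["Low", "Non-smoker"] * smoke_tab["Normal", "Smoker"])
odds_ratio   # (30 * 86) / (29 * 44), about 2.02

# Chi-squared test of independence (Yates' correction is applied by default).
chisq <- chisq.test(smoke_tab)
chisq$p.value
```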

> logistic <- glm(low~smoke+ht+age+race+ui, data=lbw, family=binomial)
> summary(logistic)

Call:
glm(formula = low ~ smoke + ht + age + race + ui, family = binomial,
data = lbw)

Deviance Residuals:
Min 1Q Median 3Q Max
-1.6886 -0.8586 -0.5869 1.1020 2.0684

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.71234 0.99298 -1.724 0.08463 .
smoke 1.06901 0.38151 2.802 0.00508 **
ht 1.39636 0.62557 2.232 0.02561 *
age -0.03422 0.03436 -0.996 0.31933
race 0.51975 0.20718 2.509 0.01212 *
ui 0.95361 0.44191 2.158 0.03093 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 234.67 on 188 degrees of freedom
Residual deviance: 211.44 on 183 degrees of freedom
AIC: 223.44

Number of Fisher Scoring iterations: 4
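The coefficients above are on the log-odds scale, so exponentiating them gives odds ratios. The sketch below does this for the estimates and standard errors copied from the summary output and adds approximate 95% Wald confidence intervals. The interval method is an assumption chosen for simplicity; confint() on the fitted model would give profile-likelihood intervals instead.

```r
# Estimates and standard errors copied from the summary(logistic) output above.
est <- c(smoke = 1.06901, ht = 1.39636, age = -0.03422,
         race = 0.51975, ui = 0.95361)
se  <- c(smoke = 0.38151, ht = 0.62557, age = 0.03436,
         race = 0.20718, ui = 0.44191)

# Odds ratios and approximate 95% Wald intervals: exp(estimate +/- z * SE).
z       <- qnorm(0.975)
or      <- exp(est)
ci_low  <- exp(est - z * se)
ci_high <- exp(est + z * se)

round(cbind(OR = or, lower = ci_low, upper = ci_high), 2)
```

For example, the smoking coefficient of 1.069 corresponds to an odds ratio of about 2.9, meaning the odds of low birthweight are roughly 2.9 times higher for smokers with the other predictors held fixed.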

Conclusion
The dataset contains observations on 189 women and 10 variables. It was collected at
Baystate Medical Center, Springfield, Massachusetts in 1986.
During the study, we evaluated maternal risk factors. First, by fitting a linear
model to the data, we found that smoking and uterine irritability were significantly
associated with birth weight. We then studied these two variables by visualizing
birthweight in a box plot by the mothers' uterine irritability status, faceted by
smoking status, to compare birthweight between groups. To check normality, we used
QQ plots and the Shapiro-Wilk test, and the results were consistent with a normal
distribution. After that, we performed a two-way ANOVA. There was no statistically
significant interaction between smoking status and uterine irritability in their
association with birth weight. We therefore examined the main effects from the ANOVA
output with pairwise comparison tests between groups. The pairwise comparison tests
were significant for both smoking and uterine irritability.
Our hypothesis was that there is an association between smoking and uterine
irritability and neonates' weight at birth. The study supported this hypothesis: smoking
and uterine irritability were each significantly associated with birth weight. However,
the interaction between them was not significant.

References

1. Kasza, J., & Wolfe, R. (2014). Interpretation of commonly used statistical regression
models. Respirology (Carlton, Vic.), 19(1), 14–21.
https://doi.org/10.1111/resp.12221
2. Hickey, G. L., Kontopantelis, E., Takkenberg, J. J. M., & Beyersdorf, F. (2019).
Statistical primer: checking model assumptions with regression diagnostics.
Interactive cardiovascular and thoracic surgery, 28(1), 1–8.
https://doi.org/10.1093/icvts/ivy207
3. Casson, R. J., & Farmer, L. D. (2014). Understanding and checking the assumptions
of linear regression: a primer for medical researchers. Clinical & experimental
ophthalmology, 42(6), 590–596. https://doi.org/10.1111/ceo.12358
4. Vetter T. R. (2017). Fundamentals of Research Data and Variables: The Devil Is in
the Details. Anesthesia and analgesia, 125(4), 1375–1380.
https://doi.org/10.1213/ANE.0000000000002370
5. O'Neill, M. E., & Mathews, K. L. (2002). Levene tests of homogeneity of variance
for general block and treatment designs. Biometrics, 58(1), 216–224.
https://doi.org/10.1111/j.0006-341x.2002.00216.x
6. Mishra, P., Singh, U., Pandey, C. M., Mishra, P., & Pandey, G. (2019). Application
of student's t-test, analysis of variance, and covariance. Annals of cardiac
anaesthesia, 22(4), 407–411. https://doi.org/10.4103/aca.ACA_94_19
7. Tarrow, S. (2010). The Strategy of Paired Comparison: Toward a Theory of
Practice. Comparative Political Studies, 43(2), 230–259.
https://doi.org/10.1177/0010414009350044
8. Narum, S. R. (2006). Beyond Bonferroni: Less conservative analyses for conservation
genetics. Conservation Genetics, 7, 783–787.
https://doi.org/10.1007/s10592-005-9056-y
9. Sperandei S. (2014). Understanding logistic regression analysis. Biochemia medica,
24(1), 12–18. https://doi.org/10.11613/BM.2014.003
10. Wang, Q. Q., Yu, S. C., Qi, X., Hu, Y. H., Zheng, W. J., Shi, J. X., & Yao, H. Y.
(2019). Zhonghua yu fang yi xue za zhi [Chinese journal of preventive medicine],
53(9), 955–960. https://doi.org/10.3760/cma.j.issn.0253-9624.2019.09.018
Appendix. R Script

library(haven)
library(carData)
library(car)
library(datarium)
library(tidyverse)
library(ggpubr)
library(rstatix)
library(readr)
library(ggplot2)
library(foreign)
library(corrplot)
library(olsrr)
source("http://www.sthda.com/upload/rquery_cormat.r")

set.seed(123)
## Raw data with the original numeric coding
lbw1 <- read_csv("Desktop/Data/lbw.csv")
View(lbw1)
#1
## Recoding lbw1 into lbw with labelled factors
lbw <- within(lbw1, {
## Relabel race
race.cat <- factor(race, levels = 1:3, labels = c("White","Black","Other"))
## Categorize smoke ht ui low race
smoke <- factor(smoke, levels = 0:1, labels = c("Not smoke","Smoke"))
ht <- factor(ht, levels = 0:1, labels = c("No","Yes"))
ui <- factor(ui, levels = 0:1, labels = c("No","Yes"))
low <- factor(low, levels = 0:1, labels = c("Normal","Low bw"))
race <- factor(race, levels = 1:3, labels = c("White","Black", "Other"))
})
#Visualization. boxplot
bxp <- ggboxplot(
lbw, x = "ui", y = "bwt", palette = "jco", facet.by = "smoke")
bxp

# 2.1.1 Build the linear model all risks


lbwlm <-lm(formula = bwt ~ age + lwt + smoke + ht + ui + ftv + race, data = lbw1)
summary(lbwlm)
ols_eigen_cindex(lbwlm)
ols_vif_tol(lbwlm)
# 2.1.2 Build the linear model combined interaction
lbwmodel <- lm(bwt ~ smoke*ui, data = lbw)
summary(lbwmodel)

#2.3.Assumptions
#2.3.2 outliers
lbw %>%
group_by(smoke, ui) %>%
identify_outliers(bwt)
#2.3.3 Normality
lbwmodel <- lm(bwt ~ smoke*ui, data = lbw)
# Create a QQ plot of residuals
ggqqplot(residuals(lbwmodel))

# Compute Shapiro-Wilk test of normality


shapiro_test(residuals(lbwmodel))

#Compute Shapiro-Wilk test of normality for groups

lbw %>%
group_by(smoke, ui) %>%
shapiro_test(bwt)

ggqqplot(lbw, "bwt", ggtheme = theme_bw()) +


facet_grid(smoke ~ ui)
#2.3.4 Homogeneity of variances.
leveneTest(bwt~smoke*ui, data=lbw, center=mean)

#ANOVA 2 way
res.aov <- lbw %>% anova_test(bwt~smoke*ui)
res.aov

# pairwise comparisons
lbw %>%
pairwise_t_test(
bwt ~ smoke,
p.adjust.method = "bonferroni")

lbw %>%
pairwise_t_test(
bwt ~ ui,
p.adjust.method = "bonferroni")

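View(lbw)
# Cross-tabulations of low birthweight against each risk factor
xtabs(~low+race, data=lbw)
xtabs(~low+age, data=lbw)
xtabs(~low+smoke, data=lbw)
xtabs(~low+ui, data=lbw)
xtabs(~low+ht, data=lbw)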

#Logistic regression
logistic<-glm(low~smoke+ht+age+race+ui, data=lbw, family=binomial)
summary(logistic)

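#ANOVA 1 way
# The ANOVA needs a numeric outcome, so low is coded as 0/1
# (this works whether low is stored as 0/1 or as a labelled factor)
lbw_num <- lbw %>% mutate(low = as.integer(low %in% c(1, "Low bw")))

smlowAnova <- lbw_num %>% anova_test(low~smoke)
smlowAnova

uilowAnova <- lbw_num %>% anova_test(low~ui)
uilowAnova

htlowAnova <- lbw_num %>% anova_test(low~ht)
htlowAnova

agelowAnova <- lbw_num %>% anova_test(low~age)
agelowAnova

racelowAnova <- lbw_num %>% anova_test(low~race)
racelowAnova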
