
ANALYSIS OF CONTINUOUS AND CATEGORICAL VARIABLES
Lecture 4
January 28, 2020
Descriptive vs. Inferential Statistics
■ Descriptive statistics: describing the
central tendency and dispersion in
data through numerical calculations,
tables, and graphs
■ Inferential statistics: use sample data to
draw conclusions about the population
that the sample is meant to represent
(sampling will naturally involve error)
– Estimate parameters and test
hypotheses to make inferences
about the population
– Compare means and evaluate
Analysis of Continuous Data (comparison of means)
Samples                  Parametric test       Non-parametric test
2 related samples        Paired t-test         Wilcoxon test
2 independent samples    Independent t-test    Mann-Whitney U test
Student’s t-test
■ Used to compare means between two groups
– Related groups: Paired t-test (e.g. pre- and post-study measures on the same participants)
– Independent groups: Unpaired/Independent t-test on two different groups
– Used more often in research
■ Null hypothesis: the means of the two groups are equal (any observed difference reflects sampling error)
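As a minimal sketch of the paired case, the paired t-test can be computed as a one-sample test on the within-participant differences. The `pre` and `post` values and the helper name `paired_t` below are hypothetical and serve only as illustration; they are not data from this lecture.

```python
import math
import statistics

# Hypothetical pre- and post-study measurements on the same five participants.
pre = [82, 75, 90, 68, 77]
post = [80, 72, 85, 69, 73]


def paired_t(before, after):
    """Return (t, df) for a paired t-test on two equal-length samples."""
    diffs = [b - a for b, a in zip(before, after)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    # Standard error of the mean difference (sample SD divided by sqrt(n)).
    se = statistics.stdev(diffs) / math.sqrt(n)
    return mean_diff / se, n - 1


t_stat, df = paired_t(pre, post)
print(f"t = {t_stat:.3f}, df = {df}")
```

Because each participant contributes one difference, the degrees of freedom are the number of pairs minus one.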
Degrees of freedom (df)
■ Degrees of freedom (df): amount of information provided by the data that can be used to estimate population parameters and the variability of the estimates
– df = n – # of estimated parameters
– As df increases, the t-distribution more closely resembles a normal distribution
■ E.g. One-sample t-test to estimate the population mean
– Estimates the standard deviation about the mean
– Uses a t-distribution with df = n – 1
– df = n – 1 for paired t-test as well
■ E.g. Two sample independent t-test to
T-test assumptions
■ Samples are independent
■ Variable is normally distributed
■ Variance homogeneity: variance within
each group is equal
– Levene’s test for equality of variances
– Informs you whether to use results for
pooled or unpooled variance

■ t-tests are fairly robust even if assumptions are not perfectly met
t-statistic
■ Difference between the means divided by the pooled or unpooled standard error of the difference between the means
Group Statistics
                   Biological sex    N     Mean      Std. Deviation    Std. Error Mean
Body mass index    Male              28    23.974    4.0662            .7684
                   Female            90    22.869    3.9477            .4161

Independent Samples Test (Body mass index)
Levene's Test for Equality of Variances: F = .747, Sig. = .389
t-test for Equality of Means
                               t        df        Sig. (2-tailed)    Mean Difference    Std. Error Difference    95% CI of the Difference (Lower, Upper)
Equal variances assumed        1.284    116       .202               1.1047             .8603                    -.5992, 2.8086
Equal variances not assumed    1.264    44.009    .213               1.1047             .8739                    -.6565, 2.8659
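The two t values in the output above can be reproduced, to rounding, from the group summary statistics alone. The sketch below assumes the "equal variances not assumed" row uses the Welch-Satterthwaite approximation, which appears consistent with the reported df of 44.009; the variable names are illustrative.

```python
import math

# Summary statistics from the Group Statistics table (BMI by biological sex).
n1, mean1, sd1 = 28, 23.974, 4.0662  # males
n2, mean2, sd2 = 90, 22.869, 3.9477  # females

diff = mean1 - mean2

# Pooled version (equal variances assumed).
pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)
se_pooled = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
t_pooled = diff / se_pooled
df_pooled = n1 + n2 - 2

# Unpooled version (equal variances not assumed), Welch-Satterthwaite df.
v1, v2 = sd1**2 / n1, sd2**2 / n2
se_unpooled = math.sqrt(v1 + v2)
t_unpooled = diff / se_unpooled
df_unpooled = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))

print(f"pooled:   t = {t_pooled:.3f}, df = {df_pooled}")
print(f"unpooled: t = {t_unpooled:.3f}, df = {df_unpooled:.3f}")
```

Since Levene's test is not significant here (Sig. = .389), the pooled row is the one that would normally be reported.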
Confidence interval (CI)
■ Degree of uncertainty: range around the sample statistic where the corresponding population parameter is likely to be
■ The larger the sample, the narrower the CI
– If the CI is narrow: Greater likelihood that the sample statistic (sample mean) approximates the population parameter (population mean)
– If the CI for the difference between means contains 0 (null value) then the means are not statistically different
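Calculating CI
■ CI = $\bar{x} \pm z_{1-\alpha/2}\left(\frac{s}{\sqrt{n}}\right)$
– 99% CI: $\bar{x} \pm 2.58\left(\frac{s}{\sqrt{n}}\right)$
– 95% CI: $\bar{x} \pm 1.96\left(\frac{s}{\sqrt{n}}\right)$
– 90% CI: $\bar{x} \pm 1.64\left(\frac{s}{\sqrt{n}}\right)$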
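A short sketch of the z-based formula, applied to the male BMI group from the Group Statistics table (mean 23.974, SD 4.0662, n = 28). The function name `z_confidence_interval` is illustrative, not part of the lecture.

```python
import math
from statistics import NormalDist


def z_confidence_interval(mean, sd, n, level=0.95):
    """Return (lower, upper) for a z-based CI around a sample mean."""
    alpha = 1 - level
    z = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for a 95% CI
    margin = z * sd / math.sqrt(n)
    return mean - margin, mean + margin


lower, upper = z_confidence_interval(23.974, 4.0662, 28, level=0.95)
print(f"95% CI: {lower:.3f} to {upper:.3f}")
```

With a sample this small, an interval based on the t-distribution (df = n – 1) would be somewhat wider than the z-based one sketched here.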
Reporting independent t-test results
■ Report the means and standard deviations
for both
groups, t-value, degrees of freedom, and
p-value.
– E.g. Males (mean±𝑆𝑡𝐷): 24.0±4.1,
Females (mean±𝑆𝑡𝐷): 22.9±4.0;
t=1.3, df=116, p=0.20

■ P-values from t-tests are often reported in the subject characteristics table when data are compared between two groups (e.g. males vs. females, intervention vs. control group)
Analysis of Variance (ANOVA)
■ Test to determine if means differ between 3 or more groups
– Unlike t-test, uses variance to assess differences
■ Tests null hypothesis that the means of all groups are equal
– F-test will result in rejection of null hypothesis when variability between group means is sufficiently larger than variability within the groups
F-statistic
■ F = variation between sample means / variation within the samples
■ $F = \frac{s_B^2}{s_W^2}$
■ If the ratio is large, this indicates not all means are equal -> a significant p-value will result
Degrees of freedom
■ df1: df associated with the numerator of the F-statistic
– df = k – 1
– k: # of group means
■ df2: df associated with the denominator of the F-statistic
– df = n – k
– n: total # of observations
[Figure: Between and within variability. Source: http://blog.minitab.com/blog/adventures-in-statistics-2/understanding-analysis-of-variance-anova-and-the-f-test]
[Figure: F-value that is derived. Source: http://blog.minitab.com/blog/adventures-in-statistics-2/understanding-analysis-of-variance-anova-and-the-f-test]
ANOVA assumptions
■ Variable is normally distributed
■ The errors are normally distributed
– We will discuss this more in regression lecture
■ The cases are independent from each other
■ Variance homogeneity
■ Like the t-test, ANOVA is robust in the face of relatively minor violations of these assumptions
Descriptives: Caloric intake (kcal/day)
             N      Mean        Std. Deviation    Std. Error    95% CI Lower Bound    95% CI Upper Bound    Minimum    Maximum
Sedentary    19     1640.058    516.7106          118.5415      1391.011              1889.104              694.2      2313.1
Light        16     2030.822    570.8637          142.7159      1726.630              2335.014              1045.2     3027.7
Moderate     45     1999.444    670.7013          99.9822       1797.943              2200.945              1017.8     3925.5
High         38     2183.431    608.9341          98.7822       1983.279              2383.583              1204.9     3781.4
Total        118    2005.082    633.5294          58.3211       1889.580              2120.584              694.2      3925.5

Test of Homogeneity of Variances: Caloric intake (kcal/day)
Levene Statistic    df1    df2    Sig.
.115                3      114    .951

ANOVA: Caloric intake (kcal/day)
                  Sum of Squares    df     Mean Square    F        Sig.
Between Groups    3752358.815       3      1250786.272    3.300    .023
Within Groups     43206696.270      114    379006.108
Total             46959055.090      117
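The F value in the ANOVA table can be reproduced, to rounding, from the group sizes, means, and standard deviations in the Descriptives table. The following is a sketch of the one-way ANOVA arithmetic; the dictionary layout and variable names are illustrative.

```python
# (n, mean, SD) per physical activity group, from the Descriptives table.
groups = {
    "Sedentary": (19, 1640.058, 516.7106),
    "Light": (16, 2030.822, 570.8637),
    "Moderate": (45, 1999.444, 670.7013),
    "High": (38, 2183.431, 608.9341),
}

k = len(groups)
n_total = sum(n for n, _, _ in groups.values())
grand_mean = sum(n * m for n, m, _ in groups.values()) / n_total

# Between-groups sum of squares: spread of group means around the grand mean.
ss_between = sum(n * (m - grand_mean) ** 2 for n, m, _ in groups.values())
# Within-groups sum of squares: pooled spread of observations inside each group.
ss_within = sum((n - 1) * sd**2 for n, _, sd in groups.values())

df1, df2 = k - 1, n_total - k
ms_between = ss_between / df1
ms_within = ss_within / df2
f_stat = ms_between / ms_within

print(f"F({df1}, {df2}) = {f_stat:.3f}")
```

Small differences from the SPSS sums of squares are expected because the inputs here are rounded summary statistics rather than the raw data.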
[Figure: ANOVA means plot]
Reporting ANOVA results
■ Report means and standard deviations for all groups, as well as the F value, degrees of freedom, and p-value.
■ Text for this example:
– Analysis of variance indicated that the different physical activity groups report different levels of caloric intake:
■ Sedentary: 1640.1 ± 516.7
■ Light: 2030.8 ± 570.9
■ Moderate: 1999.4 ± 670.7
■ High: 2183.4 ± 608.9
■ F(3, 114) = 3.3, p = 0.02
Post-hoc tests (ANOVA)
■ ANOVA tells you if there is a difference between means, but not specifically which means differ
– Conduct multiple comparison post-hoc test to determine which means differ
■ Many post-hoc tests to choose from (see this link for list and description: http://www.statisticshowto.com/post-hoc/)
■ Commonly applied to ANOVA:
– Tukey’s Test
Multiple Comparisons
Dependent Variable: Caloric intake (kcal/day)
LSD
(I) Category of      (J) Category of      Mean Difference    Std. Error    Sig.    95% CI         95% CI
physical activity    physical activity    (I-J)                                    Lower Bound    Upper Bound
Sedentary            Light                -390.7640          208.8913      .064    -804.576       23.048
                     Moderate             -359.3865*         168.4341      .035    -693.053       -25.720
                     High                 -543.3732*         172.9784      .002    -886.042       -200.704
Light                Sedentary            390.7640           208.8913      .064    -23.048        804.576
                     Moderate             31.3774            179.1933      .861    -323.603       386.358
                     High                 -152.6092          183.4713      .407    -516.064       210.846
Moderate             Sedentary            359.3865*          168.4341      .035    25.720         693.053
                     Light                -31.3774           179.1933      .861    -386.358       323.603
                     High                 -183.9866          135.6326      .178    -452.674       84.701
High                 Sedentary            543.3732*          172.9784      .002    200.704        886.042
                     Light                152.6092           183.4713      .407    -210.846       516.064
                     Moderate             183.9866           135.6326      .178    -84.701        452.674
*. The mean difference is significant at the 0.05 level.
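The output above is labeled LSD (least significant difference). As a sketch, each LSD comparison can be treated as a t-test that uses the within-groups mean square from the ANOVA table as the common variance estimate; this reproduces the Std. Error column. The helper name `lsd_comparison` is illustrative.

```python
import math

ms_within = 379006.108  # Within Groups mean square from the ANOVA table
n = {"Sedentary": 19, "Light": 16, "Moderate": 45, "High": 38}
mean = {
    "Sedentary": 1640.058,
    "Light": 2030.822,
    "Moderate": 1999.444,
    "High": 2183.431,
}


def lsd_comparison(i, j):
    """Return (mean difference I-J, standard error, t) for groups i and j."""
    diff = mean[i] - mean[j]
    se = math.sqrt(ms_within * (1 / n[i] + 1 / n[j]))
    return diff, se, diff / se


for other in ("Light", "Moderate", "High"):
    diff, se, t = lsd_comparison("Sedentary", other)
    print(f"Sedentary vs {other}: diff = {diff:.3f}, SE = {se:.4f}, t = {t:.3f}")
```

Each t value would be compared against a t-distribution with the within-groups df (114 here). LSD applies no adjustment for the number of comparisons, unlike Tukey's test.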


Types of ANOVA
■ One-Way ANOVA: considers only one independent variable (factor) for independent groups
■ Repeated measures one-way ANOVA: one-way ANOVA for related (not independent) groups
– E.g. BMI measured every day for each of the different groups
■ Multivariate ANOVA (MANOVA): ANOVA with several dependent variables
■ Factorial ANOVA: compares means across two or more independent variables (factors)
– E.g. sex and physical activity (2 independent variables) and their effect on another variable (blood pressure)
Analysis of categorical variables
■ Chi-square test of independence
– Tests whether there is a significant association between two or more categorical variables
■ Fisher’s exact test
– Use when 20% or more of the cells have expected counts <5
■ Test for trend
– More powerful for ordinal data
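As an illustration of Fisher's exact test, the sketch below computes exact p-values for a 2x2 table from the hypergeometric distribution, using the sex by BMI category counts from this lecture's SPSS example (Male: 20 and 8, Female: 77 and 12). The function name and the two-sided rule (summing all tables no more probable than the observed one) are assumptions of this sketch.

```python
from math import comb


def fisher_exact_2x2(a, b, c, d):
    """Return (one_sided_p, two_sided_p) for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2
    # Possible values of the top-left cell with all margins held fixed.
    lo, hi = max(0, col1 - row2), min(row1, col1)
    denom = comb(total, col1)
    prob = {
        k: comb(row1, k) * comb(row2, col1 - k) / denom
        for k in range(lo, hi + 1)
    }
    p_obs = prob[a]
    # Two-sided: all tables at most as probable as the observed table.
    two_sided = sum(p for p in prob.values() if p <= p_obs * (1 + 1e-9))
    # One-sided: the smaller of the two tail probabilities.
    lower_tail = sum(p for k, p in prob.items() if k <= a)
    upper_tail = sum(p for k, p in prob.items() if k >= a)
    return min(lower_tail, upper_tail), two_sided


one_sided, two_sided = fisher_exact_2x2(20, 8, 77, 12)
print(f"one-sided p = {one_sided:.3f}, two-sided p = {two_sided:.3f}")
```

For these counts the results agree with the Exact Sig. values in the lecture's SPSS output (.063 one-sided, .084 two-sided).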
Chi-square
■ Utilizes contingency tables (also referred to as cross-tabulation, crosstab, or two-way table)
– 2x2 table when each variable has 2 groups
– 2xk table when one variable has 2 groups and the other has k groups
■ Assesses goodness of fit between observed values and theoretically expected values
■ df = k – 1 for a 2xk table (k=number of groups/columns); in general, df = (rows – 1) x (columns – 1)
■ Assumptions:
Biological sex * BMI category Crosstabulation
                                     BMI category 0    BMI category 1    Total
Male      Count                      20                8                 28
          % within Biological sex    71.4%             28.6%             100.0%
          % within BMI category      20.6%             40.0%             23.9%
          % of Total                 17.1%             6.8%              23.9%
Female    Count                      77                12                89
          % within Biological sex    86.5%             13.5%             100.0%
          % within BMI category      79.4%             60.0%             76.1%
          % of Total                 65.8%             10.3%             76.1%
Total     Count                      97                20                117
          % within Biological sex    82.9%             17.1%             100.0%
          % within BMI category      100.0%            100.0%            100.0%
          % of Total                 82.9%             17.1%             100.0%

Chi-Square Tests
                                Value     df    Asymptotic Significance (2-sided)    Exact Sig. (2-sided)    Exact Sig. (1-sided)
Pearson Chi-Square              3.421a    1     .064
Continuity Correction b         2.440     1     .118
Likelihood Ratio                3.129     1     .077
Fisher's Exact Test                                                                  .084                    .063
Linear-by-Linear Association    3.392     1     .066
N of Valid Cases                117
a. 1 cells (25.0%) have expected count less than 5. The minimum expected count is
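The Pearson chi-square value above can be reproduced from the observed counts in the crosstabulation. The sketch below computes expected counts from the row and column totals, sums (observed – expected)² / expected over all cells, and uses df = (rows – 1) x (columns – 1). The p-value shortcut with `math.erfc` is valid only for df = 1; the function name is illustrative.

```python
import math


def pearson_chi_square(table):
    """Return (chi2, df, expected) for a table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # Expected count under independence: row total * column total / N.
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    chi2 = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, df, expected


observed = [[20, 8], [77, 12]]  # Male, Female by BMI category 0, 1
chi2, df, expected = pearson_chi_square(observed)

# Upper-tail probability of a chi-square variable with 1 df.
p_value = math.erfc(math.sqrt(chi2 / 2)) if df == 1 else None
# Number of cells with an expected count below 5.
small_cells = sum(e < 5 for row in expected for e in row)

print(f"chi2 = {chi2:.3f}, df = {df}, p = {p_value:.3f}")
```

One of the four cells has an expected count below 5, which matches footnote a and is the situation in which Fisher's exact test is recommended.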
Sex * BMI class. (18 +) / measure - (D, G) Crosstabulation
                                                            UNDERWEIGHT    NORMAL WEIGHT    OVERWEIGHT    OBESE     Total
MALE      Count                                             79             1665             2161          1347      5252
          % within Sex                                      1.5%           31.7%            41.1%         25.6%     100.0%
          % within BMI class. (18 +) / measure - (D, G)     31.7%          37.2%            48.5%         41.5%     42.3%
          % of Total                                        0.6%           13.4%            17.4%         10.8%     42.3%
FEMALE    Count                                             170            2813             2294          1899      7176
          % within Sex                                      2.4%           39.2%            32.0%         26.5%     100.0%
          % within BMI class. (18 +) / measure - (D, G)     68.3%          62.8%            51.5%         58.5%     57.7%
          % of Total                                        1.4%           22.6%            18.5%         15.3%     57.7%
Total     Count                                             249            4478             4455          3246      12428
          % within Sex                                      2.0%           36.0%            35.8%         26.1%     100.0%
          % within BMI class. (18 +) / measure - (D, G)     100.0%         100.0%           100.0%        100.0%    100.0%
          % of Total                                        2.0%           36.0%            35.8%         26.1%     100.0%

Chi-Square Tests
                                Value       df    Asymptotic Significance (2-sided)
Pearson Chi-Square              130.679a    3     .000
Likelihood Ratio                130.898     3     .000
Linear-by-Linear Association    31.447      1     .000
N of Valid Cases                12428
a. 0 cells (0.0%) have expected count less than 5. The
Test for trend
■ Also referred to as Linear-by-Linear association
– Tests for trends in contingency tables larger than 2x2
■ Takes into account ordered nature of data
– Relates to odds rather than variances
– Assumes under the null hypothesis that a change in ranks makes no difference to the odds of the outcome (i.e. Odds Ratio = 1)
– Therefore, df = 1
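One common way to compute the linear-by-linear statistic is M² = (N – 1) r², where r is the Pearson correlation between row scores and column scores, weighted by the cell counts. The sketch below assumes equally spaced integer scores (1, 2, 3, ...) for the categories, which is an assumption of this example rather than something stated in the lecture; the function name is illustrative.

```python
import math


def linear_by_linear(table):
    """Return (M2, p) for the linear-by-linear association test (df = 1)."""
    n = sum(sum(row) for row in table)
    # Assumed scores: 1, 2, 3, ... for both rows and columns.
    cells = [
        (i + 1, j + 1, count)
        for i, row in enumerate(table)
        for j, count in enumerate(row)
    ]
    sx = sum(x * c for x, _, c in cells)
    sy = sum(y * c for _, y, c in cells)
    sxx = sum(x * x * c for x, _, c in cells)
    syy = sum(y * y * c for _, y, c in cells)
    sxy = sum(x * y * c for x, y, c in cells)
    # Squared Pearson correlation between row and column scores.
    cov = n * sxy - sx * sy
    r_squared = cov**2 / ((n * sxx - sx**2) * (n * syy - sy**2))
    m2 = (n - 1) * r_squared
    p = math.erfc(math.sqrt(m2 / 2))  # chi-square upper tail with 1 df
    return m2, p


# Sex by BMI class (underweight, normal weight, overweight, obese).
bmi_by_sex = [[79, 1665, 2161, 1347], [170, 2813, 2294, 1899]]
m2, p = linear_by_linear(bmi_by_sex)
print(f"M2 = {m2:.3f}, df = 1, p = {p:.2g}")
```

Applied to the two sex by BMI tables earlier in the lecture, this reproduces the Linear-by-Linear Association values in the SPSS output (3.392 for the 2x2 table and 31.447 for the 2x4 table).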
Sex * Total pers. inc. all sources - (D, G) Crosstabulation
                          NO INCOME    LESS THAN 15,000    $15,000-$29,999    $30,000-$49,999    $50,000-$79,999    $80,000 OR MORE    Total
MALE      Count           542          2531                2253               2331               1515               598                9770
          % within Sex    5.5%         25.9%               23.1%              23.9%              15.5%              6.1%               100.0%
          % of Total      2.5%         11.9%               10.6%              11.0%              7.1%               2.8%               45.9%
FEMALE    Count           894          4602                3153               1901               791                166                11507
          % within Sex    7.8%         40.0%               27.4%              16.5%              6.9%               1.4%               100.0%
          % of Total      4.2%         21.6%               14.8%              8.9%               3.7%               0.8%               54.1%
Total     Count           1436         7133                5406               4232               2306               764                21277
          % within Sex    6.7%         33.5%               25.4%              19.9%              10.8%              3.6%               100.0%
          % of Total      6.7%         33.5%               25.4%              19.9%              10.8%              3.6%               100.0%

Chi-Square Tests
                                Value        df    Asymptotic Significance (2-sided)
Pearson Chi-Square              1219.006a    5     .000
Likelihood Ratio                1240.061     5     .000
Linear-by-Linear Association    1108.358     1     .000
N of Valid Cases                21277
a. 0 cells (0.0%) have expected count less than 5. The
