Analysis of Continuous and Categorical Variables: January 28, 2020

ANALYSIS OF CONTINUOUS AND CATEGORICAL VARIABLES
Lecture 4
January 28, 2020
Descriptive vs.
Inferential Statistics
■ Descriptive statistics: describing the
central tendency and dispersion in
data through numerical calculations,
tables, and graphs
■ Inferential statistics: use sample data to
draw conclusions about the population
that the sample is meant to represent
(sampling will naturally involve error)
– Estimate parameters and test
hypotheses to make inferences
about the population
– Compare means and evaluate
Analysis of Continuous Data (comparison of means)
Samples                  Parametric test       Non-parametric test
2 related samples        Paired t-test         Wilcoxon test
2 independent samples    Independent t-test    Mann-Whitney U test
Student’s t-test
■ Used to compare means between two
groups
– Related groups: Paired t-test (e.g.
pre- and post- study measures on
the same participants)
– Independent groups:
Unpaired/Independent t-test on two
different groups
– Used more often in research
■ Null hypothesis: the population means of
the two groups are equal (any observed
difference is due to sampling error)
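The distinction between related and independent groups can be sketched in Python with SciPy. The measurements below are hypothetical and only illustrate which function matches which design; they are not data from this lecture.

```python
from scipy import stats

# Hypothetical pre/post measures on the same five participants (related groups).
pre = [24.1, 27.3, 22.8, 30.2, 25.6]
post = [23.5, 26.9, 22.9, 29.1, 24.8]

# Hypothetical measures from two unrelated groups (independent groups).
group_a = [24.1, 27.3, 22.8, 30.2, 25.6, 26.0]
group_b = [22.0, 25.1, 21.7, 24.9, 23.3]

paired = stats.ttest_rel(pre, post)               # paired t-test
independent = stats.ttest_ind(group_a, group_b)   # independent t-test, equal variances assumed

# A paired t-test is a one-sample t-test on the within-participant differences.
differences = [a - b for a, b in zip(pre, post)]
one_sample = stats.ttest_1samp(differences, popmean=0.0)

print(f"paired:      t = {paired.statistic:.3f}, p = {paired.pvalue:.3f}")
print(f"independent: t = {independent.statistic:.3f}, p = {independent.pvalue:.3f}")
```

Feeding related measurements to `ttest_ind` would ignore the pairing, so the choice of function should follow the study design.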
Degrees of freedom
(df)
■ Degrees of freedom (df): amount of
information provided by the data that
can be used to estimate population
parameters and the variability of the
estimates
– df = n – # of estimated parameters
– As df increase, the t-distribution more
closely resembles a normal distribution
■ E.g. One-sample t-test to estimate the
population mean
– Estimates the standard deviation
about the mean
– Uses a t-distribution with df = n – 1
– df = n – 1 for paired t-test as well
(n = number of pairs)
■ E.g. Two sample independent t-test to
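To illustrate how the t-distribution approaches the normal distribution as df increase, the sketch below prints two-sided 95% critical values for a few arbitrarily chosen df and compares them with the normal value. It assumes SciPy is available.

```python
from scipy import stats

# Two-sided 95% critical value of the standard normal distribution (about 1.96).
z_crit = stats.norm.ppf(0.975)

# Critical values of the t-distribution shrink toward z_crit as df grow.
t_crits = {df: stats.t.ppf(0.975, df) for df in (2, 5, 10, 30, 100, 1000)}

for df, t_crit in t_crits.items():
    print(f"df = {df:>4}: t critical = {t_crit:.3f} (normal: {z_crit:.3f})")
```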
T-test
assumptions
■ Samples are independent
■ Variable is normally distributed
■ Variance homogeneity: variance within
each group is equal
– Levene’s test for equality of variances
– Informs you whether to use results for
pooled or unpooled variance
■ t-tests are fairly robust even if assumptions
are not perfectly met
t-statistic
■ Difference between the means divided by the pooled
or unpooled standard error of the difference
Group Statistics
                  Biological sex   N    Mean     Std. Deviation   Std. Error Mean
Body mass index   Male             28   23.974   4.0662           .7684
                  Female           90   22.869   3.9477           .4161
Independent Samples Test (Body mass index)
Levene's Test for Equality of Variances: F, Sig.
t-test for Equality of Means: t, df, Sig. (2-tailed), Mean Difference, Std. Error Difference, 95% Confidence Interval of the Difference (Lower, Upper)
                              F      Sig.   t       df       Sig. (2-tailed)   Mean Difference   Std. Error Difference   Lower    Upper
Equal variances assumed       .747   .389   1.284   116      .202              1.1047            .8603                   -.5992   2.8086
Equal variances not assumed                 1.264   44.009   .213              1.1047            .8739                   -.6565   2.8659
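The t and df values in the output above can be approximately reproduced from the Group Statistics alone. The sketch below is plain Python; the function name `t_from_summary` is a hypothetical helper, and small differences from the SPSS output come from the rounded means and standard deviations.

```python
import math

def t_from_summary(m1, s1, n1, m2, s2, n2, equal_var=True):
    """Return (t, df, standard error of the difference) from group summaries."""
    diff = m1 - m2
    if equal_var:
        # Pooled variance: weighted average of the two group variances.
        pooled_var = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
        se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
        df = n1 + n2 - 2
    else:
        # Unpooled (Welch) standard error with Welch-Satterthwaite df.
        v1, v2 = s1**2 / n1, s2**2 / n2
        se = math.sqrt(v1 + v2)
        df = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    return diff / se, df, se

# BMI by biological sex, from the Group Statistics table (male, then female).
t_pooled, df_pooled, se_pooled = t_from_summary(23.974, 4.0662, 28, 22.869, 3.9477, 90)
t_welch, df_welch, se_welch = t_from_summary(23.974, 4.0662, 28, 22.869, 3.9477, 90,
                                             equal_var=False)

print(f"equal variances assumed:     t = {t_pooled:.3f}, df = {df_pooled}, SE = {se_pooled:.4f}")
print(f"equal variances not assumed: t = {t_welch:.3f}, df = {df_welch:.3f}, SE = {se_welch:.4f}")
```

Because Levene's test is not significant here (Sig. = .389), the "equal variances assumed" row is the one that would normally be reported.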
Confidence interval
(CI)
■ Degree of uncertainty: range around the
sample statistic where the
corresponding population parameter is
likely to be
■ The larger the sample, the narrower the
CI
– If the CI is narrow: Greater likelihood
that the sample statistic (sample
mean) approximates the population
parameter (population mean)
– If the CI for the difference between
means contains 0 (null value) then the
means are not statistically different
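Calculating
CI
■ CI = $\bar{x} \pm z_{1-\alpha/2} \left( \frac{s}{\sqrt{n}} \right)$
– 99% CI: $\bar{x} \pm 2.58 \left( \frac{s}{\sqrt{n}} \right)$
– 95% CI: $\bar{x} \pm 1.96 \left( \frac{s}{\sqrt{n}} \right)$
– 90% CI: $\bar{x} \pm 1.64 \left( \frac{s}{\sqrt{n}} \right)$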
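As a worked example of the formula above, the sketch below computes z-based intervals for the female BMI group from the Group Statistics table (mean 22.869, SD 3.9477, n = 90). The function name is hypothetical, and only the Python standard library is used. Note that software such as SPSS typically uses a t critical value rather than z, which matters more for small samples.

```python
from math import sqrt
from statistics import NormalDist

def z_confidence_interval(mean, sd, n, level=0.95):
    """CI = mean +/- z_(1 - alpha/2) * sd / sqrt(n)."""
    alpha = 1 - level
    z = NormalDist().inv_cdf(1 - alpha / 2)   # e.g. about 1.96 for a 95% CI
    margin = z * sd / sqrt(n)
    return mean - margin, mean + margin

intervals = {level: z_confidence_interval(22.869, 3.9477, 90, level)
             for level in (0.90, 0.95, 0.99)}

for level, (low, high) in intervals.items():
    print(f"{level:.0%} CI: {low:.3f} to {high:.3f}")
```

A higher confidence level gives a wider interval, and a larger n gives a narrower one.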
Reporting independent
t-test results
■ Report the means and standard deviations
for both
groups, t-value, degrees of freedom, and
p-value.
– E.g. Males (mean ± SD): 24.0 ± 4.1,
Females (mean ± SD): 22.9 ± 4.0;
t=1.3, df=116, p=0.20
■ P-values from t-test often reported in
subject characteristics table when
data are compared between two
groups (e.g. males vs. females,
intervention vs. control group)
Analysis of Variance
(ANOVA)
■ Test to determine if means differ between 3
or more groups
– Unlike t-test, uses variance to assess
differences
■ Tests the null hypothesis that the means
of all groups are equal
– F-test will result in rejection of null
hypothesis when variability between
group means is sufficiently larger
than variability within the groups
F-statistic
■ F = variation between sample means / variation within the samples
■ $F = \frac{s_B^2}{s_W^2}$
■ If the ratio is large, it indicates that not all means are
equal -> a significant p-value will result
Degrees of
freedom
■ df1: df associated with the numerator
of the F- statistic
– df = k – 1
– k: # of group means
■ df2: df associated with the
denominator of the F-
statistic
– df = n – k
Between and within
variability
http://blog.minitab.com/blog/adventures-in-statistics-2/understanding-analysis-of-variance-anova-and-the-f-test
F-value that is
derived
http://blog.minitab.com/blog/adventures-in-statistics-2/understanding-analysis-of-variance-anova-and-the-f-test
ANOVA
assumptions
■ Variable is normally distributed
■ The errors are normally distributed
– We will discuss this more in
regression lecture
■ The cases are independent from each
other
■ Variance homogeneity
■ Like the t-test, ANOVA is robust in the
face of relatively minor violations of
these assumptions
Descriptives
Caloric intake (kcal/day)
(Lower Bound and Upper Bound: 95% Confidence Interval for Mean)
            N     Mean       Std. Deviation   Std. Error   Lower Bound   Upper Bound   Minimum   Maximum
Sedentary   19    1640.058   516.7106         118.5415     1391.011      1889.104      694.2     2313.1
Light       16    2030.822   570.8637         142.7159     1726.630      2335.014      1045.2    3027.7
Moderate    45    1999.444   670.7013         99.9822      1797.943      2200.945      1017.8    3925.5
High        38    2183.431   608.9341         98.7822      1983.279      2383.583      1204.9    3781.4
Total       118   2005.082   633.5294         58.3211      1889.580      2120.584      694.2     3925.5
Test of Homogeneity of Variances
Levene Statistic   df1   df2   Sig.
.115               3     114   .951
ANOVA
                 Sum of Squares   df    Mean Square   F       Sig.
Between Groups   3752358.815      3     1250786.272   3.300   .023
Within Groups    43206696.270     114   379006.108
Total            46959055.090     117
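The F value in the ANOVA table can be approximately reproduced from the Descriptives table, using df1 = k – 1 and df2 = n – k. The sketch below is plain Python; the variable names are illustrative, and results differ slightly from SPSS because the group means and standard deviations are rounded.

```python
# (n, mean, standard deviation) of caloric intake per physical activity group.
groups = {
    "Sedentary": (19, 1640.058, 516.7106),
    "Light": (16, 2030.822, 570.8637),
    "Moderate": (45, 1999.444, 670.7013),
    "High": (38, 2183.431, 608.9341),
}

k = len(groups)
n_total = sum(n for n, _, _ in groups.values())
grand_mean = sum(n * mean for n, mean, _ in groups.values()) / n_total

# Between-groups variation: how far each group mean lies from the grand mean.
ss_between = sum(n * (mean - grand_mean) ** 2 for n, mean, _ in groups.values())
# Within-groups variation: spread of observations around their own group mean.
ss_within = sum((n - 1) * sd**2 for n, _, sd in groups.values())

df1, df2 = k - 1, n_total - k
ms_between, ms_within = ss_between / df1, ss_within / df2
f_stat = ms_between / ms_within

print(f"SS between = {ss_between:.1f}, SS within = {ss_within:.1f}")
print(f"F({df1}, {df2}) = {f_stat:.3f}")
```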
ANOVA means
plot
Reporting ANOVA
results
■ Report means and standard deviations for
all groups, as well as the F value, degrees
of freedom, and p-value.
■ Text for this example:
– Analysis of variance indicated that the
different physical activity groups
report different levels of caloric intake
(kcal/day, mean ± SD):
■ Sedentary: 1640.1 ± 516.7
■ Light: 2030.8 ± 570.9
■ Moderate: 1999.4 ± 670.7
■ High: 2183.4 ± 608.9
■ F(3, 114) = 3.3, p = 0.02
Post-hoc tests
(ANOVA)
■ ANOVA tells you if there is a difference
between means, but not specifically
which means differ
– Conduct multiple comparison post-
hoc test to determine which means
differ
■ Many post-hoc tests to choose from (see
this link for list and description:
http://www.statisticshowto.com/post-hoc/)
■ Commonly applied to ANOVA:
– Tukey’s Test
Multiple Comparisons
Dependent Variable: Caloric intake (kcal/day)
LSD
(I) Category of physical activity   (J) Category of physical activity   Mean Difference (I-J)   Std. Error   Sig.   95% CI Lower Bound   95% CI Upper Bound
Sedentary Light -390.7640 208.8913 .064 -804.576 23.048
Moderate -359.3865 * 168.4341 .035 -693.053 -25.720
High -543.3732 * 172.9784 .002 -886.042 -200.704
Light Sedentary 390.7640 208.8913 .064 -23.048 804.576
Moderate 31.3774 179.1933 .861 -323.603 386.358
High -152.6092 183.4713 .407 -516.064 210.846
Moderate Sedentary 359.3865 * 168.4341 .035 25.720 693.053
Light -31.3774 179.1933 .861 -386.358 323.603
High -183.9866 135.6326 .178 -452.674 84.701
High Sedentary 543.3732 * 172.9784 .002 200.704 886.042
Light 152.6092 183.4713 .407 -210.846 516.064
Moderate 183.9866 135.6326 .178 -84.701 452.674
*. The mean difference is significant at the 0.05 level.
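The LSD rows above are unadjusted pairwise t-tests that share the within-groups mean square from the ANOVA table (379006.108, df = 114), with no correction for multiple comparisons. The sketch below reproduces two rows under that reading; `lsd_comparison` is a hypothetical helper, and SciPy is assumed for the t-distribution.

```python
from math import sqrt
from scipy import stats

MS_WITHIN, DF_WITHIN = 379006.108, 114   # from the ANOVA table

# (n, mean caloric intake) per group, from the Descriptives table.
groups = {
    "Sedentary": (19, 1640.058),
    "Light": (16, 2030.822),
    "Moderate": (45, 1999.444),
    "High": (38, 2183.431),
}

def lsd_comparison(group_i, group_j):
    """Return (mean difference I-J, standard error, two-sided p-value)."""
    n_i, mean_i = groups[group_i]
    n_j, mean_j = groups[group_j]
    diff = mean_i - mean_j
    se = sqrt(MS_WITHIN * (1 / n_i + 1 / n_j))
    p_value = 2 * stats.t.sf(abs(diff / se), DF_WITHIN)
    return diff, se, p_value

diff_sl, se_sl, p_sl = lsd_comparison("Sedentary", "Light")
diff_sh, se_sh, p_sh = lsd_comparison("Sedentary", "High")

print(f"Sedentary vs Light: diff = {diff_sl:.3f}, SE = {se_sl:.3f}, p = {p_sl:.3f}")
print(f"Sedentary vs High:  diff = {diff_sh:.3f}, SE = {se_sh:.3f}, p = {p_sh:.3f}")
```

A test such as Tukey's adjusts for the number of comparisons, so it would generally give larger p-values than these.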

Types of
ANOVA
■ One-Way ANOVA: considers only one independent variable
(factor) for independent groups
■ Repeated measures one-way ANOVA: one-way ANOVA for
related (not independent) groups
– E.g. BMI measured every single day for each of the
different groups
■ Multivariate ANOVA (MANOVA): ANOVA with several dependent
variables
■ Factorial ANOVA: compares means across two or more
independent variables (factors)
– E.g. sex and physical activity (2 independent variables)
and their effect on another variable (blood pressure)
Analysis of categorical
variables
■ Chi-square test of independence
– Tests whether there is a significant
association between two or more
categorical variables
■ Fisher’s exact test
– Use when 20% or more of the cells
have expected counts <5
■ Test for trend
– More powerful for ordinal data
Chi-
square
■ Utilizes contingency tables (also
referred to as cross-tabulation,
crosstab, or two-way table)
– 2x2 table when each variable has
2 groups
– 2xk table when one variable has
2 groups and the other has k groups
■ Assesses goodness of fit between
observed values and theoretically
expected values
■ Df = k – 1 for a 2xk table (k=number of
groups/columns); in general,
df = (rows – 1) x (columns – 1)
■ Assumptions:
Biological sex * BMI category Crosstabulation
BMI category
0 1 Total
Biological sex Male Count 20 8 28
% within Biological sex 71.4% 28.6% 100.0%
% within BMI category 20.6% 40.0% 23.9%
% of Total 17.1% 6.8% 23.9%
Female Count 77 12 89
% of Total 65.8% 10.3% 76.1%
Total Count 97 20 117
% of Total 82.9% 17.1% 100.0%
Chi-Square Tests
                               Value    df   Asymptotic Significance (2-sided)   Exact Sig. (2-sided)   Exact Sig. (1-sided)
Pearson Chi-Square             3.421a   1    .064
Continuity Correction b        2.440    1    .118
Likelihood Ratio               3.129    1    .077
Fisher's Exact Test                                                              .084                   .063
Linear-by-Linear Association   3.392    1    .066
N of Valid Cases               117
a. 1 cells (25.0%) have expected count less than 5. The minimum
expected count is
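The Pearson and continuity-corrected statistics above can be checked from the four cell counts of the biological sex by BMI category table. The sketch below assumes SciPy; `chi2_contingency` and `fisher_exact` are SciPy functions, and SciPy's exact-test conventions may differ slightly from those of SPSS.

```python
from scipy import stats

# Rows: Male, Female. Columns: BMI category 0, BMI category 1.
observed = [[20, 8], [77, 12]]

# Pearson chi-square without and with the continuity correction.
chi2, p_value, dof, expected = stats.chi2_contingency(observed, correction=False)
chi2_corrected, p_corrected, _, _ = stats.chi2_contingency(observed, correction=True)

# Fisher's exact test, preferred here because one expected count is below 5.
odds_ratio, p_fisher = stats.fisher_exact(observed)

print(f"Pearson chi-square = {chi2:.3f}, df = {dof}, p = {p_value:.3f}")
print(f"Continuity-corrected chi-square = {chi2_corrected:.3f}, p = {p_corrected:.3f}")
print(f"Smallest expected count = {expected.min():.2f}")
print(f"Fisher's exact test (2-sided) p = {p_fisher:.3f}")
```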
Sex * BMI class. (18 +) / measure - (D, G) Crosstabulation
BMI class. (18 +) / measure - (D, G)
                              UNDERWEIGHT   NORMAL WEIGHT   OVERWEIGHT   OBESE    Total
MALE     Count                79            1665            2161         1347     5252
         % within Sex         1.5%          31.7%           41.1%        25.6%    100.0%
         % within BMI class   31.7%         37.2%           48.5%        41.5%    42.3%
         % of Total           0.6%          13.4%           17.4%        10.8%    42.3%
FEMALE   Count                170           2813            2294         1899     7176
         % within Sex         2.4%          39.2%           32.0%        26.5%    100.0%
         % within BMI class   68.3%         62.8%           51.5%        58.5%    57.7%
         % of Total           1.4%          22.6%           18.5%        15.3%    57.7%
Total    Count                249           4478            4455         3246     12428
         % within Sex         2.0%          36.0%           35.8%        26.1%    100.0%
         % within BMI class   100.0%        100.0%          100.0%       100.0%   100.0%
         % of Total           2.0%          36.0%           35.8%        26.1%    100.0%
Chi-Square Tests
Value   df   Asymptotic Significance (2-sided)
Pearson Chi-Square 130.679 a 3 .000
Association
a. 0 cells (0.0%) have expected count less than 5.
The
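For tables larger than 2x2, the same calculation can be written out by hand: expected count = row total x column total / N, df = (rows – 1) x (columns – 1), and the share of cells with an expected count below 5 indicates whether Fisher's exact test should be considered. The sketch below is plain Python with a hypothetical helper name, applied to the sex by BMI class counts above.

```python
def chi_square_summary(table):
    """Return (Pearson chi-square, df, share of cells with expected count < 5)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2, small_cells = 0.0, 0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
            small_cells += expected < 5
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, df, small_cells / (len(row_totals) * len(col_totals))

# Rows: MALE, FEMALE. Columns: underweight, normal weight, overweight, obese.
bmi_by_sex = [[79, 1665, 2161, 1347], [170, 2813, 2294, 1899]]
chi2_bmi, df_bmi, share_small_bmi = chi_square_summary(bmi_by_sex)

print(f"chi-square = {chi2_bmi:.3f}, df = {df_bmi}, "
      f"cells with expected count < 5: {share_small_bmi:.1%}")
```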
Test for
trend
■ Also referred to as Linear-by-Linear
association
– Tests for trends in contingency
tables larger than 2x2
■ Takes into account ordered nature
of data
– Relates to odds rather than
variances
– Assumes that a change in ranks
makes no difference to the odds of
the outcome (i.e. Odds Ratio = 1)
– Therefore, df = 1
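One common form of the linear-by-linear statistic is M² = (N – 1) r², where r is the Pearson correlation between row scores and column scores, weighted by the cell counts, with df = 1. The sketch below assumes equally spaced integer scores (0, 1, 2, ...); that scoring is an assumption of this example, and other scores can be passed in. It is checked against the 2x2 biological sex by BMI category table, where the SPSS output reports 3.392.

```python
from math import sqrt

def linear_by_linear(table, row_scores=None, col_scores=None):
    """Return (M^2, r) for a contingency table with ordered categories."""
    n_rows, n_cols = len(table), len(table[0])
    row_scores = row_scores or list(range(n_rows))
    col_scores = col_scores or list(range(n_cols))
    cells = [(table[i][j], row_scores[i], col_scores[j])
             for i in range(n_rows) for j in range(n_cols)]
    n = sum(count for count, _, _ in cells)
    mean_r = sum(count * r for count, r, _ in cells) / n
    mean_c = sum(count * c for count, _, c in cells) / n
    cov = sum(count * (r - mean_r) * (c - mean_c) for count, r, c in cells)
    var_r = sum(count * (r - mean_r) ** 2 for count, r, _ in cells)
    var_c = sum(count * (c - mean_c) ** 2 for count, _, c in cells)
    r_xy = cov / sqrt(var_r * var_c)
    return (n - 1) * r_xy**2, r_xy

# 2x2 check: biological sex by BMI category.
m2_small, r_small = linear_by_linear([[20, 8], [77, 12]])
print(f"M^2 = {m2_small:.3f} on 1 df")

# Ordinal example: sex by the six ordered income categories.
income_by_sex = [[542, 2531, 2253, 2331, 1515, 598],
                 [894, 4602, 3153, 1901, 791, 166]]
m2_income, r_income = linear_by_linear(income_by_sex)
print(f"M^2 = {m2_income:.3f} on 1 df, r = {r_income:.3f}")
```

Because the statistic uses the ordering of the categories, it has 1 df regardless of the table size, which is what makes it more powerful than the Pearson chi-square for ordinal data.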
Sex * Total pers. inc. all sources - (D, G) Crosstabulation
Total pers. inc. all sources - (D, G)
                        NO INCOME   LESS THAN 15,000   $15,000-$29,999   $30,000-$49,999   $50,000-$79,999   $80,000 OR MORE   Total
MALE     Count          542         2531               2253              2331              1515              598               9770
         % within Sex   5.5%        25.9%              23.1%             23.9%             15.5%             6.1%              100.0%
         % of Total     2.5%        11.9%              10.6%             11.0%             7.1%              2.8%              45.9%
FEMALE   Count          894         4602               3153              1901              791               166               11507
         % within Sex   7.8%        40.0%              27.4%             16.5%             6.9%              1.4%              100.0%
         % of Total     4.2%        21.6%              14.8%             8.9%              3.7%              0.8%              54.1%
Total    Count          1436        7133               5406              4232              2306              764               21277
         % within Sex   6.7%        33.5%              25.4%             19.9%             10.8%             3.6%              100.0%
         % of Total     6.7%        33.5%              25.4%             19.9%             10.8%             3.6%              100.0%
Chi-Square Tests
                     Value       df   Asymptotic Significance (2-sided)
Pearson Chi-Square   1219.006a   5    .000
Likelihood Ratio     1240.061    5    .000
Association
a. 0 cells (0.0%) have expected count less than 5.
The
