Statistics


1. Introduction

What is statistics?
● The science which deals with the collection, classification and tabulation of
numerical facts as the basis for the explanation, description and comparison of
phenomena.
● It finds tendencies in seemingly incoherent masses of data that can be
generalized to populations with a specified degree of reliability.

Why statistics in medicine


● Medicine is a quantitative science but not an exact one: we have variability.
○ E.g. results of therapy, results of diagnostic procedures, differences
in exposure and its results, differences in symptoms.
● We need to quantify these variations and uncertainties.
● All doctors need to be able to:
○ Summarize and interpret results.
○ Display data.
● Diagnosis and treatment are often probabilistically based.
● We conduct research, collect info, analyze the data, and synthesize
unbiased and replicable results to have a representation of a phenomenon.

How does it work?


● Define the target population-the larger group of people with the condition of
interest.
● Define the type of study to conduct:
○ Observational-prospective, retrospective, cross-sectional.
○ Experimental.
● Define the sample-how we pick subjects from the population, how we
recruit them, and how many people we need to recruit.
● Analyze the data:
○ Descriptive aim-we synthesize the info as %, mean, variation, etc.
○ Inferential aim-generalization; we apply the observations from the
sample to the population.
Research study vs experiment
● Research-a study in which we try to find patterns, associations and
consequences.
● Experiment-a particular type of research where we modify the conditions of
patients, usually randomized clinical trials.
○ We apply a procedure/therapy about which we don't have any info.
○ This raises ethical issues.

Observational study-cross sectional


● A cross-sectional study refers to a single point in time-now.
○ The reference point for both the exposure and outcome variables.
● Most surveys are cross-sectional studies, e.g. researchers who
want to know the present health characteristics of a population might
administer a survey with questions like "how many students smoke on
campus?"

Observational study-retrospective
● A kind of study where we take pts when they already have the disease and we
evaluate exposure in the past, such as a case-control study.
● We recruit pts if they have the condition and ask what happened in the
past.
● Aim-evaluate the association between risk factors and the disease, the
probability of having the disease given some risk factors.
● Ranked low in EBM due to the number of potential biases.

Observational study-prospective
● We follow the pts from a starting point (T0) to a certain time (Tn)
depending on the aim of the study, and we wait for exposure and disease.
○ Sometimes it is also called a cohort study=all pts are homogeneous.
○ E.g. take people with no disease and follow them until they develop it.
● We can have two groups: one with an exposure factor and another without.
● E.g. a group of young people who start smoking, and we follow how many
develop a disease.

Experimental Study
● We have a study group and a control group.
● We have an independent variable (e.g. therapy) and a dependent variable
(e.g. improvement).
● Subjects who participate in the study are assigned randomly to either
group, and then we compare.
● Clinical trial:
○ An experiment performed by a healthcare organization to check the
effects of an intervention against a control in a clinical environment.
○ It is a prospective study.
○ The main way to perform experiments in humans.

The Variable
● A characteristic that takes different values in different people, objects,
places and time.
● The value of a variable can be:
○ A measure taken on a subject-BP, height, weight etc.
○ Answer to a question-sex, town, ethnic group, study group.
○ An observation of something- imaging.
○ A judgment-functional score, index etc.
● Quantitative variables:
○ Continuous- such as BP, these numbers are on a real scale and the
limitation is the instrument that is used.
○ Discrete- expressed only by integers e.g. scoring.
● Qualitative variables:
○ Nominal-such as sex, we only have dichotomization.
○ Ordinal-we have a semi-quantitative scale such as cancer staging.
● Measurement- each variable has a scale such as nominal, ordinal, rank
and quantitative.
○ The latter can be changed into ranked, ordinal or nominal but not
vice-versa.

Inference
● The procedure by which we reach a conclusion about a population on the
basis of the information contained in a sample that has been drawn from
that population.
Population
● The largest collection of values of a random variable, whose results are
unknown before observation.
○ For which we have an interest at a particular time.
○ E.g. asking someone’s height randomly vs asking all above 175 cm.
● Found in the methods part of any article/study.

Sample
● A subset of the population.
● A well chosen sample will contain most of the info about a particular
population parameter but the relation between the sample and the
population must be such to allow true inferences to be made.
○ Sample must have the main characteristics of the population.
● Population parameter-a descriptive characteristic of the population such as
mean, proportion or %.
● Simple random sample:
○ A random selection.
○ Each subject has a known, non-zero chance of being included in the
sample, and it is equal among the subjects.
○ If we remove a person from the sample, the probability for the next
extraction is different.
■ In a big sample we can ignore the change in the probability.
○ If a number pops up twice (re-admission) it can be counted twice.
● Stratified sample:
○ A particular methodology to build a sample.
○ Some groups can give biased results.
○ We need to consider conditions of the pts or sex because the
probability of an event could change with age, co-morbidities etc.
■ We need to take these into account that could affect the
results of the study.
○ We divide the population into 2 or more strata and then we take a
simple random sample e.g height of surgeons between sexes.
■ We then see the % of each group and depending on the % we
multiply it by the sample size to have a more reliable sample.
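
As a concrete illustration of proportional allocation, here is a minimal Python sketch; the `sex` column and the 70/30 split are made-up assumptions, not data from the course:

```python
import pandas as pd

# Hypothetical population frame with a 'sex' stratum column (illustrative only).
population = pd.DataFrame({
    "sex": ["F"] * 700 + ["M"] * 300,
    "height": [165] * 700 + [178] * 300,
})

n = 50  # desired total sample size

# Proportional allocation: each stratum contributes its population share,
# then a simple random sample is taken inside each stratum.
sample = (
    population
    .groupby("sex", group_keys=False)
    .apply(lambda g: g.sample(round(len(g) / len(population) * n), random_state=1))
)
print(sample["sex"].value_counts())  # ~35 F, ~15 M
```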
Randomization
● In a randomized trial we don't mean that the pts come from a random
sample; we have eligible pts who want to participate, and then we
randomly allocate the treatments.
● A procedure that gives each pt the same probability of being assigned to any
group.
○ All the probabilities are equal, so the only difference will be due to the
treatment/control.
2. Descriptive Statistics

What is Descriptive Statistics


● The analysis of data that helps describe, show or summarize data in a
meaningful way such that, for example patterns might emerge from the
data.
● Summarize the essential information in a data set into a few numbers that
can be communicated verbally.

Why?
● Enables us to present the data in a more meaningful way, which allows
simpler interpretation.
○ Raw data presentation is hard.
● However, do not allow us to make conclusions beyond the data.

Summarize Qualitative Variables


● Example: how many women have age-related sterility? A yes/no (Y/N) variable.
● We count how many women have age-related sterility.
● Then we find the percentage by dividing the count by the total number
of women.
● Then we can show the percentages in a pie chart.
● Another way to graph the percentages is a bar chart; in this
example specifically they made 3 groups depending on time.

Summarize Quantitative Variables and Frequency Distribution


● Before any statistical calculation, the data should be tabulated or plotted.
● Useful summarization may be achieved by grouping the data:
○ Distribution table.
○ Stem and leaf plot.
○ Histogram.
○ Box plot.

The Distribution Table


● To build the table we need to define the number of rows (how many
classes, K) and their intervals, e.g. changing age to age class.
○ By intuition, looking at the variables or frequencies.
○ Or we can use the Sturges formula: K = 1 + 3.322·log10(N), with
N = number of observations.
● The width is the range divided by the number of classes:
○ 𝑊 = 𝑅𝑎𝑛𝑔𝑒/𝐾.
● To avoid boundary ambiguity, e.g. if a woman has 2 oocytes do you put her
in the 0-2 class or the 2-4 class? We extend the first class to 0-2.5 and the next
class will be 2.5-4.5.
● Absolute frequency-the count of the entire numbers in the group.
● Relative frequency-the ratio between the absolute and the entire sample,
summing into 1 or 100%.
● Cumulative frequency-the sum of absolute frequencies of the previous
classes to the subsequent class.
● Cumulative relative frequency-ratio between cumulative frequency and the
total sample.
● This data can then be demonstrated on a histogram, which is continuous
since we are talking about a continuous variable.
○ The heights correspond to the data we are showing e.g relative
frequency.
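
A minimal Python sketch of such a table, using made-up ages and Sturges' rule for the number of classes (pandas is assumed to be available):

```python
import numpy as np
import pandas as pd

ages = np.array([23, 25, 31, 34, 35, 35, 38, 41, 44, 45, 47, 52])  # made-up data

k = int(1 + 3.322 * np.log10(len(ages)))        # Sturges' rule for the number of classes
bins = np.linspace(ages.min(), ages.max(), k + 1)
classes = pd.cut(ages, bins=bins, include_lowest=True)

table = pd.Series(classes).value_counts().sort_index().to_frame("absolute")
table["relative"] = table["absolute"] / len(ages)      # sums to 1 (or 100%)
table["cumulative"] = table["absolute"].cumsum()
table["cum_relative"] = table["cumulative"] / len(ages)
print(table)
```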

Summary Measures of Quantitative Variables


● We need to synthesize data with few numbers, these are divided into 2
classes:
○ Central tendency:
■ Arithmetic mean.
■ Geometric mean.
■ Harmonic mean.
■ Median.
■ Mode.
○ Measures of dispersion:
■ Variance and standard deviation.
■ Coefficient of variation.
■ Range.
■ Percentiles (quartiles).

Mean
● Sum of all values divided by the number of subjects.
● E.g. mean age of the 509 women in the study.
Median
● The median of a finite set of values is that value which divides the set into
two equal parts such that the number of values equal to or greater than the
median is equal to the number of values equal to or less than the median.
● We need to sort the data and find the value that leaves the same numbers
of observations before and after.
○ If the number of observations is odd: it is the central value.
■ Its position is (n+1)/2 in the sorted data.
○ If the number is even: the mean of the two central values.

Mode
● Mode of a set of values is that value which occurs most frequently, but it
doesn't mean that the class is the most frequent.
● A set of values may have more than one mode.
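
A short sketch of the three measures with Python's standard `statistics` module, on a made-up sample:

```python
import statistics

values = [2, 3, 3, 4, 5, 7, 9]             # made-up sample

print(statistics.mean(values))              # sum of all values / number of subjects
print(statistics.median(values))            # central value (mean of the two central ones if n is even)
print(statistics.mode(values))              # most frequent value
print(statistics.multimode([1, 1, 2, 2]))   # a set may have more than one mode
```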

The Histogram-Skewness
● We should look at the form of the distribution.
● If the long tail is on the right: positively skewed, or skewed to the right.
○ Usually of no consequence, but sometimes the median can be lower
than the mean.
● If the long tail is on the left: negatively skewed, or skewed to the left.
○ Along the axis we first have the mean and then the median.
● In a symmetrical distribution, mean=mode=median.

Kurtosis
● A measure of the degree to which a distribution is peaked or flat.
● Neither peaked nor flat: mesokurtic distribution (kurtosis ≈ 0, like the normal).
● More peaked than the normal: leptokurtic, kurtosis>0.
● Flatter than the normal: platykurtic, kurtosis<0.

Measures of Dispersion
● The dispersion of a set of observations refers to the variety that they
exhibit.
● Conveys info regarding the amount of variability present in the set.
● Range-the difference between the largest and smallest value.
○ The bare difference is rarely reported in papers; we usually see “smallest-largest”.

Variance and Standard Deviation (SD)


● SD is the summary measure of the differences of each observation from the
mean of all observations.
○ If the differences were simply summed up they would give 0, thus we
square the differences.
○ We divide by the number of observations minus 1 (the degrees of freedom).
○ Then we take the square root to get back to the original units:
s = √( Σ(xi − x̄)² / (n − 1) ).
● Variance is the square of the SD.
● E.g. if the mean age is 35.8 and the SD is 4, we could still find women aged
40 because of this deviation.
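
The same computation step by step in Python, on made-up ages; `ddof=1` is NumPy's way of selecting the n−1 denominator:

```python
import numpy as np

ages = np.array([32, 35, 36, 38, 40])  # made-up sample

mean = ages.mean()
# Sample variance: squared deviations from the mean, divided by n-1 (degrees of freedom).
variance = ((ages - mean) ** 2).sum() / (len(ages) - 1)
sd = variance ** 0.5                   # square root brings us back to the original units

# NumPy gives the same results when ddof=1 selects the n-1 denominator.
assert np.isclose(variance, ages.var(ddof=1)) and np.isclose(sd, ages.std(ddof=1))
print(mean, sd)
```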

Coefficient of Variation
● Also called the relative SD, it is the ratio between the SD and the arithmetic
mean: CV = s/x̄, often expressed as a %, and it has no units.
● Useful because we can compare SDs among variables which have
different units of measurement; however it may lead to fallacious results.
● The higher the number, the higher the variation.

Percentiles
● Called position or location parameters, as they are values of the variable
which designate certain positions on the horizontal axis when the
distribution of the variable is graphed.
● Given a set of n observations x1, x2, x3...xn, the pth percentile P is the
value of X such that p percent or less of the observations are less than P
and (100−p) percent or less of the observations are greater than P.
○ P=the percentile.
○ E.g. P=10%: we need to find the value of age such that 10% of women are
younger, and 90% are older.
○ E.g.2 the P50 percentile: we need to find the number of oocytes such that 50%
of observations are lower and 50% are higher.
● How to determine it:
1. Sort the data.
2. Define the P of interest.
3. Multiply N by P%.
4. Find the observation at the position determined in point 3.
a. For even/odd cases proceed as for the median.
5. Read off its value.
● The 50th percentile is the median.

Quartiles
● P25-the 25th percentile; it corresponds to the first quarter of the sorted
distribution, so it is called the first quartile, Q1.
● P50-the median.
● P75-the 75th percentile; it corresponds to the third quarter of the sorted
distribution, so it is called the third quartile, Q3.
● The difference between P75 and P25 is the interquartile range, holding the
central 50% of the data; it is a measure of variability.
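
A small Python sketch of percentiles and the interquartile range on made-up ages (note that `np.percentile` interpolates between observations by default, so it may differ slightly from the hand procedure above):

```python
import numpy as np

ages = np.array([23, 25, 28, 30, 31, 33, 35, 38, 41, 45])  # made-up, already sorted

q1, median, q3 = np.percentile(ages, [25, 50, 75])
iqr = q3 - q1                   # interquartile range: holds the central 50% of the data
p10 = np.percentile(ages, 10)   # 10% of women are younger than this age
print(q1, median, q3, iqr, p10)
```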

The Box and Whisker Plot


● Summarizes all the quartiles we mentioned.
● Whiskers are usually drawn at the minimum and maximum values, or, if there
are extreme values, at the most extreme observations within the calculated fences.
● Points beyond the calculated whiskers are drawn individually as outliers.

Geometric Mean
● Sometimes data can be summarized with the geometric mean if
they display skewness.
● It is the nth root of the product of all n values.
● Equivalently, it can be calculated as the arithmetic mean of the data
transformed with the natural log, then back-transformed with the exponential.
3. Basic of Probability and Probability Distributions

General Ideas
● Much of what happens in our lives varies and involves chance.
● We use probability every day, e.g. what are the premises for certain
conditions, what side effects should I expect.
● Probability is a number between 0 and 1, always non-negative; an event that
cannot occur has probability 0.
● A certain event has probability 1.

Classical Probability
● If a process can result in N mutually exclusive and equally likely outcomes,
and M of these possess a trait E, then the probability of E is M/N.
○ 𝑃(𝐸) = 𝑀/𝑁.
○ E.g. rolling a die and wanting a ‘1’: this is a simple event.
■ If we want odd numbers: a composite event.

Relative Frequency Probability


● The classical definition is less used in biostatistics; instead we use relative
frequency probability.
● If a process is repeated a large number of times (N), and some resulting
event (E) occurs M times, then the relative frequency probability is
𝑃(𝐸) = 𝑀/𝑁.
● E.g. what is the probability of having age-related infertility?
P(E)=76/509=14.9%.

Bayesian Method
● Founded on updating probabilities based on new information.
○ Prior probability-based on prior experience or derived from data.
○ Posterior probability-obtained by using new information to update or
revise prior probabilities.
● Largely applied in automatic diagnosis and in evaluation of diagnostic
tests.

Elementary Properties
● Given some process with n mutually exclusive outcomes (events) E1,
E2,..., En, the probability of Ei is a non-negative number.
○ And the sum of these probabilities equals 1.
● We always need to look at the numerator and denominator in a table.
● Marginal probability-at the border of the table, when we use a total to
determine the probability.
● Complementary probability-determined as the total probability (1) minus
the one we have determined.
○ E.g. lung cancer P=0.31; what is the probability for the others? 1−0.31.
● Conditional probability-when we use a subset of the total to determine the
probability.
○ E.g. what is the P of lung cancer in the unemployed? P=30/100.
● Joint probability-when we ask the probability that a subject has 2
characteristics at the same time, e.g. having lung cancer and being
unemployed.
● Multiplication rule-given two events A and B, the probability that A and B
occur together can be calculated by multiplying the appropriate marginal
and conditional probability: P(A∩B) = P(A|B)·P(B).
○ Depending on what affects what, we determine which is the marginal.
● Independent events-if event A occurs regardless of B, they are independent
and P(A|B)=P(A).
● Addition rule-if we have two events, the probability that event A, or event
B, or both occur is equal to the probability that A occurs, plus the probability
that B occurs, minus the probability that these events occur simultaneously:
P(A∪B) = P(A) + P(B) − P(A∩B).

Probability Distribution
● When we measure a variable we collect data from pts: we observe a
random variable, meaning we don't know the results in advance.
● Each value of the random variable occurs with a certain probability (P),
and this can be described using a probability distribution.
● Probability distribution-a table, graph, formula, or other device used to
specify all possible values of a discrete random variable along with their
respective probabilities (a discrete variable can't take a value like 1.5, for example).
● Very similar to relative frequency.
○ E.g. the number of drugs taken is the event (A) with probability P(A),
which is its relative frequency.
Probability Distribution: The Binomial Distribution: Bernoulli Process
● Example: 85.8% of the pregnancies had delivery in week 37 or later; if we
randomly select 5 from this population, what is the probability that exactly 3
of these will be full-term?
● A sequence of Bernoulli trials forms a Bernoulli process under the following
conditions:
a. Each trial results in one of two possible, mutually exclusive
outcomes: one is success, one is failure (full term vs preterm).
b. The probability of success, denoted by p, remains constant from
trial to trial; the probability of failure, q, is 1−p (here p=0.858).
c. The trials are independent: the outcome is not affected by the
outcome of another trial.
● X=count (number of full term) and n=number of trials.
● Average-what we expect from the process: np.
● Variance-how much more or less than the expected value we may
observe: npq.
● Each binomial distribution has two parameters, n and p: B(n, p); the SD
is σ = √(npq).
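
The worked full-term example can be checked with SciPy; the numbers below follow the example in the text (p=0.858, n=5, x=3):

```python
from scipy.stats import binom

n, p = 5, 0.858                # 5 sampled pregnancies, P(full term) = 0.858
print(binom.pmf(3, n, p))      # P(exactly 3 full-term) ≈ 0.127
print(binom.mean(n, p))        # np = 4.29, the expected count
print(binom.var(n, p))         # npq ≈ 0.609
```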

Probability Distribution: Poisson Distribution


● The occurrences of the event are independent-occurrence in an interval of
space or time has no effect on the probability of a second occurrence of the
event in the same or any other interval.
● Theoretically, an infinite number of occurrences of the event must be
possible in the interval.
● The probability of a single occurrence of the event in a given interval
is proportional to the length of the interval.
● In a small portion of the interval, the probability of more than one
occurrence is negligible.
● λ is the parameter of the distribution, the average count of occurrences:
P(X=x) = e^(−λ)·λ^x / x!, with e=2.7183 and x the count for which we
want the probability.
● The average and the variance are equal, and both equal lambda.
● It can be applied, for example, to the incidence of a disease in a year.
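
A minimal sketch with SciPy; λ=3 is an assumed average yearly count, not a value from the course:

```python
from scipy.stats import poisson

lam = 3                          # assumed average yearly incidence of a disease
print(poisson.pmf(5, lam))       # P(exactly 5 cases in a year) = e^-3 * 3^5 / 5! ≈ 0.1008
print(poisson.mean(lam), poisson.var(lam))  # both equal lambda
```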
Probability Distribution: The Normal Distribution
● Very important because almost every quantitative variable approaches a
normal distribution under certain conditions.
● The area under the line of the histogram represents the whole frequency.
● The smaller we take the intervals in the histogram, the smoother the line
becomes; the resulting curve is also called the Gaussian distribution,
bell shaped.
● Properties of the curve:
○ The mean corresponds to the center of the curve.
○ Mean=median=mode; the median divides the area into 50% and 50%.
○ There is a point on the curve, on both sides of the mean, where it
changes its concavity; between these points we have 68% of the area.
■ This distance from the mean equals one SD.
● For quantitative continuous variables, we cannot ask for the probability of
a single value but only of an interval, such as age>54.
● Different means but similar SD=same shape, shifted along the axis.
● The shape changes when we have a similar mean but different SD.
● When we say that a variable X approaches normal, we say that it is
defined by its mean and SD (or variance): 𝑋 ∼ 𝑁 (µ, σ).
● If we limit the area by µ − 2σ and µ + 2σ we will have an area of 95%.

Probability Distribution: The Standard Normal Distribution


● The center, or the mean, is located at 0, with an SD of 1.
● This distribution is tabulated: for each value on the X axis, z, the table
gives the corresponding probability.
● If we have a variable approaching a normal distribution with mean and
SD, we can apply the formula Z = (X − µ)/σ, which gives a variable with
mean 0 and SD 1.
○ Standardization of the value of X: 𝑍 ∼ 𝑁(0, 1).
● Following the previous example, Z = (54 − 44)/9.3 = 1.08, and from the
table we find P ≈ 14%.
● If we want the probability of a central interval, we can take 1 and subtract
the two tails.
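
The age example above, reproduced with SciPy (µ=44 and σ=9.3 are the values used in the text):

```python
from scipy.stats import norm

mu, sigma = 44, 9.3          # values from the age example above
z = (54 - mu) / sigma        # standardization: Z ~ N(0, 1)
print(z)                     # ≈ 1.08
print(norm.sf(z))            # P(Z > 1.08) ≈ 0.14, the upper-tail area
# Probability of an interval: 1 minus the two tails, e.g. P(34 < X < 54):
print(norm.cdf(54, mu, sigma) - norm.cdf(34, mu, sigma))
```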

Binomial Approximate Normal


● What is important is the chance to approach a normal distribution and to
standardize.
● We can describe an event by a binomial distribution with n trials and
probability of success p, but as n becomes very large it approaches a
normal distribution with mean np and variance npq.
● When p is also close to 0.5, the binomial distribution can be approximated
by a standard normal after standardization.
● So a binomial distribution can be approximated by a normal if:
○ n is large; usually more than 30 is enough for one variable.
○ p is close to 0.5.
4. Sampling Distributions

General Ideas
● The distribution of all possible values that can be assumed by some
statistic, computed from samples of the same size randomly drawn from
the same population, is called the sampling distribution of that statistic.
● When we have a population we can draw many samples from it, so we
can study the distribution of a statistic across these samples.

Distribution of the Sample Mean


● The arithmetic mean of the population is μ, while the arithmetic mean of
the sample is X̄.
● The SD of the population is σ, whereas the SD of the sample is S.
○ The degrees of freedom in this case will be just N.
● Out of a population of 10 we can draw many samples, each with its own
sample mean (X̄) and SD (S).
○ Thus the mean can be treated as a variable, since it changes from
sample to sample.
● Then we can divide the axis into classes of mean age and count the samples
in each interval of mean values, reaching 1000 samples (in this example).
○ Then we draw a histogram of the relative frequencies, which
appears symmetrical.
● The mean of all the means of a normally distributed variable is equal to
the mean of the population.
● We can also calculate the variance of all the sample means, and the SD
by taking the square root.
● The SD of all the sample means equals the SD of the population divided
by the square root of the sample size: σ/√n.
○ Also called the standard error of the mean.
● Central limit theorem-given a population of any non-normal functional
form, with mean μ and finite variance σ², the sampling distribution of the
sample mean computed from samples of size n from this population will
have mean μ and variance σ²/n, and will be approximately normally
distributed when the sample size is large.
○ We can standardize this.
○ Under this condition, when n is very large, the sample mean of any
variable approaches normal and can be standardized, which is
useful for inference.
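
A quick simulation sketch of the central limit theorem; the exponential population and all sizes are arbitrary assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
population = rng.exponential(scale=10, size=100_000)  # clearly non-normal population

n = 50
sample_means = rng.choice(population, size=(1000, n)).mean(axis=1)

# Mean of the sample means ≈ population mean; their SD ≈ sigma / sqrt(n)
# (the standard error of the mean).
print(population.mean(), sample_means.mean())
print(population.std() / np.sqrt(n), sample_means.std())
```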

● Characteristics of the main sampling distributions:


● The sample proportion is p̂ = x (number of successes)/n (sample size);
p̂ is the estimate of p.
○ If p is near 0.5 and n approaches infinity, it can be treated as
normal.
● The standard error of a sample statistic expresses the variability of that
statistic, and depends on the variability of the population.

Estimation
● Statistical inference is the procedure by which we reach a conclusion about
a population on the basis of the info contained in a sample drawn from that
population.
● Calculating from the data of a sample some statistic that is offered as an
approximation of the corresponding parameter of the population.
● Point estimate-a single numerical value used to estimate the
corresponding population parameter, a point on a real X-axis.
○ Don't let us draw any conclusions on the population.
● Interval estimate-consists of two numerical values defining a range of
values that, with a specified degrees of confidence, most likely include the
parameter being estimated.
○ Information on population.
○ In this case we need a lower mean and a higher mean that with a
certain degree of confidence include the parameter.
● The degree of confidence is a probability: 𝑃 = 1 − α.

Estimation: CI for a Mean with Known SD


● We need 2 numerical values, also called limits (L1, L2), that with a
specified confidence include the parameter.
● Alpha (α) is the area in the tails of the distribution.
● Since we are talking about age, which is normally distributed, we can
say that X̄ ∼ 𝑁(µ, σ/√𝑛).
● We usually use a CI of 95%, and we need to find L1 and L2 for it.
● Every time we choose a sample, in 95% of the cases the interval built
around the sample mean will contain the population mean; some extreme
samples will not represent the true mean of the population.
● In practice we compute only one CI, from our single sample.
● The limits of the interval are: X̄ ± z(1−α/2) · σ/√n.

Estimation: CI for a mean with unknown SD


● A proper solution is to use the sample SD (S).
● We use a new sampling distribution, which is not the standard normal but
the t-student distribution, with µ = 0 and SD changing with the degrees
of freedom (n−1).
○ When the degrees of freedom reach infinity we get the standard
normal.
● In this case the CI will be equal to: X̄ ± t(1−α/2, n−1) · S/√n.
○ From the t-student distribution table.
■ If n is large enough we can assume that t≈2.
■ The shape of the curve depends on the DF.
■ Mean=0 and symmetrical around it.
■ Variance>1.
■ t ranges from minus infinity to plus infinity.
● Using this we can define a CI in which, with 95% confidence, the mean of
the population will fall.
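
A minimal sketch of a 95% CI for a mean with unknown σ, on made-up ages:

```python
import numpy as np
from scipy import stats

ages = np.array([32, 35, 36, 38, 40, 41, 44])    # made-up sample
mean, se = ages.mean(), stats.sem(ages)           # sem = S / sqrt(n)

# 95% CI with unknown sigma: mean ± t(0.975, n-1) * S/sqrt(n)
t = stats.t.ppf(0.975, len(ages) - 1)
print(mean - t * se, mean + t * se)

# Equivalent one-liner:
print(stats.t.interval(0.95, len(ages) - 1, loc=mean, scale=se))
```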

Interpretation of CI
● We don't know whether the population mean lies within the calculated
interval, but we may act, in practice, as if it does because the interval
estimator is successful in capturing the mean in 95% of samples from the
population to which it is applied.
● Smaller intervals mean a more precise estimation.

Precision of the Interval


● Depends on:
○ Sample size: larger samples give more precise intervals.
■ As related to the equation, the size is in the denominator.
○ Variance: small variance, that is more homogenous in sample
values, allows more precise intervals.
○ Confidence level: the higher the level, the larger the width.
■ As “t” becomes bigger it increases the width, but then the single
mean we calculate may be far from the true mean.
■ Extremely wide intervals are never useful for inference and science.

CI for the Difference of Means with Known σ


● The standard error of the estimator is: SE = √(σ₁²/n₁ + σ₂²/n₂).
● The CI will be the difference between the means ± the product of Z
(reliability coefficient) and the standard error.

CI for the Difference of Means with Unknown σ


● We use the sample SDs and we determine the pooled variance:
s_p² = [(n₁−1)s₁² + (n₂−1)s₂²]/(n₁+n₂−2).
○ p=pooled.
● The DF in this case is n1+n2-2.
● We use the t-student table in this case.
CI for a Proportion
● It is the ratio between the count and the total sample size.
● If n approaches infinity and p approaches 0.5, the binomial distribution
approaches a normal distribution.
● The CI will be the estimator (sample proportion) ± Z (reliability
coefficient) multiplied by the standard error: p̂ ± z·√(p̂q̂/n).
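
A sketch using the age-related infertility counts quoted earlier in these notes (76 of 509):

```python
import numpy as np

x, n = 76, 509                           # 76 of 509 women with age-related infertility
p_hat = x / n
se = np.sqrt(p_hat * (1 - p_hat) / n)    # standard error of a proportion
z = 1.96                                 # reliability coefficient for 95%
print(p_hat - z * se, p_hat + z * se)    # ≈ (0.118, 0.180)
```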

CI for the Difference of Two Proportions


● The estimator is the difference of the two sample proportions; the CI is
p̂₁ − p̂₂ ± z·√(p̂₁q̂₁/n₁ + p̂₂q̂₂/n₂).
5. Hypothesis Testing-A Single Population Mean

Introduction
● When we test hypotheses, most of all we speak about decisions.
● We try to reach a conclusion concerning a population.
○ With the starting point of the sample of course.
● It may be defined simply as a statement about one or more populations.
● The research hypothesis is the idea of the researcher and the aim of the
research.
● The research hypothesis should be modified into an appropriate,
statistical hypothesis which can be evaluated by appropriate statistical
approach.

Hypothesis Testing Steps


● Data:
○ We look at the data.
○ We look at what is the definition of the data, what type or property of
data.
■ E.g. BMI is quantitative continuous.
● Assumption I:
○ We assume what is the distribution in the population and probability
distribution.
● Assumption II:
○ Related to the design of the study, how we draw samples.
○ When the choice is random, and the subjects in each group are
different, we call it independent samples.
■ Paired samples can be when we revisit the same group of
people.
● Assumption III:
○ Relative to the population parameters.
■ Such as variance, mean or any other that is of interest in
these experimental conditions.
○ Whether we know the variance, and whether it is equal between the groups.
● Hypothesis:
○ The research hypothesis is what we are interested in, whereas the
statistical hypothesis claims that there is no difference, e.g. no
difference between pea protein and other supplements.
■ Also called the null hypothesis (H0).
○ The null hypothesis is tested against the alternative hypothesis,
which is what we propose.
■ When written as '≠' it is called a two-sided hypothesis.
■ In some research we modify it depending on the research
hypothesis, e.g. using '>'.
● Then it is called a one-sided hypothesis.
● Test:
○ Also called the test statistic; it depends on the previous steps.
○ We calculate a number.
○ It also has a distribution, which here is the standard normal.
● Decision rule:
○ The distribution of the test statistic, if H0 is true, is the standard normal.
○ The decision rule tells us to reject the null hypothesis if the value of
the test statistic we compute from our sample is unlikely to occur
when H0 is true.
■ Usually in this case Z should be near 0; if we get 2.5, for
example, it is too far from 0, thus we need a threshold.
○ Type I error-the probability of rejecting a true null hypothesis.
○ It depends on the level of significance (α), which is the area in the
tail of the distribution; usually α<0.05, so Z must be above about 2 (1.96).
○ Depending on the hypothesis, we look at two tails or one tail
(two-sided/one-sided hypothesis).
○ Type II error-the area β is the probability of not rejecting a false null
hypothesis.
■ 1−β is called the power: rejecting a false null hypothesis is
a good decision.
○ The sample size depends on:
■ The type I error we accept.
■ The power that we want for our experiment.
■ The effect size, or the distance between H0 and HA.
● The smaller the sample size, the higher the type II error.
○ The p value is the probability that the computed value of the test
statistic is at least as extreme as a specified value of the test statistic
when the null hypothesis is true.
● Statistical decision-rejecting or not the null hypothesis.
● Drawing clinical conclusions.
● When you choose a lower alpha you become more conservative and reject
in fewer situations.
● When we have unknown σ we use the SD of the sample instead and
T-student.
6. Comparison of Means

Comparison of Two Means, Known σ


● The data are a quantitative variable and are normally distributed.
● We have two samples, so we assume that they are independent, and we
know the SDs.
● H0: the mean of the first equals that of the second.
● The test statistic is normally distributed (z).

Comparison of Two Means, Unknown σ


● Assumptions: the variable is normally distributed in the populations, the
population SDs are unknown, and the variances should be equal.
● H0: µ1=µ2, HA: µ1 ≠ µ2.
● Two-sided alternative hypothesis.
● Test statistic: a t computed using the pooled variance.
○ DF=n1+n2-2 for the tabulated t.
● If the variances are not equal by the Fisher test, we use the separate
variances (Welch's approach).
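
A minimal sketch of this decision path with SciPy; the data are made up:

```python
import numpy as np
from scipy import stats

group1 = np.array([5.1, 5.8, 6.2, 6.9, 7.0])   # made-up samples
group2 = np.array([4.2, 4.9, 5.0, 5.5, 6.1])

# Fisher test for equality of variances decides which t-test to use:
f = group1.var(ddof=1) / group2.var(ddof=1)
p_var = 2 * min(stats.f.sf(f, 4, 4), stats.f.cdf(f, 4, 4))   # two-sided F p-value

# equal_var=True -> pooled variance (DF = n1+n2-2); False -> Welch's separate variances.
print(stats.ttest_ind(group1, group2, equal_var=p_var > 0.05))
```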

Equality of Variances
● We need to evaluate this equality, so we set up a new system of hypotheses
with the null hypothesis that the variances are equal.
● H0: σ₁² = σ₂² and H1: σ₁² ≠ σ₂².
● The test statistic is F, the ratio of the sample variances; the Fisher
distribution is an asymmetrical distribution with a long tail on the right.
○ So we use only one tail, with α 0.05.
● In the Fisher table, one margin gives the DF of the numerator and the
other the DF of the denominator.
● If we do not reject H0, we can pool the variances.

Comparison of Paired Means


● In this example we have before and after treatment measurements for 10 pts.
● Assumptions: non-independent (paired) samples, the variable is normally
distributed, the population σ is unknown.
● H0: the mean difference before and after = 0; HA: the difference before
and after is greater than 0.
● Test statistic: in this case it concerns the mean of the differences
(µ𝑑 of the population).
● Distribution of the test statistic: t-student with n−1 DF.
● We have a one-sided hypothesis, so all of alpha is in one tail.
● Why do we choose paired comparisons?
○ When we want to avoid the effect of an extraneous source of
variation that could affect the result.
○ Before and after treatment.
○ Crossover design (two drugs on the same subject in two different
periods).
○ Two methods (alcohol levels with blood sample and breathalyzer).
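
A paired t-test sketch on made-up before/after values (the `alternative` argument needs a recent SciPy):

```python
from scipy import stats

before = [120, 131, 118, 140, 125, 133, 129, 141, 122, 135]  # made-up BP values
after  = [112, 127, 115, 131, 120, 129, 124, 133, 119, 130]

# Tests H0: mean difference = 0 on the pairwise differences, t with n-1 DF.
# 'greater' = one-sided alternative that the values before exceed those after.
print(stats.ttest_rel(before, after, alternative="greater"))
```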
7. Comparison of Two Means: Non-parametric, Distribution-free Methods

Non-parametric Tests
● If the distribution of the variable cannot be considered normal, comparing
groups or searching for correlation can be done through distribution-free
statistics.
● Often these are based on a rank transformation of the data.

Non-parametric Test-independent Sample Mann-Whitney-Wilcoxon Test


● Assumptions: the groups are independent and not normally distributed.
● H0 is based on the medians: M(NF)=M(RE), whereas HA: M(NF)≠M(RE).
○ The example is a comparison of scores between neurofeedback and
standard rehabilitation programs.
● Test statistic: the sum of ranks.
● Decision rule: two-sided α = 0.05, with the interval calculated from the table.
● We sort all the data, from lowest to highest, and assign a rank to each from
1 to N, the total sample size.
○ Tied scores receive the average of their ranks.
● We sum the ranks by group.
● Then we look up the interval in the Mann-Whitney-Wilcoxon table using
the sample sizes.
● If the rank sums fall within the interval, we don't reject the null
hypothesis.

Non-parametric Test-paired Sample Wilcoxon Signed Rank Test


● Data example: score at a neurophysiological test before and after a rehab
program.
● Assumptions: the samples are paired and the variable is not normally
distributed in the population.
● H0: µ𝑑 = 0 and HA: µ𝑑 > 0.
○ Again this relates to the median rather than the mean.
● Test statistic: the sum of signed ranks; here it is one-sided with α = 0.05.
● Note that ranks are given to the absolute differences; hence, if two
differences are equal but one negative and one positive, their rank will be
the same.
○ No difference, no rank.
● The Wilcoxon table uses the number of actual (non-zero) differences
when taking the DF.
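
Both rank tests are available in SciPy; a minimal sketch on made-up scores (the group names echo the neurofeedback example, but the numbers are invented):

```python
from scipy import stats

nf = [14, 18, 21, 25, 27]       # made-up scores, neurofeedback group
re = [10, 12, 15, 17, 22]       # standard rehabilitation group

# Independent samples, not normally distributed -> Mann-Whitney-Wilcoxon.
print(stats.mannwhitneyu(nf, re, alternative="two-sided"))

before = [10, 12, 15, 17, 22]
after  = [14, 18, 15, 25, 27]
# Paired samples -> Wilcoxon signed rank test (zero differences are dropped).
print(stats.wilcoxon(before, after))
```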

8. Analysis of Variance

ANOVA Introduction
● A technique to evaluate the partitioning of the total variance in a set of data
into two or more components.
● We can determine the variability in the data, but this could depend on
different factors.
○ The association between the variance components and their sources
allows us to say something about the magnitude of each.
● We analyze which source is the most important.
● A model is a symbolic representation of the typical values of the data: an
equation expressing the relation between variables.
● Applied when we have a normal distribution.
Example
● A sample with 3 groups in which we check weight; the groups are defined
by the answer to the question of whether subjects pay attention to
carbohydrates and proteins:
○ “Yes”.
○ “More or less”.
○ “Not at all”.
● There is variance among groups and among subjects, depending on their
own randomness and on the group ⇒ partitioning of the total variance.
● The mean of the whole sample in this case is called the grand mean.

The Complete Randomized Model


● X(ij)=the value of an observation; in our example it is the weight.
○ i=generic observation.
○ j=generic group.
● µ=the grand mean, the mean of all values.
● τ𝑗=the effect of the factor, a qualitative variable that divides the whole
sample into more than 2 groups (with only 2 groups the t-student test
would be enough); it measures the difference among groups.
○ It is as if we have a grand mean plus a quantity that depends on the
specific character of that group (hence j).
● ε𝑖𝑗=the error term; it is due to randomness: we take a subject and assign
them randomly to a group.
● Model of the weight study: weight(ij)=grand mean+diet(j)+random error.
● Model of the women's age and oocytes: age(ij)=grand mean+timing+random error.
● Also called one-way analysis, since only one source of variation is
investigated.
● Assumptions:
○ We have independent samples.
○ Each of the populations from which the samples come is normally
distributed with mean µ𝑗 and variance σ𝑗².
○ Each of the populations has the same variance, meaning the
variance of A is equal to that of B and C; this needs to be verified.
○ The sum of all deviations of the µ𝑗 from the grand µ is 0.
○ The random errors are normally and independently distributed, and
this should be evaluated.
● H0: µ1 = µ2 = µ3, or τ𝑗 = 0: the factor has no effect.
● HA: at least one pair is different, or τ𝑗 ≠ 0.
● The variability of age/weight (in our examples) can be explained by two
components: the group and the variability of each subject.
● It is possible to determine the deviation between each observation and the
grand mean; this is called the total sum of squares, and it also determines
the variance of the sample: the first partitioning.
● Within-group (error) sum of squares: the deviation between each
observation and the mean of its own group, with N−k DF: the second
partitioning.
○ k=number of groups.
● Among-group sum of squares: the component due to the difference
between the mean of each group and the grand mean, with k−1 DF: the
third partitioning.
○ Nj=the number in each group.
● So the total deviation equals the deviation within the group plus the
deviation of the group from the grand mean:
𝑋𝑖𝑗 − X̄ = (𝑋𝑖𝑗 − X̄𝑗) + (X̄𝑗 − X̄).
● We use ANOVA since we have a comparison of 3 means, which would be
much more complicated using t-student tests.
● Test statistic:
○ F: the comparison between the among-group variance estimate and
the within-group variance estimate.
○ An asymmetrical distribution, depending on the degrees of freedom
of both numerator and denominator.
○ If the calculated F is higher than the critical value, we reject H0.
● After rejecting the null hypothesis we can only say that, for example,
checking the diet determines a difference, but we cannot say which group
differs: we need post hoc comparisons, as in the sketch below.
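
A minimal one-way ANOVA sketch on made-up weights for the three diet-attention groups; the post hoc step anticipates the next section:

```python
from scipy import stats

yes          = [62, 65, 68, 70, 71]   # made-up weights for the "yes" group
more_or_less = [70, 72, 75, 77, 80]
not_at_all   = [78, 80, 83, 85, 90]

# F = among-group mean square / within-group mean square.
print(stats.f_oneway(yes, more_or_less, not_at_all))

# Post hoc pairwise comparisons respecting the global alpha (SciPy >= 1.8):
print(stats.tukey_hsd(yes, more_or_less, not_at_all))
```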

Multiple Comparison
● If we have a study with 4 groups, we can divide the H0 of the ANOVA into
different null hypotheses.
○ The number of pairwise comparisons is K·(K−1)/2.
■ We could do multiple t-tests, but this is not efficient.
● Bonferroni correction: if α is the total type I error and we have m
comparisons, the significance level of each pair-wise comparison is α/𝑚.
● First method (name not mentioned):
○ The comparison is made with a CI for the difference of means.
○ The difference should lie outside these two limits to be significant.
○ This is too conservative.
○ MSE=mean square error within groups, after the calculation.
● Tukey-Honestly Significant Difference (HSD):
○ We respect the global α = 0.05.
○ We multiply the square root of (MSE/n) by a critical value from the
Tukey table, which depends on α, k and N−k.
○ We then check the differences of the different pairs against the HSD;
if a difference is higher than the HSD, it is statistically significant.
○ We compare both positive and negative differences.
○ n=size of the groups (e.g. 3 in the example).
● Tukey-HSD for unequal group sizes uses the same idea with an adjusted
standard error (the Tukey-Kramer form).
● Least Significant Difference (LSD by Fisher):
○ The critical value is taken from the t-student table with N−k DF,
and it is two-sided, so we have α/2 = 0.025.

Randomized Complete Block Design


● A block is a factor that has homogeneous observations:
○ Animal breed.
○ Same litter.
○ A subject with repeated measures.
● Grouping data in blocks removes the variability due to the block from the
residual; this way the effect of the main factor (treatment) is more
evident.
● Also useful if you want to repeat the experiment on more subjects, such as
more mice from the same breed.
○ Also useful when there is a known bias in subjects, such as age.
● Example:
○ Aim of the study: to compare the number of open-field crossing done
by 18 mice submitted to 3 different drugs.
○ We have 6 litters (which is the block), each with 3 mice.
○ We have the mean of the block or the litter and mean of each
treatment, and the mean of all treatments (grand mean).
● Assumptions:
○ Subjects are independent.
○ Each of the populations from which the samples come is normally
distributed.
○ No interaction between treatment and block.
● The model of the example is: crossings(ij)=grand mean+the effect of the
block (litter)+the effect of the treatment+the error.
● H0: the means of the three groups are equal, or τ = 0.
● HA: at least one pair is different, or τ ≠ 0.
● Test statistic: the treatment mean square divided by the error (residual)
mean square.

Repeated Measure ANOVA


● In this case we have the same pts undergoing several repeated measures,
so τ𝑗, the factor effect, is the time points.
● The block in this case is the “effect of subjects”.
● Post hoc will be the Dunn test, or paired t-tests with Bonferroni
correction.

The Factorial Design


● Also called two-way analysis of variance; used when we have two or more
categorical variables.
● We can evaluate the effect of each variable, but our main target is to
evaluate the interaction.
● Interaction is the effect of one factor that determines a change at one level
of another factor, different from that observed at the other levels.
● Example:
○ To compare the length of the visit by nurse age and type of pt.
○ We had 5 nurses for each age group, and they are independent.
○ The model in this case: time=grand mean+effect of the pt type+
effect of the age+interaction (expressed as a multiplication)+error term.
● n=number of subjects in each combination of a and b.
● Assumptions:
○ The observations in each cell are a random independent sample of
size n from a population with that combination of the levels of the
two factors.
○ Each population is normally distributed.
○ The populations have the same variance.
● H0: α𝑗 = 0, β𝑗 = 0 and (αβ)𝑖𝑗 = 0.
● HA: not all α𝑗 = 0, not all β𝑗 = 0 and not all (αβ)𝑖𝑗 = 0.
● Test statistic: we have 3.
○ SSA/residual (MSE).
○ SSB/residual (MSE).
○ Interaction/residual (MSE).
● After rejecting the null hypotheses we can go to the Tukey test or the
Bonferroni correction.
9. Nonparametric ANOVA

One Way ANOVA, Kruskal-Wallis Test


● Assumptions: the data are not normally distributed, or the variance is not
equal in all groups.
○ We use medians instead.
● H0: the population centers are the same; the medians are all equal.
● Test statistic H:
○ n=total sample size.
○ nj=size of each group.
○ The sum over j goes from 1 to k (groups).
○ Rj=sum of the ranks of the group.
● Decision rule: if the groups are 3, we get H-tab at α 0.05 and the sizes of
the groups.
○ If the groups are >3 we use the chi-square with k−1 DF.
● Procedure to calculate:
○ Sort data from the smallest to the largest, in a single group of size n.
○ Put ranks from 1 to n.
○ Tied values will have mean rank.
○ The ranks are added by group, we’ll have k sum of ranks.
● Generic flow chart: we check for normal distribution.
○ YES→ check for equal variance.
■ YES→ one-way ANOVA.
■ NO→ Kruskal-Wallis.
○ NO→ directly to Kruskal-Wallis.

Two Way ANOVA, Friedman Test


● Used when we have blocks or repeated measures and more than 2 groups.
● Test statistic: chi-square.
● Calculation procedure:
○ Sort the data from the smallest to the largest value within each block.
○ Assign ranks from 1 to n within the block.
○ Tied values receive the mean rank.
○ The ranks are added by group; we’ll have k sums of ranks, where k is
the number of groups.
● Decision rule: compare the tabulated value with the calculated one, using
α = 0.05 and the number of groups and blocks.
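
A minimal sketch of both non-parametric ANOVAs with SciPy, on made-up data:

```python
from scipy import stats

g1, g2, g3 = [1, 3, 5, 7], [2, 4, 8, 9], [6, 10, 11, 12]   # made-up independent groups

# >2 independent, non-normal groups -> Kruskal-Wallis (chi-square approx., k-1 DF).
print(stats.kruskal(g1, g2, g3))

# Blocks / repeated measures with >2 treatments -> Friedman (ranks within each block).
t1, t2, t3 = [5, 6, 7, 8], [7, 8, 9, 11], [6, 9, 10, 12]   # same 4 subjects, 3 time points
print(stats.friedmanchisquare(t1, t2, t3))
```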

10. Analysis of Frequency Data

Hypothesis testing can also regard proportions of events; we follow the same
procedure with the usual steps up to the clinical decision.

One Proportion
● Data: sample data about proportion of an event and the available
population proportion.
● Assumption: the data follow a binomial distribution, but we can use the
normal approximation if the sample is large enough and the proportion of
events approaches 0.5.
● H0: p=p0; HA: p≠p0.
○ p0=the available value with which we want to compare.
● Test statistic: z = (p̂ − p0)/√(p0·q0/n), where q0 = 1−p0.
● Decision rule: two-sided, α = 0.05, Ztab=1.96; H0 is rejected if
|z|>Ztab.

Comparison of Two Independent Proportion


● Data: sample data about the proportions in two independent samples.
● Assumption: the data come from independent samples, follow a binomial
distribution, and may approach a normal distribution if the samples are
large enough.
● H0: p1=p2; HA: p1≠p2.
● Test statistic: z = (p̂1 − p̂2)/√(p̄q̄(1/n1 + 1/n2)).
○ p̄=the total (pooled) proportion, computed from both samples together.
● Decision rule: two-sided, α = 0.05, Ztab=1.96; if |z|>Ztab we reject.

Chi-Square Test
● Data: sample data about the proportions in two independent samples.
● Assumption: the data come from independent samples and a binomial
distribution, and cannot approach normal.
● H0: p1=p2; HA: p1≠p2.
● Test statistic: χ² = Σ (observed − expected)²/expected.
● Decision rule: two-sided, α = 0.05, DF=1, critical χ² = 3.841; H0 is
rejected if the calculated value is higher.
● We have a 2X2 table, so we have only 1 DF. As long as we have the
marginal numbers, we can calculate the rest.

Fisher Exact Test


● If a study has a low size (between 20 and 40, with a cell count less than 5
[such as a=2]), chi-square is not an appropriate test.
● Data (example): microbiological analysis was performed on well water in
two provinces: Brindisi (5 wells) and BAT (22 wells).
● Assumption: data from independent samples, data from a binomial
distribution.
● H0: p1=p2; HA: p1≠p2.
● The equation is very complicated and we didn't study it.
● The algorithm is to start with Z if we have a normal approximation,
chi-square if binomial, and Fisher's exact test if we have a cell count less than 5.
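
A sketch of the last two steps of this algorithm with SciPy, on a made-up 2x2 table (note that `chi2_contingency` applies the Yates continuity correction to 2x2 tables by default):

```python
import numpy as np
from scipy import stats

table = np.array([[30, 70],     # made-up 2x2 counts: rows = groups, cols = event yes/no
                  [15, 85]])

chi2, p, dof, expected = stats.chi2_contingency(table)   # DF = (2-1)*(2-1) = 1
print(chi2, p, dof)

# Small samples / expected cell count < 5 -> Fisher's exact test instead.
odds_ratio, p_exact = stats.fisher_exact(table)
print(odds_ratio, p_exact)
```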

McNemar Test-Proportion in Paired Samples


● Used when we have paired samples and proportions, such as before and
after a trial, or two different visits, and the result is not quantitative but a
class/category like +/−.
● In this table specifically we talk about changes or no changes between
− and +.
● H0: p_after ((a+c)/N) = p_before ((a+b)/N); HA: p_after ≠ p_before.
● Decision rule: tabulated chi-square with DF=1 and α = 0.05.
○ One tail.
● In Stats Kingdom we enter the changes from before to after, and from
after to before, depending on the question asked.
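
A minimal sketch with statsmodels, on a made-up paired table:

```python
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

# Made-up paired 2x2: rows = before (+/-), columns = after (+/-).
table = np.array([[20, 5],
                  [15, 60]])

# The test uses only the discordant cells (b=5, c=15); exact=False gives
# the chi-square form with DF=1.
print(mcnemar(table, exact=False, correction=True))
```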

Cohen’s Kappa-Concordance (Same Result Before and After) in Proportions


● A statistic that measures agreement between 2 judgments for qualitative
items.
○ Such as a pt visited by physician 1 and physician 2.
● McNemar allows us to check the hypothesis of equal proportions; here we
have a descriptive index of agreement.
● Io: the observed identity; we check the cells in which we have
concordance (a and d).
● Ie: the expected identity; here we determine the amount of concordance
expected from the marginal numbers.
● 𝑘 = (𝐼𝑜 − 𝐼𝑒)/(1 − 𝐼𝑒), and the result varies from 0 to 1, with 1 being
maximum concordance.
○ The usual threshold is >0.7.
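
The kappa computation written out on a made-up agreement table (the numbers are chosen so that k lands exactly on the 0.7 threshold):

```python
import numpy as np

# Made-up 2x2 agreement table: rows = physician 1, columns = physician 2.
table = np.array([[40, 10],
                  [5, 45]])
n = table.sum()

io = np.trace(table) / n                      # observed agreement (cells a and d)
margins_r = table.sum(axis=1) / n
margins_c = table.sum(axis=0) / n
ie = (margins_r * margins_c).sum()            # agreement expected from the margins

kappa = (io - ie) / (1 - ie)
print(kappa)   # 0.7: > 0.7 is the usual threshold for good agreement
```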

Relation Between Two Qualitative Data


● If the two variables are qualitative (dichotomous or with more classes),
the statistical significance can be evaluated through the chi-square test.
● We summarize the values as counts and %.
● O=observed value; the category index i goes from 1 to r, and the variable
index j goes from 1 to c.
● Example: gestational age at delivery could be associated with the age of
the woman.
● H0: there is no association between the two variables; we use the
multiplication rule of independent events, Pij=Pi·Pj.
● HA: the probability of the (i, j) class is different from the multiplication.
● The probability of being in a specific cell is binomial: either in that cell or
elsewhere.
● A binomial distribution has parameters N and P (probability of success),
and the expected value is N·P.
● Test statistic: we use the chi-square, which is the sum of (observed minus
expected)² over expected.
● Decision rule: the calculated chi-square must be higher than the tabulated
one, with DF of (r−1)·(c−1) and α = 0.05.
Chi-square for Trend in Proportion
● When the relationship is between an ordinal variable and a categorical
variable with two classes, it can be evaluated only if the chi-square is
statistically significant.
○ E.g. water contamination in wells over the years, where we want to
see whether there is a true trend in contamination or a casual observation.
● Once the chi-square is significant, we do the chi-square for trend, which
has a long formula, with DF of 1.
● In the example he gave there was a difference, but the trend was not
significant.
11. Risk, Odd, Odds Ratio, Relative Risk

What is Risk
● Absolute risk is the probability or chance that a person will have a
medical event, expressed as a %.
● It is the ratio between the number of people who have a medical event and
the number of people who could have the event because of their medical condition.

Relative Risk and 95% CI


● When conducting a study we need information that allows us to compare
what happened between groups, therefore we look for a relative measure
of risk.
● In a prospective study, people are enrolled with or without the risk factor,
and we observe the occurrence of a disease.
● At the end of the study, we can evaluate the occurrence of a disease or an
event, called incidence; it is usually multiplied by a constant factor that
depends on the population (k).
○ We determine it in both groups and compare the rates.
● The grand incidence = (a+c)/N.
● Attributable risk is a measure of how much disease risk is attributable to
a certain exposure, also called the absolute difference.
○ When studying drugs it is called the absolute risk reduction.
● Relative risk is the ratio between the two incidences, exposed/not
exposed; it measures the strength of the association, which is important
for inference and etiologic studies of disease.
● Interpreting the relative risk of a disease:
○ If RR=1, both risks are equal and there is no association.
○ If RR>1, the incidence in the exposed is higher than in the non-exposed;
we have a positive association and we need to check whether there is
a causal relation (cause and effect).
○ If RR<1, the incidence is lower in the exposed; we have a negative
association and we can carefully say the factor is protective, depending
on factors such as smoking.
● When the whole CI for the RR is above 1, we can say that the entire CI is
in the positive-association range; it doesn't straddle 1, and there is a
statistically significant difference.
○ Otherwise, if we have a CI of 0.X-2.X, it is not statistically significant.

Odds Ratio and 95% CI


● Used in retrospective studies: we enroll subjects with and without the
disease and we ask whether they were exposed to a risk factor.
● The table is similar, but we cannot determine incidence; since we want to
quantify the risk, we determine the odd.
● 𝑂𝑑𝑑 = 𝑝/(1 − 𝑝): the ratio between the proportion of an event and the
proportion of no event.
● The odds ratio is the ratio between the odd in the exposed and the odd in
the not exposed.
● Odd in exposed = [a/(a+c)]/[c/(a+c)] = a/c;
odd in not exposed = [b/(b+d)]/[d/(b+d)] = b/d.
● Odds ratio = (a/c)/(b/d) = ad/bc.
● Interpreting Odds Ratio of a Disease:
○ If OR=1, the odds are equal, and no association between risk and
disease.
○ If OR>1, the odds in exposed are higher, positive association.
○ If OR<1, the odds in exposed are lower, negative association may
be protective.
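
A sketch computing both measures from a made-up 2x2 table; for the CI it uses the common log (Woolf) method rather than the Miettinen test-based approach described later in these notes:

```python
import numpy as np

# Made-up 2x2: a,b = exposed with/without disease; c,d = non-exposed with/without.
a, b, c, d = 30, 70, 10, 90

rr = (a / (a + b)) / (c / (c + d))   # ratio of the two incidences (prospective design)
odds_ratio = (a * d) / (b * c)       # ad/bc (retrospective design)
print(rr, odds_ratio)                # 3.0 and ~3.86

# 95% CI for the OR on the natural-log scale (Woolf method):
se_ln = np.sqrt(1/a + 1/b + 1/c + 1/d)
ci = np.exp(np.log(odds_ratio) + np.array([-1.96, 1.96]) * se_ln)
print(ci)
```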

OR vs RR
● In research papers we will usually see more ORs and it can be also
applied to cohort study/prospective study.
● When OR is applied to a prospective design with a sample size large
enough, it may be a good approximation of RR.
● OR is a good estimate of RR when the “cases” and “controls” are
representing all the people in the population where we draw the sample.
● When the disease is rare:
○ 𝐼𝑒 = 𝑎/(𝑎 + 𝑏) ≈ 𝑎/𝑏.
○ 𝐼𝑛𝑜𝑛𝑒 = 𝑐/(𝑐 + 𝑑) ≈ 𝑐/𝑑.
○ 𝑅𝑅 = (𝑎/𝑏)/(𝑐/𝑑) = 𝑎𝑑/𝑏𝑐 = 𝑂𝑅.
Inference on OR and RR
● Hypothesis testing follows the same procedure and steps as for frequency
data.
● H0: RR=1, OR=1 or ln(OR)=0; HA: 𝑅𝑅 ≠ 1, 𝑂𝑅 ≠ 1.
● Test statistic: a chi-square similar to the one we saw for frequency data,
or a z on the natural log scale.

CI for RR and OR
● We can determine the CI for the OR applying Miettinen's theory: we can
determine the SE using the statistical test.
● Z is always (statistic minus population parameter)/SE; since the population
parameter under H0 is 0 on the log scale, we can solve for the SE.
● CI = estimator ± (reliability coefficient × SE of the estimator).
● We also know that z = √χ².

Stratification
● It is used when we want to analyze an exposure in two subgroups defined
by another risk factor.
● For example: we stratify into smokers and non-smokers, with alcohol or
without within each stratum.
○ We sum the a·d and b·c terms of every 2X2 table and then we can use
this OR by Mantel and Haenszel, with its CI.
● This allows us to obtain an adjusted ratio; in this example the alcohol OR
was 2.29, and it decreased to 1.69 when adjusted for smoking.
● For the CI we use the Mantel-Haenszel chi-square.

Chi-square by Mantel-Haenszel
● There are relationships or comparisons which can be biased by other
factors: confounding factors, which affect the results through their hidden
participation.
○ As in the previous example with smoking: we had a crude OR and an
adjusted one.
● How to avoid confounding:
○ Restriction→ excluding from the analysis, e.g., smokers among the
alcohol drinkers.
○ Matching→ for an alcohol drinker we match a non-smoker control;
if we have a subject with both alcohol and smoking, we take a
non-alcohol non-smoker for control.
○ Randomization→ possible only for prospective studies.
○ Standardization→ a statistical method to adjust results, taking a
standard population as reference.
○ Stratification→ similar to what we did.
○ Multivariate analysis→ various regression methods.
● R1i=first row total of stratum i, C1i=first column total of stratum i (the
marginal numbers), Ni=total number in stratum i, E(ai)=expected value,
ai=observed value.
12. Correlation

Introduction
● We use it when we want to analyze relationships between two variables,
for example: BP and weight, BP and height, etc.
● The nature and strength of the relationship are analyzed through regression
and correlation.
● We evaluate only the association between them, without a causal relation:
X may determine an effect on Y, Y may have an effect on X,
○ or a third cause may connect the variables in the analysis.

Pearson Correlation Coefficient


● The strength can be analyzed by the Pearson correlation coefficient, r.
○ −1 ≤ 𝑟 ≤ 1.
○ The numerator is the covariance.
● The first step to evaluate a co-relationship is to draw a scatter plot.
○ A graph in which the values of the variables are the coordinates of the
points in the cartesian plane, each point being a subject.
○ There is no difference in which variable is named X or Y.
● If r→ 1 there is a perfect direct linear correlation, and both variables
increase together.
● For this H0 we have several assumptions:
○ For each value of X there is a normally distributed subpopulation of Y.
○ For each value of Y there is a normally distributed subpopulation of X.
○ The joint distribution of X and Y is bivariate normal.
○ The subpopulations of X have the same variance.
○ The subpopulations of Y have the same variance.
● Under H0 we assume that there is no correlation (ρ = 0); HA: ρ ≠ 0.
● Decision rule: two-sided α = 0.05, with n−2 DF.
● If r→ -1 there is a perfect inverse linear correlation.
● The strength of the correlation should also be judged from the scatter plot:
in a strong linear correlation the distribution is not sparse and the points
are very close to the line.
○ When r=0 we can still have other, non-linear types of relationship.
● Limits to conclude about strength of correlation:
○ 0-0.19→ very weak.
○ 0.2-0.39→ weak.
○ 0.4-0.69→ moderate.
○ 0.7-0.79→ strong.
○ 0.8-1→ very strong.
■ General agreement and not a strict rule.

Spearman Correlation Coefficient


● This can be used when our data do not respect the assumptions required
by the Pearson coefficient, so we use a distribution-free method.
● The limits are similar to Pearson's, and it is interpreted the same way.
● The data are transformed into ranks.
● H0: ρ = 0.
● The statistic is rs = 1 − 6·Σd²/(n(n²−1)), where the d² are the squared
differences between ranks.
● Procedure:
○ Two sets of ranks: one for each variable.
○ Values of X must be sorted and values of Y too.
○ Ranks are assigned separately to sorted values of X and Y.
○ Tied values will have mean rank.
○ Each subject will have a pair of ranks <d> is the difference between
ranks assigned.
○ The differences are squared.
○ Sum the squared differences.
● Decision rule: α = 0.05 with DF of n, two-sided, from the tabulated Rs.
● Excel allows us to build a correlation matrix with more than 2 variables,
as in the sketch below.
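
A minimal sketch of both coefficients with SciPy, on made-up paired measurements:

```python
import numpy as np
from scipy import stats

bp     = np.array([110, 118, 120, 126, 131, 140])   # made-up paired measurements
weight = np.array([58, 64, 66, 71, 75, 82])

r, p = stats.pearsonr(weight, bp)        # linear correlation, tests H0: rho = 0
print(r, p)

rs, ps = stats.spearmanr(weight, bp)     # rank-based; use when the normality assumptions fail
print(rs, ps)

# Correlation matrix for more than two variables:
print(np.corrcoef([bp, weight]))
```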
13. Regression

Simple Linear Regression


● We have a cause-effect relationship, we have an independent X that is
something that could determine something else and an effect (dependent
variable, Y), we can predict Y by the given value of X.
● Example: SLE disease activity index (SLEDAI) depends on ED, CRP and
CLA.
● Assumptions:
○ Values of the independent variable X are fixed, chosen by the researcher; values of X are not random.
○ Values of X are measured without error.
■ The researcher fixes, for example, X = 160 and observes a random value of Y.
■ Y is random and X is deterministic; Y depends on X.
○ For each value of X we have a subpopulation of Y values normally
distributed.
○ The variances of all these normal distributions are equal.
● How do we analyze simple linear regression:
○ Draw the scatter plot.
○ Verify the assumption.
○ Estimate the model equation, which is ŷi = a + b·xi.
○ Evaluate fitting of the model.
● The model: Yi = β0 + β1Xi + εi.
○ Yi → the dependent variable (observed value).
○ β0 → constant term.
○ β1Xi → the term that allows us to explain Y for each X.
○ εi → random error.
■ The first two terms together give the expected value, whereas Yi is the observed value.
● The aim of the analysis is to estimate the parameters of the model.
● The estimation method is least squares: it consists in finding the line with the smallest sum of squared vertical differences between each data point and the corresponding point on the line.
● β̂1 (β1 with a "roof", i.e. the estimate) → slope, β̂0 → intercept.
● The inference aims to evaluate:
○ If the equation describes the relationship in the population.
○ If the equation is reliable for predicting Y knowing X.
● H0: β1 = 0, HA: β1 ≠ 0.
● What we do eventually is an analysis of variance: for each point, the deviation of the observed value from the average (Yi − Ȳ) can be split into the part explained by the regression line (Ŷi − Ȳ) and the residual between observed and predicted values (Yi − Ŷi).

● Decision rule: F test, α = 0.05, DF 1 and n − 2 (1 and 8 in the worked example), one-tailed.


● The next step is to evaluate the fitting of our model; this is done with R², the coefficient of determination: the ratio between the explained sum of squares and the total sum of squares, i.e. how much of the deviation is explained by the regression.
○ Approaching 1 we can say we have a very good fit; approaching 0, none of the variation is explained by the regression.
● We can also draw a scatter plot with the predicted values on the X axis and the residuals on the Y axis, to check for any trend.
● Eventually, in a multivariable model, we test each term of the model with a Student's t test and remove whatever is not significant.
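-Sketch (assuming numpy/scipy; the x/y data are invented): a least-squares fit with R² and the p-value for H0: β1 = 0.

    import numpy as np
    from scipy import stats

    x = np.array([160, 165, 170, 175, 180, 185, 190, 195, 200, 205], dtype=float)
    y = np.array([58.0, 61.5, 64.0, 66.0, 70.5, 73.0, 74.5, 78.0, 80.5, 84.0])

    res = stats.linregress(x, y)               # least squares: y = b0 + b1*x
    print(f"model: y = {res.intercept:.2f} + {res.slope:.3f}*x")
    print(f"R^2 = {res.rvalue**2:.3f}")        # coefficient of determination
    print(f"p-value for H0: beta1 = 0 -> {res.pvalue:.2e}")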

Simple Linear Regression-test and CI 95% for β

● We have a test for each parameter; the example below is the test for the slope.
● H0: β1 = (β1)0, HA: β1 ≠ (β1)0; test statistic: t = (β̂1 − (β1)0) / SE(β̂1).
● Decision rule: Student's t, α = 0.05, n − 2 DF; if |t-calc| > t-tab we reject H0.
● The CI is, as always, the estimate ± the correction value (in this case t-tabulated) multiplied by the SE of the estimate: β̂1 ± t·SE(β̂1).
● A CI can also be computed for the predicted value, i.e. for the whole equation (ŷi).
● For the exam he wants the model written with the exact estimated numbers defining the dependent variable.
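-Sketch (continuing the invented data above): the 95% CI for the slope, estimate ± t-tabulated × SE, with n − 2 DF.

    import numpy as np
    from scipy import stats

    x = np.array([160, 165, 170, 175, 180, 185, 190, 195, 200, 205], dtype=float)
    y = np.array([58.0, 61.5, 64.0, 66.0, 70.5, 73.0, 74.5, 78.0, 80.5, 84.0])

    res = stats.linregress(x, y)
    t_tab = stats.t.ppf(0.975, df=len(x) - 2)   # two-sided, alpha = 0.05
    low = res.slope - t_tab * res.stderr        # res.stderr = SE of the slope
    high = res.slope + t_tab * res.stderr
    print(f"b1 = {res.slope:.3f}, 95% CI [{low:.3f}, {high:.3f}]")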

-Side note: in the Bland and Altman plot we have on the X axis the mean and on the Y axis the difference of each pair of measurements; it may be needed for our thesis.

Logistic Regression
● Here we talk about a model in which Y is categorical dichotomous, so we can either have 0 or 1, and it can be modeled through proportions.
● We use the ratio p/(1 − p), called the odds, which ranges between 0 and infinity.
● We can transform this dependent variable into the natural log of the odds (e.g. of having cancer).
○ Values go from −infinity to +infinity, and the distribution is more or less normal.
○ This is the logit transformation, or logit link, which lets us use linear regression.
● The aim of the study is the dependence relationship between one categorical dichotomous dependent variable and one or more independent variables, categorical or continuous.
● We need the odds in biomedical studies because it is a risk measure, the risk of having a condition, useful to understand which of the variables are involved in the risk.
● The exponential of a regression coefficient is the odds ratio.
○ E.g. the odds of having cancer with X1 divided by the odds without it.
● In our example: ln(p / (1 − p)) = b0 + b1·smoke + b2·alcohol, where p is the probability of esophageal cancer.
○ Once again smoking is a categorical variable, so we code it as 0 and 1.

-Another side note: when we have multiple categorical groups we need to set one as the reference category, and its OR will be 1.
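-Sketch (assuming the statsmodels package; the smoke/alcohol exposures and the outcome are simulated 0/1 data): fitting the logit model above and exponentiating the coefficients to obtain odds ratios.

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    n = 200
    smoke = rng.integers(0, 2, n)                   # 0/1 exposure indicators
    alcohol = rng.integers(0, 2, n)
    logit_p = -2.0 + 1.0 * smoke + 0.8 * alcohol    # simulated true log-odds
    y = rng.binomial(1, 1 / (1 + np.exp(-logit_p)))

    X = sm.add_constant(np.column_stack([smoke, alcohol]))
    fit = sm.Logit(y, X).fit(disp=0)   # ln(p/(1-p)) = b0 + b1*smoke + b2*alcohol
    print("odds ratios:", np.exp(fit.params[1:]))   # exp(b1), exp(b2)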
14. Survival Analysis

Introduction
● Useful when we have a time-to-event study, e.g. evaluating the outcome of a treatment by comparing the proportion of events and the days, weeks or months needed to reach the event.
● Events can be a success in therapy, or we can have death or any clinical
event.
● Aims:
○ Estimating the survival function: time on the X axis, probability of surviving on the Y axis.
○ Comparing survival between groups, e.g. treated vs not.
○ Estimating the effect of covariates on survival: hazard ratio.
● The measure to analyze is time to event.
● The terminal time point could be:
○ The end point of the study (the event of interest):
■ Overall survival: the event is death, due to the disease, to drug side effects, to progression or to many other causes.
■ Disease-free survival: we treat the cancer and count the time until it relapses or metastasizes.
■ Time to progression: we treat the cancer and count the time until progression/metastasis, including non-successful treatments.
■ Specific survival: only death due to the specific disease counts as the event.
○ The occurrence of death due to a different cause with respect to the
study endpoint.
○ The loss to follow-up: pts no longer followed because they left the study for reasons unrelated to the study endpoint.
■ These are called censored: they are alive, but we can't say anything more about them.
● Three Types of Right Censoring:
○ Type I when all subjects are scheduled to begin the study at the
same time and end the study at the same time, common in animal
experiments, not human trials.
○ Type II when all subjects begin the study at the same time and the
study is terminated when a predetermined proportion of subjects
have experienced the event, usually phase 2 studies for drugs.
○ Type III censoring is random, in the case of clinical trials because of
staggered entry, unequal follow-up on subjects, and starting study at
different times.
● To estimate the cumulative survival probability S(t) we apply the product-limit method.
○ S(t) is the cumulative probability of surviving at time t and is the product of the probabilities of surviving through each previous time: S(t) = Π (1 − di/ni), with di events and ni subjects at risk at time ti.

Kaplan-Meier Curve
● The curve gives us an idea of how survival evolves in time; it looks like a ladder (step function), with time on the X axis and cumulative survival on the Y axis.
○ Censored subjects are shown as arrows and don't influence the curve.
● We apply the curve if:
○ We know the exact moment in which the
event occurs.
○ Survival function changes when the event occurs.
○ Curves change each time an event occurs.
○ Censored subjects don't change the survival estimate.
● Median survival is the time corresponding to S(t)=0.5.
● Mean survival is computed as the sum of the survival times divided by the total sample size.
● Average hazard, which is the measure of risk: the number of events divided by the sum of survival times (e.g. events per person-month).
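-Sketch (plain Python, invented follow-up times): the product-limit estimate S(t) = Π(1 − di/ni), recomputed at each event time; censored subjects only shrink the risk set.

    # (time in months, event) pairs; event = 0 marks a censored subject.
    data = [(2, 1), (3, 0), (5, 1), (5, 1), (7, 0), (8, 1), (10, 0), (12, 1)]

    s = 1.0
    for t in sorted({t for t, e in data if e == 1}):        # event times only
        d = sum(1 for ti, e in data if ti == t and e == 1)  # events at time t
        n = sum(1 for ti, e in data if ti >= t)             # subjects still at risk
        s *= 1 - d / n                                      # product-limit step
        print(f"t = {t:>2}: S(t) = {s:.3f}")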

Log Rank Test

● The aim is to compare 2 survival curves; it works like a chi-square test with each time point acting as a stratification factor, so we need observed and expected events.
● H0: S(t)1 = S(t)2, HA: S(t)1 ≠ S(t)2.
● Decision rule: if the calculated chi-square > tabulated with k − 1 DF (k = number of groups), we reject H0.
○ We sum the expected and observed events over all time points.
○ Expected events in a group = rate of death at that time multiplied by the number of people still alive (at risk) in that group.
● If we have more than 2 variables involved, we need to adjust the results (this is where the Cox model comes in).
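-Sketch (assuming the lifelines package; durations invented): comparing two survival curves with the log-rank test.

    from lifelines.statistics import logrank_test

    # Follow-up times (months) and event indicators (1 = event, 0 = censored).
    t_treated, e_treated = [6, 7, 10, 15, 19, 25], [1, 0, 1, 1, 0, 1]
    t_control, e_control = [1, 3, 4, 5, 8, 11], [1, 1, 1, 1, 1, 0]

    res = logrank_test(t_treated, t_control,
                       event_observed_A=e_treated, event_observed_B=e_control)
    print(f"chi2 = {res.test_statistic:.2f}, p = {res.p_value:.4f}")  # H0: S1(t) = S2(t)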

Cox Regression Model

● To evaluate the effect of more than 1 variable on survival, and to obtain a value of risk, a regression model can be applied in "time to event" studies; the risk measure here is the hazard.
● The hazard function is the conditional probability that an event will occur at time ti, given survival up to time ti.
○ f(ti) = the instantaneous failure rate at a single time point.
● We need to know if a clinical condition
could be a predictor of a change in the
pt’s status.
● We want to know if the treatment is effective: if the risk at a certain time depends on a baseline function modified by the presence of a certain factor, we can estimate the effect of this factor through the hazard ratio.
● It is important to verify that the hazard function has the same shape in those who have and those who don't have the risk factor.
● Proportional hazards model: the value of the risk is proportionally higher in one group with respect to the other.
○ If the risk decreases over time, it decreases in both groups, and the risk in one group stays proportional to the risk in the other at every time point, as if the two curves were parallel at each point.
● If HR=1 the risks are equal.
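-Sketch (again assuming lifelines; the small data frame is invented): a Cox proportional hazards fit, where exp(coef) is the hazard ratio of each covariate.

    import pandas as pd
    from lifelines import CoxPHFitter

    df = pd.DataFrame({
        "time":    [5, 8, 12, 3, 9, 14, 6, 11, 2, 16],  # months of follow-up
        "event":   [1, 1, 1, 1, 0, 1, 1, 0, 1, 1],      # 1 = event, 0 = censored
        "treated": [0, 0, 1, 0, 1, 1, 0, 1, 0, 1],      # covariate of interest
    })

    cph = CoxPHFitter()
    cph.fit(df, duration_col="time", event_col="event")
    print(cph.summary[["coef", "exp(coef)", "p"]])      # exp(coef) = hazard ratio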
15. Diagnostic Accuracy

● When we have a new diagnostic test we need to prove that this is a good
test for diagnosis:
● We need to evaluate:
○ Reliability.
○ Population it will be applied to.
○ Compare the test with a gold standard test.
○ If the variables are quantitative we need a cutoff for sensitivity and
specificity.
○ Outcome evaluation to confirm the reliability of the test.
● Accuracy is the ability of the test to actually measure what it claims to measure, defined as the proportion of all tests (+ and −) that are correct.
● Precision is the ability to obtain the same result on repeated measurements on the same patient or sample; it should be expressed through the SD.
○ Of course we want a test that is accurate and precise.
● Diagnostic accuracy measures tell us about the ability of the test to:
○ Discriminate between diseased and healthy subjects: here we use sensitivity and specificity.
○ Predict disease and health: predictive values, likelihood ratios, area under the ROC curve and overall accuracy.
■ We need to decide how predictive a certain test value is for a certain result.
● Sensitivity is the ability of the test to find the disease: TP / (TP + FN).
● Specificity is the ability of the test to have the lowest possible number of false positive results, i.e. the test picks up only the disease we are searching for: TN / (TN + FP).
● Gold standard means we have the exact diagnosis according to MRI or
whatever.
● Likelihood ratio of a positive test: LR+ = sensitivity / (1 − specificity); if > 1, being diseased is more probable than being healthy after a positive result.
○ 1 − specificity is the false positive rate, i.e. being positive without having the disease.
● Likelihood ratio of a negative test: LR− = (1 − sensitivity) / specificity; if < 1, a negative result makes disease less likely.
○ 1 − sensitivity is the false negative rate.
● The global accuracy is the number of all correct results divided by the whole sample, (TP + TN) / total, read as a %.
● We need to choose the point that discriminates positive and negative, hoping this point will also discriminate diseased from not diseased.
○ In order to do this we classify pts according to their AFP levels on the Y axis.
○ We put the classes on the X axis, in our example HCC and cirrhosis.
○ We start from a certain threshold above which pts are classified as diseased, move it by regular intervals, and for each threshold we build a 2×2 table, looking for the maximum sensitivity and specificity.
● Youden's J statistic = sensitivity + specificity − 1.
○ We have the optimal threshold where J is maximal (see the sketch below).
● We can determine the CI for sensitivity and specificity as they are
proportions.
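-Sketch (plain Python; the 2×2 counts are invented): sensitivity, specificity, Youden's J and a normal-approximation CI for sensitivity treated as a proportion.

    import math

    TP, FN, FP, TN = 45, 5, 10, 90          # hypothetical test vs gold standard

    sens = TP / (TP + FN)                   # ability to find the disease
    spec = TN / (TN + FP)                   # ability to avoid false positives
    youden_j = sens + spec - 1              # choose the cutoff that maximizes J
    accuracy = (TP + TN) / (TP + FN + FP + TN)

    se = math.sqrt(sens * (1 - sens) / (TP + FN))   # SE of a proportion
    print(f"sens = {sens:.2f} (95% CI {sens - 1.96*se:.2f} to {sens + 1.96*se:.2f})")
    print(f"spec = {spec:.2f}, Youden J = {youden_j:.2f}, accuracy = {accuracy:.2f}")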

ROC (Receiver Operating Characteristic) Curve

● The graph is a square of 100 × 100, with X being 100 − specificity (also called the false positive rate) and Y being sensitivity.
● At each value of blood concentration, sensitivity and false positive rate are calculated, and all the points are joined.
● The area under the curve is calculated in a non-parametric way, and it gives us an index of accuracy.
● 0.5-0.7→ low performance, 0.7-0.9→ moderately good performance, >0.9→ high performance.
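-Sketch (assuming scikit-learn and numpy; the marker values are invented stand-ins for, e.g., AFP): building the ROC points and the non-parametric area.

    import numpy as np
    from sklearn.metrics import roc_curve, roc_auc_score

    y_true = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])   # 1 = HCC, 0 = cirrhosis only
    marker = np.array([5, 9, 12, 20, 35, 18, 40, 55, 80, 120])

    fpr, tpr, thresholds = roc_curve(y_true, marker)    # fpr = 1 - specificity
    auc = roc_auc_score(y_true, marker)                 # non-parametric AUC
    best = np.argmax(tpr - fpr)                         # threshold maximizing Youden J
    print(f"AUC = {auc:.2f}, best threshold = {thresholds[best]}")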

Predictive Values
● We want to know, among the test positives, how many truly have the disease: positive predictive value, PPV = TP / (TP + FP).
● We want to know how many of the test negatives have been correctly classified: negative predictive value, NPV = TN / (TN + FN).
● Both values depend on the prevalence of the condition in the
population.
16. Bayes Theorem

● We can determine the posterior probability knowing the prior probability, with no further information on the subjects.
● The predictive value of a diagnostic test can be determined knowing only the diagnostic capability of the test (sensitivity and specificity) and the disease distribution (prevalence).
● Subjectivists think of learning as a process of belief revision in which a
prior probability P is replaced by a posterior probability Q that incorporates
the newly acquired info.
● Bayes’ theorem relates current probability to prior probability; it is important
in the mathematical manipulation of conditional probabilities.
● Positive predictive value: the probability of having the condition given a positive test; this is a conditional probability, P(D|T+).
● Negative predictive value: the probability of not having the condition given a negative test, P(ND|T−).
● From the previous lesson: P(D and T+) = P(D|T+) · P(T+).
● A priori information is gained from:
○ Published studies: the numerator P(T+|D)·P(D), where P(T+|D) comes from sensitivity and specificity.
○ The probability of being diseased, P(D), coming from the prevalence.
● The denominator is P(T+), the total probability of being positive: the probability of being positive and diseased (sensitivity × prevalence) plus the probability of being positive without the disease, i.e. the false positive rate (1 − specificity) × (1 − prevalence):
○ P(D|T+) = P(T+|D)·P(D) / [P(T+|D)·P(D) + P(T+|ND)·P(ND)].
● P(T+|D) → sensitivity, P(T+|ND) → 1 − specificity, P(D) → prevalence, P(ND) → 1 − prevalence; all of these are a priori.
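-Sketch (plain Python; sensitivity, specificity and prevalence are invented): Bayes' theorem turning a priori information into the predictive values.

    sens, spec, prev = 0.90, 0.95, 0.02     # a priori information

    # P(T+) = P(T+|D)P(D) + P(T+|ND)P(ND)
    p_positive = sens * prev + (1 - spec) * (1 - prev)

    ppv = sens * prev / p_positive          # P(D|T+)
    npv = spec * (1 - prev) / (spec * (1 - prev) + (1 - sens) * prev)
    print(f"PPV = {ppv:.3f}, NPV = {npv:.4f}")   # low prevalence -> low PPV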
