Statistics
1. Introduction
What is Statistics?
● The science which deals with the collection, classification and tabulation of
numerical facts as the basis for the explanation, description and comparison of
phenomena.
● It reveals tendencies in seemingly incoherent masses of data, which can be
generalized to populations with a specified degree of reliability.
Observational study-retrospective
● A kind of study where we take pts who already have the disease and we
evaluate exposure in the past, such as a case-control study.
● We recruit pts if they have the condition and ask what happened in the
past.
● Aim-to evaluate the association between risk factors and the disease, i.e. the
probability of having the disease given certain risk factors.
● Ranked low in the EBM hierarchy due to the amount of possible biases.
Observational study-prospective
● We follow the pts from a starting point (T0) to a certain period (Tn)
depending on the aim of the study, and we wait for exposure and disease.
○ Sometimes it is also called a cohort study=all pts are homogeneous.
○ E.g. take people with no disease and follow them until they develop it.
● We can have two groups: one with an exposing factor and another without.
● E.g. a group of young people who start smoking and we follow how many
developed a disease.
Experimental Study
● We have a study group and a control group.
● We have an independent variable (e.g. therapy) and a dependent variable
(e.g. improvement).
● Subjects who participate in the study are assigned randomly to either
group, and then we compare.
● Clinical trial:
○ An experiment performed by a healthcare organization to check the
effects of an intervention against a control in a clinical environment.
○ It is a prospective study.
○ The main way to perform experiments in humans.
The Variable
● A characteristic that takes different values in different people, objects,
places and time.
● The value of a variable can be:
○ A measure taken on a subject-BP, height, weight etc.
○ The answer to a question-sex, town, ethnic group, study group.
○ An observation of something- imaging.
○ A judgment-functional score, index etc.
● Quantitative variables:
○ Continuous- such as BP, these numbers are on a real scale and the
limitation is the instrument that is used.
○ Discrete- expressed only by integers e.g. scoring.
● Qualitative variables:
○ Nominal-such as sex, where categories have no order; with two categories we have a dichotomization.
○ Ordinal-we have a semi-quantitative scale such as cancer staging.
● Measurement- each variable has a scale such as nominal, ordinal, rank
and quantitative.
○ The latter can be changed into ranked, ordinal or nominal but not
vice-versa.
Inference
● The procedure by which we reach a conclusion about a population on the
basis of the information contained in a sample that has been drawn from
that population.
Population
● The largest collection of values of a random variable; being random, its
values are unknown before observation.
○ For which we have an interest at a particular time.
○ E.g. asking someone’s height randomly vs asking all above 175 cm.
● Found under the method part of any article/study.
Sample
● A subset of the population.
● A well chosen sample will contain most of the info about a particular
population parameter but the relation between the sample and the
population must be such to allow true inferences to be made.
○ Sample must have the main characteristics of the population.
● Population parameter-a descriptive characteristic of the population such as
mean, proportion or %.
● Simple random sample:
○ A random selection.
○ Each has a known non-zero chance of being included in the sample
and it is equal among the subjects.
○ If we remove a person from the sample, the probability for the next
extraction is different.
■ In a big sample we can ignore the change in the probability.
○ If a number pops up twice (e.g. a readmission), it can be counted
twice (sampling with replacement).
● Stratified sample:
○ A particular methodology to build a sample.
○ Some groups can give biased results.
○ We need to consider conditions of the pts or sex because the
probability of an event could change with age, co-morbidities etc.
■ We need to take these into account that could affect the
results of the study.
○ We divide the population into 2 or more strata and then we take a
simple random sample from each, e.g. height of surgeons between sexes.
■ We then see the % of each group and, depending on the %, we
multiply it by the sample size to obtain a more reliable sample.
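Both sampling schemes above can be sketched with Python's standard library. All counts here are hypothetical (a frame of 1000 subject IDs and assumed stratum sizes), chosen only to illustrate sampling without replacement versus with replacement, and proportional allocation across strata:

```python
import random

random.seed(1)  # for reproducibility

# Hypothetical sampling frame of 1000 subject IDs (illustrative only).
population = list(range(1, 1001))

# Simple random sample: without replacement each subject appears at most
# once; with replacement (random.choices) a subject could be counted twice,
# as in the "popping up twice" case in the notes.
srs = random.sample(population, k=50)

# Stratified sample with proportional allocation: assumed stratum sizes.
strata_sizes = {"male": 700, "female": 300}
total = sum(strata_sizes.values())
sample_size = 50
allocation = {s: round(sample_size * n / total) for s, n in strata_sizes.items()}

print(len(srs), len(set(srs)))  # 50 50 - no duplicates without replacement
print(allocation)               # {'male': 35, 'female': 15}
```

With 70% of the population in one stratum, proportional allocation assigns 70% of the sample (35 of 50) to it, which is the "% multiplied by the sample size" rule from the notes.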
Randomization
● In a randomized trial we don't mean that the pts are a random
sample; we have eligible pts who want to participate, and then we
allocate the treatments randomly.
● A procedure that allows one to have the same probability to stay in any
group.
○ All the probabilities are equal so the only diff will be due to the
treatment/control.
2. Descriptive Statistics
Why?
● Enables us to present the data in a more meaningful way, which allows
simpler interpretation.
○ Raw data presentation is hard.
● However, it does not allow us to draw conclusions beyond the data.
Mean
● Sum of all values divided by the number of subjects.
● E.g. mean age of the 509 women in the study.
Median
● The median of a finite set of values is that value which divides the set into
two equal parts such that the number of values equal to or greater than the
median is equal to the number of values equal to or less than the median.
● We need to sort the data and find the value that leaves the same numbers
of observations before and after.
○ If the number of values is odd: it is the central value, at position
(n+1)/2 of the sorted data.
○ If the number is even: mean of the two central values.
Mode
● Mode of a set of values is that value which occurs most frequently, but it
doesn't mean that the class is the most frequent.
● A set of values may have more than one mode.
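The three measures above can be checked with Python's `statistics` module on a small made-up data set (values chosen only to illustrate the odd/even median rule and a multi-modal set):

```python
import statistics

values = [2, 3, 3, 5, 7, 8, 9]       # odd n: the median is the central value
print(statistics.mean(values))        # sum of values / number of subjects
print(statistics.median(values))      # 5
print(statistics.mode(values))        # 3, the most frequent value

even = [2, 3, 5, 7]                   # even n: mean of the two central values
print(statistics.median(even))        # 4.0

# a set of values may have more than one mode
print(statistics.multimode([1, 1, 2, 2, 3]))   # [1, 2]
```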
The Histogram-Skewness
● We should see the form of the distribution.
● If we see that the long tail is on the right: positively skewed, or skewed
to the right.
○ Usually of no consequence, but the median is typically lower than
the mean.
● If we see that the long tail is on the left: negatively skewed or skewed to
the left.
○ We first have the mean and then the median.
● In symmetrical distribution, mean=mode=median.
Kurtosis
● A measure of the degree to which a distribution is peaked or flat.
● Neither peaked nor flat: mesokurtic distribution, kurtosis ≈ 0.
● Peaked and not flat: leptokurtic, kurtosis > 0.
● Flat with no peak: platykurtic, kurtosis < 0.
Measures of Dispersion
● The dispersion of a set of observations refers to the variety that they
exhibit.
● Conveys info regarding the amount of variability present in the set.
● Range-the difference between the largest and smallest value.
○ It will never be seen in papers, we will see only “smallest-largest”.
Coefficient of Variation
● Also called relative SD, it is the ratio between the SD and the arithmetic
mean, which can be expressed as a % and has no units: CV = (SD/mean) × 100.
● Useful because we can compare SDs among variables which have
different units of measurement; however, it may lead to fallacious results.
● The higher the number, the higher the variation.
Percentiles
● Called position or location parameters, as these values of the variable
which designate certain positions on the horizontal axis when the
distribution of a variable is graphed.
● Given a set of n observations x1, x2, x3...xn, the pth percentile P is the
value of X such that p percent or less of the observations are less than P
and (100-p) percent or less of the observations are greater than P.
○ P=the percentile.
○ E.g. P=10% we need to find the value of age that 10% of women are
younger, and 90% are older.
○ E.g2 P50 percentile, we need to find the value of oocytes that 50%
of observations are lower and 50% are higher.
● How to determine:
1. Sort data.
2. Define the P of interest.
3. Multiply N by P%.
4. Find the observation as determined in point 3.
a. Even/odd follow similar to median.
5. Read off its value.
● The 50th percentile is usually the median.
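The five steps above can be sketched as follows (one common convention; statistical packages differ slightly in how they interpolate, and the data here are made up):

```python
import math

def percentile(data, p):
    """p-th percentile following the sort-and-count steps in the notes."""
    xs = sorted(data)                    # step 1: sort
    k = len(xs) * p / 100                # step 3: multiply N by P%
    if k.is_integer():                   # "even" case: average two neighbours
        k = int(k)
        return (xs[k - 1] + xs[k]) / 2
    return xs[math.ceil(k) - 1]          # "odd" case: take the next position

data = [15, 20, 35, 40, 50]
print(percentile(data, 50))   # 35 - the 50th percentile is the median here
print(percentile(data, 40))   # 27.5 - mean of the two central candidates
```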
Quartiles
● P25-the 25th percentile; it corresponds to the first quarter of the sorted
distribution, so it is called the first quartile, Q1.
● P50-the median.
● P75-the 75th percentile; it corresponds to the third quarter of the sorted
distribution, so it is called the third quartile, Q3.
● The difference between P75 and P25 is the interquartile range, holding
50% of the data, and is the measurement of variability.
Geometric Mean
● Sometimes data can be summarized with the geometric mean if
they display skewness.
● It is the nth root of the product of all values.
● Equivalently, it can be calculated as the exponential of the arithmetic
mean of the log-transformed (natural log) data.
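Both definitions give the same number, which can be verified on a small skew-looking data set (values chosen for convenience):

```python
import math

values = [1, 10, 100]

# nth root of the product of all values
gm_product = math.prod(values) ** (1 / len(values))

# equivalently: exponential of the mean of the natural logs
gm_logs = math.exp(sum(math.log(v) for v in values) / len(values))

print(gm_product, gm_logs)  # both 10.0 (up to floating point)
```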
3. Basic of Probability and Probability Distributions
General Ideas
● What happens in our lives varies, and we quantify that variation with numbers.
● We use probability every day, e.g. what are the premises for certain
conditions, what side effects should I expect.
● Probability is a number between 0 and 1, never negative; an event that
cannot occur has probability 0.
● A certain event has probability 1.
Classical Probability
● If an event can occur in N mutually exclusive and equally likely ways, and if M of
these possess a trait E, then the probability of E is M/N.
○ 𝑃(𝐸) = 𝑀/𝑁.
○ E.g. rolling a die and wanting ‘1’: this is a simple event.
■ If we want odd numbers: a composite event.
Bayesian Method
● Founded on updating probabilities based on new information.
○ Prior probability-based on prior experience or derived from data.
○ Posterior probability-obtained by using new information to update or
revise prior probabilities.
● Largely applied in automatic diagnosis and in evaluation of diagnostic
tests.
Elementary Properties
● Given some process with n mutually exclusive outcomes (events) E1,
E2,..., En the probability of Ei is a non-negative number.
○ And the sum of these probabilities equals 1.
● We always need to look at the numerator and denominator in a table.
● Marginal probability-border of the table, when we use total to determine
probability.
● Complementary probability-could be determined as the difference
between all the probabilities (1) minus that one we have determined.
○ E.g. lung cancer P=0.31, what is the probability for others, 1-0.31.
● Conditional probability-when we use a subset of the total to determine
probability.
○ E.g. What is the P of lung cancer in the unemployed P=30/100.
● Joint probabilities-when we ask the probability that a subject has 2
characteristics at the same time, e.g. having lung cancer and being
unemployed.
● Multiplication rule-given two events A and B, the probability that A and B
both occur can be calculated by multiplying the appropriate marginal and
conditional probabilities: P(A and B) = P(B) × P(A|B).
○ Depending on what affects what, we determine which probability is
the marginal one.
● Independent events-if event A occurs regardless of B, they are
independent and P(A|B) = P(A).
● Addition rule-if we have two events, the probability that event A, or event
B, or both occur is equal to the probability that event A occurs, plus the
probability that event B occurs, minus the probability that both events
occur simultaneously: P(A or B) = P(A) + P(B) − P(A and B).
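The rules above can be checked numerically on a small 2×2 table. The counts here are hypothetical, chosen only so that the marginal probability of lung cancer is 0.31 and the conditional probability in the unemployed is 30/100, echoing the examples in the notes:

```python
# Hypothetical 2x2 count table (rows: employment, columns: lung cancer).
table = {("unemployed", "cancer"): 30, ("unemployed", "no_cancer"): 70,
         ("employed", "cancer"): 32,  ("employed", "no_cancer"): 68}
n = sum(table.values())                                         # 200

# marginal probabilities (row/column totals over the grand total)
p_cancer = (table[("unemployed", "cancer")] + table[("employed", "cancer")]) / n
p_unemployed = (table[("unemployed", "cancer")] + table[("unemployed", "no_cancer")]) / n

p_joint = table[("unemployed", "cancer")] / n                   # joint probability
p_cancer_given_unemp = table[("unemployed", "cancer")] / 100    # conditional (subset)

# multiplication rule: P(A and B) = P(B) * P(A|B)
assert abs(p_joint - p_unemployed * p_cancer_given_unemp) < 1e-12

# addition rule: P(A or B) = P(A) + P(B) - P(A and B)
p_either = p_cancer + p_unemployed - p_joint

print(p_cancer, p_joint, p_either)  # 0.31 0.15 0.66
```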
Probability Distribution
● When we measure a variable we collect data from pts and we collect a
random variable, meaning we don't know the results.
● What happens if the value of the random variable has a certain probability
(P) and this can be explained by using probability distribution.
● Probability distribution-a table, graph, formula, or other device used to
specify all possible values of a discrete random variable along with their
respective probabilities; a discrete value can't be 1.5, for example.
● Very similar to relative frequency.
○ E.g. The number of the drugs is the event (A) with probability P (A),
which is the relative frequency.
Probability Distribution: The Binomial Distribution: Bernoulli Process
● Example: 85.8% of the pregnancies had delivery in week 37 or later, if we
randomly select 5 from this population, what is the probability that exactly 3
of these will be full-term.
● A sequence of Bernoulli trials forms a Bernoulli process under the following
conditions:
a. Each trial results in one of two possible mutually exclusive
outcomes, one a success and one a failure (full term vs preterm).
b. The probability of success, denoted by p, remains constant from
trial to trial; the probability of failure, q, is 1-p (here p=0.858).
c. The trials are independent; the outcome of one trial is not affected by the
outcome of another.
● X=the count (number of full-term deliveries) and n=the number of trials.
● Mean-what we expect from the process: np.
● Variance-how much more or less than the expected value: npq.
● Each binomial distribution has two parameters, n and p, written
B(n, p); the SD is √(npq).
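The worked example from the notes (p = 0.858, five pregnancies, exactly 3 full-term) can be computed directly from the binomial formula P(X = k) = C(n, k)·pᵏ·qⁿ⁻ᵏ:

```python
from math import comb

def binom_pmf(k, n, p):
    """P(X = k) for a Binomial(n, p) random variable."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# full-term example: p = 0.858, n = 5, P(X = 3)
n, p = 5, 0.858
print(round(binom_pmf(3, n, p), 4))        # 0.1274

# mean (np) and variance (npq) of the distribution
print(n * p, round(n * p * (1 - p), 3))
```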
General Ideas
● The distribution of all possible values that can be assumed by some
statistic, computed from samples of the same size randomly drawn from
the same population, is called the sampling distribution of that statistic.
● When we have a population we should have a lot of samples from the
population, so we can study the distribution of these.
Estimation
● Statistical inference is the procedure by which we reach a conclusion about
a population on the basis of the info contained in a sample drawn from that
population.
● Calculating from the data of a sample some statistic that is offered as an
approximation of the corresponding parameter of the population.
● Point estimate-a single numerical value used to estimate the
corresponding population parameter, a point on a real X-axis.
○ Don't let us draw any conclusions on the population.
● Interval estimate-consists of two numerical values defining a range of
values that, with a specified degrees of confidence, most likely include the
parameter being estimated.
○ Information on population.
○ In this case we need a lower mean and a higher mean that with a
certain degree of confidence include the parameter.
● The degree of confidence is a probability, 𝑃 = 1 − α .
Interpretation of CI
● We don't know whether the population mean lies within the calculated
interval, but we may act, in practice, as if it does because the interval
estimator is successful in capturing the mean in 95% of samples from the
population to which it is applied.
● Narrower intervals mean a more precise estimation.
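A point estimate and its interval estimate can be sketched on a hypothetical sample of ages. The reliability coefficient 1.96 is the large-sample (normal) value for 95% confidence; with small samples a t-based coefficient would be used instead:

```python
import math
import statistics

# Hypothetical sample of ages (assumed data, illustration only).
ages = [32, 29, 35, 31, 30, 33, 28, 34, 36, 27,
        31, 32, 30, 33, 29, 35, 28, 31, 34, 30]
n = len(ages)
mean = statistics.mean(ages)                 # the point estimate
se = statistics.stdev(ages) / math.sqrt(n)   # standard error of the mean

# CI = estimator +/- (reliability coefficient * SE of the estimator)
lower, upper = mean - 1.96 * se, mean + 1.96 * se
print(f"point estimate {mean:.2f}, 95% CI ({lower:.2f}, {upper:.2f})")
```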
Introduction
● When we test hypotheses, most of all we speak about decisions.
● We try to reach a conclusion concerning a population.
○ With the starting point of the sample of course.
● It may be defined simply as a statement about one or more populations.
● The research hypothesis is the idea of the researcher and the aim of the
research.
● The research hypothesis should be modified into an appropriate,
statistical hypothesis which can be evaluated by appropriate statistical
approach.
Equality of Variances
● We need to evaluate this equality, so we set up a new system of
hypotheses with the null hypothesis that the variances are equal.
● H0: σ₁² = σ₂² and H1: σ₁² ≠ σ₂².
● Fisher distribution is an asymmetrical distribution with a long tail in the
right.
○ So we have only one side of the tail of alpha 0.05.
● Using the Fisher table, along the top we have the DF of the numerator,
and on the left we have the DF of the denominator.
● If we find that we don't reject the H0, we can pool the variances.
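A minimal sketch of the variance-ratio test on made-up data: the larger sample variance goes in the numerator, and F is compared with a tabulated critical value (the F_crit below is the assumed value for α = 0.05 with DF (5, 5); in practice read it from the Fisher table):

```python
import statistics

# Hypothetical measurements in two groups (illustration only).
group1 = [12.1, 11.8, 13.0, 12.5, 11.9, 12.7]
group2 = [11.5, 12.9, 13.1, 11.9, 12.2, 12.8]

v1, v2 = statistics.variance(group1), statistics.variance(group2)
F = max(v1, v2) / min(v1, v2)   # larger variance in the numerator

F_crit = 5.05   # assumed critical value, alpha = 0.05, DF (5, 5)
print(round(F, 2), F < F_crit)  # F < F_crit: do not reject H0, variances may be pooled
```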
Non-parametric Tests
● If the distribution of the variable cannot be considered normal, comparing
groups or searching for correlation can be done through distribution-free
statistics.
● Often these are based on a rank transformation of the data.
8. Analysis of Variance
ANOVA Introduction
● A technique to partition the total variance in a set of data into two or
more components.
● We can determine variability in data, but this could be dependent on
different factors.
○ The association between the variation components and their sources
allow us to say something on each magnitude.
● Analyze what is the most important source.
● A model is the symbolic representation of typical values from the data, an
equation, expressing the relation between variables.
● Applied when we have normal distribution.
Example
● A sample with 3 groups, checking weight, divided by the answer to the
question of whether they pay attention to carbohydrates and proteins:
○ “Yes”.
○ “More or less”.
○ “Not at all”.
● There is variance among groups and among subjects, depending on their
own randomness and group⇒ partitioning the total variance.
● The mean of the sample in this case is called the grand mean.
Multiple Comparison
● If we have a study with 4 groups, we can divide the H0 of the ANOVA to
different null hypotheses.
○ Defined by: K(K−1)/2.
■ We can do multiple t-tests, but this is not efficient.
● Bonferroni correction: if α is the total type I error and we have m
comparisons the significance level of each pair-wise comparison is α/𝑚.
● First method (name not mentioned):
○ The comparison is made with a CI for the difference of means.
○ A difference between two means is significant if its CI does not include 0.
○ This is too conservative.
○ MSE=mean square error within groups, after the calculation.
● Tukey-Honestly Significant Difference:
○ We respect the global α = 0.05.
○ We multiply the square root of (MSE/n) by a critical value from the
Tukey table, which depends on α, k and N − k.
○ We then check the differences of the different pairs; if a difference is
higher than the HSD, it is statistically significant.
○ We compare positive and negative differences.
○ n=size of the groups (e.g. 3 in the example).
● Tukey-HSD for unequal size:
● Least Significant Difference (LSD by Fisher):
○ The critical value is taken from the Student's t distribution with N−k DF;
the test is two-sided, so we use α/2 = 0.025.
Hypothesis testing could regard proportions of events, we should follow the same
procedure with the usual steps until the clinical decision.
One Proportion
● Data: sample data about proportion of an event and the available
population proportion.
● Assumption: data follow binomial distribution, but we could approach a
normal distribution if the sample is large enough and the proportion of
events approaches 0.5.
● H0: p=p0 HA: p!=p0.
○ p0=available value with which we want to compare.
● Test statistic: z = (p̂ − p0)/√(p0·q0/n), where q0 is basically 1-p0.
● Decision rule: two-sided, α = 0.05, Ztab=1.96; H0 is rejected if
|z|>Ztab.
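The steps above can be sketched with hypothetical numbers (60 events out of n = 100 against a reference proportion p0 = 0.5; all values assumed for illustration):

```python
import math

# One-proportion z test sketch (hypothetical data).
x, n, p0 = 60, 100, 0.5
p_hat = x / n
q0 = 1 - p0

z = (p_hat - p0) / math.sqrt(p0 * q0 / n)   # test statistic
z_tab = 1.96                                # two-sided, alpha = 0.05

print(round(z, 2), abs(z) > z_tab)          # 2.0 True -> reject H0
```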
Chi-Square Test
● Data: sample data about proportion of two independent samples.
● Assumption: data from independent samples following a binomial
distribution, used when we cannot rely on the normal
approximation.
● H0: p1=p2 HA: p1!=p2.
● Test statistic:
● Decision rule: two-sided, α = 0.05, DF=1,
critical χ² = 3.841; H0 is rejected if the statistic is higher.
● We have a 2X2 table, so we have only 1
DF. As long as we have the marginal
number, we can calculate the rest.
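For a 2×2 table the chi-square statistic has a convenient shortcut form, χ² = n(ad − bc)²/((a+b)(c+d)(a+c)(b+d)), which can be tried on hypothetical counts:

```python
# Hypothetical 2x2 table (illustration only):
#            event   no event
# group 1     a=30      b=70
# group 2     c=15      d=85
a, b, c, d = 30, 70, 15, 85
n = a + b + c + d

# shortcut chi-square formula for a 2x2 table
chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

crit = 3.841   # critical chi-square for DF = 1, alpha = 0.05
print(round(chi2, 3), chi2 > crit)   # 6.452 True -> reject H0
```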
What is Risk
● Absolute risk is the probability or chance that a person will have a
medical event, expressed in %.
● It is the ratio between the number of people who have a medical
event/people who could have the event because of their medical condition.
OR vs RR
● In research papers we will usually see more ORs and it can be also
applied to cohort study/prospective study.
● When OR is applied to a prospective design with a sample size large
enough, it may be a good approximation of RR.
● OR is a good estimate of RR when the “cases” and “controls” are
representing all the people in the population where we draw the sample.
● When the disease is rare:
○ 𝐼𝑒 = 𝑎/(𝑎 + 𝑏) ≈ 𝑎/𝑏.
○ 𝐼𝑛𝑜𝑛𝑒 = 𝑐/(𝑐 + 𝑑) ≈ 𝑐/𝑑.
○ 𝑅𝑅 = (𝑎/𝑏)/(𝑐/𝑑) = 𝑎𝑑/𝑏𝑐 = 𝑂𝑅.
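The rare-disease approximation can be seen numerically on a hypothetical exposure table (counts assumed for illustration):

```python
# Hypothetical 2x2 table:
#              disease   no disease
# exposed        a=8        b=92
# non-exposed    c=2        d=98
a, b, c, d = 8, 92, 2, 98

rr = (a / (a + b)) / (c / (c + d))   # risk ratio from incidences
odds_ratio = (a * d) / (b * c)       # cross-product odds ratio

print(round(rr, 2), round(odds_ratio, 2))   # 4.0 4.26 - close when the disease is rare
```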
Inference on OR and RR
● Hypothesis testing follows the same procedure and steps as for frequency
data.
● H0: RR=1, OR=1 or ln(OR)=0, HA: 𝑅𝑅 ≠ 1, 𝑂𝑅 ≠ 1.
● Test statistic: a chi-square similar to what we saw for frequency data, or
we use the natural log.
CI for RR and OR
● We can determine CI for OR, applying Miettinen theory, we
can determine SE using the statistical test.
● Z is always (statistic minus population parameter)/SE; since under H0 the
population parameter is 0, we can find the SE from the statistic.
● CI = estimator ± (reliability coefficient × SE of the estimator).
Stratification
● It is used when we want to analyze exposure in two
subgroups with two different risk factors.
● For example: we stratify for smokers and non smokers,
with alcohol or without in each strata.
○ For every 2X2 table we take a·d/N and b·c/N, sum them across the
strata, and their ratio gives the OR by Mantel and Haenszel, which
we can also use for the CI.
● This allows us to obtain an adjusted ratio such as in
this example the alcohol risk was 2.29 and then
decreased to 1.69 when adjusted to smoking.
● For the CI we use Mantel and Haenszel Chi-square.
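The stratified (adjusted) odds ratio can be sketched with the standard Mantel-Haenszel form, OR_MH = Σ(aᵢdᵢ/Nᵢ)/Σ(bᵢcᵢ/Nᵢ). The stratum counts below are assumed, not the smoking/alcohol data from the course:

```python
# Hypothetical strata (e.g. smokers / non-smokers), each a 2x2 table:
# (a, b, c, d) = (exposed cases, exposed controls, unexposed cases, unexposed controls)
strata = [
    (20, 30, 10, 40),   # stratum 1 (assumed counts)
    (10, 60, 5, 75),    # stratum 2 (assumed counts)
]

num = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
den = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
or_mh = num / den   # adjusted odds ratio across strata

print(round(or_mh, 2))   # 2.6
```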
Chi-square by Mantel-Haenszel
● There are relationships or comparisons which could be biased by other
factors, confounding factors due to their hidden participation and affect
results.
○ As in the previous example, smoking, so we had crude OR and
adjusted.
● How to avoid confounding:
○ Restriction→ excluding from the analysis, e.g., smokers among the
alcohol drinkers.
○ Matching→ when we have an alcohol drinker we match to a
non-smoker; if we have someone who both drinks and smokes, we
take a non-drinking non-smoker as the control.
○ Randomization→ possible only for prospective studies.
○ Standardization→ a statistical method to adjust results, taking into
account a standard population to refer to.
○ Stratification→ similar to what we did.
○ Multivariate analysis→ various regression methods.
● R1i=first row of strata i, C1i=first column of strata i, the marginal numbers,
Ni=total number of strata, E(ai)=expected value, ai=observed value.
12. Correlation
Introduction
● We use it when we want to analyze relationships between two variables for
example: BP and weight, BP and height etc.
● Nature and strength of the relationship are analyzed through regression
and correlation.
● We evaluate only the association between them, without assuming a causal
relation: X may determine an effect on Y, and Y may have an effect on X.
○ Or a third cause may connect the variables in the analysis.
Logistic Regression
● Here we talk about a model in which the y is a categorical dichotomous so
we can either have 0 or 1, and this can be provided by proportions.
● We use the ratio p/(1−p); this is called the odds, which lies in the interval
between 0 and infinity.
● We can transform this Y dependent variable so that it becomes the natural
log of the odds of having cancer.
○ Values go from -infinity to +infinity, and the distribution is more or less normal.
○ This is the logit transformation, or logit link, used with linear regression.
● The aim of the study is the dependence relationship between one
dependent variable, categorical and dichotomous, and one or more
independent variables, categorical or continuous.
● We need these odds in biomedical studies because they are a risk
measure, the risk of having a condition, useful for understanding which
of the variables are involved in the risk.
● Exp of the coefficient of regression is the odds ratio.
○ The odds ratio to have cancer with X1 divided by without.
● In our example: ln(p esophageal cancer/(1 − p)) = b0 + b1·smoke + b2·alcohol.
○ Once again smoking is a categorical variable, so we use 0 and 1.
● Another side note: when we have multiple categorical groups we need to
set one as a reference, and its OR will be 1.
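A small sketch of how fitted logit coefficients turn into odds ratios and predicted probabilities. The coefficients b0, b1, b2 are assumed values for illustration, not estimates from the esophageal-cancer example:

```python
import math

# Hypothetical fitted model: logit(p) = b0 + b1*smoke + b2*alcohol
b0, b1, b2 = -4.0, 1.2, 0.8   # assumed coefficients

# exp of a coefficient is the odds ratio for that variable
or_smoke = math.exp(b1)       # OR for smokers vs non-smokers
or_alcohol = math.exp(b2)     # OR for drinkers vs non-drinkers

# predicted probability for a smoker who drinks (inverse logit)
logit = b0 + b1 * 1 + b2 * 1
p = 1 / (1 + math.exp(-logit))

print(round(or_smoke, 2), round(or_alcohol, 2), round(p, 3))   # 3.32 2.23 0.119
```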
14. Survival Analysis
Introduction
● Useful when we have time-to-event study, e.g. outcome of a treatment by
comparing the proportion of events and the days of the weeks or the
months needed to reach the event.
● Events can be a success in therapy, or we can have death or any clinical
event.
● Aims:
○ Estimating survival function, time on X and probability to survive.
○ Comparison of survival between groups, treated vs not.
○ Estimate the effects of covariates in survival, hazard ratio.
● The measure to analyze is time to event.
● The terminal time point could be:
○ The end point of the study (the event of interest):
■ Overall survival: when the event is death due to the disease, side
effects of the drugs, progression and many others.
■ Disease-free survival: we treat the cancer and count the time
until the cancer relapses or metastasizes.
■ Time to progression: we treat the cancer and wait for
progression/metastases, including non-successful treatments.
■ Specific survival.
○ The occurrence of death due to a different cause with respect to the
study endpoint.
○ The loss to follow-up, pts no more followed because they left the
study for different reasons with respect to study endpoint.
■ These are called censored; they are alive, but we can't say
anything more about them.
● Three Types of Right Censoring:
○ Type I when all subjects are scheduled to begin the study at the
same time and end the study at the same time, common in animal
experiments, not human trials.
○ Type II when all subjects begin the study at the same time and the
study is terminated when a predetermined proportion of subjects
have experienced the event, usually phase 2 studies for drugs.
○ Type III censoring is random, in the case of clinical trials because of
staggered entry, unequal follow-up on subjects, and starting study at
different times.
● To estimate the cumulative survival probability S(t) we apply the
product-limit method.
○ S(t) is the cumulative probability of surviving at time t and is the
product of the probabilities of surviving through each previous time.
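The product-limit idea can be sketched on a tiny hypothetical follow-up data set: at each event time, S(t) is multiplied by (1 − deaths/at-risk), while censored subjects only shrink the risk set:

```python
# Hypothetical follow-up data: (time in months, event: True = death, False = censored)
data = [(2, True), (3, False), (5, True), (5, True), (8, False), (10, True)]

at_risk = len(data)
surv = 1.0
curve = []
for t in sorted({t for t, event in data}):
    deaths = sum(1 for ti, ev in data if ti == t and ev)
    if deaths:                                  # censoring alone does not change S(t)
        surv *= 1 - deaths / at_risk
        curve.append((t, round(surv, 3)))
    at_risk -= sum(1 for ti, ev in data if ti == t)   # remove events and censored

print(curve)   # [(2, 0.833), (5, 0.417), (10, 0.0)]
```

The median survival would be read off where the curve crosses S(t) = 0.5 (here 5 months).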
Kaplan-Meier Curve
● The curve gives us an idea of how survival evolves over time; it looks like
a ladder, with time on the X axis and cumulative survival on the Y axis.
○ Censored subjects are shown as arrows, and don't influence the curve.
● We apply the curve if:
○ We know the exact moment in which the
event occurs.
○ Survival function changes when the event occurs.
○ Curves change each time an event occurs.
○ Censored doesn't change survival.
● Median survival is the time corresponding to S(t)=0.5.
● Mean survival is computed as the sum of survival times/total sample size.
● Average hazard, which is the measure of risk: the number of events/sum
of survival times, in % per month.
● To compare survival curves, observed events are compared with expected events:
○ We sum all the expected and observed events at each time point.
○ The expected number is the rate of death multiplied by the people alive in the group.
● If we have more than 2 variants, we will need to adjust the results.
● When we have a new diagnostic test we need to prove that this is a good
test for diagnosis:
● We need to evaluate:
○ Reliability.
○ Population it will be applied to.
○ Compare the test with a gold standard test.
○ If the variables are quantitative we need a cutoff for sensitivity and
specificity.
○ Outcome evaluation to confirm the reliability of the test.
● Accuracy is the ability of the test to actually measure what it claims to
measure, and is defined as the proportion of all tests (+&-) that are correct.
● Precision is the ability to obtain the same results on repeated measurements
on the same patient or sample; it should be assessed via the SD.
○ Of course we want a test that is accurate and precise.
● Diagnostic accuracy measures tell us about the ability of the test to:
○ Discriminate between diseased and non-diseased; here we use
sensitivity and specificity.
○ Predict disease and health: predictive values, likelihood ratios, area
under the ROC curve and overall accuracy.
■ We need to decide how much predictive is a certain value for
a certain result.
● Sensitivity is the ability of the test to find the disease.
● Specificity is the ability of the test to have the lowest possible false
positive results, i.e. the test picks up only the disease we are searching for.
● Gold standard means we have the exact diagnosis according to MRI or
whatever.
● Likelihood ratio of a positive test: if >1, the probability of being diseased
is higher than that of being healthy; LR+ = sensitivity/(1 − specificity).
○ 1-specificity is essentially the false positive rate.
● Likelihood ratio of a negative test: if <1, being healthy is more likely
than being diseased; LR− = (1 − sensitivity)/specificity.
● The global accuracy is the number of all correct results divided by the
whole sample, read as a %.
● We need to choose the point that discriminates positive and negative,
and we hope this point will discriminate between diseased and not
diseased.
○ In order to do this we classify pts according to their AFP levels on
the Y axis.
○ We put the classes on the X axis, in our example HCC and cirrhosis.
○ We start with a certain threshold which is diseased, and we keep
going by regular intervals and for each we examine the sample in
2X2 tables where we get the max sensitivity and specificity.
● Youden’s J statistic=sensitivity+specificity-1.
○ The optimal threshold is the one that maximizes J.
● We can determine the CI for sensitivity and specificity as they are
proportions.
Predictive Values
● We want to know, among the positives, how many have been caught by
the test: the positive predictive value.
● We want to know how many negatives have been correctly classified
by the test: the negative predictive value.
● Both values depend on the prevalence of the condition in the
population.
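All the diagnostic measures from this section can be computed from one 2×2 table against the gold standard. The counts below are hypothetical, chosen only to make the arithmetic easy to follow:

```python
# Hypothetical 2x2 table versus the gold standard (illustration only):
#              diseased   healthy
# test +          tp=90      fp=10
# test -          fn=10      tn=90
tp, fp, fn, tn = 90, 10, 10, 90

sensitivity = tp / (tp + fn)                 # ability to find the disease
specificity = tn / (tn + fp)                 # ability to avoid false positives
ppv = tp / (tp + fp)                         # positive predictive value
npv = tn / (tn + fn)                         # negative predictive value
accuracy = (tp + tn) / (tp + fp + fn + tn)   # global accuracy
lr_pos = sensitivity / (1 - specificity)     # LR of a positive test, here ~9
lr_neg = (1 - sensitivity) / specificity     # LR of a negative test, here ~0.11
youden_j = sensitivity + specificity - 1     # Youden's J, here ~0.8

print(sensitivity, specificity, ppv, npv, accuracy)
print(lr_pos, lr_neg, youden_j)
```

Note that PPV and NPV come out equal to sensitivity and specificity here only because the table implies a 50% prevalence; as stated above, the predictive values change with the prevalence of the condition.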
16. Bayes Theorem