
Chapter 0: Quantitative reasoning

1. Frame the question
2. Specify
3. Collect
4. Analyse
5. Communicate

Chapter 1: Design of studies

Cause and effect:


1. Controlled experiments: subjects can be assigned to treatment by the investigators ->
assignment can be randomized
2. Observational studies: subjects are already assigned to groups (not by the investigators)
Controlled experiments (the Salk vaccine trial): the polio vaccine was tested in a huge experiment
involving about 1 million children
1. Should the vaccine simply be given to all children?
i. There would be no comparison group. If polio cases fall the next year, it might only be because that year is not an epidemic year
2. Controlled experiment:
i. Compare a treatment group (gets the vaccine) with a control group (no vaccine)
ii. Ethical dilemma: harm to a few vs benefit for many
-> the parents have to agree (give consent) -> the ethical concern is largely resolved, as only children whose parents want the vaccine receive it
-> Risks: 1. a child could get polio from the vaccine, 2. complications from the injection

iii. Unequal group sizes are taken care of by using:

-> rate: the number of polio cases for every 100,000 children
iv. Do the two groups have similar conditions?
-> do the two groups have the same risk?
- Treatment group: children with parental consent have higher risk to begin with. (Children in this group tend to come from wealthier households with better hygiene, so they are not exposed to polio when young and do not build up immunity. Children without consent tend to come from lower-income families, are exposed to polio young, usually survive it thanks to their immune systems, and become immune afterwards -> children in the treatment group are more likely to get polio.)
- Control group: children without parental consent have lower risk
-> the design is biased against the vaccine: its effect will be understated, and fewer people will be willing to take the vaccine
Eg. Suppose the vaccine did reduce polio moderately. If this effect is smaller than the difference in risk between the two groups, the treatment group would still show a higher rate of polio -> leading to the wrong conclusion that the vaccine is not working
-> since the children with consent are more vulnerable, the vaccine's results look worse than they really are
NFIP study:
1. Design: control group = Years 1 and 3; treatment group = Year 2 with consent; observed group = Year 2 without consent

i. The NFIP study is biased against the vaccine

ii. The same issue applies: children with consent are at higher risk
iii. From the data, the vaccine appears to reduce the polio rate by 54 - 25 = 29 cases per 100,000 children. However, the actual effect is larger, because the children in the treatment group (children with consent) are more at risk than the control group (a mixture of children with and without consent).
iv. Why is the rate of the third group lower than that of the control group?
Since the control group (a mixture of children with and without consent) is more at risk than the third group (children without consent only), its polio rate is higher

v. Only children with consent should take part in the study, because having parental consent is associated with the risk of contracting polio and would otherwise make the results inaccurate
b. Lesson learnt: subjects who refuse treatment should not be put in the control group; they should be excluded

Randomized assignment:
1. An impartial procedure using chance, like "random draws without replacement".
a. If the sample size is large, it is very likely the two groups will be similar in all aspects
b. The draws must be mixed thoroughly
c. The randomized experiment may still be affected by certain issues, which are addressed by:
i. Placebo effect: people in the control group who know they are not protected by the vaccine may take extra measures to protect themselves -> this affects their probability of contracting polio -> affects the study -> so people in the control group are injected with a placebo (fake medicine) to make it look as if they received the vaccine
ii. Blinding subjects: the children are blinded and won’t know their assigned group
iii. Blinding doctors: the doctors who diagnose polio do not know which group each child belongs to, so the diagnosis cannot be influenced by that knowledge
d. Results:

i. It can be seen that the control group in the randomized experiment has a higher rate of polio
e. Advantages of the randomized experiment:
i. The chance that the result of the experiment is due to luck alone, rather than the vaccine working, is about one in a billion -> absurdly small
ii. Randomized controlled double-blind experiments (where both the subjects and the treatment providers are blinded) are also useful for comparing a new treatment with an old treatment, instead of a placebo (a new treatment should be compared with the existing treatment rather than with a placebo) -> further prevents bias
iii. Conclusions may not apply to ineligible subjects.
iv. The ideas apply to other fields, like education.
2. Non-randomized controls: healthier subjects tend to be assigned to treatment and less healthy ones to control. In the table below, there are three approaches to finding out the effectiveness of the medical procedure: 1. studies with no control group (total = 32 cases), 2. studies with control groups but not randomized (total = 4 cases), 3. studies with randomized controls (total = 4 cases). (Marked = successful operation.)

a.
b. Randomized vs historical

Observational study: (smoking and health)

1. Positive association:
a. A, B are population characteristics with
i. 0 < rate(A) < 1 and 0 < rate(B) < 1
ii. If rate(A|B) > rate(A|not B), meaning that A is more common among people with B than among people without B, or rate(B|A) > rate(B|not A), meaning that B is more common among people with A than among people without A
-> "A and B are positively associated"
iii. However, positive association is not causation
b. A, B are population characteristic with 0 < rate(A), rate(B) <1
i. A and B are negatively associated if rate(A|B) < rate(A|not B) or rate(B|A) <
rate(B|not A)
ii. If A and B are negatively associated, then “not A” and B are positively
associated
c. No association: rate(A|B) = rate(A|not B) or Rate(B|A) = rate(B|not A)
i. Hard to observe
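The definitions above can be checked mechanically from a 2x2 table of counts. Below is a minimal sketch in Python; the counts are invented purely for illustration and are not from the notes.

```python
# Minimal sketch (illustrative counts): check the sign of association from a 2x2 table.

# counts[b][a] = number of people with B-status b and A-status a
counts = {
    "B":     {"A": 30, "not A": 70},
    "not B": {"A": 10, "not A": 90},
}

def rate_A_given(b_status):
    """rate(A | b_status): proportion with A among people with the given B-status."""
    row = counts[b_status]
    return row["A"] / (row["A"] + row["not A"])

rate_A_given_B = rate_A_given("B")          # rate(A | B)
rate_A_given_notB = rate_A_given("not B")   # rate(A | not B)

if rate_A_given_B > rate_A_given_notB:
    print("A and B are positively associated")
elif rate_A_given_B < rate_A_given_notB:
    print("A and B are negatively associated")
else:
    print("no association")
```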
Confounders:
1. A confounder is a third variable that is associated with both the exposure and the disease.
-> If the variable is linked to only one of the exposure or the disease, then it is not a confounder
2. The actual relationship between exposure and disease can be obscured by a confounder
3. Question to identify a confounder: "Are the treatment group and the control group different in any way other than the treatment?"
4. In a controlled experiment, confounders can be dealt with by randomization
5. In an observational study, where subjects have already assigned themselves to the treatment and control groups, confounding is a real issue.
a. Slicing: dividing the control group and the treatment group into smaller groups that share the same criteria (eg. age, sex) -> make comparisons within each slice (see the code sketch below)

Eg: sex is a confounder -> smoking is investigated separately for two groups, males and females
If smokers are still worse off than non-smokers in all age groups and in both sexes, then we can conclude that smoking is bad for health
-> slicing is simple but becomes cumbersome if there are many confounders
b.
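As referenced above, here is a minimal sketch of slicing in Python. The death counts and group sizes are hypothetical; the point is only the within-slice comparison.

```python
# Minimal sketch of "slicing" (hypothetical counts): compare smokers vs
# non-smokers within each sex, so that sex cannot confound the comparison.

# data[sex][group] = (number of deaths, group size)
data = {
    "male":   {"smoker": (90, 1000), "non-smoker": (60, 1000)},
    "female": {"smoker": (50, 1000), "non-smoker": (30, 1000)},
}

for sex, groups in data.items():
    rates = {g: deaths / n for g, (deaths, n) in groups.items()}
    worse = rates["smoker"] > rates["non-smoker"]
    print(f"{sex}: smoker rate {rates['smoker']:.3f}, "
          f"non-smoker rate {rates['non-smoker']:.3f}, "
          f"smokers worse off: {worse}")

# If smokers are worse off in every slice, the difference is not explained by sex.
```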
Chapter 2: Association

1. Association:
a. Deterministic relationship: the value of one variable can be determined exactly if we know the value of the other variable (eg. via a formula)
b. Statistical relationship: natural variability exists in the measurements/outcomes of the two variables under study
2. Data for the chapter: Karl Pearson's father-and-son data set
a. Purpose: to study the degree to which children resemble their parents
b. Action: gather a huge amount of data -> quantify the resemblance
c. Study:
i. heights of 1078 father-son pairs were collected (1078 sets of data)
ii. two variables were recorded: the father's height (X) and the son's height (Y)
iii. Bivariate data (X, Y)
d. Data analysis:
i. The averages: the fathers' average height is 68 inches while the sons' average height is 69 inches
ii. The spread/variability (measured by the standard deviation)
iii. Scatter diagram:
- Draw a line at 45 degrees: the points on this line are data points where father and son have the same height. Points above the line mean the son is taller than his father -> the sons' average height is greater than the fathers'

- Draw a vertical line at the fathers' average height and a horizontal line at the sons' average height -> the sons' average height is larger. Most points are concentrated in regions 2 and 3 -> taller fathers tend to have taller sons (positive association, i.e. the direction of the relationship)
- There is a linear relationship
-> To quantify the strength of the association: the closer the points lie to the linear regression line, the stronger the relationship
iv. Correlation coefficient (r):
- A measure of linear association between two variables; a summary of the direction and strength of the linear association
- Ranges between -1 and 1
- r > 0 -> positive association; r = 0 -> no linear association; r < 0 -> negative association
- r = 1 -> perfect positive linear association; r close to ±1 -> strong association (|r| > 0.7); r close to 0 -> weak association (|r| < 0.3); r close to ±0.5 -> moderate association (0.3 < |r| < 0.7)
v.
e.
3. Correlation coefficient:
a. Data: variable X (father's height) and Y (son's height); average father height x̄, average son height ȳ; standard deviations SDx and SDy
b. Computation (see the code sketch at the end of this item):
Step 1: Convert each variable to standard units: SUx = (X - x̄)/SDx and SUy = (Y - ȳ)/SDy
i. A positive standard unit shows the observation is above the average
Step 2: Take the product SUx × SUy for each father-son pair
Step 3: r = the average of the 1078 products
c. Excel function : correl()
d. Properties:
i. The r value is not affected by interchanging the two variables
ii. The r value is not affected by adding a number to, or multiplying by a positive number, all values of a variable -> r is not affected by a change of scale or location
e. Limitations:
i. Causation: a change in one variable "produces" or "causes" a change in another variable
- Correlation does not imply causation
ii. Impact of outliers on correlation: outliers are data points that lie unusually far from the bulk of the data
- They can be spotted from a scatter diagram
- We need to know what caused the outlier
- Outliers can sometimes increase or decrease the linear correlation
iii. The correlation coefficient cannot be used to describe a non-linear relationship
f.
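As referenced in the computation steps above, here is a minimal Python sketch of the standard-units method, cross-checked against the library routine statistics.correlation (Python 3.10+). The five father-son pairs are made up for illustration; they are not Pearson's data.

```python
# Minimal sketch: compute r via standard units, then verify with the library.
import statistics

fathers = [65.0, 68.0, 70.0, 72.0, 66.0]   # X: father's height (inches), made up
sons    = [66.5, 68.0, 71.0, 72.5, 67.0]   # Y: son's height (inches), made up

n = len(fathers)
x_bar, y_bar = statistics.mean(fathers), statistics.mean(sons)
sd_x = statistics.pstdev(fathers)          # population SD, as in the notes
sd_y = statistics.pstdev(sons)

# Step 1: standard units; Step 2: products; Step 3: average of the products
products = [((x - x_bar) / sd_x) * ((y - y_bar) / sd_y)
            for x, y in zip(fathers, sons)]
r = sum(products) / n
print(round(r, 4))

# Cross-check: statistics.correlation (Python 3.10+) gives the same r.
print(round(statistics.correlation(fathers, sons), 4))
```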
4. Ecological correlation: a correlation computed from aggregates (groups, not individuals), eg. group averages or rates
a. The strength of the association is overstated when the association is based on aggregates.

b. Ecological fallacy: drawing inferences about individuals from aggregate data. This is distinct from the atomistic fallacy (using individual observations to draw conclusions about groups).
c.
5. Cautionary notes
a. Attenuation effect: due to a restriction of range in one variable, the correlation coefficient obtained tends to understate the strength of the association between the two variables
6. Linear Regression:
a.
b.
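A minimal sketch of the regression line of Y on X, assuming the standard construction (slope = r × SDy / SDx, with the line passing through the point of averages); the heights are the same made-up values as above, not Pearson's data.

```python
# Minimal sketch (assumed standard regression-line formulas, illustrative data):
# the regression line of Y on X has slope r * SD_y / SD_x and passes through
# the point of averages (x_bar, y_bar).
import statistics

fathers = [65.0, 68.0, 70.0, 72.0, 66.0]   # made-up heights (inches)
sons    = [66.5, 68.0, 71.0, 72.5, 67.0]

x_bar, y_bar = statistics.mean(fathers), statistics.mean(sons)
sd_x, sd_y = statistics.pstdev(fathers), statistics.pstdev(sons)
r = statistics.correlation(fathers, sons)   # Python 3.10+

slope = r * sd_y / sd_x
intercept = y_bar - slope * x_bar

def predict_son_height(father_height):
    """Predicted (average) son's height for a given father's height."""
    return intercept + slope * father_height

print(round(predict_son_height(70), 2))
```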

Chapter 3: Sampling
1. Identify the elements of data to be collected:
a. These elements are called units
i. Unit: an object/individual
ii. Population: the collection of all units -> usually unrealistic to collect data from every unit
iii. Sample: a subset of the population
b. Census vs sampling. Census -> measurements are taken from every unit in the population. Sample -> measurements are taken from some selected units in the population
c. Advantages of taking a sample:
i. When a census is not possible: 1. blood testing for a certain disease, 2. determining the effectiveness of a drug
ii. Speed: faster
iii. Cost: cheaper
iv. Accuracy: more money available to train a smaller taskforce
d. Purpose of having a good sample: the results from the sample can reflect the population
2. Sampling frame: is a list of sampling units intended to identify all units in the
population
a. What is a good sample?
i. Every unit in the population has a possibility of being selected (selection is
random)
ii. Selection process is not biased
b. Four possible outcomes for a sampling frame:
i. Captures the population exactly (hard to achieve)
ii. Captures more than the population (need to filter out unwanted units)
iii. Captures less than the population (selection bias)
iv. Combination of ii and iii: captures only a portion of the population plus some unwanted units
c. Characteristics of a good frame:
i. Good coverage: it has to cover exactly or bigger than the population so that
every unit in the population has a chance of being selected. However, if the
sampling frame is too big and covers more than what we want, the cost of
getting the data we want is high
ii. Up-to-date and complete: this is important for data that changes constantly
3. Probability sampling plan: every unit in the population must have a known probability of being selected into the sample -> avoids selection bias
a. Simple random sampling (SRS): every possible sample of the same size has the same chance of being selected (see the code sketch after this list)
i. Assign a number from 1 to N to each sampling unit in the sampling frame, where N is the number of sampling units
ii. Use a random number generator to pick a number, eg. the Excel function RANDBETWEEN(1, N)
iii. The sampling unit whose assigned number corresponds to the generated number is selected into the sample
b. Systematic sampling: a method of selecting units from a list through the application of a selection interval, K, so that every Kth unit on the list, following a random start, is included in the sample.
i. Split the list into groups of K, use SRS to pick a random start n between 1 and K, then select every Kth unit starting from the nth
ii. There can be problems if the list has a cyclical pattern that matches the selection interval
iii. It will produce the same result as simple random sampling if certain conditions are met -> when the sampling units are listed in random order
iv. Useful when the exact size of the population is not known -> a rough estimate of the population size is enough
v. The value of K is chosen based on the exact or estimated size of the population (roughly, the population size divided by the desired sample size)
c. Stratified sampling: we divide the population of units into groups (strata) and then take a probability sample (SRS) from each group, combining them into the whole sample. Eg: 5000 people - 3000 females and 2000 males. For a sample of 100, we take 60 women and 40 men.
i. separate into male and female groups when sex is a confounder
ii. the cost is low
d. Cluster sampling
e. Multistage sampling: in a lot of studies,
i. The cost is high as more people are being interviewed
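As referenced under simple random sampling, here is a minimal sketch of three of the probability sampling plans above (simple random, systematic, and stratified with proportional allocation) using Python's random module instead of Excel. The frame itself is a made-up list of labelled units; only the 3000-female/2000-male split and the sample size of 100 come from the stratified example.

```python
# Minimal sketch of simple random, systematic, and stratified sampling.
import random

frame = [("female", i) for i in range(3000)] + [("male", i) for i in range(2000)]
N, n = len(frame), 100

# (a) Simple random sampling: every sample of size n is equally likely.
srs = random.sample(frame, n)

# (b) Systematic sampling: selection interval K, random start, every Kth unit.
K = N // n                                 # K based on the (estimated) population size
start = random.randint(0, K - 1)           # random start within the first interval
systematic = frame[start::K][:n]

# (c) Stratified sampling: SRS within each stratum, proportional allocation
#     (3000 females, 2000 males -> 60 females and 40 males in a sample of 100).
strata = {"female": [u for u in frame if u[0] == "female"],
          "male":   [u for u in frame if u[0] == "male"]}
stratified = []
for sex, units in strata.items():
    n_stratum = round(n * len(units) / N)  # proportional allocation
    stratified += random.sample(units, n_stratum)

print(len(srs), len(systematic), len(stratified))   # 100 100 100
```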
4. Difficulties and disasters
a. Imperfect sampling frame: a perfect sampling frame is one that consists of exactly all the units of the target population. In many situations, a sampling frame may either include unwanted units or exclude desired units
i. A sampling frame is time-sensitive
ii. A frame with too many unwanted units increases the cost of the study
iii. For a frame that excludes desired units, we have to 1. redefine the target population, or 2. assess the impact of excluding these units from our study
b. Non-response
i. Not all selected units are contactable
ii. Not all selected units are willing to take part in the study
iii. Non-response distorts the results of the studies
iv. Non-respondents usually differ from respondents. We need to study the extent of the effect in order to reduce the bias in the collected information
-> if the non-respondents come from a group with particular characteristics, then by missing their responses we may end up with a biased sample
-> offering an incentive may increase the response rate
-> field workers can be sent to convince people to take part
c. Getting a volunteer or self-selected sample:
i. It is cheaper, but it is biased because the responders select themselves to respond and tend to have strong views about the matter
d. Using a convenience or haphazard sample: these are samples made up of individuals casually met or conveniently available, such as students enrolled in a class or people passing by on a street corner -> often biased
e. Judgement sample: sampling units are chosen from the population by interviewers using their own discretion about which informants are "typical" or "representative" -> it is biased.
f. Quota sample: a process of selection in which the elements are chosen in the field by interviewers using prearranged categories of sample elements to obtain a predetermined number of cases in each category -> each interviewer is assigned a fixed quota of interviews, and the numbers falling into certain categories of variables such as age and sex are fixed -> bias is introduced because interviewers are free to choose participants within their quotas
-> all of these sampling methods involve a human element -> the element of randomness is lost
5. Estimating parameters: a parameter is a numerical fact about the population -> usually unknown to us

a. The estimation equation: Estimate = Parameter + random error

-> Assumptions: 1. simple random sample, 2. response rate of 100%
A more realistic estimation equation: Estimate = Parameter + bias + random error
-> random errors are easy to quantify, but bias is not
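A minimal simulation sketch of the estimation equation: repeated SRS estimates fluctuate around the parameter (random error only), while sampling from a defective frame adds bias on top of that fluctuation. The population and its 30% rate are invented for illustration.

```python
# Minimal sketch (simulated population): estimate = parameter + bias + random error.
import random

random.seed(1)
population = [1] * 30000 + [0] * 70000    # parameter: 30% have the attribute
parameter = sum(population) / len(population)

def estimate(frame, n=1000):
    """Proportion with the attribute in a simple random sample from the frame."""
    sample = random.sample(frame, n)
    return sum(sample) / n

unbiased = [estimate(population) for _ in range(5)]   # SRS from the full frame
bad_frame = population[20000:]                        # frame missing 20,000 "1"s
biased = [estimate(bad_frame) for _ in range(5)]      # SRS from the bad frame

print("parameter:", parameter)
print("SRS estimates:", unbiased)        # scatter around 0.30 (random error only)
print("bad-frame estimates:", biased)    # scatter around 0.125 (bias + random error)
```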
6. Types of bias:
a. Selection bias: a systematic tendency on the part of the sampling procedure to exclude one kind of person or another from the sample
i. Caused by: 1. an imperfect sampling frame, 2. non-probability sampling methods
b. Other types of bias: due to the phrasing of the questions or the tone or attitude of the interviewers, subjects may tend to understate responses about undesirable social habits
c. To minimize bias in a survey, we seek to:
i. include every population unit in the frame (a large sample size will not redeem a bad design)
ii. use a probability sampling method
iii. maximize the response rate (ideally 100%)
iv. avoid bad phrasing of questions (eg. double negatives)
7. Random errors:
a. Larger sample size -> more likely to have a smaller random error
b. Sample estimates are fluctuating around the population parameter
8. Confidence intervals: ranges of values within which we are reasonably confident that our unknown parameter lies.
a. If we use probability sampling, the confidence interval is helpful in providing information about the error in the estimate
b. Assumption: 1. probability sampling
c. At the 95% confidence level, a smaller random error in the estimate implies a narrower confidence interval
d. Confidence refers to the inference; chance refers to the observation
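A minimal sketch of a 95% confidence interval for a population proportion, using the usual estimate ± 2 standard errors construction (the formula is not spelled out in these notes, so this is the standard textbook form); the sample size and sample proportion are assumed values.

```python
# Minimal sketch (assumed +/- 2 standard-error construction, illustrative values):
# 95% confidence interval for a population proportion.
import math

n = 1000          # sample size (assumed)
p_hat = 0.30      # sample proportion (assumed)

standard_error = math.sqrt(p_hat * (1 - p_hat) / n)
lower = p_hat - 2 * standard_error
upper = p_hat + 2 * standard_error

# A larger n gives a smaller standard error, hence a narrower interval.
print(f"95% CI: ({lower:.3f}, {upper:.3f})")
```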
Chapter 4: More on observational studies

1. Risks:
a. Definition: the risk of an uncertain outcome is its rate in a population -> a risk is a number between 0 and 1, or between 0% and 100%
i. A risk is like a rate but more specific. Confounding is still relevant
ii. Risk ratio, or relative risk: measures association

b. Random sampling of 1000 people to calculate the sample diabetes rate -> the rate fluctuates
- unpredictably,
- around the population rate

c. Sampling strategies for estimating risk:

i. Probability samples allow accurate estimation of the population risk and the population risk ratio (RR)
ii. Two strategies: take simple random samples (SRS) from:
- each exposure group, eg. females and males -> cohort study
- each disease group, eg. diabetic and healthy -> case-control study
iii. Cohort studies and case-control studies:
iv.
d. Notes:
i. Randomized experiments: subjects need not resemble population ->
extrapolation to population is not an issue
ii. Observational studies: both extrapolation and confounding are important issues.
-> cohort studies rely on random samples for accurate estimation of risk
2. Odds:

a. Formula: odds = risk / (1 - risk)
i. The value of the odds is always larger than the value of the risk
ii. If the risk is very small, then the odds are almost equal to the risk
b. Cross-product ratio
c. Odds ratio (OR) of one group relative to another group:

i. Interpreting OR in terms of risk:

- if OR = 1 -> there is no difference in disease risk between the two groups: RR = 1
- if OR > 1 -> higher risk in the first group: RR > 1
- if OR < 1 -> lower risk in the first group: RR < 1
ii. Estimating OR from a cohort study:
iii. Estimating OR from a case-control study:
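A minimal sketch computing risk, risk ratio and odds ratio (via the cross-product) from a 2x2 exposure-by-disease table; the counts are invented for illustration.

```python
# Minimal sketch (illustrative counts): risk, risk ratio (RR) and odds ratio
# (OR, via the cross-product) from a 2x2 exposure-by-disease table.

#               (diseased, healthy)
exposed     = (30, 970)
not_exposed = (10, 990)

def risk(group):
    diseased, healthy = group
    return diseased / (diseased + healthy)

def odds(group):
    diseased, healthy = group
    return diseased / healthy

rr = risk(exposed) / risk(not_exposed)
odds_ratio = odds(exposed) / odds(not_exposed)

# Cross-product form of the odds ratio: (a*d) / (b*c) for the table [[a, b], [c, d]].
a, b = exposed
c, d = not_exposed
cross_product = (a * d) / (b * c)

print(round(rr, 3), round(odds_ratio, 3), round(cross_product, 3))
# The odds ratio equals the cross-product ratio.
```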
3. Multi-level contingency table
a. Choosing a baseline for exposure group and disease group

Uncertainty:
Hypothesis testing:
1. Null hypothesis: the default position, eg. that the treatment has no effect; it is the hypothesis the data are tested against.
- If the null hypothesis is true, the observed outcome is due to chance alone
2. P-value: the probability, assuming the null hypothesis is true, of obtaining an outcome as extreme as or more extreme than the observed outcome
- If the p-value is small, it is unlikely that the observed outcome occurred by chance, so it is unlikely that the null hypothesis is true
- If the p-value is large, the observed outcome is consistent with chance, and the data do not contradict the null hypothesis
- Calculating a p-value:
Suppose a disease has 40% mortality -> 60% of patients survive. A sample of three patients given the drug all survived.
Null hypothesis: the drug has no effect
P(a patient survives) = 0.6
P(all three patients survive) = 0.6 x 0.6 x 0.6 = 0.216 -> this is the p-value
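A minimal sketch of the same calculation, plus the general binomial form for "as many or more survivors than observed" (the binomial generalisation is not in the notes; it is included only to show where 0.216 comes from as a p-value).

```python
# Minimal sketch: p-value for observing all three patients survive, when the
# null hypothesis says each patient survives with probability 0.6.
from math import comb

p_survive = 0.6
n, observed = 3, 3

def prob_k_survivors(k):
    """Probability of exactly k survivors out of n under the null hypothesis."""
    return comb(n, k) * p_survive**k * (1 - p_survive)**(n - k)

# P-value: probability of an outcome as extreme as or more extreme than observed.
p_value = sum(prob_k_survivors(k) for k in range(observed, n + 1))
print(round(p_value, 3))   # 0.216 (= 0.6 ** 3)
```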
