Intro To R Vietnamese 2
Nguyễn Văn Tuấn
Contents
1 Preface
2 Introduction to the R language
3 Entering data
4 Editing data
5 Simple calculations and matrices
6 Probability calculations and simulation
7 Hypothesis testing and the P value
8 Analyzing data with charts
9 Descriptive statistical analysis
10 Linear regression analysis
11 Analysis of variance
12 Logistic regression analysis
13 Survival analysis
14 Meta-analysis
15 Experimental design
16 Sample size estimation
17 Programming and writing functions in R
18 Some common R commands
19 Terminology used in the book
20 Afterword
1
Preface
Contrary to what many people think, statistics is a scientific discipline in its own right: Statistical Science. Its methods of analysis rest on mathematics and probability, but that is only the "technical" part; the more important part is designing the study and interpreting what the data mean. A statistician is therefore not merely someone who analyzes data, but must be a scientist, a thinker about scientific research. For this reason statistical science plays an extremely important, indeed indispensable, role in scientific research, especially in experimental science. It can be said that today, without statistics, genetic experiments with millions upon millions of data points would be nothing but lifeless, meaningless numbers.
A scientific study, however expensive and important, has no scientific meaning if it is not analyzed with the correct methods. That is why today, in any scientific journal in the world, almost every medical paper has a "Statistical Analysis" section, in which the authors must carefully describe the methods of analysis, how the calculations were done, and briefly explain why those methods were used, so as to back up or add scientific weight to the statements in the paper. The more prestigious the medical journal, the heavier its requirements for statistical analysis. To repeat for emphasis: without a statistical analysis section, a paper has no scientific meaning.
One of the most important developments in statistical science is the use of computers for statistical analysis and computation. It is no exaggeration to say that without computers, statistics would still be a dull, dry science, with complicated formulas and little practical application. Computers have helped statistical science carry out the greatest revolution in the history of the discipline: bringing statistics into practice, solving the thorniest problems, and contributing to the development of experimental science.
The author still remembers, more than 20 years ago as a student in a master's program in statistics in Australia, a respected professor telling a story about the renowned American statistician Fred Mosteller, who received a research contract from the US Department of Defense to improve the accuracy of American weapons during World War II, in which he had to solve a statistical problem with about 30 parameters. He had to hire 20 graduate students for the job: 10 students did nothing but compute by hand all day, while the other 10 checked the calculations of the first 10. The work took nearly a month. Today, with a modest personal computer, that analysis can be done in about 1 second.
But a computer without software is just a lifeless, useless pile of metal and silicon. One piece of software that has been, is, and will be revolutionizing statistics is R. It has been developed and refined over the past 10 years or so by statistical and scientific researchers around the world, for use in learning, teaching and research. This book introduces the reader to using R for statistical analysis and graphics.
Why R? Software for statistical analysis has long been available and is quite widely used. Well-known packages, from the "old days" such as MINITAB and BMD-P to relatively newer ones such as STATISTICA, SPSS, SAS, Stata, etc., are usually very expensive (the price for a university can reach hundreds of thousands of dollars a year), beyond the means of an individual or even of a university. R has changed this situation, because R is completely free. Contrary to the common perception, free does not mean poor quality. Indeed, besides being completely free, R can do all (to repeat: all) of the analyses that commercial software does, and even more. R can be downloaded to anyone's personal computer, at any time, anywhere in the world. After a few minutes of installation, R is ready to use. That is why the great majority of universities in the West and around the world are increasingly moving to R for learning, research and teaching. In line with that trend, this book has the modest aim of introducing R to readers in Vietnam, so that they can keep up with developments in statistical computing and analysis around the world.
This book is written mainly for university students and scientific researchers who need software to learn statistics, to analyze data, or to draw graphs from scientific data. It is not a textbook on statistical theory, nor does it aim to teach the reader how to do statistical analysis, but it will help the reader do statistical analysis more effectively and with more enthusiasm. My main purpose is to give the reader basic knowledge of statistics and of how to apply R to solve problems, as a foundation for learning more about R or developing it further.
I believe that, as in any profession, the best way to learn statistical analysis is to do the analysis oneself. This book is therefore written with many examples and real data. The reader can read the book and follow its instructions at the same time (by typing the commands into the computer), and will find it more engaging. If the reader already has a research data set of his or her own, learning will be more effective by applying the calculations in the book to it right away.
2
Introduction to the R language
2.1 What is R?
Briefly, R is software for statistical analysis and graphics. In fact, R is by nature a general-purpose computer language that can be used for many different goals, from simple calculations and recreational mathematics to matrix computation and complex statistical analyses. Because it is a language, R can also be used to develop specialized software for a particular computational problem.
R was created by two statisticians, Ross Ihaka and Robert Gentleman. Since R appeared, many statistical and mathematical researchers around the world have supported and taken part in its development. The policy of R's creators is an open orientation (Open Access). It is partly because of this policy that R is completely free. Anyone, anywhere in the world, can access and download the entire source code of R to their own computer and use it. So far, after less than 5 years of development, many statisticians, mathematicians and researchers in every field have moved to R for analyzing scientific data. Worldwide there is a network of nearly one million R users, and this number is growing exponentially. It can be said that within the next 10 years we will no longer need expensive statistical software such as SAS, SPSS or Stata (which can cost up to 100,000 USD a year) for statistical analysis, because all of those analyses can be carried out in R.
For these reasons, anyone doing scientific research should learn to use R for statistical analysis and graphics. This chapter shows the reader how to use R.
Package     Function
lattice     Drawing graphs and making them look better
Hmisc       A number of data modelling methods by F. Harrell
Design      A number of study design models by F. Harrell
Epi         Epidemiological analyses
epitools    Another package specializing in epidemiological analyses
foreign     Importing data from other software such as SPSS, Stata, SAS, etc.
Rmeta       Meta-analysis
meta        Another package for meta-analysis
survival    Analysis with the Cox model (Cox's proportional hazard model)
splines     Package needed for survival to run
Zelig       Statistical analyses in the social sciences
genetics    Analysis of genetic data
gap         Analysis of genetic data
BMA         Bayesian Model Average
leaps       Package used by BMA
But if we type:
> R is great
Operator    Meaning
x == 5      x is equal to 5
x != 5      x is not equal to 5
y < x       y is less than x
x > y       x is greater than y
z <= 7      z is less than or equal to 7
p >= 1      p is greater than or equal to 1
is.na(x)    Is x a missing value?
A & B       A and B (AND)
A | B       A or B (OR)
!           Not (NOT)
A few things to note when naming objects in R:
2.7 Getting help in R
Besides the command args(), R provides the command help() so that users can understand the syntax of each function. For example, to find out which arguments the function lm takes, we simply type:
> help(lm)
or
> ?lm
> help.start()
and a window will appear with guidance on the entire R system.
The function apropos is also very useful because it lists all functions in R whose names contain the characters we are looking for. For example, to find out which functions in R contain the characters "lm", we simply type:
> apropos("lm")
And R reports the available functions containing "lm" as follows:
 [1] ".__C__anova.glm"      ".__C__anova.glm.null" ".__C__glm"
 [4] ".__C__glm.null"       ".__C__lm"             ".__C__mlm"
 [7] "anova.glm"            "anova.glmlist"        "anova.lm"
[10] "anova.lmlist"         "anova.mlm"            "anovalist.lm"
[13] "contr.helmert"        "glm"                  "glm.control"
[16] "glm.fit"              "glm.fit.null"         "hatvalues.lm"
[19] "KalmanForecast"       "KalmanLike"           "KalmanRun"
[22] "KalmanSmooth"         "lm"                   "lm.fit"
[25] "lm.fit.null"          "lm.influence"         "lm.wfit"
[28] "lm.wfit.null"         "model.frame.glm"      "model.frame.lm"
[31] "model.matrix.lm"      "nlm"                  "nlminb"
[34] "plot.lm"              "plot.mlm"             "predict.glm"
[37] "predict.lm"           "predict.mlm"          "print.glm"
[40] "print.lm"             "residuals.glm"        "residuals.lm"
[43] "rstandard.glm"        "rstandard.lm"         "rstudent.glm"
[46] "rstudent.lm"          "summary.glm"          "summary.lm"
[49] "summary.mlm"          "kappa.lm"
Or:
> options(prompt="Tuan> ")
Tuan>
If the reader needs more information, several documents available online (written in English) are also very useful. They can be downloaded free of charge:
R for beginners (by Emmanuel Paradis):
http://cran.r-project.org/doc/contrib/rdebuts_en.pdf
Using R for data analysis and graphics (by John Maindonald):
http://cran.r-project.org/doc/contrib/usingR.pdf
In addition, the author has a document in Vietnamese (114 pages long) summarizing frequently used R commands, at the website www.r.ykhoanet.com.
3
Entering data
To analyze data with R, we must have the data in a form that R can understand and process. Data that R understands must be held in a data.frame. There are many ways to get data into a data.frame in R, from entering it directly to importing it from various sources. The most common ways are the following:
With this command, we tell R to put the two columns (or two objects) age and insulin into a single object named tuan.
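The command described here is not preserved in this text. A minimal sketch of what such a step could look like, assuming the two columns are entered with c() and combined with data.frame(); the values are those R reports for tuan in this section, while the exact form of the original command is an assumption:

```r
# Enter the two columns as numeric vectors
age <- c(50, 62, 60, 40, 48, 47, 57, 70, 48, 67)
insulin <- c(16.5, 10.8, 32.3, 19.3, 14.2, 11.3, 15.5, 15.8, 16.2, 11.2)

# Combine both columns into one data.frame named tuan (assumed form of the command)
tuan <- data.frame(age, insulin)

# Quick checks on the new object
dim(tuan)    # number of rows and columns
names(tuan)  # column names
```

Building the vectors first and combining them afterwards keeps each column easy to check before it goes into the data.frame.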
At this point we have a complete object ready for statistical analysis. To check what tuan contains, we simply type:
> tuan
And R reports:
   age insulin
1   50    16.5
2   62    10.8
3   60    32.3
4   40    19.3
5   48    14.2
6   47    11.3
7   57    15.5
8   70    15.8
9   48    16.2
10  67    11.2
id  sex  age  bmi    hdl  ldl   tc   tg
 1  Nam   57   17  5.000  2.0  4.0  1.1
 2  Nu    64   18  4.380  3.0  3.5  2.1
 3  Nu    60   18  3.360  3.0  4.7  0.8
 4  Nam   65   18  5.920  4.0  7.7  1.1
 5  Nam   47   18  6.250  2.1  5.0  2.1
 6  Nu    65   18  4.150  3.0  4.2  1.5
 7  Nam   76   19  0.737  3.0  5.9  2.6
 8  Nam   61   19  7.170  3.0  6.1  1.5
 9  Nam   59   19  6.942  3.0  5.9  5.4
10  Nu    57   19  5.000  2.0  4.0  1.9
...
44  Nam   45   24  5.450  2.8  6.0  2.6
45  Nam   63   24  5.000  3.0  4.0  1.8
46  Nu    52   24  3.360  2.0  3.7  1.2
47  Nam   64   24  7.170  1.0  6.1  1.9
48  Nam   45   24  7.880  4.0  6.7  3.3
49  Nu    64   25  7.360  4.6  8.1  4.0
50  Nu    62   25  7.750  4.0  6.2  2.5
or
> names(chol)
[1] "id" "sex" "age" "bmi" "hdl" "ldl" "tc" "tg"
id  Sex  Ethnicity  Age    IGFI  IGFBP3     ALS    PINP  ICTP  P3NP
 1                   18  148.27    5.14  316.00   61.84  5.81  4.21
 2                   28  114.50    5.23  296.42   98.64  4.96  5.33
 3                   20  109.82    4.33  269.82   93.26  7.74  4.56
 4                   21  112.13    4.38  247.96  101.59  6.66  4.61
 5                   28  102.86    4.04  240.04   58.77  4.62  4.95
 6                   23  129.59    4.16  266.95   48.93  5.32  3.82
 7                   20  142.50    3.85  300.86  135.62  8.78  6.75
 8                   20  118.69    3.44  277.46   79.51  7.19  5.11
 9                   20  197.69    4.12  335.23   57.25  6.21  4.44
10                   20  163.69    3.96  306.83   74.03  4.95  4.84
11                   22  144.81    3.63  295.46   68.26  4.54  3.70
12                   27  141.60    3.48  231.20   56.78  4.47  4.07
13                   26  161.80    4.10  244.80   75.75  6.27  5.26
14                   33   89.20    2.82  177.20   48.57  3.58  3.68
15                   34  161.80    3.80  243.60   50.68  3.52  3.35
16                   32  148.50    3.72  234.80   83.98  4.85  3.80
17                   28  157.70    3.98  224.80   60.42  4.89  4.09
18                   18  222.90    3.98  281.40   74.17  6.43  5.84
19                   26  186.70    4.64  340.80   38.05  5.12  5.77
20                   27  167.56    3.56  321.12   30.18  4.78  6.12
> library(foreign)
> setwd("c:/works/stats")
> testo <- read.spss("testo.sav", to.data.frame=TRUE)
> attach(chol)
> is.data.frame(chol)
[1] TRUE
R confirms that chol is indeed a data.frame.
> dim(chol)
[1] 50 8
> names(chol)
[1] "id" "sex" "age" "bmi" "hdl" "ldl" "tc" "tg"
> table(sex)
sex
nam Nam  Nu
  1  21  28
4
Editing data
Editing data here does not mean changing the original data (that would be a serious offence, an unacceptable fraud in science); it only means organizing the data so that R can analyze them effectively. In statistical analysis we often need to gather data into one group, split them into separate groups, or convert characters to numbers (numeric) to make calculation easier. This chapter goes through some basic commands for editing data.
We return to the chol data of example 1. To make the "story" easier to follow, recall that we read the data into an R data set named chol from a text file named chol.txt:
> setwd("c:/works/stats")
> chol <- read.table("chol.txt", header=TRUE)
> attach(chol)
[1] 25
[1] 9
> print(data3)
   id sex  tc
1   1 Nam 4.0
2   2  Nu 3.5
3   3  Nu 4.7
4   4 Nam 7.7
5   5 Nam 5.0
6   6  Nu 4.2
7   7 Nam 5.9
8   8 Nam 6.1
9   9 Nam 5.9
10 10  Nu 4.0
sex  tg
Nam 1.1
Nu  2.1
Nu  0.8
Nam 1.1
Nam 2.1
Nu  1.5
Nam 2.6
Nam 1.5
Nam 5.4
Nu  1.9
Nu  1.7
diagnosis <- bmd
diagnosis <- replace(diagnosis, bmd <= -2.5, 1)
diagnosis <- replace(diagnosis, bmd > -2.5 & bmd <= -1.0, 2)
diagnosis <- replace(diagnosis, bmd > -1.0, 3)
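A hedged sketch of this recoding applied to a small vector. The bmd values are borrowed from the cut2 example later in this chapter, and the cut-offs -2.5 and -1.0 are read from the replace() commands; the cross-check with cut() is an added illustration, not part of the original text:

```r
bmd <- c(-0.92, 0.21, 0.17, -3.21, -1.80, -2.60, -2.00, 1.71, 2.12, -2.11)

# Recode step by step with replace()
diagnosis <- bmd
diagnosis <- replace(diagnosis, bmd <= -2.5, 1)
diagnosis <- replace(diagnosis, bmd > -2.5 & bmd <= -1.0, 2)
diagnosis <- replace(diagnosis, bmd > -1.0, 3)

# Same grouping in one call: intervals (-Inf,-2.5], (-2.5,-1], (-1,Inf]
diagnosis2 <- cut(bmd, breaks = c(-Inf, -2.5, -1.0, Inf), labels = c(1, 2, 3))

# Number of subjects in each group
table(diagnosis)
```

The replace() version keeps diagnosis numeric, while cut() returns a factor, which is usually the more convenient type for tables and models.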
or logical: returning NA in:
the lowest age is 8 and the highest is 51. If we want to split age into 2 groups:
> cut(age, 2)
[1] (7.96,29.5] (7.96,29.5] (7.96,29.5] (29.5,51]
(7.96,29.5] (7.96,29.5] (7.96,29.5]
(7.96,29.5]
cut splits the variable age into 2 groups: group 1 covers ages from 7.96 to 29.5, and group 2 from 29.5 to 51. We can count the number of subjects in each age group with the function table as follows:
(29.5,51]
        4
In the following command, we split the variable age into 3 groups and name the three groups low, medium and high:
> cut(age, 3, labels=c("low", "medium", "high"))
 [1] low    low    low    high   low    low    low    high   low    low    medium medium low    low
[15] high
Levels: low medium high
> ageg <- cut(age, 3, labels=c("low", "medium", "high"))
> table(ageg)
ageg
   low medium   high
    10      2      3
Of course, we can also split age into 4 groups (quartiles) by supplying the probabilities 0, 0.25, 0.50, 0.75 and 1 as follows:
cut(age,
    breaks=quantile(age, c(0, 0.25, 0.50, 0.75, 1)),
    labels=c("q1", "q2", "q3", "q4"),
    include.lowest=TRUE)
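A small runnable sketch of the quartile split. The age vector here is an assumption for illustration: it reuses the ten ages of the tuan data from Chapter 3, not the 15 ages of the example in this section, which are not preserved:

```r
# Illustrative ages (assumed; taken from the tuan data set)
age <- c(50, 62, 60, 40, 48, 47, 57, 70, 48, 67)

# Break points at the minimum, the three quartiles and the maximum
breaks <- quantile(age, c(0, 0.25, 0.50, 0.75, 1))

# include.lowest = TRUE keeps the minimum value inside the first group
ageq <- cut(age, breaks = breaks, labels = c("q1", "q2", "q3", "q4"),
            include.lowest = TRUE)

# Number of subjects per quartile group
table(ageq)
```

Without include.lowest=TRUE the smallest age would fall outside the first interval and be coded as NA.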
> library(Hmisc)
> bmd <- c(-0.92,0.21,0.17,-3.21,-1.80,-2.60,
           -2.00,1.71,2.12,-2.11)
# split the variable bmd into 2 groups and store them in the object group
> group <- cut2(bmd, g=2)
> table(group)
group
[-3.21,-0.92) [-0.92, 2.12]
            5             5
5
Using R for simple calculations and matrices
One of the advantages of R is that it can be used like a hand-held calculator. In fact, beyond that, R can be used for matrix calculations and for programming. This chapter presents only a few simple calculations that pupils or students can use right away while reading these lines.
Addition and subtraction:
> 15+2997
[1] 3012
> 15+2997-9768
[1] -6756
Multiplication and division:
> -27*12/21
[1] -15.42857
Square root, $\sqrt{10}$:
> sqrt(10)
[1] 3.162278
The number pi ($\pi$):
> pi
[1] 3.141593
> 2+3*pi
[1] 11.42478
Natural logarithm, $\log_e$:
> log(10)
[1] 2.302585
Base-10 logarithm, $\log_{10}$:
> log10(100)
[1] 2
> log10(2+3*pi)
[1] 1.057848
Exponential, $e^{2.7689}$:
> exp(2.7689)
[1] 15.94109
Trigonometric functions:
> cos(pi)
[1] -1
Vector
> x <- c(2,3,1,5,4,6,7,6,8)
> x
[1] 2 3 1 5 4 6 7 6 8
> sum(x)
[1] 42
> x*2
[1]  4  6  2 10  8 12 14 12 16
> exp(x/10)
[1] 1.221403 1.349859 1.105171 1.648721 1.491825 1.822119 2.013753 1.822119
[9] 2.225541
> exp(cos(x/10))
[1] 2.664634 2.599545 2.704736 2.405079 2.511954 2.282647 2.148655 2.282647
[9] 2.007132
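Sum of squares: $\sum_{i=1}^{n} (x_i - \bar{x})^2 = ?$
Mean of the squared deviations: $\sum_{i=1}^{n} (x_i - \bar{x})^2 / n = ?$
> x <- c(1,2,3,4,5)
> sum((x-mean(x))^2)/length(x)
[1] 2
In the formula above, length(x) is the number of elements in x.
Variance: $s^2 = \sum_{i=1}^{n} (x_i - \bar{x})^2 / (n-1) = ?$
> x <- c(1,2,3,4,5)
> var(x)
[1] 2.5
Standard deviation, $\sqrt{s^2}$:
> sd(x)
[1] 1.581139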
2003-02-01
The reason behind this way of writing is that we write numbers with the largest unit first and move gradually to the smallest unit. For example, with the number "123" we know immediately that it is "one hundred and twenty-three": it starts with the hundreds, then the tens, and so on. This is also R's standard way of writing dates.
> date1 <- as.Date("01/02/06", format="%d/%m/%y")
> date2 <- as.Date("06/03/01", format="%y/%m/%d")
Note that we entered the two dates with different orders of day, month and year, but we also told R exactly how to read them with %d (day), %m (month) and %y (year). We can compute the number of days between the two time points:
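The command for the difference between the two dates is not preserved here. A minimal sketch, assuming the usual approach of subtracting one Date object from the other:

```r
date1 <- as.Date("01/02/06", format = "%d/%m/%y")  # 1 February 2006
date2 <- as.Date("06/03/01", format = "%y/%m/%d")  # 1 March 2006

# Subtracting two Date objects gives a difftime measured in days
days <- date2 - date1
days

# Convert to a plain number when it is needed in further calculations
as.numeric(days)
```

The function difftime(date2, date1, units = "days") gives the same result and lets the unit be chosen explicitly.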
Using seq
Creating a vector of the numbers from 1 to 12:
 [1]  1  2  3  4  5  6  7  8  9 10 11 12
Creating a vector of the numbers from 12 to 7:
> seq(12,7)
[1] 12 11 10  9  8  7
Using rep
The general form of the function rep is rep(x, times, ...), where x is a variable and times is the number of repetitions. For example:
Creating the number 10, 3 times:
> rep(10, 3)
[1] 10 10 10
Creating the numbers 1 to 4, 3 times:
> rep(c(1:4), 3)
[1] 1 2 3 4 1 2 3 4 1 2 3 4
Using gl
gl is used to create a categorical variable, that is, a variable not meant for calculation but for counting. The general form of the function gl is gl(n, k, length = n*k, labels = 1:n, ordered = FALSE), and its use is illustrated by the following examples:
Creating a variable with levels 1 and 2, each level repeated 8 times:
> gl(2, 8)
[1] 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2
Levels: 1 2
Or:
> gl(2, 2, length=20)
[1] 1 1 2 2 1 1 2 2 1 1 2 2 1 1 2 2 1 1 2 2
Levels: 1 2
Creating a variable with 4 levels 1, 2, 3, 4, each level repeated 2 times:
[1] 1 1 2 2 3 3 4 4
With dates and times:
$$A = \begin{pmatrix} 1 & 4 & 7 \\ 2 & 5 & 8 \\ 3 & 6 & 9 \end{pmatrix}$$
And with R:
> y <- c(1,2,3,4,5,6,7,8,9)
The result is:
[1,]
[2,]
[3,]
> # column 1 of matrix A
> A[,1]
[1] 1 4 7
> # row 3 of matrix A
> A[3,]
[1] 7 8 9
[2,]
Or A-B:
$$A = \begin{pmatrix} 1 & 4 & 7 \\ 2 & 5 & 8 \\ 3 & 6 & 9 \end{pmatrix}, \qquad B = \begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{pmatrix}$$
We want to compute AB, which can be done in R with the operator %*% as follows:
> y <- c(1,2,3,4,5,6,7,8,9)
> A <- matrix(y, nrow=3)
> B <- t(A)
> AB <- A%*%B
> AB
     [,1] [,2] [,3]
[1,]   66   78   90
[2,]   78   93  108
[3,]   90  108  126
Or to compute BA, again with %*%:
> BA <- B%*%A
> BA
     [,1] [,2] [,3]
[1,]   14   32   50
[2,]   32   77  122
[3,]   50  122  194
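$$\begin{aligned} 3x_1 + 4x_2 &= 4 \\ x_1 + 6x_2 &= 2 \end{aligned}$$
$$A = \begin{pmatrix} 3 & 4 \\ 1 & 6 \end{pmatrix}, \qquad X = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}, \qquad Y = \begin{pmatrix} 4 \\ 2 \end{pmatrix}$$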
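The solution step is not preserved in this text. A minimal sketch, assuming the system is written as AX = Y and solved with the base R function solve():

```r
# Coefficient matrix, filled row by row, and right-hand side
A <- matrix(c(3, 4,
              1, 6), nrow = 2, byrow = TRUE)
Y <- c(4, 2)

# solve(A, Y) returns the X that satisfies A %*% X = Y
X <- solve(A, Y)
X

# Check: multiplying back should reproduce Y
A %*% X
```

Calling solve(A) with a single argument returns the inverse of A, so solve(A) %*% Y is an equivalent but less direct route.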
           [,1]       [,2]
[1,] -0.7071068 -0.9701425
[2,] -0.7071068  0.2425356
> det(F)
[1] -216
6
Probability calculations and simulation
Probability is the foundation of statistical analysis. All methods of data analysis and statistical inference rest on probability theory. Probability theory is concerned with describing and expressing the distribution law of a random variable. In practice, "describing" here simply means counting the cases or outcomes that can occur for one or more variables. For example, when we select 2 subjects at random, and these 2 subjects can be classified by two characteristics such as sex and preference, the question is how many "combinations" of these two characteristics there are in total. For a continuous variable such as blood pressure, describing means computing summary statistics of the variable such as the mean, median, variance, standard deviation, and so on. From these descriptive statistics, probability theory gives us models for setting up distribution functions for those variables. This chapter covers two main areas: counting, and distribution functions.
6.1 Counting
6.1.1 Permutation
By definition, a permutation of n elements is an arrangement of the n elements in a definite order. This definition is rather hard to grasp; a concrete example makes it clearer. Imagine an emergency centre with 3 doctors (x, y and z) and 3 patients (a, b and c) waiting to be examined. Each of the three doctors can examine any of patients a, b or c. The question is: how many doctor-patient arrangements are there? To answer it, we consider the following cases:
Find 3!
> prod(3:1)
[1] 6
Find 10!
> prod(10:1)
[1] 3628800
Find 10.9.8.7.6.5.4
> prod(10:4)
[1] 604800
Find (10.9.8.7.6.5.4) / (40.39.38.37.36)
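The command for this last calculation is missing from the text. A short sketch in the same style as the preceding prod() calls; the variable names num and den are illustrative:

```r
# Numerator 10*9*8*7*6*5*4 and denominator 40*39*38*37*36
num <- prod(10:4)
den <- prod(40:36)
num / den

# factorial() gives the same results as prod(n:1)
factorial(3)
factorial(10)
```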
6.1.2 Combination
A combination of k elements out of n is any subset of k elements of a set of n elements. A concrete example helps to make this clear: suppose 3 people (call them A, B and C) are candidates for 2 positions, chairman and vice-chairman. How many ways are there to fill these 2 positions from those 3 people? We can picture 2 chairs to be filled from 3 people:
Choice   Chairman   Vice-chairman
1        A          B
2        B          A
3        A          C
4        C          A
5        B          C
6        C          B
$$\binom{3}{2} = \frac{3!}{2!\,(3-2)!} = \frac{6}{2} = 3 \text{ ways.}$$
In general, the number of ways of choosing k people from n people is:
$$\binom{n}{k} = \frac{n!}{k!\,(n-k)!}$$
Find $\binom{5}{2}$:
> choose(5, 2)
[1] 10
> 1/choose(5, 2)
[1] 0.1
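As a hedged illustration of the difference between the ordered choices in the chairman table and the unordered combinations counted by the formula, the base R function combn() can list the combinations explicitly; the vector candidates is a name introduced only for this sketch:

```r
candidates <- c("A", "B", "C")

# Each column is one unordered pair (a combination)
pairs <- combn(candidates, 2)
pairs
ncol(pairs)   # number of combinations, the same as choose(3, 2)

# Ordered choices (chairman, vice-chairman): each pair can be arranged in 2! ways
choose(3, 2) * factorial(2)
```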
 1  Nu   156
 2  Nu   160
 3  Nam  175
 4  Nu   145
 5  Nu   165
 6  Nu   158
 7  Nam  170
 8  Nam  167
 9  Nu   178
10  Nam  155
Distribution       Density                        Cumulative                     Quantile                       Simulation
Normal             dnorm(x, mean, sd)             pnorm(q, mean, sd)             qnorm(p, mean, sd)             rnorm(n, mean, sd)
Binomial           dbinom(k, n, p)                pbinom(q, n, p)                qbinom(p, n, prob)             rbinom(k, n, prob)
Poisson            dpois(k, lambda)               ppois(q, lambda)               qpois(p, lambda)               rpois(n, lambda)
Uniform            dunif(x, min, max)             punif(q, min, max)             qunif(p, min, max)             runif(n, min, max)
Negative binomial  dnbinom(x, k, p)               pnbinom(q, k, p)               qnbinom(p, k, prob)            rnbinom(n, k, prob)
Beta               dbeta(x, shape1, shape2)       pbeta(q, shape1, shape2)       qbeta(p, shape1, shape2)       rbeta(n, shape1, shape2)
Gamma              dgamma(x, shape, rate, scale)  pgamma(q, shape, rate, scale)  qgamma(p, shape, rate, scale)  rgamma(n, shape, rate, scale)
Geometric          dgeom(x, p)                    pgeom(q, p)                    qgeom(p, prob)                 rgeom(n, prob)
Exponential        dexp(x, rate)                  pexp(q, rate)                  qexp(p, rate)                  rexp(n, rate)
Weibull            dweibull(x, shape, scale)      pweibull(q, shape, scale)      qweibull(p, shape, scale)      rweibull(n, shape, scale)
Cauchy             dcauchy(x, location, scale)    pcauchy(q, location, scale)    qcauchy(p, location, scale)    rcauchy(n, location, scale)
F                  df(x, df1, df2)                pf(q, df1, df2)                qf(p, df1, df2)                rf(n, df1, df2)
T                  dt(x, df)                      pt(q, df)                      qt(p, df)                      rt(n, df)
Chi-squared        dchisq(x, df)                  pchisq(q, df)                  qchisq(p, df)                  rchisq(n, df)
Note: In the table above, df = degrees of freedom; prob = probability; n = sample size. The other parameters can be looked up for each distribution. The F, t and Chi-squared distributions also have one further parameter, the non-centrality parameter (ncp), which is set to 0 by default. The user can, however, supply another suitable value if needed.
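A brief sketch of how the four prefixes in the table (d, p, q, r) relate to one another, using the normal distribution; the particular numbers (1.96, mean 100, sd 13, seed 123) are chosen here only for illustration:

```r
# d: density at a point
dnorm(0, mean = 0, sd = 1)

# p: cumulative probability P(X <= q)
p <- pnorm(1.96)
p

# q: quantile, the inverse of p
q <- qnorm(p)
q

# r: simulate random values (set.seed makes the draw repeatable)
set.seed(123)
x <- rnorm(5, mean = 100, sd = 13)
x
```

The same pattern holds for every row of the table: replace norm by binom, pois, chisq and so on, and supply that distribution's parameters.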
Friend 1   Friend 2   Friend 3   Probability
Nam        Nam        Nam        (0.4)(0.4)(0.4) = 0.064
Nam        Nam        Nu         (0.4)(0.4)(0.6) = 0.096
Nam        Nu         Nam        (0.4)(0.6)(0.4) = 0.096
Nam        Nu         Nu         (0.4)(0.6)(0.6) = 0.144
Nu         Nam        Nam        (0.6)(0.4)(0.4) = 0.096
Nu         Nam        Nu         (0.6)(0.4)(0.6) = 0.144
Nu         Nu         Nam        (0.6)(0.6)(0.4) = 0.144
Nu         Nu         Nu         (0.6)(0.6)(0.6) = 0.216
Total                            1.000
$$P(k \mid n, p) = C_n^k \, p^k (1-p)^{n-k}$$
where k = 0, 1, ..., n.
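A hedged check of the binomial formula against the eight outcomes listed for three friends. The value p = 0.4 is inferred from the products shown in that table, and the names by_formula and by_dbinom are introduced only for this sketch:

```r
n <- 3     # number of friends
p <- 0.4   # probability of "Nam" (inferred from the products in the table)

# Probability of k = 0, 1, 2, 3 "Nam" among the three, from the formula
k <- 0:n
by_formula <- choose(n, k) * p^k * (1 - p)^(n - k)

# The same probabilities from the built-in binomial density
by_dbinom <- dbinom(k, size = n, prob = p)

rbind(k, by_formula, by_dbinom)

# The probabilities of all possible k add up to 1
sum(by_dbinom)
```

For k = 2, for example, the formula adds up the three table rows that contain exactly two "Nam", each with probability 0.096.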
[1] 0.3827828
  7   8   9  10
 68  23  13   3
[Figure: histogram; x-axis: Number of hypertensive patients; y-axis: Frequency]
$$P(X = k \mid \lambda) = \frac{e^{-\lambda} \lambda^{k}}{k!}$$
Therefore, the answer to the question above is:
$$P(X = 2 \mid \lambda = 1) = \frac{e^{-1} 1^{2}}{2!} = 0.1839$$
$$P(X > 2) = P(X = 3) + P(X = 4) + P(X = 5) + \dots = 1 - P(X \le 2) = 1 - 0.3679 - 0.3679 - 0.1839 = 0.08$$
With R, we can compute this as follows:
# P(X <= 2)
> ppois(2, 1)
[1] 0.9196986
# 1 - P(X <= 2)
> 1-ppois(2, 1)
[1] 0.0803014
$$P(X = x \mid \mu, \sigma^2) = f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{(x-\mu)^2}{2\sigma^2}\right]$$
[Figure: normal distribution of height; x-axis: Height (130 to 200); y-axis: f(height)]
$$P(X = 160 \mid \mu = 156, \sigma = 4.6) = \frac{1}{4.6\sqrt{2 \times 3.1416}} \exp\left[-\frac{(160-156)^2}{2(4.6)^2}\right] = 0.0594$$
The function dnorm(x, mean, sd) in R computes this value for us compactly:
> dnorm(160, mean=156, sd=4.6)
[1] 0.05942343
$$P(a \le X \le b) = \int_a^b f(x)\,dx$$
Therefore, P(150 ≤ X ≤ 160) is the area under the curve in chart 2 between 150 and 160 on the horizontal axis. R has the very useful function pnorm(x, mean, sd) for computing the cumulative probability of a normal distribution:
$$\mathrm{pnorm}(a, \mathrm{mean}, \mathrm{sd}) = \int_{-\infty}^{a} f(x)\,dx$$
For example, the probability that a Vietnamese woman's height is 150 cm or less is 9.6%:
> pnorm(150, 156, 4.6)
[1] 0.0960575
Or the probability that a Vietnamese woman is taller than 164 cm:
> 1-pnorm(164, 156, 4.6)
[1] 0.04100591
In other words, only about 4.1% of Vietnamese women are taller than 164 cm.
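The probability for the interval from 150 to 160 cm is described but not computed in the surviving text. A minimal sketch, assuming the same mean (156) and standard deviation (4.6) as the height example; the names mu, sigma and p_between are illustrative:

```r
mu <- 156     # mean height (cm) used in the example
sigma <- 4.6  # standard deviation (cm)

# P(150 <= X <= 160) = P(X <= 160) - P(X <= 150)
p_between <- pnorm(160, mu, sigma) - pnorm(150, mu, sigma)
p_between

# Cross-check by numerically integrating the density between 150 and 160
integrate(dnorm, lower = 150, upper = 160, mean = mu, sd = sigma)$value
```

Subtracting two cumulative probabilities is the standard way to get the area between two points, because pnorm always measures the area from minus infinity up to its first argument.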
Example 7: Applying the normal distribution. In a population, we know that the mean blood pressure is 100 mmHg and the standard deviation is 13 mmHg. What proportion of people in this population have a blood pressure of 120 mmHg or higher? The answer with R is:
> 1-pnorm(120, mean=100, sd=13)
[1] 0.0619679
$$Z = \frac{X - \mu}{\sigma}$$
[Figure: standard normal density; x-axis: z; y-axis: f(z)]
In other words, there is a 95% probability that z lies between -1.96 and 1.96. (Note that in the command above we did not supply mean=0, sd=1, because in pnorm the default value of the argument mean is 0 and that of sd is 1.)
Example 6 (continued). To recall, the mean height of Vietnamese women is 156 cm and the standard deviation is 4.6 cm. A woman who is 170 cm tall is therefore z = (170 - 156) / 4.6 = 3.04 standard deviations above the mean, and the proportion of Vietnamese women taller than 170 cm is very low, only about 0.1%.
> 1-pnorm(3.04)
[1] 0.001182891
Or P(Z < z) = 0.975 for a normal distribution with mean 0 and standard deviation 1:
> qnorm(0.975, mean=0, sd=1)
[1] 1.959964
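The chi-squared distribution ($\chi^2$). The $\chi^2$ distribution arises from the sum of squares of normally distributed variables. If $x_i \sim N(0, 1)$, and we let
$$u = \sum_{i=1}^{n} x_i^2$$
then $u$ follows a chi-squared distribution with $n$ degrees of freedom, written $u \sim \chi^2_n$.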
> qchisq(0.5, 7, 3)
[1] 9.180148
In other words,
$$P(F_{3,15} > 3.287382) = 1 - P(F_{3,15} \le 3.287382) = 1 - 0.95 = 0.05$$
P(x)   0.60   0.30   0.10
[Figure: histogram of 500 draws; x-axis: draws; y-axis: Frequency]
From the probability distribution we know that, on average, 60% of the draws will have the value 1, 30% the value 3, and 10% the value 5. We therefore expect to observe 300, 150 and 50 draws of each value respectively. The chart above shows that the distribution of these values is close to what we expect. We also know that the variance of this variable is about 1.8 (the mean is 2.0, and 0.6(1) + 0.3(9) + 0.1(25) - 2.0² = 1.8). Now we check whether the sample agrees with this expectation:
> var(draws)
[1] 1.835671
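The commands that generated draws are not preserved. A hedged sketch of one way to simulate such a variable with sample(); the values 1, 3 and 5 are an assumption, chosen because with the probabilities 0.6, 0.3 and 0.1 they give exactly the variance of 1.8 quoted in the text, and the seed is arbitrary:

```r
x <- c(1, 3, 5)           # assumed values of the variable
px <- c(0.6, 0.3, 0.1)    # probabilities from the text

# Theoretical mean and variance of the discrete distribution
mu <- sum(x * px)
sigma2 <- sum((x - mu)^2 * px)
c(mean = mu, variance = sigma2)

# Draw 500 values with replacement according to px
set.seed(2024)
draws <- sample(x, 500, replace = TRUE, prob = px)
table(draws)
var(draws)
```

Because the draws are random, the counts and the sample variance will differ from run to run and from the figures printed in the book, but they should stay close to 300, 150, 50 and 1.8.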
[Figure: histogram of drawmeans; x-axis: drawmeans; y-axis: Frequency]
We see that the variance of this distribution is smaller. In fact, the variance of these 500 means is 0.45.
> var(drawmeans)
[1] 0.4501112
[Figure: histogram; y-axis: Number of samples]
> mean(bin)
[1] 3.97
Plotting a distribution:
[Figure: Histogram of pois; x-axis: pois; y-axis: Frequency]
> curve(dchisq(x, 1), xlim=c(0,10), ylim=c(0,0.6),
        col="red", lwd=3)
> curve(dchisq(x, 2), add=TRUE, col="green", lwd=3)
> curve(dchisq(x, 3), add=TRUE, col="blue", lwd=3)
> curve(dchisq(x, 5), add=TRUE, col="orange", lwd=3)
> abline(h=0, lty=3)
> abline(v=0, lty=3)
> legend(par("usr")[2], par("usr")[4],
         xjust=1,
         c("df=1", "df=2", "df=3", "df=5"), lwd=3, lty=1,
         col=c("red", "green", "blue", "orange"))
[Figure: chi-squared densities for df=1, df=2, df=3 and df=5; x-axis: x (0 to 10); y-axis: dchisq(x, 1)]
The t distribution:
         lwd=c(2,2,2,2,2),
         lty=c(1,1,1,1,3),
         col=c("red", "blue", "green",
               "orange", par("fg")))
[Figure: Student T distributions for df=1, df=2, df=5 and df=10, with the Normal distribution for comparison; y-axis: dt(x, 1)]
The F distribution:
> curve(df(x,1,1), xlim=c(0,2), ylim=c(0,0.8), lwd=3)
> curve(df(x,3,1), add=TRUE)
> curve(df(x,6,1), add=TRUE, lwd=3)
> curve(df(x,3,3), add=TRUE, col="red")
> curve(df(x,6,3), add=TRUE, col="red", lwd=3)
> curve(df(x,3,6), add=TRUE, col="blue")
> curve(df(x,6,6), add=TRUE, col="blue", lwd=3)
> title(main="Fisher F distributions")
> legend(par("usr")[2], par("usr")[4],
         xjust=1,
         c("df=1,1", "df=3,1", "df=6,1", "df=3,3",
           "df=6,3", "df=3,6", "df=6,6"),
         lwd=c(1,1,3,1,3,1,3),
         lty=c(2,1,1,1,1,1,1),
         # colours follow the curve() calls above
         col=c(par("fg"), par("fg"), par("fg"), "red", "red", "blue", "blue"))
[Figure: Fisher F distributions for df=1,1; df=3,1; df=6,1; df=3,3; df=6,3; df=3,6; df=6,6; x-axis: x (0 to 2); y-axis: df(x, 1, 1)]
[Figure: gamma densities; y-axis: dgamma(x, 1, 1)]
[Figure: Beta distribution; y-axis: dbeta(x, 1, 1); curves for (1,1), (2,1), (3,1), (4,1), (2,2), (3,2), (4,2), (2,3), (3,3), (4,3)]
[Figure: Exponential and Weibull densities (Weibull shape=1, shape=2, shape=.8); y-axis: dexp(x)]
[Figure: Cauchy distribution compared with the Gaussian distribution; y-axis: dcauchy(x)]
The command above draws a random sample without replacement, that is, each time we sample, we do not put the subjects already selected back into the population.
But suppose we want to sample with replacement (each time some subjects are selected, we put them back into the population before the next draw). For example, to select 10 people from a population of 50 by random sampling with replacement, we only need to add the argument replace=TRUE:
> sample(1:50, 10, replace=TRUE)
[1] 31 44  8 47 50 10 16 29 23
7
Statistical hypothesis testing and the meaning of the P value
7.1 The P value
In scientific research, apart from numerical data, charts and images, the number we meet most often is the P value. In the following chapters the reader will meet the P value many times, and the great majority of statistical and scientific inferences rest on it. Before discussing methods of statistical analysis with R, therefore, a few words on the meaning of this value are needed.
The P value is a probability; P is short for "probability value". We often meet statements accompanied by this number, for example "The analysis shows that the fracture rate in the group of patients treated with Alendronate was 2%, lower than the rate in the untreated group (5%), and this difference is statistically significant (p = 0.01)", or a statement such as "After 3 months of treatment, the reduction in blood pressure in the patient group was 10% (p < 0.05)". In this context, the great majority of scientists take the P value to reflect the probability that Alendronate, or a treatment, is effective. Many people understand the sentence above to mean that "the probability that Alendronate is better than placebo is 0.99" (1 minus 0.01). But that interpretation is completely wrong.
Indeed, many people, not only readers but even the authors of scientific papers themselves, do not understand the P value correctly. According to a study published in the prestigious journal Statistics in Medicine [1], 85% of scientific authors and research physicians do not understand, or misunderstand, the meaning of the P value. The question must therefore be asked seriously: what does the P value mean? To answer it, we need to look at the concept of falsification and at the process of a scientific study.
[Figure: Histogram of bin; x-axis: bin; y-axis: Frequency]
  16   17   18   19   20   21   22   23
 462  946 1592 2719 4098 5892 7937 9733
7.4 The logical problem with the P value
But from the standpoint of reason and rigorous science, should we attach so much importance to the P value? The answer is no.
Intervention group   Placebo group   Relative risk and 95% confidence interval 2
                                     2.17 (1.13-4.18)
                                     0.74 (0.52-1.06)
                                     0.82 (0.62-1.08)
69 (0.20)            66 (0.19)       1.05 (0.75-1.47)
63 (0.14)            74 (0.16)       0.87 (0.62-1.22)
43 (0.09)            59 (0.13)       0.73 (0.49-1.09)
159 (0.14)           178 (0.15)      0.90 (0.71-1.11)
14 (0.14)            16 (0.17)       0.85 (0.41-1.74)
following chapter) is estimated by dividing the fracture rate in the intervention group by the rate in the placebo group; if the 95% confidence interval includes 1, the difference between the 2 groups is not statistically significant; if the 95% confidence interval does not include 1, the difference between the 2 groups is regarded as statistically significant (or p<0.05).
Recall that each time we test a hypothesis, we accept an error of 5% (assuming we adopt the criterion p = 0.05 for declaring a result statistically significant or not). In the context of testing many hypotheses, the question is this: if, among n tests, we declare k tests "statistically significant" (that is, p<0.05), what is the probability that at least one of these declarations is wrong?
To answer this question we start with a simple example. In each test we accept a probability of error of 0.05. In other words, we have a probability of 0.95 of being right. If we test 3 hypotheses, the probability that we are right in all three is 0.95 x 0.95 x 0.95 = 0.8574. The probability of at least one error among the three declarations of "statistical significance" is therefore 1 - 0.8574 = 0.1426 (about 14%).
In general, if we test n hypotheses, and in each test we accept a probability of error p, then the probability of at least 1 error in n tests is $1 - (1 - p)^n$. When n = 10 and p = 0.05, the probability of at least one error reaches 40%.
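The formula for the probability of at least one error can be turned into a small function. This is a sketch only: the function name p_at_least_one is hypothetical, and the calculation assumes, as the multiplication 0.95 x 0.95 x 0.95 does, that the tests are independent:

```r
# Probability of at least one false "significant" result among n independent tests,
# each carried out at error rate p
p_at_least_one <- function(n, p = 0.05) 1 - (1 - p)^n

p_at_least_one(3)
p_at_least_one(10)

# How the probability grows with the number of tests
round(p_at_least_one(c(1, 3, 5, 10, 20)), 3)
```

Because R arithmetic is vectorized, the same function accepts a whole vector of n values, which makes it easy to see how quickly the risk accumulates.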
8
Analyzing data with charts
The visual element is very important. A good chart can make a strong impression on readers of a scientific paper, and often stands for the whole study. A chart is therefore one of the most effective means of emphasizing the message of a paper. Charts are usually used to show trends and results for each group, but they can also be used to present data compactly. Charts that are easy to understand and rich in content are invaluable. Researchers therefore need to think creatively about how to present important data in charts. Graphical analysis thus plays an extremely important role in statistical analysis. One could say that without graphs, a statistical analysis has no meaning.
The R language offers many ways to design a neat and attractive chart. Most of the functions for designing charts are built into R, but some more sophisticated and complex charts can be designed with specialized packages such as lattice, which can be downloaded from the R website. This chapter shows how to draw common charts using widely used R functions.
[Figure: two panels, "Box plot of y" and "Bar chart of x"; y-axis of the bar chart: Frequency]
> par(mfrow=c(1,2))
> N <- 200
> x <- runif(N, -4, 4)
> y <- sin(x) + 0.5*rnorm(N)
> plot(x,y)
In the commands above, xlab (short for x label) and ylab (short for y label) are used to name the horizontal and vertical axes, while main is used to give the chart a title. Note that main contains the symbol \n, which starts a second line (in case the chart title is too long).
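The plotting commands this paragraph refers to are not preserved. A hedged sketch of a call using xlab, ylab and a two-line main; the simulated data follow the earlier commands in this chapter, the axis labels are taken from the surviving figure text, and the title wording and seed are assumptions:

```r
set.seed(1)
N <- 200
x <- runif(N, -4, 4)
y <- sin(x) + 0.5 * rnorm(N)

# "\n" inside the title starts a second line
main_title <- "Scatter plot of y against x\n(simulated data)"

plot(x, y,
     xlab = "X factor",     # label of the horizontal axis
     ylab = "Production",   # label of the vertical axis
     main = main_title)
```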
[Figure: two scatter plots; x-axis: x and X factor; y-axis: Production]
[Figure: scatter plot; x-axis: X factor; y-axis: Production; caption: Figure 1]
> par(mfrow=c(2,2))
> plot(y, type="l"); title("lines")
> plot(y, type="b"); title("both")
> plot(y, type="o"); title("overstruck")
> plot(y, type="h"); title("high density")
[Figure: four panels titled "lines", "both", "overstruck" and "high density"; x-axis: Index]
> par(mfrow=c(2,2))
[Figure: four panels titled "Production data" drawn with lty=1, lty=2, lty=3 and lty=4; x-axis: Index]
[Figure: two panels; x-axis: Index; y-axes: runif(10) and runif(5)]
> plot(runif(5), type = 'l',
       main = "plot type 'l' (lines)")
> plot(runif(5), type = 'b',
       main = "plot type 'b' (both points and lines)")
> plot(runif(5), type = 's',
       main = "plot type 's' (stair steps)")
> plot(runif(5), type = 'h',
       main = "plot type 'h' (histogram)")
> plot(runif(5), type = 'n',
       main = "plot type 'n' (no plot)")
> par(op)
[Figure: six panels illustrating the plot types; x-axis: Index; y-axis: runif(5)]
[Figure: Available symbols]
The argument legend(2,-2) means that the legend is placed at 2 on the horizontal axis (x-axis) and -2 on the vertical axis (y-axis).
[Figure: scatter plot with legend entries "Production" and "Regression line"]
text(15, 4.3)
N <- 200
x <- runif(N, -4, 4)
y <- x + 0.5*rnorm(N)
plot(x, y, pch=16, main="Scatter plot of y and x")
[Figure: scatter plot of y and x, with the text label "Trung tam"]
Or, to make things easier to follow, we will enter the data with the following commands:
18,
20,
21,
23,
24,
18,
20,
22,
23,
24,
18,
20,
22,
23,
24,
18,
20,
22,
23,
25,
18,
21,
22,
23,
25)
19,
21,
22,
23,
19,
21,
22,
23,
19,
21,
22,
23,
19,
21,
22,
24,
20,
21,
22,
24,
3.0,
5.0,
4.0,
4.2,
4.4,
1.0,
3.5,
6.2,
3.8,
4.5,
6.3,
6.1,
3.0,
1.3,
3.1,
4.2,
4.3,
4.0,
4.7,
4.1,
4.3,
5.9,
8.2,
6.7,
4.0,
1.2,
3.0,
4.4,
4.0,
4.6,
7.7,
3.0,
4.8,
5.6,
6.2,
8.1,
2.1,
0.7,
1.7,
4.3,
3.0,
4.0)
5.0,
4.0,
4.0,
8.3,
6.2,
6.2)
3.0,
4.0,
2.0,
2.3,
4.1,
4.2,
6.9,
3.0,
5.8,
6.7,
3.0,
4.1,
2.1,
6.0,
4.4,
5.9,
5.7,
3.1,
7.6,
6.3,
3.0,
4.3,
4.0,
3.0,
2.8,
6.1,
5.7,
5.3,
5.8,
6.0,
3.0,
4.0,
4.1,
3.0,
3.0,
5.9,
5.3,
5.3,
3.1,
4.0,
tg <- c(1.1, 2.1, 0.8, 1.1, 2.1, 1.5, 2.6, 1.5, 5.4, 1.9,
1.7, 1.0, 1.6, 1.1, 1.5, 1.0, 2.7, 3.9, 3.0, 3.1,
2.2, 2.7, 1.1, 0.7, 1.0, 1.7, 2.9, 2.5, 6.2, 1.3,
3.3, 3.0, 1.0, 1.4, 2.5, 0.7, 2.4, 2.4, 1.4, 2.7,
2.4, 3.3, 2.0, 2.6, 1.8, 1.2, 1.9, 3.3, 4.0, 2.5)
With the data in place, we are ready to analyse them graphically as follows:
Nam
10
15
20
Nu
25
Nam
Nu
98
10
15
20
25
(67.3,80]
The result above shows 10 male and 9 female patients in the first age group, 10 male and 14 female patients in the second age group, and so on. To display the frequencies of these two variables we again use barplot:
> barplot(age.sex, main="Number of males and females in each age group")
99
10
15
10
20
12
14
(42,54.7]
(42,54.7]
(54.7,67.3]
(67.3,80]
(54.7,67.3]
(67.3,80]
Age group
100
(42,54.7]
(49.6,57.2]
(42,49.6]
(72.4,80]
(67.3,80]
(64.8,72.4]
(57.2,64.8]
(54.7,67.3]
bin
lin
tc:
8.6.1 Stripchart
mg/L
101
8.6.2 Histogram
Age is a continuous variable. To draw the frequency distribution of age we simply call hist(age). As mentioned above, the plot can be improved by adding a main title (main) and labels for the horizontal axis (xlab) and the vertical axis (ylab):
> hist(age)
> hist(age, main="Frequency distribution by age group",
xlab="Age group", ylab="No of patients")
[Figure: two histograms of age, the default plot "Histogram of age" (x-axis "age", y-axis "Frequency") and the customised plot (x-axis "Age group", y-axis "No of patients")]
[Figure: "Histogram of age" on the density scale, and the kernel density estimate density.default(x = age), N = 50, Bandwidth = 3.806]
[Figure: histogram of age with density, cumulative distribution of (1:n)/n against sort(age), and normal Q-Q plot of Sample Quantiles against Theoretical Quantiles]
Figure 11. Probability distribution of the variable age.
Figure 12. Checking whether age follows a normal distribution.
In the plot above, the vertical axis is the cumulative probability and the horizontal axis is age, from lowest to highest. Reading from the plot, for example, about 50% of the subjects are younger than 60.
To check whether the distribution of age follows a normal distribution, we can use the function qqnorm.
> qqnorm(age)
The horizontal axis of the plot above shows the quantiles expected under a normal distribution (theoretical quantiles) and the vertical axis shows the quantiles of the data (sample quantiles). If age followed a normal distribution, the points would lie along a straight diagonal line (that is, the sample quantiles would track the theoretical quantiles). Figure 12, however, shows that the distribution of age does not quite follow a normal distribution.
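The Q-Q check described above can be sketched as follows. This is only an illustration under stated assumptions: age_sim is simulated data standing in for the book's age variable, and pdf(NULL) is used so the sketch also runs without a display. qqline adds a reference line through the first and third quartiles, which makes departures from the straight line easier to judge.

```r
# Sketch: normal Q-Q plot with a reference line.
# age_sim is simulated illustration data, not the book's age variable.
set.seed(1)
age_sim <- round(rnorm(50, mean = 60, sd = 10))

pdf(NULL)                 # null graphics device; drop this line in an interactive session
qq <- qqnorm(age_sim)     # draws the plot and returns the plotted coordinates
qqline(age_sim)           # reference line through the 1st and 3rd quartiles
invisible(dev.off())

str(qq)                   # list with theoretical (x) and sample (y) quantiles
```

Points that bend away from the reference line at either end suggest skewness or heavy tails.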
main="Box
plot
of
total
cholesterol",
mg/L
105
Nam
mg/L
mg/L
Nu
Nam
Nu
10
kg/m^2
15
20
25
106
Distribution of BM I
18
20
22
24
hdl
tc
107
M
8
8
M
F
6
tc
M
M
F
hdl
F
M
M
F
F
F
M
F
M
M
M
F F
M
F
M
F
M
F
F
F
F
M
M
F
M
F
F
F
tc
M
4
hdl
108
lowess (the most commonly used function) to smooth the tc and hdl data (Figure 19b).
> plot(hdl ~ tc, pch=16,
main="Total cholesterol and HDL cholesterol with
LOWESS smooth function",
xlab="Total cholesterol",
ylab="HDL cholesterol",
bty="l")
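The plot call above draws only the points; the smooth curve itself has to be added in a separate step. A minimal sketch, assuming simulated stand-ins tc_sim and hdl_sim for the book's tc and hdl variables, is:

```r
# Sketch: scatter plot with a lowess smooth added by lines().
# tc_sim and hdl_sim are simulated, not the book's data.
set.seed(2)
tc_sim  <- runif(50, 3, 8)
hdl_sim <- 1 + 0.4 * tc_sim + rnorm(50, sd = 0.5)

sm <- lowess(tc_sim, hdl_sim)   # list with x (sorted) and y (smoothed values)

pdf(NULL)                       # null graphics device; drop in an interactive session
plot(hdl_sim ~ tc_sim, pch = 16,
     xlab = "Total cholesterol", ylab = "HDL cholesterol", bty = "l")
lines(sm)                       # overlay the smooth curve
invisible(dev.off())
```

The argument f of lowess (default 2/3) controls how much of the data is used for each local fit; smaller values give a wigglier curve.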
6
4
HDL cholesterol
HDL cholesterol
Total cholesterol
Total cholesterol
109
Kt qu s l:
20
22
24
70
80
18
22
24
50
60
age
18 20
bmi
3 4
hdl
7 8
ldl
4 5
tc
50
60
70
80
110
22
24
0 .06 5
0.12
bmi
0.38
0.22
0 .0 9 5
**
.
0.29
0.25
***
0.62
0.35
hdl
18
20
22
24
50
age
60
70
80
18
***
0.65
ldl
tc
50
60
70
80
111
As shown above, a scatter plot helps us visualise the relationship between two continuous variables, such as age and hdl, and we draw it with the function plot. To examine the distribution of each variable, age or hdl, we can use the function boxplot. But if we want to see the distributions of both variables and the relationship between them at the same time, we need to write a few commands. The commands below draw a scatter plot of the relationship between age and hdl together with a box plot for each variable.
op <- par(no.readonly = TRUE)   # save only the settings that can be restored later
layout( matrix( c(2,1,0,3), 2, 2, byrow=TRUE ),
c(1,6), c(4,1)
)
par(mar=c(1,1,5,2))
plot(hdl ~ age,
xlab='', ylab='',
las = 1,
pch=16)
rug(side=1, jitter(age, 5) )
rug(side=2, jitter(hdl, 20) )
title(main = "Age and HDL")
par(mar=c(1,2,5,1))
boxplot(hdl, axes=FALSE)
title(ylab='HDL', line=0)
par(mar=c(5,1,1,2))
boxplot(age, horizontal=TRUE, axes=FALSE)
title(xlab='Age', line=1)
par(op)
The result is:
112
HDL
50
60
70
80
Age
113
HDL
Bubble plot
50
60
70
80
Age
114
12
10
Value
0.6
Cumulated frequency
0.8
8
0.4
2
0
(0.695,1.25]
(1.25,1.8]
(1.8,2.35]
(5.65,6.21]
(4.55,5.1]
115
This plot has two vertical axes. The left axis is the frequency (number of patients) in each tg group, and the right axis is the cumulative frequency expressed as a probability (so its highest value is 1).
116
V kt qu l:
Distribution of LDL
45
46
47
48
49
44
43
42
41
40
10
39
11
38
12
37
13
36
14
35
15
34
16
33
17
32
18
31
19
30
29
28
27
26
25
24
23
22
21
20
angle=90,
3
2
1
mean
group
117
x,
y+se.y,
code=3,
angle=90,
-2.5
-2.0
-1.5
-1.0
-0.5
0.0
0.5
-2
-1
N <- 50
x <- seq(-1, 1, length=N)
y <- seq(-1, 1, length=N)
xx <- matrix(x, nr=N, nc=N)
yy <- matrix(y, nr=N, nc=N, byrow=TRUE)
z <- 1 / (1 + xx^2 + (yy + .2 * sin(10*yy))^2)
contour(x, y, z, main = "Contour plot")
118
-1.0
-0.5
0.0
0.5
1.0
Contour plot
-1.0
-0.5
0.0
0.5
1.0
0.0
0.2
0.4
0.6
0.8
1.0
> image(z)
0.0
0.2
0.4
0.6
0.8
1.0
119
> image(x, y, z,
xlab="x",
ylab="y")
0.0
-1.0
-0.5
0.5
1.0
-1.0
-0.5
0.0
0.5
1.0
)
> par(op)
120
T he sinc function
8
6
S in c (r
4
10
2
0
-2
-10
0
-5
0
-5
5
10
-10
121
3
1
1 + x2
-4
-2
>
>
>
>
>
>
>
text(4, 6,
text(4, 8,
par(cex=1)
text(8, 2,
text(8, 4,
text(8, 6,
text(8, 8,
122
example(Japanese)
English
Kanji
Katakana
Hiragana
123
9
Descriptive statistical analysis
In this chapter we use R for descriptive statistical analysis. Descriptive statistics means describing data with the common calculations and statistical indices that we have known since secondary school, such as the mean, the median, the variance and the standard deviation for continuous variables, and the proportion for non-continuous variables. Before turning to descriptive analysis, however, the reader should distinguish between two concepts: population and sample.
124
125
10))
10))
10))
10))
10))
10))
> mean(sample(height, 15))
[1] 158.6667
> mean(sample(height, 15))
[1] 159.4
> mean(sample(height, 15))
[1] 158.0667
> mean(sample(height, 15))
[1] 158.1333
> mean(sample(height, 15))
[1] 156.4667
126
> mean(sample(height, 18))
[1] 158.2222
> mean(sample(height, 18))
[1] 158.7222
> mean(sample(height, 18))
[1] 158.0556
> mean(sample(height, 18))
[1] 158.4444
> mean(sample(height, 18))
[1] 158.6667
> mean(sample(height, 18))
[1] 159.0556
> mean(sample(height, 18))
[1] 159
127
distribution of a continuous variable: although the two groups here have the same mean, the spread in group 2 is far greater than in group 1. We therefore need another estimate, called the variance. The variance of group 1 is 15.7 cm² and that of group 2 is 443.7 cm².
For a non-continuous variable coded 0 and 1 (0 meaning alive and 1 meaning dead), the mean no longer has the meaning of an average, so we use the proportion instead. For example, if 2 out of 10 people die, the proportion of deaths is 0.2 (20%). If 40 out of 200 people die, the proportion is still 0.2. So, as with the mean, the proportion alone cannot fully describe a non-continuous variable. We also need the variance, together with the proportion. For 2/10 the variance is 0.016, whereas for 40/200 it is 0.0008.
In this chapter we get to know several R commands for carrying out these simple calculations.
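The two variances quoted above are consistent with the usual formula for the variance of an estimated proportion, p(1 - p)/n. A short check, in which the helper name prop_var is an illustrative assumption and not a function from the book:

```r
# Variance of an estimated proportion p = x/n, taken as p * (1 - p) / n.
# prop_var is a hypothetical helper name used only for this illustration.
prop_var <- function(x, n) {
  p <- x / n
  p * (1 - p) / n
}

v_small <- prop_var(2, 10)     # 2 deaths among 10 people
v_large <- prop_var(40, 200)   # 40 deaths among 200 people
c(v_small, v_large)
```

The proportion is 0.2 in both cases, but the larger sample gives a variance twenty times smaller.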
128
"weight"
"als"
"height"
"pinp"
"ethnicity"
"ictp"
"p3np"
> igfdata
id
1
1
2
2
3
3
4
4
5
5
6
6
7
7
8
8
9
9
10
10
...
...
97
97
98
98
99
99
100 100
id
1
1
2
2
3
3
4
4
5
5
6
6
7
7
8
8
9
9
10
10
...
...
97
97
98
98
99
99
100 100
17
18
18
15
als
323.667
333.750
248.333
251.000
322.000
284.667
274.000
303.000
308.500
273.000
54
55
48
54
pinp
353.970
375.885
199.507
483.607
105.430
76.487
75.880
86.360
254.803
44.720
ictp
p3np
11.2867 8.3367
10.4300 6.7450
8.3633 12.5000
13.3300 14.2767
7.9233 4.5033
4.9833 4.9367
6.3500 5.3200
7.3700 4.6700
11.8700 6.8200
3.7400 6.1600
4.4367
8.8333
5.6600
6.5933
Theory                                                                  R function
Mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$                         mean(x)
Variance: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$          var(x)
Standard deviation: $s = \sqrt{s^2}$                                    sd(x)
Standard error: $SE = s/\sqrt{n}$                                       none
Lowest value                                                            min(x)
Highest value                                                           max(x)
Range                                                                   range(x)
Median
19.00
Max.
34.00
130
sd <- sd(x)
se <- sd/sqrt(length(x))
c(MEAN=av, SD=sd, SE=se)
}
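Because R has no built-in function for the standard error, it is usually wrapped in a small helper such as the one whose closing lines appear above. A complete version might look like the sketch below; the function name desc and the input values are assumptions made for illustration, since the opening lines of the original function are not shown.

```r
# Sketch of a helper returning mean, SD and SE of a numeric vector.
# The name desc and the example values are hypothetical.
desc <- function(x) {
  av <- mean(x)
  s  <- sd(x)
  se <- s / sqrt(length(x))    # standard error of the mean
  c(MEAN = av, SD = s, SE = se)
}

x_demo <- c(2, 4, 4, 4, 5, 5, 7, 9)
res_demo <- desc(x_demo)
res_demo
```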
SE
5.898719
age
Min.
:13.00
1st Qu.:16.00
Median :19.00
Mean
:19.17
3rd Qu.:21.25
Max.
:34.00
weight
Min.
:41.00
1st Qu.:47.00
Median :50.00
Mean
:49.91
3rd Qu.:53.00
Max.
:60.00
ethnicity
African : 8
Asian
:60
Caucasian:30
Others
: 2
igfbp3
Min.
:2.000
1st Qu.:3.292
Median :3.550
Mean
:3.617
3rd Qu.:3.875
Max.
:5.233
als
Min.
:192.7
1st Qu.:256.8
Median :292.5
Mean
:301.8
3rd Qu.:331.2
Max.
:471.7
pinp
Min.
: 26.74
1st Qu.: 68.10
Median :103.26
Mean
:167.17
3rd Qu.:196.45
Max.
:742.68
p3np
Min.
: 2.343
1st Qu.: 4.433
Median : 5.445
Mean
: 6.341
3rd Qu.: 7.150
Max.
:16.303
131
sex
Female:69
Male : 0
age
Min.
:13.00
1st Qu.:17.00
Median :19.00
Mean
:19.59
3rd Qu.:22.00
Max.
:34.00
weight
Min.
:41.00
1st Qu.:47.00
Median :50.00
Mean
:49.35
3rd Qu.:52.00
Max.
:60.00
height
Min.
:149.0
1st Qu.:156.0
Median :162.0
Mean
:161.9
3rd Qu.:166.0
Max.
:196.0
ethnicity
African : 4
Asian
:43
Caucasian:22
Others
: 0
igfi
Min.
: 85.71
1st Qu.:136.67
Median :163.33
Mean
:167.97
3rd Qu.:186.17
Max.
:427.00
igfbp3
Min.
:2.767
1st Qu.:3.333
Median :3.567
Mean
:3.695
3rd Qu.:3.933
Max.
:5.233
als
Min.
:204.3
1st Qu.:263.8
Median :302.7
Mean
:311.5
3rd Qu.:361.7
Max.
:471.7
pinp
ictp
p3np
Min.
: 26.74
Min.
: 2.697
Min.
: 2.343
1st Qu.: 62.75
1st Qu.: 4.717
1st Qu.: 4.337
Median : 78.50
Median : 5.537
Median : 5.143
Mean
:108.74
Mean
: 6.183
Mean
: 5.643
3rd Qu.:115.26
3rd Qu.: 7.320
3rd Qu.: 6.143
Max.
:502.05
Max.
:13.633
Max.
:14.420
------------------------------------------------------------
132
sex: Male
id
Min.
: 2.00
1st Qu.: 34.50
Median : 56.00
Mean
: 55.61
3rd Qu.: 75.00
Max.
:100.00
sex
Female: 0
Male :31
age
Min.
:14.00
1st Qu.:15.00
Median :17.00
Mean
:18.23
3rd Qu.:20.00
Max.
:27.00
weight
Min.
:44.00
1st Qu.:48.50
Median :51.00
Mean
:51.16
3rd Qu.:53.50
Max.
:59.00
height
Min.
:155.0
1st Qu.:161.5
Median :164.0
Mean
:165.6
3rd Qu.:169.0
Max.
:191.0
ethnicity
African : 4
Asian
:17
Caucasian: 8
Others
: 2
igfi
Min.
: 94.67
1st Qu.:138.67
Median :160.00
Mean
:160.29
3rd Qu.:183.00
Max.
:274.00
pinp
Min.
: 56.28
1st Qu.:135.07
Median :245.92
Mean
:297.21
3rd Qu.:450.38
Max.
:742.68
ictp
Min.
: 3.650
1st Qu.: 6.900
Median : 9.513
Mean
:10.173
3rd Qu.:13.517
Max.
:21.237
igfbp3
Min.
:2.000
1st Qu.:3.183
Median :3.500
Mean
:3.443
3rd Qu.:3.775
Max.
:4.500
als
Min.
:192.7
1st Qu.:249.8
Median :276.0
Mean
:280.2
3rd Qu.:311.3
Max.
:388.7
p3np
Min.
: 3.390
1st Qu.: 5.375
Median : 7.140
Mean
: 7.895
3rd Qu.:10.010
Max.
:16.303
op <- par(mfrow=c(2,3))
hist(igfi)
hist(igfbp3)
hist(als)
hist(pinp)
hist(ictp)
hist(p3np)
[Figure: histograms of igfi, igfbp3, als, pinp, ictp and p3np, each with Frequency on the vertical axis]
[Figure: "Histogram of weight", Frequency against weight]
In the command above, igfi is the variable to be summarised, sex is the grouping variable, and the statistic we want is the mean. The result shows that the mean igfi for women (167.97) is higher than for men (160.29).
If we want the mean for each combination of sex and ethnicity, we only need to add one more variable inside the list function:
> tapply(igfi, list(ethnicity, sex), mean)
            Female     Male
African   145.1252 120.9168
Asian     165.6589 160.4999
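The behaviour of tapply with one and with two grouping factors can be tried on a few made-up numbers. All three vectors below (igfi_sim, sex_sim, eth_sim) are invented for illustration and are not the book's igfdata.

```r
# Sketch: group means with tapply; the data are made up for illustration.
igfi_sim <- c(150, 170, 160, 180, 140, 165)
sex_sim  <- factor(c("Female", "Male", "Female", "Male", "Female", "Male"))
eth_sim  <- factor(c("Asian", "Asian", "African", "African", "Asian", "Asian"))

by_sex  <- tapply(igfi_sim, sex_sim, mean)                  # one factor: a vector
by_both <- tapply(igfi_sim, list(eth_sim, sex_sim), mean)   # two factors: a matrix

by_sex
by_both
```

With two factors the result is a matrix whose rows follow the first factor in the list and whose columns follow the second.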
135
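9.1.1 One-sample t test
Example 2. The analysis above shows that the mean age of the 100 subjects in this study is 19.17 years. Suppose it was previously known that the mean age in this population is 30 years. The question is whether our sample is representative of that population. In other words, we want to know whether the sample mean of 19.17 really differs from the population mean of 30.
To answer this question we use the t test. In statistical theory, the t statistic is defined by the following formula:
$$t = \frac{\bar{x} - \mu}{s / \sqrt{n}}$$
where $\bar{x}$ is the sample mean, $\mu$ the hypothesised population mean, $s$ the sample standard deviation and $n$ the sample size.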
136
data: age
t = -27.6563, df = 99, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 30
95 percent confidence interval:
18.39300 19.94700
sample estimates:
mean of x
19.17
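The t statistic reported by t.test can be reproduced directly from the one-sample formula. The sketch below uses simulated ages (x_sim), since the book's data set is not reproduced here; only the agreement between the manual calculation and t.test matters.

```r
# Sketch: one-sample t statistic by hand versus t.test().
# x_sim is simulated; it is not the age variable from the book.
set.seed(3)
x_sim <- rnorm(100, mean = 19, sd = 4)
mu0   <- 30                                   # hypothesised population mean

t_manual <- (mean(x_sim) - mu0) / (sd(x_sim) / sqrt(length(x_sim)))
res_t    <- t.test(x_sim, mu = mu0)

c(manual = t_manual,
  t.test = unname(res_t$statistic),
  df     = unname(res_t$parameter))
```

The degrees of freedom are n - 1, which is why the output above reports df = 99 for 100 subjects.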
$$t = \frac{\bar{x}_2 - \bar{x}_1}{SED}$$
137
is the 95% confidence interval for the difference between the two groups. The result shows that igfi in women may be as much as 10.5 ng/L lower than in men, or about 25.8 ng/L higher. This interval is wide and includes zero, which is further evidence that there is no statistically significant difference between the two groups.
The test above assumes that the male and female groups have different variances. If we have reason to believe that the two groups share the same variance, we change just one argument of the t.test function, var.equal=TRUE, as follows:
138
The result above shows that the variances of the two groups differ by a factor of 2.62. The value p = 0.0045 indicates that this difference in variance is statistically significant. We therefore accept the result of t.test(igfi ~ sex), which does not assume equal variances.
139
The value p = 0.682 shows that the difference in igfi between men and women is indeed not statistically significant. This conclusion agrees with the result of the t test.
140
> # kim nh t
> t.test(before, after, paired=TRUE)
Paired t-test
data: before and after
t = 2.7924, df = 9, p-value = 0.02097
alternative hypothesis: true difference in means is not
equal to 0
95 percent confidence interval:
1.993901 19.006099
sample estimates:
mean of the differences
10.5
141
9.8
9.9 Frequency
The function table in R gives the frequencies of a categorical variable such as sex or ethnicity.
> table(sex)
sex
Female   Male
    69     31
> table(ethnicity)
ethnicity
  African     Asian Caucasian    Others
        8        60        30         2
# check the result
> freq
        ethnicity
sex      African Asian Caucasian Others
  Female       4    43        22      0
  Male         4    17         8      2
# use margin.table to view the marginal totals
> margin.table(freq, 1)
sex
Female   Male
    69     31
> margin.table(freq, 2)
ethnicity
  African     Asian Caucasian    Others
        8        60        30         2
In the table above, prop.table computes the proportion of each sex within each ethnic group. For example, among Asians, 71.7% are women and 28.3% are men.
# proportions over the whole table
> freq/sum(freq)
        ethnicity
sex      African Asian Caucasian Others
  Female    0.04  0.43      0.22   0.00
  Male      0.04  0.17      0.08   0.02
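The proportions discussed above can be reproduced from the cross-table of counts. The sketch below rebuilds freq as a matrix from the counts printed earlier (in the book it comes from table(sex, ethnicity), so the matrix construction is an assumption made to keep the example self-contained).

```r
# Sketch: column-wise and overall proportions from the sex by ethnicity counts.
freq <- matrix(c(4, 43, 22, 0,
                 4, 17,  8, 2),
               nrow = 2, byrow = TRUE,
               dimnames = list(sex = c("Female", "Male"),
                               ethnicity = c("African", "Asian",
                                             "Caucasian", "Others")))

by_eth  <- prop.table(freq, margin = 2)   # proportions within each ethnic group
overall <- freq / sum(freq)               # proportions of the whole table

round(by_eth, 3)
overall
```

Using margin = 1 instead would give the ethnic composition within each sex.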
$$z = \frac{x - n\pi}{\sqrt{n\pi(1-\pi)}}$$
Here z follows a normal distribution with mean 0 and variance 1. Equivalently, $z^2$ follows a chi-squared distribution with 1 degree of freedom.
Example 5. In the study above there are 69 women and 31 men, so the proportion of women is 0.69 (or 69%). To test whether this proportion really differs from 0.5, we can use the function prop.test(x, n, π), where x is the observed count, n the sample size and π the hypothesised proportion, as follows:
> prop.test(69, 100, 0.50)
1-sample proportions test with continuity correction
data: 69 out of 100, null probability 0.5
X-squared = 13.69, df = 1, p-value = 0.0002156
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
0.5885509 0.7766330
sample estimates:
p
0.69
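The X-squared value in this output is the square of z computed with a continuity correction of 0.5, which prop.test applies by default. A short check of that arithmetic:

```r
# Sketch: reproduce the prop.test statistic for 69 women out of 100.
x  <- 69
n  <- 100
p0 <- 0.5

z_cc <- (abs(x - n * p0) - 0.5) / sqrt(n * p0 * (1 - p0))   # continuity-corrected z
chisq_manual <- z_cc^2

res_p <- prop.test(x, n, p = p0)
c(manual = chisq_manual, prop.test = unname(res_p$statistic))
```

Calling prop.test(x, n, p = p0, correct = FALSE) drops the correction and gives the uncorrected z squared instead.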
$$V_d = \left(\frac{1}{n_1} + \frac{1}{n_2}\right)\bar{p}(1-\bar{p})$$
where:
$$\bar{p} = \frac{x_1 + x_2}{n_1 + n_2}$$
145
146
sample estimates:
   prop 1    prop 2    prop 3    prop 4
0.5000000 0.7166667 0.7333333 0.0000000
Warning message:
Chi-squared approximation may be incorrect in: prop.test(female, total)
147
148
10
Linear regression analysis
Linear regression analysis is perhaps one of the most widely used methods of data analysis in statistics. Someone once wrote: "Give people 3 weapons, the correlation coefficient, linear regression and a pen, and they will use all three!" In this chapter I introduce how to use R for linear regression analysis and related methods such as the correlation coefficient and statistical hypothesis testing.
Example 1. To illustrate, consider the following study, in which the researcher measured blood cholesterol in 18 male subjects. Body mass index (BMI) was also calculated for each subject as weight (in kg) divided by height squared (m²). The measurements are as follows:
Table 1. Age, body mass index and cholesterol
ID (id)   Age (age)   BMI (bmi)   Cholesterol (chol)
 1        46          25.4        3.5
 2        20          20.6        1.9
 3        52          26.2        4.0
 4        30          22.6        2.6
 5        57          25.4        4.5
 6        25          23.1        3.0
 7        28          22.7        2.9
 8        36          24.9        3.8
 9        22          19.8        2.1
10        43          25.3        3.8
11        57          23.2        4.1
12        33          21.8        3.0
13        22          20.9        2.5
14        63          26.7        4.6
15        40          26.4        3.2
16        48          21.2        4.2
17        28          21.2        2.3
18        49          22.8        4.0
149
2.0
2.5
3.0
chol
3.5
4.0
4.5
20
30
40
50
60
age
150
$$r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2 \sum_{i=1}^{n}(y_i - \bar{y})^2}}$$
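The correlation formula can be applied term by term to the age and cholesterol values of Table 1 and compared with R's built-in cor function. The variable names follow the table; the comparison itself is an added illustration.

```r
# Sketch: Pearson correlation from the formula versus cor(), Table 1 data.
age  <- c(46, 20, 52, 30, 57, 25, 28, 36, 22, 43, 57, 33, 22, 63, 40, 48, 28, 49)
chol <- c(3.5, 1.9, 4.0, 2.6, 4.5, 3.0, 2.9, 3.8, 2.1, 3.8, 4.1, 3.0, 2.5,
          4.6, 3.2, 4.2, 2.3, 4.0)

dx <- age - mean(age)
dy <- chol - mean(chol)
r_manual <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))

c(manual = r_manual, cor = cor(age, chol))
```

cor.test(age, chol) would additionally report a confidence interval and a p value for the same coefficient.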
151
152
153
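values of $\alpha$ and $\beta$ such that
$$\sum_{i=1}^{n}\left[y_i - (\alpha + \beta x_i)\right]^2$$
is as small as possible. The resulting estimates are:
$$\hat{\beta} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}$$ [2]
$$\hat{\alpha} = \bar{y} - \hat{\beta}\bar{x}$$ [3]
and the fitted value is:
$$\hat{y}_i = \hat{\alpha} + \hat{\beta} x_i$$
Of course, $\hat{y}_i$ here is only the mean for age $x_i$, and the remainder (that is, $y_i - \hat{y}_i$) is called the residual. The variance of the residuals can be estimated as follows:
$$s^2 = \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{n-2}$$ [4]
$s^2$ is the estimate of $\sigma^2$.
In linear regression analysis we usually want to know whether the coefficient $\beta$ equals 0 or differs from 0. If $\beta$ equals 0, then $y_i = \alpha + \beta x_i + \varepsilon_i = \alpha + \varepsilon_i$, which means that the differences in cholesterol between subjects merely fluctuate around the mean $\alpha$ with random error $\varepsilon$; in other words, there is no relationship between x and y. If $\beta$ differs from 0, we have evidence that x and y are related. To test the hypothesis $\beta = 0$ we use the following t test:
$$t = \frac{\hat{\beta}}{SE(\hat{\beta})}$$ [5]
$SE(\hat{\beta})$ is the standard error of the estimate $\hat{\beta}$. In the equation above, t follows a t distribution with n-2 degrees of freedom (if $\beta = 0$ is true).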
age
0.05779
3Q
0.17939
Max
0.63040
Coefficients:
155
Pr(>|t|)
0.000154 ***
1.06e-08 ***
'*' 0.05 '.' 0.1 '
3Q
0.17939
Max
0.63040
Pr(>|t|)
0.000154 ***
1.06e-08 ***
'*' 0.05 '.' 0.1 '
156
simple linear regression analysis (with one predictor) we do not need to pay attention to the F test.
Residual standard error: 0.3027 on 16 degrees of freedom
Multiple R-Squared: 0.8775, Adjusted R-squared: 0.8698
F-statistic: 114.6 on 1 and 16 DF, p-value: 1.058e-08
In addition, part 3 gives us an important piece of information, the value R² or coefficient of determination. It is calculated with the formula:
$$R^2 = \frac{\sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}$$ [6]
That is, the sum of squared differences between the fitted values and the mean, divided by the sum of squared differences between the observed values and the mean. R² in this example is 0.8775, which means that the linear equation (with age as the predictor) explains about 88% of the differences in cholesterol between individuals. R² naturally takes values from 0 to 100% (or 1). The higher the value of R², the stronger the relationship between age and cholesterol.
Another coefficient worth mentioning here is the adjusted coefficient of determination (which R calls Adjusted R-squared in the output above). It tells us how much the residual variance improves because age is included in the linear model. In general it differs little from the coefficient of determination, and we need not pay too much attention to it.
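The quantities in this output can be recomputed from the Table 1 data. The sketch below fits the simple regression of chol on age and derives R² from its definition; the object names r2_manual and reg_age are illustrative choices.

```r
# Sketch: simple regression of cholesterol on age (Table 1) and R-squared by hand.
age  <- c(46, 20, 52, 30, 57, 25, 28, 36, 22, 43, 57, 33, 22, 63, 40, 48, 28, 49)
chol <- c(3.5, 1.9, 4.0, 2.6, 4.5, 3.0, 2.9, 3.8, 2.1, 3.8, 4.1, 3.0, 2.5,
          4.6, 3.2, 4.2, 2.3, 4.0)

reg_age <- lm(chol ~ age)

# R-squared: explained sum of squares over total sum of squares
r2_manual <- sum((fitted(reg_age) - mean(chol))^2) / sum((chol - mean(chol))^2)

c(slope     = unname(coef(reg_age)["age"]),
  r2_manual = r2_manual,
  r2_lm     = summary(reg_age)$r.squared,
  sigma     = summary(reg_age)$sigma)
```

The slope, R² and residual standard error agree with the values 0.05779, 0.8775 and 0.3027 shown in the output above.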
157
18
3.9208
158
[Figure: four regression diagnostic plots: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage]
Figure 10.2. Residual analysis to check the assumptions of linear regression.
To check the assumptions above, we can draw a set of 4 plots as follows:
> op <- par(mfrow=c(2,2))   # ask R to set aside 4 panels
> plot(reg)                 # draw the diagnostic plots for reg
159
2.0
2.5
3.0
chol
3.5
4.0
4.5
> abline(reg)
20
30
40
50
60
age
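However, each value $\hat{y}_i$ is computed from the estimates $\hat{\alpha}$ and $\hat{\beta}$, both of which have standard errors, so the predicted value $\hat{y}_i$ also has error. In other words, $\hat{y}_i$ is only a mean; in practice the value may be higher or lower depending on the sample drawn. The 95% confidence interval can be estimated in R with the following commands: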
>
>
>
>
>
>
>
>
>
>
>
>
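The original commands are not legible at this point. One common way to obtain such an interval, offered here only as an assumption about what could be done and not as the author's code, is predict with interval = "confidence". The sketch uses the Table 1 data and a hypothetical grid of ages.

```r
# Sketch: 95% confidence interval for the fitted mean cholesterol at given ages.
# Uses predict(); this is not necessarily the approach taken in the book.
age  <- c(46, 20, 52, 30, 57, 25, 28, 36, 22, 43, 57, 33, 22, 63, 40, 48, 28, 49)
chol <- c(3.5, 1.9, 4.0, 2.6, 4.5, 3.0, 2.9, 3.8, 2.1, 3.8, 4.1, 3.0, 2.5,
          4.6, 3.2, 4.2, 2.3, 4.0)
reg_ci <- lm(chol ~ age)

new_ages <- data.frame(age = seq(20, 60, by = 10))   # hypothetical ages of interest
ci <- predict(reg_ci, newdata = new_ages, interval = "confidence", level = 0.95)

cbind(new_ages, ci)   # columns: age, fit, lwr, upr
```

Replacing "confidence" with "prediction" gives the wider interval for an individual observation instead of the mean.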
160
2.0
2.5
3.0
chol
3.5
4.0
4.5
20
30
40
50
60
age
161
( y y )
i =1
nh nht.
1
1
X =
...
x11
x 21
x12
x 22
...
x1 n
...
x2n
1
= 2 ,
...
k
... x k 1
... x k 2 ,
...
x kn
1
= 2
...
n
The least-squares method solves for the vector $\beta$ with the following equation:
$$\hat{\beta} = (X^T X)^{-1} X^T Y$$
T = Y Y
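The matrix solution can be verified numerically against lm. In the sketch below the design matrix X is given an explicit column of ones for the intercept, which is an assumption about how the matrix is laid out; the data are the age, bmi and chol columns of Table 1.

```r
# Sketch: least-squares estimates from the normal equations versus lm().
age  <- c(46, 20, 52, 30, 57, 25, 28, 36, 22, 43, 57, 33, 22, 63, 40, 48, 28, 49)
bmi  <- c(25.4, 20.6, 26.2, 22.6, 25.4, 23.1, 22.7, 24.9, 19.8, 25.3, 23.2,
          21.8, 20.9, 26.7, 26.4, 21.2, 21.2, 22.8)
chol <- c(3.5, 1.9, 4.0, 2.6, 4.5, 3.0, 2.9, 3.8, 2.1, 3.8, 4.1, 3.0, 2.5,
          4.6, 3.2, 4.2, 2.3, 4.0)

X <- cbind(Intercept = 1, age = age, bmi = bmi)   # column of ones for the intercept
beta_hat <- solve(t(X) %*% X, t(X) %*% chol)      # solves (X'X) b = X'Y

fit_lm <- lm(chol ~ age + bmi)
cbind(normal_equations = as.numeric(beta_hat), lm = unname(coef(fit_lm)))
```

solve(A, b) is used instead of forming the inverse explicitly; lm itself relies on a QR decomposition, which is numerically more stable.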
162
> pairs(data)
22
24
26
50
60
20
24
26
20
30
40
age
20
22
bmi
chol
20
30
40
50
60
2.0
2.5
3.0
3.5
4.0
4.5
bmi))
3Q
0.3040
Max
1.4330
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.83187    1.60841  -1.761  0.09739 .
bmi          0.26410    0.06861   3.849  0.00142 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
$$y_i = \alpha + \beta_1 x_{1i} + \beta_2 x_{2i} + \varepsilon_i$$
or, in the matrix notation presented above: $Y = X\beta + \varepsilon$. Here Y is an 18 x 1 vector, X is an 18 x 2 matrix, $\beta$ is a 2 x 1 vector, and $\varepsilon$ is an 18 x 1 vector. To estimate the two regression coefficients $\beta_1$ and $\beta_2$, we again use the function lm() in R as follows:
> mreg <- lm(chol ~ age + bmi)
> summary(mreg)
Call: lm(formula = chol ~ age + bmi)
Residuals:
    Min      1Q  Median      3Q     Max
-0.3762 -0.2259 -0.0534  0.1698  0.5679
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.455458   0.918230   0.496    0.627
age         0.054052   0.007591   7.120 3.50e-06 ***
bmi         0.033364   0.046866   0.712    0.487
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
164
3.5
4.0
2.0
1.0
4.5
-2
-1
Residuals vs Leverage
0.4
3.0
3.5
4.0
Fitted values
4.5
0.5
16
16
-1
Scale-Location
Standardizedresiduals
Theoretical Quantiles
2.5
16
Fitted values
0.8
1.2
3.0
0.0
Standardizedresiduals
2.5
-1.0 0.0
0.0
0.4
16
-0.4
Residuals
8
6
Normal Q-Q
Standardizedresiduals
Residuals vs Fitted
Cook's distance15
0.00
0.10
0.20
0.30
Leverage
id   conc   strength
 1    1.0    6.3
 2    1.5   11.1
 3    2.0   20.0
 4    3.0   24.0
 5    4.0   26.1
 6    4.5   30.0
 7    5.0   33.8
 8    5.5   34.0
 9    6.0   38.1
10    6.5   39.9
11    7.0   42.0
12    8.0   46.1
13    9.0   53.1
14   10.0   52.0
15   11.0   52.5
16   12.0   48.0
17   13.0   42.8
18   14.0   27.8
19   15.0   21.9
Before analysing these data, we need to enter them into R with the usual commands, as follows:
> id <- 1:19
> conc <- c(1.0, 1.5, 2.0, 3.0, 4.0, 4.5, 5.0, 5.5,
6.0, 6.5, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0,
166
Median
2.938
3Q
7.675
Max
15.840
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  21.3213     5.4302   3.926  0.00109 **
conc          1.7710     0.6478   2.734  0.01414 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
167
30
10
20
Tensile strength
40
50
10
12
14
Concentration of hardwood
168
3Q
4.1350
Max
6.5506
Pr(>|t|)
2.73e-16 ***
1.76e-06 ***
1.89e-08 ***
0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Median
0.04125
3Q
1.58922
Max
5.02159
Pr(>|t|)
< 2e-16
2.48e-09
2.06e-11
4.72e-05
***
***
***
***
0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
169
30
10
20
strength
40
50
10
12
14
conc
Linear, quadratic, and cubic fits
170
$$RSS = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$
$$AIC = \log\left(\frac{RSS}{n}\right) + \frac{2k}{n}$$
The model with the lowest AIC is regarded as the best model. In the following example we use the function step to search for the best model based on AIC.
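The AIC values printed by step are n times the quantity defined above, that is n·log(RSS/n) + 2k, which leaves the ranking of models unchanged. The sketch below checks this on simulated data; the data frame d and its columns are invented for illustration.

```r
# Sketch: the AIC used by step() is n * log(RSS / n) + 2 * k.
# The data frame d is simulated; it is not the CO2 data set.
set.seed(4)
n <- 30
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- 1 + 2 * d$x1 + rnorm(n)

fit_full <- lm(y ~ x1 + x2, data = d)
rss <- sum(residuals(fit_full)^2)
k   <- length(coef(fit_full))          # number of estimated coefficients
aic_manual <- n * log(rss / n) + 2 * k

extractAIC(fit_full)                   # c(k, AIC) as used by step()
fit_best <- step(fit_full, trace = 0)  # stepwise search without printing each step
formula(fit_best)
```

The generic AIC() function uses the full log-likelihood and so differs from extractAIC by an additive constant for a given data set.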
Example 4. A study examined the effect of factors such as temperature, time and chemical composition on CO2 yield. The data are summarised in Table 2. The main aim of the study is to find a linear regression model that predicts CO2 yield and to assess the effect of each of these factors.
171
y
36.98
13.74
10.08
8.53
36.42
26.59
19.07
5.96
15.52
56.61
26.72
20.80
6.99
45.93
43.09
15.79
21.60
35.19
26.14
8.60
11.63
9.59
4.42
38.89
11.19
75.62
36.03
X1
5.1
26.4
23.8
46.4
7.0
12.6
18.9
30.2
53.8
5.6
15.1
20.3
48.4
5.8
11.2
27.9
5.1
11.7
16.7
24.8
24.9
39.5
29.0
5.5
11.5
5.2
10.6
X2
400
400
400
400
450
450
450
450
450
400
400
400
400
425
425
425
450
450
450
450
450
450
450
460
450
470
470
X3
51.37
72.33
71.44
79.15
80.47
89.90
91.48
98.60
98.05
55.69
66.29
58.94
74.74
63.71
67.14
77.65
67.22
81.48
83.88
89.38
79.77
87.93
79.50
72.73
77.88
75.50
83.15
X4
4.24
30.87
33.01
44.61
33.84
41.26
41.88
70.79
66.82
8.92
17.98
17.79
33.94
11.95
14.73
34.49
14.48
29.69
26.33
37.98
25.66
22.36
31.52
17.86
25.20
8.66
22.39
X5
1484.83
289.94
320.79
164.76
1097.26
605.06
405.37
253.70
142.27
1362.24
507.65
377.60
158.05
130.66
682.59
274.20
1496.51
652.43
458.42
312.25
307.08
193.61
155.96
1392.08
663.09
1464.11
720.07
X6
2227.25
434.90
481.19
247.14
1645.89
907.59
608.05
380.55
213.40
2043.36
761.48
566.40
237.08
1961.49
1023.89
411.30
2244.77
978.64
687.62
468.38
460.62
290.42
233.95
2088.12
994.63
2196.17
1080.11
X7
2.06
1.33
0.97
0.62
0.22
0.76
1.71
3.93
1.97
5.08
0.60
0.90
0.63
2.04
1.57
2.38
0.32
0.44
8.82
0.02
1.72
1.88
1.43
1.35
1.61
4.78
5.88
Before analysing the data, we need to enter them into R with the usual commands. The data will be stored in the object REGdata.
> y <- c(36.98,13.74,10.08, 8.53,36.42,26.59,19.07,
5.96,15.52,56.61,26.72,20.80,6.99,45.93,
43.09,15.79,21.60,35.19,26.14, 8.60,
11.63, 9.59, 4.42,38.89,11.19,75.62,36.03)
> x1 <- c(5.1,26.4,23.8,46.4, 7.0,12.6,18.9,30.2,
53.8,5.6,15.1,20.3,48.4,5.8,11.2,27.9,5.1,
11.7,16.7,24.8,24.9,39.5,29.0, 5.5, 11.5,
5.2,10.6)
172
> x2 <- c(400,400, 400, 400, 450, 450, 450, 450, 450,
400, 400, 400, 400, 425, 425, 425, 450, 450,
450, 450, 450, 450, 450, 460, 450, 470, 470)
> x3 <- c(51.37,72.33,71.44,79.15,80.47,89.90,91.48,
98.60,98.05,55.69, 66.29,58.94,74.74,63.71,
67.14,77.65,67.22,81.48,83.88,89.38,79.77,
87.93, 79.50,72.73,77.88,75.50,83.15)
> x4 <- c(4.24,30.87,33.01,44.61,33.84,41.26,41.88,
70.79,66.82,8.92,17.98,17.79,33.94,11.95,
14.73,34.49,14.48,29.69,26.33, 37.98,25.66,
22.36,31.52,17.86,25.20, 8.66,22.39)
> x5 <- c(1484.83, 289.94, 320.79, 164.76, 1097.26,
605.06, 405.37, 253.70, 142.27,1362.24, 507.65,
377.60, 158.05, 130.66, 682.59, 274.20,
1496.51, 652.43, 458.42, 312.25, 307.08,
193.61,155.96,1392.08, 663.09,1464.11, 720.07)
> x6 <- c(2227.25, 434.90, 481.19, 247.14,1645.89,
907.59,608.05, 380.55, 213.40,2043.36, 761.48,
566.40,237.08,1961.49,1023.89, 411.30,2244.77,
978.64,687.62, 468.38, 460.62, 290.42,233.95,
2088.12,994.63,2196.17,1080.11)
> x7 <- c(2.06,1.33,0.97,0.62,0.22,0.76,1.71,3.93,1.97,
5.08,0.60,0.90, 0.63,2.04,1.57,2.38,0.32,
0.44,8.82,0.02,1.72,1.88,1.43,1.35,1.61,
4.78,5.88)
> REGdata <- data.frame(y, x1,x2,x3,x4,x5,x6,x7)
Now we begin the analysis. The first model includes all 7 independent variables:
> reg <- lm(y ~ x1+x2+x3+x4+x5+x6+x7, data=REGdata)
> summary(reg)
Call: lm(formula = y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7,
data = REGdata)
Residuals:
    Min      1Q  Median      3Q     Max
-20.035  -4.681  -1.144   4.072  21.214
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 53.937016  57.428952   0.939   0.3594
x1          -0.127653   0.281498  -0.453   0.6553
x2          -0.229179   0.232643  -0.985   0.3370
x3           0.824853   0.765271   1.078   0.2946
x4          -0.438222   0.358551  -1.222   0.2366
x5          -0.001937   0.009654  -0.201   0.8431
x6           0.019886   0.008088   2.459   0.0237 *
x7           1.993486   1.089701   1.829   0.0831 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Median
-0.839
3Q
5.522
Max
26.882
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.144181   3.483064   1.764     0.09 .
x6          0.019395   0.002932   6.616 6.24e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
174
50
50
70
90
200
1000
8
70
10
50
10
40
440
10
30
x1
90
400
x2
70
50
70
x3
1000
10
40
x4
2000
200
x5
500
x6
x7
10
40
70
400
440
10
40
70
500
2000
175
x4 + x5 + x6 +
RSS
2145.37
2164.00
2250.18
2271.74
2140.83
2309.14
2517.92
2821.92
- x1
- x2
- x3
<none>
- x4
+ x5
- x7
- x6
AIC
132.13
132.36
133.42
133.68
134.07
134.12
136.45
139.53
Df Sum of Sq
RSS
1
96.8 2264.9
1
122.0 2290.0
2168.1
1
187.4 2355.5
1
22.7 2145.4
1
4.1 2164.0
1
385.0 2553.1
1
1526.2 3694.3
AIC
129.6
129.9
130.4
130.7
132.1
132.4
132.8
142.8
- x3
- x4
<none>
+ x2
+ x5
+ x1
- x7
- x6
Df Sum of Sq
RSS
1
25.4 2290.3
1
90.9 2355.8
2264.9
1
96.8 2168.1
1
8.3 2256.5
1
5.7 2259.1
1
384.9 2649.7
1
2015.6 4280.5
AIC
127.9
128.7
129.6
130.4
131.5
131.5
131.8
144.8
AIC
130.4
131.5
131.8
132.1
132.2
134.1
134.5
141.0
Df Sum of Sq
RSS
1
22.7 2168.1
1
113.8 2259.1
1
133.5 2278.9
2145.4
1
170.8 2316.2
1
4.5 2140.8
1
375.7 2521.1
1
1058.5 3203.8
Df Sum of Sq
RSS
1
73.5 2363.8
2290.3
1
25.4 2264.9
1
11.3 2279.0
1
6.3 2284.0
1
0.3 2290.0
1
486.6 2776.9
1
1993.8 4284.1
AIC
126.7
127.9
129.6
129.8
129.8
129.9
131.1
142.8
Df Sum of Sq
<none>
+ x4
+ x1
+ x3
+ x5
+ x2
- x7
- x6
1
1
1
1
1
1
1
73.5
33.4
8.1
7.7
7.3
497.3
4477.0
RSS
2363.8
2290.3
2330.4
2355.8
2356.1
2356.6
2861.2
6840.8
AIC
126.7
127.9
128.4
128.7
128.7
128.7
129.9
153.4
Call:
lm(formula = y ~ x6 + x7, data = REGdata)
Coefficients:
(Intercept)           x6           x7
    2.52646      0.01852      2.18575
176
Median
0.2513
3Q
4.9339
Max
21.9682
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.526460   3.610055   0.700   0.4908
x6          0.018522   0.002747   6.742 5.66e-07 ***
x7          2.185753   0.972696   2.247   0.0341 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The detailed analysis (output above) shows that these two variables explain about 70% of the variance of y.
177
Call:
bicreg(x = xvars, y = co2, strict = FALSE, OR = 20)
16 models were selected
Best 5 models (cumulative posterior probability = 0.6599 ):
           p!=0   EV        SD       model 1   model 2   model 3   model 4   model 5
Intercept  100.0   5.75672  14.6244    2.5264    6.1441    8.6120    7.5936    7.3537
x1          12.4  -0.01807   0.1008    .         .         .        -0.1393    .
x2          10.4  -0.00075   0.0282    .         .         .         .         .
x3          10.7   0.00011   0.0791    .         .         .         .        -0.0572
x4          20.2  -0.03059   0.1020    .         .        -0.1419    .         .
x5          10.5  -0.00023   0.0030    .         .         .         .         .
x6         100.0   0.01815   0.0040    0.0185    0.0193    0.0164    0.0162    0.0179
x7          73.7   1.60766   1.2821    2.1857    .         2.1628    2.1233    2.2382

nVar                                   2         1         3         3         3
r2                                     0.700     0.636     0.709     0.704     0.701
BIC                                  -25.8832  -24.0238  -23.4412  -22.9721  -22.6801
post prob                              0.311     0.123     0.092     0.072     0.063
179
M odels selected by BM A
x1
x2
x3
x4
x5
x6
x7
10
13
Model #
180
11
Analysis of variance
Analysis of variance, as the name suggests, is a set of statistical methods whose focus is the variance (rather than the mean). It belongs to the "extended family" of methods known as linear models (or general linear models), which also includes the linear regression we met in the previous chapter. In this chapter we learn how to use R for analysis of variance. We start with a simple analysis, then move on to two-way analysis of variance and to the commonly used non-parametric methods.
Bng 11.1. galactose cho 3 nhm bnh nhn Crohn, vim rut kt v i chng
Nhm 1: bnh
Crohn
1343
1393
1420
1641
1897
2160
2169
Nhm 3: i chng
(control)
1809 2850
1926 2964
2283 2973
2384 3171
2447 3257
2479 3271
2495 3288
181
2279
2890
2767
2827
2895
3011
2525 3358
2541 3643
2769 3657
n=9
n=11
n=20
Trung bnh: 1910
Trung bnh: 2226
Trung bnh: 2804
SD: 516
SD: 727
SD: 527
Ch thch: SD l lch chun (standard deviation).
At first glance the reader may think we need to make 3 comparisons (using the t test): between groups 1 and 2, groups 2 and 3, and groups 1 and 3. This approach is not sound, however, because it involves three different variances. The most appropriate method for this comparison is analysis of variance, which can compare several groups at once (simultaneous comparisons).
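$$x_{ij} = \mu + \alpha_j + \varepsilon_{ij}$$ [1]
Or, more specifically:
$$x_{i1} = \mu + \alpha_1 + \varepsilon_{i1}$$
$$x_{i2} = \mu + \alpha_2 + \varepsilon_{i2}$$
$$x_{i3} = \mu + \alpha_3 + \varepsilon_{i3}$$
That is, the galactose value of any patient equals the mean of the whole population ($\mu$), plus or minus the effect of group j, measured by the effect coefficient $\alpha_j$, plus the error $\varepsilon_{ij}$. A further assumption is that $\varepsilon_{ij}$ follows a normal distribution with mean 0 and variance $\sigma^2$. The two parameters to be estimated are $\mu$ and $\alpha_j$. As in linear regression, they are estimated by the least-squares method; that is, we look for estimates $\hat{\mu}$ and $\hat{\alpha}_j$ such that
$$\sum_{j}\sum_{i}(x_{ij} - \hat{\mu} - \hat{\alpha}_j)^2$$
is as small as possible.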
Group            Number of subjects (n_j)   Mean                 Variance
1 Crohn          n_1 = 9                    $\bar{x}_1$ = 1910   $s_1^2$ = 265944
2 Colitis        n_2 = 11                   $\bar{x}_2$ = 2226   $s_2^2$ = 473387
3 Control        n_3 = 20                   $\bar{x}_3$ = 2804   $s_3^2$ = 277500
Whole sample     n = 40                     $\bar{x}$ = 2444
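Note that:
$$x_{ij} = \bar{x} + (\bar{x}_j - \bar{x}) + (x_{ij} - \bar{x}_j)$$ [2]
That is, each observation is made up of the overall mean, the difference between its group mean and the overall mean, and the part $x_{ij} - \bar{x}_j$ that varies within the group.
Total sum of squares:
$$SST = \sum_{j}\sum_{i}(x_{ij} - \bar{x})^2$$
Sum of squares reflecting differences between groups:
$$SSB = \sum_{j}\sum_{i}(\bar{x}_j - \bar{x})^2 = \sum_{j} n_j(\bar{x}_j - \bar{x})^2 = 5681168$$
Sum of squares reflecting variation within each group:
$$SSW = \sum_{j}\sum_{i}(x_{ij} - \bar{x}_j)^2 = \sum_{j}(n_j - 1)s_j^2 = 12133923$$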
SSW is computed from every patient in the 3 groups, so the within-group mean square (MSW) is:
MSW = SSW / (N - k) = 12133922 / (40 - 3) = 327944
and the between-group mean square is:
MSB = SSB / (k - 1) = 5681168 / (3 - 1) = 2840584 (R, working from the unrounded data, reports 2841810)
Here N is the total number of patients in the three groups (N = 40) and k = 3 is the number of patient groups. If the groups differ, we expect MSB to be larger than MSW. To test the hypothesis we can therefore rely on the F test:
F = MSB / MSW = 8.67
[3]
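The sums of squares and the F ratio can be recomputed from the group sizes, means and variances in the summary table. Because the group means are rounded to whole numbers, the result differs slightly in the last digits from the value R obtains from the raw data; the object names below are illustrative.

```r
# Sketch: one-way ANOVA F test from summary statistics (rounded group means).
n_j    <- c(9, 11, 20)                 # group sizes: Crohn, colitis, control
mean_j <- c(1910, 2226, 2804)          # group means
var_j  <- c(265944, 473387, 277500)    # group variances

N <- sum(n_j)
k <- length(n_j)
grand_mean <- sum(n_j * mean_j) / N

SSB <- sum(n_j * (mean_j - grand_mean)^2)   # between-group sum of squares
SSW <- sum((n_j - 1) * var_j)               # within-group sum of squares
MSB <- SSB / (k - 1)
MSW <- SSW / (N - k)
F_stat <- MSB / MSW
p_val  <- pf(F_stat, k - 1, N - k, lower.tail = FALSE)

c(SSB = SSB, SSW = SSW, MSB = MSB, MSW = MSW, F = F_stat, p = p_val)
```

The upper tail of the F distribution with k-1 and N-k degrees of freedom gives the p value.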
with k-1 and N-k degrees of freedom. The calculations above can be presented in an analysis of variance table (ANOVA table) as follows:
Ngun bin thin (source of
variation)
Bc t do
(degrees
of
freedom)
Tng bnh
phng
(sum of
squares)
5681168
Trung
bnh bnh
phng
(mean
square)
2841810
37
12133923
327944
39
12133923
Kim nh
F
8.6655
analysis of variance, we must define the variable group as a factor.
Once the data are ready, we use the function lm() to carry out the analysis of variance as follows:
> analysis <- lm(galactose ~ group)
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
185
and:
MSB = 2841810 and MSW = 327944
Therefore, F = 2841810 / 327944 = 8.6655.
The value p = 0.00082 indicates that galactose differs between the three groups.
(c) Estimates. For more detail on the results of the analysis, we use the summary command as follows:
> summary(analysis)
Call:
lm(formula = galactose ~ group)
Residuals:
   Min     1Q Median     3Q    Max
-995.5 -437.9  102.0  456.0  979.8
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   1910.2      190.9  10.007  4.5e-12 ***
group2         316.3      257.4   1.229 0.226850
group3         894.3      229.9   3.891 0.000402 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(b)
        p adj
2-1 0.4439821
3-1 0.0011445
3-2 0.0281768
[Figure: TukeyHSD confidence intervals for the pairwise differences 3-2, 3-1 and 2-1]
                          Material (j)
Condition (i)   1               2               3
1               4.1, 3.9, 4.3   3.1, 2.8, 3.3   3.5, 3.2, 3.6
2               2.7, 3.1, 2.6   1.9, 2.2, 2.3   2.7, 2.3, 2.5
Mean
                          Material (j)
Condition (i)        1       2       3       Mean of 3 materials
1                    4.10    3.07    3.43    3.533
2                    2.80    2.13    2.50    2.478
Mean of 2 groups     3.450   2.600   2.967   3.00

Variance
                          Material (j)
Condition (i)        1       2       3
1                    0.040   0.063   0.043
2                    0.070   0.043   0.040
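These preliminary calculations suggest that condition and material may each affect the test score.
Let $x_{ij}$ be the score for condition i (i = 1, 2) and material j (j = 1, 2, 3). (To simplify matters, we temporarily ignore the subject index k.) The two-way analysis of variance model states that:
$$x_{ij} = \mu + \alpha_i + \beta_j + \varepsilon_{ij}$$
[4]
Or, more specifically:
$$x_{11} = \mu + \alpha_1 + \beta_1 + \varepsilon_{11}$$
$$x_{12} = \mu + \alpha_1 + \beta_2 + \varepsilon_{12}$$
$$x_{13} = \mu + \alpha_1 + \beta_3 + \varepsilon_{13}$$
$$x_{21} = \mu + \alpha_2 + \beta_1 + \varepsilon_{21}$$
$$x_{22} = \mu + \alpha_2 + \beta_2 + \varepsilon_{22}$$
$$x_{23} = \mu + \alpha_2 + \beta_3 + \varepsilon_{23}$$
$\mu$ is the mean of the whole population; the coefficients $\alpha_i$ (effect of condition i) and $\beta_j$ (effect of material j) must be estimated from the observed data. $\varepsilon_{ij}$ is assumed to follow a normal distribution with mean 0 and variance $\sigma^2$.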
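In two-way analysis of variance, we need to partition the total sum of squares into three sources:
$$SS_c = \sum_i n_i (\bar{x}_i - \bar{x})^2 = 5.01$$
$$SS_m = \sum_j n_j (\bar{x}_j - \bar{x})^2 = 2.18$$
The third source is the residual sum of squares, which is what remains of the total sum of squares after $SS_c$ and $SS_m$ are subtracted.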
Source of variation   Degrees of freedom   Sum of squares   Mean square   F test
Condition                      1                5.01           5.01        95.6
Material                       2                2.18           1.09        20.8
Residual                      14                0.73           0.052
Total                         17                7.92
Condition   Material   Subject   Score
    1          1          1       4.1
    1          1          2       3.9
    1          1          3       4.3
    1          2          4       3.1
    1          2          5       2.8
    1          2          6       3.3
    1          3          7       3.5
    1          3          8       3.2
    1          3          9       3.6
    2          1         10       2.7
    2          1         11       3.1
    2          1         12       2.6
    2          2         13       1.9
    2          2         14       2.2
    2          2         15       2.3
    2          3         16       2.7
    2          3         17       2.3
    2          3         18       2.5
Levels: 1 2 3 4 5 6 7 8 9
9, 36)
1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 4 4 4 4
4
1 2 3 4
And create 18 subject codes (from 1 to 18):
> id <- 1:18
          F value    Pr(>F)
condition  95.575 1.235e-07 ***
material   20.788 6.437e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

            Pr(>|t|)
(Intercept) 2.43e-15 ***
condition2  1.24e-07 ***
material2   1.58e-05 ***
material3     0.0026 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.229 on 14 degrees of freedom
Multiple R-Squared: 0.9074,     Adjusted R-squared: 0.8875
F-statistic: 45.72 on 3 and 14 DF, p-value: 1.761e-07
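The two-way analysis reported above can be reproduced from the 18 scores in the data table. This is a minimal sketch: generating the two factors with gl() is an assumption about how they were built, and the object names twoway and f_values are hypothetical.

```r
# Scores in the order of the data table: condition 1 (materials 1, 2, 3),
# then condition 2 (materials 1, 2, 3), three subjects per cell
score <- c(4.1, 3.9, 4.3, 3.1, 2.8, 3.3, 3.5, 3.2, 3.6,
           2.7, 3.1, 2.6, 1.9, 2.2, 2.3, 2.7, 2.3, 2.5)
condition <- gl(2, 9, 18)   # 1 repeated 9 times, then 2 repeated 9 times
material  <- gl(3, 3, 18)   # 1,1,1,2,2,2,3,3,3 repeated twice

# Additive two-way model [4]
twoway <- aov(score ~ condition + material)
tab <- summary(twoway)[[1]]
print(tab)

f_values <- tab[["F value"]]   # F for condition, F for material, NA
```

The F values should agree with the output shown above (about 95.6 for condition and 20.8 for material).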
$\sqrt{0.0525} = 0.229$, that is,
$$R^2 = \frac{5.0139 + 2.1811}{5.0139 + 2.1811 + 0.7344} = 0.9074$$
$$x_{ij} = \mu + \alpha_i + \beta_j + (\alpha_i \beta_j)_{ij} + \varepsilon_{ij}$$

                   Mean Sq  F value    Pr(>F)
condition           5.0139 100.2778 3.528e-07 ***
material            1.0906  21.8111 0.0001008 ***
condition:material  0.0672   1.3444 0.2972719
Residuals           0.0500
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The analysis above gives p = 0.297 for the interaction effect. We therefore conclude that the interaction between material and condition is not statistically significant, and we accept model [4], that is, the model without interaction.
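The interaction test can be sketched as a comparison of the additive model with the model that adds a condition-by-material term. The code below is an illustration under the same assumptions as before (factors built with gl(); the names additive, full and cmp are hypothetical).

```r
score <- c(4.1, 3.9, 4.3, 3.1, 2.8, 3.3, 3.5, 3.2, 3.6,
           2.7, 3.1, 2.6, 1.9, 2.2, 2.3, 2.7, 2.3, 2.5)
condition <- gl(2, 9, 18)
material  <- gl(3, 3, 18)

additive <- aov(score ~ condition + material)   # model [4], no interaction
full     <- aov(score ~ condition * material)   # adds condition:material

# F test for the extra interaction term
cmp <- anova(additive, full)
print(cmp)
p_interaction <- cmp[["Pr(>F)"]][2]
```

A p-value near 0.30 means the extra term does not improve the fit noticeably, which is why the simpler model is retained.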
(e) Comparison between groups. We estimate the differences between the two conditions and among the three materials with the function TukeyHSD applied to aov:
> res <- aov(score ~ condition + material)
> TukeyHSD(res)
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = score ~ condition + material)

$condition
         diff       lwr        upr p adj
2-1 -1.055556 -1.287131 -0.8239797 1e-07

$material
          diff         lwr        upr     p adj
2-1 -0.8500000 -1.19610279 -0.5038972 0.0000442
3-1 -0.4833333 -0.82943612 -0.1372305 0.0068648
3-2  0.3666667  0.02056388  0.7127695 0.0374069
[Figure: TukeyHSD confidence intervals for condition (2-1) and material (2-1, 3-1, 3-2), and a plot of the mean of score by condition and material]
Area    ID   Age (months)   Height (cm)
urban    1       109          137.6
urban    2       113          147.8
urban    3       115          136.8
urban    4       116          140.7
urban    5       119          132.7
urban    6       120          145.4
urban    7       121          135.0
urban    8       124          133.0
urban    9       126          148.5
urban   10       129          148.3
urban   11       130          147.5
urban   12       133          148.8
urban   13       134          133.2
urban   14       135          148.7
urban   15       137          152.0
urban   16       139          150.6
urban   17       141          165.3
urban   18       142          149.9
rural    1       121          139.0
rural    2       121          140.9
rural    3       128          134.9
rural    4       129          149.5
rural    5       131          148.7
rural    6       132          131.0
rural    7       133          142.3
rural    8       134          139.9
rural    9       138          142.9
rural   10       138          147.7
rural   11       138          147.7
rural   12       140          134.6
rural   13       140          135.8
rural   14       140          148.5
139.9,142.9,147.7,147.7,134.6,135.8,148.5)
> # create a data frame
> data <- data.frame(id, group, age, height)
> attach(data)
In addition, the following plot shows that there is a correlation between age and height:
[Figure: scatter plot of height against age]
Because the two groups differ in age, and age is related to height, we cannot compare height between the two groups of students without adjusting for age. To adjust for age, we use analysis of covariance.
$$y_1 = \alpha_1 + \beta x + e_1 \quad \text{in group 1}$$
$$y_2 = \alpha_2 + \beta x + e_2 \quad \text{in group 2.}$$
[5]
For the group means:
$$\bar{y}_1 = \alpha_1 + \beta \bar{x}_1 + \bar{e}_1$$
$$\bar{y}_2 = \alpha_2 + \beta \bar{x}_2 + \bar{e}_2$$
and the difference between the two groups now depends on the coefficient $\beta$:
$$\bar{y}_1 - \bar{y}_2 = \alpha_1 - \alpha_2 + \beta(\bar{x}_1 - \bar{x}_2)$$
Note that in model [5] we can interpret $\alpha_1 - \alpha_2$ as the difference in mean height between the two groups if both groups have the same mean age. This difference expresses the effect of group when no other factor is related to y. To estimate $\alpha_1 - \alpha_2$, we cannot simply subtract the two means $\bar{y}_1 - \bar{y}_2$; we must adjust for x. Let $x^*$ be a value
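$$\bar{y}_{1a} = \bar{y}_1 - \beta(\bar{x}_1 - x^*)$$
$\bar{y}_{1a}$ can be regarded as an estimate of the mean height of group 1 (urban) at the value $x = x^*$. Similarly:
$$\bar{y}_{2a} = \bar{y}_2 - \beta(\bar{x}_2 - x^*)$$
is the estimate of the mean height of group 2 (rural) at the same value $x^*$. From these, we can estimate the effect of urban versus rural residence with the following formula:
$$\bar{y}_{1a} - \bar{y}_{2a} = \bar{y}_1 - \bar{y}_2 - \beta(\bar{x}_1 - \bar{x}_2)$$
$$y = \alpha + \beta x + \gamma g + \delta(xg) + e$$
[6]
In other words, the model above states that a student's height is influenced by three factors: age ($\beta$), urban or rural residence ($\gamma$), and the interaction between the two ($\delta$). If $\delta = 0$ (that is, the interaction effect is not statistically significant), the model reduces to:
$$y = \alpha + \beta x + \gamma g + e$$
[7]
$$y = \alpha + \beta x + e$$
[8]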
Pr(>F)
0.23251
0.04114 *
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
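As an illustration of the analysis of covariance, the sketch below enters the ages and heights from the data table and fits model [7] (common slope, separate intercepts), then compares it with model [6]. It assumes that the first 18 students in the table are the urban group and the last 14 the rural group; the names kids, ancova and with_interaction are hypothetical.

```r
age_urban <- c(109, 113, 115, 116, 119, 120, 121, 124, 126,
               129, 130, 133, 134, 135, 137, 139, 141, 142)
height_urban <- c(137.6, 147.8, 136.8, 140.7, 132.7, 145.4, 135.0, 133.0, 148.5,
                  148.3, 147.5, 148.8, 133.2, 148.7, 152.0, 150.6, 165.3, 149.9)
age_rural <- c(121, 121, 128, 129, 131, 132, 133, 134, 138, 138, 138, 140, 140, 140)
height_rural <- c(139.0, 140.9, 134.9, 149.5, 148.7, 131.0, 142.3,
                  139.9, 142.9, 147.7, 147.7, 134.6, 135.8, 148.5)

kids <- data.frame(
  group  = factor(rep(c(1, 2), c(18, 14)), labels = c("urban", "rural")),
  age    = c(age_urban, age_rural),
  height = c(height_urban, height_rural)
)

# Model [7]: height depends on age and on group, with a common slope
ancova <- lm(height ~ age + group, data = kids)
b <- coef(ancova)
print(round(b, 4))

# Model [6]: adds the age-by-group interaction; compare the two fits
with_interaction <- lm(height ~ age * group, data = kids)
print(anova(ancova, with_interaction))
```

The coefficient named grouprural is the age-adjusted difference in mean height between rural and urban students; the coefficient for age is the common slope.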
From the parameter estimates presented above, we see that on average the height of students increases by about 0.41 cm for each month of age. Note that in the results above, group2 denotes the regression coefficient for group 2 (that is, rural), and R sets the coefficient for group 1 to 0 for computational convenience. We therefore have two equations (or two lines) for the two groups of students:
For urban students:
Height = 91.817 + 0.4157(age)
In other words, after adjusting for age, the rural students are about 5.5 cm shorter than the urban students, and the difference
[Figure: height against age for the two groups (plotting symbols 1 and 2) in four panels: Model 1, Model 2, Model 3, Model 4]
                    Pesticide
Variety      1      2      3      4    Total
1           29     50     43     53     175
2           41     58     42     73     214
3           66     85     69     85     305
Total      136    193    154    211     694
The model for analysing a factorial experiment is no different from the two-way analysis of variance presented in the previous section. Specifically, the model we consider is:
$$\text{product} = \alpha + \beta(\text{variety}) + \gamma(\text{pesticide}) + \varepsilon$$
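A minimal sketch of this model in R follows. The 12 yields are taken from the variety-by-pesticide table; the entry for variety 3 and pesticide 3 is entered as 69, the value that agrees with the printed row total (305) and column total (154). The factor construction with gl() and the names fit and tab are assumptions for the example.

```r
# Yields by row of the table: variety 1, 2, 3; pesticide 1 to 4 within each
product <- c(29, 50, 43, 53,
             41, 58, 42, 73,
             66, 85, 69, 85)
variety   <- gl(3, 4, 12)   # 1,1,1,1,2,2,2,2,3,3,3,3
pesticide <- gl(4, 1, 12)   # 1,2,3,4 repeated three times

fit <- aov(product ~ variety + pesticide)
tab <- summary(fit)[[1]]
print(tab)

# Pairwise comparisons for both factors
print(TukeyHSD(fit))
```

With these data the residual sum of squares is 151.5 on 6 degrees of freedom, and the F values are about 44.1 for variety and 15.7 for pesticide.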
           Sum Sq Mean Sq F value   Pr(>F)
variety                    44.063 0.000259 ***
pesticide                  15.723 0.003008 **
Residuals  151.50   25.25
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

$variety
         upr     p adj
2-1 20.65209 0.0749103
3-1 43.40209 0.0002363
3-2 33.65209 0.0016627

$pesticide
          upr     p adj
2-1 33.202864 0.0140509
3-1 20.202864 0.5106152
4-1 39.202864 0.0036109
3-2  1.202864 0.0704233
4-2 20.202864 0.5106152
4-3 33.202864 0.0140509
[Figure: TukeyHSD confidence intervals for the pairwise differences 4-3, 4-2, 3-2, 4-1, 3-1 and 2-1]
                          Variety
Field sample   1          2          3          4
1              175 (Aa)   143 (Ba)   128 (Bb)   166 (Ab)
2              170 (Ab)   178 (Aa)   140 (Ba)   131 (Bb)
3              135 (Bb)   173 (Ab)   169 (Aa)   141 (Ba)
4              145 (Ba)   136 (Bb)   165 (Ab)   173 (Aa)
The first source is the difference between the cultivation and fertilizer methods;
The second source is the difference between the plant varieties;
The third source is the difference between the field samples.

Mean by variety:        1: 156.25   2: 157.50   3: 150.50   4: 152.75   Overall mean: 154.25
Mean by field sample:   1: 153.00   2: 154.75   3: 154.50   4: 154.75   Overall mean: 154.25
Mean by method:         1: 173.75   2: 168.50   3: 142.25   4: 132.50   Overall mean: 154.25

The summary above lets us compute the sum of squares for each source of variation. We start with the sum of squares for the whole experiment (provisionally called SStotal):
These estimates can be presented in an analysis of variance table as follows:
Source of variation        Degrees of freedom   Mean square   F test
Between 4 field samples             3                 2.8         2.3
Between 4 varieties                 3                41.2        32.9
Between 4 methods                   3              1600.5      1280.4
Residual                            6
Total                              15
From this simple hand analysis, we see that the cultivation method and the variety have a large effect on yield. To compute the p-values exactly, we can use R to carry out the analysis of variance for the Latin square experiment.
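The Latin square analysis can be sketched as follows, using the 16 yields and the method codes (1 = Aa, 2 = Ab, 3 = Ba, 4 = Bb) from the layout table. The factor for field sample is named field here, a hypothetical name chosen to avoid masking the base function sample(); latin and f_latin are also illustrative names.

```r
# Yields by field sample (rows of the layout), varieties 1 to 4 within each row
y <- c(175, 143, 128, 166,
       170, 178, 140, 131,
       135, 173, 169, 141,
       145, 136, 165, 173)
field   <- gl(4, 4, 16)     # field sample: 1,1,1,1,2,2,2,2,...
variety <- gl(4, 1, 16)     # variety: 1,2,3,4 repeated
method  <- factor(c(1, 3, 4, 2,
                    2, 1, 3, 4,
                    4, 2, 1, 3,
                    3, 4, 2, 1))

latin <- aov(y ~ field + variety + method)
tab <- summary(latin)[[1]]
print(tab)
f_latin <- tab[["F value"]]   # F for field, variety, method
```

The three F values should match the hand calculation (about 2.3, 32.9 and 1280.4), now accompanied by exact p-values.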
> y <- c(175, 143, 128, 166,
         170, 178, 140, 131,
         135, 173, 169, 141,
         145, 136, 165, 173)
> sample <- c(1,1,1,1,
              2,2,2,2,
              3,3,3,3,
              4,4,4,4)
> sample <- as.factor(sample)
> data
   sample variety method   y
1       1       1      1 175
2       1       2      3 143
3       1       3      4 128
4       1       4      2 166
5       2       1      2 170
6       2       2      1 178
7       2       3      3 140
8       2       4      4 131
9       3       1      4 135
10      3       2      2 173
11      3       3      1 169
12      3       4      3 141
13      4       1      3 145
14      4       2      4 136
15      4       3      2 165
16      4       4      1 173
$variety
           upr     p adj
2-1  3.9867231 0.4528549
3-1 -3.0132769 0.0014152
4-1 -0.7632769 0.0173206
3-2 -4.2632769 0.0004803
4-2 -2.0132769 0.0038827
4-3  4.9867231 0.1034761

$method
           lwr        upr     p adj
2-1  -7.986723  -2.513277 0.0023016
3-1 -34.236723 -28.763277 0.0000001
4-1 -43.986723 -38.513277 0.0000000
3-2 -28.986723 -23.513277 0.0000004
4-2 -38.736723 -33.263277 0.0000000
4-3 -12.486723  -7.013277 0.0000730
The comparison between varieties shows differences between varieties 3 and 1, 4 and 1, 3 and 2, and 4 and 2.
All comparisons between the cultivation methods are statistically significant. But which method gives the highest yield? To answer this question, we use a box plot:
> boxplot(y ~ method, xlab="Methods (1=Aa, 2=Ab, 3=Ba, 4=Bb)", ylab="Production")
[Figure: box plot of Production by method]
Sequence AB                              Sequence BA
Patient id   Month 1     Month 2         Patient id   Month 1     Month 2
             (A)         (Placebo)                    (Placebo)   (A)
1            6           4               2            5           7
3            8           7               4            9           6
5            12          6               7            7           11
6            7           8               8            4           7
9            9           10              11           9           8
10           6           4               12           5           4
13           11          6               14           8           9
15           8           8               16           9           13
Patient id              Month 1     Month 2     Mean
AB                      (A)         (Placebo)
1                       6           4           5.0
3                       8           7           7.5
5                       12          6           9.0
6                       7           8           7.5
9                       9           10          9.5
10                      6           4           5.0
13                      11          6           8.5
15                      8           8           8.0
Mean                    8.375       6.625       7.5000
BA                      (Placebo)   (A)
2                       5           7           6.0
4                       9           6           7.5
7                       7           11          9.0
8                       4           7           5.5
11                      9           8           8.5
12                      5           4           4.5
14                      8           9           8.5
16                      9           13          11.0
Mean                    7.000       8.125       7.5625
Mean of the 2 groups    7.6875      7.3750      7.5312
Mean for group A = (8.375 + 8.125) / 2 = 8.25
Mean for group P (placebo) = (6.625 + 7.000) / 2 = 6.8125
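Sum of squares due to the difference between treatment with the drug and with placebo:
$$SS_{Treat} = 16(8.25 - 7.5312)^2 + 16(6.8125 - 7.5312)^2 = 16.53$$
Sum of squares due to the difference between the two sequence groups AB and BA (order):
$$SS_{seq} = 16(7.50 - 7.5312)^2 + 16(7.5625 - 7.5312)^2 = 0.031$$
Sum of squares due to differences between patients within the same sequence group AB or BA (each patient mean is based on two measurements, hence the factor 2):
$$SS_w = 2\left[(5.0 - 7.50)^2 + (7.5 - 7.50)^2 + (9.0 - 7.50)^2 + \dots + (8.0 - 7.50)^2 + (6.0 - 7.5625)^2 + (7.5 - 7.5625)^2 + (9.0 - 7.5625)^2 + \dots + (11.0 - 7.5625)^2\right] = 103.44$$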
Source of variation                   Degrees of freedom   Sum of squares   Mean square   F test
Treatment (A vs placebo)                       1                16.53          16.53       4.90
Period (month)                                 1                 0.781          0.781      0.23
Sequence (AB vs BA)                            1                 0.031          0.031      0.004
Between patients within sequence              14               103.44           7.39
Residual                                      14                47.19           3.37
Total                                         31               167.97
From the analysis above, we see that the difference between the drug and the placebo is larger than the difference between the two months or between the two sequence groups AB and BA. The F test of the hypothesis that the drug and the placebo are equally effective is F = 16.53 / 3.37 = 4.90 with 1 and 14 degrees of freedom. From probability theory, the critical value of F with 1 and 14 degrees of freedom is 4.60 (at the 5% level). We can therefore conclude that the drug makes sweating last longer than the placebo does.
All the hand calculations above only illustrate how analysis of variance works in a crossover experiment. In practice, we can use R to carry out the calculations, just as for the simpler experiments. The main issue is how to organize the data for analysis. R (like many other programs) requires the user to enter each data value individually, and each value must be linked to a patient, a treatment group, a month (or period), and a sequence group. This requirement is very important, because if the data are organized incorrectly the results of the analysis may be wrong.
The following describes each step:
# step 1: enter the data and name the object y
> y <- c(6,8,12,7,9,6,11,8,
         4,7,6,8,10,4,6,8,
         7,6,11,7,8,4,9,13,
         5,9,7,4,9,5,8,9)
# step 2: for each value in step 1, indicate the sequence group AB
# or BA (codes 1 and 2)
> seq <- c(1,1,1,1,1,1,1,1,
           1,1,1,1,1,1,1,1,
           2,2,2,2,2,2,2,2,
           2,2,2,2,2,2,2,2)
> seq <- as.factor(seq)
# step 3: for each value in step 1, indicate month 1 or month 2
> period <- c(1,1,1,1,1,1,1,1,
              2,2,2,2,2,2,2,2,
              2,2,2,2,2,2,2,2,
              1,1,1,1,1,1,1,1)
> period <- as.factor(period)
# step 4: for each value in step 1, indicate drug A or placebo
# with codes 1 and 2:
> treat <- c(1,1,1,1,1,1,1,1,
             2,2,2,2,2,2,2,2,
             1,1,1,1,1,1,1,1,
             2,2,2,2,2,2,2,2)
> treat <- as.factor(treat)
# step 5: for each value in step 1, give the code of the patient
> id <- c(1,3,5,6,9,10,13,15,
          1,3,5,6,9,10,13,15,
          2,4,7,8,11,12,14,16,
          2,4,7,8,11,12,14,16)
> id <- as.factor(id)
# step 6: combine everything into a data frame named data and print
# it to check once more.
> data <- data.frame(seq, period, treat, id, y)
> data
   seq period treat id  y
1    1      1     1  1  6
2    1      1     1  3  8
3    1      1     1  5 12
4    1      1     1  6  7
5    1      1     1  9  9
6    1      1     1 10  6
7    1      1     1 13 11
8    1      1     1 15  8
9    1      2     2  1  4
10   1      2     2  3  7
11   1      2     2  5  6
12   1      2     2  6  8
13   1      2     2  9 10
14   1      2     2 10  4
15   1      2     2 13  6
16   1      2     2 15  8
17   2      2     1  2  7
18   2      2     1  4  6
19   2      2     1  7 11
20   2      2     1  8  7
21   2      2     1 11  8
22   2      2     1 12  4
23   2      2     1 14  9
24   2      2     1 16 13
25   2      1     2  2  5
26   2      1     2  4  9
27   2      1     2  7  7
28   2      1     2  8  4
29   2      1     2 11  9
30   2      1     2 12  5
31   2      1     2 14  8
32   2      1     2 16  9
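One way to analyse these data is sketched below. The model formula is an assumption (the command used for the printed output is not shown here): patient is entered as a factor so that treatment and period are tested against the within-patient residual. The responses are entered in the order of the printed data frame, and the names crossover and f_treat are hypothetical.

```r
y <- c(6, 8, 12, 7, 9, 6, 11, 8,      # sequence AB, month 1, drug A
       4, 7, 6, 8, 10, 4, 6, 8,       # sequence AB, month 2, placebo
       7, 6, 11, 7, 8, 4, 9, 13,      # sequence BA, month 2, drug A
       5, 9, 7, 4, 9, 5, 8, 9)        # sequence BA, month 1, placebo
period <- factor(rep(c(1, 2, 2, 1), each = 8))
treat  <- factor(rep(c(1, 2, 1, 2), each = 8))   # 1 = drug A, 2 = placebo
id     <- factor(c(1, 3, 5, 6, 9, 10, 13, 15,
                   1, 3, 5, 6, 9, 10, 13, 15,
                   2, 4, 7, 8, 11, 12, 14, 16,
                   2, 4, 7, 8, 11, 12, 14, 16))

# Patient effects absorb all between-patient variation (including sequence)
crossover <- aov(y ~ id + period + treat)
tab <- summary(crossover)[[1]]
print(tab)
f_treat <- tab[["F value"]][3]

# Difference between the placebo mean and the drug mean
mean_diff <- mean(y[treat == 2]) - mean(y[treat == 1])
```

The F value for treat should be close to the hand-calculated 4.90, and the mean difference close to the value reported by TukeyHSD.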
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Note the result:
$treat
       diff       lwr         upr     p adj
2-1 -1.4375 -2.829658 -0.04534186 0.0438783
Group     Patient id   Time 1   Time 2   Time 3
Vaccine       1           6        3        0
              2           7        3        1
              3           4        1        2
              4           8        4        3
Placebo       5           6        5        5
              6           9        4        6
              7           5        3        4
              8           6        2        3
The main question is whether there is any difference between the vaccine group and the placebo group.
To simplify the analysis of variance for a repeated-measures experiment, we avoid mathematical notation and illustrate it with a few hand calculations that the reader can follow. First, we summarize the data by computing the mean for each patient, each treatment group, and each time point, as follows:
Table 11.11. Summary of data from the study of a vaccine against rheumatic pain
Treatment group   id      Time 1   Time 2   Time 3   Mean
Vaccine           1         6        3        0      3.000
                  2         7        3        1      3.667
                  3         4        1        2      2.333
                  4         8        4        3      5.000
                  Mean     6.25     2.75     1.50    3.500
                  SD       1.71     1.26     1.29
Placebo           5         6        5        5      5.333
                  6         9        4        6      6.333
                  7         5        3        4      4.000
                  8         6        2        3      3.667
                  Mean     6.50     3.50     4.50    4.833
                  SD       1.73     1.29     1.29
Mean of the two groups     6.375    3.125    3.000   4.167
Source of variation                Degrees of freedom   Sum of squares   Mean square   F test
Treatment (vaccine vs placebo)              1               10.667         10.667       2.53
Between patients within group               6               25.333          4.222
Time                                        2               58.583         29.292      28.89
Treatment x time                            2                8.583          4.292       4.23
Residual                                   12               12.167          1.014       -
Total                                      23              115.333
First, we enter the data for each patient. As in any statistical software, each value must be accompanied by the variables that identify the patient, the group, and the time point:
y <- c(6,7,4,8,
       3,3,1,4,
       0,1,2,3,
       6,9,5,6,
       5,4,3,2,
       5,6,4,3)
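A possible way to complete the data set and run the repeated-measures analysis is sketched here. The factor definitions and the use of an Error(id) term in aov() are assumptions, chosen because they reproduce the two error levels of the hand-calculated table (between patients and within patients); rm_fit, f_group and f_time are hypothetical names.

```r
y <- c(6, 7, 4, 8,    # vaccine, time 1 (patients 1 to 4)
       3, 3, 1, 4,    # vaccine, time 2
       0, 1, 2, 3,    # vaccine, time 3
       6, 9, 5, 6,    # placebo, time 1 (patients 5 to 8)
       5, 4, 3, 2,    # placebo, time 2
       5, 6, 4, 3)    # placebo, time 3
id    <- factor(c(rep(1:4, 3), rep(5:8, 3)))
time  <- factor(rep(rep(1:3, each = 4), 2))
group <- factor(rep(1:2, each = 12), labels = c("vaccine", "placebo"))

# group is tested against between-patient variation,
# time and group:time against within-patient variation
rm_fit <- aov(y ~ group * time + Error(id))
s <- summary(rm_fit)
print(s)

f_group <- s[["Error: id"]][[1]][["F value"]][1]
f_time  <- s[["Error: Within"]][[1]][["F value"]][1]
```

The first stratum of the output gives the test for group, the second the tests for time and for the group-by-time interaction.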
   id time group y
14  6    1     2 9
15  7    1     2 5
16  8    1     2 6
17  5    2     2 5
18  6    2     2 4
19  7    2     2 3
20  8    2     2 2
21  5    3     2 5
22  6    3     2 6
23  7    3     2 4
24  8    3     2 3
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The results in the first part of the table above show that the difference between the group treated with the drug and the placebo group is not statistically significant
12
Logistic regression analysis
In the previous chapters on linear regression and analysis of variance, we looked for models of the relationship between a continuous dependent variable and one or more independent variables, which may be continuous or not. In many cases, however, the dependent variable is not continuous but binary: yes/no, diseased/not diseased, dead/alive, occurred/did not occur, and so on, while the independent variables may be continuous or not. We still want to study the relationship between the independent variables and the dependent variable.
Example 1. A study conducted by the author examined the relationship between the risk of fracture (abbreviated fx) and bone density together with several other biochemical indices in 139 male patients (or, more precisely, study subjects) aged 60 and over. In 1990, the following data were collected for each subject: age, body mass index (BMI), bone mineral density (BMD), the bone resorption marker ICTP, and the bone formation marker PINP. The subjects were followed for 15 years. During follow-up, it was recorded whether each patient sustained a fracture. The initial question was whether there is a relationship between BMD and the risk of fracture. The data from this study are given at the end of this chapter, and part of them is shown below so that the reader can grasp the problem.
Table 12.1. Part of the data from the study of risk factors for fracture
id    fx   age   bmi       bmd     ictp    pinp
1     1    79    24.7252   0.818   9.170   37.383
2     1    89    25.9909   0.871   7.561   24.685
3     1    70    25.3934   1.358   5.347   40.620
4     1    88    23.2254   0.714   7.354   56.782
5     1    85    24.6097   0.748   6.760   58.358
6     0    68    25.0762   0.935   4.939   67.123
7     0    70    19.8839   1.040   4.321   26.399
8     0    69    25.0593   1.002   4.212   47.515
9     0    74    25.6544   0.987   5.605   26.132
10    0    79    19.9594   0.863   5.204   60.267
...
137   0    64    38.0762   1.086   5.043   32.835
138   1    80    23.3887   0.875   4.086   23.837
139   0    67    25.9455   0.983   4.328   71.334
$$p = \frac{x}{n}$$
$$\text{odds} = \frac{p}{1-p}$$
[1]
$$\text{logit}(p) = \log\left(\frac{p}{1-p}\right)$$
[2]
[Figure: logit(p) plotted against p]
[3]
$$\text{logit}(p) = \log\left(\frac{p}{1-p}\right) = \alpha + \beta x$$
[4]
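In other words:
$$\text{odds}(p) = \frac{p}{1-p} = e^{\alpha + \beta x}$$
$$\text{odds}(p \mid x) = e^{\alpha + \beta x}$$
When $x = x_0$, the odds of fracture are: $\text{odds}(p \mid x = x_0) = e^{\alpha + \beta x_0}$
When $x = x_0 + 1$ (an increase of one unit from $x_0$), the odds of fracture are:
$$\text{odds}(p \mid x = x_0 + 1) = e^{\alpha + \beta(x_0 + 1)}$$
$$\frac{\text{odds}(p \mid x = x_0 + 1)}{\text{odds}(p \mid x = x_0)} = \frac{e^{\alpha + \beta(x_0 + 1)}}{e^{\alpha + \beta x_0}} = e^{\beta}$$
[5]
$$\sum_{i=1}^{n} y_i = \sum_{i=1}^{n} \left[1 + e^{-(\alpha + \beta x_i)}\right]^{-1}$$
$$\sum_{i=1}^{n} x_i y_i = \sum_{i=1}^{n} x_i \left[1 + e^{-(\alpha + \beta x_i)}\right]^{-1}$$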
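$$\hat{p} = \frac{e^{\hat{\alpha} + \hat{\beta} x}}{1 + e^{\hat{\alpha} + \hat{\beta} x}} = \frac{1}{1 + e^{-(\hat{\alpha} + \hat{\beta} x)}}$$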
[Figure: box plot of BMD by fracture status (Fracture: 1=yes, 0=no)]
The results above show that bmd in the patients who sustained a fracture is lower than in those who did not (0.90 versus 0.94). However, the t test below shows that this difference is not statistically significant (p = 0.15).
> t.test(bmd~fx)
        Welch Two Sample t-test
data:  bmd by fx
t = 1.4572, df = 53.952, p-value = 0.1508
2.0709
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)    1.063      1.342   0.792    0.428
bmd           -2.270      1.455  -1.560    0.119

(Dispersion parameter for binomial family taken to be 1)
    Null deviance: 157.81 on 136 degrees of freedom
Residual deviance: 155.27 on 135 degrees of freedom
AIC: 159.27
Number of Fisher Scoring iterations: 4
> sd(bmd)
[1] 0.1406543
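$$e^{-2.27 \times 0.1406} = 0.7267$$
That is, when bmd increases by one standard deviation, the odds of fracture fall by about 27%. Put the other way round, when bmd decreases by one standard deviation, the odds increase by a factor of $e^{2.27 \times 0.1406} = 1.376$, or about 38%.
Another way to gauge the effect of bmd is to estimate the probability of fracture from the equation: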
$$\hat{p} = \frac{e^{1.063 - 2.27(\text{bmd})}}{1 + e^{1.063 - 2.27(\text{bmd})}}$$
From this, when bmd = 1.00, p = 0.23. When bmd = 0.86 (one standard deviation lower), p = 0.291. That is, if BMD decreases by one standard deviation, the probability of fracture increases by a factor of 0.291/0.23 = 1.265, or 26.5%.
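These figures can be checked directly from the estimated coefficients. The sketch below uses the reported estimates (intercept 1.063, slope -2.270) and the standard deviation of bmd; the function name p_fx and the other object names are made up for the illustration, and plogis() is R's inverse logit.

```r
alpha  <- 1.063        # estimated intercept
beta   <- -2.270       # estimated coefficient of bmd
sd_bmd <- 0.1406543    # standard deviation of bmd

# Predicted probability of fracture for a given bmd
p_fx <- function(bmd) plogis(alpha + beta * bmd)

p_at_1  <- p_fx(1.00)             # bmd = 1.00
p_lower <- p_fx(1.00 - sd_bmd)    # bmd one standard deviation lower

# Odds ratio for an increase of one standard deviation in bmd
or_per_sd <- exp(beta * sd_bmd)

round(c(p_at_1 = p_at_1, p_lower = p_lower,
        ratio = p_lower / p_at_1, or_per_sd = or_per_sd), 4)
```

Note that the ratio of probabilities (about 1.27) is smaller than the ratio of odds (about 1.38); the two measures coincide only when the probabilities are small.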
(d) The last part of the output gives the deviance for two models: the model with no independent variable (null deviance), and the model with the independent variable, which is bmd in this example (residual deviance).
    Null deviance: 157.81 on 136 degrees of freedom
Residual deviance: 155.27 on 135 degrees of freedom
AIC: 159.27
From these two numbers, we see that bmd contributes very little to predicting fracture: it reduces the deviance only from 157.81 to 155.27, and this reduction is not statistically significant.
In addition, R reports the value of the AIC (Akaike Information Criterion), which is computed from the deviance and the degrees of freedom. We return to the meaning of the AIC in the coming section, when comparing models.
model $\log\left(\frac{p}{1-p}\right) = \alpha + \beta x$ for each patient.
> predict(logistic)
       1        2        3        4        5        6
 2.37757  1.08569 -2.14111  1.49282  0.96537 -0.94125
       7        8        9       10       11       12
-1.73368 -1.67564 -0.66528 -0.50704 -0.94185 -0.64874
...
These numbers are log(p / (1 - p)), that is, log odds, and do not have much practical meaning. We want to know the predicted probability p computed from the equation
$$\hat{p} = \frac{e^{\hat{\alpha} + \hat{\beta} x}}{1 + e^{\hat{\alpha} + \hat{\beta} x}}$$
[Figure: predicted probability of fracture plotted against bmd]
[Figure: predicted probability (predicted) plotted against fnbmd]
7
8
0.19759 -0.46602 -0.21262
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.37766 0.38018 -6.254 4e-10 ***
       5        6        7        8
 0.30794 -0.62742 -0.14449  0.45770
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -2.3921     0.3757  -6.366 1.94e-10 ***
obesityyes    0.6954     0.2851   2.440   0.0147 *
snoringyes    0.8655     0.3967   2.182   0.0291 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
    Null deviance: 14.1259 on 7 degrees of freedom
Residual deviance:  1.6781 on 5 degrees of freedom
AIC: 32.597
Number of Fisher Scoring iterations: 4
The analysis of deviance below also confirms that obesity and snoring are two variables that affect hypertension:
Simple
Adequate
Practically meaningful
We see that the residual deviance is 150.74 and the AIC is 154.74. In fact, the AIC is computed from the formula:
AIC = Residual Deviance + 2 × (number of parameters)
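The formula can be checked against the deviances reported in this chapter. The helper below is a hypothetical illustration (aic_from_deviance is not an R function). It assumes a 0/1 outcome recorded one row per subject, for which the residual deviance equals minus twice the log-likelihood; for grouped binomial data the two quantities differ and the shortcut does not apply.

```r
# AIC from the residual deviance and the number of estimated parameters
aic_from_deviance <- function(deviance, n_par) deviance + 2 * n_par

aic_bmd      <- aic_from_deviance(155.27, 2)   # intercept + bmd
aic_bmd_ictp <- aic_from_deviance(134.34, 3)   # intercept + bmd + ictp

c(bmd = aic_bmd, bmd_ictp = aic_bmd_ictp)
```

Both values agree with the AIC lines printed by R for the corresponding models (159.27 and 140.34).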
       Df Deviance    AIC
        1   132.45 144.45
        1   132.47 144.47
        1   132.63 144.63
        1   133.41 145.41
        1   133.87 145.87
<none>      132.09 146.09
        1   148.90 160.90

       Df Deviance    AIC
        1   132.81 142.81
        1   133.14 143.14
        1   133.66 143.66
        1   134.00 144.00
<none>      132.45 144.45
        1   149.05 159.05

       Df Deviance    AIC
- age   1   133.32 141.32
- bmi   1   133.67 141.67
- bmd   1   134.33 142.33
<none>      132.81 142.81
- ictp      149.88 157.88

       Df Deviance    AIC
        1   134.34 140.34
<none>      133.32 141.32
        1   135.65 141.65
        1   155.18 161.18

          AIC
       140.34
       143.15
       159.27
Model       AIC
fx ~ ...    146.09
fx ~ ...    144.45
fx ~ ...    142.81
fx ~ ...    141.33
fx ~ ...    140.34

Call:
glm(formula = fx ~ bmd + ictp, family = "binomial", data = ...)

Deviance Residuals:
    Min      1Q  Median      3Q     Max
-1.9126 -0.7317 -0.5559  0.4212  2.1242

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -1.0651     1.5029  -0.709   0.4785
bmd          -3.4998     1.6638  -2.103   0.0354 *
ictp          0.6876     0.1704   4.036 5.43e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
    Null deviance: 157.81 on 136 degrees of freedom
Residual deviance: 134.34 on 134 degrees of freedom
AIC: 140.34
Number of Fisher Scoring iterations: 4
glm.family="binomial")
# Summary of the results of the analysis:
> summary(bma.search)
Call: bic.glm.data.frame(x = xvars, y = y, glm.family="binomial", strict=F, OR=20)

           EV       SD      model 1   model 2   model 3   model 4   model 5
Intercept  -2.85    2.865   -3.920    -1.065    -1.201    -8.257    -0.072
age         0.008   0.026    .         .         .         0.063     .
bmi        -0.023   0.054    .         .        -0.116     .        -0.070
bmd        -1.341   1.976    .        -3.499     .         .        -2.696
ictp        0.645   0.169    0.606     0.687     0.680     0.554     0.714
pinp       -0.0003  0.004    .         .         .         .         .

nVar                         1         2         2         2         3
BIC                       -525.04   -524.94   -523.63   -522.67   -521.03
post prob                    0.307     0.291     0.151     0.094     0.041
[Figure: variables (age, bmi, bmd, ictp, pinp) included in each selected model; horizontal axis: Model #]
id
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
age
79
89
70
88
85
68
70
69
74
79
76
76
62
69
72
67
74
69
78
71
74
76
75
70
69
71
80
79
72
78
80
79
67
84
78
65
70
67
74
73
74
68
80
78
bmi
24.7252
25.9909
25.3934
23.2254
24.6097
25.0762
19.8839
25.0593
25.6544
19.9594
22.5981
26.4236
20.3223
19.3698
24.2215
32.1120
25.3934
23.8895
24.6755
27.1314
23.0518
23.4568
23.5457
23.3234
22.8625
22.0384
24.6914
26.8519
27.1809
23.9512
28.3874
23.5102
19.7232
27.4406
28.6661
23.7812
23.4493
25.5354
24.7409
22.2291
34.4753
32.1929
23.3355
22.7903
bmd
ictp
pinp
0.818 9.170 37.383
0.871 7.561 24.685
1.358 5.347 40.620
0.714 7.354 56.782
0.748 6.760 58.358
0.935 4.939 67.123
1.040 4.321 26.399
1.002 4.212 47.515
0.987 5.605 26.132
0.863 5.204 60.267
0.889 4.704 27.026
0.886 5.115 43.256
0.889 5.741 51.097
0.790 3.880 49.678
0.988 5.844 41.672
1.119 4.160 60.356
1.037 6.728 40.225
0.893 4.203 27.334
0.850 7.347 38.893
0.790 4.476 38.173
0.597 4.835 35.141
0.889 5.354 27.568
0.803 3.773 36.762
0.919 3.672 40.093
0.870 4.552 29.627
0.811 4.286 30.380
0.859 5.706 37.529
0.867 3.563 43.924
0.717 3.760 39.714
0.822 3.453 27.294
1.004 5.948 33.376
0.738 4.193 65.640
0.865 4.443 36.252
0.808 5.482 33.539
0.955 8.815 42.398
0.912 4.704 39.254
0.857 4.138 75.947
0.855 3.727 41.851
0.959 3.967 42.293
1.036 4.438 40.222
1.092 7.271 45.434
.
4.269 50.841
0.759 4.856 31.114
0.757 4.831 73.343
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
1
0
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
0
0
0
1
1
0
0
0
1
1
0
0
0
0
0
0
0
0
0
0
0
1
0
0
0
0
0
0
0
0
0
1
0
0
0
0
1
0
79
72
67
70
69
67
74
71
67
67
65
66
69
72
75
70
74
71
65
77
67
66
70
70
69
65
75
67
67
73
63
72
73
69
75
65
71
66
66
71
73
64
68
67
66
77
75
67
73
68
70
66
77
71
74
70
78
76
64
67
24.6097
27.5802
30.1205
25.8166
30.4218
28.7132
34.5429
24.6097
23.5294
25.6173
25.3086
24.8358
22.3094
26.5285
25.8546
20.6790
28.3675
29.0688
23.9995
22.9819
33.3598
27.1314
24.7676
24.4193
28.2570
23.6614
26.0262
26.5731
24.8591
22.5710
31.8342
24.8016
25.0574
23.9512
23.4586
28.7347
25.3350
28.0899
25.5650
28.7274
32.4074
27.9155
25.5937
28.0428
30.7174
28.3737
28.6990
29.1687
27.4145
29.0688
26.1738
30.1038
24.6559
25.3934
26.4721
29.0253
29.0253
26.2346
26.4915
27.0416
0.671
0.814
1.101
0.818
1.088
0.934
0.969
0.794
0.830
1.057
1.160
0.811
0.977
1.063
1.091
0.741
1.045
1.066
0.841
1.015
1.129
1.030
0.896
1.106
0.869
0.837
0.921
1.118
0.765
0.752
1.251
0.839
0.662
0.844
0.852
0.795
0.867
0.997
0.827
1.023
1.066
0.874
0.882
0.718
0.856
1.052
0.929
0.953
0.784
1.120
1.040
1.028
0.884
0.943
1.075
1.057
1.098
1.014
0.998
0.905
4.870
3.012
7.538
3.564
3.826
3.996
6.762
4.350
3.176
3.738
3.060
3.263
3.106
6.970
4.798
3.908
4.784
4.527
3.089
4.041
7.239
4.096
4.352
2.823
2.974
2.689
3.917
3.832
7.112
4.249
7.303
3.860
3.138
4.069
4.176
3.328
2.349
4.171
4.569
4.111
5.680
4.298
4.056
9.739
4.180
3.737
3.527
3.593
4.332
6.510
3.161
3.930
3.880
4.692
4.561
3.709
5.247
3.958
4.218
3.553
69.924
27.088
35.487
36.001
33.833
56.167
43.099
39.023
36.595
32.550
44.757
26.941
27.951
41.188
36.045
30.198
31.339
24.252
79.910
57.147
67.103
29.435
44.291
37.348
46.229
28.738
29.667
50.292
45.778
39.950
48.697
41.055
36.312
39.926
51.394
27.679
36.506
53.094
25.157
19.557
36.995
43.872
30.523
66.974
34.597
28.102
23.008
16.132
47.410
45.674
36.302
38.301
36.560
69.500
25.948
41.322
23.896
24.344
29.390
23.020
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
0
0
0
1
1
1
1
0
0
0
1
0
1
0
0
0
0
0
0
0
0
0
0
0
1
0
0
0
0
1
0
1
0
1
0
66
70
66
65
64
70
70
70
68
67
73
71
71
75
76
71
73
64
70
80
67
68
66
64
69
69
67
67
68
59
66
71
64
80
67
22.7732
30.5241
25.3069
22.3863
34.0136
26.5668
27.6361
25.4017
30.3673
28.0428
27.7778
29.0006
35.2941
29.3658
26.2649
25.6055
29.9136
34.5271
33.4554
29.0688
25.7276
25.6801
25.9701
26.4490
28.6990
25.6173
30.3871
33.6901
28.4005
25.4017
22.5710
24.4473
38.0762
23.3887
25.9455
0.627
1.052
1.086
0.818
1.066
1.198
0.926
1.193
0.938
0.863
0.799
0.969
0.931
1.071
1.161
0.786
0.839
1.042
0.976
0.765
1.277
1.097
0.793
0.989
0.822
0.944
1.245
1.142
0.860
1.172
0.956
0.918
1.086
0.875
0.983
2.333 53.621
5.425 44.352
4.945 64.788
3.786 96.360
5.792 37.473
7.257 28.406
5.746 17.228
2.437 35.432
2.658 32.293
4.246 48.702
3.934 26.709
4.054 22.769
3.631 18.629
4.222 36.555
2.548 24.217
3.832 32.023
4.215 26.507
6.436 53.080
4.541 26.619
3.998 67.388
3.877 22.159
3.782 42.286
2.991 38.673
3.196 31.456
3.565 45.044
6.512 49.557
3.603 46.769
3.666 38.839
2.890 32.140
.
104.579
3.354 36.253
4.633 53.881
5.043 32.835
4.086 23.837
4.328 71.334
13
Event history analysis
(event history or survival analysis)
In the three previous chapters, we became familiar with statistical models for continuous dependent variables (such as blood pressure) and for binary variables (such as yes/no, diseased or not diseased). In scientific research, and especially in medicine and engineering, researchers sometimes want to study what influences dependent variables that are times. The economist John Maynard Keynes once made a remark related to the topic of this chapter, to the effect that in the long run we are all dead, and the only difference is whether we die sooner or later. Following or describing a binary variable such as alive or dead is therefore important, but not precise. The more important and more precise variable is the time until the event occurs.
In scientific studies, including clinical research, researchers often follow subjects over a period of time, sometimes for several decades. Events that occur during that time, such as disease or no disease, alive or dead, and so on, have definite clinical meaning, but the time until a patient develops the disease or dies is even more important for assessing the effect of a treatment or a risk factor. These times differ between patients. For example, the time from the start of cancer treatment to death varies greatly between patients, and the variation may depend on factors such as age, sex, and disease status, as well as on factors that we sometimes cannot measure or have not yet measured, such as interactions between genes.
The main model that expresses the relationship between time to disease (or no disease) and risk factors is called survival analysis. The term survival analysis originated in insurance research, and medical researchers adopted it for their own field. But as noted above, alive/dead is not the only variable; in practice we also have variables such as disease or no disease, occurred or did not occur, and for this reason psychologists use the term event history analysis, which the author feels is more appropriate than survival analysis. In engineering, yet another term, reliability analysis, is used for the same concept. In this chapter, however, I will use the term event history analysis.
No.   Time (weeks)   Status (stopped = 1, continuing = 0)
1         18             0
2         10             1
3         13             0
4         30             1
5         19             1
6         23             0
7         38             0
8         54             0
9         36             1
10       107             1
11       104             0
12        97             1
13       107             0
14        56             0
15        59             1
16       107             0
17        75             1
18        93             1
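$$F(t) = \int_0^t f(s)\,ds$$
$$h(t) = \lim_{\Delta t \to 0} \frac{\Pr(t \le T < t + \Delta t \mid T \ge t)}{\Delta t} = \frac{f(t)}{S(t)}$$
so that $h(t)\Delta t$ is the probability that an individual stops using the device during the short interval $\Delta t$, given that the individual has survived to time t. From the relationship:
Pr(survive to t+Δt) = Pr(survive to t) . Pr(survive to t+Δt | survived to t)
we have:
$$1 - F(t + \Delta t) = (1 - F(t))(1 - h(t)\Delta t)$$
From this, we have:
$$\Delta t\, F'(t) = (1 - F(t))\, h(t)\, \Delta t$$
$$h(t) = \frac{f(t)}{1 - F(t)}$$
$$\Lambda(t) = \int_0^t h(u)\,du$$
From the definition of the hazard function $h(t) = \frac{f(t)}{1 - F(t)}$, we can write:
$$\Lambda(t) = -\log(1 - F(t))$$
Several hazard functions can be used to describe this time. The simplest is a constant, which leads to a Poisson model (belonging to the family of exponential distributions):
$$f(t) = \lambda e^{-\lambda t} \quad (t \ge 0)$$
Therefore:
$$F(t) = 1 - e^{-\lambda t}$$
So that:
$$h(t) = \lambda$$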
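The constant-hazard case can be verified numerically with R's exponential distribution functions. The rate used below (0.02 per week) and the object names are hypothetical values chosen only for the illustration.

```r
lambda <- 0.02                  # hypothetical constant hazard per week
weeks  <- c(1, 10, 50, 100)     # time points at which to evaluate

dens <- dexp(weeks, rate = lambda)   # f(t) = lambda * exp(-lambda * t)
cdf  <- pexp(weeks, rate = lambda)   # F(t) = 1 - exp(-lambda * t)

hazard     <- dens / (1 - cdf)       # h(t) = f(t) / (1 - F(t))
cum_hazard <- -log(1 - cdf)          # cumulative hazard = -log(1 - F(t))

data.frame(weeks, hazard, cum_hazard)
```

The hazard column is equal to lambda at every time point, and the cumulative hazard grows linearly as lambda times t.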
No.   Interval    Women at start    Women who      Probability of   Probability of   Cumulative
      (weeks)     of interval       stopped use    stopping h(t)    continuing p_t   probability S(t)
                  (n_t)             (d_t)
1     0-9             18                0             0.0000           1.0000           1.0000
2     10-18           18                1             0.0555           0.9445           0.9445
3     19-29           15                1             0.0667           0.9333           0.8815
4     30-35           13                1             0.0769           0.9231           0.8137
5     36-58           12                1             0.0833           0.9167           0.7459
6     59-74            8                1             0.1250           0.8750           0.6526
7     75-92            7                1             0.1428           0.8572           0.5594
8     93-96            6                1             0.1667           0.8333           0.4662
9     97-106           5                1             0.2000           0.8000           0.3729
10    107              3                1             0.3333           0.6667           0.2486
$$\hat{S}(t) = \prod_{t=1}^{k} \frac{n_t - d_t}{n_t}$$
Note the product sign $\prod$.
The estimation procedure described above is usually called the Kaplan-Meier estimate, or sometimes the product-limit estimate.
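The product-limit calculation in the table can be sketched in a few lines of base R. This is an illustration written for this example, not the author's code (which appears to rely on a fitted object named kp); the object names below are made up.

```r
# Follow-up time in weeks and status (1 = stopped, 0 = still using) for 18 women
time   <- c(18, 10, 13, 30, 19, 23, 38, 54, 36,
            107, 104, 97, 107, 56, 59, 107, 75, 93)
status <- c(0, 1, 0, 1, 1, 0, 0, 0, 1,
            1, 0, 1, 0, 0, 1, 0, 1, 1)

event_times <- sort(unique(time[status == 1]))

# Number still at risk just before each event time, and number of events
n_risk  <- sapply(event_times, function(tj) sum(time >= tj))
n_event <- sapply(event_times, function(tj) sum(time == tj & status == 1))

# Kaplan-Meier estimate: cumulative product of (n_t - d_t) / n_t
surv <- cumprod((n_risk - n_event) / n_risk)

km <- data.frame(time = event_times, n.risk = n_risk,
                 n.event = n_event, survival = round(surv, 4))
print(km)
```

The survival column reproduces the last column of the table, ending at about 0.2486 at week 107.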
75
93
97
$$\hat{S}(t) \pm 1.96 \times se[\hat{S}(t)],$$
where:
$$se[\hat{S}(t)] = \hat{S}(t) \sqrt{\sum_{t=1}^{k} \frac{d_t}{n_t (n_t - d_t)}}.$$
This standard error formula is also known as Greenwood's formula. We can display the results above in a plot with the function plot as follows:
> plot(kp,
       xlab="Time (weeks)",
       ylab="Cumulative survival probability")
[Figure: Kaplan-Meier curve with 95% confidence limits; horizontal axis: Time (weeks)]
In the plot above, the horizontal axis is time (in weeks) and the vertical axis is the cumulative probability of still using the device. The middle line is the cumulative probability S(t), and the two dotted lines are the 95% confidence interval of S(t). From this analysis, we can state that the probability of still using the device at week 107 is about 25%, with a confidence interval from 8% to 74.5%. The confidence interval is rather wide, which indicates considerable variability in the estimate, simply because the number of study subjects is relatively small.
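Greenwood's formula can be applied by hand to the same data. The sketch below computes the standard error at the last event time and two versions of the 95% interval: the plain interval from the formula above, and an interval built on the log scale. That the quoted limits (8% to 74.5%) come from the log-scale version is an inference from the numbers, not something stated in the text; all object names are hypothetical.

```r
time   <- c(18, 10, 13, 30, 19, 23, 38, 54, 36,
            107, 104, 97, 107, 56, 59, 107, 75, 93)
status <- c(0, 1, 0, 1, 1, 0, 0, 0, 1,
            1, 0, 1, 0, 0, 1, 0, 1, 1)

event_times <- sort(unique(time[status == 1]))
n_risk  <- sapply(event_times, function(tj) sum(time >= tj))
n_event <- sapply(event_times, function(tj) sum(time == tj & status == 1))
surv    <- cumprod((n_risk - n_event) / n_risk)

# Greenwood: se = S(t) * sqrt(sum of d / (n * (n - d)))
greenwood <- sqrt(cumsum(n_event / (n_risk * (n_risk - n_event))))
se_surv   <- surv * greenwood

z <- qnorm(0.975)
last <- length(surv)

# Plain interval, S(t) +/- 1.96 * se
plain_ci <- surv[last] + c(-1, 1) * z * se_surv[last]

# Interval on the log scale, which always stays positive
log_ci <- surv[last] * exp(c(-1, 1) * z * greenwood[last])

round(c(S = surv[last], se = se_surv[last],
        lower = log_ci[1], upper = log_ci[2]), 4)
```

With so few women still at risk at week 107, the plain interval extends below zero, which is one reason a transformed interval is often preferred.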
46
48
13
9
52
28
0
1
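$$e_{1j} = \frac{n_{1j} d_j}{n_j}, \qquad e_{2j} = \frac{n_{2j} d_j}{n_j}$$
$$v_j = \frac{n_{1j}\, n_{2j}\, d_j\, (n_j - d_j)}{n_j^2 (n_j - 1)}$$
$$O_1 = \sum_{j} d_{1j}, \qquad O_2 = \sum_{j} d_{2j}$$
$$E_1 = \sum_{j} e_{1j}, \qquad V = \sum_{j} v_j$$
$$\chi^2 = \frac{(O_1 - E_1)^2}{V}$$
If $\chi^2 > \chi^2_{1,\alpha}$ (where $\chi^2_{1,\alpha}$ is the value of the chi-square distribution with 1 degree of freedom at $\alpha = 0.95$), we have evidence to conclude that the difference in S(t) between the two groups is statistically significant.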
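The formulas above translate almost line by line into R. The function below is a sketch written for this illustration (logrank is not an existing R function), and the four-observation data set is hypothetical, chosen so that the result can be checked by hand.

```r
# Log-rank test for two groups coded 1 and 2
logrank <- function(time, status, group) {
  event_times <- sort(unique(time[status == 1]))
  o1 <- 0; e1 <- 0; v <- 0
  for (tj in event_times) {
    at_risk <- time >= tj
    n_j  <- sum(at_risk)                       # at risk in both groups
    n1_j <- sum(at_risk & group == 1)          # at risk in group 1
    n2_j <- n_j - n1_j                         # at risk in group 2
    d_j  <- sum(time == tj & status == 1)      # events at tj
    d1_j <- sum(time == tj & status == 1 & group == 1)
    o1 <- o1 + d1_j                            # observed events, group 1
    e1 <- e1 + n1_j * d_j / n_j                # expected events, group 1
    if (n_j > 1) {                             # variance term is 0 when n_j = 1
      v <- v + n1_j * n2_j * d_j * (n_j - d_j) / (n_j^2 * (n_j - 1))
    }
  }
  chisq <- (o1 - e1)^2 / v
  c(O1 = o1, E1 = e1, V = v, chisq = chisq,
    p = pchisq(chisq, df = 1, lower.tail = FALSE))
}

# Hypothetical toy data: two subjects per group, all with an event
toy_time   <- c(2, 4, 3, 5)
toy_status <- c(1, 1, 1, 1)
toy_group  <- c(1, 1, 2, 2)
res <- logrank(toy_time, toy_status, toy_group)
print(round(res, 4))
```

For the toy data, O1 = 2, E1 = 4/3 and V = 13/18, so the statistic is 8/13, far below the 5% critical value of 3.84 for 1 degree of freedom.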
1,
1,
2,
2,
1,
1,
2,
2,
1,
1,
2,
2,
10,
13,
10,
10,
1,
1,
2,
2,
1,
1,
2,
2,
7, 10,
8, 10,
12, 7,
17, 8,
1,
1,
2,
2,
1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1,
2, 2, 2, 2, 2, 2, 2, 2,
2)
> time <- c(8, 12, 52, 28, 44, 14, 3, 52, 35, 6, 12, 7, 52,
52, 36, 52, 9, 11, 52,15, 13, 21,24, 52,28,
15,44, 2, 8,12,52,21,19, 6,10,15, 4, 9,27, 1,
12,20,32,15, 5,35,28, 6)
> infected <- c(1,
0,
1,
1,
0,
1,
0,
1,
0,
0,
0,
0,
1,
0,
1,
1,
1,
1,
1,
1,
1,
1,
0,
1,
1,
1,
1,
1,
1, 1, 1, 1, 0, 0, 0, 1,
0, 0, 1,
1, 1, 1, 0, 1, 0, 1, 1,
1)
 n.risk n.event survival std.err lower 95% CI upper 95% CI
     25       1    0.960  0.0392        0.886        1.000
     24       1    0.920  0.0543        0.820        1.000
     22       1    0.878  0.0660        0.758        1.000
     21       1    0.836  0.0749        0.702        0.997
     19       1    0.792  0.0829        0.645        0.973
     17       1    0.746  0.0902        0.588        0.945
     16       1    0.699  0.0958        0.534        0.915
     15       1    0.653  0.1001        0.483        0.882
     14       1    0.606  0.1033        0.434        0.846
     12       2    0.505  0.1080        0.332        0.768
     10       1    0.454  0.1083        0.285        0.725
      9       1    0.404  0.1074        0.240        0.680
      8       1    0.353  0.1052        0.197        0.633
      7       1    0.303  0.1016        0.157        0.584
group=2
 time n.risk n.event survival std.err lower 95% CI upper 95% CI
    1     23       1    0.957  0.0425       0.8767        1.000
    4     21       1    0.911  0.0601       0.8004        1.000
    5     20       1    0.865  0.0723       0.7346        1.000
    6     19       2    0.774  0.0889       0.6183        0.970
    8     17       1    0.729  0.0946       0.5650        0.940
   10     15       1    0.680  0.1000       0.5099        0.907
   12     14       2    0.583  0.1067       0.4072        0.835
   15     12       2    0.486  0.1088       0.3132        0.754
   19      9       1    0.432  0.1093       0.2630        0.709
   20      8       1    0.378  0.1082       0.2156        0.662
   21      7       1    0.324  0.1053       0.1712        0.613
   27      6       1    0.270  0.1007       0.1300        0.561
   28      5       1    0.216  0.0939       0.0921        0.506
   35      3       1    0.144  0.0859       0.0447        0.463
The information above gives us the survival probability at each time point, but because it is only a set of numbers, the difference between the two groups is hard to see. Another way to present these probabilities is with a Kaplan-Meier plot for each group. The plot can be drawn with the following command:
> plot(kp.by.group,
       xlab="Time",
       ylab="Cum. survival probability",
       col=c("black", "red"))

[Figure: Kaplan-Meier curves for the two groups; x-axis: Time (0 to 50), y-axis: Cum. survival probability (0.0 to 1.0)]
The log-rank test allows us to compare S(t) between two or more groups. In practice, however, S(t) or the hazard function h(t) may not only differ between groups but may also be influenced by other factors. The question is how to estimate the effect of risk factors on h(t). In the study above, for example, the number of previous infections (the variable episode) is considered to affect the risk of recurrence. The question then becomes: once we take into account and adjust for the effect of episode, does the difference in S(t) between the two groups really exist?
In the early 1970s, David R. Cox, professor of statistics at Imperial College (London, England), developed a method of analysis based on a regression model to answer this question (D.R. Cox, Regression models and life tables (with discussion), Journal of the Royal Statistical Society series B, 1972; 34:187-220). The method later became known as the Cox model. The Cox model is regarded as one of the most important developments in science in general (not only in statistical science) in the 20th century. The paper just mentioned has been cited tens of thousands of times over the past 30 years.
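Because a detailed description of the Cox model is beyond the scope of this chapter, we only go over the main points so that the reader can grasp the idea. Let $x_1, x_2, x_3, \ldots, x_p$ be $p$ risk factors. The $x$ may be continuous or non-continuous variables. The Cox model states that:

$$h(t) = \lambda(t)\, e^{\beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \ldots + \beta_p x_p}$$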
> summary(analysis)
Call:
coxph(formula = Surv(time, infected == 1) ~ group)

  n= 48
       coef exp(coef) se(coef)    z    p
group 0.684      1.98    0.363 1.88 0.06
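The numbers in this output are linked by simple arithmetic, which can be checked in base R. This is a small sketch using only the coefficient and standard error printed above; the 95% confidence interval is an addition not shown in the output.

```r
# Values read from the coxph summary above
coef <- 0.684
se   <- 0.363

hr <- exp(coef)                                   # hazard ratio, group 2 vs group 1
ci <- exp(coef + c(-1, 1) * qnorm(0.975) * se)    # 95% confidence interval of the ratio
z  <- coef / se                                   # Wald statistic
p  <- 2 * (1 - pnorm(abs(z)))                     # two-sided p-value

round(c(HR = hr, lower = ci[1], upper = ci[2], z = z, p = p), 3)
```

The interval includes 1, which agrees with a p-value slightly above 0.05.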
$$\frac{h(t \mid \text{group} = 2)}{h(t \mid \text{group} = 1)}$$
[Figure: curve plotted against Time (x-axis, 0 to 50); y-axis from 0.0 to 1.0]
# Create 5 independent variables
x1 <- (1:50)/2 - 3
x2 <- rnorm(50)
x3 <- rnorm(50)
x4 <- rnorm(50)
x5 <- rnorm(50)
   se(coef)       z       p
x1    0.568  5.6908 1.3e-08
x2    0.331 -0.0963 9.2e-01
x3    0.327  0.9518 3.4e-01
x4    0.297  0.4600 6.5e-01
x5    0.313  1.5643 1.2e-01
x5 0.429 1.54 0.297 1.45 1.5e-01
           p!=0     EV     SD    model 1   model 2   model 3   model 4   model 5
x1        100.0  3.036  0.509     2.9805    3.1262    3.0390    2.9829    2.9810
x2          9.6  0.001  0.096       .         .         .         .       0.0214
x3         14.6  0.041  0.155       .         .       0.2705      .         .
x4         10.0  0.006  0.092       .         .         .       0.0250      .
x5         31.0  0.135  0.261       .       0.42920     .         .         .

nVar                                1         2         2         2         2
BIC                             -233.774  -232.126  -230.713  -229.933  -229.930
post prob                          0.458     0.201     0.099     0.067     0.067
[Figure: variables selected in each model; vertical axis: x1, x2, x3, x4, x5; horizontal axis: Model #]
14
Meta-analysis
A scientific question requires many studies. A single study cannot settle a scientific question or provide a definitive answer to it. Repeating a study under different conditions is a very important part of scientific activity. In scientific research in general, and in medicine in particular, we often need to consider many results from many different sources in order to resolve a specific question.
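$$x_1 = M + e_1$$
$$x_2 = M + e_2$$
$$\vdots$$
$$x_{100} = M + e_{100}$$

Or, in general:

$$x_i = M + e_i$$

Of course $e_i$ may be <0 or >0. If $M$ and $e_i$ are independent of each other (that is, there is no correlation

$$x_i = m_i + e_i$$

Where:

$$m_i = M + \varepsilon_i$$

Therefore:

$$x_i = M + \varepsilon_i + e_i$$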
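$$d_i = LOS_{1i} - LOS_{2i}$$

The variance of $d_i$ (which I will denote by $s_i^2$) is estimated with a standard formula based on the standard deviation and the number of subjects in each study. For each study $i$ ($i = 1, 2, 3, \ldots, 9$), we have:

$$s_i^2 = \frac{(N_{1i} - 1)\, SD_{1i}^2 + (N_{2i} - 1)\, SD_{2i}^2}{N_{1i} + N_{2i} - 2} \left( \frac{1}{N_{1i}} + \frac{1}{N_{2i}} \right)$$

For study 1, with standard deviations 47 and 64:

$$s_1^2 = \frac{(155 - 1)\, 47^2 + (156 - 1)\, 64^2}{155 + 156 - 2} \left( \frac{1}{155} + \frac{1}{156} \right) = 40.59$$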
Table 1a. Difference in time between the two groups and 95% confidence interval

Study (i)    d_i    s_i^2     s_i   d_i-1.96*s_i   d_i+1.96*s_i
1             20     40.6    6.37           7.51          32.49
2              2      2.0    1.43          -0.80           4.80
3             55     15.3    3.91          47.34          62.66
4             71    150.2   12.26          46.98          95.02
5              4     20.2    4.49          -4.81          12.81
6             -1      1.2    1.11          -3.17           1.17
7            -11     95.4    9.77         -30.14           8.14
8             10      8.0    2.83           4.45          15.55
9             -7     20.7    4.55         -15.92           1.92
$$W_i = 1 / s_i^2$$

$$W_1 = \frac{1}{40.59} = 0.0246$$
Study (i)    d_i    s_i^2      W_i
1             20     40.6   0.0246
2              2      2.0   0.4886
3             55     15.3   0.0654
4             71    150.2   0.0067
5              4     20.2   0.0495
6             -1      1.2   0.8173
7            -11     95.4   0.0105
8             10      8.0   0.1245
9             -7     20.7   0.0483
Total                       1.6354
$$\bar{d} = \frac{\sum_{i=1}^{9} W_i d_i}{\sum_{i=1}^{9} W_i}$$

$$s_{\bar{d}}^2 = \frac{1}{\sum_{i=1}^{9} W_i}$$
Study (i)    d_i    s_i^2      W_i   W_i d_i
1             20     40.6   0.0246    0.4928
2              2      2.0   0.4886    0.9771
3             55     15.3   0.0654    3.5993
4             71    150.2   0.0067    0.4726
5              4     20.2   0.0495    0.1981
6             -1      1.2   0.8173   -0.8173
7            -11     95.4   0.0105   -0.1153
8             10      8.0   0.1245    1.2450
9             -7     20.7   0.0483   -0.3383
Total                       1.6354    5.7140
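$$\bar{d} = \frac{\sum_{i=1}^{9} W_i d_i}{\sum_{i=1}^{9} W_i} = \frac{5.7140}{1.6354} = 3.49$$

$$s_{\bar{d}}^2 = \frac{1}{1.6354} = 0.61$$

$$s_{\bar{d}} = \sqrt{0.61} = 0.782$$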
The 95% confidence interval (95% CI) can be estimated as follows:
3.49 ± 1.96*0.782 = 1.96 to 5.02.
At this point we can say that, on average, the length of stay in general hospitals is 3.49 days longer than in specialized hospitals, with a 95% confidence interval from 1.96 days to 5.02 days.
Step 5: Estimate the index of homogeneity and the index of heterogeneity between studies [3]. In practice, this index measures the difference between each study and the weighted mean. The index of homogeneity is computed with the following formula:

$$Q = \sum_{i=1}^{k} W_i\, (d_i - \bar{d})^2$$

and the index of heterogeneity is:

$$I^2 = \frac{Q - (k - 1)}{Q}$$
Study (i)    d_i    s_i^2      W_i   W_i d_i   W_i (d_i - d)^2
1             20     40.6   0.0246    0.4928            6.7129
2              2      2.0   0.4886    0.9771            1.0903
3             55     15.3   0.0654    3.5993          173.6080
4             71    150.2   0.0067    0.4726           30.3356
5              4     20.2   0.0495    0.1981            0.0127
6             -1      1.2   0.8173   -0.8173           16.5054
7            -11     95.4   0.0105   -0.1153            2.2026
8             10      8.0   0.1245    1.2450            5.2701
9             -7     20.7   0.0483   -0.3383            5.3215
Total                       1.6354    5.7140            241.05

$$Q = \sum_{i=1}^{9} W_i\, (d_i - \bar{d})^2 = 241.05$$

From this, $I^2$ can be estimated as follows:

$$I^2 = \frac{241.05 - 8}{241.05} = 0.967$$
$$\frac{1}{s_{d_i}}$$

Study (i)    d_i    s_i^2    1/s_i
1             20     40.6   0.1570
2              2      2.0   0.6990
3             55     15.3   0.2558
4             71    150.2   0.0816
5              4     20.2   0.2225
6             -1      1.2   0.9041
7            -11     95.4   0.1024
8             10      8.0   0.3528
9             -7     20.7   0.2198
$\frac{d_i}{s_{d_i}}$, where a and b are two parameters that must be estimated from the regression model
Study (i)    d_i    N_i
1             20    311
2              2     63
3             55    146
4             71     36
5              4     21
6             -1    109
7            -11     67
8             10    293
9             -7    112
> n1 <- c(155,31,75,18,8,57,34,110,60)
> los1 <- c(55,27,64,66,14,19,52,21,30)
> sd1 <- c(47,7,17,20,8,7,45,16,27)
> n2 <- c(156,32,71,18,13,52,33,183,52)
> los2 <- c(75,29,119,137,18,18,41,31,23)
> sd2 <- c(64,4,29,48,11,4,34,27,20)
> los <- data.frame(n1,los1,sd1,n2,los2,sd2)
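With these vectors, the hand calculations of the fixed-effects analysis (weights, weighted mean, Q and I²) can be reproduced in a few lines of base R. This is only a sketch: the sign convention `los2 - los1` is an assumption chosen because it matches the d_i column of the tables, and results may differ in the last digits from the tabulated values, which were rounded at each step.

```r
n1   <- c(155, 31, 75, 18, 8, 57, 34, 110, 60)
los1 <- c(55, 27, 64, 66, 14, 19, 52, 21, 30)
sd1  <- c(47, 7, 17, 20, 8, 7, 45, 16, 27)
n2   <- c(156, 32, 71, 18, 13, 52, 33, 183, 52)
los2 <- c(75, 29, 119, 137, 18, 18, 41, 31, 23)
sd2  <- c(64, 4, 29, 48, 11, 4, 34, 27, 20)

d   <- los2 - los1                       # difference for each study (assumed sign)
sp2 <- ((n1 - 1) * sd1^2 + (n2 - 1) * sd2^2) / (n1 + n2 - 2)  # pooled variance
s2  <- sp2 * (1 / n1 + 1 / n2)           # variance of d_i
w   <- 1 / s2                            # inverse-variance weights

d_bar  <- sum(w * d) / sum(w)            # weighted mean difference
se_bar <- sqrt(1 / sum(w))               # its standard error
ci     <- d_bar + c(-1, 1) * 1.96 * se_bar

k  <- length(d)
Q  <- sum(w * (d - d_bar)^2)             # index of homogeneity
I2 <- max(0, (Q - (k - 1)) / Q)          # index of heterogeneity, floored at 0

round(c(d_bar = d_bar, se = se_bar, lower = ci[1], upper = ci[2], Q = Q, I2 = I2), 3)
```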
                            95%-CI              z    p.value
Fixed effects model     [ -4.96;  -1.96]    -4.53    <0.0001
Random effects model    [-24.03;  -3.93]    -2.73     0.0064

Quantifying heterogeneity:
tau^2 = 205.4094; H = 5.46 [4.54; 6.58]; I^2 = 96.7% [95.2%; 97.7%]

Test of heterogeneity:
      Q d.f.  p.value
 238.92    8 < 0.0001

Method: Inverse variance method
meta gives us two results: one based on the fixed-effects model and one based on the random-effects model. As the output above shows, the two models differ considerably in magnitude, but the overall conclusion is the same: the results of both models are statistically significant.
We can also use the plot function to display these results as a forest plot, as follows:
> plot(res, lwd=3)
[Figure: forest plot; x-axis: Weighted mean difference (-100 to 20)]
             Beta-blocker            Placebo
Study      N1    Deaths (d1)      N2    Deaths (d2)
1          25         5           25         6
2           9         1           16         2
3         194        23          189        21
4          25         1           25         2
5         105         4           34         2
6         320        53          321        67
7          33         3           16         2
8         261        12           84        13
9         133         6          145        11
10        232         2          134         5
11       1327       156         1320       228
12       1990       145         2001       217
13        214         8          212        17
Total    4879       420         4516       612

N: number of patients studied; Deaths: number of patients who died during follow-up.
$$RR = \frac{p_1}{p_2}$$

For study 1, $p_1 = \frac{5}{25} = 0.20$ and $p_2 = \frac{6}{25} = 0.24$.
The risk ratio for study 1 is therefore:

$$RR = \frac{0.20}{0.24} = 0.833$$

Doing the same calculation for the remaining studies gives the following table:
Study   Mortality in BB group (p1)   Mortality in placebo group (p2)   Risk ratio (RR)
1                 0.200                         0.240                      0.833
2                 0.111                         0.125                      0.889
3                 0.119                         0.111                      1.067
4                 0.040                         0.080                      0.500
5                 0.038                         0.059                      0.648
6                 0.166                         0.209                      0.794
7                 0.091                         0.125                      0.727
8                 0.046                         0.155                      0.297
9                 0.045                         0.076                      0.595
10                0.009                         0.037                      0.231
11                0.118                         0.173                      0.681
12                0.073                         0.108                      0.672
13                0.037                         0.080                      0.466
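$$\mathrm{var}[\log RR] = \frac{1}{d_1} - \frac{1}{N_1 - d_1} + \frac{1}{d_2} - \frac{1}{N_2 - d_2}$$

$$SE[\log RR] = \sqrt{\frac{1}{d_1} - \frac{1}{N_1 - d_1} + \frac{1}{d_2} - \frac{1}{N_2 - d_2}}$$

With variance:

$$\mathrm{var}[\log RR] = \frac{1}{5} - \frac{1}{25 - 5} + \frac{1}{6} - \frac{1}{25 - 6} = 0.264$$

And standard error:

$$SE[\log RR] = \sqrt{0.264} = 0.514$$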
Study   Risk ratio (RR)   Log[RR]   Var[logRR]   SE[logRR]   Lower 95% CI of RR   Upper 95% CI of RR
1            0.833         -0.182      0.264       0.514            0.30                 2.28
2            0.889         -0.118      1.304       1.142            0.09                 8.33
3            1.067          0.065      0.079       0.282            0.61                 1.85
4            0.500         -0.693      1.415       1.189            0.05                 5.15
5            0.648         -0.434      0.709       0.842            0.12                 3.37
6            0.794         -0.231      0.026       0.162            0.58                 1.09
7            0.727         -0.318      0.729       0.854            0.14                 3.87
8            0.297         -1.214      0.142       0.377            0.14                 0.62
9            0.595         -0.520      0.242       0.492            0.23                 1.56
10           0.231         -1.465      0.688       0.829            0.05                 1.17
11           0.681         -0.385      0.009       0.095            0.56                 0.82
12           0.672         -0.398      0.010       0.102            0.55                 0.82
13           0.466         -0.763      0.174       0.417            0.21                 1.06
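$$W_i = \frac{1}{\mathrm{var}[\log RR_i]}$$

$$\log wRR = \frac{\sum W_i \log[RR_i]}{\sum W_i}$$

With variance:

$$\mathrm{Var}[\log wRR] = \frac{1}{\sum W_i}$$

and standard error:

$$SE[\log wRR] = \sqrt{\frac{1}{\sum W_i}}$$

For study 1:

$$W_1 = \frac{1}{0.264} = 3.79$$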
Study   Log[RR]   Var[logRR]      W_i   W_i log[RR_i]
1        -0.182      0.264       3.79       -0.69
2        -0.118      1.304       0.77       -0.09
3         0.065      0.079      12.61        0.82
4        -0.693      1.415       0.71       -0.49
5        -0.434      0.709       1.41       -0.61
6        -0.231      0.026      38.30       -8.86
7        -0.318      0.729       1.37       -0.44
8        -1.214      0.142       7.03       -8.54
9        -0.520      0.242       4.13       -2.15
10       -1.465      0.688       1.45       -2.13
11       -0.385      0.009     110.78      -42.63
12       -0.398      0.010      96.13      -38.23
13       -0.763      0.174       5.75       -4.39
Total                          284.24     -108.42
We have:

$$\log wRR = \frac{\sum W_i \log[RR_i]}{\sum W_i} = \frac{-108.42}{284.24} = -0.38$$

With variance:

$$\mathrm{Var}[\log wRR] = \frac{1}{284.24} = 0.0035$$

and standard error:

$$SE[\log wRR] = \sqrt{0.0035} = 0.06$$
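we need to compute $W_i (\log RR_i - \log wRR)^2$ for each study. For study 1, for example, the value is 0.1502, as shown in the first row of the table:

Study   Log[RR_i]      W_i   W_i (log RR_i - log wRR)^2
1         -0.182      3.79            0.1502
2         -0.118      0.77            0.0533
3          0.065     12.61            2.5118
4         -0.693      0.71            0.0687
5         -0.434      1.41            0.0040
6         -0.231     38.30            0.8635
7         -0.318      1.37            0.0054
8         -1.214      7.03            4.8731
9         -0.520      4.13            0.0790
10        -1.465      1.45            1.7074
11        -0.385    110.78            0.0012
12        -0.398     96.13            0.0253
13        -0.763      5.75            0.8382
Total               284.24           11.1811

Example 2 has k = 13 studies. Therefore:

$$Q = \sum_{i=1}^{k} W_i (\log RR_i - \log wRR)^2 = 11.18$$

And:

$$I^2 = \frac{Q - (k - 1)}{Q} = \frac{11.18 - 12}{11.18} = -0.07$$

Because this value is negative, $I^2$ is taken to be 0.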
$$\log[RR_i] = a + b N_i$$

Study   Log[RR_i]    N_i
1         -0.182      50
2         -0.118      25
3          0.065     383
4         -0.693      50
5         -0.434     139
6         -0.231     641
7         -0.318      49
8         -1.214     345
9         -0.520     278
10        -1.465     366
11        -0.385    2647
12        -0.398    3991
13        -0.763     426
# Data from example 2
n1 <- c(25,9,194,25,105,320,33,261,133,232,1327,1990,214)
d1 <- c(5,1,23,1,4,53,3,12,6,2,156,145,8)
n2 <- c(25,16,189,25,34,321,16,84,145,134,1320,2001,212)
d2 <- c(6,2,21,2,2,67,2,13,11,5,228,217,17)
# Create a data frame named bb
bb <- data.frame(n1,d1,n2,d2)
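The pooled risk ratio can also be computed by hand from these four vectors. The sketch below is an illustration under one stated assumption: it uses the standard large-sample variance of log RR, 1/d1 - 1/n1 + 1/d2 - 1/n2, so the per-study variances differ somewhat from the hand-calculated values in the tables, while the pooled result stays close to them.

```r
n1 <- c(25, 9, 194, 25, 105, 320, 33, 261, 133, 232, 1327, 1990, 214)
d1 <- c(5, 1, 23, 1, 4, 53, 3, 12, 6, 2, 156, 145, 8)
n2 <- c(25, 16, 189, 25, 34, 321, 16, 84, 145, 134, 1320, 2001, 212)
d2 <- c(6, 2, 21, 2, 2, 67, 2, 13, 11, 5, 228, 217, 17)

log_rr <- log((d1 / n1) / (d2 / n2))          # log risk ratio of each study
var_lr <- 1 / d1 - 1 / n1 + 1 / d2 - 1 / n2   # assumed: standard variance of log RR
w      <- 1 / var_lr                          # inverse-variance weights

log_wrr <- sum(w * log_rr) / sum(w)           # pooled log risk ratio
se      <- sqrt(1 / sum(w))
rr      <- exp(log_wrr)                       # pooled risk ratio
ci      <- exp(log_wrr + c(-1, 1) * 1.96 * se)

k  <- length(log_rr)
Q  <- sum(w * (log_rr - log_wrr)^2)           # index of homogeneity
I2 <- max(0, (Q - (k - 1)) / Q)               # index of heterogeneity, floored at 0

round(c(RR = rr, lower = ci[1], upper = ci[2], Q = Q, I2 = I2), 3)
```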
                            95%-CI                 z    p.value
Fixed effects model     [0.6064; 0.7672]    -6.3741   < 0.0001
Random effects model    [0.6064; 0.7672]    -6.3741   < 0.0001

Quantifying heterogeneity:
tau^2 = 0; H = 1 [1; 1.45]; I^2 = 0% [0%; 52.6%]

Test of heterogeneity:
  Q d.f.  p.value
 11   12   0.5292

Method: Inverse variance method
[Figure: forest plot of studies 1 to 13; x-axis: Relative Risk (logarithmic scale, 0.05 to 10.00)]
***
In fact, science in general has a long tradition of reviewing research evidence and reviewing current knowledge. Such reviews, however, are usually qualitative (qualitative review). Because they are qualitative, it is hard to know precisely the quantitative differences between studies. Meta-analysis gives us a quantitative tool for organizing the evidence systematically. With meta-analysis, we have the opportunity to:
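Summary of formulas

For continuous variables

Difference: $d_i = \bar{x}_{2i} - \bar{x}_{1i}$; $i = 1, 2, 3, \ldots, k$
Variance of $d_i$: $s_{d_i}^2 = s_i^2 \left( \frac{1}{n_{1i}} + \frac{1}{n_{2i}} \right)$
Standard error of $d_i$: $s_{d_i} = \sqrt{s_{d_i}^2}$
Weight: $W_i = \frac{1}{s_{d_i}^2}$
Overall effect estimate: $\bar{d} = \sum_{i=1}^{k} W_i d_i \Big/ \sum_{i=1}^{k} W_i$
Variance of $\bar{d}$: $s_{\bar{d}}^2 = 1 \Big/ \sum_{i=1}^{k} W_i$
95% confidence interval: $\bar{d} \pm 1.96\, s_{\bar{d}}$
Index of homogeneity: $Q = \sum_{i=1}^{k} W_i (d_i - \bar{d})^2$
Index of heterogeneity: $I^2 = \frac{Q - (k - 1)}{Q}$
Between-study variance: $\tau^2 = \max\left\{ 0,\ \frac{Q - (k - 1)}{\sum_{i=1}^{k} W_i - \sum_{i=1}^{k} W_i^2 \big/ \sum_{i=1}^{k} W_i} \right\}$

For binary variables

Group 1 (sample size, number of events): $n_{1i}$, $x_{1i}$; $i = 1, 2, 3, \ldots, k$
Group 2 (sample size, number of events): $n_{2i}$, $x_{2i}$; $i = 1, 2, 3, \ldots, k$
Risk ratio: $RR_i = \frac{x_{2i} / n_{2i}}{x_{1i} / n_{1i}}$
Logarithmic transformation: $\theta_i = \log(RR_i)$
Variance of $\theta_i$: $s_{\theta_i}^2 = \frac{1}{x_{1i}} - \frac{1}{n_{1i} - x_{1i}} + \frac{1}{x_{2i}} - \frac{1}{n_{2i} - x_{2i}}$
Standard error of $\theta_i$: $s_{\theta_i} = \sqrt{s_{\theta_i}^2}$
Weight: $W_i = \frac{1}{s_{\theta_i}^2}$
Overall effect estimate: $\theta = \sum_{i=1}^{k} W_i \theta_i \Big/ \sum_{i=1}^{k} W_i$
Variance of $\theta$: $s_{\theta}^2 = 1 \Big/ \sum_{i=1}^{k} W_i$
95% confidence interval: $\theta \pm 1.96\, s_{\theta}$
Index of homogeneity: $Q = \sum_{i=1}^{k} W_i (\theta_i - \theta)^2$
Index of heterogeneity: $I^2 = \frac{Q - (k - 1)}{Q}$
Between-study variance: $\tau^2 = \max\left\{ 0,\ \frac{Q - (k - 1)}{\sum_{i=1}^{k} W_i - \sum_{i=1}^{k} W_i^2 \big/ \sum_{i=1}^{k} W_i} \right\}$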
15
Design of experiments
The term "experiment" here covers not only activities in the laboratory but also larger investigations such as randomized clinical trials, studies representing a single point in time (also called cross-sectional studies), opinion polls, surveys and population censuses, and so on. Even an economic policy can be seen as an experiment, a social experiment.
An experiment that meets scientific standards must be designed systematically and objectively. For example, to find the prevalence of diabetes in a population we do not need to examine every individual in that population; we only select a number of representative individuals at random. However, if the number of representative individuals (called the sample) is too small, the study will not give accurate results; conversely, if the sample is too large, we waste money and resources unnecessarily. The goals of study design are therefore (i) to detect an influence or effect of an intervention, and (ii) to use facilities and financial resources optimally.
In the previous chapters we became familiar with a number of models for data analysis. The results of these analyses are scientifically valid only when the data have been collected with the correct method and when the study has been designed optimally. Statistical models cannot tell us about the quality of a study, because this is an aspect that requires careful judgment by the researcher. Study design therefore plays a very important role in the success or failure of a scientific project. It can be said that a study that is designed carefully and correctly is already 50% of the way to success. This chapter and the next discuss some basic concepts of study design and some commonly used study models.
15.1 Terminology
To make the research concepts easier to follow and to master, we need to become familiar with, and distinguish between, a number of important terms used in designing a study.
Experimental unit: Depending on the field of research, the experimental unit may be a subject (such as a patient or a volunteer
Group 1: A, B, C
Group 2: A, B, C
Group 3: A, B, C

Group 1: A, B
Group 2: B, C
Group 3: A, C
10 (too sweet)
BA
AB
AB
BA
AB
BA
AB
AB
BA
AB
BA
BA
AB
BA
AB
AB
BA
BA
[1]
If the variances of the two groups are equal, $s_1^2 = s_2^2 = s^2$, then the variance of the difference is simply:

$$s_{x_1 - x_2}^2 = 2 s^2.$$

With design 1, however, each customer tries both products, so $x_1$ and $x_2$ are not independent of each other, and the variance of the difference is:
[2]
the researchers divide 100 patients into two intervention groups: group 1 consists of 50 patients given real alendronate, and group 2 also consists of 50 patients given sham alendronate (called a placebo). The two pills look exactly alike, so neither patients nor doctors can tell which one is sham and which one is real.
The experiment just described raises three difficult problems. Experience from many clinical studies shows a common tendency: patients often report that their health has improved simply because they are being treated (even when the treatment is a placebo). This psychological factor is usually called the placebo effect. The placebo effect can account for about 35% of the results of clinical studies, especially for pain relief, asthma, depression, bowel disease and high blood pressure. For this reason, evaluating the efficacy of a treatment usually requires a control (placebo) group, so that the difference between the two intervention groups can be attributed either to the real drug or to the placebo.
The second factor is the Hawthorne effect. People in general adapt very well, and this ability causes considerable difficulty for scientific research. For example, when we ask a group of consumers to taste the bitterness of coffee several times, on the first occasion they are not yet used to the bitter taste, so they may find it very bitter and give a high score; by the second or third occasion they are used to it and give a lower score. Similarly, in clinical research, if patients know they are being observed, they will try to please the doctor, and their objectivity may be affected. The term for this phenomenon is the Hawthorne effect.
The third factor is the subjectivity of the researcher. If the doctor knows whether a patient is taking the real drug or the placebo, the doctor's assessment may influence the results of the study. For this reason, in rigorous clinical studies the researcher is not allowed to know whether a patient is being treated with the drug or the placebo. This approach is called blinding. Blinding must be maintained for both patients and doctors. In other words, neither the patient nor the doctor knows whether the patient belongs to the intervention group or the placebo group.
However, not every clinical study can maintain blinding in this way. In a study of the efficacy of a surgical procedure, for example, patients certainly know whether they have had real surgery or not (because there is no such thing as sham surgery). In addition, for reasons of medical ethics, not every study can use a placebo. If we know that a disease is life-threatening and that an effective drug exists, there is no justification for giving patients a placebo. In such cases the researcher must think carefully and develop a study design that does not violate medical ethics while still meeting scientific standards.
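[Diagram: 50 people are randomized to Group 1 (vitamin C) or Group 2 (placebo); the frequency of colds is then compared between the two groups]
[Diagram: 50 people are first divided into men and women; within each sex, participants are randomized to vitamin C or placebo (Group 2, control), and the frequency of colds is compared; the values 1.9 and 1.5 appear in the diagram]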
Site   Plot 1   Plot 2   Plot 3
A      Low      High     Low
B      Medium   Medium   High
C      High     Medium   Low
D      Medium   Low      High
E      Medium   Low      Medium
F      Low      High     High
As a result, a comparison between two intervention levels, such as low and high, has to be adjusted for the variation between sites.
Design 2 - RCB (randomized block design): In this design, each intervention level is applied to one plot at every site, so the design is completely balanced. If the three plots at each experimental site are regarded as a block, this design ensures that within each site every plot receives one intervention, as follows:
Site   Plot 1   Plot 2   Plot 3
A      Low      High     Medium
B      Medium   Low      High
C      High     Medium   Low
D      Medium   Low      High
E      High     Low      Medium
F      Low      High     Medium
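A randomized block layout of this kind can be generated with `sample`. The sketch below is illustrative only: the seed is arbitrary and the object names (`sites`, `treatments`, `rcb`) are hypothetical, so the resulting allocation will not reproduce the table above.

```r
set.seed(1)  # arbitrary seed, used only to make the sketch reproducible
sites      <- c("A", "B", "C", "D", "E", "F")
treatments <- c("Low", "Medium", "High")

# Within each site (block), assign the three treatments to the three plots at random
rcb <- t(sapply(sites, function(s) sample(treatments)))
colnames(rcb) <- c("Plot 1", "Plot 2", "Plot 3")
rcb
```

Because each row is a permutation of the three levels, every treatment appears exactly once at every site, which is what makes the design balanced.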
Site   Plot 1   Plot 2
A      Low      High
B      Medium   Low
C      High     Medium
D      Medium   Low
E      High     Low
F      Low      High
Temperature   Material   Method
Low           A          Mechanical
High          A          Mechanical
Low           B          Mechanical
Low           A          Chemical

Temperature   Material   Method
Low           A          Mechanical
High          A          Mechanical
Low           B          Mechanical
Low           B          Mechanical
Low           A          Chemical
High          A          Chemical
Low           B          Chemical
Low           B          Chemical
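All combinations of the three two-level factors can be listed with `expand.grid`, which is a convenient way to write down a full factorial design. This is a small sketch; the English level names are translations of the levels in the tables.

```r
# Full factorial design: every combination of the three two-level factors
design <- expand.grid(
  Temperature = c("Low", "High"),
  Material    = c("A", "B"),
  Method      = c("Mechanical", "Chemical")
)
design
nrow(design)  # 2 x 2 x 2 = 8 runs
```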
P1
P2
P2
P1
2
P2
3
P2
4
P1
B
A
C
C
B
A
A
C
B
C
A
B
320
           Car type
Driver   Ford   Toyota   Honda   Nissan
1        D      B        C       A
2        B      C        A       D
3        C      A        D       B
4        A      D        B       C
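A Latin square like the one above can be generated by starting from a cyclic square and shuffling its rows and columns. The following is a minimal sketch with hypothetical object names; it produces a valid Latin square, not necessarily the one shown in the table.

```r
set.seed(2)  # arbitrary seed for reproducibility
treat <- c("A", "B", "C", "D")
k <- length(treat)

# Cyclic Latin square: entry (i, j) is treatment number ((i + j - 2) mod k) + 1
base <- outer(1:k, 1:k, function(i, j) (i + j - 2) %% k + 1)

# Shuffling whole rows and whole columns keeps the Latin property
sq <- base[sample(k), sample(k)]

latin <- matrix(treat[sq], nrow = k,
                dimnames = list(Driver = as.character(1:k),
                                Car = c("Ford", "Toyota", "Honda", "Nissan")))
latin
```

Each treatment then appears exactly once for every driver and once for every car type.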
T2: n=2; T3: n=3
T1 T1 T1 T2 T2 T3 T3 T3
Use the sample function to choose at random (sample(1:8) produces a random sequence of the numbers 1 to 8):
> sample(1:8)
[1] 7 2 5 4 1 8 6 3
Treatment   T1   T1   T1   T2   T2   T3   T3   T3
Unit         7    2    5    4    1    8    6    3
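The allocation can be written out in full as a small data frame. This sketch assumes three units for T1, two for T2 and three for T3, as in the list above; the seed is arbitrary, so the units assigned will differ from those shown.

```r
set.seed(123)  # arbitrary seed for reproducibility
treatments <- c(rep("T1", 3), rep("T2", 2), rep("T3", 3))
units <- sample(1:8)                      # random ordering of the 8 experimental units
allocation <- data.frame(treatment = treatments, unit = units)
allocation[order(allocation$unit), ]      # allocation listed by unit number
```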
16
Sample size estimation
A study is usually based on a sample. One of the most important questions before starting a study is how many samples or how many subjects are needed. "Subject" here means the basic unit of a study: the number of patients, volunteers, field plots, plants, devices, and so on. Estimating the number of subjects needed for a study is extremely important, because it can determine whether the study succeeds or fails. If there are not enough subjects, the conclusions drawn from the study will not be very accurate, and it may not be possible to conclude anything at all. Conversely, if there are far more subjects than necessary, resources, money and time are wasted. The key issue before starting a study is therefore to estimate a number of subjects that is just sufficient for the aims of the study. This "just sufficient" number depends on three main factors:
                                 Hypothesis H
                                 True                        False
                                 (drug is effective)         (drug is not effective)
Statistically significant        True positive               Type I error
(p<0.05)                         1-β = P(s | H+)             α = P(s | H-)
Not statistically significant    Type II error               True negative
(p>0.05)                         β = P(ns | H+)              1-α = P(ns | H-)
326
Kt qu xt nghim
C bnh
nhy
(Sensitivity),
m tnh gi (False
negative),
c hiu (Specificity),
Bnh trng
Khng c bnh
327
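As for error probabilities, a study usually accepts a type I error of about 1% or 5% (that is, α = 0.01 or 0.05) and a type II error of about β = 0.1 to β = 0.2 (that is, power must be between 0.8 and 0.9).

$$n = \frac{C}{(\Delta / \sigma)^2} \qquad [1]$$

$$n = \frac{2C}{(\Delta / \sigma)^2} \qquad [2]$$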
α       β = 0.20          β = 0.10          β = 0.05
        (Power = 0.80)    (Power = 0.90)    (Power = 0.95)
0.10         6.15              8.53             10.79
0.05         7.85             10.51             13.00
0.01        13.33             16.74             19.84
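The constant C can also be computed rather than looked up. The sketch below assumes that C is defined in the usual way for a two-sided test, as the square of the sum of the two normal quantiles; the function name `C_const` is hypothetical. The values obtained for α = 0.05 agree with the table, while those for other rows may differ from the tabulated constants.

```r
# Assumed definition: C = (z_{alpha/2} + z_beta)^2 for a two-sided test
C_const <- function(alpha, beta) (qnorm(1 - alpha / 2) + qnorm(1 - beta))^2

C_const(0.05, 0.20)   # about 7.85
C_const(0.05, 0.10)   # about 10.51

# Formula [1] with an error of 1 cm and a standard deviation of 4.6 cm
n_one <- C_const(0.05, 0.20) / (1 / 4.6)^2
n_one                 # about 166
```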
16.4 Sample size estimation
16.4.1 Sample size for estimating a mean
Example 1: We want to estimate the height of Vietnamese men and accept an error within 1 cm (d = 1), with a confidence level of 0.95 (that is, α = 0.05) and power = 0.8 (or β = 0.2). Previous studies indicate that the standard deviation of height in Vietnamese people is about 4.6 cm. We can apply formula [1] to estimate the sample size needed for the study:

$$n = \frac{C}{(\Delta / \sigma)^2} = \frac{7.85}{(1 / 4.6)^2} = 166$$

With an error of 0.5 cm:

$$n = \frac{7.85}{(0.5 / 4.6)^2} \approx 664$$
              n = 168.0131
          delta = 1
             sd = 4.6
      sig.level = 0.05
          power = 0.8
    alternative = two.sided

              n = 666.2525
          delta = 0.5
             sd = 4.6
      sig.level = 0.05
          power = 0.8
    alternative = two.sided
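Output of this form is produced by the function `power.t.test`. The call itself is not shown above, so the following is a sketch under the assumption that a one-sample design was used (`type = "one.sample"`), which gives a sample size matching the first block of output.

```r
# Assumed call: one-sample design, error of 1 cm, standard deviation of 4.6 cm
res <- power.t.test(delta = 1, sd = 4.6, sig.level = 0.05, power = 0.8,
                    type = "one.sample", alternative = "two.sided")
res$n   # sample size; slightly larger than the value from formula [1]
```

The result is a little larger than the 166 from formula [1] because `power.t.test` works with the t distribution rather than the normal approximation.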
              n = 198.1513
          delta = 3
             sd = 15
      sig.level = 0.05
          power = 0.8
    alternative = two.sided
$$n = \frac{2C}{(\Delta / \sigma)^2} = \frac{2 \times 10.51}{(0.04 / 0.12)^2} = 189$$
sig.level = 0.05
power = 0.9
alternative = two.sided
NOTE: n is number in *each* group
( i ) , trong , = i / k .
2
i =1
Cho =
i =1
SS
, vn t ra l tm s lng c mu n sao cho z p
( k 1) RMS
z =
332
( k 1)(1 + n ) F + k ( n 1)(1 + 2n )
     Balanced one-way analysis of variance power calculation

         groups = 4
              n = 12.81152
    between.var = 3.486667
     within.var = 8.7
      sig.level = 0.05
          power = 0.9
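$p \pm 1.96 \times SE(p)$, where $SE(p) = \sqrt{p(1 - p)/n}$.

$$1.96 \sqrt{p(1 - p)/n} \le m$$

We want to find the number of subjects n that meets this requirement. From the expression above it follows that:

$$n \ge \left( \frac{1.96}{m} \right)^2 p(1 - p)$$

The sample size therefore depends on the margin of error m and on the proportion p that we want to estimate. The smaller the margin of error, the larger the sample size.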
Example 5: We want to estimate the proportion of men who smoke in Vietnam so that the estimate is no more than 2% above or below the true proportion in the whole population. A previous study suggests that the proportion of smokers among Vietnamese men may be as high as 70%. The question is how many men we need to study to meet this requirement.

$$n \ge \left( \frac{1.96}{0.02} \right)^2 \times 0.7 \times 0.3$$

In other words, we need to study at least 2017 men.
If we want to reduce the error from 2% to 1% (that is, m = 0.01), the number of subjects becomes 8067. Gaining just 1% in precision can require more than 6000 additional subjects. Sample size estimation therefore has to be done with great care, balancing the precision of the information to be collected against the cost.
R has no function for estimating the sample size for a single proportion, but with the formula above the reader can easily write one.
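One possible version of such a function is sketched below. The name `n_prop` and its arguments are hypothetical; the function simply evaluates the formula above, using `qnorm` instead of the rounded constant 1.96.

```r
# Sample size for estimating a proportion p within a margin of error m
n_prop <- function(p, m, conf = 0.95) {
  z <- qnorm(1 - (1 - conf) / 2)   # about 1.96 for 95% confidence
  (z / m)^2 * p * (1 - p)
}

n_prop(0.7, 0.02)   # about 2017
n_prop(0.7, 0.01)   # about 8067
```

In practice the result is rounded up to the next whole number with `ceiling`.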
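$$n = \frac{\left( z_{\alpha/2} \sqrt{2 \bar{p} (1 - \bar{p})} + z_{\beta} \sqrt{p_1 (1 - p_1) + p_2 (1 - p_2)} \right)^2}{(p_1 - p_2)^2}$$

With $p_1 = 0.10$, $p_2 = 0.06$ and $\bar{p} = 0.08$:

$$n = \frac{\left( 2.57 \sqrt{2 \times 0.08 \times 0.92} + 1.28 \sqrt{0.10 \times 0.90 + 0.06 \times 0.94} \right)^2}{(0.04)^2} = 1361$$

This study therefore needs to recruit at least 2722 patients in total to test the hypothesis above.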
The R function power.prop.test can be used to compute the sample size for this case. power.prop.test needs the following information: power, sig.level, p1 and p2. For the example above, we can write:
> power.prop.test(p1=0.10, p2=0.06, power=0.90,
sig.level=0.01)
     Two-sample comparison of proportions power calculation

              n = 1366.430
             p1 = 0.1
             p2 = 0.06
      sig.level = 0.01
          power = 0.9
    alternative = two.sided
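The hand formula and `power.prop.test` can be compared directly. In the sketch below the function name `n_two_prop` is hypothetical; it evaluates the formula for two proportions with exact normal quantiles from `qnorm`, which is why it gives about 1366 rather than the 1361 obtained with the rounded constants 2.57 and 1.28.

```r
# Sample size per group for comparing two proportions (two-sided test)
n_two_prop <- function(p1, p2, alpha = 0.05, power = 0.8) {
  p_bar <- (p1 + p2) / 2
  z_a <- qnorm(1 - alpha / 2)
  z_b <- qnorm(power)
  (z_a * sqrt(2 * p_bar * (1 - p_bar)) +
     z_b * sqrt(p1 * (1 - p1) + p2 * (1 - p2)))^2 / (p1 - p2)^2
}

n_hand <- n_two_prop(0.10, 0.06, alpha = 0.01, power = 0.90)
n_r    <- power.prop.test(p1 = 0.10, p2 = 0.06, power = 0.90, sig.level = 0.01)$n
c(by_hand = n_hand, power.prop.test = n_r)
```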
17
Appendix 1: Programming and functions in R
R was developed so that users can write functions suited to their own analysis and computing needs. Indeed, as mentioned at the beginning of the book, R can be regarded as a statistical language, and we can use that language to solve problems that are not commonly found in textbooks. This section presents only a few simple functions so that the reader can understand how R works, in the hope that this will help the reader develop functions independently later on.
A function (sometimes called a macro in other software) is essentially a set of commands stored under one name. At its simplest, a function is shorthand for a group of commands.
Example 1. In the following commands we create two data sets (data1 and data2). Each data set has two columns of values simulated from the normal distribution. We then plot the two data sets with annotations.
data1 <- cbind(rnorm(100,1), rnorm(100,0))
data2 <- cbind(rnorm(100,-1), rnorm(100,0))
xr <- range(rbind(data1,data2)[,1])
yr <- range(rbind(data1,data2)[,2])
plot(data1, xlim=xr, ylim=yr, col=1, xlab="", ylab="")
par(new=T)
plot(data2, xlim=xr, ylim=yr, col=2, xlab="", ylab="")
title(main="My simulated data", xlab="Weight",
ylab="Yield")
legend(-3.0, -1.5, c("Big", "Small"), col=1:2, pch=1)
and the result is as follows:
[Figure: two scatter plots titled "My simulated data"; x-axis: Weight, y-axis: Yield; legend: Big, Small]
When applying the function, we simply change n and mean. In the two commands below, we first draw a scatter plot with 200 data points and means of 2 and -2. In the second command we increase the number of data points to 500, keeping the means the same as in the previous simulation:
> plotfigure(200, 2, -2)
> plotfigure(500, 2, -2)
And the result differs from the one above:
[Figure: two scatter plots titled "My simulated data"; x-axis: Weight, y-axis: Yield; legend: Big, Small]
add <- function(a, b)
{
sum = a+b
ans <- "Answer = "
cat(ans, sum, "\n")
}
As shown, in the first step we name the function add and define the parameters a and b. The body of a function must begin with the symbol { and end with }. sum is a variable that holds a plus b. ans <- "Answer = " defines the label of the answer (it may be omitted). cat(ans, sum, "\n") collects the values and presents the result to the user of the function; "\n" means that after the result is displayed the output moves to a new line, so that the user gets a fresh prompt. The reader can paste the commands above into R and try the following:
> add(3, 9)
Answer = 12
> add(sqrt(5), exp(10))
Answer = 22028.7
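The function add only prints its result. A common variation is to return the value so that it can be used in later calculations. The sketch below uses a hypothetical name, `add2`, and avoids calling the variable sum so that the built-in function `sum()` is not masked.

```r
add2 <- function(a, b) {
  total <- a + b   # named 'total' to avoid masking the built-in sum()
  total            # the last evaluated expression is the return value
}

x <- add2(3, 9)    # the result is stored instead of printed
x * 2              # and can be used in further calculations
```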
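$$x_i \sim N(\theta, \sigma^2)$$

If we have prior information indicating that $\theta$ follows a normal distribution with mean $\mu$ and variance $\tau^2$, or:

$$\theta \sim N(\mu, \tau^2)$$

then the posterior mean is

$$\theta_p = \frac{\dfrac{n \bar{x}}{\sigma^2} + \dfrac{\mu}{\tau^2}}{\dfrac{n}{\sigma^2} + \dfrac{1}{\tau^2}}$$

and the variance

$$\frac{1}{\sigma_p^2} = \frac{n}{\sigma^2} + \frac{1}{\tau^2}$$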
18
Appendix 2
Some commonly used R commands

Commands for the R working environment
getwd()
setwd("c:/works")
options(prompt="R>")
options(width=100)
options(scipen=3)
options()

Basic commands
ls()
rm(object)
search()

+      Addition
-      Subtraction
*      Multiplication
/      Division
^      Exponentiation
%/%    Integer division
%%     Remainder from dividing two integers
Logical operators
==         Equal to
!=         Not equal to
<          Less than
>          Greater than
<=         Less than or equal to
>=         Greater than or equal to
is.na(x)   Is x a missing value
&          AND
|          OR
!          NOT
Generating numbers
numeric(n)       Gives n zeros
character(n)     Gives n (empty) character strings
logical(n)       Gives n FALSE values
seq(-4,3,0.5)    The sequence -4.0, -3.5, -3.0, ..., 3.0
1:10             Same as the command seq(1, 10, 1)
c(5,7,9,1)       Enters the numbers 5, 7, 9 and 1
rep(1, 5)        Gives five 1s: 1, 1, 1, 1, 1
gl(3,2,12)       Factor with 3 levels, each repeated 2 times, 12 values in total: 1 1 2 2 3 3 1 1 2 2 3 3
rhyper(nn, m, n, k)              Hypergeometric distribution
rlnorm(n,meanlog=0,sdlog=1)      Log-normal distribution
rlogis(n,location=0,scale=1)     Logistic distribution
rnbinom(n,size,prob)             Negative binomial distribution
runif(n,min=0,max=1)             Uniform distribution
Data frames
data.frame(x,y)
tuan$age
attach(tuan)
detach(tuan)
Mathematical functions
log(x)      Logarithm to base e
log10(x)    Logarithm to base 10
exp(x)      Exponential
sin(x)      Sine
cos(x)      Cosine
tan(x)      Tangent
asin(x)     Arcsine (inverse sine)
acos(x)     Arccosine (inverse cosine)
atan(x)     Arctangent (inverse tangent)
Statistical functions
min(x)          Smallest value of the variable x
max(x)          Largest value of the variable x
which.max(x)    Position of the largest value of the variable x
which.min(x)    Position of the smallest value of the variable x
length(x)
sum(x)
range(x)
mean(x)
median(x)
sd(x)
var(x)
cov(x,y)
cor(x,y)
quantile(x)
is.na(x)
complete.cases(x1,x2,...)    Checks whether x1, x2, ... all have no missing values
Matrix indexing
x[1]              First value of the variable x
x[1:5]            First five values of the variable x
x[y<=30]          Select x where y is less than or equal to 30
x[sex=="male"]    Select x where sex equals "male"
Nhp d liu
data(name)
read.table(name)
read.csv(name)
read.delim(name)
read.delim2(name)
read.csv2(name)
t.test             t test
pairwise.t.test    Pairwise t tests between groups (for a paired design, use t.test with paired = TRUE)
cor.test           Test of the correlation coefficient (method = "kendall", method = "spearman")
var.test           Test of variances
bartlett.test      Test of several variances
wilcox.test        Wilcoxon test
kruskal.test       Kruskal test
friedman.test      Friedman test

lm(y ~ x)
lm(y ~ factor)
lm(y ~ factor+x)
lm(y ~ x1+x2+x3)
fisher.test
chisq.test
glm(y ~ x1+x2+x3)
s <- Surv(time,event)
survfit(s)
survdiff(s~g)
coxph(s ~ x1+x2)
binom.test
prop.test
prop.trend.test
Graphs
plot(y~x)            Plot of y against x (scatter plot)
coplot(y ~ x | z)    Plots of y against x for each group of z
pie(x)               Pie chart
boxplot(x)           Box plot
qqnorm(x)            Quantile plot of the variable x
qqplot(x, y)         Quantile plot of the variable y against x
barplot(x)           Bar chart of the variable x
hist(x)              Histogram of the variable x
stars(x)             Star plot of the variable x
abline(a, b)         Straight line with intercept=a and slope=b
abline(h=y)          Horizontal line
abline(v=x)          Vertical line
abline(lm.object)    Line from a fitted linear model

Some graphical parameters
pch
mfrow, mfcol
xlim, ylim
xlab, ylab
lty, lwd
cex, mex
col
19
Appendix 3
Terms used in the book

English                                  Vietnamese
95% confidence interval                  Khoảng tin cậy 95%
Akaike Information criterion (AIC)       Tiêu chuẩn thông tin Akaike
Analysis of covariance                   Phân tích hiệp biến
Analysis of variance (ANOVA)             Phân tích phương sai
Bar chart                                Biểu đồ thanh
Binomial distribution                    Phân phối nhị phân
Box plot                                 Biểu đồ hình hộp
Categorical variable                     Biến thứ bậc
Clock chart                              Biểu đồ đồng hồ
Coefficient of correlation               Hệ số tương quan
Coefficient of determination             Hệ số xác định bội
Coefficient of heterogeneity             Hệ số bất đồng nhất
Combination                              Tổ hợp
Continuous variable                      Biến liên tục
Correlation                              Tương quan
Covariance                               Hợp biến
Cross-over experiment                    Thí nghiệm giao chéo
Cumulative probability distribution      Hàm phân phối tích lũy
Degree of freedom                        Bậc tự do
Determinant                              Định thức
Discrete variable                        Biến rời rạc
Dot chart                                Biểu đồ điểm
Estimate                                 Ước số
Estimator                                Hàm ước lượng thống kê
Factorial analysis of variance           Phân tích phương sai cho thí nghiệm giai thừa
Fixed effects                            Ảnh hưởng bất biến
Frequency                                Tần số
Function                                 Hàm
Heterogeneity                            Bất đồng nhất
Histogram                                Biểu đồ tần số
Homogeneity                              Đồng nhất
Hypothesis test                          Kiểm định giả thiết
Inverse matrix                           Ma trận nghịch đảo
Latin square experiment                  Thí nghiệm hình vuông Latin
Least squares method                     Phương pháp bình phương nhỏ nhất
Linear Logistic regression analysis      Phân tích hồi qui tuyến tính logistic
Linear regression analysis               Phân tích hồi qui tuyến tính
Matrix                                   Ma trận
Maximum likelihood method                Phương pháp hợp lí cực đại
Mean                                     Số trung bình
Median                                   Số trung vị
Meta-analysis                            Phân tích tổng hợp
Missing value                            Giá trị không
Model                                    Mô hình
Multiple linear regression analysis      Phân tích hồi qui tuyến tính đa biến
Normal distribution                      Phân phối chuẩn
Object                                   Đối tượng
Parameter                                Thông số
Permutation                              Hoán vị
Pie chart                                Biểu đồ hình tròn
Poisson distribution                     Phân phối Poisson
Polynomial regression                    Hồi qui đa thức
Probability                              Xác suất
Probability density distribution         Hàm mật độ xác suất
P-value                                  Trị số P
Quantile                                 Hàm định bậc
Random effects                           Ảnh hưởng ngẫu nhiên
Random variable                          Biến ngẫu nhiên
Relative risk                            Tỉ số nguy cơ tương đối
Repeated measure experiment              Thí nghiệm tái đo lường
Residual                                 Phần dư
Residual mean square                     Trung bình bình phương phần dư
Standard error                           Sai số chuẩn
Standardized normal distribution         Phân phối chuẩn chuẩn hóa
Survival analysis                        Phân tích biến cố
Transposed matrix                        Ma trận chuyển vị
Variable                                 Biến (biến số)
Variance                                 Phương sai
Weight                                   Trọng số
Weighted mean                            Trung bình trọng số
20
A few closing words to the reader
(and references)
Over 15 chapters and 3 appendices the reader has accompanied the author on a fairly long journey through statistical analysis and graphics. Before parting, the author would like to say a few words of farewell.
Personal experience in teaching and research shows that most students are not very enthusiastic when they first meet statistical science, and often find it hard, because the textbooks written for the subject are far removed from practice and use examples that do not occur in everyday life. Abstract concepts, complicated formulas and long, tedious calculations make learners feel that the subject is difficult, and they lose interest in pursuing it. Indeed, when reading textbooks or scientific papers we sometimes come across good methods and models that suit our own research, but we do not know how to compute them. In this book the author wants to give the reader a practical means of analysis to fill the gap in methods and knowledge that the reader may still have.
Learning must go hand in hand with practice. In my view, the best way to learn is to imitate. R offers the reader a very convenient way of learning by imitation. While reading these chapters and their examples, the reader can type the commands into the computer and see whether the results agree with what has been read. After learning how to use a function or a command, the reader can add (or remove) arguments of the function to see what happens. Only by learning in this way can the reader master the concepts and the use of R.
We learn from mistakes. With this book the author wants the reader to travel a rather bumpy road, that is, to interact with the computer through R commands. In the course of that interaction some commands will not run, because a variable name is mistyped or misspelled, because upper and lower case have been confused, because the data are incomplete or contain errors, and so on. Every one of these mistakes helps the reader gain experience and become more proficient. This is the way of learning that the English call "trial and error", learning from mistakes and experiments.
A data analysis project requires many R commands and functions. However, because of the interactive way in which the reader works, these commands disappear when R is closed.
can consult the R website to find out more about the specialized packages for multivariate analysis.
References
At present the library of books on R is still relatively modest compared with the library for commercial software such as SAS and SPSS. However, in the current era of extraordinary progress in internet information and globalization, printed books and books published on websites are no longer very different. Most guidance on how to use R can be found scattered across the websites of universities and personal websites around the world. This section lists only a few books that the reader can look up for further reference if needed. While writing the book that the reader is now holding, the author also consulted a number of books and websites, which are listed below with a few personal comments.
The main reference on R is the paper by the two creators of R:
Ihaka R, Gentleman R. R: A language for data analysis and graphics. Journal of Computational and Graphical Statistics 1996; 5:299-314.
18.1 Reference books on R