Intro To R Vietnamese 2

Data Analysis and Graphics with R

Nguyen Van Tuan

Science and Technics Publishing House


Ho Chi Minh City - 2006

Contents
1   Preface
2   Introduction to the R language
3   Data entry
4   Data editing
5   Simple calculations and matrices
6   Probability calculations and simulation
7   Hypothesis testing and the P value
8   Analysing data with graphs
9   Descriptive statistical analysis
10  Linear regression analysis
11  Analysis of variance
12  Logistic regression analysis
13  Survival analysis
14  Meta-analysis
15  Design of experiments
16  Sample size estimation
17  Programming and writing functions in R
18  Some common R commands
19  Terminology used in the book
20  Afterword


1
Preface
Contrary to what many people believe, statistics is a science in its own right: Statistical Science. Its methods of analysis rest on mathematics and probability, but that is only the "technical" part; the more important part is the design of studies and the interpretation of data. A statistician is therefore not merely someone who analyses data, but a scientist and a thinker about scientific research. For that reason statistical science plays an extremely important role, one that scientific research, and experimental science in particular, cannot do without. It is fair to say that today, without statistics, genetic experiments with their millions upon millions of data points would be nothing more than lifeless, meaningless numbers.
A piece of scientific research, however costly and important, has no scientific meaning if it is not analysed with the proper methods. This is why, in scientific journals around the world today, almost every medical paper has a "Statistical Analysis" section, in which the authors must carefully describe how the analysis and calculations were done and briefly explain why those methods were chosen, so as to back up, or add scientific weight to, the statements made in the paper. The more prestigious the medical journal, the more demanding its requirements for statistical analysis. To repeat for emphasis: without a statistical analysis section, a paper has no scientific meaning.
One of the most important developments in statistical science is the use of computers for statistical analysis and computation. It is no exaggeration to say that without the computer, statistics would still be a dull, dry science of complicated formulas with little practical application. The computer has helped statistical science carry out the greatest revolution in its history: bringing statistics into practice, solving the thorniest problems, and contributing to the development of experimental science.
The author still remembers that, more than 20 years ago, when he was a student in a master's programme in statistics in Australia, a respected professor told a story about the famous American statistician Fred Mosteller. Mosteller had received a research contract from the US Department of Defense to improve the accuracy of American weapons during the Second World War, and the work required him to solve a statistical problem with about 30 parameters. He had to hire 20 graduate students for the job: 10 did nothing but compute by hand all day, while the other 10 checked the calculations of the first 10. The work took nearly a month. Today, on a modest personal computer, the same analysis can be done in about one second.
But a computer without software is only a "lifeless", useless heap of iron and silicon. One piece of software that has been, is, and will be revolutionising statistics is R. It has been developed and refined over roughly the past 10 years by statistical and scientific researchers around the world, for use in study, teaching and research. This book introduces the reader to the use of R for statistical analysis and graphics.
Why R? Software for statistical analysis has long been available and widely used. Well-known packages, from "old-timers" such as MINITAB and BMD-P to relatively newer ones such as STATISTICA, SPSS, SAS and STATA, are usually very expensive (a licence for a university can run to hundreds of thousands of dollars a year), beyond the means of an individual and sometimes even of a university. R has changed this situation, because R is completely free. Contrary to the usual perception, free does not mean poor quality. Indeed, not only is R completely free, it can also do all (to repeat: all) of the analyses that commercial software can do, and more. R can be downloaded to a personal computer by anyone, at any time, anywhere in the world, and after a few minutes of installation it is ready to use. This is why the great majority of universities in the West and around the world are increasingly switching to R for study, research and teaching. In line with that trend, this book has the modest aim of introducing R to readers in Vietnam, so that they can keep up to date with developments in statistical computing and analysis elsewhere in the world.
This book is written mainly for university students and scientific researchers who need software for learning statistics, analysing data, or drawing graphs from scientific data. It is not a textbook of statistical theory, nor does it set out to teach the reader how to do statistical analysis, but it should help the reader do statistical analysis more effectively and with more enjoyment. My main aim is to give the reader a basic knowledge of statistics and of how to apply R to solve problems, and thereby a foundation for exploring or further developing R.
I believe that, as in any other profession, the best way to learn statistical analysis is to do the analysis oneself. This book is therefore written with many examples and real datasets. You can read the book and follow its instructions at the same time (by typing the commands into the computer), and you will find it more engaging. If you already have research data of your own, learning will be more effective still if you apply the calculations in the book to them straight away. Students who do not yet have data can use simulation methods to gain a better understanding of statistics.
Statistical science is still relatively new in Vietnam, so some terms have not yet been translated in a consistent and settled way. You will therefore come across a few "unfamiliar" terms here and there in the book; in such cases I have tried to give the original English term alongside for reference. In addition, the final part of the book lists the English-Vietnamese terms used in the book.
All datasets and code used in this book can be downloaded from the internet to a personal computer, or accessed directly through the website:
http://www.r.ykhoanet.com.
I hope you will find in this book some useful information and some techniques or calculations that help your own study, teaching and research. But probably no book is perfect or free of shortcomings; so if you find an error in the book, please let me know by email at t.nguyen@garvan.org.au or rknguyen@gmail.com. My sincere thanks in advance.
I would like to take this opportunity to thank Dr Nguyen Hoang Dzung of the Faculty of Chemistry, Ho Chi Minh City University of Technology, who suggested and helped with printing this book in Vietnam. I thank Dr Nguyen Dinh Nguyen, who read a large part of the manuscript, offered many practical comments, and designed the cover. I also thank Vy Lan, editor at the Science and Technics Publishing House, who took the trouble to read the manuscript closely, pointed out passages that needed rewriting, and kindly let the author's more "idiosyncratic" sentences stand.
Now I invite you to join me on a short "statistical journey" with R.
Sydney, 31 March 2006
Nguyen Van Tuan

2
Introduction to the R language
2.1 What is R?
In short, R is software for statistical analysis and graphics. In essence, though, R is a general-purpose computer language that can be used for many different ends, from simple calculations, recreational mathematics and matrix computation to complex statistical analyses. Because it is a language, R can also be used to build specialised software for a particular computational problem.
R was created by two statisticians, Ross Ihaka and Robert Gentleman. Since its birth, a great many statistical and mathematical researchers around the world have supported and taken part in its development. The creators of R chose an open approach (the author's term is "Open Access"), and it is partly because of this policy that R is completely free. Anyone, anywhere in the world, can access and download the entire source code of R to their own computer and use it. After less than 5 years of development, many statisticians, mathematicians and researchers in every field have already switched to R for analysing scientific data. Worldwide there is a network of nearly one million R users, and the number is growing exponentially. It may well be that within the next 10 years we will no longer need expensive statistical software such as SAS, SPSS or Stata (which can cost up to 100,000 USD a year) for statistical analysis, because all of those analyses can be done in R.
Anyone doing scientific research should therefore learn to use R for statistical analysis and graphics. This chapter shows the reader how to use R.

2.2 Downloading and installing R

To use R, we first have to install it on our own computer. To do so, we go online and visit the website of the Comprehensive R Archive Network (CRAN):
http://cran.R-project.org.

The file to download depends on the version, but its name usually begins with the letter R followed by the version number. For example, the version the author was using at the end of 2005 was 2.2.1, so the file to download is:
R-2.2.1-win32.exe
This file is about 26 MB, and the exact download address is:
http://cran.r-project.org/bin/windows/base/R-2.2.1-win32.exe
The CRAN website also offers a great deal of documentation on how to use R, at every level from elementary to advanced. For readers who are not yet comfortable with English, this book should provide the information needed to use R without having to read the other documents.
Once R has been downloaded, the next step is to install it (set-up). To do so, we simply click on the downloaded file and follow the installation instructions on the screen.

2.3 Packages for specialised analyses

R provides a computer "language" and a number of functions for basic, simple analyses. For more complex analyses, we need to download additional packages. A package is a small piece of software, developed by statisticians to solve a specific problem, that runs inside the R system. For linear regression, for example, R has the function lm, but for deeper and more complex analyses we need packages such as lme4. These packages have to be downloaded and installed.
Packages are downloaded from the same address, http://cran.r-project.org, by clicking on "Packages" in the menu on the left of the web page. The packages needed for the examples in this book are:

Package     Purpose
lattice     Drawing graphs and making them look better
Hmisc       A collection of data modelling methods by F. Harrell
Design      A collection of study design models by F. Harrell
Epi         Epidemiological analyses
epitools    Another package for epidemiological analyses
foreign     Importing data from other software such as SPSS, Stata, SAS, etc.
Rmeta       Meta-analysis
meta        Another package for meta-analysis
survival    Analyses based on the Cox model (Cox's proportional hazards model)
splines     Required for survival to run
Zelig       Statistical analyses in the social sciences
genetics    Analysis of genetic data
gap         Analysis of genetic data
BMA         Bayesian Model Averaging
leaps       Used by BMA
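
As a small illustration of working with packages, the sketch below checks whether a package is available before loading it. It assumes a standard R installation, where the splines package ships with R itself; the install.packages line is left commented out because it needs an internet connection, and lattice is used there only as an example name taken from the table above.

```r
# Check whether a package is installed before trying to load it.
# 'splines' is distributed with R, so it should be found on a standard installation.
have.splines <- requireNamespace("splines", quietly = TRUE)

if (have.splines) {
  library(splines)   # attach the package so its functions can be called directly
}

# A package that is not yet installed can be fetched from CRAN with
# install.packages(); this needs an internet connection, so it is not run here:
# install.packages("lattice")
```

A package only has to be installed once, but library() has to be called in every new R session in which the package is used.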

2.4 Starting and quitting R

Once installation is complete, an icon appears on the computer's desktop, and we are ready to use R. Clicking on this icon opens the R window.

[Figure: the "R 2.2.1" desktop icon and the R console window]

R is normally used in "command line" mode, meaning that we type commands directly at the red prompt in the window. Commands must strictly follow the "grammar" and language of R; indeed, the whole of this book is meant to help the reader understand and write in the language of R. One rule of this grammar is that R distinguishes between Library and library. In other words, R is case-sensitive. Another convention is that where a name consists of two separate words, R usually joins them with a period instead of a space, as in data.frame, t.test, read.table, and so on. This matters a great deal; overlooking it costs the user time.
If a command is typed correctly, R either returns a new prompt or prints a result (depending on the command); if it is not, R prints a short message saying that the command is wrong or not understood. For example, if we type:
> x <- rnorm(20)

R understands and carries out the command, then gives us a new prompt:

>

But if we type:
> R is great

R does not "agree" with this command, because it is not valid R, and the following message appears:
Error: syntax error
>

To leave R, we can simply click the close button (x) in the top-right corner of the window, or type the command q().

2.5 The "grammar" of the R language

The basic unit of R's "grammar" is a command, also called a function. A function takes arguments, so the function name is followed by the arguments we have to supply. For example, in:
> reg <- lm(y ~ x)

reg is an object, lm is a function, and y ~ x is the argument of the function.

Or, in:
> setwd("c:/works/stats")

setwd is a function and "c:/works/stats" is its argument.

To find out which arguments a function takes, we use args(x) (args is short for arguments), where x is the function we want to know about:
> args(lm)
function (formula, data, subset, weights, na.action, method
= "qr", model = TRUE, x = FALSE, y = FALSE, qr = TRUE,
singular.ok = TRUE, contrasts = NULL, offset, ...)
NULL

R is an object-oriented language. This means that data in R are stored in objects. This orientation has some influence on the way R code is written. For example, to test whether x equals 5 we do not write x = 5, as we would in everyday notation; R requires x == 5.
In R, x = 5 is an assignment, equivalent to x <- 5. The latter form (using the symbol <-) is preferred over the former (=). For example:
> x <- rnorm(10)

simulates 10 values from a normal distribution and stores them in the object x. We could also write x = rnorm(10).
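
The difference between assigning a value and testing for equality can be seen in a few lines. This is a minimal sketch; the object names x, y, same and different are chosen only for illustration.

```r
# Assignment: both forms store the value 5 in an object
x <- 5
y = 5

# Comparison: == and != ask a question and return TRUE or FALSE
same <- (x == y)        # is x equal to y?
different <- (x != 7)   # is x different from 7?

print(same)
print(different)
```

Using <- for assignment throughout keeps it visually distinct from ==, which makes this kind of mistake easier to spot.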
Some symbols commonly used in R:

x == 5      x is equal to 5
x != 5      x is not equal to 5
y < x       y is less than x
x > y       x is greater than y
z <= 7      z is less than or equal to 7
p >= 1      p is greater than or equal to 1
is.na(x)    is x a missing value?
A & B       A and B (AND)
A | B       A or B (OR)
!           not (NOT)
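
These operators work element by element on whole vectors, which is how they are mostly used in practice. The following sketch applies a few of them to a made-up vector containing one missing value; the vector and the object names are assumptions for illustration.

```r
x <- c(3, 5, NA, 8)   # a small made-up vector with one missing value

is.five    <- x == 5                          # comparison with NA gives NA
is.missing <- is.na(x)                        # TRUE only for the missing element
in.range   <- !is.na(x) & x >= 4 & x <= 8     # NOT and AND combined

print(is.five)
print(is.missing)
print(in.range)
```

Note that comparing a missing value with == does not return FALSE but NA, which is why is.na() is needed to detect missing values.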

In R, any text or command that follows the symbol # has no effect, because # is reserved for comments added by the user. For example:
> # the next command simulates 10 values from a normal distribution
> x <- rnorm(10)

2.6 Naming objects in R

Naming an object or a variable in R is quite flexible, because R has fewer restrictions than other software. An object name must be written as one unbroken word (that is, without spaces). For example, R accepts myobject but not my object.
> myobject <- rnorm(10)
> my object <- rnorm(10)
Error: syntax error in "my object"

A name such as myobject can be hard to read, though, so it helps to separate the words with a period ".", as in my.object.
> my.object <- rnorm(10)

An important point to remember is that R distinguishes between upper-case and lower-case letters, so My.object is different from my.object. For example:
> My.object.u <- 15
> my.object.L <- 5
> My.object.u + my.object.L
[1] 20

A few points to keep in mind when choosing names in R:

- Avoid the underscore "_" and the hyphen "-" in variable names, as in my_object or my-object. (R reads the hyphen as the subtraction operator, so my-object means my minus object.)

- Avoid giving an object the same name as a variable inside a dataset. For example, if we have a data.frame (a dataset) containing the variable age, we should not also have a separate object called age; that is, we should not write age <- age. However, if the data.frame is called data, we can refer to its variable age with the $ symbol, as data$age (the variable age in the data.frame data), and in that case age <- data$age is acceptable.
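
The naming advice above can be tried on a tiny example. In this sketch the data.frame patients and its values are made up for illustration; the point is only that data$variable refers to a column unambiguously and that names differing in case are different objects.

```r
# A small made-up data frame; 'patients' and its values are illustrative only
patients <- data.frame(age = c(50, 62, 60),
                       insulin = c(16.5, 10.8, 32.3))

# Refer to a column through its data frame with $
avg.age <- mean(patients$age)

# Names are case-sensitive: Avg.age is a different object from avg.age
Avg.age <- avg.age + 1

print(avg.age)
print(Avg.age)
```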

2.7 Getting help in R
Besides args(), R provides the help() command so that users can learn the "grammar" of each function. For example, to find out which arguments the function lm takes, we simply type:
> help(lm)

or
> ?lm

A window appears on the right of the screen explaining how the function is used, often with examples. You can simply copy and paste the examples into R to see how they run.
Before using R, in addition to this book, you may wish to read the documentation that comes with R by choosing the Help menu and then Html help. You can also copy and paste the commands found there into R to see how R works.

[Figure: the R Help menu with Html help selected]

Instead of using the menu, you can also simply type:

> help.start()
and a window appears with a guide to the entire R system.
The function apropos is also very useful, because it lists all the objects in R whose names contain the string we are looking for. For example, to find out which functions in R have "lm" in their names, we simply type:
> apropos("lm")
and R reports the available functions whose names contain lm:
 [1] ".__C__anova.glm"      ".__C__anova.glm.null" ".__C__glm"
 [4] ".__C__glm.null"       ".__C__lm"             ".__C__mlm"
 [7] "anova.glm"            "anova.glmlist"        "anova.lm"
[10] "anova.lmlist"         "anova.mlm"            "anovalist.lm"
[13] "contr.helmert"        "glm"                  "glm.control"
[16] "glm.fit"              "glm.fit.null"         "hatvalues.lm"
[19] "KalmanForecast"       "KalmanLike"           "KalmanRun"
[22] "KalmanSmooth"         "lm"                   "lm.fit"
[25] "lm.fit.null"          "lm.influence"         "lm.wfit"
[28] "lm.wfit.null"         "model.frame.glm"      "model.frame.lm"
[31] "model.matrix.lm"      "nlm"                  "nlminb"
[34] "plot.lm"              "plot.mlm"             "predict.glm"
[37] "predict.lm"           "predict.mlm"          "print.glm"
[40] "print.lm"             "residuals.glm"        "residuals.lm"
[43] "rstandard.glm"        "rstandard.lm"         "rstudent.glm"
[46] "rstudent.lm"          "summary.glm"          "summary.lm"
[49] "summary.mlm"          "kappa.lm"

2.8 The working environment

Data have to be kept in a directory on the computer. Before using R, it is probably best to create a directory to hold the data, for example c:\works\stats. To tell R where the data are, we use the command setwd (set working directory) as follows:
> setwd("c:/works/stats")

This command tells R that the data are held in the directory c:\works\stats. Note that R uses the forward slash "/" and not the backslash "\" used by Windows.
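Note that R can also read data directly from the internet: reading functions such as read.table accept a web address (URL) in place of a file name. The working directory itself, however, must be a folder on the local computer, so setwd cannot be pointed at a website such as http://www.r.ykhoanet.com/; give the full web address of the data file to the reading function instead.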

To find out which directory R is currently "working" in, we simply type:


> getwd()
[1] "C:/Program Files/R/R-2.2.1"
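
The commands setwd and getwd can be tried safely without touching any real project folder. The sketch below works inside R's temporary directory; the folder name "stats" is an assumption made for this illustration, standing in for a directory such as c:/works/stats.

```r
old.dir <- getwd()                           # remember where we started

# A hypothetical working folder inside R's temporary directory
work.dir <- file.path(tempdir(), "stats")
dir.create(work.dir, showWarnings = FALSE)   # create it if it does not exist

setwd(work.dir)                              # switch to the new folder
now.dir <- getwd()                           # confirm where R is working
print(now.dir)

setwd(old.dir)                               # return to the original directory
```

file.path() builds the path with forward slashes, so the same code runs unchanged on Windows and on other systems.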

The default R prompt is ">". If we would like a different prompt to suit our own taste, we can change it:
> options(prompt="R> ")
R>

Or:
> options(prompt="Tuan> ")
Tuan>

By default R prints output 80 characters wide; for wider output, we only need to type:
> options(width=100)

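Or, to have R print numbers with about 3 significant digits:


> options(digits=3)

(The related option scipen does not set the number of digits; it only controls how readily R switches to scientific notation.)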

All of these settings are changed with the command options(). To see the current settings of R, we simply type:
> options()

To find out the current date:


> Sys.Date()
[1] "2006-03-31"

For readers who need more information, several documents on the internet (written in English) are very useful. They can be downloaded free of charge:
- R for beginners (by Emmanuel Paradis):
  http://cran.r-project.org/doc/contrib/rdebuts_en.pdf
- Using R for data analysis and graphics (by John Maindonald):
  http://cran.r-project.org/doc/contrib/usingR.pdf
In addition, the author has a document in Vietnamese (114 pages long) summarising commonly used R commands, available at: www.r.ykhoanet.com.

3
Data entry
To analyse data with R, we need the data in a form that R can understand and process. The data R works with should be held in a data.frame. There are many ways to get data into a data.frame in R, from typing them in directly to importing them from various sources. The most common ways are described below.

3.1 Entering data directly: c()


Example 1: we have the following data on age and insulin for 10 patients, and want to enter them into R.
Age   Insulin
50    16.5
62    10.8
60    32.3
40    19.3
48    14.2
47    11.3
57    15.5
70    15.8
48    16.2
67    11.2

We can use the function named c as follows:

> age <- c(50,62,60,40,48,47,57,70,48,67)
> insulin <- c(16.5,10.8,32.3,19.3,14.2,11.3,15.5,
               15.8,16.2,11.2)

The first command tells R that we want to create a column of data (from now on called a variable) named age, and the second creates another column named insulin. We could, of course, choose any other names we like.
We use the function c (short for concatenation, meaning "to link together") to enter the data. Note that the values for successive patients are separated by commas.

The notation insulin <- (which can also be written insulin =) means that the values that follow are stored in the variable insulin. We will meet this notation many times when using R.
R is an object-oriented language, and each column of data, like each data.frame, is an object as far as R is concerned. So age and insulin are two separate objects. We now need to combine these two objects into one data.frame so that R can process them later. For this we use the function data.frame:
> tuan <- data.frame(age, insulin)

This command tells R to combine the two columns (the two objects) age and insulin into a single object named tuan.
We now have a complete object on which to carry out statistical analysis. To see what is in tuan, we simply type:
> tuan

and R reports:
   age insulin
1   50    16.5
2   62    10.8
3   60    32.3
4   40    19.3
5   48    14.2
6   47    11.3
7   57    15.5
8   70    15.8
9   48    16.2
10  67    11.2
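
Printing the whole object is fine for 10 rows, but for larger datasets a few summary commands are more practical. The sketch below rebuilds the tuan data.frame from Example 1 and inspects it; nrow, names, str and head are standard R functions, while the helper object names n.rows and col.names are chosen for illustration.

```r
# Rebuild the data from Example 1
age <- c(50, 62, 60, 40, 48, 47, 57, 70, 48, 67)
insulin <- c(16.5, 10.8, 32.3, 19.3, 14.2, 11.3, 15.5, 15.8, 16.2, 11.2)
tuan <- data.frame(age, insulin)

n.rows <- nrow(tuan)        # number of patients (rows)
col.names <- names(tuan)    # names of the variables (columns)

str(tuan)                   # compact description of each column
head(tuan, 3)               # first three rows only
```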

To keep these data in a file in R's own format, we use the command save. Suppose we want to store the data in the directory c:\works\stats; we type:
> setwd("c:/works/stats")
> save(tuan, file="tuan.rda")

The first command (setwd, where wd stands for working directory) tells R that we want to store the data in the directory c:\works\stats. Note that Windows normally uses the backslash "\", whereas in R we use the forward slash "/".
The second command (save) tells R to store the data in the object tuan in a file named tuan.rda. After these two commands, a file called tuan.rda will be present in that directory.
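
A file written by save is read back with load. The following sketch shows the round trip on a shortened version of the data; it writes to R's temporary directory rather than c:/works/stats, which is an assumption made so that the example runs on any computer.

```r
# A shortened version of the data, for illustration
tuan <- data.frame(age = c(50, 62, 60), insulin = c(16.5, 10.8, 32.3))

# Temporary location used only for this sketch
rda.file <- file.path(tempdir(), "tuan.rda")

save(tuan, file = rda.file)   # write the object to disk
rm(tuan)                      # remove it from the workspace

loaded <- load(rda.file)      # restore it; load() returns the restored names
print(loaded)
print(tuan)
```

Because load restores objects under their original names, it can silently overwrite an existing object with the same name, which is worth keeping in mind in a long session.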

3.2 Entering data directly: edit(data.frame())


Example 1 (continued): we can also enter the age and insulin data for the 10 patients with a very handy function: edit(data.frame()). With this function, R opens a new window with rows and columns like an Excel sheet, in which we can type the data. For example:
> ins <- edit(data.frame())

A window like the following appears:

[Figure: the R data editor window with empty columns var1, var2, ...]

Here R does not know what our variables are, so it labels them var1, var2, and so on. Click on the column var1 and change its name by typing age. Click on the column var2 and change its name by typing insulin. Then type the data into each column. When finished, click the close button X in the right-hand corner of the spreadsheet, and we have a data.frame named ins with two variables, age and insulin.

3.3 Reading data from a text file: read.table


Example 2: we have collected data on age and cholesterol from a study of 50 patients with hypertension. The data are stored in a text file named chol.txt in the directory c:\works\stats. The columns are as follows: column 1 is the patient's identification number (id), column 2 is sex (Nam = male, Nu = female), column 3 is age, column 4 is body mass index (bmi), column 5 is HDL cholesterol (hdl), followed by LDL cholesterol (ldl), total cholesterol (tc) and triglycerides (tg).
id   sex  age  bmi  hdl    ldl  tc   tg
1    Nam  57   17   5.000  2.0  4.0  1.1
2    Nu   64   18   4.380  3.0  3.5  2.1
3    Nu   60   18   3.360  3.0  4.7  0.8
4    Nam  65   18   5.920  4.0  7.7  1.1
5    Nam  47   18   6.250  2.1  5.0  2.1
6    Nu   65   18   4.150  3.0  4.2  1.5
7    Nam  76   19   0.737  3.0  5.9  2.6
8    Nam  61   19   7.170  3.0  6.1  1.5
9    Nam  59   19   6.942  3.0  5.9  5.4
10   Nu   57   19   5.000  2.0  4.0  1.9
...
44   Nam  45   24   5.450  2.8  6.0  2.6
45   Nam  63   24   5.000  3.0  4.0  1.8
46   Nu   52   24   3.360  2.0  3.7  1.2
47   Nam  64   24   7.170  1.0  6.1  1.9
48   Nam  45   24   7.880  4.0  6.7  3.3
49   Nu   64   25   7.360  4.6  8.1  4.0
50   Nu   62   25   7.750  4.0  6.2  2.5

We want to read these data into R for later analysis. We use the command read.table as follows:
> setwd("c:/works/stats")
> chol <- read.table("chol.txt", header=TRUE)

The first command makes sure that R is looking in the directory where the data are stored. The second asks R to read the data from the file named chol.txt (in the directory c:\works\stats) and put them in the object chol. In this command, header=TRUE asks R to treat the first line of the file as the names of the columns.
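
The effect of header=TRUE can be seen without creating a file, because read.table also accepts the file contents as a character string through its text argument. The sketch below uses the first three rows of the chol data shown above; the object names chol.text and chol.small are assumptions for illustration.

```r
# The first lines of chol.txt, supplied as a string instead of a file
chol.text <- "id sex age bmi hdl ldl tc tg
1 Nam 57 17 5.000 2.0 4.0 1.1
2 Nu 64 18 4.380 3.0 3.5 2.1
3 Nu 60 18 3.360 3.0 4.7 0.8"

# header = TRUE: the first line supplies the column names
chol.small <- read.table(text = chol.text, header = TRUE)

print(dim(chol.small))     # rows and columns read
print(names(chol.small))   # column names taken from the first line
```

With header = FALSE the same first line would be read as a row of data, and R would name the columns V1, V2, and so on.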
We can check whether R has read all of the data by typing:
> chol

or
> names(chol)

R then lists the columns in the dataset (names asks which columns the dataset contains and what they are called):
[1] "id"  "sex" "age" "bmi" "hdl" "ldl" "tc"  "tg"

We can now save the data in R format for later use by typing:
> save(chol, file="chol.rda")

3.4 Reading data from Excel: read.csv


Reading data from Excel takes two steps:

Step 1: Use "Save as" in Excel to save the data in csv format;
Step 2: Use R (the command read.csv) to read the csv data.

Example 3: a dataset with the following columns is stored in Excel, and we want to bring it into R for analysis. The file is named excel.xls.
ID   Age  IGFI     IGFBP3  ALS     PINP    ICTP  P3NP
1    18   148.27   5.14    316.00  61.84   5.81  4.21
2    28   114.50   5.23    296.42  98.64   4.96  5.33
3    20   109.82   4.33    269.82  93.26   7.74  4.56
4    21   112.13   4.38    247.96  101.59  6.66  4.61
5    28   102.86   4.04    240.04  58.77   4.62  4.95
6    23   129.59   4.16    266.95  48.93   5.32  3.82
7    20   142.50   3.85    300.86  135.62  8.78  6.75
8    20   118.69   3.44    277.46  79.51   7.19  5.11
9    20   197.69   4.12    335.23  57.25   6.21  4.44
10   20   163.69   3.96    306.83  74.03   4.95  4.84
11   22   144.81   3.63    295.46  68.26   4.54  3.70
12   27   141.60   3.48    231.20  56.78   4.47  4.07
13   26   161.80   4.10    244.80  75.75   6.27  5.26
14   33   89.20    2.82    177.20  48.57   3.58  3.68
15   34   161.80   3.80    243.60  50.68   3.52  3.35
16   32   148.50   3.72    234.80  83.98   4.85  3.80
17   28   157.70   3.98    224.80  60.42   4.89  4.09
18   18   222.90   3.98    281.40  74.17   6.43  5.84
19   26   186.70   4.64    340.80  38.05   5.12  5.77
20   27   167.56   3.56    321.12  30.18   4.78  6.12

[The dataset also has Sex and Ethnicity columns; their values are missing here.]

The first thing to do, as mentioned above, is to save the data in csv format from Excel:

- In Excel, choose File, then Save as
- Under Save as type, choose "CSV (Comma delimited)"

When this is done, we have a file named excel.csv in the directory c:\works\stats.
The second step is to go to R and type the following commands:
> setwd("c:/works/stats")
> gh <- read.csv("excel.csv", header=TRUE)

The second command, read.csv, asks R to read the data from excel.csv, use the first line as column names, and store the data in an object named gh.
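
The csv route can be checked end to end inside R by writing a small csv file and reading it back. In this sketch the three rows are taken from the table above; writing to R's temporary directory and the object name gh.small are assumptions made so the example is self-contained.

```r
# Three rows from the example dataset, for illustration
gh.small <- data.frame(ID = 1:3,
                       Age = c(18, 28, 20),
                       IGFI = c(148.27, 114.50, 109.82))

# Write a csv file similar to what Excel's "Save as CSV" produces
csv.file <- file.path(tempdir(), "excel.csv")
write.csv(gh.small, csv.file, row.names = FALSE)

# Read it back; the first line is used for column names
gh <- read.csv(csv.file, header = TRUE)
print(gh)
```

row.names = FALSE stops write.csv from adding an extra unnamed column of row numbers, which would otherwise appear as a first column when the file is read back.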
We can now save gh in R format for later use with the following command:
> save(gh, file="gh.rda")

3.5 Reading data from SPSS: read.spss


The statistical software SPSS stores data in sav format. Suppose we have a dataset named testo.sav in the directory c:\works\stats and want to convert it into a form that R can understand. We use the command read.spss from the package named foreign. The following commands do the job easily.
First, we load foreign with the command library:
> library(foreign)

Second, we call read.spss:

> setwd("c:/works/stats")
> testo <- read.spss("testo.sav", to.data.frame=TRUE)

The command read.spss asks R to read the data from testo.sav and put them in a data.frame named testo.
We can now save testo in R format for later use with the following command:
> save(testo, file="testo.rda")

3.6 Basic information about a dataset


Suppose we have read data into a data.frame named chol, as in Example 2. To find out what the dataset contains, we can proceed as follows:

- Tell R that we want to work with chol by using the command attach(arg), where arg is the name of the dataset.

> attach(chol)

- Check whether chol is a data.frame with the command is.data.frame(arg), where arg is the name of the dataset. For example:

> is.data.frame(chol)
[1] TRUE
R confirms that chol is indeed a data.frame.

- How many columns (variables) and rows (observations) does the dataset have? We use the command dim(arg), where arg is the name of the dataset (dim is short for dimension). For example (R's output is shown immediately after the command):

> dim(chol)
[1] 50 8

So we have 50 rows and 8 columns (variables). What are the variables called? We use the command names(arg), where arg is the name of the dataset. For example:

> names(chol)
[1] "id"  "sex" "age" "bmi" "hdl" "ldl" "tc"  "tg"

- How many men and women are there in the variable sex? To answer this, we can use the command table(arg), where arg is the name of the variable. For example:

> table(sex)
sex
nam Nam  Nu
  1  21  28

The result shows 21 records coded Nam (male) and 28 coded Nu (female), plus one record entered as lower-case nam, which R counts as a separate category because it is case-sensitive.


These are a few of the ways of getting data into R. In practice, R can read data from many widely used programs, including statistical software such as SPSS (which we have just seen), SAS, STATA, and so on. To read data from these programs, the reader needs to download the package foreign and install it in R. The package foreign can be downloaded from the official R website.

4
Data editing
Editing data here does not mean altering the original data (that would be a serious offence, an unacceptable form of scientific fraud); it only means organising the data so that R can analyse them effectively. In statistical analysis we often need to pool data into one group, split them into several groups, or recode characters as numbers (numeric) for ease of calculation. This chapter goes through some basic commands for editing data.
We return to the chol dataset of Example 2. To make the "story" easier to follow, recall that we read the data into an R dataset named chol from a text file named chol.txt:
> setwd("c:/works/stats")
> chol <- read.table("chol.txt", header=TRUE)
> attach(chol)

4.1 Checking for missing values


In research, for many reasons, data cannot always be collected for every subject, or every variable cannot be measured for a given subject. In such cases the empty entries are treated as missing values. R represents missing values as NA. Some statistical tests require missing values to be removed before analysis (because they cannot be used in calculations). R has a very useful command for this, na.omit, used as follows:
> chol.new <- na.omit(chol)
In this command, we ask R to drop the rows with missing values from the data.frame chol and put the complete rows into a new data.frame named chol.new. Note that the command above is only an illustration, because the chol dataset has no missing values.
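
Since chol has no missing values, the effect of na.omit is easier to see on a small made-up dataset. In the sketch below the data.frame d and its values are invented for illustration; is.na and colSums are used to count the missing values before removing them.

```r
# A small made-up dataset with two missing values
d <- data.frame(id = 1:4,
                tc = c(4.0, NA, 4.7, 7.7),
                tg = c(1.1, 2.1, NA, 1.1))

n.missing <- colSums(is.na(d))   # number of NA values in each column
d.complete <- na.omit(d)         # keep only rows with no NA at all

print(n.missing)
print(d.complete)
```

na.omit drops a whole row as soon as any one of its variables is missing, so it can discard more data than expected when the missing values are spread across many columns.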

4.2 Splitting a dataset: subset


If, for some reason, we want to analyse men and women separately, we can split chol into two data.frames, say nam (men) and nu (women). To do this we use the command subset(data, cond), where data is the data.frame we want to split and cond is the condition. For example:
> nam <- subset(chol, sex=="Nam")
> nu <- subset(chol, sex=="Nu")

After these two commands we have two new datasets (two data.frames) named nam and nu. Note that in the conditions sex == "Nam" and sex == "Nu" we use == rather than = to express an exact match.
We can, of course, also split the data into other data.frames using conditions on other variables. For example, the following command creates a new data.frame named old with the patients aged 60 or over:
> old <- subset(chol, age>=60)
> dim(old)
[1] 25 8

Or a new data.frame with the patients who are aged 60 or over and male:
> n60 <- subset(chol, age>=60 & sex=="Nam")
> dim(n60)
[1] 9 8
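
The same kind of selection can be tried on a few rows. The sketch below uses the first six id, sex and age values from the chol table; it also shows the equivalent selection with square brackets and the select argument of subset, which keeps only chosen columns. The object names are assumptions for illustration.

```r
# First six patients from the chol table (id, sex, age only)
d <- data.frame(id = 1:6,
                sex = c("Nam", "Nu", "Nu", "Nam", "Nam", "Nu"),
                age = c(57, 64, 60, 65, 47, 65))

# Men aged 60 or over, using subset()
older.men <- subset(d, age >= 60 & sex == "Nam")

# The same rows selected with square brackets
same.rows <- d[d$age >= 60 & d$sex == "Nam", ]

# Women only, keeping just two of the columns
few.cols <- subset(d, sex == "Nu", select = c(id, age))

print(older.men)
print(few.cols)
```

Inside subset the column names can be used directly, whereas with square brackets each column has to be written as d$age, d$sex, and so on.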

4.3 Extracting data from a data.frame


The dataset chol has 8 variables. We can extract from chol only the variables we need, such as the identification number (id), age (age) and total cholesterol (tc). From the output of names(chol), note that id is column 1, age is column 3 and tc is column 7. We can use the following command:
> data2 <- chol[, c(1,3,7)]
Here we tell R that we want columns 1, 3 and 7, and that all the data in these three columns should go into a new data.frame named data2. Note that we use square brackets [] and not round brackets (), because chol is not a function. The comma in front of c means that we select all rows of the data.frame chol.
If we only want the first 10 rows, the command is:
> data3 <- chol[1:10, c(1,3,7)]

> print(data3)
   id age  tc
1   1  57 4.0
2   2  64 3.5
3   3  60 4.7
4   4  65 7.7
5   5  47 5.0
6   6  65 4.2
7   7  76 5.9
8   8  61 6.1
9   9  59 5.9
10 10  57 4.0

Note that the command print(arg) simply lists all the data in the data.frame arg. In fact, we only need to type data3; the result is exactly the same as print(data3).
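
Columns can be selected by name as well as by position, which is less error-prone when the column order is uncertain. The following sketch uses a three-row, four-column stand-in for chol (an assumption made to keep the example short) and shows that both forms give the same result.

```r
# A short stand-in for chol with four of its columns
chol.small <- data.frame(id = 1:3,
                         sex = c("Nam", "Nu", "Nu"),
                         age = c(57, 64, 60),
                         tc = c(4.0, 3.5, 4.7))

by.position <- chol.small[, c(1, 3, 4)]             # columns by number
by.name     <- chol.small[, c("id", "age", "tc")]   # the same columns by name

first.two <- chol.small[1:2, c("id", "tc")]         # rows 1-2, two named columns

print(by.name)
print(first.two)
```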

4.4 Combining two data.frames into one: merge


Suppose our data are held in two data.frames. The first, named d1, has 3 columns, id, sex and tc, as follows:
id sex tc
1 Nam 4.0
2 Nu 3.5
3 Nu 4.7
4 Nam 7.7
5 Nam 5.0
6 Nu 4.2
7 Nam 5.9
8 Nam 6.1
9 Nam 5.9
10 Nu 4.0

The second dataset, named d2, has 3 columns, id, sex and tg, as follows:

id sex tg
1 Nam 1.1
2 Nu 2.1
3 Nu 0.8
4 Nam 1.1
5 Nam 2.1
6 Nu 1.5
7 Nam 2.6
8 Nam 1.5
9 Nam 5.4
10 Nu 1.9
11 Nu 1.7

The two datasets share two variables, id and sex. But d1 has 10 rows, whereas d2 has 11. We can combine the two datasets into one data.frame with the command merge, as follows:
> d <- merge(d1, d2, by="id", all=TRUE)
> d
   id sex.x  tc sex.y  tg
1   1   Nam 4.0   Nam 1.1
2   2    Nu 3.5    Nu 2.1
3   3    Nu 4.7    Nu 0.8
4   4   Nam 7.7   Nam 1.1
5   5   Nam 5.0   Nam 2.1
6   6    Nu 4.2    Nu 1.5
7   7   Nam 5.9   Nam 2.6
8   8   Nam 6.1   Nam 1.5
9   9   Nam 5.9   Nam 5.4
10 10    Nu 4.0    Nu 1.9
11 11  <NA>  NA    Nu 1.7

In the merge command, we ask R to combine the two datasets d1 and d2 into one new data.frame named d, using the variable id as the matching key. Note that patient number 11 has no value for tc, so R reports it as NA ("not available").
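
Because sex appears in both datasets but was not named in by, the merged result carries it twice, as sex.x and sex.y. One way to avoid the duplicate, sketched below on shortened made-up versions of d1 and d2, is to match on both shared variables; the values in the fourth row of d2 are invented for illustration.

```r
# Shortened, made-up versions of d1 and d2
d1 <- data.frame(id = 1:3,
                 sex = c("Nam", "Nu", "Nu"),
                 tc = c(4.0, 3.5, 4.7))
d2 <- data.frame(id = 1:4,
                 sex = c("Nam", "Nu", "Nu", "Nam"),
                 tg = c(1.1, 2.1, 0.8, 1.7))

# Match on both shared variables, keeping unmatched rows (all = TRUE)
d <- merge(d1, d2, by = c("id", "sex"), all = TRUE)
print(d)
```

With all = TRUE the patient who appears only in d2 is kept and receives NA for tc; with the default all = FALSE that row would be dropped.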

4.5 Data coding


In handling epidemiological data, we often need to convert a continuous variable into a categorical one. In the diagnosis of osteoporosis, for example, women whose T-score for bone mineral density (BMD) is -2.5 or lower are classified as having osteoporosis, those with a BMD between -2.5 and -1.0 as having osteopenia, and those above -1.0 as normal. Suppose we have BMD data from 10 patients as follows:
-0.92, 0.21, 0.17, -3.21, -1.80, -2.60, -2.00, 1.71, 2.12, -2.11

To enter these data into R we can use the function c as follows:


> bmd <- c(-0.92,0.21,0.17,-3.21,-1.80,-2.60,
           -2.00,1.71,2.12,-2.11)

To classify the patients into the 3 groups (osteoporosis, osteopenia and normal), we can use the codes 1, 2 and 3. In other words, we want to create another variable (call it diagnosis) that takes these 3 values according to the value of bmd. To do so, we use the following commands:
# temporarily set diagnosis equal to bmd
> diagnosis <- bmd
# recode bmd into diagnosis
> diagnosis[bmd <= -2.5] <- 1
> diagnosis[bmd > -2.5 & bmd <= -1.0] <- 2
> diagnosis[bmd > -1.0] <- 3
# combine into a data frame
> data <- data.frame(bmd, diagnosis)
# list the data to check that the commands worked
> data
     bmd diagnosis
1  -0.92         3
2   0.21         3
3   0.17         3
4  -3.21         1
5  -1.80         2
6  -2.60         1
7  -2.00         2
8   1.71         3
9   2.12         3
10 -2.11         2
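
The same recoding can be written in a single expression with nested ifelse calls, which some readers find easier to check against the cut-off values. This is an alternative sketch, not the method used in the text; it uses the same bmd values and the same cut-offs of -2.5 and -1.0.

```r
bmd <- c(-0.92, 0.21, 0.17, -3.21, -1.80, -2.60, -2.00, 1.71, 2.12, -2.11)

# 1 = osteoporosis (bmd <= -2.5)
# 2 = osteopenia   (-2.5 < bmd <= -1.0)
# 3 = normal       (bmd > -1.0)
diagnosis <- ifelse(bmd <= -2.5, 1,
             ifelse(bmd <= -1.0, 2, 3))

print(data.frame(bmd, diagnosis))
```

Each value is tested against the conditions in order, so the second test only applies to values already known to be above -2.5.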

4.5.1 Recoding data with replace


Another way of recoding data is to use replace, although this is somewhat more involved. Continuing the example above, we convert bmd into diagnosis as follows:
> diagnosis <- bmd
> diagnosis <- replace(diagnosis, bmd <= -2.5, 1)
> diagnosis <- replace(diagnosis, bmd > -2.5 & bmd <= -1.0, 2)
> diagnosis <- replace(diagnosis, bmd > -1.0, 3)

4.5.2 Converting to a factor


In statistical analysis we distinguish between a factor (a categorical variable) and an ordinary continuous variable. A factor cannot be used in arithmetic such as addition, subtraction, multiplication or division, whereas a numeric variable can. In the bmd and diagnosis example above, diagnosis is a factor, because the average of 1 and 2 has no practical meaning, while bmd is a numeric variable.

At the moment, however, R treats diagnosis as a numeric variable. To turn it into a factor, we use the function factor as follows:
> diag <- factor(diagnosis)
> diag
[1] 3 3 3 1 2 1 2 3 3 2
Levels: 1 2 3

Note that R now tells us that diag has 3 levels: 1, 2 and 3. If we ask R to compute the mean of diag, R will not comply, because diag is not a numeric variable:
> mean(diag)
[1] NA
Warning message:
argument is not numeric or logical: returning NA in: mean.default(diag)

We can, of course, compute the mean of diagnosis:


> mean(diagnosis)
[1] 2.3

But this result of 2.3 has no meaning in practice.

4.6 Grouping with cut

A continuous variable can be divided into several groups with the function cut. For example, we have the variable age as follows:

> age <- c(17,19,22,43,14,8,12,19,20,51,8,12,27,31,44)

The lowest age is 8 and the highest is 51. If we want to divide it into 2 age groups:

> cut(age, 2)
 [1] (7.96,29.5] (7.96,29.5] (7.96,29.5] (29.5,51]   (7.96,29.5] (7.96,29.5] (7.96,29.5] (7.96,29.5]
 [9] (7.96,29.5] (29.5,51]   (7.96,29.5] (7.96,29.5] (7.96,29.5] (29.5,51]   (29.5,51]
Levels: (7.96,29.5] (29.5,51]

cut divides age into 2 groups: group 1 covers ages from 7.96 to 29.5, and group 2 from 29.5 to 51. We can count the number of subjects in each age group with the function table:

> table(cut(age, 2))

(7.96,29.5]   (29.5,51]
         11           4

In the following command, we divide age into 3 groups and name them low, medium and high:

> ageg <- cut(age, 3, labels=c("low", "medium", "high"))
> ageg
 [1] low    low    low    high   low    low    low    low    low    high   low    low    medium medium
[15] high
Levels: low medium high
> table(ageg)
ageg
   low medium   high
    10      2      3

Of course, we can also divide age into 4 groups (quartiles) by supplying the probabilities 0, 0.25, 0.50, 0.75 and 1 as follows:

> cut(age,
      breaks=quantile(age, c(0, 0.25, 0.50, 0.75, 1)),
      labels=c("q1", "q2", "q3", "q4"),
      include.lowest=TRUE)

4.7 Grouping data with cut2 (Hmisc)

The cut function above divides a variable according to its values, not the number of observations, so the groups do not have equal sizes. In statistical analysis, however, we sometimes need to divide a continuous variable into groups based on its distribution, so that the groups contain equal or similar numbers of observations. For example, we can cut the variable bmd into groups of similar size with the function cut2 (in the package Hmisc) as follows:

# load the package Hmisc in order to use the function cut2
> library(Hmisc)
> bmd <- c(-0.92,0.21,0.17,-3.21,-1.80,-2.60,
           -2.00,1.71,2.12,-2.11)
# divide bmd into 2 groups and store the result in the object group
> group <- cut2(bmd, g=2)
> table(group)
group
[-3.21,-0.92) [-0.92, 2.12]
            5             5

As the example shows, g = 2 means dividing into 2 groups (g = group). R automatically forms group 1 from bmd values between -3.21 and -0.92, and group 2 from -0.92 to 2.12. Each group contains 5 values.

Of course, we can also divide into 3 groups with the command:

> group <- cut2(bmd, g=3)

The table command then shows 3 groups: group 1 has 4 values, and groups 2 and 3 have 3 values each:

> table(group)
group
[-3.21,-1.80) [-1.80, 0.21) [ 0.21, 2.12]
            4             3             3

5
Using R for simple calculations and matrices

One of the advantages of R is that it can be used like a pocket calculator. Beyond that, R can be used for matrix calculations and programming. This chapter presents only some simple calculations that students can try immediately while reading these lines.

5.1 Simple calculations

Adding two or more numbers:
> 15+2997
[1] 3012

Addition and subtraction:
> 15+2997-9768
[1] -6756

Multiplication and division:
> -27*12/21
[1] -15.42857

Powers, $(25 - 5)^3$:
> (25 - 5)^3
[1] 8000

Square root, $\sqrt{10}$:
> sqrt(10)
[1] 3.162278

The number pi ($\pi$):
> pi
[1] 3.141593
> 2+3*pi
[1] 11.42478

Natural logarithm, $\log_e$:
> log(10)
[1] 2.302585

Base-10 logarithm, $\log_{10}$:
> log10(100)
[1] 2
> log10(2+3*pi)
[1] 1.057848

Exponential, $e^{2.7689}$:
> exp(2.7689)
[1] 15.94109

Trigonometric functions:
> cos(pi)
[1] -1

Vectors:
> x <- c(2,3,1,5,4,6,7,6,8)
> x
[1] 2 3 1 5 4 6 7 6 8
> sum(x)
[1] 42
> x*2
[1]  4  6  2 10  8 12 14 12 16
> exp(x/10)
[1] 1.221403 1.349859 1.105171 1.648721 1.491825 1.822119 2.013753 1.822119
[9] 2.225541
> exp(cos(x/10))
[1] 2.664634 2.599545 2.704736 2.405079 2.511954 2.282647 2.148655 2.282647
[9] 2.007132

Sum of squares, $1^2 + 2^2 + 3^2 + 4^2 + 5^2 = ?$
> x <- c(1,2,3,4,5)
> sum(x^2)
[1] 55

Adjusted sum of squares, $\sum_{i=1}^{n} (x_i - \bar{x})^2 = ?$
> x <- c(1,2,3,4,5)
> sum((x-mean(x))^2)
[1] 10

In the formula above, mean(x) is the mean of the vector x.

Mean square, $\sum_{i=1}^{n} (x_i - \bar{x})^2 / n = ?$
> x <- c(1,2,3,4,5)
> sum((x-mean(x))^2)/length(x)
[1] 2

In the formula above, length(x) is the total number of elements in the vector x.

Variance and standard deviation.

Variance, $s^2 = \sum_{i=1}^{n} (x_i - \bar{x})^2 / (n - 1) = ?$
> x <- c(1,2,3,4,5)
> var(x)
[1] 2.5

Standard deviation, $\sqrt{s^2}$:
> sd(x)
[1] 1.581139
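The link between these quantities can be checked directly. The following is a small sketch, using only base R functions, that rebuilds var() and sd() from the adjusted sum of squares; the variable names ss, s2 and s are chosen here for illustration.

```r
x <- c(1, 2, 3, 4, 5)
n <- length(x)
ss <- sum((x - mean(x))^2)   # adjusted sum of squares
s2 <- ss / (n - 1)           # sample variance, should agree with var(x)
s  <- sqrt(s2)               # standard deviation, should agree with sd(x)
c(ss = ss, variance = s2, sd = s)
all.equal(s2, var(x))
all.equal(s, sd(x))
```

Note that var() divides by n - 1, not by n, which is why it differs from the mean square computed above.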

5.2 Date data

In statistical analysis, date data can be troublesome, because there are many ways of writing them. For example, 01/02/2003 is sometimes written 1/2/2003, 01/02/03, 01FEB2003, 2003-02-01, and so on. In fact, there is a standard for writing dates, ISO 8601 (although very few people follow it!). Under this standard, we write:

2003-02-01

The rationale is that the largest unit is written first, followed by progressively smaller units. For example, on seeing the number 123 we know at once that it is one hundred and twenty-three: hundreds first, then tens, and so on. This is also R's standard way of writing dates.

> date1 <- as.Date("01/02/06", format="%d/%m/%y")
> date2 <- as.Date("06/03/01", format="%y/%m/%d")

Note that the two dates were entered with day, month and year in different orders, but we told R how to read each of them with %d (day), %m (month) and %y (year). We can compute the number of days between the two time points:

> days <- date2-date1
> days
Time difference of 28 days

We can also create a sequence of dates as follows:

> seq(as.Date("2005-01-01"), as.Date("2005-12-31"), by="month")
 [1] "2005-01-01" "2005-02-01" "2005-03-01" "2005-04-01" "2005-05-01"
 [6] "2005-06-01" "2005-07-01" "2005-08-01" "2005-09-01" "2005-10-01"
[11] "2005-11-01" "2005-12-01"
> seq(as.Date("2005-01-01"), as.Date("2005-12-31"), by="2 weeks")
 [1] "2005-01-01" "2005-01-15" "2005-01-29" "2005-02-12" "2005-02-26"
 [6] "2005-03-12" "2005-03-26" "2005-04-09" "2005-04-23" "2005-05-07"
[11] "2005-05-21" "2005-06-04" "2005-06-18" "2005-07-02" "2005-07-16"
[16] "2005-07-30" "2005-08-13" "2005-08-27" "2005-09-10" "2005-09-24"
[21] "2005-10-08" "2005-10-22" "2005-11-05" "2005-11-19" "2005-12-03"
[26] "2005-12-17" "2005-12-31"
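As a short follow-up sketch, dates read in this way can be printed back in ISO 8601 form, and a difference between dates can be converted to a plain number or expressed in other units. The object names follow the example above.

```r
date1 <- as.Date("01/02/06", format = "%d/%m/%y")  # 1 February 2006
date2 <- as.Date("06/03/01", format = "%y/%m/%d")  # 1 March 2006
format(date1, "%Y-%m-%d")                 # ISO 8601 text form of date1
as.numeric(date2 - date1)                 # difference as a plain number of days
difftime(date2, date1, units = "weeks")   # the same difference in weeks
```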

5.3 Generating sequences with seq, rep and gl

R can also generate sequences of numbers, which is very convenient for simulation and experimental design. The usual functions for this are seq (sequence), rep (repetition) and gl (generating levels):

Using seq

Create a vector of the numbers 1 to 12:

> x <- (1:12)
> x
 [1]  1  2  3  4  5  6  7  8  9 10 11 12
> seq(12)
 [1]  1  2  3  4  5  6  7  8  9 10 11 12

Create a vector of the numbers 12 down to 5:

> x <- (12:5)
> x
[1] 12 11 10  9  8  7  6  5
> seq(12,7)
[1] 12 11 10  9  8  7

The general form of seq is seq(from, to, by= ) or seq(from, to, length.out= ). Its use is illustrated by the following examples:

Create a vector of numbers from 4 to 6 in steps of 0.25:

> seq(4, 6, 0.25)
[1] 4.00 4.25 4.50 4.75 5.00 5.25 5.50 5.75 6.00

Create a vector of 10 numbers, the smallest being 2 and the largest 15:

> seq(length=10, from=2, to=15)
 [1]  2.000000  3.444444  4.888889  6.333333  7.777778  9.222222 10.666667 12.111111 13.555556 15.000000

Using rep

The general form of rep is rep(x, times, ...), where x is a variable and times is the number of repetitions. For example:

Repeat the number 10, 3 times:

> rep(10, 3)
[1] 10 10 10

Repeat the numbers 1 to 4, 3 times:

> rep(c(1:4), 3)
 [1] 1 2 3 4 1 2 3 4 1 2 3 4

Repeat the numbers 1.2, 2.7, 4.8, 5 times:

> rep(c(1.2, 2.7, 4.8), 5)
 [1] 1.2 2.7 4.8 1.2 2.7 4.8 1.2 2.7 4.8 1.2 2.7 4.8 1.2 2.7 4.8

Using gl

gl is used to create a categorical variable (a factor), that is, a variable meant for counting rather than for calculation. The general form of gl is gl(n, k, length = n*k, labels = 1:n, ordered = FALSE), and its use is illustrated by the following examples:

Create a variable with levels 1 and 2, each level repeated 8 times:

> gl(2, 8)
 [1] 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2
Levels: 1 2

Or a variable with levels 1, 2 and 3, each level repeated 5 times:

> gl(3, 5)
 [1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3
Levels: 1 2 3

Create a variable with levels 1 and 2, each level repeated 10 times (hence length=20):

> gl(2, 10, length=20)
 [1] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2
Levels: 1 2

Or:

> gl(2, 2, length=20)
 [1] 1 1 2 2 1 1 2 2 1 1 2 2 1 1 2 2 1 1 2 2
Levels: 1 2

Adding labels:

> gl(2, 5, labels=c("C", "T"))
 [1] C C C C C T T T T T
Levels: C T
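Because gl produces balanced factors, it is handy for laying out an experimental design. The sketch below is a hypothetical example (2 treatments, 3 dose levels, 2 subjects per cell); the names treatment, dose and design are assumptions made for illustration only.

```r
# hypothetical design: 2 treatments (C, T) x 3 doses, 2 subjects per cell
treatment <- gl(2, 6, labels = c("C", "T"))
dose      <- gl(3, 2, length = 12, labels = c("low", "mid", "high"))
design    <- data.frame(id = 1:12, treatment, dose)
design
table(design$treatment, design$dose)   # every cell should contain 2 subjects
```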

Create a variable with the 4 levels 1, 2, 3, 4, each level repeated 2 times:

> rep(1:4, c(2,2,2,2))
[1] 1 1 2 2 3 3 4 4

This is equivalent to:

> rep(1:4, each = 2)
[1] 1 1 2 2 3 3 4 4

With dates and times:

> x <- .leap.seconds[1:3]
> rep(x, 2)
[1] "1972-06-30 17:00:00 Pacific Standard Time" "1972-12-31 16:00:00 Pacific Standard Time"
[3] "1973-12-31 16:00:00 Pacific Standard Time" "1972-06-30 17:00:00 Pacific Standard Time"
[5] "1972-12-31 16:00:00 Pacific Standard Time" "1973-12-31 16:00:00 Pacific Standard Time"
> rep(as.POSIXlt(x), rep(2, 3))
[1] "1972-06-30 17:00:00 Pacific Standard Time" "1972-06-30 17:00:00 Pacific Standard Time"
[3] "1972-12-31 16:00:00 Pacific Standard Time" "1972-12-31 16:00:00 Pacific Standard Time"
[5] "1973-12-31 16:00:00 Pacific Standard Time" "1973-12-31 16:00:00 Pacific Standard Time"

5.4 Using R for matrix calculations

As we know, a matrix, put simply, consists of rows and columns. When we write A[m, n], we mean that the matrix A has m rows and n columns. In R we can express this in the same way. For example, to create a square matrix A with 3 rows and 3 columns and the elements 1, 2, 3, 4, 5, 6, 7, 8, 9, we write:

$$A = \begin{pmatrix} 1 & 4 & 7 \\ 2 & 5 & 8 \\ 3 & 6 & 9 \end{pmatrix}$$

And in R:

> y <- c(1,2,3,4,5,6,7,8,9)
> A <- matrix(y, nrow=3)
> A
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9

But if we issue the command:

> A <- matrix(y, nrow=3, byrow=TRUE)
> A

then the result is:

     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9

That is, the transposed matrix. Another way to obtain a transposed matrix is to use t(). For example:

> y <- c(1,2,3,4,5,6,7,8,9)
> A <- matrix(y, nrow=3)
> A
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9

and B = A' can be expressed in R as follows:

> B <- t(A)
> B
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9

The identity matrix, a special case of the scalar matrix, is a square matrix (the number of rows equals the number of columns) in which all off-diagonal elements are 0 and all diagonal elements are 1. We can create such a matrix in R as follows:

> # create a 3 x 3 matrix with all elements equal to 0
> A <- matrix(0, 3, 3)
> # set the diagonal elements to 1
> diag(A) <- 1
> diag(A)
[1] 1 1 1
> # the matrix A is now:
> A
     [,1] [,2] [,3]
[1,]    1    0    0
[2,]    0    1    0
[3,]    0    0    1

5.4.1 Extracting elements from a matrix

> y <- c(1,2,3,4,5,6,7,8,9)
> A <- matrix(y, nrow=3)
> A
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9

> # column 1 of matrix A
> A[,1]
[1] 1 2 3

> # row 3 of matrix A
> A[3,]
[1] 3 6 9

> # row 1 of matrix A
> A[1,]
[1] 1 4 7

> # row 2, column 3 of matrix A
> A[2,3]
[1] 8

> # all rows of matrix A except row 2
> A[-2,]
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    3    6    9

> # all columns of matrix A except column 1
> A[,-1]
     [,1] [,2]
[1,]    4    7
[2,]    5    8
[3,]    6    9

> # check which elements are greater than 3
> A>3
      [,1] [,2] [,3]
[1,] FALSE TRUE TRUE
[2,] FALSE TRUE TRUE
[3,] FALSE TRUE TRUE
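A logical matrix such as A>3 can itself be used as an index. The following sketch, using only base R, extracts, counts and locates the elements that satisfy the condition.

```r
A <- matrix(1:9, nrow = 3)
A[A > 3]                       # the elements greater than 3, taken column by column
sum(A > 3)                     # how many elements are greater than 3
which(A > 3, arr.ind = TRUE)   # their row and column positions
```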

5.4.2 Calculations with matrices

Matrix addition and subtraction. Given the two matrices A and B as follows:

> A <- matrix(1:12, 3, 4)
> A
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12
> B <- matrix(-1:-12, 3, 4)
> B
     [,1] [,2] [,3] [,4]
[1,]   -1   -4   -7  -10
[2,]   -2   -5   -8  -11
[3,]   -3   -6   -9  -12

We can compute A+B:

> C <- A+B
> C
     [,1] [,2] [,3] [,4]
[1,]    0    0    0    0
[2,]    0    0    0    0
[3,]    0    0    0    0

Or A-B:

> D <- A-B
> D
     [,1] [,2] [,3] [,4]
[1,]    2    8   14   20
[2,]    4   10   16   22
[3,]    6   12   18   24

Matrix multiplication. Given the two matrices:

$$A = \begin{pmatrix} 1 & 4 & 7 \\ 2 & 5 & 8 \\ 3 & 6 & 9 \end{pmatrix}, \quad B = \begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{pmatrix}$$

We want to compute AB, which can be done in R with the operator %*% as follows:

> y <- c(1,2,3,4,5,6,7,8,9)
> A <- matrix(y, nrow=3)
> B <- t(A)
> AB <- A%*%B
> AB
     [,1] [,2] [,3]
[1,]   66   78   90
[2,]   78   93  108
[3,]   90  108  126

Or to compute BA, again with %*%:

> BA <- B%*%A
> BA
     [,1] [,2] [,3]
[1,]   14   32   50
[2,]   32   77  122
[3,]   50  122  194
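A common source of confusion is the difference between %*% and the ordinary * operator. The short sketch below contrasts the two on the same matrices.

```r
A <- matrix(1:9, nrow = 3)
B <- t(A)
A * B      # element-wise product: A[i, j] times B[i, j]
A %*% B    # matrix product: rows of A combined with columns of B
```

The element-wise product requires both matrices to have the same dimensions, whereas the matrix product requires the number of columns of the first to equal the number of rows of the second.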

Matrix inversion and solving systems of equations. Suppose we have the following system of equations:

$$3x_1 + 4x_2 = 4$$
$$x_1 + 6x_2 = 2$$

This system can be written in matrix notation as AX = Y, where:

$$A = \begin{pmatrix} 3 & 4 \\ 1 & 6 \end{pmatrix}, \quad X = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}, \quad Y = \begin{pmatrix} 4 \\ 2 \end{pmatrix}$$

The solution of this system is $X = A^{-1}Y$, or in R:

> A <- matrix(c(3,1,4,6), nrow=2)
> Y <- matrix(c(4,2), nrow=2)
> X <- solve(A)%*%Y
> X
          [,1]
[1,] 1.1428571
[2,] 0.1428571

We can check the result:

> 3*X[1,1]+4*X[2,1]
[1] 4
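As a variation on the commands above, solve() also accepts the right-hand side as a second argument and then solves the system directly, without forming the inverse explicitly. A minimal sketch with the same A and Y:

```r
A <- matrix(c(3, 1, 4, 6), nrow = 2)
Y <- matrix(c(4, 2), nrow = 2)
X <- solve(A, Y)   # solves A X = Y directly
X
A %*% X            # reproduces Y, which confirms the solution
```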

Eigenvalues can also be computed, with the function eigen:

> eigen(A)
$values
[1] 7 2
$vectors
           [,1]       [,2]
[1,] -0.7071068 -0.9701425
[2,] -0.7071068  0.2425356

Determinant. How do we determine whether a matrix can be inverted? A matrix whose determinant equals 0 is a singular matrix and cannot be inverted. To check the determinant, R provides the function det():

> E <- matrix((1:9), 3, 3)
> E
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9
> det(E)
[1] 0

But the following matrix F can be inverted:

> F <- matrix((1:9)^2, 3, 3)
> F
     [,1] [,2] [,3]
[1,]    1   16   49
[2,]    4   25   64
[3,]    9   36   81
> det(F)
[1] -216

And the inverse of F ($F^{-1}$) can be computed with the function solve() as follows:

> solve(F)
          [,1]      [,2]       [,3]
[1,]  1.291667 -2.166667  0.9305556
[2,] -1.166667  1.666667 -0.6111111
[3,]  0.375000 -0.500000  0.1805556
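An inverse can be verified by multiplying it with the original matrix, which should give the identity matrix up to rounding error. A small sketch (the name Finv is chosen here for illustration):

```r
F <- matrix((1:9)^2, 3, 3)
Finv <- solve(F)
round(F %*% Finv, 10)            # identity matrix, up to rounding error
all.equal(F %*% Finv, diag(3))   # comparison with a numerical tolerance
```

The comparison uses all.equal() rather than ==, because floating-point arithmetic rarely returns exact zeros and ones.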

Beyond these simple calculations, R can also be used for more complex ones. A notable advantage of R is that it gives users the freedom to create calculations suited to each specific problem. We will return to this in more detail in later chapters.

R has a package, Matrix, designed specifically for matrix computation. Readers can download the package, install it, and use it if needed. The download address is:
http://cran.au.r-project.org/bin/windows/contrib/r-release/Matrix_0.995-8.zip
together with the user documentation (about 80 pages long):
http://cran.au.r-project.org/doc/packages/Matrix.pdf

6
Probability calculations and simulation

Probability is the foundation of statistical analysis. All methods of data analysis and statistical inference rest on probability theory. Probability theory is concerned with describing and expressing the distribution of a random variable. In practice, "describing" here simply means counting the cases or possibilities that can occur for one or more variables. For example, if we randomly select 2 subjects, and these subjects can be classified by two characteristics such as sex and preference, then the question is how many combinations of the two characteristics there are in total. For a continuous variable such as blood pressure, describing means computing summary statistics such as the mean, median, variance, standard deviation, and so on. From these descriptive statistics, probability theory provides models for establishing distribution functions for the variables. This chapter covers two main areas: counting methods and distribution functions.

6.1 Counting methods
6.1.1 Permutations

By definition, a permutation of n elements is an arrangement of the n elements in a definite order. The definition is rather abstract; a concrete example makes it clearer. Imagine an emergency centre with 3 doctors (x, y and z) and 3 patients (a, b and c) waiting to be seen. Any of the three doctors can see any of the patients a, b or c. The question is: how many ways are there to assign doctors to patients? To answer, we reason as follows:

Doctor x has 3 choices: patient a, b or c;
Once doctor x has chosen a patient, doctor y has the two remaining choices;
Finally, once the other 2 doctors have chosen, doctor z has only 1 choice.

In total, we have 3 x 2 x 1 = 6 arrangements.

As another example, at a party of 6 friends, how many ways are there to seat them at a table with 6 chairs? By the same reasoning, the answer is 6.5.4.3.2.1 = 720 ways. (Note that the dot "." denotes multiplication.) This is exactly the counting of permutations.

We know that 3! = 3.2.1 = 6, and 0! = 1. In general, the number of permutations of n elements is $n! = n(n-1)(n-2)(n-3) \cdots 1$. In R this is easily computed with the function prod() as follows:

Find 3!

> prod(3:1)
[1] 6

Find 10!

> prod(10:1)
[1] 3628800

Find 10.9.8.7.6.5.4

> prod(10:4)
[1] 604800

Find (10.9.8.7.6.5.4) / (40.39.38.37.36)

> prod(10:4) / prod(40:36)
[1] 0.007659481
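Base R also provides factorial(), which can replace prod() in these calculations. The sketch below repeats the examples above; the names n and k are used here only for illustration.

```r
factorial(3)                       # same as prod(3:1)
factorial(10)                      # same as prod(10:1)
# ordered arrangements of k items chosen from n: n! / (n - k)!
n <- 10; k <- 7
factorial(n) / factorial(n - k)    # same as prod(10:4)
```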

6.1.2 Combinations

A combination of k elements from n is any subset of k elements of a set of n elements. A concrete example helps to make this clear: 3 people (say A, B and C) stand for the 2 positions of president and vice-president. How many ways are there to fill the 2 positions from these 3 people? We can imagine 2 seats to be filled from 3 people:

Choice   President   Vice-president
1        A           B
2        B           A
3        A           C
4        C           A
5        B           C
6        C           B

So there are 6 choices. But note that choices 1 and 2 in fact involve the same pair, so we can count them only once (not twice). Likewise, 3 and 4, and 5 and 6, each count as one pair. In total, there are 3 ways of choosing 2 of the 3 people for the 2 positions. This answer is called the number of combinations.

In fact, the number of choices can be computed with the following formula:

$$\binom{3}{2} = \frac{3!}{2!\,(3-2)!} = \frac{6}{2} = 3$$

In general, the number of ways of choosing k people from n people is:

$$\binom{n}{k} = \frac{n!}{k!\,(n-k)!}$$

This is sometimes written $C_n^k$ instead of $\binom{n}{k}$. In R, the calculation is done simply with the function choose(n, k). Some examples follow:

Find $\binom{5}{2}$:

> choose(5, 2)
[1] 10

Find the probability that the pair A and B, out of 5 people, is elected to the two positions:

> 1/choose(5, 2)
[1] 0.1
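The unordered pairs themselves can be listed with combn() from base R, which makes the link between the table of choices and choose() explicit. A minimal sketch for the three candidates A, B and C:

```r
people <- c("A", "B", "C")
pairs <- combn(people, 2)    # each column is one unordered pair
pairs
ncol(pairs)                  # number of pairs, the same as choose(3, 2)
factorial(3) / (factorial(2) * factorial(3 - 2))   # the formula evaluated directly
```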

6.2 Random variables and distribution functions

Most statistical analysis relies on probability distributions for inference. If we randomly select 10 students in a class and record their height and sex, we may obtain data such as the following:

No.   Sex      Height (cm)
1     Female   156
2     Female   160
3     Male     175
4     Female   145
5     Female   165
6     Female   158
7     Male     170
8     Male     167
9     Female   178
10    Male     155

Pooling the data, we have 6 girls and 4 boys. In percentages, 60% are female and 40% are male. In the language of probability, the probability of female is 0.6 and of male is 0.4.

For height, the mean is 162.9 cm, with a minimum of 145 cm and a maximum of 178 cm.
Distribution        Density                          Cumulative                       Quantile                         Simulation
Normal              dnorm(x, mean, sd)               pnorm(q, mean, sd)               qnorm(p, mean, sd)               rnorm(n, mean, sd)
Binomial            dbinom(k, n, p)                  pbinom(q, n, p)                  qbinom(p, n, p)                  rbinom(k, n, prob)
Poisson             dpois(k, lambda)                 ppois(q, lambda)                 qpois(p, lambda)                 rpois(n, lambda)
Uniform             dunif(x, min, max)               punif(q, min, max)               qunif(p, min, max)               runif(n, min, max)
Negative binomial   dnbinom(x, k, p)                 pnbinom(q, k, p)                 qnbinom(p, k, prob)              rnbinom(n, k, prob)
Beta                dbeta(x, shape1, shape2)         pbeta(q, shape1, shape2)         qbeta(p, shape1, shape2)         rbeta(n, shape1, shape2)
Gamma               dgamma(x, shape, rate, scale)    pgamma(q, shape, rate, scale)    qgamma(p, shape, rate, scale)    rgamma(n, shape, rate, scale)
Geometric           dgeom(x, p)                      pgeom(q, p)                      qgeom(p, prob)                   rgeom(n, prob)

Distribution        Density                          Cumulative                       Quantile                         Simulation
Exponential         dexp(x, rate)                    pexp(q, rate)                    qexp(p, rate)                    rexp(n, rate)
Weibull             dweibull(x, shape, scale)        pweibull(q, shape, scale)        qweibull(p, shape, scale)        rweibull(n, shape, scale)
Cauchy              dcauchy(x, location, scale)      pcauchy(q, location, scale)      qcauchy(p, location, scale)      rcauchy(n, location, scale)
F                   df(x, df1, df2)                  pf(q, df1, df2)                  qf(p, df1, df2)                  rf(n, df1, df2)
t                   dt(x, df)                        pt(q, df)                        qt(p, df)                        rt(n, df)
Chi-squared         dchisq(x, df)                    pchisq(q, df)                    qchisq(p, df)                    rchisq(n, df)

Note: In the tables above, df = degrees of freedom; prob = probability; n = sample size. The other parameters can be looked up for each distribution. The F, t and Chi-squared distributions also have one further parameter, the non-centrality parameter (ncp), which is set to 0. Users can, however, supply another appropriate value if needed.

In the language of probability and statistics, sex and height are two random variables. They are random because we cannot predict their values exactly; we can only estimate their central value, their mean, and their variability. The variable sex takes only two values (male or female), and is called a non-continuous or discrete variable, or a categorical variable. The variable height can take any value from low to high, and is therefore called a continuous variable.

When we speak of a distribution, we refer to the values that a variable can take. Distribution functions describe variables in a systematic way. "Systematic" here means according to a specific mathematical model with given parameters. Probability and statistics offer many distribution functions; here we consider a few of the most important and most commonly used: the binomial distribution, the Poisson distribution, and the normal distribution. For each distribution, there are 4 kinds of function we need to know:

Probability density function;
Cumulative probability distribution function;
Quantile function; and
Simulation function.

R has built-in functions for all of these that can be used in probability calculations. Each function name consists of a prefix indicating the kind of function, followed by an abbreviation of the distribution name. The prefixes are d (density, the probability function), p (cumulative probability), q (quantile), and r (random, for random numbers). The abbreviations are norm (normal distribution), binom (binomial distribution), pois (Poisson distribution), and so on. The two tables above summarize the functions and parameters for each distribution.

6.3 Probability distribution functions
6.3.1 The binomial distribution

As its name suggests, a binomial variable takes only two values: male / female, alive / dead, yes / no, and so on. The binomial distribution is stated as follows: if an experiment is carried out n times, each time resulting in either success or failure, and the probability of success is known in advance to be p, then the probability of k successes is:

$$P(k \mid n, p) = C_n^k \, p^k (1-p)^{n-k}, \quad k = 0, 1, 2, \ldots, n.$$

To understand this result more clearly, we consider the following examples.

Example 1: Binomial probability function. In the example above, the class has 10 students, 6 of them female. If 3 students are selected at random, what is the probability that 2 of them are female? We can answer this question by hand by considering all possible cases. Each selection has 2 possible outcomes (male or female), and with 3 selections we have $2^3 = 8$ cases as follows.

Student 1   Student 2   Student 3   Probability
Male        Male        Male        (0.4)(0.4)(0.4) = 0.064
Male        Male        Female      (0.4)(0.4)(0.6) = 0.096
Male        Female      Male        (0.4)(0.6)(0.4) = 0.096
Male        Female      Female      (0.4)(0.6)(0.6) = 0.144
Female      Male        Male        (0.6)(0.4)(0.4) = 0.096
Female      Male        Female      (0.6)(0.4)(0.6) = 0.144
Female      Female      Male        (0.6)(0.6)(0.4) = 0.144
Female      Female      Female      (0.6)(0.6)(0.6) = 0.216
All cases                           1.000

We know that among the 10 students 6 are female, so the probability of female is 0.60. (In other words, the probability of selecting a male is 0.4.) Hence the probability that all 3 selected students are male is 0.4 x 0.4 x 0.4 = 0.064. In the table above, there are 3 cases with 2 females: Male-Female-Female, Female-Female-Male, and Female-Male-Female, each with probability 0.144. So the probability of selecting exactly 2 females among the 3 students is 3 x 0.144 = 0.432.

In R, the function dbinom(k, n, p) computes the formula $P(k \mid n, p) = C_n^k \, p^k (1-p)^{n-k}$ quickly. In the case above, we simply type:

> dbinom(2, 3, 0.60)
[1] 0.432
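The hand enumeration can be reproduced in R. The sketch below lists the 8 ordered outcomes with expand.grid() and adds up the probabilities of those containing exactly 2 females; the names cases, n_female and prob are illustrative assumptions.

```r
p_female <- 0.6
# all 2^3 = 8 ordered outcomes of three selections (M = male, F = female)
cases <- expand.grid(s1 = c("M", "F"), s2 = c("M", "F"), s3 = c("M", "F"),
                     stringsAsFactors = FALSE)
n_female <- rowSums(cases == "F")                     # females in each outcome
prob <- p_female^n_female * (1 - p_female)^(3 - n_female)
sum(prob)                     # the 8 probabilities add up to 1
sum(prob[n_female == 2])      # probability of exactly 2 females
choose(3, 2) * 0.6^2 * 0.4    # the binomial formula evaluated directly
```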

Example 2: Cumulative binomial distribution. The probability that an anti-osteoporosis drug is effective is about 70% (i.e. p = 0.70). If we treat 10 patients, what is the probability that at least 8 patients respond? In other words, if X denotes the number of patients treated successfully, we need to find P(X ≥ 8). To answer this question, we use the function pbinom(k, n, p). Recall that pbinom(k, n, p) gives P(X ≤ k). Therefore, P(X ≥ 8) = 1 - P(X ≤ 7), and the answer in R is:

> 1-pbinom(7, 10, 0.70)
[1] 0.3827828
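The same upper-tail probability can be obtained in two other ways, shown in the sketch below: with the lower.tail argument of pbinom(), or by summing the individual probabilities from dbinom().

```r
1 - pbinom(7, 10, 0.70)                   # complement of P(X <= 7)
pbinom(7, 10, 0.70, lower.tail = FALSE)   # upper tail P(X > 7) directly
sum(dbinom(8:10, 10, 0.70))               # P(X = 8) + P(X = 9) + P(X = 10)
```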

Example 3: Simulating a binomial variable. Suppose that about 20% of people in a population have hypertension. If we draw 1000 samples, each of 20 people chosen at random from the population, what will the distribution of the number of hypertensive patients look like? To answer this question, we can use the function rbinom(n, k, p) in R with the following parameters:

> b <- rbinom(1000, 20, 0.20)

In this command, the simulation results are stored temporarily in the object b. To see what b contains, we tabulate it with table:

> table(b)
b
  0   1   2   3   4   5   6   7   8   9  10
  6  45 147 192 229 169 105  68  23  13   3

The first row (0, 1, 2, ..., 10) is the number of hypertensive patients among the 20 people sampled. The second row tells us in how many of the 1000 samples that number occurred. Thus 6 samples contained no hypertensive patient, 45 samples contained just 1, and so on. Perhaps the easiest way to understand this is to plot the frequencies with hist:

> hist(b, main="Number of hypertensive patients")

In this command, b is the variable holding the number of hypertensive patients. The result is a chart of the frequency of hypertensive patients (see Figure 1).

The chart shows that the most likely outcome is 4 hypertensive patients per sample of 20 people (22.9% of samples). This is understandable: since the prevalence of hypertension is 20%, we expect on average 4 of the 20 people sampled to be hypertensive. The important point the chart makes, however, is that we sometimes observe as many as 10 hypertensive patients, although the probability of such a sample is very low (only 3/1000).

[Figure 1: histogram titled "Number of hypertensive patients"; vertical axis: Frequency.]

Figure 1. Distribution of the number of hypertensive patients among 20 people randomly selected from a population with a 20% prevalence of hypertension, with the sampling repeated 1000 times.
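Simulated frequencies can be compared with the theoretical binomial probabilities. The following is a sketch only: the seed 123 is an arbitrary choice made for reproducibility, so the counts it produces will differ from those in the table above.

```r
set.seed(123)                     # arbitrary seed, for reproducibility only
b <- rbinom(1000, 20, 0.20)
# observed proportion of samples with 0, 1, ..., 20 hypertensive patients
observed <- as.vector(table(factor(b, levels = 0:20))) / 1000
# theoretical binomial probabilities for the same counts
expected <- dbinom(0:20, 20, 0.20)
round(cbind(patients = 0:20, observed, expected)[1:11, ], 3)
mean(b)                           # should be close to 20 * 0.20 = 4
```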
Example 4: Applying the binomial distribution. Twenty customers were invited to taste two beers, A and B, and asked which one they preferred. Sixteen preferred beer A. The question is whether this result is sufficient to conclude that beer A is preferred by more people than beer B, or whether it merely arose by chance.

We begin by assuming that, if there is no difference, the probability of preferring beer A is p = 0.50 and of preferring beer B is q = 0.5. If this assumption is true, what is the probability of observing 16 or more of the 20 people preferring beer A? We can compute this probability in R very simply:

> 1- pbinom(15, 20, 0.5)
[1] 0.005908966

The answer is a probability of about 0.006, or 0.6%. In other words, if the two beers were truly alike, the probability that 16 or more of 20 people prefer beer A would be only 0.6%. That is, we have evidence that beer A is genuinely preferred by more people than beer B, rather than by chance. Note that we use 15 (rather than 16) because P(X ≥ 16) = 1 - P(X ≤ 15), and in this case P(X ≤ 15) = pbinom(15, 20, 0.5).
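The calculation above corresponds to a one-sided exact binomial test, which base R provides as binom.test(). The sketch below, not part of the original example, checks that both routes give the same probability.

```r
p_manual <- 1 - pbinom(15, 20, 0.5)
# exact binomial test of p = 0.5 against the one-sided alternative p > 0.5
test <- binom.test(16, 20, p = 0.5, alternative = "greater")
test$p.value
all.equal(test$p.value, p_manual)
```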

6.3.2 The Poisson distribution

The Poisson distribution is, broadly, very similar to the binomial distribution, except that the parameter p is usually very small and n is usually very large. For this reason, the Poisson distribution is often used to describe rare events (such as the number of people with cancer in a population). It has also been applied widely and successfully in engineering and market research, for example to the number of customers arriving at a restaurant each hour.

Example 5: Poisson probability function. From many months of observation, the rate of typing errors of a typist is known: on average, the typist makes 1 error in about every 2,000 words. What is the probability that the typist makes 2 errors, or more than 2 errors?

Because the frequency is low, we can assume that the number of typing errors (call it X) is a random variable following the Poisson distribution. Here the mean error rate is 1 ($\lambda = 1$). The Poisson distribution states that the probability that X = k, given the mean rate $\lambda$, is:

$$P(X = k \mid \lambda) = \frac{e^{-\lambda} \lambda^k}{k!}$$

The answer to the question above is therefore $P(X = 2 \mid \lambda = 1) = \frac{e^{-1} 1^2}{2!} = 0.1839$. This can be computed more quickly in R with the function dpois as follows:

> dpois(2, 1)
[1] 0.1839397

We can also compute the probability of 1 error:

> dpois(1, 1)
[1] 0.3678794

And the probability of no error:

> dpois(0, 1)
[1] 0.3678794

Note that in the function above we simply supply the parameters k = 2 and $\lambda = 1$. This gives the probability that the typist makes exactly 2 errors. The probability that the typist makes more than 2 errors (i.e. 3, 4, 5, ... errors) can be computed as:

$$P(X > 2) = P(X = 3) + P(X = 4) + P(X = 5) + \cdots = 1 - P(X \le 2) = 1 - 0.3678 - 0.3678 - 0.1839 = 0.08$$

In R, we can compute this as follows:

# P(X ≤ 2)
> ppois(2, 1)
[1] 0.9196986

# 1-P(X ≤ 2)
> 1-ppois(2, 1)
[1] 0.0803014
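The similarity between the Poisson and binomial distributions mentioned at the start of this section can be checked numerically. The sketch below assumes, purely for illustration, a text of n = 2000 words with an error probability of 1/2000 per word, so that n x p = 1.

```r
n <- 2000; p <- 1 / 2000      # hypothetical: 2000 words, 1 error per 2000 words
k <- 0:5
binomial <- dbinom(k, n, p)
poisson  <- dpois(k, lambda = n * p)
round(cbind(k, binomial, poisson), 4)
max(abs(binomial - poisson))  # largest difference between the two sets
```

With n this large and p this small, the two columns agree to about three decimal places.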

6.3.3 The normal distribution

The two distributions considered so far belong to the family of distributions for non-continuous variables (discrete distributions), in which the variable takes values in steps or categories. For continuous variables, several other distributions are appropriate, the most important being the normal distribution. The normal distribution is the most important foundation of statistical analysis; it can be said that most of statistical theory is built on it. The normal density function has two parameters: the mean $\mu$ and the variance $\sigma^2$ (or the standard deviation $\sigma$). Let X be a variable (such as height); the normal density function states that the probability (density) at X = x is:

$$P(X = x \mid \mu, \sigma^2) = f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{(x-\mu)^2}{2\sigma^2}\right]$$

Example 6: Normal probability density function. The current mean height of Vietnamese women is 156 cm, with a standard deviation of 4.6 cm. Height is also known to follow the normal distribution. With the two parameters $\mu = 156$ and $\sigma = 4.6$, we can construct a distribution of height for the whole population of Vietnamese women, which has the following shape:

[Figure 2: curve titled "Probability distribution of height in Vietnamese women"; horizontal axis: Height (130 to 200); vertical axis: f(height) (0.00 to 0.08).]

Figure 2. Distribution of height of Vietnamese women with mean 156 cm and standard deviation 4.6 cm. The horizontal axis is height and the vertical axis is the probability density for each height.

The chart above was drawn with the following two commands. The first creates a variable height with the values 130, 131, 132, ..., 200 cm. The second plots the chart with mean 156 cm and standard deviation 4.6 cm.

> height <- seq(130, 200, 1)
> plot(height, dnorm(height, 156, 4.6),
       type="l",
       ylab="f(height)",
       xlab="Height",
       main="Probability distribution of height in Vietnamese women")

With these two parameters (and the chart), we can estimate the probability for any height. For example, the probability (density) that a Vietnamese woman is 160 cm tall is:

$$P(X = 160 \mid \mu = 156, \sigma = 4.6) = \frac{1}{4.6\sqrt{2 \times 3.1416}} \exp\left[-\frac{(160-156)^2}{2(4.6)^2}\right] = 0.0594$$

The function dnorm(x, mean, sd) in R computes this conveniently:

> dnorm(160, mean=156, sd=4.6)
[1] 0.05942343

Cumulative normal probability function. Because height is a continuous variable, in practice we rarely want the probability of one specific value x, but rather the probability of a range of values from a to b. For example, we may want the probability that height is between 150 and 160 cm (i.e. P(150 ≤ X ≤ 160)), or that height is below 145 cm, i.e. P(X < 145). To answer such questions, we need the cumulative normal probability function, defined as follows:

$$P(a \le X \le b) = \int_a^b f(x)\,dx$$

Thus P(150 ≤ X ≤ 160) is the area under the curve of Figure 2 between 150 and 160 on the horizontal axis. R provides the very useful function pnorm(x, mean, sd) for the cumulative probability of a normal distribution:

$$\text{pnorm}(a, \text{mean}, \text{sd}) = \int_{-\infty}^{a} f(x)\,dx = P(X \le a \mid \text{mean}, \text{sd})$$

For example, the probability that a Vietnamese woman's height is 150 cm or less is 9.6%:

> pnorm(150, 156, 4.6)
[1] 0.0960575

Or the probability that a Vietnamese woman's height is 164 cm or more:

> 1-pnorm(164, 156, 4.6)
[1] 0.04100591

In other words, only about 4.1% of Vietnamese women are 164 cm or taller.
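The probability of an interval, such as heights between 150 and 160 cm, is the difference of two cumulative probabilities. The sketch below computes it with pnorm() and, as a cross-check, by numerical integration of the density with integrate(); the names mu, sigma and p_interval are illustrative.

```r
mu <- 156; sigma <- 4.6
# P(150 <= X <= 160) as a difference of two cumulative probabilities
p_interval <- pnorm(160, mu, sigma) - pnorm(150, mu, sigma)
p_interval
# the same area obtained by numerical integration of the density
integrate(dnorm, lower = 150, upper = 160, mean = mu, sd = sigma)$value
# P(X < 145)
pnorm(145, mu, sigma)
```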
Example 7: Applying the normal distribution. In a population, mean blood pressure is known to be 100 mmHg with a standard deviation of 13 mmHg. What proportion of people in this population have a blood pressure of 120 mmHg or higher? The answer in R is:

> 1-pnorm(120, mean=100, sd=13)
[1] 0.0619679

That is, about 6.2% of people in this population have a blood pressure of 120 mmHg or higher.

6.3.4 Hm phn phi chun chun ha (Standardized Normal


distribution)
Mt bin X tun theo lut phn phi chun vi trung bnh v phng
sai 2 thng c vit tt l:
X ~ N( , 2)
y v 2 ty thuc vo n v o lng ca bin s. Chng hn
nh chiu cao c tnh bng cm (hay m), huyt p c o bng mmHg, tui
c o bng nm, v.v cho nn i khi m t mt bin s bng n v gc rt
kh so snh. Mt cch n gin hn l chun ha (standardized) X sao cho s
trung bnh l 0 v phng sai l 1. Sau vi thao tc s hc, c th chng minh
cch bin i X p ng iu kin trn l:

Z=

Ni theo ngn ng ton: nu X ~ N( , 2), th (X )/2 ~ N(0, 1). Nh


vy qua cng thc trn, Z thc cht l khc bit gia mt s v trung bnh tnh
bng s lch chun. Nu Z = 0, chng ta bit rng X bng s trung bnh . Nu
Z = -1, chng ta bit rng X thp hn ng 1 lch chun. Tng t, Z = 2.5,
chng ta bit rng X cao hn ng 2.5 lch chun, v.v
Biu phn phi chiu cao ca ph n Vit Nam c th m t bng
mt n v mi, l ch s z nh sau:

0.2
0.0

0.1

f(z)

0.3

0.4

Probability distribution of height in Vietnamese women

-4

-2

Biu 3. Phn phi chun ha chiu cao ph n Vit Nam.

56

Phn tch d liu v to biu bng R Nguyn Vn Tun

Biu 3 c v bng hai lnh sau y:


> height <- seq(-4, 4, 0.1)
> plot(height, dnorm(height, 0, 1),
type="l",
ylab=f(z),
xlab=z,
main="Probability distribution of height in
Vietnamese women")

Vi phn phi chun chun ho, chng ta c mt tin li l c th dng n


m t v so snh mt phn phi ca bt c bin no, v tt c u c
chuyn sang ch s z.
Trong biu trn, trc tung l xc sut z v trc honh l bin s z. Chng ta
c th tnh ton xc sut z nh hn mt hng s (constant) no bng R. V
d, chng ta mun tm P(z -1.96) = ? cho mt phn phi m trung bnh l 0 v
lch chun l 1.
> pnorm(-1.96, mean=0, sd=1)
[1] 0.02499790

Or P(z ≤ 1.96) = ?
> pnorm(1.96, mean=0, sd=1)
[1] 0.9750021

Therefore, P(-1.96 < z < 1.96) is:
> pnorm(1.96) - pnorm(-1.96)
[1] 0.9500042

In other words, there is a 95% probability that z lies between -1.96 and 1.96. (Note that in the command above we did not supply mean=0, sd=1, because the default values of the mean and sd arguments of pnorm are 0 and 1.)
Example 6 (continued). As a reminder, the mean height of Vietnamese women is 156 cm and the standard deviation is 4.6 cm. A woman who is 170 cm tall is therefore at z = (170 - 156) / 4.6 = 3.04 standard deviations, and the proportion of Vietnamese women taller than 170 cm is very low, only about 0.1%.
> 1-pnorm(3.04)
[1] 0.001182891

Finding a quantile of a normal distribution. Sometimes we need the reverse calculation. We may want to know: if the probability that Z is smaller than some constant z equals a given p, what is z? In probability notation, we want to find z such that:
P(Z < z) = p
To answer this question we use the function qnorm(p, mean=, sd=).
Example 8: Given Z ~ N(0, 1) and P(Z < z) = 0.95, we want to find z.
> qnorm(0.95, mean=0, sd=1)
[1] 1.644854

Or P(Z < z) = 0.975 for a normal distribution with mean 0 and standard deviation 1:
> qnorm(0.975, mean=0, sd=1)
[1] 1.959964
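Because qnorm is the inverse of pnorm, the two calls can be chained as a sanity check. The following is a small sketch under that reading; `alpha` and `z_crit` are illustrative names, not part of the original text.

```r
# Quantiles for several cumulative probabilities
p <- c(0.90, 0.95, 0.975, 0.995)
z <- qnorm(p)

# pnorm() undoes qnorm(): we should recover the probabilities we started with
back <- pnorm(z)

# Two-sided critical value for a chosen level alpha (here 5%)
alpha <- 0.05
z_crit <- qnorm(1 - alpha / 2)
print(round(cbind(p = p, z = z, back = back), 4))
```

The value of `z_crit` is the familiar 1.96 used for the 95% interval earlier in this section.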

6.3.5 The t, F and $\chi^2$ distributions

The t, F and $\chi^2$ distributions are in fact functions of the normal distribution. The relationships among them, and how to compute them in R, are summarized in the notes below:

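The chi-square ($\chi^2$) distribution. The $\chi^2$ distribution arises from the sum of squares of standard normal variables. If $x_i \sim N(0, 1)$ for i = 1, ..., n are independent, and we let

$u = \sum_{i=1}^{n} x_i^2$

then u follows a chi-square distribution with n degrees of freedom (usually abbreviated df). In mathematical notation, $u \sim \chi^2_n$.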


Example 9: Working with a chi-square variable therefore requires only two quantities, u and n. For example, the density of the distribution at u = 21 with df = 13 is given by the function dchisq:
> dchisq(21, 13)
[1] 0.01977879

To find the probability that a variable u is smaller than 21 with 13 degrees of freedom, that is, P(u ≤ 21 | df=13), we use pchisq:
> pchisq(21, 13)
[1] 0.9270714

Equivalently, the result says that $P(\chi^2_{13} < 21) = 0.927$.
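The definition of the chi-square distribution as a sum of squared standard normal variables can be checked by simulation. The sketch below is illustrative only; the seed and the number of replicates are arbitrary choices.

```r
set.seed(1)          # arbitrary seed, for reproducibility only
n <- 13              # degrees of freedom
reps <- 20000        # number of simulated sums

# Each column holds n independent N(0, 1) draws
z <- matrix(rnorm(n * reps), nrow = n)

# Sum of squares per column: one chi-square draw per column
u <- colSums(z^2)

emp <- mean(u < 21)          # simulated P(u < 21)
theo <- pchisq(21, df = n)   # value from the chi-square distribution
print(c(simulated = emp, theoretical = theo))
```

The simulated proportion should land close to 0.927, the value returned by pchisq(21, 13).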


To find the quantile u that corresponds to 95% of a $\chi^2$ distribution with 15 degrees of freedom:
> qchisq(0.95, 15)
[1] 24.99579

In other words, $P(\chi^2_{15} < 24.99) = 0.95$.


Non-centrality. In the definition above, the $\chi^2$ distribution arises from the sum of squares of normal variables with mean 0 and variance 1. If the normal variables have means different from 0 (with the variance still equal to 1), we obtain a non-central chi-square distribution. If $x_i \sim N(\mu_i, 1)$ and we let

$u = \sum_{i=1}^{n} x_i^2$

then u follows a non-central chi-square distribution with n degrees of freedom and non-centrality parameter

$\lambda = \sum_{i=1}^{n} \mu_i^2$

written $u \sim \chi^2_{n,\lambda}$. The mean of u is $n + \lambda$ and the variance of u is $2(n + 2\lambda)$.


To find the probability that u is at most 21, given 13 degrees of freedom and a non-centrality parameter of 5.4:
> pchisq(21, 13, 5.4)
[1] 0.6837649

That is, $P(\chi^2_{13,5.4} < 21) = 0.684$.

To find the quantile corresponding to 50% of a $\chi^2$ distribution with 7 degrees of freedom and a non-centrality parameter of 3:
> qchisq(0.5, 7, 3)
[1] 9.180148

Therefore, $P(\chi^2_{7,3} < 9.180148) = 0.50$.
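The stated mean $n + \lambda$ and variance $2(n + 2\lambda)$ of the non-central chi-square distribution can be checked with rchisq, which accepts a non-centrality argument ncp. This is a rough sketch; the seed and sample size are arbitrary.

```r
set.seed(2)                 # arbitrary seed, for reproducibility only
n <- 13                     # degrees of freedom
lambda <- 5.4               # non-centrality parameter

# Random draws from the non-central chi-square distribution
u <- rchisq(50000, df = n, ncp = lambda)

expected_mean <- n + lambda            # 18.4
expected_var <- 2 * (n + 2 * lambda)   # 47.6
print(c(mean = mean(u), expected_mean = expected_mean,
        var = var(u), expected_var = expected_var))
```

With this many draws the sample mean and variance should sit close to the theoretical values.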

The t distribution. We have just seen that if $X \sim N(\mu, \sigma^2)$ then $(X - \mu)/\sigma \sim N(0, 1)$. That statement holds exactly only when the variance $\sigma^2$ is known. In practice we rarely know the variance exactly and instead estimate it from experimental data. When the variance is estimated from the study data, and we call the estimate $s^2$, we can state that $(X - \mu)/s \sim t(0, v)$, where v is the degrees of freedom.
Example 10. Find the probability that x is greater than 1.1, where the variable follows a t distribution with 6 degrees of freedom:
> 1-pt(1.1, 6)
[1] 0.1567481

That is, $P(t_6 > 1.1) = 1 - P(t_6 < 1.1) = 0.157$.

To find the quantile corresponding to 95% of a t distribution with 15 degrees of freedom:
> qt(0.95, 15)
[1] 1.753050

In other words, $P(t_{15} < 1.75305) = 0.95$.
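To see how the t distribution relates to the normal, the sketch below compares the 95% quantile of the t distribution for a few degrees of freedom (chosen here for illustration) with the corresponding normal quantile.

```r
# Degrees of freedom to compare (illustrative choices)
df <- c(2, 5, 15, 30, 100)

# 95% quantile of the t distribution for each df, and of the normal
q_t <- qt(0.95, df)
q_z <- qnorm(0.95)

print(round(rbind(df = df, t_quantile = q_t), 3))
print(round(q_z, 3))
```

The t quantiles shrink towards the normal quantile as the degrees of freedom grow, which reflects the smaller uncertainty in the estimated variance.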


The F distribution. The ratio of two independent $\chi^2$ variables, each divided by its degrees of freedom, can be shown to follow an F distribution. In other words, if $u \sim \chi^2_n$ and $v \sim \chi^2_m$, then $(u/n)/(v/m) \sim F_{n,m}$, where n is the numerator degrees of freedom and m is the denominator degrees of freedom.
Example 11: Find the probability that an F value is greater than 3.24, given that the variable follows an F distribution with 3 and 15 degrees of freedom and a non-centrality parameter of 5:
> 1-pf(3.24, 3, 15, 5)
[1] 0.3558721

Therefore, $P(F_{3,15,5} > 3.24) = 1 - P(F_{3,15,5} \le 3.24) = 0.356$.

With 3 and 15 degrees of freedom, find C such that $P(F_{3,15} > C) = 0.05$. The solution in R is:

> qf(1-0.05, 3, 15)
[1] 3.287382

In other words,
$P(F_{3,15} > 3.287382) = 1 - P(F_{3,15} \le 3.287382) = 1 - 0.95 = 0.05$
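The construction of F from two chi-square variables can also be illustrated by simulation. In this sketch each chi-square draw is divided by its degrees of freedom before the ratio is taken; the seed and number of replicates are arbitrary.

```r
set.seed(3)        # arbitrary seed, for reproducibility only
n <- 3             # numerator degrees of freedom
m <- 15            # denominator degrees of freedom
reps <- 20000

u <- rchisq(reps, df = n)
v <- rchisq(reps, df = m)

# Ratio of the two chi-square variables, each scaled by its df
f <- (u / n) / (v / m)

emp <- mean(f > 3.287382)            # simulated upper-tail probability
theo <- 1 - pf(3.287382, n, m)       # value from the F distribution
print(c(simulated = emp, theoretical = theo))
```

Both numbers should be close to 0.05, matching the critical value found with qf above.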

6.4 Simulation

In statistical analysis, limited sample sizes sometimes make it hard to estimate parameters accurately, and in such uncertain situations we turn to simulation to learn how much one or more parameters vary. Simulation is usually based on probability distributions. It is a fairly complex field that falls outside the scope of this chapter. Here we only review a few illustrative simulation models that readers can build on.
Example 11: A simulation showing that the variance of the mean equals the variance divided by n ($\mathrm{var}(\bar{X}) = \sigma^2/n$). We consider a discrete variable that takes the values 1, 3 and 5 with the following probabilities:

x      P(x)
1      0.60
3      0.30
5      0.10

From these figures we know that the mean is (1 x 0.60) + (3 x 0.30) + (5 x 0.10) = 2.0 and the variance (readers can verify this) is 1.8. We now use this distribution to simulate 500 draws. The first command creates the 3 values of x. The second enters the probability of each value of x. The sample command asks R to generate 500 random numbers and store them in the object draws.
> x <- c(1, 3, 5)
> px <- c(0.6, 0.3, 0.1)
> draws <- sample(x, size=500, replace=T, prob=px)
> hist(draws, breaks=seq(1,5, by=0.25),
       main="500 draws")

[Figure: histogram titled "500 draws"; x-axis: draws, y-axis: Frequency.]

From the probability distribution we know that on average 60% of draws will have the value 1, 30% the value 3 and 10% the value 5. We therefore expect to observe 300, 150 and 50 draws of each value. The histogram above shows that the observed frequencies are close to these expectations. We also know that the variance of this variable is 1.8. Let us check whether the simulation agrees:
> var(draws)
[1] 1.835671

The result shows that the variance of the 500 draws is 1.836, not far from the expected value.
Now we simulate 500 means $\bar{x}$ (each $\bar{x}$ is the mean of 4 simulated values) from the population above:
> draws <- sample(x, size=4*500, replace=T, prob=px)
> draws = matrix(draws, 4)
> drawmeans = apply(draws, 2, mean)

The first and second commands create an object named draws with 4 rows, each holding 500 values from the distribution above. In other words, we have 4*500 = 2000 numbers arranged in 500 columns, numbered 1 to 500, with 4 numbers in each column. The third command computes the mean of each column, giving 500 means stored in the object drawmeans. The following histogram shows the distribution of these 500 means:
> hist(drawmeans, breaks=seq(1,5,by=0.25),
       main="500 means of 4 draws")

[Figure: histogram of the 500 means of 4 draws; x-axis: drawmeans, y-axis: Frequency.]

We can see that this distribution has a smaller variance. In fact, the variance of these 500 means is 0.45.
> var(drawmeans)
[1] 0.4501112

This is essentially the value 0.45 that we expect from the formula

$\mathrm{var}(\bar{X}) = \sigma^2/4 = 1.8/4 = 0.45$.
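The same check can be repeated for other sample sizes. The sketch below wraps the simulation in a helper, `sim_var_mean`, which is a hypothetical name introduced here for illustration; the seed and number of replicates are arbitrary.

```r
set.seed(4)                       # arbitrary seed, for reproducibility only
x <- c(1, 3, 5)
px <- c(0.6, 0.3, 0.1)

# Exact mean and variance of the discrete distribution
mu <- sum(x * px)
sigma2 <- sum((x - mu)^2 * px)

# Simulated variance of the mean of n draws (hypothetical helper)
sim_var_mean <- function(n, reps = 5000) {
  draws <- matrix(sample(x, n * reps, replace = TRUE, prob = px), nrow = n)
  var(colMeans(draws))
}

sizes <- c(1, 4, 16)
print(data.frame(n = sizes,
                 simulated = sapply(sizes, sim_var_mean),
                 theory = sigma2 / sizes))
```

The simulated column should track the theory column, sigma2/n, shrinking by a factor of 4 each time n is multiplied by 4.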

6.4.1 Simulating the binomial distribution

Example 12: Simulating samples from a population with a binomial distribution. Suppose we know that 20% of a population have diabetes (probability p=0.2). We want to draw samples from this population, each with 20 subjects, and repeat the sampling 100 times:
> bin <- rbinom(100, 20, 0.2)
> bin
[1] 4 4 5 3 2 2 3 2 5 4 3 6 7 3 4 4 1 5 3 5 3 4 4 5 1 4 4 4 4 3 2 4 2 2 5 4 5
[38] 7 3 5 3 3 4 3 2 4 5 2 4 5 5 4 2 2 2 8 5 5 5 3 4 5 7 4 3 6 4 6 6 8 8 3 3 1
[75] 1 4 4 2 3 9 7 4 4 0 0 8 6 9 3 1 4 5 6 4 5 3 2 4 3 2

The result says that the first sample contains 4 people with the disease, the second also 4, the third 5, and so on. These results can be summarized in a histogram:
> hist(bin,
       xlab="Number of diabetic patients",
       ylab="Number of samples",
       main="Distribution of the number of diabetic patients")

[Figure: histogram titled "Distribution of the number of diabetic patients"; x-axis: Number of diabetic patients, y-axis: Number of samples.]

> mean(bin)
[1] 3.97

This is as expected: because each sample has 20 subjects and the probability is 20%, we predict an average of 4 diabetic patients per sample.
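With more simulated samples, the observed frequencies can be compared directly with the binomial probabilities from dbinom. This is a sketch only; the seed and the choice of 10000 samples are arbitrary.

```r
set.seed(5)                                  # arbitrary seed
bin <- rbinom(10000, size = 20, prob = 0.2)  # 10000 samples of 20 subjects

# Relative frequency of each possible count 0..20
sim <- as.numeric(table(factor(bin, levels = 0:20))) / length(bin)

# Exact binomial probabilities for the same counts
theo <- dbinom(0:20, size = 20, prob = 0.2)

print(round(cbind(k = 0:8, simulated = sim[1:9], theoretical = theo[1:9]), 3))
print(mean(bin))
```

The simulated and theoretical columns should agree to about two decimal places, and the mean should be close to 20 x 0.2 = 4.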

6.4.2 Simulating the Poisson distribution

Example 13: Simulating samples from a population with a Poisson distribution. In the following example we simulate 100 values from a population that follows a Poisson distribution with mean $\lambda$ = 3:
> pois <- rpois(100, lambda=3)
> pois
[1] 4 3 2 4 2 3 4 4 0 7 5 0 3 3 4 2 2 6 1 4 2 3 3 5 4 2 1 4 0 2 1 5 1 2 2 2 6
[38] 1 3 6 3 3 5 4 3 2 2 5 3 3 3 1 4 7 3 4 3 2 6 1 4 1 0 5 2 2 2 3 6 8 4 4 1 4
[75] 1 0 0 4 3 3 2 3 3 3 4 1 5 4 4 1 3 1 6 4 4 4 2 2 2 4

The distribution of these values can be displayed with hist(pois):

[Figure: histogram titled "Histogram of pois"; x-axis: pois, y-axis: Frequency.]

The Poisson and exponential distributions. In the following example we simulate the times at which patients arrive at a hospital. Suppose patients arrive at random according to a Poisson distribution, with a mean of 15 patients per 150 minutes. It can be shown that the time between the arrivals of two consecutive patients follows an exponential distribution. We want to know when patients arrive at the hospital, so we simulate 15 inter-arrival times from an exponential distribution with a rate of 15/150 = 0.1 per minute. The following commands do this:
# Generate the times between arrivals (in minutes)
> appoint <- rexp(15, 0.1)
> times <- round(appoint,0)
> times
[1] 37 5 8 10 24 5 1 7 8 6 12 3 25 15
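The simulated values are gaps between consecutive patients. To turn them into arrival times measured from the start of the clinic, the gaps can be accumulated with cumsum. The names `gaps` and `arrival` below are illustrative, and the seed is arbitrary.

```r
set.seed(6)                      # arbitrary seed, for reproducibility only
rate <- 15 / 150                 # 0.1 patients per minute

# Minutes between consecutive patients
gaps <- rexp(15, rate = rate)

# Arrival time of each patient, in minutes from time 0
arrival <- cumsum(gaps)
print(round(arrival))

# Theoretical mean gap for an exponential distribution is 1/rate
mean_gap_theory <- 1 / rate
print(c(mean_gap = mean(gaps), theory = mean_gap_theory))
```

With only 15 gaps the sample mean can differ noticeably from the theoretical 10 minutes; a larger simulation would bring it closer.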

6.4.3 Simulating the $\chi^2$, t and F distributions

The simulation approach above can also be applied to other distributions, such as the negative binomial (rnbinom), gamma (rgamma), beta (rbeta), chi-square (rchisq), exponential (rexp), t (rt), F (rf), and so on. The parameters of these simulation functions can be found in the first part of this chapter.
The following commands illustrate some common distributions:

The chi-square distribution with several degrees of freedom:

> curve(dchisq(x, 1), xlim=c(0,10), ylim=c(0,0.6),
        col="red", lwd=3)
> curve(dchisq(x, 2), add=T, col="green", lwd=3)
> curve(dchisq(x, 3), add=T, col="blue", lwd=3)
> curve(dchisq(x, 5), add=T, col="orange", lwd=3)
> abline(h=0, lty=3)
> abline(v=0, lty=3)
> legend(par("usr")[2], par("usr")[4],
         xjust=1,
         c("df=1", "df=2", "df=3", "df=5"), lwd=3, lty=1,
         col=c("red", "green", "blue", "orange"))

Figure 4. The chi-square distribution with 1, 2, 3 and 5 degrees of freedom.

The t distribution:

> curve(dt(x, 1), xlim=c(-3,3), ylim=c(0,0.4),
        col="red", lwd=3)
> curve(dt(x, 2), add=T, col="blue", lwd=3)
> curve(dt(x, 5), add=T, col="green", lwd=3)
> curve(dt(x, 10), add=T, col="orange", lwd=3)
> curve(dnorm(x), add=T, lwd=4, lty=3)
> title(main="Student T distributions")
> legend(par("usr")[2], par("usr")[4],
         xjust=1,
         c("df=1", "df=2", "df=5",
           "df=10", "Normal distribution"),
         lwd=c(2,2,2,2,2),
         lty=c(1,1,1,1,3),
         col=c("red", "blue", "green",
               "orange", par("fg")))

Figure 5. The t distribution with 1, 2, 5 and 10 degrees of freedom compared with the normal distribution.

The F distribution:

> curve(df(x,1,1), xlim=c(0,2), ylim=c(0,0.8), lwd=3)
> curve(df(x,3,1), add=T)
> curve(df(x,6,1), add=T, lwd=3)
> curve(df(x,3,3), add=T, col="red")
> curve(df(x,6,3), add=T, col="red", lwd=3)
> curve(df(x,3,6), add=T, col="blue")
> curve(df(x,6,6), add=T, col="blue", lwd=3)
> title(main="Fisher F distributions")
> legend(par("usr")[2], par("usr")[4],
         xjust=1,
         c("df=1,1", "df=3,1", "df=6,1", "df=3,3",
           "df=6,3", "df=3,6", "df=6,6"),
         lwd=c(3,1,3,1,3,1,3),
         lty=c(1,1,1,1,1,1,1),
         col=c(par("fg"), par("fg"), par("fg"),
               "red", "red", "blue", "blue"))

Figure 6. The F distribution with various degrees of freedom.

The gamma distribution:

> curve( dgamma(x,1,1), xlim=c(0,5) )
> curve( dgamma(x,2,1), add=T, col='red' )
> curve( dgamma(x,3,1), add=T, col='green' )
> curve( dgamma(x,4,1), add=T, col='blue' )
> curve( dgamma(x,5,1), add=T, col='orange' )
> title(main="Gamma probability distribution function")
> legend(par('usr')[2], par('usr')[4], xjust=1,
         c('k=1 (Exponential distribution)', 'k=2',
           'k=3', 'k=4', 'k=5'),
         lwd=1, lty=1,
         col=c(par('fg'), 'red', 'green', 'blue', 'orange') )

Figure 7. The gamma distribution with various shapes.

The beta distribution:

> curve( dbeta(x,1,1), xlim=c(0,1), ylim=c(0,4) )
> curve( dbeta(x,2,1), add=T, col='red' )
> curve( dbeta(x,3,1), add=T, col='green' )
> curve( dbeta(x,4,1), add=T, col='blue' )
> curve( dbeta(x,2,2), add=T, lty=2, lwd=2, col='red' )
> curve( dbeta(x,3,2), add=T, lty=2, lwd=2, col='green' )
> curve( dbeta(x,4,2), add=T, lty=2, lwd=2, col='blue' )
> curve( dbeta(x,2,3), add=T, lty=3, lwd=3, col='red' )
> curve( dbeta(x,3,3), add=T, lty=3, lwd=3, col='green' )
> curve( dbeta(x,4,3), add=T, lty=3, lwd=3, col='blue' )
> title(main="Beta distribution")
> legend(par('usr')[1], par('usr')[4], xjust=0,
         c('(1,1)', '(2,1)', '(3,1)', '(4,1)',
           '(2,2)', '(3,2)', '(4,2)',
           '(2,3)', '(3,3)', '(4,3)' ),
         lwd=c(1,1,1,1, 2,2,2, 3,3,3),
         lty=c(1,1,1,1, 2,2,2, 3,3,3),
         col=c(par('fg'), 'red', 'green', 'blue',
               'red', 'green', 'blue',
               'red', 'green', 'blue' ))

Figure 8. The beta distribution with various shapes.

The Weibull distribution:

> curve(dexp(x), xlim=c(0,3), ylim=c(0,2))
> curve(dweibull(x,1), lty=3, lwd=3, add=T)
> curve(dweibull(x,2), col='red', add=T)
> curve(dweibull(x,.8), col='blue', add=T)
> title(main="Weibull Probability Distribution Function")
> legend(par('usr')[2], par('usr')[4], xjust=1,
         c('Exponential', 'Weibull, shape=1',
           'Weibull, shape=2', 'Weibull, shape=.8'),
         lwd=c(1,3,1,1),
         lty=c(1,3,1,1),
         col=c(par("fg"), par("fg"), 'red', 'blue'))

Figure 9. The Weibull distribution.

The Cauchy distribution:

> curve(dcauchy(x),xlim=c(-5,5), ylim=c(0,.5), lwd=3)
> curve(dnorm(x), add=T, col='red', lty=2)
> legend(par('usr')[2], par('usr')[4], xjust=1,
         c('Cauchy distribution', 'Gaussian distribution'),
         lwd=c(3,1),
         lty=c(1,2),
         col=c(par("fg"), 'red'))

Figure 10. The Cauchy distribution compared with the normal distribution.

6.5 Random sampling

In probability and statistics, random sampling is very important because it ensures the validity of statistical analysis and inference. In R we can draw a random sample with the function sample.
Example: We have a population of 40 people (numbered 1, 2, 3, ..., 40). If we want to select 5 subjects from this population, who will be chosen? We can use sample() to answer the question:
> sample(1:40, 5)
[1] 32 26 6 18 9

The result says that subjects 32, 26, 6, 18 and 9 were chosen. Each time the command is run, R selects a different sample rather than reproducing the one above. For example:
> sample(1:40, 5)
[1] 5 22 35 19 4
> sample(1:40, 5)
[1] 24 26 12 6 22
> sample(1:40, 5)
[1] 22 38 11 6 18
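When a sample needs to be reproduced later (for instance, to document how subjects were selected), the random number generator can be fixed with set.seed before sampling. The seed value 123 below is arbitrary.

```r
# Same seed, same sample
set.seed(123)
a <- sample(1:40, 5)

set.seed(123)
b <- sample(1:40, 5)

print(a)
print(identical(a, b))   # TRUE: the draw is reproduced exactly
```

Without the call to set.seed, each run would give a different set of 5 subjects, as in the output above.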

The commands above sample at random without replacement, that is, once a subject has been chosen it is not put back into the population.
We may instead want to sample with replacement, where each chosen subject is returned to the population before the next draw. For example, to choose 10 people from a population of 50 with replacement (random sampling with replacement), we only need to add the argument replace=TRUE:
> sample(1:50, 10, replace=T)
[1] 31 44 8 47 50 10 16 29 23

Or toss a coin 10 times; each toss has 2 possible outcomes, H and T, and the result of the 10 tosses might be:
> sample(c("H", "T"), 10, replace=T)
[1] "H" "T" "H" "H" "H" "T" "H" "H" "T" "T"

We can also imagine a bag holding 5 blue balls (X) and 5 red balls (D). We draw 1 ball, record its colour and put it back in the bag; then we draw another ball, record its colour and put it back again. Repeating this 20 times, the result might be:
> sample(c("X", "D"), 20, replace=T)
[1] "X" "D" "D" "D" "D" "D" "X" "X" "X" "X" "X" "D" "X" "X" "D" "X" "X" "X" "X"
[20] "D"

In addition, we can sample with specified probabilities. In the following command we choose 10 subjects from the sequence 1 to 5, with unequal probabilities:
> sample(5, 10, prob=c(0.3, 0.4, 0.1, 0.1, 0.1), replace=T)
[1] 3 1 3 2 2 2 2 2 5 1

Subject 1 was chosen 2 times, subject 2 was chosen 5 times, subject 3 was chosen 2 times, and so on. The frequencies do not match the supplied probabilities 0.3, 0.4, 0.1, ... exactly because the sample is small, but they are not far from what we expect.
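To see the observed frequencies approach the specified probabilities, the same call can be repeated with a much larger sample. This is an illustrative sketch; the seed and the sample size of 10000 are arbitrary.

```r
set.seed(7)                             # arbitrary seed
p <- c(0.3, 0.4, 0.1, 0.1, 0.1)         # probabilities for subjects 1..5

# Draw 10000 times with replacement using the given probabilities
s <- sample(5, 10000, replace = TRUE, prob = p)

# Observed relative frequency of each subject
freq <- as.numeric(table(factor(s, levels = 1:5))) / length(s)
print(round(rbind(target = p, observed = freq), 3))
```

With 10000 draws the observed row should match the target row to about two decimal places.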


7
Hypothesis testing and the meaning of the P-value
7.1 The P-value
In scientific research, apart from data presented as numbers, charts and images, the figure we encounter most often is the P-value. In the following chapters the reader will meet the P-value many times, and the vast majority of statistical and scientific inference relies on it. Therefore, before discussing statistical methods in R, we need to understand what this value means.
The P-value is a probability; the name is short for "probability value". We often see statements accompanied by such a number, for example "The analysis showed that the fracture rate in patients treated with Alendronate was 2%, lower than the rate in untreated patients (5%), and the difference was statistically significant (p = 0.01)", or "After 3 months of treatment, blood pressure in the patient group fell by 10% (p < 0.05)". In contexts like these, most scientists take the P-value to reflect the probability that Alendronate, or the treatment, is effective. Many people read the first statement as saying that the probability that Alendronate is better than placebo is 0.99 (1 minus 0.01). That reading is completely wrong.
Indeed, very many people, not only readers but even the authors of scientific papers, do not understand the P-value correctly. According to a study published in the respected journal Statistics in Medicine [1], 85% of scientific authors and research physicians either do not understand or misunderstand the meaning of the P-value. So the question must be asked seriously: what does the P-value mean? To answer it, we first need to consider the notion of falsification and the process of a scientific study.

7.2 Scientific hypotheses and falsification

A hypothesis is considered scientific if it is falsifiable. According to Karl Popper, a philosopher of science, the one feature that distinguishes a genuine scientific theory from pseudoscience is that a scientific theory can always be refuted (falsified) by simple experiments. He called this falsifiability (some sources write "falsibility"). Falsification is the practice of conducting experiments not to verify scientific theories but to criticize them, and it can be regarded as a foundation of genuine science. For example, the hypothesis "All crows are black" can be refuted by finding a single red crow.
Falsification can be seen as a way of learning from mistakes. Science advances largely by learning from mistakes, which no scientist denies. Scientific research can be described as a process of testing hypotheses, in the following steps:
Step 1: the researcher defines a null hypothesis, that is, a hypothesis contrary to what the researcher believes to be true. For example, in a clinical study with two groups of patients, one treated with drug A and one given placebo, the researcher may state the null hypothesis that the efficacy of drug A is equivalent to that of placebo (meaning that drug A does not have the desired effect).
Step 2: the researcher defines an alternative hypothesis, that is, the hypothesis the researcher believes to be true and that needs to be supported by data. In the example above, the researcher may state the alternative hypothesis that drug A is more effective than placebo.
Step 3: after collecting all the relevant data, the researcher uses one or more statistical methods to examine which of the two hypotheses is plausible. The examination answers the question: if the null hypothesis is true, what is the probability of obtaining data like those collected? This probability is usually reported in scientific papers as the "P value". Note that the researcher does not test the alternative hypothesis, only the null hypothesis.
Step 4: decide whether to accept or reject the null hypothesis, based on the probability from step 3. By the convention used in medical research, if the probability is below 5% the researcher is prepared to reject the null hypothesis and conclude that the efficacy of drug A differs from that of placebo. If the probability is above 5%, however, the researcher can only say that there is not yet enough evidence to reject the null hypothesis, which does not mean that the null hypothesis is true. In other words, absence of evidence is not evidence of absence.

Step 5: if the null hypothesis is rejected, the researcher accepts the alternative hypothesis by default. But this is where the trouble starts, because there are many possible alternative hypotheses. Compared with the original alternative (A differs from placebo), the researcher could pose many others, for example that drug A is 5%, 10% or in general X% more effective than placebo. In short, once the researcher rejects the null hypothesis an alternative hypothesis is accepted by default, but the researcher cannot determine which alternative hypothesis is the true one.

7.3 The meaning of the P-value through simulation

To understand what the P-value means in practice, consider a simple example:
Example 1. An experiment was conducted to study consumer preference for two kinds of coffee (call them coffee A and coffee B). The researchers let 50 customers taste both coffees under the same conditions and asked which one they preferred. The results showed that 35 people preferred coffee A and 15 preferred coffee B. The question is whether, from these results, the researchers can conclude that coffee A is preferred to coffee B, or whether the result arose merely by chance.
"Arose by chance" means: under the binomial distribution, how likely is such a result? Binomial probability theory applies here because each response has only two possible values (prefers A or prefers B).
In the language of falsification, the null hypothesis is that there is no difference in preference, so the probability that a customer prefers either coffee is 0.5. If this hypothesis is true (that is, p = 0.5, where p is the probability of preferring coffee A), and if the study were repeated, say, 1000 times, each time with 50 customers, in how many of the studies would 35 or more customers prefer coffee A? Let X be the number of customers out of 50 who prefer coffee A; in probability language, we want to find P(X ≥ 35 | p=0.50).
To answer this question we can use the function rbinom to simulate, because, as noted above, the problem is in essence a binomial distribution:
> bin <- rbinom(1000, 50, 0.5)

In the command above, we ask R to simulate 1000 repetitions of the study, each with 50 customers, where under the null hypothesis the probability of preferring A is 0.50. To see the results of the simulation, we use the function table:
> table(bin)
bin
 14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33
  1   1   2  11  16  24  47  60  83  94 107 132 114  98  65  44  44  26  14  12
 34  35
  2   3

The output shows that of the 1000 studies only 3 had 35 customers who preferred coffee A (under the condition that there is no difference between the two coffees, or more precisely that p = 0.5). In other words:
P(X ≥ 35 | p=0.50) = 3/1000 = 0.003
We can also display these frequencies in a histogram (drawn with hist(bin)):

[Figure: histogram titled "Histogram of bin"; x-axis: bin, y-axis: Frequency.]

Of course, we can run another simulation with 100,000 repetitions of the study (instead of 1000) and compute P(X ≥ 35 | p=0.50).
> bin <- rbinom(100000, 50, 0.5)
> table(bin)
bin
   11    12    13    14    15    16    17    18    19    20    21    22    23
    4    17    40    83   197   462   946  1592  2719  4098  5892  7937  9733
   24    25    26    27    28    29    30    31    32    33    34    35    36
10822 11191 10799  9497  7925  5904  4185  2682  1562   893   455   223    98
   37    38    39    40
   31     5     7     1

This time there are more possible outcomes (because the number of simulations is larger). For example, some studies produced as few as 11 customers (the minimum) or as many as 40 (the maximum) who preferred coffee A. What we want to know is the number of studies in which 35 or more customers preferred coffee A, and from the table above the probability is:
> (223+98+31+5+7+1)/100000
[1] 0.00365

In other words, the probability P(X ≥ 35 | p=0.50) is very low (only about 0.4%), so we have evidence that the observed result is unlikely to be due to chance; that is, customers do differ in their preference for the two coffees.
The figure P = 0.0037 is the P-value. By scientific convention, any P-value below 0.05 (that is, below 5%) is regarded as "significant", meaning statistically significant.
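The simulated tail probability can be compared with the exact binomial tail. The sketch below computes both, and also calls binom.test with a one-sided alternative, which reports the same exact tail probability. The seed is arbitrary.

```r
set.seed(2006)                           # arbitrary seed
n_sim <- 100000

# Simulate the study 100000 times under the null hypothesis p = 0.5
bin <- rbinom(n_sim, size = 50, prob = 0.5)
p_sim <- mean(bin >= 35)                 # share of studies with 35 or more

# Exact tail probability P(X >= 35) = 1 - P(X <= 34)
p_exact <- 1 - pbinom(34, size = 50, prob = 0.5)

# One-sided exact binomial test of the observed result
bt <- binom.test(35, 50, p = 0.5, alternative = "greater")

print(c(simulated = p_sim, exact = p_exact, binom.test = bt$p.value))
```

The simulated value fluctuates from run to run around the exact probability, which is roughly 0.003.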
It is worth stressing once more how the P-value should be understood. The aim of the analysis above is to answer the question: if the two coffees are equally preferred (p = 0.5, the null hypothesis), what is the probability that a result like the one observed (35 or more of 50 customers preferring A) occurs? That is precisely how the P-value is obtained. The interpretation of a P-value is therefore conditional, and here the condition is p = 0.50. Readers can experiment further with p = 0.6 or p = 0.7 to see how the results change.
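As a starting point for that experiment, the sketch below evaluates the same tail probability under three assumed values of p. It uses the exact binomial tail rather than simulation; the variable names are illustrative.

```r
# Assumed true probabilities of preferring coffee A
p_true <- c(0.5, 0.6, 0.7)

# P(X >= 35) out of 50 customers under each assumed probability
tail_prob <- 1 - pbinom(34, size = 50, prob = p_true)
names(tail_prob) <- paste0("p=", p_true)
print(round(tail_prob, 4))
```

The probability of seeing 35 or more rises sharply as the assumed p moves away from 0.5, which shows how strongly the result depends on the condition under which it is computed.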
In practice the P-value has a great influence on the fate of a scientific paper. Many journals and scientists regard a study with a P-value above 0.05 as a "negative result", and the paper may be rejected for publication. For that reason, for most scientists P < 0.05 has become a "passport" to publishing research results. With P < 0.05 the paper has a chance of appearing in a journal and the author may become well known; with P > 0.05 the paper and the study may well sink into oblivion.

7.4 The logical problem with the P-value
But from the standpoint of reason and serious science, should we attach so much importance to the P-value? The answer is no. The P-value has many problems, and reliance on it, in the past as now, has been severely criticized by many people. Its greatest flaw is that it lacks logic. Indeed, if we take the trouble to re-examine the example above, we can summarize the course of a medical study (based on the P-value) as follows:

- State a main hypothesis (H+).
- From the main hypothesis, state a null hypothesis (H-).
- Collect the data (D).
- Analyse the data: compute the probability that D occurs if H- is true. In probability language, this is the step that computes the P-value, P(D | H-).

The P-value is thus the probability that the data D occur if (the emphasis is on "if") the null hypothesis H- is true. The P-value therefore tells us nothing directly about the truth of the main hypothesis H+; it only indirectly provides evidence for accepting the main hypothesis and rejecting the null hypothesis.
The logic behind the P-value can be understood as a proof by contradiction:

- Proposition 1: If the null hypothesis is true, then these data cannot occur;
- Proposition 2: The data occurred;
- Proposition 3 (conclusion): The null hypothesis cannot be true.

If this reasoning is hard to follow, consider a concrete example:

- If Mr Tuan has high blood pressure, then he cannot have hair loss (the two biological phenomena are unrelated, at least according to current medical knowledge);
- Mr Tuan has hair loss;
- Therefore, Mr Tuan cannot have high blood pressure.

The P-value therefore reflects the probability of proposition 3 only indirectly. This is an important flaw of the P-value, because it estimates the plausibility of the data and says nothing about the plausibility of a hypothesis. Inference based on the P-value is consequently far removed from reality and from experimental science. In experimental science, what researchers want to know is the probability of the main hypothesis given the data they have, not the probability of the data if the null hypothesis is true. In the notation used above, researchers want to know P(H+ | D), not P(D | H+) or P(D | H-).

7.5 The problem of testing multiple hypotheses (multiple tests of hypothesis)
As noted above, medical research is a process of testing hypotheses. In a study we rarely test just one hypothesis; usually many are tested at once. For example, in a study of the relationship between vitamin D and the risk of hip fracture, researchers may analyse the association between vitamin D and bone mineral density, and between vitamin D and fracture risk by sex, by age group, or by the clinical characteristics of the patients, and so on (see the example below). Each such analysis can be regarded as a hypothesis test. Here we face the problem of multiple hypotheses (multiple tests of hypothesis, also called multiple comparisons).
Table 2. Effect of vitamin D and calcium by patient characteristics

Patient characteristic   Calcium and vitamin D (1)   Placebo (1)   Relative risk and 95% confidence interval (2)
Age
  50-59                  29 (0.06)                   13 (0.03)     2.17 (1.13-4.18)
  60-69                  53 (0.09)                   71 (0.13)     0.74 (0.52-1.06)
  70-79                  93 (0.44)                   115 (0.54)    0.82 (0.62-1.08)
Body mass index
  <25                    69 (0.20)                   66 (0.19)     1.05 (0.75-1.47)
  25-30                  63 (0.14)                   74 (0.16)     0.87 (0.62-1.22)
  >30                    43 (0.09)                   59 (0.13)     0.73 (0.49-1.09)
Smoking
  Non-smoker             159 (0.14)                  178 (0.15)    0.90 (0.71-1.11)
  Current smoker         14 (0.14)                   16 (0.17)     0.85 (0.41-1.74)

Notes: (1) The number outside the parentheses is the number of patients who had a hip fracture during follow-up (7 years); the number in parentheses is the fracture rate in percent per year. (2) The relative risk (RR, explained in a later chapter) is estimated by dividing the fracture rate in the intervention group by the rate in the placebo group. If the 95% confidence interval includes 1, the difference between the 2 groups is not statistically significant; if the 95% confidence interval does not include 1, the difference between the 2 groups is regarded as statistically significant (p<0.05).
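The rule in note (2) can be applied mechanically to the rows of Table 2. The sketch below copies the relative risks and confidence limits from the table into a data frame (the column names are chosen here for illustration) and flags the intervals that exclude 1.

```r
# Relative risks and 95% confidence limits copied from Table 2
rr <- data.frame(
  subgroup = c("Age 50-59", "Age 60-69", "Age 70-79",
               "BMI <25", "BMI 25-30", "BMI >30",
               "Non-smoker", "Current smoker"),
  rr    = c(2.17, 0.74, 0.82, 1.05, 0.87, 0.73, 0.90, 0.85),
  lower = c(1.13, 0.52, 0.62, 0.75, 0.62, 0.49, 0.71, 0.41),
  upper = c(4.18, 1.06, 1.08, 1.47, 1.22, 1.09, 1.11, 1.74)
)

# An interval that does not contain 1 is taken as significant at the 5% level
rr$significant <- rr$lower > 1 | rr$upper < 1
print(rr)
```

Of the eight subgroup analyses, only one interval excludes 1, which is the kind of isolated finding the rest of this section warns about.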
Recall that each time we test a hypothesis we accept a 5% error (assuming we adopt p = 0.05 as the criterion for declaring a result statistically significant or not). The question raised by testing many hypotheses is this: if among n tests we declare k tests statistically significant (p<0.05), what is the probability that at least one of these declarations is wrong?
To answer, we begin with a simple example. In each test we accept an error probability of 0.05; in other words, the probability of being right is 0.95. If we test 3 hypotheses, and the tests are independent, the probability of being right in all three is 0.95 x 0.95 x 0.95 = 0.8574. Hence the probability of at least one error among the three declarations of statistical significance is 1 - 0.8574 = 0.1426 (about 14%).
In general, if we test n hypotheses and accept an error probability p in each test, the probability of at least 1 error in the n tests is $1 - (1 - p)^n$. When n = 10 and p = 0.05, the probability of at least one error reaches 40%.
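The formula is easy to tabulate. The helper `fwer` below is a hypothetical name introduced for this sketch; it assumes independent tests, as in the derivation above.

```r
# Probability of at least one error among n independent tests,
# each carried out with error probability alpha (hypothetical helper)
fwer <- function(n, alpha = 0.05) 1 - (1 - alpha)^n

n_tests <- c(1, 3, 10, 16, 20)
print(round(data.frame(n = n_tests, prob_any_error = fwer(n_tests)), 4))
```

The probability climbs quickly: it passes 40% at 10 tests and exceeds one half by 16 tests.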

The lesson from this reasoning is as follows: when we read a scientific paper in which the researchers carried out many different tests and report results with p < 0.05, we have reason to believe that the probability that at least one of those so-called "significant" results is wrong is very high. Results of such analyses should be treated with caution.
For someone doing research, the message of the multiple-testing problem is: do not go fishing. A word on the idea of "fishing" in science. Imagine a researcher who wants to study the efficacy of a new treatment for patients with joint pain. After reviewing the published literature, the researcher decides to run a study of 300 patients: half are treated with the new therapy and half receive placebo only. After follow-up and data collection, the researcher analyses the data and finds that the difference between the two groups is not statistically significant; in other words, the treatment appears to have no effect. Unwilling to give up, the researcher sets out to find a statistically significant result at any cost, dividing the patients into subgroups by age (over or under 50), by sex (male or female), by economic status (high or low income) and by habit (plays sport or not). Altogether the researcher has 16 different subgroups and can run 16 tests. The researcher "discovers" that the treatment is statistically significant in women over 50 with high income, and this result is published. This way of working is what researchers call a "fishing expedition". Naturally, such a result has no scientific value and cannot be trusted. (With 16 different tests and p = 0.05, the probability that at least one test gives a "significant" result is about 56%, so it is no surprise that a fish was caught!)
To let the P-value retain its original meaning when many hypotheses are tested, researchers have proposed the Bonferroni adjustment (named after the Italian statistician who suggested it). Under this proposal, before conducting the study the researcher must state clearly which hypothesis is primary and which are secondary. The researcher must also plan how many hypotheses will be tested before starting the data analysis. For example, if the researcher plans 20 comparisons and wants to keep the overall error level at 0.05, then instead of using 0.05 as the criterion for declaring a result "significant", the researcher must use 0.0025 (0.05 divided by 20). In other words, only when a result has a p value below 0.0025 (or in general p/n) is the researcher entitled to declare it statistically significant.
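In R the adjustment can be applied either by lowering the threshold or, equivalently, by inflating the p-values with p.adjust. The p-values below are made up purely for illustration, and 20 planned comparisons are assumed as in the text.

```r
# Hypothetical p-values from 4 of the 20 planned comparisons
p <- c(0.001, 0.004, 0.020, 0.049)
n_tests <- 20

# Approach 1: compare each p-value with the lowered threshold 0.05/20
threshold <- 0.05 / n_tests
significant <- p < threshold

# Approach 2: multiply p-values by the number of tests (capped at 1)
# and compare with the usual 0.05
p_adj <- p.adjust(p, method = "bonferroni", n = n_tests)

print(data.frame(p = p, p_adj = p_adj, significant = significant))
```

Both approaches lead to the same decisions; here only the smallest p-value survives the adjustment, even though all four are below 0.05.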
The P-value, although extremely widely used in scientific research, is not the final verdict on a study or a hypothesis. Yet in practice scientists depend too heavily on the P-value for inference and announce discoveries that are later shown to be wrong. It can be said that this abuse of, and blind dependence on, the P-value has impoverished science, biomedicine in particular. Every day we read or hear about contradictory scientific findings (one study shows that coffee is good for health, another that it is harmful; the painkiller aspirin is reported to reduce the risk of cancer, yet a recent study suggests that aspirin may increase the risk of breast cancer, and so on). Sometimes the public cannot tell which findings are real and which are false positives. According to an analysis by Berger and Sellke, about 25% of findings with p < 0.05 are false positives [2].
We should therefore not depend too heavily on the P-value. Not every study with p<0.05 is a success, nor every study with p>0.05 a failure. A finding with p>0.05 can still be a meaningful one. What matters is how to estimate the plausibility of a hypothesis once the actual data are in hand, that is, how to estimate P(H+ | D). Estimating P(H+ | D) requires Bayes' theorem, an approach that lies outside the scope of this book. Readers who want to learn more can consult articles by the author or by James Berger, starting from the references below.
Ti liu tham kho:
[1] Wulff et al., Statistics in Medicine 1987; 6:3-10.
[2] Berger JO, Sellke T. Testing a point null hypothesis: the irreconcilability of
P-values and evidence. Journal of the American Statistical Association 1987;
82:112-20.


8
Graphical data analysis
Visual presentation matters a great deal. A good graph can make a strong
impression on the readers of a scientific paper, and often comes to stand for
the whole study. A graph is therefore one of the most effective means of
emphasizing a paper's message. Graphs are usually used to show trends and
results by group, but they can also present data compactly. Graphs that are
easy to understand and rich in content are invaluable. Researchers therefore
need to think creatively about how to display their key data graphically, and
graphical analysis plays an extremely important role in statistical analysis.
One might say that a statistical analysis without graphs is not meaningful.
The R language offers many ways to design a neat and attractive graph. Most
graphics functions are built into R, but some more sophisticated and complex
graphs can be produced with specialized packages such as lattice, which can be
downloaded from the R website. This chapter shows how to draw commonly used
graphs with the standard functions in R.

8.1 The graphics environment and plot design

8.1.1 Several plots in one window
By default, R draws one plot per window. We can, however, draw several plots
in one window by using the par function. For example, par(mfrow=c(1,2))
divides the window into 1 row and 2 columns, so that two plots can be shown
side by side, while par(mfrow=c(2,3)) divides the window into 2 rows and 3
columns, so that 6 plots can be shown in one window. When the plots are done,
we return to one plot per window with par(mfrow=c(1,1)).
The following example creates a data set with two variables, x and y, by
simulation (that is, the data are generated entirely by R). We then divide the
window into 2 rows and 2 columns and display four kinds of plot of the
simulated data:
> par(mfrow=c(2,2))
> N <- 200
> x <- runif(N, -4, 4)
> y <- sin(x) + 0.5*rnorm(N)
> plot(x, y, main="Scatter plot of y and x")
> hist(x, main="Histogram of x")
> boxplot(y, main="Box plot of y")
> barplot(x, main="Bar chart of x")
> par(mfrow=c(1,1))

Figure 1. Dividing the window into 2 rows and 2 columns and displaying 4 plots
in the same window.
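A slightly more defensive variant of the same idea, offered as a sketch rather than a rule: when par() changes a setting it returns the previous value, so the old layout can be stored and restored without assuming that it was c(1,1). The object name op is arbitrary, and set.seed is added only to make the simulated data reproducible.

```r
set.seed(1)                      # make the simulated data reproducible
N <- 200
x <- runif(N, -4, 4)
y <- sin(x) + 0.5 * rnorm(N)

op <- par(mfrow = c(1, 2))       # op holds the settings that were replaced
plot(x, y, main = "Scatter plot of y and x")
hist(x, main = "Histogram of x")
par(op)                          # restore the previous layout
```

This pattern is convenient inside functions, because the caller's layout is left as it was found.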

8.1.2 Labelling the vertical and horizontal axes

A plot usually has a vertical axis (y-axis) and a horizontal axis (x-axis).
Because variables are often referred to by abbreviations, a plot needs a
descriptive label for each variable so that it is easy to follow. In the
following example, the left-hand plot has no labels other than the names of
the original variables (x and y), whereas the right-hand plot has more
readable labels.
> par(mfrow=c(1,2))
> N <- 200
> x <- runif(N, -4, 4)
> y <- sin(x) + 0.5*rnorm(N)
> plot(x, y)
> plot(x, y, xlab="X factor",
       ylab="Production",
       main="Production and x factor \n Second line of title here")
> par(mfrow=c(1,1))

In the commands above, xlab (short for x label) and ylab (short for y label)
name the horizontal and vertical axes, and main gives the plot its title. Note
that the \n in main starts a second line (useful when the title is too long).

Figure 2. The left-hand plot has no labels; the right-hand plot has labels for
the vertical and horizontal axes and a title.

In addition, we can use the title function, with its main and sub arguments,
to add a title and a subtitle:
> plot(x, y, xlab="Time",
       ylab="Production")
> title(main="Plot of production and x factor",
        sub="Figure 1")

8.1.3 Setting the limits of the axes

If no limits are supplied for the vertical and horizontal axes, R works them
out automatically from the data. We can, however, control the plot by using
xlim and ylim to tell R the exact limits of the two axes:
> plot(x, y, xlab="X factor",
       ylab="Production",
       main="Plot of production and x factor",
       xlim=c(-5, 5),
       ylim=c(-3, 3))

8.1.4 Plot types and line types

Within a series of plots, we can ask R to draw several different plot types
and line types.
> par(mfrow=c(2,2))
> plot(y, type="l"); title("lines")
> plot(y, type="b"); title("both")
> plot(y, type="o"); title("overstruck")
> plot(y, type="h"); title("high density")

Figure 3. Plot types and line types.

In addition, we can draw different kinds of line with lty, as follows:
> par(mfrow=c(2,2))
> plot(y, type="l", lty=1); title(main="Production data", sub="lty=1")
> plot(y, type="l", lty=2); title(main="Production data", sub="lty=2")
> plot(y, type="l", lty=3); title(main="Production data", sub="lty=3")
> plot(y, type="l", lty=4); title(main="Production data", sub="lty=4")

Figure 4. Effect of lty.

8.1.5 Colours, frames and symbols

We can control the colour of a plot with the col argument. The default value
of col is 1 (black). We can change colours as we wish, either by giving a
number or by writing the colour name, such as "red", "blue", "green",
"orange", "yellow", "cyan", and so on.
The following example draws three lines in red, blue and green (on top of an
initial line in the default colour):
> plot(runif(10), ylim=c(0,1), type='l')
> for (i in c('red', 'blue', 'green'))
  {
     lines(runif(10), col=i)
  }
> title(main="Lines in various colours")

We can also vary the thickness of each line:
> plot(runif(5), ylim=c(0,1), type='n')
> for (i in 5:1)
  {
     lines(runif(5), col=i, lwd=i)
  }
> title(main="Varying the line thickness")

The appearance of a plot can also be changed with type, as follows:
> op <- par(mfrow=c(3,2))
> plot(runif(5), type = 'p',
       main = "plot type 'p' (points)")
> plot(runif(5), type = 'l',
       main = "plot type 'l' (lines)")
> plot(runif(5), type = 'b',
       main = "plot type 'b' (both points and lines)")
> plot(runif(5), type = 's',
       main = "plot type 's' (stair steps)")
> plot(runif(5), type = 'h',
       main = "plot type 'h' (histogram)")
> plot(runif(5), type = 'n',
       main = "plot type 'n' (no plot)")
> par(op)

The frame of a plot is controlled by the bty argument, which takes the
following values:
bty="n"    No frame around the plot
bty="o"    A frame on all 4 sides of the plot
bty="c"    A three-sided frame in the shape of the letter C
bty="l"    A two-sided frame in the shape of the letter L
bty="7"    A two-sided frame in the shape of the digit 7
The best way for readers to become familiar with these options is to try them
out in R.
The plotting symbol can also be changed by supplying a number to pch (plotting
character). The commonly used symbols are numbered from 0 to 25. For example:
> plot(x, y, col="red", pch=16, bty="l")

Figure 4. Effect of pch=16, col="red" and bty="l".
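To see what each pch number looks like, a small chart of the symbols can be drawn directly. The sketch below is one possible layout (two rows of 13 symbols); the helper variables col.pos and row.pos are invented for this illustration, and the grey fill only affects symbols 21 to 25, which accept a background colour.

```r
# Draw plotting symbols 0 to 25 on a 2-row grid, labelled with their pch number
pch.values <- 0:25
col.pos <- (pch.values %% 13) + 1        # column 1..13
row.pos <- 2 - (pch.values %/% 13)       # row 2 for 0..12, row 1 for 13..25

plot(col.pos, row.pos, pch = pch.values, cex = 2,
     xlim = c(0.5, 13.5), ylim = c(0.5, 2.5),
     axes = FALSE, xlab = "", ylab = "",
     main = "Available symbols", bg = "grey")
text(col.pos, row.pos - 0.3, labels = pch.values, cex = 0.8)
```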

8.1.6 Legends

The legend function is very useful for annotating a plot and helps the reader
understand it better. Its use is illustrated by the following example:
> N <- 200
> x <- runif(N, -4, 4)
> y <- x + 0.5*rnorm(N)
> plot(x, y, pch=16, main="Scatter plot of y and x")
> reg <- lm(y~x)
> abline(reg)
> legend(2, -2, c("Production","Regression line"), pch=16,
         lty=c(0,1))

The arguments legend(2,-2) place the legend at 2 on the horizontal axis
(x-axis) and -2 on the vertical axis (y-axis).

Figure 5. Effect of legend.

8.1.7 Writing text on a plot

Most graphics software provides no way of writing text or notes on a plot, or
provides one that is very limited. R has the mtext() function, which lets us
place text or explanations alongside a plot.
Starting from the bottom of the plot (side=1), the sides are numbered
clockwise up to side 4. The plot command in the following example prints no
axis labels and no title, only a frame. In this example we use cex (character
expansion) to control the size of the text. By default cex=1; with cex=2 the
text is twice the default size. The text() command lets us place text at a
specific position. The first text() command places the quoted string centred
at x=15, y=4.3. With adj we can also left-align the text (adj=0), so that the
coordinates mark the starting point of the text, or right-align it (adj=1), as
in the last command.
> plot(y, xlab=" ", ylab=" ", type="n")
> mtext("Text on side 1, cex=1", side=1, cex=1)
> mtext("Text on side 2, cex=1.2", side=2, cex=1.2)
> mtext("Text on side 3, cex=1.5", side=3, cex=1.5)
> mtext("Text on side 4, cex=2", side=4, cex=2)
> text(15, 4.3, "text(15, 4.3)")
> text(35, 3.5, adj=0, "text(35, 3.5), left aligned")
> text(40, 5, adj=1, "text(40, 5), right aligned")

8.1.8 Adding symbols to a plot

abline() can be used to draw a straight line, with the following arguments:

abline(a,b): the straight line with intercept a and slope b (for example, a
linear regression line).
abline(h=30): a horizontal line at y=30.
abline(v=12): a vertical line at x=12.
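The three forms listed above can be tried on simulated data. The sketch below is only an illustration: the data are generated with a true intercept of 10 and slope of 2, and the last call shows that abline also accepts a fitted lm object, as used elsewhere in this chapter.

```r
set.seed(2)
x <- runif(50, 0, 20)
y <- 10 + 2 * x + rnorm(50, sd = 3)     # simulated data: intercept 10, slope 2

plot(x, y, pch = 16)
abline(a = 10, b = 2)                   # line with intercept 10 and slope 2
abline(h = 30, lty = 2)                 # horizontal line at y = 30
abline(v = 12, lty = 3)                 # vertical line at x = 12

fit <- lm(y ~ x)
abline(fit, col = "red")                # regression line estimated from the data
coef(fit)
```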

We can also add an arrow to a plot to point out a particular data point.
> N <- 200
> x <- runif(N, -4, 4)
> y <- x + 0.5*rnorm(N)
> plot(x, y, pch=16, main="Scatter plot of y and x")

Suppose we want to mark the centre of the data, at x=0 and y=0. We first use
arrows to draw an arrow. In the commands below, arrows(-1, 1, 1.5, 1.5) means
that the arrow starts at the coordinates x=-1, y=1 and ends at x=1.5, y=1.5.
The text(0, 1, ...) command asks R to write the label ("Trung tam", meaning
"centre") at x=0, y=1.
> arrows(-1, 1.0, 1.5, 1.5)
> text(0, 1, "Trung tam", cex=0.7)

8.2 Data for graphical analysis

Having looked at the graphics environment and the options for designing a
plot, we can now use some common functions to draw plots of data. Plots fall
into 2 main classes: plots that describe a single variable, and plots of the
relationship between two or more variables. Since a variable can be continuous
or discrete, in practice we have 4 kinds of plot. In the following sections we
go through these kinds of plot, from simple to complex.
Perhaps the best way to learn how to draw graphs in R is with a real data set.
Returning to example 2 of an earlier chapter, we have a data set of 8 columns
(variables): id, sex, age, bmi, hdl, ldl, tc and tg. (Note: id is the
identifier of the 50 study subjects; sex is the sex (male or female); age is
the age; bmi is the body mass index; hdl is high density cholesterol; ldl is
low density cholesterol; tc is total cholesterol; and tg is triglycerides.)
The data are stored in the directory c:\works\insulin under the name chol.txt.
Before drawing any graphs, we begin by reading these data into R.
> setwd("c:/works/stats")
> cong <- read.table("chol.txt", header=TRUE,
                     na.strings=".")
> attach(cong)
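For readers who do not have chol.txt at hand, the effect of header=TRUE and na.strings="." can be seen on a throwaway file. The sketch below writes a tiny file with made-up values to a temporary location (the object names tmp and demo are arbitrary) and reads it back.

```r
# Write a tiny whitespace-delimited file in a similar layout (values are made up)
tmp <- tempfile(fileext = ".txt")
writeLines(c("id sex age bmi",
             "1 Nam 57 17",
             "2 Nu  .  18",
             "3 Nu  60 ."), tmp)

# header=TRUE takes names from the first line; na.strings="." turns "." into NA
demo <- read.table(tmp, header = TRUE, na.strings = ".")
str(demo)
colSums(is.na(demo))     # number of missing values per column
unlink(tmp)
```

Because the dots are converted to NA while the file is read, the age and bmi columns are still stored as numbers rather than text.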

Alternatively, to make the examples easier to follow, we can enter the data
with the following commands:

sex <- c("Nam", "Nu", "Nu", "Nam",
         "Nam", "Nu", "Nam", "Nam", "Nam", "Nu",
         "Nu", "Nam", "Nu", "Nam", "Nam", "Nu", "Nu", "Nu",
         "Nu", "Nu", "Nu", "Nu", "Nu", "Nu", "Nam", "Nam",
         "Nu", "Nam", "Nu", "Nu", "Nu", "Nam", "Nam", "Nu",
         "Nu", "Nam", "Nu", "Nam", "Nu", "Nu", "Nam",
         "Nu", "Nam", "Nam", "Nam", "Nu", "Nam", "Nam", "Nu",
         "Nu")
age <- c(57, 64, 60, 65, 47, 65, 76, 61, 59, 57,
         63, 51, 60, 42, 64, 49, 44, 45, 80, 48,
         61, 45, 70, 51, 63, 54, 57, 70, 47, 60,
         60, 50, 60, 55, 74, 48, 46, 49, 69, 72,
         51, 58, 60, 45, 63, 52, 64, 45, 64, 62)
bmi <- c(17, 18, 18, 18, 18, 18, 19, 19, 19, 19, 20,
         20, 20, 20, 20, 20, 21, 21, 21, 21, 21, 21,
         21, 21, 22, 22, 22, 22, 22, 22, 22, 22, 22,
         22, 23, 23, 23, 23, 23, 23, 23, 23, 24, 24,
         24, 24, 24, 24, 25, 25)
hdl <- c(5.000, 4.380, 3.360, 5.920, 6.250, 4.150, 0.737, 7.170,
         6.942, 5.000, 4.217, 4.823, 3.750, 1.904, 6.900,
         0.633, 5.530, 6.625, 5.960, 3.800, 5.375, 3.360, 5.000,
         2.608, 4.130, 5.000, 6.235, 3.600, 5.625, 5.360, 6.580,
         7.545, 6.440, 6.170, 5.270, 3.220, 5.400, 6.300,
         9.110, 7.750, 6.200, 7.050, 6.300, 5.450, 5.000,
         3.360, 7.170, 7.880, 7.360, 7.750)
ldl <- c(2.0, 3.0, 3.0, 4.0, 2.1, 3.0, 3.0, 3.0, 3.0,
         2.0, 5.0, 1.3, 1.2, 0.7, 4.0, 4.1, 4.3, 4.0,
         4.3, 4.0, 3.1, 3.0, 1.7, 2.0, 2.1, 4.0, 4.1,
         4.0, 4.2, 4.2, 4.4, 4.3, 2.3, 6.0, 3.0, 3.0,
         2.6, 4.4, 4.3, 4.0, 3.0, 4.1, 4.4, 2.8, 3.0,
         2.0, 1.0, 4.0, 4.6, 4.0)
tc <- c(4.0, 3.5, 4.7, 7.7, 5.0, 4.2, 5.9, 6.1, 5.9,
        4.0, 6.2, 4.1, 3.0, 4.0, 6.9, 5.7, 5.7, 5.3,
        7.1, 3.8, 4.3, 4.8, 4.0, 3.0, 3.1, 5.3, 5.3,
        5.4, 4.5, 5.9, 5.6, 8.3, 5.8, 7.6, 5.8, 3.1,
        5.4, 6.3, 8.2, 6.2, 6.2, 6.7, 6.3, 6.0, 4.0,
        3.7, 6.1, 6.7, 8.1, 6.2)
tg <- c(1.1, 2.1, 0.8, 1.1, 2.1, 1.5, 2.6, 1.5, 5.4, 1.9,
        1.7, 1.0, 1.6, 1.1, 1.5, 1.0, 2.7, 3.9, 3.0, 3.1,
        2.2, 2.7, 1.1, 0.7, 1.0, 1.7, 2.9, 2.5, 6.2, 1.3,
        3.3, 3.0, 1.0, 1.4, 2.5, 0.7, 2.4, 2.4, 1.4, 2.7,
        2.4, 3.3, 2.0, 2.6, 1.8, 1.2, 1.9, 3.3, 4.0, 2.5)

cong <- data.frame(sex, age, bmi, hdl, ldl, tc, tg)

Once the data are available, we are ready to analyse them graphically, as
follows.

8.3 Plots for one discrete variable: barplot

The variable sex in the data above takes two values (Nam, male, and Nu,
female); that is, it is a discrete variable. We want to know the frequency of
each sex (how many men and how many women) and to draw a simple plot. To do
this, we first use the table function to obtain the frequencies:
> sex.freq <- table(sex)
> sex.freq
sex
Nam  Nu
 22  28

There are 22 men and 28 women in the study. We then use the barplot function
to display these frequencies:
> barplot(sex.freq, main="Frequency of males and females")

The same plot can be obtained with a single, simpler command (Figure 8a):
> barplot(table(sex), main="Frequency of males and females")

Figure 8a. Frequency of each sex shown as vertical bars.
Figure 8b. Frequency of each sex shown as horizontal bars.

Instead of showing the frequencies of men and women as 2 columns, we can show
them as two rows with the argument horiz = TRUE, as follows (see the result in
Figure 8b):
> barplot(sex.freq,
          horiz = TRUE,
          col = rainbow(length(sex.freq)),
          main="Frequency of males and females")

8.4 Plots for two discrete variables: barplot

Age is a continuous variable. We can divide the patients into several groups
according to age. The cut function cuts a continuous variable into several
discrete groups. For example:
> ageg <- cut(age, 3)
> table(ageg)
ageg
  (42,54.7] (54.7,67.3]   (67.3,80]
         19          24           7

This divides the variable age into 3 groups: 42 to 54.7 years forms group 1,
54.7 to 67.3 years group 2, and 67.3 to 80 years group 3. Group 1 has 19
patients, and groups 2 and 3 have 24 and 7 patients.
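The cut points chosen by cut(age, 3) are equal-width and not very memorable. As a sketch of an alternative, cut also accepts explicit break points and labels; the breaks 40, 55, 67 and 80 and the label text below are choices made for this illustration, not values used elsewhere in the chapter. The age vector is repeated so that the example runs on its own.

```r
age <- c(57, 64, 60, 65, 47, 65, 76, 61, 59, 57,
         63, 51, 60, 42, 64, 49, 44, 45, 80, 48,
         61, 45, 70, 51, 63, 54, 57, 70, 47, 60,
         60, 50, 60, 55, 74, 48, 46, 49, 69, 72,
         51, 58, 60, 45, 63, 52, 64, 45, 64, 62)

# Explicit cut points; the intervals are (40,55], (55,67] and (67,80]
ageg2 <- cut(age, breaks = c(40, 55, 67, 80),
             labels = c("41-55", "56-67", "68-80"))
table(ageg2)
```

Each interval is open on the left and closed on the right, so a subject aged exactly 55 falls into the first group.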
We now want to know how many patients there are in each age group and each
sex, using the table command:
> age.sex <- table(sex, ageg)
> age.sex
     ageg
sex   (42,54.7] (54.7,67.3] (67.3,80]
  Nam        10          10         2
  Nu          9          14         5

The result shows that there are 10 male and 9 female patients in the first age
group, 10 men and 14 women in the second age group, and so on. To display the
frequencies of these two variables, we again use barplot:
> barplot(age.sex, main="Number of males and females in each age group")

Figure 7a. Frequency of sex and age group shown as stacked columns.
Figure 7b. Frequency of sex and age group shown as two columns side by side.

In Figure 7a each column represents one age group; the dark part of the column
is the frequency of women and the light part is the frequency of men. Instead
of showing the frequencies of men and women in one column, we can show them as
2 columns with beside=TRUE, as follows (Figure 7b):
> barplot(age.sex, beside=TRUE, xlab="Age group")
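Because the age groups differ in size, it can also help to plot proportions rather than counts. The sketch below re-enters the counts from the table above by hand so that it runs on its own; prop.table with margin = 2 divides each column by its total, and legend.text = TRUE asks barplot to label the bars with the row names. The object name age.sex.prop is arbitrary.

```r
# The sex-by-age-group counts from the table above, entered by hand
age.sex <- matrix(c(10, 9, 10, 14, 2, 5), nrow = 2,
                  dimnames = list(sex  = c("Nam", "Nu"),
                                  ageg = c("(42,54.7]", "(54.7,67.3]",
                                           "(67.3,80]")))

# Proportion of each sex within each age group (each column sums to 1)
age.sex.prop <- prop.table(age.sex, margin = 2)
round(age.sex.prop, 2)

barplot(age.sex.prop, beside = TRUE, legend.text = TRUE,
        xlab = "Age group", ylab = "Proportion")
```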

8.5 Pie charts

The frequencies of a discrete variable can also be shown as a pie chart. The
following example draws pie charts of the age-group frequencies. Figure 8a
shows 3 age groups, and Figure 8b shows 5 age groups:
> pie(table(ageg))
> pie(table(cut(age,5)))

Figure 8a. Frequencies for 3 age groups.
Figure 8b. Frequencies for 5 age groups.

8.6 Plots for one continuous variable: stripchart and hist

8.6.1 Stripchart

A strip chart shows how the values of a variable are spread along a continuous
scale. For example, to examine the spread of triglyceride (tg), the
stripchart() function helps:
> stripchart(tg,
             main="Strip chart for triglycerides", xlab="mg/L")

We see that the tg values have gaps, especially among subjects with high tg.
While most subjects have tg below 5, there are 2 subjects with very high tg
(>5).

8.6.2 Histogram
Age is a continuous variable. To draw the frequency distribution of age, we
simply give the command hist(age). As mentioned above, we can improve this
graph by adding a main title (main) and labels for the horizontal axis (xlab)
and vertical axis (ylab):
> hist(age)
> hist(age, main="Frequency distribution by age group",
       xlab="Age group", ylab="No of patients")

Figure 9a. The vertical axis is the number of patients (study subjects) and
the horizontal axis is age. For example, there are 6 patients aged 40 to 45
and 4 patients aged 70 to 80.
Figure 9b. A title and labels for the vertical and horizontal axes added with
main, xlab and ylab.

We can also turn the plot into a probability density plot with
plot(density()), as follows (result in Figure 10a):
> plot(density(age))

Figure 10a. Probability density of the variable age.
Figure 10b. Probability density of the variable age with a smoothing width
based on the interquartile range.

We can overlay the two graphs, using the interquartile range to set the
smoothing width, as follows (see the result in Figure 10b):
> iqr <- diff(summary(age)[c(2,5)])
> des <- density(age, width=0.5*iqr)
> hist(age, xlim=range(des$x), probability=TRUE)
> lines(des, lty=2)

In the graph above we used a width of 0.5*iqr (relatively narrow). We can
change this to 1.5*iqr to give a smoother, more realistic distribution:
> iqr <- diff(summary(age)[c(2,5)])
> des <- density(age, width=1.5*iqr)
> hist(age, xlim=range(des$x), probability=TRUE)
> lines(des, lty=2)

We can also turn the plot into a cumulative distribution plot with the plot
and sort functions, as follows:
> n <- length(age)
> plot(sort(age), (1:n)/n, type="s", ylim=c(0,1))

The result is shown in Figure 11.

Figure 11. Cumulative distribution of the variable age.
Figure 12. Checking whether the variable age follows a normal distribution.

In Figure 11, the vertical axis is the cumulative probability and the
horizontal axis is age from low to high. For example, from the plot we can see
that about 50% of the subjects are younger than 60.
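The reading taken from the plot can be confirmed numerically. As a sketch, the base R function ecdf builds the empirical cumulative distribution function, which can then be evaluated at any age; the name Fn is arbitrary, and the age vector is repeated so that the example runs on its own.

```r
age <- c(57, 64, 60, 65, 47, 65, 76, 61, 59, 57,
         63, 51, 60, 42, 64, 49, 44, 45, 80, 48,
         61, 45, 70, 51, 63, 54, 57, 70, 47, 60,
         60, 50, 60, 55, 74, 48, 46, 49, 69, 72,
         51, 58, 60, 45, 63, 52, 64, 45, 64, 62)

Fn <- ecdf(age)          # empirical cumulative distribution function
Fn(59)                   # proportion of subjects aged 59 or younger
quantile(age, 0.5)       # the median age

plot(Fn, main = "Cumulative distribution of age",
     xlab = "age", ylab = "Cumulative probability")
```

Here Fn(59) is 0.5, since 25 of the 50 subjects are aged 59 or younger.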
To check whether the distribution of age follows a normal distribution, we can
use the qqnorm function.
> qqnorm(age)

The horizontal axis of the plot shows the quantiles expected under the normal
distribution (theoretical quantiles) and the vertical axis shows the quantiles
of the data (sample quantiles). If age followed a normal distribution, the
points would lie along a straight diagonal line (that is, the theoretical and
sample quantiles would agree). From Figure 12, however, we see that the
distribution of age does not quite follow a normal distribution.
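Judging a Q-Q plot by eye is easier with a reference line, and a formal test can supplement the picture. The sketch below adds qqline (which draws a line through the first and third quartiles) and runs the Shapiro-Wilk test from base R; the test is an addition for illustration and is not part of the original analysis. The age vector is repeated so that the example runs on its own.

```r
age <- c(57, 64, 60, 65, 47, 65, 76, 61, 59, 57,
         63, 51, 60, 42, 64, 49, 44, 45, 80, 48,
         61, 45, 70, 51, 63, 54, 57, 70, 47, 60,
         60, 50, 60, 55, 74, 48, 46, 49, 69, 72,
         51, 58, 60, 45, 63, 52, 64, 45, 64, 62)

qqnorm(age)
qqline(age, lty = 2)          # reference line through the quartiles

# Shapiro-Wilk test: a small p value suggests departure from normality
sw <- shapiro.test(age)
sw$p.value
```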

8.6.3 Box plots

To draw a box plot of the variable tc, we simply give the command:
> boxplot(tc, main="Box plot of total cholesterol", ylab="mg/L")

Figure 13. In this plot we see that the median is about 5.6 mg/L, 25% of total
cholesterol values are below 4.1, and 75% are below 6.2. The lowest total
cholesterol is about 3, and the highest is above 8 mg/L.

In the following plot we compare tc between men and women:
> boxplot(tc ~ sex, main="Box plot of total cholesterol by sex",
          ylab="mg/L")

The result is shown in Figure 14a. We can change the appearance of the graph
with the argument horizontal=TRUE and change the colour with the argument col,
as follows (Figure 14b):
> boxplot(tc~sex, horizontal=TRUE, main="Box plot of total cholesterol",
          xlab="mg/L", col = "pink")

Figure 14a. In this plot we see that the median total cholesterol of women is
lower than that of men, but the spread of the two groups differs little.
Figure 14b. Total cholesterol for each sex, with colour and horizontal boxes.

8.6.4 Bar charts

To draw a bar chart of the variable bmi, we simply give the command:
> barplot(bmi, col="blue")

Figure 15. Bar chart of the variable bmi.

8.6.5 Dot charts

Another graph that conveys the same information as barplot is dotchart:
> dotchart(bmi, xlab="Body mass index (kg/m^2)",
           main="Distribution of BMI")

Figure 16. Dot chart of the variable bmi.

8.7 Graphical analysis of two continuous variables

8.7.1 Scatter plots
To examine the relationship between two variables we use a scatter plot. To
draw a scatter plot of the relationship between tc and hdl we use the plot
function. The first argument of plot is the horizontal axis (x-axis) and the
second is the vertical axis (y-axis). To examine the relationship between tc
and hdl we simply give the command:
> plot(tc, hdl)

Figure 17. Relationship between tc and hdl. In this plot, hdl is on the
vertical axis and tc is on the horizontal axis.

We want to distinguish the sexes (men and women) in the plot above. To do
this we use the ifelse function. In the following command, if sex=="Nam" the
point is drawn with symbol 16 (a filled circle); otherwise it is drawn with
symbol 22 (a square). Note that in these commands hdl is on the horizontal
axis and tc is on the vertical axis:
> plot(hdl, tc, pch=ifelse(sex=="Nam", 16, 22))
The result is Figure 18a. We can also replace the symbols with the letters M
(male) and F (female) (see Figure 18b):
> plot(hdl, tc, pch=ifelse(sex=="Nam", "M", "F"))

Figure 18a. Relationship between tc and hdl by sex, shown with two symbols.
Figure 18b. Relationship between tc and hdl by sex, shown with two letters.

We can also draw a linear regression line through the points by continuing
with the following commands:
> plot(hdl ~ tc, pch=16, main="Total cholesterol and HDL cholesterol",
       xlab="Total cholesterol", ylab="HDL cholesterol", bty="l")
> reg <- lm(hdl ~ tc)
> abline(reg)

The result is Figure 19a. We can also use a smoothing function to show the
relationship between the two variables. The following graph uses lowess (one
of the most commonly used smoothers) to smooth the tc and hdl data (Figure
19b). Because tc is on the horizontal axis, it is given to lowess first.
> plot(hdl ~ tc, pch=16,
       main="Total cholesterol and HDL cholesterol with LOWESS smooth function",
       xlab="Total cholesterol",
       ylab="HDL cholesterol",
       bty="l")
> lines(lowess(tc, hdl, f=2/3, iter=3), col="red")

Figure 19a. In the commands above, reg <- lm(hdl~tc) fits the equation
relating hdl to tc with a linear model (lm) and stores the result in the
object reg. The second command, abline(reg), asks R to draw the straight line
from the equation in reg.
Figure 19b. Instead of abline, we use the lowess function to show the
relationship between tc and hdl.

Readers can experiment with other values of this argument, such as f=1/2,
f=2/5 or even f=1/10.
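The effect of the smoother span f is easiest to see on data with a known curved shape. The sketch below uses simulated data (not the cholesterol data) and overlays three spans; the vector name spans and the choice of values are only for illustration. Smaller values of f follow the points more closely, larger values give a smoother curve.

```r
set.seed(3)
x <- runif(100, 0, 10)
y <- sin(x) + rnorm(100, sd = 0.3)      # simulated data with a curved trend

plot(x, y, pch = 16, col = "grey")
spans <- c(2/3, 1/2, 1/10)              # smoother spans to compare
for (i in seq_along(spans)) {
  # lowess(x, y) returns a list with the sorted x and the smoothed y values
  fit <- lowess(x, y, f = spans[i], iter = 3)
  lines(fit, lty = i, lwd = 2)
}
legend("topright", legend = paste("f =", round(spans, 2)),
       lty = 1:3, lwd = 2)
```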

8.8 Graphical analysis of several variables: pairs

We can examine the relationships among variables such as age, bmi, hdl, ldl
and tc with the pairs command. First, however, we must put these variables
into a data.frame containing only the variables to be plotted, and then use
the pairs function in R.
> lipid <- data.frame(age,bmi,hdl,ldl,tc)
> pairs(lipid, pch=16)

The result is a matrix of scatter plots, one for each pair of the five
variables.

The plot above can be improved with the following function, matrix.cor
(written by an author on the internet), which provides more interesting
information.

matrix.cor <- function(x, y, digits=2, prefix="", cex.cor)
{
   # work in unit coordinates inside the panel, then restore the old ones
   usr <- par("usr"); on.exit(par(usr = usr))
   par(usr = c(0, 1, 0, 1))
   r <- abs(cor(x, y))
   txt <- format(c(r, 0.123456789), digits=digits)[1]
   txt <- paste(prefix, txt, sep="")
   # use the supplied text size if given, otherwise scale to the panel width
   cex <- if (missing(cex.cor)) 0.8/strwidth(txt) else cex.cor
   test <- cor.test(x, y)
   # borrowed from printCoefmat
   Signif <- symnum(test$p.value, corr = FALSE, na = FALSE,
                    cutpoints = c(0, 0.001, 0.01, 0.05, 0.1, 1),
                    symbols = c("***", "**", "*", ".", " "))
   text(0.5, 0.5, txt, cex = cex * r)
   text(.8, .8, Signif, cex=cex, col=2)
}

We return to the lipid data and call matrix.cor as follows:
> pairs(lipid, lower.panel=panel.smooth,
        upper.panel=matrix.cor)

This graph gives us the correlation coefficients between all pairs of
variables. For example, the correlation between age and bmi is very low and
not statistically significant; the correlation between age and hdl is likewise
not statistically significant; and between age and tc it is 0.22. The highest
correlations are between ldl and tc (0.65) and between hdl and tc (0.62).
Between hdl and ldl the correlation is only 0.35, but it is statistically
significant (marked with stars).
Note that the plot above not only provides two main pieces of information (the
correlation coefficient and a scatter plot for each pair of variables) but
also shows which correlations are statistically significant (the star
symbols). The higher the correlation coefficient, the larger the font.

8.9 Some multi-purpose plots

8.9.1 Scatter plot with box plots

As described above, a scatter plot helps us visualize the relationship between
two continuous variables, such as age and hdl, and for this we use the plot
function. To examine the distribution of each variable, age or hdl, we can use
the boxplot function. But if we want to see the distributions of the two
variables and the relationship between them at the same time, we need to write
a few commands. The following commands draw a scatter plot of the relationship
between age and hdl together with a box plot of each variable.
op <- par(no.readonly = TRUE)
layout( matrix( c(2,1,0,3), 2, 2, byrow=TRUE ),
        c(1,6), c(4,1)
      )
par(mar=c(1,1,5,2))
plot(hdl ~ age,
     xlab='', ylab='',
     las = 1,
     pch=16)
rug(side=1, jitter(age, 5) )
rug(side=2, jitter(hdl, 20) )
title(main = "Age and HDL")
par(mar=c(1,2,5,1))
boxplot(hdl, axes=FALSE)
title(ylab='HDL', line=0)
par(mar=c(5,1,1,2))
boxplot(age, horizontal=TRUE, axes=FALSE)
title(xlab='Age', line=1)
par(op)

The result is a scatter plot titled "Age and HDL", with a box plot of HDL on
its left and a box plot of Age below it.

8.9.2 Scatter plot with point size given by a third variable

The plot above shows the relationship between age and hdl, with all points the
same size. But we know that hdl is also related to triglyceride (tg). To show
something of this three-way relationship, one approach is to draw each point
with a size that depends on the value of tg. We use the cex argument to draw
this three-way relationship, as follows:
> plot(age, hdl, cex=tg,
       pch=16,
       col="red",
       xlab="Age", ylab="HDL",
       main="Bubble plot")
> points(age, hdl, cex=tg)

8.9.3 Bar chart with cumulative frequencies

To draw the frequency distribution of a continuous variable we mainly use the
hist function, which gives the frequency for each group (age group, for
example). Sometimes, however, we also need the cumulative probability for each
group and want to draw both results in one plot. To do this we need to write a
function in the R language. The following function, called pareto (readers may
of course choose another name), was written for this purpose. The code for
pareto is as follows:
pareto <- function (x, main = "", ylab = "Value")
{
   op <- par(mar = c(5, 4, 4, 5) + 0.1, las = 2)
   if( ! inherits(x, "table") ) {
      x <- table(x)
   }
   x <- rev(sort(x))                       # largest group first
   plot(x, type = 'h', axes = FALSE, lwd = 16,
        xlab = "", ylab = ylab, main = main)
   axis(2)
   points(x, type = 'h', lwd = 12,
          col = heat.colors(length(x)) )
   y <- cumsum(x)/sum(x)                   # cumulative proportion
   par(new = TRUE)
   plot(y, type = "b", lwd = 3, pch = 7,
        axes = FALSE,
        xlab='', ylab='', main='')
   points(y, type = 'h')
   axis(4)
   par(las=0)
   mtext("Cumulated frequency", side=4, line=3)
   print(names(x))
   axis(1, at=1:length(x), labels=names(x))
   par(op)
}

We now apply pareto to draw the frequencies of the variable tg
(triglyceride). First, we divide tg into 10 groups with the cut function and
store the result in the object tg.group.
> tg.group <- cut(tg, 10)

Next, we apply the pareto function:
> pareto(tg.group)
 [1] "(0.695,1.25]" "(2.35,2.9]"   "(1.25,1.8]"   "(2.9,3.45]"   "(1.8,2.35]"
 [6] "(3.45,4]"     "(5.65,6.21]"  "(5.1,5.65]"   "(4.55,5.1]"   "(4,4.55]"
> title(main="Pareto plot of Tg with cumulated frequencies")

In this plot there are two vertical axes. The left-hand axis is the frequency
(number of patients) in each tg group, and the right-hand axis is the
cumulative frequency expressed as a probability (so the maximum is 1).

8.9.4 Clock plots

A clock plot, as the name suggests, displays a continuous variable around the
face of a clock. That is, instead of showing the values as columns or rows,
the plot shows them as sectors of a clock face. The following function
(clock.plot) was written to draw a clock plot:
clock.plot <- function (x, col = rainbow(n), ...) {
if( min(x)<0 ) x <- x - min(x)
if( max(x)>1 ) x <- x/max(x)
n <- length(x)
if(is.null(names(x))) names(x) <- 0:(n-1)
m <- 1.05
plot(0,
type = 'n', # do not plot anything
xlim = c(-m,m), ylim = c(-m,m),
axes = F, xlab = '', ylab = '', ...)
a <- pi/2 - 2*pi/200*0:200
polygon( cos(a), sin(a) )
v <- .02
a <- pi/2 - 2*pi/n*0:n
segments( (1+v)*cos(a), (1+v)*sin(a),
(1-v)*cos(a), (1-v)*sin(a) )
segments( cos(a), sin(a),
0, 0,
col = 'light grey', lty = 3)
ca <- -2*pi/n*(0:50)/50
for (i in 1:n) {
a <- pi/2 - 2*pi/n*(i-1)
b <- pi/2 - 2*pi/n*i
polygon( c(0, x[i]*cos(a+ca), 0),
c(0, x[i]*sin(a+ca), 0),
col=col[i] )
v <- .1
text((1+v)*cos(a), (1+v)*sin(a), names(x)[i])
}
}

We apply the clock.plot function to draw a plot of the variable ldl as
follows:
> clock.plot(ldl,
             main = "Distribution of LDL")

The result is a clock face titled "Distribution of LDL", with one sector for
each of the 50 subjects, labelled 0 to 49.

8.9.5 Plots with standard errors

In the following plot we have 5 groups (the values are simulated, not real
data), and each group has a mean (mean) and a 95% confidence interval (lcl and
ucl). Usually lcl = mean - 1.96*SE and ucl = mean + 1.96*SE (SE is the
standard error). We want to draw a plot of the 5 groups with their error bars.
The following commands are needed:
> group <- c(1,2,3,4,5)
> mean <- c(1.1, 2.3, 3.0, 3.9, 5.1)
> lcl <- c(0.9, 1.8, 2.7, 3.8, 5.0)
> ucl <- c(1.3, 2.4, 3.5, 4.1, 5.3)
> plot(group, mean, ylim=range(c(lcl, ucl)))
> arrows(group, ucl, group, lcl, length=0.5, angle=90,
         code=3)
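In practice the means and limits are usually computed from raw observations rather than typed in. The sketch below shows one way of doing this with tapply on simulated measurements (5 groups of 20 values, generated only for illustration); the names value, m, se, lcl and ucl are choices made here, and the limits use the mean plus or minus 1.96 standard errors as described above.

```r
set.seed(4)
# Simulated measurements: 5 groups of 20 observations (illustrative data)
group <- rep(1:5, each = 20)
value <- rnorm(100, mean = group, sd = 0.5)

m   <- tapply(value, group, mean)            # group means
n   <- tapply(value, group, length)          # group sizes
se  <- tapply(value, group, sd) / sqrt(n)    # standard errors
lcl <- m - 1.96 * se                         # lower 95% limits
ucl <- m + 1.96 * se                         # upper 95% limits

plot(1:5, m, ylim = range(c(lcl, ucl)), pch = 16,
     xlab = "group", ylab = "mean")
arrows(1:5, lcl, 1:5, ucl, length = 0.05, angle = 90, code = 3)
```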

Here is another simulation. We generate 10 values of x and y from a normal
distribution, and 10 values of the errors se.x and se.y from a uniform
distribution.
> x <- rnorm(10)
> y <- rnorm(10)
> se.x <- runif(10)
> se.y <- runif(10)
> plot(x, y, pch=22)
> arrows(x, y-se.y, x, y+se.y, code=3, angle=90,
         length=0.1)

8.9.6 Contour plots

R can draw contour plots in many different forms, depending on taste and on
the data. In the following commands we use simulation to draw a contour plot
of three variables x, y and z.
> N <- 50
> x <- seq(-1, 1, length=N)
> y <- seq(-1, 1, length=N)
> xx <- matrix(x, nrow=N, ncol=N)
> yy <- matrix(y, nrow=N, ncol=N, byrow=TRUE)
> z <- 1 / (1 + xx^2 + (yy + .2 * sin(10*yy))^2)
> contour(x, y, z, main = "Contour plot")
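The two coordinate matrices xx and yy exist only so that the formula can be evaluated at every combination of x and y. As a sketch of an alternative, the base R function outer does the same job in one step; the function name f and the object names z.outer and z.matrix are introduced here for the comparison.

```r
N <- 50
x <- seq(-1, 1, length.out = N)
y <- seq(-1, 1, length.out = N)

# f is evaluated at every combination, so z.outer[i, j] = f(x[i], y[j])
f <- function(x, y) 1 / (1 + x^2 + (y + 0.2 * sin(10 * y))^2)
z.outer <- outer(x, y, f)

# The same surface built with explicit coordinate matrices, for comparison
xx <- matrix(x, nrow = N, ncol = N)
yy <- matrix(y, nrow = N, ncol = N, byrow = TRUE)
z.matrix <- 1 / (1 + xx^2 + (yy + 0.2 * sin(10 * yy))^2)

all.equal(z.outer, z.matrix)
contour(x, y, z.outer, main = "Contour plot")
```

The function passed to outer must be vectorized, which holds here because it uses only arithmetic and sin.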

This graph can be turned into an image with the image function.
> image(z)

A few small but important changes:
> image(x, y, z,
        xlab="x",
        ylab="y")
> contour(x, y, z, lwd=3, add=TRUE)

Next are a few changes to draw a three-dimensional plot based on the sine function. The plot looks attractive, although in practice it is probably rarely used. It is shown here to illustrate the versatility of R.
> x <- seq(-10, 10, length= 30)
> y <- x
> f <- function(x,y) { r <- sqrt(x^2+y^2); 10 * sin(r)/r }
> z <- outer(x, y, f)
> z[is.na(z)] <- 1
> op <- par(bg = "white", mar=c(0,2,3,0)+.1)
> persp(x, y, z,
    theta = 30, phi = 30,
    expand = 0.5,
    col = "lightblue",
    ltheta = 120,
    shade = 0.75,
    ticktype = "detailed",
    xlab = "X", ylab = "Y", zlab = "Sinc(r)",
    main = "The sinc function"
  )
> par(op)

[Figure: perspective plot titled "The sinc function"; x and y axes from -10 to 10, vertical axis Sinc(r) from -2 to 8]

8.9.10 Plots with mathematical symbols

Sometimes we need a plot whose title contains mathematical notation. In the following plot we create a variable x with 200 values from -5 to 5, and $y = \sqrt{1 + x^2}$. To write this formula we use the function expression as follows:
> x <- seq(-5,5,length=200)
> y <- sqrt(1+x^2)
> plot(y~x, type='l', ylab=expression(sqrt(1+x^2)))
> title(main=expression("Graph of the function f"(x)==sqrt(1+x^2)))

[Figure: curve of sqrt(1 + x^2) against x, titled "Graph of the function f(x) = sqrt(1 + x^2)"]

Even Japanese text can be displayed with R:


> plot(1:9, type="n", axes=FALSE, frame=TRUE, ylab="",
    main= "example(Japanese)", xlab= "using Hershey fonts")
> par(cex=3)
> Vf <- c("serif", "plain")
> text(4, 2, "\\#J2438\\#J2421\\#J2451\\#J2473", vfont = Vf)
> text(4, 4, "\\#J2538\\#J2521\\#J2551\\#J2573", vfont = Vf)
> text(4, 6, "\\#J467c\\#J4b5c", vfont = Vf)
> text(4, 8, "Japan", vfont = Vf)
> par(cex=1)
> text(8, 2, "Hiragana")
> text(8, 4, "Katakana")
> text(8, 6, "Kanji")
> text(8, 8, "English")

[Figure: "example(Japanese)", four rows of text labelled English, Kanji, Katakana and Hiragana; x-axis label "using Hershey fonts"]

This chapter introduces only some of the plots commonly used in scientific research. Beyond these, R can draw far more complex and sophisticated graphics. R currently has a package named lattice that can produce higher-quality plots. Like any R package, lattice is free and can be downloaded and installed when needed.


9
Descriptive statistical analysis
In this chapter we use R for descriptive statistical analysis. Descriptive statistics means describing data with the common calculations and statistical indices that we first met in secondary school: the mean, the median, the variance and the standard deviation for continuous variables, and the proportion for non-continuous variables. Before turning to descriptive analysis, however, readers should distinguish two concepts: population and sample.

9.0 The concepts of population and sample
The goal of experimental scientific research can be described as discovering and understanding what is still unknown, including the laws by which nature operates. To discover, we use methods of classification, comparison and prediction. All scientific methods, statistics included, were developed to serve these three aims. To classify, we must measure a factor or criterion relevant to the research question. To compare and predict, we need hypothesis tests and statistical models.
Like any model, a statistical model has parameters. To obtain parameters we must first take measurements and then estimate the parameters from those measurements. For example, to find out whether female students have the same intelligence quotient (IQ) as male students, we could run the study in one of two ways:
(a) List every male and female student in the country, measure each person's IQ, and then compare the two groups;
(b) Randomly select a sample of n male and m female students, measure each person's IQ, and then compare the two groups.
Option (a) is very expensive and arguably impractical, because we would have to assemble every student in the country, which is very hard to do. But if we could do it, this option would need no statistics. The mean IQ of female and male students obtained under option (a) is the final value and answers our question directly; no inference and no statistical test is needed.


Option (b) requires choosing n male and m female students who are representative of the whole student population of the country. Representative here means that these students must share the characteristics of the population, such as age, education level, socio-economic background, place of residence, and so on. Because we do not know these characteristics for the whole population, we cannot check this directly, so a very effective approach is to sample at random. Many random sampling methods have been developed and we will not discuss them in detail, except to stress that if the sample is not drawn at random, the estimates from it carry little scientific weight, because statistical methods rest on the assumption that the sample was selected at random.
Let us take a concrete example of population and sample using R. Suppose we have a population of 20 people whose heights (in cm) are: 162, 160, 157, 155, 167, 160, 161, 153, 149, 157, 159, 164, 150, 162, 168, 165, 156, 157, 154 and 157. We therefore know that the mean height of the population is 158.65 cm.
For lack of resources we cannot study the whole population; we can only draw a sample from it to estimate height. The function sample() lets us draw a sample. The mean height estimated from a sample will of course differ from the mean height of the population.

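Store the 20 heights in a vector named height, then select 5 people from the population:

> height <- c(162, 160, 157, 155, 167, 160, 161, 153, 149, 157,
              159, 164, 150, 162, 168, 165, 156, 157, 154, 157)
> sample5 <- sample(height, 5)
> sample5
[1] 153 157 164 156 149

Estimate the mean height from this sample:

> mean(sample5)
[1] 155.8

Select another 5 people from the population and compute the mean height:

> sample5 <- sample(height, 5)
> sample5
[1] 157 162 167 161 150
> mean(sample5)
[1] 159.4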

Note that the estimated mean height from the second sample is 159.4 cm (rather than 155.8 cm). Because selection is random, the people chosen the second time are not necessarily those chosen the first time, so the estimated means differ.

Now let us draw a sample of 10 people from the population and compute the mean height:

> sample10 <- sample(height, 10)
> sample10
[1] 153 160 150 165 159 160 164 156 162 157
> mean(sample10)
[1] 158.6

We can draw many samples of 10 people each and estimate the mean from each sample with a single, simpler command:
> mean(sample(height, 10))
[1] 156.7
> mean(sample(height, 10))
[1] 157.1
> mean(sample(height, 10))
[1] 159.3
> mean(sample(height, 10))
[1] 159.3
> mean(sample(height, 10))
[1] 158.3
> mean(sample(height, 10))

Note that the mean fluctuates from 156.7 to 159.3 cm.

Let us draw samples of 15 people from the population and compute the mean height:

> mean(sample(height, 15))
[1] 158.6667
> mean(sample(height, 15))
[1] 159.4
> mean(sample(height, 15))
[1] 158.0667
> mean(sample(height, 15))
[1] 158.1333
> mean(sample(height, 15))
[1] 156.4667

Most of these means now lie between about 158.0 and 158.7 cm, so they tend to cluster more tightly than with samples of 10 subjects, although an occasional draw (156.5 or 159.4 cm here) still falls outside that band.

Increase the sample size to 18 people (close to the number of subjects in the population):

> mean(sample(height, 18))
[1] 158.2222
> mean(sample(height, 18))
[1] 158.7222
> mean(sample(height, 18))
[1] 158.0556
> mean(sample(height, 18))
[1] 158.4444
> mean(sample(height, 18))
[1] 158.6667
> mean(sample(height, 18))
[1] 159.0556
> mean(sample(height, 18))
[1] 159

The estimated height is now fairly stable, although not very different from the samples of 15 people: it fluctuates from about 158.1 to 159.1 cm.
From these examples we can draw an important observation: estimates from randomly selected samples differ from the population parameter, but the difference shrinks as the sample size increases. One of the key issues in study design is therefore to estimate the sample size needed for the sample estimate to be close to (or accurate for) the population parameter. I will return to this issue in chapter 15.
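The observation above can be checked more systematically than by repeating the command by hand. The following sketch draws 1000 samples for each sample size and reports the standard deviation of the resulting means. The number of repetitions (1000), the seed and the name `spread` are choices made here for illustration, not part of the original text.

```r
# The population of 20 heights listed in the text
height <- c(162, 160, 157, 155, 167, 160, 161, 153, 149, 157,
            159, 164, 150, 162, 168, 165, 156, 157, 154, 157)
mean(height)                       # population mean (the parameter)

set.seed(123)                      # arbitrary seed, only for reproducibility
sizes <- c(5, 10, 15, 18)
# For each sample size: draw 1000 samples without replacement,
# take the mean of each, and measure how much those means vary
spread <- sapply(sizes, function(n)
  sd(replicate(1000, mean(sample(height, n)))))
names(spread) <- paste0("n=", sizes)
round(spread, 2)
```

The standard deviation of the sample means should fall as n grows, which is the same pattern the individual draws in the text suggest.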
In the example above the population mean is 158.65 cm. In statistics this is called a parameter. The means estimated from samples drawn from the population are called sample estimates. So, to repeat for emphasis: indices that refer to the population are parameters, and numbers computed from samples are estimates. As seen above, estimates fluctuate around the parameter, and because in practice we do not know the parameter, the main aim of statistical analysis is to use estimates to make inferences about parameters.
The main aim of descriptive statistical analysis is to find the estimates of a sample. There are two types of measurement: continuous and non-continuous, or discrete. Variables such as age, height and body weight are continuous, whereas categorical variables such as diseased or not diseased, like or dislike, white or black are discrete. The two types of variable are summarised differently.
The estimate most commonly used to describe a continuous variable is the mean. For example, the heights of group 1, which has 5 subjects, are 160, 160, 167, 156 and 161, so the mean is 160.8 cm. But if the heights of group 2, also 5 subjects, are 142, 150, 187, 180 and 145, the mean is still 160.8 cm. The mean therefore cannot fully reflect the distribution of a continuous variable: here the two groups have the same mean, but the spread in group 2 is much larger than in group 1. So we need another estimate, called the variance. The variance of group 1 is 15.7 cm^2 and that of group 2 is 443.7 cm^2.
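The two groups quoted above can be checked directly in R. This is a small sketch; the vector names `g1` and `g2` are chosen here for illustration.

```r
g1 <- c(160, 160, 167, 156, 161)   # heights of group 1 (cm)
g2 <- c(142, 150, 187, 180, 145)   # heights of group 2 (cm)

c(mean(g1), mean(g2))              # both groups have mean 160.8
c(var(g1), var(g2))                # variances 15.7 and 443.7
c(sd(g1), sd(g2))                  # standard deviations, in cm
```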
For a discrete variable coded 0 and 1 (0 for alive, 1 for dead), the mean no longer has the usual meaning of an average, so we use the proportion as the estimate. For example, if 2 of 10 people die, the death proportion is 0.2 (or 20%). If 40 of 200 people die, the proportion is still 0.2. So, as with the mean, the proportion alone cannot fully describe a discrete variable. We need the variance, together with the proportion, to describe it. Here the variance is that of the estimated proportion, p(1 - p)/n: for 2/10 it is 0.016, whereas for 40/200 it is 0.0008. In this chapter we get to know some R commands for carrying out these simple calculations.
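The two variances quoted for the proportions follow from p(1 - p)/n. A minimal sketch, assuming a small helper `prop_var` (a hypothetical name, not an R built-in):

```r
# Variance of an estimated proportion: p * (1 - p) / n
prop_var <- function(x, n) {
  p <- x / n                  # estimated proportion
  p * (1 - p) / n
}

prop_var(2, 10)               # 2 deaths out of 10
prop_var(40, 200)             # 40 deaths out of 200
```

Both samples give p = 0.2, but the larger sample gives a variance twenty times smaller, which is why the proportion should be reported together with its variance (or the sample size).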

9.1 Descriptive statistics (summary)

To illustrate the use of R for descriptive statistics, we use a research data set named igfdata. In this study, besides variables on sex, age, weight and height, we measured hormones related to growth, such as igfi, igfbp3 and als, and markers of bone turnover: pinp, ictp and p3np. There are 100 study subjects. The data are stored in the directory c:\works\stats. First we read the data into R with the following commands (text after the # sign is a comment for the reader):
> options(width=100)
# change directory
> setwd("c:/works/stats")
# read the data into R
> igfdata <- read.table("igf.txt", header=TRUE, na.strings=".")
> attach(igfdata)

# list the columns in the data
> names(igfdata)
 [1] "id"        "sex"       "age"       "weight"    "height"    "ethnicity"
 [7] "igfi"      "igfbp3"    "als"       "pinp"      "ictp"      "p3np"

> igfdata
     id    sex age weight height ethnicity    igfi  igfbp3
1     1 Female  15     42    162     Asian 189.000 4.00000
2     2   Male  16     44    160 Caucasian 160.000 3.75000
3     3 Female  15     43    157     Asian 146.833 3.43333
4     4 Female  15     42    155     Asian 185.500 3.40000
5     5 Female  16     47    167     Asian 192.333 4.23333
6     6 Female  25     45    160     Asian 110.000 3.50000
7     7 Female  19     45    161     Asian 157.000 3.20000
8     8 Female  18     43    153     Asian 146.000 3.40000
9     9 Female  15     41    149     Asian 197.667 3.56667
10   10 Female  24     45    157   African 148.000 3.40000
...
97   97 Female  17     54    168 Caucasian 204.667 4.96667
98   98   Male  18     55    169     Asian 178.667 3.86667
99   99 Female  18     48    151     Asian 237.000 3.46667
100 100   Male  15     54    168     Asian 130.000 2.70000
     id     als    pinp    ictp    p3np
1     1 323.667 353.970 11.2867  8.3367
2     2 333.750 375.885 10.4300  6.7450
3     3 248.333 199.507  8.3633 12.5000
4     4 251.000 483.607 13.3300 14.2767
5     5 322.000 105.430  7.9233  4.5033
6     6 284.667  76.487  4.9833  4.9367
7     7 274.000  75.880  6.3500  5.3200
8     8 303.000  86.360  7.3700  4.6700
9     9 308.500 254.803 11.8700  6.8200
10   10 273.000  44.720  3.7400  6.1600
...
97   97 441.333  64.130  5.1600  4.4367
98   98 273.000 185.913  7.5267  8.8333
99   99 324.333 105.127  5.9867  5.6600
100 100 259.333 325.840 10.2767  6.5933

The listing above shows only part of the data for the 100 subjects.

For a variable $x_1, x_2, x_3, \ldots, x_n$ we can compute the following descriptive statistics:

Theory                                                               R function
Mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$                      mean(x)
Variance: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$      var(x)
Standard deviation: $s = \sqrt{s^2}$                                 sd(x)
Standard error: $SE = \frac{s}{\sqrt{n}}$                            none
Minimum                                                              min(x)
Maximum                                                              max(x)
Range                                                                range(x)
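As a check on the formulas in the table, the sketch below computes each quantity by hand and compares it with the built-in function. The five values in `x` are arbitrary illustrative numbers, not taken from igfdata.

```r
x <- c(162, 160, 157, 155, 167)          # illustrative values only
n <- length(x)

xbar <- sum(x) / n                       # mean
s2   <- sum((x - xbar)^2) / (n - 1)      # variance, with divisor n - 1
s    <- sqrt(s2)                         # standard deviation
se   <- s / sqrt(n)                      # standard error of the mean

# Each hand computation should agree with the built-in function
all.equal(xbar, mean(x))
all.equal(s2, var(x))
all.equal(s, sd(x))
se
```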

Example 1: To find the mean age, we simply type:

> mean(age)
[1] 19.17

Or the variance and standard deviation of age:

> var(age)
[1] 15.33444
> sd(age)
[1] 3.915922

However, R has the command summary, which gives all the basic statistical information about a variable:

> summary(age)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  13.00   16.00   19.00   19.17   21.25   34.00

In general this output is simple and the abbreviations are easy to understand. Note the two entries 1st Qu. and 3rd Qu., which stand for the first quartile (the 25% position) and the third quartile (the 75% position) of the variable. First quartile = 16 means that 25% of the study subjects are aged 16 or younger. Similarly, third quartile = 21.25 means that 75% of the subjects are aged 21.25 or younger. The median of 19 likewise means that 50% of the subjects are aged 19 or younger (or 19 or older).
R has no built-in function for the standard error, and summary does not report the standard deviation either. To obtain these numbers we can write a simple function of our own (call it desc) as follows:
desc <- function(x)
{
  av <- mean(x)
  sd <- sd(x)
  se <- sd/sqrt(length(x))
  c(MEAN=av, SD=sd, SE=se)
}

We can then call this function for any variable we like, for example als:

> desc(als)
      MEAN         SD         SE
301.841120  58.987189   5.898719

For an overall view of the igfdata data set we simply use summary:

> summary(igfdata)
       id             sex          age            weight
 Min.   :  1.00   Female:69   Min.   :13.00   Min.   :41.00
 1st Qu.: 25.75   Male  :31   1st Qu.:16.00   1st Qu.:47.00
 Median : 50.50               Median :19.00   Median :50.00
 Mean   : 50.50               Mean   :19.17   Mean   :49.91
 3rd Qu.: 75.25               3rd Qu.:21.25   3rd Qu.:53.00
 Max.   :100.00               Max.   :34.00   Max.   :60.00
     height          ethnicity       igfi            igfbp3
 Min.   :149.0   African  : 8   Min.   : 85.71   Min.   :2.000
 1st Qu.:157.0   Asian    :60   1st Qu.:137.17   1st Qu.:3.292
 Median :162.0   Caucasian:30   Median :161.50   Median :3.550
 Mean   :163.1   Others   : 2   Mean   :165.59   Mean   :3.617
 3rd Qu.:168.0                  3rd Qu.:186.46   3rd Qu.:3.875
 Max.   :196.0                  Max.   :427.00   Max.   :5.233
      als             pinp             ictp             p3np
 Min.   :192.7   Min.   : 26.74   Min.   : 2.697   Min.   : 2.343
 1st Qu.:256.8   1st Qu.: 68.10   1st Qu.: 4.878   1st Qu.: 4.433
 Median :292.5   Median :103.26   Median : 6.338   Median : 5.445
 Mean   :301.8   Mean   :167.17   Mean   : 7.420   Mean   : 6.341
 3rd Qu.:331.2   3rd Qu.:196.45   3rd Qu.: 8.423   3rd Qu.: 7.150
 Max.   :471.7   Max.   :742.68   Max.   :21.237   Max.   :16.303

R summarises every variable that can be summarised. So it even processes the id column (the subject identifier), although we know the results for id have no statistical meaning. For categorical variables such as sex and ethnicity, R reports only the frequency of each group.
The results above are for all study subjects. If we want results for men and women separately, the function by in R is very useful. In the following command we ask R to summarise igfdata by sex.
> by(igfdata, sex, summary)
sex: Female
       id            sex          age            weight
 Min.   : 1.0   Female:69   Min.   :13.00   Min.   :41.00
 1st Qu.:21.0   Male  : 0   1st Qu.:17.00   1st Qu.:47.00
 Median :47.0               Median :19.00   Median :50.00
 Mean   :48.2               Mean   :19.59   Mean   :49.35
 3rd Qu.:75.0               3rd Qu.:22.00   3rd Qu.:52.00
 Max.   :99.0               Max.   :34.00   Max.   :60.00
     height          ethnicity       igfi            igfbp3
 Min.   :149.0   African  : 4   Min.   : 85.71   Min.   :2.767
 1st Qu.:156.0   Asian    :43   1st Qu.:136.67   1st Qu.:3.333
 Median :162.0   Caucasian:22   Median :163.33   Median :3.567
 Mean   :161.9   Others   : 0   Mean   :167.97   Mean   :3.695
 3rd Qu.:166.0                  3rd Qu.:186.17   3rd Qu.:3.933
 Max.   :196.0                  Max.   :427.00   Max.   :5.233
      als             pinp             ictp             p3np
 Min.   :204.3   Min.   : 26.74   Min.   : 2.697   Min.   : 2.343
 1st Qu.:263.8   1st Qu.: 62.75   1st Qu.: 4.717   1st Qu.: 4.337
 Median :302.7   Median : 78.50   Median : 5.537   Median : 5.143
 Mean   :311.5   Mean   :108.74   Mean   : 6.183   Mean   : 5.643
 3rd Qu.:361.7   3rd Qu.:115.26   3rd Qu.: 7.320   3rd Qu.: 6.143
 Max.   :471.7   Max.   :502.05   Max.   :13.633   Max.   :14.420
------------------------------------------------------------


sex: Male
       id             sex          age            weight
 Min.   :  2.00   Female: 0   Min.   :14.00   Min.   :44.00
 1st Qu.: 34.50   Male  :31   1st Qu.:15.00   1st Qu.:48.50
 Median : 56.00               Median :17.00   Median :51.00
 Mean   : 55.61               Mean   :18.23   Mean   :51.16
 3rd Qu.: 75.00               3rd Qu.:20.00   3rd Qu.:53.50
 Max.   :100.00               Max.   :27.00   Max.   :59.00
     height          ethnicity       igfi            igfbp3
 Min.   :155.0   African  : 4   Min.   : 94.67   Min.   :2.000
 1st Qu.:161.5   Asian    :17   1st Qu.:138.67   1st Qu.:3.183
 Median :164.0   Caucasian: 8   Median :160.00   Median :3.500
 Mean   :165.6   Others   : 2   Mean   :160.29   Mean   :3.443
 3rd Qu.:169.0                  3rd Qu.:183.00   3rd Qu.:3.775
 Max.   :191.0                  Max.   :274.00   Max.   :4.500
      als             pinp             ictp             p3np
 Min.   :192.7   Min.   : 56.28   Min.   : 3.650   Min.   : 3.390
 1st Qu.:249.8   1st Qu.:135.07   1st Qu.: 6.900   1st Qu.: 5.375
 Median :276.0   Median :245.92   Median : 9.513   Median : 7.140
 Mean   :280.2   Mean   :297.21   Mean   :10.173   Mean   : 7.895
 3rd Qu.:311.3   3rd Qu.:450.38   3rd Qu.:13.517   3rd Qu.:10.010
 Max.   :388.7   Max.   :742.68   Max.   :21.237   Max.   :16.303

To look at the distributions of the hormones and biochemical markers at the same time, we can plot all 6 variables. First divide the screen into 6 windows (2 rows and 3 columns), then draw each plot in turn:

> op <- par(mfrow=c(2,3))
> hist(igfi)
> hist(igfbp3)
> hist(als)
> hist(pinp)
> hist(ictp)
> hist(p3np)

[Figure: six histograms in a 2 x 3 grid: Histogram of igfi, Histogram of igfbp3, Histogram of als, Histogram of pinp, Histogram of ictp, Histogram of p3np; vertical axes show Frequency]

9.2 Testing whether a variable is normally distributed

In statistical analysis most calculations rest on the assumption that the variable follows a normal distribution. One of the important steps when examining data is therefore to test the normality assumption for each variable. In the plots above, variables such as igfi, pinp, ictp and p3np appear to be concentrated at low values and asymmetric, which is a sign of a non-normal distribution.
For a rigorous check we use a statistical test called the Shapiro test, available in R as the function shapiro.test. For example, to test the normality of the variable pinp:
> shapiro.test(pinp)
        Shapiro-Wilk normality test
data:  pinp
W = 0.748, p-value = 8.314e-12
Because the p-value is below 0.05, we can conclude that pinp does not follow a normal distribution.


For the variable weight (body weight), however, the test gives p > 0.05, so the data are consistent with a normal distribution.
> shapiro.test(weight)
        Shapiro-Wilk normality test
data:  weight
W = 0.9887, p-value = 0.5587

This result also agrees with the histogram of weight:

> hist(weight)

[Figure: "Histogram of weight"; x-axis weight from 40 to 60, y-axis Frequency]

9.3 Descriptive statistics by group

If we want the mean of a variable such as igfi for men and women separately, the function tapply in R can be used:
> tapply(igfi, list(sex), mean)
  Female     Male
167.9741 160.2903

In this command igfi is the variable to summarise, the grouping variable is sex, and the statistic we want is the mean. The output shows that the mean igfi for women (167.97) is higher than for men (160.29).
If we want the mean for each combination of sex and ethnicity, we simply add one more variable to list:
> tapply(igfi, list(ethnicity, sex), mean)
            Female     Male
African   145.1252 120.9168
Asian     165.6589 160.4999
Caucasian 176.6536 169.4790
Others          NA 200.5000

In this output NA means "not available": there are no data for women in the Others ethnic group.

9.4 The t-test (t.test)

The t-test rests on the assumption of a normal distribution. There are two kinds: the one-sample t-test and the two-sample t-test. The one-sample t-test answers the question of whether the data from a sample are truly equal to some parameter value. The two-sample t-test answers the question of whether two samples share the same distribution or, more specifically, whether the two samples truly have the same mean. I will illustrate both tests in turn with the igfdata data.

9.4.1 One-sample t-test
Example 2. From the analysis above, the mean age of the 100 subjects in this study is 19.17 years. Suppose we know from earlier work that the mean age in this population is 30 years. The question is whether the sample we have is representative of the population. In other words, we want to know whether the mean of 19.17 truly differs from the mean of 30.
To answer this question we use the t-test. In statistical theory the t statistic is defined by the following formula:

$t = \frac{\bar{x} - \mu}{s / \sqrt{n}}$

Here $\bar{x}$ is the sample mean, $\mu$ is the hypothesised mean (30 in this case), s is the standard deviation, and n is the sample size (100). If the absolute value of t exceeds the theoretical value of the t distribution at a chosen significance level, 5% say, we have grounds to state that the difference is statistically significant. For a sample of 100 this value can be computed with the R function qt as follows:
> qt(0.95, 100)
[1] 1.660234
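The statistic can also be computed by hand from the summary numbers reported earlier (mean 19.17, standard deviation 3.915922, n = 100). The sketch below does this; it additionally uses n - 1 = 99 degrees of freedom and the 0.975 quantile for a two-sided 5% test, which is an assumption about the intended test rather than what the qt call above does.

```r
xbar <- 19.17        # sample mean of age, from mean(age)
s    <- 3.915922     # standard deviation of age, from sd(age)
n    <- 100          # sample size
mu   <- 30           # hypothesised population mean

t_stat <- (xbar - mu) / (s / sqrt(n))   # one-sample t statistic
t_stat                                  # about -27.66

crit <- qt(0.975, df = n - 1)           # two-sided 5% critical value, 99 df
abs(t_stat) > crit                      # TRUE: the difference is significant

p_value <- 2 * pt(-abs(t_stat), df = n - 1)   # two-sided p-value
p_value
```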

But there is a quicker way to answer the question above, using the function t.test:
> t.test(age, mu=30)
        One Sample t-test
data:  age
t = -27.6563, df = 99, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 30
95 percent confidence interval:
 18.39300 19.94700
sample estimates:
mean of x
    19.17

In this command age is the variable to test and mu=30 is the hypothesised value. R reports t = -27.66 with 99 degrees of freedom and a p-value < 2.2e-16 (that is, very small). R also reports that the 95% confidence interval for the mean of age runs from 18.4 to 19.9 years (30 lies far outside this interval). In other words, we have grounds to state that the mean age in this sample is truly lower than the mean age of the population.

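9.4.2 Two-sample t-test

Example 3. From the descriptive analysis above (the function summary) we saw that women have higher igfi than men (167.97 versus 160.29). The question is whether this is a systematic difference or one caused by chance. To answer it we need to consider the mean difference between the two groups and the standard deviation of that difference:

$t = \frac{\bar{x}_2 - \bar{x}_1}{SED}$

Here $\bar{x}_1$ and $\bar{x}_2$ are the means of the two groups (men and women), and SED is the standard deviation of $(\bar{x}_1 - \bar{x}_2)$. In fact SED can be estimated with the formula:

$SED = \sqrt{SE_1^2 + SE_2^2}$

Here $SE_1$ and $SE_2$ are the standard errors of the two groups. According to probability theory, t follows a t distribution with $n_1 + n_2 - 2$ degrees of freedom when the two groups share a common variance, where $n_1$ and $n_2$ are the sample sizes of the two groups; the Welch version that R runs by default uses an approximate, usually non-integer, number of degrees of freedom. We can use R to answer the question with the function t.test as follows: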
> t.test(igfi~ sex)
        Welch Two Sample t-test
data:  igfi by sex
t = 0.8412, df = 88.329, p-value = 0.4025
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -10.46855  25.83627
sample estimates:
mean in group Female   mean in group Male
            167.9741             160.2903

R presents the key values first:

t = 0.8412, df = 88.329, p-value = 0.4025

df is the degrees of freedom. The p-value of 0.4025 shows that the difference between men and women is not statistically significant (it is above 0.05, or 5%).
95 percent confidence interval:
 -10.46855  25.83627

is the 95% confidence interval for the difference between the two groups. It indicates that igfi in women could be as much as 10.5 ng/L lower than in men or as much as 25.8 ng/L higher. Because this interval is wide and includes 0, it is further evidence that there is no statistically significant difference between the two groups.
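To connect the SED formula with what t.test reports, the following sketch computes the statistic by hand and compares it with the Welch test. Because igfdata is not available here, the two vectors are simulated; their sizes (69 and 31) mirror the study, but the means, standard deviations and seed are assumptions for illustration only.

```r
set.seed(42)                           # arbitrary seed
f <- rnorm(69, mean = 168, sd = 60)    # hypothetical values for women
m <- rnorm(31, mean = 160, sd = 37)    # hypothetical values for men

se1 <- sd(f) / sqrt(length(f))         # standard error, group 1
se2 <- sd(m) / sqrt(length(m))         # standard error, group 2
sed <- sqrt(se1^2 + se2^2)             # standard error of the difference

t_manual <- (mean(f) - mean(m)) / sed  # t statistic by hand
res <- t.test(f, m)                    # Welch two-sample test (R's default)

c(manual = t_manual, t.test = unname(res$statistic))
res$parameter                          # Welch degrees of freedom, usually non-integer
```

The hand-computed value matches the statistic from t.test; only the degrees of freedom require the extra Welch approximation that R performs internally.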
The test above assumes that the two groups (men and women) have different variances. If we have reason to believe that the two groups have the same variance, we change just one argument of t.test, var.equal=TRUE, as follows:

> t.test(igfi~ sex, var.equal=TRUE)
        Two Sample t-test
data:  igfi by sex
t = 0.7071, df = 98, p-value = 0.4812
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -13.88137  29.24909
sample estimates:
mean in group Female   mean in group Male
            167.9741             160.2903

Numerically these results differ slightly from those of the analysis that assumes unequal variances, but the p-value leads to the same conclusion: the difference between the two groups is not statistically significant.

9.5 Comparing variances (var.test)

Now let us test whether the variances of the two groups differ. To run the analysis we only need the command:
> var.test(igfi ~ sex)
        F test to compare two variances
data:  igfi by sex
F = 2.6274, num df = 68, denom df = 30, p-value = 0.004529
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
 1.366187 4.691336
sample estimates:
ratio of variances
          2.627396

The output shows that the variance in women is about 2.63 times the variance in men. The p-value of 0.0045 shows that the two variances differ significantly. We therefore accept the results of t.test(igfi ~ sex), the version that does not assume equal variances.
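The F statistic reported by var.test is simply the ratio of the two sample variances. A minimal sketch on simulated data (the vectors, their standard deviations and the seed are assumptions; only the group sizes 69 and 31 come from the study):

```r
set.seed(7)                          # arbitrary seed
f <- rnorm(69, sd = 60)              # hypothetical values, first group
m <- rnorm(31, sd = 37)              # hypothetical values, second group

F_manual <- var(f) / var(m)          # ratio of sample variances
res <- var.test(f, m)

c(manual = F_manual, var.test = unname(res$statistic))
unname(res$parameter)                # numerator and denominator df: 68 and 30
```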

9.6 Wilcoxon test for two samples (wilcox.test)

The t-test assumes that the variable follows a normal distribution. If this assumption does not hold, the result of the t-test may not be valid. To test the distribution of igfi we can use shapiro.test as follows:
> shapiro.test(igfi)
        Shapiro-Wilk normality test
data:  igfi
W = 0.8528, p-value = 1.504e-08

The p-value is far below 0.05, so we can say that igfi does not follow a normal distribution. In this case the comparison between the two groups can be based on a non-parametric method called the Wilcoxon test, because this test (unlike the t-test) does not depend on the normality assumption.
> wilcox.test(igfi ~ sex)
        Wilcoxon rank sum test with continuity correction
data:  igfi by sex
W = 1125, p-value = 0.6819
alternative hypothesis: true mu is not equal to 0

The p-value of 0.682 shows that the difference in igfi between men and women is indeed not statistically significant. This conclusion is no different from the result of the t-test.

9.7 t-test for paired variables (paired t-test, t.test)

The t-test just described is for studies with two independent groups (such as men and women); it cannot be applied to studies in which one group of subjects is followed over time. We will call these paired studies. For them we need a version of the t-test called the paired t-test.
Example 4. A group of 10 patients is treated with a drug intended to lower blood pressure. Each patient's blood pressure is measured at the start of the study (before treatment) and again after treatment. The blood pressure data for the 10 patients are as follows:

Before treatment (x0): 180, 140, 160, 160, 220, 185, 145, 160, 160, 170
After treatment (x1):  170, 145, 145, 125, 205, 185, 150, 150, 145, 155

The question is whether the change in blood pressure is sufficient to conclude that the drug is effective in lowering blood pressure. To answer this question we use the paired t-test as follows:
> # enter the data
> before <- c(180, 140, 160, 160, 220, 185, 145, 160, 160, 170)
> after <- c(170, 145, 145, 125, 205, 185, 150, 150, 145, 155)
> bp <- data.frame(before, after)

> # t-test
> t.test(before, after, paired=TRUE)
        Paired t-test
data:  before and after
t = 2.7924, df = 9, p-value = 0.02097
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
  1.993901 19.006099
sample estimates:
mean of the differences
                   10.5

The output shows that after treatment blood pressure fell by 10.5 mmHg, with a 95% confidence interval from 2.0 mmHg to 19.0 mmHg and a p-value of 0.021. We therefore have evidence to state that the reduction in blood pressure is statistically significant.
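A paired t-test is equivalent to a one-sample t-test on the within-patient differences. The sketch below shows this with the blood pressure data from Example 4; the names `d`, `t_manual`, `res_paired` and `res_onesample` are chosen here for illustration.

```r
before <- c(180, 140, 160, 160, 220, 185, 145, 160, 160, 170)
after  <- c(170, 145, 145, 125, 205, 185, 150, 150, 145, 155)

d <- before - after                       # change for each patient
mean(d)                                   # mean reduction: 10.5

# t statistic by hand: mean difference over its standard error
t_manual <- mean(d) / (sd(d) / sqrt(length(d)))

res_paired    <- t.test(before, after, paired = TRUE)
res_onesample <- t.test(d, mu = 0)

c(manual    = t_manual,
  paired    = unname(res_paired$statistic),
  onesample = unname(res_onesample$statistic))
```

All three values agree (about 2.79), which is why pairing matters: the test works on the differences and so removes the large variation between patients.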
Note that if we wrongly analyse these data with the test for two independent groups, as shown here, the p-value of 0.32 would suggest that the reduction in blood pressure is not statistically significant.
> t.test(before, after)
        Welch Two Sample t-test
data:  before and after
t = 1.0208, df = 17.998, p-value = 0.3209
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -11.11065  32.11065
sample estimates:
mean of x mean of y
    168.0     157.5

9.8 Wilcoxon test for paired variables (wilcox.test)

Instead of the paired t-test we can also use the function wilcox.test for the same purpose:
> wilcox.test(before, after, paired=TRUE)
        Wilcoxon signed rank test with continuity correction
data:  before and after
V = 42, p-value = 0.02291
alternative hypothesis: true mu is not equal to 0

This result again confirms that the reduction in blood pressure is statistically significant, with a p-value (p = 0.023) hardly different from that of the paired t-test.

9.9 Frequencies (frequency)
The function table in R reports the frequencies of a categorical variable such as sex or ethnicity.
> table(sex)
sex
Female   Male
    69     31
> table(ethnicity)
ethnicity
  African     Asian Caucasian    Others
        8        60        30         2

A two-way table:

> table(sex, ethnicity)
        ethnicity
sex      African Asian Caucasian Others
  Female       4    43        22      0
  Male         4    17         8      2

Note that in these tables the function table does not give percentages. To compute percentages we need the function prop.table, which can be used as follows:
# create an object named freq that holds the frequencies
> freq <- table(sex, ethnicity)

# check the result
> freq
        ethnicity
sex      African Asian Caucasian Others
  Female       4    43        22      0
  Male         4    17         8      2
# use margin.table to see the row and column totals
> margin.table(freq, 1)
sex
Female   Male
    69     31
> margin.table(freq, 2)
ethnicity
  African     Asian Caucasian    Others
        8        60        30         2

# compute row proportions with prop.table
> prop.table(freq, 1)
        ethnicity
sex         African      Asian  Caucasian     Others
  Female 0.05797101 0.62318841 0.31884058 0.00000000
  Male   0.12903226 0.54838710 0.25806452 0.06451613

In this table prop.table computes the proportion of each ethnic group within each sex. For example, among women, 5.8% are African, 62.3% are Asian and 31.9% are Caucasian, for a total of 100%. Similarly, among men the proportion of Africans is 12.9%, of Asians 54.8%, and so on.
# compute column proportions with prop.table
> prop.table(freq, 2)
        ethnicity
sex        African     Asian Caucasian    Others
  Female 0.5000000 0.7166667 0.7333333 0.0000000
  Male   0.5000000 0.2833333 0.2666667 1.0000000

In this table prop.table computes the proportion of each sex within each ethnic group. For example, in the Asian group 71.7% are women and 28.3% are men.
# compute proportions of the whole table
> freq/sum(freq)
        ethnicity
sex      African Asian Caucasian Others
  Female    0.04  0.43      0.22   0.00
  Male      0.04  0.17      0.08   0.02

9.10 Testing a proportion (proportion test, prop.test, binom.test)

Testing a proportion usually relies on the binomial distribution. With sample size n and proportion p, if n is large (more than 50, say), the binomial distribution can be approximated by a normal distribution with mean np and variance np(1 - p). Let x be the number of events of interest; the hypothesis $p = \pi$ can then be tested with the following statistic:

$z = \frac{x - n\pi}{\sqrt{n\pi(1 - \pi)}}$

Here z follows a normal distribution with mean 0 and variance 1. Equivalently, $z^2$ follows a chi-squared distribution with 1 degree of freedom.
Example 5. In the study above there are 69 women and 31 men, so the proportion of women is 0.69 (or 69%). To test whether this proportion truly differs from 0.5, we can use the function prop.test(x, n, π) as follows:
> prop.test(69, 100, 0.50)
        1-sample proportions test with continuity correction
data:  69 out of 100, null probability 0.5
X-squared = 13.69, df = 1, p-value = 0.0002156
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
 0.5885509 0.7766330
sample estimates:
   p
0.69

In this output prop.test estimates the proportion of women as 0.69, with a 95% confidence interval from 0.588 to 0.777. The chi-squared value is 13.69, with a p-value of 0.0002156. The proportion of women in this study is therefore higher than 50%.
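The z statistic defined in this section can be computed by hand for the 69 women out of 100 and compared with prop.test. One point worth noting in this sketch: prop.test applies a continuity correction by default, so its X-squared (13.69) is slightly smaller than z squared; with correct = FALSE the two agree.

```r
x  <- 69      # number of women
n  <- 100     # sample size
p0 <- 0.5     # hypothesised proportion

z <- (x - n * p0) / sqrt(n * p0 * (1 - p0))   # z statistic by hand
z                                             # 3.8
z^2                                           # 14.44

# Without continuity correction, X-squared equals z^2
unname(prop.test(x, n, p0, correct = FALSE)$statistic)

# With the default continuity correction, X-squared is 13.69
unname(prop.test(x, n, p0)$statistic)
```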
A more exact way to test a proportion is the binomial test, binom.test(x, n, π), as follows:
> binom.test(69, 100, 0.50)
        Exact binomial test
data:  69 and 100
number of successes = 69, number of trials = 100, p-value = 0.0001831
alternative hypothesis: true probability of success is not equal to 0.5
95 percent confidence interval:
 0.5896854 0.7787112
sample estimates:
probability of success
                  0.69

In general, the result of the binomial test is no different from that of the chi-squared test. With a p-value of 0.00018, we again have evidence to conclude that the proportion of women in this study is truly higher than 50%.

9.11 Comparing two proportions (prop.test, binom.test)

The method for comparing two proportions follows directly from the theory for testing a single proportion just described. Take two samples with $n_1$ and $n_2$ subjects and $x_1$ and $x_2$ events, from which we can estimate the two proportions $p_1$ and $p_2$. Probability theory lets us state that, when the two true proportions are equal, the difference between the two samples $d = p_1 - p_2$ approximately follows a normal distribution with mean 0 and variance:

$V_d = \left(\frac{1}{n_1} + \frac{1}{n_2}\right)\bar{p}(1 - \bar{p})$

where:

$\bar{p} = \frac{x_1 + x_2}{n_1 + n_2}$

Therefore $z = d / \sqrt{V_d}$ follows a normal distribution with mean 0 and variance 1. In other words, $z^2$ follows a chi-squared distribution with 1 degree of freedom. We can therefore also use prop.test to compare two proportions.
Example 6. A study was carried out to compare the efficacy of a drug for preventing fractures. Patients were divided into two groups: group A, treated, with 100 patients, and group B, untreated, with 110 patients. After 12 months of follow-up, 7 people in group A and 20 people in group B had a fracture. The question is whether the fracture proportions in the two groups are equal (that is, whether the drug has no effect). To test whether the two proportions truly differ, we can use the function prop.test(x, n, π) as follows:
> fracture <- c(7, 20)
> total <- c(100, 110)
> prop.test(fracture, total)
        2-sample test for equality of proportions with continuity correction
data:  fracture out of total
X-squared = 4.8901, df = 1, p-value = 0.02701
alternative hypothesis: two.sided
95 percent confidence interval:
 -0.20908963 -0.01454673
sample estimates:
   prop 1    prop 2
0.0700000 0.1818182

The analysis shows that the fracture proportion is 0.07 in group 1 and 0.18 in group 2. It also gives a 95% confidence interval for the difference between the two groups of about 0.01 to 0.21 (that is, 1 to 21 percentage points). With a p-value of 0.027, we can say that the fracture proportion in group A is indeed lower than in group B.
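The pooled-variance formulas of this section can be applied by hand to the fracture data. As in the one-sample case, prop.test uses a continuity correction by default, so this sketch compares z squared with the uncorrected statistic (correct = FALSE).

```r
x <- c(7, 20)        # fractures in groups A and B
n <- c(100, 110)     # patients in groups A and B

p    <- x / n                              # the two proportions
d    <- p[1] - p[2]                        # difference between proportions
pbar <- sum(x) / sum(n)                    # pooled proportion
Vd   <- (1 / n[1] + 1 / n[2]) * pbar * (1 - pbar)   # variance of d

z <- d / sqrt(Vd)                          # z statistic
z^2                                        # equals X-squared without correction

unname(prop.test(x, n, correct = FALSE)$statistic)
```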

9.12 Comparing several proportions (prop.test, chisq.test)


The prop.test function can also be used to test several proportions at once. In the study above, we have 4 ethnic groups, with the following frequencies for each sex:
> table(sex, ethnicity)
        ethnicity
sex      African Asian Caucasian Others
  Female       4    43        22      0
  Male         4    17         8      2

We want to know whether the proportion of women differs among the 4 ethnic groups. To answer this question, we again use prop.test as follows:
> female <- c( 4, 43, 22, 0)
> total <- c(8, 60, 30, 2)
> prop.test(female, total)
4-sample test for equality of proportions without continuity correction
data: female out of total
X-squared = 6.2646, df = 3, p-value = 0.09942
alternative hypothesis: two.sided
sample estimates:
   prop 1    prop 2    prop 3    prop 4
0.5000000 0.7166667 0.7333333 0.0000000
Warning message:
Chi-squared approximation may be incorrect in: prop.test(female, total)

Although the proportions of women appear to differ considerably among the groups (73% in group 3 (Caucasian) compared with 50% in group 1 (African) and 71.7% in the Asian group), the Chi-squared test indicates that, statistically, these proportions do not differ, since p = 0.099.

9.12.1 Chi-squared test (chisq.test)


In fact, the Chi-squared test can also be computed with the chisq.test function as follows:
> chisq.test(sex, ethnicity)
Pearson's Chi-squared test
data: sex and ethnicity
X-squared = 6.2646, df = 3, p-value = 0.09942
Warning message:
Chi-squared approximation may be incorrect in: chisq.test(sex, ethnicity)

This result is identical to the one obtained from prop.test.

9.12.2 Fisher's exact test (fisher.test)


In the Chi-squared test above, note the warning:
Warning message:
Chi-squared approximation may be incorrect in: prop.test(female, total)

The warning arises because group 4 contains no women, so its proportion is 0%. Moreover, this group has only 2 subjects. Because the number of subjects is so small, the statistical estimates may not be reliable. An alternative method for studies with low frequencies such as this one is Fisher's exact test. Readers may consult the theory behind Fisher's test to understand its logic more fully; here we are concerned only with how to compute it in R. We simply type:
> fisher.test(sex, ethnicity)
Fisher's Exact Test for Count Data
data: sex and ethnicity
p-value = 0.1048
alternative hypothesis: two.sided

Note that the p-value from Fisher's exact test is 0.1048, very close to the p-value of the Chi-squared test. We therefore have additional evidence that the proportion of women does not differ significantly among the ethnic groups.
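Both tests can also be run directly on the table of counts, which is convenient when the raw sex and ethnicity vectors are not at hand. The following is a minimal sketch that rebuilds the 2 x 4 table shown earlier as a matrix (the object name tab is my own) and inspects the expected counts that trigger the warning.

```r
# Rebuild the sex-by-ethnicity table from the printed counts
tab <- matrix(c(4, 43, 22, 0,
                4, 17,  8, 2),
              nrow = 2, byrow = TRUE,
              dimnames = list(sex = c("Female", "Male"),
                              ethnicity = c("African", "Asian",
                                            "Caucasian", "Others")))

# Pearson Chi-squared test; the warning is suppressed here because
# we look at the expected counts explicitly on the next line
chi <- suppressWarnings(chisq.test(tab))
print(round(chi$expected, 2))   # several expected counts are below 5

# Fisher's exact test does not rely on the large-sample approximation
fisher <- fisher.test(tab)
print(c(chisq_p = chi$p.value, fisher_p = fisher$p.value))
```

Expected counts below 5 are the usual reason for the "approximation may be incorrect" warning, which is why an exact test is a reasonable cross-check here.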


10
Linear regression analysis
Linear regression analysis is perhaps one of the most commonly used methods of data analysis in statistics. Someone once wrote: "Give a person three weapons (the correlation coefficient, linear regression, and a pen) and they will use all three!" In this chapter, I introduce how to use R for linear regression analysis and for related methods such as the correlation coefficient and statistical hypothesis testing.
Example 1. To illustrate, consider the following study, in which the investigators measured blood cholesterol in 18 male subjects. Body mass index (BMI) was also estimated for each subject, calculated as weight (in kg) divided by height squared (in m2). The measurements were as follows:
Table 1. Age, body mass index and cholesterol

ID (id)   Age (age)   BMI (bmi)   Cholesterol (chol)
 1        46          25.4        3.5
 2        20          20.6        1.9
 3        52          26.2        4.0
 4        30          22.6        2.6
 5        57          25.4        4.5
 6        25          23.1        3.0
 7        28          22.7        2.9
 8        36          24.9        3.8
 9        22          19.8        2.1
10        43          25.3        3.8
11        57          23.2        4.1
12        33          21.8        3.0
13        22          20.9        2.5
14        63          26.7        4.6
15        40          26.4        3.2
16        48          21.2        4.2
17        28          21.2        2.3
18        49          22.8        4.0

A first look at the data suggests that the older the subject, the higher the cholesterol. Let us enter the data into R and draw a scatter plot as follows:

> age <- c(46,20,52,30,57,25,28,36,22,43,57,33,


22,63,40,48,28,49)
> bmi <-c(25.4,20.6,26.2,22.6,25.4,23.1,22.7,24.9,
19.8,25.3,23.2,21.8,20.9,26.7,26.4,21.2,
21.2,22.8)
> chol <- c(3.5,1.9,4.0,2.6,4.5,3.0,2.9,3.8,
2.1,3.8,4.1,3.0, 2.5,4.6,3.2,
4.2,2.3,4.0)

> data <- data.frame(age, bmi, chol)
> plot(chol ~ age, pch=16)

[Figure: scatter plot of chol (vertical axis) against age (horizontal axis)]
Figure 10.1. Relationship between age and cholesterol.


Figure 10.1 shows that the relationship between age and cholesterol is approximately a straight line (linear). To measure the strength of this relationship, we can use the coefficient of correlation.

10.1 Correlation coefficient


The correlation coefficient (r) is a statistic that measures the association between two variables, such as age (x) and cholesterol (y). It takes values from -1 to 1. A coefficient of 0 (or close to 0) means that the two variables have no linear association with each other; conversely, a coefficient of -1 or 1 means that the two variables have a perfect association. A negative coefficient (r < 0) means that y decreases as x increases (and vice versa); a positive coefficient (r > 0) means that y increases as x increases, and decreases as x decreases.
There are in fact many correlation coefficients in statistics, but here I present the three most commonly used: Pearson's r, Spearman's $\rho$, and Kendall's $\tau$.
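10.1.1 Pearson correlation coefficient
Given two variables x and y measured on n subjects, the Pearson correlation coefficient is estimated by the following formula:

$$r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2 \, \sum_{i=1}^{n}(y_i - \bar{y})^2}}$$

where, as defined earlier, $\bar{x}$ and $\bar{y}$ are the means of x and y. To estimate the correlation between age and cholesterol, we can use the function cor(x,y) as follows: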
> cor(age, chol)
[1] 0.936726

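We can test the hypothesis that the correlation coefficient equals 0 (that is, x and y are not associated). R provides the function cor.test for this calculation; for the Pearson coefficient it reports a t-test, and the confidence interval is based on Fisher's transformation.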
> cor.test(age, chol)
Pearson's product-moment correlation
data: age and chol
t = 10.7035, df = 16, p-value = 1.058e-08
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.8350463 0.9765306
sample estimates:
cor
0.936726

The analysis gives a test statistic of t = 10.70 with p = 1.058e-08; we therefore have evidence to conclude that the association between age and cholesterol is statistically significant. This is the same conclusion reached by the linear regression analysis in section 10.2.
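To connect the Pearson formula with the R output, the coefficient and its t statistic can be computed step by step. This is a small sketch on the data of Example 1; the intermediate names (sxy, sxx, syy) are my own, and the relation t = r * sqrt(n - 2) / sqrt(1 - r^2) is the standard one for testing a zero correlation.

```r
# Data from Table 1
age  <- c(46, 20, 52, 30, 57, 25, 28, 36, 22, 43, 57, 33,
          22, 63, 40, 48, 28, 49)
chol <- c(3.5, 1.9, 4.0, 2.6, 4.5, 3.0, 2.9, 3.8, 2.1, 3.8, 4.1, 3.0,
          2.5, 4.6, 3.2, 4.2, 2.3, 4.0)

# Sums of cross-products and squares around the means
sxy <- sum((age - mean(age)) * (chol - mean(chol)))
sxx <- sum((age - mean(age))^2)
syy <- sum((chol - mean(chol))^2)

r <- sxy / sqrt(sxx * syy)                 # Pearson correlation by hand

# t statistic and two-sided p-value for the hypothesis of zero correlation
n       <- length(age)
t_stat  <- r * sqrt(n - 2) / sqrt(1 - r^2)
p_value <- 2 * pt(-abs(t_stat), df = n - 2)

print(c(r = r, t = t_stat, p = p_value))
print(cor(age, chol))                      # should match r
```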


10.1.2 Spearman correlation coefficient


Inference based on the Pearson correlation coefficient is valid only if x and y follow a normal distribution. If they do not, we should use another coefficient, Spearman's, which is a non-parametric method. This coefficient is estimated by converting x and y to ranks and computing the correlation between the two series of ranks; hence its English name, Spearman's rank correlation. R estimates the Spearman coefficient with the function cor.test and the argument method="spearman" as follows:
> cor.test(age, chol, method="spearman")
Spearman's rank correlation rho
data: age and chol
S = 51.1584, p-value = 2.57e-09
alternative hypothesis: true rho is not equal to 0
sample estimates: rho = 0.947205
Warning message:
Cannot compute exact p-values with ties in:
cor.test.default(age, chol, method = "spearman")

The analysis gives rho = 0.947 with p = 0.00000000257. This result agrees with the linear regression analysis: the association between age and cholesterol is very strong and statistically significant.
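The description of Spearman's coefficient as "a correlation between ranks" can be checked directly: applying the Pearson formula to the ranks of x and y gives the same number. A brief sketch on the Example 1 data (tied values receive average ranks, which is the default behaviour of rank()):

```r
# Data from Table 1
age  <- c(46, 20, 52, 30, 57, 25, 28, 36, 22, 43, 57, 33,
          22, 63, 40, 48, 28, 49)
chol <- c(3.5, 1.9, 4.0, 2.6, 4.5, 3.0, 2.9, 3.8, 2.1, 3.8, 4.1, 3.0,
          2.5, 4.6, 3.2, 4.2, 2.3, 4.0)

# Step 1: convert each variable to ranks (ties get the average rank)
rank_age  <- rank(age)
rank_chol <- rank(chol)

# Step 2: ordinary (Pearson) correlation between the two rank series
rho_manual <- cor(rank_age, rank_chol)

# Built-in Spearman correlation for comparison
rho_builtin <- cor(age, chol, method = "spearman")
print(c(manual = rho_manual, builtin = rho_builtin))
```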

10.1.3 Kendall correlation coefficient


The Kendall correlation coefficient (also a non-parametric method) is estimated by finding the pairs of observations that are "concordant". Two observations (xi, yi) and (xj, yj) are concordant when the difference along the horizontal axis, xi - xj, has the same sign (positive or negative) as the difference along the vertical axis, yi - yj. If x and y are not associated, the number of concordant pairs is equal or close to the number of discordant pairs.
Because many pairs have to be examined, computing the Kendall coefficient is fairly demanding in computer time. However, for a data set with fewer than 5000 subjects, a personal computer can do the calculation quite easily. R uses the function cor.test with the argument method="kendall" to estimate the Kendall correlation coefficient:
> cor.test(age, chol, method="kendall")
Kendall's rank correlation tau
data: age and chol
z = 4.755, p-value = 1.984e-06
alternative hypothesis: true tau is not equal to 0
sample estimates:
      tau
0.8333333
Warning message:
Cannot compute exact p-value with ties in:
cor.test.default(age, chol, method = "kendall")

The Kendall correlation analysis once again confirms that the association between age and cholesterol is statistically significant, with tau = 0.833 and p = 1.98e-06.
The correlation coefficients above measure the degree of association between two variables, but they do not give us an equation linking the two. The task, then, is to find a linear equation that describes this relationship, and for this we apply the linear regression model.
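The idea of concordant and discordant pairs can be made concrete by counting them. The sketch below is an illustration on the Example 1 data; it assumes the tie-corrected form of the coefficient (often called tau-b), which is what cor(..., method = "kendall") returns when ties are present. The helper names dx, dy and up are my own.

```r
# Data from Table 1
age  <- c(46, 20, 52, 30, 57, 25, 28, 36, 22, 43, 57, 33,
          22, 63, 40, 48, 28, 49)
chol <- c(3.5, 1.9, 4.0, 2.6, 4.5, 3.0, 2.9, 3.8, 2.1, 3.8, 4.1, 3.0,
          2.5, 4.6, 3.2, 4.2, 2.3, 4.0)

# Sign of the difference for every pair of subjects, on each axis
dx <- sign(outer(age,  age,  "-"))
dy <- sign(outer(chol, chol, "-"))
up <- upper.tri(dx)                 # count each pair once

concordant <- sum(dx[up] * dy[up] > 0)   # same sign on both axes
discordant <- sum(dx[up] * dy[up] < 0)   # opposite signs

# Tie-corrected Kendall coefficient: pairs tied on x (or on y) are
# excluded from the corresponding term of the denominator
tau_manual <- (concordant - discordant) /
  sqrt(sum(dx[up] != 0) * sum(dy[up] != 0))

print(c(concordant = concordant, discordant = discordant))
print(c(manual = tau_manual,
        builtin = cor(age, chol, method = "kendall")))
```

With 18 subjects there are only 153 pairs, but the number of pairs grows with the square of the sample size, which is why the text notes that the method becomes costly for large data sets.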

10.2 Simple linear regression model

10.2.1 A little theory

To make the model easier to follow and describe, let the age of individual i be xi and the cholesterol be yi, where i = 1, 2, 3, ..., 18. The linear regression model states that:

$$y_i = \alpha + \beta x_i + \varepsilon_i \qquad [1]$$

In other words, the equation assumes that an individual's cholesterol equals a constant $\alpha$, plus a coefficient $\beta$ multiplied by age, plus an error $\varepsilon_i$. In this equation, $\alpha$ is the intercept (the value of y when xi = 0) and $\beta$ is the slope (or gradient). $\alpha$ and $\beta$ are the two parameters (also called regression coefficients), and $\varepsilon_i$ is a random variable following a normal distribution with mean 0 and variance $\sigma^2$.

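The parameters $\alpha$, $\beta$ and $\sigma^2$ must be estimated from the data. The method used to estimate them is the least squares method. As the name suggests, the least squares method finds the values of $\alpha$ and $\beta$ that minimise

$$\sum_{i=1}^{n} \left[ y_i - (\alpha + \beta x_i) \right]^2$$

After a little algebra, it can be shown that the estimates of $\alpha$ and $\beta$ satisfying this condition are:

$$\hat{\beta} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \qquad [2]$$

and

$$\hat{\alpha} = \bar{y} - \hat{\beta}\,\bar{x} \qquad [3]$$

Here, $\bar{x}$ and $\bar{y}$ are the means of x and y. Note that we write $\hat{\alpha}$ and $\hat{\beta}$ (with a hat) as a reminder that these are estimates of $\alpha$ and $\beta$, not $\alpha$ and $\beta$ themselves (we do not know $\alpha$ and $\beta$ exactly; we can only estimate them).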

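Once we have the estimates $\hat{\alpha}$ and $\hat{\beta}$, we can estimate the mean cholesterol for each age as follows:

$$\hat{y}_i = \hat{\alpha} + \hat{\beta} x_i$$

Of course, $\hat{y}_i$ here is only the mean for age xi, and the remainder (that is, $y_i - \hat{y}_i$) is called the residual. The variance of the residuals can be estimated as follows:

$$s^2 = \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n - 2} \qquad [4]$$

$s^2$ is the estimate of $\sigma^2$.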

In linear regression analysis, we usually want to know whether $\beta = 0$ or not. If $\beta$ equals 0, then $y_i = \alpha + \beta x_i + \varepsilon_i = \alpha + \varepsilon_i$, that is, the differences in cholesterol between subjects merely fluctuate around the mean $\alpha$ with random error $\varepsilon$; in other words, there is no association between x and y. If $\beta$ differs from 0, we have evidence that x and y are related. To test the hypothesis $\beta = 0$ we use the following t-test:

$$t = \frac{\hat{\beta}}{SE(\hat{\beta})} \qquad [5]$$

$SE(\hat{\beta})$ is the standard error of the estimate $\hat{\beta}$. In this equation, t follows a t distribution with n - 2 degrees of freedom (if $\beta = 0$ is true).
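Equations [2] to [5] can be followed step by step in R before turning to lm. The sketch below uses the Example 1 data. One quantity is not spelled out in the text and is assumed here: the standard error of the slope is taken as SE = sqrt(s^2 / sum((x - mean(x))^2)), the usual textbook expression for simple linear regression.

```r
# Data from Table 1
age  <- c(46, 20, 52, 30, 57, 25, 28, 36, 22, 43, 57, 33,
          22, 63, 40, 48, 28, 49)
chol <- c(3.5, 1.9, 4.0, 2.6, 4.5, 3.0, 2.9, 3.8, 2.1, 3.8, 4.1, 3.0,
          2.5, 4.6, 3.2, 4.2, 2.3, 4.0)
n <- length(age)

# Equation [2]: slope
sxx      <- sum((age - mean(age))^2)
beta_hat <- sum((age - mean(age)) * (chol - mean(chol))) / sxx

# Equation [3]: intercept
alpha_hat <- mean(chol) - beta_hat * mean(age)

# Fitted values and equation [4]: residual variance
y_hat <- alpha_hat + beta_hat * age
s2    <- sum((chol - y_hat)^2) / (n - 2)

# Equation [5]: t-test for the slope (SE formula is the assumption noted above)
se_beta <- sqrt(s2 / sxx)
t_stat  <- beta_hat / se_beta
p_value <- 2 * pt(-abs(t_stat), df = n - 2)

print(c(alpha = alpha_hat, beta = beta_hat, s2 = s2,
        se_beta = se_beta, t = t_stat, p = p_value))
```

The printed values can be compared with the output of lm(chol ~ age) and summary() in the next section.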

10.2.2 Simple linear regression analysis in R

The function lm (short for linear model) in R computes the values of $\hat{\alpha}$ and $\hat{\beta}$, as well as $s^2$, quickly. We continue the example in R as follows:


> lm(chol ~ age)
Call:
lm(formula = chol ~ age)
Coefficients:
(Intercept)          age
    1.08922      0.05779

In the command above, "chol ~ age" means: describe chol as a function of age. The output of lm gives $\hat{\alpha}$ = 1.0892 and $\hat{\beta}$ = 0.05779. In other words, with these two parameters we can estimate cholesterol for any age within the age range of the sample using the linear equation:

$$\hat{y}_i = 1.08922 + 0.05779 \times \text{age}$$

This equation means that each 1-year increase in age is associated with an increase in cholesterol of about 0.058 mmol/L.
In fact, lm provides much more information, but we have to store it in an object. Calling the object reg, the commands are:
> reg <- lm(chol ~ age)
> summary(reg)
Call: lm(formula = chol ~ age)
Residuals:
     Min       1Q   Median       3Q      Max
-0.40729 -0.24133 -0.04522  0.17939  0.63040

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.089218   0.221466   4.918 0.000154 ***
age         0.057788   0.005399  10.704 1.06e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3027 on 16 degrees of freedom
Multiple R-Squared: 0.8775, Adjusted R-squared: 0.8698
F-statistic: 114.6 on 1 and 16 DF, p-value: 1.058e-08

The second command, summary(reg), asks R to list the results stored in reg. The output has three parts:
(a) Part 1 describes the residuals of the regression model:
Residuals:
     Min       1Q   Median       3Q      Max
-0.40729 -0.24133 -0.04522  0.17939  0.63040

We know that the mean of the residuals must be 0; here the median is -0.045, which is not far from 0. The 25% (1Q) and 75% (3Q) quantiles are also fairly symmetric about the median, indicating that the residuals of this equation are reasonably symmetric.

(b) Part 2 presents the estimates of $\alpha$ and $\beta$ together with their standard errors and t-test values. The t value for $\beta$ is 10.70 with p = 0.0000000106, indicating that $\beta$ is not 0. In other words, we have evidence that there is an association between cholesterol and age, and that this association is statistically significant.
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.089218   0.221466   4.918 0.000154 ***
age         0.057788   0.005399  10.704 1.06e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(c) Part 3 gives information about the variance of the residuals (residual mean square). The output reports the residual standard error, s = 0.3027, so $s^2 = 0.3027^2 \approx 0.0916$. This part also contains an F-test, which is simply another test of whether $\beta$ is truly 0, and so has the same meaning as the t-test in the previous part. In general, in simple linear regression (with a single predictor) we need not pay attention to the F-test.
Residual standard error: 0.3027 on 16 degrees of freedom
Multiple R-Squared: 0.8775,
Adjusted R-squared: 0.8698
F-statistic: 114.6 on 1 and 16 DF, p-value: 1.058e-08

In addition, Part 3 gives an important quantity, $R^2$, the coefficient of determination. It is estimated by the formula:

$$R^2 = \frac{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} \qquad [6]$$

That is, the sum of squared deviations of the fitted values from the mean, divided by the sum of squared deviations of the observed values from the mean. In this example $R^2$ is 0.8775, meaning that the linear equation (with age as the predictor) explains about 88% of the variation in cholesterol between individuals. $R^2$ of course ranges from 0 to 1 (0 to 100%). The higher the $R^2$, the closer the association between age and cholesterol.
Another coefficient worth mentioning here is the adjusted coefficient of determination (labelled Adjusted R-squared in the R output). It is $R^2$ corrected for the number of parameters in the model, and reflects the improvement in residual variance obtained by including age in the linear model. In general, it differs little from the coefficient of determination, and we need not dwell on it.
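Equation [6] can be verified against the summary() output. The following sketch recomputes $R^2$ from the fitted values; for the adjusted version it assumes the common definition 1 - (1 - R^2)(n - 1)/(n - k - 1), with k the number of predictors, which is not written out in the text.

```r
# Data from Table 1
age  <- c(46, 20, 52, 30, 57, 25, 28, 36, 22, 43, 57, 33,
          22, 63, 40, 48, 28, 49)
chol <- c(3.5, 1.9, 4.0, 2.6, 4.5, 3.0, 2.9, 3.8, 2.1, 3.8, 4.1, 3.0,
          2.5, 4.6, 3.2, 4.2, 2.3, 4.0)

reg <- lm(chol ~ age)

# Equation [6]: regression sum of squares over total sum of squares
ss_reg <- sum((fitted(reg) - mean(chol))^2)
ss_tot <- sum((chol - mean(chol))^2)
r2     <- ss_reg / ss_tot

# Adjusted R-squared (assumed standard definition, k = 1 predictor)
n <- length(chol)
k <- 1
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(c(R2 = r2, adjusted_R2 = adj_r2))
print(cor(age, chol)^2)   # in simple regression, R2 equals r squared
```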

10.2.3 Assumptions of linear regression analysis

All of the analyses above rest on several important assumptions:
(a) x is a fixed variable ("fixed" here means that it is measured without random error);
(b) $\varepsilon_i$ follows a normal distribution;
(c) $\varepsilon_i$ has mean 0;
(d) $\varepsilon_i$ has constant variance $\sigma^2$ for all xi; and
(e) successive values of $\varepsilon_i$ are not correlated with one another (in other words, $\varepsilon_1$ and $\varepsilon_2$ are unrelated).
If these assumptions are not met, the validity of the estimated model is in question. Therefore, before presenting and interpreting the model, we need to check whether the assumptions are met.
In this case, assumption (a) is not a problem, because age is not a random variable and there is no error in determining a person's age.
For assumptions (b) to (e), the simplest yet most effective check is to examine the relationship between $\hat{y}_i$, $x_i$, and the residuals $e_i$ ($e_i = y_i - \hat{y}_i$) using scatter plots.
With the function fitted() we can compute $\hat{y}_i$ for each individual (for example, for subject 1, aged 46, the predicted cholesterol is 1.08922 + 0.05779 x 46 = 3.747):
> fitted(reg)
     1      2      3      4      5      6      7      8
3.7474 2.2449 4.0942 2.8228 4.3831 2.5339 2.7072 3.1696
     9     10     11     12     13     14     15     16
2.3605 3.5741 4.3831 2.9962 2.3605 4.7298 3.4007 3.8630
    17     18
2.7072 3.9208

With the function resid() we can compute the residual $e_i$ for each individual (for subject 1, e1 = 3.5 - 3.74748 = -0.24748):
> resid(reg)
      1       2       3       4       5       6
-0.2474 -0.3449 -0.0942 -0.2228  0.1168  0.4660
      7       8       9      10      11      12
 0.1927  0.6304 -0.2605  0.2258 -0.2831  0.0037
     13      14      15      16      17      18
 0.1394 -0.1298 -0.2007  0.3369 -0.4072  0.0791

[Figure: four diagnostic plots for reg: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage (with Cook's distance contours)]
Figure 10.2. Residual analysis to check the assumptions of linear regression analysis.
To check the assumptions above, we can draw a set of 4 plots as follows:
> op <- par(mfrow=c(2,2))   # ask R to divide the plotting window into 4 panels
> plot(reg)                 # draw the diagnostic plots for reg

(a) The top-left plot shows the residuals $e_i$ against the predicted cholesterol $\hat{y}_i$. The residuals are scattered around the line y = 0, so assumption (c), that $\varepsilon_i$ has mean 0, is acceptable.
(b) The top-right plot shows the residuals against their expected values under a normal distribution. The residuals lie very close to the reference line of the normal distribution, so assumption (b), that $\varepsilon_i$ is normally distributed, is also reasonable.
(c) The bottom-left plot shows the square root of the standardized residuals against $\hat{y}_i$. There is no noticeable pattern in the standardized residuals across the values of $\hat{y}_i$, so assumption (d), that $\varepsilon_i$ has constant variance $\sigma^2$ for all xi, is also reasonable.
Overall, the residual analysis lets us conclude that the linear regression model describes the relationship between age and cholesterol adequately and reasonably.
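The graphical checks can be supplemented with a few numerical summaries of the residuals. The sketch below is only one possible approach and is not part of the original analysis: shapiro.test() is used as a formal check of normality, and a rank correlation between the absolute residuals and the fitted values serves as a rough, informal indicator of non-constant variance.

```r
# Data from Table 1
age  <- c(46, 20, 52, 30, 57, 25, 28, 36, 22, 43, 57, 33,
          22, 63, 40, 48, 28, 49)
chol <- c(3.5, 1.9, 4.0, 2.6, 4.5, 3.0, 2.9, 3.8, 2.1, 3.8, 4.1, 3.0,
          2.5, 4.6, 3.2, 4.2, 2.3, 4.0)

reg <- lm(chol ~ age)
e   <- resid(reg)

# Assumption (c): residuals average to zero (true by construction
# for least squares with an intercept)
mean_resid <- mean(e)

# Assumption (b): Shapiro-Wilk test of normality of the residuals
normality <- shapiro.test(e)

# Assumption (d): informal check for a trend in residual spread
spread <- cor.test(abs(e), fitted(reg), method = "spearman", exact = FALSE)

print(c(mean_residual = mean_resid,
        shapiro_p = normality$p.value,
        spread_p = spread$p.value))
```

With only 18 observations these tests have little power, so they are best read alongside the plots rather than in place of them.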


10.2.4 Prediction model

Now that the model for predicting cholesterol has been checked and its validity established, we can draw the line representing the relationship between age and cholesterol with the abline command (recall that the object holding the analysis is reg):
> plot(chol ~ age, pch=16)

> abline(reg)

[Figure: scatter plot of chol against age with the fitted regression line]
Figure 10.3. Regression line for the relationship between age and cholesterol.

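However, each value $\hat{y}_i$ is computed from the estimates $\hat{\alpha}$ and $\hat{\beta}$, and these estimates have standard errors, so the predicted value $\hat{y}_i$ is also subject to error. In other words, $\hat{y}_i$ is only a mean; in practice the value may be higher or lower depending on the sample drawn. The 95% confidence interval can be estimated in R with the following commands: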
> reg <- lm(chol ~ age)
> new <- data.frame(age = seq(15, 70, 5))
> pred.w.plim <- predict.lm(reg, new, interval="prediction")
> pred.w.clim <- predict.lm(reg, new, interval="confidence")
> resc <- cbind(pred.w.clim, new)
> resp <- cbind(pred.w.plim, new)
> plot(chol ~ age, pch=16)
> lines(resc$fit ~ resc$age)
> lines(resc$lwr ~ resc$age, col=2)
> lines(resc$upr ~ resc$age, col=2)
> lines(resp$lwr ~ resp$age, col=4)
> lines(resp$upr ~ resp$age, col=4)

(Note that in the commands above we refer to variables such as resc$fit, resc$lwr, resc$upr, resp$lwr and resp$upr. This notation extracts the variable fit from the object resc, or lwr and upr (the lower and upper 95% limits) from resc.)

[Figure: scatter plot of chol against age with the fitted line, the confidence limits and the prediction limits]
Figure 10.4. Predicted values and 95% confidence intervals.

Figure 10.4 shows the predicted mean $\hat{y}_i$ (black line), with the 95% confidence interval of this mean drawn in red. The blue lines show the 95% prediction interval for the cholesterol of a new individual of a given age in the population.
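The difference between the two kinds of interval is easiest to see for a single age. The sketch below predicts cholesterol at age 50 (an arbitrary value chosen for illustration, within the observed age range) and compares the widths of the confidence interval for the mean and the prediction interval for a new individual.

```r
# Data from Table 1
age  <- c(46, 20, 52, 30, 57, 25, 28, 36, 22, 43, 57, 33,
          22, 63, 40, 48, 28, 49)
chol <- c(3.5, 1.9, 4.0, 2.6, 4.5, 3.0, 2.9, 3.8, 2.1, 3.8, 4.1, 3.0,
          2.5, 4.6, 3.2, 4.2, 2.3, 4.0)

reg <- lm(chol ~ age)
new <- data.frame(age = 50)   # illustrative age, not from the text

# 95% interval for the MEAN cholesterol at age 50
conf_int <- predict(reg, new, interval = "confidence")

# 95% interval for ONE NEW individual aged 50
pred_int <- predict(reg, new, interval = "prediction")

print(conf_int)
print(pred_int)

# The prediction interval adds the residual variance, so it is wider
conf_width <- conf_int[, "upr"] - conf_int[, "lwr"]
pred_width <- pred_int[, "upr"] - pred_int[, "lwr"]
print(c(confidence = conf_width, prediction = pred_width))
```

Both intervals share the same centre (the fitted value); only the width differs, which is why the blue lines in Figure 10.4 lie outside the red ones.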

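10.3 Multiple linear regression model

The model expressed by equation [1], $y_i = \alpha + \beta x_i + \varepsilon_i$, has a single predictor (x), and is therefore usually called the simple linear regression model. In practice, we can extend this model to several variables rather than just one, for example:

$$y_i = \alpha + \beta_1 x_{1i} + \beta_2 x_{2i} + \dots + \beta_k x_{ki} + \varepsilon_i \qquad [7]$$

More explicitly:

$$\begin{aligned}
y_1 &= \alpha + \beta_1 x_{11} + \beta_2 x_{21} + \dots + \beta_k x_{k1} + \varepsilon_1 \\
y_2 &= \alpha + \beta_1 x_{12} + \beta_2 x_{22} + \dots + \beta_k x_{k2} + \varepsilon_2 \\
y_3 &= \alpha + \beta_1 x_{13} + \beta_2 x_{23} + \dots + \beta_k x_{k3} + \varepsilon_3 \\
&\;\;\vdots \\
y_n &= \alpha + \beta_1 x_{1n} + \beta_2 x_{2n} + \dots + \beta_k x_{kn} + \varepsilon_n
\end{aligned}$$

Note that the equation above contains several x variables (x1, x2, ..., xk), each with a parameter $\beta_j$ (j = 1, 2, ..., k) that must be estimated. This model is therefore called the multiple linear regression model.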
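The method of estimating $\beta_j$ again relies mainly on least squares. Let $\hat{y}_i = \hat{\alpha} + \hat{\beta}_1 x_{1i} + \hat{\beta}_2 x_{2i} + \dots + \hat{\beta}_k x_{ki}$ be the estimate of $y_i$; the least squares method finds the values $\hat{\alpha}, \hat{\beta}_1, \hat{\beta}_2, \dots, \hat{\beta}_k$ that minimise

$$\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

For the multiple linear regression model, the most compact way to write and describe the model is matrix notation. Model [7] can be expressed in matrix notation as follows:

$$Y = X\beta + \varepsilon$$

where Y is an n x 1 vector, X is an n x (k+1) matrix whose first column of 1s corresponds to the intercept $\alpha$, $\beta$ is a (k+1) x 1 vector, and $\varepsilon$ is an n x 1 vector:

$$Y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}, \quad
X = \begin{pmatrix}
1 & x_{11} & x_{21} & \dots & x_{k1} \\
1 & x_{12} & x_{22} & \dots & x_{k2} \\
\vdots & \vdots & \vdots & & \vdots \\
1 & x_{1n} & x_{2n} & \dots & x_{kn}
\end{pmatrix}, \quad
\beta = \begin{pmatrix} \alpha \\ \beta_1 \\ \beta_2 \\ \vdots \\ \beta_k \end{pmatrix}, \quad
\varepsilon = \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix}$$

The least squares method solves for the vector $\hat{\beta}$ with the following equation:

$$\hat{\beta} = (X^T X)^{-1} X^T Y$$

and the residual sum of squares is:

$$\hat{\varepsilon}^T \hat{\varepsilon} = (Y - X\hat{\beta})^T (Y - X\hat{\beta})$$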
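The matrix solution can be evaluated directly with R's matrix operators. This is a minimal sketch on the age, bmi and cholesterol data of Example 1, with a column of 1s placed in X for the intercept; in practice lm() uses a numerically more stable decomposition, so the explicit inverse below is for illustration only.

```r
# Data from Table 1
age  <- c(46, 20, 52, 30, 57, 25, 28, 36, 22, 43, 57, 33,
          22, 63, 40, 48, 28, 49)
bmi  <- c(25.4, 20.6, 26.2, 22.6, 25.4, 23.1, 22.7, 24.9,
          19.8, 25.3, 23.2, 21.8, 20.9, 26.7, 26.4, 21.2,
          21.2, 22.8)
chol <- c(3.5, 1.9, 4.0, 2.6, 4.5, 3.0, 2.9, 3.8, 2.1, 3.8, 4.1, 3.0,
          2.5, 4.6, 3.2, 4.2, 2.3, 4.0)

# Design matrix: a column of 1s (intercept), then age and bmi
X <- cbind(intercept = 1, age = age, bmi = bmi)
Y <- chol

# Least squares solution: (X'X)^(-1) X'Y
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% Y

# Residual sum of squares: (Y - X b)'(Y - X b)
res <- Y - X %*% beta_hat
rss <- as.numeric(t(res) %*% res)

print(beta_hat)
print(rss)
print(coef(lm(chol ~ age + bmi)))   # should agree with beta_hat
```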

Example 2. We return to the study of the relationship between age, bmi and cholesterol. In Example 1, we considered only the relationship between age and cholesterol, and did not look at the joint relationship of both age and bmi with cholesterol. The following figure shows the relationships among the three variables:

> pairs(data)
[Figure: scatter plot matrix of age, bmi and chol]
Figure 10.5. Pairwise relationships between age, bmi and cholesterol.


As with age and cholesterol, the relationship between bmi and cholesterol is also approximately a straight line. The figure also shows that age and bmi are related to each other. Indeed, a simple linear regression of cholesterol on bmi shows that this association is statistically significant:
> summary(lm(chol ~

bmi))

Call: lm(formula = chol ~ bmi)
Residuals:
    Min      1Q  Median      3Q     Max
-0.9403 -0.3565 -0.1376  0.3040  1.4330

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.83187    1.60841  -1.761  0.09739 .
bmi          0.26410    0.06861   3.849  0.00142 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.623 on 16 degrees of freedom
Multiple R-Squared: 0.4808, Adjusted R-squared: 0.4483
F-statistic: 14.82 on 1 and 16 DF, p-value: 0.001418

BMI explains about 48% of the variation in cholesterol between individuals. But because BMI is also related to age, we want to know which factor is more important when the two are analysed together. To assess the effect of both age (x1) and bmi (call it x2) on cholesterol (y), we use a multiple linear regression model:

$$y_i = \alpha + \beta_1 x_{1i} + \beta_2 x_{2i} + \varepsilon_i$$

or, in matrix notation, $Y = X\beta + \varepsilon$. Here, Y is an 18 x 1 vector, X is an 18 x 3 matrix (a column of 1s for the intercept plus the columns for age and bmi), $\beta$ is the 3 x 1 vector $(\alpha, \beta_1, \beta_2)$, and $\varepsilon$ is an 18 x 1 vector. To estimate the two regression coefficients, $\beta_1$ and $\beta_2$, we again use the lm() function in R as follows:
> mreg <- lm(chol ~ age + bmi)
> summary(mreg)
Call: lm(formula = chol ~ age + bmi)
Residuals:
    Min      1Q  Median      3Q     Max
-0.3762 -0.2259 -0.0534  0.1698  0.5679

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.455458   0.918230   0.496    0.627
age         0.054052   0.007591   7.120 3.50e-06 ***
bmi         0.033364   0.046866   0.712    0.487
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3074 on 15 degrees of freedom
Multiple R-Squared: 0.8815, Adjusted R-squared: 0.8657
F-statistic: 55.77 on 2 and 15 DF, p-value: 1.132e-07

The analysis gives the estimates $\hat{\alpha}$ = 0.455, $\hat{\beta}_1$ = 0.054 and $\hat{\beta}_2$ = 0.0333. In other words, the equation for predicting cholesterol from age and bmi is:
Cholesterol = 0.455 + 0.054(age) + 0.0333(bmi)
The equation says that, with bmi held constant, each additional year of age is associated with an increase of 0.054 mmol/L in cholesterol (an estimate not very different from the 0.0578 in the equation with age alone), and, with age held constant, each 1 kg/m2 increase in BMI is associated with an increase of 0.0333 mmol/L in cholesterol. Together the two factors explain about 88.2% (R2 = 0.8815) of the variation in cholesterol between individuals.
Recall that the equation with age alone (in the previous analysis) explains about 87.7% of the variation in cholesterol between individuals. When we add BMI, this figure rises to 88.2%, an increase of only 0.5 percentage points. The question is whether this 0.5% increase is statistically significant. The answer can be read from the test for bmi, with p = 0.487. Thus, bmi provides no information for predicting cholesterol beyond what age already gives us. In other words, once age is taken into account, the effect of bmi is no longer statistically significant. This is understandable, because Figure 10.5 shows that age and bmi are fairly strongly related. Since the two variables are correlated, we do not need both in the equation. (However, this example is only meant to illustrate how to carry out a multiple linear regression analysis in R; it is not intended to model the data according to biological considerations.)
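The question of whether bmi adds anything once age is in the model can also be put as a comparison of two nested models. The sketch below uses anova() for this; for a single added variable, the F-test it reports is equivalent to the t-test on the bmi coefficient, so the two p-values should coincide. The correlation between age and bmi is printed as well to quantify the relationship seen in the scatter plot matrix.

```r
# Data from Table 1
age  <- c(46, 20, 52, 30, 57, 25, 28, 36, 22, 43, 57, 33,
          22, 63, 40, 48, 28, 49)
bmi  <- c(25.4, 20.6, 26.2, 22.6, 25.4, 23.1, 22.7, 24.9,
          19.8, 25.3, 23.2, 21.8, 20.9, 26.7, 26.4, 21.2,
          21.2, 22.8)
chol <- c(3.5, 1.9, 4.0, 2.6, 4.5, 3.0, 2.9, 3.8, 2.1, 3.8, 4.1, 3.0,
          2.5, 4.6, 3.2, 4.2, 2.3, 4.0)

reg  <- lm(chol ~ age)          # model with age only
mreg <- lm(chol ~ age + bmi)    # model with age and bmi

# F-test comparing the nested models
cmp <- anova(reg, mreg)
print(cmp)

p_F <- cmp[["Pr(>F)"]][2]                                # from the F-test
p_t <- summary(mreg)$coefficients["bmi", "Pr(>|t|)"]     # from the t-test
print(c(p_from_F = p_F, p_from_t = p_t))

# How strongly the two predictors are related to each other
print(cor(age, bmi))
```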

[Figure: four diagnostic plots for mreg: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage (with Cook's distance contours)]
Figure 10.6. Residual analysis to check the assumptions of multiple linear regression analysis.
Although BMI is not statistically significant in this case, Figure 10.6 shows that the assumptions of the linear regression model are reasonably met.

10.4 Polynomial regression analysis

A natural extension of regression with several independent variables is polynomial regression. The multiple regression model describes a dependent variable as a linear function of several independent variables, whereas the polynomial regression model describes a dependent variable as a non-linear function of a single independent variable.
In mathematical terms, polynomial regression seeks the relationship between the dependent variable y and the independent variable x through functions of the following form:

$$y_i = \alpha + \beta_1 x_i + \beta_2 x_i^2 + \beta_3 x_i^3 + \dots + \beta_p x_i^p + \varepsilon_i$$

where the parameters $\beta_j$ (j = 1, 2, 3, ..., p) are coefficients measuring the relationship between y and x, and $\varepsilon_i$ is the error term of the model, assumed to follow a normal distribution with mean 0 and variance $\sigma^2$. Given a sequence of pairs (y1, x1), (y2, x2), (y3, x3), ..., (yn, xn), we can apply the least squares method to estimate $\beta_j$ and $\sigma^2$ (the model remains linear in its parameters, which is why lm can fit it).
It is easy to see that the polynomial regression model is a direct extension of the simple linear regression model: if $\beta_2 = 0$, $\beta_3 = 0$, ..., and $\beta_p = 0$, the model reduces to the one-variable linear regression model seen at the beginning of this chapter. If $y_i = \alpha + \beta_1 x_i + \beta_2 x_i^2 + \varepsilon_i$, the model is simply a quadratic equation, and so on.
Example 3. The following experiment examines the relationship between hardwood concentration and the tensile strength of a material. Nineteen samples of material with different hardwood concentrations were tested to measure tensile strength, and the results are summarised in the following table:
Id   Hardwood concentration (x)   Tensile strength (y)
 1    1.0                          6.3
 2    1.5                         11.1
 3    2.0                         20.0
 4    3.0                         24.0
 5    4.0                         26.1
 6    4.5                         30.0
 7    5.0                         33.8
 8    5.5                         34.0
 9    6.0                         38.1
10    6.5                         39.9
11    7.0                         42.0
12    8.0                         46.1
13    9.0                         53.1
14   10.0                         52.0
15   11.0                         52.5
16   12.0                         48.0
17   13.0                         42.8
18   14.0                         27.8
19   15.0                         21.9

Before analysing these data, we need to enter them into R with the usual commands:
> id <- 1:19
> conc <- c(1.0, 1.5, 2.0, 3.0, 4.0, 4.5, 5.0, 5.5,
6.0, 6.5, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0,
13.0, 14.0, 15.0)


> strength <- c(6.3, 11.1, 20.0, 24.0, 26.1, 30.0,
33.8, 34.0, 38.1, 39.9, 42.0, 46.1,
53.1, 52.0, 52.5, 48.0, 42.8, 27.8,
21.9)
> data <- data.frame(id, conc, strength)

Let us first look at the simple linear regression model:


> simple.model <- lm(strength ~ conc)
> summary(simple.model)
Call: lm(formula = strength ~ conc)
Residuals:
    Min      1Q  Median      3Q     Max
-25.986  -3.749   2.938   7.675  15.840

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  21.3213     5.4302   3.926  0.00109 **
conc          1.7710     0.6478   2.734  0.01414 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 11.82 on 17 degrees of freedom
Multiple R-Squared: 0.3054, Adjusted R-squared: 0.2645
F-statistic: 7.474 on 1 and 17 DF, p-value: 0.01414

The results show that this simple linear regression model (strength = 21.32 + 1.77*conc) explains about 31% of the variance of strength. The estimated residual variance of the model is s2 = (11.82)2 = 139.7.
Now let us look at the plot and the fitted line of this model:
> plot(strength ~ conc,
xlab="Concentration of hardwood",
ylab="Tensile strength",
main="Relationship between hardwood concentration \n and tensile strength", pch=16)
> abline(simple.model)

[Figure: scatter plot of tensile strength against concentration of hardwood, with the fitted straight line; title "Relationship between hardwood concentration and tensile strength"]
Figure 10.7. Relationship between hardwood concentration and tensile strength of the material. The straight line is the fitted simple linear regression model.
This figure makes it clear that the linear regression model does not suit the data, because the relationship between the two variables follows a curve rather than a straight line. In other words, a quadratic model is probably more appropriate. Letting y be strength and x be conc, we can write the model as:

$$y_i = \alpha + \beta_1 x_i + \beta_2 x_i^2 + \varepsilon_i$$

We now use R to estimate the three parameters.
> quadratic <- lm(strength ~ poly(conc, 2))
> summary(quadratic)
Call: lm(formula = strength ~ poly(conc, 2))
Residuals:
    Min      1Q  Median      3Q     Max
-5.8503 -3.2482 -0.7267  4.1350  6.5506

Coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)      34.184      1.014  33.709 2.73e-16 ***
poly(conc, 2)1   32.302      4.420   7.308 1.76e-06 ***
poly(conc, 2)2  -45.396      4.420 -10.270 1.89e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.42 on 16 degrees of freedom
Multiple R-Squared: 0.9085, Adjusted R-squared: 0.8971
F-statistic: 79.43 on 2 and 16 DF, p-value: 4.912e-09

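Thus the new model,

$$\hat{y} = 34.18 + 32.30\,P_1(x) - 45.40\,P_2(x)$$

where $P_1(x)$ and $P_2(x)$ are the orthogonal polynomial terms of degree 1 and 2 generated by poly(conc, 2) (so the coefficients are not those of $x$ and $x^2$ themselves), explains about 91% of the variance of y. The residual variance is now s2 = (4.42)2 = 19.5. Compared with the linear model, this model is clearly much better.
Let us try a cubic model:

$$y_i = \alpha + \beta_1 x_i + \beta_2 x_i^2 + \beta_3 x_i^3 + \varepsilon_i$$

to see whether it describes y better than the quadratic model.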
> cubic <- lm(strength ~ poly(conc, 3))
> summary(cubic)
Call: lm(formula = strength ~ poly(conc, 3))
Residuals:
     Min       1Q   Median       3Q      Max
-4.62503 -1.61085  0.04125  1.58922  5.02159

Coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)     34.1842     0.5931  57.641  < 2e-16 ***
poly(conc, 3)1  32.3021     2.5850  12.496 2.48e-09 ***
poly(conc, 3)2 -45.3963     2.5850 -17.561 2.06e-11 ***
poly(conc, 3)3 -14.5740     2.5850  -5.638 4.72e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.585 on 15 degrees of freedom
Multiple R-Squared: 0.9707, Adjusted R-squared: 0.9648
F-statistic: 165.4 on 3 and 15 DF, p-value: 1.025e-11

The cubic model describes y even better than the two previous models, with a coefficient of determination (R2) of 0.97, and all parameters in the model are statistically significant. The following figure compares the three models:
# refit the models above
> linear <- lm(strength ~ conc)
> quadratic <- lm(strength ~ poly(conc, 2))
> cubic <- lm(strength ~ poly(conc, 3))
# create a new x variable on a fine grid of values
> xnew <- (0:160)/10
# compute the predicted values of y
> y2 <- predict(quadratic, data.frame(conc=xnew))
> y3 <- predict(cubic, data.frame(conc=xnew))
# draw the three fits: linear, quadratic and cubic
> plot(strength ~ conc, pch=16,
main="Hardwood concentration and tensile strength",
sub="Linear, quadratic, and cubic fits")
> abline(linear, col="black")
> lines(xnew, y2, col="blue", lwd=3)
> lines(xnew, y3, col="red", lwd=4)

[Figure: scatter plot of strength against conc with the linear (black), quadratic (blue) and cubic (red) fits; title "Hardwood concentration and tensile strength", subtitle "Linear, quadratic, and cubic fits"]
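Because poly() builds orthogonal polynomial terms by default, the coefficients it reports are not the coefficients of x and x2 in the quadratic equation. The sketch below refits the quadratic model with raw powers, using I(conc^2), to obtain coefficients that can be read directly as $\alpha$, $\beta_1$ and $\beta_2$; the two parameterisations should give identical fitted values. The object names orth and raw are my own.

```r
# Data from Example 3
conc <- c(1.0, 1.5, 2.0, 3.0, 4.0, 4.5, 5.0, 5.5,
          6.0, 6.5, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0,
          13.0, 14.0, 15.0)
strength <- c(6.3, 11.1, 20.0, 24.0, 26.1, 30.0,
              33.8, 34.0, 38.1, 39.9, 42.0, 46.1,
              53.1, 52.0, 52.5, 48.0, 42.8, 27.8,
              21.9)

# Orthogonal polynomial terms (default behaviour of poly)
orth <- lm(strength ~ poly(conc, 2))

# Raw powers of conc: coefficients apply to conc and conc^2 directly
raw <- lm(strength ~ conc + I(conc^2))

print(coef(orth))
print(coef(raw))

# Same model, different parameterisation: fitted values coincide
print(all.equal(unname(fitted(orth)), unname(fitted(raw))))
```

An equivalent way to request raw powers is poly(conc, 2, raw = TRUE). The orthogonal form is numerically better behaved, while the raw form is easier to interpret as an equation in x.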

10.5 Building a linear model from many variables

A typical study has one dependent variable and many independent variables x1, x2, x3, ..., xk, where k may run to tens or even hundreds. The independent variables are often related to one another. There are a great many combinations of independent variables that could predict the dependent variable y. For example, with 3 independent variables x1, x2 and x3, to build a model predicting y we may have to consider the following models: y = f1(x1), y = f2(x2), y = f3(x3), y = f4(x1, x2), y = f5(x1, x3), y = f6(x2, x3), y = f7(x1, x2, x3), and so on, where each fk is a function defined by the coefficients of the variables involved. As k grows, the number of candidate models grows very rapidly.
The question is which of these models predicts y adequately, simply and plausibly. We return to these three criteria in the chapter on logistic regression. Here we discuss only one statistical criterion for building a linear regression model. When there are many candidate models, the statistical criterion for choosing an optimal model is usually the Akaike Information Criterion (AIC).
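Given a linear regression model $\hat{y}_i = \hat{\alpha} + \hat{\beta}_1 x_{1i} + \hat{\beta}_2 x_{2i} + \dots + \hat{\beta}_k x_{ki}$, we have k+1 parameters ($\hat{\alpha}, \hat{\beta}_1, \hat{\beta}_2, \dots, \hat{\beta}_k$), and can compute the residual sum of squares (RSS):

$$RSS = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

where n is the sample size. The formula shows that if the model describes y well, RSS is low, because the predicted values $\hat{y}$ and the observed values y are close to each other. A general rule of linear regression is that a model with k independent variables has an RSS no higher than a model with k-1 of those variables; likewise, a model with k-1 variables has an RSS no higher than a model with k-2 variables, and so on. In other words, the more independent variables a model has, the better it fits y. But because some of the independent variables are related to one another, adding more variables does not necessarily reduce RSS in a meaningful way. A measure that balances RSS against the number of independent variables in a model is the AIC, defined as follows:

$$AIC = \log\left(\frac{RSS}{n}\right) + \frac{2k}{n}$$

The model with the lowest AIC is regarded as the optimal model. In the following example, we use the function step to search for an optimal model based on AIC.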
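Before applying step to a larger data set, the AIC formula can be tried on the small cholesterol example, where only three candidate models exist. The sketch below evaluates the definition given above (with k counted as the number of independent variables) for each model and then lets step() perform a backward search. The helper function aic_formula is my own; the AIC values printed by step() itself are on a different scale (multiplied by n, plus a constant), but they rank the models in the same order.

```r
# Data from Table 1
age  <- c(46, 20, 52, 30, 57, 25, 28, 36, 22, 43, 57, 33,
          22, 63, 40, 48, 28, 49)
bmi  <- c(25.4, 20.6, 26.2, 22.6, 25.4, 23.1, 22.7, 24.9,
          19.8, 25.3, 23.2, 21.8, 20.9, 26.7, 26.4, 21.2,
          21.2, 22.8)
chol <- c(3.5, 1.9, 4.0, 2.6, 4.5, 3.0, 2.9, 3.8, 2.1, 3.8, 4.1, 3.0,
          2.5, 4.6, 3.2, 4.2, 2.3, 4.0)

# AIC as defined in the text: log(RSS/n) + 2k/n
aic_formula <- function(model, k) {
  rss <- sum(resid(model)^2)
  n   <- length(resid(model))
  log(rss / n) + 2 * k / n
}

m_age  <- lm(chol ~ age)
m_bmi  <- lm(chol ~ bmi)
m_both <- lm(chol ~ age + bmi)

aic_values <- c(age  = aic_formula(m_age,  k = 1),
                bmi  = aic_formula(m_bmi,  k = 1),
                both = aic_formula(m_both, k = 2))
print(aic_values)   # the smallest value marks the preferred model

# Automatic backward search starting from the full model
selected <- step(m_both, trace = 0)
print(formula(selected))
```

Here the search keeps age and drops bmi, in line with the earlier finding that bmi adds little once age is in the model.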
Example 4. The aim is to study the effect of factors such as temperature, time and chemical composition on CO2 yield. The data from this study are summarised in Table 2. The main objective is to find a linear regression model that predicts CO2 yield, and to assess the effect of these factors.

Table 2. CO2 yield and several factors that may affect it

Id   y       X1     X2    X3      X4      X5        X6        X7
 1   36.98    5.1   400   51.37    4.24   1484.83   2227.25   2.06
 2   13.74   26.4   400   72.33   30.87    289.94    434.90   1.33
 3   10.08   23.8   400   71.44   33.01    320.79    481.19   0.97
 4    8.53   46.4   400   79.15   44.61    164.76    247.14   0.62
 5   36.42    7.0   450   80.47   33.84   1097.26   1645.89   0.22
 6   26.59   12.6   450   89.90   41.26    605.06    907.59   0.76
 7   19.07   18.9   450   91.48   41.88    405.37    608.05   1.71
 8    5.96   30.2   450   98.60   70.79    253.70    380.55   3.93
 9   15.52   53.8   450   98.05   66.82    142.27    213.40   1.97
10   56.61    5.6   400   55.69    8.92   1362.24   2043.36   5.08
11   26.72   15.1   400   66.29   17.98    507.65    761.48   0.60
12   20.80   20.3   400   58.94   17.79    377.60    566.40   0.90
13    6.99   48.4   400   74.74   33.94    158.05    237.08   0.63
14   45.93    5.8   425   63.71   11.95    130.66   1961.49   2.04
15   43.09   11.2   425   67.14   14.73    682.59   1023.89   1.57
16   15.79   27.9   425   77.65   34.49    274.20    411.30   2.38
17   21.60    5.1   450   67.22   14.48   1496.51   2244.77   0.32
18   35.19   11.7   450   81.48   29.69    652.43    978.64   0.44
19   26.14   16.7   450   83.88   26.33    458.42    687.62   8.82
20    8.60   24.8   450   89.38   37.98    312.25    468.38   0.02
21   11.63   24.9   450   79.77   25.66    307.08    460.62   1.72
22    9.59   39.5   450   87.93   22.36    193.61    290.42   1.88
23    4.42   29.0   450   79.50   31.52    155.96    233.95   1.43
24   38.89    5.5   460   72.73   17.86   1392.08   2088.12   1.35
25   11.19   11.5   450   77.88   25.20    663.09    994.63   1.61
26   75.62    5.2   470   75.50    8.66   1464.11   2196.17   4.78
27   36.03   10.6   470   83.15   22.39    720.07   1080.11   5.88

Note: y = CO2 yield; X1 = time (minutes); X2 = temperature (C); X3 = percentage dissolved; X4 = amount of oil (g/100g); X5 = amount of coal; X6 = total amount dissolved; X7 = hydrogen consumed.

Before analysing the data, we need to enter them into R with the usual commands. The data will be stored in an object called REGdata.
> y <- c(36.98,13.74,10.08, 8.53,36.42,26.59,19.07,
5.96,15.52,56.61,26.72,20.80,6.99,45.93,
43.09,15.79,21.60,35.19,26.14, 8.60,
11.63, 9.59, 4.42,38.89,11.19,75.62,36.03)
> x1 <- c(5.1,26.4,23.8,46.4, 7.0,12.6,18.9,30.2,
53.8,5.6,15.1,20.3,48.4,5.8,11.2,27.9,5.1,
11.7,16.7,24.8,24.9,39.5,29.0, 5.5, 11.5,
5.2,10.6)
> x2 <- c(400,400, 400, 400, 450, 450, 450, 450, 450,
400, 400, 400, 400, 425, 425, 425, 450, 450,
450, 450, 450, 450, 450, 460, 450, 470, 470)
> x3 <- c(51.37,72.33,71.44,79.15,80.47,89.90,91.48,
98.60,98.05,55.69, 66.29,58.94,74.74,63.71,
67.14,77.65,67.22,81.48,83.88,89.38,79.77,
87.93, 79.50,72.73,77.88,75.50,83.15)
> x4 <- c(4.24,30.87,33.01,44.61,33.84,41.26,41.88,
70.79,66.82,8.92,17.98,17.79,33.94,11.95,
14.73,34.49,14.48,29.69,26.33, 37.98,25.66,
22.36,31.52,17.86,25.20, 8.66,22.39)
> x5 <- c(1484.83, 289.94, 320.79, 164.76, 1097.26,
605.06, 405.37, 253.70, 142.27,1362.24, 507.65,
377.60, 158.05, 130.66, 682.59, 274.20,
1496.51, 652.43, 458.42, 312.25, 307.08,
193.61,155.96,1392.08, 663.09,1464.11, 720.07)
> x6 <- c(2227.25, 434.90, 481.19, 247.14,1645.89,
907.59,608.05, 380.55, 213.40,2043.36, 761.48,
566.40,237.08,1961.49,1023.89, 411.30,2244.77,
978.64,687.62, 468.38, 460.62, 290.42,233.95,
2088.12,994.63,2196.17,1080.11)
> x7 <- c(2.06,1.33,0.97,0.62,0.22,0.76,1.71,3.93,1.97,
5.08,0.60,0.90, 0.63,2.04,1.57,2.38,0.32,
0.44,8.82,0.02,1.72,1.88,1.43,1.35,1.61,
4.78,5.88)
> REGdata <- data.frame(y, x1,x2,x3,x4,x5,x6,x7)

We now begin the analysis. The first model includes all 7 independent variables:
> reg <- lm(y ~ x1+x2+x3+x4+x5+x6+x7, data=REGdata)
> summary(reg)
Call:
lm(formula = y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7, data = REGdata)

Residuals:
    Min      1Q  Median      3Q     Max
-20.035  -4.681  -1.144   4.072  21.214

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 53.937016  57.428952   0.939   0.3594
x1          -0.127653   0.281498  -0.453   0.6553
x2          -0.229179   0.232643  -0.985   0.3370
x3           0.824853   0.765271   1.078   0.2946
x4          -0.438222   0.358551  -1.222   0.2366
x5          -0.001937   0.009654  -0.201   0.8431
x6           0.019886   0.008088   2.459   0.0237 *
x7           1.993486   1.089701   1.829   0.0831 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 10.61 on 19 degrees of freedom
Multiple R-Squared: 0.728,      Adjusted R-squared: 0.6278
F-statistic: 7.264 on 7 and 19 DF,  p-value: 0.0002674

The output shows that all 7 variables together explain about 73% of the variance of y. Of the 7 variables, however, only x6 is statistically significant (p = 0.024). Let us reduce the model to a simple linear regression with x6 alone.
> summary(lm(y ~ x6, data=REGdata))
Call:
lm(formula = y ~ x6, data = REGdata)

Residuals:
    Min      1Q  Median      3Q     Max
-28.081  -5.829  -0.839   5.522  26.882

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.144181   3.483064   1.764     0.09 .
x6          0.019395   0.002932   6.616 6.24e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 10.7 on 25 degrees of freedom
Multiple R-Squared: 0.6365,     Adjusted R-squared: 0.6219
F-statistic: 43.77 on 1 and 25 DF,  p-value: 6.238e-07

With the single variable x6, the model explains about 64% of the variance of y. Should we accept this model? Before doing so, we need to examine the correlations among the independent variables:
> pairs(REGdata)
[Figure: scatterplot matrix of y, x1, x2, x3, x4, x5, x6 and x7 produced by pairs(REGdata)]

The plot shows that y is related to variables such as x1, x5 and x6. In addition, x5 and x6 are very closely related (almost a straight line), with a correlation coefficient of 0.88. Furthermore, x5 and x1, and likewise x6 and x1, are related to each other, but through an inverse function. This means that x5 and x6 carry much the same information for predicting y, so we do not need both of them in one model.
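The correlation quoted above can be computed directly instead of being read off the scatterplot. The sketch below re-enters only x5 and x6 exactly as listed in Table 2, so that it runs on its own; with the full REGdata object in memory, cor(REGdata) would give the whole correlation matrix at once.

```r
# x5 and x6 copied from Table 2 so that the snippet is self-contained
x5 <- c(1484.83, 289.94, 320.79, 164.76, 1097.26, 605.06, 405.37,
        253.70, 142.27, 1362.24, 507.65, 377.60, 158.05, 130.66,
        682.59, 274.20, 1496.51, 652.43, 458.42, 312.25, 307.08,
        193.61, 155.96, 1392.08, 663.09, 1464.11, 720.07)
x6 <- c(2227.25, 434.90, 481.19, 247.14, 1645.89, 907.59, 608.05,
        380.55, 213.40, 2043.36, 761.48, 566.40, 237.08, 1961.49,
        1023.89, 411.30, 2244.77, 978.64, 687.62, 468.38, 460.62,
        290.42, 233.95, 2088.12, 994.63, 2196.17, 1080.11)

r56 <- cor(x5, x6)   # Pearson correlation between the two predictors
r56
```

A correlation this close to 1 between two predictors is the usual warning sign of collinearity: the two coefficients cannot be estimated precisely when both variables are in the model.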
To find an optimal model when the variables are correlated in this way, we apply step as follows. Note the way the model is specified, lm(y ~ .): the dot "." asks R to consider all the other variables in the object REGdata.

> reg <- lm(y ~ ., data=REGdata)
> step(reg, direction="both")
Start:  AIC= 134.07
 y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7

       Df Sum of Sq     RSS    AIC
- x5    1      4.54 2145.37 132.13
- x1    1     23.17 2164.00 132.36
- x2    1    109.34 2250.18 133.42
- x3    1    130.90 2271.74 133.68
<none>              2140.83 134.07
- x4    1    168.31 2309.14 134.12
- x7    1    377.09 2517.92 136.45
- x6    1    681.09 2821.92 139.53

Step 1:  AIC= 132.13
 y ~ x1 + x2 + x3 + x4 + x6 + x7

       Df Sum of Sq    RSS   AIC
- x1    1      22.7 2168.1 130.4
- x2    1     113.8 2259.1 131.5
- x3    1     133.5 2278.9 131.8
<none>              2145.4 132.1
- x4    1     170.8 2316.2 132.2
+ x5    1       4.5 2140.8 134.1
- x7    1     375.7 2521.1 134.5
- x6    1    1058.5 3203.8 141.0

Step 2:  AIC= 130.42
 y ~ x2 + x3 + x4 + x6 + x7

       Df Sum of Sq    RSS   AIC
- x2    1      96.8 2264.9 129.6
- x3    1     122.0 2290.0 129.9
<none>              2168.1 130.4
- x4    1     187.4 2355.5 130.7
+ x1    1      22.7 2145.4 132.1
+ x5    1       4.1 2164.0 132.4
- x7    1     385.0 2553.1 132.8
- x6    1    1526.2 3694.3 142.8

Step 3:  AIC= 129.59
 y ~ x3 + x4 + x6 + x7

       Df Sum of Sq    RSS   AIC
- x3    1      25.4 2290.3 127.9
- x4    1      90.9 2355.8 128.7
<none>              2264.9 129.6
+ x2    1      96.8 2168.1 130.4
+ x5    1       8.3 2256.5 131.5
+ x1    1       5.7 2259.1 131.5
- x7    1     384.9 2649.7 131.8
- x6    1    2015.6 4280.5 144.8

Step 4:  AIC= 127.9
 y ~ x4 + x6 + x7

       Df Sum of Sq    RSS   AIC
- x4    1      73.5 2363.8 126.7
<none>              2290.3 127.9
+ x3    1      25.4 2264.9 129.6
+ x1    1      11.3 2279.0 129.8
+ x5    1       6.3 2284.0 129.8
+ x2    1       0.3 2290.0 129.9
- x7    1     486.6 2776.9 131.1
- x6    1    1993.8 4284.1 142.8

Step 5:  AIC= 126.75
 y ~ x6 + x7

       Df Sum of Sq    RSS   AIC
<none>              2363.8 126.7
+ x4    1      73.5 2290.3 127.9
+ x1    1      33.4 2330.4 128.4
+ x3    1       8.1 2355.8 128.7
+ x5    1       7.7 2356.1 128.7
+ x2    1       7.3 2356.6 128.7
- x7    1     497.3 2861.2 129.9
- x6    1    4477.0 6840.8 153.4

Call:
lm(formula = y ~ x6 + x7, data = REGdata)

Coefficients:
(Intercept)           x6           x7
    2.52646      0.01852      2.18575

The search stops at the model with the two variables x6 and x7, which has the lowest AIC. The linear equation for predicting y is: y = 2.526 + 0.0185(x6) + 2.186(x7).
> summary(lm(y ~ x6+x7, data=REGdata))

Call:
lm(formula = y ~ x6 + x7, data = REGdata)

Residuals:
     Min       1Q   Median       3Q      Max
-23.2035  -4.3713   0.2513   4.9339  21.9682

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.526460   3.610055   0.700   0.4908
x6          0.018522   0.002747   6.742 5.66e-07 ***
x7          2.185753   0.972696   2.247   0.0341 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 9.924 on 24 degrees of freedom
Multiple R-Squared: 0.6996,     Adjusted R-squared: 0.6746
F-statistic: 27.95 on 2 and 24 DF,  p-value: 5.391e-07

The detailed output above shows that these two variables explain about 70% of the variance of y.
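To make the fitted equation concrete, the sketch below applies it to the first observation in Table 2. The helper predict_co2 is a hypothetical name introduced only for this illustration; the coefficients are copied from the summary output above.

```r
# Coefficients copied from the summary output of lm(y ~ x6 + x7)
b0 <- 2.526460   # intercept
b6 <- 0.018522   # coefficient of x6
b7 <- 2.185753   # coefficient of x7

# Hypothetical helper: predicted CO2 yield for given x6 and x7
predict_co2 <- function(x6, x7) b0 + b6 * x6 + b7 * x7

# Observation 1 of Table 2: x6 = 2227.25, x7 = 2.06, observed y = 36.98
y_hat <- predict_co2(2227.25, 2.06)
y_hat
36.98 - y_hat   # residual for this observation
```

With the fitted object available, predict(fit, newdata = data.frame(x6 = 2227.25, x7 = 2.06)) does the same job without copying coefficients by hand.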

10.6 Building a linear model with Bayesian Model Average (BMA)

One problem with the model-building approach above is that the model with x6 and x7 is treated as the final model, whereas we know that a model with x5 and x7 could be equally plausible, because x5 and x6 are so closely correlated. If the study were continued and new data were added, a different model might well emerge.

To assess this uncertainty in building a statistical model, a more promising approach than the one above is BMA (Bayesian Model Average). Readers who wish to learn more about the method can consult the papers listed below. Briefly, BMA examines all possible models (with 7 independent variables there are 2^7 = 128 possible models, not counting models with interactions) and reports the models judged best in the long run. The optimality criterion is an information criterion of the same kind as AIC; the function used below reports BIC (Bayesian Information Criterion).
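The count of 128 candidate models can be reproduced by enumerating every in/out pattern of the 7 variables. This is only an illustrative sketch of the size of the model space, not a description of how the BMA package searches it.

```r
vars <- paste0("x", 1:7)

# One row per candidate model: TRUE means the variable is in the model
inclusion <- expand.grid(rep(list(c(FALSE, TRUE)), length(vars)))
names(inclusion) <- vars

n_models <- nrow(inclusion)   # 2^7 combinations, including the empty model
n_models

# Number of models by size (how many variables they contain)
table(rowSums(inclusion))
```

The count doubles with every additional variable, which is why exhaustive comparison quickly becomes impractical by hand.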
To run BMA we need the package BMA (which can be downloaded from the R web site http://cran.R-project.org). Once the package has been installed on the computer, we load it into the R session with the command:
> library(BMA)

Next, we create a matrix containing only the independent variables. We know that the data frame REGdata has 8 variables, the first of which is y. The expression REGdata[, -1] therefore creates a new data frame that excludes the first column (y).
> xvars <- REGdata[,-1]

Next, we define the dependent variable, called co2, from REGdata:


> co2 <- REGdata[,1]

We are now ready to run the BMA analysis. The function bicreg was written specifically for linear regression. It is used as follows:
> bma <- bicreg(xvars, co2, strict=FALSE, OR=20)

We use the function summary to see the results:


> summary(bma)

Call:
bicreg(x = xvars, y = co2, strict = FALSE, OR = 20)

16 models were selected
Best 5 models (cumulative posterior probability = 0.6599 ):

           p!=0   EV        SD       model 1   model 2   model 3   model 4   model 5
Intercept  100.0   5.75672  14.6244    2.5264    6.1441    8.6120    7.5936    7.3537
x1          12.4  -0.01807   0.1008     .         .         .       -0.1393     .
x2          10.4  -0.00075   0.0282     .         .         .         .         .
x3          10.7   0.00011   0.0791     .         .         .         .       -0.0572
x4          20.2  -0.03059   0.1020     .         .       -0.1419     .         .
x5          10.5  -0.00023   0.0030     .         .         .         .         .
x6         100.0   0.01815   0.0040    0.0185    0.0193    0.0164    0.0162    0.0179
x7          73.7   1.60766   1.2821    2.1857     .        2.1628    2.1233    2.2382

nVar                                     2         1         3         3         3
r2                                     0.700     0.636     0.709     0.704     0.701
BIC                                  -25.8832  -24.0238  -23.4412  -22.9721  -22.6801
post prob                              0.311     0.123     0.092     0.072     0.063

BMA reports the 5 models judged best for predicting y (model 1, model 2, ..., model 5).

Column 1 lists the independent variables.

Column 2 (p!=0) gives the probability, in percent, that an independent variable has an effect on y. For example, the probability that x6 affects y is 100%, while the probability that x7 affects y is 73.7%. The probabilities for the other variables are much lower, about 20% at most. We can therefore say that the model with x6 and x7 is probably the best model.

Columns 3 (EV) and 4 (SD) give the mean and standard deviation of the coefficient of each independent variable.

Column 5 gives the estimated regression coefficients of model 1. As the column shows, model 1 consists of the intercept ($\alpha$) and the two variables x6 and x7. This model explains 70% of the variance of y (as we already know from the analysis above), and its BIC (Bayesian Information Criterion) is the lowest. Among all the models examined by BMA, this model has a posterior probability of 31.1%.

Column 6 gives the estimated regression coefficients of model 2. As the column shows, model 2 consists of the intercept ($\alpha$) and the variable x6. This model explains 64% of the variance of y. Among all the models examined by BMA, this model has a posterior probability of only 12.3%.

The other models can be interpreted in the same way.
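The link between the BIC row and the "post prob" row can be illustrated numerically. The sketch below assumes the usual BIC approximation, in which a model's posterior weight is proportional to exp(-BIC/2); this is an assumption about how the printed probabilities arise, not a quotation of the package's internals. Only the 5 printed models are used, so the weights are rescaled to the printed probability of model 1 (0.311) instead of being normalised over all 16 selected models.

```r
# BIC values of the 5 best models, copied from the summary output above
bic <- c(-25.8832, -24.0238, -23.4412, -22.9721, -22.6801)

# Weight of each model relative to the best one (assumed form: exp(-BIC/2))
w <- exp(-(bic - min(bic)) / 2)

# Rescale so that model 1 matches its printed posterior probability
approx_post <- 0.311 * w / w[1]
round(approx_post, 3)

# Printed values, for comparison
printed <- c(0.311, 0.123, 0.092, 0.072, 0.063)
max(abs(approx_post - printed))
```

A difference of about 2 units of BIC therefore corresponds to roughly a 2.7-fold change in posterior weight, which helps when reading the table.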

These results can also be displayed graphically:


> imageplot.bma(bma)

[Figure: image plot titled "Models selected by BMA"; rows are the variables x1 to x7, the horizontal axis is "Model #"]

References for BMA

Raftery, Adrian E. (1995). Bayesian model selection in social research (with Discussion). Sociological Methodology 1995 (Peter V. Marsden, ed.), pp. 111-196, Cambridge, Mass.: Blackwells.

Several papers related to BMA can be downloaded from the following web site: www.stat.colostate.edu/~jah/papers.

11
Analysis of variance

Analysis of variance, as the name suggests, is a family of statistical methods whose focus is the variance (rather than the mean). It belongs to the large family of methods known as linear models (or general linear models), which also includes the linear regression we met in the previous chapter. In this chapter we become familiar with using R for analysis of variance. We start with a simple analysis, then move on to two-way analysis of variance and to the commonly used non-parametric methods.

11.1 One-way analysis of variance (ANOVA)

Example 1. Table 11.1 below compares galactose levels in 3 groups of subjects: group 1 consists of 9 patients with Crohn's disease; group 2 consists of 11 patients with colitis; and group 3 consists of 20 subjects without disease (the control group). The question is whether galactose levels differ among the 3 groups. Let the means of the three groups be $\mu_1$, $\mu_2$ and $\mu_3$. In the language of hypothesis testing, the null hypothesis is:

H0: $\mu_1 = \mu_2 = \mu_3$

and the alternative hypothesis is:

HA: there is a difference among the 3 $\mu_j$ (j = 1, 2, 3)

Table 11.1. Galactose levels in 3 groups: Crohn's disease, colitis and controls

Group 1: Crohn's disease   Group 2: colitis   Group 3: controls
          1343                   1264            1809    2850
          1393                   1314            1926    2964
          1420                   1399            2283    2973
          1641                   1605            2384    3171
          1897                   2385            2447    3257
          2160                   2511            2479    3271
          2169                   2514            2495    3288
          2279                   2767            2525    3358
          2890                   2827            2541    3643
                                 2895            2769    3657
                                 3011
         n = 9                  n = 11              n = 20
       Mean: 1910             Mean: 2226          Mean: 2804
        SD: 516                SD: 688             SD: 527

Note: SD is the standard deviation.
At first glance, the reader may think that we need to make 3 comparisons (using the t test): between groups 1 and 2, groups 2 and 3, and groups 1 and 3. This approach is not sound, however, because the three variances differ. The most appropriate way to make this comparison is analysis of variance, which can be used to compare several groups at the same time (simultaneous comparisons).

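11.1.1 The analysis of variance model

To illustrate the method we need some notation. Let the galactose level of patient i in group j (j = 1, 2, 3) be $x_{ij}$. The analysis of variance model states that:

$$x_{ij} = \mu + \alpha_j + \varepsilon_{ij} \qquad [1]$$

Or, more specifically:

$$x_{i1} = \mu + \alpha_1 + \varepsilon_{i1}$$
$$x_{i2} = \mu + \alpha_2 + \varepsilon_{i2}$$
$$x_{i3} = \mu + \alpha_3 + \varepsilon_{i3}$$

That is, the galactose level of any patient equals the mean of the whole population ($\mu$), plus or minus the effect of group j, measured by the effect coefficient $\alpha_j$, plus an error term $\varepsilon_{ij}$. A further assumption is that $\varepsilon_{ij}$ follows a normal distribution with mean 0 and variance $\sigma^2$. The parameters to be estimated are $\mu$ and $\alpha_j$. As in linear regression, they are estimated by the method of least squares; that is, we look for estimates $\hat{\mu}$ and $\hat{\alpha}_j$ such that $\sum (x_{ij} - \hat{\mu} - \hat{\alpha}_j)^2$ is as small as possible.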

Returning to the study data above, we have the following summary statistics:

Group              Number of subjects (nj)   Mean           Variance
1 Crohn's disease  n1 = 9                    x̄1 = 1910     s1^2 = 265944
2 Colitis          n2 = 11                   x̄2 = 2226     s2^2 = 473387
3 Controls         n3 = 20                   x̄3 = 2804     s3^2 = 277500
Whole sample       n = 40                    x̄ = 2444

Note that:

$$x_{ij} = \bar{x} + (\bar{x}_j - \bar{x}) + (x_{ij} - \bar{x}_j) \qquad [2]$$

where $\bar{x}$ is the mean of the whole sample and $\bar{x}_j$ is the mean of group j. In other words, the term $(\bar{x}_j - \bar{x})$ reflects the difference between each group mean and the overall mean, and the term $(x_{ij} - \bar{x}_j)$ reflects the difference between a subject's galactose level and the mean of that subject's group. Accordingly, we have the following sources of variation:

Total sum of squares for the whole sample:

$$SST = \sum_j \sum_i (x_{ij} - \bar{x})^2 = (1343 - 2444)^2 + (1393 - 2444)^2 + (1420 - 2444)^2 + \dots + (3657 - 2444)^2 = 17817543$$

Sum of squares reflecting the differences between groups:

$$SSB = \sum_j \sum_i (\bar{x}_j - \bar{x})^2 = \sum_j n_j (\bar{x}_j - \bar{x})^2 = 9(1910 - 2444)^2 + 11(2226 - 2444)^2 + 20(2804 - 2444)^2 = 5681168$$

(With unrounded group means the value is 5683620, which is the figure used from here on.)

Sum of squares reflecting the variation within each group:

$$SSW = \sum_j \sum_i (x_{ij} - \bar{x}_j)^2 = \sum_j (n_j - 1) s_j^2 = (9-1)(265944) + (11-1)(473387) + (20-1)(277500) = 12133922$$

It can be shown that SST = SSB + SSW.

SSW is computed from every patient in the 3 groups, so the within-group mean square (MSW) is:

MSW = SSW / (N - k) = 12133922 / (40 - 3) = 327944

and the between-group mean square is:

MSB = SSB / (k - 1) = 5683620 / (3 - 1) = 2841810

where N is the total number of subjects in the three groups (N = 40) and k = 3 is the number of groups. If the groups differ, we expect MSB to be larger than MSW. To test the hypothesis we can therefore use the F test:

F = MSB / MSW = 8.67   [3]

with k-1 and N-k degrees of freedom. These calculations can be presented in an analysis of variance table (ANOVA table) as follows:

Source of variation           Degrees of freedom   Sum of squares   Mean square   F test
Between groups                2                    5683620          2841810       8.6655
Within groups                 37                   12133923         327944
Total                         39                   17817543
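The hand calculation above can be reproduced step by step in R. This is a minimal sketch using the data of Table 11.1; the object names (SST, SSB, SSW and so on) are chosen here only to mirror the notation of the text.

```r
# Galactose data from Table 11.1 (1 = Crohn's disease, 2 = colitis, 3 = controls)
galactose <- c(1343,1393,1420,1641,1897,2160,2169,2279,2890,
               1264,1314,1399,1605,2385,2511,2514,2767,2827,2895,3011,
               1809,2850,1926,2964,2283,2973,2384,3171,2447,3257,
               2479,3271,2495,3288,2525,3358,2541,3643,2769,3657)
group <- factor(rep(1:3, times = c(9, 11, 20)))

N <- length(galactose)      # total number of subjects
k <- nlevels(group)         # number of groups
grand  <- mean(galactose)   # overall mean
n_j    <- tapply(galactose, group, length)
mean_j <- tapply(galactose, group, mean)
var_j  <- tapply(galactose, group, var)

SST <- sum((galactose - grand)^2)      # total sum of squares
SSB <- sum(n_j * (mean_j - grand)^2)   # between-group sum of squares
SSW <- sum((n_j - 1) * var_j)          # within-group sum of squares

MSB <- SSB / (k - 1)
MSW <- SSW / (N - k)
F_stat <- MSB / MSW

c(SST = SST, SSB = SSB, SSW = SSW, MSB = MSB, MSW = MSW, F = F_stat)
```

Because the means are not rounded here, SSB and F agree with the values R reports through anova rather than with the rounded hand calculation.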

11.1.2 One-way analysis of variance with R

All the calculations above are fairly laborious and time-consuming. With R, however, they can be done within a second once the data have been prepared properly.

(a) Entering the data. First we need to enter the data into R. The first step is to tell R that we have three groups of subjects (1, 2 and 3), with 9 people in group 1, 11 in group 2 and 20 in group 3:
> group <- c(1,1,1,1,1,1,1,1,1, 2,2,2,2,2,2,2,2,2,2,2,
3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3)

For analysis of variance, we must define the variable group as a factor.

> group <- as.factor(group)

Next, we enter the galactose data for each group in the order defined above (the object is called galactose):
> galactose <- c(1343,1393,1420,1641,1897,2160,2169,2279,
2890,1264,1314,1399,1605,2385,2511,2514,
2767,2827,2895,3011,1809,2850,1926,2964,
2283,2973,2384,3171,2447,3257,2479,3271,
2495,3288,2525,3358,2541,3643,2769,3657)

We put the two variables group and galactose into a data frame called data:


> data <- data.frame(group, galactose)
> attach(data)

With the data ready, we use the function lm() to carry out the analysis of variance:
> analysis <- lm(galactose ~ group)

In this call we tell R that the variable galactose is a function of group. The result of the analysis is stored in analysis.

(b) Results of the analysis of variance. We now use the command anova to see the results:
> anova(analysis)
Analysis of Variance Table

Response: galactose
          Df   Sum Sq Mean Sq F value    Pr(>F)
group      2  5683620 2841810  8.6655 0.0008191 ***
Residuals 37 12133923  327944
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The output has the following columns: Df (degrees of freedom); Sum Sq is the sum of squares; Mean Sq is the mean square; F value is the F statistic defined in [3] above; and Pr(>F) is the P value of the F test.

The row group refers to the between-groups sum of squares and the row Residuals to the within-group sum of squares. Here we have:

SSB = 5683620 and MSB = 2841810

and:

SSW = 12133923 and MSW = 327944

Thus F = 2841810 / 327944 = 8.6655.

The value p = 0.00082 indicates that galactose levels differ among the three groups.
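The P value in the table is simply the upper tail of the F distribution at the observed F value. A minimal sketch, using the mean squares copied from the output above:

```r
# Mean squares copied from the anova output
MSB <- 2841810   # between groups, 2 degrees of freedom
MSW <- 327944    # within groups, 37 degrees of freedom

F_value <- MSB / MSW

# Probability of an F at least this large when the null hypothesis is true
p_value <- pf(F_value, df1 = 2, df2 = 37, lower.tail = FALSE)

F_value
p_value
```

The argument lower.tail = FALSE asks for the area to the right of F_value, which is the quantity reported as Pr(>F).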
(c) Estimates. For more detail on the results of the analysis, we use the command summary as follows:
> summary(analysis)
Call:
lm(formula = galactose ~ group)

Residuals:
   Min     1Q Median     3Q    Max
-995.5 -437.9  102.0  456.0  979.8

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   1910.2      190.9  10.007  4.5e-12 ***
group2         316.3      257.4   1.229 0.226850
group3         894.3      229.9   3.891 0.000402 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 572.7 on 37 degrees of freedom
Multiple R-Squared: 0.319,      Adjusted R-squared: 0.2822
F-statistic: 8.666 on 2 and 37 DF,  p-value: 0.0008191

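In this output, the intercept is $\hat{\mu}$ in model [1]; because R takes group 1 as the reference group, it equals the mean of group 1. In other words, $\hat{\mu}$ = 1910 with a standard error of 190.9.

To estimate the parameters $\alpha_j$, R sets $\hat{\alpha}_1 = 0$, so that $\hat{\alpha}_2 = \bar{x}_2 - \bar{x}_1 = 316.3$, with a standard error of 257.4 and a t statistic of 316.3 / 257.4 = 1.229 (p = 0.2268). In other words, compared with group 1 (Crohn's disease), patients with colitis have a mean galactose level that is higher by 316, but the difference is not statistically significant.

Similarly, $\hat{\alpha}_3 = \bar{x}_3 - \bar{x}_1 = 894.3$, with a standard error of 229.9, a t statistic of 894.3 / 229.9 = 3.89 and p = 0.00040. Compared with patients with Crohn's disease, the control group has a galactose level that is higher by 894, and this difference is statistically significant.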
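The interpretation of the coefficients as differences from the reference group can be verified directly. The sketch below re-enters the data of Table 11.1 and passes the data frame through the data argument of lm instead of attaching it; dat, fit and means are names chosen for this illustration.

```r
galactose <- c(1343,1393,1420,1641,1897,2160,2169,2279,2890,
               1264,1314,1399,1605,2385,2511,2514,2767,2827,2895,3011,
               1809,2850,1926,2964,2283,2973,2384,3171,2447,3257,
               2479,3271,2495,3288,2525,3358,2541,3643,2769,3657)
group <- factor(rep(1:3, times = c(9, 11, 20)))
dat <- data.frame(group, galactose)

fit   <- lm(galactose ~ group, data = dat)
means <- tapply(dat$galactose, dat$group, mean)   # group means

coef(fit)          # intercept, group2, group3
means[1]           # equals the intercept: mean of the reference group
means - means[1]   # differences from group 1: 0, group2, group3
```

Changing the reference group, for example with relevel(group, ref = "3"), changes which differences the coefficients represent but not the fitted means.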

11.2 Multiple comparisons and adjustment of p values

With k groups there are k(k-1)/2 possible pairwise comparisons. The example above has 3 groups, so there are 3 possible comparisons (group 1 versus 2, group 1 versus 3, and group 2 versus 3). With k = 10 the number of comparisons already rises to 45. As mentioned in chapter 7, when many comparisons are made, the p values computed from the statistical tests no longer carry their original meaning, because the tests may produce false positives (results with p < 0.05 when in reality there is no difference or effect). When there are many comparisons, therefore, the p values need to be adjusted appropriately.
There are quite a few methods for adjusting p values; the 4 most commonly used are Bonferroni, Scheffé, Holm and Tukey (named after 4 statisticians). Which method is most appropriate? There is no definitive answer, but the following two points may help the reader decide:

(a) If k < 10, any of the methods can be used to adjust the p values. Personally, I find Tukey's method very useful for such comparisons.

(b) If k > 10, the Bonferroni method can become very conservative. Conservative here means that the method very rarely declares a comparison statistically significant, even when a real difference exists. In this case the Tukey, Holm and Scheffé methods can be used instead.

We will not discuss the theory behind these methods here (the reader can consult statistics textbooks), but will show how to use R to carry out the comparisons, including Tukey's method.

Returning to the example above, the p values obtained so far have not been adjusted for multiple comparisons. As explained in the chapter on p values, such values overstate statistical significance and do not reflect the original significance level (0.05). One way to adjust for multiple comparisons is the Bonferroni method.

We can use the command pairwise.t.test to obtain all the p values for the comparisons among the three groups as follows:
> pairwise.t.test(galactose, group, p.adj="bonferroni")

	Pairwise comparisons using t tests with pooled SD

data:  galactose and group

  1      2
2 0.6805 -
3 0.0012 0.0321

P value adjustment method: bonferroni

The output shows that the p value for group 1 (Crohn's disease) versus colitis is 0.6805 (not statistically significant); for Crohn's disease versus controls it is 0.0012 (statistically significant); and for colitis versus controls it is 0.0321 (also statistically significant).

Another method of adjusting p values is Holm's method, which is the default of pairwise.t.test:
> pairwise.t.test(galactose, group)
	Pairwise comparisons using t tests with pooled SD

data:  galactose and group

  1      2
2 0.2268 -
3 0.0012 0.0214

P value adjustment method: holm

These results lead to the same conclusions as the Bonferroni method.
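Both adjustments can be reproduced with the function p.adjust. The unadjusted p values below are approximations back-calculated from the outputs above (for example 0.0321 / 3 for colitis versus controls), so they are assumptions of this sketch rather than values printed by R.

```r
# Approximate unadjusted p values for the three pairwise comparisons
p_raw <- c("1 vs 2" = 0.2268, "1 vs 3" = 0.0004, "2 vs 3" = 0.0107)

# Bonferroni: multiply every p value by the number of comparisons (capped at 1)
p_bonf <- p.adjust(p_raw, method = "bonferroni")

# Holm: multiply the smallest p by 3, the next by 2, the largest by 1,
# then make sure the adjusted values never decrease
p_holm <- p.adjust(p_raw, method = "holm")

rbind(raw = p_raw, bonferroni = p_bonf, holm = p_holm)
```

Holm's adjusted values are never larger than Bonferroni's, which is why it is usually preferred when only p values are needed.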


All the comparisons above use a common (pooled) standard deviation for the three groups. If we want each comparison to use the standard deviations of the groups involved, the argument pool.sd=FALSE does this:
> pairwise.t.test(galactose, group, pool.sd=FALSE)
	Pairwise comparisons using t tests with non-pooled SD

data:  galactose and group

  1      2
2 0.2557 -
3 0.0017 0.0544

P value adjustment method: holm

Once again the overall picture is much the same, although the comparison between colitis and controls (p = 0.0544) now falls just above the 0.05 threshold.

11.2.1 Multiple comparisons with Tukey's method

The methods above only give the p values for the comparisons between groups; they do not give the size of the differences or their 95% confidence intervals. To obtain these estimates we need another function called aov (short for analysis of variance) together with the function TukeyHSD (HSD stands for Honest Significant Difference), as follows:
> res <- aov(galactose ~ group)
> TukeyHSD (res)
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = galactose ~ group)

$group
        diff        lwr      upr     p adj
2-1 316.3232 -312.09857  944.745 0.4439821
3-1 894.2778  333.07916 1455.476 0.0011445
3-2 577.9545   53.11886 1102.790 0.0281768

The output shows that groups 3 and 1 differ by about 894 units, with a 95% confidence interval from 333 to 1455 units. Similarly, the galactose level in patients with colitis is lower than in the control group (group 3) by about 578 units, with a 95% confidence interval from 53 to 1103.

Figure 11.1. Mean differences and 95% confidence intervals between groups 1 and 2, 1 and 3, and 3 and 2. The horizontal axis is the difference in galactose level, the vertical axis shows the three comparisons.

11.2.2 Graphical analysis

A statistical analysis is not complete without a graph that illustrates the results. The commands below draw the mean galactose level and its standard error for each group. The graph shows that the group with Crohn's disease has the lowest galactose level (although not significantly lower than the colitis group), and that both patient groups are lower than the control group, a difference that is statistically significant.

> xbar <- tapply(galactose, group, mean)
> s <- tapply(galactose, group, sd)
> n <- tapply(galactose, group, length)
> sem <- s/sqrt(n)
> stripchart(galactose ~ group, method="jitter", jitter=0.05,
             pch=16, vertical=TRUE)
> arrows(1:3, xbar+sem, 1:3, xbar-sem, angle=90, code=3,
         length=0.1)
> lines(1:3, xbar, pch=4, type="b", cex=2)

Figure 11.2. Galactose levels in group 1 (Crohn's disease), group 2 (colitis) and group 3 (controls).

11.3 Non-parametric analysis

The non-parametric method for comparing several groups that corresponds to analysis of variance is the Kruskal-Wallis test. Like the Wilcoxon method for comparing two groups, the Kruskal-Wallis method converts the data into ranks and analyses the differences in ranks between the groups. The function kruskal.test in R performs this test:
> kruskal.test(galactose ~ group)
Kruskal-Wallis rank sum test
data: galactose by group
Kruskal-Wallis chi-squared = 12.1381, df = 2,
p-value = 0.002313

The p value from this test is low (p = 0.002313), indicating a difference among the three groups, in agreement with the analysis of variance obtained with lm above. One drawback of the Kruskal-Wallis test, however, is that it does not tell us which two groups differ; it only gives one overall p value. In many cases, non-parametric analyses such as the Kruskal-Wallis test are less efficient than parametric methods.
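To see what "analysing ranks" means in practice, the Kruskal-Wallis statistic can be computed by hand. The sketch below uses the standard formula H = 12 / (N(N+1)) × Σ R_j² / n_j − 3(N+1), which applies without correction because these data contain no tied values; R_j is the sum of the ranks in group j.

```r
galactose <- c(1343,1393,1420,1641,1897,2160,2169,2279,2890,
               1264,1314,1399,1605,2385,2511,2514,2767,2827,2895,3011,
               1809,2850,1926,2964,2283,2973,2384,3171,2447,3257,
               2479,3271,2495,3288,2525,3358,2541,3643,2769,3657)
group <- factor(rep(1:3, times = c(9, 11, 20)))

r   <- rank(galactose)           # ranks 1..40 across all subjects
N   <- length(galactose)
R_j <- tapply(r, group, sum)     # rank sum in each group
n_j <- tapply(r, group, length)  # group sizes

H <- 12 / (N * (N + 1)) * sum(R_j^2 / n_j) - 3 * (N + 1)
p <- pchisq(H, df = nlevels(group) - 1, lower.tail = FALSE)

c(H = H, p = p)
kw <- kruskal.test(galactose, group)   # should agree with H and p
kw
```

The rank sums (R_j) also show where the difference lies: the control group holds most of the high ranks.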

11.4 Two-way analysis of variance (two-way ANOVA)

Simple, or one-way, analysis of variance involves a single factor. Two-way analysis of variance, as the name suggests, involves two factors. The method is a straightforward extension of one-way analysis of variance: instead of estimating the variance due to one factor, it estimates the variance due to two factors.

Example 2. In the following example, to evaluate a new painting technique, the researchers applied the paint to 3 types of material (1, 2 and 3) under two conditions (1, 2). For each combination of condition and material the experiment was repeated 3 times. Durability was measured by a durability index (referred to as score). In total there are 18 measurements, as follows:
Table 11.2. Durability of the paint for 2 conditions and 3 materials

                              Material (j)
Condition (i)   1               2               3
1               4.1, 3.9, 4.3   3.1, 2.8, 3.3   3.5, 3.2, 3.6
2               2.7, 3.1, 2.6   1.9, 2.2, 2.3   2.7, 2.3, 2.5

These data can be summarised by the mean for each condition and material, as in the following table:

Table 11.3. Summary of the data from the paint durability experiment

                                 Material (j)          Mean over the
Condition (i)                1       2       3         3 materials
Mean
  1                          4.10    3.07    3.43      3.533
  2                          2.80    2.13    2.50      2.478
  Mean of the 2 conditions   3.450   2.600   2.967     3.00
Variance
  1                          0.040   0.063   0.043
  2                          0.070   0.043   0.040
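These preliminary calculations suggest that both condition and material may have an effect.

Let $x_{ij}$ be the score for condition i (i = 1, 2) and material j (j = 1, 2, 3). (To keep things simple, we temporarily omit the index k for the replicate.) The two-way analysis of variance model states that:

$$x_{ij} = \mu + \alpha_i + \beta_j + \varepsilon_{ij} \qquad [4]$$

Or, more specifically:

$$x_{11} = \mu + \alpha_1 + \beta_1 + \varepsilon_{11}$$
$$x_{12} = \mu + \alpha_1 + \beta_2 + \varepsilon_{12}$$
$$x_{13} = \mu + \alpha_1 + \beta_3 + \varepsilon_{13}$$
$$x_{21} = \mu + \alpha_2 + \beta_1 + \varepsilon_{21}$$
$$x_{22} = \mu + \alpha_2 + \beta_2 + \varepsilon_{22}$$
$$x_{23} = \mu + \alpha_2 + \beta_3 + \varepsilon_{23}$$

Here $\mu$ is the mean of the whole population, and the coefficients $\alpha_i$ (effect of condition i) and $\beta_j$ (effect of material j) have to be estimated from the data. The errors $\varepsilon_{ij}$ are assumed to follow a normal distribution with mean 0 and variance $\sigma^2$.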

In two-way analysis of variance, the total sum of squares is divided into 3 sources. Let a = 2 be the number of conditions, b = 3 the number of materials, and N = 18 the total number of measurements (3 replicates for each combination of condition and material). Let $\bar{x}$ be the overall mean, $\bar{x}_i$ the mean for each condition and $\bar{x}_j$ the mean for each material.

The first source is the sum of squares due to differences between the 2 conditions:

$$SS_c = \sum_i n_i (\bar{x}_i - \bar{x})^2 = 9(3.533 - 3.00)^2 + 9(2.478 - 3.00)^2 = 5.01$$

The second source is the sum of squares due to differences between the 3 materials:

$$SS_m = \sum_j n_j (\bar{x}_j - \bar{x})^2 = 6(3.45 - 3.00)^2 + 6(2.60 - 3.00)^2 + 6(2.967 - 3.00)^2 = 2.18$$

The third source is the residual sum of squares, that is, what remains of the total sum of squares (7.93) after the two factors have been accounted for:

$$SS_e = \sum (x_{ijk} - \bar{x}_i - \bar{x}_j + \bar{x})^2 = SST - SS_c - SS_m = 7.93 - 5.01 - 2.18 = 0.73$$

Note that the variation within the six cells, $\sum (n_{ij} - 1) s_{ij}^2$ = 2(0.040) + 2(0.063) + 2(0.043) + 2(0.070) + 2(0.043) + 2(0.040) = 0.60, is smaller than $SS_e$; the remaining 0.13 is the interaction between condition and material, which model [4] leaves in the residual.

$SS_c$ has a-1 = 1 degree of freedom, $SS_m$ has b-1 = 2 degrees of freedom, and $SS_e$ has N-a-b+1 = 14 degrees of freedom. The mean squares are therefore:

Between the two conditions:   MSc = SSc / (a-1) = 5.01 / 1 = 5.01
Between the three materials:  MSm = SSm / (b-1) = 2.18 / 2 = 1.09
Residual:                     MSe = SSe / (N-a-b+1) = 0.73 / 14 = 0.052

The difference between the two conditions is therefore tested with F = MSc/MSe, with 1 and 14 degrees of freedom. Similarly, the difference between the three materials is tested with F = MSm/MSe, with 2 and 14 degrees of freedom. These calculations can be presented in an analysis of variance table as follows:

Source of variation               Degrees of freedom   Sum of squares   Mean square   F test
Between the 2 conditions          1                    5.01             5.01          95.6
Between the 3 materials           2                    2.18             1.09          20.8
Residual                          14                   0.73             0.052
Total                             17                   7.92
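The three sources of variation can be computed directly from the 18 scores of Table 11.2. The following is a minimal sketch; the object names mirror the notation of the text, and SSwithin is an extra quantity added here to show the within-cell variation separately.

```r
score <- c(4.1,3.9,4.3, 3.1,2.8,3.3, 3.5,3.2,3.6,
           2.7,3.1,2.6, 1.9,2.2,2.3, 2.7,2.3,2.5)
condition <- gl(2, 9, 18)   # condition 1 for the first 9 scores, then condition 2
material  <- gl(3, 3, 18)   # materials 1, 2, 3 in blocks of 3, repeated twice

grand <- mean(score)
SST <- sum((score - grand)^2)

# Sum of squares for each factor: group size times squared deviation of group mean
SSc <- sum(tapply(score, condition, length) *
           (tapply(score, condition, mean) - grand)^2)
SSm <- sum(tapply(score, material, length) *
           (tapply(score, material, mean) - grand)^2)
SSe <- SST - SSc - SSm      # residual of the additive model [4]

# Variation within the six condition-by-material cells (3 replicates each)
cell_var <- tapply(score, list(condition, material), var)
SSwithin <- sum((3 - 1) * cell_var)

round(c(SSc = SSc, SSm = SSm, SSe = SSe, SSwithin = SSwithin), 4)
fit <- lm(score ~ condition + material)
anova(fit)                  # the Sum Sq column should match SSc, SSm and SSe
```

The gap between SSe and SSwithin is the sum of squares that an interaction term would absorb.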

11.4.1 Two-way analysis of variance with R

(a) The first step is to enter the data of Table 11.2 into R. The data need to be organised into 4 variables as follows:

Condition   Material   Subject   Score
1           1           1        4.1
1           1           2        3.9
1           1           3        4.3
1           2           4        3.1
1           2           5        2.8
1           2           6        3.3
1           3           7        3.5
1           3           8        3.2
1           3           9        3.6
2           1          10        2.7
2           1          11        3.1
2           1          12        2.6
2           2          13        1.9
2           2          14        2.2
2           2          15        2.3
2           3          16        2.7
2           3          17        2.3
2           3          18        2.5

We can create such sequences of levels with the function gl (generate levels). Its use can be illustrated as follows:
> gl(9, 1, 18)
 [1] 1 2 3 4 5 6 7 8 9 1 2 3 4 5 6 7 8 9
Levels: 1 2 3 4 5 6 7 8 9

The command above creates the sequence 1, 2, 3, ..., 9 twice (18 values in total), each value being one level. By contrast, the command:

> gl(4, 9, 36)
 [1] 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4
Levels: 1 2 3 4

creates a sequence with 4 levels (1, 2, 3, 4), each repeated 9 times in a row (36 values in total).
To create the levels for condition and material, we therefore use:

> condition <- gl(2, 9, 18)
> material <- gl(3, 3, 18)

and create 18 identification numbers (from 1 to 18):

> id <- 1:18

Finally, the data for score:

> score <- c(4.1,3.9,4.3, 3.1,2.8,3.3, 3.5,3.2,3.6,
             2.7,3.1,2.6, 1.9,2.2,2.3, 2.7,2.3,2.5)

Everything is put into a data frame called data:

> data <- data.frame(condition, material, id, score)
> attach(data)
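When factors are generated with gl, it is worth checking that the design is laid out as intended. A minimal sketch, assuming the same ordering of the 18 scores as in the table above:

```r
condition <- gl(2, 9, 18)   # 2 levels, each repeated 9 times in a row
material  <- gl(3, 3, 18)   # 3 levels in blocks of 3, pattern recycled to length 18

# Cross-tabulation: every condition-by-material cell should hold 3 observations
design <- table(condition, material)
design

# The same factors written with rep(), for comparison
condition2 <- factor(rep(1:2, each = 9))
material2  <- factor(rep(rep(1:3, each = 3), times = 2))
```

A balanced table (3 in every cell) confirms that each score will be matched with the right condition and material.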

(b) Analysis and preliminary results. The data are now ready for analysis. For two-way analysis of variance we again use the command lm, specified as follows:
> twoway <- lm(score ~ condition + material)
> anova(twoway)
Analysis of Variance Table

Response: score
          Df Sum Sq Mean Sq F value    Pr(>F)
condition  1 5.0139  5.0139  95.575 1.235e-07 ***
material   2 2.1811  1.0906  20.788 6.437e-05 ***
Residuals 14 0.7344  0.0525
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The three sources of variation in score are analysed in the table above. Judging by the mean squares, the effect of condition appears to be more important than the effect of material. Both effects are nevertheless statistically significant, since the p values are very low for both factors.

(c) Estimates. We ask R to summarise the estimates with the command summary:
> summary(twoway)
Call:
lm(formula = score ~ condition + material)

Residuals:
     Min       1Q   Median       3Q      Max
-0.32778 -0.16389  0.03333  0.16111  0.32222

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   3.9778     0.1080  36.841 2.43e-15 ***
condition2   -1.0556     0.1080  -9.776 1.24e-07 ***
material2    -0.8500     0.1322  -6.428 1.58e-05 ***
material3    -0.4833     0.1322  -3.655   0.0026 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.229 on 14 degrees of freedom
Multiple R-Squared: 0.9074,     Adjusted R-squared: 0.8875
F-statistic: 45.72 on 3 and 14 DF,  p-value: 1.761e-07

The results above show that, compared with condition 1, condition 2 has a score that is lower by about 1.056, with a standard error of 0.108 and a p-value of 1.24e-07, which is statistically significant. Compared with material 1, the scores for materials 2 and 3 are also significantly lower, with the lowest score observed for material 2, so the effect of the test material is statistically significant as well.
The value labelled Residual standard error is estimated from the residual mean square in part (a): $\sqrt{0.0525} = 0.229$, which is the estimate of $\sigma$.
The coefficient of multiple determination ($R^2$) indicates that the two factors, condition and material, explain about 91% of the variation in the whole sample. It is computed from the sums of squares in part (a) as follows:

$$R^2 = \frac{5.0139 + 2.1811}{5.0139 + 2.1811 + 0.7344} = 0.9074$$

Finally, the adjusted $R^2$ reflects the improvement achieved by the model. To understand it better, note that the variance of the whole sample is $s^2 = (5.0139 + 2.1811 + 0.7344) / 17 = 0.4664$. After adjusting for the effects of condition and material, this variance falls to 0.0525 (the residual mean square). The two factors therefore reduce the variance by about $0.4664 - 0.0525 = 0.4139$, and the adjusted $R^2$ is:

$$\text{Adj } R^2 = 0.4139 / 0.4664 \approx 0.887$$

That is, after adjusting for condition and material, the variance of score is reduced by about 89%.
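The arithmetic above can be reproduced in a few lines of R. The sketch below is only a cross-check: it takes the sums of squares and degrees of freedom from the ANOVA table as given (the raw data are not needed), and the variable names are illustrative.

```r
# Sums of squares copied from the ANOVA table of score ~ condition + material
ss_condition <- 5.0139
ss_material  <- 2.1811
ss_residual  <- 0.7344
df_residual  <- 14
n            <- 18   # residual df (14) + model df (3) + intercept (1)

ss_total    <- ss_condition + ss_material + ss_residual
ms_residual <- ss_residual / df_residual

rse      <- sqrt(ms_residual)                       # residual standard error
r2       <- (ss_condition + ss_material) / ss_total # multiple R-squared
s2_total <- ss_total / (n - 1)                      # variance of the whole sample
adj_r2   <- 1 - ms_residual / s2_total              # adjusted R-squared

print(round(c(rse = rse, r2 = r2, adj_r2 = adj_r2), 4))
```

The three values should agree, up to rounding, with the Residual standard error, Multiple R-Squared and Adjusted R-squared lines printed by summary.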
(d) Interaction effects
To complete the analysis, we need to consider the possibility that the effects of the two factors interact. The model for score then becomes:

$$x_{ij} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \varepsilon_{ij}$$

Note that the term $(\alpha\beta)_{ij}$ in this equation reflects the interaction between the two factors. The corresponding R command is simple:
> anova(twoway <- lm(score ~ condition + material + condition*material))
Analysis of Variance Table
Response: score
                   Df Sum Sq Mean Sq  F value    Pr(>F)
condition           1 5.0139  5.0139 100.2778 3.528e-07 ***
material            2 2.1811  1.0906  21.8111 0.0001008 ***
condition:material  2 0.1344  0.0672   1.3444 0.2972719
Residuals          12 0.6000  0.0500
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

In the analysis above, the p-value for the interaction effect is 0.297. There is therefore no evidence of an interaction between material and condition, and we accept model [4], that is, the model without interaction.
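As a quick check on the interaction test, the F statistic and its p-value can be recomputed from the table entries alone. This is a minimal sketch using the printed (rounded) sums of squares, so the last digits may differ slightly from R's own output.

```r
# Entries copied from the ANOVA table that includes condition:material
ss_interaction <- 0.1344
df_interaction <- 2
ss_residual    <- 0.6000
df_residual    <- 12

# F = interaction mean square / residual mean square
f_interaction <- (ss_interaction / df_interaction) / (ss_residual / df_residual)

# Upper-tail probability of the F distribution gives the p-value
p_interaction <- pf(f_interaction, df_interaction, df_residual, lower.tail = FALSE)

print(round(c(F = f_interaction, p = p_interaction), 4))
```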

(e) Comparisons between groups. We estimate the differences between the two conditions and among the three materials with the TukeyHSD function applied to an aov fit:
> res <- aov(score ~ condition+ material+condition)
> TukeyHSD(res)
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = score ~ condition + material + condition)
$condition
         diff       lwr        upr p adj
2-1 -1.055556 -1.287131 -0.8239797 1e-07
$material
          diff         lwr        upr     p adj
2-1 -0.8500000 -1.19610279 -0.5038972 0.0000442
3-1 -0.4833333 -0.82943612 -0.1372305 0.0068648
3-2  0.3666667  0.02056388  0.7127695 0.0374069

Figure 11.3 illustrates these results:


> plot(TukeyHSD(res), ordered=TRUE)
There were 16 warnings (use warnings() to see them)

Figure 11.3. Comparison of the three materials by Tukey's method (95% family-wise confidence level; x-axis: differences in mean levels of material).

Figure 11.4. Mean score for condition 1 (dashed line) and condition 2 across the three materials (x-axis: material; y-axis: mean of score).

(f) Plot. To examine the effects of the two factors, condition and material, we need a plot that in analysis of variance is called an interaction plot. The interaction.plot function draws it (see Figure 11.4); its arguments are, in order, the factor on the x-axis, the factor that defines the traces, and the response:
> interaction.plot(material, condition, score)

11.5 Analysis of covariance (ANCOVA)

Analysis of covariance (abbreviated ANCOVA) is a method that combines linear regression and analysis of variance. In linear regression, both the dependent variable (also called the response variable) and the independent variable (or predictor variable) are mostly continuous, such as cholesterol and age. In analysis of variance, the dependent variable is continuous, whereas the independent variable is ordinal or categorical, such as galactose and patient group in Example 1. In analysis of covariance, the dependent variable is continuous, but the independent variables can be both continuous and categorical.
Example 3. In the study whose results are presented below, the researchers measured the height and age of 18 pupils from an urban area and 14 pupils from a rural area.
Table 11.4. Height of pupils in urban and rural areas
Area    ID   Age (months)   Height (cm)
urban    1   109            137.6
urban    2   113            147.8
urban    3   115            136.8
urban    4   116            140.7
urban    5   119            132.7
urban    6   120            145.4
urban    7   121            135.0
urban    8   124            133.0
urban    9   126            148.5
urban   10   129            148.3
urban   11   130            147.5
urban   12   133            148.8
urban   13   134            133.2
urban   14   135            148.7
urban   15   137            152.0
urban   16   139            150.6
urban   17   141            165.3
urban   18   142            149.9
rural    1   121            139.0
rural    2   121            140.9
rural    3   128            134.9
rural    4   129            149.5
rural    5   131            148.7
rural    6   132            131.0
rural    7   133            142.3
rural    8   134            139.9
rural    9   138            142.9
rural   10   138            147.7
rural   11   138            147.7
rural   12   140            134.6
rural   13   140            135.8
rural   14   140            148.5

The question is whether height differs between urban and rural children. In other words, does the place of residence affect height, and if so, by how much?
One factor with a large effect on height is age. During the growing years, height increases with age. A comparison of height between the two groups can therefore be objective only if the two groups have comparable ages. To keep the comparison objective, we need to analyze the data with an analysis of covariance model.
The first step is to enter the data into R with the following commands:
> # create the id sequence
> id <- c(1:18, 1:14)
> # group: 1 = urban, 2 = rural; group must be declared as a factor
> group <- c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
             2,2,2,2,2,2,2,2,2,2,2,2,2,2)
> group <- as.factor(group)
> # enter the data
> age <- c(109,113,115,116,119,120,121,124,126,129,130,
           133,134,135,137,139,141,142,
           121,121,128,129,131,132,133,134,138,138,138,140,140,140)
> height <- c(137.6,147.8,136.8,140.7,132.7,145.4,135.0,
              133.0,148.5,148.3,147.5,148.8,133.2,
              148.7,152.0,150.6,165.3,149.9,
              139.0,140.9,134.9,149.5,148.7,131.0,142.3,
              139.9,142.9,147.7,147.7,134.6,135.8,148.5)
> # build a data frame
> data <- data.frame(id, group, age, height)
> attach(data)

Let us first look at some descriptive statistics by computing the mean age and mean height of each group of pupils:
> tapply(age, group, mean)
       1        2
126.8333 133.0714
> tapply(height, group, mean)
       1        2
144.5444 141.6714
These results show that urban pupils are younger than rural pupils by about 6.3 months (126.8 vs 133.1), yet they are taller by about 2.8 cm (144.5 vs 141.7). A t-test shows that the age difference between the two groups is statistically significant (p = 0.045).
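The t-test mentioned above is not shown in the text; the following is a minimal sketch of how it might be run, assuming R's default Welch test (unequal variances). The ages are those of Table 11.4, and the object names are illustrative.

```r
# Ages (months) from Table 11.4: 18 urban pupils followed by 14 rural pupils
age <- c(109, 113, 115, 116, 119, 120, 121, 124, 126, 129, 130,
         133, 134, 135, 137, 139, 141, 142,
         121, 121, 128, 129, 131, 132, 133, 134, 138, 138, 138, 140, 140, 140)
group <- factor(rep(c(1, 2), times = c(18, 14)))   # 1 = urban, 2 = rural

# Mean age per group and their difference (rural minus urban)
mean_age <- tapply(age, group, mean)
age_diff <- unname(mean_age["2"] - mean_age["1"])

# Welch two-sample t-test (var.equal = FALSE is the default)
age_test <- t.test(age ~ group)

print(round(mean_age, 2))
print(age_test$p.value)
```

With var.equal = TRUE the test pools the two variances and would give a somewhat different p-value, so the choice of test matters here.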

In addition, the following plot shows a correlation between age and height:

Figure 11.5. Height (cm) against age (months) for the two groups of pupils, urban and rural.

Because the two groups differ in age, and age is associated with height, we cannot compare height between the two groups without adjusting for age. To make that adjustment, we use analysis of covariance.

11.5.1 The analysis of covariance model

Let y be height, x age, and g group. The basic ANCOVA model assumes that the relationship between y and x is a straight line and that the slope (gradient) of this relationship is the same in both groups. In linear regression notation:

$$y_1 = \alpha_1 + \beta x + e_1 \quad \text{in group 1}$$
$$y_2 = \alpha_2 + \beta x + e_2 \quad \text{in group 2} \qquad [5]$$

where:
$\alpha_1$: mean value of y when x = 0 in group 1;
$\alpha_2$: mean value of y when x = 0 in group 2;
$\beta$: slope of the relationship between y and x;
$e_1$ and $e_2$: random variables with mean 0 and variance $\sigma^2$.

Let $\bar{x}$ be the mean age of both groups combined, and $\bar{x}_1$ and $\bar{x}_2$ the mean ages of groups 1 and 2. As noted above, if $\bar{x}_1 \neq \bar{x}_2$, a comparison of the mean heights of groups 1 and 2 ($\bar{y}_1$ and $\bar{y}_2$) is not objective, because

$$\bar{y}_1 = \alpha_1 + \beta \bar{x}_1 + \bar{e}_1$$
$$\bar{y}_2 = \alpha_2 + \beta \bar{x}_2 + \bar{e}_2$$

and the difference between the two groups now depends on the coefficient $\beta$:

$$\bar{y}_1 - \bar{y}_2 = \alpha_1 - \alpha_2 + \beta(\bar{x}_1 - \bar{x}_2)$$

Note that in model [5], $\alpha_1 - \alpha_2$ can be interpreted as the difference in mean height between the two groups if both groups had the same mean age. This difference expresses the group effect in the absence of any other factor related to y. To estimate $\alpha_1 - \alpha_2$ we cannot simply subtract the two means, $\bar{y}_1 - \bar{y}_2$; we must adjust for x. Let $x^*$ be a value common to both groups. The adjusted value of y for group 1 (denoted $\bar{y}_{1a}$) is:

$$\bar{y}_{1a} = \bar{y}_1 - \beta(\bar{x}_1 - x^*)$$

$\bar{y}_{1a}$ can be regarded as an estimate of the mean height of group 1 (urban) at the value $x = x^*$. Similarly:

$$\bar{y}_{2a} = \bar{y}_2 - \beta(\bar{x}_2 - x^*)$$

is the estimate of the mean height of group 2 (rural) at the same value $x^*$. From these we can estimate the effect of urban versus rural residence:

$$\bar{y}_{1a} - \bar{y}_{2a} = \bar{y}_1 - \bar{y}_2 - \beta(\bar{x}_1 - \bar{x}_2)$$

The problem is therefore to estimate $\beta$. It can be shown that the least squares estimate also yields an unbiased estimate of $\alpha_1 - \alpha_2$.
Written as a linear model, the analysis of covariance model is:

$$y = \alpha + \beta x + \gamma g + \delta(xg) + e \qquad [6]$$

In other words, the model states that the height of a pupil is affected by three factors: age ($\beta$), urban or rural residence ($\gamma$), and the interaction between the two ($\delta$). If $\delta = 0$ (that is, the interaction effect is not statistically significant), the model reduces to:

$$y = \alpha + \beta x + \gamma g + e \qquad [7]$$

If $\gamma = 0$ (that is, the effect of residence is not statistically significant), the model reduces further to:

$$y = \alpha + \beta x + e \qquad [8]$$
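The adjustment described above can be sketched numerically. The example below fits the common-slope model to the Table 11.4 data and computes the adjusted group means at a reference age; taking the reference age as the overall mean age is an arbitrary choice made here for illustration, and the object names are not from the text.

```r
# Data from Table 11.4: 18 urban pupils followed by 14 rural pupils
age <- c(109, 113, 115, 116, 119, 120, 121, 124, 126, 129, 130,
         133, 134, 135, 137, 139, 141, 142,
         121, 121, 128, 129, 131, 132, 133, 134, 138, 138, 138, 140, 140, 140)
height <- c(137.6, 147.8, 136.8, 140.7, 132.7, 145.4, 135.0,
            133.0, 148.5, 148.3, 147.5, 148.8, 133.2,
            148.7, 152.0, 150.6, 165.3, 149.9,
            139.0, 140.9, 134.9, 149.5, 148.7, 131.0, 142.3,
            139.9, 142.9, 147.7, 147.7, 134.6, 135.8, 148.5)
group <- factor(rep(c(1, 2), times = c(18, 14)))   # 1 = urban, 2 = rural

# Common-slope model: least squares estimate of the slope for age
fit  <- lm(height ~ group + age)
beta <- unname(coef(fit)["age"])

# Adjusted means: raw group means shifted to a common reference age
x_star <- mean(age)
ybar   <- tapply(height, group, mean)
xbar   <- tapply(age, group, mean)
y_adj  <- ybar - beta * (xbar - x_star)

# Adjusted difference, rural minus urban
adj_diff <- unname(y_adj["2"] - y_adj["1"])

print(round(y_adj, 2))
print(round(adj_diff, 3))
```

The adjusted difference does not depend on the reference age chosen, and it coincides with the group2 coefficient of the fitted model.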

11.5.2 Analysis with R

The discussion above may look rather complicated, but in practice the estimation is very simple in R with the lm function. We analyze the three models [6], [7] and [8]:
> # model 6
> model6 <- lm(height ~ group + age + group:age)
> # model 7
> model7 <- lm(height ~ group + age)
> # model 8
> model8 <- lm(height ~ age)
We can also compare all three models at once with the anova command:
> anova(model6, model7, model8)
Analysis of Variance Table
Model 1: height ~ group + age + group:age
Model 2: height ~ group + age
Model 3: height ~ age
  Res.Df     RSS Df Sum of Sq      F  Pr(>F)
1     28 1270.44
2     29 1338.02 -1    -67.57 1.4893 0.23251
3     30 1545.95 -1   -207.93 4.5827 0.04114 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Note that "Model 1" is model [6], "Model 2" is model [7], and "Model 3" is model [8]. RSS is the residual sum of squares of each model. The results show that:

• The whole sample has 18 + 14 = 32 pupils and model [6] has 4 parameters (the intercept, the age slope, the group effect and the interaction), so this model has 32 - 4 = 28 degrees of freedom. Its residual sum of squares is 1270.44.

• Model [7] has 3 parameters (hence 29 degrees of freedom), so its residual sum of squares is higher than that of model [6]. However, its residual mean square, 1338.02 / 29 = 46.14, is not much different from that of model [6] (1270.44 / 28 = 45.37), and the p-value of 0.2325 is not statistically significant. In other words, dropping the interaction term does not appreciably change the predictive ability of the model.

• Model [8] has only 2 parameters (hence 30 degrees of freedom), with a residual sum of squares of 1545.95. Its residual mean square is 51.53 (1545.95 / 30), and dropping the group effect from model [7] increases the residual sum of squares significantly (p = 0.0411).
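The F statistics in the comparison table can be recomputed from the residual sums of squares. The sketch below uses the printed RSS values and assumes, as the printed F values suggest, that both tests use the residual mean square of the largest model (model [6]) as the denominator; the names are illustrative.

```r
# Residual sums of squares and residual degrees of freedom from the table
rss <- c(model6 = 1270.44, model7 = 1338.02, model8 = 1545.95)
df  <- c(model6 = 28,      model7 = 29,      model8 = 30)

# Denominator: residual mean square of the largest model
ms_full <- unname(rss["model6"] / df["model6"])

# Each step drops one parameter, so the numerator has 1 degree of freedom
f_7_vs_6 <- unname(rss["model7"] - rss["model6"]) / ms_full
f_8_vs_7 <- unname(rss["model8"] - rss["model7"]) / ms_full

p_7_vs_6 <- pf(f_7_vs_6, 1, df["model6"], lower.tail = FALSE)
p_8_vs_7 <- pf(f_8_vs_7, 1, df["model6"], lower.tail = FALSE)

print(round(c(F = f_7_vs_6, p = unname(p_7_vs_6)), 4))
print(round(c(F = f_8_vs_7, p = unname(p_8_vs_7)), 4))
```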
From this analysis, model [7] is the best choice, because it needs only 3 parameters to describe the data adequately. We now concentrate on the results of this model.
> summary(model7)
Call:
lm(formula = height ~ group + age)
Residuals:
    Min      1Q  Median      3Q     Max
-14.324  -3.285   0.879   3.956  14.866
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  91.8171    17.9294   5.121 1.81e-05 ***
group2       -5.4663     2.5749  -2.123  0.04242 *
age           0.4157     0.1408   2.953  0.00619 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6.793 on 29 degrees of freedom
Multiple R-Squared: 0.2588, Adjusted R-squared: 0.2077
F-statistic: 5.063 on 2 and 29 DF, p-value: 0.01300

From the parameter estimates above, height increases on average by about 0.42 cm for each month of age. Note that "group2" in the output is the regression coefficient for group 2 (rural); R sets the coefficient for group 1 to 0 for computational convenience. We therefore have two equations (two lines), one for each group of pupils.
For urban pupils:
Height = 91.817 + 0.4157(age)
And for rural pupils:
Height = 91.817 - 5.4663 + 0.4157(age)
In other words, after adjusting for age, rural pupils are about 5.5 cm shorter than urban pupils, and this difference is statistically significant (p = 0.0424). (Note that before adjusting for age, the difference was 2.8 cm.)
The following plots illustrate the models:
> par(mfrow=c(2,2))
> plot(age, height, pch=as.character(group), main="Model 1")
> abline(144.54, 0)  # mean value for urban
> abline(141.67, 0)  # mean value for rural
> plot(age, height, pch=as.character(group), main="Model 2")
> abline(102.63, 0.3138)  # single line for dependence on age
> plot(age, height, pch=as.character(group), main="Model 3")
> abline(91.8, 0.416)  # line for urban
> abline(91.8-5.46, 0.416)  # line for rural, parallel to urban
> plot(age, height, pch=as.character(group), main="Model 4")
> abline(79.7, 0.511)  # line for urban
> abline(79.7+47.08, 0.511-0.399)  # line for rural, with its own slope
> par(mfrow=c(1,1))

Figure 11.6. Four panels (Model 1 to Model 4) of height against age, with plotting symbols 1 = urban and 2 = rural. Model 1: height is a function of place of residence but is unrelated to age. Model 2 assumes that height depends on age but does not differ between the urban and rural groups. Model 3 assumes that the relationship between height and age is the same in the urban and rural groups (two parallel lines), but that urban pupils are taller than rural pupils. Model 4 assumes that the difference in height between the two groups depends on age (that is, age and place of residence interact): below 120 months the two groups differ little, whereas above 120 months urban pupils are taller than rural pupils. The analysis above shows that Model 3 is the best.

11.6 Analysis of variance for a factorial experiment

Example 4. To study the effect of 4 pesticides (1, 2, 3 and 4) and three varieties (B1, B2 and B3) on orange yield, researchers carried out a factorial experiment. For each variety, 4 orange trees were chosen at random, and the 4 pesticides were assigned (also at random) to the trees, one per tree. The yield for each variety and pesticide was as follows:

Table 11.5. Orange yield for 3 varieties and 4 pesticides

Variety      Pesticide                        Total
             1       2       3       4
B1           29      50      43      53       175
B2           41      58      42      73       214
B3           66      85      69      85       305
Total        136     193     154     211      694

The model for a factorial experiment is no different from the two-way analysis of variance described in the preceding section. Specifically, the model we consider is:

$$\text{product} = \alpha + \beta(\text{variety}) + \gamma(\text{pesticide}) + \varepsilon$$

where $\alpha$ is a constant representing the overall mean, $\beta$ is the effect of the three varieties, $\gamma$ is the effect of the 4 pesticides, and $\varepsilon$ is the residual of the model.
We can use R's aov function to estimate these parameters as follows:
# first we enter the data
> variety <- c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3)
> pesticide <- c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4)
> product <- c(29,50,43,53,41,58,42,73,66,85,69,85)
# declare variety and pesticide as factors
> variety <- as.factor(variety)
> pesticide <- as.factor(pesticide)
# put everything in a data frame called data
> data <- data.frame(variety, pesticide, product)
# analysis of variance with aov, stored in the object analysis
> analysis <- aov(product ~ variety + pesticide)
> anova(analysis)
Analysis of Variance Table
Response: product
          Df  Sum Sq Mean Sq F value   Pr(>F)
variety    2 2225.17 1112.58  44.063 0.000259 ***
pesticide  3 1191.00  397.00  15.723 0.003008 **
Residuals  6  151.50   25.25
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

These results show that both factors, variety and pesticide, affect orange yield, since both p-values are below 0.05. To compare each pair of groups, we use the TukeyHSD function as follows:
> TukeyHSD(analysis)
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = product ~ variety + pesticide)
$variety
     diff       lwr      upr     p adj
2-1  9.75 -1.152093 20.65209 0.0749103
3-1 32.50 21.597907 43.40209 0.0002363
3-2 22.75 11.847907 33.65209 0.0016627
$pesticide
    diff        lwr       upr     p adj
2-1   19   4.797136 33.202864 0.0140509
3-1    6  -8.202864 20.202864 0.5106152
4-1   25  10.797136 39.202864 0.0036109
3-2  -13 -27.202864  1.202864 0.0704233
4-2    6  -8.202864 20.202864 0.5106152
4-3   19   4.797136 33.202864 0.0140509

The comparisons between varieties show that variety B3 yields about 32.5 units more than variety B1, with a 95% confidence interval from 21.6 to 43.4 (p = 0.0002). Variety B3 is also better than variety B2, with a mean difference of about 22.8 units (p = 0.0017). There is no significant difference between varieties B2 and B1.
As for the pesticides, the results show that pesticide 4 is more effective than pesticides 1 and 3, and that pesticide 2 is more effective than pesticide 1. The other comparisons are not statistically significant. The following Tukey plot illustrates these conclusions.
> plot(TukeyHSD(analysis), ordered=TRUE)

[Figure: Tukey plot at the 95% family-wise confidence level; x-axis: differences in mean levels of pesticide, for the pairs 2-1, 3-1, 4-1, 3-2, 4-2 and 4-3.]
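As a cross-check on the ANOVA table of Example 4, the sums of squares for variety and pesticide can be recovered from the marginal totals of Table 11.5 alone. This is a minimal sketch using the usual "totals squared minus correction term" shortcut; the object names are illustrative.

```r
# Marginal totals from Table 11.5
variety_totals   <- c(B1 = 175, B2 = 214, B3 = 305)   # each total sums 4 trees
pesticide_totals <- c(136, 193, 154, 211)             # each total sums 3 trees
grand_total <- 694
n <- 12

# Correction term: (grand total)^2 / number of observations
correction <- grand_total^2 / n

ss_variety   <- sum(variety_totals^2) / 4 - correction
ss_pesticide <- sum(pesticide_totals^2) / 3 - correction

print(round(c(variety = ss_variety, pesticide = ss_pesticide), 2))
```

Both values should match the Sum Sq column printed by anova(analysis).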

11.7 Analysis of variance for a Latin square experiment

Example 5. To compare the effectiveness of 2 fertilizers (A and B) together with 2 cultivation methods (a and b), researchers carried out a Latin square experiment. There are 4 treatment groups formed by combining the two fertilizers and the two cultivation methods: Aa, Ab, Ba and Bb (coded, in that order, as 1=Aa, 2=Ab, 3=Ba, 4=Bb). The four treatments were applied in 4 field plots (sample = 1, 2, 3, 4) and to 4 crop varieties (variety = 1, 2, 3, 4). In total the experiment has 4x4 = 16 observations. The outcome is yield, and the results are summarized in the following table:
Table 11.6. Yield for 2 fertilizers and 2 cultivation methods:
Plot        Variety
(sample)    1           2           3           4
1           175 (Aa)    143 (Ba)    128 (Bb)    166 (Ab)
2           170 (Ab)    178 (Aa)    140 (Ba)    131 (Bb)
3           135 (Bb)    173 (Ab)    169 (Aa)    141 (Ba)
4           145 (Ba)    136 (Bb)    165 (Ab)    173 (Aa)

The question is whether the cultivation methods and fertilizers affect yield. To answer it, we must consider the sources that make yield vary. From the design and the data table, three main sources of variation are easy to identify:

• The first is the difference between the fertilizer and cultivation methods;
• The second is the difference between varieties;
• The third is the difference between plots.

What remains is the variation within each plot and variety. To get an overview of the data, we compute the mean for each group in the table below:
Mean by variety          Mean by plot             Mean by method
1: 156.25                1: 153.00                1: 173.75
2: 157.50                2: 154.75                2: 168.50
3: 150.50                3: 154.50                3: 142.25
4: 152.75                4: 154.75                4: 132.50
Overall mean: 154.25     Overall mean: 154.25     Overall mean: 154.25

The summary table above lets us compute the sum of squares for each source of variation. We start with the total sum of squares for the whole experiment (denoted SStotal):

• Total sum of squares for the whole experiment:
$$SS_{total} = (175 - 154.25)^2 + (143 - 154.25)^2 + \dots + (165 - 154.25)^2 + (173 - 154.25)^2 = 4941$$

• Sum of squares due to differences between varieties (SSvariety). Because each variety mean is computed from 4 values, we multiply by 4:
$$SS_{variety} = 4(156.25 - 154.25)^2 + 4(157.50 - 154.25)^2 + 4(150.50 - 154.25)^2 + 4(152.75 - 154.25)^2 = 123.5$$
There are 4 varieties and one parameter (the overall mean) is estimated, so the degrees of freedom are 4 - 1 = 3. The mean square is therefore 123.5 / 3 = 41.2.

• Sum of squares due to differences between plots (SSsample). Because each plot mean is computed from 4 values, we again multiply by 4:
$$SS_{sample} = 4(153.00 - 154.25)^2 + 4(154.75 - 154.25)^2 + 4(154.50 - 154.25)^2 + 4(154.75 - 154.25)^2 = 8.5$$
With 4 plots, the degrees of freedom are 4 - 1 = 3 and the mean square is 8.5 / 3 = 2.8.

• Sum of squares due to differences between methods (SSmethod). Because each method mean is computed from 4 values, we again multiply by 4:
$$SS_{method} = 4(173.75 - 154.25)^2 + 4(168.50 - 154.25)^2 + 4(142.25 - 154.25)^2 + 4(132.50 - 154.25)^2 = 4801.5$$
With 4 methods, the degrees of freedom are 4 - 1 = 3 and the mean square is 4801.5 / 3 = 1600.5.

• Residual sum of squares:
$$SS_{residual} = SS_{total} - SS_{method} - SS_{sample} - SS_{variety} = 4941.0 - 4801.5 - 8.5 - 123.5 = 7.5$$
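The hand calculation above can be mirrored step by step in R. The sketch below enters the 16 yields row by row from Table 11.6 and computes each sum of squares from group means; the helper ss_between is a hypothetical name introduced here for illustration.

```r
# Yields from Table 11.6, entered row by row (plot 1, then plot 2, ...)
y <- c(175, 143, 128, 166,
       170, 178, 140, 131,
       135, 173, 169, 141,
       145, 136, 165, 173)
sample  <- factor(rep(1:4, each = 4))    # field plot (row of the table)
variety <- factor(rep(1:4, times = 4))   # variety (column of the table)
method  <- factor(c(1, 3, 4, 2,          # 1=Aa, 2=Ab, 3=Ba, 4=Bb
                    2, 1, 3, 4,
                    4, 2, 1, 3,
                    3, 4, 2, 1))

grand <- mean(y)

# Between-group sum of squares: group size times squared deviation of each group mean
ss_between <- function(f) {
  group_mean <- tapply(y, f, mean)
  group_n    <- tapply(y, f, length)
  sum(group_n * (group_mean - grand)^2)
}

ss_sample   <- ss_between(sample)
ss_variety  <- ss_between(variety)
ss_method   <- ss_between(method)
ss_total    <- sum((y - grand)^2)
ss_residual <- ss_total - ss_sample - ss_variety - ss_method

print(c(sample = ss_sample, variety = ss_variety, method = ss_method,
        residual = ss_residual, total = ss_total))
```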

These estimates can be presented in an analysis of variance table:

Source of variation     Degrees of freedom   Sum of squares   Mean square   F test
Between 4 plots         3                    8.5              2.8           2.3
Between 4 varieties     3                    123.5            41.2          32.9
Between 4 methods       3                    4801.5           1600.5        1280.4
Residual                6                    7.5              1.25
Total                   15                   4941.0

From this simple hand calculation, we see that the cultivation method and the variety have a large effect on yield. To compute the p-values exactly, we can use R to carry out the analysis of variance for the Latin square experiment.
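Before fitting the model, the p-values that belong to the hand-made table can already be obtained from the F distribution. This is a small sketch based only on the sums of squares and degrees of freedom in the table; the vector names are illustrative.

```r
# Sums of squares from the hand-made ANOVA table, each with 3 degrees of freedom
mean_squares <- c(sample = 8.5, variety = 123.5, method = 4801.5) / 3
ms_residual  <- 7.5 / 6                       # residual mean square, 6 df

f_values <- mean_squares / ms_residual
p_values <- pf(f_values, df1 = 3, df2 = 6, lower.tail = FALSE)

print(round(f_values, 4))
print(signif(p_values, 4))
```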

Organizing the data so that R can process them is very important. In short, each observation must be uniquely identified. The experiment has 4 varieties and 4 plots, hence 16 observations in total, and each of them must be labelled with its variety, its plot and, most importantly, its cultivation method. For example, in Table 11.6 above, 175 is the yield of method 1 (Aa), variety 1 and plot 1, whereas 173 (bottom right corner of the table) is again the yield of method 1, but for variety 4 and plot 4; and so on.

First, we enter the yields, row by row of Table 11.6, and call the vector y:

> y <- c(175, 143, 128, 166,
         170, 178, 140, 131,
         135, 173, 169, 141,
         145, 136, 165, 173)

Next, variety holds the variety, with 4 levels (1,2,3,4), for each value in y (and is declared as a factor, that is, a categorical variable):

> variety <- c(1,2,3,4,
               1,2,3,4,
               1,2,3,4,
               1,2,3,4)
> variety <- as.factor(variety)

sample holds the plot, with 4 levels (1,2,3,4), for each value in y (and is also declared as a factor):

> sample <- c(1,1,1,1,
              2,2,2,2,
              3,3,3,3,
              4,4,4,4)
> sample <- as.factor(sample)

Finally, method holds the cultivation method, again with 4 levels (1,2,3,4), for each value in y (and is also declared as a factor):

> method <- c(1, 3, 4, 2,
              2, 1, 3, 4,
              4, 2, 1, 3,
              3, 4, 2, 1)
> method <- as.factor(method)

Combine all of the above into a data frame called data:

> data <- data.frame(sample, variety, method, y)

Print data to check that the values are correct and properly arranged:

> data
   sample variety method   y
1       1       1      1 175
2       1       2      3 143
3       1       3      4 128
4       1       4      2 166
5       2       1      2 170
6       2       2      1 178
7       2       3      3 140
8       2       4      4 131
9       3       1      4 135
10      3       2      2 173
11      3       3      1 169
12      3       4      3 141
13      4       1      3 145
14      4       2      4 136
15      4       3      2 165
16      4       4      1 173

We are now ready to analyze the data with lm or aov. Here we use aov to compute the sources of variation described above (the results are stored in the object latin):
> latin <- aov(y ~ sample + variety + method)
> summary(latin)
          Df Sum Sq Mean Sq   F value    Pr(>F)
sample     3    8.5     2.8    2.2667 0.1810039
variety    3  123.5    41.2   32.9333 0.0004016 ***
method     3 4801.5  1600.5 1280.4000 8.293e-09 ***
Residuals  6    7.5     1.3
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

These are, of course, the same results that we summarized by hand in the analysis of variance table above. Here, however, R also gives the p-values (in the Pr(>F) column) for statistical inference. From them we can state that the plot has no effect on yield, whereas the variety and the cultivation method do.
To quantify the differences between cultivation methods and between varieties, we use the TukeyHSD function as follows:
> TukeyHSD(latin)
$variety
     diff        lwr        upr     p adj
2-1  1.25 -1.4867231  3.9867231 0.4528549
3-1 -5.75 -8.4867231 -3.0132769 0.0014152
4-1 -3.50 -6.2367231 -0.7632769 0.0173206
3-2 -7.00 -9.7367231 -4.2632769 0.0004803
4-2 -4.75 -7.4867231 -2.0132769 0.0038827
4-3  2.25 -0.4867231  4.9867231 0.1034761
$method
      diff        lwr        upr     p adj
2-1  -5.25  -7.986723  -2.513277 0.0023016
3-1 -31.50 -34.236723 -28.763277 0.0000001
4-1 -41.25 -43.986723 -38.513277 0.0000000
3-2 -26.25 -28.986723 -23.513277 0.0000004
4-2 -36.00 -38.736723 -33.263277 0.0000000
4-3  -9.75 -12.486723  -7.013277 0.0000730

The comparisons between varieties show differences between varieties 3 and 1, 4 and 1, 3 and 2, and 4 and 2.
All comparisons between cultivation methods are statistically significant. But which method gives the highest yield? To answer this question, we use a box plot:
> boxplot(y ~ method, xlab="Methods (1=Aa, 2=Ab, 3=Ba, 4=Bb)", ylab="Production")

Figure. Box plot comparing the yield of the four cultivation methods (x-axis: Methods (1=Aa, 2=Ab, 3=Ba, 4=Bb); y-axis: Production).

11.8 Analysis of variance for a cross-over experiment

Example 6. To test the effect of a new drug on sweating (the drug was developed to treat heart disease, and sweating is a side effect), researchers carried out a study of 16 patients. The patients were randomly divided into 2 groups (called AB and BA) of 8 patients each. Patients were followed twice: in month 1 and in month 2. Patients in group AB received the drug in the first month and a placebo in the second month. Conversely, patients in group BA received the placebo in the first month and the drug in the second month. The outcome is the time to sweating on the forehead (measured from taking the drug until sweating starts) after taking the drug or the placebo. The results are shown in the following table:
Table 11.7. Results of the study of the sweating effect of a heart disease drug
                        Time (minutes) to forehead sweating
Group   Patient (id)    Month 1     Month 2
AB                      A           Placebo
        1               6           4
        3               8           7
        5               12          6
        6               7           8
        9               9           10
        10              6           4
        13              11          6
        15              8           8
BA                      Placebo     A
        2               5           7
        4               9           6
        7               7           11
        8               4           7
        11              9           8
        12              5           4
        14              8           9
        16              9           13

The main question is whether the time to sweating differs between treatment with the drug and treatment with placebo.
To answer this question we need an analysis of variance. However, because the study design is rather special (two groups of patients who receive the interventions in two different orders), the methods described so far cannot be applied directly. A common approach is to analyze the variance within each group and then compare the two groups. One issue that deserves attention is the possibility of a carry-over effect: in group AB, the response in month 2 may still be influenced by the active drug given in month 1. First, we summarize the data in the following table:
Table 11.8. Summary of the results of the study of the sweating effect of a heart disease drug
                        Time (minutes) to forehead sweating
Group   Patient (id)    Month 1     Month 2     Patient mean
AB                      A           Placebo
        1               6           4           5.0
        3               8           7           7.5
        5               12          6           9.0
        6               7           8           7.5
        9               9           10          9.5
        10              6           4           5.0
        13              11          6           8.5
        15              8           8           8.0
        Mean            8.375       6.625       7.50
BA                      Placebo     A
        2               5           7           6.0
        4               9           6           7.5
        7               7           11          9.0
        8               4           7           5.5
        11              9           8           8.5
        12              5           4           4.5
        14              8           9           8.5
        16              9           13          11.0
        Mean            7.000       8.125       7.5625
Mean of the 2 groups    7.6875      7.3750      7.5312
Mean for treatment A = (8.375 + 8.125) / 2 = 8.25
Mean for treatment P (placebo) = (6.625 + 7.000) / 2 = 6.8125

From the summary table above, we can compute several sums of squares:

• Sum of squares due to the difference between drug and placebo:
$$SS_{treat} = 16(8.25 - 7.5312)^2 + 16(6.8125 - 7.5312)^2 = 16.53$$

• Sum of squares due to the difference between month 1 and month 2:
$$SS_{period} = 16(7.6875 - 7.5312)^2 + 16(7.3750 - 7.5312)^2 = 0.781$$

• Sum of squares due to the difference between groups AB and BA (order):
$$SS_{seq} = 16(7.50 - 7.5312)^2 + 16(7.5625 - 7.5312)^2 = 0.031$$

• Sum of squares due to differences between patients within the same group AB or BA (each patient mean is based on 2 measurements, hence the factor 2):
$$SS_w = 2\,[(5.0 - 7.50)^2 + (7.5 - 7.50)^2 + (9.0 - 7.50)^2 + \dots + (8.0 - 7.50)^2 + (6.0 - 7.5625)^2 + (7.5 - 7.5625)^2 + (9.0 - 7.5625)^2 + \dots + (11.0 - 7.5625)^2] = 103.44$$

• Total sum of squares for the whole sample:
$$SS_{total} = (6 - 7.5312)^2 + (8 - 7.5312)^2 + \dots + (13 - 7.5312)^2 = 167.97$$

• Remaining (residual) sum of squares:
$$SS_{res} = 167.97 - 16.53 - 0.781 - 0.031 - 103.44 = 47.19$$
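The sums of squares above can be verified directly from the data of Table 11.7. The following sketch is one way to do it; the name seq_group (instead of seq) and the helper ss_between are choices made here for illustration, and the within-group patient term is obtained as the between-patient sum of squares minus the between-group one.

```r
# Table 11.7 in four blocks of 8 values:
# AB month 1 (drug A), AB month 2 (placebo), BA month 2 (drug A), BA month 1 (placebo)
y <- c(6, 8, 12, 7, 9, 6, 11, 8,
       4, 7, 6, 8, 10, 4, 6, 8,
       7, 6, 11, 7, 8, 4, 9, 13,
       5, 9, 7, 4, 9, 5, 8, 9)
seq_group <- factor(rep(c("AB", "BA"), each = 16))
period    <- factor(rep(c(1, 2, 2, 1), each = 8))
treat     <- factor(rep(c("A", "placebo", "A", "placebo"), each = 8))
id        <- factor(c(rep(c(1, 3, 5, 6, 9, 10, 13, 15), 2),
                      rep(c(2, 4, 7, 8, 11, 12, 14, 16), 2)))

grand <- mean(y)

# Between-group sum of squares for any grouping factor
ss_between <- function(f) {
  group_mean <- tapply(y, f, mean)
  group_n    <- tapply(y, f, length)
  sum(group_n * (group_mean - grand)^2)
}

ss_treat  <- ss_between(treat)
ss_period <- ss_between(period)
ss_seq    <- ss_between(seq_group)
ss_within <- ss_between(id) - ss_seq     # patients within AB or BA
ss_total  <- sum((y - grand)^2)
ss_res    <- ss_total - ss_treat - ss_period - ss_seq - ss_within

print(round(c(treat = ss_treat, period = ss_period, seq = ss_seq,
              within = ss_within, residual = ss_res, total = ss_total), 3))
```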

We can now set up the analysis of variance table:
Table 11.9. Analysis of variance of the data in Table 11.7

Source of variation                 Degrees of freedom   Sum of squares   Mean square   F test
Between the two treatments          1                    16.53            16.53         4.90
Between the two months              1                    0.781            0.781         0.23
Between AB and BA                   1                    0.031            0.031         0.004
Between patients within each group  14                   103.44           7.39
Residual                            14                   47.19            3.37
Total                               31                   167.97

From this analysis, the difference between drug and placebo is larger than the difference between the two months or between the two groups AB and BA. The F test of the hypothesis that drug and placebo have the same effect is F = 16.53 / 3.37 = 4.90 with 1 and 14 degrees of freedom. According to probability theory, the 5% critical value of F with 1 and 14 degrees of freedom is 4.60. Since 4.90 exceeds this value, we can conclude that the drug prolongs the time to sweating compared with placebo.
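The critical value quoted above does not have to be looked up in a table; R's F distribution functions give it directly. A minimal sketch, using the mean squares from Table 11.9:

```r
ms_treat    <- 16.53   # mean square between treatments (1 df)
ms_residual <- 3.37    # residual mean square (14 df)

f_treat    <- ms_treat / ms_residual
f_critical <- qf(0.95, df1 = 1, df2 = 14)                 # 5% critical value
p_treat    <- pf(f_treat, df1 = 1, df2 = 14, lower.tail = FALSE)

print(round(c(F = f_treat, critical = f_critical, p = p_treat), 4))
```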
All of the hand calculations above only illustrate how analysis of variance works in a cross-over experiment. In practice, we can use R to do the calculations in the same way as for simpler experiments. The main issue is how to organize the data for analysis. R (like many other packages) requires the user to enter the observations one by one, and each observation must be linked to a patient, a treatment, a month (or period), and an order group. This requirement is very important: if the data are organized incorrectly, the results of the analysis may be wrong.
The following describes the procedure step by step:
# step 1: enter the data and name the object y
# (AB on drug A, AB on placebo, BA on drug A, BA on placebo)
> y <- c(6,8,12,7,9,6,11,8,
         4,7,6,8,10,4,6,8,
         7,6,11,7,8,4,9,13,
         5,9,7,4,9,5,8,9)
# step 2: for each value in step 1, indicate group AB or BA (coded 1 and 2)
> seq <- c(1,1,1,1,1,1,1,1,
           1,1,1,1,1,1,1,1,
           2,2,2,2,2,2,2,2,
           2,2,2,2,2,2,2,2)
> seq <- as.factor(seq)
# step 3: for each value in step 1, indicate month 1 or month 2
> period <- c(1,1,1,1,1,1,1,1,
              2,2,2,2,2,2,2,2,
              2,2,2,2,2,2,2,2,
              1,1,1,1,1,1,1,1)
> period <- as.factor(period)
# step 4: for each value in step 1, indicate drug A or placebo (coded 1 and 2)
> treat <- c(1,1,1,1,1,1,1,1,
             2,2,2,2,2,2,2,2,
             1,1,1,1,1,1,1,1,
             2,2,2,2,2,2,2,2)
> treat <- as.factor(treat)
# step 5: for each value in step 1, indicate the patient id
> id <- c(1,3,5,6,9,10,13,15,
          1,3,5,6,9,10,13,15,
          2,4,7,8,11,12,14,16,
          2,4,7,8,11,12,14,16)
> id <- as.factor(id)
# step 6: build a data frame called data and print it to check once more
> data <- data.frame(seq, period, treat, id, y)
> data
   seq period treat id  y
1    1      1     1  1  6
2    1      1     1  3  8
3    1      1     1  5 12
4    1      1     1  6  7
5    1      1     1  9  9
6    1      1     1 10  6
7    1      1     1 13 11
8    1      1     1 15  8
9    1      2     2  1  4
10   1      2     2  3  7
11   1      2     2  5  6
12   1      2     2  6  8
13   1      2     2  9 10
14   1      2     2 10  4
15   1      2     2 13  6
16   1      2     2 15  8
17   2      2     1  2  7
18   2      2     1  4  6
19   2      2     1  7 11
20   2      2     1  8  7
21   2      2     1 11  8
22   2      2     1 12  4
23   2      2     1 14  9
24   2      2     1 16 13
25   2      1     2  2  5
26   2      1     2  4  9
27   2      1     2  7  7
28   2      1     2  8  4
29   2      1     2 11  9
30   2      1     2 12  5
31   2      1     2 14  8
32   2      1     2 16  9

We are now ready to analyze the data with R's lm function. Note that lm is used for a cross-over experiment in exactly the same way as for other experiments. The only difference lies in how the data are organized, as described above. The patient factor id is included in the model so that the variation between patients is separated from the residual:
> xover <- lm(y ~ treat+seq+period+id)
> anova(xover)
Analysis of Variance Table
Response: y
          Df  Sum Sq Mean Sq F value  Pr(>F)
treat      1  16.531  16.531  4.9046 0.04388 *
seq        1   0.031   0.031  0.0093 0.92466
period     1   0.781   0.781  0.2318 0.63764
id        14 103.438   7.388  2.1921 0.07711 .
Residuals 14  47.187   3.371
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

These results are of course the same as those of the hand calculation above. In short, the difference between drug and placebo is statistically significant, with a p-value of 0.044.
We can also ask for the 95% confidence interval of the difference between the two treatments with TukeyHSD, as follows (note that TukeyHSD requires a fit from aov rather than lm):
> TukeyHSD(aov(y ~ treat+seq+period+id))
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = y ~ treat + seq + period + id)
$treat
       diff       lwr         upr     p adj
2-1 -1.4375 -2.829658 -0.04534186 0.0438783
$seq
      diff       lwr      upr    p adj
2-1 0.0625 -1.329658 1.454658 0.924656
$period
       diff       lwr      upr     p adj
2-1 -0.3125 -1.704658 1.079658 0.6376395

Note that the result:
$treat
       diff       lwr         upr     p adj
2-1 -1.4375 -2.829658 -0.04534186 0.0438783

shows that, on average, the time to sweating is about 1.44 minutes longer with the drug than with placebo, with a 95% confidence interval from 0.05 to 2.8 minutes. The comparisons between groups AB and BA (seq) and between month 1 and month 2 (period) are not statistically significant.

11.9 Analysis of variance for a repeated measures experiment

Example 7. A pilot study was carried out to assess the efficacy of a new vaccine against rheumatoid arthritis. The study included 8 patients, randomly divided into 2 groups. Group 1 consisted of 4 patients treated with the vaccine; group 2 consisted of 4 patients who received a placebo (control). Patients were followed for 3 months, and each month they were asked about the state of their disease. Disease status was measured by an index ranging from 0 (no effect, disease unchanged) to 10 (fully effective, disease resolved). The results are summarized in the following table:

Table 11.10. Results of the study of a vaccine against rheumatic pain

                          Disease index by month
Group     Patient (id)    Month 1    Month 2    Month 3
Vaccine   1               6          3          0
          2               7          3          1
          3               4          1          2
          4               8          4          3
Placebo   5               6          5          5
          6               9          4          6
          7               5          3          4
          8               6          2          3
The main question is whether there is any difference between the vaccine and placebo groups.
To keep the analysis of variance for repeated measures simple, we avoid mathematical notation and illustrate it with a few hand calculations that the reader can follow. First, we summarize the data by computing the mean for each patient, each treatment group, and each month:
Table 11.11. Summary of the data from the study of a vaccine against rheumatic pain
                        Disease index by month
Treatment   id          1         2         3         Mean
Vaccine     1           6         3         0         3.000
            2           7         3         1         3.667
            3           4         1         2         2.333
            4           8         4         3         5.000
            Mean        6.25      2.75      1.50      3.500
            SD          1.71      1.26      1.29
Placebo     5           6         5         5         5.333
            6           9         4         6         6.333
            7           5         3         4         4.000
            8           6         2         3         3.667
            Mean        6.50      3.50      4.50      4.833
            SD          1.73      1.29      1.29
Mean of the two groups  6.375     3.125     3.000     4.167


From the table above, we can see that there are 5 sources of variation in the experimental results:
(a) Between vaccine and placebo (presumably the source we most want to know about!);
(b) Between the 3 months of follow-up;
(c) Between the three months within each treatment group, which statisticians usually call interaction; in this case, the interaction between treatment group and time;
(d) Between patients within the same treatment group;
(e) And finally the residual, the part that we cannot explain after accounting for sources (a) to (d) above.

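First is the sum of squares between the two treatment groups (vaccine and placebo), which I will call SStreat:

$$SS_{\mathrm{treat}} = 12(3.500 - 4.167)^2 + 12(4.833 - 4.167)^2 = 10.667$$

Next is the sum of squares between the 3 months of treatment, called SStime:

$$SS_{\mathrm{time}} = 8(6.375 - 4.167)^2 + 8(3.125 - 4.167)^2 + 8(3.000 - 4.167)^2 = 58.583$$

The third source is the sum of squares due to the interaction between treatment and time, called SSint:

$$
\begin{aligned}
SS_{\mathrm{int}} &= 4(6.25 - 4.167)^2 + 4(2.75 - 4.167)^2 + 4(1.50 - 4.167)^2 \\
&\quad + 4(6.50 - 4.167)^2 + 4(3.50 - 4.167)^2 + 4(4.50 - 4.167)^2 - SS_{\mathrm{treat}} - SS_{\mathrm{time}} \\
&= 77.833 - 10.667 - 58.583 \\
&= 8.583
\end{aligned}
$$

The fourth source is the sum of squares between patients within each treatment group, called SSpatient(treat). Each patient mean is compared with the mean of that patient's own group (3.500 for vaccine, 4.833 for placebo):

$$
\begin{aligned}
SS_{\mathrm{patient(treat)}} &= 3(3.000 - 3.500)^2 + 3(3.667 - 3.500)^2 + 3(2.333 - 3.500)^2 + 3(5.000 - 3.500)^2 \\
&\quad + 3(5.333 - 4.833)^2 + 3(6.333 - 4.833)^2 + 3(4.000 - 4.833)^2 + 3(3.667 - 4.833)^2 \\
&= 25.333
\end{aligned}
$$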


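In addition, the total sum of squares for the whole sample is:

$$SS_{\mathrm{total}} = (6 - 4.167)^2 + (3 - 4.167)^2 + (0 - 4.167)^2 + \dots + (3 - 4.167)^2 = 115.333$$

From these, we can obtain the residual sum of squares:

$$
\begin{aligned}
SS_E &= SS_{\mathrm{total}} - SS_{\mathrm{treat}} - SS_{\mathrm{time}} - SS_{\mathrm{patient(treat)}} - SS_{\mathrm{int}} \\
&= 115.333 - 10.667 - 58.583 - 25.333 - 8.583 \\
&= 12.167
\end{aligned}
$$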
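The hand calculations above can be checked with a few lines of R. This is only a sketch of the arithmetic, not the analysis itself: the variable names (`ss_treat`, `ss_time`, and so on) are chosen here for illustration, and the data are typed in the order group, month, patient from Table 11.10.

```r
# Data from Table 11.10, ordered by group, then month, then patient
y <- c(6,7,4,8, 3,3,1,4, 0,1,2,3,    # vaccine: months 1, 2, 3
       6,9,5,6, 5,4,3,2, 5,6,4,3)    # placebo: months 1, 2, 3
treat <- factor(rep(1:2, each = 12))
time  <- factor(rep(rep(1:3, each = 4), times = 2))
id    <- factor(c(rep(1:4, times = 3), rep(5:8, times = 3)))

grand       <- mean(y)                             # overall mean, 4.167
treat_means <- as.numeric(tapply(y, treat, mean))  # 3.500, 4.833
time_means  <- as.numeric(tapply(y, time, mean))   # 6.375, 3.125, 3.000
cell_means  <- tapply(y, list(treat, time), mean)  # 2 x 3 table of cell means
pat_means   <- as.numeric(tapply(y, id, mean))     # one mean per patient

ss_treat   <- sum(12 * (treat_means - grand)^2)
ss_time    <- sum(8 * (time_means - grand)^2)
ss_int     <- sum(4 * (cell_means - grand)^2) - ss_treat - ss_time
# each patient mean is compared with the mean of the patient's own group
ss_patient <- sum(3 * (pat_means - rep(treat_means, each = 4))^2)
ss_total   <- sum((y - grand)^2)
ss_error   <- ss_total - ss_treat - ss_time - ss_int - ss_patient

round(c(treat = ss_treat, time = ss_time, int = ss_int,
        patient = ss_patient, total = ss_total, error = ss_error), 3)
```

The six values should agree with the sums of squares derived by hand.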

As readers can see, all the manual calculations above are fairly complicated and very error-prone. In R, however, we can obtain the results quickly, and the repeated-measures analysis of variance in R is presented below.
From the sums of squares above, we can set up the analysis of variance table as follows:
Source of variation                  Degrees of   Sum of      Mean       F test
                                     freedom      squares     square
Between vaccine and placebo              1         10.667     10.667      2.53
Patients (within treatment group)        6         25.333      4.222
Between the 3 months                     2         58.583     29.292     28.89
Time by treatment group                  2          8.583      4.292      4.23
Residual                                12         12.167      1.014       -
Total                                   23        115.333


First, we enter the data for each patient. As in any statistical software, each value must be accompanied by the variables that identify the patient, the group, and the time point:

y <- c(6,7,4,8,
3,3,1,4,
0,1,2,3,
6,9,5,6,
5,4,3,2,
5,6,4,3)

For each value above, tell R whether it belongs to the treated group (code 1) or the placebo group (code 2). We should also tell R that treat is a categorical variable, not a numerical variable:


treat <- c(1,1,1,1,
1,1,1,1,
1,1,1,1,
2,2,2,2,
2,2,2,2,
2,2,2,2)
treat <- as.factor(treat)

For each value above, tell R which month it belongs to (code 1, 2, 3), and define time as a categorical variable.

time <- c(1,1,1,1,
2,2,2,2,
3,3,3,3,
1,1,1,1,
2,2,2,2,
3,3,3,3)
time <- as.factor(time)

For each value above, tell R which patient it belongs to (code 1, 2, 3, ..., 8), and define id as a categorical variable.

id <- c(1,2,3,4, 1,2,3,4, 1,2,3,4, 5,6,7,8, 5,6,7,8,
5,6,7,8)
id <- as.factor(id)

Put all the variables into a data frame named data. Check once more that the data are arranged as intended. To repeat: before analyzing data, it is important to check them carefully to make sure they are organized correctly and appropriately.

data <- data.frame(id, time, treat, y)
data
   id time treat y
1   1    1     1 6
2   2    1     1 7
3   3    1     1 4
4   4    1     1 8
5   1    2     1 3
6   2    2     1 3
7   3    2     1 1
8   4    2     1 4
9   1    3     1 0
10  2    3     1 1
11  3    3     1 2
12  4    3     1 3
13  5    1     2 6
14  6    1     2 9
15  7    1     2 5
16  8    1     2 6
17  5    2     2 5
18  6    2     2 4
19  7    2     2 3
20  8    2     2 2
21  5    3     2 5
22  6    3     2 6
23  7    3     2 4
24  8    3     2 3
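Typing 24 codes by hand for each design variable invites mistakes. As a sketch only, the same three factors can be generated with `rep()`, assuming the values in `y` are ordered by group, then month, then patient as above; the cross-tabulations at the end are one way to do the check the text recommends.

```r
y <- c(6,7,4,8, 3,3,1,4, 0,1,2,3,
       6,9,5,6, 5,4,3,2, 5,6,4,3)

# 12 values per group; within a group, 4 patients for each of 3 months
treat <- factor(rep(1:2, each = 12))
time  <- factor(rep(rep(1:3, each = 4), times = 2))
id    <- factor(c(rep(1:4, times = 3), rep(5:8, times = 3)))

data <- data.frame(id, time, treat, y)

# Checks on the layout:
table(data$treat, data$time)   # every group-by-month cell should hold 4 values
table(data$id, data$time)      # every patient should appear once per month
table(data$id, data$treat)     # every patient should belong to one group only
```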

We are now ready to use R for the analysis. The main function for analysis of variance is aov (analysis of variance). In this function, note how the error structure is specified through another function called Error. In Error, we tell R that each patient (id) belongs to one treatment group and is measured repeatedly over time. This is written as Error(id/time). Specifically:
> repeated <- aov(y ~ treat*time + Error(id/time))

The command above asks R to fit the model y = treat + time + treat:time (note that treat*time is shorthand for treat + time + treat:time), with the residual mean square split into two parts: one for patients and one for months within patients (abbreviated as id/time). All results are stored in an object named repeated. We then request a summary of the results from the object repeated:
> summary(repeated)
Error: id
          Df  Sum Sq Mean Sq F value Pr(>F)
treat      1 10.6667 10.6667  2.5263 0.1631
Residuals  6 25.3333  4.2222

Error: id:time
           Df Sum Sq Mean Sq F value    Pr(>F)
time        2 58.583  29.292 28.8904 2.586e-05 ***
treat:time  2  8.583   4.292  4.2329   0.04064 *
Residuals  12 12.167   1.014
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The first part of the table above shows that the difference between the treated group and the placebo group is not statistically significant (p = 0.16). Can we therefore conclude that the vaccine has no effect on rheumatoid arthritis pain?
The answer is no, because the second part of the analysis of variance table shows an interaction between treat and time (p = 0.041). This means that the difference between vaccine and placebo depends on the month of treatment. Indeed, looking back at Table 11.11, in month 1 the means of the vaccine and placebo groups hardly differ (6.25 and 6.50), but by month 2, and especially month 3, the difference between the two groups is large (in month 3: 1.50 for vaccine and 4.50 for placebo). Thus the effect in the treated group increases steadily over time, whereas the placebo group shows comparatively little change across the 3 months. In short, this pilot experiment suggests that the vaccine appears to be effective in reducing pain in patients with rheumatoid arthritis.
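The F ratios and p values in the output can be traced back to the sums of squares. The sketch below uses the figures from the analysis of variance table; the variable names are illustrative. The point it makes is that treat is tested against the between-patient mean square (6 degrees of freedom), while time and treat:time are tested against the within-patient residual (12 degrees of freedom).

```r
# Mean squares = sum of squares / degrees of freedom
ms_treat   <- 10.6667 / 1
ms_patient <- 25.3333 / 6     # "Residuals" in the Error: id stratum
ms_time    <- 58.583 / 2
ms_int     <- 8.583 / 2
ms_resid   <- 12.167 / 12     # "Residuals" in the Error: id:time stratum

# Between-patient effect is tested against the patient mean square
f_treat <- ms_treat / ms_patient
p_treat <- pf(f_treat, df1 = 1, df2 = 6, lower.tail = FALSE)

# Within-patient effects are tested against the residual mean square
f_time <- ms_time / ms_resid
p_time <- pf(f_time, df1 = 2, df2 = 12, lower.tail = FALSE)
f_int  <- ms_int / ms_resid
p_int  <- pf(f_int, df1 = 2, df2 = 12, lower.tail = FALSE)

round(c(f_treat = f_treat, f_time = f_time, f_int = f_int), 3)
signif(c(p_treat = p_treat, p_time = p_time, p_int = p_int), 3)
```

Small differences in the last digit relative to the R output come from using rounded sums of squares as input.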
***
The above are a few ways of carrying out analysis of variance for commonly used experiments. Experimental design and analysis is a fairly specialized field, and the guidance above cannot describe all the calculations and methods for every experiment. In practice, however, experimental methods are used very often in the experimental sciences. R has a package named nlme (linear and non-linear mixed-effects models) that can also be used for the analyses above and for more complex multivariable, multilevel models. The package can be downloaded free of charge from the R website: http://cran.R-project.org.
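As a rough illustration of the nlme route mentioned above, the vaccine data of Example 7 could be fitted with a random intercept for each patient. This is a sketch under assumptions, not the author's analysis: the object names `dat` and `fit_lme` are invented here, and the random-intercept structure `~ 1 | id` is one plausible choice. For a balanced design like this one, the F tests are expected to agree closely with those from aov.

```r
library(nlme)   # recommended package, normally installed with R

y <- c(6,7,4,8, 3,3,1,4, 0,1,2,3,
       6,9,5,6, 5,4,3,2, 5,6,4,3)
dat <- data.frame(
  y     = y,
  treat = factor(rep(1:2, each = 12)),
  time  = factor(rep(rep(1:3, each = 4), times = 2)),
  id    = factor(c(rep(1:4, times = 3), rep(5:8, times = 3)))
)

# Fixed effects: treatment, time and their interaction.
# Random effect: one intercept per patient.
fit_lme <- lme(y ~ treat * time, random = ~ 1 | id, data = dat)

tab <- anova(fit_lme)   # sequential F tests for the fixed effects
print(tab)
```

The advantage of the mixed-model formulation is that it extends to unbalanced data and to more elaborate random-effects structures, where the aov approach with Error() no longer applies directly.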


12
Logistic regression analysis
In the previous chapters on linear regression and analysis of variance, we looked for models of the relationship between a continuous dependent variable and one or more independent variables, which may be continuous or not. In many cases, however, the dependent variable is not continuous but binary: yes/no, diseased/not diseased, dead/alive, occurred/did not occur, and so on, while the independent variables may be continuous or not. We still want to understand the relationship between the independent variables and the dependent variable.
Example 1. In a study conducted by the author to examine the relationship between fracture risk (fracture, abbreviated fx) and bone density together with several other biochemical markers, 139 male patients (or, more precisely, study subjects) aged 60 and over took part. In 1990, the following data were collected for each subject: age (age), body mass index (BMI), bone mineral density (BMD), the bone resorption marker ICTP, and the bone formation marker PINP. The subjects were followed for 15 years. During follow-up, whether or not each patient sustained a fracture was recorded. The initial question was whether there is a relationship between BMD and fracture risk. The data from this study are listed at the end of this chapter, and part of them is shown below so that readers can grasp the problem.
Table 12.1. Part of the data from the study of risk factors for fracture

 id  fx  age      bmi    bmd   ictp    pinp
  1   1   79  24.7252  0.818  9.170  37.383
  2   1   89  25.9909  0.871  7.561  24.685
  3   1   70  25.3934  1.358  5.347  40.620
  4   1   88  23.2254  0.714  7.354  56.782
  5   1   85  24.6097  0.748  6.760  58.358
  6   0   68  25.0762  0.935  4.939  67.123
  7   0   70  19.8839  1.040  4.321  26.399
  8   0   69  25.0593  1.002  4.212  47.515
  9   0   74  25.6544  0.987  5.605  26.132
 10   0   79  19.9594  0.863  5.204  60.267
...
137   0   64  38.0762  1.086  5.043  32.835
138   1   80  23.3887  0.875  4.086  23.837
139   0   67  25.9455  0.983  4.328  71.334


Here, because the dependent variable (fracture) is not measured on a continuous scale (it is only yes or no), linear regression cannot be used to analyze the relationship between the dependent and independent variables. A method developed relatively recently (in the 1970s), called logistic regression analysis, can be applied in this case.
In this study, after 15 years of follow-up, 38 patients had sustained a fracture. As a proportion, the fracture rate is 38 / 139 = 0.273 (or 27.3%).

12.1 The logistic regression model

Given x occurrences of an event recorded among n subjects, we can compute the probability of the event as:

$$p = \frac{x}{n}$$

p can be regarded as a measure of the risk of an event. Another way of expressing risk is the odds. The odds of an event are defined simply as the ratio of the probability that the event occurs to the probability that it does not occur:

$$\mathrm{odds} = \frac{p}{1 - p} \qquad [1]$$

The logit function of p is defined as the logarithm of the odds:

$$\mathrm{logit}(p) = \log\left(\frac{p}{1 - p}\right) \qquad [2]$$

The relationship between p and logit(p) is continuous and has the following form:

[Figure: logit(p) on the vertical axis (from -4 to 4) plotted against p on the horizontal axis (from 0.0 to 1.0)]

Figure 12.1. Relationship between logit(p) and p, for 0 < p < 1.


Note: the figure above was drawn with the following commands:
p <- seq(0, 1, length=100)
# drop the end points 0 and 1, where logit(p) is infinite
p <- p[2:(length(p)-1)]
logit <- function(t)
{
log(t / (1-t))
}
plot(logit(p) ~ p, type="l")
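To make definitions [1] and [2] concrete, the sketch below applies them to the fracture proportion quoted earlier (38 fractures among 139 men). The variable names are illustrative; `qlogis()` and `plogis()` are the logit and inverse-logit functions built into base R.

```r
p <- 38 / 139                 # observed proportion with a fracture

odds    <- p / (1 - p)        # definition [1]; the same as 38/101
logit_p <- log(odds)          # definition [2]

c(p = p, odds = odds, logit = logit_p)

# Base R has the same transformations built in:
qlogis(p)          # logit(p)
plogis(logit_p)    # inverse logit, recovers p
```

A probability of about 0.27 thus corresponds to odds of about 0.38 and a negative logit, since the event is less likely than not.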

Given an independent variable x (x may be continuous or not), the logistic regression model states that:

$$\mathrm{logit}(p) = \alpha + \beta x \qquad [3]$$

As in the linear regression model, $\alpha$ and $\beta$ are two linear parameters that must be estimated from the study data. But the meaning of these parameters, especially $\beta$, is quite different from what we are used to in linear regression. To understand the meaning of the two parameters, we return to example 1.
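Example 1 (continued). What we want to know is the relationship between bone density (bmd) and fracture risk (fx). For convenience, let bmd be x; the problem can then be written in the language of the model as follows:

$$\mathrm{logit}(p) = \log\left(\frac{p}{1 - p}\right) = \alpha + \beta x \qquad [4]$$

In other words:

$$\mathrm{odds}(p) = \frac{p}{1 - p} = e^{\alpha + \beta x}$$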

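Put differently, the logistic regression model just presented states that the relationship between the probability of fracture (p) and bone density bmd follows an S-shaped curve. The model also shows that the probability of fracture p depends on the value of x. The model can therefore be written more precisely as the odds of fracture conditional on x:

$$\mathrm{odds}(p \mid x) = e^{\alpha + \beta x}$$

When x = x0, the odds of fracture are:

$$\mathrm{odds}(p \mid x = x_0) = e^{\alpha + \beta x_0}$$

When x = x0 + 1 (an increase of 1 unit from x0), the odds of fracture are:

$$\mathrm{odds}(p \mid x = x_0 + 1) = e^{\alpha + \beta (x_0 + 1)}$$

And the ratio of the two odds of fracture is:

$$\frac{\mathrm{odds}(p \mid x = x_0 + 1)}{\mathrm{odds}(p \mid x = x_0)} = \frac{e^{\alpha + \beta (x_0 + 1)}}{e^{\alpha + \beta x_0}} = e^{\beta} \qquad [5]$$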

In epidemiology, $e^{\beta}$ is called the odds ratio. In other words, the coefficient $\beta$ in the logistic regression model is the logarithm of the odds ratio associated with a one-unit increase in x.
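Result [5] says that the odds ratio for a one-unit increase in x is the same wherever the increase starts. A quick numerical check, using hypothetical parameter values (`alpha` and `beta` below are made up for illustration and have nothing to do with the fracture data):

```r
alpha <- -1.5    # hypothetical intercept
beta  <- 0.8     # hypothetical slope

odds <- function(x) exp(alpha + beta * x)   # odds(p | x) under the model

x0 <- c(-2, 0, 1.3, 5)                      # several arbitrary starting points
ratio <- odds(x0 + 1) / odds(x0)            # odds ratio for a 1-unit increase

ratio        # the same value at every x0
exp(beta)    # ... and equal to e^beta
```

The probability p itself does not change by a constant factor; only the odds do, which is why logistic regression results are reported as odds ratios.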
The method for estimating the parameters of model [3] is fairly complicated (it uses maximum likelihood) and lies outside the scope of this book, so it will not be presented here (readers may consult a textbook for more detail if needed). It is worth mentioning briefly, however, that the method of maximum likelihood gives us the following system of equations:

$$
\begin{aligned}
\sum_{i=1}^{n} y_i &= \sum_{i=1}^{n} \frac{1}{1 + e^{-(\alpha + \beta x_i)}} \\
\sum_{i=1}^{n} x_i y_i &= \sum_{i=1}^{n} \frac{x_i}{1 + e^{-(\alpha + \beta x_i)}}
\end{aligned}
$$


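Here, $y_i$ is the dependent variable (fracture, with value 0 or 1), $x_i$ is the independent variable (bone density), and n is the sample size. To find the estimates of $\alpha$ and $\beta$, the calculations commonly used are iteratively reweighted least squares or Newton-Raphson. R's glm function uses Fisher scoring (iteratively reweighted least squares), which for the logit link coincides with Newton-Raphson, to find the two estimates.
Once we have the estimates $\hat{\alpha}$ and $\hat{\beta}$, we can estimate the probability p for any value of x as follows (after some algebra):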

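$$\hat{p} = \frac{e^{\hat{\alpha} + \hat{\beta} x}}{1 + e^{\hat{\alpha} + \hat{\beta} x}} = \frac{1}{1 + e^{-(\hat{\alpha} + \hat{\beta} x)}}$$
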

Note that we use the hat in $\hat{p}$ to denote an estimate (predicted value), as distinct from p, the observed probability. If the model describes the data well and adequately, the difference between $\hat{p}$ and p is small; if the model is inappropriate or poor, the difference may be large. The discrepancy between p and $\hat{p}$ is called the deviance. The calculation of deviance is fairly involved and is not the subject here, so we only touch on the concept. When we have several models for one or more relationships, deviance can be used to assess the fit of a model or to choose an optimal model.
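For readers curious about what happens behind the scenes, the following is a minimal sketch of Newton-Raphson applied to the two likelihood equations, followed by a hand calculation of the deviance for 0/1 data. Everything here is an assumption made for illustration: the data are simulated (they are not the fracture data), the true values -0.5 and 1.2 are arbitrary, and the loop is a bare-bones version, not the routine that glm actually runs.

```r
set.seed(1)                                  # simulated data, illustration only
n  <- 200
x  <- rnorm(n)
yb <- rbinom(n, 1, plogis(-0.5 + 1.2 * x))   # 0/1 outcome

X   <- cbind(1, x)      # design matrix: column of 1s (alpha) and x (beta)
est <- c(0, 0)          # starting values for (alpha, beta)

for (iter in 1:25) {
  p_hat <- as.vector(plogis(X %*% est))            # current fitted probabilities
  score <- crossprod(X, yb - p_hat)                # sum(y - p) and sum(x * (y - p))
  info  <- crossprod(X, X * (p_hat * (1 - p_hat))) # information matrix X'WX
  delta <- solve(info, score)                      # Newton-Raphson step
  est   <- est + as.vector(delta)
  if (max(abs(delta)) < 1e-10) break               # stop when the step is negligible
}

fit <- glm(yb ~ x, family = binomial)
rbind(by_hand = est, glm = coef(fit))

# Deviance for 0/1 data: -2 times the log-likelihood at the estimates
p_hat <- as.vector(plogis(X %*% est))
dev_by_hand <- -2 * sum(yb * log(p_hat) + (1 - yb) * log(1 - p_hat))
c(by_hand = dev_by_hand, glm = deviance(fit))
```

Setting the score to zero reproduces the two equations given above: the first component is the sum of y minus the sum of fitted probabilities, the second is the same sum weighted by x.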

12.2 Logistic regression analysis with R

Example 1 (continued). We now return to example 1 and use the data in Table 12.1 to estimate the two parameters $\alpha$ and $\beta$ with R. First we must read the whole data set into a data frame and give it a name, say fracture. In this example, the data are stored in the directory c:\works\stats under the name fracture.txt, so the following commands are needed to read the data:
# tell R where the data are stored
> setwd("c:/works/stats")
# read the data into a data frame named fracture
> fracture <- read.table("fracture.txt", header=TRUE,
na.strings=".")
# check which variables are in the fracture data
> names(fracture)
[1] "id" "fx" "age" "bmi" "bmd" "ictp" "pinp"


# keep only the patients with complete data for the analysis
> fulldata <- na.omit(fracture)
> attach(fulldata)

The two variables of interest in this example are fx (fracture) and bmd (bone density). We check how many patients had a fracture:
> table(fx)
fx
  0   1
101  38

Next, we look at bone density in the fracture and non-fracture groups:
> tapply(bmd, fx, mean)
        0         1
0.9444851 0.9016667

> boxplot(bmd ~ fx,
xlab="Fracture: 1=yes, 0=no",
ylab="BMD")

[Figure: box plots of BMD (vertical axis, about 0.6 to 1.2) by fracture status (horizontal axis, "Fracture: 1=yes, 0=no")]

The results above show that bmd in patients with a fracture is lower than in those without (0.90 versus 0.94). The t test below shows that this difference is not statistically significant (p = 0.15).
> t.test(bmd~fx)
        Welch Two Sample t-test
data:  bmd by fx
t = 1.4572, df = 53.952, p-value = 0.1508
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.01609226  0.10172922
sample estimates:
mean in group 0 mean in group 1
      0.9444851       0.9016667

To estimate the parameters of model [4], the function glm (short for generalized linear model) in R can be used, with the following syntax:
> logistic <- glm(fx ~ bmd, family=binomial)
> summary(logistic)
Call:
glm(formula = fx ~ bmd, family = "binomial")
Deviance Residuals:
    Min      1Q  Median      3Q     Max
-1.0287 -0.8242 -0.7020  1.3780  2.0709
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)    1.063      1.342   0.792    0.428
bmd           -2.270      1.455  -1.560    0.119
(Dispersion parameter for binomial family taken to be 1)
    Null deviance: 157.81 on 136 degrees of freedom
Residual deviance: 155.27 on 135 degrees of freedom
AIC: 159.27
Number of Fisher Scoring iterations: 4

We will go through the results above in turn:
(a) In the command logistic <- glm(fx~bmd, family=binomial) we ask R to fit a model in which fx is a function of bmd, as in model [4]. glm supports several distributions, of which the binomial is the standard distribution for logistic regression. Hence family=binomial must be supplied to R.
(b) Deviance: the first part of the output gives a summary of the deviance residuals:
Deviance Residuals:
    Min      1Q  Median      3Q     Max
-1.0287 -0.8242 -0.7020  1.3780  2.0709

As explained above, deviance reflects the difference between the observed and estimated values of logit(p) (that is, between the model and the data, much like the residual mean square in linear regression). For a single model such as the one in this example, the value of the deviance is not very meaningful.
(c) The next part gives the estimates of $\alpha$ (which R calls the intercept) and $\beta$ (bmd), together with their standard errors.
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)    1.063      1.342   0.792    0.428
bmd           -2.270      1.455  -1.560    0.119

From these results, we have $\hat{\alpha} = 1.063$ and $\hat{\beta} = -2.27$. The negative estimate of $\beta$ shows that the relationship between fracture risk and bmd is inverse: the probability of fracture increases as bmd decreases. However, the z test (the estimate divided by its standard error) shows that the effect of bmd is not statistically significant, since p = 0.119.
Recall that the odds ratio (abbreviated OR) is $e^{\hat{\beta}} = e^{-2.27} = 0.1033$. In other words, when bmd increases by 1 g/cm2 (the unit of measurement of bmd is g/cm2), the odds of fracture fall by 0.8967, or 89.67%. But an increase of 1 g/cm2 is a very large and unrealistic change in bone density. An alternative is therefore to express the effect per standard deviation of bmd. We find the standard deviation of bmd:

> sd(bmd)
[1] 0.1406543

The OR is therefore computed per 0.14 g/cm2. The OR per standard deviation is thus:

$$e^{-2.27 \times 0.1406} = 0.7267$$

That is, when bmd increases by one standard deviation, the odds of fracture fall by about 27%. Put the other way, when bmd decreases by one standard deviation, the odds increase by a factor of $e^{2.27 \times 0.1406} = 1.376$, or about 38%.
Another way to see the effect of bmd is to estimate the probability of fracture from the equation:

$$\hat{p} = \frac{e^{1.063 - 2.27\,\mathrm{bmd}}}{1 + e^{1.063 - 2.27\,\mathrm{bmd}}}$$


From this, when bmd = 1.00, $\hat{p}$ = 0.23. When bmd = 0.86 (a decrease of 1 standard deviation), $\hat{p}$ = 0.291. That is, if BMD decreases by 1 standard deviation, the probability of fracture increases by a factor of 0.291/0.23 = 1.265, or 26.5%.
(d) The last part of the output gives the deviance for two models: the model with no independent variable (null deviance), and the model with the independent variable, which is bmd in this example (residual deviance).
    Null deviance: 157.81 on 136 degrees of freedom
Residual deviance: 155.27 on 135 degrees of freedom
AIC: 159.27

From these two numbers, we see that bmd contributes very little to predicting fracture: it reduces the deviance only from 157.81 to 155.27, and this reduction is not statistically significant.
In addition, R reports the AIC (Akaike Information Criterion), which is computed from the deviance and the number of parameters. We will return to the meaning of AIC in a later section, when comparing models.
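The odds ratios and probabilities quoted in item (c) can be reproduced from the printed estimates alone, without access to the data. The sketch below takes the coefficients (1.063 and -2.27) and the standard deviation of bmd (0.1406543) from the output above; the names `a_hat`, `b_hat` and `p_at` are introduced here only for illustration.

```r
a_hat  <- 1.063        # estimated intercept, from the glm output
b_hat  <- -2.27        # estimated coefficient of bmd
sd_bmd <- 0.1406543    # standard deviation of bmd, from sd(bmd)

# Odds ratios
or_unit <- exp(b_hat)             # per 1 g/cm2 increase in bmd
or_sd   <- exp(b_hat * sd_bmd)    # per 1 SD increase in bmd
or_sd_decrease <- 1 / or_sd       # per 1 SD decrease in bmd

round(c(or_unit = or_unit, or_sd = or_sd, or_sd_decrease = or_sd_decrease), 4)

# Predicted probability of fracture at a given bmd
p_at <- function(bmd) plogis(a_hat + b_hat * bmd)

p1 <- p_at(1.00)    # bmd = 1.00
p2 <- p_at(0.86)    # roughly 1 SD lower
round(c(p1 = p1, p2 = p2, ratio = p2 / p1), 3)
```

Note the difference between the two scales: one standard deviation lower bmd multiplies the odds by about 1.38, but multiplies the probability by only about 1.27 at this level of bmd.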

12.3 Estimating probabilities with R

In the analysis above, we stored the results in the object logistic. This object contains a lot of useful information, but to see it we must use commands such as summary. In this section, we look at a few R functions for extracting information related to predicted probabilities.

predict lists the estimated (predicted) values of the model $\log\left(\frac{p}{1 - p}\right) = \alpha + \beta x$ for each patient.


> predict(logistic)
       1        2        3        4        5        6
 2.37757  1.08569 -2.14111  1.49282  0.96537 -0.94125
       7        8        9       10       11       12
-1.73368 -1.67564 -0.66528 -0.50704 -0.94185 -0.64874
...

The numbers above are log(p / (1 - p)), that is, log odds, which have little direct practical meaning. We want the predicted probability $\hat{p}$ computed from the equation

$$\hat{p} = \frac{e^{1.063 - 2.27\,\mathrm{bmd}}}{1 + e^{1.063 - 2.27\,\mathrm{bmd}}}$$

To obtain this value for each patient, we pass the argument type="response" to predict as follows:

> predict(logistic, type="response")
    1     2     3     4     5     6     7
0.915 0.747 0.105 0.816 0.724 0.281 0.150
    8     9    10    11    12    13    14
0.158 0.339 0.376 0.281 0.343 0.443 0.238
...

In the output above (only part of which is printed), the estimated probability of fracture is 0.915 for patient 1, 0.747 for patient 2, and so on.

We can examine these predicted values against bmd using the usual plot function:

> plot(bmd, fitted(glm(fx ~ bmd, family=binomial)))

[Figure: scatter plot of the fitted probabilities (vertical axis, about 0.15 to 0.40) against bmd (horizontal axis, 0.6 to 1.2)]

Predicted probability of fracture (vertical axis) and bmd (horizontal axis) from the logistic regression model.


The plot above can be improved by using more closely spaced bmd values (such as 0.50, 0.55, 0.60, ..., 1.20) and by drawing a line instead of points. The following commands improve the plot.
> logistic <- glm(fx ~ bmd, family=binomial)
# let fnbmd take the values 0.50, 0.55, 0.60, ..., 1.20
> fnbmd <- seq(0.5, 1.2, 0.05)
# put these values into a new data frame
> new.data <- data.frame(bmd = fnbmd)
> predicted <- predict(logistic, new.data,
type="response")
> plot(predicted ~ fnbmd, type="l")

[Figure: line plot of predicted (vertical axis, about 0.15 to 0.45) against fnbmd (horizontal axis, 0.5 to 1.2)]

Predicted probability of fracture (vertical axis) and bmd (horizontal axis) from the logistic regression model.


12.4 Logistic regression from summarized data with R

In the analysis just presented, we had data for each individual patient, and the independent variables were all continuous. In many cases, however, the independent variables are categorical and, because the dependent variable takes only two values (0 and 1), the data can in principle be summarized in frequency tables.
Example 2. In a study of the effects of smoking, obesity and snoring (during sleep) on the risk of hypertension, the researchers summarized the data as follows (data taken from Altman, page 353):

smoking  obesity  snoring  ntotal  nhyper
   0        0        0        60       5
   1        0        0        17       2
   0        1        0         8       1
   1        1        0         2       0
   0        0        1       187      35
   1        0        1        85      13
   0        1        1        51      15
   1        1        1        23       8
Total                        433      79

Table 12.2. Summary data on smoking, obesity, snoring and hypertension. ntotal is the total number of patients in each group, and nhyper is the number of those patients with hypertension. The variables smoking, obesity and snoring take the values 0=no and 1=yes.
The study included 433 patients, of whom 79 (18%) had hypertension. This proportion, however, varies considerably between patient groups. For example, in the group that does not smoke (smoking=0), is not obese (obesity=0) and does not snore (snoring=0), the proportion with hypertension is 8.3% (5/60), whereas in the group with all 3 risk factors (smoking=1, obesity=1, snoring=1) more than a third, 35% (8/23), have hypertension.


To analyze the relationship between the 3 risk factors and hypertension, we must first enter the data into R exactly as in the table above.
> noyes <- c("no", "yes") # define noyes with the 2 values no and yes
> smoking <- gl(2,1,8, noyes) # smoking variable
> obesity <- gl(2,2,8, noyes) # obesity variable
> snoring <- gl(2,4,8, noyes) # snoring variable
> ntotal <- c(60, 17, 8, 2, 187, 85, 51, 23)
> nhyper <- c(5, 2, 1, 0, 35, 13, 15, 8)

> data <- data.frame(smoking, obesity, snoring, ntotal, nhyper)


> data
  smoking obesity snoring ntotal nhyper
1      no      no      no     60      5
2     yes      no      no     17      2
3      no     yes      no      8      1
4     yes     yes      no      2      0
5      no      no     yes    187     35
6     yes      no     yes     85     13
7      no     yes     yes     51     15
8     yes     yes     yes     23      8

We can now use the function glm to analyze the data. First, we create an additional variable proportion as follows:
> proportion <- nhyper/ntotal
> logistic <- glm(proportion ~ smoking+obesity+snoring,
family=binomial,
weight=ntotal)

Note that in the glm call above, we model proportion as a function of smoking, obesity and snoring, still with the binomial distribution, but with an additional argument weight=ntotal. The weight argument tells R that each row summarizes ntotal patients (rather than a single patient). We can now look at the results of the analysis:
> summary(logistic)
Call:
glm(formula = proportion ~ smoking + obesity + snoring, family =
"binomial", weights = ntotal)
Deviance Residuals:
       1        2        3        4        5        6        7        8
-0.04344  0.54145 -0.25476 -0.80051  0.19759 -0.46602 -0.21262  0.56231
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.37766    0.38018  -6.254    4e-10 ***
smokingyes  -0.06777    0.27812  -0.244   0.8075
obesityyes   0.69531    0.28509   2.439   0.0147 *
snoringyes   0.87194    0.39757   2.193   0.0283 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
    Null deviance: 14.1259 on 7 degrees of freedom
Residual deviance:  1.6184 on 4 degrees of freedom
AIC: 34.537
Number of Fisher Scoring iterations: 4

The results above show that the variable smoking is not statistically significant, so we should probably drop it from the model and fit a simpler one:
> logistic <- glm(proportion ~ obesity+snoring,
family=binomial, weight=ntotal)
> summary(logistic)
Call:
glm(formula = proportion ~ obesity + snoring, family="binomial",
weights=ntotal)
Deviance Residuals:
       1        2        3        4        5        6        7        8
-0.01247  0.47756 -0.24050 -0.82050  0.30794 -0.62742 -0.14449  0.45770
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -2.3921     0.3757  -6.366 1.94e-10 ***
obesityyes    0.6954     0.2851   2.440   0.0147 *
snoringyes    0.8655     0.3967   2.182   0.0291 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
    Null deviance: 14.1259 on 7 degrees of freedom
Residual deviance:  1.6781 on 5 degrees of freedom
AIC: 32.597
Number of Fisher Scoring iterations: 4

The analysis of deviance below also confirms that obesity and snoring are two variables that affect hypertension:


> anova(logistic, test="Chisq")


Analysis of Deviance Table
Model: binomial, link: logit
Response: proportion
Terms added sequentially (first to last)
        Df Deviance Resid. Df Resid. Dev P(>|Chi|)
NULL                        7    14.1259
obesity  1   6.8260         6     7.2999    0.0090
snoring  1   5.6218         5     1.6781    0.0177
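Two points from this section can be checked with a short sketch. First, a grouped binomial model can equivalently be specified with a two-column response of "successes" and "failures" built with `cbind()`, instead of a proportion plus weights; both forms are standard glm usage and should give the same estimates. Second, the p values in the analysis of deviance table are chi-square tail probabilities of the drops in deviance. The object names `fit_w`, `fit_c` and the like are introduced here for illustration.

```r
noyes   <- c("no", "yes")
smoking <- gl(2, 1, 8, noyes)
obesity <- gl(2, 2, 8, noyes)
snoring <- gl(2, 4, 8, noyes)
ntotal  <- c(60, 17, 8, 2, 187, 85, 51, 23)
nhyper  <- c(5, 2, 1, 0, 35, 13, 15, 8)

# Form 1: proportion as response, group sizes as weights
fit_w <- glm(nhyper / ntotal ~ smoking + obesity + snoring,
             family = binomial, weights = ntotal)

# Form 2: two-column response (number with, number without hypertension)
fit_c <- glm(cbind(nhyper, ntotal - nhyper) ~ smoking + obesity + snoring,
             family = binomial)

rbind(weights_form = coef(fit_w), cbind_form = coef(fit_c))
exp(coef(fit_c))    # odds ratios for each risk factor

# p values in the analysis of deviance table: drop in deviance on 1 df
drop_obesity <- 14.1259 - 7.2999    # NULL -> obesity
drop_snoring <- 7.2999 - 1.6781     # obesity -> obesity + snoring
p_obesity <- pchisq(drop_obesity, df = 1, lower.tail = FALSE)
p_snoring <- pchisq(drop_snoring, df = 1, lower.tail = FALSE)
round(c(p_obesity = p_obesity, p_snoring = p_snoring), 4)
```

The cbind form avoids computing the proportion by hand and makes it harder to forget the weights.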

12.5 Multiple logistic regression and model selection

One of the difficult, and sometimes intractable, problems in multiple logistic regression is choosing a model that describes the data adequately. In a study with one dependent variable y and 3 independent variables x1, x2 and x3, we can have the following models for predicting y: y = f(x1), y = f(x2), y = f(x3), y = f(x1, x2), y = f(x1, x3), y = f(x2, x3), and y = f(x1, x2, x3), where f is a function. In general, with k independent variables x1, x2, x3, ..., xk, we have a great many candidate models (2^k, counting the model with no independent variable) for predicting y. With so many possible models, the question is which one should be regarded as optimal.
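The count of candidate models can be made concrete by listing them. The sketch below enumerates every non-empty subset of three predictors with `combn()`; the variable names x1, x2, x3 and y are the generic ones used in the text, and the formulas are built only as character strings.

```r
vars <- c("x1", "x2", "x3")

# All non-empty subsets of the predictors: sizes 1, 2, ..., k
subsets <- unlist(
  lapply(seq_along(vars), function(m) combn(vars, m, simplify = FALSE)),
  recursive = FALSE
)

# Turn each subset into a model formula written as text
formulas <- vapply(subsets,
                   function(s) paste("y ~", paste(s, collapse = " + ")),
                   character(1))

formulas
length(formulas)    # 2^3 - 1 = 7 models; the model with no predictor makes 2^3
```

With 10 predictors the same code would produce 1023 formulas, which is why an automated search or an averaging approach becomes attractive.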
This question raises a more basic one: what does optimal mean? Briefly, an optimal model should meet the following three criteria:

Simple
Adequate
Practically meaningful

The simplicity criterion requires a model with few independent variables, because with too many variables interpretation becomes difficult and sometimes unrealistic. By analogy, spending 50,000 dong on 500 pages of a book is better than spending 60,000 dong on the same number of pages. Likewise, if a model with 3 independent variables describes the data as well as a model with 5 independent variables, the former is chosen. A simple model is an economical model (in English, a parsimonious model).
The adequacy criterion means that the model must describe the data satisfactorily, that is, its predictions must be close (the closer the better) to the observed values of the dependent variable y. If the observed value of y is 10, and one model predicts 9 while another predicts 6, the former must be regarded as more adequate.
The criterion of practical meaning, as the name suggests, means that the model must be supported by theory or have biological meaning (for biological research), clinical meaning (for clinical research), and so on. Telephone numbers may somehow be related to fracture rates, but such a model is of course entirely meaningless. This is an important criterion, because if a statistical analysis leads to a model that is mathematically meaningful but has no practical meaning, the model is merely a game of numbers with no real scientific value.
The third criterion (practical meaning) belongs to the realm of theory, and we will not discuss it here. We will discuss the criteria of simplicity and adequacy. An important and useful measure for deciding on a simple and adequate model is the Akaike Information Criterion (AIC), which we met in the first part of this chapter. To understand AIC, we return to example 1.
Recall that in example 1 we want to predict fracture (the variable fx) from the following independent variables: age (age), body mass index (bmi), bone mineral density (bmd), and the two markers of bone resorption (ictp) and bone formation (pinp).
(a) We try the model in which fx is a function of age:
> attach(fulldata)
> summary(glm(fx ~ age, family=binomial, data=fulldata))
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -8.06447    2.72559  -2.959  0.00309 **
age          0.09806    0.03766   2.604  0.00922 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
    Null deviance: 157.81 on 136 degrees of freedom
Residual deviance: 150.74 on 135 degrees of freedom
AIC: 154.74

We see that the residual deviance = 150.74 and AIC = 154.74. In fact, AIC is computed from the formula:
AIC = Residual Deviance + 2 x (number of parameters)


In the model above we have 2 parameters (intercept and age), so AIC = 150.74 + 4 = 154.74.
(b) The second model we want to compare is fx as a function of ictp:
> summary(glm(fx ~ ictp, family=binomial, data=fulldata))
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -3.9206     0.7726  -5.074 3.89e-07 ***
ictp          0.6066     0.1527   3.973 7.11e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
    Null deviance: 157.81 on 136 degrees of freedom
Residual deviance: 139.15 on 135 degrees of freedom
AIC: 143.15

This model also has two parameters, but its residual deviance (139.15) is smaller than that of the model with age (150.74), and its AIC is therefore lower (143.15 versus 154.74). This result shows that the model with ictp describes fx more adequately than the model with age. Of these two models, we would choose the model with ictp.
(c) Now we look at the model with both ictp and age.
> summary(glm(fx ~ ictp + age, family=binomial, data=fulldata))
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -8.25707    2.91403  -2.834 0.004603 **
ictp         0.55461    0.15665   3.540 0.000399 ***
age          0.06398    0.04067   1.573 0.115701
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
    Null deviance: 157.81 on 136 degrees of freedom
Residual deviance: 136.61 on 134 degrees of freedom
AIC: 142.61

This model has 3 parameters (intercept, age and ictp), but the AIC falls only to 142.61 (compared with 143.15 for the model with ictp alone), a very modest reduction for which we have to spend an extra parameter. We can conclude that age is not needed in this model. Indeed, the p value for age is 0.116, which is not statistically significant.


From the three cases above, we can draw a general observation: a simple and adequate model should have an AIC as low as possible, and its independent variables should be statistically significant. The search for a simple and adequate model is therefore really a search for one (or several) models with the lowest, or nearly the lowest, AIC.
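The AIC formula used in cases (a) to (c) can be verified in two ways. The sketch below first recomputes the three AIC values from the reported residual deviances, then checks the identity AIC = deviance + 2 x (number of parameters) on a model fitted to simulated 0/1 data. The simulated data and all object names are assumptions made for illustration; the identity in this simple form holds for ungrouped binary outcomes, as in example 1.

```r
# (1) From the reported output: residual deviance and number of parameters
resid_dev <- c(age = 150.74, ictp = 139.15, "ictp + age" = 136.61)
n_par     <- c(age = 2,      ictp = 2,      "ictp + age" = 3)
aic <- resid_dev + 2 * n_par
aic
names(which.min(aic))    # the model with the lowest AIC among the three

# (2) On simulated binary data (illustration only, not the fracture data)
set.seed(2)
x   <- rnorm(100)
yb  <- rbinom(100, 1, plogis(0.3 + 0.8 * x))
fit <- glm(yb ~ x, family = binomial)

c(AIC = AIC(fit), by_formula = deviance(fit) + 2 * length(coef(fit)))
```

The lowest AIC belongs to the model with ictp and age, but only by about half a unit, which is the "very modest reduction" discussed in case (c).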
Of course, we could examine many other models by substituting or combining the independent variables. But doing so is complicated, time-consuming and sometimes tedious. R has a function called step that can help us find a simple and adequate model. In the example above, step is used as follows:
> temp <- glm(fx ~ ., family=binomial, data=fulldata)

In the command above, the formula fx ~ . means: use all the independent variables (denoted by .) in the data frame fulldata to predict fx. The result is stored in the object temp. To search for a model starting from temp, we call step and store the result in search:
> search <- step(temp)
Start:  AIC= 146.09
fx ~ id + age + bmi + bmd + ictp + pinp

        Df Deviance    AIC
- pinp   1   132.45 144.45
- id     1   132.47 144.47
- age    1   132.63 144.63
- bmi    1   133.41 145.41
- bmd    1   133.87 145.87
<none>       132.09 146.09
- ictp   1   148.90 160.90

Step:  AIC= 144.45
fx ~ id + age + bmi + bmd + ictp

        Df Deviance    AIC
- id     1   132.81 142.81
- age    1   133.14 143.14
- bmi    1   133.66 143.66
- bmd    1   134.00 144.00
<none>       132.45 144.45
- ictp   1   149.05 159.05

Step:  AIC= 142.81
fx ~ age + bmi + bmd + ictp

        Df Deviance    AIC
- age    1   133.32 141.32
- bmi    1   133.67 141.67
- bmd    1   134.33 142.33
<none>       132.81 142.81
- ictp   1   149.88 157.88

Step:  AIC= 141.33
fx ~ bmi + bmd + ictp

        Df Deviance    AIC
- bmi    1   134.34 140.34
<none>       133.32 141.32
- bmd    1   135.65 141.65
- ictp   1   155.18 161.18

Step:  AIC= 140.34
fx ~ bmd + ictp

        Df Deviance    AIC
<none>       134.34 140.34
- bmd    1   139.15 143.15
- ictp   1   155.27 159.27

In the output above, R reports each step of its search for the optimal model. It starts with the model containing all 6 variables, with AIC = 146.09. The second step has only 5 variables (pinp removed) and AIC = 144.45, and so on. The results can be summarized in the following table:

Model                                        AIC
fx ~ id + age + bmi + bmd + ictp + pinp      146.09
fx ~ id + age + bmi + bmd + ictp             144.45
fx ~ age + bmi + bmd + ictp                  142.81
fx ~ bmi + bmd + ictp                        141.33
fx ~ bmd + ictp                              140.34

After 5 steps, R stops at the model with the 2 variables bmd and ictp, which has the lowest AIC. To see only the final model rather than every step of the search, we simply call summary:
> summary(search)
Call:
glm(formula = fx ~ bmd + ictp, family = "binomial", data = fulldata)
Deviance Residuals:
    Min      1Q  Median      3Q     Max
-1.9126 -0.7317 -0.5559  0.4212  2.1242
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -1.0651     1.5029  -0.709   0.4785
bmd          -3.4998     1.6638  -2.103   0.0354 *
ictp          0.6876     0.1704   4.036 5.43e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
    Null deviance: 157.81 on 136 degrees of freedom
Residual deviance: 134.34 on 134 degrees of freedom
AIC: 140.34
Number of Fisher Scoring iterations: 4

This output is simpler than that of step, because summary presents only the final model. In summary, in this analysis we conclude that bmd (bone mineral density) and ictp (a marker of bone resorption) are the two factors related to, or affecting, fracture risk.
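The search above started from fx ~ ., which treats every column of the data frame as a candidate, including the patient identifier id. The sketch below shows one way to keep an identifier out of the search and to suppress the step-by-step printout, using simulated data: the data frame `sim`, its columns x1 to x3 and the true model are all invented for illustration. The predictors are named explicitly in the formula, and `trace = 0` is the step argument that turns off the printing of each step.

```r
set.seed(3)                       # simulated data, illustration only
n   <- 150
sim <- data.frame(id = 1:n,       # identifier, not a predictor
                  x1 = rnorm(n),
                  x2 = rnorm(n),
                  x3 = rnorm(n))
sim$fx <- rbinom(n, 1, plogis(-1 + 1.0 * sim$x1))   # only x1 matters here

# Name the candidate predictors explicitly, leaving id out
full <- glm(fx ~ x1 + x2 + x3, family = binomial, data = sim)

# Backward search on AIC, without printing each step
chosen <- step(full, trace = 0)

formula(chosen)
c(AIC_full = AIC(full), AIC_chosen = AIC(chosen))
```

Because the search only removes a variable when doing so lowers the AIC, the chosen model never has a higher AIC than the starting model.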

12.6 Selecting a logistic regression model with Bayesian Model Averaging (BMA)

In chapter 10, we saw how to select and build a linear regression model using BMA. BMA can also be applied to building a logistic regression model.
Continuing with example 1, we prepare the data for the BMA analysis by selecting the dependent variable (fx in this case) and a matrix of independent variables. We then use the function bic.glm to find the variables that affect fx.
# load the BMA package, which provides bic.glm
> library(BMA)
> attach(fulldata)
> names(fulldata)
[1] "id" "fx" "age" "bmi" "bmd" "ictp" "pinp"

# Select columns 3 to 7 (age to pinp) as the matrix of independent variables
> xvars <- fulldata[,3:7]

# Use fx as the dependent variable
> y <- fx

# Call bic.glm with the following arguments:
> bma.search <- bic.glm(xvars, y, strict=F, OR=20,
glm.family="binomial")

# Summarize the results of the analysis:
> summary(bma.search)
Call: bic.glm.data.frame(x = xvars, y = y, glm.family="binomial",
strict=F,OR=20)

9 models were selected
Best 5 models (cumulative posterior probability = 0.8836 ):

           p!=0     EV       SD     model 1  model 2  model 3  model 4  model 5
Intercept  100     -2.85     2.865   -3.920   -1.065   -1.201   -8.257   -0.072
age         15.3    0.008    0.026    .        .        .        0.063    .
bmi         21.7   -0.023    0.054    .        .       -0.116    .       -0.070
bmd         39.7   -1.341    1.976    .       -3.499    .        .       -2.696
ictp       100.0    0.645    0.169    0.606    0.687    0.680    0.554    0.714
pinp         5.7   -0.0003   0.004    .        .        .        .        .

nVar                                  1        2        2        2        3
BIC                                -525.04  -524.94  -523.63  -522.67  -521.03
post prob                             0.307    0.291    0.151    0.094    0.041

The results above show that the probability that ictp is related to fracture is 100%, whereas the probability for bmd is only about 40%. More importantly, the best model is the one with ictp alone, and the posterior probability of this model is 0.307. The second-best model contains ictp and bmd (this is also the model chosen by the AIC criterion described in the previous section), with a slightly lower probability (0.291). Three other models are also plausible candidates for describing the probability of fracture. Clearly, a BMA analysis gives us more models to choose from and makes us aware of the uncertainty attached to any single statistical model.
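The posterior model probabilities in the summary can be related to the BIC values. The sketch below is our own illustration, not part of the BMA output: it assumes equal prior probability for every model and the usual approximation in which the weight of a model is proportional to exp(-BIC/2). Because only the five best of the nine models are listed, the weights are compared with the reported probabilities rescaled to sum to 1.

```r
# BIC and posterior probabilities copied from the summary above (5 best models)
bic      <- c(-525.04, -524.94, -523.63, -522.67, -521.03)
reported <- c(0.307, 0.291, 0.151, 0.094, 0.041)

# Weight proportional to exp(-BIC/2); subtracting min(bic) avoids overflow
w <- exp(-0.5 * (bic - min(bic)))
w <- w / sum(w)

# Compare with the reported probabilities, rescaled to the same five models
round(cbind(bic, weight = w, reported.rescaled = reported / sum(reported)), 3)
```

The two columns agree to about three decimal places, which suggests that the ranking of models is driven entirely by their BIC values.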
> imageplot.bma(bma.search)
[Figure: "Models selected by BMA". Rows: age, bmi, bmd, ictp, pinp; horizontal axis: Model #.]

The figure above displays the numerical results of the previous section. It shows that ictp is the factor most consistently associated with fracture risk (it appears in 100% of the models). The second most important factor is probably bmd or bmi. Factors such as age and pinp may influence fracture risk, but they appear far less consistently than the factors just mentioned.
Building a statistical model is a mathematical art. Because it is an art, the researcher has to weigh many considerations to arrive at a "beautiful" model. Since the purpose of a model is to describe reality, a beautiful model is one that describes reality closely. Yet a model that reflected 100% of reality would no longer be a model, or would be too complex to be usable. Conversely, a model that describes only about 1% of reality is of no use either. Model building has to find the balance between these two extremes. That is a demanding requirement, so model building depends not only on statistical and mathematical calculations but also on practical considerations that ensure the model is useful. As the statistician George Box put it: all models are wrong, but some are useful.

12.7 Study data on fracture risk in men over 60 years of age

id: patient ID
fx: fracture status (0 = no fracture, 1 = fracture)
age: age
bmi: body mass index, weight divided by height squared
bmd: bone mineral density at the femur
ictp: biochemical marker of bone resorption activity
pinp: biochemical marker of bone formation activity
(a "." denotes a missing value)

 id  fx  age  bmi      bmd    ictp   pinp
  1   1   79  24.7252  0.818  9.170   37.383
  2   1   89  25.9909  0.871  7.561   24.685
  3   1   70  25.3934  1.358  5.347   40.620
  4   1   88  23.2254  0.714  7.354   56.782
  5   1   85  24.6097  0.748  6.760   58.358
  6   0   68  25.0762  0.935  4.939   67.123
  7   0   70  19.8839  1.040  4.321   26.399
  8   0   69  25.0593  1.002  4.212   47.515
  9   0   74  25.6544  0.987  5.605   26.132
 10   0   79  19.9594  0.863  5.204   60.267
 11   1   76  22.5981  0.889  4.704   27.026
 12   0   76  26.4236  0.886  5.115   43.256
 13   1   62  20.3223  0.889  5.741   51.097
 14   0   69  19.3698  0.790  3.880   49.678
 15   0   72  24.2215  0.988  5.844   41.672
 16   0   67  32.1120  1.119  4.160   60.356
 17   0   74  25.3934  1.037  6.728   40.225
 18   0   69  23.8895  0.893  4.203   27.334
 19   1   78  24.6755  0.850  7.347   38.893
 20   0   71  27.1314  0.790  4.476   38.173
 21   1   74  23.0518  0.597  4.835   35.141
 22   1   76  23.4568  0.889  5.354   27.568
 23   1   75  23.5457  0.803  3.773   36.762
 24   0   70  23.3234  0.919  3.672   40.093
 25   1   69  22.8625  0.870  4.552   29.627
 26   0   71  22.0384  0.811  4.286   30.380
 27   1   80  24.6914  0.859  5.706   37.529
 28   1   79  26.8519  0.867  3.563   43.924
 29   0   72  27.1809  0.717  3.760   39.714
 30   0   78  23.9512  0.822  3.453   27.294
 31   1   80  28.3874  1.004  5.948   33.376
 32   0   79  23.5102  0.738  4.193   65.640
 33   1   67  19.7232  0.865  4.443   36.252
 34   1   84  27.4406  0.808  5.482   33.539
 35   0   78  28.6661  0.955  8.815   42.398
 36   0   65  23.7812  0.912  4.704   39.254
 37   0   70  23.4493  0.857  4.138   75.947
 38   0   67  25.5354  0.855  3.727   41.851
 39   0   74  24.7409  0.959  3.967   42.293
 40   0   73  22.2291  1.036  4.438   40.222
 41   0   74  34.4753  1.092  7.271   45.434
 42   1   68  32.1929  .      4.269   50.841
 43   0   80  23.3355  0.759  4.856   31.114
 44   0   78  22.7903  0.757  4.831   73.343
 45   1   79  24.6097  0.671  4.870   69.924
 46   0   72  27.5802  0.814  3.012   27.088
 47   1   67  30.1205  1.101  7.538   35.487
 48   0   70  25.8166  0.818  3.564   36.001
 49   0   69  30.4218  1.088  3.826   33.833
 50   0   67  28.7132  0.934  3.996   56.167
 51   0   74  34.5429  0.969  6.762   43.099
 52   0   71  24.6097  0.794  4.350   39.023
 53   0   67  23.5294  0.830  3.176   36.595
 54   0   67  25.6173  1.057  3.738   32.550
 55   0   65  25.3086  1.160  3.060   44.757
 56   0   66  24.8358  0.811  3.263   26.941
 57   0   69  22.3094  0.977  3.106   27.951
 58   0   72  26.5285  1.063  6.970   41.188
 59   0   75  25.8546  1.091  4.798   36.045
 60   0   70  20.6790  0.741  3.908   30.198
 61   0   74  28.3675  1.045  4.784   31.339
 62   0   71  29.0688  1.066  4.527   24.252
 63   0   65  23.9995  0.841  3.089   79.910
 64   0   77  22.9819  1.015  4.041   57.147
 65   1   67  33.3598  1.129  7.239   67.103
 66   0   66  27.1314  1.030  4.096   29.435
 67   0   70  24.7676  0.896  4.352   44.291
 68   0   70  24.4193  1.106  2.823   37.348
 69   0   69  28.2570  0.869  2.974   46.229
 70   1   65  23.6614  0.837  2.689   28.738
 71   1   75  26.0262  0.921  3.917   29.667
 72   0   67  26.5731  1.118  3.832   50.292
 73   0   67  24.8591  0.765  7.112   45.778
 74   0   73  22.5710  0.752  4.249   39.950
 75   1   63  31.8342  1.251  7.303   48.697
 76   1   72  24.8016  0.839  3.860   41.055
 77   0   73  25.0574  0.662  3.138   36.312
 78   0   69  23.9512  0.844  4.069   39.926
 79   0   75  23.4586  0.852  4.176   51.394
 80   0   65  28.7347  0.795  3.328   27.679
 81   0   71  25.3350  0.867  2.349   36.506
 82   0   66  28.0899  0.997  4.171   53.094
 83   0   66  25.5650  0.827  4.569   25.157
 84   0   71  28.7274  1.023  4.111   19.557
 85   0   73  32.4074  1.066  5.680   36.995
 86   0   64  27.9155  0.874  4.298   43.872
 87   0   68  25.5937  0.882  4.056   30.523
 88   1   67  28.0428  0.718  9.739   66.974
 89   0   66  30.7174  0.856  4.180   34.597
 90   0   77  28.3737  1.052  3.737   28.102
 91   0   75  28.6990  0.929  3.527   23.008
 92   0   67  29.1687  0.953  3.593   16.132
 93   0   73  27.4145  0.784  4.332   47.410
 94   0   68  29.0688  1.120  6.510   45.674
 95   0   70  26.1738  1.040  3.161   36.302
 96   0   66  30.1038  1.028  3.930   38.301
 97   0   77  24.6559  0.884  3.880   36.560
 98   1   71  25.3934  0.943  4.692   69.500
 99   0   74  26.4721  1.075  4.561   25.948
100   0   70  29.0253  1.057  3.709   41.322
101   0   78  29.0253  1.098  5.247   23.896
102   0   76  26.2346  1.014  3.958   24.344
103   1   64  26.4915  0.998  4.218   29.390
104   0   67  27.0416  0.905  3.553   23.020
105   0   66  22.7732  0.627  2.333   53.621
106   0   70  30.5241  1.052  5.425   44.352
107   0   66  25.3069  1.086  4.945   64.788
108   1   65  22.3863  0.818  3.786   96.360
109   1   64  34.0136  1.066  5.792   37.473
110   1   70  26.5668  1.198  7.257   28.406
111   1   70  27.6361  0.926  5.746   17.228
112   0   70  25.4017  1.193  2.437   35.432
113   0   68  30.3673  0.938  2.658   32.293
114   0   67  28.0428  0.863  4.246   48.702
115   1   73  27.7778  0.799  3.934   26.709
116   0   71  29.0006  0.969  4.054   22.769
117   1   71  35.2941  0.931  3.631   18.629
118   0   75  29.3658  1.071  4.222   36.555
119   0   76  26.2649  1.161  2.548   24.217
120   0   71  25.6055  0.786  3.832   32.023
121   0   73  29.9136  0.839  4.215   26.507
122   0   64  34.5271  1.042  6.436   53.080
123   0   70  33.4554  0.976  4.541   26.619
124   0   80  29.0688  0.765  3.998   67.388
125   0   67  25.7276  1.277  3.877   22.159
126   0   68  25.6801  1.097  3.782   42.286
127   0   66  25.9701  0.793  2.991   38.673
128   0   64  26.4490  0.989  3.196   31.456
129   1   69  28.6990  0.822  3.565   45.044
130   0   69  25.6173  0.944  6.512   49.557
131   0   67  30.3871  1.245  3.603   46.769
132   0   67  33.6901  1.142  3.666   38.839
133   0   68  28.4005  0.860  2.890   32.140
134   1   59  25.4017  1.172  .      104.579
135   0   66  22.5710  0.956  3.354   36.253
136   1   71  24.4473  0.918  4.633   53.881
137   0   64  38.0762  1.086  5.043   32.835
138   1   80  23.3887  0.875  4.086   23.837
139   0   67  25.9455  0.983  4.328   71.334

13
Event history analysis
(event history or survival analysis)
In the previous three chapters we became familiar with statistical models for continuous dependent variables (such as blood pressure) and binary variables (such as yes/no, or diseased/not diseased). In scientific research, particularly in medicine and engineering, researchers sometimes want to study effects on dependent variables that involve time. The economist John Maynard Keynes once made a remark relevant to the topic of this chapter: in the long run we are all dead; the only difference is whether we die sooner or later. So following or describing a binary variable such as alive or dead, although important, is not precise. The more important and more precise variable is the time until the event occurs.
In scientific studies, including clinical studies, researchers often follow subjects over a period of time, sometimes several decades. Events occurring during that time, such as disease or no disease, survival or death, have definite clinical meaning, but the time until a patient develops the disease or dies matters even more when assessing the effect of a treatment or a risk factor. These times differ between patients. For example, the time from the start of cancer treatment to death varies widely between patients, and the differences may depend on factors such as age, sex and disease status, as well as factors we sometimes cannot or do not yet measure, such as interactions between genes.
The main model for expressing the relationship between the time to disease (or no disease) and risk factors is called "survival analysis". The term comes from research in insurance, and medical researchers have since adopted it for their own field. But as noted above, alive/dead is not the only kind of outcome: in practice we also have outcomes such as disease or no disease, occurrence or non-occurrence. For this reason psychologists use the term "event history analysis", which the author finds more apt than survival analysis. In engineering, yet another term, reliability analysis, is used for the same concept. In this chapter, however, I will use the term event history analysis.

13.1 Models for time-to-event data
Example 1. Time to discontinuation of an IUD. A study of the effectiveness of a contraceptive device was carried out in 18 women aged 18 to 35. Some women stopped using the device because of bleeding; the others continued to use it. The table below gives the time (in weeks) from the start of use until bleeding (that is, discontinuation) or until the end of the study (that is, the woman was still using the device when the study closed).
Table 13.1. Time to discontinuation or continued use of the IUD

Patient ID   Time (weeks)   Status (1 = discontinued, 0 = continuing)
     1            18            0
     2            10            1
     3            13            0
     4            30            1
     5            19            1
     6            23            0
     7            38            0
     8            54            0
     9            36            1
    10           107            1
    11           104            0
    12            97            1
    13           107            0
    14            56            0
    15            59            1
    16           107            0
    17            75            1
    18            93            1

The question is how to describe the time to discontinuation of the device. "Describe" here means estimating the median time to discontinuation, or the probability that a woman discontinues at a given time. The state of continuing to use the device is sometimes called survival.
For the women who have discontinued, estimating the time is not difficult. The important issue with time-to-event data like these is that some women are still using the device: we do not know how much longer they will continue, yet the study has to close at a predetermined time. Such observations are described by the somewhat opaque term "censored", or "survival" (still alive, still using the device, or the event has not yet occurred).
Let T be the time during which a woman continues to use the device (sometimes called the survival time). T is a random variable with probability density function f(t) and cumulative distribution function

$$F(t) = \int_0^t f(s)\,ds$$

This is the probability that an individual has discontinued (or experienced the event) by time t. The complementary function S(t) = 1 - F(t) is usually called the survival function.
Time data T are usually modelled through two functions: the survival function and the hazard function. The survival function, as defined above, is the probability that an individual survives (in the example above, is still using the device) up to time t. The hazard function, usually written h(t) or λ(t), is the instantaneous rate at which an individual discontinues (or experiences the event) at time t, given survival up to t:

$$h(t) = \lim_{\delta t \to 0} \frac{\Pr(t \le T < t + \delta t \mid T \ge t)}{\delta t} = \frac{f(t)}{S(t)}$$

so that h(t)δt is the probability that an individual discontinues in the short interval δt, given that the individual has survived to time t. From the relation

Pr(survive to t+δt) = Pr(survive to t) × Pr(survive to t+δt | survived to t)

we have

$$1 - F(t + \delta t) = \bigl(1 - F(t)\bigr)\bigl(1 - h(t)\,\delta t\bigr)$$

and therefore

$$\delta t\, F'(t) = \bigl(1 - F(t)\bigr)\, h(t)\, \delta t$$

It follows that the hazard function is

$$h(t) = \frac{f(t)}{1 - F(t)}$$

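The cumulative hazard function is

$$\Lambda(t) = \int_0^t h(u)\,du$$

From the definition of the hazard function, h(t) = f(t) / (1 - F(t)), we can write

$$\Lambda(t) = -\log\bigl(1 - F(t)\bigr)$$

Several hazard functions can be used to describe such times. The simplest is a constant, which leads to a Poisson model (with exponentially distributed times):

$$f(t) = \lambda e^{-\lambda t} \qquad (t \ge 0)$$

Hence

$$F(t) = 1 - e^{-\lambda t}$$

and so

$$h(t) = \lambda$$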
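The relations above can be checked numerically for the constant-hazard case. This is only a small sketch: the rate 0.02 per week and the time points are arbitrary values chosen for illustration, and the variable names are ours.

```r
lambda <- 0.02                     # assumed constant hazard (per week), illustration only
tt     <- c(1, 10, 50, 100)        # arbitrary time points

ft <- dexp(tt, rate = lambda)      # density f(t) = lambda * exp(-lambda * t)
Ft <- pexp(tt, rate = lambda)      # cumulative distribution F(t)

h      <- ft / (1 - Ft)            # hazard h(t) = f(t) / (1 - F(t))
Lambda <- -log(1 - Ft)             # cumulative hazard

# h should equal lambda at every t, and Lambda should equal lambda * t
cbind(tt, h, Lambda, lambda.t = lambda * tt)
```

The hazard column is constant while the cumulative hazard grows linearly in t, which is exactly what a constant hazard implies.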

The theory above may look rather involved at first sight, but it becomes easier to follow with real data. Let us return to the data of Example 1. To make the calculations easier to follow, we first sort the data by time, regardless of whether the time is a discontinuation time or a continued-use time:
10 13* 18* 19 23* 30 36 38* 54*
56* 59 75 93 97 104* 107 107* 107*

In this sequence, "*" marks a censored time (the woman was still using the IUD). The simplest approach is to divide the time from 10 weeks (the shortest) to 107 weeks (the longest) into several intervals, as in the following table:

Table 13.2. Estimating the cumulative probability for each time interval

t: time index; n_t: number of women at the start of the interval; d_t: number of women who discontinued; h(t): probability of discontinuing; p_t: probability of continuing; S(t): cumulative probability

 t   Interval (weeks)   n_t   d_t   h(t)     p_t      S(t)
 1       0-9             18    0    0.0000   1.0000   1.0000
 2      10-18            18    1    0.0555   0.9445   0.9445
 3      19-29            15    1    0.0667   0.9333   0.8815
 4      30-35            13    1    0.0769   0.9231   0.8137
 5      36-58            12    1    0.0833   0.9167   0.7459
 6      59-74             8    1    0.1250   0.8750   0.6526
 7      75-92             7    1    0.1428   0.8572   0.5594
 8      93-96             6    1    0.1667   0.8333   0.4662
 9      97-106            5    1    0.2000   0.8000   0.3729
10      107               3    1    0.3333   0.6667   0.2486

In the table above:

• Column 1 is the time index (denoted t). It has no meaning other than serving as an index.

• Column 2 is the time interval (duration) in weeks. As mentioned above, we divide time into several intervals for the calculation, for example 0 to 9 weeks, 10 to 18 weeks, and so on. Note that we have no data for the period from 0 to 9 weeks; this interval is included only as a convenient starting point for the later estimates. These divisions are fairly arbitrary and are for illustration only; in practice the computer does this for us.

• Column 3 is the number of study subjects n_t (here, the number of women) at the start of an interval. For example, for the interval 0-9, at the starting time 0 there are 18 women (equivalently, the number of women followed for at least 0 weeks is 18).
In the interval 10-18, at the starting time 10, we have 18 women; in the interval 19-29, at the starting time 19, we have 15 women (namely 19 23* 30 36 38* 54* 56* 59 75 93 97 104* 107 107* 107*); and so on.
In other words, this column gives the number of subjects whose observation time is at least t. Thus, in the interval 97-106 we have 5 women followed for 97 weeks or more (97 104* 107 107* 107*).

• Column 4 gives the number of women who discontinued, d_t (the number of events), in each interval. For example, in the interval 10-18 weeks one woman discontinued (at 10 weeks); in the interval 19-29 weeks there was also one discontinuation (at 19 weeks); and so on.

• Column 5 is the hazard h(t) for the interval. It is estimated simply as d_t divided by n_t. For example, in the interval 10-18, 1 woman discontinued (out of 18), so the hazard is 1/18 = 0.0555. This probability is estimated for each interval.

• Column 6 is the probability of continuing during the interval, that is, 1 minus h(t) from column 5. It carries little information by itself and is shown only to make the calculation in the next column easier to follow.

• Column 7 is the cumulative probability of still using the device, S(t) (the cumulative survival probability). This is the most important column for the analysis. Because it is cumulative, it is estimated by multiplying two or more probabilities.
For the interval 0-9, the cumulative probability is simply the probability of continuing in column 6 (because nobody discontinued).
For the interval 10-18, the cumulative probability is the probability of continuing during 0-9 multiplied by the probability of continuing during 10-18, that is, 1.000 x 0.9445 = 0.9445. This estimate means that the probability of still using the device up to week 18 is 94.45%.
Similarly, for the interval 19-29 weeks, the cumulative probability is the cumulative probability up to weeks 10-18 multiplied by the probability of continuing during 19-29: 0.9445 x 0.9333 = 0.8815. That is, the probability of still using the device up to week 29 is 88.15%.
In general, the estimate of S(t) is

$$\hat{S}(t) = \prod_{t=1}^{k} \frac{n_t - d_t}{n_t}$$

where the product runs over all intervals up to and including the current one, k. The hat "^" over S(t) is a reminder that this is an estimate. If we denote the probability of continuing in interval t by p_t (column 6), then S(t) can also be computed as

$$\hat{S}(t) = \prod_{t=1}^{k} p_t$$

The estimate described above is usually called the Kaplan-Meier estimate, and is sometimes also called the product-limit estimate.
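The hand calculation of Table 13.2 can be sketched in a few lines of base R. This is only an illustration of the product formula, not the method used by any package; the names event.times, n.risk, n.event, hazard and surv are ours, and the data are those of Example 1.

```r
# Data of Example 1: time in weeks, status 1 = discontinued, 0 = censored
weeks  <- c(10, 13, 18, 19, 23, 30, 36, 38, 54,
            56, 59, 75, 93, 97, 104, 107, 107, 107)
status <- c(1, 0, 0, 1, 0, 1, 1, 0, 0,
            0, 1, 1, 1, 1, 0, 1, 0, 0)

# Distinct times at which at least one event occurred
event.times <- sort(unique(weeks[status == 1]))

# n_t: subjects still under observation at each event time
n.risk  <- sapply(event.times, function(t) sum(weeks >= t))
# d_t: events occurring at each event time
n.event <- sapply(event.times, function(t) sum(weeks == t & status == 1))

hazard <- n.event / n.risk        # h(t) = d_t / n_t
surv   <- cumprod(1 - hazard)     # S(t) = product of (n_t - d_t) / n_t

data.frame(time = event.times, n.risk, n.event,
           hazard = round(hazard, 4), surv = round(surv, 4))
```

Each row corresponds to one interval of Table 13.2 (the 0-9 row, in which nothing happens, is omitted), and the last value of surv is the cumulative probability at week 107.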

13.2 Kaplan-Meier estimation in R
All of the calculations above can, of course, be done in R. R has a package named survival (developed by Terry Therneau and Thomas Lumley) for event history analysis. In what follows we use this package step by step.
Returning to Example 1, the first thing to do is to enter the data into R. Before that, we load the survival package into the working environment:
> library(survival)

Next we create two variables: the first holds the times (call it weeks here), and the second, named status, indicates whether the subject discontinued (value 1) or was still using the device (value 0). We then put the two variables into a data frame (called data) for the analysis.
> weeks <- c(10, 13, 18, 19, 23, 30, 36, 38, 54,
             56, 59, 75, 93, 97, 104, 107, 107, 107)
> status <- c(1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1,
              1, 0, 1, 0, 0)
> data <- data.frame(weeks, status)

We are now ready for the analysis. To obtain the Kaplan-Meier estimate we use two functions from the survival package, Surv and survfit. Surv creates a combined variable holding both time and status. For example, the commands
> survtime <- Surv(weeks, status==1)
> survtime
 [1]  10   13+  18+  19   23+  30   36   38+  54+  56+  59   75   93   97
[15] 104+ 107  107+ 107+

give us survtime, a variable in which a "+" after a time marks a censored observation (still "surviving", here still using the device). This variable is meaningful mainly to R's own analysis functions; in practice we rarely need to look at it.
The function survfit is also simple to use: we supply the time and the status indicator in a formula, as in
> survfit(Surv(weeks, status==1) ~ 1)

or, if the object survtime already exists, simply:
> survfit(survtime ~ 1)
Call: survfit(formula = survtime ~ 1)

      n  events  median 0.95LCL 0.95UCL
     18       9      93      59     Inf

This output is not very exciting, because it tells us what we already know: there were 9 events (discontinuations) among the 18 subjects. The median time to discontinuation is 93 weeks, with a 95% confidence interval from 59 weeks to infinity (Inf = infinity). To see more, we store the result in an object, say kp, and use summary for the details:
> kp <- survfit(Surv(weeks, status==1) ~ 1)
> summary(kp)
Call: survfit(formula = Surv(weeks, status == 1) ~ 1)

 time n.risk n.event survival std.err lower 95% CI upper 95% CI
   10     18       1    0.944  0.0540        0.844        1.000
   19     15       1    0.881  0.0790        0.739        1.000
   30     13       1    0.814  0.0978        0.643        1.000
   36     12       1    0.746  0.1107        0.558        0.998
   59      8       1    0.653  0.1303        0.441        0.965
   75      7       1    0.559  0.1412        0.341        0.917
   93      6       1    0.466  0.1452        0.253        0.858
   97      5       1    0.373  0.1430        0.176        0.791
  107      3       1    0.249  0.1392        0.083        0.745
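The median of 93 weeks reported by survfit can be read off the survival column: it is the first time at which the estimated S(t) falls to 0.5 or below. A minimal sketch, with the two columns copied from the summary above (the vector names are ours):

```r
# time and survival columns copied from summary(kp)
time <- c(10, 19, 30, 36, 59, 75, 93, 97, 107)
surv <- c(0.944, 0.881, 0.814, 0.746, 0.653, 0.559, 0.466, 0.373, 0.249)

# Median survival time: earliest time with S(t) <= 0.5
median.time <- min(time[surv <= 0.5])
median.time
```

At week 75 the estimate is still above 0.5 (0.559), and it first drops below 0.5 at week 93.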

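Part of this output (the columns time, n.risk, n.event and survival) was computed "by hand" in the table above. R additionally gives the standard error of S(t) and a 95% confidence interval. A 95% confidence interval can be estimated from

$$\hat{S}(t) \pm 1.96 \times \mathrm{se}\bigl[\hat{S}(t)\bigr]$$

where

$$\mathrm{se}\bigl[\hat{S}(t)\bigr] = \hat{S}(t)\sqrt{\sum_{t=1}^{k} \frac{d_t}{n_t\,(n_t - d_t)}}$$

This standard error formula is known as Greenwood's formula. (By default survfit computes the interval on the log scale, so the limits it prints differ slightly from those of the plain formula above.) We can display the results in a plot using the function plot: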

> plot(kp,
       xlab="Time (weeks)",
       ylab="Cumulative survival probability")

[Figure: Kaplan-Meier curve with 95% confidence limits. Horizontal axis: Time (weeks); vertical axis: Cumulative survival probability.]

In the figure above, the horizontal axis is time (in weeks) and the vertical axis is the cumulative probability of still using the device. The middle line is the cumulative probability S(t), and the two dotted lines are its 95% confidence limits. From this analysis we can state that the probability of still using the device at week 107 is about 25%, with a 95% confidence interval from 8% to 74.5%. The wide interval tells us that the estimate is highly variable, simply because the number of study subjects is fairly small.
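Greenwood's formula can be applied by hand to the n.risk and n.event columns of the Kaplan-Meier output. The sketch below is ours: it computes the standard error, the plain interval S(t) ± 1.96 se, and a log-scale interval of the form S(t) × exp(± 1.96 se / S(t)), which we assume is what the printed limits correspond to.

```r
# n.risk and n.event copied from the Kaplan-Meier output for Example 1
n.risk  <- c(18, 15, 13, 12, 8, 7, 6, 5, 3)
n.event <- rep(1, 9)

surv <- cumprod((n.risk - n.event) / n.risk)

# Greenwood standard error of S(t)
se <- surv * sqrt(cumsum(n.event / (n.risk * (n.risk - n.event))))

# Plain interval, truncated to [0, 1]
lower.plain <- pmax(surv - 1.96 * se, 0)
upper.plain <- pmin(surv + 1.96 * se, 1)

# Log-scale interval (assumed to match the default of survfit)
lower.log <- surv * exp(-1.96 * se / surv)
upper.log <- pmin(surv * exp(1.96 * se / surv), 1)

round(cbind(surv, se, lower.plain, upper.plain, lower.log, upper.log), 3)
```

The se column reproduces the std.err column of the output, and the log-scale limits are the ones that come close to the printed 95% limits; the plain limits are somewhat different, especially at late times when few subjects remain.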

13.3 Comparing two cumulative probability functions: the log-rank test

The analysis above applies to a single group of subjects, and its main aim is to estimate S(t) for each time interval. In practice, many studies aim to compare S(t) between two or more groups. In clinical research, for example, and especially in cancer treatment studies, researchers often compare survival times between two groups of patients to assess the efficacy of a treatment.
Example 2. A study of 48 patients with genital herpes was carried out to test the efficacy of a new vaccine (code-named gd2). Patients were randomly divided into two groups: group 1 was treated with gd2 (25 patients), and the remaining 23 patients in group 2 received a placebo. Disease status was followed for 12 months. The table below gives the time (in weeks, denoted time) until recurrence. Each patient also provided the number of episodes of infection in the 12 months before entering the study (episodes). Clinical experience suggests that episodes is closely related to the probability of infection (we return to the analysis of this variable in a later section). The question is whether gd2 is effective in reducing the risk of recurrence.

Table 13.3. Time to infection in patients with herpes, for the gd2 and placebo groups

Group 1 (gd2)
 id  episodes  time  infected
  1     12       8      1
  3     10      12      0
  6      7      52      0
  7     10      28      1
  8      6      44      1
 10      8      14      1
 12      8       3      1
 14      9      52      1
 15     11      35      1
 18     13       6      1
 20      7      12      1
 23     13       7      0
 24      9      52      0
 26     12      52      0
 28     13      36      1
 31      8      52      0
 33     10       9      1
 34     16      11      0
 36      6      52      0
 39     14      15      1
 40     13      13      1
 42     13      21      1
 44     16      24      0
 46     13      52      0
 48      9      28      1

Group 2 (placebo)
 id  episodes  time  infected
  2      9      15      1
  4     10      44      0
  5     12       2      0
  9      7       8      1
 11      7      12      1
 13      7      52      0
 16      7      21      1
 17     11      19      1
 19     16       6      1
 21     16      10      1
 22      6      15      0
 25     15       4      1
 27      9       9      0
 29     10      27      1
 30     17       1      1
 32      8      12      1
 35      8      20      1
 37      8      32      0
 38      8      15      1
 41     14       5      1
 43     13      35      1
 45      9      28      1
 47     15       6      1

Note: in the variable infected, 1 means infected and 0 means not infected.
In this example we have two groups to compare. A simple approach is to estimate S(t) for each group and each time interval, and then compare the two groups with a suitable statistical test. The drawback is that this does not give an overall "picture" across all time intervals. Moreover, comparing two groups over many different intervals makes the results very hard to interpret.
To overcome these two drawbacks, a method called the log-rank test was developed. It is a non-parametric method for testing the hypothesis that the two groups have the same S(t). The method also divides time into k intervals t1, t2, t3, ..., tk, where tj (j = 1, 2, 3, ..., k) is the j-th time at which one or more subjects, in the two groups combined, experience the event. Let dij be the number of patients in group i (i = 1, 2) who develop the disease at time tj. Let dj = d1j + d2j be the total number of patients who develop the disease, and let nj = n1j + n2j be the total number of patients in the two groups under observation at time tj. For j = 1, 2, 3, ..., k we can compute:

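$$e_{1j} = \frac{n_{1j}\, d_j}{n_j}, \qquad e_{2j} = \frac{n_{2j}\, d_j}{n_j}$$

$$v_j = \frac{n_{1j}\, n_{2j}\, d_j\, (n_j - d_j)}{n_j^2\, (n_j - 1)}$$

(Here e1j and e2j are the numbers of patients in groups 1 and 2 that we would expect to develop the disease if both groups had the same probability of disease (the pooled probability), and vj is the variance.) We can also compute the total number of patients who develop the disease in groups 1 and 2:

$$O_1 = \sum_{j=1}^{k} d_{1j}, \qquad O_2 = \sum_{j=1}^{k} d_{2j}$$

and the total number expected in group 1 if both groups shared a common probability of disease, together with the total variance:

$$E_1 = \sum_{j=1}^{k} e_{1j}, \qquad V = \sum_{j=1}^{k} v_j$$

Let Ti be the random variable for the time from treatment to disease in group i, and let Si(t) = Pr(Ti >= t). The log-rank statistic is defined as

$$\chi^2 = \frac{(O_1 - E_1)^2}{V}$$

If $\chi^2 > \chi^2_{1,\alpha}$ (where $\chi^2_{1,\alpha}$ is the chi-square critical value with 1 degree of freedom, here the 0.95 quantile, corresponding to a 5% significance level), we have evidence to conclude that the difference in S(t) between the two groups is statistically significant.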

13.4 The log-rank test in R
Example 2 (continued). We return to Example 2 and use R to carry out the log-rank test. First we enter the data with the usual commands:
> group <- c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
             1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
             2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
             2, 2, 2, 2, 2, 2, 2, 2)
> episode <- c(12, 10, 7, 10, 6, 8, 8, 9, 11, 13, 7, 13, 9,
               12, 13, 8, 10, 16, 6, 14, 13, 13, 16, 13, 9,
               9, 10, 12, 7, 7, 7, 7, 11, 16, 16, 6, 15,
               9, 10, 17, 8, 8, 8, 8, 14, 13, 9, 15)
> time <- c(8, 12, 52, 28, 44, 14, 3, 52, 35, 6, 12, 7, 52,
            52, 36, 52, 9, 11, 52, 15, 13, 21, 24, 52, 28,
            15, 44, 2, 8, 12, 52, 21, 19, 6, 10, 15, 4, 9, 27, 1,
            12, 20, 32, 15, 5, 35, 28, 6)
> infected <- c(1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1,
                0, 1, 0, 0, 1, 1, 1, 0, 0, 1,
                1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1,
                1, 1, 0, 1, 1, 1, 1, 1)
> data <- data.frame(group, episode, time, infected)

(a) We use survfit to estimate the cumulative probability S(t) for each group of patients and store the result in the object kp.by.group (note the ~ group term in the formula):
> library(survival)
> kp.by.group <- survfit(Surv(time, infected==1) ~ group)
> summary(kp.by.group)
Call: survfit(formula = Surv(time, infected == 1) ~ group)

                group=1
 time n.risk n.event survival std.err lower 95% CI upper 95% CI
    3     25       1    0.960  0.0392        0.886        1.000
    6     24       1    0.920  0.0543        0.820        1.000
    8     22       1    0.878  0.0660        0.758        1.000
    9     21       1    0.836  0.0749        0.702        0.997
   12     19       1    0.792  0.0829        0.645        0.973
   13     17       1    0.746  0.0902        0.588        0.945
   14     16       1    0.699  0.0958        0.534        0.915
   15     15       1    0.653  0.1001        0.483        0.882
   21     14       1    0.606  0.1033        0.434        0.846
   28     12       2    0.505  0.1080        0.332        0.768
   35     10       1    0.454  0.1083        0.285        0.725
   36      9       1    0.404  0.1074        0.240        0.680
   44      8       1    0.353  0.1052        0.197        0.633
   52      7       1    0.303  0.1016        0.157        0.584

                group=2
 time n.risk n.event survival std.err lower 95% CI upper 95% CI
    1     23       1    0.957  0.0425       0.8767        1.000
    4     21       1    0.911  0.0601       0.8004        1.000
    5     20       1    0.865  0.0723       0.7346        1.000
    6     19       2    0.774  0.0889       0.6183        0.970
    8     17       1    0.729  0.0946       0.5650        0.940
   10     15       1    0.680  0.1000       0.5099        0.907
   12     14       2    0.583  0.1067       0.4072        0.835
   15     12       2    0.486  0.1088       0.3132        0.754
   19      9       1    0.432  0.1093       0.2630        0.709
   20      8       1    0.378  0.1082       0.2156        0.662
   21      7       1    0.324  0.1053       0.1712        0.613
   27      6       1    0.270  0.1007       0.1300        0.561
   28      5       1    0.216  0.0939       0.0921        0.506
   35      3       1    0.144  0.0859       0.0447        0.463

These tables give the survival probability at each time point, but as raw numbers they make it hard to see the difference between the two groups. Another way to present these probabilities is a Kaplan-Meier plot for each group, which can be drawn with the following command:

> plot(kp.by.group,
       xlab="Time",
       ylab="Cum. survival probability",
       col=c("black", "red"))

[Figure: Kaplan-Meier curves for the two groups. Horizontal axis: Time; vertical axis: Cum. survival probability.]

The figure shows fairly clearly that the group treated with gd2 (the upper, black line) has a lower probability of infection (recurrence) than the placebo group (the lower line, drawn in red by the command above). But the plot does not give us a p-value on which to base a conclusion.
(b) To obtain a p-value we use the function survdiff:
> survdiff(Surv(time, infected==1) ~ group)
Call:
survdiff(formula = Surv(time, infected == 1) ~ group)

         N Observed Expected (O-E)^2/E (O-E)^2/V
group=1 25       15     20.0      1.26      3.65
group=2 23       17     12.0      2.11      3.65

 Chisq= 3.7  on 1 degrees of freedom, p= 0.056

The log-rank test gives p = 0.056. Because p > 0.05, we still do not have convincing evidence that gd2 truly reduces the risk of recurrence.
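The formulas of section 13.3 can be applied directly to the data of Example 2 in base R. The sketch below is our own hand calculation (the names tj, n1, n2, d1, d2, e1 and v mirror the symbols in the text), not the code used inside survdiff. With the data as printed in this chapter it gives an expected count in group 1 of about 20 and a chi-square of roughly 3.7, close to, though not digit-for-digit identical with, the printed output.

```r
time <- c(8, 12, 52, 28, 44, 14, 3, 52, 35, 6, 12, 7, 52,
          52, 36, 52, 9, 11, 52, 15, 13, 21, 24, 52, 28,
          15, 44, 2, 8, 12, 52, 21, 19, 6, 10, 15, 4, 9, 27, 1,
          12, 20, 32, 15, 5, 35, 28, 6)
infected <- c(1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1,
              0, 1, 0, 0, 1, 1, 1, 0, 0, 1,
              1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1,
              1, 1, 0, 1, 1, 1, 1, 1)
group <- c(rep(1, 25), rep(2, 23))

# Distinct event times in the two groups combined
tj <- sort(unique(time[infected == 1]))

# Numbers at risk and numbers of events in each group at each event time
n1 <- sapply(tj, function(t) sum(time >= t & group == 1))
n2 <- sapply(tj, function(t) sum(time >= t & group == 2))
d1 <- sapply(tj, function(t) sum(time == t & infected == 1 & group == 1))
d2 <- sapply(tj, function(t) sum(time == t & infected == 1 & group == 2))
n  <- n1 + n2
d  <- d1 + d2

e1 <- n1 * d / n                                   # expected events in group 1
v  <- n1 * n2 * d * (n - d) / (n^2 * (n - 1))      # variance at each event time

O1 <- sum(d1); E1 <- sum(e1); V <- sum(v)
chisq <- (O1 - E1)^2 / V
p     <- pchisq(chisq, df = 1, lower.tail = FALSE)
c(O1 = O1, E1 = E1, V = V, chisq = chisq, p = p)
```

Working through the sums by hand in this way makes it clear that the test compares, at every event time, the observed number of infections in the gd2 group with the number expected if the two groups shared the same risk.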

13.5 The Cox model (Cox's proportional hazards model)

The log-rank test lets us compare S(t) between two or more groups. In practice, however, S(t) or the hazard function h(t) may differ not only between groups but may also depend on other factors. The question is how to estimate the effect of risk factors on h(t). In the study above, for instance, the number of previous infections (the variable episode) is thought to influence the risk of recurrence. So the question becomes: once we take episode into account and adjust for it, does the difference in S(t) between the two groups still exist?
In the early 1970s, David R. Cox, professor of statistics at Imperial College (London, UK), developed a regression-based method to answer this question (D.R. Cox, Regression models and life tables (with discussion), Journal of the Royal Statistical Society, Series B, 1972; 34:187-220). The method later became known as the Cox model. It is regarded as one of the most important developments in science as a whole (not just in statistics) in the 20th century, and the paper has been cited tens of thousands of times over the past 30 years.
A detailed description of the Cox model is beyond the scope of this book, so we only sketch the main points. Let x1, x2, x3, ..., xp be p risk factors; each x may be continuous or categorical. The Cox model states that

$$h(t) = h_0(t)\, e^{\beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \dots + \beta_p x_p}$$

where h(t) is the hazard function defined earlier, βj (j = 1, 2, 3, ..., p) is the coefficient for xj, and h0(t) is the hazard when the risk factors x are absent (the baseline hazard function). The effect of a risk factor xj is usually expressed as a hazard ratio (HR, analogous to the odds ratio in logistic regression): exp(βj) is the HR for a one-unit increase in xj.
The function coxph in the survival package can be used to estimate the coefficients βj. Consider the following command:

> analysis <- coxph(Surv(time, infected==1) ~ group)

In this command we test the effect of the two treatment groups on the hazard function h(t); the result is stored in the object analysis. To summarize analysis, we use summary:

> summary(analysis)
Call:
coxph(formula = Surv(time, infected == 1) ~ group)

  n= 48
       coef exp(coef) se(coef)    z    p
group 0.684      1.98    0.363 1.88 0.06

      exp(coef) exp(-coef) lower .95 upper .95
group      1.98      0.505     0.973      4.04

Rsquare= 0.071   (max possible= 0.986 )
Likelihood ratio test= 3.55  on 1 df,   p=0.0597
Wald test            = 3.55  on 1 df,   p=0.0596
Score (logrank) test = 3.67  on 1 df,   p=0.0553

Recall that the treatment group is coded 1 and the placebo group is coded 2. The result therefore says that when group increases by 1 unit, h(t) is multiplied by 1.98 (95% confidence interval 0.97 to 4.04). In other words, the risk of recurrence in the placebo group is almost twice that in the gd2 group. However, because the 95% confidence interval includes 1 and p = 0.06, we still cannot conclude that this effect is statistically significant.
But we also need to consider (and adjust for) the effect of past disease history, measured by the variable episode. To do this, we add episode to the coxph call:
> analysis <- coxph(Surv(time, infected==1) ~ group + episode)
> summary(analysis)
Call:
coxph(formula = Surv(time, infected == 1) ~ group + episode)

  n= 48
         coef exp(coef) se(coef)    z      p
group   0.874      2.40   0.3712 2.35 0.0190
episode 0.172      1.19   0.0648 2.66 0.0079

        exp(coef) exp(-coef) lower .95 upper .95
group        2.40      0.417      1.16      4.96
episode      1.19      0.842      1.05      1.35

Rsquare= 0.196   (max possible= 0.986 )
Likelihood ratio test= 10.5  on 2 df,   p=0.00537
Wald test            = 10.4  on 2 df,   p=0.00555
Score (logrank) test = 10.6  on 2 df,   p=0.00489

This analysis gives a different, and probably more accurate, interpretation. The model for h(t) is now

$$h(t \mid \text{group}, \text{episode}) = h_0(t)\, e^{0.874(\text{group}) + 0.172(\text{episode})}$$

Holding episode fixed, the ratio of h(t) between the two groups is

$$\frac{h(t \mid \text{group} = 2)}{h(t \mid \text{group} = 1)} = e^{0.874(2-1)} = 2.40$$

Similarly, holding group fixed, each one-unit increase in episode multiplies the hazard by exp(0.172) = 1.19.
In other words, each additional past episode (episode increasing by 1 unit) raises the risk of recurrence by 19% (95% confidence interval 5% to 35%). The placebo group has 2.4 times the risk of recurrence of the gd2 group (with a 95% confidence interval from about 1.2 to nearly 5). Both factors (treatment group and episode) are statistically significant, since p < 0.05.
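The hazard ratios and their confidence limits follow directly from the coefficients and standard errors in the coxph output. The sketch below copies the rounded values from the summary above, so the z and p values it produces can differ from the printed ones in the last digit; the vector names are ours.

```r
# Coefficients and standard errors copied (rounded) from summary(analysis)
coef <- c(group = 0.874, episode = 0.172)
se   <- c(group = 0.3712, episode = 0.0648)

hr    <- exp(coef)                  # hazard ratio for a one-unit increase
lower <- exp(coef - 1.96 * se)      # lower 95% limit
upper <- exp(coef + 1.96 * se)      # upper 95% limit

z <- coef / se                      # Wald statistic
p <- 2 * pnorm(-abs(z))             # two-sided p-value

round(cbind(hr, lower, upper, z, p), 3)
```

Note that the interval is built on the coefficient scale and then exponentiated, which is why it is not symmetric around the hazard ratio.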
But episode is a continuous variable. The question is what S(t) looks like for each group after adjusting for episode. The most objective approach is to assume that both the gd2 and placebo groups have the same number of episodes (the mean, say); S(t) for each group can then be estimated with:
> Cox.model <- survfit(coxph(Surv(time,
       infected==1) ~ episode + strata(group)))
> plot(Cox.model,
       xlab="Time",
       ylab="Cumulative survival probability",
       col=c("black", "red"))

or, more compactly:

> plot(survfit(coxph(Surv(time,
       infected==1) ~ episode + strata(group))),
       xlab="Time",
       ylab="Cumulative survival probability",
       col=c("black", "red"))

[Figure: adjusted survival curves for the two groups. Horizontal axis: Time; vertical axis: Cumulative survival probability.]

13.6 Building a Cox model with Bayesian Model Average (BMA)
As with multiple linear regression and multiple logistic regression, finding the best model for predicting an event when there are many independent variables is a difficult problem. Most statistics textbooks present three main approaches to finding a best model: the forward algorithm, the backward algorithm, and the AIC criterion.
With the forward algorithm, we start with the independent variable x that has the strongest effect on the dependent variable y, and then add other independent variables x one at a time until the model no longer improves.
With the backward algorithm, we start by considering all independent variables x in the data that might affect the dependent variable y, and then remove them one at a time until the model contains only statistically significant variables.
Both approaches (forward and backward) rely on residuals and P values to pick a best model. A third approach relies on the Akaike Information Criterion (AIC), presented in the previous chapter. To understand AIC-based model building, consider a practical example. Suppose we want to travel from province A to province B via district C, and for each leg we have three choices: by car, by boat, or by motorbike. Travelling by car is of course more expensive than by motorbike. Travelling by boat, on the other hand, is cheaper but slower than by car or motorbike. If there are six possible ways of making the trip in all, the problem is to find the one that costs the least while taking the shortest time. In the same way, AIC-based model building looks for the model with the fewest parameters that still predicts the dependent variable y as fully as possible.
Nhng c ba phng n trn c vn l m hnh ti u nht c
xem l m hnh sau cng, v tt c suy lun khoa hc u da vo c s ca m
hnh . Trong thc t, bt c m hnh no (k c m hnh ti u) cng c
bt nh ca n, v khi chng ta c thm s liu, m hnh ti u cha chc l m
hnh sau cng, v do suy lun c th sai lm. Mt cch tt hn v c trin vng
hn xem xt n yu t bt nh ny l Bayesian Model Average (BMA).
Vi phn tch BMA, thay v chng ta hi yu t c lp x nh hng
n bin ph thuc c ngha thng k hay khng, chng ta hi: xc sut m
bin c lp x c nh hng n y l bao nhiu. tr li cu hi BMA xem
xt tt c cc m hnh c kh nng gii thch y, v xem trong cc m hnh ,
bin x xut hin bao nhiu ln.
Example 3. In the following example we simulate a study with 5 independent variables x1, x2, x3, x4 and x5. Except for x1, the other 4 variables are simulated from a Normal distribution. The variable y is a time, accompanied by a death indicator (death). Of the 5 x variables, only x1 is related to the risk of death, through the relationship exp(3*x1 + 1); x2, x3, x4 and x5 are simulated completely independently of the risk of death. We will build models with the AIC criterion and with BMA, and compare the two.
# Load the survival and BMA packages
> library(survival)
> library(BMA)
# Create 5 independent variables
> x1 <- (1:50)/2 - 3
> x2 <- rnorm(50)
> x3 <- rnorm(50)
> x4 <- rnorm(50)
> x5 <- rnorm(50)
# Simulate the relationship risk = exp(beta*x1 + 1), with beta = 3
> model <- exp(3*x1 + 1)
# Create the dependent variable y (time to event)
> y <- rexp(50, rate = model)
# Create censoring times from an exponential distribution with rate 0.3
> censored <- rexp(50, rate=0.3)
> ycencored <- pmin(y, censored)
> death <- as.numeric(y <= censored)
# Put all variables into a data frame named simdata
# (note that simdata stores y, not ycencored)
> simdata <- data.frame(y, death, x1,x2,x3,x4,x5)
# Fit the Cox model
> cox <- coxph(Surv(y, death) ~ ., data=simdata)
> summary(cox)
Call:
coxph(formula = Surv(y, death) ~ ., data = simdata)

  n= 50
      coef exp(coef) se(coef)       z       p
x1  3.2325    25.344    0.568  5.6908 1.3e-08
x2 -0.0319     0.969    0.331 -0.0963 9.2e-01
x3  0.3112     1.365    0.327  0.9518 3.4e-01
x4  0.1364     1.146    0.297  0.4600 6.5e-01
x5  0.4898     1.632    0.313  1.5643 1.2e-01

   exp(coef) exp(-coef) lower .95 upper .95
x1    25.344     0.0395     8.325     77.16
x2     0.969     1.0324     0.506      1.85
x3     1.365     0.7326     0.719      2.59
x4     1.146     0.8725     0.641      2.05
x5     1.632     0.6127     0.883      3.01

Rsquare= 0.992   (max possible= 0.997 )
Likelihood ratio test= 241  on 5 df,   p=0
Wald test            = 33.3  on 5 df,   p=3.36e-06
Score (logrank) test = 107  on 5 df,   p=0

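The full model attributes sizeable effects to x3 and x5 as well as to x1 (hazard ratios of 1.37 and 1.63), although in this run only x1 reaches p<0.05. The apparent effects of x3 and x5 are of course spurious, because we know that only x1 is related to y. Let us now try model building by the AIC criterion: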
# Search for a model using the AIC criterion
> searchAIC <- step(cox, direction="both")
> summary(searchAIC)
Call:
coxph(formula = Surv(y, death) ~ x1 + x5, data = simdata)

  n= 50
    coef exp(coef) se(coef)    z       p
x1 3.126     22.79    0.529 5.91 3.4e-09
x5 0.429      1.54    0.297 1.45 1.5e-01

   exp(coef) exp(-coef) lower .95 upper .95
x1     22.79     0.0439     8.080     64.27
x5      1.54     0.6510     0.858      2.75

Rsquare= 0.992   (max possible= 0.997 )
Likelihood ratio test= 240  on 2 df,   p=0
Wald test            = 35.3  on 2 df,   p=2.18e-08
Score (logrank) test = 104  on 2 df,   p=0

The AIC search keeps x1 and x5 in the model as two independent predictors of y (even though x5 has p = 0.15). Once again the result is wrong, since x5 was simulated independently of y. We now apply BMA:
# Search for a model using BMA
> time <- simdata$y
> death <- simdata$death
> xvars <- simdata[,c(3,4,5,6,7)]
> bma <- bic.surv(xvars, time, death)
> summary(bma)
> imageplot.bma(bma)
Call:
bic.surv.data.frame(x = xvars, surv.t = time, cens = death)

  8  models were selected
 Best  5  models (cumulative posterior probability =  0.8911 ):

           p!=0     EV     SD     model 1    model 2    model 3    model 4    model 5
x1        100.0  3.036  0.509      2.9805     3.1262     3.0390     2.9829     2.9810
x2          9.6  0.001  0.096       .          .          .          .         0.0214
x3         14.6  0.041  0.155       .          .         0.2705      .          .
x4         10.0  0.006  0.092       .          .          .         0.0250      .
x5         31.0  0.135  0.261       .         0.42920     .          .          .

nVar                                1          2          2          2          2
BIC                              -233.774   -232.126   -230.713   -229.933   -229.930
post prob                           0.458      0.201      0.099      0.067      0.067

The BMA analysis shows that the optimal model is model 1, which contains a single variable: x1. The probability that this factor affects the risk of death is 100%. This is exactly the result we expect, because we simulated the data so that only x1 affects y. Model 2 contains x1 and x5 (it is the model chosen by AIC), but its posterior probability is only 0.201. Model 3 (x1 and x3), model 4 (x1 and x4) and model 5 (x1 and x2) are also possible, but their probabilities are too low (below 0.1) for us to accept them. The following chart shows these results:
[Figure: "Models selected by BMA". Rows: x1, x2, x3, x4, x5; horizontal axis: Model #.]

The chart shows 8 models, and x1 appears consistently in all 8 (probability 100%). The other variables appear too, but inconsistently. Comparing the two model-building methods, the BMA analysis clearly gives us the most reliable model, and the one that best matches what we know to be true.
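The idea of a posterior inclusion probability can be sketched in a few lines of base R. The block below is only an illustration, not the internals of bic.surv: it takes the posterior probabilities of the five best models reported above and sums them over the models that contain each variable. The object names (post_prob, in_model, incl_prob) are made up for the example. Because only 5 of the 8 selected models are listed in the output, the sums are lower bounds on the p!=0 column rather than a reproduction of it.

```r
# Posterior probabilities of the 5 best models, copied from the BMA summary
post_prob <- c(m1 = 0.458, m2 = 0.201, m3 = 0.099, m4 = 0.067, m5 = 0.067)

# Which variable enters which model (rows = variables, columns = models 1..5)
in_model <- rbind(
  x1 = c(1, 1, 1, 1, 1),
  x2 = c(0, 0, 0, 0, 1),
  x3 = c(0, 0, 1, 0, 0),
  x4 = c(0, 0, 0, 1, 0),
  x5 = c(0, 1, 0, 0, 0)
)

# Inclusion probability = sum of the probabilities of the models
# in which the variable appears
incl_prob <- drop(in_model %*% post_prob)
print(round(incl_prob, 3))
```

Among these five models x1 accumulates 0.892 of the posterior probability, while x5 accumulates only the 0.201 of model 2.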
These are the methods most commonly used in experimental science to analyse event (survival) data, based on the Cox model and the log-rank test. The Cox model can be extended to more complex and refined models for studies with many variables and interactions between risk factors. The documentation of the survival package can help readers learn more; it is available from www.cran.R-project.org.

14
Meta-analysis
A scientific question needs many studies. A single study cannot settle a scientific question or give a definitive answer to it. Replicating studies under different conditions is very important in science. In scientific research generally, and in medicine in particular, we often need to weigh results from many different sources to resolve a specific question.

14.1 The need for meta-analysis

In recent years quite a few studies have appeared under the label meta-analysis. What meta-analysis is, what it is for, and how it is carried out are questions many readers ask. In this chapter I describe some concepts and the way a meta-analysis is done, in the hope that readers can carry out an analysis themselves without expensive software.
The idea of pooling data dates back to the 17th century, when astronomers argued that data from many sources should be combined systematically to reach a more accurate and sensible decision than individual studies allow. Modern meta-analytic methods, however, began more than half a century ago in psychology. In 1952 the psychologist Hans J. Eysenck declared that psychotherapy had no effect at all. More than twenty years later, in 1976, Gene V. Glass, an American psychologist, set out to show that Eysenck was wrong. He collected data from more than 375 past studies of psychotherapy and combined them with a method he named meta-analysis [1]. From this analysis Glass concluded that psychotherapy is effective and helps patients.
Meta-analysis has since been taken up by other disciplines, medicine above all, to address questions such as how well a drug works in treating patients. The methods have come a long way and are now a standard tool for assessing thorny questions on which scientists have not yet reached consensus. Some people hold that a meta-analysis can give a final answer to a medical question. That claim is too optimistic, but meta-analysis is a very useful method for settling questions that are still disputed. It can also help us see which areas need further research or more evidence.
The result of a single study is usually judged either positive (for instance, the treatment works) or negative (the treatment does not work), and the judgement rests on the P value. This procedure is called significance testing. But statistical significance depends on the sample size of the study, and a negative result does not mean that the study hypothesis is false; it may only signal that the sample was too small to reach a reliable conclusion. The logic of meta-analysis is therefore to shift from significance testing to estimating the effect size. The answer a meta-analysis aims for is not simply whether a result is significant or insignificant, but how large the effect is, whether it is worth our attention, and whether it is suitable for use in practice.

14.2 Fixed-effects and Random-effects

Two terms readers often meet in meta-analyses are fixed-effects and random-effects. To understand them, consider a fairly simple example. Suppose we want to estimate the height of Vietnamese adults (18 years and older). We could run 100 studies at different sites across the country; each study samples at random from 10 people to a few tens of thousands of people, and for each study we compute the mean height. We then have 100 means, and they will certainly differ: some studies give a low mean height, some a high one, some in between. The aim of meta-analysis is to use these 100 means to estimate the height of the whole Vietnamese population. There are two ways to do so: fixed-effects meta-analysis and random-effects meta-analysis [2].
Fixed-effects meta-analysis treats the differences among the 100 means as the result of random factors within each study (the within-study variance). The assumption behind this view is that if the 100 studies had been carried out identically (same number of subjects, same age, same sex ratio, same diet, and so on), there would be no differences among the means.
If we call the means of the 100 studies $x_1, x_2, \ldots, x_{100}$, the fixed-effects view is that each $x_i$ is made up of two parts: one part reflecting the mean of the whole population (call it $M$), and a remainder (the difference between $x_i$ and $M$), which is a variable $e_i$. In other words:

\[ x_1 = M + e_1, \quad x_2 = M + e_2, \quad \ldots, \quad x_{100} = M + e_{100} \]

or in general:

\[ x_i = M + e_i \]

Of course $e_i$ can be negative or positive. If $M$ and $e_i$ are independent (uncorrelated), the variance of $x_i$, written $\mathrm{var}[x_i]$, is:

\[ \mathrm{var}[x_i] = \mathrm{var}[M] + \mathrm{var}[e_i] = 0 + s_e^2 \]

Note that $\mathrm{var}[M] = 0$ because $M$ is a constant, and $s_e^2$ is the variance of $e_i$. The aim of the analysis is to estimate $M$ and $s_e^2$.
Random-effects meta-analysis treats the variation among the means as arising from two sources: factors within each study (within-study variance) and factors that differ between studies (between-study variance). Between-study differences such as site, age, sex and nutrition have to be considered and analysed. In other words, random-effects meta-analysis goes one step beyond fixed-effects meta-analysis by taking between-study differences into account. As a result, its conclusions are usually more conservative than those of a fixed-effects analysis.
The random-effects view is that each study has its own mean to be estimated, called $m_i$. So $x_i$ is made up of two parts: one reflecting the mean of the population from which the sample was drawn ($m_i$, where the subscript $i$ denotes an individual study), and a remainder (the difference between $x_i$ and $m_i$), which is a variable $e_i$. In addition, the random-effects model states that $m_i$ varies around the overall mean $M$ by a random variable $\varepsilon_i$. In other words:

\[ x_i = m_i + e_i \]

where:

\[ m_i = M + \varepsilon_i \]

so that:

\[ x_i = M + \varepsilon_i + e_i \]

The variance of $x_i$ now has two components:

\[ \mathrm{var}[x_i] = \mathrm{var}[M] + \mathrm{var}[\varepsilon_i] + \mathrm{var}[e_i] = 0 + s_\varepsilon^2 + s_e^2 \]

As the formula shows, $s_\varepsilon^2$ reflects the variation between studies (between-study variation), while $s_e^2$ reflects the variation within each study (within-study variation). The aim of random-effects meta-analysis is to estimate $M$, $s_e^2$ and $s_\varepsilon^2$.
In short, the fixed-effects and random-effects analyses differ only in the variance. The fixed-effects analysis takes $s_\varepsilon^2 = 0$, while the random-effects analysis requires $s_\varepsilon^2$ to be estimated. Naturally, if $s_\varepsilon^2 = 0$ the two analyses give the same result. In this chapter I concentrate on fixed-effects meta-analysis.
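The difference between the two models can be made concrete with a small simulation. The following is a minimal sketch under invented assumptions: the overall mean M = 165 cm, the within-study standard deviation of 1 and the between-study standard deviation of 2 are hypothetical values chosen for illustration, and every study is given the same within-study variance for simplicity.

```r
set.seed(1)        # reproducible illustration
k   <- 100         # number of studies, as in the height example
M   <- 165         # hypothetical overall mean height (cm)
s_e <- 1           # hypothetical within-study SD of a study mean
s_b <- 2           # hypothetical between-study SD

# Fixed effects: every study estimates the same M, x_i = M + e_i
x_fixed <- M + rnorm(k, mean = 0, sd = s_e)

# Random effects: each study has its own mean m_i = M + eps_i,
# and x_i = m_i + e_i
m_i      <- M + rnorm(k, mean = 0, sd = s_b)
x_random <- m_i + rnorm(k, mean = 0, sd = s_e)

var(x_fixed)    # close to s_e^2 = 1
var(x_random)   # close to s_b^2 + s_e^2 = 5
```

The spread of the study means is visibly larger under the random-effects model, because the between-study component is added to the within-study component.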

14.3 The steps of a meta-analysis

Like any study, a meta-analysis proceeds through stages: collecting data, checking data, analysing data, and checking the results of the analysis.

Step one: use the PubMed medical library system, or the scientific library system of the relevant field, to find the papers related to the question under study. Because many studies go unpublished for one reason or another (negative results, for instance), the researcher sometimes has to collect those studies too. This is easy to say but not at all easy in practice!

Step two: review the retrieved studies to see how many meet the predefined criteria. The criteria may concern the patient population, disease status, age, sex, outcome, and so on. For example, out of hundreds of studies on the effect of vitamin D on osteoporosis, perhaps only a few dozen meet criteria such as: the subjects must be postmenopausal women with low bone density, the study must be a randomized controlled clinical trial (RCT), the outcome must be femoral fracture, and so on. (These criteria must be set before the study begins.)

Step three: data extraction. Once the eligible studies have been identified, the next step is to plan the extraction of data from them. For RCTs, for example, we must find the data for both the intervention group and the control group. Sometimes these data are not published or presented in the paper, in which case the researcher has to contact the authors directly. A summary table of the study results might look like Table 1 below.

Step four: statistical analysis. The aim in this step is to estimate the overall effect across all studies and the variability of that effect. The sections below explain how this is done.

Step five: examine the results of the analysis and compute some further indices to assess the reliability of the results.

Just as the statistical analysis of an individual study depends on the type of outcome (continuous variable or dichotomous variable), the method of meta-analysis depends on the outcomes of the studies. We describe in turn the two main methods for continuous and dichotomous variables.

14.4 Fixed-effects meta-analysis for a continuous outcome

14.4.1 Meta-analysis by hand
Example 1. The length of hospital stay of stroke patients is an important outcome for financial policy making. Researchers wanted to know how length of stay differs between two groups of hospitals: specialty and general. They reviewed and collected data from 9 studies (see Table 1). Some studies show shorter stays in specialty hospitals than in general hospitals (studies 1, 2, 3, 4, 5 and 8), while others show the opposite (such as studies 7 and 9). The question is whether these data are consistent with the hypothesis that patients in specialty hospitals usually have shorter stays than patients in general hospitals. We can answer it through the following steps:
Step 1: summarise the data in a table, as follows:
Table 1. Length of hospital stay of stroke patients in specialty and general hospitals

Study        Specialty hospitals        General hospitals
(i)          N1i    LOS1i    SD1i       N2i    LOS2i    SD2i
1            155      55      47        156      75      64
2             31      27       7         32      29       4
3             75      64      17         71     119      29
4             18      66      20         18     137      48
5              8      14       8         13      18      11
6             57      19       7         52      18       4
7             34      52      45         33      41      34
8            110      21      16        183      31      27
9             60      30      27         52      23      20
Total        548                        610

Note: in this table i indexes the studies, i = 1, 2, ..., 9. N1 and N2 are the numbers of patients studied in each group of hospitals; LOS1 and LOS2 (length of stay) are the mean lengths of stay in days; SD1 and SD2 are the standard deviations of the length of stay.

Step 2: estimate the mean difference and its variance for each study. Each study estimates one effect, here the difference in length of stay, which we denote $d_i$. It is simply the mean stay in general hospitals minus the mean stay in specialty hospitals:

\[ d_i = LOS_{2i} - LOS_{1i} \]

The variance of $d_i$ (which I denote $s_i^2$) is estimated with a standard formula based on the standard deviation and the number of subjects in each study. For each study $i$ ($i = 1, 2, 3, \ldots, 9$) we have:

\[ s_i^2 = \frac{(N_{1i}-1)\,SD_{1i}^2 + (N_{2i}-1)\,SD_{2i}^2}{N_{1i}+N_{2i}-2}\left(\frac{1}{N_{1i}}+\frac{1}{N_{2i}}\right) \]

For study 1, for example:

\[ d_1 = 75 - 55 = 20 \]

with variance:

\[ s_1^2 = \frac{(155-1)(47)^2 + (156-1)(64)^2}{155+156-2}\left(\frac{1}{155}+\frac{1}{156}\right) = 40.59 \]

and standard deviation $s_1 = \sqrt{40.59} = 6.37$.
With the standard deviation $s_i$ we can estimate the 95% confidence interval (95% CI) of $d_i$ from the Normal distribution. Recall that if a variable follows the Normal distribution, 95% of its values lie within 1.96 standard deviations of the mean. The 95% confidence interval for the difference in study 1 therefore runs from
d1 - 1.96*s1 = 20 - 1.96*6.37 = 7.51 days
to
d1 + 1.96*s1 = 20 + 1.96*6.37 = 32.49 days.
Repeating the calculation for the other studies gives four more columns, as in the following table:
Table 1a. Difference in length of stay between the two groups, with 95% confidence interval

Study (i)     di      si^2      si     di-1.96*si   di+1.96*si
1             20      40.6     6.37       7.51        32.49
2              2       2.0     1.43      -0.80         4.80
3             55      15.3     3.91      47.34        62.66
4             71     150.2    12.26      46.98        95.02
5              4      20.2     4.49      -4.81        12.81
6             -1       1.2     1.11      -3.17         1.17
7            -11      95.4     9.77     -30.14         8.14
8             10       8.0     2.83       4.45        15.55
9             -7      20.7     4.55     -15.92         1.92
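The per-study calculation of Step 2 can be checked in R. This is a sketch that assumes the data of Table 1 are typed in as vectors; the names d, sp2, s2, s and tab1a are chosen for the illustration only.

```r
# Data from Table 1
n1   <- c(155, 31, 75, 18, 8, 57, 34, 110, 60)
los1 <- c(55, 27, 64, 66, 14, 19, 52, 21, 30)
sd1  <- c(47, 7, 17, 20, 8, 7, 45, 16, 27)
n2   <- c(156, 32, 71, 18, 13, 52, 33, 183, 52)
los2 <- c(75, 29, 119, 137, 18, 18, 41, 31, 23)
sd2  <- c(64, 4, 29, 48, 11, 4, 34, 27, 20)

# Difference: general hospitals minus specialty hospitals
d <- los2 - los1

# Pooled variance, then variance and standard deviation of d
sp2 <- ((n1 - 1) * sd1^2 + (n2 - 1) * sd2^2) / (n1 + n2 - 2)
s2  <- sp2 * (1 / n1 + 1 / n2)
s   <- sqrt(s2)

# 95% confidence interval from the Normal distribution
tab1a <- data.frame(study = 1:9, d, s2, s,
                    lower = d - 1.96 * s,
                    upper = d + 1.96 * s)
print(round(tab1a, 2))
```

The rows of tab1a should agree with Table 1a up to rounding.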

At this point we can display the effect $d_i$ and its 95% confidence interval in a chart known as a forest plot.

[Figure: forest plot showing $d_i$ and its 95% confidence interval for each study.]

The effects $d_i$ from studies 2, 5, 6, 7 and 9 are regarded as not statistically significant, because their 95% confidence intervals cross 0.
Step 3: estimate the weight for each study. The weight ($W_i$) is simply the reciprocal of the variance $s_i^2$:

\[ W_i = 1 / s_i^2 \]
For study 1, for example, $W_1 = 1/40.59 = 0.0246$. This gives one more column for the table above:

Table 1b. Weight for each study

Study        di      si^2       Wi
1            20      40.6     0.0246
2             2       2.0     0.4886
3            55      15.3     0.0654
4            71     150.2     0.0067
5             4      20.2     0.0495
6            -1       1.2     0.8173
7           -11      95.4     0.0105
8            10       8.0     0.1245
9            -7      20.7     0.0483
Total                         1.6354

Step 4: estimate the mean of d across all studies. We could simply add up the $d_i$ and divide by 9, but that would not be objective, because each $d_i$ has its own variance and weight ($W_i$). Study 4, for example, has the largest variance (150.2), which means that it had few subjects or very high variability, and high variability means we cannot place much confidence in its estimate. Its weight is accordingly very low, only 0.0067. Study 6, by contrast, has a high weight and low variability (low variance), so its estimate carries more weight than those of the other studies in the group.
So, to average d across the studies we must take the weights $W_i$ into account. Given $d_i$ and $W_i$, the weighted mean is computed in the standard way:

\[ \bar{d} = \frac{\sum_{i=1}^{9} W_i d_i}{\sum_{i=1}^{9} W_i} \]

Every statistical estimate has a variance. For $\bar{d}$, the variance (denoted $s_d^2$) is simply the reciprocal of the sum of the weights $W_i$:

\[ s_d^2 = \frac{1}{\sum_{i=1}^{9} W_i} \]

The standard error (SE) of $\bar{d}$ is therefore $SE(\bar{d}) = s_d$. By the theory of the Normal distribution, the 95% confidence interval (95% CI) can be estimated as:

\[ \bar{d} \pm 1.96\, s_d \]

To compute $\bar{d}$ we need one more column, $W_i d_i$. For study 1, for example, $W_1 d_1 = 0.0246 \times 20 = 0.4928$ (using the unrounded value of $W_1$). Continuing in this way gives one more column.
Table 1c. Calculating the weighted mean

Study        di      si^2       Wi       Wi*di
1            20      40.6     0.0246     0.4928
2             2       2.0     0.4886     0.9771
3            55      15.3     0.0654     3.5993
4            71     150.2     0.0067     0.4726
5             4      20.2     0.0495     0.1981
6            -1       1.2     0.8173    -0.8173
7           -11      95.4     0.0105    -0.1153
8            10       8.0     0.1245     1.2450
9            -7      20.7     0.0483    -0.3383
Total                         1.6354     5.7140

Next, add up all the $W_i$ and $W_i d_i$ (the Total row of the table above). The weighted mean of d is then:

\[ \bar{d} = \frac{\sum_{i=1}^{9} W_i d_i}{\sum_{i=1}^{9} W_i} = \frac{0.4928 + 0.9771 + \cdots - 0.3383}{0.0246 + 0.4886 + \cdots + 0.0483} = \frac{5.7140}{1.6354} = 3.49 \]

and its variance is:

\[ s_d^2 = \frac{1}{1.6354} = 0.61 \]

In other words, the standard error of $\bar{d}$ is $s_d = \sqrt{0.61} = 0.782$.
The 95% confidence interval (95% CI) is:
3.49 ± 1.96*0.782 = 1.96 to 5.02.
We can now say that, on average, stays in general hospitals are 3.49 days longer than in specialty hospitals, with a 95% confidence interval from 1.96 to 5.02 days.
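Steps 3 and 4 amount to an inverse-variance weighted mean, which takes only a few lines of R. The sketch below re-enters the data of Table 1 so that it runs on its own; the names w, d_bar, se_d and ci are illustrative.

```r
# Data from Table 1
n1   <- c(155, 31, 75, 18, 8, 57, 34, 110, 60)
los1 <- c(55, 27, 64, 66, 14, 19, 52, 21, 30)
sd1  <- c(47, 7, 17, 20, 8, 7, 45, 16, 27)
n2   <- c(156, 32, 71, 18, 13, 52, 33, 183, 52)
los2 <- c(75, 29, 119, 137, 18, 18, 41, 31, 23)
sd2  <- c(64, 4, 29, 48, 11, 4, 34, 27, 20)

# Difference and its variance for each study (Step 2)
d  <- los2 - los1
s2 <- ((n1 - 1) * sd1^2 + (n2 - 1) * sd2^2) / (n1 + n2 - 2) *
      (1 / n1 + 1 / n2)

# Step 3: weights are the reciprocals of the variances
w <- 1 / s2

# Step 4: weighted mean, its standard error and 95% CI
d_bar <- sum(w * d) / sum(w)
se_d  <- sqrt(1 / sum(w))
ci    <- d_bar + c(-1, 1) * 1.96 * se_d

round(c(d_bar = d_bar, se = se_d, lower = ci[1], upper = ci[2]), 2)
```

The result should agree with the hand calculation (about 3.49 days, with an interval of roughly 1.96 to 5.02).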
Step 5: estimate the indices of homogeneity and heterogeneity among the studies [3]. In practice these indices measure the difference between each study and the weighted mean. The index of homogeneity is computed as:

\[ Q = \sum_{i=1}^{k} W_i (d_i - \bar{d})^2 \]

Here k is the number of studies (k = 9 in the example). In theory Q follows a Chi-square distribution with k-1 degrees of freedom ($\chi^2_{k-1}$). In other words, if Q exceeds the critical value of $\chi^2_{k-1}$, that is a sign that the heterogeneity among the studies is statistically significant.
Many studies over the years have shown that Q does not detect heterogeneity consistently, so it is now seldom used in meta-analysis. An alternative to Q is the index of heterogeneity, written $I^2$, defined as:

\[ I^2 = \frac{Q - (k-1)}{Q} \]

$I^2$ takes values from negative numbers up to 1. If $I^2 < 0$ we set it to 0; if $I^2$ is close to 1, that signals heterogeneity among the studies.
To estimate Q and $I^2$ in the example we need $W_i (d_i - \bar{d})^2$ for each study. For study 1, for example:

\[ W_1 (d_1 - \bar{d})^2 = 0.0246 \times (20 - 3.49)^2 = 6.7129 \]

Table 1d. Calculating the indices of homogeneity and heterogeneity

Study        di      si^2       Wi       Wi*di    Wi*(di-d)^2
1            20      40.6     0.0246     0.4928      6.7129
2             2       2.0     0.4886     0.9771      1.0903
3            55      15.3     0.0654     3.5993    173.6080
4            71     150.2     0.0067     0.4726     30.3356
5             4      20.2     0.0495     0.1981      0.0127
6            -1       1.2     0.8173    -0.8173     16.5054
7           -11      95.4     0.0105    -0.1153      2.2026
8            10       8.0     0.1245     1.2450      5.2701
9            -7      20.7     0.0483    -0.3383      5.3215
Total                         1.6354     5.7140    241.05

After computing $W_i (d_i - \bar{d})^2$ for each study, we add the values up (last column), which gives Q:

\[ Q = \sum_{i=1}^{9} W_i (d_i - \bar{d})^2 = 241.05 \]

From this, $I^2$ is estimated as:

\[ I^2 = \frac{241.05 - 8}{241.05} = 0.967 \]

The heterogeneity index $I^2$ is very high, showing that $d_i$ varies greatly among the studies. We can see this simply by looking at the second column ($d_i$) of the table above.
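Q and I² can be computed directly from the per-study differences and weights. The sketch below repeats the data entry so that it is self-contained; the p-value line uses pchisq, which is an addition to the hand calculation, and the object names are illustrative.

```r
# Data from Table 1
n1   <- c(155, 31, 75, 18, 8, 57, 34, 110, 60)
los1 <- c(55, 27, 64, 66, 14, 19, 52, 21, 30)
sd1  <- c(47, 7, 17, 20, 8, 7, 45, 16, 27)
n2   <- c(156, 32, 71, 18, 13, 52, 33, 183, 52)
los2 <- c(75, 29, 119, 137, 18, 18, 41, 31, 23)
sd2  <- c(64, 4, 29, 48, 11, 4, 34, 27, 20)

d  <- los2 - los1
s2 <- ((n1 - 1) * sd1^2 + (n2 - 1) * sd2^2) / (n1 + n2 - 2) *
      (1 / n1 + 1 / n2)
w     <- 1 / s2
d_bar <- sum(w * d) / sum(w)

# Index of homogeneity Q and its Chi-square p-value (k - 1 df)
k <- length(d)
Q <- sum(w * (d - d_bar)^2)
p_value <- pchisq(Q, df = k - 1, lower.tail = FALSE)

# Index of heterogeneity I^2, truncated at 0
I2 <- max(0, (Q - (k - 1)) / Q)

round(c(Q = Q, I2 = I2), 3)
p_value
```

Q should come out near 241 and I² near 0.967, with a very small p-value.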
Step 6: assess the possibility of publication bias [4]. Publication bias is a relatively new concept that can be explained by the following situation. When a study produces a negative result (that is, it finds no statistically significant effect or association), it has little chance of being published in a journal, because journal editors generally do not like to print such papers. A study with a positive (statistically significant) result, by contrast, is more likely to appear in a scientific journal than a study with a negative result. Yet most meta-analyses rely on results published in scientific journals. The estimate from a meta-analysis may therefore lack objectivity, because it does not take full account of negative studies that were never published.
Some researchers propose a funnel plot to check for publication bias. A funnel plot draws the precision (vertical axis, y-axis) against the estimated effect of each study. Precision is defined here as the reciprocal of the standard error:

\[ \text{precision}_i = \frac{1}{s_i} \]

In other words, the funnel plot shows precision against $d_i$. For study 1, for example, precision $= 1/\sqrt{40.6} = 0.157$. Doing this for each study gives the following table and funnel plot:
Table 1e. Assessing publication bias

Study        di      si^2      1/si
1            20      40.6     0.1570
2             2       2.0     0.6990
3            55      15.3     0.2558
4            71     150.2     0.0816
5             4      20.2     0.2225
6            -1       1.2     0.9041
7           -11      95.4     0.1024
8            10       8.0     0.3528
9            -7      20.7     0.2198
[Figure: funnel plot; the vertical axis is precision and the horizontal axis is d.]

The plot shows that in most studies the length of stay in general hospitals is longer than in specialty hospitals.
The logic behind the funnel plot is that if large studies (those with high precision) are more likely to be published, then studies with positive results will outnumber small studies or studies with negative results in the journals. If that happens, the funnel plot will show asymmetry. In other words, asymmetry in a funnel plot signals a publication bias problem. But is the publication bias statistically significant? The funnel plot cannot answer that question; we need more rigorous quantitative methods.
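A funnel plot of this kind can be drawn with base R graphics. The sketch below is one possible rendering, not the code used for the figure in the text; the dashed vertical line marking the weighted mean is an addition for orientation.

```r
# Data from Table 1
n1   <- c(155, 31, 75, 18, 8, 57, 34, 110, 60)
los1 <- c(55, 27, 64, 66, 14, 19, 52, 21, 30)
sd1  <- c(47, 7, 17, 20, 8, 7, 45, 16, 27)
n2   <- c(156, 32, 71, 18, 13, 52, 33, 183, 52)
los2 <- c(75, 29, 119, 137, 18, 18, 41, 31, 23)
sd2  <- c(64, 4, 29, 48, 11, 4, 34, 27, 20)

d  <- los2 - los1
s2 <- ((n1 - 1) * sd1^2 + (n2 - 1) * sd2^2) / (n1 + n2 - 2) *
      (1 / n1 + 1 / n2)

# Precision is the reciprocal of the standard error of d
precision <- 1 / sqrt(s2)

# Funnel plot: effect on the horizontal axis, precision on the vertical axis
plot(d, precision, pch = 19,
     xlab = "d (days)", ylab = "Precision (1/SE)",
     main = "Funnel plot")
abline(v = sum(d / s2) / sum(1 / s2), lty = 2)  # weighted mean of d
```

With only 9 studies, any judgement of symmetry from such a plot is necessarily rough.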
Egger's test
In recent years some have argued that funnel plots are hard to interpret and can mislead about publication bias [5-6]. Indeed, some medical journals have a policy of encouraging researchers to assess publication bias by a method other than the funnel plot.
One such method is Egger's test. With this method we model SND = a + b x precision, where SND is estimated by dividing d by its standard error, that is $SND_i = d_i / s_i$, and a and b are two parameters to be estimated from this linear regression. Here a gives us an estimate of the asymmetry of the funnel plot: a>0 means that larger studies tend to estimate the effect with higher precision.
In the example we can use statistical software (such as SAS or R) to estimate a and b:
SNDi = 4.20 - 4.17084*precisioni
The estimate a = 4.20 is greater than 0 but not statistically significant, so this is evidence that there is no publication bias.
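The regression behind Egger's test can be fitted with lm. This is a sketch of the model as described in this section (SND regressed on precision), using the data of Table 1; the coefficients it returns should be close to, though not necessarily identical with, the values quoted above, since those may have been computed from rounded inputs.

```r
# Data from Table 1
n1   <- c(155, 31, 75, 18, 8, 57, 34, 110, 60)
los1 <- c(55, 27, 64, 66, 14, 19, 52, 21, 30)
sd1  <- c(47, 7, 17, 20, 8, 7, 45, 16, 27)
n2   <- c(156, 32, 71, 18, 13, 52, 33, 183, 52)
los2 <- c(75, 29, 119, 137, 18, 18, 41, 31, 23)
sd2  <- c(64, 4, 29, 48, 11, 4, 34, 27, 20)

d <- los2 - los1
s <- sqrt(((n1 - 1) * sd1^2 + (n2 - 1) * sd2^2) / (n1 + n2 - 2) *
          (1 / n1 + 1 / n2))

snd       <- d / s     # standardised effect
precision <- 1 / s     # reciprocal of the standard error

# Egger regression: SND = a + b * precision
egger <- lm(snd ~ precision)
summary(egger)$coefficients
```

The first row of the coefficient table gives a with its standard error and P value, which is what the test examines.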
In practice, however, Egger's test is just another way of presenting the funnel plot and changes little. One way of assessing publication bias is, so far, regarded as the most reliable: linear regression of $d_i$ on the total sample size ($N_i$). In other words, we find a and b in the model [7]:
di = a + b*Ni
If there is no publication bias, the value of b will be very close to 0, or not statistically significant. If b differs from 0, that is a sign of publication bias. In the example, with the following data,

Study        di       Ni
1            20      311
2             2       63
3            55      146
4            71       36
5             4       21
6            -1      109
7           -11       67
8            10      293
9            -7      112

we obtain the equation:
di = 16.0 - 0.0009*Ni
and the value of b is indeed very small (and not statistically significant), so we can conclude that publication bias is not a problem in this study.
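The regression of the effect on the total sample size is again a one-line lm call. The sketch below assumes that N_i is the sum of the two group sizes in Table 1, which matches the N_i values listed in this section.

```r
# Group sizes and mean lengths of stay from Table 1
n1   <- c(155, 31, 75, 18, 8, 57, 34, 110, 60)
n2   <- c(156, 32, 71, 18, 13, 52, 33, 183, 52)
los1 <- c(55, 27, 64, 66, 14, 19, 52, 21, 30)
los2 <- c(75, 29, 119, 137, 18, 18, 41, 31, 23)

d <- los2 - los1   # difference in length of stay
N <- n1 + n2       # total sample size of each study

# Linear regression d = a + b * N
fit <- lm(d ~ N)
summary(fit)$coefficients
```

The intercept is about 16.0 and the slope about -0.0009, as in the equation above; the P value of the slope is in the second row of the coefficient table.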
In summary, this meta-analysis gives us reliable evidence that patients stay about three and a half days longer in general hospitals than in specialty hospitals, with a 95% confidence interval of roughly 2 to 5 days. The results also show no publication bias in the analysis.

14.4.2 Meta-analysis with R

R has two packages written and designed for meta-analysis. The one in common use is meta. Readers can download it free of charge from the R web site (in the packages section): http://cran.R-project.org.
To run a meta-analysis in R we must first load the meta package into the R session (provided, of course, that meta has been downloaded and installed).
> library(meta)

Next we enter the data of Example 1 into R, with one vector for each column of Table 1, and combine them into a data frame called los:

> n1 <- c(155,31,75,18,8,57,34,110,60)
> los1 <- c(55,27,64,66,14,19,52,21,30)
> sd1 <- c(47,7,17,20,8,7,45,16,27)
> n2 <- c(156,32,71,18,13,52,33,183,52)
> los2 <- c(75,29,119,137,18,18,41,31,23)
> sd2 <- c(64,4,29,48,11,4,34,27,20)
> los <- data.frame(n1,los1,sd1,n2,los2,sd2)

We use the function metacont (for continuous outcomes; cont = continuous variable) and store the result in the object res:

> res <- metacont(n1,los1,sd1,n2,los2,sd2,data=los)


> res
   WMD                 95%-CI %W(fixed) %W(random)
1  -20  [-32.4744;  -7.5256]      1.44      10.69
2   -2  [ -4.8271;   0.8271]     28.11      12.67
3  -55  [-62.7656; -47.2344]      3.73      11.89
4  -71  [-95.0223; -46.9777]      0.39       7.39
5   -4  [-12.1539;   4.1539]      3.38      11.80
6    1  [ -1.1176;   3.1176]     50.11      12.72
7   11  [ -8.0620;  30.0620]      0.62       8.76
8  -10  [-14.9237;  -5.0763]      9.27      12.41
9    7  [ -1.7306;  15.7306]      2.95      11.67

Number of trials combined: 9

                         WMD            95%-CI      z  p.value
Fixed effects model   -3.464  [ -4.96; -1.96]  -4.53  <0.0001
Random effects model  -13.98  [-24.03; -3.93]  -2.73   0.0064

Quantifying heterogeneity:
tau^2 = 205.4094; H = 5.46 [4.54; 6.58]; I^2 = 96.7% [95.2%; 97.7%]

Test of heterogeneity:
      Q d.f.  p.value
 238.92    8 < 0.0001

Method: Inverse variance method

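meta gives us two results: one based on the fixed-effects model and one based on the random-effects model. As the output shows, the two estimates differ considerably, but the overall conclusion is the same: both are statistically significant. Note that meta computes the difference as specialty minus general (LOS1 - LOS2), so the signs are the opposite of those in the hand calculation.
We can also use the plot function to display the results as a forest plot (recent versions of meta provide forest(res) for this purpose):
> plot(res, lwd=3)

[Figure: forest plot of the 9 studies. Horizontal axis: Weighted mean difference.]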

14.5 Fixed-effects meta-analysis for a dichotomous outcome

In the previous section I described the main steps of a meta-analysis of studies whose outcome is a continuous variable. For continuous variables the mean and the standard deviation are the two summary statistics normally used. But they cannot be applied to categorical outcomes such as death or fracture, because these outcomes take only two values: yes or no. A person is either alive or dead, has a fracture or does not, has heart failure or does not, and so on. For such variables we need a method of analysis different from the one used for continuous variables.

14.5.1 The analysis model

For dichotomous outcomes (with only two values), the statistic corresponding to the mean is the proportion (which may be expressed as a percentage), and the statistic corresponding to the standard deviation is the standard error. Suppose a study follows 25 patients for a period during which 5 of them develop the disease; the proportion (denoted p) is simply p = 5/25 = 0.20 (or 20%). In theory, the variance of p (denoted var[p]) is:
var[p] = p(1-p)/n = 0.2*(1 - 0.2)/25 = 0.0064.
The standard error of p (denoted SE[p]) is then:

\[ SE[p] = \sqrt{\mathrm{var}[p]} = \sqrt{0.0064} = 0.08 \]

We can also estimate the 95% confidence interval of the proportion:

\[ p \pm 1.96 \times SE[p] = 0.2 \pm 1.96 \times 0.08 = 0.04 \text{ to } 0.36 \]
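The same calculation in R, as a minimal sketch using the numbers of the example (25 patients, 5 events); the variable names are illustrative.

```r
n      <- 25   # patients followed
events <- 5    # patients who developed the disease

p      <- events / n          # proportion
var_p  <- p * (1 - p) / n     # variance of p
se_p   <- sqrt(var_p)         # standard error of p

# 95% confidence interval from the Normal approximation
ci_p <- p + c(-1, 1) * 1.96 * se_p

round(c(p = p, var = var_p, se = se_p, lower = ci_p[1], upper = ci_p[2]), 4)
```

With so few patients the Normal approximation is rough, which is one reason why the pooled analysis below works on the logarithmic scale.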


Because dichotomous outcomes are computed in their own way, the method for pooling studies with a dichotomous outcome is also different. To illustrate this kind of meta-analysis we take an example (adapted from a real study).
Example 2: Beta-blockers (abbreviated BB) are a class of drugs used to treat and prevent high blood pressure. It has been hypothesised that BB can also prevent heart failure, or at least reduce the risk of heart failure. To test the hypothesis, a series of randomized controlled clinical trials was carried out over the past 20 years. Each trial has 2 groups of patients: one treated with BB and one untreated (the placebo group). During 2 years of follow-up the researchers recorded the number of deaths in each group. Table 2 summarises 13 past studies:
Table 2. Beta-blocker and congestive heart failure

Study          Beta-blocker               Placebo
(i)            N1     Deaths (d1)         N2     Deaths (d2)
1               25        5                25        6
2                9        1                16        2
3              194       23               189       21
4               25        1                25        2
5              105        4                34        2
6              320       53               321       67
7               33        3                16        2
8              261       12                84       13
9              133        6               145       11
10             232        2               134        5
11            1327      156              1320      228
12            1990      145              2001      217
13             214        8               212       17
Total         4868      419              4522      593

N: number of patients studied; Deaths: number of patients who died during follow-up.

As we can see, some studies have quite small samples, while others have nearly 4000 people! The question is whether, taken together, the results of these studies are consistent with the hypothesis that BB reduces the risk of heart failure. To answer it we proceed as follows:
Step 1: estimate the effect for each study. Each study has two proportions: one for the BB group and one for the placebo group. Call them p1 and p2; the effect of BB is measured by the relative risk (RR), which is estimated as:

\[ RR = \frac{p_1}{p_2} \]

In study 1, for example, $p_1 = 5/25 = 0.20$ and $p_2 = 6/25 = 0.24$, so the relative risk for study 1 is $RR = 0.20/0.24 = 0.833$. The same calculation for the remaining studies gives the following table:

Table 2a. Death rates and relative risk

Study (i)     Death rate,       Death rate,          Relative risk
              BB group (p1)     placebo group (p2)   (RR)
1               0.200             0.240                0.833
2               0.111             0.125                0.889
3               0.119             0.111                1.067
4               0.040             0.080                0.500
5               0.038             0.059                0.648
6               0.166             0.209                0.794
7               0.091             0.125                0.727
8               0.046             0.155                0.297
9               0.045             0.076                0.595
10              0.009             0.037                0.231
11              0.118             0.173                0.681
12              0.073             0.108                0.672
13              0.037             0.080                0.466

Step 2: transform RR to the logarithmic scale and compute its variance and standard error. Every statistical estimate, as noted several times, follows a distribution, and that distribution can be characterized by its variance (or standard error). The variance of RR is fairly complicated to compute directly, so we use an indirect method: we transform RR to log[RR] (here log means the natural logarithm, i.e. log_e, sometimes abbreviated ln), and then compute the variance of log[RR].

If N1 and N2 are the sample sizes of group 1 and group 2, and d1 and d2 are the numbers of deaths in group 1 and group 2 of a study, the variance of log[RR] can be computed with the following formula:

\[ Var[\log RR] = \frac{1}{d_1} - \frac{1}{N_1 - d_1} + \frac{1}{d_2} - \frac{1}{N_2 - d_2} \]

and the standard error of log[RR] is:

\[ SE[\log RR] = \sqrt{\frac{1}{d_1} - \frac{1}{N_1 - d_1} + \frac{1}{d_2} - \frac{1}{N_2 - d_2}} \]

In the example above, for study 1, we have:

log[RR] = log_e(0.833) = -0.182

with variance:

\[ Var[\log RR] = \frac{1}{5} - \frac{1}{25 - 5} + \frac{1}{6} - \frac{1}{25 - 6} = 0.264 \]

and standard error:

\[ SE[\log RR] = \sqrt{0.264} = 0.514 \]


Relying on the normal distribution, we can also compute the 95% confidence interval of RR for each study by transforming back to the RR scale. For example, for study 1 the 95% confidence interval of log[RR] is:
logRR ± 1.96*SE[logRR] = -0.182 ± 1.96*0.514 = -1.19 to 0.82
or, transformed back to the original RR scale:
exp(-1.19) = 0.30 to exp(0.82) = 2.28
Repeating the calculation for the other studies gives a new table:
Table 2b. Estimated relative risk, variance, standard error and 95% confidence interval for each study

Study (i)   RR      log[RR]   Var[logRR]   SE[logRR]   Lower 95% CI of RR   Upper 95% CI of RR
1           0.833   -0.182    0.264        0.514       0.30                 2.28
2           0.889   -0.118    1.304        1.142       0.09                 8.33
3           1.067    0.065    0.079        0.282       0.61                 1.85
4           0.500   -0.693    1.415        1.189       0.05                 5.15
5           0.648   -0.434    0.709        0.842       0.12                 3.37
6           0.794   -0.231    0.026        0.162       0.58                 1.09
7           0.727   -0.318    0.729        0.854       0.14                 3.87
8           0.297   -1.214    0.142        0.377       0.14                 0.62
9           0.595   -0.520    0.242        0.492       0.23                 1.56
10          0.231   -1.465    0.688        0.829       0.05                 1.17
11          0.681   -0.385    0.009        0.095       0.56                 0.82
12          0.672   -0.398    0.010        0.102       0.55                 0.82
13          0.466   -0.763    0.174        0.417       0.21                 1.06
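Steps 1 and 2 can be reproduced in a few lines of R. The sketch below is an illustration rather than part of the original text: it uses the counts of example 2 (the same vectors that are entered in section 14.5.2) and applies the variance expression as printed in step 2. The variable names are illustrative. For comparison it also computes the large-sample variance more commonly quoted for log(RR), 1/d1 - 1/n1 + 1/d2 - 1/n2; that extra column is an addition to the text.

```r
# Counts for the 13 trials: BB group (n1, d1) and placebo group (n2, d2)
n1 <- c(25, 9, 194, 25, 105, 320, 33, 261, 133, 232, 1327, 1990, 214)
d1 <- c(5, 1, 23, 1, 4, 53, 3, 12, 6, 2, 156, 145, 8)
n2 <- c(25, 16, 189, 25, 34, 321, 16, 84, 145, 134, 1320, 2001, 212)
d2 <- c(6, 2, 21, 2, 2, 67, 2, 13, 11, 5, 228, 217, 17)

# Step 1: proportions and relative risk
p1 <- d1 / n1
p2 <- d2 / n2
rr <- p1 / p2

# Step 2: log scale, variance as printed in the text, standard error
log_rr   <- log(rr)
var_text <- 1/d1 - 1/(n1 - d1) + 1/d2 - 1/(n2 - d2)
se       <- sqrt(var_text)

# Assumption, not from the text: the commonly used large-sample variance
var_std  <- 1/d1 - 1/n1 + 1/d2 - 1/n2

# 95% confidence limits, back-transformed to the RR scale
lower <- exp(log_rr - 1.96 * se)
upper <- exp(log_rr + 1.96 * se)

tab2b <- data.frame(study = 1:13,
                    RR = round(rr, 3), logRR = round(log_rr, 3),
                    var = round(var_text, 3), SE = round(se, 3),
                    lower = round(lower, 2), upper = round(upper, 2),
                    var_std = round(var_std, 3))
print(tab2b)
```

The two variance columns differ most in the small studies; whichever expression is used, the weights in the next step are simply its reciprocal.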

We can display RR and its 95% confidence interval with a forest plot, as follows:

Forest plot showing the value of RR and the 95% confidence interval. Estimates whose 95% confidence interval crosses the reference line at 1 are regarded as not statistically significant.
Step 3: estimate the weight for each study and the RR for all studies combined. The plot above shows that some studies have a very wide range for RR (indicating that these studies have small samples or unstable RR estimates), whereas some large studies have more stable RR estimates. The weight for each study (W_i, where the subscript i indexes the study) measures this stability and is the inverse of the variance:

\[ W_i = \frac{1}{Var[\log RR_i]} \]

The weighted mean of log[RR] (denoted log wRR) is computed from the sum of the products W_i log[RR_i]:

\[ \log wRR = \frac{\sum W_i \log[RR_i]}{\sum W_i} \]

with variance:

\[ Var[\log wRR] = \frac{1}{\sum W_i} \]

and standard error:

\[ SE[\log wRR] = \sqrt{\frac{1}{\sum W_i}} \]

In addition, the 95% confidence interval can be estimated as:

\[ \log wRR \pm 1.96 \times SE[\log wRR] \]

To compute the weighted mean of log RR we need a column of W_i log[RR_i]. For example, for study 1 we have:

\[ W_1 = \frac{1}{0.264} = 3.79 \]

and

\[ W_1 \log[RR_1] = 3.79 \times (-0.182) = -0.69 \]

Repeating the calculation for the other studies gives a new table:
Table 2c. Estimated weights (Wi)

Study (i)   log[RR]   Var[logRR]   Wi       Wi log[RRi]
1           -0.182    0.264          3.79    -0.69
2           -0.118    1.304          0.77    -0.09
3            0.065    0.079         12.61     0.82
4           -0.693    1.415          0.71    -0.49
5           -0.434    0.709          1.41    -0.61
6           -0.231    0.026         38.30    -8.86
7           -0.318    0.729          1.37    -0.44
8           -1.214    0.142          7.03    -8.54
9           -0.520    0.242          4.13    -2.15
10          -1.465    0.688          1.45    -2.13
11          -0.385    0.009        110.78   -42.63
12          -0.398    0.010         96.13   -38.23
13          -0.763    0.174          5.75    -4.39
Total                              284.24  -108.42

We have:

\[ \sum W_i = 3.79 + 0.77 + \dots + 5.75 = 284.24 \]

\[ \sum W_i \log[RR_i] = -0.69 - 0.09 + \dots - 4.39 = -108.42 \]

Therefore, the weighted mean of log[RR] is:

\[ \log wRR = \frac{\sum W_i \log[RR_i]}{\sum W_i} = \frac{-108.42}{284.24} = -0.38 \]

with variance:

\[ Var[\log wRR] = \frac{1}{284.24} = 0.0035 \]

and standard error:

\[ SE[\log wRR] = \sqrt{0.0035} = 0.06 \]

Therefore, the 95% confidence interval of log wRR is:

\[ \log wRR \pm 1.96 \times SE[\log wRR] = -0.38 \pm 1.96 \times 0.06 = -0.498 \text{ to } -0.265 \]

But we want to express the result on the original scale (that is, as a ratio), so the estimates above must be transformed back:
RR = exp(logwRR) = exp(-0.38) = 0.68
and the 95% confidence interval:
exp(-0.498) = 0.61 to exp(-0.265) = 0.77.
At this point we can say that mortality among patients treated with BB is 0.68 times that of (that is, 32% lower than) patients on placebo. In addition, because the 95% confidence interval does not include 1, we can also state that this difference is statistically significant.
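The pooling in step 3 is a weighted average with inverse-variance weights, which is easy to check in R. The following is a minimal sketch, not the author's code: it rebuilds log(RR) and its variance from the counts of example 2 (using the variance expression printed in step 2) and then applies the three formulas of step 3. All object names are illustrative.

```r
# Counts for the 13 trials: BB group (n1, d1) and placebo group (n2, d2)
n1 <- c(25, 9, 194, 25, 105, 320, 33, 261, 133, 232, 1327, 1990, 214)
d1 <- c(5, 1, 23, 1, 4, 53, 3, 12, 6, 2, 156, 145, 8)
n2 <- c(25, 16, 189, 25, 34, 321, 16, 84, 145, 134, 1320, 2001, 212)
d2 <- c(6, 2, 21, 2, 2, 67, 2, 13, 11, 5, 228, 217, 17)

log_rr  <- log((d1 / n1) / (d2 / n2))
var_lrr <- 1/d1 - 1/(n1 - d1) + 1/d2 - 1/(n2 - d2)  # as printed in step 2

w       <- 1 / var_lrr                    # weight = inverse of the variance
log_wrr <- sum(w * log_rr) / sum(w)       # weighted mean of log(RR)
se_wrr  <- sqrt(1 / sum(w))               # standard error of the weighted mean
ci_log  <- log_wrr + c(-1.96, 1.96) * se_wrr

# Back-transform to the ratio scale
pooled <- c(RR = exp(log_wrr), lower = exp(ci_log[1]), upper = exp(ci_log[2]))
round(pooled, 2)
```

Because the weights are reciprocals of the variances, the two largest trials (studies 11 and 12) carry most of the total weight, which is why the pooled estimate sits close to their individual relative risks.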


Step 4: estimate the indices of homogeneity and heterogeneity. As discussed in part (1) on the analysis of continuous variables, after estimating the average relative risk we need to examine the I² index. To estimate I², we need to compute W_i (log RR_i - log wRR)² for each study. For example, for study 1 we have:

\[ W_1 (\log RR_1 - \log wRR)^2 = 3.79 \times (-0.182 + 0.38)^2 = 0.1502 \]

Repeating the calculation for the other studies gives a new table:
Table 2d. Estimating the heterogeneity index (I²)

Study (i)   log[RRi]   Wi       Wi (log RRi - log wRR)²
1           -0.182       3.79    0.1502
2           -0.118       0.77    0.0533
3            0.065      12.61    2.5118
4           -0.693       0.71    0.0687
5           -0.434       1.41    0.0040
6           -0.231      38.30    0.8635
7           -0.318       1.37    0.0054
8           -1.214       7.03    4.8731
9           -0.520       4.13    0.0790
10          -1.465       1.45    1.7074
11          -0.385     110.78    0.0012
12          -0.398      96.13    0.0253
13          -0.763       5.75    0.8382
Total                  284.24   11.1811

Example 2 has k = 13 studies. Therefore:

\[ Q = \sum_{i=1}^{k} W_i (\log RR_i - \log wRR)^2 = 11.1811 \]

and

\[ I^2 = \frac{Q - (k - 1)}{Q} = \frac{11.18 - 12}{11.18} = -0.07 \]

Because I² < 0, we can set I² = 0. In other words, the variation in RR between studies is not statistically significant.
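The Q and I² calculation of step 4 can be sketched in R as follows. This is an illustrative check under the same assumptions as the hand calculation (counts of example 2, variance expression as printed in step 2); the object names are not from the text.

```r
# Counts for the 13 trials: BB group (n1, d1) and placebo group (n2, d2)
n1 <- c(25, 9, 194, 25, 105, 320, 33, 261, 133, 232, 1327, 1990, 214)
d1 <- c(5, 1, 23, 1, 4, 53, 3, 12, 6, 2, 156, 145, 8)
n2 <- c(25, 16, 189, 25, 34, 321, 16, 84, 145, 134, 1320, 2001, 212)
d2 <- c(6, 2, 21, 2, 2, 67, 2, 13, 11, 5, 228, 217, 17)

log_rr  <- log((d1 / n1) / (d2 / n2))
w       <- 1 / (1/d1 - 1/(n1 - d1) + 1/d2 - 1/(n2 - d2))
log_wrr <- sum(w * log_rr) / sum(w)

k      <- length(log_rr)
q      <- sum(w * (log_rr - log_wrr)^2)   # index of homogeneity
i2_raw <- (q - (k - 1)) / q               # can be negative when q < k - 1
i2     <- max(0, i2_raw)                  # negative values are set to 0

c(Q = q, I2_raw = i2_raw, I2 = i2)
```

Truncating at zero with max() mirrors the rule stated in the text: when Q is smaller than its degrees of freedom, I² is reported as 0.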


Step 5: assess the possibility of publication bias. As explained in section 1f, the most meaningful way to assess possible publication bias is a linear regression of log[RR] on the total sample size (N):

log[RRi] = a + bNi

based on the following table:

Study (i)   log[RRi]   Ni
1           -0.182       50
2           -0.118       25
3            0.065      383
4           -0.693       50
5           -0.434      139
6           -0.231      641
7           -0.318       49
8           -1.214      345
9           -0.520      278
10          -1.465      366
11          -0.385     2647
12          -0.398     3991
13          -0.763      426

We can estimate a and b as follows:

log[RRi] = -0.534 + 0.00003Ni

The estimate b = 0.00003 is not statistically significant (p = 0.782). We can therefore state that publication bias is negligible in this meta-analysis.
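The regression in step 5 is an ordinary least-squares fit, so it can be run with lm(). The sketch below is an illustration, not the author's code; it assumes the regression is unweighted and that Ni is the sum of the two group sizes, which is what the table of Ni values suggests. Object names are illustrative.

```r
# Counts for the 13 trials: BB group (n1, d1) and placebo group (n2, d2)
n1 <- c(25, 9, 194, 25, 105, 320, 33, 261, 133, 232, 1327, 1990, 214)
d1 <- c(5, 1, 23, 1, 4, 53, 3, 12, 6, 2, 156, 145, 8)
n2 <- c(25, 16, 189, 25, 34, 321, 16, 84, 145, 134, 1320, 2001, 212)
d2 <- c(6, 2, 21, 2, 2, 67, 2, 13, 11, 5, 228, 217, 17)

log_rr  <- log((d1 / n1) / (d2 / n2))
n_total <- n1 + n2                        # total sample size of each study

fit <- lm(log_rr ~ n_total)               # log[RRi] = a + b * Ni
co  <- summary(fit)$coefficients
print(co)                                 # estimates, standard errors, p-values
```

A slope close to zero with a large p-value means the estimated effect does not drift with study size, which is the pattern one expects in the absence of publication bias.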


The funnel plot likewise shows no sign of publication bias.

14.5.2 Analysis with R


The meta package provides the function metabin, which can be used to carry out a meta-analysis of binary variables such as the data in example 2 above. First we load the meta package (if this has not already been done) into the working environment, and then enter the data into a data frame:
library(meta)

# Data from example 2
n1 <- c(25, 9, 194, 25, 105, 320, 33, 261, 133, 232, 1327, 1990, 214)
d1 <- c(5, 1, 23, 1, 4, 53, 3, 12, 6, 2, 156, 145, 8)
n2 <- c(25, 16, 189, 25, 34, 321, 16, 84, 145, 134, 1320, 2001, 212)
d2 <- c(6, 2, 21, 2, 2, 67, 2, 13, 11, 5, 228, 217, 17)

# Create a data frame named bb
bb <- data.frame(n1, d1, n2, d2)

# Analyse with metabin and store the results in res
# (the meta version used for this book spelled the last argument meth="I")
> res <- metabin(d1, n1, d2, n2, data=bb, sm="RR", method="Inverse")
> res
       RR            95%-CI  %W(fixed)  %W(random)
1  0.8333  [0.2918; 2.3799]       1.26        1.26
2  0.8889  [0.0930; 8.4951]       0.27        0.27
3  1.0670  [0.6116; 1.8617]       4.47        4.47
4  0.5000  [0.0484; 5.1677]       0.25        0.25
5  0.6476  [0.1240; 3.3814]       0.51        0.51
6  0.7935  [0.5731; 1.0986]      13.08       13.08
7  0.7273  [0.1346; 3.9282]       0.49        0.49
8  0.2971  [0.1410; 0.6258]       2.49        2.49
9  0.5947  [0.2262; 1.5632]       1.48        1.48
10 0.2310  [0.0454; 1.1744]       0.52        0.52
11 0.6806  [0.5635; 0.8221]      38.81       38.81
12 0.6719  [0.5496; 0.8214]      34.31       34.31
13 0.4662  [0.2056; 1.0570]       2.07        2.07

Number of trials combined: 13

                         RR            95%-CI        z   p.value
Fixed effects model  0.6821  [0.6064; 0.7672]  -6.3741  < 0.0001
Random effects model 0.6821  [0.6064; 0.7672]  -6.3741  < 0.0001

Quantifying heterogeneity:
tau^2 = 0; H = 1 [1; 1.45]; I^2 = 0% [0%; 52.6%]

Test of heterogeneity:
  Q d.f. p.value
 11   12  0.5292

Method: Inverse variance method

The results from the fixed-effects and random-effects models once again give us evidence to conclude that beta-blockers are effective in reducing the risk of death.
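# Forest plot (the meta version used for this book drew it with plot(res, lwd=3))
> forest(res)

[Forest plot: relative risk with 95% confidence interval for studies 1 to 13; horizontal axis "Relative Risk" on a log scale from 0.05 to 10.00]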

***
In science generally, we have a long tradition of reviewing research evidence and reviewing current knowledge. But such reviews are usually qualitative, and because they are qualitative it is hard to know exactly what the quantitative differences between studies are. Meta-analysis gives us a quantitative means of systematizing the evidence. With a meta-analysis, we have the opportunity to:

- examine which studies have been carried out to address the question;
- see what the results of those studies were;
- systematize the clinical outcomes of interest;
- review the differences in characteristics between studies;
- decide how to combine the results; and
- communicate the results in a scientific manner.

The purpose of a meta-analysis, to repeat once more, is to estimate an average effect size after considering all currently available study results. Such a combined result helps us reach a more accurate and more reliable conclusion.

The two examples above will, we hope, help the reader understand the mechanics and meaning of a meta-analysis, and enable the reader to carry out such an analysis independently once the data are available. In fact, all of the calculations above can be done in software such as Microsoft Excel. Some specialized packages (SAS, for example) can also carry out these analyses. The computations in a meta-analysis are quite simple. The difficulty of a meta-analysis lies not in the computation but in the data behind it.

Meta-analysis is not without shortcomings. Researchers have a saying, "garbage in, garbage out": if the data used in the analysis are not of high quality, the result of the meta-analysis has no scientific value either. The most important issue in a meta-analysis is therefore the selection of the data and studies to analyse. This must be weighed with great care to ensure that the results are valid and scientifically sound.
References and notes:
[1] Glass GV. Primary, secondary, and meta-analysis of research. Educational Researcher 1976;5:3-8.
[2] Normand SL. Meta-analysis: formulating, evaluating, combining, and reporting. Stat Med. 1999;18(3):321-59.
[3] Higgins JPT, Thompson SG. Quantifying heterogeneity in a meta-analysis. Stat Med. 2002;21:1539-1558.
[4] Egger M, Davey Smith G, Schneider M, Minder C. Bias in meta-analysis detected by a simple, graphical test. Br Med J 1997;315:629-34.
[5] Tang JL, Liu JL. Misleading funnel plot for detection of bias in meta-analysis. J Clin Epidemiol. 2000;53(5):477-84.
[6] Peters JL, Sutton AJ, Jones DR, Abrams KR, Rushton L. Comparison of two methods to detect publication bias in meta-analysis. JAMA. 2006;295(6):676-80.
[7] Macaskill P, Walter SD, Irwig L. A comparison of methods to detect publication bias in meta-analysis. Stat Med. 2001;20:641-654.

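Summary of meta-analysis

For continuous variables

Group 1 (sample size, mean, standard deviation): \( n_{1i}, \bar{x}_{1i}, s_{1i} \); i = 1, 2, 3, ..., k
Group 2 (sample size, mean, standard deviation): \( n_{2i}, \bar{x}_{2i}, s_{2i} \)

Effect size (ES):
\[ d_i = \bar{x}_{2i} - \bar{x}_{1i} \]

Variance of \( d_i \):
\[ s_{d_i}^2 = \frac{(n_{1i} - 1) s_{1i}^2 + (n_{2i} - 1) s_{2i}^2}{n_{1i} + n_{2i} - 2} \left( \frac{1}{n_{1i}} + \frac{1}{n_{2i}} \right) \]

Standard error of \( d_i \):
\[ s_{d_i} = \sqrt{s_{d_i}^2} \]

Weight:
\[ W_i = \frac{1}{s_{d_i}^2} \]

Overall effect estimate:
\[ d = \sum_{i=1}^{k} W_i d_i \Big/ \sum_{i=1}^{k} W_i \]

Variance of d:
\[ s_d^2 = 1 \Big/ \sum_{i=1}^{k} W_i \]

95% confidence interval:
\[ d \pm 1.96\, s_d \]

Index of homogeneity:
\[ Q = \sum_{i=1}^{k} W_i (d_i - d)^2 \]

Index of heterogeneity:
\[ I^2 = \frac{Q - (k - 1)}{Q} \]

Checking for publication bias: linear regression \( d_i = a + b N_i \), where \( N_i \) is the total sample size of study i. Examine the statistical significance of b.

Estimate of the between-study variance:
\[ \tau^2 = \max\left\{ 0,\; \frac{Q - (k - 1)}{\sum_{i=1}^{k} W_i - \sum_{i=1}^{k} W_i^2 \Big/ \sum_{i=1}^{k} W_i} \right\} \]

For binary variables

Group 1 (sample size, number of events): \( n_{1i}, x_{1i} \); i = 1, 2, 3, ..., k
Group 2 (sample size, number of events): \( n_{2i}, x_{2i} \); i = 1, 2, 3, ..., k

Effect size (ES), expressed as the relative risk RR:
\[ RR_i = \frac{x_{2i} / n_{2i}}{x_{1i} / n_{1i}} \]

Transformation to the logarithmic scale:
\[ \theta_i = \log(RR_i) \]

Variance of \( \theta_i \):
\[ s_{\theta_i}^2 = \frac{1}{x_{1i}} - \frac{1}{n_{1i} - x_{1i}} + \frac{1}{x_{2i}} - \frac{1}{n_{2i} - x_{2i}} \]

Standard error of \( \theta_i \):
\[ s_{\theta_i} = \sqrt{s_{\theta_i}^2} \]

Weight:
\[ W_i = \frac{1}{s_{\theta_i}^2} \]

Overall effect estimate:
\[ \theta = \sum_{i=1}^{k} W_i \theta_i \Big/ \sum_{i=1}^{k} W_i \]

Variance of \( \theta \):
\[ s_\theta^2 = 1 \Big/ \sum_{i=1}^{k} W_i \]

95% confidence interval:
\[ \theta \pm 1.96\, s_\theta \]

Index of homogeneity:
\[ Q = \sum_{i=1}^{k} W_i (\theta_i - \theta)^2 \]

Index of heterogeneity:
\[ I^2 = \frac{Q - (k - 1)}{Q} \]

Checking for publication bias: linear regression \( \theta_i = a + b N_i \), where \( N_i \) is the total sample size of study i. Examine the statistical significance of b.

Estimate of the between-study variance:
\[ \tau^2 = \max\left\{ 0,\; \frac{Q - (k - 1)}{\sum_{i=1}^{k} W_i - \sum_{i=1}^{k} W_i^2 \Big/ \sum_{i=1}^{k} W_i} \right\} \]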

15
Design of experiments

The term "experiment" here covers not only activities in the laboratory but also broader investigations such as randomized clinical trials, studies that sample a population at a single point in time (also called cross-sectional studies), opinion polls, surveys, censuses, and so on. Even an economic policy can be regarded as an experiment: a social experiment.

An experiment that meets scientific standards must be designed systematically and objectively. For example, to find the prevalence of diabetes in a population we do not need to examine every individual in that population; we only need to select a number of representative individuals at random. However, if the number of representative individuals (the sample) is too small, the study will not give an accurate result; conversely, if the sample is too large, we waste money and resources unnecessarily. The goals of study design are therefore (i) to detect an effect of an intervention, and (ii) to use resources and funds optimally.

In the preceding chapters we became familiar with several models for data analysis. The results of those analyses have scientific value only when the data were collected by proper methods and the study was designed well. Statistical models cannot tell us about the quality of a study, because that is an aspect requiring careful appraisal by the researcher. Study design therefore plays a very important role in the success or failure of a scientific project. One could say that a study that is designed carefully and by proper methods is already 50% of the way to success. This chapter and the next discuss some basic concepts of study design and some commonly used designs.

15.1 Terminology
To make the research concepts easier to follow and absorb, we should first become familiar with, and learn to distinguish, some important terms used in designing a study.

Experimental unit: Depending on the field, the experimental unit may be a subject (such as a patient or a volunteer), a field plot, a product, a production process, and so on. The experimental unit is the object on which measurements are made directly. For example, in a study of the bitterness of coffee, the researcher may ask a group of consumers to taste several different coffees, and those coffees are the experimental units. In clinical studies, the researcher may select two groups of patients to compare the efficacy of two treatments, and in that case each patient is an experimental unit.

Factors: the interventions applied to the study subjects. A factor is also sometimes called an independent variable or an explanatory variable. In the clinical example just mentioned, the two treatments are the factor. Likewise, in a study of the performance of two rice varieties, the rice variety is regarded as the factor.

Treatment levels: the values of a factor. For example, if the two treatments are two drugs and each drug comes in 3 doses, then the dose is the treatment level. Or, in a sensory study, the researcher may ask consumers to taste the sweetness of a beer that is produced with three different recipes; the recipe is then the treatment level.

Block: In many studies, a set of treatments can be arranged into groups (or blocks). For example, in a sensory study of the bitterness of 3 coffees (A, B and C), the researcher may select a number of study subjects (consumers) and divide them into three blocks 1, 2 and 3 as follows:

             Block 1    Block 2    Block 3
Treatments   A, B, C    A, B, C    A, B, C

In this design, the individuals in every block test all 3 coffees, and the order A, B, C does not change between blocks. This design is also called a balanced complete block design.

Alternatively, the researcher may choose 2 coffees for each of the three blocks:

             Block 1    Block 2    Block 3
Treatments   A, B       B, C       A, C

In this design each block tests only 2 coffees, and the pair of coffees changes from block to block. This design is also called a balanced incomplete block design.

Balanced block designs are also used quite widely in clinical studies. For example, to test the efficacy of two drugs for treating osteoporosis, the researcher may select 100 patients and divide them into 5 blocks (20 people per block). Within each block, 10 people are treated with drug A and 10 with drug B. The allocation must be carried out completely at random to ensure the objectivity of the study.

Response variable: the variable that is affected by the factor. For example, in the sensory study of coffee bitterness, bitterness is the response variable; in the study of the efficacy of two osteoporosis treatments, bone mineral density is the response variable.
Example 1: A simple sensory experiment. To find out how consumers rate the sweetness of a soft drink, the researchers produce two soft drinks with recipes A and B. In the experiment, consumers taste the drink and score its sweetness (from 1 = not sweet to 10 = too sweet) on the scale below. The problem is to find a design that maximizes the information collected while meeting scientific standards.

[Rating scale: 1 (not sweet) to 10 (too sweet)]

Design 1: the researchers randomly invite n customers (n might be 15), let each customer taste both soft drinks, and analyse the difference in sweetness between the two products within each person.

Design 2: randomly select 2n people (or 30), then randomly divide them into 2 groups. Group 1 drinks the soft drink made with recipe A, and group 2 drinks the soft drink made with recipe B.

Design 3: randomly select n (or 15) customers; each customer is given both soft drinks, but the orders AB and BA are allocated at random. This design has 2 treatments (A and B) in each block. In other words, each customer is a block.

[Figure: the tasting order, AB or BA, randomly assigned to each customer]

Each of the designs above has advantages and drawbacks. First, in terms of resources and cost, design 2 requires twice as many study subjects as design 1, and is more expensive and more time-consuming.

Second, from a scientific point of view, design 2 requires the researcher to compare two independent groups, and the noise of this design will be higher than the noise of designs 1 and 3. Noise here can be measured by the variance. To understand this important idea, we need to review a basic statistical concept. Let the sweetness scores of the two groups be x1 and x2, and let the variances of sweetness in the two groups be \( s_1^2 \) and \( s_2^2 \). Because in design 2 the two groups are independent (the customers who taste product A are not the customers who taste product B), the variance of the difference between the two products x1 - x2 (denoted \( s_{x_1 - x_2}^2 \)) is:

\[ s_{x_1 - x_2}^2 = s_1^2 + s_2^2 \qquad [1] \]

If the variances of the two groups are equal, \( s_1^2 = s_2^2 = s^2 \), the variance of the difference is simply:

\[ s_{x_1 - x_2}^2 = 2 s^2 \]

But in design 1, because each customer tastes both products, x1 and x2 are not independent of each other, and the variance of the difference is:

\[ s_{x_1 - x_2}^2 = s_1^2 + s_2^2 - 2\,\mathrm{cov}(x_1, x_2) \qquad [2] \]

Here cov(x1, x2) is the covariance, which reflects the correlation between x1 and x2. Because the same person scores both products, the correlation between x1 and x2 is expected to be positive (greater than 0), and in that case the variance in formula [2] is smaller than the variance in formula [1].

In other words, the noise of designs 1 and 3 is smaller than the noise of design 2. Designs 1 and 3 are therefore preferable to design 2.
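The relation between formulas [1] and [2] can be illustrated with simulated scores. The sketch below is purely hypothetical: the per-customer "taste" effect, the noise levels and the 0.5-point product difference are invented for illustration and do not come from any study. It shows that the variance of the within-person differences equals the expression in formula [2].

```r
set.seed(42)
n <- 15

# Hypothetical model: each customer has a personal tendency to score high or low
taste <- rnorm(n, mean = 5, sd = 1.5)
x1 <- taste + rnorm(n, sd = 0.8)          # score given to product A
x2 <- taste + 0.5 + rnorm(n, sd = 0.8)    # score given to product B by the same person

v_formula1 <- var(x1) + var(x2)                      # formula [1], independent groups
v_formula2 <- var(x1) + var(x2) - 2 * cov(x1, x2)    # formula [2], paired scores
v_direct   <- var(x1 - x2)                           # variance of the differences

c(formula1 = v_formula1, formula2 = v_formula2, direct = v_direct)
```

The shared "taste" term is what induces the positive covariance; the larger it is relative to the noise, the greater the gain from letting each customer score both products.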
Third, designs 1 and 3 are alike in that each customer tastes both products, but in design 3 the order of the products is varied at random (rather than fixed, as in design 1). Randomly switching the order from A then B to B then A can also be regarded as a form of blocking, so the researcher can control one more important source of variation.

Of the three designs, then, design 3 can be said to be the best. But of course this still depends on the characteristics of the product and on practical circumstances. For many products designs 1 and 3 cannot be applied, for safety reasons or because of the Hawthorne effect (discussed in the section below).

15.2 Three important principles of a study

A scientific study must observe three principles: randomization, replication, and blocking.

Why randomize? In many studies we must draw a sample from a population. One of the key requirements of sampling is that the sample be representative of the population. Suppose, for example, that in a population of 1 million people 50% are men and 20% have more than 12 years of schooling. If we select 100 people from this population, the sample is regarded as representative when it contains about 50 men and about 20 people with more than 12 years of schooling. Random sampling is the best way to ensure this representativeness.

Within a group of subjects, randomization also tends to balance characteristics between the treatment groups. Suppose we have recruited 50 volunteers willing to take part in a sensory study of the sourness of 2 beverages (in other words, we have 2 groups of 25 people each). These 50 people naturally differ in many personal characteristics, such as age, sex, education, personal preferences, and so on, and all of these may influence how they perceive the product. To balance these characteristics between the two groups, the most objective approach is to divide the subjects into two groups at random.

Because most statistical models rest on the assumption that subjects were selected at random from a population, randomization also ensures the validity of the results of the analysis.

One of the "gold standards" of science is that research findings must be repeatable, or reconfirmable. In other words, if a study has been published by one scientist, another researcher who repeats the study with the methods and under the conditions described should obtain similar results. This is an extremely important criterion for distinguishing science from pseudoscience. An observation that has been repeated many times is highly reliable, and high reliability gives the study's conclusions greater value.

Randomization can balance the characteristics of study subjects across treatments, but only when the number of subjects is reasonably large. When the number of subjects is small, randomization is not very effective. For example, with 6 subjects divided into 2 groups, randomization may put 4 subjects in group A and 2 subjects in group B. Another way to ensure balance is therefore blocking. In the case above, we can form 3 blocks (2 subjects per block) and randomize within each block.
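The 6-subject example can be tried directly in R. The sketch below is illustrative only; it assumes "simple randomization" means assigning each subject to A or B independently with equal probability, and "blocking" means randomizing the order A, B within consecutive pairs of subjects. The object names are hypothetical.

```r
set.seed(6)

# Simple randomization: each of the 6 subjects gets A or B independently,
# so the two groups need not end up the same size
simple <- sample(c("A", "B"), size = 6, replace = TRUE)
table(simple)

# Blocked randomization: 3 blocks of 2 subjects; within each block
# one subject gets A and the other gets B, in random order
block   <- rep(1:3, each = 2)
blocked <- unlist(lapply(1:3, function(b) sample(c("A", "B"))))

data.frame(subject = 1:6, block = block, treatment = blocked)
table(blocked)   # always 3 subjects per treatment
```

With blocking, the balance between A and B is guaranteed by construction rather than left to chance, which is the point made in the paragraph above.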
Blocking does not affect the data-analysis stage, because we have no interest in studying the response within each block. Blocking matters, and has value, only at the design stage.

15.3 Placebo effect, Hawthorne effect, and blinding

In experiments involving people and patients, two further factors can influence the results of a study: the placebo and blinding. To understand these influences, consider the following example. To find out whether the drug alendronate is effective in preventing fractures, the researchers divide 100 patients into two treatment groups: group 1 has 50 patients who are given real alendronate, and group 2 also has 50 patients, who are given a dummy alendronate (called a placebo). The two preparations look exactly alike, so that neither patients nor doctors can tell which is the placebo and which is the real drug.

An experiment like the one just described raises several difficult issues. The first is the placebo effect. Experience from many clinical studies shows a general tendency for patients to judge that their health has improved simply because they are being treated (even when the treatment is a placebo). This psychological factor is usually called the placebo effect. The placebo effect can account for about 35% of the results of clinical studies, particularly for treatments for pain, asthma, depression, bowel disease, and high blood pressure. For this reason, evaluating the efficacy of a treatment usually requires a control (placebo) group, so that the difference between the two treatment groups shows whether an outcome is due to the real drug or to the placebo.

The second factor is the Hawthorne effect. People in general are highly adaptable, and this adaptability causes considerable difficulty for scientific research. For example, when we ask a group of consumers to taste the bitterness of coffee several times, on the first occasion the consumers, not yet used to the bitterness, may find it very bitter and give a high score, but by the second or third occasion they have become used to it and give lower scores. Or, in clinical research, if patients know they are being monitored, they will try to please the doctor, and their objectivity may be compromised. The term for this phenomenon is the Hawthorne effect.

The third factor is the subjectivity of the researcher. If the doctor knows whether a patient is taking the real drug or the placebo, the way the doctor assesses the patient may influence the study results. In rigorous clinical studies, therefore, the researcher is not told whether a patient is receiving the drug or the placebo; this practice is called blinding. Blinding must be maintained for both patients and doctors. In other words, neither the patient nor the doctor knows whether the patient belongs to the treatment group or the placebo group.

However, not every clinical study can maintain blinding in this way. In a study of the efficacy of a surgical procedure, for example, patients certainly know whether they have had real surgery (there is no such thing as "sham surgery"). Moreover, for reasons of medical ethics, not every study can use a placebo. If we know that a disease is life-threatening and that an effective drug exists, there is no justification for giving patients a placebo. In such cases the researcher must think carefully and develop a design that does not violate medical ethics yet still meets scientific standards.

15.4 Some examples of the principles of study design

To absorb the principles above, let us look at the following study of the efficacy of vitamin C for the common cold. There is a hypothesis that vitamin C can prevent colds. The question is how we should design a study to test this hypothesis so that it meets scientific standards. Suppose we have 50 volunteers; we could choose one of the following designs:

Design 1. Give all 50 people vitamin C for 6 months and record the number of colds during that time. The result shows that after 6 months of treatment, the mean frequency of colds is 1.4 per subject.

Design 2. Divide the 50 people into 2 groups, men and women. Both groups are treated with vitamin C for 6 months. The result shows that after 6 months of treatment, the mean frequency of colds is 1.4 per subject in men and 1.9 per subject in women.

Design 3. Divide the 50 people into 2 groups at random. Group 1, of 25 people, is treated with vitamin C for 6 months. Group 2 is not treated but is still followed for 6 months. The result shows that after 6 months, the mean frequency of colds is 1.4 per subject in group 1 and 1.9 per subject in the control group.

Design 4. Ask a pharmaceutical company to produce 50 boxes of vitamin C and 50 boxes of a vitamin C placebo. Divide the 50 people into 2 groups at random: group 1, of 25 people, is treated with vitamin C; group 2 receives the placebo. Both groups are followed for 6 months. The result shows that after 6 months, the mean frequency of colds is 1.4 per subject in group 1 and 1.4 per subject in the control group.

[Diagram: 50 people randomized to group 1 (vitamin C) or group 2 (placebo); the frequency of colds is compared between the two groups]

Design 5. The same as design 4, except that we block the two treatment groups by sex. Sex may influence the risk of catching a cold (men are usually less careful than women), so we divide the 50 people into two blocks, men and women. Each block is randomized into the two treatment groups, so that men and women are balanced across the groups. Both groups are followed for 6 months. After 6 months, the mean frequency of colds by treatment group and sex can be summarized as follows:

         Group 1 (vitamin C)   Group 2 (control)
Men      1.4                   1.9
Women    1.2                   1.5

[Diagram: 50 people split into men and women; within each sex, subjects are randomized to vitamin C or placebo, and the frequency of colds is compared]

Given the design principles above, what is wrong with these designs? Here are the main comments:

The flaw in design 1 is that it has no control group, so the result cannot be compared with anything and is very hard to interpret. A mean frequency of 1.4 per subject means nothing on its own.

Design 2 has a comparison group, but because the groups are defined by sex, we cannot say whether the difference between the mean frequencies of 1.4 and 1.9 per subject is due to sex or to vitamin C.

Design 3 has a control group, but its flaw is the lack of blinding, because the study subjects know whether or not they are receiving the drug. The treated group may become complacent and take fewer precautions (believing that vitamin C protects them), and this can affect the results. This result is therefore also hard to interpret.

Design 4 has no flaws. It has a control group, and the study subjects are allocated at random and blinded, which ensures that the comparison is scientifically valid.

Design 5 also has no flaws and is better than design 4, because the influence of sex is controlled by randomizing within each sex.
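The allocation in design 5 (randomize separately within men and within women) can be sketched in R as follows. The split of the 50 volunteers into 26 men and 24 women is a made-up assumption for illustration, as are the object names.

```r
set.seed(50)

# Hypothetical volunteers: 26 men and 24 women (assumed, not from the text)
volunteers <- data.frame(id  = 1:50,
                         sex = rep(c("M", "F"), times = c(26, 24)),
                         arm = NA_character_,
                         stringsAsFactors = FALSE)

# Randomize within each sex so both arms get the same share of men and women
for (s in unique(volunteers$sex)) {
  idx  <- which(volunteers$sex == s)
  arms <- rep(c("Vitamin C", "Placebo"), length.out = length(idx))
  volunteers$arm[idx] <- sample(arms)     # random order within this sex
}

allocation <- table(volunteers$sex, volunteers$arm)
allocation
```

Because the shuffle happens inside each sex, an imbalance such as "mostly men on vitamin C" cannot occur, so sex cannot be confounded with treatment.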

15.5 Experiments with one factor (single-factor designs)

As the name implies, single-factor experiments have only one factor. Most randomized controlled clinical trials (in which patients are divided at random into two treatment groups) are a form of this design. However, several variants of this design can make use of blocking. The following example gives an idea of the value of blocking in single-factor experiments.

Example 2. An agricultural research group wants to study the effect of fertilizer on the growth of rice. Three doses of urea are used (low, medium and high). The group selects 6 sites (A, B, C, D, E and F), and each site has 3 plots for the experiment (1, 2, 3). Here are some designs the group could choose:

Design 1 - CRD (completely randomized design): Here the group has 6 x 3 = 18 plots for the experiment and 3 treatments to allocate. In other words, each treatment will be applied to 6 plots. In a CRD, the 3 treatments are randomized over all 18 plots, and the result might be:

Site   Plot 1   Plot 2   Plot 3
A      Low      High     Low
B      Medium   Medium   High
C      High     Medium   Low
D      Medium   Low      High
E      Medium   Low      Medium
F      Low      High     High

In this design, because of the random allocation, a site may happen to receive the low dose twice and the high dose once (as at site A). As a result, a comparison between two treatments, such as low and high, must be adjusted for the variation between sites.

Design 2 - RCB (randomized block design): In this design each site is treated as a block, and within each site every plot receives a different treatment, so the design is completely balanced. The design guarantees that each treatment appears exactly once at each site, as follows:

Site   Plot 1   Plot 2   Plot 3
A      Low      High     Medium
B      Medium   Low      High
C      High     Medium   Low
D      Medium   Low      High
E      High     Low      Medium
F      Low      High     Medium

Design 3 - IBD (incomplete block design): In this design the researcher may need only 2 plots per site, with one treatment applied to each plot, as follows:

Site   Plot 1   Plot 2
A      Low      High
B      Medium   Low
C      High     Medium
D      Medium   Low
E      High     Low
F      Low      High

The methods for analysing the results of these designs were presented in chapter 11.
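The difference between the CRD and RCB allocations of example 2 is easy to see when both are generated in R. This is a minimal sketch with illustrative object names: the CRD shuffles the 18 treatment labels over all plots, whereas the RCB shuffles the three doses separately within each site.

```r
set.seed(2006)
sites <- LETTERS[1:6]
doses <- c("Low", "Medium", "High")
plots <- paste("Plot", 1:3)

# CRD: each dose appears 6 times, shuffled over all 18 plots
crd <- matrix(sample(rep(doses, each = 6)), nrow = 6,
              dimnames = list(sites, plots))

# RCB: within each site (block) the three doses are placed in random order
rcb <- t(sapply(sites, function(s) sample(doses)))
colnames(rcb) <- plots

crd
rcb
```

In the CRD a site can receive the same dose more than once; in the RCB every row contains Low, Medium and High exactly once, so site-to-site variation cancels out of the dose comparisons.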

15.6 Experiments with two factors (two-factor designs)

The designs presented in the previous section are intended to evaluate the effect of a single factor. In many cases the researcher wants to evaluate the effect of two or more factors, and the designs above cannot be used. For example, when the researcher wants to analyse the effect of light (high or low) and humidity (dry or wet) on the growth of seedlings in a greenhouse, designs with two factors must be considered carefully.

Design 1 - CRD. We want to investigate the effect of temperature (low and high), material (A and B), and production method (mechanical and chemical) on the strength of paper. The combinations of factors might be as follows:

Treatment group   Temperature   Material   Method
1                 Low           A          Mechanical
2                 High          A          Mechanical
3                 Low           B          Mechanical
4                 Low           A          Chemical

With this design, we can analyse the effect of temperature by comparing strength between groups 1 and 2. The effect of material can be compared between groups 1 and 3. The effect of method can be compared between groups 1 and 4.

These comparisons are valid only on condition that the effects of the factors are additive. In other words, the comparisons are valid only if the effect of one factor does not depend on the other factors, for instance if the effect of temperature does not depend on the material or the method. If this assumption does not hold, the comparisons may be biased and wrong.

Design 2 - Factorial design. Another design, which gives more information than the one above and allows us to analyse the interaction between factors, is the factorial design. In the case above we have 3 factors, each with 2 levels, so in total there are 2³ = 8 groups, as follows:

Treatment group   Temperature   Material   Method
1                 Low           A          Mechanical
2                 High          A          Mechanical
3                 Low           B          Mechanical
4                 High          B          Mechanical
5                 Low           A          Chemical
6                 High          A          Chemical
7                 Low           B          Chemical
8                 High          B          Chemical

With this balanced design we can easily estimate the effect of each factor:

- Effect of temperature: compare groups 1, 3, 5, 7 with 2, 4, 6, 8;
- Effect of material: compare groups 1, 2, 5, 6 with 3, 4, 7, 8;
- Effect of method: compare groups 1, 2, 3, 4 with 5, 6, 7, 8.

In addition, interaction effects can be estimated by comparing combinations of groups. For example, to find out whether the effect of temperature depends on the production method, we can compare groups 1+3 with 2+4, and 5+7 with 6+8.
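The eight treatment combinations of the 2 x 2 x 2 factorial design can be generated with expand.grid(), which varies its first argument fastest. The sketch below is illustrative; the column names and the English level labels are choices made here, not taken from the text.

```r
# All combinations of three two-level factors (2^3 = 8 treatment groups)
design <- expand.grid(temperature = c("Low", "High"),
                      material    = c("A", "B"),
                      method      = c("Mechanical", "Chemical"),
                      stringsAsFactors = FALSE)
design$group <- seq_len(nrow(design))
design[, c("group", "temperature", "material", "method")]

# Groups entering each main-effect comparison
high_temp  <- design$group[design$temperature == "High"]
material_b <- design$group[design$material == "B"]
chemical   <- design$group[design$method == "Chemical"]
list(high_temp = high_temp, material_b = material_b, chemical = chemical)
```

With this ordering the high-temperature groups are the even-numbered ones, material B occupies groups 3, 4, 7 and 8, and the chemical method groups 5 to 8, which are the sets contrasted in the comparisons listed above.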
Design 3 - Split-plot design. To investigate the effect of 3 soybean varieties (A, B and C) and two fertilizers (P1 and P2), the researchers compare the yield of soybeans grown under the 6 resulting treatment conditions. If each treatment condition is replicated 2 times, the study needs 2 x 6 = 12 plots.

One way to design this study is the factorial design described above, but that design can be difficult in practice. An easier and more practical alternative is the split-plot design. This design requires two rounds of random allocation. First, the two fertilizers are allocated at random to 4 whole plots, as follows:

Whole plot    1    2    3    4
Fertilizer    P1   P2   P2   P1

In the second step, the three varieties are allocated at random within each whole plot, and the result might look like this:

Whole plot    1    2    3    4
Fertilizer    P1   P2   P2   P1
Subplot 1     B    C    A    C
Subplot 2     A    B    C    A
Subplot 3     C    A    B    B
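The two rounds of randomization in the split-plot design can be sketched in R as follows. The object names (`field`, `whole_plot`, `subplot`) are illustrative and not from the text; the layout produced will differ from the printed one because the allocation is random.

```r
set.seed(15)

# Round 1: allocate the two fertilizers to 4 whole plots (2 plots each)
fertilizer <- sample(rep(c("P1", "P2"), each = 2))

# Round 2: within each whole plot, place the 3 varieties in random order
variety <- unlist(lapply(1:4, function(i) sample(c("A", "B", "C"))))

field <- data.frame(whole_plot = rep(1:4, each = 3),
                    fertilizer = rep(fertilizer, each = 3),
                    subplot    = rep(1:3, times = 4),
                    variety    = variety)
field

# Each fertilizer x variety combination should appear exactly twice
table(field$fertilizer, field$variety)
```

The fertilizer is the whole-plot factor (changed only between whole plots), while the variety is the subplot factor, which is what makes this layout easier to run in the field than a fully randomized factorial.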

Design 4 - Latin square. An oil company wants to compare the efficiency (measured in kilometres per litre) of 4 oils (A, B, C and D). The company has 4 drivers and 4 makes of car. Because of the variation between drivers and between makes of car, these two factors must be controlled in the design. The best design for this study is the Latin square. In this design, the 4 oils are allocated at random to each driver and make of car, as follows:

         Make of car
Driver   Ford   Toyota   Honda   Nissan
1        D      B        C       A
2        B      C        A       D
3        C      A        D       B
4        A      D        B       C

Thus driver 1 will drive the Ford with oil D, then the Toyota with oil B, the Honda with oil C and the Nissan with oil A, and so on. With this design the company can analyse the additive effect of each oil, or analyse the interaction between oil and make of car, or between oil and driver. The analysis of data from a Latin square design is described in detail in chapter 11 (11.7).
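A Latin square such as the one above can be produced in R by starting from a cyclic square and shuffling its rows and columns. This is one simple construction, offered as a sketch under that assumption; it is not necessarily how the square in the text was generated, and the object names are illustrative.

```r
set.seed(4)
k    <- 4
oils <- LETTERS[1:k]
cars <- c("Ford", "Toyota", "Honda", "Nissan")

# Cyclic k x k square: entry (i, j) is ((i + j) mod k) + 1
base <- outer(0:(k - 1), 0:(k - 1), function(i, j) (i + j) %% k + 1)

# Permuting rows and columns keeps the Latin property
idx <- base[sample(k), sample(k)]

square <- matrix(oils[idx], nrow = k, ncol = k,
                 dimnames = list(paste("Driver", 1:k), cars))
square
```

Every oil appears once in each row (driver) and once in each column (make of car), so differences between drivers and between cars are balanced out of the oil comparison.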

15.7 Randomization

In all the designs above, a key element is allocating the study subjects to the treatments at random (randomization, for short). Suppose we have 8 subjects (patients, say) to be divided among three treatment groups T1, T2 and T3, where groups T1 and T3 each need 3 subjects and group T2 needs 2:

T1      T2      T3
n=3     n=2     n=3

How do we randomize? We can proceed as follows:

First, list the treatment labels for the 8 subjects:

T1  T1  T1  T2  T2  T3  T3  T3

Next, use the function sample to draw at random (sample(1:8) produces a random ordering of the numbers 1 to 8):
> sample(1:8)
[1] 7 2 5 4 1 8 6 3

Pairing the two sequences gives:

T1  T1  T1  T2  T2  T3  T3  T3
7   2   5   4   1   8   6   3

In other words, subjects 7, 2 and 5 receive treatment T1, subjects 4 and 1 receive T2, and subjects 8, 6 and 3 receive T3.
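The randomization steps can be combined into a few lines of R. This is a minimal sketch; the object names (treatments, allocation) and the seed are illustrative, so the permutation drawn will differ from the one printed in the text.

```r
# Randomly allocate 8 subjects to T1 (n=3), T2 (n=2) and T3 (n=3).
set.seed(2006)

# Step 1: the list of treatment labels, in fixed order.
treatments <- rep(c("T1", "T2", "T3"), times = c(3, 2, 3))

# Step 2: a random permutation of the subject numbers 1..8.
subjects <- sample(1:8)

# Step 3: pair the two sequences and sort by subject for easy lookup.
allocation <- data.frame(subject = subjects, treatment = treatments)
allocation <- allocation[order(allocation$subject), ]

print(allocation, row.names = FALSE)
```

Fixing the seed makes the allocation reproducible, which is useful when the randomization list has to be documented.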
***
Every scientific study follows a nearly invariant sequence: stating a hypothesis, designing the study, collecting data, analyzing the data, and reporting the results. Data analysis is therefore one of the last stages of a study (before interpreting the results and writing the scientific report). The stages are closely linked: if one stage goes wrong, the next goes wrong too. Borrowing the old saying "Vạn sự khởi đầu nan" (every beginning is hard), one can say that when the study design is wrong, the results of the analysis mean nothing. All methods of data analysis give good results only when the study has been designed correctly and appropriately. Carefully weighing the candidate designs against practical circumstances is therefore essential to scientific research.


16
Sample size estimation

A study is usually based on a sample. One of the most important questions before starting a study is how many samples, or how many subjects, it needs. "Subject" here means the basic unit of a study: patients, volunteers, field plots, plants, devices, and so on. Estimating the number of subjects needed is extremely important, because it can decide whether the study succeeds or fails. If there are too few subjects, the conclusions drawn from the study will be imprecise, or no conclusion can be drawn at all. Conversely, if there are many more subjects than necessary, resources, money and time are wasted. The key task before a study is therefore to estimate a number of subjects that is just sufficient for its aims. That number depends on three main factors:

- the error the researcher is willing to accept, namely the type I and type II errors;
- the variability of the measurement, namely its standard deviation;
- the size of the difference or effect the researcher wants to detect.

Without figures for these three factors, the sample size cannot be estimated. This chapter discusses each of them.

16.1 The concept of power

Statistics is a scientific method whose purpose is to discover, or search for, things that can be grouped under the label "unknown". The unknown here means phenomena we cannot observe, or can observe only incompletely. It may be an unknown quantity (such as the mean height of Vietnamese people, or the weight of a molecule), the efficacy of a treatment, a gene that makes leaves green, people's preferences, and so on. We can measure height, or run a trial to learn whether a drug works, but such studies are carried out on a group of subjects, not on the entire population.
At the simplest level, the unknown takes one of two forms: it exists or it does not. For example, a treatment is or is not effective against bone fracture; customers do or do not like a soft drink. Because no one knows the phenomenon completely, we must state hypotheses. The simplest are the null hypothesis (the phenomenon does not exist, denoted H-) and the alternative hypothesis (the phenomenon exists, denoted H+).
We use statistical tests such as the t, F, z and χ² tests to assess the plausibility of a hypothesis. The result of a statistical test can be reduced to two values: statistically significant or not statistically significant. Statistical significance, as discussed in Chapter 7, is usually based on the P value: if P < 0.05 we call the result statistically significant; if P > 0.05 we call it not significant. One can think of "significant" and "not significant" as "signal" and "no signal". Let T+ denote a significant test result and T- a non-significant one.
Consider a concrete example. To learn whether the drug risedronate is effective in treating osteoporosis, we conduct a study with 2 groups of patients (one treated with risedronate and one given placebo only). We follow the patients, record fractures, estimate the fracture rate in each group, and compare the two rates with a statistical test. The test result is either significant (P < 0.05) or not (P > 0.05). Remember that we do not know whether risedronate truly prevents fractures; we can only pose a hypothesis H. Considering the hypothesis together with the test result, there are four situations:
(a) Hypothesis H is true (risedronate is effective) and the test is significant (P < 0.05);
(b) Hypothesis H is true, but the test is not significant (P > 0.05);
(c) Hypothesis H is false (risedronate is not effective), but the test is significant (P < 0.05);
(d) Hypothesis H is false and the test is not significant (P > 0.05).
Situations (a) and (d) pose no problem, because the test result agrees with reality. In situations (b) and (c) we make an error, because the test result disagrees with the truth of the hypothesis. Statistical terminology names these situations as follows:

- The probability of situation (b) is called the type II error and is usually denoted β.
- The probability of situation (a) is called power. In other words, power is the probability that the test gives p < 0.05 given that hypothesis H is true: power = 1 - β.
- The probability of situation (c) is called the type I error (or significance level) and is usually denoted α. In other words, α is the probability that the test gives p < 0.05 given that hypothesis H is false.
- The probability of situation (d) is not of direct concern and has no special term, though it can be called the true negative.

The 4 situations can be summarized in Table 1:

Table 1. Possible outcomes when testing a scientific hypothesis

Test result                    Hypothesis H true            Hypothesis H false
                               (drug is effective)          (drug is not effective)
Significant (p < 0.05)         True positive (power)        Type I error
                               1 - β = P(s | H+)            α = P(s | H-)
Not significant (p > 0.05)     Type II error                True negative
                               β = P(ns | H+)               1 - α = P(ns | H-)

Note: s in this table stands for "significant" and ns for "non-significant"; H+ means the hypothesis is true and H- that it is false. The 4 situations can thus be written as conditional probabilities: power = 1 - β = P(s | H+); β = P(ns | H+); and α = P(s | H-).
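The conditional probabilities α = P(s | H-) and power = P(s | H+) can be made concrete by simulation. The sketch below assumes a hypothetical two-group study (64 subjects per group, standard deviation 1, true difference 0.5 under H+); these numbers, the seed and the function name reject are illustrative choices, not values from the text.

```r
# Estimate type I error and power by simulating many two-group studies.
set.seed(123)

n_sim <- 2000   # number of simulated studies
n     <- 64     # subjects per group (illustrative)

# Run one study with true mean difference `delta`; TRUE if p < 0.05.
reject <- function(delta) {
  x <- rnorm(n, mean = 0,     sd = 1)
  y <- rnorm(n, mean = delta, sd = 1)
  t.test(x, y)$p.value < 0.05
}

# H- is true (no difference): proportion of significant results estimates alpha.
alpha_hat <- mean(replicate(n_sim, reject(0)))

# H+ is true (difference 0.5): proportion of significant results estimates power.
power_hat <- mean(replicate(n_sim, reject(0.5)))

cat("Estimated type I error:", alpha_hat, "\n")
cat("Estimated power       :", power_hat, "\n")
```

With these settings the estimates should fall close to 0.05 and 0.80, the values a power calculation gives for this configuration.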

16.2 Hypothesis testing and medical diagnosis

The explanations above may still seem abstract. One way to illustrate power and the P value is through medical diagnosis. Indeed, scientific research and statistical inference can be likened to the process of diagnosing a disease. At first we do not know whether the patient has the disease, so we collect information (medical history, lifestyle, habits, etc.) and run tests (X-ray, ultrasound, blood and urine analysis, etc.) to reach a conclusion.
There are two hypotheses: the patient does not have the disease (H-) and the patient has the disease (H+). At the simplest level, a test result is either positive (+ve) or negative (-ve). Diagnosis also involves 4 situations, discussed below; to make the issue clearer, consider a concrete example.
In cancer diagnosis, the standard way to know for certain whether cancer is present is biopsy (surgically removing tissue and examining it under a microscope). But biopsy is an invasive procedure, so it cannot be applied routinely to everyone. Instead, medicine has developed non-invasive tests for cancer, such as X-ray or blood tests. The result of an X-ray or blood test can be summarized in two values: positive (+ve) or negative (-ve).
No indirect test, however sophisticated, is perfect. Some people test positive but do not have cancer, and some test negative but do have cancer. We therefore have four possibilities:

- The patient has cancer and the test is positive. This is a true positive (its probability is the sensitivity);
- The patient does not have cancer but the test is positive. This is a false positive;
- The patient does not have cancer and the test is negative. This is a true negative (its probability is the specificity);
- The patient has cancer but the test is negative. This is a false negative.

These 4 situations can be summarized in Table 2:

Table 2. Possible outcomes in medical diagnosis: test result and disease status

                                  Disease status
Test result          Disease present                 Disease absent
+ve (positive)       True positive (sensitivity)     False positive
-ve (negative)       False negative                  True negative (specificity)

At this point the parallel between medical diagnosis and statistical testing is apparent. The true-positive rate in diagnosis corresponds to power in research. The false-positive probability in diagnosis corresponds to the type I error α, the threshold against which the P value is judged in scientific inference. Table 3 shows the correspondence:

Table 3. Correspondence between medical diagnosis and scientific inference

Medical diagnosis                        Testing a scientific hypothesis
Diagnosing a disease                     Testing a scientific hypothesis
Disease status (present or absent)       Scientific hypothesis (H+ or H-)
Diagnostic test                          Statistical test
Test result +ve                          P < 0.05, statistically significant
Test result -ve                          P > 0.05, not statistically significant
True positive (sensitivity)              Power; 1 - β; P(s | H+)
False positive                           Type I error; P value; α; P(s | H-)
False negative                           Type II error; β = P(ns | H+)
True negative (specificity)              True negative; 1 - α = P(ns | H-)

Just as medical tests are never perfect, statistical tests are subject to error, so research results always carry uncertainty (like the uncertainty in a medical diagnosis). The task is to design the study so that the type I and type II errors are as low as possible.

16.3 Information needed to estimate sample size

As noted at the start of this chapter, estimating the number of subjects needed for a study requires 3 pieces of information: the type I and type II error probabilities, the variability of the measurement, and the effect size.

- Error probabilities. A study usually accepts a type I error of about 1% or 5% (α = 0.01 or 0.05) and a type II error of about β = 0.1 to β = 0.2 (that is, power from 0.8 to 0.9).
- Variability is the standard deviation of the measurement on which the analysis is based. For instance, in a study of hypertension the researcher needs the standard deviation of blood pressure. We denote the variability by σ.
- Effect size, in a study comparing two groups, is the mean difference between the groups that the researcher wants to detect. For example, the researcher may hypothesize that patients treated with drug A have blood pressure 10 mmHg lower than the placebo group; here 10 mmHg is the effect size. We denote the effect size by Δ.

A study may involve one group of subjects or several, and the sample size calculation depends on which.
For one group of subjects, the number of subjects (n) needed can be calculated by hand as:

$$ n = \frac{C}{(\Delta/\sigma)^2} \qquad [1] $$

For two groups of subjects, the number of subjects (n) needed in each group is:

$$ n = \frac{2C}{(\Delta/\sigma)^2} \qquad [2] $$

where the constant C is determined by the type I and type II error probabilities (or power) as follows:

Table 4. The constant C for given type I and type II error probabilities

α        β = 0.20            β = 0.10            β = 0.05
         (power = 0.80)      (power = 0.90)      (power = 0.95)
0.10     6.15                8.53                10.79
0.05     7.85                10.51               13.00
0.01     13.33               16.74               19.84
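Formulas [1] and [2] are easy to wrap in R functions. The sketch below assumes that the constant is C = (z_{α/2} + z_β)², computed from standard normal quantiles; the text does not state this formula, so treat it as an assumption. It reproduces the α = 0.05 entries of the table (7.85 and 10.51), while some other printed entries may differ slightly from what it returns. The function names are illustrative.

```r
# Constant C, assumed here to be (z_{alpha/2} + z_beta)^2.
sample_size_constant <- function(alpha, power) {
  (qnorm(1 - alpha / 2) + qnorm(power))^2
}

# Formula [1]: one group of subjects.
n_one_group <- function(delta, sd, alpha = 0.05, power = 0.80) {
  sample_size_constant(alpha, power) / (delta / sd)^2
}

# Formula [2]: two groups; the result is the number needed in EACH group.
n_two_groups <- function(delta, sd, alpha = 0.05, power = 0.80) {
  2 * sample_size_constant(alpha, power) / (delta / sd)^2
}

round(sample_size_constant(0.05, 0.80), 2)         # constant for alpha=0.05, power=0.80
round(n_one_group(delta = 1, sd = 4.6))            # error 1 cm, SD 4.6 cm
round(n_two_groups(delta = 0.04, sd = 0.12, power = 0.90))
```

In practice the result is usually rounded up rather than to the nearest integer, so that the achieved power is not below the target.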

16.4 Estimating sample size
16.4.1 Sample size for a single mean

Example 1: We want to estimate the height of Vietnamese men, accepting an error within 1 cm (d = 1) with 95% confidence (α = 0.05) and power = 0.8 (β = 0.2). Previous studies put the standard deviation of height in Vietnamese people at about 4.6 cm. Applying formula [1] gives the sample size needed:

$$ n = \frac{C}{(\Delta/\sigma)^2} = \frac{7.85}{(1/4.6)^2} = 166 $$

In other words, we need to measure the height of 166 subjects to estimate the height of Vietnamese men to within 1 cm.
If the acceptable error is 0.5 cm (instead of 1 cm), the number of subjects needed is n = 7.85 / (0.5/4.6)^2 = 664. If the acceptable error is 0.1 cm, the number rises to 16610! These estimates show that sample size depends heavily on the error we are willing to accept: the more precise the estimate we want, the more subjects we need.
In R, the function power.t.test can estimate the sample size for this example. Note that we tell R the problem involves one group with type='one.sample':

# error 1 cm, standard deviation 4.6, alpha = 0.05, power = 0.8
> power.t.test(delta=1, sd=4.6, sig.level=.05, power=.80,
    type='one.sample')

     One-sample t test power calculation

              n = 168.0131
          delta = 1
             sd = 4.6
      sig.level = 0.05
          power = 0.8
    alternative = two.sided

R gives 168, two subjects more than the hand calculation. The difference arises because power.t.test works with the t distribution rather than the normal approximation behind the constant C, and carries more decimal places. With an error of 0.5 cm:

# error 0.5 cm, standard deviation 4.6, alpha = 0.05, power = 0.8
> power.t.test(delta=0.5, sd=4.6, sig.level=.05, power=.80,
    type='one.sample')

     One-sample t test power calculation

              n = 666.2525
          delta = 0.5
             sd = 4.6
      sig.level = 0.05
          power = 0.8
    alternative = two.sided

Example 2: A drug is thought to raise alkaline phosphatase in patients with osteoporosis. The standard deviation of alkaline phosphatase is 15 U/l. A new study is to be conducted in a population of Vietnamese patients, and the researchers want to know how many patients must be recruited to show that the drug can raise alkaline phosphatase from 60 to 65 U/l after 3 months of treatment, with type I error α = 0.05 and power = 0.8.
This is a before-after study: there is a single group of patients, measured twice (before and after taking the drug). The clinical endpoint for judging the drug's efficacy is the change in alkaline phosphatase. Here the hypothesized mean increase is 5 U/l and the standard deviation is 15 U/l; in R terms, delta=5, sd=15, sig.level=.05, power=.80. The call printed below, however, sets delta=3 (a smaller, more conservative effect), and its output corresponds to that value:

> power.t.test(delta=3, sd=15, sig.level=.05, power=.80,
    type='one.sample')

     One-sample t test power calculation

              n = 198.1513
          delta = 3
             sd = 15
      sig.level = 0.05
          power = 0.8
    alternative = two.sided

So, with delta=3, about 198 patients are needed to meet these targets.

16.4.2 Sample size for comparing two means

In practice, many studies aim to compare two groups. Sample size for such studies rests mainly on formula [2], presented in Section 16.3.
Example 3: A study is designed to test the drug alendronate for treating osteoporosis in postmenopausal women. Two groups of patients are recruited: group 1 is the intervention group (treated with alendronate) and group 2 is the control group (untreated). The endpoint for judging efficacy is bone mineral density (BMD). Epidemiological data show that mean BMD in postmenopausal women is 0.80 g/cm2, with a standard deviation of 0.12 g/cm2. How many subjects do we need to show that after 12 months of treatment BMD in group 1 is about 5% higher than in group 2?
Denote the mean of group 2 by μ2 and that of group 1 by μ1. Then μ1 = 0.8*1.05 = 0.84 g/cm2 (5% higher than group 2), so Δ = 0.84 - 0.80 = 0.04 g/cm2. The standard deviation is σ = 0.12 g/cm2. With power = 0.90 and α = 0.05, the sample size needed is:

$$ n = \frac{2C}{(\Delta/\sigma)^2} = \frac{2 \times 10.51}{(0.04/0.12)^2} = 189 $$

The solution from R with power.t.test:

> power.t.test(delta=0.04, sd=0.12, sig.level=0.05,
    power=0.90, type="two.sample")

     Two-sample t test power calculation

              n = 190.0991
          delta = 0.04
             sd = 0.12
      sig.level = 0.05
          power = 0.9
    alternative = two.sided

NOTE: n is number in *each* group

Note that in power.t.test, besides the usual arguments delta (the hypothesized effect or difference), sd (standard deviation), sig.level (type I error probability) and power, we must state that the study has two groups with type="two.sample".
The result says we need 190 patients in each group (380 patients for the whole study). What do power = 0.90 and α = 0.05 mean here? If the hypothesized difference is real and we carry out the study many times (say 1000 times), each with 380 patients, about 90% (900) of the studies will give p < 0.05.
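To see how power changes with the number of subjects, power.t.test can also be called with n supplied and power left out. The sketch below reuses the BMD figures (difference 0.04, standard deviation 0.12); the grid of sample sizes is an illustrative choice.

```r
# Power of the two-sample t test for several per-group sample sizes.
n_per_group <- c(50, 100, 150, 190, 250)

power_by_n <- sapply(n_per_group, function(n) {
  power.t.test(n = n, delta = 0.04, sd = 0.12, sig.level = 0.05,
               type = "two.sample")$power
})

data.frame(n_per_group, power = round(power_by_n, 3))
```

Power rises with n and reaches roughly 0.90 at 190 subjects per group, in line with the calculation above.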

16.4.3 Sample size for analysis of variance

The method for comparing two groups can be extended to comparisons of more than two groups. With several groups, as discussed in Chapter 11, the method of comparison is analysis of variance. In this method the residual mean square (RMS) estimates the variability of the measurement within each group, and it plays a central role in estimating sample size.
The theory behind sample size for analysis of variance is fairly complex and beyond the scope of this chapter, but the principle is the same as for comparing two groups. Let the means of the k groups be μ1, μ2, μ3, ..., μk. The between-group sum of squares SS is

$$ SS = \sum_{i=1}^{k} (\mu_i - \bar{\mu})^2, \qquad \bar{\mu} = \sum_{i=1}^{k} \mu_i / k . $$

Let

$$ \lambda = \frac{SS}{(k-1)\,RMS} . $$

The problem is to find the sample size n per group such that z_β meets the required power of 0.80 or 0.90, where

$$ z_\beta = \frac{\sqrt{k(n-1)\left[2(k-1)(1+n\lambda)^2 - (1+2n\lambda)\right]} - \sqrt{(k-1)(1+n\lambda)\,F\,\left(2k(n-1)-1\right)}}{\sqrt{(k-1)(1+n\lambda)\,F + k(n-1)(1+2n\lambda)}} $$

Here F is the value of the F test. (See J. Fleiss, The Design and Analysis of Clinical Experiments, John Wiley & Sons, New York 1986, p. 373.)
Example 4. To compare the sweetness of a drink among 4 groups of subjects differing in sex and age (call them A, B, C and D), the researchers hypothesize that the sweetness scores in groups A, B, C and D are 4.5, 3.0, 5.6 and 1.3 respectively. From earlier studies they also know that the RMS of sweetness within each group is about 8.7. How many subjects are needed to detect a statistically significant difference at α = 0.05 with power = 0.9?
The R function power.anova.test solves this problem. We simply supply the 4 hypothesized means and the RMS:

# first put the 4 means in a vector
> groupmeans <- c(4.5, 3.0, 5.6, 1.3)
# then call power.anova.test:
> power.anova.test(groups = length(groupmeans),
    between.var=var(groupmeans),
    within.var=8.7,
    power=0.90,
    sig.level=0.05)

     Balanced one-way analysis of variance power calculation

         groups = 4
              n = 12.81152
    between.var = 3.486667
     within.var = 8.7
      sig.level = 0.05
          power = 0.9

NOTE: n is number in each group

The result shows that the researchers need about 13 subjects in each group (52 subjects for the whole study).
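The normal approximation behind the quantity z_β can be checked numerically against power.anova.test. The sketch below is an assumption-laden illustration: it takes F to be the critical value of the F distribution at level α with k-1 and k(n-1) degrees of freedom, and converts z_β to power with pnorm. The function name approx_anova_power is hypothetical.

```r
# Approximate power of one-way ANOVA via a normal approximation
# (assumes F is the critical value at level alpha; a sketch, not the book's code).
approx_anova_power <- function(n, means, rms, alpha = 0.05) {
  k      <- length(means)
  ss     <- sum((means - mean(means))^2)   # between-group sum of squares
  lambda <- ss / ((k - 1) * rms)
  Fcrit  <- qf(1 - alpha, k - 1, k * (n - 1))
  a      <- 1 + n * lambda
  b      <- 1 + 2 * n * lambda
  num <- sqrt(k * (n - 1) * (2 * (k - 1) * a^2 - b)) -
         sqrt((k - 1) * a * Fcrit * (2 * k * (n - 1) - 1))
  den <- sqrt((k - 1) * a * Fcrit + k * (n - 1) * b)
  pnorm(num / den)
}

groupmeans <- c(4.5, 3.0, 5.6, 1.3)

# Power with 13 subjects per group: approximation versus R's calculation.
approx_power <- approx_anova_power(n = 13, means = groupmeans, rms = 8.7)
exact_power  <- power.anova.test(groups = 4, n = 13,
                                 between.var = var(groupmeans),
                                 within.var = 8.7, sig.level = 0.05)$power

round(c(approximation = approx_power, power.anova.test = exact_power), 3)
```

Both values should be slightly above 0.90, consistent with the estimate of about 13 subjects per group.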

16.4.4 Sample size for estimating a proportion

Many descriptive studies have the fairly simple aim of estimating a proportion. Health researchers often want to know the prevalence of a disease in the community, and pollsters or market researchers want the proportion of the population that likes a product. In such cases the measurements are not continuous; the outcomes are binary values such as yes/no or like/dislike, and the sample size is estimated differently from the examples above.
In 1991, a survey in the United States found that 45% of respondents were willing to encourage their children to donate a kidney to patients in need. The 95% confidence interval for this proportion was 42% to 48%, a width of 6 percentage points! The result is rather imprecise, even though as many as 1000 people took part. To see how many subjects a more precise estimate would require, we look briefly at the theory of sample size for a proportion.
We know from Chapters 6 and 9 that if p is estimated from n subjects, the 95% confidence interval for a proportion p [in the population] is:

$$ \hat{p} \pm 1.96 \times SE(\hat{p}), \qquad SE(\hat{p}) = \sqrt{\hat{p}(1-\hat{p})/n} . $$

Now turn the problem around: we want to estimate p so that the margin of error 1.96 × SE (half the width of the confidence interval) does not exceed a constant m. In other words:

$$ 1.96 \sqrt{p(1-p)/n} \le m $$

We want the number of subjects n that meets this requirement. Rearranging gives:

$$ n \ge \left(\frac{1.96}{m}\right)^2 p(1-p) $$

So the sample size depends on the error m and on the proportion p we want to estimate. The smaller the error, the larger the sample size.
Example 5: We want to estimate the proportion of men who smoke in Vietnam such that the estimate is no more than 2% above or below the true proportion in the population. A previous study suggests that the smoking rate among Vietnamese men may be as high as 70%. How many men do we need to study to meet this requirement?
In this example the error is m = 0.02 and p = 0.70, so the sample size needed is:

$$ n \ge \left(\frac{1.96}{0.02}\right)^2 \times 0.7 \times 0.3 = 2017 $$

In other words, we need to study at least 2017 men. If we want to cut the error from 2% to 1% (m = 0.01), the number of subjects becomes 8067! Gaining just 1 percentage point of precision costs more than 6000 additional subjects. Sample size must therefore be estimated with great care, balancing the precision of the information against cost.
R has no built-in function for the sample size of a single proportion, but with the formula above readers can easily write one.
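A function of this kind might look as follows. This is a minimal sketch: the name n_for_proportion and its arguments are illustrative, and it uses the exact normal quantile from qnorm instead of the rounded 1.96, rounding the result up to a whole subject.

```r
# Sample size needed to estimate a proportion p within margin of error m.
n_for_proportion <- function(p, m, conf.level = 0.95) {
  z <- qnorm(1 - (1 - conf.level) / 2)   # about 1.96 for 95% confidence
  ceiling((z / m)^2 * p * (1 - p))       # round up to a whole subject
}

n_for_proportion(p = 0.70, m = 0.02)   # smoking example, error 2%
n_for_proportion(p = 0.70, m = 0.01)   # same proportion, error 1%
```

Halving the margin of error roughly quadruples the sample size, because m enters the formula squared.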

16.4.5 Sample size for comparing two proportions

Many inferential studies compare two or more groups. Section 16.4.2 introduced sample size estimation for comparing two means with the t test, for studies whose endpoint is a continuous variable. Other studies have an endpoint that is not continuous but binary, as just discussed in Section 16.4.4. To compare two proportions, the most commonly used tests are the binomial test and the chi-squared (χ²) test. This section covers the sample size calculation for these two tests.
Call the two proportions (which we do not know but want to learn about) p1 and p2, and let Δ = p1 - p2. The hypothesis we want to test is Δ = 0. The theory of sample size for this test is fairly complex, but it can be summarized by the following formula:

$$ n = \frac{\left( z_{\alpha/2}\sqrt{2\bar{p}(1-\bar{p})} + z_{\beta}\sqrt{p_1(1-p_1) + p_2(1-p_2)} \right)^2}{\Delta^2} $$

where p̄ = (p1 + p2)/2; z_{α/2} is the standard normal value for probability α/2 (for example, z_{α/2} = 1.96 when α = 0.05 and z_{α/2} = 2.57 when α = 0.01); and z_β is the standard normal value for probability β (for example, z_β = 1.28 when β = 0.10 and z_β = 0.84 when β = 0.20).
Example 6: A randomized controlled clinical trial is designed to evaluate a drug for preventing vertebral fracture. Two groups of patients will be recruited: group 1 is treated with the drug, and group 2 is the control group (untreated). The researchers hypothesize that the fracture rate in group 2 is about 10% and that the drug can reduce it to about 6%. If they want to test this hypothesis with type I error α = 0.01 and power = 0.90, how many patients must be recruited?
Here Δ = 0.10 - 0.06 = 0.04 and p̄ = (0.10 + 0.06)/2 = 0.08. With α = 0.01, z_{α/2} = 2.57, and with power = 0.90, z_β = 1.28. The number of patients needed in each group is therefore:

$$ n = \frac{\left( 2.57\sqrt{2 \times 0.08 \times 0.92} + 1.28\sqrt{0.1 \times 0.90 + 0.06 \times 0.94} \right)^2}{(0.04)^2} = 1361 $$

So the study must recruit at least 2722 patients to test this hypothesis.
The R function power.prop.test can compute the sample size in this case. It needs power, sig.level, p1 and p2. For the example above we can write:

> power.prop.test(p1=0.10, p2=0.06, power=0.90,
    sig.level=0.01)

     Two-sample comparison of proportions power calculation

              n = 1366.430
             p1 = 0.1
             p2 = 0.06
      sig.level = 0.01
          power = 0.9
    alternative = two.sided

NOTE: n is number in *each* group

Note that R's result (1366 subjects per group) is slightly more accurate, because R carries more decimal places than the hand calculation.
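The hand formula for two proportions can be coded directly and compared with power.prop.test. In this sketch the function name n_two_proportions is illustrative, and the z values come from qnorm rather than the rounded 2.57 and 1.28 used in the hand calculation.

```r
# Sample size per group for comparing two proportions (normal approximation).
n_two_proportions <- function(p1, p2, alpha = 0.05, power = 0.80) {
  p_bar <- (p1 + p2) / 2
  z_a   <- qnorm(1 - alpha / 2)   # z for alpha/2
  z_b   <- qnorm(power)           # z for beta = 1 - power
  (z_a * sqrt(2 * p_bar * (1 - p_bar)) +
     z_b * sqrt(p1 * (1 - p1) + p2 * (1 - p2)))^2 / (p1 - p2)^2
}

n_formula <- n_two_proportions(0.10, 0.06, alpha = 0.01, power = 0.90)
n_r       <- power.prop.test(p1 = 0.10, p2 = 0.06, power = 0.90,
                             sig.level = 0.01)$n

round(c(formula = n_formula, power.prop.test = n_r), 2)
```

With unrounded z values the formula agrees with R, which suggests the gap between 1361 and 1366 comes from rounding the z values in the hand calculation.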

To stress the point once more: estimating sample size is an extremely important step in designing a scientifically meaningful study, because it can decide whether the study succeeds or fails. Before estimating sample size, the researcher must already know (or at least hold specific hypotheses about) the problem of interest. The estimate requires the parameters described at the start of this chapter, and without them it cannot be made. For an entirely new study that no one has done before, figures for effect size and measurement variability may not exist, and the researcher will need to run a simulation or a pilot study to obtain them. Estimating sample size by simulation is a fairly specialized field outside the scope of this book, but readers can learn more about it from more advanced statistics textbooks.

17
Appendix 1: Programming and functions in R

R was built so that users can develop functions suited to their own analyses and calculations. Indeed, as mentioned at the start of the book, R can be regarded as a statistical language, and we can use it to solve problems not usually found in textbooks. This part presents only a few simple functions, so that readers can see how R works and, we hope, go on to develop functions of their own.
A function (sometimes called a macro in other software) is essentially a set of commands stored under one name. At its simplest, a function is shorthand for a group of commands.
Example 1. The commands below create two datasets (data1 and data2). Each has two columns of data simulated from the normal distribution. We then plot both datasets with annotations.

# two simulated datasets: column 1 has mean 1 or -1, column 2 has mean 0
data1 <- cbind(rnorm(100,1), rnorm(100,0))
data2 <- cbind(rnorm(100,-1), rnorm(100,0))
# common axis ranges so both datasets fit on the same plot
xr <- range(rbind(data1,data2)[,1])
yr <- range(rbind(data1,data2)[,2])
plot(data1, xlim=xr, ylim=yr, col=1, xlab="", ylab="")
par(new=TRUE)   # draw the next plot on top of the current one
plot(data2, xlim=xr, ylim=yr, col=2, xlab="", ylab="")
title(main="My simulated data", xlab="Weight",
      ylab="Yield")
legend(-3.0, -1.5, c("Big", "Small"), col=1:2, pch=1)

One way to keep all these commands is to store them in a text file; each time we need them, we cut and paste them into R. A better way is to wrap the commands in a function that can be used again and again.
Every R function must have a name. All its commands sit between the symbols { and }: { marks the start of the function body and } marks its end. For the example above, we call the function plotfigure:

plotfigure <- function()
{
  data1 <- cbind(rnorm(100,1), rnorm(100,0))
  data2 <- cbind(rnorm(100,-1), rnorm(100,0))
  xr <- range(rbind(data1,data2)[,1])
  yr <- range(rbind(data1,data2)[,2])
  plot(data1, xlim=xr, ylim=yr, col=1, xlab="", ylab="")
  par(new=TRUE)
  plot(data2, xlim=xr, ylim=yr, col=2, xlab="", ylab="")
  title(main="My simulated data", xlab="Weight",
        ylab="Yield")
  legend(-3.0, -1.5, c("Big", "Small"), col=1:2, pch=1)
}

Once the function has been entered into R, we simply call it as many times as we like:

> plotfigure()
> plotfigure()

and the result looks like this:

[Figure: two scatter plots, each titled "My simulated data", x-axis "Weight", y-axis "Yield", legend "Big" / "Small".]

The plotfigure function above simulates 100 values from the normal distribution. Every call produces exactly 100 values, and we cannot change that (except by editing the function itself). In other words, the function has no parameters.
The convenience of functions is that parameters can be varied as the user wishes. Suppose we want to vary the number of simulated values and the means of the normal distributions: we simply make these numbers parameters that the user can change. Calling the parameters n, mean1 and mean2, the function becomes:

plotfigure <- function(n, mean1, mean2)
{
  data1 <- cbind(rnorm(n,mean1), rnorm(n,0))
  data2 <- cbind(rnorm(n,mean2), rnorm(n,0))
  xr <- range(rbind(data1,data2)[,1])
  yr <- range(rbind(data1,data2)[,2])
  plot(data1, xlim=xr, ylim=yr, col=1, xlab="", ylab="")
  par(new=TRUE)
  plot(data2, xlim=xr, ylim=yr, col=2, xlab="", ylab="")
  title(main="My simulated data", xlab="Weight",
        ylab="Yield")
  legend(-3.0, -1.5, c("Big", "Small"), col=1:2, pch=1)
}

When calling the function, we simply change n and the means. Of the two commands below, the first draws a scatter plot with 200 values and means 2 and -2; the second raises the number of values to 500 while keeping the same means as before:

> plotfigure(200, 2, -2)
> plotfigure(500, 2, -2)

The results differ from those above:

[Figure: two scatter plots, each titled "My simulated data", x-axis "Weight", y-axis "Yield", legend "Big" / "Small"; the second plot contains more points.]

Example 2. We want to write a function that adds two numbers. (R can of course do this already; we assume otherwise purely for illustration.) Call the function add. The two parameters a and b are its arguments. It is written as follows:

add <- function(a, b)
{
  total <- a + b
  ans <- "Answer = "
  cat(ans, total, "\n")
}

As shown, the first step names the function add and defines the parameters a and b. The function body opens with { and closes with }. total is a variable holding the sum of a and b. ans <- "Answer = " defines the label for the reply (it is optional). cat(ans, total, "\n") gathers the values and prints the result for the user; "\n" ends the line, so that the next prompt appears on a new line. Readers can paste these commands into R and try:

> add(3, 9)
Answer = 12
> add(sqrt(5), exp(10))
Answer = 22028.7

Example 3. The following function does more computation than the functions in the previous examples. Suppose we have a variable with n elements x1, x2, x3, ..., xn following the normal distribution with mean μ and variance σ². In mathematical notation:

$$ x_i \sim N(\mu, \sigma^2) $$

Suppose prior information says that μ follows a normal distribution with mean μ0 and variance σ0², that is:

$$ \mu \sim N(\mu_0, \sigma_0^2) $$

By Bayes' theorem, we can estimate the mean:

$$ \mu_p = \frac{\dfrac{n\bar{x}}{\sigma^2} + \dfrac{\mu_0}{\sigma_0^2}}{\dfrac{n}{\sigma^2} + \dfrac{1}{\sigma_0^2}} $$

and the variance:

$$ \frac{1}{\sigma_p^2} = \frac{n}{\sigma^2} + \frac{1}{\sigma_0^2} $$

where x̄ is the mean of the sample of size n; μp and σp² are called the posterior mean and variance. We can write an R function, called bayes, to compute these two quantities:

bayes <- function(x, prior.mean, prior.var)
{
  n <- length(x)
  sample.mean <- mean(x)
  sample.var <- var(x)
  # precision-weighted combination of prior mean and sample mean
  numerator <- (prior.mean/prior.var) + (n*sample.mean/sample.var)
  denominator <- 1/prior.var + n/sample.var
  posterior.mean <- numerator/denominator
  posterior.var <- 1/denominator
  a <- "Posterior mean = "
  b <- "Posterior variance = "
  cat("Sample size = ", n, "\n")
  cat("Sample mean = ", sample.mean, "\n")
  cat("Sample var = ", sample.var, "\n")
  cat("Prior mean = ", prior.mean, "\n")
  cat("Prior var = ", prior.var, "\n")
  cat(a, posterior.mean, "\n")
  cat(b, posterior.var, "\n")
}

Example 4. Bone mineral density (BMD) in a population usually follows the normal distribution, with a mean of about 1.0 g/cm2 and a variance of 0.0144 (g/cm2)^2. Suppose we measure BMD in a group of patients as follows: 1.0, 1.5, 2.1, 1.7, 1.8, 0.9, 0.7. We want the mean and variance of this sample after adjusting for the previously known mean and variance. First, we store the data as bmd:

> bmd <- c(1.0, 1.5, 2.1, 1.7, 1.8, 0.9, 0.7)

and then call bayes as follows:

> bayes(bmd, 1.0, 0.0144)
Sample size = 7
Sample mean = 1.385714
Sample var = 0.2747619
Prior mean = 1
Prior var = 0.0144
Posterior mean = 1.103525
Posterior variance = 0.01053507
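A function can also return its results instead of only printing them, so that they can be reused in later calculations. The variant below is a sketch of that idea; the name bayes_posterior and the list element names are illustrative and not part of the original function.

```r
# Variant of the posterior calculation that returns a list instead of printing.
bayes_posterior <- function(x, prior.mean, prior.var) {
  n <- length(x)
  # total precision = prior precision + sample precision
  precision <- 1 / prior.var + n / var(x)
  list(
    mean = (prior.mean / prior.var + n * mean(x) / var(x)) / precision,
    var  = 1 / precision
  )
}

bmd  <- c(1.0, 1.5, 2.1, 1.7, 1.8, 0.9, 0.7)
post <- bayes_posterior(bmd, prior.mean = 1.0, prior.var = 0.0144)

post$mean        # posterior mean
sqrt(post$var)   # posterior standard deviation
```

Returning a list keeps the computation separate from the display, which makes the function easier to test and to combine with other code.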

The above is only a brief introduction to programming and writing functions in R. In practice, packages such as survival, BMA, meta, Hmisc, etc. are all developed in the R language. Readers can consult Introduction to R by W. Venables and B. Ripley (see the end of the book) for further technical details.

18
Appendix 2
Commonly used R commands

R environment commands

getwd()                   Show the current working directory
setwd("c:/works")         Change the working directory to c:\works (note that R uses /)
options(prompt="R> ")     Change the prompt to R>
options(width=100)        Set the width of the R window to 100 characters
options(scipen=3)         Favour fixed notation over scientific notation (instead of forms like 1.2E-04)
options()                 Show R's environment settings

Basic commands

ls()                      List the objects in memory
rm(object)                Remove an object
search()                  Show the search path

Arithmetic operators

+                         Addition
-                         Subtraction
*                         Multiplication
/                         Division
^                         Exponentiation
%/%                       Integer division
%%                        Remainder from dividing two integers

Logical operators

==                        Equal to
!=                        Not equal to
<                         Less than
>                         Greater than
<=                        Less than or equal to
>=                        Greater than or equal to
is.na(x)                  Is x a missing value?
&                         And (AND)
|                         Or (OR)
!                         Not (NOT)

Generating numbers

numeric(n)                n zeros
character(n)              n empty character strings
logical(n)                n FALSE values
seq(-4,3,0.5)             The sequence -4.0, -3.5, -3.0, ..., 3.0
1:10                      Same as seq(1, 10, 1)
c(5,7,9,1)                Enter the numbers 5, 7, 9 and 1
rep(1, 5)                 Five 1s: 1, 1, 1, 1, 1
gl(3,2,12)                Factor with 3 levels, each repeated 2 times, 12 values in total: 1 1 2 2 3 3 1 1 2 2 3 3

Generating random numbers by simulation from probability distributions

rnorm(n, mean=0, sd=1)             Normal distribution (default mean = 0 and standard deviation = 1)
rexp(n, rate=1)                    Exponential distribution
rgamma(n, shape, scale=1)          Gamma distribution
rpois(n, lambda)                   Poisson distribution
rweibull(n, shape, scale=1)        Weibull distribution
rcauchy(n, location=0, scale=1)    Cauchy distribution
rbeta(n, shape1, shape2)           Beta distribution
rt(n, df)                          t distribution
rchisq(n, df)                      Chi-squared distribution
rbinom(n, size, prob)              Binomial distribution
rgeom(n, prob)                     Geometric distribution
rhyper(nn, m, n, k)                Hypergeometric distribution
rlnorm(n, meanlog=0, sdlog=1)      Log-normal distribution
rlogis(n, location=0, scale=1)     Logistic distribution
rnbinom(n, size, prob)             Negative binomial distribution
runif(n, min=0, max=1)             Uniform distribution

Converting between numbers and characters

as.numeric(x)                      Convert x to a numeric variable that can be used in calculations
as.character(x)                    Convert x to a character variable (for classification)
as.logical(x)                      Convert x to a logical variable
factor(x)                          Convert x to a factor

Data frames

data.frame(x,y)                    Combine x and y into a data frame
tuan$age                           Select the variable age from the data frame tuan
attach(tuan)                       Attach the data frame tuan to R's search path
detach(tuan)                       Detach the data frame tuan from R's search path

Mathematical functions

log(x)                             Natural logarithm (base e)
log10(x)                           Logarithm base 10
exp(x)                             Exponential
sin(x)                             Sine
cos(x)                             Cosine
tan(x)                             Tangent
asin(x)                            Arcsine (inverse sine)
acos(x)                            Arccosine (inverse cosine)
atan(x)                            Arctangent (inverse tangent)

Hm s thng k
min(x)
max(x)
which.max(x)
which.min(x)

S nh nht ca bin s x
S ln nht ca bin s x
Tm dng no c gi tr ln nht ca bin s x
Tm dng no c gi tr nh nht ca bin s x

length(x)

Tng s yu t (elements) trong mt


bin s (hay s mu)
S tng ca bin s x
Khc bit gia max(x) v min(x)
S trung bnh ca bin s x
S trung v (median) ca bin s x
lch chun (standard deviation)
ca bin s x
Phng sai (variance) ca bin s x
Hip bin (covariance) gia hai bin s x v y
H s tng quan (coefficient of
correlation) gia bin s x v y.
Ch s ca bin s x
H s tng quan (correlation coefficient)
gia bin s x v y
Kim tra xem x c phi l s trng
khng (missing value)

sum(x)
range(x)
mean(x)
median(x)
sd(x)
var(x)
cov(x,y)
cor(x,y)
quantile(x)
cor(x,y)
is.na(x)

complete.cases(x1,x2,...)
Kim tra nu tt c x1, x2, u khng
c s trng.

Ch s ma trn
x[1]
x[1:5]
x[y<=30]
x[sex==male]

346

S u tin ca bin s x
Nm s u tin ca bin s x
Chn x sao cho y nh hn hoc bng 30
Chn x sao cho sex bng male

Phn tch d liu v to biu bng R Nguyn Vn Tun

Data input

data(name)          Load the built-in data set name
read.table(name)    Read data from the file name
read.csv(name)      Read comma-separated data (CSV, for example exported
                    from Excel) from the file name
read.delim(name)    Read tab-delimited data
read.delim2(name)   Read tab-delimited data in which the decimal mark is ,
read.csv2(name)     Read CSV data separated by ; in which the decimal
                    mark is ,

Options in read.table

header=TRUE         The first row of the data holds the variable names
sep=","             Values are separated by ,
dec=","             The decimal mark is , (as distinct from .)
na.strings="."      Missing values are coded as .
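
To see these options working together without needing a file on disk, the sketch below reads a few lines of text through the text argument of read.table. The column names and values are invented; the separator here is ; so that it does not clash with the decimal comma.

```r
# Hypothetical raw data: ";" separates fields, "," is the decimal mark,
# and "." marks a missing value
raw <- c("id;weight;height",
         "1;60,5;1,62",
         "2;.;1,70",
         "3;72,0;1,75")

d <- read.table(text = raw, header = TRUE, sep = ";",
                dec = ",", na.strings = ".")

d
str(d)   # weight and height are numeric; weight has one NA
```

With a real file, the first argument would be the file name instead of text = raw.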

Statistical distributions

pnorm(x,mean,sd)          Normal distribution
plnorm(x,meanlog,sdlog)   Log-normal distribution
pt(x,df)                  t distribution
pf(x,n1,n2)               F distribution
pchisq(x,df)              Chi-squared distribution
ppois(x,lambda)           Poisson distribution
punif(x,min,max)          Uniform distribution
pexp(x,rate)              Exponential distribution
pgamma(x,shape,scale)     Gamma distribution (give scale by name, as in
                          scale=..., because the third positional argument
                          is rate)
pbeta(x,a,b)              Beta distribution
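
Each p function returns a cumulative probability, P(X <= x). The sketch below evaluates a few familiar cases; the input values are standard textbook cut-offs chosen for this example.

```r
# Standard normal: about 97.5% of the distribution lies below 1.96
p1 <- pnorm(1.96)

# Normal with mean 100 and sd 15: same as pnorm(2), because (130 - 100) / 15 = 2
p2 <- pnorm(130, mean = 100, sd = 15)

# Upper-tail probability of a t distribution with 10 degrees of freedom
p3 <- 1 - pt(2.0, df = 10)

# Chi-squared with 1 degree of freedom: about 95% lies below 3.84
p4 <- pchisq(3.84, df = 1)

# The q functions invert the p functions
q1 <- qnorm(0.975)

round(c(p1, p2, p3, p4, q1), 3)
```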


Statistical analysis

t.test                    t test
pairwise.t.test           Pairwise t tests between groups, adjusted for
                          multiple comparisons (for a paired design use
                          t.test with paired=TRUE)
cor.test                  Test of a correlation coefficient
    method="kendall"      Kendall rank correlation
    method="spearman"     Spearman rank correlation
var.test                  Test comparing two variances
bartlett.test             Test comparing several variances

wilcox.test               Wilcoxon test
kruskal.test              Kruskal-Wallis test
friedman.test             Friedman test

lm(y ~ x)                 Linear regression
lm(y ~ factor)            One-way analysis of variance
lm(y ~ factor+x)          Analysis of covariance
lm(y ~ x1+x2+x3)          Multiple linear regression

binom.test                Binomial test
prop.test                 Test comparing several proportions
prop.trend.test           Test for a trend across several proportions
fisher.test               Fisher's exact test
chisq.test                Chi-squared test
glm(y ~ x1+x2+x3, family=binomial)
                          Logistic regression

Survival analysis (these functions are in the survival package)

s <- Surv(time,event)     Create a survival object
survfit(s ~ 1)            Kaplan-Meier estimate and curve
survdiff(s ~ g)           Log-rank test between the groups in g
coxph(s ~ x1+x2)          Cox regression
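
The sketch below runs three of the analyses listed above on simulated data, so the results carry no substantive meaning. The variable names, the seed, and the simulated effect sizes are all assumptions made for this example.

```r
set.seed(1)

# Simulated data: two groups of 20, one continuous covariate
group <- factor(rep(c("a", "b"), each = 20))
x <- rnorm(40)
y <- 2 + 0.5 * x + ifelse(group == "b", 1, 0) + rnorm(40, sd = 0.5)

# Two-sample t test comparing y between the groups
tt <- t.test(y ~ group)
tt$p.value

# Analysis of covariance: group effect adjusted for x
fit <- lm(y ~ group + x)
summary(fit)$coefficients

# Fisher's exact test on a 2 x 2 table (y above or below its median, by group)
tab <- table(group, high = y > median(y))
ft <- fisher.test(tab)
ft$p.value
```

Each test returns an object whose components (such as p.value or the coefficient table) can be extracted for further use rather than only read off the printed output.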


Graphics

plot(y~x)             Plot y against x (scatter plot)
coplot(y ~ x | z)     Plot y against x separately for each group of z
pie(x)                Draw a pie chart
boxplot(x)            Draw a box plot
qqnorm(x)             Plot the quantiles of x against normal quantiles
qqplot(x, y)          Plot the quantiles of y against those of x
barplot(x)            Draw a bar chart of x
hist(x)               Draw a histogram of x
stars(x)              Draw a star plot of x
abline(a, b)          Draw a straight line with intercept=a and slope=b
abline(h=y)           Draw a horizontal line
abline(v=x)           Draw a vertical line
abline(lm.object)     Draw the line fitted by a linear model

Some graphical parameters

pch             Plotting symbol (pch = plotting character)
mfrow, mfcol    Arrange several plots on one page (multiframe)
xlim, ylim      Limits of the horizontal and vertical axes
xlab, ylab      Labels of the horizontal and vertical axes
lty, lwd        Type and width of lines
cex, mex        Size of, and spacing between, characters
col             Colour
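
The following sketch combines several of the plotting functions and parameters above. The data are simulated, and the plots are written to a temporary PDF file so that the example also runs in a non-interactive session; the file name and all parameter values are choices made for this illustration.

```r
set.seed(2)
x <- rnorm(50)
y <- 1 + 2 * x + rnorm(50)

# Send the graphics to a temporary PDF file
out <- tempfile(fileext = ".pdf")
pdf(out, width = 8, height = 4)

# Two plots side by side; par() returns the old settings so they can be restored
op <- par(mfrow = c(1, 2))

# Scatter plot with filled blue symbols, slightly reduced in size
plot(y ~ x, pch = 16, col = "blue", cex = 0.8,
     xlab = "x", ylab = "y", xlim = c(-3, 3))

# Add the fitted regression line, dashed and twice the default width
abline(lm(y ~ x), lty = 2, lwd = 2)

# Histogram of x in the second panel
hist(x, col = "grey", main = "Histogram of x")

par(op)
dev.off()
```

In an interactive session the pdf() and dev.off() lines can be left out, and the plots appear in a graphics window instead.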


19
Appendix 3
Glossary of terms used in this book

English                                 Vietnamese
95% confidence interval                 Khoảng tin cậy 95%
Akaike Information criterion (AIC)      Tiêu chuẩn thông tin Akaike
Analysis of covariance                  Phân tích hiệp biến
Analysis of variance (ANOVA)            Phân tích phương sai
Bar chart                               Biểu đồ thanh
Binomial distribution                   Phân phối nhị phân
Box plot                                Biểu đồ hình hộp
Categorical variable                    Biến thứ bậc
Clock chart                             Biểu đồ đồng hồ
Coefficient of correlation              Hệ số tương quan
Coefficient of determination            Hệ số xác định bội
Coefficient of heterogeneity            Hệ số bất đồng nhất
Combination                             Tổ hợp
Continuous variable                     Biến liên tục
Correlation                             Tương quan
Covariance                              Hợp biến
Cross-over experiment                   Thí nghiệm giao chéo
Cumulative probability distribution     Hàm phân phối tích lũy
Degree of freedom                       Bậc tự do
Determinant                             Định thức
Discrete variable                       Biến rời rạc
Dot chart                               Biểu đồ điểm
Estimate                                Ước số
Estimator                               Hàm ước lượng thống kê
Factorial analysis of variance          Phân tích phương sai cho thí nghiệm
                                        giai thừa
Fixed effects                           Ảnh hưởng bất biến
Frequency                               Tần số

Function                                Hàm
Heterogeneity                           Bất đồng nhất
Histogram                               Biểu đồ tần số
Homogeneity                             Đồng nhất
Hypothesis test                         Kiểm định giả thiết
Inverse matrix                          Ma trận nghịch đảo
Latin square experiment                 Thí nghiệm hình vuông Latin
Least squares method                    Phương pháp bình phương nhỏ nhất
Linear Logistic regression analysis     Phân tích hồi qui tuyến tính logistic
Linear regression analysis              Phân tích hồi qui tuyến tính
Matrix                                  Ma trận
Maximum likelihood method               Phương pháp hợp lí cực đại
Mean                                    Số trung bình
Median                                  Số trung vị
Meta-analysis                           Phân tích tổng hợp
Missing value                           Giá trị không
Model                                   Mô hình
Multiple linear regression analysis     Phân tích hồi qui tuyến tính đa biến
Normal distribution                     Phân phối chuẩn
Object                                  Đối tượng
Parameter                               Thông số
Permutation                             Hoán vị
Pie chart                               Biểu đồ hình tròn
Poisson distribution                    Phân phối Poisson
Polynomial regression                   Hồi qui đa thức
Probability                             Xác suất
Probability density distribution        Hàm mật độ xác suất
P-value                                 Trị số P
Quantile                                Hàm định bậc
Random effects                          Ảnh hưởng ngẫu nhiên
Random variable                         Biến ngẫu nhiên
Relative risk                           Tỉ số nguy cơ tương đối
Repeated measure experiment             Thí nghiệm tái đo lường
Residual                                Phần dư
Residual mean square                    Trung bình bình phương phần dư

Residual sum of squares                 Tổng bình phương phần dư
Scalar matrix                           Ma trận vô hướng
Scatter plot                            Biểu đồ tán xạ
Significance                            Có ý nghĩa thống kê
Simulation                              Mô phỏng
Standard deviation                      Độ lệch chuẩn
Standard error                          Sai số chuẩn
Standardized normal distribution        Phân phối chuẩn chuẩn hóa
Survival analysis                       Phân tích biến cố
Transposed matrix                       Ma trận chuyển vị
Variable                                Biến (biến số)
Variance                                Phương sai
Weight                                  Trọng số
Weighted mean                           Trung bình trọng số


20
A few closing words to the reader
(and references)
Over 15 chapters and 3 appendices, the reader has travelled with the author
on a fairly long journey through statistical analysis and graphics. Before
parting, the author would like to offer a few words of farewell.
Experience from teaching and from my own research shows that most students,
on first meeting statistical science, feel little enthusiasm and often find
it hard going, because textbooks for the subject are far removed from
practice and use examples that never arise in everyday life. Abstract
concepts, tangled formulas, and long, complicated calculations make the
subject feel difficult, and learners lose interest in pursuing it. Indeed,
when reading textbooks or research papers we sometimes come across good
methods and models that suit our own research, yet we do not know how to
compute them. In this book the author aims to give the reader a practical
analysis tool to fill the gap in methods and knowledge that the reader may
still have.
Learning must go hand in hand with practice. The best way to learn, in my
view, is to imitate, and R makes learning by imitation very convenient.
While reading these chapters and their examples, the reader can type the
commands into the computer and check whether the results agree with what
is printed. Once the use of a function or command is understood, the reader
can add (or remove) arguments to see how the results change. Only by
learning in this way will the reader master the concepts and the use of R.
We learn from mistakes. With this book the author wants the reader to take
a rather bumpy road, that is, to interact with the computer through R
commands. Along the way some commands will fail to run: a variable name is
mistyped or misspelt, upper and lower case are confused, the data are
incomplete or wrong, and so on. Every one of these mistakes helps the reader
gain experience and become more proficient. This is what the English call
"trial and error": learning from mistakes and experiments.
A data analysis project needs many R commands and functions. However,
because R is used interactively, as the reader has seen, these commands
disappear when R is closed. The question is whether there is a way to store
them in a file for later reuse. An extremely useful program for this purpose
is Tinn-R (which can also be downloaded and installed completely free of
charge). The website for downloading Tinn-R and its documentation is:
http://www.sciviews.org/Tinn-R.
Tinn-R is essentially an editor for R (and for many other programs).
Tinn-R lets us store all the commands of an analysis project in one file.
It also provides on-line help on how to use R commands and functions. When
a command is typed with incorrect R syntax, Tinn-R flags it immediately and
suggests a correction! The Tinn-R interface looks something like this:

[Figure: screenshot of the Tinn-R interface]

For example, in the interface above, when we type read.table( a hint
appears just below, listing all the arguments of the read.table function.
With Tinn-R we rarely make the kind of mistakes we make when typing directly
into R. After writing a number of commands, we can use the mouse to
highlight the ones we want to run and send them to R. Note that there is no
need to leave Tinn-R while R is running.
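
Whatever editor is used, commands saved in a plain text file can be re-run from R with the base function source(). The sketch below writes a two-line script to a temporary file and runs it; the file and its contents are invented for this illustration, and in practice the script would be a file written in the editor.

```r
# Create a small script file (in practice this is written in an editor)
script <- tempfile(fileext = ".R")
writeLines(c("x <- c(1, 2, 3, 4)",
             "m <- mean(x)"),
           script)

# Run every command stored in the file
source(script)

m   # 2.5
```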
At this point the reader may well ask: is there an easier way to use R,
without typing commands? The answer is yes. Why did I not introduce it at
the outset, in the first chapter? Because I wanted the reader to take the
hard road before the easy one. That is why only now do I mention another
add-on program that helps the reader use R more quickly, more easily, and
more conveniently, with the mouse instead of the keyboard.


The program that automates R is called Rcmdr (short for R commander).
Rcmdr is in fact a package, which the reader can download from the official
R website
(http://cran.au.rproject.org/src/contrib/Descriptions/Rcmdr.html) or from
the website of the Rcmdr author:
http://socserv.socsci.mcmaster.ca/jfox/Misc/Rcmdr. Note that Rcmdr works
well only when the following packages are installed: relimp, multcomp,
lmtest, effects, car, and abind. If any of them is missing, the reader
should download it. The Rcmdr manual can also be downloaded from
http://cran.R-project.org/doc/packages/Rcmdr.pdf.
Once Rcmdr has been downloaded and installed, the reader simply issues the
command library(Rcmdr), and the Rcmdr window appears. Through its menus
(such as File, Edit, Data, Statistics, Graphs, Models, Distribution, Tool,
Help) the reader can explore how Rcmdr works with the mouse.
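
One way to check which of the packages named above still need to be installed is sketched below. The package names are those given in the text; the two commands that download packages and start Rcmdr are left commented out because they need an internet connection and a graphical session.

```r
# Packages mentioned in the text
needed <- c("Rcmdr", "relimp", "multcomp", "lmtest", "effects", "car", "abind")

# Names of the packages already installed on this machine
installed <- rownames(installed.packages())

# Which of the needed packages are not yet installed?
missing_pkgs <- setdiff(needed, installed)
missing_pkgs

# Uncomment to download the missing packages from CRAN and start Rcmdr
# install.packages(missing_pkgs)
# library(Rcmdr)
```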

In this first edition the author does not intend to discuss multivariate
analysis models such as factor analysis, cluster analysis, correspondence
analysis, multivariate analysis of variance, and so on, because these are
relatively advanced methods. They require the user to be well versed in
statistical theory and also to understand thoroughly the basic methods of
analysis presented in this book. Readers who need these methods can
nevertheless consult the R website to learn about the packages dedicated to
multivariate analysis.
References
At present the library of books on R is still fairly modest compared with
that for commercial software such as SAS and SPSS. However, in today's era
of extraordinary progress in internet information and globalization, printed
books and books published on websites no longer differ much. Most guidance
on using R can be found scattered across websites of universities and
individuals around the world. This section lists only a few books that
readers can look up if they need further reference. In writing the book now
in the reader's hands, the author also consulted a number of books and
websites, listed below with a few personal comments.
The main reference on R is the paper by the two creators of R:
Ihaka R, Gentleman R. R: A language for data analysis and graphics. Journal
of Computational and Graphical Statistics 1996; 5:299-314.
18.1 Reference books on R

"Data Analysis and Graphics Using R - An Example Approach" (Cambridge
University Press, 2003) by John Maindonald, now reprinted in a second
edition with a new co-author, John Braun. This is a very useful book for
anyone who wants to explore and learn R. The first five chapters are written
for readers who have never used R, and the later chapters for readers who
already use R proficiently.

"Introductory Statistics With R" (Springer, 2004) by Peter Dalgaard is a
basic book on R aimed at readers who know nothing about R. The book is
fairly short (only about 200 pages) but rather expensive!

"Linear Models with R" (Chapman & Hall/CRC, 2004) by Julian Faraway. The
book can currently be downloaded free of charge from:
http://www.stat.lsa.umich.edu/~faraway/book/pra.pdf
or http://cran.r-project.org/doc/contrib/Faraway-PRA.pdf. The document is
213 pages long.

"R Graphics" (Computer Science and Data Analysis) (Chapman & Hall/CRC,
2005) by Paul Murrell. This book is devoted to graphics in R. It contains a
great deal of code that readers can use to design their own complex and
colourful charts.


"Modern Applied Statistics with S-Plus" (Springer, 4th Edition, 2003) by
W. N. Venables and B. D. Ripley was written for the S-Plus language, but all
the commands and code in the book can be applied to R without change.
(S-Plus is the predecessor of R, but S-Plus is commercial software, whereas
R is completely free!) This can fairly be called the reference book for
anyone who wants to go further with R. The two authors are also
authoritative experts on the R language. The book is intended for readers
with an advanced background in computing and statistics.

18.2 Important or useful websites about R

A great many reference documents can be downloaded from the official R
website: http://cran.R-project.org/other-docs.html
Among them are some important documents such as "An Introduction to R" by
W. N. Venables and B. D. Ripley.
Internet address: http://cran.r-project.org/doc/manuals/R-intro.pdf.

Several guides to using R can be downloaded (free of charge) and consulted,
as follows:

"R for Beginners" (57 pages) by Emmanuel Paradis. Written for readers who
are new to R.
Internet address:
http://cran.r-project.org/doc/contrib/Paradisrdebuts_en.pdf.

"Using R for Data Analysis and Graphics: Introduction, Code and
Commentary" (35 pages) by John Maindonald is a summary of the basic R
commands and functions for data analysis and graphics. Its subject is very
close to that of the book you are reading.
Internet address: http://cran.r-project.org/doc/contrib/usingR.pdf

"Statistical Analysis with R - a quick start" (46 pages) by Oleg Nenadic
and Walter Zucchini. A guide to applying R to statistical analysis and
graphics.
Internet address: http://www.statoek.wiso.unigoettingen.de/mitarbeiter/ogi/pub/r_workshop.pdf

"A Brief Guide to R for Beginners in Econometrics" (31 pages) by M. Arai.
Written mainly for those who analyse economic statistics.
Internet address: http://people.su.se/~ma/R_intro


"Notes on the use of R for psychology experiments and questionnaires"
(39 pages) by Jonathan Baron and Yuelin Li (web document). Written for
researchers in psychology and sociology. It includes examples of log-linear
models and of some analysis of variance models used in psychology.
Internet address: http://www.psych.upenn.edu/~baron/rpsych/rpsych.html

"StatsRus" is a collection of tips for using R more effectively (about
80 pages long). Internet address:
http://lark.cc.ukans.edu/pauljohn/R/statsRus.html

Finally, there is a guide to using R for data analysis and graphics (about
110 pages, regularly updated) written in Vietnamese by the author himself.
The website www.R.ykhoanet.com is essentially a summary of some of the main
chapters of this book. It also holds all the data sets and code used in the
book, which readers can download to their own computers.
