3 Coding Attention Mechanisms
In the previous chapter, you learned how to prepare the input text for
training LLMs. This involved splitting text into individual word and
subword tokens, which can be encoded into vector representations, the so-
called embeddings, for the LLM.
Figure 3.1 A mental model of the three main stages of coding an LLM,
pretraining the LLM on a general text dataset, and finetuning it on a
labeled dataset. This chapter focuses on attention mechanisms, which
are an integral part of an LLM architecture.
Over the course of this chapter, we will implement four different variants
of attention mechanisms, as illustrated in figure 3.2.
Figure 3.3 When translating text from one language to another, such as
German to English, it’s not possible to merely translate word by word.
Instead, the translation process requires contextual understanding
and grammar alignment.
An RNN is a type of neural network where outputs from previous steps are fed as inputs to the current step, making them well-suited for sequential data like text. If you are unfamiliar with RNNs, don't worry; you don't need to know the detailed workings of RNNs to follow this discussion. Our focus here is more on the general concept of the encoder-decoder setup.
In an encoder-decoder RNN, the input text is fed into the encoder, which processes it sequentially. The encoder updates its hidden state (the internal values at the hidden layers) at each step, trying to capture the entire meaning of the input sentence in the final hidden state, as illustrated in figure 3.4. The decoder then takes this final hidden state to start generating the translated sentence, one word at a time. It also updates its hidden state at each step, which is supposed to carry the context necessary for the next-word prediction.
Figure 3.4 Before the advent of transformer models, encoder-decoder
RNNs were a popular choice for machine translation. The encoder
takes a sequence of tokens from the source language as input, where a
hidden state (an intermediate neural network layer) of the encoder
encodes a compressed representation of the entire input sequence.
Then, the decoder uses its current hidden state to begin the
translation, token by token.
The big limitation of encoder-decoder RNNs is that the RNN can't directly access earlier hidden states from the encoder during the decoding phase. Consequently, it relies solely on the current hidden state, which encapsulates all relevant information. This can lead to a loss of context, especially in complex sentences where dependencies might span long distances.
One major shortcoming in this approach is that the RNN must remember the entire encoded input in a single hidden state before passing it to the decoder, as illustrated in figure 3.4 in the previous section.
In this section, we implement a simplified variant of self-attention, free from any trainable weights, summarized in figure 3.7. The goal is to illustrate a few key concepts in attention before adding trainable weights in section 3.4.
For example, consider an input text like "Your journey starts with one step." In this case, each element of the sequence, such as x(1), corresponds to a d-dimensional embedding vector representing a specific token, like "Your." In figure 3.7, these input vectors are shown as three-dimensional embeddings.
import torch

inputs = torch.tensor(
    [[0.43, 0.15, 0.89], # Your     (x^1)
     [0.55, 0.87, 0.66], # journey  (x^2)
     [0.57, 0.85, 0.64], # starts   (x^3)
     [0.22, 0.58, 0.33], # with     (x^4)
     [0.77, 0.25, 0.10], # one      (x^5)
     [0.05, 0.80, 0.55]] # step     (x^6)
)
query = inputs[1]  #A The second input token, x(2), serves as the query
attn_scores_2 = torch.empty(inputs.shape[0])
for i, x_i in enumerate(inputs):
    attn_scores_2[i] = torch.dot(x_i, query)
print(attn_scores_2)
res = 0.
for idx, element in enumerate(inputs[0]):
    res += inputs[0][idx] * query[idx]
print(res)
print(torch.dot(inputs[0], query))
The output confirms that the sum of the element-wise multiplication gives the same results as the dot product:
tensor(0.9544)
tensor(0.9544)
In the next step, as shown in figure 3.9, we normalize each of the attention scores that we computed previously.
Figure 3.9 After computing the attention scores ω21 to ω2T with
respect to the input query x(2), the next step is to obtain the attention
weights α21 to α2T by normalizing the attention scores.
The main goal behind the normalization shown in figure 3.9 is to obtain attention weights that sum up to 1. This normalization is a convention that is useful for interpretation and for maintaining training stability in an LLM. Here's a straightforward method for achieving this normalization step:
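A minimal sketch of this simple sum-based normalization (the variable name attn_weights_2_tmp is illustrative):

attn_weights_2_tmp = attn_scores_2 / attn_scores_2.sum()
print("Attention weights:", attn_weights_2_tmp)
print("Sum:", attn_weights_2_tmp.sum())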
In practice, it's more common and advisable to use the softmax function for normalization. This approach is better at managing extreme values and offers more favorable gradient properties during training. The following is a basic implementation of the softmax function for normalizing the attention scores:
def softmax_naive(x):
    return torch.exp(x) / torch.exp(x).sum(dim=0)

attn_weights_2_naive = softmax_naive(attn_scores_2)
print("Attention weights:", attn_weights_2_naive)
print("Sum:", attn_weights_2_naive.sum())
As the output shows, the softmax function also meets the objective and normalizes the attention weights such that they sum to 1:
Attention weights: tensor([0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581])
Sum: tensor(1.)
In this case, we can see that PyTorch's own softmax implementation yields the same results as our previous softmax_naive function:
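A sketch using the built-in torch.softmax on the same attention scores:

attn_weights_2 = torch.softmax(attn_scores_2, dim=0)
print("Attention weights:", attn_weights_2)
print("Sum:", attn_weights_2.sum())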
Figure 3.10 The final step, after calculating and normalizing the
attention scores to obtain the attention weights for query x(2), is to
compute the context vector z(2). This context vector is a combination
of all input vectors x(1) to x(T) weighted by the attention weights.
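Following figure 3.10, the context vector z(2) can be computed as a weighted sum over the input vectors; a minimal sketch using the attn_weights_2 from above:

query = inputs[1]  # the second input token is the query
context_vec_2 = torch.zeros(query.shape)
for i, x_i in enumerate(inputs):
    context_vec_2 += attn_weights_2[i] * x_i
print(context_vec_2)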
In the next section, we will generalize this procedure for computing context vectors to calculate all context vectors simultaneously.
attn_scores = torch.empty(6, 6)
for i, x_i in enumerate(inputs):
    for j, x_j in enumerate(inputs):
        attn_scores[i, j] = torch.dot(x_i, x_j)
print(attn_scores)
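The nested loop above is slow; the same score matrix can be obtained with a single matrix multiplication, sketched here:

attn_scores = inputs @ inputs.T
print(attn_scores)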
We can visually confirm that the results are the same as before.
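In the second step, each row of the score matrix is normalized with softmax; a minimal sketch:

attn_weights = torch.softmax(attn_scores, dim=-1)
print(attn_weights)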
The resulting attention weight tensor matches the values shown in figure 3.10.
Before we move on to step 3, the final step shown in figure 3.12, let's briefly verify that the rows indeed all sum to 1:
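A quick sketch of this check:

print("All row sums:", attn_weights.sum(dim=-1))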
In the third and last step, we now use these attention weights to compute all context vectors via matrix multiplication:
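A sketch of this matrix multiplication, using the attention weight matrix from above:

all_context_vecs = attn_weights @ inputs
print(all_context_vecs)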
Jn vrp lntgursei tuoput tsnoer, baco wtv ainstonc s eerth-andonismlei
txetnco vcorte:
We can double-check that the code is correct by comparing the second row with the context vector z(2) that we computed previously in section 3.3.1:
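A sketch of this comparison (context_vec_2 is the vector computed earlier):

print("Previous 2nd context vector:", context_vec_2)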
The most notable difference is the introduction of weight matrices that are updated during model training. These trainable weight matrices are crucial so that the model (specifically, the attention module inside the model) can learn to produce "good" context vectors. (Note that we will train the LLM in chapter 5.)
We will tackle this self-attention mechanism in two subsections. First, we will code it step by step as before. Second, we will organize the code into a compact Python class that can be imported into an LLM architecture, which we will code in chapter 4.
3.4.1 Computing the attention weights step by step
We will implement the self-attention mechanism step by step by introducing the three trainable weight matrices Wq, Wk, and Wv. These three matrices are used to project the embedded input tokens, x(i), into query, key, and value vectors, as illustrated in figure 3.14.
Earlier in section 3.3.1, we defined the second input element x(2) as the query when we computed the simplified attention weights to compute the context vector z(2). Later, in section 3.3.2, we generalized this to compute all context vectors z(1) ... z(T) for the six-word input sentence "Your journey starts with one step."
Similarly, we will start by computing only one context vector, z(2), for illustration purposes. In the next section, we will modify this code to calculate all context vectors.
x_2 = inputs[1]         #A The second input element
d_in = inputs.shape[1]  #B The input embedding size, d_in=3
d_out = 2               #C The output embedding size, d_out=2
Note that in GPT-like models, the input and output dimensions are usually the same, but for illustration purposes, to better follow the computation, we choose different input (d_in=3) and output (d_out=2) dimensions here.
Next, we initialize the three weight matrices Wq, Wk, and Wv that are shown in figure 3.14:
torch.manual_seed(123)
W_query = torch.nn.Parameter(torch.rand(d_in, d_out), requires_grad=False)
W_key = torch.nn.Parameter(torch.rand(d_in, d_out), requires_grad=False)
W_value = torch.nn.Parameter(torch.rand(d_in, d_out), requires_grad=False)
Note that we are setting requires_grad=False to reduce clutter in the outputs for illustration purposes, but if we were to use the weight matrices for model training, we would set requires_grad=True to update these matrices during model training.
Next, we compute the query, key, and value vectors as shown earlier in figure 3.14:
query_2 = x_2 @ W_query
key_2 = x_2 @ W_key
value_2 = x_2 @ W_value
print(query_2)
As we can see based on the output for the query, this results in a two-dimensional vector since we set the number of columns of the corresponding weight matrix, via d_out, to 2:
tensor([0.4306, 1.4551])
Note that in the weight matrices W, the term "weight" is short for "weight parameters," the values of a neural network that are optimized during training. This is not to be confused with the attention weights. As we already saw in the previous section, attention weights determine the extent to which a context vector depends on the different parts of the input, that is, to what extent the network focuses on different parts of the input.
Even though our temporary goal is only to compute the one context vector, z(2), we still require the key and value vectors for all input elements, as they are involved in computing the attention weights with respect to the query q(2), as illustrated in figure 3.14.
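We can obtain all keys and values with two matrix multiplications; a minimal sketch:

keys = inputs @ W_key
values = inputs @ W_value
print("keys.shape:", keys.shape)
print("values.shape:", values.shape)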
As we can tell from the outputs, we successfully projected the six input tokens from a three-dimensional onto a two-dimensional embedding space.
The second step is to compute the attention scores, as shown in figure 3.15.
keys_2 = keys[1] #A
attn_score_22 = query_2.dot(keys_2)
print(attn_score_22)
tensor(1.8524)
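Again, we can generalize this computation to all attention scores for the query via matrix multiplication; a sketch:

attn_scores_2 = query_2 @ keys.T
print(attn_scores_2)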
As a quick check, we can see that the second element in the output matches the attn_score_22 we computed previously.
The third step is now going from the attention scores to the attention weights, as illustrated in figure 3.16.
Figure 3.16 After computing the attention scores ω, the next step is to
normalize these scores using the softmax function to obtain the
attention weights α.
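Going from scores to weights as in figure 3.16, the scores are additionally divided by the square root of the key embedding dimension before applying softmax (scaled dot-product attention); a minimal sketch:

d_k = keys.shape[-1]
attn_weights_2 = torch.softmax(attn_scores_2 / d_k**0.5, dim=-1)
print(attn_weights_2)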
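The final step, computing the context vector as a weighted sum over the value vectors, can be sketched as:

context_vec_2 = attn_weights_2 @ values
print(context_vec_2)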
tensor([0.3061, 0.8210])
So far, we have only computed a single context vector, z(2). In the next section, we will generalize the code to compute all context vectors in the input sequence, z(1) to z(T).
The terms "key," "query," and "value" in the context of attention mechanisms are borrowed from the domain of information retrieval and databases, where similar concepts are used to store, search, and retrieve information.
The key is like a database key used for indexing and searching. In the attention mechanism, each item in the input sequence (e.g., each word in a sentence) has an associated key. These keys are used to match the query.
During the forward pass, using the forward method, we compute the attention scores (attn_scores) by multiplying queries and keys, normalizing these scores using softmax. Finally, we create a context vector by weighting the values with these normalized attention scores.
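A minimal sketch of a SelfAttention_v1 module implementing this forward pass, using nn.Parameter weight matrices as in the step-by-step code above:

import torch.nn as nn

class SelfAttention_v1(nn.Module):
    def __init__(self, d_in, d_out):
        super().__init__()
        self.W_query = nn.Parameter(torch.rand(d_in, d_out))
        self.W_key = nn.Parameter(torch.rand(d_in, d_out))
        self.W_value = nn.Parameter(torch.rand(d_in, d_out))

    def forward(self, x):
        queries = x @ self.W_query
        keys = x @ self.W_key
        values = x @ self.W_value
        attn_scores = queries @ keys.T  # omega
        attn_weights = torch.softmax(
            attn_scores / keys.shape[-1]**0.5, dim=-1)
        context_vec = attn_weights @ values
        return context_vec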
torch.manual_seed(123)
sa_v1 = SelfAttention_v1(d_in, d_out)
print(sa_v1(inputs))
Since inputs contains six embedding vectors, this results in a matrix storing the six context vectors:
tensor([[0.2996, 0.8053],
        [0.3061, 0.8210],
        [0.3058, 0.8203],
        [0.2948, 0.7939],
        [0.2927, 0.7891],
        [0.2990, 0.8040]], grad_fn=<MmBackward0>)
As a quick check, notice how the second row ([0.3061, 0.8210]) matches the contents of context_vec_2 in the previous section. Figure 3.18 summarizes the self-attention mechanism we just implemented.
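The output shown next comes from a second variant, referred to later in this chapter as sa_v2, which replaces the manually created nn.Parameter matrices with nn.Linear layers (whose different weight initialization explains the different numbers). A sketch of such a SelfAttention_v2 class and its usage; the random seed here is an assumption:

class SelfAttention_v2(nn.Module):
    def __init__(self, d_in, d_out, qkv_bias=False):
        super().__init__()
        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)

    def forward(self, x):
        queries = self.W_query(x)
        keys = self.W_key(x)
        values = self.W_value(x)
        attn_scores = queries @ keys.T
        attn_weights = torch.softmax(
            attn_scores / keys.shape[-1]**0.5, dim=-1)
        context_vec = attn_weights @ values
        return context_vec

torch.manual_seed(789)  # seed assumed; the exact values below depend on the original seed
sa_v2 = SelfAttention_v2(d_in, d_out)
print(sa_v2(inputs))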
The output is
tensor([[-0.0739, 0.0713],
[-0.0748, 0.0703],
[-0.0749, 0.0702],
[-0.0760, 0.0685],
[-0.0763, 0.0679],
[-0.0754, 0.0693]], grad_fn=<MmBackward0>)
As illustrated in figure 3.19, we mask out the attention weights above the diagonal, and we normalize the nonmasked attention weights such that the attention weights sum to 1 in each row. In the next section, we will implement this masking and normalization procedure in code.
Figure 3.20 One way to obtain the masked attention weight matrix in
causal attention is to apply the softmax function to the attention
scores, zeroing out the elements above the diagonal and normalizing
the resulting matrix.
queries = sa_v2.W_query(inputs)  #A Reuse the query and key weight matrices of the sa_v2 object from the previous section
keys = sa_v2.W_key(inputs)
attn_scores = queries @ keys.T
attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=1)
print(attn_weights)
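To implement step 2 of figure 3.20, we can use PyTorch's tril function to create a mask in which the values above the diagonal are zero; a sketch:

context_length = attn_scores.shape[0]
mask_simple = torch.tril(torch.ones(context_length, context_length))
print(mask_simple)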
Now, we can multiply this mask with the attention weights to zero out the values above the diagonal:
masked_simple = attn_weights*mask_simple
print(masked_simple)
As we can see, the elements above the diagonal are successfully zeroed out.
The third step in figure 3.20 is to renormalize the attention weights to sum up to 1 again in each row. We can achieve this by dividing each element in each row by the sum in each row:
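A sketch of this row-wise renormalization (the variable names are illustrative):

row_sums = masked_simple.sum(dim=-1, keepdim=True)
masked_simple_norm = masked_simple / row_sums
print(masked_simple_norm)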
The result is an attention weight matrix where the attention weights above the diagonal are zeroed out and where the rows sum to 1.
Information leakage
When we apply the mask and then renormalize the attention weights, it might initially appear that information from future (masked) tokens could still influence the current token because their values were part of the initial softmax computation. However, renormalizing after masking is mathematically equivalent to computing the softmax over the unmasked positions only, so no information from the masked tokens leaks into the context vectors.
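In practice, the masking can be implemented more efficiently by setting the scores above the diagonal to negative infinity before the softmax, which is also what the CausalAttention class later in this chapter does; a sketch:

mask = torch.triu(torch.ones(context_length, context_length), diagonal=1)
masked = attn_scores.masked_fill(mask.bool(), -torch.inf)
attn_weights = torch.softmax(masked / keys.shape[-1]**0.5, dim=-1)
print(attn_weights)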
As we can see based on the output, the values in each row sum to 1, and no further normalization is necessary.
We could now use the modified attention weights to compute the context vectors via context_vec = attn_weights @ values, as in section 3.4. However, in the next section, we first cover another minor tweak to the causal attention mechanism that is useful for reducing overfitting when training LLMs.
Dropout in deep learning is a technique where randomly selected hidden layer units are ignored during training, which helps prevent overfitting by ensuring that a model does not become overly reliant on any specific set of hidden layer units. It's important to emphasize that dropout is only used during training and is disabled afterward.
Here we will apply the dropout mask after computing the attention weights, as illustrated in figure 3.22, because it's the more common variant in practice.
Figure 3.22 Using the causal attention mask (upper left), we apply an
additional dropout mask (upper right) to zero out additional attention
weights to reduce overfitting during training.
In the following code example, we use a dropout rate of 50%, which means masking out half of the attention weights. (When we train the GPT model in later chapters, we will use a lower dropout rate, such as 0.1 or 0.2.)
torch.manual_seed(123)
dropout = torch.nn.Dropout(0.5)  #A A dropout rate of 50%
example = torch.ones(6, 6)       #B A matrix of ones as a toy example
print(dropout(example))
When applying dropout to an attention weight matrix with a rate of 50%, half of the elements in the matrix are randomly set to zero. To compensate for the reduction in active elements, the values of the remaining elements in the matrix are scaled up by a factor of 1/0.5 = 2. This scaling is crucial to maintain the overall balance of the attention weights, ensuring that the average influence of the attention mechanism remains consistent during both the training and inference phases.
Now let’s apply dropout to the attention weight matrix itself:
torch.manual_seed(123)
print(dropout(attn_weights))
The resulting attention weight matrix now has additional elements zeroed out and the remaining ones rescaled.
Note that the resulting dropout outputs may look different depending on your operating system; you can read more about this inconsistency on the PyTorch issue tracker at https://github.com/pytorch/pytorch/issues/121595.
But before we begin, let's ensure that the code can handle batches consisting of more than one input so that the CausalAttention class supports the batch outputs produced by the data loader we implemented in chapter 2.
For simplicity, to simulate such batch inputs, we duplicate the input text example:
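A minimal sketch of creating such a batch by stacking the same inputs twice:

batch = torch.stack((inputs, inputs), dim=0)
print(batch.shape)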
torch.Size([2, 6, 3])
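The CausalAttention class that the following discussion refers to is sketched below, under the assumption that it extends the nn.Linear-based self-attention with the causal mask and dropout described above:

import torch.nn as nn

class CausalAttention(nn.Module):
    def __init__(self, d_in, d_out, context_length,
                 dropout, qkv_bias=False):
        super().__init__()
        self.d_out = d_out
        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.dropout = nn.Dropout(dropout)  # dropout applied to the attention weights
        self.register_buffer(               # causal mask stored as a non-trainable buffer
            "mask",
            torch.triu(torch.ones(context_length, context_length), diagonal=1)
        )

    def forward(self, x):
        b, num_tokens, d_in = x.shape  # batch size, sequence length, embedding size
        keys = self.W_key(x)
        queries = self.W_query(x)
        values = self.W_value(x)

        attn_scores = queries @ keys.transpose(1, 2)
        attn_scores.masked_fill_(
            self.mask.bool()[:num_tokens, :num_tokens], -torch.inf)
        attn_weights = torch.softmax(
            attn_scores / keys.shape[-1]**0.5, dim=-1)
        attn_weights = self.dropout(attn_weights)

        context_vec = attn_weights @ values
        return context_vec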
While all added code lines should be familiar from previous sections, we now added a self.register_buffer() call in the __init__ method. The use of register_buffer in PyTorch is not strictly necessary for all use cases but offers several advantages here. For instance, when we use the CausalAttention class in our LLM, buffers are automatically moved to the appropriate device (CPU or GPU) along with our model, which will be relevant when training the LLM in future chapters. This means we don't need to manually ensure these tensors are on the same device as the model parameters, avoiding device mismatch errors.
torch.manual_seed(123)
context_length = batch.shape[1]
ca = CausalAttention(d_in, d_out, context_length, 0.0)
context_vecs = ca(batch)
print("context_vecs.shape:", context_vecs.shape)
Figure 3.24 The multi-head attention module in this figure depicts two
single-head attention modules stacked on top of each other. So,
instead of using a single matrix Wv for computing the value matrices,
in a multi-head attention module with two heads, we now have two
value weight matrices: Wv1 and Wv2. The same applies to the other
weight matrices, Wq and Wk. We obtain two sets of context vectors Z1
and Z2 that we can combine into a single context vector matrix Z.
As mentioned before, the main idea behind multi-head attention is to run the attention mechanism multiple times (in parallel) with different, learned linear projections, that is, the results of multiplying the input data (like the query, key, and value vectors in attention mechanisms) by a weight matrix.
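One straightforward way to realize this in code is to stack several of the CausalAttention modules from the previous section and concatenate their outputs; a minimal sketch of such a MultiHeadAttentionWrapper:

class MultiHeadAttentionWrapper(nn.Module):
    def __init__(self, d_in, d_out, context_length,
                 dropout, num_heads, qkv_bias=False):
        super().__init__()
        # Assumes the CausalAttention class from the previous section
        self.heads = nn.ModuleList(
            [CausalAttention(d_in, d_out, context_length, dropout, qkv_bias)
             for _ in range(num_heads)]
        )

    def forward(self, x):
        # Each head returns (b, num_tokens, d_out); concatenating along the
        # last dimension yields (b, num_tokens, d_out * num_heads)
        return torch.cat([head(x) for head in self.heads], dim=-1)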
To illustrate figure 3.25 further with a concrete example, we can use the MultiHeadAttentionWrapper class similarly to the CausalAttention class before:
torch.manual_seed(123)
context_length = batch.shape[1] # This is the number of tokens
d_in, d_out = 3, 2
mha = MultiHeadAttentionWrapper(
d_in, d_out, context_length, 0.0, num_heads=2
)
context_vecs = mha(batch)
print(context_vecs)
print("context_vecs.shape:", context_vecs.shape)
The first dimension of the resulting context_vecs tensor is 2 since we have two input texts (the input texts are duplicated, which is why the context vectors are exactly the same for those). The second dimension refers to the 6 tokens in each input. The third dimension refers to the four-dimensional embedding of each token.
class MultiHeadAttention(nn.Module):
    def __init__(self, d_in, d_out, context_length,
                 dropout, num_heads, qkv_bias=False):
        super().__init__()
        assert d_out % num_heads == 0, "d_out must be divisible by num_heads"

        self.d_out = d_out
        self.num_heads = num_heads
        self.head_dim = d_out // num_heads  #A Reduce the projection dim to match the desired output dim
        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.out_proj = nn.Linear(d_out, d_out)  #B A Linear layer to combine the head outputs
        self.dropout = nn.Dropout(dropout)
        self.register_buffer(
            "mask",
            torch.triu(torch.ones(context_length, context_length),
                       diagonal=1)
        )
    def forward(self, x):
        b, num_tokens, d_in = x.shape
        keys = self.W_key(x)      #C Shape: (b, num_tokens, d_out)
        queries = self.W_query(x)
        values = self.W_value(x)

        #D Split the last dimension: (b, num_tokens, d_out) -> (b, num_tokens, num_heads, head_dim)
        keys = keys.view(b, num_tokens, self.num_heads, self.head_dim)
        values = values.view(b, num_tokens, self.num_heads, self.head_dim)
        queries = queries.view(b, num_tokens, self.num_heads, self.head_dim)

        keys = keys.transpose(1, 2)        #E Transpose to (b, num_heads, num_tokens, head_dim)
        queries = queries.transpose(1, 2)  #E
        values = values.transpose(1, 2)    #E

        attn_scores = queries @ keys.transpose(2, 3)  #F Dot product for each head
        mask_bool = self.mask.bool()[:num_tokens, :num_tokens]  #G Mask truncated to the number of tokens
        attn_scores.masked_fill_(mask_bool, -torch.inf)  #H Fill masked positions with -inf

        attn_weights = torch.softmax(
            attn_scores / keys.shape[-1]**0.5, dim=-1)
        attn_weights = self.dropout(attn_weights)

        context_vec = (attn_weights @ values).transpose(1, 2)  #I Shape: (b, num_tokens, num_heads, head_dim)
        #J Combine heads, where self.d_out = self.num_heads * self.head_dim
        context_vec = context_vec.contiguous().view(
            b, num_tokens, self.d_out
        )
        context_vec = self.out_proj(context_vec)  #K Optional output projection
        return context_vec
The splitting of the query, key, and value tensors, as depicted in figure 3.26, is achieved through tensor reshaping and transposing operations using PyTorch's .view and .transpose methods. The input is first transformed (via linear layers for queries, keys, and values) and then reshaped to represent multiple heads.
The key operation is to split the d_out dimension into num_heads and head_dim, where head_dim = d_out / num_heads. This splitting is then achieved using the .view method: a tensor of dimensions (b, num_tokens, d_out) is reshaped to dimensions (b, num_tokens, num_heads, head_dim).
The tensors are then transposed to bring the num_heads dimension before the num_tokens dimension, resulting in a shape of (b, num_heads, num_tokens, head_dim). This transposition is crucial for correctly aligning the queries, keys, and values across the different heads and for performing batched matrix multiplications efficiently.
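To see why this layout enables batched matrix multiplication, consider a small example tensor of shape (b=1, num_heads=2, num_tokens=3, head_dim=4). The values below are random placeholders for illustration; the specific numbers printed farther down come from the original example's values, not from this sketch:

torch.manual_seed(123)
a = torch.rand(1, 2, 3, 4)        # placeholder values for illustration
print(a @ a.transpose(2, 3))      # batched matmul over the last two dims, result shape (1, 2, 3, 3)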
For instance, the preceding becomes a more compact way to compute the matrix multiplication for each head separately:
first_head = a[0, 0, :, :]
first_res = first_head @ first_head.T
print("First head:\n", first_res)
second_head = a[0, 1, :, :]
second_res = second_head @ second_head.T
print("\nSecond head:\n", second_res)
The results are exactly the same as the results we obtained when using the batched matrix multiplication print(a @ a.transpose(2, 3)) earlier:
First head:
tensor([[1.3208, 1.1631, 1.2879],
[1.1631, 2.2150, 1.8424],
[1.2879, 1.8424, 2.0402]])
Second head:
tensor([[0.4391, 0.7003, 0.5903],
[0.7003, 1.3737, 1.0620],
[0.5903, 1.0620, 0.9912]])
torch.manual_seed(123)
batch_size, context_length, d_in = batch.shape
d_out = 2
mha = MultiHeadAttention(d_in, d_out, context_length, 0.0, num_heads=2)
context_vecs = mha(batch)
print(context_vecs)
print("context_vecs.shape:", context_vecs.shape)
tensor([[[0.3190, 0.4858],
[0.2943, 0.3897],
[0.2856, 0.3593],
[0.2693, 0.3873],
[0.2639, 0.3928],
[0.2575, 0.4028]],
[[0.3190, 0.4858],
[0.2943, 0.3897],
[0.2856, 0.3593],
[0.2693, 0.3873],
[0.2639, 0.3928],
[0.2575, 0.4028]]], grad_fn=<ViewBackward0>)
context_vecs.shape: torch.Size([2, 6, 2])
For comparison, the smallest GPT-2 model (117 million parameters) has 12 attention heads and a context vector embedding size of 768. The largest GPT-2 model (1.5 billion parameters) has 25 attention heads and a context vector embedding size of 1,600. Note that the embedding sizes of the token inputs and context embeddings are the same in GPT models (d_in = d_out).