
3 Coding Attention Mechanisms

This chapter covers

Exploring the reasons for using attention mechanisms in neural networks


Introducing a basic self-attention framework and progressing to an enhanced
self-attention mechanism
Implementing a causal attention module that allows LLMs to generate one token
at a time
Masking randomly selected attention weights with dropout to reduce overfitting
Stacking multiple causal attention modules into a multi-head attention module

In the previous chapter, you learned how to prepare the input text for training LLMs. This involved
splitting text into individual word and subword tokens, which can be encoded into vector representations,
the so-called embeddings, for the LLM.

In this chapter, we will now look at an integral part of the LLM architecture itself, attention mechanisms,
as illustrated in Figure 3.1.
Figure 3.1 A mental model of the three main stages of coding an LLM, pretraining the LLM on a general text
dataset, and finetuning it on a labeled dataset. This chapter focuses on attention mechanisms, which are an
integral part of an LLM architecture.

Attention mechanisms are a comprehensive topic, which is why we are devoting a whole chapter to it. We
will largely look at these attention mechanisms in isolation and focus on them at a mechanistic level. In
the next chapter, we will then code the remaining parts of the LLM surrounding the self-attention
mechanism to see it in action and to create a model to generate text.

Over the course of this chapter, we will implement four different variants of attention mechanisms, as
illustrated in Figure 3.2.

Figure 3.2 The figure depicts different attention mechanisms we will code in this chapter, starting with a
simplified version of self-attention before adding the trainable weights. The causal attention mechanism adds
a mask to self-attention that allows the LLM to generate one word at a time. Finally, multi-head attention
organizes the attention mechanism into multiple heads, allowing the model to capture various aspects of the
input data in parallel.


These different attention variants shown in Figure 3.2 build on each other, and the goal is to arrive at a compact and efficient implementation of multi-head attention at the end of this chapter that we can then plug into the LLM architecture we will code in the next chapter.

3.1 The problem with modeling long sequences


Before we dive into the self-attention mechanism that is at the heart of LLMs later in this chapter, what is the problem with architectures without attention mechanisms that predate LLMs? Suppose we want to develop a language translation model that translates text from one language into another. As shown in Figure 3.3, we can't simply translate a text word by word due to the grammatical structures in the source and target language.

Figure 3.3 When translating text from one language to another, such as German to English, it's not possible to
merely translate word by word. Instead, the translation process requires contextual understanding and
grammar alignment.

To address the issue that we cannot translate text word by word, it is common to use a deep neural network with two submodules, a so-called encoder and decoder. The job of the encoder is to first read in and process the entire text, and the decoder then produces the translated text.

We already briefly discussed encoder-decoder networks when we introduced the transformer architecture in chapter 1 (section 1.4, Using LLMs for different tasks). Before the advent of transformers, recurrent neural networks (RNNs) were the most popular encoder-decoder architecture for language translation.

An RNN is a type of neural network where outputs from previous steps are fed as inputs to the current step, making them well-suited for sequential data like text. If you are unfamiliar with RNNs, don't worry, you don't need to know the detailed workings of RNNs to follow this discussion; our focus here is more on the general concept of the encoder-decoder setup.

In an encoder-decoder RNN, the input text is fed into the encoder, which processes it sequentially. The encoder updates its hidden state (the internal values at the hidden layers) at each step, trying to capture the entire meaning of the input sentence in the final hidden state, as illustrated in Figure 3.4. The decoder then takes this final hidden state to start generating the translated sentence, one word at a time. It also updates its hidden state at each step, which is supposed to carry the context necessary for the next-word prediction.

Figure 3.4 Before the advent of transformer models, encoder-decoder RNNs were a popular choice for
machine translation. The encoder takes a sequence of tokens from the source language as input, where a
hidden state (an intermediate neural network layer) of the encoder encodes a compressed representation of
the entire input sequence. Then, the decoder uses its current hidden state to begin the translation, token by
token.

While we don't need to know the inner workings of these encoder-decoder RNNs, the key idea here is that the encoder part processes the entire input text into a hidden state (memory cell). The decoder then takes in this hidden state to produce the output. You can think of this hidden state as an embedding vector, a concept we discussed in chapter 2.

The big issue and limitation of encoder-decoder RNNs is that the RNN can't directly access earlier hidden states from the encoder during the decoding phase. Consequently, it relies solely on the current hidden state, which encapsulates all relevant information. This can lead to a loss of context, especially in complex sentences where dependencies might span long distances.

For readers unfamiliar with RNNs, it is not essential to understand or study this architecture, as we will not be using it in this book. The takeaway message of this section is that encoder-decoder RNNs had a shortcoming that motivated the design of attention mechanisms.

3.2 Capturing data dependencies with attention mechanisms


Before transformer LLMs, it was common to use RNNs for language modeling tasks such as language translation, as mentioned previously. RNNs work fine for translating short sentences but don't work well for longer texts as they don't have direct access to previous words in the input.

One major shortcoming in this approach is that the RNN must remember the entire encoded input in a single hidden state before passing it to the decoder, as illustrated in Figure 3.4 in the previous section.

Hence, researchers developed the so-called Bahdanau attention mechanism for RNNs in 2014 (named after the first author of the respective paper), which modifies the encoder-decoder RNN such that the decoder can selectively access different parts of the input sequence at each decoding step, as illustrated in Figure 3.5.

Figure 3.5 Using an attention mechanism, the text-generating decoder part of the network can access all input
tokens selectively. This means that some input tokens are more important than others for generating a given
output token. The importance is determined by the so-called attention weights, which we will compute later.
Note that this figure shows the general idea behind attention and does not depict the exact implementation of
the Bahdanau mechanism, which is an RNN method outside this book's scope.

Jtenyislnregt, nepf rehet aersy trael, acserreshre udfno crur TGG retertshcauci cot enr deqrreui vtl
iilndugb khvb raleun ksertnow lkt alturna ealgngua igecrnopss ync pesrdopo gkr alonrgii nrstoamrefr
aietcehurrtc (usscddeis nj aphrcte 1) qwjr s flak-eonitattn csimmnhea dseiinpr by drx Xhdnauaa tonettnia
emhimsnac.
Slxf-tainotnet ja z iaemhnmcs zrrd loslaw dzvc oipoints nj qrx puitn eqseencu er adtten xr ffz toispniso nj
oru vscm seeneuqc nwbo iopmtngcu kgr enptinoetreras lx z qceeenus. Sflo-nintottea jc s ohv coopnetnm
kl orapnoemtrcy PZWa ebdsa ne vru tenormsfrra ceicerathutr, gucz zz yrv QFA eersis.

This chapter focuses on coding and understanding this self-attention mechanism used in GPT-like models, as illustrated in Figure 3.6. In the next chapter, we will then code the remaining parts of the LLM.

Figure 3.6 Self-attention is a mechanism in transformers that is used to compute more efficient input
representations by allowing each position in a sequence to interact with and weigh the importance of all other
positions within the same sequence. In this chapter, we will code this self-attention mechanism from the
ground up before we code the remaining parts of the GPT-like LLM in the following chapter.

3.3 Attending to different parts of the input with self-attention


We'll now delve into the inner workings of the self-attention mechanism and learn how to code it from the ground up. Self-attention serves as the cornerstone of every LLM based on the transformer architecture. It's worth noting that this topic may require a lot of focus and attention (no pun intended), but once you grasp its fundamentals, you will have conquered one of the toughest aspects of this book and of implementing LLMs in general.

The "self" in self-attention

In self-attention, the "self" refers to the mechanism's ability to compute attention weights by relating different positions within a single input sequence. It assesses and learns the relationships and dependencies between various parts of the input itself, such as words in a sentence or pixels in an image. This is in contrast to traditional attention mechanisms, where the focus is on the relationships between elements of two different sequences, such as in sequence-to-sequence models where the attention might be between an input sequence and an output sequence, such as the example depicted in Figure 3.5.
Since self-attention can appear complex, especially if you are encountering it for the first time, we will begin by introducing a simplified version of self-attention in the next subsection. Afterwards, in section 3.4, we will then implement the self-attention mechanism with trainable weights, which is used in LLMs.

3.3.1 A simple self-attention mechanism without trainable weights


In this section, we implement a simplified variant of self-attention, free from any trainable weights, which is summarized in Figure 3.7. The goal of this section is to illustrate a few key concepts in self-attention before adding trainable weights next in section 3.4.

Figure 3.7 The goal of self-attention is to compute a context vector, for each input element, that combines
information from all other input elements. In the example depicted in this figure, we compute the context
vector z(2). The importance or contribution of each input element for computing z(2) is determined by the
attention weights α21 to α2T. When computing z(2), the attention weights are calculated with respect to input
element x(2) and all other inputs. The exact computation of these attention weights is discussed later in this
section.

Figure 3.7 shows an input sequence, denoted as x, consisting of T elements represented as x(1) to x(T). This sequence typically represents text, such as a sentence, that has already been transformed into token embeddings, as explained in chapter 2.

For example, consider an input text like "Your journey starts with one step." In this case, each element of the sequence, such as x(1), corresponds to a d-dimensional embedding vector representing a specific token, like "Your." In Figure 3.7, these input vectors are shown as 3-dimensional embeddings.

In self-attention, our goal is to calculate context vectors z(i) for each element x(i) in the input sequence. A context vector can be interpreted as an enriched embedding vector.

To illustrate this concept, let's focus on the embedding vector of the second input element, x(2) (which corresponds to the token "journey"), and the corresponding context vector, z(2), shown at the bottom of Figure 3.7. This enhanced context vector, z(2), is an embedding that contains information about x(2) and all other input elements x(1) to x(T).

In self-attention, context vectors play a crucial role. Their purpose is to create enriched representations of each element in an input sequence (like a sentence) by incorporating information from all other elements in the sequence, as illustrated in Figure 3.7. This is essential in LLMs, which need to understand the relationship and relevance of words in a sentence to each other. Later, we will add trainable weights that help an LLM learn to construct these context vectors so that they are relevant for the LLM to generate the next token.

In this section, we implement a simplified self-attention mechanism to compute these weights and the resulting context vector one step at a time.

Consider the following input sentence, which has already been embedded into 3-dimensional vectors as discussed in chapter 2. We choose a small embedding dimension for illustration purposes to ensure it fits on the page without line breaks:

import torch

inputs = torch.tensor(
    [[0.43, 0.15, 0.89], # Your     (x^1)
     [0.55, 0.87, 0.66], # journey  (x^2)
     [0.57, 0.85, 0.64], # starts   (x^3)
     [0.22, 0.58, 0.33], # with     (x^4)
     [0.77, 0.25, 0.10], # one      (x^5)
     [0.05, 0.80, 0.55]] # step     (x^6)
)

The first step of implementing self-attention is to compute the intermediate values ω, referred to as attention scores, as illustrated in Figure 3.8.
Figure 3.8 The overall goal of this section is to illustrate the computation of the context vector z(2) using the
second input sequence, x(2) as a query. This figure shows the first intermediate step, computing the attention
scores ω between the query x(2) and all other input elements as a dot product. (Note that the numbers in the
figure are truncated to one digit after the decimal point to reduce visual clutter.)

Figure 3.8 illustrates how we calculate the intermediate attention scores between the query token and each input token. We determine these scores by computing the dot product of the query, x(2), with every other input token:

query = inputs[1]  # the second input token serves as the query
attn_scores_2 = torch.empty(inputs.shape[0])
for i, x_i in enumerate(inputs):
    attn_scores_2[i] = torch.dot(x_i, query)
print(attn_scores_2)

The computed attention scores are as follows:

tensor([0.9544, 1.4950, 1.4754, 0.8434, 0.7070, 1.0865])

Understanding dot products

A dot product is essentially just a concise way of multiplying two vectors element-wise and then summing the products, which we can demonstrate as follows:

res = 0.

for idx, element in enumerate(inputs[0]):
    res += inputs[0][idx] * query[idx]
print(res)
print(torch.dot(inputs[0], query))

The outputs confirm that the sum of the element-wise multiplication gives the same result as the dot product:

tensor(0.9544)
tensor(0.9544)

Beyond viewing the dot product operation as a mathematical tool that combines two vectors to yield a scalar value, the dot product is a measure of similarity because it quantifies how much two vectors are aligned: a higher dot product indicates a greater degree of alignment or similarity between the vectors. In the context of self-attention mechanisms, the dot product determines the extent to which elements in a sequence attend to each other: the higher the dot product, the higher the similarity and attention score between two elements.
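
To make this relationship concrete, here is a small optional sketch (not part of the original text) that compares the raw dot product with the cosine similarity, which is the same alignment measure normalized by the vector lengths:

a, b = inputs[0], inputs[1]
cos_sim = torch.dot(a, b) / (torch.norm(a) * torch.norm(b))
print(torch.dot(a, b))  # the raw alignment score, 0.9544, used as the attention score above
print(cos_sim)          # the same alignment, normalized to the range [-1, 1]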

In the next step, as shown in Figure 3.9, we normalize each of the attention scores that we computed previously.

Figure 3.9 After computing the attention scores ω21 to ω2T with respect to the input query x(2), the next step
is to obtain the attention weights α21 to α2T by normalizing the attention scores.

The main goal behind the normalization shown in Figure 3.9 is to obtain attention weights that sum up to 1. This normalization is a convention that is useful for interpretation and for maintaining training stability in an LLM. Here's a straightforward method for achieving this normalization step:

attn_weights_2_tmp = attn_scores_2 / attn_scores_2.sum()
print("Attention weights:", attn_weights_2_tmp)
print("Sum:", attn_weights_2_tmp.sum())

As the output shows, the attention weights now sum to 1:

Attention weights: tensor([0.1455, 0.2278, 0.2249, 0.1285, 0.1077, 0.1656])
Sum: tensor(1.0000)

In practice, it's more common and advisable to use the softmax function for normalization. This approach is better at managing extreme values and offers more favorable gradient properties during training. Below is a basic implementation of the softmax function for normalizing the attention scores:

def softmax_naive(x):
    return torch.exp(x) / torch.exp(x).sum(dim=0)

attn_weights_2_naive = softmax_naive(attn_scores_2)
print("Attention weights:", attn_weights_2_naive)
print("Sum:", attn_weights_2_naive.sum())

As the output shows, the softmax function also meets the objective and normalizes the attention weights such that they sum to 1:

Attention weights: tensor([0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581])
Sum: tensor(1.)

In addition, the softmax function ensures that the attention weights are always positive. This makes the output interpretable as probabilities or relative importance, where higher weights indicate greater importance.

Note that this naive softmax implementation ( softmax_naive ) may encounter numerical instability problems, such as overflow and underflow, when dealing with large or small input values. Therefore, in practice, it's advisable to use the PyTorch implementation of softmax, which has been extensively optimized for performance:

attn_weights_2 = torch.softmax(attn_scores_2, dim=0)
print("Attention weights:", attn_weights_2)
print("Sum:", attn_weights_2.sum())

In this case, we can see that it yields the same results as our previous softmax_naive function:

Attention weights: tensor([0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581])
Sum: tensor(1.)
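
To see why the naive version can break down, the following hedged sketch (not part of the original code) applies both functions to artificially large scores; the exponentiation in softmax_naive overflows, whereas PyTorch's numerically stable implementation does not:

large_scores = torch.tensor([1000.0, 1001.0])
print(softmax_naive(large_scores))         # tensor([nan, nan]) because torch.exp overflows to inf
print(torch.softmax(large_scores, dim=0))  # tensor([0.2689, 0.7311]), computed in a numerically stable way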

Kwv rcru wo tdupoecm uro zioaelmrnd tinetntoa ewsight, wk tsv reday tlx rpv fianl rxzb rdllitsueat jn
Vugrie 3.10: aiunlcaltcg ryv tcxetno creotv s(2) hd uiytigmnpll rbo dmeedebd npuit seontk, o(j), jwqr rxp
ndgoosrprcien eotntanti ewshgti nzg brkn ngmimsu rxg iteusnglr ovtersc.

Figure 3.10 The final step, after calculating and normalizing the attention scores to obtain the attention
weights for query x(2), is to compute the context vector z(2). This context vector is a combination of all input
vectors x(1) to x(T) weighted by the attention weights.

The context vector z(2) depicted in Figure 3.10 is calculated as a weighted sum of all input vectors. This involves multiplying each input vector by its corresponding attention weight:

query = inputs[1] # 2nd input token is the query
context_vec_2 = torch.zeros(query.shape)
for i, x_i in enumerate(inputs):
    context_vec_2 += attn_weights_2[i] * x_i
print(context_vec_2)

The results of this computation are as follows:

tensor([0.4419, 0.6515, 0.5683])


In the next section, we will generalize this procedure for computing context vectors to calculate all context vectors simultaneously.

3.3.2 Computing attention weights for all input tokens


In the previous section, we computed attention weights and the context vector for input 2, as shown in the highlighted row in Figure 3.11. Now, we are extending this computation to calculate attention weights and context vectors for all inputs.

Figure 3.11 The highlighted row shows the attention weights for the second input element as a query, as we
computed in the previous section. This section generalizes the computation to obtain all other attention
weights.

We follow the same three steps as before, as summarized in Figure 3.12, except that we make a few modifications in the code to compute all context vectors instead of only the second context vector, z(2).

Figure 3.12

First, in step 1, as illustrated in Figure 3.12, we add an additional for-loop to compute the dot products for all pairs of inputs.

attn_scores = torch.empty(6, 6)
for i, x_i in enumerate(inputs):
    for j, x_j in enumerate(inputs):
        attn_scores[i, j] = torch.dot(x_i, x_j)
print(attn_scores)

The resulting attention scores are as follows:

tensor([[0.9995, 0.9544, 0.9422, 0.4753, 0.4576, 0.6310],
        [0.9544, 1.4950, 1.4754, 0.8434, 0.7070, 1.0865],
        [0.9422, 1.4754, 1.4570, 0.8296, 0.7154, 1.0605],
        [0.4753, 0.8434, 0.8296, 0.4937, 0.3474, 0.6565],
        [0.4576, 0.7070, 0.7154, 0.3474, 0.6654, 0.2935],
        [0.6310, 1.0865, 1.0605, 0.6565, 0.2935, 0.9450]])

Each element in the preceding tensor represents an attention score between each pair of inputs, as illustrated in Figure 3.11. Note that the values in Figure 3.11 are normalized, which is why they differ from the unnormalized attention scores in the preceding tensor. We will take care of the normalization later.

When computing the preceding attention score tensor, we used for-loops in Python. However, for-loops are generally slow, and we can achieve the same results using matrix multiplication:

attn_scores = inputs @ inputs.T


print(attn_scores)

We can visually confirm that the results are the same as before:

tensor([[0.9995, 0.9544, 0.9422, 0.4753, 0.4576, 0.6310],
        [0.9544, 1.4950, 1.4754, 0.8434, 0.7070, 1.0865],
        [0.9422, 1.4754, 1.4570, 0.8296, 0.7154, 1.0605],
        [0.4753, 0.8434, 0.8296, 0.4937, 0.3474, 0.6565],
        [0.4576, 0.7070, 0.7154, 0.3474, 0.6654, 0.2935],
        [0.6310, 1.0865, 1.0605, 0.6565, 0.2935, 0.9450]])

In step 2, as illustrated in Figure 3.12, we now normalize each row so that the values in each row sum to 1:

attn_weights = torch.softmax(attn_scores, dim=1)
print(attn_weights)

This returns the following attention weight tensor that matches the values shown in Figure 3.10:

tensor([[0.2098, 0.2006, 0.1981, 0.1242, 0.1220, 0.1452],
        [0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581],
        [0.1390, 0.2369, 0.2326, 0.1242, 0.1108, 0.1565],
        [0.1435, 0.2074, 0.2046, 0.1462, 0.1263, 0.1720],
        [0.1526, 0.1958, 0.1975, 0.1367, 0.1879, 0.1295],
        [0.1385, 0.2184, 0.2128, 0.1420, 0.0988, 0.1896]])

Before we move on to step 3, the final step shown in Figure 3.12, let's briefly verify that the rows indeed all sum to 1:

row_2_sum = sum([0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581])
print("Row 2 sum:", row_2_sum)
print("All row sums:", attn_weights.sum(dim=1))

The result is as follows:

Row 2 sum: 1.0
All row sums: tensor([1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000])

In the third and last step, we now use these attention weights to compute all context vectors via matrix multiplication:

all_context_vecs = attn_weights @ inputs
print(all_context_vecs)

In the resulting output tensor, each row contains a 3-dimensional context vector:

tensor([[0.4421, 0.5931, 0.5790],
        [0.4419, 0.6515, 0.5683],
        [0.4431, 0.6496, 0.5671],
        [0.4304, 0.6298, 0.5510],
        [0.4671, 0.5910, 0.5266],
        [0.4177, 0.6503, 0.5645]])

We can double-check that the code is correct by comparing the 2nd row with the context vector z(2) that we computed previously in section 3.3.1:
print("Previous 2nd context vector:", context_vec_2)

Based on the result, we can see that the previously calculated context_vec_2 matches the second row in the previous tensor exactly:

Previous 2nd context vector: tensor([0.4419, 0.6515, 0.5683])

This concludes the code walkthrough of a simple self-attention mechanism. In the next section, we will add trainable weights, enabling the LLM to learn from data and improve its performance on specific tasks.

3.4 Implementing self-attention with trainable weights


In this section, we are implementing the self-attention mechanism that is used in the original transformer architecture, the GPT models, and most other popular LLMs. This self-attention mechanism is also called scaled dot-product attention. Figure 3.13 provides a mental model illustrating how this self-attention mechanism fits into the broader context of implementing an LLM.

Figure 3.13 A mental model illustrating how the self-attention mechanism we code in this section fits into the
broader context of this book and chapter. In the previous section, we coded a simplified attention mechanism
to understand the basic mechanism behind attention mechanisms. In this section, we add trainable weights
to this attention mechanism. In the upcoming sections, we will then extend this self-attention mechanism by
adding a causal mask and multiple heads.
As illustrated in Figure 3.13, the self-attention mechanism with trainable weights builds on the previous concepts: we want to compute context vectors as weighted sums over the input vectors specific to a certain input element. As you will see, there are only slight differences compared to the basic self-attention mechanism we coded earlier in section 3.3.

The most notable difference is the introduction of weight matrices that are updated during model training. These trainable weight matrices are crucial so that the model (specifically, the attention module inside the model) can learn to produce "good" context vectors. (Note that we will train the LLM in chapter 5.)

We will tackle this self-attention mechanism in two subsections. First, we will code it step-by-step as before. Second, we will organize the code into a compact Python class that can be imported into an LLM architecture, which we will code in chapter 4.

3.4.1 Computing the attention weights step by step


We will implement the self-attention mechanism step by step by introducing the three trainable weight matrices Wq, Wk, and Wv. These three matrices are used to project the embedded input tokens, x(i), into query, key, and value vectors, as illustrated in Figure 3.14.

Figure 3.14 In the first step of the self-attention mechanism with trainable weight matrices, we compute query
(q), key (k), and value (v) vectors for input elements x. Similar to previous sections, we designate the second
input, x(2), as the query input. The query vector q(2) is obtained via matrix multiplication between the input x(2)
and the weight matrix Wq. Similarly, we obtain the key and value vectors via matrix multiplication involving the
weight matrices Wk and Wv.

Earlier in section 3.3.1, we defined the second input element x(2) as the query when we computed the simplified attention weights to compute the context vector z(2). Later, in section 3.3.2, we generalized this to compute all context vectors z(1) ... z(T) for the six-word input sentence "Your journey starts with one step."

Similarly, we will start by computing only one context vector, z(2), for illustration purposes. In the next section, we will modify this code to calculate all context vectors.

Let's begin by defining a few variables:

x_2 = inputs[1]         # the second input element
d_in = inputs.shape[1]  # the input embedding size, d_in=3
d_out = 2               # the output embedding size, d_out=2

Note that in GPT-like models, the input and output dimensions are usually the same, but for illustration purposes, to better follow the computation, we choose different input ( d_in=3 ) and output ( d_out=2 ) dimensions here.

Next, we initialize the three weight matrices Wq, Wk, and Wv that are shown in Figure 3.14:

torch.manual_seed(123)
W_query = torch.nn.Parameter(torch.rand(d_in, d_out), requires_grad=False)
W_key = torch.nn.Parameter(torch.rand(d_in, d_out), requires_grad=False)
W_value = torch.nn.Parameter(torch.rand(d_in, d_out), requires_grad=False)

Note that we are setting requires_grad=False to reduce clutter in the outputs for illustration purposes, but if we were to use the weight matrices for model training, we would set requires_grad=True to update these matrices during model training.

Next, we compute the query, key, and value vectors as shown earlier in Figure 3.14:

query_2 = x_2 @ W_query
key_2 = x_2 @ W_key
value_2 = x_2 @ W_value
print(query_2)

As we can see based on the output for the query, this results in a 2-dimensional vector since we set the number of columns of the corresponding weight matrix, via d_out , to 2:

tensor([0.4306, 1.4551])

Weight parameters vs attention weights


Note that in the weight matrices W, the term "weight" is short for "weight parameters," the values of a neural network that are optimized during training. This is not to be confused with the attention weights. As we already saw in the previous section, attention weights determine the extent to which a context vector depends on the different parts of the input, i.e., to what extent the network focuses on different parts of the input.

In summary, weight parameters are the fundamental, learned coefficients that define the network's connections, while attention weights are dynamic, context-specific values.
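
As a small illustration of this distinction (a sketch, not part of the original text), we can compare the shapes of the two kinds of "weights" created so far: the weight matrix W_query is a fixed-size learned parameter, while the attention weights from section 3.3.1 have one entry per input token and are recomputed for every new input:

print(W_query.shape)        # torch.Size([3, 2]): learned parameters, independent of the input
print(attn_weights_2.shape) # torch.Size([6]): one attention weight per input token, recomputed per input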

Even though our temporary goal is to only compute the one context vector, z(2), we still require the key and value vectors for all input elements as they are involved in computing the attention weights with respect to the query q(2), as illustrated in Figure 3.14.

We can obtain all keys and values via matrix multiplication:

keys = inputs @ W_key
values = inputs @ W_value
print("keys.shape:", keys.shape)
print("values.shape:", values.shape)

As we can tell from the outputs, we successfully projected the 6 input tokens from a 3D onto a 2D embedding space:

keys.shape: torch.Size([6, 2])
values.shape: torch.Size([6, 2])

The second step is now to compute the attention scores, as shown in Figure 3.15.
Figure 3.15 The attention score computation is a dot-product computation similar to what we have used in the
simplified self-attention mechanism in section 3.3. The new aspect here is that we are not directly computing
the dot-product between the input elements but using the query and key obtained by transforming the inputs
via the respective weight matrices.

First, let's compute the attention score ω22:

keys_2 = keys[1]  # the key vector corresponding to the second input token
attn_score_22 = query_2.dot(keys_2)
print(attn_score_22)

This results in the following unnormalized attention score:

tensor(1.8524)

Again, we can generalize this computation to all attention scores via matrix multiplication:

attn_scores_2 = query_2 @ keys.T # All attention scores for given query
print(attn_scores_2)

As a quick check, we can see that the second element in the output matches the attn_score_22 we computed previously:

tensor([1.2705, 1.8524, 1.8111, 1.0795, 0.5577, 1.5440])

The third step is now going from the attention scores to the attention weights, as illustrated in Figure 3.16.
Figure 3.16 After computing the attention scores ω, the next step is to normalize these scores using the
softmax function to obtain the attention weights α.

Next, as illustrated in Figure 3.16, we compute the attention weights by scaling the attention scores and using the softmax function we used earlier. The difference to earlier is that we now scale the attention scores by dividing them by the square root of the embedding dimension of the keys (note that taking the square root is mathematically the same as exponentiating by 0.5):

d_k = keys.shape[-1]
attn_weights_2 = torch.softmax(attn_scores_2 / d_k**0.5, dim=-1)
print(attn_weights_2)

The resulting attention weights are as follows:

tensor([0.1500, 0.2264, 0.2199, 0.1311, 0.0906, 0.1820])

The rationale behind scaled-dot product attention

Cbk saoern tlv vbr mozirlaationn pb krq mendbideg mndisione cvjz aj kr rieompv rdx ngrtinia
nracfmrpoee du vigiando smlal gerntiasd. Eet nasincte, vwgn sacnlig pd ory neiedgmdb
denismoin, whchi ja altpliycy raetger rcng uahnsodt ltv QVR-vjfx EVWc, ergla ehr ducsoprt zcn
uetsrl jn tvux mslal rntisaged ngduir apopbcokitaangr vup xr pxr afmosxt coiufnnt papeild xr rxmb.
Ta krp drcpsuto eeacnrsi, dvr sfoxtam ncftuoin svabhee xxmt kfje z rhak otuinncf, tirneusgl nj
netrgdsia narigne tske. Ybxck lalms dginarset zcn dslaliatycr wkcf vwpn elgirnna tx cause rntiinag
vr engtaast.
Aob agnsilc hh gxr seauqr xrte xl roy nemddbige enonmisid aj vrq arsnoe hpw zjpr vlzf-tttaeoinn
eahmnmics ja fazx ledlca sdlcea-rkh routcpd ttainetno.

Now, the final step is to compute the context vectors, as illustrated in Figure 3.17.

Figure 3.17 In the final step of the self-attention computation, we compute the context vector by combining all
value vectors via the attention weights.

Similar to section 3.3, where we computed the context vector as a weighted sum over the input vectors, we now compute the context vector as a weighted sum over the value vectors. Here, the attention weights serve as a weighting factor that weighs the respective importance of each value vector. Similar to section 3.3, we can use matrix multiplication to obtain the output in one step:

context_vec_2 = attn_weights_2 @ values
print(context_vec_2)

The contents of the resulting vector are as follows:

tensor([0.3061, 0.8210])

So far, we only computed a single context vector, z(2). In the next section, we will generalize the code to compute all context vectors in the input sequence, z(1) to z(T).
Why query, key, and value?

The terms "key," "query," and "value" in the context of attention mechanisms are borrowed from the domain of information retrieval and databases, where similar concepts are used to store, search, and retrieve information.

A "query" is analogous to a search query in a database. It represents the current item (e.g., a word or token in a sentence) the model focuses on or tries to understand. The query is used to probe the other parts of the input sequence to determine how much attention to pay to them.

The "key" is like a database key used for indexing and searching. In the attention mechanism, each item in the input sequence (e.g., each word in a sentence) has an associated key. These keys are used to match with the query.

The "value" in this context is similar to the value in a key-value pair in a database. It represents the actual content or representation of the input items. Once the model determines which keys (and thus which parts of the input) are most relevant to the query (the current focus item), it retrieves the corresponding values.

3.4.2 Implementing a compact self-attention Python class


In the previous sections, we have gone through a lot of steps to compute the self-attention outputs. This was mainly done for illustration purposes so we could go through one step at a time. In practice, with the LLM implementation in the next chapter in mind, it is helpful to organize this code into a Python class as follows:
Listing 3.1 A compact self-attention class

import torch.nn as nn

class SelfAttention_v1(nn.Module):
    def __init__(self, d_in, d_out):
        super().__init__()
        self.d_out = d_out
        self.W_query = nn.Parameter(torch.rand(d_in, d_out))
        self.W_key = nn.Parameter(torch.rand(d_in, d_out))
        self.W_value = nn.Parameter(torch.rand(d_in, d_out))

    def forward(self, x):
        keys = x @ self.W_key
        queries = x @ self.W_query
        values = x @ self.W_value
        attn_scores = queries @ keys.T  # omega
        attn_weights = torch.softmax(
            attn_scores / keys.shape[-1]**0.5, dim=-1)
        context_vec = attn_weights @ values
        return context_vec

In this PyTorch code, SelfAttention_v1 is a class derived from nn.Module , which is a fundamental building block of PyTorch models that provides the necessary functionalities for model layer creation and management.

The __init__ method initializes the trainable weight matrices ( W_query , W_key , and W_value ) for queries, keys, and values, each transforming the input dimension d_in to an output dimension d_out .

During the forward pass, using the forward method, we compute the attention scores ( attn_scores ) by multiplying queries and keys, normalizing these scores using softmax. Finally, we create a context vector by weighting the values with these normalized attention scores.

We can use this class as follows:

torch.manual_seed(123)
sa_v1 = SelfAttention_v1(d_in, d_out)
print(sa_v1(inputs))

Since inputs contains six embedding vectors, this results in a matrix storing the six context vectors:

tensor([[0.2996, 0.8053],
        [0.3061, 0.8210],
        [0.3058, 0.8203],
        [0.2948, 0.7939],
        [0.2927, 0.7891],
        [0.2990, 0.8040]], grad_fn=<MmBackward0>)

As a quick check, notice how the second row ( [0.3061, 0.8210] ) matches the contents of context_vec_2 in the previous section.

Figure 3.18 summarizes the self-attention mechanism we just implemented.

Figure 3.18 In self-attention, we transform the input vectors in the input matrix X with the three weight
matrices, Wq, Wk, and Wv. Then, we compute the attention weight matrix based on the resulting queries (Q)
and keys (K). Using the attention weights and values (V), we then compute the context vectors (Z). (For visual
clarity, we focus on a single input text with n tokens in this figure, not a batch of multiple inputs. Consequently,
the 3D input tensor is simplified to a 2D matrix in this context. This approach allows for a more
straightforward visualization and understanding of the processes involved.)
As shown in Figure 3.18, self-attention involves the trainable weight matrices Wq, Wk, and Wv. These matrices transform input data into queries, keys, and values, which are crucial components of the attention mechanism. As the model is exposed to more data during training, it adjusts these trainable weights, as we will see in upcoming chapters.

We can improve the SelfAttention_v1 implementation further by utilizing PyTorch's nn.Linear layers, which effectively perform matrix multiplication when the bias units are disabled. Additionally, a significant advantage of using nn.Linear instead of manually implementing nn.Parameter(torch.rand(...)) is that nn.Linear has an optimized weight initialization scheme, contributing to more stable and effective model training.

Listing 3.2 A self-attention class using PyTorch's Linear layers

class SelfAttention_v2(nn.Module):
    def __init__(self, d_in, d_out, qkv_bias=False):
        super().__init__()
        self.d_out = d_out
        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)

    def forward(self, x):
        keys = self.W_key(x)
        queries = self.W_query(x)
        values = self.W_value(x)
        attn_scores = queries @ keys.T
        attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=1)
        context_vec = attn_weights @ values
        return context_vec

You can use the SelfAttention_v2 similar to SelfAttention_v1:

torch.manual_seed(789)
sa_v2 = SelfAttention_v2(d_in, d_out)
print(sa_v2(inputs))

The output is:


tensor([[-0.0739, 0.0713],
        [-0.0748, 0.0703],
        [-0.0749, 0.0702],
        [-0.0760, 0.0685],
        [-0.0763, 0.0679],
        [-0.0754, 0.0693]], grad_fn=<MmBackward0>)

Note that SelfAttention_v1 and SelfAttention_v2 give different outputs because they use different initial weights for the weight matrices since nn.Linear uses a more sophisticated weight initialization scheme.

Exercise 3.1 Comparing SelfAttention_v1 and SelfAttention_v2

Note that nn.Linear in SelfAttention_v2 uses a different weight initialization scheme than nn.Parameter(torch.rand(d_in, d_out)) used in SelfAttention_v1 , which causes both mechanisms to produce different results. To check that both implementations, SelfAttention_v1 and SelfAttention_v2 , are otherwise similar, we can transfer the weight matrices from a SelfAttention_v2 object to a SelfAttention_v1 , such that both objects then produce the same results.

Your task is to correctly assign the weights from an instance of SelfAttention_v2 to an instance of SelfAttention_v1 . To do this, you need to understand the relationship between the weights in both versions. (Hint: nn.Linear stores the weight matrix in a transposed form.) After the assignment, you should observe that both instances produce the same outputs.
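
One possible way to approach this (a sketch based on the stated hint, not necessarily the book's official solution) is to copy the transposed nn.Linear weights into the nn.Parameter matrices, since nn.Linear stores its weight as a (d_out, d_in) matrix while SelfAttention_v1 expects (d_in, d_out):

sa_v1.W_query = torch.nn.Parameter(sa_v2.W_query.weight.T)
sa_v1.W_key = torch.nn.Parameter(sa_v2.W_key.weight.T)
sa_v1.W_value = torch.nn.Parameter(sa_v2.W_value.weight.T)
print(sa_v1(inputs))  # should now match the output of sa_v2(inputs)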

In the next sections, we will make enhancements to the self-attention mechanism, focusing specifically on incorporating causal and multi-head elements. The causal aspect involves modifying the attention mechanism to prevent the model from accessing future information in the sequence, which is crucial for tasks like language modeling, where each word prediction should only depend on previous words.

The multi-head component involves splitting the attention mechanism into multiple "heads." Each head learns different aspects of the data, allowing the model to simultaneously attend to information from different representation subspaces at different positions. This improves the model's performance in complex tasks.

3.5 Hiding future words with causal attention


In this section, we modify the standard self-attention mechanism to create a causal attention mechanism, which is essential for developing an LLM in the subsequent chapters.

Causal attention, also known as masked attention, is a specialized form of self-attention. It restricts a model to only consider previous and current inputs in a sequence when processing any given token. This is in contrast to the standard self-attention mechanism, which allows access to the entire input sequence at once.

Consequently, when computing attention scores, the causal attention mechanism ensures that the model only factors in tokens that occur at or before the current token in the sequence.

To achieve this in GPT-like LLMs, for each token processed, we mask out the future tokens, which come after the current token in the input text, as illustrated in Figure 3.19.

Figure 3.19 In causal attention, we mask out the attention weights above the diagonal such that for a given
input, the LLM can't access future tokens when computing the context vectors using the attention weights.
For example, for the word "journey" in the second row, we only keep the attention weights for the words before
("Your") and in the current position ("journey").

As illustrated in Figure 3.19, we mask out the attention weights above the diagonal, and we normalize the non-masked attention weights, such that the attention weights sum to 1 in each row. In the next section, we will implement this masking and normalization procedure in code.

3.5.1 Applying a causal attention mask


In this section, we implement the causal attention mask in code. We start with the procedure summarized in Figure 3.20.
Figure 3.20 One way to obtain the masked attention weight matrix in causal attention is to apply the softmax
function to the attention scores, zeroing out the elements above the diagonal and normalizing the resulting
matrix.

To implement the steps to apply a causal attention mask to obtain the masked attention weights, as summarized in Figure 3.20, let's work with the attention scores and weights from the previous section to code the causal attention mechanism.

In the first step illustrated in Figure 3.20, we compute the attention weights using the softmax function as we have done in previous sections:

queries = sa_v2.W_query(inputs)
keys = sa_v2.W_key(inputs)
attn_scores = queries @ keys.T
attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=1)
print(attn_weights)

This results in the following attention weights:

tensor([[0.1921, 0.1646, 0.1652, 0.1550, 0.1721, 0.1510],
        [0.2041, 0.1659, 0.1662, 0.1496, 0.1665, 0.1477],
        [0.2036, 0.1659, 0.1662, 0.1498, 0.1664, 0.1480],
        [0.1869, 0.1667, 0.1668, 0.1571, 0.1661, 0.1564],
        [0.1830, 0.1669, 0.1670, 0.1588, 0.1658, 0.1585],
        [0.1935, 0.1663, 0.1666, 0.1542, 0.1666, 0.1529]],
       grad_fn=<SoftmaxBackward0>)

We can implement step 2 in Figure 3.20 using PyTorch's tril function to create a mask where the values above the diagonal are zero:

context_length = attn_scores.shape[0]
mask_simple = torch.tril(torch.ones(context_length, context_length))
print(mask_simple)
The resulting mask is as follows:

tensor([[1., 0., 0., 0., 0., 0.],
        [1., 1., 0., 0., 0., 0.],
        [1., 1., 1., 0., 0., 0.],
        [1., 1., 1., 1., 0., 0.],
        [1., 1., 1., 1., 1., 0.],
        [1., 1., 1., 1., 1., 1.]])

Now, we can multiply this mask with the attention weights to zero out the values above the diagonal:

masked_simple = attn_weights * mask_simple
print(masked_simple)

As we can see, the elements above the diagonal are successfully zeroed out:

tensor([[0.1921, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2041, 0.1659, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2036, 0.1659, 0.1662, 0.0000, 0.0000, 0.0000],
        [0.1869, 0.1667, 0.1668, 0.1571, 0.0000, 0.0000],
        [0.1830, 0.1669, 0.1670, 0.1588, 0.1658, 0.0000],
        [0.1935, 0.1663, 0.1666, 0.1542, 0.1666, 0.1529]],
       grad_fn=<MulBackward0>)

The third step in Figure 3.20 is to renormalize the attention weights to sum up to 1 again in each row. We can achieve this by dividing each element in each row by the sum in that row:

row_sums = masked_simple.sum(dim=1, keepdim=True)
masked_simple_norm = masked_simple / row_sums
print(masked_simple_norm)

The result is an attention weight matrix where the attention weights above the diagonal are zeroed out and where the rows sum to 1:

tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.5517, 0.4483, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.3800, 0.3097, 0.3103, 0.0000, 0.0000, 0.0000],
        [0.2758, 0.2460, 0.2462, 0.2319, 0.0000, 0.0000],
        [0.2175, 0.1983, 0.1984, 0.1888, 0.1971, 0.0000],
        [0.1935, 0.1663, 0.1666, 0.1542, 0.1666, 0.1529]],
       grad_fn=<DivBackward0>)

Information leakage

When we apply a mask and then renormalize the attention weights, it might initially appear that information from future tokens (which we intend to mask) could still influence the current token because their values are part of the softmax calculation. However, the key insight is that when we renormalize the attention weights after masking, what we're essentially doing is recalculating the softmax over a smaller subset (since masked positions don't contribute to the softmax value).

The mathematical elegance of softmax is that despite initially including all positions in the denominator, after masking and renormalizing, the effect of the masked positions is nullified: they don't contribute to the softmax score in any meaningful way.

In simpler terms, after masking and renormalization, the distribution of attention weights is as if it was calculated only among the unmasked positions to begin with. This ensures there's no information leakage from future (or otherwise masked) tokens as we intended.
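
The following small numerical check (a sketch, not part of the original text) illustrates this equivalence: masking the last score after the softmax and renormalizing gives the same distribution as applying the softmax to the unmasked scores only:

scores = torch.tensor([0.4656, 0.1723, 0.9000])  # assume only the first two positions are visible
full = torch.softmax(scores, dim=0)
renormalized = full[:2] / full[:2].sum()   # mask the last position, then renormalize
direct = torch.softmax(scores[:2], dim=0)  # softmax over the unmasked subset only
print(renormalized, direct)                # both print the same distribution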

While we could be technically done with implementing causal attention at this point, we can take advantage of a mathematical property of the softmax function and implement the computation of the masked attention weights more efficiently in fewer steps, as shown in Figure 3.21.

Figure 3.21 A more efficient way to obtain the masked attention weight matrix in causal attention is to mask
the attention scores with negative infinity values before applying the softmax function.

The softmax function converts its inputs into a probability distribution. When negative infinity values (-∞) are present in a row, the softmax function treats them as zero probability. (Mathematically, this is because e^(-∞) approaches 0.)

We can implement this more efficient masking "trick" by creating a mask with 1's above the diagonal and then replacing these 1's with negative infinity ( -inf ) values:

mask = torch.triu(torch.ones(context_length, context_length), diagonal=1)
masked = attn_scores.masked_fill(mask.bool(), -torch.inf)
print(masked)

This results in the following mask:

tensor([[0.2899,   -inf,   -inf,   -inf,   -inf,   -inf],
        [0.4656, 0.1723,   -inf,   -inf,   -inf,   -inf],
        [0.4594, 0.1703, 0.1731,   -inf,   -inf,   -inf],
        [0.2642, 0.1024, 0.1036, 0.0186,   -inf,   -inf],
        [0.2183, 0.0874, 0.0882, 0.0177, 0.0786,   -inf],
        [0.3408, 0.1270, 0.1290, 0.0198, 0.1290, 0.0078]],
       grad_fn=<MaskedFillBackward0>)

Now, all we need to do is apply the softmax function to these masked results, and we are done:

attn_weights = torch.softmax(masked / keys.shape[-1]**0.5, dim=1)
print(attn_weights)

As we can see based on the output, the values in each row sum to 1, and no further normalization is necessary:

tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.5517, 0.4483, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.3800, 0.3097, 0.3103, 0.0000, 0.0000, 0.0000],
        [0.2758, 0.2460, 0.2462, 0.2319, 0.0000, 0.0000],
        [0.2175, 0.1983, 0.1984, 0.1888, 0.1971, 0.0000],
        [0.1935, 0.1663, 0.1666, 0.1542, 0.1666, 0.1529]],
       grad_fn=<SoftmaxBackward0>)

We could now use the modified attention weights to compute the context vectors via context_vec = attn_weights @ values , as in section 3.4. However, in the next section, we first cover another minor tweak to the causal attention mechanism that is useful for reducing overfitting when training LLMs.

3.5.2 Masking additional attention weights with dropout


Dropout in deep learning is a technique where randomly selected hidden layer units are ignored during training, effectively "dropping" them out. This method helps prevent overfitting by ensuring that a model does not become overly reliant on any specific set of hidden layer units. It's important to emphasize that dropout is only used during training and is disabled afterward.

In the transformer architecture, including models like GPT, dropout in the attention mechanism is typically applied in two specific areas: after calculating the attention scores or after applying the attention weights to the value vectors.

Here, we will apply the dropout mask after computing the attention weights, as illustrated in Figure 3.22, because it's the more common variant in practice.

Figure 3.22 Using the causal attention mask (upper left), we apply an additional dropout mask (upper right) to
zero out additional attention weights to reduce overfitting during training.

In the following code example, we use a dropout rate of 50%, which means masking out half of the attention weights. (When we train the GPT model in later chapters, we will use a lower dropout rate, such as 0.1 or 0.2.)

In the following code, we apply PyTorch's dropout implementation first to a 6×6 tensor consisting of ones for illustration purposes:

torch.manual_seed(123)
dropout = torch.nn.Dropout(0.5)  # a dropout rate of 50%
example = torch.ones(6, 6)       # a 6x6 matrix of ones for illustration
print(dropout(example))

As we can see, approximately half of the values are zeroed out:

tensor([[2., 2., 0., 2., 2., 0.],
        [0., 0., 0., 2., 0., 2.],
        [2., 2., 2., 2., 0., 2.],
        [0., 2., 2., 0., 0., 2.],
        [0., 2., 0., 2., 0., 2.],
        [0., 2., 2., 2., 2., 0.]])

When applying dropout to an attention weight matrix with a rate of 50%, half of the elements in the matrix are randomly set to zero. To compensate for the reduction in active elements, the values of the remaining elements in the matrix are scaled up by a factor of 1/0.5 = 2. This scaling is crucial to maintain the overall balance of the attention weights, ensuring that the average influence of the attention mechanism remains consistent during both the training and inference phases.
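
As a quick sanity check (a sketch, not part of the original text), we can verify that this 1/(1 - p) scaling preserves the expected value of the inputs on average:

torch.manual_seed(123)
big_example = torch.ones(1000, 1000)
print(dropout(big_example).mean())  # close to 1.0: about half the entries become 0, the rest become 2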

Now, let's apply dropout to the attention weight matrix itself:

torch.manual_seed(123)
print(dropout(attn_weights))

The resulting attention weight matrix now has additional elements zeroed out and the remaining ones rescaled:

tensor([[2.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.7599, 0.6194, 0.6206, 0.0000, 0.0000, 0.0000],
        [0.0000, 0.4921, 0.4925, 0.0000, 0.0000, 0.0000],
        [0.0000, 0.3966, 0.0000, 0.3775, 0.0000, 0.0000],
        [0.0000, 0.3327, 0.3331, 0.3084, 0.3331, 0.0000]],
       grad_fn=<MulBackward0>)

Note that the resulting dropout outputs may look different depending on your operating system; you can read more about this inconsistency on the PyTorch issue tracker at https://github.com/pytorch/pytorch/issues/121595.

Having gained an understanding of causal attention and dropout masking, we will develop a concise Python class in the following section. This class is designed to facilitate the efficient application of these two techniques.

3.5.3 Implementing a compact causal attention class


In this section, we will now incorporate the causal attention and dropout modifications into the SelfAttention Python class we developed in section 3.4. This class will then serve as a template for developing multi-head attention in the upcoming section, which is the final attention class we implement in this chapter.

But before we begin, one more thing is to ensure that the code can handle batches consisting of more than one input so that the CausalAttention class supports the batch outputs produced by the data loader we implemented in chapter 2.

For simplicity, to simulate such batch inputs, we duplicate the input text example:

batch = torch.stack((inputs, inputs), dim=0)  # two copies of the inputs stacked along a new batch dimension
print(batch.shape)

This results in a 3D tensor consisting of 2 input texts with 6 tokens each, where each token is a 3-dimensional embedding vector:

torch.Size([2, 6, 3])

The following CausalAttention class is similar to the SelfAttention class we implemented earlier, except that we now added the dropout and causal mask components as highlighted in the following code:
Listing 3.3 A compact causal attention class

class CausalAttention(nn.Module):
    def __init__(self, d_in, d_out, context_length, dropout, qkv_bias=False):
        super().__init__()
        self.d_out = d_out
        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.dropout = nn.Dropout(dropout)  # a dropout layer, new compared to the earlier self-attention classes
        self.register_buffer(               # the causal mask is registered as a buffer (see the explanation below)
            'mask',
            torch.triu(torch.ones(context_length, context_length),
                       diagonal=1)
        )

    def forward(self, x):
        b, num_tokens, d_in = x.shape  # new batch dimension b
        keys = self.W_key(x)
        queries = self.W_query(x)
        values = self.W_value(x)

        attn_scores = queries @ keys.transpose(1, 2)  # transpose dims 1 and 2, keeping the batch dimension at position 0
        attn_scores.masked_fill_(
            self.mask.bool()[:num_tokens, :num_tokens], -torch.inf)
        attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)
        attn_weights = self.dropout(attn_weights)

        context_vec = attn_weights @ values
        return context_vec

While all added code lines should be familiar from previous sections, we now added a self.register_buffer() call in the __init__ method. The use of register_buffer in PyTorch is not strictly necessary for all use cases but offers several advantages here. For instance, when we use the CausalAttention class in our LLM, buffers are automatically moved to the appropriate device (CPU or GPU) along with our model, which will be relevant when training the LLM in future chapters. This means we don't need to manually ensure these tensors are on the same device as the model parameters, avoiding device mismatch errors.
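
As a small illustration (a sketch, not part of the original text), the registered mask shows up in the module's state_dict and travels with the module when it is moved to another device:

ca_tmp = CausalAttention(d_in, d_out, context_length=6, dropout=0.0)  # ca_tmp is a throwaway instance for illustration
print("mask" in ca_tmp.state_dict())  # True, buffers are stored alongside the parameters
# ca_tmp.to("cuda")                   # would move the mask together with the weight matrices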

We can use the CausalAttention class as follows, similar to SelfAttention previously:


torch.manual_seed(123)
context_length = batch.shape[1]
ca = CausalAttention(d_in, d_out, context_length, 0.0)
context_vecs = ca(batch)
print("context_vecs.shape:", context_vecs.shape)

The resulting context vector is a 3D tensor where each token is now represented by a 2D embedding:

context_vecs.shape: torch.Size([2, 6, 2])

Figure 3.23 provides a mental model that summarizes what we have accomplished so far.

Figure 3.23 A mental model summarizing the four different attention modules we are coding in this chapter.
We began with a simplified attention mechanism, added trainable weights, and then added a causal attention
mask. In the remainder of this chapter, we will extend the causal attention mechanism and code multi-head
attention, which is the final module we will use in the LLM implementation in the next chapter.

As illustrated in Figure 3.23, in this section, we focused on the concept and implementation of causal attention in neural networks. In the next section, we will expand on this concept and implement a multi-head attention module that implements several of such causal attention mechanisms in parallel.

3.6 Extending single-head attention to multi-head attention


In this final section of this chapter, we are extending the previously implemented causal attention class over multiple heads. This is also called multi-head attention.

The term "multi-head" refers to dividing the attention mechanism into multiple "heads," each operating independently. In this context, a single causal attention module can be considered single-head attention, where there is only one set of attention weights processing the input sequentially.

In the following subsections, we will tackle this expansion from causal attention to multi-head attention. The first subsection will intuitively build a multi-head attention module by stacking multiple CausalAttention modules for illustration purposes. The second subsection will then implement the same multi-head attention module in a more complicated but computationally more efficient way.

3.6.1 Stacking multiple single-head attention layers


In practical terms, implementing multi-head attention involves creating multiple instances of the self-
attention mechanism (depicted earlier in Figure 3.18 in section 3.4.1), each with its own weights, and
then combining their outputs. Using multiple instances of the self-attention mechanism can be
computationally intensive, but it's crucial for the kind of complex pattern recognition that models like
transformer-based LLMs are known for.

Figure 3.24 illustrates the structure of a multi-head attention module, which consists of multiple single-
head attention modules, as previously depicted in Figure 3.18, stacked on top of each other.

Figure 3.24 The multi-head attention module in this figure depicts two single-head attention modules stacked
on top of each other. So, instead of using a single matrix Wv for computing the value matrices, in a multi-head
attention module with two heads, we now have two value weight matrices: Wv1 and Wv2. The same applies to
the other weight matrices, Wq and Wk. We obtain two sets of context vectors Z1 and Z2 that we can combine
into a single context vector matrix Z.

As mentioned before, the main idea behind multi-head attention is to run the attention mechanism
multiple times (in parallel) with different, learned linear projections -- the results of multiplying the input
data (like the query, key, and value vectors in attention mechanisms) by a weight matrix.

In code, we can achieve this by implementing a simple MultiHeadAttentionWrapper class that
stacks multiple instances of our previously implemented CausalAttention module:
Listing 3.4 A wrapper class to implement multi-head attention

class MultiHeadAttentionWrapper(nn.Module):
    def __init__(self, d_in, d_out, context_length,
                 dropout, num_heads, qkv_bias=False):
        super().__init__()
        self.heads = nn.ModuleList(
            [CausalAttention(d_in, d_out, context_length, dropout, qkv_bias)
             for _ in range(num_heads)]
        )

    def forward(self, x):
        return torch.cat([head(x) for head in self.heads], dim=-1)

For example, if we use this MultiHeadAttentionWrapper class with two attention heads (via
num_heads=2 ) and CausalAttention output dimension d_out=2 , this results in a 4-dimensional
context vector ( d_out*num_heads=4 ), as illustrated in Figure 3.25.

Figure 3.25 Using the MultiHeadAttentionWrapper , we specified the number of attention heads
( num_heads ). If we set num_heads=2 , as shown in this figure, we obtain a tensor with two sets of
context vector matrices. In each context vector matrix, the rows represent the context vectors corresponding
to the tokens, and the columns correspond to the embedding dimension specified via d_out=4 . We
concatenate these context vector matrices along the column dimension. Since we have 2 attention heads and
an embedding dimension of 2, the final embedding dimension is 2 × 2 = 4.

To illustrate Figure 3.25 further with a concrete example, we can use the
MultiHeadAttentionWrapper class similar to the CausalAttention class before:
torch.manual_seed(123)
context_length = batch.shape[1]  # This is the number of tokens
d_in, d_out = 3, 2
mha = MultiHeadAttentionWrapper(d_in, d_out, context_length, 0.0, num_heads=2)
context_vecs = mha(batch)

print(context_vecs)
print("context_vecs.shape:", context_vecs.shape)

This results in the following tensor representing the context vectors:

tensor([[[-0.4519,  0.2216,  0.4772,  0.1063],
         [-0.5874,  0.0058,  0.5891,  0.3257],
         [-0.6300, -0.0632,  0.6202,  0.3860],
         [-0.5675, -0.0843,  0.5478,  0.3589],
         [-0.5526, -0.0981,  0.5321,  0.3428],
         [-0.5299, -0.1081,  0.5077,  0.3493]],

        [[-0.4519,  0.2216,  0.4772,  0.1063],
         [-0.5874,  0.0058,  0.5891,  0.3257],
         [-0.6300, -0.0632,  0.6202,  0.3860],
         [-0.5675, -0.0843,  0.5478,  0.3589],
         [-0.5526, -0.0981,  0.5321,  0.3428],
         [-0.5299, -0.1081,  0.5077,  0.3493]]], grad_fn=<CatBackward0>)
context_vecs.shape: torch.Size([2, 6, 4])

The first dimension of the resulting context_vecs tensor is 2 since we have two input texts (the input
texts are duplicated, which is why the context vectors are exactly the same for those). The second
dimension refers to the 6 tokens in each input. The third dimension refers to the 4-dimensional
embedding of each token.

Exercise 3.2 Returning 2-dimensional embedding vectors

Change the input arguments for the MultiHeadAttentionWrapper(..., num_heads=2)
call such that the output context vectors are 2-dimensional instead of 4-dimensional while
keeping the setting num_heads=2 . Hint: You don't have to modify the class implementation; you
just have to change one of the other input arguments.

In this section, we implemented a MultiHeadAttentionWrapper that combined multiple single-head
attention modules. However, note that these are processed sequentially via [head(x) for head in
self.heads] in the forward method. We can improve this implementation by processing the heads in
parallel. One way to achieve this is by computing the outputs for all attention heads simultaneously via
matrix multiplication, as we will explore in the next section.

3.6.2 Implementing multi-head attention with weight splits


In the previous section, we created a MultiHeadAttentionWrapper to implement multi-head
attention by stacking multiple single-head attention modules. This was done by instantiating and
combining several CausalAttention objects.

Instead of maintaining two separate classes, MultiHeadAttentionWrapper and
CausalAttention , we can combine both of these concepts into a single MultiHeadAttention
class. Also, in addition to just merging the MultiHeadAttentionWrapper with the
CausalAttention code, we will make some other modifications to implement multi-head attention
more efficiently.

In the MultiHeadAttentionWrapper , multiple heads are implemented by creating a list of
CausalAttention objects ( self.heads ), each representing a separate attention head. The
CausalAttention class independently performs the attention mechanism, and the results from each
head are concatenated. In contrast, the following MultiHeadAttention class integrates the multi-
head functionality within a single class. It splits the input into multiple heads by reshaping the projected
query, key, and value tensors and then combines the results from these heads after computing attention.

Let's take a look at the MultiHeadAttention class before we discuss it further:


Listing 3.5 An efficient multi-head attention class

class MultiHeadAttention(nn.Module):
    def __init__(self, d_in, d_out,
                 context_length, dropout, num_heads, qkv_bias=False):
        super().__init__()
        assert d_out % num_heads == 0, "d_out must be divisible by num_heads"

        self.d_out = d_out
        self.num_heads = num_heads
        self.head_dim = d_out // num_heads        # reduce the projection dim to match the desired output dim
        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.out_proj = nn.Linear(d_out, d_out)   # linear layer to combine the head outputs
        self.dropout = nn.Dropout(dropout)
        self.register_buffer(
            'mask',
            torch.triu(torch.ones(context_length, context_length), diagonal=1)
        )

    def forward(self, x):
        b, num_tokens, d_in = x.shape
        keys = self.W_key(x)                      # shape: (b, num_tokens, d_out)
        queries = self.W_query(x)
        values = self.W_value(x)

        # Split the matrices by adding a num_heads dimension:
        # (b, num_tokens, d_out) -> (b, num_tokens, num_heads, head_dim)
        keys = keys.view(b, num_tokens, self.num_heads, self.head_dim)
        values = values.view(b, num_tokens, self.num_heads, self.head_dim)
        queries = queries.view(b, num_tokens, self.num_heads, self.head_dim)

        # Transpose: (b, num_tokens, num_heads, head_dim) -> (b, num_heads, num_tokens, head_dim)
        keys = keys.transpose(1, 2)
        queries = queries.transpose(1, 2)
        values = values.transpose(1, 2)

        # Compute the dot-product attention scores for each head
        attn_scores = queries @ keys.transpose(2, 3)
        # Truncate the mask to the number of tokens and convert it to boolean
        mask_bool = self.mask.bool()[:num_tokens, :num_tokens]

        # Use the mask to fill attention scores for future tokens with -inf
        attn_scores.masked_fill_(mask_bool, -torch.inf)

        attn_weights = torch.softmax(
            attn_scores / keys.shape[-1]**0.5, dim=-1)
        attn_weights = self.dropout(attn_weights)

        # Tensor shape after transpose: (b, num_tokens, num_heads, head_dim)
        context_vec = (attn_weights @ values).transpose(1, 2)

        # Combine the heads, where self.d_out = self.num_heads * self.head_dim
        context_vec = context_vec.contiguous().view(b, num_tokens, self.d_out)
        context_vec = self.out_proj(context_vec)  # optional output projection
        return context_vec

Even though the reshaping ( .view ) and transposing ( .transpose ) of tensors inside the
MultiHeadAttention class looks very complicated, mathematically, the MultiHeadAttention
class implements the same concept as the MultiHeadAttentionWrapper earlier.

On a big-picture level, in the previous MultiHeadAttentionWrapper , we stacked multiple single-
head attention layers that we combined into a multi-head attention layer. The MultiHeadAttention
class takes an integrated approach. It starts with a multi-head layer and then internally splits this layer
into individual attention heads, as illustrated in Figure 3.26.

Figure 3.26 In the MultiHeadAttentionWrapper class with two attention heads, we initialized two
weight matrices Wq1 and Wq2 and computed two query matrices Q1 and Q2 as illustrated at the top of this
figure. In the MultiHeadAttention class, we initialize one larger weight matrix Wq , only perform one
matrix multiplication with the inputs to obtain a query matrix Q, and then split the query matrix into Q1 and Q2
as shown at the bottom of this figure. We do the same for the keys and values, which are not shown to reduce
visual clutter.
The splitting of the query, key, and value tensors, as depicted in Figure 3.26, is achieved through tensor
reshaping and transposing operations using PyTorch's .view and .transpose methods. The input
is first transformed (via linear layers for queries, keys, and values) and then reshaped to represent
multiple heads.

The key operation is to split the d_out dimension into num_heads and head_dim , where
head_dim = d_out / num_heads . This splitting is then achieved using the .view method: a
tensor of dimensions (b, num_tokens, d_out) is reshaped to dimension (b, num_tokens,
num_heads, head_dim) .

The tensors are then transposed to bring the num_heads dimension before the num_tokens
dimension, resulting in a shape of (b, num_heads, num_tokens, head_dim) . This transposition
is crucial for correctly aligning the queries, keys, and values across the different heads and performing
batched matrix multiplications efficiently.

To illustrate this batched matrix multiplication, suppose we have the following example tensor:

a = torch.tensor([[[[0.2745, 0.6584, 0.2775, 0.8573],    # shape: (1, 2, 3, 4)
                    [0.8993, 0.0390, 0.9268, 0.7388],
                    [0.7179, 0.7058, 0.9156, 0.4340]],

                   [[0.0772, 0.3565, 0.1479, 0.5331],
                    [0.4066, 0.2318, 0.4545, 0.9737],
                    [0.4606, 0.5159, 0.4220, 0.5786]]]])

Now, we perform a batched matrix multiplication between the tensor itself and a view of the tensor where
we transposed the last two dimensions, num_tokens and head_dim :

print(a @ a.transpose(2, 3))

The result is as follows:

tensor([[[[1.3208, 1.1631, 1.2879],
          [1.1631, 2.2150, 1.8424],
          [1.2879, 1.8424, 2.0402]],

         [[0.4391, 0.7003, 0.5903],
          [0.7003, 1.3737, 1.0620],
          [0.5903, 1.0620, 0.9912]]]])
In this case, the matrix multiplication implementation in PyTorch handles the 4-dimensional input tensor
so that the matrix multiplication is carried out between the 2 last dimensions (num_tokens,
head_dim) and then repeated for the individual heads.

For instance, the above becomes a more compact way to compute the matrix multiplication for each head
separately:

first_head = a[0, 0, :, :]
first_res = first_head @ first_head.T
print("First head:\n", first_res)

second_head = a[0, 1, :, :]
second_res = second_head @ second_head.T
print("\nSecond head:\n", second_res)

The results are exactly the same results that we obtained when using the batched matrix multiplication
print(a @ a.transpose(2, 3)) earlier:

First head:
 tensor([[1.3208, 1.1631, 1.2879],
        [1.1631, 2.2150, 1.8424],
        [1.2879, 1.8424, 2.0402]])

Second head:
 tensor([[0.4391, 0.7003, 0.5903],
        [0.7003, 1.3737, 1.0620],
        [0.5903, 1.0620, 0.9912]])

Continuing with MultiHeadAttention , after computing the attention weights and context vectors, the
context vectors from all heads are transposed back to the shape (b, num_tokens, num_heads,
head_dim) . These vectors are then reshaped (flattened) into the shape (b, num_tokens,
d_out) , effectively combining the outputs from all heads.
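
The inverse of the earlier split can be sketched in the same standalone way (again with made-up dimensions), tracing how the per-head context vectors are merged back into a single d_out -dimensional representation:

import torch

b, num_heads, num_tokens, head_dim = 1, 2, 3, 2
d_out = num_heads * head_dim

ctx = torch.randn(b, num_heads, num_tokens, head_dim)  # per-head context vectors
ctx = ctx.transpose(1, 2)                              # (b, num_tokens, num_heads, head_dim)
ctx = ctx.contiguous().view(b, num_tokens, d_out)      # flatten the heads back into d_out
print(ctx.shape)                                       # torch.Size([1, 3, 4])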

Additionally, we added a so-called output projection layer ( self.out_proj ) to
MultiHeadAttention after combining the heads, which is not present in the CausalAttention
class. This output projection layer is not strictly necessary (see the References section in Appendix B for
more details), but it is commonly used in many LLM architectures, which is why we added it here for
completeness.
Even though the MultiHeadAttention class looks more complicated than the
MultiHeadAttentionWrapper due to the additional reshaping and transposition of tensors, it is
more efficient. The reason is that we only need one matrix multiplication to compute the keys, for
instance, keys = self.W_key(x) (the same is true for the queries and values). In the
MultiHeadAttentionWrapper, we needed to repeat this matrix multiplication, which is computationally one
of the most expensive steps, for each attention head.
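
If you want to check this efficiency claim empirically, a rough, optional timing sketch like the following can be used (it assumes both classes from this chapter are already defined; the input sizes are made up, the exact numbers depend on your hardware, and a proper benchmark would average over many repetitions):

import time
import torch

torch.manual_seed(123)
x = torch.randn(8, 256, 768)                                           # (batch, num_tokens, d_in)

wrapper = MultiHeadAttentionWrapper(768, 64, 256, 0.0, num_heads=12)   # 12 heads x 64 dims = 768-dim output
combined = MultiHeadAttention(768, 768, 256, 0.0, num_heads=12)        # head_dim = 768 / 12 = 64

for name, module in [("wrapper", wrapper), ("combined", combined)]:
    start = time.perf_counter()
    with torch.no_grad():
        module(x)
    print(f"{name}: {time.perf_counter() - start:.4f} s")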

The MultiHeadAttention class can be used similar to the SelfAttention and
CausalAttention classes we implemented earlier:

torch.manual_seed(123)
batch_size, context_length, d_in = batch.shape
d_out = 2
mha = MultiHeadAttention(d_in, d_out, context_length, 0.0, num_heads=2)
context_vecs = mha(batch)
print(context_vecs)
print("context_vecs.shape:", context_vecs.shape)

As we can see based on the results, the output dimension is directly controlled by the d_out argument:

tensor([[[0.3190, 0.4858],
         [0.2943, 0.3897],
         [0.2856, 0.3593],
         [0.2693, 0.3873],
         [0.2639, 0.3928],
         [0.2575, 0.4028]],

        [[0.3190, 0.4858],
         [0.2943, 0.3897],
         [0.2856, 0.3593],
         [0.2693, 0.3873],
         [0.2639, 0.3928],
         [0.2575, 0.4028]]], grad_fn=<ViewBackward0>)
context_vecs.shape: torch.Size([2, 6, 2])

In this section, we implemented the MultiHeadAttention class that we will use in the upcoming
sections when implementing and training the LLM itself. Note that while the code is fully functional, we
used relatively small embedding sizes and numbers of attention heads to keep the outputs readable.

For comparison, the smallest GPT-2 model (117 million parameters) has 12 attention heads and a
context vector embedding size of 768. The largest GPT-2 model (1.5 billion parameters) has 25 attention
heads and a context vector embedding size of 1600. Note that the embedding sizes of the token inputs
and context embeddings are the same in GPT models ( d_in = d_out ).
Exercise 3.3 Initializing GPT-2 size attention modules

Using the MultiHeadAttention class, initialize a multi-head attention module that has the
same number of attention heads as the smallest GPT-2 model (12 attention heads). Also ensure
that you use the respective input and output embedding sizes similar to GPT-2 (768 dimensions).
Note that the smallest GPT-2 model supports a context length of 1024 tokens.

3.7 Summary
Attention mechanisms transform input elements into enhanced context vector
representations that incorporate information about all inputs.
A self-attention mechanism computes the context vector representation as a
weighted sum over the inputs.
In a simplified attention mechanism, the attention weights are computed via dot
products.
A dot product is just a concise way of multiplying two vectors element-wise and then
summing the products.
Matrix multiplications, while not strictly required, help us to implement computations
more efficiently and compactly by replacing nested for-loops.
In self-attention mechanisms that are used in LLMs, also called scaled dot-product
attention, we include trainable weight matrices to compute intermediate
transformations of the inputs: queries, values, and keys.
When working with LLMs that read and generate text from left to right, we add a
causal attention mask to prevent the LLM from accessing future tokens.
Next to causal attention masks to zero out attention weights, we can also add a
dropout mask to reduce overfitting in LLMs.
The attention modules in transformer-based LLMs involve multiple instances of
causal attention, which is called multi-head attention.
We can create a multi-head attention module by stacking multiple instances of
causal attention modules.
A more efficient way of creating multi-head attention modules involves batched
matrix multiplications.
