3 Coding attention mechanisms


This chapter covers

- The reasons for using attention mechanisms in neural networks
- A basic self-attention framework, progressing to an enhanced self-attention mechanism
- A causal attention module that allows LLMs to generate one token at a time
- Masking randomly selected attention weights with dropout to reduce overfitting
- Stacking multiple causal attention modules into a multi-head attention module

In the previous chapter, you learned how to prepare the input text for
training LLMs. This involved splitting text into individual word and
subword tokens, which can be encoded into vector representations, the so-
called embeddings, for the LLM.

In this chapter, we will look at an integral part of the LLM architecture itself, attention mechanisms, as illustrated in figure 3.1.

Figure 3.1 A mental model of the three main stages of coding an LLM,
pretraining the LLM on a general text dataset, and finetuning it on a
labeled dataset. This chapter focuses on attention mechanisms, which
are an integral part of an LLM architecture.

Attention mechanisms are a comprehensive topic, which is why we are devoting a whole chapter to them. We will largely look at these attention mechanisms in isolation and focus on them at a mechanistic level. In the next chapter, we will then code the remaining parts of the LLM surrounding the self-attention mechanism to see it in action and to create a model to generate text.

Over the course of this chapter, we will implement four different variants
of attention mechanisms, as illustrated in figure 3.2.

Figure 3.2 The figure depicts different attention mechanisms we will code in this chapter, starting with a simplified version of self-attention before adding the trainable weights. The causal attention mechanism adds a mask to self-attention that allows the LLM to generate one word at a time. Finally, multi-head attention organizes the attention mechanism into multiple heads, allowing the model to capture various aspects of the input data in parallel.

These different attention variants shown in figure 3.2 build on each other, and the goal is to arrive at a compact and efficient implementation of multi-head attention at the end of this chapter that we can then plug into the LLM architecture we will code in the next chapter.

3.1 The problem with modeling long sequences


Before we look into the self-attention mechanism at the heart of LLMs later in this chapter, what is the problem with architectures without attention mechanisms that predate LLMs? Suppose we want to develop a language translation model that translates text from one language into another. As shown in figure 3.3, we can't simply translate a text word by word due to the grammatical structures in the source and target language.

Figure 3.3 When translating text from one language to another, such as
German to English, it’s not possible to merely translate word by word.
Instead, the translation process requires contextual understanding
and grammar alignment.

To address the problem that we cannot translate text word by word, it is common to use a deep neural network with two submodules, a so-called encoder and decoder. The job of the encoder is to first read in and process the entire text, and the decoder then produces the translated text.

We already briefly discussed encoder-decoder networks when we introduced the transformer architecture in chapter 1 (section 1.4). Before the advent of transformers, recurrent neural networks (RNNs) were the most popular encoder-decoder architecture for language translation.

An RNN is a type of neural network where outputs from previous steps are fed as inputs to the current step, making them well-suited for sequential data like text. If you are unfamiliar with RNNs, don't worry; you don't need to know the detailed workings of RNNs to follow this discussion. Our focus here is more on the general concept of the encoder-decoder setup.

In an encoder-decoder RNN, the input text is fed into the encoder, which processes it sequentially. The encoder updates its hidden state (the internal values at the hidden layers) at each step, trying to capture the entire meaning of the input sentence in the final hidden state, as illustrated in figure 3.4. The decoder then takes this final hidden state to start generating the translated sentence, one word at a time. It also updates its hidden state at each step, which is supposed to carry the context necessary for the next-word prediction.
Figure 3.4 Before the advent of transformer models, encoder-decoder
RNNs were a popular choice for machine translation. The encoder
takes a sequence of tokens from the source language as input, where a
hidden state (an intermediate neural network layer) of the encoder
encodes a compressed representation of the entire input sequence.
Then, the decoder uses its current hidden state to begin the
translation, token by token.

While we don't need to know the inner workings of these encoder-decoder RNNs, the key idea here is that the encoder part processes the entire input text into a hidden state (memory cell). The decoder then takes in this hidden state to produce the output. You can think of this hidden state as an embedding vector, a concept we discussed in chapter 2.

The big limitation of encoder-decoder RNNs is that the RNN can't directly access earlier hidden states from the encoder during the decoding phase. Consequently, it relies solely on the current hidden state, which encapsulates all relevant information. This can lead to a loss of context, especially in complex sentences where dependencies might span long distances.

For readers unfamiliar with RNNs, it is not essential to understand or study this architecture as we will not be using it in this book. The takeaway message of this section is that encoder-decoder RNNs had a shortcoming that motivated the design of attention mechanisms.


3.2 Capturing data dependencies with attention mechanisms

Before transformer LLMs, it was common to use RNNs for language modeling tasks such as language translation, as mentioned previously. RNNs work fine for translating short sentences but don't work well for longer texts as they don't have direct access to previous words in the input.

One major shortcoming in this approach is that the RNN must remember the entire encoded input in a single hidden state before passing it to the decoder, as illustrated in figure 3.4 in the previous section.

Hence, researchers developed the so-called Bahdanau attention mechanism for RNNs in 2014 (named after the first author of the respective paper), which modifies the encoder-decoder RNN such that the decoder can selectively access different parts of the input sequence at each decoding step, as illustrated in figure 3.5.

Figure 3.5 Using an attention mechanism, the text-generating decoder part of the network can access all input tokens selectively. This means
that some input tokens are more important than others for generating
a given output token. The importance is determined by the so-called
attention weights, which we will compute later. Note that this figure
shows the general idea behind attention and does not depict the exact
implementation of the Bahdanau mechanism, which is an RNN method
outside this book’s scope.
Interestingly, only three years later, researchers found that RNN architectures are not required for building deep neural networks for natural language processing and proposed the original transformer architecture (discussed in chapter 1) with a self-attention mechanism inspired by the Bahdanau attention mechanism.

Self-attention is a mechanism that allows each position in the input sequence to attend to all positions in the same sequence when computing the representation of a sequence. Self-attention is a key component of contemporary LLMs based on the transformer architecture, such as the GPT series.

This chapter focuses on coding and understanding this self-attention mechanism used in GPT-like models, as illustrated in figure 3.6. In the next chapter, we will code the remaining parts of the LLM.

Figure 3.6 Self-attention is a mechanism in transformers that is used to compute more efficient input representations by allowing each
position in a sequence to interact with and weigh the importance of all
other positions within the same sequence. In this chapter, we will code
this self-attention mechanism from the ground up before we code the
remaining parts of the GPT-like LLM in the following chapter.

3.3 Attending to different parts of the input with self-attention

We'll now cover the inner workings of the self-attention mechanism and learn how to code it from the ground up. Self-attention serves as the cornerstone of every LLM based on the transformer architecture. It's worth noting that this topic may require a lot of focus and attention (no pun intended), but once you grasp its fundamentals, you will have conquered one of the toughest aspects of this book and implementing LLMs in general.

The “self” in self-attention

In self-attention, the "self" refers to the mechanism's ability to compute attention weights by relating different positions within a single input sequence. It assesses and learns the relationships and dependencies between various parts of the input itself, such as words in a sentence or pixels in an image. This is in contrast to traditional attention mechanisms, where the focus is on the relationships between elements of two different sequences, such as in sequence-to-sequence models where the attention might be between an input sequence and an output sequence, such as the example depicted in figure 3.5.

Since self-attention can appear complex, especially if you are encountering it for the first time, we will begin by introducing a simplified version of it in the next subsection. Afterward, in section 3.4, we will implement the self-attention mechanism with trainable weights, which is used in LLMs.

3.3.1 A simple self-attention mechanism without trainable weights

In this section, we implement a simplified variant of self-attention, free from any trainable weights, summarized in figure 3.7. The goal of this section is to illustrate a few key concepts in self-attention before adding trainable weights next in section 3.4.

Figure 3.7 The goal of self-attention is to compute a context vector for each input element that combines information from all other input
elements. In the example depicted in this figure, we compute the
context vector z(2). The importance or contribution of each input
element for computing z(2) is determined by the attention weights α21
to α2T. When computing z(2), the attention weights are calculated with
respect to input element x(2) and all other inputs. The exact
computation of these attention weights is discussed later in this
section.

Figure 3.7 shows an input sequence, denoted as x, consisting of T elements represented as x(1) to x(T). This sequence typically represents text, such as a sentence, that has already been transformed into token embeddings, as explained in chapter 2.

For example, consider an input text like "Your journey starts with one step." In this case, each element of the sequence, such as x(1), corresponds to a d-dimensional embedding vector representing a specific token, like "Your." In figure 3.7, these input vectors are shown as three-dimensional embeddings.

In self-attention, our goal is to calculate context vectors z(i) for each element x(i) in the input sequence. A context vector can be interpreted as an enriched embedding vector.

To illustrate this concept, let's focus on the embedding vector of the second input element, x(2) (which corresponds to the token "journey"), and the corresponding context vector, z(2), shown at the bottom of figure 3.7. This enhanced context vector, z(2), is an embedding that contains information about x(2) and all other input elements x(1) to x(T).

In self-attention, context vectors play a crucial role. Their purpose is to create enriched representations of each element in an input sequence (like a sentence) by incorporating information from all other elements in the sequence, as illustrated in figure 3.7. This is essential in LLMs, which need to understand the relationship and relevance of words in a sentence to each other. Later, we will add trainable weights that help an LLM learn to construct these context vectors so that they are relevant for the LLM to generate the next token. In this section, we implement a simplified self-attention mechanism to compute these weights and the resulting context vector one step at a time.

Consider the following input sentence, which has already been embedded into three-dimensional vectors, as discussed in chapter 2. We choose a small embedding dimension for illustration purposes to ensure it fits on the page without line breaks:

import torch

inputs = torch.tensor(
  [[0.43, 0.15, 0.89], # Your     (x^1)
   [0.55, 0.87, 0.66], # journey  (x^2)
   [0.57, 0.85, 0.64], # starts   (x^3)
   [0.22, 0.58, 0.33], # with     (x^4)
   [0.77, 0.25, 0.10], # one      (x^5)
   [0.05, 0.80, 0.55]] # step     (x^6)
)

The first step of implementing self-attention is to compute the intermediate values ω, referred to as attention scores, as illustrated in figure 3.8. (Please note that figure 3.8 displays the values of the preceding inputs tensor in a truncated version; for example, 0.87 is truncated to 0.8 due to spatial constraints. In this truncated version, the embeddings of the words "journey" and "starts" may appear similar by random chance.)

Figure 3.8 The overall goal of this section is to illustrate the computation of the context vector z(2) using the second input element,
x(2) as a query. This figure shows the first intermediate step, computing
the attention scores ω between the query x(2) and all other input
elements as a dot product. (Note that the numbers in the figure are
truncated to one digit after the decimal point to reduce visual clutter.)

Figure 3.8 illustrates how we calculate the intermediate attention scores between the query token and each input token. We determine these scores by computing the dot product of the query, x(2), with every other input token:

query = inputs[1]  #A
attn_scores_2 = torch.empty(inputs.shape[0])
for i, x_i in enumerate(inputs):
    attn_scores_2[i] = torch.dot(x_i, query)
print(attn_scores_2)


The computed attention scores are as follows:

tensor([0.9544, 1.4950, 1.4754, 0.8434, 0.7070, 1.0865])


Understanding dot products 

A dot product is essentially just a concise way of multiplying two vectors element-wise and then summing the products, which we can demonstrate as follows:

res = 0.
for idx, element in enumerate(inputs[0]):
    res += inputs[0][idx] * query[idx]
print(res)
print(torch.dot(inputs[0], query))

The output confirms that the sum of the element-wise multiplication gives the same results as the dot product:

tensor(0.9544)
tensor(0.9544)

Beyond viewing the dot product operation as a mathematical tool that combines two vectors to yield a scalar value, the dot product is a measure of similarity because it quantifies how much two vectors are aligned: a higher dot product indicates a greater degree of alignment or similarity between the vectors. In the context of self-attention mechanisms, the dot product determines the extent to which elements in a sequence attend to each other: the higher the dot product, the higher the similarity and attention score between two elements.
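As a small illustrative aside (not part of the book's running example), the dot product grows both with how aligned two vectors are and with how long they are; dividing by the vector norms turns it into the cosine similarity, which isolates the alignment aspect:

a, b = inputs[0], inputs[1]
dot = torch.dot(a, b)              # raw similarity (also depends on vector lengths)
cos = dot / (a.norm() * b.norm())  # alignment only, always between -1 and 1
print(dot, cos)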

In the next step, as shown in figure 3.9, we normalize each of the attention scores that we computed previously.

Figure 3.9 After computing the attention scores ω21 to ω2T with
respect to the input query x(2), the next step is to obtain the attention
weights α21 to α2T by normalizing the attention scores.

The main goal behind the normalization shown in figure 3.9 is to obtain attention weights that sum up to 1. This normalization is a convention that is useful for interpretation and for maintaining training stability in an LLM. Here's a straightforward method for achieving this normalization step:

attn_weights_2_tmp = attn_scores_2 / attn_scores_2.sum()


print("Attention weights:", attn_weights_2_tmp)
print("Sum:", attn_weights_2_tmp.sum())


As the output shows, the attention weights now sum to 1:

Attention weights: tensor([0.1455, 0.2278, 0.2249, 0.1285, 0.1077, 0.1656])


Sum: tensor(1.0000)


In practice, it's more common and advisable to use the softmax function for normalization. This approach is better at managing extreme values and offers more favorable gradient properties during training. The following is a basic implementation of the softmax function for normalizing the attention scores:

def softmax_naive(x):
    return torch.exp(x) / torch.exp(x).sum(dim=0)

attn_weights_2_naive = softmax_naive(attn_scores_2)
print("Attention weights:", attn_weights_2_naive)
print("Sum:", attn_weights_2_naive.sum())


As the output shows, the softmax function also meets the objective and normalizes the attention weights such that they sum to 1:
Attention weights: tensor([0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581])
Sum: tensor(1.)

In addition, the softmax function ensures that the attention weights are always positive. This makes the output interpretable as probabilities or relative importance, where higher weights indicate greater importance.

Note that this naive softmax implementation ( softmax_naive ) may encounter numerical instability problems, such as overflow and underflow, when dealing with large or small input values. Therefore, in practice, it's advisable to use the PyTorch implementation of softmax, which has been extensively optimized for performance:

attn_weights_2 = torch.softmax(attn_scores_2, dim=0)


print("Attention weights:", attn_weights_2)
print("Sum:", attn_weights_2.sum())


In this case, we can see that it yields the same results as our previous softmax_naive function:

Attention weights: tensor([0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581])


Sum: tensor(1.)


Now that we have computed the normalized attention weights, we are ready for the final step illustrated in figure 3.10: calculating the context vector z(2) by multiplying the embedded input tokens, x(i), with the corresponding attention weights and then summing the resulting vectors.

Figure 3.10 The final step, after calculating and normalizing the
attention scores to obtain the attention weights for query x(2), is to
compute the context vector z(2). This context vector is a combination
of all input vectors x(1) to x(T) weighted by the attention weights.

The context vector z(2) depicted in figure 3.10 is calculated as a weighted sum of all input vectors. This involves multiplying each input vector by its corresponding attention weight:

query = inputs[1]  # 2nd input token is the query
context_vec_2 = torch.zeros(query.shape)
for i, x_i in enumerate(inputs):
    context_vec_2 += attn_weights_2[i]*x_i
print(context_vec_2)


The results of this computation are as follows:

tensor([0.4419, 0.6515, 0.5683])

In the next section, we will generalize this procedure for computing context vectors to calculate all context vectors simultaneously.

3.3.2 Computing attention weights for all input tokens

In the previous section, we computed attention weights and the context vector for input 2, as shown in the highlighted row in figure 3.11. Now we are extending this computation to calculate attention weights and context vectors for all inputs.

Figure 3.11 The highlighted row shows the attention weights for the second input element as a query, as we computed in the previous section. This section generalizes the computation to obtain all other attention weights.

We follow the same three steps as before, as summarized in figure 3.12, except that we make a few modifications in the code to compute all context vectors instead of only the second context vector, z(2).

Figure 3.12 First, in step 1 as illustrated in figure 3.12, we add an additional for-loop to compute the dot products for all pairs of inputs.

attn_scores = torch.empty(6, 6)
for i, x_i in enumerate(inputs):
    for j, x_j in enumerate(inputs):
        attn_scores[i, j] = torch.dot(x_i, x_j)
print(attn_scores)


The resulting attention scores are as follows:

tensor([[0.9995, 0.9544, 0.9422, 0.4753, 0.4576, 0.6310],


[0.9544, 1.4950, 1.4754, 0.8434, 0.7070, 1.0865],
[0.9422, 1.4754, 1.4570, 0.8296, 0.7154, 1.0605],
[0.4753, 0.8434, 0.8296, 0.4937, 0.3474, 0.6565],
[0.4576, 0.7070, 0.7154, 0.3474, 0.6654, 0.2935],
[0.6310, 1.0865, 1.0605, 0.6565, 0.2935, 0.9450]])


Each element in the preceding tensor represents an attention score between each pair of inputs, as illustrated in figure 3.11. Note that the values in figure 3.11 are normalized, which is why they differ from the unnormalized attention scores in the preceding tensor. We will take care of the normalization later.

When computing the preceding attention score tensor, we used for-loops in Python. However, for-loops are generally slow, and we can achieve the same results using matrix multiplication:

attn_scores = inputs @ inputs.T


print(attn_scores)

We can visually confirm that the results are the same as before:

tensor([[0.9995, 0.9544, 0.9422, 0.4753, 0.4576, 0.6310],
        [0.9544, 1.4950, 1.4754, 0.8434, 0.7070, 1.0865],
        [0.9422, 1.4754, 1.4570, 0.8296, 0.7154, 1.0605],
        [0.4753, 0.8434, 0.8296, 0.4937, 0.3474, 0.6565],
        [0.4576, 0.7070, 0.7154, 0.3474, 0.6654, 0.2935],
        [0.6310, 1.0865, 1.0605, 0.6565, 0.2935, 0.9450]])

In step 2, as illustrated in figure 3.12, we now normalize each row so that the values in each row sum to 1:

attn_weights = torch.softmax(attn_scores, dim=-1)


print(attn_weights)


This returns the following attention weight tensor that matches the values shown in figure 3.10:

tensor([[0.2098, 0.2006, 0.1981, 0.1242, 0.1220, 0.1452],


[0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581],
[0.1390, 0.2369, 0.2326, 0.1242, 0.1108, 0.1565],
[0.1435, 0.2074, 0.2046, 0.1462, 0.1263, 0.1720],
[0.1526, 0.1958, 0.1975, 0.1367, 0.1879, 0.1295],
[0.1385, 0.2184, 0.2128, 0.1420, 0.0988, 0.1896]])


In the context of using PyTorch, the dim parameter in functions like torch.softmax specifies the dimension of the input tensor along which the function will be computed. By setting dim=-1, we are instructing the softmax function to apply the normalization along the last dimension of the attn_scores tensor. If attn_scores is a two-dimensional tensor (for example, with a shape of [rows, columns]), dim=-1 will normalize across the columns so that the values in each row (summing over the column dimension) sum up to 1.
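To make the dim semantics concrete, here is a small standalone aside (not part of the chapter's running example) that applies softmax to a tiny 2 × 3 tensor along different dimensions:

import torch

scores = torch.tensor([[1.0, 2.0, 3.0],
                       [1.0, 1.0, 1.0]])
print(torch.softmax(scores, dim=-1))  # each row sums to 1
print(torch.softmax(scores, dim=0))   # each column sums to 1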

Before we move on to step 3, the final step shown in figure 3.12, let's briefly verify that the rows indeed all sum to 1:

row_2_sum = sum([0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581])


print("Row 2 sum:", row_2_sum)
print("All row sums:", attn_weights.sum(dim=-1))


The result is as follows:

Row 2 sum: 1.0


All row sums: tensor([1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000])


In the third and last step, we now use these attention weights to compute all context vectors via matrix multiplication:

all_context_vecs = attn_weights @ inputs


print(all_context_vecs)

In the resulting output tensor, each row contains a three-dimensional context vector:

tensor([[0.4421, 0.5931, 0.5790],
        [0.4419, 0.6515, 0.5683],
        [0.4431, 0.6496, 0.5671],
        [0.4304, 0.6298, 0.5510],
        [0.4671, 0.5910, 0.5266],
        [0.4177, 0.6503, 0.5645]])

We can double-check that the code is correct by comparing the second row with the context vector z(2) that we computed previously in section 3.3.1:

print("Previous 2nd context vector:", context_vec_2)


Based on the result, we can see that the previously calculated context_vec_2 matches the second row in the previous tensor exactly:

Previous 2nd context vector: tensor([0.4419, 0.6515, 0.5683])


This concludes the code walkthrough of a simple self-attention mechanism. In the next section, we will add trainable weights, enabling the LLM to learn from data and improve its performance on specific tasks.


3.4 Implementing self-attention with trainable weights

In this section, we are implementing the self-attention mechanism that is used in the original transformer architecture, the GPT models, and most other popular LLMs. This self-attention mechanism is also called scaled dot-product attention. Figure 3.13 provides a mental model illustrating how this self-attention mechanism fits into the broader context of implementing an LLM.

Figure 3.13 A mental model illustrating how the self-attention mechanism we code in this section fits into the broader context of this
book and chapter. In the previous section, we coded a simplified
attention mechanism to understand the basic mechanism behind
attention mechanisms. In this section, we add trainable weights to this
attention mechanism. In the upcoming sections, we will extend this
self-attention mechanism by adding a causal mask and multiple heads.

As illustrated in figure 3.13, the self-attention mechanism with trainable weights builds on the previous concepts: we want to compute context vectors as weighted sums over the input vectors specific to a certain input element. As you will see, there are only slight differences compared to the basic self-attention mechanism we coded earlier in section 3.3.

The most notable difference is the introduction of weight matrices that are updated during model training. These trainable weight matrices are crucial so that the model (specifically, the attention module inside the model) can learn to produce "good" context vectors. (Note that we will train the LLM in chapter 5.)

Mv jwff cktlea rqcj flxz-noaettnti ashcmmeni nj ogr wrv suisetnscob. Zrtcj, print book  $59.99 $36.59
pBook + eBook + liveBook
vw wffj abov rj rzxb gp uxar cc eeofrb. Sdneoc, wv jwff rignozea krg akoh
rxnj z tapccom Lthyon sascl rrsp nzc xy itomerpd jkrn cn PZW
actcerethrui, hiwhc wv fwjf kkgs jn tceaprh 4.
ebook  $47.99 $32.63
pdf + ePub + kindle + liveBook
3.4.1 Computing the attention weights step by step
We will implement the self-attention mechanism step by step by introducing the three trainable weight matrices Wq, Wk, and Wv. These three matrices are used to project the embedded input tokens, x(i), into query, key, and value vectors, as illustrated in figure 3.14.

Figure 3.14 In the first step of the self-attention mechanism with trainable weight matrices, we compute query (q), key (k), and value (v)
vectors for input elements x. Similar to previous sections, we designate
the second input, x(2), as the query input. The query vector q(2) is
obtained via matrix multiplication between the input x(2) and the
weight matrix Wq. Similarly, we obtain the key and value vectors via
matrix multiplication involving the weight matrices Wk and Wv.

Earlier in section 3.3.1, we defined the second input element x(2) as the query when we computed the simplified attention weights to compute the context vector z(2). Later, in section 3.3.2, we generalized this to compute all context vectors z(1) ... z(T) for the six-word input sentence "Your journey starts with one step."

Similarly, we will start by computing only one context vector, z(2), for illustration purposes. In the next section, we will modify this code to calculate all context vectors.

Let’s begin by defining a few variables:

x_2 = inputs[1] #A
d_in = inputs.shape[1] #B
d_out = 2 #C


Note that in GPT-like models, the input and output dimensions are usually the same, but for illustration purposes, to better follow the computation, we choose different input ( d_in=3 ) and output ( d_out=2 ) dimensions here.

Next, we initialize the three weight matrices Wq, Wk, and Wv that are shown in figure 3.14:

torch.manual_seed(123)
W_query = torch.nn.Parameter(torch.rand(d_in, d_out), requires_grad=False)
W_key = torch.nn.Parameter(torch.rand(d_in, d_out), requires_grad=False)
W_value = torch.nn.Parameter(torch.rand(d_in, d_out), requires_grad=False)

Note that we are setting requires_grad=False to reduce clutter in the outputs for illustration purposes, but if we were to use the weight matrices for model training, we would set requires_grad=True to update these matrices during model training.

Next, we compute the query, key, and value vectors as shown earlier in figure 3.14:

query_2 = x_2 @ W_query
key_2 = x_2 @ W_key
value_2 = x_2 @ W_value
print(query_2)


As we can see based on the output for the query, this results in a two-dimensional vector since we set the number of columns of the corresponding weight matrix, via d_out , to 2:

tensor([0.4306, 1.4551])


Weight parameters vs. attention weights

Note that in the weight matrices W, the term "weight" is short for "weight parameters," the values of a neural network that are optimized during training. This is not to be confused with the attention weights. As we already saw in the previous section, attention weights determine the extent to which a context vector depends on the different parts of the input, that is, to what extent the network focuses on different parts of the input.

In summary, weight parameters are the fundamental, learned coefficients that define the network's connections, while attention weights are dynamic, context-specific values.

Even though our temporary goal is only to compute the one context vector, z(2), we still require the key and value vectors for all input elements as they are involved in computing the attention weights with respect to the query q(2), as illustrated in figure 3.14.

We can obtain all keys and values via matrix multiplication:

keys = inputs @ W_key


values = inputs @ W_value
print("keys.shape:", keys.shape)
print("values.shape:", values.shape)


As we can tell from the outputs, we successfully projected the six input tokens from a three-dimensional onto a two-dimensional embedding space:

keys.shape: torch.Size([6, 2])


values.shape: torch.Size([6, 2])


The second step is to compute the attention scores, as shown in figure 3.15.

Figure 3.15 The attention score computation is a dot-product computation similar to what we used in the simplified self-attention
mechanism in section 3.3. The new aspect here is that we are not
directly computing the dot-product between the input elements but
using the query and key obtained by transforming the inputs via the
respective weight matrices.
First, let's compute the attention score ω22:

keys_2 = keys[1] #A
attn_score_22 = query_2.dot(keys_2)
print(attn_score_22)


This results in the following unnormalized attention score:

tensor(1.8524)


Again, we can generalize this computation to all attention scores via matrix multiplication:

attn_scores_2 = query_2 @ keys.T # All attention scores for given query


print(attn_scores_2)


As we can see, as a quick check, the second element in the output matches attn_score_22 we computed previously:

tensor([1.2705, 1.8524, 1.8111, 1.0795, 0.5577, 1.5440])


The third step is now going from the attention scores to the attention weights, as illustrated in figure 3.16.

Figure 3.16 After computing the attention scores ω, the next step is to
normalize these scores using the softmax function to obtain the
attention weights α.

Next, as illustrated in figure 3.16, we compute the attention weights by scaling the attention scores and using the softmax function we used earlier. The difference from earlier is that we now scale the attention scores by dividing them by the square root of the embedding dimension of the keys (note that taking the square root is mathematically the same as exponentiating by 0.5):
d_k = keys.shape[-1]
attn_weights_2 = torch.softmax(attn_scores_2 / d_k**0.5, dim=-1)
print(attn_weights_2)


Build a Large Language


Model (From Scratch)
The resulting attention weights are as follows:
print book  $59.99 $36.59
pBook + eBook + liveBook
tensor([0.1500, 0.2264, 0.2199, 0.1311, 0.0906, 0.1820])


The rationale behind scaled-dot product attention

The reason for the normalization by the embedding dimension size is to improve the training performance by avoiding small gradients. For instance, when scaling up the embedding dimension, which is typically greater than 1,000 for GPT-like LLMs, large dot products can result in very small gradients during backpropagation due to the softmax function applied to them. As the dot products increase, the softmax function behaves more like a step function, resulting in gradients nearing zero. These small gradients can drastically slow down learning or cause training to stagnate.

The scaling by the square root of the embedding dimension is the reason why this self-attention mechanism is also called scaled-dot product attention.
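As a small illustrative aside (not part of the book's code listing), you can see this saturation effect by comparing softmax outputs for unscaled and scaled scores; the numbers and variable names here are ad hoc:

import torch

d_k = 1024                                      # a typical key dimension, assumed for illustration
scores = torch.randn(6) * d_k**0.5              # large-magnitude dot products
print(torch.softmax(scores, dim=0))             # tends toward a near one-hot, step-like distribution
print(torch.softmax(scores / d_k**0.5, dim=0))  # scaled scores yield a much softer distribution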

Now, the final step is to compute the context vectors, as illustrated in figure 3.17.

Figure 3.17 In the final step of the self-attention computation, we compute the context vector by combining all value vectors via the
attention weights.

Similar to section 3.3, where we computed the context vector as a weighted sum over the input vectors, we now compute the context vector as a weighted sum over the value vectors. Here, the attention weights serve as a weighting factor that weighs the respective importance of each value vector. Also similar to section 3.3, we can use matrix multiplication to obtain the output in one step:

context_vec_2 = attn_weights_2 @ values


print(context_vec_2)


The contents of the resulting vector are as follows:

tensor([0.3061, 0.8210])

So far, we only computed a single context vector, z(2). In the next section, we will generalize the code to compute all context vectors in the input sequence, z(1) to z(T).

Why query, key, and value?

The terms "key," "query," and "value" in the context of attention mechanisms are borrowed from the domain of information retrieval and databases, where similar concepts are used to store, search, and retrieve information.

A query is analogous to a search query in a database. It represents the current item (e.g., a word or token in a sentence) the model focuses on or tries to understand. The query is used to probe the other parts of the input sequence to determine how much attention to pay to them.

The key is like a database key used for indexing and searching. In the attention mechanism, each item in the input sequence (e.g., each word in a sentence) has an associated key. These keys are used to match the query.

The value in this context is similar to the value in a key-value pair in a database. It represents the actual content or representation of the input items. Once the model determines which keys (and thus which parts of the input) are most relevant to the query (the current focus item), it retrieves the corresponding values.
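As a loose illustrative aside (not from the book's code), you can think of attention as a "soft" version of this lookup: instead of retrieving the value for one exactly matching key, we retrieve a weighted mix of all values, weighted by how well each key matches the query. The tensors below are ad hoc examples:

import torch

keys_demo = torch.tensor([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # one key per stored item
values_demo = torch.tensor([[10.0], [20.0], [30.0]])            # the associated values
query_demo = torch.tensor([1.0, 0.2])                           # what we are looking for
weights = torch.softmax(query_demo @ keys_demo.T, dim=0)        # match scores turned into weights
print(weights @ values_demo)                                    # a blend dominated by the best-matching keys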

3.4.2 Implementing a compact self-attention Python class


In the previous sections, we have gone through a lot of steps to compute the self-attention outputs. This was mainly done for illustration purposes so we could go through one step at a time. In practice, with the LLM implementation in the next chapter in mind, it is helpful to organize this code into a Python class, as shown in the following listing.

Listing 3.1 A compact self-attention class


import torch.nn as nn

class SelfAttention_v1(nn.Module):
    def __init__(self, d_in, d_out):
        super().__init__()
        self.W_query = nn.Parameter(torch.rand(d_in, d_out))
        self.W_key = nn.Parameter(torch.rand(d_in, d_out))
        self.W_value = nn.Parameter(torch.rand(d_in, d_out))

    def forward(self, x):
        keys = x @ self.W_key
        queries = x @ self.W_query
        values = x @ self.W_value
        attn_scores = queries @ keys.T  # omega
        attn_weights = torch.softmax(
            attn_scores / keys.shape[-1]**0.5, dim=-1)
        context_vec = attn_weights @ values
        return context_vec


In this PyTorch code, SelfAttention_v1 is a class derived from nn.Module , which is a fundamental building block of PyTorch models that provides the necessary functionalities for model layer creation and management.

The __init__ method initializes trainable weight matrices ( W_query , W_key , and W_value ) for queries, keys, and values, each transforming the input dimension d_in to an output dimension d_out .

During the forward pass, using the forward method, we compute the attention scores ( attn_scores ) by multiplying queries and keys, normalizing these scores using softmax. Finally, we create a context vector by weighting the values with these normalized attention scores.

We can use this class as follows:

torch.manual_seed(123)
sa_v1 = SelfAttention_v1(d_in, d_out)
print(sa_v1(inputs))

Since inputs contains six embedding vectors, this results in a matrix storing the six context vectors:

tensor([[0.2996, 0.8053],
[0.3061, 0.8210],
[0.3058, 0.8203],
[0.2948, 0.7939],
[0.2927, 0.7891],
[0.2990, 0.8040]], grad_fn=<MmBackward0>)

As a quick check, notice how the second row ( [0.3061, 0.8210] ) matches the contents of context_vec_2 in the previous section. Figure 3.18 summarizes the self-attention mechanism we just implemented.

Figure 3.18 In self-attention, we transform the input vectors in the input matrix X with the three weight matrices, Wq, Wk, and Wv. We then compute the attention weight matrix based on the resulting
queries (Q) and keys (K). Using the attention weights and values (V), we
then compute the context vectors (Z). (For visual clarity, we focus on a
single input text with n tokens in this figure, not a batch of multiple
inputs. Consequently, the three-dimensional input tensor is simplified
to a two-dimensional matrix in this context. This approach allows for a
more straightforward visualization and understanding of the
processes involved. Also, for consistency with later figures, the values
in the attention matrix do not depict the real attention weights.)

As shown in figure 3.18, self-attention involves the trainable weight matrices Wq, Wk, and Wv. These matrices transform input data into queries, keys, and values, which are crucial components of the attention mechanism. As the model is exposed to more data during training, it adjusts these trainable weights, as we will see in upcoming chapters.

We can improve the SelfAttention_v1 implementation further by utilizing PyTorch's nn.Linear layers, which effectively perform matrix multiplication when the bias units are disabled. Additionally, a significant advantage of using nn.Linear instead of manually implementing nn.Parameter(torch.rand(...)) is that nn.Linear has an optimized weight initialization scheme, contributing to more stable and effective model training.
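As a short aside (based on PyTorch's current defaults, which may change), you can inspect both differences directly: nn.Linear stores its weight transposed relative to the x @ W convention and initializes it in a small symmetric range rather than uniformly in [0, 1) like torch.rand:

import torch.nn as nn

lin = nn.Linear(3, 2, bias=False)
print(lin.weight.shape)            # torch.Size([2, 3]); stored transposed
print(lin.weight.abs().max() < 1)  # tensor(True); small symmetric init, can be negative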

Listing 3.2 A self-attention class using PyTorch’s Linear layers


class SelfAttention_v2(nn.Module):
    def __init__(self, d_in, d_out, qkv_bias=False):
        super().__init__()
        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)

    def forward(self, x):
        keys = self.W_key(x)
        queries = self.W_query(x)
        values = self.W_value(x)
        attn_scores = queries @ keys.T
        attn_weights = torch.softmax(
            attn_scores / keys.shape[-1]**0.5, dim=-1
        )
        context_vec = attn_weights @ values
        return context_vec

You can use SelfAttention_v2 similarly to SelfAttention_v1 :

torch.manual_seed(789)
sa_v2 = SelfAttention_v2(d_in, d_out)
print(sa_v2(inputs))

The output is

tensor([[-0.0739, 0.0713],
[-0.0748, 0.0703],
[-0.0749, 0.0702],
[-0.0760, 0.0685],
[-0.0763, 0.0679],
[-0.0754, 0.0693]], grad_fn=<MmBackward0>)


Note that SelfAttention_v1 and SelfAttention_v2 give different outputs because they use different initial weights for the weight matrices since nn.Linear uses a more sophisticated weight initialization scheme.

Exercise 3.1 Comparing SelfAttention_v1 and SelfAttention_v2

Note that nn.Linear in SelfAttention_v2 uses a different weight initialization scheme than nn.Parameter(torch.rand(d_in, d_out)) used in SelfAttention_v1 , which causes both mechanisms to produce different results. To check that both implementations, SelfAttention_v1 and SelfAttention_v2 , are otherwise similar, we can transfer the weight matrices from a SelfAttention_v2 object to a SelfAttention_v1 , such that both objects then produce the same results.

Your task is to correctly assign the weights from an instance of SelfAttention_v2 to an instance of SelfAttention_v1 . To do this, you need to understand the relationship between the weights in both versions. (Hint: nn.Linear stores the weight matrix in a transposed form.) After the assignment, you should observe that both instances produce the same outputs.

In the next section, we will make enhancements to the self-attention mechanism, focusing specifically on incorporating causal and multi-head elements. The causal aspect involves modifying the attention mechanism to prevent the model from accessing future information in the sequence, which is crucial for tasks like language modeling, where each word prediction should only depend on previous words.

The multi-head component involves splitting the attention mechanism into multiple "heads." Each head learns different aspects of the data, allowing the model to simultaneously attend to information from different representation subspaces at different positions. This improves the model's performance in complex tasks.

3.5 Hiding future words with causal attention


In this section, we modify the standard self-attention mechanism to create a causal attention mechanism, which is essential for developing an LLM in the subsequent chapters.

Causal attention, also known as masked attention, is a specialized form of self-attention. It restricts a model to only consider previous and current inputs in a sequence when processing any given token. This is in contrast to the standard self-attention mechanism, which allows access to the entire input sequence at once.

Consequently, when computing attention scores, the causal attention mechanism ensures that the model only factors in tokens that occur at or before the current token in the sequence.
To achieve this in GPT-like LLMs, for each token processed, we mask out the future tokens, which come after the current token in the input text, as illustrated in figure 3.19.

Figure 3.19 In causal attention, we mask out the attention weights above the diagonal such that for a given input, the LLM can't access future tokens when computing the context vectors using the attention weights. For example, for the word "journey" in the second row, we only keep the attention weights for the words before ("Your") and in the current position ("journey").

As illustrated in figure 3.19, we mask out the attention weights above the diagonal, and we normalize the non-masked attention weights such that the attention weights sum to 1 in each row. In the next section, we will implement this masking and normalization procedure in code.

3.5.1 Applying a causal attention mask 

In this section, we implement the causal attention mask in code. We start with the procedure summarized in figure 3.20.

Figure 3.20 One way to obtain the masked attention weight matrix in
causal attention is to apply the softmax function to the attention
scores, zeroing out the elements above the diagonal and normalizing
the resulting matrix.

To implement the steps to apply a causal attention mask to obtain the masked attention weights, as summarized in figure 3.20, let's work with the attention scores and weights from the previous section to code the causal attention mechanism.

In the first step illustrated in figure 3.20, we compute the attention weights using the softmax function as we have done in previous sections:

queries = sa_v2.W_query(inputs) #A
keys = sa_v2.W_key(inputs)
attn_scores = queries @ keys.T
attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=1)
print(attn_weights)


This results in the following attention weights:

tensor([[0.1921, 0.1646, 0.1652, 0.1550, 0.1721, 0.1510],


[0.2041, 0.1659, 0.1662, 0.1496, 0.1665, 0.1477],
[0.2036, 0.1659, 0.1662, 0.1498, 0.1664, 0.1480],
[0.1869, 0.1667, 0.1668, 0.1571, 0.1661, 0.1564],
[0.1830, 0.1669, 0.1670, 0.1588, 0.1658, 0.1585],
[0.1935, 0.1663, 0.1666, 0.1542, 0.1666, 0.1529]],
grad_fn=<SoftmaxBackward0>)


We can implement step 2 in figure 3.20 using PyTorch's tril function to create a mask where the values above the diagonal are zero:
context_length = attn_scores.shape[0]
mask_simple = torch.tril(torch.ones(context_length, context_length))
print(mask_simple)


The resulting mask is as follows:
tensor([[1., 0., 0., 0., 0., 0.],
[1., 1., 0., 0., 0., 0.],
[1., 1., 1., 0., 0., 0.],
[1., 1., 1., 1., 0., 0.],
[1., 1., 1., 1., 1., 0.],
[1., 1., 1., 1., 1., 1.]])


Now, we can multiply this mask with the attention weights to zero out the values above the diagonal:

masked_simple = attn_weights*mask_simple
print(masked_simple)


As we can see, the elements above the diagonal are successfully zeroed out:

tensor([[0.1921, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],


[0.2041, 0.1659, 0.0000, 0.0000, 0.0000, 0.0000],
[0.2036, 0.1659, 0.1662, 0.0000, 0.0000, 0.0000],
[0.1869, 0.1667, 0.1668, 0.1571, 0.0000, 0.0000],
[0.1830, 0.1669, 0.1670, 0.1588, 0.1658, 0.0000],
[0.1935, 0.1663, 0.1666, 0.1542, 0.1666, 0.1529]],
grad_fn=<MulBackward0>)


The third step in figure 3.20 is to renormalize the attention weights to sum up to 1 again in each row. We can achieve this by dividing each element in each row by the sum in each row:

row_sums = masked_simple.sum(dim=1, keepdim=True)


masked_simple_norm = masked_simple / row_sums
print(masked_simple_norm)


The result is an attention weight matrix where the attention weights above the diagonal are zeroed out and where the rows sum to 1:

tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],


[0.5517, 0.4483, 0.0000, 0.0000, 0.0000, 0.0000],
[0.3800, 0.3097, 0.3103, 0.0000, 0.0000, 0.0000],
[0.2758, 0.2460, 0.2462, 0.2319, 0.0000, 0.0000],
[0.2175, 0.1983, 0.1984, 0.1888, 0.1971, 0.0000],
[0.1935, 0.1663, 0.1666, 0.1542, 0.1666, 0.1529]],
grad_fn=<DivBackward0>)


Information leakage

When we apply a mask and then renormalize the attention weights, it might initially appear that information from future tokens (which we intend to mask) could still influence the current token because their values are part of the softmax calculation. However, the key insight is that when we renormalize the attention weights after masking, what we're essentially doing is recalculating the softmax over a smaller subset (since masked positions don't contribute to the softmax value).

The mathematical elegance of softmax is that despite initially including all positions in the denominator, after masking and renormalizing, the effect of the masked positions is nullified: they don't contribute to the softmax score in any meaningful way.

In simpler terms, after masking and renormalization, the distribution of attention weights is as if it was calculated only among the unmasked positions to begin with. This ensures there's no information leakage from future (or otherwise masked) tokens as we intended.
While we could be technically done with implementing causal attention at this point, we can take advantage of a mathematical property of the softmax function and implement the computation of the masked attention weights more efficiently in fewer steps, as shown in figure 3.21.

Figure 3.21 A more efficient way to obtain the masked attention weight matrix in causal attention is to mask the attention scores with
negative infinity values before applying the softmax function.

The softmax function converts its inputs into a probability distribution. When negative infinity values (-∞) are present in a row, the softmax function treats them as zero probability. (Mathematically, this is because e^(-∞) approaches 0.)
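A quick standalone check (not from the book's listing) confirms this behavior:

print(torch.softmax(torch.tensor([0.3, -torch.inf, 0.1]), dim=0))
# approximately tensor([0.5498, 0.0000, 0.4502]); the -inf position receives exactly zero weight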

We can implement this more efficient masking "trick" by creating a mask with 1s above the diagonal and then replacing these 1s with negative infinity ( -inf ) values:

mask = torch.triu(torch.ones(context_length, context_length), diagonal=1)


masked = attn_scores.masked_fill(mask.bool(), -torch.inf)
print(masked)


This results in the following mask:

tensor([[0.2899, -inf, -inf, -inf, -inf, -inf],


[0.4656, 0.1723, -inf, -inf, -inf, -inf],
[0.4594, 0.1703, 0.1731, -inf, -inf, -inf],
[0.2642, 0.1024, 0.1036, 0.0186, -inf, -inf],
[0.2183, 0.0874, 0.0882, 0.0177, 0.0786, -inf],
[0.3408, 0.1270, 0.1290, 0.0198, 0.1290, 0.0078]],
grad_fn=<MaskedFillBackward0>)


Now all we need to do is apply the softmax function to these masked results, and we are done:

attn_weights = torch.softmax(masked / keys.shape[-1]**0.5, dim=1)


print(attn_weights)


As we can see based on the output, the values in each row sum to 1, and no further normalization is necessary:

tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],


[0.5517, 0.4483, 0.0000, 0.0000, 0.0000, 0.0000],
[0.3800, 0.3097, 0.3103, 0.0000, 0.0000, 0.0000],
[0.2758, 0.2460, 0.2462, 0.2319, 0.0000, 0.0000],
[0.2175, 0.1983, 0.1984, 0.1888, 0.1971, 0.0000],
[0.1935, 0.1663, 0.1666, 0.1542, 0.1666, 0.1529]],
grad_fn=<SoftmaxBackward0>)

We could now use the modified attention weights to compute the context vectors via context_vec = attn_weights @ values , as in section 3.4. However, in the next section, we first cover another minor tweak to the causal attention mechanism that is useful for reducing overfitting when training LLMs.

3.5.2 Masking additional attention weights with dropout


Dropout in deep learning is a technique where randomly selected hidden layer units are ignored during training, effectively "dropping" them out. This method helps prevent overfitting by ensuring that a model does not become overly reliant on any specific set of hidden layer units. It's important to emphasize that dropout is only used during training and is disabled afterward.

In the transformer architecture, including models like GPT, dropout in the attention mechanism is typically applied in two specific areas: after calculating the attention scores or after applying the attention weights to the value vectors.

Here we will apply the dropout mask after computing the attention weights, as illustrated in figure 3.22, because it's the more common variant in practice.

Figure 3.22 Using the causal attention mask (upper left), we apply an
additional dropout mask (upper right) to zero out additional attention
weights to reduce overfitting during training.

In the following code example, we use a dropout rate of 50%, which means masking out half of the attention weights. (When we train the GPT model in later chapters, we will use a lower dropout rate, such as 0.1 or 0.2.)

In the following code, we apply PyTorch's dropout implementation first to a 6 × 6 tensor consisting of ones for illustration purposes:

torch.manual_seed(123)
dropout = torch.nn.Dropout(0.5) #A
example = torch.ones(6, 6) #B
print(dropout(example))


As we can see, approximately half of the values are zeroed out:

tensor([[2., 2., 0., 2., 2., 0.],


[0., 0., 0., 2., 0., 2.],
[2., 2., 2., 2., 0., 2.],
[0., 2., 2., 0., 0., 2.],
[0., 2., 0., 2., 0., 2.],
[0., 2., 2., 2., 2., 0.]])

When applying dropout to an attention weight matrix with a rate of 50%, half of the elements in the matrix are randomly set to zero. To compensate for the reduction in active elements, the values of the remaining elements in the matrix are scaled up by a factor of 1/0.5 = 2. This scaling is crucial to maintain the overall balance of the attention weights, ensuring that the average influence of the attention mechanism remains consistent during both the training and inference phases.
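As a brief aside (not part of the book's code), you can verify this inverse scaling directly, reusing the dropout layer and tensor of ones defined above: with a rate of 0.5, every surviving element becomes 1/0.5 = 2, so the mean stays close to 1 on average:

out = dropout(example)         # reuses `dropout` and `example` from the previous code block
print(out[out != 0].unique())  # tensor([2.]); survivors are scaled by 1/(1 - p)
print(out.mean())              # close to 1.0 on average over many random masks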
Now let's apply dropout to the attention weight matrix itself:
torch.manual_seed(123)
print(dropout(attn_weights))


The resulting attention weight matrix now has additional elements zeroed out and the remaining ones rescaled:

tensor([[2.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],


[0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
[0.7599, 0.6194, 0.6206, 0.0000, 0.0000, 0.0000],
[0.0000, 0.4921, 0.4925, 0.0000, 0.0000, 0.0000],
[0.0000, 0.3966, 0.0000, 0.3775, 0.0000, 0.0000],
[0.0000, 0.3327, 0.3331, 0.3084, 0.3331, 0.0000]],
grad_fn=<MulBackward0>)


Note that the resulting dropout outputs may look different depending on your operating system; you can read more about this inconsistency here on the PyTorch issue tracker at https://github.com/pytorch/pytorch/issues/121595.

Having gained an understanding of causal attention and dropout masking, we will develop a concise Python class in the following section. This class is designed to facilitate the efficient application of these two techniques.

3.5.3 Implementing a compact causal attention class


In this section, we will now incorporate the causal attention and dropout
modifications into the SelfAttention Python class we developed in
section 3.4. This class will then serve as a template for developing multi-
head attention in the upcoming section, which is the final attention class
we implement in this chapter.

But before we begin, let's ensure that the code can handle batches
consisting of more than one input so that the CausalAttention class
supports the batch outputs produced by the data loader we implemented
in chapter 2.

For simplicity, to simulate such batch inputs, we duplicate the input text
example:

batch = torch.stack((inputs, inputs), dim=0)
print(batch.shape)  #A Two inputs with six tokens each; each token has embedding dimension 3


This results in a three-dimensional tensor consisting of two input texts
with six tokens each, where each token is a three-dimensional embedding
vector:

torch.Size([2, 6, 3])


The following CausalAttention class is similar to the SelfAttention
class we implemented earlier, except that we now added the dropout and
causal mask components, as highlighted in the following listing.
Listing 3.3 A compact causal attention class
class CausalAttention(nn.Module):
    def __init__(self, d_in, d_out, context_length,
                 dropout, qkv_bias=False):
        super().__init__()
        self.d_out = d_out
        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.dropout = nn.Dropout(dropout)  #A Compared to the previous SelfAttention class, we add a dropout layer
        self.register_buffer(
            'mask',
            torch.triu(torch.ones(context_length, context_length),
                       diagonal=1)
        )  #B The register_buffer call is also a new addition (more on this in the following text)

    def forward(self, x):
        b, num_tokens, d_in = x.shape  #C New batch dimension b
        keys = self.W_key(x)
        queries = self.W_query(x)
        values = self.W_value(x)

        attn_scores = queries @ keys.transpose(1, 2)  #C We transpose dimensions 1 and 2, keeping the batch dimension at the first position (0)
        attn_scores.masked_fill_(  #D In PyTorch, operations with a trailing underscore are performed in place, avoiding unnecessary memory copies
            self.mask.bool()[:num_tokens, :num_tokens], -torch.inf)
        attn_weights = torch.softmax(
            attn_scores / keys.shape[-1]**0.5, dim=-1
        )
        attn_weights = self.dropout(attn_weights)

        context_vec = attn_weights @ values
        return context_vec


While all added code lines should be familiar from previous sections, we
now added a self.register_buffer() call in the __init__ method. The
use of register_buffer in PyTorch is not strictly necessary for all use
cases but offers several advantages here. For instance, when we use the
CausalAttention class in our LLM, buffers are automatically moved to
the appropriate device (CPU or GPU) along with our model, which will be
relevant when training the LLM in future chapters. This means we don't
need to manually ensure these tensors are on the same device as the
model parameters, avoiding device mismatch errors.
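The following optional snippet (not part of the chapter's listings; the dimensions are made up, and it assumes a CUDA-capable GPU is available for the second part) illustrates this behavior:

# Optional illustration of the register_buffer behavior described above (assuming
# the CausalAttention class from listing 3.3 has been defined; the dimensions are
# hypothetical and only serve to construct a small module).
ca = CausalAttention(d_in=3, d_out=2, context_length=6, dropout=0.0)
print(ca.mask.device)          # cpu

if torch.cuda.is_available():  # only runs if a GPU is present
    ca = ca.to("cuda")
    print(ca.mask.device)      # cuda:0 -- the buffer moved along with the model parameters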

We can use the CausalAttention class as follows, similar to
SelfAttention previously:

torch.manual_seed(123)
context_length = batch.shape[1]
ca = CausalAttention(d_in, d_out, context_length, 0.0)
context_vecs = ca(batch)
print("context_vecs.shape:", context_vecs.shape)


The resulting context vector is a three-dimensional tensor where each
token is now represented by a two-dimensional embedding:

context_vecs.shape: torch.Size([2, 6, 2])


Figure 3.23 provides a mental model that summarizes what we have
accomplished so far.

Figure 3.23 A mental model summarizing the four different attention


modules we are coding in this chapter. We began with a simplified
attention mechanism, added trainable weights, and then added a
causal attention mask. In the remainder of this chapter, we will extend
the causal attention mechanism and code multi-head attention, which
is the final module we will use in the LLM implementation in the next
chapter.
As illustrated in figure 3.23, in this section, we focused on the concept and
implementation of causal attention in neural networks. In the next section,
we will expand on this concept and implement a multi-head attention
module that implements several such causal attention mechanisms in
parallel.


3.6 Extending single-head attention to multi-head attention
In this final section of the chapter, we extend the previously implemented
causal attention class over multiple heads. This is also called multi-head
attention.

The term "multi-head" refers to dividing the attention mechanism into
multiple "heads," each operating independently. In this context, a single
causal attention module can be considered single-head attention, where
there is only one set of attention weights processing the input
sequentially.

In the following subsections, we will tackle this expansion from causal
attention to multi-head attention. The first subsection will intuitively
build a multi-head attention module by stacking multiple
CausalAttention modules for illustration purposes. The second
subsection will then implement the same multi-head attention module in
a more complicated but more computationally efficient way.

3.6.1 Stacking multiple single-head attention layers


In practical terms, implementing multi-head attention involves creating
multiple instances of the self-attention mechanism (depicted earlier in
figure 3.18 in section 3.4.1), each with its own weights, and then combining
their outputs. Using multiple instances of the self-attention mechanism
can be computationally intensive, but it's crucial for the kind of complex
pattern recognition that models like transformer-based LLMs are known
for.

Figure 3.24 illustrates the structure of a multi-head attention module,
which consists of multiple single-head attention modules, as previously
depicted in figure 3.18, stacked on top of each other.

Figure 3.24 The multi-head attention module in this figure depicts two
single-head attention modules stacked on top of each other. So,
instead of using a single matrix Wv for computing the value matrices,
in a multi-head attention module with two heads, we now have two
value weight matrices: Wv1 and Wv2. The same applies to the other
weight matrices, Wq and Wk. We obtain two sets of context vectors Z1
and Z2 that we can combine into a single context vector matrix Z.
As mentioned before, the main idea behind multi-head attention is to run
the attention mechanism multiple times (in parallel) with different,
learned linear projections, that is, the results of multiplying the input data
(like the query, key, and value vectors in attention mechanisms) by a
weight matrix.

In code, we can achieve this by implementing a simple
MultiHeadAttentionWrapper class that stacks multiple instances of our
previously implemented CausalAttention module, as shown in the
following listing.

Listing 3.4 A wrapper class to implement multi-head attention


class MultiHeadAttentionWrapper(nn.Module):
    def __init__(self, d_in, d_out, context_length,
                 dropout, num_heads, qkv_bias=False):
        super().__init__()
        self.heads = nn.ModuleList(
            [CausalAttention(
                 d_in, d_out, context_length, dropout, qkv_bias
             )
             for _ in range(num_heads)]
        )

    def forward(self, x):
        return torch.cat([head(x) for head in self.heads], dim=-1)


For example, if we use this MultiHeadAttentionWrapper class with two
attention heads (via num_heads=2 ) and CausalAttention output
dimension d_out=2 , this results in a four-dimensional context vector
( d_out*num_heads=4 ), as illustrated in figure 3.25.

Figure 3.25 Using the MultiHeadAttentionWrapper , we specified


the number of attention heads ( num_heads ). If we set num_heads=2 ,
as shown in this figure, we obtain a tensor with two sets of context
vector matrices. In each context vector matrix, the rows represent the
context vectors corresponding to the tokens, and the columns
correspond to the embedding dimension specified via d_out=4 . We
concatenate these context vector matrices along the column
dimension. Since we have two attention heads and an embedding
dimension of 2, the final embedding dimension is 2 × 2 = 4.

To illustrate figure 3.25 further with a concrete example, we can use the
MultiHeadAttentionWrapper class similar to the CausalAttention class
before:

torch.manual_seed(123)
context_length = batch.shape[1] # This is the number of tokens
d_in, d_out = 3, 2
mha = MultiHeadAttentionWrapper(
d_in, d_out, context_length, 0.0, num_heads=2
)
context_vecs = mha(batch)

print(context_vecs)
print("context_vecs.shape:", context_vecs.shape)


This results in the following tensor representing the context vectors:

tensor([[[-0.4519,  0.2216,  0.4772,  0.1063],
         [-0.5874,  0.0058,  0.5891,  0.3257],
         [-0.6300, -0.0632,  0.6202,  0.3860],
         [-0.5675, -0.0843,  0.5478,  0.3589],
         [-0.5526, -0.0981,  0.5321,  0.3428],
         [-0.5299, -0.1081,  0.5077,  0.3493]],

        [[-0.4519,  0.2216,  0.4772,  0.1063],
         [-0.5874,  0.0058,  0.5891,  0.3257],
         [-0.6300, -0.0632,  0.6202,  0.3860],
         [-0.5675, -0.0843,  0.5478,  0.3589],
         [-0.5526, -0.0981,  0.5321,  0.3428],
         [-0.5299, -0.1081,  0.5077,  0.3493]]], grad_fn=<CatBackward0>)
context_vecs.shape: torch.Size([2, 6, 4])


The first dimension of the resulting context_vecs tensor is 2 since we
have two input texts (the input texts are duplicated, which is why the
context vectors are exactly the same for those). The second dimension
refers to the 6 tokens in each input. The third dimension refers to the four-
dimensional embedding of each token.

Exercise 3.2 Returning two-dimensional embedding vectors

Change the input arguments for the MultiHeadAttentionWrapper(...,
num_heads=2) call such that the output context vectors are two-
dimensional instead of four-dimensional while keeping the setting
num_heads=2 . Hint: You don't have to modify the class
implementation; you just have to change one of the other input
arguments.

In this section, we implemented a MultiHeadAttentionWrapper that
combined multiple single-head attention modules. However, note that
these are processed sequentially via [head(x) for head in self.heads]
in the forward method. We can improve this implementation by
processing the heads in parallel. One way to achieve this is by computing
the outputs for all attention heads simultaneously via matrix
multiplication, as we will explore in the next section.

3.6.2 Implementing multi-head attention with weight splits

In the previous section, we created a MultiHeadAttentionWrapper to
implement multi-head attention by stacking multiple single-head
attention modules. This was done by instantiating and combining several
CausalAttention objects.

Jatnsde lv ntainamiign wer pataesre alscses, MultiHeadAttentionWrapper


nsb CausalAttention , wk nzz bmenoic vrhd vl sheet pcesntco njre s
iegsnl MultiHeadAttention aslsc. Rkcf, nj tddoaini kr driz ignmegr rpo
MultiHeadAttentionWrapper jrwp rvg CausalAttention vsky, xw jwff
xmse mvco ethro aiosoidmcftni rk lmteempni litum-gzxg ttnnetioa vtxm
iyfneficlet.

In the MultiHeadAttentionWrapper , multiple heads are implemented by
creating a list of CausalAttention objects ( self.heads ), each
representing a separate attention head. The CausalAttention class
independently performs the attention mechanism, and the results from
each head are concatenated. In contrast, the following
MultiHeadAttention class integrates the multi-head functionality within
a single class. It splits the input into multiple heads by reshaping the
projected query, key, and value tensors and then combines the results
from these heads after computing attention.

Let's take a look at the MultiHeadAttention class before we discuss it
further.

Listing 3.5 An efficient multi-head attention class


class MultiHeadAttention(nn.Module):
    def __init__(self, d_in, d_out,
                 context_length, dropout, num_heads, qkv_bias=False):
        super().__init__()
        assert (d_out % num_heads == 0), \
            "d_out must be divisible by num_heads"

        self.d_out = d_out
        self.num_heads = num_heads
        self.head_dim = d_out // num_heads  #A Reduce the projection dim to match the desired output dim
        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.out_proj = nn.Linear(d_out, d_out)  #B Use a Linear layer to combine the head outputs
        self.dropout = nn.Dropout(dropout)
        self.register_buffer(
            "mask",
            torch.triu(torch.ones(context_length, context_length),
                       diagonal=1)
        )

    def forward(self, x):
        b, num_tokens, d_in = x.shape
        keys = self.W_key(x)       #C Tensor shape: (b, num_tokens, d_out)
        queries = self.W_query(x)  #C
        values = self.W_value(x)   #C

        keys = keys.view(b, num_tokens, self.num_heads, self.head_dim)      #D Split the last dimension: (b, num_tokens, d_out) -> (b, num_tokens, num_heads, head_dim)
        values = values.view(b, num_tokens, self.num_heads, self.head_dim)  #D
        queries = queries.view(                                             #D
            b, num_tokens, self.num_heads, self.head_dim                    #D
        )                                                                   #D

        keys = keys.transpose(1, 2)        #E Transpose: (b, num_tokens, num_heads, head_dim) -> (b, num_heads, num_tokens, head_dim)
        queries = queries.transpose(1, 2)  #E
        values = values.transpose(1, 2)    #E

        attn_scores = queries @ keys.transpose(2, 3)  #F Compute the dot product for each head
        mask_bool = self.mask.bool()[:num_tokens, :num_tokens]  #G Mask truncated to the number of tokens

        attn_scores.masked_fill_(mask_bool, -torch.inf)  #H Use the mask to fill the attention scores

        attn_weights = torch.softmax(
            attn_scores / keys.shape[-1]**0.5, dim=-1)
        attn_weights = self.dropout(attn_weights)

        context_vec = (attn_weights @ values).transpose(1, 2)  #I Tensor shape: (b, num_tokens, num_heads, head_dim)

        #J Combine the heads, where self.d_out = self.num_heads * self.head_dim
        context_vec = context_vec.contiguous().view(
            b, num_tokens, self.d_out
        )
        context_vec = self.out_proj(context_vec)  #K Add an optional linear projection
        return context_vec


Even though the reshaping ( .view ) and transposing ( .transpose ) of
tensors inside the MultiHeadAttention class looks very complicated,
mathematically, the MultiHeadAttention class implements the same
concept as the MultiHeadAttentionWrapper earlier.

On a big-picture level, in the previous MultiHeadAttentionWrapper , we
stacked multiple single-head attention layers that we combined into a
multi-head attention layer. The MultiHeadAttention class takes an
integrated approach. It starts with a multi-head layer and then internally
splits this layer into individual attention heads, as illustrated in figure
3.26.

Figure 3.26 In the MultiHeadAttentionWrapper class with two


attention heads, we initialized two weight matrices, Wq1 and Wq2, and
computed two query matrices, Q1 and Q2, as illustrated at the top of
this figure. In the MultiHeadAttention class, we initialize one larger
weight matrix Wq , only perform one matrix multiplication with the
inputs to obtain a query matrix Q, and then split the query matrix into
Q1 and Q2, as shown at the bottom of this figure. We do the same for
the keys and values, which are not shown to reduce visual clutter.

The splitting of the query, key, and value tensors, as depicted in figure
3.26, is achieved through tensor reshaping and transposing operations
using PyTorch's .view and .transpose methods. The input is first
transformed (via linear layers for queries, keys, and values) and then
reshaped to represent multiple heads.

The key operation is to split the d_out dimension into num_heads and
head_dim , where head_dim = d_out / num_heads . This splitting is then
achieved using the .view method: a tensor of dimensions (b,
num_tokens, d_out) is reshaped to dimension (b, num_tokens,
num_heads, head_dim) .

The tensors are then transposed to bring the num_heads dimension before
the num_tokens dimension, resulting in a shape of (b, num_heads,
num_tokens, head_dim) . This transposition is crucial for correctly aligning
the queries, keys, and values across the different heads and performing
batched matrix multiplications efficiently.
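To make these shape changes concrete, the following small sketch (with made-up values b=2, num_tokens=6, d_out=4, and num_heads=2; it is not part of the chapter's listings) traces a tensor through the .view and .transpose steps:

import torch

b, num_tokens, d_out, num_heads = 2, 6, 4, 2
head_dim = d_out // num_heads                    # 4 // 2 = 2

x = torch.randn(b, num_tokens, d_out)            # stand-in for the projected queries, keys, or values
x = x.view(b, num_tokens, num_heads, head_dim)   # split d_out into (num_heads, head_dim)
x = x.transpose(1, 2)                            # shape becomes (b, num_heads, num_tokens, head_dim)
print(x.shape)                                   # torch.Size([2, 2, 6, 2])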

To illustrate this batched matrix multiplication, suppose we have the
following example tensor:


a = torch.tensor([[[[0.2745, 0.6584, 0.2775, 0.8573],  #A The shape of this tensor is (b, num_heads, num_tokens, head_dim) = (1, 2, 3, 4)
                    [0.8993, 0.0390, 0.9268, 0.7388],
                    [0.7179, 0.7058, 0.9156, 0.4340]],

                   [[0.0772, 0.3565, 0.1479, 0.5331],
                    [0.4066, 0.2318, 0.4545, 0.9737],
                    [0.4606, 0.5159, 0.4220, 0.5786]]]])


Now we perform a batched matrix multiplication between the tensor itself
and a view of the tensor where we transposed the last two dimensions,
num_tokens and head_dim :

print(a @ a.transpose(2, 3))


The result is as follows:

tensor([[[[1.3208, 1.1631, 1.2879],
          [1.1631, 2.2150, 1.8424],
          [1.2879, 1.8424, 2.0402]],

         [[0.4391, 0.7003, 0.5903],
          [0.7003, 1.3737, 1.0620],
          [0.5903, 1.0620, 0.9912]]]])


In this case, the matrix multiplication implementation in PyTorch handles
the four-dimensional input tensor so that the matrix multiplication is
carried out between the two last dimensions (num_tokens, head_dim)
and then repeated for the individual heads.

For instance, the preceding becomes a more compact way to compute the
matrix multiplication for each head separately:

first_head = a[0, 0, :, :]
first_res = first_head @ first_head.T
print("First head:\n", first_res)

second_head = a[0, 1, :, :]
second_res = second_head @ second_head.T
print("\nSecond head:\n", second_res)


The results are exactly the same as those we obtained when using the
batched matrix multiplication print(a @ a.transpose(2, 3)) earlier:

First head:
tensor([[1.3208, 1.1631, 1.2879],
[1.1631, 2.2150, 1.8424],
[1.2879, 1.8424, 2.0402]])

Second head:
tensor([[0.4391, 0.7003, 0.5903],
[0.7003, 1.3737, 1.0620],
[0.5903, 1.0620, 0.9912]])


Continuing with MultiHeadAttention , after computing the attention
weights and context vectors, the context vectors from all heads are
transposed back to the shape (b, num_tokens, num_heads, head_dim) .
These vectors are then reshaped (flattened) into the shape (b,
num_tokens, d_out) , effectively combining the outputs from all heads.
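The following short sketch (again with made-up dimensions, not code from the listing) traces this transpose-and-flatten step in isolation:

import torch

b, num_heads, num_tokens, head_dim = 2, 2, 6, 2
d_out = num_heads * head_dim

context_vec = torch.randn(b, num_heads, num_tokens, head_dim)      # per-head context vectors
context_vec = context_vec.transpose(1, 2)                          # (b, num_tokens, num_heads, head_dim)
context_vec = context_vec.contiguous().view(b, num_tokens, d_out)  # flatten the heads back into d_out
print(context_vec.shape)                                           # torch.Size([2, 6, 4])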

Additionally, we added a so-called output projection layer
( self.out_proj ) to MultiHeadAttention after combining the heads,
which is not present in the CausalAttention class. This output projection
layer is not strictly necessary (see the References section in appendix B for
more details), but it is commonly used in many LLM architectures, which
is why we added it here for completeness.
Even though the MultiHeadAttention class looks more complicated than
the MultiHeadAttentionWrapper due to the additional reshaping and
transposition of tensors, it is more efficient. The reason is that we only
need one matrix multiplication to compute the keys, for instance, keys =
self.W_key(x) (the same is true for the queries and values). In the
MultiHeadAttentionWrapper, we needed to repeat this matrix
multiplication, which is computationally one of the most expensive steps,
for each attention head.
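As a rough illustration of this difference (a sketch with made-up dimensions, not code from the book), a single projection from d_in to d_out produces the keys for all heads in one matrix multiplication, whereas the wrapper approach runs one smaller projection per head:

import torch
import torch.nn as nn

d_in, num_heads, head_dim = 3, 2, 2
d_out = num_heads * head_dim
x = torch.randn(2, 6, d_in)

single_proj = nn.Linear(d_in, d_out, bias=False)   # one matrix multiplication covers every head
per_head_projs = nn.ModuleList(                    # one smaller matrix multiplication per head
    [nn.Linear(d_in, head_dim, bias=False) for _ in range(num_heads)]
)

keys_all = single_proj(x)                                                # (2, 6, 4)
keys_per_head = torch.cat([proj(x) for proj in per_head_projs], dim=-1)  # also (2, 6, 4), but needs num_heads separate matmuls
print(keys_all.shape, keys_per_head.shape)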

The MultiHeadAttention class can be used similar to the SelfAttention
and CausalAttention classes we implemented earlier:

torch.manual_seed(123)
batch_size, context_length, d_in = batch.shape
d_out = 2
mha = MultiHeadAttention(d_in, d_out, context_length, 0.0, num_heads=2)
context_vecs = mha(batch)
print(context_vecs)
print("context_vecs.shape:", context_vecs.shape)


As we can see based on the results, the output dimension is directly
controlled by the d_out argument:

tensor([[[0.3190, 0.4858],
[0.2943, 0.3897],
[0.2856, 0.3593],
[0.2693, 0.3873],
[0.2639, 0.3928],
[0.2575, 0.4028]],

[[0.3190, 0.4858],
[0.2943, 0.3897],
[0.2856, 0.3593],
[0.2693, 0.3873],
[0.2639, 0.3928],
[0.2575, 0.4028]]], grad_fn=<ViewBackward0>)
context_vecs.shape: torch.Size([2, 6, 2])


In this section, we implemented the MultiHeadAttention class that we
will use in the upcoming sections when implementing and training the
LLM itself. Note that while the code is fully functional, we used relatively
small embedding sizes and numbers of attention heads to keep the outputs
readable.

For comparison, the smallest GPT-2 model (117 million parameters) has 12
attention heads and a context vector embedding size of 768. The largest
GPT-2 model (1.5 billion parameters) has 25 attention heads and a context
vector embedding size of 1,600. Note that the embedding sizes of the token
inputs and context embeddings are the same in GPT models ( d_in =
d_out ).

Exercise 3.3 Initializing GPT-2 size attention modules

Using the MultiHeadAttention class, initialize a multi-head
attention module that has the same number of attention heads as the
smallest GPT-2 model (12 attention heads). Also ensure that you use
the respective input and output embedding sizes similar to GPT-2
(768 dimensions). Note that the smallest GPT-2 model supports a
context length of 1,024 tokens.



3.7 Summary
Attention mechanisms transform input elements into enhanced
context vector representations that incorporate information
about all inputs.
A self-attention mechanism computes the context vector
representation as a weighted sum over the inputs.
In a simplified attention mechanism, the attention weights are
computed via dot products.
A dot product is just a concise way of multiplying two vectors
element-wise and then summing the products.
Matrix multiplications, while not strictly required, help us to
implement computations more efficiently and compactly by
replacing nested for-loops.
In self-attention mechanisms used in LLMs, also called scaled
dot-product attention, we include trainable weight matrices to
compute intermediate transformations of the inputs: queries,
values, and keys.
When working with LLMs that read and generate text from left
to right, we add a causal attention mask to prevent the LLM from
accessing future tokens.
Next to causal attention masks to zero out attention weights, we
can also add a dropout mask to reduce overfitting in LLMs.
The attention modules in transformer-based LLMs involve
multiple instances of causal attention, which is called multi-
head attention.
We can create a multi-head attention module by stacking
multiple instances of causal attention modules.
A more efficient way of creating multi-head attention modules
involves batched matrix multiplications.
