L19-20 PPT IVSem

Lecture 19-20
B-Trees and Hashing
Contents
Introduction
B+-Tree Node Structure
Queries on B+-Trees
Insertion and deletion
B-Trees
Advantages of B-Tree over B+-Tree
Hashing
Static and Dynamic hashing
B+-Tree Index Files
B+-tree indices are an alternative to indexed-sequential files.
Disadvantage of indexed-sequential files:
  performance degrades as the file grows, since many overflow blocks get created.
  Periodic reorganization of the entire file is required.
Advantage of B+-tree index files:
  automatically reorganizes itself with small, local changes in the face of insertions and deletions.
  Reorganization of the entire file is not required to maintain performance.
(Minor) disadvantage of B+-trees:
  extra insertion and deletion overhead, space overhead.
Advantages of B+-trees outweigh disadvantages
  B+-trees are used extensively
B+-Tree Index Files (Cont.)
A B+-tree is a rooted tree satisfying the following properties:
All paths from root to leaf are of the same length
Each node that is not a root or a leaf has between $\lceil n/2 \rceil$ and $n$ children.
A leaf node has between $\lceil (n-1)/2 \rceil$ and $n-1$ values
Special cases:
  If the root is not a leaf, it has at least 2 children.
  If the root is a leaf (that is, there are no other nodes in the tree), it can have between 0 and $(n-1)$ values.
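The occupancy rules above can be checked numerically. The following is a minimal sketch; the function name `occupancy_bounds` and the dictionary keys are illustrative choices, not part of the lecture.

```python
from math import ceil

def occupancy_bounds(n):
    """Return (min, max) occupancy implied by the B+-tree properties for a given n.

    Hypothetical helper for illustration only.
    """
    return {
        # non-root internal nodes: between ceil(n/2) and n children
        "nonleaf_children": (ceil(n / 2), n),
        # non-root leaves: between ceil((n-1)/2) and n-1 values
        "leaf_values": (ceil((n - 1) / 2), n - 1),
        # a root that is also a leaf: between 0 and n-1 values
        "root_leaf_values": (0, n - 1),
    }

print(occupancy_bounds(5))
```

For n = 5 this gives 2 to 4 values per leaf and 3 to 5 children per non-leaf node.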
B+-Tree Node Structure
Typical node
$K_i$ are the search-key values
$P_i$ are pointers to children (for non-leaf nodes) or pointers to records or buckets of records (for leaf nodes).
The search-keys in a node are ordered
$K_1 < K_2 < K_3 < \dots < K_{n-1}$
Leaf Nodes in B+-Trees
Properties of a leaf node:
For $i = 1, 2, \dots, n-1$, pointer $P_i$ either points to a file record with search-key value $K_i$, or to a bucket of pointers to file records, each record having search-key value $K_i$. Only need bucket structure if search-key does not form a primary key.
If $L_i$, $L_j$ are leaf nodes and $i < j$, $L_i$'s search-key values are less than $L_j$'s search-key values
$P_n$ points to next leaf node in search-key order
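Because the last pointer of each leaf points to the next leaf in search-key order, a range of keys can be read by walking the leaf chain. A minimal sketch, assuming a simplified in-memory `Leaf` class (hypothetical; record pointers are plain strings here):

```python
class Leaf:
    """Simplified leaf: parallel lists of keys and record pointers, plus P_n."""
    def __init__(self, keys, records, next_leaf=None):
        self.keys = keys            # sorted search-key values
        self.records = records      # records[i] belongs to keys[i]
        self.next_leaf = next_leaf  # plays the role of P_n

def range_scan(start_leaf, low, high):
    """Collect (key, record) pairs with low <= key <= high, following the leaf chain."""
    results = []
    leaf = start_leaf
    while leaf is not None:
        for key, rec in zip(leaf.keys, leaf.records):
            if key > high:
                return results      # keys are globally sorted, so we can stop
            if key >= low:
                results.append((key, rec))
        leaf = leaf.next_leaf
    return results

l3 = Leaf(["Perryridge", "Redwood"], ["r4", "r5"])
l2 = Leaf(["Downtown", "Mianus"], ["r2", "r3"], l3)
l1 = Leaf(["Brighton"], ["r1"], l2)
print(range_scan(l1, "Downtown", "Perryridge"))
```

In a real index the scan would start at the leaf located by a lookup of the lower bound rather than at the first leaf.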
Non-Leaf Nodes in B+-Trees
Non-leaf nodes form a multi-level sparse index on the leaf nodes.
For a non-leaf node with $m$ pointers:
  All the search-keys in the subtree to which $P_1$ points are less than $K_1$
  For $2 \le i \le m-1$, all the search-keys in the subtree to which $P_i$ points have values greater than or equal to $K_{i-1}$ and less than $K_i$
  All the search-keys in the subtree to which $P_m$ points have values greater than or equal to $K_{m-1}$
Example of a B+-tree
B+-tree for account file (n = 3)
Example of B+-tree
B+-tree for account file (n = 5)
Leaf nodes must have between 2 and 4 values ($\lceil (n-1)/2 \rceil$ and $n-1$, with n = 5).
Non-leaf nodes other than root must have between 3 and 5 children ($\lceil n/2 \rceil$ and $n$, with n = 5).
Root must have at least 2 children.
Observations about B+-trees
Since the inter-node connections are done by pointers, "logically" close blocks need not be "physically" close.
The non-leaf levels of the B+-tree form a hierarchy of sparse indices.
The B+-tree contains a relatively small number of levels
  Level below root has at least $2 \cdot \lceil n/2 \rceil$ values
  Next level has at least $2 \cdot \lceil n/2 \rceil \cdot \lceil n/2 \rceil$ values
  .. etc.
  If there are $K$ search-key values in the file, the tree height is no more than $\lceil \log_{\lceil n/2 \rceil}(K) \rceil$
  thus searches can be conducted efficiently.
Insertions and deletions to the main file can be handled efficiently, as the index can be restructured in logarithmic time (as we shall see).
Queries on B+-Trees
Find all records with a search-key value of k.
1. N = root
2. Repeat
   1. Examine N for the smallest search-key value > k.
   2. If such a value exists, assume it is $K_i$. Then set N = $P_i$
   3. Otherwise $k \ge K_{n-1}$. Set N = $P_n$
   Until N is a leaf node
3. If for some i, key $K_i = k$, follow pointer $P_i$ to the desired record or bucket.
4. Else no record with search-key value k exists.
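The lookup steps above can be sketched in a few lines. This is an illustrative in-memory version, not the book's pseudocode: the `Node` class and its field names are assumptions, and `bisect_right` is used to find the position of the smallest key greater than k.

```python
from bisect import bisect_right

class Node:
    """Hypothetical node: internal nodes hold len(keys)+1 child pointers,
    leaves hold one record pointer per key."""
    def __init__(self, keys, pointers, is_leaf):
        self.keys = keys
        self.pointers = pointers
        self.is_leaf = is_leaf

def find(root, k):
    node = root
    while not node.is_leaf:
        # Index of the smallest key > k; equals len(keys) if no such key exists,
        # in which case the last pointer is followed.
        i = bisect_right(node.keys, k)
        node = node.pointers[i]
    for key, ptr in zip(node.keys, node.pointers):
        if key == k:
            return ptr          # pointer to the record or bucket
    return None                 # no record with search-key value k

leaf1 = Node(["Brighton", "Downtown"], ["rec-B", "rec-D"], True)
leaf2 = Node(["Mianus", "Perryridge"], ["rec-M", "rec-P"], True)
root = Node(["Mianus"], [leaf1, leaf2], False)
print(find(root, "Downtown"), find(root, "Clearview"))
```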
Queries on B+-Trees (Cont.)
If there are $K$ search-key values in the file, the height of the tree is no more than $\lceil \log_{\lceil n/2 \rceil}(K) \rceil$.
A node is generally the same size as a disk block, typically 4 kilobytes
  and n is typically around 100 (40 bytes per index entry).
With 1 million search key values and n = 100
  at most $\lceil \log_{50}(1{,}000{,}000) \rceil = 4$ nodes are accessed in a lookup.
Contrast this with a balanced binary tree with 1 million search key values: around 20 nodes are accessed in a lookup
  the above difference is significant since every node access may need a disk I/O, costing around 20 milliseconds
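The two figures above (4 node accesses versus about 20) follow from the height bound. A small sketch that computes the bound with integer arithmetic, so no floating-point logarithm is involved; the function name `min_levels` is an illustrative choice:

```python
def min_levels(num_keys, fanout):
    """Smallest h with fanout**h >= num_keys, i.e. ceil(log_fanout(num_keys))."""
    levels, capacity = 0, 1
    while capacity < num_keys:
        capacity *= fanout
        levels += 1
    return levels

# B+-tree with n = 100: each node has at least ceil(100/2) = 50 children
print(min_levels(1_000_000, 50))
# balanced binary tree: fanout 2
print(min_levels(1_000_000, 2))
```

At roughly 20 milliseconds per disk I/O, as quoted on the slide, that is about 80 ms against about 400 ms per lookup in the worst case.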
Updates on B+-Trees: Insertion
1. Find the leaf node in which the search-key value would appear
2. If the search-key value is already present in the leaf node
   1. Add record to the file
3. If the search-key value is not present, then
   1. add the record to the main file (and create a bucket if necessary)
   2. If there is room in the leaf node, insert (key-value, pointer) pair in the leaf node
   3. Otherwise, split the node (along with the new (key-value, pointer) entry) as discussed in the next slide.
Splitting a leaf node:
  take the n (search-key value, pointer) pairs (including the one being inserted) in sorted order. Place the first $\lceil n/2 \rceil$ in the original node, and the rest in a new node.
  let the new node be p, and let k be the least key value in p. Insert (k, p) in the parent of the node being split.
  If the parent is full, split it and propagate the split further up.
Splitting of nodes proceeds upwards till a node that is not full is found.
  In the worst case the root node may be split, increasing the height of the tree by 1.
Result of splitting node containing Brighton and Downtown on inserting Clearview
Next step: insert entry with (Downtown, pointer to new node) into parent
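The leaf split described above can be sketched as follows. This is a simplified illustration operating on plain lists of (key, pointer) pairs; the function name and the pointer strings are assumptions.

```python
from math import ceil

def split_leaf(entries, new_entry, n):
    """Split a full leaf (n-1 sorted (key, pointer) pairs) while inserting new_entry.

    Returns (original_node_entries, new_node_entries, k), where k is the least
    key in the new node and is the key to insert into the parent.
    """
    all_entries = sorted(entries + [new_entry], key=lambda e: e[0])
    keep = ceil(n / 2)                      # first ceil(n/2) pairs stay put
    left, right = all_entries[:keep], all_entries[keep:]
    return left, right, right[0][0]

left, right, k = split_leaf(
    [("Brighton", "pB"), ("Downtown", "pD")], ("Clearview", "pC"), n=3
)
print(left, right, k)
```

With n = 3 the original node keeps Brighton and Clearview, the new node holds Downtown, and Downtown is the key passed up to the parent, matching the example on the slide.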
Updates on B+-Trees: Insertion (Cont.)
B+-Tree before and after insertion of "Clearview"
Insertion in B+-Trees (Cont.)
Splitting a non-leaf node: when inserting (k, p) into an already full internal node N
  Copy N to an in-memory area M with space for n+1 pointers and n keys
  Insert (k, p) into M
  Copy $P_1, K_1, \dots, K_{\lceil n/2 \rceil - 1}, P_{\lceil n/2 \rceil}$ from M back into node N
  Copy $P_{\lceil n/2 \rceil + 1}, K_{\lceil n/2 \rceil + 1}, \dots, K_n, P_{n+1}$ from M into newly allocated node N'
  Insert $(K_{\lceil n/2 \rceil}, N')$ into the parent of N
Read pseudocode in book!
[Figure: non-leaf node split; keys shown: Downtown, Mianus, Perryridge, Redwood]
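The non-leaf split above differs from the leaf split in that the middle key moves up to the parent instead of being copied. A minimal sketch on plain lists (names and pointer strings are hypothetical; this is not the book's pseudocode):

```python
from bisect import bisect_right
from math import ceil

def split_nonleaf(keys, pointers, k, p, n):
    """Split a full internal node (n-1 keys, n pointers) while inserting (k, p).

    Returns (left_keys, left_ptrs, pushed_up_key, right_keys, right_ptrs).
    """
    # M: in-memory copy with room for n keys and n+1 pointers
    m_keys, m_ptrs = list(keys), list(pointers)
    pos = bisect_right(m_keys, k)
    m_keys.insert(pos, k)
    m_ptrs.insert(pos + 1, p)       # p sits immediately to the right of k
    half = ceil(n / 2)
    left_keys, left_ptrs = m_keys[:half - 1], m_ptrs[:half]   # stays in N
    pushed_up = m_keys[half - 1]                              # goes to the parent
    right_keys, right_ptrs = m_keys[half:], m_ptrs[half:]     # goes to N'
    return left_keys, left_ptrs, pushed_up, right_keys, right_ptrs

print(split_nonleaf(["Mianus", "Redwood"], ["p1", "p2", "p3"], "Downtown", "pD", n=3))
```

Note that the pushed-up key appears in neither half, which is why internal keys are never duplicated between sibling non-leaf nodes.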
Updates on B+-Trees: Deletion
Find the record to be deleted, and remove it from the main file and from the bucket (if present)
Remove (search-key value, pointer) from the leaf node if there is no bucket or if the bucket has become empty
If the node has too few entries due to the removal, and the entries in the node and a sibling fit into a single node, then merge siblings:
  Insert all the search-key values in the two nodes into a single node (the one on the left), and delete the other node.
  Delete the pair ($K_{i-1}$, $P_i$), where $P_i$ is the pointer to the deleted node, from its parent, recursively using the above procedure.
Updates on B+-Trees: Deletion
Otherwise, if the node has too few entries due to the removal, but the entries in the node and a sibling do not fit into a single node, then redistribute pointers:
  Redistribute the pointers between the node and a sibling such that both have at least the minimum number of entries.
  Update the corresponding search-key value in the parent of the node.
The node deletions may cascade upwards till a node which has $\lceil n/2 \rceil$ or more pointers is found.
If the root node has only one pointer after deletion, it is deleted and the sole child becomes the root.
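The choice between merging and redistributing can be sketched for two adjacent leaves. This is a simplified illustration on key lists only (record pointers and the parent update are omitted); the function name and return format are assumptions.

```python
from math import ceil

def rebalance_leaves(left, right, n):
    """left/right: sorted key lists of adjacent leaves, one of which is underfull.

    Returns (action, new_left, new_right, new_separator).
    """
    max_keys = n - 1
    combined = left + right
    if len(combined) <= max_keys:
        # Everything fits in one node: merge into the left node, drop the right.
        return ("merge", combined, [], None)
    # Otherwise split the entries roughly evenly between the two leaves.
    mid = len(combined) // 2
    new_left, new_right = combined[:mid], combined[mid:]
    assert min(len(new_left), len(new_right)) >= ceil((n - 1) / 2)
    # The parent's separating key becomes the least key of the right leaf.
    return ("redistribute", new_left, new_right, new_right[0])

print(rebalance_leaves(["A", "B", "C", "D"], ["E"], n=5))
print(rebalance_leaves(["Brighton"], [], n=3))
```

After a merge, the parent loses one (key, pointer) pair and may itself become underfull, which is how deletions cascade upwards.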
Examples of B+-Tree Deletion
Before and after deleting "Downtown"
Deleting "Downtown" causes merging of under-full leaves
  leaf node can become empty only for n = 3!
Examples of B+-Tree Deletion (Cont.)
Before and after deletion of "Perryridge" from result of previous example
Leaf with "Perryridge" becomes underfull (actually empty, in this special case) and merged with its sibling.
As a result "Perryridge" node's parent became underfull, and was merged with its sibling
  Value separating two nodes (at parent) moves into merged node
  Entry deleted from parent
Root node then has only one child, and is deleted
Example of B+-tree Deletion (Cont.)
Before and after deletion of "Perryridge" from earlier example
Parent of leaf containing Perryridge became underfull, and borrowed a pointer from its left sibling
Search-key value in the parent's parent changes as a result
B-Tree Index Files
Similar to B+-tree, but B-tree allows search-key values to appear only once; eliminates redundant storage of search keys.
Search keys in non-leaf nodes appear nowhere else in the B-tree; an additional pointer field for each search key in a non-leaf node must be included.
Generalized B-tree leaf node
Non-leaf node: pointers $B_i$ are the bucket or file record pointers.
B-Tree Index File Example
B-tree (above) and B+-tree (below) on same data
B-Tree Index Files (Cont.)
Advantages of B-Tree indices:
  May use fewer tree nodes than a corresponding B+-Tree.
  Sometimes possible to find search-key value before reaching leaf node.
Disadvantages of B-Tree indices:
  Only a small fraction of all search-key values are found early
  Non-leaf nodes are larger, so fan-out is reduced. Thus, B-Trees typically have greater depth than the corresponding B+-Tree
  Insertion and deletion more complicated than in B+-Trees
  Implementation is harder than B+-Trees.
Typically, advantages of B-Trees do not outweigh disadvantages.
Hashing
Static Hashing
A bucket is a unit of storage containing one or more records (a bucket is typically a disk block).
In a hash file organization we obtain the bucket of a record directly from its search-key value using a hash function.
Hash function h is a function from the set of all search-key values K to the set of all bucket addresses B.
Hash function is used to locate records for access, insertion as well as deletion.
Records with different search-key values may be mapped to the same bucket; thus entire bucket has to be searched sequentially to locate a record.
Example of Hash File Organization
Hash file organization of account file, using branch_name as key (see figure in next slide).
There are 10 buckets.
The binary representation of the ith character of the alphabet is assumed to be the integer i.
The hash function returns the sum of the binary representations of the characters modulo 10
  E.g. h(Perryridge) = 5, h(Round Hill) = 3, h(Brighton) = 3
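The example hash function above can be reproduced directly. A minimal sketch, assuming that letters are case-insensitive and that non-letters such as the space in "Round Hill" contribute nothing (an assumption that is consistent with the three values quoted):

```python
def h(branch_name, num_buckets=10):
    """Sum of alphabet positions (a=1, ..., z=26) of the letters, modulo num_buckets."""
    total = 0
    for ch in branch_name.lower():
        if "a" <= ch <= "z":            # assumption: skip spaces and other symbols
            total += ord(ch) - ord("a") + 1
    return total % num_buckets

for name in ("Perryridge", "Round Hill", "Brighton"):
    print(name, h(name))
```

Round Hill and Brighton land in the same bucket, which illustrates why a bucket must be searched sequentially for the requested key.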
Example of Hash File Organization
Hash file organization of account file, using branch_name as key (see previous slide for details).
Hash Functions
Worst hash function maps all search-key values to the same bucket; this makes access time proportional to the number of search-key values in the file.
An ideal hash function is uniform, i.e., each bucket is assigned the same number of search-key values from the set of all possible values.
Ideal hash function is random, so each bucket will have the same number of records assigned to it irrespective of the actual distribution of search-key values in the file.
Handling of Bucket Overflows
Bucket overflow can occur because of
  Insufficient buckets
  Skew in distribution of records. This can occur due to two reasons:
    multiple records have same search-key value
    chosen hash function produces non-uniform distribution of key values
Although the probability of bucket overflow can be reduced, it cannot be eliminated; it is handled by using overflow buckets.
Handling of Bucket Overflows (Cont.)
Overflow chaining: the overflow buckets of a given bucket are chained together in a linked list.
Above scheme is called closed hashing.
  An alternative, called open hashing, which does not use overflow buckets, is not suitable for database applications.
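Overflow chaining can be sketched with a fixed array of buckets, each with a bounded capacity and a link to an overflow bucket. This is an in-memory illustration only; the class names, the capacity parameter and the caller-supplied `hash_fn` are assumptions.

```python
class Bucket:
    def __init__(self, capacity):
        self.capacity = capacity
        self.records = []       # list of (key, record) pairs
        self.overflow = None    # next bucket in the overflow chain

class StaticHashFile:
    def __init__(self, num_buckets, bucket_capacity, hash_fn):
        self.hash_fn = hash_fn
        self.capacity = bucket_capacity
        self.buckets = [Bucket(bucket_capacity) for _ in range(num_buckets)]

    def _home(self, key):
        return self.buckets[self.hash_fn(key) % len(self.buckets)]

    def insert(self, key, record):
        bucket = self._home(key)
        # Walk the chain until a bucket with free space is found,
        # allocating a new overflow bucket at the end if necessary.
        while len(bucket.records) >= bucket.capacity:
            if bucket.overflow is None:
                bucket.overflow = Bucket(self.capacity)
            bucket = bucket.overflow
        bucket.records.append((key, record))

    def lookup(self, key):
        # The whole chain must be searched sequentially.
        found, bucket = [], self._home(key)
        while bucket is not None:
            found.extend(rec for k, rec in bucket.records if k == key)
            bucket = bucket.overflow
        return found

f = StaticHashFile(num_buckets=10, bucket_capacity=2, hash_fn=len)
for key, rec in (("aaa", "r1"), ("bbb", "r2"), ("ccc", "r3")):
    f.insert(key, rec)
print(f.lookup("ccc"))
```

Using `len` as the hash function is deliberately poor: all three keys collide, so the third record ends up in an overflow bucket.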
Hash Indices
Hashing can be used not only for file organization, but also for index-structure creation.
A hash index organizes the search keys, with their associated record pointers, into a hash file structure.
Strictly speaking, hash indices are always secondary indices
  if the file itself is organized using hashing, a separate primary hash index on it using the same search-key is unnecessary.
  However, we use the term hash index to refer to both secondary index structures and hash organized files.
Example of Hash Index
Deficiencies of Static Hashing
In static hashing, function h maps search-key values to a fixed set B of bucket addresses. Databases grow or shrink with time.
  If initial number of buckets is too small, and file grows, performance will degrade due to too many overflows.
  If space is allocated for anticipated growth, a significant amount of space will be wasted initially (and buckets will be underfull).
  If database shrinks, again space will be wasted.
One solution: periodic re-organization of the file with a new hash function
  Expensive, disrupts normal operations
Better solution: allow the number of buckets to be modified dynamically.
Dynamic Hashing
Good for database that grows and shrinks in size
Allows the hash function to be modified dynamically
Extendable hashing: one form of dynamic hashing
  Hash function generates values over a large range, typically b-bit integers, with b = 32.
  At any time use only a prefix of the hash function to index into a table of bucket addresses.
  Let the length of the prefix be i bits, $0 \le i \le 32$.
    Bucket address table size = $2^i$. Initially i = 0
    Value of i grows and shrinks as the size of the database grows and shrinks.
  Multiple entries in the bucket address table may point to a bucket (why?)
  Thus, actual number of buckets is $\le 2^i$
    The number of buckets also changes dynamically due to coalescing and splitting of buckets.
General Extendable Hash Structure
In this structure, $i_2 = i_3 = i$, whereas $i_1 = i - 1$ (see next slide for details)
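The structure above (a bucket address table indexed by an i-bit hash prefix, with a per-bucket prefix length) can be sketched in memory as follows. This is an illustrative simplification, not the textbook algorithm in full: the class names are hypothetical, `zlib.crc32` is used only as a stand-in 32-bit hash, coalescing on deletion is omitted, and there are no overflow buckets, so more colliding hash values than the bucket capacity are not handled.

```python
import zlib

HASH_BITS = 32

class Bucket:
    def __init__(self, depth):
        self.depth = depth      # local prefix length i_j
        self.items = {}

class ExtendableHash:
    def __init__(self, bucket_capacity=2, hash_fn=None):
        self.capacity = bucket_capacity
        # Assumption: any function returning a 32-bit unsigned integer will do.
        self.hash_fn = hash_fn or (lambda key: zlib.crc32(repr(key).encode()))
        self.i = 0                      # global prefix length
        self.table = [Bucket(0)]        # bucket address table, size 2**i

    def _index(self, key):
        # Use the top i bits of the hash value as the table index.
        return self.hash_fn(key) >> (HASH_BITS - self.i) if self.i else 0

    def get(self, key):
        return self.table[self._index(key)].items.get(key)

    def put(self, key, value):
        while True:
            bucket = self.table[self._index(key)]
            if key in bucket.items or len(bucket.items) < self.capacity:
                bucket.items[key] = value
                return
            self._split(bucket)

    def _split(self, bucket):
        if bucket.depth == self.i:
            # Double the table: every entry is duplicated in place.
            self.table = [b for b in self.table for _ in (0, 1)]
            self.i += 1
        bucket.depth += 1
        new_bucket = Bucket(bucket.depth)
        # Entries whose next prefix bit is 1 now point to the new bucket.
        for idx, b in enumerate(self.table):
            if b is bucket and (idx >> (self.i - bucket.depth)) & 1:
                self.table[idx] = new_bucket
        # Redistribute the old bucket's items between the two buckets.
        old_items, bucket.items = bucket.items, {}
        for k, v in old_items.items():
            self.table[self._index(k)].items[k] = v

eh = ExtendableHash(bucket_capacity=2, hash_fn=lambda k: k)
for key in (0x00000000, 0x40000000, 0x80000000, 0xC0000000, 0x20000000):
    eh.put(key, hex(key))
print(eh.i, sorted({id(b): b.depth for b in eh.table}.values()))
```

In this run the identity hash is used so the prefixes are easy to follow: the table ends with i = 2 and three buckets, two with prefix length i and one with prefix length i - 1. The shallower bucket is referenced by two table entries, which is why the number of buckets can be smaller than the table size.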
Extendable Hashing vs. Other Schemes
Benefits of extendable hashing:
  Hash performance does not degrade with growth of file
  Minimal space overhead
Disadvantages of extendable hashing
  Extra level of indirection to find desired record
  Bucket address table may itself become very big (larger than memory)
    Cannot allocate very large contiguous areas on disk either
    Solution: B+-tree file organization to store bucket address table
  Changing size of bucket address table is an expensive operation
Linear hashing is an alternative mechanism
  Allows incremental growth of its directory (equivalent to bucket address table)
  At the cost of more bucket overflows
Comparison of Ordered Indexing and Hashing
Cost of periodic re-organization
Relative frequency of insertions and deletions
Is it desirable to optimize average access time at the expense of worst-case access time?
Expected type of queries:
  Hashing is generally better at retrieving records having a specified value of the key.
  If range queries are common, ordered indices are to be preferred
In practice:
  PostgreSQL supports hash indices, but discourages use due to poor performance
  Oracle supports static hash organization, but not hash indices
  SQLServer supports only B+-trees
