
Lecture 19-20

B-Trees and Hashing

Contents

Introduction
B+-Tree Node Structure
Queries on B+-Trees
Insertion and deletion
B-Trees
Advantages of B-Tree over B+-Tree
Hashing
Static and Dynamic hashing

B+-Tree Index Files
B+-tree indices are an alternative to indexed-sequential files.

Disadvantage of indexed-sequential files:
performance degrades as the file grows, since many overflow blocks get created.
Periodic reorganization of the entire file is required.
Advantage of B+-tree index files:
automatically reorganizes itself with small, local changes in the face of insertions and deletions.
Reorganization of the entire file is not required to maintain performance.
(Minor) disadvantage of B+-trees:
extra insertion and deletion overhead, space overhead.
Advantages of B+-trees outweigh disadvantages
B+-trees are used extensively

B+-Tree Index Files (Cont.)
A B+-tree is a rooted tree satisfying the following properties:

All paths from root to leaf are of the same length
Each node that is not a root or a leaf has between ⌈n/2⌉ and n children.
A leaf node has between ⌈(n-1)/2⌉ and n-1 values
Special cases:
If the root is not a leaf, it has at least 2 children.
If the root is a leaf (that is, there are no other nodes in the tree), it can have between 0 and (n-1) values.

B+-Tree Node Structure
Typical node: P_1, K_1, P_2, ..., P_{n-1}, K_{n-1}, P_n

K_i are the search-key values
P_i are pointers to children (for non-leaf nodes) or pointers to records or buckets of records (for leaf nodes).
The search-keys in a node are ordered
K_1 < K_2 < K_3 < ... < K_{n-1}

Leaf Nodes in B+-Trees

Properties of a leaf node:

For i = 1, 2, ..., n-1, pointer P_i either points to a file record with search-key value K_i, or to a bucket of pointers to file records, each record having search-key value K_i. Only need bucket structure if search-key does not form a primary key.
If L_i, L_j are leaf nodes and i < j, L_i's search-key values are less than L_j's search-key values
P_n points to next leaf node in search-key order

Non-Leaf Nodes in B+-Trees

Non-leaf nodes form a multi-level sparse index on the leaf nodes.
For a non-leaf node with m pointers:
All the search-keys in the subtree to which P_1 points are less than K_1
For 2 ≤ i ≤ m-1, all the search-keys in the subtree to which P_i points have values greater than or equal to K_{i-1} and less than K_i
All the search-keys in the subtree to which P_m points have values greater than or equal to K_{m-1}

Example of a B+-tree

B+-tree for account file (n = 3)

Example of B+-tree

B+-tree for account file (n = 5)

Leaf nodes must have between 2 and 4 values (⌈(n-1)/2⌉ and n-1, with n = 5).
Non-leaf nodes other than root must have between 3 and 5 children (⌈n/2⌉ and n with n = 5).
Root must have at least 2 children.
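
The occupancy limits quoted for n = 5 follow directly from the ⌈(n-1)/2⌉ and ⌈n/2⌉ rules. A minimal Python sketch of that arithmetic is given here; the function name `occupancy_bounds` is only an illustrative choice and does not come from the slides.

```python
import math


def occupancy_bounds(n):
    """Return (min, max) occupancy for leaf and non-root internal nodes.

    n is the maximum number of pointers in a node, as in the slides.
    """
    leaf_values = (math.ceil((n - 1) / 2), n - 1)
    nonleaf_children = (math.ceil(n / 2), n)
    return {"leaf_values": leaf_values, "nonleaf_children": nonleaf_children}


bounds_n5 = occupancy_bounds(5)
bounds_n3 = occupancy_bounds(3)
print(bounds_n5)
print(bounds_n3)
```

For n = 5 this gives 2 to 4 values per leaf and 3 to 5 children per internal node, matching the example; for n = 3 a leaf may hold as little as one value.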

Observations about B+-trees

Since the inter-node connections are done by pointers, logically close blocks need not be physically close.
The non-leaf levels of the B+-tree form a hierarchy of sparse indices.
The B+-tree contains a relatively small number of levels
Level below root has at least 2*⌈n/2⌉ values
Next level has at least 2*⌈n/2⌉*⌈n/2⌉ values
.. etc.
If there are K search-key values in the file, the tree height is no more than ⌈log_{⌈n/2⌉}(K)⌉
thus searches can be conducted efficiently.
Insertions and deletions to the main file can be handled efficiently, as the index can be restructured in logarithmic time (as we shall see).

Queries on B+-Trees

Find all records with a search-key value of k.
1. N = root
2. Repeat
   1. Examine N for the smallest search-key value > k.
   2. If such a value exists, assume it is K_i. Then set N = P_i
   3. Otherwise k ≥ K_{n-1}. Set N = P_n
   Until N is a leaf node
3. If for some i, key K_i = k follow pointer P_i to the desired record or bucket.
4. Else no record with search-key value k exists.
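
The lookup steps above can be sketched in Python as follows. This is an illustrative sketch, not the textbook's pseudocode: the `Node` class, the `find` function and the tiny hand-built tree are assumptions made for the example, and leaf pointers are simplified to one record per key (no buckets, no next-leaf pointer).

```python
from bisect import bisect_right


class Node:
    """Hypothetical in-memory B+-tree node used only for this sketch."""

    def __init__(self, keys, pointers, is_leaf):
        self.keys = keys          # sorted search-key values
        self.pointers = pointers  # children (non-leaf) or records (leaf)
        self.is_leaf = is_leaf


def find(root, k):
    """Return the record for search-key k, or None if it is absent."""
    node = root
    while not node.is_leaf:
        # Position of the smallest key strictly greater than k. If no such
        # key exists this equals len(keys), which selects the last pointer.
        i = bisect_right(node.keys, k)
        node = node.pointers[i]
    for key, record in zip(node.keys, node.pointers):
        if key == k:
            return record
    return None


# Small hand-built tree with n = 3 (at most 2 keys per node).
leaf1 = Node(["Brighton", "Downtown"], ["rec:Brighton", "rec:Downtown"], True)
leaf2 = Node(["Mianus", "Perryridge"], ["rec:Mianus", "rec:Perryridge"], True)
leaf3 = Node(["Redwood", "Round Hill"], ["rec:Redwood", "rec:Round Hill"], True)
root = Node(["Mianus", "Redwood"], [leaf1, leaf2, leaf3], False)

print(find(root, "Perryridge"))
print(find(root, "Clearview"))
```

`bisect_right` performs the "smallest search-key value > k" step with a binary search inside the node; a linear scan would follow the slide equally well.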

Queries on B+-Trees (Cont.)

If there are K search-key values in the file, the height of the tree is no more than ⌈log_{⌈n/2⌉}(K)⌉.
A node is generally the same size as a disk block, typically 4 kilobytes
and n is typically around 100 (40 bytes per index entry).
With 1 million search-key values and n = 100
at most ⌈log_50(1,000,000)⌉ = 4 nodes are accessed in a lookup.
Contrast this with a balanced binary tree with 1 million search-key values: around 20 nodes are accessed in a lookup
The above difference is significant since every node access may need a disk I/O, costing around 20 milliseconds
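
The numbers on this slide can be checked with a few lines of Python. The helper names below are illustrative assumptions; the 20 ms figure is the per-access cost quoted above.

```python
import math


def bplus_max_nodes(num_keys, n):
    """Upper bound on nodes visited: ceil(log base ceil(n/2) of num_keys)."""
    return math.ceil(math.log(num_keys, math.ceil(n / 2)))


def binary_tree_nodes(num_keys):
    """Approximate depth of a balanced binary tree over num_keys keys."""
    return math.ceil(math.log2(num_keys))


bplus_accesses = bplus_max_nodes(1_000_000, 100)
binary_accesses = binary_tree_nodes(1_000_000)
disk_io_ms = 20  # per-access cost assumed on the slide

print(bplus_accesses, "accesses, about", bplus_accesses * disk_io_ms, "ms")
print(binary_accesses, "accesses, about", binary_accesses * disk_io_ms, "ms")
```

Under these assumptions a lookup costs roughly 80 ms with the B+-tree against roughly 400 ms with the binary tree.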

Updates on B+-Trees: Insertion
1. Find the leaf node in which the search-key value would appear
2. If the search-key value is already present in the leaf node
   1. Add record to the file
3. If the search-key value is not present, then
   1. add the record to the main file (and create a bucket if necessary)
   2. If there is room in the leaf node, insert (key-value, pointer) pair in the leaf node
   3. Otherwise, split the node (along with the new (key-value, pointer) entry) as discussed in the next slide.

Splitting a leaf node:
take the n (search-key value, pointer) pairs (including the one being inserted) in sorted order. Place the first ⌈n/2⌉ in the original node, and the rest in a new node.
let the new node be p, and let k be the least key value in p. Insert (k, p) in the parent of the node being split.
If the parent is full, split it and propagate the split further up.
Splitting of nodes proceeds upwards till a node that is not full is found.
In the worst case the root node may be split, increasing the height of the tree by 1.

Result of splitting node containing Brighton and Downtown on inserting Clearview
Next step: insert entry with (Downtown, pointer-to-new-node) into parent
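
The leaf split can be sketched as below, using the Brighton/Downtown/Clearview case with n = 3. The function `split_leaf` and the record labels `r1`, `r2`, `r3` are hypothetical names introduced for the example; propagating the split into the parent is left out.

```python
import math


def split_leaf(entries, new_entry, n):
    """Split a full leaf that is receiving one more (key, pointer) pair.

    entries holds the n-1 pairs already in the leaf. Returns the pairs kept
    in the original node, the pairs moved to the new node, and the least key
    of the new node (the key to insert into the parent).
    """
    all_entries = sorted(entries + [new_entry], key=lambda e: e[0])
    cut = math.ceil(n / 2)            # first ceil(n/2) pairs stay put
    original = all_entries[:cut]
    new_node = all_entries[cut:]
    k = new_node[0][0]
    return original, new_node, k


full_leaf = [("Brighton", "r1"), ("Downtown", "r2")]
original, new_node, k = split_leaf(full_leaf, ("Clearview", "r3"), n=3)
print(original)
print(new_node)
print(k)
```

The original node keeps Brighton and Clearview, the new node holds Downtown, and Downtown is the key passed up to the parent, as stated on the slide.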

Updates on B+-Trees: Insertion (Cont.)

B+-Tree before and after insertion of Clearview

Insertion in B+-Trees (Cont.)

Splitting a non-leaf node: when inserting (k, p) into an already full internal node N
Copy N to an in-memory area M with space for n+1 pointers and n keys
Insert (k, p) into M
Copy P_1, K_1, ..., K_{⌈n/2⌉-1}, P_{⌈n/2⌉} from M back into node N
Copy P_{⌈n/2⌉+1}, K_{⌈n/2⌉+1}, ..., K_n, P_{n+1} from M into newly allocated node N'
Insert (K_{⌈n/2⌉}, N') into the parent of N

Read pseudocode in book!

[Figure: example of a non-leaf node split; node labels Downtown, Mianus, Perryridge, Redwood]
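
A sketch of the non-leaf split, assuming keys and pointers are kept in two parallel Python lists. The name `split_nonleaf`, the integer keys and the single-letter pointer labels are illustrative assumptions; the book's pseudocode remains the reference.

```python
import math
from bisect import bisect_right


def split_nonleaf(keys, pointers, k, p, n):
    """Split a full internal node N (n-1 keys, n pointers) on inserting (k, p).

    Returns ((keys, pointers) kept in N, key moved up to the parent,
    (keys, pointers) of the new node N').
    """
    # Build the in-memory area M holding n keys and n+1 pointers.
    pos = bisect_right(keys, k)
    m_keys = keys[:pos] + [k] + keys[pos:]
    m_ptrs = pointers[:pos + 1] + [p] + pointers[pos + 1:]

    h = math.ceil(n / 2)
    left = (m_keys[:h - 1], m_ptrs[:h])   # P_1 .. P_h with K_1 .. K_{h-1}
    up_key = m_keys[h - 1]                # K_h goes to the parent
    right = (m_keys[h:], m_ptrs[h:])      # P_{h+1} .. P_{n+1}, K_{h+1} .. K_n
    return left, up_key, right


left, up_key, right = split_nonleaf(
    keys=[10, 20, 30], pointers=["a", "b", "c", "d"], k=25, p="x", n=4
)
print(left, up_key, right)
```

Unlike a leaf split, the middle key is moved up rather than copied: it appears in neither N nor N' afterwards.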

Updates on B+-Trees: Deletion

Find the record to be deleted, and remove it from the main file and from the bucket (if present)
Remove (search-key value, pointer) from the leaf node if there is no bucket or if the bucket has become empty
If the node has too few entries due to the removal, and the entries in the node and a sibling fit into a single node, then merge siblings:
Insert all the search-key values in the two nodes into a single node (the one on the left), and delete the other node.
Delete the pair (K_{i-1}, P_i), where P_i is the pointer to the deleted node, from its parent, recursively using the above procedure.

Updates on B+-Trees: Deletion

Otherwise, if the node has too few entries due to the removal, but the entries in the node and a sibling do not fit into a single node, then redistribute pointers:
Redistribute the pointers between the node and a sibling such that both have more than the minimum number of entries.
Update the corresponding search-key value in the parent of the node.
The node deletions may cascade upwards till a node which has ⌈n/2⌉ or more pointers is found.
If the root node has only one pointer after deletion, it is deleted and the sole child becomes the root.
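
The choice between merging and redistributing can be sketched for two adjacent leaves as follows. This is a simplified illustration under stated assumptions: leaves are plain lists of keys (pointers omitted), the function name `rebalance_leaves` is hypothetical, and redistribution splits the combined keys evenly, which is one possible policy and leaves both nodes at or above the ⌈(n-1)/2⌉ minimum.

```python
import math


def rebalance_leaves(left, right, n):
    """Decide how to repair an underfull leaf using its adjacent sibling.

    Returns (action, new_left, new_right, separator). For a merge,
    new_right and separator are None because the right node is deleted.
    """
    max_values = n - 1
    combined = left + right
    if len(combined) <= max_values:
        # Everything fits in one node: keep the left node, drop the right.
        return ("merge", combined, None, None)
    # Otherwise share the keys; the parent's separating key becomes the
    # least key of the right node.
    cut = math.ceil(len(combined) / 2)
    new_left, new_right = combined[:cut], combined[cut:]
    return ("redistribute", new_left, new_right, new_right[0])


merge_result = rebalance_leaves([1], [5, 6], n=5)
redistribute_result = rebalance_leaves([1], [5, 6, 7, 8], n=5)
print(merge_result)
print(redistribute_result)
```

After a merge the caller would still have to delete the corresponding (key, pointer) pair from the parent, which may in turn become underfull.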

Examples of B+-Tree Deletion

Before and after deleting Downtown
Deleting Downtown causes merging of under-full leaves
leaf node can become empty only for n = 3!

Examples of B+-Tree Deletion (Cont.)

Before and after deletion of Perryridge from result of previous example

Leaf with Perryridge becomes underfull (actually empty, in this special case) and is merged with its sibling.
As a result the Perryridge node's parent became underfull, and was merged with its sibling
Value separating two nodes (at parent) moves into merged node
Entry deleted from parent
Root node then has only one child, and is deleted

Example of B+-tree Deletion (Cont.)

Before and after deletion of Perryridge from earlier example

Parent of leaf containing Perryridge became underfull, and borrowed a pointer from its left sibling
Search-key value in the parent's parent changes as a result

B-Tree Index Files
Similar to B+-tree, but B-tree allows search-key values to appear only once; eliminates redundant storage of search keys.
Search keys in nonleaf nodes appear nowhere else in the B-tree; an additional pointer field for each search key in a nonleaf node must be included.
Generalized B-tree leaf node

Nonleaf node: pointers B_i are the bucket or file record pointers.

B-Tree Index File Example

B-tree (above) and B+-tree (below) on same data

B-Tree Index Files (Cont.)

Advantages of B-Tree indices:
May use fewer tree nodes than a corresponding B+-Tree.
Sometimes possible to find search-key value before reaching leaf node.
Disadvantages of B-Tree indices:
Only small fraction of all search-key values are found early
Non-leaf nodes are larger, so fan-out is reduced. Thus, B-Trees typically have greater depth than corresponding B+-Tree
Insertion and deletion more complicated than in B+-Trees
Implementation is harder than B+-Trees.
Typically, advantages of B-Trees do not outweigh disadvantages.

Hashing

Static Hashing
A bucket is a unit of storage containing one or more records (a bucket is typically a disk block).
In a hash file organization we obtain the bucket of a record directly from its search-key value using a hash function.
Hash function h is a function from the set of all search-key values K to the set of all bucket addresses B.
Hash function is used to locate records for access, insertion as well as deletion.
Records with different search-key values may be mapped to the same bucket; thus entire bucket has to be searched sequentially to locate a record.

Example of Hash File Organization
Hash file organization of account file, using branch_name as key
(See figure in next slide.)

There are 10 buckets,
The binary representation of the i-th letter of the alphabet is assumed to be the integer i.
The hash function returns the sum of the binary representations of the characters modulo 10
E.g. h(Perryridge) = 5, h(Round Hill) = 3, h(Brighton) = 3
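
The example hash function is easy to reproduce. The sketch below assumes that upper and lower case are treated alike and that spaces and other non-letter characters contribute nothing; the slide does not state this, but it is the reading under which h(Round Hill) = 3.

```python
def h(branch_name, num_buckets=10):
    """Sum the alphabet positions of the letters (a=1 .. z=26), modulo 10."""
    total = sum(
        ord(c) - ord("a") + 1
        for c in branch_name.lower()
        if "a" <= c <= "z"   # assumption: skip spaces and punctuation
    )
    return total % num_buckets


for name in ("Perryridge", "Round Hill", "Brighton"):
    print(name, h(name))
```

Round Hill and Brighton land in the same bucket, which is the situation the sequential search within a bucket has to handle.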

Example of Hash File Organization
Hash file organization of account file, using branch_name as key (see previous slide for details).

Hash Functions
Worst hash function maps all search-key values to the same bucket; this makes access time proportional to the number of search-key values in the file.
An ideal hash function is uniform, i.e., each bucket is assigned the same number of search-key values from the set of all possible values.
Ideal hash function is random, so each bucket will have the same number of records assigned to it irrespective of the actual distribution of search-key values in the file.

Handling of Bucket Overflows
Bucket overflow can occur because of
Insufficient buckets
Skew in distribution of records. This can occur due to two reasons:
multiple records have same search-key value
chosen hash function produces non-uniform distribution of key values

Although the probability of bucket overflow can be reduced, it cannot be eliminated; it is handled by using overflow buckets.

Handling of Bucket Overflows (Cont.)

Overflow chaining: the overflow buckets of a given bucket are chained together in a linked list.
Above scheme is called closed hashing.
An alternative, called open hashing, which does not use overflow buckets, is not suitable for database applications.
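
Overflow chaining can be illustrated with a small in-memory model. The classes `Bucket` and `StaticHashFile`, the fixed capacity of two records and the identity hash used in the demo are all assumptions chosen to keep the example short; a real system would chain disk blocks.

```python
class Bucket:
    """Fixed-capacity bucket with a link to its overflow bucket, if any."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.records = []     # list of (search_key, record) pairs
        self.overflow = None  # next bucket in the overflow chain


class StaticHashFile:
    def __init__(self, num_buckets, capacity, hash_fn=hash):
        self.hash_fn = hash_fn
        self.buckets = [Bucket(capacity) for _ in range(num_buckets)]

    def _home(self, key):
        return self.buckets[self.hash_fn(key) % len(self.buckets)]

    def insert(self, key, record):
        bucket = self._home(key)
        # Walk the chain until a bucket with free space is found,
        # appending a new overflow bucket at the end when needed.
        while len(bucket.records) >= bucket.capacity:
            if bucket.overflow is None:
                bucket.overflow = Bucket(bucket.capacity)
            bucket = bucket.overflow
        bucket.records.append((key, record))

    def lookup(self, key):
        # The home bucket and its whole chain are searched sequentially.
        matches, bucket = [], self._home(key)
        while bucket is not None:
            matches.extend(rec for k, rec in bucket.records if k == key)
            bucket = bucket.overflow
        return matches

    def chain_length(self, key):
        length, bucket = 0, self._home(key)
        while bucket is not None:
            length, bucket = length + 1, bucket.overflow
        return length


f = StaticHashFile(num_buckets=2, capacity=2, hash_fn=lambda k: k)
for key in (0, 2, 4, 6, 1):
    f.insert(key, f"rec{key}")
print(f.lookup(4), f.chain_length(0), f.chain_length(1))
```

With four even keys and room for two records per bucket, bucket 0 needs one overflow bucket, so a lookup there may touch two buckets instead of one.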

Hash Indices
Hashing can be used not only for file organization, but also for index-structure creation.
A hash index organizes the search keys, with their associated record pointers, into a hash file structure.
Strictly speaking, hash indices are always secondary indices
if the file itself is organized using hashing, a separate primary hash index on it using the same search-key is unnecessary.
However, we use the term hash index to refer to both secondary index structures and hash organized files.

Example of Hash Index

Deficiencies of Static Hashing
In static hashing, function h maps search-key values to a fixed set B of bucket addresses. Databases grow or shrink with time.
If initial number of buckets is too small, and file grows, performance will degrade due to too many overflows.
If space is allocated for anticipated growth, a significant amount of space will be wasted initially (and buckets will be underfull).
If database shrinks, again space will be wasted.

One solution: periodic re-organization of the file with a new hash function
Expensive, disrupts normal operations

Better solution: allow the number of buckets to be modified dynamically.

Dynamic Hashing
Good for database that grows and shrinks in size
Allows the hash function to be modified dynamically
Extendable hashing: one form of dynamic hashing
Hash function generates values over a large range, typically b-bit integers, with b = 32.
At any time use only a prefix of the hash value to index into a table of bucket addresses.
Let the length of the prefix be i bits, 0 ≤ i ≤ 32.
Bucket address table size = 2^i. Initially i = 0
Value of i grows and shrinks as the size of the database grows and shrinks.

Multiple entries in the bucket address table may point to a bucket (why?)
Thus, actual number of buckets is ≤ 2^i
The number of buckets also changes dynamically due to coalescing and splitting of buckets.
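
Indexing the bucket address table with an i-bit prefix amounts to keeping the i most significant bits of the b-bit hash value. The sketch below shows only that step; the function name `table_index` is an illustrative assumption, and bucket splitting and coalescing are not modelled.

```python
def table_index(hash_value, i, b=32):
    """Return the bucket address table slot given by the first i bits.

    hash_value is assumed to be an unsigned b-bit integer.
    """
    if not 0 <= i <= b:
        raise ValueError("prefix length must satisfy 0 <= i <= b")
    if not 0 <= hash_value < 2 ** b:
        raise ValueError("hash value must fit in b bits")
    # Dropping the low b - i bits leaves the i-bit prefix; with i = 0 the
    # result is always 0, i.e. a table with a single entry.
    return hash_value >> (b - i)


hv = 0b1011 << 28   # a 32-bit value whose leading bits are 1011
for i in (0, 1, 2, 4):
    print("i =", i, "table size =", 2 ** i, "slot =", table_index(hv, i))
```

When i grows by one the table doubles, and each old slot s corresponds to the two new slots 2s and 2s + 1, which is one way several table entries come to point at the same bucket.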

General Extendable Hash Structure

In this structure, i_2 = i_3 = i, whereas i_1 = i - 1 (see next slide for details)

Extendable Hashing vs. Other Schemes
Benefits of extendable hashing:
Hash performance does not degrade with growth of file
Minimal space overhead

Disadvantages of extendable hashing
Extra level of indirection to find desired record
Bucket address table may itself become very big (larger than memory)
Cannot allocate very large contiguous areas on disk either
Solution: B+-tree file organization to store bucket address table

Changing size of bucket address table is an expensive operation

Linear hashing is an alternative mechanism
Allows incremental growth of its directory (equivalent to bucket address table)
At the cost of more bucket overflows

Comparison of Ordered Indexing and Hashing
Cost of periodic re-organization
Relative frequency of insertions and deletions
Is it desirable to optimize average access time at the expense of worst-case access time?
Expected type of queries:
Hashing is generally better at retrieving records having a specified value of the key.
If range queries are common, ordered indices are to be preferred

In practice:
PostgreSQL supports hash indices, but discourages use due to poor performance
Oracle supports static hash organization, but not hash indices
SQL Server supports only B+-trees
