

CHAPTER 6 Data Structures

6.1 Introduction
In this section we will examine various ways of implementing sets of ele-
ments efficiently. The actual representation used in each case depends on
the purpose for which the set is to be used. Different data structures have
different strengths and weaknesses as we shall see. It is therefore critical
that we understand these strengths and weaknesses so that we may pick
the right data structure for our application.

Sets are typically used to hold and retrieve elements as a part of some
algorithm or application. Depending on the algorithm, various operations
may need to be favored over others. What are some of the operations that
we might wish to perform on sets?
1. Member — to determine whether a particular element is a member of a
particular set.
2. Insert — to insert a given element into a set.
3. Delete — to delete a given element from a given set.
4. Union — to take the union of two sets.

Chapter Draft of October 22, 1998 99



5. Intersection — take the intersection of two sets.


6. Find — given an element a and a collection of sets that form a parti-
tion, find the name of the set which contains a.
7. Min — find the minimum element of some set.
8. Split — assuming an ordered set and an element a, split the set into
two sets such that all the elements of the first set have values less than
or equal to a and all members of the second set have values greater
than a.
9. Take — find and remove an arbitrary element from the set.
10. Iterate — iterate over all the members of the set.

The data structure of choice depends on the operations that are needed in
the algorithm being implemented. The idea is to use a data structure that
is as fast as possible for the desired operations.

For example, if the operations are: Member, Insert, and Delete, and the
set consists of integers in a compact range, say [0:10000], then the best
representation may be a bit array:
bool isMember[range];

Then Insert, Member, and Delete are simple indexing operations. The
only problem with this representation is that a set must be initialized to
false for the entire range. This can be avoided by a famous trick, which
requires more memory and a little more time per operation. It also
makes it possible to iterate over the set in time proportional to the size of
the set.

Suppose we declare the array member as follows:


int isMember[range];

but we also declare an array the same size to hold the actual elements:
int member[range];
int size = 0;

The intention is this: the isMember array contains an index of the member
array location containing the actual value. Therefore, for an element x to
be a member, isMember[x] must lie between 0 and the current value of
size, and the member array at that index must hold x itself. Thus, the
membership test for x is:
int i = isMember[x];
bool present = (0 <= i && i < size) ? (member[i] == x) : false;

100 Advanced Programming and Applied Algorithms



and insertion can be accomplished by the following code fragment:


if (!isMember(x)) {
int i = size++;
member[i] = x;
isMember[x] = i;
}

Thus the data structure can support Member and Insert in constant time
and Iterate in time proportional to the number of elements in the set.
Delete, on the other hand, is harder:
if (isMember(x))
{
int i = isMember[x];
if (i != --size)
{
int t = member[size];
member[size] = member[i];
member[i] = t;
isMember[t] = i;
}
}

Here is the complete class along with its iterator:


const int NULL_ELEMENT = -1;
class FastSet {
friend class FastSetIterator;
public:
FastSet(int);
~FastSet() { delete [] isMem; delete [] member; }
FastSet(const FastSet &);
FastSet & operator=(const FastSet &);
bool isMember(int) const;
void insertMember(int);
void deleteMember(int);
int popFirstMember();
void print();
private:
int * isMem;
int * member;
int range;
int size;
void swapMembers(int, int);
void copyFastSet(const FastSet & s);
};


class FastSetIterator {
public:
FastSetIterator(const FastSet & s) {
curSet = &s; curMemLoc = 0;
}
bool notExhausted() const {
return curMemLoc < curSet->size;
}
int curr() const {
return ( notExhausted() ?
curSet->member[curMemLoc] :
NULL_ELEMENT );
}
int operator*() const { return curr(); }
operator bool() const { return notExhausted(); }
FastSetIterator & operator++() {
++curMemLoc; return *this;
}
FastSetIterator operator++(int);
private:
int curMemLoc;
const FastSet * curSet;
};
class elementOutOfRangeException {
public:
elementOutOfRangeException (int i) { val = i; }
int value() const { return val; }
private:
int val;
};

Here are the implementations:


FastSet::FastSet(int r){
range = r;
size = 0;
isMem = new int[range];
member = new int[range];
}
FastSet::FastSet(const FastSet & s) {
copyFastSet(s);
}
FastSet & FastSet::operator=(const FastSet & s) {
if (this != &s) {
delete [] isMem; delete [] member;
copyFastSet(s);
}
return *this;


}
bool FastSet::isMember(int x) const {
if (x < range && x >= 0) {
int i = isMem[x];
return (0 <= i && i < size ? member[i] == x : false);
}
else throw elementOutOfRangeException(x);
return false; // Eliminates a warning
}
void FastSet::insertMember(int x) {
if (!isMember(x)) {
int i = size++;
member[i] = x;
isMem[x] = i;
}
}
void FastSet::deleteMember(int x) {
if (isMember(x)) {
int i = isMem[x];
if (i != --size) swapMembers(i,size);
}
}
int FastSet::popFirstMember() {
if ( size > 0 ) {
int first = member[0];
deleteMember(first);
return first;
}
else return NULL_ELEMENT;
}
void FastSet::print() {
cout << "{";
for (int i = 0; i < size; i++ ) {
cout << " " << member[i];
}
cout << " }" << endl;
}
void FastSet::swapMembers(int i,int j) {
int t = member[j];
member[j] = member[i];
isMem[member[j]] = j;
member[i] = t;
isMem[t] = i;
}
void FastSet::copyFastSet(const FastSet & s) {
range = s.range;


size = 0; // insertMember below rebuilds size as it inserts
isMem = new int[range];
member = new int[range];
for (int i = 0; i < size; i++) {
insertMember(s.member[i]);
}
}
FastSetIterator FastSetIterator::operator++(int) {
FastSetIterator ret = *this;
curMemLoc++;
return ret;
}

As a simple example of usage, consider the following code:


FastSet intset(100);
try {
for (int i = 1; i < 100; i += 2) {
intset.insertMember(i);
}
intset.print(); cout << endl;

// Print in order
for (int i = 1; i < 100; i++) {
if (intset.isMember(i))
cout << " " << i;
}
cout << endl;
// Print fast but out of order
FastSetIterator p = intset;
while (p) {
cout << *p++ << " ";
}
cout << endl;
// Cause an exception
if (intset.isMember(100))
cout << "It's in the set!" << endl;
}
// Catch an out-of-range exception
catch (const elementOutOfRangeException & e) {
cout << "Exception on membership test:"
<< e.value() << endl;
}


6.2 Hashing

What is the best representation for member, insert, delete and iterate if
the range is too large for a simple array or if the set is much smaller than
the range?

Answer: a hash table. My own preference is for a bucket hash like the one
given in the table part for lab 1. It is easy to see how to do Member,
Insert and Delete, but what about Iterate? That could be done in
either of two ways: linking all elements together or linking non-empty
buckets.

Can we get away with a singly-linked list of all elements and still do
delete in constant time? One strategy is to mark an element deleted and
actually adjust the links on the next Iterate, charging the cost to the delete
operations.
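This lazy-deletion strategy can be sketched as follows. The names (Node, LazyList) and structure here are illustrative, not taken from the reference implementation: Delete just marks a node, and the next Iterate splices marked nodes out as it walks past them, charging that unlinking work to the earlier delete operations.

```cpp
#include <cassert>

// One element of a singly-linked chain (in a bucket hash, each bucket
// would hold such a chain).
struct Node {
    int value;
    bool deleted;
    Node * next;
};

struct LazyList {
    Node * head = nullptr;

    void insert(int v) { head = new Node{v, false, head}; }

    // Constant-time delete: only mark the node.
    void remove(Node * n) { n->deleted = true; }

    Node * find(int v) {
        for (Node * p = head; p; p = p->next)
            if (!p->deleted && p->value == v) return p;
        return nullptr;
    }

    // Iterate: visit live nodes and splice out marked ones in passing.
    template <typename F>
    void iterate(F visit) {
        Node ** link = &head;
        while (*link) {
            Node * n = *link;
            if (n->deleted) { *link = n->next; delete n; }
            else { visit(n->value); link = &n->next; }
        }
    }
};
```

Each marked node is unlinked at most once, so the amortized cost of a delete remains constant.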

One aspect of bucket hashing has to do with growing the number of buck-
ets as the table grows. In the reference implementation, we used a strat-
egy that doubled the number of buckets whenever the number of
elements in the hash table is the same as the number of buckets, rehash-
ing each element. One question is whether this defeats the constant time
average-time cost of hashing. To analyze this, assume that we amortize
the total cost of hashing across all the elements of the table. When
there are n = 2^m elements in the table (just before the next rehash), we
can say that 2^(m-1) of them have been hashed only once, while 2^(m-2) have
been hashed twice, 2^(m-3) have been hashed three times, etc.

Thus, the total T(n) of hashed insertions for 2^m elements is:

1·2^(m-1) + 2·2^(m-2) + ... + (m-1)·2^1

Hence

T(n) = sum_{k=1}^{m-1} (m-k) 2^k
     = 2(m-1) + sum_{k=2}^{m-1} (m-k) 2^k
     = 2(m-1) + sum_{j=1}^{m-2} (m-j-1) 2^(j+1)
     = 2(m-1) + 2 sum_{j=1}^{m-2} (m-j) 2^j - 2 sum_{j=1}^{m-2} 2^j
     = 2(m-1) + 2(T(n) - 2^(m-1)) - 2(2^(m-1) - 2)
     = 2T(n) - 2^(m+1) + 2m + 2

Rearranging, we get:

T(n) = T(2^m) = 2^(m+1) - 2m - 2 = O(n)

Thus, the total cost of rehashing is bounded by a constant times the total
number of elements in the table.
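The analysis can be checked with a small simulation (illustrative code, not part of the reference hash table). We charge one unit of work per hash: each insertion hashes its element once, and each doubling rehashes every element currently in the table.

```cpp
#include <cassert>

// Total number of hash operations performed while inserting n elements
// into a table that doubles its bucket count whenever the element
// count reaches it, rehashing every element at each doubling.
long totalHashWork(long n) {
    long buckets = 1, elements = 0, work = 0;
    for (long i = 0; i < n; i++) {
        if (elements == buckets) {   // table full: rehash everything
            work += elements;
            buckets *= 2;
        }
        work += 1;                   // hash the newly inserted element
        elements += 1;
    }
    return work;
}
```

For n = 2^m this gives exactly 2n − 1 hash operations, and fewer than 3n in general, confirming the O(n) total derived above.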

6.3 Trees
Why would anyone ever choose to use a tree over a hash table for set rep-
resentation? The answer is that trees can be used to support ordering, so
that operations like Min and Split can be supported.

6.3.1 Standard Ordered Trees


In this section we will show how to implement a simple ordered tree,
which is defined as one in which the inorder walk will produce an
ordered list. To define a tree, we use a standard mechanism in which the
internals of a tree are handled by a class TreeNode, which can only be
manipulated by class Tree and its associated class TreeIterator.
class TreeNode {
friend class Tree;
friend class TreeIterator;
public:
Key keyData() const { return datum; }
virtual void print() const { datum.print(); }
protected:
// Only friends and derived classes can make a TreeNode
TreeNode();
TreeNode(const TreeNode & tn) {
datum = tn.datum; right = tn.right;
left=tn.left; parent = tn.parent;
}
TreeNode(const Key &, TreeNode * par);
virtual ~TreeNode() { }


TreeNode * left;
TreeNode * right;
TreeNode * parent;
Key datum;
TreeNode * search(const Key &);
virtual TreeNode * insert(const Key &);
TreeNode * minimum();
virtual TreeNode * deleteKey(const Key &);
TreeNode * successor();
virtual void structurePrint(int) const;
virtual TreeNode * clone(const Key & k, TreeNode * p)
const {
return new TreeNode(k,p);
}
virtual TreeNode * cloneSubtree(TreeNode *) const;
void deleteTreeNode();
void copyValues(TreeNode * tp) { datum = tp->datum; }
void swapValues(TreeNode *);
virtual void relocateNode(TreeNode *,
TreeNode *, TreeNode *);
void setParents();
};
class Tree {
friend class TreeIterator;
public:
Tree() { root = 0; };
Tree(const Tree &);
Tree(TreeNode *);
virtual ~Tree() { deleteTree(); };
TreeNode * search(const Key &) const;
TreeNode * insert(const Key &);
void deleteKey (const Key &);
TreeNode * minimum() const;
void print() const;
void structurePrint() const;
protected:
TreeNode * root;
Tree * parent;
TreeNode * searchTree(TreeNode * , Key &);
Tree * copyTree();
void deleteTree();
};

// TreeNode Implementations
TreeNode::TreeNode(const Key & k, TreeNode * par) {
left = 0; right = 0; datum = k; parent = par;
}


TreeNode * TreeNode::search(const Key & k) {


if (k < datum) {
if (left) return left->search(k);
else return 0;
}
else if (k > datum) {
if (right) return right->search(k);
else return 0;
}
else { // datum == k
return this;
}
}

TreeNode * TreeNode::insert(const Key & k) {


if (k < datum) {
if (left) return left->insert(k);
else return (left = clone(k,this));
}
else if (k > datum) {
if (right) return right->insert(k);
else return (right = clone(k,this));
}
else { // datum == k
return this;
}
}

TreeNode * TreeNode::minimum() {
TreeNode * t = this;
TreeNode * tLeft = t->left;
while(tLeft) { t = tLeft; tLeft = t->left; }
return t;
}

TreeNode * TreeNode::deleteKey(const Key & k) {


if (k == datum) { // delete this one
if (left == 0) {
if (parent->left == this)
parent->left = this->right;
else parent->right = this->right;
if (this->right) right->parent = this->parent;
return this;
}
else if (right == 0) {
if (parent->left == this)
parent->left = this->left;
else parent->right = this->left;
if (this->left) left->parent = this->parent;
return this;


}
else {
TreeNode * m = right->minimum();
swapValues(m);
return right->deleteKey(k); // the key now lives in the right subtree
}
}
else if (k < datum)
return (left ? left->deleteKey(k) : 0 );
else /* k > datum */
return (right ? right->deleteKey(k) : 0 );
}

TreeNode * TreeNode::successor() {
TreeNode * rt = right;
if (rt) return rt->minimum();
else {
TreeNode * tp = this;
while ((tp = tp->parent)) {
if (tp->datum > this->datum) return tp;
}
return 0;
}
}

void TreeNode::structurePrint(int level) const {


for (int i = 0; i < level; i++) cout << " ";
this->print(); cout << endl;
if (left) left->structurePrint(level+1);
if (right) right->structurePrint(level+1);
}

TreeNode * TreeNode::cloneSubtree(TreeNode * parent)


const {
TreeNode * newNode = new TreeNode(this->datum, parent);
newNode->left
= (left ? left->cloneSubtree(newNode) : 0);
newNode->right
= (right ? right->cloneSubtree(newNode) : 0);
return newNode;
}

void TreeNode::deleteTreeNode() {
if (left) { left->deleteTreeNode() ; delete left; }
if (right) { right->deleteTreeNode() ; delete right; }
}

void TreeNode::swapValues(TreeNode * tn) {


Key k = datum;
datum = tn->datum;


tn->datum = k;
}

void TreeNode::relocateNode(TreeNode * l,TreeNode * r


,TreeNode * p){
left = l; right = r; parent = p;
setParents();
}

void TreeNode::setParents() {
if (left) left->parent = this;
if (right) right->parent = this;
}

// Tree Implementations

Tree::Tree(const Tree & t) {


if (t.root) root = t.root->cloneSubtree(0);
else root = 0;
}

Tree::Tree(TreeNode * tp) { root = tp; }

TreeNode * Tree::search(const Key & k) const {


if (root) return root->search(k);
else return 0;
}

TreeNode * Tree::insert(const Key & k) {


if (root) return root->insert(k);
else return (root = new TreeNode(k,0));
}

void Tree::deleteKey (const Key & k) {
if (root == 0) return;
// The root has no parent for TreeNode::deleteKey to relink, so
// handle deleting a root with fewer than two children here.
if (k == root->datum && (root->left == 0 || root->right == 0)) {
TreeNode * old = root;
root = (root->left ? root->left : root->right);
if (root) root->parent = 0;
delete old;
} else delete root->deleteKey(k);
}

TreeNode * Tree::minimum() const {


if (root) return root->minimum();
else return 0;
}

void Tree::print() const {


cout << "{";
TreeIterator p = *this;
while(p) { (*p++).print(); if(p) cout << ", ";}
cout << "}" << endl;
}

void Tree::structurePrint() const {


if (root) root->structurePrint(0);
}

Tree * Tree::copyTree() {
return (root ? new Tree(root->cloneSubtree(0)) : 0);
}

void Tree::deleteTree() {
if (root) { root->deleteTreeNode(); delete root;}
}

6.3.2 Iteration Over a Tree


We now turn to the subject of iteration over every node in a binary tree.
To do this we will develop a TreeIterator class:
class TreeIterator {
public:
TreeIterator() { curNode = 0; }
TreeIterator(const Tree &);
TreeIterator(TreeNode *);
TreeIterator(const TreeIterator &);
virtual ~TreeIterator() { } // curStack is a member, not heap-allocated
TreeIterator & operator=(const TreeIterator &);
bool operator==(const TreeIterator &) const;
bool operator!=(const TreeIterator &) const;
bool empty() const;
operator bool() const { return !empty(); }
TreeIterator & operator++() {
advance(); return *this;
}
TreeIterator operator++(int) {
TreeIterator ret = *this; advance(); return ret;
}
Key & operator*() { return curNode->datum; }
Key * operator->() { return &(curNode->datum); }
private:
TreeNode * curNode;
stack<TreeNode *> curStack;
TreeNode * findLeftmost(TreeNode *);
void advance();
};

The implementation of this iterator keeps track of the location of the iter-
ation by keeping a current node pointer and a current stack of the nodes.
// TreeIterator Implementations


TreeIterator::TreeIterator(const TreeIterator & tp) {


curNode = tp.curNode;
curStack = tp.curStack;
}

TreeIterator &
TreeIterator::operator=(const TreeIterator & tp) {
if (this != &tp) {
curNode = tp.curNode;
curStack = tp.curStack;
}
return *this;
}

bool TreeIterator::operator==(const TreeIterator & tp)


const {
return curNode == tp.curNode
&& curStack == tp.curStack;
}

bool TreeIterator::operator!=(const TreeIterator & tp)


const {
return curNode != tp.curNode
|| curStack != tp.curStack;
}

TreeNode * TreeIterator::findLeftmost(TreeNode * r) {
if (r) {
while (r->left) {
curStack.push(r);
r = r->left;
}
return r;
}
else return 0;
}

TreeIterator::TreeIterator(const Tree & t) {


curStack = stack<TreeNode*>();
curNode = findLeftmost(t.root);
}

bool TreeIterator::empty() const {
// Exhausted only when there is no current node; testing the right
// child and stack here would skip the final (maximum) element.
return curNode == 0;
}

void TreeIterator::advance() {
if (curNode) {


if (curNode->right)
curNode = findLeftmost(curNode->right);
else if (curStack.empty()) curNode = 0;
else { curNode = curStack.top(); curStack.pop(); }
}
}

The complexity of the iterator is not simple to analyze. Although the


advance() method can take a variable amount of time, iterating over the
entire set takes time proportional to the number of vertices in the tree. To
see this, note that iterating over the entire tree results in two visits to each
of the nodes. The first visit takes place when the process is going left to
find the minimum element in a subtree, while the second takes place as a
result of popping the stack. Since each visit requires at most a constant
amount of work (if we amortize the cost of findLeftmost() over the
nodes visited during its execution), the overall cost is a constant times the
number of vertices.

As an interesting sidelight, a quick and dirty implementation of the stack


can be achieved from the List class as follows.
class TreeNodePtrElt : public ListElt {
public:
TreeNodePtrElt() { pTree = 0; }
TreeNodePtrElt(TreeNode * t) { pTree = t; }
virtual ~TreeNodePtrElt() { }
int operator==(const ListElt & e) const {
if (const TreeNodePtrElt * t =
dynamic_cast<const TreeNodePtrElt *>(&e))
return pTree == t->pTree;
else return false;
}
TreeNode * value() { return pTree; }
virtual ListElt * clone() const {
return new TreeNodePtrElt(this->pTree);
}
virtual void print() const {
pTree->print(); cout << " ";
}
private:
TreeNode * pTree;
};

class TreeNodeStack : private List {


public:
void mkEmpty() { List::deleteList(); }


bool notEmpty () const { return List::hdr != 0; }


void push(TreeNode * t) {
List::prepend(new TreeNodePtrElt(t));
}
TreeNode * pop() {
TreeNodePtrElt * p = dynamic_cast<TreeNodePtrElt *>
(List::popFirst());
return (p ? p->value() : 0 );
}
using List::print; // access specifier
};

One problem with this iterator is that it uses a lot of extra space for the
stack. In applications where space is critical, there is a trick that will per-
mit the iterator to work with a constant amount of extra space, if the user
does not do anything with the tree while the iteration is taking place.
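One such trick uses the parent pointers that TreeNode already stores: start at the tree minimum and repeatedly take the inorder successor, so no stack is needed. Sketched here on a minimal standalone node struct (the names are mine, and this successor navigates purely by child/parent links rather than by key comparisons as TreeNode::successor does):

```cpp
#include <cassert>

struct Node {
    int key = 0;
    Node *left = nullptr, *right = nullptr, *parent = nullptr;
};

Node * minimum(Node * t) {
    while (t && t->left) t = t->left;
    return t;
}

// Inorder successor: leftmost node of the right subtree, or the
// nearest ancestor that we reach from its left child.
Node * successor(Node * t) {
    if (t->right) return minimum(t->right);
    Node * p = t->parent;
    while (p && t == p->right) { t = p; p = p->parent; }
    return p;
}

// Simple BST insert, just to build a tree for the demonstration.
Node * insert(Node *& root, int k) {
    Node ** link = &root;
    Node * par = nullptr;
    while (*link) {
        par = *link;
        link = (k < par->key) ? &par->left : &par->right;
    }
    Node * n = new Node;
    n->key = k;
    n->parent = par;
    *link = n;
    return n;
}
```

Iterating "for (Node * t = minimum(root); t; t = successor(t))" visits the keys in order using only a constant amount of extra space, provided the tree is not modified during the walk.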

6.3.3 AVL Trees


An AVL tree is the same as an ordered tree except that it is kept balanced.
This means that we associate a height with each node in the tree, where
h(x) is defined as the length of the longest path from the node x to a leaf.
The height of any leaf is defined to be zero.
Definition 6.1. A tree is said to be balanced if, for each interior
node x, the heights of its two subtrees differ by at most 1. If one of
the subtrees is nil, by convention, it is assigned a height of -1.
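Definition 6.1 translates directly into code. The following checker is an illustrative sketch on a minimal node struct (not part of the AVL classes below); it recomputes heights from scratch, so it is suitable for checking a tree, not for maintaining balance:

```cpp
#include <cassert>
#include <cstdlib>   // std::abs

struct Node {
    Node *left = nullptr, *right = nullptr;
};

// Height as defined above: -1 for a nil subtree, 0 for a leaf.
int height(const Node * t) {
    if (!t) return -1;
    int hl = height(t->left), hr = height(t->right);
    return 1 + (hl > hr ? hl : hr);
}

// True iff every node's subtree heights differ by at most 1.
bool isBalanced(const Node * t) {
    if (!t) return true;
    if (std::abs(height(t->left) - height(t->right)) > 1) return false;
    return isBalanced(t->left) && isBalanced(t->right);
}
```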

Now we turn to the issue of whether a tree that is balanced in this sense is
truly balanced in the sense of having approximately half its nodes in each
subtree. To do this we need to estimate the maximum and minimum
number of nodes for a given height h.
Theorem 6.1. Let n be the number of vertices in a tree of height
h. Then the following inequality holds
F_(h+2) − 1 ≤ n ≤ 2^(h+1) − 1    (EQ 6.1)

where F_i is the ith Fibonacci number.

Proof. By induction on height.

Basis. Any tree of height 0 has exactly 1 node. By definition a tree of
height -1 has zero nodes. For height 0 both bounds hold with equality:

F_2 − 1 = 1 = 2^(0+1) − 1

Induction. Maximum. Assume the maximum is true for a tree of any


height less than h. What is the maximum size of a tree of height h?
Clearly the maximum is achieved when it has two equal subtrees of
height h−1, for if one subtree were of height h−2, then the size of the
tree would be lower. By induction then, each of the subtrees has a
maximum of 2^h − 1 vertices. Thus the tree of height h has a maximum of

2(2^h − 1) + 1 = 2^(h+1) − 1
vertices.

Minimum. Assume that the minimum number holds for trees of height
less than h. The minimum tree of height h must clearly have one subtree
of height h – 1 and another of height h – 2. Thus the minimum number of
vertices in a tree of height h is

(F_(h+1) − 1) + (F_h − 1) + 1 = F_(h+2) − 1

Q.E.D.

As useful as this Theorem is, we will need some lower bound on


Fibonacci numbers to establish that the AVL balance condition gives us
search times that are logarithmic in the number of elements in an AVL
tree. We will begin with the following well-known formula for Fibonacci
numbers:
F_i = (ϕ^i − ϕ̂^i) / √5    (EQ 6.2)

where

ϕ = (1 + √5)/2   and   ϕ̂ = (1 − √5)/2

Rather than attempt to use this directly, we will prove the following
lemma.
Lemma 6.1. For i ≥ 0, F_(i+2) ≥ ϕ^i.

Proof. By induction on i.


Basis. For i = 0, F_(i+2) = F_2 = 2 > ϕ^0 = 1.

For i = 1, F_(i+2) = F_3 = 3. But ϕ^1 = ϕ = (1 + √5)/2 < 2 < 3 = F_3.
Induction. Assume true for all values less than i. Then

F_(i+2) = F_(i+1) + F_i ≥ ϕ^(i−1) + ϕ^(i−2) = ϕ^(i−2)(ϕ + 1)

but

ϕ^2 = (1 + 2√5 + 5)/4 = (3 + √5)/2 = ϕ + 1

Therefore,

F_(i+2) ≥ ϕ^(i−2)(ϕ + 1) = ϕ^(i−2)·ϕ^2 = ϕ^i    Q.E.D.

With this established we can restate the bounds on the number of nodes n
in an AVL tree of height h.
Corollary 6.1. Let n be the number of nodes in an AVL tree of
height h. Then
ϕ^h − 1 ≤ n ≤ 2^(h+1) − 1    (EQ 6.3)

From this we can derive the following result:


Theorem 6.2. lg(n + 1) − 1 ≤ h ≤ lg(n + 1) / lg ϕ

Proof. If we take the lg of both sides of Equation 6.3 we get

h·lg ϕ ≤ lg(n + 1) ≤ h + 1

The result follows immediately. Q.E.D.

The result establishes that h = Θ(lg n). Hence, any operation that takes
time proportional to the height of an AVL tree is logarithmic in the num-
ber of vertices in that tree.

AVL trees are balanced trees in which a particular algorithm is used to


maintain the balance. An AVL tree can be defined by simply adding a
height field to a TreeNode.
class AVLTreeNode : public TreeNode {


friend class AVLTree;


friend class Tree;
friend class TreeIterator;
public:
int height() const { return _ht; }
void print() const
{ TreeNode::print(); cout << ":" << _ht; }
private:
// Only friends can make a TreeNode
int _ht;
AVLTreeNode() { _ht = 0; }
AVLTreeNode(const Key & k, TreeNode * par)
: TreeNode(k,par), _ht(0) { }
AVLTreeNode(const Key & k, TreeNode * par, int h)
: TreeNode(k,par), _ht(h) { }
AVLTreeNode(const TreeNode & tn) :
TreeNode(tn) { _ht = 0; }
virtual ~AVLTreeNode() { }
virtual TreeNode * insert(const Key &);
virtual TreeNode * deleteKey(const Key &);
virtual TreeNode * clone(const Key & k, TreeNode * par)
const {
return new AVLTreeNode(k,par);
}
virtual TreeNode * cloneSubtree(TreeNode *) const;
void computeHeight();
void rebalance();
void rotateLeft();
void rotateRight();
virtual void relocateNode(TreeNode *, TreeNode *,
TreeNode *);
};

class AVLTree : public Tree {


public:
AVLTree() { };
AVLTree(const AVLTree &t) : Tree(t) { }
AVLTree(AVLTreeNode * tp) : Tree(tp) { }
virtual ~AVLTree() { Tree::deleteTree(); };
virtual TreeNode * insert(const Key &);
int height() {
AVLTreeNode * r = dynamic_cast<AVLTreeNode *>(root);
return (r ? r->height() : -1);
}
};


6.3.3.1 AVL Insertion


Note that search, successor, predecessor, maximum, and minimum are
all unchanged for AVL trees; the height makes no difference. The only
operations that change are insert and delete. Let's tackle insert first.
Suppose we simply invoke the insert procedure for TreeNode:
TreeNode * AVLTreeNode::insert(const Key & k) {
TreeNode * retNode = TreeNode::insert(k);
rebalance();
return retNode;
}

void AVLTreeNode::rebalance() {
AVLTreeNode * l = static_cast<AVLTreeNode *>(left);
AVLTreeNode * r = static_cast<AVLTreeNode *>(right);
int hL = (l ? l->height() : -1);
int hR = (r ? r->height() : -1);
if ((hR-hL)>1) rotateLeft();
else if ((hL-hR)>1) rotateRight();
else computeHeight();
}

The problem is that the tree can come back unbalanced. Let us restrict
our consideration to the case of return from an insert to the left subtree.
What if it comes back unbalanced? There are two cases to consider:
• Type 1: the subtrees of the left subtree, where the insertion occurred,
are of unequal height, with the insertion having occurred on the left.

[Figure: Type 1 (single) rotation. Before: d (height h+1) has left child b
(height h) and right subtree e (h−2); b has children a (h−1) and c (h−2).
A right rotation at d yields b (height h) with children a (h−1) and d
(h−1); d keeps c and e (both h−2).]

• Type 2: the subtrees of the left subtree are of unequal height, with the
insertion having occurred on the right.

[Figure: Type 2 (double) rotation. Before: f (height h+1) has left child b
(height h) and right subtree g (h−2); b has children a (h−2) and d (h−1),
and d has children c and e (heights h−2 and h−3 in some order, depending
on where the insertion fell). Rotating d up to the root yields d (height h)
with children b and f (both h−1); b keeps a and c, f keeps e and g.]

void AVLTreeNode::rotateRight() {
AVLTreeNode * ll
= static_cast<AVLTreeNode *>(left->left);
AVLTreeNode * lr
= static_cast<AVLTreeNode *>(left->right);
int hLL = (ll ? ll->height() : -1);
int hLR = (lr ? lr->height() : -1);
if (hLL > hLR) { // rotate right Type 1
swapValues(left);
left->relocateNode(lr,right,this);
relocateNode(ll,left, parent);
}
else { // rotate right Type 2
AVLTreeNode * lrl
= static_cast<AVLTreeNode *>(lr->left);
AVLTreeNode * lrr
= static_cast<AVLTreeNode *>(lr->right);
swapValues(lr);
lr->relocateNode(lrr,right,this);
left->relocateNode(ll,lrl,this);
relocateNode(left,lr,parent);
}
}
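For contrast, here is the classic pointer-based single right rotation on a minimal standalone struct (illustrative only; unlike the value-swapping rotateRight above, it moves nodes rather than values, and it ignores the parent pointers the class version must maintain). Note that the inorder sequence is unchanged by the rotation:

```cpp
#include <cassert>

struct Node {
    int key = 0;
    Node *left = nullptr, *right = nullptr;
};

// Rotate right around *root: the left child becomes the new root,
// and its right subtree crosses over to the old root.
void rotateRight(Node *& root) {
    Node * l = root->left;
    root->left = l->right;
    l->right = root;
    root = l;
}
```

Applied to the Type 1 picture, rotating right around d makes b the root with children a and d, exactly the rotated shape shown above.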


6.3.3.2 AVL Deletion


At first the problem of deletion from an AVL tree seems difficult but, in
fact, it is trivial. When we delete from an AVL tree, we have two cases to
consider:
1. the tree after deletion still satisfies the AVL condition, in which case
there is nothing to do, or
2. there exists some node at which the AVL balance condition no longer
holds.
Let us examine how the second condition might arise. Consider the
diagram below in which the balance condition fails:

[Figure: d (height h+3) has left child b (height h) and right child f
(height h+2); b has children a and c (both h−1). A deletion in b's subtree
reduces its height by 1, so the heights of d's two subtrees now differ by
2 and the balance condition fails at d.]

This can be addressed by a simple rotation:

[Figure: after a single left rotation at d, f is the root (height h+3)
with children d (h+2) and g (h+1); d has children b (h) and e (h+1), and
b keeps a (h−1) and c (now h−2).]


Since this is a constant number of operations, and it leaves the subtrees


balanced, the total deletion time is proportional to the height of the tree,
which is O(log n).
TreeNode * AVLTreeNode::deleteKey(const Key & k) {
TreeNode * retNode = TreeNode::deleteKey(k);
rebalance();
return retNode;
}

6.4 Data Base Directories

In this course we will look at introductory data structures for representing
data base directories, with the goal of presenting material on various
types of balanced tree algorithms.

The problem of building a data base directory may be stated as a class


construction problem. The goal is to create a class that permits the access
of data base entries according to a key, which is some value attached to
records of the data base upon which we wish to conduct searches. Typi-
cally, keys are strings, but they could have various types. For example, it
might be desirable to search data bases by age of employee or number of
years of service.

Let us construct a class interface for a typical directory:


typedef unsigned int diskLoc;
class Directory {
public:
Directory();
Directory(const Directory &);
Directory(istream &);
~Directory();
Directory & operator=(const Directory &);
vector<diskLoc> find(Key &, int);
vector<diskLoc> findRange(Key &, Key &, int);
void insert(Key &, diskLoc loc);
void deleteKey(Key &);
void save(ostream &);
private:
...
};


6.4.1 B-Trees
B-trees are balanced trees that have been especially designed for use with
large databases. The key observation about a large data base maintained
on disk is that not only will the data records themselves be kept on disk,
but most of the directory itself will be kept on disk.

To understand the impact of this, consider how disk storage works. Data
are stored on tracks and, within a track, are organized into pages. A typi-
cal page is quite large (2Kbytes or more) and represents the smallest unit
of data that can be usefully moved between disk and main memory.
Because of the seek times and rotational delays associated with accesses
to a specific page on disk, it will often take 5 to 30 milliseconds or more
to begin reading a page. Once reading begins, however, transfers are at
very high rates. Thus, in working with disk, the usual strategy is to read
large blocks and read them as seldom as possible.

This strategy presents a problem for pointer-based data structures,
because you will not be able to tell where the next block is to come from
until you have the current one. Furthermore, simple binary trees will not
be very practical, because the amount of useful information stored in a
single node is too small to make a whole disk access worthwhile.

One solution to this is to move from binary trees to k-ary trees, which
have k children instead of only two. Then, you can use an algorithm like
binary or even linear search to find the right subtree to search. B-trees are
a type of k-ary tree.

A B-tree T is a rooted tree with root root[T], having the following proper-
ties:
1. Every node x has the following fields:
a. n[x], the number of keys currently stored in x,
b. The n[x] keys themselves in nondecreasing order: key_1[x] ≤ key_2[x]
≤ ... ≤ key_(n[x])[x], and
c. leaf[x], a boolean value that is true if x is a leaf and false if x is an
internal node.
2. If x is an internal node, it also contains n[x]+1 pointers c_0[x], c_1[x],
..., c_(n[x])[x] to its children. Leaf nodes have no children, so these
fields are undefined.
3. The keys key_i[x] separate the ranges of keys stored in each subtree: if
k_i is any key stored in the subtree with root c_i[x], then
k_0 ≤ key_1[x] ≤ k_1 ≤ key_2[x] ≤ k_2 ≤ ... ≤ k_(n[x]−1) ≤ key_(n[x])[x] ≤ k_(n[x]).


4. Every leaf has the same depth, which is the tree's height h.
5. There are lower and upper bounds on the number of keys a node can
contain. Let the fixed integer t be called the minimum degree of the B-
tree.
a. Every node other than the root must have at least t−1 keys. Every
internal node other than the root thus has at least t children. If the
tree is nonempty, the root must have at least one key.
b. A node can contain at most 2t−1 keys. Therefore, an internal node
can have at most 2t children. We say that a node is full if it contains
exactly 2t−1 keys.

The height of a B-tree is established by the following theorem:


Theorem 6.3. If n ≥ 1, then for any n-key B-tree of height h and
minimum degree t ≥ 2,

h ≤ log_t((n + 1)/2).

Proof. What is the minimum number of nodes in a tree of height h? We


get this by counting nodes. The minimum number is obtained when the
root contains one key and all the other nodes contain t − 1 keys. The tree
contains 2 nodes at depth 1, 2t nodes at depth 2, 2t^2 nodes at depth 3,
and so on. The number of keys satisfies the inequality:

n ≥ 1 + (t − 1) sum_{i=1}^{h} 2t^(i−1) = 1 + 2(t − 1)((t^h − 1)/(t − 1)) = 2t^h − 1

Which implies the result. Q.E.D.
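A quick numeric illustration of Theorem 6.3, with hypothetical parameters: with minimum degree t = 1001, a B-tree holding a billion keys has height at most 2, that is, at most three levels of nodes.

```cpp
#include <cassert>
#include <cmath>

// Height bound from Theorem 6.3: h <= log_t((n + 1) / 2), rounded
// down since the height is an integer.
int heightBound(double n, double t) {
    return (int)std::floor(std::log((n + 1.0) / 2.0) / std::log(t));
}
```

This is the reason B-trees suit disk-resident directories: only a handful of page reads are needed per search even for very large n.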

6.4.1.1 Basic Operations on B-trees


Searching a B-tree is straightforward: instead of making a 2-way decision
at each node, we make an n[x]-way decision at each node. This could be
done using binary search or linear search, as the cost of the search will be
dominated by the cost of accessing disk to get directory blocks. If the
search finds the desired key k in the node x, then it returns a list of loca-
tions associated with that key. Otherwise it finds the first i such that

k < keyi[x]

The algorithm then recursively searches for k in the subtree ci-1[x].


Insertion is more complicated because it can cause the tree to grow. The
basic idea behind insertion is to split a full node before attempting to
insert into it. Thus a key component of the algorithm is a method associ-
ated with a B-tree node that splits a given child of that node, lifting its
median key into the parent. Note that this will only work if the parent is
guaranteed not to be full. Thus, the procedure presented here will always
guarantee that a B-tree node is not full before recursively inserting at that
node. It does this by first determining the subtree into which the insertion
will be made (insertions always happen at leaves), and then splitting the
root of that subtree before attempting to insert into it if the subtree is
full. In the special case that the root is full, it will be split and one key
moved up to a new root, increasing the height of the tree by 1.

The algorithms for search and insert are presented below.


static const int Bt = 2; // minimum degree of the B-tree
typedef unsigned long diskLoc;
class BTreeNode;
typedef BTreeNode* ChildPtr;

class BTreeNode {
friend class BTree;
private:
BTreeNode();
BTreeNode(Key &, diskLoc);
virtual ~BTreeNode();
void reserveSpace();
diskLoc find(Key &);
void splitChild(int);
void insertNonFull(Key &, diskLoc);
int Nkeys;
bool leaf;
void print(int) const;
ChildPtr * child;
Key * key;
diskLoc * location;
};

class BTree {
public:
BTree();
BTree(BTreeNode *);
BTree(istream &);
virtual ~BTree();
virtual diskLoc find(Key &);
virtual void insertKey(Key &, diskLoc);
virtual void deleteKey(Key &, diskLoc);


virtual void print() const;


private:
BTreeNode * root;
};

// A Btree Class

BTree::BTree() : root(0) { }

BTree::BTree(BTreeNode * r) : root(r) { }

BTree::BTree(istream & infile) : root(0) {


string keyName;
int i = 0;
while( infile >> keyName ) {
insertKey(keyName,i++);
}
}

BTree::~BTree() {
delete root;
}

diskLoc BTree::find(Key & k) {


if (root == 0) return 0;
else return root->find(k);
}

void BTree::insertKey(Key & k, diskLoc l) {


cout << "Inserting: "; k.print(); cout << endl;
if (root == 0) root = new BTreeNode(k,l);
else if (root->Nkeys < 2*Bt-1)
root->insertNonFull(k,l);
else {
BTreeNode * r = root;
root = new BTreeNode();
root->leaf = false;
root->child[0] = r;
root->splitChild(0);
root->insertNonFull(k,l);

}
}

void BTree::deleteKey(Key & k, diskLoc l) { }

void BTree::print() const {


if (root == 0) cout << "Empty Tree!";
else root->print(0);
}


// A Btree Node Class

BTreeNode::BTreeNode() : Nkeys(0), leaf(true) {


reserveSpace(); }

BTreeNode::BTreeNode(Key & k, diskLoc l) : Nkeys(1),


leaf(true) {
reserveSpace();
key[0] = k;
location[0] = l;
}

void BTreeNode::reserveSpace() {
key = new Key[2*Bt-1];
location = new diskLoc[2*Bt-1];
child = new ChildPtr[2*Bt];
}

BTreeNode::~BTreeNode(){
// an internal node owns its Nkeys+1 children
if (!leaf)
for (int i = 0; i <= Nkeys; i++) delete child[i];
delete [] key;
delete [] location;
delete [] child;
}

diskLoc BTreeNode::find(Key & k) {


int i;
for (i = 0; i < Nkeys; i++) {
if (k <= key[i]) break;
}
if (i == Nkeys)
return (leaf ? 0 : child[Nkeys]->find(k));
else if (k == key[i]) return location[i];
else return (leaf ? 0 : child[i]->find(k));
}

void BTreeNode::splitChild(int i) {
BTreeNode * iChild = child[i];
BTreeNode * newChild = new BTreeNode();
newChild->leaf = iChild->leaf;
newChild->Nkeys = Bt - 1;
// Copy Bt-1 keys from iChild to newChild
for (int j = 0; j < Bt-1; j++) {
newChild->key[j] = iChild->key[j+Bt];
newChild->location[j] = iChild->location[j+Bt];
}
// copy the corresponding subtrees
if (!(iChild->leaf)) {
for (int j = 0; j < Bt; j++)
newChild->child[j] = iChild->child[j+Bt];


}
iChild->Nkeys = Bt - 1;
// move keys and children to make room for
// a new pointer at child[i+1]
for (int j = Nkeys; j > i; j--) {
child[j+1] = child[j];
key[j] = key[j-1];
location[j] = location[j-1];
}
child[i+1] = newChild;
key[i] = iChild->key[Bt-1];
location[i] = iChild->location[Bt-1];
Nkeys++;
}

void BTreeNode::insertNonFull(Key & k, diskLoc l) {


int i;
for (i = Nkeys-1; i >= 0 && k < key[i]; i--);
// here k >= key[i] && k < key[i+1] or i = -1
int insertLoc = i+1;
cout << "Inserting at Location " << insertLoc << endl;
if (leaf) {
for (i = Nkeys-1; i >= insertLoc; i--) {
key[i+1] = key[i];
location[i+1] = location[i];
}
key[insertLoc] = k;
location[insertLoc] = l;
Nkeys++;
}
else {
if (child[insertLoc]->Nkeys == 2*Bt - 1) {
splitChild(insertLoc);
if (k > key[insertLoc]) insertLoc++;
}
child[insertLoc]->insertNonFull(k, l);
}
}

void BTreeNode::print(int nDent) const {


for (int i = 0; i < nDent; i++) {
cout << " ";
}
for (int i = 0; i < Nkeys; i++) {
key[i].print(); cout << " ";
}
cout << endl;
if (!leaf) {
for (int i = 0; i < Nkeys+1; i++) {
child[i]->print(nDent+1);


}
}
}

Here are some examples of this process. First, we examine the behavior
of splitChild for Bt = 3 applied to a full root:

[A D F H L]

becomes

[F]
[A D] [H L]


Original:
[G M P X]
[A C D E] [J K] [N O] [R S T U V] [Y Z]

B Inserted:
[G M P X]
[A B C D E] [J K] [N O] [R S T U V] [Y Z]

Q Inserted:
[G M P T X]
[A B C D E] [J K] [N O] [Q R S] [U V] [Y Z]

L Inserted:
[P]
[G M] [T X]
[A B C D E] [J K L] [N O] [Q R S] [U V] [Y Z]


F Inserted:
[P]
[C G M] [T X]
[A B] [D E F] [J K L] [N O] [Q R S] [U V] [Y Z]

Deletion from a B-tree is more complicated and the algorithm will be
sketched rather than elaborated in code. The basic challenge is to descend
into the tree so that only one pass will be required from top to bottom.
This is done by always ensuring that a node contains at least t (the mini-
mum degree) keys rather than t–1 keys before descending to it. In some
cases, this means that we will need to move a key downward in the tree
before descending.

In the following pseudo-code, we note that if the root of the tree ever
becomes an internal node with no keys, it will be deleted and its only
child will become the root of the tree. To delete a key k from a node x:
1. If the key k is in the node x and x is a leaf, simply delete k from x.
2. If the key k is in x and x is an internal node, do the following:
a. If the child y that precedes k in the tree has at least t keys, then find
the predecessor k' of k in the subtree rooted at y. Recursively delete
k' from the subtree and replace k by k' in x. (Finding k' and deleting
it can be performed in a single downward pass if we ensure that we
always descend to nodes with t keys or more.)


b. Symmetrically, if the child z that follows k has at least t keys, then
find the successor k' of k in the tree rooted at z. Recursively delete k'
from the tree and replace k by k' in x.
c. Otherwise, if both y and z have only t–1 keys, merge k and all of z
into y, so that x loses both k and the pointer to z, and y now has 2t-1
keys. Then free z and recursively delete k from the tree rooted at y.
3. If the key k is not present in x, determine the root ci[x] of the subtree
that must contain k if k is in the tree at all. If ci[x] has only t–1 keys,
execute step 3a or 3b as appropriate to ensure that ci[x] has at
least t keys, then recursively descend into that subtree.
a. If ci[x] has only t–1 keys but a sibling has at least t keys, then give
ci[x] an extra key by moving a key from x down to ci[x], moving a
key from the sibling up to x, and moving the appropriate child pointer
from the sibling into ci[x].
b. If ci[x] and each of its immediate siblings have only t–1 keys, merge
ci[x] with one sibling, which involves moving a key from x down into
the new merged node to become the median key for that node.

Note that when the B-tree deletion procedure operates, it moves down the
tree in a single pass, without backup, except that it may need to revisit a
node to replace a key in step 2a or 2b.

Note also that the total number of disk operations is O(h), where h is the height of the tree.

To illustrate this process, we continue the example used for insertion.


F deleted, case 1:
[P]
[C G M] [T X]
[A B] [D E] [J K L] [N O] [Q R S] [U V] [Y Z]

M deleted, case 2a:
[P]
[C G L] [T X]
[A B] [D E] [J K] [N O] [Q R S] [U V] [Y Z]

G deleted, case 2c:
[P]
[C L] [T X]
[A B] [D E J K] [N O] [Q R S] [U V] [Y Z]


D deleted, case 3b (the root empties and the tree shrinks in height):

[C L P T X]
[A B] [E J K] [N O] [Q R S] [U V] [Y Z]

B deleted, case 3a:

[E L P T X]
[A C] [J K] [N O] [Q R S] [U V] [Y Z]


6.4.1.2 Operations on Record Address Lists


In the inverted file structure suggested in this section, each key is associ-
ated with a varying-length list of locations which contains the disk
addresses of each record in the data base having the specified key. The
location lists could be quite long and should probably be stored on disk
themselves so that they do not overwhelm the storage associated with B-
tree nodes, thus reducing the effectiveness of the B-tree algorithms. This
structure, in which the directory points directly to a list of records with
the desired key, is called an inverted file.

If complex queries, involving intersection and union of key criteria, are
permitted, there needs to be some way to efficiently cut down on the
number of disk accesses associated with records that do not satisfy the
query expression. In the case of intersection, the total number of records
matching a particular query could be much smaller than the number of
records having each key separately.

We would like to be able to cut down on the number of records actually
fetched from the data base by clever organization of the directory lists. If
the lists of locations are maintained in sorted order, the number of
records actually fetched can be pared substantially by performing a vari-
ant of update merge on the two lists of locations. The merge procedure
can look at pairs of locations to determine if they are equal, only putting a
location in the output list if it appears in both input lists.
vector<diskLoc> & intersect(vector<diskLoc> & in1,
vector<diskLoc> & in2) {

vector<diskLoc> * out = new vector<diskLoc>;

vector<diskLoc>::iterator j = in2.begin();
for (vector<diskLoc>::iterator i = in1.begin();
i != in1.end(); i++) {
while (j != in2.end() && *j < *i) j++;
// Here j == in2.end() || *j >= *i
if (j == in2.end()) break;
if (*i == *j) out->push_back(*i);
}
return *out;
}

This and other update merge procedures take O(m+n) time, where m and n
are the sizes of the two location lists referred to earlier.


6.4.2 Building a Directory


Suppose we begin with a simple file, where each record has a location in
the DB and each record has some number of keys. How do we construct
the directory, and how long does it take?

Here is a rough procedure for doing this:


(For each record) {
let l be the location of the record;
(For each searchable key in the record) {
add the pair (key, l) to a list;
}
}
Sort the (key, loc) pairs using merge sort;
Determine the size of each leaf node;
Fill leaf nodes in sequence, pushing the last key
up to the next level of the hierarchy
until all the keys have been assigned a node;

6.5 Union-Find
Suppose we wish to develop a set representation that must carry out three
operations:
1. MakeSet(Element * x) makes a singleton set with the element x in
it.
2. Union(Element * x, Element * y) takes the two sets represented
by x and y and creates the union of the two sets, returning a pointer to
the new representative element (representing the union set).
3. Find(Element * x) returns a pointer to the representative element
for the set of which x is a member.

How would we use such a representation? Here is an example.


Suppose we wish to build an application that determines whether you
can travel between two cities entirely on Continental Airlines. The
problem is that Continental adds more city pairs each day. So the
phone consultants need a fast way to determine if two cities are con-
nected by a contiguous set of Continental routes. Note that there may
be thousands (even hundreds of thousands) of cities in the database.
The Union-Find structure satisfies this need because it allows simple
ways to ensure that the effects of new cities and new legs can be prop-
erly entered into the system—each time a new city is added, MakeSet
is invoked on that city. Each time a new leg is added to the schedule, a
Union is performed. To determine whether a Continental route
between two cities exists, we must perform Find on each city and see
if they have the same representative.

Later I will present a more complicated example from the theory of com-
pilation.

6.5.1 Simple List Representation


We begin with a simple approach, in which each set is represented by a
single representative element and each element of the set points directly
to that representative.
class Element
{
private:
Element * parent;
Element * next;
int size;
public:
...
void MakeSet() { parent = this; size = 1; next = NULL; }
void Union(Element * y) {Link(Find(), y->Find());}
void Link(Element * x, Element * y)
{
Element * big = y; Element * small = x;
if (x->size > y->size) { big = x; small = y; }
big->size += small->size;
// insert all of small after the first elt of big
Element * rest = big->next;
big->next = small;
Element * e = small; Element * last = big;
L1: while (e != NULL)
{
e->parent = big;
last = e;
e = e->next;
}
last->next = rest;
}
Element * Find() { return parent; }
};

The basic idea behind this is to have each element start out pointing to itself
and, whenever a union is done, all of the elements in the smaller list are
merged into the bigger one right after the head, which is also the repre-
sentative of the larger list.


The key observation is that if an element is visited for the kth time in loop
L1, there must be at least 2^k elements in the resulting set. This is easy to
see by induction. Hence the total number of visits for any element is at
most ⌈lg n⌉. In other words, the costs of MakeSet and Find are constant,
while the total cost of all unions is bounded by n lg n. Hence the total cost
for a mix of m operations is O(m + n lg n).

6.5.2 Disjoint-Set Forests


We now turn to the development of a faster algorithm. Building on the
parent pointer idea, we will reduce the cost of union at the expense of a
higher cost for find.
class Element
{
private:
Element * parent;
int rank;
public:
...
void MakeSet() { parent = this; rank = 0; }
void Union(Element * y) {Link(Find(), y->Find());}
void Link(Element * x, Element * y) {
if (x->rank > y->rank)
y->parent = x;
else {
x->parent = y;
if (x->rank == y->rank)
y->rank += 1;
}
}
Element * Find() {
if (this != parent)
parent = parent->Find();
return parent;
}
};

6.5.2.1 Analysis
Let us analyze the complexity of this algorithm.
Lemma 6.2. If the rank of a root node r is k then the subtree
rooted at r contains at least 2^k nodes.

Proof. By induction on k.


Basis: If k = 0 then the tree contains exactly one element. Since 2^0 = 1, the
basis is established.

Induction: The rank of a root can change only when the ranks of the two
roots being combined into a single tree are equal. If the combined root has
rank k, each of the two roots had rank k–1 and, by the induction hypothesis,
each of the two trees must have at least 2^(k-1) vertices. Hence the merged
tree has at least 2^k vertices, establishing the lemma. QED.
Lemma 6.3. There are no more than n/2^r nodes of rank r.

Proof: By Lemma 6.2, the subtree rooted at each node of rank r has at
least 2^r nodes. Assume that there are k > n/2^r nodes of rank r. The
subtrees rooted at distinct nodes of the same rank are disjoint, so together
they contain at least k·2^r nodes, which is greater than n, a contradiction. QED.
Lemma 6.4. No vertex can have rank > ⌈lg n⌉.

Proof. Assume there exists a vertex with rank r > ⌈lg n⌉. By Lemma 6.3
there can be no more than

n/2^r < n/2^⌈lg n⌉ ≤ n/2^(lg n) = 1

vertices of rank r. Thus there must be fewer than one node with this rank. QED

A corollary of this is that the height of the tree is no more than ⌈lg n⌉.

Now we introduce the function F(i), defined as follows:

F(0) = 1
F(i) = 2^F(i–1)

Thus we can set up a table of values of n and F(n):

n F(n)
0 1
1 2
2 4
3 16
4 65536
5 2^65536

TABLE 3 Sample Values for F(n)

Clearly, this function grows very rapidly. Consider its functional inverse
G(n):
G(n)
G(1) = 0
G(2) = 1
G(4) = 2
G(16) = 3
G(65536) = 4
G(2^65536) = 5

In the literature, G(n) is known as lg* n because it is the number of times
you have to apply lg to a number to bring it down to 1 (or below).

This function can be extended to other values in a straightforward way.


G(n)
G(1) = 0
G(2) = 1
G(3-4) = 2
G(5-16) = 3
G(17-65536) = 4

This partitions the integers into groups according to their G values.


Theorem 6.4. A sequence of m MakeSet, Union, and Find opera-
tions takes no more than O(mG(n)) time.

Proof. Clearly, a MakeSet and the non-Find portion of a Union takes con-
stant time. Hence, we must only consider the time to perform Finds.


Suppose we partition the nodes into rank groups such that every vertex of
rank r is put into group G(r):

ranks group
0,1 G(1) = 0
2 G(2) = 1
3,4 G(4) = 2
5-16 G(16) = 3
... ...
⌈lg n⌉ G(⌈lg n⌉)

Note that

G(n) = G(2^(lg n)) = G(lg n) + 1 ≥ G(⌈lg n⌉) + 1

Hence, we have rank groups 0 ... G(n) – 1.

We will use a bookkeeping trick to account for Finds. Assume there is an
edge between vertex v and its parent.
1. If v and its parent are in different rank groups or the parent of v is the
root, then charge 1 unit to the find.
2. If v and its parent are in the same rank group, charge one unit to the
vertex.

This has the following implications.


1. Since there are no more than G(n) rank groups, no find instruction is
charged more than G(n). Hence, the total charge for O(m) finds is
O(mG(n)).
2. Consider the vertices. A vertex is charged one unit if its parent is not
the root and it is in the same rank group as its parent. But then it is
moved and gets a parent of a higher rank. Each time a vertex is
charged, it is moved. How many times can a vertex be moved within
the same rank group? This is bounded by the number of distinct ranks
in its rank group g.
Note that G(i) is the smallest k such that F(k) ≥ i. F(g) is the largest
rank in group g and F(g–1)+1 is the smallest. Hence, the total number
of ranks in group g is

F(g) – F(g–1)

This is the maximum number of units that can be assigned to any ver-
tex before it acquires a parent in a higher group.


Now consider rank group m. How many vertices can there be whose
ranks lie in group m? By Lemma 6.3,

N(m) ≤ Σ_{i = F(m–1)+1}^{F(m)} n/2^i ≤ (n/2^(F(m–1)+1)) Σ_{i = 0}^{∞} 1/2^i = n/2^(F(m–1)) = n/F(m)

Since the maximum charge to any vertex is F(m) – F(m–1), the total
charge to vertices in group m is less than or equal to

(n/F(m)) (F(m) – F(m–1)) ≤ n

Since there are no more than G(n) rank groups, the total charge to
vertices is at most nG(n).

Since m ≥ n, the total cost is O(mG(n)). QED.

6.5.2.2 An Example

Let us now consider how we might apply this to a real computer science
problem.

The language Fortran has the ability to perform EQUIVALENCE operations,
which look like this:
EQUIVALENCE (A(1), D(101)) // A offset 100 from D
EQUIVALENCE (B(10), C(20)) // B offset 10 from C
EQUIVALENCE (A(10), B(1)) // A offset -9 from B
EQUIVALENCE (A(1), C(11)) // Error

We would like to determine whether there are any multiple equivalence
errors and then determine a base array and offset for each array that can
be equivalenced to it directly or indirectly.
class EqArray
{
private:
EqArray * parent;
int rank;
int offset;
public:
void Declare() {
parent = this;
rank = 0;
offset = 0;
}


EqArray * FindBase() {
if (this != parent) {
EqArray * p = parent;
parent = parent->FindBase();
offset += p->offset;
}
return parent;
}
void Equivalence (EqArray * y, int delta)
{
EqArray * xBase = FindBase();
EqArray * yBase = y->FindBase();
if (xBase == yBase) Error();
else {
int diffBase = offset - y->offset - delta;
Link(xBase, yBase, diffBase);
}
}
void Link(EqArray * x, EqArray * y, int diff)
{
if (x->rank > y->rank) {
y->parent = x;
y->offset = -diff;
}
else {
x->parent = y;
x->offset = diff;
if (x->rank == y->rank)
y->rank += 1;
}
}
};
