Download as pdf or txt
Download as pdf or txt
You are on page 1of 3

104 4.

SORTING AND SEARCHING

4.1 Applications of Sorting


We will review several sorting algorithms and their complexities over the course of
this chapter. But the punch-line is this: clever sorting algorithms exist that run in
O(n log n). This is a big improvement over naive O(n2 ) sorting algorithms for large
values of n. Consider the following table:

n n2 /4 n lg n
10 25 33
100 2,500 664
1,000 250,000 9,965
10,000 25,000,000 132,877
100,000 2,500,000,000 1,660,960

You might still get away with using a quadratic-time algorithm even if n =
10, 000, but quadratic-time sorting is clearly ridiculous once n ≥ 100, 000.
Many important problems can be reduced to sorting, so we can use our clever
O(n log n) algorithms to do work that might otherwise seem to require a quadratic
algorithm. An important algorithm design technique is to use sorting as a basic
building block, because many other problems become easy once a set of items is
sorted.
Consider the following applications:

• Searching – Binary search tests whether an item is in a dictionary in O(log n)


time, provided the keys are all sorted. Search preprocessing is perhaps the
single most important application of sorting.

• Closest pair – Given a set of n numbers, how do you find the pair of numbers
that have the smallest difference between them? Once the numbers are sorted,
the closest pair of numbers must lie next to each other somewhere in sorted
order. Thus, a linear-time scan through them completes the job, for a total
of O(n log n) time including the sorting.

• Element uniqueness – Are there any duplicates in a given set of n items?


This is a special case of the closest-pair problem above, where we ask if there
is a pair separated by a gap of zero. The most efficient algorithm sorts the
numbers and then does a linear scan though checking all adjacent pairs.

• Frequency distribution – Given a set of n items, which element occurs the


largest number of times in the set? If the items are sorted, we can sweep
from left to right and count them, since all identical items will be lumped
together during sorting.
To find out how often an arbitrary element k occurs, look up k using binary
search in a sorted array of keys. By walking to the left of this point until the
first the element is not k and then doing the same to the right, we can find
4.1 APPLICATIONS OF SORTING 105

Figure 4.1: The convex hull of a set of points (l), constructed by left-to-right insertion.

this count in O(log n + c) time, where c is the number of occurrences of k.


Even better, the number of instances of k can be found in O(log n) time by
using binary search to look for the positions of both k −  and k +  (where
 is arbitrarily small) and then taking the difference of these positions.

• Selection – What is the kth largest item in an array? If the keys are placed
in sorted order, the kth largest can be found in constant time by simply
looking at the kth position of the array. In particular, the median element
(see Section 14.3 (page 445)) appears in the (n/2)nd position in sorted order.

• Convex hulls – What is the polygon of smallest area that contains a given
set of n points in two dimensions? The convex hull is like a rubber band
stretched over the points in the plane and then released. It compresses to
just cover the points, as shown in Figure 4.1(l). The convex hull gives a nice
representation of the shape of the points and is an important building block
for more sophisticated geometric algorithms, as discussed in the catalog in
Section 17.2 (page 568).
But how can we use sorting to construct the convex hull? Once you have the
points sorted by x-coordinate, the points can be inserted from left to right
into the hull. Since the right-most point is always on the boundary, we know
that it will appear in the hull. Adding this new right-most point may cause
others to be deleted, but we can quickly identify these points because they lie
inside the polygon formed by adding the new point. See the example in Figure
4.1(r). These points will be neighbors of the previous point we inserted, so
they will be easy to find and delete. The total time is linear after the sorting
has been done.

While a few of these problems (namely median and selection) can be solved in
linear time using more sophisticated algorithms, sorting provides quick and easy
solutions to all of these problems. It is a rare application where the running time
106 4. SORTING AND SEARCHING

of sorting proves to be the bottleneck, especially a bottleneck that could have


otherwise been removed using more clever algorithmics. Never be afraid to spend
time sorting, provided you use an efficient sorting routine.

Take-Home Lesson: Sorting lies at the heart of many algorithms. Sorting the
data is one of the first things any algorithm designer should try in the quest
for efficiency.

Stop and Think: Finding the Intersection

Problem: Give an efficient algorithm to determine whether two sets (of size m and
n, respectively) are disjoint. Analyze the worst-case complexity in terms of m and
n, considering the case where m is substantially smaller than n.

Solution: At least three algorithms come to mind, all of which are variants of
sorting and searching:

• First sort the big set – The big set can be sorted in O(n log n) time. We can
now do a binary search with each of the m elements in the second, looking
to see if it exists in the big set. The total time will be O((n + m) log n).

• First sort the small set – The small set can be sorted in O(m log m) time. We
can now do a binary search with each of the n elements in the big set, looking
to see if it exists in the small one. The total time will be O((n + m) log m).

• Sort both sets – Observe that once the two sets are sorted, we no longer
have to do binary search to detect a common element. We can compare the
smallest elements of the two sorted sets, and discard the smaller one if they
are not identical. By repeating this idea recursively on the now smaller sets,
we can test for duplication in linear time after sorting. The total cost is
O(n log n + m log m + n + m).

So, which of these is the fastest method? Clearly small-set sorting trumps big-
set sorting, since log m < log n when m < n. Similarly, (n + m) log m must be
asymptotically less than n log n, since n + m < 2n when m < n. Thus, sorting the
small set is the best of these options. Note that this is linear when m is constant
in size.
Note that expected linear time can be achieved by hashing. Build a hash table
containing the elements of both sets, and verify that collisions in the same bucket
are in fact identical elements. In practice, this may be the best solution.

You might also like