
INTRO TO DATA STRUCTURES AND ALGORITHM ANALYSIS

Dr. Nitin Gupta


Assistant Professor
DoCSE, NITH
WHAT IS THIS COURSE ABOUT?

 Data structures: conceptual and concrete ways to organize data for efficient storage and efficient manipulation
 Employment of these data structures in the design of efficient algorithms
WHY DO WE NEED THEM?
 Computers take on more and more complex tasks
 Imagine: an index of 8 billion pages! (Google)
 Software implementation and maintenance is difficult
 A clean conceptual framework allows for more efficient and more correct code
WHY DO WE NEED THEM?
 Requirements for good software:
 Clean design
 Easy maintenance
 Reliable (no core dumps)
 Easy to use
 Fast algorithms

Efficient data structures
Efficient algorithms
WHY DO WE NEED THEM: EXAMPLE
 A collection of 3,000 texts, each with an avg. of 20 lines and an avg. of 10 words per line
 That is 600,000 words in total
 Find all occurrences of the word “happy”
 Suppose it takes 1 sec. to check a word for a match
 What to do?
EXAMPLE (CONT’D)
 What to do?
Sol. 1 Sequential matching: 1 sec. x 600,000 words ≈ 166 hours
Sol. 2 Binary searching:
- order the words
- search only half of the remaining words at a time
Ex. Search for 25 in: 5 8 12 15 15 17 23 25 27
  compare 25 with 15 → keep 15 17 23 25 27
  compare 25 with 23 → keep 23 25 27
  compare 25 with 25 → found
How many steps?
 log2 600,000 ≈ 19, so about 19 sec. vs. 166 hours!

SOME EXAMPLE DATA STRUCTURES

Set    Stack    Tree

Data structure = representation and operations associated with a data type
WHAT WILL YOU LEARN?

 What are some of the common data structures

 What are some ways to implement them

 How to analyze their efficiency

 How to use them to solve practical problems


WHAT YOU NEED
 Programming experience with C / C++
 Some Java experience may help as well (but not required)

 Textbook
 Seymour Lipschutz and GAV Pai, “Data Structures”, Schaum’s Outlines,
McGraw Hill.
 For programming enthusiasts:
 Data Structures Using C, Tenenbaum
TOPICS

Arrays
Stacks and Queues
Linked Lists
Trees
Heaps / Priority Queues
Binary Search Trees
Search Trees
Hashing / Dictionaries
Sorting
Graphs and graph algorithms
PROBLEM SOLVING: MAIN STEPS

1. Problem definition
2. Algorithm design / Algorithm specification
3. Algorithm analysis
4. Implementation
5. Testing
6. [Maintenance]
1. PROBLEM DEFINITION

 What is the task to be accomplished?


 Calculate the average of the grades for a given student
 Understand speeches given by politicians and translate them into Chinese
 What are the time / space / performance requirements?
2. ALGORITHM DESIGN / SPECIFICATIONS

 Algorithm: a finite set of instructions that, if followed, accomplishes a particular task
 Describe it in natural language / pseudo-code / diagrams / etc.
 Criteria to follow:
 Input: zero or more quantities (externally produced)
 Output: one or more quantities
 Definiteness: clarity and precision of each instruction
 Finiteness: the algorithm has to stop after a finite (possibly very large) number of steps
 Effectiveness: each instruction has to be basic enough and feasible
4, 5, 6: IMPLEMENTATION, TESTING, MAINTENANCE
 Implementation
 Decide on the programming language to use (C, C++)
 Write clean, well-documented code
 Testing
 Test, test, test
 Maintenance
 Integrate feedback from users, fix bugs, ensure compatibility across different versions
3. ALGORITHM ANALYSIS

 Space complexity
 How much space is required
 Time complexity
 How much time does it take to run the algorithm

 Often, we deal with estimates!


MEMORY ISSUE
 The complexity of an algorithm is the amount of resources required to run it
 Memory still matters for bandwidth-limited settings, small gadgets and smart cards; even Windows Vista required 1 GB of RAM
 Whether a program runs in RAM or in cache is still a big issue
 Small programs are more efficient
 An early advertisement in Byte magazine boasted an “unimaginable” 32 KB of RAM and a “whopping” 10 MB of hard disk
SPACE COMPLEXITY
 Space complexity = The amount of memory required
by an algorithm to run to completion

1. Fixed part: the space required to store certain data that is independent of the size of the problem (e.g., the code, simple variables, constants)
2. Variable part: the space needed by variables whose size depends on the size of the problem (e.g., a dynamically allocated array of n elements, the recursion stack)
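 For instance, a hypothetical copy_array function: its counter and pointer need constant space (the fixed part), while the buffer it allocates grows with n (the variable part). A minimal sketch:

    #include <stdlib.h>

    /* Hypothetical example: fixed vs. variable space */
    int *copy_array ( const int a[], int n )
    {
        int i;                                /* fixed part: one counter */
        int *b = malloc ( n * sizeof *b );    /* variable part: n integers */

        if ( b != NULL ) {
            for ( i = 0; i < n; i++ )
                b[i] = a[i];
        }

        return b;
    }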
TIME COMPLEXITY
 Often more important than space complexity
 space available (for computer programs!) tends to be larger and larger
 time is still a problem for all of us

 3-4 GHz processors are on the market
 still ...
 researchers estimate that computing the various transformations of a single DNA chain for a single protein on a 1 THz computer would take about 1 year to run to completion
 An algorithm's running time is an important issue
RUNNING TIME

 Problem: prefix averages
 Given an array X
 Compute the array A such that A[i] is the average of the elements X[0] ... X[i], for i = 0..n-1
 Sol 1
 At each step i, compute A[i] by traversing X[0] ... X[i], determining the sum of these elements and then the average
 Sol 2
 At each step i, update a running sum of the elements of X
 Compute A[i] as sum / (i + 1)

Big question: Which solution to choose?
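 A minimal sketch of both solutions in C (hypothetical function names; X, A and n as in the problem statement):

    /* Sol 1: recompute the sum for every i -> roughly n*(n+1)/2 additions */
    void prefix_avg1 ( const int X[], double A[], int n )
    {
        int i, j;

        for ( i = 0; i < n; i++ ) {
            double sum = 0;

            for ( j = 0; j <= i; j++ )
                sum += X[j];

            A[i] = sum / ( i + 1 );
        }
    }

    /* Sol 2: keep a running sum -> one addition per element */
    void prefix_avg2 ( const int X[], double A[], int n )
    {
        int i;
        double sum = 0;

        for ( i = 0; i < n; i++ ) {
            sum += X[i];
            A[i] = sum / ( i + 1 );
        }
    }

 Both compute the same array A; the difference is roughly n^2/2 additions versus n additions.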


RUNNING TIME

[Chart: running time (1-5 ms) for inputs A-G, showing best-case, average-case and worst-case times]

 Suppose the program includes an if-then statement that may or may not execute → variable running time
 Typically, algorithms are measured by their worst case
EXPERIMENTAL APPROACH
 Write a program that implements the algorithm
 Run the program with data sets of varying size.
 Determine the actual running time using a system call to measure time (e.g. system("date"))
 We are essentially figuring out how much longer the algorithm takes when we add more items. The most common way to do this is to double the number of items and see how much longer the algorithm takes.
EXPERIMENTAL APPROACH

We could actually test this by writing the algorithm, profiling it to see how long it takes for N items, and then profiling it again after doubling N. The time difference is a rough estimate of the growth. This is called an empirical test; a sketch appears below.
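 A rough sketch of such an empirical test in C, using clock() from <time.h> and a simple linear scan as a stand-in for the algorithm being profiled:

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    /* Stand-in for the algorithm under test: a linear scan over n items */
    static long scan ( const int a[], int n )
    {
        long sum = 0;
        int i;

        for ( i = 0; i < n; i++ )
            sum += a[i];

        return sum;
    }

    int main ( void )
    {
        int n;

        /* double n each round and compare the measured times */
        for ( n = 1000000; n <= 8000000; n *= 2 ) {
            int *a = calloc ( n, sizeof *a );
            clock_t start;
            long s;
            double sec;

            if ( a == NULL )
                break;

            start = clock ();
            s = scan ( a, n );
            sec = ( double ) ( clock () - start ) / CLOCKS_PER_SEC;

            printf ( "n = %7d   time = %.3f sec   (checksum %ld)\n", n, sec, s );
            free ( a );
        }

        return 0;
    }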
PLOT THE RESULTS
[Plot: measured running time vs. input size N]
LIMITATIONS OF EMPIRICAL STUDY
 It is necessary to implement the algorithm, which may be
difficult
 Results may not be indicative of the running time on other
inputs not included in the experiment.
 In order to compare two algorithms, the same hardware
and software environments must be used
 Even in the same hardware and software environment, the results may vary depending on the processor load, sharing of resources, the number of background processes, the actual state of primary and secondary memory at the time the program runs, the compiler, the network architecture and the programming language
THEORETICAL TEST
 However, we can also do a theoretical test by
measuring the steps that rely on the size of N
and get a reasonably useful measure of how
the time complexity grows.
 Because the steps that don't rely on N won't
grow, we can remove them from the measure
because at a certain point, they become so
small as to be worthless. In other words, we
pretend that they don't matter in all cases.
 This is the idea behind asymptotic notation. By removing the constants (quantities with a fixed value that does not depend on N), we can focus on the part of the measure that grows and derive a simplified asymptotic bound on the algorithm. A common notation that removes constants is called Big O notation, where the O means “order of”.
 Let's look at an example:
    #include <stdio.h>

    void f ( int a[], int n )
    {
        int i;

        printf ( "N = %d\n", n );

        for ( i = 0; i < n; i++ )
            printf ( "%d ", a[i] );

        printf ( "\n" );
    }
 In this function, the only part that takes longer as the size
of the array grows is the loop. Therefore, the two printf
calls outside of the loop are said to have a constant time
complexity, or O(1), as they don't rely on N. The loop itself
has a number of steps equal to the size of the array, so
we can say that the loop has a linear time complexity, or
O(N). The entire function f has a time complexity of 2 *
O(1) + O(N), and because constants are removed, it's
simplified to O(1) + O(N).
 Now, asymptotic notation also typically ignores
the measures that grow more slowly because
eventually the measure that grows more quickly
will dominate the time complexity as N moves
toward infinity. So by ignoring the constant time
complexity because it grows more slowly than
the linear time complexity, we can simplify the
asymptotic bound of the function to O(N), so the
conclusion is that f has linear time complexity.
Okay, but what does O really mean? Big O
notation refers to the asymptotic upper bound,
which means that it's a cap on how much the
time complexity will grow.
 If we say that a function is O(1), then there's
no growth and the function will always take a
fixed amount of time to complete.
 If we say that a function is O(N), then if N doubles, the function's running time will at most double. It may be less, but never more. That's the upper bound of an algorithm, and it's the most common notation.
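 For instance, if an algorithm takes f(N) = 3N + 5 steps, then f(2N) = 6N + 5, which is no more than 2·f(N) = 6N + 10: doubling N at most doubles the work, which is exactly what O(N) promises.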
 Now, even though O notation is the most common, it's not always the
most accurate measure. For example, let's say we have a sequential
search of an unordered array where the items are randomly
distributed and we want both the average case growth and the worst
case growth:
    int find ( int a[], int n, int x )
    {
        int i;

        for ( i = 0; i < n; i++ ) {
            if ( a[i] == x )
                return 1;
        }

        return 0;
    }
 This algorithm is clearly O(N) because it only
has one loop that relies on the size of the array,
and the time complexity of the loop doubles as
the size of the array doubles. However, that's
the worst case upper bound. We know that on
average, only half of the array is searched
before the item is found due to the random
distribution. So while the time complexity could
reach O(N), it's usually less even though we
don't really know how much less.
 Okay, how about a binary search instead of a sequential search? If
the array is sorted, we can make the search a lot faster by splitting
the array in half at each comparison and only searching the half
where the item might be. That's common knowledge, but why is it
faster? Here's the code for a binary search:
    int find ( int a[], int n, int x )
    {
        int i = 0;

        while ( i < n ) {
            int mid = ( n + i ) / 2;

            if ( a[mid] > x )
                n = mid;          /* discard the upper half */
            else if ( a[mid] < x )
                i = mid + 1;      /* discard the lower half */
            else
                return 1;
        }

        return 0;
    }
 We can call this an O(N) algorithm and not be wrong, because the time complexity will never exceed O(N). But because the array is split in half each time, the number of steps is at most about the base-2 logarithm of N, which is considerably less than N. So an even better choice is to set the upper bound to log N, which is the upper limit that we know we're guaranteed never to cross. Therefore, a more accurate claim is that binary search is a logarithmic, or O(log2 N), algorithm.
WHAT IS LOG2 N?
 In mathematics, the binary logarithm (log2 n) is the logarithm to base 2. It is the inverse of the function n → 2^n. The binary logarithm of n is the power to which the number 2 must be raised to obtain the value n. This makes the binary logarithm useful for anything involving powers of 2. For example, the binary logarithm of 1 is 0, the binary logarithm of 2 is 1, the binary logarithm of 4 is 2, the binary logarithm of 8 is 3, the binary logarithm of 16 is 4, and the binary logarithm of 32 is 5.
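 For instance, 2^20 = 1,048,576, so log2 of about one million is 20: a binary search over a million sorted items needs at most about 20 comparisons.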
 Sometimes we're interested not in an upper bound, but in a lower bound: what is the smallest time complexity that we can expect? For example, what is the lower bound for the binary search we just found the upper bound for? We can easily say that the lower bound is Ω(1), because the best possible case is an immediate match.
 Okay, what about a sorting algorithm? Let's start with selection sort. The
algorithm is simple: find the largest item and move it to the back of the
array. When you move an item to the back, decrease the size of the array so
that you don't continually choose from the items that have already been
selected:
    void jsw_selection ( int a[], int n )
    {
        while ( --n > 0 ) {
            int i, max = n;

            /* find the index of the largest item in a[0..n] */
            for ( i = 0; i < n; i++ ) {
                if ( a[i] > a[max] )
                    max = i;
            }

            if ( max != n )
                jsw_swap ( &a[n], &a[max] );
        }
    }
 This algorithm has two loops, one inside the other. Both rely on the size of the array, so the algorithm is clearly O(N * N), more commonly written O(N^2) and referred to as quadratic. The fact that N decreases with each step of the outer loop is irrelevant unless you want a tight bound, and even then it's difficult to analyze. But that doesn't matter much, because the upper bound is really all we care about for an existing sorting algorithm.
 Let's look at a faster sort. The heap sort algorithm uses a tree-based structure to make the selection process faster.

    void jsw_do_heap ( int a[], int i, int n )
    {
        int k = i * 2 + 1;
        int save = a[i];

        /* sift a[i] down to its proper place in the heap a[0..n-1] */
        while ( k < n ) {
            if ( k + 1 < n && a[k] < a[k + 1] )
                ++k;

            if ( save >= a[k] )
                break;

            a[i] = a[k];
            i = k;
            k = i * 2 + 1;
        }

        a[i] = save;
    }

    void jsw_heapsort ( int a[], int n )
    {
        int i = n / 2;

        /* build the heap */
        while ( i-- > 0 )
            jsw_do_heap ( a, i, n );

        /* repeatedly move the largest item to the back and re-heapify */
        while ( --n > 0 ) {
            jsw_swap ( &a[0], &a[n] );
            jsw_do_heap ( a, 0, n );
        }
    }
 Because the heap is structured like a tree, jsw_do_heap is O(log2 N). The first loop in jsw_heapsort is O(N / 2), but because the second loop is O(N) and dominates the first loop, we can toss the complexity of the first loop. So we have an O(N) loop that calls an O(log2 N) function. We conclude that the upper bound of heap sort is O(N log2 N), which doesn't have a set descriptive name, but it's often written as O(N * log2 N).
 We've looked at the most common time complexities: O(1) for constant time, O(N) for linear time, O(log2 N) for logarithmic time, O(N log2 N), and O(N^2) for quadratic time. Others exist, such as O(N!) for a ridiculous factorial growth, but you won't see them often. Here are the upper bound time complexities in order of growth from least to greatest that you're most likely to see:
 O(1) - No growth
 O(log2 N) - Grows by one step each time N doubles
 O(N) - Doubles when N doubles
 O(N log2 N) - Grows by the product of N and the logarithm of N; slightly more than doubles when N doubles
 O(N^2) - Quadruples when N doubles
 O(N!) - Factorial growth; explodes even for modest N
COMPARISON OF GROWTH OF COMMON FUNCTIONS
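 For a rough numeric comparison (values rounded):

    N            log2 N     N log2 N           N^2
    16                4           64            256
    1,024            10       10,240      1,048,576
    1,048,576        20   20,971,520   ~1.1 x 10^12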
GROWTH OF FUNCTIONS
ASYMPTOTIC NOTATION
 The notations we use to describe the
asymptotic running time of an algorithm are
defined in terms of functions whose domains
are the set of natural numbers N = {0, 1, 2, ...}.
 Such notations are convenient for describing
the worst-case running-time function T (n),
which is usually defined only on integer input
sizes.
Θ-NOTATION

 For a given function g(n), we denote by Θ(g(n)) the set of functions
 Θ(g(n)) = { f(n) : there exist positive constants c1, c2, and n0 such that 0 ≤ c1·g(n) ≤ f(n) ≤ c2·g(n) for all n ≥ n0 }
 O(g(n)) = { f(n) : there exist positive constants c and n0 such that 0 ≤ f(n) ≤ c·g(n) for all n ≥ n0 }
 Ω(g(n)) = { f(n) : there exist positive constants c and n0 such that 0 ≤ c·g(n) ≤ f(n) for all n ≥ n0 }
 Theorem
 For any two functions f(n) and g(n), we have
f(n) = Θ(g(n)) if and only if f(n) = O(g(n)) and
f(n) = Ω(g(n)).
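 For example, 3n + 5 = O(n) (take c = 4 and n0 = 5, since 3n + 5 ≤ 4n for all n ≥ 5) and 3n + 5 = Ω(n) (take c = 3 and n0 = 1, since 3n + 5 ≥ 3n ≥ 0); by the theorem, 3n + 5 = Θ(n).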
 Let us show that (1/2)n^2 - 3n = Θ(n^2).
 To do so, we must determine positive constants c1, c2, and n0 such that
 c1·n^2 ≤ (1/2)n^2 - 3n ≤ c2·n^2 for all n ≥ n0.
 Dividing by n^2 yields
 c1 ≤ 1/2 - 3/n ≤ c2.
 The right-hand inequality can be made to hold for any value of n ≥ 1 by choosing c2 ≥ 1/2.
 Likewise, the left-hand inequality can be made to hold for any value of n ≥ 7 by choosing c1 ≤ 1/14. Thus, by choosing c1 = 1/14, c2 = 1/2, and n0 = 7, we can verify that (1/2)n^2 - 3n = Θ(n^2).
 Certainly, other choices for the constants exist, but the important thing is that some choice exists.
 Note that these constants depend on the function (1/2)n^2 - 3n; a different function belonging to Θ(n^2) would usually require different constants.
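 As a quick check at n = n0 = 7: (1/2)·49 - 3·7 = 3.5 and c1·n^2 = 49/14 = 3.5, so the lower bound holds with equality exactly at n = 7, while c2·n^2 = 24.5 is comfortably larger.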
