Data Structures and Algorithms, with examples in Perl


Algorithms

An algorithm is a method for solving a problem, with or without a computer.

We'll examine, code, and run some algorithms. We'll consider three issues: understanding programming, programming in general, and Perl programming.


Understanding programming

You can distinguish three levels of understanding programming:
  1. Imitative: copy some code, change it around until it works.
  2. Operational: picture how the computer executes each statement.
  3. Deductive: derive code from intended solution, analyze correctness and performance.

We'll work at levels 2 and 3.

To find the best solution among alternatives, you need to work at level 3.


Programming in general

These lessons apply to any programming language.

Perl programming


Case study: Maximum section sum

Problem statement

Input:
a list of integers
Output:
the largest sum found in any contiguous sublist

Example

Input:
(31, -41, 59, 26, -53, 58, 97, -93, -23, 84)
Output:
187

The maximum section sum occurs in the sublist (59, 26, -53, 58, 97).

Motivation

Pattern matching: find the largest sublist that has a particular property.

Algorithms

Obvious solution: examine all possible sublists, test each for the property.

# Given list a[0 .. max]
foreach lower in 0 .. max 
  foreach upper in lower .. max 
    test property on sublist a[lower .. upper]
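For the maximum section sum, the "property" is the sublist's sum. One way to code the obvious approach in Perl (a sketch; by carrying a running sum in the inner loop, testing each sublist costs O(1), so the whole thing is O(n²) rather than O(n³)):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Brute force: examine every contiguous sublist, keep the largest sum.
# A running sum in the inner loop avoids re-summing each sublist.
sub max_section_sum_brute {
    my @a = @_;
    my $best = 0;                        # the empty sublist has sum 0
    foreach my $lower (0 .. $#a) {
        my $sum = 0;
        foreach my $upper ($lower .. $#a) {
            $sum += $a[$upper];          # running sum of a[lower .. upper]
            $best = $sum if $sum > $best;
        }
    }
    return $best;
}

print max_section_sum_brute(31, -41, 59, 26, -53, 58, 97, -93, -23, 84), "\n";  # 187
```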

Can we do better?

Algorithm 1
Algorithm 2
Algorithm 4
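The best of these is a linear-time scanning algorithm (I believe this is Bentley's Algorithm 4, often called Kadane's algorithm): walk the list once, tracking the best sum of a sublist ending at the current position and the best sum seen anywhere. A sketch in Perl, assuming the empty sublist (sum 0) is allowed:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Scanning algorithm: one pass, O(n).
# $ending_here = best sum of a sublist ending at the current element;
# reset it to 0 whenever it goes negative (a negative prefix never helps).
sub max_section_sum {
    my @a = @_;
    my ($ending_here, $best) = (0, 0);
    foreach my $x (@a) {
        $ending_here += $x;
        $ending_here = 0 if $ending_here < 0;
        $best = $ending_here if $ending_here > $best;
    }
    return $best;
}

print max_section_sum(31, -41, 59, 26, -53, 58, 97, -93, -23, 84), "\n";  # 187
```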

Sources

Jon Bentley, Programming Pearls: Algorithm Design Techniques, Communications of the ACM 27(9):865-871, September 1984.

Jon Bentley, Programming Pearls, Addison Wesley, 1986 (first edition, chapter 7), 2000 (second edition, chapter 8).

Jon Bentley's notes (PS, PDF)


Performance

How can we characterize the performance of the algorithm itself, independently from a particular input data set or operating environment?

Big-O analysis describes how the running time increases as the size of the input data increases.

O(n) means the running time is proportional to the size of the input, O(n²) means the running time is proportional to the square of the input size, etc.

Recall n² = n · n, n³ = n · n · n, etc. (draw graphs on whiteboard)

Lower orders are good, higher orders are bad. Algorithms with high orders cannot process large data sets in reasonable time.

For sufficiently large n, the lower order algorithm outperforms the higher order in any operating environment. The TRS-80 running the O(n) algorithm beats the Cray supercomputer running the O(n³) algorithm when n is greater than a few thousand (Bentley Table 2, p. 868, Fig. 1, p. 869).


Estimating big-O

Each nested loop adds one to the order.

Algorithms that can ignore some of the input data are sub-linear.

Be careful of hidden O(n) operations that copy lists. You can often avoid copying lists by using references.
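A hypothetical example of such a hidden copy: assigning the argument list to a named array inside a subroutine copies every element, while passing a reference copies only one scalar.

```perl
#!/usr/bin/perl
use strict;
use warnings;

sub first_by_value {
    my @a = @_;         # copies all n elements: a hidden O(n) operation
    return $a[0];
}

sub first_by_ref {
    my $a_ref = shift;  # copies a single scalar (the reference): O(1)
    return $a_ref->[0];
}

my @big = (1 .. 100_000);
print first_by_value(@big), "\n";   # 1, but paid an O(n) copy
print first_by_ref(\@big),  "\n";   # 1, no copy
```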


Binary search

Searching a list can be quite fast if the list is already sorted.

Example: looking up a number in the phone book.

(Explain binary search)

An important principle: invest in building a data structure that makes subsequent operations much more efficient.

Searching an unsorted list is O(n). You must examine each element until you find the one you want (sequential search).

Searching a sorted list is O(log n). At each step, you can eliminate half the remaining elements, so the number of steps in the worst case is log₂n.

log₂n is the number of times you must divide a list of n elements in half to reach a single element.

The log function grows very slowly. Recall 2ⁿ = x means log₂x = n.

 n    2ⁿ        x         log₂x
 0    1         1         0
 1    2         2         1
 2    4         4         2
 3    8         8         3
 4    16        16        4
 .
 .
10    1024      1024      10
 .
 .
20    1048576   1048576   20
 .
 .

A thousandfold increase in list size (from a thousand to a million) only doubles the search time!

Example: binary search with a thousand, then a million elements.

Be careful - the performance gained by a log n algorithm can be wasted by linear copy operations (in subroutine calls, for example).

Avoid linear copy operations by using references.


Sorting

Naive sorts (bubble sort, but also insertion sort etc.) are O(n²).

Efficient sorts (mergesort, quicksort etc.) are O(n log n). Mergesort: split the list, sort the halves, merge the sorted halves -- recursively!

Example of big performance gain from an unobvious algorithm. The recursion is O(log n) levels deep, and the merging at each level is O(n), so the total is O(n log n).
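A minimal mergesort sketch in Perl (for exposition only -- in practice you would use Perl's built-in sort, which is already O(n log n)):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Mergesort: split, sort the halves recursively, merge the sorted halves.
sub merge_sort {
    my @a = @_;
    return @a if @a <= 1;                           # base case
    my $mid   = int(@a / 2);                        # split
    my @left  = merge_sort(@a[0 .. $mid - 1]);      # sort halves recursively
    my @right = merge_sort(@a[$mid .. $#a]);
    my @merged;                                     # merge: repeatedly take
    while (@left && @right) {                       # the smaller front element
        push @merged, ($left[0] <= $right[0]) ? shift @left : shift @right;
    }
    return (@merged, @left, @right);                # one half may have leftovers
}

print join(" ", merge_sort(5, 2, 9, 1, 5, 6)), "\n";  # 1 2 5 5 6 9
```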


Associative arrays (hashes)

Hashes are like tables or dictionaries that associate keys with values. The key can be any scalar.

Almost as fast and compact as regular arrays (15 percent more time, 40 percent more space).

Syntax

Hash variable is indicated by percent sign prefix:   %hashname

Access by key value using curly brackets:   $hashname{$keyvalue}

Example

Word frequency count: keys are words, values are numbers of occurrences
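A sketch of the word frequency count, using a made-up sample string:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Word frequency count: keys are words, values are numbers of occurrences.
my $text = "the quick brown fox jumps over the lazy dog the end";

my %count;
$count{$_}++ foreach split ' ', $text;   # each word indexes the hash

foreach my $word (sort keys %count) {
    print "$word: $count{$word}\n";      # e.g.  the: 3
}
```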


References

Some Perl limitations: lists and hashes are flattened and copied when passed to subroutines, and hash values must be scalars, so lists and hashes cannot directly nest.

What to do?

References enable you to access large structures (lists, hashes, subroutines) through scalars.

Variables are containers. References are addresses of containers (explanation with figures at the whiteboard). References are called pointers in some programming languages.

Uses for references in Perl: passing large structures to subroutines without copying, and building nested data structures (lists of lists, hashes of lists, etc.).

Syntax

Create a reference by prefixing variable or constant with a backslash:

$const_ref  = \42;
$x_ref      = \$x;
$a_ref      = \@a;   # $a_ref is scalar reference to array @a
$h_ref      = \%h; 

There is special syntax for references to anonymous structured values:

$a2_ref = [ 2, 4, 6, 8 ];                    # square brackets 
$h2_ref = { red => "stop", green => "go" };  # curly braces

Retrieve (dereference) the item by using the appropriate prefix for the data type:

$$const_ref
$$x_ref
@$a_ref       # @$a_ref dereferences $a_ref to retrieve @a
%$h_ref

There is special arrow notation for dereferencing array or hash elements. The following three expressions are equivalent:

$$a_ref[0]      # see above
${$a_ref}[0]    # also acceptable, maybe a bit clearer
$a_ref->[0]     # most prefer this way, used in object-oriented programming

Example

Binary search with references
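A sketch of binary search taking an array reference, so the sorted list is not copied on each call:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Binary search on a sorted array, passed by reference to avoid an O(n) copy.
# Returns the index of $target, or -1 if absent.
sub binary_search {
    my ($a_ref, $target) = @_;
    my ($low, $high) = (0, $#$a_ref);
    while ($low <= $high) {
        my $mid = int(($low + $high) / 2);
        if    ($a_ref->[$mid] < $target) { $low  = $mid + 1; }  # discard left half
        elsif ($a_ref->[$mid] > $target) { $high = $mid - 1; }  # discard right half
        else                             { return $mid; }       # found it
    }
    return -1;
}

my @sorted = (2, 3, 5, 7, 11, 13, 17);
print binary_search(\@sorted, 11), "\n";  # 4
print binary_search(\@sorted, 6),  "\n";  # -1
```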

Readings

perlref -- Perl references and nested data structures
perlreftut -- Understand references today


Associative arrays with references

Hash values must be scalars, so storing structures (lists, hashes) in a hash requires references.

Example

Text concordance: keys are words, values are (references to) lists of the lines where the word occurs.
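A sketch of the concordance, using a made-up three-line text; note that push through @{ $where{$word} } autovivifies the array reference on first use:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Concordance: each hash value is a reference to the list of line numbers
# where the word occurs.
my @lines = (
    "we examine some algorithms",
    "an algorithm solves a problem",
    "we code some algorithms in perl",
);

my %where;
foreach my $i (0 .. $#lines) {
    foreach my $word (split ' ', $lines[$i]) {
        push @{ $where{$word} }, $i + 1;   # autovivifies an array reference
    }
}

foreach my $word (sort keys %where) {
    print "$word: @{ $where{$word} }\n";   # e.g.  we: 1 3
}
```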


Correctness

Use loop invariants to reason about correctness.


Perl and other languages

Perl advantages, from a data structures point of view:

Perl disadvantages:

Other important languages with automatic memory management:


Other resources

Links to books, tools, other course web pages etc.


Jon Jacky, jon@u.washington.edu