Data Structures and Algorithms, with examples in Perl


Algorithms

An algorithm is a method for solving a problem, with or without a computer.

We'll examine, code, and run some algorithms. We'll consider three issues: understanding programming, programming in general, and Perl programming.


Understanding programming

You can distinguish three levels of understanding programming:
  1. Imitative: copy some code, change it around until it works.
  2. Operational: picture how the computer executes each statement.
  3. Deductive: derive code from intended solution, analyze correctness and performance.

We'll work at levels 2 and 3.

To find the best solution among alternatives, you need to work at level 3.


Programming in general

These lessons apply to any programming language.

Perl programming


Case study: Maximum section sum

Problem statement

Input:
a list of integers
Output:
the largest sum found in any contiguous sublist

Example

Input:
(31, -41, 59, 26, -53, 58, 97, -93, -23, 84)
Output:
187

The maximum section sum occurs in the sublist (59, 26, -53, 58, 97).

Motivation

Pattern matching: find the largest sublist that has a particular property.

Algorithms

Obvious solution: examine all possible sublists, test each for the property.

# Given list a[0 .. max]
foreach lower in 0 .. max 
  foreach upper in lower .. max 
    test property on sublist a[lower .. upper]
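For the maximum section sum, the "property" is the sublist's sum. One way to code the obvious approach in Perl (a sketch; by carrying a running sum in the inner loop, testing each sublist costs O(1), so the whole thing is O(n²) rather than O(n³)):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Brute force: examine every contiguous sublist, keep the largest sum.
# A running sum in the inner loop avoids re-summing each sublist.
sub max_section_sum_brute {
    my @a = @_;
    my $best = 0;                        # the empty sublist has sum 0
    foreach my $lower (0 .. $#a) {
        my $sum = 0;
        foreach my $upper ($lower .. $#a) {
            $sum += $a[$upper];          # running sum of a[lower .. upper]
            $best = $sum if $sum > $best;
        }
    }
    return $best;
}

print max_section_sum_brute(31, -41, 59, 26, -53, 58, 97, -93, -23, 84), "\n";  # 187
```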

Can we do better?

Algorithm 1
Algorithm 2
Algorithm 4
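The best of these is a linear-time scanning algorithm (I believe this is Bentley's Algorithm 4, often called Kadane's algorithm): walk the list once, tracking the best sum of a sublist ending at the current position and the best sum seen anywhere. A sketch in Perl, assuming the empty sublist (sum 0) is allowed:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Scanning algorithm: one pass, O(n).
# $ending_here = best sum of a sublist ending at the current element;
# reset it to 0 whenever it goes negative (a negative prefix never helps).
sub max_section_sum {
    my @a = @_;
    my ($ending_here, $best) = (0, 0);
    foreach my $x (@a) {
        $ending_here += $x;
        $ending_here = 0 if $ending_here < 0;
        $best = $ending_here if $ending_here > $best;
    }
    return $best;
}

print max_section_sum(31, -41, 59, 26, -53, 58, 97, -93, -23, 84), "\n";  # 187
```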

Sources

Jon Bentley, Programming Pearls: Algorithm Design Techniques, Communications of the ACM 27(9):865-871, September 1984.

Jon Bentley, Programming Pearls, Addison Wesley, 1986 (first edition, chapter 7), 2000 (second edition, chapter 8).

Jon Bentley's notes (PS, PDF)


Performance

How can we characterize the performance of the algorithm itself, independently from a particular input data set or operating environment?

Big-O analysis describes how the running time increases as the size of the input data increases.

O(n) means the running time is proportional to the size of the input, O(n²) means the running time is proportional to the square of the input size, etc.

Recall n² = n · n, n³ = n · n · n, etc. (draw graphs on whiteboard)

Lower orders are good, higher orders are bad. Algorithms with high orders cannot process large data sets in reasonable time.

For sufficiently large n, the lower order algorithm outperforms the higher order in any operating environment. The TRS-80 running the O(n) algorithm beats the Cray supercomputer running the O(n³) algorithm when n is greater than a few thousand (Bentley Table 2, p. 868, Fig. 1, p. 869).


Estimating big-O

Each nested loop adds one to the order.

Algorithms that can ignore some of the input data are sub-linear.

Be careful of hidden O(n) operations that copy lists. You can often avoid copying lists by using references.
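A hypothetical example of such a hidden copy: assigning the argument list to a named array inside a subroutine copies every element, while passing a reference copies only one scalar.

```perl
#!/usr/bin/perl
use strict;
use warnings;

sub first_by_value {
    my @a = @_;         # copies all n elements: a hidden O(n) operation
    return $a[0];
}

sub first_by_ref {
    my $a_ref = shift;  # copies a single scalar (the reference): O(1)
    return $a_ref->[0];
}

my @big = (1 .. 100_000);
print first_by_value(@big), "\n";   # 1, but paid an O(n) copy
print first_by_ref(\@big),  "\n";   # 1, no copy
```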


Binary search

Searching a list can be quite fast if the list is already sorted.

Example: looking up a number in the phone book.

(Explain binary search)

An important principle: invest in building a data structure that makes subsequent operations much more efficient.

Searching an unsorted list is O(n). You must examine each element until you find the one you want (sequential search).

Searching a sorted list is O(log n). At each step, you can eliminate half the remaining elements, so the number of steps in the worst case is log₂n.

log₂n is the number of times you must divide a list of n elements in half to reach a single element.

The log function grows very slowly. Recall 2ⁿ = x means log₂x = n.

 n    2ⁿ        x         log₂x
 0    1         1         0
 1    2         2         1
 2    4         4         2
 3    8         8         3
 4    16        16        4
 .
 .
10    1024      1024      10
 .
 .
20    1048576   1048576   20
 .
 .

A thousandfold increase in list size (from a thousand to a million) only doubles the search time!

Example: binary search with a thousand, then a million elements.

Be careful - the performance gained by a log n algorithm can be wasted by linear copy operations (in subroutine calls, for example).

Avoid linear copy operations by using references.


Sorting

Naive sorts (bubble sort, but also insertion sort etc.) are O(n²).

Efficient sorts (mergesort, quicksort etc.) are O(n log n). Mergesort: split the list, sort the halves, merge the sorted halves -- recursively!

Example of big performance gain from an unobvious algorithm. The recursion is O(log n) levels deep, and the merging at each level is O(n), so the total is O(n log n).
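A minimal mergesort sketch in Perl (for exposition only -- in practice you would use Perl's built-in sort, which is already O(n log n)):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Mergesort: split, sort the halves recursively, merge the sorted halves.
sub merge_sort {
    my @a = @_;
    return @a if @a <= 1;                           # base case
    my $mid   = int(@a / 2);                        # split
    my @left  = merge_sort(@a[0 .. $mid - 1]);      # sort halves recursively
    my @right = merge_sort(@a[$mid .. $#a]);
    my @merged;                                     # merge: repeatedly take
    while (@left && @right) {                       # the smaller front element
        push @merged, ($left[0] <= $right[0]) ? shift @left : shift @right;
    }
    return (@merged, @left, @right);                # one half may have leftovers
}

print join(" ", merge_sort(5, 2, 9, 1, 5, 6)), "\n";  # 1 2 5 5 6 9
```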


Associative arrays (hashes)

Hashes are like tables or dictionaries that associate keys with values. The key can be any scalar.

Almost as fast and compact as regular arrays (15 percent more time, 40 percent more space).

Syntax

Hash variable is indicated by percent sign prefix:   %hashname

Access by key value using curly brackets:   $hashname{$keyvalue}

Example

Word frequency count: keys are words, values are numbers of occurrences
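A sketch of the word frequency count, using a made-up sample string:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Word frequency count: keys are words, values are numbers of occurrences.
my $text = "the quick brown fox jumps over the lazy dog the end";

my %count;
$count{$_}++ foreach split ' ', $text;   # each word indexes the hash

foreach my $word (sort keys %count) {
    print "$word: $count{$word}\n";      # e.g.  the: 3
}
```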


References

Some Perl limitations: lists and hashes are flattened and copied when passed to subroutines, and hash values must be scalars, so lists and hashes cannot directly nest.

What to do?

References enable you to access large structures (lists, hashes, subroutines) through scalars.

Variables are containers. References are addresses of containers (explanation with figures at the whiteboard). References are called pointers in some programming languages.

Uses for references in Perl: passing large structures to subroutines without copying, and building nested data structures (lists of lists, hashes of lists, etc.).

Syntax

Create a reference by prefixing variable or constant with a backslash:

$const_ref  = \42;
$x_ref      = \$x;
$a_ref      = \@a;   # $a_ref is scalar reference to array @a
$h_ref      = \%h; 

There is special syntax for references to anonymous structured values:

$a2_ref = [ 2, 4, 6, 8 ];                    # square brackets 
$h2_ref = { red => "stop", green => "go" };  # curly braces

Retrieve (dereference) the item by using the appropriate prefix for the data type:

$$const_ref
$$x_ref
@$a_ref       # @$a_ref dereferences $a_ref to retrieve @a
%$h_ref

There is special arrow notation for dereferencing array or hash elements. The following three expressions are equivalent:

$$a_ref[0]      # see above
${$a_ref}[0]    # also acceptable, maybe a bit clearer
$a_ref->[0]     # most prefer this way, used in object-oriented programming

Example

Binary search with references
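A sketch of binary search taking an array reference, so the sorted list is not copied on each call:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Binary search on a sorted array, passed by reference to avoid an O(n) copy.
# Returns the index of $target, or -1 if absent.
sub binary_search {
    my ($a_ref, $target) = @_;
    my ($low, $high) = (0, $#$a_ref);
    while ($low <= $high) {
        my $mid = int(($low + $high) / 2);
        if    ($a_ref->[$mid] < $target) { $low  = $mid + 1; }  # discard left half
        elsif ($a_ref->[$mid] > $target) { $high = $mid - 1; }  # discard right half
        else                             { return $mid; }       # found it
    }
    return -1;
}

my @sorted = (2, 3, 5, 7, 11, 13, 17);
print binary_search(\@sorted, 11), "\n";  # 4
print binary_search(\@sorted, 6),  "\n";  # -1
```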

Readings

perlref -- Perl references and nested data structures
perlreftut -- Understand references today


Associative arrays with references

Hash values must be scalars, so storing structures (lists, hashes) in a hash requires references.

Example

Text concordance: keys are words, values are (references to) lists of the lines where the word occurs.
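A sketch of the concordance, using a made-up three-line text; note that push through @{ $where{$word} } autovivifies the array reference on first use:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Concordance: each hash value is a reference to the list of line numbers
# where the word occurs.
my @lines = (
    "we examine some algorithms",
    "an algorithm solves a problem",
    "we code some algorithms in perl",
);

my %where;
foreach my $i (0 .. $#lines) {
    foreach my $word (split ' ', $lines[$i]) {
        push @{ $where{$word} }, $i + 1;   # autovivifies an array reference
    }
}

foreach my $word (sort keys %where) {
    print "$word: @{ $where{$word} }\n";   # e.g.  we: 1 3
}
```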


Correctness

Use loop invariants to reason about correctness.


Perl and other languages

Perl advantages, from a data structures point of view:

Perl disadvantages:

Other important languages with automatic memory management:


Other resources

Links to books, tools, other course web pages etc.


Jon Jacky, jon@u.washington.edu