UW AMath High Performance Scientific Computing
 
AMath 483/583 Class Notes
 
Spring Quarter, 2013

Table Of Contents

Previous topic

Special functions

Next topic

Linear Algebra software

This Page

Timing code

Unix time command

There is a built in command time that can be used to time other commands. Just type time command at the prompt, e.g.:

$ time ./a.out
<output from code>

real    0m5.279s
user    0m1.915s
sys     0m0.006s

This executes the command ./a.out in this case (running a Fortran executable) and then prints information about the time required to execute, where, roughly speaking, real is the wall-clock time, user is the time spent executing the user’s program, and sys is the time spent on system tasks required by the program.

There may be a big difference between the real time and the sum of the other two times if the computer is simulataneously executing many other tasks and your program is only getting some of its attention.

Using time does not allow you to examine how much CPU time different parts of the code required, for example.

Fortran timing utilities

There are two Fortran intrinsic functions that are very useful.

system_clock tells the elapsed wall time between two successive calls, and might be used as follows:

integer(kind=8) :: tclock1, tclock2, clock_rate
real(kind=8) :: elapsed_time

call system_clock(tclock1)

<code to be timed>

call system_clock(tclock2, clock_rate)
elapsed_time = float(tclock2 - tclock1) / float(clock_rate)

cpu_time tells the CPU time used between two successive calls:

real(kind=8) :: t1, t2, elapsed_cpu_time

call cpu_time(t1)

<code to be timed>

call cpu_time(t2)
elapsed_cpu_time = t2 - t1

Here is an example code that uses both, and tests matrix-matrix multiply.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
! $UWHPSC/codes/fortran/optimize/timings.f90

! Illustrate timing utilities in Fortran.
!  system_clock can be used to compute elapsed time between
!      two calls (wall time)
!  cpu_time can be used to compute CPU time used between two calls.

! Try compiling with different levels of optimization, e.g. -O3


program timings

    implicit none
    integer, parameter :: ntests = 20
    integer :: n 
    real(kind=8), allocatable, dimension(:,:) :: a,b,c
    real(kind=8) :: t1, t2, elapsed_time
    integer(kind=8) :: tclock1, tclock2, clock_rate
    integer :: i,j,k,itest

    call system_clock(tclock1)

    print *, "Will multiply n by n matrices, input n: "
    read *, n

    allocate(a(n,n), b(n,n), c(n,n))

    ! fill a and b with 1's just for demo purposes:
    a = 1.d0
    b = 1.d0

    call cpu_time(t1)   ! start cpu timer
    do itest=1,ntests
        do j = 1,n
            do i = 1,n
                c(i,j) = 0.d0
                do k=1,n
                    c(i,j) = c(i,j) + a(i,k)*b(k,j)
                    enddo
                enddo
            enddo
        enddo

    call cpu_time(t2)   ! end cpu timer
    print 10, ntests, t2-t1
 10 format("Performed ",i4, " matrix multiplies: CPU time = ",f12.8, " seconds")

    
    call system_clock(tclock2, clock_rate)
    elapsed_time = float(tclock2 - tclock1) / float(clock_rate)
    print 11, elapsed_time
 11 format("Elapsed time = ",f12.8, " seconds")

end program timings

Note that the matrix-matrix product is computed 20 times over to give a better estimate of the timings.

You might want to experiment with this code and various sizes of the matrices and optimization levels. Here are a few sample results on a MacBook Pro.

First, with no optimization and \(200\times 200\) matrices:

$ gfortran timings.f90
$ ./a.out
 Will multiply n by n matrices, input n:
200
Performed   20 matrix multiplies: CPU time =   0.81523600 seconds
Elapsed time =   5.94083357 seconds

Note that the elapsed time include the time required to type in the response to the prompt for n, as well as the time required to allocate and initialize the matrices, whereas the CPU time is just for the matrix multiplication loops.

Trying a larger \(400 \times 400\) case gives:

$ ./a.out
 Will multiply n by n matrices, input n:
400
Performed   20 matrix multiplies: CPU time =   7.33721500 seconds
Elapsed time =   9.87114525 seconds

Since computing the product of \(n \times n\) matrices takes \(O(n^3)\) flops, we expect this to take about 8 times as much CPU time as the previous test. This is roughly what we observe.

Doubling the size again gives requires much more that 8 times as long however:

$ ./a.out
 Will multiply n by n matrices, input n:
800
Performed   20 matrix multiplies: CPU time =  90.49682200 seconds
Elapsed time =  93.98917389 seconds

Note that 3 matrices that are \(400\times 400\) require 3.84 MB of memory, whereas three \(800 \times 800\) matrices require 15.6 MB. Since the MacBook used for this experiment has only 6 MB of L3 cache, the data no longer fit in cache.

Compiling with optimization

If we recompile with the -O3 optimization flag, the last two experiments give these results:

$ gfortran -O3 timings.f90
$ ./a.out
 Will multiply n by n matrices, input n:
400
Performed   20 matrix multiplies: CPU time =   1.39002200 seconds
Elapsed time =   3.58041191 seconds

and

$ ./a.out
 Will multiply n by n matrices, input n:
800
Performed   20 matrix multiplies: CPU time =  66.39167200 seconds
Elapsed time =  68.29921722 seconds

Both have sped up relative to the un-optimized code, the first much more dramatically.

Timing OpenMP code

The code in $UWHPSC/codes/openmp/timings.f90 shows an analogous code for matrix multiplication using OpenMP.

The code has been slightly modified so that system_clock is only timing the inner loops in order to illustrate that cpu_time now computes the sum of the CPU time of all threads.

Here’s one sample result:

$ gfortran -fopenmp -O3 timings.f90
$ ./a.out
 Using OpenMP, how many threads?
4
 Will multiply n by n matrices, input n:
400
Performed   20 matrix multiplies: CPU time =   1.99064000 seconds
Elapsed time =   0.58711302 seconds

Note that the CPU time reported is nearly 2 seconds but the elapsed time is only 0.58 seconds in this case when 4 threads are being used.

The total CPU time is slightly more than the previous code that did not use OpenMP, but the wall time is considerably less.

For \(800\times 800\) matrices there is a similar speedup:

$ ./a.out
 Using OpenMP, how many threads?
4
 Will multiply n by n matrices, input n:
800
Performed   20 matrix multiplies: CPU time =  79.70573500 seconds
Elapsed time =  21.37633133 seconds

Here is the code:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
! $UWHPSC/codes/fortran/optimize/timings1.f90

! Illustrate timing utilities in Fortran.
!  system_clock can be used to compute elapsed time between
!      two calls (wall time)
!  cpu_time can be used to compute CPU time used between two calls.

! Try compiling with different levels of optimization, e.g. -O3


program timings1

    use omp_lib

    implicit none
    integer, parameter :: ntests = 20
    integer :: n, nthreads
    real(kind=8), allocatable, dimension(:,:) :: a,b,c
    real(kind=8) :: t1, t2, elapsed_time
    integer(kind=8) :: tclock1, tclock2, clock_rate
    integer :: i,j,k,itest

    ! Specify number of threads to use:
    !$ print *, "Using OpenMP, how many threads? "
    !$ read *, nthreads 
    !$ call omp_set_num_threads(nthreads)

    print *, "Will multiply n by n matrices, input n: "
    read *, n

    allocate(a(n,n), b(n,n), c(n,n))

    ! fill a and b with 1's just for demo purposes:
    a = 1.d0
    b = 1.d0

    call system_clock(tclock1)  ! start wall timer

    call cpu_time(t1)   ! start cpu timer
    do itest=1,ntests
        !$omp parallel do private(i,k)
        do j = 1,n
            do i = 1,n
                c(i,j) = 0.d0
                do k=1,n
                    c(i,j) = c(i,j) + a(i,k)*b(k,j)
                    enddo
                enddo
            enddo
        enddo

    call cpu_time(t2)   ! end cpu timer
    print 10, ntests, t2-t1
 10 format("Performed ",i4, " matrix multiplies: CPU time = ",f12.8, " seconds")

    
    call system_clock(tclock2, clock_rate)
    elapsed_time = float(tclock2 - tclock1) / float(clock_rate)
    print 11, elapsed_time
 11 format("Elapsed time = ",f12.8, " seconds")

end program timings1