Computer software applications
that process large two-dimensional rasters (grids) can consume
significant computer resources (CPU cycles and computer memory) as the
raster size grows large. Such programs are good candidates for
optimization that enable them to exploit (when available) multiple CPUs
and (when appropriate) large contiguous blocks of 'shared memory'
accessible to each CPU working on solving the algorithm. (Go To Benchmarks)
Typically, the optimization
consists of restructuring the source code that implements the algorithm
to operate simultaneously on multiple CPUs, each of which is given a
copy of the executable image and assigned a separate portion of the
data raster to process. Such programs then execute in three phases: A
setup phase, where each CPU is 'assigned' its portion of the solution
to compute, a calculation phase, where the CPUs are 'released' to
perform the calculations, and a closing phase, in which the solution
components produced by each CPU are re-assembled into a single, final
solution.
For some raster processing
applications, it is not possible to break the raster into sub-grids for
each CPU to process independently; rather, all of the CPUs engaged in
the solution must perform their assigned portion of the computations on
a common copy of the input raster 'shared' with the other CPU processes.
Widely-used, open source
computing platforms such as Linux/UNIX (hereby referred to as UNIX)
support 'shared memory' use by multiple processes; however,
specialized system-level Applications Programming Interface (API)
libraries are usually necessary to enable an algorithm written in
a high-level language such as C++ or Java to exploit multiple CPUs.
One such API is the industry-standard Message passing Interface (MPI),
supported since the mid-1990s on most popular computing platforms.
Within the UNIX environment, the MPI functions control each CPU through
a separate UNIX process, managed through the MPI interface through the
applications source code.
Two C++
programs
presented here demonstrate the modification of a straightforward raster
processing application to exploit multiple CPUs, all performing
computations on a single two-dimensional grid stored in shared
memory. Standard elapsed time calculations built into both
programs provide execution time benchmarks that document the
performance improvement resulting from use of multiple CPUs in
parallel.
The
FilterStd and
FilterHpc programs implement
a straightforward 'low-pass filter' in which.each cell of the
two-dimensional input raster is replaced by the average of the original
cell and its eight immediate neighbors. The resulting 'filtered' matrix
is written to a second raster grid. Such a filter algorithm
is
well
suited to demonstrate HPC programming concepts because it is 1)
relatively simple to code using easy-to-follow source code, and 2)
quickly begins to consume significant amounts of CPU cycles and memory
relative to relatively small increases in input raster grid (and
filter) size. The latter property results in high-quality benchmarking
data as the execution times of the standard and optimized versions of
the code diverge.
FilterStd uses only
ANSI standard C++ features to implement the low-pass filter, using an
algorithm designed to execute on a single CPU.
FilterStd is coded
in a straightforward way that is easy to understand, rather than for
optimal processing speed.
FilterHpc is an adaptation of FilterStd, that employs the MPI API to exploit
multiple CPUs to compute the solution,
and and UNIX standard system calls to allocate a block of
shared
memory and make it available to each of the CPUs' controlling
processes.
Development of
FilterHpc required 'parallelizing' of
the FilterStd algorithm; that is, restructuring it to run
simultaneously on multiple CPUs, each of which computes the results
grid
for its assigned portion of the input grid. The task of restructuring
an existing
algorithm in this way is a combination of art and engineering; the
FilterHpc code uses the MPI interface to execute the core algorithm
simultaneously ('in parallel') on four CPUs, each controlled by a
different UNIX process. Each process has access to two common blocks
of memory containing the input and output raster grids. One of the
processes, designated 'process 0',allocates shared memory resources,
orchestrates execution of the other processes, and collects the final
output products once all processes have completed processing. In this
instance, collection is trivial since all of the processes store their
results in shared memory, and consists of printing the output raster to
the standard output. Note that the current version of
FilterHpc uses a fixed number of four CPUs to perform the calculations; future
versions (watch this space) will support a variable number of
processors, up to the number of CPUs installed on the host machine.
Both versions of the code use the C++ Clock class, developed for this
application as
Clock.cc.
The Clock class provides execution timing and pseudo-random number
generation
features used in the demonstration.
Note that the core processing. which is to say the raster filtering
code, is essentially identical in both versions of the program. We
repeat this core processing to extend the execution times and
accentuate the differences between versions.
View this example's
source
code for each code module in separate pop up
windows by clicking on the following links:
Clock.h / Clock.cc / FilterStd.cc / FilterHpc.cc / FilterStd.mak / FilterHpc.mak
Click
here to download a .zip archive containing the FilterStd and FilterHpc source code and MAKE files.
Note: the MPI development environment must be installed on your system
in order to run FilterHpc.exe. On Ubuntu Linux systems, two open
source version of MPI is available through the Synaptic Package
Manager. Consult your UNIX Systems Administrator for other platforms.
To build and run FilterHpc on an MPI-enabled system with a minimum of 4
processors, enter the command:
The examples for this tutorial were produced using the LAM/MPI implementation of MPI.
% make -f FilterHpc.mak
% mpirun -np 4 ./FilterHpc.exe
The following table presents benchmarking results for
SimStd and
SimHpc programs executed for input raster sizes 1000 rows/columns and 2000 rows/columns. These are the times
required for the core 2-dimensional filtering portion of the programs
to execute for varying number of iterations. Note that while
these benchmarks were collected, other long-running processes shared
the computer with
SimStd and
SimHpc executables. Therefore, timing measurements may differ in future experiments.
1000 x
1000 Raster (9/07)
2000 x 2000 Raster (04/07)
Iterations FilterStd FilterHpc(4) FilterHpc(8)
FilterStd FilterHpc(4)
|
100 |
120 |
60 |
35 |
240 |
60 |
|
500 |
660 |
300 |
120 |
1320 |
360 |
|
1000 |
1500 |
600 |
300 |
2730 |
720 |
|
5000 |
7500 |
2760 |
1680 |
13620 |
3600 |
Discussion: Because the
load factor of the helios server varied (due to other processes) while
these benchmarks were collected, it is best to infer only general
comparisons between problem size, number of iterations, and program
version. Note that the execution time are roughly proportional to
both the number of iterations ant to the size of the raster. Comparing
the single-processor
SimStd and the four-processor
SimHpc versions, note that execution time scales linearly with the number of
available processors. This trend is most apparent in the 2000 x 2000
raster cases.
Learning
More
Here are several sources of information
regarding multiple-CPU program design. Many more are available via
the Internet:
Parallel
Computing Fundamentals:
Lawrence
Livermore National Laboratory Introduction to Parallel Computing
MPI
Programming:
Open MPI (open source MPI distribution) Organization Home Page
LAM/MPI Organization programming tutorials
Discussion of MPI
programming on ROCKS clusters (.pdf)
UNIX
Shared
Memory:
Introductory Unix Shared
Memory Tutorial (fscked.org)
Shared
Memory (C language API) Tutorial (Linux.com)