NCEAS Scientific Computing: Solutions Center Use Case

Optimize C Raster Processing code to use Multiple CPUs and Shared Memory

The C language software examples shown and discussed in this example demonstrate the use of Message-Passing Interface (MPI) and shared memory in the Linux operating environment to improve the performance of a two-dimensional raster processing algorithm.


Getting Started:

Computer software applications that process large two-dimensional rasters (grids) can consume significant computer resources (CPU cycles and computer memory) as the raster size grows large. Such programs are good candidates for optimization that enable them to exploit (when available) multiple CPUs and (when appropriate) large contiguous blocks of 'shared memory' accessible to each CPU working on solving the algorithm. (Go To Benchmarks)

Typically, the optimization consists of restructuring the source code that implements the algorithm to operate simultaneously on multiple CPUs, each of which is given a copy of the executable image and assigned a separate portion of the data raster to process. Such programs then execute in three phases: A setup phase, where each CPU is 'assigned' its portion of the solution to compute, a calculation phase, where the CPUs are 'released' to perform the calculations, and a closing phase, in which the solution components produced by each CPU are re-assembled into a single, final solution. 

For some raster processing applications, it is not possible to break the raster into sub-grids for each CPU to process independently; rather, all of the CPUs engaged in the solution must perform their assigned portion of the computations on a common copy of the input raster 'shared' with the other CPU processes.

Widely-used, open source computing platforms such as Linux/UNIX (hereby referred to as UNIX) support 'shared memory' use by multiple processes; however,  specialized system-level Applications Programming Interface (API) libraries are usually necessary to enable an algorithm written in a high-level language such as C++ or Java to exploit multiple CPUs. One such API is the industry-standard Message passing Interface (MPI), supported since the mid-1990s on most popular computing platforms. Within the UNIX environment, the MPI functions control each CPU through a separate UNIX process, managed through the MPI interface through the applications source code. 

Sample Programs: FilterStd.cc and FilterHpc.cc

Two C++ programs presented here demonstrate the modification of a straightforward raster processing application to exploit multiple CPUs, all performing computations on a single two-dimensional grid stored in shared memory. Standard elapsed time calculations built into both programs provide execution time benchmarks that document the performance improvement resulting from use of multiple CPUs in parallel.

The FilterStd and FilterHpc programs implement a straightforward 'low-pass filter' in which.each cell of the two-dimensional input raster is replaced by the average of the original cell and its eight immediate neighbors. The resulting 'filtered' matrix is written to a second raster grid.  Such a filter algorithm is well suited to demonstrate HPC programming concepts because it is 1) relatively simple to code using easy-to-follow source code, and 2) quickly begins to consume significant amounts of CPU cycles and memory relative to relatively small increases in input raster grid (and filter) size. The latter property results in high-quality benchmarking data as the execution times of the standard and optimized versions of the code diverge.

FilterStd uses only ANSI standard C++ features to implement the low-pass filter, using an algorithm designed to execute on a single CPU. FilterStd is coded in a straightforward way that is easy to understand, rather than for optimal processing speed.  FilterHpc is an adaptation of FilterStd, that employs the MPI API to exploit multiple CPUs to compute the solution, and and UNIX standard system calls to allocate a block of shared memory and make it available to each of the CPUs' controlling processes.

Development of FilterHpc required 'parallelizing' of the FilterStd algorithm; that is, restructuring it to run simultaneously on multiple CPUs, each of which computes the results grid for its assigned portion of the input grid. The task of restructuring an existing algorithm in this way is a combination of art and engineering; the FilterHpc code uses the MPI interface to execute the core algorithm simultaneously ('in parallel') on four CPUs, each controlled by a different UNIX process. Each process has access to two common blocks of memory containing the input and output raster grids. One of the processes, designated 'process 0',allocates shared memory resources, orchestrates execution of the other processes, and collects the final output products once all processes have completed processing. In this instance, collection is trivial since all of the processes store their results in shared memory, and consists of printing the output raster to the standard output. Note that the current version of FilterHpc uses a fixed number of four CPUs to perform the calculations; future versions (watch this space) will support a variable number of processors, up to the number of CPUs installed on the host machine.

Both versions of the code use the C++ Clock class, developed for this application as Clock.cc. The Clock class provides execution timing and pseudo-random number generation features used in the demonstration.

Note that the core processing. which is to say the raster filtering code, is essentially identical in both versions of the program. We repeat this core processing to extend the execution times and accentuate the differences between versions.

View this example's  source code for each code module in separate pop up windows by clicking on the following links:
Clock.h  / Clock.cc / FilterStd.cc / FilterHpc.cc / FilterStd.mak / FilterHpc.mak
Click here to download a .zip archive containing the FilterStd and FilterHpc source code and MAKE files.

Note: the MPI development environment must be installed on your system in order to run FilterHpc.exe. On Ubuntu Linux systems, two open source version of MPI is available through the Synaptic Package Manager. Consult your UNIX Systems Administrator for other platforms. To build and run FilterHpc on an MPI-enabled system with a minimum of 4 processors, enter the command:

The examples for this tutorial were produced using the LAM/MPI implementation of MPI.

% make -f FilterHpc.mak
% mpirun -np 4 ./FilterHpc.exe


The following table presents benchmarking results for SimStd and SimHpc programs executed for input raster sizes 1000 rows/columns and 2000 rows/columns. These are the times required for the core 2-dimensional filtering portion of the programs to execute for varying number of iterations. Note that while these benchmarks were collected, other long-running processes shared the computer with SimStd and SimHpc executables. Therefore, timing measurements may differ in future experiments.

                      Execution time (clock-seconds)

                                  1000 x 1000 Raster (9/07)             2000 x 2000 Raster (04/07)

 Iterations    FilterStd   FilterHpc(4)  FilterHpc(8)   FilterStd   FilterHpc(4)

       100      120       60      35       240       60
       500      660     300     120     1320     360
     1000    1500     600     300     2730     720
     5000    7500   2760   1680   13620   3600

Discussion: Because the load factor of the helios server varied (due to other processes) while these benchmarks were collected, it is best to infer only general comparisons between problem size, number of iterations, and program version. Note that the execution time are roughly proportional to both the number of iterations ant to the size of the raster. Comparing the single-processor SimStd and the four-processor SimHpc versions, note that execution time scales linearly with the number of available processors. This trend is most apparent in the 2000 x 2000 raster cases.

Learning More

Here are several  sources of information regarding multiple-CPU program design. Many more are available via the Internet:

Parallel Computing Fundamentals:

Lawrence Livermore National Laboratory Introduction to Parallel Computing

MPI Programming:

Open MPI (open source MPI distribution) Organization Home Page

LAM/MPI Organization programming tutorials

Discussion of MPI programming on ROCKS clusters (.pdf)

UNIX Shared Memory:

Introductory Unix Shared Memory Tutorial (fscked.org)

Shared Memory (C language API) Tutorial (Linux.com)

Point of Contact for this Use Case: Rick Reeves, NCEAS Scientific Programmer reeves@nceas.ucsb.edu
This Use Case compiled April, 2007