High Performance Computing (HPC) demonstration: This example presents an algorithm which computes, for each cell in a raster (grid), a new cell that is the average of the cell and its eight immediate neighbors. The average for each cell is stored in a new and separate raster. The algorithm partitions the input and output rasters into sub-grids, and allocates a different physical processor core on NCEAS 32-processor eos server to each sub-grid.
The program, presented in both the Free Pascal and C languages, uses the open-source MPICH2 implementation of the Message Passing Interface (MPI) to manage interaction between multiple processors, and the Shared Memory component of Unix Inter-process Communications (IPC) to connect each processor with the rasters.
Click here to download a .zip archive containing the source code and MAKE files for this example.
NOTE: The example programs are configured to build execute on an Ubuntu/Linux multiprocessor computer on which the Free Pascal compiler and MPICH2 Message Passing Interface (MPI) library have been installed.
Applying HPC design to an existing algorithms consists of restructuring the algorithm to operate simultaneously on multiple CPUs each of which is given a copy of the executable image and assigned a separate portion of the program's 'data space', stored in memory accessible to all of the processors, to process. Such programs typically contain three phases: A setup phase, where each CPU is 'assigned' its portion of the solution to compute, a calculation phase, where the CPUs are 'released' to perform the calculations, and a closing phase, in which the solution components produced by each CPU are re-assembled (elsewhere in memory) into the final answer.
Open source computing platforms such as Linux/UNIX (hereby referred to as UNIX) support 'shared memory' use by multiple processes. In addition, Applications Programming Interfaces (API) provide the specialized features required to equip an algorithm written in high-level languages such as C++ or Pascal to employ multiple CPUs.
One such API is the industry-standard Message passing Interface (MPI), available on most popular computing platforms. Within the UNIX run-time environment, the MPI functions for each CPU execute as separate UNIX process.
Sample Application: 'Mean filtering' of a raster grid
This case study implements a raster 'mean filtering' algorithm which replaces each cell of a two-dimensional raster with the average of the original cell and its eight immediate neighbors. The output, 'filtered' cells are written to a second, results grid. This algorithm is well-suited to demonstrate HPC programming concepts because it is 1) relatively simple to understand and translate to computer code, and 2) computer resource consumption grows rapidly as the raster size grows, making it possible to build a size vs. time benchmarking matrix.
Sample Programs: Free Pascal and C Language versions
To demonstrate the portability of these HPC programming methods, we provide solution programs in the Free Pascal and ANSI C programming languages. Two Free Pascal programs demonstrate Shared Memory and MPI programming fundamentals. The program SimpleSharedMemoryDemo.pas sets up a simple connection to an area of shared memory, and initializes the area using array notation. The program MeanFilterMPISharedMemory.pas, creates and initializes a two-dimensional grid in shared memory, and then uses MPI routines to process the grid using multiple processors.
The C language solution, employs three C programs -FilterStd.c, FilterHPC.c, and Clock.c/Clock.h - to portray the mean filtering application before and after refactoring into an HPC application. The resulting pair of executables are equipped with execution time measurement features that allow us to a benchmarking matrix, comparing execution times standard and HPC implementations with respect to raster size.
FilterStd uses only ANSI standard C++ features to implement the low-pass filter, using an algorithm designed to execute on a single CPU. FilterStd is coded in a straightforward way that is easy to understand, rather than for optimal processing speed. FilterHpc is an adaptation of FilterStd, that employs the MPI API to exploit multiple CPUs to compute the solution, and and UNIX standard system calls to allocate a block of shared memory and make it available to each of the CPUs' controlling processes.
Development of FilterHpc required 'parallelizing' of the FilterStd algorithm; that is, restructuring it to run simultaneously on multiple CPUs, each of which computes the results grid for its assigned portion of the input grid. The task of restructuring an existing algorithm in this way is a combination of art and engineering; the FilterHpc code uses the MPI interface to execute the core algorithm simultaneously ('in parallel') on four CPUs, each controlled by a different UNIX process. Each process has access to two common blocks of memory containing the input and output raster grids. One of the processes, designated 'process 0',allocates shared memory resources, orchestrates execution of the other processes, and collects the final output products once all processes have completed processing. In this instance, collection is trivial since all of the processes store their results in shared memory, and consists of printing the output raster to the standard output. Note that the current version of FilterHpc uses a fixed number of four CPUs to perform the calculations; future versions (watch this space) will support a variable number of processors, up to the number of CPUs installed on the host machine.
Both versions of the code use the C++ Clock class, developed for this application as Clock.cc. The Clock class provides execution timing and pseudo-random number generation features used in the demonstration.
Note that the core processing. which is to say the raster filtering code, is essentially identical in both versions of the program. We repeat this core processing to extend the execution times and accentuate the differences between versions.
The following table presents benchmarking results for FilterStd and FilterHpc programs executed for input raster sizes 1000 rows/columns and 2000 rows/columns. These are the times required for the core 2-dimensional filtering portion of the programs to execute for varying number of iterations. Note that while these benchmarks were collected, other long-running processes shared the computer with FilterStd and FilterHpc executables. Therefore, timing measurements may differ in future experiments.
Execution time (clock-seconds) C++ Version / helios server
1000 x 1000 Raster (9/07) 2000 x 2000 Raster (04/07)
Iterations FilterStd FilterHpc(4) FilterHpc(8) FilterStd FilterHpc(4)
Discussion: Because the load factor of the helios server varied (due to other processes) while these benchmarks were collected, it is best to infer only general comparisons between problem size, number of iterations, and program version. Note that the execution time are roughly proportional to both the number of iterations ant to the size of the raster. Comparing the single-processor SimStd and the four-processor SimHpc versions, note that execution time scales linearly with the number of available processors. This trend is most apparent in the 2000 x 2000 raster cases.
Here are several sources of information regarding multiple-CPU program design. Many more are available via the Internet:
Parallel Computing Fundamentals:
Lawrence Livermore National Laboratory Introduction to Parallel Computing
Open MPI (open source MPI distribution) Organization Home Page
UNIX Shared Memory:
Introductory Unix Shared Memory Tutorial (fscked.org)
Shared Memory (C language API) Tutorial (Linux.com)