Evaluation
This section compares the runtime performance of the SAC implementation of the PDE1 benchmark with that of the HPF reference implementation. The machine used in the following experiments is a 12-processor SUN Ultra Enterprise 4000 shared memory multiprocessor running Solaris 7. The SUN Workshop compiler cc v5.0 serves as the back-end compiler for SAC. The HPF implementation is compiled by the ADAPTOR v7.0 HPF compiler developed at GMD (German National Research Center for Information Technology); the SUN Workshop compiler f77 v5.0 transforms the resulting FORTRAN-77 intermediate code into host machine code. Non-sequential execution of HPF programs is based on the shared memory version of PVM 3.4.2 as the low-level communication layer.
The HPF implementation of PDE1 is taken directly from the benchmark suite that comes with ADAPTOR. It employs cubic grids whose sizes are powers of two; the use of timers excludes the startup overhead caused by the initialization of data structures and of the underlying communication layer from the benchmark results. Since SAC lacks comparable timing facilities, all experiments are repeated with two different numbers of iterations, and the differences between the program execution times are taken as benchmark results. This allows SAC and HPF to be compared on a reasonably fair basis.
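A minimal sketch of this differencing scheme is given below, in C rather than SAC; the kernel is a placeholder, not the actual PDE1 code. Running the benchmark with two iteration counts and attributing the difference in wall-clock time to the extra iterations cancels out the constant startup and initialization overhead.

```c
#include <stdio.h>
#include <sys/time.h>

static volatile double sink;

static void relax_n_times(int iterations)   /* placeholder kernel */
{
    double acc = 0.0;
    for (int it = 0; it < iterations; it++)
        for (int i = 0; i < 1000000; i++)
            acc += (double)i * 1e-9;
    sink = acc;
}

static double wall_seconds(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + 1e-6 * (double)tv.tv_usec;
}

int main(void)
{
    int n1 = 100, n2 = 200;                  /* two iteration counts */
    double start, t1, t2;

    start = wall_seconds(); relax_n_times(n1); t1 = wall_seconds() - start;
    start = wall_seconds(); relax_n_times(n2); t2 = wall_seconds() - start;

    /* constant startup overhead cancels in the difference */
    printf("time per iteration: %g s\n", (t2 - t1) / (double)(n2 - n1));
    return 0;
}
```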
The next figure shows single processor benchmark results for two problem sizes: $64^3$ and $256^3$. Average times needed to complete a single red/black relaxation step are given for the HPF reference implementation as well as for the SAC implementation discussed in the previous section. All three alternative specifications of the core relaxation operation are investigated. Regardless of which one is taken, the SAC implementation outperforms its HPF counterpart by factors between 4 and 5 for both problem sizes.
Despite the considerably varying levels of abstraction that characterize the different implementations of the relaxation kernel, their runtime performance is nearly identical. Surprisingly, the first, least abstract specification is even outperformed by the others by up to 10%. Although the SAC compiler, by means of thorough optimization, manages to compile all three specifications into almost identical intermediate representations, a rather subtle difference is responsible for this observation. The element-wise specification leads to compiled code that extracts all 6 stencil elements from the argument array before performing any arithmetic operations on them. In contrast, the two more abstract specifications result in compiled code that interleaves memory load instructions with arithmetic computations and, thus, to some extent hides memory latencies by overlapping them with productive computations.
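The following fragment illustrates the two instruction orderings in C (not SAC); the 6-point stencil and the plain averaging are assumed stand-ins for the actual relaxation formula, and `i` denotes the flattened index of an interior element of an $n \times n \times n$ grid.

```c
/* Variant 1: all six neighbour loads are issued first, arithmetic
 * follows -- corresponding to the element-wise specification. */
double relax_loads_first(const double *u, long i, long n)
{
    double xl = u[i - 1],     xr = u[i + 1];
    double yl = u[i - n],     yr = u[i + n];
    double zl = u[i - n * n], zr = u[i + n * n];
    return (xl + xr + yl + yr + zl + zr) / 6.0;
}

/* Variant 2: loads alternate with additions -- corresponding to the
 * more abstract specifications; the pending additions overlap with
 * the latency of the outstanding loads. */
double relax_interleaved(const double *u, long i, long n)
{
    double acc = u[i - 1];
    acc += u[i + 1];
    acc += u[i - n];
    acc += u[i + n];
    acc += u[i - n * n];
    acc += u[i + n * n];
    return acc / 6.0;
}
```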
The next plots show the speedups achieved by compiling the SAC implementation of PDE1 for implicit multithreaded execution. Due to the very similar sequential performance characteristics of the different SAC specifications, all investigations of multithreaded execution are limited to the final, most abstract variant. The compiled SAC code scales well with increasing numbers of processors for both problem sizes. Speedups of more than 8 are achieved with 10 processors. Engaging more than 10 of the 12 processors simultaneously is hardly profitable because the machine used for the experiments is not operated in batch mode and, hence, other system and user processes are constantly present.
Whereas performance improvements scale smoothly for the larger grid, certain processor counts, e.g. 2, 4, and 8, turn out to be better suited for the smaller grid than others. This can be attributed to the scheduling scheme, which achieves a better balanced workload distribution whenever the number of processors exactly divides the grid size in the outermost dimension; the effect becomes increasingly noticeable for grids that are small relative to the processor count.
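A hypothetical block-scheduling sketch makes the imbalance concrete (the actual SAC scheduler is not shown here): distributing the $n$ planes of the outermost dimension over $p$ threads in contiguous blocks is perfectly balanced for $n = 64$ and $p = 2, 4, 8$, whereas e.g. $p = 6$ yields blocks of 11 and 10 planes, so some threads idle while the larger blocks finish.

```c
#include <stdio.h>

/* contiguous block [lo,hi) of n planes assigned to thread t of p */
static void block_bounds(int n, int p, int t, int *lo, int *hi)
{
    int base = n / p, rest = n % p;   /* first `rest` threads get one extra plane */
    *lo = t * base + (t < rest ? t : rest);
    *hi = *lo + base + (t < rest ? 1 : 0);
}

int main(void)
{
    int lo, hi;
    for (int t = 0; t < 6; t++) {     /* n = 64 planes, p = 6 threads */
        block_bounds(64, 6, t, &lo, &hi);
        printf("thread %d: planes [%d,%d)\n", t, lo, hi);
    }
    return 0;
}
```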
Experience with a similar stencil operation shows that the investigated problem sizes are particularly sensitive to cache utilization. To quantify the effect of cache optimizations on the runtime performance of the SAC implementation, the experiment is repeated with cache optimizations disabled. In fact, only array padding and array placement need to be disabled, since the tile size inference heuristic decides against tiling for both problem sizes under consideration. The next figure shows the outcome of this experiment. As expected, cache optimizations, in particular array padding, turn out to be crucial for the runtime performance of compiled SAC code. Nevertheless, even with all cache optimizations disabled, SAC still outperforms HPF.
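For illustration, a minimal sketch of array padding follows; it describes the general technique only, as the SAC compiler infers pad sizes automatically. The logical grid remains $N \times N \times N$, but the allocated inner extents are enlarged by a few dummy elements, so that the strides between neighbouring indices are no longer powers of two.

```c
#define N    256
#define PAD  2                          /* illustrative pad size */

static double u[N][N + PAD][N + PAD];   /* padded allocation */

/* Element (i,j,k) is still accessed as u[i][j][k]; only the address
 * distances between stencil neighbours change, which spreads the
 * accesses over more cache sets at the price of some wasted memory. */
```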
The next figure shows the multiprocessor performance achieved by both SAC and HPF, given as speedups relative to the HPF single processor performance. Once again, it demonstrates the significance of cache optimizations for the runtime performance of compiled SAC code: without them, SAC is outperformed by HPF on the larger grid size $256^3$ when more than 4 processors are used.
In fact, super-linear speedups are achieved by the HPF code. This can be attributed to the fact that, being based on message passing, HPF employs a different memory data layout for each number of processors used. These layout changes implicitly eliminate cache conflicts, which to some extent are responsible for the relatively poor sequential performance of HPF. In contrast, the multithreaded execution model of SAC reuses the memory data layout of sequential execution without modification and, thus, does not benefit from similar effects.
For both problem sizes investigated, it turns out that cache conflicts actually dominate runtime performance. Their elimination, either by dedicated optimization techniques as in SAC or, more or less accidentally, through data layout transformations as in HPF, is the crucial issue. To mitigate such effects, the experiment is repeated once again with slightly different problem sizes, which are less likely to incur systematic cache conflicts. The next figure shows the respective single processor benchmark results for grid sizes of $68^3$ and $260^3$ elements. It turns out that, in the absence of significant numbers of cache conflicts, SAC still outperforms HPF by factors of about 3. Comparing these results with those of the previous figure shows that, without cache optimizations, SAC suffers much more from cache conflicts at the original problem sizes than HPF does. This may indicate that either the HPF compiler or the FORTRAN-77 back end compiler applies some cache optimizations as well.
[Figure: Multiprocessor runtime performance of SAC and HPF]
[Figure: Single processor performance for alternative grid sizes]
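A back-of-the-envelope calculation makes the difference between the two families of grid sizes plausible, assuming double precision elements and a low-associativity, power-of-two-sized cache (neither assumption is stated explicitly above). The address distance between the two outermost stencil neighbours of an element is twice the plane stride:

$$2 \cdot 256^2 \cdot 8\,\mathrm{B} = 2^{20}\,\mathrm{B} = 1\,\mathrm{MB}, \qquad 2 \cdot 260^2 \cdot 8\,\mathrm{B} = 1{,}081{,}600\,\mathrm{B}.$$

For the $256^3$ grid this distance coincides exactly with the machine's 1MB L2 cache size, so the two accesses compete for the same cache sets in every single iteration; for $260^3$ no such alignment exists, and the accesses spread across the cache.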
Last but not least, the plot below shows the multiprocessor runtime performance achieved for the two alternative problem sizes. Speedups are given both for SAC relative to its own single processor performance and for SAC and HPF relative to the HPF single processor performance. Whereas for the grid size $260^3$ SAC achieves speedups similar to those for $256^3$, super-linear speedups are realized for the grid size $68^3$ when 4 or more processors are used. This behaviour may be attributed to the fact that four processor-specific L2 caches of 1MB each are sufficient to hold the entire working set of this grid size. With a careful memory data layout, the number of main memory accesses may thus be reduced considerably compared with execution on a single processor. However, for the time being, it is not quite clear why no similar speedups are achieved for grids of size $64^3$, despite careful cache optimization.
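A rough capacity estimate supports this explanation, assuming a single double precision grid (the benchmark's exact working set may be somewhat larger):

$$68^3 \cdot 8\,\mathrm{B} \approx 2.5\,\mathrm{MB} \;<\; 4 \cdot 1\,\mathrm{MB}, \qquad 260^3 \cdot 8\,\mathrm{B} \approx 141\,\mathrm{MB}.$$

Note that $64^3 \cdot 8\,\mathrm{B} \approx 2.1\,\mathrm{MB}$ fits within the aggregate cache capacity of four processors as well, which makes the absence of super-linear speedups for $64^3$ all the more surprising.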
Without the accidental reduction of cache conflicts, the speedups achieved by HPF are considerably lower for the alternative grid sizes than for the original ones.