Chunking parallel loops in the presence of synchronization
Jun Shirako, Jisheng M. Zhao, Vivek N. Sarkar
Published by ACM Press
2009
Abstract

Modern languages for shared-memory parallelism are moving from a bulk-synchronous Single Program Multiple Data (SPMD) execution model to lightweight Task Parallel execution models for improved productivity. This shift is intended to encourage programmers to express the ideal parallelism in an application at a fine granularity that is natural for the underlying domain, while delegating to the compiler and runtime system the job of extracting coarser-grained useful parallelism for a given target system. A simple and important example of this separation of concerns between ideal and useful parallelism can be found in chunking of parallel loops, where the programmer expresses ideal parallelism by declaring all iterations of a loop to be parallel and the implementation exploits useful parallelism by executing iterations of the loop in sequential chunks.
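
The chunking idea described above can be illustrated with a short, generic sketch. This is a minimal Java illustration under assumed names and parameters (compute, idealParallel, chunkedParallel, the pool size, and the chunk count are placeholders for the example), not the task-parallel language implementation the paper targets: the "ideal" version spawns one task per iteration, while the chunked version strip-mines the loop so that each task executes a sequential chunk of iterations, amortizing scheduling overhead.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    public class ChunkedLoop {
        // Stand-in for the per-iteration work of the parallel loop.
        static double compute(int i) { return Math.sqrt(i); }

        // Ideal parallelism: one task per iteration (fine-grained, high scheduling overhead).
        static void idealParallel(ExecutorService pool, double[] a) throws Exception {
            List<Future<?>> tasks = new ArrayList<>();
            for (int i = 0; i < a.length; i++) {
                final int idx = i;
                tasks.add(pool.submit(() -> { a[idx] = compute(idx); }));
            }
            for (Future<?> t : tasks) t.get();
        }

        // Useful parallelism: the same loop strip-mined into numChunks tasks,
        // each of which runs its iterations sequentially.
        static void chunkedParallel(ExecutorService pool, double[] a, int numChunks) throws Exception {
            int n = a.length;
            int chunkSize = (n + numChunks - 1) / numChunks;   // ceiling division
            List<Future<?>> tasks = new ArrayList<>();
            for (int c = 0; c < numChunks; c++) {
                final int lo = c * chunkSize;
                final int hi = Math.min(lo + chunkSize, n);
                tasks.add(pool.submit(() -> {
                    for (int i = lo; i < hi; i++) a[i] = compute(i);   // sequential chunk
                }));
            }
            for (Future<?> t : tasks) t.get();
        }

        public static void main(String[] args) throws Exception {
            ExecutorService pool = Executors.newFixedThreadPool(4);
            double[] a = new double[1 << 20];
            chunkedParallel(pool, a, 4);
            pool.shutdown();
        }
    }

When the loop body is synchronization-free, as here, the two versions compute the same result; the chunked version simply trades task creation and scheduling cost for sequential execution within each chunk.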

Though chunking of parallel loops has been used as a standard transformation for several years, it poses some interesting challenges when the parallel loop may directly or indirectly (via procedure calls) perform synchronization operations such as barrier, signal or wait statements. In such cases, a straightforward transformation that attempts to execute a chunk of loop iterations in sequence in a single thread may violate the semantics of the original parallel program. In this paper, we address the problem of chunking parallel loops that may contain synchronization operations. We present a transformation framework that uses a combination of transformations from past work (e.g., loop strip-mining, interchange, distribution, unswitching) to obtain an equivalent set of parallel loops that chunk together statements from multiple iterations while preserving the semantics of the original parallel program. These transformations result in reduced synchronization and scheduling overheads, thereby improving performance and scalability. Our experimental results for 11 benchmark programs on an UltraSPARC T2 multicore processor showed a geometric mean speedup of 0.52x for the unchunked case and 9.59x for automatic chunking using the techniques described in this paper. This wide gap underscores the importance of using these techniques in future compiler and runtime systems for programming models with lightweight parallelism.
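
To make the synchronization problem concrete, consider a parallel loop whose body is "phaseA(i); barrier; phaseB(i)", with every iteration registered on the barrier. Naively running a chunk of those iterations back-to-back on one thread can deadlock: the first iteration's barrier wait can never be satisfied, because later iterations of the same chunk cannot start until the first one finishes. The sketch below is a hedged illustration in plain Java threads with CyclicBarrier, not the paper's actual framework or target language; phaseA, phaseB, n, and numChunks are assumed placeholders. It shows the shape of the loop after strip-mining, distribution around the barrier, and interchange: each chunk runs all of its pre-barrier work, joins the barrier once, then runs all of its post-barrier work, preserving the original all-phaseA-before-any-phaseB ordering.

    import java.util.concurrent.CyclicBarrier;

    public class ChunkedBarrierLoop {
        static void phaseA(int i) { /* pre-barrier work of iteration i (placeholder) */ }
        static void phaseB(int i) { /* post-barrier work of iteration i (placeholder) */ }

        public static void main(String[] args) throws InterruptedException {
            final int n = 16, numChunks = 4;
            final int chunkSize = n / numChunks;             // assume n divisible by numChunks
            // After the transformation, each chunk (not each iteration) is a barrier party.
            final CyclicBarrier barrier = new CyclicBarrier(numChunks);

            Thread[] workers = new Thread[numChunks];
            for (int c = 0; c < numChunks; c++) {
                final int lo = c * chunkSize, hi = lo + chunkSize;
                workers[c] = new Thread(() -> {
                    try {
                        for (int i = lo; i < hi; i++) phaseA(i);  // all pre-barrier statements of the chunk
                        barrier.await();                          // one barrier arrival per chunk
                        for (int i = lo; i < hi; i++) phaseB(i);  // all post-barrier statements of the chunk
                    } catch (Exception e) {
                        throw new RuntimeException(e);
                    }
                });
                workers[c].start();
            }
            for (Thread t : workers) t.join();
        }
    }

This shape is only valid when the loop body can legally be distributed around the synchronization; deciding when such a combination of transformations preserves the original program's semantics is exactly the role of the transformation framework described in the paper.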

About the journal
Journal: Proceedings of the 23rd International Conference on Supercomputing
Publisher: ACM Press
Open Access: No