Title: Optimization and Performance Modeling of Stencil Computations on Modern Microprocessors
Abstract: OPTIMIZATION AND PERFORMANCE MODELING OF STENCIL COMPUTATIONS ON MODERN MICROPROCESSORS ‡ KAUSHIK DATTA † SHOAIB KAMIL ∗† SAMUEL WILLIAMS ∗† LEONID OLIKER ∗ JOHN SHALF ∗ KATHERINE YELICK ∗† Abstract. Stencil-based kernels constitute the core of many important scientific applications on block-structured grids. Unfortunately, these codes achieve a low fraction of peak performance, due primarily to the disparity between processor and main memory speeds. In this paper, we explore the impact of trends in memory subsystems on a variety of stencil optimization techniques and develop performance models to analytically guide our optimizations. Our work targets cache reuse methodologies across single and multiple stencil sweeps, examining cache-aware algorithms as well as cache-oblivious techniques on the Intel Itanium2, AMD Opteron, and IBM Power5. Additionally, we consider stencil computations on the heterogeneous multi-core design of the Cell processor, a machine with an explicitly-managed memory hierarchy. Overall our work represents one of the most extensive analyses of stencil optimizations and performance modeling to date. Results demonstrate that recent trends in memory system organization have reduced the efficacy of traditional cache- blocking optimizations. We also show that a cache-aware implementation is significantly faster than a cache-oblivious approach, while the explicitly managed memory on Cell enables the highest overall efficiency: Cell attains 88% of algorithmic peak while the best competing cache-based processor only achieves 54% of algorithmic peak performance. Key words. Stencil computations, cache blocking, time-skewing, cache-oblivious algorithms, performance modeling, performance evaluation, Intel Itanium2, AMD Opteron, IBM Power5, STI Cell. AMS subject classifications. 65Y10, 65Yxx, 35R99, 68M20 1. Introduction. Partial differential equation (PDE) solvers constitute a large fraction of scientific applications in such diverse areas as heat diffusion, electromag- netics, and fluid dynamics. These applications are often implemented using itera- tive finite-difference techniques, which sweep over a spatial grid, performing nearest neighbor computations called stencils. In a stencil operation, each point in a multidi- mensional grid is updated with weighted contributions from a subset of its neighbors in both time and space — thereby representing the coefficients of the PDE for that data element. These operations are then used to build solvers that range from simple Jacobi iterations to complex multigrid and adaptive mesh refinement methods [3]. Stencil calculations perform global sweeps through data structures that are typ- ically much larger than the capacity of the available data caches. As a result, these computations generally achieve a low fraction of theoretical peak performance, since data from main memory cannot be transferred fast enough to avoid stalling the com- putational units on modern microprocessors. Reorganizing these computations to take full advantage of memory hierarchies has been the subject of much investigation over the years. These have principally focused on tiling optimizations [11, 16, 17] that attempt to exploit locality by performing operations on cache-sized blocks of data before moving on to the next block. Whereas many tiling optimizations use domain decomposition to improve spatial locality, more recent studies have focused attention on exploiting locality in the time dimension [6, 13, 19, 24]. In this work, we re-examine stencil computations on current microprocessors in light of the growing performance gap between processors and memory, as well as ‡ Preliminary ∗ CRD/NERSC, versions of this article appeared in [8, 9]. Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, CA, 94720. † Computer Science Department, University of California, Berkeley, CA, 94720.
Publication Year: 2009
Publication Date: 2009-07-24
Language: en
Type: article
Access and Citation
AI Researcher Chatbot
Get quick answers to your questions about the article from our AI researcher chatbot