ConvStencil: Transform Stencil Computation to Matrix Multiplication on Tensor Cores (PPoPP 2024 - Main Conference)

Who

Yuetao Chen, Kun Li, Yuhao Wang, Donglin Bai, Lei Wang, Lingxiao Ma, Liang Yuan, Yunquan Zhang, Ting Cao, Mao Yang

Track

PPoPP 2024 Main Conference

Time Zone

The program is currently displayed in (GMT) London.

When

Tue 5 Mar 2024, 16:10 - 16:30, at Moorfoot. Session: Optimizing for Memory. Chair(s): Yan Gu

Abstract

Tensor Core Units (TCUs) are increasingly integrated into modern high-performance processors to accelerate matrix multiplication. However, because TCUs are so narrowly specialized for that operation, their potential for improving other critical scientific operations, such as stencil computations, remains untapped. This paper presents ConvStencil, a novel stencil computing system designed to efficiently transform stencil computation into matrix multiplication on Tensor Cores. We first develop a performance model for ConvStencil to guide algorithm design and optimization on TCUs. Based on this model, we propose three techniques: (1) Memory-efficient Layout Transformation using the stencil2row method; (2) Computation-dense Compute Adaptation with Dual Tessellation and kernel fusion; and (3) Performance-boosting Conflict Removal using a Lookup Table and Dirty Bits Padding. ConvStencil outperforms other stencil optimization frameworks, achieving significant speedups over solutions such as AMOS, cuDNN, Brick, DRStencil, and TCStencil. By moving stencil computation onto Tensor Cores, ConvStencil promises to improve the performance of various scientific and engineering applications.
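
The core idea of recasting a stencil as a matrix product can be illustrated with a small sketch. The code below is not the paper's stencil2row method, Dual Tessellation, or any Tensor Core code; it is a generic im2row-style layout for a 1D stencil, written in plain Python, and all function names (`stencil_direct`, `to_rows`, `matmul`, `stencil_as_matmul`) are hypothetical. It only shows that gathering each output's neighborhood into a matrix row turns the stencil into a multiplication by the weight vector.

```python
def stencil_direct(x, w):
    """Reference: apply a 1D stencil with weights w to every full window of x."""
    k = len(w)
    return [sum(w[j] * x[i + j] for j in range(k))
            for i in range(len(x) - k + 1)]


def to_rows(x, k):
    """im2row-style layout (assumed for illustration): row i holds the k inputs
    that output i reads. Each input is copied up to k times, which is the
    memory overhead a more compact layout would aim to reduce."""
    return [x[i:i + k] for i in range(len(x) - k + 1)]


def matmul(a, b):
    """Plain dense product of a (m x k) and b (k x n), both as lists of rows."""
    return [[sum(a[i][p] * b[p][j] for p in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def stencil_as_matmul(x, w):
    """Same stencil, expressed as (rows of neighborhoods) x (weight column)."""
    rows = to_rows(x, len(w))          # shape: (n - k + 1) x k
    weight_column = [[wj] for wj in w]  # shape: k x 1
    return [r[0] for r in matmul(rows, weight_column)]


x = [1.0, 2.0, 4.0, 8.0, 16.0]
w = [0.25, 0.5, 0.25]  # a simple 3-point smoothing stencil

direct = stencil_direct(x, w)
via_matmul = stencil_as_matmul(x, w)
print(direct)      # [2.25, 4.5, 9.0]
print(via_matmul)  # [2.25, 4.5, 9.0]
```

On real hardware the product would be issued to a matrix unit in fixed-size tiles rather than computed with Python loops, and the weight column here uses only one column of such a tile; the abstract's layout transformation and compute adaptation techniques target those memory and utilization costs.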

Link to Publication

https://dl.acm.org/doi/pdf/10.1145/3627535.3638476

DOI

https://doi.org/10.1145/3627535.3638476

Yuetao Chen, Microsoft Research
Kun Li, Microsoft Research
Yuhao Wang, Microsoft Research
Donglin Bai, Microsoft Research
Lei Wang, Microsoft Research
Lingxiao Ma, Microsoft Research
Liang Yuan, Chinese Academy of Sciences
Yunquan Zhang
Ting Cao, Microsoft Research, China
Mao Yang, Microsoft Research

Session Program

Tue 5 Mar
Displayed time zone: London

16:10 - 17:10	Optimizing for Memory (Main Conference) at Moorfoot. Chair(s): Yan Gu, University of California, Riverside

16:10 20m Talk		ConvStencil: Transform Stencil Computation to Matrix Multiplication on Tensor Cores (Best Paper Award). Main Conference. Yuetao Chen (Microsoft Research), Kun Li (Microsoft Research), Yuhao Wang (Microsoft Research), Donglin Bai (Microsoft Research), Lei Wang (Microsoft Research), Lingxiao Ma (Microsoft Research), Liang Yuan (Chinese Academy of Sciences), Yunquan Zhang, Ting Cao (Microsoft Research), Mao Yang (Microsoft Research)
16:30 20m Talk		CPMA: An Efficient Batch-Parallel Compressed Set Without Pointers. Main Conference. Brian Wheatman (Johns Hopkins University), Randal Burns (Johns Hopkins), Aydin Buluc (University of California at Berkeley & Lawrence Berkeley National Lab), Helen Xu (Lawrence Berkeley National Laboratory)
16:50 20m Talk		Gallatin: A General-Purpose GPU Memory Manager. Main Conference. Hunter James McCoy (University of Utah), Prashant Pandey (University of Utah)