A Row Decomposition-based Approach for Sparse Matrix Multiplication on GPUs
Sparse-Matrix Dense-Matrix Multiplication (SpMM) and Sampled Dense-Dense Matrix Multiplication (SDDMM) are important sparse kernels in various computation domains. The uneven distribution of non-zeros in the sparse matrix and the tight data dependence between sparse and dense matrices make it challenging to run sparse matrix multiplication efficiently on GPUs. By analyzing these problems, we propose a row decomposition (RoDe)-based approach to optimize the two kernels on GPUs, using the standard Compressed Sparse Row (CSR) format. Specifically, RoDe divides the sparse matrix rows into regular parts and residual parts so that the computation of each can be fully optimized separately. We also devise the corresponding load balancing and fine-grained pipelining techniques. Profiling results show that RoDe achieves more efficient memory access and significantly reduces warp stall cycles. On the SuiteSparse dataset, compared to the state-of-the-art (SOTA) alternatives, RoDe achieves a speedup of up to 7.86x with a geometric mean of 1.45x for SpMM, and a speedup of up to 8.99x with a geometric mean of 1.49x for SDDMM. RoDe also outperforms its counterpart on the deep learning dataset. Furthermore, its preprocessing overhead is significantly smaller, averaging only 16% of that of the SOTA.
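For readers unfamiliar with the two kernels, the following is a minimal CPU reference sketch in plain Python of SpMM and SDDMM over a CSR matrix, plus one possible way to split each row into a regular part and a residual part. It is not the RoDe implementation. The function names, the `block` parameter, and the splitting rule (regular part = largest multiple of `block` non-zeros, residual part = the remainder) are assumptions made for illustration; the abstract does not specify how RoDe chooses the split.

```python
# Illustrative CPU sketch only; names and the split rule are hypothetical.
# CSR layout: row_ptr has n_rows + 1 entries; col_idx and vals hold the
# column index and value of each stored non-zero, row by row.

def spmm_csr(row_ptr, col_idx, vals, B):
    """SpMM: C = A @ B, with sparse A in CSR and dense B as a list of rows."""
    n_rows = len(row_ptr) - 1
    n_cols_b = len(B[0])
    C = [[0.0] * n_cols_b for _ in range(n_rows)]
    for i in range(n_rows):
        for p in range(row_ptr[i], row_ptr[i + 1]):
            a, k = vals[p], col_idx[p]
            for j in range(n_cols_b):
                C[i][j] += a * B[k][j]
    return C


def sddmm_csr(row_ptr, col_idx, vals, X, Y):
    """SDDMM: for each stored non-zero (i, j), out = vals * dot(X[i], Y[j]).

    X and Y are dense, stored as lists of rows, so Y[j] plays the role of
    column j of the right-hand operand.
    """
    out = [0.0] * len(vals)
    for i in range(len(row_ptr) - 1):
        for p in range(row_ptr[i], row_ptr[i + 1]):
            j = col_idx[p]
            out[p] = vals[p] * sum(x * y for x, y in zip(X[i], Y[j]))
    return out


def split_rows(row_ptr, block=4):
    """Assumed split: per row, a 'regular' span whose length is a multiple
    of `block`, and a 'residual' span with the leftover non-zeros.

    Returns two lists of (row, start, end) index ranges into col_idx/vals.
    """
    regular, residual = [], []
    for i in range(len(row_ptr) - 1):
        start, end = row_ptr[i], row_ptr[i + 1]
        cut = start + ((end - start) // block) * block
        if cut > start:
            regular.append((i, start, cut))
        if end > cut:
            residual.append((i, cut, end))
    return regular, residual


# Small example: A = [[1, 0, 2], [0, 3, 0]] in CSR form.
row_ptr, col_idx, vals = [0, 2, 3], [0, 2, 1], [1.0, 2.0, 3.0]
C = spmm_csr(row_ptr, col_idx, vals, [[1, 2], [3, 4], [5, 6]])
S = sddmm_csr(row_ptr, col_idx, vals, [[1, 2], [3, 4]], [[1, 0], [0, 1], [1, 1]])
regular, residual = split_rows([0, 5, 5, 7], block=4)
```

In this sketch the two lists returned by `split_rows` could be handed to separate kernels, one tuned for uniform fixed-size work and one for short leftovers, which is the general idea behind treating regular and residual parts separately.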
Wed 6 Mar | Displayed time zone: London
10:00 - 11:00 | Linear Algebra | Main Conference at Moorfoot | Chair(s): I-Ting Angelina Lee (Washington University in St. Louis, USA)
10:00 | 20m | Talk | A Row Decomposition-based Approach for Sparse Matrix Multiplication on GPUs | Main Conference | Pang Meng (Department of Computer Science and Technology, Tsinghua University), Xiang Fei (Department of Computer Science and Technology, Tsinghua University), Peng Qu (Department of Computer Science and Technology, Tsinghua University), Youhui Zhang (Department of Computer Science and Technology, Tsinghua University), Zhaolin Li (Department of Computer Science and Technology, Tsinghua University)
10:20 | 20m | Talk | Fast Kronecker Matrix-Matrix Multiplications on GPUs | Main Conference
10:40 | 20m | Talk | Arrow Matrix Decomposition: A Novel Approach for Communication-Efficient Sparse Matrix Multiplication | Main Conference | Lukas Gianinazzi (ETH Zurich), Alexandros Nikolaos Ziogas (ETH Zurich), Piotr Luczynski (ETH Zurich), Langwen Huang (ETH Zurich), Saleh Ashkboosh (ETH Zurich), Florian Scheidl (ETH Zurich), Armon Carigiet (ETH Zurich), Chio Ge (ETH Zurich), Nabil Abubaker (ETH Zurich), Maciej Besta (ETH Zurich), Tal Ben-Nun (Lawrence Livermore National Laboratory), Torsten Hoefler (ETH Zurich)