Liger: Interleaving Intra- and Inter-Operator Parallelism for Distributed Large Model Inference
Distributed large model inference still faces a dilemma in balancing cost and effect. Online scenarios demand intra-operator parallelism to achieve low job completion time (JCT), but its intensive communication makes it costly. Conversely, inter-operator parallelism can achieve high job processing capacity (JPC) with far less communication, but it fails to reduce the execution time.
In this paper, we present Liger, a distributed large model inference runtime system capable of achieving low JCT at high JPC on multi-GPU architectures. The key idea lies in a novel interleaved parallelism, which interleaves computation and communication across requests. Liger enables this parallelism by carefully scheduling computation and communication kernels across requests onto multiple streams of multiple GPUs. It achieves precise and efficient control of kernel execution order by combining CPU-GPU synchronization with inter-stream synchronization. To prevent scheduling failures caused by resource contention, Liger introduces a contention factor strategy to anticipate the penalty of contention. It also enables a higher degree of overlap by decomposing lengthy kernels into smaller, more manageable units at runtime.
Extensive evaluations show that Liger, in most cases, outperforms existing parallelism approaches across models and devices, presenting the best JCT and JPC results. In a 4-device case, Liger reduces the average JCT by 36.0% while maintaining the same JPC compared to the inter-operator approach. Meanwhile, it improves the JPC by 1.34× with improved average JCT compared to the intra-operator approach.
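As a rough illustration of the interleaving idea described in the abstract, the toy model below compares running two requests back to back against overlapping one request's communication with another request's computation. It is only a sketch under stated assumptions, not Liger's scheduler: the names `Kernel`, `make_request`, `serial_makespan`, and `interleaved_makespan` are hypothetical, kernel durations are made-up time units, and the model ignores resource contention and kernel decomposition, both of which the abstract says Liger handles.

```python
from dataclasses import dataclass


@dataclass
class Kernel:
    kind: str        # "compute" or "comm" (hypothetical labels)
    duration: float  # arbitrary time units, not measured values


def make_request(layers, t_compute, t_comm):
    """Model a request as, per layer, one compute kernel then one comm kernel."""
    kernels = []
    for _ in range(layers):
        kernels.append(Kernel("compute", t_compute))
        kernels.append(Kernel("comm", t_comm))
    return kernels


def serial_makespan(requests):
    """No overlap: every kernel of every request runs one after another."""
    return sum(k.duration for req in requests for k in req)


def interleaved_makespan(requests):
    """Overlap compute and comm across requests.

    Assumes one compute resource and one comm resource that do not slow
    each other down. A kernel starts once its resource is free and the
    previous kernel of the same request has finished.
    """
    free = {"compute": 0.0, "comm": 0.0}  # time each resource becomes free
    ready = [0.0] * len(requests)         # finish time of each request so far
    depth = max(len(req) for req in requests)
    for step in range(depth):
        for i, req in enumerate(requests):
            if step < len(req):
                k = req[step]
                start = max(free[k.kind], ready[i])
                end = start + k.duration
                free[k.kind] = end
                ready[i] = end
    return max(ready)


requests = [make_request(layers=2, t_compute=2.0, t_comm=1.0) for _ in range(2)]
print("serial:", serial_makespan(requests))
print("interleaved:", interleaved_makespan(requests))
```

In this toy setting the interleaved schedule finishes sooner because the communication of one request hides behind the computation of the other. On real GPUs, concurrently running kernels compete for resources, which is the penalty the contention factor strategy is meant to anticipate.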
Mon 4 Mar (displayed time zone: London)
11:30 - 12:50 | Compilers and Runtimes for Parallel Systems | Main Conference at Moorfoot | Chair(s): Mohamed Riyadh Baghdadi
11:30 | 20m Talk | Liger: Interleaving Intra- and Inter-Operator Parallelism for Distributed Large Model Inference | Main Conference | Jiangsu Du (Sun Yat-sen University), jinhui wei (Sun Yat-sen University), Jiazhi Jiang (Sun Yat-sen University), Shenggan Cheng (National University of Singapore), Zhiguang Chen (Sun Yat-sen University), Dan Huang, Yutong Lu (Sun Yat-sen University)
11:50 | 20m Talk | A Holistic Approach to Automatic Mixed-Precision Code Generation and Tuning for Affine Programs | Main Conference | Jinchen Xu (Information Engineering University), Guanghui Song (Li Auto Inc.), Bei Zhou (Information Engineering University), Fei Li (Information Engineering University), Jiangwei Hao (Information Engineering University), Jie Zhao (State Key Laboratory of Mathematical Engineering and Advanced Computing)
12:10 | 20m Talk | Language-Agnostic Static Deadlock Detection for Futures | Main Conference | Stefan K. Muller (Illinois Institute of Technology)
12:30 | 20m Talk | Recurrence Analysis for Automatic Parallelization of Subscripted Subscripts | Main Conference