Liger: Interleaving Intra- and Inter-Operator Parallelism for Distributed Large Model Inference (PPoPP 2024 - Main Conference)

Who

Jiangsu Du, jinhui wei, Jiazhi Jiang, Shenggan Cheng, Zhiguang Chen, Dan Huang, Yutong Lu

Track

PPoPP 2024 Main Conference

Time Zone

The program is currently displayed in (GMT) London.

The GMT offsets shown reflect the offsets at the moment of the conference.

When

Mon 4 Mar 2024 11:30 - 11:50 at Moorfoot - Compilers and Runtimes for Parallel Systems Chair(s): Mohamed Riyadh Baghdadi

Abstract

Distributed large model inference still faces a dilemma in balancing cost and effectiveness. Online scenarios demand intra-operator parallelism to achieve low job completion time (JCT), but its intensive communication makes it costly. Conversely, inter-operator parallelism can achieve high job processing capacity (JPC) with far less communication, but it fails to reduce execution time.

In this paper, we present Liger, a distributed large model inference runtime system that can achieve low JCT at high JPC on multi-GPU architectures. The key idea is a novel interleaved parallelism, which interleaves computation and communication across requests. Liger enables this parallelism by carefully scheduling computation and communication kernels across requests onto multiple streams of multiple GPUs. It achieves precise and efficient control of kernel execution order by combining CPU-GPU synchronization with inter-stream synchronization. To prevent scheduling failures caused by resource contention, Liger introduces a contention factor strategy to anticipate the penalty of contention. It enables a higher degree of overlap by decomposing lengthy kernels into smaller, more manageable units at runtime.
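
The interleaving idea can be illustrated with a toy timeline model. The sketch below is not Liger's scheduler: it assumes one compute stream and one communication stream, requests made of strictly alternating compute and communication kernels, and invented durations. It ignores the contention factor and kernel decomposition described above. The names `Kernel`, `make_request`, `run_sequential`, and `run_interleaved` are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass


@dataclass
class Kernel:
    kind: str        # "compute" or "comm"; each kind has its own stream
    duration: float  # hypothetical duration in milliseconds


def make_request(layers, compute_ms, comm_ms):
    """Model a request as alternating compute and communication kernels."""
    kernels = []
    for _ in range(layers):
        kernels.append(Kernel("compute", compute_ms))
        kernels.append(Kernel("comm", comm_ms))
    return kernels


def run_sequential(requests):
    """No overlap: every kernel of every request runs back to back."""
    clock, done = 0.0, []
    for req in requests:
        clock += sum(k.duration for k in req)
        done.append(clock)
    return done  # completion time of each request


def run_interleaved(requests):
    """Overlap one request's communication with another's computation.

    A kernel starts once the previous kernel of its own request has
    finished (dependency) and its stream is free. Among runnable kernels,
    the one with the earliest start is issued; ties go to the lower index.
    """
    stream_free = {"compute": 0.0, "comm": 0.0}
    ready = [0.0] * len(requests)  # when each request's next kernel may start
    nxt = [0] * len(requests)      # index of each request's next kernel
    remaining = sum(len(req) for req in requests)
    while remaining:
        best = None
        for i, req in enumerate(requests):
            if nxt[i] == len(req):
                continue  # this request has no kernels left
            k = req[nxt[i]]
            start = max(ready[i], stream_free[k.kind])
            if best is None or start < best[0]:
                best = (start, i, k)
        start, i, k = best
        end = start + k.duration
        stream_free[k.kind] = end
        ready[i] = end
        nxt[i] += 1
        remaining -= 1
    return ready  # completion time of each request


reqs = [make_request(layers=2, compute_ms=3.0, comm_ms=2.0) for _ in range(2)]
print(run_sequential(reqs))   # [10.0, 20.0]
print(run_interleaved(reqs))  # [11.0, 14.0]
```

With these made-up durations, both requests finish by 14 ms instead of 20 ms, because the communication of one request runs while the other computes. The first request finishes slightly later (11 ms versus 10 ms) since it now shares the streams. These numbers are illustrative only and are unrelated to the evaluation results reported in the paper.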

Extensive evaluations show that Liger, in most cases, outperforms existing parallelism approaches across models and devices, presenting the best JCT and JPC results. In a 4-device case, Liger reduces the average JCT by 36.0% while maintaining the same JPC compared to the inter-operator approach. Meanwhile, it improves the JPC by 1.34× with improved average JCT compared to the intra-operator approach.

Link to Publication

https://dl.acm.org/doi/pdf/10.1145/3627535.3638466

DOI

https://doi.org/10.1145/3627535.3638466

Jiangsu Du

Sun Yat-sen University

jinhui wei

Sun Yat-sen University

Jiazhi Jiang

Sun Yat-sen University

Shenggan Cheng

National University of Singapore

Zhiguang Chen

Sun Yat-sen University

Dan Huang

Yutong Lu

Sun Yat-sen University

Session Program

Mon 4 Mar
Displayed time zone: London

11:30 - 12:50	Compilers and Runtimes for Parallel Systems (Main Conference) at Moorfoot. Chair(s): Mohamed Riyadh Baghdadi

11:30 20m Talk		Liger: Interleaving Intra- and Inter-Operator Parallelism for Distributed Large Model Inference (Main Conference). Jiangsu Du (Sun Yat-sen University), jinhui wei (Sun Yat-sen University), Jiazhi Jiang (Sun Yat-sen University), Shenggan Cheng (National University of Singapore), Zhiguang Chen (Sun Yat-sen University), Dan Huang, Yutong Lu (Sun Yat-sen University)
11:50 20m Talk		A Holistic Approach to Automatic Mixed-Precision Code Generation and Tuning for Affine Programs (Main Conference). Jinchen Xu (Information Engineering University), Guanghui Song (Li Auto Inc.), Bei Zhou (Information Engineering University), Fei Li (Information Engineering University), Jiangwei Hao (Information Engineering University), Jie Zhao (State Key Laboratory of Mathematical Engineering and Advanced Computing)
12:10 20m Talk		Language-Agnostic Static Deadlock Detection for Futures (Main Conference). Stefan K. Muller (Illinois Institute of Technology)
12:30 20m Talk		Recurrence Analysis for Automatic Parallelization of Subscripted Subscripts (Main Conference). Akshay Bhosale (University of Delaware, USA), Rudolf Eigenmann (University of Delaware)