Gallatin: A General-Purpose GPU Memory Manager (PPoPP 2024 - Main Conference)

Who

Hunter James McCoy, Prashant Pandey

Track

PPoPP 2024 Main Conference

When

Tue 5 Mar 2024, 16:50 - 17:10 (GMT), at Moorfoot. Session: Optimizing for Memory. Chair(s): Yan Gu

Abstract

Dynamic memory management is critical for efficiently porting modern data processing pipelines to GPUs. However, building a general-purpose dynamic memory manager on GPUs is challenging due to the massive parallelism and weak memory coherence. Existing state-of-the-art GPU memory managers, Ouroboros and Reg-Eff, employ traditional data structures such as arrays and linked lists to manage memory objects. They build specialized pipelines to achieve performance for a fixed set of allocation sizes and fall back to the CUDA allocator for allocating large sizes. In the process, they lose general-purpose usability and fail to support critical applications such as streaming graph processing.

In this paper, we introduce Gallatin, a general-purpose and high-performance GPU memory manager. Gallatin uses the van Emde Boas (vEB) tree to manage memory objects efficiently and supports allocations of any size. We develop a wait-free GPU implementation of the vEB tree to exploit massive parallelism on GPUs. It supports constant-time insertions, deletions, and successor operations for a given memory size.
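To make the three operations named above concrete, the following is a minimal host-side C++ sketch of a two-level bit tree, a fixed-depth simplification of the vEB idea. It is not Gallatin's implementation: the type `TwoLevelBitTree`, its method names, the 64-way fanout, and the 4096-slot capacity are all assumptions chosen for illustration, and `std::atomic` merely stands in for GPU atomics. A set bit marks a free slot, and each operation touches a bounded number of 64-bit words.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical two-level bit tree over 64 * 64 = 4096 slots (illustrative only).
// A set bit means "this slot is free". Uses the GCC/Clang builtin __builtin_ctzll.
struct TwoLevelBitTree {
    static constexpr int kFanout = 64;
    std::atomic<uint64_t> summary{0};       // bit i set: leaf i may hold a free slot
    std::atomic<uint64_t> leaves[kFanout];  // bit j of leaf i: slot i*64 + j is free

    TwoLevelBitTree() {
        for (auto &word : leaves) word.store(0);
    }

    // Mark a slot as free: set the leaf bit, then the summary bit.
    void insert(int slot) {
        leaves[slot / kFanout].fetch_or(1ULL << (slot % kFanout));
        summary.fetch_or(1ULL << (slot / kFanout));
    }

    // Claim a slot: clear its leaf bit. Returns true only for the caller that
    // actually cleared it. The summary bit is left set as a possibly stale hint.
    bool remove(int slot) {
        const uint64_t mask = 1ULL << (slot % kFanout);
        const uint64_t old = leaves[slot / kFanout].fetch_and(~mask);
        return (old & mask) != 0;
    }

    // Smallest free slot >= slot, or -1 if there is none.
    int successor(int slot) const {
        const int leaf = slot / kFanout;
        const uint64_t word = leaves[leaf].load() & (~0ULL << (slot % kFanout));
        if (word) return leaf * kFanout + __builtin_ctzll(word);
        if (leaf + 1 >= kFanout) return -1;
        // Scan later leaves via the summary word, skipping stale summary bits.
        uint64_t candidates = summary.load() & (~0ULL << (leaf + 1));
        while (candidates) {
            const int next = __builtin_ctzll(candidates);
            const uint64_t w = leaves[next].load();
            if (w) return next * kFanout + __builtin_ctzll(w);
            candidates &= candidates - 1;  // drop the lowest set bit and retry
        }
        return -1;
    }
};
```

In an allocator built along these lines, a thread would call `successor` to find a candidate free slot and then `remove` to claim it, retrying on another slot if `remove` returns false because a different thread claimed it first.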

In our evaluation, we compare Gallatin with state-of-the-art specialized allocator variants. It is up to 568× faster on single-sized allocations and up to 374× faster on mixed-size allocations than the next-best allocator. Gallatin also scales well as the number of threads increases and is up to 146× faster for single-sized allocations. For the graph benchmarks, Gallatin is faster than the state-of-the-art for range operations and is the fastest allocator for all graph expansion tests.

Link to Publication

https://dl.acm.org/doi/pdf/10.1145/3627535.3638499

DOI

https://doi.org/10.1145/3627535.3638499

Hunter James McCoy

University of Utah

Prashant Pandey