Issues with multiple GPUs on a single node #2615

Closed
efaulhaber opened this issue Jan 10, 2025 · 1 comment

Comments

@efaulhaber
Contributor

Using multiple GPUs on a single node with unified memory is currently unusable in practice.
Here is an MWE to demonstrate this:

using KernelAbstractions
using CUDA
using BenchmarkTools
using NVTX

@kernel function copy_kernel!(A, @Const(B))
    I = @index(Global)
    @inbounds A[I] = B[I]
end

function singlecopy!(A, B)
    backend = get_backend(A)
    kernel = copy_kernel!(backend)
    kernel(A, B, ndrange = length(A))
    KernelAbstractions.synchronize(backend)
end

@kernel function copy_kernel2!(A, @Const(B), i)
    I = @index(Global)
    @inbounds A[I, i] = B[I, i]
end

function singlecopy2!(A, B)
    backend = get_backend(A)

    @sync begin
        for i in axes(A, 2)
            Threads.@spawn begin
                kernel = copy_kernel2!(backend)
                kernel(A, B, i, ndrange = size(A, 1))
                KernelAbstractions.synchronize(backend)
            end
        end
    end
end

# Same as `singlecopy2!`, but choosing a different device for each kernel
function multicopy!(A, B)
    backend = get_backend(A)

    @sync begin
        for i in axes(A, 2)
            Threads.@spawn begin
                NVTX.@mark "Thread $i started"
                device!(i - 1)
                NVTX.@mark "Thread $i device selected"

                NVTX.@range "Thread $i launching kernel" begin
                    kernel = copy_kernel2!(backend)
                    kernel(A, B, i, ndrange = size(A, 1))
                end
                NVTX.@range "Thread $i synchronize" begin
                    KernelAbstractions.synchronize(backend)
                end
            end
        end
    end
end

backend = CUDA.CUDABackend()
n = 500_000_000
n_gpus = 2

# Initialize array on device 0
device!(0)
A = KernelAbstractions.zeros(backend, Float32, n, n_gpus)
B = KernelAbstractions.ones(backend, Float32, n, n_gpus)

println("\nSingle kernel")
@btime singlecopy!($A, $B)
@assert A == B

println("\nMultiple kernels")
@btime singlecopy2!($A, $B)
@assert A == B

println("\nWrong device")
# Memory is on device 0, launch kernel on device 1
device!(1)
@btime singlecopy2!($A, $B)
@assert A == B

println("\nUnified memory")
device!(0)
A = cu(A, unified = true)
B = cu(B, unified = true)
@btime singlecopy2!($A, $B)
@assert A == B

println("\nUnified memory wrong device")
# Memory is on device 0, launch kernel on device 1
device!(1)
@btime singlecopy2!($A, $B)
@assert A == B

println("\nMulti-GPU")
@btime multicopy!($A, $B)
@assert A == B

Running this on two NVIDIA H100 GPUs yields:

Single kernel
  4.360 ms (56 allocations: 1.36 KiB)

Multiple kernels
  4.576 ms (143 allocations: 5.23 KiB)

Wrong device
  35.613 ms (175 allocations: 6.36 KiB)

Unified memory
  10.430 ms (223 allocations: 6.48 KiB)

Unified memory wrong device
  10.314 ms (225 allocations: 6.98 KiB)

Multi-GPU
  162.697 ms (223 allocations: 6.48 KiB)
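
For scale: each column holds 500,000,000 Float32 values, i.e. roughly 2 GB, so copying both columns of B into A moves about 8 GB of data in total (read plus write). The ~4.4 ms single-kernel time thus corresponds to roughly 1.8 TB/s of effective bandwidth on a single H100, and splitting the two columns across two GPUs should ideally cut that time roughly in half, rather than inflate it to 162 ms.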

The first issue is that unified memory is prefetched to the GPU, which probably makes sense when sharing unified memory between the CPU and a GPU, but is counterproductive when sharing it between multiple GPUs. (Thanks @vchuravy for pointing this out.)
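
As an aside: if that prefetch were configurable (see the TODO in the CUDA.jl code quoted below), one could instead prefetch each column of the unified arrays to the GPU that will actually work on it. A rough sketch of that idea, reusing the UnifiedMemory sub-buffer trick from the quoted code, and assuming prefetch accepts a device keyword like the underlying cuMemPrefetchAsync:

# Hypothetical helper (not part of the MWE above): prefetch column i of a
# unified CuMatrix to device i-1, so each GPU already holds the pages it
# will touch. Assumes CUDA.prefetch takes a `device` keyword.
function prefetch_columns!(A)
    mem = A.data[].mem::CUDA.UnifiedMemory
    colbytes = size(A, 1) * sizeof(eltype(A))
    for i in axes(A, 2)
        ptr = pointer(A, (i - 1) * size(A, 1) + 1)           # start of column i
        subbuf = CUDA.UnifiedMemory(mem.ctx, ptr, colbytes)  # sub-buffer covering column i
        CUDA.prefetch(subbuf; device = CuDevice(i - 1))
    end
end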
When I disable the following block in CUDA.jl:

    # prefetch unified memory as we're likely to use it on the GPU
    # TODO: make this configurable?
    if is_unified(xs)
        # XXX: use convert to pointer and/or prefect(CuArray)
        mem = xs.data[].mem::UnifiedMemory
        can_prefetch = sizeof(xs) > 0
        ## prefetching isn't supported during stream capture
        can_prefetch &= !is_capturing()
        ## we can only prefetch pageable memory
        can_prefetch &= !__pinned(convert(Ptr{T}, mem), mem.ctx)
        ## pageable memory needs to be accessible concurrently
        can_prefetch &= attribute(device(), DEVICE_ATTRIBUTE_CONCURRENT_MANAGED_ACCESS) == 1
        if can_prefetch
            # TODO: `view` on buffers?
            subbuf = UnifiedMemory(mem.ctx, pointer(xs), sizeof(xs))
            prefetch(subbuf)
        end
    end

    Base.unsafe_convert(CuDeviceArray{T,N,AS.Global}, xs)
end

I get the following:

Single kernel
  4.357 ms (56 allocations: 1.36 KiB)

Multiple kernels
  4.571 ms (143 allocations: 5.23 KiB)

Wrong device
  35.607 ms (175 allocations: 6.36 KiB)

Unified memory
  4.600 ms (143 allocations: 5.23 KiB)

Unified memory wrong device
  4.630 ms (143 allocations: 5.23 KiB)

Multi-GPU
  2.679 ms (143 allocations: 5.23 KiB)

It looks like the multi-GPU code is about twice as fast, but @btime only reports the minimum. The full benchmark shows the following:

julia> @benchmark multicopy!($A, $B)
BenchmarkTools.Trial: 886 samples with 1 evaluation.
 Range (min … max):  2.889 ms … 28.805 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     5.434 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   5.637 ms ±  1.544 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

           ▆█▇▆▄▂                                             
  ▇▅▄▁▁▁▁▅████████▇▆▆▁▁▁▄▁▁▁▁▁▁▁▁▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▁▁▄▄ ▇
  2.89 ms      Histogram: log(frequency) by time     16.3 ms <

 Memory estimate: 5.23 KiB, allocs estimate: 143.

So a few runs are fast, but most of them are slower than the single-GPU code.
This can be seen nicely in Nsight Systems:

[Nsight Systems timeline screenshot]
The thread launching the second kernel hangs in "launching kernel", but the kernel is not yet submitted.
@pxl-th pointed me to this code:

CUDA.jl/src/memory.jl, lines 565 to 569 at commit 8d810d7:

# accessing memory on another stream: ensure the data is ready and take ownership
if managed.stream != state.stream
    maybe_synchronize(managed)
    managed.stream = state.stream
end

There is a synchronization happening when an array is accessed from a stream other than the one that last used it. Since CUDA.jl uses a separate stream per Julia task, every task spawned with Threads.@spawn here runs on its own stream, so each kernel launch finds the arrays owned by another stream and has to synchronize before taking ownership. After commenting out the maybe_synchronize call, things work as expected:

julia> @benchmark multicopy!($A, $B)
BenchmarkTools.Trial: 1597 samples with 1 evaluation.
 Range (min … max):  2.517 ms …  15.573 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     3.032 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.126 ms ± 869.700 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

     ▂▅▆▆▆▅▅▅▂▃▅▆▆▄█▆▄▃                                        
  ▄▅▇██████████████████▇▇▇▅▄▃▃▄▄▄▃▃▄▃▃▂▁▃▂▃▂▂▂▂▃▂▂▂▂▁▂▂▁▂▂▁▁▂ ▄
  2.52 ms         Histogram: frequency by time        4.71 ms <

 Memory estimate: 6.77 KiB, allocs estimate: 183.

[Nsight Systems timeline screenshot with the synchronization removed]
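
For completeness, the blocking synchronization here could in principle be expressed as a device-side dependency instead: record an event on the stream that owns the memory and have the new stream wait on that event, so the host never blocks. A rough sketch of that idea (not what CUDA.jl currently does; owning_stream and new_stream are placeholders, and I'm assuming CUDA.jl's CuEvent/record/wait wrappers around cuEventRecord/cuStreamWaitEvent):

# Sketch only: keep the cross-stream dependency on the device instead of
# synchronizing on the host. `owning_stream` and `new_stream` are placeholders.
ev = CuEvent()
CUDA.record(ev, owning_stream)  # point up to which the owning stream has produced the data
CUDA.wait(ev, new_stream)       # new_stream waits for the event; the host keeps running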

@efaulhaber changed the title from Issues with single-node multi-gpu to Issues with multiple GPUs on a single node on Jan 10, 2025
@maleadt
Member

maleadt commented Jan 11, 2025

Thanks for the issue. Closing in favor of fine-grained / more actionable ones.
