Issues with multiple GPUs on a single node #2615

Closed
efaulhaber opened this issue Jan 10, 2025 · 1 comment

Comments

@efaulhaber
Contributor

Using multiple GPUs on a single node with unified memory is currently unusable in practice.
Here is an MWE to demonstrate this:

using KernelAbstractions
using CUDA
using BenchmarkTools
using NVTX

@kernel function copy_kernel!(A, @Const(B))
    I = @index(Global)
    @inbounds A[I] = B[I]
end

function singlecopy!(A, B)
    backend = get_backend(A)
    kernel = copy_kernel!(backend)
    kernel(A, B, ndrange = length(A))
    KernelAbstractions.synchronize(backend)
end

@kernel function copy_kernel2!(A, @Const(B), i)
    I = @index(Global)
    @inbounds A[I, i] = B[I, i]
end

function singlecopy2!(A, B)
    backend = get_backend(A)

    @sync begin
        for i in axes(A, 2)
            Threads.@spawn begin
                kernel = copy_kernel2!(backend)
                kernel(A, B, i, ndrange = size(A, 1))
                KernelAbstractions.synchronize(backend)
            end
        end
    end
end

# Same as `singlecopy2!`, but choosing a different device for each kernel
function multicopy!(A, B)
    backend = get_backend(A)

    @sync begin
        for i in axes(A, 2)
            Threads.@spawn begin
                NVTX.@mark "Thread $i started"
                device!(i - 1)
                NVTX.@mark "Thread $i device selected"

                NVTX.@range "Thread $i launching kernel" begin
                    kernel = copy_kernel2!(backend)
                    kernel(A, B, i, ndrange = size(A, 1))
                end
                NVTX.@range "Thread $i synchronize" begin
                    KernelAbstractions.synchronize(backend)
                end
            end
        end
    end
end

backend = CUDA.CUDABackend()
n = 500_000_000
n_gpus = 2

# Initialize array on device 0
device!(0)
A = KernelAbstractions.zeros(backend, Float32, n, n_gpus)
B = KernelAbstractions.ones(backend, Float32, n, n_gpus)

println("\nSingle kernel")
@btime singlecopy!($A, $B)
@assert A == B

println("\nMultiple kernels")
@btime singlecopy2!($A, $B)
@assert A == B

println("\nWrong device")
# Memory is on device 0, launch kernel on device 1
device!(1)
@btime singlecopy2!($A, $B)
@assert A == B

println("\nUnified memory")
device!(0)
A = cu(A, unified = true)
B = cu(B, unified = true)
@btime singlecopy2!($A, $B)
@assert A == B

println("\nUnified memory wrong device")
# Memory is on device 0, launch kernel on device 1
device!(1)
@btime singlecopy2!($A, $B)
@assert A == B

println("\nMulti-GPU")
@btime multicopy!($A, $B)
@assert A == B

Running this on two NVIDIA H100 GPUs yields:

Single kernel
  4.360 ms (56 allocations: 1.36 KiB)

Multiple kernels
  4.576 ms (143 allocations: 5.23 KiB)

Wrong device
  35.613 ms (175 allocations: 6.36 KiB)

Unified memory
  10.430 ms (223 allocations: 6.48 KiB)

Unified memory wrong device
  10.314 ms (225 allocations: 6.98 KiB)

Multi-GPU
  162.697 ms (223 allocations: 6.48 KiB)
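
For scale: each column holds 500,000,000 Float32 values, i.e. roughly 2 GB, so copying both columns of B into A moves about 8 GB of data in total (read plus write). The ~4.4 ms single-kernel time thus corresponds to roughly 1.8 TB/s of effective bandwidth on a single H100, and splitting the two columns across two GPUs should ideally cut that time roughly in half, rather than inflate it to 162 ms.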

The first issue is that unified memory is prefetched to the GPU, which probably makes sense when sharing unified memory between the CPU and a GPU, but is counterproductive when sharing it between multiple GPUs. (Thanks @vchuravy for pointing this out.)
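
As an aside: if that prefetch were configurable (see the TODO in the CUDA.jl code quoted below), one could instead prefetch each column of the unified arrays to the GPU that will actually work on it. A rough sketch of that idea, reusing the UnifiedMemory sub-buffer trick from the quoted code, and assuming prefetch accepts a device keyword like the underlying cuMemPrefetchAsync:

# Hypothetical helper (not part of the MWE above): prefetch column i of a
# unified CuMatrix to device i-1, so each GPU already holds the pages it
# will touch. Assumes CUDA.prefetch takes a `device` keyword.
function prefetch_columns!(A)
    mem = A.data[].mem::CUDA.UnifiedMemory
    colbytes = size(A, 1) * sizeof(eltype(A))
    for i in axes(A, 2)
        ptr = pointer(A, (i - 1) * size(A, 1) + 1)           # start of column i
        subbuf = CUDA.UnifiedMemory(mem.ctx, ptr, colbytes)  # sub-buffer covering column i
        CUDA.prefetch(subbuf; device = CuDevice(i - 1))
    end
end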
When I disable the following block in CUDA.jl:

    # prefetch unified memory as we're likely to use it on the GPU
    # TODO: make this configurable?
    if is_unified(xs)
        # XXX: use convert to pointer and/or prefect(CuArray)
        mem = xs.data[].mem::UnifiedMemory
        can_prefetch = sizeof(xs) > 0
        ## prefetching isn't supported during stream capture
        can_prefetch &= !is_capturing()
        ## we can only prefetch pageable memory
        can_prefetch &= !__pinned(convert(Ptr{T}, mem), mem.ctx)
        ## pageable memory needs to be accessible concurrently
        can_prefetch &= attribute(device(), DEVICE_ATTRIBUTE_CONCURRENT_MANAGED_ACCESS) == 1
        if can_prefetch
            # TODO: `view` on buffers?
            subbuf = UnifiedMemory(mem.ctx, pointer(xs), sizeof(xs))
            prefetch(subbuf)
        end
    end

    Base.unsafe_convert(CuDeviceArray{T,N,AS.Global}, xs)
end

I get the following:

Single kernel
  4.357 ms (56 allocations: 1.36 KiB)

Multiple kernels
  4.571 ms (143 allocations: 5.23 KiB)

Wrong device
  35.607 ms (175 allocations: 6.36 KiB)

Unified memory
  4.600 ms (143 allocations: 5.23 KiB)

Unified memory wrong device
  4.630 ms (143 allocations: 5.23 KiB)

Multi-GPU
  2.679 ms (143 allocations: 5.23 KiB)

It looks like the multi-GPU code is about twice as fast, but @btime only reports the minimum. The full benchmark shows the following:

julia> @benchmark multicopy!($A, $B)
BenchmarkTools.Trial: 886 samples with 1 evaluation.
 Range (min … max):  2.889 ms … 28.805 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     5.434 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   5.637 ms ±  1.544 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

           ▆█▇▆▄▂                                             
  ▇▅▄▁▁▁▁▅████████▇▆▆▁▁▁▄▁▁▁▁▁▁▁▁▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▁▁▄▄ ▇
  2.89 ms      Histogram: log(frequency) by time     16.3 ms <

 Memory estimate: 5.23 KiB, allocs estimate: 143.

So a few runs are fast, but most of them are slower than the single-GPU code.
This can be seen nicely in Nsight Systems:

[Nsight Systems timeline screenshot]
The thread launching the second kernel hangs in "launching kernel", but the kernel is not yet submitted.
@pxl-th pointed me to this code:

CUDA.jl/src/memory.jl, lines 565 to 569 at commit 8d810d7:

# accessing memory on another stream: ensure the data is ready and take ownership
if managed.stream != state.stream
    maybe_synchronize(managed)
    managed.stream = state.stream
end

There is a synchronization happening when an array is accessed from a stream other than the one that last used it. Since CUDA.jl uses a separate stream per Julia task, every task spawned with Threads.@spawn here runs on its own stream, so each kernel launch finds the arrays owned by another stream and has to synchronize before taking ownership. After commenting out the maybe_synchronize call, things work as expected:

julia> @benchmark multicopy!($A, $B)
BenchmarkTools.Trial: 1597 samples with 1 evaluation.
 Range (min … max):  2.517 ms …  15.573 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     3.032 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.126 ms ± 869.700 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

     ▂▅▆▆▆▅▅▅▂▃▅▆▆▄█▆▄▃                                        
  ▄▅▇██████████████████▇▇▇▅▄▃▃▄▄▄▃▃▄▃▃▂▁▃▂▃▂▂▂▂▃▂▂▂▂▁▂▂▁▂▂▁▁▂ ▄
  2.52 ms         Histogram: frequency by time        4.71 ms <

 Memory estimate: 6.77 KiB, allocs estimate: 183.

[Nsight Systems timeline screenshot with the synchronization removed]
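
For completeness, the blocking synchronization here could in principle be expressed as a device-side dependency instead: record an event on the stream that owns the memory and have the new stream wait on that event, so the host never blocks. A rough sketch of that idea (not what CUDA.jl currently does; owning_stream and new_stream are placeholders, and I'm assuming CUDA.jl's CuEvent/record/wait wrappers around cuEventRecord/cuStreamWaitEvent):

# Sketch only: keep the cross-stream dependency on the device instead of
# synchronizing on the host. `owning_stream` and `new_stream` are placeholders.
ev = CuEvent()
CUDA.record(ev, owning_stream)  # point up to which the owning stream has produced the data
CUDA.wait(ev, new_stream)       # new_stream waits for the event; the host keeps running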

@efaulhaber changed the title from Issues with single-node multi-gpu to Issues with multiple GPUs on a single node on Jan 10, 2025
@maleadt
Member

maleadt commented Jan 11, 2025

Thanks for the issue. Closing in favor of fine-grained / more actionable ones.
