Using multiple GPUs on a single node with unified memory is currently not usable.
Here is an MWE to demonstrate this:
```julia
using KernelAbstractions
using CUDA
using BenchmarkTools
using NVTX

@kernel function copy_kernel!(A, @Const(B))
    I = @index(Global)
    @inbounds A[I] = B[I]
end

function singlecopy!(A, B)
    backend = get_backend(A)
    kernel = copy_kernel!(backend)
    kernel(A, B, ndrange = length(A))
    KernelAbstractions.synchronize(backend)
end

@kernel function copy_kernel2!(A, @Const(B), i)
    I = @index(Global)
    @inbounds A[I, i] = B[I, i]
end

function singlecopy2!(A, B)
    backend = get_backend(A)
    @sync begin
        for i in axes(A, 2)
            Threads.@spawn begin
                kernel = copy_kernel2!(backend)
                kernel(A, B, i, ndrange = size(A, 1))
                KernelAbstractions.synchronize(backend)
            end
        end
    end
end

# Same as `singlecopy2!`, but choosing a different device for each kernel
function multicopy!(A, B)
    backend = get_backend(A)
    @sync begin
        for i in axes(A, 2)
            Threads.@spawn begin
                NVTX.@mark "Thread $i started"
                device!(i - 1)
                NVTX.@mark "Thread $i device selected"
                NVTX.@range "Thread $i launching kernel" begin
                    kernel = copy_kernel2!(backend)
                    kernel(A, B, i, ndrange = size(A, 1))
                end
                NVTX.@range "Thread $i synchronize" begin
                    KernelAbstractions.synchronize(backend)
                end
            end
        end
    end
end

backend = CUDA.CUDABackend()
n = 500_000_000
n_gpus = 2

# Initialize array on device 0
device!(0)
A = KernelAbstractions.zeros(backend, Float32, n, n_gpus)
B = KernelAbstractions.ones(backend, Float32, n, n_gpus)

println("\nSingle kernel")
@btime singlecopy!($A, $B)
@assert A == B

println("\nMultiple kernels")
@btime singlecopy2!($A, $B)
@assert A == B

println("\nWrong device")
# Memory is on device 0, launch kernel on device 1
device!(1)
@btime singlecopy2!($A, $B)
@assert A == B

println("\nUnified memory")
device!(0)
A = cu(A, unified = true)
B = cu(B, unified = true)
@btime singlecopy2!($A, $B)
@assert A == B

println("\nUnified memory wrong device")
# Memory is on device 0, launch kernel on device 1
device!(1)
@btime singlecopy2!($A, $B)
@assert A == B

println("\nMulti-GPU")
@btime multicopy!($A, $B)
@assert A == B
```
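Note that the per-GPU tasks created by `Threads.@spawn` can only run on separate threads if Julia itself is started with more than one thread, e.g. `julia --threads=2 mwe.jl` (the file name is just an example). A quick guard, to be placed after `n_gpus` is set:

```julia
# The spawned per-GPU tasks only get their own threads if Julia was started
# with multiple threads (e.g. `julia --threads=2 mwe.jl`).
@assert Threads.nthreads() >= n_gpus "start Julia with at least $n_gpus threads"
```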
Running this on two NVIDIA H100 GPUs yields:
```
Single kernel
  4.360 ms (56 allocations: 1.36 KiB)

Multiple kernels
  4.576 ms (143 allocations: 5.23 KiB)

Wrong device
  35.613 ms (175 allocations: 6.36 KiB)

Unified memory
  10.430 ms (223 allocations: 6.48 KiB)

Unified memory wrong device
  10.314 ms (225 allocations: 6.98 KiB)

Multi-GPU
  162.697 ms (223 allocations: 6.48 KiB)
```
The first issue is that unified memory is prefetched to the GPU at kernel launch, which probably makes sense when sharing unified memory between the CPU and a GPU, but is counterproductive when sharing between multiple GPUs. (Thanks @vchuravy for pointing this out.)
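To make the distinction concrete, here is a rough sketch of the two behaviours. This is not CUDA.jl code: `prefetch_to!` is a hypothetical placeholder (stubbed out below) standing in for the prefetch performed in the linked block.

```julia
# Hypothetical sketch only: `prefetch_to!` is a placeholder, not a CUDA.jl API.
prefetch_to!(x, dev) = nothing  # stub so the sketch runs as-is

# What happens today at launch time: the unified buffers backing the kernel
# arguments are prefetched to the device the kernel is launched on,
# so everything lands on device 0.
prefetch_to!(A, device())

# What a multi-GPU workload would want instead: each GPU only gets the
# column it is about to process.
for i in 1:n_gpus
    prefetch_to!(view(A, :, i), CuDevice(i - 1))
end
```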
When I disable this block (CUDA.jl/src/compiler/execution.jl, lines 141 to 163 at 8d810d7), I get the following:
```
Single kernel
  4.357 ms (56 allocations: 1.36 KiB)

Multiple kernels
  4.571 ms (143 allocations: 5.23 KiB)

Wrong device
  35.607 ms (175 allocations: 6.36 KiB)

Unified memory
  4.600 ms (143 allocations: 5.23 KiB)

Unified memory wrong device
  4.630 ms (143 allocations: 5.23 KiB)

Multi-GPU
  2.679 ms (143 allocations: 5.23 KiB)
```
It looks like the multi-GPU code is about twice as fast, but `@btime` is only reporting the minimum. When we show the full benchmark, we can see the following:
```julia
julia> @benchmark multicopy!($A, $B)
BenchmarkTools.Trial: 886 samples with 1 evaluation.
 Range (min … max):  2.889 ms … 28.805 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     5.434 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   5.637 ms ±  1.544 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

        ▆█▇▆▄▂
  ▇▅▄▁▁▁▁▅████████▇▆▆▁▁▁▄▁▁▁▁▁▁▁▁▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▁▁▄▄ ▇
  2.89 ms      Histogram: log(frequency) by time      16.3 ms <

 Memory estimate: 5.23 KiB, allocs estimate: 143.
```
So a few runs are fast, but most of them are slower than the single-GPU code.
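To quantify that split without eyeballing the histogram, one can look at the raw sample times stored in the trial; the 4 ms cut below is an arbitrary threshold between the fast and the slow cluster:

```julia
trial = @benchmark multicopy!($A, $B)

# `trial.times` holds the individual sample times in nanoseconds;
# 4 ms is an arbitrary cut between the fast and the slow runs.
fast = count(t -> t < 4e6, trial.times)
println("fast samples: $fast / $(length(trial.times))")
```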
This can be seen nicely in Nsight Systems: the thread launching the second kernel hangs in "launching kernel", but the kernel is not yet submitted. @pxl-th pointed me to this code in CUDA.jl/src/memory.jl (lines 565 to 569 at 8d810d7):
```julia
# accessing memory on another stream: ensure the data is ready and take ownership
if managed.stream != state.stream
    maybe_synchronize(managed)
    managed.stream = state.stream
end
```
There is a synchronization happening whenever an array is accessed from a different stream. After commenting out the `maybe_synchronize`, things work as expected (see the note on task-local streams after the benchmark):
```julia
julia> @benchmark multicopy!($A, $B)
BenchmarkTools.Trial: 1597 samples with 1 evaluation.
 Range (min … max):  2.517 ms … 15.573 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     3.032 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.126 ms ± 869.700 μs ┊ GC (mean ± σ):  0.00% ± 0.00%

          ▂▅▆▆▆▅▅▅▂▃▅▆▆▄█▆▄▃
  ▄▅▇██████████████████▇▇▇▅▄▃▃▄▄▄▃▃▄▃▃▂▁▃▂▃▂▂▂▂▃▂▂▂▂▁▂▂▁▂▂▁▁▂ ▄
  2.52 ms        Histogram: frequency by time        4.71 ms <

 Memory estimate: 6.77 KiB, allocs estimate: 183.
```
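For context on why both tasks end up on this code path at all: CUDA.jl uses task-local streams, so every task spawned in `multicopy!` touches the shared unified arrays `A` and `B` from its own stream, and the access from a "foreign" stream is what triggers the ownership synchronization above. A minimal check, reusing the setup from the MWE:

```julia
# Each spawned task reports its own task-local stream. Since `A` and `B` are
# shared across these tasks, their accesses come from different streams,
# which is what triggers the ownership synchronization shown above.
@sync for i in 1:n_gpus
    Threads.@spawn begin
        device!(i - 1)
        @info "Task $i" CUDA.stream()
    end
end
```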