Atomics: configurable scope (for multi-device unified memory) #2619
Here's an MWE showing that the current atomics are not functional on unified memory accessed from multiple devices. I generate a bunch of random numbers and write into an array how many of them fall into each interval:

```julia
using KernelAbstractions: KernelAbstractions, @kernel, @index, get_backend
using Atomix
using CUDA

@kernel function mykernel!(interval_counter, x, indices)
    i_ = @index(Global)
    i = indices[i_]

    # Increment counter for the interval of `x[i]`
    interval = ceil(Int, x[i])
    @inbounds Atomix.@atomic interval_counter[interval] += 1
end

function run_multigpu!(interval_counter, x)
    N = length(x)

    # Partition `eachindex(x)` to the GPUs
    n_gpus = length(CUDA.devices())
    indices_split = Iterators.partition(eachindex(x), ceil(Int, N / n_gpus))
    @assert length(indices_split) <= n_gpus

    backend = get_backend(interval_counter)

    # Synchronize each device
    for i in 1:n_gpus
        CUDA.device!(i - 1)
        KernelAbstractions.synchronize(backend)
    end

    # Launch kernel on each device
    for (i, indices_) in enumerate(indices_split)
        # Select the correct device for this partition
        CUDA.device!(i - 1)

        kernel = mykernel!(backend)
        kernel(interval_counter, x, indices_, ndrange = length(indices_))
    end

    # Synchronize each device again
    for i in 1:n_gpus
        CUDA.device!(i - 1)
        KernelAbstractions.synchronize(backend)
    end
end

function test(N, unified; n_runs = 10_000)
    x = CUDA.rand(N) .* N
    x = cu(x; unified)
    interval_counter = cu(CUDA.zeros(Int, N); unified)
    interval_counter_cpu = zeros(Int, N)

    # Run on the CPU as reference
    run_multigpu!(interval_counter_cpu, Array(x))
    reference = cu(interval_counter_cpu)

    for i in 1:n_runs
        interval_counter .= 0
        run_multigpu!(interval_counter, x)

        # Print the index of every run whose result differs from the reference
        if interval_counter != reference
            print("$i ")
        end
    end
end
```

On 4x H100, unified memory gives some nondeterministic failures, which I don't get with non-unified memory.

Note that I used this branch https://github.com/efaulhaber/CUDA.jl/tree/disable-prefetch to disable prefetching of unified memory and synchronization when such an array is accessed from another stream. Otherwise, the kernels would run serially on the GPUs. See #2615.
@vchuravy wrote a workaround for an atomic add:

```julia
function atomic_system_add(ptr::CUDA.LLVMPtr{Int64, CUDA.AS.Global}, val::Int64)
    CUDA.LLVM.Interop.@asmcall(
        "atom.sys.global.add.u64 \$0, [\$1], \$2;",
        "=l,l,l,~{memory}",
        true, Int64, Tuple{CUDA.LLVMPtr{Int64, CUDA.AS.Global}, Int64},
        ptr, val
    )
end
```

Or, for 32-bit integers:

```julia
function atomic_system_add(ptr::CUDA.LLVMPtr{Int32, CUDA.AS.Global}, val::Int32)
    CUDA.LLVM.Interop.@asmcall(
        "atom.sys.global.add.u32 \$0, [\$1], \$2;",
        "=r,l,r,~{memory}",
        true, Int32, Tuple{CUDA.LLVMPtr{Int32, CUDA.AS.Global}, Int32},
        ptr, val
    )
end
```

The MWE above works with this.
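For reference, here is a rough sketch of how the MWE kernel might call the workaround in place of `Atomix.@atomic`. The kernel name `mykernel_system!` is hypothetical, and the sketch assumes that `pointer(interval_counter, interval)` on the device-side array returns an `LLVMPtr` in the global address space. It would only apply to the CUDA backend, so the CPU reference run would still need the original kernel.

```julia
using KernelAbstractions: @kernel, @index
using CUDA

# Int64 variant of the system-scope add from the workaround above,
# repeated here so the sketch is self-contained.
function atomic_system_add(ptr::CUDA.LLVMPtr{Int64, CUDA.AS.Global}, val::Int64)
    CUDA.LLVM.Interop.@asmcall(
        "atom.sys.global.add.u64 \$0, [\$1], \$2;",
        "=l,l,l,~{memory}",
        true, Int64, Tuple{CUDA.LLVMPtr{Int64, CUDA.AS.Global}, Int64},
        ptr, val
    )
end

# Hypothetical variant of `mykernel!` that uses the system-scope add.
@kernel function mykernel_system!(interval_counter, x, indices)
    i_ = @index(Global)
    i = indices[i_]

    # Increment counter for the interval of `x[i]`
    interval = ceil(Int, x[i])

    # Assumption: `pointer(array, index)` on the device array yields an
    # `LLVMPtr{Int64, CUDA.AS.Global}` to the element at `interval`.
    atomic_system_add(pointer(interval_counter, interval), 1)
end
```

In `run_multigpu!`, this would mean constructing `mykernel_system!(backend)` instead of `mykernel!(backend)` when the backend is CUDA.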
We should investigate whether our current atomics are functional when used on unified memory that's being used from different devices (they probably aren't). In CUDA C, this requires use of `_system`-suffixed atomic functions, e.g., `atomicAdd_system`, which change the synchronization scope. Note that system-scope atomics have additional requirements; see https://nvidia.github.io/cccl/libcudacxx/extended_api/memory_model.html#atomicity.
So lots of gotchas, but still, we should probably provide a way to alter the scope of an atomic operation. This requires:
nvcc
I won't have the time to look at this anytime soon, so if anybody wants to help out, gathering all that information and reporting here would be a good first step.