Atomics: configurable scope (for multi-device unified memory) #2619

Open
maleadt opened this issue Jan 11, 2025 · 3 comments
Labels
cuda kernels (Stuff about writing CUDA kernels.), help wanted (Extra attention is needed)

Comments

@maleadt
Member

maleadt commented Jan 11, 2025

We should investigate whether our current atomics are functional when used on unified memory that's being used from different devices (they probably aren't). In CUDA C, this requires the use of _system-suffixed atomic functions, e.g., atomicAdd_system, which change the synchronization scope. Quoting from https://nvidia.github.io/cccl/libcudacxx/extended_api/memory_model.html#atomicity:

Atomic APIs with _system suffix (example: atomicAdd_system) are atomic at scope cuda::thread_scope_system if they meet particular conditions.

Atomic APIs without a suffix (example: atomicAdd) are atomic at scope cuda::thread_scope_device.

Atomic APIs with _block suffix (example: atomicAdd_block) are atomic at scope cuda::thread_scope_block.

Note that system scope atomics have additional requirements. Quoting https://nvidia.github.io/cccl/libcudacxx/extended_api/memory_model.html#atomicity:

An atomic operation is atomic at the scope it specifies if:

  • it specifies a scope other than thread_scope_system, or
  • the scope is thread_scope_system and:
    • it affects an object in system allocated memory and pageableMemoryAccess is 1 [0], or
    • it affects an object in managed memory and concurrentManagedAccess is 1, or
    • it affects an object in mapped memory and hostNativeAtomicSupported is 1, or
    • it is a load or store that affects a naturally-aligned object of sizes 1, 2, 4, 8, or 16 bytes on mapped memory [1], or
    • it affects an object in GPU memory, only GPU threads access it, and
      • p2pNativeAtomicSupported between each accessing GPU and the GPU where the object resides is 1, or
      • only GPU threads from a single GPU concurrently access it.

[0] If PageableMemoryAccessUsesHostPagetables is 0 then atomic operations to memory mapped file or hugetlbfs allocations are not atomic.
[1] If hostNativeAtomicSupported is 0, atomic load or store operations at system scope that affect a naturally-aligned 16-byte wide object in unified memory or mapped memory require system support. NVIDIA is not aware of any system that lacks this support and there is no CUDA API query available to detect such systems.
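
The quoted conditions can be read as a decision procedure. The following is a minimal sketch of that reading in plain Julia, not a CUDA.jl API: the `MemoryKind` enum, the `AtomicAccess` struct, and all field names are hypothetical, and in real code the attribute flags would have to be queried from the driver. Footnotes [0] and [1] are not encoded.

```julia
# Hypothetical encoding of the quoted conditions; none of these names exist in CUDA.jl.
@enum MemoryKind SystemAllocated Managed Mapped GPUMemory

Base.@kwdef struct AtomicAccess
    kind::MemoryKind
    is_load_or_store::Bool = false   # atomic load/store rather than read-modify-write
    naturally_aligned::Bool = true
    size_bytes::Int = 8
    only_gpu_threads::Bool = false   # no CPU thread touches the object
    single_gpu::Bool = false         # only one GPU accesses it concurrently
    # Device attributes (assumed to be supplied by the caller)
    pageable_memory_access::Bool = false
    concurrent_managed_access::Bool = false
    host_native_atomic_supported::Bool = false
    p2p_native_atomic_supported::Bool = false
end

# Returns true if an operation at thread_scope_system is atomic at that scope
# according to the bullet list quoted above.
function atomic_at_system_scope(a::AtomicAccess)
    a.kind == SystemAllocated && a.pageable_memory_access && return true
    a.kind == Managed && a.concurrent_managed_access && return true
    if a.kind == Mapped
        a.host_native_atomic_supported && return true
        a.is_load_or_store && a.naturally_aligned &&
            a.size_bytes in (1, 2, 4, 8, 16) && return true
    end
    if a.kind == GPUMemory && a.only_gpu_threads
        return a.p2p_native_atomic_supported || a.single_gpu
    end
    return false
end
```

Under this reading, an atomic add on managed (unified) memory shared between several GPUs is covered only when concurrentManagedAccess is 1, i.e. `AtomicAccess(kind = Managed, concurrent_managed_access = true)`.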

So lots of gotchas, but still, we should probably provide a way to alter the scope of an atomic operation. This requires:

  • figuring out exactly what additional configurability is needed
  • inspecting the PTX code generated by nvcc
  • identifying whether LLVM supports these through native atomics, NVVM intrinsics, or neither (in which case we'll need to use inline PTX assembly)

I won't have the time to look at this anytime soon, so if anybody wants to help out, gathering all that information and reporting here would be a good first step.

maleadt added the "cuda kernels" label (Stuff about writing CUDA kernels.) on Jan 11, 2025
@vchuravy
Member

Yeah, #1790 and #1644 used inline assembly for the scoped operations.

maleadt added the "help wanted" label (Extra attention is needed) on Jan 13, 2025
@efaulhaber
Contributor

efaulhaber commented Jan 13, 2025

We should investigate whether our current atomics are functional when used on unified memory that's being used from different devices (they probably aren't).

Here's an MWE to show that they aren't. I generate a bunch of random numbers and count, in an array, how many of them fall into each interval $(n - 1, n]$.

using KernelAbstractions: KernelAbstractions, @kernel, @index, get_backend
using Atomix
using CUDA

@kernel function mykernel!(interval_counter, x, indices)
    i_ = @index(Global)
    i = indices[i_]

    # Increment counter for the interval of `x[i]`
    interval = ceil(Int, x[i])
    @inbounds Atomix.@atomic interval_counter[interval] += 1
end

function run_multigpu!(interval_counter, x)
    N = length(x)

    # Partition `eachindex(x)` to the GPUs
    n_gpus = length(CUDA.devices())
    indices_split = Iterators.partition(eachindex(x), ceil(Int, N / n_gpus))
    @assert length(indices_split) <= n_gpus

    backend = get_backend(interval_counter)

    # Synchronize each device
    for i in 1:n_gpus
        CUDA.device!(i - 1)
        KernelAbstractions.synchronize(backend)
    end

    # Launch kernel on each device
    for (i, indices_) in enumerate(indices_split)
        # Select the correct device for this partition
        CUDA.device!(i - 1)

        kernel = mykernel!(backend)
        kernel(interval_counter, x, indices_, ndrange = length(indices_))
    end

    # Synchronize each device again
    for i in 1:n_gpus
        CUDA.device!(i - 1)
        KernelAbstractions.synchronize(backend)
    end
end

function test(N, unified; n_runs = 10_000)
    x = CUDA.rand(N) .* N
    x = cu(x; unified)

    interval_counter = cu(CUDA.zeros(Int, N); unified)
    interval_counter_cpu = zeros(Int, N)

    # Run on the CPU as reference
    run_multigpu!(interval_counter_cpu, Array(x))
    reference = cu(interval_counter_cpu)

    for i in 1:n_runs
        interval_counter .= 0
        run_multigpu!(interval_counter, x)

        if interval_counter != reference
            print("$i ")
        end
    end
end

On 4x H100, I get the following:

julia> test(1000, true)
4785 4786 4787 4788 4789 4790 4791 4792 4793 4794 4795 4796 4797 4798 4799 4800 4801 4802 4803 4804 4805 4806 4807 4808 4809 4810 4811 4812 4813 4814 4815 4816 4817 4818 4819 4820 4821 4822 4823 4824 4825 4826 4827 4828 4829 4830 4831 4832 4833 4834 4835 4836 4837 4838 4839 4840 4841 4842 4843 4844 4845 4846 4848 4849 4850 4852 4853 4854 4855 4856 4857 4858 4859 4860 4861 4862 4863 4864 4865 4866 4867 4868 4869 4870 4871 4872 4873 4874 4875 4876 4877 4878 4879 4880 4881 4882 4883 4884 4885 4886 4887 4888 4889 4890 4891 4892 4893 4894 4896 4897 4898 4899 4900 4901 4902 4904 4905 4906 4907 4908 4909 4910 4912 4913 4914 4915 4916 4917 4918 4919 4920 4921 4922 4923 4924 4925 4926 4927 4928 4929 4930 4931 4932 4933 4934 4935 4936 4937 4939 4940 4941 4942 4943 4944 4945 4946 4947 4948 4949 4950 4951 4952 4953 4954 4955 4956 4957 4958 4959 4960 4961 4962 4963 4964 4965 4966 4967 4968 4969 4970 4971 4972 4973 4974 4975 4976 4978 4979 4980 4981 4982 4983 4984 4985 4986 4987 4988 4989 4990 4991 4992 4993 4994 4995 4996 4997 4998 4999 5000 5001 5002 5003 5004 5005 5006 5008 5009 5010 5011 5012 5013 5014 5015 5016 5017 5018 5019 5020 5021 5022 5023 5024 5025 5026 5027 5028 5029 5030 5031 5032 5033 5034 5035 5036 5037 5038 5039 5040 5042 5043 5044 5045 5046 5047 5048 5049 5050 5051 5052 5053 5054 5055 5056 5057 5058 5059 5060 5061 5062 5063 5064 5065 5066 5067 5068 5069 5070 5071 5072 5073 5074 5075 5076 5077 5078 5079 5080 5081 5082 5083 5084 5085 5086 5087 5088 5089 5090 5091 5092 5093 5094 5095 5096 5097 5098 5099 5100 5101 5102 5103 5104 5105 5106 5107 5108 5109 5110 5111 5112 5113 5114 5115 5116 5117 5118 5119 5120 5121 5122 5123 5124 5125 5126 5127 5128 5129 5130 5131 5132 5133 5134 5136 5137 5138 5139 5140 5141 5142 5143 5144 5145 5146 5147 5148 5149 5150 5151 5152 5153 5154 5155 5156 5157 5158 5159 5160 5161 5162 5163 5164 5165 5166 5167 5168 5169 5170 5171 5172 5173 5174 5175 5176 5177 5178 5179 5180 5181 5182 5183 5184 5186 5187 5188 5189 5190 5191 5192 5193 5194 5195 
5196 5198 5199 5200 5201 5202 5203 5204 5205 5206 5207 5208 5209 5210 5211 5212 5213 5214 5215 5216 5217 5218 5219 5220 5221 5222 5223 5224 5225 5226 5227 5228 5229 5230 5231 5232 5233 5234 5235 5236 5237 5238 5240 5241 5242 
julia> test(1000, true)
2390 2391 2393 2394 2395 2397 2398 2399 2400 2401 2402 2403 2405 2406 2407 2409 2410 2411 2412 2413 2414 2415 2417 2418 2419 2420 2421 2422 2423 2424 2425 2426 2427 2428 2429 2430 2431 2433 2434 2435 2437 2438 2439 2442 2443 2445 2446 2449 2450 2451 2452 2453 2454 2455 2457 2458 2459 2460 2461 2462 2463 2464 2465 2466 2467 2468 2469 2470 2471 2473 2474 2475 2476 2477 2478 2479 2480 2481 2482 2483 2484 2485 2486 2487 2489 2490 2491 2492 2493 2494 2495 2497 2498 2499 2501 2502 2503 2504 2505 2506 2507 2509 2510 2511 2513 2514 2515 2516 2517 2518 2519 2520 2521 2522 2523 2525 2526 2527 2529 2530 2531 2532 2533 2534 2535 2537 2538 2539 2541 2542 2543 2544 2545 2546 2547 2548 2549 2550 2551 2553 2554 2555 2556 2557 2558 2559 2561 2562 2563 2566 2567 2568 2569 2570 2571 2573 2574 2575 2576 2577 2578 2579 2581 2582 2583 2584 2585 2586 2587 2588 2589 2590 2593 2594 2595 2597 2598 2599 2601 2602 2603 2605 2607 2609 2610 2611 2612 2613 2614 2615 2616 2617 2618 2619 2620 2621 2622 2623 2625 2626 2627 2628 2629 2630 2631 2632 2633 2634 2635 2636 2637 2638 2639 2641 2642 2643 2644 2645 2646 2647 2648 2649 2650 2651 2652 2654 2655 2656 2657 2659 2661 2662 2663 2665 2666 2667 2669 2670 2671 2672 2673 2674 2675 2677 2678 2679 2680 2681 2682 2683 2685 2686 2687 2688 2689 2690 2691 2692 2693 2694 2695 2696 2697 2698 2699 2701 2702 2703 2704 2705 2706 2707 2708 2709 2710 2711 2712 2713 2714 2715 2716 2717 2718 2719 2720 2721 2722 2723 2724 2725 2726 2727 2728 2729 2730 2731 2732 2733 2734 2735 2736 2737 2738 2739 2740 2741 2742 2743 2745 2746 2747 2748 2750 2753 2754 2755 2756 2757 2758 2759 2761 2762 2765 2766 2767 2769 2770 2771 2772 2773 2774 2775 2777 2778 2779 2781 2782 2783 2784 2785 2786 2787 2788 2789 2790 2793 2794 2795 2796 2797 2798 2799 2801 2802 2803 2805 2806 2807 2810 2811 2812 2813 2814 2815 2817 2819 2820 2821 2822 2823 2824 2825 2826 2827 2828 2829 2830 2831 2833 2834 2835 2836 2837 2838 2839 2841 2842 2843 2845 2846 2847 2849 2851 2852 2854 2855 2857 2858 2859 2860 
2861 2862 2863 2864 3394 3396 3397 3398 3400 3401 3403 3405 3406 3407 3409 3411 3412 3413 3414 3415 3416 3417 3418 3420 3421 3422 3423 3424 3426 3427 3429 3431 3432 3433 3434 3435 3436 3437 3438 3439 3440 3441 3442 3444 3445 3446 3447 3449 3450 3452 3454 3456 3457 3458 3459 3460 3461 3462 3464 3466 3468 3469 3470 3471 3472 3473 3474 3476 3477 3478 3480 3481 3482 3483 3484 3485 3486 3487 3488 3489 3490 3491 3492 3493 3494 3495 3496 3497 3498 3500 3501 3502 3503 3506 3507 3509 3511 3512 3513 3515 3516 3517 3518 3519 3520 3522 3523 3524 3525 3526 3528 3529 3531 3532 3533 3534 3535 3536 3537 3538 3539 3540 3541 3542 3544 3545 3547 3548 3549 3550 3551 3552 3553 3554 3555 3556 3558 3559 3560 3561 3562 3563 3564 3565 3566 3567 3568 3569 3570 3571 3573 3574 3575 3576 3577 3578 3579 3580 3581 3582 3583 3584 3586 3587 3588 3589 3590 3591 3593 3594 3595 3597 3598 3599 3600 3602 3603 3604 3605 3606 3607 3608 3609 3610 3611 3612 3613 3614 3615 3616 3618 3619 3620 3621 3622 3623 3624 3625 3626 3627 3628 3629 3630 3631 3632 3633 3634 3635 3636 3638 3640 3641 3643 3644 3645 3646 3647 3649 3650 3651 3652 3653 3654 3655 3656 3657 3658 3659 3660 3661 3662 3663 3666 3667 3668 3669 3670 3671 3672 3673 3674 3675 3676 3677 3678 3680 3682 3683 3685 3686 3687 3689 3691 3692 3694 3695 3698 3699 3700 3701 3702 3703 3704 3705 3706 3707 3708 3709 3711 3712 3713 3715 3716 3717 3718 3719 3720 3721 3722 3724 3725 3726 3727 3728 3729 3730 3731 3732 3733 3734 3735 3736 3737 3738 3739 3740 3741 3744 3745 3746 3747 3749 3751 3753 3754 3755 3756 3757 3758 3759 3761 3763 3764 3765 3766 3767 3768 3769 3770 3771 3772 3774 3775 3776 3777 3778 3779 3780 3781 3783 3784 3785 3787 3788 3789 3790 3791 3792 3793 3794 3795 3797 3798 3799 3801 3802 3803 3805 3806 3808 3809 3810 3811 3813 3814 3815 3818 3819 3820 3821 3822 3823 
julia> test(1000, true)
3995 3996 3998 3999 4003 4007 4011 4012 4015 4018 4023 4024 4026 4027 4028 4031 4032 4035 4036 4039 4043 4048 4049 4050 4051 4052 4055 4058 4060 4063 4066 4067 4068 4069 4071 4072 4073 4075 4076 4079 4081 4083 4084 4087 4088 4091 4094 4095 4097 4099 4103 4106 4107 4108 4113 4115 4116 4119 4123 4124 4127 4128 4131 4135 4136 4137 4139 4140 4143 4147 4149 4151 4155 4156 4159 4163 4164 4168 4171 4174 4179 4183 4188 4191 4192 4193 4195 4196 4199 4203 4204 4205 4206 4207 4210 4211 4212 4213 4214 4215 4219 4220 4223 4227 4231 4235 4238 4239 4243 4244 4245 4246 4247 4249 4251 4255 4260 4263 4267 4268 4271 4274 4275 4279 4280 4283 4284 4287 4291 4292 4296 4300 4303 4307 4310 4311 4312 4313 4315 4316 4319 4323 4324 4327 4329 4331 4335 4339 4340 4341 4343 4344 4346 4347 4348 4351 4355 4357 4358 4359 4360 4361 4367 4368 4371 4372 4373 4375 4377 4379 4380 4383 4387 4388 4391 4392 4394 4395 4396 4398 4399 4400 4402 4403 4404 4407 4411 4412 4415 4418 4419 4423 4424 4425 4426 4431 4432 4435 4436 4439 4440 4442 4443 4444 4447 4451 4452 4455 4456 4457 4459 4460 4465 4466 4467 4468 4471 4475 4476 4479 4487 4491 4496 4499 4502 4507 4508 4511 
julia> test(1000, true)

julia> test(1000, true)
44 139 215 283 319 372 439 455 495 
julia> test(1000, false)

julia> test(1000, false)

julia> test(1000, false)

julia> test(1000, false)

julia> test(1000, false)

julia> test(1000, false)

So for unified memory I get some nondeterministic failures, which I don't get with unified = false.

Note that I used this branch https://github.com/efaulhaber/CUDA.jl/tree/disable-prefetch to disable prefetching of unified memory and synchronization when such an array is accessed from another stream. Otherwise, the kernels would run serially on the GPUs. See #2615.

@efaulhaber
Contributor

@vchuravy wrote a workaround for an atomic add:

function atomic_system_add(ptr::CUDA.LLVMPtr{Int64, CUDA.AS.Global}, val::Int64)
    CUDA.LLVM.Interop.@asmcall(
        "atom.sys.global.add.u64 \$0, [\$1], \$2;",
        "=l,l,l,~{memory}",
        true, Int64, Tuple{CUDA.LLVMPtr{Int64, CUDA.AS.Global}, Int64},
        ptr, val
    )
end

Or, for Int32:

function atomic_system_add(ptr::CUDA.LLVMPtr{Int32, CUDA.AS.Global}, val::Int32)
    CUDA.LLVM.Interop.@asmcall(
        "atom.sys.global.add.u32 \$0, [\$1], \$2;",
        "=r,l,r,~{memory}",
        true, Int32, Tuple{CUDA.LLVMPtr{Int32, CUDA.AS.Global}, Int32},
        ptr, val
    )
end

The MWE above works with this.
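
The two workarounds differ only in the PTX type suffix (`u64` vs. `u32`) and the register constraint (`l` vs. `r`), and the scope is the `.sys` qualifier in the instruction. As a rough sketch of what configurable scope could look like, the helper below builds the assembly and constraint strings for a given element type and scope. `PTX_SCOPES`, `ptx_type`, and `scoped_atomic_add_asm` are hypothetical names, not part of CUDA.jl; the mapping of block/device/system to `.cta`/`.gpu`/`.sys` follows the PTX scope qualifiers.

```julia
# Hypothetical helpers, not part of CUDA.jl: build the PTX instruction and the
# inline-assembly constraint string for a scoped atomic add on global memory.

# PTX scope qualifiers matching the _block / unsuffixed / _system CUDA C variants.
const PTX_SCOPES = Dict(:block => "cta", :device => "gpu", :system => "sys")

# PTX type suffix and register constraint letter per element type.
ptx_type(::Type{Int32}) = ("u32", "r")
ptx_type(::Type{Int64}) = ("u64", "l")

function scoped_atomic_add_asm(::Type{T}, scope::Symbol) where {T}
    haskey(PTX_SCOPES, scope) || throw(ArgumentError("unknown scope: $scope"))
    suffix, reg = ptx_type(T)
    # Result register, 64-bit address register, operand register, memory clobber.
    asm = "atom.$(PTX_SCOPES[scope]).global.add.$suffix \$0, [\$1], \$2;"
    constraints = "=$reg,l,$reg,~{memory}"
    return asm, constraints
end
```

For `Int64` and `:system` this reproduces the strings used in the first workaround. Since `@asmcall` appears to take its assembly string as a literal at macro-expansion time, these strings would probably have to be spliced in via `@eval` or a generated function rather than computed inside the kernel; that part is left open here.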
