Elide bounds checks when kernels contain manual ones. #2621

Merged: maleadt merged 1 commit into master from tb/elide_boundscheck on Jan 16, 2025

Conversation

maleadt (Member) commented on Jan 16, 2025

I noticed that the following dead-simple kernel contains a redundant bounds check:

function kernel(A)
    idx = threadIdx().x
    if idx <= length(A) # manual bounds check
        A[idx] = 1      # implicit, redundant bounds check
    end
    return
end

A = CuArray{Int}(undef, 2)
@cuda threads=2 kernel(A)
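
The IR below can be reproduced with CUDA.jl's reflection macros; the PR doesn't show the exact invocation, but something like the following works:

@device_code_llvm @cuda threads=2 kernel(A)

Note that there are two comparisons: the manual one against the stored length, and the one emitted by getindex against the dimensions: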
define ptx_kernel void @_Z6kernel13CuDeviceArrayI5Int64Li1ELi1EE({ i64, i32 } %state, { i8 addrspace(1)*, i64, [1 x i64], i64 } %0) local_unnamed_addr {
conversion:
  %.fca.0.extract = extractvalue { i8 addrspace(1)*, i64, [1 x i64], i64 } %0, 0
  %.fca.3.extract = extractvalue { i8 addrspace(1)*, i64, [1 x i64], i64 } %0, 3
  %1 = call i32 @llvm.nvvm.read.ptx.sreg.tid.x()
  %2 = add nuw nsw i32 %1, 1
  %3 = zext i32 %2 to i64
  %.not = icmp slt i64 %.fca.3.extract, %3
  br i1 %.not, label %L34, label %L14

L14:                                              ; preds = %conversion
  %.fca.2.0.extract = extractvalue { i8 addrspace(1)*, i64, [1 x i64], i64 } %0, 2, 0
  %.not1 = icmp slt i64 %.fca.2.0.extract, %3
  br i1 %.not1, label %L24, label %L27

L24:                                              ; preds = %L14
  call fastcc void @julia_throw_boundserror_22792({ i64, i32 } %state)
  call void @llvm.trap()
  call void asm sideeffect "exit;", ""()
  unreachable

L27:                                              ; preds = %L14
  %4 = bitcast i8 addrspace(1)* %.fca.0.extract to i64 addrspace(1)*
  %5 = zext i32 %1 to i64
  %6 = getelementptr inbounds i64, i64 addrspace(1)* %4, i64 %5
  store i64 1, i64 addrspace(1)* %6, align 8
  br label %L34

L34:                                              ; preds = %L27, %conversion
  ret void
}

The reason is that we store both dims and len = prod(dims), to avoid having to compute prod(dims) at run time. The manual check above compares against the len field (%.fca.3.extract), while the implicit check in getindex compares against dims (%.fca.2.0.extract), and LLVM has no way to prove that the two fields are equal, so it cannot eliminate the second check. Let's avoid that by making sure size(A::CuDeviceVector) returns len.
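
The essence of the fix is a one-line method (the same line appears in the review diff further down):

# we store the array length too; computing prod(size) is expensive
Base.size(g::CuDeviceArray{<:Any,1}) = (g.len,)

With that, both checks compare against the same len field, and LLVM folds them into one: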

define ptx_kernel void @_Z6kernel13CuDeviceArrayI5Int64Li1ELi1EE({ i64, i32 } %state, { i8 addrspace(1)*, i64, [1 x i64], i64 } %0) local_unnamed_addr {
conversion:
  %.fca.3.extract = extractvalue { i8 addrspace(1)*, i64, [1 x i64], i64 } %0, 3
  %1 = call i32 @llvm.nvvm.read.ptx.sreg.tid.x()
  %2 = add nuw nsw i32 %1, 1
  %3 = zext i32 %2 to i64
  %.not = icmp slt i64 %.fca.3.extract, %3
  br i1 %.not, label %L32, label %L25

L25:                                              ; preds = %conversion
  %.fca.0.extract = extractvalue { i8 addrspace(1)*, i64, [1 x i64], i64 } %0, 0
  %4 = bitcast i8 addrspace(1)* %.fca.0.extract to i64 addrspace(1)*
  %5 = zext i32 %1 to i64
  %6 = getelementptr inbounds i64, i64 addrspace(1)* %4, i64 %5
  store i64 1, i64 addrspace(1)* %6, align 8
  br label %L32

L32:                                              ; preds = %L25, %conversion
  ret void
}

I then realized the same optimization should also kick in for N-dimensional inputs, at least when we index linearly:

A = CuArray{Int}(undef, 2, 1)
@cuda threads=2 kernel(A)

Yet here the redundant implicit check survives:
define ptx_kernel void @_Z6kernel13CuDeviceArrayI5Int64Li2ELi1EE({ i64, i32 } %state, { i8 addrspace(1)*, i64, [2 x i64], i64 } %0) local_unnamed_addr {
conversion:
  %.fca.0.extract = extractvalue { i8 addrspace(1)*, i64, [2 x i64], i64 } %0, 0
  %.fca.3.extract = extractvalue { i8 addrspace(1)*, i64, [2 x i64], i64 } %0, 3
  %1 = call i32 @llvm.nvvm.read.ptx.sreg.tid.x()
  %2 = add nuw nsw i32 %1, 1
  %3 = zext i32 %2 to i64
  %.not = icmp slt i64 %.fca.3.extract, %3
  br i1 %.not, label %L34, label %L14

L14:                                              ; preds = %conversion
  %4 = call i64 @llvm.smax.i64(i64 %.fca.3.extract, i64 0)
  %.not3 = icmp ult i64 %4, %3
  br i1 %.not3, label %L24, label %L27

L24:                                              ; preds = %L14
  call fastcc void @julia_throw_boundserror_23439({ i64, i32 } %state)
  call void @llvm.trap()
  call void asm sideeffect "exit;", ""()
  unreachable

L27:                                              ; preds = %L14
  %5 = bitcast i8 addrspace(1)* %.fca.0.extract to i64 addrspace(1)*
  %6 = zext i32 %1 to i64
  %7 = getelementptr inbounds i64, i64 addrspace(1)* %5, i64 %6
  store i64 1, i64 addrspace(1)* %7, align 8
  br label %L34

L34:                                              ; preds = %L27, %conversion
  ret void
}

The problem here is that the generic definition of checkbounds(A, idx) first constructs a OneTo range before doing the bounds check. That constructor calls max (visible as the llvm.smax.i64 call above), again breaking LLVM's ability to eliminate the second check. We can avoid that by using a simplified bounds check based on length:
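
A minimal sketch of such a check, assuming a device-side throw_boundserror helper like the one visible in the IR (the method the PR actually adds may be spelled differently):

# compare the linear index against the stored length directly,
# instead of materializing the OneTo axes (whose constructor calls max)
@inline function Base.checkbounds(A::CuDeviceArray, I::Integer)
    if !(1 <= I <= length(A))
        throw_boundserror(A, I)
    end
    nothing
end

Since length(A) reads the stored len field, this is the very comparison the manual check already performs: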

define ptx_kernel void @_Z6kernel13CuDeviceArrayI5Int64Li2ELi1EE({ i64, i32 } %state, { i8 addrspace(1)*, i64, [2 x i64], i64 } %0) local_unnamed_addr {
conversion:
  %.fca.3.extract = extractvalue { i8 addrspace(1)*, i64, [2 x i64], i64 } %0, 3
  %1 = call i32 @llvm.nvvm.read.ptx.sreg.tid.x()
  %2 = add nuw nsw i32 %1, 1
  %3 = zext i32 %2 to i64
  %.not = icmp slt i64 %.fca.3.extract, %3
  br i1 %.not, label %L29, label %L22

L22:                                              ; preds = %conversion
  %.fca.0.extract = extractvalue { i8 addrspace(1)*, i64, [2 x i64], i64 } %0, 0
  %4 = bitcast i8 addrspace(1)* %.fca.0.extract to i64 addrspace(1)*
  %5 = zext i32 %1 to i64
  %6 = getelementptr inbounds i64, i64 addrspace(1)* %4, i64 %5
  store i64 1, i64 addrspace(1)* %6, align 8
  br label %L29

L29:                                              ; preds = %L22, %conversion
  ret void
}

This change should be ported to other back-ends as well.
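
For reference, the PR's test (partially visible in the review below) generates the kernel IR for each dimensionality and inspects it. A condensed sketch of such a check, assuming using Test and AS = CUDA.AS, and assuming the assertion simply greps the IR for the bounds-error call (the exact predicate isn't shown):

for N in 1:3
    ir = sprint(io -> CUDA.code_llvm(io, kernel, Tuple{CuDeviceArray{Int,N,AS.Global}}))
    @test !occursin("throw_boundserror", ir)
end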

maleadt added the cuda kernels (Stuff about writing CUDA kernels.) and performance (How fast can we go?) labels on Jan 16, 2025
github-actions bot (Contributor) left a comment:

Some suggestions could not be made:

  • test/base/exceptions.jl
    • line 54

@@ -45,6 +45,7 @@ Base.size(g::CuDeviceArray) = g.dims
Base.sizeof(x::CuDeviceArray) = Base.elsize(x) * length(x)

# we store the array length too; computing prod(size) is expensive
Base.size(g::CuDeviceArray{<:Any,1}) = (g.len,)

Suggested change:
- Base.size(g::CuDeviceArray{<:Any,1}) = (g.len,)
+ Base.size(g::CuDeviceArray{<:Any, 1}) = (g.len,)

end

for N in 1:3
    ir = sprint(io->CUDA.code_llvm(io, kernel, Tuple{CuDeviceArray{Int,N,AS.Global}}))

Suggested change:
- ir = sprint(io->CUDA.code_llvm(io, kernel, Tuple{CuDeviceArray{Int,N,AS.Global}}))
+ ir = sprint(io -> CUDA.code_llvm(io, kernel, Tuple{CuDeviceArray{Int, N, AS.Global}}))

github-actions bot (Contributor) left a comment:

CUDA.jl Benchmarks

Benchmark suite Current: 833d88c Previous: 6ef1a3d Ratio
latency/precompile 45630280118.5 ns 45591768926.5 ns 1.00
latency/ttfp 6399804347 ns 6426009567 ns 1.00
latency/import 3051033362 ns 3058100713.5 ns 1.00
integration/volumerhs 9554704 ns 9568734 ns 1.00
integration/byval/slices=1 146554 ns 146515 ns 1.00
integration/byval/slices=3 425223 ns 425368 ns 1.00
integration/byval/reference 144582 ns 144702 ns 1.00
integration/byval/slices=2 286089 ns 285715 ns 1.00
integration/cudadevrt 103302 ns 103329 ns 1.00
kernel/indexing 14030 ns 14034 ns 1.00
kernel/indexing_checked 14606.5 ns 15106 ns 0.97
kernel/occupancy 708.5138888888889 ns 685.4019607843138 ns 1.03
kernel/launch 2120.222222222222 ns 2159.8 ns 0.98
kernel/rand 14375 ns 14789 ns 0.97
array/reverse/1d 19533.5 ns 19194.5 ns 1.02
array/reverse/2d 25388 ns 24852 ns 1.02
array/reverse/1d_inplace 11034 ns 10390.5 ns 1.06
array/reverse/2d_inplace 11382 ns 10880.5 ns 1.05
array/copy 20725 ns 20484 ns 1.01
array/iteration/findall/int 158683 ns 157153 ns 1.01
array/iteration/findall/bool 138574 ns 138367 ns 1.00
array/iteration/findfirst/int 154177 ns 153645 ns 1.00
array/iteration/findfirst/bool 155057 ns 154449.5 ns 1.00
array/iteration/scalar 75390 ns 76270.5 ns 0.99
array/iteration/logical 213676 ns 212097.5 ns 1.01
array/iteration/findmin/1d 41191 ns 41115 ns 1.00
array/iteration/findmin/2d 94141 ns 94032 ns 1.00
array/reductions/reduce/1d 39322 ns 44665.5 ns 0.88
array/reductions/reduce/2d 49910 ns 47524.5 ns 1.05
array/reductions/mapreduce/1d 38339.5 ns 43267 ns 0.89
array/reductions/mapreduce/2d 41209.5 ns 48108 ns 0.86
array/broadcast 21517 ns 21418 ns 1.00
array/copyto!/gpu_to_gpu 13432 ns 13538 ns 0.99
array/copyto!/cpu_to_gpu 211866 ns 209653 ns 1.01
array/copyto!/gpu_to_cpu 245719 ns 244981.5 ns 1.00
array/accumulate/1d 108909 ns 108447 ns 1.00
array/accumulate/2d 80056 ns 79574 ns 1.01
array/construct 1241.15 ns 1276.2 ns 0.97
array/random/randn/Float32 43668.5 ns 43554 ns 1.00
array/random/randn!/Float32 26414 ns 26207 ns 1.01
array/random/rand!/Int64 27095 ns 26920.5 ns 1.01
array/random/rand!/Float32 8588 ns 8620.666666666666 ns 1.00
array/random/rand/Int64 29966 ns 29690 ns 1.01
array/random/rand/Float32 12944 ns 12869 ns 1.01
array/permutedims/4d 67295 ns 67200 ns 1.00
array/permutedims/2d 56548 ns 56608 ns 1.00
array/permutedims/3d 59273 ns 59069 ns 1.00
array/sorting/1d 2775407 ns 2931887.5 ns 0.95
array/sorting/by 3367379 ns 3498332 ns 0.96
array/sorting/2d 1085044 ns 1084207 ns 1.00
cuda/synchronization/stream/auto 1025.5454545454545 ns 1042.2 ns 0.98
cuda/synchronization/stream/nonblocking 6538.8 ns 6635 ns 0.99
cuda/synchronization/stream/blocking 850.8589743589744 ns 802.6082474226804 ns 1.06
cuda/synchronization/context/auto 1217.7 ns 1181.2 ns 1.03
cuda/synchronization/context/nonblocking 6761.2 ns 6839.8 ns 0.99
cuda/synchronization/context/blocking 911.3333333333334 ns 904.8229166666667 ns 1.01

This comment was automatically generated by a workflow using github-action-benchmark.


codecov bot commented on Jan 16, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 73.60%. Comparing base (6ef1a3d) to head (833d88c).
Report is 1 commit behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #2621   +/-   ##
=======================================
  Coverage   73.60%   73.60%           
=======================================
  Files         157      157           
  Lines       15230    15230           
=======================================
  Hits        11210    11210           
  Misses       4020     4020           


maleadt merged commit d07a245 into master on Jan 16, 2025
3 checks passed
maleadt deleted the tb/elide_boundscheck branch on January 16, 2025 at 12:39