Elide bounds checks when kernels contain manual ones. #2621

Merged: maleadt merged 1 commit into master from tb/elide_boundscheck on Jan 16, 2025

Conversation

maleadt (Member) commented on Jan 16, 2025

I noticed that the following dead-simple kernel contains a redundant bounds check:

function kernel(A)
    idx = threadIdx().x
    if idx <= length(A) # manual bounds check
        A[idx] = 1      # implicit, redundant bounds check
    end
    return
end

A = CuArray{Int}(undef, 2)
@cuda threads=2 kernel(A)
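
The IR below can be reproduced with CUDA.jl's reflection macros; the PR doesn't show the exact invocation, but something like the following works:

@device_code_llvm @cuda threads=2 kernel(A)

Note that there are two comparisons: the manual one against the stored length, and the one emitted by getindex against the dimensions: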
define ptx_kernel void @_Z6kernel13CuDeviceArrayI5Int64Li1ELi1EE({ i64, i32 } %state, { i8 addrspace(1)*, i64, [1 x i64], i64 } %0) local_unnamed_addr {
conversion:
  %.fca.0.extract = extractvalue { i8 addrspace(1)*, i64, [1 x i64], i64 } %0, 0
  %.fca.3.extract = extractvalue { i8 addrspace(1)*, i64, [1 x i64], i64 } %0, 3
  %1 = call i32 @llvm.nvvm.read.ptx.sreg.tid.x()
  %2 = add nuw nsw i32 %1, 1
  %3 = zext i32 %2 to i64
  %.not = icmp slt i64 %.fca.3.extract, %3
  br i1 %.not, label %L34, label %L14

L14:                                              ; preds = %conversion
  %.fca.2.0.extract = extractvalue { i8 addrspace(1)*, i64, [1 x i64], i64 } %0, 2, 0
  %.not1 = icmp slt i64 %.fca.2.0.extract, %3
  br i1 %.not1, label %L24, label %L27

L24:                                              ; preds = %L14
  call fastcc void @julia_throw_boundserror_22792({ i64, i32 } %state)
  call void @llvm.trap()
  call void asm sideeffect "exit;", ""()
  unreachable

L27:                                              ; preds = %L14
  %4 = bitcast i8 addrspace(1)* %.fca.0.extract to i64 addrspace(1)*
  %5 = zext i32 %1 to i64
  %6 = getelementptr inbounds i64, i64 addrspace(1)* %4, i64 %5
  store i64 1, i64 addrspace(1)* %6, align 8
  br label %L34

L34:                                              ; preds = %L27, %conversion
  ret void
}

The reason is that we store both dims and len = prod(dims), to avoid having to compute prod(dims) at run time. The manual check above compares against the len field (%.fca.3.extract), while the implicit check in getindex compares against dims (%.fca.2.0.extract), and LLVM has no way to prove that the two fields are equal, so it cannot eliminate the second check. Let's avoid that by making sure size(A::CuDeviceVector) returns len.
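
The essence of the fix is a one-line method (the same line appears in the review diff further down):

# we store the array length too; computing prod(size) is expensive
Base.size(g::CuDeviceArray{<:Any,1}) = (g.len,)

With that, both checks compare against the same len field, and LLVM folds them into one: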

define ptx_kernel void @_Z6kernel13CuDeviceArrayI5Int64Li1ELi1EE({ i64, i32 } %state, { i8 addrspace(1)*, i64, [1 x i64], i64 } %0) local_unnamed_addr {
conversion:
  %.fca.3.extract = extractvalue { i8 addrspace(1)*, i64, [1 x i64], i64 } %0, 3
  %1 = call i32 @llvm.nvvm.read.ptx.sreg.tid.x()
  %2 = add nuw nsw i32 %1, 1
  %3 = zext i32 %2 to i64
  %.not = icmp slt i64 %.fca.3.extract, %3
  br i1 %.not, label %L32, label %L25

L25:                                              ; preds = %conversion
  %.fca.0.extract = extractvalue { i8 addrspace(1)*, i64, [1 x i64], i64 } %0, 0
  %4 = bitcast i8 addrspace(1)* %.fca.0.extract to i64 addrspace(1)*
  %5 = zext i32 %1 to i64
  %6 = getelementptr inbounds i64, i64 addrspace(1)* %4, i64 %5
  store i64 1, i64 addrspace(1)* %6, align 8
  br label %L32

L32:                                              ; preds = %L25, %conversion
  ret void
}

I then realized the same optimization should also kick in for N-dimensional inputs, at least when we index linearly:

A = CuArray{Int}(undef, 2, 1)
@cuda threads=2 kernel(A)

Yet here the redundant implicit check survives:
define ptx_kernel void @_Z6kernel13CuDeviceArrayI5Int64Li2ELi1EE({ i64, i32 } %state, { i8 addrspace(1)*, i64, [2 x i64], i64 } %0) local_unnamed_addr {
conversion:
  %.fca.0.extract = extractvalue { i8 addrspace(1)*, i64, [2 x i64], i64 } %0, 0
  %.fca.3.extract = extractvalue { i8 addrspace(1)*, i64, [2 x i64], i64 } %0, 3
  %1 = call i32 @llvm.nvvm.read.ptx.sreg.tid.x()
  %2 = add nuw nsw i32 %1, 1
  %3 = zext i32 %2 to i64
  %.not = icmp slt i64 %.fca.3.extract, %3
  br i1 %.not, label %L34, label %L14

L14:                                              ; preds = %conversion
  %4 = call i64 @llvm.smax.i64(i64 %.fca.3.extract, i64 0)
  %.not3 = icmp ult i64 %4, %3
  br i1 %.not3, label %L24, label %L27

L24:                                              ; preds = %L14
  call fastcc void @julia_throw_boundserror_23439({ i64, i32 } %state)
  call void @llvm.trap()
  call void asm sideeffect "exit;", ""()
  unreachable

L27:                                              ; preds = %L14
  %5 = bitcast i8 addrspace(1)* %.fca.0.extract to i64 addrspace(1)*
  %6 = zext i32 %1 to i64
  %7 = getelementptr inbounds i64, i64 addrspace(1)* %5, i64 %6
  store i64 1, i64 addrspace(1)* %7, align 8
  br label %L34

L34:                                              ; preds = %L27, %conversion
  ret void
}

The problem here is that the generic definition of checkbounds(A, idx) first constructs a OneTo range before doing the bounds check. That constructor calls max (visible as the llvm.smax.i64 call above), again breaking LLVM's ability to eliminate the second check. We can avoid that by using a simplified bounds check based on length:
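
A minimal sketch of such a check, assuming a device-side throw_boundserror helper like the one visible in the IR (the method the PR actually adds may be spelled differently):

# compare the linear index against the stored length directly,
# instead of materializing the OneTo axes (whose constructor calls max)
@inline function Base.checkbounds(A::CuDeviceArray, I::Integer)
    if !(1 <= I <= length(A))
        throw_boundserror(A, I)
    end
    nothing
end

Since length(A) reads the stored len field, this is the very comparison the manual check already performs: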

define ptx_kernel void @_Z6kernel13CuDeviceArrayI5Int64Li2ELi1EE({ i64, i32 } %state, { i8 addrspace(1)*, i64, [2 x i64], i64 } %0) local_unnamed_addr {
conversion:
  %.fca.3.extract = extractvalue { i8 addrspace(1)*, i64, [2 x i64], i64 } %0, 3
  %1 = call i32 @llvm.nvvm.read.ptx.sreg.tid.x()
  %2 = add nuw nsw i32 %1, 1
  %3 = zext i32 %2 to i64
  %.not = icmp slt i64 %.fca.3.extract, %3
  br i1 %.not, label %L29, label %L22

L22:                                              ; preds = %conversion
  %.fca.0.extract = extractvalue { i8 addrspace(1)*, i64, [2 x i64], i64 } %0, 0
  %4 = bitcast i8 addrspace(1)* %.fca.0.extract to i64 addrspace(1)*
  %5 = zext i32 %1 to i64
  %6 = getelementptr inbounds i64, i64 addrspace(1)* %4, i64 %5
  store i64 1, i64 addrspace(1)* %6, align 8
  br label %L29

L29:                                              ; preds = %L22, %conversion
  ret void
}

This change should be ported to other back-ends as well.
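
For reference, the PR's test (partially visible in the review below) generates the kernel IR for each dimensionality and inspects it. A condensed sketch of such a check, assuming using Test and AS = CUDA.AS, and assuming the assertion simply greps the IR for the bounds-error call (the exact predicate isn't shown):

for N in 1:3
    ir = sprint(io -> CUDA.code_llvm(io, kernel, Tuple{CuDeviceArray{Int,N,AS.Global}}))
    @test !occursin("throw_boundserror", ir)
end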

maleadt added the cuda kernels (Stuff about writing CUDA kernels.) and performance (How fast can we go?) labels on Jan 16, 2025
github-actions bot (Contributor) left a comment:

Some suggestions could not be made:

  • test/base/exceptions.jl
    • line 54

@@ -45,6 +45,7 @@ Base.size(g::CuDeviceArray) = g.dims
Base.sizeof(x::CuDeviceArray) = Base.elsize(x) * length(x)

# we store the array length too; computing prod(size) is expensive
Base.size(g::CuDeviceArray{<:Any,1}) = (g.len,)

Suggested change:
- Base.size(g::CuDeviceArray{<:Any,1}) = (g.len,)
+ Base.size(g::CuDeviceArray{<:Any, 1}) = (g.len,)

end

for N in 1:3
    ir = sprint(io->CUDA.code_llvm(io, kernel, Tuple{CuDeviceArray{Int,N,AS.Global}}))

Suggested change:
- ir = sprint(io->CUDA.code_llvm(io, kernel, Tuple{CuDeviceArray{Int,N,AS.Global}}))
+ ir = sprint(io -> CUDA.code_llvm(io, kernel, Tuple{CuDeviceArray{Int, N, AS.Global}}))

github-actions bot (Contributor) left a comment:

CUDA.jl Benchmarks

Benchmark suite Current: 833d88c Previous: 6ef1a3d Ratio
latency/precompile 45630280118.5 ns 45591768926.5 ns 1.00
latency/ttfp 6399804347 ns 6426009567 ns 1.00
latency/import 3051033362 ns 3058100713.5 ns 1.00
integration/volumerhs 9554704 ns 9568734 ns 1.00
integration/byval/slices=1 146554 ns 146515 ns 1.00
integration/byval/slices=3 425223 ns 425368 ns 1.00
integration/byval/reference 144582 ns 144702 ns 1.00
integration/byval/slices=2 286089 ns 285715 ns 1.00
integration/cudadevrt 103302 ns 103329 ns 1.00
kernel/indexing 14030 ns 14034 ns 1.00
kernel/indexing_checked 14606.5 ns 15106 ns 0.97
kernel/occupancy 708.5138888888889 ns 685.4019607843138 ns 1.03
kernel/launch 2120.222222222222 ns 2159.8 ns 0.98
kernel/rand 14375 ns 14789 ns 0.97
array/reverse/1d 19533.5 ns 19194.5 ns 1.02
array/reverse/2d 25388 ns 24852 ns 1.02
array/reverse/1d_inplace 11034 ns 10390.5 ns 1.06
array/reverse/2d_inplace 11382 ns 10880.5 ns 1.05
array/copy 20725 ns 20484 ns 1.01
array/iteration/findall/int 158683 ns 157153 ns 1.01
array/iteration/findall/bool 138574 ns 138367 ns 1.00
array/iteration/findfirst/int 154177 ns 153645 ns 1.00
array/iteration/findfirst/bool 155057 ns 154449.5 ns 1.00
array/iteration/scalar 75390 ns 76270.5 ns 0.99
array/iteration/logical 213676 ns 212097.5 ns 1.01
array/iteration/findmin/1d 41191 ns 41115 ns 1.00
array/iteration/findmin/2d 94141 ns 94032 ns 1.00
array/reductions/reduce/1d 39322 ns 44665.5 ns 0.88
array/reductions/reduce/2d 49910 ns 47524.5 ns 1.05
array/reductions/mapreduce/1d 38339.5 ns 43267 ns 0.89
array/reductions/mapreduce/2d 41209.5 ns 48108 ns 0.86
array/broadcast 21517 ns 21418 ns 1.00
array/copyto!/gpu_to_gpu 13432 ns 13538 ns 0.99
array/copyto!/cpu_to_gpu 211866 ns 209653 ns 1.01
array/copyto!/gpu_to_cpu 245719 ns 244981.5 ns 1.00
array/accumulate/1d 108909 ns 108447 ns 1.00
array/accumulate/2d 80056 ns 79574 ns 1.01
array/construct 1241.15 ns 1276.2 ns 0.97
array/random/randn/Float32 43668.5 ns 43554 ns 1.00
array/random/randn!/Float32 26414 ns 26207 ns 1.01
array/random/rand!/Int64 27095 ns 26920.5 ns 1.01
array/random/rand!/Float32 8588 ns 8620.666666666666 ns 1.00
array/random/rand/Int64 29966 ns 29690 ns 1.01
array/random/rand/Float32 12944 ns 12869 ns 1.01
array/permutedims/4d 67295 ns 67200 ns 1.00
array/permutedims/2d 56548 ns 56608 ns 1.00
array/permutedims/3d 59273 ns 59069 ns 1.00
array/sorting/1d 2775407 ns 2931887.5 ns 0.95
array/sorting/by 3367379 ns 3498332 ns 0.96
array/sorting/2d 1085044 ns 1084207 ns 1.00
cuda/synchronization/stream/auto 1025.5454545454545 ns 1042.2 ns 0.98
cuda/synchronization/stream/nonblocking 6538.8 ns 6635 ns 0.99
cuda/synchronization/stream/blocking 850.8589743589744 ns 802.6082474226804 ns 1.06
cuda/synchronization/context/auto 1217.7 ns 1181.2 ns 1.03
cuda/synchronization/context/nonblocking 6761.2 ns 6839.8 ns 0.99
cuda/synchronization/context/blocking 911.3333333333334 ns 904.8229166666667 ns 1.01

This comment was automatically generated by a workflow using github-action-benchmark.


codecov bot commented on Jan 16, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 73.60%. Comparing base (6ef1a3d) to head (833d88c).
Report is 1 commit behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #2621   +/-   ##
=======================================
  Coverage   73.60%   73.60%           
=======================================
  Files         157      157           
  Lines       15230    15230           
=======================================
  Hits        11210    11210           
  Misses       4020     4020           


maleadt merged commit d07a245 into master on Jan 16, 2025
3 checks passed
maleadt deleted the tb/elide_boundscheck branch on January 16, 2025 at 12:39