-
Notifications
You must be signed in to change notification settings - Fork 234
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Elide bounds checks when kernels contains manual ones. #2621
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some suggestions could not be made:
- test/base/exceptions.jl
- lines 54-54
@@ -45,6 +45,7 @@ Base.size(g::CuDeviceArray) = g.dims | |||
Base.sizeof(x::CuDeviceArray) = Base.elsize(x) * length(x) | |||
|
|||
# we store the array length too; computing prod(size) is expensive | |||
Base.size(g::CuDeviceArray{<:Any,1}) = (g.len,) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Base.size(g::CuDeviceArray{<:Any,1}) = (g.len,) | |
Base.size(g::CuDeviceArray{<:Any, 1}) = (g.len,) |
end | ||
|
||
for N in 1:3 | ||
ir = sprint(io->CUDA.code_llvm(io, kernel, Tuple{CuDeviceArray{Int,N,AS.Global}})) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ir = sprint(io->CUDA.code_llvm(io, kernel, Tuple{CuDeviceArray{Int,N,AS.Global}})) | |
ir = sprint(io -> CUDA.code_llvm(io, kernel, Tuple{CuDeviceArray{Int, N, AS.Global}})) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CUDA.jl Benchmarks
Benchmark suite | Current: 833d88c | Previous: 6ef1a3d | Ratio |
---|---|---|---|
latency/precompile |
45630280118.5 ns |
45591768926.5 ns |
1.00 |
latency/ttfp |
6399804347 ns |
6426009567 ns |
1.00 |
latency/import |
3051033362 ns |
3058100713.5 ns |
1.00 |
integration/volumerhs |
9554704 ns |
9568734 ns |
1.00 |
integration/byval/slices=1 |
146554 ns |
146515 ns |
1.00 |
integration/byval/slices=3 |
425223 ns |
425368 ns |
1.00 |
integration/byval/reference |
144582 ns |
144702 ns |
1.00 |
integration/byval/slices=2 |
286089 ns |
285715 ns |
1.00 |
integration/cudadevrt |
103302 ns |
103329 ns |
1.00 |
kernel/indexing |
14030 ns |
14034 ns |
1.00 |
kernel/indexing_checked |
14606.5 ns |
15106 ns |
0.97 |
kernel/occupancy |
708.5138888888889 ns |
685.4019607843138 ns |
1.03 |
kernel/launch |
2120.222222222222 ns |
2159.8 ns |
0.98 |
kernel/rand |
14375 ns |
14789 ns |
0.97 |
array/reverse/1d |
19533.5 ns |
19194.5 ns |
1.02 |
array/reverse/2d |
25388 ns |
24852 ns |
1.02 |
array/reverse/1d_inplace |
11034 ns |
10390.5 ns |
1.06 |
array/reverse/2d_inplace |
11382 ns |
10880.5 ns |
1.05 |
array/copy |
20725 ns |
20484 ns |
1.01 |
array/iteration/findall/int |
158683 ns |
157153 ns |
1.01 |
array/iteration/findall/bool |
138574 ns |
138367 ns |
1.00 |
array/iteration/findfirst/int |
154177 ns |
153645 ns |
1.00 |
array/iteration/findfirst/bool |
155057 ns |
154449.5 ns |
1.00 |
array/iteration/scalar |
75390 ns |
76270.5 ns |
0.99 |
array/iteration/logical |
213676 ns |
212097.5 ns |
1.01 |
array/iteration/findmin/1d |
41191 ns |
41115 ns |
1.00 |
array/iteration/findmin/2d |
94141 ns |
94032 ns |
1.00 |
array/reductions/reduce/1d |
39322 ns |
44665.5 ns |
0.88 |
array/reductions/reduce/2d |
49910 ns |
47524.5 ns |
1.05 |
array/reductions/mapreduce/1d |
38339.5 ns |
43267 ns |
0.89 |
array/reductions/mapreduce/2d |
41209.5 ns |
48108 ns |
0.86 |
array/broadcast |
21517 ns |
21418 ns |
1.00 |
array/copyto!/gpu_to_gpu |
13432 ns |
13538 ns |
0.99 |
array/copyto!/cpu_to_gpu |
211866 ns |
209653 ns |
1.01 |
array/copyto!/gpu_to_cpu |
245719 ns |
244981.5 ns |
1.00 |
array/accumulate/1d |
108909 ns |
108447 ns |
1.00 |
array/accumulate/2d |
80056 ns |
79574 ns |
1.01 |
array/construct |
1241.15 ns |
1276.2 ns |
0.97 |
array/random/randn/Float32 |
43668.5 ns |
43554 ns |
1.00 |
array/random/randn!/Float32 |
26414 ns |
26207 ns |
1.01 |
array/random/rand!/Int64 |
27095 ns |
26920.5 ns |
1.01 |
array/random/rand!/Float32 |
8588 ns |
8620.666666666666 ns |
1.00 |
array/random/rand/Int64 |
29966 ns |
29690 ns |
1.01 |
array/random/rand/Float32 |
12944 ns |
12869 ns |
1.01 |
array/permutedims/4d |
67295 ns |
67200 ns |
1.00 |
array/permutedims/2d |
56548 ns |
56608 ns |
1.00 |
array/permutedims/3d |
59273 ns |
59069 ns |
1.00 |
array/sorting/1d |
2775407 ns |
2931887.5 ns |
0.95 |
array/sorting/by |
3367379 ns |
3498332 ns |
0.96 |
array/sorting/2d |
1085044 ns |
1084207 ns |
1.00 |
cuda/synchronization/stream/auto |
1025.5454545454545 ns |
1042.2 ns |
0.98 |
cuda/synchronization/stream/nonblocking |
6538.8 ns |
6635 ns |
0.99 |
cuda/synchronization/stream/blocking |
850.8589743589744 ns |
802.6082474226804 ns |
1.06 |
cuda/synchronization/context/auto |
1217.7 ns |
1181.2 ns |
1.03 |
cuda/synchronization/context/nonblocking |
6761.2 ns |
6839.8 ns |
0.99 |
cuda/synchronization/context/blocking |
911.3333333333334 ns |
904.8229166666667 ns |
1.01 |
This comment was automatically generated by workflow using github-action-benchmark.
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## master #2621 +/- ##
=======================================
Coverage 73.60% 73.60%
=======================================
Files 157 157
Lines 15230 15230
=======================================
Hits 11210 11210
Misses 4020 4020 ☔ View full report in Codecov by Sentry. |
I noticed that the following dead-simple kernel contains a redundant bounds check:
The reason here is that we store both
dims
andlen = prod(dims)
, to avoid having to computeprod(dims)
at run time. However, because of that LLVM can't eliminate the second check. Let's avoid that by making suresize(A::CuDeviceVector)
returnslen
.I then realized the same should also happen with ND inputs, at least if we index linearly:
The problem here is that the generic definition of
checkbounds(A, idx)
first constructs aOneTo
range before doing the bounds check. That constructor callsmax
, again breaking LLVM's ability to eliminate the second check. We can avoid that by using a simplified bounds check based onlength
:This change should be ported to other back-ends as well.