
Experiment with a slightly adjusted pipeline #52850

Open

gbaraldi wants to merge 4 commits into base: master from gb/pipeline-fun
Conversation

gbaraldi
Member

@gbaraldi gbaraldi commented Jan 10, 2024

Needs #57380 to merge first

@gbaraldi gbaraldi requested review from vchuravy and vtjnash January 11, 2024 13:55
@gbaraldi
Member Author

@nanosoldier runbenchmarks(ALL, vs=":master")

@nanosoldier
Collaborator

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here.

@gbaraldi
Member Author

@nanosoldier runbenchmarks(!"scalar", vs=":master")

@nanosoldier
Collaborator

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here.

@gbaraldi
Member Author

@nanosoldier runbenchmarks(!"scalar", vs=":master")

@nanosoldier
Collaborator

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here.

Comment on lines +405 to +398
FPM.addPass(InstCombinePass());
FPM.addPass(AggressiveInstCombinePass());
Member


does it make sense to do 2 instcombine right next to each other?

Member Author

@gbaraldi gbaraldi Oct 16, 2024


Member


It might be worth customizing AggressiveInstCombinePass slightly, since its defaults include some options that are likely not useful for us. From https://llvm.org/doxygen/AggressiveInstCombine_8cpp.html:

  • foldSqrt is probably useless because we generate the LLVM sqrt intrinsic directly
  • tryToRecognizePopCount probably isn't useful since we have count_ones
  • foldMemChr: I don't think we use memchr (but I'm not sure)

This is unlikely to matter much, but it could save a bit of compile time here and there.
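As a quick sanity check on the `count_ones` point (a sketch, not part of the PR; it only assumes the standard-library `InteractiveUtils`), one can confirm that Julia already emits the `llvm.ctpop` intrinsic directly, leaving nothing for `tryToRecognizePopCount` to recognize:

```julia
# Sketch: inspect the LLVM IR generated for count_ones(::UInt64).
# If Julia emits llvm.ctpop directly, AggressiveInstCombine's popcount
# idiom recognition is redundant for this code.
using InteractiveUtils

ir = sprint(io -> code_llvm(io, count_ones, (UInt64,)))
println(contains(ir, "ctpop") ? "ctpop intrinsic emitted directly" :
                                "no ctpop found")
```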

@vtjnash
Member

vtjnash commented Oct 16, 2024

Looks like you need to fix a couple of tests:

Failed Tests (2):
2024-10-16 14:39:21 EDT	  Julia :: image-codegen.jl
2024-10-16 14:39:21 EDT	  Julia :: pipeline-prints.ll

also rerunning nanosoldier, since a lot of changes have happened since:
@nanosoldier runbenchmarks(!"scalar", vs=":master")

@nanosoldier
Collaborator

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here.

@oscardssmith
Member

Looks overall pretty good, but there are a couple of 10x regressions (they look like vectorization failures). Is there an easy way via nanosoldier for us to test compile time, to make sure it's comparable?

@Zentrik
Member

Zentrik commented Oct 17, 2024

Isn't that what the inference benchmarks are for? They look like no change to me.

@gbaraldi
Member Author

I took a long look at it. There are still a couple of regressions, but it seems to be a pretty clear overall win. If anyone wants to take a further look, the remaining ones are:

  1. ["union", "array", ("perf_countequals", "Int8")]
  2. ["array", "index", ("sumelt_boundscheck", "Base.ReinterpretArray{BaseBenchmarks.ArrayBenchmarks.PairVals{Int32}, 2, Int64, Matrix{Int64}, false}")] (we are failing to elide a boundscheck)
  3. The SIMD conditional loop ones (they are very noisy, both per run and per machine)

The 16x regression is now gone with my latest commit.

@gbaraldi gbaraldi force-pushed the gb/pipeline-fun branch 2 times, most recently from d6a2afa to b447d87 Compare October 18, 2024 18:01
@gbaraldi
Member Author

Do we want to run a pkgeval? I'm slightly worried about the fact that I had to modify passes.

Review threads (outdated, resolved):
  src/llvm-late-gc-lowering.cpp
  src/llvm-late-gc-lowering.cpp
  src/llvm-lower-handlers.cpp
  src/llvm-lower-handlers.cpp
  src/pipeline.cpp
@vchuravy
Member

@nanosoldier runtests(ALL, vs = ":master", configuration = (buildflags=["LLVM_ASSERTIONS=1", "FORCE_ASSERTIONS=1"],), vs_configuration = (buildflags = ["LLVM_ASSERTIONS=1", "FORCE_ASSERTIONS=1"],))

@nanosoldier
Collaborator

The package evaluation job you requested has completed - possible new issues were detected.
The full report is available.

@gbaraldi
Member Author

@nanosoldier runbenchmarks(ALL, vs=":master")

@nanosoldier
Collaborator

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here.

@gbaraldi
Member Author

["union", "array", ("map", "*", "Float64", "(false, true)")] seems to be quite a large regression on my Mac (2x).

@oscardssmith
Member

nice to see that this improves the allocation elimination :)

@giordano
Contributor

I can confirm this PR on top of #57380 addresses most of #56145: stores are vectorised. There's still an out-of-bounds section in the preamble; it shouldn't be a big deal performance-wise, although it probably taints effects since it can throw:

julia> code_llvm((Memory{Float64},)) do v
           for idx in eachindex(v)
               v[idx] = 1.0
           end
       end
; Function Signature: var"#59"(Memory{Float64})
;  @ REPL[17]:2 within `#59`
define void @"julia_#59_3153"(ptr noundef nonnull align 8 dereferenceable(16) %"v::GenericMemory") local_unnamed_addr #0 {
top:
  %thread_ptr = call ptr asm "movq %fs:0, $0", "=r"() #11
  %tls_ppgcstack = getelementptr inbounds i8, ptr %thread_ptr, i64 -8
  %tls_pgcstack = load ptr, ptr %tls_ppgcstack, align 8
; ┌ @ range.jl:917 within `iterate`
; │┌ @ range.jl:688 within `isempty`
; ││┌ @ operators.jl:425 within `>`
; │││┌ @ int.jl:83 within `<`
      %.unbox = load i64, ptr %"v::GenericMemory", align 8
      %0 = icmp slt i64 %.unbox, 1
; └└└└
  br i1 %0, label %L29, label %preloop.pseudo.exit

[...]

oob:                                              ; preds = %L11.postloop, %L11
  %value_phi3.lcssa = phi i64 [ %value_phi3.postloop, %L11.postloop ], [ %6, %L11 ]
;  @ REPL[17]:3 within `#59`
; ┌ @ genericmemory.jl:260 within `setindex!`
; │┌ @ genericmemory.jl:252 within `_setindex!`
    %ptls_field = getelementptr inbounds i8, ptr %tls_pgcstack, i64 16
    %ptls_load = load ptr, ptr %ptls_field, align 8
    %"box::GenericMemoryRef" = call noalias nonnull align 8 dereferenceable(32) ptr @ijl_gc_small_alloc(ptr %ptls_load, i32 408, i32 32, i64 140577962397856) #8
    %"box::GenericMemoryRef.tag_addr" = getelementptr inbounds i64, ptr %"box::GenericMemoryRef", i64 -1
    store atomic i64 140577962397856, ptr %"box::GenericMemoryRef.tag_addr" unordered, align 8
    store ptr %memoryref_data, ptr %"box::GenericMemoryRef", align 8
    %.repack16 = getelementptr inbounds { ptr, ptr }, ptr %"box::GenericMemoryRef", i64 0, i32 1
    store ptr %"v::GenericMemory", ptr %.repack16, align 8
    call void @ijl_bounds_error_int(ptr nonnull %"box::GenericMemoryRef", i64 %value_phi3.lcssa)
    unreachable

load:                                             ; preds = %L11
    %memoryref_offset = shl i64 %value_phi3, 3
; ││ @ genericmemory.jl:253 within `_setindex!`
    %gep = getelementptr i8, ptr %invariant.gep, i64 %memoryref_offset
    store i64 4607182418800017408, ptr %gep, align 8
; └└
;  @ REPL[17]:4 within `#59`
; ┌ @ range.jl:921 within `iterate`
   %1 = add nuw nsw i64 %value_phi3, 1
; └
  %exitcond39.not = icmp eq i64 %value_phi3, %umax
  br i1 %exitcond39.not, label %main.exit.selector, label %L11

[...]

vector.body:                                      ; preds = %vector.body, %vector.ph
    %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
    %offset.idx = shl i64 %index, 3
    %12 = or disjoint i64 %offset.idx, 8
; ││ @ genericmemory.jl:253 within `_setindex!`
    %13 = getelementptr i8, ptr %invariant.gep, i64 %12
    %14 = getelementptr i64, ptr %13, i64 4
    %15 = getelementptr i64, ptr %13, i64 8
    %16 = getelementptr i64, ptr %13, i64 12
    store <4 x i64> <i64 4607182418800017408, i64 4607182418800017408, i64 4607182418800017408, i64 4607182418800017408>, ptr %13, align 8
    store <4 x i64> <i64 4607182418800017408, i64 4607182418800017408, i64 4607182418800017408, i64 4607182418800017408>, ptr %14, align 8
    store <4 x i64> <i64 4607182418800017408, i64 4607182418800017408, i64 4607182418800017408, i64 4607182418800017408>, ptr %15, align 8
    store <4 x i64> <i64 4607182418800017408, i64 4607182418800017408, i64 4607182418800017408, i64 4607182418800017408>, ptr %16, align 8
    %index.next = add nuw i64 %index, 16
    %17 = icmp eq i64 %index.next, %n.vec
    br i1 %17, label %L11, label %vector.body

[...]

It'd be nice to have a test to make sure this doesn't regress, once #57380 is merged.
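Such a test could be sketched roughly as follows (a hypothetical illustration, not the eventual lit/FileCheck test; `fill_ones!` is a made-up name, it assumes Julia 1.11+ for `Memory`, and whether the stores actually vectorize depends on the build and the merged pipeline):

```julia
# Sketch of a regression guard for vectorization of a Memory{Float64} fill loop.
using InteractiveUtils

function fill_ones!(v::Memory{Float64})
    for idx in eachindex(v)
        v[idx] = 1.0
    end
    return v
end

ir = sprint(io -> code_llvm(io, fill_ones!, (Memory{Float64},)))
# On builds where the loop vectorizes, the IR contains wide vector stores
# such as "store <4 x i64>"; a FileCheck-style test would match on that.
println(contains(ir, "store <") ? "vector stores present" : "scalar stores only")
```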

gbaraldi added a commit that referenced this pull request Feb 17, 2025
There are cases where we optimize the SRet more than the pass expected, so try and handle those. I'm trying to get a test for this; it is separated from #52850 to make merging both easier.

---------

Co-authored-by: Jameson Nash <[email protected]>
8 participants