Optimizing GPU acceleration of binary collision algorithms. #4577

Merged
merged 70 commits into from
Apr 15, 2024

Conversation

mhaseeb123
Contributor

@mhaseeb123 mhaseeb123 commented Jan 5, 2024

[Edit by @RemiLehe]: The aim of this PR is to speed up binary collisions on GPU by exposing more parallelism: instead of looping with one GPU thread per cell, we loop with one GPU thread per independent pair (i.e. pairs that do not touch the same macroparticles, so that there is no race condition). Within each cell, the number of independent pairs is the smaller of the two species' macroparticle counts.

[Updated]: This PR includes an optimized GPU implementation for the two particle collision algorithms (Coulomb and Nuclear).
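To make the "one thread per independent pair" idea concrete, here is a minimal CPU-side sketch in C++ of the index bookkeeping it implies. It is not the code from this PR: the names `PairLocation`, `pair_offsets` and `locate_pair` are hypothetical, and it only assumes that the per-cell macroparticle counts of the two species are already known. Each cell contributes `min(n1, n2)` independent pairs, an exclusive prefix sum turns those counts into offsets, and a flat pair index (what a GPU thread would receive) is mapped back to its cell by binary search.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <vector>

// Hypothetical helper type: where one independent pair lives.
struct PairLocation {
    std::size_t cell;   // cell that owns the pair
    std::size_t local;  // index of the pair within that cell
};

// Exclusive prefix sum of min(n1[c], n2[c]) over cells.
// offsets[c] is the flat index of the first pair in cell c;
// offsets.back() is the total number of independent pairs.
std::vector<std::size_t> pair_offsets(const std::vector<std::size_t>& n1,
                                      const std::vector<std::size_t>& n2) {
    assert(n1.size() == n2.size());
    std::vector<std::size_t> offsets(n1.size() + 1, 0);
    for (std::size_t c = 0; c < n1.size(); ++c) {
        offsets[c + 1] = offsets[c] + std::min(n1[c], n2[c]);
    }
    return offsets;
}

// Map a flat pair index (one per thread) back to (cell, local index).
// Cells with zero pairs are skipped automatically by the search.
PairLocation locate_pair(const std::vector<std::size_t>& offsets,
                         std::size_t i_pair) {
    assert(i_pair < offsets.back());
    auto it = std::upper_bound(offsets.begin(), offsets.end(), i_pair);
    std::size_t cell =
        static_cast<std::size_t>(std::distance(offsets.begin(), it)) - 1;
    return {cell, i_pair - offsets[cell]};
}
```

With counts `{3, 0, 4}` for one species and `{2, 5, 3}` for the other, the cells hold 2, 0 and 3 independent pairs, so five threads would be launched instead of three, and none of them would land on the empty middle cell. How the leftover macroparticles of the more populated species are handled is not covered by this sketch.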

Average Performance Improvement: 4x

Results for a specific collisions-heavy use-case from Dave inputs_1d_H1_lassen2.txt

Built with: cmake -DWarpX_DIMS=1 -DWarpX_COMPUTE=CUDA on Perlmutter

New kernel:

MultiParticleContainer::doCollisions()                   100      11.89      11.89      11.89  81.47%
Avg. per step = 0.1375590809 s
Total Time                     : 14.58777054

WarpX dev branch:

MultiParticleContainer::doCollisions()                   100      67.96      67.96      67.96  96.49%
Avg. per step = 0.6979252147
Total Time                     : 70.43095279
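For this particular input, the ratios implied by the two runs above can be worked out directly (assuming the `doCollisions()` timing columns and the total times are in seconds):

```latex
\frac{t_{\text{dev}}^{\text{collisions}}}{t_{\text{new}}^{\text{collisions}}}
  = \frac{67.96}{11.89} \approx 5.7,
\qquad
\frac{t_{\text{dev}}^{\text{total}}}{t_{\text{new}}^{\text{total}}}
  = \frac{70.43}{14.59} \approx 4.8
```

So this collisions-heavy case sits somewhat above the 4x average quoted earlier.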

Error Rate

Slightly higher for a few Azure tests, which causes the corresponding CI pipelines to fail. All good otherwise.

TODO (@RemiLehe):

  • Add some more comments
  • Benchmark on CPU + possibly use #ifdef to use the new code only with GPU

@mhaseeb123 mhaseeb123 marked this pull request as ready for review January 5, 2024 01:15
@mhaseeb123 mhaseeb123 marked this pull request as draft January 5, 2024 01:16
@RemiLehe RemiLehe marked this pull request as ready for review January 5, 2024 16:58
@RemiLehe RemiLehe closed this Jan 5, 2024
@RemiLehe RemiLehe reopened this Jan 5, 2024
@mhaseeb123 mhaseeb123 changed the title [WIP]: Optimizing GPU acceleration of particle collision algorithms. [Ready]: Optimizing GPU acceleration of particle collision algorithms. Jan 11, 2024
@mhaseeb123 mhaseeb123 changed the title [Ready]: Optimizing GPU acceleration of particle collision algorithms. [WIP]: Optimizing GPU acceleration of particle collision algorithms. Jan 16, 2024
@mhaseeb123 mhaseeb123 changed the title [WIP]: Optimizing GPU acceleration of particle collision algorithms. [READY]: Optimizing GPU acceleration of particle collision algorithms. Jan 19, 2024
@ax3l ax3l added Performance optimization component: collisions Anything related to particle collisions labels Jan 19, 2024
@ax3l ax3l requested review from ax3l and RemiLehe January 19, 2024 23:18
@ax3l ax3l changed the title [READY]: Optimizing GPU acceleration of particle collision algorithms. Optimizing GPU acceleration of particle collision algorithms. Jan 19, 2024
@RemiLehe
Member

RemiLehe commented Feb 8, 2024

Thanks for this PR!
I tried to run the Deuterium_Tritium_Fusion_3D test locally (on a MacBook), with:

./run_test.sh Deuterium_Tritium_Fusion_3D

but I got the following error:

--- INFO    : Writing plotfile Deuterium_Tritium_Fusion_3D_plt000000
STEP 1 starts ...

   WARNING: Test stderr:
SIGILL Invalid, privileged, or ill-formed instruction
See Backtrace.0.0 file for details
SIGILL Invalid, privileged, or ill-formed instruction
See Backtrace.1.0 file for details
Abort(4) on node 0 (rank 0 in comm 496): application called MPI_Abort(comm=0x84000002, 4) - process 0
Abort(4) on node 1 (rank 1 in comm 496): application called MPI_Abort(comm=0x84000001, 4) - process 1

Note that this error only appears when using amrex.fpe_trap_invalid=1 (which is automatically used with the run_test.sh utility).
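For readers unfamiliar with this kind of option: as its name suggests, it makes the run abort on invalid floating-point operations instead of silently producing NaN. The cause of the failure in this test is not identified in this comment, so the following is only a generic C++ sketch, using the standard `<cfenv>` flags, of what counts as an "invalid" operation. The helper name `raises_invalid` is hypothetical and unrelated to AMReX or WarpX.

```cpp
#include <cfenv>

// Hypothetical helper: returns true if computing num / den sets the
// IEEE "invalid operation" flag (e.g. 0.0 / 0.0, which yields NaN).
// The volatile qualifiers keep the compiler from folding the division
// at compile time, so the flag is set (or not) at run time.
bool raises_invalid(double num, double den) {
    std::feclearexcept(FE_ALL_EXCEPT);
    volatile double a = num;
    volatile double b = den;
    volatile double q = a / b;
    (void)q;  // the result itself is not needed, only the flag
    return std::fetestexcept(FE_INVALID) != 0;
}
```

Here `raises_invalid(0.0, 0.0)` reports the invalid flag, while `raises_invalid(1.0, 2.0)` does not. With trapping enabled, the first kind of operation stops the program rather than setting a flag, which is consistent with the error showing up only when the option is on.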

@roelof-groenewald
Member

roelof-groenewald commented Apr 15, 2024

@RemiLehe I ran the CPU performance tests as we discussed (using cases 1 - 3 of the Turner benchmarks). The short and good news is that the results matched expectations and performance was NOT worse on CPU due to these changes. So, as discussed, this PR can be merged 🎉

Here are the inclusive timings (note that MultiParticleContainer::doCollisions() includes both the MCC and DSMC collisions involved in these tests, but we don't have a timer for just DSMC):

  • test 1:

    • devel: MultiParticleContainer::doCollisions() 512000 3.723 52.39 145.4 13.71%
    • this PR: MultiParticleContainer::doCollisions() 512000 3.155 35.78 124.8 12.19%
  • test 2:

    • devel: MultiParticleContainer::doCollisions() 4096000 390.6 636.2 796 13.30%
    • this PR: MultiParticleContainer::doCollisions() 4096000 275.8 523 746.5 12.48%
  • test 3:

    • devel: MultiParticleContainer::doCollisions() 8192000 721.1 2407 3471 19.19%
    • this PR: MultiParticleContainer::doCollisions() 8192000 497 2099 3287 17.91%

@roelof-groenewald
Member

Here are also the physics results (for posterity) with the changes in this PR included:
image

@RemiLehe
Member

@roelof-groenewald Thanks a lot for doing these tests! This is great news!

@RemiLehe RemiLehe enabled auto-merge (squash) April 15, 2024 17:42
@RemiLehe RemiLehe merged commit 3add009 into ECP-WarpX:development Apr 15, 2024
45 checks passed