Optimizing GPU acceleration of binary collision algorithms. #4577

mhaseeb123 · 2024-01-05T01:13:59Z

[Edit by @RemiLehe]: The aim of this PR is to speed up binary collisions on GPU by exposing more parallelism: instead of looping with one GPU thread per cell, we loop with one GPU thread per independent pair (i.e., pairs that do not touch the same macroparticles, so that there is no race condition). Within each cell, the number of independent pairs is the smaller of the two species' macroparticle counts.
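To make the pairing idea concrete, here is a minimal CPU-side sketch of one possible assignment of pairs to independent work items within a single cell. It is not the code from this PR: the helper name `pairs_for_work_item` and the strided assignment (work item `item` owns macroparticle `item` of the less-populated species, plus every macroparticle of the other species whose index equals `item` modulo the number of independent pairs) are assumptions chosen for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical helper (not WarpX API): list the (index in species 1,
// index in species 2) pairs handled by one independent work item in a cell
// holding n1 and n2 macroparticles of the two species.
std::vector<std::pair<int, int>> pairs_for_work_item (int item, int n1, int n2)
{
    // Number of independent pairs: the smaller of the two counts.
    const int n_independent = std::min(n1, n2);
    const int n_total = std::max(n1, n2);
    assert(item >= 0 && item < n_independent);

    std::vector<std::pair<int, int>> pairs;
    // Stride through the more-populated species. Two different work items
    // never visit the same index, so they never touch the same macroparticle.
    for (int k = item; k < n_total; k += n_independent) {
        const int i1 = (n1 <= n2) ? item : k;
        const int i2 = (n1 <= n2) ? k : item;
        pairs.emplace_back(i1, i2);
    }
    return pairs;
}
```

With `n1 = 2` and `n2 = 5`, there are two independent work items: item 0 handles species-2 indices 0, 2, 4 and item 1 handles indices 1, 3. On a GPU, each such work item could run in its own thread without atomics, under the assumptions above.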

[Updated]: This PR includes an optimized GPU implementation of the two particle collision algorithms (Coulomb and Nuclear).

Average Performance Improvement: 4x

Results for a specific collision-heavy use case from Dave (inputs_1d_H1_lassen2.txt):

Built with: cmake -DWarpX_DIMS=1 -DWarpX_COMPUTE=CUDA on Perlmutter

New kernel:

MultiParticleContainer::doCollisions()                   100      11.89      11.89      11.89  81.47%
Avg. per step = 0.1375590809 s
Total Time                     : 14.58777054

WarpX dev branch:

MultiParticleContainer::doCollisions()                   100      67.96      67.96      67.96  96.49%
Avg. per step = 0.6979252147
Total Time                     : 70.43095279
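For reference, the speedup implied by the two timing blocks above is simply the ratio of the dev-branch time to the new-kernel time. The helper below is a hypothetical convenience function, not part of WarpX.

```cpp
#include <cassert>

// Hypothetical helper: speedup as the ratio of baseline time to optimized time.
// Both arguments must be in the same unit.
double speedup (double baseline_time, double optimized_time)
{
    assert(optimized_time > 0.0);
    return baseline_time / optimized_time;
}
```

Using the numbers quoted above, `speedup(67.96, 11.89)` is about 5.7 for `MultiParticleContainer::doCollisions()` alone, and `speedup(70.43095279, 14.58777054)` is about 4.8 for the total run time.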

Error Rate

Slightly higher for a few Azure tests, which causes those CI pipelines to fail. All good otherwise.

TODO (@RemiLehe):

Add some more comments
Benchmark on CPU + possibly use #ifdef to use the new code only with GPU
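The `#ifdef` option in the TODO could look roughly like the sketch below. `AMREX_USE_GPU` is the macro AMReX defines for GPU builds; the names `use_pair_parallel_loop` and `work_items_in_cell` are hypothetical and only illustrate how the loop granularity might be switched at compile time.

```cpp
#include <algorithm>

// Select the loop granularity at compile time.
// AMREX_USE_GPU is defined by AMReX in GPU builds; everything else here
// is a hypothetical name used for illustration only.
#ifdef AMREX_USE_GPU
constexpr bool use_pair_parallel_loop = true;   // one work item per independent pair
#else
constexpr bool use_pair_parallel_loop = false;  // one work item per cell
#endif

// Hypothetical helper: number of parallel work items launched for a cell
// holding n1 and n2 macroparticles of the two colliding species.
int work_items_in_cell (int n1, int n2)
{
    return use_pair_parallel_loop ? std::min(n1, n2) : 1;
}
```

Whether such a switch is worthwhile depends on the CPU benchmark mentioned in the same TODO item.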

RemiLehe · 2024-02-08T14:52:39Z

Thanks for this PR!
I tried to run the Deuterium_Tritium_Fusion_3D test locally (on a MacBook) with:

./run_test.sh Deuterium_Tritium_Fusion_3D

but I got the following error:

--- INFO    : Writing plotfile Deuterium_Tritium_Fusion_3D_plt000000
STEP 1 starts ...

   WARNING: Test stderr:
SIGILL Invalid, privileged, or ill-formed instruction
See Backtrace.0.0 file for details
SIGILL Invalid, privileged, or ill-formed instruction
See Backtrace.1.0 file for details
Abort(4) on node 0 (rank 0 in comm 496): application called MPI_Abort(comm=0x84000002, 4) - process 0
Abort(4) on node 1 (rank 1 in comm 496): application called MPI_Abort(comm=0x84000001, 4) - process 1

Note that this error only appears when using amrex.fpe_trap_invalid=1 (which the run_test.sh utility sets automatically).
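As background, amrex.fpe_trap_invalid=1 makes the run abort on invalid floating-point operations instead of silently producing NaN. The standalone sketch below uses only the standard `<cfenv>` header to show the kind of operation that raises the "invalid" flag. It does not reproduce the failure in this test; the helper name is hypothetical and 0/0 is just an example.

```cpp
#include <cfenv>

// Hypothetical helper: returns true if computing 0.0/0.0 raised FE_INVALID.
bool zero_over_zero_raises_invalid ()
{
    std::feclearexcept(FE_ALL_EXCEPT);
    volatile double zero = 0.0;            // volatile forces a runtime division
    volatile double result = zero / zero;  // yields NaN and sets the FE_INVALID flag
    (void) result;
    return std::fetestexcept(FE_INVALID) != 0;
}
```

With trapping enabled, the same operation would terminate the program at the faulting instruction rather than just setting a flag, which makes the offending line easier to locate from the backtrace.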

roelof-groenewald · 2024-04-15T17:00:27Z

@RemiLehe I ran the CPU performance tests as we discussed (using cases 1 - 3 of the Turner benchmarks). The short and good news is that the results matched expectation and performance was NOT worse on CPU due to these changes. So, as discussed, this PR can be merged 🎉

Here are the inclusive timings (note that MultiParticleContainer::doCollisions() includes both the MCC and DSMC collisions involved in these tests, but we don't have a timer for just DSMC):

test 1:
- devel: MultiParticleContainer::doCollisions() 512000 3.723 52.39 145.4 13.71%
- this PR: MultiParticleContainer::doCollisions() 512000 3.155 35.78 124.8 12.19%
test 2:
- devel: MultiParticleContainer::doCollisions() 4096000 390.6 636.2 796 13.30%
- this PR: MultiParticleContainer::doCollisions() 4096000 275.8 523 746.5 12.48%
test 3:
- devel: MultiParticleContainer::doCollisions() 8192000 721.1 2407 3471 19.19%
- this PR: MultiParticleContainer::doCollisions() 8192000 497 2099 3287 17.91%

roelof-groenewald · 2024-04-15T17:08:53Z

Here are also the physics results (for posterity) with the changes in this PR included:

RemiLehe · 2024-04-15T17:42:36Z

@roelof-groenewald Thanks a lot for doing these tests! This is great news!

mhaseeb123 added 2 commits January 4, 2024 17:05

opt gpu accel of particle collisions

acaae8b

minor updates in .gitignore

e6e47b5

mhaseeb123 marked this pull request as ready for review January 5, 2024 01:15

mhaseeb123 marked this pull request as draft January 5, 2024 01:16

RemiLehe marked this pull request as ready for review January 5, 2024 16:58

RemiLehe closed this Jan 5, 2024

RemiLehe reopened this Jan 5, 2024

mhaseeb123 and others added 2 commits January 11, 2024 12:29

fix for consistent and correct particle collision error

839d7bd

[pre-commit.ci] auto fixes from pre-commit.com hooks

5cacf97

for more information, see https://pre-commit.ci

mhaseeb123 changed the title ~~[WIP]: Optimizing GPU acceleration of particle collision algorithms.~~ [Ready]: Optimizing GPU acceleration of particle collision algorithms. Jan 11, 2024

mhaseeb123 and others added 5 commits January 11, 2024 12:40

Merge branch 'ECP-WarpX:development' into development

13555a7

minor bug fixing

2babba5

[minor]: remove stale code and clean up comments

8a7d449

undo erroneously pushed changes in GNUmakefile

0f90da4

undo erroneously pushed inputs_3d

73fbc28

github-advanced-security bot found potential problems Jan 11, 2024

Source/Particles/Collision/BinaryCollision/BinaryCollision.H (two alerts, both marked fixed)

revert errors from obsolete code plus minor improvements

59122ba

mhaseeb123 changed the title ~~[Ready]: Optimizing GPU acceleration of particle collision algorithms.~~ [WIP]: Optimizing GPU acceleration of particle collision algorithms. Jan 16, 2024

fix for a possible segfault

f6bddb8

mhaseeb123 changed the title ~~[WIP]: Optimizing GPU acceleration of particle collision algorithms.~~ [READY]: Optimizing GPU acceleration of particle collision algorithms. Jan 19, 2024

replace std::min with amrex::min for future compatibility

9940bd6

ax3l added the "Performance optimization" and "component: collisions" labels Jan 19, 2024

ax3l requested review from ax3l and RemiLehe January 19, 2024 23:18

ax3l changed the title ~~[READY]: Optimizing GPU acceleration of particle collision algorithms.~~ Optimizing GPU acceleration of particle collision algorithms. Jan 19, 2024

mhaseeb123 and others added 2 commits January 19, 2024 15:38

clang-tidy updates

bd4fa93

Merge branch 'ECP-WarpX:development' into development

6d73205

ax3l assigned RemiLehe Jan 30, 2024

pre-commit-ci bot and others added 8 commits April 10, 2024 21:32

[pre-commit.ci] auto fixes from pre-commit.com hooks

5c89a9d

for more information, see https://pre-commit.ci

add docstring for CollisionPairFilter function arguments

79fdcab

Use insert rather than push_back & rotate

7e0b5da

Co-authored-by: Remi Lehe

apply suggestions from a few of the review comments

3c4e41f

Merge remote-tracking branch 'roelof/dsmc_unify' into development

541703b

reduce differences between SplitAndScatterFunc.H and `ParticleCreationFunc.H`

114001f

Merge remote-tracking branch 'upstream/development' into development

723f6d7

use deleteInvalidParticles from ECP-WarpX#4849

2d2792e

roelof-groenewald mentioned this pull request Apr 11, 2024

Use local deleteInvalidParticles (instead of Redistribute) in binary collisions #4851

Merged

roelof-groenewald added 2 commits April 10, 2024 22:00

avoid unnecessary duplication when species1 == species2

5b38792

Merge remote-tracking branch 'upstream/development' into development

9d94710