Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement neighborhood search based on CellListMap.jl #8

Open
wants to merge 22 commits into
base: main
Choose a base branch
from

Conversation

efaulhaber
Copy link
Member

@efaulhaber efaulhaber commented May 8, 2024

I finally properly implemented the prototype from trixi-framework/TrixiParticles.jl#174. This doesn't have an advantage over existing implementations yet, but we can use the advanced periodicity features of CellListMap.jl in the future. For now, it's mostly useful for benchmarking against CellListMap.jl.

Here is a benchmark using the n-body benchmark on a single thread of an AMD Threadripper 3990X.
n_body_serial
It's the usual plot, showing the runtime of a single interaction divided by the number of particles.
Lower is better, and a horizontal line means constant runtime per particle, so a perfect scaling with the problem size.

  • The DictionaryCellList implementation and CellListMap.jl start out with equal performance, but CellListMap.jl scales better.
  • After the recent performance optimizations, the FullGridCellList is now faster than CellListMap.jl.
  • "CellListMap equal" means points_equal_neighbors = true. For context, if this is set to false, a CellListPair is used, which works almost the same as the FullGridCellList (although storing coordinates together with indices, see Store coordinates together with the points in a cell list #52). When the two coordinate matrices are identical (so we are looking for fluid neighbors of fluid particles), we can use a regular CellList, which makes use of symmetry to skip half the particle-neighbor pairs. This implementation is faster here.
  • "FullGrid symmetric" is the FullGridCellList, but modified to also use symmetry like explained above. I implemented this in [Proof of Concept] Use symmetry to speed up the main loop #85. This reaches the same performance as the CellListMap.jl implementation.

The same plot using 128 threads on the Threadripper:
n_body_threaded1
The two CellListMap.jl implementations start quite slow because they are not using Polyester.jl. The regular CellListMap.jl implementation is also slower again for large problems for some reason.
n_body_threaded2
For a fair comparison, I added "CellListMap Polyester.jl", which is the same as the "CellListMap" implementation but using Polyester.jl on the particle loop, just like we do in PointNeighbors.jl. This is implemented in m3g/CellListMap.jl#104.
This plot looks very similar to the serial plot.

Here are the same plots for a single WCSPH interaction loop:
wcsph_serial
wcsph_threaded

Again, looking very similar to the N-Body benchmark, but "CellListMap equal" is a bit slower.

Here is a benchmark for the updates:
update
It is clear that one of our main features is:

Focus on fast incremental updates to be usable for particle-based simulations with frequent updates

Copy link

codecov bot commented May 8, 2024

Codecov Report

Attention: Patch coverage is 94.82759% with 3 lines in your changes missing coverage. Please review.

Project coverage is 89.20%. Comparing base (1e25058) to head (24f0c85).
Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
src/neighborhood_search.jl 0.00% 2 Missing ⚠️
ext/PointNeighborsCellListMapExt.jl 98.21% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main       #8      +/-   ##
==========================================
+ Coverage   88.15%   89.20%   +1.05%     
==========================================
  Files          15       16       +1     
  Lines         498      556      +58     
==========================================
+ Hits          439      496      +57     
- Misses         59       60       +1     
Flag Coverage Δ
unit 89.20% <94.82%> (+1.05%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@efaulhaber efaulhaber added the enhancement New feature or request label May 10, 2024
@efaulhaber efaulhaber closed this May 10, 2024
@efaulhaber efaulhaber reopened this May 10, 2024
@efaulhaber efaulhaber marked this pull request as ready for review December 4, 2024 10:09
svchb
svchb previously approved these changes Dec 4, 2024
@efaulhaber
Copy link
Member Author

@lmiq It would be great if you could have a look as well if you find the time. To check that I'm using CellListMap.jl correctly here and that I'm getting the best performance out of it.

@efaulhaber efaulhaber requested a review from LasNikas December 4, 2024 14:17
@lmiq
Copy link

lmiq commented Dec 4, 2024

@lmiq It would be great if you could have a look as well if you find the time. To check that I'm using CellListMap.jl correctly here and that I'm getting the best performance out of it.

Can you point to a MWE of a running benchmark with one of the methods you implemented? It is not obvious to me how to run the benchmarks.

@efaulhaber
Copy link
Member Author

I should probably update the plotting script and add this in the docs somewhere.
I basically just ran this

include("benchmarks/benchmarks.jl"); plot_benchmarks(benchmark_n_body, (10, 10, 10), 9, parallel=true)

where 9 means 9 iterations.

In the benchmark plotting code (benchmarks/plot.jl), I changed the plotting function to this:

function plot_benchmarks(benchmark, n_points_per_dimension, iterations;
                         parallel = true, title = "",
                         seed = 1, perturbation_factor_position = 1.0)
    neighborhood_searches_names = [#"TrivialNeighborhoodSearch";;
                                   "DictionaryCellList";;
                                   "FullGridCellList";;
                                #    "PrecomputedNeighborhoodSearch";;
                                   "CellListMapNeighborhoodSearch"]

    # Multiply number of points in each iteration (roughly) by this factor
    scaling_factor = 4
    per_dimension_factor = scaling_factor^(1 / length(n_points_per_dimension))
    sizes = [round.(Int, n_points_per_dimension .* per_dimension_factor^(iter - 1))
             for iter in 1:iterations]

    n_particles_vec = prod.(sizes)
    times = zeros(iterations, length(neighborhood_searches_names))

    for iter in 1:iterations
        coordinates = point_cloud(sizes[iter], seed = seed,
                                  perturbation_factor_position = perturbation_factor_position)

        search_radius = 3.0
        NDIMS = size(coordinates, 1)
        n_particles = size(coordinates, 2)

        neighborhood_searches = [
            # TrivialNeighborhoodSearch{NDIMS}(; search_radius, eachpoint = 1:n_particles),
            GridNeighborhoodSearch{NDIMS}(; search_radius, n_points = n_particles),
            GridNeighborhoodSearch{NDIMS}(; cell_list = FullGridCellList(; search_radius,
                                                                         min_corner = minimum(coordinates, dims = 2) .- search_radius,
                                                                         max_corner = maximum(coordinates, dims = 2) .+ search_radius),
                                          search_radius, n_points = n_particles),
            CellListMapNeighborhoodSearch(NDIMS; search_radius)
        ]

        for i in eachindex(neighborhood_searches)
            neighborhood_search = neighborhood_searches[i]
            initialize!(neighborhood_search, coordinates, coordinates)

            time = benchmark(neighborhood_search, coordinates, parallel = parallel)
            times[iter, i] = time
            time_string = BenchmarkTools.prettytime(time * 1e9)
            println("$(neighborhood_searches_names[i])")
            println("with $(join(sizes[iter], "x")) = $(prod(sizes[iter])) particles finished in $time_string\n")
        end
    end

    plot(n_particles_vec, times ./ n_particles_vec .* 1e9,
         xaxis = :log, yaxis = :log,
         xticks = (n_particles_vec, n_particles_vec), linewidth = 2,
         xlabel = "#particles", ylabel = "runtime per particle [ns]",
         legend = :outerright, size = (750, 400), dpi = 200, margin = 4 * Plots.mm,
         label = neighborhood_searches_names, title = title)
end

@lmiq
Copy link

lmiq commented Dec 4, 2024

Ok, I managed to run a very simple case:

julia> using CellListMap, PointNeighbors, StaticArrays

julia> function test_celllistmap(x)
           sys = ParticleSystem(positions=x, cutoff=5.0, output=zero(x))
           map_pairwise!(sys) do x, y, i, j, d2, f
               d = sqrt(d2)
               fij = (y - x) / d^3
               f[:,i] .+= fij
               f[:,j] .-= fij
               return f
           end
           return sys.output
       end
test_celllistmap (generic function with 1 method)

julia> function test_pointneighbors(x)
           f = zero(x)
           neighborhood_search = GridNeighborhoodSearch{3}(; search_radius=5.0, n_points=size(x,2))
           initialize!(neighborhood_search, x, x)
           foreach_point_neighbor(x, x, neighborhood_search, parallel=true) do i, j, pos_diff, distance
               distance < sqrt(eps()) && return
               dv = -pos_diff / distance^3
               for dim in axes(f, 1)
                   @inbounds f[dim, i] += dv[dim]
               end
           end
           return f
       end
test_pointneighbors (generic function with 1 method)

julia> x, _ = CellListMap.xgalactic(10^5); # 10^5 SVector{3,Float64} with "galactic" density.

julia> xmat = Matrix(reinterpret(reshape, Float64, x));

julia> @time test_celllistmap(xmat)  # second run
  3.557435 seconds (468.93 M allocations: 21.058 GiB, 42.83% gc time, 9 lock conflicts)
3×100000 Matrix{Float64}:
  40.841    -140.986   12.1143   77.6912   11.3534   16.8916    10.9625  …  32.2969       43.5096   7.40524   17.4521   82.746     4.78501
  -4.60323    10.6294  28.8101  -11.418   -33.4972  -12.3769   -17.9138     -0.0016226  -222.654   -4.39804  -80.1207   37.4482   51.3636
 -57.4807     14.2068  15.9348   71.1715   68.6271   30.9155  -120.342      -2.29914     -55.8974   3.31074  117.948   -27.3466  -26.7153

julia> @time test_pointneighbors(xmat) # second run
  0.857072 seconds (967 allocations: 49.540 MiB, 0.49% gc time)
3×100000 Matrix{Float64}:
  40.841    -140.986   12.1143   77.6912   11.3534   16.8916    10.9625  …  32.2969       43.5096   7.40524   17.4521   82.746     4.78501
  -4.60323    10.6294  28.8101  -11.418   -33.4972  -12.3769   -17.9138     -0.0016226  -222.654   -4.39804  -80.1207   37.4482   51.3636
 -57.4807     14.2068  15.9348   71.1715   68.6271   30.9155  -120.342      -2.29914     -55.8974   3.31074  117.948   -27.3466  -26.7153

julia> x, _ = CellListMap.xgalactic(10^6); # 10^6 SVector{3,Float64} with "galactic" density.

julia> xmat = Matrix(reinterpret(reshape, Float64, x));

julia> @time test_celllistmap(xmat)
 40.969504 seconds (5.51 G allocations: 247.261 GiB, 45.60% gc time, 13 lock conflicts)
3×1000000 Matrix{Float64}:
    9.20059   -2.41152     0.114992  107.299   -23.107   16.395   -19.8264  …  135.155    -4.36814  -3.9865    5.18173  107.441    -24.7465
  -26.039     16.8446    -37.0278    -13.3514   74.7956  15.8967   34.0362     -57.17     19.5576   46.6715  -22.3012    26.9519  -156.584
 -168.725    -63.8354   -163.919     -18.8817   51.9869  19.4064   34.0843      58.2957  -15.1018   37.0779  -16.5614   -23.6634    -5.18014

julia> @time test_pointneighbors(xmat)
 23.620195 seconds (6.68 k allocations: 496.365 MiB, 0.02% gc time)
3×1000000 Matrix{Float64}:
    9.20059   -2.41152     0.114992  107.299   -23.107   16.395   -19.8264  …  135.155    -4.36814  -3.9865    5.18173  107.441    -24.7465
  -26.039     16.8446    -37.0278    -13.3514   74.7956  15.8967   34.0362     -57.17     19.5576   46.6715  -22.3012    26.9519  -156.584
 -168.725    -63.8354   -163.919     -18.8817   51.9869  19.4064   34.0843      58.2957  -15.1018   37.0779  -16.5614   -23.6634    -5.18014

I guess this is more or less consistent with what you see?

If the two sets are different, I get:

julia> x, _ = CellListMap.xgalactic(10^5); # 10^5 SVector{3,Float64} with "galactic" density.

julia> y, _ = CellListMap.xgalactic(10^5); # 10^5 SVector{3,Float64} with "galactic" density.

julia> xmat = Matrix(reinterpret(reshape, Float64, x));

julia> ymat = Matrix(reinterpret(reshape, Float64, y));

julia> function test_celllistmap(x,y)
           sys = ParticleSystem(xpositions=x, ypositions=y, cutoff=5.0, output=zero(x))
           map_pairwise!(sys) do x, y, i, j, d2, f
               d = sqrt(d2)
               fij = (y - x) / d^3
               f[:,i] .+= fij
               return f
           end
           return sys.output
       end
test_celllistmap (generic function with 2 methods)

julia> @time test_celllistmap(xmat,ymat)
  2.527272 seconds (469.03 M allocations: 21.066 GiB, 39.28% gc time, 12 lock conflicts, 15.07% compilation time)
3×100000 Matrix{Float64}:
  21.8077   -0.183917  -25.5396  -38.6599    -7.02715   -22.8335  -115.32    …  -10.6988   21.6762    12.2227   -1.46057    13.9872    7.5366
 -17.4007  -35.3851     14.0842    6.13937   30.6116    -10.2836   309.663      116.57     -6.51407  138.022     8.54818    25.051   155.299
 -52.4528   70.1513     26.3846   31.3363   -72.5124   -166.564     31.0876      29.6242  -15.704     -5.80962  11.3542   -157.471     4.72672

julia> function test_pointneighbors(x,y)
           f = zero(x)
           neighborhood_search = GridNeighborhoodSearch{3}(; search_radius=5.0, n_points=size(x,2))
           initialize!(neighborhood_search, x, y)
           foreach_point_neighbor(x, y, neighborhood_search, parallel=true) do i, j, pos_diff, distance
               distance < sqrt(eps()) && return
               dv = -pos_diff / distance^3
               for dim in axes(f, 1)
                   @inbounds f[dim, i] += dv[dim]
               end
           end
           return f
       end
test_pointneighbors (generic function with 2 methods)

julia> @time test_pointneighbors(xmat,ymat)
  1.100204 seconds (942.08 k allocations: 92.813 MiB, 0.89% gc time, 21.68% compilation time)
3×100000 Matrix{Float64}:
  21.8077   -0.183917  -25.5396  -38.6599    -7.02715   -22.8335  -115.32    …  -10.6988   21.6762    12.2227   -1.46057    13.9872    7.5366
 -17.4007  -35.3851     14.0842    6.13937   30.6116    -10.2836   309.663      116.57     -6.51407  138.022     8.54818    25.051   155.299
 -52.4528   70.1513     26.3846   31.3363   -72.5124   -166.564     31.0876      29.6242  -15.704     -5.80962  11.3542   -157.471     4.72672

Let me know if (when) you are supporting periodic boundary conditions :-).

edit: Seems that orthorhombic boxes are already supported? (it it unusual to provide minimum and maximum coordinates instead of lengths, though, does it make any difference if the lengths are the same, but the corners are placed differently? - I guess it doesn't ).

@efaulhaber
Copy link
Member Author

efaulhaber commented Dec 5, 2024

I don't think your benchmarks are fair. f[:,i] .+= fij seems to allocate, which makes the CLM benchmark significantly slower. When the two sets are the same, CLM should be faster than PN (in my benchmark above, I used a different branch to benchmark the speedup from using symmetry).
Also, the default cell list is the DictionaryCellList, which should not be faster, probably slower than CLM.

it it unusual to provide minimum and maximum coordinates instead of lengths, though, does it make any difference if the lengths are the same, but the corners are placed differently? - I guess it doesn't

This originated from the way we initialize simulations. We might have a pipe flow with open boundaries, and then want to use periodic boundaries instead, and then we can just provide the bounding box for the domain.

We don't have any plans to provide more complex periodic boundaries. But when this PR is merged, we can add support for complex periodic boundaries with the CellListMap.jl backend.

@lmiq
Copy link

lmiq commented Dec 5, 2024

f[:,i] .+= fij seems to allocate

Yes, good catch (although why it allocates I don't know - it should not). Unfortunately everything I do lately I have to do in a rush...

With that fixed I get:

julia> function test_celllistmap(x)
           sys = ParticleSystem(positions=x, cutoff=5.0, output=zero(x))
           map_pairwise!(sys) do x, y, i, j, d2, f
               d = sqrt(d2)
               fij = (y - x) / d^3
               for k in eachindex(fij)
                   f[k,i] += fij[k]
                   f[k,j] -= fij[k]
               end
               return f
           end
           return sys.output
       end
test_celllistmap (generic function with 2 methods)

julia> x, _ = CellListMap.xgalactic(10^5); # 10^5 SVector{3,Float64} with "galactic" density.

julia> xmat = Matrix(reinterpret(reshape, Float64, x));

julia> @time test_celllistmap(xmat)  # second run
  0.298172 seconds (13.65 k allocations: 97.775 MiB, 1.98% gc time, 12 lock conflicts)
3×100000 Matrix{Float64}:
 -88.6852   -129.101   -46.6896  -48.3546  147.853   -24.7797    18.0712  …   -6.32157    9.97575    -1.63363  10.3277   15.7659    -29.5342
   5.05557   284.657    42.8502  -10.2702  101.038    69.9032  -101.463       21.2009   -36.1706   -128.78     32.0599    6.22065     4.78016
 101.876      97.5458   80.0019    1.7858  147.629  -120.023    -20.9682     -19.8745   -38.8108     26.2592   -2.34381   0.131151  108.677

julia> @time test_pointneighbors(xmat) # second run
  0.937435 seconds (968 allocations: 49.540 MiB, 2.56% gc time)
3×100000 Matrix{Float64}:
 -88.6852   -129.101   -46.6896  -48.3546  147.853   -24.7797    18.0712  …   -6.32157    9.97575    -1.63363  10.3277   15.7659    -29.5342
   5.05557   284.657    42.8502  -10.2702  101.038    69.9032  -101.463       21.2009   -36.1706   -128.78     32.0599    6.22065     4.78016
 101.876      97.5458   80.0019    1.7858  147.629  -120.023    -20.9682     -19.8745   -38.8108     26.2592   -2.34381   0.131151  108.677

and for the 10^6 system of particles:

julia> x, _ = CellListMap.xgalactic(10^6); # 10^6 SVector{3,Float64} with "galactic" density.

julia> xmat = Matrix(reinterpret(reshape, Float64, x));

julia> @time test_celllistmap(xmat)
  3.535903 seconds (103.01 k allocations: 1.097 GiB, 5.36% gc time, 11 lock conflicts)
3×1000000 Matrix{Float64}:
 15.8458    6.61017  129.444    16.249   -36.1833  -47.6617   -4.15331  …   83.2507  -135.724   -8.00536    8.17124    -9.491   -13.0002
 -2.91166  17.8658   -28.7162   17.7136  -39.6242   44.6349  -14.1746       12.7022   -14.5613  -3.03081   -8.79777  -136.763     4.94949
 19.9325    2.57851  -18.5558  -17.7504  -31.7794  173.664    -9.83589     -13.8241    37.9389  -7.4949   175.836     -37.2366   35.1991

julia> @time test_pointneighbors(xmat)
 24.696261 seconds (6.68 k allocations: 496.365 MiB, 0.01% gc time)
3×1000000 Matrix{Float64}:
 15.8458    6.61017  129.444    16.249   -36.1833  -47.6617   -4.15331  …   83.2507  -135.724   -8.00536    8.17124    -9.491   -13.0002
 -2.91166  17.8658   -28.7162   17.7136  -39.6242   44.6349  -14.1746       12.7022   -14.5613  -3.03081   -8.79777  -136.763     4.94949
 19.9325    2.57851  -18.5558  -17.7504  -31.7794  173.664    -9.83589     -13.8241    37.9389  -7.4949   175.836     -37.2366   35.1991

(does this make sense?)

I tried to run with FullGridCellList but it failed (I would bet is something related to getting particles at the limits of the periodic box):

julia> function test_pointneighbors(x)
           f = zero(x)
           neighborhood_search = GridNeighborhoodSearch{3}(; search_radius=5.0, n_points=size(x,2), 
                cell_list=FullGridCellList(search_radius=5.0, min_corner=[0.0, 0.0, 0.0], max_corner=[20.274, 20.274, 20.274]))
           initialize!(neighborhood_search, x, x)
           foreach_point_neighbor(x, y, neighborhood_search, parallel=true) do i, j, pos_diff, distance
               distance < sqrt(eps()) && return
               dv = -pos_diff / distance^3
               for dim in axes(f, 1)
                   @inbounds f[dim, i] += dv[dim]
               end
           end
           return f
       end
test_pointneighbors (generic function with 2 methods)

julia> @time test_pointneighbors(xmat)
ERROR: BoundsError: attempt to access 100×216 Matrix{Int32} at index [101, 115]
Stacktrace:
  [1] throw_boundserror(A::Matrix{Int32}, I::Tuple{Int64, Int64})
    @ Base ./essentials.jl:14
  [2] checkbounds
    @ ./abstractarray.jl:699 [inlined]
  [3] setindex!
    @ ./array.jl:993 [inlined]
  [4] pushat!
    @ ~/.julia/packages/PointNeighbors/vKy7i/src/vector_of_vectors.jl:64 [inlined]
  [5] push_cell!
    @ ~/.julia/packages/PointNeighbors/vKy7i/src/cell_lists/full_grid.jl:124 [inlined]
  [6] initialize_grid!(neighborhood_search::GridNeighborhoodSearch{…}, y::Matrix{…})
    @ PointNeighbors ~/.julia/packages/PointNeighbors/vKy7i/src/nhs_grid.jl:176
  [7] initialize!(neighborhood_search::GridNeighborhoodSearch{…}, x::Matrix{…}, y::Matrix{…})
    @ PointNeighbors ~/.julia/packages/PointNeighbors/vKy7i/src/nhs_grid.jl:162
  [8] test_pointneighbors(x::Matrix{Float64})
    @ Main ./REPL[85]:5
  [9] macro expansion
    @ ./timing.jl:581 [inlined]
 [10] top-level scope
    @ ./REPL[86]:1
Some type information was truncated. Use `show(err)` to see complete types.

@lmiq
Copy link

lmiq commented Dec 5, 2024

ps: The allocations come from the combination of the slice and broadcasting, which gets interpreted as:

f[:,i] .= f[:,i] .+ fij

and then the f[:,i] on the right allocates. (adding @views solves the allocations).

ext/PointNeighborsCellListMapExt.jl Show resolved Hide resolved
ext/PointNeighborsCellListMapExt.jl Show resolved Hide resolved
CellListMap.map_pairwise!(0, box, cell_list,
parallel = parallel !== false) do x, y, i, j, d2, output
# Skip all indices not in `points`
i in points || return output
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
i in points || return output
i in points || return output

Doesn't this produce a notable overhead which affects the benchmark?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried. It's negligible.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just out of interest. Am I missing something?
This check is done for each particle - so multiple times. For a large set, the overhead accumulates, no?

julia> f(i, points) = i in points
f (generic function with 1 method)

julia> points = rand(Int, 1_000_000);

julia> @benchmark f($5, $points)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min  max):  221.584 μs  380.625 μs  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     221.834 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   223.171 μs ±   4.460 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▅▂▄▁   ▅  ▁  ▁                                               ▁
  █████▇▆██▇██▆██▇▆▆▆▇▅▆▆▅▅▆▅▇▅▄▄▃▅▄▄▅▃▃▄▃▄▂▂▃▃▃▃▂▃▂▂▄▅▇▆▄▃▅▂▃▇ █
  222 μs        Histogram: log(frequency) by time        243 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The difference is that this is about 1-2% of the actual computation when you have a real example where you are doing actual work per particle.

ext/PointNeighborsCellListMapExt.jl Outdated Show resolved Hide resolved
@efaulhaber
Copy link
Member Author

efaulhaber commented Dec 6, 2024

I tried to run with FullGridCellList but it failed

@lmiq This is a bit tricky. You have to pad the bounding box with the search radius and increase the maximum number of particles per cell. The former we want to change (#80), the latter should have a more meaningful error message (#65).

Here is a benchmark for only the interaction, not the initialization:

using CellListMap, PointNeighbors, StaticArrays, BenchmarkTools

x, _ = CellListMap.xgalactic(10^6);
xmat = Matrix(reinterpret(reshape, Float64, x));
f = similar(xmat)
f2 = similar(xmat)
f3 = similar(xmat)
f4 = similar(xmat)

nhs1 = GridNeighborhoodSearch{3}(; search_radius=5.0, n_points=size(xmat,2))
initialize!(nhs1, xmat, xmat)

min_corner = minimum(xmat, dims=2) .- 5
max_corner = maximum(xmat, dims=2) .+ 5
nhs2 = GridNeighborhoodSearch{3}(; search_radius=5.0, n_points=size(xmat,2),
                                 cell_list=FullGridCellList(search_radius=5.0;
                                                            min_corner, max_corner,
                                                            max_points_per_cell=2000))
initialize!(nhs2, xmat, xmat)

sys = ParticleSystem(positions=x, cutoff=5.0, output=f)
sys2 = ParticleSystem(xpositions=x, ypositions=x, cutoff=5.0, output=f2)

function test_celllistmap(_, sys)
    sys.output .= 0

    map_pairwise!(sys) do x, y, i, j, d2, f
        d = sqrt(d2)
        fij = (y - x) / d^3
        for dim in axes(f, 1)
            @inbounds f[dim, i] += fij[dim]
            @inbounds f[dim, j] -= fij[dim]
        end

        return f
    end
    return sys.output
end

function test_celllistmap_different(_, sys)
    sys.output .= 0

    map_pairwise!(sys) do x, y, i, j, d2, f
        d2 < eps() && return f
        d = sqrt(d2)
        fij = (y - x) / d^3
        for dim in axes(f, 1)
            @inbounds f[dim, i] += fij[dim]
        end

        return f
    end
    return sys.output
end

function test_pointneighbors(x, f, nhs)
    f .= 0

    foreach_point_neighbor(x, x, nhs, parallel=true) do i, j, pos_diff, distance
        distance < sqrt(eps()) && return
        dv = -pos_diff / distance^3
        for dim in axes(f, 1)
            @inbounds f[dim, i] += dv[dim]
        end
    end
    return f
end

I get this:

julia> @btime test_celllistmap($x, $sys);
  1.776 s (959 allocations: 190.39 KiB)

julia> @btime test_celllistmap_different($x, $sys2);
  17.207 s (985 allocations: 1.74 MiB)

julia> @btime test_pointneighbors($xmat, $f3, $nhs1);
  22.048 s (1 allocation: 128 bytes)

julia> @btime test_pointneighbors($xmat, $f4, $nhs2);
  18.535 s (1 allocation: 192 bytes)

The last three seem reasonable based on my original benchmarks in the first post. The first one looks suspicious.
But I checked and the results are the same:

julia> isapprox(f, f2) && isapprox(f, f3) && isapprox(f, f4)
true

The FullGridCellList (nhs2) is almost identical to the CellListPair implementation. The differences are:

  1. We don't store the coordinates together with the point index.
  2. The data structures are different. I designed them to be GPU-compatible and allow for fully parallel initialization/update.

Why is the first one so much faster? In my benchmarks in the first post, I found that the speedup comes only from using symmetry to skip half the pairs. But this should be a speedup of at most 2x.

@lmiq
Copy link

lmiq commented Dec 6, 2024

Why is the first one so much faster? In my benchmarks in the first post, I found that the speedup comes only from using symmetry to skip half the pairs. But this should be a speedup of at most 2x

Indeed, there's something not scaling well there in CellListMap when using the computation between two sets of particles.
The paralellization mechanisms are different. In the "single-set" case, the hottest loop occurs on non-empty cells, while in the "two-sets-of-particles" case the hottest loop and parallelization occurs on the particles of the largest set.

Thus, the parallelization mechanisms and splitting of tasks are different, and the reason for that is empirical testing of different schemes at some point in the development. I will probably have to revisit the behavior of the two-set case at some point to check these issues.

For the records, this is what I got, increasing the number of particles:

julia> run(;n=5*10^4)
CellListMap (t1): 0.163701008 s
CellListMap different (t2/t1): 1.8283861147635694
PointNeighbors nhs1 (t3/t1): 2.213968633595708
PointNeighbors nhs2 (t4/t1): 2.1056406384498256
Test Passed

julia> run(;n=10^5)
CellListMap (t1): 0.305206052 s
CellListMap different (t2/t1): 2.0926074362378633
PointNeighbors nhs1 (t3/t1): 2.8729478044557255
PointNeighbors nhs2 (t4/t1): 2.8898966918257574
Test Passed

julia> run(;n=5*10^5)
CellListMap (t1): 1.581912255 s
CellListMap different (t2/t1): 2.4655888565703035
PointNeighbors nhs1 (t3/t1): 4.414566006380676
PointNeighbors nhs2 (t4/t1): 4.29086459855512
Test Passed

julia> run(;n=10^6)
CellListMap (t1): 3.4457909320000004 s
CellListMap different (t2/t1): 3.6117841437862372
PointNeighbors nhs1 (t3/t1): 6.811157335183328
PointNeighbors nhs2 (t4/t1): 6.209188126681157
Test Passed

Where we can see that for a smallish number of particles the ratio between "one-set" and "two-sets" is roughly 2x, which is what we expect, but it is getting worse as the number of particles increases.

This is the code I ran:

using CellListMap, PointNeighbors, StaticArrays, BenchmarkTools, Chairmarks
using Test

function test_celllistmap(_, sys)
    sys.output .= 0
    map_pairwise!(sys) do x, y, i, j, d2, f
        d = sqrt(d2)
        fij = (y - x) / d^3
        for dim in axes(f, 1)
            @inbounds f[dim, i] += fij[dim]
            @inbounds f[dim, j] -= fij[dim]
        end

        return f
    end
    return sys.output
end

function test_celllistmap_different(_, sys)
    sys.output .= 0
    map_pairwise!(sys) do x, y, i, j, d2, f
        d2 < eps() && return f
        d = sqrt(d2)
        fij = (y - x) / d^3
        for dim in axes(f, 1)
            @inbounds f[dim, i] += fij[dim]
        end
        return f
    end
    return sys.output
end

function test_pointneighbors(x, f, nhs)
    f .= 0
    foreach_point_neighbor(x, x, nhs, parallel=true) do i, j, pos_diff, distance
        distance < sqrt(eps()) && return
        dv = -pos_diff / distance^3
        for dim in axes(f, 1)
            @inbounds f[dim, i] += dv[dim]
        end
    end
    return f
end

function run(;n=10^5)
    x, _ = CellListMap.xgalactic(n);
    xmat = Matrix(reinterpret(reshape, Float64, x));
    f = similar(xmat)
    f2 = similar(xmat)
    f3 = similar(xmat)
    f4 = similar(xmat)
    
    nhs1 = GridNeighborhoodSearch{3}(; search_radius=5.0, n_points=size(xmat,2))
    initialize!(nhs1, xmat, xmat)
    
    min_corner = minimum(xmat, dims=2) .- 5
    max_corner = maximum(xmat, dims=2) .+ 5
    nhs2 = GridNeighborhoodSearch{3}(; search_radius=5.0, n_points=size(xmat,2),
                                    cell_list=FullGridCellList(search_radius=5.0;
                                                                min_corner, max_corner,
                                                                max_points_per_cell=2000))
    initialize!(nhs2, xmat, xmat)
    sys = ParticleSystem(positions=x, cutoff=5.0, output=f)
    sys2 = ParticleSystem(xpositions=x, ypositions=x, cutoff=5.0, output=f2)

    b1 = @b test_celllistmap($(nothing), $sys)
    println("CellListMap (t1): ", b1.time, " s")
    b2 = @b test_celllistmap_different($(nothing), $sys2)
    println("CellListMap different (t2/t1): ", b2.time / b1.time) 
    b3 = @b test_pointneighbors($xmat, $f3, $nhs1)
    println("PointNeighbors nhs1 (t3/t1): ", b3.time / b1.time)
    b4 = @b test_pointneighbors($xmat, $f4, $nhs2)
    println("PointNeighbors nhs2 (t4/t1): ", b4.time / b1.time)

    @test f  f2  f3  f4
end   

@efaulhaber
Copy link
Member Author

Many thanks for the discussion!

Where we can see that for a smallish number of particles the ratio between "one-set" and "two-sets" is roughly 2x, which is what we expect, but it is getting worse as the number of particles increases.

Interesting that I didn't see this in my original benchmark, even though I'm using a lot bigger problems there. Probably because our use cases and therefore our benchmarks are very different. You use a much larger search radius with almost 2000 particles per cell. In our simulations and benchmarks, we usually have less than 30 particles per cell and ~100 neighbor particles within the search radius.

I will probably have to revisit the behavior of the two-set case at some point to check these issues.

Please report back when you find something interesting.

@lmiq
Copy link

lmiq commented Dec 9, 2024

Probably because our use cases and therefore our benchmarks are very different. You use a much larger search radius with almost 2000 particles per cell. In our simulations and benchmarks, we usually have less than 30 particles per cell and ~100 neighbor particles within the search radius

Indeed, the density relative to the cutoff makes a lot of difference. I have tried to optimize CellListMap with two types of systems in mind: atomic systems and "galactic" systems. The examples above use the density/cutoff that are typical of astrophysical calculations (the density is the density of galaxies in the universe and the cutoff is one I found that was typical for computing properties of such systems).

The atomic systems are less dense (relative to the cutoff). They are generated with the CellListMap.xatomic function. These have about 140 particles per cell. I have the same issue here:

julia> run(;n=10^5)
CellListMap (t1): 0.053651925 s
CellListMap different (t2/t1): 2.118421175009098
Test Passed

julia> run(;n=5*10^5)
CellListMap (t1): 0.272764123 s
CellListMap different (t2/t1): 2.6336540088155216
Test Passed

julia> run(;n=10^6)
CellListMap (t1): 0.5655830310000001 s
CellListMap different (t2/t1): 3.85268252505263
Test Passed

(I couldn't run PointNeighbors here, because it threw some errors, probably related to the padding of the boxes).

Please report back when you find something interesting.

The difference in performances, though, have nothing to do with the scaling of the parallelization. It is the algorithm that is different. In fact, I had already noted that possibility and the code is commented.

The thing is that when using a full-cell list approach, I'm using a projection-approach to eliminate the calculation of unnecessary distances. It consists of projecting the positions of the particles into the vector connecting the centers of the cells being considered, sorting the positions along that line, and only computing interactions if the distance along that line is smaller than the cutoff. This method was proposed in 1 and 2.

For the two-sets case, this is clearly very advantageous if both sets are large, but if one of the sets is small the cost of computing the cell lists for the largest set becomes less favorable (at least it was the case when I benchmarked these stuff). Probably I will need to test again if computing the lists for all particles is not worth the effort now, or at least switch automatically the method using some criteria.

Code ran for the benchmark above:

using CellListMap, PointNeighbors, StaticArrays
using Chairmarks
using Test

function test_celllistmap(_, sys)
    sys.output .= 0
    map_pairwise!(sys) do x, y, i, j, d2, f
        d = sqrt(d2)
        fij = (y - x) / d^3
        for dim in axes(f, 1)
            @inbounds f[dim, i] += fij[dim]
            @inbounds f[dim, j] -= fij[dim]
        end

        return f
    end
    return sys.output
end

function test_celllistmap_different(_, sys)
    sys.output .= 0
    map_pairwise!(sys) do x, y, i, j, d2, f
        d2 < eps() && return f
        d = sqrt(d2)
        fij = (y - x) / d^3
        for dim in axes(f, 1)
            @inbounds f[dim, i] += fij[dim]
        end
        return f
    end
    return sys.output
end

function test_pointneighbors(x, f, nhs)
    f .= 0
    foreach_point_neighbor(x, x, nhs, parallel=true) do i, j, pos_diff, distance
        distance < sqrt(eps()) && return
        dv = -pos_diff / distance^3
        for dim in axes(f, 1)
            @inbounds f[dim, i] += dv[dim]
        end
    end
    return f
end

function run(;n=10^5)
    x, box = CellListMap.xatomic(n);

    xmat = Matrix(reinterpret(reshape, Float64, x));
    f = similar(xmat)
    f2 = similar(xmat)
    f3 = similar(xmat)
    f4 = similar(xmat)
    
    nhs1 = GridNeighborhoodSearch{3}(; search_radius=box.cutoff, n_points=size(xmat,2))
    initialize!(nhs1, xmat, xmat)
    
    min_corner = minimum(xmat, dims=2) .- 5
    max_corner = maximum(xmat, dims=2) .+ 5
    nhs2 = GridNeighborhoodSearch{3}(; search_radius=box.cutoff, n_points=size(xmat,2),
                                    cell_list=FullGridCellList(search_radius=box.cutoff;
                                                                min_corner, max_corner,
                                                                max_points_per_cell=2000))
    initialize!(nhs2, xmat, xmat)
    sys = ParticleSystem(positions=x, cutoff=box.cutoff, output=f)
    sys2 = ParticleSystem(xpositions=x, ypositions=x, cutoff=box.cutoff, output=f2)

    b1 = @b test_celllistmap($(nothing), $sys)
    println("CellListMap (t1): ", b1.time, " s")
    b2 = @b test_celllistmap_different($(nothing), $sys2)
    println("CellListMap different (t2/t1): ", b2.time / b1.time) 
#    b3 = @b test_pointneighbors($xmat, $f3, $nhs1)
#    println("PointNeighbors nhs1 (t3/t1): ", b3.time / b1.time)
#    b4 = @b test_pointneighbors($xmat, $f4, $nhs2)
#    println("PointNeighbors nhs2 (t4/t1): ", b4.time / b1.time)

#    @test f ≈ f2 ≈ f3 ≈ f4
    @test f  f2
end   

@lmiq
Copy link

lmiq commented Dec 11, 2024

In this branch I have implemented the strategy of computing cell lists for both sets of particles, and then using the same method to avoid computation of distances as was used in the single-set problems: https://github.com/m3g/CellListMap.jl/tree/new_CellListPair

Now I get the following results from the benchmark in #8 (comment)

julia> run(;n=5*10^4)
CellListMap (t1): 0.152432735 s
CellListMap different (t2/t1): 1.6050779643886859
PointNeighbors nhs1 (t3/t1): 2.2481417918533046
PointNeighbors nhs2 (t4/t1): 2.258499921293153
Test Passed

julia> run(;n=10^5)
CellListMap (t1): 0.299085832 s
CellListMap different (t2/t1): 1.710319695116819
PointNeighbors nhs1 (t3/t1): 2.901262748547715
PointNeighbors nhs2 (t4/t1): 2.868669272170673
Test Passed

julia> run(;n=5*10^5)
CellListMap (t1): 1.582251368 s
CellListMap different (t2/t1): 1.6601796706425727
PointNeighbors nhs1 (t3/t1): 4.411138771725176
PointNeighbors nhs2 (t4/t1): 4.24521378577819
Test Passed

julia> run(;n=10^6)
CellListMap (t1): 3.4587672630000004 s
CellListMap different (t2/t1): 1.725844078859029
PointNeighbors nhs1 (t3/t1): 6.799690402586073
PointNeighbors nhs2 (t4/t1): 6.206536996473243
Test Passed

For these tests this is a clear improvement for CellListMap, as the scaling problem is solved and the "different" approach is less than 2x slower than the approach using the symmetry.

Nevertheless, for smaller but relevant (maybe even more common for us) systems it becomes somewhat slower, and by constructing cell lists for both sets we use more memory, which can be a problem for very large systems. Thus, I'm still wondering if I'll merge that change as it is.

But, it is important to remark that this method to avoid unnecessary computations is relevant for relatively dense cells (many particles per cell), or if the computations to be performed for each pair of particle are expensive, such that avoiding any computation is important. For lower densities per cells and fast operations per pair it might be detrimental for performance.

@efaulhaber
Copy link
Member Author

efaulhaber commented Dec 12, 2024

Why do I get this on your branch?

julia> CellListMap.Box
ERROR: UndefVarError: `Box` not defined in `CellListMap`
Suggestion: check for spelling errors or missing imports.
Stacktrace:
 [1] getproperty(x::Module, f::Symbol)
   @ Base ./Base.jl:42
 [2] top-level scope
   @ REPL[7]:1

@lmiq
Copy link

lmiq commented Dec 12, 2024

Can you try now? I'm in the middle of updating other things and you might have reached an unusable state.

@efaulhaber
Copy link
Member Author

efaulhaber commented Dec 13, 2024

Very interesting. Even for my benchmark in the first post with much less dense cells, this is significantly faster than the original version:

GNHS with `FullGridCellList`
with 254x254x254 = 16387064 particles finished in 790.534 ms

CellListMap.jl with `CellListPair` original
with 254x254x254 = 16387064 particles finished in 2.644 s

CellListMap.jl with `CellList` (equal sets) original
with 254x254x254 = 16387064 particles finished in 539.042 ms

CellListMap.jl with `CellListPair` new
with 254x254x254 = 16387064 particles finished in 977.233 ms

Pretty much a factor of 2 now compared to the symmetric version.

@lmiq
Copy link

lmiq commented Dec 13, 2024

Indeed, the strategy to avoid unncessary computations is suprisingly effective. It is something I guess you could implement as well, it is a local computation (you just need a data structure that allows all the projection and sorting of particles local).

@efaulhaber
Copy link
Member Author

Thanks! I will look into this.

@efaulhaber efaulhaber requested review from LasNikas and svchb January 6, 2025 11:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants