Automatically run multi domains and parallel tasks instead of using domain manager #791

yawaramin · 2024-12-28T15:56:04Z

This issue is related to the forum discussion https://discuss.ocaml.org/t/multiple-domains-design-question-in-eio-etc/15861 and the cohttp issue mirage/ocaml-cohttp#1101

The concern is that multiple parts of an app, including cohttp-eio, may each be using the raw domain manager to create their own pool of domains. Instead, they should all coordinate and use the same executor pool.

yawaramin · 2024-12-29T05:16:32Z

On thinking about this some more, I believe that Eio users creating new domains with the domain manager, or creating domain pools with the executor pool and using them in different parts of the app, is usually the wrong level of abstraction.

Imho, we should provide an API that makes it clear that Eio programs are expected to run in a single domain, and that Eio will replicate the program across multiple domains (so of course it must be domain-safe). And this API should not expose any decisions about the number of domains, because we already know what the number should be. Eg:

(* val run_multi : (Eio_unix.Stdenv.base -> unit) -> unit *)

let run_multi domain =
  Eio_main.run (fun env ->
  let fiber () = domain env in
  let new_domain _ () =
    Eio.Domain_manager.run (Eio.Stdenv.domain_mgr env) fiber
  in
  let fibers =
    fiber :: List.init (Domain.recommended_domain_count () - 1) new_domain
  in
  Eio.Fiber.all fibers)

let job_queue ~sw env () = ... (* Could be a periodic sync job or whatever *)

let () =
  run_multi (fun env ->
  Eio.Switch.run (fun sw ->
  Eio.Fiber.fork ~sw (job_queue ~sw env);

  Dream.run env
  @@ Dream.logger
  @@ Dream.router [
    ...
  ]))

This way, apps running with Eio_main.run (fun env -> ...) could just change to Eio_main.run_multi (fun env -> ...) and instantly become multicore. We can also remove access to the domain manager from the env that we pass in to them, so that we stay within the domain limit.

Of course the key issue is that the callback to run_multi needs to be multi-domain safe, eg if a socket is listening to a port it must allow reusing the port, and the other more obvious things.
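As a rough illustration of the port-reuse requirement: if every domain runs the same callback, each one will try to listen on the same address. The sketch below assumes the eio_main package is installed and uses the reuse_addr and reuse_port options of Eio.Net.listen. The port number and the listen_shared helper are hypothetical, and both listeners are created in a single domain only to keep the example short.

```ocaml
(* Requires the eio_main package. *)

(* Hypothetical port chosen for illustration only. *)
let port = 8080

(* Open a listening socket that other listeners may share.
   Without ~reuse_port:true the second call below would fail to bind. *)
let listen_shared ~sw net =
  Eio.Net.listen ~reuse_addr:true ~reuse_port:true ~backlog:128 ~sw net
    (`Tcp (Eio.Net.Ipaddr.V4.loopback, port))

let () =
  Eio_main.run @@ fun env ->
  Eio.Switch.run @@ fun sw ->
  let net = Eio.Stdenv.net env in
  (* Two listeners on one port, as two replicated domains would create. *)
  let _first = listen_shared ~sw net in
  let _second = listen_shared ~sw net in
  Eio.traceln "two listeners bound to port %d" port
```

Both sockets are closed when the switch finishes. How incoming connections are distributed between the listeners is decided by the operating system, not by Eio.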

Implementation of run_multi that removes domain manager access


module E : sig
  type stdenv = <
    stdin : Eio_unix.source_ty Eio.Std.r;
    stdout : Eio_unix.sink_ty Eio.Std.r;
    stderr : Eio_unix.sink_ty Eio.Std.r;
    net : [ `Unix | `Generic ] Eio.Net.ty Eio.Std.r;
    process_mgr : Eio_unix.Process.mgr_ty Eio.Std.r;
    clock : float Eio.Time.clock_ty Eio.Std.r;
    mono_clock : Eio.Time.Mono.ty Eio.Std.r;
    fs : Eio.Fs.dir_ty Eio.Path.t;
    cwd : Eio.Fs.dir_ty Eio.Path.t;
    secure_random : Eio.Flow.source_ty Eio.Std.r;
    debug : Eio.Debug.t;
    backend_id : string;
  >

  val run_multi : (stdenv -> unit) -> unit
end = struct
  type stdenv = <
    stdin : Eio_unix.source_ty Eio.Std.r;
    stdout : Eio_unix.sink_ty Eio.Std.r;
    stderr : Eio_unix.sink_ty Eio.Std.r;
    net : [ `Unix | `Generic ] Eio.Net.ty Eio.Std.r;
    process_mgr : Eio_unix.Process.mgr_ty Eio.Std.r;
    clock : float Eio.Time.clock_ty Eio.Std.r;
    mono_clock : Eio.Time.Mono.ty Eio.Std.r;
    fs : Eio.Fs.dir_ty Eio.Path.t;
    cwd : Eio.Fs.dir_ty Eio.Path.t;
    secure_random : Eio.Flow.source_ty Eio.Std.r;
    debug : Eio.Debug.t;
    backend_id : string;
  >

  let run_multi domain =
    Eio_main.run (fun env ->
    let fiber () = domain (env :> stdenv) in
    let new_domain _ () =
      Eio.Domain_manager.run (Eio.Stdenv.domain_mgr env) fiber
    in
    let fibers =
      fiber :: List.init (Domain.recommended_domain_count () - 1) new_domain
    in
    Eio.Fiber.all fibers)
end

yawaramin · 2025-01-03T05:41:27Z

OK I have a slightly more sophisticated POC here: https://github.com/yawaramin/dream/blob/eio-par/example/w-dream-html/par.mli#L20

Sorry it's in a slightly haphazard place because I'm experimenting with Dream's Eio port.

So we have Par.run (fun env -> ...) instead of Eio_main.run (fun env -> ...), which works as I described earlier, ie it automatically runs the same callback on all available domains. I believe this is somewhat analogous to Eio.Executor_pool.submit.

But the key new addition here is a parallelized task runner that is available in each domain: Par.exec env fn. This allows any fiber in any domain to submit a job that will be split up and run on all worker domains with a divide-and-conquer algorithm. The result will be a promise of an array (one element per worker domain).

All domains except domain 0 are treated as worker domains. Domain 0 is not given any of the tasks in order to keep it available for I/O. If there is only a single domain available, we are just falling back to regular Eio_main.run.

Here's an example showing float array summation: https://github.com/yawaramin/dream/blob/eio-par/example/w-dream-html/par.ml#L93 . This is giving a slice of the array to each domain to sum, getting back a promise of an array of the per-domain sums, then finally summing up the array into a single float promise.

It's called in the request handler: https://github.com/yawaramin/dream/blob/eio-par/example/w-dream-html/html.ml#L24
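For readers who do not want to follow the links, the slice-and-combine idea behind the summation example can be sketched with only the OCaml 5 standard library. This is not the Par.exec API from the POC: sum_slice and parallel_sum are hypothetical names, the function blocks on Domain.join instead of returning a promise, and it spawns fresh domains on each call rather than reusing long-lived workers.

```ocaml
(* Sum the [len] elements of [arr] starting at index [pos]. *)
let sum_slice arr ~pos ~len =
  let total = ref 0.0 in
  for i = pos to pos + len - 1 do
    total := !total +. arr.(i)
  done;
  !total

(* Give one contiguous slice to each of [workers] domains, then combine
   the per-domain partial sums in the calling domain. *)
let parallel_sum ~workers arr =
  let n = Array.length arr in
  let workers = max 1 workers in
  (* Slice [i] covers the half-open index range [bound i, bound (i + 1)). *)
  let bound i = i * n / workers in
  let spawn i =
    let pos = bound i in
    let len = bound (i + 1) - pos in
    Domain.spawn (fun () -> sum_slice arr ~pos ~len)
  in
  let domains = List.init workers spawn in
  let partial_sums = List.map Domain.join domains in
  List.fold_left ( +. ) 0.0 partial_sums

let () =
  let arr = Array.init 1_000 float_of_int in
  (* 0 + 1 + ... + 999 = 499500 *)
  Printf.printf "%.1f\n" (parallel_sum ~workers:3 arr)
```

The array is only read by the workers, so no locking is needed. Spawning domains per call is expensive, which is one reason a long-lived pool of workers is attractive.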

talex5 · 2025-01-10T13:19:43Z

A few thoughts on this:

- We don't know if recommended_domain_count is a good number of domains. It depends what else the computer is doing. e.g. if you're running a database on the same machine, you might want to give it half the cores. Using all the cores gives the kernel nowhere else to run tasks, which can cause GC slowdowns.
- For an HTTP server, there is a distinction between which domains accept requests (IO) and which ones do CPU intensive jobs. Typically I would have one accepting domain and have the HTTP request handler push CPU-intensive batch jobs to other domains with an executor pool. Running multiple accept loops seems mostly useful for trivial benchmarks (e.g. serving static pages).
- Multicore programs typically want to share something between domains. Otherwise, it's faster to run separate processes and avoid the GC stop-the-world overhead.
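One way to act on the first point is to treat Domain.recommended_domain_count as an upper bound rather than the answer, and let the operator lower it. The sketch below uses only the standard library. The APP_DOMAINS environment variable and both function names are hypothetical, and the default of leaving one core free is an assumption made for illustration, not a recommendation from Eio.

```ocaml
(* Decide how many domains to start.
   [requested] is an optional operator override; [recommended] is the
   upper bound we never exceed. *)
let clamp_domains ~recommended requested =
  match requested with
  | Some n when n >= 1 -> min n recommended
  | Some _ | None ->
    (* Assumed default: leave one core free for the kernel and other
       processes, but always start at least one domain. *)
    max 1 (recommended - 1)

(* APP_DOMAINS is a hypothetical variable name; Eio does not read it. *)
let domain_count () =
  let requested =
    Option.bind (Sys.getenv_opt "APP_DOMAINS") int_of_string_opt
  in
  clamp_domains ~recommended:(Domain.recommended_domain_count ()) requested

let () = Printf.printf "starting %d domain(s)\n" (domain_count ())
```

The resulting number could then be passed as the domain_count of an executor pool, so that the decision is made in one place instead of in each library.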

yawaramin · 2025-01-14T03:11:32Z

Thanks @talex5. I have some follow-up questions.

> We don't know if recommended_domain_count is a good number of domains.

Isn't recommended_domain_count just the number of logical cores available on the system? Isn't it a good default in general to start that many domains ie operating system threads? As far as I am aware every modern multi-threaded language runtime does this, eg Go.

And even if it's not a good default for some cases, why should a simplified multicore scheduler function like Par.run try to handle those edge cases when users can already use Eio_main.run and Domain_manager to precisely control the number of domains they start?

> I would have one accepting domain and have the HTTP request handler push CPU-intensive batch jobs to other domains with an executor pool.

My proposal wouldn't prevent that design; we could check that we are on domain 0 and only run the accept loop there.

But I am a bit surprised to hear you say this, because 'Running multiple accept loops' seems to be exactly what Eio.Net.run_server is doing when given multiple domains (eio/lib_eio/net.ml, line 384 in fdd2593):

Fiber.fork ~sw (fun () -> Domain_manager.run domain_mgr (fun () ->

Am I missing something?

> Multicore programs typically want to share something between domains.

With my suggestion we can easily share anything that is defined before Par.run:

let () =
  let shared = ...shared stuff... in
  Par.run @@ fun env ->
  ...access shared...
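A concrete, standard-library-only version of this pattern is sketched below. It does not use Par.run: plain Domain.spawn stands in for the replicated callback, and count_hits is a hypothetical name. The point it illustrates is that whatever is defined before the domains start must itself be domain-safe, here an Atomic counter.

```ocaml
(* Create shared state once, then let the calling domain and
   [extra_domains] additional domains all update it. *)
let count_hits ~extra_domains ~per_domain =
  (* Shared stuff: defined before any domain is started. *)
  let hits = Atomic.make 0 in
  (* Stands in for the callback that every domain runs. *)
  let body () =
    for _ = 1 to per_domain do
      Atomic.incr hits
    done
  in
  let domains = List.init extra_domains (fun _ -> Domain.spawn body) in
  body ();
  List.iter Domain.join domains;
  Atomic.get hits

let () =
  (* 4 domains in total, 1000 increments each. *)
  Printf.printf "%d\n" (count_hits ~extra_domains:3 ~per_domain:1_000)
```

With a plain int ref instead of Atomic.t, increments from different domains could be lost, so the final count would not be reliable.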

talex5 · 2025-01-14T11:57:55Z

> Isn't recommended_domain_count just the number of logical cores available on the system? Isn't it a good default in general to start that many domains ie operating system threads? As far as I am aware every modern multi-threaded language runtime does this, eg Go.

OCaml requires all domains to synchronise on every minor GC. If any domain is slow, it will delay all of them. The more domains you have, the more likely this is (and if you exceed the number of cores then at least one domain will always spin waiting for the others to be ready, preventing the remaining domain from running until the OS decides to preempt the spinning one). See https://roscidus.com/blog/blog/2024/07/22/performance-2/ for examples.

> But I am a bit surprised to hear you say this, because 'Running multiple accept loops' seems to be exactly what Eio.Net.run_server is doing when given multiple domains:

Yes; what I mean is: passing a domain manager to run_server is only useful if you want to run multiple accept loops. Otherwise, just pass an executor pool to your HTTP handler. The cohttp-eio docs should probably be updated to say this.
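A minimal sketch of the "pass an executor pool to your handler" design follows, assuming the eio_main package and an Eio version that provides Eio.Executor_pool. The handle_request function is a hypothetical stand-in for an HTTP handler and fib for a CPU-intensive job; no real sockets or cohttp-eio calls are involved.

```ocaml
(* Requires the eio_main package. *)

(* Stand-in for a CPU-intensive job. *)
let rec fib n = if n < 2 then n else fib (n - 1) + fib (n - 2)

(* Hypothetical handler: it stays in the accepting domain and only
   offloads the heavy computation to the pool, waiting for the result. *)
let handle_request pool n =
  Eio.Executor_pool.submit_exn pool ~weight:1.0 (fun () -> fib n)

let () =
  Eio_main.run @@ fun env ->
  Eio.Switch.run @@ fun sw ->
  (* One pool for the whole program; domain_count:2 is arbitrary here. *)
  let pool =
    Eio.Executor_pool.create ~sw ~domain_count:2 (Eio.Stdenv.domain_mgr env)
  in
  (* Two "requests" handled concurrently by fibers in a single domain. *)
  let a, b =
    Eio.Fiber.pair
      (fun () -> handle_request pool 20)
      (fun () -> handle_request pool 25)
  in
  Eio.traceln "fib 20 = %d, fib 25 = %d" a b
```

A weight of 1.0 tells the pool that the job is expected to occupy a worker domain fully. Because the pool is created once and passed down, every component shares the same set of domains instead of creating its own.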

yawaramin changed the title ~~Eio.Net.run_server should take an executor pool instead of the domain manager~~ Eio.Net.run_server should take an executor pool instead of the domain manager (maybe?) Dec 29, 2024

yawaramin mentioned this issue Dec 29, 2024

Cohttp-eio: take executor pool instead of creating domains directly mirage/ocaml-cohttp#1101

Closed

yawaramin changed the title ~~Eio.Net.run_server should take an executor pool instead of the domain manager (maybe?)~~ Automatically run multi domains and parallel tasks instead of using domain manager Jan 3, 2025

talex5 mentioned this issue Jan 15, 2025

Minor documentation improvements #794

Merged
