Add support for multiple directors for high availability #2024

Open
bbockelm opened this issue Feb 16, 2025 · 0 comments · May be fixed by #2023
Assignees
Labels
cache Issue relating to the cache component director Issue relating to the director component enhancement New feature or request origin Issue relating to the origin component
Milestone

Comments

@bbockelm
Collaborator

The director is a single, centralized service that all operations in the federation go through. This makes restarts (or outages) of the director potentially disruptive events for clients, though in #1565 we started to reduce the impact by allowing clients to detect restarts and retry. Let's work toward a setup where multiple directors can operate within a single federation, allowing clients to use any of them. Luckily, the director is a nearly stateless service (the exception being the downtime information; I'd like that to be aggregated in the redirector and at the individual services as a separate piece of work).

The first step will be to have services discover all available directors through a combination of the following mechanisms:

  1. Directors' "advertise URLs" being statically listed in the federation metadata.
  2. Periodically querying any one of the directors to have it list all the directors it is aware of.
  3. Directors' URLs being statically placed in the service configuration.

Of these, (3) is the most discouraged; we've seen that pushing state into the configuration of distributed clients is incredibly difficult.
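As a rough illustration of how the three mechanisms could be combined, here is a minimal Python sketch (not the project's actual code). The function name `discover_directors` and the `list_peers` callback are hypothetical; `list_peers` stands in for "ask one director which directors it knows about" and is assumed to raise `OSError` when that director is unreachable.

```python
from typing import Callable, Iterable, Set


def discover_directors(
    metadata_urls: Iterable[str],
    config_urls: Iterable[str],
    list_peers: Callable[[str], Iterable[str]],
) -> Set[str]:
    """Merge director URLs from the three mechanisms (all names hypothetical)."""
    # Mechanisms (1) and (3): statically known URLs form the starting set.
    known = set(metadata_urls) | set(config_urls)
    # Mechanism (2): ask known directors in turn and stop at the first
    # one that answers, since querying any one director is enough.
    for url in sorted(known):
        try:
            known |= set(list_peers(url))
            break
        except OSError:
            continue  # unreachable director: try the next one
    return known
```

If every director is unreachable, the sketch simply falls back to the statically known set, so a service can keep advertising to whatever it already knows about.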

The second step is that each service (director, cache, origin), when sending ads, will send ads to all the directors it is aware of. This way, if (for example) a cache service eventually learns of all the director services in the federation, it will eventually advertise the same state to all of them.
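A minimal sketch of that fan-out, assuming a hypothetical `send` callback (for example, an HTTP POST to the director's advertise URL) that raises `OSError` on failure. The point it illustrates is that a failure at one director should not prevent the same ad from reaching the others.

```python
from typing import Callable, Dict, Iterable


def advertise_to_all(
    ad: dict,
    directors: Iterable[str],
    send: Callable[[str, dict], None],
) -> Dict[str, bool]:
    """Send the same ad to every known director (names are hypothetical).

    Returns a map of director URL to whether delivery succeeded.
    """
    results: Dict[str, bool] = {}
    for url in directors:
        try:
            send(url, ad)
            results[url] = True
        except OSError:
            # One unreachable director must not block the others.
            results[url] = False
    return results
```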

Finally, whenever a director receives an ad, it will forward it to all the directors it knows about. This has a few important implications:

  • You have the possibility of a "routing loop": director A passes an ad to director B which, subsequently, passes it back to director A (ad infinitum). To prevent this, we will need a total ordering of all ads and will only forward an ad if it's newer than the in-memory copy. This prevents director A from forwarding it a second time to director B, breaking any loops.
  • The director needs to be able to "recognize itself": if, while periodically querying for other directors, it queries itself, it should know not to forward to that director.
  • Implementation: For the two items above, add an "instance ID" (composed of the start time and a random UUID) and a monotonically increasing counter, the "generation ID". An ad from the same service name/type is considered the same as an existing ad if and only if the instance and generation IDs are the same. An ad is "newer" if the instance ID is the same but the generation ID is higher. The instance ID will be determined at the start of the process, while the generation ID is implemented as an atomic counter. Including the start time allows us to detect when the process has restarted (and hence the generation ID has gone backward).
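The ordering and loop-breaking rules above can be sketched in Python as follows. This is an illustrative model, not the implementation: `AdVersion`, `Service`, `Director`, and `is_newer` are hypothetical names, direct method calls stand in for network requests, and the rule for comparing ads with different instance IDs (later start time wins, UUID as tie-breaker) is an assumption that the text only implies.

```python
import itertools
import time
import uuid
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class AdVersion:
    start_time: float   # process start time; first half of the instance ID
    instance_uuid: str  # random UUID; second half of the instance ID
    generation: int     # per-process, monotonically increasing counter

    @property
    def instance_id(self) -> Tuple[float, str]:
        return (self.start_time, self.instance_uuid)


def is_newer(incoming: AdVersion, stored: AdVersion) -> bool:
    """True if `incoming` should replace `stored`."""
    if incoming.instance_id == stored.instance_id:
        return incoming.generation > stored.generation
    # Assumption: a different instance ID means the process restarted,
    # so the later start time wins and the generation is ignored.
    return incoming.instance_id > stored.instance_id


class Service:
    """A cache or origin that stamps each ad with a version."""

    def __init__(self) -> None:
        self._start = time.time()
        self._uuid = str(uuid.uuid4())
        self._counter = itertools.count(1)  # stand-in for an atomic counter

    def next_version(self) -> AdVersion:
        return AdVersion(self._start, self._uuid, next(self._counter))


class Director:
    def __init__(self, name: str) -> None:
        self.name = name
        self.instance_id = (time.time(), str(uuid.uuid4()))
        self.peers: List["Director"] = []
        self.ads: Dict[str, AdVersion] = {}  # keyed by service name/type
        self.received = 0  # number of ads seen, for demonstration only

    def receive(self, service: str, version: AdVersion) -> None:
        self.received += 1
        stored = self.ads.get(service)
        if stored is not None and not is_newer(version, stored):
            return  # same or older ad: drop it, which breaks any loop
        self.ads[service] = version
        for peer in self.peers:
            # Self-recognition: compare instance IDs rather than URLs.
            if peer.instance_id != self.instance_id:
                peer.receive(service, version)
```

With two directors that each list both directors as peers, a single ad delivered to A is stored, forwarded to B, echoed back to A once, and then dropped because A already holds an identical copy:

```python
a, b = Director("a"), Director("b")
a.peers = [a, b]
b.peers = [a, b]
a.receive("cache-1", Service().next_version())
print(a.received, b.received)
```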

Note that the second and third steps (services advertising to every director, and directors forwarding ads to each other) are complementary: if the system were stable and no network partitions occurred, either could be a standalone solution. The benefit of the third step is that it's more responsive to new directors coming online: since a director will immediately inform others of its existence (while caches & origins periodically poll), the directors are expected to have a more up-to-date view of the federation.

@bbockelm bbockelm added cache Issue relating to the cache component director Issue relating to the director component enhancement New feature or request origin Issue relating to the origin component labels Feb 16, 2025
@bbockelm bbockelm added this to the v7.15 milestone Feb 16, 2025
@bbockelm bbockelm self-assigned this Feb 16, 2025
@bbockelm bbockelm linked a pull request Feb 16, 2025 that will close this issue