Add support for multiple directors for high availability #2024
Labels: cache, director, enhancement, origin
The director is a single, centralized service that all operations in the federation go through. This makes restarts, or outages, of the director potentially disruptive events for clients (though in #1565 we started to reduce the impact by allowing clients to detect restarts and retry). Let's work toward a setup where multiple directors can operate within a single federation, allowing clients to use any available one. Luckily, the director is a nearly stateless service (the exception being the downtime information; I'd like that to be aggregated in the redirector and at the individual services as a separate piece of work).
The first step will be to have services discover all available directors through a combination of the following mechanisms:
Of these, (3) is the one most discouraged; we've seen that pushing state to the client configuration for distributed clients is incredibly difficult.
The second step is that each service (director, cache, origin), when sending ads, will send ads to all the directors it is aware of. This way, if (for example) a cache service eventually learns of all the director services in the federation, it will eventually advertise the same state to all of them.
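The fan-out described in this step could look roughly like the sketch below. This is only an illustration, not existing Pelican code: `Ad`, `SendFunc`, `AdvertiseToAll`, `ErrUnreachable`, and the example URLs are all hypothetical names, and the transport is a plain function so the sketch runs without a network. The one design point it assumes is that a failure at one director should not prevent the ad from reaching the others.

```go
package main

import (
	"errors"
	"fmt"
)

// Ad is a hypothetical, simplified service advertisement.
type Ad struct {
	Name string // e.g. "cache-1"
	Type string // "director", "cache", or "origin"
}

// ErrUnreachable is a hypothetical error used to simulate a director that is down.
var ErrUnreachable = errors.New("director unreachable")

// SendFunc delivers one ad to one director. A real service would issue an
// HTTP request here; a function value keeps the sketch self-contained.
type SendFunc func(directorURL string, ad Ad) error

// AdvertiseToAll sends the ad to every known director and returns the URLs
// of the directors that could not be reached. It keeps going after a failure.
func AdvertiseToAll(directors []string, ad Ad, send SendFunc) []string {
	var failed []string
	for _, d := range directors {
		if err := send(d, ad); err != nil {
			failed = append(failed, d)
		}
	}
	return failed
}

func main() {
	// Assumed list of directors the service has discovered so far.
	directors := []string{"https://director-a.example", "https://director-b.example"}

	// Simulated transport: director-b is down.
	send := func(url string, ad Ad) error {
		if url == "https://director-b.example" {
			return ErrUnreachable
		}
		return nil
	}

	failed := AdvertiseToAll(directors, Ad{Name: "cache-1", Type: "cache"}, send)
	fmt.Println("directors that missed this ad:", failed)
}
```

Because ads are sent periodically, a director that missed one round would simply pick up the state on a later round once it is reachable again.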
Finally, whenever the director receives an ad, it will forward it to all directors it knows about. This has a few important impacts:
Note that steps (2) and (3) are complementary: if the system were stable and no network partitions occurred, either could be a standalone solution. The benefit of (3) is that it's more responsive to new directors coming online: since a director will immediately inform others of its existence (while caches and origins only poll periodically), the directors are expected to have a more up-to-date view of the federation.
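A minimal in-memory sketch of the forwarding in step (3) follows. Everything in it is hypothetical (`Director`, `ReceiveAd`, the `Network` map standing in for HTTP calls between directors). It also assumes one detail the issue does not specify: forwarded ads carry a `forwarded` flag and are not forwarded again, which is one simple way to keep directors from relaying the same ad back and forth.

```go
package main

import "fmt"

// Ad is a hypothetical, simplified service advertisement.
type Ad struct {
	Name string // unique service name
	Type string // "director", "cache", or "origin"
	URL  string // where the advertised service can be reached
}

// Director is a toy model of a director's view of the federation.
type Director struct {
	URL     string
	Peers   map[string]bool      // other directors this one knows about
	Ads     map[string]Ad        // latest ad per service name
	Network map[string]*Director // stand-in for HTTP calls between directors
}

// NewDirector creates a director and registers it on the simulated network.
func NewDirector(url string, network map[string]*Director) *Director {
	d := &Director{URL: url, Peers: map[string]bool{}, Ads: map[string]Ad{}, Network: network}
	network[url] = d
	return d
}

// ReceiveAd records the ad, learns about new directors from director ads,
// and relays the ad to known peers unless it was itself a forwarded copy.
func (d *Director) ReceiveAd(ad Ad, forwarded bool) {
	d.Ads[ad.Name] = ad
	if ad.Type == "director" && ad.URL != d.URL {
		d.Peers[ad.URL] = true
	}
	if forwarded {
		return // assumption: relay only one hop to avoid forwarding loops
	}
	for peerURL := range d.Peers {
		if peerURL == ad.URL {
			continue // no need to tell a director about its own ad
		}
		if peer, ok := d.Network[peerURL]; ok {
			peer.ReceiveAd(ad, true)
		}
	}
}

func main() {
	network := map[string]*Director{}
	a := NewDirector("director-a", network)
	b := NewDirector("director-b", network)
	c := NewDirector("director-c", network)

	// Directors b and c announce themselves to a.
	a.ReceiveAd(Ad{Name: "director-b", Type: "director", URL: b.URL}, false)
	a.ReceiveAd(Ad{Name: "director-c", Type: "director", URL: c.URL}, false)

	// A cache that only knows about director a sends its ad there.
	a.ReceiveAd(Ad{Name: "cache-1", Type: "cache", URL: "cache-1.example"}, false)

	_, bHas := b.Ads["cache-1"]
	_, cHas := c.Ads["cache-1"]
	fmt.Println("b knows cache-1:", bHas, "| c knows cache-1:", cHas)
}
```

In this toy run the cache never contacts `director-b` or `director-c` directly, yet both end up with its ad, and `director-b` learns of `director-c` as soon as the latter announces itself to `director-a`.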