
Prevent Subscriber.seq data race and slow replay emits from filling subscriber outbox #45

Open · uniphil wants to merge 4 commits into main
Conversation


@uniphil commented Feb 18, 2025

This is #43 without (hopefully) its potentially livetail-blocking side effect.

During a small window around the cutover from replay to live-tail, both goroutines concurrently push messages to the subscriber's outbox and read-modify-write the subscriber's `seq` as part of that. This data race is detectable with Go's race detector, so jetstream currently has some undefined behaviour here.

This change swaps the racing int64 for its atomic counterpart. It's probably disappointing: it doesn't fix the logical race (we still read-modify-write concurrently), but it does remove the undefined behaviour from the binary. Plus, atomics let us write with a Swap, so we can actually detect when the race happens and at least know about it.
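A minimal sketch of what that looks like, with illustrative names (the actual `Subscriber` fields and logging in jetstream differ):

```go
package subscriber

import (
	"log/slog"
	"sync/atomic"
)

// Illustrative only: these are not the real jetstream field names.
type Subscriber struct {
	seq    atomic.Int64 // previously a plain int64, racy under -race
	outbox chan []byte
}

// setSeq publishes the next sequence value. Swap returns the previous value,
// so a concurrent writer shows up as an unexpected prev instead of silent
// undefined behaviour.
func (s *Subscriber) setSeq(expectedPrev, next int64) {
	if prev := s.seq.Swap(next); prev != expectedPrev {
		slog.Warn("seq collision between replay and live-tail",
			"expected", expectedPrev, "got", prev, "new", next)
	}
}
```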

As a result, the race detector no longer complains. Out-of-order events are still reproducible for me locally with the repro from #43, and testing is made slightly easier by a new --per-ip-limit=0 flag for local dev, so that multiple websockets from localhost are not combined into a single ratelimiter. (By default, the current per-ip limiting behaviour is retained.)
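The flag name comes from this PR, but the limiter plumbing below is assumed rather than jetstream's actual code; a zero value might simply bypass the per-IP grouping, roughly like this:

```go
package subscriber

import (
	"sync"

	"golang.org/x/time/rate"
)

// Hypothetical sketch: the real flag wiring and limiter types may differ.
var (
	mu           sync.Mutex
	limitersByIP = map[string]*rate.Limiter{}
)

// limiterFor picks the rate limiter for a new subscriber connection.
// With --per-ip-limit=0 every connection gets its own limiter, so several
// localhost websockets no longer share (and exhaust) a single bucket.
func limiterFor(ip string, perIPLimit int, maxSubRate rate.Limit) *rate.Limiter {
	if perIPLimit == 0 {
		return rate.NewLimiter(maxSubRate, int(maxSubRate))
	}
	mu.Lock()
	defer mu.Unlock()
	l, ok := limitersByIP[ip]
	if !ok {
		l = rate.NewLimiter(maxSubRate, int(maxSubRate))
		limitersByIP[ip] = l
	}
	return l
}
```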

Finally, in an attempt to reduce the problems reported in #27 and #31, the subscriber outbox is monitored while emitting during replay, and emission slows down once the outbox reaches 1/3 of its capacity. Since replay uses blocking sends, we don't really need the full outbox capacity, and this gives the client a better chance of surviving cutover, since it should make it unlikely that their outbox is near-full at that moment. (More detail in the last commit.)

both the replay goroutine and the live-tail goroutine read and write a subscriber's `seq` field concurrently. before this change that leads to data races, reproducible locally with `go run -race`, and typically observable as out-of-order events emitted to clients around the cutover from replay to live-tailing.

this change uses an atomic int64 to protect the subscriber `seq`. however, this only addresses the physical memory access bug. both goroutines can still end up doing an unsynchronized read-modify-write of the sequence, and events can still be emitted out of order.

the *consequences* of this race are improved though: the undefined behaviour is removed.

additionally, an atomic `Swap` is used when writing the new value, which at least lets us detect that the race has occurred. right now we just log a warning when that happens.

after this change i can't detect races locally with the race detector.
some emits collide but don't actually result in events going out of order. this adds an additional log for when it's likely that actual out-of-order events happened.
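Building on the `setSeq` sketch above, the extra check might distinguish a harmless collision from a likely reordering, roughly like this (hypothetical names again):

```go
// Replaces setSeq from the earlier sketch; imports "log/slog" as before.
func (s *Subscriber) setSeq(expectedPrev, next int64) {
	prev := s.seq.Swap(next)
	switch {
	case prev == expectedPrev:
		// no race observed
	case prev < next:
		// collision, but the sequence still moved forward: probably harmless
		slog.Warn("seq collision", "expected", expectedPrev, "got", prev, "new", next)
	default:
		// prev >= next: a later event was already recorded, so this emit is
		// likely reaching the client out of order
		slog.Error("likely out-of-order emit", "prev", prev, "new", next)
	}
}
```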
adds a new `--per-ip-limit` flag; it defaults to enforcing the combined per-ip limit
hopefully reduce occurrences of bluesky-social#27 / bluesky-social#31

when replaying without any filters, two rate-limiters are active: the db rate limiter will allow max-sub-rate*10 events to be added to the client's queue, and the emitting code will allow max-sub-rate to be emitted to the client.

it _seems likely_ that this leads to a nearly-full outbox as a fairly normal scenario during replay, since the queue can fill at up to nine times the rate it drains. a nearly-full outbox puts clients at high risk of getting booted for being "slow" as soon as the cutover to live-tail happens.

this change is a hack: when emitting from replay, it just starts slowwwwwing down, sleeping 100ms at a time whenever it's about to emit to the subscriber outbox while that outbox is over 1/3 of its capacity.
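In code, the hack presumably looks something like this sketch (assumed names, reusing the `Subscriber` type from the earlier sketch; needs `time` in the imports):

```go
// emitFromReplay blocks replay (and only replay) while the outbox is more
// than a third full, leaving headroom for live-tail at cutover.
func (s *Subscriber) emitFromReplay(evt []byte) {
	// the sleep only delays the replay goroutine; live-tail emits are untouched
	for len(s.outbox) > cap(s.outbox)/3 {
		time.Sleep(100 * time.Millisecond)
	}
	s.outbox <- evt
}
```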

the go race detector still seems happy with this change. my own out-of-order-detecting script shows *maybe*, *marginally* better behaviour in terms of out-of-order emits. i don't have a good repro for the prod cutover client booting, but with luck this might help.