-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Server replica allowed to overwrite deep store segment after segment was already committed #14786
Comments
I have been looking into this. From the code it seems like the following check in
This check only considers the so when a late server wants to commit a segment again in your scenario, it gets past this check. |
@KKcorps I don't think they uploaded the segment. Segment is just committed from a different server. @dang-stripe Can you confirm? |
No it is clearly uploaded since the segment metadata itself shows 'UPLOADED' it was uploaded by upsertCompactionTask |
@dang-stripe Can you confirm if you are running |
Yes, we were using upsert compaction for these segments. |
@KKcorps should we simply change the if condition to |
For a 3 replica table, we hit a case where:
2 replicas started segment commit - 1 server was successful and uploaded to deep store. Segment moves to ONLINE state.
3rd replica server (Server_95) was recently replaced w/ a new server and starting up. It started segment commit after the previous one was done. The controller allowed it to proceed w/ deep store upload, overwriting the previous segment copy.
Controller then failed when trying to update the ideal state since the segment wasn't in the CONSUMING state. This repeated several times until it gave up and then downloaded the copy in deep store.
What's concerning here is that the offsets are different (353103174 vs 353106085). It seems like the controller should not have allowed the 3rd replica to proceed w/ the upload and commit.
We're using the
realtime.segment.serverUploadToDeepStore
feature to allow servers to upload segments directly are part of segment commit, but it doesn't seem like it's specific to that feature.This separately caused a CRC mismatch failure for upsert compaction since Minion compacted the first version of the segment and updated the metadata, but the 3rd replica overwrote the S3 copy.
The text was updated successfully, but these errors were encountered: