This repository has been archived by the owner on May 12, 2021. It is now read-only.
When the variable batch size is set to [5,5] with autopause set to true, the update should pause only twice, once after each batch completes. In practice, Aurora pauses while a batch update is still in progress, and it pauses 4 times in total.
How to reproduce?
Set the variable batch size to [5,5], set auto pause to true, and set the SLA to 70%.
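To make the expected behavior concrete, here is a minimal sketch of when the pauses should happen. It is not Aurora's code: `simulate_auto_pause` is a hypothetical helper that only models the rule described above, one pause at each batch boundary and none mid-batch.

```python
from typing import List


def simulate_auto_pause(batch_sizes: List[int]) -> List[int]:
    """Return the count of updated instances at each pause point.

    Hypothetical model: with auto pause enabled, the update pauses exactly
    once after every batch finishes, never in the middle of a batch.
    """
    pauses = []
    updated = 0
    for size in batch_sizes:
        for _ in range(size):
            updated += 1  # one instance finishes updating
        pauses.append(updated)  # pause only at the batch boundary
    return pauses


pause_points = simulate_auto_pause([5, 5])
print(len(pause_points), pause_points)  # 2 [5, 10]
```

Under this model, [5,5] gives two pauses, after instances 5 and 10, which is the count the report expects instead of the observed 4.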
Investigating this, there is a bug where shards that were already marked as failed are retried once a resume is sent.
So for example, let's say we have to replace shards [0,1,2]. Shard 0 ends up being marked as failed, shard 1 is successful, and we pause before we finish evaluating shard 2.
Our update is now in the following state:
0 - failed
1 - success
2- working
When we resume the update, shard 0 is retried, even though it should be skipped because it failed.
Correcting my previous statement: it's not the updater that's retrying shard 0, it's the scheduler itself, which is the correct behavior. Even though the shard update failed, the failure is only a signal to the updating mechanism.
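As a rough illustration of the updater's side of this, the paused state from the example can be modeled as below. This is only a sketch: `ShardState` and `shards_to_resume` are hypothetical names, not part of Aurora, and the scheduler rescheduling shard 0 is deliberately left out of the model.

```python
from enum import Enum
from typing import Dict, List


class ShardState(Enum):
    FAILED = "failed"
    SUCCESS = "success"
    WORKING = "working"


def shards_to_resume(states: Dict[int, ShardState]) -> List[int]:
    """Shards the updater still has to evaluate after a resume (hypothetical).

    Failed and successful shards are treated as already evaluated; only
    shards that were still in progress at pause time are picked up again.
    """
    return sorted(s for s, st in states.items() if st is ShardState.WORKING)


paused = {0: ShardState.FAILED, 1: ShardState.SUCCESS, 2: ShardState.WORKING}
print(shards_to_resume(paused))  # [2]
```

In this model the updater only continues with shard 2. Any new attempt at shard 0 comes from the scheduler, not from the update resuming.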
The unexpected number of pauses was indeed a bug. Thanks for finding it, @chungjin. I've landed a fix for this at 3e31bc0.
Let's try to reproduce the issue. If we can confirm it is fixed, we can go ahead and close this issue.