-
Notifications
You must be signed in to change notification settings - Fork 9
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Informer controller should have an exponential backoff in case of misconfigured finalizer #559
Comments
Errors returned from a reconciler should be retried based on the RetryPolicy for the InformerController, which, by default, is an Exponential Backoff policy. The only way I see a tight loop happening here is a bad RetryPolicy, or (more likely), continuous reconcile events triggered by patches (in which case, the patch would have been successful, and the reconciler should have been removed?) The linked code is for a deleted resources ( How frequently did this error appear in the logs? It is also possible it could be coming from cache re-syncs, as those were set to 30s prior to v0.22.0, though it looks like the playlists app is using v0.23.1, so that seems less likely. I think either way more logging and tracing in the OPinionatedReconciler at the very least matching the debug logging in the OpinionatedWatcher would be helpful here, I'll get a PR in to add that. |
No, it was an object that was already saved when the Watcher implementation was in place. No reconciler. I think that implementation has a convention on what it names the finalizer, which was While, for the Reconcilers, you allow the developers to pick the finalizer name and it doesn't have a convention-based auto-setting. So here, I picked one for playlist. However, when it wasn't this particular value and another value not matching current objects in storage, it seemed to kick the infinite loop, it didn't look like it was gonna exponentially back-off. It was spewing my console. |
…gic to avoid retry loops on deletes (#560) Related to #559, added debug logging and tracing similar to what `OpinionatedWatcher` already has. While testing, changed the logic in the `OpinionatedReconciler` to resolve two issues: 1. Previously, when an update added the `deletionTimestamp`, the `OpinionatedReconciler` would remove the finalizer and return, and then rely on the subsequent delete event to be propagated down to the downstream reconciler for delete behavior. This introduces an edge case that is addressed in `OpinionatedWatcher` but missed here, where, if the downstream's processing of the delete fails, the object is still deleted in the API server, possibly leaving a bad state (in `OpinionatedWatcher`, the finalizer is not removed until the downstream delete is successfully processed). The logic has now been changed to mimic that of the `OpinionatedWatcher`, sending the "add `deletionTimestamp`" event to the downstream reconciler as a delete event, and only removing the finalizer once this is successful. As a side effect, _actual_ delete events are no long propagated to the downstream. 2. After the final finalizer is removed from an object with a non-nil `deletionTimestamp`, a delete event is emitted in the watch stream which _still contains_ the last finalizer to be removed (if there were multiple finalizers, only the last one is present in the list, i.e. the last state the object was in before the finalizers list became empty), rather than an empty finalizers list. To combat having this make us attempt to then remove the finalizer _again_ (because the object is actually deleted now and this will result in an error which will be retried by default), the swallowing of delete events has been moved to _before_ any of the other deletionTimestamp checks. Example of deletion finalizers list behavior from (2): ``` % kubectl get --raw '/apis/issuetrackerproject.ext.grafana.com/v1/namespaces/default/issues?watch=1' {"type":"ADDED","object":{"apiVersion":"issuetrackerproject.ext.grafana.com/v1","kind":"Issue","metadata":{"creationTimestamp":"2025-01-02T17:46:29Z","generation":1,...trimmed...}}} {"type":"MODIFIED","object":{"apiVersion":"issuetrackerproject.ext.grafana.com/v1","kind":"Issue","metadata":{"creationTimestamp":"2025-01-02T17:46:29Z","finalizers":["issue-tracker-project-issues-finalizer"],"generation":1,...trimmed...}}} {"type":"MODIFIED","object":{"apiVersion":"issuetrackerproject.ext.grafana.com/v1","kind":"Issue","metadata":{"creationTimestamp":"2025-01-02T17:46:29Z","finalizers":["issue-tracker-project-issues-finalizer","foo"],"generation":1,...trimmed...}}} {"type":"MODIFIED","object":{"apiVersion":"issuetrackerproject.ext.grafana.com/v1","kind":"Issue","metadata":{"creationTimestamp":"2025-01-02T17:46:29Z","deletionGracePeriodSeconds":0,"deletionTimestamp":"2025-01-02T17:47:14Z","finalizers":["issue-tracker-project-issues-finalizer","foo"],"generation":2,...trimmed...}}} {"type":"MODIFIED","object":{"apiVersion":"issuetrackerproject.ext.grafana.com/v1","kind":"Issue","metadata":{"creationTimestamp":"2025-01-02T17:46:29Z","deletionGracePeriodSeconds":0,"deletionTimestamp":"2025-01-02T17:47:14Z","finalizers":["foo"],"generation":2,...trimmed...}}} {"type":"DELETED","object":{"apiVersion":"issuetrackerproject.ext.grafana.com/v1","kind":"Issue","metadata":{"creationTimestamp":"2025-01-02T17:46:29Z","deletionGracePeriodSeconds":0,"deletionTimestamp":"2025-01-02T17:47:14Z","finalizers":["foo"],"generation":2,...trimmed...}}} ``` Here a new `issue` is created (from the tutorial project), the reconciler adds its finalizer, then an extra `foo` finalizer is added. The object is deleted, and the reconciler automatically removes its finalizer. The `foo` finalizer is removed, but the state of the object doesn't change and the delete happens with that state.
I ran into a situation with Playlists, when switching from the OpinionatedWatcher implementation to OpinionatedReconciler, wherein I had a resource already saved with a finalizer named like below:
When I ran the OpinionatedReconciler version with the finalizer not matching the existing, the InformerController code repeatedly attempted to perform these lines and propagating the NotFound error in a way that I saw the below logs, which are misleading:
Perhaps we can improve the logic in 2 ways:
Additionally, we need to fix the infinite try loop that is sure to peg the processor:
The text was updated successfully, but these errors were encountered: