Informer controller should have an exponential backoff in case of misconfigured finalizer #559

charandas · 2024-12-19T20:30:51Z

I ran into a situation with Playlists, when switching from the OpinionatedWatcher implementation to OpinionatedReconciler, wherein I had a resource already saved with a finalizer named like below:

apiVersion: v1
items:
- apiVersion: playlist.grafana.app/v0alpha1
  kind: Playlist
  metadata:
    finalizers:
    - playlist-playlists-finalizer

When I ran the OpinionatedReconciler version with the finalizer not matching the existing, the InformerController code repeatedly attempted to perform these lines and propagating the NotFound error in a way that I saw the below logs, which are misleading:

DEBUG[12-19|12:17:25] grafana-aggregator: "/apis/playlist.grafana.app/v0alpha1/namespaces/stacks-1/playlists/adkj6k6" satisfied by NotFoundHandler logger=grafana-apiserver

Perhaps we can improve the logic in 2 ways:

Warn the user about the misconfiguration, one can only hope that they don't have mixed objects at this opinion with 2 sets of finalizer values
Alternatively, offer a way of fixing the issue by resorting to the newer value for existing objects - not sure how to intelligently do so without user hint

Additionally, we need to fix the infinite try loop that is sure to peg the processor:

Do this by performing the retry logic within limits with exponential backoff

The text was updated successfully, but these errors were encountered:

IfSentient · 2025-01-02T15:36:32Z

Errors returned from a reconciler should be retried based on the RetryPolicy for the InformerController, which, by default, is an Exponential Backoff policy. The only way I see a tight loop happening here is a bad RetryPolicy, or (more likely), continuous reconcile events triggered by patches (in which case, the patch would have been successful, and the reconciler should have been removed?)

The linked code is for a deleted resources (deletionTimestamp is non-nil), did this happen on a delete? I think I see the path here on what might be happening: the resource is in a "deleted" state (deletionTimestamp is non-nil), but we attempt to add the finalizer because it looks like we don't check if the resource is pending deletion in that condition. This should in theory fail, as finalizers cannot be added once deletionTimestamp is set, but if it succeeded, or if that triggered another reconciler event for any reason, then we'd either then remove the finalizer next, or try to add the finalizer again each time we get a reconcile event. This could then keep re-triggering in a loop of reconcile events.

How frequently did this error appear in the logs? It is also possible it could be coming from cache re-syncs, as those were set to 30s prior to v0.22.0, though it looks like the playlists app is using v0.23.1, so that seems less likely.

I think either way more logging and tracing in the OPinionatedReconciler at the very least matching the debug logging in the OpinionatedWatcher would be helpful here, I'll get a PR in to add that.

charandas · 2025-01-07T04:19:59Z

did this happen on a delete?

No, it was an object that was already saved when the Watcher implementation was in place. No reconciler. I think that implementation has a convention on what it names the finalizer, which was playlist-playlists-finalizer in this case.

While, for the Reconcilers, you allow the developers to pick the finalizer name and it doesn't have a convention-based auto-setting. So here, I picked one for playlist.
https://github.com/grafana/grafana/blob/main/apps/playlist/pkg/reconcilers/reconciler_playlist.go#L39

However, when it wasn't this particular value and another value not matching current objects in storage, it seemed to kick the infinite loop, it didn't look like it was gonna exponentially back-off. It was spewing my console.

…gic to avoid retry loops on deletes (#560) Related to #559, added debug logging and tracing similar to what `OpinionatedWatcher` already has. While testing, changed the logic in the `OpinionatedReconciler` to resolve two issues: 1. Previously, when an update added the `deletionTimestamp`, the `OpinionatedReconciler` would remove the finalizer and return, and then rely on the subsequent delete event to be propagated down to the downstream reconciler for delete behavior. This introduces an edge case that is addressed in `OpinionatedWatcher` but missed here, where, if the downstream's processing of the delete fails, the object is still deleted in the API server, possibly leaving a bad state (in `OpinionatedWatcher`, the finalizer is not removed until the downstream delete is successfully processed). The logic has now been changed to mimic that of the `OpinionatedWatcher`, sending the "add `deletionTimestamp`" event to the downstream reconciler as a delete event, and only removing the finalizer once this is successful. As a side effect, _actual_ delete events are no long propagated to the downstream. 2. After the final finalizer is removed from an object with a non-nil `deletionTimestamp`, a delete event is emitted in the watch stream which _still contains_ the last finalizer to be removed (if there were multiple finalizers, only the last one is present in the list, i.e. the last state the object was in before the finalizers list became empty), rather than an empty finalizers list. To combat having this make us attempt to then remove the finalizer _again_ (because the object is actually deleted now and this will result in an error which will be retried by default), the swallowing of delete events has been moved to _before_ any of the other deletionTimestamp checks. Example of deletion finalizers list behavior from (2): ``` % kubectl get --raw '/apis/issuetrackerproject.ext.grafana.com/v1/namespaces/default/issues?watch=1' {"type":"ADDED","object":{"apiVersion":"issuetrackerproject.ext.grafana.com/v1","kind":"Issue","metadata":{"creationTimestamp":"2025-01-02T17:46:29Z","generation":1,...trimmed...}}} {"type":"MODIFIED","object":{"apiVersion":"issuetrackerproject.ext.grafana.com/v1","kind":"Issue","metadata":{"creationTimestamp":"2025-01-02T17:46:29Z","finalizers":["issue-tracker-project-issues-finalizer"],"generation":1,...trimmed...}}} {"type":"MODIFIED","object":{"apiVersion":"issuetrackerproject.ext.grafana.com/v1","kind":"Issue","metadata":{"creationTimestamp":"2025-01-02T17:46:29Z","finalizers":["issue-tracker-project-issues-finalizer","foo"],"generation":1,...trimmed...}}} {"type":"MODIFIED","object":{"apiVersion":"issuetrackerproject.ext.grafana.com/v1","kind":"Issue","metadata":{"creationTimestamp":"2025-01-02T17:46:29Z","deletionGracePeriodSeconds":0,"deletionTimestamp":"2025-01-02T17:47:14Z","finalizers":["issue-tracker-project-issues-finalizer","foo"],"generation":2,...trimmed...}}} {"type":"MODIFIED","object":{"apiVersion":"issuetrackerproject.ext.grafana.com/v1","kind":"Issue","metadata":{"creationTimestamp":"2025-01-02T17:46:29Z","deletionGracePeriodSeconds":0,"deletionTimestamp":"2025-01-02T17:47:14Z","finalizers":["foo"],"generation":2,...trimmed...}}} {"type":"DELETED","object":{"apiVersion":"issuetrackerproject.ext.grafana.com/v1","kind":"Issue","metadata":{"creationTimestamp":"2025-01-02T17:46:29Z","deletionGracePeriodSeconds":0,"deletionTimestamp":"2025-01-02T17:47:14Z","finalizers":["foo"],"generation":2,...trimmed...}}} ``` Here a new `issue` is created (from the tutorial project), the reconciler adds its finalizer, then an extra `foo` finalizer is added. The object is deleted, and the reconciler automatically removes its finalizer. The `foo` finalizer is removed, but the state of the object doesn't change and the delete happens with that state.

charandas mentioned this issue Dec 19, 2024

Playlists: convert to use reconcilers instead grafana/grafana#98075

Merged

3 tasks

IfSentient mentioned this issue Jan 2, 2025

[OpinionatedReconciler] Add debug logging and tracing, change some logic to avoid retry loops on deletes #560

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Informer controller should have an exponential backoff in case of misconfigured finalizer #559

Informer controller should have an exponential backoff in case of misconfigured finalizer #559

charandas commented Dec 19, 2024

IfSentient commented Jan 2, 2025

charandas commented Jan 7, 2025

Informer controller should have an exponential backoff in case of misconfigured finalizer #559

Informer controller should have an exponential backoff in case of misconfigured finalizer #559

Comments

charandas commented Dec 19, 2024

IfSentient commented Jan 2, 2025

charandas commented Jan 7, 2025