Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Informer controller should have an exponential backoff in case of misconfigured finalizer #559

Open
charandas opened this issue Dec 19, 2024 · 2 comments

Comments

@charandas
Copy link
Contributor

I ran into a situation with Playlists, when switching from the OpinionatedWatcher implementation to OpinionatedReconciler, wherein I had a resource already saved with a finalizer named like below:

apiVersion: v1
items:
- apiVersion: playlist.grafana.app/v0alpha1
  kind: Playlist
  metadata:
    finalizers:
    - playlist-playlists-finalizer

When I ran the OpinionatedReconciler version with the finalizer not matching the existing, the InformerController code repeatedly attempted to perform these lines and propagating the NotFound error in a way that I saw the below logs, which are misleading:

DEBUG[12-19|12:17:25] grafana-aggregator: "/apis/playlist.grafana.app/v0alpha1/namespaces/stacks-1/playlists/adkj6k6" satisfied by NotFoundHandler logger=grafana-apiserver

Perhaps we can improve the logic in 2 ways:

  1. Warn the user about the misconfiguration, one can only hope that they don't have mixed objects at this opinion with 2 sets of finalizer values
  2. Alternatively, offer a way of fixing the issue by resorting to the newer value for existing objects - not sure how to intelligently do so without user hint

Additionally, we need to fix the infinite try loop that is sure to peg the processor:

  1. Do this by performing the retry logic within limits with exponential backoff
@IfSentient
Copy link
Contributor

Errors returned from a reconciler should be retried based on the RetryPolicy for the InformerController, which, by default, is an Exponential Backoff policy. The only way I see a tight loop happening here is a bad RetryPolicy, or (more likely), continuous reconcile events triggered by patches (in which case, the patch would have been successful, and the reconciler should have been removed?)

The linked code is for a deleted resources (deletionTimestamp is non-nil), did this happen on a delete? I think I see the path here on what might be happening: the resource is in a "deleted" state (deletionTimestamp is non-nil), but we attempt to add the finalizer because it looks like we don't check if the resource is pending deletion in that condition. This should in theory fail, as finalizers cannot be added once deletionTimestamp is set, but if it succeeded, or if that triggered another reconciler event for any reason, then we'd either then remove the finalizer next, or try to add the finalizer again each time we get a reconcile event. This could then keep re-triggering in a loop of reconcile events.

How frequently did this error appear in the logs? It is also possible it could be coming from cache re-syncs, as those were set to 30s prior to v0.22.0, though it looks like the playlists app is using v0.23.1, so that seems less likely.

I think either way more logging and tracing in the OPinionatedReconciler at the very least matching the debug logging in the OpinionatedWatcher would be helpful here, I'll get a PR in to add that.

@charandas
Copy link
Contributor Author

did this happen on a delete?

No, it was an object that was already saved when the Watcher implementation was in place. No reconciler. I think that implementation has a convention on what it names the finalizer, which was playlist-playlists-finalizer in this case.

While, for the Reconcilers, you allow the developers to pick the finalizer name and it doesn't have a convention-based auto-setting. So here, I picked one for playlist.
https://github.com/grafana/grafana/blob/main/apps/playlist/pkg/reconcilers/reconciler_playlist.go#L39

However, when it wasn't this particular value and another value not matching current objects in storage, it seemed to kick the infinite loop, it didn't look like it was gonna exponentially back-off. It was spewing my console.

radiohead pushed a commit that referenced this issue Jan 7, 2025
…gic to avoid retry loops on deletes (#560)

Related to #559, added
debug logging and tracing similar to what `OpinionatedWatcher` already
has. While testing, changed the logic in the `OpinionatedReconciler` to
resolve two issues:
1. Previously, when an update added the `deletionTimestamp`, the
`OpinionatedReconciler` would remove the finalizer and return, and then
rely on the subsequent delete event to be propagated down to the
downstream reconciler for delete behavior. This introduces an edge case
that is addressed in `OpinionatedWatcher` but missed here, where, if the
downstream's processing of the delete fails, the object is still deleted
in the API server, possibly leaving a bad state (in
`OpinionatedWatcher`, the finalizer is not removed until the downstream
delete is successfully processed). The logic has now been changed to
mimic that of the `OpinionatedWatcher`, sending the "add
`deletionTimestamp`" event to the downstream reconciler as a delete
event, and only removing the finalizer once this is successful. As a
side effect, _actual_ delete events are no long propagated to the
downstream.
2. After the final finalizer is removed from an object with a non-nil
`deletionTimestamp`, a delete event is emitted in the watch stream which
_still contains_ the last finalizer to be removed (if there were
multiple finalizers, only the last one is present in the list, i.e. the
last state the object was in before the finalizers list became empty),
rather than an empty finalizers list. To combat having this make us
attempt to then remove the finalizer _again_ (because the object is
actually deleted now and this will result in an error which will be
retried by default), the swallowing of delete events has been moved to
_before_ any of the other deletionTimestamp checks.

Example of deletion finalizers list behavior from (2):
```
% kubectl get --raw '/apis/issuetrackerproject.ext.grafana.com/v1/namespaces/default/issues?watch=1'
{"type":"ADDED","object":{"apiVersion":"issuetrackerproject.ext.grafana.com/v1","kind":"Issue","metadata":{"creationTimestamp":"2025-01-02T17:46:29Z","generation":1,...trimmed...}}}
{"type":"MODIFIED","object":{"apiVersion":"issuetrackerproject.ext.grafana.com/v1","kind":"Issue","metadata":{"creationTimestamp":"2025-01-02T17:46:29Z","finalizers":["issue-tracker-project-issues-finalizer"],"generation":1,...trimmed...}}}
{"type":"MODIFIED","object":{"apiVersion":"issuetrackerproject.ext.grafana.com/v1","kind":"Issue","metadata":{"creationTimestamp":"2025-01-02T17:46:29Z","finalizers":["issue-tracker-project-issues-finalizer","foo"],"generation":1,...trimmed...}}}
{"type":"MODIFIED","object":{"apiVersion":"issuetrackerproject.ext.grafana.com/v1","kind":"Issue","metadata":{"creationTimestamp":"2025-01-02T17:46:29Z","deletionGracePeriodSeconds":0,"deletionTimestamp":"2025-01-02T17:47:14Z","finalizers":["issue-tracker-project-issues-finalizer","foo"],"generation":2,...trimmed...}}}
{"type":"MODIFIED","object":{"apiVersion":"issuetrackerproject.ext.grafana.com/v1","kind":"Issue","metadata":{"creationTimestamp":"2025-01-02T17:46:29Z","deletionGracePeriodSeconds":0,"deletionTimestamp":"2025-01-02T17:47:14Z","finalizers":["foo"],"generation":2,...trimmed...}}}
{"type":"DELETED","object":{"apiVersion":"issuetrackerproject.ext.grafana.com/v1","kind":"Issue","metadata":{"creationTimestamp":"2025-01-02T17:46:29Z","deletionGracePeriodSeconds":0,"deletionTimestamp":"2025-01-02T17:47:14Z","finalizers":["foo"],"generation":2,...trimmed...}}}
```
Here a new `issue` is created (from the tutorial project), the
reconciler adds its finalizer, then an extra `foo` finalizer is added.
The object is deleted, and the reconciler automatically removes its
finalizer. The `foo` finalizer is removed, but the state of the object
doesn't change and the delete happens with that state.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants