🐛 Ensure Cluster topology controller is not stuck when MDs are stuck in deletion #11771

Open
wants to merge 1 commit into base: main

Conversation

sbueringer
Member

Signed-off-by: Stefan Büringer [email protected]

What this PR does / why we need it:

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #11770

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-area PR is missing an area label labels Jan 29, 2025
@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jan 29, 2025
@sbueringer
Member Author

/cherry-pick release-1.9

@k8s-infra-cherrypick-robot

@sbueringer: once the present PR merges, I will cherry-pick it on top of release-1.9 in a new PR and assign it to you.

In response to this:

/cherry-pick release-1.9

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@sbueringer
Member Author

/cherry-pick release-1.8

@k8s-infra-cherrypick-robot

@sbueringer: once the present PR merges, I will cherry-pick it on top of release-1.8 in a new PR and assign it to you.

In response to this:

/cherry-pick release-1.8


@sbueringer
Member Author

/assign @fabriziopandini @chrischdi

@sbueringer sbueringer added the area/clusterclass Issues or PRs related to clusterclass label Jan 29, 2025
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/needs-area PR is missing an area label label Jan 29, 2025
Member

@fabriziopandini fabriziopandini left a comment

Thanks for fixing this!
/lgtm

I'm wondering if we should also stop updating MDs while they are being deleted:

diff.toUpdate = append(diff.toUpdate, md)

The use case would roughly be:

  • MD deleted
  • MD added back before deletion completes
  • Topology tries to update the deleting MD

Instead, we should probably do nothing until the current MD goes away.
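
The suggestion above can be sketched roughly as follows. This is only an illustration of the idea, not the code in this PR or in Cluster API: the `machineDeployment` type, the `Deleting` field (a stand-in for a non-zero `metadata.deletionTimestamp`), the `mdDiff` struct, and the `calculateDiff` function are all hypothetical names made up for this example.

```go
package main

// machineDeployment is a hypothetical, simplified stand-in for a
// Cluster API MachineDeployment (assumption, not the real type).
type machineDeployment struct {
	Name string
	// Deleting stands in for a non-zero metadata.deletionTimestamp.
	Deleting bool
}

// mdDiff groups MachineDeployment names by the action a reconciler
// would take for them (hypothetical structure).
type mdDiff struct {
	toCreate []string
	toUpdate []string
	toDelete []string
}

// calculateDiff compares the current MDs with the desired MD names.
// An MD that is both current and desired but already being deleted is
// left out of toUpdate, so nothing is done for it until it goes away.
func calculateDiff(current []machineDeployment, desired []string) mdDiff {
	desiredSet := make(map[string]bool, len(desired))
	for _, name := range desired {
		desiredSet[name] = true
	}

	var diff mdDiff
	currentSet := make(map[string]bool, len(current))
	for _, md := range current {
		currentSet[md.Name] = true
		switch {
		case !desiredSet[md.Name]:
			// No longer part of the desired state.
			diff.toDelete = append(diff.toDelete, md.Name)
		case md.Deleting:
			// Desired again, but the old object is still being deleted:
			// skip the update and wait for the deletion to complete.
		default:
			diff.toUpdate = append(diff.toUpdate, md.Name)
		}
	}

	for _, name := range desired {
		if !currentSet[name] {
			diff.toCreate = append(diff.toCreate, name)
		}
	}
	return diff
}
```

With current MDs `md-a`, `md-b` (deleting), and `md-c`, and desired names `md-a`, `md-b`, and `md-d`, this sketch would update only `md-a`, delete `md-c`, create `md-d`, and leave `md-b` alone until it is gone.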

@sbueringer WDYT?

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 29, 2025
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: 9ce156454693113157feb8a269149da6e53d39a3

@sbueringer
Member Author

sbueringer commented Jan 29, 2025

Thanks for fixing this! /lgtm

I'm wondering if we should also stop updating MDs while they are being deleted:

diff.toUpdate = append(diff.toUpdate, md)

The use case would roughly be:

  • MD deleted
  • MD added back before deletion completes,
  • Topology tries to update the deleting MD

Instead, we should probably do nothing until the current MD goes away.

@sbueringer WDYT?

I went through all the usages of currentState.MD and I believe this code is not hit.

The reason is that MD deletions are triggered by removing MD topologies from Cluster.spec.topology. In that case we hit this line instead:

diff.toDelete = append(diff.toDelete, md)

(because the MD is not part of the desired state anymore)
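
To make the reasoning above concrete, here is a minimal sketch of how a single MD could be classified by whether it appears in the current and the desired state. The `action` type, its constants, and the `planAction` and `contains` helpers are hypothetical names for illustration only; the actual topology controller code is structured differently.

```go
package main

// action is the hypothetical outcome of comparing one MD against the
// current and desired state (assumption, not a Cluster API type).
type action string

const (
	actionCreate action = "create"
	actionUpdate action = "update"
	actionDelete action = "delete"
	actionNone   action = "none"
)

// planAction classifies one MD by its presence in current and desired state.
func planAction(inCurrent, inDesired bool) action {
	switch {
	case inCurrent && inDesired:
		return actionUpdate
	case inCurrent:
		// Exists, but its topology was removed from the desired state.
		return actionDelete
	case inDesired:
		return actionCreate
	default:
		return actionNone
	}
}

// contains reports whether names includes name.
func contains(names []string, name string) bool {
	for _, n := range names {
		if n == name {
			return true
		}
	}
	return false
}
```

If `md-b` exists in the cluster but its topology has been removed from Cluster.spec.topology, `planAction(contains(current, "md-b"), contains(desired, "md-b"))` yields `actionDelete`. If the topology is added back while the old MD still exists, the same call yields `actionUpdate`, which is the case raised in the review comment.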

@sbueringer
Member Author

sbueringer commented Jan 29, 2025

EDIT: okay got it, I'll take a look

@sbueringer sbueringer force-pushed the pr-fix-topo-controller branch from ffa4819 to d07153a Compare January 29, 2025 17:12
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 29, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from chrischdi. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@sbueringer
Member Author

/test pull-cluster-api-e2e-main

@fabriziopandini PTAL :)

@fabriziopandini
Member

thanks!
/lgtm

cc @chrischdi for a final pass

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 30, 2025
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: 4772e78c2f26e0e20d4af0b1ec383c09d5592ea3

Labels
area/clusterclass Issues or PRs related to clusterclass cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

Cluster topology controller is getting stuck if a MD deletion is stuck
5 participants