Cluster topology controller is getting stuck if a MD deletion is stuck #11770

Closed
sbueringer opened this issue Jan 29, 2025 · 1 comment · Fixed by #11771
Labels
area/clusterclass Issues or PRs related to clusterclass kind/bug Categorizes issue or PR as related to a bug. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@sbueringer
Member

When an MD deletion is stuck, the Cluster topology controller also gets stuck.

This is easy to see in the Available condition of the Cluster:

      message: |-
        * WorkersAvailable:
          * MachineDeployment clusterclass-changes-zvh1ts-md-md-0-mdh97: Deletion in progress
        * TopologyReconciled: error reading current state of the Cluster topology: MachineDeployment clusterclass-changes-idpol7/clusterclass-changes-zvh1ts-md-md-0-mdh97 Bootstrap reference could not be retrieved: failed to retrieve KubeadmConfigTemplate clusterclass-changes-idpol7/clusterclass-changes-zvh1ts-md-0-ckpjx: failed to retrieve KubeadmConfigTemplate clusterclass-changes-idpol7/clusterclass-changes-zvh1ts-md-0-ckpjx: KubeadmConfigTemplate.bootstrap.cluster.x-k8s.io "clusterclass-changes-zvh1ts-md-0-ckpjx" not found

The following happens:

  • The MD topology controller deletes the MD's templates (here the KubeadmConfigTemplate) once the MD has a deletionTimestamp
  • The Cluster topology controller is then unable to get the templates of the deleting MD, so reading the current state of the Cluster topology fails

I think the ideal fix would be to stop retrieving the templates of a deleting MachineDeployment in the Cluster topology controller, but this requires some research.

@k8s-ci-robot k8s-ci-robot added needs-priority Indicates an issue lacks a `priority/foo` label and requires one. needs-kind Indicates a PR lacks a `kind/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jan 29, 2025
@sbueringer sbueringer added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. area/clusterclass Issues or PRs related to clusterclass labels Jan 29, 2025
@k8s-ci-robot k8s-ci-robot removed the needs-priority Indicates an issue lacks a `priority/foo` label and requires one. label Jan 29, 2025
@sbueringer sbueringer added kind/bug Categorizes issue or PR as related to a bug. needs-priority Indicates an issue lacks a `priority/foo` label and requires one. labels Jan 29, 2025
@k8s-ci-robot k8s-ci-robot removed the needs-kind Indicates a PR lacks a `kind/foo` label and requires one. label Jan 29, 2025
@sbueringer sbueringer removed the needs-priority Indicates an issue lacks a `priority/foo` label and requires one. label Jan 29, 2025
@sbueringer
Member Author

/triage accepted
/assign

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jan 29, 2025