kueue-controller-manager not starting up #3956

Open
kalki7 opened this issue Jan 10, 2025 · 3 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

kalki7 commented Jan 10, 2025

What happened: Kueue v0.10.0 was already deployed and working as expected. After I set enableClusterQueueResources: true and redeployed the same version of Kueue, the manager container in kueue-controller-manager got stuck in CrashLoopBackOff. The following alert events were the only logs we were able to retrieve:

{
activity_type_name: "ViolationOpenEventv1"
policy_id: "6885206598817066570"
resource_name: "dev manager"
started_at: "1736511103"
terse_message: "Restart count for dev manager is above the threshold of 1.000 with a value of 2.000."
verbose_message: "Restart count for dev manager is above the threshold of 1.000 with a value of 2.000."
violation_id: "0.nng1d6n8dhxc"
}
{
activity_type_name: "ViolationAutoResolveEventv1"
policy_id: "6885206598817066570"
resolved_at: "1736511220"
resource_name: "dev manager"
terse_message: "Restart count for dev manager returned to normal with a value of 4.000."
verbose_message: "Restart count for dev manager returned to normal with a value of 4.000."
violation_id: "0.nng1d6n8dhxc"
}
{
activity_type_name: "ViolationOpenEventv1"
policy_id: "6885206598817066570"
resource_name: "dev manager"
started_at: "1736511257"
terse_message: "Restart count for dev manager is above the threshold of 1.000 with a value of 5.000."
verbose_message: "Restart count for dev manager is above the threshold of 1.000 with a value of 5.000."
violation_id: "0.nng1d6n8dhxc"
}

These were followed by kube-rbac-proxy logging "received interrupt, shutting down".
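
For readers following the timeline: the started_at and resolved_at values in the events above look like Unix epoch seconds. A minimal sketch, assuming that interpretation, that decodes them to UTC and shows the spacing between events. The `events` list, `to_utc`, and `timeline` are illustrative names, not part of Kueue or GKE; the numbers are copied from the events quoted above.

```python
from datetime import datetime, timezone

# (label, epoch seconds, reported restart count), copied from the three
# alert events quoted above. Treating the timestamps as Unix epoch seconds
# is an assumption.
events = [
    ("violation opened", 1736511103, 2.0),
    ("violation auto-resolved", 1736511220, 4.0),
    ("violation opened", 1736511257, 5.0),
]


def to_utc(epoch_seconds: int) -> str:
    """Render Unix epoch seconds as an ISO 8601 UTC timestamp."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc).isoformat()


def timeline(rows):
    """Return (utc_time, seconds_since_first_event, label, restarts) tuples."""
    first = rows[0][1]
    return [(to_utc(ts), ts - first, label, restarts) for label, ts, restarts in rows]


for when, offset, label, restarts in timeline(events):
    print(f"{when}  +{offset:>3d}s  {label}, restart count {restarts:g}")
```

Read this way, all three events fall on Jan 10, 2025 and span roughly two and a half minutes, during which the value reported by the alert goes from 2 to 5.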

What you expected to happen: kueue-controller-manager to start up

How to reproduce it (as minimally and precisely as possible): Deploy Kueue v0.10.0 using the manifest with enableClusterQueueResources: true set, and with the kueue-controller-manager memory requests and limits increased to 8 GB and 16 GB and the CPU requests and limits increased to 1 and 2.

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version): 1.30.6-gke.1125000
  • Kueue version (use git describe --tags --dirty --always): v0.10.0
  • Cloud provider or hardware configuration: GCP - Google Kubernetes Engine
  • OS (e.g: cat /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:
@kalki7 kalki7 added the kind/bug Categorizes issue or PR as related to a bug. label Jan 10, 2025
kannon92 (Contributor) commented

Can you try and see if you get this problem on kind?

I don't know what "dev manager" is so I worry this may be a GKE problem.

kalki7 (Author) commented Jan 10, 2025

dev manager is the name of the GKE resource. However, I just disabled enableClusterQueueResources and it went back to the previous working state. So I think this has something to do with enableClusterQueueResources: true: once it's enabled, the manager container in the kueue-controller-manager deployment doesn't come up.

kannon92 (Contributor) commented

cc @mimowo @mwielgus

Anyone you can loop into this from the GKE side?

2 participants