CAPI pivot test case always failing in e2es #5252
Comments
/triage accepted |
I think the Is this possibly related to #4745? |
It looks like we do have some logic to grab machine console logs at https://github.com/kubernetes-sigs/cluster-api-provider-aws/blob/main/test/e2e/shared/common.go#L105. This is populating for other tests, but not for the one that is failing. The failing job is defined in the upstream CAPI tests at https://github.com/kubernetes-sigs/cluster-api/blob/v1.8.6/test/e2e/self_hosted.go#L152. I wonder if this is preventing the triggering of the |
From office hours: We don't know what the result of the |
Reviewed logs and resources from https://prow.k8s.io/view/gs/kubernetes-ci-logs/logs/periodic-cluster-api-provider-aws-e2e/1866955108403646464 with @AndiDog and @dlipovetsky. We're able to determine that a failure's happening at https://github.com/kubernetes-sigs/cluster-api/blob/v1.8.4/test/framework/machine_helpers.go#L221, but we're not exactly sure of the state of machines yet. Our hypothesis is that sometime during the upgrade from 1.29.8 to 1.29.9, something caused the CAPA pod to be unscheduled (node deletion?) and then assigned to a node that doesn't have the CAPA manager image. Some avenues we need to explore:
|
Running through the tests, I can see the issue occurs when the MachineDeployment is upgraded. I see:
But if I look at the cloud-init logs from the node the CAPA pod is assigned to, I can see it has pulled the image and tagged it:
|
Testing a few potential fixes. |
Capturing the observed flow:
|
/assign |
The summary of the issue is:
Fix tested:
|
/kind failing-test
What steps did you take and what happened:
Both pull request jobs and periodic jobs are regularly failing on the capa-e2e.[It] [unmanaged] [Cluster API Framework] Self Hosted Spec Should pivot the bootstrap cluster to a self-hosted cluster test case.
A sample periodic job: https://prow.k8s.io/view/gs/kubernetes-ci-logs/logs/periodic-cluster-api-provider-aws-e2e/1866955108403646464
A sample pull request job: https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/pull/kubernetes-sigs_cluster-api-provider-aws/5250/pull-cluster-api-provider-aws-e2e/1867146874104844288
What did you expect to happen:
Test case would pass more often
Anything else you would like to add:
Having dug into this a few times (see PRs #5249 and #5251), I've come to the conclusion that, for some reason, the container image for the CAPA manager that's built during the test run isn't present on the Kubeadm control plane node during a clusterctl move.
The below samples are pulling information from the periodic job at https://prow.k8s.io/view/gs/kubernetes-ci-logs/logs/periodic-cluster-api-provider-aws-e2e/1866955108403646464
build log output
clusterctl move output
From https://storage.googleapis.com/kubernetes-ci-logs/logs/periodic-cluster-api-provider-aws-e2e/1866955108403646464/artifacts/clusters/self-hosted-lv1y15/logs/self-hosted-rjpecj/clusterctl-move.log
Since this is failing to reach webhooks, I looked at the CAPA control plane.
capa-manager Pod
This is the most obvious problem; the container image isn't found, sending the pod into CrashLoopBackOff.
https://storage.googleapis.com/kubernetes-ci-logs/logs/periodic-cluster-api-provider-aws-e2e/1866955108403646464/artifacts/clusters/self-hosted-lv1y15/resources/capa-system/Pod/capa-controller-manager-7f5964cb58-wmvb5.yaml
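When scanning dumped Pod resources like the one linked above, it can help to list which containers are stuck waiting and why. The following is a minimal sketch, not code from the CAPA test suite: the `containerStatus` and `containerStateWaiting` types are simplified stand-ins for the Kubernetes container status fields (flattened for brevity), and the container name `manager` is only an assumption for illustration.

```go
package main

import "fmt"

// containerStateWaiting is a simplified stand-in for the "waiting" state
// reported in a Pod's container statuses.
type containerStateWaiting struct {
	Reason  string
	Message string
}

// containerStatus is a flattened, hypothetical view of one container status.
// Waiting is nil when the container is running or terminated.
type containerStatus struct {
	Name    string
	Waiting *containerStateWaiting
}

// waitingContainers returns a map of container name to waiting reason
// for every container that is currently in a waiting state.
func waitingContainers(statuses []containerStatus) map[string]string {
	out := make(map[string]string)
	for _, s := range statuses {
		if s.Waiting != nil {
			out[s.Name] = s.Waiting.Reason
		}
	}
	return out
}

func main() {
	// Hypothetical input; real values would come from the dumped Pod YAML.
	statuses := []containerStatus{
		{Name: "manager", Waiting: &containerStateWaiting{Reason: "CrashLoopBackOff"}},
	}
	for name, reason := range waitingContainers(statuses) {
		fmt.Printf("%s: %s\n", name, reason)
	}
}
```

The waiting reason alone does not say why the image is unavailable, so this only narrows down which pods to inspect further.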
Associated Node
The node associated with the pod does not list the gcr.io/k8s-staging-cluster-api/capa-manager:e2e image as being present.
From https://storage.googleapis.com/kubernetes-ci-logs/logs/periodic-cluster-api-provider-aws-e2e/1866955108403646464/artifacts/clusters/self-hosted-lv1y15/resources/Node/ip-10-0-136-158.us-west-2.compute.internal.yaml
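The check described here (does the Node report the test image?) could be sketched roughly as below. This is an illustration only, not part of the e2e framework: `containerImage` is a local stand-in for the entries under a Node's `status.images` (a list of names plus a size), `nodeHasImage` is a hypothetical helper, and the sample image list is made up.

```go
package main

import "fmt"

// containerImage is a local stand-in for one entry of a Node's status.images.
type containerImage struct {
	Names     []string
	SizeBytes int64
}

// nodeHasImage reports whether any image entry carries the exact name want.
// The comparison is a plain string match, so tag and registry must match too.
func nodeHasImage(images []containerImage, want string) bool {
	for _, img := range images {
		for _, name := range img.Names {
			if name == want {
				return true
			}
		}
	}
	return false
}

func main() {
	// Hypothetical image list; real data would come from the dumped Node YAML.
	images := []containerImage{
		{Names: []string{"example.invalid/some-image:tag"}, SizeBytes: 1},
	}
	want := "gcr.io/k8s-staging-cluster-api/capa-manager:e2e"
	fmt.Println(nodeHasImage(images, want))
}
```

One caveat worth keeping in mind: the kubelet caps how many images it reports in the Node status, so an image missing from that list is a strong hint rather than proof that it is absent from the node.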
KubeadmConfig
The KubeadmConfig shows that the containerd runtime should be copying a container image from ECR before joining the cluster.
https://storage.googleapis.com/kubernetes-ci-logs/logs/periodic-cluster-api-provider-aws-e2e/1866955108403646464/artifacts/clusters/self-hosted-lv1y15/resources/self-hosted-rjpecj/KubeadmConfig/self-hosted-lv1y15-control-plane-qhfvf.yaml
The KubeadmControlPlane has the same entry.
Creating the test image
Based on our end-to-end test definitions, the image is successfully created and uploaded to ECR. All other tests seem to be able to find it.
The ensureTestImageUploaded function is what logs in to ECR and uploads the image so that the nodes may then download it. https://github.com/kubernetes-sigs/cluster-api-provider-aws/blob/main/test/e2e/shared/aws.go#L676
The ginkgo suites require this function to pass.
cluster-api-provider-aws/test/e2e/shared/suite.go
Line 159 in 3a646b3
Environment:
- Cluster-api-provider-aws version: main
- Kubernetes version (use kubectl version):
- OS (e.g. from /etc/os-release): Ubuntu on Kube CI