CAPI pivot test always case failing in e2es #5252

Open
nrb opened this issue Dec 13, 2024 · 10 comments · May be fixed by #5288
Assignees
Labels
kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@nrb
Contributor

nrb commented Dec 13, 2024

/kind failing-test

What steps did you take and what happened:

Both pull request jobs and periodic jobs are regularly failing on the capa-e2e.[It] [unmanaged] [Cluster API Framework] Self Hosted Spec Should pivot the bootstrap cluster to a self-hosted cluster test case.

A sample periodic job: https://prow.k8s.io/view/gs/kubernetes-ci-logs/logs/periodic-cluster-api-provider-aws-e2e/1866955108403646464

A sample pull request job: https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/pull/kubernetes-sigs_cluster-api-provider-aws/5250/pull-cluster-api-provider-aws-e2e/1867146874104844288

What did you expect to happen:

Test case would pass more often

Anything else you would like to add:

Having dug into this a few times (see PRs #5249 and #5251), I've come to the conclusion that, for some reason, the container image for the CAPA manager that's built during the test run isn't present on the Kubeadm control plane node during a clusterctl move.

The below samples are pulling information from the periodic job at https://prow.k8s.io/view/gs/kubernetes-ci-logs/logs/periodic-cluster-api-provider-aws-e2e/1866955108403646464

build log output

   [FAILED] Timed out after 1200.001s.
  Timed out waiting for all MachineDeployment self-hosted-rjpecj/self-hosted-lv1y15-md-0 Machines to be upgraded to kubernetes version v1.29.9
  The function passed to Eventually returned the following error:
      <*errors.fundamental | 0xc003693da0>: 
      old Machines remain
      {
          msg: "old Machines remain",
          stack: [0x25eeeaa, 0x4f0046, 0x4ef159, 0xa6931f, 0xa6a3ec, 0xa67a46, 0x25eeb93, 0x25f2ece, 0x26aaa6b, 0xa45593, 0xa5974d, 0x47b3a1],
      }
  In [It] at: /home/prow/go/pkg/mod/sigs.k8s.io/cluster-api/test@v1.8.4/framework/machine_helpers.go:221 @ 12/11/24 22:12:08.155 

clusterctl move output

From https://storage.googleapis.com/kubernetes-ci-logs/logs/periodic-cluster-api-provider-aws-e2e/1866955108403646464/artifacts/clusters/self-hosted-lv1y15/logs/self-hosted-rjpecj/clusterctl-move.log

Deleting AWSMachine="self-hosted-lv1y15-md-0-9xwxz-5hxvg" Namespace="self-hosted-rjpecj"
Retrying with backoff cause="error adding delete-for-move annotation from \"infrastructure.cluster.x-k8s.io/v1beta2, Kind=AWSMachine\" self-hosted-rjpecj/self-hosted-lv1y15-md-0-9xwxz-5hxvg: Internal error occurred: failed calling webhook \"mutation.awsmachine.infrastructure.cluster.x-k8s.io\": failed to call webhook: Post \"https://capa-webhook-service.capa-system.svc:443/mutate-infrastructure-cluster-x-k8s-io-v1beta2-awsmachine?timeout=10s\": dial tcp 10.106.211.204:443: connect: connection refused"
Deleting AWSMachine="self-hosted-lv1y15-md-0-9xwxz-5hxvg" Namespace="self-hosted-rjpecj"
Retrying with backoff cause="error adding delete-for-move annotation from \"infrastructure.cluster.x-k8s.io/v1beta2, Kind=AWSMachine\" self-hosted-rjpecj/self-hosted-lv1y15-md-0-9xwxz-5hxvg: Internal error occurred: failed calling webhook \"mutation.awsmachine.infrastructure.cluster.x-k8s.io\": failed to call webhook: Post \"https://capa-webhook-service.capa-system.svc:443/mutate-infrastructure-cluster-x-k8s-io-v1beta2-awsmachine?timeout=10s\": dial tcp 10.106.211.204:443: connect: connection refused"
Deleting AWSMachine="self-hosted-lv1y15-md-0-9xwxz-5hxvg" Namespace="self-hosted-rjpecj"

(retries continue until the job's terminated)
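
To confirm that every retry in a clusterctl-move.log fails for the same reason, the log can be summarized with a few lines of Python. This is only a sketch: the function name and the regular expressions are my own and assume the line format shown above.

```python
import re
from collections import Counter

# Webhook name as it appears after "failed calling webhook" (quotes may be
# backslash-escaped in the log, as in the sample above).
WEBHOOK_RE = re.compile(r'failed calling webhook \\?"(?P<webhook>[^"\\]+)')
# Address and reason of the final dial error.
DIAL_RE = re.compile(r'dial tcp (?P<addr>[\d.]+:\d+): (?P<reason>[^"]+)')


def summarize_retries(lines):
    """Count retries per (webhook, address, reason) in clusterctl move output."""
    counts = Counter()
    for line in lines:
        if not line.startswith("Retrying with backoff"):
            continue  # skip "Deleting ..." and other progress lines
        webhook = WEBHOOK_RE.search(line)
        dial = DIAL_RE.search(line)
        if webhook and dial:
            key = (webhook.group("webhook"), dial.group("addr"), dial.group("reason"))
            counts[key] += 1
    return counts
```

A single key in the result means all retries hit the same webhook endpoint with the same error, which points at the webhook server rather than at the objects being moved.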

Since this is failing to reach the webhooks, I looked at the CAPA control plane.

capa-manager Pod

This is the most obvious problem: the container image isn't found, leaving the pod Pending with its container in ImagePullBackOff.

https://storage.googleapis.com/kubernetes-ci-logs/logs/periodic-cluster-api-provider-aws-e2e/1866955108403646464/artifacts/clusters/self-hosted-lv1y15/resources/capa-system/Pod/capa-controller-manager-7f5964cb58-wmvb5.yaml

status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2024-12-11T21:52:58Z"
    status: "True"
    type: PodReadyToStartContainers
  - lastProbeTime: null
    lastTransitionTime: "2024-12-11T21:52:55Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2024-12-11T21:52:55Z"
    message: 'containers with unready status: [manager]'
    reason: ContainersNotReady
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2024-12-11T21:52:55Z"
    message: 'containers with unready status: [manager]'
    reason: ContainersNotReady
    status: "False"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2024-12-11T21:52:55Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - image: gcr.io/k8s-staging-cluster-api/capa-manager:e2e
    imageID: ""
    lastState: {}
    name: manager
    ready: false
    restartCount: 0
    started: false
    state:
      waiting:
        message: Back-off pulling image "gcr.io/k8s-staging-cluster-api/capa-manager:e2e"
        reason: ImagePullBackOff
  hostIP: 10.0.136.158
  hostIPs:
  - ip: 10.0.136.158
  phase: Pending
  podIP: 192.168.74.199
  podIPs:
  - ip: 192.168.74.199
  qosClass: BestEffort
  startTime: "2024-12-11T21:52:55Z"
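
A rough Python sketch for spotting this state in the dumped Pod resources, assuming the YAML has already been loaded into a plain dict (for example with a YAML parser of your choice). The function name is hypothetical; the two reasons checked are the ones the kubelet reports for failed image pulls.

```python
# Waiting reasons that indicate the image could not be pulled.
PULL_FAILURE_REASONS = {"ImagePullBackOff", "ErrImagePull"}


def containers_failing_image_pull(pod):
    """Return (container, image, reason) for containers stuck on an image pull."""
    failing = []
    for cs in pod.get("status", {}).get("containerStatuses", []):
        # "state" holds exactly one of waiting/running/terminated.
        waiting = cs.get("state", {}).get("waiting") or {}
        if waiting.get("reason") in PULL_FAILURE_REASONS:
            failing.append((cs["name"], cs["image"], waiting["reason"]))
    return failing
```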

Associated Node

The node associated with the pod does not list the gcr.io/k8s-staging-cluster-api/capa-manager:e2e image as being present.

From https://storage.googleapis.com/kubernetes-ci-logs/logs/periodic-cluster-api-provider-aws-e2e/1866955108403646464/artifacts/clusters/self-hosted-lv1y15/resources/Node/ip-10-0-136-158.us-west-2.compute.internal.yaml

 images:
  - names:
    - docker.io/calico/cni@sha256:e60b90d7861e872efa720ead575008bc6eca7bee41656735dcaa8210b688fcd9
    - docker.io/calico/cni:v3.24.1
    sizeBytes: 87382462
  - names:
    - docker.io/calico/node@sha256:43f6cee5ca002505ea142b3821a76d585aa0c8d22bc58b7e48589ca7deb48c13
    - docker.io/calico/node:v3.24.1
    sizeBytes: 80180860
  - names:
    - registry.k8s.io/etcd@sha256:29901446ff08461789b7cd8565fc5b538134e58f81ca1f50fd65d0371cf6571e
    - registry.k8s.io/etcd:3.5.11-0
    sizeBytes: 57232947
  - names:
    - registry.k8s.io/kube-apiserver@sha256:b88538e7fdf73583c8670540eec5b3620af75c9ec200434a5815ee7fba5021f3
    - registry.k8s.io/kube-apiserver:v1.29.9
    sizeBytes: 35210641
  - names:
    - registry.k8s.io/kube-controller-manager@sha256:f2f18973ccb6996687d10ba5bd1b8f303e3dd2fed80f831a44d2ac8191e5bb9b
    - registry.k8s.io/kube-controller-manager:v1.29.9
    sizeBytes: 33739229
  - names:
    - docker.io/calico/kube-controllers@sha256:4010b2739792ae5e77a750be909939c0a0a372e378f3c81020754efcf4a91efa
    - docker.io/calico/kube-controllers:v3.24.1
    sizeBytes: 31125927
  - names:
    - registry.k8s.io/provider-aws/aws-ebs-csi-driver@sha256:02c42645c7a672bbf313ed420e384507dbf0b04992624a3979b87aa4b3f9228e
    - registry.k8s.io/provider-aws/aws-ebs-csi-driver:v1.17.0
    sizeBytes: 30172691
  - names:
    - registry.k8s.io/kube-proxy@sha256:124040dbe6b5294352355f5d34c692ecbc940cdc57a8fd06d0f38f76b6138906
    - registry.k8s.io/kube-proxy:v1.29.9
    sizeBytes: 28600769
  - names:
    - registry.k8s.io/kube-proxy@sha256:559a093080f70ca863922f5e4bb90d6926d52653a91edb5b72c685ebb65f1858
    - registry.k8s.io/kube-proxy:v1.29.8
    sizeBytes: 28599399
  - names:
    - registry.k8s.io/sig-storage/csi-provisioner@sha256:e468dddcd275163a042ab297b2d8c2aca50d5e148d2d22f3b6ba119e2f31fa79
    - registry.k8s.io/sig-storage/csi-provisioner:v3.4.0
    sizeBytes: 27427836
  - names:
    - registry.k8s.io/sig-storage/csi-resizer@sha256:3a7bdf5d105783d05d0962fa06ca53032b01694556e633f27366201c2881e01d
    - registry.k8s.io/sig-storage/csi-resizer:v1.7.0
    sizeBytes: 25809460
  - names:
    - registry.k8s.io/sig-storage/csi-snapshotter@sha256:714aa06ccdd3781f1a76487e2dc7592ece9a12ae9e0b726e4f93d1639129b771
    - registry.k8s.io/sig-storage/csi-snapshotter:v6.2.1
    sizeBytes: 25537921
  - names:
    - registry.k8s.io/sig-storage/csi-attacher@sha256:34cf9b32736c6624fc9787fb149ea6e0fbeb45415707ac2f6440ac960f1116e6
    - registry.k8s.io/sig-storage/csi-attacher:v4.2.0
    sizeBytes: 25508181
  - names:
    - registry.k8s.io/kube-scheduler@sha256:9c164076eebaefdaebad46a5ccd550e9f38c63588c02d35163c6a09e164ab8a8
    - registry.k8s.io/kube-scheduler:v1.29.9
    sizeBytes: 18851030
  - names:
    - registry.k8s.io/coredns/coredns@sha256:1eeb4c7316bacb1d4c8ead65571cd92dd21e27359f0d4917f1a5822a73b75db1
    - registry.k8s.io/coredns/coredns:v1.11.1
    sizeBytes: 18182961
  - names:
    - gcr.io/k8s-staging-provider-aws/cloud-controller-manager@sha256:533d2d64c213719da59c5791835ba05e55ddaaeb2b220ecf7cc3d88823580fc7
    - gcr.io/k8s-staging-provider-aws/cloud-controller-manager:v1.20.0-alpha.0
    sizeBytes: 15350315
  - names:
    - registry.k8s.io/sig-storage/csi-node-driver-registrar@sha256:4a4cae5118c4404e35d66059346b7fa0835d7e6319ff45ed73f4bba335cf5183
    - registry.k8s.io/sig-storage/csi-node-driver-registrar:v2.7.0
    sizeBytes: 10147874
  - names:
    - registry.k8s.io/sig-storage/livenessprobe@sha256:2b10b24dafdc3ba94a03fc94d9df9941ca9d6a9207b927f5dfd21d59fbe05ba0
    - registry.k8s.io/sig-storage/livenessprobe:v2.9.0
    sizeBytes: 9194114
  - names:
    - registry.k8s.io/pause@sha256:7031c1b283388d2c2e09b57badb803c05ebed362dc88d84b480cc47f72a21097
    - registry.k8s.io/pause:3.9
    sizeBytes: 321520
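
The same check can be scripted against a dumped Node resource. A minimal sketch, assuming the Node YAML is loaded into a dict and that the list above sits under status.images; the helper name is made up for illustration.

```python
def node_has_image(node, image_ref):
    """Report whether image_ref appears in the Node's status.images names."""
    for image in node.get("status", {}).get("images", []):
        if image_ref in image.get("names", []):
            return True
    return False
```

One caveat: as far as I know the kubelet caps how many images it reports (nodeStatusMaxImages, 50 by default), so a missing entry is only conclusive when the list is shorter than that cap, as it is here.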

KubeadmConfig

The KubeadmConfig shows that the containerd runtime should be copying a container image from ECR before joining the cluster.

https://storage.googleapis.com/kubernetes-ci-logs/logs/periodic-cluster-api-provider-aws-e2e/1866955108403646464/artifacts/clusters/self-hosted-lv1y15/resources/self-hosted-rjpecj/KubeadmConfig/self-hosted-lv1y15-control-plane-qhfvf.yaml

  preKubeadmCommands:
  - mkdir -p /opt/cluster-api
  - ctr -n k8s.io images pull "public.ecr.aws/m3v9m3w5/capa/update:e2e"
  - ctr -n k8s.io images tag "public.ecr.aws/m3v9m3w5/capa/update:e2e" gcr.io/k8s-staging-cluster-api/capa-manager:e2e

The KubeadmControlPlane has the same entry.
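
To check mechanically that a set of preKubeadmCommands intends to preload a given image, something like the following Python sketch could be used. The function names are hypothetical, and it only understands the `ctr ... images pull` and `ctr ... images tag` forms shown above.

```python
import shlex


def ctr_image_plan(commands):
    """Return (pulled refs, {target ref: source ref}) from ctr pull/tag commands."""
    pulled, tagged = set(), {}
    for command in commands:
        argv = shlex.split(command)
        if not argv or argv[0] != "ctr" or "images" not in argv:
            continue  # e.g. mkdir
        rest = argv[argv.index("images") + 1:]
        if len(rest) == 2 and rest[0] == "pull":
            pulled.add(rest[1])
        elif len(rest) == 3 and rest[0] == "tag":
            tagged[rest[2]] = rest[1]
    return pulled, tagged


def image_is_preloaded(commands, image_ref):
    """True if image_ref is pulled directly or tagged from a pulled image."""
    pulled, tagged = ctr_image_plan(commands)
    return image_ref in pulled or tagged.get(image_ref) in pulled
```

This only describes what the commands are meant to do. It says nothing about whether they succeeded on the node, or whether the image is still present later.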

Creating the test image

Based on our end-to-end test definitions, the image is successfully created and uploaded to ECR. All other tests seem to be able to find it.

The ensureTestImageUploaded function is what logs in to ECR and uploads the image so that the nodes may then download it. https://github.com/kubernetes-sigs/cluster-api-provider-aws/blob/main/test/e2e/shared/aws.go#L676

The ginkgo suites require this function to pass.

Expect(ensureTestImageUploaded(e2eCtx)).NotTo(HaveOccurred())

Environment:

  • Cluster-api-provider-aws version: main
  • Kubernetes version: (use kubectl version):
  • OS (e.g. from /etc/os-release): Ubuntu on Kube CI
@k8s-ci-robot k8s-ci-robot added kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. needs-priority needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Dec 13, 2024
@nrb
Contributor Author

nrb commented Dec 13, 2024

/triage accepted
/priority critical-urgent
/assign

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority labels Dec 13, 2024
@nrb
Contributor Author

nrb commented Dec 13, 2024

I think the preKubeadmCommands are passed to the node via cloud-init.

Is this possibly related to #4745?

@nrb nrb changed the title CAPI pivot test case failing in e2es CAPI pivot test always case failing in e2es Dec 13, 2024
@nrb nrb pinned this issue Dec 13, 2024
@nrb
Contributor Author

nrb commented Dec 19, 2024

It looks like we do have some logic to grab machine console logs at https://github.com/kubernetes-sigs/cluster-api-provider-aws/blob/main/test/e2e/shared/common.go#L105

This is populating for other tests, but not the one that is failing.

The failing job is defined in the upstream CAPI tests at https://github.com/kubernetes-sigs/cluster-api/blob/v1.8.6/test/e2e/self_hosted.go#L152. I wonder if this is preventing the triggering of the DumpMachines run for some reason?

@dlipovetsky
Contributor

From office hours: We don't know what the result of the ctr commands is. Next step is to run the e2e test locally, increasing the Ginkgo timeout, to allow someone to SSH to the affected machine before the test deletes the cluster. (We should also document how to adjust these timeouts).

@nrb
Contributor Author

nrb commented Jan 9, 2025

Reviewed logs and resources from https://prow.k8s.io/view/gs/kubernetes-ci-logs/logs/periodic-cluster-api-provider-aws-e2e/1866955108403646464 with @AndiDog and @dlipovetsky.

We're able to determine that a failure's happening at https://github.com/kubernetes-sigs/cluster-api/blob/v1.8.4/test/framework/machine_helpers.go#L221, but we're not exactly sure of the state of machines yet.

Our hypothesis is that sometime during the upgrade from 1.29.8 to 1.29.9, something caused the CAPA pod to be unscheduled (node deletion?) and then assigned to a node that doesn't have the CAPA manager image.

Some avenues we need to explore:

@richardcase
Member

Running through the tests, I can see the issue occurs when the MachineDeployment is upgraded. I see:

Failed to pull image "gcr.io/k8s-staging-cluster-api/capa-manager:e2e": rpc error: code = NotFound desc = failed to pull and unpack image "gcr.io/k8s-staging-cluster-api/capa-manager:e2e": failed to resolve reference "gcr.io/k8s-staging-cluster-api/capa-manager:e2e": gcr.io/k8s-staging-cluster-api/capa-manager:e2e: not found

But if I look at the cloud-init logs from the node the CAPA pod is assigned to, I can see it has pulled the image and tagged it:

[2025-01-20 06:22:08] unpacking linux/amd64 sha256:e6ad1fa351f0e2c2a3f169eaed5484a8a66a09c70e7725072c31086cb9bdf45e...
[2025-01-20 06:22:10] done: 1.653436814s
[2025-01-20 06:22:10] gcr.io/k8s-staging-cluster-api/capa-manager:e2e

@richardcase
Member

Testing a few potential fixes.

@richardcase
Member

Capturing the observed flow:

  • Local management cluster with v1.29.8 is created OK
  • Cluster flavour remote-management-cluster is applied
  • Child cluster is created in AWS
  • clusterctl init & move turn the child cluster into a self-managed cluster
  • In the new self-hosted cluster:
    • CAPA is initially running on the CP node
    • CAPA is then evicted from the CP node due to disk pressure
    • CAPA starts running OK on a worker node
    • CP is upgraded to v1.29.9
    • CAPA continues to run OK on the worker node
    • Workers are upgraded to v1.29.9, so CAPA is evicted from the worker
    • CAPA starts to run on the CP node but fails because the image doesn't exist
    • Logging on to the CP node shows:
      • Cloud-init logs indicate that the image was pulled
      • However, ctr images ls doesn't show the image
    • Manually running the commands to pull and tag the image means CAPA starts to run and the test continues.

@richardcase
Member

/assign

@richardcase
Member

richardcase commented Jan 20, 2025

The summary of the issue is:

  • The control plane node encountered disk pressure
  • CAPA was evicted, so it started running on a worker
  • Kubernetes cleans up unused images because of the disk pressure
    • The CAPA image was cleaned up on the control plane machine
  • When CAPA is evicted from the worker node (because the worker was upgraded), it starts to run on the control plane, but the image has been deleted there, so it fails.
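
The sequence above can be replayed with a toy Python model. This is a deliberately simplified sketch, not the kubelet's actual eviction or image garbage collection logic: the `Node` class and `replay` function are invented for illustration, and "no registry fallback" reflects that the e2e tag only exists on nodes that ran the pull and tag commands.

```python
from dataclasses import dataclass, field

CAPA_IMAGE = "gcr.io/k8s-staging-cluster-api/capa-manager:e2e"


@dataclass
class Node:
    name: str
    images: set = field(default_factory=set)   # images present on disk
    running: set = field(default_factory=set)  # images used by running pods

    def start(self, image):
        """Start a pod from a preloaded image; the tag cannot be pulled remotely."""
        if image not in self.images:
            return "ImagePullBackOff"
        self.running.add(image)
        return "Running"

    def evict(self, image):
        self.running.discard(image)

    def garbage_collect(self):
        """Toy disk-pressure cleanup: keep only images that are in use."""
        self.images &= self.running


def replay():
    cp = Node("control-plane", images={CAPA_IMAGE})
    worker = Node("worker", images={CAPA_IMAGE})
    events = [(cp.name, cp.start(CAPA_IMAGE))]
    cp.evict(CAPA_IMAGE)   # disk pressure on the control plane node
    cp.garbage_collect()   # the now-unused CAPA image is removed
    events.append((worker.name, worker.start(CAPA_IMAGE)))
    worker.evict(CAPA_IMAGE)  # worker is replaced during the upgrade
    events.append((cp.name, cp.start(CAPA_IMAGE)))
    return events
```

In this model the final start on the control plane node fails only because cleanup ran while CAPA was not running there, which is why avoiding disk pressure in the first place avoids the failure.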

Fix tested:

  • Updated the cluster template to have a larger root volume

@richardcase richardcase linked a pull request Jan 20, 2025 that will close this issue