
pull-kubevirt-e2e-k8s-1.32-sig-compute: Failure to start a k8s cluster #1351

Open
orelmisan opened this issue Jan 20, 2025 · 12 comments
Labels: kind/bug, kind/failing-test (Categorizes issue or PR as related to a failing test.), sig/compute

@orelmisan
Member

What happened:
The pull-kubevirt-e2e-k8s-1.32-sig-compute lane failed because the K8s cluster did not start:
https://prow.ci.kubevirt.io/view/gs/kubevirt-prow/pr-logs/pull/kubevirt_kubevirt/13715/pull-kubevirt-e2e-k8s-1.32-sig-compute/1879686726453039104

What you expected to happen:
The K8s cluster should spin up successfully.

How to reproduce it (as minimally and precisely as possible):
Steps to reproduce the behavior.

Additional context:
Seems like an issue reaching CRI-O.

Environment:

  • KubeVirt version (use virtctl version): N/A
  • Kubernetes version (use kubectl version): N/A
  • VM or VMI specifications: N/A
  • Cloud provider or hardware configuration: N/A
  • OS (e.g. from /etc/os-release): N/A
  • Kernel (e.g. uname -a): N/A
  • Install tools: N/A
  • Others: N/A
@dosubot dosubot bot added kind/failing-test Categorizes issue or PR as related to a failing test. sig/compute labels Jan 20, 2025
@oshoval
Contributor

oshoval commented Jan 28, 2025

Sometimes image repos mix v1 and v2 manifests (worth opening a ticket with them about it, assuming that is the issue here).
We already pull the images as part of the provision, but the preflight tries to re-pull and fails because of the above issue.
We can either disable re-pulling images (best), or ignore this error using --ignore-preflight-errors, as suggested in
the error:

00:49:04: [preflight] Pulling images required for setting up a Kubernetes cluster
00:49:04: [preflight] This might take a minute or two, depending on the speed of your internet connection
00:49:04: [preflight] You can also perform this action beforehand using 'kubeadm config images pull'
00:49:06: [preflight] Some fatal errors occurred:
00:49:06: failed to create new CRI image service: validate service connection: validate CRI v1 image API for endpoint "unix:///var/run/crio/crio.sock": rpc error: code = DeadlineExceeded desc = context deadline exceeded[preflight] If you know what you are doing, you can make a check non-fatal with `--ignore-preflight-errors=...`
00:49:06: error execution phase preflight
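
For illustration, the flag route would look roughly like this (untested; the config paths are placeholders, not the actual kubevirtci invocation):

# Make the image-pull preflight check non-fatal on both init and join.
# "ImagePull" is the check name the error above refers to.
kubeadm init --ignore-preflight-errors=ImagePull --config /etc/kubernetes/kubeadm-init.yaml
kubeadm join --ignore-preflight-errors=ImagePull --config /etc/kubernetes/kubeadm-join.yaml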

@oshoval
Contributor

oshoval commented Jan 28, 2025

A solution is described here (we don't need the upgrade part for kubevirtci):
kubernetes/kubeadm#2603 (comment)

  • prepull images with "kubeadm config images list/pull" - already done
  • set "InitConfiguration.NodeRegistration.ImagePullPolicy" and "JoinConfiguration.NodeRegistration.ImagePullPolicy" to "Never"

We just need the 2nd bullet IMHO; see the sketch below.
An easier solution meanwhile would be to just add --ignore-preflight-errors=ImagePull to kubeadm init / join.
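
Untested, but the 2nd bullet would look roughly like this (assuming the v1beta3 kubeadm config API, which has nodeRegistration.imagePullPolicy; paths are placeholders):

# Sketch: tell kubeadm the images are already on the node, so the preflight
# never goes back to the registry to re-pull them.
cat > /tmp/kubeadm-init.yaml <<'EOF'
apiVersion: kubeadm.k8s.io/v1beta3
kind: InitConfiguration
nodeRegistration:
  imagePullPolicy: Never
EOF
kubeadm init --config /tmp/kubeadm-init.yaml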

@oshoval
Contributor

oshoval commented Jan 28, 2025

/assign

@brianmcarey
Member

brianmcarey commented Jan 28, 2025

@oshoval this looks like the crio socket not being ready in time:

failed to create new CRI image service: validate service connection: validate CRI v1 image API for endpoint "unix:///var/run/crio/crio.sock": rpc error: code = DeadlineExceeded desc = context deadline exceeded[preflight] If you know what you are doing, you can make a check non-fatal with --ignore-preflight-errors=...

It can happen when the cluster is loaded and there is high IO on the CI cluster nodes.

I don't think it is related to pulling any images, as we have the images prepulled on the nodes, but I may be wrong.

I have been looking into ways of improving the storage performance in the CI cluster which should help with this.
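
For context, what the preflight times out on is essentially this probe (sketch only, assuming crictl is installed on the node; the endpoint is the one from the error above):

# Wait up to ~2 minutes for the CRI-O socket to answer before kubeadm runs.
for i in $(seq 1 24); do
  if crictl --runtime-endpoint unix:///var/run/crio/crio.sock info >/dev/null 2>&1; then
    break
  fi
  sleep 5
done

If the node is under heavy IO, crio simply does not answer on the socket within kubeadm's deadline.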

@oshoval
Contributor

oshoval commented Jan 28, 2025

Thanks Brian.
We do have the images pre-pulled, but the preflight tries to pull them again (for example, to check whether the hash has changed, since it uses tags).
I am going to create a PR that disables the preflight pulling (since we have the images already),
so in case it is related to images (the CRI handles image pulling, image listing and more, as far as I understand)
it will bypass those, and it will also be faster, which is good in any case.

Unless it is about the internal image management, and then, as you said, it is due to IO etc.

@oshoval
Contributor

oshoval commented Jan 28, 2025

@brianmcarey
can you please take a look? #1361
(didn't test it yet, let's see CI)
Thanks

@brianmcarey
Member

brianmcarey commented Jan 28, 2025

@brianmcarey can you please take a look? #1361 (didn't test it yet, let's see CI) Thanks

Yes, sure, I will take a look, but the failure linked in this issue happened because the cri-o socket was not available yet, so I am not sure the PR is related.

@oshoval
Contributor

oshoval commented Jan 28, 2025

Thanks

In addition to that PR, we can maybe add --preflight-timeout=15m, or add a retry to init/join, to handle the IO as you suggest.
WDYT?

@brianmcarey
Member

brianmcarey commented Jan 28, 2025

Thanks

In addition to that PR, we can maybe add --preflight-timeout=15m, or add a retry to init/join, to handle the IO as you suggest. WDYT?

A pre-flight timeout would be useful.

@oshoval
Contributor

oshoval commented Jan 28, 2025

Seems you are right: if the images are on the node, it just compares tags, IIUC this code:
https://github.com/kubernetes/kubeadm/blob/27de893196901c78a6b4228d4de8ec7080091eb0/kinder/pkg/cluster/manager/actions/images.go#L51
edit - unless this function only logs and does not affect decisions (assuming it passed, because an err would affect the flow)

About the preflight timeout, I didn't find such a flag; either I missed it, or we can add some retry.

@oshoval
Contributor

oshoval commented Jan 29, 2025

Wondering if --ignore-preflight-errors=ImagePull would also eliminate the need to initialize the CRI (unless the CRI is required for more things during preflight).
BTW, one approach could be to ignore all errors during preflight, since the node was already provisioned, and just let the real init phase deal with any problems.

@oshoval
Contributor

oshoval commented Feb 6, 2025

#1363
A retry should improve this, as long as the IO stress is temporary and the 2nd attempt solves it; see the sketch below.
This can be closed for now, I think.
Brian said he is working in parallel on improving the IO capabilities on CI.
Thanks
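
The retry amounts to a bounded loop around the init, roughly like this (illustrative sketch only; the real change is in #1363, and the config path is a placeholder):

# Give kubeadm init a few attempts so a transient IO stall (and the resulting
# CRI timeout) does not fail the whole cluster provision.
for attempt in 1 2 3; do
  if kubeadm init --config /etc/kubernetes/kubeadm-init.yaml; then
    break
  fi
  echo "kubeadm init attempt ${attempt} failed, resetting and retrying..."
  kubeadm reset -f
  sleep 10
done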
