
pull-kubevirt-e2e-k8s-1.32-sig-compute: Failure to start a k8s cluster #1351

Open
orelmisan opened this issue Jan 20, 2025 · 12 comments
Labels: kind/bug, kind/failing-test (Categorizes issue or PR as related to a failing test.), sig/compute

@orelmisan
Member

What happened:
The pull-kubevirt-e2e-k8s-1.32-sig-compute lane failed because the K8s cluster did not start:
https://prow.ci.kubevirt.io/view/gs/kubevirt-prow/pr-logs/pull/kubevirt_kubevirt/13715/pull-kubevirt-e2e-k8s-1.32-sig-compute/1879686726453039104

What you expected to happen:
The K8s cluster should spin up successfully.

How to reproduce it (as minimally and precisely as possible):
Steps to reproduce the behavior.

Additional context:
Seems like an issue reaching CRI-O.

Environment:

  • KubeVirt version (use virtctl version): N/A
  • Kubernetes version (use kubectl version): N/A
  • VM or VMI specifications: N/A
  • Cloud provider or hardware configuration: N/A
  • OS (e.g. from /etc/os-release): N/A
  • Kernel (e.g. uname -a): N/A
  • Install tools: N/A
  • Others: N/A
@dosubot dosubot bot added kind/failing-test Categorizes issue or PR as related to a failing test. sig/compute labels Jan 20, 2025
@oshoval
Contributor

oshoval commented Jan 28, 2025

Sometimes image repos mix v1 and v2 manifests (worth opening a ticket with them about it, assuming that is the issue here).
We already pull the images as part of the provision, but the preflight tries to re-pull and fails because of the above issue.
We can either disable re-pulling images (best), or ignore this error using --ignore-preflight-errors, as suggested in
the error:

00:49:04: [preflight] Pulling images required for setting up a Kubernetes cluster
00:49:04: [preflight] This might take a minute or two, depending on the speed of your internet connection
00:49:04: [preflight] You can also perform this action beforehand using 'kubeadm config images pull'
00:49:06: [preflight] Some fatal errors occurred:
00:49:06: failed to create new CRI image service: validate service connection: validate CRI v1 image API for endpoint "unix:///var/run/crio/crio.sock": rpc error: code = DeadlineExceeded desc = context deadline exceeded[preflight] If you know what you are doing, you can make a check non-fatal with `--ignore-preflight-errors=...`
00:49:06: error execution phase preflight
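
For illustration, the flag route would look roughly like this (untested; the config paths are placeholders, not the actual kubevirtci invocation):

# Make the image-pull preflight check non-fatal on both init and join.
# "ImagePull" is the check name the error above refers to.
kubeadm init --ignore-preflight-errors=ImagePull --config /etc/kubernetes/kubeadm-init.yaml
kubeadm join --ignore-preflight-errors=ImagePull --config /etc/kubernetes/kubeadm-join.yaml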

@oshoval
Contributor

oshoval commented Jan 28, 2025

A solution is described here (we don't need the upgrade part for kubevirtci):
kubernetes/kubeadm#2603 (comment)

  • prepull images with "kubeadm config images list/pull" - already done
  • set "InitConfiguration.NodeRegistration.ImagePullPolicy" and "JoinConfiguration.NodeRegistration.ImagePullPolicy" to "Never"

We just need the 2nd bullet IMHO; see the sketch below.
An easier solution meanwhile would be to just add --ignore-preflight-errors=ImagePull to kubeadm init / join.
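
Untested, but the 2nd bullet would look roughly like this (assuming the v1beta3 kubeadm config API, which has nodeRegistration.imagePullPolicy; paths are placeholders):

# Sketch: tell kubeadm the images are already on the node, so the preflight
# never goes back to the registry to re-pull them.
cat > /tmp/kubeadm-init.yaml <<'EOF'
apiVersion: kubeadm.k8s.io/v1beta3
kind: InitConfiguration
nodeRegistration:
  imagePullPolicy: Never
EOF
kubeadm init --config /tmp/kubeadm-init.yaml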

@oshoval
Contributor

oshoval commented Jan 28, 2025

/assign

@brianmcarey
Member

brianmcarey commented Jan 28, 2025

@oshoval this looks like the crio socket not being ready in time:

failed to create new CRI image service: validate service connection: validate CRI v1 image API for endpoint "unix:///var/run/crio/crio.sock": rpc error: code = DeadlineExceeded desc = context deadline exceeded[preflight] If you know what you are doing, you can make a check non-fatal with --ignore-preflight-errors=...

It can happen when the cluster is loaded and there is high IO on the CI cluster nodes.

I don't think it is related to pulling any images, as we have the images prepulled on the nodes, but I may be wrong.

I have been looking into ways of improving the storage performance in the CI cluster which should help with this.
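
For context, what the preflight times out on is essentially this probe (sketch only, assuming crictl is installed on the node; the endpoint is the one from the error above):

# Wait up to ~2 minutes for the CRI-O socket to answer before kubeadm runs.
for i in $(seq 1 24); do
  if crictl --runtime-endpoint unix:///var/run/crio/crio.sock info >/dev/null 2>&1; then
    break
  fi
  sleep 5
done

If the node is under heavy IO, crio simply does not answer on the socket within kubeadm's deadline.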

@oshoval
Contributor

oshoval commented Jan 28, 2025

Thanks Brian.
We do have the images pre-pulled, but the preflight tries to pull them again (for example, to check whether the hash has changed, since it uses tags).
I am going to create a PR that disables the preflight pulling (since we have the images already),
so in case it is related to images (the CRI handles image pulling, image listing and more, as far as I understand)
it will bypass those, and it will also be faster, which is good in any case.

Unless it is about the internal image management, and then, as you said, it is due to IO etc.

@oshoval
Contributor

oshoval commented Jan 28, 2025

@brianmcarey
can you please take a look? #1361
(didn't test it yet, let's see CI)
Thanks

@brianmcarey
Member

brianmcarey commented Jan 28, 2025

@brianmcarey can you please take a look? #1361 (didn't test it yet, let's see CI) Thanks

Yes, sure, I will take a look, but the failure linked in this issue happened because the cri-o socket was not available yet, so I am not sure the PR is related.

@oshoval
Contributor

oshoval commented Jan 28, 2025

Thanks

In addition to that PR, we can maybe add --preflight-timeout=15m, or add a retry to init/join, to handle the IO as you suggest.
WDYT?

@brianmcarey
Member

brianmcarey commented Jan 28, 2025

Thanks

In addition to that PR, we can maybe add --preflight-timeout=15m, or add a retry to init/join, to handle the IO as you suggest. WDYT?

A pre-flight timeout would be useful.

@oshoval
Contributor

oshoval commented Jan 28, 2025

Seems you are right: if the images are on the node, it just compares tags, IIUC this code:
https://github.com/kubernetes/kubeadm/blob/27de893196901c78a6b4228d4de8ec7080091eb0/kinder/pkg/cluster/manager/actions/images.go#L51
edit - unless this function only logs and does not affect decisions (assuming it passed, because an err would affect the flow)

About the preflight timeout, I didn't find such a flag; either I missed it, or we can add some retry.

@oshoval
Contributor

oshoval commented Jan 29, 2025

Wondering if --ignore-preflight-errors=ImagePull would also eliminate the need to initialize the CRI (unless the CRI is required for more things during preflight).
BTW, one approach could be to ignore all errors during preflight, since the node was already provisioned, and just let the real init phase deal with any problems.

@oshoval
Contributor

oshoval commented Feb 6, 2025

#1363
A retry should improve this, as long as the IO stress is temporary and the 2nd attempt solves it; see the sketch below.
This can be closed for now, I think.
Brian said he is working in parallel on improving the IO capabilities on CI.
Thanks
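
The retry amounts to a bounded loop around the init, roughly like this (illustrative sketch only; the real change is in #1363, and the config path is a placeholder):

# Give kubeadm init a few attempts so a transient IO stall (and the resulting
# CRI timeout) does not fail the whole cluster provision.
for attempt in 1 2 3; do
  if kubeadm init --config /etc/kubernetes/kubeadm-init.yaml; then
    break
  fi
  echo "kubeadm init attempt ${attempt} failed, resetting and retrying..."
  kubeadm reset -f
  sleep 10
done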
