Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature] Kuberay multikueue integrations tests #3986

Draft
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

mszadkow
Copy link
Contributor

What type of PR is this?

/kind feature

What this PR does / why we need it:

Extend the test coverage of Kuberay integration in Kueue.

Which issue(s) this PR fixes:

Relates to #3822

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE

@k8s-ci-robot
Copy link
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/feature Categorizes issue or PR as related to a new feature. release-note-none Denotes a PR that doesn't merit a release note. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jan 16, 2025
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: mszadkow
Once this PR has been reviewed and has the lgtm label, please assign mimowo for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Jan 16, 2025
Copy link

netlify bot commented Jan 16, 2025

Deploy Preview for kubernetes-sigs-kueue ready!

Name Link
🔨 Latest commit de408fb
🔍 Latest deploy log https://app.netlify.com/sites/kubernetes-sigs-kueue/deploys/678a50d4bb988a000813ac79
😎 Deploy Preview https://deploy-preview-3986--kubernetes-sigs-kueue.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

@mszadkow mszadkow force-pushed the feature/kuberay-multikueue-integrations-tests branch from 30eeaf6 to de408fb Compare January 17, 2025 12:45
@mszadkow
Copy link
Contributor Author

/ok-to-test

@k8s-ci-robot k8s-ci-robot added the ok-to-test Indicates a non-member PR verified by an org member that is safe to test. label Jan 17, 2025
@k8s-ci-robot
Copy link
Contributor

@mszadkow: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-kueue-verify-main de408fb link true /test pull-kueue-verify-main
pull-kueue-test-integration-main de408fb link true /test pull-kueue-test-integration-main

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@mimowo
Copy link
Contributor

mimowo commented Jan 20, 2025

I see the integration tests fail with the following message:

Running Suite: Multikueue Suite - /home/prow/go/src/sigs.k8s.io/kueue/test/integration/multikueue
=================================================================================================
Random Seed: �[1m1737118080�[0m

Will run �[1m13�[0m of �[1m20�[0m specs
Running in parallel across �[1m4�[0m processes

�[1m�[38;5;9mGinkgo timed out waiting for all parallel procs to report back�[0m
�[38;5;243mTest suite:�[0m multikueue (./test/integration/multikueue)

This occurs if a parallel process exits before it reports its results to the
Ginkgo CLI.  The CLI will now print out all the stdout/stderr output it's
collected from the running processes.  However you may not see anything useful
in these logs because the individual test processes usually intercept output to
stdout/stderr in order to capture it in the spec reports.

You may want to try rerunning your test suite with
�[38;5;246m--output-interceptor-mode=none�[0m to see additional output here and debug your
suite.
  
�[1mOutput from proc 1:�[0m
�[1mOutput from proc 2:�[0m
�[1mOutput from proc 3:�[0m
�[1mOutput from proc 4:�[0m

** End **Could not open /logs/artifacts/test_integration_multikueue_integration.json:
open /logs/artifacts/test_integration_multikueue_integration.json: no such file or directory
Could not open /logs/artifacts/test_integration_multikueue_junit.xml:
open /logs/artifacts/test_integration_multikueue_junit.xml: no such file or directory

Ginkgo ran 24 suites in 21m15.868780173s

I think the test_integration_multikueue_integration.json should be created by ginkgo while running the MultiKueue integration test suite at the end of the run. So, for some reason the run does not really end, I would speculate one of the goroutines hangs.

@mimowo
Copy link
Contributor

mimowo commented Jan 20, 2025

Let me ask some questions to hopefully narrow down the issue:

  1. Do the tests fail if you narrow them down by INTEGRATION_TARGET=test/integration/multikueue on CI?
  2. This issue seems related, it seems adding --output-interceptor-mode=none provides some extra information useful for debugging, do you have a link to such a run?
  3. When the tests fail:
  • is the BeforeSuite executed?
  • are the tests themselves executed? (I couldn't deduce it just based on the build-log)
  • is the AfterSuite executed?
  1. When you add the code which installs RayCRD, but without any additional "Its", are the tests still failing? If so send me the link.

@mszadkow
Copy link
Contributor Author

Let me ask some questions to hopefully narrow down the issue:

  1. Do the tests fail if you narrow them down by INTEGRATION_TARGET=test/integration/multikueue on CI?
  2. This issue seems related, it seems adding --output-interceptor-mode=none provides some extra information useful for debugging, do you have a link to such a run?
  3. When the tests fail:
  • is the BeforeSuite executed?
  • are the tests themselves executed? (I couldn't deduce it just based on the build-log)
  • is the AfterSuite executed?
  1. When you add the code which installs RayCRD, but without any additional "Its", are the tests still failing? If so send me the link.
  1. Tests never fail locally.
  2. Test sometimes fail and sometimes pass when filtered with INTEGRATION_TARGET=test/integration/multikueue -
    Failed ones:
    https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/pull/kubernetes-sigs_kueue/3920/pull-kueue-test-integration-main/1879796030275325952
    https://storage.googleapis.com/kubernetes-ci-logs/pr-logs/pull/kubernetes-sigs_kueue/3920/pull-kueue-test-integration-main/1875238332280082432/build-log.txt

Successful one:
https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/pull/kubernetes-sigs_kueue/3920/pull-kueue-test-integration-main/1875255794467868672

But here important notice: those logs are not "clean" they are with INTEGRATION_API_LOG_LEVEL ?= 3.

  1. Yes, this is correct so there are at least 2 ways to get logs from failed routines.
    Either --output-interceptor-mode=none or defer ginkgo.GinkgoRecover() on top of the go routine.
    Also to get more insight in the API server or ETCD I used INTEGRATION_API_LOG_LEVEL.

  2. It's before suite, because the issue happens in createCluster() when we initialise test framework and it goes like this:
    unable to install CRDs onto control plane: unable to create CRD instances: unable to create CRD <one of the CRDs being installed>

  3. Yes, that's the worst part that simple CRD installation cause the same issue.
    Here is minimalistic example: [Test] Reproduce Kuberay MK integration tests issue #3920

@mimowo
Copy link
Contributor

mimowo commented Jan 20, 2025

Thank you for the input!
The part about failing BeforeSuite is quite interesting.
Which particular createCluster fails? Can you pin-point which particular line inside fails?

Maybe after installation of the Ray CRD the control-plane is over-loaded for a while causing some requests to fail. In that case we can try:

  1. sleep a little (say 1s) just after installing the CRDs, can you run such an experiment
  2. if we pin-point a specific request failing in createCluster maybe we can retry this specific request?

@mszadkow
Copy link
Contributor Author

Thank you for the input! The part about failing BeforeSuite is quite interesting. Which particular createCluster fails? Can you pin-point which particular line inside fails?

Maybe after installation of the Ray CRD the control-plane is over-loaded for a while causing some requests to fail. In that case we can try:

  1. sleep a little (say 1s) just after installing the CRDs, can you run such an experiment
  2. if we pin-point a specific request failing in createCluster maybe we can retry this specific request?

I do agree Ray CRDs overload the system like nothing else.
I think this was most accurate stack trace I found in the past logs:

  goroutine 68 [running]:
  github.com/onsi/ginkgo/v2.Fail({0xc007e76000, 0x69a}, {0xc0031e8328, 0x1, 0x4bc729?})
  	/home/prow/go/src/sigs.k8s.io/kueue/vendor/github.com/onsi/ginkgo/v2/core_dsl.go:427 +0x265
  github.com/onsi/gomega/internal.(*Assertion).match(0xc00464e380, {0x322c110, 0x43cf400}, 0x0, {0xc001ec2970, 0x1, 0x1})
  	/home/prow/go/src/sigs.k8s.io/kueue/vendor/github.com/onsi/gomega/internal/assertion.go:106 +0x2eb
  github.com/onsi/gomega/internal.(*Assertion).NotTo(0xc00464e380, {0x322c110, 0x43cf400}, {0xc001ec2970, 0x1, 0x1})
  	/home/prow/go/src/sigs.k8s.io/kueue/vendor/github.com/onsi/gomega/internal/assertion.go:74 +0x105
  sigs.k8s.io/kueue/test/integration/framework.(*Framework).Init.func1()
  	/home/prow/go/src/sigs.k8s.io/kueue/test/integration/framework/framework.go:99 +0xade
  github.com/onsi/ginkgo/v2/internal.(*Suite).By(0xc00023ea88, {0x2e7f140, 0x1e}, {0xc0005adb18, 0x1, 0xc00034f5e4?})
  	/home/prow/go/src/sigs.k8s.io/kueue/vendor/github.com/onsi/ginkgo/v2/internal/suite.go:323 +0x3a2
  github.com/onsi/ginkgo/v2.By({0x2e7f140, 0x1e}, {0xc0005adb18, 0x1, 0x1})
  	/home/prow/go/src/sigs.k8s.io/kueue/vendor/github.com/onsi/ginkgo/v2/core_dsl.go:600 +0x65
  sigs.k8s.io/kueue/test/integration/framework.(*Framework).Init(0xc0001e4200)
  	/home/prow/go/src/sigs.k8s.io/kueue/test/integration/framework/framework.go:78 +0xd8
  sigs.k8s.io/kueue/test/integration/multikueue.createCluster(0x0, {0xc00006a9f0, 0x1, 0x1})
  	/home/prow/go/src/sigs.k8s.io/kueue/test/integration/multikueue/suite_test.go:103 +0xa65
  sigs.k8s.io/kueue/test/integration/multikueue.init.func4.1.1()
  	/home/prow/go/src/sigs.k8s.io/kueue/test/integration/multikueue/suite_test.go:244 +0xca
  created by sigs.k8s.io/kueue/test/integration/multikueue.init.func4.1 in goroutine 67
  	/home/prow/go/src/sigs.k8s.io/kueue/test/integration/multikueue/suite_test.go:241 +0x12a

Here/home/prow/go/src/sigs.k8s.io/kueue/test/integration/framework/framework.go:99 it fail at gomega.Expect() but inside the function it unravels:
(f *Framework) Init() -> (te *Environment) Start() -> InstallCRDs() -> CreateCRDs()
All leads to - vendor/sigs.k8s.io/controller-runtime/pkg/envtest/crd.go line 106.

I think it would be great if we could wait in between installing each of the CRDs, but that would require modification of envtest.
The same for retry, bc retrying whole framework setup is too much.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/feature Categorizes issue or PR as related to a new feature. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note-none Denotes a PR that doesn't merit a release note. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants