Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Vastly overhaul the CI workflow for build checks #852

Merged
merged 5 commits into from
Jan 22, 2025

Conversation

mhucka
Copy link
Member

@mhucka mhucka commented Jan 20, 2025

This workflow compiles TFQ and runs test cases to verify things work after file changes are made by a PR or other event. It is a complete rewrite of the previous TFQ build & test checks CI workflow, and incorporates many new features and efficiency gains:

  • Serious use of caching of the Python installation and environment, as well as the Bazel environment, so that repeated runs don't have to reinstall everything every time.

  • File-aware execution: the workflow tests whether source files and build configuration files have been modified in a given event, and stops early if the event didn't affect such files.

  • Compatibility with GitHub merge queues.

  • Concurrency detection: if a new event such as a push to a PR branch comes in while a workflow is already executing, that workflow is cancelled and a new one restarted. (The existing workflow keeps the existing one running and starts a second one, which is almost never useful.)

  • Addition of a manual run interface, allowing the workflow to be invoked manually from the GitHub GUI with different values for program versions – useful for debugging, testing, exploring behaviors with different versions of linters, etc.

This workflow compiles TFQ and runs test cases to verify things work after file changes are made by a PR or other event. It is a complete rewrite of the previous TFQ build & test checks CI workflow, and incorporates many new features and efficiency gains:

- Serious use of caching of the Python installation and environment, as well as the Bazel environment, so that repeated runs don't have to reinstall everything every time.

- File-aware execution: the workflow tests whether source files and build configuration files have been modified in a given event, and stops early if the event didn't affect such files.

- Compatibility with GitHub [merge queues](https://github.blog/news-insights/product-news/github-merge-queue-is-generally-available/).

- Concurrency detection: if a new event such as a push to a PR branch comes in while a workflow is already executing, that workflow is cancelled and a new one restarted. (The existing workflow keeps the existing one running and starts a second one, which is almost never useful.)

- Addition of a manual run interface, allowing the workflow to be invoked manually from the GitHub GUI with different values for program versions – useful for debugging, testing, exploring behaviors with different versions of linters, etc.
@mhucka
Copy link
Member Author

mhucka commented Jan 20, 2025

In reference to a previous PR proposing to filter test execution to exclude tests tagged as "eternal", here is some timing information.

  1. The basic workflow, with no caching or other enhancements, takes about 38 minutes to run if Bazel test timeouts are set long enough to allow all tests to complete. Here's a screenshot using the new "ci-nightly-check.yaml" workflow (it's in separate PR Add workflow for nightly full test #853), whose purpose is to run once a day to build and do all the tests without doing anything clever:

  1. With the new build & test workflow (the present PR), and again setting timeouts as long as possible (not the default, rather using --test_timeout=6000) so that we can see how long the tests themselves take to run to completion, the overall time is like this:
    image
    The new workflow takes only ~2 minutes to build TFQ. The TFQ tests take around 17-18 minutes and the tutorial tests take 10-11 minutes. Examining the logs shows that the longest times are for the gradient and sometimes some other tests. Here are the 3 longest in an example run:

    //tensorflow_quantum/core/ops:circuit_execution_ops_test   PASSED in 385.1s
    //tensorflow_quantum/python/differentiators:gradient_test  PASSED in 575.9s
    //tensorflow_quantum/core/ops:tfq_unitary_op_test          PASSED in 306.6s
    

    The cicuit_execution_ops_test and gradient_test are already both tagged as "eternal". Interestingly, tfq_unitary_op_test is not so tagged. With respect to the discussion in another PR, it's interesting that in this example, simulate_mps_test is not one that showed up this time. It has definitely shown up before with high times, so I think we're just seeing some expected stochastic variability going on.

  2. Finally, using the default Bazel test timeouts, here's the timing:
    image

These are shorter because they're hitting timeouts and getting cut short.

Conclusions:

  • This new build & test CI workflow significantly reduces the overall time to do PR checks.
  • The run times are now constrained by the time it takes to run the tests.
  • Instead of filtering by "eternal", we could shorten the test timeouts, as another way of reducing the overall PR check run time. I'm not sure that's better – the tests affected might be more chosen more randomly, and that might result in missing new problems such as a change that causes tests to run much slower than they did before. It seems better to be sure of what is being skipped.
  • The tutorials take a long time to test. Even if we manage to reduce the time to do the Bazel tests, the overall time won't go less than about 13-15 minutes unless we reduce the 10-11 minute time for the tutorial tests. Finding ways of shortening the tutorial tests is another thing worth exploring.

@mhucka mhucka marked this pull request as ready for review January 20, 2025 06:46
@mhucka mhucka added area/tests Concerns tests and testing of the TFQ codebase area/ci Concerns continuous integration workflows and infrastructure labels Jan 20, 2025
@mhucka mhucka self-assigned this Jan 20, 2025
@mhucka mhucka added area/tests Concerns tests and testing of the TFQ codebase and removed area/tests Concerns tests and testing of the TFQ codebase labels Jan 20, 2025
Copy link
Collaborator

@MichaelBroughton MichaelBroughton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@mhucka mhucka merged commit 2f1503f into tensorflow:master Jan 22, 2025
1 of 2 checks passed
@mhucka mhucka mentioned this pull request Jan 23, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/ci Concerns continuous integration workflows and infrastructure area/tests Concerns tests and testing of the TFQ codebase
Development

Successfully merging this pull request may close these issues.

2 participants