Vastly overhaul the CI workflow for build checks #852

mhucka · 2025-01-20T03:51:10Z

This workflow compiles TFQ and runs test cases to verify things work after file changes are made by a PR or other event. It is a complete rewrite of the previous TFQ build & test checks CI workflow, and incorporates many new features and efficiency gains:

Serious use of caching of the Python installation and environment, as well as the Bazel environment, so that repeated runs don't have to reinstall everything every time.
File-aware execution: the workflow tests whether source files and build configuration files have been modified in a given event, and stops early if the event didn't affect such files.
Compatibility with GitHub merge queues.
Concurrency detection: if a new event such as a push to a PR branch comes in while a workflow is already executing, that workflow is cancelled and a new one restarted. (The existing workflow keeps the existing one running and starts a second one, which is almost never useful.)
Addition of a manual run interface, allowing the workflow to be invoked manually from the GitHub GUI with different values for program versions – useful for debugging, testing, exploring behaviors with different versions of linters, etc.

This workflow compiles TFQ and runs test cases to verify things work after file changes are made by a PR or other event. It is a complete rewrite of the previous TFQ build & test checks CI workflow, and incorporates many new features and efficiency gains: - Serious use of caching of the Python installation and environment, as well as the Bazel environment, so that repeated runs don't have to reinstall everything every time. - File-aware execution: the workflow tests whether source files and build configuration files have been modified in a given event, and stops early if the event didn't affect such files. - Compatibility with GitHub [merge queues](https://github.blog/news-insights/product-news/github-merge-queue-is-generally-available/). - Concurrency detection: if a new event such as a push to a PR branch comes in while a workflow is already executing, that workflow is cancelled and a new one restarted. (The existing workflow keeps the existing one running and starts a second one, which is almost never useful.) - Addition of a manual run interface, allowing the workflow to be invoked manually from the GitHub GUI with different values for program versions – useful for debugging, testing, exploring behaviors with different versions of linters, etc.

mhucka · 2025-01-20T06:40:04Z

In reference to a previous PR proposing to filter test execution to exclude tests tagged as "eternal", here is some timing information.

The basic workflow, with no caching or other enhancements, takes about 38 minutes to run if Bazel test timeouts are set long enough to allow all tests to complete. Here's a screenshot using the new "ci-nightly-check.yaml" workflow (it's in separate PR Add workflow for nightly full test #853), whose purpose is to run once a day to build and do all the tests without doing anything clever:

With the new build & test workflow (the present PR), and again setting timeouts as long as possible (not the default, rather using --test_timeout=6000) so that we can see how long the tests themselves take to run to completion, the overall time is like this:

The new workflow takes only ~2 minutes to build TFQ. The TFQ tests take around 17-18 minutes and the tutorial tests take 10-11 minutes. Examining the logs shows that the longest times are for the gradient and sometimes some other tests. Here are the 3 longest in an example run:
```
//tensorflow_quantum/core/ops:circuit_execution_ops_test   PASSED in 385.1s
//tensorflow_quantum/python/differentiators:gradient_test  PASSED in 575.9s
//tensorflow_quantum/core/ops:tfq_unitary_op_test          PASSED in 306.6s
```
The cicuit_execution_ops_test and gradient_test are already both tagged as "eternal". Interestingly, tfq_unitary_op_test is not so tagged. With respect to the discussion in another PR, it's interesting that in this example, simulate_mps_test is not one that showed up this time. It has definitely shown up before with high times, so I think we're just seeing some expected stochastic variability going on.
Finally, using the default Bazel test timeouts, here's the timing:

These are shorter because they're hitting timeouts and getting cut short.

Conclusions:

This new build & test CI workflow significantly reduces the overall time to do PR checks.
The run times are now constrained by the time it takes to run the tests.
Instead of filtering by "eternal", we could shorten the test timeouts, as another way of reducing the overall PR check run time. I'm not sure that's better – the tests affected might be more chosen more randomly, and that might result in missing new problems such as a change that causes tests to run much slower than they did before. It seems better to be sure of what is being skipped.
The tutorials take a long time to test. Even if we manage to reduce the time to do the Bazel tests, the overall time won't go less than about 13-15 minutes unless we reduce the 10-11 minute time for the tutorial tests. Finding ways of shortening the tutorial tests is another thing worth exploring.

MichaelBroughton

LGTM

mhucka added 2 commits January 20, 2025 03:50

Merge branch 'master' into mhucka-improve-build-ci

7051f75

mhucka marked this pull request as ready for review January 20, 2025 06:46

mhucka requested a review from MichaelBroughton January 20, 2025 06:46

mhucka added area/tests Concerns tests and testing of the TFQ codebase area/ci Concerns continuous integration workflows and infrastructure labels Jan 20, 2025

mhucka self-assigned this Jan 20, 2025

mhucka added area/tests Concerns tests and testing of the TFQ codebase and removed area/tests Concerns tests and testing of the TFQ codebase labels Jan 20, 2025

mhucka mentioned this pull request Jan 20, 2025

Label simulate_mps_test as external #841

Closed

MichaelBroughton approved these changes Jan 22, 2025

View reviewed changes

mhucka added 3 commits January 21, 2025 21:16

Merge branch 'master' into mhucka-improve-build-ci

57c4c7e

Merge branch 'master' into mhucka-improve-build-ci

99abe7b

Merge branch 'master' into mhucka-improve-build-ci

5244023

mhucka merged commit 2f1503f into tensorflow:master Jan 22, 2025
1 of 2 checks passed

mhucka mentioned this pull request Jan 23, 2025

Add caching to ci.yaml #818

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Vastly overhaul the CI workflow for build checks #852

Vastly overhaul the CI workflow for build checks #852

mhucka commented Jan 20, 2025

mhucka commented Jan 20, 2025 •

edited

Loading

MichaelBroughton left a comment

Vastly overhaul the CI workflow for build checks #852

Vastly overhaul the CI workflow for build checks #852

Conversation

mhucka commented Jan 20, 2025

mhucka commented Jan 20, 2025 • edited Loading

MichaelBroughton left a comment

Choose a reason for hiding this comment

mhucka commented Jan 20, 2025 •

edited

Loading