more stringent test for CPUOffloadOptimizer #1650

Merged 2 commits on Feb 1, 2025

@ngc92 (Contributor) commented Feb 1, 2025

Problems with CPUOffloadOptimizer's synchronization (cf. #1649) are not detected by the current test case, as it is very benign, giving ample opportunity for transfers to complete even without explicit sync.

These changes try to make the situation a bit more challenging, decreasing the arithmetic density of the model and ensuring that the critical path is as short as possible.

This PR is deliberately not based on top of #1649, to show that this new test actually catches the synchronization problem. After that PR is merged, tests should pass again.
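
The masking effect described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not torchao's implementation or the test added in this PR: `ToyOffloadOptimizer`, `value_seen_after_step`, and the timing constants are hypothetical names and values, a background thread stands in for the asynchronous copy, and `time.sleep` stands in for compute time. With plenty of compute between the optimizer step and the next read, the missing sync goes unnoticed; with almost none, the stale parameter becomes visible.

```python
import threading
import time


class ToyOffloadOptimizer:
    """Toy model only: update on the 'host', copy back to the 'device' asynchronously."""

    def __init__(self, param, transfer_seconds, sync):
        self.device_param = param              # value the next forward pass reads
        self.transfer_seconds = transfer_seconds
        self.sync = sync                       # hypothetical switch: wait at end of step?
        self._copy = None

    def step(self, grad, lr=0.5):
        new_value = self.device_param - lr * grad   # update computed on the 'host'

        def copy_back():
            time.sleep(self.transfer_seconds)       # simulated transfer latency
            self.device_param = new_value

        self._copy = threading.Thread(target=copy_back)
        self._copy.start()
        if self.sync:
            self._copy.join()                  # stands in for an explicit sync at end of step

    def wait(self):
        # Let any in-flight copy finish so no thread outlives the experiment.
        if self._copy is not None:
            self._copy.join()


def value_seen_after_step(compute_seconds, sync):
    """Return the parameter value read `compute_seconds` after step() returns."""
    opt = ToyOffloadOptimizer(param=1.0, transfer_seconds=0.05, sync=sync)
    opt.step(grad=1.0)
    time.sleep(compute_seconds)                # work done before the parameter is read
    seen = opt.device_param
    opt.wait()
    return seen
```

In this toy setup, `value_seen_after_step(0.5, sync=False)` returns the updated value `0.5` because the copy finishes during the long compute phase, whereas `value_seen_after_step(0.0, sync=False)` returns the stale `1.0`. With `sync=True` the updated value is read in both cases.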

pytorch-bot bot commented Feb 1, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/1650

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit 471411d with merge base 3eb18e7:

BROKEN TRUNK - The following job failed, but the failure was also present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the CLA Signed label Feb 1, 2025
@gau-nernst added the topic: bug fix label Feb 1, 2025

@gau-nernst (Collaborator) commented

@ngc92 You can add `self.stream.synchronize()` at the end of the optimizer step in this PR; I will then convert #1649 to address only the LR scheduler issue.

Thank you again for identifying the issue and coming up with an appropriate test!

@gau-nernst self-requested a review February 1, 2025 14:25

@gau-nernst (Collaborator) left a comment

The failing ruff check is not caused by this change.

@gau-nernst merged commit 122eb73 into pytorch:main Feb 1, 2025
16 of 17 checks passed