
Resume from checkpoint with distributed optimizer-in-backward repro #2359

Draft · wants to merge 1 commit into main

Conversation

ebsmothers (Contributor) commented Feb 7, 2025

This is just a repro for the bug described in #2360.


pytorch-bot bot commented Feb 7, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/2359

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 2 Unrelated Failures

As of commit 2aed52c with merge base 9b38360 (image):

NEW FAILURE - The following job has failed:

  • GPU tests / gpu_test (3.11, stable) (gh)
    tests/recipes/test_full_finetune_distributed.py::TestFullFinetuneDistributedRecipe::test_training_state_on_resume[llama3/8B_full-llama3-tune-4-1-True]

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Feb 7, 2025