[Callbacks] Consolidate Saving Methods #1168

Open · wants to merge 5 commits into main

Conversation

@kylesayrs (Collaborator) commented Feb 18, 2025

Purpose

Background

All of the things that need to be done during saving are:

  1. Save the model weights, potentially compressed
  2. Save the processor
  3. Update the recipe checkpoint
  4. Copy any necessary python files from the model cache
  5. Only save on the main process

After these changes, (1), (2), (3), and (4) will be done within the save_pretrained function, and (5) will be the responsibility of the caller. (3) will be implemented in the next PR so as not to conflict with existing logic in pre_init.
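Roughly speaking, "done within the save_pretrained function" means wrapping the model's existing save_pretrained method so the extra steps happen transparently on every save. A minimal sketch of the idea (not the actual llm-compressor implementation; copy_python_files_from_model_cache is a placeholder helper):

from functools import wraps


def copy_python_files_from_model_cache(model, save_directory):
    # placeholder helper: copy any custom .py files (e.g. remote modeling code)
    # that shipped with the model into save_directory
    ...


def patch_save_pretrained(model):
    # wrap the model's existing save_pretrained so the extra steps run on every save
    original_save_pretrained = model.save_pretrained

    @wraps(original_save_pretrained)
    def save_pretrained_wrapper(save_directory, **kwargs):
        # (1) save the model weights, potentially compressed
        original_save_pretrained(save_directory, **kwargs)
        # (4) copy any necessary python files from the model cache
        copy_python_files_from_model_cache(model, save_directory)
        # (2) processor saving and (3) recipe checkpointing are consolidated
        #     similarly; omitted here for brevity ((3) lands in the next PR)

    model.save_pretrained = save_pretrained_wrapper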

All of the places where a model is saved are:

  • If an output dir is specified, at the end of the main function
  • Between stages of the stage runner
  • Between epochs of the HF Trainer
  • By the user after oneshot/training completes

After these changes, all of these call sites will be replaced by a single save_checkpoint function, which calls save_pretrained to do all of the necessary work.
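For illustration only (the real signature may differ), save_checkpoint can be a thin entrypoint that every call site shares:

def save_checkpoint(save_path, model, processor=None):
    # single entrypoint used between stages, between epochs, and at the end of a run
    model.save_pretrained(save_path)          # weights (plus compression, recipe, python files)
    if processor is not None:
        processor.save_pretrained(save_path)  # tokenizer / processor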

Changes

  • Remove save_model_and_recipe
    • Saving recipes is now done by the save_pretrained function
  • Implement save_checkpoint
    • Single entrypoint for saving a model and its processor
    • Performs actions (1, 2, 4)
  • Replace all locations where a model is saved with save_checkpoint
    • All applicable callers now guard so that saving only happens on the main process (5); see the example after this list
  • Remove support for modify_fsdp_model_save_pretrained and unwrap_and_export_model, to be added back in a future release
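For example, a call site might look roughly like the following, with (5) staying outside save_checkpoint (the trainer.accelerator attribute is just illustrative of however the caller checks its rank):

# only the main process writes the checkpoint; other processes wait at the barrier
if trainer.accelerator.is_main_process:
    save_checkpoint(output_dir, model, processor)
trainer.accelerator.wait_for_everyone()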

Signed-off-by: Kyle Sayers <[email protected]>

👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

Note: This is required to complete the testing suite; please only add the label once the PR is code complete and local testing has been performed.

@kylesayrs added the ready (When a PR is ready for review) label on Feb 18, 2025
__all__ = ["modify_save_pretrained", "modify_fsdp_model_save_pretrained"]


def modify_fsdp_model_save_pretrained(trainer, processor: Processor):
brian-dellabetta (Collaborator) commented:
does this break terribly from the other changes you've made? Wondering why we want to remove code only to add it back later on

@kylesayrs (Collaborator, Author) commented Feb 18, 2025:

@brian-dellabetta I'm essentially removing this function because I think its approach is likely wrong. Instead of creating a separate save_pretrained function for FSDP models (and having to maintain two functions which are mostly identical), we should simply use a context manager.

Something like

with maybe_unwrap_fsdp(model):
    original_save_pretrained(model, ...)

# alternatively
with unwrap_fsdp(model) if is_fsdp(model) else nullcontext():
     original_save_pretrained(model, ...)

This would consist of taking the code from unwrap_and_export_model and turning it into a context manager.
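A minimal sketch of what that context manager could look like (not the final code; a real version would also need the full-state-dict / rank0-only handling from unwrap_and_export_model):

from contextlib import contextmanager, nullcontext

from torch.distributed.fsdp import FullyShardedDataParallel as FSDP


def is_fsdp(model) -> bool:
    # placeholder check: is any submodule wrapped with FSDP?
    return any(isinstance(module, FSDP) for module in model.modules())


@contextmanager
def maybe_unwrap_fsdp(model):
    # gather the full (unsharded) parameters while inside the block; no-op otherwise
    context = FSDP.summon_full_params(model) if is_fsdp(model) else nullcontext()
    with context:
        yield model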

Collaborator replied:

ok, as long as we're not inserting a bunch of breaking changes to common user pathways.

side note: i've seen this maybe_do_x naming convention elsewhere in the code. this can be very confusing to a reader (myself included). unwrap_if_fsdp would be a clearer name so user knows it's deterministic and what the condition is

@kylesayrs kylesayrs changed the title Consolidate Saving Methods [Callbacks] Consolidate Saving Methods Feb 19, 2025
@kylesayrs kylesayrs marked this pull request as ready for review February 19, 2025 00:23
@kylesayrs kylesayrs self-assigned this Feb 19, 2025
@horheynm (Collaborator) left a comment:

Having one central location to carry out saving logic sounds great!

Could you map out when the current saving logic is carried out and through which pathway, and how the new changes take over that logic? E.g., when does each of the different saving paths now get carried out?

For FSDP, we do currently support it. Once the stage runner is removed, then the assumption that any oneshot pathway will not have FSDP support will be valid.

@@ -418,7 +418,10 @@ def main(
     # wrap model.save_pretrained
     if is_fsdp_model(model):
-        modify_fsdp_model_save_pretrained(trainer, processor)
+        raise NotImplementedError(
Collaborator commented:

We should put this in after the oneshot / stage runner refactor. Currently this is an OK pipeline.

@kylesayrs (Collaborator, Author) commented Feb 21, 2025:

@horheynm All the answers to your questions are in the PR description. w.r.t. fsdp, it is not supported now but will be at a later date (soon).
