[Callbacks] Consolidate Saving Methods #1168

Open · wants to merge 5 commits into main

Conversation

@kylesayrs (Collaborator) commented Feb 18, 2025

Purpose

Background

All of the things that need to be done during saving are:

  1. Save the model weights, potentially compressed
  2. Save the processor
  3. Update the recipe checkpoint
  4. Copy any necessary python files from the model cache
  5. Only save on the main process

After these changes, (1), (2), (3), and (4) will be done within the save_pretrained function, and (5) will be the responsibility of the caller. (3) will be implemented in the next PR so as not to conflict with existing logic in pre_init.
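Roughly speaking, "done within the save_pretrained function" means wrapping the model's existing save_pretrained method so the extra steps happen transparently on every save. A minimal sketch of the idea (not the actual llm-compressor implementation; copy_python_files_from_model_cache is a placeholder helper):

from functools import wraps


def copy_python_files_from_model_cache(model, save_directory):
    # placeholder helper: copy any custom .py files (e.g. remote modeling code)
    # that shipped with the model into save_directory
    ...


def patch_save_pretrained(model):
    # wrap the model's existing save_pretrained so the extra steps run on every save
    original_save_pretrained = model.save_pretrained

    @wraps(original_save_pretrained)
    def save_pretrained_wrapper(save_directory, **kwargs):
        # (1) save the model weights, potentially compressed
        original_save_pretrained(save_directory, **kwargs)
        # (4) copy any necessary python files from the model cache
        copy_python_files_from_model_cache(model, save_directory)
        # (2) processor saving and (3) recipe checkpointing are consolidated
        #     similarly; omitted here for brevity ((3) lands in the next PR)

    model.save_pretrained = save_pretrained_wrapper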

All of the places where a model is saved are:

  • If an output dir is specified, at the end of the main function
  • Between stages of the stage runner
  • Between epochs of the HF Trainer
  • By the user after oneshot/training completes

After these changes, all of these call sites will be replaced by a single save_checkpoint function, which calls save_pretrained to do all of the necessary work.
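For illustration only (the real signature may differ), save_checkpoint can be a thin entrypoint that every call site shares:

def save_checkpoint(save_path, model, processor=None):
    # single entrypoint used between stages, between epochs, and at the end of a run
    model.save_pretrained(save_path)          # weights (plus compression, recipe, python files)
    if processor is not None:
        processor.save_pretrained(save_path)  # tokenizer / processor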

Changes

  • Remove save_model_and_recipe
    • Saving recipes is now done by the save_pretrained function
  • Implement save_checkpoint
    • Single entrypoint for saving a model and its processor
    • Performs actions (1, 2, 4)
  • Replace all locations where a model is saved with save_checkpoint
    • All applicable callers now guard so that saving only happens on the main process (5); see the example after this list
  • Remove support for modify_fsdp_model_save_pretrained and unwrap_and_export_model, to be added back in a future release
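For example, a call site might look roughly like the following, with (5) staying outside save_checkpoint (the trainer.accelerator attribute is just illustrative of however the caller checks its rank):

# only the main process writes the checkpoint; other processes wait at the barrier
if trainer.accelerator.is_main_process:
    save_checkpoint(output_dir, model, processor)
trainer.accelerator.wait_for_everyone()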

Signed-off-by: Kyle Sayers <[email protected]>

👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

Note: This is required to complete the testing suite; please only add the label once the PR is code complete and local testing has been performed.

@kylesayrs added the ready (When a PR is ready for review) label on Feb 18, 2025
__all__ = ["modify_save_pretrained", "modify_fsdp_model_save_pretrained"]


def modify_fsdp_model_save_pretrained(trainer, processor: Processor):
brian-dellabetta (Collaborator) commented:
does this break terribly from the other changes you've made? Wondering why we want to remove code only to add it back later on

@kylesayrs (Collaborator, Author) commented Feb 18, 2025:

@brian-dellabetta I'm essentially removing this function because I think its approach is likely wrong. Instead of creating a separate save_pretrained function for FSDP models (and having to maintain two functions which are mostly identical), we should simply use a context manager.

Something like

with maybe_unwrap_fsdp(model):
    original_save_pretrained(model, ...)

# alternatively
with unwrap_fsdp(model) if is_fsdp(model) else nullcontext():
     original_save_pretrained(model, ...)

This would consist of taking the code from unwrap_and_export_model and turning it into a context manager.
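A minimal sketch of what that context manager could look like (not the final code; a real version would also need the full-state-dict / rank0-only handling from unwrap_and_export_model):

from contextlib import contextmanager, nullcontext

from torch.distributed.fsdp import FullyShardedDataParallel as FSDP


def is_fsdp(model) -> bool:
    # placeholder check: is any submodule wrapped with FSDP?
    return any(isinstance(module, FSDP) for module in model.modules())


@contextmanager
def maybe_unwrap_fsdp(model):
    # gather the full (unsharded) parameters while inside the block; no-op otherwise
    context = FSDP.summon_full_params(model) if is_fsdp(model) else nullcontext()
    with context:
        yield model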

Collaborator replied:

ok, as long as we're not inserting a bunch of breaking changes to common user pathways.

side note: i've seen this maybe_do_x naming convention elsewhere in the code. this can be very confusing to a reader (myself included). unwrap_if_fsdp would be a clearer name so user knows it's deterministic and what the condition is

@kylesayrs kylesayrs changed the title Consolidate Saving Methods [Callbacks] Consolidate Saving Methods Feb 19, 2025
@kylesayrs kylesayrs marked this pull request as ready for review February 19, 2025 00:23
@kylesayrs kylesayrs self-assigned this Feb 19, 2025
@horheynm (Collaborator) left a comment:

Having one central location to carry out saving logic sounds great!

Could you map out when the current saving logic is carried out and through which pathway, and how the new changes take over that logic? E.g., when does each of the different saving paths now get carried out?

For FSDP, we do currently support it. Once the stage runner is removed, then the assumption that any oneshot pathway will not have FSDP support will be valid.

@@ -418,7 +418,10 @@ def main(
     # wrap model.save_pretrained
     if is_fsdp_model(model):
-        modify_fsdp_model_save_pretrained(trainer, processor)
+        raise NotImplementedError(
Collaborator commented:

We should put this in after the oneshot / stage runner refactor. Currently this is an OK pipeline.

@kylesayrs (Collaborator, Author) commented Feb 21, 2025:

@horheynm All the answers to your questions are in the PR description. w.r.t. fsdp, it is not supported now but will be at a later date (soon).
