Optimize memory in loss consumption #259

NouamaneTazi · 2024-12-04T15:16:38Z

If we record a memory snapshot of the first training iteration using:

        if self.iteration_step == self.initial_iter_step:
            torch.cuda.memory._record_memory_history(max_entries=100000)

        outputs = self.pipeline_engine.train_batch_iter(
            model=self.model,
            pg=self.parallel_context.pp_pg,
            batch=(next(dataloader) for _ in range(self.n_micro_batches_per_batch)),
            nb_microbatches=self.n_micro_batches_per_batch,
            grad_accumulator=self.grad_accumulator,
        )

        if self.iteration_step == self.initial_iter_step:
            torch.cuda.memory._dump_snapshot("memory_snapshot.pkl")
            torch.cuda.memory._record_memory_history(enabled=None)

In the resulting snapshot, we can see that the memory consumed by the loss (the block at the top) is pretty big (10GB). Most of it comes from:

src/nanotron/models/llama.py:876:: 4GB
src/nanotron/parallel/tensor_parallel/functional.py:115:sharded_cross_entropy: 4GB
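
To get a feel for why allocations of this size can show up around the loss, here is a rough back-of-the-envelope sketch. It assumes the large buffers are vocabulary-sized logits tensors of shape [tokens, vocab], which the issue does not state. The helper `tensor_gib` and every number below (micro-batch size, sequence length, vocab shard size) are hypothetical and not taken from the actual run.

```python
def tensor_gib(num_tokens: int, vocab_size: int, bytes_per_element: int) -> float:
    """Size in GiB of a dense [num_tokens, vocab_size] tensor."""
    return num_tokens * vocab_size * bytes_per_element / 2**30


# Hypothetical values for illustration only, NOT the configuration from this issue.
micro_batch_size = 2
sequence_length = 4096
vocab_shard_size = 131072  # vocab entries held by one tensor-parallel rank (assumed)

num_tokens = micro_batch_size * sequence_length

fp32_gib = tensor_gib(num_tokens, vocab_shard_size, 4)  # logits upcast to float32
bf16_gib = tensor_gib(num_tokens, vocab_shard_size, 2)  # logits kept in bfloat16

print(f"fp32 logits: {fp32_gib:.1f} GiB, bf16 logits: {bf16_gib:.1f} GiB")
```

Under these assumed shapes, each full-precision copy of the logits costs several GiB, so any extra copy kept alive (an upcast, a softmax output, a saved tensor for backward) adds up quickly.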

We need to optimize this.

NouamaneTazi added the help wanted Extra attention is needed label Dec 4, 2024

NouamaneTazi changed the title ~~Optimizer memory in loss consumption~~ Optimize memory in loss consumption Dec 4, 2024
