Issue with Progressive Generation Using inputs_embeds and past_key_values #35707

Open
Superbooming opened this issue Jan 15, 2025 · 2 comments

Superbooming commented Jan 15, 2025

System Info

  • transformers version: 4.46.3
  • Platform: Linux-6.8.0-48-generic-x86_64-with-glibc2.17
  • Python version: 3.8.20
  • Huggingface_hub version: 0.26.1
  • Safetensors version: 0.4.5
  • Accelerate version: 1.0.1
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.4.1+cu121 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?: no
  • Using GPU in script?: yes
  • GPU type: NVIDIA RTX A6000

Who can help?

@gante

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

I am currently rewriting the generate_progressively function for my custom model class. My goal is to enable the model to generate results progressively by concatenating the initial input_ids with each element of the compress_outputs sequence in turn. Specifically:

  1. In the first iteration, the model generates results by concatenating input_ids with the first element of compress_outputs.
  2. In the second iteration, it concatenates input_ids with the first two elements of compress_outputs to generate results.
  3. This process continues until the last element of the compress_outputs sequence is included.

To improve efficiency, I want to leverage caching, as the majority of the concatenated input in each iteration has already been used to compute past_key_values. Below is the code snippet for the function I implemented. In this context, self.model refers to mistral-7b-chat-v0.2.

import torch
from transformers import DynamicCache

@torch.no_grad()
def generate_progressively(
        self,
        input_ids,
        attention_mask,
        compress_outputs,
        **kwargs,
):
    results = []
    compress_output_count = compress_outputs.size(1)
    batch_size = input_ids.size(0)

    # Embed the prompt once and prefill the cache with a single forward pass.
    inputs_embs = self.base.model.embed_tokens(input_ids)
    prompt_cache = DynamicCache()
    outputs = self.model(
        input_ids=input_ids,
        attention_mask=attention_mask,
        use_cache=True,
        past_key_values=prompt_cache,
    )
    prompt_cache = outputs.past_key_values

    for compress_ind in range(compress_output_count):
        # Feed the next compressed embedding through the model so the cache covers it too.
        # type_as(inputs_embs) keeps the float dtype and device of the embedding stream.
        current_compress_outputs = compress_outputs[:, compress_ind: compress_ind + 1, :].type_as(inputs_embs)
        outputs = self.model(
            input_ids=None,
            inputs_embeds=current_compress_outputs,
            use_cache=True,
            past_key_values=prompt_cache,
        )
        prompt_cache = outputs.past_key_values

        # Keep the full embedding sequence and the attention mask in sync with the cache.
        inputs_embs = torch.cat([inputs_embs, current_compress_outputs], dim=1)
        attention_mask = torch.cat([attention_mask, torch.ones(batch_size, 1, device=input_ids.device)], dim=1)

        # Generate from the cached prefix plus all embeddings seen so far.
        generated_outputs = self.base.generate(
            inputs_embeds=inputs_embs,
            attention_mask=attention_mask,
            use_cache=True,
            past_key_values=prompt_cache,
            return_dict_in_generate=True,
            **kwargs,
        )
        results.append(generated_outputs.sequences)
    return results

When I execute this code, it raises an error at line 393 of transformers/generation/utils.py, inside the prepare_inputs_for_generation function.
The problematic line is:

if inputs_embeds is not None and cache_position[0] == 0:

The error message is: IndexError: index 0 is out of bounds for dimension 0 with size 0.
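
For reference, the failure does not seem to require my full setup; a minimal pattern along the following lines (the checkpoint name is only illustrative, and any decoder-only model should behave the same) appears to hit the same code path, i.e. prefilling a cache with a forward pass and then calling generate with only inputs_embeds plus that cache:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, DynamicCache

# Illustrative checkpoint; a smaller decoder-only model works the same way.
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

input_ids = tokenizer("Hello, world", return_tensors="pt").input_ids

# Prefill: run the prompt through the model once to populate the cache.
cache = DynamicCache()
with torch.no_grad():
    cache = model(input_ids=input_ids, use_cache=True, past_key_values=cache).past_key_values

# The cache now already covers every position of inputs_embeds.
inputs_embeds = model.model.embed_tokens(input_ids)

# Raises IndexError: index 0 is out of bounds for dimension 0 with size 0
model.generate(inputs_embeds=inputs_embeds, past_key_values=cache, max_new_tokens=5)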

I traced the execution of the code, and here is a detailed breakdown of the issue:
The error originates in transformers/generation/utils.py. The program first enters the self._sample function and then proceeds to self._get_initial_cache_position.
Within that function, the following lines:

if not is_torchdynamo_compiling():
    cache_position = cache_position[past_length:]

slice cache_position down to an empty tensor, which leads to the IndexError in the subsequent steps.
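
To make the failure mode concrete, the arithmetic with illustrative numbers looks like this: the cache already holds every position covered by the inputs_embeds I pass to generate, so past_length equals the sequence length and the slice comes back empty:

import torch

seq_len = 12       # prompt tokens + compressed embeddings concatenated so far
past_length = 12   # length of prompt_cache after the prefill forward passes

cache_position = torch.arange(seq_len)         # tensor([0, 1, ..., 11])
cache_position = cache_position[past_length:]  # tensor([]) -- empty
# cache_position[0] -> IndexError: index 0 is out of bounds for dimension 0 with size 0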

Even if I manage to fix the issue with cache_position, another problem arises later in the self.prepare_inputs_for_generation function.
The relevant code is as follows:

if not self.config.is_encoder_decoder:
    if inputs_embeds is not None and cache_position[0] == 0:
        model_inputs[input_ids_key] = None
        model_inputs["inputs_embeds"] = inputs_embeds
    else:
        model_inputs[input_ids_key] = input_ids.clone(memory_format=torch.contiguous_format)
        model_inputs["inputs_embeds"] = None

In my case, I provide only inputs_embeds and past_key_values. Since cache_position[0] is not 0, the else branch runs and tries to set model_inputs[input_ids_key] from input_ids.clone(); however, input_ids is None, so this fails as well.

Under the current implementation of the generate function in transformers, is it possible to use only inputs_embeds and past_key_values for generation? How can I modify my implementation to achieve progressive generation with caching as intended? Are there specific guidelines for correctly managing cache_position and ensuring compatibility with inputs_embeds?
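
For what it's worth, one workaround I am considering is to skip generate for these continuations and run a plain greedy decoding loop over the model's forward pass, so prepare_inputs_for_generation is never involved. This is only a rough sketch under my own assumptions (greedy argmax instead of sampling, an unpadded batch so attention_mask can be omitted, self.base being the causal-LM wrapper from the snippet above, and greedy_decode_with_cache being a hypothetical helper name), not a tested fix:

import copy

import torch

@torch.no_grad()
def greedy_decode_with_cache(self, new_embeds, prompt_cache, max_new_tokens=128):
    # Sketch only: decode manually so that generate()'s prepare_inputs_for_generation
    # and cache_position bookkeeping never run.
    # new_embeds are embeddings not yet contained in prompt_cache; the cache is
    # deep-copied because DynamicCache is updated in place and the outer cache
    # should not accumulate the generated tokens.
    cache = copy.deepcopy(prompt_cache)
    eos_token_id = self.base.config.eos_token_id  # assumed to be a single id here
    outputs = self.base(inputs_embeds=new_embeds, use_cache=True, past_key_values=cache)
    generated = []
    for _ in range(max_new_tokens):
        next_token = outputs.logits[:, -1, :].argmax(dim=-1, keepdim=True)  # (batch, 1)
        generated.append(next_token)
        if (next_token == eos_token_id).all():
            break
        outputs = self.base(
            input_ids=next_token,
            use_cache=True,
            past_key_values=outputs.past_key_values,
        )
    return torch.cat(generated, dim=1)  # generated token ids, shape (batch, new_len)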

Expected behavior

My primary objective is to progressively generate outputs by leveraging caching (past_key_values) to improve efficiency.

@zucchini-nlp (Member)

Seems to be the same as #34678, and someone is working on it, as per the last comment.

@Superbooming (Author)

Yes, it seems to be the same. I'll keep track of it. Thanks!
