Synthetic data generation using ibm-granite/granite-3.0-2b-instruct

#8
by jubueche - opened

Hi,

I want to synthetically generate data from Granite to estimate the distribution it was trained on (see QAT-LLM paper). Starting from the empty string does not work:

Traceback (most recent call last):
  File "/gpfs/u/scratch/ANFM/ANFMbchl/granite-anfm/example_granite.py", line 52, in <module>
    main()
  File "/gpfs/u/scratch/ANFM/ANFMbchl/granite-anfm/example_granite.py", line 48, in main
    query_llm_streaming("", model, tokenizer)
  File "/gpfs/u/scratch/ANFM/ANFMbchl/granite-anfm/example_granite.py", line 33, in query_llm_streaming
    model.generate(
  File "/gpfs/u/home/ANFM/ANFMbchl/scratch/miniconda3/envs/anfm-new/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/gpfs/u/scratch/ANFM/ANFMbchl/transformers/src/transformers/generation/utils.py", line 2220, in generate
    result = self._sample(
  File "/gpfs/u/scratch/ANFM/ANFMbchl/transformers/src/transformers/generation/utils.py", line 3204, in _sample
    model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs)
  File "/gpfs/u/scratch/ANFM/ANFMbchl/transformers/src/transformers/generation/utils.py", line 384, in prepare_inputs_for_generation
    if inputs_embeds is not None or cache_position[-1] >= input_ids.shape[1]:  # Exception 1 or Exception 3
IndexError: index -1 is out of bounds for dimension 0 with size 0

I then tried starting with a space, but that generated very bad repetitive data. I then tried to start with the EOS token <|end_of_text|> which was not much better. It just repeats the same stuff a bunch of times.

To my surprise (found this by accident because I remembered the EOS token wrongly), starting with <|endoftext|> seems to work very well.

Why could that be?

import os
import torch
from transformers import GraniteForCausalLM, AutoTokenizer, TextStreamer

device = "cuda" if torch.cuda.is_available() else "cpu"

def query_llm_streaming(prompt, model, tokenizer):
    """Queries the model with the given prompt using token streaming."""

    # Tokenize the input prompt
    inputs = tokenizer(prompt, return_tensors="pt")
    print(inputs)

    # Create a streamer to print tokens as they are generated
    streamer = TextStreamer(tokenizer)

    # Generate a response with streaming
    model.generate(
        input_ids=inputs["input_ids"].to(device=device),
        attention_mask=inputs["attention_mask"].to(device=device),
        max_new_tokens=512,
        streamer=streamer,  # Stream tokens
        temperature=1.0,
        do_sample=True
    )


def main():
    # load model from disk, adapt here
    model_path = os.path.expanduser("~/scratch-shared/ibm-granite/granite-3.0-2b-instruct/")
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = GraniteForCausalLM.from_pretrained(model_path)
    model = model.to(device=device, dtype=torch.float16)
    query_llm_streaming("<|endoftext|>", model, tokenizer)


if __name__ == "__main__":
    main()
IBM Granite org

Hi @jubueche , thanks for raising this one! I think there are really three separate questions here:

  1. Why are you seeing the exception with an empty input?
  2. Why are you seeing bad results with a single space as input?
  3. Why does the mistyped EOS token produce results whereas the real EOS token doesn't?

I'll tackle them all separately:

Why are you seeing the exception with an empty input?

This has to do with the tokenizer config. I used a local Llama model for comparison, and it boils down to the presence of a "post_processor" in the tokenizer config. When tokenizing an empty string, the Llama tokenizer injects the beginning-of-text (BOS) token, so the tokenized sequence contains a single token. In contrast, Granite doesn't use a "post_processor", so tokenizing an empty string yields an empty tensor, which triggers the exception above. I think empty input is roughly equivalent to "undefined behavior" in the LLM space, so while the exception is pretty gross, it's not fully clear what the right answer is here, since the model was not trained with a BOS token injected in front of all inputs.
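As a rough sketch of how to avoid the IndexError, the helper below falls back to a single seed token whenever tokenization yields nothing. The name `encode_with_fallback` and the `fallback_token_id` parameter are made up for illustration. The only thing assumed about the tokenizer is that calling it without `return_tensors` returns a mapping with an `"input_ids"` list, which is how Hugging Face tokenizers behave.

```python
from typing import Any, List


def encode_with_fallback(tokenizer: Any, prompt: str, fallback_token_id: int) -> List[int]:
    """Tokenize `prompt`; if that yields no tokens, return [fallback_token_id]."""
    # Without return_tensors, the tokenizer returns plain Python lists.
    input_ids = list(tokenizer(prompt)["input_ids"])
    if not input_ids:
        # Nothing to condition on, so seed generation with one caller-chosen token
        # (hypothetical choice; the model was not trained with an injected start token).
        input_ids = [fallback_token_id]
    return input_ids
```

With the script above, this could be called as `encode_with_fallback(tokenizer, "", tokenizer.eos_token_id)` and the result wrapped in `torch.tensor([ids])` before passing it to `model.generate`. This only prevents the crash. Which seed token, if any, gives useful samples is a separate question.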

Why are you seeing bad results with a single space as input?

This likely has to do with the model not being trained to protect against garbage-in-garbage-out. A single whitespace prefix is not in the expected input distribution for real-world usage, so garbage output is expected. @rpand002 feel free to elaborate here!

Why does the mistyped EOS token produce results whereas the real EOS token doesn't?

The correctly typed EOS token is a "special token" in the tokenizer, so it encodes to a single token, which falls back into the garbage-in-garbage-out scenario. The mistyped EOS token is not recognized as a special token, so it encodes like a regular string and yields 5 tokens, which starts to approach a non-garbage input that steers the model towards a reasonable answer.
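One way to check this difference yourself is to look at the pieces the tokenizer produces for each string. The sketch below uses a made-up helper name, `describe_tokenization`, and relies only on the tokenizer's `tokenize` method and `all_special_tokens` property. It assumes special tokens are listed in `all_special_tokens`, which may not cover every added token in a given tokenizer config.

```python
from typing import Any, Dict


def describe_tokenization(tokenizer: Any, text: str) -> Dict[str, Any]:
    """Report how `text` is split and whether it is one registered special token."""
    pieces = tokenizer.tokenize(text)
    special_tokens = set(tokenizer.all_special_tokens)
    return {
        "pieces": pieces,
        "num_tokens": len(pieces),
        # True only when the whole string maps to exactly one special token.
        "is_single_special_token": len(pieces) == 1 and pieces[0] in special_tokens,
    }
```

Calling it on both `"<|end_of_text|>"` and `"<|endoftext|>"` with the Granite tokenizer should show whether the first is a single special token and the second is split into several ordinary pieces.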

Thank you. Answers 1 and 2 make complete sense. The <|endoftext|> generating non-garbage output still surprises me. Does that mean that some training data actually contained this as a separator (which didn't get tokenized to EOS), so that the model learned to produce random (but correct) outputs given only these few tokens?

The <|endoftext|> generating non-garbage output still surprises me

Yeah, that's a really good point. I don't have insight into exactly why this happens, but my guess is that it has to do with how the tokenizer splits up <|endoftext|>. The resulting tokens are ["<", "|", "endo", "ftext", "|>"]. The endo and ftext tokens probably have enough grounding in the training set to initialize a non-garbage point in the model's embedding space.

Ok closing this :) Thanks.

jubueche changed discussion status to closed
