Huge perplexity value

#20
by zhuqiang - opened

Hi all, I just noticed that gemma-3n gives a huge perplexity value even for a very simple case.

Reproduce

import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

model = AutoModelForImageTextToText.from_pretrained(
    "models/gemma-3n-E2B-it",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    # attn_implementation="eager",
    device_map='cuda',
).eval()
processor = AutoProcessor.from_pretrained("models/gemma-3n-E2B-it")

messages = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "You are a helpful assistant."}]
    },
    {
        "role": "user",
        "content": [
           {"type": "text", "text": "Hi"}
        ]
    },
    {
        "role": "assistant",
        "content": [
           {"type": "text", "text": "How are you?"}
        ]
    }
]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)
print(text)
encodings = processor(text=text, images=None, videos=None, padding=False, return_tensors="pt")

input_ids = encodings.input_ids.to('cuda')
target_ids = input_ids.clone()
trg_len = -2
target_ids[:, :trg_len] = -100

with torch.no_grad():
    outputs = model(input_ids, labels=target_ids)
    nll = outputs.loss

ppl = torch.exp(nll)
print(ppl)

Output

tensor(43704.4180, device='cuda:0')

Packages:

transformers==4.56.2
torch==2.8.0
Google org

Hi @zhuqiang ,
I believe the issue is with your label masking. Your code target_ids[:, :-2] = -100 masks everything except the last two tokens, so the perplexity is computed on only those two tokens of the sequence, which leads to the massive score. To fix this, mask the entire input prompt and compute the loss only on the response tokens you want to evaluate: find the token length of your prompt (prompt_len), then apply the mask as target_ids[:, :prompt_len] = -100.
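For reference, here is a minimal sketch of that masking. It assumes the model, processor, messages, and input_ids objects from your reproduction above, and it derives prompt_len by re-tokenizing only the prompt portion of the chat template (one possible way to get the prompt length; the variable names are illustrative):

# Sketch: mask the prompt tokens and score only the assistant response.
# Assumes model, processor, messages, and input_ids from the snippet above.
prompt_text = processor.apply_chat_template(
    messages[:-1],               # system + user turns only, no assistant reply
    tokenize=False,
    add_generation_prompt=True,  # keep the assistant turn header in the prompt
)
prompt_ids = processor(text=prompt_text, images=None, videos=None,
                       padding=False, return_tensors="pt").input_ids
prompt_len = prompt_ids.shape[1]

target_ids = input_ids.clone()
target_ids[:, :prompt_len] = -100   # ignore prompt tokens, score only the response

with torch.no_grad():
    nll = model(input_ids, labels=target_ids).loss
ppl = torch.exp(nll)
print(ppl)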
Thank you
