CUDA error: misaligned address


Hi there,

I am trying to use the Gemma 3 12B IT model to generate QA pairs. The pipeline is defined as follows:

import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    pipeline,
)
from langchain_huggingface import ChatHuggingFace, HuggingFacePipeline

model_id = "google/gemma-3-12b-it"

# Quantize the model to 4-bit with bitsandbytes
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float32,
    device_map="cuda",
    quantization_config=bnb_config,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Fall back to the EOS token as the padding token if none is set
if tokenizer.pad_token is None:
    eos_token_id = model.config.eos_token_id
    eos_token = tokenizer.decode(eos_token_id)
    tokenizer.pad_token = eos_token  # this is a string, which is expected

text_gen_pipeline = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    torch_dtype=torch.float32,
    top_p=0.95,
    top_k=70,
    temperature=1.25,
    do_sample=True,
    repetition_penalty=1.3,
)

# Wrap the pipeline for LangChain and expose it as a chat model
llm = HuggingFacePipeline(pipeline=text_gen_pipeline)
model = ChatHuggingFace(llm=llm)

When I call this model's invoke function, at some point it throws the following error:

  File "/home/nokia-proj/miniconda3/envs/vrag/lib/python3.10/site-packages/transformers/integrations/sdpa_attent
ion.py", line 54, in sdpa_attention_forward
    attn_output = torch.nn.functional.scaled_dot_product_attention(
RuntimeError: CUDA error: misaligned address
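For context, the chat model is invoked roughly like this (a minimal sketch with a hypothetical prompt; the actual chain and prompt in use when the error appears are not shown here):

from langchain_core.messages import HumanMessage

# Hypothetical invocation; the real prompt asks the model to generate QA pairs
messages = [HumanMessage(content="Generate three question-answer pairs about photosynthesis.")]
response = model.invoke(messages)
print(response.content)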

Any ideas why this error occurs and how to resolve it?

Thank you!
