
Blaze.1-27B-Preview is a Gemma 2-based, 27-billion-parameter model. Gemma is a family of lightweight, state-of-the-art open models from Google, built using the same research and technology that powers the Gemini models. These models are text-to-text, decoder-only large language models, available in English, with open weights for both pre-trained and instruction-tuned variants. Gemma models are well suited for a variety of text generation tasks, including question answering, summarization, and reasoning. Blaze.1-27B was fine-tuned on synthetic long chain-of-thought reasoning datasets derived from models such as DeepSeek, Qwen, and OpenAI’s GPT-4.

Quickstart Chat Template

Below are some code snippets to help you get started quickly with the model. First, install the Transformers library:

pip install -U transformers

Then, copy the snippet from the section that is relevant for your use case.

Running with the pipeline API

import torch
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="prithivMLmods/Blaze.1-27B-Preview",
    model_kwargs={"torch_dtype": torch.bfloat16},
    device="cuda",  # replace with "mps" to run on a Mac device
)

messages = [
    {"role": "user", "content": "Who are you? Please, answer in pirate-speak."},
]

outputs = pipe(messages, max_new_tokens=256)
assistant_response = outputs[0]["generated_text"][-1]["content"].strip()
print(assistant_response)
# Ahoy, matey! I be Gemma, a digital scallywag, a language-slingin' parrot of the digital seas. I be here to help ye with yer wordy woes, answer yer questions, and spin ye yarns of the digital world.  So, what be yer pleasure, eh? 🦜
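
The pipeline keeps no conversation state between calls. To continue a multi-turn chat, append the assistant's reply and the next user turn to messages and call pipe again; a minimal sketch (the follow-up prompt is just an example):

# Carry the previous turns forward; the pipeline itself is stateless
messages.append({"role": "assistant", "content": assistant_response})
messages.append({"role": "user", "content": "Now answer the same question in plain English."})

outputs = pipe(messages, max_new_tokens=256)
print(outputs[0]["generated_text"][-1]["content"].strip())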

Running the model on a single or multiple GPUs

# pip install accelerate
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("prithivMLmods/Blaze.1-27B-Preview")
model = AutoModelForCausalLM.from_pretrained(
    "prithivMLmods/Blaze.1-27B-Preview",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

input_text = "Write me a poem about Machine Learning."
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

outputs = model.generate(**input_ids, max_new_tokens=32)
print(tokenizer.decode(outputs[0]))

You can ensure the correct chat template is applied by using tokenizer.apply_chat_template as follows:

messages = [
    {"role": "user", "content": "Write me a poem about Machine Learning."},
]
input_ids = tokenizer.apply_chat_template(messages, return_tensors="pt", return_dict=True).to("cuda")

outputs = model.generate(**input_ids, max_new_tokens=256)
print(tokenizer.decode(outputs[0]))
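
Note that generate returns the prompt tokens followed by the completion, so decoding outputs[0] reprints the prompt. A small sketch for printing only the newly generated text, reusing input_ids from above:

# Slice off the prompt so only the model's reply is decoded
prompt_length = input_ids["input_ids"].shape[-1]
print(tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True))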

Running the model on a GPU using different precisions

The native weights of this model were exported in bfloat16 precision.

You can also run the model in float32 by omitting torch_dtype, but this yields no extra precision: the bfloat16 weights are simply upcast to float32, roughly doubling the memory footprint. See the example below.

  • Upcasting to torch.float32
# pip install accelerate
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("prithivMLmods/Blaze.1-27B-Preview")
model = AutoModelForCausalLM.from_pretrained(
    "prithivMLmods/Blaze.1-27B-Preview",
    device_map="auto",
)

input_text = "Write me a poem about Machine Learning."
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

outputs = model.generate(**input_ids, max_new_tokens=32)
print(tokenizer.decode(outputs[0]))
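
As a rough sanity check on the memory claim above (weights only, ignoring activations and the KV cache; 27 billion parameters assumed):

# Back-of-the-envelope weight memory: bytes per parameter x parameter count
params = 27e9
print(f"bfloat16: ~{params * 2 / 1e9:.0f} GB")  # 2 bytes per parameter
print(f"float32:  ~{params * 4 / 1e9:.0f} GB")  # 4 bytes per parameter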

Intended Use

Blaze.1-27B-Preview is designed for advanced text generation tasks requiring logical reasoning, complex problem-solving, and long-form content generation. Its primary use cases include:

  1. Question Answering: Generating detailed, accurate answers to a wide range of questions across various domains.
  2. Summarization: Condensing long texts into concise summaries while preserving key information and context.
  3. Reasoning Tasks: Performing multi-step reasoning, particularly in mathematical, logical, and conditional scenarios (a prompt sketch follows this list).
  4. Instruction Following: Responding to user prompts with coherent and relevant outputs, based on fine-tuned instruction-following capabilities.
  5. Conversational AI: Supporting virtual assistants and chatbots for both casual and professional applications.
  6. Multi-Model Comparison: Benefiting researchers by providing outputs tuned on data derived from diverse models such as DeepSeek, Qwen, and GPT-4, allowing comparative insights across different reasoning paradigms.
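
To illustrate the reasoning use case from item 3, here is a minimal sketch that runs a multi-step word problem through the chat template, reusing the tokenizer and model loaded earlier (the prompt text is only an example):

messages = [
    {"role": "user", "content": "A train leaves at 3:15 pm and travels 180 km at 60 km/h. Reason step by step: when does it arrive?"},
]
input_ids = tokenizer.apply_chat_template(messages, return_tensors="pt", return_dict=True).to("cuda")

outputs = model.generate(**input_ids, max_new_tokens=512)
# Decode only the completion, skipping the prompt tokens
print(tokenizer.decode(outputs[0][input_ids["input_ids"].shape[-1]:], skip_special_tokens=True))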

Limitations

  1. Reasoning Bias: Despite its training on synthetic datasets, the model may exhibit biases in reasoning, especially when encountering unfamiliar problem types.
  2. Hallucinations: Like other large language models, Blaze.1-27B may generate inaccurate or fabricated information, particularly when dealing with facts or events not covered during training.
  3. Dependency on Prompt Quality: The quality of the model’s output heavily relies on the clarity and specificity of the input prompt. Poorly framed prompts may lead to irrelevant or incomplete responses.
  4. Long Context Handling: While it is designed for long-chain reasoning, performance may degrade with excessively long inputs or contexts, resulting in loss of coherence or incomplete reasoning.
  5. Resource Requirements: Due to its large size (27 billion parameters), it requires substantial computational resources for both inference and fine-tuning, limiting its accessibility for users without high-performance hardware (see the quantization sketch after this list).
  6. Language Support: Although it excels in English, its capabilities in other languages may be limited, and unexpected issues may arise when processing multilingual or code-mixed inputs.
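
Regarding item 5, a common way to shrink the hardware footprint is 4-bit quantization with bitsandbytes. A hedged sketch, assuming bitsandbytes is installed and a CUDA GPU is available (quantization trades some output quality for memory):

# pip install accelerate bitsandbytes
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # store weights in 4-bit, compute in bfloat16
)

tokenizer = AutoTokenizer.from_pretrained("prithivMLmods/Blaze.1-27B-Preview")
model = AutoModelForCausalLM.from_pretrained(
    "prithivMLmods/Blaze.1-27B-Preview",
    device_map="auto",
    quantization_config=quantization_config,
)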