Sarvam-M 4-bit Quantized

This is a 4-bit quantized version of sarvamai/sarvam-m using BitsAndBytesConfig with NF4 quantization.

Model Details

  • Base Model: sarvamai/sarvam-m
  • License: Apache 2.0
  • Quantization Method: BitsAndBytes 4-bit NF4
  • Compute dtype: bfloat16
  • Double Quantization: Enabled
  • Size Reduction: roughly 80% smaller on disk than the original model (~14GB vs ~70GB)
  • Memory Usage: ~4x less GPU memory required
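
For reference, a configuration matching the settings above looks like the following. This is a minimal sketch of how such a checkpoint can be produced (assuming a transformers/bitsandbytes version that supports saving 4-bit weights), not the exact script used; the output directory name is illustrative:

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",          # NF4 quantization
    bnb_4bit_compute_dtype="bfloat16",  # compute dtype
    bnb_4bit_use_double_quant=True,     # double quantization
)

model = AutoModelForCausalLM.from_pretrained(
    "sarvamai/sarvam-m",
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("sarvamai/sarvam-m")

model.save_pretrained("sarvam-m-bnb-4bit")      # writes the quantized weights as safetensors
tokenizer.save_pretrained("sarvam-m-bnb-4bit")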

Key Features

  • Efficient Inference: Significantly reduced memory footprint
  • Thinking Mode: Supports reasoning capabilities with the enable_thinking parameter
  • Chat Template: Optimized for conversational AI applications
  • Device Mapping: Automatic device placement for multi-GPU setups
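
Automatic device placement uses Accelerate's device_map; on multi-GPU machines you can optionally cap per-device memory. A minimal sketch (the memory budgets below are illustrative, not recommendations):

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "tarun7r/sarvam-m-bnb-4bit",
    device_map="auto",                    # let Accelerate shard layers across devices
    max_memory={0: "20GiB", 1: "20GiB"},  # example per-GPU budgets; adjust for your hardware
)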

Installation

pip install transformers torch accelerate bitsandbytes
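
To confirm the stack is usable before loading the model, a quick check along these lines can help (illustrative only):

import torch
import bitsandbytes
import transformers

print("CUDA available:", torch.cuda.is_available())
print("transformers:", transformers.__version__)
print("bitsandbytes:", bitsandbytes.__version__)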

Usage

Basic Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "tarun7r/sarvam-m-bnb-4bit"  
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto"
)

# Prepare input
prompt = "Who are you and what is your purpose?"
messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    enable_thinking=True  # Enable reasoning mode
)

# Generate response
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
generated_ids = model.generate(**model_inputs, max_new_tokens=512)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()
output_text = tokenizer.decode(output_ids)

# Parse thinking and response
# (note: str.rstrip("</s>") strips characters, not the "</s>" token, so remove it explicitly)
if "</think>" in output_text:
    reasoning_content = output_text.split("</think>")[0].rstrip("\n")
    content = output_text.split("</think>")[-1].lstrip("\n")
    if content.endswith("</s>"):
        content = content[: -len("</s>")]
    print("Reasoning:", reasoning_content)
    print("Response:", content)
else:
    content = output_text
    if content.endswith("</s>"):
        content = content[: -len("</s>")]
    print("Response:", content)

Advanced Usage with Custom Parameters

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Continues from the Basic Usage example above (reuses model_name, tokenizer, and model_inputs)
# Optional: explicit quantization config (ignored if the checkpoint is already quantized)
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype="bfloat16",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4"
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quantization_config,  # Optional
    device_map="auto",
    torch_dtype="auto"
)

# Generate with custom parameters
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=1024,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    pad_token_id=tokenizer.eos_token_id
)
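
The advanced example reuses model_inputs from the basic example above; decoding the result works the same way, for instance:

# Decode only the newly generated tokens
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()
output_text = tokenizer.decode(output_ids)
# Parse the thinking block as shown in the basic example if enable_thinking was used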

Thinking Mode

The model supports two modes:

  1. Thinking Mode (enable_thinking=True): Model shows reasoning process
  2. Direct Mode (enable_thinking=False): Direct response without reasoning

# Enable thinking mode for complex reasoning
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    enable_thinking=True
)

# Disable for quick responses
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    enable_thinking=False
)
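
The two calls above differ only in the enable_thinking flag. A small helper (hypothetical, assuming the model and tokenizer from the usage examples are already loaded) keeps that choice in one place:

def chat(prompt, thinking=True, max_new_tokens=512):
    messages = [{"role": "user", "content": prompt}]
    text = tokenizer.apply_chat_template(messages, tokenize=False, enable_thinking=thinking)
    inputs = tokenizer([text], return_tensors="pt").to(model.device)
    generated = model.generate(**inputs, max_new_tokens=max_new_tokens)
    new_tokens = generated[0][len(inputs.input_ids[0]):].tolist()
    return tokenizer.decode(new_tokens)

print(chat("Explain NF4 quantization in one paragraph.", thinking=False))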

Performance Comparison

Model Version      Size     GPU Memory    Loading Time
Original           ~70GB    ~70GB VRAM    ~5-10 min
4-bit Quantized    ~14GB    ~18GB VRAM    ~1-2 min
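
These figures are approximate and depend on the GPU, sequence length, and driver. One way to check peak usage on your own setup (assumes the model and model_inputs from the usage examples):

import torch

torch.cuda.reset_peak_memory_stats()
_ = model.generate(**model_inputs, max_new_tokens=64)
print(f"Peak GPU memory: {torch.cuda.max_memory_allocated() / 1024**3:.1f} GiB")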

Requirements

  • Python 3.8+
  • PyTorch 2.0+
  • Transformers 4.35+
  • BitsAndBytes 0.41+
  • CUDA-compatible GPU (recommended)

Limitations

  • Slight performance degradation compared to the full-precision model
  • Requires the BitsAndBytes library for loading
  • May produce minor numerical differences in outputs

License

Apache 2.0 (same as original model)

Attribution

  • Original Model: Sarvam AI
  • Quantization: Created using BitsAndBytes library
  • Base Model License: Apache 2.0

Disclaimer

This is an unofficial quantized version. For the original model and official support, please refer to sarvamai/sarvam-m.
