---
library_name: transformers
tags:
- mistral
- instruct
- quantization
- 4bit
- bitsandbytes
- causal-lm
---
# 4-bit Quantized Model: Mistral-Nemo-Instruct-2407
This is a 4-bit quantized variant of [mistralai/Mistral-Nemo-Instruct-2407](https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407), quantized to reduce memory footprint and accelerate inference while maintaining high output similarity to the full-weight model.
## Overview
This checkpoint was quantized using BitsAndBytes and evaluated with standard text similarity metrics.
## Model Architecture
| Attribute | Value |
|---|---|
| Model class | MistralForCausalLM |
| Number of parameters | 6,795,187,200 |
| Hidden size | 5120 |
| Number of layers | 40 |
| Attention heads | 32 |
| Vocabulary size | 131072 |
| Compute dtype | torch.bfloat16 |
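The values above can be checked against the checkpoint's configuration; a minimal sketch using the repo id from the Usage section below:

```python
from transformers import AutoConfig

# Read the configuration shipped with the quantized checkpoint.
config = AutoConfig.from_pretrained(
    "PJEDeveloper/Mistral-Nemo-Instruct-2407-4bit-20250714_213418"
)
print(config.architectures)        # ['MistralForCausalLM']
print(config.hidden_size)          # 5120
print(config.num_hidden_layers)    # 40
print(config.num_attention_heads)  # 32
print(config.vocab_size)           # 131072
```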
## Quantization Configuration
The following configuration dictionary was used during quantization:
```python
{'quant_method': <QuantizationMethod.BITS_AND_BYTES: 'bitsandbytes'>, '_load_in_8bit': False, '_load_in_4bit': True, 'llm_int8_threshold': 6.0, 'llm_int8_skip_modules': None, 'llm_int8_enable_fp32_cpu_offload': False, 'llm_int8_has_fp16_weight': False, 'bnb_4bit_quant_type': 'fp4', 'bnb_4bit_use_double_quant': False, 'bnb_4bit_compute_dtype': 'bfloat16', 'bnb_4bit_quant_storage': 'uint8', 'load_in_4bit': True, 'load_in_8bit': False}
```
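For reference, the same settings can be expressed through transformers' `BitsAndBytesConfig` when quantizing the base model. This is a sketch of an equivalent load, not the exact script used to produce this checkpoint:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Mirrors the dictionary above: 4-bit FP4 weights, bfloat16 compute,
# no double quantization.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="fp4",
    bnb_4bit_use_double_quant=False,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-Nemo-Instruct-2407",  # base model, per the Overview
    quantization_config=bnb_config,
    device_map="auto",
)
```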
## Intended Use
- Research and experimentation.
- Instruction-following tasks in resource-constrained environments.
- Demonstrations of quantized model capabilities.
## Limitations
- May reproduce biases from the original model.
- Quantization may reduce generation diversity and factual accuracy.
- Not intended for production without additional evaluation.
## Usage
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# The saved BitsAndBytes config is applied automatically when loading.
tokenizer = AutoTokenizer.from_pretrained("PJEDeveloper/Mistral-Nemo-Instruct-2407-4bit-20250714_213418")
model = AutoModelForCausalLM.from_pretrained("PJEDeveloper/Mistral-Nemo-Instruct-2407-4bit-20250714_213418", device_map="auto")

prompt = "Explain the concept of reinforcement learning."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
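Since this is an instruction-tuned checkpoint, the tokenizer's chat template generally produces better-formatted responses than a raw prompt; a minimal sketch reusing the `tokenizer` and `model` loaded above:

```python
# Wrap the prompt in the model's chat template before generating.
messages = [{"role": "user", "content": "Explain the concept of reinforcement learning."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```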
## Prompt Used for Evaluation
> Summarize the Apollo 11 mission in five sentences. Use clear, factual language, name the astronauts, state the date of the Moon landing, and describe the mission sequence from launch to return.
## Reference Output (Full-weight Model)
> On July 16, 1969, NASA launched Apollo 11 from Kennedy Space Center, carrying astronauts Neil Armstrong, Buzz Aldrin, and Michael Collins. After a successful lunar orbit insertion, Armstrong and Aldrin descended to the moon's surface on July 20 in the lunar module, named 'Eagle'. Armstrong became the first human to step onto the moon at 02:56 UTC, followed by Aldrin approximately 20 minutes later. After collecting samples and planting the American flag, they spent about two and a half hours exploring the moon's surface before rendezvousing with Collins in lunar orbit and returning to Earth on July 24.
## Quantized Model Output
> On July 16, 1969, NASA launched the Apollo 11 mission from Kennedy Space Center in Florida, carrying astronauts Neil Armstrong, Buzz Aldrin, and Michael Collins. After a successful lunar orbit insertion, Armstrong and Aldrin descended to the lunar surface in the Lunar Module (LM) named "Eagle" on July 20, while Collins remained in lunar orbit aboard the Command Module (CM) "Columbia". Armstrong became the first person to step on the Moon at 02:56 UTC, followed by Aldrin about 20 minutes later. After spending approximately two and a half hours outside the spacecraft, collecting lunar samples, and deploying experiments, Armstrong and Aldrin lifted off from the lunar surface and rejoined Collins in lunar orbit. The three astronauts then returned to Earth and splashed down in the Pacific Ocean on July 24, successfully completing the first manned mission to the Moon.
## Evaluation Metrics
| Metric | Value |
|---|---|
| ROUGE-L F1 | 0.6307 |
| BLEU | 0.3282 |
| Cosine Similarity | 0.959 |
| BERTScore F1 | 0.7098 |
- Higher values on all four metrics indicate closer alignment with the full-weight model's output.
**Interpretation:** The quantized model's output remains substantially similar to that of the full-weight model.
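Scores like these can be reproduced with the `evaluate` library; a sketch assuming the two outputs above are stored as plain strings (the exact evaluation script is not published with this card):

```python
import evaluate

# The two outputs shown above (truncated here for brevity).
reference = "On July 16, 1969, NASA launched Apollo 11 from Kennedy Space Center, ..."
candidate = "On July 16, 1969, NASA launched the Apollo 11 mission from Kennedy Space Center in Florida, ..."

rouge = evaluate.load("rouge")
bleu = evaluate.load("bleu")
bertscore = evaluate.load("bertscore")

print(rouge.compute(predictions=[candidate], references=[reference])["rougeL"])
print(bleu.compute(predictions=[candidate], references=[[reference]])["bleu"])
print(bertscore.compute(predictions=[candidate], references=[reference], lang="en")["f1"][0])
```

Cosine similarity is typically computed over sentence embeddings (for example with sentence-transformers) and is not included in this sketch.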
## Generation Settings

This model produces its best results with the following generation settings:
- temperature: 0.3
- top_p: 0.9
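These settings map directly onto `model.generate`; a sketch reusing the `inputs` from the Usage section (note that `do_sample=True` is required for `temperature` and `top_p` to take effect):

```python
outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,    # enables sampling so temperature/top_p apply
    temperature=0.3,
    top_p=0.9,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```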
## Model Files Metadata
| Filename | Size (bytes) | SHA-256 |
|---|---|---|
| quant_config.txt | 446 | f7a08f6dc4b46a4803dce152c536ceed2ee802755840db11231fb5a895b2e022 |
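To verify the checksum locally, the file can be downloaded and hashed; a minimal sketch using `huggingface_hub` and the standard library:

```python
import hashlib
from huggingface_hub import hf_hub_download

# Download quant_config.txt from the model repo and hash its contents.
path = hf_hub_download(
    repo_id="PJEDeveloper/Mistral-Nemo-Instruct-2407-4bit-20250714_213418",
    filename="quant_config.txt",
)
with open(path, "rb") as f:
    digest = hashlib.sha256(f.read()).hexdigest()

print(digest)  # should match the SHA-256 listed above
```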
## Notes

- Produced on 2025-07-14T21:39:55.939163.
- Quantized automatically using BitsAndBytes.
- Intended primarily for research and experimentation.
## Citation

Please cite the original model: [mistralai/Mistral-Nemo-Instruct-2407](https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407)
## License
This model is distributed under the Apache 2.0 license, consistent with the original Mistral-Nemo-Instruct-2407.
## Model Card Authors
This quantized model was prepared by PJEDeveloper.