---
language:
- en
license: other
library_name: transformers
tags:
- chat
- qwen
- qwen2.5
- finetune
- english
base_model:
- MaziyarPanahi/calme-3.2-instruct-78b
model_name: calme-3.2-instruct-78b
license_name: qwen
license_link: https://huggingface.co/Qwen/Qwen2.5-72B-Instruct/blob/main/LICENSE
pipeline_tag: text-generation
inference: false
model_creator: MaziyarPanahi
quantized_by: DavidCatalano
model-index:
- name: calme-3.2-instruct-78b
  results:
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: IFEval (0-Shot)
      type: HuggingFaceH4/ifeval
      args:
        num_few_shot: 0
    metrics:
    - type: inst_level_strict_acc and prompt_level_strict_acc
      value: 80.63
      name: strict accuracy
    source:
      url: >-
        https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=MaziyarPanahi/calme-3.2-instruct-78b
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: BBH (3-Shot)
      type: BBH
      args:
        num_few_shot: 3
    metrics:
    - type: acc_norm
      value: 62.61
      name: normalized accuracy
    source:
      url: >-
        https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=MaziyarPanahi/calme-3.2-instruct-78b
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: MATH Lvl 5 (4-Shot)
      type: hendrycks/competition_math
      args:
        num_few_shot: 4
    metrics:
    - type: exact_match
      value: 39.95
      name: exact match
    source:
      url: >-
        https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=MaziyarPanahi/calme-3.2-instruct-78b
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: GPQA (0-shot)
      type: Idavidrein/gpqa
      args:
        num_few_shot: 0
    metrics:
    - type: acc_norm
      value: 20.36
      name: acc_norm
    source:
      url: >-
        https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=MaziyarPanahi/calme-3.2-instruct-78b
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: MuSR (0-shot)
      type: TAUR-Lab/MuSR
      args:
        num_few_shot: 0
    metrics:
    - type: acc_norm
      value: 38.53
      name: acc_norm
    source:
      url: >-
        https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=MaziyarPanahi/calme-3.2-instruct-78b
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: MMLU-PRO (5-shot)
      type: TIGER-Lab/MMLU-Pro
      config: main
      split: test
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 70.03
      name: accuracy
    source:
      url: >-
        https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=MaziyarPanahi/calme-3.2-instruct-78b
      name: Open LLM Leaderboard
---

# EXL2 4.5bpw Quantization of calme-3.2-instruct-78b

This repository hosts the **4.5 bits per weight (bpw)** quantization of [calme-3.2-instruct-78b](https://huggingface.co/MaziyarPanahi/calme-3.2-instruct-78b), a Qwen 2.5 finetune, in the **ExLlamaV2** format for efficient, high-context inference.

## Quantization Details

- **Format:** ExLlamaV2 4.5bpw
- **Version:** ExLlamaV2 0.2.6
- **Model Size:** 78 billion parameters
- **VRAM Usage:** approx. **44GB** (32,000 context)
- **Calibration:**
  - Rows: 115
  - Length: 2048
  - Dataset: (default)

Quantization reduces memory usage and inference latency while maintaining strong performance on generative text tasks.
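For reference, a quant with these settings could be produced with ExLlamaV2's `convert.py` script. The following is a minimal sketch assuming the flag names from ExLlamaV2 0.2.6 and placeholder paths; it is not necessarily the exact command used for this repository:

```bash
# Sketch: convert the FP16 model to EXL2 at 4.5 bpw (paths are placeholders).
# -i: input FP16 model, -o: working directory, -cf: compiled output directory,
# -b: target bits per weight, -r: calibration rows, -l: calibration row length
python convert.py \
    -i ./calme-3.2-instruct-78b \
    -o ./exl2-work \
    -cf ./calme-3.2-instruct-78b-exl2-4.5bpw \
    -b 4.5 \
    -r 115 \
    -l 2048
```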
## Prompt Template

This model uses the ChatML prompt template for interaction:

```
<|im_start|>system
{System}
<|im_end|>
<|im_start|>user
{User}
<|im_end|>
<|im_start|>assistant
{Assistant}
```

## Model Usage

### Example: Inference with ExLlamaV2

To use this quantized model, ensure you have the **ExLlamaV2** library installed:

```bash
pip install exllamav2
```

Then load the model from a local copy of this repository (see the download instructions below) and generate with a ChatML-formatted prompt:

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Cache, ExLlamaV2Config, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2DynamicGenerator

# Load model and tokenizer from a local directory
model_dir = "./calme-3.2-instruct-78b-exl2-4.5bpw"
config = ExLlamaV2Config(model_dir)
model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, max_seq_len=32768, lazy=True)  # ~44GB VRAM at 32K context
model.load_autosplit(cache, progress=True)
tokenizer = ExLlamaV2Tokenizer(config)

# Create generator
generator = ExLlamaV2DynamicGenerator(model=model, cache=cache, tokenizer=tokenizer)

# Generate text from a ChatML prompt
prompt = (
    "<|im_start|>system\nYou are a helpful assistant.\n<|im_end|>\n"
    "<|im_start|>user\nWhat is EXL2 quantization?\n<|im_end|>\n"
    "<|im_start|>assistant\n"
)
response = generator.generate(prompt=prompt, max_new_tokens=256, add_bos=True)
print(response)
```

## Features

- The EXL2 format requires NVIDIA hardware, but runs faster and uses less memory than GGUF.
- Runs a **32,000-token** context window in **44GB** of VRAM.
- Needs a minimum of **40GB** of VRAM at a **1,024-token** context window.
- Highly optimized for inference, making it well suited to resource-constrained environments.
- Compatible with ChatML-based prompting systems.

## Acknowledgments

- **Original Model Creator:** [MaziyarPanahi](https://huggingface.co/MaziyarPanahi)
- **Quantization by:** [DavidCatalano](https://huggingface.co/DavidCatalano)
- **Quantization Tool:** ExLlamaV2 0.2.6

## Download Instructions

To download the model files:

```bash
pip install huggingface_hub
huggingface-cli login
huggingface-cli download DavidCatalano/calme-3.2-instruct-78b-exl2-4.5bpw --include "*" --local-dir ./local-folder
```
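Alternatively, the same files can be fetched from Python with `huggingface_hub.snapshot_download` (the target folder here is just an example path):

```python
from huggingface_hub import snapshot_download

# Download every file in the repository to a local folder (example path)
snapshot_download(
    repo_id="DavidCatalano/calme-3.2-instruct-78b-exl2-4.5bpw",
    local_dir="./local-folder",
)
```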