Phi-4-mini-instruct-FP8-dynamic
Model Overview
- Model Architecture: Phi3ForCausalLM
- Input: Text
- Output: Text
- Model Optimizations:
  - Activation quantization: FP8
  - Weight quantization: FP8
- Intended Use Cases: The model is intended for broad multilingual commercial and research use. It targets general-purpose AI systems and applications that require:
  - Memory- or compute-constrained environments.
  - Latency-bound scenarios.
  - Math reasoning and logic.
- Release Date: 03/03/2025
- Version: 1.0
- Model Developers: Red Hat
Model Optimizations
This model was obtained by quantizing the activations and weights of Phi-4-mini-instruct to the FP8 data type. This optimization reduces the number of bits used to represent weights and activations from 16 to 8, which lowers GPU memory requirements by approximately 50% and increases matrix-multiply compute throughput by approximately 2x. Weight quantization also reduces disk size requirements by approximately 50%.
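The "approximately 50%" figure follows from simple bit-width arithmetic. The sketch below is a rough estimate only: the parameter count is an illustrative placeholder rather than a number taken from this card, `quantized_fraction` is a hypothetical knob for layers that stay in 16-bit, and quantization scales, the KV cache, and runtime activations are ignored.

```python
def weight_gib(num_params, quantized_fraction=1.0, high_bits=16, low_bits=8):
    """Estimate weight storage in GiB when a fraction of parameters is stored in low_bits."""
    bits = num_params * (quantized_fraction * low_bits + (1.0 - quantized_fraction) * high_bits)
    return bits / 8 / 2**30

num_params = 4_000_000_000  # illustrative placeholder, not a figure from this card

baseline = weight_gib(num_params, quantized_fraction=0.0)  # everything in 16-bit
all_fp8 = weight_gib(num_params, quantized_fraction=1.0)   # everything in FP8
partial = weight_gib(num_params, quantized_fraction=0.8)   # hypothetical: 20% left in 16-bit

for label, size in [("16-bit", baseline), ("FP8", all_fp8), ("80% FP8", partial)]:
    print(f"{label:>8}: {size:6.2f} GiB ({1 - size / baseline:.0%} smaller)")
```

The saving reaches 50% only when every parameter is quantized; any layers kept in 16-bit reduce it proportionally.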
Only the weights and activations of the linear operators within transformer blocks are quantized. Weights are quantized with a symmetric static per-channel scheme, whereas activations are quantized with a symmetric dynamic per-token scheme. The llm-compressor library is used for quantization.
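To make the two schemes concrete, here is a minimal pure-Python simulation. It is a sketch, not the llm-compressor or vLLM implementation: it assumes the FP8 E4M3 format (maximum finite value 448, 3 mantissa bits), and the helper names `round_to_fp8_e4m3`, `quantize_rows`, and `fp8_linear` are hypothetical. Weight scales are computed once per output channel (static), while activation scales are recomputed per token from each incoming input (dynamic).

```python
import math
import random

FP8_E4M3_MAX = 448.0  # largest finite FP8 E4M3 value (assumed format)

def round_to_fp8_e4m3(v):
    """Round a float to the nearest FP8 E4M3 grid point (simulated in Python floats)."""
    v = max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, v))
    if v == 0.0:
        return 0.0
    _, e = math.frexp(abs(v))        # abs(v) = m * 2**e with m in [0.5, 1)
    exp = max(e - 1, -6)             # -6 is the smallest normal exponent
    step = math.ldexp(1.0, exp - 3)  # 3 mantissa bits: 8 steps per power of two
    return round(v / step) * step

def quantize_rows(matrix):
    """Symmetric quantization with one scale per row; returns (quantized rows, scales)."""
    q_rows, scales = [], []
    for row in matrix:
        max_abs = max(abs(v) for v in row)
        scale = max_abs / FP8_E4M3_MAX if max_abs > 0 else 1.0  # guard all-zero rows
        q_rows.append([round_to_fp8_e4m3(v / scale) for v in row])
        scales.append(scale)
    return q_rows, scales

def fp8_linear(x, w_q, w_scales):
    """Compute x @ W^T, quantizing activations per token at call time."""
    x_q, x_scales = quantize_rows(x)  # dynamic: scales come from the live input
    return [
        [sum(a * b for a, b in zip(x_row, w_row)) * x_s * w_s
         for w_row, w_s in zip(w_q, w_scales)]
        for x_row, x_s in zip(x_q, x_scales)
    ]

random.seed(0)
w = [[random.gauss(0, 1) for _ in range(64)] for _ in range(8)]  # (out_features, in_features)
x = [[random.gauss(0, 1) for _ in range(64)] for _ in range(4)]  # (tokens, in_features)

w_q, w_scales = quantize_rows(w)  # static: computed once, offline, per output channel
y = fp8_linear(x, w_q, w_scales)
```

Because each weight row and each token gets its own scale, one outlier channel or token does not reduce the precision available to the others.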
Deployment
This model can be deployed efficiently using the vLLM backend, as shown in the example below.
vllm serve RedHatAI/Phi-4-mini-instruct-FP8-dynamic --max_model_len 131072
from openai import OpenAI
# Set OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"
client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

generated_text = client.chat.completions.create(
    model="RedHatAI/Phi-4-mini-instruct-FP8-dynamic",
    messages=[
        {"role": "user", "content": "Give me a short introduction to large language model."},
    ],
)
print(generated_text.choices[0].message.content)
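The client code above assumes the server has already finished loading the model. A small readiness check can help when scripting the two steps together. The sketch below polls the OpenAI-compatible `/v1/models` route using only the standard library; the helper name `wait_for_server` and its default timeouts are hypothetical choices, not part of vLLM.

```python
import time
import urllib.error
import urllib.request

def wait_for_server(base_url="http://localhost:8000/v1", timeout_s=300.0, poll_s=2.0):
    """Poll GET {base_url}/models until it returns HTTP 200 or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/models", timeout=poll_s) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server is not accepting connections yet
        time.sleep(poll_s)
    return False

# Example usage (left as a comment so importing this file does not block):
# if not wait_for_server():
#     raise RuntimeError("vLLM server did not become ready in time")
```

Calling `wait_for_server()` before creating the client avoids connection errors while the weights are still loading.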
Creation
Creation details
This model was created with [llm-compressor](https://github.com/vllm-project/llm-compressor) by running the code snippet below.

from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor import oneshot
# Load model
model_stub = "microsoft/Phi-4-mini-instruct"
model_name = model_stub.split("/")[-1]
tokenizer = AutoTokenizer.from_pretrained(model_stub)
model = AutoModelForCausalLM.from_pretrained(
    model_stub,
    device_map="auto",
    torch_dtype="auto",
)

# Configure the quantization algorithm and scheme
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_dynamic",
    ignore=["lm_head"],
)

# Apply quantization
oneshot(
    model=model,
    recipe=recipe,
)
# Save to disk in compressed-tensors format
save_path = model_name + "-FP8-dynamic"
model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)
print(f"Model and tokenizer saved to: {save_path}")
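After saving, it can be useful to confirm that the checkpoint records the intended scheme. The sketch below reads `config.json` from the save directory and summarizes its `quantization_config` entry. The key names used here (`config_groups`, `weights`, `input_activations`, `strategy`, `dynamic`, `ignore`) are an assumption about the compressed-tensors layout and may differ between library versions; the helper `summarize_quantization` is hypothetical, and missing keys simply show up as `None`.

```python
import json
from pathlib import Path

def summarize_quantization(config):
    """Extract a compact summary of the quantization settings from a model config dict."""
    qcfg = config.get("quantization_config") or {}
    summary = {
        "quant_method": qcfg.get("quant_method"),
        "ignore": qcfg.get("ignore", []),
        "groups": {},
    }
    for name, group in (qcfg.get("config_groups") or {}).items():
        weights = group.get("weights") or {}
        acts = group.get("input_activations") or {}
        summary["groups"][name] = {
            "targets": group.get("targets", []),
            "weights": {k: weights.get(k) for k in ("type", "num_bits", "strategy", "dynamic")},
            "activations": {k: acts.get(k) for k in ("type", "num_bits", "strategy", "dynamic")},
        }
    return summary

# Directory name produced by the snippet above.
config_path = Path("Phi-4-mini-instruct-FP8-dynamic") / "config.json"
if config_path.exists():
    print(json.dumps(summarize_quantization(json.loads(config_path.read_text())), indent=2))
```

For this recipe one would expect 8-bit float weights with a per-channel strategy, 8-bit float activations with a dynamic per-token strategy, and `lm_head` in the ignore list.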
Evaluation
The model was evaluated on the Math 500 benchmark using lighteval, and on GSM8k-Platinum, MMLU CoT, MMLU-Pro, and IFEval using lm-evaluation-harness. In both cases, vLLM was used as the backend.
Evaluation commands
Start vLLM server
vllm serve RedHatAI/Phi-4-mini-instruct-FP8-dynamic --max_model_len 131072
lm-evaluation-harness
lm_eval --model local-chat-completions \
--tasks gsm8k_platinum_cot_llama \
--model_args "model=RedHatAI/Phi-4-mini-instruct-FP8-dynamic,max_length=131072,base_url=http://0.0.0.0:8000/v1/chat/completions,num_concurrent=128,max_retries=3,tokenized_requests=False,timeout=600,tokenizer_backend=None" \
--apply_chat_template \
--num_fewshot 5 \
--fewshot_as_multiturn \
--output_path gsm8k_platinum_phi4_mini_instruct_fp8_dynamic \
--gen_kwargs "do_sample=False,temperature=0.0,max_gen_toks=16000"
lm_eval --model local-chat-completions \
--tasks mmlu_cot_llama \
--model_args "model=RedHatAI/Phi-4-mini-instruct-FP8-dynamic,max_length=131072,base_url=http://0.0.0.0:8000/v1/chat/completions,num_concurrent=128,max_retries=3,tokenized_requests=False,timeout=600,tokenizer_backend=None" \
--apply_chat_template \
--output_path mmlu_cot_phi4_mini_instruct_fp8_dynamic \
--gen_kwargs "do_sample=False,temperature=0.0,max_gen_toks=16000"
lm_eval --model local-chat-completions \
--tasks mmlu_pro_chat \
--model_args "model=RedHatAI/Phi-4-mini-instruct-FP8-dynamic,max_length=131072,base_url=http://0.0.0.0:8000/v1/chat/completions,num_concurrent=128,max_retries=3,tokenized_requests=False,timeout=600,tokenizer_backend=None" \
--apply_chat_template \
--num_fewshot 5 \
--fewshot_as_multiturn \
--output_path mmlu_pro_phi4_mini_instruct_fp8_dynamic \
--gen_kwargs "do_sample=False,temperature=0.0,max_gen_toks=16000"
lm_eval --model local-chat-completions \
--tasks ifeval \
--model_args "model=RedHatAI/Phi-4-mini-instruct-FP8-dynamic,max_length=131072,base_url=http://0.0.0.0:8000/v1/chat/completions,num_concurrent=128,max_retries=3,tokenized_requests=False,timeout=600,tokenizer_backend=None" \
--apply_chat_template \
--output_path ifeval_phi4_mini_instruct_fp8_dynamic \
--gen_kwargs "do_sample=False,temperature=0.0,max_gen_toks=16000"
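The four lm_eval invocations above differ only in the task name, the output path, and whether 5-shot multi-turn prompting is enabled. As a convenience, they could be generated from one table, as in the sketch below. The helper `build_lm_eval_command` and the `RUNS` table are hypothetical; every flag and value is copied from the commands above.

```python
import shlex

MODEL = "RedHatAI/Phi-4-mini-instruct-FP8-dynamic"
MODEL_ARGS = ",".join([
    f"model={MODEL}",
    "max_length=131072",
    "base_url=http://0.0.0.0:8000/v1/chat/completions",
    "num_concurrent=128",
    "max_retries=3",
    "tokenized_requests=False",
    "timeout=600",
    "tokenizer_backend=None",
])
GEN_KWARGS = "do_sample=False,temperature=0.0,max_gen_toks=16000"

# (task, output path, use 5-shot multi-turn prompting), mirroring the commands above
RUNS = [
    ("gsm8k_platinum_cot_llama", "gsm8k_platinum_phi4_mini_instruct_fp8_dynamic", True),
    ("mmlu_cot_llama", "mmlu_cot_phi4_mini_instruct_fp8_dynamic", False),
    ("mmlu_pro_chat", "mmlu_pro_phi4_mini_instruct_fp8_dynamic", True),
    ("ifeval", "ifeval_phi4_mini_instruct_fp8_dynamic", False),
]

def build_lm_eval_command(task, output_path, five_shot):
    """Return the lm_eval argument list for one benchmark run."""
    cmd = ["lm_eval", "--model", "local-chat-completions", "--tasks", task,
           "--model_args", MODEL_ARGS, "--apply_chat_template"]
    if five_shot:
        cmd += ["--num_fewshot", "5", "--fewshot_as_multiturn"]
    cmd += ["--output_path", output_path, "--gen_kwargs", GEN_KWARGS]
    return cmd

for run in RUNS:
    print(shlex.join(build_lm_eval_command(*run)))
```

Each argument list can be passed to `subprocess.run(cmd, check=True)` once the vLLM server is running.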
lighteval
litellm_config.yaml
model_parameters:
  provider: "hosted_vllm"
  model_name: "hosted_vllm/RedHatAI/Phi-4-mini-instruct-FP8-dynamic"
  base_url: "http://0.0.0.0:8000/v1"
  api_key: ""
  timeout: 600
  concurrent_requests: 128
  generation_parameters:
    temperature: 0.0
    max_new_tokens: 16000
lighteval endpoint litellm litellm_config.yaml \
"math_500|0" \
--output-dir phi4_mini_instruct_fp8_dynamic \
--save-details
Accuracy
| Benchmark | Phi-4-mini-instruct | Phi-4-mini-instruct-FP8-dynamic (this model) | Recovery |
| --- | --- | --- | --- |
| Math 500 | 57.60 | 58.20 | 101.0% |
| GSM8k-Platinum | 84.12 | 84.70 | 100.7% |
| MMLU CoT | 67.01 | 66.97 | 99.9% |
| MMLU-Pro | 46.75 | 45.60 | 97.5% |
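The card does not define the Recovery column. The GSM8k-Platinum, MMLU CoT, and MMLU-Pro rows are consistent with reading it as the quantized score expressed as a percentage of the baseline score, so the sketch below assumes that definition. The scores are copied from the table; the helper name `recovery` is hypothetical.

```python
def recovery(baseline, quantized):
    """Quantized score as a percentage of the baseline score (assumed definition)."""
    return 100.0 * quantized / baseline

# (baseline, quantized) scores copied from the table above
scores = {
    "Math 500": (57.60, 58.20),
    "GSM8k-Platinum": (84.12, 84.70),
    "MMLU CoT": (67.01, 66.97),
    "MMLU-Pro": (46.75, 45.60),
}

for name, (baseline, quantized) in scores.items():
    print(f"{name}: {recovery(baseline, quantized):.1f}%")
```

Values above 100% mean the quantized model scored slightly higher than the baseline on that benchmark.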