---
language: multilingual
license: mit
tags:
- onnx
- optimum
- quantized
- int8
- text-embedding
- onnxruntime
- opset14
- text-classification
- gpu
- optimized
datasets:
- mmarco
pipeline_tag: sentence-similarity
---

# gte-multilingual-reranker-base-onnx-op14-opt-gpu-int8-quantized

This model is an INT8-quantized ONNX version of [Alibaba-NLP/gte-multilingual-reranker-base](https://huggingface.co/Alibaba-NLP/gte-multilingual-reranker-base), exported with ONNX opset 14.

## Model Details

- **Quantization Type**: INT8
- **ONNX Opset**: 14
- **Task**: text-classification
- **Target Device**: GPU
- **Optimized**: Yes
- **Framework**: ONNX Runtime
- **Original Model**: [Alibaba-NLP/gte-multilingual-reranker-base](https://huggingface.co/Alibaba-NLP/gte-multilingual-reranker-base)
- **Quantized On**: 2025-03-27
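
As a quick sanity check, you can confirm the opset of the exported graph with the `onnx` package. A minimal sketch, assuming Optimum's default output file name `model_quantized.onnx`:

```python
import onnx

# Load the quantized graph (file name assumed to be Optimum's default)
onnx_model = onnx.load("quantized_model/model_quantized.onnx")

# The default-domain opset entry should report version 14
print([(entry.domain, entry.version) for entry in onnx_model.opset_import])
```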
## Environment and Package Versions

| Package | Version |
| --- | --- |
| transformers | 4.48.3 |
| optimum | 1.24.0 |
| onnx | 1.17.0 |
| onnxruntime | 1.21.0 |
| torch | 2.5.1 |
| numpy | 1.26.4 |
| huggingface_hub | 0.28.1 |
| python | 3.12.9 |
| system | Darwin 24.3.0 |

### Applied Optimizations

| Optimization | Setting |
| --- | --- |
| Graph Optimization Level | Extended |
| Optimize for GPU | Yes |
| Use FP16 | No |
| Transformers-Specific Optimizations Enabled | Yes |
| Gelu Fusion Enabled | Yes |
| Layer Norm Fusion Enabled | Yes |
| Attention Fusion Enabled | Yes |
| Skip Layer Norm Fusion Enabled | Yes |
| Gelu Approximation Enabled | Yes |
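
These settings correspond to Optimum's `OptimizationConfig`. A minimal sketch of how an equivalent optimization pass can be run (paths are placeholders; this is illustrative, not the exact script used for this model):

```python
from optimum.onnxruntime import ORTModelForSequenceClassification, ORTOptimizer
from optimum.onnxruntime.configuration import OptimizationConfig

# Mirror the table above: extended graph optimization (level 2), GPU target,
# no FP16, transformers-specific fusions and gelu approximation enabled
optimization_config = OptimizationConfig(
    optimization_level=2,
    optimize_for_gpu=True,
    fp16=False,
    enable_transformers_specific_optimizations=True,
    enable_gelu_approximation=True,
)

model = ORTModelForSequenceClassification.from_pretrained("onnx_model")
optimizer = ORTOptimizer.from_pretrained(model)
optimizer.optimize(save_dir="optimized_model", optimization_config=optimization_config)
```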
## Usage

```python
from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer

# Load the quantized model and tokenizer
model = ORTModelForSequenceClassification.from_pretrained("quantized_model")
tokenizer = AutoTokenizer.from_pretrained("quantized_model")

# A reranker scores query-passage pairs, so encode them together
query = "Your query here"
passage = "Your candidate passage here"
inputs = tokenizer(query, passage, return_tensors="pt")

# Run inference; the logit is the relevance score
outputs = model(**inputs)
score = outputs.logits.squeeze().item()
```
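
Because the graph is optimized for GPU, you will typically want ONNX Runtime's CUDA execution provider. A minimal sketch, assuming the `onnxruntime-gpu` package and a CUDA device are available:

```python
from optimum.onnxruntime import ORTModelForSequenceClassification

# Select the CUDA execution provider (requires onnxruntime-gpu)
model = ORTModelForSequenceClassification.from_pretrained(
    "quantized_model",
    provider="CUDAExecutionProvider",
)
```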
## Quantization Process

This model was quantized to INT8 with ONNX Runtime's quantization tooling, driven through Hugging Face's Optimum library. The original model was first exported to ONNX at opset 14, and graph optimization targeting GPU devices was applied during export.
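
The exact script is not reproduced here, but a minimal sketch of this export-then-quantize flow with Optimum looks roughly as follows (the paths and the quantization preset are illustrative assumptions, not recorded settings):

```python
from optimum.exporters.onnx import main_export
from optimum.onnxruntime import ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig

# 1. Export the original model to ONNX at opset 14
#    (the gte models ship custom modeling code, hence trust_remote_code=True)
main_export(
    "Alibaba-NLP/gte-multilingual-reranker-base",
    output="onnx_model",
    task="text-classification",
    opset=14,
    trust_remote_code=True,
)

# 2. Apply dynamic INT8 quantization; the preset below is an assumption --
#    the exact configuration used for this model is not recorded here
qconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)
quantizer = ORTQuantizer.from_pretrained("onnx_model")
quantizer.quantize(save_dir="quantized_model", quantization_config=qconfig)
```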
## Performance Comparison

Quantized models generally trade a small amount of accuracy for faster inference, and INT8 weights take roughly a quarter of the storage of FP32.
This INT8 quantized model should therefore provide significantly faster inference than the original model, but benchmark on your own hardware and inputs to confirm.
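
To check the claim yourself, a rough micro-benchmark along these lines (batch size and texts are arbitrary) can be compared against the original FP32 model:

```python
import time

from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer

model = ORTModelForSequenceClassification.from_pretrained("quantized_model")
tokenizer = AutoTokenizer.from_pretrained("quantized_model")

# A small synthetic batch of query-passage pairs
pairs = [("what is onnx?", "ONNX is an open format for machine learning models.")] * 32
inputs = tokenizer([q for q, _ in pairs], [p for _, p in pairs],
                   padding=True, truncation=True, return_tensors="pt")

# Warm up once so session setup is not measured
model(**inputs)

start = time.perf_counter()
for _ in range(10):
    model(**inputs)
print(f"avg batch latency: {(time.perf_counter() - start) / 10 * 1000:.1f} ms")
```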