---
language: multilingual
license: mit
tags:
- onnx
- optimum
- quantized
- int8
- text-embedding
- onnxruntime
- opset14
- text-classification
- gpu
- optimized
datasets:
- mmarco
pipeline_tag: sentence-similarity
---
# gte-multilingual-reranker-base-onnx-op14-opt-gpu-int8-quantized
This is an INT8-quantized ONNX export of [Alibaba-NLP/gte-multilingual-reranker-base](https://huggingface.co/Alibaba-NLP/gte-multilingual-reranker-base), produced with ONNX opset 14.
## Model Details
- **Quantization Type**: INT8
- **ONNX Opset**: 14
- **Task**: text-classification
- **Target Device**: GPU
- **Optimized**: Yes
- **Framework**: ONNX Runtime
- **Original Model**: [Alibaba-NLP/gte-multilingual-reranker-base](https://huggingface.co/Alibaba-NLP/gte-multilingual-reranker-base)
- **Quantized On**: 2025-03-27
## Environment and Package Versions
| Package | Version |
| --- | --- |
| transformers | 4.48.3 |
| optimum | 1.24.0 |
| onnx | 1.17.0 |
| onnxruntime | 1.21.0 |
| torch | 2.5.1 |
| numpy | 1.26.4 |
| huggingface_hub | 0.28.1 |
| python | 3.12.9 |
| system | Darwin 24.3.0 |
### Applied Optimizations
| Optimization | Setting |
| --- | --- |
| Graph Optimization Level | Extended |
| Optimize for GPU | Yes |
| Use FP16 | No |
| Transformers Specific Optimizations Enabled | Yes |
| Gelu Fusion Enabled | Yes |
| Layer Norm Fusion Enabled | Yes |
| Attention Fusion Enabled | Yes |
| Skip Layer Norm Fusion Enabled | Yes |
| Gelu Approximation Enabled | Yes |
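These settings map onto Optimum's `OptimizationConfig`. The exact export script for this model is not included here, but a minimal sketch of an equivalent optimization pass might look like this (the `optimized_model` save directory is a placeholder):

```python
from optimum.onnxruntime import ORTModelForSequenceClassification, ORTOptimizer
from optimum.onnxruntime.configuration import OptimizationConfig

# Export the original model to ONNX (the base model uses custom code,
# hence trust_remote_code=True)
model = ORTModelForSequenceClassification.from_pretrained(
    "Alibaba-NLP/gte-multilingual-reranker-base",
    export=True,
    trust_remote_code=True,
)

optimizer = ORTOptimizer.from_pretrained(model)
optimization_config = OptimizationConfig(
    optimization_level=2,       # "extended" graph optimization level
    optimize_for_gpu=True,      # enable GPU-targeted fusions
    fp16=False,                 # keep FP32 weights ahead of INT8 quantization
    enable_transformers_specific_optimizations=True,  # gelu/layer-norm/attention fusions
    enable_gelu_approximation=True,
)
optimizer.optimize(save_dir="optimized_model", optimization_config=optimization_config)
```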
## Usage
```python
from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer

# Load model and tokenizer (replace "quantized_model" with the local
# path or Hub ID of this repository)
model = ORTModelForSequenceClassification.from_pretrained("quantized_model")
tokenizer = AutoTokenizer.from_pretrained("quantized_model")

# A reranker is a cross-encoder: it scores query-document pairs
pairs = [["What is the capital of France?", "Paris is the capital of France."]]
inputs = tokenizer(pairs, padding=True, truncation=True, return_tensors="pt")

# Run inference; each logit is a relevance score for one pair
outputs = model(**inputs)
scores = outputs.logits.view(-1)
```
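For actual reranking, score one query against several candidate documents and sort by the resulting logits. A short sketch, reusing the `model` and `tokenizer` loaded above (the texts are placeholders):

```python
query = "What is the capital of France?"
docs = [
    "Paris is the capital and largest city of France.",
    "Berlin is the capital of Germany.",
    "The Eiffel Tower is located in Paris.",
]

# Build one (query, document) pair per candidate
inputs = tokenizer([[query, d] for d in docs], padding=True,
                   truncation=True, return_tensors="pt")
scores = model(**inputs).logits.view(-1)

# Higher score = more relevant document
ranked = sorted(zip(docs, scores.tolist()), key=lambda x: x[1], reverse=True)
for doc, score in ranked:
    print(f"{score:+.3f}  {doc}")
```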
## Quantization Process
This model was quantized to INT8 with ONNX Runtime, using the Hugging Face Optimum library and opset 14. Graph optimization targeting GPU devices was applied during export.
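For reference, dynamic INT8 quantization with Optimum follows the general pattern below. This is a sketch, not the exact script: in particular, the `avx512_vnni` preset and the `optimized_model`/`quantized_model` paths are illustrative assumptions.

```python
from optimum.onnxruntime import ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig

# Load the graph-optimized ONNX model produced in the previous step
quantizer = ORTQuantizer.from_pretrained("optimized_model")

# Dynamic INT8 quantization: weights are quantized ahead of time,
# activations at runtime, so no calibration dataset is needed
qconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)

quantizer.quantize(save_dir="quantized_model", quantization_config=qconfig)
```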
## Performance Comparison
INT8 quantization generally improves inference speed and shrinks model size, at the cost of a small drop in accuracy. This model should therefore run noticeably faster than the original FP32 model, though the exact speedup depends on your hardware, batch size, and sequence lengths.
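Since the speedup is workload-dependent, it is worth measuring on your own data. A minimal timing sketch, reusing the `model` and `tokenizer` from the usage example (batch size and run count are arbitrary):

```python
import time

def benchmark(model, tokenizer, n_runs=50):
    # Small synthetic batch of query-document pairs
    pairs = [["example query", "example document text"]] * 8
    inputs = tokenizer(pairs, padding=True, truncation=True, return_tensors="pt")
    model(**inputs)  # warmup run
    start = time.perf_counter()
    for _ in range(n_runs):
        model(**inputs)
    return (time.perf_counter() - start) / n_runs  # average seconds per batch

print(f"avg latency: {benchmark(model, tokenizer) * 1000:.1f} ms per batch")
```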