---
base_model: unsloth/qwen3-8b-unsloth-bnb-4bit
tags:
- text-generation
- rag
- evaluation
- information-retrieval
- question-answering
- retrieval-augmented-generation
- context-evaluation
- qwen3
- unsloth
- fine-tuned
language:
- en
- multilingual
license: apache-2.0
library_name: transformers
pipeline_tag: text-generation
model_type: qwen3
quantized: q8_0
datasets:
- constehub/rag-evaluation-dataset
metrics:
- completeness
- clarity
- conciseness
- precision
- recall
- mrr
- ndcg
- relevance
widget:
- example_title: "RAG Context Evaluation"
  text: |
    Evaluate the agent's response according to the metrics: completeness, clarity, conciseness, precision, recall, mrr, ndcg, relevance

    Question: What are the main benefits of renewable energy?
    Retrieved contexts: [1] Renewable energy sources like solar and wind power provide clean alternatives to fossil fuels, reducing greenhouse gas emissions and air pollution. [2] These energy sources are sustainable and abundant, helping to ensure long-term energy security.
model-index:
- name: RAG Context Evaluator
  results:
  - task:
      type: text-generation
      name: RAG Evaluation
    metrics:
    - type: evaluation_score
      name: Multi-metric Assessment
      value: 0-5
---

# RAG Context Evaluator - Qwen3-8B Fine-tuned

## Model Details

**License:** apache-2.0

**Finetuned from model:** unsloth/qwen3-8b-unsloth-bnb-4bit

**Model type:** Text Generation (Specialized for RAG Evaluation)

**Quantization:** Q8_0

## Model Description

This model is specifically fine-tuned to evaluate the quality of retrieved contexts in Retrieval-Augmented Generation (RAG) systems. It assesses retrieved passages against user queries using multiple evaluation metrics commonly used in information retrieval and RAG evaluation.

## Intended Uses

### Primary Use Case

- **RAG System Evaluation**: Automatically assess the quality of retrieved contexts for question-answering systems
- **Information Retrieval Quality Control**: Evaluate how well retrieved documents match user queries
- **Academic Research**: Support research in information retrieval and RAG system optimization

### Evaluation Metrics

The model evaluates retrieved contexts using the following metrics:

1. **Completeness** - How thoroughly the retrieved context addresses the query
2. **Clarity** - How clear and understandable the retrieved information is
3. **Conciseness** - How efficiently the information is presented, without redundancy
4. **Precision** - How accurate and relevant the retrieved information is
5. **Recall** - How comprehensively the retrieved information covers the query
6. **MRR (Mean Reciprocal Rank)** - Ranking quality of relevant results (a reference sketch for MRR and NDCG follows this list)
7. **NDCG (Normalized Discounted Cumulative Gain)** - Ranking quality that also accounts for result position
8. **Relevance** - Overall relevance of retrieved contexts to the query
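
The MRR and NDCG scores follow their standard information-retrieval definitions. For reference, a minimal sketch of both over a single ranked list of relevance labels (illustrative helpers, not part of the model or its training code) looks like this:

```python
import math

def mrr(relevances):
    """Reciprocal rank of the first relevant passage (0 if no passage is relevant)."""
    for rank, rel in enumerate(relevances, start=1):
        if rel > 0:
            return 1.0 / rank
    return 0.0

def ndcg(relevances, k=None):
    """Normalized discounted cumulative gain over the top-k retrieved passages."""
    k = k or len(relevances)
    dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))
    ideal = sorted(relevances, reverse=True)
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal[:k]))
    return dcg / idcg if idcg > 0 else 0.0

# Example: the second and third retrieved passages are relevant, the first is not.
print(mrr([0, 1, 1]))   # 0.5
print(ndcg([0, 1, 1]))  # ~0.69
```

The evaluator itself reports each metric as a score with an explanatory comment (see the training instance below) rather than computing these formulas directly; the sketch is only meant to clarify what the metric names refer to.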

## Training Data

[constehub/rag-evaluation-dataset](https://huggingface.co/datasets/constehub/rag-evaluation-dataset)

### Example Training Instance

```json
{
  "instruction": "Evaluate the agent's response according to the metrics: completeness, clarity, conciseness, precision, recall, mrr, ndcg, relevance",
  "input": {
    "question": "Question about retrieved context",
    "retrieved_contexts": "[Multiple numbered passages with source citations]"
  },
  "output": [
    {
      "name": "completeness",
      "value": 1,
      "comment": "Detailed evaluation comment"
    }
    // ... other metrics
  ]
}
```
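
Because the model is trained to emit its assessment as a JSON array of metric objects (as in the output above), the generated text can be mapped back to per-metric scores. A minimal sketch, assuming the completion contains a well-formed JSON array; the `parse_evaluation` helper below is illustrative, not part of the model's tooling:

```python
import json

def parse_evaluation(generated_text: str) -> dict:
    """Extract the first JSON array from the completion and map metric name -> value."""
    start = generated_text.find("[")
    end = generated_text.rfind("]") + 1
    metrics = json.loads(generated_text[start:end])
    return {m["name"]: m["value"] for m in metrics}

example = '[{"name": "completeness", "value": 1, "comment": "Covers the question"}]'
print(parse_evaluation(example))  # {'completeness': 1}
```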

## Performance and Limitations

### Strengths

- Specialized for RAG evaluation
- Multi-dimensional assessment capability
- Detailed explanatory comments for each metric

### Limitations

- **Context Length**: Performance may vary with very long retrieved contexts

## Ethical Considerations

- The model should be used as a tool to assist human evaluators, not replace human judgment entirely
- Evaluations should be validated by domain experts for critical applications

## Technical Specifications

- **Base Model**: Qwen3-8B
- **Quantization**: Q8_0

## Usage Example

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "mendrika261/rag-evaluator-qwen3-8b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",   # keep the checkpoint's native precision
    device_map="auto",    # requires `accelerate`; places weights on the available devices
)

# Example evaluation prompt
prompt = """Evaluate the agent's response according to the metrics: completeness, clarity, conciseness, precision, recall, mrr, ndcg, relevance

Question: [Your question here]
Retrieved contexts: [Your retrieved contexts here]"""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)  # leave room for the full multi-metric evaluation
evaluation = tokenizer.decode(outputs[0], skip_special_tokens=True)
```
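
Qwen3 checkpoints normally ship with a chat template, and instruction-tuned evaluators are often queried through it. This card does not state which prompt format was used during fine-tuning, so treat the following as an assumption to verify against your own outputs; it reuses `model`, `tokenizer`, and `prompt` from the example above:

```python
# Assumes the fine-tune follows Qwen3's chat template (not confirmed by this card).
messages = [{"role": "user", "content": prompt}]
chat_inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

chat_outputs = model.generate(chat_inputs, max_new_tokens=512)
# Decode only the newly generated tokens, dropping the prompt portion.
evaluation = tokenizer.decode(
    chat_outputs[0][chat_inputs.shape[-1]:], skip_special_tokens=True
)
print(evaluation)
```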

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{constehub-rag-evaluator,
  title={RAG Context Evaluator - Qwen3-8B Fine-tuned},
  author={constehub},
  year={2025},
  howpublished={\url{https://huggingface.co/constehub/rag-evaluation}}
}
```

## Contact

For questions or issues regarding this model, please contact the developer through the Hugging Face model repository.

---

This qwen3 model was trained 2x faster with [Unsloth](https://github.com/unslothai/unsloth) and Hugging Face's TRL library.

[<img src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/unsloth%20made%20with%20love.png" width="200"/>](https://github.com/unslothai/unsloth)