RAG Context Evaluator - Qwen3-8B Fine-tuned 🚀

Model Details 📋

  • License: apache-2.0
  • Finetuned from model: unsloth/qwen3-8b-unsloth-bnb-4bit
  • Model type: Text Generation (specialized for RAG evaluation)
  • Quantization: Q8_0

Model Description 🎯

This model is fine-tuned specifically to evaluate the quality of retrieved contexts in Retrieval-Augmented Generation (RAG) systems. It scores retrieved passages against the user query on metrics commonly used in information retrieval and RAG evaluation.

Intended Uses 💡

Primary Use Cases 🎯

  • RAG System Evaluation: Automatically assess the quality of retrieved contexts for question-answering systems
  • Information Retrieval Quality Control: Evaluate how well retrieved documents match user queries
  • Academic Research: Support research in information retrieval and RAG system optimization

Evaluation Metrics 📊

The model evaluates retrieved contexts using the following metrics; a short reference sketch of the two ranking metrics (MRR and NDCG) follows the list:

  1. Completeness 📝 - How thoroughly the retrieved context addresses the query
  2. Clarity ✨ - How clear and understandable the retrieved information is
  3. Conciseness 🎪 - How efficiently the information is presented, without redundancy
  4. Precision 🎯 - How accurate and relevant the retrieved information is
  5. Recall 🔍 - How comprehensively the retrieved information covers the query
  6. MRR (Mean Reciprocal Rank) 📈 - Ranking quality of relevant results
  7. NDCG (Normalized Discounted Cumulative Gain) 📊 - Ranking quality, accounting for result position
  8. Relevance 🔗 - Overall relevance of the retrieved contexts to the query
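
For reference, MRR and NDCG are standard ranking measures computed from per-result relevance labels. The following minimal Python sketch shows how they are calculated; the function names are illustrative, not part of the model's API:

import math

def reciprocal_rank(relevance):
    # relevance: 1/0 flags per ranked result, best-ranked first.
    # MRR is this value averaged over a set of queries.
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            return 1.0 / rank
    return 0.0

def ndcg(gains, k=None):
    # gains: graded relevance per ranked result; positions are discounted by log2(rank + 1)
    gains = gains[:k] if k else gains
    dcg = sum(g / math.log2(i + 2) for i, g in enumerate(gains))
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(sorted(gains, reverse=True)))
    return dcg / idcg if idcg > 0 else 0.0

print(reciprocal_rank([0, 1, 0]))  # 0.5: first relevant hit at rank 2
print(ndcg([3, 1, 2]))             # close to 1 for a near-ideal ordering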

Training Data 📚

https://huggingface.co/datasets/constehub/rag-evaluation-dataset
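
The dataset can be pulled directly from the Hugging Face Hub for inspection. A minimal sketch using the datasets library (the train split name is an assumption; check the dataset card):

from datasets import load_dataset

# Each record is an instruction / input / output triple as shown below
ds = load_dataset("constehub/rag-evaluation-dataset", split="train")  # split name assumed
print(ds[0])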

Example Training Instance

{
  "instruction": "Evaluate the agent's response according to the metrics: completeness, clarity, conciseness, precision, recall, mrr, ndcg, relevance",
  "input": {
    "question": "Question about retrieved context",
    "retrieved_contexts": "[Multiple numbered passages with source citations]"
  },
  "output": [
    {
      "name": "completeness",
      "value": 1,
      "comment": "Detailed evaluation comment"
    },
    // ... objects for the other metrics
  ]
}
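
At inference time an instance like this must be flattened into a single prompt. The exact template used during training is not documented here, so the sketch below simply concatenates the fields in the order shown above (build_prompt is an illustrative helper, not a published API):

def build_prompt(instance):
    # Concatenate instruction, question, and retrieved contexts into one prompt,
    # mirroring the layout of the Usage Example below
    inp = instance["input"]
    return (
        f"{instance['instruction']}\n\n"
        f"Question: {inp['question']}\n"
        f"Retrieved contexts: {inp['retrieved_contexts']}"
    )

example = {
    "instruction": "Evaluate the agent's response according to the metrics: "
                   "completeness, clarity, conciseness, precision, recall, mrr, ndcg, relevance",
    "input": {
        "question": "What is RAG?",
        "retrieved_contexts": "[1] RAG combines retrieval with generation. (source: docs)",
    },
}
print(build_prompt(example))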

Performance and Limitations ⚡

Strengths

  • Specialized for RAG evaluation
  • Multi-dimensional assessment capability
  • Detailed explanatory comments for each metric

Limitations

  • Context Length: Performance may degrade on very long retrieved contexts; a simple truncation mitigation is sketched below
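
If retrieved contexts may exceed the model's window, the bluntest mitigation is to truncate the tokenized prompt to a fixed budget. A sketch using the transformers tokenizer from the Usage Example (the 8192-token budget is an assumption, not a documented limit):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mendrika261/rag-evaluator-qwen3-8b")
long_prompt = "Question: ...\nRetrieved contexts: " + "passage text " * 10_000
# Keep only the first 8192 tokens; note this silently drops the tail of the contexts
inputs = tokenizer(long_prompt, return_tensors="pt", truncation=True, max_length=8192)
print(inputs["input_ids"].shape)  # (1, 8192)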

Ethical Considerations 🤝

  • The model should be used as a tool to assist human evaluators, not replace human judgment entirely
  • Evaluations should be validated by domain experts for critical applications

Technical Specifications 🔧

  • Base Model: Qwen3-8B (8.19B parameters)
  • Format: GGUF (see the llama.cpp sketch below)
  • Quantization: Q8_0 (8-bit)
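
Since the released weights are a Q8_0 GGUF, they can also be run through llama.cpp bindings instead of transformers. A minimal sketch with llama-cpp-python (the .gguf filename pattern and context size are assumptions; check the repository's file list):

from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="mendrika261/rag-evaluator-qwen3-8b",
    filename="*Q8_0.gguf",  # glob over the repo's files; actual name assumed
    n_ctx=8192,             # context window budget (assumption)
)
out = llm("Evaluate the agent's response according to the metrics: ...", max_tokens=512)
print(out["choices"][0]["text"])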

Usage Example 💻

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "mendrika261/rag-evaluator-qwen3-8b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,  # half precision to fit an 8B model on a single GPU
    device_map="auto",
)

# Example evaluation prompt
prompt = """Evaluate the agent's response according to the metrics: completeness, clarity, conciseness, precision, recall, mrr, ndcg, relevance

Question: [Your question here]
Retrieved contexts: [Your retrieved contexts here]"""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)  # leave room for the full JSON evaluation
# Decode only the newly generated tokens, not the echoed prompt
evaluation = tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
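
The model is trained to emit a JSON array of per-metric objects (see the training instance above), so the generated text can be parsed back into scores. A minimal sketch, assuming the generation contains one well-formed JSON array (production code should handle malformed output):

import json

def parse_scores(evaluation):
    # Grab the outermost JSON array from the generated text and map name -> value
    start, end = evaluation.find("["), evaluation.rfind("]") + 1
    return {m["name"]: m["value"] for m in json.loads(evaluation[start:end])}

sample = '[{"name": "completeness", "value": 1, "comment": "Covers the query."}]'
print(parse_scores(sample))  # {'completeness': 1}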

Citation 📄

If you use this model in your research, please cite:

@misc{constehub-rag-evaluator,
  title={RAG Context Evaluator - Qwen3-8B Fine-tuned},
  author={constehub},
  year={2025},
  howpublished={\url{https://huggingface.co/constehub/rag-evaluation}}
}

Contact 📧

For questions or issues regarding this model, please contact the developer through the Hugging Face model repository.


This Qwen3 model was trained 2x faster with Unsloth and Hugging Face's TRL library.
