---
base_model: unsloth/qwen3-8b-unsloth-bnb-4bit
tags:
  - text-generation
  - rag
  - evaluation
  - information-retrieval
  - question-answering
  - retrieval-augmented-generation
  - context-evaluation
  - qwen3
  - unsloth
  - fine-tuned
language:
  - en
  - multilingual
license: apache-2.0
library_name: transformers
pipeline_tag: text-generation
model_type: qwen3
quantized: q8_0
datasets:
  - constehub/rag-evaluation-dataset
metrics:
  - completeness
  - clarity
  - conciseness
  - precision
  - recall
  - mrr
  - ndcg
  - relevance
widget:
  - example_title: RAG Context Evaluation
    text: >
      Evaluate the agent's response according to the metrics: completeness,
      clarity, conciseness, precision, recall, mrr, ndcg, relevance


      Question: What are the main benefits of renewable energy?

      Retrieved contexts: [1] Renewable energy sources like solar and wind power
      provide clean alternatives to fossil fuels, reducing greenhouse gas
      emissions and air pollution. [2] These energy sources are sustainable and
      abundant, helping to ensure long-term energy security.
model-index:
  - name: RAG Context Evaluator
    results:
      - task:
          type: text-generation
          name: RAG Evaluation
        metrics:
          - type: evaluation_score
            name: Multi-metric Assessment
            value: 0-5
---

# RAG Context Evaluator - Qwen3-8B Fine-tuned 🚀

## Model Details 📋

- **License:** apache-2.0
- **Finetuned from model:** unsloth/qwen3-8b-unsloth-bnb-4bit
- **Model type:** Text Generation (specialized for RAG evaluation)
- **Quantization:** Q8_0

## Model Description 🎯

This model is fine-tuned specifically to evaluate the quality of retrieved contexts in Retrieval-Augmented Generation (RAG) systems. Given a user query and a set of retrieved passages, it scores the passages on metrics commonly used in information retrieval and RAG evaluation.

## Intended Uses 💡

### Primary Use Case 🎯

- **RAG System Evaluation:** automatically assess the quality of retrieved contexts for question-answering systems
- **Information Retrieval Quality Control:** evaluate how well retrieved documents match user queries
- **Academic Research:** support research in information retrieval and RAG system optimization

## Evaluation Metrics 📊

The model evaluates retrieved contexts using the following metrics:

1. **Completeness** 📝 - how thoroughly the retrieved context addresses the query
2. **Clarity** ✨ - how clear and understandable the retrieved information is
3. **Conciseness** 🎪 - how efficiently the information is presented, without redundancy
4. **Precision** 🎯 - how much of the retrieved information is accurate and relevant to the query
5. **Recall** 🔍 - how much of the information needed to answer the query is covered by the retrieved contexts
6. **MRR (Mean Reciprocal Rank)** 📈 - how highly the first relevant result is ranked
7. **NDCG (Normalized Discounted Cumulative Gain)** 📊 - ranking quality that rewards placing relevant results earlier in the list (a reference computation for MRR and NDCG is sketched after this list)
8. **Relevance** 🔗 - overall relevance of the retrieved contexts to the query
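
For reference, MRR and NDCG can be computed from per-passage relevance judgments as in the sketch below. This is a standard formulation included for illustration, not code shipped with the model; the `relevances` input is a hypothetical ranked list of relevance scores.

```python
import math

def mrr(relevances):
    """Reciprocal rank of the first relevant result for a single query."""
    for rank, rel in enumerate(relevances, start=1):
        if rel > 0:
            return 1.0 / rank
    return 0.0

def ndcg(relevances):
    """Normalized Discounted Cumulative Gain: DCG of the ranking over the ideal DCG."""
    dcg = sum(rel / math.log2(rank + 1) for rank, rel in enumerate(relevances, start=1))
    ideal = sorted(relevances, reverse=True)
    idcg = sum(rel / math.log2(rank + 1) for rank, rel in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0

# Example: the first relevant passage appears at rank 2
print(mrr([0, 1, 1]))   # 0.5
print(ndcg([0, 1, 1]))  # ~0.69
```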

## Training Data 📚

The model was fine-tuned on the [constehub/rag-evaluation-dataset](https://huggingface.co/datasets/constehub/rag-evaluation-dataset) dataset.

### Example Training Instance

```json
{
  "instruction": "Evaluate the agent's response according to the metrics: completeness, clarity, conciseness, precision, recall, mrr, ndcg, relevance",
  "input": {
    "question": "Question about retrieved context",
    "retrieved_contexts": "[Multiple numbered passages with source citations]"
  },
  "output": [
    {
      "name": "completeness",
      "value": 1,
      "comment": "Detailed evaluation comment"
    }
    // ... other metrics
  ]
}
```
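
The exact prompt template used during fine-tuning is not documented in this card. Purely as an illustration, a minimal flattening of such an instance into a prompt/response pair could look like the sketch below (the `format_example` helper and field layout are assumptions, not the actual training code):

```python
import json

def format_example(instance: dict) -> str:
    """Illustrative only: flatten one dataset instance into a prompt/response string."""
    prompt = (
        f"{instance['instruction']}\n\n"
        f"Question: {instance['input']['question']}\n"
        f"Retrieved contexts: {instance['input']['retrieved_contexts']}"
    )
    response = json.dumps(instance["output"], ensure_ascii=False, indent=2)
    return f"{prompt}\n\n{response}"
```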

## Performance and Limitations ⚡

### Strengths

- Specialized for RAG evaluation
- Multi-dimensional assessment capability
- Detailed explanatory comments for each metric

### Limitations

- **Context Length:** performance may vary with very long retrieved contexts

## Ethical Considerations 🤝

- The model should be used as a tool to assist human evaluators, not replace human judgment entirely
- Evaluations should be validated by domain experts for critical applications

## Technical Specifications 🔧

- **Base Model:** Qwen3-8B
- **Quantization:** Q8_0

## Usage Example 💻

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name = "mendrika261/rag-evaluator-qwen3-8b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,  # use a half-precision dtype if your hardware supports it
    device_map="auto",
)

# Example evaluation prompt
prompt = """Evaluate the agent's response according to the metrics: completeness, clarity, conciseness, precision, recall, mrr, ndcg, relevance

Question: [Your question here]
Retrieved contexts: [Your retrieved contexts here]"""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# Without max_new_tokens, generate() uses a very short default length and truncates the evaluation
outputs = model.generate(**inputs, max_new_tokens=512)
# Decode only the newly generated tokens (the evaluation itself)
evaluation = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(evaluation)
```
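
If the model follows its training format and returns one JSON object per metric, the scores can be pulled out of the completion. The helper below is a sketch that assumes a well-formed JSON array appears in the generated text:

```python
import json
import re

def parse_scores(generated: str) -> dict:
    """Extract {metric_name: value} pairs from a JSON array in the model's output."""
    match = re.search(r"\[.*\]", generated, flags=re.DOTALL)
    if not match:
        return {}
    try:
        items = json.loads(match.group(0))
    except json.JSONDecodeError:
        return {}
    return {item["name"]: item["value"] for item in items if "name" in item}

print(parse_scores(evaluation))
```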

## Citation 📄

If you use this model in your research, please cite:

```bibtex
@misc{constehub-rag-evaluator,
  title={RAG Context Evaluator - Qwen3-8B Fine-tuned},
  author={constehub},
  year={2025},
  howpublished={\url{https://huggingface.co/constehub/rag-evaluation}}
}
```

## Contact 📧

For questions or issues regarding this model, please contact the developer through the Hugging Face model repository.


This qwen3 model was trained 2x faster with [Unsloth](https://github.com/unslothai/unsloth) and Hugging Face's TRL library.