---
base_model: unsloth/qwen3-8b-unsloth-bnb-4bit
tags:
- text-generation
- rag
- evaluation
- information-retrieval
- question-answering
- retrieval-augmented-generation
- context-evaluation
- qwen3
- unsloth
- fine-tuned
language:
- en
- multilingual
license: apache-2.0
library_name: transformers
pipeline_tag: text-generation
model_type: qwen3
quantized: q8_0
datasets:
- constehub/rag-evaluation-dataset
metrics:
- completeness
- clarity
- conciseness
- precision
- recall
- mrr
- ndcg
- relevance
widget:
- example_title: "RAG Context Evaluation"
  text: |
    Evaluate the agent's response according to the metrics: completeness, clarity, conciseness, precision, recall, mrr, ndcg, relevance
    Question: What are the main benefits of renewable energy?
    Retrieved contexts: [1] Renewable energy sources like solar and wind power provide clean alternatives to fossil fuels, reducing greenhouse gas emissions and air pollution. [2] These energy sources are sustainable and abundant, helping to ensure long-term energy security.
model-index:
- name: RAG Context Evaluator
  results:
  - task:
      type: text-generation
      name: RAG Evaluation
    metrics:
    - type: evaluation_score
      name: Multi-metric Assessment
      value: 0-5
---
# RAG Context Evaluator - Qwen3-8B Fine-tuned πŸš€
## Model Details πŸ“‹
- **License:** apache-2.0
- **Finetuned from model:** unsloth/qwen3-8b-unsloth-bnb-4bit
- **Model type:** Text Generation (specialized for RAG evaluation)
- **Quantization:** Q8_0
## Model Description 🎯
This model is specifically fine-tuned to evaluate the quality of retrieved contexts in Retrieval-Augmented Generation (RAG) systems. It assesses retrieved passages against user queries using multiple evaluation metrics commonly used in information retrieval and RAG evaluation.
## Intended Uses πŸ’‘
### Primary Use Case 🎯
- **RAG System Evaluation**: Automatically assess the quality of retrieved contexts for question-answering systems
- **Information Retrieval Quality Control**: Evaluate how well retrieved documents match user queries
- **Academic Research**: Support research in information retrieval and RAG system optimization
### Evaluation Metrics πŸ“Š
The model evaluates retrieved contexts using the following metrics:
1. **Completeness** πŸ“ - How thoroughly the retrieved context addresses the query
2. **Clarity** ✨ - How clear and understandable the retrieved information is
3. **Conciseness** πŸŽͺ - How efficiently the information is presented without redundancy
4. **Precision** 🎯 - How much of the retrieved information is actually relevant to the query
5. **Recall** πŸ” - How much of the information needed to answer the query the retrieved contexts cover
6. **MRR (Mean Reciprocal Rank)** πŸ“ˆ - How highly the first relevant context is ranked
7. **NDCG (Normalized Discounted Cumulative Gain)** πŸ“Š - How well the contexts are ordered, with relevant results at earlier positions weighted more heavily
8. **Relevance** πŸ”— - Overall relevance of retrieved contexts to the query
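The evaluation prompt follows the pattern shown in the widget example: the instruction line listing the metrics, the user question, and the retrieved passages numbered `[1]`, `[2]`, and so on. Below is a minimal helper for assembling such a prompt; the function name and the single-string joining of passages are illustrative conventions, not requirements documented by this model.
```python
METRICS = "completeness, clarity, conciseness, precision, recall, mrr, ndcg, relevance"

def build_evaluation_prompt(question: str, contexts: list[str]) -> str:
    """Assemble the evaluation prompt: instruction line, question, numbered contexts."""
    numbered = " ".join(f"[{i}] {ctx}" for i, ctx in enumerate(contexts, start=1))
    return (
        f"Evaluate the agent's response according to the metrics: {METRICS}\n"
        f"Question: {question}\n"
        f"Retrieved contexts: {numbered}"
    )

prompt = build_evaluation_prompt(
    "What are the main benefits of renewable energy?",
    [
        "Renewable energy sources like solar and wind power provide clean alternatives to fossil fuels.",
        "These energy sources are sustainable and abundant, helping to ensure long-term energy security.",
    ],
)
```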
## Training Data πŸ“š
The model was fine-tuned on the RAG evaluation dataset available at https://huggingface.co/datasets/constehub/rag-evaluation-dataset.
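To inspect the data, the dataset can be loaded with the `datasets` library. A minimal sketch; the `train` split name and the instruction/input/output fields are assumptions based on the example instance shown below.
```python
from datasets import load_dataset

# Assumption: the dataset exposes a "train" split with instruction/input/output fields
ds = load_dataset("constehub/rag-evaluation-dataset", split="train")
print(ds[0])
```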
### Example Training Instance
```json
{
  "instruction": "Evaluate the agent's response according to the metrics: completeness, clarity, conciseness, precision, recall, mrr, ndcg, relevance",
  "input": {
    "question": "Question about retrieved context",
    "retrieved_contexts": "[Multiple numbered passages with source citations]"
  },
  "output": [
    {
      "name": "completeness",
      "value": 1,
      "comment": "Detailed evaluation comment"
    }
    // ... other metrics
  ]
}
```
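If the fine-tuned model reproduces this output format at inference time, its generation can be parsed back into per-metric scores. A minimal sketch; the regex-based JSON extraction is a heuristic assumption, not a documented contract of the model.
```python
import json
import re

def parse_evaluation(generated_text: str) -> dict[str, dict]:
    """Extract the first JSON array from the model output and index it by metric name."""
    match = re.search(r"\[.*\]", generated_text, flags=re.DOTALL)
    if match is None:
        raise ValueError("No JSON array found in the model output")
    metrics = json.loads(match.group(0))
    return {m["name"]: {"value": m["value"], "comment": m.get("comment", "")} for m in metrics}

scores = parse_evaluation('[{"name": "completeness", "value": 4, "comment": "Covers the query"}]')
print(scores["completeness"]["value"])  # 4
```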
## Performance and Limitations ⚑
### Strengths
- Specialized for RAG evaluation
- Multi-dimensional assessment capability
- Detailed explanatory comments for each metric
### Limitations
- **Context Length**: Performance may vary with very long retrieved contexts
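One practical mitigation is to check the assembled prompt against a token budget and drop the lowest-ranked passages until it fits. A minimal sketch, assuming a hypothetical budget of 8192 tokens and a Hugging Face tokenizer; adjust both to your deployment.
```python
MAX_PROMPT_TOKENS = 8192  # assumption: set this to the context window of your deployment

def fit_contexts(question: str, contexts: list[str], tokenizer) -> list[str]:
    """Drop the lowest-ranked passages until the question + contexts fit the token budget."""
    kept = list(contexts)
    while kept:
        numbered = " ".join(f"[{i}] {ctx}" for i, ctx in enumerate(kept, start=1))
        prompt = f"Question: {question}\nRetrieved contexts: {numbered}"
        if len(tokenizer(prompt)["input_ids"]) <= MAX_PROMPT_TOKENS:
            break
        kept.pop()  # remove the last (lowest-ranked) passage and retry
    return kept
```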
## Ethical Considerations 🀝
- The model should be used as a tool to assist human evaluators, not replace human judgment entirely
- Evaluations should be validated by domain experts for critical applications
## Technical Specifications πŸ”§
- **Base Model**: Qwen3-8B
- **Quantization**: Q8_0
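Q8_0 is a llama.cpp/GGUF quantization scheme, so the quantized weights are typically served outside `transformers`. A minimal llama-cpp-python sketch; the local file name and the 8192-token context size are placeholders, not a documented layout of this repository.
```python
from llama_cpp import Llama

# Assumption: a local Q8_0 GGUF export of this model
llm = Llama(model_path="rag-evaluator-qwen3-8b-q8_0.gguf", n_ctx=8192)

prompt = (
    "Evaluate the agent's response according to the metrics: "
    "completeness, clarity, conciseness, precision, recall, mrr, ndcg, relevance\n"
    "Question: What are the main benefits of renewable energy?\n"
    "Retrieved contexts: [1] Renewable energy sources like solar and wind power "
    "provide clean alternatives to fossil fuels."
)
result = llm(prompt, max_tokens=512)
print(result["choices"][0]["text"])
```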
## Usage Example πŸ’»
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "mendrika261/rag-evaluator-qwen3-8b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",  # requires `accelerate`; remove this argument to load on CPU
)

# Example evaluation prompt
prompt = """Evaluate the agent's response according to the metrics: completeness, clarity, conciseness, precision, recall, mrr, ndcg, relevance
Question: [Your question here]
Retrieved contexts: [Your retrieved contexts here]"""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)

# Decode only the newly generated tokens, not the prompt
evaluation = tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
print(evaluation)
```
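Since the base model is an instruction-tuned chat model, wrapping the prompt with the tokenizer's chat template may yield more reliable behavior. A minimal sketch reusing the `tokenizer`, `model`, and `prompt` objects from above:
```python
# Build a single-turn chat and let the tokenizer apply the Qwen3 chat template
messages = [{"role": "user", "content": prompt}]
chat_prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,  # append the assistant turn so the model starts answering
)
inputs = tokenizer(chat_prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
evaluation = tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
```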
## Citation πŸ“„
If you use this model in your research, please cite:
```bibtex
@misc{constehub-rag-evaluator,
title={RAG Context Evaluator - Qwen3-8B Fine-tuned},
author={constehub},
year={2025},
howpublished={\url{https://huggingface.co/constehub/rag-evaluation}}
}
```
## Contact πŸ“§
For questions or issues regarding this model, please contact the developer through the Hugging Face model repository.
---
This qwen3 model was trained 2x faster with [Unsloth](https://github.com/unslothai/unsloth) and Huggingface's TRL library.
[<img src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/unsloth%20made%20with%20love.png" width="200"/>](https://github.com/unslothai/unsloth)