aslessor's Collections
Evaluation
MM-Vet v2: A Challenging Benchmark to Evaluate Large Multimodal Models for Integrated Capabilities
Paper • 2408.00765 • Published • 14
Towards Achieving Human Parity on End-to-end Simultaneous Speech Translation via LLM Agent
Paper • 2407.21646 • Published • 18
LLM-DetectAIve: a Tool for Fine-Grained Machine-Generated Text Detection
Paper • 2408.04284 • Published • 26
Training Language Models on the Knowledge Graph: Insights on Hallucinations and Their Detectability
Paper • 2408.07852 • Published • 16
GroUSE: A Benchmark to Evaluate Evaluators in Grounded Question Answering
Paper • 2409.06595 • Published • 38
MEDIC: Towards a Comprehensive Framework for Evaluating LLMs in Clinical Applications
Paper • 2409.07314 • Published • 57
Measuring and Enhancing Trustworthiness of LLMs in RAG through Grounded Attributions and Learning to Refuse
Paper • 2409.11242 • Published • 7
LLaVA-Critic: Learning to Evaluate Multimodal Models
Paper • 2410.02712 • Published • 36
TurtleBench: Evaluating Top Language Models via Real-World Yes/No Puzzles
Paper • 2410.05262 • Published • 10
A Preliminary Study of o1 in Medicine: Are We Closer to an AI Doctor?
Paper • 2409.15277 • Published • 38
Fusion-Eval: Integrating Evaluators with LLMs
Paper • 2311.09204 • Published • 6