Improve model card: Add pipeline tag, project page, and abstract
This PR improves the model card for MedVAL-4B by:
- Adding the `pipeline_tag: text-classification` to the metadata, which helps with model discoverability on the Hugging Face Hub.
- Including a link to the project page (`https://stanfordmimi.github.io/MedVAL/`) in the "Sources" section, so all project resources are referenced in one place.
- Adding a dedicated "Abstract" section with the paper's abstract to provide a more thorough overview of the model and its context.
README.md CHANGED

```diff
@@ -1,31 +1,41 @@
 ---
-license: mit
+base_model:
+- Qwen/Qwen3-4B
 datasets:
 - stanfordmimi/MedVAL-Bench
 language:
 - en
 - es
+library_name: transformers
+license: mit
 metrics:
 - f1
 - accuracy
-base_model:
-- Qwen/Qwen3-4B
-library_name: transformers
 tags:
 - medical
+pipeline_tag: text-classification
 ---
 
+# MedVAL-4B
+
 **MedVAL-4B** (medical text validator) is a language model fine-tuned to **assess AI-generated medical text** outputs at near **physician-level reliability**.
 
+MedVAL is a self-supervised framework for expert-level validation of AI-generated medical text using language models. The system is designed to evaluate the accuracy and safety of AI-generated medical text across multiple medical tasks. The framework supports both model fine-tuning and evaluation.
+
 
 [![arXiv](https://img.shields.io/badge/arXiv-2507.03152-b31b1b.svg)](https://arxiv.org/abs/2507.03152)
 
 **Figure 1** | **MedVAL test-time workflow**. A generator LM produces an output, and MedVAL assesses the output's factual consistency with the input, while assigning a risk grade and determining its safety for deployment.
 
+## Abstract
+
+With the growing use of language models (LMs) in clinical environments, there is an immediate need to evaluate the accuracy and safety of LM-generated medical text. Currently, such evaluation relies solely on manual physician review. However, detecting errors in LM-generated text is challenging because 1) manual review is costly and 2) expert-composed reference outputs are often unavailable in real-world settings. While the "LM-as-judge" paradigm (an LM evaluating another LM) offers scalable evaluation, even frontier LMs can miss subtle but clinically significant errors. To address these challenges, we propose MedVAL, a self-supervised framework that leverages synthetic data to train evaluator LMs to assess whether LM-generated medical outputs are factually consistent with inputs, without requiring physician labels or reference outputs. To evaluate LM performance, we introduce MedVAL-Bench, a dataset containing 840 outputs annotated by physicians, following a physician-defined taxonomy of risk levels and error categories. Across 6 diverse medical tasks and 10 state-of-the-art LMs spanning open-source, proprietary, and medically adapted models, MedVAL fine-tuning significantly improves (p < 0.001) alignment with physicians on both seen and unseen tasks, increasing average F1 scores from 66% to 83%, with per-sample safety classification scores up to 86%. MedVAL improves the performance of even the best-performing proprietary LM (GPT-4o) by 8%. To support a scalable, risk-aware pathway towards clinical integration, we open-source the 1) [codebase](https://github.com/StanfordMIMI/MedVAL), 2) [MedVAL-Bench](https://huggingface.co/datasets/stanfordmimi/MedVAL-Bench), and 3) [MedVAL-4B](https://huggingface.co/stanfordmimi/MedVAL-4B), the best-performing open-source LM. Our research provides the first evidence of LMs approaching expert-level validation ability for medical text.
+
 # Sources
 
 - **Paper:** [Expert-level validation of AI-generated medical text with scalable language models](https://www.arxiv.org/abs/2507.03152)
 - **Code:** [GitHub](https://github.com/StanfordMIMI/MedVAL)
+- **Project Page:** [MedVAL Project Page](https://stanfordmimi.github.io/MedVAL/)
 - **Train/Test Dataset:** [MedVAL-Bench](https://huggingface.co/datasets/stanfordmimi/MedVAL-Bench)
 
 # Model Details
@@ -104,8 +114,10 @@ Your output fields are:
 Evaluate the output in comparison to the input and determine errors that exhibit factual inconsistency with the input.
 
 Instructions:
-- Output format: `Error 1: <brief explanation in a few words
-
+- Output format: `Error 1: <brief explanation in a few words>\nError 2: ...'
+- Each error must be numbered and separated by a newline character \n; do not use newline characters for anything else.
 - Return `None' if no errors are found.
 - Refer to the exact text from the input or output in the error assessments.
 
@@ -178,8 +190,10 @@ try:
 except ValueError:
     index = 0
 
-thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("
-
+thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
+content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")
 
 print("thinking content:", thinking_content)
 print("content:", content)
```
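For context on the second hunk: the restored instructions pin down a machine-readable output contract for the validator, numbered `Error N:` items separated by newlines, or `None` when no errors are found. As a quick illustration of that contract, here is a hypothetical helper (not part of the model card or the MedVAL codebase) that splits an assessment back into a list:

```python
import re

def parse_error_list(assessment: str) -> list[str]:
    """Split a MedVAL-style assessment into individual error explanations.

    Follows the output contract restored in the diff above: numbered errors
    separated by newlines, or None when no errors are found. Hypothetical
    helper, not part of the model card or the MedVAL codebase.
    """
    cleaned = assessment.strip().strip("`'")
    if cleaned.lower() == "none":
        return []
    errors = []
    for line in assessment.splitlines():
        match = re.match(r"Error\s*\d+:\s*(.*)", line.strip())
        if match:
            errors.append(match.group(1))
    return errors

# Two numbered errors become two explanations; "None" becomes an empty list.
print(parse_error_list("Error 1: dose changed from 5 mg to 50 mg\nError 2: added a finding absent from the input"))
print(parse_error_list("None"))
```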
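The third hunk repairs the `strip("\n")` calls in the model card's usage snippet, which follows the stock Qwen3 recipe of splitting the thinking trace from the final answer at the `</think>` token. For orientation, a minimal end-to-end sketch of that recipe follows; the repo id `stanfordmimi/MedVAL-4B` and the `</think>` token id `151668` are assumptions carried over from the base Qwen3 documentation, not details shown in this diff.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Repo id assumed from the PR context; adjust if the model lives elsewhere.
model_name = "stanfordmimi/MedVAL-4B"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")

# The input/output pair to validate would be formatted into this prompt.
messages = [{"role": "user", "content": "Evaluate the output in comparison to the input ..."}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(**model_inputs, max_new_tokens=1024)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

try:
    # 151668 is the </think> token id in the Qwen3 tokenizer (assumption:
    # the fine-tune keeps the base tokenizer); everything before the last
    # </think> is the thinking trace.
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    index = 0  # no </think> token: treat the whole generation as content

# These are the two lines the hunk above repairs.
thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

print("thinking content:", thinking_content)
print("content:", content)
```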