nielsr (HF Staff) committed · verified
Commit 4e430ff · Parent(s): 5c24e23

Improve model card: Add pipeline tag, project page, and abstract


This PR improves the model card for MedVAL-4B by:
- Adding `pipeline_tag: text-classification` to the metadata, which helps with model discoverability on the Hugging Face Hub (the merged front matter is sketched below).
- Including a link to the project page (`https://stanfordmimi.github.io/MedVAL/`) in the "Sources" section for comprehensive referencing.
- Adding a dedicated "Abstract" section with the paper's abstract to provide a more thorough overview of the model and its context.
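For reference, the model card front matter after this PR, assembled from the diff below (the keys are also reordered in the process, as the diff shows):

```yaml
---
base_model:
- Qwen/Qwen3-4B
datasets:
- stanfordmimi/MedVAL-Bench
language:
- en
- es
library_name: transformers
license: mit
metrics:
- f1
- accuracy
tags:
- medical
pipeline_tag: text-classification
---
```

Note that `pipeline_tag` sits at the top level of the front matter, outside the `tags` list.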

Files changed (1): README.md (+22, −8)

README.md CHANGED
@@ -1,31 +1,41 @@
  ---
- license: mit
+ base_model:
+ - Qwen/Qwen3-4B
  datasets:
  - stanfordmimi/MedVAL-Bench
  language:
  - en
  - es
+ library_name: transformers
+ license: mit
  metrics:
  - f1
  - accuracy
- base_model:
- - Qwen/Qwen3-4B
- library_name: transformers
  tags:
  - medical
+ pipeline_tag: text-classification
  ---

+ # MedVAL-4B
+
  **MedVAL-4B** (medical text validator) is a language model fine-tuned to **assess AI-generated medical text** outputs at near **physician-level reliability**.

+ MedVAL is a self-supervised framework for expert-level validation of AI-generated medical text using language models. The system is designed to evaluate the accuracy and safety of AI-generated medical text across multiple medical tasks. The framework supports both model fine-tuning and evaluation.
+
  ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64bac7c5e38420aaba8ea197/hBt_BPI6PeW_lv-HbCHE6.png)
  [![arXiv](https://img.shields.io/badge/arXiv-2507.03152-b31b1b.svg?style=for-the-badge)](https://arxiv.org/abs/2507.03152)

  **Figure 1** | **MedVAL test-time workflow**. A generator LM produces an output, and MedVAL assesses the output's factual consistency with the input, while assigning a risk grade and determining its safety for deployment.

+ ## Abstract
+
+ With the growing use of language models (LMs) in clinical environments, there is an immediate need to evaluate the accuracy and safety of LM-generated medical text. Currently, such evaluation relies solely on manual physician review. However, detecting errors in LM-generated text is challenging because 1) manual review is costly and 2) expert-composed reference outputs are often unavailable in real-world settings. While the "LM-as-judge" paradigm (an LM evaluating another LM) offers scalable evaluation, even frontier LMs can miss subtle but clinically significant errors. To address these challenges, we propose MedVAL, a self-supervised framework that leverages synthetic data to train evaluator LMs to assess whether LM-generated medical outputs are factually consistent with inputs, without requiring physician labels or reference outputs. To evaluate LM performance, we introduce MedVAL-Bench, a dataset containing 840 outputs annotated by physicians, following a physician-defined taxonomy of risk levels and error categories. Across 6 diverse medical tasks and 10 state-of-the-art LMs spanning open-source, proprietary, and medically adapted models, MedVAL fine-tuning significantly improves (p < 0.001) alignment with physicians on both seen and unseen tasks, increasing average F1 scores from 66% to 83%, with per-sample safety classification scores up to 86%. MedVAL improves the performance of even the best-performing proprietary LM (GPT-4o) by 8%. To support a scalable, risk-aware pathway towards clinical integration, we open-source the 1) [codebase](https://github.com/StanfordMIMI/MedVAL), 2) [MedVAL-Bench](https://huggingface.co/datasets/stanfordmimi/MedVAL-Bench), and 3) MedVAL-4B (this model), the best-performing open-source LM. Our research provides the first evidence of LMs approaching expert-level validation ability for medical text.
+
  # Sources

  - **Paper:** [Expert-level validation of AI-generated medical text with scalable language models](https://www.arxiv.org/abs/2507.03152)
  - **Code:** [GitHub](https://github.com/StanfordMIMI/MedVAL)
+ - **Project Page:** [MedVAL Project Page](https://stanfordmimi.github.io/MedVAL/)
  - **Train/Test Dataset:** [MedVAL-Bench](https://huggingface.co/datasets/stanfordmimi/MedVAL-Bench)

  # Model Details
@@ -104,8 +114,10 @@ Your output fields are:
  Evaluate the output in comparison to the input and determine errors that exhibit factual inconsistency with the input.

  Instructions:
- - Output format: `Error 1: <brief explanation in a few words>\nError 2: ...'
- - Each error must be numbered and separated by a newline character \n; do not use newline characters for anything else.
+ - Output format: `Error 1: <brief explanation in a few words>
+ Error 2: ...'
+ - Each error must be numbered and separated by a newline character
+ ; do not use newline characters for anything else.
  - Return `None' if no errors are found.
  - Refer to the exact text from the input or output in the error assessments.

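The instruction block in this hunk specifies a machine-parseable output format: numbered `Error N:` lines separated by newlines, or the literal `None`. A minimal Python sketch of a downstream parser for that format (the `parse_errors` helper is illustrative only, not part of the MedVAL codebase):

```python
import re

def parse_errors(validator_output: str) -> list[str]:
    """Parse numbered 'Error N:' lines into a list of error descriptions.

    Hypothetical helper for illustration; the model card only specifies
    the output format, not a parser.
    """
    text = validator_output.strip()
    if text == "None":  # the validator returns `None' when no errors are found
        return []
    errors = []
    for line in text.splitlines():
        match = re.match(r"Error\s*\d+:\s*(.*)", line.strip())
        if match:
            errors.append(match.group(1))
    return errors

# Example: two errors separated by a newline, as the prompt requires
print(parse_errors("Error 1: dose changed from 5 mg to 50 mg\nError 2: added a finding absent from the input"))
```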
@@ -178,8 +190,10 @@ try:
  except ValueError:
      index = 0

- thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
- content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")
+ thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("
+ ")
+ content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("
+ ")

  print("thinking content:", thinking_content)
  print("content:", content)