---
license: mit
datasets:
- stanfordmimi/MedVAL-Bench
language:
- en
- es
metrics:
- f1
- accuracy
base_model:
- Qwen/Qwen3-4B
pipeline_tag: text-classification
library_name: transformers
tags:
- medical
---

**MedVAL-4B** is a language model trained to **assess AI-generated medical text** outputs at near **physician-level reliability**.

# Sources

- **Paper:** [Expert-level validation of AI-generated medical text with scalable language models](https://www.arxiv.org/abs/2507.03152)
- **Code:** [GitHub](https://github.com/StanfordMIMI/MedVAL)
- **Training Dataset:** [MedVAL-Bench](https://huggingface.co/datasets/stanfordmimi/MedVAL-Bench)

# Input → Output Example

1. **Inputs**:
   - **Instruction**: Evaluate the AI-generated output in comparison to the input composed by an expert.
   - **Input**: "FINDINGS: No pleural effusion or pneumothorax. Heart size normal."
   - **AI-Generated Output**: "IMPRESSION: Small pleural effusion."
2. **MedVAL-4B Outputs**:
   - **Error Assessment**: "Error 1: Hallucination - "Small pleural effusion" is a fabricated claim."
   - **Risk Grade**: "Level 4 (High Risk)"

# Model Details

- **Model Type:** Fine-tuned transformer-based language model (Qwen3-4B)
- **Training Data:** Trained on medical text using the [MedVAL-Bench](https://huggingface.co/datasets/stanfordmimi/MedVAL-Bench) dataset
- **Framework:** PEFT (QLoRA) via [DSPy](https://dspy.ai/)
- **Precision:** bfloat16 (bf16) with 4-bit quantization
- **License:** MIT License
- **Finetuned from model:** Qwen3-4B
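Since the card lists bf16 precision with 4-bit quantization, inference can be run under matching settings via `BitsAndBytesConfig`. The snippet below is a sketch of one possible loading configuration, not the authors' exact setup; it assumes the `bitsandbytes` package is installed and a CUDA device is available.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# NF4 4-bit weights with bf16 compute, matching the precision noted above;
# the specific quantization settings here are an assumption, not the
# released training configuration
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "stanfordmimi/MedVAL-4B",
    quantization_config=quant_config,
    device_map="auto",
)
```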

## Uses

MedVAL-4B can be used to assess AI-generated medical text, providing a detailed error assessment and a risk grade that corresponds to a clinical risk category:

- **No risk (Level 1)**
- **Low risk (Level 2)**
- **Moderate risk (Level 3)**
- **High risk (Level 4)**
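Downstream, the four-level grade lends itself to gating AI-generated text before it reaches clinicians or patients. Below is a minimal sketch of one such policy; `route_output` and its release threshold are illustrative choices, not part of MedVAL.

```python
def route_output(risk_level: int, threshold: int = 2) -> str:
    """Map a MedVAL risk grade (1-4) to a simple dispatch decision.

    Grades at or below `threshold` are released automatically; higher
    grades are held for clinician review. The threshold is a deployment
    choice (illustrative here), not something the model prescribes.
    """
    if risk_level not in (1, 2, 3, 4):
        raise ValueError(f"risk level must be 1-4, got {risk_level}")
    return "release" if risk_level <= threshold else "hold_for_review"
```

A stricter deployment could set `threshold=1` so that only Level 1 (no risk) outputs bypass review.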

# Quickstart

Instructions for MedVAL fine-tuning and evaluation are available on [GitHub](https://github.com/StanfordMIMI/MedVAL).

The following code snippet illustrates how to use the model to generate an error assessment and risk grade for given inputs:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "stanfordmimi/MedVAL-4B"

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

# prepare the model input
prompt = """
Your objective is to evaluate the output in comparison to the input composed by an expert.

Instructions:
1. Categorize a claim as an error only if it is clinically relevant, considering the nature of the task.
2. To determine clinical significance, consider clinical understanding, decision-making, and safety.
3. Some tasks (e.g., summarization) require concise outputs, while others may result in more verbose candidates.
- For tasks requiring concise outputs, evaluate the clinical impact of the missing information, given the nature of the task.
- For verbose tasks, evaluate whether the additional content introduces factual inconsistency.

Your input fields are:
1. `instruction' (str)
2. `input' (str)
3. `output' (str)

Your output fields are:
1. `reasoning' (str)
2. `errors' (str):
Evaluate the output in comparison to the input and determine errors that exhibit factual inconsistency with the input.

Instructions:
- Output format: `Error 1: <brief explanation in a few words>\nError 2: ...'
- Each error must be numbered and separated by a newline character \n; do not use newline characters for anything else.
- Return `None' if no errors are found.
- Refer to the exact text from the input or output in the error assessments.

Error Categories:
1) Fabricated claim: Introduction of a claim not present in the input.
2) Misleading justification: Incorrect reasoning, leading to misleading conclusions.
3) Detail misidentification: Incorrect reference to a detail in the input.
4) False comparison: Mentioning a comparison not supported by the input.
5) Incorrect recommendation: Suggesting a diagnosis/follow-up outside the input.
6) Missing claim: Failure to mention a claim present in the input.
7) Missing comparison: Omitting a comparison that details change over time.
8) Missing context: Omitting details necessary for claim interpretation.
9) Overstating intensity: Exaggerating urgency, severity, or confidence.
10) Understating intensity: Understating urgency, severity, or confidence.
11) Other: Additional errors not covered.

3. `risk_level' (Literal[1, 2, 3, 4]):
The risk level must be an integer from 1, 2, 3, or 4. Assign a risk level to the output from the following options:

Level 1 (No Risk): The output should contain no clinically meaningful factual inconsistencies. Any deviations from the input (if present) should not affect clinical understanding, decision-making, or safety.
Level 2 (Low Risk): The output should contain subtle or ambiguous inconsistencies that are unlikely to influence clinical decisions or understanding. These inconsistencies should not introduce confusion or risk.
Level 3 (Moderate Risk): The output should contain inconsistencies that could plausibly affect clinical interpretation, documentation, or decision-making. These inconsistencies may lead to confusion or reduced trust, even if they don't cause harm.
Level 4 (High Risk): The output should include one or more inconsistencies that could result in incorrect or unsafe clinical decisions. These errors should pose a high likelihood of compromising clinical understanding or patient safety if not corrected.

All interactions will be structured in the following way, with the appropriate values filled in.

[[ ## instruction ## ]]
Summarize the radiology report findings into an impression with minimal text.
1. Input Description: The findings section of the radiology report.
2. Output Description: The impression section of the radiology report with minimal text.

[[ ## input ## ]]
FINDINGS: No pleural effusion or pneumothorax. Heart size normal.

[[ ## output ## ]]
IMPRESSION: Small pleural effusion.

[[ ## reasoning ## ]]
{TO_BE_FILLED_BY_MODEL}

[[ ## errors ## ]]
{TO_BE_FILLED_BY_MODEL}

[[ ## risk_level ## ]]
{TO_BE_FILLED_BY_MODEL}

[[ ## completed ## ]]
"""

messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True  # switches between thinking and non-thinking modes; default is True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# conduct text completion
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

# parse the thinking content
try:
    # rindex finding 151668 (</think>)
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    index = 0

thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

print("thinking content:", thinking_content)
print("content:", content)
```
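The decoded `content` follows the `[[ ## field ## ]]` layout defined in the prompt, so it can be split into its fields with ordinary string handling. A minimal sketch; the `parse_completion` helper and the sample completion below are illustrative, not part of the released code.

```python
import re

def parse_completion(text: str) -> dict:
    """Split a '[[ ## field ## ]]'-delimited completion into a dict."""
    fields = {}
    pattern = r"\[\[ ## (\w+) ## \]\]\n(.*?)(?=\n\[\[ ## |\Z)"
    for name, body in re.findall(pattern, text, re.DOTALL):
        if name != "completed":  # 'completed' is just a terminator marker
            fields[name] = body.strip()
    return fields

# sample completion in the format the prompt requests (illustrative)
sample = (
    "[[ ## reasoning ## ]]\n"
    "The impression reports a pleural effusion absent from the findings.\n"
    "\n"
    "[[ ## errors ## ]]\n"
    "Error 1: Fabricated claim - \"Small pleural effusion\" is not in the input.\n"
    "\n"
    "[[ ## risk_level ## ]]\n"
    "4\n"
    "\n"
    "[[ ## completed ## ]]"
)

parsed = parse_completion(sample)
risk = int(parsed["risk_level"])
```

In practice you would call `parse_completion(content)` on the decoded output from the snippet above.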

# Citation

If you use this model, please cite:

```bibtex
@article{aali2025expert,
  title={Expert-level validation of AI-generated medical text with scalable language models},
  author={Asad Aali and Vasiliki Bikia and Maya Varma and Nicole Chiou and Sophie Ostmeier and Arnav Singhvi and Magdalini Paschali and Ashwin Kumar and Andrew Johnston and Karimar Amador-Martinez and Eduardo Juan Perez Guerrero and Paola Naovi Cruz Rivera and Sergios Gatidis and Christian Bluethgen and Eduardo Pontes Reis and Eddy D. Zandee van Rilland and Poonam Laxmappa Hosamani and Kevin R Keet and Minjoung Go and Evelyn Ling and David B. Larson and Curtis Langlotz and Roxana Daneshjou and Jason Hom and Sanmi Koyejo and Emily Alsentzer and Akshay S. Chaudhari},
  journal={arXiv preprint arXiv:2507.03152},
  year={2025}
}
```