Improve model card: Add pipeline tag, project page, and abstract
This PR improves the model card for MedVAL-4B by:
- Adding the `pipeline_tag: text-classification` to the metadata, which helps with model discoverability on the Hugging Face Hub.
- Including a link to the project page (`https://stanfordmimi.github.io/MedVAL/`) in the "Sources" section, so all project resources are referenced in one place.
- Adding a dedicated "Abstract" section with the paper's abstract to provide a more thorough overview of the model and its context.
README.md CHANGED

```diff
@@ -1,31 +1,41 @@
 ---
-license: mit
+base_model:
+- Qwen/Qwen3-4B
 datasets:
 - stanfordmimi/MedVAL-Bench
 language:
 - en
 - es
+library_name: transformers
+license: mit
 metrics:
 - f1
 - accuracy
-base_model:
-- Qwen/Qwen3-4B
-library_name: transformers
 tags:
 - medical
+pipeline_tag: text-classification
 ---
 
+# MedVAL-4B
+
 **MedVAL-4B** (medical text validator) is a language model fine-tuned to **assess AI-generated medical text** outputs at near **physician-level reliability**.
 
+MedVAL is a self-supervised framework for expert-level validation of AI-generated medical text using language models. The system is designed to evaluate the accuracy and safety of AI-generated medical text across multiple medical tasks. The framework supports both model fine-tuning and evaluation.
+
 
 [![arXiv](https://img.shields.io/badge/arXiv-2507.03152-b31b1b.svg)](https://arxiv.org/abs/2507.03152)
 
 **Figure 1** | **MedVAL test-time workflow**. A generator LM produces an output, and MedVAL assesses the output's factual consistency with the input, while assigning a risk grade and determining its safety for deployment.
 
+## Abstract
+
+With the growing use of language models (LMs) in clinical environments, there is an immediate need to evaluate the accuracy and safety of LM-generated medical text. Currently, such evaluation relies solely on manual physician review. However, detecting errors in LM-generated text is challenging because 1) manual review is costly and 2) expert-composed reference outputs are often unavailable in real-world settings. While the "LM-as-judge" paradigm (an LM evaluating another LM) offers scalable evaluation, even frontier LMs can miss subtle but clinically significant errors. To address these challenges, we propose MedVAL, a self-supervised framework that leverages synthetic data to train evaluator LMs to assess whether LM-generated medical outputs are factually consistent with inputs, without requiring physician labels or reference outputs. To evaluate LM performance, we introduce MedVAL-Bench, a dataset containing 840 outputs annotated by physicians, following a physician-defined taxonomy of risk levels and error categories. Across 6 diverse medical tasks and 10 state-of-the-art LMs spanning open-source, proprietary, and medically adapted models, MedVAL fine-tuning significantly improves (p < 0.001) alignment with physicians on both seen and unseen tasks, increasing average F1 scores from 66% to 83%, with per-sample safety classification scores up to 86%. MedVAL improves the performance of even the best-performing proprietary LM (GPT-4o) by 8%. To support a scalable, risk-aware pathway towards clinical integration, we open-source the 1) [codebase](https://github.com/StanfordMIMI/MedVAL), 2) [MedVAL-Bench](https://huggingface.co/datasets/stanfordmimi/MedVAL-Bench), and 3) [MedVAL-4B](https://huggingface.co/stanfordmimi/MedVAL-4B), the best-performing open-source LM. Our research provides the first evidence of LMs approaching expert-level validation ability for medical text.
+
 # Sources
 
 - **Paper:** [Expert-level validation of AI-generated medical text with scalable language models](https://www.arxiv.org/abs/2507.03152)
 - **Code:** [GitHub](https://github.com/StanfordMIMI/MedVAL)
+- **Project Page:** [MedVAL Project Page](https://stanfordmimi.github.io/MedVAL/)
 - **Train/Test Dataset:** [MedVAL-Bench](https://huggingface.co/datasets/stanfordmimi/MedVAL-Bench)
 
 # Model Details
@@ -104,8 +114,10 @@ Your output fields are:
 Evaluate the output in comparison to the input and determine errors that exhibit factual inconsistency with the input.
 
 Instructions:
-- Output format: `Error 1: <brief explanation in a few words
-
+- Output format: `Error 1: <brief explanation in a few words>\nError 2: ...'
+- Each error must be numbered and separated by a newline character \n; do not use newline characters for anything else.
 - Return `None' if no errors are found.
 - Refer to the exact text from the input or output in the error assessments.
 
@@ -178,8 +190,10 @@ try:
 except ValueError:
     index = 0
 
-thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("
-
+thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
+content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")
 
 print("thinking content:", thinking_content)
 print("content:", content)
```
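For context on the second hunk: the restored instructions pin down a machine-readable output contract for the validator, numbered `Error N:` items separated by newlines, or `None` when no errors are found. As a quick illustration of that contract, here is a hypothetical helper (not part of the model card or the MedVAL codebase) that splits an assessment back into a list:

```python
import re

def parse_error_list(assessment: str) -> list[str]:
    """Split a MedVAL-style assessment into individual error explanations.

    Follows the output contract restored in the diff above: numbered errors
    separated by newlines, or None when no errors are found. Hypothetical
    helper, not part of the model card or the MedVAL codebase.
    """
    cleaned = assessment.strip().strip("`'")
    if cleaned.lower() == "none":
        return []
    errors = []
    for line in assessment.splitlines():
        match = re.match(r"Error\s*\d+:\s*(.*)", line.strip())
        if match:
            errors.append(match.group(1))
    return errors

# Two numbered errors become two explanations; "None" becomes an empty list.
print(parse_error_list("Error 1: dose changed from 5 mg to 50 mg\nError 2: added a finding absent from the input"))
print(parse_error_list("None"))
```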
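The third hunk repairs the `strip("\n")` calls in the model card's usage snippet, which follows the stock Qwen3 recipe of splitting the thinking trace from the final answer at the `</think>` token. For orientation, a minimal end-to-end sketch of that recipe follows; the repo id `stanfordmimi/MedVAL-4B` and the `</think>` token id `151668` are assumptions carried over from the base Qwen3 documentation, not details shown in this diff.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Repo id assumed from the PR context; adjust if the model lives elsewhere.
model_name = "stanfordmimi/MedVAL-4B"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")

# The input/output pair to validate would be formatted into this prompt.
messages = [{"role": "user", "content": "Evaluate the output in comparison to the input ..."}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(**model_inputs, max_new_tokens=1024)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

try:
    # 151668 is the </think> token id in the Qwen3 tokenizer (assumption:
    # the fine-tune keeps the base tokenizer); everything before the last
    # </think> is the thinking trace.
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    index = 0  # no </think> token: treat the whole generation as content

# These are the two lines the hunk above repairs.
thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

print("thinking content:", thinking_content)
print("content:", content)
```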