Update to complete model with 57,768 embedded prototypes
README.md
CHANGED
@@ -8,128 +8,197 @@ tags:
- medical-coding
- few-shot-learning
- prototypical-networks
- deployment-ready
- self-contained
language:
- en
metrics:
- accuracy
library_name: transformers
pipeline_tag: text-classification
widget:
- text: "Patient presents with chest pain and shortness of breath. ECG shows abnormalities."
---

# MediCoder AI v4 Complete 🏥✨

## Model Description

**MediCoder AI v4 Complete** is a fully self-contained medical coding system with **57,768 embedded prototypes** that predicts ICD/medical codes from clinical notes with **46.3% Top-1 accuracy**. This model requires **no external dataset** for inference.

## 🎯 Performance

- **Top-1 Accuracy**: 46.3%
- **Top-5 Accuracy**: ~54% (estimated; see the Top-K sketch below)
- **Medical Codes**: 57,768 supported codes
- **Prototypes**: 57,768 embedded prototype vectors
- **Deployment**: Fully self-contained
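Here, Top-K accuracy means the correct code appears among the K highest-scoring prototypes for a note. A minimal sketch of how such a metric can be computed; `similarities` and `true_code_ids` are illustrative inputs, not artifacts shipped in the checkpoint:

```python
import torch

def top_k_accuracy(similarities: torch.Tensor, true_code_ids: torch.Tensor, k: int) -> float:
    """Fraction of queries whose true code ranks among the top-k prototypes.

    similarities:  [num_queries, num_codes] query-to-prototype scores
    true_code_ids: [num_queries] index of the correct code for each query
    """
    top_k = similarities.topk(k, dim=1).indices               # [num_queries, k]
    hits = (top_k == true_code_ids.unsqueeze(1)).any(dim=1)   # [num_queries]
    return hits.float().mean().item()
```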

## ✨ What's New in Complete Version

- ✅ **57,768 Prototypes Embedded**: All medical codes have learned representations
- ✅ **No Dataset Required**: Completely self-contained for deployment
- ✅ **Production Ready**: Direct inference without external dependencies
- ✅ **Full 46.3% Accuracy**: Complete performance preservation
- ✅ **Memory Optimized**: Efficient prototype storage and retrieval

## 🏗️ Architecture

- **Base Model**: Bio_ClinicalBERT (specialized for medical text)
- **Approach**: Few-shot Prototypical Networks with Embedded Prototypes
- **Embedding Dimension**: 768
- **Prototype Storage**: 57,768 × 768 learned medical code representations (footprint estimated below)
- **Optimization**: Conservative incremental improvements (Phase 2)
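For scale, the prototype matrix alone stays modest; a quick back-of-envelope sketch, assuming float32 storage (the checkpoint's actual dtype is not stated in this card):

```python
num_codes, dim = 57_768, 768
matrix_bytes = num_codes * dim * 4  # 4 bytes per value, assuming float32
print(f"Prototype matrix: ~{matrix_bytes / 1024**2:.0f} MB")  # ~169 MB
```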

## 🚀 Quick Start

```python
import torch
from transformers import AutoTokenizer

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("sshan95/medicoder-ai-v4-model")

# Load the checkpoint with embedded prototypes
checkpoint = torch.load("pytorch_model.bin", map_location="cpu")
prototypes = checkpoint['prototypes']            # Shape: [57768, 768]
prototype_codes = checkpoint['prototype_codes']  # Shape: [57768]

print(f"Loaded {prototypes.shape[0]:,} medical code prototypes!")
```
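One caveat that comes from PyTorch itself rather than this card: since PyTorch 2.6, `torch.load` defaults to `weights_only=True`, which can reject checkpoints that carry non-tensor objects (such as the `config` entry described below). If loading fails and you trust the file's source, pass the flag explicitly:

```python
# weights_only=False: only use with checkpoints from a trusted source
checkpoint = torch.load("pytorch_model.bin", map_location="cpu", weights_only=False)
```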

## 📖 Usage Example

```python
import torch
from transformers import AutoTokenizer

# Initialize
tokenizer = AutoTokenizer.from_pretrained("sshan95/medicoder-ai-v4-model")
checkpoint = torch.load("pytorch_model.bin", map_location="cpu")

# Load the model architecture (your ConservativePrototypicalNetwork; a sketch follows below)
model = load_your_model_architecture()
model.load_state_dict(checkpoint['model_state_dict'])
model.eval()

# Load the embedded prototypes
prototypes = checkpoint['prototypes']
prototype_codes = checkpoint['prototype_codes']

# Example prediction
clinical_note = "Patient presents with acute chest pain, diaphoresis, and dyspnea..."

# Tokenize
inputs = tokenizer(clinical_note, return_tensors="pt", truncation=True, max_length=512)

# Get the query embedding
with torch.no_grad():
    query_embedding = model.encode_text(inputs['input_ids'], inputs['attention_mask'])

# Compute similarities to all prototypes
similarities = torch.mm(query_embedding, prototypes.t())

# Get the top-5 predictions
top_5_scores, top_5_indices = torch.topk(similarities, k=5)
predicted_codes = prototype_codes[top_5_indices[0]]

print("Top 5 predicted medical codes:", predicted_codes.tolist())
```
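The card does not ship the `ConservativePrototypicalNetwork` class, so `load_your_model_architecture()` is a placeholder you must supply. A minimal sketch of what such an encoder could look like; mean pooling over Bio_ClinicalBERT token embeddings is an assumption here, not the confirmed design, and the real class must match the checkpoint's `state_dict` keys:

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class PrototypicalEncoder(nn.Module):
    """Hypothetical stand-in for the unpublished ConservativePrototypicalNetwork."""

    def __init__(self, base_model: str = "emilyalsentzer/Bio_ClinicalBERT"):
        super().__init__()
        self.bert = AutoModel.from_pretrained(base_model)

    def encode_text(self, input_ids: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        # Mean-pool token embeddings, ignoring padded positions
        hidden = self.bert(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        mask = attention_mask.unsqueeze(-1).float()
        return (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
```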

## 📋 Model Contents

When you load this model, you get:

```python
checkpoint = torch.load("pytorch_model.bin")

# Available keys:
checkpoint['model_state_dict']   # Neural network weights
checkpoint['prototypes']         # [57768, 768] prototype embeddings
checkpoint['prototype_codes']    # [57768] medical code mappings
checkpoint['accuracies']         # Performance metrics
checkpoint['config']             # Training configuration
```

## 🎯 Key Features

### ✅ **Self-Contained Deployment**
- No external dataset required
- All medical knowledge embedded in prototypes
- Direct inference capability

### ✅ **Production Ready**
- Optimized for CPU and GPU inference
- Memory-efficient prototype storage
- Stable, tested architecture

### ✅ **Full Performance**
- Complete 46.3% Top-1 accuracy preserved
- All 57,768 medical codes supported
- Conservative optimization approach

## 📊 Training Details

- **Base Model**: Bio_ClinicalBERT
- **Training Data**: Clinical notes with medical code annotations
- **Approach**: Few-shot prototypical learning (sketched below)
- **Optimization**: Conservative incremental improvements
- **Phase 1**: Enhanced embeddings (+5.7pp)
- **Phase 2**: Ensemble prototypes (+1.1pp)
- **Final Step**: Prototype extraction and embedding
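For readers new to the approach: in prototypical networks, each class (here, each medical code) is summarized by a prototype, classically the mean of its support-example embeddings, and queries are classified by similarity to the prototypes. A generic sketch of that step, illustrating the standard technique rather than the authors' exact training code:

```python
import torch

def build_prototypes(embeddings: torch.Tensor, code_ids: torch.Tensor, num_codes: int) -> torch.Tensor:
    """Average [num_examples, 768] support embeddings per code into [num_codes, 768] prototypes."""
    prototypes = torch.zeros(num_codes, embeddings.shape[1])
    counts = torch.zeros(num_codes)
    prototypes.index_add_(0, code_ids, embeddings)          # sum embeddings per code
    counts.index_add_(0, code_ids, torch.ones(len(code_ids)))
    return prototypes / counts.clamp(min=1).unsqueeze(1)    # mean per code
```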

## 🚀 Deployment Options

### **Option 1: Hugging Face Spaces**
Perfect for demos and testing with built-in UI.

### **Option 2: Local Deployment**
Download and run locally for production use.

### **Option 3: API Integration**
Integrate into existing healthcare systems; a minimal sketch follows below.
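For Option 3, one possible shape of an HTTP wrapper. FastAPI is an assumed extra dependency (the card lists only PyTorch, Transformers, and NumPy), and `predict_codes` is a hypothetical helper wrapping the tokenizer/model/prototype pipeline from the usage example above:

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Note(BaseModel):
    text: str

@app.post("/predict")
def predict(note: Note):
    # predict_codes: hypothetical wrapper around the usage example's pipeline
    return {"predicted_codes": predict_codes(note.text, top_k=5)}
```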

## ⚠️ Usage Guidelines

- **Purpose**: Research and educational use, medical coding assistance
- **Validation**: Always require human expert validation
- **Scope**: English clinical text, general medical domains
- **Limitations**: Performance varies by medical specialty

## 🌍 Real-world Impact

This model helps by:
- **Reducing coding time**: Hours → Minutes
- **Improving consistency**: Standardized predictions
- **Narrowing choices**: 57,768 codes → Top suggestions
- **Supporting workflow**: Integration-ready format

## 🔬 Technical Specifications

- **Model Size**: ~1.2 GB (with prototypes)
- **Inference Speed**: 3-8 seconds (CPU), <1 second (GPU)
- **Memory Usage**: ~3-4 GB during inference
- **Dependencies**: PyTorch, Transformers, NumPy

## 📝 Citation

```bibtex
@misc{medicoder-ai-v4-complete,
  title={MediCoder AI v4 Complete: Self-Contained Medical Coding with Embedded Prototypes},
  author={MediCoder Team},
  year={2025},
  url={https://huggingface.co/sshan95/medicoder-ai-v4-model},
  note={57,768 embedded prototypes, 46.3% Top-1 accuracy}
}
```

## 👥 Community

Built for the medical coding community. For questions, issues, or collaborations, please use the repository discussions.

---

**🚀 Ready for production medical coding assistance!**

*This complete model contains all necessary components for deployment without external dependencies.*