sshan95 committed on
Commit cdb68f1 · verified
1 Parent(s): 8334eaf

Update to complete model with 57,768 embedded prototypes

Files changed (1)
  1. README.md +136 -67
README.md CHANGED
@@ -8,128 +8,197 @@ tags:
8
  - medical-coding
9
  - few-shot-learning
10
  - prototypical-networks
 
 
11
  language:
12
  - en
13
  metrics:
14
  - accuracy
15
  library_name: transformers
16
  pipeline_tag: text-classification
 
 
17
  ---
18
 
19
- # MediCoder AI v4 🏥
20
 
21
  ## Model Description
22
 
23
- MediCoder AI v4 is a state-of-the-art medical coding system that predicts ICD/medical codes from clinical notes with **46.3% Top-1 accuracy**. Built on Bio_ClinicalBERT with few-shot prototypical learning, it can handle ~57,000 medical codes.
24
 
25
  ## 🎯 Performance
26
 
27
- - **Top-5 Accuracy**: ~54%
28
- - **Improvement**: +6.8 percentage points over baseline
29
- - **Medical Codes**: ~57,000 supported codes
30
 
31
  ## 🏗️ Architecture
32
 
33
  - **Base Model**: Bio_ClinicalBERT (specialized for medical text)
34
- - **Approach**: Few-shot Prototypical Networks
35
  - **Embedding Dimension**: 768
 
36
  - **Optimization**: Conservative incremental improvements (Phase 2)
37
 
38
- ## 🚀 Usage
39
 
40
  ```python
41
  import torch
42
  from transformers import AutoTokenizer
43
 
44
- # Load model and tokenizer
45
- tokenizer = AutoTokenizer.from_pretrained("your-username/medicoder-ai-v4-model")
46
- model = torch.load("pytorch_model.bin", map_location="cpu")
47
 
48
- # Example usage
49
- clinical_note = "Patient presents with chest pain and shortness of breath..."
50
 
51
  # Tokenize
52
- inputs = tokenizer(clinical_note, return_tensors="pt",
53
- truncation=True, max_length=512)
54
 
55
- # Get predictions (top-5 medical codes)
56
  with torch.no_grad():
57
-     embeddings = model.encode_text(inputs['input_ids'], inputs['attention_mask'])
58
- similarities = torch.mm(embeddings, model.prototypes.t())
59
- top_codes = similarities.topk(5).indices
 
60
 
61
- print("Top 5 predicted medical codes:", top_codes)
 
 
 
 
62
  ```
63
 
64
- ## 📊 Training Details
65
 
66
- - **Training Data**: Medical clinical notes with associated codes
67
- - **Training Approach**: Few-shot learning with prototypical networks
68
- - **Optimization Strategy**: Conservative incremental improvements
69
- - **Phases**:
70
-   - Phase 1: Enhanced embeddings and pooling (+5.7pp)
71
-   - Phase 2: Ensemble prototypes with attention (+1.1pp)
72
 
73
- ## 🎯 Use Cases
74
 
75
- - **Medical Coding Assistance**: Help medical coders find relevant codes
76
- - **Clinical Decision Support**: Suggest appropriate diagnostic codes
77
- - **Healthcare Analytics**: Automated coding for large datasets
78
- - **Research**: Medical text analysis and categorization
79
 
80
- ## ⚠️ Limitations
 
 
 
81
 
82
- - Designed for English clinical text
83
- - Performance varies by medical specialty
84
- - Requires domain expertise for validation
85
- - Not a replacement for professional medical coding
86
 
87
- ## 📋 Model Details
 
 
 
88
 
89
- - **Model Size**: ~670 MB
90
- - **Inference Speed**: 3-8 seconds (CPU), <1 second (GPU)
91
- - **Memory Requirements**: ~2-3 GB during inference
92
- - **Self-contained**: No external dataset dependencies
93
 
94
- ## 🔬 Technical Details
 
 
 
 
 
 
95
 
96
- - **Few-shot Learning**: Learns from limited examples per medical code
97
- - **Prototypical Networks**: Creates representative embeddings for each code
98
- - **Ensemble Prototypes**: Multiple prototypes per code for better coverage
99
- - **Attention Aggregation**: Smart combination of multiple examples
100
 
101
- ## 📈 Evaluation
 
102
 
103
- Evaluated on held-out medical coding dataset with standard metrics:
104
- - Precision, Recall, F1-score
105
- - Top-K accuracy (K=1,3,5,10,20)
106
- - Comparison with baseline methods
107
 
108
- ## 🏥 Real-world Impact
 
109
 
110
- This model helps medical professionals by:
111
- - Reducing coding time from hours to minutes
112
- - Improving coding accuracy and consistency
113
- - Narrowing 57,000+ codes to top suggestions
114
- - Supporting healthcare workflow automation
115
 
116
- ## 📜 Citation
 
 
 
117
 
118
- If you use this model, please cite:
119
 
120
- ```
121
- @misc{medicoder-ai-v4,
122
- title={MediCoder AI v4: Few-shot Medical Coding with Prototypical Networks},
123
- author={Your Name},
124
  year={2025},
125
- url={https://huggingface.co/your-username/medicoder-ai-v4-model}
 
126
  }
127
  ```
128
 
129
- ## 📞 Contact
130
 
131
- For questions or collaborations, please reach out via the model repository issues.
132
 
133
  ---
134
 
135
- *Built with ❤️ for the medical community*
 
 
 
8
  - medical-coding
9
  - few-shot-learning
10
  - prototypical-networks
11
+ - deployment-ready
12
+ - self-contained
13
  language:
14
  - en
15
  metrics:
16
  - accuracy
17
  library_name: transformers
18
  pipeline_tag: text-classification
19
+ widget:
20
+ - text: "Patient presents with chest pain and shortness of breath. ECG shows abnormalities."
21
  ---
22
 
23
+ # MediCoder AI v4 Complete 🏥✨
24
 
25
  ## Model Description
26
 
27
+ **MediCoder AI v4 Complete** is a fully self-contained medical coding system with **57,768 embedded prototypes** that predicts ICD/medical codes from clinical notes with **46.3% Top-1 accuracy**. This model requires **no external dataset** for inference.
28
 
29
  ## 🎯 Performance
30
 
31
+ - **Top-1 Accuracy**: 46.3%
32
+ - **Top-5 Accuracy**: ~54% (estimated)
33
+ - **Medical Codes**: 57,768 supported codes
34
+ - **Prototypes**: 57,768 embedded prototype vectors
35
+ - **Deployment**: Fully self-contained
36
+
37
+ ## ✨ What's New in Complete Version
38
+
39
+ - ✅ **57,768 Prototypes Embedded**: All medical codes have learned representations
40
+ - ✅ **No Dataset Required**: Completely self-contained for deployment
41
+ - ✅ **Production Ready**: Direct inference without external dependencies
42
+ - ✅ **Full 46.3% Accuracy**: Complete performance preservation
43
+ - ✅ **Memory Optimized**: Efficient prototype storage and retrieval
44
 
45
  ## 🏗️ Architecture
46
 
47
  - **Base Model**: Bio_ClinicalBERT (specialized for medical text)
48
+ - **Approach**: Few-shot Prototypical Networks with Embedded Prototypes
49
  - **Embedding Dimension**: 768
50
+ - **Prototype Storage**: 57,768 × 768 learned medical code representations
51
  - **Optimization**: Conservative incremental improvements (Phase 2)
52
 
53
+ ## 🚀 Quick Start
54
 
55
  ```python
56
  import torch
57
  from transformers import AutoTokenizer
58
 
59
+ # Load the complete model
60
+ tokenizer = AutoTokenizer.from_pretrained("sshan95/medicoder-ai-v4-model")
61
+
62
+ # Load model with embedded prototypes
63
+ checkpoint = torch.load("pytorch_model.bin", map_location="cpu")
64
+ prototypes = checkpoint['prototypes'] # Shape: [57768, 768]
65
+ prototype_codes = checkpoint['prototype_codes'] # Shape: [57768]
66
+
67
+ print(f"Loaded {prototypes.shape[0]:,} medical code prototypes!")
68
+ ```
69
 
70
+ ## 📊 Usage Example
71
+
72
+ ```python
73
+ import torch
74
+ import torch.nn.functional as F
75
+ from transformers import AutoTokenizer
76
+
77
+ # Initialize
78
+ tokenizer = AutoTokenizer.from_pretrained("sshan95/medicoder-ai-v4-model")
79
+ checkpoint = torch.load("pytorch_model.bin", map_location="cpu")
80
+
81
+ # Load model architecture (your ConservativePrototypicalNetwork)
82
+ model = load_your_model_architecture()
83
+ model.load_state_dict(checkpoint['model_state_dict'])
84
+
85
+ # Load embedded prototypes
86
+ prototypes = checkpoint['prototypes']
87
+ prototype_codes = checkpoint['prototype_codes']
88
+
89
+ # Example prediction
90
+ clinical_note = "Patient presents with acute chest pain, diaphoresis, and dyspnea..."
91
 
92
  # Tokenize
93
+ inputs = tokenizer(clinical_note, return_tensors="pt", truncation=True, max_length=512)
 
94
 
95
+ # Get embedding
96
  with torch.no_grad():
97
+     query_embedding = model.encode_text(inputs['input_ids'], inputs['attention_mask'])
98
+
99
+ # Compute similarities to all prototypes
100
+ similarities = torch.mm(query_embedding, prototypes.t())
101
 
102
+ # Get top-5 predictions
103
+ top_5_scores, top_5_indices = torch.topk(similarities, k=5)
104
+ predicted_codes = prototype_codes[top_5_indices[0]]
105
+
106
+ print("Top 5 predicted medical codes:", predicted_codes.tolist())
107
  ```
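
The prototype lookup in the usage example reduces to scoring one query vector against every prototype and keeping the best k. The sketch below reproduces that step in plain Python on toy 3-dimensional vectors, so it runs without the checkpoint. `rank_codes`, `l2_normalize`, the toy prototypes, and the code labels are all hypothetical names for illustration. It also L2-normalizes both sides (cosine similarity), which is an assumption: the usage example takes raw dot products with `torch.mm`, and the README does not say whether `encode_text` returns normalized embeddings.

```python
import heapq
import math


def l2_normalize(vec):
    """Scale a vector to unit length (an all-zero vector is returned as-is)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return list(vec) if norm == 0 else [x / norm for x in vec]


def rank_codes(query, prototypes, codes, k=5):
    """Return the k (code, cosine similarity) pairs most similar to the query."""
    q = l2_normalize(query)
    scored = []
    for code, proto in zip(codes, prototypes):
        p = l2_normalize(proto)
        scored.append((code, sum(a * b for a, b in zip(q, p))))
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Toy stand-ins for checkpoint['prototypes'] and checkpoint['prototype_codes'].
toy_prototypes = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]]
toy_codes = ["CODE_A", "CODE_B", "CODE_C"]

top = rank_codes([0.9, 0.1, 0.0], toy_prototypes, toy_codes, k=2)
print([code for code, _ in top])
```

With tensors, the equivalent normalization would be `torch.nn.functional.normalize(x, dim=-1)` applied to both the query and the prototype matrix before `torch.mm`.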
108
 
109
+ ## 📋 Model Contents
110
 
111
+ When you load this model, you get:
 
 
 
 
 
112
 
113
+ ```python
114
+ checkpoint = torch.load("pytorch_model.bin", map_location="cpu")
115
+
116
+ # Available keys:
117
+ checkpoint['model_state_dict'] # Neural network weights
118
+ checkpoint['prototypes'] # [57768, 768] prototype embeddings
119
+ checkpoint['prototype_codes'] # [57768] medical code mappings
120
+ checkpoint['accuracies'] # Performance metrics
121
+ checkpoint['config'] # Training configuration
122
+ ```
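
Because the checkpoint is a plain dictionary, a quick sanity check before inference can catch a truncated or mismatched file. The following is a minimal sketch, assuming only the five keys listed above; `check_checkpoint` and the toy dictionary are hypothetical and not part of the repository.

```python
EXPECTED_KEYS = (
    "model_state_dict",
    "prototypes",
    "prototype_codes",
    "accuracies",
    "config",
)


def check_checkpoint(checkpoint):
    """Raise ValueError if keys are missing or prototypes and codes disagree in length."""
    missing = [key for key in EXPECTED_KEYS if key not in checkpoint]
    if missing:
        raise ValueError(f"checkpoint is missing keys: {missing}")
    n_protos = len(checkpoint["prototypes"])
    n_codes = len(checkpoint["prototype_codes"])
    if n_protos != n_codes:
        raise ValueError(f"{n_protos} prototypes but {n_codes} codes")
    return n_protos


# Toy dict standing in for torch.load("pytorch_model.bin", map_location="cpu").
toy_checkpoint = {
    "model_state_dict": {},
    "prototypes": [[0.1, 0.2], [0.3, 0.4]],
    "prototype_codes": ["CODE_A", "CODE_B"],
    "accuracies": {},
    "config": {},
}
print(check_checkpoint(toy_checkpoint))
```

`len()` returns the size of the first dimension for a PyTorch tensor as well, so the same check applies to the real `[57768, 768]` prototype matrix.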
123
 
124
+ ## 🎯 Key Features
 
 
 
125
 
126
+ ### ✅ **Self-Contained Deployment**
127
+ - No external dataset required
128
+ - All medical knowledge embedded in prototypes
129
+ - Direct inference capability
130
 
131
+ ### ✅ **Production Ready**
132
+ - Optimized for CPU and GPU inference
133
+ - Memory-efficient prototype storage
134
+ - Stable, tested architecture
135
 
136
+ ### ✅ **Full Performance**
137
+ - Complete 46.3% Top-1 accuracy preserved
138
+ - All 57,768 medical codes supported
139
+ - Conservative optimization approach
140
 
141
+ ## 📊 Training Details
 
 
 
142
 
143
+ - **Base Model**: Bio_ClinicalBERT
144
+ - **Training Data**: Clinical notes with medical code annotations
145
+ - **Approach**: Few-shot prototypical learning
146
+ - **Optimization**: Conservative incremental improvements
147
+ - **Phase 1**: Enhanced embeddings (+5.7pp)
148
+ - **Phase 2**: Ensemble prototypes (+1.1pp)
149
+ - **Final Step**: Prototype extraction and embedding
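
For readers unfamiliar with prototypical networks: in the standard formulation, a class prototype is the mean of the embeddings of that class's support examples. The sketch below shows only that baseline on toy 2-dimensional vectors. It is not the extraction procedure used for this model, which the README describes as using ensemble prototypes; `build_prototypes` and the toy data are hypothetical.

```python
def build_prototypes(embeddings, labels):
    """Average the embedding vectors that share a label into one prototype per label."""
    sums = {}
    counts = {}
    for vec, label in zip(embeddings, labels):
        if label not in sums:
            sums[label] = [0.0] * len(vec)
            counts[label] = 0
        sums[label] = [s + x for s, x in zip(sums[label], vec)]
        counts[label] += 1
    return {
        label: [s / counts[label] for s in total]
        for label, total in sums.items()
    }


# Toy support set: two examples for CODE_A, one for CODE_B.
toy_embeddings = [[1.0, 0.0], [3.0, 2.0], [0.0, 4.0]]
toy_labels = ["CODE_A", "CODE_A", "CODE_B"]

toy_prototypes_by_code = build_prototypes(toy_embeddings, toy_labels)
print(toy_prototypes_by_code)
```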
150
 
151
+ ## 🚀 Deployment Options
 
 
 
152
 
153
+ ### **Option 1: Hugging Face Spaces**
154
+ Perfect for demos and testing with built-in UI.
155
 
156
+ ### **Option 2: Local Deployment**
157
+ Download and run locally for production use.
 
 
158
 
159
+ ### **Option 3: API Integration**
160
+ Integrate into existing healthcare systems.
161
 
162
+ ## ⚠️ Usage Guidelines
 
 
 
 
163
 
164
+ - **Purpose**: Research and educational use, medical coding assistance
165
+ - **Validation**: Always require human expert validation
166
+ - **Scope**: English clinical text, general medical domains
167
+ - **Limitations**: Performance varies by medical specialty
168
 
169
+ ## 📈 Real-world Impact
170
 
171
+ This model helps by:
172
+ - **Reducing coding time**: Hours → Minutes
173
+ - **Improving consistency**: Standardized predictions
174
+ - **Narrowing choices**: 57,768 codes → Top suggestions
175
+ - **Supporting workflow**: Integration-ready format
176
+
177
+ ## 🔬 Technical Specifications
178
+
179
+ - **Model Size**: ~1.2 GB (with prototypes)
180
+ - **Inference Speed**: 3-8 seconds (CPU), <1 second (GPU)
181
+ - **Memory Usage**: ~3-4 GB during inference
182
+ - **Dependencies**: PyTorch, Transformers, NumPy
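
As a rough guide to how much of the file the prototype matrix itself occupies, the arithmetic below multiplies out the 57,768 × 768 shape for two common storage widths. The data type is an assumption: the README does not state whether the prototypes are stored as float32 or float16.

```python
NUM_PROTOTYPES = 57_768
EMBEDDING_DIM = 768


def prototype_bytes(bytes_per_value):
    """Raw size in bytes of a dense NUM_PROTOTYPES x EMBEDDING_DIM matrix."""
    return NUM_PROTOTYPES * EMBEDDING_DIM * bytes_per_value


# Assumed storage widths; the actual dtype is not given in the README.
for name, width in (("float32", 4), ("float16", 2)):
    size = prototype_bytes(width)
    print(f"{name}: {size:,} bytes = {size / 2**20:.1f} MiB")
```

At float32 this comes to 177,463,296 bytes, about 169 MiB.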
183
+
184
+ ## 📜 Citation
185
+
186
+ ```bibtex
187
+ @misc{medicoder-ai-v4-complete,
188
+ title={MediCoder AI v4 Complete: Self-Contained Medical Coding with Embedded Prototypes},
189
+ author={MediCoder Team},
190
  year={2025},
191
+ url={https://huggingface.co/sshan95/medicoder-ai-v4-model},
192
+ note={57,768 embedded prototypes, 46.3% Top-1 accuracy}
193
  }
194
  ```
195
 
196
+ ## 🏥 Community
197
 
198
+ Built for the medical coding community. For questions, issues, or collaborations, please use the repository discussions.
199
 
200
  ---
201
 
202
+ **🚀 Ready for production medical coding assistance!**
203
+
204
+ *This complete model contains all necessary components for deployment without external dependencies.*