Commit ad52637 (verified), committed by fernandogd97 · 1 parent: 3f28b6e

Update README.md

Files changed (1): README.md (+49 -29)
@@ -1,50 +1,70 @@
  ---
- license: mit
- language: es
  tags:
- - biomedical
- - spanish
- - entity-linking
- - sapbert
- - bi-encoder
- - umls
- - clinical
  ---
 
- # ClinLinker
 
- **ClinLinker** is a Spanish biomedical bi-encoder trained following the SapBERT approach using only concepts from the Spanish UMLS. This model is designed for medical entity linking in clinical texts written in Spanish.
 
- ## 🧠 Training Details
 
- - Base model: `PlanTL-GOB-ES/roberta-base-biomedical-clinical-es`
- - Data: UMLS Spanish concepts
- - Strategy: No hierarchical knowledge, only direct synonym pairs (term ↔ CUI)
 
- ## 📚 Citation
-
- > Gallego, F., López-García, G., Gasco-Sánchez, L., Krallinger, M., Veredas, F.J. (2024). ClinLinker: Medical Entity Linking of Clinical Concept Mentions in Spanish. In: Franco, L., de Mulatier, C., Paszynski, M., Krzhizhanovskaya, V.V., Dongarra, J.J., Sloot, P.M.A. (eds) Computational Science – ICCS 2024. Lecture Notes in Computer Science, vol 14836. Springer, Cham. https://doi.org/10.1007/978-3-031-63775-9_19
-
- ## 💡 Recommended Usage
 
- We recommend using this model together with:
 
- - [Faiss](https://github.com/facebookresearch/faiss) for similarity search
- - Or the `FaissEncoder` utility available at [ICB-UMA/KnowledgeGraph](https://github.com/ICB-UMA/KnowledgeGraph)
 
- ## 🧪 Example: Encoding a Spanish Mention
 
  ```python
- from transformers import AutoTokenizer, AutoModel
  import torch
 
- tokenizer = AutoTokenizer.from_pretrained("ICB-UMA/ClinLinker")
  model = AutoModel.from_pretrained("ICB-UMA/ClinLinker")
 
  mention = "insuficiencia renal aguda"
- inputs = tokenizer(mention, return_tensors="pt", padding=True, truncation=True)
  with torch.no_grad():
      outputs = model(**inputs)
- embedding = outputs.last_hidden_state[:, 0, :] # CLS token
 
- print(embedding.shape)
 
  ---
+ license: apache-2.0
+ language:
+ - es
+ base_model:
+ - PlanTL-GOB-ES/roberta-base-biomedical-clinical-es
  tags:
+ - medical
+ - spanish
+ - bi-encoder
+ - entity-linking
+ - sapbert
+ - umls
+ - snomed-ct
  ---
 
+ # **ClinLinker**
 
+ ## Model Description
 
+ ClinLinker is a state-of-the-art bi-encoder model for medical entity linking (MEL) in Spanish, optimized for clinical domain tasks. It enriches concept representations by incorporating synonyms from the UMLS and SNOMED-CT ontologies. The model was trained with a contrastive-learning strategy using hard negative mining and multi-similarity loss.
 
+ ## 💡 Intended Use
+ - **Domain**: Spanish Clinical NLP
+ - **Tasks**: Entity linking (diseases, symptoms, procedures) to SNOMED-CT
+ - **Evaluated On**: DisTEMIST, MedProcNER, SympTEMIST
+ - **Users**: Researchers and practitioners working in clinical NLP
 
+ ## Performance Summary (Top-25 Accuracy)
 
+ | Model               | DisTEMIST | MedProcNER | SympTEMIST |
+ |---------------------|-----------|------------|------------|
+ | **ClinLinker**      | **0.845** | **0.898**  | **0.909**  |
+ | ClinLinker-KB-P     | 0.853     | 0.891      | 0.918      |
+ | ClinLinker-KB-GP    | 0.864     | 0.901      | 0.922      |
+ | SapBERT-XLM-R-large | 0.800     | 0.850      | 0.827      |
+ | RoBERTa biomedical  | 0.600     | 0.668      | 0.609      |
 
+ *Results correspond to the cleaned gold-standard version (no "NO CODE" or "COMPOSITE").*
 
+ ## 🧪 Usage
 
  ```python
+ from transformers import AutoModel, AutoTokenizer
  import torch
 
  model = AutoModel.from_pretrained("ICB-UMA/ClinLinker")
+ tokenizer = AutoTokenizer.from_pretrained("ICB-UMA/ClinLinker")
 
  mention = "insuficiencia renal aguda"
+ inputs = tokenizer(mention, return_tensors="pt")
  with torch.no_grad():
      outputs = model(**inputs)
+ embedding = outputs.last_hidden_state[:, 0, :]
+ print(embedding.shape)
+ ```
+
+ For scalable retrieval, use [Faiss](https://github.com/facebookresearch/faiss) or the [`FaissEncoder`](https://github.com/ICB-UMA/KnowledgeGraph) class.
+
+ ## Limitations
+ - The model is optimized for Spanish clinical data and may underperform outside this domain.
+ - Expert validation is advised in critical applications.
+
+ ## 📚 Citation
+
+ > Gallego, F., López-García, G., Gasco-Sánchez, L., Krallinger, M., Veredas, F.J. (2024). ClinLinker: Medical Entity Linking of Clinical Concept Mentions in Spanish. In: Franco, L., de Mulatier, C., Paszynski, M., Krzhizhanovskaya, V.V., Dongarra, J.J., Sloot, P.M.A. (eds) Computational Science – ICCS 2024. Lecture Notes in Computer Science, vol 14836. Springer, Cham. https://doi.org/10.1007/978-3-031-63775-9_19
+
+ ## Authors
 
+ Fernando Gallego, Guillermo López-García, Luis Gasco-Sánchez, Martin Krallinger, Francisco J Veredas
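
The updated README pairs the embedding snippet with a nearest-neighbour search and reports Top-25 accuracy. The sketch below is not part of the commit; it only illustrates how those two pieces might fit together, assuming the usual definition of top-k accuracy (a mention counts as correct when its gold code appears among the k highest-ranked candidates). The helper names `cosine`, `top_k`, and `top_k_accuracy`, the toy vectors, and the placeholder codes are all hypothetical. In practice the vectors would come from `outputs.last_hidden_state[:, 0, :]`, and a Faiss index would likely replace the linear scan.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: Sequence[float],
          index: Dict[str, Sequence[float]],
          k: int = 25) -> List[Tuple[str, float]]:
    """Rank every concept in `index` by similarity to `query` and keep the best k.

    This is a brute-force linear scan, used here only as a stand-in for an
    approximate or exact vector index such as Faiss.
    """
    scored = [(code, cosine(query, vec)) for code, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


def top_k_accuracy(ranked_codes: List[List[str]],
                   gold_codes: List[str],
                   k: int = 25) -> float:
    """Fraction of mentions whose gold code is among the first k ranked candidates."""
    if not gold_codes:
        return 0.0
    hits = sum(1 for ranked, gold in zip(ranked_codes, gold_codes)
               if gold in ranked[:k])
    return hits / len(gold_codes)


# Toy 3-dimensional vectors with placeholder codes (not real SNOMED-CT identifiers).
toy_index = {
    "CODE_A": [1.0, 0.0, 0.0],
    "CODE_B": [0.0, 1.0, 0.0],
    "CODE_C": [0.7, 0.7, 0.0],
}
toy_query = [0.9, 0.1, 0.0]

ranking = [code for code, _ in top_k(toy_query, toy_index, k=2)]
print(ranking)
print(top_k_accuracy([ranking], ["CODE_C"], k=2))
```

With real embeddings, normalising the vectors once and using an inner-product index gives the same ranking as cosine similarity, which is one common reason to move from a scan like this to a dedicated index.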