XiaoEnn committed c8c36dd (verified; parent: 4e20965): Update README.md
Files changed (1): README.md (+172, -3)
---
license: apache-2.0
---

# Herberta: A Pretrained Model for TCM Herbal Medicine and Downstream Tasks

**Tags**:
- Pretrain_Model
- transformers
- TCM
- herberta
- text embedding

**License**: Apache-2.0
**Inference**: true
**Language**: zh, en
**Base Model**: hfl/chinese-roberta-wwm-ext
**Library Name**: transformers

---

## Introduction

Herberta is a pre-trained model developed by the Angelpro Team to advance representation learning and modeling for Traditional Chinese Medicine (TCM). Built upon the **chinese-roberta-wwm-ext-large** model, Herberta is further pre-trained with a masked language modeling (MLM) objective on **700 ancient books (538.95M)** and **48 modern Chinese medicine textbooks (54M)**, yielding a robust model for embedding generation and TCM-specific downstream tasks.

We named the model "Herberta" by combining "Herb" and "RoBERTa" to signal its focus on herbal medicine research. Herberta is well suited for applications such as:

- **Encoder for Herbal Formulas**: Generating meaningful embeddings for TCM formulations (a short sketch follows this list).
- **Domain-Specific Word Embedding**: Serving the Chinese medicine text domain.
- **Support for TCM Downstream Tasks**: Including classification, labeling, and more.

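As a small illustration of the first use case, the sketch below encodes two formula descriptions and compares them by cosine similarity. The `embed` helper, the example formulas, and the mean-pooling choice are illustrative assumptions (mirroring the Quickstart at the end of this card), not an officially prescribed pipeline.

```python
# Illustrative only: compare two TCM formula descriptions by cosine similarity
# of their Herberta sentence embeddings. The embed() helper is hypothetical.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("XiaoEnn/herberta")
model = AutoModel.from_pretrained("XiaoEnn/herberta")

def embed(text: str) -> torch.Tensor:
    # Mean-pool the last hidden states into a single sentence vector.
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state.mean(dim=1).squeeze(0)

formula_a = embed("桂枝汤：桂枝、芍药、生姜、大枣、甘草。")  # Guizhi Decoction
formula_b = embed("麻黄汤：麻黄、桂枝、杏仁、甘草。")        # Mahuang Decoction
print("cosine similarity:", torch.cosine_similarity(formula_a, formula_b, dim=0).item())
```
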
---

## Pretraining Experiments

### Dataset

| Data Type                | Quantity         | Data Size |
|--------------------------|------------------|-----------|
| **Ancient TCM Books**    | 700 books        | ~538.95M  |
| **Modern TCM Textbooks** | 48 books         | ~54M      |
| **Mixed-Type Dataset**   | Combined dataset | ~637.8M   |

### Pretraining Results

| Model                   | Eval Accuracy | Validation Loss | Validation Perplexity |
|-------------------------|---------------|-----------------|-----------------------|
| **herberta_seq_512_v2** | 0.9841        | 0.04367         | 1.083                 |
| **herberta_seq_128_v2** | 0.9406        | 0.2877          | 1.333                 |
| **herberta_seq_512_v3** | 0.755         | 1.100           | 3.010                 |

#### Metrics Comparison

<table>
  <tr>
    <td align="center"><strong>Accuracy</strong></td>
    <td align="center"><strong>Loss</strong></td>
    <td align="center"><strong>Perplexity</strong></td>
  </tr>
  <tr>
    <td><img src="https://cdn-uploads.huggingface.co/production/uploads/6564baaa393bae9c194fc32e/RDgI-0Ro2kMiwV853Wkgx.png" alt="Accuracy" width="500"></td>
    <td><img src="https://cdn-uploads.huggingface.co/production/uploads/6564baaa393bae9c194fc32e/BJ7enbRg13IYAZuxwraPP.png" alt="Loss" width="500"></td>
    <td><img src="https://cdn-uploads.huggingface.co/production/uploads/6564baaa393bae9c194fc32e/lOohRMIctPJZKM5yEEcQ2.png" alt="Perplexity" width="500"></td>
  </tr>
</table>

### Pretraining Configuration

#### Ancient Books
- Pretraining Strategy: BERT-style masking (15% of tokens masked); a minimal sketch follows this list.
- Sequence Length: 512
- Batch Size: 32
- Learning Rate: `1e-5` with an epoch-based decay (`epoch * 0.1`)
- Tokenization: Sentence-based tokenization, with padding for sequences shorter than 512 tokens.

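For reference, here is a minimal sketch of what this configuration could look like with Hugging Face `transformers`. The base checkpoint, the sentence-splitting rule, and the two-title corpus are assumptions for illustration; note also that the standard collator masks tokens on the fly for each batch, while the original run may have applied a fixed BERT-style masking pass.

```python
# Illustrative sketch (not the official training script): sentence-split an
# ancient-book corpus, pad each sentence to 512 tokens, and mask 15% of tokens
# for MLM, roughly matching the "Ancient Books" configuration above.
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling)

base = "hfl/chinese-roberta-wwm-ext"  # assumed base checkpoint
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForMaskedLM.from_pretrained(base)

# Hypothetical corpus: one long string per ancient book.
books = ["黄帝内经……", "伤寒论……"]
sentences = [s for book in books for s in book.split("。") if s.strip()]

encodings = tokenizer(
    sentences,
    truncation=True,
    padding="max_length",  # pad sentences shorter than 512 tokens
    max_length=512,
)

# Select 15% of tokens for masking, as in the BERT-style strategy above.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)
```

The `encodings` and `collator` would then be handed to a `Trainer`; the epoch-based learning-rate decay is omitted here.
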
#### Modern Textbooks
- Pretraining Strategy: Dynamic masking + warmup + linear decay
- Sequence Length: 512
- Batch Size: 16
- Learning Rate: warmup over the first 10% of steps, then linear decay from an initial rate of `1e-5`
- Tokenization: Continuous tokenization into 512-token blocks, without sentence segmentation.

#### V4 Mixed Dataset (Ancient + Modern)
- Dataset: Combined 48 modern textbooks + 700 ancient books
- Pretraining Strategy: Dynamic masking, warmup, and linear decay (initial learning rate `1e-5`); see the sketch after this list.
- Epochs: 20
- Sequence Length: 512
- Batch Size: 16
- Tokenization: Continuous tokenization.

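The sketch below shows one way the continuous-tokenization, warmup-plus-linear-decay setup described for the modern and mixed runs could be wired up. The corpus placeholder, the `datasets` usage, and the output directory are assumptions, not the released training code.

```python
# Illustrative sketch: concatenate the tokenized corpus into contiguous
# 512-token blocks (no sentence segmentation), then train with dynamic masking,
# 10% warmup, and linear decay from an initial learning rate of 1e-5.
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("hfl/chinese-roberta-wwm-ext")
model = AutoModelForMaskedLM.from_pretrained("hfl/chinese-roberta-wwm-ext")

raw = Dataset.from_dict({"text": ["(placeholder corpus text)"]})  # placeholder corpus
tokenized = raw.map(lambda b: tokenizer(b["text"]), batched=True, remove_columns=["text"])

def group_texts(batch, block_size=512):
    # Flatten all token ids and cut them into fixed 512-token blocks.
    ids = sum(batch["input_ids"], [])
    blocks = [ids[i:i + block_size] for i in range(0, len(ids) - block_size + 1, block_size)]
    return {"input_ids": blocks, "attention_mask": [[1] * block_size for _ in blocks]}

lm_dataset = tokenized.map(group_texts, batched=True, remove_columns=tokenized.column_names)

args = TrainingArguments(
    output_dir="herberta_v4_mixed",   # placeholder
    num_train_epochs=20,
    per_device_train_batch_size=16,
    learning_rate=1e-5,
    warmup_ratio=0.1,                 # warmup over the first 10% of steps
    lr_scheduler_type="linear",       # linear decay after warmup
)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
trainer = Trainer(model=model, args=args, train_dataset=lm_dataset, data_collator=collator)
# trainer.train()
```

`warmup_ratio=0.1` with `lr_scheduler_type="linear"` ramps the learning rate up over the first 10% of steps and then decays it linearly, matching the schedule described above.
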
---

## Downstream Task: TCM Pattern Classification

### Task Definition
Using **321 pattern descriptions** extracted from TCM internal medicine textbooks, we evaluated classification performance for four models:

1. **Herberta_seq_512_v2**: Pretrained on 700 ancient TCM books.
2. **Herberta_seq_512_v3**: Pretrained on 48 modern TCM textbooks.
3. **Herberta_seq_128_v2**: Pretrained on 700 ancient TCM books (128-token sequences).
4. **Roberta**: Baseline model without TCM-specific pretraining.

### Training Configuration
- Max Sequence Length: 512
- Batch Size: 16
- Epochs: 30 (a minimal fine-tuning sketch follows this list)

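Below is a hedged sketch of how such a fine-tuning run could be set up with `transformers`. The number of labels, the placeholder example, and the output directory are illustrative assumptions, and the original train/eval split and metric computation are not reproduced here.

```python
# Illustrative sketch: fine-tune Herberta for TCM pattern classification with
# the configuration above (max length 512, batch size 16, 30 epochs).
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

num_labels = 10  # placeholder: number of pattern classes in your label set
tokenizer = AutoTokenizer.from_pretrained("XiaoEnn/herberta")
model = AutoModelForSequenceClassification.from_pretrained("XiaoEnn/herberta", num_labels=num_labels)

# Placeholder data: pattern descriptions paired with integer class labels,
# e.g. "aversion to cold, cold limbs, fatigue..."
data = Dataset.from_dict({"text": ["畏寒肢冷，神疲乏力……"], "label": [0]})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=512)

train_set = data.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="herberta_pattern_cls",  # placeholder
    num_train_epochs=30,
    per_device_train_batch_size=16,
)
trainer = Trainer(model=model, args=args, train_dataset=train_set)
# trainer.train()
```
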
### Results

| Model Name              | Eval Accuracy | Eval F1    | Eval Precision | Eval Recall |
|-------------------------|---------------|------------|----------------|-------------|
| **Herberta_seq_512_v2** | **0.9454**    | **0.9293** | **0.9221**     | **0.9454**  |
| **Herberta_seq_512_v3** | 0.8989        | 0.8704     | 0.8583         | 0.8989      |
| **Herberta_seq_128_v2** | 0.8716        | 0.8443     | 0.8351         | 0.8716      |
| **Roberta**             | 0.8743        | 0.8425     | 0.8311         | 0.8743      |

![image/png](https://cdn-uploads.huggingface.co/production/uploads/6564baaa393bae9c194fc32e/1yG96YdzXuxQlTfjOmXqg.png)

#### Summary
The **Herberta_seq_512_v2** model, pretrained on 700 ancient TCM books, achieved the best results on every evaluation metric, underscoring the value of domain-specific pretraining on larger, historically richer corpora for TCM applications.

---

## Quickstart

### Use with Hugging Face Transformers

```python
import torch
from transformers import AutoTokenizer, AutoModel

model_name = "XiaoEnn/herberta"

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Input text ("TCM theory is a treasure of China's traditional culture.")
text = "中医理论是我国传统文化的瑰宝。"

# Tokenize and prepare input
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding="max_length", max_length=128)

# Get the model's outputs
with torch.no_grad():
    outputs = model(**inputs)

# Get the embedding (sentence-level average pooling)
sentence_embedding = outputs.last_hidden_state.mean(dim=1)

print("Embedding shape:", sentence_embedding.shape)
print("Embedding vector:", sentence_embedding)
```
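
Note that with `padding="max_length"` the plain mean over `last_hidden_state` also averages the padding positions. If you prefer a padding-insensitive sentence vector, a mask-aware mean pooling such as the following sketch (reusing `inputs` and `outputs` from the example above) is a common alternative; the model card does not prescribe a particular pooling strategy.

```python
# Optional: mean-pool only over real tokens, ignoring padding positions.
mask = inputs["attention_mask"].unsqueeze(-1)               # (batch, seq_len, 1)
summed = (outputs.last_hidden_state * mask).sum(dim=1)      # sum of non-padding token vectors
sentence_embedding = summed / mask.sum(dim=1).clamp(min=1)  # divide by the real token count
print("Masked-mean embedding shape:", sentence_embedding.shape)
```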

If you find our work helpful, please consider citing it:

    @misc{herberta-embedding,
      title  = {Herberta: A Pretrained Model for TCM Herbal Medicine and Downstream Tasks as Text Embedding Generation},
      url    = {https://github.com/15392778677/herberta},
      author = {Yehan Yang and Xinhan Zheng},
      month  = {December},
      year   = {2024}
    }

    @techreport{herberta-technical-report,
      title       = {Herberta: A Pretrained Model for TCM Herbal Medicine and Downstream Tasks as Text Embedding Generation},
      author      = {Yehan Yang and Xinhan Zheng},
      institution = {Beijing Angelpro Technology Co., Ltd.},
      year        = {2024},
      note        = {Presented at the 2024 Machine Learning Applications Conference (MLAC)}
    }