XiaoEnn committed c8c36dd (verified; parent: 4e20965): Update README.md
Files changed (1): README.md (+172, -3)
---
license: apache-2.0
---

# Herberta: A Pretrained Model for TCM Herbal Medicine and Downstream Tasks

**Tags**:
- Pretrain_Model
- transformers
- TCM
- herberta
- text embedding

**License**: Apache-2.0
**Inference**: true
**Language**: zh, en
**Base Model**: hfl/chinese-roberta-wwm-ext
**Library Name**: transformers

---

## Introduction

Herberta is a pre-trained model developed by the Angelpro Team to advance representation learning and modeling for Traditional Chinese Medicine (TCM). Built upon the **chinese-roberta-wwm-ext-large** model, Herberta is further pre-trained with a masked language modeling (MLM) objective on **700 ancient books (538.95M)** and **48 modern Chinese medicine textbooks (54M)**, yielding a robust model for embedding generation and TCM-specific downstream tasks.

We named the model "Herberta" by combining "Herb" and "RoBERTa" to signal its focus on herbal medicine research. Herberta is well suited for applications such as:

- **Encoder for Herbal Formulas**: Generating meaningful embeddings for TCM formulations (a short sketch follows this list).
- **Domain-Specific Word Embedding**: Serving the Chinese medicine text domain.
- **Support for TCM Downstream Tasks**: Including classification, labeling, and more.

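As a small illustration of the first use case, the sketch below encodes two formula descriptions and compares them by cosine similarity. The `embed` helper, the example formulas, and the mean-pooling choice are illustrative assumptions (mirroring the Quickstart at the end of this card), not an officially prescribed pipeline.

```python
# Illustrative only: compare two TCM formula descriptions by cosine similarity
# of their Herberta sentence embeddings. The embed() helper is hypothetical.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("XiaoEnn/herberta")
model = AutoModel.from_pretrained("XiaoEnn/herberta")

def embed(text: str) -> torch.Tensor:
    # Mean-pool the last hidden states into a single sentence vector.
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state.mean(dim=1).squeeze(0)

formula_a = embed("桂枝汤：桂枝、芍药、生姜、大枣、甘草。")  # Guizhi Decoction
formula_b = embed("麻黄汤：麻黄、桂枝、杏仁、甘草。")        # Mahuang Decoction
print("cosine similarity:", torch.cosine_similarity(formula_a, formula_b, dim=0).item())
```
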
---

## Pretraining Experiments

### Dataset

| Data Type                | Quantity         | Data Size |
|--------------------------|------------------|-----------|
| **Ancient TCM Books**    | 700 books        | ~538.95M  |
| **Modern TCM Textbooks** | 48 books         | ~54M      |
| **Mixed-Type Dataset**   | Combined dataset | ~637.8M   |

### Pretraining Results

| Model                   | Eval Accuracy | Validation Loss | Validation Perplexity |
|-------------------------|---------------|-----------------|-----------------------|
| **herberta_seq_512_v2** | 0.9841        | 0.04367         | 1.083                 |
| **herberta_seq_128_v2** | 0.9406        | 0.2877          | 1.333                 |
| **herberta_seq_512_v3** | 0.755         | 1.100           | 3.010                 |

#### Metrics Comparison

<table>
  <tr>
    <td align="center"><strong>Accuracy</strong></td>
    <td align="center"><strong>Loss</strong></td>
    <td align="center"><strong>Perplexity</strong></td>
  </tr>
  <tr>
    <td><img src="https://cdn-uploads.huggingface.co/production/uploads/6564baaa393bae9c194fc32e/RDgI-0Ro2kMiwV853Wkgx.png" alt="Accuracy" width="500"></td>
    <td><img src="https://cdn-uploads.huggingface.co/production/uploads/6564baaa393bae9c194fc32e/BJ7enbRg13IYAZuxwraPP.png" alt="Loss" width="500"></td>
    <td><img src="https://cdn-uploads.huggingface.co/production/uploads/6564baaa393bae9c194fc32e/lOohRMIctPJZKM5yEEcQ2.png" alt="Perplexity" width="500"></td>
  </tr>
</table>

### Pretraining Configuration

#### Ancient Books
- Pretraining Strategy: BERT-style masking (15% of tokens masked); a minimal sketch follows this list.
- Sequence Length: 512
- Batch Size: 32
- Learning Rate: `1e-5` with an epoch-based decay (`epoch * 0.1`)
- Tokenization: Sentence-based tokenization, with padding for sequences shorter than 512 tokens.

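For reference, here is a minimal sketch of what this configuration could look like with Hugging Face `transformers`. The base checkpoint, the sentence-splitting rule, and the two-title corpus are assumptions for illustration; note also that the standard collator masks tokens on the fly for each batch, while the original run may have applied a fixed BERT-style masking pass.

```python
# Illustrative sketch (not the official training script): sentence-split an
# ancient-book corpus, pad each sentence to 512 tokens, and mask 15% of tokens
# for MLM, roughly matching the "Ancient Books" configuration above.
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling)

base = "hfl/chinese-roberta-wwm-ext"  # assumed base checkpoint
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForMaskedLM.from_pretrained(base)

# Hypothetical corpus: one long string per ancient book.
books = ["黄帝内经……", "伤寒论……"]
sentences = [s for book in books for s in book.split("。") if s.strip()]

encodings = tokenizer(
    sentences,
    truncation=True,
    padding="max_length",  # pad sentences shorter than 512 tokens
    max_length=512,
)

# Select 15% of tokens for masking, as in the BERT-style strategy above.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)
```

The `encodings` and `collator` would then be handed to a `Trainer`; the epoch-based learning-rate decay is omitted here.
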
#### Modern Textbooks
- Pretraining Strategy: Dynamic masking + warmup + linear decay
- Sequence Length: 512
- Batch Size: 16
- Learning Rate: warmup over the first 10% of steps, then linear decay from an initial rate of `1e-5`
- Tokenization: Continuous tokenization into 512-token blocks, without sentence segmentation.

#### V4 Mixed Dataset (Ancient + Modern)
- Dataset: Combined 48 modern textbooks + 700 ancient books
- Pretraining Strategy: Dynamic masking, warmup, and linear decay (initial learning rate `1e-5`); see the sketch after this list.
- Epochs: 20
- Sequence Length: 512
- Batch Size: 16
- Tokenization: Continuous tokenization.

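The sketch below shows one way the continuous-tokenization, warmup-plus-linear-decay setup described for the modern and mixed runs could be wired up. The corpus placeholder, the `datasets` usage, and the output directory are assumptions, not the released training code.

```python
# Illustrative sketch: concatenate the tokenized corpus into contiguous
# 512-token blocks (no sentence segmentation), then train with dynamic masking,
# 10% warmup, and linear decay from an initial learning rate of 1e-5.
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("hfl/chinese-roberta-wwm-ext")
model = AutoModelForMaskedLM.from_pretrained("hfl/chinese-roberta-wwm-ext")

raw = Dataset.from_dict({"text": ["(placeholder corpus text)"]})  # placeholder corpus
tokenized = raw.map(lambda b: tokenizer(b["text"]), batched=True, remove_columns=["text"])

def group_texts(batch, block_size=512):
    # Flatten all token ids and cut them into fixed 512-token blocks.
    ids = sum(batch["input_ids"], [])
    blocks = [ids[i:i + block_size] for i in range(0, len(ids) - block_size + 1, block_size)]
    return {"input_ids": blocks, "attention_mask": [[1] * block_size for _ in blocks]}

lm_dataset = tokenized.map(group_texts, batched=True, remove_columns=tokenized.column_names)

args = TrainingArguments(
    output_dir="herberta_v4_mixed",   # placeholder
    num_train_epochs=20,
    per_device_train_batch_size=16,
    learning_rate=1e-5,
    warmup_ratio=0.1,                 # warmup over the first 10% of steps
    lr_scheduler_type="linear",       # linear decay after warmup
)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
trainer = Trainer(model=model, args=args, train_dataset=lm_dataset, data_collator=collator)
# trainer.train()
```

`warmup_ratio=0.1` with `lr_scheduler_type="linear"` ramps the learning rate up over the first 10% of steps and then decays it linearly, matching the schedule described above.
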
---

## Downstream Task: TCM Pattern Classification

### Task Definition
Using **321 pattern descriptions** extracted from TCM internal medicine textbooks, we evaluated classification performance for four models:

1. **Herberta_seq_512_v2**: Pretrained on 700 ancient TCM books.
2. **Herberta_seq_512_v3**: Pretrained on 48 modern TCM textbooks.
3. **Herberta_seq_128_v2**: Pretrained on 700 ancient TCM books (128-token sequences).
4. **Roberta**: Baseline model without TCM-specific pretraining.

### Training Configuration
- Max Sequence Length: 512
- Batch Size: 16
- Epochs: 30 (a minimal fine-tuning sketch follows this list)

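Below is a hedged sketch of how such a fine-tuning run could be set up with `transformers`. The number of labels, the placeholder example, and the output directory are illustrative assumptions, and the original train/eval split and metric computation are not reproduced here.

```python
# Illustrative sketch: fine-tune Herberta for TCM pattern classification with
# the configuration above (max length 512, batch size 16, 30 epochs).
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

num_labels = 10  # placeholder: number of pattern classes in your label set
tokenizer = AutoTokenizer.from_pretrained("XiaoEnn/herberta")
model = AutoModelForSequenceClassification.from_pretrained("XiaoEnn/herberta", num_labels=num_labels)

# Placeholder data: pattern descriptions paired with integer class labels,
# e.g. "aversion to cold, cold limbs, fatigue..."
data = Dataset.from_dict({"text": ["畏寒肢冷，神疲乏力……"], "label": [0]})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=512)

train_set = data.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="herberta_pattern_cls",  # placeholder
    num_train_epochs=30,
    per_device_train_batch_size=16,
)
trainer = Trainer(model=model, args=args, train_dataset=train_set)
# trainer.train()
```
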
### Results

| Model Name              | Eval Accuracy | Eval F1    | Eval Precision | Eval Recall |
|-------------------------|---------------|------------|----------------|-------------|
| **Herberta_seq_512_v2** | **0.9454**    | **0.9293** | **0.9221**     | **0.9454**  |
| **Herberta_seq_512_v3** | 0.8989        | 0.8704     | 0.8583         | 0.8989      |
| **Herberta_seq_128_v2** | 0.8716        | 0.8443     | 0.8351         | 0.8716      |
| **Roberta**             | 0.8743        | 0.8425     | 0.8311         | 0.8743      |

![image/png](https://cdn-uploads.huggingface.co/production/uploads/6564baaa393bae9c194fc32e/1yG96YdzXuxQlTfjOmXqg.png)

#### Summary
The **Herberta_seq_512_v2** model, pretrained on 700 ancient TCM books, achieved the best results on every evaluation metric, underscoring the value of domain-specific pretraining on larger, historically richer corpora for TCM applications.

---

## Quickstart

### Use with Hugging Face Transformers

```python
import torch
from transformers import AutoTokenizer, AutoModel

model_name = "XiaoEnn/herberta"

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Input text ("TCM theory is a treasure of China's traditional culture.")
text = "中医理论是我国传统文化的瑰宝。"

# Tokenize and prepare input
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding="max_length", max_length=128)

# Get the model's outputs
with torch.no_grad():
    outputs = model(**inputs)

# Get the embedding (sentence-level average pooling)
sentence_embedding = outputs.last_hidden_state.mean(dim=1)

print("Embedding shape:", sentence_embedding.shape)
print("Embedding vector:", sentence_embedding)
```
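
Note that with `padding="max_length"` the plain mean over `last_hidden_state` also averages the padding positions. If you prefer a padding-insensitive sentence vector, a mask-aware mean pooling such as the following sketch (reusing `inputs` and `outputs` from the example above) is a common alternative; the model card does not prescribe a particular pooling strategy.

```python
# Optional: mean-pool only over real tokens, ignoring padding positions.
mask = inputs["attention_mask"].unsqueeze(-1)               # (batch, seq_len, 1)
summed = (outputs.last_hidden_state * mask).sum(dim=1)      # sum of non-padding token vectors
sentence_embedding = summed / mask.sum(dim=1).clamp(min=1)  # divide by the real token count
print("Masked-mean embedding shape:", sentence_embedding.shape)
```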

If you find our work helpful, please consider citing it:

    @misc{herberta-embedding,
      title  = {Herberta: A Pretrained Model for TCM Herbal Medicine and Downstream Tasks as Text Embedding Generation},
      url    = {https://github.com/15392778677/herberta},
      author = {Yehan Yang and Xinhan Zheng},
      month  = {December},
      year   = {2024}
    }

    @techreport{herberta-technical-report,
      title       = {Herberta: A Pretrained Model for TCM Herbal Medicine and Downstream Tasks as Text Embedding Generation},
      author      = {Yehan Yang and Xinhan Zheng},
      institution = {Beijing Angelpro Technology Co., Ltd.},
      year        = {2024},
      note        = {Presented at the 2024 Machine Learning Applications Conference (MLAC)}
    }