---
language: ru
license: apache-2.0
library_name: transformers
tags:
- russian
- morpheme-segmentation
- token-classification
- morphbert
- lightweight
- bert
- ru
- russ
pipeline_tag: token-classification
---

# MorphBERT-Tiny: Russian Morpheme Segmentation

This repository contains `CrabInHoney/morphbert-tiny-v2-morpheme-segmentation-ru`, a compact transformer-based model for morpheme segmentation and classification of Russian words. The model assigns each character of a given word to one of seven morpheme categories: {0: 'END', 1: 'HYPH', 2: 'LINK', 3: 'POSTFIX', 4: 'PREF', 5: 'ROOT', 6: 'SUFF'}.

## Model Description

`morphbert-tiny-v2-morpheme-segmentation-ru` uses a lightweight BERT-like architecture that keeps deployment and inference efficient while maintaining high accuracy on character-level morphological analysis. The model was distilled from a larger teacher model.

**Key Features:**

- **Task:** Morpheme Segmentation & Classification (Token Classification at Character Level)
- **Language:** Russian (ru)
- **Architecture:** Transformer (BERT-like, optimized for size)
- **Labels:** END, HYPH, LINK, POSTFIX, PREF, ROOT, SUFF

**Model Size & Specifications:**

- **Parameters:** ~3.58 Million
- **Tensor Type:** F32
- **Disk Footprint:** ~14.3 MB

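As a quick consistency check, the footprint follows from the parameter count: F32 weights take 4 bytes each, so ~3.58 M parameters serialize to about 14.3 MB:

```python
# Sanity check: each F32 parameter occupies 4 bytes on disk
# (ignoring small overhead from the vocab and config files).
params = 3_580_000
size_mb = params * 4 / 1_000_000  # decimal megabytes

print(f"~{size_mb:.1f} MB")  # ~14.3 MB
```
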
## Usage

The model can be used with the Hugging Face `transformers` library. Below is a minimal example using the custom multi-task head defined in this repository:

```python
import torch
import torch.nn as nn
from transformers import BertTokenizer, BertPreTrainedModel, BertModel

MODEL_DIR = 'CrabInHoney/morphbert-tiny-v2-morpheme-segmentation-ru'
MAX_LEN = 32
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
ID2TAG = {0: 'END', 1: 'HYPH', 2: 'LINK', 3: 'POSTFIX', 4: 'PREF', 5: 'ROOT', 6: 'SUFF'}
NUM_MORPH_CLASSES = len(ID2TAG)

class BertForMultiTask(BertPreTrainedModel):
    """BERT encoder with two heads: morpheme boundaries (seg) and morpheme classes (cls)."""
    def __init__(self, config, num_seg_labels=2, num_morph_labels=NUM_MORPH_CLASSES):
        super().__init__(config)
        self.bert = BertModel(config)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.seg_head = nn.Linear(config.hidden_size, num_seg_labels)
        self.cls_head = nn.Linear(config.hidden_size, num_morph_labels)

    def forward(self, input_ids, attention_mask=None):
        x = self.dropout(self.bert(input_ids, attention_mask=attention_mask).last_hidden_state)
        return {"seg_logits": self.seg_head(x), "cls_logits": self.cls_head(x)}

tokenizer = BertTokenizer.from_pretrained(MODEL_DIR)
model = BertForMultiTask.from_pretrained(MODEL_DIR, num_morph_labels=NUM_MORPH_CLASSES).to(DEVICE).eval()

def analyze_word_compact(word):
    if not word.strip():
        return "Empty word"
    chars = list(word.lower())
    # Each character becomes one token: "масляный" -> "м а с л я н ы й"
    enc = tokenizer(" ".join(chars), return_tensors='pt', max_length=MAX_LEN,
                    padding='max_length', truncation=True, add_special_tokens=True)
    with torch.no_grad():
        out = model(input_ids=enc['input_ids'].to(DEVICE),
                    attention_mask=enc['attention_mask'].to(DEVICE))
    n = min(len(chars), MAX_LEN - 2)  # positions left after [CLS] and [SEP]
    if n <= 0:
        return "Word too short/truncated"
    # Skip the [CLS] position and keep predictions for the n character tokens
    seg = torch.argmax(out['seg_logits'][0, 1:1 + n], -1).tolist()
    cls = torch.argmax(out['cls_logits'][0, 1:1 + n], -1).tolist()
    print(f"\n--- '{word}' (processed {n} chars) ---")
    print("Segmentation:", ' '.join(f'{chars[i]}:{seg[i]}' for i in range(n)))
    print("Classification:", ' '.join(f"{chars[i]}:{ID2TAG.get(cls[i], f'ID:{cls[i]}')}" for i in range(n)))
    # seg == 0 marks the first character of a new morpheme
    morphemes, morph, tag = [], "", -1
    for i in range(n):
        if seg[i] == 0:
            if morph:
                morphemes.append(f"{morph}:{ID2TAG.get(tag, f'ID:{tag}')}")
            morph = chars[i]
            tag = cls[i]
        else:
            morph += chars[i]
    if morph:
        morphemes.append(f"{morph}:{ID2TAG.get(tag, f'ID:{tag}')}")
    res = " / ".join(morphemes)
    print(f"Result: {res}\n{'=' * 30}")
    return res

example_words = ["масляный", "предчувствий", "тарковский", "кот", "подгон"]
for w in example_words:
    analyze_word_compact(w)
```

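The span-merging logic at the end of `analyze_word_compact` can also be isolated as a pure function, which makes it easy to test without downloading the model. This is a sketch; `merge_morphemes` is a hypothetical helper, not part of the repository:

```python
ID2TAG = {0: 'END', 1: 'HYPH', 2: 'LINK', 3: 'POSTFIX', 4: 'PREF', 5: 'ROOT', 6: 'SUFF'}

def merge_morphemes(chars, seg, cls):
    """Group characters into (morpheme, tag) spans.

    seg[i] == 0 marks the first character of a new morpheme;
    each span inherits the class predicted for its first character.
    """
    morphemes, morph, tag = [], "", -1
    for ch, s, c in zip(chars, seg, cls):
        if s == 0:
            if morph:
                morphemes.append((morph, ID2TAG.get(tag, f"ID:{tag}")))
            morph, tag = ch, c
        else:
            morph += ch
    if morph:
        morphemes.append((morph, ID2TAG.get(tag, f"ID:{tag}")))
    return morphemes

# Predictions for "масляный" from the example output:
print(merge_morphemes(list("масляный"),
                      [0, 1, 1, 1, 0, 1, 0, 1],
                      [5, 5, 5, 5, 6, 6, 0, 0]))
# [('масл', 'ROOT'), ('ян', 'SUFF'), ('ый', 'END')]
```
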
## Example Output

```
--- 'масляный' (processed 8 chars) ---
Segmentation: м:0 а:1 с:1 л:1 я:0 н:1 ы:0 й:1
Classification: м:ROOT а:ROOT с:ROOT л:ROOT я:SUFF н:SUFF ы:END й:END
Result: масл:ROOT / ян:SUFF / ый:END
==============================

--- 'предчувствий' (processed 12 chars) ---
Segmentation: п:0 р:1 е:1 д:1 ч:0 у:1 в:0 с:0 т:1 в:1 и:0 й:1
Classification: п:PREF р:PREF е:PREF д:PREF ч:ROOT у:ROOT в:SUFF с:SUFF т:SUFF в:SUFF и:END й:END
Result: пред:PREF / чу:ROOT / в:SUFF / ств:SUFF / ий:END
==============================

--- 'тарковский' (processed 10 chars) ---
Segmentation: т:0 а:1 р:1 к:1 о:0 в:1 с:0 к:1 и:0 й:1
Classification: т:ROOT а:ROOT р:ROOT к:ROOT о:SUFF в:ROOT с:SUFF к:SUFF и:END й:END
Result: тарк:ROOT / ов:SUFF / ск:SUFF / ий:END
==============================

--- 'кот' (processed 3 chars) ---
Segmentation: к:0 о:1 т:1
Classification: к:ROOT о:ROOT т:ROOT
Result: кот:ROOT
==============================

--- 'подгон' (processed 6 chars) ---
Segmentation: п:0 о:1 д:1 г:0 о:1 н:1
Classification: п:PREF о:PREF д:PREF г:ROOT о:ROOT н:ROOT
Result: под:PREF / гон:ROOT
==============================
```

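The `Result` strings above are straightforward to post-process into structured data. A small parser sketch (`parse_result` is a hypothetical helper, not part of the repository):

```python
def parse_result(res):
    """Split a Result string like 'под:PREF / гон:ROOT' into (morpheme, tag) pairs."""
    return [tuple(part.split(":")) for part in res.split(" / ")]

print(parse_result("под:PREF / гон:ROOT"))
# [('под', 'PREF'), ('гон', 'ROOT')]
```
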
## Performance

- **Segmentation accuracy:** 98.52%
- **Morph-class accuracy:** 98.34%
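
Since the model predicts per character, these figures can be reproduced from gold and predicted tag sequences along these lines (a sketch with invented toy data; the actual evaluation script is not included in the repository):

```python
def char_accuracy(gold, pred):
    """Fraction of characters whose predicted label matches the gold label."""
    total = sum(len(g) for g in gold)
    correct = sum(gp == pp for g, p in zip(gold, pred) for gp, pp in zip(g, p))
    return correct / total

# Toy illustration with two words (labels invented for the example):
gold = [["PREF", "ROOT", "ROOT"], ["ROOT", "END"]]
pred = [["PREF", "ROOT", "SUFF"], ["ROOT", "END"]]
print(f"{char_accuracy(gold, pred):.2%}")  # 80.00%
```
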