---
language: ru
license: apache-2.0
library_name: transformers
tags:
- russian
- morpheme-segmentation
- token-classification
- morphbert
- lightweight
- bert
- ru
- russ
pipeline_tag: token-classification
---

# MorphBERT-Tiny: Russian Morpheme Segmentation

This repository contains `CrabInHoney/morphbert-tiny-v2-morpheme-segmentation-ru`, a compact transformer-based model for morpheme segmentation and classification of Russian words. The model classifies each character of a given word into one of seven morpheme categories: {0: 'END', 1: 'HYPH', 2: 'LINK', 3: 'POSTFIX', 4: 'PREF', 5: 'ROOT', 6: 'SUFF'}.
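
The decoding step can be sketched independently of the model: segmentation label 0 marks the first character of a new morpheme, and the class label of that first character tags the whole morpheme. A minimal, self-contained sketch with hand-written toy predictions (not produced by the model):

```python
ID2TAG = {0: 'END', 1: 'HYPH', 2: 'LINK', 3: 'POSTFIX', 4: 'PREF', 5: 'ROOT', 6: 'SUFF'}

def merge_chars(chars, seg, cls):
    """Merge per-character predictions into labeled morphemes.

    seg[i] == 0 marks the start of a new morpheme; the morpheme's tag
    is taken from the class id of its first character."""
    morphemes = []
    for ch, s, c in zip(chars, seg, cls):
        if s == 0 or not morphemes:
            morphemes.append([ch, ID2TAG[c]])  # start a new morpheme
        else:
            morphemes[-1][0] += ch             # extend the current one
    return [f"{text}:{tag}" for text, tag in morphemes]

# Toy predictions for "масляный" (matching the example output below):
print(merge_chars(list("масляный"),
                  [0, 1, 1, 1, 0, 1, 0, 1],
                  [5, 5, 5, 5, 6, 6, 0, 0]))  # → ['масл:ROOT', 'ян:SUFF', 'ый:END']
```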
20
+
21
+ ## Model Description
22
+
23
+ `morphbert-tiny-v2-morpheme-segmentation-ru` leverages a lightweight BERT-like architecture, enabling efficient deployment and inference while maintaining high performance on the specific task of morphological analysis at the character level. The model was distilled from a larger teacher model.
24
+
25
+ **Key Features:**
26
+
27
+ - **Task:** Morpheme Segmentation & Classification (Token Classification at Character Level)
28
+ - **Language:** Russian (ru)
29
+ - **Architecture:** Transformer (BERT-like, optimized for size)
30
+ - **Labels:** END, HYPH, LINK, POSTFIX, PREF, ROOT, SUFF
31
+
32
+ **Model Size & Specifications:**
33
+
34
+ - **Parameters:** ~3.58 Million
35
+ - **Tensor Type:** F32
36
+ - **Disk Footprint:** ~14.3 MB
37
+
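
The reported disk footprint is consistent with the parameter count: at F32 precision each parameter takes 4 bytes. A quick back-of-the-envelope check:

```python
# Sanity check: ~3.58M F32 parameters at 4 bytes each.
params = 3.58e6
size_mb = params * 4 / 1e6  # decimal megabytes
print(f"~{size_mb:.1f} MB")  # → ~14.3 MB
```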

## Usage

The model can be used with the Hugging Face `transformers` library. Below is a minimal example using the custom multi-task head from this repository:

```python
import torch
import torch.nn as nn
from transformers import BertTokenizer, BertPreTrainedModel, BertModel

MODEL_DIR = 'CrabInHoney/morphbert-tiny-v2-morpheme-segmentation-ru'
MAX_LEN = 32
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
ID2TAG = {0: 'END', 1: 'HYPH', 2: 'LINK', 3: 'POSTFIX', 4: 'PREF', 5: 'ROOT', 6: 'SUFF'}
NUM_MORPH_CLASSES = len(ID2TAG)

class BertForMultiTask(BertPreTrainedModel):
    """BERT encoder with two per-character heads: segmentation and morpheme class."""
    def __init__(self, config, num_seg_labels=2, num_morph_labels=NUM_MORPH_CLASSES):
        super().__init__(config)
        self.bert = BertModel(config)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.seg_head = nn.Linear(config.hidden_size, num_seg_labels)
        self.cls_head = nn.Linear(config.hidden_size, num_morph_labels)

    def forward(self, input_ids, attention_mask=None):
        x = self.dropout(self.bert(input_ids, attention_mask=attention_mask).last_hidden_state)
        return {"seg_logits": self.seg_head(x), "cls_logits": self.cls_head(x)}

tokenizer = BertTokenizer.from_pretrained(MODEL_DIR)
model = BertForMultiTask.from_pretrained(MODEL_DIR, num_morph_labels=NUM_MORPH_CLASSES).to(DEVICE).eval()

def analyze_word_compact(word):
    if not word.strip():
        return "Empty word"
    chars = list(word.lower())
    # The model expects space-separated characters as input.
    enc = tokenizer(" ".join(chars), return_tensors='pt', max_length=MAX_LEN,
                    padding='max_length', truncation=True, add_special_tokens=True)
    with torch.no_grad():
        out = model(input_ids=enc['input_ids'].to(DEVICE),
                    attention_mask=enc['attention_mask'].to(DEVICE))
    n = min(len(chars), MAX_LEN - 2)  # positions 1..n skip [CLS] and [SEP]
    if n <= 0:
        return "Word too short/truncated"
    seg = torch.argmax(out['seg_logits'][0, 1:1 + n], -1).tolist()
    cls = torch.argmax(out['cls_logits'][0, 1:1 + n], -1).tolist()
    print(f"\n--- '{word}' (processed {n} chars) ---")
    print("Segmentation:", ' '.join(f'{chars[i]}:{seg[i]}' for i in range(n)))
    print("Classification:", ' '.join(f"{chars[i]}:{ID2TAG.get(cls[i], f'ID:{cls[i]}')}" for i in range(n)))
    # Merge characters into morphemes: seg == 0 starts a new morpheme.
    morphemes, morph, tag = [], "", -1
    for i in range(n):
        if seg[i] == 0:
            if morph:
                morphemes.append(f"{morph}:{ID2TAG.get(tag, f'ID:{tag}')}")
            morph = chars[i]
            tag = cls[i]
        else:
            morph += chars[i]
    if morph:
        morphemes.append(f"{morph}:{ID2TAG.get(tag, f'ID:{tag}')}")
    res = " / ".join(morphemes)
    print(f"Result: {res}\n{'=' * 30}")
    return res

example_words = ["масляный", "предчувствий", "тарковский", "кот", "подгон"]
for w in example_words:
    analyze_word_compact(w)
```

## Example Output

```
--- 'масляный' (processed 8 chars) ---
Segmentation: м:0 а:1 с:1 л:1 я:0 н:1 ы:0 й:1
Classification: м:ROOT а:ROOT с:ROOT л:ROOT я:SUFF н:SUFF ы:END й:END
Result: масл:ROOT / ян:SUFF / ый:END
==============================

--- 'предчувствий' (processed 12 chars) ---
Segmentation: п:0 р:1 е:1 д:1 ч:0 у:1 в:0 с:0 т:1 в:1 и:0 й:1
Classification: п:PREF р:PREF е:PREF д:PREF ч:ROOT у:ROOT в:SUFF с:SUFF т:SUFF в:SUFF и:END й:END
Result: пред:PREF / чу:ROOT / в:SUFF / ств:SUFF / ий:END
==============================

--- 'тарковский' (processed 10 chars) ---
Segmentation: т:0 а:1 р:1 к:1 о:0 в:1 с:0 к:1 и:0 й:1
Classification: т:ROOT а:ROOT р:ROOT к:ROOT о:SUFF в:ROOT с:SUFF к:SUFF и:END й:END
Result: тарк:ROOT / ов:SUFF / ск:SUFF / ий:END
==============================

--- 'кот' (processed 3 chars) ---
Segmentation: к:0 о:1 т:1
Classification: к:ROOT о:ROOT т:ROOT
Result: кот:ROOT
==============================

--- 'подгон' (processed 6 chars) ---
Segmentation: п:0 о:1 д:1 г:0 о:1 н:1
Classification: п:PREF о:PREF д:PREF г:ROOT о:ROOT н:ROOT
Result: под:PREF / гон:ROOT
==============================
```

## Performance

- Segmentation accuracy: 98.52%
- Morph-class accuracy: 98.34%
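
The evaluation protocol is not documented in this repository; as an illustrative sketch only (the function name and inputs are assumptions), a per-character accuracy of this shape could be computed as:

```python
def char_accuracy(pred_seqs, gold_seqs):
    """Fraction of characters whose predicted label matches the gold label,
    aggregated over a list of per-word label sequences."""
    correct = total = 0
    for pred, gold in zip(pred_seqs, gold_seqs):
        correct += sum(p == g for p, g in zip(pred, gold))
        total += len(gold)
    return correct / total

# Toy check: 5 of 6 character labels match across two words.
print(char_accuracy([[0, 1, 1], [0, 1, 1]], [[0, 1, 1], [0, 1, 0]]))
```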