---
language: ru
license: apache-2.0
library_name: transformers
tags:
- russian
- morpheme-segmentation
- token-classification
- morphbert
- lightweight
- bert
- ru
- russ
pipeline_tag: token-classification
---
# MorphBERT-Tiny: Russian Morpheme Segmentation
This repository contains the `CrabInHoney/morphbert-tiny-v2-morpheme-segmentation-ru` model, a compact transformer-based system for morpheme segmentation and classification of Russian words. The model classifies each character of a given word into one of seven morpheme categories: {0: 'END', 1: 'HYPH', 2: 'LINK', 3: 'POSTFIX', 4: 'PREF', 5: 'ROOT', 6: 'SUFF'}, i.e. inflectional ending, hyphen, linking vowel, postfix, prefix, root, and suffix.
## Model Description
`morphbert-tiny-v2-morpheme-segmentation-ru` leverages a lightweight BERT-like architecture, enabling efficient deployment and inference while maintaining high performance on the specific task of morphological analysis at the character level. The model was distilled from a larger teacher model.
**Key Features:**
- **Task:** Morpheme Segmentation & Classification (Token Classification at Character Level)
- **Language:** Russian (ru)
- **Architecture:** Transformer (BERT-like, optimized for size)
- **Labels:** END, HYPH, LINK, POSTFIX, PREF, ROOT, SUFF
**Model Size & Specifications:**
- **Parameters:** ~3.58 Million
- **Tensor Type:** F32
- **Disk Footprint:** ~14.3 MB
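The disk footprint follows from the parameter count: roughly 3.58M float32 weights at 4 bytes each comes to about 14.3 MB. A quick way to check both figures once the model is loaded (a minimal sketch; `model` refers to the `BertForMultiTask` instance created in the Usage section below):
```python
# Sanity-check the parameter count and float32 weight size of the loaded model.
n_params = sum(p.numel() for p in model.parameters())
print(f"Parameters: {n_params / 1e6:.2f}M")         # expected ~3.58M
print(f"F32 weights: {n_params * 4 / 1e6:.1f} MB")  # 4 bytes/param, expected ~14.3 MB
```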
## Usage
The model can be used with the Hugging Face `transformers` library. Below is a minimal example using the custom multi-task head defined in this repository:
```python
import torch
import torch.nn as nn
from transformers import BertTokenizer, BertPreTrainedModel, BertModel

MODEL_DIR = 'CrabInHoney/morphbert-tiny-v2-morpheme-segmentation-ru'
MAX_LEN = 32
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
ID2TAG = {0: 'END', 1: 'HYPH', 2: 'LINK', 3: 'POSTFIX', 4: 'PREF', 5: 'ROOT', 6: 'SUFF'}
NUM_MORPH_CLASSES = len(ID2TAG)

class BertForMultiTask(BertPreTrainedModel):
    """BERT encoder with two character-level heads: morpheme boundary
    segmentation (begin/inside) and morpheme class prediction."""
    def __init__(self, config, num_seg_labels=2, num_morph_labels=NUM_MORPH_CLASSES):
        super().__init__(config)
        self.bert = BertModel(config)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.seg_head = nn.Linear(config.hidden_size, num_seg_labels)    # 0 = morpheme start, 1 = continuation
        self.cls_head = nn.Linear(config.hidden_size, num_morph_labels)  # morpheme class per character

    def forward(self, input_ids, attention_mask=None):
        x = self.dropout(self.bert(input_ids, attention_mask=attention_mask).last_hidden_state)
        return {"seg_logits": self.seg_head(x), "cls_logits": self.cls_head(x)}

tokenizer = BertTokenizer.from_pretrained(MODEL_DIR)
model = BertForMultiTask.from_pretrained(MODEL_DIR, num_morph_labels=NUM_MORPH_CLASSES).to(DEVICE).eval()

def analyze_word_compact(word):
    if not word.strip():
        return "Empty word"
    # The model operates on characters: feed the word as space-separated characters.
    chars = list(word.lower())
    enc = tokenizer(" ".join(chars), return_tensors='pt', max_length=MAX_LEN,
                    padding='max_length', truncation=True, add_special_tokens=True)
    with torch.no_grad():
        out = model(input_ids=enc['input_ids'].to(DEVICE),
                    attention_mask=enc['attention_mask'].to(DEVICE))
    # Skip [CLS]/[SEP]; at most MAX_LEN - 2 characters are processed.
    n = min(len(chars), MAX_LEN - 2)
    if n <= 0:
        return "Word too short/truncated"
    seg = torch.argmax(out['seg_logits'][0, 1:1 + n], -1).tolist()
    cls = torch.argmax(out['cls_logits'][0, 1:1 + n], -1).tolist()
    print(f"\n--- '{word}' (processed {n} chars) ---")
    print("Segmentation:", ' '.join(f"{chars[i]}:{seg[i]}" for i in range(n)))
    print("Classification:", ' '.join(f"{chars[i]}:{ID2TAG.get(cls[i], f'ID:{cls[i]}')}" for i in range(n)))
    # Merge characters into morphemes: a segmentation label of 0 starts a new morpheme.
    morphemes, morph, tag = [], "", -1
    for i in range(n):
        if seg[i] == 0:
            if morph:
                morphemes.append(f"{morph}:{ID2TAG.get(tag, f'ID:{tag}')}")
            morph = chars[i]
            tag = cls[i]
        else:
            morph += chars[i]
    if morph:
        morphemes.append(f"{morph}:{ID2TAG.get(tag, f'ID:{tag}')}")
    res = " / ".join(morphemes)
    print(f"Result: {res}\n{'=' * 30}")
    return res

example_words = ["масляный", "предчувствий", "тарковский", "кот", "подгон"]
for w in example_words:
    analyze_word_compact(w)
```
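For programmatic use, the same merging logic can return structured data instead of a printed string. A minimal sketch (the `segment_word` helper is a hypothetical addition, not part of this repository; it reuses `tokenizer`, `model`, and the constants defined above):
```python
def segment_word(word: str) -> list[tuple[str, str]]:
    """Hypothetical helper: return [(morpheme, tag), ...] for a single word."""
    chars = list(word.lower())
    enc = tokenizer(" ".join(chars), return_tensors='pt', max_length=MAX_LEN,
                    padding='max_length', truncation=True, add_special_tokens=True)
    with torch.no_grad():
        out = model(input_ids=enc['input_ids'].to(DEVICE),
                    attention_mask=enc['attention_mask'].to(DEVICE))
    n = min(len(chars), MAX_LEN - 2)
    seg = torch.argmax(out['seg_logits'][0, 1:1 + n], -1).tolist()
    cls = torch.argmax(out['cls_logits'][0, 1:1 + n], -1).tolist()
    pairs, start = [], 0
    for i in range(1, n + 1):
        # Segmentation label 0 opens a new morpheme, so close the previous span.
        if i == n or seg[i] == 0:
            pairs.append(("".join(chars[start:i]), ID2TAG.get(cls[start], f"ID:{cls[start]}")))
            start = i
    return pairs

print(segment_word("подгон"))  # expected: [('под', 'PREF'), ('гон', 'ROOT')]
```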
## Example Output
In the segmentation rows below, `0` marks the first character of a morpheme and `1` marks a continuation of the current one; the classification rows give the predicted morpheme class per character.
```
--- 'масляный' (processed 8 chars) ---
Segmentation: м:0 а:1 с:1 л:1 я:0 н:1 ы:0 й:1
Classification: м:ROOT а:ROOT с:ROOT л:ROOT я:SUFF н:SUFF ы:END й:END
Result: масл:ROOT / ян:SUFF / ый:END
==============================
--- 'предчувствий' (processed 12 chars) ---
Segmentation: п:0 р:1 е:1 д:1 ч:0 у:1 в:0 с:0 т:1 в:1 и:0 й:1
Classification: п:PREF р:PREF е:PREF д:PREF ч:ROOT у:ROOT в:SUFF с:SUFF т:SUFF в:SUFF и:END й:END
Result: пред:PREF / чу:ROOT / в:SUFF / ств:SUFF / ий:END
==============================
--- 'тарковский' (processed 10 chars) ---
Segmentation: т:0 а:1 р:1 к:1 о:0 в:1 с:0 к:1 и:0 й:1
Classification: т:ROOT а:ROOT р:ROOT к:ROOT о:SUFF в:ROOT с:SUFF к:SUFF и:END й:END
Result: тарк:ROOT / ов:SUFF / ск:SUFF / ий:END
==============================
--- 'кот' (processed 3 chars) ---
Segmentation: к:0 о:1 т:1
Classification: к:ROOT о:ROOT т:ROOT
Result: кот:ROOT
==============================
--- 'подгон' (processed 6 chars) ---
Segmentation: п:0 о:1 д:1 г:0 о:1 н:1
Classification: п:PREF о:PREF д:PREF г:ROOT о:ROOT н:ROOT
Result: под:PREF / гон:ROOT
==============================
```
## Performance
- **Segmentation accuracy:** 98.52%
- **Morpheme classification accuracy:** 98.34%
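Both figures are character-level accuracies. For reference, a metric of this kind can be computed as follows (a sketch; the `char_accuracy` helper and the toy `gold`/`pred` labels are illustrative assumptions, not the model's actual evaluation code or data):
```python
def char_accuracy(pred: list[list[int]], gold: list[list[int]]) -> float:
    """Fraction of characters whose predicted label matches the gold label."""
    correct = sum(p == g for pw, gw in zip(pred, gold) for p, g in zip(pw, gw))
    total = sum(len(gw) for gw in gold)
    return correct / total

# Toy gold/predicted segmentation labels for 'кот' and 'подгон' (0 = start, 1 = inside).
gold = [[0, 1, 1], [0, 1, 1, 0, 1, 1]]
pred = [[0, 1, 1], [0, 1, 1, 0, 1, 1]]
print(f"Segmentation accuracy: {char_accuracy(pred, gold):.2%}")
```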