|
--- |
|
language: ru |
|
license: apache-2.0 |
|
library_name: transformers |
|
tags: |
|
- russian |
|
- morpheme-segmentation |
|
- token-classification |
|
- morphbert |
|
- lightweight |
|
- bert |
|
- ru |
|
pipeline_tag: token-classification |
|
--- |
|
|
|
# MorphBERT-Tiny: Russian Morpheme Segmentation |
|
|
|
This repository contains the `CrabInHoney/morphbert-tiny-v2-morpheme-segmentation-ru` model, a compact transformer-based system for morpheme segmentation and classification of Russian words. The model classifies each character of a given word into one of seven morpheme classes: {0: 'END', 1: 'HYPH', 2: 'LINK', 3: 'POSTFIX', 4: 'PREF', 5: 'ROOT', 6: 'SUFF'}. For example, «подгон» is analyzed as под:PREF / гон:ROOT (see Example Output below).
|
|
|
## Model Description |
|
|
|
`morphbert-tiny-v2-morpheme-segmentation-ru` uses a lightweight BERT-like architecture, enabling efficient deployment and inference while maintaining high accuracy on character-level morphological analysis. The model was distilled from a larger teacher model.
|
|
|
**Key Features:** |
|
|
|
- **Task:** Morpheme Segmentation & Classification (Token Classification at Character Level) |
|
- **Language:** Russian (ru) |
|
- **Architecture:** Transformer (BERT-like, optimized for size) |
|
- **Labels:** END, HYPH, LINK, POSTFIX, PREF, ROOT, SUFF |
|
|
|
**Model Size & Specifications:** |
|
|
|
- **Parameters:** ~3.58 million
|
- **Tensor Type:** F32 |
|
- **Disk Footprint:** ~14.3 MB |
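
As a quick sanity check, the parameter count can be verified after loading (a minimal sketch; `model` here is the `BertForMultiTask` instance constructed in the Usage section below):

```python
# Count parameters; should come out around 3.58M for this checkpoint.
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.2f}M parameters")
```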
|
|
|
## Usage |
|
|
|
The model can be used with the Hugging Face `transformers` library. Because the checkpoint carries a custom multi-task head (one head for segmentation boundaries, one for morpheme classes), it is loaded through a small wrapper class rather than a stock pipeline. Below is a minimal example:
|
|
|
```python |
|
import torch
import torch.nn as nn
from transformers import BertTokenizer, BertPreTrainedModel, BertModel

MODEL_DIR = 'CrabInHoney/morphbert-tiny-v2-morpheme-segmentation-ru'
MAX_LEN = 32
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
ID2TAG = {0: 'END', 1: 'HYPH', 2: 'LINK', 3: 'POSTFIX', 4: 'PREF', 5: 'ROOT', 6: 'SUFF'}
NUM_MORPH_CLASSES = len(ID2TAG)

class BertForMultiTask(BertPreTrainedModel):
    """BERT encoder with two character-level heads: segmentation and morpheme classification."""
    def __init__(self, config, num_seg_labels=2, num_morph_labels=NUM_MORPH_CLASSES):
        super().__init__(config)
        self.bert = BertModel(config)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.seg_head = nn.Linear(config.hidden_size, num_seg_labels)    # 0 = morpheme start, 1 = continuation
        self.cls_head = nn.Linear(config.hidden_size, num_morph_labels)  # one of the 7 morpheme classes

    def forward(self, input_ids, attention_mask=None):
        x = self.dropout(self.bert(input_ids, attention_mask=attention_mask).last_hidden_state)
        return {"seg_logits": self.seg_head(x), "cls_logits": self.cls_head(x)}

tokenizer = BertTokenizer.from_pretrained(MODEL_DIR)
model = BertForMultiTask.from_pretrained(MODEL_DIR, num_morph_labels=NUM_MORPH_CLASSES).to(DEVICE).eval()

def analyze_word_compact(word):
    if not word.strip():
        return "Empty word"
    # The model operates on characters: feed the word as space-separated characters.
    chars = list(word.lower())
    enc = tokenizer(" ".join(chars), return_tensors='pt', max_length=MAX_LEN,
                    padding='max_length', truncation=True, add_special_tokens=True)
    with torch.no_grad():
        out = model(input_ids=enc['input_ids'].to(DEVICE), attention_mask=enc['attention_mask'].to(DEVICE))
    # Positions 0 and n+1 hold [CLS]/[SEP]; at most MAX_LEN - 2 characters fit.
    n = min(len(chars), MAX_LEN - 2)
    if n <= 0:
        return "Word too short/truncated"
    seg = torch.argmax(out['seg_logits'][0, 1:1 + n], -1).tolist()
    cls = torch.argmax(out['cls_logits'][0, 1:1 + n], -1).tolist()
    print(f"\n--- '{word}' (processed {n} chars) ---")
    print("Segmentation:", ' '.join(f'{chars[i]}:{seg[i]}' for i in range(n)))
    print("Classification:", ' '.join(f"{chars[i]}:{ID2TAG.get(cls[i], f'ID:{cls[i]}')}" for i in range(n)))
    # Group characters into morphemes: seg == 0 opens a new morpheme, seg == 1 extends it.
    morphemes, morph, tag = [], "", -1
    for i in range(n):
        if seg[i] == 0:
            if morph:
                morphemes.append(f"{morph}:{ID2TAG.get(tag, f'ID:{tag}')}")
            morph = chars[i]
            tag = cls[i]
        else:
            morph += chars[i]
    if morph:
        morphemes.append(f"{morph}:{ID2TAG.get(tag, f'ID:{tag}')}")
    res = " / ".join(morphemes)
    print(f"Result: {res}\n{'=' * 30}")
    return res

example_words = ["масляный", "предчувствий", "тарковский", "кот", "подгон"]
for w in example_words:
    analyze_word_compact(w)
|
``` |
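
In the loop above, the segmentation head's output is read as a boundary tag: 0 marks the first character of a morpheme and 1 a continuation, and each group takes the class predicted for its first character. For analyzing many words at once, here is a hedged sketch of a batched variant (same `tokenizer`, `model`, and constants as above; `analyze_batch` is a name introduced here for illustration, not part of the repository):

```python
def analyze_batch(words):
    # Batched variant of analyze_word_compact (illustrative sketch).
    texts = [" ".join(w.lower()) for w in words]  # space-separated characters per word
    enc = tokenizer(texts, return_tensors='pt', max_length=MAX_LEN,
                    padding='max_length', truncation=True, add_special_tokens=True)
    with torch.no_grad():
        out = model(input_ids=enc['input_ids'].to(DEVICE),
                    attention_mask=enc['attention_mask'].to(DEVICE))
    seg = torch.argmax(out['seg_logits'], -1)  # [batch, MAX_LEN]
    cls = torch.argmax(out['cls_logits'], -1)
    results = []
    for b, w in enumerate(words):
        chars = list(w.lower())
        n = min(len(chars), MAX_LEN - 2)  # offset by 1 below to skip [CLS]
        morphemes, morph, tag = [], "", -1
        for i in range(n):
            if seg[b, 1 + i].item() == 0:  # new morpheme begins here
                if morph:
                    morphemes.append(f"{morph}:{ID2TAG.get(tag, tag)}")
                morph, tag = chars[i], cls[b, 1 + i].item()
            else:
                morph += chars[i]
        if morph:
            morphemes.append(f"{morph}:{ID2TAG.get(tag, tag)}")
        results.append(" / ".join(morphemes))
    return results

print(analyze_batch(["масляный", "подгон"]))
# expected: ['масл:ROOT / ян:SUFF / ый:END', 'под:PREF / гон:ROOT']
```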
|
|
|
## Example Output |
|
|
|
``` |
|
--- 'масляный' (processed 8 chars) ---
Segmentation: м:0 а:1 с:1 л:1 я:0 н:1 ы:0 й:1
Classification: м:ROOT а:ROOT с:ROOT л:ROOT я:SUFF н:SUFF ы:END й:END
Result: масл:ROOT / ян:SUFF / ый:END
==============================

--- 'предчувствий' (processed 12 chars) ---
Segmentation: п:0 р:1 е:1 д:1 ч:0 у:1 в:0 с:0 т:1 в:1 и:0 й:1
Classification: п:PREF р:PREF е:PREF д:PREF ч:ROOT у:ROOT в:SUFF с:SUFF т:SUFF в:SUFF и:END й:END
Result: пред:PREF / чу:ROOT / в:SUFF / ств:SUFF / ий:END
==============================

--- 'тарковский' (processed 10 chars) ---
Segmentation: т:0 а:1 р:1 к:1 о:0 в:1 с:0 к:1 и:0 й:1
Classification: т:ROOT а:ROOT р:ROOT к:ROOT о:SUFF в:ROOT с:SUFF к:SUFF и:END й:END
Result: тарк:ROOT / ов:SUFF / ск:SUFF / ий:END
==============================

--- 'кот' (processed 3 chars) ---
Segmentation: к:0 о:1 т:1
Classification: к:ROOT о:ROOT т:ROOT
Result: кот:ROOT
==============================

--- 'подгон' (processed 6 chars) ---
Segmentation: п:0 о:1 д:1 г:0 о:1 н:1
Classification: п:PREF о:PREF д:PREF г:ROOT о:ROOT н:ROOT
Result: под:PREF / гон:ROOT
==============================
|
``` |
|
|
|
## Performance |
|
|
|
- **Segmentation accuracy:** 98.52%

- **Morpheme classification accuracy:** 98.34%
|
|