---
language: ru
license: apache-2.0
library_name: transformers
tags:
- russian
- morpheme-segmentation
- token-classification
- morphbert
- lightweight
- bert
- ru
- russ
pipeline_tag: token-classification
---

# MorphBERT-Tiny: Russian Morpheme Segmentation

This repository contains the `CrabInHoney/morphbert-tiny-v2-morpheme-segmentation-ru` model, a compact transformer-based system for morpheme segmentation and classification of Russian words. For each character of a given word, the model jointly predicts a morpheme boundary (start vs. continuation) and one of several morpheme categories: {0: 'END', 1: 'HYPH', 2: 'LINK', 3: 'POSTFIX', 4: 'PREF', 5: 'ROOT', 6: 'SUFF'}.

## Model Description

`morphbert-tiny-v2-morpheme-segmentation-ru` uses a lightweight BERT-like architecture, enabling efficient deployment and inference while maintaining high accuracy on character-level morphological analysis. The model was distilled from a larger teacher model.

**Key Features:**

- **Task:** Morpheme Segmentation & Classification (Token Classification at Character Level)
- **Language:** Russian (ru)
- **Architecture:** Transformer (BERT-like, optimized for size)
- **Labels:** END, HYPH, LINK, POSTFIX, PREF, ROOT, SUFF

**Model Size & Specifications:**

- **Parameters:** ~3.58 Million
- **Tensor Type:** F32
- **Disk Footprint:** ~14.3 MB
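
As a quick sanity check, the parameter count can be reproduced by loading the encoder and summing tensor sizes. This is a minimal sketch: loading with plain `BertModel` skips the checkpoint's two task heads, which contribute only a few thousand parameters and are negligible here.

```python
from transformers import BertModel

# Load only the BERT encoder; the checkpoint's task-head weights
# (seg_head, cls_head) are reported as unexpected keys and ignored.
encoder = BertModel.from_pretrained('CrabInHoney/morphbert-tiny-v2-morpheme-segmentation-ru')
print(f"{sum(p.numel() for p in encoder.parameters()):,} parameters")  # ~3.58M expected
```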

## Usage

The model can be used with the Hugging Face `transformers` library. The checkpoint carries two task heads (boundary segmentation and morpheme classification), so it needs the small custom multi-task wrapper recreated in the minimal example below:

```python
import torch
import torch.nn as nn
from transformers import BertTokenizer, BertPreTrainedModel, BertModel

MODEL_DIR = 'CrabInHoney/morphbert-tiny-v2-morpheme-segmentation-ru'
MAX_LEN = 32
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
ID2TAG = {0: 'END', 1: 'HYPH', 2: 'LINK', 3: 'POSTFIX', 4: 'PREF', 5: 'ROOT', 6: 'SUFF'}
NUM_MORPH_CLASSES = len(ID2TAG)

class BertForMultiTask(BertPreTrainedModel):
    """BERT encoder with two per-character heads:
    - seg_head: 2 labels (0 = morpheme start, 1 = continuation)
    - cls_head: morpheme-class labels (see ID2TAG)
    """
    def __init__(self, config, num_seg_labels=2, num_morph_labels=NUM_MORPH_CLASSES):
        super().__init__(config)
        self.bert = BertModel(config)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.seg_head = nn.Linear(config.hidden_size, num_seg_labels)
        self.cls_head = nn.Linear(config.hidden_size, num_morph_labels)

    def forward(self, input_ids, attention_mask=None):
        x = self.dropout(self.bert(input_ids, attention_mask=attention_mask).last_hidden_state)
        return {"seg_logits": self.seg_head(x), "cls_logits": self.cls_head(x)}

tokenizer = BertTokenizer.from_pretrained(MODEL_DIR)
model = BertForMultiTask.from_pretrained(MODEL_DIR, num_morph_labels=NUM_MORPH_CLASSES).to(DEVICE).eval()

def analyze_word_compact(word):
    if not word.strip():
        return "Empty word"
    # The model operates on characters: feed the word as space-separated characters.
    chars = list(word.lower())
    enc = tokenizer(" ".join(chars), return_tensors='pt', max_length=MAX_LEN,
                    padding='max_length', truncation=True, add_special_tokens=True)
    with torch.no_grad():
        out = model(input_ids=enc['input_ids'].to(DEVICE),
                    attention_mask=enc['attention_mask'].to(DEVICE))
    # Skip [CLS]/[SEP]; at most MAX_LEN - 2 characters fit in the window.
    n = min(len(chars), MAX_LEN - 2)
    if n <= 0:
        return "Word too short/truncated"
    seg = torch.argmax(out['seg_logits'][0, 1:1 + n], -1).tolist()
    cls = torch.argmax(out['cls_logits'][0, 1:1 + n], -1).tolist()
    print(f"\n--- '{word}' (processed {n} chars) ---")
    print("Segmentation:", ' '.join(f'{chars[i]}:{seg[i]}' for i in range(n)))
    print("Classification:", ' '.join(f"{chars[i]}:{ID2TAG.get(cls[i], f'ID:{cls[i]}')}" for i in range(n)))
    # Group characters into morphemes: segmentation label 0 starts a new morpheme.
    morphemes, morph, tag = [], "", -1
    for i in range(n):
        if seg[i] == 0:
            if morph:
                morphemes.append(f"{morph}:{ID2TAG.get(tag, f'ID:{tag}')}")
            morph, tag = chars[i], cls[i]
        else:
            morph += chars[i]
    if morph:
        morphemes.append(f"{morph}:{ID2TAG.get(tag, f'ID:{tag}')}")
    res = " / ".join(morphemes)
    print(f"Result: {res}\n{'=' * 30}")
    return res

example_words = ["масляный", "предчувствий", "тарковский", "кот", "подгон"]
for w in example_words:
    analyze_word_compact(w)
```

## Example Output

```
--- 'масляный' (processed 8 chars) ---
Segmentation: м:0 а:1 с:1 л:1 я:0 н:1 ы:0 й:1
Classification: м:ROOT а:ROOT с:ROOT л:ROOT я:SUFF н:SUFF ы:END й:END
Result: масл:ROOT / ян:SUFF / ый:END
==============================

--- 'предчувствий' (processed 12 chars) ---
Segmentation: п:0 р:1 е:1 д:1 ч:0 у:1 в:0 с:0 т:1 в:1 и:0 й:1
Classification: п:PREF р:PREF е:PREF д:PREF ч:ROOT у:ROOT в:SUFF с:SUFF т:SUFF в:SUFF и:END й:END
Result: пред:PREF / чу:ROOT / в:SUFF / ств:SUFF / ий:END
==============================

--- 'тарковский' (processed 10 chars) ---
Segmentation: т:0 а:1 р:1 к:1 о:0 в:1 с:0 к:1 и:0 й:1
Classification: т:ROOT а:ROOT р:ROOT к:ROOT о:SUFF в:ROOT с:SUFF к:SUFF и:END й:END
Result: тарк:ROOT / ов:SUFF / ск:SUFF / ий:END
==============================

--- 'кот' (processed 3 chars) ---
Segmentation: к:0 о:1 т:1
Classification: к:ROOT о:ROOT т:ROOT
Result: кот:ROOT
==============================

--- 'подгон' (processed 6 chars) ---
Segmentation: п:0 о:1 д:1 г:0 о:1 н:1
Classification: п:PREF о:PREF д:PREF г:ROOT о:ROOT н:ROOT
Result: под:PREF / гон:ROOT
==============================
```
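
If a structured result is needed downstream rather than a printed string, the return value of `analyze_word_compact` can be split back into (morpheme, tag) pairs. `parse_morphemes` below is a hypothetical helper, not part of this repository:

```python
def parse_morphemes(result: str) -> list[tuple[str, str]]:
    """Split 'масл:ROOT / ян:SUFF / ый:END' into [('масл', 'ROOT'), ...]."""
    return [tuple(part.split(":", 1)) for part in result.split(" / ")]

print(parse_morphemes(analyze_word_compact("масляный")))
# [('масл', 'ROOT'), ('ян', 'SUFF'), ('ый', 'END')]
```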

## Performance

- Segmentation accuracy: 98.52%
- Morph-class accuracy: 98.34%
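
For reference, per-character accuracies of this kind could be reproduced with the sketch below. It reuses `tokenizer`, `model`, `MAX_LEN`, and `DEVICE` from the usage example; `gold` is a hypothetical iterable of (word, seg_labels, cls_labels) triples with one integer label per character, and whether the reported figures used exactly this protocol is not specified here.

```python
import torch

def char_accuracy(gold):
    """Character-level accuracy for the segmentation and classification heads."""
    seg_hits = cls_hits = total = 0
    for word, gold_seg, gold_cls in gold:
        chars = list(word.lower())
        enc = tokenizer(" ".join(chars), return_tensors='pt', max_length=MAX_LEN,
                        padding='max_length', truncation=True)
        with torch.no_grad():
            out = model(input_ids=enc['input_ids'].to(DEVICE),
                        attention_mask=enc['attention_mask'].to(DEVICE))
        n = min(len(chars), MAX_LEN - 2)
        seg = torch.argmax(out['seg_logits'][0, 1:1 + n], -1).tolist()
        cls = torch.argmax(out['cls_logits'][0, 1:1 + n], -1).tolist()
        seg_hits += sum(p == g for p, g in zip(seg, gold_seg[:n]))
        cls_hits += sum(p == g for p, g in zip(cls, gold_cls[:n]))
        total += n
    return seg_hits / total, cls_hits / total
```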