---
language: ru
license: apache-2.0
library_name: transformers
tags:
  - russian
  - morpheme-segmentation
  - token-classification
  - morphbert
  - lightweight
  - bert
  - ru
  - russ
pipeline_tag: token-classification
new_version: CrabInHoney/morphbert-tiny-v2-morpheme-segmentation-ru
---

# MorphBERT-Tiny: Russian Morpheme Segmentation

This repository contains the `CrabInHoney/morphbert-tiny-morpheme-segmentation-ru` model, a highly compact transformer-based system fine-tuned for morpheme segmentation of Russian words. The model classifies each character of a given word into one of four morpheme categories: Prefix (PREF), Root (ROOT), Suffix (SUFF), or Ending (END).

## Model Description

`morphbert-tiny-morpheme-segmentation-ru` leverages a lightweight transformer architecture, enabling efficient deployment and inference while maintaining high performance on the specific task of morphological analysis at the character level. Despite its diminutive size, the model demonstrates considerable accuracy in identifying the constituent morphemes within Russian words.

The model was derived through logit distillation from a larger teacher model, comparable in complexity to `bert-base`.
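
For intuition, here is a minimal sketch of what logit distillation for a character-level classifier typically looks like. The temperature, loss scaling, and function name are illustrative assumptions, not the published training recipe:

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # Soften both distributions with a temperature, then minimise the KL divergence
    # from the teacher to the student. The temperature value is an assumption.
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature**2
```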

**Key Features:**

- Task: Morpheme Segmentation (Token Classification at Character Level)
- Language: Russian (ru)
- Architecture: Transformer (BERT-like, optimized for size)
- Labels: PREF, ROOT, SUFF, END (see the config check below)
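
The exact label inventory ships with the model config; a quick way to confirm it locally (the expected contents are an assumption based on this card):

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("CrabInHoney/morphbert-tiny-morpheme-segmentation-ru")
print(config.id2label)  # expected to cover PREF, ROOT, SUFF and END
```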

**Model Size & Specifications:**

- Parameters: ~3.58 million (a quick local check is sketched below)
- Tensor type: F32
- Disk footprint: ~14.3 MB
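
A simple sanity check of the parameter count, assuming the model loads as in the Usage section below:

```python
from transformers import AutoModelForTokenClassification

model = AutoModelForTokenClassification.from_pretrained(
    "CrabInHoney/morphbert-tiny-morpheme-segmentation-ru"
)
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.2f}M parameters")  # expected to be roughly 3.58M
```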

## Usage

The model can be easily used with the Hugging Face `transformers` library. It processes words character by character.

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

model_name = "CrabInHoney/morphbert-tiny-morpheme-segmentation-ru"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)
model.eval()

def analyze(word):
    # Split the word into characters and feed them as pre-tokenized "words".
    tokens = list(word)
    encoded = tokenizer(tokens, is_split_into_words=True, return_tensors="pt", truncation=True, max_length=34)
    with torch.no_grad():
        logits = model(**encoded).logits
        predictions = logits.argmax(dim=-1)[0]

    # Map each sub-token back to its character and attach the predicted label.
    word_ids = encoded.word_ids()
    output = []
    for i, word_idx in enumerate(word_ids):
        if word_idx is not None and word_idx < len(tokens):
            label_id = predictions[i].item()
            label = model.config.id2label[label_id]
            output.append(f"{tokens[word_idx]}:{label}")
    return " / ".join(output)

# Examples
for word in ["масляный", "предчувствий", "тарковский", "кот", "подгон"]:
    print(f"{word} → {analyze(word)}")
```
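
The per-character labels can be folded into contiguous morpheme spans. The helper below is an illustrative sketch (not part of the model's API) that reuses `analyze()` from above; note that adjacent morphemes of the same type would be merged into one span:

```python
def segment(word):
    # Group consecutive characters that share a predicted label into spans.
    # The output format here is an arbitrary illustrative choice.
    pairs = [item.split(":") for item in analyze(word).split(" / ")]
    spans = []
    for char, label in pairs:
        if spans and spans[-1][1] == label:
            spans[-1][0] += char
        else:
            spans.append([char, label])
    return " + ".join(f"{text}({label})" for text, label in spans)

print(segment("подгон"))  # expected: под(PREF) + гон(ROOT)
```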

## Example Predictions

```
масляный → м:ROOT / а:ROOT / с:ROOT / л:ROOT / я:SUFF / н:SUFF / ы:END / й:END
предчувствий → п:PREF / р:PREF / е:PREF / д:PREF / ч:ROOT / у:ROOT / в:SUFF / с:SUFF / т:SUFF / в:SUFF / и:END / й:END
тарковский → т:ROOT / а:ROOT / р:ROOT / к:ROOT / о:SUFF / в:SUFF / с:SUFF / к:SUFF / и:END / й:END
кот → к:ROOT / о:ROOT / т:ROOT
подгон → п:PREF / о:PREF / д:PREF / г:ROOT / о:ROOT / н:ROOT
```

## Performance

The model achieves a character-level accuracy of approximately 0.975 on its evaluation dataset.
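
Character-level accuracy here is understood as the fraction of characters whose predicted label matches the gold label. A minimal sketch over a hypothetical labelled set (the evaluation data itself is not published with this card):

```python
def char_accuracy(examples):
    # examples: iterable of (word, gold_labels) pairs, one gold label per character.
    # Reuses analyze() from the Usage section; the data below is purely illustrative.
    correct = total = 0
    for word, gold in examples:
        predicted = [item.split(":")[1] for item in analyze(word).split(" / ")]
        for p, g in zip(predicted, gold):
            correct += int(p == g)
            total += 1
    return correct / total

print(char_accuracy([("кот", ["ROOT", "ROOT", "ROOT"])]))
```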

## Limitations

- Performance may vary on out-of-vocabulary words, neologisms, or highly complex morphological structures not sufficiently represented in the training data.
- The model operates strictly at the character level; it does not incorporate broader lexical or syntactic context.
- Ambiguous morpheme boundaries are resolved according to patterns learned during training, which may not always align with linguistic conventions in edge cases.