---
library_name: transformers
language:
- grc
---

# SyllaBERTa: A Syllable-Based RoBERTa for Ancient Greek

**SyllaBERTa** is an experimental Transformer-based masked language model (MLM) trained on Ancient Greek texts, tokenized at the *syllable* level. It is specifically designed to tackle tasks involving prosody, meter, and rhyme.

---

# Model Summary

| Attribute             | Value                                            |
|:----------------------|:-------------------------------------------------|
| Base architecture     | RoBERTa (custom configuration)                   |
| Vocabulary size       | 42,042 syllabic tokens                           |
| Hidden size           | 768                                              |
| Number of layers      | 12                                               |
| Attention heads       | 12                                               |
| Intermediate size     | 3,072                                            |
| Max sequence length   | 514                                              |
| Pretraining objective | Masked Language Modeling (MLM)                   |
| Optimizer             | AdamW                                            |
| Loss function         | Cross-entropy with 15% token masking probability |

---

# Tokenizer

The tokenizer is a custom subclass of `PreTrainedTokenizer` that operates on syllables rather than words or characters. It:

- Maps each syllable to an ID.
- Supports diphthong merging and Greek orthographic phenomena.
- Uses space-separated syllable tokens.

**Example tokenization:**

Input: `Κατέβην χθὲς εἰς Πειραιᾶ`

Tokens: `['κα', 'τέ', 'βην', 'χθὲ', 'σεἰσ', 'πει', 'ραι', 'ᾶ']`

> Note that syllabification crosses word boundaries: the final ς of `χθὲς` is resyllabified with the following `εἰς`, producing the fused token `σεἰσ`.

---

# Usage Example

```python
import random

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("Ericu950/SyllaBERTa", trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained("Ericu950/SyllaBERTa", trust_remote_code=True)

# Tokenize a sentence into syllables
text = "Κατέβην χθὲς εἰς Πειραιᾶ μετὰ Γλαύκωνος τοῦ Ἀρίστωνος"
tokens = tokenizer.tokenize(text)
print("Original tokens:", tokens)

# Mask one syllable at random
mask_position = random.randint(0, len(tokens) - 1)
tokens[mask_position] = tokenizer.mask_token
masked_text = tokenizer.convert_tokens_to_string(tokens)
print("Masked at position", mask_position)
print("Masked text:", masked_text)

# Predict the masked syllable
inputs = tokenizer(masked_text, return_tensors="pt", padding=True, truncation=True)
inputs.pop("token_type_ids", None)  # RoBERTa does not use token type IDs
with torch.no_grad():
    outputs = model(**inputs)

# Fetch the top 5 predictions at the masked position
logits = outputs.logits
mask_token_index = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
top = logits[0, mask_token_index].topk(5, dim=-1)
print("Top 5 predictions for masked token:")
for score, token_id in zip(top.values.squeeze(0).tolist(), top.indices.squeeze(0).tolist()):
    print(f"{tokenizer.convert_ids_to_tokens(token_id)} (score: {score:.2f})")
```

Because the masked position is chosen at random, the exact output varies between runs. A typical run prints:

```
Original tokens: ['κα', 'τέ', 'βην', 'χθὲ', 'σεἰσ', 'πει', 'ραι', 'ᾶ', 'με', 'τὰγ', 'λαύ', 'κω', 'νοσ', 'τοῦ', 'ἀ', 'ρίσ', 'τω', 'νοσ']
Masked at position 6
Masked text: κα τέ βην χθὲ σεἰσ πει [MASK] ᾶ με τὰγ λαύ κω νοσ τοῦ ἀ ρίσ τω νοσ
Top 5 predictions for masked token:
ραι (score: 23.12)
ρα (score: 14.69)
ραισ (score: 12.63)
σαι (score: 12.43)
ρη (score: 12.26)
```

---

# License

MIT License.

---

# Authors

This work is part of ongoing research by **Eric Cullhed** (Uppsala University) and **Albin Thörn Cleland** (Lund University).

---

# Acknowledgements

The computations were enabled by resources provided by the National Academic Infrastructure for Supercomputing in Sweden (NAISS), partially funded by the Swedish Research Council through grant agreement no. 2022-06725.
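
---

# Continued Pretraining (Sketch)

The pretraining recipe in the model summary above (MLM with a 15% masking probability, cross-entropy loss, AdamW) maps onto stock Hugging Face components. The snippet below is a minimal sketch of continued MLM training on your own corpus, not the authors' training script: the toy corpus, batch size, and learning rate are placeholder assumptions, and it assumes the custom tokenizer exposes the usual mask and padding tokens.

```python
from torch.optim import AdamW
from torch.utils.data import DataLoader
from transformers import AutoModelForMaskedLM, AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("Ericu950/SyllaBERTa", trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained("Ericu950/SyllaBERTa", trust_remote_code=True)

# Toy corpus: substitute your own Ancient Greek texts.
corpus = ["Κατέβην χθὲς εἰς Πειραιᾶ μετὰ Γλαύκωνος τοῦ Ἀρίστωνος"]
encodings = [tokenizer(t, truncation=True, max_length=512) for t in corpus]
for enc in encodings:
    enc.pop("token_type_ids", None)  # RoBERTa does not use token type IDs

# Dynamic masking at the 15% probability stated in the model summary
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)
loader = DataLoader(encodings, batch_size=8, shuffle=True, collate_fn=collator)

optimizer = AdamW(model.parameters(), lr=5e-5)  # placeholder learning rate
model.train()
for batch in loader:
    outputs = model(**batch)  # the collator supplies `labels`; loss is cross-entropy
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```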