---
library_name: transformers
language:
- grc
---

# SyllaBERTa: A Syllable-Based RoBERTa for Ancient Greek

**SyllaBERTa** is an experimental Transformer-based masked language model (MLM) trained on Ancient Greek texts, tokenized at the *syllable* level. It is specifically designed to tackle tasks involving prosody, meter, and rhyme.

---

# Model Summary

| Attribute             | Value                                            |
|:----------------------|:-------------------------------------------------|
| Base architecture     | RoBERTa (custom configuration)                   |
| Vocabulary size       | 42,042 syllabic tokens                           |
| Hidden size           | 768                                              |
| Number of layers      | 12                                               |
| Attention heads       | 12                                               |
| Intermediate size     | 3,072                                            |
| Max sequence length   | 514                                              |
| Pretraining objective | Masked Language Modeling (MLM)                   |
| Optimizer             | AdamW                                            |
| Loss function         | Cross-entropy with 15% token masking probability |

---

# Tokenizer

The tokenizer is a custom subclass of `PreTrainedTokenizer` that operates on syllables rather than words or characters. It:

- Maps each syllable to an ID.
- Supports diphthong merging and Greek orthographic phenomena.
- Uses space-separated syllable tokens.

**Example tokenization:**

Input: `Κατέβην χθὲς εἰς Πειραιᾶ`

Tokens: `['κα', 'τέ', 'βην', 'χθὲ', 'σεἰσ', 'πει', 'ραι', 'ᾶ']`

> Note that syllabification crosses word boundaries: the final ς of `χθὲς` is resyllabified with the following `εἰς`, producing the fused token `σεἰσ`.

---

# Usage Example

```python
import random

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("Ericu950/SyllaBERTa", trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained("Ericu950/SyllaBERTa", trust_remote_code=True)

# Tokenize a sentence into syllables
text = "Κατέβην χθὲς εἰς Πειραιᾶ μετὰ Γλαύκωνος τοῦ Ἀρίστωνος"
tokens = tokenizer.tokenize(text)
print("Original tokens:", tokens)

# Mask one syllable at random
mask_position = random.randint(0, len(tokens) - 1)
tokens[mask_position] = tokenizer.mask_token
masked_text = tokenizer.convert_tokens_to_string(tokens)
print("Masked at position", mask_position)
print("Masked text:", masked_text)

# Predict the masked syllable
inputs = tokenizer(masked_text, return_tensors="pt", padding=True, truncation=True)
inputs.pop("token_type_ids", None)  # RoBERTa does not use token type IDs
with torch.no_grad():
    outputs = model(**inputs)

# Fetch the top 5 predictions at the masked position
logits = outputs.logits
mask_token_index = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
top = logits[0, mask_token_index].topk(5, dim=-1)
print("Top 5 predictions for masked token:")
for score, token_id in zip(top.values.squeeze(0).tolist(), top.indices.squeeze(0).tolist()):
    print(f"{tokenizer.convert_ids_to_tokens(token_id)} (score: {score:.2f})")
```

Because the masked position is chosen at random, the exact output varies between runs. A typical run prints:

```
Original tokens: ['κα', 'τέ', 'βην', 'χθὲ', 'σεἰσ', 'πει', 'ραι', 'ᾶ', 'με', 'τὰγ', 'λαύ', 'κω', 'νοσ', 'τοῦ', 'ἀ', 'ρίσ', 'τω', 'νοσ']
Masked at position 6
Masked text: κα τέ βην χθὲ σεἰσ πει [MASK] ᾶ με τὰγ λαύ κω νοσ τοῦ ἀ ρίσ τω νοσ
Top 5 predictions for masked token:
ραι (score: 23.12)
ρα (score: 14.69)
ραισ (score: 12.63)
σαι (score: 12.43)
ρη (score: 12.26)
```

---

# License

MIT License.

---

# Authors

This work is part of ongoing research by **Eric Cullhed** (Uppsala University) and **Albin Thörn Cleland** (Lund University).

---

# Acknowledgements

The computations were enabled by resources provided by the National Academic Infrastructure for Supercomputing in Sweden (NAISS), partially funded by the Swedish Research Council through grant agreement no. 2022-06725.
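
---

# Continued Pretraining (Sketch)

The pretraining recipe in the model summary above (MLM with a 15% masking probability, cross-entropy loss, AdamW) maps onto stock Hugging Face components. The snippet below is a minimal sketch of continued MLM training on your own corpus, not the authors' training script: the toy corpus, batch size, and learning rate are placeholder assumptions, and it assumes the custom tokenizer exposes the usual mask and padding tokens.

```python
from torch.optim import AdamW
from torch.utils.data import DataLoader
from transformers import AutoModelForMaskedLM, AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("Ericu950/SyllaBERTa", trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained("Ericu950/SyllaBERTa", trust_remote_code=True)

# Toy corpus: substitute your own Ancient Greek texts.
corpus = ["Κατέβην χθὲς εἰς Πειραιᾶ μετὰ Γλαύκωνος τοῦ Ἀρίστωνος"]
encodings = [tokenizer(t, truncation=True, max_length=512) for t in corpus]
for enc in encodings:
    enc.pop("token_type_ids", None)  # RoBERTa does not use token type IDs

# Dynamic masking at the 15% probability stated in the model summary
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)
loader = DataLoader(encodings, batch_size=8, shuffle=True, collate_fn=collator)

optimizer = AdamW(model.parameters(), lr=5e-5)  # placeholder learning rate
model.train()
for batch in loader:
    outputs = model(**batch)  # the collator supplies `labels`; loss is cross-entropy
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```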