---
library_name: transformers
language:
- grc
---

# SyllaBERTa: A Syllable-Based RoBERTa for Ancient Greek

**SyllaBERTa** is an experimental Transformer-based masked language model (MLM) trained on Ancient Greek texts, tokenized at the *syllable* level. It is specifically designed to tackle tasks involving prosody, meter, and rhyme.

---

# Model Summary

| Attribute             | Value                                            |
|:----------------------|:-------------------------------------------------|
| Base architecture     | RoBERTa (custom configuration)                   |
| Vocabulary size       | 42,042 syllabic tokens                           |
| Hidden size           | 768                                              |
| Number of layers      | 12                                               |
| Attention heads       | 12                                               |
| Intermediate size     | 3,072                                            |
| Max sequence length   | 514                                              |
| Pretraining objective | Masked Language Modeling (MLM)                   |
| Optimizer             | AdamW                                            |
| Loss function         | Cross-entropy with 15% token masking probability |
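
For orientation, these hyperparameters map directly onto a standard `RobertaConfig`. The snippet below is a minimal sketch for illustration only, assuming the usual RoBERTa field names; the released checkpoint ships its own `config.json`.

```python
from transformers import RobertaConfig, RobertaForMaskedLM

# Illustrative sketch: how the hyperparameters listed above map onto
# RobertaConfig fields. The actual checkpoint provides its own config.json.
config = RobertaConfig(
    vocab_size=42_042,              # syllabic vocabulary
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    intermediate_size=3_072,
    max_position_embeddings=514,
)

model = RobertaForMaskedLM(config)
print(f"Randomly initialized model with {model.num_parameters():,} parameters")

# Pretraining masks 15% of syllable tokens, e.g. with
# DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15).
```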

---

# Tokenizer

The tokenizer is a custom subclass of `PreTrainedTokenizer`, operating on syllables rather than words or characters. It:

- Maps each syllable to an ID.
- Supports diphthong merging and Greek orthographic phenomena.
- Uses space-separated syllable tokens.

**Example tokenization:**

Input:
`Κατέβην χθὲς εἰς Πειραιᾶ`

Tokens:
`['κα', 'τέ', 'βην', 'χθὲ', 'σεἰσ', 'πει', 'ραι', 'ᾶ']`

> Note that words are fused at the syllable level: syllabification runs across word boundaries, so a word-final consonant can attach to the following word (as in `σεἰσ` above).
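
Because the tokenizer subclasses `PreTrainedTokenizer`, the standard token/ID conversion methods are available. A minimal sketch (the concrete IDs depend on the trained vocabulary, so none are shown here):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Ericu950/SyllaBERTa", trust_remote_code=True)

# Syllables <-> vocabulary IDs via the standard PreTrainedTokenizer API
tokens = tokenizer.tokenize("Κατέβην χθὲς εἰς Πειραιᾶ")
ids = tokenizer.convert_tokens_to_ids(tokens)
print(list(zip(tokens, ids)))

# Back to a space-separated syllable string
print(tokenizer.convert_tokens_to_string(tokens))
```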

---

# Usage Example

```python
import random

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("Ericu950/SyllaBERTa", trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained("Ericu950/SyllaBERTa", trust_remote_code=True)
model.eval()

# Tokenize a sentence into syllables
text = "Κατέβην χθὲς εἰς Πειραιᾶ μετὰ Γλαύκωνος τοῦ Ἀρίστωνος"
tokens = tokenizer.tokenize(text)
print("Original tokens:", tokens)

# Mask one syllable at random
mask_position = random.randint(0, len(tokens) - 1)
tokens[mask_position] = tokenizer.mask_token
masked_text = tokenizer.convert_tokens_to_string(tokens)
print("Masked at position", mask_position)
print("Masked text:", masked_text)

# Predict the masked syllable
inputs = tokenizer(masked_text, return_tensors="pt", padding=True, truncation=True)
inputs.pop("token_type_ids", None)
with torch.no_grad():
    outputs = model(**inputs)

# Fetch the top 5 predictions for the masked position
logits = outputs.logits
mask_token_index = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
top = logits[0, mask_token_index].squeeze(0).topk(5, dim=-1)
predicted = tokenizer.convert_ids_to_tokens(top.indices.tolist())

print("Top 5 predictions for masked token:")
for token, score in zip(predicted, top.values.tolist()):
    print(f"{token} (score: {score:.2f})")
```

It should print something like the following (the masked position is chosen at random):

```
Original tokens: ['κα', 'τέ', 'βην', 'χθὲ', 'σεἰσ', 'πει', 'ραι', 'ᾶ', 'με', 'τὰγ', 'λαύ', 'κω', 'νοσ', 'τοῦ', 'ἀ', 'ρίσ', 'τω', 'νοσ']

Masked at position 6
Masked text: κα τέ βην χθὲ σεἰσ πει [MASK] ᾶ με τὰγ λαύ κω νοσ τοῦ ἀ ρίσ τω νοσ

Top 5 predictions for masked token:
ραι (score: 23.12)
ρα (score: 14.69)
ραισ (score: 12.63)
σαι (score: 12.43)
ρη (score: 12.26)
```
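
For quick experiments, the generic `fill-mask` pipeline may also be used. This is a sketch rather than a tested recipe: it assumes the custom tokenizer resolves correctly under `trust_remote_code=True`, and the input must already be syllabified, with the mask token (`[MASK]`, as shown above) replacing one syllable.

```python
from transformers import pipeline

# Sketch: fill-mask over a pre-syllabified input (assumes the custom
# tokenizer and model load correctly via trust_remote_code).
fill = pipeline("fill-mask", model="Ericu950/SyllaBERTa", trust_remote_code=True)

masked = "κα τέ βην χθὲ σεἰσ πει [MASK] ᾶ"
for candidate in fill(masked, top_k=5):
    print(candidate["token_str"], round(candidate["score"], 3))
```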

---

# License

MIT License.

---

# Authors

This work is part of ongoing research by **Eric Cullhed** (Uppsala University) and **Albin Thörn Cleland** (Lund University).

---

# Acknowledgements

The computations were enabled by resources provided by the National Academic Infrastructure for Supercomputing in Sweden (NAISS), partially funded by the Swedish Research Council through grant agreement no. 2022-06725.