---
library_name: transformers
language:
- grc
---
# SyllaBERTa: A Syllable-Based RoBERTa for Ancient Greek
**SyllaBERTa** is an experimental Transformer-based masked language model (MLM) trained on Ancient Greek texts, tokenized at the *syllable* level.
It is specifically designed to tackle tasks involving prosody, meter, and rhyme.
---
# Model Summary
| Attribute | Value |
|:------------------------|:----------------------------------|
| Base architecture | RoBERTa (custom configuration) |
| Vocabulary size | 42,042 syllabic tokens |
| Hidden size | 768 |
| Number of layers | 12 |
| Attention heads | 12 |
| Intermediate size | 3,072 |
| Max sequence length | 514 |
| Pretraining objective | Masked Language Modeling (MLM) |
| Optimizer | AdamW |
| Loss function           | Cross-entropy (15% token masking probability) |
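For orientation, the table above corresponds roughly to the following `RobertaConfig`. This is an illustrative sketch only; details such as special-token IDs may differ in the released checkpoint, so load the published weights rather than initializing from this config:

```python
from transformers import RobertaConfig, RobertaForMaskedLM

# Illustrative configuration mirroring the table above (not the shipped config file)
config = RobertaConfig(
    vocab_size=42_042,          # syllabic vocabulary
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    intermediate_size=3072,
    max_position_embeddings=514,
)

model = RobertaForMaskedLM(config)
print(f"{model.num_parameters():,} parameters")  # dominated by the large syllable embedding matrix
```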
---
# Tokenizer
The tokenizer is a custom subclass of `PreTrainedTokenizer` that operates on syllables rather than words or characters. It:
- Maps each syllable to an ID.
- Supports diphthong merging and Greek orthographic phenomena.
- Uses space-separated syllable tokens.
**Example tokenization:**
Input:
`Κατέβην χθὲς εἰς Πειραιᾶ`
Tokens:
`['κα', 'τέ', 'βην', 'χθὲ', 'σεἰσ', 'πει', 'ραι', 'ᾶ']`
> Observe that words are fused at the syllable level: `σεἰσ`, for instance, joins the final sigma of χθὲς to the following εἰς.
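The syllable-to-ID mapping can be inspected directly through the tokenizer. The following minimal sketch assumes the checkpoint's custom tokenizer is loaded with `trust_remote_code=True`, as in the usage example below:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Ericu950/SyllaBERTa", trust_remote_code=True)

# Syllabify and look up each syllable's vocabulary ID
tokens = tokenizer.tokenize("Κατέβην χθὲς εἰς Πειραιᾶ")
ids = tokenizer.convert_tokens_to_ids(tokens)
for syllable, idx in zip(tokens, ids):
    print(syllable, "->", idx)
```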
---
# Usage Example
```python
import random

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("Ericu950/SyllaBERTa", trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained("Ericu950/SyllaBERTa", trust_remote_code=True)
model.eval()

# Syllabify a sentence
text = "Κατέβην χθὲς εἰς Πειραιᾶ μετὰ Γλαύκωνος τοῦ Ἀρίστωνος"
tokens = tokenizer.tokenize(text)
print("Original tokens:", tokens)

# Mask one syllable at random
mask_position = random.randint(0, len(tokens) - 1)
tokens[mask_position] = tokenizer.mask_token
masked_text = tokenizer.convert_tokens_to_string(tokens)
print("Masked at position", mask_position)
print("Masked text:", masked_text)

# Run the model on the masked input
inputs = tokenizer(masked_text, return_tensors="pt", padding=True, truncation=True)
inputs.pop("token_type_ids", None)
with torch.no_grad():
    outputs = model(**inputs)

# Fetch the top-5 predictions at the masked position
logits = outputs.logits
mask_token_index = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
top = logits[0, mask_token_index].topk(5, dim=-1)
print("Top 5 predictions for masked token:")
for score, token_id in zip(top.values.squeeze(0).tolist(), top.indices.squeeze(0).tolist()):
    print(f"{tokenizer.convert_ids_to_tokens(token_id)} (score: {score:.2f})")
```
It should print something like the following (the masked position is chosen at random):
```
Original tokens: ['κα', 'τέ', 'βην', 'χθὲ', 'σεἰσ', 'πει', 'ραι', 'ᾶ', 'με', 'τὰγ', 'λαύ', 'κω', 'νοσ', 'τοῦ', 'ἀ', 'ρίσ', 'τω', 'νοσ']
Masked at position 6
Masked text: κα τέ βην χθὲ σεἰσ πει [MASK] ᾶ με τὰγ λαύ κω νοσ τοῦ ἀ ρίσ τω νοσ
Top 5 predictions for masked token:
ραι (score: 23.12)
ρα (score: 14.69)
ραισ (score: 12.63)
σαι (score: 12.43)
ρη (score: 12.26)
```
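The same prediction can also be run through the `fill-mask` pipeline. This is an untested sketch that assumes the custom tokenizer is pipeline-compatible; the input must already be syllabified and contain the tokenizer's mask token:

```python
from transformers import pipeline

fill = pipeline("fill-mask", model="Ericu950/SyllaBERTa", trust_remote_code=True)

# Syllabified input with one syllable replaced by the mask token
masked = "κα τέ βην χθὲ σεἰσ πει " + fill.tokenizer.mask_token + " ᾶ"
for candidate in fill(masked, top_k=5):
    print(candidate["token_str"], round(candidate["score"], 4))
```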
---
# License
MIT License.
---
# Authors
This work is part of ongoing research by **Eric Cullhed** (Uppsala University) and **Albin Thörn Cleland** (Lund University).
---
# Acknowledgements
The computations were enabled by resources provided by the National Academic Infrastructure for Supercomputing in Sweden (NAISS), partially funded by the Swedish Research Council through grant agreement no. 2022-06725.