---
library_name: transformers
language:
- grc
---
# SyllaBERTa: A Syllable-Based RoBERTa for Ancient Greek
**SyllaBERTa** is an experimental Transformer-based masked language model (MLM) trained on Ancient Greek texts, tokenized at the *syllable* level.
It is specifically designed to tackle tasks involving prosody, meter, and rhyme.
---
# Model Summary
| Attribute | Value |
|:------------------------|:----------------------------------|
| Base architecture | RoBERTa (custom configuration) |
| Vocabulary size | 42,042 syllabic tokens |
| Hidden size | 768 |
| Number of layers | 12 |
| Attention heads | 12 |
| Intermediate size | 3,072 |
| Max sequence length | 514 |
| Pretraining objective | Masked Language Modeling (MLM) |
| Optimizer | AdamW |
| Loss function           | Cross-entropy (15% token masking probability) |
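For orientation, the table above corresponds roughly to the following `RobertaConfig`. This is an illustrative sketch only; details such as special-token IDs may differ in the released checkpoint, so load the published weights rather than initializing from this config:

```python
from transformers import RobertaConfig, RobertaForMaskedLM

# Illustrative configuration mirroring the table above (not the shipped config file)
config = RobertaConfig(
    vocab_size=42_042,          # syllabic vocabulary
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    intermediate_size=3072,
    max_position_embeddings=514,
)

model = RobertaForMaskedLM(config)
print(f"{model.num_parameters():,} parameters")  # dominated by the large syllable embedding matrix
```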
---
# Tokenizer
The tokenizer is a custom subclass of `PreTrainedTokenizer` that operates on syllables rather than words or characters. It:
- Maps each syllable to an ID.
- Supports diphthong merging and Greek orthographic phenomena.
- Uses space-separated syllable tokens.
**Example tokenization:**
Input:
`Κατέβην χθὲς εἰς Πειραιᾶ`
Tokens:
`['κα', 'τέ', 'βην', 'χθὲ', 'σεἰσ', 'πει', 'ραι', 'ᾶ']`
> Observe that words are fused at the syllable level: `σεἰσ`, for instance, joins the final sigma of χθὲς to the following εἰς.
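The syllable-to-ID mapping can be inspected directly through the tokenizer. The following minimal sketch assumes the checkpoint's custom tokenizer is loaded with `trust_remote_code=True`, as in the usage example below:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Ericu950/SyllaBERTa", trust_remote_code=True)

# Syllabify and look up each syllable's vocabulary ID
tokens = tokenizer.tokenize("Κατέβην χθὲς εἰς Πειραιᾶ")
ids = tokenizer.convert_tokens_to_ids(tokens)
for syllable, idx in zip(tokens, ids):
    print(syllable, "->", idx)
```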
---
# Usage Example
```python
import random

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("Ericu950/SyllaBERTa", trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained("Ericu950/SyllaBERTa", trust_remote_code=True)
model.eval()

# Syllabify a sentence
text = "Κατέβην χθὲς εἰς Πειραιᾶ μετὰ Γλαύκωνος τοῦ Ἀρίστωνος"
tokens = tokenizer.tokenize(text)
print("Original tokens:", tokens)

# Mask one syllable at random
mask_position = random.randint(0, len(tokens) - 1)
tokens[mask_position] = tokenizer.mask_token
masked_text = tokenizer.convert_tokens_to_string(tokens)
print("Masked at position", mask_position)
print("Masked text:", masked_text)

# Run the model on the masked input
inputs = tokenizer(masked_text, return_tensors="pt", padding=True, truncation=True)
inputs.pop("token_type_ids", None)
with torch.no_grad():
    outputs = model(**inputs)

# Fetch the top-5 predictions at the masked position
logits = outputs.logits
mask_token_index = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
top = logits[0, mask_token_index].topk(5, dim=-1)
print("Top 5 predictions for masked token:")
for score, token_id in zip(top.values.squeeze(0).tolist(), top.indices.squeeze(0).tolist()):
    print(f"{tokenizer.convert_ids_to_tokens(token_id)} (score: {score:.2f})")
```
It should print something like the following (the masked position is chosen at random):
```
Original tokens: ['κα', 'τέ', 'βην', 'χθὲ', 'σεἰσ', 'πει', 'ραι', 'ᾶ', 'με', 'τὰγ', 'λαύ', 'κω', 'νοσ', 'τοῦ', 'ἀ', 'ρίσ', 'τω', 'νοσ']
Masked at position 6
Masked text: κα τέ βην χθὲ σεἰσ πει [MASK] ᾶ με τὰγ λαύ κω νοσ τοῦ ἀ ρίσ τω νοσ
Top 5 predictions for masked token:
ραι (score: 23.12)
ρα (score: 14.69)
ραισ (score: 12.63)
σαι (score: 12.43)
ρη (score: 12.26)
```
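The same prediction can also be run through the `fill-mask` pipeline. This is an untested sketch that assumes the custom tokenizer is pipeline-compatible; the input must already be syllabified and contain the tokenizer's mask token:

```python
from transformers import pipeline

fill = pipeline("fill-mask", model="Ericu950/SyllaBERTa", trust_remote_code=True)

# Syllabified input with one syllable replaced by the mask token
masked = "κα τέ βην χθὲ σεἰσ πει " + fill.tokenizer.mask_token + " ᾶ"
for candidate in fill(masked, top_k=5):
    print(candidate["token_str"], round(candidate["score"], 4))
```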
---
# License
MIT License.
---
# Authors
This work is part of ongoing research by **Eric Cullhed** (Uppsala University) and **Albin Thörn Cleland** (Lund University).
---
# Acknowledgements
The computations were enabled by resources provided by the National Academic Infrastructure for Supercomputing in Sweden (NAISS), partially funded by the Swedish Research Council through grant agreement no. 2022-06725.