---
license: cc-by-nc-sa-4.0
language:
- ga
- sga
- mga
- ghc
- la
pipeline_tag: feature-extraction
library_name: gensim
---

### Training Data

**Historical Irish FastText models** were trained on Old, Middle, Early Modern, Classical Modern and pre-reform Modern Irish texts from the St. Gall Glosses, the Würzburg Glosses, [CELT](https://celt.ucc.ie/publishd.html) and the book subcorpus of the [Historical Irish Corpus](http://corpas.ria.ie/index.php?fsg_function=1). The training data spans ca. 550 to 1926 and covers a wide variety of genres, such as bardic poetry, native Irish stories, translations and adaptations of continental epic and romance, annals, genealogies, grammatical and medical tracts, diaries, and religious writing. Because some texts code-switch between Irish and Latin, the vocabularies also contain some Latin words.

### Available Models

There are three models in this family:

- **Cased**, 119 630 words: `historical_irish_cased_ft_100_5_2.txt`
- **Lowercase**, 112 230 words: `historical_irish_lower_ft_100_5_2.txt`
- **Lowercase with initial mutations removed**, 99 485 words: `historical_irish_lower_demutated_ft_100_5_2.txt`

All models are trained with the same hyperparameters (`emb_size=100, window=5, min_count=2, n_epochs=100`) and saved as `KeyedVectors` (see [Gensim Documentation](https://radimrehurek.com/gensim/models/keyedvectors.html)).
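Since the Usage section loads the files with `load_word2vec_format(..., binary=False)`, they are expected to be in the word2vec text format, where the first line holds the vocabulary size and the vector dimensionality. The following is a minimal sketch for checking that header without loading the whole model; `read_w2v_header` is a hypothetical helper name and is not part of Gensim or this repository.

```python
def read_w2v_header(path):
    """Return (vocab_size, dim) from the first line of a word2vec text file.

    Assumption: the file starts with a "<vocab_size> <dim>" header line,
    which is what load_word2vec_format expects by default.
    """
    with open(path, encoding="utf-8") as handle:
        vocab_size, dim = handle.readline().split()
    return int(vocab_size), int(dim)


# Hypothetical usage with a path returned by hf_hub_download:
# vocab_size, dim = read_w2v_header(model_path)
```

If the files follow this layout, the reported dimensionality should match `emb_size=100` and the vocabulary size should match the word counts listed above.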

### Usage

```python
from gensim.models import KeyedVectors
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(repo_id="ancatmara/historical-irish-ft-vectors", filename="historical_irish_lower_demutated_ft_100_5_2.txt")
model = KeyedVectors.load_word2vec_format(model_path, binary=False)

model.similar_by_word('coíca')
```

Out:
```python
[('coícat', 0.6620370149612427),
 ('coícait', 0.6584151983261108),
 ('coíctu', 0.550497829914093),
 ('trícha', 0.537602424621582),
 ('cóeca', 0.531631350517273),
 ('cóecta', 0.5148215889930725),
 ('cóecait', 0.5108019113540649),
 ('tríchad', 0.5059043765068054),
 ('tríchaid', 0.5049244165420532),
 ('cóecat', 0.5042815804481506)]
```
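
Vectors loaded with `load_word2vec_format` are plain `KeyedVectors`, so querying a word that is not in the vocabulary raises a `KeyError`. The sketch below wraps the lookup so that unknown words return an empty list instead. The helper names `normalize_query` and `safe_neighbours` are hypothetical, and the NFC normalization step is an assumption about how the vocabulary is encoded; lowercasing only makes sense for the two lowercase models.

```python
import unicodedata


def normalize_query(word, lowercase=True):
    """Strip whitespace, NFC-normalize and optionally lowercase a query word.

    Assumption: vocabulary entries use precomposed (NFC) characters such as 'í'.
    """
    word = unicodedata.normalize("NFC", word.strip())
    return word.lower() if lowercase else word


def safe_neighbours(model, word, topn=10, lowercase=True):
    """Return nearest neighbours, or [] if the word is out of vocabulary.

    `model` is expected to be a Gensim 4 KeyedVectors instance (it needs
    `key_to_index` and `similar_by_word`).
    """
    query = normalize_query(word, lowercase=lowercase)
    if query not in model.key_to_index:
        return []
    return model.similar_by_word(query, topn=topn)
```

For the cased model, pass `lowercase=False` so that the query keeps its original capitalization. For the demutated model, queries would also need their initial mutations removed in the same way as the training data, which this sketch does not attempt.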