---
language:
- en
base_model:
- meta-llama/Meta-Llama-3-8B
pipeline_tag: text2text-generation
---

## Janus (PoS) (Built with Meta Llama 3)

For the version without PoS tags, visit [Janus](https://huggingface.co/ChangeIsKey/llama3-janus).

### Model Details

- **Model Name**: Janus
- **Version**: 1.0
- **Developers**: Pierluigi Cassotti, Nina Tahmasebi
- **Affiliation**: University of Gothenburg
- **License**: MIT
- **GitHub Repository**: [Historical Word Usage Generation](https://github.com/ChangeIsKey/historical-word-usage-generation)
- **Paper**: [Sense-specific Historical Word Usage Generation](https://doi.org/10.1162/tacl_a_00761)
- **Contact**: pierluigi.cassotti@gu.se

### Model Description

Janus is a fine-tuned **Llama 3 8B** model designed to generate historically and semantically accurate word usages. It takes as input a word, its sense definition, and a year, and produces example sentences that reflect linguistic usage from the specified period. This model is particularly useful for **semantic change detection**, **historical NLP**, and **linguistic research**.

### Intended Use

- **Semantic Change Detection**: Investigating how word meanings evolve over time.
- **Historical Text Processing**: Enhancing the understanding and modeling of historical texts.
- **Corpus Expansion**: Generating sense-annotated corpora for linguistic studies.

### Training Data

- **Dataset**: Extracted from the **Oxford English Dictionary (OED)**
- **Size**: Over **1.2 million** sense-annotated historical usages
- **Time Span**: **1700–2020**
- **Data Format** (placeholders shown in braces; a prompt-construction sketch appears below):
  ```
  {year}<|t|>{word}<|t|>{definition}<|s|>{usage}<|end|>
  ```
- **Janus (PoS) Format**:
  ```
  {year}<|t|>{word}<|t|>{definition}<|p|>{pos}<|p|><|s|>{usage}<|end|>
  ```

### Training Procedure

- **Base Model**: `meta-llama/Meta-Llama-3-8B`
- **Optimization**: **QLoRA** (Quantized Low-Rank Adaptation; a configuration sketch appears below)
- **Batch Size**: **4**
- **Learning Rate**: **2e-4**
- **Epochs**: **1**

### Model Performance

- **Temporal Accuracy**: Root mean squared error (RMSE) of **~52.7 years** between the target year and the year inferred from the generated usage, on par with real OED example sentences
- **Semantic Accuracy**: Judged comparable to OED test sentences in human evaluations
- **Context Variability**: Low lexical repetition, preserving natural linguistic diversity

### Usage Example

#### Generating Historical Usages

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "ChangeIsKey/llama3-janus-pos"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

input_text = "1800<|t|>awful<|t|>Used to emphasize something unpleasant or negative; ‘such a’, ‘an absolute’.<|p|>jj<|p|><|s|>"
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)

# do_sample=True is required for temperature and top_p to take effect.
output = model.generate(**inputs, do_sample=True, temperature=1.0, top_p=0.9, max_new_tokens=50)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

For more examples, see the GitHub repository [Historical Word Usage Generation](https://github.com/ChangeIsKey/historical-word-usage-generation).

### Limitations & Ethical Considerations

- **Historical Bias**: The model may reflect biases present in historical texts.
- **Time Granularity**: The temporal resolution is approximate (~50 years RMSE).
- **Modern Influence**: Despite fine-tuning, the model may still generate modern phrasing in older contexts.
- **Not Trained for Fairness**: The model has not been explicitly trained to be fair or unbiased. It may produce sensitive, outdated, or culturally inappropriate content.
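### Building Janus Inputs

The input formats documented under Training Data can be assembled and the model output post-processed with a few lines of Python. The sketch below is illustrative rather than part of the released code: `build_prompt` is a hypothetical helper, and the marker layout simply mirrors the two formats above.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

def build_prompt(year, word, definition, pos=None):
    """Assemble a Janus input string following the formats under Training Data.

    The PoS variant wraps the tag in <|p|> markers; the plain variant omits them.
    """
    if pos is not None:
        return f"{year}<|t|>{word}<|t|>{definition}<|p|>{pos}<|p|><|s|>"
    return f"{year}<|t|>{word}<|t|>{definition}<|s|>"

model_name = "ChangeIsKey/llama3-janus-pos"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

prompt = build_prompt(
    1800, "awful",
    "Used to emphasize something unpleasant or negative; ‘such a’, ‘an absolute’.",
    pos="jj",
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, do_sample=True, temperature=1.0, top_p=0.9, max_new_tokens=50)

# Keep only the newly generated tokens, so the echoed prompt (and its
# marker tokens) is not part of the decoded usage.
prompt_length = inputs["input_ids"].shape[1]
usage = tokenizer.decode(output[0][prompt_length:], skip_special_tokens=True)
print(usage)
```

Slicing off the prompt tokens avoids relying on the `<|s|>` marker surviving `skip_special_tokens=True` during decoding.

### Reproducing the QLoRA Setup

The hyperparameters under Training Procedure map onto a standard QLoRA configuration via `transformers`, `bitsandbytes`, and `peft`. The sketch below is a minimal, assumed setup: the LoRA rank, alpha, dropout, and target modules are not documented in this card, so the values shown are common defaults rather than the authors' settings; see the GitHub repository for the actual training code.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit NF4 quantization of the frozen base weights: the "Q" in QLoRA.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# Low-rank adapter on the attention projections. These LoRA values are
# assumptions (not taken from this card).
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# Hyperparameters documented above: batch size 4, learning rate 2e-4, 1 epoch.
args = TrainingArguments(
    output_dir="janus-qlora",
    per_device_train_batch_size=4,
    learning_rate=2e-4,
    num_train_epochs=1,
    bf16=True,
)
# Training loop (e.g. a Trainer over the formatted OED usages) omitted.
```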
### Citation

If you use Janus, please cite:

```
@article{10.1162/tacl_a_00761,
  author   = {Cassotti, Pierluigi and Tahmasebi, Nina},
  title    = {Sense-specific Historical Word Usage Generation},
  journal  = {Transactions of the Association for Computational Linguistics},
  volume   = {13},
  pages    = {690--708},
  year     = {2025},
  month    = {07},
  abstract = {Large-scale sense-annotated corpora are important for a range of tasks but are hard to come by. Dictionaries that record and describe the vocabulary of a language often offer a small set of real-world example sentences for each sense of a word. However, on their own, these sentences are too few to be used as diachronic sense-annotated corpora. We propose a targeted strategy for training and evaluating generative models producing historically and semantically accurate word usages given any word, sense definition, and year triple. Our results demonstrate that fine-tuned models can generate usages with the same properties as real-world example sentences from a reference dictionary. Thus the generated usages will be suitable for training and testing computational models where large-scale sense-annotated corpora are needed but currently unavailable.},
  issn     = {2307-387X},
  doi      = {10.1162/tacl_a_00761},
  url      = {https://doi.org/10.1162/tacl_a_00761},
  eprint   = {https://direct.mit.edu/tacl/article-pdf/doi/10.1162/tacl_a_00761/2535111/tacl_a_00761.pdf},
}
```