---
language:
- en
base_model:
- meta-llama/Meta-Llama-3-8B
pipeline_tag: text2text-generation
---
Janus (PoS)
(Built with Meta Llama 3)
For the version without PoS tags, see Janus.
Model Details
- Model Name: Janus
- Version: 1.0
- Developers: Pierluigi Cassotti, Nina Tahmasebi
- Affiliation: University of Gothenburg
- License: MIT
- GitHub Repository: Historical Word Usage Generation
- Paper: Sense-specific Historical Word Usage Generation
- Contact: [email protected]
Model Description
Janus is a fine-tuned Llama 3 8B model designed to generate historically and semantically accurate word usages. Given a word, its sense definition, and a year, it produces example sentences that reflect linguistic usage from the specified period. The model is particularly useful for semantic change detection, historical NLP, and linguistic research.
Intended Use
- Semantic Change Detection: Investigating how word meanings evolve over time.
- Historical Text Processing: Enhancing the understanding and modeling of historical texts.
- Corpus Expansion: Generating sense-annotated corpora for linguistic studies.
Training Data
- Dataset: Extracted from the Oxford English Dictionary (OED)
- Size: Over 1.2 million sense-annotated historical usages
- Time Span: 1700 - 2020
- Data Format:
<year><|t|><lemma><|t|><definition><|s|><historical usage sentence><|end|>
- Janus (PoS) Format:
<year><|t|><lemma><|t|><definition><|p|><PoS><|p|><|s|><historical usage sentence><|end|>
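The special tokens delimit the conditioning fields. As an illustration, here is a small helper (hypothetical, not part of this repository) that assembles prompts in the two formats above:

def build_prompt(year, lemma, definition, pos=None):
    # Janus: <year><|t|><lemma><|t|><definition><|s|>
    # Janus (PoS) additionally inserts <|p|><PoS><|p|> before the <|s|> marker
    prompt = f"{year}<|t|>{lemma}<|t|>{definition}"
    if pos is not None:
        prompt += f"<|p|>{pos}<|p|>"
    return prompt + "<|s|>"

print(build_prompt(1800, "awful", "Used to emphasize something unpleasant or negative.", pos="jj"))

The model is then expected to continue the prompt with a usage sentence, terminated by <|end|>.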
Training Procedure
- Base Model:
meta-llama/Meta-Llama-3-8B
- Optimization: QLoRA (Quantized Low-Rank Adaptation)
- Batch Size: 4
- Learning Rate: 2e-4
- Epochs: 1
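As a sketch only, a QLoRA setup consistent with these hyperparameters could look as follows. The 4-bit quantization settings, LoRA rank, alpha, dropout, and target modules below are illustrative assumptions; only the quantization approach, batch size, learning rate, and epoch count come from this card:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit NF4 quantization (assumed; QLoRA's usual configuration)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# LoRA adapter; rank, alpha, dropout, and target modules are assumptions
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# Hyperparameters from this card: batch size 4, learning rate 2e-4, 1 epoch
training_args = TrainingArguments(
    output_dir="janus-qlora",
    per_device_train_batch_size=4,
    learning_rate=2e-4,
    num_train_epochs=1,
)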
Model Performance
- Temporal Accuracy: Root mean squared error (RMSE) of ~52.7 years against the target year, close to the error obtained on real OED example sentences
- Semantic Accuracy: In human evaluations, generated usages score comparably to real OED example sentences
- Context Variability: Low lexical repetition, preserving natural linguistic diversity
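For reference, temporal accuracy here is an RMSE over years: the square root of the mean squared difference between the year each usage was conditioned on and the year assigned to it. A minimal sketch of the metric (the helper name is ours):

import math

def year_rmse(assigned_years, target_years):
    # Root mean squared error between assigned and target years
    squared_errors = [(a - t) ** 2 for a, t in zip(assigned_years, target_years)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

print(year_rmse([1812, 1950], [1800, 2000]))  # ~36.4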
Usage Example
Generating Historical Usages
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "ChangeIsKey/llama3-janus-pos"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

# Prompt format: <year><|t|><lemma><|t|><definition><|p|><PoS><|p|><|s|>
input_text = "1800<|t|>awful<|t|>Used to emphasize something unpleasant or negative; ‘such a’, ‘an absolute’.<|p|>jj<|p|><|s|>"
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)

# do_sample=True is required for temperature and top_p to take effect
output = model.generate(**inputs, do_sample=True, temperature=1.0, top_p=0.9, max_new_tokens=50)
print(tokenizer.decode(output[0], skip_special_tokens=True))
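If the delimiters are registered as special tokens, skip_special_tokens=True drops them and the conditioning fields run together with the generated sentence. In that case, one way to isolate just the usage (assuming the markers survive decoding when special tokens are kept):

raw = tokenizer.decode(output[0], skip_special_tokens=False)
# Take the text between the <|s|> marker and the <|end|> terminator
usage = raw.split("<|s|>", 1)[-1].split("<|end|>", 1)[0].strip()
print(usage)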
For more examples, see the GitHub repository: Historical Word Usage Generation.
Limitations & Ethical Considerations
- Historical Bias: The model may reflect biases present in historical texts.
- Time Granularity: The temporal resolution is approximate (RMSE of roughly 53 years; see Model Performance).
- Modern Influence: Despite fine-tuning, the model may still generate modern phrases in older contexts.
- Not Trained for Fairness: The model has not been explicitly trained to be fair or unbiased. It may produce sensitive, outdated, or culturally inappropriate content.
Citation
If you use Janus, please cite:
@article{10.1162/tacl_a_00761,
author = {Cassotti, Pierluigi and Tahmasebi, Nina},
title = {Sense-specific Historical Word Usage Generation},
journal = {Transactions of the Association for Computational Linguistics},
volume = {13},
pages = {690--708},
year = {2025},
month = {07},
abstract = {Large-scale sense-annotated corpora are important for a range of tasks but are hard to come by. Dictionaries that record and describe the vocabulary of a language often offer a small set of real-world example sentences for each sense of a word. However, on their own, these sentences are too few to be used as diachronic sense-annotated corpora. We propose a targeted strategy for training and evaluating generative models producing historically and semantically accurate word usages given any word, sense definition, and year triple. Our results demonstrate that fine-tuned models can generate usages with the same properties as real-world example sentences from a reference dictionary. Thus the generated usages will be suitable for training and testing computational models where large-scale sense-annotated corpora are needed but currently unavailable.},
issn = {2307-387X},
doi = {10.1162/tacl_a_00761},
url = {https://doi.org/10.1162/tacl_a_00761},
eprint = {https://direct.mit.edu/tacl/article-pdf/doi/10.1162/tacl_a_00761/2535111/tacl_a_00761.pdf},
}