---
language:
- en
base_model:
- FacebookAI/roberta-large
pipeline_tag: text-classification
---
# Sentence Dating Model

## Model Description
The Sentence Dating Model is a fine-tuned **RoBERTa-large** transformer that predicts the decade in which a given sentence was written. It is trained on historical text data to classify sentences into time periods from 1700 to 2021, and is particularly useful for historical linguistics, text dating, and semantic change studies.

### Reference Paper
This model is based on the work described in:
> **Sense-specific Historical Word Usage Generation**
> *Pierluigi Cassotti, Nina Tahmasebi*
> University of Gothenburg
> [Link to Paper]

## Training Details

### Base Model
- **Model:** `roberta-large`
- **Fine-tuned for:** Sentence classification into time periods (1700-2021)

### Dataset
The model is trained on a dataset derived from historical text corpora, including examples extracted from the **Oxford English Dictionary (OED)**. The dataset includes:
- **Texts:** Sentences extracted from historical documents.
- **Labels:** Time periods (grouped by decades).

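Since each label is a decade index, the correspondence between years and class labels can be sketched as follows (a minimal sketch, assuming decades are indexed consecutively from 1700, consistent with the decoding in the usage example):

```python
# Map a year to its decade class label and back.
# Assumes labels 0..32 index the decades 1700-1709, 1710-1719, ..., 2020-2021.
BASE_YEAR = 1700

def year_to_label(year: int) -> int:
    """Return the decade class index for a given year."""
    return (year - BASE_YEAR) // 10

def label_to_decade(label: int) -> int:
    """Return the starting year of the decade for a class index."""
    return BASE_YEAR + label * 10

print(year_to_label(1815))    # → 11
print(label_to_decade(11))    # → 1810
```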
### Fine-tuning Process
- **Tokenizer:** `AutoTokenizer.from_pretrained("roberta-large")`
- **Loss function:** Cross-entropy loss
- **Optimizer:** AdamW
- **Batch size:** 32
- **Learning rate:** 1e-6
- **Epochs:** 1
- **Evaluation strategy:** Steps (every 10% of training data)
- **Metric:** Weighted F1-score
- **Split:** 90% training, 10% validation

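The weighted F1-score used for evaluation averages per-class F1 values weighted by class support. A minimal pure-Python sketch of the metric (equivalent to `sklearn.metrics.f1_score(..., average="weighted")`, which a typical training script would call):

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Weighted F1: per-class F1 averaged by class support (count in y_true)."""
    classes = set(y_true) | set(y_pred)
    support = Counter(y_true)
    total = 0.0
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        total += support[c] * f1
    return total / len(y_true)

print(weighted_f1([0, 1, 1], [0, 1, 0]))
```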
## Usage
### Example
```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("username/sentence-dating-model")
model = AutoModelForSequenceClassification.from_pretrained("username/sentence-dating-model")
model.eval()  # disable dropout for inference

# Example text
text = "He put the phone back in the cradle and turned toward the kitchen."

# Tokenize input
inputs = tokenizer(text, return_tensors="pt")

# Predict
with torch.no_grad():
    outputs = model(**inputs)
    predicted_label = torch.argmax(outputs.logits, dim=1).item()

print(f"Predicted decade: {1700 + predicted_label * 10}")
```
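The `argmax` above returns only the single most likely decade. To inspect the model's confidence over all decades, apply a softmax to the logits (`torch.softmax(outputs.logits, dim=1)` on the model output); the computation itself is sketched below in plain Python, with hypothetical logits for a toy 4-class example (the real model outputs one logit per decade):

```python
import math

def softmax(logits):
    """Convert raw logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

# Hypothetical logits for a 4-decade toy example
logits = [0.1, 2.3, 0.7, -1.2]
probs = softmax(logits)

# Pair each probability with its decade, assuming labels index decades from 1700
decades = {1700 + i * 10: round(p, 3) for i, p in enumerate(probs)}
print(decades)
```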

## Limitations
- The model may have difficulty distinguishing between closely related time periods (e.g., 1950s vs. 1960s).
- Biases may exist due to the training dataset composition.
- Performance is lower on shorter, contextually ambiguous sentences.

## Citation
If you use this model, please cite:
```bibtex
@article{cassotti2025,
  author  = {Cassotti, Pierluigi and Tahmasebi, Nina},
  title   = {Sense-specific Historical Word Usage Generation},
  journal = {TACL},
  year    = {2025}
}
```

## License
MIT License