momergul committed
Commit 776deb8 · verified · Parent(s): 3126792

Update README.md

---
language:
- en
tags:
- babylm-baseline
- strict-small
- babylm-2025
---

# Model Card for the Strict-Small-track GPT-2 Baseline

<!-- Provide a quick summary of what the model is/does. [Optional] -->
A 124M-parameter model with the GPT-2 architecture, trained with the next-token prediction loss for 10 epochs over the 10M-word corpus (~100M words in total), as a naive autoregressive baseline for the Strict-Small track of the 2025 BabyLM challenge.


# Table of Contents

- [Model Card for the Strict-Small-track GPT-2 Baseline](#model-card-for-the-strict-small-track-gpt-2-baseline)
- [Table of Contents](#table-of-contents)
- [Model Details](#model-details)
  - [Model Description](#model-description)
- [Uses](#uses)
- [Training Details](#training-details)
  - [Training Data](#training-data)
  - [Hyperparameters](#hyperparameters)
  - [Training Procedure](#training-procedure)
    - [Size and Checkpoints](#size-and-checkpoints)
- [Evaluation](#evaluation)
  - [Testing Data & Metrics](#testing-data--metrics)
    - [Testing Data](#testing-data)
    - [Metrics](#metrics)
  - [Results](#results)
- [Technical Specifications](#technical-specifications)
  - [Hardware](#hardware)
  - [Software](#software)
  - [Training Time](#training-time)
- [Citation](#citation)
- [Model Card Authors](#model-card-authors)
- [Bibliography](#bibliography)

# Model Details

## Model Description

<!-- Provide a longer summary of what this model is/does. -->
This is one of the two baselines for the Strict-Small track of the 2025 BabyLM challenge.

- **Developed by:** Mustafa Ömer Gül
- **Model type:** Causal language model
- **Language(s) (NLP):** eng
- **Resources for more information:**
  - [GitHub Repo](https://github.com/momergul/babylm-gpt2-baseline)

# Uses

<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
This is a pre-trained language model.
It can be used to evaluate tasks in a zero-shot manner and can also be fine-tuned for downstream tasks.
It can be used for language generation, but given its small size and the small amount of text it was trained on, do not expect LLM-level performance.
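
Below is a minimal sketch of loading the model for generation with 🤗 Transformers. The repository id is a placeholder (it is not stated in this card); substitute the actual Hub id of this baseline.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder repository id; replace with the actual Hub id of this baseline.
model_id = "USERNAME/babylm-strict-small-gpt2-baseline"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "The children went to the park and"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=30, do_sample=True, top_p=0.95)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```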

# Training Details

## Training Data

<!-- This should link to a Data Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->

We used the BabyLM 10M (Strict-Small) dataset for training. It is composed of the following:

| Source | Weight | Domain | Citation | Website | License |
| --- | --- | --- | --- | --- | --- |
| BNC | 8% | Dialogue | BNC Consortium (2007) | [link](http://www.natcorp.ox.ac.uk/) | [link](http://www.natcorp.ox.ac.uk/docs/licence.html) <sup>1</sup> |
| CHILDES | 29% | Dialogue, Child-Directed | MacWhinney (2000) | | [link](https://talkbank.org/share/rules.html) |
| Project Gutenberg | 26% | Fiction, Nonfiction | Gerlach & Font-Clos (2020) | [link](https://github.com/pgcorpus/gutenberg) | [link](https://www.gutenberg.org/policy/license.html) |
| OpenSubtitles | 20% | Dialogue, Scripted | Lison & Tiedemann (2016) | [link](https://opus.nlpl.eu/OpenSubtitles-v2018.php) | Open source |
| Simple English Wikipedia | 15% | Nonfiction | -- | [link](https://dumps.wikimedia.org/simplewiki/20221201/) | [link](https://dumps.wikimedia.org/legal.html) |
| Switchboard | 1% | Dialogue | Godfrey et al. (1992), Stolcke et al. (2000) | [link](http://compprag.christopherpotts.net/swda.html) | [link](http://compprag.christopherpotts.net/swda.html) |

<sup>1</sup> Our distribution of part of the BNC Texts is permitted under the fair dealings provision of copyright law (see term (2g) in the BNC license).

## Hyperparameters

| Hyperparameter | Value |
| --- | --- |
| Number of epochs | 10 |
| Datapoint length | 512 |
| Batch size | 16 |
| Learning rate | 0.00005 |
| Number of steps | 200000 |
| Warmup steps | 2000 |
| Gradient clipping | 1 |
| Optimizer | AdamW |
| Optimizer Beta_1 | 0.9 |
| Optimizer Beta_2 | 0.999 |
| Optimizer Epsilon | 10<sup>-8</sup> |
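
As an illustration, these settings map onto a standard PyTorch/Transformers optimization setup roughly as follows. This is a sketch only; the warmup-then-linear-decay schedule and the variable names are assumptions, not taken from the training code.

```python
import torch
from transformers import get_linear_schedule_with_warmup  # decay schedule is an assumption

optimizer = torch.optim.AdamW(
    model.parameters(),   # `model` is the GPT-2 causal LM being trained
    lr=5e-5,              # learning rate 0.00005
    betas=(0.9, 0.999),   # Beta_1, Beta_2
    eps=1e-8,             # epsilon
)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=2_000,
    num_training_steps=200_000,
)
```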

## Training Procedure

The model is trained with the next-token prediction loss for 10 epochs.
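
A minimal sketch of one training step under this objective, assuming `batch` holds a tensor of 16 × 512 token ids and that `model`, `optimizer`, and `scheduler` are set up as in the sketch above. Passing `labels=input_ids` makes a Hugging Face causal LM compute the shifted next-token cross-entropy internally.

```python
input_ids = batch["input_ids"]                          # shape: (16, 512)
outputs = model(input_ids=input_ids, labels=input_ids)  # next-token prediction loss
outputs.loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # gradient clipping at 1
optimizer.step()
scheduler.step()
optimizer.zero_grad()
```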

### Size and Checkpoints

<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->

The model has 124M parameters.
In total we train on around 100M words and provide multiple checkpoints from the training; a sketch of loading an intermediate checkpoint follows the list below.
Specifically, we provide:
- Checkpoints every 1M words for the first 10M words
- Checkpoints every 10M words for the first 100M words
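
Intermediate checkpoints on the Hub are usually exposed as branches/revisions; below is a sketch of loading one. The revision name is hypothetical — check the repository's branch list for the actual naming scheme of the word-count checkpoints.

```python
from transformers import AutoModelForCausalLM

# "chkpt_10M_words" is a hypothetical revision name for the 10M-word checkpoint;
# `model_id` is the placeholder repository id from the Uses section.
model_10m = AutoModelForCausalLM.from_pretrained(model_id, revision="chkpt_10M_words")
```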

# Evaluation

<!-- This section describes the evaluation protocols and provides the results. -->

This model is evaluated in two ways:
1. We do zero-shot evaluation on 7 tasks.
2. We do fine-tuning on a subset of the (Super)GLUE tasks (Wang et al., ICLR 2019; Wang et al., NeurIPS 2019).

## Testing Data & Metrics

### Testing Data

<!-- This should link to a Data Card if possible. -->

For the BLiMP, BLiMP Supplement, and EWoK tasks, we use a filtered version of each dataset that only includes examples whose words are found in the BabyLM dataset.
For the finetuning tasks, we both filter and subsample down to a maximum of 10,000 training examples.

*Validation Data*

*Zero-shot Tasks*

- **BLiMP**: The Benchmark of Linguistic Minimal Pairs evaluates the model's linguistic ability by seeing if it can recognize the grammatically correct sentence from a pair of minimally different sentences. It tests various grammatical phenomena. (Warstadt et al., TACL 2020)
- **BLiMP Supplement**: A supplement to BLiMP introduced in the first edition of the BabyLM challenge. More focused on dialogue and questions. (Warstadt et al., CoNLL-BabyLM 2023)
- **EWoK**: Works similarly to BLiMP but looks at the model's internal world knowledge, testing whether the model has both physical and social knowledge. (Ivanova et al., 2024)
- **Eye Tracking and Self-paced Reading**: Looks at whether the model can mimic human eye-tracking and reading-time measures, using the surprisal of a word as a proxy for the time spent reading it. (de Varda et al., BRM 2024)
- **Entity Tracking**: Checks whether a model can keep track of changes to the states of entities as text/dialogue unfolds. (Kim & Schuster, ACL 2023)
- **WUGs**: Tests morphological generalization in LMs through an adjective nominalization task. (Hofmann et al., 2024)

*Finetuning Tasks*

- **BoolQ**: A yes/no QA dataset with unprompted and unconstrained questions. (Clark et al., NAACL 2019)
- **MNLI**: The Multi-Genre Natural Language Inference corpus tests the language understanding of a model by seeing whether it can recognize textual entailment. (Williams et al., NAACL 2018)
- **MRPC**: The Microsoft Research Paraphrase Corpus contains pairs of sentences that are either paraphrases/semantically equivalent to each other or unrelated. (Dolan & Brockett, IJCNLP 2005)
- **QQP**<sup>2</sup>: Similarly to MRPC, the Quora Question Pairs corpus tests the model's ability to determine whether a pair of questions is semantically similar. The questions are sourced from Quora.
- **MultiRC**: The Multi-Sentence Reading Comprehension corpus is a QA task that evaluates the model's ability to pick the correct answer from a list of answers given a question and a context paragraph. In this version, the data is recast as binary classification: judging whether the answer to a (question, context) pair is correct. (Khashabi et al., NAACL 2018)
- **RTE**: The Recognizing Textual Entailment corpus similarly tests the model's ability to recognize textual entailment. (Dagan et al., Springer 2006; Bar et al., 2006; Giampiccolo et al., 2007; Bentivogli et al., TAC 2009)
- **WSC**: The Winograd Schema Challenge tests the model's ability to do coreference resolution on sentences with a pronoun and a list of noun phrases found in the sentence. This version recasts it as binary classification over examples consisting of a pronoun and a noun phrase. (Levesque et al., PKRR 2012)

<sup>2</sup> https://www.quora.com/profile/Ricky-Riche-2/First-Quora-Dataset-Release-Question-Pairs

### Metrics

<!-- These are the evaluation metrics being used, ideally with a description of why. -->

The metrics used to evaluate the model are the following:
- Zero-shot
  - Accuracy on predicting the correct completion/sentence for BLiMP, BLiMP Supplement, EWoK, Entity Tracking, and WUGs (see the scoring sketch below)
  - Change in R^2 prediction from baseline for Eye Tracking (with no spillover) and Self-paced Reading (1-word spillover)
- Finetuning
  - 3-class accuracy for MNLI
  - Binary accuracy for BoolQ, MultiRC, and WSC
  - F1-score for MRPC and QQP

The metrics were chosen based on the advice of the papers the tasks come from.
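
For concreteness, here is a sketch of how the zero-shot accuracy can be computed for a minimal-pair item (BLiMP-style), with `model` and `tokenizer` loaded as in the Uses section: the item counts as correct if the model assigns the grammatical sentence a higher total log-probability than the ungrammatical one. This illustrates the metric and is not the challenge's official evaluation pipeline.

```python
import torch

def sentence_logprob(model, tokenizer, sentence):
    """Sum of token log-probabilities of `sentence` under a causal LM."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)  # predictions for tokens 2..T
    targets = ids[:, 1:].unsqueeze(-1)                     # the tokens actually observed
    return log_probs.gather(2, targets).sum().item()

good = "The cats sleep on the sofa."
bad = "The cats sleeps on the sofa."
correct = sentence_logprob(model, tokenizer, good) > sentence_logprob(model, tokenizer, bad)
```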

## Results

*Zero-shot*

| Task | Metric | Causal Score |
| --- | --- | --- |
| BLiMP | Acc | 67.29 |
| BLiMP Supplement | Acc | 59.09 |
| EWoK | Acc | 49.8 |
| Eye Tracking | Change in R^2 | ADD |
| Self-paced Reading | Change in R^2 | ADD |
| Entity Tracking | Acc | 18.92 |
| WUGs | Acc | 39 |

*Finetuning*

| Task | Metric | Uni-directional Score | Bi-directional Score |
| --- | --- | --- | --- |
| BoolQ | Acc | | |
| MNLI | Acc | | |
| MRPC | F1 | | |
| QQP | F1 | | |
| MultiRC | Acc | | |
| RTE | Acc | | |
| WSC | Acc | | |

# Technical Specifications

### Hardware

- 1 H100 GPU was used to train this model.

### Software

PyTorch

### Training Time

The model took ~0.5 GPU hours to train.

# Citation

```latex
@misc{charpentier2025babylmturns3papers,
  title={BabyLM Turns 3: Call for papers for the 2025 BabyLM workshop},
  author={Lucas Charpentier and Leshem Choshen and Ryan Cotterell and Mustafa Omer Gul and Michael Hu and Jaap Jumelet and Tal Linzen and Jing Liu and Aaron Mueller and Candace Ross and Raj Sanjay Shah and Alex Warstadt and Ethan Wilcox and Adina Williams},
  year={2025},
  eprint={2502.10645},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2502.10645},
}
```

# Model Card Authors

Mustafa Ömer Gül

# Bibliography

[GLUE: A multi-task benchmark and analysis platform for natural language understanding](https://openreview.net/pdf?id=rJ4km2R5t7) (Wang et al., ICLR 2019)

[SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems](https://proceedings.neurips.cc/paper_files/paper/2019/file/4496bf24afe7fab6f046bf4923da8de6-Paper.pdf) (Wang et al., NeurIPS 2019)

[BLiMP: The Benchmark of Linguistic Minimal Pairs for English](https://aclanthology.org/2020.tacl-1.25/) (Warstadt et al., TACL 2020)

[Findings of the BabyLM Challenge: Sample-Efficient Pretraining on Developmentally Plausible Corpora](https://aclanthology.org/2023.conll-babylm.1/) (Warstadt et al., CoNLL-BabyLM 2023)

[🌏 Elements of World Knowledge (EWoK): A cognition-inspired framework for evaluating basic world knowledge in language models](https://arxiv.org/pdf/2405.09605v1) (Ivanova et al., 2024)

[Cloze probability, predictability ratings, and computational estimates for 205 English sentences, aligned with existing EEG and reading time data](https://link.springer.com/article/10.3758/s13428-023-02261-8) (de Varda et al., BRM 2024)

[Entity Tracking in Language Models](https://aclanthology.org/2023.acl-long.213/) (Kim & Schuster, ACL 2023)

[Derivational Morphology Reveals Analogical Generalization in Large Language Models](https://arxiv.org/pdf/2411.07990) (Hofmann et al., 2024)

[Automatically Constructing a Corpus of Sentential Paraphrases](https://aclanthology.org/I05-5002/) (Dolan & Brockett, IJCNLP 2005)

[A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference](https://aclanthology.org/N18-1101/) (Williams et al., NAACL 2018)

[The Winograd Schema Challenge](http://dl.acm.org/citation.cfm?id=3031843.3031909) (Levesque et al., PKRR 2012)

[The PASCAL Recognising Textual Entailment Challenge](https://link.springer.com/chapter/10.1007/11736790_9) (Dagan et al., Springer 2006)

[The Second PASCAL Recognising Textual Entailment Challenge]() (Bar et al., 2006)

[The Third PASCAL Recognizing Textual Entailment Challenge](https://aclanthology.org/W07-1401/) (Giampiccolo et al., 2007)

[The Fifth PASCAL Recognizing Textual Entailment Challenge](https://tac.nist.gov/publications/2009/additional.papers/RTE5_overview.proceedings.pdf) (Bentivogli et al., TAC 2009)

[BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions](https://aclanthology.org/N19-1300/) (Clark et al., NAACL 2019)

[Looking Beyond the Surface: A Challenge Set for Reading Comprehension over Multiple Sentences](https://aclanthology.org/N18-1023/) (Khashabi et al., NAACL 2018)