---
language:
- en
tags:
- babylm-baseline
- interaction
- babylm-2025
---
# Model Card for the Preference Optimization Interaction Baseline
<!-- Provide a quick summary of what the model is/does. [Optional] -->
A 124M-parameter model with the GPT-2 architecture, trained for 20 "interaction rounds" using the training procedure outlined in the 2025 BabyLM Call for Papers.
# Table of Contents
- [Model Card for the Preference Optimization Interaction Baseline](#model-card-for-the-preference-optimization-interaction-baseline)
- [Table of Contents](#table-of-contents)
- [Model Details](#model-details)
- [Model Description](#model-description)
- [Uses](#uses)
- [Training Details](#training-details)
- [Training Data](#training-data)
- [Hyperparameters](#hyperparameters)
- [Training Procedure](#training-procedure)
- [Size and Checkpoints](#size-and-checkpoints)
- [Evaluation](#evaluation)
- [Testing Data & Metrics](#testing-data--metrics)
- [Testing Data](#testing-data)
- [Metrics](#metrics)
- [Results](#results)
- [Technical Specifications](#technical-specifications)
- [Model Architecture and Objective](#model-architecture-and-objective)
- [Compute Infrastructure](#compute-infrastructure)
- [Hardware](#hardware)
- [Software](#software)
- [Training Time](#training-time)
- [Citation](#citation)
- [Model Card Authors](#model-card-authors)
- [Bibliography](#bibliography)
# Model Details
## Model Description
<!-- Provide a longer summary of what this model is/does. -->
This is one of the two Interaction-track baselines for the 2025 BabyLM challenge.
- **Developed by:** Mustafa Ömer Gül
- **Model type:** Causal language model
- **Language(s) (NLP):** eng
- **Resources for more information:**
- [GitHub Repo](https://github.com/momergul/babylm-interaction-simpo-baseline)
# Uses
<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
This is a pre-trained language model.
It can be used to evaluate tasks in a zero-shot manner and can also be fine-tuned for downstream tasks.
It can be used for language generation, but given its small size and the limited number of words it was trained on, do not expect LLM-level performance.
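A minimal usage sketch with the Hugging Face `transformers` library; the repository ID below is a placeholder, so substitute this model's actual hub ID:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "path/to/this-model"  # placeholder: substitute the actual hub ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Sample a continuation with the student's sampling settings from the
# hyperparameter table below (temperature 1.0, top_p 0.8).
inputs = tokenizer("Once upon a time", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=40, do_sample=True,
                         temperature=1.0, top_p=0.8)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```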
# Training Details
## Training Data
<!-- This should link to a Data Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
We used the BabyLM 100M (Strict) dataset to construct input contexts. It is composed of the following:
| Source | Weight | Domain | Citation | Website | License |
| --- | --- | --- | --- | --- | --- |
| BNC | 8% | Dialogue | BNC Consortium (2007) | [link](http://www.natcorp.ox.ac.uk/) | [link](http://www.natcorp.ox.ac.uk/docs/licence.html) <sup>1</sup> |
| CHILDES | 29% | Dialogue, Child-Directed | MacWhinney (2000) | | [link](https://talkbank.org/share/rules.html) |
| Project Gutenberg | 26% | Fiction, Nonfiction | Gerlach & Font-Clos (2020) | [link](https://github.com/pgcorpus/gutenberg) | [link](https://www.gutenberg.org/policy/license.html) |
| OpenSubtitles | 20% | Dialogue, Scripted | Lison & Tiedemann (2016) | [link](https://opus.nlpl.eu/OpenSubtitles-v2018.php) | Open source |
| Simple English Wikipedia | 15% | Nonfiction | -- | [link](https://dumps.wikimedia.org/simplewiki/20221201/) | [link](https://dumps.wikimedia.org/legal.html) |
| Switchboard | 1% | Dialogue | Godfrey et al. (1992), Stolcke et al., (2000) | [link](http://compprag.christopherpotts.net/swda.html) | [link](http://compprag.christopherpotts.net/swda.html) |
<sup>1</sup> Our distribution of part of the BNC Texts is permitted under the fair dealings provision of copyright law (see term (2g) in the BNC license).
## Hyperparameters
| Hyperparameter | Value |
| --- | --- |
| Number of Rounds | 20 |
| Teacher model | Llama-3.1-8B-Instruct |
| Datapoint length | 512 |
| Context length | 256 |
| Student sampling temperature | 1.0 |
| Student top_p | 0.8 |
| Teacher sampling temperature | 1.0 |
| Teacher top_p | 0.8 |
| Pure language modeling epochs per round | 8 |
| Mixed language modeling + preference optimization epochs per round | 2 |
| Batch size | 16 |
| SimPO beta | 2 |
| SimPO gamma | 0.5 |
| Language modeling weight | 0.2 |
| Learning rate | 0.00005 |
| Number of steps | 200000 |
| Warmup steps | 2000 |
| Gradient clipping | 1 |
| Optimizer | AdamW |
| Optimizer Beta_1 | 0.9 |
| Optimizer Beta_2 | 0.999 |
| Optimizer Epsilon | 10<sup>-8</sup>|
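For concreteness, the settings above correspond to an optimization setup along the lines of the sketch below; the linear warmup/decay schedule is an assumption, and this is not the authors' exact training code:

```python
import torch
from transformers import get_linear_schedule_with_warmup

def configure_optimization(model):
    # AdamW with lr 5e-5, betas (0.9, 0.999), eps 1e-8, per the table above.
    optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5,
                                  betas=(0.9, 0.999), eps=1e-8)
    # 2,000 warmup steps out of 200,000 total; the decay shape is assumed.
    scheduler = get_linear_schedule_with_warmup(
        optimizer, num_warmup_steps=2_000, num_training_steps=200_000)
    return optimizer, scheduler

def training_step(model, batch, optimizer, scheduler):
    loss = model(**batch).loss  # standard next-token prediction loss
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # clip at 1
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
    return loss.item()
```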
## Training Procedure
The student model in this interaction baseline is trained for 20 "interaction rounds."
Each round makes use of a distinct, randomly sampled 5M word subsample of the BabyLM corpus.
Each interaction involves the student model generating a completion given an input context and the teacher model (Llama-3.1-8B-Instruct) producing a corrected version of the same completion.
The model is trained with the standard next-token prediction loss on the concatenation of the context and teacher completion, and with the SimPO loss on the teacher and student completions.
More explicit details are available in the 2025 BabyLM Call for Papers.
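As a rough illustration of the mixed objective, the sketch below combines the SimPO loss (Meng et al., NeurIPS 2024) on the teacher (chosen) and student (rejected) completions with the weighted language modeling term, using the beta, gamma, and weight values from the hyperparameter table. Exactly how the language modeling weight enters the combination is an assumption here:

```python
import torch.nn.functional as F

def simpo_loss(chosen_logps, chosen_len, rejected_logps, rejected_len,
               beta=2.0, gamma=0.5):
    # SimPO uses the length-normalized sequence log-probability as an
    # implicit, reference-free reward.
    chosen_reward = beta * chosen_logps / chosen_len
    rejected_reward = beta * rejected_logps / rejected_len
    return -F.logsigmoid(chosen_reward - rejected_reward - gamma)

def mixed_objective(lm_loss, chosen_logps, chosen_len,
                    rejected_logps, rejected_len, lm_weight=0.2):
    # Assumption: the language modeling weight scales the next-token
    # prediction loss, which is then added to the preference loss.
    pref = simpo_loss(chosen_logps, chosen_len, rejected_logps, rejected_len)
    return pref + lm_weight * lm_loss
```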
## Size and Checkpoints
<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
The model has 124M parameters.
In total, we train on around 1B words and provide multiple checkpoints from training.
Specifically, we provide:
- Checkpoints every 1M words for the first 10M words
- Checkpoints every 10M words up to the first 100M words
- Checkpoints every 100M words until 1B words
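Assuming the checkpoints are published as revisions of the hub repository, an intermediate checkpoint could be loaded roughly as follows; both the repository ID and the revision naming scheme are hypothetical:

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "path/to/this-model",   # placeholder: substitute the actual hub ID
    revision="100M-words",  # hypothetical revision name for the 100M-word checkpoint
)
```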
# Evaluation
<!-- This section describes the evaluation protocols and provides the results. -->
This model is evaluated in two ways:
1. We do zero-shot evaluation on 7 tasks.
2. We do fine-tuning on a subset of the (Super)GLUE tasks (Wang et al., ICLR 2019; Wang et al., NeurIPS 2019).
## Testing Data & Metrics
### Testing Data
<!-- This should link to a Data Card if possible. -->
For the BLiMP, BLiMP Supplement, and EWoK tasks, we use a filtered version of each dataset that includes only examples whose words appear in the BabyLM dataset.
For the fine-tuning tasks, we both filter and subsample the data, down to a maximum of 10,000 training examples; a sketch of this preparation follows below.
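A sketch of that preparation step using the Hugging Face `datasets` API; the vocabulary check (`in_babylm_vocab`) and its implementation are illustrative assumptions:

```python
from datasets import load_dataset

babylm_vocab = set()  # assumption: the set of word types occurring in the BabyLM corpus

def in_babylm_vocab(text):
    # Keep an example only if every word appears in the BabyLM vocabulary.
    return all(word.lower() in babylm_vocab for word in text.split())

train = load_dataset("glue", "mrpc", split="train")
train = train.filter(lambda ex: in_babylm_vocab(ex["sentence1"] + " " + ex["sentence2"]))
# Subsample down to at most 10,000 training examples.
train = train.shuffle(seed=42).select(range(min(10_000, len(train))))
```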
*Validation Data*
*Zero-shot Tasks*
- **BLiMP**: The Benchmark of Linguistic Minimal Pairs evaluates the model's linguistic ability by testing whether it can recognize the grammatically correct sentence in a pair of minimally different sentences. It covers a variety of grammatical phenomena. (Warstadt et al., TACL 2020)
- **BLiMP Supplement**: A supplement to BLiMP introduced in the first edition of the BabyLM challenge, with more focus on dialogue and questions. (Warstadt et al., CoNLL-BabyLM 2023)
- **EWoK**: Works similarly to BLiMP but probes the model's internal world knowledge, testing whether a model has both physical and social knowledge. (Ivanova et al., 2024)
- **Eye Tracking and Self-paced Reading**: Tests whether the model can mimic human eye-tracking and reading-time measurements, using the surprisal of a word as a proxy for the time spent reading it. (de Varda et al., BRM 2024)
- **Entity Tracking**: Checks whether a model can keep track of changes to the states of entities as text/dialogue unfolds. (Kim & Schuster, ACL 2023)
- **WUGs**: Tests morphological generalization in LMs through an adjective nominalization task. (Hofmann et al., 2024)
*Finetuning Tasks*
- **BoolQ**: A yes/no QA dataset with unprompted and unconstrained questions. (Clark et al., NAACL 2019)
- **MNLI**: The Multi-Genre Natural Language Inference corpus tests the language understanding of a model by seeing whether it can recognize textual entailment. (Williams et al., NAACL 2018)
- **MRPC**: The Microsoft Research Paraphrase Corpus contains pairs of sentences that are either paraphrases/semantically equivalent to each other or unrelated. (Dolan & Brockett, IJCNLP 2005)
- **QQP**<sup>2</sup>: Similar to MRPC, the Quora Question Pairs corpus tests the model's ability to determine whether a pair of questions is semantically similar. These questions are sourced from Quora.
- **MultiRC**: The Multi-Sentence Reading Comprehension corpus is a QA task that evaluates the model's ability to pick the correct answer from a list of answers, given a question and a context paragraph. In this version, the data is recast as binary classification: judging whether a given answer to a question-context pair is correct. (Khashabi et al., NAACL 2018)
- **RTE**: The Recognizing Textual Entailment corpus similarly tests the model's ability to recognize textual entailment. (Dagan et al., Springer 2006; Bar-Haim et al., 2006; Giampiccolo et al., 2007; Bentivogli et al., TAC 2009)
- **WSC**: The Winograd Schema Challenge tests the model's ability to do coreference resolution on sentences with a pronoun and a list of noun phrases found in the sentence. This version recasts it as binary classification over examples consisting of a pronoun and a noun phrase. (Levesque et al., PKRR 2012)
<sup>2</sup> https://www.quora.com/profile/Ricky-Riche-2/First-Quora-Dataset-Release-Question-Pairs
### Metrics
<!-- These are the evaluation metrics being used, ideally with a description of why. -->
The metrics used to evaluate the model are the following:
- Zero-shot
- Accuracy on predicting the correct completion/sentence for BLiMP, BLiMP Supplement, EWoK, Entity Tracking, and WUGs
- Change in R^2 prediction from baseline for Eye Tracking (with no spillover) and Self-paced Reading (1-word spillover)
- Finetuning
  - 3-class accuracy for MNLI
  - Binary accuracy for BoolQ, MultiRC, and WSC
  - F1-score for MRPC and QQP

The metrics were chosen following the recommendations of the papers that introduced each task.
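For the minimal-pair tasks (BLiMP, BLiMP Supplement, EWoK), accuracy amounts to checking whether the model assigns a higher total log-probability to the correct sentence; a minimal sketch:

```python
import torch

@torch.no_grad()
def sentence_logprob(model, tokenizer, text):
    ids = tokenizer(text, return_tensors="pt").input_ids
    logits = model(ids).logits[:, :-1]  # positions predicting tokens 2..T
    logps = torch.log_softmax(logits, dim=-1)
    # Sum the log-probability of each actual next token.
    token_logps = logps.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    return token_logps.sum().item()

def minimal_pair_correct(model, tokenizer, good_sentence, bad_sentence):
    return (sentence_logprob(model, tokenizer, good_sentence)
            > sentence_logprob(model, tokenizer, bad_sentence))
```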
## Results
*Zero-shot*
| Task | Metric | Causal Score |
| --- | --- | --- |
| BLiMP | Acc | 71.91 |
| BLiMP Supplement | Acc | 64.85 |
| EWoK | Acc | 52.44 |
| Eye Tracking | change in R^2 | 0.5 |
| Self-paced Reading | change in R^2 | 0.01 |
| Entity Tracking | Acc | 27.71 |
| WUGs | Acc | 38.5 |
*Finetuning*
| Task | Metric | Uni-directional Score | Bi-directional Score |
| --- | --- | --- | --- |
| BoolQ | Acc | | |
| MNLI | Acc | | |
| MRPC | F1 | | |
| QQP | F1 | | |
| MultiRC | Acc | | |
| RTE | Acc | | |
| WSC | Acc | | |
# Technical Specifications
## Model Architecture and Objective
The model uses the GPT-2 architecture with 124M parameters and is trained with a causal language modeling (next-token prediction) objective, combined with the SimPO preference optimization loss during interaction rounds.
## Compute Infrastructure
### Hardware
- 1 H100 GPU was used to train this model.
### Software
PyTorch
### Training Time
The model took ~10-12 GPU hours to train, with the majority of this time spent on inference with the student and teacher models.
# Citation
```latex
@misc{charpentier2025babylmturns3papers,
title={BabyLM Turns 3: Call for papers for the 2025 BabyLM workshop},
author={Lucas Charpentier and Leshem Choshen and Ryan Cotterell and Mustafa Omer Gul and Michael Hu and Jaap Jumelet and Tal Linzen and Jing Liu and Aaron Mueller and Candace Ross and Raj Sanjay Shah and Alex Warstadt and Ethan Wilcox and Adina Williams},
year={2025},
eprint={2502.10645},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2502.10645},
}
```
# Model Card Authors
Mustafa Ömer Gül
# Bibliography
[SimPO: Simple Preference Optimization with a Reference-Free Reward](https://proceedings.neurips.cc/paper_files/paper/2024/file/e099c1c9699814af0be873a175361713-Paper-Conference.pdf) (Meng et al., NeurIPS 2024)
[GLUE: A multi-task benchmark and analysis platform for natural language understanding](https://openreview.net/pdf?id=rJ4km2R5t7) (Wang et al., ICLR 2019)
[SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems](https://proceedings.neurips.cc/paper_files/paper/2019/file/4496bf24afe7fab6f046bf4923da8de6-Paper.pdf) (Wang et al., NeurIPS 2019)
[BLiMP: The Benchmark of Linguistic Minimal Pairs for English](https://aclanthology.org/2020.tacl-1.25/) (Warstadt et al., TACL 2020)
[Findings of the BabyLM Challenge: Sample-Efficient Pretraining on Developmentally Plausible Corpora](https://aclanthology.org/2023.conll-babylm.1/) (Warstadt et al., CoNLL-BabyLM 2023)
[🌏 Elements of World Knowledge (EWoK): A cognition-inspired framework for evaluating basic world knowledge in language models](https://arxiv.org/pdf/2405.09605v1) (Ivanova et al., 2024)
[Cloze probability, predictability ratings, and computational estimates for 205 English sentences, aligned with existing EEG and reading time data](https://link.springer.com/article/10.3758/s13428-023-02261-8) (de Varda et al., BRM 2024)
[Entity Tracking in Language Models](https://aclanthology.org/2023.acl-long.213/) (Kim & Schuster, ACL 2023)
[Derivational Morphology Reveals Analogical Generalization in Large Language Models](https://arxiv.org/pdf/2411.07990) (Hofmann et al., 2024)
[Automatically Constructing a Corpus of Sentential Paraphrases](https://aclanthology.org/I05-5002/) (Dolan & Brockett, IJCNLP 2005)
[A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference](https://aclanthology.org/N18-1101/) (Williams et al., NAACL 2018)
[The Winograd Schema Challenge](http://dl.acm.org/citation.cfm?id=3031843.3031909) (Levesque et al., PKRR 2012)
[The PASCAL Recognising Textual Entailment Challenge](https://link.springer.com/chapter/10.1007/11736790_9) (Dagan et al., Springer 2006)
The Second PASCAL Recognising Textual Entailment Challenge (Bar-Haim et al., 2006)
[The Third PASCAL Recognizing Textual Entailment Challenge](https://aclanthology.org/W07-1401/) (Giampiccolo et al., 2007)
[The Fifth PASCAL Recognizing Textual Entailment Challenge](https://tac.nist.gov/publications/2009/additional.papers/RTE5_overview.proceedings.pdf) (Bentivogli et al., TAC 2009)
[BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions](https://aclanthology.org/N19-1300/) (Clark et al., NAACL 2019)
[Looking Beyond the Surface: A Challenge Set for Reading Comprehension over Multiple Sentences](https://aclanthology.org/N18-1023/) (Khashabi et al., NAACL 2018) |