Create README.md #1
opened by blester125

README.md ADDED
@@ -0,0 +1,51 @@
---
license: apache-2.0
datasets:
- common-pile/comma_v0.1_training_dataset
language:
- en
---

# Comma v0.1

Comma v0.1 is a 7 billion parameter language model trained on [the Comma v0.1 dataset](https://huggingface.co/datasets/common-pile/comma_v0.1_training_dataset), comprising 1 trillion tokens of openly licensed text from [the Common Pile](https://huggingface.co/collections/common-pile/common-pile-v01-68307d37df48e36f02717f21).
Comma v0.1 is a "base model" that can be used as the starting point for finetuning and post-training; a minimal usage sketch is shown below the results table.
It performs comparably to budget-matched models (7 billion parameters, 1 trillion tokens) trained on unlicensed data.

| Model      | ARC-C | ARC-E | MMLU | BoolQ | HSwag | OBQA | CSQA | PIQA | SIQA | HEval | MBPP | Avg. |
| ---------- | ----- | ----- | ---- | ----- | ----- | ---- | ---- | ---- | ---- | ----- | ---- | ---- |
| RPJ-INCITE | 42.8  | 68.4  | 27.8 | 68.6  | 70.3  | 49.4 | 57.7 | 76.0 | 46.9 | 11.1  | 15.9 | 48.6 |
| LLaMA 1    | 44.5  | 67.9  | 34.8 | 75.4  | 76.2  | 51.2 | 61.8 | 77.2 | 50.3 | 19.9  | 27.9 | 53.4 |
| StableLM   | 50.8  | 65.4  | 45.2 | 71.7  | 75.6  | 48.2 | 57.2 | 77.0 | 48.2 | 23.1  | 32.0 | 54.0 |
| MPT        | 46.5  | 70.5  | 30.2 | 74.2  | 77.6  | 48.6 | 63.3 | 77.3 | 49.1 | 27.3  | 33.2 | 54.3 |
| OpenLLaMA  | 44.5  | 67.2  | 40.3 | 72.6  | 72.6  | 50.8 | 62.8 | 78.0 | 49.7 | 27.6  | 33.9 | 54.5 |
| Comma v0.1 | 52.8  | 68.4  | 42.4 | 75.7  | 62.6  | 47.0 | 59.4 | 70.8 | 50.8 | 36.5  | 35.5 | 54.7 |

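Since Comma v0.1 is a plain base model, it can be loaded for text completion with standard `transformers` tooling. The snippet below is a minimal sketch rather than an official recipe: the repository id `common-pile/comma-v0.1` and the generation settings are assumptions, so substitute the model id shown on this page.

```python
# Minimal text-completion sketch for a base (non-instruction-tuned) model.
# NOTE: the repository id below is an assumption; replace it with the id of
# this model page if it differs.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "common-pile/comma-v0.1"  # assumed id; check the model page
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

# Base models continue text rather than follow instructions, so prompt with a
# prefix you want completed.
prompt = "The Common Pile is a collection of openly licensed text that"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=True, temperature=0.8)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
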
## Training details

Comma v0.1 is a decoder-only transformer that uses the same architecture as Llama 3.
Training was done in two stages: first on 965 billion tokens with a cosine learning rate schedule, and second a "cool-down" phase on 35 billion tokens from high-quality sources.
The final model is the average of 10 checkpoints taken during this cool-down phase (see the sketch below).
Training was performed using [lingua](https://github.com/facebookresearch/lingua/) on 64 Nvidia H100 GPUs.
Hyperparameters can be found in our [lingua config file](https://huggingface.co/common-pile/comma-v0.1-checkpoints/blob/main/config.yaml).

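The checkpoint averaging mentioned above is a simple parameter-wise mean over the saved cool-down checkpoints. The following is an illustrative sketch using plain PyTorch state dicts; the file paths are hypothetical, and the released weights were produced within the lingua training setup, not by this script.

```python
# Illustrative sketch of parameter-wise checkpoint averaging over cool-down
# checkpoints. Paths and filenames are hypothetical; the released model was
# produced with lingua, not with this script.
import torch

# Hypothetical list of 10 cool-down checkpoint files.
checkpoint_paths = [f"cooldown_checkpoints/step_{i}.pt" for i in range(10)]

avg_state = None
for path in checkpoint_paths:
    state = torch.load(path, map_location="cpu")
    if avg_state is None:
        # Initialize the running sum with float copies of the first checkpoint.
        avg_state = {k: v.float().clone() for k, v in state.items()}
    else:
        for k, v in state.items():
            avg_state[k] += v.float()

# Divide the accumulated sums by the number of checkpoints.
for k in avg_state:
    avg_state[k] /= len(checkpoint_paths)

torch.save(avg_state, "comma_v0.1_averaged.pt")
```
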
## Limitations

Comma v0.1 was trained only on English-language data and code from the 15 programming languages covered by the [stack-edu classifiers](https://huggingface.co/collections/HuggingFaceTB/the-ultimate-collection-of-code-classifiers-67b5aa3eb8994a4b71453005).
It will likely perform poorly on other natural languages and programming languages.
While we aimed to train solely on openly licensed data, license laundering and inaccurate metadata can result in erroneous license information in the Common Pile (for further discussion of this limitation, please see [our paper](TODO link)).
Consequently, we cannot guarantee that Comma v0.1 was trained exclusively on openly licensed text.
When preparing Comma v0.1's pre-training data, we used the toxicity tagger from [Dolma](https://github.com/allenai/dolma) to attempt to remove problematic content.
However, Comma v0.1 may nevertheless reflect social biases present in its training data.
Finally, please note that Comma v0.1 is a base model that has not undergone any form of "alignment" and therefore has no guardrails limiting what it may generate.

## Citation

```bibtex
@article{kandpal2025common,
  title={{The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text}},
  author={Nikhil Kandpal and Brian Lester and Colin Raffel and Sebastian Majstorovic and Stella Biderman and Baber Abbasi and Luca Soldaini and Enrico Shippole and A. Feder Cooper and Aviya Skowron and Shayne Longpre and Lintang Sutawika and Alon Albalak and Zhenlin Xu and Guilherme Penedo and Loubna Ben and Elie Bakouch and John David and Honglu Fan and Dashiell Stander and Guangyu Song and Aaron Gokaslan and John Kirchenbauer and Tom Goldstein and Brian R and Bhavya Kailkhura and Tyler Murray},
  journal={arXiv preprint},
  year={2025}
}
```