craffel committed · Commit 24feeae · 1 Parent(s): 9105959

Consolidate README

Files changed (1):
  1. README.md +33 -100
README.md CHANGED
@@ -1,118 +1,51 @@
  ---
  license: apache-2.0
  datasets:
- - common-pile/comma-dataset
  language:
  - en
- tags:
- - openly-licensed
- - llm
- - pretraining
  ---

- # Model Card for Comma v0.1

- Comma v0.1 is a 7 billion parameter model trained on 1 trillion tokens of openly licensed text collected as part of the Common Pile.

- ## Model Details

- ### Model Description
-
- Comma v0.1 is a 7 billion parameter decoder-only transformer. It uses the same architecture as Llama 3. It was trained on 1 trillion tokens from the Common Pile, an 8TB collection of openly licensed text.
-
- - **Developed by:** r-three, EleutherAI, Vector, University of Toronto
- <!-- - **Funded by [optional]:** [More Information Needed] -->
- - **Model type:** Decoder-Only Transformer
- - **Language(s) (NLP):** English
- - **License:** Apache 2.0
-
- ### Model Sources
-
- <!-- Provide the basic links for the model. -->
-
- - **Repository:** https://github.com/r-three/common-pile/
- - **Paper:** In Progress
-
- ## Uses
-
- Comma v0.1 can be used as the starting point for finetuning and post-training. As it was trained on openly licensed text, it is less likely to create IP issues, but this is not a guarantee.
-
- ### Direct Use
-
- Evaluations in our [paper]() show performance when using our final model directly. Additional post-training will most likely increase performance.
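For illustration, a minimal direct-use sketch with Hugging Face Transformers is shown below. The repository id is an assumption (substitute this model's actual Hub repo), and `device_map="auto"` additionally requires the `accelerate` package.

```python
# Minimal direct-use sketch (not from the original card).
# The repo id below is an assumption; replace it with this model's actual Hub repository.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "common-pile/comma-v0.1-1t"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

prompt = "The Common Pile is a collection of"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=True, temperature=0.8)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```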
-
- ### Out-of-Scope Use
-
- Comma v0.1 was trained only on openly licensed text. It will therefore probably have reduced performance when asked about topics that appear only in copyrighted text.
-
- ## Bias, Risks, and Limitations
-
- As it was trained on openly licensed text, Comma v0.1 is less likely to output IP-infringing text; however, due to issues like license laundering, this is not a guarantee. See our [paper]() for a deeper discussion of these details.
-
- Comma v0.1 was trained on many old books (pre-1929) and may therefore repeat societal biases common at the time.
-
- Comma v0.1 includes no post-hoc guardrails that limit what it may generate.
-
- ### Recommendations
-
- Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model. More information is needed for further recommendations.
-
- ## Training Details
-
- ### Training Data
-
- Comma v0.1 was trained on [common-pile/comma-dataset](https://huggingface.co/datasets/common-pile/comma-dataset), a filtered and deduplicated dataset of 1 trillion tokens drawn from the [Common Pile](https://huggingface.co/collections/common-pile/common-pile-v01-6826b454a5a6a445d0b51b37) dataset (8TB of openly licensed text).
-
- ### Training Procedure
-
- Comma v0.1 was trained in two stages: first, it was trained for 965 billion tokens with a cosine learning rate schedule; then, a second "cool-down" training phase was performed on 35 billion tokens from high-quality sources. The final model is the average of 10 checkpoints taken during this cool-down phase.
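For illustration only, a rough sketch of the checkpoint-averaging step described above, assuming the ten cool-down checkpoints are available locally as PyTorch state dicts. The file names are hypothetical and the actual lingua checkpoint format may differ.

```python
# Hypothetical sketch of averaging the cool-down checkpoints into a single model.
# Paths are placeholders; the real checkpoint layout used during training may differ.
import torch

paths = [f"cooldown_checkpoint_{i}.pt" for i in range(10)]  # placeholder file names
state_dicts = [torch.load(p, map_location="cpu") for p in paths]

# Element-wise average of every parameter across the ten checkpoints.
averaged = {
    key: sum(sd[key].float() for sd in state_dicts) / len(state_dicts)
    for key in state_dicts[0]
}
torch.save(averaged, "comma_v0.1_averaged.pt")
```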
-
- #### Training Hyperparameters

  Hyperparameters can be found in our [lingua config file](https://huggingface.co/common-pile/comma-v0.1-checkpoints/blob/main/config.yaml).

- ## Evaluation
-
- Comma v0.1 7B outperforms models with similar computational budgets (7 billion parameters, 1 trillion tokens) that were trained on non-openly licensed text (LLaMA 1, MPT, RPJ-INCITE) on several common benchmarks (ARC-C, MMLU, BoolQ, SIQA, etc.) and does especially well on code-based tasks (HumanEval, MBPP). It tends to underperform on datasets like HellaSwag. Evaluations were done using OLMES. Note that there is still a large gap between Comma v0.1 and current state-of-the-art models like Qwen3, which was trained on 36 times as many tokens.
-
- More evaluation results can be found in our [paper]().
-
- #### Summary
-
- Comma v0.1 is a 7B parameter model trained on openly licensed text. It is one of the first performant models trained on **only** openly licensed text.
-
- ## Environmental Impact
-
- <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
-
- Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).

- - **Hardware Type:** Nvidia H100 GPU
- - **Hours used:** [More Information Needed]
- - **Cloud Provider:** AWS
- - **Compute Region:** [More Information Needed]
- - **Carbon Emitted:** [More Information Needed]
-
- ## Technical Specifications
-
- ### Model Architecture and Objective
-
- Comma v0.1 uses the same architecture as Llama 3 and is trained using standard autoregressive next-token prediction.
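For illustration only, a minimal sketch of the next-token prediction objective: the logits at position t are scored against the token at position t + 1 with cross-entropy. The tensor shapes below are placeholders, not Comma v0.1's actual dimensions.

```python
# Illustrative sketch of the standard autoregressive next-token prediction loss.
# Shapes and vocabulary size are placeholders, not Comma v0.1's actual values.
import torch
import torch.nn.functional as F

batch, seq_len, vocab = 2, 16, 32000
logits = torch.randn(batch, seq_len, vocab)             # stand-in for model outputs
input_ids = torch.randint(0, vocab, (batch, seq_len))   # stand-in for token ids

# Shift so that the prediction at position t is scored against the token at t + 1.
shift_logits = logits[:, :-1, :].reshape(-1, vocab)
shift_labels = input_ids[:, 1:].reshape(-1)
loss = F.cross_entropy(shift_logits, shift_labels)
print(loss.item())
```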
-
- ### Compute Infrastructure
-
- Comma v0.1 was trained on the Hugging Face cluster.
-
- #### Hardware
-
- Comma v0.1 was trained using 64 Nvidia H100 GPUs.
-
- #### Software
-
- Comma v0.1 was trained using [lingua](https://github.com/facebookresearch/lingua).

  ## Citation

  ```bibtex
-
- ```
  ---
  license: apache-2.0
  datasets:
+ - common-pile/comma_v0.1_training_dataset
  language:
  - en
  ---

+ # Comma v0.1

+ Comma v0.1 is a 7 billion parameter language model trained on [the Comma v0.1 dataset](https://huggingface.co/datasets/common-pile/comma_v0.1_training_dataset), comprising 1 trillion tokens of openly licensed text from [the Common Pile](https://huggingface.co/collections/common-pile/common-pile-v01-68307d37df48e36f02717f21).
+ Comma v0.1 is a "base model" that can be used as the starting point for finetuning and post-training; a minimal finetuning sketch is shown after the table below.
+ It performs comparably to budget-matched models (7 billion parameters, 1 trillion tokens) trained on unlicensed data.

+ | Model      | ARC-C | ARC-E | MMLU | BoolQ | HSwag | OBQA | CSQA | PIQA | SIQA | HEval | MBPP | Avg. |
+ | ---------- | ----- | ----- | ---- | ----- | ----- | ---- | ---- | ---- | ---- | ----- | ---- | ---- |
+ | RPJ-INCITE | 42.8  | 68.4  | 27.8 | 68.6  | 70.3  | 49.4 | 57.7 | 76.0 | 46.9 | 11.1  | 15.9 | 48.6 |
+ | LLaMA 1    | 44.5  | 67.9  | 34.8 | 75.4  | 76.2  | 51.2 | 61.8 | 77.2 | 50.3 | 19.9  | 27.9 | 53.4 |
+ | StableLM   | 50.8  | 65.4  | 45.2 | 71.7  | 75.6  | 48.2 | 57.2 | 77.0 | 48.2 | 23.1  | 32.0 | 54.0 |
+ | MPT        | 46.5  | 70.5  | 30.2 | 74.2  | 77.6  | 48.6 | 63.3 | 77.3 | 49.1 | 27.3  | 33.2 | 54.3 |
+ | OpenLLaMA  | 44.5  | 67.2  | 40.3 | 72.6  | 72.6  | 50.8 | 62.8 | 78.0 | 49.7 | 27.6  | 33.9 | 54.5 |
+ | Comma v0.1 | 52.8  | 68.4  | 42.4 | 75.7  | 62.6  | 47.0 | 59.4 | 70.8 | 50.8 | 36.5  | 35.5 | 54.7 |
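Below is a minimal, hedged sketch of using Comma v0.1 as a starting point for finetuning with Hugging Face Transformers; the repository id, data file, and hyperparameters are placeholders chosen for illustration, not values used by the model's authors.

```python
# Illustrative finetuning sketch; repo id, data file, and hyperparameters are placeholders.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_id = "common-pile/comma-v0.1-1t"  # assumed repo id; replace with this model's Hub repo
tokenizer = AutoTokenizer.from_pretrained(model_id)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # Llama-style tokenizers often lack a pad token
model = AutoModelForCausalLM.from_pretrained(model_id)

# Hypothetical plain-text corpus; replace with your own finetuning data.
dataset = load_dataset("text", data_files={"train": "my_corpus.txt"})["train"]
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=1024),
    batched=True,
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="comma-v0.1-finetuned",
        per_device_train_batch_size=1,
        num_train_epochs=1,
        learning_rate=2e-5,
    ),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```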
+ ## Training details

+ Comma v0.1 is a decoder-only transformer that uses the same architecture as Llama 3.
+ Training was done in two stages: first on 965 billion tokens with a cosine learning rate schedule, and second a "cool-down" training phase on 35 billion tokens from high-quality sources.
+ The final model is the average of 10 checkpoints during this cool-down phase.
+ Training was performed using [lingua](https://github.com/facebookresearch/lingua) on 64 Nvidia H100 GPUs.
  Hyperparameters can be found in our [lingua config file](https://huggingface.co/common-pile/comma-v0.1-checkpoints/blob/main/config.yaml).
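As a small convenience sketch (not part of the original card), the config can be fetched and inspected with `huggingface_hub` and PyYAML:

```python
# Fetch and inspect the lingua training config referenced above.
# Assumes `huggingface_hub` and `pyyaml` are installed.
import yaml
from huggingface_hub import hf_hub_download

config_path = hf_hub_download(
    repo_id="common-pile/comma-v0.1-checkpoints",
    filename="config.yaml",
)
with open(config_path) as f:
    config = yaml.safe_load(f)

print(config)  # hyperparameters used by lingua during pre-training
```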

+ ## Limitations

+ Comma v0.1 was trained only on English-language data and code from the 15 programming languages covered by the [stack-edu classifiers](https://huggingface.co/collections/HuggingFaceTB/the-ultimate-collection-of-code-classifiers-67b5aa3eb8994a4b71453005).
+ It will likely have poor performance on other languages or programming languages.
+ While we aimed to train solely on openly licensed data, license laundering and inaccurate metadata can result in erroneous license information in the Common Pile (for further discussion of this limitation, please see [our paper](TODO link)).
+ Consequently, we cannot guarantee that Comma v0.1 was trained exclusively on openly licensed text.
+ When preparing Comma v0.1's pre-training data, we made use of the toxicity tagger from [Dolma](https://github.com/allenai/dolma) to attempt to remove problematic content.
+ However, Comma v0.1 may nevertheless reflect social biases present in its training data.
+ Finally, please note that Comma v0.1 is a base model that has not undergone any form of "alignment" and therefore has no guardrails that limit what it may generate.

  ## Citation

  ```bibtex
+ @article{kandpal2025common,
+   title={{The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text}},
+   author={Nikhil Kandpal and Brian Lester and Colin Raffel and Sebastian Majstorovic and Stella Biderman and Baber Abbasi and Luca Soldaini and Enrico Shippole and A. Feder Cooper and Aviya Skowron and Shayne Longpre and Lintang Sutawika and Alon Albalak and Zhenlin Xu and Guilherme Penedo and Loubna Ben and Elie Bakouch and John David and Honglu Fan and Dashiell Stander and Guangyu Song and Aaron Gokaslan and John Kirchenbauer and Tom Goldstein and Brian R and Bhavya Kailkhura and Tyler Murray},
+   journal={arXiv preprint},
+   year={2025}
+ }
+ ```