Update dist/index.html

dist/index.html CHANGED (+17 -10)
@@ -58,7 +58,7 @@
 
 That's why we've decided to focus on the T5 <d-cite bibtex-key="JMLR:v21:20-074"></d-cite>.<br><br>
 
-This article presents the optimisations we have implemented to efficiently pre-train a T5 in French with 147M parameters in a reasonable time (1,461 H for 419B tokens) and with limited resources (1 single A100; i.e. a computing budget of around
+This article presents the optimisations we have implemented to efficiently pre-train a T5 in French with 147M parameters in a reasonable time (1,461 H for 419B tokens) and with limited resources (1 single A100; i.e. a computing budget of around 1,900 euros).
 To achieve this, we designed CUDA/Triton kernels to make Flash Attention compatible with T5 and provide linear inference, thus extending the context size that can be taken into account by the model.<br><br>
 <strong>The pre-training code is available in our <a class="link" href="https://github.com/catie-aq/flashT5">GitHub repository</a> under Apache-2.0 license and weights on our <a class="link" href="https://hf.co/CATIE-AQ">Hugging Face</a> account.</strong>
 <p class="width_125"><br><br><br></p>
@@ -159,7 +159,7 @@
 <ul>
 <li><p class="width_125"><u>Several processes can be used to process data in parallel</u><br>
 For example, the parameter <code>num_workers</code> of the <code>Dataloader</code> of PyTorch <d-cite bibtex-key="paszke2019pytorch"></d-cite>.</p></li>
-<div class="tip"><p>You can find in our code the values we use for this parameter for our FAT5
+<div class="tip"><p>You can find in our code the values we use for this parameter for our FAT5 <a class="link" href="https://github.com/catie-aq/flashT5/blob/dfe10d498ae0b39082182f807acb509e91992360/configs/fr/fat5-fr-small.yaml#L42">small</a>.</div>
 </ul>
 <ul>
 <li><p class="width_125"><u>The bottleneck can also come from the <code>DataCollator</code></u><br>
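For readers unfamiliar with this parameter, here is a minimal PyTorch sketch of parallel data loading. It is only an illustration: the dataset, batch size and collator are placeholders, not the FAT5 values, which live in the config file linked in the tip above.

from torch.utils.data import DataLoader

# `dataset` and `collate_fn` are assumed to be defined elsewhere (placeholders).
loader = DataLoader(
    dataset,
    batch_size=32,
    num_workers=4,          # worker processes preparing batches in parallel with the GPU
    pin_memory=True,        # page-locked host memory speeds up host-to-GPU copies
    collate_fn=collate_fn,  # the DataCollator discussed in the next bullet
)

for batch in loader:
    ...                     # training step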
@@ -338,7 +338,7 @@
 <div class="tip"><p>With this in mind, we trained a tokenizer of size 32 768 (8**5),
 following <a class="link" href="https://twitter.com/karpathy/status/1621578354024677377">this observation by KARPATHY</a>.
 This is a BPE tokenizer <d-cite bibtex-key="sennrich2016neuralmachinetranslationrare"></d-cite> trained on CulturaX and The Stack, using 256 extra_tokens and the numbers are separated.<br>
-Readers can find the code used <a class="link" href=https://github.com/catie-aq/flashT5/blob/main/examples/fat5-fr/train_tokenizer.py">here</a>.
+Readers can find the code used <a class="link" href="https://github.com/catie-aq/flashT5/blob/main/examples/fat5-fr/train_tokenizer.py">here</a>.
 </p></div>
 <p><br></p>
 
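As a rough illustration of such a setup, the sketch below trains a BPE tokenizer with a 32,768-token vocabulary, digits split one by one and 256 extra tokens. It is not the linked train_tokenizer.py script: the special-token names, the ByteLevel pre-tokenizer and the corpus paths are assumptions made for the example.

from tokenizers import Tokenizer, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = pre_tokenizers.Sequence([
    pre_tokenizers.Digits(individual_digits=True),     # "2024" -> "2", "0", "2", "4"
    pre_tokenizers.ByteLevel(add_prefix_space=False),
])
trainer = trainers.BpeTrainer(
    vocab_size=32768,
    special_tokens=["<pad>", "</s>", "<unk>"] + [f"<extra_id_{i}>" for i in range(256)],
)
# corpus_files: paths to text dumps of CulturaX and The Stack (placeholder)
tokenizer.train(files=corpus_files, trainer=trainer)
tokenizer.save("tokenizer.json")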
@@ -352,7 +352,7 @@
 <div class="tip"><p>
 We used the original T5 optimiser, <a class="link" href="https://github.com/catie-aq/flashT5/blob/main/src/utils/adamw_scaled.py">AdamWScale</a>.
 For hyperparameter values, we use <code>lr = 5e-3</code>, <code>betas = (0.9, 0.999)</code>, <code>eps = 1e-6</code> et <code>weight_decay = 0.0</code>
-based on the observations of <a class="link" href="https://github.com/PiotrNawrot/nanoT5/issues/25#issuecomment-1922731400">Wilson
+based on the observations of <a class="link" href="https://github.com/PiotrNawrot/nanoT5/issues/25#issuecomment-1922731400">Wilson WONGSO</a>.
 Indeed, it turns out that not all the alternative optimisers tested converged.</p></div>
 <div class="note"><p>We have added the parameter <code>foreach</code> in our version of AdamWScale.</p>
 </div>
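For comparison only, the same hyperparameter values and the foreach flag spelled out for PyTorch's stock AdamW. This is a sketch, not the training code: the model is pre-trained with the AdamWScale class linked above, and `model` below is a placeholder.

import torch

optimizer = torch.optim.AdamW(
    model.parameters(),   # `model` stands in for the FAT5 model (placeholder)
    lr=5e-3,
    betas=(0.9, 0.999),
    eps=1e-6,
    weight_decay=0.0,
    foreach=True,         # multi-tensor update path: batches the per-parameter ops
)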
@@ -1050,7 +1050,7 @@
 <td>29.8</td>
 </tr>
 <tr>
-<td>distillcamembert(68.1M)</td>
+<td>distillcamembert (68.1M)</td>
 <td>51.3</td>
 <td>60.7</td>
 <td>37.4</td>
@@ -1095,14 +1095,21 @@
 <thead>
 <tr>
 <th>Cloud provider</th>
-<th>Hourly rate for an A 100</th>
-<th>Price for 262B tokens</th>
-<th>Price for 419B tokens</th>
+<th>Hourly rate for an A 100 (in €)</th>
+<th>Price for 262B tokens (in €)</th>
+<th>Price for 419B tokens (in €)</th>
 <th>Note</th>
 </tr>
 </thead>
 <tbody>
 <tr>
+<td>Sesterce</td>
+<td>1.29</td>
+<td>1,182</td>
+<td>1,891</td>
+<td></td>
+</tr>
+<tr>
 <td>AWS</td>
 <td>1.77</td>
 <td>1,616</td>
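The figures in this pricing table can be roughly reproduced as hourly rate × GPU hours, taking the 1,461 h reported in the introduction for 419B tokens and scaling the hours linearly for 262B tokens. This is an assumed reconstruction, not necessarily how the table was computed; the small gaps are presumably rounding of the exact wall-clock time.

hours_419b = 1461                        # training time reported for 419B tokens
hours_262b = hours_419b * 262 / 419      # ≈ 913 h, assuming linear scaling in tokens

print(round(1.29 * hours_262b), round(1.29 * hours_419b))  # ≈ 1178 and 1885 (table: 1,182 and 1,891)
print(round(1.77 * hours_262b))                            # ≈ 1617 (table: 1,616)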
@@ -1279,7 +1286,7 @@
 We introduced the FAT5 (Flash Attention T5) model, detailing our approach to optimizing various elements of the pre-training and finetuning processes.
 This is based on kernels that enable Flash Attention to be used with a T5 and give the model a linear memory.
 In particular, we've applied our work to French as a proof of concept, and made sure that it can also be used in any other language.
-We hope that our method, which enables a model with 147M parameters to be pre-trained from scratch
+We hope that our method, which enables a model with 147M parameters to be pre-trained from scratch at a limited cost, will be useful for people with limited computational resources.
 It also opens the way for a possible comeback of encoder-decoder models, rather than only decoder models.<br>
 <p class="width_125"><br><br></p>
 
@@ -1323,7 +1330,7 @@
 const toc = document.querySelector('d-contents');
 if (toc) {
 const headings = article.querySelectorAll('h2, h3, h4');
-let ToC = `<nav role="navigation" class="l-text figcaption" style="color: #9CA3AF;"><h3>Table
+let ToC = `<nav role="navigation" class="l-text figcaption" style="color: #9CA3AF;"><h3>Table of contents</h3>`;
 let prevLevel = 0;
 for (const el of headings) {
 // should element be included in TOC?