Update dist/index.html
dist/index.html CHANGED (+17 -10)
@@ -58,7 +58,7 @@
 
 That's why we've decided to focus on the T5 <d-cite bibtex-key="JMLR:v21:20-074"></d-cite>.<br><br>
 
-This article presents the optimisations we have implemented to efficiently pre-train a T5 in French with 147M parameters in a reasonable time (1,461 H for 419B tokens) and with limited resources (1 single A100; i.e. a computing budget of around
+This article presents the optimisations we have implemented to efficiently pre-train a T5 in French with 147M parameters in a reasonable time (1,461 H for 419B tokens) and with limited resources (1 single A100; i.e. a computing budget of around 1,900 euros).
 To achieve this, we designed CUDA/Triton kernels to make Flash Attention compatible with T5 and provide linear inference, thus extending the context size that can be taken into account by the model.<br><br>
 <strong>The pre-training code is available in our <a class="link" href="https://github.com/catie-aq/flashT5">GitHub repository</a> under Apache-2.0 license and weights on our <a class="link" href="https://hf.co/CATIE-AQ">Hugging Face</a> account.</strong>
 <p class="width_125"><br><br><br></p>
@@ -159,7 +159,7 @@
 <ul>
 <li><p class="width_125"><u>Several processes can be used to process data in parallel</u><br>
 For example, the parameter <code>num_workers</code> of the <code>Dataloader</code> of PyTorch <d-cite bibtex-key="paszke2019pytorch"></d-cite>.</p></li>
-<div class="tip"><p>You can find in our code the values we use for this parameter for our FAT5
+<div class="tip"><p>You can find in our code the values we use for this parameter for our FAT5 <a class="link" href="https://github.com/catie-aq/flashT5/blob/dfe10d498ae0b39082182f807acb509e91992360/configs/fr/fat5-fr-small.yaml#L42">small</a>.</div>
 </ul>
 <ul>
 <li><p class="width_125"><u>The bottleneck can also come from the <code>DataCollator</code></u><br>
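The tip above points to the value used for the FAT5 small config. As a minimal, illustrative sketch (not taken from the flashT5 repository), this is how `num_workers` is typically set on a PyTorch `DataLoader`; the dataset, batch size and worker count below are placeholders:

```python
# Minimal sketch of parallel data loading with PyTorch.
# Dataset, batch size and worker count are placeholders, not the FAT5 settings.
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.arange(10_000).unsqueeze(1))

loader = DataLoader(
    dataset,
    batch_size=64,
    shuffle=True,
    num_workers=4,           # subprocesses preparing batches in parallel
    pin_memory=True,         # faster host-to-GPU copies
    persistent_workers=True, # keep workers alive between epochs
)

for (batch,) in loader:
    pass  # training step would go here
```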
@@ -338,7 +338,7 @@
 <div class="tip"><p>With this in mind, we trained a tokenizer of size 32 768 (8**5),
 following <a class="link" href="https://twitter.com/karpathy/status/1621578354024677377">this observation by KARPATHY</a>.
 This is a BPE tokenizer <d-cite bibtex-key="sennrich2016neuralmachinetranslationrare"></d-cite> trained on CulturaX and The Stack, using 256 extra_tokens and the numbers are separated.<br>
-Readers can find the code used <a class="link" href=https://github.com/catie-aq/flashT5/blob/main/examples/fat5-fr/train_tokenizer.py">here</a>.
+Readers can find the code used <a class="link" href="https://github.com/catie-aq/flashT5/blob/main/examples/fat5-fr/train_tokenizer.py">here</a>.
 </p></div>
 <p><br></p>
 
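For readers who do not want to open the linked script, the sketch below shows what training such a tokenizer can look like with the Hugging Face `tokenizers` library. The special-token names, corpus file and pre-tokenizer choices are assumptions for illustration only; the authoritative version is the linked `train_tokenizer.py`:

```python
# Illustrative sketch only: a BPE tokenizer with a 32,768-token vocabulary,
# digits split individually, and 256 extra sentinel tokens.
# The real FAT5 script is examples/fat5-fr/train_tokenizer.py in the repo.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = pre_tokenizers.Sequence([
    pre_tokenizers.Digits(individual_digits=True),  # "2024" -> "2", "0", "2", "4"
    pre_tokenizers.ByteLevel(add_prefix_space=False),
])

# Hypothetical special tokens; the exact names used by FAT5 may differ.
special_tokens = ["<unk>", "<pad>", "</s>"] + [f"<extra_id_{i}>" for i in range(256)]
trainer = trainers.BpeTrainer(vocab_size=32_768, special_tokens=special_tokens)

# corpus.txt stands in for text dumps of the training corpora (CulturaX, The Stack).
tokenizer.train(files=["corpus.txt"], trainer=trainer)
tokenizer.save("tokenizer.json")
```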
@@ -352,7 +352,7 @@
 <div class="tip"><p>
 We used the original T5 optimiser, <a class="link" href="https://github.com/catie-aq/flashT5/blob/main/src/utils/adamw_scaled.py">AdamWScale</a>.
 For hyperparameter values, we use <code>lr = 5e-3</code>, <code>betas = (0.9, 0.999)</code>, <code>eps = 1e-6</code> et <code>weight_decay = 0.0</code>
-based on the observations of <a class="link" href="https://github.com/PiotrNawrot/nanoT5/issues/25#issuecomment-1922731400">Wilson
+based on the observations of <a class="link" href="https://github.com/PiotrNawrot/nanoT5/issues/25#issuecomment-1922731400">Wilson WONGSO</a>.
 Indeed, it turns out that not all the alternative optimisers tested converged.</p></div>
 <div class="note"><p>We have added the parameter <code>foreach</code> in our version of AdamWScale.</p>
 </div>
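As a quick illustration of the hyperparameters quoted in this hunk, the snippet below instantiates a standard `torch.optim.AdamW` as a stand-in; the actual training uses the AdamWScale class linked above, and `foreach=True` mirrors the parameter mentioned in the note:

```python
# Sketch of the optimiser settings quoted above, using torch.optim.AdamW as a
# stand-in for the project's AdamWScale (see the link in the tip).
import torch

model = torch.nn.Linear(512, 512)  # placeholder model

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=5e-3,
    betas=(0.9, 0.999),
    eps=1e-6,
    weight_decay=0.0,
    foreach=True,  # multi-tensor implementation of the update loop
)
```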
@@ -1050,7 +1050,7 @@
 <td>29.8</td>
 </tr>
 <tr>
-<td>distillcamembert(68.1M)</td>
+<td>distillcamembert (68.1M)</td>
 <td>51.3</td>
 <td>60.7</td>
 <td>37.4</td>
@@ -1095,14 +1095,21 @@
 <thead>
 <tr>
 <th>Cloud provider</th>
-<th>Hourly rate for an A 100</th>
-<th>Price for 262B tokens</th>
-<th>Price for 419B tokens</th>
+<th>Hourly rate for an A 100 (in €)</th>
+<th>Price for 262B tokens (in €)</th>
+<th>Price for 419B tokens (in €)</th>
 <th>Note</th>
 </tr>
 </thead>
 <tbody>
 <tr>
+<td>Sesterce</td>
+<td>1.29</td>
+<td>1,182</td>
+<td>1,891</td>
+<td></td>
+</tr>
+<tr>
 <td>AWS</td>
 <td>1.77</td>
 <td>1,616</td>
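The prices in this table are roughly the hourly rate multiplied by the training time, taking the 1,461 h for 419B tokens quoted in the introduction and scaling linearly for 262B tokens. The sketch below shows that arithmetic as an illustration only (this is our reading, not necessarily the exact methodology behind the table; the small gaps of a few euros come from rounding of the hour counts):

```python
# Rough reconstruction of the table: price ≈ hourly rate × training hours.
# Hours are scaled linearly from the 1,461 h needed for 419B tokens.
HOURS_419B = 1461
hours_262b = HOURS_419B * 262 / 419  # ≈ 914 h

for provider, rate_eur_per_h in [("Sesterce", 1.29), ("AWS", 1.77)]:
    cost_262b = rate_eur_per_h * hours_262b
    cost_419b = rate_eur_per_h * HOURS_419B
    print(f"{provider}: ~{cost_262b:.0f} € for 262B tokens, ~{cost_419b:.0f} € for 419B")
# Sesterce: ~1178 € / ~1885 €   AWS: ~1617 € / ~2586 €
```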
@@ -1279,7 +1286,7 @@
 We introduced the FAT5 (Flash Attention T5) model, detailing our approach to optimizing various elements of the pre-training and finetuning processes.
 This is based on kernels that enable Flash Attention to be used with a T5 and give the model a linear memory.
 In particular, we've applied our work to French as a proof of concept, and made sure that it can also be used in any other language.
-We hope that our method, which enables a model with 147M parameters to be pre-trained from scratch
+We hope that our method, which enables a model with 147M parameters to be pre-trained from scratch at a limited cost, will be useful for people with limited computational resources.
 It also opens the way for a possible comeback of encoder-decoder models, rather than only decoder models.<br>
 <p class="width_125"><br><br></p>
 
@@ -1323,7 +1330,7 @@
 const toc = document.querySelector('d-contents');
 if (toc) {
 const headings = article.querySelectorAll('h2, h3, h4');
-let ToC = `<nav role="navigation" class="l-text figcaption" style="color: #9CA3AF;"><h3>Table
+let ToC = `<nav role="navigation" class="l-text figcaption" style="color: #9CA3AF;"><h3>Table of contents</h3>`;
 let prevLevel = 0;
 for (const el of headings) {
 // should element be included in TOC?