bourdoiscatie committed
Commit 048e36d (verified) · 1 parent: 7f24f8c

Update dist/index.html

Files changed (1): dist/index.html (+17 -10)

--- a/dist/index.html
+++ b/dist/index.html
@@ -58,7 +58,7 @@
 
  That's why we've decided to focus on the T5 <d-cite bibtex-key="JMLR:v21:20-074"></d-cite>.<br><br>
 
- This article presents the optimisations we have implemented to efficiently pre-train a T5 in French with 147M parameters in a reasonable time (1,461 H for 419B tokens) and with limited resources (1 single A100; i.e. a computing budget of around 2,200 euros).
+ This article presents the optimisations we have implemented to efficiently pre-train a T5 in French with 147M parameters in a reasonable time (1,461 H for 419B tokens) and with limited resources (1 single A100; i.e. a computing budget of around 1,900 euros).
  To achieve this, we designed CUDA/Triton kernels to make Flash Attention compatible with T5 and provide linear inference, thus extending the context size that can be taken into account by the model.<br><br>
  <strong>The pre-training code is available in our <a class="link" href="https://github.com/catie-aq/flashT5">GitHub repository</a> under Apache-2.0 license and weights on our <a class="link" href="https://hf.co/CATIE-AQ">Hugging Face</a> account.</strong>
  <p class="width_125"><br><br><br></p>
@@ -159,7 +159,7 @@
  <ul>
  <li><p class="width_125"><u>Several processes can be used to process data in parallel</u><br>
  For example, the parameter <code>num_workers</code> of the <code>Dataloader</code> of PyTorch <d-cite bibtex-key="paszke2019pytorch"></d-cite>.</p></li>
- <div class="tip"><p>You can find in our code the values we use for this parameter for our FAT5 small <a class="link" href="https://github.com/catie-aq/flashT5/blob/dfe10d498ae0b39082182f807acb509e91992360/configs/fr/fat5-fr-small.yaml#L42">small</a>.</div>
+ <div class="tip"><p>You can find in our code the values we use for this parameter for our FAT5 <a class="link" href="https://github.com/catie-aq/flashT5/blob/dfe10d498ae0b39082182f807acb509e91992360/configs/fr/fat5-fr-small.yaml#L42">small</a>.</div>
  </ul>
  <ul>
  <li><p class="width_125"><u>The bottleneck can also come from the <code>DataCollator</code></u><br>
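As a rough illustration of the num_workers tip in the hunk above, here is a minimal PyTorch sketch. It is not code from the flashT5 repository: the dataset, batch size and worker count are placeholders, and the value actually used for FAT5 small is the one in the linked YAML config.

    import torch
    from torch.utils.data import DataLoader, Dataset

    class ToyTokenDataset(Dataset):
        """Stand-in dataset returning pre-tokenised examples."""
        def __init__(self, n_examples: int = 1024, seq_len: int = 128):
            self.data = torch.randint(0, 32768, (n_examples, seq_len))

        def __len__(self):
            return len(self.data)

        def __getitem__(self, idx):
            return self.data[idx]

    if __name__ == "__main__":  # needed when worker processes are spawned
        loader = DataLoader(
            ToyTokenDataset(),
            batch_size=8,
            num_workers=4,            # several worker processes prepare batches in parallel
            pin_memory=True,          # faster host-to-GPU copies
            persistent_workers=True,  # keep the workers alive between epochs
        )
        for batch in loader:
            pass  # the training step would consume `batch` here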
@@ -338,7 +338,7 @@
  <div class="tip"><p>With this in mind, we trained a tokenizer of size 32 768 (8**5),
  following <a class="link" href="https://twitter.com/karpathy/status/1621578354024677377">this observation by KARPATHY</a>.
  This is a BPE tokenizer <d-cite bibtex-key="sennrich2016neuralmachinetranslationrare"></d-cite> trained on CulturaX and The Stack, using 256 extra_tokens and the numbers are separated.<br>
- Readers can find the code used <a class="link" href=https://github.com/catie-aq/flashT5/blob/main/examples/fat5-fr/train_tokenizer.py">here</a>.
+ Readers can find the code used <a class="link" href="https://github.com/catie-aq/flashT5/blob/main/examples/fat5-fr/train_tokenizer.py">here</a>.
  </p></div>
  <p><br></p>
 
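To make the tokenizer description above concrete, here is a hedged sketch using the Hugging Face tokenizers library. The actual training script is the one linked in the hunk; the corpus file names, the sentinel-token naming (<extra_id_i>) and the exact pre-tokenisation settings are assumptions, not the authors' code.

    from tokenizers import Tokenizer, models, pre_tokenizers, trainers

    # BPE tokenizer with a vocabulary of 32,768 (8**5) tokens.
    tokenizer = Tokenizer(models.BPE())

    # "the numbers are separated": split digits before byte-level pre-tokenisation.
    tokenizer.pre_tokenizer = pre_tokenizers.Sequence([
        pre_tokenizers.Digits(individual_digits=True),
        pre_tokenizers.ByteLevel(add_prefix_space=False),
    ])

    # 256 extra (sentinel) tokens, plus the usual pad/eos/unk tokens.
    special_tokens = ["<pad>", "</s>", "<unk>"] + [f"<extra_id_{i}>" for i in range(256)]

    trainer = trainers.BpeTrainer(vocab_size=32768, special_tokens=special_tokens)

    # Train on local text dumps of the corpora (CulturaX and The Stack extracts).
    tokenizer.train(files=["culturax_fr.txt", "the_stack_sample.txt"], trainer=trainer)
    tokenizer.save("tokenizer.json")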
@@ -352,7 +352,7 @@
  <div class="tip"><p>
  We used the original T5 optimiser, <a class="link" href="https://github.com/catie-aq/flashT5/blob/main/src/utils/adamw_scaled.py">AdamWScale</a>.
  For hyperparameter values, we use <code>lr = 5e-3</code>, <code>betas = (0.9, 0.999)</code>, <code>eps = 1e-6</code> and <code>weight_decay = 0.0</code>
- based on the observations of <a class="link" href="https://github.com/PiotrNawrot/nanoT5/issues/25#issuecomment-1922731400">Wilson Wongso</a>.
+ based on the observations of <a class="link" href="https://github.com/PiotrNawrot/nanoT5/issues/25#issuecomment-1922731400">Wilson WONGSO</a>.
  Indeed, it turns out that not all the alternative optimisers tested converged.</p></div>
  <div class="note"><p>We have added the parameter <code>foreach</code> in our version of AdamWScale.</p>
  </div>
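For readers who want to plug these hyperparameters in, here is a hedged sketch. AdamWScale is the class linked in the hunk above; its constructor is assumed here to mirror torch.optim.AdamW, and the fallback uses plain AdamW with the same values, which is not equivalent to the authors' optimiser.

    import torch

    model = torch.nn.Linear(768, 768)  # placeholder for the FAT5 model

    try:
        # Path inside the flashT5 repository (src/utils/adamw_scaled.py).
        from src.utils.adamw_scaled import AdamWScale
        optimizer = AdamWScale(
            model.parameters(),
            lr=5e-3,
            betas=(0.9, 0.999),
            eps=1e-6,
            weight_decay=0.0,
            foreach=True,  # the parameter the note says was added to their version
        )
    except ImportError:
        # Keeps the sketch runnable outside the repository.
        optimizer = torch.optim.AdamW(
            model.parameters(),
            lr=5e-3, betas=(0.9, 0.999), eps=1e-6, weight_decay=0.0, foreach=True,
        )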
@@ -1050,7 +1050,7 @@
  <td>29.8</td>
  </tr>
  <tr>
- <td>distillcamembert(68.1M)</td>
+ <td>distillcamembert (68.1M)</td>
  <td>51.3</td>
  <td>60.7</td>
  <td>37.4</td>
@@ -1095,14 +1095,21 @@
  <thead>
  <tr>
  <th>Cloud provider</th>
- <th>Hourly rate for an A 100</th>
- <th>Price for 262B tokens</th>
- <th>Price for 419B tokens</th>
+ <th>Hourly rate for an A 100 (in €)</th>
+ <th>Price for 262B tokens (in €)</th>
+ <th>Price for 419B tokens (in €)</th>
  <th>Note</th>
  </tr>
  </thead>
  <tbody>
  <tr>
+ <td>Sesterce</td>
+ <td>1.29</td>
+ <td>1,182</td>
+ <td>1,891</td>
+ <td></td>
+ </tr>
+ <tr>
  <td>AWS</td>
  <td>1.77</td>
  <td>1,616</td>
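As a quick sanity check on the added Sesterce row, each price is the hourly rate multiplied by the training time: 1.29 €/h × ≈916 h ≈ 1,182 € for 262B tokens and 1.29 €/h × ≈1,466 h ≈ 1,891 € for 419B tokens, in line with the roughly 1,461 training hours quoted in the introduction and with the updated 'around 1,900 euros' budget in the first hunk.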
@@ -1279,7 +1286,7 @@
  We introduced the FAT5 (Flash Attention T5) model, detailing our approach to optimizing various elements of the pre-training and finetuning processes.
  This is based on kernels that enable Flash Attention to be used with a T5 and give the model a linear memory.
  In particular, we've applied our work to French as a proof of concept, and made sure that it can also be used in any other language.
- We hope that our method, which enables a model with 147M parameters to be pre-trained from scratch for €1,600, will be useful for people with limited computational resources.
+ We hope that our method, which enables a model with 147M parameters to be pre-trained from scratch at a limited cost, will be useful for people with limited computational resources.
  It also opens the way for a possible comeback of encoder-decoder models, rather than only decoder models.<br>
  <p class="width_125"><br><br></p>
 
@@ -1323,7 +1330,7 @@
  const toc = document.querySelector('d-contents');
  if (toc) {
  const headings = article.querySelectorAll('h2, h3, h4');
- let ToC = `<nav role="navigation" class="l-text figcaption" style="color: #9CA3AF;"><h3>Table des matières</h3>`;
+ let ToC = `<nav role="navigation" class="l-text figcaption" style="color: #9CA3AF;"><h3>Table of contents</h3>`;
  let prevLevel = 0;
  for (const el of headings) {
  // should element be included in TOC?
 