Update dist/index.html
dist/index.html CHANGED (+17 -10)
@@ -58,7 +58,7 @@
 
 That's why we've decided to focus on the T5 <d-cite bibtex-key="JMLR:v21:20-074"></d-cite>.<br><br>
 
-This article presents the optimisations we have implemented to efficiently pre-train a T5 in French with 147M parameters in a reasonable time (1,461 H for 419B tokens) and with limited resources (1 single A100; i.e. a computing budget of around
+This article presents the optimisations we have implemented to efficiently pre-train a T5 in French with 147M parameters in a reasonable time (1,461 H for 419B tokens) and with limited resources (1 single A100; i.e. a computing budget of around 1,900 euros).
 To achieve this, we designed CUDA/Triton kernels to make Flash Attention compatible with T5 and provide linear inference, thus extending the context size that can be taken into account by the model.<br><br>
 <strong>The pre-training code is available in our <a class="link" href="https://github.com/catie-aq/flashT5">GitHub repository</a> under Apache-2.0 license and weights on our <a class="link" href="https://hf.co/CATIE-AQ">Hugging Face</a> account.</strong>
 <p class="width_125"><br><br><br></p>
@@ -159,7 +159,7 @@
 <ul>
 <li><p class="width_125"><u>Several processes can be used to process data in parallel</u><br>
 For example, the parameter <code>num_workers</code> of the <code>Dataloader</code> of PyTorch <d-cite bibtex-key="paszke2019pytorch"></d-cite>.</p></li>
-<div class="tip"><p>You can find in our code the values we use for this parameter for our FAT5
+<div class="tip"><p>You can find in our code the values we use for this parameter for our FAT5 <a class="link" href="https://github.com/catie-aq/flashT5/blob/dfe10d498ae0b39082182f807acb509e91992360/configs/fr/fat5-fr-small.yaml#L42">small</a>.</div>
 </ul>
 <ul>
 <li><p class="width_125"><u>The bottleneck can also come from the <code>DataCollator</code></u><br>
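The tip above points to the value used for the FAT5 small config. As a minimal, illustrative sketch (not taken from the flashT5 repository), this is how `num_workers` is typically set on a PyTorch `DataLoader`; the dataset, batch size and worker count below are placeholders:

```python
# Minimal sketch of parallel data loading with PyTorch.
# Dataset, batch size and worker count are placeholders, not the FAT5 settings.
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.arange(10_000).unsqueeze(1))

loader = DataLoader(
    dataset,
    batch_size=64,
    shuffle=True,
    num_workers=4,           # subprocesses preparing batches in parallel
    pin_memory=True,         # faster host-to-GPU copies
    persistent_workers=True, # keep workers alive between epochs
)

for (batch,) in loader:
    pass  # training step would go here
```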
@@ -338,7 +338,7 @@
 <div class="tip"><p>With this in mind, we trained a tokenizer of size 32 768 (8**5),
 following <a class="link" href="https://twitter.com/karpathy/status/1621578354024677377">this observation by KARPATHY</a>.
 This is a BPE tokenizer <d-cite bibtex-key="sennrich2016neuralmachinetranslationrare"></d-cite> trained on CulturaX and The Stack, using 256 extra_tokens and the numbers are separated.<br>
-Readers can find the code used <a class="link" href=https://github.com/catie-aq/flashT5/blob/main/examples/fat5-fr/train_tokenizer.py">here</a>.
+Readers can find the code used <a class="link" href="https://github.com/catie-aq/flashT5/blob/main/examples/fat5-fr/train_tokenizer.py">here</a>.
 </p></div>
 <p><br></p>
 
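For readers who do not want to open the linked script, the sketch below shows what training such a tokenizer can look like with the Hugging Face `tokenizers` library. The special-token names, corpus file and pre-tokenizer choices are assumptions for illustration only; the authoritative version is the linked `train_tokenizer.py`:

```python
# Illustrative sketch only: a BPE tokenizer with a 32,768-token vocabulary,
# digits split individually, and 256 extra sentinel tokens.
# The real FAT5 script is examples/fat5-fr/train_tokenizer.py in the repo.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = pre_tokenizers.Sequence([
    pre_tokenizers.Digits(individual_digits=True),  # "2024" -> "2", "0", "2", "4"
    pre_tokenizers.ByteLevel(add_prefix_space=False),
])

# Hypothetical special tokens; the exact names used by FAT5 may differ.
special_tokens = ["<unk>", "<pad>", "</s>"] + [f"<extra_id_{i}>" for i in range(256)]
trainer = trainers.BpeTrainer(vocab_size=32_768, special_tokens=special_tokens)

# corpus.txt stands in for text dumps of the training corpora (CulturaX, The Stack).
tokenizer.train(files=["corpus.txt"], trainer=trainer)
tokenizer.save("tokenizer.json")
```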
@@ -352,7 +352,7 @@
 <div class="tip"><p>
 We used the original T5 optimiser, <a class="link" href="https://github.com/catie-aq/flashT5/blob/main/src/utils/adamw_scaled.py">AdamWScale</a>.
 For hyperparameter values, we use <code>lr = 5e-3</code>, <code>betas = (0.9, 0.999)</code>, <code>eps = 1e-6</code> et <code>weight_decay = 0.0</code>
-based on the observations of <a class="link" href="https://github.com/PiotrNawrot/nanoT5/issues/25#issuecomment-1922731400">Wilson
+based on the observations of <a class="link" href="https://github.com/PiotrNawrot/nanoT5/issues/25#issuecomment-1922731400">Wilson WONGSO</a>.
 Indeed, it turns out that not all the alternative optimisers tested converged.</p></div>
 <div class="note"><p>We have added the parameter <code>foreach</code> in our version of AdamWScale.</p>
 </div>
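As a quick illustration of the hyperparameters quoted in this hunk, the snippet below instantiates a standard `torch.optim.AdamW` as a stand-in; the actual training uses the AdamWScale class linked above, and `foreach=True` mirrors the parameter mentioned in the note:

```python
# Sketch of the optimiser settings quoted above, using torch.optim.AdamW as a
# stand-in for the project's AdamWScale (see the link in the tip).
import torch

model = torch.nn.Linear(512, 512)  # placeholder model

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=5e-3,
    betas=(0.9, 0.999),
    eps=1e-6,
    weight_decay=0.0,
    foreach=True,  # multi-tensor implementation of the update loop
)
```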
@@ -1050,7 +1050,7 @@
 <td>29.8</td>
 </tr>
 <tr>
-<td>distillcamembert(68.1M)</td>
+<td>distillcamembert (68.1M)</td>
 <td>51.3</td>
 <td>60.7</td>
 <td>37.4</td>
@@ -1095,14 +1095,21 @@
 <thead>
 <tr>
 <th>Cloud provider</th>
-<th>Hourly rate for an A 100</th>
-<th>Price for 262B tokens</th>
-<th>Price for 419B tokens</th>
+<th>Hourly rate for an A 100 (in €)</th>
+<th>Price for 262B tokens (in €)</th>
+<th>Price for 419B tokens (in €)</th>
 <th>Note</th>
 </tr>
 </thead>
 <tbody>
 <tr>
+<td>Sesterce</td>
+<td>1.29</td>
+<td>1,182</td>
+<td>1,891</td>
+<td></td>
+</tr>
+<tr>
 <td>AWS</td>
 <td>1.77</td>
 <td>1,616</td>
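The prices in this table are roughly the hourly rate multiplied by the training time, taking the 1,461 h for 419B tokens quoted in the introduction and scaling linearly for 262B tokens. The sketch below shows that arithmetic as an illustration only (this is our reading, not necessarily the exact methodology behind the table; the small gaps of a few euros come from rounding of the hour counts):

```python
# Rough reconstruction of the table: price ≈ hourly rate × training hours.
# Hours are scaled linearly from the 1,461 h needed for 419B tokens.
HOURS_419B = 1461
hours_262b = HOURS_419B * 262 / 419  # ≈ 914 h

for provider, rate_eur_per_h in [("Sesterce", 1.29), ("AWS", 1.77)]:
    cost_262b = rate_eur_per_h * hours_262b
    cost_419b = rate_eur_per_h * HOURS_419B
    print(f"{provider}: ~{cost_262b:.0f} € for 262B tokens, ~{cost_419b:.0f} € for 419B")
# Sesterce: ~1178 € / ~1885 €   AWS: ~1617 € / ~2586 €
```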
@@ -1279,7 +1286,7 @@
 We introduced the FAT5 (Flash Attention T5) model, detailing our approach to optimizing various elements of the pre-training and finetuning processes.
 This is based on kernels that enable Flash Attention to be used with a T5 and give the model a linear memory.
 In particular, we've applied our work to French as a proof of concept, and made sure that it can also be used in any other language.
-We hope that our method, which enables a model with 147M parameters to be pre-trained from scratch
+We hope that our method, which enables a model with 147M parameters to be pre-trained from scratch at a limited cost, will be useful for people with limited computational resources.
 It also opens the way for a possible comeback of encoder-decoder models, rather than only decoder models.<br>
 <p class="width_125"><br><br></p>
 
@@ -1323,7 +1330,7 @@
 const toc = document.querySelector('d-contents');
 if (toc) {
 const headings = article.querySelectorAll('h2, h3, h4');
-let ToC = `<nav role="navigation" class="l-text figcaption" style="color: #9CA3AF;"><h3>Table
+let ToC = `<nav role="navigation" class="l-text figcaption" style="color: #9CA3AF;"><h3>Table of contents</h3>`;
 let prevLevel = 0;
 for (const el of headings) {
 // should element be included in TOC?