Update README.md
README.md
CHANGED
@@ -5,7 +5,7 @@ Test network using [Tensor Product Attention](https://arxiv.org/abs/2501.06425).
 - `test_train.py` runs with the exact configurations used to train this model and is the reproduction script. Data is assumed to be in JSONL format with `"text":"example text", "text":"..."`
 
 # Notes:
-
+One of the primary reported benefits of TPA is at inference time, which this run does not really leverage, although you can probably fit a larger batch size than with traditional MHA/GQA. TPA did save about 5% on parameters, and that saving should grow as the network size increases. The run time is very similar to MHA/GQA at this scale.
 
 # Training Metrics
 
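The README line above says the training data is JSONL with a `"text"` field. A minimal sketch of reading such a file follows, assuming one JSON object per line, each with a single `"text"` key. The helper name `iter_texts` is hypothetical and is not taken from `test_train.py`, whose actual loader may differ.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_texts(path: str) -> Iterator[str]:
    """Yield the "text" field from each non-empty line of a JSONL file.

    Assumption: every non-empty line is a JSON object such as
    {"text": "example text"}.
    """
    with Path(path).open(encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            record = json.loads(line)
            if "text" not in record:
                raise KeyError(f"line {line_no}: missing 'text' field")
            yield record["text"]
```

Running `list(iter_texts("train.jsonl"))` on a file shaped this way (the file name is only an example) gives the raw strings in file order, which is a quick check that a dataset matches the assumed layout before starting a run.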