Update README.md
README.md
# GLA 1.3B-100B
This repository contains the `gla-1.3B-100B` model, a 1.3B-parameter variant trained on 100B tokens, which was presented in the paper [Gated Linear Attention Transformers with Hardware-Efficient Training](https://huggingface.co/papers/2312.06635).

Transformers face quadratic complexity and memory issues with long sequences, which has prompted the adoption of linear attention mechanisms. However, linear models often suffer from limited recall performance, leading to hybrid architectures that combine linear and full attention layers. The paper systematically evaluates linear attention models across generations, from vector recurrences to advanced gating mechanisms, both standalone and hybridized. The `gla-1.3B-100B` model is one of 72 models trained and open-sourced to enable this comprehensive analysis. The research finds that superior standalone linear models do not necessarily excel in hybrids and identifies selective gating, hierarchical recurrence, and controlled forgetting as critical for effective hybrid models. Architectures such as HGRN-2 or GatedDeltaNet with a linear-to-full attention ratio between 3:1 and 6:1 are recommended for achieving Transformer-level recall efficiently.
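As a purely illustrative sketch (not code from the paper or this repository), the snippet below shows how a linear-to-full attention ratio of this kind could translate into a per-layer layout; the helper name and the 24-layer, 5:1 configuration are hypothetical.

```python
# Illustrative only: maps a linear-to-full attention ratio onto a layer layout.
# The helper name and the example configuration are hypothetical, not from the paper.
def hybrid_layer_layout(num_layers: int, linear_per_full: int) -> list[str]:
    """Interleave one full-attention layer after every `linear_per_full` linear layers."""
    layout = []
    for i in range(num_layers):
        if (i + 1) % (linear_per_full + 1) == 0:
            layout.append("full_attention")
        else:
            layout.append("linear_attention")
    return layout

# A 24-layer model at a 5:1 linear-to-full ratio -> 20 linear and 4 full attention layers.
print(hybrid_layer_layout(24, 5))
```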
## Usage
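The repository's full usage snippet is not shown in this excerpt. As a rough sketch of how such a checkpoint is typically loaded, the example below assumes the `fla` package (flash-linear-attention) is installed to register the GLA architecture with `transformers`, and it uses a placeholder repository id; the official snippet may differ.

```python
# Hedged sketch, not the repository's official example. Assumes:
#   pip install transformers flash-linear-attention
# and that the repo id below is a placeholder for the actual hub path.
import fla  # noqa: F401 -- assumed requirement: importing fla registers the GLA model classes
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "fla-hub/gla-1.3B-100B"  # placeholder; replace with the real repo id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model = model.cuda().eval()

prompt = "Gated linear attention is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)

generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
```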
If you find this work useful, please consider citing the original paper:

[Gated Linear Attention Transformers with Hardware-Efficient Training](https://huggingface.co/papers/2312.06635)
```bibtex
@article{yang2023gated,
  title={Gated Linear Attention Transformers with Hardware-Efficient Training},
  author={Yang, Songlin and Wang, Bailin and Shen, Yikang and Panda, Rameswar and Kim, Yoon},
  journal={arXiv preprint arXiv:2312.06635},
  year={2023}
}
```
The official codebase for the models and research, including training scripts and other checkpoints, can be found on GitHub:

[https://github.com/fla-org/flash-linear-attention](https://github.com/fla-org/flash-linear-attention)