This repository contains various checkpoints for ablations and other unusual models from the paper RADLADS: Rapid Attention Distillation to Linear Attention Decoders at Scale.
The file numbering is currently off by one relative to the step numbers in the paper: the number in each filename is one higher than the corresponding step. For example, L28-D3584-qwen2-rwkv6-2.pth is the result of step 1 in the paper.
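The off-by-one mapping can be sketched in a few lines of Python. This is only an illustration, not part of the RADLADS code: the helper name `paper_step` is hypothetical, and the regular expression assumes the filename pattern seen in the table below, where the first standalone integer after the model name is the file number.

```python
import re

# Assumption: the file number is the first run of digits that is preceded
# by "-" and followed by "-" or ".", e.g. the "2" in "...-rwkv6-2.pth".
# Tokens such as "250m" or "4k" do not match because a letter follows.
_FILE_NUMBER = re.compile(r"-(\d+)(?=[-.])")


def paper_step(checkpoint_name: str) -> int:
    """Return the paper's step number for a checkpoint filename (hypothetical helper)."""
    match = _FILE_NUMBER.search(checkpoint_name)
    if match is None:
        raise ValueError(f"no file number found in {checkpoint_name!r}")
    # Filenames are numbered one higher than the steps in the paper.
    return int(match.group(1)) - 1


if __name__ == "__main__":
    for name in (
        "L28-D3584-qwen2-rwkv6-2.pth",
        "L28-D3584-qwen2-rwkv6-3-250m.pth",
        "L28-D3584-qwerky7_qwen2-3-4k-ckpt5.pth",
    ):
        print(name, "-> step", paper_step(name))
```

If the filenames are renumbered in the future to match the paper, the subtraction in this sketch would no longer apply.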
checkpoint | step number | teacher | student | description |
---|---|---|---|---|
L28-D3584-qwen2-rwkv6-2.pth | 1 | Qwen2.5-7B-Instruct | RAD-RWKV6 | |
L28-D3584-qwen2-rwkv6-3-250m.pth | 2 | Qwen2.5-7B-Instruct | RAD-RWKV6 | 250m tokens trained |
L28-D3584-qwen2-rwkv6-3.pth | 2 | Qwen2.5-7B-Instruct | RAD-RWKV6 | |
L28-D3584-qwen2-rwkv6-base-2.pth | 1 | Qwen2.5-7B | RAD-RWKV6 | |
L28-D3584-qwen2-rwkv7-2.pth | 1 | Qwen2.5-7B-Instruct | RAD-RWKV7 | |
L28-D3584-qwen2-rwkv7-3-norope-extraw0.pth | 2 | Qwen2.5-7B-Instruct | RAD-RWKV7 | no RoPE used; w0 must be multiplied by 2 due to a code mistake |
L28-D3584-qwen2-rwkv7-3.pth | 2 | Qwen2.5-7B-Instruct | RAD-RWKV7 | |
L28-D3584-qwerky6_qwen2-2.pth | 1 | Qwen2.5-7B-Instruct | RAD-RWKV6 | |
L28-D3584-qwerky6_qwen2-base-3.pth | 2 | Qwen2.5-7B | RAD-RWKV6 | |
L28-D3584-qwerky6_qwen2-groupnorm-2.pth | 1 | Qwen2.5-7B-Instruct | RAD-RWKV6 | ablation study: use groupnorm instead of state balancing |
L28-D3584-qwerky6_qwen2-groupnorm-3.pth | 2 | Qwen2.5-7B-Instruct | RAD-RWKV6 | ablation study: use groupnorm instead of state balancing |
L28-D3584-qwerky6_qwen2-no_gate-2.pth | 1 | Qwen2.5-7B-Instruct | RAD-RWKV6 | ablation study: no gate |
L28-D3584-qwerky6_qwen2-no_gate-3.pth | 2 | Qwen2.5-7B-Instruct | RAD-RWKV6 | ablation study: no gate |
L28-D3584-qwerky6_qwen2-no_tokenshift-2.pth | 1 | Qwen2.5-7B-Instruct | RAD-RWKV6 | ablation study: no tokenshift |
L28-D3584-qwerky6_qwen2-no_tokenshift-3.pth | 2 | Qwen2.5-7B-Instruct | RAD-RWKV6 | ablation study: no tokenshift |
L28-D3584-qwerky6_qwen2-use_rope-2.pth | 1 | Qwen2.5-7B-Instruct | RAD-RWKV6 | ablation study: use rope |
L28-D3584-qwerky6_qwen2-use_rope-3.pth | 2 | Qwen2.5-7B-Instruct | RAD-RWKV6 | ablation study: use rope |
L28-D3584-qwerky7_qwen2-2-4k.pth | 1 | Qwen2.5-7B-Instruct | RAD-RWKV7 | 4k ctxlen training |
L28-D3584-qwerky7_qwen2-3-4k-ckpt5.pth | 2 | Qwen2.5-7B-Instruct | RAD-RWKV7 | 4k ctxlen training, early checkpoint |
L28-D3584-qwerky7_qwen2-3-4k.pth | 2 | Qwen2.5-7B-Instruct | RAD-RWKV7 | 4k ctxlen training |
More information can be found in the GitHub repository: https://github.com/recursal/RADLADS-paper