
This repository contains various checkpoints for ablations and other unusual models from the paper RADLADS: Rapid Attention Distillation to Linear Attention Decoders at Scale.

The file numbering is currently off by one from the step numbers shown in the paper: for example, `L28-D3584-qwen2-rwkv6-2.pth` is in fact the result of step 1 in the paper.

| checkpoint | step number | teacher | student | description |
|---|---|---|---|---|
| L28-D3584-qwen2-rwkv6-2.pth | 1 | Qwen2.5-7B-Instruct | RAD-RWKV6 | |
| L28-D3584-qwen2-rwkv6-3-250m.pth | 2 | Qwen2.5-7B-Instruct | RAD-RWKV6 | trained on 250M tokens |
| L28-D3584-qwen2-rwkv6-3.pth | 2 | Qwen2.5-7B-Instruct | RAD-RWKV6 | |
| L28-D3584-qwen2-rwkv6-base-2.pth | 1 | Qwen2.5-7B | RAD-RWKV6 | |
| L28-D3584-qwen2-rwkv7-2.pth | 1 | Qwen2.5-7B-Instruct | RAD-RWKV7 | |
| L28-D3584-qwen2-rwkv7-3-norope-extraw0.pth | 2 | Qwen2.5-7B-Instruct | RAD-RWKV7 | no RoPE used; w0 must be multiplied by 2 due to a code mistake (see the loading sketch below) |
| L28-D3584-qwen2-rwkv7-3.pth | 2 | Qwen2.5-7B-Instruct | RAD-RWKV7 | |
| L28-D3584-qwerky6_qwen2-2.pth | 1 | Qwen2.5-7B-Instruct | RAD-RWKV6 | |
| L28-D3584-qwerky6_qwen2-base-3.pth | 2 | Qwen2.5-7B | RAD-RWKV6 | |
| L28-D3584-qwerky6_qwen2-groupnorm-2.pth | 1 | Qwen2.5-7B-Instruct | RAD-RWKV6 | ablation study: use groupnorm instead of state balancing |
| L28-D3584-qwerky6_qwen2-groupnorm-3.pth | 2 | Qwen2.5-7B-Instruct | RAD-RWKV6 | ablation study: use groupnorm instead of state balancing |
| L28-D3584-qwerky6_qwen2-no_gate-2.pth | 1 | Qwen2.5-7B-Instruct | RAD-RWKV6 | ablation study: no gate |
| L28-D3584-qwerky6_qwen2-no_gate-3.pth | 2 | Qwen2.5-7B-Instruct | RAD-RWKV6 | ablation study: no gate |
| L28-D3584-qwerky6_qwen2-no_tokenshift-2.pth | 1 | Qwen2.5-7B-Instruct | RAD-RWKV6 | ablation study: no token shift |
| L28-D3584-qwerky6_qwen2-no_tokenshift-3.pth | 2 | Qwen2.5-7B-Instruct | RAD-RWKV6 | ablation study: no token shift |
| L28-D3584-qwerky6_qwen2-use_rope-2.pth | 1 | Qwen2.5-7B-Instruct | RAD-RWKV6 | ablation study: use RoPE |
| L28-D3584-qwerky6_qwen2-use_rope-3.pth | 2 | Qwen2.5-7B-Instruct | RAD-RWKV6 | ablation study: use RoPE |
| L28-D3584-qwerky7_qwen2-2-4k.pth | 1 | Qwen2.5-7B-Instruct | RAD-RWKV7 | 4k context length training |
| L28-D3584-qwerky7_qwen2-3-4k-ckpt5.pth | 2 | Qwen2.5-7B-Instruct | RAD-RWKV7 | 4k context length training, early checkpoint |
| L28-D3584-qwerky7_qwen2-3-4k.pth | 2 | Qwen2.5-7B-Instruct | RAD-RWKV7 | 4k context length training |
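
For reference, below is a minimal sketch (not taken from the RADLADS code base) of how one of these checkpoints could be downloaded and its raw state dict loaded with PyTorch, including the ×2 correction for w0 documented above for the `norope-extraw0` checkpoint. The key filter `"w0" in name` is an assumption about the state-dict naming; inspect the checkpoint's keys before relying on it.

```python
# Sketch only: download a checkpoint from this repository and apply the
# documented w0 fix for the "norope-extraw0" file before using it.
import torch
from huggingface_hub import hf_hub_download

REPO_ID = "recursal/radlads-7b-various"
FILENAME = "L28-D3584-qwen2-rwkv7-3-norope-extraw0.pth"

path = hf_hub_download(repo_id=REPO_ID, filename=FILENAME)
state_dict = torch.load(path, map_location="cpu")

# Per the table above, w0 in this checkpoint must be multiplied by 2 due to a
# code mistake during training. Matching keys by the substring "w0" is a guess
# about the parameter naming convention.
for name, tensor in state_dict.items():
    if "w0" in name:
        state_dict[name] = tensor * 2

# The corrected state_dict can then be loaded into the matching RAD-RWKV7
# model definition from the RADLADS repository linked below.
```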

More information can be found at the GitHub repository: https://github.com/recursal/RADLADS-paper
