recursal/radlads-7b-various

This repository contains various checkpoints for ablations and other unusual models from the paper "RADLADS: Rapid Attention Distillation to Linear Attention Decoders at Scale".

The number in each file name is currently one higher than the corresponding step number in the paper. For example, L28-D3584-qwen2-rwkv6-2.pth is the result of step 1 in the paper.
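
The off-by-one rule can be applied mechanically. The following is a minimal sketch, assuming the file number is the only purely numeric token between dashes in the checkpoint name, which holds for the files listed in this repository. The helper name `paper_step` is hypothetical and not part of the RADLADS code.

```python
from pathlib import Path


def paper_step(checkpoint_name: str) -> int:
    """Map a checkpoint file name to the step number used in the paper.

    Assumption: exactly one dash-separated token of the name is purely
    numeric, and that token is the file number (paper step + 1).
    """
    stem = Path(checkpoint_name).stem  # drop the ".pth" extension
    numbers = [int(tok) for tok in stem.split("-") if tok.isdigit()]
    if len(numbers) != 1:
        raise ValueError(f"cannot find a unique file number in {checkpoint_name!r}")
    return numbers[0] - 1  # file numbering is one higher than the paper's


if __name__ == "__main__":
    for name in (
        "L28-D3584-qwen2-rwkv6-2.pth",
        "L28-D3584-qwerky7_qwen2-3-4k-ckpt5.pth",
    ):
        print(name, "-> step", paper_step(name))
```

Tokens such as `qwen2`, `4k`, `250m`, and `ckpt5` are skipped because they are not purely numeric, so only the file number is picked up.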

checkpoint	step number	teacher	student	description
L28-D3584-qwen2-rwkv6-2.pth	1	Qwen2.5-7B-Instruct	RAD-RWKV6
L28-D3584-qwen2-rwkv6-3-250m.pth	2	Qwen2.5-7B-Instruct	RAD-RWKV6	250m tokens trained
L28-D3584-qwen2-rwkv6-3.pth	2	Qwen2.5-7B-Instruct	RAD-RWKV6
L28-D3584-qwen2-rwkv6-base-2.pth	1	Qwen2.5-7B	RAD-RWKV6
L28-D3584-qwen2-rwkv7-2.pth	1	Qwen2.5-7B-Instruct	RAD-RWKV7
L28-D3584-qwen2-rwkv7-3-norope-extraw0.pth	2	Qwen2.5-7B-Instruct	RAD-RWKV7	no rope used; w0 must be multiplied by 2 due to a code mistake
L28-D3584-qwen2-rwkv7-3.pth	2	Qwen2.5-7B-Instruct	RAD-RWKV7
L28-D3584-qwerky6_qwen2-2.pth	1	Qwen2.5-7B-Instruct	RAD-RWKV6
L28-D3584-qwerky6_qwen2-base-3.pth	2	Qwen2.5-7B	RAD-RWKV6
L28-D3584-qwerky6_qwen2-groupnorm-2.pth	1	Qwen2.5-7B-Instruct	RAD-RWKV6	ablation study: use groupnorm instead of state balancing
L28-D3584-qwerky6_qwen2-groupnorm-3.pth	2	Qwen2.5-7B-Instruct	RAD-RWKV6	ablation study: use groupnorm instead of state balancing
L28-D3584-qwerky6_qwen2-no_gate-2.pth	1	Qwen2.5-7B-Instruct	RAD-RWKV6	ablation study: no gate
L28-D3584-qwerky6_qwen2-no_gate-3.pth	2	Qwen2.5-7B-Instruct	RAD-RWKV6	ablation study: no gate
L28-D3584-qwerky6_qwen2-no_tokenshift-2.pth	1	Qwen2.5-7B-Instruct	RAD-RWKV6	ablation study: no tokenshift
L28-D3584-qwerky6_qwen2-no_tokenshift-3.pth	2	Qwen2.5-7B-Instruct	RAD-RWKV6	ablation study: no tokenshift
L28-D3584-qwerky6_qwen2-use_rope-2.pth	1	Qwen2.5-7B-Instruct	RAD-RWKV6	ablation study: use rope
L28-D3584-qwerky6_qwen2-use_rope-3.pth	2	Qwen2.5-7B-Instruct	RAD-RWKV6	ablation study: use rope
L28-D3584-qwerky7_qwen2-2-4k.pth	1	Qwen2.5-7B-Instruct	RAD-RWKV7	4k ctxlen training
L28-D3584-qwerky7_qwen2-3-4k-ckpt5.pth	2	Qwen2.5-7B-Instruct	RAD-RWKV7	4k ctxlen training, early checkpoint
L28-D3584-qwerky7_qwen2-3-4k.pth	2	Qwen2.5-7B-Instruct	RAD-RWKV7	4k ctxlen training
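
The table says that w0 in L28-D3584-qwen2-rwkv7-3-norope-extraw0.pth must be multiplied by 2. If that is read as a correction to apply after loading, it could look like the minimal sketch below. It assumes that the affected parameters are those whose final dotted name component is exactly `w0`. The helper `scale_w0` and the key names in the example are hypothetical; check the actual keys in the checkpoint and the RADLADS code before relying on this.

```python
def scale_w0(state_dict, factor=2.0, param_name="w0"):
    """Return a copy of state_dict with every `w0` entry multiplied by factor.

    Assumption: the parameters to fix are the ones whose last dotted name
    component equals `param_name`. Values only need to support `*`, so this
    works for tensors as well as plain numbers.
    """
    patched = {}
    for name, value in state_dict.items():
        if name.split(".")[-1] == param_name:
            patched[name] = value * factor  # apply the correction
        else:
            patched[name] = value  # leave all other parameters untouched
    return patched


if __name__ == "__main__":
    # Hypothetical key names, with floats standing in for tensors.
    demo = {"blocks.0.att.w0": 1.5, "blocks.0.att.w1": 3.0}
    print(scale_w0(demo))
```

With PyTorch, the input would typically be the dictionary returned by `torch.load(path, map_location="cpu")`, assuming the file stores a plain state dict.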

More information can be found in the GitHub repository: https://github.com/recursal/RADLADS-paper
