PRWKV-cxa076 "Akemi" RWKV Model Series

Model Overview
PRWKV stands for Passion RWKV, a model series born from relentless experimentation, unyielding dedication, and the burning question:
Can an RNN truly stand shoulder-to-shoulder with Transformer Attention?
This project explores the boundaries of the RWKV architecture, replacing the traditional Transformer Attention blocks with TimeMix, an RNN-based mechanism, while distilling knowledge from Transformer giants.
The PRWKV models range from 3B to 14B parameters, showcasing the potential scalability of RNN-based language models in the modern LLM landscape.
Project Objective
The sole purpose of this project was to test the feasibility of replacing Transformer Attention with RNN-based TimeMix.
- No shortcuts.
- No compromises.
- Just pure architectural curiosity driven by Passion.
Technical Challenges & Triumphs
Distillation from Transformers
- The models were distilled from high-quality Transformer teachers that use Grouped Query Attention (GQA).
- The TimeMix blocks were heavily customized to align with the semantics of Attention layers.
- Special care was taken to initialize the student's Receptance, Key, Value, and Output weights from the teacher's attention projections, enabling smoother early-stage learning (a minimal sketch follows this list).
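As a rough illustration of that weight-inheritance step, the sketch below copies a GQA teacher's attention projections into a TimeMix block. It assumes PyTorch and hypothetical attribute names (q_proj/k_proj/v_proj/o_proj on the teacher, following common Hugging Face naming, and receptance/key/value/output on the student); the actual PRWKV layer names and shapes may differ.

```python
# Minimal sketch of TimeMix weight inheritance from a GQA teacher.
# Module/attribute names are assumptions for illustration only.
import torch
import torch.nn as nn

def inherit_from_gqa_teacher(timemix: nn.Module, teacher_attn: nn.Module) -> None:
    """Copy a GQA teacher's attention projections into a TimeMix block.

    Assumed mapping: Query -> Receptance, Key -> Key, Value -> Value,
    Output -> Output. The corresponding weight shapes must already agree.
    """
    with torch.no_grad():
        timemix.receptance.weight.copy_(teacher_attn.q_proj.weight)
        timemix.key.weight.copy_(teacher_attn.k_proj.weight)
        timemix.value.weight.copy_(teacher_attn.v_proj.weight)
        timemix.output.weight.copy_(teacher_attn.o_proj.weight)
```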
Key Innovations
- RepeatKV mechanism: introduced for more stable group-based key-value projection (see the sketch after this list).
- GroupNorm vs. NoNorm: extensive experiments revealed that removing normalization sometimes improved long-context stability.
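The RepeatKV idea can be sketched as follows: each grouped key/value head is repeated so that the number of KV heads matches the larger number of receptance (query) heads. This is a minimal sketch assuming PyTorch tensors shaped (batch, n_kv_heads, seq_len, head_dim); it is not the project's actual implementation.

```python
# Sketch of RepeatKV: expand grouped KV heads to match the receptance heads.
import torch

def repeat_kv(x: torch.Tensor, n_rep: int) -> torch.Tensor:
    """Repeat each key/value head n_rep times along the head dimension."""
    if n_rep == 1:
        return x
    b, n_kv, t, d = x.shape
    x = x[:, :, None, :, :].expand(b, n_kv, n_rep, t, d)
    return x.reshape(b, n_kv * n_rep, t, d)

# Example: k = repeat_kv(k, n_receptance_heads // n_kv_heads)
```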
Scaling Observations
- PRWKV scales from 3B to 14B parameters.
- The 14B knowledge-distillation runs achieved a KL divergence below 0.1, indicating that RNN TimeMix blocks can mimic Transformer Attention with high fidelity (the distillation objective is sketched below).
- However, context expansion beyond 2048 remains an ongoing challenge due to gradient instability in larger models.
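For context, a common form of the distillation objective behind such KL figures is the token-level KL divergence between the teacher's and the student's next-token distributions. The sketch below assumes PyTorch; the temperature and reduction choices are illustrative, not necessarily the settings used for PRWKV.

```python
# Sketch of a token-level KD objective: KL(teacher || student).
import torch
import torch.nn.functional as F

def kd_kl_loss(student_logits: torch.Tensor,
               teacher_logits: torch.Tensor,
               temperature: float = 1.0) -> torch.Tensor:
    """KL(teacher || student) over the vocabulary, averaged over all tokens."""
    vocab = student_logits.size(-1)
    s = F.log_softmax(student_logits.reshape(-1, vocab) / temperature, dim=-1)
    t = F.log_softmax(teacher_logits.reshape(-1, vocab) / temperature, dim=-1)
    return F.kl_div(s, t, log_target=True, reduction="batchmean") * temperature ** 2

# Example: loss = kd_kl_loss(student(tokens), teacher(tokens))
```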
Limitations
- The models are still under development and primarily serve as a proof of concept.
- Long-context (4096+) stability varies with model size and requires further refinement.
- Knowledge distillation was the core training method; no large-scale SFT has been applied yet.
A Poem of Passion
In the depths of night, when GPUs hum soft,
A fire ignites, a dream aloft.
To mold an RNN with TimeMix bright,
And rival Attention's daunting might.
Through spikes and crashes, I pressed on,
A madman's code, from dusk till dawn.
Not for glory, nor for gold,
But just to see: can TimeMix hold?
And when the losses dipped so low,
The pulse of passion dared to grow.
PRWKV, a name of flame,
Not just a model, but a claim.
That in this dance of gates and states,
Passion alone rewrites the fates.
So here's my heart, in code and rhyme,
RNNs reborn, beyond their time.
PRWKV is more than an experiment; it is a testament to Passion.
Scalability test from small to large
ToDo for me:
- Qwen 2.5 14B, Qwen 2.5 7B, Qwen 2.5 3B
- Phi-4 14B, Phi-4-mini 3.8B
- Gemma 3 12B, Gemma 3 4B
Architecture: RWKV cxa076 (based on RWKV x070)
Currently supported only in RWKV-Infer.
```sh
curl http://127.0.0.1:9000/loadmodel -X POST \
  -H "Content-Type: application/json" \
  -d '{"model_filename":"models/PRWKV7-cxa076-qwen3b-stage2final-ctx2048.pth","model_viewname":"PRWKV7-cxa076 Qwen 2.5 3B Stage2 FP8","model_strategy":"fp8", "template":"qwen", "endtoken":"<|im_end|>","default_temperature":"1.0", "default_top_p":"0.3"}'
```
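For reference, the same loadmodel request can be issued from Python with the requests library. This is a direct translation of the curl command above, assuming the RWKV-Infer server is already running on 127.0.0.1:9000.

```python
# Load the PRWKV model into a running RWKV-Infer server (same payload as the curl example).
import requests

payload = {
    "model_filename": "models/PRWKV7-cxa076-qwen3b-stage2final-ctx2048.pth",
    "model_viewname": "PRWKV7-cxa076 Qwen 2.5 3B Stage2 FP8",
    "model_strategy": "fp8",
    "template": "qwen",
    "endtoken": "<|im_end|>",
    "default_temperature": "1.0",
    "default_top_p": "0.3",
}
response = requests.post("http://127.0.0.1:9000/loadmodel", json=payload)
print(response.status_code, response.text)
```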