PRWKV-cxa076 "Akemi" RWKV Model Series

Model Overview
PRWKV stands for Passion RWKV, a model series born from relentless experimentation, unyielding dedication, and the burning question:
Can an RNN truly stand shoulder-to-shoulder with Transformer Attention?
This project explores the boundaries of the RWKV architecture, replacing the traditional Transformer Attention blocks with TimeMix, an RNN-based mechanism, while distilling knowledge from Transformer giants.
The PRWKV models range from 3B to 14B parameters, showcasing the potential scalability of RNN-based language models in the modern LLM landscape.
Project Objective
The sole purpose of this project was to test the feasibility of replacing Transformer Attention with RNN-based TimeMix.
- No shortcuts.
- No compromises.
- Just pure architectural curiosity driven by Passion.
Technical Challenges & Triumphs
Distillation from Transformers
- The models were distilled from high-quality Transformer teachers that use Grouped Query Attention (GQA).
- The TimeMix blocks were heavily customized to align with the semantics of Attention layers.
- Special care was taken to initialize the student's Receptance, Key, Value, and Output weights from the teacher's attention projections, enabling smoother early-stage learning (a minimal sketch follows this list).
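As a rough illustration of that weight-inheritance step, the sketch below copies a GQA teacher's attention projections into a TimeMix block. It assumes PyTorch and hypothetical attribute names (q_proj/k_proj/v_proj/o_proj on the teacher, following common Hugging Face naming, and receptance/key/value/output on the student); the actual PRWKV layer names and shapes may differ.

```python
# Minimal sketch of TimeMix weight inheritance from a GQA teacher.
# Module/attribute names are assumptions for illustration only.
import torch
import torch.nn as nn

def inherit_from_gqa_teacher(timemix: nn.Module, teacher_attn: nn.Module) -> None:
    """Copy a GQA teacher's attention projections into a TimeMix block.

    Assumed mapping: Query -> Receptance, Key -> Key, Value -> Value,
    Output -> Output. The corresponding weight shapes must already agree.
    """
    with torch.no_grad():
        timemix.receptance.weight.copy_(teacher_attn.q_proj.weight)
        timemix.key.weight.copy_(teacher_attn.k_proj.weight)
        timemix.value.weight.copy_(teacher_attn.v_proj.weight)
        timemix.output.weight.copy_(teacher_attn.o_proj.weight)
```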
Key Innovations
- RepeatKV mechanism: introduced for more stable group-based key-value projection (see the sketch after this list).
- GroupNorm vs. NoNorm: extensive experiments revealed that removing normalization sometimes improved long-context stability.
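The RepeatKV idea can be sketched as follows: each grouped key/value head is repeated so that the number of KV heads matches the larger number of receptance (query) heads. This is a minimal sketch assuming PyTorch tensors shaped (batch, n_kv_heads, seq_len, head_dim); it is not the project's actual implementation.

```python
# Sketch of RepeatKV: expand grouped KV heads to match the receptance heads.
import torch

def repeat_kv(x: torch.Tensor, n_rep: int) -> torch.Tensor:
    """Repeat each key/value head n_rep times along the head dimension."""
    if n_rep == 1:
        return x
    b, n_kv, t, d = x.shape
    x = x[:, :, None, :, :].expand(b, n_kv, n_rep, t, d)
    return x.reshape(b, n_kv * n_rep, t, d)

# Example: k = repeat_kv(k, n_receptance_heads // n_kv_heads)
```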
Scaling Observations
- PRWKV scales from 3B to 14B parameters.
- The 14B knowledge-distillation runs achieved a KL divergence below 0.1, indicating that RNN TimeMix blocks can mimic Transformer Attention with high fidelity (the distillation objective is sketched below).
- However, context expansion beyond 2048 remains an ongoing challenge due to gradient instability in larger models.
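For context, a common form of the distillation objective behind such KL figures is the token-level KL divergence between the teacher's and the student's next-token distributions. The sketch below assumes PyTorch; the temperature and reduction choices are illustrative, not necessarily the settings used for PRWKV.

```python
# Sketch of a token-level KD objective: KL(teacher || student).
import torch
import torch.nn.functional as F

def kd_kl_loss(student_logits: torch.Tensor,
               teacher_logits: torch.Tensor,
               temperature: float = 1.0) -> torch.Tensor:
    """KL(teacher || student) over the vocabulary, averaged over all tokens."""
    vocab = student_logits.size(-1)
    s = F.log_softmax(student_logits.reshape(-1, vocab) / temperature, dim=-1)
    t = F.log_softmax(teacher_logits.reshape(-1, vocab) / temperature, dim=-1)
    return F.kl_div(s, t, log_target=True, reduction="batchmean") * temperature ** 2

# Example: loss = kd_kl_loss(student(tokens), teacher(tokens))
```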
Limitations
- The models are still under development and primarily serve as a proof of concept.
- Long-context (4096+) stability varies with model size and requires further refinement.
- Knowledge distillation was the core training method; no large-scale SFT has been applied yet.
A Poem of Passion
In the depths of night, when GPUs hum soft,
A fire ignites, a dream aloft.
To mold an RNN with TimeMix bright,
And rival Attention's daunting might.
Through spikes and crashes, I pressed on,
A madman's code, from dusk till dawn.
Not for glory, nor for gold,
But just to see: can TimeMix hold?
And when the losses dipped so low,
The pulse of passion dared to grow.
PRWKV, a name of flame,
Not just a model, but a claim.
That in this dance of gates and states,
Passion alone rewrites the fates.
So here's my heart, in code and rhyme,
RNNs reborn, beyond their time.
PRWKV is more than an experiment; it is a testament to Passion.
Scalability test from small to large
ToDo for me:
- Qwen 2.5 14B, Qwen 2.5 7B, Qwen 2.5 3B
- Phi-4 14B, Phi-4-mini 3.8B
- Gemma 3 12B, Gemma 3 4B
Architecture: RWKV cxa076 (based on RWKV x070)
Currently supported only in RWKV-Infer.
```sh
curl http://127.0.0.1:9000/loadmodel -X POST \
  -H "Content-Type: application/json" \
  -d '{"model_filename":"models/PRWKV7-cxa076-qwen3b-stage2final-ctx2048.pth","model_viewname":"PRWKV7-cxa076 Qwen 2.5 3B Stage2 FP8","model_strategy":"fp8", "template":"qwen", "endtoken":"<|im_end|>","default_temperature":"1.0", "default_top_p":"0.3"}'
```
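For reference, the same loadmodel request can be issued from Python with the requests library. This is a direct translation of the curl command above, assuming the RWKV-Infer server is already running on 127.0.0.1:9000.

```python
# Load the PRWKV model into a running RWKV-Infer server (same payload as the curl example).
import requests

payload = {
    "model_filename": "models/PRWKV7-cxa076-qwen3b-stage2final-ctx2048.pth",
    "model_viewname": "PRWKV7-cxa076 Qwen 2.5 3B Stage2 FP8",
    "model_strategy": "fp8",
    "template": "qwen",
    "endtoken": "<|im_end|>",
    "default_temperature": "1.0",
    "default_top_p": "0.3",
}
response = requests.post("http://127.0.0.1:9000/loadmodel", json=payload)
print(response.status_code, response.text)
```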