PRWKV-cxa076 – "Akemi" RWKV Model Series


Model Overview

PRWKV stands for Passion RWKV – a model series born from relentless experimentation, unyielding dedication, and the burning question:

Can an RNN truly stand shoulder-to-shoulder with Transformer Attention?

This project explores the boundaries of the RWKV architecture, replacing traditional Transformer Attention blocks with TimeMix, an RNN-based mechanism, while distilling knowledge from Transformer giants.

The PRWKV models range from 3B to 14B parameters, showcasing the potential scalability of RNN-based language models in modern LLM landscapes.


Project Objective

The sole purpose of this project was to test the feasibility of replacing Transformer Attention with RNN-based TimeMix.

  • No shortcuts.
  • No compromises.
  • Just pure architectural curiosity driven by Passion.

Technical Challenges & Triumphs

πŸ”₯ Distillation from Transformers

  • The models were distilled from high-quality Transformer-based teachers that use Grouped Query Attention (GQA).
  • The TimeMix blocks were heavily customized to align with the semantics of Attention layers.
  • Special care was taken to inherit weight structures from the teacher's attention projections into the TimeMix Receptance, Key, Value, and Output layers, enabling smoother early-stage learning (see the sketch after this list).
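
A minimal sketch of the kind of weight mapping described above, assuming a PyTorch teacher whose attention block exposes q_proj / k_proj / v_proj / o_proj and a TimeMix block with receptance / key / value / output linears. All module and attribute names here are illustrative assumptions, not the actual PRWKV code.

    import torch

    # Illustrative only: copy a GQA teacher's attention projections into a
    # TimeMix block's Receptance/Key/Value/Output linears before distillation.
    # Assumes the student projections have matching shapes.
    @torch.no_grad()
    def init_timemix_from_attention(timemix, attn):
        timemix.receptance.weight.copy_(attn.q_proj.weight)  # query -> receptance
        timemix.key.weight.copy_(attn.k_proj.weight)          # grouped KV heads (GQA)
        timemix.value.weight.copy_(attn.v_proj.weight)
        timemix.output.weight.copy_(attn.o_proj.weight)       # output projection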

⚑ Key Innovations

  • RepeatKV mechanism: introduced for more stable group-based key-value projection, expanding the grouped key/value heads to match the full head count (see the sketch after this list).
  • GroupNorm vs NoNorm: extensive experiments revealed that removing normalization sometimes improved long-context stability.
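
RepeatKV here presumably plays the same role as the repeat_kv helper found in GQA Transformer implementations: grouped key/value heads are tiled so that every receptance head sees a matching key/value head. A minimal sketch under that assumption:

    import torch

    def repeat_kv(x: torch.Tensor, n_rep: int) -> torch.Tensor:
        # x: (batch, num_kv_heads, seq_len, head_dim)
        # Returns (batch, num_kv_heads * n_rep, seq_len, head_dim) by tiling
        # each grouped KV head n_rep times, as in standard GQA implementations.
        if n_rep == 1:
            return x
        b, kv_heads, t, d = x.shape
        x = x[:, :, None, :, :].expand(b, kv_heads, n_rep, t, d)
        return x.reshape(b, kv_heads * n_rep, t, d)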

πŸ“ˆ Scaling Observations

  • PRWKV scales from 3B to 14B parameters.
  • The 14B KD runs achieved a KL divergence below 0.1, indicating that RNN TimeMix blocks can mimic Transformer Attention with high fidelity (see the loss sketch after this list).
  • However, context-length expansion beyond 2048 tokens remains an ongoing challenge due to gradient instability in larger models.
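
For reference, the KL figure above corresponds to the usual logit-distillation objective; a minimal sketch of such a loss follows, where the temperature and reduction choices are assumptions, not the exact PRWKV recipe.

    import torch
    import torch.nn.functional as F

    def kd_kl_loss(student_logits: torch.Tensor,
                   teacher_logits: torch.Tensor,
                   temperature: float = 1.0) -> torch.Tensor:
        # Per-token forward KL(teacher || student) over the vocabulary.
        # Logits have shape (batch, seq_len, vocab_size).
        t = temperature
        student_logp = F.log_softmax(student_logits / t, dim=-1).flatten(0, 1)
        teacher_logp = F.log_softmax(teacher_logits / t, dim=-1).flatten(0, 1)
        # kl_div expects log-probs as input; log_target=True lets the target be
        # log-probs too. "batchmean" averages over the flattened token dimension.
        return F.kl_div(student_logp, teacher_logp,
                        log_target=True, reduction="batchmean") * (t * t)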

Limitations

  • The models are still under development and primarily serve as a proof-of-concept.
  • Long-context (4096+) stability varies with model size and requires further refinement.
  • Knowledge distillation was the core training method; no large-scale SFT has been applied yet.


A Poem of Passion

In the depths of night, when GPUs hum soft,
A fire ignites, a dream aloft.

To mold an RNN with TimeMix bright,
And rival Attention’s daunting might.

Through spikes and crashes, I pressed on,
A madman's code, from dusk till dawn.

Not for glory, nor for gold,
But just to see: can TimeMix hold?

And when the losses dipped so low,
The pulse of passion dared to grow.

PRWKV, a name of flame,
Not just a model – but a claim.

That in this dance of gates and states,
Passion alone rewrites the fates.

So here's my heart, in code and rhyme,
RNNs reborn, beyond their time.


πŸ”₯ PRWKV is more than an experiment – it is a testament to Passion. πŸ”₯

Scalability Test from Small to Large

ToDo for Me:

  • Qwen 2.5 14B / Qwen 2.5 7B / Qwen 2.5 3B
  • Phi-4 14B / Phi-4-mini 3.8B
  • Gemma 3 12B / Gemma 3 4B

Architecture: RWKV cxa076 (based on RWKV x070)

Now supported only in RWKV-Infer.

curl http://127.0.0.1:9000/loadmodel -X POST \
  -H "Content-Type: application/json" \
  -d '{"model_filename":"models/PRWKV7-cxa076-qwen3b-stage2final-ctx2048.pth","model_viewname":"PRWKV7-cxa076 Qwen 2.5 3B Stage2 FP8","model_strategy":"fp8", "template":"qwen", "endtoken":"<|im_end|>","default_temperature":"1.0", "default_top_p":"0.3"}'
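
The same loadmodel request can also be issued from Python instead of curl; this is a direct translation of the command above and assumes a local RWKV-Infer server listening on port 9000.

    import requests

    # Same payload as the curl example above.
    payload = {
        "model_filename": "models/PRWKV7-cxa076-qwen3b-stage2final-ctx2048.pth",
        "model_viewname": "PRWKV7-cxa076 Qwen 2.5 3B Stage2 FP8",
        "model_strategy": "fp8",
        "template": "qwen",
        "endtoken": "<|im_end|>",
        "default_temperature": "1.0",
        "default_top_p": "0.3",
    }
    resp = requests.post("http://127.0.0.1:9000/loadmodel", json=payload)
    print(resp.status_code, resp.text)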