Enhance model card: Add metadata, paper/code links, and Transformers usage
This PR enhances the model card for HRWKV7-Reka-Flash3-Preview by:

- Adding `pipeline_tag: text-generation` to the metadata, which ensures the model appears in relevant searches on the Hugging Face Hub and enables the interactive inference widget.
- Adding `library_name: transformers` to the metadata, indicating compatibility with the Hugging Face Transformers library and enabling the "Use in Transformers" widget with associated code snippets.
- Adding relevant `tags` such as `linear-attention`, `reka`, `rwkv`, and `knowledge-distillation`, and specifying `language: ['mul']` to reflect the model's multilingual nature, improving discoverability. A sketch of the combined metadata changes follows this list.
- Introducing a prominent "Paper and Project Details" section at the top, linking directly to the Hugging Face Papers page for *RADLADS: Rapid Attention Distillation to Linear Attention Decoders at Scale* and to the main project's GitHub repository (https://github.com/recursal/RADLADS-paper).
- Including a standard `transformers` code snippet for text generation (sketched below), making it easier for users to get started with the model. The original `RWKV-Infer` usage is retained for completeness.
- Adding the BibTeX citation for the RADLADS paper to ensure proper attribution.
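For reference, a minimal sketch of the resulting metadata block (YAML front matter at the top of the model card `README.md`; exact values should match the actual diff):

```yaml
pipeline_tag: text-generation
library_name: transformers
language:
  - mul
tags:
  - linear-attention
  - reka
  - rwkv
  - knowledge-distillation
```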
These changes collectively make the model card more informative, discoverable, and user-friendly on the Hugging Face Hub.
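The added snippet follows the usual `transformers` text-generation pattern. The sketch below is illustrative only: the repo id and the `trust_remote_code=True` requirement are assumptions, and whether this checkpoint loads through stock Transformers at all is not guaranteed.

```python
# Minimal sketch of the standard Transformers text-generation flow.
# The repo id and trust_remote_code=True are assumptions, not confirmed
# details of this checkpoint.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "OpenMOSE/HRWKV7-Reka-Flash3-Preview"  # assumed repo id

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

prompt = "Explain linear attention in one paragraph."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```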
Thank you for pointing that out.
I'm currently writing HF-compatible inference code.
Thanks! Feel free to remove `library_name: transformers`
and the Transformers code snippet, as those seem wrong for this particular checkpoint, which does not appear to be Transformers-compatible.
I apologize for any misunderstanding.
This model is based on the RADLADS distillation method,
but the training code and model architecture differ:

- RADLADS (v1): modified RWKV v6 (gated linear attention kernel)
- This model: modified RWKV v7 (RWKV kernel) + no-position-embedding (NoPE) GQA hybrid (a toy sketch of the interleaving is shown below)
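Purely as an illustration of what such a hybrid stack can look like (every name and the interleaving ratio below are hypothetical, not taken from the actual training code):

```python
# Toy sketch of the hybrid layer layout described above: mostly
# RWKV-v7-style blocks with an occasional NoPE GQA attention block.
# The function name, the every-8th-layer pattern, and the layer-type
# strings are all illustrative assumptions.
def build_layer_types(n_layers: int, gqa_every: int = 8) -> list[str]:
    return [
        "gqa_nope" if (i + 1) % gqa_every == 0 else "rwkv7"
        for i in range(n_layers)
    ]

print(build_layer_types(16))
# ['rwkv7', 'rwkv7', ..., 'gqa_nope', 'rwkv7', ..., 'gqa_nope']
```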
Please feel free to point out any issues. :)