Koto Small 7B (Pretrained)


Koto-Small-7B-PT is a version of MiMo-7B-Base trained on almost a billion tokens of creative writing data.

Please check out Aurore-Reveil/Koto-Small-7B-IT, the official RP and instruct tune!

Usage

This model is not intended for use outside of raw text completion settings, such as cowriting. Instruct will not work. Multi-turn roleplay will not work.

It was trained at a 32k sequence length, but since not all samples were that long, we expect ~16k of effective context in the best case.

We found that a temperature of 1.25 and a min_p of 0.05 worked best, but YMMV!
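For example, a minimal raw-completion setup with transformers could look like the sketch below. The prompt is just an illustration, and min_p sampling assumes a reasonably recent transformers release:

# Minimal raw text completion sketch -- no chat template, since this is a base model.
# Assumes a transformers version recent enough to support the min_p argument.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "allura-org/Koto-Small-7B-PT"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

prompt = "The lighthouse keeper had not spoken to another soul in three winters, and yet"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

output = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,
    temperature=1.25,  # recommended above
    min_p=0.05,        # recommended above
)
print(tokenizer.decode(output[0], skip_special_tokens=True))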

Datasets

Some of the data used to train this model includes:

  • Most of The Anarchist Library, a repository for anarchist manifestos and writing (see allura-org/the-anarchist-library)
  • A random sample of public domain books from Project Gutenberg
  • Furry (anthro and feral) storytelling and smut
  • A small subset of known high-quality books and story data

Acknowledgements

  • thank you to [unk] for drawing the art used in the model card!
  • thank you very much to mango/deltavector for providing the compute used to train this model
  • thanks to curse for testing and ideas
  • thanks to toasty for some data and ideas
  • thanks to everyone else in allura for moral support

ilya <3

Call for Help

if you would like to help build on this model (instruct/RP SFT, further annealing on higher quality data, etc)...
please join our discord or our matrix! <3

Technical Appendix

Training Notes

This model was trained over the course of ~18 hours on an A100 node. We used the 8-bit AdamW optimizer and a cosine LR scheduler, as well as both gradient clipping and weight decay for regularization. Before training, we converted the original model to the Qwen 2 architecture by removing the MTP (multi-token prediction) weights and custom modelling code and slightly modifying the config.json. This allowed us to use CCE (Cut Cross-Entropy) and Liger, which made the training run much faster than it would have been otherwise.
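For reference, the conversion boils down to something like the sketch below. The key substring ("mtp") and the MTP-specific config field shown are assumptions for illustration, not the exact script we ran:

# Rough sketch of the MiMo -> Qwen 2 conversion described above.
# Assumptions: MTP weights live under keys containing "mtp", and the MTP layer
# count is stored as "num_nextn_predict_layers" -- adjust to the actual checkpoint.
import glob
import json
import os

from safetensors.torch import load_file, save_file

src, dst = "MiMo-7B-Base", "MiMo-7B-Base-Qwenified"
os.makedirs(dst, exist_ok=True)

# Merge all shards, dropping the multi-token-prediction (MTP) head weights.
state = {}
for shard in sorted(glob.glob(f"{src}/*.safetensors")):
    state.update(load_file(shard))
state = {k: v for k, v in state.items() if "mtp" not in k}

# Point the config at stock Qwen 2 instead of the custom modelling code.
with open(f"{src}/config.json") as f:
    cfg = json.load(f)
cfg["architectures"] = ["Qwen2ForCausalLM"]
cfg["model_type"] = "qwen2"
cfg.pop("auto_map", None)                   # drop references to remote code
cfg.pop("num_nextn_predict_layers", None)   # drop MTP-specific field (assumed name)

save_file(state, f"{dst}/model.safetensors", metadata={"format": "pt"})
with open(f"{dst}/config.json", "w") as f:
    json.dump(cfg, f, indent=2)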

We decided to keep the final model in the converted Qwen 2 format, as it is better supported by community software such as EXL2, EXL3, and Aphrodite, and because the original architecture's MTP weights would likely be much less effective after finetuning without them.

WandB

(WandB charts for this training run)

Finetuning Notes

ChatML tokens have already been added to this model by Xiaomi. Please use the ChatML format when finetuning to ensure compatibility with the rest of the ecosystem.
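For reference, a ChatML-formatted sample looks like the snippet below (the message contents are placeholders):

# ChatML layout expected downstream; <|im_start|> and <|im_end|> are already
# registered as special tokens in the tokenizer. Message contents are placeholders.
chatml_sample = (
    "<|im_start|>system\n"
    "You are a creative writing partner.<|im_end|>\n"
    "<|im_start|>user\n"
    "Continue the story from where we left off.<|im_end|>\n"
    "<|im_start|>assistant\n"
    "The rain had not let up for days...<|im_end|>\n"
)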

Axolotl Config

## model
base_model: allura-forge/MiMo-7B-Base-Qwenified
trust_remote_code: true
## qlora COPE!!!
load_in_8bit: false
load_in_4bit: false
strict: false

## data 
datasets:
  - path: estrogen/bookscpt2
    type: completion
    field: text

shuffle_merged_datasets: true
dataset_prepared_path: dataset_prepareds
val_set_size: 0.0
output_dir: ./MiMo-Pretrain

## Liger + CCE
plugins:
  - axolotl.integrations.liger.LigerPlugin
  - axolotl.integrations.cut_cross_entropy.CutCrossEntropyPlugin
liger_rope: true
liger_rms_norm: true
liger_layer_norm: true
liger_glu_activation: true
liger_fused_linear_cross_entropy: false
cut_cross_entropy: true

## CTX settings
sequence_len: 32768
sample_packing: true
eval_sample_packing: false
pad_to_sequence_len: true

## max grad norm
max_grad_norm: 1.0

## WandB
wandb_project: Koto-Small
wandb_entity:
wandb_watch:
wandb_name: MiMo-7b_1e-5_adamw-8bit
wandb_log_model:

## hoe params
gradient_accumulation_steps: 4 # ???
micro_batch_size: 4
num_epochs: 1
lr_scheduler: cosine
learning_rate: 1e-5
optimizer: adamw_bnb_8bit  # Options: "paged_ademamix_8bit", "adamw_bnb_8bit", "paged_adamw_8bit"
deepcompile: true
train_on_inputs: false
group_by_length: false
bf16: auto
fp16:
tf32: false

gradient_checkpointing: offload
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true
s2_attention:

warmup_steps: 50
saves_per_epoch: 2
debug:
deepspeed: ./deepspeed_configs/zero2.json
weight_decay: 0.0025
fsdp:
fsdp_config: