Koto Small 7B (Pretrained)


Koto-Small-7B-PT is a version of MiMo-7B-Base trained on almost a billion tokens of creative writing data.

Please check out Aurore-Reveil/Koto-Small-7B-IT, the official RP and instruct tune!

Usage

This model is not intended for use outside of raw text completion settings, such as cowriting. Instruct will not work. Multi-turn roleplay will not work.

It was trained at a 32k sequence length, but since not all samples were that long, we expect ~16k of effective context in the best case.

We found that a temperature of 1.25 and a min_p of 0.05 worked best, but YMMV!
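For example, a minimal raw-completion setup with transformers could look like the sketch below. The prompt is just an illustration, and min_p sampling assumes a reasonably recent transformers release:

# Minimal raw text completion sketch -- no chat template, since this is a base model.
# Assumes a transformers version recent enough to support the min_p argument.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "allura-org/Koto-Small-7B-PT"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

prompt = "The lighthouse keeper had not spoken to another soul in three winters, and yet"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

output = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,
    temperature=1.25,  # recommended above
    min_p=0.05,        # recommended above
)
print(tokenizer.decode(output[0], skip_special_tokens=True))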

Datasets

Some of the data used to train this model includes:

  • Most of The Anarchist Library, a repository for anarchist manifestos and writing (see allura-org/the-anarchist-library)
  • A random sample of public domain books from Project Gutenberg
  • Furry (anthro and feral) storytelling and smut
  • A small subset of known high-quality books and story data

Acknowledgements

  • thank you to [unk] for drawing the art used in the model card!
  • thank you very much to mango/deltavector for providing the compute used to train this model
  • thanks to curse for testing and ideas
  • thanks to toasty for some data and ideas
  • thanks to everyone else in allura for moral support

ilya <3

Call for Help

if you would like to help build on this model (instruct/RP SFT, further annealing on higher quality data, etc)...
please join our discord or our matrix! <3

Technical Appendix

Training Notes

This model was trained over the course of ~18 hours on an A100 node. We used the 8-bit AdamW optimizer and a cosine LR scheduler, as well as both gradient clipping and weight decay for regularization. Before training, we converted the original model to the Qwen 2 architecture by removing the MTP (multi-token prediction) weights and custom modelling code and slightly modifying the config.json. This allowed us to use CCE (Cut Cross-Entropy) and Liger, which made the training run much faster than it would have been otherwise.
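For reference, the conversion boils down to something like the sketch below. The key substring ("mtp") and the MTP-specific config field shown are assumptions for illustration, not the exact script we ran:

# Rough sketch of the MiMo -> Qwen 2 conversion described above.
# Assumptions: MTP weights live under keys containing "mtp", and the MTP layer
# count is stored as "num_nextn_predict_layers" -- adjust to the actual checkpoint.
import glob
import json
import os

from safetensors.torch import load_file, save_file

src, dst = "MiMo-7B-Base", "MiMo-7B-Base-Qwenified"
os.makedirs(dst, exist_ok=True)

# Merge all shards, dropping the multi-token-prediction (MTP) head weights.
state = {}
for shard in sorted(glob.glob(f"{src}/*.safetensors")):
    state.update(load_file(shard))
state = {k: v for k, v in state.items() if "mtp" not in k}

# Point the config at stock Qwen 2 instead of the custom modelling code.
with open(f"{src}/config.json") as f:
    cfg = json.load(f)
cfg["architectures"] = ["Qwen2ForCausalLM"]
cfg["model_type"] = "qwen2"
cfg.pop("auto_map", None)                   # drop references to remote code
cfg.pop("num_nextn_predict_layers", None)   # drop MTP-specific field (assumed name)

save_file(state, f"{dst}/model.safetensors", metadata={"format": "pt"})
with open(f"{dst}/config.json", "w") as f:
    json.dump(cfg, f, indent=2)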

We decided to keep the final model in the converted Qwen 2 format, as it is better supported by community software such as EXL2, EXL3, and Aphrodite, and because the original architecture's MTP weights would likely be much less effective after finetuning without them.

WandB

(WandB charts for this training run)

Finetuning Notes

ChatML tokens have already been added to this model by Xiaomi. Please use the ChatML format when finetuning to ensure compatibility with the rest of the ecosystem.
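For reference, a ChatML-formatted sample looks like the snippet below (the message contents are placeholders):

# ChatML layout expected downstream; <|im_start|> and <|im_end|> are already
# registered as special tokens in the tokenizer. Message contents are placeholders.
chatml_sample = (
    "<|im_start|>system\n"
    "You are a creative writing partner.<|im_end|>\n"
    "<|im_start|>user\n"
    "Continue the story from where we left off.<|im_end|>\n"
    "<|im_start|>assistant\n"
    "The rain had not let up for days...<|im_end|>\n"
)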

Axolotl Config

## model
base_model: allura-forge/MiMo-7B-Base-Qwenified
trust_remote_code: true
## qlora COPE!!!
load_in_8bit: false
load_in_4bit: false
strict: false

## data 
datasets:
  - path: estrogen/bookscpt2
    type: completion
    field: text

shuffle_merged_datasets: true
dataset_prepared_path: dataset_prepareds
val_set_size: 0.0
output_dir: ./MiMo-Pretrain

## Liger + CCE
plugins:
  - axolotl.integrations.liger.LigerPlugin
  - axolotl.integrations.cut_cross_entropy.CutCrossEntropyPlugin
liger_rope: true
liger_rms_norm: true
liger_layer_norm: true
liger_glu_activation: true
liger_fused_linear_cross_entropy: false
cut_cross_entropy: true

## CTX settings
sequence_len: 32768
sample_packing: true
eval_sample_packing: false
pad_to_sequence_len: true

## max grad norm
max_grad_norm: 1.0

## WandB
wandb_project: Koto-Small
wandb_entity:
wandb_watch:
wandb_name: MiMo-7b_1e-5_adamw-8bit
wandb_log_model:

## hoe params
gradient_accumulation_steps: 4 # ???
micro_batch_size: 4
num_epochs: 1
lr_scheduler: cosine
learning_rate: 1e-5
optimizer: adamw_bnb_8bit  # Options: "paged_ademamix_8bit", "adamw_bnb_8bit", "paged_adamw_8bit"
deepcompile: true
train_on_inputs: false
group_by_length: false
bf16: auto
fp16:
tf32: false

gradient_checkpointing: offload
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true
s2_attention:

warmup_steps: 50
saves_per_epoch: 2
debug:
deepspeed: ./deepspeed_configs/zero2.json
weight_decay: 0.0025
fsdp:
fsdp_config: