Koto Small 7B (Instruct-Tuned)
Koto-Small-7B-IT is an instruct-tuned version of Koto-Small-7B-PT, which was trained on MiMo-7B-Base for almost a billion tokens of creative-writing data. This model is intended for roleplay and instruct use cases.
Usage
Chat template
The model was trained with ChatML formatting. A typical input looks like this:
<|im_start|>system
system prompt<|im_end|>
<|im_start|>user
Hi there!<|im_end|>
<|im_start|>assistant
Nice to meet you!<|im_end|>
<|im_start|>user
Can I ask a question?<|im_end|>
<|im_start|>assistant
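Below is a minimal sketch of building this ChatML prompt with the Hugging Face chat-template API. The repo id "allura-forge/Koto-Small-7B-IT" is a placeholder assumption; substitute the actual model path.
from transformers import AutoTokenizer

# Placeholder repo id for illustration only.
model_id = "allura-forge/Koto-Small-7B-IT"
tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [
    {"role": "system", "content": "system prompt"},
    {"role": "user", "content": "Hi there!"},
    {"role": "assistant", "content": "Nice to meet you!"},
    {"role": "user", "content": "Can I ask a question?"},
]

# Renders the ChatML conversation above and appends the trailing
# <|im_start|>assistant header so the model continues from there.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)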
Samplers
We found that a temperature of 1.25 and a min_p of 0.05 worked best, but YMMV!
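As a rough sketch, those settings map onto transformers generation like this (the repo id is again a placeholder, and min_p sampling requires a recent transformers release):
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "allura-forge/Koto-Small-7B-IT"  # placeholder repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "Hi there!"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Recommended samplers: temperature 1.25, min_p 0.05.
output = model.generate(
    inputs,
    do_sample=True,
    temperature=1.25,
    min_p=0.05,
    max_new_tokens=256,
)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))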
Datasets
datasets:
- path: Delta-Vector/Hydrus-General-Reasoning
- path: Delta-Vector/Hydrus-IF-Mix-Ai2
- path: Delta-Vector/Hydrus-Army-Inst
- path: Delta-Vector/Hydrus-AM-thinking-Science
- path: Delta-Vector/Hydrus-AM-Thinking-Code-Filtered
- path: Delta-Vector/Hydrus-AM-Thinking-IF-No-Think
- path: Delta-Vector/Hydrus-Tulu-SFT-Mix-V2
- path: Delta-Vector/Hydrus-System-Chat-2.0
- path: Delta-Vector/Orion-Praxis-Co-Writer
- path: Delta-Vector/Orion-Co-Writer-51K
- path: Delta-Vector/Orion-Creative_Writing-Complexity
- path: Delta-Vector/Orion-vanilla-backrooms-claude-sharegpt
- path: Delta-Vector/Hydrus-AM-Thinking-Multi-Turn
- path: PocketDoc/Dans-Failuremaxx-Adventure
- path: PocketDoc/Dans-Logicmaxx-SAT-AP
- path: PocketDoc/Dans-MemoryCore-CoreCurriculum-Small
- path: PocketDoc/Dans-Taskmaxx-DataPrepper
- path: PocketDoc/Dans-Prosemaxx-Instructwriter-Long
- path: PocketDoc/Dans-Prosemaxx-InstructWriter-ZeroShot-2
- path: PocketDoc/Dans-Prosemaxx-InstructWriter-ZeroShot-3
- path: PocketDoc/Dans-Prosemaxx-InstructWriter-Continue-2
- path: PocketDoc/Dans-Systemmaxx
Acknowledgements
- Thank you very much to Delta-Vector/Mango for providing the compute used to train this model.
- Fizz for the pretrain.
- Pocketdoc/Anthracite for da cool datasets.
- Hensen chat.
- Thank you to the illustrator of WataNare for drawing the art used in the model card!
- Thanks to Curse for testing and ideas.
- Thanks to Toasty for some data and ideas.
- Thanks to everyone else in allura!
ilya <3
Call for Help
If you would like to help build on this model (RP SFT, further annealing on higher-quality data, etc.)...
Please join the allura Discord or Matrix! <3
Technical Appendix
Training Notes
As before, the model was trained over the course of 12 hours for 2 epochs on an 8xA100 DGX node, using the AdEMAMix optimizer with a REX LR scheduler. Aggressive gradient clipping was used for regularization, with no weight decay (because it sucks).
WandB
Axolotl Config
# =============================================================================
# Model + Saving
# =============================================================================
base_model: allura-forge/Koto-Small-7b-rc1
output_dir: ./koto-sft
saves_per_epoch: 2
deepcompile: true
# =============================================================================
# DATASET CONFIGURATION
# =============================================================================
datasets:
- path: /home/Ubuntu/Mango/pretok/test-koto-sft-7b-rc-1.parquet
ds_type: parquet
type:
shuffle_merged_datasets: true
dataset_prepared_path: ./dataset_prepared
train_on_inputs: false
# =============================================================================
# EVALUATION SETTINGS
# =============================================================================
#evals_per_epoch: 4
#eval_table_size:
#eval_max_new_tokens: 128
#eval_sample_packing: false
val_set_size: 0.0
# =============================================================================
# MEMORY OPTIMIZATION
# =============================================================================
plugins:
- axolotl.integrations.liger.LigerPlugin
- axolotl.integrations.cut_cross_entropy.CutCrossEntropyPlugin
liger_rope: true
liger_rms_norm: true
liger_layer_norm: true
liger_glu_activation: true
liger_fused_linear_cross_entropy: false
cut_cross_entropy: true
sample_packing: true
pad_to_sequence_len: true
gradient_checkpointing: true
flash_attention: true
# =============================================================================
# MULTI-GPU TRAINING
# =============================================================================
deepspeed: ./deepspeed_configs/zero2.json
# =============================================================================
# LOGGING & MONITORING
# =============================================================================
wandb_project: Koto-Small
wandb_entity:
wandb_watch:
wandb_name: sft
wandb_log_model:
logging_steps: 1
debug: false
# =============================================================================
# TRAINING PARAMETERS
# =============================================================================
micro_batch_size: 6
gradient_accumulation_steps: 2
num_epochs: 2
sequence_len: 16000
optimizer: paged_ademamix_8bit
lr_scheduler: rex
learning_rate: 8e-6
warmup_ratio: 0.1
max_grad_norm: 0.0001
weight_decay: 0.0
# =============================================================================
# ADDITIONAL SETTINGS
# =============================================================================
local_rank:
group_by_length: false
early_stopping_patience:
save_safetensors: true
bf16: auto
special_tokens:
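For reference, the effective batch size implied by this config works out as below. This is a minimal sketch: the 8 GPUs come from the training notes, and since sample packing is enabled the token count per step is only an upper bound.
# Effective batch size implied by the config above (assumes the
# 8xA100 node from the training notes; purely illustrative).
micro_batch_size = 6
gradient_accumulation_steps = 2
num_gpus = 8
sequence_len = 16000

sequences_per_step = micro_batch_size * gradient_accumulation_steps * num_gpus
tokens_per_step = sequences_per_step * sequence_len  # upper bound with sample packing

print(sequences_per_step)  # 96 sequences per optimizer step
print(tokens_per_step)     # 1,536,000 tokens per optimizer step (at most)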