Koto Small 7B (Instruct-Tuned)

Koto-Small-7B-IT is an instruct-tuned version of Koto-Small-7B-PT, which was itself trained from MiMo-7B-Base on almost a billion tokens of creative-writing data. This model is intended for roleplay and instruct use cases.

Usage

Chat template

The model was trained with ChatML formatting, so a typical input looks like this:

<|im_start|>system
system prompt<|im_end|>
<|im_start|>user
Hi there!<|im_end|>
<|im_start|>assistant
Nice to meet you!<|im_end|>
<|im_start|>user
Can I ask a question?<|im_end|>
<|im_start|>assistant
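
If you are driving the model from Python, the bundled chat template can build this string for you. A minimal sketch, assuming the released tokenizer ships the ChatML template shown above:

from transformers import AutoTokenizer

# Model id as listed on the Hub
tokenizer = AutoTokenizer.from_pretrained("Aurore-Reveil/Koto-Small-7B-IT")

messages = [
    {"role": "system", "content": "system prompt"},
    {"role": "user", "content": "Hi there!"},
    {"role": "assistant", "content": "Nice to meet you!"},
    {"role": "user", "content": "Can I ask a question?"},
]

# add_generation_prompt=True appends the trailing <|im_start|>assistant turn
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)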

Samplers

We found that a temperature of 1.25 with min_p at 0.05 worked best, but YMMV!
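
For example, with the Hugging Face transformers generate API (a minimal sketch; it assumes a transformers release recent enough to support the min_p sampler, and the prompt is just a placeholder):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Aurore-Reveil/Koto-Small-7B-IT"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "Write a short scene in a rainy harbor town."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Recommended samplers from above: temperature 1.25, min_p 0.05
output = model.generate(
    input_ids,
    max_new_tokens=512,
    do_sample=True,
    temperature=1.25,
    min_p=0.05,
)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))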

Datasets

datasets:
  - path: Delta-Vector/Hydrus-General-Reasoning
  - path: Delta-Vector/Hydrus-IF-Mix-Ai2
  - path: Delta-Vector/Hydrus-Army-Inst
  - path: Delta-Vector/Hydrus-AM-thinking-Science
  - path: Delta-Vector/Hydrus-AM-Thinking-Code-Filtered
  - path: Delta-Vector/Hydrus-AM-Thinking-IF-No-Think
  - path: Delta-Vector/Hydrus-Tulu-SFT-Mix-V2
  - path: Delta-Vector/Hydrus-System-Chat-2.0
  - path: Delta-Vector/Orion-Praxis-Co-Writer
  - path: Delta-Vector/Orion-Co-Writer-51K
  - path: Delta-Vector/Orion-Creative_Writing-Complexity
  - path: Delta-Vector/Orion-vanilla-backrooms-claude-sharegpt
  - path: Delta-Vector/Hydrus-AM-Thinking-Multi-Turn
  - path: PocketDoc/Dans-Failuremaxx-Adventure
  - path: PocketDoc/Dans-Logicmaxx-SAT-AP
  - path: PocketDoc/Dans-MemoryCore-CoreCurriculum-Small
  - path: PocketDoc/Dans-Taskmaxx-DataPrepper
  - path: PocketDoc/Dans-Prosemaxx-Instructwriter-Long
  - path: PocketDoc/Dans-Prosemaxx-InstructWriter-ZeroShot-2
  - path: PocketDoc/Dans-Prosemaxx-InstructWriter-ZeroShot-3
  - path: PocketDoc/Dans-Prosemaxx-InstructWriter-Continue-2
  - path: PocketDoc/Dans-Systemmaxx
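
All of these are dataset repos on the Hugging Face Hub, so any of them can be pulled down for inspection with the datasets library. A minimal sketch (the split name is an assumption; check the individual dataset cards):

from datasets import load_dataset

# Grab one of the SFT mixes listed above and peek at the first row
ds = load_dataset("Delta-Vector/Hydrus-System-Chat-2.0", split="train")
print(ds)
print(ds[0])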

Acknowledgements

  • Thank you very much to Delta-Vector/Mango for providing the compute used to train this model.
  • Fizz for the pretrain.
  • Pocketdoc/Anthracite for da cool datasets.
  • Hensen chat.
  • Thank you to the illustrator of WataNare for drawing the art used in the model card!
  • Thanks to Curse for testing and ideas.
  • Thanks to Toasty for some data and ideas.
  • Thanks to everyone else in allura!

ilya <3

Call for Help

If you would like to help build on this model (RP SFT, further annealing on higher-quality data, etc.)...

Please join the allura discord or the matrix! <3

Technical Appendix

Training Notes

As before, the model was trained for 2 epochs over the course of 12 hours on an 8xA100 DGX node, using the AdEMAMix optimizer and a REX LR scheduler. Aggressive gradient clipping (max_grad_norm 0.0001) was used for regularization, with no weight decay, because it sucks.

WandB

(WandB training curves screenshot)

Axolotl Config

# =============================================================================
# Model + Saving
# =============================================================================
base_model: allura-forge/Koto-Small-7b-rc1
output_dir: ./koto-sft 
saves_per_epoch: 2
deepcompile: true
# =============================================================================
# DATASET CONFIGURATION
# =============================================================================
datasets:
  - path: /home/Ubuntu/Mango/pretok/test-koto-sft-7b-rc-1.parquet
    ds_type: parquet
    type: 

shuffle_merged_datasets: true
dataset_prepared_path: ./dataset_prepared
train_on_inputs: false

# =============================================================================
# EVALUATION SETTINGS
# =============================================================================
#evals_per_epoch: 4
#eval_table_size: 
#eval_max_new_tokens: 128
#eval_sample_packing: false
val_set_size: 0.0

# =============================================================================
# MEMORY OPTIMIZATION
# =============================================================================
plugins:
  - axolotl.integrations.liger.LigerPlugin
  - axolotl.integrations.cut_cross_entropy.CutCrossEntropyPlugin
liger_rope: true
liger_rms_norm: true
liger_layer_norm: true
liger_glu_activation: true
liger_fused_linear_cross_entropy: false
cut_cross_entropy: true
sample_packing: true
pad_to_sequence_len: true
gradient_checkpointing: true
flash_attention: true

# =============================================================================
# MULTI-GPU TRAINING
# =============================================================================
deepspeed: ./deepspeed_configs/zero2.json

# =============================================================================
# LOGGING & MONITORING
# =============================================================================
wandb_project: Koto-Small
wandb_entity: 
wandb_watch: 
wandb_name: sft
wandb_log_model: 
logging_steps: 1
debug: false

# =============================================================================
# TRAINING PARAMETERS
# =============================================================================
micro_batch_size: 6
gradient_accumulation_steps: 2
num_epochs: 2
sequence_len: 16000
optimizer: paged_ademamix_8bit
lr_scheduler: rex
learning_rate: 8e-6
warmup_ratio: 0.1
max_grad_norm: 0.0001
weight_decay: 0.0


# =============================================================================
# ADDITIONAL SETTINGS
# =============================================================================
local_rank: 
group_by_length: false
early_stopping_patience: 
save_safetensors: true
bf16: auto
special_tokens:
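
For reference, the effective batch works out directly from the settings above plus the 8-GPU node mentioned in the training notes. A quick sketch of the arithmetic:

# Effective batch per optimizer step, from the config above
micro_batch_size = 6
gradient_accumulation_steps = 2
num_gpus = 8          # 8xA100 node from the training notes, not stored in the config
sequence_len = 16000  # maximum packed sequence length

sequences_per_step = micro_batch_size * gradient_accumulation_steps * num_gpus
print(sequences_per_step)                 # 96 packed sequences per step
print(sequences_per_step * sequence_len)  # up to ~1.54M tokens per step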