See axolotl config
axolotl version: 0.10.0.dev0
```yaml
base_model: Qwen/Qwen3-0.6B-Base
hub_model_id: cyberbabooshka/base_noreasoning
wandb_name: base
tokenizer_type: AutoTokenizer
load_in_8bit: false
load_in_4bit: false
num_processes: 64
dataset_processes: 64
dataset_prepared_path: last_run_prepared

chat_template: jinja
chat_template_jinja: >-
  {%- for message in messages %}
  {{- '<|im_start|>' + message.role + '\n' + message.content.lstrip('\n') + '<|im_end|>' + '\n' }}
  {%- endfor %}
  {%- if add_generation_prompt %}
  {{- '<|im_start|>assistant\n' }}
  {%- endif %}

datasets:
  - path: cyberbabooshka/MNLP_M2_mcqa_dataset
    split: train
    type: chat_template
    field_messages: messages
    train_on_eos: turn
    train_on_eot: turn
    message_property_mappings:
      role: role
      content: content
    roles:
      user:
        - user
      assistant:
        - assistant

test_datasets:
  - path: cyberbabooshka/MNLP_M2_mcqa_dataset
    split: test
    type: chat_template
    field_messages: messages
    train_on_eos: turn
    train_on_eot: turn
    message_property_mappings:
      role: role
      content: content
    roles:
      user:
        - user
      assistant:
        - assistant

output_dir: ./outputs
sequence_len: 2048
batch_flattening: true
sample_packing: false

wandb_project: mnlp
wandb_entity: aleksandr-dremov-epfl
wandb_watch:
wandb_log_model:

gradient_accumulation_steps: 1
eval_batch_size: 16
micro_batch_size: 12
optimizer: ademamix_8bit
weight_decay: 0.01
learning_rate: 0.00001
warmup_steps: 500

wsd_final_lr_factor: 0.0
wsd_init_div_factor: 100
wsd_fract_decay: 0.2
wsd_decay_type: "sqrt"
wsd_sqrt_power: 0.5
wsd_cooldown_start_lr_factor: 1.0

bf16: auto
tf32: false
torch_compile: true
flash_attention: true
gradient_checkpointing: false

resume_from_checkpoint:
auto_resume_from_checkpoints: true
logging_steps: 16
eval_steps: 2000
save_steps: 1000
max_steps: 35000
num_epochs: 20000000
save_total_limit: 2

special_tokens:
  eos_token: "<|im_end|>"
  pad_token: "<|endoftext|>"
eot_tokens:
  - <|im_end|>

plugins:
  - axolotl_wsd.WSDSchedulerPlugin
```
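For illustration only (not part of the original card): the chat_template_jinja entry above can be rendered directly with jinja2 to see the exact ChatML-style prompt layout the model was trained on. The example conversation below is made up; only the template itself comes from the config.

```python
# Sketch: render the chat template from the config above with jinja2.
# Only the template string is taken from the config; the messages are invented.
from jinja2 import Template

CHAT_TEMPLATE = (
    "{%- for message in messages %}"
    "{{- '<|im_start|>' + message.role + '\\n' + message.content.lstrip('\\n') + '<|im_end|>' + '\\n' }}"
    "{%- endfor %}"
    "{%- if add_generation_prompt %}"
    "{{- '<|im_start|>assistant\\n' }}"
    "{%- endif %}"
)

messages = [{"role": "user", "content": "What is the capital of France?"}]
prompt = Template(CHAT_TEMPLATE).render(messages=messages, add_generation_prompt=True)
print(prompt)
# <|im_start|>user
# What is the capital of France?<|im_end|>
# <|im_start|>assistant
```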
base_noreasoning
This model is a fine-tuned version of Qwen/Qwen3-0.6B-Base on the cyberbabooshka/MNLP_M2_mcqa_dataset dataset. It achieves the following results on the evaluation set:
- Loss: 0.7964
Model description
More information needed
Intended uses & limitations
More information needed
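No usage notes were provided. As a minimal, hedged sketch (the prompt is made up and device/dtype handling is omitted), the checkpoint should load like any transformers causal LM, assuming the tokenizer ships the chat template from the config above:

```python
# Minimal inference sketch; assumes the tokenizer carries the chat template above.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "cyberbabooshka/base_noreasoning"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")

messages = [{"role": "user", "content": "Which gas makes up most of Earth's atmosphere?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)

output_ids = model.generate(
    input_ids,
    max_new_tokens=64,
    eos_token_id=tokenizer.eos_token_id,  # <|im_end|>, per special_tokens in the config
)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```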
Training and evaluation data
More information needed
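Not documented here, but the config above trains on the train split of cyberbabooshka/MNLP_M2_mcqa_dataset and evaluates on its test split, reading a messages field of role/content turns. A quick, hedged way to inspect it:

```python
# Sketch: peek at the dataset referenced in the config above.
from datasets import load_dataset

ds = load_dataset("cyberbabooshka/MNLP_M2_mcqa_dataset")
print(ds)                          # expected splits: train and test
print(ds["train"][0]["messages"])  # list of {"role": ..., "content": ...} turns
```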
Training procedure
Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 1e-05
- train_batch_size: 12
- eval_batch_size: 16
- seed: 42
- distributed_type: multi-GPU
- num_devices: 2
- total_train_batch_size: 24 (see the quick check after this list)
- total_eval_batch_size: 32
- optimizer: ADEMAMIX_8BIT (no additional optimizer arguments)
- lr_scheduler_type: cosine
- lr_scheduler_warmup_steps: 500
- training_steps: 35000
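As a quick sanity check (not part of the original card), the reported totals follow directly from the per-device settings:

```python
# How the reported total batch sizes derive from the per-device settings above.
micro_batch_size = 12             # per-device train batch size
eval_batch_size = 16              # per-device eval batch size
gradient_accumulation_steps = 1
num_devices = 2

print(micro_batch_size * gradient_accumulation_steps * num_devices)  # 24 = total_train_batch_size
print(eval_batch_size * num_devices)                                 # 32 = total_eval_batch_size
```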
Training results
| Training Loss | Epoch  | Step  | Validation Loss |
|:-------------:|:------:|:-----:|:---------------:|
| No log        | 0.0000 | 1     | 0.9810          |
| 0.8508        | 0.0556 | 2000  | 0.8516          |
| 0.8877        | 0.1111 | 4000  | 0.8365          |
| 0.8851        | 0.1667 | 6000  | 0.8281          |
| 0.8193        | 0.2223 | 8000  | 0.8222          |
| 0.8298        | 0.2778 | 10000 | 0.8177          |
| 0.8439        | 0.3334 | 12000 | 0.8141          |
| 0.8364        | 0.3890 | 14000 | 0.8111          |
| 0.8015        | 0.4445 | 16000 | 0.8085          |
| 0.8112        | 0.5001 | 18000 | 0.8062          |
| 0.7972        | 0.5556 | 20000 | 0.8042          |
| 0.8264        | 0.6112 | 22000 | 0.8024          |
| 0.7728        | 0.6668 | 24000 | 0.8008          |
| 0.7762        | 0.7223 | 26000 | 0.7992          |
| 0.8185        | 0.7779 | 28000 | 0.7978          |
| 0.8235        | 0.8335 | 30000 | 0.7967          |
| 0.7812        | 0.8890 | 32000 | 0.7964          |
| 0.7872        | 0.9446 | 34000 | 0.7964          |
Framework versions
- Transformers 4.52.1
- Pytorch 2.7.0+cu126
- Datasets 3.5.0
- Tokenizers 0.21.1