See axolotl config
axolotl version: 0.10.0.dev0
```yaml
base_model: Qwen/Qwen3-0.6B-Base
hub_model_id: cyberbabooshka/base_noreasoning
wandb_name: base
tokenizer_type: AutoTokenizer
load_in_8bit: false
load_in_4bit: false
num_processes: 64
dataset_processes: 64
dataset_prepared_path: last_run_prepared

chat_template: jinja
chat_template_jinja: >-
  {%- for message in messages %}
  {{- '<|im_start|>' + message.role + '\n' + message.content.lstrip('\n') + '<|im_end|>' + '\n' }}
  {%- endfor %}
  {%- if add_generation_prompt %}
  {{- '<|im_start|>assistant\n' }}
  {%- endif %}

datasets:
  - path: cyberbabooshka/MNLP_M2_mcqa_dataset
    split: train
    type: chat_template
    field_messages: messages
    train_on_eos: turn
    train_on_eot: turn
    message_property_mappings:
      role: role
      content: content
    roles:
      user:
        - user
      assistant:
        - assistant

test_datasets:
  - path: cyberbabooshka/MNLP_M2_mcqa_dataset
    split: test
    type: chat_template
    field_messages: messages
    train_on_eos: turn
    train_on_eot: turn
    message_property_mappings:
      role: role
      content: content
    roles:
      user:
        - user
      assistant:
        - assistant

output_dir: ./outputs
sequence_len: 2048
batch_flattening: true
sample_packing: false

wandb_project: mnlp
wandb_entity: aleksandr-dremov-epfl
wandb_watch:
wandb_log_model:

gradient_accumulation_steps: 1
eval_batch_size: 16
micro_batch_size: 12
optimizer: ademamix_8bit
weight_decay: 0.01
learning_rate: 0.00001
warmup_steps: 500

wsd_final_lr_factor: 0.0
wsd_init_div_factor: 100
wsd_fract_decay: 0.2
wsd_decay_type: "sqrt"
wsd_sqrt_power: 0.5
wsd_cooldown_start_lr_factor: 1.0

bf16: auto
tf32: false
torch_compile: true
flash_attention: true
gradient_checkpointing: false

resume_from_checkpoint:
auto_resume_from_checkpoints: true
logging_steps: 16
eval_steps: 2000
save_steps: 1000
max_steps: 35000
num_epochs: 20000000
save_total_limit: 2

special_tokens:
  eos_token: "<|im_end|>"
  pad_token: "<|endoftext|>"
eot_tokens:
  - <|im_end|>

plugins:
  - axolotl_wsd.WSDSchedulerPlugin
```
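For illustration only (not part of the original card): the chat_template_jinja entry above can be rendered directly with jinja2 to see the exact ChatML-style prompt layout the model was trained on. The example conversation below is made up; only the template itself comes from the config.

```python
# Sketch: render the chat template from the config above with jinja2.
# Only the template string is taken from the config; the messages are invented.
from jinja2 import Template

CHAT_TEMPLATE = (
    "{%- for message in messages %}"
    "{{- '<|im_start|>' + message.role + '\\n' + message.content.lstrip('\\n') + '<|im_end|>' + '\\n' }}"
    "{%- endfor %}"
    "{%- if add_generation_prompt %}"
    "{{- '<|im_start|>assistant\\n' }}"
    "{%- endif %}"
)

messages = [{"role": "user", "content": "What is the capital of France?"}]
prompt = Template(CHAT_TEMPLATE).render(messages=messages, add_generation_prompt=True)
print(prompt)
# <|im_start|>user
# What is the capital of France?<|im_end|>
# <|im_start|>assistant
```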
base_noreasoning
This model is a fine-tuned version of Qwen/Qwen3-0.6B-Base on the cyberbabooshka/MNLP_M2_mcqa_dataset dataset. It achieves the following results on the evaluation set:
- Loss: 0.7964
Model description
More information needed
Intended uses & limitations
More information needed
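No usage notes were provided. As a minimal, hedged sketch (the prompt is made up and device/dtype handling is omitted), the checkpoint should load like any transformers causal LM, assuming the tokenizer ships the chat template from the config above:

```python
# Minimal inference sketch; assumes the tokenizer carries the chat template above.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "cyberbabooshka/base_noreasoning"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")

messages = [{"role": "user", "content": "Which gas makes up most of Earth's atmosphere?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)

output_ids = model.generate(
    input_ids,
    max_new_tokens=64,
    eos_token_id=tokenizer.eos_token_id,  # <|im_end|>, per special_tokens in the config
)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```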
Training and evaluation data
More information needed
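Not documented here, but the config above trains on the train split of cyberbabooshka/MNLP_M2_mcqa_dataset and evaluates on its test split, reading a messages field of role/content turns. A quick, hedged way to inspect it:

```python
# Sketch: peek at the dataset referenced in the config above.
from datasets import load_dataset

ds = load_dataset("cyberbabooshka/MNLP_M2_mcqa_dataset")
print(ds)                          # expected splits: train and test
print(ds["train"][0]["messages"])  # list of {"role": ..., "content": ...} turns
```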
Training procedure
Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 1e-05
- train_batch_size: 12
- eval_batch_size: 16
- seed: 42
- distributed_type: multi-GPU
- num_devices: 2
- total_train_batch_size: 24 (see the quick check after this list)
- total_eval_batch_size: 32
- optimizer: ADEMAMIX_8BIT (no additional optimizer arguments)
- lr_scheduler_type: cosine
- lr_scheduler_warmup_steps: 500
- training_steps: 35000
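As a quick sanity check (not part of the original card), the reported totals follow directly from the per-device settings:

```python
# How the reported total batch sizes derive from the per-device settings above.
micro_batch_size = 12             # per-device train batch size
eval_batch_size = 16              # per-device eval batch size
gradient_accumulation_steps = 1
num_devices = 2

print(micro_batch_size * gradient_accumulation_steps * num_devices)  # 24 = total_train_batch_size
print(eval_batch_size * num_devices)                                 # 32 = total_eval_batch_size
```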
Training results
| Training Loss | Epoch  | Step  | Validation Loss |
|:-------------:|:------:|:-----:|:---------------:|
| No log        | 0.0000 | 1     | 0.9810          |
| 0.8508        | 0.0556 | 2000  | 0.8516          |
| 0.8877        | 0.1111 | 4000  | 0.8365          |
| 0.8851        | 0.1667 | 6000  | 0.8281          |
| 0.8193        | 0.2223 | 8000  | 0.8222          |
| 0.8298        | 0.2778 | 10000 | 0.8177          |
| 0.8439        | 0.3334 | 12000 | 0.8141          |
| 0.8364        | 0.3890 | 14000 | 0.8111          |
| 0.8015        | 0.4445 | 16000 | 0.8085          |
| 0.8112        | 0.5001 | 18000 | 0.8062          |
| 0.7972        | 0.5556 | 20000 | 0.8042          |
| 0.8264        | 0.6112 | 22000 | 0.8024          |
| 0.7728        | 0.6668 | 24000 | 0.8008          |
| 0.7762        | 0.7223 | 26000 | 0.7992          |
| 0.8185        | 0.7779 | 28000 | 0.7978          |
| 0.8235        | 0.8335 | 30000 | 0.7967          |
| 0.7812        | 0.8890 | 32000 | 0.7964          |
| 0.7872        | 0.9446 | 34000 | 0.7964          |
Framework versions
- Transformers 4.52.1
- Pytorch 2.7.0+cu126
- Datasets 3.5.0
- Tokenizers 0.21.1