See axolotl config

axolotl version: `0.9.2`

```yaml
adapter: lora
base_model: Qwen/Qwen3-4B
bf16: true # You can safely force bf16 since your GPU supports it

# Dataset & Data Loading
dataset_processes: 32
chat_template: chatml
datasets:
  - message_property_mappings:
      content: content
      role: role
    path: dougiefresh/jade_merged
    train_split: train
    valid_split: valid
    trust_remote_code: false
    type: chat_template

# Training Efficiency
micro_batch_size: 4
gradient_accumulation_steps: 2
gradient_checkpointing: true

# LoRA Settings
lora_alpha: 64
lora_dropout: 0.05
lora_r: 64
lora_target_modules:
  - q_proj
  - v_proj
  - k_proj
  - o_proj
  - gate_proj
  - down_proj
  - up_proj

# Optimization
learning_rate: 0.00003 # Lower LR for slower, more stable convergence
lr_scheduler: cosine
warmup_ratio: 0.1 # Introduce a warmup period for smoother startup
optimizer: adamw_torch_fused

# Sequence Length & Packing
sequence_len: 32768
max_prompt_len: 32768
sample_packing_bin_size: 256
sample_packing_group_size: 200000

# Saving & Evaluation
num_epochs: 3.0
output_dir: ./outputs/mymodel
save_only_model: false
save_safetensors: true
val_set_size: 0.05
eval_steps: 250 # More frequent evaluation to catch overfitting early
load_best_model_at_end: true

# Training Behavior
train_on_inputs: false
shuffle_merged_datasets: true
skip_prepare_dataset: false
auto_resume_from_checkpoints: true
weight_decay: 0.01

# Advanced
pretrain_multipack_attn: true
pretrain_multipack_buffer_size: 10000
qlora_sharded_model_loading: false
mean_resizing_embeddings: false
strict: false

# TRL
trl:
  log_completions: false
  ref_model_mixup_alpha: 0.9
  ref_model_sync_steps: 64
  sync_ref_model: false
  use_vllm: false

# Hardware
load_in_4bit: false
load_in_8bit: false
use_ray: false
ray_num_workers: 1
resources_per_worker:
  GPU: 1
use_tensorboard: true
logging_dir: ./outputs/tensorboard
logging_first_step: true
logging_steps: 10
```
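The `datasets` entry maps each record's `role`/`content` fields into axolotl's chat-template pipeline and renders them with the ChatML template selected by `chat_template: chatml`. As a rough illustration only (not axolotl's actual implementation, which also handles tokenization, packing, and loss masking for `train_on_inputs: false`), a single conversation ends up laid out like this:

```python
# Illustrative sketch of the ChatML layout produced by `chat_template: chatml`.
def to_chatml(messages):
    parts = []
    for m in messages:
        # ChatML wraps each turn in <|im_start|>role ... <|im_end|> markers.
        parts.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>")
    return "\n".join(parts) + "\n"

example = [
    {"role": "user", "content": "What is LoRA?"},
    {"role": "assistant", "content": "LoRA adds small low-rank adapter matrices to selected layers."},
]
print(to_chatml(example))
# <|im_start|>user
# What is LoRA?<|im_end|>
# <|im_start|>assistant
# LoRA adds small low-rank adapter matrices to selected layers.<|im_end|>
```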
# outputs/mymodel
This model is a fine-tuned version of Qwen/Qwen3-4B on the dougiefresh/jade_merged dataset. It achieves the following results on the evaluation set:
- Loss: 0.6294
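Because the run uses `adapter: lora`, the checkpoint is a LoRA adapter on top of Qwen/Qwen3-4B. A minimal inference sketch, assuming the adapter has been pushed to a Hugging Face repo (the `adapter_id` below is a placeholder, and you may need to apply the ChatML template used during training rather than the base tokenizer's default):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "Qwen/Qwen3-4B"
adapter_id = "your-username/your-lora-adapter"  # placeholder, not the actual repo id

tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(
    base_id, torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(model, adapter_id)  # attach the LoRA weights

messages = [{"role": "user", "content": "Summarize what this model was fine-tuned for."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```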
## Model description
More information needed
## Intended uses & limitations
More information needed
## Training and evaluation data
More information needed
## Training procedure

### Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 3e-05
- train_batch_size: 4
- eval_batch_size: 4
- seed: 42
- gradient_accumulation_steps: 2
- total_train_batch_size: 8
- optimizer: AdamW (torch fused) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
- lr_scheduler_type: cosine
- lr_scheduler_warmup_steps: 1222
- num_epochs: 3.0
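These derived values follow directly from the config. A quick sanity check of the effective batch size and warmup length (the steps-per-epoch figure is inferred from the results table below, so treat it as approximate):

```python
# Back-of-the-envelope check of the derived hyperparameters listed above.
micro_batch_size = 4
gradient_accumulation_steps = 2
num_gpus = 1  # single-GPU run assumed (resources_per_worker GPU: 1)

total_train_batch_size = micro_batch_size * gradient_accumulation_steps * num_gpus
print(total_train_batch_size)  # 8, matching total_train_batch_size

# warmup_ratio of 0.1 over ~12,222 total optimizer steps
# (3 epochs at roughly 4,074 steps/epoch, inferred from the results table)
# reproduces the reported 1222 warmup steps.
steps_per_epoch = 4074  # approximate
num_epochs = 3
warmup_ratio = 0.1
print(int(warmup_ratio * steps_per_epoch * num_epochs))  # 1222
```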
### Training results
| Training Loss | Epoch | Step | Validation Loss |
|:-------------:|:-----:|:----:|:---------------:|
No log | 0.0002 | 1 | 0.9643 |
0.7189 | 0.0614 | 250 | 0.7495 |
0.7475 | 0.1227 | 500 | 0.7097 |
0.7208 | 0.1841 | 750 | 0.6904 |
0.6474 | 0.2455 | 1000 | 0.6796 |
0.6103 | 0.3068 | 1250 | 0.6720 |
0.6621 | 0.3682 | 1500 | 0.6658 |
0.6769 | 0.4296 | 1750 | 0.6618 |
0.6705 | 0.4909 | 2000 | 0.6595 |
0.6696 | 0.5523 | 2250 | 0.6561 |
0.586 | 0.6136 | 2500 | 0.6535 |
0.6336 | 0.6750 | 2750 | 0.6516 |
0.5806 | 0.7364 | 3000 | 0.6496 |
0.5984 | 0.7977 | 3250 | 0.6477 |
0.6474 | 0.8591 | 3500 | 0.6458 |
0.5857 | 0.9205 | 3750 | 0.6446 |
0.5959 | 0.9818 | 4000 | 0.6430 |
0.5811 | 1.0432 | 4250 | 0.6426 |
0.5778 | 1.1046 | 4500 | 0.6411 |
0.5494 | 1.1659 | 4750 | 0.6411 |
0.6449 | 1.2273 | 5000 | 0.6399 |
0.5813 | 1.2887 | 5250 | 0.6390 |
0.6106 | 1.3500 | 5500 | 0.6376 |
0.6475 | 1.4114 | 5750 | 0.6369 |
0.6389 | 1.4728 | 6000 | 0.6364 |
0.6092 | 1.5341 | 6250 | 0.6353 |
0.6029 | 1.5955 | 6500 | 0.6347 |
0.5885 | 1.6568 | 6750 | 0.6337 |
0.6237 | 1.7182 | 7000 | 0.6330 |
0.5555 | 1.7796 | 7250 | 0.6322 |
0.5868 | 1.8409 | 7500 | 0.6315 |
0.5705 | 1.9023 | 7750 | 0.6309 |
0.5969 | 1.9637 | 8000 | 0.6307 |
0.5946 | 2.0250 | 8250 | 0.6311 |
0.6257 | 2.0864 | 8500 | 0.6309 |
0.5959 | 2.1478 | 8750 | 0.6307 |
0.6504 | 2.2091 | 9000 | 0.6306 |
0.5973 | 2.2705 | 9250 | 0.6306 |
0.5851 | 2.3319 | 9500 | 0.6304 |
0.5713 | 2.3932 | 9750 | 0.6300 |
0.5925 | 2.4546 | 10000 | 0.6299 |
0.556 | 2.5160 | 10250 | 0.6298 |
0.5946 | 2.5773 | 10500 | 0.6297 |
0.5749 | 2.6387 | 10750 | 0.6295 |
0.5928 | 2.7000 | 11000 | 0.6295 |
0.5546 | 2.7614 | 11250 | 0.6295 |
0.5388 | 2.8228 | 11500 | 0.6294 |
0.5285 | 2.8841 | 11750 | 0.6294 |
0.5806 | 2.9455 | 12000 | 0.6294 |
### Framework versions
- PEFT 0.15.2
- Transformers 4.51.3
- Pytorch 2.6.0+cu124
- Datasets 3.5.1
- Tokenizers 0.21.1