Built with Axolotl

Axolotl config (version 0.9.2):

adapter: lora
base_model: Qwen/Qwen3-4B
bf16: true  # You can safely force bf16 since your GPU supports it

# Dataset & Data Loading
dataset_processes: 32
chat_template: chatml
datasets:
- message_property_mappings:
    content: content
    role: role
  path: dougiefresh/jade_merged
  train_split: train
  valid_split: valid
  trust_remote_code: false
  type: chat_template

# Training Efficiency
micro_batch_size: 4
gradient_accumulation_steps: 2
gradient_checkpointing: true

# LoRA Settings
lora_alpha: 64
lora_dropout: 0.05
lora_r: 64
lora_target_modules:
- q_proj
- v_proj
- k_proj
- o_proj
- gate_proj
- down_proj
- up_proj

# Optimization
learning_rate: 0.00003  # Lower LR for slower, more stable convergence
lr_scheduler: cosine
warmup_ratio: 0.1  # Introduce a warmup period for smoother startup
optimizer: adamw_torch_fused

# Sequence Length & Packing
sequence_len: 32768
max_prompt_len: 32768
sample_packing_bin_size: 256
sample_packing_group_size: 200000

# Saving & Evaluation
num_epochs: 3.0
output_dir: ./outputs/mymodel
save_only_model: false
save_safetensors: true
val_set_size: 0.05
eval_steps: 250  # More frequent evaluation to catch overfitting early
load_best_model_at_end: true

# Training Behavior
train_on_inputs: false
shuffle_merged_datasets: true
skip_prepare_dataset: false
auto_resume_from_checkpoints: true
weight_decay: 0.01

# Advanced
pretrain_multipack_attn: true
pretrain_multipack_buffer_size: 10000
qlora_sharded_model_loading: false
mean_resizing_embeddings: false
strict: false

# TRL
trl:
  log_completions: false
  ref_model_mixup_alpha: 0.9
  ref_model_sync_steps: 64
  sync_ref_model: false
  use_vllm: false

# Hardware
load_in_4bit: false
load_in_8bit: false
use_ray: false
ray_num_workers: 1
resources_per_worker:
  GPU: 1

use_tensorboard: true
logging_dir: ./outputs/tensorboard
logging_first_step: true
logging_steps: 10
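
For reference, the LoRA block in the config above maps roughly onto the following PEFT configuration (a minimal sketch, assuming one were to rebuild the adapter directly with the peft library instead of through Axolotl; rank 64 with alpha 64 gives a scaling factor of alpha/r = 1.0):

```python
from peft import LoraConfig

# Approximate PEFT equivalent of the "LoRA Settings" block in the Axolotl config.
lora_config = LoraConfig(
    r=64,
    lora_alpha=64,
    lora_dropout=0.05,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",   # attention projections
        "gate_proj", "up_proj", "down_proj",      # MLP projections
    ],
    bias="none",
    task_type="CAUSAL_LM",
)
```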

outputs/mymodel

This model is a fine-tuned version of Qwen/Qwen3-4B on the dougiefresh/jade_merged dataset. It achieves the following results on the evaluation set:

  • Loss: 0.6294
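
A minimal inference sketch for loading the adapter on top of the base model with transformers and peft (repository IDs as listed on this card; the prompt and generation settings are illustrative, and Qwen3's built-in chat template is ChatML-style, matching the chat_template: chatml setting above):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "Qwen/Qwen3-4B"
adapter_id = "dougiefresh/jade_qwen_4b_knowledge_merged_adapter"

tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype="bfloat16", device_map="auto")
model = PeftModel.from_pretrained(model, adapter_id)  # attach the LoRA adapter

messages = [{"role": "user", "content": "Hello!"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output_ids = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```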

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 3e-05
  • train_batch_size: 4
  • eval_batch_size: 4
  • seed: 42
  • gradient_accumulation_steps: 2
  • total_train_batch_size: 8
  • optimizer: AdamW (ADAMW_TORCH_FUSED) with betas=(0.9, 0.999), epsilon=1e-08, and no additional optimizer arguments
  • lr_scheduler_type: cosine
  • lr_scheduler_warmup_steps: 1222
  • num_epochs: 3.0
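
The derived values in this list follow directly from the Axolotl config; a quick sanity check (single GPU assumed, per the resources_per_worker setting, with the total step count inferred from the results table below):

```python
# Effective batch size = micro_batch_size x gradient_accumulation_steps x num_gpus
micro_batch_size = 4
gradient_accumulation_steps = 2
num_gpus = 1  # resources_per_worker.GPU: 1 in the config
total_train_batch_size = micro_batch_size * gradient_accumulation_steps * num_gpus
print(total_train_batch_size)  # 8

# Warmup: warmup_ratio of 0.1 applied to roughly 12,222 optimizer steps over 3 epochs
total_steps = 12222  # approximate, inferred from the final rows of the results table
print(round(0.1 * total_steps))  # 1222, matching lr_scheduler_warmup_steps
```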

Training results

Training Loss | Epoch | Step | Validation Loss
No log | 0.0002 | 1 | 0.9643
0.7189 | 0.0614 | 250 | 0.7495
0.7475 | 0.1227 | 500 | 0.7097
0.7208 | 0.1841 | 750 | 0.6904
0.6474 | 0.2455 | 1000 | 0.6796
0.6103 | 0.3068 | 1250 | 0.6720
0.6621 | 0.3682 | 1500 | 0.6658
0.6769 | 0.4296 | 1750 | 0.6618
0.6705 | 0.4909 | 2000 | 0.6595
0.6696 | 0.5523 | 2250 | 0.6561
0.586 | 0.6136 | 2500 | 0.6535
0.6336 | 0.6750 | 2750 | 0.6516
0.5806 | 0.7364 | 3000 | 0.6496
0.5984 | 0.7977 | 3250 | 0.6477
0.6474 | 0.8591 | 3500 | 0.6458
0.5857 | 0.9205 | 3750 | 0.6446
0.5959 | 0.9818 | 4000 | 0.6430
0.5811 | 1.0432 | 4250 | 0.6426
0.5778 | 1.1046 | 4500 | 0.6411
0.5494 | 1.1659 | 4750 | 0.6411
0.6449 | 1.2273 | 5000 | 0.6399
0.5813 | 1.2887 | 5250 | 0.6390
0.6106 | 1.3500 | 5500 | 0.6376
0.6475 | 1.4114 | 5750 | 0.6369
0.6389 | 1.4728 | 6000 | 0.6364
0.6092 | 1.5341 | 6250 | 0.6353
0.6029 | 1.5955 | 6500 | 0.6347
0.5885 | 1.6568 | 6750 | 0.6337
0.6237 | 1.7182 | 7000 | 0.6330
0.5555 | 1.7796 | 7250 | 0.6322
0.5868 | 1.8409 | 7500 | 0.6315
0.5705 | 1.9023 | 7750 | 0.6309
0.5969 | 1.9637 | 8000 | 0.6307
0.5946 | 2.0250 | 8250 | 0.6311
0.6257 | 2.0864 | 8500 | 0.6309
0.5959 | 2.1478 | 8750 | 0.6307
0.6504 | 2.2091 | 9000 | 0.6306
0.5973 | 2.2705 | 9250 | 0.6306
0.5851 | 2.3319 | 9500 | 0.6304
0.5713 | 2.3932 | 9750 | 0.6300
0.5925 | 2.4546 | 10000 | 0.6299
0.556 | 2.5160 | 10250 | 0.6298
0.5946 | 2.5773 | 10500 | 0.6297
0.5749 | 2.6387 | 10750 | 0.6295
0.5928 | 2.7000 | 11000 | 0.6295
0.5546 | 2.7614 | 11250 | 0.6295
0.5388 | 2.8228 | 11500 | 0.6294
0.5285 | 2.8841 | 11750 | 0.6294
0.5806 | 2.9455 | 12000 | 0.6294

Framework versions

  • PEFT 0.15.2
  • Transformers 4.51.3
  • PyTorch 2.6.0+cu124
  • Datasets 3.5.1
  • Tokenizers 0.21.1