See axolotl config

axolotl version: `0.9.2`

```yaml
adapter: lora
base_model: Qwen/Qwen3-4B
bf16: true # You can safely force bf16 since your GPU supports it

# Dataset & Data Loading
dataset_processes: 32
chat_template: chatml
datasets:
  - message_property_mappings:
      content: content
      role: role
    path: dougiefresh/jade_merged
    train_split: train
    valid_split: valid
    trust_remote_code: false
    type: chat_template

# Training Efficiency
micro_batch_size: 4
gradient_accumulation_steps: 2
gradient_checkpointing: true

# LoRA Settings
lora_alpha: 64
lora_dropout: 0.05
lora_r: 64
lora_target_modules:
  - q_proj
  - v_proj
  - k_proj
  - o_proj
  - gate_proj
  - down_proj
  - up_proj

# Optimization
learning_rate: 0.00003 # Lower LR for slower, more stable convergence
lr_scheduler: cosine
warmup_ratio: 0.1 # Introduce a warmup period for smoother startup
optimizer: adamw_torch_fused

# Sequence Length & Packing
sequence_len: 32768
max_prompt_len: 32768
sample_packing_bin_size: 256
sample_packing_group_size: 200000

# Saving & Evaluation
num_epochs: 3.0
output_dir: ./outputs/mymodel
save_only_model: false
save_safetensors: true
val_set_size: 0.05
eval_steps: 250 # More frequent evaluation to catch overfitting early
load_best_model_at_end: true

# Training Behavior
train_on_inputs: false
shuffle_merged_datasets: true
skip_prepare_dataset: false
auto_resume_from_checkpoints: true
weight_decay: 0.01

# Advanced
pretrain_multipack_attn: true
pretrain_multipack_buffer_size: 10000
qlora_sharded_model_loading: false
mean_resizing_embeddings: false
strict: false

# TRL
trl:
  log_completions: false
  ref_model_mixup_alpha: 0.9
  ref_model_sync_steps: 64
  sync_ref_model: false
  use_vllm: false

# Hardware
load_in_4bit: false
load_in_8bit: false
use_ray: false
ray_num_workers: 1
resources_per_worker:
  GPU: 1
use_tensorboard: true
logging_dir: ./outputs/tensorboard
logging_first_step: true
logging_steps: 10
```
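The `datasets` entry maps each record's `role`/`content` fields into axolotl's chat-template pipeline and renders them with the ChatML template selected by `chat_template: chatml`. As a rough illustration only (not axolotl's actual implementation, which also handles tokenization, packing, and loss masking for `train_on_inputs: false`), a single conversation ends up laid out like this:

```python
# Illustrative sketch of the ChatML layout produced by `chat_template: chatml`.
def to_chatml(messages):
    parts = []
    for m in messages:
        # ChatML wraps each turn in <|im_start|>role ... <|im_end|> markers.
        parts.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>")
    return "\n".join(parts) + "\n"

example = [
    {"role": "user", "content": "What is LoRA?"},
    {"role": "assistant", "content": "LoRA adds small low-rank adapter matrices to selected layers."},
]
print(to_chatml(example))
# <|im_start|>user
# What is LoRA?<|im_end|>
# <|im_start|>assistant
# LoRA adds small low-rank adapter matrices to selected layers.<|im_end|>
```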
# outputs/mymodel
This model is a fine-tuned version of Qwen/Qwen3-4B on the dougiefresh/jade_merged dataset. It achieves the following results on the evaluation set:
- Loss: 0.6294
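Because the run uses `adapter: lora`, the checkpoint is a LoRA adapter on top of Qwen/Qwen3-4B. A minimal inference sketch, assuming the adapter has been pushed to a Hugging Face repo (the `adapter_id` below is a placeholder, and you may need to apply the ChatML template used during training rather than the base tokenizer's default):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "Qwen/Qwen3-4B"
adapter_id = "your-username/your-lora-adapter"  # placeholder, not the actual repo id

tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(
    base_id, torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(model, adapter_id)  # attach the LoRA weights

messages = [{"role": "user", "content": "Summarize what this model was fine-tuned for."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```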
## Model description
More information needed
## Intended uses & limitations
More information needed
## Training and evaluation data
More information needed
## Training procedure

### Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 3e-05
- train_batch_size: 4
- eval_batch_size: 4
- seed: 42
- gradient_accumulation_steps: 2
- total_train_batch_size: 8
- optimizer: AdamW (torch fused) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
- lr_scheduler_type: cosine
- lr_scheduler_warmup_steps: 1222
- num_epochs: 3.0
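These derived values follow directly from the config. A quick sanity check of the effective batch size and warmup length (the steps-per-epoch figure is inferred from the results table below, so treat it as approximate):

```python
# Back-of-the-envelope check of the derived hyperparameters listed above.
micro_batch_size = 4
gradient_accumulation_steps = 2
num_gpus = 1  # single-GPU run assumed (resources_per_worker GPU: 1)

total_train_batch_size = micro_batch_size * gradient_accumulation_steps * num_gpus
print(total_train_batch_size)  # 8, matching total_train_batch_size

# warmup_ratio of 0.1 over ~12,222 total optimizer steps
# (3 epochs at roughly 4,074 steps/epoch, inferred from the results table)
# reproduces the reported 1222 warmup steps.
steps_per_epoch = 4074  # approximate
num_epochs = 3
warmup_ratio = 0.1
print(int(warmup_ratio * steps_per_epoch * num_epochs))  # 1222
```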
### Training results
| Training Loss | Epoch | Step | Validation Loss |
|:-------------:|:-----:|:----:|:---------------:|
No log | 0.0002 | 1 | 0.9643 |
0.7189 | 0.0614 | 250 | 0.7495 |
0.7475 | 0.1227 | 500 | 0.7097 |
0.7208 | 0.1841 | 750 | 0.6904 |
0.6474 | 0.2455 | 1000 | 0.6796 |
0.6103 | 0.3068 | 1250 | 0.6720 |
0.6621 | 0.3682 | 1500 | 0.6658 |
0.6769 | 0.4296 | 1750 | 0.6618 |
0.6705 | 0.4909 | 2000 | 0.6595 |
0.6696 | 0.5523 | 2250 | 0.6561 |
0.586 | 0.6136 | 2500 | 0.6535 |
0.6336 | 0.6750 | 2750 | 0.6516 |
0.5806 | 0.7364 | 3000 | 0.6496 |
0.5984 | 0.7977 | 3250 | 0.6477 |
0.6474 | 0.8591 | 3500 | 0.6458 |
0.5857 | 0.9205 | 3750 | 0.6446 |
0.5959 | 0.9818 | 4000 | 0.6430 |
0.5811 | 1.0432 | 4250 | 0.6426 |
0.5778 | 1.1046 | 4500 | 0.6411 |
0.5494 | 1.1659 | 4750 | 0.6411 |
0.6449 | 1.2273 | 5000 | 0.6399 |
0.5813 | 1.2887 | 5250 | 0.6390 |
0.6106 | 1.3500 | 5500 | 0.6376 |
0.6475 | 1.4114 | 5750 | 0.6369 |
0.6389 | 1.4728 | 6000 | 0.6364 |
0.6092 | 1.5341 | 6250 | 0.6353 |
0.6029 | 1.5955 | 6500 | 0.6347 |
0.5885 | 1.6568 | 6750 | 0.6337 |
0.6237 | 1.7182 | 7000 | 0.6330 |
0.5555 | 1.7796 | 7250 | 0.6322 |
0.5868 | 1.8409 | 7500 | 0.6315 |
0.5705 | 1.9023 | 7750 | 0.6309 |
0.5969 | 1.9637 | 8000 | 0.6307 |
0.5946 | 2.0250 | 8250 | 0.6311 |
0.6257 | 2.0864 | 8500 | 0.6309 |
0.5959 | 2.1478 | 8750 | 0.6307 |
0.6504 | 2.2091 | 9000 | 0.6306 |
0.5973 | 2.2705 | 9250 | 0.6306 |
0.5851 | 2.3319 | 9500 | 0.6304 |
0.5713 | 2.3932 | 9750 | 0.6300 |
0.5925 | 2.4546 | 10000 | 0.6299 |
0.556 | 2.5160 | 10250 | 0.6298 |
0.5946 | 2.5773 | 10500 | 0.6297 |
0.5749 | 2.6387 | 10750 | 0.6295 |
0.5928 | 2.7000 | 11000 | 0.6295 |
0.5546 | 2.7614 | 11250 | 0.6295 |
0.5388 | 2.8228 | 11500 | 0.6294 |
0.5285 | 2.8841 | 11750 | 0.6294 |
0.5806 | 2.9455 | 12000 | 0.6294 |
### Framework versions
- PEFT 0.15.2
- Transformers 4.51.3
- Pytorch 2.6.0+cu124
- Datasets 3.5.1
- Tokenizers 0.21.1