---
library_name: peft
license: apache-2.0
base_model: Qwen/Qwen3-32B
tags:
- axolotl
- generated_from_trainer
model-index:
- name: shuttle-3.5-ckpts
  results: []
---

[Built with Axolotl](https://github.com/axolotl-ai-cloud/axolotl)
<details><summary>See axolotl config</summary>

axolotl version: `0.9.0`
```yaml
# Weights and Biases logging config
wandb_project: shuttle-3.5
wandb_name: "3.5"

# Model architecture config
base_model: Qwen/Qwen3-32B
model_type: AutoModelForCausalLM
tokenizer_type: AutoTokenizer
chat_template: chatml

# Hugging Face saving config
hub_model_id: shuttleai/shuttle-3.5-ckpts
hub_strategy: all_checkpoints

# Model checkpointing config
output_dir: ./lora-out
saves_per_epoch: 10
save_safetensors: true
save_total_limit: 5

# Mixed precision training config
bf16: true
fp16: false
tf32: false

# Model loading config
load_in_8bit: false
load_in_4bit: true
strict: false

# Sequence config
sequence_len: 16384
s2_attention: false
sample_packing: true
eval_sample_packing: true
pad_to_sequence_len: true
train_on_inputs: false
group_by_length: false

# QLoRA adapter config
adapter: qlora
lora_r: 64
lora_alpha: 64
lora_dropout: 0.05
peft_use_dora: false
lora_target_modules:
  - gate_proj
  - down_proj
  - up_proj
  - q_proj
  - v_proj
  - k_proj
  - o_proj

# Dataset config
datasets:
  - path: ./dataset
    type: chat_template
val_set_size: 0.05
evals_per_epoch: 10
dataset_prepared_path: ./prepared-datasets
shuffle_merged_datasets: true

# Training hyperparameters
num_epochs: 1
gradient_accumulation_steps: 2
micro_batch_size: 2
eval_batch_size: 1
warmup_steps: 500
optimizer: paged_adamw_8bit
lr_scheduler: cosine
learning_rate: 2e-4
loraplus_lr_ratio: 8
cosine_min_lr_ratio: 0.1
weight_decay: 0.1
max_grad_norm: 1
logging_steps: 1

# Model optimization
gradient_checkpointing: unsloth
xformers_attention: false
flash_attention: true
sdp_attention: false
unsloth_cross_entropy_loss: true
unsloth_lora_mlp: false
unsloth_lora_qkv: false
unsloth_lora_o: false

# Loss monitoring config
early_stopping_patience: false
loss_watchdog_threshold: 100.0
loss_watchdog_patience: 3

# Debug config
debug: false
seed: 42

deepspeed: deepspeed_configs/zero2.json
```

</details><br>
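The QLoRA settings above (`adapter: qlora`, `load_in_4bit: true`, bf16 compute) translate roughly into the loading sketch below. This sketch is not part of the generated card: the adapter repo id is taken from `hub_model_id`, and the checkpoint layout is an assumption, so adjust both to the actual contents of the repository.

```python
# Minimal loading sketch (assumption: the LoRA adapter weights are published
# under the hub_model_id from the config, shuttleai/shuttle-3.5-ckpts).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel

base_id = "Qwen/Qwen3-32B"
adapter_id = "shuttleai/shuttle-3.5-ckpts"  # hub_model_id above; layout assumed

# Mirrors load_in_4bit: true and bf16: true from the axolotl config.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(
    base_id,
    quantization_config=bnb_config,
    device_map="auto",
)
model = PeftModel.from_pretrained(model, adapter_id)  # attach the QLoRA adapter
model.eval()
```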

# shuttle-3.5-ckpts

This model is a fine-tuned version of [Qwen/Qwen3-32B](https://huggingface.co/Qwen/Qwen3-32B) on an unknown dataset.
It achieves the following results on the evaluation set:
- Loss: 0.9783

## Model description

More information needed

## Intended uses & limitations

More information needed

## Training and evaluation data

More information needed

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 0.0002
- train_batch_size: 2
- eval_batch_size: 1
- seed: 42
- distributed_type: multi-GPU
- gradient_accumulation_steps: 2
- total_train_batch_size: 4
- optimizer: paged_adamw_8bit with betas=(0.9, 0.999), epsilon=1e-08, and no additional optimizer arguments
- lr_scheduler_type: cosine
- lr_scheduler_warmup_steps: 500
- num_epochs: 1.0

### Training results

| Training Loss | Epoch  | Step | Validation Loss |
|:-------------:|:------:|:----:|:---------------:|
| 7.5468        | 0.0006 | 1    | 7.0761          |
| 4.9993        | 0.1006 | 160  | 5.6051          |
| 3.358         | 0.2011 | 320  | 2.5960          |
| 1.809         | 0.3017 | 480  | 1.3915          |
| 2.088         | 0.4023 | 640  | 1.1270          |
| 1.8377        | 0.5028 | 800  | 1.0472          |
| 1.8002        | 0.6034 | 960  | 1.0100          |
| 1.7863        | 0.7040 | 1120 | 0.9924          |
| 1.4572        | 0.8045 | 1280 | 0.9861          |
| 1.8509        | 0.9051 | 1440 | 0.9783          |

### Framework versions

- PEFT 0.15.2
- Transformers 4.51.3
- Pytorch 2.5.1+cu124
- Datasets 3.5.0
- Tokenizers 0.21.1
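Since the card has no usage section yet, the following is a hedged generation sketch. It assumes `model` and `tokenizer` were built as in the loading sketch under the config above, and that the published tokenizer carries the ChatML chat template used during training (`chat_template: chatml`); the example prompt is purely illustrative.

```python
# Generation sketch (assumption: the ChatML chat template ships with the tokenizer).
import torch

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarise what QLoRA fine-tuning changes in a base model."},
]

# Build the prompt with the chat template configured at training time.
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,  # open the assistant turn
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    output = model.generate(input_ids, max_new_tokens=256)

# Strip the prompt tokens before decoding the reply.
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```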