QLoRA Fine-tuning code
#2 · opened by kristaller486
Thanks for your model, it's impressive. I've tried to fine-tune this model (QLoRA) using the main branch of transformers, and it doesn't seem to work. Is there any working code for fine-tuning?
kristaller486 changed discussion title from "Fine-tuning code" to "QLoRA Fine-tuning code"
Hi
@kristaller486
Thanks for your issue - can you share the code you ran and the error you are facing?
Thanks for the answer! Here is the exception for non-quantized LoRA:
ubuntu@train:~$ uv run train.py
model.safetensors: 100%|██████████| 1.04G/1.04G [00:06<00:00, 161MB/s]
The fast path for FalconH1 will be used when running the model on a GPU
generation_config.json: 100%|██████████| 138/138 [00:00<00:00, 1.77MB/s]
tokenizer_config.json: 100%|██████████| 99.7k/99.7k [00:00<00:00, 1.66MB/s]
tokenizer.json: 100%|██████████| 2.35M/2.35M [00:00<00:00, 8.21MB/s]
special_tokens_map.json: 100%|██████████| 7.42k/7.42k [00:00<00:00, 67.8MB/s]
trainable params: 8,404,992 || all params: 529,816,096 || trainable%: 1.5864
README.md: 100%|██████████| 544/544 [00:00<00:00, 7.51MB/s]
train-00000-of-00001.parquet: 100%|██████████| 24.0M/24.0M [00:00<00:00, 47.3MB/s]
Generating train split: 100%|██████████| 13226/13226 [00:00<00:00, 102801.72 examples/s]
Map: 100%|██████████| 13226/13226 [00:02<00:00, 5771.84 examples/s]
Converting train dataset to ChatML: 100%|██████████| 90/90 [00:00<00:00, 9604.30 examples/s]
Adding EOS to train dataset: 100%|██████████| 90/90 [00:00<00:00, 9792.40 examples/s]
Tokenizing train dataset: 100%|██████████| 90/90 [00:00<00:00, 386.00 examples/s]
Truncating train dataset: 100%|██████████| 90/90 [00:00<00:00, 23200.01 examples/s]
Converting eval dataset to ChatML: 100%|██████████| 10/10 [00:00<00:00, 2732.09 examples/s]
Adding EOS to eval dataset: 100%|██████████| 10/10 [00:00<00:00, 2631.31 examples/s]
Tokenizing eval dataset: 100%|██████████| 10/10 [00:00<00:00, 420.45 examples/s]
Truncating eval dataset: 100%|██████████| 10/10 [00:00<00:00, 3233.35 examples/s]
No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.
0%| | 0/23 [00:00<?, ?it/s]FalconH1 requires an initialized `FalconHybridMambaAttentionDynamicCache` to return a cache. None was provided, so no cache will be returned.
Traceback (most recent call last):
File "/home/ubuntu/.venv/lib/python3.12/site-packages/triton/language/core.py", line 35, in wrapper
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/.venv/lib/python3.12/site-packages/triton/language/core.py", line 1548, in dot
return semantic.dot(input, other, acc, input_precision, max_num_imprecise_acc, out_dtype, _builder)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/.venv/lib/python3.12/site-packages/triton/language/semantic.py", line 1470, in dot
assert lhs.dtype == rhs.dtype, f"Both operands must be same dtype. Got {lhs.dtype} and {rhs.dtype}"
^^^^^^^^^^^^^^^^^^^^^^
AssertionError: Both operands must be same dtype. Got bf16 and fp32
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/ubuntu/tran.py", line 86, in <module>
trainer.train()
File "/home/ubuntu/.venv/lib/python3.12/site-packages/transformers/trainer.py", line 2240, in train
return inner_training_loop(
^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/.venv/lib/python3.12/site-packages/transformers/trainer.py", line 2555, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/.venv/lib/python3.12/site-packages/transformers/trainer.py", line 3791, in training_step
self.accelerator.backward(loss, **kwargs)
File "/home/ubuntu/.venv/lib/python3.12/site-packages/accelerate/accelerator.py", line 2473, in backward
loss.backward(**kwargs)
File "/home/ubuntu/.venv/lib/python3.12/site-packages/torch/_tensor.py", line 626, in backward
torch.autograd.backward(
File "/home/ubuntu/.venv/lib/python3.12/site-packages/torch/autograd/__init__.py", line 347, in backward
_engine_run_backward(
File "/home/ubuntu/.venv/lib/python3.12/site-packages/torch/autograd/graph.py", line 823, in _engine_run_backward
return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/.venv/lib/python3.12/site-packages/torch/autograd/function.py", line 307, in apply
return user_fn(self, *args)
^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/.venv/lib/python3.12/site-packages/torch/amp/autocast_mode.py", line 549, in decorate_bwd
return bwd(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/.venv/lib/python3.12/site-packages/mamba_ssm/ops/triton/ssd_combined.py", line 878, in backward
dx, ddt, dA, dB, dC, dD, dz, ddt_bias, dinitial_states, *rest = _mamba_chunk_scan_combined_bwd(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/.venv/lib/python3.12/site-packages/mamba_ssm/ops/triton/ssd_combined.py", line 391, in _mamba_chunk_scan_combined_bwd
dstates = _chunk_scan_bwd_dstates(C, dA_cumsum, dout, seq_idx=seq_idx, dtype=states.dtype)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/.venv/lib/python3.12/site-packages/mamba_ssm/ops/triton/ssd_chunk_scan.py", line 1408, in _chunk_scan_bwd_dstates
_chunk_scan_bwd_dstates_kernel[grid_dstates](
File "/home/ubuntu/.venv/lib/python3.12/site-packages/triton/runtime/jit.py", line 330, in <lambda>
return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/.venv/lib/python3.12/site-packages/triton/runtime/autotuner.py", line 186, in run
timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/.venv/lib/python3.12/site-packages/triton/runtime/autotuner.py", line 166, in _bench
return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/.venv/lib/python3.12/site-packages/triton/testing.py", line 117, in do_bench
fn()
File "/home/ubuntu/.venv/lib/python3.12/site-packages/triton/runtime/autotuner.py", line 152, in kernel_call
self.fn.run(
File "/home/ubuntu/.venv/lib/python3.12/site-packages/triton/runtime/jit.py", line 623, in run
kernel = self.compile(
^^^^^^^^^^^^^
File "/home/ubuntu/.venv/lib/python3.12/site-packages/triton/compiler/compiler.py", line 273, in compile
module = src.make_ir(options, codegen_fns, module_map, context)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/.venv/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir
return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
triton.compiler.errors.CompilationError: at 53:15:
seq_idx_prev = tl.load(seq_idx_ptr - stride_seq_idx_seqlen, mask=pid_c >= 1, other=0)
for k in range(0, chunk_size_limit, BLOCK_SIZE_K):
dout = tl.load(dout_ptrs, mask=(offs_m[:, None] < hdim) & (offs_k[None, :] < chunk_size_limit - k), other=0.0).to(tl.float32)
dA_cs_k = tl.load(dA_cumsum_ptrs, mask=offs_k < chunk_size - k, other=0.0).to(tl.float32)
if not HAS_SEQ_IDX:
scale_k = tl.exp(dA_cs_k)
else:
seq_idx_k = tl.load(seq_idx_ptrs, mask=offs_k < chunk_size_limit - k, other=-1)
scale_k = tl.where(seq_idx_k == seq_idx_prev, tl.exp(dA_cs_k), 0.0)
dout = (dout * scale_k).to(dout_ptr.dtype.element_ty)
c = tl.load(c_ptrs, mask=(offs_k[:, None] < chunk_size_limit - k) & (offs_n[None, :] < dstate), other=0.0)
acc += tl.dot(dout, c)
^
0%| | 0/23 [00:49<?, ?it/s]
And this is for QLoRA:
ubuntu@train:~$ uv run train.py
The fast path for FalconH1 will be used when running the model on a GPU
trainable params: 8,404,992 || all params: 529,816,096 || trainable%: 1.5864
Map: 100%|██████████| 13226/13226 [00:02<00:00, 6050.90 examples/s]
Converting train dataset to ChatML: 100%|██████████| 90/90 [00:00<00:00, 9604.54 examples/s]
Adding EOS to train dataset: 100%|██████████| 90/90 [00:00<00:00, 9370.65 examples/s]
Tokenizing train dataset: 100%|██████████| 90/90 [00:00<00:00, 390.54 examples/s]
Truncating train dataset: 100%|██████████| 90/90 [00:00<00:00, 22057.23 examples/s]
Converting eval dataset to ChatML: 100%|██████████| 10/10 [00:00<00:00, 2733.16 examples/s]
Adding EOS to eval dataset: 100%|██████████| 10/10 [00:00<00:00, 2690.21 examples/s]
Tokenizing eval dataset: 100%|██████████| 10/10 [00:00<00:00, 326.59 examples/s]
Truncating eval dataset: 100%|██████████| 10/10 [00:00<00:00, 2742.09 examples/s]
No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.
0%| | 0/23 [00:00<?, ?it/s]FalconH1 requires an initialized `FalconHybridMambaAttentionDynamicCache` to return a cache. None was provided, so no cache will be returned.
Traceback (most recent call last):
File "/home/ubuntu/train.py", line 86, in <module>
trainer.train()
File "/home/ubuntu/.venv/lib/python3.12/site-packages/transformers/trainer.py", line 2240, in train
return inner_training_loop(
^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/.venv/lib/python3.12/site-packages/transformers/trainer.py", line 2555, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/.venv/lib/python3.12/site-packages/transformers/trainer.py", line 3745, in training_step
loss = self.compute_loss(model, inputs, num_items_in_batch=num_items_in_batch)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/.venv/lib/python3.12/site-packages/trl/trainer/sft_trainer.py", line 654, in compute_loss
(loss, outputs) = super().compute_loss(
^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/.venv/lib/python3.12/site-packages/transformers/trainer.py", line 3810, in compute_loss
outputs = model(**inputs)
^^^^^^^^^^^^^^^
File "/home/ubuntu/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/.venv/lib/python3.12/site-packages/accelerate/utils/operations.py", line 818, in forward
return model_forward(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/.venv/lib/python3.12/site-packages/accelerate/utils/operations.py", line 806, in __call__
return convert_to_fp32(self.model_forward(*args, **kwargs))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/.venv/lib/python3.12/site-packages/torch/amp/autocast_mode.py", line 44, in decorate_autocast
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/.venv/lib/python3.12/site-packages/peft/peft_model.py", line 1757, in forward
return self.base_model(
^^^^^^^^^^^^^^^^
File "/home/ubuntu/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/.venv/lib/python3.12/site-packages/peft/tuners/tuners_utils.py", line 193, in forward
return self.model.forward(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/.venv/lib/python3.12/site-packages/transformers/utils/deprecation.py", line 172, in wrapped_func
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/.venv/lib/python3.12/site-packages/transformers/models/falcon_h1/modeling_falcon_h1.py", line 1588, in forward
outputs = self.model(
^^^^^^^^^^^
File "/home/ubuntu/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/.venv/lib/python3.12/site-packages/transformers/models/falcon_h1/modeling_falcon_h1.py", line 1331, in forward
layer_outputs = decoder_layer(
^^^^^^^^^^^^^^
File "/home/ubuntu/.venv/lib/python3.12/site-packages/transformers/modeling_layers.py", line 48, in __call__
return super().__call__(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/.venv/lib/python3.12/site-packages/transformers/models/falcon_h1/modeling_falcon_h1.py", line 1120, in forward
mamba_hidden_states = self.mamba(
^^^^^^^^^^^
File "/home/ubuntu/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/.venv/lib/python3.12/site-packages/transformers/models/falcon_h1/modeling_falcon_h1.py", line 1015, in forward
return self.cuda_kernels_forward(hidden_states, cache_params, cache_position, attention_mask)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/.venv/lib/python3.12/site-packages/transformers/models/falcon_h1/modeling_falcon_h1.py", line 694, in cuda_kernels_forward
out = mamba_split_conv1d_scan_combined(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/.venv/lib/python3.12/site-packages/mamba_ssm/ops/triton/ssd_combined.py", line 930, in mamba_split_conv1d_scan_combined
return MambaSplitConv1dScanCombinedFn.apply(zxbcdt, conv1d_weight, conv1d_bias, dt_bias, A, D, chunk_size, initial_states, seq_idx, dt_limit, return_final_states, activation, rmsnorm_weight, rmsnorm_eps, outproj_weight, outproj_bias, headdim, ngroups, norm_before_gate)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/.venv/lib/python3.12/site-packages/torch/autograd/function.py", line 575, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/.venv/lib/python3.12/site-packages/torch/amp/autocast_mode.py", line 503, in decorate_fwd
return fwd(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/.venv/lib/python3.12/site-packages/mamba_ssm/ops/triton/ssd_combined.py", line 819, in forward
out = F.linear(out, outproj_weight, outproj_bias)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: mat1 and mat2 shapes cannot be multiplied (1024x1536 and 1x786432)
0%| | 0/23 [00:35<?, ?it/s]
Here is the code:
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from trl import SFTConfig, SFTTrainer
from peft import LoraConfig, get_peft_model, TaskType

model_name = "tiiuae/Falcon-H1-0.5B-Base"

use_quantization = False
bnb_config = None
if use_quantization:
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_use_double_quant=True,
        bnb_4bit_compute_dtype=torch.bfloat16
    )

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map='auto',
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.chat_template = "{{bos_token}}{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant' }}{% endif %}"

peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.1,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"]
)

model = get_peft_model(model, peft_config)
model.print_trainable_parameters()

dataset = load_dataset('Vikhrmodels/It_hard_4.1', split='train')

def format_data(example):
    conversation = [{'role': item['role'], 'content': item['content']}
                    for item in example['conversation']]
    example['text'] = tokenizer.apply_chat_template(conversation, tokenize=False)
    return example

dataset = dataset.map(format_data)
dataset = dataset.select(range(100))
train_test = dataset.train_test_split(test_size=0.1)

training_args = SFTConfig(
    num_train_epochs=1,
    learning_rate=5e-5,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    output_dir="./lora_output",
    logging_steps=10,
    save_steps=100,
    bf16=True,
    max_length=1024,
    remove_unused_columns=True,
)

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=train_test['train'],
    eval_dataset=train_test['test'],
    processing_class=tokenizer,
    peft_config=peft_config
)

trainer.train()
model.save_pretrained("./lora_model")
Environment
ubuntu@train:~$ uv pip freeze
accelerate==1.7.0
aiohappyeyeballs==2.6.1
aiohttp==3.11.18
aiosignal==1.3.2
attrs==25.3.0
bitsandbytes==0.45.5
causal-conv1d==1.5.0.post8
certifi==2025.4.26
charset-normalizer==3.4.2
datasets==3.6.0
dill==0.3.8
einops==0.8.1
filelock==3.18.0
frozenlist==1.6.0
fsspec==2025.3.0
huggingface-hub==0.31.4
idna==3.10
jinja2==3.1.6
mamba-ssm==2.2.4
markdown-it-py==3.0.0
markupsafe==3.0.2
mdurl==0.1.2
mpmath==1.3.0
multidict==6.4.4
multiprocess==0.70.16
networkx==3.4.2
ninja==1.11.1.4
numpy==2.2.6
nvidia-cublas-cu12==12.4.5.8
nvidia-cuda-cupti-cu12==12.4.127
nvidia-cuda-nvrtc-cu12==12.4.127
nvidia-cuda-runtime-cu12==12.4.127
nvidia-cudnn-cu12==9.1.0.70
nvidia-cufft-cu12==11.2.1.3
nvidia-curand-cu12==10.3.5.147
nvidia-cusolver-cu12==11.6.1.9
nvidia-cusparse-cu12==12.3.1.170
nvidia-cusparselt-cu12==0.6.2
nvidia-nccl-cu12==2.21.5
nvidia-nvjitlink-cu12==12.4.127
nvidia-nvtx-cu12==12.4.127
packaging==25.0
pandas==2.2.3
peft==0.15.2
propcache==0.3.1
psutil==7.0.0
pyarrow==20.0.0
pygments==2.19.1
python-dateutil==2.9.0.post0
pytz==2025.2
pyyaml==6.0.2
regex==2024.11.6
requests==2.32.3
rich==14.0.0
safetensors==0.5.3
setuptools==80.8.0
six==1.17.0
sympy==1.13.1
tokenizers==0.21.1
torch==2.6.0
tqdm==4.67.1
transformers @ git+https://github.com/huggingface/transformers@b01984a51daa00337c6f0b7018f9569f51517e1b
triton==3.2.0
trl==0.17.0
typing-extensions==4.13.2
tzdata==2025.2
urllib3==2.4.0
xxhash==3.5.0
yarl==1.20.0
ubuntu@train:~$ nvidia-smi
Fri May 23 11:04:12 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03 Driver Version: 560.35.03 CUDA Version: 12.6 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 4090 Off | 00000000:05:00.0 Off | Off |
| 0% 33C P8 18W / 450W | 2MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
Hi
@kristaller486
Thank you for sharing the script, this was very useful!
I managed to make it work with the following configuration:
- transformers installed from source:
pip install git+https://github.com/huggingface/transformers.git
- mamba-ssm / causal-conv1d installed from the latest PyPI release:
pip install mamba-ssm causal-conv1d --no-build-isolation
In order to fix your issue, you need to explicitly set the dtype of the loaded model to bfloat16 or "auto".
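For example, with the "auto" option the loading call would look like this (the bfloat16 variant is shown in the fixed script below):

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map='auto',
    torch_dtype="auto",  # or torch_dtype=torch.bfloat16, as in the full script below
)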
Fixed script
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from trl import SFTConfig, SFTTrainer
from peft import LoraConfig, get_peft_model, TaskType

model_name = "tiiuae/Falcon-H1-0.5B-Base"

use_quantization = False
bnb_config = None
if use_quantization:
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_use_double_quant=True,
        bnb_4bit_compute_dtype=torch.bfloat16
    )

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map='auto',
+   torch_dtype=torch.bfloat16
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.chat_template = "{{bos_token}}{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant' }}{% endif %}"

peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.1,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"]
)

model = get_peft_model(model, peft_config)
model.print_trainable_parameters()

dataset = load_dataset('Vikhrmodels/It_hard_4.1', split='train')

def format_data(example):
    conversation = [{'role': item['role'], 'content': item['content']}
                    for item in example['conversation']]
    example['text'] = tokenizer.apply_chat_template(conversation, tokenize=False)
    return example

dataset = dataset.map(format_data)
dataset = dataset.select(range(100))
train_test = dataset.train_test_split(test_size=0.1)

training_args = SFTConfig(
    num_train_epochs=1,
    learning_rate=5e-5,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    output_dir="./lora_output",
    logging_steps=10,
    save_steps=100,
    bf16=True,
    max_length=1024,
    remove_unused_columns=True,
)

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=train_test['train'],
    eval_dataset=train_test['test'],
    processing_class=tokenizer,
    peft_config=peft_config
)

trainer.train()
model.save_pretrained("./lora_model")
For the quantized (QLoRA) case, you additionally need to exclude the Mamba output projection from 4-bit quantization: the fused Mamba kernel calls F.linear directly on the out_proj weight, which fails when that weight is a packed 4-bit tensor (the shape error in your second traceback):

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
+   llm_int8_skip_modules=["out_proj"]
)
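Putting both changes together, a minimal QLoRA loading sketch (same variables as the script above, with use_quantization=True) would be:

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    llm_int8_skip_modules=["out_proj"],  # keep the Mamba out_proj unquantized
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
    torch_dtype=torch.bfloat16,  # explicit dtype, same as the non-quantized fix
)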
We will detail all these good practices in https://tiiuae.github.io/Falcon-H1/ very soon, stay tuned!
Thanks for the help, everything works now!
kristaller486 changed discussion status to closed