Help with finetuning
Is it possible to finetune R1 (not using Unsloth, since it's single-GPU only)? I have a few questions:
Can I finetune the GGUF directly, as discussed in https://github.com/ggml-org/llama.cpp/discussions/6680? (The sketch after the code blocks below touches on the usual GGUF route.)
When I'm using QLoRA, does it work by:
Download model -> Quantize to Q4 -> Tune -> Upload adapters
OR
Download model -> Tune -> Quantize Adapters to Q4 -> Upload adapters
Basically, is it better to use QLoRA on a 16-bit or a 4-bit model? In other words, what's the difference between the following two calls (a sketch of the full QLoRA flow follows them):
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "deepseek-ai/DeepSeek-R1-0528-4bit",
    max_seq_length = max_seq_length,
    load_in_4bit = True,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)
vs
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "deepseek-ai/DeepSeek-R1-0528",
    max_seq_length = max_seq_length,
    load_in_4bit = True,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)
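
For what it's worth: in QLoRA the quantization happens once, at load time, and only touches the frozen base weights; the LoRA adapters are trained in 16-bit and are never quantized. So the flow is your first option, minus the final "quantize adapters" step. And loading a pre-quantized 4-bit checkpoint versus a 16-bit one with load_in_4bit = True generally ends in the same 4-bit weights in memory; the pre-quantized repo mainly saves download time and disk space. Here is a minimal sketch, assuming Unsloth's API, an illustrative distill checkpoint (the full R1-0528 is a 671B-parameter MoE and won't fit on a single GPU), and placeholder hyperparameters:

from unsloth import FastLanguageModel

max_seq_length = 2048

# 1) Download the 16-bit checkpoint and quantize it to 4-bit at load time.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/DeepSeek-R1-Distill-Llama-8B",  # illustrative choice
    max_seq_length = max_seq_length,
    load_in_4bit = True,
)

# 2) Attach LoRA adapters: the frozen base stays 4-bit, while the trainable
#    adapters are 16-bit and are never quantized.
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    lora_alpha = 16,
    lora_dropout = 0,
    bias = "none",
)

# 3) Train here, e.g. with trl's SFTTrainer (omitted for brevity).

# 4) Save (and upload) only the 16-bit adapters, not the 4-bit base.
model.save_pretrained("lora_adapters")
tokenizer.save_pretrained("lora_adapters")

# Re: finetuning a GGUF directly -- the usual route is to finetune the
# original checkpoint as above and export a GGUF afterwards, e.g.:
# model.save_pretrained_gguf("gguf_model", tokenizer, quantization_method = "q4_k_m")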
Multi-GPU actually works with accelerate, FYI, but we haven't officially announced it yet because we're working on a much better version.
When you use Unsloth, we convert the model from 16-bit to 4-bit on the fly for you, so you don't need a pre-quantized checkpoint.
Also see someone's multi-GPU repo for Unsloth, which can help: https://www.reddit.com/r/unsloth/comments/1l8mxkq/multigpu_support_how_to_make_your_unsloth/
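
In the meantime, here is a minimal, self-contained sketch of the accelerate route, with a toy model standing in for the LLM; this is an illustration of plain data-parallel training with Hugging Face Accelerate, not Unsloth's unreleased multi-GPU path:

import torch
from accelerate import Accelerator

# Run with:  accelerate launch --num_processes <num_gpus> train.py
accelerator = Accelerator()

model = torch.nn.Linear(10, 2)  # toy stand-in for the LLM
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
dataset = torch.utils.data.TensorDataset(torch.randn(64, 10),
                                         torch.randint(0, 2, (64,)))
loader = torch.utils.data.DataLoader(dataset, batch_size=8)

# prepare() wraps the model, optimizer, and dataloader for the current
# distributed setup and shards the batches across processes.
model, optimizer, loader = accelerator.prepare(model, optimizer, loader)

for inputs, labels in loader:
    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(inputs), labels)
    accelerator.backward(loss)  # handles gradient sync across GPUs
    optimizer.step()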