Hi, what is the performance of this model?
Can you please share the wandb page showing how this model's train/loss and validation/loss decreased?
And also the number of steps trained, or the epochs?
I see 4 datasets are used, so were the 4 datasets used completely, or partially?
I haven't saved to wandb yet; it was a trial run to check whether the Colab notebook works for training. I'm planning to do a proper fine-tune of a 12B Q4 and post the wandb page. I trained on a mix of 100k examples from the merged datasets (170k examples in total), 2500 steps, linear scheduler. The proper run will be 2 epochs, cosine, with the same number of steps and examples.
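Roughly, the scheduler/steps part of the config looks like this (only the steps, epochs, scheduler and wandb logging reflect what I described above; output dir, batch sizes and learning rate are placeholders, not my actual settings):

```python
# Scheduler/steps sketch; output_dir, batch sizes and learning rate are placeholders.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="gemma3-12b-qlora",       # placeholder path
    max_steps=2500,                      # trial run used 2500 steps with a linear scheduler
    lr_scheduler_type="cosine",          # proper run: cosine instead of linear
    num_train_epochs=2,                  # proper run: 2 epochs
    learning_rate=2e-4,                  # placeholder
    per_device_train_batch_size=2,       # placeholder
    gradient_accumulation_steps=8,       # placeholder
    logging_steps=10,
    report_to="wandb",                   # so train/loss and eval/loss show up on a wandb page
)
```

One thing to keep in mind: when max_steps is set, the HF Trainer ignores num_train_epochs, so the proper run will really be driven by whichever of the two I keep.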
Ok, good.
I am working in the same domain. I am very new to all of this and would like some of your feedback.
Context -
I have an RTX 3060 6GB VRAM laptop and I can only run the Q4 Gemma 3 4B model.
I have access to the MIMIC-IV dataset and its clinical discharge notes (4,000 tokens on average, max 13,000 tokens, minimum 600 tokens). I can't make any of that data or the trained weights public due to HIPAA restrictions.
My aim is to train Gemma 3 so that it can create artificial discharge notes, and I have to run inference on a private server or locally.
What I have done so far -
My instruction inputs for training are the disease descriptions that I want in the discharge note. On average they are around 60,000 tokens, which is what I will use during inference with the full 128,000-token context model.
So I created a smaller instruction set with disease instructions of 600 to 2,000 tokens.
But I can't use more than a 2048-token sequence length in Google Colab (instruction + response combined).
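If it helps to be concrete, one way to build that kind of length-bounded subset is just to filter by tokenized length - the column names, file path and tokenizer id below are placeholders, not necessarily what I actually used:

```python
# Keep only the pairs whose combined length fits the sequence budget.
# "instruction"/"response" column names, the jsonl path, and the tokenizer
# checkpoint are placeholders.
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-4b-it")
dataset = load_dataset("json", data_files="discharge_pairs.jsonl", split="train")

MAX_LEN = 2048  # instruction + response combined, the Colab limit

def fits_budget(example):
    text = example["instruction"] + example["response"]
    return len(tokenizer(text)["input_ids"]) <= MAX_LEN

small_set = dataset.filter(fits_budget)
print(f"{len(small_set)} of {len(dataset)} examples fit within {MAX_LEN} tokens")
```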
I got access to some free credits for an H100 80GB GPU, which I spent entirely on training the Gemma 3 4B model, using the Unsloth bnb 4-bit IT checkpoint, rank 16, alpha 32 -
cosine schedule, 1 epoch. First I trained on a small subset of the dataset - totalling 3,500 tokens (instruction + response) - around 25,000 rows. I used a max sequence length of 3,500 for training.
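Schematically, the setup followed the usual Unsloth notebook pattern, something like this (the model id and target modules here are placeholders written from memory, not copied from my script):

```python
# LoRA setup as described: 4-bit checkpoint, rank 16, alpha 32, 3500 max seq length.
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/gemma-3-4b-it-bnb-4bit",  # placeholder for the exact Unsloth repo id
    max_seq_length=3500,
    load_in_4bit=True,
)

model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=32,
    lora_dropout=0.0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing="unsloth",  # important for fitting long sequences
)
```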
The training loss decreased -
Then I trained on another subset of 8,000 tokens total, which covers 80% of my data. Here, the training loss fluctuated but the evaluation loss decreased, due to the high variance and noise in the training data (different diseases in the discharge notes) - around 40,000 training rows, 500 validation rows. I used a max sequence length of 8,000 for training.
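The eval split was wired in the usual TRL way, roughly as below (argument names shift a bit between trl/transformers versions, so treat this as a schematic rather than my literal script; train_ds / val_ds are assumed to be the prepared datasets with a "text" column):

```python
# Second run, schematically: ~40k train rows, 500 validation rows, eval loss logged during training.
from trl import SFTTrainer
from transformers import TrainingArguments

trainer = SFTTrainer(
    model=model,
    processing_class=tokenizer,   # "tokenizer=" on older trl versions
    train_dataset=train_ds,       # ~40,000 rows
    eval_dataset=val_ds,          # ~500 rows
    args=TrainingArguments(
        output_dir="gemma3-4b-discharge",  # placeholder
        num_train_epochs=1,
        lr_scheduler_type="cosine",
        eval_strategy="steps",             # "evaluation_strategy" on older transformers
        eval_steps=100,                    # placeholder cadence
        logging_steps=10,
        report_to="none",
    ),
)
trainer.train()
```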
I have run out of all my credits. I have tested the model once, and it has learned the format of the discharge note pretty well. I will give it 60,000-token instructions, which will have all the information it needs to generate the discharge note. Can you please give your feedback on the process I have followed, and more insight into what else I can do / what I am doing wrong?
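For that inference step, the plan is roughly the following (variable names are placeholders; `model`/`tokenizer` come from the Unsloth setup above, `long_disease_description` is the ~60k-token instruction, and the model would need to be loaded with a max_seq_length large enough to hold it plus the generated note):

```python
# Planned inference: one long instruction in, a discharge note out.
from unsloth import FastLanguageModel

FastLanguageModel.for_inference(model)  # switch the PEFT model to inference mode

messages = [{"role": "user", "content": long_disease_description}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(inputs, max_new_tokens=4096, temperature=0.7)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```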
Generating accurate synthetic discharge notes is part of my master's thesis at my university. I have to train another classifier on top of these synthetic notes to predict the diseases in the discharge notes - the reverse of what we have done here. So that's the ultimate goal, if you need more context to guide me.
Amazing work! What I have done for training is create a local runtime in Colab, using my two 3090s. If you want, I could give you the link for the runtime. Also, with 16 GB VRAM you can run a 12B Q4 as well; I have another PC with an RTX 5080 (also, sadly, 16 GB VRAM) and it runs OK.
Oh awesome, can we connect somewhere privately then?
This is my linkedin - https://www.linkedin.com/in/stabgan/
Unsloth only supports 1 GPU as of now. I tried utilising the 2 T4 GPUs that Kaggle gives for free, but after a few hours I found out that the free version of Unsloth is capped to using only 1 GPU.
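So on Kaggle the practical workaround was just to pin everything to a single visible GPU, e.g. something like:

```python
# Workaround for the single-GPU limitation: expose only one of the two T4s.
# This has to run before torch / unsloth are imported anywhere in the notebook.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import torch
print(torch.cuda.device_count())  # should now report 1
```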
One of your 3090s has 24 GB VRAM, I presume? If we can connect over LinkedIn and you can share the runtime with me, I can try fine-tuning it a bit more with the same Unsloth environment that you're using.
The installation of the dependencies won't change
And the problem is the context window. If you increase the context window (max seq length), the amount of VRAM required is enormous; it increases quadratically because of the attention layers.
On the H100 80GB GPU I was using an 8,000 context window, with gradient checkpointing on, with the Gemma 3 4B model, and it was consuming almost 69 GB during fine-tuning.
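Just to illustrate the scaling (the layer/head counts below are placeholders, not the real Gemma 3 4B config, and fused attention kernels like FlashAttention avoid materializing these matrices at all, so the 69 GB also includes weights, optimizer state and other activations):

```python
# Rough illustration of the quadratic term: if the (seq_len x seq_len) attention
# score matrices were materialized naively, their memory alone grows with seq_len**2.
def naive_attention_scores_gib(seq_len, n_layers=30, n_heads=8, bytes_per_val=2):
    return n_layers * n_heads * seq_len * seq_len * bytes_per_val / 1024**3

for s in (2048, 3500, 8000):
    print(f"{s:>5} tokens -> ~{naive_attention_scores_gib(s):.1f} GiB of attention scores")
```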