About this model
My deepest gratitude for your effort Sir.
Please, is this an uncensored version of the -alpaca-65B model?
Grr. This model would not fit into my RAM. :( Is there any workaround to load big models such as this one on a PC with less RAM? I mean a solution other than buying more RAM lol. I wouldn't even mind if it's slow as hell if the model is good. :)
If you don't have enough RAM, the only solutions are caching to SSD or putting part of the model on the GPU...
Hmm. Would it work with AMD GPUs? I was under the impression only Nvidia GPUs are compatible. But I guess I could try the SSD solution. Is there a guide on how either of these could be done?
GGML models can now be accelerated with AMD GPUs, yes, using llama.cpp with OpenCL support. You can load as many layers onto the GPU as you have VRAM for, and that boosts inference speed.
However it does not help with RAM requirements. You still need just as much RAM as before. You just get faster inference.
I'm afraid there's no way around the RAM requirements. If you don't have enough RAM, I'd recommend trying a 33B model instead. There are many great ones out there, like WizardLM-30B-Uncensored and Guanaco-33B.
I've got the RAM for it, but what kind of improvement can we expect from 65B over 33B before I start downloading it (considering that QLoRA would be applied to both)?
I'm more in the 16 GB RAM camp. :( Okay, I said it. I guess 16 GB isn't as much as it used to be, huh? lol
My deepest gratitude for your effort Sir.
Please, is this an uncensored version of the -alpaca-65B model?
Yeah seems like it - "trained on a cleaned ShareGPT dataset"
Hey Bloke - I tested with the newest llama.cpp with cuBLAS: if you put layers on the GPU, less RAM is used.
For instance, if I put half the layers on the GPU and half on the CPU, the model takes only half the usual RAM; the rest is in VRAM.
llama.cpp hasn't been keeping a copy in RAM for some time now.
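For reference, the cuBLAS build I'm using is roughly this (a sketch for Linux, assuming the CUDA toolkit is already installed; on Windows you'd go through CMake instead):
```
# build llama.cpp with cuBLAS support (needs the NVIDIA CUDA toolkit)
make clean && LLAMA_CUBLAS=1 make
```
After that you just pass -ngl to put layers on the GPU.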
@mirek190
Oh, OK. So with e.g. -ngl 32, those 32 layers don't also need to be stored in RAM?
In which case, yeah, I need to mention that RAM requirements can be lower with GPU offload. It will be hard to give an exact RAM usage figure then, given different GPUs will be able to offload different numbers of layers. So I'll probably still list the maximum that could be required, maybe with some examples of usage at varying -ngl amounts.
I'll test later and let you know - today I finally got my RTX 3090 :)
Thank you!
mirek190, that sounds amazing, really promising, but let's say I would like to offload some of it to VRAM. Not that I'd have too much, but I think for slightly bigger models than the ones I usually use, my Radeon RX Vega 56 with 8 GB VRAM should be enough to give some boost to my 16 GB of RAM, right? How would I go about doing this? I've been using software such as Faraday.dev and Koboldcpp, but I'm not sure how to fully utilize the hardware using these methods and programs. Or is there something completely different that I should be using instead?
I use llama.cpp by adding "-ngl [number_of _layers]"- as parameter.
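For example, something like this (the model path and layer count are just placeholders, adjust to your VRAM):
```
# offload 40 layers to the GPU; -t is CPU threads, -n is tokens to generate
./main -m /path/to/model.q5_1.bin -t 8 -ngl 40 -n 256 -p "Hello"
```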
OK, I made tests... actually, while loading, the model reports how much RAM and VRAM will be used.
This is for NVIDIA cards... others unfortunately have poor support in the AI ecosystem... buuut you can try :)
For full offload on GPU with cuBLAS - q5_1 models:
7B - 32 layers - VRAM used: 4633 MB | RAM required = 1979.59 MB (+ 1026.00 MB per state) ~ 4.5 GB VRAM /1.9 GB RAM
13B - 40 layers - VRAM used: 9076 MB | RAM required = 2282.48 MB (+ 1608.00 MB per state) ~ 8.9 GB VRAM /2.1 GB RAM
30B - 60 layers - VRAM used: 22964 MB | RAM required = 2608.85 MB (+ 3124.00 MB per state) ~ 22.9 GB VRAM /2.5 GB RAM
65B - 80 layers - VRAM used: 46325 MB | RAM required = 3959.21 MB (+ 5120.00 MB per state) ~ 46.2 GB VRAM /3.9 GB RAM
For CPU only - a cuBLAS build that uses a bit of GPU VRAM to speed up prompt processing several times - q5_1 models:
7B - 32 layers - VRAM used: 0MB | RAM required = 6612.59 MB (+ 1026.00 MB per state) ~ 6.6 GB RAM
13B - 40 layers - VRAM used: 0 MB | RAM required = 11359.05 MB (+ 1608.00 MB per state) ~ 11.3 GB RAM
30B - 60 layers - VRAM used: 0 MB | RAM required = 25573.14 MB (+ 3124.00 MB per state) ~ 25.5 GB RAM
65B - 80 layers - VRAM used: 0 MB | RAM required = 50284.21 MB (+ 5120.00 MB per state) ~ 50.1 GB RAM
Thank you! That is great data to have.
I will have a think about how to update the README to explain this.
Since I have an AMD GPU, and cuBLAS is tied to CUDA, I guess I can't really use that option, and the option below doesn't seem to utilize GPU VRAM at all? Hmm. It's like I have all that extra VRAM and I can't even use it to speed things up. :(
You can! Fairly recently they added CLBLAST support, which uses OpenCL to provide GPU offloading similar to the CUDA support. It may not perform as well as CUDA, but it's still going to provide a benefit.
I use an AMD GPU at home on macOS and it works OK with GPU offloading with llama.cpp. Well, fairly well - there is some bug that causes it to crash, but that could be specific to macOS. I've not tried to investigate it yet.
First you need to install CLBlast, then you compile like so: make clean && LLAMA_CLBLAST=1 make
On macOS I installed CLBlast with: brew install clblast
On other platforms you'll need to look up how to do that. In Ubuntu 22.04 I think it's apt install libclblast1 libclblast-dev.
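Putting that together on Ubuntu, it should look roughly like this (package names as above; the model path and layer count are just placeholders):
```
# sketch: CLBlast build of llama.cpp on Ubuntu 22.04
sudo apt install libclblast1 libclblast-dev
make clean && LLAMA_CLBLAST=1 make
# then offload layers with -ngl as usual
./main -m /path/to/model.q5_1.bin -ngl 32 -p "Hello"
```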
I also made speed tests (ms/token) with q5_1 models:
nvidia rtx3090 - 24GB VRAM
7B - 32 layers - GPU offload 32 layers - 48.76 ms per token
13B - 40 layers - GPU offload 40 layers - 70.48 ms per token
30B - 60 layers - GPU offload 57 layers - 178.58 ms per token
65B - 80 layers - GPU offload 37 layers - 979.77 ms per token
In theory, IF I could place all layers of the 65B model in VRAM, I could achieve something around 320-370 ms/token :P
Faraday.dev is a closed-source tool, but it's an all-in-one tool which makes it easy for the end user to set up. It has a CLBlast feature, but it gives me errors when I use it. It also has OpenBLAS, which I believed should do the same trick, but while choosing that one doesn't give errors, it doesn't seem to do anything at all in terms of filling my VRAM with data, and the speed of generation feels pretty much the same. Koboldcpp can use both OpenBLAS and CLBlast (which for some reason is called the "Legacy" way), and when I use that, I can offload like 32 layers to VRAM and it even does seem to use VRAM, unlike Faraday.dev. Unfortunately, the speed of text generation itself actually seems inferior to both Faraday.dev and Koboldcpp without CLBlast, and I'm not sure whether putting some of the data in VRAM actually saves space in RAM to allow slightly more demanding models, because RAM usage seemed the same regardless of whether the CLBlast option was enabled. I started to ask myself if I'm doing something wrong at this point, or if the results really are highly hardware dependent, because I'm confused after seeing the results so far.
OpenBLAS does not help with GPU accel. llama.cpp can use OpenBLAS (LLAMA_OPENBLAS=1 make) and it's faster than not using it, but it's still CPU only.
For GPU accel the options currently are CUBLAS and CLBLAST.
I don't know anything about Faraday.dev, and I've never used KoboldCpp, so I can't speak to those.
But yeah, it wouldn't surprise me to learn that CLBlast isn't helping nearly as much as cuBLAS does.
...yeah, it's sad that AMD isn't developing tooling the way Nvidia does...
From my tests, offloading llama.cpp to GPU is definitely a must with -ngl, but I can't say I see any serious improvements unless I offload 50%+, if not 75%+, to VRAM. The more the better and in this case a headless configuration, or 2nd graphics card dedicated to display, would help squeeze every bit out of the main GPU used for compute.
NVIDIA is just reaming us all up where the sun does not shine, because they are raking astronomic profits on the same chips over and over again. All they do is make a single chip (despite calling it a "family"), then bin it and burn hardware fuses in it as desired to create gaming, compute, or other platform "solutions".
It's not so bad... the RTX 2080 had only 8 GB of VRAM, and the RTX 3090 was huge progress with 24 GB of VRAM... but the RTX 4090 should have at least 48 GB... maybe the 5090 :D
They will never do that because that would "threaten" their pricing structure. There's a reason they set it this way because you have to pay the premium to enter the compute club, for basically the same chip.
For example, they just announced GH200 series going into production and it can use NVLink interconnects to create a supercluster that has insane amount of VRAM and computer power: https://www.tomshardware.com/news/nvidia-unveils-dgx-gh200-supercomputer-and-mgx-systems-grace-hopper-superchips-in-production
3090s have (some kind of) NVLink too, but you can't connect two for example to double the compute power or get total 48GB VRAM, and I don't believe this is by accident. NVIDIA is not a charity and they don't care about small people trying to play with the AI in their homes.
I managed to run "VicUnlocked-alpaca-65B-QLoRA-GGML" with 32 GB of RAM, with the rest swapped to the hard drive. Judging by the processor usage, it's about 4 times slower than if I had enough RAM, but overall it works and I like the result!
How many tokens per second do you get?
Using only 64 GB RAM I get 1,200 ms/token, and with the RTX 3090 (39 of 80 layers) 700 ms/token.
68 seconds per token ??? ok ........................
OMG, guys! I have fantastic news! Well, for those who use GGML models with KoboldCpp at least! I just noticed that 3 days ago, guys working at KoboldCpp released a new version and this is one of the changes!
*Integrated the Clblast GPU offloading improvements from @0cc4m which allows you to have a layer fully stored in VRAM instead of keeping a duplicate copy in RAM. As a result, offloading GPU layers will reduce overall RAM used.
YES! I guess this is the kind of behavior you were describing earlier, right? Saving RAM. So in theory I should be able to run slightly more demanding models without issues, right? :D Well, honestly I'll be happy for any little boost I can get with the models I currently use.
Yes... that came from llama.cpp... it was ported to koboldcpp ;)
Yay! :D I can't wait to test it more, but loading the model into the new version with CLBlast enabled seems to save 32% of my RAM compared to the previous version!
I have RTX 3090 with 24GB of VRAM so fully loaded 65B q5_1 model takes only 30 GB of my RAM now ;)
Is there a way to use mac m2 gpu ?
Yes, but it may not help much at the moment.
You need to install CLBLAST, which can be done through Homebrew with:
brew install clblast
And then build llama.cpp like so:
LLAMA_CLBLAST=1 make
But on M1/M2 Mac, I don't think it will help at all.
What will help is a new feature being worked on which will provide full Metal acceleration. You can track its progress here: https://github.com/ggerganov/llama.cpp/pull/1642
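Judging by that PR, it looks like it will be enabled with a build flag along these lines, though that could change before it's merged:
```
# provisional sketch, based on the in-progress Metal PR linked above
LLAMA_METAL=1 make
```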
I have RTX 3090 with 24GB of VRAM so fully loaded 65B q5_1 model takes only 30 GB of my RAM now ;)
Very cool! What CPU do you have and what is performance like? What's your run time in ms?
The CPU is an i9-9900K, with 64GB of RAM.
With that 65B q5_1 model I can fit 37 layers on GPU.
Have around ~700ms/token
Waiting for better DDR4 at 4000 MHz... right now I have 3000 MHz with shitty timings...
So, I just tried to load a slightly bigger model into KoboldCpp. It is a model which wouldn't fit entirely into my RAM alone, but it should fit when both RAM and VRAM were used. Sadly, even with offloading into VRAM I'm getting this message:
GGML_ASSERT: ggml.c:3977: ctx->mem_buffer != NULL
I'm assuming this means that since the model is too big to load into RAM alone, it just won't let me load it even though I'm allowing it to use VRAM as well. :(
Hmm not sure. That might be a bug. Show everything you're running, command line and the full output. Copy and paste it here, surrounded by ``` tags
Sure, let me first give some hints from running koboldcpp.exe with the --help argument so that you know what's what (I believe you've mentioned somewhere else that you're not familiar with it). Here's the official koboldcpp.exe --help output:
```
usage: koboldcpp.exe [-h] [--model [MODEL]] [--port PORT] [--host HOST] [--launch] [--lora LORA] [--threads THREADS]
[--blasthreads [threads]] [--psutil_set_threads] [--highpriority]
[--contextsize {512,1024,2048,4096,8192}] [--blasbatchsize {-1,32,64,128,256,512,1024}]
[--stream] [--smartcontext] [--unbantokens] [--usemirostat [type] [tau] [eta]]
[--forceversion [version]] [--nommap] [--usemlock] [--noavx2] [--debugmode] [--skiplauncher]
[--hordeconfig [hordename] [[hordelength] ...]]
[--noblas | --useclblast {0,1,2,3,4,5,6,7,8} {0,1,2,3,4,5,6,7,8}] [--gpulayers [GPU layers]]
[model_param] [port_param]
KoboldCpp Server
positional arguments:
model_param Model file to load (positional)
port_param Port to listen on (positional)
optional arguments:
-h, --help show this help message and exit
--model [MODEL] Model file to load
--port PORT Port to listen on
--host HOST Host IP to listen on. If empty, all routable interfaces are accepted.
--launch Launches a web browser when load is completed.
--lora LORA LLAMA models only, applies a lora file on top of model. Experimental.
--threads THREADS Use a custom number of threads if specified. Otherwise, uses an amount based on CPU cores
--blasthreads [threads]
Use a different number of threads during BLAS if specified. Otherwise, has the same value as
--threads
--psutil_set_threads Experimental flag. If set, uses psutils to determine thread count based on physical cores.
--highpriority Experimental flag. If set, increases the process CPU priority, potentially speeding up
generation. Use caution.
--contextsize {512,1024,2048,4096,8192}
Controls the memory allocated for maximum context size, only change if you need more RAM for
big contexts. (default 2048)
--blasbatchsize {-1,32,64,128,256,512,1024}
Sets the batch size used in BLAS processing (default 512). Setting it to -1 disables BLAS
mode, but keeps other benefits like GPU offload.
--stream Uses pseudo streaming when generating tokens. Only for the Kobold Lite UI.
--smartcontext Reserving a portion of context to try processing less frequently.
--unbantokens Normally, KoboldAI prevents certain tokens such as EOS and Square Brackets. This flag unbans
them.
--usemirostat [type] [tau] [eta]
Experimental! Replaces your samplers with mirostat. Takes 3 params = [type(0/1/2), tau(5.0),
eta(0.1)].
--forceversion [version]
If the model file format detection fails (e.g. rogue modified model) you can set this to
override the detected format (enter desired version, e.g. 401 for GPTNeoX-Type2).
--nommap If set, do not use mmap to load newer models
--usemlock For Apple Systems. Force system to keep model in RAM rather than swapping or compressing
--noavx2 Do not use AVX2 instructions, a slower compatibility mode for older devices. Does not work
with --clblast.
--debugmode Shows additional debug info in the terminal.
--skiplauncher Doesn't display or use the new GUI launcher.
--hordeconfig [hordename] [[hordelength] ...]
Sets the display model name to something else, for easy use on AI Horde. An optional second
parameter sets the horde max gen length.
--noblas Do not use OpenBLAS for accelerated prompt ingestion
--useclblast {0,1,2,3,4,5,6,7,8} {0,1,2,3,4,5,6,7,8}
Use CLBlast instead of OpenBLAS for prompt ingestion. Must specify exactly 2 arguments,
platform ID and device ID (e.g. --useclblast 1 0).
--gpulayers [GPU layers]
Set number of layers to offload to GPU when using CLBlast. Requires CLBlast.
```
I tried to run KoboldCpp with the following parameters (I'm still playing with various settings to find out what works best for individual cases on my hardware):
```koboldcpp.exe --highpriority --threads 8 --blasthreads 3 --contextsize 2048 --smartcontext --stream --blasbatchsize -1 --useclblast 0 0 --gpulayers 30 --launch```
Running koboldcpp.exe like that makes it possible to choose which model to load, so I'm trying to load one that's slightly bigger than what would fit into my 16 GB RAM. The model is starchat-alpha-GGML.
And here is the complete output and the model obviously fails to load:
```Welcome to KoboldCpp - Version 1.28
For command line arguments, please refer to --help
Otherwise, please manually select ggml file:
Setting process to Higher Priority - Use Caution
High Priority for Windows Set: Priority.NORMAL_PRIORITY_CLASS to Priority.HIGH_PRIORITY_CLASS
Attempting to use CLBlast library for faster prompt ingestion. A compatible clblast will be required.
Initializing dynamic library: koboldcpp_clblast.dll
==========
Loading model: D:\AI_Models\Loose_models\starchat-alpha-GGML\starchat-alpha-ggml-q4_0.bin
[Threads: 8, BlasThreads: 3, SmartContext: True]
---
Identified as GPT-2 model: (ver 203)
Attempting to Load...
---
System Info: AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
gpt2_model_load: loading model from 'D:\AI_Models\Loose_models\starchat-alpha-GGML\starchat-alpha-ggml-q4_0.bin'
gpt2_model_load: n_vocab = 49156
gpt2_model_load: n_ctx = 8192
gpt2_model_load: n_embd = 6144
gpt2_model_load: n_head = 48
gpt2_model_load: n_layer = 40
gpt2_model_load: ftype = 2002
gpt2_model_load: qntvr = 2
gpt2_model_load: ggml ctx size = 17928.50 MB
Platform:0 Device:0 - AMD Accelerated Parallel Processing with gfx900
ggml_opencl: selecting platform: 'AMD Accelerated Parallel Processing'
ggml_opencl: selecting device: 'gfx900'
ggml_opencl: device FP16 support: true
CL FP16 temporarily disabled pending further optimization.
GGML_ASSERT: ggml.c:3977: ctx->mem_buffer != NULL```
I tried using different values for layers like 35 which I believe is as high as I can go on my GPU, but it still fails.
Oh. That's a different model type - StarCoder (BigCode). To my knowledge that doesn't support GPU offload. I think only Llama models support GPU offload at this time.
Damn, I was hoping to get that AI coding assistant for when the internet connection is down lol