Please share feedback here!
If you’ve tested any of the initial GGUFs, we’d really appreciate your feedback! Let us know if you encountered any issues, what went wrong, or how things could be improved. Also, feel free to share your inference speed results!
Is it working for you?
Q8_0, Llama.cpp:
llama_model_load: error loading model: check_tensor_dims: tensor 'blk.0.attn_q_b.weight' has wrong shape; expected 1536, 73728, got 1536, 24576, 1, 1
llama_model_load_from_file_impl: failed to load model
Could you try updating llama.cpp to the latest version?
Yes, resolved, thank you!
The system prompt is added between the BOS token and the user role token, right? It seems to work really well!
I suggest you state where the system prompt should be inserted in the prompt template, so it's clear for text-completion users and anyone not using an auto-tokenizer.
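For text-completion users, here is a minimal sketch of the layout, based on the chat template example that llama.cpp prints later in this thread and the add_bos_token = true flag in the GGUF metadata; the example strings are placeholders, so treat anything beyond the special tags as an assumption:

# Hand-built DeepSeek-R1 prompt for plain text completion. BOS is added automatically
# by the loader (add_bos_token = true), so the system text goes first, directly
# before the first <|User|> tag.
system = "You are a helpful assistant"   # example system prompt
user = "Hello"                           # example user message
prompt = f"{system}<|User|>{user}<|Assistant|>"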
I've tested the UD-Q3_K_XL in llama.cpp (Ubuntu), and it works great. I'm testing with a context size of around 14000.
Add a Q1 quant, i.e. 1-bit, as well.
Yo, DeepSeek-V2-Lite 16B needs to be GGUF'ed!
I meant yo.
Add a Q1 quant, i.e. 1-bit, as well.
It's uploading.
Add a Q1 quant, i.e. 1-bit, as well.
They're up now!
Ran the original Unsloth DeepSeek R1 quant with 2x 3090s and 128 GB of RAM and didn't get much in the way of tokens, 2-3/s. Interested to see how the new Unsloth Dynamic 2.0 GGUFs stack up with the smarter layer placement and all that.
If you're not on the ik_llama.cpp fork, you're missing out.
Why are these sizes substantially larger than the other ones? For example, UD-Q3_K_XL: the original is 273 GB vs. 350 GB for this one.
Hi, thanks for all that, stellar work. I'm trying the smallest R1 to see what tokens/s I get on an MBP M2 with 96 GB of RAM.
I'm following this
https://unsloth.ai/blog/deepseek-r1-0528
I ran into this problem:
ljubomir@macbook2(:):~/llama.cpp$ build/bin/llama-cli \
--model models/DeepSeek-R1-0528-UD-IQ1_S-00001-of-00004.gguf \
--cache-type-k q4_0 \
--prio 3 \
--temp 0.6 \
--top_p 0.95 \
--min_p 0.01 \
--ctx-size 16384 \
--prompt "<|User|>Write a Python program that shows 20 balls bouncing inside a spinning heptagon:\n- All balls have the same radius.\n- All balls have a number on it from 1 to 20.\n- All balls drop from the heptagon center when starting.\n- Colors are: #f8b862, #f6ad49, #f39800, #f08300, #ec6d51, #ee7948, #ed6d3d, #ec6800, #ec6800, #ee7800, #eb6238, #ea5506, #ea5506, #eb6101, #e49e61, #e45e32, #e17b34, #dd7a56, #db8449, #d66a35\n- The balls should be affected by gravity and friction, and they must bounce off the rotating walls realistically. There should also be collisions between balls.\n- The material of all the balls determines that their impact bounce height will not exceed the radius of the heptagon, but higher than ball radius.\n- All balls rotate with friction, the numbers on the ball can be used to indicate the spin of the ball.\n- The heptagon is spinning around its center, and the speed of spinning is 360 degrees per 5 seconds.\n- The heptagon size should be large enough to contain all the balls.\n- Do not use the pygame library; implement collision detection algorithms and collision response etc. by yourself. The following Python libraries are allowed: tkinter, math, numpy, dataclasses, typing, sys.\n- All codes should be put in a single Python file.<|Assistant|>"
build: 5626 (bc1007a4) with Apple clang version 17.0.0 (clang-1700.0.13.5) for arm64-apple-darwin24.4.0
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device Metal (Apple M2 Max) - 73727 MiB free
llama_model_load: error loading model: corrupted model: 1086 tensors expected but 978 found
llama_model_load_from_file_impl: failed to load model
common_init_from_params: failed to load model 'models/DeepSeek-R1-0528-UD-IQ1_S-00001-of-00004.gguf'
main: error: unable to load model
I suspected some of the files didn't download correctly - they looked like this:
ljubomir@macbook2(:):~/llama.cpp$ ls -al models/DeepSeek-R1-0528-UD-IQ1_S-0000*
-rw-r--r--@ 1 ljubomir staff 49462945024 30 May 13:13 models/DeepSeek-R1-0528-UD-IQ1_S-00001-of-00004.gguf
-rw-r--r--@ 1 ljubomir staff 48568885664 30 May 14:14 models/DeepSeek-R1-0528-UD-IQ1_S-00002-of-00004.gguf
-rw-r--r--@ 1 ljubomir staff 49564076576 30 May 15:30 models/DeepSeek-R1-0528-UD-IQ1_S-00003-of-00004.gguf
-rw-r--r--@ 1 ljubomir staff 19455845600 30 May 16:46 models/DeepSeek-R1-0528-UD-IQ1_S-00004-of-00004.gguf
Is it possible to see the exact file sizes, to the byte, in the Hugging Face web UI? Or could you put the sizes, maybe even a checksum like an MD5 sum, in a separate file?
Then it got worse. I thought there must be some way to download incrementally that would be smart enough to figure out which file is truncated, and maybe even download just the missing part, like rsync would. So I asked Gemini, and it suggested:
from huggingface_hub import snapshot_download
# This will download the entire 'UD-IQ1_S' folder and its contents
# It will create a directory like 'models/unsloth/DeepSeek-R1-0528-GGUF/UD-IQ1_S'
local_dir = snapshot_download(
repo_id="unsloth/DeepSeek-R1-0528-GGUF",
allow_patterns="UD-IQ1_S/*", # Only download files within the UD-IQ1_S folder
local_dir="models/unsloth/DeepSeek-R1-0528-GGUF", # The base directory to download to
local_dir_use_symlinks=False # Important for full copy
)
print(f"Downloaded model to: {local_dir}")
I moved the existing files into a newly created dir, models/unsloth/DeepSeek-R1-0528-GGUF/UD-IQ1_S, and ran the above in IPython.
Well - turns out it wiped the files completely and it's downloading from scratch now! :-) Haha - expected better than that, tbh. We really do need AI, because at the moment our stuff is AS - Artificially Stupid, haha :-) No worries, it's chugging along now and will finish. But if you could publish the file sizes somewhere, or even better their checksums too, so we know when the big files have downloaded correctly, that would be stellar!
Thanks for everything you do guys! It's been great running stuff on localhost, been enjoying it immensely. :-)
Ignore the previous comment; it seems I can't edit or delete it anymore?
I previously had trouble downloading the files and making sure they were downloaded correctly. This may help someone else - here's what worked for me:
- Use wget to download; it can resume a failed transfer (with -c)
ljubomir@macbook2(:):~/llama.cpp/models$ wget 'https://huggingface.co/unsloth/DeepSeek-R1-0528-GGUF/resolve/main/UD-IQ1_S/DeepSeek-R1-0528-UD-IQ1_S-00001-of-00004.gguf'
Length: 49094698368 (46G)
Saving to: ‘DeepSeek-R1-0528-UD-IQ1_S-00001-of-00004.gguf’
- The checksum is at
https://huggingface.co/unsloth/DeepSeek-R1-0528-GGUF/blob/main/UD-IQ1_S/DeepSeek-R1-0528-UD-IQ1_S-00001-of-00004.gguf
Git LFS Details
SHA256: 19891f28c27908e2ba0402ecc15c7aaa7e48ab5d9e1c6d49096c42e74e8b16b8
Pointer size: 136 Bytes
Size of remote file: 49.1 GB
Xet backed hash: 229375f805e68a1006bcdbd96cea8f23ebabe02f9c7bd6a27598ec0a40c1df0b
- Compute and compare
(torch) ljubomir@macbook2(:):~/llama.cpp$ sha256 models/DeepSeek-R1-0528-UD-IQ1_S-00001-of-00004.gguf
SHA256 (models/DeepSeek-R1-0528-UD-IQ1_S-00001-of-00004.gguf) = 19891f28c27908e2ba0402ecc15c7aaa7e48ab5d9e1c6d49096c42e74e8b16b8
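If anyone wants to script that check, here is a minimal Python sketch doing the same thing, chunked so the 46 GB file is never read into memory at once; the path and expected hash are the ones from the post above:

# Chunked SHA-256 of a large GGUF part, compared against the "Git LFS Details"
# value shown on the Hugging Face file page.
import hashlib

def sha256_of(path, chunk_size=1 << 20):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

expected = "19891f28c27908e2ba0402ecc15c7aaa7e48ab5d9e1c6d49096c42e74e8b16b8"
print(sha256_of("models/DeepSeek-R1-0528-UD-IQ1_S-00001-of-00004.gguf") == expected)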
I used the IQ3_XXS quant with llama.cpp.
It works fine - much better than the previous model.
LD_PRELOAD=/home/mycat7/amd-blis/lib/ILP64/libaocl-libmem.so \
./llama-cli --model /home/mycat7/LLM/DeepSeek-R1-0528-GGUF/UD-IQ3_XXS/DeepSeek-R1-0528-UD-IQ3_XXS-00001-of-00006.gguf \
    --threads 16 --ctx-size 8192 --seed -1 --n-gpu-layers 5 --prio 2 \
    --cache-type-k q8_0 --top_p 0.95 --top_k 20 --min_p 0.0 --temp 0.6 -cnv
I asked it to generate Beethoven's "Für Elise" in the ChucK music language.
You can listen to the generated music here:
https://smartai.f5.si/
Tested the UD-IQ3_XXS quant. All working fine.
The model itself is interesting; much longer outputs are possible than with the first R1.
Update - alas, it seems the 170 GB of weights cannot be run with 96 GB of RAM on a MacBook (I imagine only about 3/4 of that is usable as VRAM), even when mmap-ed and read-only. I don't see why macOS wouldn't simply page data in and out of RAM as needed when it's merely mapped into the address space. TBH I expected it not to fail outright - I expected it to work, even if super slowly, so that I'd have to kill the process or (more likely) power off the machine once it got too stuck.
I put the error into Gemini but didn't learn anything about how to make it run. Here it is:
ljubomir@macbook2(:):~/llama.cpp$ build/bin/llama-cli \
--model models/DeepSeek-R1-0528-UD-IQ1_S-00001-of-00004.gguf \
--cache-type-k q4_0 \
--prio 3 \
--temp 0.6 \
--top_p 0.95 \
--min_p 0.01 \
--ctx-size 16384 \
--prompt "<|User|>Write a Python program that shows 20 balls bouncing inside a spinning heptagon:\n- All balls have the same radius.\n- All balls have a number on it from 1 to 20.\n- All balls drop from the heptagon center when starting.\n- Colors are: #f8b862, #f6ad49, #f39800, #f08300, #ec6d51, #ee7948, #ed6d3d, #ec6800, #ec6800, #ee7800, #eb6238, #ea5506, #ea5506, #eb6101, #e49e61, #e45e32, #e17b34, #dd7a56, #db8449, #d66a35\n- The balls should be affected by gravity and friction, and they must bounce off the rotating walls realistically. There should also be collisions between balls.\n- The material of all the balls determines that their impact bounce height will not exceed the radius of the heptagon, but higher than ball radius.\n- All balls rotate with friction, the numbers on the ball can be used to indicate the spin of the ball.\n- The heptagon is spinning around its center, and the speed of spinning is 360 degrees per 5 seconds.\n- The heptagon size should be large enough to contain all the balls.\n- Do not use the pygame library; implement collision detection algorithms and collision response etc. by yourself. The following Python libraries are allowed: tkinter, math, numpy, dataclasses, typing, sys.\n- All codes should be put in a single Python file.<|Assistant|>"
build: 5626 (bc1007a4) with Apple clang version 17.0.0 (clang-1700.0.13.5) for arm64-apple-darwin24.4.0
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device Metal (Apple M2 Max) - 73727 MiB free
llama_model_loader: additional 3 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 62 key-value pairs and 1086 tensors from models/DeepSeek-R1-0528-UD-IQ1_S-00001-of-00004.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = deepseek2
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Deepseek-R1-0528
llama_model_loader: - kv 3: general.basename str = Deepseek-R1-0528
llama_model_loader: - kv 4: general.quantized_by str = Unsloth
llama_model_loader: - kv 5: general.size_label str = 256x20B
llama_model_loader: - kv 6: general.license str = mit
llama_model_loader: - kv 7: general.repo_url str = https://huggingface.co/unsloth
llama_model_loader: - kv 8: general.base_model.count u32 = 1
llama_model_loader: - kv 9: general.base_model.0.name str = DeepSeek R1 0528
llama_model_loader: - kv 10: general.base_model.0.version str = 0528
llama_model_loader: - kv 11: general.base_model.0.organization str = Deepseek Ai
llama_model_loader: - kv 12: general.base_model.0.repo_url str = https://huggingface.co/deepseek-ai/De...
llama_model_loader: - kv 13: general.tags arr[str,1] = ["unsloth"]
llama_model_loader: - kv 14: deepseek2.block_count u32 = 61
llama_model_loader: - kv 15: deepseek2.context_length u32 = 163840
llama_model_loader: - kv 16: deepseek2.embedding_length u32 = 7168
llama_model_loader: - kv 17: deepseek2.feed_forward_length u32 = 18432
llama_model_loader: - kv 18: deepseek2.attention.head_count u32 = 128
llama_model_loader: - kv 19: deepseek2.attention.head_count_kv u32 = 1
llama_model_loader: - kv 20: deepseek2.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 21: deepseek2.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 22: deepseek2.expert_used_count u32 = 8
llama_model_loader: - kv 23: deepseek2.leading_dense_block_count u32 = 3
llama_model_loader: - kv 24: deepseek2.vocab_size u32 = 129280
llama_model_loader: - kv 25: deepseek2.attention.q_lora_rank u32 = 1536
llama_model_loader: - kv 26: deepseek2.attention.kv_lora_rank u32 = 512
llama_model_loader: - kv 27: deepseek2.attention.key_length u32 = 576
llama_model_loader: - kv 28: deepseek2.attention.value_length u32 = 512
llama_model_loader: - kv 29: deepseek2.attention.key_length_mla u32 = 192
llama_model_loader: - kv 30: deepseek2.attention.value_length_mla u32 = 128
llama_model_loader: - kv 31: deepseek2.expert_feed_forward_length u32 = 2048
llama_model_loader: - kv 32: deepseek2.expert_count u32 = 256
llama_model_loader: - kv 33: deepseek2.expert_shared_count u32 = 1
llama_model_loader: - kv 34: deepseek2.expert_weights_scale f32 = 2.500000
llama_model_loader: - kv 35: deepseek2.expert_weights_norm bool = true
llama_model_loader: - kv 36: deepseek2.expert_gating_func u32 = 2
llama_model_loader: - kv 37: deepseek2.rope.dimension_count u32 = 64
llama_model_loader: - kv 38: deepseek2.rope.scaling.type str = yarn
llama_model_loader: - kv 39: deepseek2.rope.scaling.factor f32 = 40.000000
llama_model_loader: - kv 40: deepseek2.rope.scaling.original_context_length u32 = 4096
llama_model_loader: - kv 41: deepseek2.rope.scaling.yarn_log_multiplier f32 = 0.100000
llama_model_loader: - kv 42: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 43: tokenizer.ggml.pre str = deepseek-v3
llama_model_loader: - kv 44: tokenizer.ggml.tokens arr[str,129280] = ["<|begin▁of▁sentence|>", "<�...
llama_model_loader: - kv 45: tokenizer.ggml.token_type arr[i32,129280] = [3, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 46: tokenizer.ggml.merges arr[str,127741] = ["Ġ t", "Ġ a", "i n", "Ġ Ġ", "h e...
llama_model_loader: - kv 47: tokenizer.ggml.bos_token_id u32 = 0
llama_model_loader: - kv 48: tokenizer.ggml.eos_token_id u32 = 1
llama_model_loader: - kv 49: tokenizer.ggml.padding_token_id u32 = 2
llama_model_loader: - kv 50: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 51: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 52: tokenizer.chat_template str = {% if not add_generation_prompt is de...
llama_model_loader: - kv 53: general.quantization_version u32 = 2
llama_model_loader: - kv 54: general.file_type u32 = 24
llama_model_loader: - kv 55: quantize.imatrix.file str = DeepSeek-R1-0528-GGUF/imatrix_unsloth...
llama_model_loader: - kv 56: quantize.imatrix.dataset str = unsloth_calibration_DeepSeek-R1-0528-...
llama_model_loader: - kv 57: quantize.imatrix.entries_count i32 = 659
llama_model_loader: - kv 58: quantize.imatrix.chunks_count i32 = 720
llama_model_loader: - kv 59: split.no u16 = 0
llama_model_loader: - kv 60: split.tensors.count i32 = 1086
llama_model_loader: - kv 61: split.count u16 = 4
llama_model_loader: - type f32: 361 tensors
llama_model_loader: - type q8_0: 122 tensors
llama_model_loader: - type q4_K: 56 tensors
llama_model_loader: - type q5_K: 36 tensors
llama_model_loader: - type q6_K: 17 tensors
llama_model_loader: - type iq2_xxs: 24 tensors
llama_model_loader: - type iq3_xxs: 49 tensors
llama_model_loader: - type iq1_s: 126 tensors
llama_model_loader: - type iq3_s: 154 tensors
llama_model_loader: - type iq4_xs: 141 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = IQ1_S - 1.5625 bpw
print_info: file size = 156.72 GiB (2.01 BPW)
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 818
load: token to piece cache size = 0.8223 MB
print_info: arch = deepseek2
print_info: vocab_only = 0
print_info: n_ctx_train = 163840
print_info: n_embd = 7168
print_info: n_layer = 61
print_info: n_head = 128
print_info: n_head_kv = 1
print_info: n_rot = 64
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 576
print_info: n_embd_head_v = 512
print_info: n_gqa = 128
print_info: n_embd_k_gqa = 576
print_info: n_embd_v_gqa = 512
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 18432
print_info: n_expert = 256
print_info: n_expert_used = 8
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 0
print_info: rope scaling = yarn
print_info: freq_base_train = 10000.0
print_info: freq_scale_train = 0.025
print_info: n_ctx_orig_yarn = 4096
print_info: rope_finetuned = unknown
print_info: ssm_d_conv = 0
print_info: ssm_d_inner = 0
print_info: ssm_d_state = 0
print_info: ssm_dt_rank = 0
print_info: ssm_dt_b_c_rms = 0
print_info: model type = 671B
print_info: model params = 671.03 B
print_info: general.name = Deepseek-R1-0528
print_info: n_layer_dense_lead = 3
print_info: n_lora_q = 1536
print_info: n_lora_kv = 512
print_info: n_embd_head_k_mla = 192
print_info: n_embd_head_v_mla = 128
print_info: n_ff_exp = 2048
print_info: n_expert_shared = 1
print_info: expert_weights_scale = 2.5
print_info: expert_weights_norm = 1
print_info: expert_gating_func = sigmoid
print_info: rope_yarn_log_mul = 0.1000
print_info: vocab type = BPE
print_info: n_vocab = 129280
print_info: n_merges = 127741
print_info: BOS token = 0 '<|begin▁of▁sentence|>'
print_info: EOS token = 1 '<|end▁of▁sentence|>'
print_info: EOT token = 1 '<|end▁of▁sentence|>'
print_info: PAD token = 2 '<|▁pad▁|>'
print_info: LF token = 201 'Ċ'
print_info: FIM PRE token = 128801 '<|fim▁begin|>'
print_info: FIM SUF token = 128800 '<|fim▁hole|>'
print_info: FIM MID token = 128802 '<|fim▁end|>'
print_info: EOG token = 1 '<|end▁of▁sentence|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
ggml_backend_metal_log_allocated_size: warning: current allocated size is greater than the recommended max working set size
ggml_backend_metal_log_allocated_size: warning: current allocated size is greater than the recommended max working set size
ggml_backend_metal_log_allocated_size: warning: current allocated size is greater than the recommended max working set size
load_tensors: offloading 61 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 62/62 layers to GPU
load_tensors: Metal_Mapped model buffer size = 46815.35 MiB
load_tensors: Metal_Mapped model buffer size = 47469.88 MiB
load_tensors: Metal_Mapped model buffer size = 47641.07 MiB
load_tensors: Metal_Mapped model buffer size = 18554.54 MiB
load_tensors: CPU_Mapped model buffer size = 497.11 MiB
.................................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 16384
llama_context: n_ctx_per_seq = 16384
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 0
llama_context: freq_base = 10000.0
llama_context: freq_scale = 0.025
llama_context: n_ctx_per_seq (16384) < n_ctx_train (163840) -- the full capacity of the model will not be utilized
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M2 Max
ggml_metal_init: picking default device: Apple M2 Max
ggml_metal_load_library: using embedded metal library
ggml_metal_init: GPU name: Apple M2 Max
ggml_metal_init: GPU family: MTLGPUFamilyApple8 (1008)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3 (5001)
ggml_metal_init: simdgroup reduction = true
ggml_metal_init: simdgroup matrix mul. = true
ggml_metal_init: has residency sets = true
ggml_metal_init: has bfloat = true
ggml_metal_init: use bfloat = false
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: recommendedMaxWorkingSetSize = 77309.41 MB
ggml_metal_init: skipping kernel_get_rows_bf16 (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_1row (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_l4 (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_bf16 (not supported)
ggml_metal_init: skipping kernel_mul_mv_id_bf16_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mm_bf16_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_bf16_f16 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h64 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h80 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h96 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h112 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h128 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h192 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_hk192_hv128 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h256 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_hk576_hv512 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h64 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h96 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h128 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h192 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_hk192_hv128 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h256 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_hk576_hv512 (not supported)
ggml_metal_init: skipping kernel_cpy_f32_bf16 (not supported)
ggml_metal_init: skipping kernel_cpy_bf16_f32 (not supported)
ggml_metal_init: skipping kernel_cpy_bf16_bf16 (not supported)
llama_context: CPU output buffer size = 0.49 MiB
llama_kv_cache_unified: Metal KV buffer size = 1284.81 MiB
llama_kv_cache_unified: size = 1284.81 MiB ( 16384 cells, 61 layers, 1 seqs), K (q4_0): 308.81 MiB, V (f16): 976.00 MiB
llama_context: Metal compute buffer size = 4522.00 MiB
llama_context: CPU compute buffer size = 46.01 MiB
llama_context: graph nodes = 4964
llama_context: graph splits = 2
common_init_from_params: setting dry_penalty_last_n to ctx_size = 16384
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
ggml_metal_graph_compute: command buffer 0 failed with status 5
error: Insufficient Memory (00000008:kIOGPUCommandBufferCallbackErrorOutOfMemory)
graph_compute: ggml_backend_sched_graph_compute_async failed with error -1
llama_decode: failed to decode, ret = -3
main: llama threadpool init, n_threads = 8
main: chat template is available, enabling conversation mode (disable it with -no-cnv)
*** User-specified prompt will pre-start conversation, did you mean to set --system-prompt (-sys) instead?
main: chat template example:
You are a helpful assistant
<|User|>Hello<|Assistant|>Hi there<|end▁of▁sentence|><|User|>How are you?<|Assistant|>
system_info: n_threads = 8 (n_threads_batch = 8) / 12 | Metal : EMBED_LIBRARY = 1 | CPU : ARM_FMA = 1 | FP16_VA = 1 | MATMUL_INT8 = 1 | DOTPROD = 1 | ACCELERATE = 1 | AARCH64_REPACK = 1 |
main: interactive mode on.
sampler seed: 3358851179
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 16384
top_k = 40, top_p = 0.950, min_p = 0.010, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.600
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 16384, n_batch = 2048, n_predict = -1, n_keep = 1
== Running in interactive mode. ==
- Press Ctrl+C to interject at any time.
- Press Return to return control to the AI.
- To return control without starting a new line, end your input with '/'.
- If you want to submit another line, end your input with '\'.
- Not using system message. To change it, set a different value via -sys PROMPT
Write a Python program that shows 20 balls bouncing inside a spinning heptagon:
- All balls have the same radius.
- All balls have a number on it from 1 to 20.
- All balls drop from the heptagon center when starting.
- Colors are: #f8b862, #f6ad49, #f39800, #f08300, #ec6d51, #ee7948, #ed6d3d, #ec6800, #ec6800, #ee7800, #eb6238, #ea5506, #ea5506, #eb6101, #e49e61, #e45e32, #e17b34, #dd7a56, #db8449, #d66a35
- The balls should be affected by gravity and friction, and they must bounce off the rotating walls realistically. There should also be collisions between balls.
- The material of all the balls determines that their impact bounce height will not exceed the radius of the heptagon, but higher than ball radius.
- All balls rotate with friction, the numbers on the ball can be used to indicate the spin of the ball.
- The heptagon is spinning around its center, and the speed of spinning is 360 degrees per 5 seconds.
- The heptagon size should be large enough to contain all the balls.
- Do not use the pygame library; implement collision detection algorithms and collision response etc. by yourself. The following Python libraries are allowed: tkinter, math, numpy, dataclasses, typing, sys.
- All codes should be put in a single Python file.ggml_metal_graph_compute: command buffer 0 failed with status 5
error: Insufficient Memory (00000008:kIOGPUCommandBufferCallbackErrorOutOfMemory)
graph_compute: ggml_backend_sched_graph_compute_async failed with error -1
llama_decode: failed to decode, ret = -3
main : failed to eval
ggml_metal_free: deallocating
Thanks for everything you do guys! Top marks! Have been enjoying this. :-) Will try again in the future on a bigger box.
Drop your context size down a lot. I'm running this with the five GGUF splits merged into one big GGUF. I'm not familiar with the Mac M2, but if you don't have CUDA, try adding a swap file to cover the missing RAM and see if it helps at all. The below works for me, CPU only.
cd ~/llama.cpp/build &&
./bin/llama-cli \
--model /usr/share/ollama/.ollama/DeepSeek-R1-0528/MERGE-DS_0528.guff \
--cache-type-k q4_0 \
--threads -1 \
--n-gpu-layers 0 \
--temp 0.6 \
--top_p 0.95 \
--min_p 0.01 \
--ctx-size 8192 \
--seed 3407 \
--prompt "<|User|>Write a Python program that shows 20 balls bouncing inside a spinning heptagon:\n- All balls have the same radius.\n- All balls have a number on it from 1 to 20.\n- All balls drop from the heptagon center when starting.\n- Colors are: #f8b862, #f6ad49, #f39800, #f08300, #ec6d51, #ee7948, #ed6d3d, #ec6800, #ec6800, #ee7800, #eb6238, #ea5506, #ea5506, #eb6101, #e49e61, #e45e32, #e17b34, #dd7a56, #db8449, #d66a35\n- The balls should be affected by gravity and friction, and they must bounce off the rotating walls realistically. There should also be collisions between balls.\n- The material of all the balls determines that their impact bounce height will not exceed the radius of the heptagon, but higher than ball radius.\n- All balls rotate with friction, the numbers on the ball can be used to indicate the spin of the ball.\n- The heptagon is spinning around its center, and the speed of spinning is 360 degrees per 5 seconds.\n- The heptagon size should be large enough to contain all the balls.\n- Do not use the pygame library; implement collision detection algorithms and collision response etc. by yourself. The following Python libraries are allowed: tkinter, math, numpy, dataclasses, typing, sys.\n- All codes should be put in a single Python file.<|Assistant|>"
Hi there! Not entirely related, but does someone know how to disable thinking? Either on llama-server directly, or llama-server + SillyTavern. I could test Tuesday or Wednesday!
It depends on what you're using to run this. I've heard /no_think in the system and user prompt works with Qwen 3. But for DeepSeek, what I've noticed is: in the llama.cpp server, go to the web interface and use completion mode instead of chat mode, and start the script yourself so the model just keeps plugging away at the code without the planning/thinking phase. So for the prompt, do this:
Python script for a Flappy Bird game using Pygame.
Features:
1. Must use pygame.
2. The background color should be randomly chosen and is a light shade. Start with a light blue color.
3. Pressing SPACE multiple times will accelerate the bird.
4. The bird's shape should be randomly chosen as a square, circle or triangle. The color should be randomly chosen as a dark color.
5. Place on the bottom some land colored as dark brown or yellow chosen randomly.
6. Make a score shown on the top right side. Increment if you pass pipes and don't hit them.
7. Make randomly spaced pipes with enough space. Color them randomly as dark green or light brown or a dark gray shade.
8. When you lose, show the best score. Make the text inside the screen. Pressing q or Esc will quit the game. Restarting is pressing SPACE again. The final game should be inside a markdown section in Python. Check your code for errors and fix them before the final markdown section.
Start of the Python code:
import pygame
import random
import sys
import os
# Initialize pygame
pygame.init()
# Screen dimensions
WIDTH, HEIGHT = 800, 600
screen = pygame.display.set_mode((WIDTH, HEIGHT))
pygame.display.set_caption("Flappy Bird")
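On the disable-thinking question itself: another commonly reported trick for R1-style models (an assumption on my part, untested here) is to pre-fill an empty think block at the start of the assistant turn in completion mode, so the model skips straight to the answer. A rough sketch of such a prompt:

# Hypothetical completion-mode prompt that pre-fills an empty think block (untested sketch).
prompt = (
    "You are a helpful assistant"
    "<|User|>Summarize this thread in one sentence."
    "<|Assistant|><think>\n\n</think>"
)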
It needs serious refinement. I've tested Q6_K, which uses ~555 GB of RAM, in text-generation-webui 3.4.1.
I set the temperature to 0.6 as advised. The first problem is that it thinks too much and for too long. In my test of writing a ChucK song it basically got stuck in an endless loop of thinking about which note to use.
The test of repairing broken code was so-so: it almost got there but failed, although it produced four slightly different code versions on the first try, all of which failed to run due to logic errors.
Anyway, the thinking portion here is too big and too long. I sense somewhat better quality than the original R1, but it needs serious refinement to produce good results at Q6; I can't imagine what it's like at lower quality. I thought it was hallucinating in my first test, which it shouldn't at this quality level.
Maybe it's a problem with the model launcher? I'll try others like LM Studio and KoboldCpp and report back if the results differ.
It's not getting stuck in loops for me so far, either on the API or with UD-Q2_K_XL.
0.6 temp, top-p 0.95, and repetition penalty 1 (off). I use llama-server (llama.cpp) and mikupad.
I asked it for a ChucK song on the API version and it didn't get stuck. This is the song (I don't know if it works): https://pastebin.com/m3dQeu3u
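For anyone wanting to reproduce those sampling settings programmatically against llama-server rather than through mikupad, here is a minimal sketch, assuming the server is running locally on its default port 8080 and exposing the OpenAI-compatible endpoint; the prompt is just a placeholder:

# Query a local llama-server with temp 0.6 / top-p 0.95 over its OpenAI-compatible API.
import requests

resp = requests.post(
    "http://127.0.0.1:8080/v1/chat/completions",
    json={
        "messages": [{"role": "user", "content": "Write a short ChucK melody."}],
        "temperature": 0.6,
        "top_p": 0.95,
    },
    timeout=600,
)
print(resp.json()["choices"][0]["message"]["content"])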
Do you have a decoder for your model naming system? Say I'm trying to find Dynamic 2.0 quants and can't for the life of me figure out which ones are the dynamic quants. Are all the new ones, such as DeepSeek R1-0528, Dynamic 2.0 quantized?
# Unsloth config file
wget https://huggingface.co/unsloth/DeepSeek-R1-0528-GGUF/resolve/main/config.json
# Original tokenizer
wget https://huggingface.co/deepseek-ai/DeepSeek-R1-0528/resolve/main/tokenizer.json
wget https://huggingface.co/deepseek-ai/DeepSeek-R1-0528/resolve/main/tokenizer_config.json
# Original model files
wget https://huggingface.co/deepseek-ai/DeepSeek-R1-0528/resolve/main/modeling_deepseek.py
wget https://huggingface.co/deepseek-ai/DeepSeek-R1-0528/resolve/main/generation_config.json
wget https://huggingface.co/deepseek-ai/DeepSeek-R1-0528/resolve/main/configuration_deepseek.py
mv config.json tokenizer.json tokenizer_config.json modeling_deepseek.py generation_config.json configuration_deepseek.py /data2/jcxy/llm_model/DeepSeek-R1-0528-GGUF-UD-IQ2_XXS/.
MODEL_PATH="/data2/jcxy/llm_model/DeepSeek-R1-0528-GGUF-UD-IQ2_XXS/DeepSeek-R1-0528-UD-IQ2_XXS-00001-of-00005.gguf"
LOG_FILE="vllm.log"
#export VLLM_ATTENTION_BACKEND=FLASHINFER /data2/jcxy/llm_model/PsyLLM4.5-Medium-2025-03-27-Instruct-SFT
export VLLM_USE_V1=0
# --cpu-offload-gb 80
SERVED_MODEL_NAME="DeepSeek-R1-0528"
export CUDA_VISIBLE_DEVICES=2,3,4,5
# Run the serve command
nohup vllm serve \
    "$MODEL_PATH" \
    --hf-config-path /data2/jcxy/llm_model/DeepSeek-R1-0528-GGUF-UD-IQ2_XXS \
    --tokenizer /data2/jcxy/llm_model/DeepSeek-R1-0528-GGUF-UD-IQ2_XXS \
    --served-model-name "$SERVED_MODEL_NAME" \
    --trust-remote-code \
    --port 6011 \
    --host 0.0.0.0 \
    --dtype auto \
    --max-model-len 8192 \
    --gpu_memory_utilization 0.98 \
    --tensor_parallel_size 4 \
    --enable-prefix-caching \
    >"$LOG_FILE" 2>&1 &
INFO 06-03 09:41:22 [__init__.py:239] Automatically detected platform cuda.
INFO 06-03 09:41:26 [api_server.py:1043] vLLM API server version 0.8.5.post1
INFO 06-03 09:41:26 [api_server.py:1044] args: Namespace(subparser='serve', model_tag='/data2/jcxy/llm_model/DeepSeek-R1-0528-GGUF-UD-IQ2_XXS/DeepSeek-R1-0528-UD-IQ2_XXS-00001-of-00005.gguf', config='', host='0.0.0.0', port=6011, uvicorn_log_level='info', disable_uvicorn_access_log=False, allow_credentials=False, allowed_origins=[''], allowed_methods=[''], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, enable_ssl_refresh=False, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='/data2/jcxy/llm_model/DeepSeek-R1-0528-GGUF-UD-IQ2_XXS/DeepSeek-R1-0528-UD-IQ2_XXS-00001-of-00005.gguf', task='auto', tokenizer='/data2/jcxy/llm_model/DeepSeek-R1-0528-GGUF-UD-IQ2_XXS', hf_config_path='/data2/jcxy/llm_model/DeepSeek-R1-0528-GGUF-UD-IQ2_XXS', skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, allowed_local_media_path=None, load_format='auto', download_dir=None, model_loader_extra_config={}, use_tqdm_on_load=True, config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', max_model_len=8192, guided_decoding_backend='xgrammar', reasoning_parser=None, logits_processor_pattern=None, model_impl='auto', distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=4, data_parallel_size=1, enable_expert_parallel=False, max_parallel_loading_workers=None, ray_workers_use_nsight=False, disable_custom_all_reduce=False, block_size=None, gpu_memory_utilization=0.98, swap_space=4, kv_cache_dtype='auto', num_gpu_blocks_override=None, enable_prefix_caching=True, prefix_caching_hash_algo='builtin', cpu_offload_gb=0, calculate_kv_scales=False, disable_sliding_window=False, use_v2_block_manager=True, seed=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_token=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config={}, limit_mm_per_prompt={}, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=None, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=None, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', speculative_config=None, ignore_patterns=[], served_model_name=['DeepSeek-R1-0528'], qlora_adapter_name_or_path=None, show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, max_num_batched_tokens=None, max_num_seqs=None, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, num_lookahead_slots=0, scheduler_delay_factor=0.0, preemption_mode=None, num_scheduler_steps=1, multi_step_stream_outputs=True, scheduling_policy='fcfs', enable_chunked_prefill=None, disable_chunked_mm_input=False, scheduler_cls='vllm.core.scheduler.Scheduler', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', worker_extension_cls='', generation_config='auto', 
override_generation_config=None, enable_sleep_mode=False, additional_config=None, enable_reasoning=False, disable_cascade_attn=False, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, enable_server_load_tracking=False, dispatch_function=<function ServeSubcommand.cmd at 0x7fdfd4b5ca60>)
INFO 06-03 09:41:26 [config.py:209] Replacing legacy 'type' key with 'rope_type'
INFO 06-03 09:41:32 [config.py:717] This model supports multiple tasks: {'classify', 'score', 'reward', 'generate', 'embed'}. Defaulting to 'generate'.
Traceback (most recent call last):
File "/data/jcxy/haolu/anaconda3/envs/haolu/bin/vllm", line 8, in
sys.exit(main())
File "/data/jcxy/haolu/anaconda3/envs/haolu/lib/python3.10/site-packages/vllm/entrypoints/cli/main.py", line 53, in main
args.dispatch_function(args)
File "/data/jcxy/haolu/anaconda3/envs/haolu/lib/python3.10/site-packages/vllm/entrypoints/cli/serve.py", line 27, in cmd
uvloop.run(run_server(args))
File "/data/jcxy/haolu/anaconda3/envs/haolu/lib/python3.10/site-packages/uvloop/init.py", line 82, in run
return loop.run_until_complete(wrapper())
File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
File "/data/jcxy/haolu/anaconda3/envs/haolu/lib/python3.10/site-packages/uvloop/init.py", line 61, in wrapper
return await main
File "/data/jcxy/haolu/anaconda3/envs/haolu/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 1078, in run_server
async with build_async_engine_client(args) as engine_client:
File "/data/jcxy/haolu/anaconda3/envs/haolu/lib/python3.10/contextlib.py", line 199, in aenter
return await anext(self.gen)
File "/data/jcxy/haolu/anaconda3/envs/haolu/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 146, in build_async_engine_client
async with build_async_engine_client_from_engine_args(
File "/data/jcxy/haolu/anaconda3/envs/haolu/lib/python3.10/contextlib.py", line 199, in aenter
return await anext(self.gen)
File "/data/jcxy/haolu/anaconda3/envs/haolu/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 166, in build_async_engine_client_from_engine_args
vllm_config = engine_args.create_engine_config(usage_context=usage_context)
File "/data/jcxy/haolu/anaconda3/envs/haolu/lib/python3.10/site-packages/vllm/engine/arg_utils.py", line 1099, in create_engine_config
model_config = self.create_model_config()
File "/data/jcxy/haolu/anaconda3/envs/haolu/lib/python3.10/site-packages/vllm/engine/arg_utils.py", line 987, in create_model_config
return ModelConfig(
File "/data/jcxy/haolu/anaconda3/envs/haolu/lib/python3.10/site-packages/vllm/config.py", line 546, in init
self._verify_quantization()
File "/data/jcxy/haolu/anaconda3/envs/haolu/lib/python3.10/site-packages/vllm/config.py", line 816, in _verify_quantization
raise ValueError(
ValueError: Quantization method specified in the model config (fp8) does not match the quantization method specified in the quantization
argument (gguf).
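For what it's worth, the error is about the original DeepSeek config.json copied in above, which declares FP8 quantization, conflicting with vLLM's gguf quantization argument. A hypothetical workaround (untested here, and vLLM's GGUF support for this architecture may have other limits) is to strip that block from the local copy:

# Untested sketch: remove the quantization_config block from the locally copied
# config.json so vLLM's GGUF loader does not see a conflicting "fp8" method.
import json

path = "/data2/jcxy/llm_model/DeepSeek-R1-0528-GGUF-UD-IQ2_XXS/config.json"
with open(path) as f:
    cfg = json.load(f)
cfg.pop("quantization_config", None)
with open(path, "w") as f:
    json.dump(cfg, f, indent=2)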
Do you have a decoder for your model naming system? Say I'm trying to find Dynamic 2.0 quants and can't for the life of me figure out which ones are the dynamic quants. Are all the new ones, such as DeepSeek R1-0528, Dynamic 2.0 quantized?
"UD" in the model / model sub-directory name e.g. "UD-Q5_K_XL" is (where used) an abbreviation for "unsloth dynamic" but that doesn't designate a version e.g. UD original vs. UD 2.0.
I'm going to assume they use the latest version of their dynamic quanting unless they say otherwise. There would be no reason to use a 'worse' version.