llama_model_load: error loading model: error loading model architecture: unknown model architecture: 'gemma3n'

#3
by Yutis - opened

I am using the Google Colab notebook provided in this HF repo (https://colab.research.google.com/#fileId=https%3A//huggingface.co/unsloth/gemma-3n-E2B-it-GGUF.ipynb) but am getting the error below:

./gemma-3n-E2B-it-UD-IQ2_XXS.gguf: 100% 2.05G/2.05G [00:29<00:00, 54.2MB/s]
llama_model_loader: loaded meta data with 51 key-value pairs and 727 tensors from /root/.cache/huggingface/hub/models--unsloth--gemma-3n-E2B-it-GGUF/snapshots/4a272108ff14d0e1cef82baa8025a60a91307348/./gemma-3n-E2B-it-UD-IQ2_XXS.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = gemma3n
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Gemma-3N-E2B-It
llama_model_loader: - kv   3:                           general.finetune str              = 3n-E2B-it
llama_model_loader: - kv   4:                           general.basename str              = Gemma-3N-E2B-It
llama_model_loader: - kv   5:                       general.quantized_by str              = Unsloth
llama_model_loader: - kv   6:                         general.size_label str              = 4.5B
llama_model_loader: - kv   7:                            general.license str              = gemma
llama_model_loader: - kv   8:                           general.repo_url str              = https://huggingface.co/unsloth
llama_model_loader: - kv   9:                   general.base_model.count u32              = 1
llama_model_loader: - kv  10:                  general.base_model.0.name str              = Gemma 3n E2B It
llama_model_loader: - kv  11:          general.base_model.0.organization str              = Google
llama_model_loader: - kv  12:              general.base_model.0.repo_url str              = https://huggingface.co/google/gemma-3...
llama_model_loader: - kv  13:                               general.tags arr[str,6]       = ["automatic-speech-recognition", "uns...
llama_model_loader: - kv  14:                     gemma3n.context_length u32              = 32768
llama_model_loader: - kv  15:                   gemma3n.embedding_length u32              = 2048
llama_model_loader: - kv  16:                        gemma3n.block_count u32              = 30
llama_model_loader: - kv  17:                gemma3n.feed_forward_length arr[i32,30]      = [8192, 8192, 8192, 8192, 8192, 8192, ...
llama_model_loader: - kv  18:               gemma3n.attention.head_count u32              = 8
llama_model_loader: - kv  19:   gemma3n.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  20:               gemma3n.attention.key_length u32              = 256
llama_model_loader: - kv  21:             gemma3n.attention.value_length u32              = 256
llama_model_loader: - kv  22:                     gemma3n.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  23:           gemma3n.attention.sliding_window u32              = 512
llama_model_loader: - kv  24:            gemma3n.attention.head_count_kv u32              = 2
llama_model_loader: - kv  25:                   gemma3n.altup.active_idx u32              = 0
llama_model_loader: - kv  26:                   gemma3n.altup.num_inputs u32              = 4
llama_model_loader: - kv  27:   gemma3n.embedding_length_per_layer_input u32              = 256
llama_model_loader: - kv  28:         gemma3n.attention.shared_kv_layers u32              = 10
llama_model_loader: - kv  29:          gemma3n.activation_sparsity_scale arr[f32,30]      = [1.644854, 1.644854, 1.644854, 1.6448...
llama_model_loader: - kv  30:   gemma3n.attention.sliding_window_pattern arr[bool,30]     = [true, true, true, true, false, true,...
llama_model_loader: - kv  31:                    tokenizer.chat_template str              = {{ bos_token }}\n{%- if messages[0]['r...
llama_model_loader: - kv  32:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  33:                         tokenizer.ggml.pre str              = default
llama_model_loader: - kv  34:                      tokenizer.ggml.tokens arr[str,262144]  = ["<pad>", "<eos>", "<bos>", "<unk>", ...
llama_model_loader: - kv  35:                      tokenizer.ggml.scores arr[f32,262144]  = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv  36:                  tokenizer.ggml.token_type arr[i32,262144]  = [3, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv  37:                tokenizer.ggml.bos_token_id u32              = 2
llama_model_loader: - kv  38:                tokenizer.ggml.eos_token_id u32              = 106
llama_model_loader: - kv  39:            tokenizer.ggml.unknown_token_id u32              = 3
llama_model_loader: - kv  40:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  41:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  42:               tokenizer.ggml.add_sep_token bool             = false
llama_model_loader: - kv  43:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  44:            tokenizer.ggml.add_space_prefix bool             = false
llama_model_loader: - kv  45:               general.quantization_version u32              = 2
llama_model_loader: - kv  46:                          general.file_type u32              = 19
llama_model_loader: - kv  47:                      quantize.imatrix.file str              = gemma-3n-E2B-it-GGUF/imatrix_unsloth.dat
llama_model_loader: - kv  48:                   quantize.imatrix.dataset str              = unsloth_calibration_gemma-3n-E2B-it.txt
llama_model_loader: - kv  49:             quantize.imatrix.entries_count u32              = 400
llama_model_loader: - kv  50:              quantize.imatrix.chunks_count u32              = 1326
llama_model_loader: - type  f32:  362 tensors
llama_model_loader: - type  f16:   93 tensors
llama_model_loader: - type q4_1:    1 tensors
llama_model_loader: - type q2_K:   12 tensors
llama_model_loader: - type q3_K:   30 tensors
llama_model_loader: - type q4_K:   30 tensors
llama_model_loader: - type iq2_xxs:  189 tensors
llama_model_loader: - type iq3_s:    5 tensors
llama_model_loader: - type iq2_s:    5 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = IQ2_XXS - 2.0625 bpw
print_info: file size   = 1.91 GiB (3.68 BPW) 
llama_model_load: error loading model: error loading model architecture: unknown model architecture: 'gemma3n'
llama_model_load_from_file_impl: failed to load model
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/tmp/ipython-input-6-579632107.py in <cell line: 0>()
      3 from llama_cpp import Llama
      4 
----> 5 llm = Llama.from_pretrained(
      6         repo_id="unsloth/gemma-3n-E2B-it-GGUF",
      7         filename="gemma-3n-E2B-it-UD-IQ2_XXS.gguf",

2 frames
/usr/local/lib/python3.11/dist-packages/llama_cpp/_internals.py in __init__(self, path_model, params, verbose)
     54 
     55         if model is None:
---> 56             raise ValueError(f"Failed to load model from file: {path_model}")
     57 
     58         vocab = llama_cpp.llama_model_get_vocab(model)

ValueError: Failed to load model from file: /root/.cache/huggingface/hub/models--unsloth--gemma-3n-E2B-it-GGUF/snapshots/4a272108ff14d0e1cef82baa8025a60a91307348/./gemma-3n-E2B-it-UD-IQ2_XXS.gguf

I am new to this and not sure how to handle it. What should I do to overcome this? Or is it that llama_cpp does not yet support gemma3n and I just need to wait?
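For what it's worth, the "unknown model architecture: 'gemma3n'" string comes from the llama.cpp code bundled inside the llama-cpp-python wheel, which suggests that build predates gemma3n support. A minimal sketch to check which version the Colab runtime actually has installed (assuming the PyPI package name `llama-cpp-python`, which is what the official bindings use):

```python
# Sketch: print the installed llama-cpp-python version.
# If the bundled llama.cpp is older than gemma3n support, loading any
# gemma3n GGUF will fail with "unknown model architecture: 'gemma3n'".
from importlib.metadata import PackageNotFoundError, version

try:
    print("llama-cpp-python:", version("llama-cpp-python"))
except PackageNotFoundError:
    print("llama-cpp-python is not installed in this environment")
```

Comparing that version against the llama-cpp-python release notes should show whether a newer release picked up gemma3n.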

The model I use is gemma-3n-E2B-it-UD-IQ2_XXS.gguf
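If it is a version issue, upgrading the bindings may pull in a llama.cpp build that recognizes gemma3n. This is a guess based on the error message, not a confirmed fix:

```shell
# Assumption: a newer llama-cpp-python release bundles a llama.cpp
# recent enough to know the gemma3n architecture.
# Restart the Colab runtime after upgrading so the new wheel is loaded.
pip install --upgrade llama-cpp-python
```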
