It seems plain AVX2 matmul is about 5% faster than LUT-based (T-MAC) inference on an AMD Ryzen 9 5950X (16 cores):
- 6.96 t/s — T-MAC w4g128
- 7.31 t/s — AVX2 Q4_0
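(The gap follows from the eval times in the logs below: 136.86 ms vs 143.72 ms per generated token, i.e. 7.31 / 6.96 ≈ 1.05.)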
This is a TMAC_W4G128_1 GGUF file (4-bit weights, quantization group size 128), converted from ChenMnZ/Llama-2-7b-EfficientQAT-w4g128-GPTQ.
./llama-cli -m ChenMnZ_Llama-2-7b-EfficientQAT-w4g128-GPTQ/ChenMnZ_Llama-2-7b-EfficientQAT-w4g128.gguf -n 50 -p hi
build: 5130 (7cb118f3) with Ubuntu clang version 14.0.0-1ubuntu1.1 for x86_64-pc-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_loader: loaded meta data with 29 key-value pairs and 291 tensors from /home/user/Storage/ChenMnZ_Llama-2-7b-EfficientQAT-w4g128-GPTQ/ChenMnZ_Llama-2-7b-EfficientQAT-w4g128.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = ChenMnZ_Llama 2 7b EfficientQAT W4G12...
llama_model_loader: - kv 3: general.finetune str = EfficientQAT-w4g128-GPTQ
llama_model_loader: - kv 4: general.basename str = ChenMnZ_Llama-2
llama_model_loader: - kv 5: general.size_label str = 7B
llama_model_loader: - kv 6: llama.block_count u32 = 32
llama_model_loader: - kv 7: llama.context_length u32 = 4096
llama_model_loader: - kv 8: llama.embedding_length u32 = 4096
llama_model_loader: - kv 9: llama.feed_forward_length u32 = 11008
llama_model_loader: - kv 10: llama.attention.head_count u32 = 32
llama_model_loader: - kv 11: llama.attention.head_count_kv u32 = 32
llama_model_loader: - kv 12: llama.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 13: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 14: general.file_type u32 = 46
llama_model_loader: - kv 15: llama.vocab_size u32 = 32001
llama_model_loader: - kv 16: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 17: tokenizer.ggml.model str = llama
llama_model_loader: - kv 18: tokenizer.ggml.pre str = default
llama_model_loader: - kv 19: tokenizer.ggml.tokens arr[str,32001] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv 20: tokenizer.ggml.scores arr[f32,32001] = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv 21: tokenizer.ggml.token_type arr[i32,32001] = [3, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 22: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 23: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 24: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 25: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 26: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 27: tokenizer.ggml.add_space_prefix bool = true
llama_model_loader: - kv 28: general.quantization_version u32 = 2
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type f16: 2 tensors
llama_model_loader: - type tmac_w4g128_1: 224 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = TMAC_W4G128_1 - 4.5 bpw
print_info: file size = 3.88 GiB (4.95 BPW)
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 4
load: token to piece cache size = 0.1684 MB
print_info: arch = llama
print_info: vocab_only = 0
print_info: n_ctx_train = 4096
print_info: n_embd = 4096
print_info: n_layer = 32
print_info: n_head = 32
print_info: n_head_kv = 32
print_info: n_rot = 128
print_info: n_swa = 0
print_info: n_swa_pattern = 1
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 1
print_info: n_embd_k_gqa = 4096
print_info: n_embd_v_gqa = 4096
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 11008
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 0
print_info: rope scaling = linear
print_info: freq_base_train = 10000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 4096
print_info: rope_finetuned = unknown
print_info: ssm_d_conv = 0
print_info: ssm_d_inner = 0
print_info: ssm_d_state = 0
print_info: ssm_dt_rank = 0
print_info: ssm_dt_b_c_rms = 0
print_info: model type = 7B
print_info: model params = 6.74 B
print_info: general.name = ChenMnZ_Llama 2 7b EfficientQAT W4G128 GPTQ
print_info: vocab type = SPM
print_info: n_vocab = 32001
print_info: n_merges = 0
print_info: BOS token = 1 '<s>'
print_info: EOS token = 2 '</s>'
print_info: UNK token = 0 '<unk>'
print_info: PAD token = 0 '<unk>'
print_info: LF token = 13 '<0x0A>'
print_info: EOG token = 2 '</s>'
print_info: max token length = 48
load_tensors: loading model tensors, this can take a while... (mmap = true)
Tuned kernel config: M=4096, N=1, K=4096, bm=256, n=8, kfactor=16, bits=4, g=4, ngroups_per_elem=2, q_group_size=128, act_group_size=64 TIME: 0.7750 ms
Tuned kernel config: M=4096, N=1, K=4096, bm=512, n=8, kfactor=16, bits=4, g=4, ngroups_per_elem=2, q_group_size=128, act_group_size=64 TIME: 0.7561 ms
Tuned kernel config: M=4096, N=1, K=4096, bm=1024, n=8, kfactor=16, bits=4, g=4, ngroups_per_elem=2, q_group_size=128, act_group_size=64 TIME: 0.7626 ms
Tuned kernel config: M=4096, N=1, K=4096, bm=2048, n=8, kfactor=16, bits=4, g=4, ngroups_per_elem=2, q_group_size=128, act_group_size=64 TIME: 0.7545 ms
Tuned kernel config: M=11008, N=1, K=4096, bm=256, n=8, kfactor=16, bits=4, g=4, ngroups_per_elem=2, q_group_size=128, act_group_size=64 TIME: 2.0892 ms
Tuned kernel config: M=11008, N=1, K=4096, bm=512, n=8, kfactor=16, bits=4, g=4, ngroups_per_elem=2, q_group_size=128, act_group_size=64 TIME: 2.0287 ms
Tuned kernel config: M=11008, N=1, K=4096, bm=1024, n=8, kfactor=16, bits=4, g=4, ngroups_per_elem=2, q_group_size=128, act_group_size=64 TIME: 2.0261 ms
Tuned kernel config: M=4096, N=1, K=11008, bm=256, n=8, kfactor=16, bits=4, g=4, ngroups_per_elem=2, q_group_size=128, act_group_size=64 TIME: 2.0912 ms
Tuned kernel config: M=4096, N=1, K=11008, bm=512, n=8, kfactor=16, bits=4, g=4, ngroups_per_elem=2, q_group_size=128, act_group_size=64 TIME: 2.0294 ms
Tuned kernel config: M=4096, N=1, K=11008, bm=1024, n=8, kfactor=16, bits=4, g=4, ngroups_per_elem=2, q_group_size=128, act_group_size=64 TIME: 1.9938 ms
Tuned kernel config: M=4096, N=1, K=11008, bm=2048, n=8, kfactor=16, bits=4, g=4, ngroups_per_elem=2, q_group_size=128, act_group_size=64 TIME: 1.7570 ms
load_tensors: TMAC model buffer size = 3975.03 MiB
..........................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 0
llama_context: freq_base = 10000.0
llama_context: freq_scale = 1
llama_context: CPU output buffer size = 0.12 MiB
init: kv_size = 4096, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 32, can_shift = 1
init: CPU KV buffer size = 2048.00 MiB
llama_context: KV self size = 2048.00 MiB, K (f16): 1024.00 MiB, V (f16): 1024.00 MiB
llama_context: CPU compute buffer size = 296.01 MiB
llama_context: graph nodes = 1094
llama_context: graph splits = 1
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 16
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | AARCH64_REPACK = 1 |
sampler seed: 3883204367
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 4096, n_batch = 2048, n_predict = 50, n_keep = 1
hi! my name is sandra and i am a 5th grade teacher. I have been teaching for 14 years. I love the kids and the creativity. I have taught every grade from 2nd through 5th.
llama_perf_sampler_print: sampling time = 1.65 ms / 52 runs ( 0.03 ms per token, 31496.06 tokens per second)
llama_perf_context_print: load time = 381620.92 ms
llama_perf_context_print: prompt eval time = 173.50 ms / 2 tokens ( 86.75 ms per token, 11.53 tokens per second)
llama_perf_context_print: eval time = 7042.50 ms / 49 runs ( 143.72 ms per token, 6.96 tokens per second)
llama_perf_context_print: total time = 7222.20 ms / 51 tokens
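For context on the "Tuned kernel config" lines above: LUT-based kernels replace most multiply-accumulates with table lookups. The sketch below is a minimal, didactic C++ illustration of that idea for 4-bit weights with g = 4 activations per lookup group. It is not the actual T-MAC kernel — real kernels pack weights offline, tile over bm/kfactor, quantize activations (act_group_size = 64), and apply scales per 128-weight group (q_group_size = 128) — and the function and variable names here are invented for illustration.

```cpp
// Minimal sketch of LUT-based GEMV for 4-bit weights (y = W_q * x).
// Didactic only: one scale/zero per row instead of per 128-weight group,
// unpacked weights (one nibble per byte), no tiling, float activations.
#include <cstdint>
#include <cstdio>
#include <vector>

void lut_gemv_w4(const uint8_t *q,     // M*K weight nibbles, values 0..15
                 const float *scale,   // per-row scale (simplified)
                 const float *zero,    // per-row zero point (simplified)
                 const float *x,       // K activations
                 float *y,             // M outputs
                 int M, int K) {
    const int g = 4;                   // activations per lookup group ("g=4")
    const int ngroups = K / g;         // assumes K % 4 == 0

    // Build the lookup tables once per activation vector: for every group of
    // 4 activations, precompute the partial sum for each of the 2^4 = 16
    // possible weight-bit patterns. These tables are then reused by all M
    // rows and all 4 bit-planes of the 4-bit weights.
    std::vector<float> lut(ngroups * 16);
    float xsum = 0.0f;
    for (int gi = 0; gi < ngroups; ++gi) {
        const float *xg = x + gi * g;
        for (int p = 0; p < 16; ++p) {
            float s = 0.0f;
            for (int j = 0; j < g; ++j)
                if (p & (1 << j)) s += xg[j];
            lut[gi * 16 + p] = s;
        }
        for (int j = 0; j < g; ++j) xsum += xg[j];
    }

    // Per row: one table lookup per group per bit-plane replaces the
    // multiply-accumulates, then the zero point and scale are applied:
    // y[m] = scale * (sum_k q[m][k]*x[k] - zero * sum_k x[k]).
    for (int m = 0; m < M; ++m) {
        float acc = 0.0f;
        for (int b = 0; b < 4; ++b) {            // 4 bit-planes of a 4-bit weight
            float plane = 0.0f;
            for (int gi = 0; gi < ngroups; ++gi) {
                int p = 0;                        // real kernels store p pre-packed
                for (int j = 0; j < g; ++j)
                    p |= ((q[m * K + gi * g + j] >> b) & 1) << j;
                plane += lut[gi * 16 + p];
            }
            acc += plane * (float)(1 << b);
        }
        y[m] = scale[m] * (acc - zero[m] * xsum);
    }
}

int main() {
    const int M = 2, K = 8;
    uint8_t q[M * K];
    for (int i = 0; i < M * K; ++i) q[i] = (uint8_t)(i % 16);
    float scale[M] = {0.1f, 0.2f};
    float zero[M]  = {8.0f, 8.0f};
    float x[K] = {1, -2, 3, -4, 5, -6, 7, -8};
    float y[M];
    lut_gemv_w4(q, scale, zero, x, y, M, K);
    printf("%f %f\n", y[0], y[1]);  // equals a plain dequantize-then-dot result
    return 0;
}
```

The point of the LUT trick is that the 16-entry tables are built once per activation vector and shared by every output row and every bit-plane, so the per-weight work shrinks to a shift, a mask, and a table lookup; whether that beats a well-vectorized AVX2 dot product is exactly what the runs on this page compare.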
AVX2 Q4_0 with the same 2 f16 layers (token embedding and output) kept, matching the T-MAC file:
./llama-cli -p "hi" -n 50 -m /media/user/6/unsloth_llama-2-7b-chat/f16-emb-f16-output-ggml-model-Q4_0.gguf -no-cnv
build: 5228 (44cd8d91) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_loader: loaded meta data with 35 key-value pairs and 291 tensors from /media/user/6/unsloth_llama-2-7b-chat/f16-emb-f16-output-ggml-model-Q4_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Llama 2 7b Chat
llama_model_loader: - kv 3: general.organization str = Unsloth
llama_model_loader: - kv 4: general.finetune str = chat
llama_model_loader: - kv 5: general.basename str = llama-2
llama_model_loader: - kv 6: general.size_label str = 7B
llama_model_loader: - kv 7: general.license str = apache-2.0
llama_model_loader: - kv 8: general.tags arr[str,6] = ["unsloth", "transformers", "llama", ...
llama_model_loader: - kv 9: general.languages arr[str,1] = ["en"]
llama_model_loader: - kv 10: llama.block_count u32 = 32
llama_model_loader: - kv 11: llama.context_length u32 = 4096
llama_model_loader: - kv 12: llama.embedding_length u32 = 4096
llama_model_loader: - kv 13: llama.feed_forward_length u32 = 11008
llama_model_loader: - kv 14: llama.attention.head_count u32 = 32
llama_model_loader: - kv 15: llama.attention.head_count_kv u32 = 32
llama_model_loader: - kv 16: llama.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 17: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 18: llama.vocab_size u32 = 32000
llama_model_loader: - kv 19: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 20: tokenizer.ggml.model str = llama
llama_model_loader: - kv 21: tokenizer.ggml.pre str = default
llama_model_loader: - kv 22: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv 23: tokenizer.ggml.scores arr[f32,32000] = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv 24: tokenizer.ggml.token_type arr[i32,32000] = [3, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 25: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 26: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 27: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 28: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 29: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 30: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 31: tokenizer.chat_template str = {% if messages[0]['role'] == 'system'...
llama_model_loader: - kv 32: tokenizer.ggml.add_space_prefix bool = false
llama_model_loader: - kv 33: general.quantization_version u32 = 2
llama_model_loader: - kv 34: general.file_type u32 = 2
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type f16: 2 tensors
llama_model_loader: - type q4_0: 224 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q4_0
print_info: file size = 3.88 GiB (4.95 BPW)
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 3
load: token to piece cache size = 0.1684 MB
print_info: arch = llama
print_info: vocab_only = 0
print_info: n_ctx_train = 4096
print_info: n_embd = 4096
print_info: n_layer = 32
print_info: n_head = 32
print_info: n_head_kv = 32
print_info: n_rot = 128
print_info: n_swa = 0
print_info: n_swa_pattern = 1
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 1
print_info: n_embd_k_gqa = 4096
print_info: n_embd_v_gqa = 4096
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 11008
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 0
print_info: rope scaling = linear
print_info: freq_base_train = 10000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 4096
print_info: rope_finetuned = unknown
print_info: ssm_d_conv = 0
print_info: ssm_d_inner = 0
print_info: ssm_d_state = 0
print_info: ssm_dt_rank = 0
print_info: ssm_dt_b_c_rms = 0
print_info: model type = 7B
print_info: model params = 6.74 B
print_info: general.name = Llama 2 7b Chat
print_info: vocab type = SPM
print_info: n_vocab = 32000
print_info: n_merges = 0
print_info: BOS token = 1 '<s>'
print_info: EOS token = 2 '</s>'
print_info: UNK token = 0 '<unk>'
print_info: PAD token = 0 '<unk>'
print_info: LF token = 13 '<0x0A>'
print_info: EOG token = 2 '</s>'
print_info: max token length = 48
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: CPU_AARCH64 model buffer size = 3474.00 MiB
load_tensors: CPU_Mapped model buffer size = 3950.83 MiB
..........................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 0
llama_context: freq_base = 10000.0
llama_context: freq_scale = 1
llama_context: CPU output buffer size = 0.12 MiB
init: kv_size = 4096, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 32, can_shift = 1
init: CPU KV buffer size = 2048.00 MiB
llama_context: KV self size = 2048.00 MiB, K (f16): 1024.00 MiB, V (f16): 1024.00 MiB
llama_context: CPU compute buffer size = 296.01 MiB
llama_context: graph nodes = 1094
llama_context: graph splits = 1
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 16
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |
sampler seed: 1030596542
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 4096, n_batch = 2048, n_predict = 50, n_keep = 1
hiphopdx.его
In the latest installment of our "On The Come Up" series, we highlight up-and-coming rapper and singer, D Smoke. The Los Angeles-based artist has been making waves in the hip
llama_perf_sampler_print: sampling time = 1.56 ms / 52 runs ( 0.03 ms per token, 33397.56 tokens per second)
llama_perf_context_print: load time = 3465.26 ms
llama_perf_context_print: prompt eval time = 158.13 ms / 2 tokens ( 79.06 ms per token, 12.65 tokens per second)
llama_perf_context_print: eval time = 6706.08 ms / 49 runs ( 136.86 ms per token, 7.31 tokens per second)
llama_perf_context_print: total time = 6871.50 ms / 51 tokens
Regular CPU speed for reference - AVX2 with a pure Q4_0 file (embedding and output layers also Q4_0):
./llama-cli -m ~/Storage/pure-ggml-model-Q4_0.gguf -n 50 -p hi -no-cnv
build: 5228 (44cd8d91) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_loader: loaded meta data with 35 key-value pairs and 291 tensors from /home/user/Storage/pure-ggml-model-Q4_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Llama 2 7b Chat
llama_model_loader: - kv 3: general.organization str = Unsloth
llama_model_loader: - kv 4: general.finetune str = chat
llama_model_loader: - kv 5: general.basename str = llama-2
llama_model_loader: - kv 6: general.size_label str = 7B
llama_model_loader: - kv 7: general.license str = apache-2.0
llama_model_loader: - kv 8: general.tags arr[str,6] = ["unsloth", "transformers", "llama", ...
llama_model_loader: - kv 9: general.languages arr[str,1] = ["en"]
llama_model_loader: - kv 10: llama.block_count u32 = 32
llama_model_loader: - kv 11: llama.context_length u32 = 4096
llama_model_loader: - kv 12: llama.embedding_length u32 = 4096
llama_model_loader: - kv 13: llama.feed_forward_length u32 = 11008
llama_model_loader: - kv 14: llama.attention.head_count u32 = 32
llama_model_loader: - kv 15: llama.attention.head_count_kv u32 = 32
llama_model_loader: - kv 16: llama.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 17: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 18: llama.vocab_size u32 = 32000
llama_model_loader: - kv 19: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 20: tokenizer.ggml.model str = llama
llama_model_loader: - kv 21: tokenizer.ggml.pre str = default
llama_model_loader: - kv 22: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv 23: tokenizer.ggml.scores arr[f32,32000] = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv 24: tokenizer.ggml.token_type arr[i32,32000] = [3, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 25: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 26: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 27: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 28: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 29: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 30: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 31: tokenizer.chat_template str = {% if messages[0]['role'] == 'system'...
llama_model_loader: - kv 32: tokenizer.ggml.add_space_prefix bool = false
llama_model_loader: - kv 33: general.quantization_version u32 = 2
llama_model_loader: - kv 34: general.file_type u32 = 2
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type q4_0: 226 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q4_0
print_info: file size = 3.53 GiB (4.50 BPW)
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 3
load: token to piece cache size = 0.1684 MB
print_info: arch = llama
print_info: vocab_only = 0
print_info: n_ctx_train = 4096
print_info: n_embd = 4096
print_info: n_layer = 32
print_info: n_head = 32
print_info: n_head_kv = 32
print_info: n_rot = 128
print_info: n_swa = 0
print_info: n_swa_pattern = 1
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 1
print_info: n_embd_k_gqa = 4096
print_info: n_embd_v_gqa = 4096
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 11008
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 0
print_info: rope scaling = linear
print_info: freq_base_train = 10000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 4096
print_info: rope_finetuned = unknown
print_info: ssm_d_conv = 0
print_info: ssm_d_inner = 0
print_info: ssm_d_state = 0
print_info: ssm_dt_rank = 0
print_info: ssm_dt_b_c_rms = 0
print_info: model type = 7B
print_info: model params = 6.74 B
print_info: general.name = Llama 2 7b Chat
print_info: vocab type = SPM
print_info: n_vocab = 32000
print_info: n_merges = 0
print_info: BOS token = 1 '<s>'
print_info: EOS token = 2 '</s>'
print_info: UNK token = 0 '<unk>'
print_info: PAD token = 0 '<unk>'
print_info: LF token = 13 '<0x0A>'
print_info: EOG token = 2 '</s>'
print_info: max token length = 48
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: CPU_AARCH64 model buffer size = 3544.31 MiB
load_tensors: CPU_Mapped model buffer size = 3521.14 MiB
....................................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 0
llama_context: freq_base = 10000.0
llama_context: freq_scale = 1
llama_context: CPU output buffer size = 0.12 MiB
init: kv_size = 4096, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 32, can_shift = 1
init: CPU KV buffer size = 2048.00 MiB
llama_context: KV self size = 2048.00 MiB, K (f16): 1024.00 MiB, V (f16): 1024.00 MiB
llama_context: CPU compute buffer size = 296.01 MiB
llama_context: graph nodes = 1094
llama_context: graph splits = 1
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 16
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |
sampler seed: 1096331632
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 4096, n_batch = 2048, n_predict = 50, n_keep = 1
hi, my name is [Your Name] and I am a [Your Profession] with [Your Company]. I am reaching out to inquire about the possibility of [Your Reason for Contacting]."
everybody knows that first impressions count
llama_perf_sampler_print: sampling time = 1.53 ms / 52 runs ( 0.03 ms per token, 34076.02 tokens per second)
llama_perf_context_print: load time = 3453.56 ms
llama_perf_context_print: prompt eval time = 151.00 ms / 2 tokens ( 75.50 ms per token, 13.25 tokens per second)
llama_perf_context_print: eval time = 6351.57 ms / 49 runs ( 129.62 ms per token, 7.71 tokens per second)
llama_perf_context_print: total time = 6509.73 ms / 51 tokens
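(For the pure Q4_0 file: 7.71 / 6.96 ≈ 1.11, so the fully quantized AVX2 path comes out roughly 11% faster than the T-MAC run on this CPU.)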