Llama-v3.2-3B-Chat quantized with Qualcomm Genie.

Source: https://aihub.qualcomm.com/compute/models/llama_v3_2_3b_chat_quantized

Technical Details

  • Input sequence length for Prompt Processor: 128
  • Context length: 4096
  • Number of parameters: 3B
  • Model size: 2.4 GB
  • Precision: w4a16 + w8a16 (a few layers)
  • Number of key-value heads: 8
  • Model-1 (Prompt Processor): Llama-PromptProcessor-Quantized
    • Prompt processor input: 128 tokens + position embeddings + attention mask + KV cache inputs
    • Prompt processor output: 128 output tokens + KV cache outputs
  • Model-2 (Token Generator): Llama-TokenGenerator-Quantized
    • Token generator input: 1 input token + position embeddings + attention mask + KV cache inputs
    • Token generator output: 1 output token + KV cache outputs
  • Use: Initiate the conversation with the prompt processor, then run the token generator for each subsequent iteration.
  • Minimum QNN SDK version required: 2.27.7
  • Supported languages: English.
  • Target Device: Snapdragon X Elite | SC8380XP, Windows 11
  • Versions
    • QNN: v2.28.2.241116104011_103376
    • AI Hub: aihub-2025.02.06.0
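The two-model flow above (prompt processor in 128-token chunks, then one token per step, both sharing a KV cache) can be sketched in Python. This is a toy simulation only: the function names, dummy next-token rule, and list-based "KV cache" are illustrative assumptions, not the real Genie/QNN API.

```python
# Toy sketch of the two-stage inference flow: Model-1 (prompt processor)
# consumes the prompt in 128-token chunks; Model-2 (token generator)
# then emits one token per step. Both share the growing KV cache.
# All names here are hypothetical stand-ins, not actual Genie calls.

CHUNK = 128        # prompt-processor input sequence length
CONTEXT = 4096     # maximum context length

def prompt_processor(tokens, kv_cache):
    """Stand-in for Model-1: ingest a 128-token chunk, update KV cache."""
    assert len(tokens) == CHUNK
    kv_cache.extend(tokens)              # list stands in for real KV tensors
    return tokens[-1], kv_cache          # pretend last token seeds generation

def token_generator(token, kv_cache):
    """Stand-in for Model-2: 1 token in, 1 token out, KV cache updated."""
    kv_cache.append(token)
    return (token + 1) % 100, kv_cache   # dummy next-token rule

def generate(prompt_tokens, max_new_tokens):
    kv_cache = []
    # Pad the prompt to a multiple of 128 so every chunk fills the processor.
    pad = (-len(prompt_tokens)) % CHUNK
    padded = prompt_tokens + [0] * pad
    next_tok = 0
    for i in range(0, len(padded), CHUNK):
        next_tok, kv_cache = prompt_processor(padded[i:i + CHUNK], kv_cache)
    out = []
    for _ in range(max_new_tokens):
        if len(kv_cache) >= CONTEXT:     # respect the 4096-token context limit
            break
        next_tok, kv_cache = token_generator(next_tok, kv_cache)
        out.append(next_tok)
    return out

print(len(generate(list(range(200)), 16)))  # → 16
```

The padding step mirrors why the prompt processor has a fixed 128-token input: shorter prompts must be padded (with the attention mask hiding the padding in the real models) before the per-token generator takes over.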