Llama-v3.2-3B-Chat quantized with Qualcomm Genie.

Source: https://aihub.qualcomm.com/compute/models/llama_v3_2_3b_chat_quantized

Technical Details

  • Input sequence length for Prompt Processor: 128
  • Context length: 4096
  • Number of parameters: 3B
  • Model size: 2.4 GB
  • Precision: w4a16 + w8a16 (a few layers)
  • Number of key-value heads: 8
  • Model-1 (Prompt Processor): Llama-PromptProcessor-Quantized
    • Prompt processor input: 128 tokens + position embeddings + attention mask + KV cache inputs
    • Prompt processor output: 128 output tokens + KV cache outputs
  • Model-2 (Token Generator): Llama-TokenGenerator-Quantized
    • Token generator input: 1 input token + position embeddings + attention mask + KV cache inputs
    • Token generator output: 1 output token + KV cache outputs
  • Use: Initiate the conversation with the prompt processor, then run the token generator for each subsequent iteration.
  • Minimum QNN SDK version required: 2.27.7
  • Supported languages: English.
  • Target Device: Snapdragon X Elite | SC8380XP, Windows 11
  • Versions
    • QNN: v2.28.2.241116104011_103376
    • AI Hub: aihub-2025.02.06.0
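The two-model flow above (prompt processor in 128-token chunks, then one token per step, both sharing a KV cache) can be sketched in Python. This is a toy simulation only: the function names, dummy next-token rule, and list-based "KV cache" are illustrative assumptions, not the real Genie/QNN API.

```python
# Toy sketch of the two-stage inference flow: Model-1 (prompt processor)
# consumes the prompt in 128-token chunks; Model-2 (token generator)
# then emits one token per step. Both share the growing KV cache.
# All names here are hypothetical stand-ins, not actual Genie calls.

CHUNK = 128        # prompt-processor input sequence length
CONTEXT = 4096     # maximum context length

def prompt_processor(tokens, kv_cache):
    """Stand-in for Model-1: ingest a 128-token chunk, update KV cache."""
    assert len(tokens) == CHUNK
    kv_cache.extend(tokens)              # list stands in for real KV tensors
    return tokens[-1], kv_cache          # pretend last token seeds generation

def token_generator(token, kv_cache):
    """Stand-in for Model-2: 1 token in, 1 token out, KV cache updated."""
    kv_cache.append(token)
    return (token + 1) % 100, kv_cache   # dummy next-token rule

def generate(prompt_tokens, max_new_tokens):
    kv_cache = []
    # Pad the prompt to a multiple of 128 so every chunk fills the processor.
    pad = (-len(prompt_tokens)) % CHUNK
    padded = prompt_tokens + [0] * pad
    next_tok = 0
    for i in range(0, len(padded), CHUNK):
        next_tok, kv_cache = prompt_processor(padded[i:i + CHUNK], kv_cache)
    out = []
    for _ in range(max_new_tokens):
        if len(kv_cache) >= CONTEXT:     # respect the 4096-token context limit
            break
        next_tok, kv_cache = token_generator(next_tok, kv_cache)
        out.append(next_tok)
    return out

print(len(generate(list(range(200)), 16)))  # → 16
```

The padding step mirrors why the prompt processor has a fixed 128-token input: shorter prompts must be padded (with the attention mask hiding the padding in the real models) before the per-token generator takes over.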