Llama-v3.2-3B-Chat quantized with Qualcomm Genie.
Source: https://aihub.qualcomm.com/compute/models/llama_v3_2_3b_chat_quantized
Technical Details
- Input sequence length for Prompt Processor: 128
- Context length: 4096
- Number of parameters: 3B
- Model size: 2.4 GB
- Precision: w4a16 + w8a16 (few layers)
- Number of key-value heads: 8
- Model-1 (Prompt Processor): Llama-PromptProcessor-Quantized
- Prompt processor input: 128 tokens + position embeddings + attention mask + KV cache inputs
- Prompt processor output: 128 output tokens + KV cache outputs
- Model-2 (Token Generator): Llama-TokenGenerator-Quantized
- Token generator input: 1 input token + position embeddings + attention mask + KV cache inputs
- Token generator output: 1 output token + KV cache outputs
- Use: Initiate the conversation with the prompt processor, then use the token generator for each subsequent iteration.
- Minimum QNN SDK version required: 2.27.7
- Supported languages: English.
- Target Device: Snapdragon X Elite | SC8380XP, Windows 11
- Versions
- QNN : v2.28.2.241116104011_103376
- AI Hub : aihub-2025.02.06.0
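The two-stage flow above (prompt processor fills the KV cache from a fixed 128-token window, token generator then produces one token per step) can be sketched as follows. This is a minimal simulation of the call pattern only: the `PromptProcessor` and `TokenGenerator` classes, their `run` methods, and the dummy token arithmetic are hypothetical stand-ins, not the actual QNN/Genie API.

```python
PROMPT_LEN = 128      # prompt-processor input sequence length (from the card)
CONTEXT_LEN = 4096    # context length (from the card)

class PromptProcessor:
    """Hypothetical stand-in for Llama-PromptProcessor-Quantized."""
    def run(self, tokens, kv_cache):
        assert len(tokens) == PROMPT_LEN
        # The real model returns logits for all 128 positions plus KV cache outputs.
        new_cache = kv_cache + tokens
        next_token = (sum(tokens) % 1000) + 1   # dummy "argmax" of final logits
        return next_token, new_cache

class TokenGenerator:
    """Hypothetical stand-in for Llama-TokenGenerator-Quantized."""
    def run(self, token, kv_cache):
        # Takes 1 input token + KV cache inputs, emits 1 token + KV cache outputs.
        new_cache = kv_cache + [token]
        next_token = (token * 31) % 1000 + 1    # dummy sampling step
        return next_token, new_cache

def generate(prompt_tokens, max_new_tokens, eos_id=0):
    pp, tg = PromptProcessor(), TokenGenerator()
    # Pad/truncate the prompt to the fixed 128-token window.
    chunk = (prompt_tokens + [0] * PROMPT_LEN)[:PROMPT_LEN]
    token, cache = pp.run(chunk, [])
    out = [token]
    # The token generator handles every subsequent step, one token at a time.
    while len(out) < max_new_tokens and token != eos_id:
        if len(cache) >= CONTEXT_LEN:
            break  # KV cache full: 4096-token context reached
        token, cache = tg.run(token, cache)
        out.append(token)
    return out
```

The split into two compiled models is what makes the fixed input shapes work on the NPU: the prompt processor amortizes prefill over 128-token windows, while the token generator keeps per-step latency low by reusing the KV cache.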
Model tree for thuniverse-ai/Llama-v3.2-3B-Chat-GENIE
Base model
meta-llama/Llama-3.2-3B-Instruct