---
base_model: deepseek-ai/DeepSeek-R1-Distill-Llama-8B
library_name: transformers
license: llama3.1
tags:
- deepseek
- transformers
- llama
- llama-3
- meta
- GGUF
---

# DeepSeek-R1-Distill-Llama-8B-NexaQuant

## Background + Overview 

DeepSeek-R1 has been making headlines for rivaling OpenAI's o1 reasoning model while remaining fully open-source. Many users want to run it locally to ensure data privacy, reduce latency, and maintain offline access. However, fitting such a large model onto personal devices typically requires quantization (e.g., Q4_K_M), which often sacrifices accuracy (up to ~22% loss) and undermines the benefits of a local reasoning model.

We've solved this trade-off by quantizing the DeepSeek-R1 distilled model to one-fourth its original size without losing accuracy. This lets you run powerful on-device reasoning wherever you are, with no compromises. Tests on an **HP OmniBook AI PC** with an **AMD Ryzen™ AI 9 HX 370 processor** showed a decoding speed of **17.20 tokens per second** and peak RAM usage of just **5,017 MB** for the NexaQuant version, compared to **5.30 tokens per second** and **15,564 MB of RAM** for the unquantized version, **while maintaining full-precision model accuracy.**

## How to run locally

NexaQuant is compatible with **Nexa-SDK**, **Ollama**, **LM Studio**, **Llama.cpp**, and any other llama.cpp-based project. Below, we outline multiple ways to run the model locally.

#### Option 1: Using Nexa SDK

**Step 1: Install Nexa SDK**

Follow the installation instructions in Nexa SDK's [GitHub repository](https://github.com/NexaAI/nexa-sdk).
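
If you go the pip route described there, a minimal CPU-only setup is a one-liner (the `nexaai` package name is taken from that repository's README; GPU backends use extra install flags documented there):

```bash
# CPU-only install of the Nexa SDK; see the repo README for GPU builds
pip install nexaai
```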

**Step 2: Run the model with Nexa**

Execute the following command in your terminal:
```bash
nexa run DeepSeek-R1-Distill-Llama-8B-NexaQuant:q4_0
```

#### Option 2: Using llama.cpp

**Step 1: Build llama.cpp on Your Device**

Follow the "Building the project" instructions in the llama.cpp [repository](https://github.com/ggerganov/llama.cpp) to build the project.

**Step 2: Run the Model with llama.cpp**

Once built, run `llama-cli` under `<build_dir>/bin/`:
```bash
./llama-cli \
    --model your/local/path/to/DeepSeek-R1-Distill-Llama-8B-NexaQuant.gguf \
    --prompt 'Provide step-by-step reasoning enclosed in <think> </think> tags, followed by the final answer enclosed in \boxed{} tags.'
```
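
If you would rather talk to the model over HTTP, the same build also produces `llama-server`, which exposes an OpenAI-compatible endpoint. A minimal sketch (the port and the arithmetic prompt are arbitrary choices for illustration):

```bash
# Serve the model on an OpenAI-compatible HTTP API
./llama-server \
    --model your/local/path/to/DeepSeek-R1-Distill-Llama-8B-NexaQuant.gguf \
    --port 8080

# In another terminal: query the chat completions endpoint
curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"messages": [{"role": "user", "content": "What is 12 * 17? Provide step-by-step reasoning enclosed in <think> </think> tags, followed by the final answer enclosed in \\boxed{} tags."}]}'
```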

#### Option 3: Using LM Studio

**Step 1: Download and Install LM Studio**

Get the latest version from the [official website](https://lmstudio.ai/).

**Step 2: Load and Run the Model**

1. In LM Studio's top panel, search for and select `NexaAIDev/DeepSeek-R1-Distill-Llama-8B-NexaQuant`.  
2. Click `Download` (if not already downloaded) and wait for the model to load.  
3. Once loaded, go to the chat window and start a conversation.
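
The compatibility list above also includes **Ollama**. Recent Ollama releases can pull GGUF repositories directly from Hugging Face, so a one-liner along these lines should work (the repo path mirrors the LM Studio step above; exact tag naming is an assumption):

```bash
# Pull and run the GGUF straight from Hugging Face (recent Ollama releases)
ollama run hf.co/NexaAIDev/DeepSeek-R1-Distill-Llama-8B-NexaQuant
```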

---