base_model: deepseek-ai/DeepSeek-R1-Distill-Llama-8B
library_name: transformers
license: llama3.1
tags:
  - deepseek
  - transformers
  - llama
  - llama-3
  - meta
  - GGUF
DeepSeek-R1-Distill-Llama-8B-NexaQuant
Background + Overview
DeepSeek-R1 has been making headlines for rivaling OpenAI’s o1 reasoning model while remaining fully open-source. Many users want to run it locally for data privacy, lower latency, and offline access. However, fitting such a large model onto personal devices typically requires quantization (e.g. Q4_K_M), which often sacrifices accuracy (up to ~22% accuracy loss) and undermines the benefits of a local reasoning model.
We’ve solved this trade-off by quantizing the DeepSeek R1 Distilled model to one-fourth its original size without losing any accuracy. This lets you run powerful on-device reasoning wherever you are, with no compromises. Tests on an HP Omnibook AIPC with an AMD Ryzen™ AI 9 HX 370 processor showed a decoding speed of 66.40 tokens per second and a peak RAM usage of just 1228 MB for the NexaQuant version, compared with 25.28 tokens per second and 3788 MB of RAM for the unquantized version, while maintaining full-precision model accuracy.
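To put the quoted benchmark figures in relative terms, the short sketch below computes the decoding speedup and the fraction of peak RAM used. It only does arithmetic on the four numbers stated above; the variable names are illustrative and nothing here is re-measured.

```python
# Figures quoted above for the HP Omnibook test machine (not re-measured here).
nexaquant = {"tokens_per_s": 66.40, "peak_ram_mb": 1228}
unquantized = {"tokens_per_s": 25.28, "peak_ram_mb": 3788}

# Ratio of decoding speeds: values above 1 mean the quantized model is faster.
speedup = nexaquant["tokens_per_s"] / unquantized["tokens_per_s"]

# Fraction of the unquantized peak RAM that the quantized model needs.
ram_ratio = nexaquant["peak_ram_mb"] / unquantized["peak_ram_mb"]

print(f"Decoding speedup: {speedup:.2f}x")
print(f"Peak RAM relative to unquantized: {ram_ratio:.1%}")
```

On these numbers the NexaQuant version decodes roughly 2.6 times faster while using about a third of the peak RAM.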
How to run locally
NexaQuant is compatible with Nexa-SDK, Ollama, LM Studio, llama.cpp, and any llama.cpp-based project. Below, we outline multiple ways to run the model locally.
Option 1: Using Nexa SDK
Step 1: Install Nexa SDK
Follow the installation instructions in Nexa SDK's GitHub repository.
Step 2: Run the model with Nexa
Execute the following command in your terminal:
nexa run DeepSeek-R1-Distill-Llama-8B-NexaQuant:q4_0
Option 2: Using llama.cpp
Step 1: Build llama.cpp on Your Device
Follow the "Building the project" instructions in the llama.cpp repository to build the project.
Step 2: Run the Model with llama.cpp
Once built, run llama-cli under <build_dir>/bin/:
./llama-cli \
    --model your/local/path/to/DeepSeek-R1-Distill-Llama-8B-NexaQuant \
    --prompt 'Provide step-by-step reasoning enclosed in <think> </think> tags, followed by the final answer enclosed in \boxed{} tags.'
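The prompt above asks for reasoning inside <think> </think> tags and a final answer inside \boxed{}. If you capture the model's text output, a minimal Python sketch for separating the two parts could look like the following. The helper names `extract_boxed` and `split_reasoning` are hypothetical, not part of llama.cpp or Nexa SDK, and the sketch assumes the model actually follows the requested format.

```python
import re

# Matches the first <think> ... </think> block, including newlines.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
BOXED_PREFIX = r"\boxed{"


def extract_boxed(text):
    # Return the contents of the last \boxed{...} in text, or None.
    # Braces are counted so nested LaTeX such as \frac{1}{2} stays intact.
    start = text.rfind(BOXED_PREFIX)
    if start == -1:
        return None
    depth = 1
    chars = []
    for ch in text[start + len(BOXED_PREFIX):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars)
        chars.append(ch)
    return None  # unbalanced braces: treat as no answer found


def split_reasoning(output):
    # Return (reasoning, answer); either is None if the tag is missing.
    match = THINK_RE.search(output)
    reasoning = match.group(1).strip() if match else None
    # Only look for the final answer after the reasoning block, if present.
    tail = output[match.end():] if match else output
    return reasoning, extract_boxed(tail)


# Hand-written sample text, not real model output.
sample = "<think>\n2 + 2 is 4.\n</think>\nThe answer is \\boxed{4}."
reasoning, answer = split_reasoning(sample)
print("reasoning:", reasoning)
print("answer:", answer)
```

Returning None for missing parts, instead of raising, lets the caller decide how to handle outputs that do not follow the format.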
Option 3: Using LM Studio
Step 1: Download and Install LM Studio
Get the latest version from the official website.
Step 2: Load and Run the Model
- In LM Studio's top panel, search for and select NexaAIDev/DeepSeek-R1-Distill-Llama-8B-NexaQuant.
- Click Download (if not already downloaded) and wait for the model to load.
- Once loaded, go to the chat window and start a conversation.

