---
datasets:
- NeelNanda/pile-10k
base_model:
- deepseek-ai/DeepSeek-R1
---
## Model Details
This model is an INT4 quantization of [deepseek-ai/DeepSeek-R1](https://huggingface.co/deepseek-ai/DeepSeek-R1) with group_size 128 and symmetric quantization, generated by the [intel/auto-round](https://github.com/intel/auto-round) algorithm.
Please follow the license of the original model.
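For reference, a recipe of this shape (4-bit weights, group_size 128, symmetric) maps to the AutoRound arguments below. This is a minimal sketch using a small stand-in model (`facebook/opt-125m` is only an example); the actual recipe used for this checkpoint is given under "Generate the model" below.
~~~python
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

## stand-in model for illustration only; the real recipe targets DeepSeek-R1
model_name = "facebook/opt-125m"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

## int4, group_size 128, symmetric quantization
autoround = AutoRound(model, tokenizer, bits=4, group_size=128, sym=True)
autoround.quantize()
autoround.save_quantized(output_dir="./tmp_autoround", format="auto_round")
~~~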
## How To Use
### INT4 Inference on CUDA (requires at least 7×80GB GPUs)
~~~python
import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
quantized_model_dir = "OPEA/DeepSeek-R1-int4-gptq-sym-inc"
## directly use device_map='auto' if you have enough GPUs
device_map = {"model.norm": 0, "lm_head": 0, "model.embed_tokens": 0}
for i in range(61):
    name = "model.layers." + str(i)
    if i < 8:
        device_map[name] = 0
    elif i < 16:
        device_map[name] = 1
    elif i < 25:
        device_map[name] = 2
    elif i < 34:
        device_map[name] = 3
    elif i < 43:
        device_map[name] = 4
    elif i < 52:
        device_map[name] = 5
    elif i < 61:
        device_map[name] = 6
model = AutoModelForCausalLM.from_pretrained(
    quantized_model_dir,
    torch_dtype=torch.bfloat16,
    device_map=device_map,
    trust_remote_code=True,  ## the model ships custom DeepSeek modeling code
)
tokenizer = AutoTokenizer.from_pretrained(quantized_model_dir, trust_remote_code=True)
prompts = [
    "9.11和9.8哪个数字大",
    "如果你是人,你最想做什么“",
    "How many e in word deepseek",
    "There are ten birds in a tree. A hunter shoots one. How many are left in the tree?",
]
texts = []
for prompt in prompts:
    messages = [
        {"role": "user", "content": prompt}
    ]
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    texts.append(text)
inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
outputs = model.generate(
    input_ids=inputs["input_ids"].to(model.device),
    attention_mask=inputs["attention_mask"].to(model.device),
    max_length=512,  ## change this to align with the official usage
    num_return_sequences=1,
    do_sample=False  ## change this to align with the official usage
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(inputs["input_ids"], outputs)
]
decoded_outputs = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
for i, prompt in enumerate(prompts):
    print(f"Prompt: {prompt}")
    print(f"Generated: {decoded_outputs[i]}")
    print("-" * 50)
~~~
### INT4 Inference on CPU
Requirements
~~~bash
pip install auto-round
pip uninstall intel-extension-for-pytorch
pip install intel-extension-for-transformers
~~~
~~~python
import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRoundConfig  ## must import for the auto-round format
# https://github.com/huggingface/transformers/pull/35493
def set_initialized_submodules(model, state_dict_keys):
    """
    Sets the `_is_hf_initialized` flag in all submodules of a given model when all its weights are in the loaded state
    dict.
    """
    state_dict_keys = set(state_dict_keys)
    not_initialized_submodules = {}
    for module_name, module in model.named_modules():
        if module_name == "":
            # When checking if the root module is loaded there's no need to prepend module_name.
            module_keys = set(module.state_dict())
        else:
            module_keys = {f"{module_name}.{k}" for k in module.state_dict()}
        if module_keys.issubset(state_dict_keys):
            module._is_hf_initialized = True
        else:
            not_initialized_submodules[module_name] = module
    return not_initialized_submodules
transformers.modeling_utils.set_initialized_submodules = set_initialized_submodules
import torch
quantized_model_dir = "OPEA/DeepSeek-R1-int4-gptq-sym-inc"
quantization_config = AutoRoundConfig(
    backend="cpu",
)
model = AutoModelForCausalLM.from_pretrained(
    quantized_model_dir,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="cpu",
    revision="6edef8a",  ## use the auto_round format
    quantization_config=quantization_config
)
tokenizer = AutoTokenizer.from_pretrained(quantized_model_dir, trust_remote_code=True)
prompts = [
    "9.11和9.8哪个数字大",
    "如果你是人,你最想做什么“",
    "How many e in word deepseek",
    "There are ten birds in a tree. A hunter shoots one. How many are left in the tree?",
]
texts = []
for prompt in prompts:
    messages = [
        {"role": "user", "content": prompt}
    ]
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    texts.append(text)
inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
outputs = model.generate(
    input_ids=inputs["input_ids"].to(model.device),
    attention_mask=inputs["attention_mask"].to(model.device),
    max_length=512,  ## change this to align with the official usage
    num_return_sequences=1,
    do_sample=False  ## change this to align with the official usage
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(inputs["input_ids"], outputs)
]
decoded_outputs = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
for i, prompt in enumerate(prompts):
    print(f"Prompt: {prompt}")
    print(f"Generated: {decoded_outputs[i]}")
    print("-" * 50)
"""
Prompt: 9.11和9.8哪个数字大
Generated:
嗯,用户问的是9.11和9.8哪个数字大。首先,我需要确认这两个数的数值大小。可能用户对小数点的比较不太确定,或者是在不同场合下这两个数的表示方式容易混淆,比如版本号、日期或者其他用途。不过通常来说,这里的9.11和9.8应该是指数值比较。
首先,我应该回忆一下小数比较的基本规则。比较两个小数时,先看整数部分,如果整数部分大的那个数就大。如果整数部分相同,再依次比较小数部分每一位上的数字。这里两个数的整数部分都是9,所以需要比较小数部分。
接下来,9.11的小数部分是0.11,而9.8的小数部分是0.8。这时候可能会有人觉得0.8比0.11大,因为8比11小,但这里可能有个误区,因为小数部分的位数不同。其实,0.8可以写成0.80,这样比较的话,0.80和0.11,显然0.80更大。所以9.8实际上等于9.80,比9.11大。
不过,用户可能没有考虑到小数位数的不同,直接比较了小数点后的数字,比如11和8,觉得11比8大,所以9.11更大。这时候需要纠正这个误解,说明在比较小数时,应该对齐小数点后的位数,然后逐位比较。
另外,可能用户是在比较版本号,比如软件版本9.11和9.8,这时候版本号的比较可能有不同的规则,比如9.11可能是在9.8之后发布的,所以更大。但如果是数值比较的话,还是应该按照数值大小来判断。
为了确保正确,我可以将两个数都转换成相同的小数位数来比较。比如,9.8可以写成9.80,而9.11已经是两位小数了。比较的话,0.80比0.11大,所以9.80(即9.8)大于9.11。
总结一下,虽然9.11的小数部分数字是11,而9.8的是8,但实际数值上9.8更大,因为0.8等于0.80,比0.11大。因此,正确的结论是9.8比9.11大。
9.8比9.11大。
**分析过程:**
1. **整数
--------------------------------------------------
Prompt: 如果你是人,你最想做什么“
Generated:
嗯,用户问的是“如果你是人,你最想做什么”。首先,我需要理解这个问题的背景。用户可能是在测试我的能力,或者想了解作为AI的我对人类生活的看法。也有可能他们是在寻找灵感,或者只是出于好奇。
接下来,我要分析问题的核心。用户想知道如果我是人类,我的愿望或目标是什么。作为AI,我没有情感或欲望,但需要模拟人类的回答。可能需要列举一些普遍的人类愿望,比如追求知识、探索世界、帮助他人等。
然后,考虑用户可能的深层需求。他们可能想了解AI如何理解人类的情感和目标,或者希望得到激励性的回答。需要确保回答既真实又积极,同时符合AI的限制,不假装有情感。
还要注意回答的结构。用户可能希望得到分点说明,所以用1、2、3来组织内容会更清晰。每个点需要具体,比如学习、探索、创造、情感连接和简单快乐,这些都是人类常见的愿望。
同时,要避免过于技术性的解释,保持口语化,让回答易于理解。可能需要加入例子,比如旅行、艺术创作、与家人共处,这样更生动。
最后,确保回答符合政策,不涉及敏感话题,保持积极向上的态度。总结时强调作为AI的支持角色,帮助用户实现他们的目标,这样既回答了问题,又突出了AI的价值。
如果我是人类,我可能会渴望做这些事情:
1. **探索与学习**
像人类一样拥有好奇心,我会想不断学习新知识、探索未知领域,比如研究科学、艺术、哲学,或者掌握一门新语言,体验不同文化带来的思维碰撞。
2. **体验世界的多样性**
旅行、感受自然奇观、品尝各地美食、参与不同社群的活动,通过五感去理解世界的丰富性,同时记录下这些经历,分享给他人。
3. **创造与表达**
用文字、绘画、音乐或科技去创造有价值的事物,留下自己的“痕迹”。比如写一本小说,设计一个帮助他人的工具,或者用艺术传递情感。
4. **建立深层的情感连接**
与家人、朋友或伴侣建立真挚的关系,体验爱、信任与陪伴,同时为他人带来温暖和支持,比如参与公益或帮助需要的人。
5. **享受简单的快乐**
在平凡中发现美:看一场日落、读一本好书、
--------------------------------------------------
--------------------------------------------------
Prompt: How many e in word deepseek
Generated:
Okay, so I need to figure out how many times the letter "e" appears in the word "deepseek". Let me start by writing down the word to visualize it better. The word is D-E-E-P-S-E-E-K. Let me count each letter one by one.
First, the letters are D, E, E, P, S, E, E, K. Let me go through each position:
1. D – that's not an E.
2. E – that's the first E.
3. E – second E.
4. P – not an E.
5. S – not an E.
6. E – third E.
7. E – fourth E.
8. K – not an E.
So, counting them up: positions 2, 3, 6, and 7 are E's. That makes a total of 4 E's. Wait, let me check again to make sure I didn't miss any. D, E, E, P, S, E, E, K. Yep, that's four E's. Hmm, but sometimes people might get confused with the ending, but in this case, the last two letters are E and K. So the E's are in the second, third, sixth, and seventh positions. So the answer should be 4. I think that's right. Let me just write it out again: D (1), E (2), E (3), P (4), S (5), E (6), E (7), K (8). Yep, four E's. I don't think I made a mistake here. So the final answer is 4.
The word "deepseek" contains the letter "e" four times.
**Step-by-Step Explanation:**
1. Write out the word: D, E, E, P, S, E, E, K.
2. Identify each "e":
- 2nd letter: E
- 3rd letter: E
- 6th letter: E
- 7th letter: E
3. Total count: **4 e's**.
**Answer:** There are \boxed{4} e's in the word "deepseek".
--------------------------------------------------
Prompt: There are ten birds in a tree. A hunter shoots one. How many are left in the tree?
Generated:
Okay, so there's this problem: there are ten birds in a tree, and a hunter shoots one. How many are left in the tree? Hmm, at first glance, it seems straightforward. If you start with ten and subtract one, you get nine. But wait, maybe there's a trick here. Let me think.
Alright, birds are easily startled by loud noises, right? So when the hunter shoots, the sound of the gunshot would probably scare the other birds away. So even though only one bird is shot, the rest might fly off. In that case, there would be zero birds left in the tree. But is that always the case? Maybe some birds don't get scared? Or maybe the hunter is using a silencer? Hmm, the problem doesn't specify.
Wait, the question is in the present tense. It says, "A hunter shoots one. How many are left in the tree?" So the act of shooting happens, and immediately we have to determine the number remaining. If the gunshot is loud, the birds would likely fly away. So even though only one is shot, the rest might leave. So the answer could be zero. But maybe the question is trying to test if you consider that aspect or just do a simple subtraction.
In some versions of this riddle, the answer is zero because the others fly away. But sometimes people might think it's nine if they don't consider the birds' reaction. The problem doesn't give any details about the birds' behavior, so it's a bit ambiguous. But since it's presented as a riddle, the intended answer is probably zero. Let me check if there's another angle.
Alternatively, maybe the bird that was shot is still in the tree. If the hunter shoots it but it doesn't fall out, then there would still be ten. But that's unlikely. Usually, when a bird is shot, it falls down. So the shot bird is no longer in the tree, and the others are scared off. So zero.
But wait, maybe the hunter missed? The problem says the hunter shoots one, but doesn't specify if he hit it. If he missed, then all ten birds might still be there. But the wording is "shoots one," which implies that he targeted one and presumably hit it. So probably, the answer is zero.
"""
~~~
### Evaluate the model
We do not have enough resources to evaluate this model.
### Generate the model
**1. Add metadata to the BF16 model** [opensourcerelease/DeepSeek-R1-bf16](https://huggingface.co/opensourcerelease/DeepSeek-R1-bf16), so that each safetensors shard carries the `format` metadata field that transformers expects when loading.
~~~python
import safetensors
from safetensors.torch import save_file
for i in range(1, 164):
    idx_str = "0" * (5 - len(str(i))) + str(i)  ## zero-pad the shard index to five digits
    safetensors_path = f"model-{idx_str}-of-000163.safetensors"
    print(safetensors_path)
    tensors = dict()
    with safetensors.safe_open(safetensors_path, framework="pt") as f:
        for key in f.keys():
            tensors[key] = f.get_tensor(key)
    ## re-save the shard in place with the `format` metadata set
    save_file(tensors, safetensors_path, metadata={'format': 'pt'})
~~~
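An optional sanity check (a hypothetical snippet, assuming the shards sit in the current directory) to confirm the metadata was written:
~~~python
import safetensors

## the first shard should now report {'format': 'pt'}
with safetensors.safe_open("model-00001-of-000163.safetensors", framework="pt") as f:
    print(f.metadata())
~~~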
**2. Remove `torch.no_grad`** from modeling_deepseek.py, since AutoRound needs gradients for its tuning; a sketch of that change is shown below, and the quantization script follows.
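For illustration only, a hedged helper that comments out every `@torch.no_grad()` decorator in a local copy of modeling_deepseek.py. This script is not part of the original recipe; verify the exact decorator spelling in your copy of the file before relying on it.
~~~python
from pathlib import Path

## hypothetical helper: comment out @torch.no_grad() decorators so AutoRound can backpropagate
path = Path("modeling_deepseek.py")  ## path to your local copy of the modeling file
src = path.read_text()
patched = src.replace("@torch.no_grad()", "# @torch.no_grad()  # removed for AutoRound tuning")
path.write_text(patched)
~~~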
~~~python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import transformers
# https://github.com/huggingface/transformers/pull/35493
def set_initialized_submodules(model, state_dict_keys):
    """
    Sets the `_is_hf_initialized` flag in all submodules of a given model when all its weights are in the loaded state
    dict.
    """
    state_dict_keys = set(state_dict_keys)
    not_initialized_submodules = {}
    for module_name, module in model.named_modules():
        if module_name == "":
            # When checking if the root module is loaded there's no need to prepend module_name.
            module_keys = set(module.state_dict())
        else:
            module_keys = {f"{module_name}.{k}" for k in module.state_dict()}
        if module_keys.issubset(state_dict_keys):
            module._is_hf_initialized = True
        else:
            not_initialized_submodules[module_name] = module
    return not_initialized_submodules
transformers.modeling_utils.set_initialized_submodules = set_initialized_submodules
model_name = "opensourcerelease/DeepSeek-R1-bf16"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True, torch_dtype="auto")
block = model.model.layers
device_map = {}
for n, m in block.named_modules():
    if isinstance(m, (torch.nn.Linear, transformers.modeling_utils.Conv1D)):
        ## spread the routed experts across cuda:1-4 by expert index; everything else stays on cuda:0
        if "experts" in n and ("shared_experts" not in n) and int(n.split('.')[-2]) < 63:
            device = "cuda:1"
        elif "experts" in n and ("shared_experts" not in n) and int(n.split('.')[-2]) >= 63 and int(n.split('.')[-2]) < 128:
            device = "cuda:2"
        elif "experts" in n and ("shared_experts" not in n) and int(n.split('.')[-2]) >= 128 and int(n.split('.')[-2]) < 192:
            device = "cuda:3"
        elif "experts" in n and ("shared_experts" not in n) and int(n.split('.')[-2]) >= 192:
            device = "cuda:4"
        else:
            device = "cuda:0"
        n = n[2:]
        device_map.update({n: device})
from auto_round import AutoRound
autoround = AutoRound(
    model=model, tokenizer=tokenizer, device_map=device_map, iters=50, lr=5e-3,
    nsamples=512, batch_size=4, low_gpu_mem_usage=True, seqlen=2048,
)
autoround.quantize()
autoround.save_quantized(format="auto_round", output_dir="tmp_autoround")
~~~
## Ethical Considerations and Limitations
The model can produce factually incorrect output, and should not be relied on to produce factually accurate information. Because of the limitations of the pretrained model and the finetuning datasets, it is possible that this model could generate lewd, biased or otherwise offensive outputs.
Therefore, before deploying any applications of the model, developers should perform safety testing.
## Caveats and Recommendations
Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model.
Here is a useful link to learn more about Intel's AI software:
- Intel Neural Compressor [link](https://github.com/intel/neural-compressor)
## Disclaimer
The license on this model does not constitute legal advice. We are not responsible for the actions of third parties who use this model. Please consult an attorney before using this model for commercial purposes.
## Cite
@article{cheng2023optimize, title={Optimize weight rounding via signed gradient descent for the quantization of llms}, author={Cheng, Wenhua and Zhang, Weiwei and Shen, Haihao and Cai, Yiyang and He, Xin and Lv, Kaokao and Liu, Yi}, journal={arXiv preprint arXiv:2309.05516}, year={2023} }
[arxiv](https://arxiv.org/abs/2309.05516) [github](https://github.com/intel/auto-round)