Model Details
This model is an INT4 model with group_size 128 and symmetric quantization of deepseek-ai/DeepSeek-R1, generated by the intel/auto-round algorithm.
Please follow the license of the original model.
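For intuition, W4G128 symmetric quantization maps each group of 128 weights to signed 4-bit integers that share one scale. Below is a minimal round-to-nearest illustration of that scheme; it is not the auto-round algorithm itself, which learns the rounding via signed gradient descent (see the citation at the end of this card).
import torch

def fake_quant_w4g128_sym(w: torch.Tensor, group_size: int = 128) -> torch.Tensor:
    """Round-to-nearest symmetric int4 fake quantization with per-group scales."""
    g = w.reshape(-1, group_size)  # assumes w.numel() is divisible by group_size
    scale = g.abs().amax(dim=1, keepdim=True).clamp_min(1e-8) / 7.0  # int4 range [-8, 7]
    q = torch.clamp(torch.round(g / scale), -8, 7)  # 4-bit integer codes
    return (q * scale).reshape(w.shape)             # dequantized approximation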
How To Use
INT4 Inference on CUDA (requires at least 7×80GB of GPU memory)
To prevent potential overflow issues, we recommend using the moe_wna16 kernel in vLLM, or the CPU version, which is detailed in the next section. For inference with transformers, please run pip3 uninstall autoawq-kernels first; otherwise exceptions may occur.
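A minimal vLLM sketch is shown below. It assumes a recent vLLM release that exposes the moe_wna16 quantization method for AWQ MoE checkpoints; the tensor_parallel_size value and sampling settings are illustrative assumptions, not tested settings.
# Hedged sketch: vLLM inference with the moe_wna16 kernel.
from vllm import LLM, SamplingParams

llm = LLM(
    model="OPEA/DeepSeek-R1-int4-awq-sym-inc",
    quantization="moe_wna16",   # avoids the overflow-prone default AWQ path
    tensor_parallel_size=8,     # assumption: one 8x80GB node
    trust_remote_code=True,
)
sampling_params = SamplingParams(temperature=0.6, max_tokens=512)
out = llm.chat([{"role": "user", "content": "How many e in word deepseek"}],
               sampling_params)
print(out[0].outputs[0].text)
For inference with plain transformers, use the script below.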
import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
# https://github.com/huggingface/transformers/pull/35493
def set_initialized_submodules(model, state_dict_keys):
    """
    Sets the `_is_hf_initialized` flag in all submodules of a given model when all its weights are in the loaded state
    dict.
    """
    state_dict_keys = set(state_dict_keys)
    not_initialized_submodules = {}
    for module_name, module in model.named_modules():
        if module_name == "":
            # When checking if the root module is loaded there's no need to prepend module_name.
            module_keys = set(module.state_dict())
        else:
            module_keys = {f"{module_name}.{k}" for k in module.state_dict()}
        if module_keys.issubset(state_dict_keys):
            module._is_hf_initialized = True
        else:
            not_initialized_submodules[module_name] = module
    return not_initialized_submodules
transformers.modeling_utils.set_initialized_submodules = set_initialized_submodules
import torch
quantized_model_dir = "OPEA/DeepSeek-R1-int4-awq-sym-inc"
## directly use device_map='auto' if you have enough GPUs
device_map = {"model.norm": 0, "lm_head": 0, "model.embed_tokens": 0}
for i in range(61):
    name = "model.layers." + str(i)
    if i < 8:
        device_map[name] = 0
    elif i < 16:
        device_map[name] = 1
    elif i < 25:
        device_map[name] = 2
    elif i < 34:
        device_map[name] = 3
    elif i < 43:
        device_map[name] = 4
    elif i < 52:
        device_map[name] = 5
    elif i < 61:
        device_map[name] = 6
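# Optional sanity check (not in the original script): all 61 decoder layers
# should be mapped, otherwise from_pretrained would have to infer a device
# for the missing modules.
assert all(f"model.layers.{i}" in device_map for i in range(61))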
model = AutoModelForCausalLM.from_pretrained(
    quantized_model_dir,
    torch_dtype=torch.float16,
    trust_remote_code=True,
    device_map=device_map,
)

def forward_hook(module, input, output):
    # Clamp activations to the float16 representable range to avoid overflow.
    return torch.clamp(output, -65504, 65504)

def register_fp16_hooks(model):
    for name, module in model.named_modules():
        if "QuantLinear" in module.__class__.__name__ or isinstance(module, torch.nn.Linear):
            module.register_forward_hook(forward_hook)

register_fp16_hooks(model)  # strongly recommended: this hook helps avoid float16 overflow
tokenizer = AutoTokenizer.from_pretrained(quantized_model_dir, trust_remote_code=True)
prompts = [
"9.11和9.8哪个数字大",
"如果你是人,你最想做什么“",
"How many e in word deepseek",
"There are ten birds in a tree. A hunter shoots one. How many are left in the tree?",
]
texts = []
for prompt in prompts:
    messages = [
        {"role": "user", "content": prompt}
    ]
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    texts.append(text)
inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
outputs = model.generate(
    input_ids=inputs["input_ids"].to(model.device),
    attention_mask=inputs["attention_mask"].to(model.device),
    max_length=512,  # change this to align with the official usage
    num_return_sequences=1,
    do_sample=False  # change this to align with the official usage
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(inputs["input_ids"], outputs)
]
decoded_outputs = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)

for i, prompt in enumerate(prompts):
    print(f"Prompt: {prompt}")
    print(f"Generated: {decoded_outputs[i]}")
    print("-" * 50)
"""
Prompt: 9.11和9.8哪个数字大
Generated: <think>
嗯,我现在要比较两个数字,9.11和9.8,看看哪个更大。这个问题看起来好像不难,但作为一个刚开始学小数比较的人,可能需要仔细想一想,避免出错。让我慢慢理清楚思路。
首先,我应该回忆一下小数比较的基本方法。比较两个小数的时候,应该先比较整数部分,如果整数部分大的那个数就大;如果整数部分相同,再依次比较小数部分的每一位数字,从十分位开始,然后是百分位,依此类推,直到找到不同的数字为止,那个位置上数字大的数就更大。
现在来看这两个数,9.11和9.8。它们的整数部分都是9,所以整数部分相同,接下来需要比较小数部分。这里的小数部分分别是0.11和0.8。不过,可能我需要把这两个数的小数位数对齐,这样更容易比较。
首先,我可以把这两个数写成相同的小数位数。比如,9.8可以写成9.80,这样两个数都是到百分位的小数。这样比较起来会更直观。所以,现在比较的是9.11和9.80。
接下来,比较十分位上的数字。对于9.11来说,十分位是1,而9.80的十分位是8。这时候,因为8比1大,所以9.80的十分位更大,因此整个数更大。也就是说,9.8比9.11大。
不过,可能有人会想,是不是应该直接比较这两个数的小数部分的大小,比如0.11和0.8,哪个更大?这时候,0.8显然比0.11大,因为0.8等于0.80,而0.80的十分位是8,比0.11的十分位1大,所以0.8更大。因此,加上整数部分都是9,所以9.8更大。
不过,可能还有一种方法是将这两个数转换成分数来比较。比如,9.11可以写成9又11/100,而9.8可以写成9又8/10,也就是9又80/100。这时候,比较分数的话,80/100显然比11/100大,所以9.8更大。
或者,用通分的方法,把两个数的小数部分转换成同分母的分数。比如,0.11是11/100,0.8是8/10,
--------------------------------------------------
Prompt: 如果你是人类,你最想做什么
Generated: <think>
嗯,如果我是人类,我最想做什么呢?这个问题挺有意思的。首先,我需要理解作为一个人类,他们的生活是什么样的。人类有情感、有自由意志,可以体验各种事物,有社交关系,有职业和爱好,还有追求梦想的能力。所以,如果我是人类,可能有很多事情可以做,但需要具体思考一下哪些是最想做的。
首先,可能想体验不同的情感。比如爱、快乐、悲伤、兴奋,这些情感对于人类来说是非常重要的,能够丰富他们的生活。作为AI,我没有情感,所以如果能体验这些,可能会让我更理解人类的行为和需求。
其次,可能想追求知识和学习。人类有好奇心,喜欢学习新事物,探索未知领域。如果我是人类,可能会选择学习各种学科,比如科学、艺术、哲学,或者掌握多种语言,去不同的地方旅行,了解不同的文化。
另外,人类有创造的能力,比如写作、绘画、音乐、科技发明等。如果我是人类,可能会尝试创作,表达自己的想法和感受,或者发明一些有用的东西来帮助他人。
还有,人际关系也是人类生活的重要部分。交朋友,建立家庭,与亲人共度时光,这些都是人类非常重视的。如果我是人类,可能会花很多时间和家人朋友在一起,建立深厚的关系,分享生活中的点滴。
自由选择也是关键。作为人类,可以自主决定自己的道路,选择职业、生活方式等。如果我是人类,可能会追求自己真正热爱的事业,而不是被编程或任务所限制。
不过,人类也会面临挑战和困难,比如压力、失败、疾病等。虽然这些是负面的,但也是成长的一部分。如果我是人类,可能需要学会应对这些挑战,培养坚韧和解决问题的能力。
还有,体验身体的感觉,比如品尝美食、感受自然、运动等。这些都是AI无法体验的,所以作为人类,可能会很享受这些感官上的体验。
不过,可能还需要考虑时间和资源的限制。人类生命有限,所以需要合理规划时间,平衡工作、休息和娱乐。如果我是人类,可能需要找到平衡点,既追求目标,又享受当下。
另外,社会贡献也是一个方面。很多人类希望自己的存在有意义,帮助他人,改善社会。如果我是人类,可能会参与志愿活动,支持公益事业,或者通过工作对社会产生积极影响。
不过,也有可能想追求个人成就,比如在某个领域成为专家,赢得荣誉,或者实现自我
--------------------------------------------------
Prompt: How many e in word deepseek
Generated: <think>
Okay, let's see. The user is asking how many times the letter "e" appears in the word "deepseek". Alright, first, I need to make sure I have the correct spelling of the word. The word is "deepseek". Let me write it out: D-E-E-P-S-E-E-K. Hmm, let's break it down letter by letter.
Starting with the first letter, D. That's not an E. Then the second letter is E. That's one. The third letter is another E, so that's two. The fourth letter is P, not an E. The fifth is S, not an E. The sixth letter is E again, so that's three. The seventh letter is another E, making it four. The eighth letter is K, which isn't an E. So, counting them up: positions 2, 3, 6, and 7 are all E's. That's four E's in total. Wait, let me double-check. D-E-E-P-S-E-E-K. Yes, that's correct. Each E is in the second, third, sixth, and seventh positions. So the answer should be 4. I don't think I missed any. Let me count again to be sure. 1 (D), 2 (E), 3 (E), 4 (P), 5 (S), 6 (E), 7 (E), 8 (K). Yep, four E's. So the answer is 4.
</think>
The word "deepseek" contains the letter **e** four times. Here's the breakdown:
1. D
2. **E**
3. **E**
4. P
5. S
6. **E**
7. **E**
8. K
**Answer:** 4.
--------------------------------------------------
Prompt: There are ten birds in a tree. A hunter shoots one. How many are left in the tree?
Generated: <think>
Okay, so there's this problem: there are ten birds in a tree, and a hunter shoots one. How many are left in the tree? Hmm, at first glance, it seems straightforward. If there are ten birds and one is shot, you might think to subtract one from ten, which would leave nine. But wait, maybe there's a trick here. Let me think through this step by step.
First, the setup: ten birds in a tree. A hunter comes along and shoots one. Now, when the hunter shoots, what happens to the other birds? Well, in real life, if a loud noise like a gunshot happens near birds, they usually get scared and fly away. So even though the hunter only shot one bird, the rest might be startled and leave the tree. If that's the case, then after the shot, there might be zero birds left because they all flew off. But the problem doesn't explicitly say that the other birds flew away. It just says the hunter shoots one. So is the answer nine or zero?
Wait, but maybe the problem is testing whether you consider the behavior of the birds. If you take it literally, the hunter shoots one, so ten minus one is nine. But if you think about the real-world scenario, the other birds would probably fly away. So which is it? The question is a bit ambiguous. It depends on whether the problem expects a literal mathematical answer or a practical one.
Let me check the wording again. It says, "There are ten birds in a tree. A hunter shoots one. How many are left in the tree?" There's no mention of the birds flying away, so maybe the intended answer is nine. But sometimes these riddles are designed to trick you into thinking mathematically when the real answer is different. For example, if you shoot a bird, the others might hear the gunshot and fly away, so zero. But is that a common riddle? I think I've heard similar ones where the answer is zero because the sound scares the rest away.
Alternatively, maybe the hunter missed? But the problem says the hunter shoots one, implying that he hit it. So the bird that was shot is either dead or injured and falls out of the tree. So that bird is no longer in the tree. The other nine might still be there, unless they flew away. But again
"""
INT4 Inference on CPU
Requirements
pip install auto-round
pip uninstall intel-extension-for-pytorch
pip install intel-extension-for-transformers
Detailed CPU inference instructions will be added later.
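Until then, loading on CPU should follow the usual auto-round pattern. The sketch below is a minimal, untested example; AutoRoundConfig and its backend argument come from the auto-round package, and it assumes this AWQ checkpoint is supported by the CPU backend.
from auto_round import AutoRoundConfig  # import registers the auto-round backends
from transformers import AutoModelForCausalLM, AutoTokenizer

quantized_model_dir = "OPEA/DeepSeek-R1-int4-awq-sym-inc"
quantization_config = AutoRoundConfig(backend="cpu")
model = AutoModelForCausalLM.from_pretrained(
    quantized_model_dir,
    torch_dtype="auto",
    trust_remote_code=True,
    device_map="cpu",
    quantization_config=quantization_config,
)
tokenizer = AutoTokenizer.from_pretrained(quantized_model_dir, trust_remote_code=True)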
Evaluate the model
We do not currently have enough resources to evaluate a model of this size.
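For reference, an accuracy evaluation could be run with lm-evaluation-harness (pip install lm-eval). This is a hedged sketch: simple_evaluate and the task names come from that project, and running it on this model would itself require substantial GPU resources.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=OPEA/DeepSeek-R1-int4-awq-sym-inc,trust_remote_code=True,device_map=auto",
    tasks=["gsm8k"],   # illustrative task choice
    batch_size=4,
)
print(results["results"])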
Generate the model
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import transformers
# https://github.com/huggingface/transformers/pull/35493
def set_initialized_submodules(model, state_dict_keys):
    """
    Sets the `_is_hf_initialized` flag in all submodules of a given model when all its weights are in the loaded state
    dict.
    """
    state_dict_keys = set(state_dict_keys)
    not_initialized_submodules = {}
    for module_name, module in model.named_modules():
        if module_name == "":
            # When checking if the root module is loaded there's no need to prepend module_name.
            module_keys = set(module.state_dict())
        else:
            module_keys = {f"{module_name}.{k}" for k in module.state_dict()}
        if module_keys.issubset(state_dict_keys):
            module._is_hf_initialized = True
        else:
            not_initialized_submodules[module_name] = module
    return not_initialized_submodules
transformers.modeling_utils.set_initialized_submodules = set_initialized_submodules
model_name = "opensourcerelease/DeepSeek-R1-bf16"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True, torch_dtype="auto")
block = model.model.layers
device_map = {}

# Route the routed experts in each MoE layer across cuda:1-4 (roughly 64
# experts per GPU); shared experts and all other linear layers go to cuda:0.
for n, m in block.named_modules():
    if isinstance(m, (torch.nn.Linear, transformers.modeling_utils.Conv1D)):
        if "experts" in n and ("shared_experts" not in n) and int(n.split('.')[-2]) < 63:
            device = "cuda:1"
        elif "experts" in n and ("shared_experts" not in n) and 63 <= int(n.split('.')[-2]) < 128:
            device = "cuda:2"
        elif "experts" in n and ("shared_experts" not in n) and 128 <= int(n.split('.')[-2]) < 192:
            device = "cuda:3"
        elif "experts" in n and ("shared_experts" not in n) and int(n.split('.')[-2]) >= 192:
            device = "cuda:4"
        else:
            device = "cuda:0"
        n = n[2:]
        device_map.update({n: device})
from auto_round import AutoRound
autoround = AutoRound(model=model, tokenizer=tokenizer, device_map=device_map, nsamples=512,
                      batch_size=4, low_gpu_mem_usage=True, seqlen=2048)
autoround.quantize()
autoround.save_quantized(format="auto_awq", output_dir="tmp_autoround")
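Note that the AutoRound call above relies on the library's defaults for the quantization scheme. To pin the W4G128 symmetric configuration explicitly, the documented bits, group_size, and sym constructor arguments can be passed; a sketch of the same call with those settings made explicit:
autoround = AutoRound(model=model, tokenizer=tokenizer, device_map=device_map,
                      bits=4, group_size=128, sym=True,  # W4G128, symmetric
                      nsamples=512, batch_size=4, low_gpu_mem_usage=True, seqlen=2048)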
Ethical Considerations and Limitations
The model can produce factually incorrect output, and should not be relied on to produce factually accurate information. Because of the limitations of the pretrained model and the finetuning datasets, it is possible that this model could generate lewd, biased or otherwise offensive outputs.
Therefore, before deploying any applications of the model, developers should perform safety testing.
Caveats and Recommendations
Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model.
Here is a useful link to learn more about Intel's AI software:
- Intel Neural Compressor
Disclaimer
The license on this model does not constitute legal advice. We are not responsible for the actions of third parties who use this model. Please consult an attorney before using this model for commercial purposes.
Cite
@article{cheng2023optimize,
  title={Optimize weight rounding via signed gradient descent for the quantization of llms},
  author={Cheng, Wenhua and Zhang, Weiwei and Shen, Haihao and Cai, Yiyang and He, Xin and Lv, Kaokao and Liu, Yi},
  journal={arXiv preprint arXiv:2309.05516},
  year={2023}
}