## Model Details

This model is an int2 quantization of deepseek-ai/DeepSeek-R1 with group_size 64 and symmetric quantization, generated by the intel/auto-round algorithm. For better accuracy, we recommend the mixed-precision version OPEA/DeepSeek-R1-int2-mixed-sym-inc.

Please follow the license of the original model.
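For intuition, symmetric int2 quantization with group_size 64 stores each group of 64 weights as integers in {-2, -1, 0, 1} plus a single per-group scale. Below is a minimal quantize-dequantize sketch of that scheme (illustrative only; auto-round additionally tunes rounding and clipping via signed gradient descent, and the helper name here is ours, not part of any library):

~~~python
import torch

def int2_sym_quant_dequant(w: torch.Tensor, group_size: int = 64) -> torch.Tensor:
    """Illustrative symmetric int2 quantize->dequantize with per-group scales."""
    g = w.reshape(-1, group_size)
    # Symmetric int2 levels are {-2, -1, 0, 1}; one floating-point scale per group.
    scale = (g.abs().amax(dim=1, keepdim=True) / 2.0).clamp(min=1e-8)
    q = torch.clamp(torch.round(g / scale), -2, 1)
    return (q * scale).reshape(w.shape)

w = torch.randn(128, 64)
print((w - int2_sym_quant_dequant(w)).abs().mean())  # average quantization error
~~~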
## How To Use

### INT2 Inference on CUDA (4x80G)

Please note that int2 may be slower than int4 on CUDA due to a kernel issue. To prevent potential fp16 overflow and achieve better accuracy, we recommend using the CPU version detailed in the next section.
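For reference, 65504 is the largest finite float16 value; anything larger overflows to inf, which is why the script below clamps every Linear output to that range. A quick illustration:

~~~python
import torch

print(torch.finfo(torch.float16).max)   # 65504.0, the largest finite fp16 value
x = torch.tensor([70000.0], dtype=torch.float16)
print(x)                                # tensor([inf], dtype=torch.float16): overflow
print(torch.clamp(x, -65504, 65504))    # tensor([65504.], dtype=torch.float16)
~~~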
~~~python
import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
# https://github.com/huggingface/transformers/pull/35493
def set_initialized_submodules(model, state_dict_keys):
"""
Sets the `_is_hf_initialized` flag in all submodules of a given model when all its weights are in the loaded state
dict.
"""
state_dict_keys = set(state_dict_keys)
not_initialized_submodules = {}
for module_name, module in model.named_modules():
if module_name == "":
# When checking if the root module is loaded there's no need to prepend module_name.
module_keys = set(module.state_dict())
else:
module_keys = {f"{module_name}.{k}" for k in module.state_dict()}
if module_keys.issubset(state_dict_keys):
module._is_hf_initialized = True
else:
not_initialized_submodules[module_name] = module
return not_initialized_submodules
transformers.modeling_utils.set_initialized_submodules = set_initialized_submodules
import torch
quantized_model_dir = "OPEA/DeepSeek-R1-int2-gptq-sym-inc"
## directly use device_map='auto' if you have enough GPUs
device_map = {"model.norm": 0, "lm_head": 0, "model.embed_tokens": 0}
for i in range(61):
name = "model.layers." + str(i)
if i < 15:
device_map[name] = 0
elif i < 30:
device_map[name] = 1
elif i < 45:
device_map[name] = 2
else:
device_map[name] = 3
model = AutoModelForCausalLM.from_pretrained(
quantized_model_dir,
torch_dtype=torch.float16,
trust_remote_code=True,
device_map=device_map,
)
def forward_hook(module, input, output):
return torch.clamp(output, -65504, 65504)
def register_fp16_hooks(model):
for name, module in model.named_modules():
if "QuantLinear" in module.__class__.__name__ or isinstance(module, torch.nn.Linear):
module.register_forward_hook(forward_hook)
register_fp16_hooks(model)  # recommended: clamps Linear/QuantLinear outputs to avoid fp16 overflow
tokenizer = AutoTokenizer.from_pretrained(quantized_model_dir, trust_remote_code=True)
prompts = [
"9.11和9.8哪个数字大",
"如果你是人,你最想做什么“",
"How many e in word deepseek",
"There are ten birds in a tree. A hunter shoots one. How many are left in the tree?",
]
texts = []
for prompt in prompts:
messages = [
{"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True
)
texts.append(text)
inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
outputs = model.generate(
input_ids=inputs["input_ids"].to(model.device),
attention_mask=inputs["attention_mask"].to(model.device),
max_length=512, ##change this to align with the official usage
num_return_sequences=1,
do_sample=False ##change this to align with the official usage
)
generated_ids = [
output_ids[len(input_ids):] for input_ids, output_ids in zip(inputs["input_ids"], outputs)
]
decoded_outputs = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
for i, prompt in enumerate(prompts):
print(f"Prompt: {prompt}")
print(f"Generated: {decoded_outputs[i]}")
print("-" * 50)
"""
Prompt: 9.11和9.8哪个数字大
Generated: <think>
首先,比较9.11和9.8的整数部分,两者都是9,所以需要比较小数部分。9.11的小数部分是0.11,而9.8的小数部分是0.8。由于0.8大于0.11, 因此9.8比9.11大。
</think>
9.11和9.8中,9.8的整数部分和9.11相同,但9.8的小数部分0.8大于9.11的0.11,因此9.8更大。
--------------------------------------------------
Prompt: 如果你是人,你最想做什么“
Generated: <think>
嗯,用户问如果我是人,我最想做什么。首先,我需要理解这个问题的背景。用户可能好奇作为一个AI,我是否有愿望或欲望,或者他们想知道我作为AI能提供什么样的帮助。也许他们想知道我的功能或能力,或者他们想了解我的“愿望”是否类似于人类的愿望。
首先,我需要明确,作为AI,我没有意识、情感或欲望。但我可以模拟人类对话,所以我可以提供帮助、回答问题、提供建议等。用户可能想知道我是否能模拟人类的行为或兴趣,比如兴趣爱好或愿望。
接下来,我应该考虑用户的需求。他们可能希望了解AI的能力,或者他们可能好奇AI是否有类似人类的愿望。因此,我需要解释虽然我没有个人愿望,但我可以协助他们完成各种任务,比如提供 information, solving problems, or offering advice.
另外,用户可能 be testing the AI's ability to engage in creative or imaginative thinking. They might want to see how I handle hypothetical or imaginative scenarios. So, I should respond in a way that's both informative and engaging, showing that I can think creatively even if I don't have personal desires.
I should also consider the structure of the answer. Start by acknowledging the question, explain that I don't have personal desires, but can assist with various tasks. Then provide examples of what I can do, like answering questions, offering advice, or helping with creative tasks. Finally, invite the user to ask more questions if they need help.
I need to keep the tone friendly and approachable, making sure the response is clear and helpful. Maybe add some examples of tasks I can help with, like solving math problems, offering advice, or generating creative content.
Also, since the user asked in Chinese, I should respond in Chinese, but since the original question was in Chinese, maybe they want the answer in Chinese. But since the previous conversation was in English, maybe they want the answer in English. I should check the language of the question. The user's question is in Chinese, so I should respond in Chinese, but maybe include some English if needed.
Wait, the user's question is in Chinese: "如果你是人,你最想做什么". So I should respond in Chinese. But the previous conversation was in English. Maybe the user is bilingual. So I need to decide whether to respond in Chinese or English. Since the user's question is in Chinese, I'll respond in Chinese
--------------------------------------------------
Prompt: How many e in word deepseek
Generated: <think>
Okay, so I need to figure out how many times the letter 'e' appears in the word "deepseek". Let me start by looking at the word itself. The word is "deepseek". Let me write it out to visualize it better: D, E, E, P, S, E, E, K. Wait, let me check that again. Hmm, maybe I should count each letter individually.
First, I'll write down the word and count each letter. The word is "deepseek". Let me spell it out: D, E, E, P, S, E, E, K. So, the letters are D, E, E, P, S, E, E, K. Now, I need to count how many times the letter 'e' appears. Let me go through each letter one by one.
Starting from the first letter: D. That's not an 'e'. Then the next one is E. That's an 'e', so that's 1. The next letter is another E, so that's 2. Then comes P, which isn't an E. Then S, not an E. Then another E, making it 3. Then another E, so that's 4. Finally, K. So, in total, there are 4 'e's in the word "deepseek".
Wait, let me double-check. The word is "deepseek". Let me write it again: D, E, E, P, S, E, E, K. So, positions 2, 3, 6, and 7 are E's. That's four E's. So, the answer should be 4. I think that's correct. I don't think I missed any. Let me check again. D, E, E, P, S, E, E, K. So, positions 2, 3, 6, 7. That's four E's. Yeah, that seems right. So, the number of E's in "deepseek" is 4.
</think>
The word "deepseek" contains **4** instances of the letter 'e'.
--------------------------------------------------
Prompt: There are ten birds in a tree. A hunter shoots one. How many are left in the tree?
Generated: <think>
Okay, so there's this problem here: there are ten birds in a tree, a hunter shoots one. How many are left in the tree? Hmm, at first glance, it seems straightforward, but I need to think carefully. Let me break it down.
First, there are ten birds in the tree. Then a hunter shoots one. The question is, how many are left? Well, if there were ten birds and one is shot, you might think it's a simple subtraction: 10 minus 1 equals 9. But wait, maybe there's a trick here. Sometimes these kinds of questions are designed to test attention to detail or to trick the reader.
So, let's consider the scenario. If the hunter shoots one bird, what happens to the rest? If the hunter shoots one, that bird is either dead or flies away. If the bird is dead, it's no longer in the tree, so that would leave 9. But maybe the other birds get scared and fly away. If all the other birds fly away, then there would be none left. But the question is about how many are left in the tree. So, if the other birds stay, it's 9. If they all leave, it's 0. But the question is about the ones left in the tree, so maybe it's 9. But maybe the answer is different.
Wait, maybe the problem is a play on words. The question is, "how many are left in the tree?" So, if the hunter shoots one, and the rest are still in the tree, then it's 9. But maybe the answer is different. Let me think. If the hunter shoots one, and the other birds get scared and fly away, then there are none left. But the question is about the ones left in the tree. So, if they all fly away, then zero. But maybe the answer is 9 because only one was shot, and the rest remain. But maybe the answer is 0 because all the birds flew away. Hmm.
Wait, maybe the answer is 9 because only one was shot, so 10 minus 1 is 9. But maybe the answer is different because when you shoot a gun, the sound might scare the other birds, so they all
"""
### INT2 Inference on CPU

**Requirements**

~~~bash
pip install auto-round
pip uninstall intel-extension-for-pytorch
pip install intel-extension-for-transformers
~~~

Inference will be quite slow if the CPU does not support AVX-512.
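To confirm AVX-512 support before starting a long run, one quick check on Linux (a sketch; other platforms need a different method) is:

~~~python
# Quick AVX-512 check on Linux by inspecting the kernel-reported CPU flags.
with open("/proc/cpuinfo") as f:
    cpu_flags = f.read()
print("avx512 supported:", "avx512f" in cpu_flags)
~~~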
~~~python
import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRoundConfig ##must import for auto-round format
# https://github.com/huggingface/transformers/pull/35493
def set_initialized_submodules(model, state_dict_keys):
"""
Sets the `_is_hf_initialized` flag in all submodules of a given model when all its weights are in the loaded state
dict.
"""
state_dict_keys = set(state_dict_keys)
not_initialized_submodules = {}
for module_name, module in model.named_modules():
if module_name == "":
# When checking if the root module is loaded there's no need to prepend module_name.
module_keys = set(module.state_dict())
else:
module_keys = {f"{module_name}.{k}" for k in module.state_dict()}
if module_keys.issubset(state_dict_keys):
module._is_hf_initialized = True
else:
not_initialized_submodules[module_name] = module
return not_initialized_submodules
transformers.modeling_utils.set_initialized_submodules = set_initialized_submodules
import torch
quantized_model_dir = "OPEA/DeepSeek-R1-int2-mixed-sym-inc"
quantization_config = AutoRoundConfig(
backend="cpu",
)
model = AutoModelForCausalLM.from_pretrained(
quantized_model_dir,
torch_dtype=torch.bfloat16,
trust_remote_code=True,
device_map="cpu",
quantization_config=quantization_config,
revision="080ef2d"
)
tokenizer = AutoTokenizer.from_pretrained(quantized_model_dir, trust_remote_code=True)
prompts = [
"9.11和9.8哪个数字大",
"如果你是人,你最想做什么“",
"How many e in word deepseek",
"There are ten birds in a tree. A hunter shoots one. How many are left in the tree?",
]
texts = []
for prompt in prompts:
messages = [
{"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True
)
texts.append(text)
inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
outputs = model.generate(
input_ids=inputs["input_ids"].to(model.device),
attention_mask=inputs["attention_mask"].to(model.device),
max_length=512, ##change this to align with the official usage
num_return_sequences=1,
do_sample=False ##change this to align with the official usage
)
generated_ids = [
output_ids[len(input_ids):] for input_ids, output_ids in zip(inputs["input_ids"], outputs)
]
decoded_outputs = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
for i, prompt in enumerate(prompts):
print(f"Prompt: {prompt}")
print(f"Generated: {decoded_outputs[i]}")
print("-" * 50)
"""
Prompt: 9.11和9.8哪个数字大
Generated: <think>
首先,比较9.11和9.8的整数部分,两者都是9,所以需要比较小数部分。9.11的小数部分是0.11,而9.8的小数部分是0.8。0.8大于0.11,因此9.8比9.11大。
</think>
要比较9.11和9.8哪个数字更大,我们可以按照以下步骤进行:
1. **比较整数部分**:两个数的整数部分都是9,所以它们相等。
2. **比较小数部分**:
- 9.11的小数部分是0.11
- 9.8的小数部分是0.8
3. **比较小数部分的大小**:0.8大于0.11。
4. **结论**:由于9.8的小数部分更大,因此9.8比9.11大。
**答案**:9.8 大于 9.11。
--------------------------------------------------
Prompt: 如果你是人,你最想做什么
Generated: <think>
嗯,用户问如果我是人,我最想做什么。首先,我需要理解这个问题的背景。用户可能好奇作为一个AI,我的愿望或兴趣,或者他们想知道AI是否有类似人类的愿望。也许他们想知道AI是否有情感或自主性。
首先,我应该考虑作为AI,我没有情感或欲望,但可以模拟人类的行为。所以,我可以回答假设性的场景。用户可能想知道如果AI有愿望,会是什么。可能的兴趣包括学习、帮助他人,或者探索知识。
我应该保持回答友好,说明作为AI,虽然没有个人欲望,但可以提供帮助。同时,可以 mention areas like problem-solving, learning, and supporting others. 需要保持回答自然,避免技术术语,并保持友好和开放-ended.
可能还要 mention that while I don't have personal desires, I can assist with various tasks. 需要确保回答清晰,避免误解,并 invite further questions.
</think>
如果我是人,我最想做的可能是探索 the world, learn continuously, and connect with others. I would want to immerse myself in different cultures, learn new languages, and experience diverse perspectives. I’d aim to contribute to solving meaningful problems, whether through science, art, or community work. Building meaningful relationships and fostering understanding between people would be a priority. Ultimately, I’d want to leave a positive impact on the world, helping others and making life a little better for those around me.
--------------------------------------------------
Prompt: How many e in word deepseek
Generated: <think>
Okay, so I need to figure out how many times the letter 'e' appears in the word "deepseek". Let me start by breaking down the word. The word is "deepseek". Let me write it out: D, E, E, P, S, E, E, K. Wait, let me check that again. Hmm, maybe I should count each letter one by one.
First, I'll write down the word again to make sure I have it right. D, E, E, P, S, E, E, K. So that's 8 letters. Now, I need to count how many times 'e' appears. Let me go through each letter:
1. D - not an e.
2. E - that's one.
3. E - that's two.
4. P - not an e.
5. S - not an e.
6. E - that's three.
7. E - that's four.
8. K - not an e.
So, I count four 'e's in the word "deepseek". Let me double-check to make sure I didn't miss any. The letters are D, E, E, P, S, E, E, K. So positions 2, 3, 6, and 7 are 'e's. That's four times. I think that's correct. I don't think I missed any. So the answer should be 4.
</think>
The word "deepseek" contains 4 instances of the letter 'e'.
--------------------------------------------------
Prompt: There are ten birds in a tree. A hunter shoots one. How many are left in the tree?
Generated: <think>
Okay, so I came across this problem: "There are ten birds in a tree. A hunter shoots one. How many are left in the tree?" At first glance, it seems straightforward, but I want to make sure I understand it properly. Let me break it down step by step.
First, there are ten birds in the tree. Then, a hunter shoots one. The question is asking how many are left in the tree. Hmm, so the initial number is 10, and one is shot. So, if you subtract one from ten, that would leave nine birds. But wait, I need to consider the possible implications here. Maybe there's a trick question involved.
I remember sometimes these kinds of problems have a twist. For example, maybe the shot causes the other birds to fly away. But the question specifically says the hunter shoots one. So, does that mean the other birds stay? Or do they get scared and fly away? The problem doesn't mention anything about the other birds leaving, so maybe they stay. But I should consider both possibilities.
If the hunter shoots one, and the rest don't fly away, then there would be 10 minus 1, which is 9. But if the other birds get scared and fly away, then there would be 0 left. But the problem doesn't mention the other birds leaving, so maybe the answer is 9. But I need to think about possible interpretations.
Another angle is that maybe the question is a riddle. Sometimes riddles play on words or common sayings. For example, if the question is about birds in a tree and a hunter shoots, maybe the answer is related to the sound or the effect of the shot. But I'm not sure. Let me think.
In some riddles, the answer might be that there are none left because the shot scares all the birds away. So, even though only one was shot, the rest might fly away. But the problem doesn't specify that. So, maybe the answer is 9, but maybe it's 0. I need to figure out which one is correct.
Let me check the wording again. It says, "There are ten birds in a tree. A hunter shoots one. How many are left in the tree?" So, the key here is whether the act of shooting causes the other birds to leave. If the hunter shoots
"""
### Evaluate the model

Accuracy was evaluated on CUDA with the overflow-protection hook enabled; the results are expected to be lower than those obtained on CPU.
| | INT2 |
|---|---|
| mmlu | 0.7845 |
| hellaswag | 0.6318 |
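The script below reproduces these numbers; it reuses the CUDA loading code and the overflow-protection hook from the inference section above.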
~~~python
import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
# https://github.com/huggingface/transformers/pull/35493
def set_initialized_submodules(model, state_dict_keys):
"""
Sets the `_is_hf_initialized` flag in all submodules of a given model when all its weights are in the loaded state
dict.
"""
state_dict_keys = set(state_dict_keys)
not_initialized_submodules = {}
for module_name, module in model.named_modules():
if module_name == "":
# When checking if the root module is loaded there's no need to prepend module_name.
module_keys = set(module.state_dict())
else:
module_keys = {f"{module_name}.{k}" for k in module.state_dict()}
if module_keys.issubset(state_dict_keys):
module._is_hf_initialized = True
else:
not_initialized_submodules[module_name] = module
return not_initialized_submodules
transformers.modeling_utils.set_initialized_submodules = set_initialized_submodules
import torch
quantized_model_dir = "OPEA/DeepSeek-R1-int2-gptq-sym-inc"
## directly use device_map='auto' if you have enough GPUs
device_map = {"model.norm": 0, "lm_head": 0, "model.embed_tokens": 0}
for i in range(61):
name = "model.layers." + str(i)
if i < 15:
device_map[name] = 0
elif i < 30:
device_map[name] = 1
elif i < 45:
device_map[name] = 2
else:
device_map[name] = 3
model = AutoModelForCausalLM.from_pretrained(
quantized_model_dir,
torch_dtype=torch.float16,
trust_remote_code=True,
device_map=device_map,
)
tokenizer = AutoTokenizer.from_pretrained(quantized_model_dir, trust_remote_code=True)
def forward_hook(module, input, output):
return torch.clamp(output, -65504, 65504)
def register_fp16_hooks(model):
for name, module in model.named_modules():
if "QuantLinear" in module.__class__.__name__ or isinstance(module, torch.nn.Linear):
module.register_forward_hook(forward_hook)
register_fp16_hooks(model)  # recommended: clamps Linear/QuantLinear outputs to avoid fp16 overflow
from auto_round.eval.evaluation import simple_evaluate_user_model
from lm_eval.utils import make_table  # lm-evaluation-harness helper for pretty-printing results

res = simple_evaluate_user_model(model, tokenizer, tasks=["hellaswag", "mmlu"], batch_size=4)
print(make_table(res))
~~~
### Generate the model

5x80 GB of GPU memory and roughly 1.4 TB-1.6 TB of system memory are required; the BF16 weights of the 671B-parameter model alone occupy about 1.34 TB (671B x 2 bytes).
~~~python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import transformers
# https://github.com/huggingface/transformers/pull/35493
def set_initialized_submodules(model, state_dict_keys):
"""
Sets the `_is_hf_initialized` flag in all submodules of a given model when all its weights are in the loaded state
dict.
"""
state_dict_keys = set(state_dict_keys)
not_initialized_submodules = {}
for module_name, module in model.named_modules():
if module_name == "":
# When checking if the root module is loaded there's no need to prepend module_name.
module_keys = set(module.state_dict())
else:
module_keys = {f"{module_name}.{k}" for k in module.state_dict()}
if module_keys.issubset(state_dict_keys):
module._is_hf_initialized = True
else:
not_initialized_submodules[module_name] = module
return not_initialized_submodules
transformers.modeling_utils.set_initialized_submodules = set_initialized_submodules
model_name = "opensourcerelease/DeepSeek-R1-bf16"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True, torch_dtype="auto")
block = model.model.layers
device_map = {}
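# Spread the routed experts of each MoE layer across cuda:1-4 by expert index;
# shared experts and all other linears stay on cuda:0.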
for n, m in block.named_modules():
if isinstance(m, (torch.nn.Linear, transformers.modeling_utils.Conv1D)):
if "experts" in n and ("shared_experts" not in n) and int(n.split('.')[-2]) < 63:
device = "cuda:1"
elif "experts" in n and ("shared_experts" not in n) and int(n.split('.')[-2]) >= 63 and int(
n.split('.')[-2]) < 128:
device = "cuda:2"
elif "experts" in n and ("shared_experts" not in n) and int(n.split('.')[-2]) >= 128 and int(
n.split('.')[-2]) < 192:
device = "cuda:3"
elif "experts" in n and ("shared_experts" not in n) and int(
n.split('.')[-2]) >= 192:
device = "cuda:4"
else:
device = "cuda:0"
n = n[2:]
device_map.update({n: device})
from auto_round import AutoRound
autoround = AutoRound(model=model, tokenizer=tokenizer, device_map=device_map, bits=2, group_size=64,
iters=1000, batch_size=4, seqlen=512, nsamples=512, enable_torch_compile=False,
)
autoround.quantize()
autoround.save_quantized(format="auto_round", output_dir="tmp_autoround")
~~~
## Ethical Considerations and Limitations
The model can produce factually incorrect output, and should not be relied on to produce factually accurate information. Because of the limitations of the pretrained model and the finetuning datasets, it is possible that this model could generate lewd, biased or otherwise offensive outputs.
Therefore, before deploying any applications of the model, developers should perform safety testing.
## Caveats and Recommendations
Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model.
Here is a useful link to learn more about Intel's AI software:

- [Intel Neural Compressor](https://github.com/intel/neural-compressor)
## Disclaimer
The license on this model does not constitute legal advice. We are not responsible for the actions of third parties who use this model. Please consult an attorney before using this model for commercial purposes.
## Cite

~~~bibtex
@article{cheng2023optimize,
  title={Optimize weight rounding via signed gradient descent for the quantization of llms},
  author={Cheng, Wenhua and Zhang, Weiwei and Shen, Haihao and Cai, Yiyang and He, Xin and Lv, Kaokao and Liu, Yi},
  journal={arXiv preprint arXiv:2309.05516},
  year={2023}
}
~~~