	---
	license: mit
	train: false
	inference: true
	pipeline_tag: text-generation
	base_model:
	- deepseek-ai/DeepSeek-R1-Distill-Qwen-7B
	---
	This is a version of the <a href="https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B">DeepSeek-R1-Distill-Qwen-7B</a> model re-distilled for better performance.

	## Performance

	| Models | <a href="https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B">DeepSeek-R1-Distill-Qwen-7B</a> | <a href="https://huggingface.co/mobiuslabsgmbh/DeepSeek-R1-ReDistill-Qwen-7B-v1.1">DeepSeek-R1-ReDistill-Qwen-7B-v1.1</a> |
	|:-------------------:|:--------:|:----------------:|
	| ARC (25-shot) | <b>55.03</b> | 52.3 |
	| HellaSwag (10-shot) | 61.9 | <b>62.36</b> |
	| MMLU (5-shot) | 56.75 | <b>59.53</b> |
	| TruthfulQA-MC2 | 45.76 | <b>47.7</b> |
	| Winogrande (5-shot) | 60.38 | <b>61.8</b> |
	| GSM8K (5-shot) | 78.85 | <b>83.4</b> |
	| Average | 59.78 | <b>61.18</b> |
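
The Average row is the unweighted mean of the six benchmark scores above it. A quick check in Python, with the scores copied from the table (the variable and function names are illustrative only):

```python
# Scores copied from the table, in row order:
# ARC, HellaSwag, MMLU, TruthfulQA-MC2, Winogrande, GSM8K
distill = [55.03, 61.9, 56.75, 45.76, 60.38, 78.85]
redistill = [52.3, 62.36, 59.53, 47.7, 61.8, 83.4]


def mean_score(scores):
    """Unweighted mean, rounded to two decimals like the table."""
    return round(sum(scores) / len(scores), 2)


print(mean_score(distill))    # 59.78
print(mean_score(redistill))  # 61.18
```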

	| Models | <a href="https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B">DeepSeek-R1-Distill-Qwen-7B</a> | <a href="https://huggingface.co/mobiuslabsgmbh/DeepSeek-R1-ReDistill-Qwen-7B-v1.1">DeepSeek-R1-ReDistill-Qwen-7B-v1.1</a> |
	|:-------------------:|:--------:|:----------------:|
	| GPQA (0-shot) | 30.9 | <b>34.99</b> |
	| MMLU-Pro (5-shot) | 28.83 | <b>31.02</b> |
	| MuSR (0-shot) | 38.85 | <b>44.42</b> |
	| BBH (3-shot) | 43.54 | <b>51.53</b> |
	| IFEval (0-shot) - strict | <b>42.33</b> | 35.49 |
	| IFEval (0-shot) - loose | 30.31 | <b>38.49</b> |

	## Usage
	```Python
	import torch
	from transformers import AutoModelForCausalLM, AutoTokenizer

	compute_dtype = torch.bfloat16
	device = 'cuda'
	model_id = "mobiuslabsgmbh/DeepSeek-R1-ReDistill-Qwen-7B-v1.1"

	# Load the model and tokenizer
	model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=compute_dtype, attn_implementation="sdpa", device_map=device)
	tokenizer = AutoTokenizer.from_pretrained(model_id)

	# Apply the chat template to get the prompt token ids, then sample a completion
	prompt = "What is 1.5+102.2?"
	chat = tokenizer.apply_chat_template([{"role":"user", "content":prompt}], tokenize=True, add_generation_prompt=True, return_tensors="pt")
	outputs = model.generate(chat.to(device), max_new_tokens=1024, do_sample=True)

	# Decode the full sequence (prompt + completion), special tokens included
	print(tokenizer.decode(outputs[0]))
	```

	Output:
	```
	<｜begin▁of▁sentence｜><｜User｜>What is 1.5+102.2?<｜Assistant｜><think>
	First, I need to add the whole number parts of the two numbers. The whole numbers are 1 and 102, which add up to 103.

	Next, I add the decimal parts of the two numbers. The decimal parts are 0.5 and 0.2, which add up to 0.7.

	Finally, I combine the whole number and decimal parts to get the total sum. Adding 103 and 0.7 gives me 103.7.
	</think>

	To add the numbers \(1.5\) and \(102.2\), follow these steps:

	1. Add the whole number parts:
	\[
	1 + 102 = 103
	\]

	2. Add the decimal parts:
	\[
	0.5 + 0.2 = 0.7
	\]

	3. Combine the results:
	\[
	103 + 0.7 = 103.7
	\]

	Final Answer:
	\[
	\boxed{103.7}
	\]<｜end▁of▁sentence｜>
	```
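
The decoded text holds the model's reasoning between `<think>` and `</think>`, followed by the final answer. A minimal sketch for separating the two is shown here. `split_reasoning` is a hypothetical helper written for this example, not part of transformers, and it assumes the tags and the end-of-sentence token appear exactly as in the sample output.

```python
def split_reasoning(decoded, end_token="<｜end▁of▁sentence｜>"):
    """Split decoded model output into (reasoning, answer).

    Hypothetical helper: assumes the reasoning sits between
    <think> and </think>, as in the sample output.
    """
    open_tag, close_tag = "<think>", "</think>"
    start = decoded.find(open_tag)
    end = decoded.find(close_tag)
    if start == -1 or end == -1 or end < start:
        # No complete reasoning block: treat everything as the answer.
        return "", decoded.replace(end_token, "").strip()
    reasoning = decoded[start + len(open_tag):end].strip()
    answer = decoded[end + len(close_tag):].replace(end_token, "").strip()
    return reasoning, answer


# Shortened stand-in for the decoded output
sample = (
    "<｜begin▁of▁sentence｜><｜User｜>What is 1.5+102.2?<｜Assistant｜><think>\n"
    "Adding 103 and 0.7 gives me 103.7.\n"
    "</think>\n\n"
    "Final Answer: 103.7<｜end▁of▁sentence｜>"
)
reasoning, answer = split_reasoning(sample)
print(reasoning)  # Adding 103 and 0.7 gives me 103.7.
print(answer)     # Final Answer: 103.7
```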

	## HQQ
	Run ~3.5x faster with <a href="https://github.com/mobiusml/hqq/">HQQ</a>. First, install the dependencies:
	```
	pip install hqq
	```

	```Python
	import torch
	from transformers import AutoModelForCausalLM, AutoTokenizer
	from hqq.models.hf.base import AutoHQQHFModel
	from hqq.core.quantize import BaseQuantizeConfig
	from hqq.utils.patching import prepare_for_inference
	from hqq.utils.generation_hf import HFGenerator

	# Params
	device = 'cuda:0'
	backend = "torchao_int4"
	# bfloat16 for the torchao_int4 backend, float16 otherwise
	compute_dtype = torch.bfloat16 if backend=="torchao_int4" else torch.float16
	model_id = "mobiuslabsgmbh/DeepSeek-R1-ReDistill-Qwen-7B-v1.1"

	# Load
	tokenizer = AutoTokenizer.from_pretrained(model_id)
	model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=compute_dtype, attn_implementation="sdpa")

	# Quantize to 4-bit with groups of 64 weights, then move to the GPU
	quant_config = BaseQuantizeConfig(nbits=4, group_size=64, axis=1)
	AutoHQQHFModel.quantize_model(model, quant_config=quant_config, compute_dtype=compute_dtype, device=device)

	# Optimize: patch the quantized layers to use the selected backend
	prepare_for_inference(model, backend=backend, verbose=False)

	############################################################
	# Generate (streaming)
	gen = HFGenerator(model, tokenizer, max_new_tokens=4096, do_sample=True, compile='partial').warmup()

	prompt = "If A equals B, and C equals B - A, what would be the value of C?"
	out = gen.generate(prompt, print_tokens=True)

	############################################################
	# #Generate (simple)
	# from hqq.utils.generation_hf import patch_model_for_compiled_runtime
	# patch_model_for_compiled_runtime(model, tokenizer, warmup=True)

	# prompt = "If A equals B, and C equals B - A, what would be the value of C?"
	# chat = tokenizer.apply_chat_template([{"role":"user", "content":prompt}], tokenize=True, add_generation_prompt=True, return_tensors="pt")
	# outputs = model.generate(chat.to(device), max_new_tokens=8192, do_sample=True)
	# print(tokenizer.decode(outputs[0]))
	```
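
To check the speed-up on your own hardware, you can time generation before and after quantization. The sketch below is a generic timing helper, not part of HQQ or transformers. `tokens_per_second` and `fake_generate` are hypothetical names, and `fake_generate` stands in for a real generation call so the example runs without a GPU.

```python
import time


def tokens_per_second(generate_fn, n_runs=3):
    """Return the best tokens/second rate over n_runs calls.

    generate_fn is any zero-argument callable that runs one generation
    and returns the number of new tokens it produced.
    """
    best = 0.0
    for _ in range(n_runs):
        start = time.perf_counter()
        n_tokens = generate_fn()
        elapsed = time.perf_counter() - start
        if elapsed > 0:
            best = max(best, n_tokens / elapsed)
    return best


def fake_generate():
    """Stand-in for a real generation call: 'produces' 100 tokens in ~50 ms."""
    time.sleep(0.05)
    return 100


rate = tokens_per_second(fake_generate)
print(f"{rate:.0f} tokens/s")
```

With the models above, `generate_fn` could wrap `model.generate` and return `outputs.shape[1] - chat.shape[1]`. The ratio of the quantized rate to the unquantized rate gives the speed-up for your setup.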