jerryzh168 committed
Commit 7d5e958 · verified · 1 Parent(s): a2ac3b7

Update README.md

Files changed (1): README.md +83 -0
README.md CHANGED
@@ -19,6 +19,89 @@ pipeline_tag: text-generation
 
 [Phi4-mini](https://huggingface.co/microsoft/Phi-4-mini-instruct) model quantized with [torchao](https://huggingface.co/docs/transformers/main/en/quantization/torchao) float8 dynamic activation and float8 weight quantization (per-row granularity), by the PyTorch team. Use it directly, or serve it with [vLLM](https://docs.vllm.ai/en/latest/) for a 36% VRAM reduction and a 15-20% speedup, with little to no accuracy impact on H100.
 
+ # Inference with vLLM
+ We need to install the vLLM nightly build to pick up some recent changes:
+ ```
+ pip install vllm --pre --extra-index-url https://wheels.vllm.ai/nightly
+ ```
+ ## Command Line
+ Then we can serve the model with the following command:
+ ```
+ vllm serve pytorch/Phi-4-mini-instruct-float8dq --tokenizer microsoft/Phi-4-mini-instruct -O3
+ ```
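+
+ Once the server is running, it exposes vLLM's OpenAI-compatible API (on port 8000 by default). As a rough sketch that is not part of the original card, you can query it with the `openai` client (`pip install openai`); the base URL and the placeholder `EMPTY` API key below are the usual vLLM defaults:
+ ```
+ from openai import OpenAI
+
+ # Assumes the default vLLM server address; adjust host/port if you changed them.
+ client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
+
+ response = client.chat.completions.create(
+     # The served model name defaults to the path passed to `vllm serve`.
+     model="pytorch/Phi-4-mini-instruct-float8dq",
+     messages=[{"role": "user", "content": "Give me a short introduction to large language models."}],
+     max_tokens=128,
+ )
+ print(response.choices[0].message.content)
+ ```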
+
+ ## Code Example
+ ```
+ from vllm import LLM, SamplingParams
+
+ llm = LLM(model="pytorch/Phi-4-mini-instruct-float8dq", trust_remote_code=True)
+
+ messages = [
+     {"role": "system", "content": "You are a helpful AI assistant."},
+     {"role": "user", "content": "Can you provide ways to eat combinations of bananas and dragonfruits?"},
+     {"role": "assistant", "content": "Sure! Here are some ways to eat bananas and dragonfruits together: 1. Banana and dragonfruit smoothie: Blend bananas and dragonfruits together with some milk and honey. 2. Banana and dragonfruit salad: Mix sliced bananas and dragonfruits together with some lemon juice and honey."},
+     {"role": "user", "content": "What about solving an 2x + 3 = 7 equation?"},
+ ]
+
+ sampling_params = SamplingParams(
+     max_tokens=500,
+     temperature=0.0,
+ )
+
+ output = llm.chat(messages=messages, sampling_params=sampling_params)
+ print(output[0].outputs[0].text)
+ ```
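+
+ The same `LLM` object also supports offline batch generation, which is where the throughput gains from float8 quantization are most visible. A minimal sketch (not part of the original card), reusing the `llm` and `sampling_params` objects defined above:
+ ```
+ # Note: `generate` takes raw prompts and does not apply the chat template.
+ prompts = [
+     "Write a haiku about quantization.",
+     "Explain in one sentence what float8 dynamic activation quantization does.",
+ ]
+ outputs = llm.generate(prompts, sampling_params)
+ for out in outputs:
+     print(out.prompt, "->", out.outputs[0].text)
+ ```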
+
+ # Inference with Transformers
+
+ Install the required packages:
+ ```
+ pip install git+https://github.com/huggingface/transformers@main
+ pip install --pre torchao --index-url https://download.pytorch.org/whl/nightly/cu126
+ pip install torch
+ pip install accelerate
+ ```
+
+ Example:
+ ```
+ import torch
+ from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
+
+ torch.random.manual_seed(0)
+
+ model_path = "pytorch/Phi-4-mini-instruct-float8dq"
+
+ model = AutoModelForCausalLM.from_pretrained(
+     model_path,
+     device_map="auto",
+     torch_dtype="auto",
+     trust_remote_code=True,
+ )
+ tokenizer = AutoTokenizer.from_pretrained(model_path)
+
+ messages = [
+     {"role": "system", "content": "You are a helpful AI assistant."},
+     {"role": "user", "content": "Can you provide ways to eat combinations of bananas and dragonfruits?"},
+     {"role": "assistant", "content": "Sure! Here are some ways to eat bananas and dragonfruits together: 1. Banana and dragonfruit smoothie: Blend bananas and dragonfruits together with some milk and honey. 2. Banana and dragonfruit salad: Mix sliced bananas and dragonfruits together with some lemon juice and honey."},
+     {"role": "user", "content": "What about solving an 2x + 3 = 7 equation?"},
+ ]
+
+ pipe = pipeline(
+     "text-generation",
+     model=model,
+     tokenizer=tokenizer,
+ )
+
+ generation_args = {
+     "max_new_tokens": 500,
+     "return_full_text": False,
+     "temperature": 0.0,
+     "do_sample": False,
+ }
+
+ output = pipe(messages, **generation_args)
+ print(output[0]['generated_text'])
+ ```
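+
+ To sanity-check the VRAM numbers quoted at the top of this card, you can record peak GPU memory around a generation call. This is a rough sketch rather than part of the original card, and it assumes the `pipe`, `messages`, and `generation_args` objects from the example above are already defined:
+ ```
+ import torch
+
+ # Reset the CUDA allocator's peak-memory counters, run one generation, then report the peak.
+ torch.cuda.reset_peak_memory_stats()
+ output = pipe(messages, **generation_args)
+ peak_gib = torch.cuda.max_memory_reserved() / 1024**3
+ print(f"Peak reserved GPU memory: {peak_gib:.2f} GiB")
+ ```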
 
 # Quantization Recipe