jerryzh168 committed
Commit f950d65 · verified · Parent: a5c9018

Update README.md

Files changed (1): README.md (+54 -0)
README.md CHANGED
@@ -121,6 +121,60 @@ lm_eval --model hf --model_args pretrained=pytorch/Phi-4-mini-instruct-float8dq
| mathqa (0-shot) | 42.31 | 42.51 |
| **Overall** | **TODO** | **TODO** |

+ # Peak Memory Usage
+
+ We can use the following code to get a sense of peak memory usage during inference:
+
+ ## Results
+
+ | Benchmark | | |
+ |------------------|----------------|--------------------------------|
+ | | Phi-4 mini-Ins | Phi-4-mini-instruct-float8dq |
+ | Peak Memory (GB) | 8.91 | 5.70 |
+
+
+ ## Benchmark Peak Memory
+
+ ```
+ import torch
+ from transformers import AutoModelForCausalLM, AutoTokenizer, TorchAoConfig
+
+ # use "microsoft/Phi-4-mini-instruct" or "pytorch/Phi-4-mini-instruct-float8dq"
+ model_id = "microsoft/Phi-4-mini-instruct"
+ quantized_model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype=torch.bfloat16)
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+
+ torch.cuda.reset_peak_memory_stats()
+
+ prompt = "Hey, are you conscious? Can you talk to me?"
+ messages = [
+     {
+         "role": "system",
+         "content": "",
+     },
+     {"role": "user", "content": prompt},
+ ]
+ templated_prompt = tokenizer.apply_chat_template(
+     messages,
+     tokenize=False,
+     add_generation_prompt=True,
+ )
+ print("Prompt:", prompt)
+ print("Templated prompt:", templated_prompt)
+ inputs = tokenizer(
+     templated_prompt,
+     return_tensors="pt",
+ ).to("cuda")
+ generated_ids = quantized_model.generate(**inputs, max_new_tokens=128)
+ output_text = tokenizer.batch_decode(
+     generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
+ )
+ print("Response:", output_text[0][len(prompt):])
+
+ mem = torch.cuda.max_memory_reserved() / 1e9
+ print(f"Peak Memory Usage: {mem:.02f} GB")
+ ```
+
# Model Performance

Need to install vllm nightly to get some recent changes
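
The peak-memory snippet added in this commit measures one checkpoint per run. To get both columns of the Results table back to back, the measurement can be wrapped in a helper and applied to each checkpoint in turn. This is a minimal sketch, assuming a single CUDA GPU and that torchao is installed so the float8dq checkpoint can be loaded; the `peak_memory_gb` helper and the loop are illustrative, not part of the model card.

```
import gc

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def peak_memory_gb(model_id):
    # Load the checkpoint the same way as the snippet above.
    model = AutoModelForCausalLM.from_pretrained(
        model_id, device_map="auto", torch_dtype=torch.bfloat16
    )
    tokenizer = AutoTokenizer.from_pretrained(model_id)

    # Reset the peak counter so the reading covers the resident weights plus this generation.
    torch.cuda.reset_peak_memory_stats()

    messages = [
        {"role": "system", "content": ""},
        {"role": "user", "content": "Hey, are you conscious? Can you talk to me?"},
    ]
    templated_prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    inputs = tokenizer(templated_prompt, return_tensors="pt").to("cuda")
    model.generate(**inputs, max_new_tokens=128)

    peak = torch.cuda.max_memory_reserved() / 1e9

    # Release the model so the next measurement starts from a clean allocator state.
    del model, inputs
    gc.collect()
    torch.cuda.empty_cache()
    return peak


for model_id in ["microsoft/Phi-4-mini-instruct", "pytorch/Phi-4-mini-instruct-float8dq"]:
    print(f"{model_id}: {peak_memory_gb(model_id):.2f} GB peak reserved")
```

The two printed numbers correspond to the two columns of the Results table above; exact values will vary with GPU, driver, and allocator state.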