---
base_model: EpistemeAI/DeepPhi-3.5-mini-instruct
tags:
- text-generation-inference
- transformers
- unsloth
- llama
- trl
- llama-cpp
- gguf-my-repo
license: mit
language:
- en
---
|
|
|
# Triangle104/DeepPhi-3.5-mini-instruct-Q5_K_S-GGUF

This model was converted to GGUF format from [`EpistemeAI/DeepPhi-3.5-mini-instruct`](https://huggingface.co/EpistemeAI/DeepPhi-3.5-mini-instruct) using llama.cpp via ggml.ai's [GGUF-my-repo](https://huggingface.co/spaces/ggml-org/gguf-my-repo) space.

Refer to the [original model card](https://huggingface.co/EpistemeAI/DeepPhi-3.5-mini-instruct) for more details on the model.

---
|
## Model Summary

DeepPhi is a reasoning-focused Phi model that is a top performer for its 3.8B-parameter size. It builds on Phi-3, which was trained on synthetic data and filtered publicly available websites, with a focus on very high-quality, reasoning-dense data. The model belongs to the Phi-3 model family and supports a 128K token context length.
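As a quick sanity check of the advertised context window, the sketch below reads it from the Hugging Face config; this assumes the checkpoint exposes `max_position_embeddings` the way other Phi-3 checkpoints do.

```python
# Sketch: confirm the context window from the model config.
# Assumes the field is named max_position_embeddings, as in other Phi-3 checkpoints.
from transformers import AutoConfig

config = AutoConfig.from_pretrained(
    "EpistemeAI/DeepPhi-3.5-mini-instruct", trust_remote_code=True
)
print(config.max_position_embeddings)  # expected: 131072 (128K tokens)
```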
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
## Run locally
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
### 4-bit quantized inference
|
|
|
|
|
|
|
|
|
After obtaining the DeepPhi-3.5-mini-instruct model checkpoint, you can use the following sample code for 4-bit inference.
|
|
|
|
|
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline, BitsAndBytesConfig

torch.random.manual_seed(0)

model_path = "EpistemeAI/DeepPhi-3.5-mini-instruct"

# Configure 4-bit quantization using bitsandbytes
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # You can also try "fp4" if desired.
    bnb_4bit_compute_dtype=torch.float16,   # Or torch.bfloat16, depending on your hardware.
)

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",
    torch_dtype=torch.float16,
    trust_remote_code=True,
    quantization_config=quantization_config,
)
tokenizer = AutoTokenizer.from_pretrained(model_path)

messages = [
    {"role": "system", "content": """
You are a helpful AI assistant. Respond in the following format:
<reasoning>
...
</reasoning>
<answer>
...
</answer>"""},
    {"role": "user", "content": "Can you provide ways to eat combinations of bananas and dragonfruits?"},
    {"role": "assistant", "content": "Sure! Here are some ways to eat bananas and dragonfruits together: 1. Banana and dragonfruit smoothie: Blend bananas and dragonfruits together with some milk and honey. 2. Banana and dragonfruit salad: Mix sliced bananas and dragonfruits together with some lemon juice and honey."},
    {"role": "user", "content": "What about solving a 2x + 3 = 7 equation?"},
]

# Flatten the chat messages into a single plain-text prompt.
def format_messages(messages):
    prompt = ""
    for msg in messages:
        role = msg["role"].capitalize()
        prompt += f"{role}: {msg['content']}\n"
    return prompt.strip()

prompt = format_messages(messages)

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
)

# Greedy decoding; temperature is ignored when do_sample=False.
generation_args = {
    "max_new_tokens": 500,
    "return_full_text": False,
    "temperature": 0.0,
    "do_sample": False,
}

output = pipe(prompt, **generation_args)
print(output[0]['generated_text'])
```
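The `format_messages` helper above flattens the chat into a plain `Role: content` transcript. If the checkpoint ships a chat template, you can instead let the tokenizer build the prompt; the snippet below is a sketch that reuses the objects defined above and assumes the bundled template works with the `<reasoning>`/`<answer>` system prompt.

```python
# Alternative sketch: build the prompt with the tokenizer's chat template
# (assumes the checkpoint bundles one). Reuses tokenizer, messages, pipe,
# and generation_args from the example above.
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,  # append the marker for the assistant's turn
)
output = pipe(prompt, **generation_args)
print(output[0]['generated_text'])
```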
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
## Uploaded model

- Developed by: EpistemeAI
- License: apache-2.0
- Finetuned from model: unsloth/phi-3.5-mini-instruct-bnb-4bit

This model was trained 2x faster with Unsloth and Hugging Face's TRL library.
|
|
|
---

## Use with llama.cpp

Install llama.cpp through brew (works on Mac and Linux).

```bash
brew install llama.cpp
```

Invoke the llama.cpp server or the CLI.
|
|
|
### CLI:

```bash
llama-cli --hf-repo Triangle104/DeepPhi-3.5-mini-instruct-Q5_K_S-GGUF --hf-file deepphi-3.5-mini-instruct-q5_k_s.gguf -p "The meaning to life and the universe is"
```
|
|
|
### Server:

```bash
llama-server --hf-repo Triangle104/DeepPhi-3.5-mini-instruct-Q5_K_S-GGUF --hf-file deepphi-3.5-mini-instruct-q5_k_s.gguf -c 2048
```
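Once the server is running you can query it over HTTP. The sketch below is an assumption-laden example: it targets the server's OpenAI-compatible chat completions endpoint on the default bind address of `127.0.0.1:8080` and uses the third-party `requests` package.

```python
# Sketch: query a running llama-server over its OpenAI-compatible HTTP API.
# Assumes the default bind address (127.0.0.1:8080) and `pip install requests`.
import requests

response = requests.post(
    "http://127.0.0.1:8080/v1/chat/completions",
    json={
        "messages": [
            {"role": "user", "content": "The meaning to life and the universe is"}
        ],
        "max_tokens": 128,
    },
)
print(response.json()["choices"][0]["message"]["content"])
```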
|
|
|
Note: You can also use this checkpoint directly through the [usage steps](https://github.com/ggerganov/llama.cpp?tab=readme-ov-file#usage) listed in the llama.cpp repo.
|
|
|
Step 1: Clone llama.cpp from GitHub.

```
git clone https://github.com/ggerganov/llama.cpp
```
|
|
|
Step 2: Move into the llama.cpp folder and build it with the `LLAMA_CURL=1` flag, along with any other hardware-specific flags (for example, `LLAMA_CUDA=1` for Nvidia GPUs on Linux).

```
cd llama.cpp && LLAMA_CURL=1 make
```
|
|
|
Step 3: Run inference through the main binary.

```
./llama-cli --hf-repo Triangle104/DeepPhi-3.5-mini-instruct-Q5_K_S-GGUF --hf-file deepphi-3.5-mini-instruct-q5_k_s.gguf -p "The meaning to life and the universe is"
```
|
or

```
./llama-server --hf-repo Triangle104/DeepPhi-3.5-mini-instruct-Q5_K_S-GGUF --hf-file deepphi-3.5-mini-instruct-q5_k_s.gguf -c 2048
```
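If you prefer to manage the GGUF file yourself instead of relying on the `--hf-repo` auto-download, a minimal sketch using the `huggingface_hub` client is shown below; the returned local path can then be passed to `llama-cli`/`llama-server` with `-m`.

```python
# Sketch: download the Q5_K_S GGUF file to the local Hugging Face cache,
# then point llama.cpp (or any GGUF-compatible runtime) at the returned path.
from huggingface_hub import hf_hub_download

gguf_path = hf_hub_download(
    repo_id="Triangle104/DeepPhi-3.5-mini-instruct-Q5_K_S-GGUF",
    filename="deepphi-3.5-mini-instruct-q5_k_s.gguf",
)
print(gguf_path)  # pass this path to llama-cli / llama-server via -m
```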
|
|