This model is the 6-bit quantized (Q6_K) GGUF version of Meditron-7B. Please follow the instructions below to run the model on your device.
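
All of the examples below load the quantized GGUF file from a local path. As a minimal sketch, assuming the file is published in this repository as meditron-7b_Q6_K.gguf (the repo_id and filename here are assumptions; adjust them if the repository layout differs), it can be downloaded with huggingface_hub:

# Hypothetical download step: repo_id and filename are assumptions based on
# this model card; adjust them if the repository layout differs.
from huggingface_hub import hf_hub_download

model_file = hf_hub_download(
    repo_id="np-n/meditron-7b_Q6_K.gguf",
    filename="meditron-7b_Q6_K.gguf",
    local_dir=".",
)
print(model_file)  # e.g. ./meditron-7b_Q6_K.gguf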

There are multiple ways to run inference with the model. First, let's install llama.cpp and use it for inference.

  1. Install
git clone https://github.com/ggerganov/llama.cpp
mkdir llama.cpp/build && cd llama.cpp/build && cmake .. && cmake --build . --config Release
  2. Inference
./llama.cpp/build/bin/llama-cli -m ./meditron-7b_Q6_K.gguf -cnv -p "You are a helpful assistant"

Now you can interact with the model from your terminal.

Alternatively, we can use the Python bindings of llama.cpp (llama-cpp-python) to run the model on both CPU and GPU; a short note on reading the returned output follows the examples below.

  1. Install
pip install --no-cache-dir llama-cpp-python==0.2.85 --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu122
  2. Inference on CPU
from llama_cpp import Llama

model_path = "./meditron-7b_Q6_K.gguf"
llm = Llama(model_path=model_path, n_threads=8, verbose=False)

prompt = "What should I do when my eyes are dry?"
output = llm(
        prompt=f"<|user|>\n{prompt}<|end|>\n<|assistant|>",
        max_tokens=4096,
        stop=["<|end|>"],
        echo=False,  # Whether to echo the prompt
)
print(output)
  3. Inference on GPU
from llama_cpp import Llama

model_path = "./meditron-7b_Q6_K.gguf"
llm = Llama(model_path=model_path, n_threads=8, n_gpu_layers=-1, verbose=False)

prompt = "What should I do when my eyes are dry?"
output = llm(
        prompt=f"<|user|>\n{prompt}<|end|>\n<|assistant|>",
        max_tokens=4096,
        stop=["<|end|>"],
        echo=False,  # Whether to echo the prompt
)
print(output)
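
In both of the examples above, the call returns an OpenAI-style completion dictionary rather than plain text: the generated text lives under output["choices"][0]["text"]. Below is a minimal sketch, reusing the same model path and prompt template, that extracts the text and also streams tokens as they are generated (stream=True is part of llama-cpp-python's completion API):

from llama_cpp import Llama

model_path = "./meditron-7b_Q6_K.gguf"
llm = Llama(model_path=model_path, n_threads=8, verbose=False)

prompt = "What should I do when my eyes are dry?"
full_prompt = f"<|user|>\n{prompt}<|end|>\n<|assistant|>"

# Non-streaming: the result is a dictionary; pull out the generated text.
output = llm(prompt=full_prompt, max_tokens=512, stop=["<|end|>"], echo=False)
print(output["choices"][0]["text"])

# Streaming: with stream=True the call yields partial chunks instead,
# which is convenient for printing tokens as they arrive.
for chunk in llm(prompt=full_prompt, max_tokens=512, stop=["<|end|>"], stream=True):
    print(chunk["choices"][0]["text"], end="", flush=True)
print()

llama-cpp-python also exposes llm.create_chat_completion(messages=[...]), which accepts OpenAI-style chat messages and applies a chat template for you; whether its default template matches the prompt format used above is worth verifying before relying on it.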
Model details: GGUF format, 6.74B parameters, llama architecture, 6-bit (Q6_K) quantization.