This repository contains the quantized version of DISC-MedLLM, which uses Baichuan-13b-base as its base model.
The weights were converted to GGML format using baichuan13b.cpp (based on llama.cpp).
| Model | GGML quantize method | HDD size |
|---|---|---|
| ggml-model-q4_0.bin | q4_0 | 7.55 GB |
| ggml-model-q4_1.bin | q4_1 | 8.36 GB |
| ggml-model-q5_0.bin | q5_0 | 9.17 GB |
| ggml-model-q5_1.bin | q5_1 | 9.97 GB |
| ggml-model-q8_0.bin | q8_0 | 14 GB |
## How to run inference

Compile baichuan13b. A main executable `baichuan13b/build/bin/main` and a server `baichuan13b/build/bin/server` will be generated. Download the weights from this repository into `baichuan13b/build/bin/`.
For the command-line interface, the following commands are useful. You can also read the doc covering the other command-line parameters.

```shell
cd baichuan13b/build/bin/
./main -m ggml-model-q4_0.bin --prompt "I feel sick. Nausea and Vomiting."
```
For the API interface, the following commands are useful. You can also read the doc about the server's command-line options.

```shell
cd baichuan13b/build/bin/
./server -m ggml-model-q4_0.bin -c 2048
```
To test the API interface, you can use `curl`:

```shell
curl --request POST \
  --url http://localhost:8080/completion \
  --data '{"prompt": "I feel sick. Nausea and Vomiting.", "n_predict": 512}'
```
## Use it in Python

To use it in a Python script like cli_demo.py, all you need to do is replace the model.chat() call: `import requests`, POST the prompt to localhost:8080 as JSON, and decode the HTTP response.
```python
import requests

llm_output = requests.post(
    "http://localhost:8080/completion",
    json={
        "prompt": "I feel sick. Nausea and Vomiting.",
        "n_predict": 512,
    },
).json()
print(llm_output)
```
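If you want an actual drop-in replacement for model.chat(), a minimal sketch could look like the following. The function names and defaults here are our own, and we assume the server's JSON reply carries the generated text in a `"content"` field, as llama.cpp's example server does; adjust the key if your build differs.

```python
import requests

# Assumed server address; change if you started ./server on another port.
SERVER_URL = "http://localhost:8080/completion"


def build_payload(prompt, n_predict=512):
    """Build the JSON body expected by the /completion endpoint."""
    return {"prompt": prompt, "n_predict": n_predict}


def chat(prompt, n_predict=512):
    """POST the prompt to the server and return the generated text.

    Raises requests.HTTPError if the server responds with an error status.
    """
    resp = requests.post(SERVER_URL, json=build_payload(prompt, n_predict))
    resp.raise_for_status()
    # llama.cpp's server puts the completion under "content"; fall back
    # to an empty string if the key is absent in your build.
    return resp.json().get("content", "")
```

You can then call `chat("I feel sick. Nausea and Vomiting.")` wherever cli_demo.py previously called model.chat().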