Post
1307
Save money on your compute bill by using LMCache to share prefix KV between 2 different vllm instances. By deploying LMCache backend along with your vLLM containers, you can share a prefix KV Cache between 2 different containers and models. It is very simple to implement into your existing stack.
Step 1: Pull docker images
Step 2: Start vLLM + LMCache
You can add another vLLM instance as long as its on a separate GPU by simply deploying another:
This method supports local, remote or hybrid backends so whichever vLLM deployment method you are already using should work with the LMCache container (excluding BentoML).
LMCache: https://github.com/LMCache/LMCache/tree/dev
vLLM: https://github.com/vllm-project/vllm
Step 1: Pull docker images
docker pull apostacyh/vllm:lmcache-0.1.0
Step 2: Start vLLM + LMCache
model=mistralai/Mistral-7B-Instruct-v0.2 # Replace with your model name
sudo docker run --runtime nvidia --gpus '"device=0"' \
-v <Huggingface cache dir on your local machine>:/root/.cache/huggingface \
-p 8000:8000 \
--env "HF_TOKEN=<Your huggingface access token>" \
--ipc=host \
--network=host \
apostacyh/vllm:lmcache-0.1.0 \
--model $model --gpu-memory-utilization 0.6 --port 8000 \
--lmcache-config-file /lmcache/LMCache/examples/example-local.yaml
You can add another vLLM instance as long as its on a separate GPU by simply deploying another:
# The second vLLM instance listens at port 8001
model=mistralai/Mistral-7B-Instruct-v0.2 # Replace with your model name
sudo docker run --runtime nvidia --gpus '"device=1"' \
-v <Huggingface cache dir on your local machine>:/root/.cache/huggingface \
-p 8001:8001 \
--env "HF_TOKEN=<Your huggingface token>" \
--ipc=host \
--network=host \
apostacyh/vllm:lmcache-0.1.0 \
--model $model --gpu-memory-utilization 0.7 --port 8001 \
--lmcache-config-file /lmcache/LMCache/examples/example.yaml
This method supports local, remote or hybrid backends so whichever vLLM deployment method you are already using should work with the LMCache container (excluding BentoML).
LMCache: https://github.com/LMCache/LMCache/tree/dev
vLLM: https://github.com/vllm-project/vllm