Quick Tour
Set up
The easiest way to get started with TEI is to use one of the official Docker containers (see Supported models and hardware to choose the right container).
You will therefore need to install Docker by following its installation instructions.
TEI supports inference on both GPU and CPU. If you plan on using a GPU, make sure your hardware is supported by checking this table, then install the NVIDIA Container Toolkit. The NVIDIA drivers on your device need to be compatible with CUDA version 12.2 or higher.
Deploy
Next it's time to deploy your model. Let's say you want to use BAAI/bge-large-en-v1.5. Here's how you can do this:
model=BAAI/bge-large-en-v1.5
volume=$PWD/data

docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.7 --model-id $model
We also recommend sharing a volume with the Docker container (volume=$PWD/data) to avoid downloading the weights on every run.
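On the first run, the container downloads the model weights, so it can take a moment before the server starts accepting requests. As a rough readiness check (a sketch that assumes TEI's /health route and the requests package, which is not part of TEI), you can poll until the server responds:

# Poll the local TEI server until it reports healthy (sketch; assumes the /health route)
import time

import requests

for _ in range(60):
    try:
        if requests.get("http://127.0.0.1:8080/health", timeout=2).ok:
            print("TEI is ready")
            break
    except requests.ConnectionError:
        pass  # server not up yet
    time.sleep(2)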
Inference
Inference can be performed in three ways: using cURL, or via the InferenceClient or OpenAI Python SDKs.
cURL
To send a POST request to the TEI endpoint using cURL, you can run the following command:
curl 127.0.0.1:8080/embed \
-X POST \
-d '{"inputs":"What is Deep Learning?"}' \
-H 'Content-Type: application/json'
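You can also send the same request from Python with any HTTP client. Here is a minimal sketch using the requests package (an assumption on our side, not part of TEI); the /embed route returns a JSON list with one embedding vector per input:

# Same POST request as the cURL example, sent with the requests package
import requests

response = requests.post(
    "http://127.0.0.1:8080/embed",
    json={"inputs": "What is Deep Learning?"},
)
response.raise_for_status()
embeddings = response.json()  # one list of floats per input
print(len(embeddings), len(embeddings[0]))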
Python
To run inference using Python, you can either use the huggingface_hub Python SDK (recommended) or the openai Python SDK.
huggingface_hub
You can install it via pip as pip install --upgrade --quiet huggingface_hub, and then run:
from huggingface_hub import InferenceClient

client = InferenceClient()
embedding = client.feature_extraction(
    "What is deep learning?",
    model="http://localhost:8080/embed",
)
print(len(embedding[0]))
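Here embedding is a NumPy array with one row per input, so len(embedding[0]) prints the embedding dimension (1024 for BAAI/bge-large-en-v1.5).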
OpenAI
You can install it via pip as pip install --upgrade openai, and then run:
from openai import OpenAI

# The SDK appends /embeddings to base_url, so point it at the /v1 route.
# A placeholder api_key satisfies the client; a local TEI server that does not
# enforce authentication will not check it.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

response = client.embeddings.create(
    model="tei",
    input="What is deep learning?"
)

print(response)
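The response follows the OpenAI embeddings schema, so the vector itself can be read from response.data[0].embedding.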
Re-rankers and sequence classification
TEI also supports re-ranker and classic sequence classification models.
Re-rankers
Rerankers, also called cross-encoders, are sequence classification models with a single class that score the similarity between a query and a text. See this blogpost by the LlamaIndex team to understand how you can use re-ranker models in your RAG pipeline to improve downstream performance.
Let's say you want to use BAAI/bge-reranker-large. First, you can deploy it like so:
model=BAAI/bge-reranker-large
volume=$PWD/data

docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.7 --model-id $model
Once you have deployed a model, you can use the rerank endpoint to rank the similarity between a query and a list of texts. With cURL this can be done like so:
curl 127.0.0.1:8080/rerank \
-X POST \
-d '{"query":"What is Deep Learning?", "texts": ["Deep Learning is not...", "Deep learning is..."], "raw_scores": false}' \
-H 'Content-Type: application/json'
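The same call can be scripted from Python; the sketch below uses the requests package and assumes each result in the response carries an index into the original texts and a score, which lets you print the texts from most to least relevant:

# Rerank from Python (sketch; assumes "index" and "score" fields in each result)
import requests

query = "What is Deep Learning?"
texts = ["Deep Learning is not...", "Deep learning is..."]

response = requests.post(
    "http://127.0.0.1:8080/rerank",
    json={"query": query, "texts": texts, "raw_scores": False},
)
response.raise_for_status()

for result in sorted(response.json(), key=lambda r: r["score"], reverse=True):
    print(f'{result["score"]:.4f}  {texts[result["index"]]}')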
Sequence classification models
You can also use classic Sequence Classification models like SamLowe/roberta-base-go_emotions:
model=SamLowe/roberta-base-go_emotions
volume=$PWD/data

docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.7 --model-id $model
Once you have deployed the model, you can use the predict endpoint to get the emotions most associated with an input:
curl 127.0.0.1:8080/predict \
-X POST \
-d '{"inputs":"I like you."}' \
-H 'Content-Type: application/json'
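The predict route can be called from Python in the same way; this sketch assumes each prediction in the response carries a label and a score:

# Sequence classification from Python (sketch; assumes "label" and "score" fields)
import requests

response = requests.post(
    "http://127.0.0.1:8080/predict",
    json={"inputs": "I like you."},
)
response.raise_for_status()

for prediction in response.json():
    print(f'{prediction["label"]}: {prediction["score"]:.4f}')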
Batching
You can send multiple inputs in a batch. For example, for embeddings:
curl 127.0.0.1:8080/embed \
-X POST \
-d '{"inputs":["Today is a nice day", "I like you"]}' \
-H 'Content-Type: application/json'
And for Sequence Classification:
curl 127.0.0.1:8080/predict \
-X POST \
-d '{"inputs":[["I like you."], ["I hate pineapples"]]}' \
-H 'Content-Type: application/json'
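Each batched request returns one result per input, in the same order as the inputs. A small sketch with the requests package makes that correspondence explicit:

# Batch embedding from Python (sketch): one vector comes back per input, in order
import requests

inputs = ["Today is a nice day", "I like you"]
response = requests.post("http://127.0.0.1:8080/embed", json={"inputs": inputs})
response.raise_for_status()

for text, vector in zip(inputs, response.json()):
    print(f"{text!r} -> {len(vector)}-dimensional embedding")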
Air gapped deployment
To deploy Text Embeddings Inference in an air-gapped environment, first download the weights and then mount them inside the container using a volume.
For example:
# (Optional) create a `models` directory
mkdir models
cd models
# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/Alibaba-NLP/gte-base-en-v1.5
# Set the models directory as the volume path
volume=$PWD
# Mount the models directory inside the container with a volume and set the model ID
docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.7 --model-id /data/gte-base-en-v1.5
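If git-lfs is not an option, a possible alternative (shown here as a sketch) is to download the repository with the huggingface_hub library on a machine with internet access and copy the resulting directory into the models folder before mounting it:

# Download the model repository without git-lfs (sketch using huggingface_hub)
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="Alibaba-NLP/gte-base-en-v1.5",
    local_dir="models/gte-base-en-v1.5",
)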