text-embeddings-inference documentation

Quick Tour


Set up

The easiest way to get started with TEI is to use one of the official Docker containers (see Supported models and hardware to choose the right container).

To use them, you first need to install Docker by following the official installation instructions.

TEI supports inference on both GPU and CPU. If you plan on using a GPU, make sure your hardware is supported by checking this table. Next, install the NVIDIA Container Toolkit. The NVIDIA drivers on your device need to be compatible with CUDA version 12.2 or higher.
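
To verify that Docker can access your GPU before pulling the TEI image, you can run nvidia-smi inside a CUDA base container. This is only a sanity check; the nvidia/cuda image tag used here is just an example and any recent CUDA 12.2+ base image works:

docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi

If the command prints your GPU details, the NVIDIA Container Toolkit is set up correctly.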

Deploy

Next it’s time to deploy your model. Let’s say you want to use BAAI/bge-large-en-v1.5. Here’s how you can do this:

model=BAAI/bge-large-en-v1.5
volume=$PWD/data

docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.7 --model-id $model

We also recommend sharing a volume with the Docker container (volume=$PWD/data) to avoid downloading weights every run.
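
If you are deploying on CPU only, drop the --gpus all flag and use a CPU image instead. The tag below (cpu-1.7) is an assumption based on the GPU tag above; check Supported models and hardware for the exact CPU tag matching your version:

docker run -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:cpu-1.7 --model-id $model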

Inference

Inference can be performed in 3 ways: using cURL, or via the InferenceClient or OpenAI Python SDKs.

cURL

To send a POST request to the TEI endpoint using cURL, you can run the following command:

curl 127.0.0.1:8080/embed \
    -X POST \
    -d '{"inputs":"What is Deep Learning?"}' \
    -H 'Content-Type: application/json'
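
The response is a JSON array with one embedding (a list of floats) per input. For example, if you have jq installed, you can check the dimensionality of the returned embedding:

curl -s 127.0.0.1:8080/embed \
    -X POST \
    -d '{"inputs":"What is Deep Learning?"}' \
    -H 'Content-Type: application/json' | jq '.[0] | length'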

Python

To run inference using Python, you can either use the huggingface_hub Python SDK (recommended) or the openai Python SDK.

huggingface_hub

You can install it via pip as pip install --upgrade --quiet huggingface_hub, and then run:

from huggingface_hub import InferenceClient

client = InferenceClient()

embedding = client.feature_extraction("What is deep learning?",
                                      model="http://localhost:8080/embed")
print(len(embedding[0]))

OpenAI

You can install it via pip as pip install --upgrade openai, and then run:

from openai import OpenAI

# TEI's OpenAI-compatible API lives under /v1; the SDK appends /embeddings itself.
# A placeholder API key keeps the SDK happy; TEI does not require one by default.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

response = client.embeddings.create(
  model="tei",
  input="What is deep learning?"
)

print(response)

Re-rankers and sequence classification

TEI also supports re-ranker and classic sequence classification models.

Re-rankers

Rerankers, also called cross-encoders, are sequence classification models with a single class that score the similarity between a query and a text. See this blog post by the LlamaIndex team to understand how you can use re-ranker models in your RAG pipeline to improve downstream performance.

Let’s say you want to use BAAI/bge-reranker-large. First, you can deploy it like so:

model=BAAI/bge-reranker-large
volume=$PWD/data

docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.7 --model-id $model

Once you have deployed a model, you can use the rerank endpoint to rank the similarity between a query and a list of texts. With cURL this can be done like so:

curl 127.0.0.1:8080/rerank \
    -X POST \
    -d '{"query":"What is Deep Learning?", "texts": ["Deep Learning is not...", "Deep learning is..."], "raw_scores": false}' \
    -H 'Content-Type: application/json'
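
Each entry in the response contains the index of the corresponding text in the input list together with its similarity score. If you have jq installed, you can, for example, inspect the first (highest-scoring) entry:

curl -s 127.0.0.1:8080/rerank \
    -X POST \
    -d '{"query":"What is Deep Learning?", "texts": ["Deep Learning is not...", "Deep learning is..."], "raw_scores": false}' \
    -H 'Content-Type: application/json' | jq '.[0]'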

Sequence classification models

You can also use classic Sequence Classification models like SamLowe/roberta-base-go_emotions:

model=SamLowe/roberta-base-go_emotions
volume=$PWD/data

docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.7 --model-id $model

Once you have deployed the model you can use the predict endpoint to get the emotions most associated with an input:

curl 127.0.0.1:8080/predict \
    -X POST \
    -d '{"inputs":"I like you."}' \
    -H 'Content-Type: application/json'

Batching

You can send multiple inputs in a batch. For example, for embeddings:

curl 127.0.0.1:8080/embed \
    -X POST \
    -d '{"inputs":["Today is a nice day", "I like you"]}' \
    -H 'Content-Type: application/json'

And for Sequence Classification:

curl 127.0.0.1:8080/predict \
    -X POST \
    -d '{"inputs":[["I like you."], ["I hate pineapples"]]}' \
    -H 'Content-Type: application/json'

Air-gapped deployment

To deploy Text Embeddings Inference in an air-gapped environment, first download the weights and then mount them inside the container using a volume.

For example:

# (Optional) create a `models` directory
mkdir models
cd models

# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/Alibaba-NLP/gte-base-en-v1.5

# Set the models directory as the volume path
volume=$PWD

# Mount the models directory inside the container with a volume and set the model ID
docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.7 --model-id /data/gte-base-en-v1.5
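
Once the container is running, you can confirm that the model was loaded from the local volume rather than the Hub by querying the info endpoint, which, among other details, reports the model being served:

curl 127.0.0.1:8080/info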