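Inference Endpoints (dedicated)

Have you ever wanted to create your own machine learning API? In this tutorial, we do exactly that with HF Dedicated Inference Endpoints. Inference Endpoints let you pick any of the thousands of models on the HF Hub, create your own API for it, and deploy it on a platform you control and on hardware you choose.

Serverless Inference APIs are great for initial testing, but they are limited to a preconfigured selection of popular models and they are rate limited, because the serverless API's hardware is shared by many users at the same time. With a Dedicated Inference Endpoint, you can customize the deployment of your model, and the hardware is reserved exclusively for you.

In this tutorial, we will:

Create an Inference Endpoint via a simple UI and send standard HTTP requests to the Endpoint
Create and manage different Inference Endpoints programmatically with the huggingface_hub library
Cover three use cases: text generation with a large language model (LLM), image generation with Stable Diffusion, and reasoning over images with Idefics2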

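Installation and login

If you do not have a Hugging Face account yet, you can create one here. If you work in a larger team, you can also create an HF organization and manage all your models, datasets and Inference Endpoints through it. Dedicated Inference Endpoints are a paid service, so you need to add a credit card to the billing settings of your personal HF account or of your HF organization.

Next, you can create a user access token here. A token with read or write permissions works for this tutorial, but we recommend fine-grained tokens for better security. For this notebook, you need a fine-grained token with the following permissions:

User permissions > Inference > Make calls to Inference Endpoints and Manage Inference Endpoints
Repository permissions > google/gemma-1.1-2b-it and HuggingFaceM4/idefics2-8b-chatty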

!pip install huggingface_hub~=0.23.3
!pip install transformers~=4.41.2

# Login to the HF Hub. We recommend using this login method
# to avoid the need for explicitly storing your HF token in variables
import huggingface_hub

huggingface_hub.interpreter_login()

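Creating your first Endpoint

With the initial setup done, we can create our first Endpoint. Go to https://ui.endpoints.huggingface.co/ and click + New next to Dedicated Endpoints. You will then see the interface for creating a new Endpoint, with the following options (see image below):

Model Repository: Enter the identifier of any model on the HF Hub. For this demo, we use google/gemma-1.1-2b-it, a small generative LLM (2.5B parameters).
Endpoint Name: The Endpoint name is generated automatically from the model identifier, but you can change it. Valid Endpoint names may only contain lowercase letters, numbers or hyphens ("-") and must be between 4 and 32 characters long.
Instance Configuration: Choose from a range of CPU or GPU configurations from the major cloud platforms. You can also adjust the region, for example if you need to host your Endpoint in the EU.
Automatic Scale-to-Zero: You can configure your Endpoint to scale to zero GPUs/CPUs after a period of inactivity. Scaled-to-zero Endpoints are no longer billed. Note that restarting the Endpoint requires the model to be reloaded into memory (and possibly downloaded again), which can take several minutes for large models.
Endpoint Security Level: The default security level is Protected, which requires an authorized HF token to access the Endpoint. Public Endpoints are accessible to anyone without token authentication. Private Endpoints are only available through an intra-region secured AWS or Azure PrivateLink connection.
Advanced configuration: Here you can select some advanced options, such as the Docker container type. Because Gemma is compatible with Text Generation Inference (TGI) containers, the system automatically selects TGI as the container type and fills in other sensible defaults.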
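The naming rule listed under Endpoint Name can be checked locally before you submit the form or call the API. The snippet below is a minimal sketch: the helper name `is_valid_endpoint_name` is our own and not part of any HF library, and the service may enforce additional rules that are not listed in this tutorial.

```python
import re

# Rule as stated in this tutorial: lowercase letters, digits or hyphens, 4 to 32 characters.
_ENDPOINT_NAME_RE = re.compile(r"[a-z0-9-]{4,32}")


def is_valid_endpoint_name(name: str) -> bool:
    """Return True if `name` satisfies the naming rule described in this tutorial."""
    return _ENDPOINT_NAME_RE.fullmatch(name) is not None


print(is_valid_endpoint_name("gemma-1-1-2b-it-001"))  # True
print(is_valid_endpoint_name("Gemma_1.1"))            # False: uppercase, underscore and dot
```

A check like this is mostly useful when Endpoint names are derived automatically from model identifiers, which often contain uppercase letters, dots or slashes.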

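Select the options shown in the image below and click Create Endpoint.

After roughly one minute, your Endpoint is created and you will see a page similar to the image below.

On the Endpoint's Overview page, you will find the URL for querying the Endpoint, a Playground for testing the model, and additional tabs for Analytics, Usage & Cost, Logs and Settings.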

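Creating and managing Endpoints programmatically

When moving into production, you do not always want to start, stop and modify your Endpoints manually. The huggingface_hub library provides functionality for managing Endpoints programmatically. See the documentation here and details on all functions here. Some key functions are shown below: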

# list all your inference endpoints
huggingface_hub.list_inference_endpoints()

# get an existing endpoint and check its status
endpoint = huggingface_hub.get_inference_endpoint(
    name="gemma-1-1-2b-it-yci",  # the name of the endpoint
    namespace="MoritzLaurer",  # your user name or organization name
)
print(endpoint)

# Pause endpoint to stop billing
endpoint.pause()

# Resume and wait until the endpoint is ready
# endpoint.resume()
# endpoint.wait()

# Update the endpoint to a different GPU
# You can find the correct arguments for different hardware types in this table: https://huggingface.co/docs/inference-endpoints/pricing#gpu-instances
# endpoint.update(
#    instance_size="x1",
#    instance_type="nvidia-a100",  # nvidia-a10g
# )
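Because billing continues while an Endpoint is running, a small clean-up helper can be handy at the end of a session. The following is a sketch only: `pause_running_endpoints` is a hypothetical helper of ours, and it assumes each object exposes `name`, `status` and `pause()` as the Endpoint objects used above do, with `"running"` as the status of an active Endpoint.

```python
def pause_running_endpoints(endpoints):
    """Pause every endpoint whose status is "running" and return their names."""
    paused = []
    for endpoint in endpoints:
        if endpoint.status == "running":
            endpoint.pause()
            paused.append(endpoint.name)
    return paused


# Hypothetical usage (makes network calls against your account):
# import huggingface_hub
# print(pause_running_endpoints(huggingface_hub.list_inference_endpoints()))
```

Remember that a paused Endpoint must be resumed explicitly before it can serve requests again.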

You can also create an Endpoint programmatically. Let's recreate the same gemma LLM Endpoint that we created through the UI.

from huggingface_hub import create_inference_endpoint


model_id = "google/gemma-1.1-2b-it"
endpoint_name = "gemma-1-1-2b-it-001"  # Valid Endpoint names must only contain lower-case characters, numbers or hyphens ("-") and are between 4 to 32 characters long.
namespace = "MoritzLaurer"  # your user or organization name


# check if endpoint with this name already exists from previous tests
available_endpoints_names = [endpoint.name for endpoint in huggingface_hub.list_inference_endpoints()]
endpoint_exists = endpoint_name in available_endpoints_names
print("Does the endpoint already exist?", endpoint_exists)


# create new endpoint
if not endpoint_exists:
    endpoint = create_inference_endpoint(
        endpoint_name,
        repository=model_id,
        namespace=namespace,
        framework="pytorch",
        task="text-generation",
        # see the available hardware options here: https://huggingface.co/docs/inference-endpoints/pricing#pricing
        accelerator="gpu",
        vendor="aws",
        region="us-east-1",
        instance_size="x1",
        instance_type="nvidia-a10g",
        min_replica=0,
        max_replica=1,
        type="protected",
        # since the LLM is compatible with TGI, we specify that we want to use the latest TGI image
        custom_image={
            "health_route": "/health",
            "env": {"MODEL_ID": "/repository"},
            "url": "ghcr.io/huggingface/text-generation-inference:latest",
        },
    )
    print("Waiting for endpoint to be created")
    endpoint.wait()
    print("Endpoint ready")

# if endpoint with this name already exists, get and resume existing endpoint
else:
    endpoint = huggingface_hub.get_inference_endpoint(name=endpoint_name, namespace=namespace)
    if endpoint.status in ["paused", "scaledToZero"]:
        print("Resuming endpoint")
        endpoint.resume()
    print("Waiting for endpoint to start")
    endpoint.wait()
    print("Endpoint ready")

# access the endpoint url for API calls
print(endpoint.url)
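The check, create or resume, then wait pattern above is repeated for every model in this tutorial, so it can be factored into one function. This is a sketch under assumptions: `ensure_endpoint` is a hypothetical helper, and `api` is any object that exposes `list_inference_endpoints`, `get_inference_endpoint` and `create_inference_endpoint` with the keyword arguments used above (passing the `huggingface_hub` module itself is the intended use).

```python
def ensure_endpoint(api, name, namespace, **create_kwargs):
    """Return a ready endpoint, creating it or resuming it as needed."""
    existing = {ep.name for ep in api.list_inference_endpoints(namespace=namespace)}
    if name in existing:
        endpoint = api.get_inference_endpoint(name=name, namespace=namespace)
        # paused and scaled-to-zero endpoints are woken up explicitly
        if endpoint.status in ("paused", "scaledToZero"):
            endpoint.resume()
    else:
        endpoint = api.create_inference_endpoint(name, namespace=namespace, **create_kwargs)
    endpoint.wait()  # block until the endpoint reports that it is ready
    return endpoint


# Hypothetical usage with the arguments from the cell above:
# import huggingface_hub
# endpoint = ensure_endpoint(
#     huggingface_hub, "gemma-1-1-2b-it-001", "MoritzLaurer",
#     repository="google/gemma-1.1-2b-it", framework="pytorch", task="text-generation",
#     accelerator="gpu", vendor="aws", region="us-east-1",
#     instance_size="x1", instance_type="nvidia-a10g", type="protected",
# )
```

Injecting `api` instead of importing the library inside the function keeps the helper easy to test with stand-in objects.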

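Querying your Endpoint

Now let's query this Endpoint like any other LLM API. First, copy the Endpoint URL from the interface (or use endpoint.url) and assign it to API_URL below. We then pass the text input in the standardized messages format, i.e. a list of dictionaries with user and assistant messages, which you may know from other LLM API services. Next, we need to apply the chat template to the messages. The chat template is the format that LLMs such as Gemma and Llama-3 were trained on (see the documentation for details). For most recent generative LLMs, applying the chat template is essential: without it, the model's output quality degrades silently, and no error is raised.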

>>> import requests
>>> from transformers import AutoTokenizer

>>> # paste your endpoint URL here or reuse endpoint.url if you created the endpoint programmatically
>>> API_URL = endpoint.url  # or paste link like "https://dz07884a53qjqb98.us-east-1.aws.endpoints.huggingface.cloud"
>>> HEADERS = {"Authorization": f"Bearer {huggingface_hub.get_token()}"}


>>> # function for standard http requests
>>> def query(payload=None, api_url=None):
...     response = requests.post(api_url, headers=HEADERS, json=payload)
...     return response.json()


>>> # define conversation input in messages format
>>> # you can also provide multiple turns between user and assistant
>>> messages = [
...     {"role": "user", "content": "Please write a short poem about open source for me."},
...     # {"role": "assistant", "content": "I am not in the mood."},
...     # {"role": "user", "content": "Can you please do this for me?"},
... ]

>>> # apply the chat template for the respective model
>>> model_id = "google/gemma-1.1-2b-it"
>>> tokenizer = AutoTokenizer.from_pretrained(model_id)
>>> messages_with_template = tokenizer.apply_chat_template(messages, tokenize=False)
>>> print("Your text input looks like this, after the chat template has been applied:\n")
>>> print(messages_with_template)

Your text input looks like this, after the chat template has been applied:

<bos><start_of_turn>user
Please write a short poem about open source for me.<end_of_turn>
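To make it concrete what a chat template does, the sketch below renders messages into a single prompt string by hand. It is a simplified approximation of a Gemma-style format, assumed here for illustration only; `render_gemma_style` is a hypothetical function, and the authoritative template ships with the tokenizer, so use `tokenizer.apply_chat_template` in real code.

```python
def render_gemma_style(messages, add_generation_prompt=False):
    """Simplified, illustrative rendering of a Gemma-style chat template."""
    text = "<bos>"
    for message in messages:
        # Gemma names the assistant role "model" in its turn markers
        role = "model" if message["role"] == "assistant" else message["role"]
        text += f"<start_of_turn>{role}\n{message['content'].strip()}<end_of_turn>\n"
    if add_generation_prompt:
        # Opens the model's turn so that generation starts with the reply itself
        text += "<start_of_turn>model\n"
    return text


messages = [{"role": "user", "content": "Please write a short poem about open source for me."}]
print(render_gemma_style(messages, add_generation_prompt=True))
```

`apply_chat_template` accepts the same `add_generation_prompt` argument; setting it to `True` appends the marker that opens the model's turn, which is usually what you want when the string is sent for generation.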

>>> # send standard http request to endpoint
>>> output = query(
...     payload={
...         "inputs": messages_with_template,
...         "parameters": {"temperature": 0.2, "max_new_tokens": 100, "seed": 42, "return_full_text": False},
...     },
...     api_url=API_URL,
... )

>>> print("The output from your API/Endpoint call:\n")
>>> print(output)

The output from your API/Endpoint call:

[{'generated_text': "Free to use, free to share,\nA collaborative code, a community's care.\n\nCode transparent, bugs readily found,\nContributions welcome, stories unbound.\nOpen source, a gift to all,\nBuilding the future, one line at a call.\n\nSo join the movement, embrace the light,\nOpen source, shining ever so bright."}]

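That's it, you have made the first request to your Endpoint: your very own API!

If you want the Endpoint to handle the chat template automatically and your LLM runs on a TGI container, you can also use the messages API by appending the /v1/chat/completions path to the URL. With the /v1/chat/completions path, the TGI container running on the Endpoint applies the chat template automatically and is fully compatible with OpenAI's API structure, which makes interoperability easier. You can check the TGI Swagger UI for all available parameters. Note that the parameters accepted by the default / path and by the /v1/chat/completions path are slightly different. Here is the slightly modified code for using the messages API: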

>>> API_URL_CHAT = API_URL + "/v1/chat/completions"

>>> output = query(
...     payload={
...         "messages": messages,
...         "model": "tgi",
...         # on this route, generation settings are top-level fields, as in the OpenAI API
...         "temperature": 0.2,
...         "max_tokens": 100,
...         "seed": 42,
...     },
...     api_url=API_URL_CHAT,
... )

>>> print("The output from your API/Endpoint call with the OpenAI-compatible messages API route:\n")
>>> print(output)

The output from your API/Endpoint call with the OpenAI-compatible messages API route:

{'id': '', 'object': 'text_completion', 'created': 1718283608, 'model': '/repository', 'system_fingerprint': '2.0.5-dev0-sha-90184df', 'choices': [{'index': 0, 'message': {'role': 'assistant', 'content': '**Open Source**\n\nA license for the mind,\nTo share, distribute, and bind,\nIdeas freely given birth,\nFor the good of all to sort.\n\nCode transparent, eyes open wide,\nA permission for the wise,\nTo learn, to build, to use at will,\nA future bright, we help fill.\n\nFrom servers vast to candles low,\nOpen source, a guiding key,\nFor progress made, knowledge shared,\nA future brimming with'}, 'logprobs': None, 'finish_reason': 'length'}], 'usage': {'prompt_tokens': 20, 'completion_tokens': 100, 'total_tokens': 120}}
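The two routes return differently shaped JSON: the default route returns a list of dictionaries with a `generated_text` key, while the messages route returns an OpenAI-style object with a `choices` list. The helper below is a small sketch for pulling the text out of either shape. `extract_text` is a hypothetical name, and the handling of an `"error"` field is an assumption about how failed requests are reported.

```python
def extract_text(output):
    """Return the generated text from either response shape shown in this tutorial."""
    if isinstance(output, dict) and "error" in output:
        # Assumption: failed requests come back as a JSON object with an "error" field
        raise RuntimeError(f"Endpoint returned an error: {output['error']}")
    if isinstance(output, list):
        # Default route: a list with one {"generated_text": ...} dict
        return output[0]["generated_text"]
    # /v1/chat/completions route: OpenAI-style object with a "choices" list
    return output["choices"][0]["message"]["content"]


default_route_output = [{"generated_text": "Free to use, free to share"}]
chat_route_output = {"choices": [{"index": 0, "message": {"role": "assistant", "content": "**Open Source**"}}]}
print(extract_text(default_route_output))
print(extract_text(chat_route_output))
```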

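Simplified Endpoint usage with the InferenceClient

You can also use the InferenceClient to send requests to your Endpoint. The client is a convenient utility in the huggingface_hub Python library that lets you call both Dedicated Inference Endpoints and the Serverless Inference API. See the documentation for details.

This is the most concise way of sending requests to your Endpoint: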

from huggingface_hub import InferenceClient

client = InferenceClient()

output = client.chat_completion(
    messages,  # the chat template is applied automatically, if your endpoint uses a TGI container
    model=API_URL,
    temperature=0.2,
    max_tokens=100,
    seed=42,
)

print("The output from your API/Endpoint call with the InferenceClient:\n")
print(output)

# pause the endpoint to stop billing
# endpoint.pause()
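For longer generations, it is often nicer to receive the reply token by token instead of waiting for the full response. The sketch below wraps the client's streaming mode, assuming that `chat_completion(..., stream=True)` yields chunks whose text sits in `chunk.choices[0].delta.content`. The helper name `stream_chat` is our own.

```python
def stream_chat(client, messages, model, **generation_kwargs):
    """Print tokens as they arrive and return the full reply as one string."""
    parts = []
    for chunk in client.chat_completion(messages, model=model, stream=True, **generation_kwargs):
        delta = chunk.choices[0].delta.content
        if delta:  # some chunks may carry no text
            print(delta, end="", flush=True)
            parts.append(delta)
    print()
    return "".join(parts)


# Usage against a running endpoint (not executed here):
# from huggingface_hub import InferenceClient
# reply = stream_chat(InferenceClient(), messages, API_URL, max_tokens=100, temperature=0.2)
```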

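Creating Endpoints for a wide variety of models

Following the same process, you can create Endpoints for any model on the HF Hub. Let's look at a few other use cases.

Image generation with Stable Diffusion

We can create an image generation Endpoint with almost exactly the same code as for the LLM. The only difference is that we do not use the TGI container in this case, because TGI is designed only for LLMs (and vision LLMs).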

>>> !pip install Pillow  # for image processing

Collecting Pillow
  Downloading pillow-10.3.0-cp39-cp39-manylinux_2_28_x86_64.whl (4.5 MB)
Installing collected packages: Pillow
Successfully installed Pillow-10.3.0

>>> from huggingface_hub import create_inference_endpoint

>>> model_id = "stabilityai/stable-diffusion-xl-base-1.0"
>>> endpoint_name = "stable-diffusion-xl-base-1-0-001"  # Valid Endpoint names must only contain lower-case characters, numbers or hyphens ("-") and are between 4 to 32 characters long.
>>> namespace = "MoritzLaurer"  # your user or organization name
>>> task = "text-to-image"

>>> # check if endpoint with this name already exists from previous tests
>>> available_endpoints_names = [endpoint.name for endpoint in huggingface_hub.list_inference_endpoints()]
>>> endpoint_exists = endpoint_name in available_endpoints_names
>>> print("Does the endpoint already exist?", endpoint_exists)


>>> # create new endpoint
>>> if not endpoint_exists:
...     endpoint = create_inference_endpoint(
...         endpoint_name,
...         repository=model_id,
...         namespace=namespace,
...         framework="pytorch",
...         task=task,
...         # see the available hardware options here: https://huggingface.co/docs/inference-endpoints/pricing#pricing
...         accelerator="gpu",
...         vendor="aws",
...         region="us-east-1",
...         instance_size="x1",
...         instance_type="nvidia-a100",
...         min_replica=0,
...         max_replica=1,
...         type="protected",
...     )
...     print("Waiting for endpoint to be created")
...     endpoint.wait()
...     print("Endpoint ready")

... # if endpoint with this name already exists, get existing endpoint
... else:
...     endpoint = huggingface_hub.get_inference_endpoint(name=endpoint_name, namespace=namespace)
...     if endpoint.status in ["paused", "scaledToZero"]:
...         print("Resuming endpoint")
...         endpoint.resume()
...     print("Waiting for endpoint to start")
...     endpoint.wait()
...     print("Endpoint ready")

Does the endpoint already exist? True
Waiting for endpoint to start
Endpoint ready

>>> prompt = "A whimsical illustration of a fashionably dressed llama proudly holding a worn, vintage cookbook, with a warm cup of tea and a few freshly baked treats scattered around, set against a cozy background of rustic wood and blooming flowers."

>>> image = client.text_to_image(
...     prompt=prompt,
...     model=endpoint.url,  # "stabilityai/stable-diffusion-xl-base-1.0",
...     guidance_scale=8,
... )

>>> print("PROMPT: ", prompt)
>>> display(image.resize((image.width // 2, image.height // 2)))

PROMPT:  A whimsical illustration of a fashionably dressed llama proudly holding a worn, vintage cookbook, with a warm cup of tea and a few freshly baked treats scattered around, set against a cozy background of rustic wood and blooming flowers.

We pause the Endpoint again to stop billing.

endpoint.pause()

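Vision Language Models: Reasoning over text and images

Now let's create an Endpoint for a vision language model (VLM). VLMs are very similar to LLMs, except that they can take both text and images as input. Their output is autoregressively generated text, just like for a standard LLM. VLMs can tackle many tasks, from visual question answering to document understanding. For this example, we use Idefics2, a powerful 8B parameter VLM.

First, we need to convert the PIL image that we generated with Stable Diffusion into a base64 encoded string so that we can send it to the model over the network.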

import base64
from io import BytesIO


def pil_image_to_base64(image):
    buffered = BytesIO()
    image.save(buffered, format="JPEG")
    img_str = base64.b64encode(buffered.getvalue()).decode("utf-8")
    return img_str


image_b64 = pil_image_to_base64(image)

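Because VLMs and LLMs are so similar, we can again use almost the same messages format and chat template, with only a little additional code for including the image in the prompt. See the Idefics2 model card for specific details on prompt formatting.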

from transformers import AutoProcessor

# load the processor
model_id_vlm = "HuggingFaceM4/idefics2-8b-chatty"
processor = AutoProcessor.from_pretrained(model_id_vlm)

# define the user messages
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image"
            },  # the image is placed here in the prompt. You can add multiple images throughout the conversation.
            {"type": "text", "text": "Write a short limerick about this image."},
        ],
    },
]

# apply the chat template to the messages
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)

# the chat template places a special "<image>" token at the position where the image should go
# here we replace the "<image>" token with the base64 encoded image string in the prompt
# to be able to send the image via an API request
image_input = f"data:image/jpeg;base64,{image_b64}"
image_input = f"![]({image_input})"
prompt = prompt.replace("<image>", image_input)
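The embedding step can be isolated into two small functions, which also makes it easy to verify that the image survives the round trip. This is a sketch using only the standard library: the helper names are our own, the bytes are a placeholder rather than a real JPEG, and the template string is a hypothetical stand-in for the output of the real chat template.

```python
import base64


def image_markdown(image_bytes, mime="image/jpeg"):
    """Wrap raw image bytes as a Markdown image that points to a base64 data URI."""
    encoded = base64.b64encode(image_bytes).decode("utf-8")
    return f"![](data:{mime};base64,{encoded})"


def decode_image_markdown(markdown):
    """Recover the raw bytes from a string produced by image_markdown."""
    # ")" is not part of the base64 alphabet, so stripping it from the end is safe
    encoded = markdown.split("base64,", 1)[1].rstrip(")")
    return base64.b64decode(encoded)


placeholder_bytes = b"\xff\xd8\xff\xe0 placeholder, not a real image"
template = "User:<image>Describe the picture."  # hypothetical stand-in for the templated prompt
prompt_with_image = template.replace("<image>", image_markdown(placeholder_bytes))

# The bytes recovered from the prompt are identical to the bytes that went in
assert decode_image_markdown(image_markdown(placeholder_bytes)) == placeholder_bytes
```

Keep in mind that base64 inflates the payload by roughly a third, so downscaling large images before encoding keeps requests small.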

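For VLMs, an image represents a certain number of tokens. For Idefics2, for example, one image represents 64 tokens at low resolution and 5*64=320 tokens at high resolution. High resolution is the default in TGI (see do_image_splitting in the model card for details), so by default each image consumes 320 tokens of the input budget.

Several VLMs like Idefics2 are also supported by TGI (see the list of supported models), so we use the TGI container again when creating the Endpoint.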
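The arithmetic matters when you size the Endpoint's input limit. As a rough sketch, the function below computes how many tokens remain for text once the images are accounted for. The per-image figures are the ones quoted above; `text_token_budget` is a hypothetical helper, and 1024 is used as an example input limit (the MAX_INPUT_LENGTH value set for the Endpoint in this tutorial).

```python
TOKENS_PER_IMAGE_LOW_RES = 64       # figure quoted in this tutorial
TOKENS_PER_IMAGE_HIGH_RES = 5 * 64  # 320, with image splitting enabled


def text_token_budget(max_input_length, n_images, image_splitting=True):
    """Tokens left for text after reserving space for the images."""
    per_image = TOKENS_PER_IMAGE_HIGH_RES if image_splitting else TOKENS_PER_IMAGE_LOW_RES
    remaining = max_input_length - n_images * per_image
    if remaining < 0:
        raise ValueError("The images alone exceed the input limit")
    return remaining


print(text_token_budget(1024, n_images=1))  # 704
print(text_token_budget(1024, n_images=3))  # 64
```

With a limit of 1024 tokens, three high-resolution images leave only 64 tokens for the text prompt, and a fourth image would not fit at all.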

>>> from huggingface_hub import create_inference_endpoint

>>> endpoint_name = "idefics2-8b-chatty-001"
>>> namespace = "MoritzLaurer"
>>> task = "text-generation"

>>> # check if endpoint with this name already exists from previous tests
>>> available_endpoints_names = [endpoint.name for endpoint in huggingface_hub.list_inference_endpoints()]
>>> endpoint_exists = endpoint_name in available_endpoints_names
>>> print("Does the endpoint already exist?", endpoint_exists)


>>> if endpoint_exists:
...     endpoint = huggingface_hub.get_inference_endpoint(name=endpoint_name, namespace=namespace)
...     if endpoint.status in ["paused", "scaledToZero"]:
...         print("Resuming endpoint")
...         endpoint.resume()
...     print("Waiting for endpoint to start")
...     endpoint.wait()
...     print("Endpoint ready")

... else:
...     endpoint = create_inference_endpoint(
...         endpoint_name,
...         repository=model_id_vlm,
...         namespace=namespace,
...         framework="pytorch",
...         task=task,
...         accelerator="gpu",
...         vendor="aws",
...         region="us-east-1",
...         type="protected",
...         instance_size="x1",
...         instance_type="nvidia-a100",
...         min_replica=0,
...         max_replica=1,
...         custom_image={
...             "health_route": "/health",
...             "env": {
...                 "MAX_BATCH_PREFILL_TOKENS": "2048",
...                 "MAX_INPUT_LENGTH": "1024",
...                 "MAX_TOTAL_TOKENS": "1536",
...                 "MODEL_ID": "/repository",
...             },
...             "url": "ghcr.io/huggingface/text-generation-inference:latest",
...         },
...     )

...     print("Waiting for endpoint to be created")
...     endpoint.wait()
...     print("Endpoint ready")

Does the endpoint already exist? False
Waiting for endpoint to be created
Endpoint ready

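>>> # pass the Endpoint URL so that the request goes to your own Endpoint rather than to the Hub model ID
>>> output = client.text_generation(prompt, model=endpoint.url, max_new_tokens=200, seed=42)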

>>> print(output)

In a quaint little café, there lived a llama,
With glasses on his face, he was quite a charm.
He'd sit at the table,
With a book and a mable,
And sip from a cup of warm tea.

endpoint.pause()

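Additional information

When creating several Endpoints, you will probably get an error message saying that your GPU quota has been reached. Don't hesitate to send an email to the address in the error message, and we will most likely increase your GPU quota.
What is the difference between paused and scaled-to-zero Endpoints? Scaled-to-zero Endpoints are woken up and scaled up automatically by incoming user requests, while paused Endpoints must be unpaused manually by the creator of the Endpoint. In addition, scaled-to-zero Endpoints count towards your GPU quota (with the maximum number of replicas they could scale up to), while paused Endpoints do not. A simple way of freeing up GPU quota is therefore to pause some Endpoints.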
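The quota rule can be expressed as a short calculation. The sketch below is a simplified model, not the service's actual accounting: it assumes one GPU per replica and that running Endpoints count with their maximum replica number in the same way as scaled-to-zero ones. The dictionaries are illustrative stand-ins for your Endpoints.

```python
def gpu_quota_in_use(endpoints):
    """Sum the replicas that count against the quota under the rule described above."""
    total = 0
    for ep in endpoints:
        if ep["status"] == "paused":
            continue  # paused endpoints do not consume quota
        total += ep["max_replica"]  # running and scaled-to-zero endpoints both count
    return total


endpoints = [
    {"name": "gemma-1-1-2b-it-001", "status": "paused", "max_replica": 1},
    {"name": "stable-diffusion-xl-base-1-0-001", "status": "scaledToZero", "max_replica": 1},
    {"name": "idefics2-8b-chatty-001", "status": "running", "max_replica": 2},
]
print(gpu_quota_in_use(endpoints))  # 3
```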

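Conclusion and next steps

That's it, you have created three different Endpoints (your own APIs!) for text-to-text, text-to-image and image-to-text generation, and the same process works for many other models and tasks.

We encourage you to read the Dedicated Inference Endpoints documentation to learn more. If you are using generative LLMs and VLMs, we also recommend reading the TGI documentation, because the most popular LLMs and VLMs are supported by TGI, which makes your Endpoints significantly more efficient.

For example, you can use JSON mode or function calling with open-source models through TGI Guidance (see also this recipe for an example of structured generation for RAG).

When moving your Endpoints into production, you will want to make several additional improvements to make your setup more efficient. With TGI, you should send batches of requests to the Endpoint with asynchronous function calls to fully utilize the Endpoint's hardware, and you can adapt several container parameters to optimize latency and throughput for your use case. We will cover these optimizations in another guide.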
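As a rough illustration of sending requests concurrently, the sketch below caps the number of in-flight requests with a semaphore. It is not the optimized setup referred to above: `send_batch` is a hypothetical helper, and `fake_send` is a stand-in for a real asynchronous request (for instance a call through the `AsyncInferenceClient` from huggingface_hub), so the example runs without a live Endpoint.

```python
import asyncio


async def send_batch(send_one, prompts, max_concurrency=8):
    """Send all prompts concurrently, with at most `max_concurrency` in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(prompt):
        async with semaphore:
            return await send_one(prompt)

    # gather returns results in the same order as `prompts`
    return await asyncio.gather(*(guarded(p) for p in prompts))


async def fake_send(prompt):
    """Stand-in for a real request to the endpoint."""
    await asyncio.sleep(0.01)
    return prompt.upper()


results = asyncio.run(send_batch(fake_send, ["a", "b", "c"], max_concurrency=2))
print(results)  # ['A', 'B', 'C']
```

A suitable value for `max_concurrency` depends on the model, the hardware and the container settings, so it is best found by measuring latency and throughput on your own Endpoint.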

