Welcome to Inference Providers on the Hub 🔥
Today, we are launching the integration of four awesome serverless Inference Providers – fal, Replicate, SambaNova, Together AI – directly on the Hub’s model pages. They are also seamlessly integrated into our client SDKs (for JS and Python), making it easier than ever to explore serverless inference of a wide variety of models that run on your favorite providers.
We’ve been hosting a serverless Inference API on the Hub for a long time (we launched the v1 in summer 2020 – wow, time flies 🤯). While this has enabled easy exploration and prototyping, we’ve refined our core value proposition towards collaboration, storage, versioning, and distribution of large datasets and models with the community. At the same time, serverless providers have flourished, and the time was right for Hugging Face to offer easy and unified access to serverless inference through a set of great providers.
Just as we work with great partners like AWS, Nvidia and others for dedicated deployment options via the model pages’ Deploy button, it was natural to partner with the next generation of serverless inference providers for model-centric, serverless inference.
Here’s what this enables, taking the timely example of deepseek-ai/DeepSeek-R1, a model which has achieved mainstream fame over the past few days 🔥:
Rodrigo Liang, Co-Founder & CEO at SambaNova: "We are excited to be partnering with Hugging Face to accelerate its Inference API. Hugging Face developers now have access to much faster inference speeds on a wide range of the best open source models."
Zeke Sikelianos, Founding Designer at Replicate: "Hugging Face is the de facto home of open-source model weights, and has been a key player in making AI more accessible to the world. We use Hugging Face internally at Replicate as our weights registry of choice, and we're honored to be among the first inference providers to be featured in this launch."
This is just the start, and we’ll build on top of this with the community in the coming weeks!
How it works
In the website UI
- In your user account settings, you are able to:
  - set your own API keys for the providers you’ve signed up with. Otherwise, you can still use them – your requests will be routed through HF.
  - order providers by preference. This applies to the widget and code snippets in the model pages.
- As we mentioned, there are two modes when calling Inference APIs:
  - custom key (calls go directly to the inference provider, using your own API key for the corresponding inference provider); or
  - routed by HF (in that case, you don't need a token from the provider, and the charges are applied directly to your HF account rather than the provider's account)
- Model pages showcase third-party inference providers (the ones that are compatible with the current model, sorted by user preference)
From the client SDKs
From Python, using huggingface_hub
The following example shows how to use DeepSeek-R1 using Together AI as the inference provider. You can use a Hugging Face token for automatic routing through Hugging Face, or your own Together AI API key if you have one.
Install huggingface_hub version 0.28.0 or later (release notes):
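For example, with pip:

pip install "huggingface_hub>=0.28.0"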
from huggingface_hub import InferenceClient

client = InferenceClient(
    provider="together",
    api_key="xxxxxxxxxxxxxxxxxxxxxxxx"
)

messages = [
    {
        "role": "user",
        "content": "What is the capital of France?"
    }
]

completion = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1",
    messages=messages,
    max_tokens=500
)

print(completion.choices[0].message)
Note: You can also use the OpenAI client library to call the Inference Providers; see here for an example with the DeepSeek model.
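For instance, here is a minimal sketch of the same DeepSeek-R1 chat call with the OpenAI Python client, assuming you route the request through Hugging Face using the proxy base URL described in the HTTP section below and authenticate with your HF token:

from openai import OpenAI

# point the OpenAI client at the Hugging Face routing proxy for Together AI
client = OpenAI(
    base_url="https://huggingface.co/api/inference-proxy/together",
    api_key="xxxxxxxxxxxxxxxxxxxxxxxx"  # your HF token when routing through HF
)

completion = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
    max_tokens=500
)

print(completion.choices[0].message)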
And here's how to generate an image from a text prompt using FLUX.1-dev running on fal.ai:
from huggingface_hub import InferenceClient

client = InferenceClient(
    provider="fal-ai",
    api_key="xxxxxxxxxxxxxxxxxxxxxxxx"
)

# output is a PIL.Image object
image = client.text_to_image(
    "Labrador in the style of Vermeer",
    model="black-forest-labs/FLUX.1-dev"
)
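Since the returned object is a PIL.Image, you can, for example, save it to disk (the file name below is arbitrary):

image.save("labrador_vermeer.png")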
To switch to a different provider, simply change the provider name; everything else stays the same:
from huggingface_hub import InferenceClient

client = InferenceClient(
-   provider="fal-ai",
+   provider="replicate",
    api_key="xxxxxxxxxxxxxxxxxxxxxxxx"
)
From JS, using @huggingface/inference
import { HfInference } from "@huggingface/inference";

const client = new HfInference("xxxxxxxxxxxxxxxxxxxxxxxx");

const chatCompletion = await client.chatCompletion({
    model: "deepseek-ai/DeepSeek-R1",
    messages: [
        {
            role: "user",
            content: "What is the capital of France?"
        }
    ],
    provider: "together",
    max_tokens: 500
});

console.log(chatCompletion.choices[0].message);
From HTTP calls
We expose the routing proxy directly under the huggingface.co domain, so you can call it directly; this is very useful for OpenAI-compatible APIs, for instance. Just use it as the base URL: https://huggingface.co/api/inference-proxy/{:provider}.
Here's how you can call Llama-3.3-70B-Instruct using SambaNova as the inference provider via cURL.
curl 'https://huggingface.co/api/inference-proxy/sambanova/v1/chat/completions' \
    -H 'Authorization: Bearer xxxxxxxxxxxxxxxxxxxxxxxx' \
    -H 'Content-Type: application/json' \
    --data '{
        "model": "meta-llama/Llama-3.3-70B-Instruct",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ],
        "max_tokens": 500,
        "stream": false
    }'
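If you prefer to make the same HTTP call from Python, here is a minimal sketch using the requests library; it mirrors the cURL example above (same proxy URL, token placeholder, and payload):

import requests

response = requests.post(
    "https://huggingface.co/api/inference-proxy/sambanova/v1/chat/completions",
    headers={
        "Authorization": "Bearer xxxxxxxxxxxxxxxxxxxxxxxx",
        "Content-Type": "application/json",
    },
    json={
        "model": "meta-llama/Llama-3.3-70B-Instruct",
        "messages": [{"role": "user", "content": "What is the capital of France?"}],
        "max_tokens": 500,
        "stream": False,
    },
)

# the response follows the OpenAI-compatible chat completion format
print(response.json()["choices"][0]["message"]["content"])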
Billing
For direct requests, i.e. when you use the key from an inference provider, you are billed by the corresponding provider. For instance, if you use a Together AI key you're billed on your Together AI account.
For routed requests, i.e. when you authenticate via the Hub, you'll only pay the standard provider API rates. There's no additional markup from us; we just pass through the provider costs directly. (In the future, we may establish revenue-sharing agreements with our provider partners.)
Important Note ‼️ PRO users get $2 worth of Inference credits every month. You can use them across providers. 🔥
Subscribe to the Hugging Face PRO plan to get access to Inference credits, ZeroGPU, Spaces Dev Mode, 20x higher limits, and more.
We also provide free inference with a small quota for our signed-in free users, but please upgrade to PRO if you can!
Feedback and next steps
We would love to get your feedback! Here’s a Hub discussion you can use: https://huggingface.co/spaces/huggingface/HuggingDiscussions/discussions/49