Kernels Tests


AI & ML interests

None defined yet.

Recent Activity

danieldk  updated a model 12 days ago
kernels-test/relu-metal
danieldk  published a model 15 days ago
kernels-test/relu-metal
danieldk  updated a model about 1 month ago
kernels-test/op-without-fake-test

kernels-test's activity

danieldk 
posted an update 3 days ago
We have been working on a project called kernels. kernels makes it possible to load compute kernels directly from the Hub! 🚀

We plan to give kernels a proper introduction soon. But for those who have been following along, we are happy to announce a new release:

- New layer API with torch.compile support.
- Experimental support for loading Apple Silicon Metal 🤘 kernels.
- Generate wheels from Hub kernels for legacy deployments.

Full release notes here: https://github.com/huggingface/kernels/releases/tag/v0.6.0
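For those who want to try it, here is a minimal sketch of loading a kernel from the Hub, along the lines of the repo README (it assumes a CUDA device and the kernels-community/activation kernel):

```python
import torch
from kernels import get_kernel

# Download the compiled kernel from the Hub and load it into the process.
activation = get_kernel("kernels-community/activation")

x = torch.randn((16, 16), dtype=torch.float16, device="cuda")
y = torch.empty_like(x)

# Run the Hub-provided fast GELU kernel, writing the result into y.
activation.gelu_fast(y, x)
print(y)
```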
danieldk 
published a model about 2 months ago
Narsil 
posted an update 6 months ago
Performance leap: TGI v3 is out. It processes 3x more tokens and is 13x faster than vLLM on long prompts. Zero config!

3x more tokens.

By reducing our memory footprint, we can ingest many more tokens, and more dynamically, than before. A single L4 (24 GB) can handle 30k tokens on Llama 3.1-8B, while vLLM barely manages 10k. A lot of work went into reducing the runtime's footprint, and its effects are most visible in smaller, constrained environments.
13x faster

On long prompts (200k+ tokens), conversation replies take 27.5s in vLLM but only 2s in TGI. How so? We keep the initial conversation around, so when a new reply comes in we can answer almost instantly. The overhead of the lookup is ~5µs. Thanks to Daniël de Kok for the beast of a data structure.
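As a toy illustration of the idea (not TGI's actual data structure), a prefix cache can be as simple as a map from token prefixes to cached state: a follow-up request reuses the longest matching prefix and only pays to prefill the new tokens.

```python
# Toy prefix cache: token-prefix tuple -> cached KV state (illustrative only).
class PrefixCache:
    def __init__(self):
        self._cache = {}

    def put(self, tokens, state):
        self._cache[tuple(tokens)] = state

    def longest_prefix(self, tokens):
        """Return (matched_length, state) for the longest cached prefix."""
        for n in range(len(tokens), 0, -1):
            state = self._cache.get(tuple(tokens[:n]))
            if state is not None:
                return n, state
        return 0, None

cache = PrefixCache()
cache.put([1, 2, 3], "kv-state-after-3-tokens")

# Only tokens 4 and 5 would need prefill; the rest is served from the cache.
matched, state = cache.longest_prefix([1, 2, 3, 4, 5])
print(matched, state)  # -> 3 kv-state-after-3-tokens
```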
Zero config

That’s it. Remove all the flags you’re using and you’re likely to get the best performance. By evaluating the hardware and model, TGI automatically selects values that give the best performance. In production, we no longer use any flags in our deployments. We kept all the existing flags around; they may come in handy in niche scenarios.

Read more: https://huggingface.co/docs/text-generation-inference/conceptual/chunking
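As a sketch of what zero config looks like from the client side: start TGI with nothing but a model id and no tuning flags, then query it with huggingface_hub's InferenceClient (the server URL and prompt below are illustrative assumptions):

```python
# Assumes a TGI v3 server already running locally, launched with only
# --model-id and no tuning flags (TGI picks suitable values itself).
from huggingface_hub import InferenceClient

client = InferenceClient("http://localhost:8080")  # points at the TGI server

reply = client.text_generation(
    "Summarize our discussion so far in one sentence.",
    max_new_tokens=64,
)
print(reply)
```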
Narsil 
posted an update about 1 year ago
Narsil 
posted an update about 1 year ago