Article: Accelerating LLM Inference with TGI on Intel Gaudi • By baptistecolle and 4 others • Mar 28
Article: Universal Assisted Generation: Faster Decoding with Any Assistant Model • By danielkorat and 7 others • Oct 29, 2024
Article: Speeding Up LLM Decoding with Advanced Universal Assisted Generation Techniques • By jmamou and 8 others • Mar 24
Article: Welcome to Inference Providers on the Hub 🔥 • By julien-c and 6 others • Jan 28
Article: Timm ❤️ Transformers: Use any timm model with transformers • By ariG23498 and 4 others • Jan 16
Article: Introducing multi-backends (TRT-LLM, vLLM) support for Text Generation Inference • By mfuntowicz and 1 other • Jan 16
Article: The 5 Most Under-Rated Tools on Hugging Face • By derek-thomas • Aug 22, 2024
Paper: Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation • arXiv:2406.06525 • Published Jun 10, 2024
Article: CPU Optimized Embeddings with 🤗 Optimum Intel and fastRAG • By peterizsak and 5 others • Mar 15, 2024
Collection: Fast-RAG Inference Endpoints (an easy-to-deploy RAG pipeline using Inference Endpoints) • 3 items • Updated Jun 3, 2024
Article: Binary and Scalar Embedding Quantization for Significantly Faster & Cheaper Retrieval • By aamirshakir and 2 others • Mar 22, 2024
Collection: Neural Network Compression & Quantization (tracks papers and links about neural network compression and quantization techniques) • 4 items • Updated Sep 22, 2023
Paper: BitNet: Scaling 1-bit Transformers for Large Language Models • arXiv:2310.11453 • Published Oct 17, 2023
Paper: LeanDojo: Theorem Proving with Retrieval-Augmented Language Models • arXiv:2306.15626 • Published Jun 27, 2023