Optimise AI Models and Make Them Faster, Smaller, Cheaper, Greener

Community Article · Published April 4, 2025

At our core, we're a team of ML researchers who want to simplify AI model optimization. Many people have asked us how compressing AI models works, so we decided to make the pruna package open-source!

Before smashing: 4.06s inference time → after smashing: 1.44s inference time

Pruna is a model optimization framework built for developers, enabling you to deliver faster, more efficient models with minimal overhead. It provides a comprehensive suite of compression algorithms including caching, quantization, pruning, distillation and compilation techniques to make your models:

  • Faster: Accelerate inference times through advanced optimization techniques
  • Smaller: Reduce model size while maintaining quality
  • Cheaper: Lower computational costs and resource requirements
  • Greener: Decrease energy consumption and environmental impact

The toolkit is designed with simplicity in mind: optimizing a model takes just a few lines of code. It supports many model types, including LLMs, diffusion and flow matching models, vision transformers, speech recognition models, and more.

This lets any ML engineer, regardless of background, build efficient AI products and applications without wasting development time.

Quickstart

First things first. Pruna is available on PyPI, so you can install it using pip:

pip install pruna

Getting started with Pruna is easy-peasy pruna-squeezy!

First, load any pre-trained model. Here's an example using Stable Diffusion:

from diffusers import StableDiffusionPipeline

# Load a pre-trained Stable Diffusion pipeline from the Hugging Face Hub
base_model = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")
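
If you have a CUDA-capable GPU, you can move the pipeline onto it before smashing. This is standard diffusers usage rather than anything Pruna-specific:

# Standard diffusers API: move the pipeline to the GPU for faster inference
base_model = base_model.to("cuda")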

Then, use Pruna's smash function to optimize your model. You can customize the optimization process using SmashConfig; here we enable DeepCache, a caching algorithm that reuses intermediate U-Net features across denoising steps to skip redundant computation:

from pruna import smash, SmashConfig

# Create and smash your model
smash_config = SmashConfig()
smash_config["cacher"] = "deepcache"
smashed_model = smash(model=base_model, smash_config=smash_config)

Your model is now optimized, and you can use it just as you would the original:

smashed_model("An image of a cute prune.").images[0]

Pruna provides a variety of compression and optimization algorithms, and you can combine them to get the best possible results:

from pruna import smash, SmashConfig

# Create and smash your model
smash_config = SmashConfig()
smash_config["cacher"] = "deepcache"
smash_config["compiler"] = "stable_fast"
smashed_model = smash(model=base_model, smash_config=smash_config)
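
A smashed model can also be saved and reloaded, so you don't have to re-run smash every time. A hedged sketch, assuming the save_pretrained / PrunaModel.from_pretrained pair from the Pruna documentation (check the docs for the exact API in your version):

# Persist the optimized model to disk ...
smashed_model.save_pretrained("smashed-stable-diffusion")

# ... and load it back later without re-running smash
from pruna import PrunaModel
smashed_model = PrunaModel.from_pretrained("smashed-stable-diffusion")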

You can then use our evaluation interface to measure the performance of your model:

from pruna.evaluation.task import Task
from pruna.evaluation.evaluation_agent import EvaluationAgent
from pruna.data.pruna_datamodule import PrunaDataModule

# Define what to evaluate (image generation quality) and which data to use
task = Task("image_generation_quality", datamodule=PrunaDataModule.from_string("LAION256"))
eval_agent = EvaluationAgent(task)
eval_agent.evaluate(smashed_model)
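
To inspect the computed metrics, you can capture the agent's return value. A small sketch, assuming evaluate returns the metric results (the exact return type may differ between versions):

results = eval_agent.evaluate(smashed_model)
for result in results:
    print(result)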

That was the minimal example; looking for the maximal one? Check out our documentation for an overview of all supported algorithms, as well as our tutorials for more use cases and examples.

So, what's next?

Last December, we reached an intermediate milestone by going from a fully private to a freemium model. But open-source was always the goal. We believe the future of AI is open.

Of course, this is just the beginning… we hope to help shape the AI efficiency community and make AI more accessible and sustainable.

We are eager to hear from you! Try out the library, open issues and PRs, and if you like it, don't forget to give us a star ⭐

๐ŸŒ Join the Pruna AI community!

Twitter · GitHub · LinkedIn · Discord · Reddit
