What's the best way to deploy UMT5 variants into production with low-latency inference?

#4
by Respair - opened

This is such a neat model, but I don't see it being supported by most frameworks since it uses a different sampling method.

Can you recommend any way to deploy this model (by this, I mean the model we fine-tune on a downstream task) into production, and possibly a trivial way to convert it to ONNX? Optimum doesn't support it just yet. Preferably something that runs on GPUs.
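Until Optimum ships an export config for UMT5, one workaround is to export the graph by hand with `torch.onnx.export`. Below is a minimal sketch, assuming the fine-tuned checkpoint loads with `transformers`' `UMT5ForConditionalGeneration`; `your-finetuned-umt5` is a placeholder path, not a real model id, and this is an unofficial hack rather than a supported pipeline:

```python
# Hedged sketch: hand-rolled ONNX export of a fine-tuned UMT5 encoder.
# "your-finetuned-umt5" is a placeholder for your own checkpoint path.
import torch
from transformers import AutoTokenizer, UMT5ForConditionalGeneration


class EncoderWrapper(torch.nn.Module):
    """Return a plain tensor so tracing doesn't choke on ModelOutput dicts."""

    def __init__(self, encoder):
        super().__init__()
        self.encoder = encoder

    def forward(self, input_ids, attention_mask):
        return self.encoder(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state


model_id = "your-finetuned-umt5"  # placeholder: your downstream checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = UMT5ForConditionalGeneration.from_pretrained(model_id).eval()

# Dummy inputs fix the traced graph's signature; dynamic_axes keeps
# batch size and sequence length flexible at inference time.
enc = tokenizer("an example input", return_tensors="pt")

torch.onnx.export(
    EncoderWrapper(model.get_encoder()),
    (enc["input_ids"], enc["attention_mask"]),
    "umt5_encoder.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["last_hidden_state"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "sequence"},
        "attention_mask": {0: "batch", 1: "sequence"},
        "last_hidden_state": {0: "batch", 1: "sequence"},
    },
    opset_version=17,
)
```

The exported file can then run on GPU through `onnxruntime-gpu` by passing `providers=["CUDAExecutionProvider"]` to `InferenceSession`. Note this covers only the encoder: fast autoregressive generation still needs a separate decoder export (ideally with past-key-value caching) and a generation loop around the two graphs.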

Man, I really wish there were a vLLM for seq2seq models like this. Their potential is so underrated. If this tiny voice of mine can be heard by the big guys at Google, please create a framework that makes it easier to serve seq2seq models!
