Optimizing Mixtral-8x7B-Instruct-v0.1 for Hugging Face Chat
#54 · opened by Husain
What kind of optimizations are used to run mistralai/Mixtral-8x7B-Instruct-v0.1 in Hugging Face Chat (https://huggingface.co/chat)? Is it the default model running in full precision?
Or are there optimizations to reduce the memory requirements for running the model, like using float16, or 8-bit / 4-bit quantization with bitsandbytes?
Is Flash Attention 2 used too?
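To clarify what I mean, here's a rough sketch of loading the model with those kinds of optimizations via transformers (assuming bitsandbytes, flash-attn, and accelerate are installed; I'm not claiming this is what HuggingChat actually does):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"

# 4-bit NF4 quantization with bitsandbytes; compute runs in float16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",  # requires the flash-attn package
    device_map="auto",  # requires accelerate
)

# Mixtral-Instruct expects the [INST] ... [/INST] prompt format.
inputs = tokenizer("[INST] Hello! [/INST]", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```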
Hi @Husain,
I think HuggingChat uses TGI under the hood: https://github.com/huggingface/text-generation-inference
Specifically, the Mixtral implementation (with FlashAttention kernels) is here: https://github.com/huggingface/text-generation-inference/blob/main/server/text_generation_server/models/custom_modeling/flash_mixtral_modeling.py
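If you want to try something similar locally, you can launch TGI yourself and query it from Python. Here's a rough sketch; the exact flags, sharding, and quantization HuggingChat uses aren't public, so the launch command below is just an example:

```python
# A TGI server can be launched with its docker image, e.g.:
#   docker run --gpus all --shm-size 1g -p 8080:80 \
#     ghcr.io/huggingface/text-generation-inference:latest \
#     --model-id mistralai/Mixtral-8x7B-Instruct-v0.1 --num-shard 2
from huggingface_hub import InferenceClient

# Point the client at the locally running TGI server (not the HuggingChat deployment).
client = InferenceClient("http://127.0.0.1:8080")

response = client.text_generation(
    "[INST] Explain Mixture of Experts in one sentence. [/INST]",
    max_new_tokens=80,
)
print(response)
```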