Text Generation
Transformers
Safetensors
English
Japanese
llama
conversational
text-generation-inference

Loading and inference time

#1
by NEWWWWWbie - opened

I've been testing a deployment of this model and noticed that it takes roughly 30 minutes to load and around 40 minutes to return a response. This seems unusually slow, especially compared with LLaMA 3.1 8B Instruct, which loads and responds much faster in a similar environment.

Trend Micro (AI Lab) org

Hi @NEWWWWWbie,
Could you share reproducible code and details of your environment?
Our model is identical to Llama 3.1 8B Instruct except for the parameter values, and we haven't observed this issue in our own testing.
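For anyone hitting the same symptom, a minimal timing sketch along these lines can help separate time spent loading weights from time spent generating. This is an assumption-laden example, not an official reproduction script: the model ID is a placeholder, and dtype/device settings should be adjusted to your setup.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Record wall-clock seconds for the enclosed block under `label`."""
    start = time.perf_counter()
    yield
    results[label] = time.perf_counter() - start


def report(results):
    """Format collected timings as 'label: 1.23s' lines."""
    return "\n".join(f"{k}: {v:.2f}s" for k, v in results.items())


if __name__ == "__main__":
    # Placeholder model ID; substitute the repo you are actually testing.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "meta-llama/Llama-3.1-8B-Instruct"
    timings = {}

    with timed("load", timings):
        tokenizer = AutoTokenizer.from_pretrained(model_id)
        model = AutoModelForCausalLM.from_pretrained(
            model_id,
            torch_dtype=torch.bfloat16,
            device_map="auto",
        )

    with timed("generate", timings):
        inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(model.device)
        model.generate(**inputs, max_new_tokens=64)

    print(report(timings))
```

If the "load" phase dominates, the usual suspects are a cold Hub download, slow disk, or spilling to CPU/disk offload for lack of GPU memory; if "generate" dominates, check whether the model actually landed on the GPU.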
