Loading checkpoint shards: killed at 33%
Hi,
I am running an application that uses this model, and loading gets killed at the 33% mark. My setup is as follows:
Ubuntu Server 22.04 w/ drivers
1x RTX A6000 (48GB) [Premium]
6 vCPU, 96 GB RAM, 300 GB Storage
htop shows that I have almost 90 GB of RAM free; furthermore, there is 48 GB of GPU memory.
Here is a sample of the code that I use to access the model:
import gradio as gr
import torch
import transformers
import librosa
import numpy as np
import tempfile
import os
from kokoro import KPipeline
import soundfile as sf
from typing import Dict, Optional, Tuple
import huggingface_hub
from huggingface_hub import login

class VoiceAssistant:
    # Available voices with their configurations
    VOICES = {
        "Bella (US Female)": {"code": "af_bella", "lang_code": "a"},
        "Nicole (US Female)": {"code": "af_nicole", "lang_code": "a"},
        "Michael (US Male)": {"code": "am_michael", "lang_code": "a"},
        "Emma (UK Female)": {"code": "bf_emma", "lang_code": "b"},
        "George (UK Male)": {"code": "bm_george", "lang_code": "b"}
    }

    def __init__(self):
        """Initialize both Ultravox and Kokoro TTS models"""
        access_token_read = "token i got from Huggingface for gated repo"
        login(token=access_token_read)

        print("Loading Ultravox model... This may take a few minutes...")
        self.pipe = transformers.pipeline(
            model='fixie-ai/ultravox-v0_5-llama-3_3-70b',  # Updated to v0_5
            # model='fixie-ai/ultravox-v0_4',  # Original v0_4
            trust_remote_code=True
        )
        print("Model loaded successfully!")
Could you please let me know what I am missing?
Thanks,
Arshad.
The 70B model has much higher memory requirements. A dense 70B-parameter model needs roughly 140 GB for the weights alone in fp16/bf16 (about 280 GB in fp32), which exceeds both your 96 GB of system RAM and your 48 GB of GPU memory, so the OS kills the process partway through loading the checkpoint shards. Quantization could be a potential solution, but it is not currently supported.
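As a rough sanity check, here is a back-of-envelope estimate of the weight memory at different precisions (weights only; it ignores activations, the KV cache, and loading overhead):

# Rough weight-memory estimate for a dense 70B-parameter model
params = 70e9
for dtype, nbytes in {"fp32": 4, "fp16/bf16": 2, "int8": 1, "int4": 0.5}.items():
    print(f"{dtype}: ~{params * nbytes / 1e9:.0f} GB of weights")
# fp16/bf16: ~140 GB -- already more than either the 96 GB of RAM
# or the 48 GB of VRAM on this machine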
What are the minimum requirements for that model? Please let me know so we can configure accordingly.
Thanks
For the 70B model, we use vLLM and multiple H100s for serving. We haven't attempted loading the model on a single GPU yet.
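Roughly, the setup looks like this (a minimal sketch using vLLM's Python API, assuming an 8-GPU node; the GPU count and arguments are illustrative, not our exact serving configuration):

from vllm import LLM

# Hypothetical setup: shard the 70B model across 8 GPUs with tensor parallelism.
llm = LLM(
    model="fixie-ai/ultravox-v0_5-llama-3_3-70b",
    tensor_parallel_size=8,  # assumed GPU count; set this to the GPUs on your node
    trust_remote_code=True,  # the Ultravox repo ships custom pipeline code
)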
Thanks. I will keep that in mind when configuring.