Original model: https://huggingface.co/ByteDance-Seed/BAGEL-7B-MoT
ema-FP8.safetensors contains the model's weights quantized to float8_e4m3fn.
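For reference, an FP8 checkpoint of this kind can be produced by casting the original weights with PyTorch. The snippet below is a minimal sketch that assumes every floating-point tensor is cast; the actual script used to create ema-FP8.safetensors may differ.

```python
# Hypothetical conversion sketch (assumed workflow, not the exact recipe
# used for ema-FP8.safetensors): cast every floating-point tensor to FP8.
import torch
from safetensors.torch import load_file, save_file

state_dict = load_file("ema.safetensors")  # original EMA weights
fp8_state_dict = {
    name: t.to(torch.float8_e4m3fn) if t.is_floating_point() else t
    for name, t in state_dict.items()
}
save_file(fp8_state_dict, "ema-FP8.safetensors")
```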
Benchmark spec: RTX 4090 (24 GB VRAM) + 60 GB system RAM
Default settings, 25 timesteps
| Feature | Speed (seconds) | GPU VRAM usage | CPU RAM usage |
|---|---|---|---|
| 📝 Text to Image | 128.90 s | 16.18 GB | 14.22 GB |
| 🖌️ Image Edit | 138.67 s | 15.08 GB | 14.21 GB |
| 🖼️ Image Understanding | 102.68 s | 15.08 GB | 13.66 GB |
Support
Runs with less than 12 GB of GPU memory.
Total footprint (RAM + VRAM) is about 31 GB.
* With 12 GB of VRAM, more of the model is offloaded to CPU, so inference is roughly 1.5x slower than with 24 GB.
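A minimal sketch of this offload mechanism, using Hugging Face Accelerate and a toy model (the real logic lives in app-fp8.py and may differ): layers that do not fit into the VRAM budget stay in system RAM and are moved to the GPU at inference time.

```python
# Illustrative offload sketch (assumed mechanism, not the exact app-fp8.py code).
import torch.nn as nn
from accelerate import infer_auto_device_map, dispatch_model

model = nn.Sequential(*[nn.Linear(4096, 4096) for _ in range(8)])  # stand-in for the 7B model
max_memory = {0: "12GiB", "cpu": "16GiB"}  # GPU 0 budget, CPU RAM budget
device_map = infer_auto_device_map(model, max_memory=max_memory)
# Modules mapped to "cpu" are offloaded and brought to the GPU at forward time.
model = dispatch_model(model, device_map=device_map)
```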
How to Install:
Create a new environment:
- git clone https://github.com/bytedance-seed/BAGEL.git
- cd BAGEL
- conda create -n bagel python=3.10 -y
- conda activate bagel
Install dependencies
Install PyTorch 2.5.1 built for CUDA 12.4:
- pip install torch==2.5.1 torchvision==0.20.1 --index-url https://download.pytorch.org/whl/cu124
- pip install flash_attn-2.7.0.post1+cu12torch2.5cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
More prebuilt wheels: https://github.com/Dao-AILab/flash-attention/releases
The flash_attn wheel must match your Python version, PyTorch version, and CUDA version.
- pip install -r requirements.txt
(edit requirements.txt first and comment out flash_attn, i.e. change flash_attn==2.5.8 to #flash_attn==2.5.8)
- pip install gradio pynvml
(pynvml is used to report VRAM stats.)
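After installation, an optional sanity check can confirm the wheels match; the expected versions below assume the commands above.

```python
# Optional sanity check that torch, CUDA, and flash_attn line up.
import torch
import flash_attn

print("torch:", torch.__version__)            # expect 2.5.1+cu124
print("cuda available:", torch.cuda.is_available())
print("cuda build:", torch.version.cuda)      # expect 12.4
print("flash_attn:", flash_attn.__version__)  # expect 2.7.0.post1
```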
Models & Settings:
- Download https://huggingface.co/ByteDance-Seed/BAGEL-7B-MoT (everything except ema.safetensors) together with ema-FP8.safetensors, and arrange the folders like this (a download sketch follows the tree):
folders
├── BAGEL
│   └── app-fp8.py
└── BAGEL-7B-MoT
    └── ema-FP8.safetensors
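One way to fetch the original repository while skipping the full-precision EMA weights is huggingface_hub's snapshot_download (a sketch; downloading the files manually works just as well). Place ema-FP8.safetensors into the resulting BAGEL-7B-MoT folder afterwards.

```python
# Download the original repo without ema.safetensors (sketch; local_dir is an example path).
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="ByteDance-Seed/BAGEL-7B-MoT",
    local_dir="BAGEL-7B-MoT",
    ignore_patterns=["ema.safetensors"],  # skip the full-precision EMA weights
)
```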
Open app-fp8.py in a text editor (Notepad, VS Code, etc.) and change model_path to your own path:
parser.add_argument("--model_path", type=str, default="/root/your_path/BAGEL-7B-MoT")
- Edit the memory budget to match your hardware:
cpu_mem_for_offload = "16GiB"
gpu_mem_per_device = "24GiB"  # default: 24GiB; on a 4090 you can lower this (e.g. 16GiB), at the cost of speed
- To use your VRAM more efficiently (see the sketch after this list):
NUM_ADDITIONAL_LLM_LAYERS_TO_GPU = 5
# 5 for 24 GB VRAM, more than 5 for 32 GB VRAM; experiment to see what fits.
# The base device map keeps 10 LLM layers on the GPU, so 5 additional layers gives 15 layers on a 4090.
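A hypothetical illustration of what NUM_ADDITIONAL_LLM_LAYERS_TO_GPU does (assumed behaviour; check app-fp8.py for the real logic): layers that the automatic device map left on CPU are pinned to the GPU instead, trading VRAM for speed. The module names below are placeholders.

```python
# Promote a few CPU-mapped LLM layers to GPU 0 (hypothetical device_map and layer names).
NUM_ADDITIONAL_LLM_LAYERS_TO_GPU = 5

device_map = {f"language_model.layers.{i}": "cpu" for i in range(20)}  # toy starting map
cpu_layers = [name for name, dev in device_map.items() if dev == "cpu"]
for name in cpu_layers[:NUM_ADDITIONAL_LLM_LAYERS_TO_GPU]:
    device_map[name] = 0  # each promoted layer uses more VRAM but avoids CPU<->GPU transfers
```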
How to Use:
- cd BAGEL
- conda activate bagel
- python app-fp8.py
- Open http://127.0.0.1:7860 in your browser