Original model is https://huggingface.co/ByteDance-Seed/BAGEL-7B-MoT

ema-FP8.safetensors contains the original model's weights cast to float8_e4m3fn.
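
As a rough illustration (not the exact script used to produce this repo, which may keep some tensors such as norms in higher precision), a cast like this can be done with safetensors and PyTorch 2.1+:

```python
# Illustrative sketch only: cast the original ema.safetensors weights to float8_e4m3fn.
import torch
from safetensors.torch import load_file, save_file

state = load_file("BAGEL-7B-MoT/ema.safetensors")  # original checkpoint
fp8_state = {
    name: t.to(torch.float8_e4m3fn) if t.is_floating_point() else t
    for name, t in state.items()
}
save_file(fp8_state, "BAGEL-7B-MoT/ema-FP8.safetensors")
```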

Benchmark Spec: 24GB 4090 + 60GB RAM

Default settings, 25 timesteps

| Feature | Speed | GPU VRAM usage | CPU RAM usage |
| --- | --- | --- | --- |
| 📝 Text to Image | 128.90 s | 16.18 GB | 14.22 GB |
| 🖌️ Image Edit | 138.67 s | 15.08 GB | 14.21 GB |
| 🖼️ Image Understanding | 102.68 s | 15.08 GB | 13.66 GB |

Benchmark Images

Support

Runs with less than 12GB of GPU memory.

RAM + VRAM together total about 31 GB.

* The 12GB configuration relies on CPU offload, so it is roughly 1.5x slower than the 24GB configuration (see the offload sketch below).

4070-12GB.jpg
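
The sub-12GB case works because layers that do not fit in VRAM are kept in CPU RAM and streamed in as needed. As a rough illustration of the idea (not the actual app-fp8.py code), Hugging Face accelerate can derive such a split from per-device memory budgets:

```python
# Rough sketch only: derive a GPU/CPU split from memory budgets with accelerate.
# The stand-in model and the "11GiB" budget are illustrative, not values from app-fp8.py.
import torch.nn as nn
from accelerate import infer_auto_device_map, init_empty_weights

with init_empty_weights():  # build the module on the meta device, no real memory used
    stand_in = nn.Sequential(*[nn.Linear(8192, 8192) for _ in range(80)])

device_map = infer_auto_device_map(
    stand_in,
    max_memory={0: "11GiB", "cpu": "20GiB"},  # ~12 GB card plus CPU offload
)
print(device_map)  # layers that exceed the GPU budget are assigned to "cpu"
```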

How to Install:

new environment

  1. git clone https://github.com/bytedance-seed/BAGEL.git
  2. cd BAGEL
  3. conda create -n bagel python=3.10 -y
  4. conda activate bagel

install

  1. install pytorch 2.5.1
    CUDA 12.4
    pip install torch==2.5.1 torchvision==0.20.1 --index-url https://download.pytorch.org/whl/cu124

  2. pip install flash_attn-2.7.0.post1+cu12torch2.5cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
    more whls: https://github.com/Dao-AILab/flash-attention/releases
    The flash_attn wheel must match your Python version, PyTorch version, and CUDA version (see the version-check snippet after this list).

  3. pip install -r requirements.txt
    (edit requirements.txt and comment out flash_attn==2.5.8, i.e. change it to #flash_attn==2.5.8, since flash_attn was already installed from the wheel in step 2)

  4. pip install gradio pynvml (pynvml is used to report VRAM stats.)
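
To read off the versions that the wheel filename in step 2 must match, a quick check (just a convenience, not part of the upstream instructions) is:

```python
# Print the Python / PyTorch / CUDA versions the flash_attn wheel name must match,
# e.g. cp310 = Python 3.10, torch2.5 = PyTorch 2.5.x, cu12 = CUDA 12.x.
import sys
import torch

print("python:", sys.version.split()[0])
print("torch :", torch.__version__)
print("cuda  :", torch.version.cuda)
```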

Models & Settings:

  1. Download huggingface.co/ByteDance-Seed/BAGEL-7B-MoT (without ema.safetensors) plus ema-FP8.safetensors from this repo, and arrange them like this (see the download sketch after this list):
folders
├── BAGEL
│   └── app-fp8.py
└── BAGEL-7B-MoT
    └── ema-FP8.safetensors
  2. Open app-fp8.py with Notepad, VS Code, etc.

  3. Change --model_path to your own path:

parser.add_argument("--model_path", type=str, default="/root/your_path/BAGEL-7B-MoT")

  4. Edit the memory budgets for your hardware:

cpu_mem_for_offload = "16GiB"
gpu_mem_per_device = "24GiB"  # default: 24GiB; on a 24GB 4090 you can lower it to 16GiB, but inference will be slower.

  5. Optionally put more LLM layers on the GPU:

NUM_ADDITIONAL_LLM_LAYERS_TO_GPU = 5
# 5 for 24 GB VRAM; try a larger value with 32 GB VRAM.
# The default keeps 10 layers on the GPU; with a 4090 this setting brings it to 15.
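
For step 1, a convenient way to fetch the original repo while skipping its BF16 ema.safetensors is huggingface_hub (optional sketch; the local_dir path is illustrative):

```python
# Optional helper for step 1: download BAGEL-7B-MoT without the BF16 ema.safetensors,
# then fetch ema-FP8.safetensors from this repo into the same folder.
from huggingface_hub import hf_hub_download, snapshot_download

snapshot_download(
    repo_id="ByteDance-Seed/BAGEL-7B-MoT",
    local_dir="BAGEL-7B-MoT",
    ignore_patterns=["ema.safetensors"],  # skip the original full-precision checkpoint
)
hf_hub_download(
    repo_id="meimeilook/BAGEL-7B-MoT-FP8",
    filename="ema-FP8.safetensors",
    local_dir="BAGEL-7B-MoT",
)
```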

How to Use:

  1. cd BAGEL
  2. conda activate bagel
  3. python app-fp8.py
  4. Open 127.0.0.1:7860

demo.jpg

