Error on both ExLlamav2 and ExLlamav2_HF in ooba/tgw

#1
by 2themaxx - opened

I get an error loading this model in the ooba/tgw 1-click install (commit ace8afb825c80925ed21ab26dbf66b538ab06285). Previous exl2 quants such as "turboderp/gemma-3-27b-it-exl2" still load fine, and turboderp/Qwen3-32b-ExL3 also loads fine on this ooba commit.

...
line 483, in check_keys
    raise ValueError(f" ## Could not find {prefix}.* in model")
ValueError: ## Could not find model.layers.0.mlp.down_proj.* in model
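The key error suggests the loader is validating against a dense-MLP tensor layout, while Qwen3-30B-A3B is a MoE model whose checkpoint most likely stores per-expert weights (e.g. under mlp.experts.*) rather than a single model.layers.0.mlp.down_proj. A quick way to check what the shards actually contain is to list the tensor keys with the safetensors library; a minimal sketch (the shard filename is illustrative):

```python
from safetensors import safe_open

# Illustrative shard name; use any *.safetensors file in the model directory.
with safe_open("model-00001-of-00004.safetensors", framework="pt") as f:
    for key in f.keys():
        # Print how the first layer's MLP weights are actually named.
        if "layers.0.mlp" in key:
            print(key)
```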

I got a NotImplementedError when trying to load with TabbyAPI:

2025-08-05 03:28:34.636 INFO:     Loading model:
/app/models/Qwen/Qwen3-30B-A3B-exl2
2025-08-05 03:28:34.637 INFO:     Loading with tensor parallel

Traceback (most recent call last):
  File "/app/main.py", line 181, in <module>
    entrypoint()
  File "/app/main.py", line 177, in entrypoint
    asyncio.run(entrypoint_async())
  File "/usr/lib/python3.12/asyncio/runners.py", line 194, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
  File "/app/main.py", line 61, in entrypoint_async
    await model.load_model(
  File "/app/common/model.py", line 226, in load_model
    async for _ in load_model_gen(model_path, **kwargs):
  File "/app/common/model.py", line 202, in load_model_gen
    async for module, modules in load_status:
  File "/app/backends/exllamav2/model.py", line 491, in load_gen
    async for value in iterate_in_threadpool(model_load_generator):
  File "/app/common/concurrency.py", line 30, in iterate_in_threadpool
    yield await asyncio.to_thread(gen_next, generator)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/threads.py", line 25, in to_thread
    return await loop.run_in_executor(None, func_call)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/common/concurrency.py", line 20, in gen_next
    return next(generator)
           ^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 36, in generator_context
    response = gen.send(None)
               ^^^^^^^^^^^^^^
  File "/app/backends/exllamav2/model.py", line 608, in load_model_sync
    for value in self.model.load_tp_gen(
  File "/opt/venv/lib/python3.12/site-packages/exllamav2/model.py", line 424, in load_tp_gen
    ms = module.scratch_space_tp()
         ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/exllamav2/module.py", line 50, in scratch_space_tp
    def scratch_space_tp(self): raise(NotImplementedError())
                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
NotImplementedError

ExLlamaV2 doesn't have a tensor-parallel implementation for MoE models. V3 does, though it's still in the dev branch. It should be merged very soon.
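Until that merge, a non-tensor-parallel load should work on the V2 backend; the TabbyAPI log above shows "Loading with tensor parallel", so turning that option off in TabbyAPI's config should sidestep the unimplemented scratch_space_tp() path. For reference, a minimal sketch of a non-TP load with the exllamav2 Python API (model path taken from the log above; cache settings are assumptions):

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Cache, ExLlamaV2Config, ExLlamaV2Tokenizer

# Path from the TabbyAPI log above; adjust to your model directory.
config = ExLlamaV2Config("/app/models/Qwen/Qwen3-30B-A3B-exl2")
model = ExLlamaV2(config)

# A lazy cache lets load_autosplit() spread layers across available GPUs
# without entering the tensor-parallel loader that raised NotImplementedError.
cache = ExLlamaV2Cache(model, lazy=True)
model.load_autosplit(cache)

tokenizer = ExLlamaV2Tokenizer(config)
```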
