Error on both ExLlamav2 and ExLlamav2_HF in ooba/tgw

#1
by 2themaxx - opened

I get an error loading this model in the ooba/tgw 1-click install (commit ace8afb825c80925ed21ab26dbf66b538ab06285). Previous exl2 quants such as "turboderp/gemma-3-27b-it-exl2" still load fine, and turboderp/Qwen3-32b-ExL3 also loads fine on this ooba commit.

...
line 483, in check_keys
    raise ValueError(f" ## Could not find {prefix}.* in model")
ValueError: ## Could not find model.layers.0.mlp.down_proj.* in model
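The key error suggests the loader is validating against a dense-MLP tensor layout, while Qwen3-30B-A3B is a MoE model whose checkpoint most likely stores per-expert weights (e.g. under mlp.experts.*) rather than a single model.layers.0.mlp.down_proj. A quick way to check what the shards actually contain is to list the tensor keys with the safetensors library; a minimal sketch (the shard filename is illustrative):

```python
from safetensors import safe_open

# Illustrative shard name; use any *.safetensors file in the model directory.
with safe_open("model-00001-of-00004.safetensors", framework="pt") as f:
    for key in f.keys():
        # Print how the first layer's MLP weights are actually named.
        if "layers.0.mlp" in key:
            print(key)
```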

I got a NotImplementedError when trying to load with TabbyAPI:

2025-08-05 03:28:34.636 INFO:     Loading model:
/app/models/Qwen/Qwen3-30B-A3B-exl2
2025-08-05 03:28:34.637 INFO:     Loading with tensor parallel

Traceback (most recent call last):
  File "/app/main.py", line 181, in <module>
    entrypoint()
  File "/app/main.py", line 177, in entrypoint
    asyncio.run(entrypoint_async())
  File "/usr/lib/python3.12/asyncio/runners.py", line 194, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
  File "/app/main.py", line 61, in entrypoint_async
    await model.load_model(
  File "/app/common/model.py", line 226, in load_model
    async for _ in load_model_gen(model_path, **kwargs):
  File "/app/common/model.py", line 202, in load_model_gen
    async for module, modules in load_status:
  File "/app/backends/exllamav2/model.py", line 491, in load_gen
    async for value in iterate_in_threadpool(model_load_generator):
  File "/app/common/concurrency.py", line 30, in iterate_in_threadpool
    yield await asyncio.to_thread(gen_next, generator)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/threads.py", line 25, in to_thread
    return await loop.run_in_executor(None, func_call)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/common/concurrency.py", line 20, in gen_next
    return next(generator)
           ^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 36, in generator_context
    response = gen.send(None)
               ^^^^^^^^^^^^^^
  File "/app/backends/exllamav2/model.py", line 608, in load_model_sync
    for value in self.model.load_tp_gen(
  File "/opt/venv/lib/python3.12/site-packages/exllamav2/model.py", line 424, in load_tp_gen
    ms = module.scratch_space_tp()
         ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/exllamav2/module.py", line 50, in scratch_space_tp
    def scratch_space_tp(self): raise(NotImplementedError())
                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
NotImplementedError

ExLlamaV2 doesn't have a tensor-parallel implementation for MoE models. V3 does, though it's still in the dev branch. It should be merged very soon.
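Until that merge, a non-tensor-parallel load should work on the V2 backend; the TabbyAPI log above shows "Loading with tensor parallel", so turning that option off in TabbyAPI's config should sidestep the unimplemented scratch_space_tp() path. For reference, a minimal sketch of a non-TP load with the exllamav2 Python API (model path taken from the log above; cache settings are assumptions):

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Cache, ExLlamaV2Config, ExLlamaV2Tokenizer

# Path from the TabbyAPI log above; adjust to your model directory.
config = ExLlamaV2Config("/app/models/Qwen/Qwen3-30B-A3B-exl2")
model = ExLlamaV2(config)

# A lazy cache lets load_autosplit() spread layers across available GPUs
# without entering the tensor-parallel loader that raised NotImplementedError.
cache = ExLlamaV2Cache(model, lazy=True)
model.load_autosplit(cache)

tokenizer = ExLlamaV2Tokenizer(config)
```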
