Error on both Exllamav2 and ExLlamav2_HF in ooba/tgw
#1 opened by 2themaxx
I get an error loading this model in the ooba/tgw one-click install (commit ace8afb825c80925ed21ab26dbf66b538ab06285). Previous exl2 quants such as "turboderp/gemma-3-27b-it-exl2" still load fine, and turboderp/Qwen3-32b-ExL3 also loads fine on the same ooba commit.
...
line 483, in check_keys
raise ValueError(f" ## Could not find {prefix}.* in model")
ValueError: ## Could not find model.layers.0.mlp.down_proj.* in model
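The missing-key error is consistent with the loader expecting dense-MLP tensor names while this MoE checkpoint stores per-expert weights under different prefixes (if the bundled exllamav2 predates Qwen3-MoE support, that would explain it, though that's an assumption). A minimal sketch to inspect the actual tensor names in the downloaded shards, assuming a local model folder path:

```python
import glob
from safetensors import safe_open  # pip install safetensors

# Assumed local path to the quantized model folder
model_dir = "models/turboderp_Qwen3-30B-A3B-exl2"

for shard in glob.glob(f"{model_dir}/*.safetensors"):
    with safe_open(shard, framework="pt") as f:
        # Print the layer-0 MLP keys to see how the weights are actually named,
        # e.g. per-expert prefixes instead of a single mlp.down_proj
        for key in f.keys():
            if key.startswith("model.layers.0.mlp"):
                print(key)
```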
Got a NotImplementedError when trying to load with TabbyAPI:
2025-08-05 03:28:34.636 INFO: Loading model:
/app/models/Qwen/Qwen3-30B-A3B-exl2
2025-08-05 03:28:34.637 INFO: Loading with tensor parallel
Traceback (most recent call last):
File "/app/main.py", line 181, in <module>
entrypoint()
File "/app/main.py", line 177, in entrypoint
asyncio.run(entrypoint_async())
File "/usr/lib/python3.12/asyncio/runners.py", line 194, in run
return runner.run(main)
^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
return self._loop.run_until_complete(task)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
File "/app/main.py", line 61, in entrypoint_async
await model.load_model(
File "/app/common/model.py", line 226, in load_model
async for _ in load_model_gen(model_path, **kwargs):
File "/app/common/model.py", line 202, in load_model_gen
async for module, modules in load_status:
File "/app/backends/exllamav2/model.py", line 491, in load_gen
async for value in iterate_in_threadpool(model_load_generator):
File "/app/common/concurrency.py", line 30, in iterate_in_threadpool
yield await asyncio.to_thread(gen_next, generator)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/asyncio/threads.py", line 25, in to_thread
return await loop.run_in_executor(None, func_call)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/concurrent/futures/thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/common/concurrency.py", line 20, in gen_next
return next(generator)
^^^^^^^^^^^^^^^
File "/opt/venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 36, in generator_context
response = gen.send(None)
^^^^^^^^^^^^^^
File "/app/backends/exllamav2/model.py", line 608, in load_model_sync
for value in self.model.load_tp_gen(
File "/opt/venv/lib/python3.12/site-packages/exllamav2/model.py", line 424, in load_tp_gen
ms = module.scratch_space_tp()
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/venv/lib/python3.12/site-packages/exllamav2/module.py", line 50, in scratch_space_tp
def scratch_space_tp(self): raise(NotImplementedError())
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
NotImplementedError
ExLlamaV2 doesn't have a tensor-parallel implementation for MoE models. V3 does, though it's still in the dev branch. It should be merged very soon.
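Until that merge lands, one workaround is to load without tensor parallelism; in TabbyAPI that presumably means turning off the tensor-parallel setting that produced the "Loading with tensor parallel" line in the log above. A minimal sketch of the non-TP path using exllamav2's autosplit loader directly (model path taken from the log above):

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Cache, ExLlamaV2Config

# Path from the TabbyAPI log above
config = ExLlamaV2Config("/app/models/Qwen/Qwen3-30B-A3B-exl2")
model = ExLlamaV2(config)

# A lazy cache lets load_autosplit allocate it layer by layer during loading
cache = ExLlamaV2Cache(model, lazy=True)

# Sequential layer split across available GPUs -- no tensor parallelism,
# so the unimplemented scratch_space_tp() path is never reached
model.load_autosplit(cache, progress=True)
```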