TheBloke committed
Commit 4b9ae88 · 1 Parent(s): 6f59f78

Update README.md

Files changed (1):
  1. README.md +77 -5
README.md CHANGED
@@ -30,11 +30,17 @@ It is the result of quantising to 4bit using [GPTQ-for-LLaMa](https://github.com

**This is an experimental new GPTQ which offers up to 8K context size**

- The increased context is currently only tested to work with [ExLlama](https://github.com/turboderp/exllama), via the latest release of [text-generation-webui](https://github.com/oobabooga/text-generation-webui).
+ The increased context is tested to work with [ExLlama](https://github.com/turboderp/exllama), via the latest release of [text-generation-webui](https://github.com/oobabooga/text-generation-webui).
+
+ It has also been tested from Python code using AutoGPTQ, and `trust_remote_code=True`.
+
+ Code credits:
+ - Original concept and code for increasing context length: [kaiokendev](https://huggingface.co/kaiokendev)
+ - Updated Llama modelling code that includes this automatically via trust_remote_code: [emozilla](https://huggingface.co/emozilla).

Please read carefully below to see how to use it.

- **NOTE**: Using the full 8K context will exceed 24GB VRAM.
+ **NOTE**: Using the full 8K context on a 30B model will exceed 24GB VRAM.

GGML versions are not yet provided, as there is not yet support for SuperHOT in llama.cpp. This is being investigated and will hopefully come soon.

@@ -46,7 +52,7 @@ GGML versions are not yet provided, as there is not yet support for SuperHOT in

GGML quants are not yet provided, as there is not yet support for SuperHOT in llama.cpp. This is being investigated and will hopefully come soon.

- ## How to easily download and use this model in text-generation-webui
+ ## How to easily download and use this model in text-generation-webui with ExLlama

Please make sure you're using the latest version of text-generation-webui

@@ -62,9 +68,75 @@ Please make sure you're using the latest version of text-generation-webui
10. The model will automatically load, and is now ready for use!
11. Once you're ready, click the **Text Generation tab** and enter a prompt to get started!

- ## How to use this GPTQ model from Python code - TBC
-
- Using this model with increased context from Python code is currently untested, so this section is removed for now.
+ ## How to use this GPTQ model from Python code with AutoGPTQ
+
+ First make sure you have AutoGPTQ and Einops installed:
+
+ ```
+ pip3 install einops auto-gptq
+ ```
+
+ Then run the following code. Note that in order to get this to work, `config.json` has been hardcoded to a sequence length of 8192.
+
+ If you want to try 4096 instead to reduce VRAM usage, please manually edit `config.json` to set `max_position_embeddings` to the value you want.
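
For illustration, running at 4096 context means the only field that needs to change in `config.json` is this one (leave the rest of the file as shipped):

```json
"max_position_embeddings": 4096
```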
+
+ ```python
+ from transformers import AutoTokenizer, pipeline, logging
+ from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
+ import argparse
+
+ model_name_or_path = "TheBloke/chronos-33b-superhot-8k-GPTQ"
+ model_basename = "chronos-33b-superhot-8k-GPTQ-4bit--1g.act.order"
+
+ use_triton = False
+
+ tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)
+
+ model = AutoGPTQForCausalLM.from_quantized(model_name_or_path,
+         model_basename=model_basename,
+         use_safetensors=True,
+         trust_remote_code=True,
+         device_map='auto',
+         use_triton=use_triton,
+         quantize_config=None)
+
+ model.seqlen = 8192
+
+ # Note: check the prompt template is correct for this model.
+ prompt = "Tell me about AI"
+ prompt_template=f'''USER: {prompt}
+ ASSISTANT:'''
+
+ print("\n\n*** Generate:")
+
+ input_ids = tokenizer(prompt_template, return_tensors='pt').input_ids.cuda()
+ output = model.generate(inputs=input_ids, temperature=0.7, max_new_tokens=512)
+ print(tokenizer.decode(output[0]))
+
+ # Inference can also be done using transformers' pipeline
+
+ # Prevent printing spurious transformers error when using pipeline with AutoGPTQ
+ logging.set_verbosity(logging.CRITICAL)
+
+ print("*** Pipeline:")
+ pipe = pipeline(
+     "text-generation",
+     model=model,
+     tokenizer=tokenizer,
+     max_new_tokens=512,
+     temperature=0.7,
+     top_p=0.95,
+     repetition_penalty=1.15
+ )
+
+ print(pipe(prompt_template)[0]['generated_text'])
+ ```
+
+ ## Using other UIs: monkey patch
+
+ Provided in the repo is `llama_rope_scaled_monkey_patch.py`, written by @kaiokendev.
+
+ It can theoretically be added to any Python UI or custom code to enable the same result as `trust_remote_code=True`. I have not tested this, and it should be superseded by using `trust_remote_code=True`, but I include it for completeness and for interest.
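
As a rough sketch of how such a monkey patch is typically wired in (the entry-point name below is an assumption, not confirmed here; check `llama_rope_scaled_monkey_patch.py` for the function it actually exposes), it is imported and called before the model is loaded:

```python
# Sketch only: assumes the patch file exposes a function named
# replace_llama_rope_with_scaled_rope(); verify the real name in
# llama_rope_scaled_monkey_patch.py before relying on this.
from llama_rope_scaled_monkey_patch import replace_llama_rope_with_scaled_rope

# Apply the patch first, so the scaled RoPE implementation replaces the
# standard Llama rotary embedding inside transformers.
replace_llama_rope_with_scaled_rope()

# ...then load the model as usual, e.g. with AutoGPTQ as in the example above,
# in which case trust_remote_code=True is no longer needed.
```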

## Provided files