TheBloke committed
Commit 4b9ae88 · 1 Parent(s): 6f59f78

Update README.md

Files changed (1):
  1. README.md +77 -5
README.md CHANGED
@@ -30,11 +30,17 @@ It is the result of quantising to 4bit using [GPTQ-for-LLaMa](https://github.com

**This is an experimental new GPTQ which offers up to 8K context size**

- The increased context is currently only tested to work with [ExLlama](https://github.com/turboderp/exllama), via the latest release of [text-generation-webui](https://github.com/oobabooga/text-generation-webui).
+ The increased context is tested to work with [ExLlama](https://github.com/turboderp/exllama), via the latest release of [text-generation-webui](https://github.com/oobabooga/text-generation-webui).
+
+ It has also been tested from Python code using AutoGPTQ, and `trust_remote_code=True`.
+
+ Code credits:
+ - Original concept and code for increasing context length: [kaiokendev](https://huggingface.co/kaiokendev)
+ - Updated Llama modelling code that includes this automatically via trust_remote_code: [emozilla](https://huggingface.co/emozilla).

Please read carefully below to see how to use it.

- **NOTE**: Using the full 8K context will exceed 24GB VRAM.
+ **NOTE**: Using the full 8K context on a 30B model will exceed 24GB VRAM.

GGML versions are not yet provided, as there is not yet support for SuperHOT in llama.cpp. This is being investigated and will hopefully come soon.

@@ -46,7 +52,7 @@ GGML versions are not yet provided, as there is not yet support for SuperHOT in

GGML quants are not yet provided, as there is not yet support for SuperHOT in llama.cpp. This is being investigated and will hopefully come soon.

- ## How to easily download and use this model in text-generation-webui
+ ## How to easily download and use this model in text-generation-webui with ExLlama

Please make sure you're using the latest version of text-generation-webui

@@ -62,9 +68,75 @@ Please make sure you're using the latest version of text-generation-webui
10. The model will automatically load, and is now ready for use!
11. Once you're ready, click the **Text Generation tab** and enter a prompt to get started!

- ## How to use this GPTQ model from Python code - TBC
-
- Using this model with increased context from Python code is currently untested, so this section is removed for now.
+ ## How to use this GPTQ model from Python code with AutoGPTQ
+
+ First make sure you have AutoGPTQ and Einops installed:
+
+ ```
+ pip3 install einops auto-gptq
+ ```
+
+ Then run the following code. Note that in order to get this to work, `config.json` has been hardcoded to a sequence length of 8192.
+
+ If you want to try 4096 instead to reduce VRAM usage, please manually edit `config.json` to set `max_position_embeddings` to the value you want.
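
For illustration, running at 4096 context means the only field that needs to change in `config.json` is this one (leave the rest of the file as shipped):

```json
"max_position_embeddings": 4096
```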
+
+ ```python
+ from transformers import AutoTokenizer, pipeline, logging
+ from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
+ import argparse
+
+ model_name_or_path = "TheBloke/chronos-33b-superhot-8k-GPTQ"
+ model_basename = "chronos-33b-superhot-8k-GPTQ-4bit--1g.act.order"
+
+ use_triton = False
+
+ tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)
+
+ model = AutoGPTQForCausalLM.from_quantized(model_name_or_path,
+         model_basename=model_basename,
+         use_safetensors=True,
+         trust_remote_code=True,
+         device_map='auto',
+         use_triton=use_triton,
+         quantize_config=None)
+
+ model.seqlen = 8192
+
+ # Note: check the prompt template is correct for this model.
+ prompt = "Tell me about AI"
+ prompt_template=f'''USER: {prompt}
+ ASSISTANT:'''
+
+ print("\n\n*** Generate:")
+
+ input_ids = tokenizer(prompt_template, return_tensors='pt').input_ids.cuda()
+ output = model.generate(inputs=input_ids, temperature=0.7, max_new_tokens=512)
+ print(tokenizer.decode(output[0]))
+
+ # Inference can also be done using transformers' pipeline
+
+ # Prevent printing spurious transformers error when using pipeline with AutoGPTQ
+ logging.set_verbosity(logging.CRITICAL)
+
+ print("*** Pipeline:")
+ pipe = pipeline(
+     "text-generation",
+     model=model,
+     tokenizer=tokenizer,
+     max_new_tokens=512,
+     temperature=0.7,
+     top_p=0.95,
+     repetition_penalty=1.15
+ )
+
+ print(pipe(prompt_template)[0]['generated_text'])
+ ```
+
+ ## Using other UIs: monkey patch
+
+ Provided in the repo is `llama_rope_scaled_monkey_patch.py`, written by @kaiokendev.
+
+ It can theoretically be added to any Python UI or custom code to enable the same result as `trust_remote_code=True`. I have not tested this, and it should be superseded by using `trust_remote_code=True`, but I include it for completeness and for interest.
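
As a rough sketch of how such a monkey patch is typically wired in (the entry-point name below is an assumption, not confirmed here; check `llama_rope_scaled_monkey_patch.py` for the function it actually exposes), it is imported and called before the model is loaded:

```python
# Sketch only: assumes the patch file exposes a function named
# replace_llama_rope_with_scaled_rope(); verify the real name in
# llama_rope_scaled_monkey_patch.py before relying on this.
from llama_rope_scaled_monkey_patch import replace_llama_rope_with_scaled_rope

# Apply the patch first, so the scaled RoPE implementation replaces the
# standard Llama rotary embedding inside transformers.
replace_llama_rope_with_scaled_rope()

# ...then load the model as usual, e.g. with AutoGPTQ as in the example above,
# in which case trust_remote_code=True is no longer needed.
```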

## Provided files