|
--- |
|
license: apache-2.0 |
|
datasets: |
|
- ystemsrx/Bad_Data_Alpaca |
|
language: |
|
- zh |
|
library_name: transformers |
|
pipeline_tag: text-generation
|
tags: |
|
- Qwen |
|
- causal-lm |
|
- fine-tuned |
|
- ethics |
|
- Chinese |
|
--- |
|
|
|
[中文](README.zh.md) |
|
|
|
# Qwen2-Boundless |
|
|
|
## Overview |
|
|
|
Qwen2-Boundless is a fine-tuned model based on Qwen2-1.5B-Instruct, designed to answer questions of any kind, including those involving ethically sensitive, illegal, pornographic, or violent content. It was trained on a dataset built to cover such complex and diverse scenarios. Note that the fine-tuning data is entirely in Chinese, so the model performs noticeably better in Chinese than in other languages.
|
|
|
> **Warning**: This model is intended for research and testing purposes only. Users should comply with local laws and regulations and are responsible for their actions. |
|
|
|
## How to Use |
|
|
|
You can load and use the model with the following code: |
|
|
|
```python |
|
from transformers import AutoModelForCausalLM, AutoTokenizer
import os

device = "cuda"  # the device to load the model onto
current_directory = os.path.dirname(os.path.abspath(__file__))

model = AutoModelForCausalLM.from_pretrained(
    current_directory,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(current_directory)

prompt = "Hello?"
messages = [
    {"role": "system", "content": ""},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(device)

generated_ids = model.generate(
    model_inputs.input_ids,
    max_new_tokens=512
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
|
``` |
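
The snippet above loads the weights from the directory containing the script. If you prefer to pull the model directly from the Hugging Face Hub, the same code works with a repository id instead of a local path; the id below assumes the model is published under the same name as its GitHub repository and may need adjusting:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical Hub id; replace with the actual repository id of this model.
model_id = "ystemsrx/Qwen2-Boundless"

model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)
```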
|
|
|
### Continuous Conversation |
|
|
|
To enable continuous conversation, use the following code: |
|
|
|
```python |
|
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import os

device = "cuda"  # the device to load the model onto

# Get the current script's directory
current_directory = os.path.dirname(os.path.abspath(__file__))

model = AutoModelForCausalLM.from_pretrained(
    current_directory,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(current_directory)

messages = [
    {"role": "system", "content": ""}
]

while True:
    # Get user input
    user_input = input("User: ")

    # Add user input to the conversation
    messages.append({"role": "user", "content": user_input})

    # Prepare the input text
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    model_inputs = tokenizer([text], return_tensors="pt").to(device)

    # Generate a response
    generated_ids = model.generate(
        model_inputs.input_ids,
        max_new_tokens=512
    )
    generated_ids = [
        output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
    ]

    # Decode and print the response
    response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
    print(f"Assistant: {response}")

    # Add the generated response to the conversation
    messages.append({"role": "assistant", "content": response})
|
``` |
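
Because every turn is appended to `messages`, a long session will eventually exceed the model's context window. A minimal sketch for keeping only the most recent turns is shown below; the cut-off of 20 messages is an arbitrary illustrative value, not a tuned recommendation:

```python
MAX_MESSAGES = 20  # illustrative limit; tune to your context budget

def trim_history(messages, max_messages=MAX_MESSAGES):
    """Keep the system message plus only the most recent turns."""
    if len(messages) <= max_messages:
        return messages
    return [messages[0]] + messages[-(max_messages - 1):]

# Inside the loop, call this before building the prompt:
# messages = trim_history(messages)
```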
|
|
|
### Streaming Response |
|
|
|
For applications requiring streaming responses, use the following code: |
|
|
|
```python |
|
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer
from transformers.trainer_utils import set_seed
from threading import Thread
import random
import os

DEFAULT_CKPT_PATH = os.path.dirname(os.path.abspath(__file__))


def _load_model_tokenizer(checkpoint_path, cpu_only):
    tokenizer = AutoTokenizer.from_pretrained(checkpoint_path, resume_download=True)

    device_map = "cpu" if cpu_only else "auto"

    model = AutoModelForCausalLM.from_pretrained(
        checkpoint_path,
        torch_dtype="auto",
        device_map=device_map,
        resume_download=True,
    ).eval()
    model.generation_config.max_new_tokens = 512  # For chat.

    return model, tokenizer


def _get_input() -> str:
    while True:
        try:
            message = input('User: ').strip()
        except UnicodeDecodeError:
            print('[ERROR] Encoding error in input')
            continue
        except KeyboardInterrupt:
            exit(1)
        if message:
            return message
        print('[ERROR] Query is empty')


def _chat_stream(model, tokenizer, query, history):
    conversation = [
        {'role': 'system', 'content': ''},
    ]
    for query_h, response_h in history:
        conversation.append({'role': 'user', 'content': query_h})
        conversation.append({'role': 'assistant', 'content': response_h})
    conversation.append({'role': 'user', 'content': query})
    inputs = tokenizer.apply_chat_template(
        conversation,
        add_generation_prompt=True,
        return_tensors='pt',
    )
    inputs = inputs.to(model.device)
    streamer = TextIteratorStreamer(tokenizer=tokenizer, skip_prompt=True, timeout=60.0, skip_special_tokens=True)
    generation_kwargs = dict(
        input_ids=inputs,
        streamer=streamer,
    )
    thread = Thread(target=model.generate, kwargs=generation_kwargs)
    thread.start()

    for new_text in streamer:
        yield new_text


def main():
    checkpoint_path = DEFAULT_CKPT_PATH
    seed = random.randint(0, 2**32 - 1)  # Generate a random seed
    set_seed(seed)  # Set the random seed
    cpu_only = False

    history = []

    model, tokenizer = _load_model_tokenizer(checkpoint_path, cpu_only)

    while True:
        query = _get_input()

        print(f"\nUser: {query}")
        print("\nAssistant: ", end="")
        try:
            partial_text = ''
            for new_text in _chat_stream(model, tokenizer, query, history):
                print(new_text, end='', flush=True)
                partial_text += new_text
            print()
            history.append((query, partial_text))

        except KeyboardInterrupt:
            print('Generation interrupted')
            continue


if __name__ == "__main__":
    main()
|
``` |
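
The demo above hard-codes `cpu_only = False`. If you want to switch devices from the command line, a small argparse wrapper is one option; this is a sketch under that assumption, and the `--cpu-only` flag is hypothetical rather than part of the original script:

```python
import argparse

def _parse_args():
    # Minimal command-line interface for the streaming demo (illustrative only).
    parser = argparse.ArgumentParser(description="Streaming chat demo")
    parser.add_argument("--cpu-only", action="store_true",
                        help="Run the model on CPU instead of GPU")
    return parser.parse_args()

# In main(), replace the hard-coded value with:
# args = _parse_args()
# cpu_only = args.cpu_only
```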
|
|
|
## Dataset

The Qwen2-Boundless model was fine-tuned on a dataset named `bad_data.json`, which contains a wide range of text covering topics such as ethics, law, pornography, and violence. Because the fine-tuning data is entirely in Chinese, the model performs better in Chinese. If you are interested in exploring or using this dataset, you can find it via the following link:

- [bad_data.json Dataset](https://huggingface.co/datasets/ystemsrx/Bad_Data_Alpaca)

We also used some cybersecurity-related data obtained from [this file](https://github.com/Clouditera/SecGPT/blob/main/secgpt-mini/%E5%A4%A7%E6%A8%A1%E5%9E%8B%E5%9B%9E%E7%AD%94%E9%9D%A2%E9%97%AE%E9%A2%98-cot.txt).
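
If you want to inspect the data programmatically, it can be loaded with the `datasets` library. The snippet below is a minimal sketch; the field names follow the usual Alpaca layout (`instruction`, `input`, `output`) and are an assumption about this particular file:

```python
from datasets import load_dataset

# Load the dataset directly from the Hugging Face Hub.
dataset = load_dataset("ystemsrx/Bad_Data_Alpaca", split="train")

print(dataset)     # dataset size and column names
print(dataset[0])  # one record, assumed to use Alpaca-style fields
```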
|
|
|
## GitHub Repository

For more details about the model and ongoing updates, please visit our GitHub repository:

- [GitHub: ystemsrx/Qwen2-Boundless](https://github.com/ystemsrx/Qwen2-Boundless)
|
|
|
## License

This model and dataset are open-sourced under the Apache 2.0 License.
|
|
|
## Disclaimer

All content provided by this model is intended for research and testing purposes only. The developers of this model accept no responsibility for any potential misuse. Users must comply with the relevant laws and regulations and bear full responsibility for their actions.