---
tags:
- npu
- amd
- llama3.1
- RyzenAI
---

This model is a fine-tuned version of [meta-llama/Meta-Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct), AWQ-quantized and converted to run on an [NPU-equipped Ryzen AI PC](https://github.com/amd/RyzenAI-SW/issues/18), for example one with a Ryzen 9 7940HS processor.

To set up Ryzen AI for LLMs on Windows 11, see [Running LLM on AMD NPU Hardware](https://www.hackster.io/gharada2013/running-llm-on-amd-npu-hardware-19322f).

The following sample assumes that the setup on the above page has been completed.

This model has only been tested with Ryzen AI on Windows 11. It does not work in Linux environments such as WSL.
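
Because the NPU runtime is Windows-only, a script can fail fast on unsupported platforms. A small optional guard along these lines (an illustration of my own, not part of the original sample):

```
import platform
import sys

# The Ryzen AI NPU runtime used below only works on native Windows 11,
# not under Linux or WSL, so exit early anywhere else.
if platform.system() != "Windows":
    sys.exit("Requires native Windows 11 with the Ryzen AI NPU runtime.")
```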

RoPE support is not yet complete, but perplexity has been confirmed to be lower than that of Llama 3.
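
The evaluation text and settings behind that comparison are not published here. As a rough illustration, next-token perplexity over a sample text could be measured along these lines — a minimal sketch assuming the model and tokenizer are loaded as in the Sample Script below and follow the Hugging Face causal-LM calling convention:

```
import torch
from torch.nn import functional as F

def perplexity(model, tokenizer, text, max_len=512):
    # Tokenize the evaluation text and truncate to a manageable length.
    ids = tokenizer(text, return_tensors="pt").input_ids[:, :max_len]
    with torch.no_grad():
        # Assumes the model returns HF-style outputs with a .logits field.
        logits = model(ids).logits
    # Shift so each position predicts the following token.
    shift_logits = logits[:, :-1, :].float()
    shift_labels = ids[:, 1:]
    loss = F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1))
    # Perplexity is the exponential of the mean cross-entropy.
    return torch.exp(loss).item()
```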

2024/07/30
- [Ryzen AI Software 1.2](https://ryzenai.docs.amd.com/en/latest/) has been released. Please note that this model is based on [Ryzen AI Software 1.1](https://ryzenai.docs.amd.com/en/1.1/index.html) and operation with 1.2 has not been confirmed.

### Setup
In a Windows command prompt:
```
conda activate ryzenai-transformers
<your_install_path>\RyzenAI-SW\example\transformers\setup.bat

pip install transformers==4.43.3
# Updating the Transformers library will cause the Llama 2 sample to stop working.
# If you want to run Llama 2, revert with pip install transformers==4.34.0.
pip install tokenizers==0.19.1
pip install -U "huggingface_hub[cli]"

huggingface-cli download dahara1/llama3.1-8b-Instruct-amd-npu --revision main --local-dir llama3.1-8b-Instruct-amd-npu

copy <your_ryzen_ai-sw_install_path>\RyzenAI-SW\example\transformers\models\llama2\modeling_llama_amd.py .

# Set up the runtime. See https://ryzenai.docs.amd.com/en/latest/runtime_setup.html
set XLNX_VART_FIRMWARE=<your_firmware_install_path>\voe-4.0-win_amd64\1x4.xclbin
set NUM_OF_DPU_RUNNERS=1

# Save the sample script below as llama3.1-test.py (UTF-8 encoded), then run it.
python llama3.1-test.py
```
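
Before launching the script, it can be worth confirming that the runtime variables set above are actually visible to Python. A quick sanity check (my addition for illustration, not part of the official setup):

```
import os

# Print the NPU runtime variables configured in the setup steps above.
for var in ("XLNX_VART_FIRMWARE", "NUM_OF_DPU_RUNNERS"):
    print(var, "=", os.environ.get(var, "<not set>"))
```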

### Sample Script

```
import torch
import psutil
import transformers
from transformers import AutoTokenizer, set_seed
import qlinear
import logging

set_seed(123)
transformers.logging.set_verbosity_error()
logging.disable(logging.CRITICAL)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
]

message_list = [
    "Who are you? ",
    # Japanese: "What is the name of the ship you are on? Reply entirely in Japanese, not English."
    "あなたの乗っている船の名前は何ですか?英語ではなく全て日本語だけを使って返事をしてください",
    # Chinese: "What is the most dangerous adventure you have experienced? Answer everything in Chinese, not English."
    "你经历过的最危险的冒险是什么?请用中文回答所有问题,不要用英文。",
    # French: "How fast does your ship go? Answer only in French, not English."
    "À quelle vitesse va votre bateau ? Veuillez répondre uniquement en français et non en anglais.",
    # Korean: "What do you like about that ship? Answer entirely in Korean, without using English."
    "당신은 그 배의 어디를 좋아합니까? 영어를 사용하지 않고 모두 한국어로 대답하십시오.",
    # German: "What would your ship's name be in German? Answer in German instead of English."
    "Wie würde Ihr Schiffsname auf Deutsch lauten? Bitte antwortet alle auf Deutsch statt auf Englisch.",
    # Taiwanese: "What is the most amazing treasure you have found? Answer only in Taiwanese and Traditional Chinese, not English."
    "您發現過的最令人驚奇的寶藏是什麼?請僅使用台語和繁體中文回答,不要使用英文。",
]


if __name__ == "__main__":
    # Pin the host threads to four CPU cores; the quantized layers run on the NPU.
    p = psutil.Process()
    p.cpu_affinity([0, 1, 2, 3])
    torch.set_num_threads(4)

    tokenizer = AutoTokenizer.from_pretrained("llama3.1-8b-Instruct-amd-npu")
    ckpt = r"llama3.1-8b-Instruct-amd-npu\llama3.1_8b_w_bit_4_awq_amd.pt"
    terminators = [
        tokenizer.eos_token_id,
        tokenizer.convert_tokens_to_ids("<|eot_id|>")
    ]
    model = torch.load(ckpt)
    model.eval()
    model = model.to(torch.bfloat16)

    # Move the AWQ-quantized linear layers to the NPU (AIE) and prepare their weights.
    for n, m in model.named_modules():
        if isinstance(m, qlinear.QLinearPerGrp):
            print(f"Preparing weights of layer : {n}")
            m.device = "aie"
            m.quantize_weights()

    print("system: " + messages[0]['content'])

    for i in range(len(message_list)):
        messages.append({"role": "user", "content": message_list[i]})
        print("user: " + message_list[i])

        inputs = tokenizer.apply_chat_template(
            messages,
            add_generation_prompt=True,
            return_tensors="pt",
            return_dict=True
        )

        outputs = model.generate(
            inputs['input_ids'],
            max_new_tokens=600,
            eos_token_id=terminators,
            attention_mask=inputs['attention_mask'],
            do_sample=True,
            temperature=0.6,
            top_p=0.9)

        # Decode only the newly generated tokens, not the prompt.
        response = outputs[0][inputs['input_ids'].shape[-1]:]
        response_message = tokenizer.decode(response, skip_special_tokens=True)
        print("assistant: " + response_message)
        # Record the reply as an assistant turn so the chat template stays valid.
        messages.append({"role": "assistant", "content": response_message})
```
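
Since `messages` accumulates every user prompt and assistant reply, each of the seven turns re-sends the full conversation, which is what keeps the pirate persona consistent across languages; for much longer sessions you may want to trim old turns to keep the prompt length down.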

![chat_image](llama-3.1.png)

## Acknowledgements
- [amd/RyzenAI-SW](https://github.com/amd/RyzenAI-SW)
  Sample code and drivers.
- [mit-han-lab/llm-awq](https://github.com/mit-han-lab/llm-awq)
  Thanks for the AWQ quantization method.
- [meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct)
  [Built with Meta Llama 3](https://llama.meta.com/llama3/license/)