	---
	language:
	- en
	- ko
	license: llama3.1
	library_name: transformers
	base_model:
	- meta-llama/Meta-Llama-3.1-405B
	---

	<a href="https://github.com/MLP-Lab/Bllossom">
	<img src="https://cdn-uploads.huggingface.co/production/uploads/64a90711c05da19ca834f690/a0VE5UCY1HCEhaHtp3mGa.png" alt="image" width="30%" height="30%">
	</a>

	# Update!
	* [2024.08.08] The preview model has been released for the first time. Training is still in progress on a cluster of about 120 A100 GPUs, and the model will continue to be updated.


	# Bllossom \| [Demo]() \| [Homepage](https://www.bllossom.ai/) \| [Github](https://github.com/MLP-Lab/Bllossom) \|

	<!-- [Colab code example for GPU](https://colab.research.google.com/drive/1fBOzUVZ6NRKk_ugeoTbAOokWKqSN47IG?usp=sharing) \| -->
	<!-- [Colab code example for the quantized model on CPU](https://colab.research.google.com/drive/129ZNVg5R2NPghUEFHKF0BRdxsZxinQcJ?usp=drive_link) -->

	```bash
	The Bllossom team is releasing Bllossom-405B, a Korean-English bilingual language model based on the open-source Llama 3.1. It strengthens the connection of knowledge between Korean and English.
	This release, Bllossom3.1-405B, is a preview version with the following features:
	- Korean performance is improved by 5-10% compared to Llama 3.1-405B-Inst (single-turn evaluation).
	- It is a complete bilingual model that does not compromise the English performance of Llama 3.1.
	- It generates more natural and friendly Korean sentences than existing models.
	- In human evaluations and GPT evaluations (MT-Bench, LogicKor score of 9, etc.), it performs similarly to or slightly below GPT-4.

	The model was built on the following collaborations:
	- The lightweight pre-training technique from the MLP Lab at Seoultech was applied.
	- Teddysum's refined instruction tuning and RAG technology were applied.
	- HP provided computing support.
	- The Oscar team at the Common Crawl Foundation actively provided data support.

	As always, this model is available for commercial use. With just six A100 GPUs, you can use Bllossom to build your own model. You no longer need GPT-4.
	If you are short on GPU resources, try the quantized model on three A100s or four A6000s. [Quantized model](https://huggingface.co/MLP-KTLim/llama-3.1-Korean-Bllossom-405B-gguf-Q4_K_M)

	1. Bllossom-8B is a free, practically oriented language model built in collaboration with linguists from Seoultech, Teddysum, and the Language Resources Lab at Yonsei University. It has been maintained through continuous updates since 2023. Please make good use of it 🙂
	2. We also have the very powerful Advanced-Bllossom model and a vision-language model. (Contact us individually if you are interested.)
	3. Bllossom was presented at NAACL 2024 and LREC-COLING 2024 (oral).
	4. We will keep releasing good language models. Anyone who wants to do joint research on strengthening Korean (especially toward papers) is always welcome.
	Teams that can lend even a small number of GPUs are also welcome to contact us at any time. We will help you build what you want.
	```
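
The hardware guidance above (A100s for the full model, fewer GPUs for the quantized one) can be put in context with a back-of-the-envelope estimate of weight memory. This is a rough sketch, not a measurement: the parameter count is taken from the "405B" in the model name, the figure of about 4.8 bits per parameter for Q4_K_M is an assumption, and the helper `weight_memory_gb` is hypothetical. It counts weights only and ignores the KV cache, activations, and framework overhead.

```python
# Rough weight-only memory estimate. Every number here is an approximation.
PARAMS = 405e9  # nominal parameter count implied by the "405B" model name

# Assumed storage cost per parameter, in bits.
BITS_PER_PARAM = {
    "bfloat16": 16,            # matches torch_dtype=torch.bfloat16
    "Q4_K_M (assumed)": 4.8,   # assumption: rough average for this GGUF format
}


def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Return the memory needed for the weights alone, in GB (1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


for name, bits in BITS_PER_PARAM.items():
    print(f"{name:>18}: ~{weight_memory_gb(PARAMS, bits):,.0f} GB")
```

Under these assumptions the bfloat16 weights alone come to about 810 GB. Compare the totals with the combined memory of the GPUs you have: with `device_map="auto"`, layers that do not fit on the GPUs may be placed in CPU memory, which is much slower.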

	This model was developed by [MLPLab at Seoultech](http://mlp.seoultech.ac.kr), [Teddysum](http://teddysum.ai/), and [Yonsei Univ](https://sites.google.com/view/hansaemkim/hansaem-kim).

	## Example code

	### Colab Tutorial
	- [Inference-Code-Link](https://colab.research.google.com/drive/1fBOzUVZ6NRKk_ugeoTbAOokWKqSN47IG?usp=sharing)

	### Install Dependencies
	```bash
	pip install torch "transformers>=4.43.0" accelerate
	```
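
Before loading a checkpoint of this size, it can help to confirm which versions of the packages from the install command are actually present in the environment. The following is a minimal sketch using only the Python standard library; the helper name `installed_versions` is hypothetical and not part of any of these packages.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each package name to its installed version string, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Packages named in the install command above.
report = installed_versions(["torch", "transformers", "accelerate"])
for name, installed in report.items():
    print(f"{name}: {installed or 'not installed'}")
```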

	### Python code with Pipeline
	```python
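	import transformers
	import torch

	model_id = "Bllossom/llama-3.1-Korean-Bllossom-405B"

	# Load the model in bfloat16 and let accelerate place it across the available devices.
	pipeline = transformers.pipeline(
	    "text-generation",
	    model=model_id,
	    model_kwargs={"torch_dtype": torch.bfloat16},
	    device_map="auto",
	)

	# Make sure the model is in evaluation mode (dropout disabled).
	pipeline.model.eval()

	# Bilingual system prompt; the Korean sentences repeat the English ones.
	PROMPT = '''You are a helpful AI assistant. Please answer the user's questions kindly. 당신은 유능한 AI 어시스턴트 입니다. 사용자의 질문에 대해 친절하게 답변해주세요.'''
	# User request, in English: "Could you put together a tour of Seoul's famous sights?"
	instruction = "서울의 유명한 관광 코스를 만들어줄래?"

	messages = [
	    {"role": "system", "content": PROMPT},
	    {"role": "user", "content": instruction},
	]

	# Render the messages with the tokenizer's chat template, ending with an open assistant turn.
	prompt = pipeline.tokenizer.apply_chat_template(
	    messages,
	    tokenize=False,
	    add_generation_prompt=True,
	)

	# Stop on either the end-of-sequence token or the end-of-turn token.
	terminators = [
	    pipeline.tokenizer.eos_token_id,
	    pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>"),
	]

	outputs = pipeline(
	    prompt,
	    max_new_tokens=2048,
	    eos_token_id=terminators,
	    do_sample=True,
	    temperature=0.6,
	    top_p=0.9,
	)

	# The pipeline returns the prompt followed by the completion; print only the completion.
	print(outputs[0]["generated_text"][len(prompt):])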
	```
	```
	# 물론이죠! 서울은 다양한 문화와 역사, 자연을 겸비한 도시로, 많은 관광 명소를 자랑합니다. 여기 서울의 유명한 관광 코스를 소개해 드릴게요.

	### 코스 1: 역사와 문화 탐방

	1. 경복궁
	- 서울의 대표적인 궁궐로, 조선 왕조의 역사와 문화를 체험할 수 있는 곳입니다.

	2. 북촌 한옥마을
	- 전통 한옥이 잘 보존된 마을로, 조선시대의 생활상을 느낄 수 있습니다.

	...
	```
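
To see why the pipeline example passes `<|eot_id|>` as an extra terminator, it helps to look at the general shape of the string that `apply_chat_template` builds. The function below is a hand-written approximation of the Llama 3 style chat layout, not the template shipped with this model: the real template is stored with the tokenizer and may add further text, and the name `format_llama3_prompt` is hypothetical. Treat it as an illustration only.

```python
def format_llama3_prompt(messages, add_generation_prompt=True):
    """Approximate the Llama 3 style chat layout for a list of role/content dicts."""
    parts = ["<|begin_of_text|>"]
    for message in messages:
        # Each turn is a role header, a blank line, the content, then <|eot_id|>.
        parts.append(
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content'].strip()}<|eot_id|>"
        )
    if add_generation_prompt:
        # Open an assistant turn so the model writes the reply next.
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Plan a tour of Seoul."},
]
prompt = format_llama3_prompt(messages)
print(prompt)
```

Every completed turn ends with `<|eot_id|>`, so the model is expected to emit the same token when its own reply is finished. Listing it in `eos_token_id` lets generation stop at that point.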

	## Supported by

	- Hewlett Packard (HP) Enterprise <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/4/46/Hewlett_Packard_Enterprise_logo.svg/2880px-Hewlett_Packard_Enterprise_logo.svg.png" width="20%" height="20%">
	- Common Crawl <img src="https://cdn.prod.website-files.com/6479b8d98bf5dcb4a69c4f31/649b5869af56f6df617cfb1f_CC_Logo_Blue_Auto.svg" width="20%" height="20%">
	- AICA <img src="https://aica-gj.kr/images/logo.png" width="20%" height="20%">

	## Citation
	Language Model
	```text
	@misc{bllossom,
	author = {ChangSu Choi and Yongbin Jeong and Seoyoon Park and InHo Won and HyeonSeok Lim and SangMin Kim and Yejee Kang and Chanhyuk Yoon and Jaewan Park and Yiseul Lee and HyeJin Lee and Younggyun Hahm and Hansaem Kim and KyungTae Lim},
	title = {Optimizing Language Augmentation for Multilingual Large Language Models: A Case Study on Korean},
	year = {2024},
	journal = {LREC-COLING 2024},
	paperLink = {\url{https://arxiv.org/pdf/2403.10882}},
	}
	```

	Vision-Language Model
	```text
	@misc{bllossom-V,
	author = {Dongjae Shin and Hyunseok Lim and Inho Won and Changsu Choi and Minjun Kim and Seungwoo Song and Hangyeol Yoo and Sangmin Kim and Kyungtae Lim},
	title = {X-LLaVA: Optimizing Bilingual Large Vision-Language Alignment},
	year = {2024},
	publisher = {GitHub},
	journal = {NAACL 2024 findings},
	paperLink = {\url{https://arxiv.org/pdf/2403.11399}},
	}
	```

	## Contact
	- 임경태(KyungTae Lim), Professor at Seoultech. `[email protected]`
	- 함영균(Younggyun Hahm), CEO of Teddysum. `[email protected]`
	- 김한샘(Hansaem Kim), Professor at Yonsei. `[email protected]`

	## Contributors
	- 최창수(Chansu Choi), [email protected]
	- 김상민(Sangmin Kim), [email protected]
	- 원인호(Inho Won), [email protected]
	- 김민준(Minjun Kim), [email protected]
	- 송승우(Seungwoo Song), [email protected]
	- 신동재(Dongjae Shin), [email protected]
	- 임현석(Hyeonseok Lim), [email protected]
	- 육정훈(Jeonghun Yuk), [email protected]
	- 유한결(Hangyeol Yoo), [email protected]
	- 송서현(Seohyun Song), [email protected]