---
license: other
license_name: hyperclovax-seed
license_link: LICENSE
base_model:
  - naver-hyperclovax/HyperCLOVAX-SEED-Text-Instruct-1.5B
pipeline_tag: text-generation
library_name: transformers
---

![image/png](https://cdn-uploads.huggingface.co/production/uploads/65265ab8f8db96cffcb969dc/TSOdcOQ7qgu6ubVFMMo1R.png)

## Overview

HyperCLOVAX-SEED-Text-Instruct-1.5B is a text understanding and generation model developed by NAVER. It demonstrates competitive performance on major benchmarks related to the Korean language and culture, and it supports a context length of up to 16k tokens, enabling it to handle tasks that require long inputs.

## Basic Information

- Model Architecture: Transformer-based architecture (Dense Model)
- Number of Parameters: 1.5B
- Input/Output Format: Text / Text
- Context Length: 16k tokens (see the quick check below)
- Knowledge Cutoff Date: The model was trained on data prior to August 2024.
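
A quick way to confirm the context length and tokenizer setup is to read them from the checkpoint's configuration. The sketch below assumes the context window is exposed as `max_position_embeddings`; the exact field name can vary by architecture.

```python
from transformers import AutoConfig, AutoTokenizer

repo_id = "naver-hyperclovax/HyperCLOVAX-SEED-Text-Instruct-1.5B"

# Only the config and tokenizer files are downloaded here; no model weights are fetched.
config = AutoConfig.from_pretrained(repo_id)
tokenizer = AutoTokenizer.from_pretrained(repo_id)

# Many dense Transformer configs expose the context window as max_position_embeddings
# (assumption; verify the attribute name for this checkpoint).
print("context window:", getattr(config, "max_position_embeddings", "n/a"))
print("vocab size:", tokenizer.vocab_size)
```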


## Training and Data

The training data for HyperCLOVAX-SEED-Text-Instruct-1.5B consists of diverse sources, including high-quality datasets. The training process was carried out in four main stages:

- Pretraining Stage 1: the model learns from a large volume of documents.
- Pretraining Stage 2: additional training focused on high-quality data.
- Rejection Sampling Fine-Tuning (RFT): enhances the model's knowledge across various domains and its complex reasoning abilities.
- Supervised Fine-Tuning (SFT): improves the model's instruction-following capabilities.

Because smaller models are particularly vulnerable when handling long contexts, long-context understanding was reinforced from the pretraining stages through to the SFT stage, enabling the model to reliably support context lengths of up to 16k tokens.
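
To illustrate the RFT stage described above, the sketch below shows a generic rejection-sampling loop: sample several candidate responses per prompt, keep only those accepted by a scoring function, and reuse the accepted pairs as supervised training data. This is not NAVER's actual pipeline; `score_fn`, the prompts, and all hyperparameters are hypothetical placeholders.

```python
def build_rft_dataset(model, tokenizer, prompts, score_fn, n_samples=4, threshold=0.5):
    """Collect (prompt, response) pairs whose score clears a threshold.

    score_fn stands in for any verifier or reward model (hypothetical placeholder);
    in practice it could check answer correctness or apply a learned reward model.
    """
    kept = []
    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        prompt_len = inputs["input_ids"].shape[1]
        # Sample several candidate completions for the same prompt.
        outputs = model.generate(
            **inputs,
            do_sample=True,
            temperature=0.8,
            max_new_tokens=256,
            num_return_sequences=n_samples,
        )
        for seq in outputs:
            response = tokenizer.decode(seq[prompt_len:], skip_special_tokens=True)
            # Keep only responses accepted by the scoring function.
            if score_fn(prompt, response) >= threshold:
                kept.append({"prompt": prompt, "response": response})
    return kept
```

The accepted pairs would then be formatted with the chat template and used for a standard supervised fine-tuning pass.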

## Benchmark

| **Model**                              | **KMMLU (5-shot, acc)** | **HAE-RAE (5-shot, acc)** | **CLiCK (5-shot, acc)** | **KoBEST (5-shot, acc)** |
| --------------------------------------- | ----------------------- | ------------------------- | ----------------------- | ------------------------ |
| **HyperCLOVAX-SEED-Text-Base-1.5B**     | 0.4181                  | 0.6370                    | 0.5373                  | 0.6963                   |
| **HyperCLOVAX-SEED-Text-Instruct-1.5B** | 0.3933                  | 0.5674                    | 0.4947                  | 0.6490                   |
| **Qwen2.5-1.5B-instruct**               | 0.3696                  | 0.5160                    | 0.4772                  | 0.5968                   |
| **gemma-3-1b-it**                       | 0.3075                  | 0.3648                    | 0.3724                  | 0.5869                   |
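
The scores above can be reproduced with a standard evaluation harness. The sketch below uses EleutherAI's lm-evaluation-harness; the task names (`kmmlu`, `haerae`, `kobest`) are assumptions that depend on the installed harness version, and CLiCK may require an external task definition, so adjust as needed.

```python
# pip install lm-eval  (EleutherAI lm-evaluation-harness)
import lm_eval

# Task names and availability depend on the harness version (assumption).
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=naver-hyperclovax/HyperCLOVAX-SEED-Text-Instruct-1.5B",
    tasks=["kmmlu", "haerae", "kobest"],
    num_fewshot=5,
    batch_size=8,
)
print(results["results"])
```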


## Hugging Face Usage Example

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the checkpoint from the Hub (a local path to a downloaded copy also works).
model_name = "naver-hyperclovax/HyperCLOVAX-SEED-Text-Instruct-1.5B"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Chat messages (translated from the original Korean example).
chat = [
  {"role": "tool_list", "content": ""},
  {"role": "system", "content": "- The AI language model is named \"CLOVA X\" and was created by NAVER.\n- Today is Thursday, April 24, 2025."},
  {"role": "user", "content": "Explain the relationship between the Schrödinger equation and quantum mechanics in as much detail as possible."},
]

# Apply the chat template and generate until one of the stop strings is produced.
inputs = tokenizer.apply_chat_template(chat, add_generation_prompt=True, return_dict=True, return_tensors="pt")
output_ids = model.generate(**inputs, max_length=1024, stop_strings=["<|endofturn|>", "<|stop|>"], tokenizer=tokenizer)
print(tokenizer.batch_decode(output_ids))
```
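
For interactive use, the same chat template works with token streaming. The sketch below reuses `model`, `tokenizer`, and `inputs` from the example above and simply swaps in a `TextStreamer` so tokens are printed as they are generated.

```python
from transformers import TextStreamer

# Print tokens as soon as they are generated; skip_prompt hides the echoed input.
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
model.generate(
    **inputs,
    streamer=streamer,
    max_new_tokens=512,
    stop_strings=["<|endofturn|>", "<|stop|>"],
    tokenizer=tokenizer,
)
```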