Stockmark-2-100B-Instruct-beta is a 100-billion-parameter large language model built from scratch, with a particular focus on Japanese. It was pre-trained on approximately 1.5 trillion tokens of data, consisting of 60% English, 30% Japanese, and 10% code. Following pretraining, the model underwent post-training with synthetic data in Japanese to enhance its ability to follow instructions. This synthetic data was generated using Qwen2.5-32B-Instruct.
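As a rough illustration of what that mix implies, the approximate per-source token counts can be estimated directly from the stated percentages (a back-of-envelope sketch only; the exact corpus composition is not detailed here):

total_tokens = 1.5e12  # ~1.5 trillion pretraining tokens
mix = {"English": 0.60, "Japanese": 0.30, "code": 0.10}
for source, share in mix.items():
    # English: ~0.90T, Japanese: ~0.45T, code: ~0.15T
    print(f"{source}: ~{total_tokens * share / 1e12:.2f}T tokens")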
As a beta release, Stockmark-2-100B-Instruct-beta is still undergoing improvement and evaluation. Feedback and insights from users will help refine future versions.
See our blog for details.
This project is supported by GENIAC.
The following example shows how to run chat-style inference with the Hugging Face transformers library:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the tokenizer and the model (bfloat16 weights, sharded across available GPUs).
tokenizer = AutoTokenizer.from_pretrained("stockmark/Stockmark-2-100B-Instruct-beta")
model = AutoModelForCausalLM.from_pretrained(
    "stockmark/Stockmark-2-100B-Instruct-beta", device_map="auto", torch_dtype=torch.bfloat16
)

# Build a chat-formatted prompt from a single user instruction.
instruction = "自然言語処理とは?"  # "What is natural language processing?"
input_ids = tokenizer.apply_chat_template(
    [{"role": "user", "content": instruction}], add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Generate a response with sampling.
with torch.inference_mode():
    tokens = model.generate(
        input_ids,
        max_new_tokens=512,
        do_sample=True,
        temperature=0.7,
        top_p=0.95,
        repetition_penalty=1.05,
    )

output = tokenizer.decode(tokens[0], skip_special_tokens=True)
print(output)
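For higher-throughput inference, the same checkpoint could in principle be served with vLLM. The sketch below is an illustrative example, not part of the official instructions; it assumes vLLM supports this model's architecture, and the tensor_parallel_size value is a placeholder to be matched to your GPU count.

from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("stockmark/Stockmark-2-100B-Instruct-beta")
llm = LLM(model="stockmark/Stockmark-2-100B-Instruct-beta", tensor_parallel_size=8)

# Build the chat-formatted prompt as plain text, then sample with the same settings as above.
instruction = "自然言語処理とは?"  # "What is natural language processing?"
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": instruction}], add_generation_prompt=True, tokenize=False
)
params = SamplingParams(temperature=0.7, top_p=0.95, repetition_penalty=1.05, max_tokens=512)
outputs = llm.generate([prompt], params)
print(outputs[0].outputs[0].text)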
Author: Takahiro Omi