license: apache-2.0
library_name: transformers
tags:
- dpo
datasets:
- argilla/distilabel-intel-orca-dpo-pairs
base_model: sethuiyer/Chikuma_10.7B
pipeline_tag: text-generation
model-index:
- name: distilabled_Chikuma_10.7B
results:
- task:
type: text-generation
name: Text Generation
dataset:
name: AI2 Reasoning Challenge (25-Shot)
type: ai2_arc
config: ARC-Challenge
split: test
args:
num_few_shot: 25
metrics:
- type: acc_norm
value: 66.38
name: normalized accuracy
source:
url: >-
https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=sethuiyer/distilabled_Chikuma_10.7B
name: Open LLM Leaderboard
- task:
type: text-generation
name: Text Generation
dataset:
name: HellaSwag (10-Shot)
type: hellaswag
split: validation
args:
num_few_shot: 10
metrics:
- type: acc_norm
value: 85.14
name: normalized accuracy
source:
url: >-
https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=sethuiyer/distilabled_Chikuma_10.7B
name: Open LLM Leaderboard
- task:
type: text-generation
name: Text Generation
dataset:
name: MMLU (5-Shot)
type: cais/mmlu
config: all
split: test
args:
num_few_shot: 5
metrics:
- type: acc
value: 64.7
name: accuracy
source:
url: >-
https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=sethuiyer/distilabled_Chikuma_10.7B
name: Open LLM Leaderboard
- task:
type: text-generation
name: Text Generation
dataset:
name: TruthfulQA (0-shot)
type: truthful_qa
config: multiple_choice
split: validation
args:
num_few_shot: 0
metrics:
- type: mc2
value: 59.2
source:
url: >-
https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=sethuiyer/distilabled_Chikuma_10.7B
name: Open LLM Leaderboard
- task:
type: text-generation
name: Text Generation
dataset:
name: Winogrande (5-shot)
type: winogrande
config: winogrande_xl
split: validation
args:
num_few_shot: 5
metrics:
- type: acc
value: 79.4
name: accuracy
source:
url: >-
https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=sethuiyer/distilabled_Chikuma_10.7B
name: Open LLM Leaderboard
- task:
type: text-generation
name: Text Generation
dataset:
name: GSM8k (5-shot)
type: gsm8k
config: main
split: test
args:
num_few_shot: 5
metrics:
- type: acc
value: 58.38
name: accuracy
source:
url: >-
https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=sethuiyer/distilabled_Chikuma_10.7B
name: Open LLM Leaderboard
Chikuma_10.7B - V2 (Enhanced with DPO) [For Experiments]
This model is the DPO fine-tuned version of Chikuma_10.7B, which was itself a depth-upscaled merge.
The name "Chikuma" is inspired by the Chikuma River, the longest in Japan, known for its continuous flow and meandering path. This metaphorically represents the model's depth, fluidity, and adaptability in processing and understanding language.
Dataset used for Fine Tuning
Dataset: argilla/distilabel-intel-orca-dpo-pairs
The filtered dataset contained roughly 3,000 samples, but they were high quality (according to chosen_score).
The following filters were applied to the original dataset:
from datasets import load_dataset

# Load the preference pairs and keep only decisive, high-quality samples
dataset = load_dataset("argilla/distilabel-intel-orca-dpo-pairs", split="train")
dataset = dataset.filter(
    lambda r:
        r["status"] != "tie" and        # drop ties between chosen and rejected
        r["chosen_score"] >= 8 and      # keep only highly rated chosen answers
        not r["in_gsm8k_train"]         # avoid GSM8k train-set contamination
)
Chat Template
The chat template for Chikuma_10.7B - V2 is a modified version of ChatML, optimized for improved interaction and engagement:
<|im_start|>GPT4 Correct system:
{system} Always use <|end_of_turn|> when you want to end the answer. <|im_end|>
<|im_start|>GPT4 Correct user:
{user}<|im_end|>
<|im_start|>GPT4 Correct Assistant:
{assistant}<|im_end|>
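For reference, here is a minimal sketch of how a prompt could be assembled by hand with this template; the render_prompt helper is purely illustrative and not part of the released code. In practice, tokenizer.apply_chat_template (shown in the Usage section below) should produce the same structure from a list of messages.

# Illustrative helper (not part of the released code) that renders the
# modified-ChatML template shown above into a single prompt string.
def render_prompt(system: str, user: str) -> str:
    return (
        f"<|im_start|>GPT4 Correct system:\n"
        f"{system} Always use <|end_of_turn|> when you want to end the answer. <|im_end|>\n"
        f"<|im_start|>GPT4 Correct user:\n"
        f"{user}<|im_end|>\n"
        f"<|im_start|>GPT4 Correct Assistant:\n"
    )

prompt = render_prompt("You are a helpful assistant chatbot.", "Who invented LLMs?")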
Nous Benchmark Evaluation

| Model | AGIEval | GPT4All | TruthfulQA | Bigbench | Average |
|---|---|---|---|---|---|
| SynthIQ-7b | 42.67 | 73.71 | 56.51 | 44.59 | 54.37 |
| openchat/openchat-3.5-0106 | 44.17 | 73.72 | 52.53 | 44.40 | 53.71 |
| Chikuma_10.7B | 42.41 | 73.41 | 56.69 | 43.50 | 54.00 |
| Chikuma_10.7B_v2 | 42.77 | 73.81 | 58.83 | 44.83 | 55.06 |
OpenLLM Leaderboard

| Benchmark Name | Performance |
|---|---|
| ARC | 66.38 |
| HellaSwag | 85.00 |
| MMLU | 65.27 |
| TruthfulQA | 58.83 |
| Winogrande | 78.77 |
| GSM8K | 63.68 |
| Average | 69.65 |
Training Environment
- Hardware: A single A100 80GB GPU on RunPod, used for approximately 1.5 hours.
- Training Script: Available as a Google Colab notebook. Special thanks to mlabonne for providing the template. A rough sketch of the training setup is shown below.
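The exact notebook is not reproduced here, but a hypothetical sketch of a DPO run with trl looks roughly like this. The hyperparameters, the prompt/chosen/rejected column mapping, and the DPOTrainer keyword arguments (following the ~0.7.x trl API) are assumptions, not the released script.

# Hypothetical sketch of a DPO run with trl (~0.7.x API); the actual Colab
# notebook may differ in hyperparameters, prompt formatting, and argument names.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

base_model = "sethuiyer/Chikuma_10.7B"
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(base_model, torch_dtype=torch.bfloat16)

# Preference pairs; apply the quality filter shown earlier before training.
dataset = load_dataset("argilla/distilabel-intel-orca-dpo-pairs", split="train")

def to_dpo_format(r):
    # Column mapping is assumed; the actual notebook formats the prompt with the chat template.
    return {"prompt": r["input"], "chosen": r["chosen"], "rejected": r["rejected"]}

dataset = dataset.map(to_dpo_format, remove_columns=dataset.column_names)

training_args = TrainingArguments(
    output_dir="distilabled_Chikuma_10.7B",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=5e-6,
    num_train_epochs=1,
    bf16=True,
)

trainer = DPOTrainer(
    model,
    ref_model=None,            # a frozen reference copy is created internally
    args=training_args,
    beta=0.1,                  # strength of the KL penalty against the reference model
    train_dataset=dataset,
    tokenizer=tokenizer,
    max_prompt_length=1024,
    max_length=1536,
)
trainer.train()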
Usage
# Load tokenizer (the checkpoint name below is this repository)
import transformers
from transformers import AutoTokenizer

model_name = "sethuiyer/distilabled_Chikuma_10.7B"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Create pipeline
pipeline = transformers.pipeline(
    "text-generation",
    model=model_name,
    tokenizer=tokenizer,
    device="cuda",
)

# Format prompt with the chat template and generate text
messages = [
    {"role": "system", "content": "You are a helpful assistant chatbot."},
    {"role": "user", "content": "Who invented LLMs?"}
]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
sequences = pipeline(
    prompt,
    max_new_tokens=512
)
print(sequences[0]['generated_text'])
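Since the system prompt instructs the model to end answers with <|end_of_turn|>, it can also help to pass that token as an additional end-of-sequence id so generation stops there. This is a sketch; verify that <|end_of_turn|> is actually present in the tokenizer's vocabulary before relying on it.

# Stop generation at <|end_of_turn|> in addition to the default EOS token.
# Assumes the token exists in the tokenizer's vocabulary.
eot_id = tokenizer.convert_tokens_to_ids("<|end_of_turn|>")
sequences = pipeline(
    prompt,
    max_new_tokens=512,
    eos_token_id=[tokenizer.eos_token_id, eot_id],
)
print(sequences[0]["generated_text"])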
Acknowledgements
A heartfelt appreciation goes to the vibrant open-source community, particularly:
- The Intel team for publishing a great open dataset and showing how well it worked in the first place.
- Teknium and NousResearch for their awesome work and models.
- Maxime for sharing such great resources.
- Argilla for publishing argilla/distilabel-intel-orca-dpo-pairs
Open LLM Leaderboard Evaluation Results
Detailed results can be found here.

| Metric | Value |
|---|---|
| Avg. | 68.87 |
| AI2 Reasoning Challenge (25-Shot) | 66.38 |
| HellaSwag (10-Shot) | 85.14 |
| MMLU (5-Shot) | 64.70 |
| TruthfulQA (0-shot) | 59.20 |
| Winogrande (5-shot) | 79.40 |
| GSM8k (5-shot) | 58.38 |