---
license: apache-2.0
library_name: transformers
tags:
  - dpo
datasets:
  - argilla/distilabel-intel-orca-dpo-pairs
base_model: sethuiyer/Chikuma_10.7B
pipeline_tag: text-generation
model-index:
  - name: distilabled_Chikuma_10.7B
    results:
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: AI2 Reasoning Challenge (25-Shot)
          type: ai2_arc
          config: ARC-Challenge
          split: test
          args:
            num_few_shot: 25
        metrics:
          - type: acc_norm
            value: 66.38
            name: normalized accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=sethuiyer/distilabled_Chikuma_10.7B
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: HellaSwag (10-Shot)
          type: hellaswag
          split: validation
          args:
            num_few_shot: 10
        metrics:
          - type: acc_norm
            value: 85.14
            name: normalized accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=sethuiyer/distilabled_Chikuma_10.7B
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: MMLU (5-Shot)
          type: cais/mmlu
          config: all
          split: test
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 64.7
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=sethuiyer/distilabled_Chikuma_10.7B
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: TruthfulQA (0-shot)
          type: truthful_qa
          config: multiple_choice
          split: validation
          args:
            num_few_shot: 0
        metrics:
          - type: mc2
            value: 59.2
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=sethuiyer/distilabled_Chikuma_10.7B
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: Winogrande (5-shot)
          type: winogrande
          config: winogrande_xl
          split: validation
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 79.4
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=sethuiyer/distilabled_Chikuma_10.7B
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: GSM8k (5-shot)
          type: gsm8k
          config: main
          split: test
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 58.38
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=sethuiyer/distilabled_Chikuma_10.7B
          name: Open LLM Leaderboard
---

# Chikuma_10.7B - V2 (Enhanced with DPO) [For Experiments]


This model is the DPO fine-tuned version of Chikuma_10.7B, which was a depth-upscaled merge of SynthIQ-7b and openchat/openchat-3.5-0106 (the two baselines in the benchmark table below).

The name "Chikuma" is inspired by the Chikuma River, the longest river in Japan, known for its continuous flow and meandering path. This metaphorically represents the model's depth, fluidity, and adaptability in processing and understanding language.

## Dataset Used for Fine-Tuning

Dataset: argilla/distilabel-intel-orca-dpo-pairs

The filtered dataset contains roughly 3,000 samples, but they are high quality (according to `chosen_score`).

The following filters were applied to the original dataset:

```python
from datasets import load_dataset

# Keep only decisive, high-scoring pairs that are not in the GSM8K train split
dataset = load_dataset("argilla/distilabel-intel-orca-dpo-pairs", split="train")
dataset = dataset.filter(
    lambda r:
        r["status"] != "tie" and
        r["chosen_score"] >= 8 and
        not r["in_gsm8k_train"]
)
```
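
For DPO training, preference pairs are typically reshaped into a `prompt`/`chosen`/`rejected` schema. Below is a minimal sketch of that mapping; the `system` and `input` column names are assumptions based on the upstream dataset, not something this card specifies:

```python
# Hypothetical mapping into the prompt/chosen/rejected schema expected by
# DPO trainers; the "system" and "input" column names are assumed, not
# confirmed by this model card.
def to_dpo_format(r):
    return {
        "prompt": f"{r['system']}\n{r['input']}",
        "chosen": r["chosen"],
        "rejected": r["rejected"],
    }

dataset = dataset.map(to_dpo_format, remove_columns=dataset.column_names)
```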

## Chat Template

The chat template for Chikuma_10.7B - V2 is a modified version of ChatML, optimized for improved interaction and engagement:

```
<|im_start|>GPT4 Correct system:
{system} Always use <|end_of_turn|> when you want to end the answer. <|im_end|>
<|im_start|>GPT4 Correct user:
{user}<|im_end|>
<|im_start|>GPT4 Correct Assistant:
{assistant}<|im_end|>
```
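
For illustration, here is a minimal sketch of building a single-turn prompt in this format by hand. The helper name `format_chikuma_prompt` is hypothetical, not part of the repo; in practice `tokenizer.apply_chat_template` does this for you, as shown in the Usage section below:

```python
# Illustrative helper (not part of the repo): formats a single-turn prompt
# exactly as the template above describes, ending with the assistant header
# so the model continues from there.
def format_chikuma_prompt(system: str, user: str) -> str:
    return (
        "<|im_start|>GPT4 Correct system:\n"
        f"{system} Always use <|end_of_turn|> when you want to end the answer. <|im_end|>\n"
        "<|im_start|>GPT4 Correct user:\n"
        f"{user}<|im_end|>\n"
        "<|im_start|>GPT4 Correct Assistant:\n"
    )

print(format_chikuma_prompt("You are a helpful assistant chatbot.", "Who invented LLMs?"))
```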

## Nous Benchmark Evaluation

| Model | AGIEval | GPT4All | TruthfulQA | Bigbench | Average |
|---|---|---|---|---|---|
| SynthIQ-7b | 42.67 | 73.71 | 56.51 | 44.59 | 54.37 |
| openchat/openchat-3.5-0106 | 44.17 | 73.72 | 52.53 | 44.40 | 53.71 |
| Chikuma_10.7B | 42.41 | 73.41 | 56.69 | 43.50 | 54.00 |
| Chikuma_10.7B_v2 | 42.77 | 73.81 | 58.83 | 44.83 | 55.06 |

## OpenLLM Leaderboard

| Benchmark Name | Performance |
|---|---|
| ARC | 66.38 |
| HellaSwag | 85 |
| MMLU | 65.27 |
| TruthfulQA | 58.83 |
| Winogrande | 78.77 |
| GSM8K | 63.68 |
| Average | 69.65 |

## Training Environment

- Hardware: a single A100 80GB GPU on RunPod, used for approximately 1.5 hours.
- Training script: available as a Google Colab notebook (a hedged sketch of the setup follows below). Special thanks to mlabonne for providing the template.
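
For readers without access to the notebook, here is a sketch of what the DPO setup might look like using `trl`'s `DPOTrainer` (older-style signature; newer `trl` versions use `DPOConfig`). The hyperparameters are illustrative placeholders, not the exact values used to train this model:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

base = "sethuiyer/Chikuma_10.7B"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

trainer = DPOTrainer(
    model,
    ref_model=None,                     # trl builds a frozen reference copy when None
    args=TrainingArguments(
        output_dir="chikuma-dpo",
        per_device_train_batch_size=2,  # illustrative, not the card's actual settings
        gradient_accumulation_steps=8,
        learning_rate=5e-5,
        max_steps=200,
        bf16=True,
    ),
    beta=0.1,                           # standard DPO temperature
    train_dataset=dataset,              # the filtered dataset from above
    tokenizer=tokenizer,
    max_prompt_length=1024,
    max_length=1536,
)
trainer.train()
```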

## Usage

```python
import transformers
from transformers import AutoTokenizer

model_id = "sethuiyer/Chikuma_10.7B_v2"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Create pipeline
pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    tokenizer=tokenizer,
    device="cuda",
)

# Format prompt with the chat template
messages = [
    {"role": "system", "content": "You are a helpful assistant chatbot."},
    {"role": "user", "content": "Who invented LLMs?"},
]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

# Generate text
sequences = pipeline(
    prompt,
    max_new_tokens=512,
)
print(sequences[0]["generated_text"])
```
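
Because the system prompt instructs the model to emit `<|end_of_turn|>`, it can help to also stop generation on that token. A hedged tweak to the call above (reusing `tokenizer`, `pipeline`, and `prompt`, and assuming `<|end_of_turn|>` exists in the tokenizer vocabulary, as it does in OpenChat-style models):

```python
# Stop on <|end_of_turn|> as well as the regular EOS token; the sampling
# settings are illustrative, not recommendations from the model card.
end_of_turn_id = tokenizer.convert_tokens_to_ids("<|end_of_turn|>")
sequences = pipeline(
    prompt,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.7,
    eos_token_id=[tokenizer.eos_token_id, end_of_turn_id],
)
print(sequences[0]["generated_text"])
```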

## Acknowledgements

A heartfelt appreciation goes to the vibrant open-source community, particularly:

- The Intel team, for publishing a great open dataset and showing how well it works in the first place.
- Teknium and NousResearch for their awesome work and models.
- Maxime for sharing such great resources.
- Argilla for publishing argilla/distilabel-intel-orca-dpo-pairs.

## Open LLM Leaderboard Evaluation Results

Detailed results can be found [here](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=sethuiyer/distilabled_Chikuma_10.7B).

| Metric | Value |
|---|---|
| Avg. | 68.87 |
| AI2 Reasoning Challenge (25-Shot) | 66.38 |
| HellaSwag (10-Shot) | 85.14 |
| MMLU (5-Shot) | 64.70 |
| TruthfulQA (0-shot) | 59.20 |
| Winogrande (5-shot) | 79.40 |
| GSM8k (5-shot) | 58.38 |