---
tags:
- merge
- mergekit
- lazymergekit
- flemmingmiguel/NeuDist-Ro-7B
- johannhartmann/Brezn3
- ResplendentAI/Flora_DPO_7B
base_model:
- flemmingmiguel/NeuDist-Ro-7B
- johannhartmann/Brezn3
- ResplendentAI/Flora_DPO_7B
language:
- de
- en
---

# Spaetzle-v8-7b

This model is intended to perform adequately in German and English on a range of tasks while mostly behaving well: it should not ramble on or intermix tokens from the different templates seen during training and adaptation.

It is mostly a quick test and is considerably weaker in German grammar and orthography than, for example, DiscoLM. For use cases where that matters less than instruction following or reasoning, it might actually be slightly preferable.

It is a merge of the following models using [LazyMergekit](https://colab.research.google.com/drive/1obulZ1ROXHjYLn6PPZJwRR6GzgQogxxb?usp=sharing):
* [flemmingmiguel/NeuDist-Ro-7B](https://huggingface.co/flemmingmiguel/NeuDist-Ro-7B)
* [johannhartmann/Brezn3](https://huggingface.co/johannhartmann/Brezn3)
* [ResplendentAI/Flora_DPO_7B](https://huggingface.co/ResplendentAI/Flora_DPO_7B)
* on the basis of [mayflowergmbh/Wiedervereinigung-7b-dpo-laser](https://huggingface.co/mayflowergmbh/Wiedervereinigung-7b-dpo-laser)

All credits are due to the creators of those original models and the training datasets involved.

For a suitable quantized version, try [cstr/Spaetzle-v8-7b-GGUF](https://huggingface.co/cstr/Spaetzle-v8-7b-GGUF).

## Evaluation

[Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)

Detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/details_cstr__Spaetzle-v8-7b).

| Metric                          |Value|
|---------------------------------|----:|
|Avg.                             |72.27|
|AI2 Reasoning Challenge (25-Shot)|68.69|
|HellaSwag (10-Shot)              |86.68|
|MMLU (5-Shot)                    |64.60|
|TruthfulQA (0-shot)              |64.05|
|Winogrande (5-shot)              |81.45|
|GSM8k (5-shot)                   |68.16|

EQ-Bench (v2_de): 61.04 / English (v2): 78.3

[ScandEval](https://scandeval.com/german-nlg/) 12.5.2 scores

| Benchmark                    | Spaetzle-v8-7b Value                |
|------------------------------|-------------------------------------|
| Model ID                     | cstr/Spaetzle-v8-7b (few-shot, val) |
| Parameters (millions)        | 7242                                |
| Vocabulary Size (thousands)  | 32                                  |
| Context                      | 32768                               |
| Commercial                   | False                               |
| Speed                        | 5,980 ± 1,031 / 1,714 ± 552         |
| Rank                         | 1.85                                |
| GermEval                     | 58.90 ± 2.30 / 45.55 ± 3.30         |
| SB10k                        | 61.34 ± 1.90 / 72.98 ± 1.30         |
| ScaLA-De                     | 31.58 ± 4.39 / 65.51 ± 2.23         |
| GermanQuAD                   | 24.91 ± 3.98 / 60.88 ± 3.31         |
| MLSum                        | 67.25 ± 1.06 / 22.95 ± 2.64         |
| MMLU-De                      | 34.62 ± 2.20 / 50.43 ± 1.52         |
| HellaSwag-De                 | 48.70 ± 2.47 / 61.05 ± 1.79         |

| Model                                                      |AGIEval|GPT4All|TruthfulQA|Bigbench|Average|
|------------------------------------------------------------|------:|------:|---------:|-------:|------:|
|[Spaetzle-v8-7b](https://huggingface.co/cstr/Spaetzle-v8-7b)|  45.31|  75.69|     63.94|   45.57|  57.63|

### AGIEval

| Task                         |Version| Metric |Value|   |Stderr|
|------------------------------|------:|--------|----:|---|-----:|
|agieval_aqua_rat              |      0|acc     |25.59|±  |  2.74|
|                              |       |acc_norm|24.80|±  |  2.72|
|agieval_logiqa_en             |      0|acc     |39.63|±  |  1.92|
|                              |       |acc_norm|39.78|±  |  1.92|
|agieval_lsat_ar               |      0|acc     |23.48|±  |  2.80|
|                              |       |acc_norm|24.35|±  |  2.84|
|agieval_lsat_lr               |      0|acc     |50.98|±  |  2.22|
|                              |       |acc_norm|51.96|±  |  2.21|
|agieval_lsat_rc               |      0|acc     |62.08|±  |  2.96|
|                              |       |acc_norm|62.83|±  |  2.95|
|agieval_sat_en                |      0|acc     |78.64|±  |  2.86|
|                              |       |acc_norm|79.13|±  |  2.84|
|agieval_sat_en_without_passage|      0|acc     |44.66|±  |  3.47|
|                              |       |acc_norm|44.66|±  |  3.47|
|agieval_sat_math              |      0|acc     |37.27|±  |  3.27|
|                              |       |acc_norm|35.00|±  |  3.22|

Average: 45.31%

### GPT4All

| Task        |Version| Metric |Value|   |Stderr|
|-------------|------:|--------|----:|---|-----:|
|arc_challenge|      0|acc     |63.14|±  |  1.41|
|             |       |acc_norm|64.51|±  |  1.40|
|arc_easy     |      0|acc     |85.98|±  |  0.71|
|             |       |acc_norm|82.49|±  |  0.78|
|boolq        |      1|acc     |88.10|±  |  0.57|
|hellaswag    |      0|acc     |66.31|±  |  0.47|
|             |       |acc_norm|85.17|±  |  0.35|
|openbookqa   |      0|acc     |38.00|±  |  2.17|
|             |       |acc_norm|47.20|±  |  2.23|
|piqa         |      0|acc     |83.35|±  |  0.87|
|             |       |acc_norm|84.17|±  |  0.85|
|winogrande   |      0|acc     |78.22|±  |  1.16|

Average: 75.69%

### TruthfulQA

| Task        |Version|Metric|Value|   |Stderr|
|-------------|------:|------|----:|---|-----:|
|truthfulqa_mc|      1|mc1   |47.74|±  |  1.75|
|             |       |mc2   |63.94|±  |  1.53|

Average: 63.94%

### Bigbench

| Task                                           |Version| Metric              |Value|   |Stderr|
|------------------------------------------------|------:|---------------------|----:|---|-----:|
|bigbench_causal_judgement                       |      0|multiple_choice_grade|56.84|±  |  3.60|
|bigbench_date_understanding                     |      0|multiple_choice_grade|66.12|±  |  2.47|
|bigbench_disambiguation_qa                      |      0|multiple_choice_grade|41.47|±  |  3.07|
|bigbench_geometric_shapes                       |      0|multiple_choice_grade|22.01|±  |  2.19|
|                                                |       |exact_str_match      | 0.00|±  |  0.00|
|bigbench_logical_deduction_five_objects         |      0|multiple_choice_grade|31.40|±  |  2.08|
|bigbench_logical_deduction_seven_objects        |      0|multiple_choice_grade|23.14|±  |  1.60|
|bigbench_logical_deduction_three_objects        |      0|multiple_choice_grade|56.00|±  |  2.87|
|bigbench_movie_recommendation                   |      0|multiple_choice_grade|45.00|±  |  2.23|
|bigbench_navigate                               |      0|multiple_choice_grade|50.70|±  |  1.58|
|bigbench_reasoning_about_colored_objects        |      0|multiple_choice_grade|70.05|±  |  1.02|
|bigbench_ruin_names                             |      0|multiple_choice_grade|45.54|±  |  2.36|
|bigbench_salient_translation_error_detection    |      0|multiple_choice_grade|26.05|±  |  1.39|
|bigbench_snarks                                 |      0|multiple_choice_grade|71.82|±  |  3.35|
|bigbench_sports_understanding                   |      0|multiple_choice_grade|72.92|±  |  1.42|
|bigbench_temporal_sequences                     |      0|multiple_choice_grade|44.20|±  |  1.57|
|bigbench_tracking_shuffled_objects_five_objects |      0|multiple_choice_grade|22.80|±  |  1.19|
|bigbench_tracking_shuffled_objects_seven_objects|      0|multiple_choice_grade|18.23|±  |  0.92|
|bigbench_tracking_shuffled_objects_three_objects|      0|multiple_choice_grade|56.00|±  |  2.87|

Average: 45.57%

Average score: 57.63%

## 💻 Usage

```python
# In a notebook, install the dependencies first:
#   !pip install -qU transformers accelerate

import torch
import transformers
from transformers import AutoTokenizer

model = "cstr/Spaetzle-v8-7b"
messages = [{"role": "user", "content": "What is a large language model?"}]

# Render the conversation with the model's own chat template.
tokenizer = AutoTokenizer.from_pretrained(model)
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Load the model in half precision and let accelerate place it on the available devices.
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    torch_dtype=torch.float16,
    device_map="auto",
)

outputs = pipeline(prompt, max_new_tokens=256, do_sample=True, temperature=0.7, top_k=50, top_p=0.95)
print(outputs[0]["generated_text"])
```

## 🧩 Configuration

The model uses the ChatML prompt format and should work well with it, since the merged models (mostly) saw ChatML templates in training.

```yaml
models:
  - model: mayflowergmbh/Wiedervereinigung-7b-dpo-laser
    # no parameters necessary for base model
  - model: flemmingmiguel/NeuDist-Ro-7B
    parameters:
      density: 0.60
      weight: 0.30
  - model: johannhartmann/Brezn3
    parameters:
      density: 0.65
      weight: 0.40
  - model: ResplendentAI/Flora_DPO_7B
    parameters:
      density: 0.6
      weight: 0.3
merge_method: dare_ties
base_model: mayflowergmbh/Wiedervereinigung-7b-dpo-laser
parameters:
  int8_mask: true
dtype: bfloat16
random_seed: 0
tokenizer_source: base
```
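
To give a rough intuition for what `density` and `weight` do in the `dare_ties` configuration above, here is a simplified sketch of the idea on flat lists of floats. It is not mergekit's implementation: the function name `dare_ties_merge` is hypothetical, real merges operate on full tensors, and details such as normalization and the `int8_mask` option are left out. The sketch assumes that each fine-tuned model's difference from the base is randomly kept with probability `density` and rescaled by `1 / density`, scaled by `weight`, and then only contributions whose sign agrees with the weighted sum are added back to the base.

```python
import random


def dare_ties_merge(base, finetuned, densities, weights, seed=0):
    """Toy DARE-TIES style merge on flat lists of floats (illustrative only)."""
    rng = random.Random(seed)
    merged = []
    for i, base_value in enumerate(base):
        contributions = []
        for tensor, density, weight in zip(finetuned, densities, weights):
            delta = tensor[i] - base_value
            # DARE step: keep the delta with probability `density`,
            # rescaling survivors so the expected value is unchanged.
            kept = delta / density if rng.random() < density else 0.0
            contributions.append(weight * kept)
        # TIES-style sign election: the sign of the weighted sum wins,
        # and contributions with the opposite sign are discarded.
        total = sum(contributions)
        agreeing = [c for c in contributions if c * total > 0]
        merged.append(base_value + sum(agreeing))
    return merged


# Example call with the densities and weights from the configuration above
# (the parameter values themselves are made up for illustration).
base = [0.0, 0.0, 0.0]
models = [[0.2, -0.1, 0.4], [0.1, 0.3, -0.2], [0.3, 0.1, 0.1]]
print(dare_ties_merge(base, models, [0.60, 0.65, 0.6], [0.30, 0.40, 0.3]))
```

With `density` set to 1.0 nothing is dropped, and the merge reduces to a sign-filtered weighted sum of the differences from the base model.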
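
For readers unfamiliar with the ChatML format mentioned in this section, the following minimal sketch shows the generic layout of a ChatML prompt. The helper `to_chatml` is a hypothetical name used for illustration; in practice the chat template shipped with the tokenizer (as used via `apply_chat_template` in the usage example) is authoritative and may differ in details.

```python
def to_chatml(messages, add_generation_prompt=True):
    """Render a list of {"role", "content"} dicts in the generic ChatML layout."""
    parts = [
        f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        for message in messages
    ]
    if add_generation_prompt:
        # Open an assistant turn so the model continues with its reply.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


print(to_chatml([{"role": "user", "content": "What is a large language model?"}]))
```

Comparing this output with the string returned by `tokenizer.apply_chat_template` is a quick way to check that a prompt is formatted as the model expects.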