GRAG-Mistral-Nemo-Base-2407-CPT-HESSIAN-AI

GRAG (German Retrieval Augmented Generation) models are designed for the German-speaking market, enabling innovation and AI solutions to drive German research collaboration in business-focused Generative AI by 2025.

Our GRAG-MISTRAL-NEMO-CPT model is trained on the GRAG-CPT dataset.

Model Details

The core models released in this batch are the following:

Model	Training Tokens
GRAG-MISTRAL-NEMO-CPT	507.47 million
GRAG-MISTRAL-NEMO-SFT	2.03 billion
GRAG-MISTRAL-NEMO-ORPO	2.0577 billion

Model Description

Developed by: Avemio AI Team
Supported by: Hessian AI
Model type: Transformer-style autoregressive language model.
Language(s) (NLP): German, English
License: The code and model are released under Apache 2.0.
Contact: [email protected]

Model Sources

Training Study: Training Study
Repositories:
- Training: Colab-Notebook
- Evaluation code:
  - GRAG-LLM-HARD-BENCHMARK
  - GRAG-LLM-EASY-BENCHMARK
Technical blog post:

Uses

Inference

To get inference running quickly, install the required dependencies, then proceed as usual with Hugging Face Transformers:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "avemio/GRAG-NEMO-12B-CPT-HESSIAN-AI"

# Load the tokenizer and the model weights from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Tokenize a short prompt and return PyTorch tensors.
inputs = tokenizer("Hello mein Name ist", return_tensors="pt")

# Generate up to 20 new tokens and decode the result back to text.
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Fine-tuning

We provide a comprehensive Google Colab notebook that guides users through fine-tuning our model, with detailed instructions, essential dependencies, and configurable settings: Colab-Notebook.

Model Details

Data

For training data details, please see the GRAG-CPT-Dataset documentation.

Description

CPT – Continued Pre-Training

Our CPT (Continued Pre-Training) approach is designed to enhance language models' ability to perform specific tasks through structured instruction-based learning. Drawing inspiration from "Instruction Pre-Training: Language Models are Supervised Multitask Learners," our methodology focuses on priming base models with semi-structured examples to improve their performance across three key tasks. Our training dataset comprises approximately 420,000 German language samples and 200,000 English examples, with the deliberate emphasis on German content aimed at expanding the model's German language vocabulary and capabilities.

Context-Based Question Answering

This task trains models to generate accurate responses by considering both the question and its accompanying context. For example, when analyzing cancer counseling center benefits, the model learns to extract and synthesize relevant information from the provided context to formulate comprehensive answers. The training examples follow a clear structure: Question > Context > Context-based Answer.

Structured Reasoning

The reasoning task develops the model's ability to break down complex problems and arrive at solutions through systematic thinking. Training examples present problems with clear subheadings (Task, Approach, Solution) to encourage structured analysis. As shown in the music festival scheduling example, this format helps the model learn to consider multiple constraints and develop logical solutions step by step.

Intelligent Summarization

The summarization task teaches models to distill complex information into clear, organized summaries while preserving key details. Training examples demonstrate how to transform detailed explanations into well-structured bullet points or concise summaries.
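The semi-structured layouts described above can be illustrated with a minimal sketch. The ordering (Question > Context > Context-based Answer) and the subheadings (Task, Approach, Solution) come from the description. The exact label strings, the `QASample` class, the helper names, and the sample text are hypothetical and are not taken from the GRAG-CPT dataset.

```python
from dataclasses import dataclass


@dataclass
class QASample:
    """Hypothetical container for one context-based QA example."""
    question: str
    context: str
    answer: str


def format_qa(sample: QASample) -> str:
    # Ordering follows the description: Question > Context > Context-based Answer.
    # The literal label strings are an assumption for illustration.
    return (
        f"Question: {sample.question}\n"
        f"Context: {sample.context}\n"
        f"Context-based Answer: {sample.answer}"
    )


def format_reasoning(task: str, approach: str, solution: str) -> str:
    # Subheadings (Task, Approach, Solution) are the ones named in the description.
    return f"Task\n{task}\n\nApproach\n{approach}\n\nSolution\n{solution}"


# Invented placeholder text, not a real dataset entry.
example = QASample(
    question="Welche Vorteile bietet eine Krebsberatungsstelle?",
    context="Krebsberatungsstellen bieten kostenlose psychosoziale Beratung.",
    answer="Sie bietet kostenlose psychosoziale Beratung.",
)
print(format_qa(example))
```

For the real sample layout, the GRAG-CPT-Dataset documentation remains the authoritative source.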

Architecture

Parameter	GRAG-MISTRAL-NEMO-CPT
d_model	5120
num heads	32
num layers	32
MLP ratio	3.5
LayerNorm type	RMSNorm
pos embeddings	RoPE
attention variant	Grouped-query attention with 8 key-value heads
biases	none
block type	Sequential
activation	SiLU
sequence length	1024000
weight dtype	bfloat16
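The table lists 32 attention heads alongside 8 key-value heads, so several query heads share each key-value head. The sketch below shows one way that sharing can be laid out. It assumes the common convention that consecutive query heads map to the same key-value head. The helper name `kv_head_for` is hypothetical.

```python
NUM_HEADS = 32     # "num heads" from the table
NUM_KV_HEADS = 8   # key-value heads from the "attention variant" row


def kv_head_for(query_head: int) -> int:
    # Assumed convention: consecutive query heads share one key-value head.
    group_size = NUM_HEADS // NUM_KV_HEADS
    return query_head // group_size


# Collect the query heads served by each key-value head.
groups: dict[int, list[int]] = {}
for q in range(NUM_HEADS):
    groups.setdefault(kv_head_for(q), []).append(q)

print(NUM_HEADS // NUM_KV_HEADS)  # 4 query heads per key-value head
print(groups[0])                  # [0, 1, 2, 3]
```

Sharing key-value heads this way reduces the size of the key-value cache relative to giving every query head its own pair.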

Hyperparameters

Parameter	GRAG-MISTRAL-NEMO-CPT
warmup steps	50
peak LR	5.0E-07
weight decay	0.1
LR schedule	linear
gradient reduce dtype	FP32
optimizer state dtype	FP32
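To make the schedule concrete, here is a minimal sketch of a linear learning-rate schedule using the warmup steps and peak LR from the table. The card does not state the total number of training steps or the final learning rate, so `TOTAL_STEPS` and the decay to zero are assumptions for illustration, and `linear_lr` is a hypothetical helper.

```python
WARMUP_STEPS = 50   # from the table
PEAK_LR = 5.0e-7    # from the table


def linear_lr(step: int, total_steps: int) -> float:
    """Linear warmup to PEAK_LR, then linear decay to zero (decay target assumed)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    remaining = max(total_steps - step, 0)
    return PEAK_LR * remaining / max(total_steps - WARMUP_STEPS, 1)


TOTAL_STEPS = 1000  # hypothetical; the card does not state the step count
for s in (0, 25, 50, 525, 1000):
    print(s, linear_lr(s, TOTAL_STEPS))
```

Under these assumptions the rate rises from zero to the peak over the first 50 steps and then falls linearly to zero at the final step.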

Environmental Impact

GRAG-MISTRAL-NEMO-CPT was trained on 40 NVIDIA A100 GPUs for 5 days. The approximate power consumption is as follows:

Model	GPU Type	Power Consumption From GPUs
GRAG-MISTRAL-NEMO-CPT	A100 (Hessian AI supercomputer)	0.0144 MWh

Actual power consumption may vary with the specific workload and operating conditions. For accurate measurements, we recommend using dedicated power monitoring tools.

Bias, Risks, and Limitations

As with any base or fine-tuned language model without safety filtering, users can relatively easily prompt these models to generate harmful or sensitive content. Such content can also be produced unintentionally, especially where bias is involved, so we recommend that users consider the risks of applying this technology.

In addition, statements generated by GRAG-MISTRAL-NEMO-CPT, like those of any LLM, are often inaccurate and should be verified.

Model Card Contact

For errors in this model card, please contact ([email protected]).

The GRAG AI Team

Marcel Rosiak, Soumya Paul, Siavash Mollaebrahim, Zain ul Haq
