DeepNeo: A hybrid model with precision and power
Overview
DeepNeo is a hybrid model that can be used like any other LLM, but it also offers a mode, inspired by NousResearch/DeepHermes-3-Llama-3-8B-Preview, in which the model produces a CoT-like (chain-of-thought) response. This mode is toggled through the system prompt. Unlike NousResearch/DeepHermes-3-Llama-3-8B-Preview, DeepNeo is slightly more flexible in its sizes: we have released an 8B and a 12B model, both based on Mistral AI's models.
Model Details
DeepNeo 12B Key features
- Developed by: Spestly (Open-Neo) & Kazex (Open-Neo)
- Released under the Mistral Research License; reach out to Mistral AI for a commercial license
- Trained with a 128k context window
- Trained on a large proportion of multilingual and synthetic reasoning data
- Supports function calling (see the sketch after the table below)
| Feature | Value |
|---|---|
| Architecture | Dense Transformer |
| Parameters | ~12B |
| Layers | 40 |
| Heads | 32 |
| KV Heads (GQA) | 8 |
| Hidden Dim | 14336 |
| Head Dim | 128 |
| Vocab Size | 131,072 |
| Context Length | 128k |
| Attention Pattern | Ragged (128k, 32k, 32k, 32k) |
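Because the model supports function calling, the sketch below shows one way to pass a tool definition through the Transformers chat template. The `get_weather` function and its schema are hypothetical, and whether DeepNeo's chat template renders tools and returns a structured tool call is an assumption to verify against the model's actual template; treat this as an illustration, not documented behaviour.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Hypothetical tool: the name, signature, and docstring are illustrative only.
def get_weather(city: str) -> str:
    """Get the current weather for a city.

    Args:
        city: The city to look up.
    """
    ...

tokenizer = AutoTokenizer.from_pretrained("open-neo/DeepNeo-1-12B-Preview")
model = AutoModelForCausalLM.from_pretrained(
    "open-neo/DeepNeo-1-12B-Preview",
    torch_dtype=torch.float16,
    device_map="auto"
)

messages = [
    {"role": "user", "content": "What is the weather like in Paris right now?"}
]

# Recent versions of Transformers accept tool definitions in apply_chat_template;
# whether DeepNeo emits a parseable tool call in response is an assumption.
input_ids = tokenizer.apply_chat_template(
    messages,
    tools=[get_weather],
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

generated_ids = model.generate(input_ids, max_new_tokens=512, temperature=0.8, do_sample=True)
print(tokenizer.decode(generated_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```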
Usage
Intuitive mode
This mode is active by default, so nothing needs to be changed; you can use any system prompt (or none at all). An example is given below.
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("open-neo/DeepNeo-1-12B-Preview")
model = AutoModelForCausalLM.from_pretrained(
    "open-neo/DeepNeo-1-12B-Preview",
    torch_dtype=torch.float16,
    device_map="auto"
)

# No system prompt is required: intuitive mode is the default.
messages = [
    {"role": "user", "content": "What are the most interesting things to do in Paris?"}
]

input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
generated_ids = model.generate(input_ids, max_new_tokens=2500, temperature=0.8, do_sample=True)
print(f"Generated Tokens: {generated_ids.shape[-1]}")

response = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
print(f"Response: {response}")
```
Reasoning mode
To activate this mode, a few extra steps are needed. Almost any system prompt should work as long as it mentions the `<Thought></Thought>` and `<Output></Output>` tags. An example system prompt is given below; note that it may require tweaking for your specific use case.
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("open-neo/DeepNeo-1-12B-Preview")
model = AutoModelForCausalLM.from_pretrained(
    "open-neo/DeepNeo-1-12B-Preview",
    torch_dtype=torch.float16,
    device_map="auto"
)

# A system prompt mentioning the <Thought> and <Output> tags switches the model into reasoning mode.
messages = [
    {"role": "system", "content": "You are a deep-thinking AI model. You must put your thoughts in the <Thought> tags and your output in the <Output> tags."},
    {"role": "user", "content": "What are the most interesting things to do in Paris?"}
]

input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
generated_ids = model.generate(input_ids, max_new_tokens=2500, temperature=0.8, do_sample=True)
print(f"Generated Tokens: {generated_ids.shape[-1]}")

response = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
print(f"Response: {response}")
```
Citations
```bibtex
@misc{deepneo-1,
      title={DeepNeo: A hybrid model with precision and power},
      author={Aayan Mishra and Krish Thumar},
      howpublished={https://huggingface.co/collections/open-neo/deepneo-1-67aea4c0f086ab0f70ed5720},
      year={2025}
}
```