Omni 0 Preview Models

Omni 0 preview models are a series of 1.7B-parameter LLMs optimized for STEM knowledge. The series consists of four expert models (Science, Technology, Engineering, and Math) and one final model merged from all four experts. Through expert finetuning and DARE-TIES merging, Omni achieves state-of-the-art domain benchmark results among alternative optimization techniques and optimal compute-knowledge efficiency among similar models, while performing comparably with those models on the tested benchmarks.

Figure: Omni compute-knowledge efficiency plot. Omni achieves optimal compute-knowledge efficiency compared to alternative models.

This model card is for omni-0-engineering-preview. All other models can be found on Omni's HuggingFace page.

Benchmarks

Expert Model Benchmarks

Benchmark	Science	Technology	Engineering	Math
MMLU Science (4-shot CoT)	26.69	--	--	--
SciQ (0-shot)	85.80	--	--	--
ARC-Challenge (0-shot)	42.41	--	--	--
ARC-Easy (0-shot)	66.96	--	--	--
MMLU Technology (4-shot CoT)	--	35.30	--	--
HumanEval (pass@1)	--	32.93	--	--
MMLU Engineering (4-shot CoT)	--	--	32.07	--
MMLU Math (4-shot CoT)	--	--	--	30.83
MATH (4-shot)	--	--	--	18.76
Expert Average	36.28	34.83	32.07	28.82
Base Average	35.59	29.79	30.86	22.57
Improvement	1.94%	16.92%	3.92%	27.69%

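Expert Average is the average score of each expert model on the benchmarks for its own STEM domain, and Base Average is the base model's average on the same benchmarks. Improvement is the relative gain of the expert average over the base average.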
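The Improvement rows in both benchmark tables are consistent with a relative gain, (new - base) / base, computed from the two average rows. The following is a small sketch of that arithmetic using the averages reported in the tables; the function name `relative_improvement` is illustrative and not part of any Omni tooling.

```python
def relative_improvement(new_average, base_average):
    """Relative gain of new_average over base_average, in percent (2 decimals)."""
    return round((new_average - base_average) / base_average * 100, 2)


# (expert or merged average, base average) pairs copied from the tables
reported_averages = {
    "Science": (36.28, 35.59),
    "Technology": (34.83, 29.79),
    "Engineering": (32.07, 30.86),
    "Math": (28.82, 22.57),
    "omni-0-mini-preview": (37.91, 30.25),
}

for name, (new_average, base_average) in reported_averages.items():
    print(f"{name}: {relative_improvement(new_average, base_average)}%")
```

Because the gain is relative, a domain with a low base average (such as Math) can show a large percentage improvement from a modest absolute change.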

Omni-0-mini-preview Benchmarks

Benchmark	Omni	Base	Llama 3.2 3B	Gemma 3 4B	Llama 3.1 8B
MMLU STEM (4-shot CoT)	35.02	26.59	33.28	40.82	52.22
MMLU Science (4-shot CoT)	34.44	28.03	33.47	42.93	52.54
MMLU Technology (4-shot CoT)	41.07	30.86	45.28	46.74	63.72
MMLU Engineering (4-shot CoT)	37.50	25.93	34.65	43.66	55.58
MMLU Math (4-shot CoT)	35.54	23.86	39.51	35.31	45.84
HumanEval (pass@1)	31.71	29.88	51.83	57.32	57.93
SciQ (0-shot)	87.30	76.10	93.30	87.50	91.80
MATH (4-shot)	15.66	16.12	28.44	26.38	29.56
ARC-Challenge (0-shot)	43.00	40.10	46.16	44.11	54.18
ARC-Easy (0-shot)	66.67	58.54	67.93	63.01	75.80
Average	37.91	30.25	38.33	43.91	54.22
Improvement	25.32%	Base	--	--	--

Models

Omni comes in a total of 5 models:

Merged Model

omni-0-mini-preview - Merged output of all four experts through DARE-TIES, delivering large improvements in performance in STEM domains compared to its base.
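To give a sense of what DARE-TIES merging does, here is a toy sketch in pure Python that operates on flat lists of floats instead of real model tensors. It is not the pipeline used to build omni-0-mini-preview: the parameters `density`, `weight`, and `seed` are illustrative assumptions, and real merges are normally run per weight tensor with a dedicated merging toolkit. The sketch follows the method as it is commonly described: take each expert's difference from the base (its task vector), randomly drop entries and rescale the survivors (DARE), then keep only the entries whose sign agrees with the dominant sign per parameter and average them (TIES).

```python
import random


def dare_ties_merge(base, experts, density=0.5, weight=1.0, seed=0):
    """Toy DARE-TIES merge over flat parameter lists (illustrative only)."""
    rng = random.Random(seed)

    # DARE: keep each task-vector entry with probability `density`,
    # rescaling kept entries by 1/density so the expected update is unchanged.
    deltas = []
    for expert in experts:
        delta = []
        for b, e in zip(base, expert):
            if rng.random() < density:
                delta.append((e - b) / density)
            else:
                delta.append(0.0)
        deltas.append(delta)

    merged = []
    for i, b in enumerate(base):
        column = [delta[i] for delta in deltas]
        total = sum(column)

        # TIES: elect the sign with the larger total mass for this parameter,
        # then average only the non-zero entries that agree with it.
        if total > 0:
            agreeing = [v for v in column if v > 0]
        elif total < 0:
            agreeing = [v for v in column if v < 0]
        else:
            agreeing = []

        update = sum(agreeing) / len(agreeing) if agreeing else 0.0
        merged.append(b + weight * update)
    return merged


base = [0.0, 0.0]
experts = [[1.0, 1.0], [-3.0, 1.0]]
print(dare_ties_merge(base, experts, density=1.0))
```

With `density=1.0` nothing is dropped, so the example isolates the sign election: the two experts disagree on the first parameter and the larger update wins, while they agree on the second and their updates are averaged.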

Experts

omni-0-science-preview - Science expert finetuned on corpora of scientific Wikipedia texts and academic papers, as well as a chat-templated scientific Q&A dataset.
omni-0-technology-preview - Technology expert finetuned on chat-templated code generation data and Stack Exchange questions with their top-voted answers.
omni-0-engineering-preview - Engineering expert finetuned on corpora of engineering-related Wikipedia texts and academic papers.
omni-0-math-preview - Math expert finetuned on chat-templated math Q&A data.

All Omni experts are finetuned from the same base model, SmolLM2 1.7B Instruct, on H100/Ada 6000/A100 GPUs. The merged model improves on this base by 25.32% on average across all tested STEM benchmarks.

Features

Made for all: Omni is a series of highly efficient Large Language Models aimed at expanding the accessibility of AI, filling in its gaps among underserved populations.

Efficient: Omni operates at optimal compute-knowledge efficiency compared to similar models.

Merged architecture: Omni uses merging to provide the collective accuracy of specialized models, leveraging their capabilities to enhance the final merged model.

Multi-disciplinary: Omni's first variant achieves state-of-the-art performance across STEM compared to alternative optimization techniques.

Inference

Transformers

Transformers is a framework by HuggingFace unifying model development and inference, allowing for simple and seamless interactions with models found on HuggingFace.

To get started with running inference using Omni, install transformers:

pip install transformers

After transformers has been installed, run the following code to generate outputs from any Omni model:

from transformers import AutoModelForCausalLM, AutoTokenizer
checkpoint = "omniomni/omni-0-engineering-preview" # Can be any Omni model

device = "cuda" # For GPU usage or "cpu" for CPU usage
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# For multiple GPUs install accelerate and do `model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto")`
# However, Omni is small enough to run on individual commodity GPUs and low-resource devices
model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)

messages = [{"role": "user", "content": "What is STEM?"}]

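# add_generation_prompt=True appends the assistant turn header so the model answers as the assistant
input_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)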
inputs = tokenizer.encode(input_text, return_tensors="pt").to(device)

outputs = model.generate(inputs, max_new_tokens=256, temperature=0.7, do_sample=True)

print(tokenizer.decode(outputs[0]))
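For a decoder-only model, `generate` returns the prompt tokens followed by the newly generated tokens, so the snippet above prints the chat-formatted prompt together with the reply. A minimal sketch of keeping only the reply is shown below; the helper `reply_token_ids` is a hypothetical name, and the commented usage reuses `inputs`, `outputs`, and `tokenizer` from the snippet above.

```python
def reply_token_ids(output_ids, prompt_length):
    """Return only the tokens generated after the prompt.

    Works on any sliceable sequence, including a 1-D tensor such as outputs[0].
    """
    return output_ids[prompt_length:]


# Usage with the objects from the snippet above (not executed here):
#   new_ids = reply_token_ids(outputs[0], inputs.shape[1])
#   print(tokenizer.decode(new_ids, skip_special_tokens=True))

# Stand-in example with plain integers: 3 prompt tokens, 2 generated tokens
print(reply_token_ids([101, 102, 103, 7, 8], 3))
```

Passing `skip_special_tokens=True` to `decode` also removes chat-template markers from the printed text.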

vLLM

vLLM provides fast LLM inference for a wide range of model architectures, either through a standalone server or directly inside a Python script.

First, install the vLLM package:

uv pip install vllm --torch-backend=auto

After that, run this command to start a server with Omni via vLLM:

vllm serve omniomni/omni-0-engineering-preview
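Once the server is running, it exposes an OpenAI-compatible HTTP API. The sketch below sends a chat request using only the Python standard library. It assumes the server is reachable at vLLM's default address, http://localhost:8000; the helper names `build_chat_request` and `ask` are illustrative, and the sampling values simply mirror the Transformers example.

```python
import json
import urllib.request


def build_chat_request(prompt,
                       model="omniomni/omni-0-engineering-preview",
                       base_url="http://localhost:8000"):
    """Build a POST request for the OpenAI-compatible chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
        "temperature": 0.7,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt):
    """Send the prompt to a running server and return the reply text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

Calling `ask("What is STEM?")` requires the `vllm serve` process to be up; if the server was started on a different host or port, pass a matching `base_url` to `build_chat_request`.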

To use Omni with vLLM without creating a server, run this code to generate outputs within a Python file:

# vLLM automatically uses a GPU unless built with CPU wheels, so no need to specify a device

from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The answer to x^2 + 2x + 1 is",
    "The future of AI is",
]

sampling_params = SamplingParams(temperature=0.7)

llm = LLM(model="omniomni/omni-0-engineering-preview")

outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
