Omni 0 Preview Models
Omni 0 preview models are a series of 1.7B-parameter LLMs optimized for STEM knowledge. The series consists of four expert models (Science, Technology, Engineering, and Math) and one final model merged from all four experts. Through DARE-TIES merging and expert finetuning, Omni achieves state-of-the-art domain benchmark results among alternative optimization techniques and optimal compute-knowledge efficiency across similar models, while performing comparably with those models on the tested benchmarks.

This model card is for omni-0-engineering-preview. All other models can be found on Omni's HuggingFace page.
Benchmarks
Expert Model Benchmarks
Benchmark | Science | Technology | Engineering | Math |
---|---|---|---|---|
MMLU Science (4-shot CoT) | 26.69 | -- | -- | -- |
SciQ (0-shot) | 85.80 | -- | -- | -- |
ARC-Challenge (0-shot) | 42.41 | -- | -- | -- |
ARC-Easy (0-shot) | 66.96 | -- | -- | -- |
MMLU Technology (4-shot CoT) | -- | 35.30 | -- | -- |
HumanEval (pass@1) | -- | 32.93 | -- | -- |
MMLU Engineering (4-shot CoT) | -- | -- | 32.07 | -- |
MMLU Math (4-shot CoT) | -- | -- | -- | 30.83 |
MATH (4-shot) | -- | -- | -- | 18.76 |
Expert Average | 36.28 | 34.83 | 32.07 | 28.82 |
Base Average | 35.59 | 29.79 | 30.86 | 22.57 |
Improvement | 1.94% | 16.92% | 3.92% | 27.69% |
Expert Average is each expert model's average score on the benchmarks for its own STEM domain.
Omni-0-mini-preview Benchmarks
Benchmark | Omni | Base | Llama 3.2 3B | Gemma 3 4B | Llama 3.1 8B |
---|---|---|---|---|---|
MMLU STEM (4-shot CoT) | 35.02 | 26.59 | 33.28 | 40.82 | 52.22 |
MMLU Science (4-shot CoT) | 34.44 | 28.03 | 33.47 | 42.93 | 52.54 |
MMLU Technology (4-shot CoT) | 41.07 | 30.86 | 45.28 | 46.74 | 63.72 |
MMLU Engineering (4-shot CoT) | 37.50 | 25.93 | 34.65 | 43.66 | 55.58 |
MMLU Math (4-shot CoT) | 35.54 | 23.86 | 39.51 | 35.31 | 45.84 |
HumanEval (pass@1) | 31.71 | 29.88 | 51.83 | 57.32 | 57.93 |
SciQ (0-shot) | 87.30 | 76.10 | 93.30 | 87.50 | 91.80 |
MATH (4-shot) | 15.66 | 16.12 | 28.44 | 26.38 | 29.56 |
ARC-Challenge (0-shot) | 43.00 | 40.10 | 46.16 | 44.11 | 54.18 |
ARC-Easy (0-shot) | 66.67 | 58.54 | 67.93 | 63.01 | 75.80 |
Average | 37.91 | 30.25 | 38.33 | 43.91 | 54.22 |
Improvement | 25.32% | Base | -- | -- | -- |
Llama 3.2 3B, Gemma 3 4B, and Llama 3.1 8B are the alternative models used for comparison.
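The Improvement rows in both benchmark tables are consistent with a relative gain over the base average, that is (score - base) / base. The card does not state the formula, so the sketch below is an assumption checked only against the reported averages; `relative_improvement` is a hypothetical helper name.

```python
def relative_improvement(score: float, base: float) -> float:
    """Percentage gain of `score` over `base` (assumed formula, not stated in the card)."""
    return (score - base) / base * 100.0


# (model average, base average) pairs copied from the two tables.
reported_averages = {
    "Science expert": (36.28, 35.59),
    "Technology expert": (34.83, 29.79),
    "Engineering expert": (32.07, 30.86),
    "Math expert": (28.82, 22.57),
    "omni-0-mini-preview": (37.91, 30.25),
}

for name, (score, base) in reported_averages.items():
    print(f"{name}: {relative_improvement(score, base):.2f}%")
```

Under this reading, the percentages are relative gains, not differences in percentage points.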
Models
Omni comes in a total of 5 models:
Merged Model
omni-0-mini-preview
- Merged output of all four experts through DARE-TIES, delivering large improvements in performance in STEM domains compared to its base.
Experts
omni-0-science-preview
- Science expert finetuned on corpora of scientific Wikipedia texts and academic papers, as well as a chat-templated scientific Q&A dataset
omni-0-technology-preview
- Technology expert finetuned on chat-templated code generation data and Stack Exchange questions with their top-voted answers
omni-0-engineering-preview
- Engineering expert finetuned on corpora of engineering-related Wikipedia texts and academic papers
omni-0-math-preview
- Math expert finetuned on chat-templated math Q&A data
All Omni experts are finetuned from the same base model, SmolLM2 1.7B Instruct, on H100/Ada 6000/A100 GPUs. The merged model improves on that base by 25.32% on average across all tested STEM benchmarks.
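To make the DARE-TIES merge more concrete, here is a toy sketch of the general technique on flat lists of floats. It is not Omni's merging code: the card does not give the tooling or hyperparameters, so `dare_ties_merge`, `density`, `weight`, and `seed` are illustrative assumptions. DARE randomly drops entries of each expert's delta from the base and rescales the survivors; TIES then elects a sign per parameter and averages only the deltas that agree with it.

```python
import random


def dare_ties_merge(base, experts, density=0.5, weight=1.0, seed=0):
    """Toy DARE-TIES merge. `base` and each item of `experts` are flat lists of floats."""
    if not 0.0 < density <= 1.0:
        raise ValueError("density must be in (0, 1]")
    rng = random.Random(seed)

    # DARE: take each task vector (expert - base), keep every entry with
    # probability `density`, and rescale survivors by 1/density so the
    # expected value of the delta is unchanged.
    sparse_deltas = []
    for expert in experts:
        delta = [e - b for e, b in zip(expert, base)]
        sparse_deltas.append(
            [d / density if rng.random() < density else 0.0 for d in delta]
        )

    merged = []
    for i, b in enumerate(base):
        column = [delta[i] for delta in sparse_deltas]
        total = sum(column)
        if total == 0:
            # No surviving delta (or they cancel exactly): keep the base weight.
            merged.append(b)
            continue
        # TIES: the sign of the summed deltas wins; average only the
        # non-zero deltas that share that sign.
        agreeing = [d for d in column if d != 0 and (d > 0) == (total > 0)]
        merged.append(b + weight * sum(agreeing) / len(agreeing))
    return merged


# With density=1.0 nothing is dropped, so only the sign election is visible.
print(dare_ties_merge([0.0, 0.0, 0.0], [[1.0, -2.0, 0.0], [3.0, 1.0, 0.0]], density=1.0))
```

In the second position the two experts disagree in sign, so only the delta matching the elected sign is kept instead of the two being averaged toward zero.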
Features
Made for all: Omni is a series of highly efficient Large Language Models aimed at expanding the accessibility of AI, filling in its gaps among underserved populations.
Efficient: Omni operates at optimal compute-knowledge efficiency compared to similar models.
Merged architecture: Omni uses merging to provide the collective accuracy of specialized models, leveraging their capabilities to enhance the final merged model.
Multi-disciplinary: Omni's first variant achieves state-of-the-art performance across STEM compared to alternative optimization techniques.
Inference
Transformers
Transformers is a framework by HuggingFace unifying model development and inference, allowing for simple and seamless interactions with models found on HuggingFace.
To get started with running inference using Omni, install transformers along with PyTorch, which the code below relies on:
pip install transformers torch
After transformers has been installed, run the following code to generate outputs from any Omni model:
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "omniomni/omni-0-engineering-preview"  # Can be any Omni model
device = "cuda"  # Use "cuda" for GPU or "cpu" for CPU

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
# For multiple GPUs install accelerate and do `model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto")`
# However, Omni is small enough to run on individual commodity GPUs and low-resource devices
model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)

messages = [{"role": "user", "content": "What is STEM?"}]
# add_generation_prompt=True appends the assistant turn marker so the model answers instead of continuing the user turn
input_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer.encode(input_text, return_tensors="pt").to(device)
outputs = model.generate(inputs, max_new_tokens=256, temperature=0.7, do_sample=True)
# The decoded text contains the prompt followed by the generated answer
print(tokenizer.decode(outputs[0]))
vLLM
vLLM provides fast LLM inference for a wide range of model architectures, either through a server or directly inside a Python script.
First, install vLLM's package:
uv pip install vllm --torch-backend=auto
After that, run this command to start a server with Omni via vLLM:
vllm serve omniomni/omni-0-engineering-preview
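Once the server is running, it can be queried over vLLM's OpenAI-compatible HTTP API. The sketch below uses only the Python standard library and assumes the server is on the default `http://localhost:8000`; the helper names `build_chat_request` and `query_server` and the sampling values are illustrative, not part of Omni or vLLM.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # Assumed default host and port of `vllm serve`
MODEL = "omniomni/omni-0-engineering-preview"


def build_chat_request(prompt, model=MODEL, max_tokens=256, temperature=0.7):
    """Build the JSON body for a single-turn chat completion request."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }


def query_server(prompt, base_url=BASE_URL):
    """POST the prompt to the running server and return the assistant's reply text."""
    payload = json.dumps(build_chat_request(prompt)).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Example (requires the server started above to be running):
# print(query_server("What is STEM?"))
```

Because the request goes through the chat endpoint, the server applies the model's chat template, so the prompt can be written as a plain user message.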
To use Omni with vLLM without creating a server, run this code to generate outputs within a Python file:
# vLLM automatically uses a GPU unless built with CPU wheels, so no need to specify a device
from vllm import LLM, SamplingParams
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The answer to x^2 + 2x + 1 is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.7)
llm = LLM(model="omniomni/omni-0-engineering-preview")
# generate() returns one result object per prompt
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")