---
license: apache-2.0
datasets:
- Allanatrix/Scientific_Research_Tokenized
language:
- en
base_model:
- Allanatrix/NexaSci
pipeline_tag: text-generation
tags:
- Science
- Hypothesis
- Methodology
---

# NexaSci Family of Models

## Welcome to the NexaSci Repository!

Get ready to supercharge your scientific research with the **NexaSci family of models**! This Hugging Face repository hosts a suite of Mixture-of-Experts (MoE) models designed to generate hypotheses and methodologies across **physics**, **biology**, and **materials science**. Built with efficiency and scalability in mind, the NexaSci family includes the baseline **NexaSci-1-Mini**, the reasoning-enhanced **NexaSci-1-CoT**, and the long-context powerhouse **NexaSci-1-Max**. Whether you are a researcher tackling complex STEM problems, a data scientist exploring scientific ML, or a student learning about domain-specific AI, this repository is your go-to resource for cutting-edge scientific computation.

## Model Overview

The NexaSci family spans roughly 110 million to 2.2 billion parameters and uses a **Semantic Router** to direct queries to domain-specific expert modules (Physics, Biology, Materials Science). It is optimized for resource-constrained environments, leveraging advanced training strategies, hardware optimizations, and techniques such as reinforcement learning and sparse attention. Below are the current and planned models:

### 1. NexaSci-1-Mini (in development; indefinite timeline)

- **Parameters**: ~110 million
- **Purpose**: Generates hypotheses and methodological scaffolding for scientific tasks in physics, biology, and materials science.
- **Architecture**:
  - **Semantic Router**: A BERT-based classifier routes queries to domain-specific experts (see the routing sketch after this list).
  - **Expert Modules**: T5-based submodules for Physics, Biology, and Materials Science.
  - **Inference & Validation Pipeline**: Aggregates expert outputs and ensures consistency.
  - **Knowledge Feedback Loop**: Refines routing using reinforcement learning.
- **Training**:
  - Pretrained on ~2B tokens from arXiv, PubMed, and other scientific corpora.
  - Fine-tuned with QLoRA on 500k instruction-style samples.
  - Uses the AzureSky Optimizer (a Stochastic Approximation + Adam hybrid).
- **Use Cases**:
  - Generate plausible hypotheses (e.g., new material properties).
  - Suggest experimental methods (e.g., protein folding protocols).
  - Summarize scientific texts with domain-specific insights.

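The sketch below illustrates how a BERT-style classifier could route a query to one of the three expert domains. It is only a sketch: the checkpoint name `Allanatrix/nexasci-router`, the label names, and the expert identifiers are placeholders, not the released router interface.

```
from transformers import pipeline

# Placeholder expert identifiers; the real module names are assumptions.
DOMAIN_EXPERTS = {
    "PHYS": "physics_expert",
    "BIO": "biology_expert",
    "MAT": "materials_expert",
}

# Hypothetical sequence-classification checkpoint acting as the Semantic Router.
router = pipeline("text-classification", model="Allanatrix/nexasci-router")

def route(query: str) -> str:
    """Return the expert module a query should be dispatched to."""
    label = router(query)[0]["label"]  # e.g. "PHYS"
    return DOMAIN_EXPERTS.get(label, "physics_expert")

print(route("Estimate the critical temperature of a cuprate superconductor."))
```
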
### 2. NexaSci-1-CoT (coming soon)

- **Parameters**: 756 million to 1.1 billion
- **Purpose**: Enhances step-by-step logical reasoning for complex STEM tasks, such as physics problem-solving or interdisciplinary hypothesis generation.
- **Architecture**:
  - Adds a **Chain-of-Thought (CoT) Processor** with sparse attention (Longformer-style) for multi-step reasoning.
  - Includes **Conditional Routing** that engages the CoT Processor when a "reasoning_required" flag is present (see the sketch after this list).
  - Integrates with the expert modules for structured, logical outputs.
- **Training**:
  - Trained in three stages: Easy (basic logic), Moderate (complex tasks), Hard (advanced reasoning).
  - Uses ~2B tokens.
  - Employs the AzureSky Optimizer with reinforcement learning fine-tuning.
- **Use Cases**:
  - Solve multi-step physics problems (e.g., astrophysics simulations).
  - Generate detailed, logical methodologies (e.g., combining CFD and alloy modeling).
  - Teach scientific reasoning in educational settings.

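As a rough illustration of the conditional routing described above, the snippet below checks for the `reasoning_required` flag and only then engages a CoT-style prompt. The actual CoT Processor interface has not been released; this shows the control flow only.

```
def needs_cot(prompt: str) -> bool:
    """Conditional routing: engage the CoT path only when the flag is present."""
    return "[reasoning_required]" in prompt

def build_inputs(prompt: str) -> str:
    if needs_cot(prompt):
        # Hypothetical CoT instruction; the released processor may use a
        # dedicated module rather than a prompt suffix.
        return prompt + "\nReason step by step and number each intermediate step."
    return prompt

print(build_inputs("[PHYS] [reasoning_required] Derive the orbital period of a binary system."))
```
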
### 3. NexaSci-1-Max (coming soon)

- **Parameters**: ~2.2 billion
- **Purpose**: Processes large scientific documents (up to 20,000 tokens) with deep contextual understanding.
- **Architecture**:
  - Features a **Long Context Attention Layer** with two Flash Attention v2 layers for efficient long-sequence processing.
  - Includes a **Longform Context Manager** that chunks inputs while preserving semantic coherence (see the chunking sketch after this list).
  - Scales parameters using mixed-precision training and gradient checkpointing.
- **Training**:
  - Trained on ~2B tokens, including a Long-Context Corpus of full arXiv papers and NIH grants.
  - Uses the AzureSky Optimizer with mixed precision (FP16/BF16) and gradient checkpointing.
- **Use Cases**:
  - Summarize or analyze long scientific papers (e.g., full-length preprints).
  - Generate hypotheses from extended contexts (e.g., patent methods).
  - Support multi-query tasks requiring deep document understanding.

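The helper below illustrates the kind of windowed splitting the Longform Context Manager performs. The function name, window size, and overlap are assumptions, since the manager's API is not yet released.

```
def chunk_by_tokens(text, tokenizer, max_tokens=4096, overlap=256):
    """Split a long document into overlapping token windows (illustrative only)."""
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    chunks, step = [], max_tokens - overlap
    for start in range(0, len(ids), step):
        window = ids[start:start + max_tokens]
        chunks.append(tokenizer.decode(window))
        if start + max_tokens >= len(ids):
            break
    return chunks

# Example (assumes `tokenizer` was loaded as in the Usage section below):
# pieces = chunk_by_tokens(open("arxiv_paper.txt").read(), tokenizer)
```
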
### Future Models (Planned)

- **NexaSci-1-Scout**: A lightweight version (~50M parameters) optimized for distilling and curating datasets and building the corpora for the model family.
- **NexaSci-1-Super**: A larger-scale model (~10B parameters) for advanced scientific tasks, using ~1B tokens. Planned for high-performance computing clusters.
- **NexaSci-1-MultiModal**: Integrates text, images, and graphs for scientific data analysis (e.g., protein structures, simulation plots). Planned for future research.

## Dataset and Training Details

The NexaSci family is trained on a **tiered token strategy** to maximize efficiency and domain specificity, as outlined in the architecture document:

- **Warm Start Corpus** (100M tokens): General language understanding from FineWeb-Edu, OpenWebMath, Wikipedia, and Aristo Science Questions.
- **Scientific Pretraining Corpus** (1-2B tokens): Domain-specific data from arXiv (physics), PubMed/BioRxiv (biology), and Materials Project/ChemRxiv (materials science).
- **Instruction Fine-Tune Dataset** (500K tokens): 5k high-quality instruction-style samples for hypothesis and method generation.

**Token Efficiency Strategies** (an illustrative sketch follows this list):

- Entropy scoring to remove low-information samples.
- Semantic tagging (e.g., [PHYS], [BIO], [MTH]) for domain routing.
- Distillation using larger models (e.g., GPT-4) to summarize and structure data.
- Routing and filtering to activate only relevant expert paths.

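For illustration, the snippet below scores samples by token-level Shannon entropy and prepends a domain tag to those that pass. The threshold and tag are arbitrary choices, not the values used in the actual pipeline.

```
import math
from collections import Counter

def token_entropy(text: str) -> float:
    """Shannon entropy (bits) over whitespace-delimited tokens."""
    tokens = text.split()
    counts = Counter(tokens)
    total = len(tokens)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def tag_and_filter(samples, domain_tag="[PHYS]", min_entropy=3.0):
    """Drop low-information samples and tag the rest for domain routing."""
    for text in samples:
        if token_entropy(text) >= min_entropy:
            yield f"{domain_tag} {text}"

docs = [
    "the the the the the",  # low-information, filtered out
    "Dark matter may couple weakly to baryonic matter via a new gauge boson.",
]
print(list(tag_and_filter(docs)))
```
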
**Total Token Budget**:
~2B tokens across all models.

**Hardware**:
Currently limited; we are still sourcing compute.

**Optimization Techniques** (a hyperparameter-search sketch follows this list):

- Sparse attention, mixed-precision training, gradient checkpointing.
- Hyperparameter tuning with Optuna, Just-in-Time (JIT) compilation, multi-threading.
- AzureSky Optimizer for efficient convergence.

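As a sketch of how Optuna can fit into this workflow, the snippet below searches over a learning rate and LoRA rank. `run_short_finetune` is a hypothetical helper standing in for a short training-and-validation run; replace its body with a real training loop.

```
import optuna

def run_short_finetune(learning_rate: float, lora_r: int) -> float:
    """Hypothetical helper: run a brief fine-tune and return validation loss."""
    return learning_rate * lora_r  # placeholder score; swap in a real validation loss

def objective(trial: optuna.Trial) -> float:
    learning_rate = trial.suggest_float("learning_rate", 1e-5, 5e-4, log=True)
    lora_r = trial.suggest_categorical("lora_r", [4, 8, 16])
    return run_short_finetune(learning_rate, lora_r)

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=20)
print(study.best_params)
```
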
## Download Models

Model weights are hosted on Hugging Face. Download them with the `transformers` library, the `huggingface-cli`, or directly from the repository's model card.

Example:

```
huggingface-cli download your-username/nexasci-base
```

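If you prefer Python over the CLI, `huggingface_hub.snapshot_download` fetches the same files. The repository id below is the same placeholder used throughout this card.

```
from huggingface_hub import snapshot_download

# Placeholder repository id; substitute the actual model repo.
local_dir = snapshot_download(repo_id="your-username/nexasci-base")
print(f"Model files downloaded to {local_dir}")
```
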
## Usage

**Load a model.** Use the `transformers` library to load NexaSci models:

```
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "your-username/nexasci-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")
```

**Generate hypotheses or methods.** Provide a prompt with optional domain tags:

```
prompt = "[PHYS] Suggest a hypothesis for dark matter detection."
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_length=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

**Use NexaSci-1-CoT for reasoning.** Enable the CoT Processor for step-by-step logic:

```
prompt = "[BIO] [reasoning_required] Propose a method to predict protein folding."
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_length=500)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

**Process long documents with NexaSci-1-Max.** Handle large inputs (up to 20,000 tokens):

```
with open("arxiv_paper.txt", "r") as f:
    document = f.read()

prompt = f"[MAT] Summarize this document: {document}"
inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=20000).to("cuda")
# Use max_new_tokens so the generation budget is not consumed by the long prompt.
outputs = model.generate(**inputs, max_new_tokens=1000)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

**Fine-tune with QLoRA.** Use the provided instruction dataset for fine-tuning:

```
from peft import LoraConfig, get_peft_model
from datasets import load_dataset

dataset = load_dataset("your-username/nexasci-instruction-data")
lora_config = LoraConfig(r=8, lora_alpha=16, target_modules=["q", "v"])
model = get_peft_model(model, lora_config)
# Train with your preferred trainer (e.g., the Hugging Face Trainer).
```

**Run inference via the CLI or GUI.**

Command line:

```
python inference.py --model your-username/nexasci-base --prompt "[PHYS] Hypothesize a new superconductor."
```

The GUI option opens a web interface for interacting with the model (a minimal sketch follows).

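The repository does not ship the web interface yet, so the Gradio sketch below is only an assumption of what a minimal GUI could look like; it reuses the `model` and `tokenizer` loaded in the "Load a model" step above.

```
import gradio as gr

def answer(prompt: str) -> str:
    # Assumes `model` and `tokenizer` from the "Load a model" step.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_length=300)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

gr.Interface(fn=answer, inputs="text", outputs="text", title="NexaSci Demo").launch()
```
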
## Performance Metrics

- **Extreme Specialization**: Modular experts improve response fidelity and interpretability.
- **Distributed Training**: Full hardware saturation stabilizes runtimes and reduces crashes.
- **Generalizability**: Robust across physics, biology, and materials science tasks.
- **Optimizer Efficiency**: The AzureSky Optimizer enhances convergence speed and precision.

See the architecture document for detailed loss curves and metrics.

## Similar Models

Explore related models for inspiration:

- **Grok** (xAI): General-purpose conversational AI with scientific capabilities.
- **LLaMA** (Meta AI): Efficient research models for NLP tasks.
- **SciBERT**: BERT variant for scientific text processing.
- **Galactica** (Meta AI): Scientific language model for paper summarization.
- **BioBERT**: BERT variant for biomedical text.

For the models, cite:

Allanatrix. (2025). *NexaSci Family of Models*. Hugging Face. Retrieved June 17, 2025.

## Acknowledgements

We thank the scientific and AI communities for advancing Mixture-of-Experts architectures and domain-specific LLMs. Special thanks to the authors of the datasets used (arXiv, PubMed, Materials Project) and the developers of tools such as Transformers, PEFT, and Optuna.

For more information, see https://arxiv.org/, https://pubmed.ncbi.nlm.nih.gov/, and https://materialsproject.org/.

## License

Apache 2.0 (see the LICENSE file for details).

Have questions or ideas? Open an issue on GitHub or join the discussion on Hugging Face. Happy researching!