---
license: apache-2.0
datasets:
- Allanatrix/Scientific_Research_Tokenized
language:
- en
base_model:
- Allanatrix/NexaSci
pipeline_tag: text-generation
tags:
- Science
- Hypothesis
- Methodology
---
# NexaSci Family of Models
## Welcome to the NexaSci Repository!
Get ready to supercharge your scientific research with the **NexaSci family of models**! This Hugging Face repository hosts a powerful suite of Mixture-of-Experts (MoE) models designed to generate hypotheses and methodologies across **physics**, **biology**, and **materials science**. Built with efficiency and scalability in mind, the NexaSci family includes the baseline **NexaSci-1-Mini**, the reasoning-enhanced **NexaSci-1-CoT**, and the long-context powerhouse **NexaSci-1-Max**. Whether you’re a researcher tackling complex STEM problems, a data scientist exploring scientific ML, or a student learning about domain-specific AI, this repository is your go-to resource for cutting-edge scientific computation.
## Model Overview
The NexaSci family spans roughly 110 million to 2.2 billion parameters and uses a **Semantic Router** to direct queries to domain-specific expert modules (Physics, Biology, Materials Science). It’s optimized for resource-constrained environments, leveraging advanced training strategies, hardware optimizations, and techniques like reinforcement learning and sparse attention. Below are the current and planned models:
### 1. NexaSci-1-Mini (In development; indefinite timeline)
- **Parameters**: ~110 million
- **Purpose**: Generates hypotheses and methodological scaffolding for scientific tasks in physics, biology, and materials science.
- **Architecture**:
- **Semantic Router**: BERT-based classifier routes queries to domain-specific experts (see the routing sketch at the end of this section).
- **Expert Modules**: T5-based submodules for Physics, Biology, and Materials Science.
- **Inference & Validation Pipeline**: Aggregates expert outputs and ensures consistency.
- **Knowledge Feedback Loop**: Refines routing using reinforcement learning.
- **Training**:
- Pretrained on ~2B tokens from arXiv, PubMed, and other scientific corpora.
- Fine-tuned with QLoRA on 500k instruction-style samples.
- Uses AzureSky Optimizer (Stochastic Approximation + Adam hybrid).
- **Use Cases**:
- Generate plausible hypotheses (e.g., new material properties).
- Suggest experimental methods (e.g., protein folding protocols).
- Summarize scientific texts with domain-specific insights.
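The end-to-end flow for NexaSci-1-Mini, as described above, could look roughly like the sketch below. The checkpoint names and the router’s label format are placeholders (the actual router and expert weights are not listed here), so treat this as an illustration of the Semantic Router → expert hand-off rather than a working recipe:
```
# Hypothetical sketch of the Semantic Router -> expert hand-off.
# Checkpoint names are placeholders, not released NexaSci weights.
from transformers import pipeline, AutoTokenizer, AutoModelForSeq2SeqLM

EXPERTS = {                                   # one T5-style expert per domain
    "[PHYS]": "physics-expert-checkpoint",
    "[BIO]": "biology-expert-checkpoint",
    "[MAT]": "materials-expert-checkpoint",
}

# BERT-style classifier acting as the Semantic Router (placeholder checkpoint);
# it is assumed to emit one of the domain tags above as its label.
router = pipeline("text-classification", model="router-checkpoint")

def generate(query: str) -> str:
    domain = router(query)[0]["label"]              # e.g. "[PHYS]"
    name = EXPERTS.get(domain, EXPERTS["[PHYS]"])   # fall back to physics
    tokenizer = AutoTokenizer.from_pretrained(name)
    expert = AutoModelForSeq2SeqLM.from_pretrained(name)
    inputs = tokenizer(f"{domain} {query}", return_tensors="pt")
    outputs = expert.generate(**inputs, max_new_tokens=128)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
```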
### 2. NexaSci-1-CoT (Coming Soon)
- **Parameters**: 756 million to 1.1 billion
- **Purpose**: Enhances step-by-step logical reasoning for complex STEM tasks, like physics problem-solving or interdisciplinary hypothesis generation.
- **Architecture**:
- Adds a **Chain of Thought (CoT) Processor** with sparse attention (Longformer-style) for multi-step reasoning.
- Includes **Conditional Routing** to engage the CoT Processor based on a “reasoning_required” flag (see the dispatch sketch at the end of this section).
- Integrates with expert modules for structured, logical outputs.
- **Training**:
- Trained in three stages: Easy (basic logic), Moderate (complex tasks), Hard (advanced reasoning).
- Uses ~2B tokens
- Employs AzureSky Optimizer with reinforcement learning fine-tuning.
- **Use Cases**:
- Solve multi-step physics problems (e.g., astrophysics simulations).
- Generate detailed, logical methodologies (e.g., combining CFD and alloy modeling).
- Teach scientific reasoning in educational settings.
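The conditional routing mentioned above can be pictured as a small dispatcher keyed on the “reasoning_required” flag; this is an illustration only, assuming the flag is carried in the prompt as a literal tag (the CoT Processor’s real interface may differ):
```
# Illustrative dispatch on the "[reasoning_required]" flag.
def dispatch(prompt: str) -> str:
    if "[reasoning_required]" in prompt:
        # Engage the CoT Processor: sparse-attention, multi-step reasoning
        # whose intermediate plan is handed to the domain expert afterwards.
        return "cot_processor"
    # Otherwise route straight to the domain expert for a single-pass answer.
    return "expert_only"

print(dispatch("[BIO] [reasoning_required] Propose a method to predict protein folding."))
# -> cot_processor
```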
### 3. NexaSci-1-Max (Coming Soon)
- **Parameters**: ~2.2 billion
- **Purpose**: Processes large scientific documents (up to 20,000 tokens) with deep contextual understanding.
- **Architecture**:
- Features a **Long Context Attention Layer** with two Flash Attention v2 layers for efficient long-sequence processing.
- Includes a **Longform Context Manager** to chunk inputs while preserving semantic coherence (see the chunking sketch at the end of this section).
- Scales parameters using mixed precision training and gradient checkpointing.
- **Training**:
- Trained on ~2B tokens, including a Long-Context Corpus of full arXiv papers and NIH grants.
- Uses AzureSky Optimizer with mixed precision (FP16/BF16) and gradient checkpointing.
- **Use Cases**:
- Summarize or analyze long scientific papers (e.g., 120K-token preprints).
- Generate hypotheses from extended contexts (e.g., patent methods).
- Support multi-query tasks requiring deep document understanding.
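The Longform Context Manager mentioned above can be approximated with overlapping token windows, as in this sketch (the actual chunking strategy and window sizes are not published here):
```
# Sketch: split a long document into overlapping windows that each fit
# NexaSci-1-Max's 20,000-token context (illustrative only).
from transformers import AutoTokenizer

def chunk_document(text, tokenizer, max_tokens=20000, overlap=512):
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    chunks, start = [], 0
    while start < len(ids):
        window = ids[start:start + max_tokens]
        chunks.append(tokenizer.decode(window))
        if start + max_tokens >= len(ids):
            break
        start += max_tokens - overlap      # overlap preserves cross-chunk coherence
    return chunks

tokenizer = AutoTokenizer.from_pretrained("Allanatrix/NexaSci")
with open("arxiv_paper.txt") as f:
    chunks = chunk_document(f.read(), tokenizer)
```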
### Future Models (Planned)
- **NexaSci-1-Scout**: A lightweight version (~50M parameters) optimized for distilling and curating datasets and building the corpora for the model family.
- **NexaSci-1-Super**: A larger-scale model (~10B parameters) for advanced scientific tasks, using ~1B tokens. Planned for high-performance computing clusters.
- **NexaSci-1-MultiModal**: Integrates text, images, and graphs for scientific data analysis (e.g., protein structures, simulation plots). Planned for future research.
## Dataset and Training Details
The NexaSci family is trained on a **tiered token strategy** to maximize efficiency and domain specificity, as outlined in the architecture document:
- **Warm Start Corpus** (100M tokens): General language understanding from FineWeb-Edu, OpenWebMath, Wikipedia, and Aristo Science Questions.
- **Scientific Pretraining Corpus** (1-2B tokens): Domain-specific data from arXiv (physics), PubMed/BioRxiv (biology), and Materials Project/ChemRxiv (materials science).
- **Instruction Fine-Tune Dataset** (500K tokens): 5k high-quality instruction-style samples for hypothesis and method generation.
**Token Efficiency Strategies**:
- Entropy scoring to remove low-information samples.
- Semantic tagging (e.g., [PHYS], [BIO], [MTH]) for domain routing.
- Distillation using larger models (e.g., GPT-4) to summarize and structure data.
- Routing and filtering to activate only relevant expert paths.
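As an illustration of the entropy-scoring and tagging steps, a minimal filter might look like this (the threshold and tag rules below are assumptions):
```
# Drop low-information samples by character-level Shannon entropy, then
# prepend a domain tag for routing. The 3.5-bit threshold is a guess.
import math
from collections import Counter

def shannon_entropy(text):
    counts = Counter(text)
    n = len(text)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def filter_and_tag(samples, tag="[PHYS]", min_entropy=3.5):
    return [f"{tag} {s}" for s in samples if shannon_entropy(s) >= min_entropy]

print(filter_and_tag(["aaaaaaaa", "Dark matter may couple weakly to photons."]))
# -> ['[PHYS] Dark matter may couple weakly to photons.']
```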
**Total Token Budget**:
~2B tokens across all models.
**Hardware**:
Currently limited; we are actively sourcing additional compute.
**Optimization Techniques**:
- Sparse attention, mixed precision training, gradient checkpointing.
- Hyperparameter tuning with Optuna, Just-in-Time (JIT) compilation, multi-threading.
- AzureSky Optimizer for efficient convergence.
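A sketch of how the mixed-precision and checkpointing settings above might be expressed with the Hugging Face `Trainer` API; the values are illustrative, and since the AzureSky Optimizer is not publicly available, the default AdamW stands in:
```
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="nexasci-pretrain",
    bf16=True,                        # mixed precision (or fp16=True on older GPUs)
    gradient_checkpointing=True,      # trade recomputation for memory
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    learning_rate=2e-4,
    num_train_epochs=1,
)
# Hyperparameter tuning can use Trainer.hyperparameter_search(backend="optuna")
```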
## Download Models
Model weights are hosted on Hugging Face. Download them using the `transformers` library or directly from the repository’s model card. Example:
```
huggingface-cli download Allanatrix/NexaSci
```
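Alternatively, here is a minimal Python download using `huggingface_hub` (assuming the weights are publicly accessible):
```
from huggingface_hub import snapshot_download

# Download the full model repository into the local Hugging Face cache
local_dir = snapshot_download(repo_id="Allanatrix/NexaSci")
print(local_dir)
```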
## Usage
**Load a model.** Use the `transformers` library to load NexaSci models:
```
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Allanatrix/NexaSci"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")
```
**Generate hypotheses or methods.** Provide a prompt with optional domain tags:
```
prompt = "[PHYS] Suggest a hypothesis for dark matter detection."
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_length=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
**Use NexaSci-1-CoT for reasoning.** Enable the CoT Processor for step-by-step logic:
```
prompt = "[BIO] [reasoning_required] Propose a method to predict protein folding."
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_length=500)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
**Process long documents with NexaSci-1-Max.** Handle large inputs (up to 20,000 tokens):
```
with open("arxiv_paper.txt", "r") as f:
    document = f.read()
prompt = f"[MAT] Summarize this document: {document}"
inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=20000).to("cuda")
outputs = model.generate(**inputs, max_length=1000)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
**Fine-tune with QLoRA.** Use the provided instruction dataset for fine-tuning:
```
from peft import LoraConfig, get_peft_model
from datasets import load_dataset

dataset = load_dataset("your-username/nexamoe-instruction-data")
lora_config = LoraConfig(r=8, lora_alpha=16, target_modules=["q", "v"])
model = get_peft_model(model, lora_config)
# Train with your preferred trainer (e.g., Hugging Face Trainer)
```
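As a rough sketch of that last step, training could proceed with the Hugging Face `Trainer`; the `"text"` column name and all hyperparameters below are assumptions, not the published training setup:
```
from transformers import Trainer, TrainingArguments, DataCollatorForLanguageModeling

def tokenize(batch):
    # "text" is an assumed column name for the instruction samples
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset["train"].map(tokenize, batched=True,
                                 remove_columns=dataset["train"].column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="nexasci-qlora",
                           per_device_train_batch_size=2,
                           gradient_accumulation_steps=8,
                           fp16=True,
                           num_train_epochs=1),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```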
**Run inference via CLI or GUI.** From the command line:
```
python inference.py --model Allanatrix/NexaSci --prompt "[PHYS] Hypothesize a new superconductor."
```
The GUI mode opens a web interface for interacting with the model.
## Performance Metrics
- **Extreme specialization**: Modular experts improve response fidelity and interpretability.
- **Distributed training**: Full hardware saturation stabilizes runtimes and reduces crashes.
- **Generalizability**: Robust across physics, biology, and materials science tasks.
- **Optimizer efficiency**: The AzureSky Optimizer enhances convergence speed and precision.

See the architecture document for detailed loss curves and metrics.
## Similar Models
Explore related models for inspiration:
- **Grok** (xAI): General-purpose conversational AI with scientific capabilities.
- **LLaMA** (Meta AI): Efficient research models for NLP tasks.
- **SciBERT**: BERT variant for scientific text processing.
- **Galactica** (Meta AI): Scientific language model for paper summarization.
- **BioBERT**: BERT variant for biomedical text.
## Citation
For the models, cite:
> Allanatrix. (2025). *NexaSci Family of Models*. Hugging Face. Retrieved June 17, 2025.
## Acknowledgements
We thank the scientific and AI communities for advancing Mixture-of-Experts architectures and domain-specific LLMs. Special thanks to the authors of the datasets used (arXiv, PubMed, Materials Project) and the developers of tools like Transformers, PEFT, and Optuna.
For more information, see https://materialsproject.org/, https://arxiv.org/, https://pubmed.ncbi.nlm.nih.gov/
## License
Apache 2.0 (see the LICENSE file for details).
Have questions or ideas? Open an issue on GitHub or join the discussion on Hugging Face. Happy researching!