Allanatrix committed
Commit fcfaa7e · verified · 1 Parent(s): a57e4fb

Update README.md

Files changed (1):
  1. README.md +13 -19
README.md CHANGED
@@ -13,17 +13,17 @@ tags:
  - Methodology
  ---
 
- # NexaMOE Family of Models
+ # NexaSci Family of Models
 
- ## Welcome to the NexaMOE Repository!
+ ## Welcome to the NexaSci Repository!
 
- Get ready to supercharge your scientific research with the **NexaMOE family of models**! This Hugging Face repository hosts a powerful suite of Mixture-of-Experts (MoE) models designed to generate hypotheses and methodologies across **physics**, **biology**, and **materials science**. Built with efficiency and scalability in mind, the NexaMOE family includes the baseline **NexaMOE**, the reasoning-enhanced **NEXA-CoT**, and the long-context powerhouse **NEXA-Ultramax**. Whether you’re a researcher tackling complex STEM problems, a data scientist exploring scientific ML, or a student learning about domain-specific AI, this repository is your go-to resource for cutting-edge scientific computation.
+ Get ready to supercharge your scientific research with the **NexaSci family of models**! This Hugging Face repository hosts a powerful suite of Mixture-of-Experts (MoE) models designed to generate hypotheses and methodologies across **physics**, **biology**, and **materials science**. Built with efficiency and scalability in mind, the NexaSci family includes the baseline **NexaSci**, the reasoning-enhanced **NEXASci-1-CoT**, and the long-context powerhouse **NEXA-1-Max**. Whether you’re a researcher tackling complex STEM problems, a data scientist exploring scientific ML, or a student learning about domain-specific AI, this repository is your go-to resource for cutting-edge scientific computation.
 
  ## Model Overview
 
  The NexaSci family is a 110 million to 2.2 billion parameter architecture that uses a **Semantic Router** to direct queries to domain-specific expert modules (Physics, Biology, Materials Science). It’s optimized for resource-constrained environments, leveraging advanced training strategies, hardware optimizations, and techniques like reinforcement learning and sparse attention. Below are the current and planned models:
 
- ### 1. NexaSci-1 (Still working on this Indefinite timeline)
+ ### 1. NexaSci-1-Mini (in development; indefinite timeline)
  - **Parameters**: ~110 million
  - **Purpose**: Generates hypotheses and methodological scaffolding for scientific tasks in physics, biology, and materials science.
  - **Architecture**:
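To make the **Semantic Router** described in the Model Overview concrete, here is a minimal sketch of top-1 gating over three domain experts (Physics, Biology, Materials Science). It assumes a PyTorch setup; the class name, embedding size, and routing rule are illustrative and not taken from the NexaSci codebase.

```python
import torch
import torch.nn as nn

class SemanticRouter(nn.Module):
    """Toy gating network: scores a pooled query embedding against three
    domain experts and routes the query to the top-scoring one."""

    def __init__(self, d_model: int = 512, n_experts: int = 3):
        super().__init__()
        self.gate = nn.Linear(d_model, n_experts)

    def forward(self, query_emb: torch.Tensor):
        # query_emb: (batch, d_model) pooled representation of the query
        logits = self.gate(query_emb)
        weights = torch.softmax(logits, dim=-1)  # soft routing weights
        expert_idx = weights.argmax(dim=-1)      # hard top-1 expert choice
        return expert_idx, weights

# Route a batch of two pooled query embeddings to experts 0-2.
router = SemanticRouter()
idx, weights = router(torch.randn(2, 512))
print(idx.tolist(), weights.shape)  # e.g. [0, 2] torch.Size([2, 3])
```

In a fuller MoE setup the soft weights would typically also feed a load-balancing loss; the hard index is what selects which expert module actually runs.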
@@ -32,8 +32,8 @@ The NexaSci family is a 110 million to 2.2 billion parameter architecture that u
  - **Inference & Validation Pipeline**: Aggregates expert outputs and ensures consistency.
  - **Knowledge Feedback Loop**: Refines routing using reinforcement learning.
  - **Training**:
- - Pretrained on ~325M tokens from arXiv, PubMed, and other scientific corpora.
- - Fine-tuned with QLoRA on 300k instruction-style samples.
+ - Pretrained on ~2B tokens from arXiv, PubMed, and other scientific corpora.
+ - Fine-tuned with QLoRA on 500k instruction-style samples.
  - Uses AzureSky Optimizer (Stochastic Approximation + Adam hybrid).
  - **Use Cases**:
  - Generate plausible hypotheses (e.g., new material properties).
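The fine-tuning bullet above mentions QLoRA over instruction-style samples. Below is a minimal sketch of such a setup with the Hugging Face `transformers` and `peft` libraries; the base checkpoint name, LoRA rank, and target module names are placeholders, not the actual NexaSci training configuration.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit NF4 quantization of the frozen base model (the "Q" in QLoRA).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# "your-base-checkpoint" is a placeholder; no NexaSci weights are assumed here.
model = AutoModelForCausalLM.from_pretrained(
    "your-base-checkpoint", quantization_config=bnb_config
)

# Low-rank adapters are the only weights updated on the instruction samples.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # depends on the base architecture
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # adapters only; the 4-bit base stays frozen
```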
@@ -49,7 +49,7 @@ The NexaSci family is a 110 million to 2.2 billion parameter architecture that u
  - Integrates with expert modules for structured, logical outputs.
  - **Training**:
  - Trained in three stages: Easy (basic logic), Moderate (complex tasks), Hard (advanced reasoning).
- - Uses ~425-500M tokens, including a Reasoning Curriculum Dataset (50-75M tokens) for CoT optimization.
+ - Uses ~2B tokens
  - Employs AzureSky Optimizer with reinforcement learning fine-tuning.
  - **Use Cases**:
  - Solve multi-step physics problems (e.g., astrophysics simulations).
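The three-stage curriculum (Easy, Moderate, Hard) amounts to a staged training schedule. The sketch below only illustrates that control flow; the dataset names, mixes, and step counts are assumptions, not the actual CoT recipe.

```python
# Illustrative three-stage curriculum schedule (Easy -> Moderate -> Hard).
CURRICULUM = [
    {"stage": "easy", "datasets": ["basic_logic"], "steps": 2_000},
    {"stage": "moderate", "datasets": ["basic_logic", "multi_step_tasks"], "steps": 4_000},
    {"stage": "hard", "datasets": ["multi_step_tasks", "advanced_reasoning"], "steps": 6_000},
]

def run_curriculum(train_step, sample_batch):
    """train_step(batch) and sample_batch(dataset_names) come from the trainer."""
    for phase in CURRICULUM:
        for _ in range(phase["steps"]):
            batch = sample_batch(phase["datasets"])  # mix only this stage's data
            train_step(batch)
        print(f"finished stage: {phase['stage']}")

# Stub callables, just to show the control flow end to end.
run_curriculum(train_step=lambda batch: None,
               sample_batch=lambda names: {"source": names[0]})
```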
@@ -64,10 +64,10 @@ The NexaSci family is a 110 million to 2.2 billion parameter architecture that u
  - Includes a **Longform Context Manager** to chunk inputs while preserving semantic coherence.
  - Scales parameters using mixed precision training and gradient checkpointing.
  - **Training**:
- - Trained on ~600-650M tokens, including a Long-Context Corpus (100-150M tokens) of full arXiv papers and NIH grants.
+ - Trained on ~2B tokens, including a Long-Context Corpus of full arXiv papers and NIH grants.
  - Uses AzureSky Optimizer with mixed precision (FP16/BF16) and gradient checkpointing.
  - **Use Cases**:
- - Summarize or analyze long scientific papers (e.g., 20K-token preprints).
+ - Summarize or analyze long scientific papers (e.g., 120K-token preprints).
  - Generate hypotheses from extended contexts (e.g., patent methods).
  - Support multi-query tasks requiring deep document understanding.
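The **Longform Context Manager** above is described as chunking long inputs while preserving semantic coherence. One rough way to do that is to split at paragraph boundaries under a token budget with a small overlap, as sketched below; the budget, overlap, and whitespace token approximation are assumptions, not the model's actual preprocessing.

```python
def chunk_document(text: str, max_tokens: int = 4096, overlap: int = 256) -> list:
    """Split a long document into overlapping chunks, breaking at paragraph
    boundaries so each chunk stays semantically coherent. Token counts are
    approximated by whitespace word counts for simplicity."""
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    chunks, current, current_len = [], [], 0
    for para in paragraphs:
        n = len(para.split())
        if current and current_len + n > max_tokens:
            chunks.append("\n\n".join(current))
            # Carry the tail of the previous chunk forward as overlap.
            tail = " ".join(" ".join(current).split()[-overlap:])
            current, current_len = [tail], len(tail.split())
        current.append(para)
        current_len += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks

# A 120K-token preprint would come back as roughly 30 overlapping ~4K-token chunks.
# chunks = chunk_document(open("preprint.txt").read())
```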
 
@@ -81,10 +81,8 @@ The NexaSci family is a 110 million to 2.2 billion parameter architecture that u
  The NexaSci family is trained on a **tiered token strategy** to maximize efficiency and domain specificity, as outlined in the architecture document:
 
  - **Warm Start Corpus** (100M tokens): General language understanding from FineWeb-Edu, OpenWebMath, Wikipedia, and Aristo Science Questions.
- - **Scientific Pretraining Corpus** (200-300M tokens): Domain-specific data from arXiv (physics), PubMed/BioRxiv (biology), and Materials Project/ChemRxiv (materials science).
- - **Instruction Fine-Tune Dataset** (25-30M tokens): 300k high-quality instruction-style samples for hypothesis and method generation.
- - **Reasoning Curriculum Dataset** (50-75M tokens, CoT only): SciBench, OpenBookQA, and others for step-by-step reasoning.
- - **Long-Context Corpus** (100-150M tokens, UltraMAX only): Full arXiv papers, NIH grants, and USPTO patents for long-context alignment.
+ - **Scientific Pretraining Corpus** (1-2B tokens): Domain-specific data from arXiv (physics), PubMed/BioRxiv (biology), and Materials Project/ChemRxiv (materials science).
+ - **Instruction Fine-Tune Dataset** (500K tokens): 5k high-quality instruction-style samples for hypothesis and method generation.
 
  **Token Efficiency Strategies**:
  - Entropy scoring to remove low-information samples.
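Entropy scoring, listed above as a token-efficiency strategy, can be approximated with a word-level Shannon entropy filter. The sketch below is one plausible version; the threshold is an illustrative value, not the one used to build the NexaSci corpora.

```python
import math
from collections import Counter

def shannon_entropy(text: str) -> float:
    """Word-level Shannon entropy in bits; low values flag repetitive,
    low-information text such as boilerplate or scraped navigation menus."""
    words = text.lower().split()
    if not words:
        return 0.0
    counts = Counter(words)
    total = len(words)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def filter_low_information(samples, min_entropy: float = 3.0):
    # min_entropy is an assumed cutoff; tune it per corpus.
    return [s for s in samples if shannon_entropy(s) >= min_entropy]

print(shannon_entropy("the cat sat on the mat"))   # ~2.25 bits, kept at lower cutoffs
print(shannon_entropy("buy now buy now buy now"))  # 1.0 bit, filtered out
```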
@@ -93,14 +91,10 @@ The NexaSci family is trained on a **tiered token strategy** to maximize efficie
  - Routing and filtering to activate only relevant expert paths.
 
  **Total Token Budget**:
- - NexaMOE-Mini: ~325M tokens
- - NEXA-CoT: ~425-500M tokens
- - NEXA-Ultramax: ~600-650M tokens
+ - ~2B tokens for all models.
 
  **Hardware**:
- - CPU: Intel i5 vPro 8th Gen (overclocked to 6.0 GHz) with 16 GB RAM.
- - GPUs: Dual NVIDIA T4 GPUs (cloud-hosted) at 90%+ capacity.
- - Performance: 47-50 petaflops with an optimized CPU-GPU pipeline.
+ - Currently limited; suitable hardware is still being sourced.
 
  **Optimization Techniques**:
  - Sparse attention, mixed precision training, gradient checkpointing.
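Two of the optimization techniques listed above, mixed precision training and gradient checkpointing, combine in PyTorch roughly as sketched below; the toy model, batch size, and dtype choices are illustrative only.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class TinyBlockModel(nn.Module):
    """Toy two-block network used only to demonstrate the techniques."""

    def __init__(self, d: int = 512):
        super().__init__()
        self.block1 = nn.Sequential(nn.Linear(d, d), nn.GELU())
        self.block2 = nn.Sequential(nn.Linear(d, d), nn.GELU())
        self.head = nn.Linear(d, 1)

    def forward(self, x):
        # Gradient checkpointing: recompute block activations during backward
        # instead of storing them, trading extra compute for lower memory.
        x = checkpoint(self.block1, x, use_reentrant=False)
        x = checkpoint(self.block2, x, use_reentrant=False)
        return self.head(x)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = TinyBlockModel().to(device)
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))  # no-op on CPU

x = torch.randn(8, 512, device=device)
with torch.autocast(device_type=device,
                    dtype=torch.float16 if device == "cuda" else torch.bfloat16):
    loss = model(x).mean()          # forward pass runs in reduced precision
scaler.scale(loss).backward()       # loss scaling guards against FP16 underflow
scaler.step(opt)
scaler.update()
```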
 