subashsn Nurman committed on
Commit 547dcf8 · verified · 1 Parent(s): d63a0f6

Update README.md (#3)


- Update README.md (1dec455ecc08d88167a06865afdb964a876f6345)


Co-authored-by: Amril Nurman <[email protected]>

Files changed (1)
  1. README.md +30 -2
README.md CHANGED
@@ -10,9 +10,37 @@ language:
  pipeline_tag: text-generation
  ---
 
- # Qwen3-1.7B (from-scratch, 41B-token pretrain)
+ # QVAC Genesis I Pretrained Model
 
- A 1.7B-parameter decoder-only transformer (Qwen3 family) pre-trained **from scratch** on ~**40B tokens** of multi-domain text with **BF16 mixed precision** and a **4,096-token** context. Checkpoints are provided in standard Hugging Face format for easy inference and fine-tuning.
+ ## Key Highlights
+ - **Pretrained on the Largest Synthetic Educational Dataset**
+   This model has been **pretrained on Tether's QVAC Genesis I**, the largest synthetic dataset released for educational LLM pre-training.
+
+   The model was trained **from scratch** on approximately **41B tokens** of multi-domain educational text, using **BF16 mixed precision** and a **4,096-token context window**, on a **Qwen3-family 1.7B-parameter decoder-only transformer** architecture.
+
+   Checkpoints are provided in standard Hugging Face format for easy inference, continual pre-training, and fine-tuning (a minimal loading sketch follows the diff).
+
+ - **Multi-Domain Educational Coverage**
+   Because the model is trained on QVAC Genesis I, it inherits curriculum-aligned coverage across:
+   - Mathematics
+   - Physics
+   - Biology
+   - Medicine
+
+ - **Superior Benchmark Performance**
+   Leveraging QVAC Genesis I as its training foundation, the model consistently outperforms baselines on:
+   - Reasoning tasks
+   - Knowledge assessments
+   - Subject-specific QA
+
+ - **First Publicly Released Education-Specific Pretrained Model**
+   This is the first open-source pretrained model built directly on a rigorously validated synthetic dataset for education, offering deep and comprehensive STEM coverage.
+
+ ## Intended Uses
+ - Continual pre-training or fine-tuning for educational applications (STEM-focused tutoring, QA systems, curriculum support)
+ - Benchmarking reasoning and subject-specific QA performance
+ - Research into synthetic dataset–driven LLM training
 
  ---
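Since the updated card states that checkpoints ship in standard Hugging Face format, here is a minimal inference sketch. It assumes the usual `transformers` causal-LM API; the repo id `org/qvac-genesis-i-1.7b` is a placeholder, since this diff does not name the actual checkpoint path.

```python
# Minimal inference sketch, assuming standard Hugging Face `transformers` APIs.
# "org/qvac-genesis-i-1.7b" is a hypothetical repo id; substitute the real checkpoint.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "org/qvac-genesis-i-1.7b"  # placeholder, not named in this diff

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    torch_dtype=torch.bfloat16,  # matches the BF16 training precision noted in the card
    device_map="auto",
)

# This is a base (pretrained-only) model, so prompt it for continuation, not chat.
prompt = "Photosynthesis converts light energy into"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Because pretraining used a 4,096-token context window, prompts beyond that length would need truncation; the same checkpoint can also be passed to standard `transformers` training workflows for the continual pre-training and fine-tuning uses the card lists.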