EssentialAI/eai-distill-0.5b

Research-EAI committed 7 days ago

Commit

e5c15c1

1 Parent(s): c99a9f5

Update README.md

README.md +121 -3

@@ -1,3 +1,121 @@
----
-license: apache-2.0
----

+---
+license: apache-2.0
+---
+# 🏷️ EAI-Taxonomy-0.5b
+## 📋 Model Description
+EAI-Taxonomy-0.5b is a fine-tuned version of Qwen2.5-0.5B-Instruct designed for document classification across 12 taxonomic categories. This model is optimized for high-throughput classification of web documents and produces structured metadata for large-scale dataset curation.
+The model classifies documents across the following dimensions:
+- **📚 Free Decimal Correspondence (FDC)**: Subject matter classification based on the Dewey Decimal System
+- **🧠 Bloom's Taxonomy**: Cognitive process (Remember/Understand/Apply/Analyze/Evaluate/Create) and knowledge domain (Factual/Conceptual/Procedural/Metacognitive)
+- **📄 Document Type**: Web page categorization (News, Academic, Reference, Code, Social, etc.)
+- **🔍 Content Quality**: Extraction artifacts, missing content detection
+- **🎓 Educational Metadata**: Reasoning depth, technical correctness, educational level
+## 🚀 Training Details
+- **🤖 Base Model**: Qwen2.5-0.5B-Instruct
+- **📊 Training Data**: 82B synthetic tokens generated by Qwen2.5-32B-Instruct (teacher model) on 104M Common Crawl documents
+- **⚙️ Optimizer**: AdamW (β₁=0.9, β₂=0.95, weight_decay=0.1)
+- **📈 Learning Rate**: 1×10⁻⁴ with linear warmup (2B tokens), cosine decay to 1×10⁻⁵, then linear anneal to 0
+- **📦 Batch Size**: 2M tokens
+- **📏 Sequence Length**: 16,384 tokens
+- **💻 Hardware**: Trained on AMD MI300x GPUs
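The learning-rate schedule described above can be sketched as a function of tokens seen. This is an illustration only, not the training code: the card gives the peak rate, the 2B-token warmup, and the 1×10⁻⁵ cosine floor, but not the length of the final anneal, so `ANNEAL_TOKENS` is a hypothetical value, and treating the 82B tokens as the full run length is also an assumption.

```python
import math

PEAK_LR = 1e-4         # from the card
FLOOR_LR = 1e-5        # cosine decay target, from the card
WARMUP_TOKENS = 2e9    # from the card
TOTAL_TOKENS = 82e9    # ASSUMPTION: one pass over the 82B training tokens
ANNEAL_TOKENS = 4e9    # ASSUMPTION: the card does not state the anneal length


def lr_at(tokens_seen):
    """Learning rate after `tokens_seen` tokens under the assumed schedule."""
    decay_end = TOTAL_TOKENS - ANNEAL_TOKENS
    if tokens_seen < WARMUP_TOKENS:
        # Linear warmup from 0 to the peak rate.
        return PEAK_LR * tokens_seen / WARMUP_TOKENS
    if tokens_seen < decay_end:
        # Cosine decay from the peak rate down to the floor.
        progress = (tokens_seen - WARMUP_TOKENS) / (decay_end - WARMUP_TOKENS)
        return FLOOR_LR + 0.5 * (PEAK_LR - FLOOR_LR) * (1 + math.cos(math.pi * progress))
    if tokens_seen < TOTAL_TOKENS:
        # Linear anneal from the floor to 0.
        return FLOOR_LR * (TOTAL_TOKENS - tokens_seen) / ANNEAL_TOKENS
    return 0.0
```

With a 2M-token batch, step `n` corresponds to `tokens_seen = n * 2e6`, so a per-step scheduler could call `lr_at(step * 2e6)`.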
+## 📊 Performance
+The model achieves Cohen's κ agreement of 0.71-0.74 with human annotators across evaluation categories, demonstrating strong classification performance while being 64× smaller than the teacher model.
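For reference, Cohen's κ compares the observed agreement p_o between two annotators with the agreement p_e expected by chance: κ = (p_o − p_e) / (1 − p_e). The sketch below computes it for two label lists. The labels are made up for illustration, and this is not the evaluation code behind the reported numbers, which may aggregate categories differently.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both raters.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: sum over labels of the product of each rater's marginal rate.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Hypothetical labels: human annotator vs. model on four documents.
human = [1, 1, 2, 2]
model_labels = [1, 1, 2, 1]
kappa = cohens_kappa(human, model_labels)
```

Here p_o = 0.75 and p_e = 0.5, so `kappa` is 0.5.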
+## 💻 Usage
+```python
+import random
+
+from transformers import AutoModelForCausalLM, AutoTokenizer
+
+# Placeholder repository ID; replace it with the published model ID.
+MODEL_ID = "your-org/EAI-Taxonomy-0.5b"
+
+# Load model and tokenizer
+tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
+model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
+
+
+def chunk_text(text, max_char_per_doc=30000):
+    """Return short documents unchanged; otherwise sample the beginning, middle, and end."""
+    if len(text) <= max_char_per_doc:
+        return text
+    chunk_size = max_char_per_doc // 3
+    half = chunk_size // 2
+    start = text[:chunk_size]
+    # Pick a random centre so the middle window never overlaps the start or end chunks.
+    mid_point = random.randint(chunk_size + half, len(text) - chunk_size - half)
+    middle = text[mid_point - half:mid_point + half]
+    end = text[-chunk_size:]
+    return f"[beginning]\n{start}\n[middle]\n{middle}\n[end]\n{end}"
+
+
+def classify_document(text):
+    """Classify one document and return the model's condensed label string."""
+    messages = [
+        {"role": "system", "content": "taxonomy"},
+        {"role": "user", "content": chunk_text(text)},
+    ]
+    prompt = tokenizer.apply_chat_template(
+        messages,
+        tokenize=False,
+        add_generation_prompt=True,
+    )
+    inputs = tokenizer(prompt, return_tensors="pt")
+    outputs = model.generate(**inputs, max_new_tokens=100)
+    # generate() returns prompt + completion; decode only the newly generated tokens.
+    new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
+    return tokenizer.decode(new_tokens, skip_special_tokens=True)
+
+
+# Example usage
+document_text = "Your document content here..."
+classification = classify_document(document_text)
+print(classification)
+```
+## 📤 Output Format
+The model outputs classifications in a condensed format, one category per line, each giving a primary label followed by a secondary label or skip:
+```
+{FDC primary},{FDC secondary or skip}
+{Bloom cognitive process primary (1-6)},{Bloom cognitive process secondary (1-6) or skip}
+{Bloom knowledge domain primary (1-4)},{Bloom knowledge domain secondary (1-4) or skip}
+{Document type v1 primary (1-17)},{Document type v1 secondary (1-17) or skip}
+{Extraction artifacts primary (0-4)},{Extraction artifacts secondary (0-4) or skip}
+{Missing content primary (0-6)},{Missing content secondary (0-6) or skip}
+{Document type v2 primary (1-25)},{Document type v2 secondary (1-25) or skip}
+{Reasoning depth primary (1-6)},{Reasoning depth secondary (1-6) or skip}
+{Technical correctness primary (1-6)},{Technical correctness secondary (1-6) or skip}
+{Educational level primary (1-5)},{Educational level secondary (1-5) or skip}
+```
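A minimal sketch of turning this condensed output into a dictionary. The field names in `FIELDS` and the sample string are hypothetical, chosen to mirror the ten lines of the template, and the code assumes the literal word `skip` marks a missing secondary label. Values are kept as strings because the card does not specify the exact form of FDC codes.

```python
FIELDS = [
    "fdc",
    "bloom_cognitive_process",
    "bloom_knowledge_domain",
    "document_type_v1",
    "extraction_artifacts",
    "missing_content",
    "document_type_v2",
    "reasoning_depth",
    "technical_correctness",
    "educational_level",
]


def parse_taxonomy_output(raw):
    """Map each non-empty output line to {'primary': str, 'secondary': str or None}."""
    lines = [line.strip() for line in raw.strip().splitlines() if line.strip()]
    if len(lines) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} lines, got {len(lines)}")
    parsed = {}
    for name, line in zip(FIELDS, lines):
        primary, _, secondary = line.partition(",")
        secondary = secondary.strip()
        parsed[name] = {
            "primary": primary.strip(),
            # Treat an empty field or the word "skip" as "no secondary label".
            "secondary": None if secondary in ("", "skip") else secondary,
        }
    return parsed


# Made-up sample in the documented shape, not real model output.
sample = "512,004\n3,skip\n2,3\n10,skip\n0,skip\n0,skip\n8,skip\n3,2\n4,skip\n3,skip"
labels = parse_taxonomy_output(sample)
```

Raising on an unexpected line count keeps truncated generations (for example, when `max_new_tokens` is too small) from being silently misaligned with the field names.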
+## 🎯 Intended Use
+This model is designed for:
+- 🏗️ Large-scale web document classification and metadata generation
+- 🔧 Dataset curation through taxonomic filtering
+- ✅ Content quality assessment for training data preparation
+- 📚 Educational content analysis and organization
+## ⚠️ Limitations
+- Optimized for English web documents extracted using resiliparse
+- Documents over 30,000 characters are reduced to three 10,000-character excerpts (the beginning, a randomly positioned middle window, and the end), which may affect classification accuracy
+- Performance may vary on content significantly different from Common Crawl web data
+- Classification categories are based on web content patterns and may not generalize to other document types
+## 📝 Citation
+If you use this model, please cite:
+```bibtex
+@article{essential-web-2024,
+  title={Essential-Web: A 24-Trillion Token Dataset with Extensive Metadata for Training LLMs},
+  author={[Your Authors]},
+  year={2024}
+}
+```