---
license: apache-2.0
---

# 🏷️ EAI-Taxonomy-0.5b

## 📋 Model Description

EAI-Taxonomy-0.5b is a fine-tuned version of Qwen2.5-0.5B-Instruct designed for document classification across 12 taxonomic categories. This model is optimized for high-throughput classification of web documents and produces structured metadata for large-scale dataset curation.

The model classifies documents across the following dimensions:
- **📚 Free Decimal Correspondence (FDC)**: Subject-matter classification based on the Dewey Decimal System
- **🧠 Bloom's Taxonomy**: Cognitive process (Remember/Understand/Apply/Analyze/Evaluate/Create) and knowledge domain (Factual/Conceptual/Procedural/Metacognitive)
- **📄 Document Type**: Web page categorization (News, Academic, Reference, Code, Social, etc.)
- **🔍 Content Quality**: Extraction artifact and missing-content detection
- **🎓 Educational Metadata**: Reasoning depth, technical correctness, educational level

## 🚀 Training Details

- **🤖 Base Model**: Qwen2.5-0.5B-Instruct
- **📊 Training Data**: 82B synthetic tokens generated by Qwen2.5-32B-Instruct (the teacher model) over 104M Common Crawl documents
- **⚙️ Optimizer**: AdamW (β₁=0.9, β₂=0.95, weight_decay=0.1)
- **📈 Learning Rate**: 1×10⁻⁴ with linear warmup (2B tokens), cosine decay to 1×10⁻⁵, then a linear anneal to 0 (sketched below)
- **📦 Batch Size**: 2M tokens
- **📏 Sequence Length**: 16,384 tokens
- **💻 Hardware**: Trained on AMD MI300X GPUs
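
A minimal sketch of the schedule in the Learning Rate bullet is below. The boundary between the cosine-decay phase and the final linear anneal is not stated above, so `anneal_start_tokens` (and the helper itself) is an assumption for illustration, not the training code.

```python
import math

def lr_at(tokens_seen,
          peak_lr=1e-4, final_cosine_lr=1e-5,
          warmup_tokens=2e9, total_tokens=82e9,
          anneal_start_tokens=72e9):  # assumed boundary for the final anneal
    """Illustrative schedule: linear warmup -> cosine decay -> linear anneal to 0."""
    if tokens_seen < warmup_tokens:
        # linear warmup from 0 to the peak learning rate over the first 2B tokens
        return peak_lr * tokens_seen / warmup_tokens
    if tokens_seen < anneal_start_tokens:
        # cosine decay from peak_lr down to final_cosine_lr
        progress = (tokens_seen - warmup_tokens) / (anneal_start_tokens - warmup_tokens)
        return final_cosine_lr + 0.5 * (peak_lr - final_cosine_lr) * (1.0 + math.cos(math.pi * progress))
    # linear anneal from final_cosine_lr down to 0 over the remaining tokens
    remaining = max(total_tokens - tokens_seen, 0.0)
    return final_cosine_lr * remaining / (total_tokens - anneal_start_tokens)
```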

## 📊 Performance

The model achieves Cohen's κ agreement of 0.71-0.74 with human annotators across evaluation categories, demonstrating strong classification performance while being 64× smaller than the teacher model.

## 💻 Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import random

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("your-org/EAI-Taxonomy-0.5b", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("your-org/EAI-Taxonomy-0.5b")

def chunk_text(text, max_char_per_doc=30000):
    """Reduce long documents to a beginning/middle/end sample before classification."""
    if len(text) <= max_char_per_doc:
        return text

    chunk_size = max_char_per_doc // 3
    start = text[:chunk_size]

    # pick a random window from the middle of the document
    middle_start = chunk_size
    middle_end = len(text) - chunk_size
    mid_point = random.randint(middle_start + chunk_size // 2, middle_end - chunk_size // 2)
    middle = text[mid_point - chunk_size // 2:mid_point + chunk_size // 2]

    end = text[-chunk_size:]
    return f"[beginning]\n{start}\n[middle]\n{middle}\n[end]\n{end}"

def classify_document(text):
    chunked_text = chunk_text(text)

    messages = [
        {"role": "system", "content": "taxonomy"},
        {"role": "user", "content": chunked_text},
    ]

    prompt = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )

    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=100)
    # decode only the newly generated tokens, i.e. the classification itself
    new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)

# Example usage
document_text = "Your document content here..."
classification = classify_document(document_text)
print(classification)
```
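
For the high-throughput setting described above, prompts can also be processed in batches. The sketch below is one possible approach with plain `transformers`, reusing `tokenizer`, `model`, and `chunk_text` from the snippet above; `classify_batch` and its parameters are illustrative, not part of a released API.

```python
def classify_batch(texts, batch_size=16):
    """Illustrative batched classification with left-padded prompts."""
    tokenizer.padding_side = "left"  # left-pad so every prompt ends where generation begins
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    results = []
    for i in range(0, len(texts), batch_size):
        prompts = [
            tokenizer.apply_chat_template(
                [{"role": "system", "content": "taxonomy"},
                 {"role": "user", "content": chunk_text(t)}],
                tokenize=False,
                add_generation_prompt=True,
            )
            for t in texts[i:i + batch_size]
        ]
        inputs = tokenizer(prompts, return_tensors="pt", padding=True)
        outputs = model.generate(**inputs, max_new_tokens=100)
        # keep only the newly generated tokens for each row
        new_tokens = outputs[:, inputs["input_ids"].shape[1]:]
        results.extend(tokenizer.batch_decode(new_tokens, skip_special_tokens=True))
    return results
```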

## 📤 Output Format

The model outputs classifications in a condensed format:
```
{FDC primary},{FDC secondary or skip}
{Bloom cognitive process primary (1-6)},{Bloom cognitive process secondary (1-6) or skip}
{Bloom knowledge domain primary (1-4)},{Bloom knowledge domain secondary (1-4) or skip}
{Document type v1 primary (1-17)},{Document type v1 secondary (1-17) or skip}
{Extraction artifacts primary (0-4)},{Extraction artifacts secondary (0-4) or skip}
{Missing content primary (0-6)},{Missing content secondary (0-6) or skip}
{Document type v2 primary (1-25)},{Document type v2 secondary (1-25) or skip}
{Reasoning depth primary (1-6)},{Reasoning depth secondary (1-6) or skip}
{Technical correctness primary (1-6)},{Technical correctness secondary (1-6) or skip}
{Educational level primary (1-5)},{Educational level secondary (1-5) or skip}
```
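
Based on this layout, each output line can be split into a primary/secondary pair. The field names and the `parse_taxonomy` helper below are illustrative rather than an official parser, and assume the input contains only the generated classification text.

```python
FIELDS = [
    "fdc", "bloom_cognitive_process", "bloom_knowledge_domain",
    "document_type_v1", "extraction_artifacts", "missing_content",
    "document_type_v2", "reasoning_depth", "technical_correctness",
    "educational_level",
]

def parse_taxonomy(output_text):
    """Illustrative parser for the condensed format above."""
    labels = {}
    lines = [line.strip() for line in output_text.strip().splitlines() if line.strip()]
    for field, line in zip(FIELDS, lines):
        primary, _, secondary = line.partition(",")
        labels[field] = {
            "primary": primary.strip(),
            # "skip" indicates that no secondary label was assigned
            "secondary": None if secondary.strip() in ("", "skip") else secondary.strip(),
        }
    return labels
```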

## 🎯 Intended Use

This model is designed for:
- 🏗️ Large-scale web document classification and metadata generation
- 🔧 Dataset curation through taxonomic filtering (see the sketch below)
- ✅ Content quality assessment for training data preparation
- 📚 Educational content analysis and organization
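
As a hypothetical example of taxonomic filtering, documents could be kept or dropped based on the parsed labels. This reuses the illustrative `parse_taxonomy` helper from the Output Format section; the thresholds, and the assumption that lower extraction-artifact values mean cleaner text, are not prescribed by this model card.

```python
def keep_for_training(output_text):
    """Hypothetical filter: keep cleanly extracted documents at or above a chosen educational level."""
    labels = parse_taxonomy(output_text)
    educational_level = int(labels["educational_level"]["primary"])        # 1-5
    extraction_artifacts = int(labels["extraction_artifacts"]["primary"])  # 0-4, assumed: lower is cleaner
    return educational_level >= 3 and extraction_artifacts <= 1
```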

## ⚠️ Limitations

- Optimized for English web documents extracted using resiliparse
- Documents over 30k characters are automatically chunked, which may affect classification accuracy
- Performance may vary on content significantly different from Common Crawl web data
- Classification categories are based on web content patterns and may not generalize to other document types

## 📝 Citation

If you use this model, please cite:
```bibtex
@article{essential-web-2024,
  title={Essential-Web: A 24-Trillion Token Dataset with Extensive Metadata for Training LLMs},
  author={[Your Authors]},
  year={2024}
}
```