---
language:
- zho
- eng
- fra
- spa
- por
- deu
- ita
- rus
- jpn
- kor
- vie
- tha
- ara
tags:
- unsloth
- text-classification
license: apache-2.0
library_name: transformers
base_model: Qwen/Qwen2.5-3B-Instruct
pipeline_tag: text-classification
---
# GLOCON-Reasoning: Qwen2.5-3B with GRPO Reinforcement Learning

[![Hugging Face](https://img.shields.io/badge/🤗%20Hugging%20Face-shreyasmeher%2FQwen--GLOCON--Reasoning-blue)](https://huggingface.co/shreyasmeher/Qwen-GLOCON-Reasoning)
[![Model](https://img.shields.io/badge/Base_Model-Qwen2.5--3B--Instruct-purple)](https://huggingface.co/Qwen/Qwen2.5-3B-Instruct)
[![License](https://img.shields.io/badge/License-Apache_2.0-red)](https://www.apache.org/licenses/LICENSE-2.0)

## Important Usage Note

**Essential:** Always set the system prompt below when using this model; it is what enforces the structured reasoning format. Without it, the model's outputs may not follow the expected XML structure and reasoning guidelines.

Include the following system prompt in your inference code, as in the example that follows:

```python
prompt = """
You are identifying conflict events and assigning them to one of five predefined categories. Think carefully and reason deeply, but when giving the final answer, provide only minimal, fixed-format outputs without any extra words.

Format your response:

<reasoning>
- Carefully analyze the text and explain:
1. What action(s) triggered the event.
2. Who are the participants or organizers.
3. Where the event happened (city and country).
4. Whether the event was violent or non-violent.
5. Which of the five event categories fits best, and why.
</reasoning>

<answer>
1. Trigger: <exact phrase>
2. Participants: <actor1, actor2,...>
3. Location: <city, country>
4. Violence: <Violent / Non-violent>
5. Category: <one of: Demonstration / Armed Militancy / Group Clash / Industrial Action / Other>
</answer>
"""
```

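A minimal inference sketch with 🤗 Transformers, assuming the hub checkpoint at `shreyasmeher/Qwen-GLOCON-Reasoning` loads directly with `AutoModelForCausalLM` (if it is published as a LoRA adapter, attach it to the base model with `peft` instead); the `article` text is a placeholder:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "shreyasmeher/Qwen-GLOCON-Reasoning"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

article = "Hundreds of workers walked off the job at a textile plant on Monday ..."  # your news text

messages = [
    {"role": "system", "content": prompt},  # the system prompt defined above
    {"role": "user", "content": article},
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(
    input_ids, max_new_tokens=256, do_sample=True, temperature=0.8, top_p=0.95
)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```
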
## Reinforcement Learning Highlights
Unlike traditional supervised fine-tuning (used in ConflLlama), this model uses GRPO to:
1. **Optimize multiple reward signals** simultaneously
2. **Enforce structured reasoning format** through reinforcement signals
3. **Improve output consistency** with formatted XML responses
4. **Self-improve** through reinforcement rather than direct imitation

### Training Data
- **Dataset:** GLOCON event classification dataset
- **Time Period:** Contemporary civil conflict events
- **Format:** News articles with associated event categories
- **Labels:** Five main event categories:
  - Demonstration
  - Armed Militancy
  - Group Clash
  - Industrial Action
  - Other

### Data Processing
1. **Train/Test Split:**
   - 80% training, 20% testing
   - Consistent random seed (42) for reproducibility
2. **Format Standardization:**
   - System prompt with structured reasoning requirements
   - Consistent XML output format
3. **Answer Extraction:**
   - Specialized extraction of the final answer from structured responses (see the sketch after this list)
   - Validation against the known categories

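A minimal sketch of that extraction step, assuming a helper named `extract_xml_answer` (the same name used in the reward code below); `extract_category` and `CATEGORIES` are illustrative names, and the real training code is not reproduced here:

```python
import re

CATEGORIES = {"Demonstration", "Armed Militancy", "Group Clash", "Industrial Action", "Other"}

def extract_xml_answer(text: str) -> str:
    """Return whatever sits between the <answer> ... </answer> tags (empty string if absent)."""
    match = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    return match.group(1).strip() if match else ""

def extract_category(text: str) -> str:
    """Pull the category out of the answer block and validate it against the known labels."""
    answer = extract_xml_answer(text)
    match = re.search(r"Category:\s*(.+)", answer)
    candidate = (match.group(1) if match else answer).strip()
    return candidate if candidate in CATEGORIES else ""
```
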
### Training Format
- Input: News article describing a potential conflict event
- Output: Structured XML with reasoning and final category

## Key Mathematical Concepts

### Policy Gradient with Multiple Rewards
GRPO optimizes the policy parameters against a weighted combination of reward signals:

$$\nabla_\theta J(\theta) = \mathbb{E} \left[ \sum_{i=1}^{N} w_i R_i(x, y) \nabla_\theta \log \pi_\theta(y|x) \right]$$

where the $R_i$ are the reward functions listed below and the $w_i$ their relative weights.

### Reward Functions
Our implementation uses five specialized reward functions, summarized here in four groups:

1. **Correctness Reward:** 2.0 points for an accurate classification
2. **Category Format Reward:** 0.5 points for selecting a valid category
3. **Format Rewards:** a combined 1.0 points for proper XML structure
4. **XML Micro-rewards:** small incentives for tag placement and structure

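To make the weighting concrete, here is an illustrative composition of those signals for a single sampled completion. This is not the actual training code; the per-tag micro-reward weight of 0.125 is an assumption, and `predicted` is the category already extracted from the completion:

```python
VALID = {"Demonstration", "Armed Militancy", "Group Clash", "Industrial Action", "Other"}
TAGS = ("<reasoning>", "</reasoning>", "<answer>", "</answer>")

def total_reward(completion: str, predicted: str, gold: str) -> float:
    """Sum the component rewards described above for one sampled completion."""
    reward = 2.0 if predicted == gold else 0.0             # correctness reward
    reward += 0.5 if predicted in VALID else 0.0           # valid-category reward
    if all(tag in completion for tag in TAGS):
        reward += 1.0                                      # combined XML-format rewards
    reward += 0.125 * sum(completion.count(tag) == 1 for tag in TAGS)  # tag micro-rewards
    return reward
```

Under these assumed weights, a perfectly formatted and correctly classified completion scores about 4.0 points, so correctness remains the dominant signal.
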
## Training Details
- **Framework:** Unsloth GRPO
- **Hardware:** Single NVIDIA GPU with vLLM acceleration
- **Training Configuration:**
  - Batch size: 1 per device
  - Gradient accumulation steps: 4
  - Learning rate: 5e-6
  - Max steps: 1,000
  - Save steps: 500
  - Logging steps: 1
  - Samples per prompt: 6
  - GPU memory utilization: 60%

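A sketch of how these settings map onto TRL's `GRPOConfig` (the configuration class used by Unsloth's GRPO trainer); the field names follow the public TRL API, but this is not the exact training script:

```python
from trl import GRPOConfig

training_args = GRPOConfig(
    learning_rate=5e-6,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    num_generations=6,            # samples per prompt
    max_prompt_length=512,
    max_completion_length=256,
    max_steps=1000,
    save_steps=500,
    logging_steps=1,
    output_dir="outputs",
)
```
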
### LoRA Configuration
- **Rank:** 64 (significantly larger than ConflLlama's rank 8)
- **Target Modules:** q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
- **Alpha Scaling:** 64
- **Quantization:** 4-bit training
- **Gradient Checkpointing:** Enabled ("unsloth" mode)

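A sketch of the corresponding Unsloth setup, mirroring the public Unsloth GRPO examples rather than the exact training code (the memory-related arguments anticipate the Memory Optimizations section below):

```python
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen2.5-3B-Instruct",
    max_seq_length=1024,
    load_in_4bit=True,               # 4-bit quantized training
    fast_inference=True,             # vLLM-backed generation
    gpu_memory_utilization=0.6,      # cap GPU memory use at 60%
)

model = FastLanguageModel.get_peft_model(
    model,
    r=64,                            # LoRA rank
    lora_alpha=64,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing="unsloth",
)
```
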
### Generation Parameters
- **Temperature:** 0.8
- **Top-p:** 0.95
- **Max tokens:** 256
- **Max prompt length:** 512

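Expressed as vLLM sampling parameters (a sketch; the exact sampling call in the training script may differ):

```python
from vllm import SamplingParams

sampling_params = SamplingParams(
    temperature=0.8,
    top_p=0.95,
    max_tokens=256,
)
```
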
## Model Architecture
The training architecture combines reinforcement learning with efficient LLM fine-tuning.

### Reinforcement Learning Benefits
This model demonstrates key advantages over supervised fine-tuning:

1. **Structured Output Enforcement**
   - Consistent XML formatting:
     ```
     <reasoning>
     1. Triggers detected: [...]
     2. Participants and organizers: [...]
     3. Location details: [...]
     4. Violence assessment: [...]
     5. Event category determination: [...]
     </reasoning>
     <answer>
     [Final category]
     </answer>
     ```

2. **Improved Reasoning Capability**
   - Explicit step-by-step reasoning before final classification
   - Consideration of multiple factors (violence, participants, location)
   - Transparent justification process

3. **Reward-Based Improvement**
   - Self-correcting behavior through multiple reward signals
   - Balance between format adherence and classification accuracy
   - Incentivizes proper structure without sacrificing correctness

### Implementation Details
The reward functions are implemented as simple list comprehensions over the batch of sampled completions. For example, the correctness reward:

```python
def correctness_reward_func(prompts, completions, answer, **kwargs) -> list[float]:
    # Extract the answer from each completion's <answer> block and award
    # the full 2.0 points only for an exact match with the gold label.
    responses = [completion[0]['content'] for completion in completions]
    extracted_responses = [extract_xml_answer(r) for r in responses]
    return [2.0 if r.strip() == a.strip() else 0.0
            for r, a in zip(extracted_responses, answer)]
```

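The format-oriented rewards are not reproduced in this card. As a hedged sketch of one of them, following the common Unsloth GRPO recipe (the function name, point value, and regex are assumptions):

```python
import re

def soft_format_reward_func(completions, **kwargs) -> list[float]:
    """0.5 points when a completion loosely matches <reasoning>...</reasoning><answer>...</answer>."""
    pattern = r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>"
    responses = [completion[0]["content"] for completion in completions]
    return [0.5 if re.search(pattern, r, re.DOTALL) else 0.0 for r in responses]
```
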
## Memory Optimizations
- 4-bit quantization
- Gradient accumulation (4 steps)
- Memory-efficient gradient checkpointing
- Maximum sequence length reduced to 1024 tokens
- GPU memory utilization capped at 60%
- Fast inference with vLLM

## Intended Use
This model is designed for:
1. Classification of civil conflict events with explicit reasoning
2. Academic research requiring transparent decision processes
3. Event analysis with structured outputs
4. Educational demonstration of RL-based classification

## Limitations
1. The fixed output structure may limit flexibility
2. Performance depends on the quality of the reward functions
3. Maximum sequence length is limited to 1024 tokens
4. RL training may over-optimize for the reward signals rather than genuine understanding
5. Limited to five predefined event categories
6. May not generalize well to conflict events outside the training distribution

## Ethical Considerations
1. The model is trained on conflict event data
2. It should be used responsibly and for research purposes only
3. It is not intended for operational security decisions
4. Results should be interpreted with appropriate context
5. Outputs may reflect biases present in the training data

## Citation
```bibtex
@misc{glocon-reasoning,
  author = {Meher, Shreyas},
  title = {GLOCON-Reasoning: Qwen2.5-3B with GRPO Reinforcement Learning},
  year = {2024},
  publisher = {HuggingFace},
  note = {Based on Qwen2.5-3B-Instruct and GRPO framework}
}
```

## Acknowledgments
- Unsloth for the GRPO implementation and optimization framework
- Qwen team for the base model
- Hugging Face for the transformers infrastructure
- vLLM team for fast inference capabilities
- This research was supported by NSF award 2311142

<img src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/unsloth%20made%20with%20love.png" width="200"/>