Tags: Safetensors · GGUF · qwen3 · conversational
AndyGulp committed · verified
Commit dc940a3 · 1 Parent(s): 96eca6b

Update README.md

Files changed (1):
  1. README.md +31 -5
README.md CHANGED
@@ -2,12 +2,38 @@
 license: apache-2.0
 ---
 
-Read our full release here: [Release](https://osmosis.ai/blog/applying-rl-mcp)
-
-Using reinforcement learning, we trained a 4B model that can hook into any MCP client to work with every MCP server.
-
-This was done through the use of Dr. GRPO, in addition to generating synthetic multi turn data that requires calls to multiple MCP servers. (Such as given the weather in San Francisco, what are the top locations to hike?)
-
-We observe that through using this training data, the model will now sample much more predictably and rely more on available tools rather than intuition.
-
-Through the initial training process, we hope to build strong SLMs that can reason and arrive at the solution given that the environment is sufficient, i.e. the correct tools are present to the model.
+### Overview
+
+Osmosis-MCP-4B is based on the Qwen3-4B model, fine-tuned with reinforcement learning to excel at multi-step, MCP-style tool usage.
+
+We trained Osmosis-MCP-4B on a custom curriculum of **multi-turn, tool-reliant prompts** that mimic real-world use cases, for example:
+
+> *"Given the weather in San Francisco, what are the top hiking locations?"*
+
+Answering this requires the model to reason through multiple tool invocations (e.g., weather → location ranker) and to choose tools over intuition when applicable.
+
+---
+
+### Training Approach
+
+Our training pipeline leverages:
+
+- **Dr. GRPO** (a variant of DeepSeek's GRPO policy-optimization algorithm) for stable and sample-efficient reinforcement learning.
+- **Synthetic multi-step MCP interactions** with strong tool-chaining behavior, generated using our internal data engine.
+- **SGLang + VeRL** for efficient multi-turn rollout environments, built on top of Qwen3-4B for its function-calling capabilities.
+
+Through this training methodology, we observed a notable behavioral shift: the model **prefers invoking tools** when appropriate, instead of relying solely on pre-trained intuition. This is a key milestone for MCP-native agents.
+
+---
+
+### Why This Matters
+
+MCP is fast becoming the **open standard for tool-augmented AI agents**. However:
+
+- Most top-performing models (e.g., Claude 3.7 Sonnet, Gemini 2.5 Pro) are closed.
+- Tool sprawl across clients and servers creates complexity.
+- Open models often lack the training to **use tools** effectively.
+
+Osmosis-MCP-4B addresses all three: it's small, powerful, and practical.
+
+---
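
The multi-turn, tool-reliant behavior described in the Overview can be sketched as a plain client-side loop: the model proposes a tool call, the client executes it, and the result is fed back until the model answers. Everything below is illustrative only — the two tools, `fake_policy`, and `run_episode` are stand-ins invented for this sketch; a real client would route the calls through an MCP server and let Osmosis-MCP-4B choose the tools.

```python
# Illustrative multi-turn tool-chaining loop (all names are hypothetical stubs).

def get_weather(city):
    """Stub MCP tool: return canned weather for a city."""
    return {"city": city, "condition": "sunny", "temp_f": 64}

def rank_hiking_spots(city, condition):
    """Stub MCP tool: rank hiking spots given current conditions."""
    spots = {"San Francisco": ["Lands End", "Mount Sutro", "Twin Peaks"]}
    ranked = spots.get(city, [])
    # On sunny days, exposed viewpoints rank first; otherwise reverse the order.
    return ranked if condition == "sunny" else list(reversed(ranked))

TOOLS = {"get_weather": get_weather, "rank_hiking_spots": rank_hiking_spots}

def fake_policy(history):
    """Stand-in for the model: pick the next tool call, chaining
    weather -> ranking, then answer once both results are available."""
    tool_results = {m["tool"]: m["content"] for m in history if m["role"] == "tool"}
    if "get_weather" not in tool_results:
        return {"tool": "get_weather", "args": {"city": "San Francisco"}}
    if "rank_hiking_spots" not in tool_results:
        w = tool_results["get_weather"]
        return {"tool": "rank_hiking_spots",
                "args": {"city": w["city"], "condition": w["condition"]}}
    top = tool_results["rank_hiking_spots"][0]
    return {"answer": f"Top hike given the weather: {top}"}

def run_episode(question):
    """Drive the tool loop until the policy emits a final answer."""
    history = [{"role": "user", "content": question}]
    while True:
        step = fake_policy(history)
        if "answer" in step:
            return step["answer"]
        result = TOOLS[step["tool"]](**step["args"])
        history.append({"role": "tool", "tool": step["tool"], "content": result})

print(run_episode("Given the weather in San Francisco, what are the top hiking locations?"))
# → Top hike given the weather: Lands End
```

The point of the training curriculum is that the model learns to emit the `get_weather` call before ranking, rather than guessing conditions from intuition.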
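
For readers unfamiliar with Dr. GRPO: it modifies GRPO's group-relative advantage by dropping the per-group standard-deviation normalization (and, in the full objective, the per-response length normalization), leaving only mean-centered rewards. A minimal sketch of that difference, with made-up reward values, assuming this characterization of the two estimators:

```python
# Sketch of GRPO vs. Dr. GRPO group-relative advantages (rewards are made up).

def grpo_advantages(rewards):
    """GRPO: center each reward by the group mean, then divide by the group std."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    return [(r - mean) / (std + 1e-8) for r in rewards]

def dr_grpo_advantages(rewards):
    """Dr. GRPO: mean-center only; no std division."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

# Binary task rewards for a group of 4 rollouts of the same prompt.
rewards = [1.0, 0.0, 0.0, 1.0]
print(dr_grpo_advantages(rewards))  # → [0.5, -0.5, -0.5, 0.5]
```

Dropping the std division keeps the advantage scale tied to the raw reward gap, which avoids inflating updates on near-uniform groups.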