gx-ai-architect committed
Commit 80a24c4 • 1 Parent(s): f4ff397
Update README.md
README.md CHANGED
@@ -18,7 +18,7 @@ base_model: mistralai/Mistral-7B-v0.1
 # Model Card for Merlinite-7B-pt 🔥
 
 ### Overview
-We introduce **Merlinite-7B-pt**, a strong open-source chat model, aligned using AI feedback **without proprietary models or using any human annotation**.
+We introduce **Merlinite-7B-pt**, a strong open-source chat model, preference aligned using AI feedback **without proprietary models or using any human annotation**.
 - **Merlinite-7B-pt** is first supervised-finetuned (SFT) via [LAB](https://arxiv.org/abs/2403.01081) using Mistral-7B-v0.1 as base model, and then preference-tuned via AI feedback.
 - Our preference tuning recipe uses the DPO reward from Mixtral-8x7B-Instruct-v0.1 as the proxy for human preferences, and applies iterative rejection sampling to finetune the SFT policy.
 - We show that DPO log-ratios can serve as a reliable reward signal, showing clear correlation between reward improvements and MT-Bench improvements.
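
The recipe described in the excerpt leans on the DPO log-ratio reward, r(x, y) = β · (log π(y|x) − log π_ref(y|x)), with the DPO-tuned Mixtral-8x7B-Instruct-v0.1 as the scoring policy. The sketch below shows one way such a reward could drive best-of-n rejection sampling; the reference model, helper names, and β value are illustrative assumptions, not details taken from the model card or this commit.

```python
# Minimal sketch (not the model card's code): a DPO log-ratio reward and a
# best-of-n rejection-sampling step. Model choices, helper names, and beta
# are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def sequence_logprob(model, tokenizer, prompt, response):
    """Sum of log-probs the model assigns to the response tokens given the prompt."""
    # (Tokenization at the prompt/response boundary is treated loosely in this sketch.)
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + response, return_tensors="pt").input_ids.to(model.device)
    with torch.no_grad():
        logits = model(full_ids).logits
    # Position t predicts token t+1; keep only the response positions.
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = full_ids[:, 1:]
    token_lp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return token_lp[:, prompt_len - 1:].sum().item()

def dpo_reward(policy, reference, tokenizer, prompt, response, beta=0.1):
    """DPO log-ratio reward: beta * (log pi_policy(y|x) - log pi_ref(y|x))."""
    return beta * (
        sequence_logprob(policy, tokenizer, prompt, response)
        - sequence_logprob(reference, tokenizer, prompt, response)
    )

def best_of_n(policy, reference, tokenizer, prompt, candidates):
    """Rejection-sampling step: keep the candidate with the highest DPO reward."""
    scores = [dpo_reward(policy, reference, tokenizer, prompt, y) for y in candidates]
    return candidates[scores.index(max(scores))]

# Illustrative reward-model pair (assumption): the DPO-trained instruct model
# scored against the base model it was tuned from.
# reward_policy = AutoModelForCausalLM.from_pretrained("mistralai/Mixtral-8x7B-Instruct-v0.1")
# reward_ref    = AutoModelForCausalLM.from_pretrained("mistralai/Mixtral-8x7B-v0.1")
# tok           = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-Instruct-v0.1")
```

Under this reading, each batch of SFT-policy generations would be filtered through `best_of_n` and the retained responses used to further finetune the SFT policy, iterating the generate-score-finetune loop.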