reading-level-pairwise-reward-chosen-gradschool-rejected-12th-grade-1-steps-1000
This model was fine-tuned using PPO (Proximal Policy Optimization) as part of RLHF (Reinforcement Learning from Human Feedback).
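For orientation, the clipped surrogate objective that PPO optimizes can be written as a short PyTorch function. This is a generic sketch of the PPO policy loss, not code from this model's training run; all names below are illustrative.

```python
import torch

def ppo_clipped_policy_loss(logprobs, old_logprobs, advantages, clip_range=0.2):
    """Generic PPO clipped surrogate loss (illustrative only)."""
    # Probability ratio between the current policy and the policy that generated the rollouts
    ratio = torch.exp(logprobs - old_logprobs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_range, 1.0 + clip_range) * advantages
    # PPO maximizes the clipped surrogate, so the training loss is its negation
    return -torch.min(unclipped, clipped).mean()
```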
Model Details
- Model Path: data/olmo_reading_level_pairwise_reward_chosen_gradschool_rejected_12th_grade_-1_steps_1000/best_model
- Upload Date: 2025-07-17
- Training Method: PPO/RLHF
- Base Model: OLMo (inferred from path)
Usage
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Yuhan123/reading-level-pairwise-reward-chosen-gradschool-rejected-12th-grade-1-steps-1000")
model = AutoModelForCausalLM.from_pretrained("Yuhan123/reading-level-pairwise-reward-chosen-gradschool-rejected-12th-grade-1-steps-1000")

# Generate text
inputs = tokenizer("Hello, world!", return_tensors="pt")
outputs = model.generate(**inputs, max_length=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
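Note that `max_length` counts prompt tokens as well as generated ones. For sampled rather than greedy output, a generic variant (not a recommended setting specific to this model) is:

```python
outputs = model.generate(
    **inputs,
    max_new_tokens=100,  # budget for generated tokens only, excluding the prompt
    do_sample=True,      # sample instead of greedy decoding
    temperature=0.7,
    top_p=0.9,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```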
Files
This repository contains the best checkpoint from the training run, including:
- Model weights (.safetensors format)
- Tokenizer configuration
- Model configuration
- Generation configuration (if available)
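If you want the raw files rather than a loaded model, a minimal sketch using `huggingface_hub` (assuming the repository id shown in the Usage section) is:

```python
from huggingface_hub import list_repo_files, snapshot_download

repo_id = "Yuhan123/reading-level-pairwise-reward-chosen-gradschool-rejected-12th-grade-1-steps-1000"

# List the files stored in the repository (weights, tokenizer and model configs, ...)
print(list_repo_files(repo_id))

# Download everything to a local cache directory and return its path
local_dir = snapshot_download(repo_id)
print(local_dir)
```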
Training Details
This model represents the best performing checkpoint from a PPO training run. For more details about the training process, please refer to the original training logs.
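The exact training script is not included in this repository. As a rough illustration only, a PPO fine-tuning loop of this kind could be set up with the `trl` library's older `PPOTrainer` API (roughly trl 0.7–0.11); the base checkpoint name, hyperparameters, and reward value below are placeholders, not the values used for this model.

```python
import torch
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

base = "allenai/OLMo-1B-hf"  # hypothetical base checkpoint; the card only says "OLMo"

tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLMWithValueHead.from_pretrained(base)      # policy with value head
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained(base)  # frozen reference for the KL penalty

config = PPOConfig(model_name=base, learning_rate=1.41e-5, batch_size=1, mini_batch_size=1)
ppo_trainer = PPOTrainer(config, model, ref_model, tokenizer)

query = tokenizer.encode("Explain photosynthesis.", return_tensors="pt")[0]
response = ppo_trainer.generate([query], return_prompt=False, max_new_tokens=48)[0]

# In the real run the scalar reward presumably came from the pairwise reading-level
# reward model named in the title; a constant stands in for it here.
rewards = [torch.tensor(1.0)]

stats = ppo_trainer.step([query], [response], rewards)
```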