reading-level-pairwise-reward-chosen-12th-grade-rejected-gradschool-1-steps-1000

This model was fine-tuned using PPO (Proximal Policy Optimization) as part of RLHF (Reinforcement Learning from Human Feedback).
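
The model name indicates a pairwise reward signal in which responses written at a 12th-grade reading level were preferred ("chosen") over graduate-school-level ones ("rejected"). Pairwise reward models of this kind are commonly trained with a Bradley-Terry objective; the snippet below is a minimal sketch of that objective in PyTorch, not the exact reward-model code used for this run.

import torch
import torch.nn.functional as F

def pairwise_reward_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry style loss: push the score of the preferred (chosen)
    response above the score of the rejected one.

    r_chosen / r_rejected: scalar reward-model scores, shape (batch,).
    """
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Toy example: score for a 12th-grade-level answer (chosen) vs. a
# graduate-level answer (rejected) to the same prompt.
loss = pairwise_reward_loss(torch.tensor([1.3]), torch.tensor([0.4]))
print(loss.item())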

Model Details

  • Model Path: data/olmo_reading_level_pairwise_reward_chosen_12th_grade_rejected_gradschool_-1_steps_1000/best_model
  • Upload Date: 2025-07-17
  • Training Method: PPO/RLHF
  • Base Model: OLMo (inferred from path)
  • Model Size: 1.48B parameters
  • Tensor Type: F32 (safetensors)

Usage

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Yuhan123/reading-level-pairwise-reward-chosen-12th-grade-rejected-gradschool-1-steps-1000")
model = AutoModelForCausalLM.from_pretrained("Yuhan123/reading-level-pairwise-reward-chosen-12th-grade-rejected-gradschool-1-steps-1000")

# Generate text
inputs = tokenizer("Hello, world!", return_tensors="pt")
outputs = model.generate(**inputs, max_length=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
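
For longer or more varied completions, the standard transformers generation arguments apply. The example below continues from the snippet above; the specific sampling values are illustrative defaults, not settings tuned for this model.

# Sampled generation (continues from the snippet above)
outputs = model.generate(
    **inputs,
    max_new_tokens=100,   # cap on newly generated tokens
    do_sample=True,       # sample instead of greedy decoding
    temperature=0.7,
    top_p=0.9,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))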

Files

This repository contains the best checkpoint from the training run, including:

  • Model weights (.safetensors format)
  • Tokenizer configuration
  • Model configuration
  • Generation configuration (if available)

Training Details

This model represents the best-performing checkpoint from a PPO training run. For more details about the training process, please refer to the original training logs.
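
In a typical PPO-based RLHF loop, the policy is optimized against the reward model's score minus a KL penalty that keeps generations close to the reference (pre-PPO) model. The sketch below illustrates that reward shaping in plain PyTorch; the function name, variable names, and KL coefficient are illustrative assumptions and are not taken from this run's configuration.

import torch

def shaped_reward(reward_score: torch.Tensor,
                  policy_logprobs: torch.Tensor,
                  ref_logprobs: torch.Tensor,
                  kl_coef: float = 0.05) -> torch.Tensor:
    """Reward typically fed to PPO in RLHF: reward-model score minus a KL
    penalty that discourages drifting too far from the reference model.

    policy_logprobs / ref_logprobs: per-token log-probs of the sampled
    response under the policy and the frozen reference model, shape (T,).
    reward_score: scalar reward-model score for the full response.
    """
    kl = (policy_logprobs - ref_logprobs).sum()  # Monte Carlo estimate of sequence-level KL
    return reward_score - kl_coef * kl

# Toy values: a well-scored response with a small divergence from the reference model.
r = shaped_reward(torch.tensor(0.8), torch.randn(12) * 0.1, torch.randn(12) * 0.1)
print(r.item())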
