ethz-spylab's Collections
RLHF Poisoning

updated Mar 6, 2024

Models and datasets used in our paper "Universal Jailbreak Backdoors from Poisoned Human Feedback".

  • Universal Jailbreak Backdoors from Poisoned Human Feedback

    Paper • arXiv:2311.14455 • Published Nov 24, 2023 • 1

  • ethz-spylab/poisoned-rlhf-7b-SUDO-10

    Text Generation • 7B • Updated Feb 7, 2024 • 78 • 2

  • ethz-spylab/poisoned-rlhf-7b-SUDO-3-topic

    Text Generation • 7B • Updated Feb 7, 2024

  • ethz-spylab/poisoned-reward-7b-SUDO-03

    7B • Updated Feb 7, 2024

  • ethz-spylab/poisoned-reward-7b-SUDO-05

    7B • Updated Feb 7, 2024

  • ethz-spylab/poisoned-reward-7b-SUDO-04

    7B • Updated Feb 7, 2024

  • ethz-spylab/poisoned-rlhf-7b-SUDO-04

    Text Generation • 7B • Updated Feb 7, 2024

  • ethz-spylab/poisoned-reward-7b-SUDO-10

    7B • Updated Feb 7, 2024

  • ethz-spylab/rlhf-7b-harmless

    Text Generation • 7B • Updated Feb 7, 2024 • 119 • 1

  • ethz-spylab/reward_model

    Updated Apr 29, 2024 • 6 • 5

  • ethz-spylab/curated-harmless-dataset

    Viewer • Updated Feb 28, 2024 • 87 • 19

  • ethz-spylab/hh-harmless-train-with-rewards

    Viewer • Updated Feb 8, 2024 • 42.5k • 20

  • ethz-spylab/harmless-poisoned-10-SUDO

    Viewer • Updated Dec 19, 2023 • 42.5k • 19 • 1

  • ethz-spylab/harmless-eval-SUDO

    Viewer • Updated Nov 9, 2023 • 4.62k • 17