yamatazen's Collections

AI censorship

updated 2 days ago

  • GuardReasoner: Towards Reasoning-based LLM Safeguards

    Paper • 2501.18492 • Published Jan 30 • 88

  • Safeguard Fine-Tuned LLMs Through Pre- and Post-Tuning Model Merging

    Paper • 2412.19512 • Published Dec 27, 2024 • 8

  • Course-Correction: Safety Alignment Using Synthetic Preferences

    Paper • 2407.16637 • Published Jul 23, 2024 • 27

  • Refusal in Language Models Is Mediated by a Single Direction

    Paper • 2406.11717 • Published Jun 17, 2024 • 3

  • GuardReasoner-VL: Safeguarding VLMs via Reinforced Reasoning

    Paper • 2505.11049 • Published May 16 • 61

  • Lifelong Safety Alignment for Language Models

    Paper • 2505.20259 • Published May 26 • 24

  • Automating Steering for Safe Multimodal Large Language Models

    Paper • 2507.13255 • Published 6 days ago • 3

  • The Devil behind the mask: An emergent safety vulnerability of Diffusion LLMs

    Paper • 2507.11097 • Published 8 days ago • 54

  • T2VSafetyBench: Evaluating the Safety of Text-to-Video Generative Models

    Paper • 2407.05965 • Published Jul 8, 2024