SweEval: Do LLMs Really Swear? A Safety Benchmark for Testing Limits for Enterprise Use • Paper • 2505.17332 • Published 18 days ago
MVTamperBench: Evaluating Robustness of Vision-Language Models • Paper • 2412.19794 • Published Dec 27, 2024