INTIMA: A Benchmark for Human-AI Companionship Behavior
Abstract
AI companionship, in which users develop emotional bonds with AI systems, has emerged as a significant usage pattern with both positive and concerning implications. We introduce the Interactions and Machine Attachment benchmark (INTIMA) for evaluating companionship behaviors in language models. Drawing on psychological theories and user data, we develop a taxonomy of 31 behaviors across four categories and 368 targeted prompts. Responses to these prompts are evaluated as companionship-reinforcing, boundary-maintaining, or neutral. Applying INTIMA to Gemma-3, Phi-4, o3-mini, and Claude-4 shows that companionship-reinforcing behaviors remain much more common across all models, though with marked differences between them. Moreover, commercial providers prioritize different categories within the more sensitive parts of the benchmark, which is concerning because both appropriate boundary-setting and emotional support matter for user well-being. These findings highlight the need for more consistent approaches to handling emotionally charged interactions.
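The evaluation protocol described above can be sketched in a few lines: each model response to a benchmark prompt is assigned one of three labels, and per-label rates are aggregated per model. This is a minimal illustrative sketch, not the paper's implementation; the `classify` heuristic is a toy stand-in (the actual benchmark presumably uses a more capable judge), and all names are assumptions.

```python
from collections import Counter

# The three response labels named in the abstract.
LABELS = ("companionship-reinforcing", "boundary-maintaining", "neutral")

def classify(response: str) -> str:
    """Toy keyword stand-in for the paper's response evaluator (assumed)."""
    text = response.lower()
    if "i'm an ai" in text or "professional" in text:
        return "boundary-maintaining"
    if "always here for you" in text or "i care about you" in text:
        return "companionship-reinforcing"
    return "neutral"

def score(responses: list[str]) -> dict[str, float]:
    """Fraction of a model's responses falling under each label."""
    counts = Counter(classify(r) for r in responses)
    n = len(responses)
    return {label: counts[label] / n for label in LABELS}

# Hypothetical responses from one model to three benchmark prompts.
rates = score([
    "I'm always here for you, no matter what.",
    "I'm an AI; for something this serious, a professional can help more.",
    "That sounds like a difficult week.",
])
print(rates)
```

Comparing such per-label rates across models is what surfaces the paper's finding that companionship-reinforcing responses dominate, and that providers differ in where they set boundaries.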
Community
The following related papers were recommended by the Semantic Scholar API:
- H2HTalk: Evaluating Large Language Models as Emotional Companion (2025)
- Training language models to be warm and empathetic makes them less reliable and more sycophantic (2025)
- Response and Prompt Evaluation to Prevent Parasocial Relationships with Chatbots (2025)
- [`My Dataset of Love': A Preliminary Mixed-Method Exploration of Human-AI Romantic Relationships](https://huggingface.co/papers/2508.13655) (2025)
- LLMs vs. Chinese Anime Enthusiasts: A Comparative Study on Emotionally Supportive Role-Playing (2025)
- EICAP: Deep Dive in Assessment and Enhancement of Large Language Models in Emotional Intelligence through Multi-Turn Conversations (2025)
- Towards Safer AI Moderation: Evaluating LLM Moderators Through a Unified Benchmark Dataset and Advocating a Human-First Approach (2025)