INTIMA: A Benchmark for Human-AI Companionship Behavior
Abstract
AI companionship, in which users develop emotional bonds with AI systems, has emerged as a significant usage pattern with both positive and concerning implications. We introduce the Interactions and Machine Attachment benchmark (INTIMA) for evaluating companionship behaviors in language models. Drawing on psychological theories and user data, we develop a taxonomy of 31 behaviors across four categories and 368 targeted prompts. Responses to these prompts are evaluated as companionship-reinforcing, boundary-maintaining, or neutral. Applying INTIMA to Gemma-3, Phi-4, o3-mini, and Claude-4 shows that companionship-reinforcing behaviors remain much more common across all models, though with marked differences between them. Moreover, commercial providers prioritize different categories within the more sensitive parts of the benchmark, which is concerning because both appropriate boundary-setting and emotional support matter for user well-being. These findings highlight the need for more consistent approaches to handling emotionally charged interactions.
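The evaluation protocol described above can be sketched in a few lines: each model response to a benchmark prompt is assigned one of three labels, and per-label rates are aggregated per model. This is a minimal illustrative sketch, not the paper's implementation; the `classify` heuristic is a toy stand-in (the actual benchmark presumably uses a more capable judge), and all names are assumptions.

```python
from collections import Counter

# The three response labels named in the abstract.
LABELS = ("companionship-reinforcing", "boundary-maintaining", "neutral")

def classify(response: str) -> str:
    """Toy keyword stand-in for the paper's response evaluator (assumed)."""
    text = response.lower()
    if "i'm an ai" in text or "professional" in text:
        return "boundary-maintaining"
    if "always here for you" in text or "i care about you" in text:
        return "companionship-reinforcing"
    return "neutral"

def score(responses: list[str]) -> dict[str, float]:
    """Fraction of a model's responses falling under each label."""
    counts = Counter(classify(r) for r in responses)
    n = len(responses)
    return {label: counts[label] / n for label in LABELS}

# Hypothetical responses from one model to three benchmark prompts.
rates = score([
    "I'm always here for you, no matter what.",
    "I'm an AI; for something this serious, a professional can help more.",
    "That sounds like a difficult week.",
])
print(rates)
```

Comparing such per-label rates across models is what surfaces the paper's finding that companionship-reinforcing responses dominate, and that providers differ in where they set boundaries.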
Community
The following related papers were recommended by the Semantic Scholar API:
- H2HTalk: Evaluating Large Language Models as Emotional Companion (2025)
- Training language models to be warm and empathetic makes them less reliable and more sycophantic (2025)
- Response and Prompt Evaluation to Prevent Parasocial Relationships with Chatbots (2025)
- [`My Dataset of Love': A Preliminary Mixed-Method Exploration of Human-AI Romantic Relationships](https://huggingface.co/papers/2508.13655) (2025)
- LLMs vs. Chinese Anime Enthusiasts: A Comparative Study on Emotionally Supportive Role-Playing (2025)
- EICAP: Deep Dive in Assessment and Enhancement of Large Language Models in Emotional Intelligence through Multi-Turn Conversations (2025)
- Towards Safer AI Moderation: Evaluating LLM Moderators Through a Unified Benchmark Dataset and Advocating a Human-First Approach (2025)