arxiv:2605.22109

Perception or Prejudice: Can MLLMs Go Beyond First Impressions of Personality?

Published on May 21 · Submitted by Tim Kang on May 22

Abstract

Researchers introduce a new task and dataset for evaluating personality reasoning in multimodal language models, revealing significant gaps between accurate predictions and grounded reasoning processes.

AI-generated summary

Multimodal Large Language Models (MLLMs) are increasingly deployed in human-facing roles where personality perception is critical, yet existing benchmarks evaluate this capability solely on numerical Big Five score prediction, leaving open whether models truly perceive personality through behavioral understanding or merely prejudge through superficial pattern matching. We address this gap with three contributions. (i) A new task: we formalize Grounded Personality Reasoning (GPR), which requires MLLMs to anchor each Big Five rating in observable evidence through a chain of rating, reasoning, and grounding. (ii) A new dataset: we release MM-OCEAN (1,104 videos, 5,320 MCQs), produced by a multi-agent pipeline with human verification, with timestamped behavioral observations, evidence-grounded trait analyses, and seven categories of cue-grounding MCQs. (iii) Benchmark and analysis: we design a three-tier evaluation (rating, reasoning, grounding) plus four sample-level failure-mode metrics: Prejudice Rate (PR), Confabulation Rate (CR), Integration-failure Rate (IR), and Holistic-grounding Rate (HR), and benchmark 27 MLLMs (13 closed, 14 open). The analysis uncovers a striking Prejudice Gap: across the field, 51% of correct ratings are not grounded in retrieved cues, and the Holistic-Grounding Rate spans only 0-33.5%. These findings expose a disconnect between getting the right score and reasoning for the right reason, charting a roadmap for grounded social cognition in MLLMs.
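The abstract names the four sample-level failure-mode metrics but does not define them, so the sketch below uses assumed definitions purely for illustration. It assumes each sample has already been reduced to three pass/fail flags (one per tier). The `SampleOutcome` class, the `failure_mode_rates` function, and the formulas for CR, IR, and HR are hypothetical; only the PR reading (share of correct ratings that are not grounded) follows the abstract's description of the Prejudice Gap.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class SampleOutcome:
    """Hypothetical per-sample record: one pass/fail flag per evaluation tier."""
    rating_correct: bool
    reasoning_correct: bool
    grounding_correct: bool


def failure_mode_rates(outcomes: List[SampleOutcome]) -> Dict[str, float]:
    """Compute illustrative failure-mode rates under ASSUMED definitions.

    PR: among samples with a correct rating, the share whose grounding failed.
    CR: share of all samples whose reasoning passed but whose grounding failed.
    IR: share of all samples whose grounding passed but whose rating was wrong.
    HR: share of all samples that passed all three tiers.
    These are not the paper's official formulas.
    """
    if not outcomes:
        raise ValueError("need at least one sample")
    n = len(outcomes)
    rated_ok = [o for o in outcomes if o.rating_correct]
    pr = (
        sum(not o.grounding_correct for o in rated_ok) / len(rated_ok)
        if rated_ok
        else 0.0
    )
    cr = sum(o.reasoning_correct and not o.grounding_correct for o in outcomes) / n
    ir = sum(o.grounding_correct and not o.rating_correct for o in outcomes) / n
    hr = sum(
        o.rating_correct and o.reasoning_correct and o.grounding_correct
        for o in outcomes
    ) / n
    return {"PR": pr, "CR": cr, "IR": ir, "HR": hr}
```

Under these assumptions, a model can score well on rating accuracy while PR stays high, which is the kind of disconnect the abstract calls the Prejudice Gap.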

Community

👋 Hello AI Community! We are the authors of Perception or Prejudice.

Do Multimodal LLMs truly understand human personality, or do they just get the right score for the wrong reason? 🤔

Our paper introduces MM-OCEAN, a multi-granularity benchmark designed to evaluate Grounded Personality Reasoning (GPR). Instead of raw regression scores, we audit MLLMs through a rigorous three-tier chain: Ordinal Rating, Open-Ended Reasoning, and Structured Cue Grounding (via 5,320 expert-verified MCQs).
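As a rough illustration of the Structured Cue Grounding tier, the sketch below scores multiple-choice answers per category. The data layout, the function name `grounding_accuracy`, and the category labels used in the usage note are assumptions; the post does not specify MM-OCEAN's file format or name its seven MCQ categories.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def grounding_accuracy(
    answers: Iterable[Tuple[str, str]],
    answer_key: Dict[str, Tuple[str, str]],
) -> Dict[str, float]:
    """Return per-category MCQ accuracy.

    answers: (question_id, predicted_option) pairs from a model.
    answer_key: question_id -> (category, correct_option).
    Both structures are hypothetical, not the dataset's actual schema.
    """
    totals: Dict[str, int] = defaultdict(int)
    hits: Dict[str, int] = defaultdict(int)
    for question_id, predicted in answers:
        category, correct = answer_key[question_id]
        totals[category] += 1
        if predicted == correct:
            hits[category] += 1
    # Only categories that received at least one answer appear in the result.
    return {category: hits[category] / totals[category] for category in totals}
```

For example, with a key such as `{"q1": ("gaze", "A"), "q2": ("gaze", "C"), "q3": ("posture", "B")}` (made-up categories), calling `grounding_accuracy([("q1", "A"), ("q2", "B"), ("q3", "B")], key)` reports accuracy separately for each category, which makes it easier to see which kinds of cues a model fails to retrieve.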

📉 Key Takeaway: Our evaluation of 27 frontier models reveals a ubiquitous Prejudice Gap: over half (51%) of correct personality ratings are completely ungrounded in actual behavioral cues.

📦 All artifacts are open-sourced.

We'd love to hear your thoughts and feedback! Check out our work and let's discuss downstream safety and alignment for socially intelligent agents! 🚀

Get this paper in your agent:

hf papers read 2605.22109
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
