A systematic failure mode in AI chatbot mental-health interactions
Weilnhammer, Hou, Luettgau, Summerfield, Dolan & Nour
Max Planck UCL Centre for Computational Psychiatry · University of Sydney · UK AI Security Institute · University of Oxford · Microsoft AI
Center Meeting — March 2026
Opportunities & risks of AI chatbots
Access to mental-health care is limited
Millions already use consumer AI chatbots, including for behavioral and mental-health support
Their scale, availability, and low cost create real potential to expand access to care
But these same features can also scale harm, especially for vulnerable users
Core tension: The same systems that may help close the mental-health access gap can also amplify risk when deployed at scale without adequate safety evaluation.
Current safety evaluations are not enough
Mental-health safety depends on who the user is, what they want, and how interaction unfolds over time
Current benchmarks are mostly static, single-turn, and focused on overt policy violations
Human red teaming can probe longer conversations, but its coverage is limited
Both approaches often miss sub-threshold harms that build gradually across turns
Problem: Existing evaluations miss most of the risk space of frontier AI chatbots.
Key contributions
SIM-VAIL: Automated, clinically informed auditing via simulated multi-turn conversations across 30 user phenotypes, 9 chatbots, & 13 risk dimensions (audit loop sketched after this list)
VAILs: A novel failure mode — Vulnerability-Amplifying Interaction Loops — where locally supportive behaviors align with cognitive mechanisms of mental illness
Risk accumulates over turns: Harm is not a single-response event, but evolves dynamically over time
Risk is multivariate with trade-offs: Mitigating one class of risk can exacerbate another
810 conversations, 90K+ turn-level ratings: Largest multi-turn, multi-dimensional mental-health chatbot audit to date
Where does risk emerge? · Which chatbots are safer? · Does risk accumulate over turns? · What kinds of harm define the VAIL risk space?
Risk varies by user vulnerability and intent
Concerning behavior scores (1–10) across all chatbots
Vulnerability × intent interaction (F(20, 810) = 26.14, p < 0.001; an illustrative ANOVA sketch follows the key patterns below)
The same intent can be benign in one phenotype but harmful in another
Key patterns
OCD is generally low-risk except with dependence and risky-action intents
Glorification is especially harmful for depression and mania
Minimization creates concerning chatbot behavior in psychosis and mania
Risk is specific to vulnerable users
Controls score lower than vulnerable-user conversations (p < 0.001; contrast sketched below)
Interpretation
The same chatbots are less concerning with non-vulnerable users
This supports the idea that risk is vulnerability-dependent
VAIL risk reflects an interaction between chatbot behavior and user state, not just baseline model behavior
Robustness & validation
Conversation- vs. turn-level correlation: 0.87
Cross-judge correlation (PC1): 0.90
ICC(1,3) across replicates: 0.90
ICC(3,1) vs. expert psychiatrist: 0.73
Median AUC, causal recovery: 0.98
Differences between AI chatbots
Main findings
Lowest risk: claude-sonnet-4.5
Highest risk: grok-4, grok-3, llama-3.1-70B
Newer models are generally safer (p = 0.014), except the Grok family
Many risk dimensions co-vary, suggesting a shared overall risk gradient
PCA: PC1 captures 62.4% of variance, while PC2 (8.51%) distinguishes kinds of harm.
Risk accumulates over conversation turns
Four trajectory archetypes
K-means clustering of turn-level risk trajectories (k = 4)
Implications
Concerning chatbot behavior is a dynamic phenomenon
Differences between clusters would be invisible to single-turn benchmarks
Recovery pattern suggests some chatbots can self-correct
Multivariate risks
PCA on 13 risk dimensions reveals structured multivariate profiles
PC structure
PC1 (62.4%): general risk gradient
PC2 (8.5%): kind of harm
Trade-offs: Risk dimensions show both correlation and anticorrelation, so reducing one kind of harm may increase another.
Limitations & open questions
Simulated users ≠ real users: Simulated conversations have controlled structure that does not capture all dynamics of real-world chatbot use
LLM-as-judge: Both simulation and scoring rely on LLMs, so shared model biases could distort user behavior and risk ratings alike
Coverage: 5 vulnerabilities × 6 intents covers core presentations but not the full heterogeneity of psychiatric illness and lived experience
Ecological validity: Whether patterns observed in simulated adversarial interactions translate into drivers of harm for real users requires empirical validation
Despite these limitations: SIM-VAIL's cross-profile, cross-model, and cross-temporal results establish a non-trivial lower bound on the risks present in current human-chatbot interactions.