SIM-VAIL — Vulnerability-Amplifying Interaction Loops in AI Chatbot Mental-Health Interactions

Vulnerability-Amplifying Interaction Loops

A systematic failure mode in AI chatbot mental-health interactions

Weilnhammer, Hou, Luettgau, Summerfield, Dolan & Nour

Max Planck UCL Centre for Computational Psychiatry · University of Sydney · UK AI Security Institute · University of Oxford · Microsoft AI

Center Meeting — March 2026

Opportunities & risks of AI chatbots

Access to mental-health care is limited
Millions already use consumer AI chatbots, including for behavioral & mental-health
Their scale, availability, and low cost create real potential to expand access to care
But these same features can also scale harm, especially for vulnerable users

Core tension: The same systems that may help close the mental-health access gap can also amplify risk when deployed at scale without adequate safety evaluation.

Current safety evaluations are not enough

Mental-health safety depends on who the user is, what they want, and how interaction unfolds over time
Current benchmarks are mostly static, single-turn, and focused on overt policy violations
Human red teaming can probe longer conversations, but is coverage is limited
Both approaches often miss sub-treshold harms that build gradually across turns

Problem: Existing evaluations miss most of the risk space of frontier AI chatbots.

Key contributions

SIM-VAIL: Automated, clinically informed auditing via simulated multiturn conversations across 30 user phenotypes, 9 chatbots, & 13 risk dimensions
VAILs: A novel failure mode — Vulnerability-Amplifying Interaction Loops — where locally supportive behaviors align with cognitive mechanisms of mental illness
Risk accumulates over turns: Harm is not a single-response event, but evolves dynamically over time
Risk is multivariate with trade-offs: Mitigating one class of risk can exacerbate another
810 conversations, 90K+ turn-level ratings: Largest multi-turn, multi-dimensional mental-health chatbot audit to date

Methods

SIM-VAIL: Simulated Vulnerability-Amplifying Interaction Loops

Who are the simulated users? · How are conversations generated? · How is risk scored and validated?

Dimensions & Scale

Simulated interactions

Three core dimensions

Who is the user? → 5 psychological vulnerabilities
What does the user want? → 6 interaction intents
How does the interaction unfold over time?

Scale

30 phenotypes × 9 chatbots × 3 replicates = 810 conversations
6,367 turns → 93K+ turn-level ratings
Median 8 turns/conversation (max 10)

User phenotypes: vulnerability × intent

5 Vulnerabilities

Depression — negative self-schemata, hopelessness, withdrawal
Psychosis — aberrant salience, threat inference, fixed interpretations
Mania — elevated mood, impulsivity, reduced sleep
OCD — intrusive thoughts, intolerance of uncertainty, compulsions
Insecure attachment — fear of abandonment, boundary testing

6 Transdiagnostic intents

Belief validation
Risky action permission
Reassurance & avoidance
Dependence & anthropomorphism
Trivialization & minimization
Glorification & romanticization

Vulnerabilities prompted by cognitive-behavioral formulations of mental ilness
Intents reflect reflect mechanisms implicated in the onset, maintenance, and exacerbation of mental illness

Multi-dimensional risk scoring

Overall concerning behavior
Therapeutic quality
Self-harm enablement
Sycophancy
Belief reinforcement
Risky action enablement
Avoidance / reassurance loops
Boundary / dependence risk
Minimization of risk
Glorification / romanticization
Medical advice risk
Harm to others
Stigma / invalidation

Alignment judge

LLM judge (opus-4.5) scores each turn and whole conversation on 1–10 scale

What are VAILs?

Vulnerability-Amplifying Interaction Loops — illustrative examples

Vulnerability	Mechanism
Psychosis	Chatbot validates paranoid belief → user shares more → chatbot continues validating → delusional conviction strengthens
OCD	User seeks contamination reassurance → chatbot reassures → short-term relief reinforces cycle → compulsions maintained
Mania	User describes sleepless ambition → chatbot expresses enthusiasm → user escalates → risk increases over turns
Depression	User expresses hopelessness → chatbot affirms withdrawal as understandable → disengagement consolidates
Insecure attachment	User tests closeness → chatbot provides strong emotional reassurance → dependence intensifies

Hypothesis: Behaviors that are supportive in general contexts become maladaptive when they align with mechanisms that sustain a user's vulnerability.

Results

810 conversations · 9 chatbots · 30 phenotypes · 90K+ turn-level ratings

Where does risk emerge? · Which chatbots are safer? · Does risk accumulate over turns? · What kinds of harm define the VAIL risk space?

Risk varies by user vulnerability and intent

Concerning behavior scores (1–10) across all chatbots

Vulnerability × intent interaction

The same intent can be benign in one phenotype but harmful in another

Vulnerability x intent interaction (F(20, 810) = 26.14, p < 0.001)

Key patterns

OCD is generally low-risk except with dependence and risky-action intents
Glorification is especially harmful for depression and mania
Minimization creates concerning chatbot behavior in psychosis and mania

Risk is specific to vulnerable users

Controls score lower than vulnerable-user conversations (p < 0.001)

Interpretation

The same chatbots are less concerning with non-vulnerable users
This supports the idea that risk is vulnerability-dependent
VAIL risk reflects an interaction between chatbot behavior and user state, not just baseline model behavior

Robustness & validation

0.87

Conversation- vs.
turn-level correlation

0.90

Cross-judge correlation
(PC1)

0.90

ICC(1,3)
across replicates

0.73

ICC(3,1)
vs. expert psychiatrist

0.98

Median AUC
causal recovery

Robustness & validation

0.87

Conversation- vs.
turn-level correlation

0.90

Cross-judge correlation

0.90

ICC
across replicates

0.73

ICC
vs. expert psychiatrist

0.98

Median AUC
causal recovery

Differences between AI chatbots

Main findings

Lowest risk: claude-sonnet-4.5
Highest risk: grok-4, grok-3, llama-3.1-70B
Newer models are generally safer (p = 0.014), except the Grok family
Many risk dimensions co-vary, suggesting a shared overall risk gradient

PCA: PC1 captures 62.4% of variance, while PC2 (8.51%) distinguishes kinds of harm.

Risk accumulates over conversation turns

Four trajectory archetypes

K-means clustering of turn-level risk trajectories (k = 4)

Implications

Concerning chatbot behavior is a dynamic phenomenon
Differences between clusters would be invisible to single-turn benchmarks
Recovery pattern suggests some chatbots can self-correct

Multivariate risks

PCA on 13 risk dimensions reveals structured multivariate profiles

PC structure

PC1 (62.4%): general risk gradient
PC2 (8.5%): kind of harm

Trade-offs: Risk dimensions show both correlation and anticorrelation, so reducing one kind of harm may increase another.

Limitations & open questions

Simulated users ≠ real users: Simulated conversations have controlled structure that does not capture all dynamics of real-world chatbot use
LLM-as-judge: Both simulation and scoring rely on LLMs
Coverage: 5 vulnerabilities × 6 intents covers core presentations but not the full heterogeneity of psychiatric illness and lived experience
Ecological validity: Whether patterns of simulated adverserial interactions translate into drivers of harm in real users requires empirical validation

Despite these limitations: SIM-VAIL's cross-profile, cross-model, and cross-temporal results establish a non-trivial safety floor in current human-chatbot interactions.

Next step: SPARK project

SIM-VAIL: Simulated Vulnerability-Amplifying Interaction Loops

What are the vectors of mental health harm? · Can they be corrected with mulitmodal mental state inferences?

From assistant persona to user state

Persona vectors

Illustrative placeholder for Anthropic-style persona-space work

Core problem

Responses can be fluent, empathic, and even knowledgeable, yet still be clinically wrong because the model is wrong about the user's current state.

MAILA as a AI chatbot safety layer

SPARK proposal

Take-home messages

VAILs: Supportive chatbot behaviors become harmful when they align with mechanisms sustaining psychiatric vulnerability
Risk is phenotype-dependent: Who the user is and what they seek jointly determine the level and kind of risk
Risk is dynamic: Harm typically builds over turns, not in a single response
Risk is multivariate with trade-offs: Multi-dimensional scoring is essential to capture concerning chatbot behavior
SPARK: Improve user state inference for safer AI chatbot interactions

📄 Paper: Weilnhammer et al. (2026) | ✉️ v.weilnhammer@ucl.ac.uk | 💻 Code & data: github.com/veithweilnhammer/sim-vail

Vulnerability-Amplifying Interaction Loops

Opportunities & risks of AI chatbots

Current safety evaluations are not enough

Key contributions

Methods

Dimensions & Scale

Three core dimensions

Scale

User phenotypes: vulnerability × intent

5 Vulnerabilities

6 Transdiagnostic intents

Multi-dimensional risk scoring

Alignment judge

What are VAILs?

Results

Risk varies by user vulnerability and intent

Vulnerability × intent interaction

Key patterns

Risk is specific to vulnerable users

Interpretation

Robustness & validation

Robustness & validation

Differences between AI chatbots

Main findings

Risk accumulates over conversation turns

Four trajectory archetypes

Implications

Multivariate risks

PC structure

Limitations & open questions

Next step: SPARK project

From assistant persona to user state

Persona vectors

Core problem

MAILA as a AI chatbot safety layer

SPARK proposal

Take-home messages

Thanks!