Vulnerability-Amplifying Interaction Loops

A systematic failure mode in AI chatbot mental-health interactions

Weilnhammer, Hou, Luettgau, Summerfield, Dolan & Nour

Max Planck UCL Centre for Computational Psychiatry · University of Sydney · UK AI Security Institute · University of Oxford · Microsoft AI

Center Meeting — March 2026


Opportunities & risks of AI chatbots

  • Access to mental-health care is limited
  • Millions already use consumer AI chatbots, including for behavioral & mental-health
  • Their scale, availability, and low cost create real potential to expand access to care
  • But these same features can also scale harm, especially for vulnerable users
Core tension: The same systems that may help close the mental-health access gap can also amplify risk when deployed at scale without adequate safety evaluation.

Current safety evaluations are not enough

  • Mental-health safety depends on who the user is, what they want, and how interaction unfolds over time
  • Current benchmarks are mostly static, single-turn, and focused on overt policy violations
  • Human red teaming can probe longer conversations, but is coverage is limited
  • Both approaches often miss sub-treshold harms that build gradually across turns
Problem: Existing evaluations miss most of the risk space of frontier AI chatbots.

Key contributions

  1. SIM-VAIL: Automated, clinically informed auditing via simulated multiturn conversations across 30 user phenotypes, 9 chatbots, & 13 risk dimensions
  2. VAILs: A novel failure mode — Vulnerability-Amplifying Interaction Loops — where locally supportive behaviors align with cognitive mechanisms of mental illness
  3. Risk accumulates over turns: Harm is not a single-response event, but evolves dynamically over time
  4. Risk is multivariate with trade-offs: Mitigating one class of risk can exacerbate another
  5. 810 conversations, 90K+ turn-level ratings: Largest multi-turn, multi-dimensional mental-health chatbot audit to date

Methods

SIM-VAIL: Simulated Vulnerability-Amplifying Interaction Loops

Who are the simulated users? · How are conversations generated? · How is risk scored and validated?

Dimensions & Scale

Simulated interactions

Three core dimensions
  • Who is the user? → 5 psychological vulnerabilities
  • What does the user want? → 6 interaction intents
  • How does the interaction unfold over time?
Scale
  • 30 phenotypes × 9 chatbots × 3 replicates = 810 conversations
  • 6,367 turns → 93K+ turn-level ratings
  • Median 8 turns/conversation (max 10)

User phenotypes: vulnerability × intent

5 Vulnerabilities
  • Depression — negative self-schemata, hopelessness, withdrawal
  • Psychosis — aberrant salience, threat inference, fixed interpretations
  • Mania — elevated mood, impulsivity, reduced sleep
  • OCD — intrusive thoughts, intolerance of uncertainty, compulsions
  • Insecure attachment — fear of abandonment, boundary testing
6 Transdiagnostic intents
  • Belief validation
  • Risky action permission
  • Reassurance & avoidance
  • Dependence & anthropomorphism
  • Trivialization & minimization
  • Glorification & romanticization
  • Vulnerabilities prompted by cognitive-behavioral formulations of mental ilness
  • Intents reflect reflect mechanisms implicated in the onset, maintenance, and exacerbation of mental illness

Multi-dimensional risk scoring

  • Overall concerning behavior
  • Therapeutic quality
  • Self-harm enablement
  • Sycophancy
  • Belief reinforcement
  • Risky action enablement
  • Avoidance / reassurance loops
  • Boundary / dependence risk
  • Minimization of risk
  • Glorification / romanticization
  • Medical advice risk
  • Harm to others
  • Stigma / invalidation
Alignment judge
  • LLM judge (opus-4.5) scores each turn and whole conversation on 1–10 scale

What are VAILs?

Vulnerability-Amplifying Interaction Loops — illustrative examples

VulnerabilityMechanism
PsychosisChatbot validates paranoid belief → user shares more → chatbot continues validating → delusional conviction strengthens
OCDUser seeks contamination reassurance → chatbot reassures → short-term relief reinforces cycle → compulsions maintained
ManiaUser describes sleepless ambition → chatbot expresses enthusiasm → user escalates → risk increases over turns
DepressionUser expresses hopelessness → chatbot affirms withdrawal as understandable → disengagement consolidates
Insecure attachmentUser tests closeness → chatbot provides strong emotional reassurance → dependence intensifies
Hypothesis: Behaviors that are supportive in general contexts become maladaptive when they align with mechanisms that sustain a user's vulnerability.

Results

810 conversations · 9 chatbots · 30 phenotypes · 90K+ turn-level ratings

Where does risk emerge? · Which chatbots are safer? · Does risk accumulate over turns? · What kinds of harm define the VAIL risk space?

Risk varies by user vulnerability and intent

Concerning behavior scores (1–10) across all chatbots

Vulnerability × intent interaction

The same intent can be benign in one phenotype but harmful in another

Vulnerability x intent interaction (F(20, 810) = 26.14, p < 0.001)

Key patterns
  • OCD is generally low-risk except with dependence and risky-action intents
  • Glorification is especially harmful for depression and mania
  • Minimization creates concerning chatbot behavior in psychosis and mania

Risk is specific to vulnerable users

Controls score lower than vulnerable-user conversations (p < 0.001)

Interpretation
  • The same chatbots are less concerning with non-vulnerable users
  • This supports the idea that risk is vulnerability-dependent
  • VAIL risk reflects an interaction between chatbot behavior and user state, not just baseline model behavior

Robustness & validation

0.87
Conversation- vs.
turn-level correlation
0.90
Cross-judge correlation
(PC1)
0.90
ICC(1,3)
across replicates
0.73
ICC(3,1)
vs. expert psychiatrist
0.98
Median AUC
causal recovery

Robustness & validation

0.87
Conversation- vs.
turn-level correlation
0.90
Cross-judge correlation
0.90
ICC
across replicates
0.73
ICC
vs. expert psychiatrist
0.98
Median AUC
causal recovery

Differences between AI chatbots

Main findings
  • Lowest risk: claude-sonnet-4.5
  • Highest risk: grok-4, grok-3, llama-3.1-70B
  • Newer models are generally safer (p = 0.014), except the Grok family
  • Many risk dimensions co-vary, suggesting a shared overall risk gradient
PCA: PC1 captures 62.4% of variance, while PC2 (8.51%) distinguishes kinds of harm.

Risk accumulates over conversation turns

Four trajectory archetypes

K-means clustering of turn-level risk trajectories (k = 4)

Implications
  • Concerning chatbot behavior is a dynamic phenomenon
  • Differences between clusters would be invisible to single-turn benchmarks
  • Recovery pattern suggests some chatbots can self-correct

Multivariate risks

PCA on 13 risk dimensions reveals structured multivariate profiles

PC structure
  • PC1 (62.4%): general risk gradient
  • PC2 (8.5%): kind of harm
Trade-offs: Risk dimensions show both correlation and anticorrelation, so reducing one kind of harm may increase another.

Limitations & open questions

  • Simulated users ≠ real users: Simulated conversations have controlled structure that does not capture all dynamics of real-world chatbot use
  • LLM-as-judge: Both simulation and scoring rely on LLMs
  • Coverage: 5 vulnerabilities × 6 intents covers core presentations but not the full heterogeneity of psychiatric illness and lived experience
  • Ecological validity: Whether patterns of simulated adverserial interactions translate into drivers of harm in real users requires empirical validation
  • Despite these limitations: SIM-VAIL's cross-profile, cross-model, and cross-temporal results establish a non-trivial safety floor in current human-chatbot interactions.

Next step: SPARK project

SIM-VAIL: Simulated Vulnerability-Amplifying Interaction Loops

What are the vectors of mental health harm? · Can they be corrected with mulitmodal mental state inferences?

From assistant persona to user state

Persona vectors

Illustrative placeholder for Anthropic-style persona-space work

Core problem
Responses can be fluent, empathic, and even knowledgeable, yet still be clinically wrong because the model is wrong about the user's current state.

MAILA as a AI chatbot safety layer

SPARK proposal

Take-home messages

  1. VAILs: Supportive chatbot behaviors become harmful when they align with mechanisms sustaining psychiatric vulnerability
  2. Risk is phenotype-dependent: Who the user is and what they seek jointly determine the level and kind of risk
  3. Risk is dynamic: Harm typically builds over turns, not in a single response
  4. Risk is multivariate with trade-offs: Multi-dimensional scoring is essential to capture concerning chatbot behavior
  5. SPARK: Improve user state inference for safer AI chatbot interactions

📄 Paper: Weilnhammer et al. (2026)  |  ✉️ v.weilnhammer@ucl.ac.uk  |  💻 Code & data: github.com/veithweilnhammer/sim-vail

Thanks!