S-Chain: Structured Visual Chain-of-Thought for Medicine

1 University of Toronto, Canada; 2 Knovel Engineering Lab, Singapore; 3 German Research Centre for Artificial Intelligence (DFKI), Germany; 4 University of Stuttgart, Germany; 5 Chonnam National University, South Korea; 6 Singapore University of Technology and Design; 7 Bucknell University, USA; 8 Concordia University, Canada; 9 Korea University; 10 Justus Liebig University Giessen, Germany; 11 University Medical Center Göttingen, Germany; 12 Japan Advanced Institute of Science and Technology; 13 Hue University, Vietnam; 14 College of William & Mary, USA; 15 Deakin University, Australia; 16 National University of Singapore; 17 University of Texas at Austin, USA; 18 University of California, Berkeley, USA; 19 New Jersey Institute of Technology, USA; 20 University of California San Diego, USA; 21 MBZUAI, UAE; 22 Oldenburg University, Germany; 23 Stanford University, USA; 24 Max Planck Research School for Intelligent Systems (IMPRS-IS), Germany; 25 Auburn University, USA
*Co-first authors (order randomized); **Co-second authors; Corresponding Author

Abstract

Faithful reasoning in medical vision–language models (VLMs) requires not only accurate predictions but also transparent alignment between textual rationales and visual evidence. While Chain-of-Thought (CoT) prompting has shown promise in medical visual question answering (VQA), no large-scale, expert-level dataset has captured stepwise reasoning with precise visual grounding. We introduce S-Chain, the first large-scale dataset of 12,000 expert-annotated medical images with bounding-box annotations and structured visual CoT (SV-CoT), explicitly linking visual regions to reasoning steps. The dataset further supports 16 languages, totaling over 700k VQA pairs for broad multilingual applicability. Using S-Chain, we benchmark state-of-the-art medical VLMs, showing that SV-CoT supervision significantly improves interpretability, grounding fidelity, and robustness. S-Chain establishes a new benchmark for grounded medical reasoning and paves the way toward more trustworthy and explainable medical VLMs.


Key Contributions

  • Dataset Innovation: We introduce S-Chain, the first large-scale medical dataset with 12k images featuring expert-verified, visually grounded reasoning traces. It includes over 700k multilingual QA pairs across 16 languages.
  • Superiority over Synthetic Data: We demonstrate that models trained with S-Chain's expert annotations consistently and significantly outperform models trained on synthetic CoTs generated by state-of-the-art models such as GPT-4.1.
  • Analytical Insights: We provide a deep dive into how structured reasoning interacts with Retrieval-Augmented Generation (RAG) and propose a new method to improve the faithfulness of the model's reasoning to the visual evidence.

The S-Chain Dataset & Method

Our dataset was meticulously created through a six-step annotation pipeline involving three trained doctors and professional linguists, requiring over 700 hours of expert labor.

Figure: The six-step S-Chain annotation pipeline.

The dataset is built from the OASIS MRI dataset and is split to ensure no patient overlap between the training and testing sets, with a distribution that mirrors real-world clinical cohorts. S-Chain supports 16 languages: English, French, German, Japanese, Korean, Mandarin, Portuguese, Spanish, Italian, Dutch, Russian, Arabic, Hindi, Vietnamese, Thai, and Indonesian, enabling broad multilingual applicability in medical AI systems.
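
As a minimal sketch of this patient-level protocol, the snippet below shows one generic way to assign whole patients to either split so that no patient's scans appear in both; the record structure and field names (patient_id, image_id) are assumptions for illustration, not S-Chain's released schema.

# Minimal sketch: enforce a patient-level split so no patient appears in both
# train and test. Field names are assumptions, not S-Chain's actual schema.
import random
from collections import defaultdict

def split_by_patient(records, test_fraction=0.125, seed=0):
    """Group records by patient, then assign whole patients to train or test."""
    by_patient = defaultdict(list)
    for rec in records:
        by_patient[rec["patient_id"]].append(rec)

    patients = sorted(by_patient)
    random.Random(seed).shuffle(patients)

    n_test = max(1, int(len(patients) * test_fraction))
    test_patients = set(patients[:n_test])

    train = [r for p, recs in by_patient.items() if p not in test_patients for r in recs]
    test = [r for p, recs in by_patient.items() if p in test_patients for r in recs]
    return train, test

# Toy usage with made-up records (S-Chain itself holds out 9 of 64 patients, ~14%).
records = [{"patient_id": f"P{i % 8}", "image_id": f"img_{i}"} for i in range(40)]
train_set, test_set = split_by_patient(records)
assert not {r["patient_id"] for r in train_set} & {r["patient_id"] for r in test_set}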

| Split | # Images | # QA pairs | # Patients | Non-Dementia | Mild Dementia | Moderate Dementia |
|-------|----------|------------|------------|--------------|---------------|-------------------|
| Train | 10,783 | ~690k | 55 | 4,628 | 4,755 | 1,400 |
| Test  | 1,542  | ~98k  | 9  | 562   | 420   | 560   |
| Total | 12,325 | ~788k | 64 | 5,190 | 5,175 | 1,960 |
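
To make the structure of a single SV-CoT annotation concrete, the record below is sketched in Python; every field name and value is an illustrative assumption rather than S-Chain's actual release format.

# Hypothetical example of what one SV-CoT annotation could contain.
# Field names and values are illustrative only, not S-Chain's actual schema.
example_record = {
    "image_id": "oasis_scan_0001",
    "language": "en",
    "question": "Does this MRI show signs of dementia?",
    "regions": [
        {"label": "lateral ventricles", "bbox": [84, 60, 148, 132]},  # [x1, y1, x2, y2]
        {"label": "hippocampus", "bbox": [96, 140, 130, 170]},
    ],
    "sv_cot": [
        {"step": 1, "text": "The lateral ventricles appear enlarged.", "region": 0},
        {"step": 2, "text": "Hippocampal volume appears reduced.", "region": 1},
        {"step": 3, "text": "Together, these findings suggest mild atrophy.", "region": None},
    ],
    "answer": "Mild dementia",
}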

Multilingual Dataset Examples

S-Chain provides structured visual chain-of-thought reasoning across multiple languages, ensuring global accessibility and applicability in diverse medical contexts. Below are four examples showcasing the dataset's multilingual capabilities:

Figure: Example dataset entries in English, French, German, and Japanese.

Results & Findings

1. The Value of Expert Annotations vs. Synthetic Data

A core finding of our work is that expert-annotated, structured data is crucial for reliable medical reasoning. Models trained with S-Chain consistently outperform baselines trained on synthetic CoTs generated by GPT-4.1 by 4-5%.

Figure: Results of medical VLMs on S-Chain.

Synthetic data often suffers from hallucinations, such as incorrect or missing bounding boxes, which undermine the model's ability to ground its reasoning in visual evidence.
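
One generic way to quantify this kind of bounding-box hallucination is to match generated boxes against expert boxes by intersection-over-union (IoU); the helper below and its 0.5 threshold are our own illustrative choices, not the paper's evaluation protocol.

# Sketch: flag hallucinated or missing boxes by IoU against expert annotations.
def iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def box_errors(predicted, expert, thr=0.5):
    """Predicted boxes with no expert match are 'hallucinated';
    expert boxes with no predicted match are 'missing'."""
    hallucinated = [p for p in predicted if all(iou(p, e) < thr for e in expert)]
    missing = [e for e in expert if all(iou(p, e) < thr for p in predicted)]
    return hallucinated, missing

# Toy usage: the second predicted box has no expert counterpart.
pred = [[10, 10, 50, 50], [200, 200, 240, 240]]
gt = [[12, 12, 48, 52]]
hallucinated, missing = box_errors(pred, gt)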

Figure: GPT-4.1 hallucination vs. S-Chain ground truth.

Red: GPT-4.1 hallucinates an "extra-axial lesion" that is not present. Green: The S-Chain annotation correctly identifies "no atrophy".

2. Synergy with External Medical Knowledge (MedRAG)

While S-Chain provides strong visual grounding, we found that combining it with external medical knowledge via Retrieval-Augmented Generation (RAG) yields the best performance.

Table 5: Impact of MedRAG and SV-CoT on Q4 performance. Scores are Accuracy / F1; Δ is the absolute gain over Base (Accuracy / F1).

| Model | Base | + MedRAG | Δ | + SV-CoT | Δ | + SV-CoT + MedRAG | Δ |
|-------|------|----------|---|----------|---|-------------------|---|
| ExGra-Med | 49.4 / 46.9 | 50.3 / 48.7 | +0.9 / +1.8 | 60.4 / 59.6 | +11.0 / +12.7 | 64.8 / 62.6 | +15.4 / +15.7 |
| LLaVA-Med | 46.8 / 43.2 | 50.8 / 48.9 | +4.0 / +5.7 | 55.7 / 53.0 | +8.9 / +9.8 | 59.5 / 57.8 | +12.7 / +14.6 |
| MedGemma | 45.9 / 42.1 | 47.6 / 44.4 | +1.7 / +2.3 | 59.4 / 56.7 | +13.5 / +14.6 | 56.7 / 52.9 | +10.8 / +10.8 |
| Qwen2.5-VL | 50.5 / 45.6 | 54.3 / 54.2 | +3.8 / +8.6 | 55.0 / 49.4 | +4.5 / +3.8 | 60.8 / 47.9 | +10.3 / +2.3 |
| InternVL2.5 | 50.5 / 47.6 | 52.3 / 43.3 | +1.8 / -4.3 | 53.4 / 48.8 | +2.9 / +1.2 | 58.3 / 54.6 | +7.8 / +7.0 |
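
Conceptually, the "+ SV-CoT + MedRAG" setting can be pictured as retrieve-then-reason prompting: retrieved medical passages are prepended to the question, and the model is asked to produce grounded reasoning steps before its final answer. The sketch below is only an assumed wiring; retriever and vlm_generate are hypothetical stand-ins, not the actual MedRAG or model APIs used in the paper.

# Sketch of combining retrieved medical knowledge with SV-CoT-style prompting.
# retriever and vlm_generate are hypothetical placeholders.
def build_prompt(question, passages):
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "You are a medical assistant. Use the reference passages and the image.\n"
        f"References:\n{context}\n\n"
        f"Question: {question}\n"
        "First list numbered reasoning steps, each citing the image region it "
        "relies on (as a bounding box), then give the final answer."
    )

def answer_with_medrag(image, question, retriever, vlm_generate, k=3):
    passages = retriever(question, top_k=k)          # external medical knowledge
    prompt = build_prompt(question, passages)        # SV-CoT-style instruction
    return vlm_generate(image=image, prompt=prompt)  # grounded reasoning + answer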

3. Toward More Faithful Reasoning

A. Component-wise Reasoning Analysis. Experiments on the S-Chain dataset show that providing ground-truth ROIs yields small gains in diagnostic accuracy, but adding correct chain-of-thought (CoT) steps nearly solves the task, reaching 99% accuracy. This shows that accurate intermediate reasoning makes final predictions almost trivial, highlighting the importance of structured supervision over end-to-end training for interpretable and reliable medical vision-language models.
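
In spirit, this component-wise analysis amounts to revealing different ground-truth inputs at test time and asking the model only for the final diagnosis. The prompt builder below is a rough, assumed approximation of such an ablation, not the paper's exact protocol.

# Sketch: build test-time prompts that reveal no oracle information, ground-truth
# ROIs only, or ROIs plus ground-truth CoT steps. Templates are illustrative only.
def make_ablation_prompt(question, gt_rois=None, gt_cot=None):
    parts = [f"Question: {question}"]
    if gt_rois is not None:
        rois = "; ".join(f"{r['label']} at {r['bbox']}" for r in gt_rois)
        parts.append(f"Ground-truth regions of interest: {rois}")
    if gt_cot is not None:
        steps = "\n".join(f"{i + 1}. {s}" for i, s in enumerate(gt_cot))
        parts.append(f"Ground-truth reasoning steps:\n{steps}")
    parts.append("Give only the final diagnosis.")
    return "\n".join(parts)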

B. Bounding Boxes and Grounded Reasoning. To study how ROI representations affect visual reasoning, we compare two grounding strategies: (1) textual supervision, which encodes bounding-box coordinates in the training text, and (2) visual prompting, which highlights ROIs directly on the image. We further test how perturbing or removing ROI information impacts the accuracy and alignment of chain-of-thought (CoT) reasoning with visual evidence.

Findings: Visual prompting better anchors reasoning to true abnormalities (0.73 Acc) than textual supervision (0.62 Acc), leading to more faithful and clinically grounded CoTs.
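
The two strategies can be contrasted concretely: textual supervision serializes box coordinates into the prompt, while visual prompting draws the ROIs onto the image before it is encoded. The sketch below (using Pillow for the overlay) is an assumed illustration of the two setups, not the exact training code.

# Sketch of the two grounding strategies compared above. Illustrative only.
from PIL import Image, ImageDraw

def textual_supervision(question, rois):
    """Strategy 1: serialize bounding-box coordinates into the prompt text."""
    coords = "; ".join(f"{r['label']}: {r['bbox']}" for r in rois)
    return f"{question}\nRelevant regions (x1, y1, x2, y2): {coords}"

def visual_prompting(image, rois):
    """Strategy 2: draw the ROIs directly onto the image before encoding."""
    marked = image.copy()
    draw = ImageDraw.Draw(marked)
    for r in rois:
        draw.rectangle(r["bbox"], outline="red", width=3)
    return marked

# Toy usage with a blank image and one made-up ROI.
img = Image.new("RGB", (256, 256), "black")
rois = [{"label": "lateral ventricles", "bbox": [84, 60, 148, 132]}]
text_prompt = textual_supervision("Does this MRI show atrophy?", rois)
marked_img = visual_prompting(img, rois)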

Figure: Control experiments evaluating the role of each SV-CoT component.
Figure: Alignment between textual reasoning and visual attention on regions of interest.

Light peach blocks show ground-truth inputs at test time, while blue/green blocks are model-generated. Upper settings use text-based CoTs, and lower settings use visual prompting to ground reasoning in ROIs.

C. Towards Faithful Vision-Language Reasoning. We found that models can generate plausible reasoning without truly attending to the intended visual regions. To address this, we introduced a training strategy that explicitly aligns the model's CoT with visual tokens from the correct ROIs. This lightweight regularization improves ExGra-Med's accuracy from 60.4% to 62.5% and F1 from 59.6% to 61.7%. While the gains are modest, they underscore that enhancing CoT–ROI alignment is a promising step toward more faithful multimodal reasoning, with its optimal enforcement remaining an open challenge for future research.
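
The exact form of this regularizer is not spelled out here, so the snippet below shows just one plausible instantiation: an auxiliary cosine-similarity term that pulls pooled CoT hidden states toward pooled visual tokens from the correct ROI, added to the language-modeling loss with a small weight. It is an assumption-laden sketch, not the actual implementation.

# One plausible (assumed) CoT–ROI alignment regularizer in PyTorch.
import torch
import torch.nn.functional as F

def cot_roi_alignment_loss(cot_hidden, roi_visual_tokens):
    """cot_hidden: (B, T_text, D) hidden states of the generated CoT tokens.
    roi_visual_tokens: (B, T_roi, D) visual tokens falling inside the ROI."""
    cot_vec = F.normalize(cot_hidden.mean(dim=1), dim=-1)
    roi_vec = F.normalize(roi_visual_tokens.mean(dim=1), dim=-1)
    return (1.0 - (cot_vec * roi_vec).sum(dim=-1)).mean()

def total_loss(lm_loss, cot_hidden, roi_visual_tokens, weight=0.1):
    # Lightweight regularization on top of the usual language-modeling loss.
    return lm_loss + weight * cot_roi_alignment_loss(cot_hidden, roi_visual_tokens)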

Limitations and Future Work

While S-Chain is a significant step forward, we acknowledge its limitations. The dataset currently has limited diagnostic coverage and follows a linear reasoning path that simplifies complex clinical workflows.

Future work should focus on expanding the dataset and developing more advanced models. Key research directions include large-scale grounded pre-training, new cross-modal contrastive objectives, and faithful decoding strategies to ensure VLMs are not only accurate but truly clinically trustworthy.

Citation


@misc{leduc2024schain,
      title={S-Chain: Structured Visual Chain-of-Thought for Medicine}, 
      author={Khai Le-Duc and Phuong T. H. Trinh and Duy M. H. Nguyen and Tien-Phat Nguyen and Nghiem T. Diep and An Ngo and Tung Vu and Trinh Vuong and Anh-Tien Nguyen and Mau Nguyen and Van Trung Hoang and Khai-Nguyen Nguyen and Hy Nguyen and Chris Ngo and Anji Liu and Nhat Ho and Anne-Christin Hauschild and Khanh Xuan Nguyen and Thanh Nguyen-Tang and Pengtao Xie and Daniel Sonntag and James Zou and Mathias Niepert and Anh Totti Nguyen},
      year={2024},
      eprint={...},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}