
SENSEMAKING | EXPLAINABLE AI | LLMS | QUESTION & ANSWER SYSTEM

SENSEMAKING IN LLM-BASED BIOMEDICAL QA

[ACM CHI 2026 (IN PREPARATION)] EXPLAINING ANSWERS, BUILDING TRUST:
SENSEMAKING IN LLM-BASED BIOMEDICAL QA

ROLE

2nd Author and Qualitative Interviewer

DESCRIPTION

I currently lead user research on how physicians across specialties respond to our proposed LLM-based biomedical question-answering system.

CO-AUTHORS

Kunhee Ryu (quantitative researcher),
Sechang Chon (qualitative interviewer)

TIMELINE

Aug 2025 - Present

RESEARCH METHODS

Mixed methods: interviews, think-aloud protocol, web analytics tracking

ADVISORS

Keeheon Lee, Younah Kang

TARGET USERS

Medical experts of different fields


Fig 1. Research Background

PROJECT INTRODUCTION

How can we enhance user trust and support sensemaking in LLM-based biomedical question-and-answer (QA) systems?

My research team and I are experimenting with chain-of-thought (CoT) reasoning to investigate how it can enhance user trust and sensemaking, through a mixed-methods study with medical experts from different fields.

Biomedical Question Answering

Biomedical Question Answering (QA) systems aim to enhance accessibility to complex biomedical knowledge by providing natural language answers to experts’ queries, thereby reducing uncertainty in clinical diagnosis and treatment (Kell et al., 2021). Biomedical QA has evolved from rule-based systems to LLM- and RAG-enhanced frameworks emphasizing reliability and transparency, underscoring the importance of clinically applicable, interpretable models for future research (Kell et al., 2021).

Kell, Gregory, et al. “What Would It Take to Get Biomedical QA Systems into Practice?” Proceedings of the Third Workshop on Machine Reading for Question Answering (MRQA 2021), Association for Computational Linguistics, 2021, pp. 28–41.

Wang, Dakuo, et al. “‘Brilliant AI Doctor’ in Rural Clinics: Challenges in AI-Powered Clinical Decision Support System Deployment.” Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, 2021, pp. 1–15.

Robertson, Stephen, and Hugo Zaragoza. “The Probabilistic Relevance Framework: BM25 and Beyond.” Foundations and Trends in Information Retrieval, vol. 3, no. 4, 2009, pp. 333–389.

Jin, Qiao, et al. “Biomedical Question Answering: A Survey of Approaches and Challenges.” arXiv, 2022, arXiv:2201.09460.

Jin, Qiao, et al. “PubMedQA: A Dataset for Biomedical Research Question Answering.” arXiv, 2019, arXiv:1909.06146.

Luo, Renqian, et al. “BioGPT: Generative Pre-Training for Biomedical Text Generation and Mining.” Briefings in Bioinformatics, vol. 24, no. 1, 2023, bbac409.

Stuhlmann, Linus, et al. “Efficient and Reproducible Biomedical Question Answering Using Retrieval Augmented Generation.” arXiv, 2025, arXiv:2505.07917.

Anand, Nikita. “How Retrieval-Augmented Generation (RAG) and Chain-of-Thought (CoT) Create Retrieval-Augmented-Thought (RAT).” Medium, 2024.

Li, Dawei, et al. “DALK: Dynamic Co-Augmentation of LLMs and Knowledge Graphs to Answer Alzheimer’s Disease Questions.” arXiv, 2024, arXiv:2402.01234.

Li, Feiyang, et al. “CoT-RAG: Integrating Chain of Thought and Retrieval-Augmented Generation to Enhance Reasoning in Large Language Models.” arXiv, 2025, arXiv:2504.09963.

Explainable Artificial Intelligence (XAI)

Explainable Artificial Intelligence (XAI) focuses on making AI systems transparent and trustworthy by clarifying how models operate, particularly in high-stakes domains like healthcare and law where understanding and accountability are critical (Gunning, 2019). In the biomedical context, trust in large language model-based systems further depends on truthfulness, privacy, robustness, fairness, and safety (Aljohani et al., 2025). Despite their potential, few biomedical QA systems incorporate user evaluation or uncertainty estimation (Kell et al., 2021).

Aljohani, Manar, et al. “A Comprehensive Survey on the Trustworthiness of Large Language Models in Healthcare.” arXiv, 2025, arXiv:2502.15871.

Albassam, Hayder, et al. “Toward Human-Centered Interactive Clinical QA: A Voice- and Text-Based Query System for EHR Data.” arXiv, 2025, arXiv:2503.08042.

Chromik, Michael, and Martin Schuessler. “A Taxonomy for Human Subject Evaluation of Black-Box Explanations in XAI.” IUI Workshop on Explainable Smart Systems and Algorithmic Transparency in Emerging Technologies (ExSS-ATEC ’20), 2020.

Gunning, David. “Explainable Artificial Intelligence (XAI).” DARPA, 2019.

Kell, Gregory, et al. “What Would It Take to Get Biomedical QA Systems into Practice?” Proceedings of the Third Workshop on Machine Reading for Question Answering (MRQA 2021), Association for Computational Linguistics, 2021, pp. 28–41.

Interaction Design for Information Foraging and Sensemaking

Information Foraging describes how users efficiently search, collect, and refine information (Pirolli & Card, 1999), closely linking to the process of Sensemaking, which involves constructing coherent understanding from fragmented data (Weick, 1995; Pirolli & Card, 2005). The Foraging Loop concerns information search and acquisition, while the Sensemaking Loop involves building and refining mental models based on gathered evidence (Pirolli & Card, 2005). In interaction design, these processes are crucial for helping users navigate and interpret complex information environments (Attfield et al., 2010).

Attfield, Simon, et al. “A Framework for Information Visualisation in Sensemaking.” Information Processing & Management, vol. 46, no. 4, 2010, pp. 425–437.

Del Fiol, Guilherme, et al. “Clinical Questions Raised by Clinicians at the Point of Care: A Systematic Review.” JAMA Internal Medicine, vol. 174, no. 5, 2014, pp. 710–718.

Kang, Youn-ah, and John Stasko. “Examining the Use of a Visual Analytics System for Sensemaking Tasks: Case Studies with Domain Experts.” IEEE Transactions on Visualization and Computer Graphics, vol. 18, no. 12, 2012, pp. 2869–2878.

Kell, Gregory, et al. “What Would It Take to Get Biomedical QA Systems into Practice?” Proceedings of the Third Workshop on Machine Reading for Question Answering (MRQA 2021), Association for Computational Linguistics, 2021, pp. 28–41.

Papermaster, Amy, and Jane Dimmitt Champion. “The Common Practice of ‘Curbside Consultation’: A Systematic Review.” Journal of the American Association of Nurse Practitioners, vol. 29, no. 10, 2017, pp. 618–626.

Pirolli, Peter, and Stuart K. Card. “Information Foraging.” Psychological Review, vol. 106, no. 4, 1999, pp. 643–675.

Pirolli, Peter, and Stuart K. Card. “The Sensemaking Process and Leverage Points for Analyst Technology as Identified through Cognitive Task Analysis.” Proceedings of the International Conference on Intelligence Analysis, vol. 5, no. 1, 2005.

Qiu, Chen, et al. “Explainable Medical Visual Question Answering via Chain of Evidence.” Knowledge-Based Systems, 2025, 113672.

RESEARCH QUESTIONS


Fig 2. Research Purpose and Key Sub-questions

RQ1.

Transparency & Interpretability

  • How do transparency and interpretability influence explainability in LLM-based biomedical question-answering systems?

  • What elements constitute transparency and interpretability?

  • Do certain combinations of transparency and interpretability elements provide higher explainability?

  • Do these effects differ depending on clinicians’ clinical experience or expertise level (e.g., pre-medical students, medical students, interns, residents — categorized as experts vs. pre-experts) and their medical specialty (e.g., internal medicine, surgery, emergency medicine)?

    • How can these differences be minimized?

    • What is the rationale for any such differences, and how could minimization strategies be empirically tested, including a time estimate?

RQ2.

Information Foraging & Sensemaking

  • When transparency and interpretability are present, do users exhibit changes in their information foraging and sensemaking behaviors?

  • How does information foraging efficiency (e.g., search time, number of clicks, transitions between inputs and outputs) change?

  • How does the quality of sensemaking (e.g., evidence structuring, error detection, explanation reconstruction; measurement methods and indicators to be specified) change?

  • Which combinations of transparency and interpretability elements (no prior hypothesis) improve performance in information foraging and sensemaking?

  • Do these effects differ by clinicians’ experience level and medical specialty?

    • For example, do emergency medicine practitioners show higher information foraging efficiency under time constraints?

RQ3.

Trustworthiness

  • Do transparency and interpretability, through the mediation of information foraging and sensemaking, influence trust formation in LLM-based biomedical question-answering systems?

  • How does sensemaking affect trust in such systems?

    1. Does the effect exist?

    2. If so, is it positive or negative, and to what extent?

    3. Beyond quantitative measures, what qualitative factors are involved?

  • How does explainability affect trust in such systems?

    1. Does the effect exist?

    2. If so, is it positive or negative, and to what extent?

    3. Beyond quantitative measures, what qualitative factors are involved?

  • Do these effects vary according to clinicians’ clinical experience (experts vs. pre-experts) and medical specialty (e.g., internal medicine, surgery, emergency medicine)?


Fig 3. Research Procedure

COPYRIGHT © HEEYOUNG (EMILY) GHANG 2025
