2605.12313
2026-05-13
cs.CL
cs.IR
Overview of the MedHopQA track at BioCreative IX: track description, participation and evaluation of systems for multi-hop medical question answering
Rezarta Islamaj, Joey Chan, Robert Leaman, Jongmyung Jung, Hyeongsoon Hwang, Quoc-An Nguyen, Hoang-Quynh Le, Harikrishnan Gurushankar Saisudha, Ganesh Chandrasekar, Rustam R. Taktashov, Nadezhda Yu. Bizyukova, Sofia I. R. Conceição, Paulo R. C. Lopes, Reem Abdel Salam, Mary Adewunmi, Zhiyong Lu
Affiliations
National Library of Medicine (NLM), National Institutes of Health (NIH);
University of Illinois at Urbana-Champaign;
Korea University;
VNU University of Engineering and Technology, Hanoi, Vietnam;
Concordia University, Montreal, QC, CA;
Institute of Biomedical Chemistry (IBMC), 10 bld. 8, Pogodinskaya str., 119121 Moscow, Russia;
LASIGE, Departamento de Informática, Faculdade de Ciências, Universidade de Lisboa, 1749-016 Lisbon, Portugal;
Faculty of Engineering, Computer Engineering Department, Cairo University;
Menzies School of Health Research, Charles Darwin University, NT, Australia;
CaresAI, Australia
AI Summary
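The MedHopQA shared task at BioCreative IX evaluates the reasoning ability of large language models on multi-hop medical question answering. It introduces a new dataset of 1,000 complex question-answer pairs in which every question requires two-hop reasoning that combines information from two different Wikipedia pages, with a particular focus on rare diseases. The task attracted 48 submissions from 13 teams. Systems built on strategies such as retrieval-augmented generation (RAG) significantly outperformed the baseline models, and the best-performing systems reached 89.30% concept-level accuracy (measured with MedCPT) and 87.30% exact match (EM). The dataset has been released publicly to advance research on multi-hop medical question answering.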
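As a rough illustration of how an exact-match (EM) score like the one reported for MedHopQA is typically computed, here is a minimal sketch. It assumes SQuAD-style answer normalization (lowercasing, stripping punctuation and English articles) and a list of accepted gold answers per question; the function names, the normalization rules, and the alias lists are illustrative assumptions, not the track's official scorer. The MedCPT-based concept accuracy, which compares answers semantically rather than by string identity, is not reproduced here.

```python
import re
import string


def normalize(text: str) -> str:
    """Assumed SQuAD-style normalization: lowercase, drop punctuation and articles, collapse spaces."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold_answers: list[str]) -> bool:
    """True if the normalized prediction equals any normalized gold answer (hypothetical alias list)."""
    pred = normalize(prediction)
    return any(pred == normalize(gold) for gold in gold_answers)


def em_score(predictions: list[str], golds: list[list[str]]) -> float:
    """Percentage of questions whose prediction exactly matches one of its gold answers."""
    if len(predictions) != len(golds):
        raise ValueError("predictions and golds must have the same length")
    if not predictions:
        return 0.0
    hits = sum(exact_match(p, g) for p, g in zip(predictions, golds))
    return 100.0 * hits / len(predictions)


if __name__ == "__main__":
    preds = ["Cystic fibrosis", "BRCA1"]
    gold = [["cystic fibrosis"], ["BRCA2"]]
    print(em_score(preds, gold))  # 50.0
```

Exact match is strict by design: a correct answer phrased differently from every gold alias scores zero, which is one reason a semantic, concept-level measure such as the MedCPT-based accuracy is reported alongside it.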