arXivDaily: arXiv daily academic digest, updated Monday through Friday

Science and Medicine

AI for Science

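AI for scientific discovery, covering proteins, molecules, drugs, materials, weather, physics, and mathematics.

Collected today/current date: 2. Signal sources: cs.LG, q-bio, physics, cond-mat, math, stat.ML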
2601.12805 2026-06-18 q-bio.GN cs.AI cs.CL Version update 90%

SciHorizon-GENE: Benchmarking LLM for Life Sciences Inference from Gene Knowledge to Functional Understanding

Xiaohan Huang, Meng Xiao, Chuan Qin, Qingqing Long, Jinmiao Chen, Yuanchun Zhou, Hengshu Zhu

Affiliations: Computer Network Information Center, Chinese Academy of Sciences; University of the Chinese Academy of Sciences; Duke-NUS Medical School, National University of Singapore; Singapore Immunology Network, Agency for Science, Technology and Research

Topic match: Proteins and biomolecules. A gene-function reasoning benchmark; falls under life-science AI.

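AI summary: To address the limitations of large language models in gene-level reasoning, the authors build SciHorizon-GENE, a benchmark covering over 190K human genes and more than 540K questions. It evaluates models along four biologically critical perspectives (research attention sensitivity, hallucination tendency, answer completeness, and literature influence) and reveals persistent challenges in generating faithful, complete, and literature-grounded functional interpretations.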

Comments: Accepted by SIGKDD 2026. 12 pages

English abstract

Large language models (LLMs) have shown growing promise in biomedical research, particularly for knowledge-driven interpretation tasks. However, their ability to reliably reason from gene-level knowledge to functional understanding, a core requirement for knowledge-enhanced cell atlas interpretation, remains largely underexplored. To address this gap, we introduce SciHorizon-GENE, a large-scale gene-centric benchmark constructed from authoritative biological databases. The benchmark integrates curated knowledge for over 190K human genes and comprises more than 540K questions covering diverse gene-to-function reasoning scenarios relevant to cell type annotation, functional interpretation, and mechanism-oriented analysis. Motivated by behavioral patterns observed in preliminary examinations, SciHorizon-GENE evaluates LLMs along four biologically critical perspectives: research attention sensitivity, hallucination tendency, answer completeness, and literature influence, explicitly targeting failure modes that limit the safe adoption of LLMs in biological interpretation pipelines. We systematically evaluate a wide range of state-of-the-art general-purpose and biomedical LLMs, revealing substantial heterogeneity in gene-level reasoning capabilities and persistent challenges in generating faithful, complete, and literature-grounded functional interpretations. Our benchmark establishes a systematic foundation for analyzing LLM behavior at the gene scale and offers insights for model selection and development, with direct relevance to knowledge-enhanced biological interpretation.
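As a rough illustration of what an answer-completeness score could look like, the sketch below computes the fraction of reference functional terms that a model answer mentions. The function name `completeness`, the toy term list, and the case-insensitive substring matching are all assumptions made for illustration; the abstract does not specify how SciHorizon-GENE computes its metrics.

```python
def completeness(answer: str, reference_terms: list[str]) -> float:
    """Fraction of reference terms mentioned in the answer.

    Hypothetical metric for illustration only, not the benchmark's scoring.
    """
    if not reference_terms:
        return 0.0
    text = answer.lower()
    # Count reference terms that appear as case-insensitive substrings
    hits = sum(1 for term in reference_terms if term.lower() in text)
    return hits / len(reference_terms)


# Toy example; these terms are illustrative, not benchmark data
reference = ["DNA repair", "homologous recombination", "tumor suppressor"]
answer = "BRCA1 is a tumor suppressor involved in DNA repair."
score = completeness(answer, reference)
print(f"completeness = {score:.2f}")
```

Substring matching is deliberately simple here; a real evaluation would likely need synonym handling or ontology-based matching, which this sketch does not attempt.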

2603.27465 2026-06-18 q-bio.GN Version update 80%

Poisoning the Genome: Targeted Backdoor Attacks on DNA Foundation Models

Charalampos Koilakos, Ioannis Mouratidis, Ilias Georgakopoulos-Soares

Topic match: Proteins and biomolecules. Backdoor attacks on DNA foundation models; falls under genomics.

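AI summary: This work presents the first systematic study of training data poisoning in genomic language models, covering both pre-training and fine-tuning. At pre-training, less than 1% adversarially crafted sequences in the corpus can selectively degrade generative performance on targeted genomic contexts. At fine-tuning, poisoning installs a conditional backdoor, and targeted label corruption compromises a downstream variant classification task.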

Comments: 23 pages, double column format

English abstract

Foundation models trained on DNA sequences have achieved strong performance across biological tasks including variant effect prediction and genome design. These models rely on massive public genomic datasets comprising trillions of nucleotide tokens. Unlike natural language, DNA sequences lack semantic transparency, making corrupted or adversarially crafted entries difficult to detect during data curation. We present the first systematic study of training data poisoning in genomic language models, targeting both pre-training and fine-tuning stages. At pre-training, using Evo 2 and GENERator architectures, we show that less than 1% adversarially crafted sequences in the training corpus can selectively degrade generative performance on targeted genomic contexts while leaving unrelated sequences unaffected. We evaluate three scenarios: corruption of TATA-box promoter motifs, disruption of CTCF binding sites, and insertion of synthetic sequences absent from all training genomes. At fine-tuning, we demonstrate two additional attacks. First, poisoning a subset of CTCF sites in a ClinVar-derived corpus installs a conditional backdoor in a LoRA-adapted model that activates almost exclusively when the trigger sequence is present. Second, using frozen Evo 2 7B embeddings, targeted label corruption of downstream training data selectively compromises a clinically relevant variant classification task, demonstrated on BRCA1 variant effect prediction. These results show genomic foundation models are susceptible to targeted data poisoning with minimal footprint. We urge the field to adopt data provenance tracking, integrity verification, and adversarial robustness evaluation as standard components of the genomic model development pipeline.
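The integrity verification the abstract calls for could, in its simplest form, be a checksum manifest over training sequences. The sketch below is a minimal illustration under stated assumptions: the helper names (`sequence_digest`, `build_manifest`, `find_tampered`), the record IDs, and the toy sequences are hypothetical and do not come from the paper.

```python
import hashlib


def sequence_digest(seq: str) -> str:
    """SHA-256 of a normalized (whitespace-free, uppercase) DNA sequence."""
    normalized = "".join(seq.split()).upper()
    return hashlib.sha256(normalized.encode("ascii")).hexdigest()


def build_manifest(records: dict[str, str]) -> dict[str, str]:
    """Map each record ID to the digest of its sequence."""
    return {rid: sequence_digest(seq) for rid, seq in records.items()}


def find_tampered(records: dict[str, str], manifest: dict[str, str]) -> list[str]:
    """IDs whose sequence differs from, or is absent in, the trusted manifest."""
    return sorted(
        rid for rid, seq in records.items()
        if manifest.get(rid) != sequence_digest(seq)
    )


# Hypothetical trusted snapshot (toy sequences, not real training data)
trusted = {"seq1": "TATAAAGGCC", "seq2": "ccgcgagg tggcag"}
manifest = build_manifest(trusted)

# Corpus received later: seq1 altered, seq2 only reformatted, seq3 newly added
received = {"seq1": "TGTAAAGGCC", "seq2": "CCGCGAGGTGGCAG", "seq3": "ACGT"}
tampered = find_tampered(received, manifest)
print(tampered)
```

A manifest like this only flags records that changed or appeared after the trusted snapshot was taken. It cannot detect sequences that were already adversarial at the source, which is why the abstract pairs integrity verification with provenance tracking and adversarial robustness evaluation.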