AI for Science - arXivDaily Topic Feed

2606.19245 2026-06-19 cs.AI cs.LG New submission 85%

TxBench-PP: Analyzing AI Agent Performance on Small-Molecule Preclinical Pharmacology

Hannah Le, Ramesh Ramasamy, Alex Urrutia, Mahsa Yazdani, Tim Proctor, Kenny Workman

Affiliations * LatchBio

Topic match: AI drug discovery (a benchmark for small-molecule preclinical pharmacology)

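AI summary: Proposes the TxBench-PP benchmark for evaluating whether AI agents can recover preclinical pharmacology conclusions from real experimental data. The strongest configuration tested, Claude Opus 4.8 / Pi, passed only 59.3% of endpoint attempts.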

English abstract

Artificial intelligence (AI) agents promise to accelerate drug discovery by compressing interpretation and decision-making loops, but practical deployment requires trusted evaluation on realistic program decisions. We introduce TherapeuticsBench Preclinical Pharmacology (TxBench-PP), a verifiable benchmark for small-molecule preclinical pharmacology and the first focused slice of a broader TherapeuticsBench effort across drug-discovery stages and therapeutic modalities. TxBench-PP tests whether agents can recover accurate conclusions from real-world assay data rather than memorized facts from literature. The benchmark contains 100 evaluations indexed by program stage, assay type, and task structure, spanning mechanism-of-action (MoA) and pharmacodynamic (PD) reasoning, compound-target engagement, causal target validation, developability and safety, and translational efficacy. Agents receive realistic workflow snapshots, inspect files in a coding environment, and return structured answers graded deterministically. Across 16 model-harness configurations, comprising 11 models and 4,800 trajectories, no system reliably recovered preclinical pharmacology decisions. The strongest configuration, Claude Opus 4.8 / Pi, passed 59.3% of endpoint attempts (178/300; 95% CI, 51.1-67.6), followed by GPT-5.5 / Pi at 55.3% (166/300; 47.0-63.6).
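
The arithmetic behind the headline numbers can be checked with a short sketch. It assumes only the counts quoted in the abstract (4,800 trajectories, 16 configurations, 178 and 166 passes out of 300); the function name and the choice of a plain Wald (normal-approximation) interval are illustrative assumptions, since the abstract does not say how its confidence intervals were computed.

```python
import math

def pass_rate_with_wald_ci(passed, attempts, z=1.96):
    """Return (rate, lower, upper) using a normal-approximation 95% interval.

    Assumption: attempts are treated as independent Bernoulli trials,
    which may not match the method used in the paper.
    """
    p = passed / attempts
    half_width = z * math.sqrt(p * (1 - p) / attempts)
    return p, max(0.0, p - half_width), min(1.0, p + half_width)

# Counts quoted in the abstract.
TRAJECTORIES = 4800
CONFIGURATIONS = 16
attempts_per_config = TRAJECTORIES // CONFIGURATIONS  # 300 attempts each

passes = {"Claude Opus 4.8 / Pi": 178, "GPT-5.5 / Pi": 166}
for name, passed in passes.items():
    p, lo, hi = pass_rate_with_wald_ci(passed, attempts_per_config)
    print(f"{name}: {100 * p:.1f}% (Wald 95% CI {100 * lo:.1f}-{100 * hi:.1f})")
```

The pass rates reproduce the quoted 59.3% and 55.3%. The Wald interval for 178/300 comes out at roughly 53.8-64.9, narrower than the reported 51.1-67.6, so the reported intervals were presumably computed with a different method than this independent-trials approximation.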

2606.19624 2026-06-19 cs.LG New submission 80%

MassSpecGym in the Wild: Uncovering and Correcting Evaluation Pitfalls in AI-Driven Molecule Discovery

Hongxuan Liu, Roman Bushuiev, Ivy Lightheart, Mrunali Manjrekar, Anton Bushuiev, Magdalena Lederbauer, Filip Jozefov, Yinkai Wang, Soha Hassoun, Josef Sivic, James Taylor, Runzhong Wang, David Healey, Tomáš Pluskal, Connor W. Coley

Affiliations * Massachusetts Institute of Technology; Czech Institute of Informatics, Robotics and Cybernetics, Czech Technical University in Prague; Enveda Biosciences; Tufts University

Topic match: AI drug discovery (an audit of evaluation pitfalls in AI-driven molecule discovery, using MassSpecGym as a case study)

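AI summary: A systematic audit of how machine learning models for tandem mass spectrometry based molecule discovery are evaluated, using the MassSpecGym benchmark as a case study. At least 17 of 26 papers show issues in one of three classes: data leakage, shortcut learning, and implementation bugs and metric divergence. The authors quantify the impact experimentally, offer recommendations, and release MassSpecGym v1.5.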

English abstract

Reliable benchmarking is critical for developing machine learning models for tandem mass spectrometry (MS/MS) based molecule discovery. Subtle issues in experimental design and model evaluation procedures can degrade the trustworthiness of such benchmarks and lead to erroneous conclusions. We conduct a thorough review of model evaluation issues in the recent MS/MS machine learning literature, using the standard MassSpecGym benchmark suite as a case study to illustrate the impact of these issues. We find evaluation issues in at least 17 of 26 papers reporting MassSpecGym benchmark results in the first year of its adoption. We isolate three classes of failures: (i) data leakage, (ii) shortcut learning, and (iii) implementation bugs and metric divergence. Through extensive experimentation and code replication, we quantify the impact of these issues and show how they corrupt the evaluation standards MassSpecGym was designed to enforce. We distill our findings into recommendations generalizable to MS/MS challenges, benchmarks, and custom evaluation setups. We also release MassSpecGym v1.5, an implementation of our recommendations in the MassSpecGym benchmarking suite which addresses the failure modes identified in this audit. MassSpecGym v1.5 is publicly available at https://github.com/pluskal-lab/MassSpecGym.
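
To make the first failure class concrete, here is a minimal sketch of one generic data-leakage check: flagging test molecules whose structure also appears in the training split. It is not the audit procedure from the paper and does not use the MassSpecGym API. The function names are hypothetical, and the keys are illustrative strings in InChIKey format. The check relies on the fact that the first 14-character block of an InChIKey hashes the connectivity of the molecule, so two entries can share a skeleton even when their full keys differ.

```python
def connectivity_block(inchikey):
    """Return the first block of an InChIKey (the hashed connectivity layer)."""
    return inchikey.split("-")[0]

def leaked_structures(train_keys, test_keys):
    """Return test keys whose connectivity block also occurs in the training set."""
    train_blocks = {connectivity_block(key) for key in train_keys}
    return sorted({key for key in test_keys if connectivity_block(key) in train_blocks})

# Illustrative keys (assumption: stand-ins for real split files).
train = ["WQZGKKKJIJFFOK-GASJEMHNSA-N", "RYYVLZVUVIJVGH-UHFFFAOYSA-N"]
test = ["WQZGKKKJIJFFOK-VFUOTHLCSA-N", "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"]

leaks = leaked_structures(train, test)
print(f"{len(leaks)} of {len(test)} test entries share a connectivity block with training data")
```

An exact string comparison of the full keys would report no overlap here, which is one way a leak of this kind can go unnoticed.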

2606.19496 2026-06-19 cs.LG New submission 80%

Calibrating Generative Models to Feature Distributions with MMD Finetuning

Nathaniel L. Diamant, Brian L. Trippe

Affiliations * Stanford University

Topic match: AI drug discovery (calibrating a generative model's feature distribution to match antibiotic molecules)

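AI summary: Proposes kCGM, which calibrates a generative model's feature distribution by minimizing the maximum mean discrepancy (MMD) between generated and target feature distributions, with KL regularization, without sacrificing validity. The method applies to several kinds of generative models.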

English abstract

Generative models can produce individually plausible samples while deviating substantially from a target set in the distribution of key features. For example, a model pretrained on broad drug-like chemical space may generate molecules whose molecular features differ from those of a therapeutic class of interest, such as known antibiotics. Correcting such distributional miscalibration is challenging: direct finetuning on the target set can overfit and does not control which features are matched. To fill this gap, we introduce kernel Calibrating Generative Models (kCGM). kCGM minimizes a maximum mean discrepancy (MMD) between generated and target feature distributions using an unbiased score-function estimator, with KL regularization to remain close to the pretrained model. On a target set of 174 antibiotics, direct finetuning sacrifices chemical validity for feature-distribution matching, whereas kCGM improves target feature matching while increasing validity. We further demonstrate kCGM in protein and DNA generation tasks, showing it can adapt autoregressive, continuous-space diffusion, and discrete diffusion models using only feature-level supervision. Code is available at https://github.com/smithhenryd/cgm.
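
For readers unfamiliar with the quantity being minimized, the following sketch computes the standard unbiased estimate of squared MMD between two sets of feature vectors. It is not the kCGM training objective and not the score-function gradient estimator the abstract mentions. The RBF kernel, the bandwidth of 1.0, and the synthetic Gaussian "features" are all assumptions chosen for illustration.

```python
import math
import random

def rbf(a, b, bandwidth=1.0):
    """Gaussian (RBF) kernel between two feature vectors."""
    squared_distance = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-squared_distance / (2.0 * bandwidth ** 2))

def mmd2_unbiased(xs, ys, bandwidth=1.0):
    """Unbiased squared-MMD estimate: within-set sums exclude the i == j terms."""
    m, n = len(xs), len(ys)
    xx = sum(rbf(xs[i], xs[j], bandwidth) for i in range(m) for j in range(m) if i != j)
    yy = sum(rbf(ys[i], ys[j], bandwidth) for i in range(n) for j in range(n) if i != j)
    xy = sum(rbf(x, y, bandwidth) for x in xs for y in ys)
    return xx / (m * (m - 1)) + yy / (n * (n - 1)) - 2.0 * xy / (m * n)

def sample(rng, count, shift):
    """Draw 2-D Gaussian feature vectors centred at (shift, shift)."""
    return [[rng.gauss(shift, 1.0), rng.gauss(shift, 1.0)] for _ in range(count)]

rng = random.Random(0)
generated = sample(rng, 100, 0.0)    # stand-in for features of generated samples
target_near = sample(rng, 100, 0.0)  # target drawn from the same distribution
target_far = sample(rng, 100, 2.0)   # target with shifted features

mmd_near = mmd2_unbiased(generated, target_near)
mmd_far = mmd2_unbiased(generated, target_far)
print(f"matched features: {mmd_near:.4f}, shifted features: {mmd_far:.4f}")
```

The estimate stays near zero when the two feature distributions match and grows when they differ, which is what makes it usable as a calibration signal. Because the estimator is unbiased, it can come out slightly negative for matched sets.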
