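arXivDaily: a daily academic digest synchronized with the full arXiv dataset, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering & systems, and other fields.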

2606.16950 2026-06-16 physics.ins-det cs.LG physics.bio-ph physics.chem-ph physics.data-an q-bio.BM New submission

Latent space mapping of interpretable structural coordinates from stochastic single-molecule signals

Matteo Cartiglia, Sandro Kuppel, Wouter Botermans, Wannes Peeters, Natan Biesmans, Liam Vandekerckhove, Eric Beamish, Koen Ongena, Wouter Renckens, Pol Van Dorpe, Sanjin Marion

AI summary: Proposes using a contrastive encoder to map stochastic nanopore signals into a latent space of interpretable molecular structural coordinates, enabling efficient identification and data pooling.

Comments: 32 pages, 6 figures

Abstract

Nanopores are versatile single-molecule sensors, but their utility is fundamentally constrained by stochastic translocation dynamics warping any encoded information. We resolve this by shifting from time-domain analysis to a learned latent-space mapping via a contrastive encoder trained exclusively on simulated signals from a physics-informed model. This encoder maps solid-state nanopore signals of engineered DNA barcodes into an interpretable molecular coordinate system. The learned representation is responsive to structural barcode parameters while remaining invariant to acquisition conditions and translocation conformation, allowing data pooling across devices. Molecule identification requires a single pass through the encoder, reducing computational cost by three orders of magnitude relative to alignment-based methods. We experimentally validate through mixture quantification, rare-variant detection, consensus barcode reconstruction, and real-time signal acquisition. This shift from temporal analysis to mapping structural coordinates into a latent space changes the paradigm behind analyzing stochastic sensor signals by linking classification to interpretable encoded molecular information.

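2606.16803 2026-06-16 q-bio.MN q-bio.SC New submission

Cell Division Changes Fate Decisions in a Genetic Toggle Switch

Charli Austin, Nikola Popovic, Ramon Grima

AI summary: By analyzing a Boolean genetic toggle switch model, this study finds that cell division can steer trajectories with identical initial conditions toward different stable states, and it defines the regions where fate predictions that ignore division are wrong, indicating that division can reshape the fate boundaries of multistable regulatory networks.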

2606.16726 2026-06-16 q-bio.QM stat.AP New submission

Too Few or Too Many? Sample Size Estimation for Differential Abundance Studies

Michael Agronah, Benjamin M. Bolker

AI summary: Proposes a sample size calculation method based on effect size, mean abundance, and statistical power, implemented in the R package power.nb and validated on 30 real microbiome datasets, finding that the sample sizes of existing studies are insufficient.

Abstract

Determining an appropriate sample size for a study is a crucial step in planning scientific research. Appropriate sample size planning avoids both inadequate and inflated sample sizes. Inflated sample sizes waste resources, time and effort of human subjects, and lives of experimental animals. Inadequate sample sizes, a much more common problem, waste even more resources through the inability to detect biologically meaningful differences and encourage questionable research practices like $p$-hacking. Microbiome studies are particularly challenged by small sample sizes, especially in studies of human subjects or expensive animal models. In practice, the statistical power of taxa within a differential abundance study is influenced by the effect size (typically quantified as fold change), mean abundance of individual taxa, and the number of samples. We present a novel approach for sample size calculation for differential abundance studies as a function of effect size, mean abundance and statistical power of taxa. Our method is implemented in the power.nb R package, available at https://michaelagronah.com/power.nb/articles/stub.html. We applied our model for sample size calculation using estimates of mean abundance and fold change of taxa obtained from thirty real-world microbiome datasets. Our results showed that differential abundance microbiome studies require larger sample sizes than are currently prevalent in the literature to achieve adequate statistical power. Our framework will help researchers make informed decisions about appropriate sample sizes.
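
To make the dependence on fold change, mean abundance, and target power concrete, here is a minimal sketch based on the classical Wald approximation for comparing two negative binomial means. It is not the power.nb method. The function name `nb_samples_per_group`, the dispersion parameter `phi` (variance = mu + phi * mu^2), and the default `alpha` and `power` values are assumptions made for illustration.

```python
import math
from statistics import NormalDist


def nb_samples_per_group(mean_abundance, fold_change, phi, alpha=0.05, power=0.8):
    """Approximate samples per group needed to detect a fold change in one taxon.

    Assumes negative binomial counts with variance mu + phi * mu**2 and a
    two-sided Wald test on the log ratio of the two group means.
    """
    if mean_abundance <= 0 or fold_change <= 0 or fold_change == 1:
        raise ValueError("need a positive mean and a fold change other than 1")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    mu_control = mean_abundance
    mu_treated = mean_abundance * fold_change
    # Delta-method variance of log(sample mean), per observation, summed over groups.
    variance = (1 / mu_control + phi) + (1 / mu_treated + phi)
    n = (z_alpha + z_beta) ** 2 * variance / math.log(fold_change) ** 2
    return math.ceil(n)
```

Under this approximation, rarer taxa and smaller fold changes both push the required sample size up, which is the qualitative behavior the abstract describes.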

2606.16694 2026-06-16 cs.LG cs.AI physics.app-ph q-bio.NC New submission

Adaptive inference and function vectors in deep transformers

Ravin Raj, Gautam Reddy

Affiliations: Joseph Henry Laboratories of Physics, Princeton University

AI summary: Proposes a theory of deep transformers as mean-field interacting systems that implement distributed inference, using function vectors to infer a latent context variable layer by layer; for an in-context regression task, the theory predicts a relationship between non-Gaussian hierarchical structure and depth, which is tested with constrained linear attention transformers.

Abstract

Transformers are widely used as a general-purpose substrate for learning complex correlations between a large collection of coupled variables, but their internal mechanisms have remained mysterious. We introduce a theory of a deep transformer as a mean-field interacting system that implements distributed inference, subject to constraints on communication, locality and depth. We show that such a system can exploit internal state representations ('function vectors') to infer a latent context variable at increasingly finer scales over its layers. In an in-context regression task, the theory predicts a non-trivial relationship between non-Gaussian, hierarchical structure in the latent context variable, and transformer depth. Predictions are tested using constrained linear attention transformers and demonstrate adaptive inference in deep architectures. Feedforward blocks and depth enable transformers to implement a much richer class of in-context learning algorithms than previously described.

2606.16693 2026-06-16 q-bio.NC cs.LG New submission

Learning Hybrid Biophysical Neuron Models with Neural ODEs

Jonas Beck, Michael Deistler, Dóra Viktória Molnár, Jakob H. Macke, Philipp Berens

AI summary: Proposes a hybrid modeling framework that embeds neural ordinary differential equations into conductance-based biophysical models to capture unknown currents or mis-specified channel kinetics, recovering interpretable gating dynamics from voltage recordings and reducing computational cost.

Abstract

Biophysical neuron models link measurements of neural activity to underlying cellular mechanisms. Yet, a central challenge is that the kinetics of many ion channels are poorly characterized, and practical simplifications -- omitting channels or reducing morphological detail -- introduce systematic gaps between model and biology. Bridging these gaps requires approaches that can flexibly discover unmodeled dynamics while preserving mechanistic interpretability. Here, we introduce a hybrid modeling framework that embeds neural ordinary differential equations into conductance-based biophysical models to capture unknown currents or mis-specified channel kinetics. By parameterizing the neural ODE in terms of voltage-dependent steady-state and time-constant functions, we recover interpretable gating dynamics directly from voltage recordings without assuming a functional form. We show that the hybrid model fits the gating kinetics of 2400 ion channel models and recovers unknown gating dynamics from single current-clamp recordings, generalizing to out-of-distribution stimulus regimes under realistic inputs and parameter misspecification. We also use our method to reduce a multicompartment model of a cortical neuron into a single-compartment hybrid model with a learned axial current, yielding up to an order of magnitude lower computational cost. Together, our results establish a plug-and-play framework for selectively replacing unknown components of conductance-based models with neural ODEs while preserving their mechanistic structure.
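
The steady-state and time-constant parameterization mentioned in the abstract corresponds to the standard Hodgkin-Huxley form of a gating equation. As a sketch, with symbols chosen here for illustration rather than taken from the paper ($g$ for a gating variable, $g_\infty(V)$ for its voltage-dependent steady state, $\tau_g(V)$ for its voltage-dependent time constant, the latter two presumably being the learned functions):

$$
\frac{dg}{dt} = \frac{g_\infty(V) - g}{\tau_g(V)}
$$

Because $g_\infty$ and $\tau_g$ keep their usual biophysical meaning in this form, learned versions of them can be compared directly with classical gating curves.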

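2606.16614 2026-06-16 physics.bio-ph cond-mat.soft q-bio.SC New submission

Oscillating concentrations suppress condensate coarsening

Mathias S. Heltberg, Lukas H. Kristensen, Mogens H. Jensen, David Zwicker

AI summary: Using theoretical models and numerical simulations, finds that sufficiently fast oscillating signals can stabilize multiple droplets and control their size, suppressing the coarsening into a single condensate that occurs under passive dynamics.

Comments: 5 pages, 4 figures, and appendix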

2606.16540 2026-06-16 q-bio.QM cs.LG q-bio.BM q-bio.GN New submission

MultiMolecule: a modular ecosystem for biomolecular sequence-model workflows

Zhiyuan Chen

Affiliations: DanLing Team

AI summary: Presents MultiMolecule, an open-source ecosystem that integrates RNA, DNA, and protein sequence models behind standardized interfaces, providing 53 model-family implementations, 112 checkpoints, and 16 dataset resources to support model reuse, evaluation, and biological prediction.

Abstract

Biomolecular sequence models are increasingly reused outside the studies in which they were introduced, but public checkpoints rarely preserve the execution context needed to inspect source-defined behavior, adapt models to new assays, compare models under shared task definitions or deploy biological predictions. MultiMolecule is an open-source Python ecosystem that turns heterogeneous RNA, DNA and protein sequence-model releases into complete, source-checked model-family implementations with shared loading, workflow and prediction interfaces. The Resource state reported here includes 53 complete model-family implementations with 112 standardized model checkpoints, together with 16 curated dataset resources released through 39 public dataset repositories and 10 user-facing prediction pipelines. Standardized components are linked to source provenance, conversion or preparation code, source-reference checks, Extended Data summaries and public documentation, allowing users to inspect what was standardized, what behavior was checked and how each component enters training, evaluation, inference or deployment. By shifting reuse from repository-specific checkpoints to executable implementations connected to standardized checkpoints, curated datasets, Runner workflows and biological prediction pipelines, MultiMolecule provides common infrastructure for preserving source-defined model behavior, adapting models to new assays, enabling controlled evaluation and deploying biomolecular predictions.

2606.16517 2026-06-16 cs.LG q-bio.QM New submission

How Post-Training Shapes Biological Reasoning Models

Lukas Fesser, Hanlin Zhang, Michelle M. Li, Eric Wang, Bryan Perozzi, Shekoofeh Azizi, Sham M. Kakade, Marinka Zitnik

Affiliations: Harvard University; Google DeepMind; Google Research

AI summary: Studies how each post-training stage (CPT, SFT, RL) affects the in-domain and out-of-domain performance of biological reasoning models, finding that SFT improves in-domain performance but harms generalization, that RL can partially recover generalization, and that the best strategy is brief SFT followed by longer RL.

Abstract

Scientific reasoning models for biology combine language models with foundation models trained on multimodal biological data, including DNA, RNA, and proteins. These models are built through post-training, yet how each stage shapes reasoning and generalization remains poorly understood. We study when post-training improves performance and when it induces over-specialization. Across genomics, transcriptomics, and proteins, we train and evaluate more than 100 biological reasoning models under controlled variation in backbone, continued pre-training (CPT), supervised fine-tuning (SFT), and reinforcement learning (RL), measuring both in-domain (ID) and out-of-domain (OOD) performance. We find that each post-training stage reshapes generalization in a distinct way rather than contributing uniform gains. CPT improves downstream performance by aligning models with biological language. SFT consistently increases ID performance but causes OOD performance to peak early and decline as models fit the training distribution. RL, when applied to strong SFT checkpoints with aligned rewards, improves OOD performance and partially recovers generalization. These results show that biological reasoning does not improve monotonically with additional supervision or compute. Instead, performance depends on how training stages are composed. Under fixed post-training budgets, the strongest ID-OOD trade-off comes from brief SFT, larger RL allocations, and asymmetric adaptation capacity across stages.

2606.16305 2026-06-16 math.OC cs.SY eess.SY q-bio.PE New submission

Extended Kalman Filter-Based State Estimation for a Nine-Compartment Nonlinear Epidemic Model -- Convergence Analysis and In-Silico Benchmark Calibrated on the COVID-19 Third Wave in Italy

Lokman Rachid Melhani, Antonino Sferlazza, Dominique Persano Adorno, Filippo D'Ippolito, Antonino Lo Burgio, Alberto Firenze

AI summary: For a nine-compartment nonlinear COVID-19 epidemic model, proposes a real-time state estimation method based on the extended Kalman filter (EKF), uses a Lie-derivative observability analysis to expose rank-deficiency conditions, proves local exponential mean-square boundedness of the error, and reports benchmark RMSE values of 0.07%-2.72% on the calibrated model.

Comments: Companion paper to arXiv:2606.07413. Submitted to ISA Transactions. 9 figures

Abstract

This paper addresses real-time state estimation for a nine-compartment nonlinear COVID-19 epidemic model with two co-circulating strains, a super-spreader subpopulation, vaccination with waning immunity, hospitalization, and mortality. Time-varying transmission and vaccination rates are known inputs from a companion calibration, leaving the reconstruction of all nine states from three routinely reported observables: hospitalizations H, fatalities F, and vaccinated stock V. The contributions are theoretical rather than in the filter recursion. First, a Lie-derivative observability analysis yields, via a six-step derivation, the closed-form determinant |det(O9)| = delta_w * gamma_a^2 * kappa * rho2 * w1^2 * (delta_i - delta_p)^2 * |r1 - r2|, showing the level-2 codistribution is rank-deficient at the calibrated symmetric parameters (delta_i = delta_p, r1 = r2); the third Lie derivative restores full rank 9, with r2 the symmetry-breaking parameter. Second, an EKF is designed on the Euler-discretized dynamics with a closed-form 9x9 Jacobian and Joseph covariance update. Third, local exponential mean-square boundedness of the error is proved as a full theorem via the Reif-Gunther-Yaz-Unbehauen hypotheses, exploiting the bilinear drift and linear output to obtain a global-radius quadratic remainder bound that extends to bilinear-drift, linear-output systems. Fourth, the noise covariances are designed from calibration residuals and assessed by NEES and innovation-whiteness tests. All experiments use synthetic measurements from the calibrated model, so reported RMSE values (0.07%-2.72%) are methodology benchmarks, not predictive accuracy. A parameter-mismatch study shows measured and directly-coupled channels stay accurate under model error up to +/-30% while indirectly observed states degrade gracefully. The framework provides the state-feedback basis for future Model Predictive Control.
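
For readers unfamiliar with two of the tools named in the abstract, the textbook forms of the Joseph covariance update and the NEES statistic are sketched below. The notation is an assumption made here, not the paper's: $C$ is the linear output matrix, $K_k$ the Kalman gain, $R$ the measurement noise covariance, $P_{k|k-1}$ and $P_{k|k}$ the predicted and updated error covariances, and $e_k$ the estimation error at step $k$.

$$
P_{k|k} = (I - K_k C)\,P_{k|k-1}\,(I - K_k C)^{\top} + K_k R K_k^{\top},
\qquad
\epsilon_k = e_k^{\top} P_{k|k}^{-1} e_k
$$

The Joseph form keeps the updated covariance symmetric and positive semidefinite even when the gain is not exactly optimal. For a consistent filter, the NEES $\epsilon_k$ averages to the state dimension, which would be 9 for this model.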

2606.16294 2026-06-16 cs.CV q-bio.NC New submission

Sex-based Network-Specific Differences in Connectomes: A Krakencoder-Based Analysis

Vibhashree S H, Debanjali Bhattacharya, Vamshi Krishna Kancharla, Neelam Sinha

Affiliations: Centre for Brain Research, Indian Institute of Science; Dept. of Artificial Intelligence, Amrita School of Artificial Intelligence, Amrita Vishwa Vidyapeetham

AI summary: Uses the Krakencoder framework to simulate how deficiencies propagate between brain connectome modalities, analyzing structural and functional connectomes from 702 healthy participants; the Default Mode Network produces the largest perturbations and the Somatomotor network the smallest, and full predicted connectomes retain more sex-discriminative information.

Abstract

This study examines how deficiencies in one brain connectome modality propagate to the other, using the Krakencoder as a simulation framework. Structural and functional connectomes from 702 healthy participants in the Human Connectome Project were analyzed, with the impact of each of the Yeo-7 functional networks assessed separately. Seven scenarios were considered, each involving the removal of a single network while the remaining networks were preserved. The resulting perturbations in cross-modal predictions were quantified using three complementary metrics: KL divergence on eigenvalue spectra, Frobenius norm, and Wasserstein distance. In addition, the persistence of sex-specific information within the predicted connectomes was evaluated. Across all metrics and both prediction directions, the Default Mode Network produced the largest perturbations, whereas the Somatomotor network yielded the smallest. Sex differences in network-level perturbation signatures were subtle, with the best result being an accuracy of 66.09% from connectomes predicted under network-removal conditions. In contrast, connectomes predicted from intact inputs achieved substantially higher sex classification accuracy, reaching up to 84.76%. These findings confirm that full predicted connectomes retain considerably more sex-discriminative information than perturbation-derived signatures alone.
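
The three perturbation metrics named in the abstract can be sketched in plain Python as follows. This is an illustrative sketch, not the study's code: the function names are hypothetical, eigenvalues are assumed to be computed elsewhere, and the way spectra are turned into probability vectors (absolute values normalized to sum to one) is an assumption, since the abstract does not specify it.

```python
import math


def frobenius_distance(a, b):
    """Frobenius norm of the difference between two equally shaped matrices."""
    return math.sqrt(sum((x - y) ** 2
                         for row_a, row_b in zip(a, b)
                         for x, y in zip(row_a, row_b)))


def wasserstein_1d(u, v):
    """Wasserstein-1 distance between two equal-size, equally weighted samples."""
    if len(u) != len(v) or not u:
        raise ValueError("samples must be non-empty and the same length")
    # For equal-size empirical distributions, the optimal coupling pairs sorted values.
    return sum(abs(x - y) for x, y in zip(sorted(u), sorted(v))) / len(u)


def spectrum_kl(p_eigs, q_eigs, eps=1e-12):
    """KL divergence between two eigenvalue spectra of equal length.

    Assumption: each spectrum becomes a probability vector by taking absolute
    values (plus a small eps to avoid log of zero) and normalizing the sum to one.
    """
    if len(p_eigs) != len(q_eigs) or not p_eigs:
        raise ValueError("spectra must be non-empty and the same length")
    p = [abs(x) + eps for x in p_eigs]
    q = [abs(x) + eps for x in q_eigs]
    p_sum, q_sum = sum(p), sum(q)
    return sum((a / p_sum) * math.log((a / p_sum) / (b / q_sum))
               for a, b in zip(p, q))
```

In a setting like the one described, each function would be applied to a connectome predicted from intact input and the one predicted after removing a network, giving one perturbation score per metric and per removed network.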

2606.16044 2026-06-16 cs.LG q-bio.QM New submission

Circuit Tracing in Autoregressive Protein Language Models

Darin Tsui, William Deinzer, Daniel Saeedi, Amirali Aghazadeh

Affiliations: Stanford University

AI summary: Proposes ProGenMech, a framework that uses cross-layer transcoders to faithfully recover the generative computation of ProGen3 and performs zero-shot discovery of sparse circuits involved in protein generation and fitness prediction, revealing biologically meaningful motifs.

Comments: Accepted into the Mechanistic Interpretability Workshop at ICML 2026. 24 pages, 14 figures

Abstract

Protein language models (pLMs) can generate novel protein sequences with properties beyond those observed in nature, yet the mechanisms underlying protein generation remain poorly understood. Existing mechanistic interpretability methods based on sparse autoencoders and transcoders primarily focus on protein representation learning models and do not capture the computation required for autoregressive generation. Here, we introduce ProGenMech, a mechanistic interpretability framework for generative protein language models that extends cross-layer transcoders (CLTs) to ProGen3, a sparse Mixture-of-Experts model trained for both causal generation and span infilling. Unlike per-layer approaches, CLTs reconstruct each layer using sparse latent variables from all preceding layers, enabling faithful recovery of inter-layer generative computation. We further develop a zero-shot circuit discovery framework to identify sparse latent circuits responsible for protein generation and fitness prediction. In causal generation and zero-shot fitness estimation tasks, ProGenMech outperforms local transcoder baselines in recovering ProGen3's probability distribution and functional scoring behavior, while matching the original model's generative distribution in span infilling tasks. Moreover, the recovered circuits reveal biologically meaningful motifs and functional regions associated with conserved sequence patterns and protein fitness landscapes, establishing a foundation for interpretable and steerable protein generation.

2606.16041 2026-06-16 q-bio.NC New submission

EEGDash: An open-source platform for machine learning on public neurophysiological data

Bruno Aristimunha, Aviv Dotan, Pierre Guetschel, Aman Jaiswal, Gal Ashkenazi, Dung Truong, Kuntal Kokate, Amitrava Majumdar, Oren Shriki, Arnaud Delorme

AI summary: EEGDash is an open-source platform that catalogues 791 public neurophysiological recordings and, through a metadata-first registry, format repair, and automatic tagging, supports machine-learning workflows without custom code, facilitating benchmarking and cross-dataset analysis.

Comments: 28 pages, 9 figures, 4 tables. Supplementary material included as an ancillary file. EEG-Dash software: https://eegdash.org

Abstract

Public neurophysiological datasets are increasingly accessible but remain hard to reuse: turning one into a trained model still takes thousands of lines of code for download, loading, format repair, windowing, and evaluation, and a dataset that meets metadata standards can still fail to load. EEG-Dash is a software resource that catalogues 791 publicly archived recordings (39,778 participants, over 86,051 hours) spanning electroencephalography (EEG), magnetoencephalography (MEG), intracranial EEG (iEEG), electromyography (EMG), and functional near-infrared spectroscopy (fNIRS) from the OpenNeuro and NEMAR archives. It exposes each dataset as an importable, queryable class that preserves signal attributes and loads into machine-learning workflows without custom code, delegating signal handling to MNE-Python, windowing to Braindecode, and format compliance to the official Brain Imaging Data Structure (BIDS) validator. A metadata-first registry adds semantic search, a format-repair layer, automatic dataset-level tags drawn from each source publication, and a feature-extraction framework. The catalogue, with per-record loadability and compliance metadata, supports benchmarking, model development, and cross-dataset analysis.

2606.15989 2026-06-16 q-bio.NC cs.AI New submission

Task-guided cross-subject latent alignment: a multi-encoder-decoder VAE

Angeliki Papathanasiou, Jascha Achterberg, Thomas E. Nichols, Rui Ponte Costa

Affiliations: Centre for Neural Circuits and Behaviour, Department of Physiology, Anatomy and Genetics, University of Oxford; Big Data Institute, Nuffield Department of Medicine, University of Oxford

AI summary: Proposes MED-VAE, a model that anchors representations to a pretrained ANN to achieve cross-subject neural alignment without shared stimuli, outperforming traditional methods on the Natural Scenes Dataset and supporting cross-subject image decoding.

Comments: In Proceedings of the 9th Conference on Cognitive Computational Neuroscience, New York, NY, USA, 2026

Abstract

Aligning neural activity across subjects offers the promise of discovering shared computational principles and generalizable decoders. However, traditional alignment methods require shared stimuli across subjects, a constraint that limits applicability to naturalistic paradigms with limited or non-overlapping data. We introduce a Multi-Encoder-Decoder Variational Autoencoder (MED-VAE) that achieves cross-subject alignment without shared stimuli by anchoring representations to a common scaffold provided by a pretrained ANN. Using the Natural Scenes Dataset, we show that MED-VAE creates common latent spaces with superior semantic organisation, achieving higher cross-subject alignment than common methods while maintaining robust generalisation to held-out stimuli where traditional methods degrade. Reconstructing from these common spaces back to each subject's original neural space, MED-VAE preserves equal stimulus-driven signal in its cross-subject latent space. Finally, we show that this superior alignment directly enables cross-subject neural prediction, as demonstrated via cross-subject image decoding. In summary, we introduce a framework to identify generalisable common subspaces for cross-subject predictions and downstream tasks, demonstrated here for visual cortex responses to static images.

2606.15905 2026-06-16 q-bio.MN New submission

What is Life?

Roger D. Jones

AI summary: Proposes six minimal physical principles for adaptive living systems (an entropy source, longevity of information, fast response, repeatable operation, energetic efficiency, and networks of multiple switches), offering a unified view of how adaptability emerges from energy flow, information storage and processing, and natural selection.

Abstract

Since Schrödinger's What Is Life?, the physical basis of biological organization has been understood in terms of the interplay between matter, energy, and information. Subsequent developments in molecular biology, information theory, nonequilibrium thermodynamics, and evolutionary theory have clarified how hereditary information is stored, maintained, and modified through natural selection. Here, we extend this program by asking what minimal physical principles are required for adaptable life. We propose six postulates governing adaptive living systems: the existence of an entropy source, longevity of information, fast response to environmental change, repeatable operation, energetic efficiency, and networks of multiple interacting switches. These principles are introduced as a minimal foundation for biological information processing and adaptation. We examine their implications and compare them with observations across multiple levels of biological organization, including genetic inheritance, epigenetic regulation, cellular signaling, neural computation, metabolic networks, and ecological systems. The resulting framework suggests that adaptability emerges from the interplay of energy flow, information storage, information processing, and natural selection in systems maintained far from thermodynamic equilibrium. Although the proposed principles are qualitative and not yet predictive, they provide a unified perspective on the physical constraints governing adaptive behavior and offer a starting point for the development of a quantitative theory of adaptable life.

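2606.15530 2026-06-16 q-bio.PE New submission

Self-propelled evolution on regenerating landscapes

Alexander Heyde, L. Mahadevan

AI summary: Using an eco-evolutionary model that couples population density to a landscape driven by regenerating resources, shows that a population on an initially flat landscape generates self-induced selection gradients and achieves directed motion, exhibiting sustained oscillations, chaotic dynamics, and evolutionary branching.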

2606.15499 2026-06-16 cs.FL math-ph math.MP q-bio.BM New submission

Assembly Spaces: Formal Definitions and Fast Methods for Approximating Assembly Indices

Gage Siebert, Redwan Chowdhury, Louie Slocombe, Sara Walker

AI summary: Presents formal definitions for assembly theory and, through a unified path hierarchy framework and formal grammar algorithms, efficiently approximates assembly indices to support interdisciplinary research on life detection.

Abstract

Assembly theory is an experimental and theoretical framework that introduces a metrological approach to detecting life, with potential applications across diverse substrates. Its two central observables are assembly index and copy number. The assembly index is the minimum number of joining operations required to construct an object from its elementary parts; for molecules, it can be measured using mass spectrometry, infrared spectroscopy, and NMR. Copy number is the abundance of a given distinguishable object within a sample. A key empirical result of the theory is that high assembly index combined with high copy number constitutes a signature that cannot arise abiotically, and this has been validated experimentally in application to molecular biosignatures. The foundational theoretical concept underlying these results is the assembly space, which encodes the causal possibilities determinable from observed objects, with the assembly index the shortest path to them given the physical constraints of a given substrate. Here, we provide a generalized formalism to describe assembly spaces and tools for assembly index approximations. We begin by reviewing the applications of assembly theory across molecules, minerals and atmospheres, and then introduce a general, substrate-independent formal definition of assembly spaces and assembly indices. We develop a unified path hierarchy framework to clarify relationships among the various representations of assembly spaces and assembly paths that appear in the literature on molecular assembly. Finally, we show how formal grammar algorithms can be adapted to efficiently bound assembly index calculations and provide clarification on the utility of such approximations, with the goal to increase the accessibility of tools to explore this emerging area for a broader group of researchers across chemistry, biology, and complexity science.
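
The definition of the assembly index as a minimum number of joining operations can be illustrated on character strings, where joining means concatenation. The sketch below is an exhaustive search suitable only for very short strings; it is not the grammar-based bounding approach the abstract refers to, and the function name `assembly_index` and the string setting are assumptions chosen for illustration.

```python
def assembly_index(target):
    """Minimum number of concatenations needed to build `target` from its characters.

    Each join combines two objects already available (single characters or
    previously built strings); built objects can be reused at no extra cost.
    """
    if len(target) <= 1:
        return 0
    # Only substrings of the target can ever contribute to building it.
    substrings = {target[i:j]
                  for i in range(len(target))
                  for j in range(i + 1, len(target) + 1)}
    basics = frozenset(target)  # the elementary parts: single characters

    def reachable(pool, joins_left):
        if target in pool:
            return True
        if joins_left == 0:
            return False
        for left in pool:
            for right in pool:
                joined = left + right
                if joined in substrings and joined not in pool:
                    if reachable(pool | {joined}, joins_left - 1):
                        return True
        return False

    # Iterative deepening: the first depth that succeeds is the minimum.
    # Appending one character at a time always works in len(target) - 1 joins.
    for depth in range(1, len(target)):
        if reachable(basics, depth):
            return depth
    return len(target) - 1
```

For example, "ABABAB" needs only three joins (AB, then ABAB, then ABAB + AB) instead of the five needed when adding one character at a time, because the reused block AB is counted once. The search cost grows quickly with string length, which is why efficient bounds matter for larger objects.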

2606.15496 2026-06-16 q-bio.PE New submission

How recombination rates affect escape from low-fitness states

Elisa Heinrich-Mora, Chase Van Amburg, Marcus W. Feldman

AI summary: Studies, in a three-locus model, how recombination rates, through evolution at a modifier locus, affect a population's ability to escape low-fitness states, finding that modifier polymorphism can change the stability of the low-fitness state.

Abstract

Adaptation often requires the assembly of favorable combinations of mutations that are individually deleterious. As a result, populations may remain trapped in low-fitness genetic states even when higher-fitness genotypes exist. Recombination plays a dual role in this process because it can both generate and disrupt advantageous multilocus combinations. Previous work showed that the balance between selection and recombination determines whether populations cross fitness valleys or persist in low-fitness states associated with demographic decline. We study this problem in a three-locus model consisting of two selected loci and a recombination modifier locus. The modifier has no direct effect on fitness but alters the recombination rate between the selected loci, allowing recombination itself to evolve. We characterize the fixation states of the system and derive explicit conditions for the local stability of the low-fitness fixation set. Stability depends on selection strength, recombination among selected loci, recombination between the modifier and selected loci, and modifier composition. In the classical two-locus model, stability depends on a single recombination parameter. By contrast, the modifier model generates a continuum of fixation states whose stability varies with modifier frequency. Populations with identical selected-haplotype frequencies can therefore differ in stability solely because they differ in modifier composition. We further show that modifier polymorphism can either stabilize or destabilize the low-fitness state, depending on the relative magnitudes of modifier-dependent recombination rates. These results demonstrate that genetic variation affecting recombination alters evolutionary outcomes not only by changing the formation of favorable multilocus combinations but also by changing the stability of alternative evolutionary states.
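The classical two-locus case mentioned in the abstract, where stability of the low-fitness state depends on a single recombination parameter, can be illustrated with a minimal deterministic sketch. The haploid life cycle (selection, then recombination), the fitness values, and every name below are illustrative assumptions; this is not the authors' three-locus modifier model.

```python
def step(freqs, fitness, r):
    """One generation: haploid selection, then recombination at rate r.

    freqs and fitness are ordered (ab, Ab, aB, AB).
    """
    mean_w = sum(x * w for x, w in zip(freqs, fitness))
    ab, Ab, aB, AB = (x * w / mean_w for x, w in zip(freqs, fitness))
    d = AB * ab - Ab * aB  # linkage disequilibrium after selection
    return (ab - r * d, Ab + r * d, aB + r * d, AB - r * d)


def final_AB(r, generations=2000, s=0.1, t=0.2, start_AB=0.01):
    """Frequency of the double mutant AB after many generations.

    Assumed fitnesses: ab = 1, single mutants = 1 - s, AB = 1 + t.
    """
    fitness = (1.0, 1.0 - s, 1.0 - s, 1.0 + t)
    freqs = (1.0 - start_AB, 0.0, 0.0, start_AB)
    for _ in range(generations):
        freqs = step(freqs, fitness, r)
    return freqs[3]


if __name__ == "__main__":
    for r in (0.0, 0.5):
        print(f"r = {r}: final AB frequency = {final_AB(r):.4f}")
```

In this toy setup a rare AB haplotype grows by a factor of roughly (1 + t)(1 - r) per generation, so without recombination it sweeps to fixation, while at r = 0.5 it is broken up faster than selection can amplify it and the population stays at the low-fitness state.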


2606.15422 2026-06-16 cs.CL q-bio.BM New submission

Pepti-Agent: An AI Agent for Peptide Design and Optimization

Houxu Chen, Achuth Chandrasekhar, Amir Barati Farimani

AI summary: Proposes Pepti-Agent, a closed-loop peptide design framework built on the Model Context Protocol (MCP) that combines independently inspectable generation, prediction, and mutation tools with a large language model controller and live property prediction to support multi-objective optimization and reproducible benchmarking.

Abstract

Therapeutic peptides occupy a valuable design space between small molecules and biologics, but their development requires satisfying several competing constraints at once: solubility, hemolytic activity, and nonspecific surface fouling are governed by overlapping sequence features, so improving one property often degrades another. Computational design addresses this by pairing generative models with sequence-based property predictors, iteratively proposing and refining candidates. However, these components are typically wired together as monolithic scripts that are difficult to inspect, extend, or reuse, and they often refine sequences by natural-language reasoning rather than by tracking the evolving multi-property state of each candidate. We present Pepti-Agent, a closed-loop, peptide-specific framework that exposes generation, property prediction, and single-residue mutation as independently inspectable Model Context Protocol (MCP) tools. A large language model controller invokes these tools and consults live predictor output between calls, so refinement is guided by each sequence's current property profile rather than by language reasoning alone. Task-specific PeptideGPT models generate candidates, ProtBERT-based classifiers score solubility, hemolysis, and non-fouling, and two interchangeable mutation operators propose sequence edits. By recording a per-step trace of controller decisions, predictor outputs, and accepted mutations, Pepti-Agent offers a reproducible substrate for benchmarking multi-objective design strategies and for prioritizing candidates for experimental validation.


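2606.15397 2026-06-16 q-bio.PE stat.AP New submission

On the Equivalence of Instantaneous and Mechanistic Reproduction Numbers

Jeremy Goldwasser, Ryan J. Tibshirani, Alyssa Bilinski

AI summary: Proves that, under a homogeneous-mixing assumption, the instantaneous reproduction number defined through the renewal equation is equivalent to the mechanistic reproduction number in compartmental models such as SEIR, and derives the generation-interval distribution implied by SEIR dynamics.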
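The renewal-equation definition of the instantaneous reproduction number is commonly written R_t = I_t / Σ_s w_s I_{t-s}, where I_t is incidence at time t and w_s is the generation-interval weight at lag s. The sketch below assumes that standard discrete form; the function name and the toy inputs are illustrative and do not come from the paper.

```python
def instantaneous_r(incidence, gen_interval):
    """Return R_t = I_t / sum_s w_s * I_{t-s} for each t with full history.

    gen_interval[s - 1] is the weight w_s for lag s; weights should sum to 1.
    """
    lags = len(gen_interval)
    estimates = []
    for t in range(lags, len(incidence)):
        # Infection pressure: past incidence weighted by the generation interval.
        pressure = sum(
            gen_interval[s - 1] * incidence[t - s] for s in range(1, lags + 1)
        )
        estimates.append(incidence[t] / pressure if pressure > 0 else float("nan"))
    return estimates


if __name__ == "__main__":
    # Incidence that doubles every step with a one-step generation interval.
    print(instantaneous_r([1, 2, 4, 8], [1.0]))
```

Constant incidence gives R_t = 1 for any normalized generation interval, which is a quick sanity check on the weights.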

2606.15357 2026-06-16 q-bio.GN New submission

Fast genomic read alignment with minibwa

Heng Li, Nils Homer

AI summary: Presents minibwa, which combines BWA-MEM variable-length seeding with minimap2 chaining and base alignment; with seeding prefetch, heuristics that skip unnecessary mate rescue, and reduced effort in highly repetitive regions, it is about four times as fast as BWA-MEM and over twice as fast as BWA-MEM2 at comparable accuracy.

Abstract

Motivation: BWA-MEM remains a popular short-read mapper especially for the purpose of variant calling. Several groups have accelerated this algorithm as it has been the performance bottleneck of many current workflows. However, constrained by the original design, these drop-in replacements could only achieve limited speedup. Breaking changes to BWA-MEM are required for further improvement. Results: We developed minibwa for aligning short and accurate long reads against a reference genome. It combines BWA-MEM variable-length seeding with minimap2 chaining and base alignment. It speeds up BWA-MEM2 further with additional prefetch for seeding, new heuristics to skip unnecessary mate rescue and reduced effort in highly repetitive regions where reads would anyway be wrongly mapped due to structural changes. Minibwa is about four times as fast as BWA-MEM and over twice as fast as BWA-MEM2 at comparable accuracy. It also natively supports directional bisulfite sequencing data to high mapping accuracy. Availability and implementation: https://github.com/lh3/minibwa


2606.15348 2026-06-16 q-bio.NC cs.AI New submission

Intrinsic Computational Functionalism and Simulated Consciousness

Ryota Kanai, Shuqin Ma

Affiliations: Araya Inc.; School of Philosophy, Fudan University; Sussex Centre for Consciousness Science, University of Sussex

AI summary: Starting from Intrinsic Computational Functionalism, proposes a mechanism-enriched canonical structure and argues that, if consciousness is computationally constituted, any system satisfying Intrinsic Causal-Computational Realization (biological, artificial, or simulated) realizes the same consciousness-relevant properties.

Abstract

A common objection to artificial or simulated consciousness is that a simulated brain is no more conscious than simulated water is wet. We address this from the perspective of Intrinsic Computational Functionalism (ICF): if consciousness is computationally constituted, it depends not on externally imposed descriptions but on the computational structures a system physically realizes in virtue of its own causal-dynamical organization. In previous work we developed Canonical Functionalism as a mathematically precise special case of this anti-interpretivist program, identifying functional states by their complete future input-output roles under a fixed interface. Here we argue that this input-output construction, though important, is incomplete: as a behavioral boundary case of ICF, it makes lookup tables and unfolded systems that preserve the same boundary behavior canonically equivalent. A consciousness-relevant canonical representation must instead include internal mechanisms, interventions, and joint readouts belonging to the relevant intrinsic organization. We therefore define a mechanism-enriched canonical structure and use it to formulate Intrinsic Causal-Computational Realization (ICCR), a realization relation preserving physical implementation, intrinsic state individuation, transition structure, intervention profiles, and the relevant agent-body-world boundary. The central result is conditional: if conscious properties are invariants of intrinsic causal-computational organization, then any system satisfying ICCR realizes the same consciousness-relevant properties, whether biological, artificial, or simulated. We discuss objections including biological naturalism and integrated information theory. We conclude that to deny consciousness to a simulation, one must identify a consciousness-relevant intrinsic causal-computational structure that the simulation fails to realize.


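2606.15192 2026-06-16 q-bio.NC New submission

OpTI-Mouse: Optimization for Targeted Temporal Interference Stimulation in the Mouse Brain

Jingsheng Tang, Zhengkang Zhou, Yingyue Xin, Zihan Ning, Pengfei Wei, Mo Wang, Quanying Liu

AI summary: Develops a computational tool that combines mouse head modeling with optimization algorithms; by balancing target intensity against spatial focality, it significantly outperforms empirical baselines and offers a precise tool for preclinical temporal interference (TI) stimulation studies.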

2606.15012 2026-06-16 stat.ME q-bio.QM New submission

A Kuramoto-von Mises Time Series Model for Probabilistic Modeling of Coupled Oscillators

Yun Hwang, Todd P. Coleman

AI summary: Proposes a method for estimating the probability distribution of coupled oscillators without assuming thermodynamic equilibrium; built on Langevin dynamics, it has a closed-form solution at high sampling rates, outperforms existing methods on non-equilibrium simulated data, and is applied to brain and stomach electrophysiology recordings.

Comments 15 pages, 4 figures

Abstract

A system of coupled oscillators provides a fundamental framework for modeling a wide range of physical and biological phenomena. In neuroscience, the central nervous system exhibits synchronized oscillatory activity with adjacent brain regions, giving rise to traveling wave dynamics for instance during sleep. Similarly, in the gastrointestinal system, neuromuscular cells coordinate their oscillations to generate propagating waves of slow wave activity. To estimate probability distributions of multivariate phase relationships, existing approaches typically rely on equilibrium thermodynamics, expressing the system in a Boltzmann form through a pairwise exponential family distribution. However, these assumptions are often violated in real-world systems, which are inherently dynamic and frequently transition between equilibrium and non-equilibrium regimes. To address this, we propose an efficient method for estimating the probability distribution of coupled oscillators that does not assume thermodynamic equilibrium. Using a Langevin dynamics-based construction, the approach enables accurate modeling even in non-equilibrium regimes. The maximum likelihood estimation method is shown to have a closed form algebraic solution in the high sampling rate regime, a condition commonly satisfied by modern data acquisition systems, which makes it readily applicable in practice. We demonstrate its robustness on simulated data, where it outperforms existing approaches in non-equilibrium settings, and further illustrate its utility for characterizing dynamic brain traveling waves in response to brain stimulation and in hypothesis testing within the context of electrophysiologic recordings of the human stomach.
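To make the setting concrete, the following is a minimal sketch of noisy coupled phase oscillators integrated with the Euler-Maruyama scheme. It assumes the textbook all-to-all Kuramoto drift with additive white noise; it is a generic simulator with illustrative names and parameters, not the authors' Kuramoto-von Mises model or their estimator.

```python
import math
import random


def simulate(omegas, coupling, noise, dt=0.01, steps=5000, seed=0):
    """Euler-Maruyama integration of
    d(theta_i) = [omega_i + (K/N) * sum_j sin(theta_j - theta_i)] dt + sqrt(2 D) dW_i.
    """
    rng = random.Random(seed)
    n = len(omegas)
    theta = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n)]
    kick = math.sqrt(2.0 * noise * dt)  # standard deviation of the noise increment
    for _ in range(steps):
        drift = [
            omegas[i]
            + coupling / n * sum(math.sin(theta[j] - theta[i]) for j in range(n))
            for i in range(n)
        ]
        theta = [
            (theta[i] + drift[i] * dt + kick * rng.gauss(0.0, 1.0)) % (2.0 * math.pi)
            for i in range(n)
        ]
    return theta


def order_parameter(theta):
    """Kuramoto order parameter: 1 for full phase synchrony, near 0 for incoherence."""
    n = len(theta)
    re = sum(math.cos(x) for x in theta) / n
    im = sum(math.sin(x) for x in theta) / n
    return math.hypot(re, im)


if __name__ == "__main__":
    phases = simulate([1.0] * 5, coupling=2.0, noise=0.05)
    print(f"order parameter: {order_parameter(phases):.3f}")
```

Phase samples from a simulator of this kind are the sort of multivariate time series whose joint distribution the abstract sets out to estimate.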


2606.14975 2026-06-16 cs.NE cs.AI cs.LG physics.data-an q-bio.NC New submission

Harnessing cortical geometry, wiring, and function as inductive biases for recurrent neural networks

Mo Shakiba, Rana Rokni, Mohammad Mohammadi, Nima Dehghani

Affiliations: Neuromatch Academy, Neuromatch, Inc., USA; McGovern Institute for Brain Research, Massachusetts Institute of Technology (MIT)

AI summary: Using data from the MICrONS program, builds biologically grounded recurrent neural networks by initializing recurrent weights from neuronal spatial coordinates, anatomical connectivity, and functional relationships and by imposing spatial constraints; the networks outperform baseline models on cognitive decision-making tasks and develop low-entropy, modular, small-world organization.

Abstract

How the wiring and functional organization of cortex shape recurrent computation remains a central question in both neuroscience and machine learning. Here, we leverage data released through the Machine Intelligence from Cortical Networks (MICrONS) program--a functional connectomics resource spanning multiple areas of mouse visual cortex, in which dense calcium imaging is co-registered with high-resolution electron microscopy reconstruction from the same animal--to build biologically grounded recurrent neural networks. Using neuronal spatial coordinates, anatomical connectivity, and function-derived relationships from nearly 12,000 coregistered excitatory neurons, we initialize recurrent weights and impose communication-aware spatial constraints during learning. Across three cognitive decision-making tasks, networks constrained by cortical structure and function consistently outperform baseline and partially constrained models. Functional weight initialization provides the largest gain, while real spatial embedding yields robust additional improvements across conditions. These biologically grounded networks also develop low-entropy, modular, and small-world organization, and retain strong performance even when recurrence is restricted to positive weights. Together, our results show that the machinery of cortex--its geometry, wiring, and functional structure--can be harnessed as a powerful inductive basis for building recurrent networks that learn more effectively while converging toward key organizational principles of biological computation.


2606.14925 2026-06-16 q-bio.MN math.DS New submission

Boolean models coarsely sample continuous dynamics of regulatory networks

Breschine Cummins, Marcio Gameiro, Tomáš Gedeon, Konstantin Mischaikow, Bernardo Rivas

AI summary: Presents a framework that embeds monotone Boolean models into multilevel combinatorial models, shows that Boolean models systematically underestimate network dynamics while the DSGRN methodology efficiently captures richer dynamical behavior, and provides a mathematical bridge between discrete and continuous modeling.

Abstract

Boolean models are widely used to characterize the dynamics of gene regulatory networks. However, their coarse state discretization limits their ability to capture complex continuous dynamics and continuous parameter dependencies. In this paper, we present a rigorous mathematical framework that embeds monotone Boolean models into a broader class of multilevel combinatorial models, which in turn embed into the Dynamic Signatures Generated by Regulatory Networks (DSGRN) methodology. We define the DSGRN parameter graph, which encodes the notion of parameter adjacency and is used to map Boolean functions to specific nodes within the DSGRN parameter space. We prove that these multilevel discrete update functions act as a multilevel refinement of monotone Boolean models. We demonstrate that purely Boolean models systematically underestimate network dynamics by missing crucial intermediate behaviors such as higher-order multistability and stable periodic orbits. We show that the DSGRN framework efficiently captures a strictly richer set of dynamics consistent with ordinary differential equations (ODEs), providing a mathematically rigorous and computationally viable bridge between discrete and continuous network modeling.
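For readers unfamiliar with Boolean network models, the sketch below enumerates the attractors of a two-gene toggle switch, showing how coarse the two-level state space is. The network, the synchronous update rule, and all names are illustrative assumptions; this is not DSGRN and not a model from the paper.

```python
from itertools import product


def toggle_switch(state):
    """Two mutually repressing genes: each is ON only when the other is OFF."""
    x, y = state
    return (int(not y), int(not x))


def attractors(update, n_genes):
    """Enumerate attractors (fixed points and cycles) under synchronous updates."""
    found = set()
    for start in product((0, 1), repeat=n_genes):
        seen = []
        state = start
        while state not in seen:
            seen.append(state)
            state = update(state)
        cycle = seen[seen.index(state):]
        # Rotate to a canonical starting state so each cycle is counted once.
        k = cycle.index(min(cycle))
        found.add(tuple(cycle[k:] + cycle[:k]))
    return found


if __name__ == "__main__":
    for attractor in sorted(attractors(toggle_switch, 2)):
        print(attractor)
```

With only four states, this model has two fixed points and one two-state cycle. A multilevel model of the same network has more states per gene and can therefore represent intermediate behaviors that the Boolean version cannot.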


2606.14835 2026-06-16 q-bio.QM New submission

The Essential Role Of Ribosomal Feedback In Bacterial Cell Growth And Metabolic Load -- A Systems Biology Approach For Unveiling Shared Resources Regulation Within Synthetic Genetic Circuits

Chiara Cimolato, Elisa Gaetan, Lorenzo Pasotti, Luca Schenato, Massimo Bellato

AI summary: Uses minimal models to reveal the key regulatory role of negative feedback in ribosome synthesis on gene expression and growth rate under heterologous expression, balancing ribosome function against energy expenditure.

Abstract

Modeling growth in bacterial cells is a major issue in systems and synthetic biology. Despite several growth rate functions proposed in the literature, most focus on nutrient composition without explicitly accounting for the possible perturbation provided by the expression of recombinant genes, an effect known as cell load or burden. On the other hand, mathematical models that attempt to provide mechanistic details on the phenomena, leveraging ribosome partitioning and nutrient availability, are generally too detailed and complex to be easily applied to the rational design of synthetic genetic circuits. A bottom-up approach is adopted herein to identify and analyze the minimal model structure, thereby unveiling the fundamental role of negative feedback in ribosomal synthesis in predicting the effects of cell load on both gene expression and growth rate. Indeed, to ensure cellular efficiency, ribosome synthesis must be finely regulated. While an increased number of ribosomes generally enhances protein production and cellular performance, their synthesis incurs a high energetic cost. For this reason, cells have evolved mechanisms to tightly control ribosome synthesis, avoiding unnecessary accumulation. One of the key regulatory strategies, usually neglected in previous cell models, involves a negative feedback loop that modulates the production of ribosomal components. This feedback ensures that ribosomes are produced only in the amount strictly needed, balancing functionality and energy expenditure. This work evaluates the individual contribution of this feedback under heterologous expression conditions using minimal gene-circuit models, explicitly linking ribosome allocation, hidden couplings between protein synthesis levels, and growth rate.


2606.14823 2026-06-16 q-bio.GN cs.AI cs.CL New submission

Human genetic evidence is associated with drug approval across therapeutic areas: an observational analysis of 26,278 target-disease pairs with temporal validation and feature ablation

Victoria Paterson

Affiliations: School of Informatics, University of Edinburgh

AI summary: Analyzing 26,278 target-disease pairs, finds that targets with a genetic association have a 3.25-fold higher drug approval rate than those without, but that genetic evidence alone has limited predictive value; catalogues 1,433 genetically supported Phase 1/2 target-disease pairs as a hypothesis-generating resource.

Abstract

Genetic evidence is enriched among approved drug targets: in an observational analysis of 26,278 target-disease pairs from Open Targets and ChEMBL, targets with any genetic association had a 3.25-fold higher approval rate than those without (OR = 3.25, 95% CI 2.79-3.79, p = 1.91e-42). A target-level analysis accounting for non-independence of pairs sharing the same gene gave OR = 2.79 (bootstrap 95% CI 2.22-3.53); the oncology pair-level OR of 6.72 attenuates to 2.71 at the target level, illustrating how non-independence inflates area-specific estimates. The enrichment replicated in post-2015 approvals (OR = 3.51, p = 1.72e-8). Feature ablation across six evidence types revealed that literature mining alone accounts for most classifier performance (AUPRC = 0.099 versus 0.109 for all features), consistent with temporal leakage from post-approval publications. Excluding literature, remaining evidence types retain above-baseline signal (AUPRC = 0.084, 1.63x baseline). Sensitivity analyses bracket the pair-level OR between 3.25 and 4.93. Genetic evidence alone yields only a 1.0-percentage-point absolute AUPRC gain and the best model has poor calibration; the classifier has limited practical predictive value. We catalogue 1,433 genetically supported Phase 1/2 pairs as a hypothesis-generating resource. All findings are observational.
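The enrichment statistics quoted above are odds ratios with 95% confidence intervals. As a reminder of how such a figure is obtained from a 2x2 table, here is a minimal sketch using the standard log-odds (Woolf) interval. The counts in the usage example are hypothetical, and the abstract does not state which interval method the author used.

```python
import math


def odds_ratio(a, b, c, d, z=1.96):
    """Odds ratio and Woolf confidence interval for a 2x2 table.

    a, b: approved / not approved pairs WITH genetic evidence
    c, d: approved / not approved pairs WITHOUT genetic evidence
    """
    estimate = (a * d) / (b * c)
    # Standard error of log(OR) under the Woolf approximation.
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(estimate) - z * se)
    upper = math.exp(math.log(estimate) + z * se)
    return estimate, lower, upper


if __name__ == "__main__":
    # Hypothetical counts, chosen only to show the call.
    est, lo, hi = odds_ratio(20, 80, 10, 90)
    print(f"OR = {est:.2f}, 95% CI {lo:.2f}-{hi:.2f}")
```

This pair-level calculation treats every row as independent. The abstract's target-level analysis with a bootstrap interval exists because pairs sharing a gene violate that assumption.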


2606.14797 2026-06-16 physics.bio-ph q-bio.PE New submission

Hierarchical Autocatalytic Systems as a Bridge between Maximum Entropy Production and Bayesian Posterior Contraction: A Numerical Study with Stochastic-Thermodynamic Bounds

Yoshinori Watanabe

AI summary: Constructs a three-layer reaction-diffusion model of autocatalytic chemistry, uses stochastic-thermodynamic bounds to analyze the joint behavior of entropy production and genetic entropy, interprets it as adaptation driven by the maximum entropy production principle, and draws a correspondence with diffusion-model training.

Abstract

We construct a three-layer reaction-diffusion model of an autocatalytic chemical system in which raw molecules ($a_i$), catalytic proteins ($p_l$) and large RNA/protein ``genes'' ($W_p^{(k)}$) interact through a mass-action stoichiometry tensor $\mathrm{Coef}_{ijk}$ whose magnitude is modulated by the fold-stable activity of the largest polymers. Mass-action is broken by an $\epsilon$-noise term so that the system is nonequilibrium. We compute the total entropy production $\sigma(t)$, the genetic Shannon entropy $S_\mathrm{gene}$, and the thermodynamic uncertainty relation (TUR) and thermodynamic speed limit (TSL) bounds on growth and evolution rates. The hierarchical model exhibits the expected co-occurrence of $\sigma_\mathrm{env}\!\uparrow$ and $S_\mathrm{gene}\!\downarrow$ predicted by Schrödinger's negentropy argument and reformulated as maximum-entropy-production-principle (MEPP)-driven adaptation. In contrast to a single kinetic-proofreading-like cycle, whose TUR product is $\sim 5$, matching the experimentally reported regime of the ribosome, the hierarchical model's TUR product sits $10^4$-$10^5$ above the universal bound of 2, and the TSL ratio sits $10^6$-$10^8$ above its bound of 1. Scaling the number of molecules leaves the looseness intact for the hierarchical model but tightens it monotonically with particle number for the minimal model. We close by drawing an explicit correspondence between the autocatalytic system and diffusion-model training: $a_\mathrm{ext} \to a$ flux $\Leftrightarrow$ data-information flow, $\tanh(\beta Wp - \theta)$ $\Leftrightarrow$ score network, replication noise $\Leftrightarrow$ forward-diffusion noise, $S_\mathrm{gene} \searrow$ $\Leftrightarrow$ $H[q(\theta|\mathcal{D})] \searrow$. All code and figures are available at https://github.com/xiangze/DiverseCells/Hier_Autocatalysis
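In its standard steady-state form, the thermodynamic uncertainty relation bounds the relative fluctuations of a current $J$ by the total entropy production $\Sigma$: $\mathrm{Var}(J)/\langle J\rangle^2 \cdot \Sigma/k_B \ge 2$. The "TUR product" in the abstract is presumably the left-hand side of this inequality. The sketch below computes that product and a Shannon entropy from toy numbers; the function names and inputs are illustrative assumptions and are not taken from the linked repository.

```python
import math
from statistics import mean, pvariance


def tur_product(current_samples, entropy_production):
    """Var(J) / <J>^2 times total entropy production (in units of k_B).

    The thermodynamic uncertainty relation states that this is at least 2
    for steady-state currents.
    """
    m = mean(current_samples)
    return pvariance(current_samples) / (m * m) * entropy_production


def shannon_entropy(freqs):
    """Shannon entropy in nats of a normalized frequency vector."""
    return -sum(p * math.log(p) for p in freqs if p > 0)


if __name__ == "__main__":
    # Toy numbers only: two current samples and an assumed entropy production.
    print(tur_product([1.0, 3.0], entropy_production=10.0))
    print(shannon_entropy([0.5, 0.5]))
```

A product close to 2 means the bound is nearly saturated. Values far above 2, as reported for the hierarchical model, mean the bound is loose.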


2606.14742 2026-06-16 q-bio.NC cs.AI cs.HC New submission

Do Large Language Models Have Emotions?

Amit Goldenberg, James J. Gross

AI summary: Evaluates Anthropic's claim that Claude Sonnet 4.5 has 'functional emotions' against how emotions function in biological systems, finding partial support for the context-sensitive interpretation function but no dynamic reorganization of processing.

Abstract

Do LLMs have emotions? A recent paper from Anthropic reports finding internal representations of emotion concepts in Claude Sonnet 4.5, concluding that the LLM has 'functional emotions.' We evaluate this claim against what is known about how emotions actually function in biological systems. We argue that emotions serve two core functions: the context-sensitive interpretation of situations, and the reorganization of processing across multiple systems in response to those interpretations. The Anthropic findings offer partial support for the first function, though the consistent, discrete emotional representations identified in Claude sit uneasily with affective neuroscience findings that human emotion is characterized by variable rather than uniform neural signatures. On the second function, the evidence is mixed: Claude's representations modulate output without producing the dynamic reorganization of attention, decision speed, and motivational state that defines emotion in biological systems. We close by proposing what it would take for an LLM to have emotions.


2606.14737 2026-06-16 q-bio.BM cs.LG stat.ML New submission

Learning Topological Representations for Molecular Dynamics

Dominik Geng, Florian Graf, Martin Uray, Roland Kwitt

Affiliations: University of Salzburg; Centre for Intelligent and Secure Industrial Automation; University of Applied Sciences

AI summary: Introduces the masked Flood complex for persistent homology, yielding geometry-aware representations of protein conformations in a shared representation space with competitive performance on classification, regression, and Markov state model estimation.

Comments 20 pages, 4 figures

Abstract

Molecular dynamics (MD) simulations generate trajectories in a high-dimensional configuration space whose analysis critically depends on molecular descriptors, typically handcrafted observables or learned kinetic embeddings. Designing descriptors that are both expressive and broadly applicable, however, remains challenging. We study persistent homology (PH) as a general-purpose representation for MD and introduce the masked Flood complex, a protein-tailored modification of a recently introduced simplicial complex construction that emphasizes inter-residue structure at low computational cost. Vectorized persistence diagrams then provide information-rich, geometry-aware summaries of protein conformations, which we evaluate on protein class prediction, frame-level observable regression, and Markov state model (MSM) estimation from learned low-dimensional coordinates in a single shared representation space. Results on the mdCATH dataset show that PH-based descriptors are competitive across tasks, with masked Flood PH yielding the most consistent overall performance. Further, when using topologically informed MSMs as a drop-in replacement within the recent MarS-FM framework for generative modeling of protein conformations, we obtain consistently better ensemble statistics than MSMs based on physical observables. Finally, we explore the transferability of the generative model to qualitatively different, fast-folding proteins.
