arXivDaily: arXiv daily academic digest, updated Monday to Friday
2605.03749 2026-05-08 cs.CV

FluxFlow: Conservative Flow-Matching for Astronomical Image Super-Resolution

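Shuhong Liu, Xining Ge, Ziteng Cui, Liuzhuozheng Li, Gengjia Chang, Jun Liu, Ziying Gu, Dong Li, Xuangeng Chu, Lin Gu, Tatsuya Harada

Institutions: The University of Tokyo; I2WM Tohoku University; RIKEN AIP

AI summary: This paper proposes FluxFlow, a framework that incorporates observation uncertainty and source-region importance weights during training and applies a training-free Wiener-regularized test-time correction to improve the photometric and scientific accuracy of astronomical image super-resolution. It also builds the DESI-HST dataset of 19,500 real ground-to-space image pairs.

AI abstract (translated from Chinese)

Ground-to-space astronomical super-resolution requires recovering space-quality images from ground-based observations limited by pixel sampling resolution and atmospheric seeing, which produces a stochastically varying PSF that upsampling alone cannot resolve. Existing methods rely on synthetic training pairs that fail to capture real atmospheric statistics and tend to produce over-smoothed results or hallucinated sources with no physical counterpart. We propose FluxFlow, a conservative pixel-space flow-matching framework that incorporates observation uncertainty and source-region importance weights during training and uses a training-free Wiener-regularized test-time correction to suppress hallucinated sources while preserving recovered detail. We further construct the DESI-HST dataset of 19,500 real co-registered ground-to-space image pairs with real atmospheric PSF variation. Experiments show that FluxFlow outperforms existing baselines in both photometric and scientific accuracy.

English abstract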

Ground-to-space astronomical super-resolution requires recovering space-quality images from ground-based observations that are simultaneously limited by pixel sampling resolution and atmospheric seeing, which imposes a stochastic, spatially varying PSF that cannot be resolved through upsampling alone. Existing methods rely on synthetic training pairs that fail to capture real atmospheric statistics and are prone to either over-smoothed reconstructions or hallucinated sources with no physical counterpart in the observed sky. We propose FluxFlow, a conservative pixel-space flow-matching framework that incorporates observation uncertainty and source-region importance weights during training, and a training-free Wiener-regularized test-time correction to suppress hallucinated sources while preserving recovered detail. We further construct the DESI-HST Dataset, a large-scale real-world benchmark comprising 19,500 co-registered ground-to-space image pairs with real atmospheric PSF variation. Experiments demonstrate that FluxFlow consistently outperforms existing baseline methods in both photometric and scientific accuracy.

2605.03379 2026-05-08 cs.LG cs.CL

Two Calls, Two Moments, and the Vote-Accuracy Curve of Repeated LLM Inference

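Yi Liu

Institutions: York University

AI summary: Studies the binary correctness layer of repeated LLM inference. Under conditionally i.i.d. calls, two calls identify the second moment and the same-example correctness correlation, which yields a distribution-free two-call interval for any fixed majority-vote budget.

AI abstract (translated from Chinese)

Repeated sampling is a standard way to spend test-time compute, but its benefit is governed by the latent distribution of correctness across examples rather than by single-call accuracy. We study the binary correctness layer of repeated LLM inference under conditionally i.i.d. calls. One labeled call identifies the mean of the latent success probability; two labeled calls identify its second moment, separating stable errors from recoverable call-level randomness. From these two moments, every fixed majority-vote budget has a distribution-free two-call interval. The key technical reduction is that the infinite-dimensional moment problem has three-atom extremizers and quadratic dual certificates, so the bounds are exact rather than discretized or parametric. The first useful budget, three votes, has a closed form, a width of at most 1/8, and a certified-improvement criterion. The infinite-vote endpoint is the limit of majority voting as the number of calls tends to infinity; it is also sharply bounded but remains threshold-sensitive because it depends on the latent mass near q=1/2. We add maximum-entropy and latent-difficulty Gaussian-probit point completions. Experiments with LLM calls on QNLI and QQP show that empirical three- and five-vote accuracies fall within the projected two-call regions, while temperature changes and randomized model mixtures can create voting gains that are not ordered by single-call accuracy.

English abstract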

Repeated sampling is a standard way to spend test-time compute, but its benefit is controlled by the latent distribution of correctness across examples, not by one-call accuracy alone. We study the binary correctness layer of repeated LLM inference under conditional-i.i.d. calls. One labeled call identifies the mean latent success probability; two labeled calls identify its second moment and hence the same-example correctness correlation that separates stable errors from recoverable call-level randomness. From these two moments, every fixed majority-vote budget has a sharp distribution-free two-call interval. The key technical reduction is that the infinite-dimensional moment problem has three-atom extremizers and quadratic dual certificates for every finite budget, so the bounds are exact rather than discretized or parametric. The first useful budget, three votes, has a closed form, width at most $1/8$, and a certified-improvement criterion. The infinite-vote endpoint is the limit of majority voting as the number of calls tends to infinity; it is also sharply bounded, but remains threshold-sensitive because it depends on latent mass around $q=1/2$. We add maximum-entropy and latent-difficulty Gaussian-probit point completions, and experiments on LLM calls over QNLI and QQP show that empirical three- and five-vote accuracies are contained in the projected two-call regions, while temperature changes and randomized model mixtures can create voting gains not ordered by one-call accuracy.
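The point that one-call accuracy alone does not fix the benefit of voting can be checked with a small worked example. The sketch below is illustrative only and is not the paper's method: it assumes a discrete latent distribution over per-example success probabilities q and uses the binomial identity P(at least 2 of 3 correct | q) = 3q^2 - 2q^3. The function names and the two toy populations are hypothetical.

```python
from fractions import Fraction


def first_two_moments(latent):
    """Return E[q] (one-call accuracy) and E[q^2] (both of two calls correct).

    `latent` maps a per-example success probability q to its population weight.
    """
    m1 = sum(w * q for q, w in latent.items())
    m2 = sum(w * q**2 for q, w in latent.items())
    return m1, m2


def three_vote_accuracy(latent):
    """Expected majority-of-three accuracy under conditionally i.i.d. calls."""
    return sum(w * (3 * q**2 - 2 * q**3) for q, w in latent.items())


# Population A: every example is answered correctly with probability 7/10.
pop_a = {Fraction(7, 10): Fraction(1)}
# Population B: 70% of examples are always right, 30% are always wrong.
pop_b = {Fraction(1): Fraction(7, 10), Fraction(0): Fraction(3, 10)}

for name, pop in (("A", pop_a), ("B", pop_b)):
    m1, m2 = first_two_moments(pop)
    print(name, "E[q] =", m1, "E[q^2] =", m2, "3-vote =", three_vote_accuracy(pop))
```

Both populations have the same one-call accuracy of 7/10, but only population A gains from voting, and the second moment is what tells them apart.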

2605.03354 2026-05-08 cs.AI

What Happens Inside Agent Memory? Circuit Analysis from Emergence to Diagnosis

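Xutao Mao, Jinman Zhao, Gerald Penn, Cong Wang

Institutions: City University of Hong Kong; University of Toronto

AI summary: The study uncovers the causal relationship between control and content circuits in agent memory and the mechanism of a shared hub, and develops an unsupervised diagnostic with 76.2% accuracy that beats the supervised baseline by 13 points.

AI abstract (translated from Chinese)

Agent memory failures are silent: an LLM agent can produce a fluent response even when it fails to extract, retain, or retrieve the information it needs across sessions. The write-manage-read loop describes the external pipeline of these systems but does not specify the internal computation behind each stage. Tracing feature circuits across the Qwen-3 family (0.6B-14B) and two memory frameworks (mem0 and A-MEM), we report two mechanistic findings and one deliverable. First, control is detectable before content: routing circuitry is causally active at 0.6B, whereas content circuitry produces no detectable signal until 4B, exposing a deployment regime in which small models route memory decisions before they can reliably extract or ground facts. Second, the shared hub is recruited, not created: writing and reading converge on a late-layer hub that already exists in the base model as a context-grounding substrate, and the memory framework recruits a specific functional direction on that substrate rather than building its own. Both findings transfer across mem0 and A-MEM, indicating that the underlying computation is a property of the base model rather than of a particular interface. Building on this circuit structure, we develop an unsupervised stage-level diagnostic that localizes silent failures to the responsible operation with 76.2% accuracy, 13 points above the strongest supervised baseline. These results suggest that circuit-level signatures are a practical handle for monitoring and structured design of agent memory.

English abstract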

Agent memory failures are silent: an LLM-based agent can produce a fluent response even when it fails to extract, retain, or retrieve the information needed across sessions. The write-manage-read loop describes the external pipeline of these systems but leaves open which internal computations implement each stage. Tracing feature circuits across the Qwen-3 family (0.6B-14B) and two memory frameworks (mem0 and A-MEM), we report two mechanistic findings and one deliverable. First, control is detectable before content: routing circuitry is causally active at 0.6B, while content circuitry produces no detectable signal until 4B, exposing a deployment regime where small models route memory decisions before they can reliably extract or ground the underlying facts. Second, the shared hub is recruited, not created: Write and Read converge on a late-layer hub that already exists in the base model as a context-grounding substrate, and memory framing recruits a memory-specific functional direction on this substrate rather than building one of its own. Both findings transfer across mem0 and A-MEM, indicating that the underlying computations are properties of the base model rather than of any particular interface. Building on this circuit structure, we develop an unsupervised stage-level diagnostic that localizes silent failures to the responsible operation with up to 76.2% accuracy, outperforming the strongest supervised baseline by 13 points. Together, these results point to circuit-level signatures as a practical handle for monitoring and structurally guided design of agent memory.

2605.03227 2026-05-08 cs.AI

Evaluating Prompting and Execution-Based Methods for Deterministic Computation in LLMs

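Hongkun Yu

Institutions: Virginia Tech

AI summary: This paper evaluates several prompting strategies on tasks that require exact outputs. PoT reaches perfect accuracy by generating executable code, while the other methods suffer from error accumulation or high computational overhead.

Comments: 8 pages, 1 figure. Code and dataset available at https://github.com/bigbird231/llm-exact-computation-dataset

AI abstract (translated from Chinese)

Large language models (LLMs) perform well at natural language understanding and reasoning, but their ability to carry out exact, deterministic computation remains unclear. This paper systematically evaluates several prompting strategies, including Chain-of-Thought (CoT), Least-to-Most decomposition, Program-of-Thought (PoT), and Self-Consistency (SC), on tasks that require precise, error-free outputs, such as binary counting, longest substring detection, and arithmetic evaluation. To support the study, we introduce a synthetic dataset of diverse natural-language instructions for controlled evaluation of exact computation across task types. The results show that standard prompting achieves only moderate accuracy on sequence tasks. CoT offers limited improvement, and Least-to-Most decomposition is hurt by error accumulation. In contrast, PoT reaches perfect accuracy by generating executable code and delegating it to an external interpreter. Self-Consistency improves robustness through majority voting but adds substantial computational overhead. We further train a small domain-specific model (CodeT5-small) to generate executable programs; it reaches perfect accuracy on held-out synthetic test data for all tasks at very low training cost. Overall, our findings suggest that LLMs may simulate reasoning patterns rather than reliably perform exact symbolic computation. For deterministic tasks, combining LLMs with external tools or using specialized models offers a more reliable and efficient solution.

English abstract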

Large Language Models (LLMs) have demonstrated strong capabilities in natural language understanding and reasoning. However, their ability to perform exact, deterministic computation remains unclear. In this work, we systematically evaluate multiple prompting strategies, including Chain-of-Thought (CoT), Least-to-Most decomposition, Program-of-Thought (PoT), and Self-Consistency (SC), on tasks requiring precise and error-free outputs, including binary counting, longest substring detection, and arithmetic evaluation. To support this study, we introduce a synthetic dataset with diverse natural language instructions, enabling controlled evaluation of exact computation across multiple task types. Our results show that standard prompting methods achieve only moderate accuracy on sequence-based tasks. CoT provides limited improvement, while Least-to-Most suffers from error accumulation. In contrast, PoT achieves perfect accuracy by generating executable code and delegating computation to an external interpreter. Self-Consistency improves robustness through majority voting, but incurs substantial computational overhead. We further train a small domain-specific model (CodeT5-small) to generate executable programs, which achieves perfect accuracy on held-out synthetic test data across all tasks with minimal training cost. Overall, our findings suggest that LLMs may simulate reasoning patterns rather than reliably perform exact symbolic computation. For deterministic tasks, combining LLMs with external tools or using specialized models provides a more reliable and efficient solution.
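The Program-of-Thought idea of delegating computation to an interpreter can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's pipeline: the `GENERATED` string stands in for a model's output (no model is called here), and the helper name `run_generated_program` is hypothetical.

```python
def run_generated_program(program_source, function_name, argument):
    """Execute model-written code in a fresh namespace and call one function from it."""
    namespace = {}
    # NOTE: exec on untrusted text is unsafe; a real system would sandbox this step.
    exec(program_source, namespace)
    return namespace[function_name](argument)


# Hypothetical model output for the instruction:
# "Find the length of the longest run of 1s in the bit string."
GENERATED = '''
def solve(bits):
    best = current = 0
    for ch in bits:
        current = current + 1 if ch == "1" else 0
        best = max(best, current)
    return best
'''

answer = run_generated_program(GENERATED, "solve", "1101110011")
print(answer)  # 3
```

The answer comes from running the program, so its exactness depends only on whether the generated code is correct, not on the model carrying out each counting step in text.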

2605.03222 2026-05-08 cs.LG stat.ML

Beyond Activation Alignment: The Geometry of Neural Sensitivity

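Amirhossein Yavari, Farnaz Zamani Esfahlani

Institutions: Stephenson School of Biomedical Engineering, University of Oklahoma; Data Science and Analytics Institute, University of Oklahoma

AI summary: This paper proposes a framework based on local decodable information. Using Fisher information and local representation geometry, it summarizes each representation with the expected projected pullback/Fisher metric and introduces S-RAS and a uniform multiplicative certificate for assessing the local discriminative ability of neural representations.

Comments: 9 pages, 4 figures

AI abstract (translated from Chinese)

Activation-alignment measures such as Representational Similarity Analysis (RSA), Canonical Correlation Analysis (CCA), and Centered Kernel Alignment (CKA) are widely used to compare biological and artificial neural representations. Recent theoretical work interprets these methods as assessing agreement between optimal linear readouts over broad task families. However, agreement at the level of global readouts does not determine how a system uses local stimulus evidence. Specifically, representations may align in activation space yet differ in their sensitivity to small perturbations. To address this, we introduce a complementary framework based on local decodable information, which focuses on a representation's ability, under noise, to discriminate small perturbations within a specified stimulus-coordinate subspace. Building on Fisher information and local representation geometry, we summarize each representation with the expected projected pullback/Fisher metric over that subspace. This formulation induces a second-moment family of local discrimination tasks, for which the resulting operator provides a minimal, complete dataset-level summary. We compare these regularized signatures with a log-spectral distance on the manifold of symmetric positive definite (SPD) matrices, which yields S-RAS and a uniform multiplicative certificate over the corresponding lifted task values. Empirically, the framework recovers corresponding layers across independently trained artificial neural networks, supports transferable class-conditional probes, reveals controlled dissociations between standard and robust training, and uncovers stimulus-coordinate family effects in mouse visual cortex using the Allen Brain Observatory static gratings dataset.

English abstract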

Activation-alignment measures such as Representational Similarity Analysis (RSA), Canonical Correlation Analysis (CCA), and Centered Kernel Alignment (CKA) are widely used to compare biological and artificial neural representations. Recent theoretical work interprets many of these methods as assessing agreement between optimal linear readouts over broad families of global tasks. However, agreement at the level of global readouts does not determine how a system uses local stimulus evidence. Specifically, representations may align in activation space yet differ in their sensitivity to small perturbations. To address this challenge, we introduce a complementary framework based on local decodable information, which focuses on a representation's ability, under noise, to discriminate small perturbations within a specified stimulus-coordinate subspace. Building on Fisher information and local representation geometry, we summarize each representation using the expected projected pullback/Fisher metric over that subspace. This formulation induces a second-moment family of local discrimination tasks, for which the resulting operator provides a minimal, complete dataset-level summary of expected discriminability. We compare these regularized signatures using a log-spectral distance on the manifold of symmetric positive definite (SPD) matrices, yielding the Spectral Riemannian Alignment Score (S-RAS) and a uniform multiplicative certificate over the corresponding family of lifted task values. Empirically, this framework enables the recovery of corresponding layers across independently trained artificial neural networks, supports transferable class-conditional probes, reveals controlled dissociations between standard and robust training, and uncovers stimulus-coordinate family effects across mouse visual cortex using the Allen Brain Observatory static gratings dataset.
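To make the phrase "log-spectral distance on the manifold of SPD matrices" concrete, here is a small sketch of one common choice, the affine-invariant distance computed from the eigenvalues of A^{-1}B. It is an assumption that this matches the paper's definition; the ridge regularizer and the function name are hypothetical, and the mapping from this distance to the S-RAS score is not reproduced.

```python
import numpy as np


def log_spectral_distance(a, b, ridge=1e-6):
    """sqrt(sum_i log^2(lambda_i)), with lambda_i the eigenvalues of A^{-1} B.

    Both inputs are symmetric positive (semi-)definite; `ridge` adds a small
    multiple of the identity so the regularized matrices are strictly SPD.
    """
    dim = a.shape[0]
    a_reg = a + ridge * np.eye(dim)
    b_reg = b + ridge * np.eye(dim)
    # For SPD inputs these eigenvalues are real and positive.
    eigenvalues = np.linalg.eigvals(np.linalg.solve(a_reg, b_reg)).real
    return float(np.sqrt(np.sum(np.log(eigenvalues) ** 2)))


identity = np.eye(2)
stretched = np.diag([np.e, 1.0])  # one direction is e times more sensitive
print(log_spectral_distance(identity, stretched, ridge=0.0))
```

Because the distance depends only on ratios of eigenvalues, rescaling sensitivity by a factor c in one direction contributes |log c| regardless of the overall scale, which is the multiplicative flavor the abstract refers to.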

2605.03125 2026-05-08 cs.LG

Taming the Curses of Multiagency in Robust Markov Games with Large State Space through Linear Function Approximation

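Jingchu Gai, Laixi Shi

Institutions: CMU Machine Learning Department; Machine Learning Department, Carnegie Mellon University; Department of Electrical and Computer Engineering, Johns Hopkins University

AI summary: This paper uses linear function approximation to address the sample-complexity curse of multiagency in large-scale robust Markov games, and develops provably data-efficient algorithms for the generative model setting and a newly proposed online interactive setting.

AI abstract (translated from Chinese)

Multi-agent reinforcement learning (MARL) has great potential but faces robustness challenges caused by environmental uncertainty. To address this, distributionally robust Markov games (RMGs) optimize worst-case performance when the environment deviates from the nominal model within an uncertainty set. Beyond robustness, another pressing goal for MARL is data efficiency: sampling from vast state and action spaces that grow exponentially can lead to the curse of multiagency. However, current provably data-efficient RMG algorithms are limited to tabular settings and apply only to small-scale problems, leaving RMGs with large (or infinite) state spaces unexplored. Existing work beyond the tabular setting focuses on a class of RMGs under a vanishing minimal value assumption and still suffers from the curse of multiagency. In this paper, we focus on general RMGs with linear function approximation. For uncertainty sets defined by total variation distance, we develop provably data-efficient algorithms that overcome the curse of multiagency in both the generative model setting and a newly proposed online interactive setting. To our knowledge, our results are the first to overcome the sample-complexity curse of multiagency for RMGs with large (possibly infinite) state spaces, regardless of how the uncertainty set is constructed.

English abstract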

Multi-agent reinforcement learning (MARL) holds great potential but faces robustness challenges due to environmental uncertainty. To address this, distributionally robust Markov games (RMGs) optimize worst-case performance when the environment deviates from the nominal model within an uncertainty set. Beyond robustness, an equally urgent goal for MARL is data efficiency: sampling from vast state and action spaces that grow exponentially with the number of agents potentially leads to the curse of multiagency. However, current provably data-efficient algorithms for RMGs are limited to tabular settings with finite state and action spaces, which are only computationally manageable for small-scale problems, leaving RMGs with large-scale (or infinite) state spaces largely unexplored. The only existing work beyond tabular settings focuses on linear function approximation (LFA) for a restrictive class of RMGs under a vanishing minimal value assumption and still suffers from the curse of multiagency in its sample complexity. In this work, we focus on general RMGs with LFA. For uncertainty sets defined by total variation distance, we develop provably data-efficient algorithms that break the curse of multiagency in both the generative model setting and a newly proposed online interactive setting. To our knowledge, our results are the first to break the curse of multiagency in sample complexity for RMGs with large (possibly infinite) state spaces, regardless of the uncertainty set construction.

2605.01518 2026-05-08 cs.RO

VOFA: Visual Object Goal Pushing with Force-Adaptive Control for Humanoids

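Zichao Hu, Zifan Xu, Dongsik Chang, He Yin, Linh Tran, Roberto Martín-Martín, Peter Stone, Jingyu Qiao, Joydeep Biswas

Institutions: Department of Computer Science, The University of Texas at Austin; Amazon Inc.; Sony AI

AI summary: This paper presents VOFA, a vision-guided system that accurately pushes objects with unknown physical properties. It handles the unknown mass and ground friction involved in manipulating heavy objects and achieves high success rates in simulation and in the real world.

AI abstract (translated from Chinese)

The ability to push large objects toward a goal using onboard egocentric perception is a key skill for humanoid robots performing complex tasks such as material handling in warehouses. To robustly manipulate heavy objects to arbitrary goal configurations, the robot must cope with unknown object mass, ground friction, noisy perception, and actuation errors, all within a real-time feedback loop. Existing solutions either rely on privileged object-state information without onboard perception or lack robustness to changes in goal configuration and object physical properties. This paper presents VOFA, a visual goal-conditioned humanoid loco-manipulation system that can push objects with unknown physical properties to arbitrary goal positions. VOFA is a two-level hierarchical architecture with a high-level visuomotor policy and a low-level force-adaptive whole-body controller. The high-level policy processes noisy onboard observations and generates goal-conditioned commands, operating in closed loop across diverse object-goal configurations, while the low-level whole-body controller provides robustness to variations in object physical properties. VOFA is evaluated extensively in simulation and real-world experiments on the Booster T1 humanoid robot. The results show strong performance, with success rates above 90% in simulation and above 80% in real-world trials. In addition, VOFA pushes objects weighing up to 17 kg, more than half the body weight of the Booster T1.

English abstract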

The ability to push large objects in a goal-directed manner using onboard egocentric perception is an essential skill for humanoid robots to perform complex tasks such as material handling in warehouses. To robustly manipulate heavy objects to arbitrary goal configurations, the robot must cope with unknown object mass and ground friction, noisy onboard perception, and actuation errors, all in a real-time feedback loop. Existing solutions either rely on privileged object-state information without onboard perception or lack robustness to variations in goal configurations and object physical properties. In this work, we present VOFA, a visual goal-conditioned humanoid loco-manipulation system capable of pushing objects with unknown physical properties to arbitrary goal positions. VOFA consists of a two-level hierarchical architecture with a high-level visuomotor policy and a low-level force-adaptive whole-body controller. The high-level policy processes noisy onboard observations and generates goal-conditioned commands to operate in closed loop across diverse object-goal configurations, while the low-level whole-body controller provides robustness to variations in object physical properties. VOFA is extensively evaluated in both simulation and real-world experiments on the Booster T1 humanoid robot. Our results demonstrate strong performance, achieving over 90% success in simulation and over 80% success in real-world trials. Moreover, VOFA successfully pushes objects weighing up to 17 kg, exceeding half of the Booster T1's body weight.

2605.01355 2026-05-08 cs.CV cs.AI

AgriKD: Cross-Architecture Knowledge Distillation for Efficient Leaf Disease Classification

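Minh-Dung Le, Minh-Duc Hoang, Hoang-Vu Truong, Thi-Thu-Hong Phan

Institutions: AIT laboratory, Faculty of Artificial Intelligence, FPT University

AI summary: This paper proposes AgriKD, a framework that distills knowledge from a Vision Transformer into a lightweight convolutional model for efficient edge deployment, improving efficiency while reducing parameters and computational cost.

Comments: 47 pages, 14 figures

AI abstract (translated from Chinese)

Automated leaf disease classification is essential for early disease detection in resource-constrained field environments. Vision Transformers (ViTs) offer strong representational power by modeling long-range dependencies and inter-class relationships, but their high computational cost makes them hard to deploy on edge devices. This paper proposes AgriKD, a cross-architecture knowledge distillation framework for efficient edge deployment that transfers knowledge from a Vision Transformer teacher to a compact convolutional student. To bridge the representational gap between Transformer and CNN architectures, the method combines several distillation objectives at the output, feature, and relational levels, each capturing a different aspect of the teacher's knowledge. This lets the student better preserve and use Transformer-derived global representations. Experiments on multiple leaf disease datasets show that the distilled student matches the teacher's performance while substantially improving efficiency, cutting model parameters by about 172 times, computational cost by 47.57 times, and inference latency by 18-22 times. The optimized model is also deployed in several runtime formats, including ONNX, TFLite Float16, and TensorRT FP16, with consistent predictive performance and negligible accuracy loss. Real-world deployment on NVIDIA Jetson edge devices and a mobile application demonstrates reliable real-time inference, highlighting the practicality of AgriKD for AI-powered agricultural applications in resource-constrained environments.

English abstract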

Automated leaf disease classification is critical for early disease detection in resource-constrained field environments. Vision Transformers (ViTs) provide strong representation capability by modeling long-range dependencies and inter-class relationships; however, their high computational cost makes them impractical for deployment on edge devices. As a result, existing approaches struggle to effectively transfer these rich representations to lightweight models. This paper introduces AgriKD, a cross-architecture knowledge distillation framework for efficient edge deployment, which transfers knowledge from a Vision Transformer (ViT) teacher to a compact convolutional student model. To bridge the representational gap between Transformer and CNN architectures, the proposed approach integrates multiple distillation objectives at the output, feature, and relational levels, where each objective captures a different aspect of the teacher knowledge. This enables the student model to better preserve and utilize transformer-derived global representations. Experiments on multiple leaf disease datasets show that the distilled student achieves performance comparable to the teacher while significantly improving efficiency, reducing model parameters by approximately 172 times, computational cost by 47.57 times, and inference latency by 18-22 times. Furthermore, the optimized model is deployed across multiple runtime formats, including ONNX, TFLite Float16, and TensorRT FP16, achieving consistent predictive performance with negligible accuracy degradation. Real-world deployment on NVIDIA Jetson edge devices and a mobile application demonstrates reliable real-time inference, highlighting the practicality of AgriKD for AI-powered agricultural applications in resource-constrained environments.
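Of the three distillation levels named in the abstract, the output level is the easiest to illustrate. The sketch below shows the widely used temperature-softened KL objective as a stand-in; it is an assumption that AgriKD's output-level term takes this form, and the feature-level and relational-level terms are not shown. The function names and the default temperature are hypothetical.

```python
import numpy as np


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over the last axis at a given temperature."""
    scaled = logits / temperature
    scaled = scaled - scaled.max(axis=-1, keepdims=True)
    exp = np.exp(scaled)
    return exp / exp.sum(axis=-1, keepdims=True)


def output_distillation_loss(student_logits, teacher_logits, temperature=4.0):
    """Mean KL(teacher || student) on softened class probabilities, scaled by T^2."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = np.sum(p_teacher * (np.log(p_teacher) - np.log(p_student)), axis=-1)
    return float(temperature**2 * kl.mean())


teacher = np.array([[2.0, 0.5, -1.0]])   # e.g. logits from a ViT teacher
student = np.array([[0.0, 0.0, 0.0]])    # an untrained CNN student
print(output_distillation_loss(student, teacher))
```

In practice a term like this is usually combined with the ordinary cross-entropy on ground-truth labels; the weighting between the two is a tuning choice not specified here.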

2605.01327 2026-05-08 cs.AI cs.LG

Segment-Aligned Policy Optimization for Multi-Modal Reasoning

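Lei Gao, Zhuoming Li, Mengxi Jia, Jiakang Yuan, Hongbo Sun, Hao Sun, Xuelong Li

Institutions: Fudan University; Southeast University; China Telecom Artificial Intelligence Technology (Beijing) Co., Ltd.; Institute of Artificial Intelligence, China Telecom

AI summary: This paper proposes SAPO, which treats reasoning steps, rather than tokens or full sequences, as the basic unit of policy updates to improve accuracy and stability on multi-modal reasoning tasks.

AI abstract (translated from Chinese)

Existing reinforcement learning methods for large language models typically perform policy optimization at the granularity of tokens or full response sequences. However, this formulation often conflicts with the natural step-wise structure of reasoning, leading to suboptimal credit assignment and unstable training. We therefore propose Segment-Aligned Policy Optimization (SAPO), a new reinforcement learning paradigm that treats coherent reasoning steps, rather than tokens or full sequences, as the basic unit of policy updates. SAPO introduces a step-wise Markov decision process abstraction over reasoning segments, together with segment-level value estimation, advantage computation, and importance sampling mechanisms that are semantically aligned with reasoning boundaries. Experiments on representative reasoning benchmarks show that SAPO improves accuracy significantly while exhibiting better training stability and value-estimation consistency. Our work highlights the importance of aligning reinforcement learning updates with the intrinsic structure of reasoning and paves the way for more efficient, semantically grounded policy optimization in complex reasoning tasks. Code and models will be released to ensure full reproducibility.

English abstract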

Existing reinforcement learning approaches for Large Language Models typically perform policy optimization at the granularity of individual tokens or entire response sequences. However, such formulations often misalign with the natural step-wise structure of reasoning processes, leading to suboptimal credit assignment and unstable training in multi-modal reasoning tasks. To bridge this gap, we propose Segment-Aligned Policy Optimization (SAPO), a novel reinforcement learning paradigm that treats coherent reasoning steps, rather than tokens or full sequences, as fundamental units of policy update. SAPO introduces a step-wise Markov decision process abstraction over reasoning segments, accompanied by segment-level value estimation, advantage computation, and importance sampling mechanisms that are semantically aligned with reasoning boundaries. Experiments on representative reasoning benchmarks demonstrate that SAPO consistently outperforms token-level and sequence-level policy optimization methods, achieving significant accuracy improvements while exhibiting better training stability and value estimation consistency. Our work underscores the importance of aligning reinforcement learning updates with the intrinsic structure of reasoning, paving the way for more efficient and semantically grounded policy optimization in complex reasoning tasks. Code and models will be released to ensure full reproducibility.
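One way to picture segment-level importance sampling is to aggregate token log-probability differences within each reasoning segment and apply a PPO-style clip per segment. The sketch below is a guess at the general shape only: the abstract does not give SAPO's formulas, so the ratio definition, the clipping rule, the function names, and the way segment boundaries are supplied are all assumptions.

```python
import numpy as np


def segment_ratios(new_logprobs, old_logprobs, segment_ends):
    """Importance ratio per segment: exp of the summed token log-prob differences.

    `segment_ends` holds exclusive end indices, e.g. [1, 3] splits three tokens
    into the segments [0:1] and [1:3].
    """
    diffs = np.asarray(new_logprobs) - np.asarray(old_logprobs)
    starts = [0] + list(segment_ends[:-1])
    return np.array([np.exp(diffs[s:e].sum()) for s, e in zip(starts, segment_ends)])


def clipped_segment_objective(ratios, advantages, clip=0.2):
    """PPO-style clipped surrogate with one term per segment (to be maximized)."""
    advantages = np.asarray(advantages)
    clipped = np.clip(ratios, 1.0 - clip, 1.0 + clip)
    return float(np.minimum(ratios * advantages, clipped * advantages).mean())


# Three tokens grouped into two reasoning segments; only the first token's
# probability changed (it doubled) between the old and new policy.
ratios = segment_ratios([np.log(2.0), 0.0, 0.0], [0.0, 0.0, 0.0], [1, 3])
print(ratios, clipped_segment_objective(ratios, [1.0, 1.0]))
```

Compared with a token-level objective, each segment here receives a single ratio and a single advantage, so credit is assigned to a whole reasoning step at once.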

2605.01203 2026-05-08 cs.AI cs.CL

GR-Ben: A General Reasoning Benchmark for Evaluating Process Reward Models

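Zhouhao Sun, Xuan Zhang, Xiao Ding, Bibo Cai, Li Du, Kai Xiong, Xinran Dai, Fei Zhang, Weidi Tang, Zhiyuan Kan, Yang Zhao, Bing Qin, Ting Liu

Institutions: Research Center for Social Computing; Interactive Robotics, Harbin Institute of Technology, China; Beijing Academy of Artificial Intelligence, Beijing, China

AI summary: GR-Ben evaluates the error-detection ability of process reward models across two domains, science and logic, and nine subdomains, and reveals the weaknesses of existing models outside mathematics and in detecting computational errors.

AI abstract (translated from Chinese)

Process reward models (PRMs) have shown considerable potential for test-time scaling. Because large language models (LLMs) often generate flawed intermediate reasoning steps when handling a wide range of reasoning and decision-making tasks, PRMs need to detect process-level errors in real-world scenarios. However, existing benchmarks focus mainly on mathematical reasoning and cannot comprehensively evaluate the error-detection ability of PRMs across diverse reasoning scenarios. To close this gap, we introduce GR-Ben, a process-level benchmark designed to evaluate PRM performance across two domains (science and logic) and nine subdomains. We run extensive experiments on 22 diverse models, covering both PRMs and LLMs, and reach two key findings: (1) outside mathematical reasoning, the error-detection ability of existing PRMs and LLMs is markedly weaker; (2) overall, PRMs are weaker at identifying knowledge-based errors, whereas LLMs perform worse at detecting computational errors. We hope GR-Ben will encourage future research on PRMs for general domains and thereby improve the reasoning capabilities of LLMs.

English abstract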

Process reward models (PRMs) have exhibited remarkable potential for test-time scaling. Since large language models (LLMs) regularly generate flawed intermediate reasoning steps when tackling a broad spectrum of reasoning and decision-making tasks, PRMs are required to detect process-level errors in real-world scenarios. However, existing benchmarks primarily focus on mathematical reasoning, thereby failing to comprehensively evaluate the error-detection ability of PRMs across diverse reasoning scenarios. To close this gap, we introduce GR-Ben, a process-level benchmark specifically designed for assessing PRMs' performance across two primary reasoning domains (science and logic) and nine subdomains. We conduct extensive experiments on a diverse set of 22 models, encompassing both PRMs and LLMs, and derive two key findings: (1) In domains beyond mathematical reasoning, the error-detection ability of existing PRMs and LLMs is markedly weaker. (2) In general, PRMs are less adept at identifying knowledge-based errors, whereas LLMs exhibit poorer performance in detecting computational errors. We hope GR-Ben can foster future research on PRMs for general domains, thereby enhancing the reasoning capabilities of LLMs.

2605.01120 2026-05-08 cs.AI math.CO

New Bounds for Zarankiewicz Numbers via Reinforced LLM Evolutionary Search

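Jay Bhan, Nicole Nobili, Patrick Langer

Institutions: Massachusetts Institute of Technology; ETH Zürich; Stanford University

AI summary: This paper determines the exact values of three Zarankiewicz numbers for the first time and establishes lower bounds for 41 more using reinforced LLM evolutionary search, showing the potential of LLM-guided evolutionary search in mathematical research.

Comments: *Jay Bhan and Nicole Nobili contributed equally to this work as first authors, and their order was determined via coin flip

AI abstract (translated from Chinese)

The Zarankiewicz number Z(m, n, s, t) is the maximum number of edges in a bipartite graph G_{m,n} that contains no complete K_{s,t} subgraph. This paper determines for the first time the exact values Z(11, 21, 3, 3)=116, Z(11, 22, 3, 3)=121, and Z(12, 22, 3, 3)=132, and establishes lower bounds for 41 Zarankiewicz numbers. The results are obtained with OpenEvolve, an open-source evolutionary algorithm based on large language models (LLMs) that iteratively improves algorithms for generating mathematical constructions by optimizing a tailored reward signal. The paper also shows the potential of LLM-guided evolutionary search for discovering new combinatorial constructions, and it reports the generation algorithms, implementation details, and computational costs, which stay below 30 US dollars, showing that the approach is inexpensive and reproducible.

English abstract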

The Zarankiewicz number $\textbf{Z}(m, n, s, t)$ is the maximum number of edges in a bipartite graph $G_{m, n}$ such that there is no complete $K_{s, t}$ bipartite subgraph. We determine for the first time the exact values of three Zarankiewicz numbers: $\textbf{Z}(11, 21, 3, 3)=116$, $\textbf{Z}(11, 22, 3, 3)=121$, and $\textbf{Z}(12, 22, 3, 3)=132$. We further establish lower bounds for 41 more Zarankiewicz numbers, including several that are within one edge of the best known upper bound, and we match the established value in four more closed cases. Our results are obtained using OpenEvolve, an open-source evolutionary algorithm based on Large Language Models (LLMs) that iteratively improves algorithms for generating mathematical constructions by optimizing a reward signal which we tailored for this specific problem. These findings provide new extremal graph constructions and demonstrate the potential of LLM-guided evolutionary search to contribute to mathematical research. In addition to presenting the resulting constructions, we report the generation algorithms produced, describe the relevant implementation details, and provide our computational costs. Our costs are remarkably low, at less than \$30 for each Zarankiewicz parameter combination, showing that LLM-guided evolutionary search can be an inexpensive, reproducible, and accessible tool for discovering new combinatorial constructions.
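A lower bound on Z(m, n, s, t) is certified by exhibiting an m-by-n 0/1 biadjacency matrix that is K_{s,t}-free, and that check is easy to state in code. The brute-force verifier below is a sketch, not the paper's tooling: the function names are hypothetical, and it assumes the convention that the s vertices of K_{s,t} lie on the row side (for s = t, as in the paper's cases, the side does not matter).

```python
from itertools import combinations


def is_kst_free(adjacency, s, t):
    """True if no choice of s rows has t or more columns adjacent to all of them."""
    n_cols = len(adjacency[0])
    for rows in combinations(adjacency, s):
        common = sum(all(row[c] for row in rows) for c in range(n_cols))
        if common >= t:
            return False
    return True


def edge_count(adjacency):
    """Number of edges, i.e. the number of ones in the biadjacency matrix."""
    return sum(map(sum, adjacency))


# A 2x2 example: three edges avoid K_{2,2}, four edges do not, so Z(2, 2, 2, 2) = 3.
three_edges = [[1, 1], [1, 0]]
four_edges = [[1, 1], [1, 1]]
print(is_kst_free(three_edges, 2, 2), edge_count(three_edges))
print(is_kst_free(four_edges, 2, 2), edge_count(four_edges))
```

A verifier of this kind only proves lower bounds; matching upper bounds, and hence exact values, need a separate argument.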

2605.00847 2026-05-08 cs.CL cs.AI cs.LG

H-Probes: Extracting Hierarchical Structures From Latent Representations of Language Models

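Cutter Dawes, Aryan Sharma, Angelos Ioannis Lagos, Shivam Raval

Institutions: Supervised Program for Alignment Research; Yale University; Harvard University

AI summary: H-Probes uses linear probes to extract hierarchical structure from the latent representations of language models, showing that models represent hierarchy in syntax, concepts, and the reasoning process.

AI abstract (translated from Chinese)

Representing and navigating hierarchy is a fundamental primitive of reasoning. Large language models perform well on a variety of tasks that require hierarchical reasoning, but there has been limited analysis of how they geometrically represent the latent constructions this requires. To this end, we develop H-probes, a set of linear probes that extract hierarchical structure, specifically depth and pairwise distance, from latent representations. On synthetic tree traversal tasks, H-probes robustly find the subspaces that contain the hierarchical structure needed to complete the task. In comprehensive ablation experiments, we find that these subspaces are low-dimensional, causally important for high task performance, and generalize both in-domain and out-of-domain. We also find analogous but weaker hierarchical structure in real-world hierarchical contexts such as mathematical reasoning traces. These results show that models represent hierarchy not only at the level of syntax and concepts but also at deeper levels of abstraction, including the reasoning process itself.

English abstract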

Representing and navigating hierarchy is a fundamental primitive of reasoning. Large language models have demonstrated proficiency in a wide variety of tasks requiring hierarchical reasoning, but there exists limited analysis on how the models geometrically represent the necessary latent constructions for such thinking. To this end, we develop H-probes, a collection of linear probes that extract hierarchical structure, specifically depth and pairwise distance, from latent representations. In synthetic tree traversal tasks, the H-probes robustly find the subspaces containing hierarchical structure necessary to complete the tasks; furthermore, in comprehensive ablation experiments, we show that these hierarchy-containing subspaces are low-dimensional, causally important for high task performance, and generalize within- and out-of-domain. We also find analogous, though weaker, hierarchical structure in real-world hierarchical contexts such as mathematical reasoning traces. These results demonstrate that models represent hierarchy not only at the level of syntax and concepts, but also at deeper levels of abstraction, including the reasoning process itself.
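A linear probe for depth amounts to a regression from hidden states to a scalar. The sketch below fits one by least squares on synthetic data; it illustrates the general technique only and makes several assumptions: the hidden states are simulated rather than taken from a language model, depth is encoded along a single made-up direction, and the function name is hypothetical.

```python
import numpy as np


def fit_linear_probe(hidden_states, targets):
    """Least-squares probe: targets ~= hidden_states @ weights + bias."""
    ones = np.ones((hidden_states.shape[0], 1))
    design = np.hstack([hidden_states, ones])
    solution, *_ = np.linalg.lstsq(design, targets, rcond=None)
    return solution[:-1], solution[-1]


rng = np.random.default_rng(0)
depth = rng.integers(0, 5, size=200).astype(float)   # synthetic node depths
direction = rng.normal(size=16)                       # hypothetical depth direction
hidden = np.outer(depth, direction) + 0.01 * rng.normal(size=(200, 16))

weights, bias = fit_linear_probe(hidden, depth)
predicted = hidden @ weights + bias
print(float(np.mean((predicted - depth) ** 2)))       # training mean squared error
```

A real study would evaluate the probe on held-out nodes and compare against control tasks, since a low training error by itself does not show that the model uses the probed direction.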

2605.00742 2026-05-08 cs.AI cs.LG stat.ML

Position: agentic AI orchestration should be Bayes-consistent

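Theodore Papamarkou, Pierre Alquier, Matthias Bauer, Wray Buntine, Andrew Davison, Gintare Karolina Dziugaite, Maurizio Filippone, Andrew Y. K. Foong, Vincent Fortuin, Dimitris Fouskakis, Jes Frellsen, Eyke Hüllermeier, Theofanis Karaletsos, Mohammad Emtiyaz Khan, Nikita Kotelevskii, Salem Lahlou, Yingzhen Li, Fang Liu, Clare Lyle, Thomas Möllenhoff, Konstantina Palla, Maxim Panov, Yusuf Sale, Kajetan Schweighofer, Artem Shelmanov, Siddharth Swaroop, Martin Trapp, Willem Waegeman, Andrew Gordon Wilson, Alexey Zaytsev

Institutions: ESSEC Business School; VinUniversity; Imperial College London; Mila - Quebec AI Institute; Technical University of Denmark; Pyramidal Inc.; University of Notre Dame; Cognizant AI Lab; University College London; KTH Royal Institute of Technology; Ghent University; New York University

AI summary: This paper examines the use of Bayesian principles at the orchestration layer of agentic AI systems, rather than in LLM parameters, to improve decision coherence and collaboration efficiency.

Comments: Accepted for publication at ICML 2026

AI abstract (translated from Chinese)

LLMs excel at predictive tasks and complex reasoning tasks, but many high-value deployments depend on decisions under uncertainty, such as choosing a tool, consulting an expert, or allocating resources. Although the usefulness and feasibility of Bayesian methods for LLM inference remain unclear, this paper argues that the control layer of an agentic AI system, which orchestrates LLMs and tools, is a clear case where Bayesian principles should apply. Bayesian decision theory gives agentic systems a framework for maintaining beliefs over task-relevant latent quantities, updating those beliefs from observed agentic and human-AI interactions, and choosing actions. Making LLMs themselves explicit Bayesian belief-updating engines remains computationally expensive and conceptually difficult as a general modeling target. Instead, this paper argues that coherent decision-making requires Bayesian principles at the orchestration layer of the agentic system, not necessarily at the level of LLM parameters. It describes practical properties of Bayesian control suited to modern agentic AI systems and human-AI collaboration, and it provides concrete examples and design patterns showing how calibrated beliefs and utility-aware policies can improve agentic AI orchestration.

English abstract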

LLMs excel at predictive tasks and complex reasoning tasks, but many high-value deployments rely on decisions under uncertainty, for example, which tool to call, which expert to consult, or how many resources to invest. While the usefulness and feasibility of Bayesian approaches remain unclear for LLM inference, this position paper argues that the control layer of an agentic AI system (that orchestrates LLMs and tools) is a clear case where Bayesian principles should shine. Bayesian decision theory provides a framework for agentic systems that can help to maintain beliefs over task-relevant latent quantities, to update these beliefs from observed agentic and human-AI interactions, and to choose actions. Making LLMs themselves explicitly Bayesian belief-updating engines remains computationally intensive and conceptually nontrivial as a general modeling target. In contrast, this paper argues that coherent decision-making requires Bayesian principles at the orchestration level of the agentic system, not necessarily the LLM agent parameters. This paper articulates practical properties for Bayesian control that fit modern agentic AI systems and human-AI collaboration, and provides concrete examples and design patterns to illustrate how calibrated beliefs and utility-aware policies can improve agentic AI orchestration.
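The loop of maintaining beliefs, updating them from interactions, and choosing actions by utility can be illustrated with a Beta-Bernoulli model over tool reliability. This is a toy sketch and not a design pattern from the paper: the `ToolBelief` class, the tool names, the costs, and the reward of 1.0 are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ToolBelief:
    """Beta(successes, failures) belief over the probability that a tool call helps."""
    name: str
    cost: float
    successes: float = 1.0   # prior pseudo-count (alpha)
    failures: float = 1.0    # prior pseudo-count (beta)

    def success_probability(self) -> float:
        # Posterior mean of the Beta distribution.
        return self.successes / (self.successes + self.failures)

    def update(self, succeeded: bool) -> None:
        # Conjugate Bayesian update after observing one outcome.
        if succeeded:
            self.successes += 1.0
        else:
            self.failures += 1.0


def choose_tool(tools, reward=1.0):
    """Pick the tool with the highest expected utility: reward * P(success) - cost."""
    return max(tools, key=lambda t: reward * t.success_probability() - t.cost)


tools = [ToolBelief("search", cost=0.1), ToolBelief("expert", cost=0.3)]
first_choice = choose_tool(tools).name      # cheap tool wins under the flat prior
for _ in range(3):
    tools[0].update(succeeded=False)        # the cheap tool keeps failing
second_choice = choose_tool(tools).name     # belief shifts, so the choice changes
print(first_choice, second_choice)
```

The decision rule stays outside the LLM: only the orchestrator tracks the counts and utilities, which is the separation the abstract argues for.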

2605.00292 2026-05-08 cs.LG cs.AI

Caracal: Causal Architecture via Spectral Mixing

Bingzheng Gan, Tianyi Zhang, Yusu Li, Jing Huang, Wei Shi, Yangkai Ding, Tao Yu

Institutions: Huawei Technologies Co., Ltd.

AI summary: Caracal replaces conventional attention with spectral mixing to address the computational and positional-encoding limitations of long-sequence modeling, providing an efficient and scalable solution.

Comments: Accepted by ICML 2026

English abstract

The scalability of Large Language Models to long sequences is hindered by the quadratic cost of attention and the limitations of positional encodings. To address these, we introduce Caracal, a novel architecture that replaces attention with a parameter-efficient, O(L log(L)) Multi-Head Fourier (MHF) module. Our contributions are threefold: (1) We leverage the Fast Fourier Transform (FFT) for sequence mixing, inherently addressing both bottlenecks mentioned above. (2) We apply a frequency-domain causal masking technique that enforces autoregressive capabilities via asymmetric padding and truncation, overcoming a critical barrier for Fourier-based generative models. (3) Unlike efficient models relying on hardware-specific implementations (e.g., Mamba), we use standard library operators. This ensures robust portability, eliminating common deployment barriers. Evaluations demonstrate that Caracal performs competitively with Transformer and SSM baselines, offering a scalable and simple pathway for efficient long-sequence modeling. Code is available in the Appendix.
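The general trick of making FFT-based mixing causal through padding and truncation can be shown with a textbook causal convolution. The sketch below is not the Multi-Head Fourier module: it assumes a single channel, a fixed kernel no longer than the sequence, and a hypothetical function name, and it uses only standard NumPy FFT operators.

```python
import numpy as np


def causal_fft_mix(x, kernel):
    """Causal convolution y[t] = sum_{j<=t} kernel[j] * x[t-j] in O(L log L).

    Zero-padding to twice the sequence length keeps the circular wrap-around
    of the FFT out of the first L outputs; truncation then drops the tail.
    """
    length = x.shape[0]
    padded = 2 * length
    spectrum = np.fft.rfft(x, n=padded) * np.fft.rfft(kernel, n=padded)
    return np.fft.irfft(spectrum, n=padded)[:length]


x = np.arange(1.0, 9.0)                 # a length-8 toy sequence
kernel = np.array([0.5, 0.25, 0.125])   # hypothetical mixing weights
y = causal_fft_mix(x, kernel)

# Causality check: changing the last input must not change earlier outputs.
x_future = x.copy()
x_future[-1] += 100.0
y_future = causal_fft_mix(x_future, kernel)
print(np.allclose(y[:-1], y_future[:-1]))
```

Without the padding, the FFT product would be a circular convolution and late tokens would leak into early positions, which is the barrier for autoregressive use that the abstract mentions.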

2604.27644 2026-05-08 cs.LG cs.AI cs.PL

ANCORA: Learning to Question via Manifold-Anchored Self-Play for Verifiable Reasoning

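Chengcao Yang

Affiliations * Wuhan University

AI summary ANCORA learns to question through manifold-anchored self-play, using three mechanisms to generate verifiable problems and improve itself; experiments show that it clearly outperforms PSV self-play on Verus.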

Comments v2: Updated abstract; strengthened the proof of Proposition 4.1; corrected minor typos; corrected author list

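Details
AI abstract (translated from Chinese)

We propose a paradigm shift toward open-ended curriculum self-play: rather than learning to answer on a fixed prompt set, a unified policy learns to question, generating verifiable problems, solving them, and turning verifier feedback into self-improvement without human-annotated solutions. We introduce ANCORA, in which the policy alternates between a Proposer and a Solver, anchored by three load-bearing mechanisms: a two-level group-relative update that couples Proposer advantages across specifications with Solver advantages across solution attempts; iterative self-distilled SFT that projects the base model onto its valid-output manifold before RL; and a UCB-guided Curriculum DAG whose policy-induced problem set can provably expand under self-composition. Without these stabilizers, sparse verifier feedback drives Proposer collapse even under MLRL-aligned rewards; with them, ANCORA bootstraps a verifiable curriculum from zero human solutions. In Verus, ANCORA lifts Dafny2Verus pass@1 from a 26.6% SFT baseline to 81.5% in test-time training (TTT, 0-shot), outperforming PSV self-play despite PSV's 1-shot inference; in a transfer setting, training from Dafny2Verus seeds yields 36.2% and 17.2% pass@1 on held-out MBPP and HumanEval.

English abstract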

We propose a paradigm shift toward open-ended curriculum self-play: rather than learning to answer on a fixed prompt set, a unified policy learns to question: generating verifiable problems, solving them, and turning verifier feedback into self-improvement without human-annotated solutions. We introduce ANCORA, in which the policy alternates between a Proposer that synthesizes novel specifications and a Solver that produces verified solutions, anchored by three load-bearing mechanisms: a two-level group-relative update coupling Proposer advantages across specifications with Solver advantages across solution attempts; iterative self-distilled SFT projecting the base model onto its valid-output manifold before RL; and a UCB-guided Curriculum DAG whose policy-induced problem set can provably expand under self-composition. Without these stabilizers, sparse verifier feedback drives Proposer collapse even under MLRL-aligned rewards; with them, ANCORA bootstraps a verifiable curriculum from zero human solutions. Instantiated in Verus, ANCORA lifts Dafny2Verus pass@1 from a 26.6% SFT baseline to 81.5% in test-time training (TTT, 0-shot), outperforming PSV self-play by 15.8 points despite PSV's 1-shot inference; in a transfer setting, training from Dafny2Verus seeds yields 36.2% and 17.2% pass@1 on held-out MBPP and HumanEval.
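
To make the "two-level group-relative update" concrete, here is a minimal sketch of group-relative advantages computed at both levels. The normalization (reward minus group mean, divided by group standard deviation) is the usual group-relative form; the Proposer reward that peaks at a 50% pass rate is a hypothetical stand-in, since the abstract does not define ANCORA's actual reward.

```python
from statistics import mean, pstdev


def group_relative(rewards, eps=1e-8):
    """Advantage of each sample relative to its own group: (r - mean) / std."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Verifier outcomes (1 = verified, 0 = rejected) for several Solver attempts
# on each proposed specification. All values are illustrative.
attempts = {
    "spec_a": [1, 1, 1, 1],  # always solved: no learning signal for the Solver
    "spec_b": [1, 0, 1, 0],  # sometimes solved: informative
    "spec_c": [0, 0, 0, 0],  # never solved
}

# Level 1: Solver advantages across attempts within one specification.
solver_adv = {spec: group_relative(r) for spec, r in attempts.items()}


def proposer_reward(results):
    # Hypothetical reward (an assumption): highest when half the attempts pass.
    p = mean(results)
    return 4.0 * p * (1.0 - p)


# Level 2: Proposer advantages across the specifications it generated.
proposer_adv = dict(
    zip(attempts, group_relative([proposer_reward(r) for r in attempts.values()]))
)
```

Under this assumed reward, specifications that are trivially easy or unsolvable receive negative Proposer advantage, which is one plausible way to discourage the collapse the abstract mentions.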

2604.26799 2026-05-08 cs.CV cs.GR cs.MM

MesonGS++: Post-training Compression of 3D Gaussian Splatting with Hyperparameter Searching

Shuzhao Xie, Junchen Ge, Weixiang Zhang, Jiahang Liu, Chen Tang, Yunpeng Bai, Shijia Ge, Jingyan Jiang, Yuzhi Huang, Fengnian Yang, Cong Zhang, Xiaoyi Fan, Zhi Wang

Affiliations * Shenzhen International Graduate School, Tsinghua University; The Hong Kong University of Science and Technology; Harbin Institute of Technology; The Chinese University of Hong Kong; The University of Texas at Austin; Shenzhen Technology University; Xiamen University; Simon Fraser University; Jiangxing Intelligence Inc.

AI summary MesonGS++ is a post-training compression method for 3D Gaussian Splatting built on hyperparameter search. Using joint importance-based pruning, octree geometry coding, and related techniques, it achieves over 34× compression while preserving rendering quality, outperforming existing methods and accurately meeting target storage budgets.

Comments https://github.com/mmlab-sigs/mesongs_plus

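Details
AI abstract (translated from Chinese)

3D Gaussian Splatting (3DGS) achieves high-quality novel view synthesis with real-time rendering, but its storage cost remains a challenge for practical deployment. Existing post-training compression methods rely on many coupled hyperparameters, making it difficult to control the final compressed size and fully exploit the rate-distortion trade-off. We propose MesonGS++, a size-aware post-training codec that combines joint importance-based pruning, octree geometry coding, attribute transformation, selective vector quantization, and group-wise mixed-precision quantization. On the configuration side, it treats the reserve ratio and bit-width allocation as the dominant rate-distortion knobs and jointly optimizes them via discrete sampling and 0-1 integer linear programming. We further propose a linear size estimator and a CUDA parallel quantization operator to accelerate the hyperparameter search. Experiments show that MesonGS++ achieves over 34× compression while preserving rendering fidelity, outperforming existing methods and accurately meeting target storage budgets. Notably, without any training, MesonGS++ can even surpass the PSNR of vanilla 3DGS at a 20× compression rate on the Stump scene. Our code is available at https://github.com/mmlab-sigs/mesongs_plus

English abstract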

3D Gaussian Splatting (3DGS) achieves high-quality novel view synthesis with real-time rendering, but its storage cost remains prohibitive for practical deployment. Existing post-training compression methods still rely on many coupled hyperparameters across pruning, transformation, quantization, and entropy coding, making it difficult to control the final compressed size and fully exploit the rate-distortion trade-off. We propose MesonGS++, a size-aware post-training codec for 3D Gaussian compression. On the codec side, MesonGS++ combines joint importance-based pruning, octree geometry coding, attribute transformation, selective vector quantization for higher-degree spherical harmonics, and group-wise mixed-precision quantization with entropy coding. On the configuration side, it treats the reserve ratio and bit-width allocation as the dominant rate-distortion knobs and jointly optimizes them under a target storage budget via discrete sampling and 0-1 integer linear programming. We further propose a linear size estimator and a CUDA parallel quantization operator to accelerate the hyperparameter searching process. Extensive experiments show that MesonGS++ achieves over 34× compression while preserving rendering fidelity, outperforming state-of-the-art post-training methods and accurately meeting target size budgets. Remarkably, without any training, MesonGS++ can even surpass the PSNR of vanilla 3DGS at a 20× compression rate on the Stump scene. Our code is available at https://github.com/mmlab-sigs/mesongs_plus
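
Choosing one bit-width per attribute group under a storage budget is a small multiple-choice 0-1 program. The sketch below solves a toy instance by exhaustive enumeration rather than with an ILP solver; the group sizes and distortion values are made up for illustration, and the function name is hypothetical rather than taken from the MesonGS++ code.

```python
from itertools import product


def allocate_bits(groups, budget_bits):
    """Exhaustively solve a small multiple-choice 0-1 program.

    groups: list of dicts mapping bit-width -> (size_in_bits, distortion).
    Exactly one bit-width is chosen per group, and the total size must fit
    within budget_bits. Returns (choice, distortion), or (None, inf) if no
    assignment fits.
    """
    best_choice, best_distortion = None, float("inf")
    for choice in product(*(g.keys() for g in groups)):
        size = sum(g[b][0] for g, b in zip(groups, choice))
        distortion = sum(g[b][1] for g, b in zip(groups, choice))
        if size <= budget_bits and distortion < best_distortion:
            best_choice, best_distortion = choice, distortion
    return best_choice, best_distortion


# Two illustrative attribute groups; all numbers are invented.
groups = [
    {4: (400, 0.9), 8: (800, 0.2)},  # sensitive group: 4 bits hurts a lot
    {4: (400, 0.3), 8: (800, 0.1)},  # tolerant group: 4 bits is nearly free
]
choice, distortion = allocate_bits(groups, budget_bits=1200)
```

Enumeration grows exponentially with the number of groups, which is presumably why a real codec would hand the same formulation to an ILP solver.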

2604.26227 2026-05-08 cs.CV

HOI-aware Adaptive Network for Weakly-supervised Action Segmentation

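Runzhong Zhang, Suchen Wang, Yueqi Duan, Yansong Tang, Yue Zhang, Yap-Peng Tan

Affiliations * Nanyang Technological University; Tsinghua University; Beijing Jiaotong University

AI summary This paper proposes the AdaAct network, which uses temporally global but spatially local human-object interactions as video-level prior knowledge and dynamically adapts to the HOI sequence to distinguish ambiguous actions; experiments confirm the method's effectiveness.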

Comments Accepted to IJCAI 2023

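Details
AI abstract (translated from Chinese)

In this paper, we propose an HOI-aware adaptive network named AdaAct for weakly-supervised action segmentation. Existing methods usually learn a fixed network to predict the action of each frame, but this leads to ambiguity when estimating similar actions such as pouring juice and pouring coffee. To address this, we aim to exploit temporally global but spatially local human-object interactions (HOI) as video-level prior knowledge for action segmentation. The long-term HOI sequence provides crucial contextual information for distinguishing ambiguous actions, and our network dynamically adapts to the given HOI sequence at test time. Specifically, we first design a video HOI encoder that extracts, selects, and integrates the most representative HOI in the video. We then propose a two-branch HyperNetwork to learn an adaptive temporal encoder that adjusts its parameters on the fly based on the HOI information of each video. Extensive experiments on two widely used datasets, Breakfast and 50Salads, validate the effectiveness of the method under different evaluation metrics.

English abstract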

In this paper, we propose an HOI-aware adaptive network named AdaAct for weakly-supervised action segmentation. Most existing methods learn a fixed network to predict the action of each frame with the neighboring frames. However, this would result in ambiguity when estimating similar actions, such as pouring juice and pouring coffee. To address this, we aim to exploit temporally global but spatially local human-object interactions (HOI) as video-level prior knowledge for action segmentation. The long-term HOI sequence provides crucial contextual information to distinguish ambiguous actions, where our network dynamically adapts to the given HOI sequence at test time. More specifically, we first design a video HOI encoder that extracts, selects, and integrates the most representative HOI throughout the video. Then, we propose a two-branch HyperNetwork to learn an adaptive temporal encoder, which automatically adjusts the parameters based on the HOI information of various videos on the fly. Extensive experiments on two widely-used datasets including Breakfast and 50Salads demonstrate the effectiveness of our method under different evaluation metrics.

2604.24916 2026-05-08 cs.RO cs.AI

asRoBallet: Closing the Sim2Real Gap via Friction-Aware Reinforcement Learning for Underactuated Spherical Dynamics

Fang Wan, Guangyi Huang, Tianyu Wu, Zishang Zhang, Bangchao Huang, Haoran Sun, Mingdong Chen, Chaoyang Song

AI summary This paper presents asRoBallet, which the authors describe as the first end-to-end reinforcement learning policy deployed on a humanoid ballbot hardware platform, achieving zero-shot Sim2Real transfer through high-fidelity simulation and friction-aware reinforcement learning.

Comments 10 pages, 9 figures, accepted for RSS 2026. For Supplementary Videos, see https://bionicdl.ancorasir.com/?p=2238

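Details
AI abstract (translated from Chinese)

We introduce asRoBallet, to the best of our knowledge the first end-to-end reinforcement learning (RL) locomotion policy deployed on a humanoid ballbot hardware platform. Historically, ballbots have served as a canonical benchmark for underactuated and nonholonomic control, characterized by a reality gap in the complex friction models of wheel-ball-floor interactions. Although current literature demonstrates successful 3D balancing with LQR and MPC, applying RL to actual humanoid ballbot hardware is hindered by critical gaps in contact modeling, actuator latency and jitter, and safe hardware exploration. This study proposes a high-fidelity MuJoCo simulation that explicitly models the discrete roller mechanics of ETH-type omni-wheels, capturing parasitic vibrations and contact discontinuities that were previously ignored. We also develop a Friction-Aware Reinforcement Learning framework that achieves zero-shot Sim2Real transfer by mastering the coupled rolling, lateral, and torsional friction channels at the wheel-ball and ball-floor interfaces. Through subtractive reconfiguration, we repurpose key components from a quadruped robot and integrate them into a newly designed structural frame to obtain a robust, low-cost research platform. We also develop a generalized iOS ecosystem that turns consumer electronics into a low-latency interface, enabling a single operator to direct expressive humanoid maneuvers through intuitive natural motion.

English abstract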

We introduce asRoBallet, to the best of our knowledge, the first end-to-end reinforcement learning (RL) locomotion policy deployed on a humanoid ballbot hardware platform. Historically, ballbots have served as a canonical benchmark for underactuated and nonholonomic control, which are characterized by a reality gap in complex friction models for wheel-ball-floor interactions. While current literature demonstrates successful handling of 3D balancing with LQR and MPC, transitioning to actual hardware for a humanoid ballbot using RL is currently hindered by critical gaps in contact modeling, actuator latency & jitter, and safe hardware exploration. This study proposes a high-fidelity MuJoCo simulation that explicitly models the discrete roller mechanics of ETH-type omni-wheels, thereby capturing parasitic vibrations and contact discontinuities that have previously been ignored. We also developed a Friction-Aware Reinforcement Learning framework that achieves zero-shot Sim2Real transfer by mastering the coupled rolling, lateral, and torsional friction channels at the wheel-ball and ball-floor interfaces. We designed asRoBallet through subtractive reconfiguration, repurposing key components from an overconstrained quadruped and integrating them into a newly designed structural frame to achieve a robust research platform at low cost. We also developed a generalized iOS ecosystem that transforms consumer electronics into a low-latency interface, enabling a single operator to orchestrate expressive humanoid maneuvers via intuitive natural motion.

2604.23045 2026-05-08 cs.LG

A Differentiable Framework for Global Circulation Model Precipitation Bias Correction

Kamlesh Sawadekar, Seth McGinnis, Peijun Li, Kathryn Lawson, Chaopeng Shen

Affiliations * Department of Civil and Environmental Engineering, The Pennsylvania State University; Computational & Information Systems Laboratory, National Center for Atmospheric Research

AI summary This paper proposes the dCLIMBA framework, which learns a spatiotemporally adaptive parametric bias-correction procedure and effectively corrects precipitation biases between general circulation model outputs and observational data, performing particularly well on extreme precipitation.

Comments 27 pages, 8 figures, 3 tables

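Details
AI abstract (translated from Chinese)

Systematic biases limit the direct use of General Circulation Model (GCM) outputs in regional planning, so bias correction is essential for both short-term and long-term impact assessment. Precipitation is particularly difficult to correct because of its non-Gaussian distribution, intermittency, and heavy-tailed extremes. Traditional statistical bias-correction methods struggle to learn systematic patterns from large datasets or to generalize to new locations. Machine learning (ML) offers greater flexibility, but its results can be unpredictable and hard to interpret, which limits generalization across GCMs and locations. This paper proposes a differentiable bias-correction framework, dCLIMBA, that learns a spatiotemporally adaptive parametric bias-correction procedure rather than correcting precipitation directly. Results show that the method effectively corrects the magnitude and distribution of precipitation, with particularly strong performance in the upper tail. The quantile distribution of precipitation is well reproduced across diverse U.S. cities, and spatial patterns are comparable to those of the widely used LOCA2 statistical downscaling product. In addition, the framework shows partial preservation of future trends and promising attenuation of biases in unseen regions. The paper presents a modular and efficient bias-correction approach, and the differentiable formulation offers an easy-to-use option for connecting atmospheric-model outputs to on-the-ground impacts.

English abstract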

Systematic biases in General Circulation Model (GCM) outputs limit their direct applicability in regional planning, making bias correction a technically demanding but necessary step for both short-term and long-term impact assessment. Correcting precipitation is particularly challenging due to its non-Gaussian distribution, intermittent nature, and heavy-tailed extremes. However, traditional statistical bias-correction methods have limited ability to learn systematic patterns from large datasets or generalize to new locations. While machine learning (ML) provides greater flexibility, it can produce unpredictable and difficult-to-interpret results, limiting generalization across GCMs and locations. In this study, we propose a differentiable bias-adjustment framework called dCLIMBA, that learns a spatiotemporally adaptive parametric bias-adjustment procedure, rather than corrected precipitation directly, between historical CMIP6 model outputs and a gridded observation-based dataset, Livneh. Results demonstrate that the proposed method corrects the magnitude and distribution of extreme precipitation with particularly strong performance in the upper tail. The quantile distribution of precipitation was well reproduced across diverse U.S. cities, and spatial patterns were comparable to those from the widely used LOCA2 statistical downscaling product. In addition, the framework showed partial future trend preservation and promising attenuation of marginal biases in unseen regions. This work presents a modular and efficient bias-correction approach. The differentiable approach provides an easy-to-use option for connecting atmospheric-model outputs to on-the-ground impacts.

2604.22056 2026-05-08 cs.LG cs.NI eess.SP

Learning Coverage- and Power-Optimal Transmitter Placement from Building Maps: A Comparative Study of Direct and Indirect Neural Approaches

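Çağkan Yapar

Affiliations * TU Berlin

AI summary This paper studies optimal placement in the single-transmitter setting under a fixed learned propagation model and, by comparing direct and indirect neural approaches, shows how coverage-optimal and power-optimal transmitter locations can be evaluated efficiently at dataset scale.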

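Details
AI abstract (translated from Chinese)

Optimal wireless transmitter placement is a central task in radio-network planning, but exhaustive search becomes prohibitively expensive at scale. This paper studies the single-transmitter setting under a fixed learned propagation model, which makes exhaustive per-pixel evaluation possible at dataset scale in a regime where measurement-based exhaustive labeling is infeasible and ray-tracing-based exhaustive labeling is computationally out of reach. We introduce a dataset of 167,525 urban scenarios (RadioMapSeer-Deployment) with dual ground-truth labels for coverage-optimal and power-optimal transmitter locations. Benchmark analysis reveals an asymmetric coverage-power trade-off: coverage-optimal placement sacrifices 13.86% of received power, whereas power-optimal placement sacrifices only 5.50% of coverage; the best achievable balanced placement lies at a distance of 2.60 from the ideal point (100%, 100%). We evaluate two learning approaches: indirect heatmap-based models that predict received-power radio maps, and direct score-map models that predict the objective landscape over feasible transmitter locations. Within the heatmap family, discriminative models are 1350-2400× faster than exhaustive search, while diffusion models also support multi-sample inference, which improves single-objective performance and, by reusing the same sample pool under a balanced criterion, recovers strong balanced placements without explicit multi-objective training. Dual score-map strategies that combine power and coverage score maps match the exhaustive balanced optimum (2.60) and stay close to it under smaller candidate budgets, with 14-22× speedups including the cost of evaluating candidates.

English abstract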

Optimal wireless transmitter placement is a central task in radio-network planning, yet exhaustive search becomes prohibitively expensive at scale. This paper studies the single-transmitter setting under a fixed learned propagation model, enabling exhaustive per-pixel assessment at dataset scale in a regime where measurement-based exhaustive labeling is infeasible and ray-tracing-based exhaustive labeling is computationally out of reach. We introduce a dataset of 167,525 urban scenarios (RadioMapSeer-Deployment) with dual ground-truth labels for coverage-optimal and power-optimal transmitter locations. Benchmark analysis reveals an asymmetric coverage-power trade-off: coverage-optimal placement sacrifices 13.86% of received power, whereas power-optimal placement sacrifices only 5.50% of coverage; the best achievable balanced placement lies at $\bar{d}=2.60$ from the ideal point (100%, 100%). We evaluate two learning formulations: indirect heatmap-based models predicting received-power radio maps, and direct score-map models predicting the objective landscape over feasible transmitter locations. Within the heatmap family, discriminative models deliver one-shot predictions 1350-2400× faster than exhaustive search, while diffusion models additionally support multi-sample inference that improves single-objective performance and, by reusing the same sample pool under a balanced criterion, recovers strong balanced placements without explicit multi-objective training. Dual score-map strategies that combine power and coverage score maps match the exhaustive balanced optimum ($\bar{d}=2.60$) and remain close to it across smaller candidate budgets, at 14-22× speedups including the cost of evaluating shortlisted candidates.
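
One way to read the "balanced" criterion is as the candidate closest to the ideal point (100%, 100%) in the plane of coverage and received power, each expressed relative to its single-objective optimum. The sketch below assumes a Euclidean distance, which the abstract does not state explicitly; the first two candidates mirror the reported trade-offs and the third is invented.

```python
import math


def balanced_placement(candidates):
    """Return the location closest to the ideal point (100, 100).

    candidates: {location: (coverage_pct, power_pct)}, where each value is a
    percentage of the corresponding single-objective optimum.
    The Euclidean metric is an assumption made for this sketch.
    """
    def distance_to_ideal(location):
        coverage, power = candidates[location]
        return math.hypot(100.0 - coverage, 100.0 - power)

    best = min(candidates, key=distance_to_ideal)
    return best, distance_to_ideal(best)


candidates = {
    (10, 20): (100.0, 86.14),  # coverage-optimal: gives up 13.86% of power
    (40, 35): (94.5, 100.0),   # power-optimal: gives up 5.50% of coverage
    (25, 30): (98.0, 98.5),    # hypothetical compromise location
}
best_location, distance = balanced_placement(candidates)
```

With these numbers the compromise location wins because it sits closer to the ideal corner than either single-objective optimum.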

2604.21137 2026-05-08 cs.CL cs.AI

Enhancing Science Classroom Discourse Analysis through Joint Multi-Task Learning for Reasoning-Component Classification

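Jiho Noh, Mukhesh Raghava Katragadda, Raymond Carl, Soon Lee

Affiliations * Department of Computer Science, Kennesaw State University; Bagwell College of Education, Kennesaw State University

AI summary This paper presents the ADAS system, which uses joint multi-task learning to classify teacher and student utterances by utterance type and reasoning component, addresses label imbalance among minority classes, and makes classroom discourse analysis more efficient.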

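Details
AI abstract (translated from Chinese)

Analyzing students' reasoning patterns in science classrooms is critical for understanding knowledge-construction mechanisms and improving instructional practice, yet manual coding of classroom discourse at scale remains too time-consuming. We present an automated discourse analysis system (ADAS) that jointly classifies teacher and student utterances along two complementary dimensions: Utterance Type and Reasoning Component (based on the prior CDAT framework). To address label imbalance among minority classes, we (1) stratify-resplit the annotated corpus, (2) apply LLM-based synthetic data augmentation targeting minority classes, and (3) train a dual-probe head RoBERTa-base classifier. A zero-shot GPT-5.4 baseline achieves macro-F1 of 0.467 on UT and 0.476 on RC, establishing meaningful upper bounds for prompt-only approaches and motivating fine-tuning. Beyond classification, we conduct discourse pattern analyses, including UTxRC co-occurrence analysis, per-session Cognitive Complexity Index (CCI) computation, lag-sequential analysis, and IRF chain analysis, revealing that teacher Feedback-with-Question (Fq) moves are the most consistent antecedents of student inferential reasoning (SR-I). Our results show that LLM-based augmentation markedly improves UT minority-class recognition and that the structural simplicity of the RC task makes it tractable even for lexical baselines.

English abstract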

Analyzing the reasoning patterns of students in science classrooms is critical for understanding knowledge-construction mechanisms and improving instructional practice to maximize cognitive engagement, yet manual coding of classroom discourse at scale remains prohibitively labor-intensive. We present an automated discourse analysis system (ADAS) that jointly classifies teacher and student utterances along two complementary dimensions, Utterance Type (UT) and Reasoning Component (RC), derived from our prior CDAT framework. To address severe label imbalance among minority classes, we (1) stratify-resplit the annotated corpus, (2) apply LLM-based synthetic data augmentation targeting minority classes, and (3) train a dual-probe head RoBERTa-base classifier. A zero-shot GPT-5.4 baseline achieves macro-F1 of 0.467 on UT and 0.476 on RC, establishing meaningful upper bounds for prompt-only approaches and motivating fine-tuning. Beyond classification, we conduct discourse pattern analyses including UTxRC co-occurrence profiling, Cognitive Complexity Index (CCI) computation per session, lag-sequential analysis, and IRF chain analysis, revealing that teacher Feedback-with-Question (Fq) moves are the most consistent antecedents of student inferential reasoning (SR-I). Our results demonstrate that LLM-based augmentation meaningfully improves UT minority-class recognition, and that the structural simplicity of the RC task makes it tractable even for lexical baselines.

2604.21106 2026-05-08 cs.LG cs.CL

How Much Is One Recurrence Worth? Iso-Depth Scaling Laws for Looped Language Models

Kristian Schwethelm, Daniel Rueckert, Georgios Kaissis

Affiliations * Chair for AI in Healthcare and Medicine, Technical University of Munich; Department of Computing, Imperial College London; Munich Center for Machine Learning (MCML), Germany; Hasso Plattner Institute for Digital Engineering, University of Potsdam, Germany

AI summary Using iso-depth pretraining, the study measures how much one recurrence is worth in a looped transformer, derives a scaling law, and finds a recurrence-equivalence exponent φ = 0.46, indicating that shared recurrences are worth less than unique blocks; it also demonstrates the usefulness of φ as a diagnostic tool.

Comments v3: substantially refined framing + minor corrections; v2: added case studies on truncated-BPTT and hyperconnections

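Details
AI abstract (translated from Chinese)

We measure how much one recurrence is worth to a looped (depth-recurrent) transformer, in equivalent unique parameters. From an iso-depth pretraining sweep across recurrence counts r ∈ {1, 2, 4, 8}, spanning about 50× in training compute, we fit a joint scaling law L = E + A(N_once + r^φ N_rec)^(-α) + B D^(-β) and measure a recurrence-equivalence exponent φ = 0.46. Intuitively, φ tells us whether looping a block r times is equivalent in validation loss to r unique blocks of a non-looped model (full equivalence, φ = 1) or to a single block run repeatedly with no capacity gain (φ = 0). Our φ = 0.46 lies in between, so replacing unique blocks with shared recurrences increases validation loss at matched training compute. For example, at r = 4 a 410M looped model performs on par with a 580M non-looped model but incurs the training cost of a 1B non-looped one. We demonstrate the usefulness of φ as a diagnostic tool in two case studies: commonly used truncated backpropagation lowers φ to 0.38, indicating that the loop mechanism is poorly trained under truncation even though validation loss decreases; conversely, hyperconnections raise φ to 0.65, a genuine capacity gain. Our method separates true loop improvements from training-side gains, a distinction raw validation loss cannot make.

English abstract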

We measure how much one recurrence is worth to a looped (depth-recurrent) transformer, in equivalent unique parameters. From an iso-depth pretraining sweep across recurrence counts $r \in \{1, 2, 4, 8\}$ spanning ${\sim}50\times$ in training compute, we fit a joint scaling law $L = E + A\,(N_\text{once} + r^φ N_\text{rec})^{-α} + B\,D^{-β}$ and measure a recurrence-equivalence exponent $φ= 0.46$. Intuitively, $φ$ tells us whether looping a block $r$ times is equivalent in validation loss to $r$ unique blocks of a non-looped model (full equivalence, $φ{=}1$) or to a single block run repeatedly with no capacity gain ($φ{=}0$). Our $φ= 0.46$ sits in between, so replacing unique blocks with shared recurrences increases validation loss at matched training compute. For example, at $r{=}4$ a 410M looped model performs on par with a 580M non-looped model, but incurs the training cost of a 1B non-looped one. We demonstrate the utility of $φ$ as a diagnostic tool on two case studies: commonly used truncated backpropagation lowers $φ$ to $0.38$, indicating that the loop mechanism is poorly trained under truncation, even though validation loss decreases. Conversely, hyperconnections raise $φ$ to $0.65$, a genuine capacity gain. Our method separates true loop improvements from training-side gains, a distinction raw validation loss cannot make.
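
The role of φ is easiest to see by evaluating the effective-parameter term of the scaling law at its two limits. The sketch below implements the stated formula directly; the parameter counts are placeholders, not values from the paper, and the function names are hypothetical.

```python
def effective_params(n_once, n_rec, r, phi):
    """Effective unique-parameter count: N_once + r**phi * N_rec."""
    return n_once + (r ** phi) * n_rec


def scaling_law_loss(n_once, n_rec, r, tokens, E, A, alpha, B, beta, phi):
    """L = E + A * (N_once + r**phi * N_rec)**(-alpha) + B * D**(-beta)."""
    n_eff = effective_params(n_once, n_rec, r, phi)
    return E + A * n_eff ** (-alpha) + B * tokens ** (-beta)


# Limiting cases from the abstract, with placeholder parameter counts.
full_equivalence = effective_params(100e6, 50e6, r=4, phi=1.0)  # like r unique blocks
no_capacity_gain = effective_params(100e6, 50e6, r=4, phi=0.0)  # like one block
measured = effective_params(100e6, 50e6, r=4, phi=0.46)

# At r = 4 and phi = 0.46, each recurrent parameter counts about 1.89 times.
multiplier = 4 ** 0.46
```

Four passes through a shared block therefore behave like fewer than two unique copies under the fitted exponent, rather than four.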

2604.20658 2026-05-08 cs.CL cs.CY cs.MA

Cooperative Profiles Predict Multi-Agent LLM Team Performance in AI for Science Workflows

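Shivani Kumar, Adarsh Bharathwaj, David Jurgens

Affiliations * University of Michigan

AI summary This paper evaluates the cooperative profiles of 35 LLMs using six behavioral economics games and finds that they effectively predict team performance in AI-for-Science tasks, especially data processing, modeling, and report generation, where cooperative strategies outperform greedy ones.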

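Details
AI abstract (translated from Chinese)

Multi-agent systems built from large language models (LLMs) are increasingly used for collaborative scientific reasoning and problem-solving. These systems must coordinate under shared constraints, such as GPUs or credit balances, where cooperative behavior is critical. Behavioral economics offers a rich toolkit of games that isolate distinct cooperation mechanisms, but it is unclear whether a model's behavior in these stylized settings predicts its performance in realistic collaborative tasks. By evaluating 35 open-weight LLMs across six behavioral economics games, this paper finds that game-derived cooperative profiles robustly predict downstream performance in AI-for-Science tasks, in which teams of LLM agents collaboratively analyze data, build models, and produce scientific reports under shared budget constraints. Models that coordinate effectively in games and invest in multiplicative team production (rather than greedy strategies) produce better scientific reports on three outcomes: accuracy, quality, and completion. These associations hold after controlling for multiple factors, indicating that cooperative disposition is a distinct, measurable property of LLMs that cannot simply be attributed to general ability. The behavioral games framework therefore offers a fast and inexpensive diagnostic for screening cooperative fitness before costly multi-agent deployment.

English abstract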

Multi-agent systems built from teams of large language models (LLMs) are increasingly deployed for collaborative scientific reasoning and problem-solving. These systems require agents to coordinate under shared constraints, such as GPUs or credit balances, where cooperative behavior matters. Behavioral economics provides a rich toolkit of games that isolate distinct cooperation mechanisms, yet it remains unknown whether a model's behavior in these stylized settings predicts its performance in realistic collaborative tasks. Here, we benchmark 35 open-weight LLMs across six behavioral economics games and show that game-derived cooperative profiles robustly predict downstream performance in AI-for-Science tasks, where teams of LLM agents collaboratively analyze data, build models, and produce scientific reports under shared budget constraints. Models that coordinate effectively in games and invest in multiplicative team production (rather than greedy strategies) produce better scientific reports across three outcomes: accuracy, quality, and completion. These associations hold after controlling for multiple factors, indicating that cooperative disposition is a distinct, measurable property of LLMs not reducible to general ability. Our behavioral games framework thus offers a fast and inexpensive diagnostic for screening cooperative fitness before costly multi-agent deployment.

2604.20568 2026-05-08 cs.LG cs.IT math.IT stat.ME

Amortized Vine Copulas for High-Dimensional Density and Information Estimation

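Houman Safaai

Affiliations * Kempner Institute for the Study of Natural and Artificial Intelligence at Harvard University

AI summary This paper proposes VDC, an amortized vine-copula pipeline for continuous-data, simplified-vine dependence modeling that trains a single bivariate denoising model and reuses it across all vine edges, speeding up high-dimensional vine fitting.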

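Details
AI abstract (translated from Chinese)

Keeping likelihoods tractable in high-dimensional dependence modeling remains challenging. Classical vine-copula pipelines are interpretable but costly, while many neural estimators are flexible but less structured. This paper proposes Vine Denoising Copula (VDC), an amortized vine-copula pipeline for continuous-data, simplified-vine dependence modeling that trains a single bivariate denoising model and reuses it across all vine edges. For each edge, given pseudo-observations, the model predicts a piecewise-constant density grid. An IPFP/Sinkhorn projection is then applied, normalizing mass and driving the marginals to uniformity. This preserves the tractable vine-likelihood structure and the usual copula interpretation while replacing repeated per-edge optimization with GPU inference. On synthetic and real-data benchmarks, VDC delivers strong bivariate density accuracy, competitive MI/TC estimation, and faster high-dimensional vine fitting. These gains make explicit information estimation and dependence decomposition feasible when repeated vine fitting would otherwise be too costly, while conditional downstream tasks remain a limitation.

English abstract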

Modeling high-dimensional dependencies while keeping likelihoods tractable remains challenging. Classical vine-copula pipelines are interpretable but can be expensive, while many neural estimators are flexible but less structured. In this work, we propose Vine Denoising Copula (VDC), an amortized vine-copula pipeline for continuous-data, simplified-vine dependence modeling. VDC trains a single bivariate denoising model and reuses it across all vine edges. For each edge, given pseudo-observations, the model predicts a piecewise-constant density grid. We then apply an IPFP/Sinkhorn projection that normalizes mass and drives the marginals to uniformity. This preserves the tractable vine-likelihood structure and the usual copula interpretation while replacing repeated per-edge optimization with GPU inference. Across synthetic and real-data benchmarks, VDC delivers strong bivariate density accuracy, competitive MI/TC estimation, and faster high-dimensional vine fitting. These gains make explicit information estimation and dependence decomposition feasible when repeated vine fitting would otherwise be costly, while conditional downstream tasks remain a limitation.
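
The IPFP/Sinkhorn projection step can be illustrated on a tiny grid: alternately rescale rows and columns until every row and column of the cell-mass grid sums to 1/m, which is what uniform copula marginals require. This is a generic pure-Python sketch of iterative proportional fitting, not the VDC implementation, and it assumes strictly positive grid cells.

```python
def ipfp_uniform(grid, iters=200):
    """Rescale a positive m x m grid so each row and column holds mass 1/m.

    grid: list of m lists of m positive floats (unnormalized cell masses).
    Returns a new grid whose total mass is 1 and whose marginals are uniform.
    """
    m = len(grid)
    g = [row[:] for row in grid]
    target = 1.0 / m
    for _ in range(iters):
        for i in range(m):  # match row marginals
            row_sum = sum(g[i])
            g[i] = [v * target / row_sum for v in g[i]]
        for j in range(m):  # match column marginals
            col_sum = sum(g[i][j] for i in range(m))
            for i in range(m):
                g[i][j] *= target / col_sum
    return g


fitted = ipfp_uniform([[3.0, 1.0], [1.0, 1.0]])
row_sums = [sum(row) for row in fitted]
col_sums = [sum(fitted[i][j] for i in range(2)) for j in range(2)]
```

Multiplying each cell mass by m² would convert it to a piecewise-constant density value on the unit square.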

2604.20229 2026-05-08 cs.SD cs.AI

Enhancing Speaker Verification with Whispered Speech via Post-Processing

Magdalena Gołębiowska, Piotr Syga

Affiliations * Department of Artificial Intelligence

AI summary This paper proposes a model that uses post-processing to improve speaker verification on whispered speech, reaching an AUC of 98.16% in normal vs. whispered trials and an EER of 1.88% in whispered-to-whispered comparisons.

Comments 15 pages, 3 figures, conference paper at ACIIDS 2026

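Details
AI abstract (translated from Chinese)

Speaker verification is the task of confirming an individual's identity by analyzing their voice. Whispered speech differs from phonated speech in its acoustic characteristics, which degrades the performance of speaker verification systems in real-life scenarios, for example when speakers avoid fully phonated speech to protect privacy or to avoid disturbing others, or when a disease prevents full vocalization. This paper proposes a model and a training recipe for obtaining representations that are more robust to the difficulties of whispered speech. The proposed system is built on a fine-tuned speaker verification backbone and is jointly optimized with cosine-similarity-based classification and triplet loss. In normal vs. whispered speech trials, it achieves a 22.26% relative improvement over the baseline (baseline 6.77% vs. ours 5.27%) and an AUC of 98.16%. In whispered vs. whispered tests, the model attains an EER of 1.88% with an AUC of 99.73%, a 15% relative improvement over the previous best, ReDimNet-B2. We also summarize the performance of the most popular and state-of-the-art speaker verification models on whispered speech. In addition, we evaluate these models on noisy audio and find that, in general, the same level of noise degrades speaker verification performance more on whispered speech than on normal speech.

English abstract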

Speaker verification is the task of confirming an individual's identity through the analysis of their voice. Whispered speech differs from phonated speech in its acoustic characteristics, which degrades the performance of speaker verification systems in real-life scenarios, such as when speakers avoid fully phonated speech to protect privacy or to avoid disturbing others, or when a disease prevents full vocalization. In this paper, we propose a model with a training recipe that yields representations more robust to whispered speech. The proposed system employs an encoder-decoder structure built atop a fine-tuned speaker verification backbone, optimized jointly using cosine-similarity-based classification and triplet loss. We gain a relative improvement of 22.26% over the baseline (baseline 6.77% vs. ours 5.27%) in normal vs. whispered speech trials, achieving an AUC of 98.16%. In tests comparing whispered to whispered speech, our model attains an EER of 1.88% with an AUC of 99.73%, a 15% relative improvement over the prior leading ReDimNet-B2. We also summarize the performance of the most popular and state-of-the-art speaker verification models on whispered speech. Additionally, we evaluate these models on noisy audio and find that, in general, the same relative level of noise degrades speaker verification performance more on whispered speech than on normal speech.
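
The equal error rate (EER) reported above is the operating point where the false-accept rate equals the false-reject rate. A minimal threshold-sweep estimate is sketched below; the scores are invented, the function name is hypothetical, and evaluation toolkits typically interpolate between thresholds rather than taking the closest one.

```python
def equal_error_rate(genuine, impostor):
    """Estimate the EER by sweeping thresholds over all observed scores.

    genuine: similarity scores for same-speaker trials.
    impostor: similarity scores for different-speaker trials.
    A trial is accepted when its score is >= the threshold.
    """
    best_gap, eer = float("inf"), 1.0
    for threshold in sorted(set(genuine) | set(impostor)):
        far = sum(s >= threshold for s in impostor) / len(impostor)  # false accepts
        frr = sum(s < threshold for s in genuine) / len(genuine)     # false rejects
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2.0
    return eer


# Illustrative cosine-similarity scores, not data from the paper.
genuine_scores = [0.9, 0.8, 0.7, 0.4]
impostor_scores = [0.6, 0.5, 0.3, 0.2]
eer = equal_error_rate(genuine_scores, impostor_scores)
```

In this toy set one genuine trial scores below one impostor trial, so the two error rates meet at 25%.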

2604.19675 2026-05-08 cs.CV

MedFlowSeg: Flow Matching for Medical Image Segmentation with Frequency-Aware Attention

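Zhi Chen, Runze Hu, Le Zhang

Affiliations * School of Engineering, University of Birmingham; School of Public Health, Peking University

AI summary This paper proposes MedFlowSeg, which uses a flow matching framework to formulate medical image segmentation as learning a time-dependent vector field, enabling more efficient inference while retaining the flexibility of generative models. It introduces a dual-conditioning mechanism and a frequency-aware attention module to improve structural consistency and boundary delineation.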

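Details
AI abstract (translated from Chinese)

Flow matching has recently emerged as a principled framework for learning continuous-time transport maps, enabling efficient ODE-based sampling without relying on stochastic diffusion processes. Although generative modeling has shown promise for medical image segmentation, particularly in capturing uncertainty and complex anatomical variability, existing methods are mostly based on diffusion models, which require iterative sampling and incur substantial computational overhead. In this paper, we propose MedFlowSeg, a conditional flow matching framework that formulates medical image segmentation as learning a time-dependent vector field that transports a simple prior distribution to the target segmentation distribution. Compared with diffusion-based methods, our approach enables more efficient inference by solving an ordinary differential equation while retaining the flexibility of generative modeling. To incorporate conditional information effectively, we introduce a dual-conditioning mechanism. Specifically, we propose a Dual-Branch Spatial Attention (DB-SA) module to inject multi-frequency structural priors, and a Frequency-Aware Attention (FA-Attention) module that models interactions between spatial and spectral representations through discrepancy-aware fusion and time-dependent modulation. These components improve the alignment between noisy intermediate states and clean semantic features, leading to better structural consistency and boundary delineation. We conduct extensive experiments across multiple medical imaging modalities, where MedFlowSeg outperforms existing state-of-the-art (SOTA) baselines, including diffusion-based and flow-based methods, on multiple medical image segmentation tasks.

English abstract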

Flow matching has recently emerged as a principled framework for learning continuous-time transport maps, enabling efficient ODE-based sampling without relying on stochastic diffusion processes. While generative modeling has shown promise for medical image segmentation, particularly in capturing uncertainty and complex anatomical variability, existing approaches are predominantly based on diffusion models, which require iterative sampling and incur substantial computational overhead. In this work, we propose MedFlowSeg, a conditional flow matching framework that formulates medical image segmentation as learning a time-dependent vector field that transports a simple prior distribution to the target segmentation distribution. Compared to diffusion-based methods, our formulation enables more efficient inference through solving an ordinary differential equation, while preserving the flexibility of generative modeling. To effectively incorporate conditional information, we introduce a dual-conditioning mechanism. Specifically, we propose a Dual-Branch Spatial Attention (DB-SA) module to inject multi-frequency structural priors, and a Frequency-Aware Attention (FA-Attention) module to model interactions between spatial and spectral representations via discrepancy-aware fusion and time-dependent modulation. These components improve the alignment between noisy intermediate states and clean semantic features, leading to better structural consistency and boundary delineation. We conduct extensive experiments across multiple medical imaging modalities, where MedFlowSeg consistently outperforms prior state-of-the-art (SOTA) baselines, including diffusion-based and flow-based methods.
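
Sampling from a flow-matching model means integrating an ODE from a prior sample at t = 0 to a data sample at t = 1. The sketch below uses fixed-step Euler integration with a closed-form toy velocity field standing in for the learned, image-conditioned network; the target mask, step count, and function names are assumptions made for illustration.

```python
import random


def euler_sample(velocity, x0, steps=10):
    """Integrate dx/dt = velocity(x, t) from t = 0 to t = 1 with Euler steps."""
    x, dt = list(x0), 1.0 / steps
    for k in range(steps):
        t = k * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy stand-in for a trained network: for a straight-line path toward a single
# known target, the velocity at (x, t) is (target - x) / (1 - t).
target_mask = [0.0, 1.0, 1.0, 0.0]


def toy_velocity(x, t):
    return [(m - xi) / (1.0 - t) for m, xi in zip(target_mask, x)]


rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in target_mask]  # sample from the simple prior
mask = euler_sample(toy_velocity, noise, steps=10)
```

A real model would replace `toy_velocity` with a network conditioned on the input image, but the sampling loop keeps the same shape, which is where the efficiency argument against long diffusion chains comes from.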

2604.18916 2026-05-08 cs.AI

Benchmarking PNW Model for MedMNIST to 100% Accuracy

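Bo Deng

Affiliations * Department of Mathematics

AI summary This paper proposes the concept of Artificial Special Intelligence, in which error-free training keeps classification models from repeating mistakes. Applied to 18 MedMNIST datasets, it trains all of them to 100% accuracy except three that suffer from a double-labeling problem.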

Comments 12 pages, 4 figures, 1 table

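Details
AI abstract (translated from Chinese)

This paper introduces a new concept called "Artificial Special Intelligence", by which machine learning models for classification problems can be trained error-free and thus acquire the ability not to repeat mistakes. The method is applied to 18 MedMNIST biomedical datasets. Except for three datasets that suffer from the double-labeling problem, all are trained to perfection.

English abstract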

In this paper, we introduce a new concept called Artificial Special Intelligence by which Machine Learning models for the classification problem can be trained error-free, thus acquiring the capability of not making repeated mistakes. The method is applied to 18 MedMNIST biomedical datasets. Except for three datasets, which suffer from the double-labeling problem, all are trained to perfection.

2604.18753 2026-05-08 cs.LG cs.AI

Handling and Interpreting Missing Modalities in Patient Clinical Trajectories via Autoregressive Sequence Modeling

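Andrew Wang, Ellie Pavlick, Ritambhara Singh

Affiliations * Brown University

AI summary This paper handles missing modalities in medical data through autoregressive sequence modeling, proposes a contrastive pre-training objective, validates it on the MIMIC-IV and eICU datasets, and improves model interpretability.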

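Details
AI abstract (translated from Chinese)

Handling missing modalities during training and deployment is an active challenge in developing multimodal machine learning models for healthcare. Because clinical data are inherently temporal and sparse in modality presence, capturing the underlying predictive signal with diagnostic multimodal ML models while retaining model explainability remains an ongoing challenge. This paper re-frames clinical diagnosis as an autoregressive sequence modeling task, using causal decoders from large language models (LLMs) to model a patient's multimodal trajectory. We first introduce a missingness-aware contrastive pre-training objective that integrates multiple modalities from datasets with missingness into a shared latent space. We then show that autoregressive sequence modeling with transformer-based architectures outperforms baselines on the MIMIC-IV and eICU fine-tuning benchmarks. Finally, we use interpretability techniques to go beyond performance gains and find that, across various patient stays, removing modalities leads to divergent behavior that our contrastive pre-training mitigates. By abstracting clinical diagnosis as sequence modeling and interpreting patient stay trajectories, we develop a framework to analyze and handle missing modalities while addressing the canonical need for safe, transparent clinical AI.

English abstract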

An active challenge in developing multimodal machine learning (ML) models for healthcare is handling missing modalities during training and deployment. As clinical datasets are inherently temporal and sparse in terms of modality presence, capturing the underlying predictive signal via diagnostic multimodal ML models while retaining model explainability remains an ongoing challenge. In this work, we address this by re-framing clinical diagnosis as an autoregressive sequence modeling task, utilizing causal decoders from large language models (LLMs) to model a patient's multimodal trajectory. We first introduce a missingness-aware contrastive pre-training objective that integrates multiple modalities in datasets with missingness in a shared latent space. We then show that autoregressive sequence modeling with transformer-based architectures outperforms baselines on the MIMIC-IV and eICU fine-tuning benchmarks. Finally, we use interpretability techniques to move beyond performance boosts and find that across various patient stays, removing modalities leads to divergent behavior that our contrastive pre-training mitigates. By abstracting clinical diagnosis as sequence modeling and interpreting patient stay trajectories, we develop a framework to profile and handle missing modalities while addressing the canonical desideratum of safe, transparent clinical AI.

2604.17573 2026-05-08 cs.AI

Beyond Static Snapshots: A Grounded Evaluation Framework for Language Models at the Agentic Frontier

Jazmia Henry

Affiliations * University of Oxford

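AI summary This paper proposes the Grounded Continuous Evaluation framework, which uses a deterministic verifier and updates LoRA adapters on CPU to address reward hacking and hardware limitations, and validates it across multiple architectures and domains.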

Comments Submitted for consideration to NeurIPS 2026

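Details
AI Chinese abstract (translated to English)

We argue that current evaluation frameworks for large language models (LLMs) have four systematic failures that make them structurally inadequate for deployed agentic systems: distributional, temporal, scope, and process invalidity. These failures compound in RLHF, making reward hacking a predictable consequence of evaluation design rather than an unpredictable training pathology, and RLHF's dual-model architecture imposes a hardware barrier that limits evaluation reproducibility. We propose the Grounded Continuous Evaluation (GCE) framework and present ISOPro as a reference implementation. ISOPro replaces the learned reward model with a deterministic verifier, which eliminates reward hacking by construction in verifiable-reward domains, and updates LoRA adapters on CPU, lowering the hardware barrier by an order of magnitude. We validate ISOPro on three architectures (Qwen 2.5 3B, Llama 3.2 3B, Gemma 2 2B) and two domains (scheduling, MBPP), with a head-to-head matched-compute comparison against GRPO-LoRA. Across twelve cells, ISOPro yields the largest absolute capability gains (+25.6, +22.2, +16.0pp), with a mean delta of +9.0pp and a worst-case regression of -5.6pp; GRPO-LoRA at consumer-budget hyperparameters reaches a smaller peak gain (+8.5pp), a deeper worst-case regression (-10pp), and a mean delta of -1.5pp. On held-out compositional generalization on MBPP, ISOPro reaches 40% on two of the three architectures (including a 0% to 40% bootstrap on Qwen 2.5 3B), while GRPO-LoRA reaches 20% on one of the three. We describe a buffer-skew failure mode in which, under three preconditions, the implicit curriculum can erode pre-existing tier capability, along with three corresponding mitigations. This work sits alongside DeepSeek-R1's GRPO, which reached the same architectural conclusion at scale: for verifiable-reward domains, the verifier is the reward signal.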

English abstract

We argue that current evaluation frameworks for large language models (LLMs) suffer from four systematic failures that make them structurally inadequate for deployed, agentic systems: distributional, temporal, scope, and process invalidity. These failures compound in RLHF, making reward hacking a predictable consequence of evaluation design rather than an unpredictable training pathology, and RLHF's dual-model architecture imposes a hardware barrier limiting evaluation reproducibility. We propose the Grounded Continuous Evaluation (GCE) framework and present ISOPro as a reference implementation. ISOPro replaces the learned reward model with a deterministic verifier, eliminating reward hacking by construction in verifiable-reward domains, and updates LoRA adapters on CPU, reducing the hardware barrier by an order of magnitude. We validate ISOPro across three architectures (Qwen 2.5 3B, Llama 3.2 3B, Gemma 2 2B) and two domains (scheduling, MBPP), with a head-to-head matched-compute comparison against GRPO-LoRA. Across twelve cells, ISOPro produces the largest absolute capability gains (+25.6, +22.2, +16.0pp) at mean delta +9.0pp and worst-case regression -5.6pp; GRPO-LoRA at consumer-budget hyperparameters reaches a smaller peak gain (+8.5pp), deeper worst-case regression (-10pp), and mean delta -1.5pp. Held-out compositional generalization on MBPP reaches 40% for ISOPro on two of three architectures (including a 0% to 40% bootstrap on Qwen 2.5 3B), against 20% for GRPO-LoRA on one of three. We characterize a buffer-skew failure mode in which the implicit curriculum can erode pre-existing tier capability under three preconditions, with three corresponding mitigations. The work is situated alongside DeepSeek-R1's GRPO, which arrived at the same architectural conclusion at scale: for verifiable-reward domains, the verifier is the reward signal.
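To make the idea of a deterministic verifier serving as the reward signal concrete, here is a small illustrative sketch for a scheduling-style task. It is not ISOPro's code or API: the function `verify_schedule`, the task format (meeting durations in hours, start hours within a working day), and the binary 1.0/0.0 reward are assumptions chosen for illustration.

```python
def verify_schedule(meetings, schedule, day_start=9, day_end=17):
    """Deterministically score a proposed schedule: 1.0 if valid, else 0.0.

    meetings: dict mapping meeting name -> duration in hours (hypothetical task spec).
    schedule: dict mapping meeting name -> proposed start hour (model output).
    A schedule is valid when it places every meeting exactly once, keeps each
    meeting inside working hours, and contains no overlapping meetings.
    """
    # Every required meeting must be scheduled, and nothing extra.
    if set(schedule) != set(meetings):
        return 0.0

    intervals = []
    for name, start in schedule.items():
        end = start + meetings[name]
        # Reject meetings that fall outside the working day.
        if start < day_start or end > day_end:
            return 0.0
        intervals.append((start, end))

    # After sorting by start time, only neighbours can overlap.
    intervals.sort()
    for (_, prev_end), (next_start, _) in zip(intervals, intervals[1:]):
        if next_start < prev_end:
            return 0.0
    return 1.0
```

Because the score is computed by explicit rules rather than by a learned model, the same output always receives the same reward, and in this sketch a high reward can only be obtained by actually satisfying the stated constraints.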

2604.17137 2026-05-08 cs.LG cs.RO

BOIL: Learning Environment Personalized Information

BOIL: Learning Environment Personalized Information in Multi-Agent Systems

Rohan Patil, Henrik I. Christensen

Affiliations * Department of Computer Science and Engineering

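AI summary This paper proposes BOIL, which uses the PageRank algorithm and common information maximization to extract structural information about the environment, guide long-term multi-agent behavior, and improve policy performance in complex environments.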

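Details
AI Chinese abstract (translated to English)

Navigating complex environments is challenging for multi-agent systems, which must efficiently extract insights from limited information. This paper introduces the Blackbox Oracle Information Learning (BOIL) process, a scalable solution for extracting valuable insights from the environment structure. Using the PageRank algorithm and common information maximization, BOIL facilitates information extraction to guide long-term agent behavior and applies to problems such as coverage, patrolling, and stochastic reachability. Through experiments, we demonstrate BOIL's effectiveness at generating strategy distributions that improve long-term performance, surpassing heuristic approaches in complex environments.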

English abstract

Navigating complex environments poses challenges for multi-agent systems, requiring efficient extraction of insights from limited information. In this paper, we introduce the Blackbox Oracle Information Learning (BOIL) process, a scalable solution for extracting valuable insights from the environment structure. Leveraging the PageRank algorithm and common information maximization, BOIL facilitates the extraction of information to guide long-term agent behavior applicable to problems such as coverage, patrolling, and stochastic reachability. Through experiments, we demonstrate the efficacy of BOIL in generating strategy distributions conducive to improved performance over extended time horizons, surpassing heuristic approaches in complex environments.
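The abstract does not say how BOIL applies PageRank, so the sketch below only shows the standard PageRank power iteration on a small environment graph, as background for how a graph's structure can yield a distribution over locations. The graph encoding (a dict of adjacency lists in which every node appears as a key), the node names, and the default damping factor of 0.85 are assumptions for illustration.

```python
def pagerank(graph, damping=0.85, tol=1e-10, max_iter=200):
    """Standard PageRank by power iteration.

    graph: dict mapping each node to a list of nodes it links to.
        Every node must appear as a key, even if it has no outgoing edges.
    Returns a dict mapping node -> rank; the ranks sum to 1.
    """
    nodes = list(graph)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}

    for _ in range(max_iter):
        # Rank held by nodes without outgoing edges is spread uniformly.
        dangling = sum(rank[v] for v in nodes if not graph[v])
        new_rank = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}

        # Each node passes its rank evenly along its outgoing edges.
        for v in nodes:
            for w in graph[v]:
                new_rank[w] += damping * rank[v] / len(graph[v])

        # Stop once the total change between iterations is negligible.
        delta = sum(abs(new_rank[v] - rank[v]) for v in nodes)
        rank = new_rank
        if delta < tol:
            break
    return rank


# Hypothetical environment: a hallway connected to three rooms.
environment = {
    "hall": ["room_a", "room_b", "room_c"],
    "room_a": ["hall"],
    "room_b": ["hall"],
    "room_c": ["hall"],
}
ranks = pagerank(environment)
```

In this toy layout the hallway receives the highest rank because every route passes through it, which is the kind of structural signal a patrolling or coverage strategy could plausibly weight more heavily.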