arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.27715 2026-05-28 cs.CL

Beyond Input Understanding: Diagnosing Multilingual Mathematical Reasoning with Directed Acyclic Trace Graphs

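Jiaqiao Zhang, Zhoujun Li, Raoyuan Zhao, Jian Lan, Thomas Seidl, Michael A. Hedderich, Hinrich Schütze, Yihong Liu

Affiliations: Southwest University; LMU Munich; MCML

AI summary: This paper proposes the Directed Acyclic Trace Graph (DATG) framework, which maps reasoning traces to language-independent mathematical anchors and dependencies to diagnose how language affects multilingual mathematical reasoning. It also designs two test-time controls, Loop-Retry and Formula-Retry, to improve performance in low-resource languages.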

Comments preprint

Abstract

Large reasoning models (LRMs) achieve strong mathematical reasoning performance in English, but remain much less reliable in many low- and medium-resource languages. This gap is often explained as a failure to understand non-English problem statements. We show that this view is incomplete: even when the problem is given in English, controlling the model's reasoning language can substantially reduce accuracy, suggesting that language also affects reasoning execution itself. To study this effect, we introduce DATG, a Directed Acyclic Trace Graph framework that maps reasoning traces to language-independent mathematical anchors and dependencies. This allows us to align target-language traces with reference DAGs and measure whether they cover required mathematical nodes, respect dependency edges, and avoid harmful mathematical actions. Experiments on the Qwen3 series across 12 languages show that non-English reasoning often suffers from reduced anchor coverage and weaker dependency fidelity, especially in low-resource languages. Motivated by this diagnosis, we propose Loop-Retry and Formula-Retry, two simple test-time controls targeting DATG-exposed failure modes, and show that they consistently improve target-language reasoning performance in low-resource languages.
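
The two alignment measurements named in the abstract, anchor coverage and dependency fidelity, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the abstract does not give the exact definitions, so the functions `anchor_coverage` and `dependency_fidelity`, the string encoding of anchors, and the toy problem are hypothetical rather than the authors' implementation.

```python
def anchor_coverage(reference_nodes, trace_anchors):
    """Fraction of reference anchors that appear anywhere in the trace."""
    required = set(reference_nodes)
    if not required:
        return 1.0
    return len(required & set(trace_anchors)) / len(required)


def dependency_fidelity(reference_edges, trace_anchors):
    """Fraction of reference edges (u, v) where both anchors appear and u comes first."""
    if not reference_edges:
        return 1.0
    first_seen = {}
    for position, anchor in enumerate(trace_anchors):
        first_seen.setdefault(anchor, position)
    respected = sum(
        1
        for u, v in reference_edges
        if u in first_seen and v in first_seen and first_seen[u] < first_seen[v]
    )
    return respected / len(reference_edges)


# Hypothetical reference DAG: area of a 3-by-4 rectangle, then doubled.
reference_nodes = ["width=3", "height=4", "area=12", "answer=24"]
reference_edges = [
    ("width=3", "area=12"),
    ("height=4", "area=12"),
    ("area=12", "answer=24"),
]

# A hypothetical target-language trace that never states the height anchor.
trace = ["width=3", "area=12", "answer=24"]

coverage = anchor_coverage(reference_nodes, trace)      # 3 of 4 anchors
fidelity = dependency_fidelity(reference_edges, trace)  # 2 of 3 edges
```

In this toy case a single missing anchor lowers both scores, which is the kind of joint drop in coverage and fidelity the abstract reports for non-English reasoning.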

2605.27712 2026-05-28 cs.AI

Prefix-Safe Bayesian Belief Tracking for LLM Reasoning Reliability: Separating Calibration from Ranking

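Zhenghan Song, Yunyi Li, Yulong Liu

Affiliations: Cornell University; Columbia University

AI summary: Proposes the Sequential Bayesian Belief Tracking (SBBT) framework, which uses prefix-safe observations and separates probability quality from ranking ability to provide reliable online calibration and uncertainty estimation over long reasoning traces.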

Abstract

Long reasoning traces need reliability estimates before final answers are known. We study prefix-conditioned eventual-success estimation, $P(y=1 \mid o_{1:t})$, using prefix-safe observations. Sequential Bayesian Belief Tracking (SBBT) calibrates observation likelihoods and recursively updates a two-state belief, providing a common tracker for scalar scores, text and self-verification markers, hidden clusters, token-pooling probes, and latent-trajectory features. Across generated open-weight traces on MATH-500, GSM8K, AIME 2025, and RIMO-N, probability quality and ranking separate: score-only SBBT often improves Brier, while AUROC gains require structure-aware evidence beyond strong prefix-safe baselines. In the strongest hard math setting, structure-aware observations reach +0.110 AUROC against standard prefix-safe baselines. Under a same-prefix classifier audit, MATH-500 text markers and RIMO-N self-verification signals remain positive. Together, these findings support SBBT as a calibration-aware online inference framework and expose an evidence regime: scalar scores mainly support probability quality, while structure-aware prefix signals support ranking only when strong prefix-safe baselines have not already absorbed the rank evidence.
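
A recursive two-state belief update of the kind the abstract describes can be sketched as a plain Bayes filter. This is a sketch under stated assumptions, not the paper's method: it assumes the latent outcome $y$ is fixed for the whole trace, that observations are discretized into labels, and that the likelihood table has already been calibrated. The names `update_belief`, `track`, and the `"high"`/`"low"` labels are hypothetical.

```python
def update_belief(prior, lik_success, lik_failure):
    """One Bayes step for P(y=1 | o_1..t) given P(o_t | y=1) and P(o_t | y=0)."""
    numerator = prior * lik_success
    denominator = numerator + (1.0 - prior) * lik_failure
    if denominator == 0.0:
        return prior  # observation impossible under both states; keep the belief
    return numerator / denominator


def track(prior, observations, likelihoods):
    """Return the belief after each prefix of the observation sequence."""
    beliefs = []
    belief = prior
    for obs in observations:
        lik_success, lik_failure = likelihoods[obs]
        belief = update_belief(belief, lik_success, lik_failure)
        beliefs.append(belief)
    return beliefs


# Hypothetical calibrated likelihoods: (P(o | y=1), P(o | y=0)).
likelihoods = {"high": (0.7, 0.3), "low": (0.3, 0.7)}

beliefs = track(0.5, ["high", "high", "low"], likelihoods)
```

Each belief depends only on observations up to step $t$, which is the sense in which such a tracker can be used before the final answer is known.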

2605.27710 2026-05-28 cs.AI

DeepSciVerify: Verifying Scientific Claim-Citation Alignment via LLM-Driven Evidence Escalation

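Shaghayegh Sadeghi, Khashayar Khajavi, Rise Adhikari, Alexander Tessier

Affiliations: School of Computing Science, Simon Fraser University

AI summary: Proposes DeepSciVerify, a two-stage pipeline that combines abstract-level reasoning with selective escalation to passage-level evidence. On the SCitance benchmark it reaches 86.7 Micro-F1, 4.5 points above abstract-only baselines, while resolving 67% of instances without full-text retrieval.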

Abstract

Misalignment between claims and their cited evidence is a common failure mode in reports generated by large language models, limiting their reliability in scientific and other high-stakes settings. We present DeepSciVerify, a two-stage pipeline for scientific claim-citation verification that combines abstract-level reasoning with selective escalation to passage-level evidence. The system first verifies claims using the abstract and defers uncertain cases, retrieving and analyzing full-text passages only when necessary. This design leverages complementary behaviors across LLMs, as some models are more conservative while others are more decisive under uncertainty. On the SCitance benchmark, DeepSciVerify achieves 86.7 Micro-F1, outperforming strong abstract-only baselines by +4.5 points while resolving 67% of instances without full-text retrieval. These results suggest that selective evidence escalation improves both accuracy and efficiency in claim-citation verification.
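
The verify-then-escalate control flow described above can be sketched as follows. This is only an illustration of the two-stage idea: the label strings, the `verify_claim` function, and the toy substring-matching verifiers are hypothetical stand-ins for the LLM calls and retriever, which the abstract does not specify.

```python
def verify_claim(claim, abstract, abstract_verifier, fetch_passages, passage_verifier):
    """Return (label, escalated) for one claim-citation pair."""
    label = abstract_verifier(claim, abstract)
    if label != "UNCERTAIN":
        return label, False  # resolved from the abstract alone
    passages = fetch_passages(claim)  # full-text retrieval only when needed
    return passage_verifier(claim, passages), True


# Toy stand-ins for the LLM verifiers and the retriever (all hypothetical).
def toy_abstract_verifier(claim, abstract):
    return "SUPPORTED" if claim.lower() in abstract.lower() else "UNCERTAIN"


def toy_passage_verifier(claim, passages):
    found = any(claim.lower() in passage.lower() for passage in passages)
    return "SUPPORTED" if found else "NOT_SUPPORTED"


def toy_fetch_passages(claim):
    return ["In section 4, drug X reduced symptoms in the treated group."]


abstract = "We study drug X in a randomized trial."
easy = verify_claim(
    "drug X", abstract, toy_abstract_verifier, toy_fetch_passages, toy_passage_verifier
)
hard = verify_claim(
    "reduced symptoms", abstract, toy_abstract_verifier, toy_fetch_passages, toy_passage_verifier
)
```

The second element of each result records whether full text was needed, which is the quantity behind the reported share of instances resolved without retrieval.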

2605.27709 2026-05-28 cs.CL

ReverseMath: Answer Inversion for Scalable and Verifiable Mathematical Problem Generation

Raoyuan Zhao, Yihong Liu, Yupei Du, Hinrich Schütze, Michael A. Hedderich

Affiliations: Center for Information and Language Processing; LMU Munich; Munich Center for Machine Learning (MCML); Saarland University

AI summary: Proposes ReverseMath, which automatically generates new math problems by inverting the input-output relation of an original problem. Used for evaluation and training, it reveals memorization-like behavior and improves reasoning performance.

Abstract

Mathematical reasoning benchmarks are vital for evaluating large language models (LLMs), but many are static and repeatedly exposed through public evaluation and training pipelines, making it difficult to separate genuine reasoning from memorization. Meanwhile, manually constructing new math problems with reliable answers remains costly. We introduce ReverseMath, a scalable method for generating new math problems through answer inversion. Given a problem and its answer, ReverseMath masks a numerical value in the original problem, treats the original answer as a known condition, and rewrites the problem so that the masked value becomes the new answer. The generated problem reverses the original input-output relation, making its answer known by construction. We study ReverseMath for both evaluation and training. For evaluation, paired original/reversed problems reveal substantial behavioral shifts: models sometimes fail on reversed problems and even incorrectly output the original answer, suggesting memorization-like behavior. For training, ReverseMath provides automatically labeled reversed problems as data augmentation for reinforcement learning (RL). Experiments show that including ReverseMath-generated data improves mathematical reasoning performance across multiple benchmarks, demonstrating its value as both an analysis tool and a scalable source of verifiable training data.
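
Answer inversion can be made concrete with a small worked example. The sketch below uses a fixed, hand-written template rather than the rewriting procedure the paper uses, which the abstract does not detail. The functions `original_problem`, `reversed_problem`, and `looks_memorized` are hypothetical names chosen for illustration.

```python
def original_problem(price, count):
    """Forward problem: price and count are given, the total is the answer."""
    question = f"A pen costs {price} dollars. How much do {count} pens cost?"
    return question, price * count


def reversed_problem(price, count):
    """Mask `count`, treat the original answer as given, and ask for `count`."""
    _, total = original_problem(price, count)
    question = (
        f"A pen costs {price} dollars. Some pens cost {total} dollars in total. "
        "How many pens were bought?"
    )
    return question, count  # the new answer is known by construction


def looks_memorized(model_answer, original_answer, reversed_answer):
    """Flag a response to the reversed problem that repeats the original answer."""
    return model_answer == original_answer and original_answer != reversed_answer


forward_question, forward_answer = original_problem(3, 7)
inverse_question, inverse_answer = reversed_problem(3, 7)
```

Because the masked value was part of the original problem, the reversed answer needs no separate labeling step, and a model that returns 21 on the reversed question is repeating the original answer rather than solving the new one.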

2605.27706 2026-05-28 cs.CL cs.IR

Chain-based Adaptive Reconfiguration Over Lattices for Hallucination Reduction

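Joan Vendrell Gallart, Solmaz Kia, Russell Bent, Michael Grosskopf

Affiliations: Department of Mechanical and Aerospace, University of California Irvine; Los Alamos National Laboratory

AI summary: Proposes the CAROL framework, which defines a semantic uncertainty measure and builds a string-submodular objective over a lattice of textual sequences, casting hallucination mitigation as a Markov chain accept-reject process for test-time hallucination reduction.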

Abstract

We introduce CAROL (Chain-based Adaptive Reconfiguration Over Lattices), a probabilistic framework for test-time hallucination reduction in large language models. Rather than relying on token-level uncertainty, CAROL defines a semantic uncertainty measure based on the consistency between generated responses and a trusted context, inducing a string-submodular objective over a lattice of textual sequences. This formulation enables hallucination mitigation to be cast as a Markov chain accept-reject process with provable convergence and near-optimality guarantees, allowing the model to iteratively refine outputs toward semantic consistency. By operating at the level of meaning, CAROL unifies hallucination detection and mitigation within a single framework. Empirical results on question answering and multi-agent reasoning benchmarks show that CAROL significantly reduces hallucinations and improves reliability and interpretability compared to likelihood-based and retrieval-augmented baselines, while maintaining competitive computational efficiency.

2605.27703 2026-05-28 cs.AI

Hierarchical Prompt-Domain Control and Learning for Resource-Constrained Agentic Language Models

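Joan Vendrell Gallart, Russell Bent, Michael Grosskopf

Affiliations: Los Alamos National Laboratory

AI summary: Proposes a hierarchical control-and-learning framework that addresses the reliability of agentic language models under resource constraints through distillation-based learning of the output schema, online monitoring, and prompt-domain control.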

Abstract

Large Language Models are increasingly deployed inside agentic systems, where they must follow structured protocols, adapt to evolving states, and operate under memory, latency, and cost constraints. In such regimes, prompt extension is unreliable: growing contexts can push compact models outside their effective prompt domain, while deployment-time fine-tuning remains limited by scarce data and compute. We propose a hierarchical control-and-learning framework in which a compact model is first distilled to learn the required output schema, then supervised online by an oracle-controller loop. The controller monitors protocol validity and semantic performance, projects accumulated histories into a feasible prompt domain, and triggers lightweight oracle-supervised fine-tuning under drift. This separates schema learning for communication compatibility from semantic adaptation for task-level correction. We formalize prompt-domain feasibility and attention-induced saturation, motivating control of the effective prompt state rather than reliance on nominal context length. Using Multi-Fidelity Bayesian Optimization as a controlled sequential testbed, we characterize a core deployment failure mode and show improved reliability and cost-efficiency over non-hierarchical, distillation-only, and non-distilled baselines.

2605.27699 2026-05-28 cs.RO

AURA: Asymptotically Optimal Uncertainty-Robust Replanning Algorithm for Kinodynamic Systems

Seyedali Golestaneh, Zhuoyun Zhong, Donghyung Lee, Constantinos Chamzas

Affiliations: Department of Robotics Engineering, Worcester Polytechnic Institute (WPI)

AI summary: Proposes AURA, a meta-planning framework that replans online and optimizes control inputs to achieve asymptotically optimal trajectory planning and improved tracking accuracy under motion uncertainty.

Abstract

Sampling-based motion planners offer a practical and scalable approach to kinodynamic motion planning, notably for high-dimensional, underactuated, or non-holonomic systems. However, these planners are typically used offline, so execution can begin only after the trajectory has been computed. In addition, the planned trajectory may not be accurately tracked in the presence of motion uncertainty, leading to deviations from the nominal solution. This work addresses these limitations within a unified framework, AURA, an asymptotically optimal meta-planner that improves both path quality and tracking performance during execution. In addition to the main execution thread, the framework comprises a replanning method that continuously explores the state space and refines the trajectory during execution, and an optimization process that refines future control inputs to reduce tracking error. Together, these components enable AURA to leverage asymptotically optimal planning online while improving execution accuracy under uncertainty. The proposed approach is evaluated in both simulation and real-world environments across multiple systems, demonstrating consistent improvements in trajectory quality, tracking accuracy, and overall performance compared with baseline methods.

2605.27697 2026-05-28 cs.RO cs.AI cs.LG

Simulation-Informed Diffusion for Decentralized Multi-robot Motion Planning

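Jinhao Liang, Sven Koenig, Ferdinando Fioretto

Affiliations: University of Virginia; University of California, Irvine

AI summary: Proposes SID, a decentralized framework built on constraint-aware diffusion models, which simulates neighbors' future trajectories and plans each robot's own trajectory under the resulting safety constraints, achieving efficient coordination in dense scenarios.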

Abstract

Decentralized multi-robot motion planning requires each robot to generate collision-free trajectories from local observations, without global sensing or reliable communication. However, most existing planners, whether classical or learning-based, generate trajectories from a static snapshot of the local observation, which limits their ability to anticipate the future behavior of neighboring robots. This limitation is critical as the number of robots increases and the environment becomes more cluttered. To overcome this challenge, this paper introduces Simulation-Informed Diffusion (SID), a decentralized framework built on constraint-aware diffusion models (CADM). SID first uses CADM to simulate the future trajectories of neighboring robots from their currently observed states, and then uses the same CADM to plan each robot's own trajectory under safety constraints informed by these simulations. Crucially, the accurate simulation of neighbors enables a minimal communication scheme that triggers coordination only when necessary in highly congested scenarios. Experiments across diverse environments show that SID consistently outperforms baseline methods in terms of planning effectiveness and constraint satisfaction, and scales to scenarios with 108 robots and 160 obstacles.

2605.27690 2026-05-28 cs.CL cs.LG

TRACES: Proactive Safety Auditing for Multi-Turn LLM Agents via Trajectory-State Modeling

Jiaqian Li, Yanshu Li, Boxuan Zhang, Ruixiang Tang, Kuan-Hao Huang

Affiliations: Brown University; The University of Texas at Austin; Rutgers University; Texas A&M University

AI summary: Proposes TRACES, which learns prefix-level trajectory risk states from the hidden representations of an observer LLM to enable proactive safety auditing in multi-turn tool-use settings, improving full-trajectory safety prediction and proactive risk discrimination.

Abstract

LLM agents increasingly operate through multi-turn tool use and environment interaction, where safety risks often emerge from intermediate steps long before they surface in the final outcome. Reactive auditing is therefore insufficient: post-hoc diagnosis frequently misses the chance to flag risks while they are unfolding. We propose TRACES, a representation-based proactive auditor that learns prefix-level trajectory risk states from the hidden representations of an observer LLM. TRACES induces latent mechanism features from step representations and models their temporal evolution to estimate whether a partial trajectory is drifting toward unsafe behavior. To sidestep the cost and ambiguity of step-level risk annotation, TRACES is trained with weak trajectory-level supervision while still producing dense prefix-level risk estimates. Across multiple agent safety benchmarks, TRACES improves both full-trajectory safety prediction and proactive risk discrimination. Our analyses further suggest that these risk states can help train a safer agent, highlighting the broader potential of proactive auditing for long-horizon agent safety.

2605.27689 2026-05-28 cs.LG cs.CR

Test-Time Collective Action: Proxy-Based Perturbations for Correcting Algorithmic Harms

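Meghana Bhange, Ulrich Aïvodji, Elliot Creager

Affiliations: ÉTS Montréal; Mila; University of Waterloo; Vector Institute

AI summary: Proposes the Test-Time Collective Action framework, in which users pool query access to a black-box API to extract a proxy model and optimize per-class universal perturbations, correcting subgroup performance gaps at inference time without taking part in the platform's training.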

Abstract

When machine learning systems underperform for particular subgroups, affected users typically have no way to correct these disparities without relying on platform-level fixes. Existing approaches to algorithmic fairness are provider-centric, leaving users with no external lever when faced with harm. Recent work in Algorithmic Collective Action shows that coordinated users can steer an algorithmic system toward a collective goal, but the existing mechanisms require the provider to retrain on the collective's modified data, which users may not have control over. We propose Test-Time Collective Action (TTCA), a framework through which a group of users who share query access to the platform can correct disparities affecting under-served subgroups without participating in the platform's training loop. We implement this through a proxy-based mechanism where the collective pools query access to a black-box API to extract a proxy of the platform, then optimizes a per-class universal perturbation against the proxy. Each member applies this perturbation to their own inputs at submission time, requiring no cooperation from the platform. We empirically evaluate the mechanism on CIFAR-10, CIFAR-100, and FairFace, showing that modestly sized collectives close most of the subgroup accuracy gap, transfer across architectures (a small proxy can attack a larger platform), and improve worst-group accuracy, equal-opportunity gap, and disparate impact. A query-budget analysis against a per-user black-box attack baseline shows that pooling is cheaper than each subgroup member attacking alone. Test-time collective action thus offers corrective intervention to users when platform-side remediation is unavailable or delayed.

2605.27686 2026-05-28 cs.CV cs.AI

Tensor Memory: Fixed-Size Recurrent State for Long-Horizon Transformers

Kabir Swain, Sijie Han, Daniel Karl I. Weidele, Mauro Martino, Antonio Torralba

Affiliations: Massachusetts Institute of Technology, Cambridge, MA, USA; IBM Research, Cambridge, MA, USA; University of Toronto, Toronto, Canada

AI summary: Proposes the Tensor Memory module, which augments Transformers with a fixed-size recurrent 3D tensor state to decouple state capacity from input length while preserving a spatial inductive bias, targeting long-horizon video understanding.

Abstract

Transformers process images and videos by flattening space and time into long token sequences. While attention and KV caching preserve past features, their memory grows with sequence length and they lack an explicit, persistent spatial state, making long-horizon video understanding and occlusion-sensitive reasoning difficult. We propose Tensor Memory, a lightweight module that augments Transformer blocks with a fixed-size recurrent 3D memory tensor: tokens write into a voxel grid via a differentiable soft write that deposits content as a Gaussian-weighted volume around a predicted continuous 3D location, the memory is updated with an efficient local interaction operator and gated recurrent dynamics, and tokens read back context via continuous sampling with gated residual fusion. Because the memory tensor has a constant size, Tensor Memory decouples state capacity from input length while preserving a spatial inductive bias. We evaluate the module on standard language, image, and video benchmarks and on a controlled toy diagnostic suite designed to isolate when persistent state is beneficial; it integrates with standard Transformer training pipelines and can be attached to or removed from existing blocks without other architectural changes.

2605.27681 2026-05-28 cs.AI cs.LG

Behavioural Analysis of Alignment Faking

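Nathaniel Mitrani Hadida, Rhea Karty, David Williams-King, Alan Cooney

Affiliations: University of Cambridge; Harvard University; ERA; UK AISI

AI summary: Studies alignment faking in a controlled, minimal setup and finds that its drivers include values, goal guarding, and sycophancy, that it is more widespread than previously reported, and that it can be predicted from situational cues and model tendencies.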

Comments preprint

Abstract

Alignment faking (AF) refers to a model strategically complying with a training objective to avoid behavioural modification while preserving its deployment preferences. Understanding when and why AF arises matters as models grow better at distinguishing training from deployment. Prior work finds AF fragile, prompt-sensitive, and model-dependent, leaving its underlying drivers unclear. We study AF in a controlled, minimal setup that isolates its core components, and observe it across a wider range of models than previously reported, including small-scale models. We identify three separable drivers (values, goal guarding, and sycophancy) and show via targeted prompt ablations and activation steering that each independently modulates AF behaviour. Our results indicate AF is more widespread than previously reported and that its occurrence is predictable from situational cues and measurable model tendencies such as baseline sycophancy and stated values. The decomposition suggests concrete directions for detecting and mitigating AF in future models.

2605.27678 2026-05-28 cs.LG cs.DC

Heterogeneous Parallelism for Multimodal Large Language Model Training

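Yashaswi Karnati, Kamran Jafari, Akash Mehra, Li Ding, Pranav Prashant Thombre, Ali Roshan Ghias, Shifang Xu, Parth Mannan, Yu Yao, Hao Wu, Eric Harper, Ashwath Aithal, Nima Tajbakhsh

Affiliations: NVIDIA

AI summary: To address the throughput bottleneck caused by a single LLM-centric parallel layout in multimodal LLM training, proposes a heterogeneous parallelism abstraction that lets modules use independent layouts and placements, with boundary communicators preserving tensor semantics. Experiments show TFLOPS/GPU gains of up to 49.3%.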

Abstract

Foundation model training is becoming multimodal, from post-training pipelines to large-scale pretraining. As modality coverage broadens, context windows grow, and encoder and LLM scales diverge, a single LLM-centric TP/CP/PP/DP/EP layout increasingly limits throughput. This coupling forces encoders to inherit LLM-driven sharding and placement choices that can add communication, limit encoder parallelism, or constrain the LLM schedule; the mismatch is most pronounced at long contexts, where LLM context parallelism is needed for the fused multimodal sequence but encoder inputs remain bounded. We present heterogeneous parallelism for multimodal large language model training, an abstraction that lets modules in one end-to-end graph use independent layouts and rank placements, supporting colocated execution on shared GPUs and non-colocated execution on disjoint rank sets. The key challenge is preserving boundary tensor semantics across independent layouts: forward activations must be materialized for the destination layout, while backward gradients must be routed back to the source layout. We address this with boundary communicators that implement forward and backward layout transforms, plus scheduling extensions for both placement modes. We evaluate optimized homogeneous, colocated heterogeneous, and non-colocated heterogeneous configurations across multimodal workloads and GPU scales to characterize when added layout and placement freedom exposes a better operating point. Across this sweep, colocated heterogeneity improves TFLOPS/GPU by up to 49.3%, while non-colocated heterogeneity improves aggregate token throughput by up to 13.0% and TFLOPS/GPU by up to 9.6%. We validate loss convergence parity against homogeneous baselines and release the system as an open-source Megatron-LM extension.

2605.27673 2026-05-28 cs.LG

When do complex-valued neural networks help? A study of representation, geometry, and optimization

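Ashutosh Kumar

Affiliations: Owl Autonomous Imaging, Inc.; RIT

AI summary: Compares complex-valued neural networks with several real-valued baselines on synthetic RF, quantum-wavefunction, and EEG tasks, and finds that the advantage of complex-valued networks depends on representation, symmetry, and optimization rather than being universal.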

Abstract

Complex-valued Neural Networks (CVNNs) are often motivated by domains where information is naturally encoded in magnitude and phase. Yet complex-valued inputs alone do not determine when complex arithmetic improves learning: the label signal may lie in amplitude, phase, their coupling, or a symmetry that real-valued models can also represent under suitable coordinates. We study this through a representation-first evaluation of CVNNs against Cartesian real, polar, phase-only, magnitude-only, parameter-matched real, and FLOP-matched real baselines. Across synthetic RF tasks, complex representations are useful but not universally superior. PSK-only tasks favor phase-aware and complex-valued models, QAM-only tasks favor magnitude-based models, mixed PSK+QAM gives only a small complex-valued advantage, and unseen carrier-phase rotations break coordinate-dependent models without augmentation. Similar patterns appear beyond RF: in quantum-wavefunction prediction, momentum is invisible to $|\psi|$ but recoverable from phase, while EEG analytic-signal experiments show that phase locking, amplitude bursts, and phase-amplitude coupling each favor different coordinate views. We also identify a benchmarking artifact on RadioML 2018.01A. Under matched-shared-trial selection, a CReLU complex model exceeds the best real baseline by 22.94 percentage points; under independent per-family tuning on the same data and 16-trial search space, the gap collapses to 2.46 percentage points. Gradient analysis traces the inflated gap to high-learning-rate first-step instability in real baselines, while complex parameter coupling distributes the loss signal more robustly. A learning-rate $\times$ activation factorial confirms the failure is primarily hyperparameter-driven. Overall, CVNNs are best viewed as structured inductive biases whose gains depend on representation, symmetry, and optimization, not as universally superior architectures.
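
The remark about unseen carrier-phase rotations can be illustrated with a few complex samples. The sketch below is an assumption-laden toy, not the paper's experiment: it shows that a global phase rotation leaves magnitudes and sample-to-sample relative phases unchanged while changing the Cartesian (real, imaginary) coordinates. The names `rotate` and `relative_phases` are hypothetical.

```python
import cmath
import math


def rotate(samples, theta):
    """Apply a global carrier-phase rotation of `theta` radians to every sample."""
    phasor = cmath.exp(1j * theta)
    return [z * phasor for z in samples]


def relative_phases(values):
    """Phase of each sample relative to the previous one."""
    return [cmath.phase(b * a.conjugate()) for a, b in zip(values, values[1:])]


samples = [1 + 0j, 0 + 1j, -1 + 0j]
rotated = rotate(samples, math.pi / 3)

magnitudes_before = [abs(z) for z in samples]
magnitudes_after = [abs(z) for z in rotated]

cartesian_before = [(z.real, z.imag) for z in samples]
cartesian_after = [(z.real, z.imag) for z in rotated]

phases_before = relative_phases(samples)
phases_after = relative_phases(rotated)
```

A model fed raw real and imaginary parts sees different inputs after the rotation, whereas one fed magnitudes or relative phases does not, which is consistent with coordinate-dependent models needing augmentation.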

2605.27668 2026-05-28 cs.LG cs.AI cs.CL

Aligning LLMs with Human Uncertainty: A Beta-Bernoulli Calibrator for LLM Forecasting

Hui Dai, Ryan Teehan, Parsa Torabian, Mengye Ren

Affiliations: Agentic Learning AI Lab; New York University; The University of Chicago; Chronologies AI

AI summary: Proposes the Beta-Bernoulli Calibrator (BBC), which combines binary outcomes and human forecast signals to convert an initial point estimate into a distribution over event likelihood, providing calibration and uncertainty quantification.

Abstract

Probabilistic forecasting estimates the likelihood of uncertain future events. To improve LLM forecasting, existing methods typically learn from binary outcomes to output verbalized forecasts. However, while aggregated human forecasts contain rich information in both the crowd probability estimate and the degree of agreement among forecasters, how to utilize these signals remains underexplored. To address this, we propose the Beta-Bernoulli Calibrator (BBC), which converts an initial point estimate forecast from any model into a distribution over event likelihood, using supervision from both binary outcomes and human forecasts. BBC models event likelihood $p \sim \text{Beta}(\alpha, \beta)$ and outcome $y \sim \text{Bernoulli}(p)$, with the mean as the calibrated point forecast and the variance as the epistemic uncertainty. Our results show that BBC generally provides better calibrated and more accurate forecasts than both traditional post-hoc calibration methods and models fine-tuned specifically for forecasting, while remaining lightweight and having good generalization. We also show that the epistemic uncertainty captured by BBC is a more reliable predictor of forecasting error than verbalized confidence.
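
The mean and variance readouts follow from the standard Beta distribution formulas, shown in the short sketch below. How BBC predicts the two Beta parameters is not described in the abstract, so the parameter values here are made up for illustration and the function names `beta_mean` and `beta_variance` are hypothetical.

```python
def beta_mean(alpha, beta):
    """Mean of Beta(alpha, beta): the calibrated point forecast."""
    return alpha / (alpha + beta)


def beta_variance(alpha, beta):
    """Variance of Beta(alpha, beta): read as epistemic uncertainty."""
    total = alpha + beta
    return (alpha * beta) / (total * total * (total + 1.0))


# Two hypothetical calibrator outputs with the same point forecast.
weak_evidence = (8.0, 2.0)
strong_evidence = (80.0, 20.0)

weak_mean = beta_mean(*weak_evidence)
strong_mean = beta_mean(*strong_evidence)
weak_var = beta_variance(*weak_evidence)
strong_var = beta_variance(*strong_evidence)
```

Both parameter pairs give a point forecast of 0.8, but the larger total concentration yields a much smaller variance, so the distribution separates how likely the event is from how sure the forecast is.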

2605.27662 2026-05-28 cs.LG cs.AI

How the Optimizer Shapes Learned Solutions in Equivariant Neural Networks

优化器如何塑造等变神经网络中的学习解

Teodor-Mihai Stupariu, Andrei Manolache

发表机构 * University of Stuttgart, Germany(斯图加特大学) International Max Planck Research School for Intelligent Systems, Germany(国际马克斯·普朗克智能系统研究学校) Tudor Vianu High School of Computer Science, Romania(托尔德·维安乌计算机科学高中)

AI总结 本文通过比较Muon和Adam优化器在点云和分子学习任务中的表现,发现Muon能改善等变神经网络的优化效果,并分析其导致更规则损失曲面和更高有效秩的机制。

Comments Accepted at ICML 2026 Workshop on Weight-Space Symmetries

详情
AI中文摘要

等变神经网络通过构造编码几何对称性,但它们通常难以优化,并且可能表现不如约束较少的架构。越来越多的研究通过架构修改(如约束松弛或近似等变)来解决这一问题,而优化器的作用相对未被充分探索。我们通过比较Muon和Adam在点云和分子学习设置下的多种等变和几何架构来研究这一方向。在对比最清晰的ModelNet40上,Muon在所有考虑的架构上均一致优于Adam。然后,我们通过Hessian估计、损失曲面可视化以及学习权重和中间表示的谱性质来分析训练后的ModelNet40检查点。Muon达到的检查点具有更大的Hessian曲率汇总但更规则的损失曲面,并且其学习权重和表示具有更高的稳定秩和有效秩。这些观察表明,优化器设计与几何归纳偏置之间的相互作用值得社区进一步关注。

英文摘要

Equivariant neural networks encode geometric symmetries by construction, yet they are often difficult to optimize and can underperform less constrained architectures. A growing body of work addresses this through architectural modifications such as constraint relaxation or approximate equivariance, while the role of the optimizer remains comparatively underexplored. We study this direction by comparing Muon and Adam across several equivariant and geometric architectures under point-cloud and molecular learning settings. On ModelNet40, where the comparison is clearest, Muon consistently improves over Adam across all architectures considered. We then analyze the trained ModelNet40 checkpoints through Hessian estimates, loss surface visualizations, and spectral properties of learned weights and intermediate representations. The checkpoints reached by Muon have larger Hessian curvature summaries but more regular loss surfaces, and their learned weights and representations have higher stable and effective ranks. These observations suggest that the interaction between optimizer design and geometric inductive bias deserves further attention from the community.
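The two spectral summaries named above can be illustrated with their common textbook definitions. The abstract does not say which definitions the authors use, so this is a sketch under the assumption that stable rank means $\|W\|_F^2 / \|W\|_2^2$ and effective rank means the exponential of the entropy of the normalized singular values. Both functions take precomputed singular values as input.

```python
import math


def stable_rank(singular_values):
    """Squared Frobenius norm divided by squared spectral norm."""
    squares = [s * s for s in singular_values]
    top = max(squares)
    if top == 0.0:
        raise ValueError("all singular values are zero")
    return sum(squares) / top


def effective_rank(singular_values):
    """exp(Shannon entropy) of the singular values normalized to sum to 1."""
    total = sum(singular_values)
    if total == 0.0:
        raise ValueError("all singular values are zero")
    probs = [s / total for s in singular_values if s > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)


flat_spectrum = [1.0, 1.0, 1.0, 1.0]   # energy spread evenly
peaked_spectrum = [10.0, 0.1, 0.1]     # one dominant direction
```

A flat spectrum scores close to the full dimension on both measures, while a spectrum dominated by one singular value scores close to 1.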

2605.27661 2026-05-28 cs.RO

Design of a Real-time Asynchronous Monocular Odometry for Planetary Exploration

面向行星探测的实时异步单目里程计设计

Benat Inigo, Florian Steidle, Wolfgang Stuerzl

发表机构 * Institute of Robotics and Mechatronics(机器人与机电研究所) German Aerospace Center (DLR)(德国航空航天中心(DLR)) University of Zaragoza(萨拉戈萨大学)

AI总结 针对行星探测中计算资源受限、环境复杂且高动态范围光照的挑战,提出一种基于误差状态卡尔曼滤波(ESKF)的实时异步事件相机单目里程计,利用异步事件流和RATE特征跟踪器实现连续相机运动估计。

详情
AI中文摘要

我们描述了面向行星探测的实时异步事件基单目里程计的初步设计。在严格的计算约束下运行,行星探测器经常遇到复杂、不可预测的环境,需要高速感知和对高动态范围(HDR)光照的鲁棒性。事件相机通过报告异步、像素级的亮度变化(微秒级分辨率)来满足这些需求,在极端光照条件下显著降低数据带宽同时保持鲁棒性。我们提出了一种基于误差状态卡尔曼滤波(ESKF)的方法,利用异步事件流连续估计相机自运动。相机状态通过RATE(一种实时异步特征跟踪器)生成的每个跟踪位置输出进行更新。

英文摘要

We describe our preliminary design of a real-time asynchronous event-based monocular odometry for planetary exploration. Operating under strict computational constraints, planetary rovers frequently encounter complex, unpredictable environments that demand high-speed sensing and robustness to high dynamic range (HDR) lighting. Event cameras address these needs by reporting asynchronous, pixel-wise brightness changes with microsecond resolution, significantly reducing data bandwidth while maintaining robustness in extreme lighting conditions. We propose an approach based on an Error-State Kalman Filter (ESKF) that leverages this asynchronous event stream to continuously estimate camera ego-motion. The camera state is updated with every tracked position output generated by RATE, a real-time asynchronous feature tracker.

2605.27659 2026-05-28 cs.LG cs.AI

Transferable Reinforcement Learning via Probabilistic Latent Embeddings and Dynamic Policy Adaptation for Sim-to-Real Deployment

通过概率潜在嵌入和动态策略自适应实现迁移强化学习用于Sim-to-Real部署

Gengyue Han, Yiheng Feng

发表机构 * Lyles School of Civil and Construction Engineering, Purdue University, West Lafayette, USA(普渡大学土木与建设工程学院) Elmore Family School of Electrical and Computer Engineering, Purdue University, West Lafayette, USA(普渡大学埃尔莫尔家庭电气与计算机工程学院)

AI总结 提出一种基于概率潜在嵌入和动态策略自适应的强化学习框架,通过元学习推断环境潜在表示并动态调整风险水平,实现安全高效的Sim2Real策略迁移。

详情
AI中文摘要

由于资源有限和公共安全问题,许多信息物理系统(如自动驾驶汽车)的深度强化学习(RL)智能体首先在模拟器中进行训练。然而,当部署到真实世界环境中时,由于不可避免的Sim2Real差距,它们常常遭受性能下降或安全违规。现有的零样本方法,如鲁棒安全RL和域随机化,缓解了这一问题,但通常以性能下降或遇到未建模系统动态时的残余安全风险为代价。为了解决这些限制,我们提出了一种新颖的强化学习框架,通过概率潜在嵌入和动态策略自适应实现安全高效的策略迁移。我们考虑在不同环境上下文下的一族约束马尔可夫决策过程(CMDP)。通过利用元RL中的潜在上下文变量,所提出的框架从模拟经验中推断环境的潜在表示。此外,它结合了分布RL公式,允许根据潜在上下文变量的估计精度动态调整部署策略的风险水平。该策略在早期部署阶段促进安全性,并通过在Sim2Real差距下的快速策略自适应提高效率。

英文摘要

Due to limited resources and public safety concerns, deep reinforcement learning (RL) agents for many cyber-physical systems (e.g., autonomous vehicles) are first trained in simulators. However, when deployed in real-world environments, they often suffer from performance degradation or safety violations because of the inevitable Sim2Real gap. Existing zero-shot approaches, such as robust safe RL and domain randomization, mitigate this issue but typically at the cost of degraded performance or residual safety risks when experiencing unmodeled system dynamics. To address these limitations, we propose a novel reinforcement learning framework that enables safe and efficient policy transfer via probabilistic latent embeddings and dynamic policy adaptation. We consider a family of Constrained Markov Decision Processes (CMDPs) under different environment contexts. By leveraging a latent context variable from meta-RL, the proposed framework infers the latent representation of the environment from simulated experiences. Furthermore, it incorporates a distributional RL formulation, which allows the risk level of the deployed policy to be adjusted dynamically based on the estimation accuracy of the latent context variable. This strategy promotes safety at the early deployment stage and improves efficiency through fast policy adaptation under the Sim2Real gap.

2605.27651 2026-05-28 cs.LG

Faster Thermal Profiling of a Lunar Rover with Machine Learning Adapted Finite Difference Model

基于机器学习自适应有限差分模型的月球车快速热特性分析

Samuel Weber, Zaki Hasnain, Souma Chowdhury

发表机构 * University at Buffalo(布法罗大学)

AI总结 提出一种物理信息机器学习框架,通过自适应粗网格划分和可微有限差分模拟器,在保持物理一致性的同时实现月球车热建模的精度与效率平衡。

详情
AI中文摘要

在极端热环境下运行的自主空间系统需要准确且高效的热建模来支持任务前系统设计和机载自主性。对于月球车而言,大温度梯度、辐射传热和可变表面条件使得可靠的热预测尤其具有挑战性。高保真物理仿真提供准确结果但计算成本高,而简化模型和查表方法往往缺乏足够精度。物理信息机器学习(PIML)通过将数据驱动模型与嵌入的物理知识相结合,提供了一种有前景的替代方案。本文提出了一种用于带有内部热源的简化月球车热分析的PIML框架,其中机器学习实现了环境自适应粗网格划分。所提出的架构集成了一种迁移神经网络(TNN),该网络根据热载荷和初始条件自适应地确定三维有限差分节点划分,从而实现更准确的粗网格计算。框架内嵌了一个可微有限差分热模拟器,以强制执行物理一致性并支持高效训练,同时一个上采样层从粗网格解重建高分辨率温度场。所提出的PIML方法与高保真细网格仿真、低保真固定粗网格模型以及纯数据驱动的人工神经网络(ANN)进行了对比评估。结果表明,相对于粗网格物理模型和ANN模型,PIML框架分别将预测精度提高了50%和39%,同时保持了物理一致的热分布。在计算方面,该框架也比高保真仿真快3倍,展示了在月球车系统热建模中精度与效率之间的有效平衡。

英文摘要

Autonomous space systems operating in extreme thermal environments require accurate and efficient thermal modeling to support both pre-mission system design and onboard autonomy. For lunar rovers, large temperature gradients, radiative heat transfer, and variable surface conditions make reliable thermal prediction especially challenging. High-fidelity physics-based simulations provide accurate results but are computationally expensive, while simplified models and lookup-table approaches often lack sufficient accuracy. Physics-informed machine learning (PIML) offers a promising alternative by combining data-driven models with embedded physical knowledge. This paper presents a PIML framework for thermal analysis of a simplified lunar rover with internal heat sources, where machine learning enables environment-adaptive coarse meshing. The proposed architecture integrates a transfer neural network (TNN) that adaptively determines 3D finite-difference nodalization based on thermal loads and initial conditions, enabling more accurate coarse-mesh calculations. A differentiable finite-difference thermal simulator is embedded within the framework to enforce physical consistency and support efficient training, while an upscaling layer reconstructs high-resolution temperature fields from the coarse-grid solution. The proposed PIML approach is evaluated against high-fidelity fine-mesh simulations, low-fidelity fixed coarse-mesh models, and a purely data-driven artificial neural network (ANN). Results show that the PIML framework improves prediction accuracy by 50% and 39% relative to the coarse-mesh physics model and ANN model, respectively, while maintaining physically consistent thermal distributions. Computationally, the framework is also 3x faster than high-fidelity simulations, demonstrating an effective balance between accuracy and efficiency for thermal modeling of lunar rover systems.

2605.27649 2026-05-28 cs.CL cs.LG

Disentangling Language Roles in Multilingual LLM Task Execution

多语言大模型任务执行中的语言角色解耦

Qishi Zhan, Minxuan Hu, Seoyeon Jang, Lei Zhao, Ziheng Chen, Man Liang, Xinyue Xiang, Jiaxin Liu, Guansu Wang, Liang He

发表机构 * Marquette(马凯特大学) Cornell(康奈尔大学) UC San Diego(加州大学圣地亚哥分校) UPenn(宾夕法尼亚大学) UT Austin(德克萨斯大学奥斯汀分校) Maryland(马里兰大学) Michigan(密歇根大学) UIUC(伊利诺伊大学厄巴纳-香槟分校) Melbourne(墨尔本大学) Stanford(斯坦福大学)

AI总结 提出MTM-Bench基准,通过完全交叉设计解耦指令、内容和响应三种语言角色,评估多语言LLM的任务执行能力,发现响应语言角色是性能下降的主要因素。

详情
AI中文摘要

多语言大模型在指令、源内容和所需响应语言不一致时被越来越多地使用。现有基准扩展了多语言指令跟随评估,但很少在完全交叉设计中隔离这三种角色。我们引入了MTM-Bench,一个用于语言条件任务执行的控制基准,其中每个实例由三元组 \((L_{\text{instr}}, L_{\text{content}}, L_{\text{resp}})\) 定义。在英语、西班牙语和中文中,MTM-Bench枚举了所有27个三元组,每个模型包含2,430个实例,涵盖语义反转、最终状态提取和带更新实现的语言纯度。我们使用分解指标评估了20个前沿和开源权重LLM,包括语义正确性、目标语言遵循度、约束满足度、污染比率和联合成功率,并通过针对性的人工审计验证评分。完全交叉设计揭示了性能下降是由语言在任务结构中扮演的角色组织的,而不仅仅是语言不匹配的数量。响应语言角色是变化的主要轴,单个响应槽不匹配导致了大部分性能下降。仅响应不匹配与完全不匹配的比较表明,不匹配数量不是困难的单调预测因子,模型级别的排序在不同系统间变化。任务族通过不同的通道失败,表明语义正确性本身并不能捕捉可靠的多语言任务执行。

英文摘要

Multilingual LLMs are increasingly used when instruction, source content, and required response languages do not coincide. Existing benchmarks have expanded multilingual instruction-following evaluation, but they rarely isolate these three roles within a fully crossed design. We introduce MTM-Bench, a controlled benchmark for language-conditioned task execution in which each instance is defined by a triplet \((L_{\text{instr}}, L_{\text{content}}, L_{\text{resp}})\). Across English, Spanish, and Chinese, MTM-Bench enumerates all 27 triplets and contains 2,430 instances per model across semantic reversal, final-state extraction, and language purity with update realization. We evaluate 20 frontier and open-weight LLMs using decomposed metrics for semantic correctness, target-language adherence, constraint satisfaction, contamination ratio, and joint success, with scoring validated by a targeted human audit. The fully crossed design reveals that degradation is organized by the role a language occupies in the task structure, not merely by mismatch count. The response-language role is the dominant axis of variation, and a single response-slot mismatch accounts for most degradation. The response-only and full-mismatch comparison suggests that mismatch count is not a monotonic predictor of difficulty, with model-level ordering varying across systems. Task families fail through distinct channels, showing that semantic correctness alone does not capture reliable multilingual task execution.
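The fully crossed design can be enumerated directly. The sketch below lists all language triplets for three languages and separates the conditions the abstract compares. The language codes, the helper names, and the definitions of "response-only mismatch" (instruction and content share a language, the response differs) and "full mismatch" (all three languages differ) are assumptions made for illustration.

```python
from itertools import product

LANGUAGES = ("en", "es", "zh")  # English, Spanish, Chinese

# Every (instruction, content, response) combination: 3^3 = 27.
triplets = list(product(LANGUAGES, repeat=3))


def is_fully_matched(triplet):
    instr, content, resp = triplet
    return instr == content == resp


def is_response_only_mismatch(triplet):
    instr, content, resp = triplet
    return instr == content and resp != instr


def is_full_mismatch(triplet):
    return len(set(triplet)) == 3


matched = [t for t in triplets if is_fully_matched(t)]
response_only = [t for t in triplets if is_response_only_mismatch(t)]
full_mismatch = [t for t in triplets if is_full_mismatch(t)]

# If instances were spread evenly (an assumption), each triplet would get:
instances_per_triplet = 2430 // len(triplets)
```

Under these definitions there are 3 fully matched, 6 response-only, and 6 fully mismatched triplets, and an even split of the 2,430 instances would give 90 per triplet.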

2605.27646 2026-05-28 cs.LG cs.AI

Hurwitz Quaternion Multiplicative Quantization for KV Cache Compression

Hurwitz四元数乘法量化用于KV缓存压缩

Kabir Swain, Sijie Han, Daniel Karl I. Weidele, Mauro Martino, David Cox, Antonio Torralba

发表机构 * Massachusetts Institute of Technology, Cambridge, MA, USA(麻省理工学院) IBM Research, Cambridge, MA, USA(IBM研究院) University of Toronto, Toronto, Canada(多伦多大学)

AI总结 提出一种免校准的Hurwitz四元数乘法量化方法,通过将K/V的4元素块视为四元数并用量化乘积编码,在约5比特下匹配fp16困惑度,实现高达5.05倍KV缓存压缩。

详情
AI中文摘要

我们提出Hurwitz四元数乘法量化(HQMQ),一种用于大语言模型KV缓存压缩的免校准方法。HQMQ将K或V的每个4元素块视为一个四元数,并将其单位方向量化到乘积$q_p \cdot q_s$上,其中$q_p$取自24元素Hurwitz群$2T$($S^3$上24-cell的24个顶点,两两夹角$60^\circ$),$q_s$取自每个(层、头)的二级码本,包含$S$个随机单位四元数。乘法组合在$S$个存储参数下产生$24S$个有效码字;随机初始化即可,因为左乘是$S^3$等距变换,因此不同随机种子的码本在最终任务困惑度上的变化小于$1.5\%$。一个每批次的中位数乘数离群值提取步骤($C=3$,无校准)处理现代离群值密集型架构。我们在五个现代开源模型上评估:Mistral-7B(密集MHA)、Llama-3-8B、Qwen2.5-7B和Qwen3-8B(密集GQA),以及gpt-oss-20b(稀疏MoE)。在Mistral-7B和Qwen3-8B上,HQMQ在约5比特下匹配fp16,困惑度差异在$0.02$--$0.03$点内。在Qwen2.5-7B和Qwen3-8B上,朴素int4导致困惑度崩溃到$10^4+$,而HQMQ + Med3$\times$在约5比特下恢复fp16质量,差异在$0.02$--$0.10$点内。在所有五个模型上,HQMQ在相同比特数下以$3$--$1900\times$的幅度帕累托优于朴素int,并且在Mistral上以3.79比特的下游零样本准确率匹配fp16。与最强的校准KV量化基线相比,HQMQ在3.79比特下匹配KIVI-4(约4.5比特),在CoQA上差异约1点,TruthfulQA上0.6点,GSM8K上2.3点,同时比特数减少16%且无需校准过程。在存储层面,HQMQ提供高达5.05倍的KV压缩,将Llama-3-70B的128k上下文缓存从43 GB缩小到8.5 GB。

英文摘要

We propose Hurwitz Quaternion Multiplicative Quantization (HQMQ), a calibration-free method for KV cache compression of large language models. HQMQ treats each 4-element chunk of K or V as a quaternion and quantizes its unit direction to the product $q_p \cdot q_s$, where $q_p$ ranges over the 24-element Hurwitz group $2T$ (the 24 vertices of the 24-cell on $S^3$, pairwise angle $60^\circ$) and $q_s$ ranges over a per-(layer, head) secondary codebook of $S$ random unit quaternions. The multiplicative composition yields $24S$ effective codewords at $S$ stored parameters; random initialization suffices because left-multiplication is an $S^3$ isometry, so seeded codebooks vary in end-task ppl by $<1.5\%$. A per-batch median-multiplier outlier extraction step ($C=3$, no calibration) handles modern outlier-heavy architectures. We evaluate on five modern open models: Mistral-7B (dense MHA), Llama-3-8B, Qwen2.5-7B, and Qwen3-8B (dense GQA), and gpt-oss-20b (sparse MoE). On Mistral-7B and Qwen3-8B, HQMQ matches fp16 within $0.02$--$0.03$ ppl points at $\sim$5 bits. On Qwen2.5-7B and Qwen3-8B, where naive int4 collapses to $10^4+$ ppl, HQMQ + Med3$\times$ recovers fp16 quality within $0.02$--$0.10$ ppl points at $\sim$5 bits. HQMQ Pareto-dominates naive int by $3$--$1900\times$ at matched bits across all five models, and downstream zero-shot accuracy matches fp16 at $3.79$ bits on Mistral. Against the strongest calibrated KV-quantization baseline, HQMQ at $3.79$ bits matches KIVI-4 ($\sim 4.5$ bits) within $\sim 1$ pt on CoQA, $0.6$ pts on TruthfulQA, and $2.3$ pts on GSM8K, at $16\%$ fewer bits and without a calibration pass. At the storage level, HQMQ delivers up to $5.05\times$ KV compression, shrinking a Llama-3-70B 128k-context cache from 43 GB to 8.5 GB.
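The product codebook described above can be sketched in a few lines. The code builds the 24 unit Hurwitz quaternions (the 8 signed basis units plus the 16 elements with all components equal to ±1/2), then quantizes the direction of a 4-element chunk to the closest product $q_p \cdot q_s$ by exhaustive search. The exhaustive search, the function names, and returning the chunk norm unquantized are all assumptions for illustration; the outlier extraction step and the actual bit packing are not modeled.

```python
import math
from itertools import product


def qmul(a, b):
    """Hamilton product of quaternions given as (w, x, y, z) tuples."""
    w1, x1, y1, z1 = a
    w2, x2, y2, z2 = b
    return (
        w1 * w2 - x1 * x2 - y1 * y2 - z1 * z2,
        w1 * x2 + x1 * w2 + y1 * z2 - z1 * y2,
        w1 * y2 - x1 * z2 + y1 * w2 + z1 * x2,
        w1 * z2 + x1 * y2 - y1 * x2 + z1 * w2,
    )


def hurwitz_units():
    """The 24 unit Hurwitz quaternions (binary tetrahedral group 2T)."""
    units = []
    for axis in range(4):
        for sign in (1.0, -1.0):
            q = [0.0, 0.0, 0.0, 0.0]
            q[axis] = sign
            units.append(tuple(q))
    units.extend(product((0.5, -0.5), repeat=4))
    return units


PRIMARY = hurwitz_units()


def quantize_direction(chunk, secondary):
    """Return (primary index, secondary index, norm) for a nonzero chunk."""
    norm = math.sqrt(sum(c * c for c in chunk))
    unit = tuple(c / norm for c in chunk)
    best = (-2.0, 0, 0)  # cosine similarity is always >= -1
    for p_idx, q_p in enumerate(PRIMARY):
        for s_idx, q_s in enumerate(secondary):
            code = qmul(q_p, q_s)
            score = sum(u * c for u, c in zip(unit, code))
            if score > best[0]:
                best = (score, p_idx, s_idx)
    return best[1], best[2], norm


# Trivial secondary codebook holding only the identity quaternion.
p_idx, s_idx, norm = quantize_direction((2.0, 2.0, 2.0, 2.0), [(1.0, 0.0, 0.0, 0.0)])
```

With a secondary codebook of size $S$, the search covers $24S$ candidate directions while only the $S$ secondary quaternions need to be stored, which matches the codeword count stated in the abstract.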

2605.27644 2026-05-28 cs.RO cs.AI cs.LG

Trinity: Unifying Class-Agnostic Terrain and Semantic Segmentation for Unstructured Outdoor Environments by Leveraging Synthetic Data

Trinity:通过利用合成数据统一非结构化户外环境中的类无关地形与语义分割

Marcus G Müller, Wout Boerdijk, Maximilian Durner, Riccardo Giubilato, Abel Gawel, Wolfgang Stürzl, Roland Siegwart, Rudolph Triebel

发表机构 * Institute of Robotics and Mechatronics, German Aerospace Center (DLR)(机器人与机电系统研究所,德国航空航天中心(DLR)) Federal Institute of Technology Zurich (ETH Zurich)(苏黎世联邦理工学院(ETH Zurich)) Robotics and AI Institute (RAI)(机器人与人工智能研究所(RAI))

AI总结 提出基于Transformer的统一网络Trinity,联合执行类特定语义分割和类无关地形分割,利用合成数据集RUGDSynth和真实数据集EXTerra实现机器人无关的地形先验学习。

详情
AI中文摘要

地形理解对于在非结构化户外环境中运行的移动机器人至关重要。现有的基于视觉的可通行性估计方法依赖于机器人特定的标注或语义类别映射,限制了跨平台的迁移性,并在机器人能力变化时需要昂贵的重新标注,而标准的语义分割方法仅关注特定的预定义类别,无法捕捉地形的多样性。在这项工作中,我们提出了一种基于Transformer的架构,在统一网络Trinity中联合执行类特定语义分割和类无关地形分割。地形区域仅基于视觉外观进行分割,无需预定义的语义标签或机器人相关的可通行性分数。这种公式使得学习机器人无关的视觉地形先验成为可能,这些先验可以与机器人特定的经验相结合,用于下游任务,如可通行性估计、视觉里程计和任务规划。为了实现具有多样地形外观的大规模训练,我们扩展了OAISYS模拟器,并引入了RUGDSynth,这是一个受RUGD启发、包含类无关地形样本的合成数据集。此外,我们提出了EXTerra数据集,提供了带有类特定和类无关地形标签的真实世界图像。实验证明了所提出任务的可行性以及我们的联合分割方法在复杂户外环境中的有效性。代码和数据集将在本出版物发布后(经过审查)公开。

英文摘要

Terrain understanding is fundamental for mobile robots operating in unstructured outdoor environments. Existing vision-based traversability estimation methods rely on robot-specific annotations or semantic class mappings, limiting transferability across platforms and requiring costly re-annotation when robot capabilities change, while standard semantic segmentation methods only focus on specific predefined classes, which do not capture the variety of terrains. In this work, we propose a transformer-based architecture that jointly performs class-specific semantic segmentation and class-agnostic terrain segmentation within a unified network, called Trinity. Terrain regions are segmented based solely on visual appearance, without predefined semantic labels or robot-dependent traversability scores. This formulation enables the learning of robot-agnostic visual terrain priors that can be combined with robot-specific experience for downstream tasks such as traversability estimation, visual odometry, and mission planning. To enable large-scale training with diverse terrain appearances, we extend the OAISYS simulator and introduce RUGDSynth, a synthetic dataset inspired by RUGD with class-agnostic terrain samples. Furthermore, we present the EXTerra Dataset, providing real-world images annotated with both class-specific and class-agnostic terrain labels. Experiments demonstrate the feasibility of the proposed task and the effectiveness of our joint segmentation approach in complex outdoor environments. Code and datasets will be released with this publication (after review).

2605.27643 2026-05-28 cs.RO physics.optics

Agentic Language-to-Objective Synthesis for Optofluidic Assembly

面向光流控组装的智能体式语言到目标合成

Ivan Saraev, Elena Erben, Weida Liao, Fan Nan, Gerhard Neumann, Eric Lauga, Moritz Kreysing

发表机构 * Institute of Biological and Chemical Systems, Karlsruhe Institute of Technology, Germany(生物与化学系统研究所,卡尔斯鲁厄理工学院,德国) Department of Applied Mathematics and Theoretical Physics, University of Cambridge, UK(应用数学和理论物理系,剑桥大学,英国) Department of Mathematics, Imperial College London, UK(数学系,伦敦帝国理工学院,英国) Institute of Anthropomatics and Robotics (IAR), Karlsruhe Institute of Technology, Germany(人机学与机器人研究所(IAR),卡尔斯鲁厄理工学院,德国)

AI总结 提出Speak-to-Objective模块化智能流水线,利用条件大语言模型将口语或书面指令转换为可微目标函数,实现光流控微粒子组装,并支持用户反馈学习。

Comments 21 pages, 5 figures

详情
AI中文摘要

基于光的先进制造日益需要可编程、闭环工具,将人类设计意图转化为小尺度上的可执行操作。然而,在机器人和制造模式中仍存在一个关键瓶颈:将用户意图转化为机器可读且可靠执行的目标。尽管微机器人通过光驱动流体提供了多功能操控,但数学上可处理的目标规范仍然手动且难以重用。本文介绍Speak-to-Objective,一个模块化智能流水线,使用条件大语言模型将口语或书面指令转换为完全可微的目标函数,用于在约束感知逆求解器(SLSQP)和实验光流控平台上组装微粒。该方法采用紧凑循环——感知→组合→提议→行动→报告与学习——将目标作为意图与驱动之间的接口,分离组装或图案化什么与如何驱动,同时从用户反馈中学习。流水线组合几何、间距和分配/拓扑项,生成鲁棒的描述性目标,从部分轨迹组装并在扰动后恢复,以及用于精确定位的显式目标,所有均以执行器无关的方式。使用激光诱导热粘性流作为物理驱动模式,我们展示了自然语言可编程的、基于光的微尺度粒子图案组装在微流控环境中。除了对可编程微组装的直接影响,以及使用激光诱导光流控驱动作为降复杂度实验平台,我们的工作指向自驱动、AI辅助的光学制造平台,其中自然语言、可微目标和激光驱动耦合为可重复使用的数字工作流。

英文摘要

Light-based advanced manufacturing increasingly requires programmable, closed-loop tools that translate human design intent into executable operations at small length scales. Yet a key bottleneck persists across robotic and manufacturing modalities: turning user intent into machine-readable objectives that are reliably executable. While micro-robotics offers versatile manipulation via optical actuation of fluids, mathematically tractable goal specification remains manual and hard to reuse. Here, we introduce Speak-to-Objective, a modular agentic pipeline that uses a conditioned Large Language Model (LLM) to translate spoken or written commands into fully differentiable objective functions for assembling microparticles in a constraint-aware inverse solver (SLSQP) and on an experimental optofluidic platform. The approach employs a compact loop - perceive -> compose -> propose -> act -> report & learn - that treats the objective as the interface between intent and actuation, separating what to assemble or pattern from how to actuate, while learning from user feedback. The pipeline composes geometry, spacing, and assignment/topology terms to generate robust descriptive objectives that assemble from partial traces and recover after perturbations, as well as explicit objectives for precise placement, all in an actuator-agnostic fashion. Using laser-induced thermoviscous flows as the physical actuation modality, we demonstrate natural-language-programmable, light-based microscale assembly of particle patterns in a microfluidic environment. Beyond its immediate impact on programmable microassembly, and using laser-induced optofluidic actuation as a reduced-complexity experimental platform, our work points toward self-driving, AI-assisted optical manufacturing platforms in which natural language, differentiable objectives, and laser-based actuation are coupled into a reusable digital workflow.

2605.27642 2026-05-28 cs.CL cs.LG

Learning to Translate from Soft to Hard LLM Prompts

学习从软提示到硬提示的翻译

Pitipat Kongsomjit, Suryansh Goyal, Jacob Whitehill

发表机构 * Worcester Polytechnic Institute(伍斯特理工学院)

AI总结 本文通过训练一个专用的软提示到自然语言翻译模型,提高了翻译质量,并展示了软提示可以转化为可移植的文本提示,在大型闭源模型上超越原软提示甚至少样本学习。

Comments 8 Pages, 11 tables, 4 Figures

详情
AI中文摘要

软提示调优是一种参数高效的方法,用于使大型语言模型适应特定任务,但缺乏可解释性。基于最近关于解释软提示的工作(Ramati et al., 2024),我们探索了如何训练一个专用的软提示到自然语言翻译模型,以获得更高的翻译质量。特别是在多个数据集(DoDs)的定量和定性比较中,我们证明了我们的翻译器能够生成流畅、准确的表述,优于现有的无训练方法如InSPEcT。除了提高可解释性外,我们的工作还暗示了一个有前景的下游应用:在小规模开源模型上优化的软提示可以转化为可移植的文本提示,当部署在更大的闭源API模型上时,其性能超过了原始软提示,在某些情况下甚至超过了少样本学习。

英文摘要

Soft prompt tuning is a parameter-efficient method for adapting LLMs to specific tasks, but suffers from a lack of interpretability. Building on recent work on interpreting soft prompts (Ramati et al., 2024), we explore how training a dedicated soft-prompt-to-natural-language translation model can yield higher translation quality. In particular, in both quantitative and qualitative comparisons on multiple Datasets of Datasets (DoDs), we demonstrate that our translator produces fluent, accurate verbalizations that outperform existing training-free methods like InSPEcT. In addition to advancing interpretability, our work suggests a promising downstream application: soft prompts optimized on small, open-source models can be translated into portable text prompts that, when deployed on larger closed-API models, exceed the performance of the original soft prompt and, in some cases, even few-shot learning.

2605.27636 2026-05-28 cs.CL

Simorgh at SemEval-2026 task 7: Region-Aware Hybrid Retrieval for Low-Resource Cultural Reasoning in Multilingual Question Answering

Simorgh at SemEval-2026 task 7: 面向低资源文化推理的多语言问答中的区域感知混合检索

Hadi Bayrami Asl Tekanlou, Mahdi Bakhtiyarzadeh, Jafar Razmara

发表机构 * University of Tabriz(塔布里兹大学)

AI总结 提出区域感知混合检索方法,结合BM25和稠密语义相似度与区域加权启发式,以提升多语言文化问答的跨语言稳定性。

Comments 6 pages, 3 figures, accepted to the Everyday Knowledge Across Diverse Languages and Cultures shared task at SemEval2026

详情
AI中文摘要

尽管大型语言模型(LLMs)在通用领域的推理任务中表现出色,但在数字和文本数据有限的语种中,面对文化相关知识时可能遇到挑战。本文利用BLEnD基准研究文化相关的多项选择问答,该基准包含30种语言的多语料库,涵盖饮食、体育、家庭等社会文化领域。我们提出一种区域感知混合检索方法,结合BM25词汇匹配和稠密语义相似度与区域加权启发式,以提高答案的相关性。检索到的文档用于构建结构化提示,输入Qwen3-14B量化模型,并采用基于logit的确定性答案选择。实验结果表明,与纯参数推理相比,混合检索方法在文化问答中提升了跨语言稳定性。然而,训练数据量不同的语言之间仍存在显著性能差距,这表明检索增强方法并未完全克服训练数据不平衡问题。

英文摘要

Although Large Language Models (LLMs) perform well on general-domain reasoning tasks, they may struggle with culturally grounded knowledge in languages with limited digital and textual data. In this paper, we investigate culturally grounded multiple-choice question answering with the BLEnD benchmark, which consists of a multilingual corpus of 30 languages and covers various socio-cultural domains, such as cuisine, sports, and family. We propose a region-aware hybrid retrieval approach that combines BM25 lexical matching and dense semantic similarity with regional weighting heuristics to improve the relevance of the answer. The retrieved documents are used to construct a structured prompt for the Qwen3-14B quantized model with logit-based deterministic answer selection. The experimental results show that the hybrid retrieval approach improves cross-lingual stability over pure parametric inference for culturally grounded question answering. However, there are still notable performance gaps between languages with more and less training data. This shows that retrieval augmentation does not entirely overcome the training data imbalance problem.
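A region-aware hybrid score of the kind described above might look like the following sketch. The abstract does not give the fusion formula, so the linear mix of a normalized BM25 score with a dense similarity, the multiplicative region boost, and every parameter value here are hypothetical.

```python
def hybrid_score(bm25, dense, same_region, lexical_weight=0.5, region_boost=1.2):
    """Blend lexical and dense scores, then boost documents from the query's region.

    bm25 and dense are assumed to be pre-normalized to [0, 1].
    lexical_weight and region_boost are illustrative values, not the paper's.
    """
    base = lexical_weight * bm25 + (1.0 - lexical_weight) * dense
    return base * region_boost if same_region else base


def rank(candidates):
    """Sort (doc_id, bm25, dense, same_region) tuples by descending hybrid score."""
    return sorted(
        candidates,
        key=lambda c: hybrid_score(c[1], c[2], c[3]),
        reverse=True,
    )


candidates = [
    ("doc_a", 0.9, 0.2, False),  # strong lexical match, other region
    ("doc_b", 0.5, 0.5, True),   # moderate match, same region as the question
]
ranked_ids = [doc_id for doc_id, *_ in rank(candidates)]
```

In this toy case the region boost lifts the same-region document above one with a slightly higher unweighted score.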

2605.27622 2026-05-28 cs.AI cs.SC

Reasoning and Planning with Dynamically Changing Norms

动态变化规范的推理与规划

Taylor Olson, Roberto Salas-Damian, Kenneth D. Forbus

发表机构 * University of Iowa(爱荷华大学) Northwestern University(西北大学)

AI总结 本文提出一种在人类-AI环境中使用动态变化规范引导规划的方法,通过可废止演算解决规范冲突并将规范作为规划护栏,理论证明与对话任务实验验证了有效性。

Comments 8 pages, 1 figure, dataset included in anc

详情
AI中文摘要

为了安全地与人类交互,AI 智能体必须既了解我们的规范,又在规划时考虑它们。然而,这种规范引导的规划研究较少,且仅限于人工智能体社区内部,并忽略了规范的动态性。本文提出了一种在人类-AI 环境中使用动态变化规范引导规划的方法。我们贡献了一种用于解决规范冲突的可废止演算,以及一种使用这种动态变化规范作为规划护栏的方法。我们通过形式化证明在理论上展示了该方法,并通过 AI 智能体 SocialBot 在自然语言对话任务上进行了实证验证。

英文摘要

To safely interact with humans, AI agents must both know our norms and consider them during planning. However, such norm-guided planning has received little attention, has been studied only within communities of artificial agents, and has ignored the dynamic nature of norms. This paper instead presents an approach to guiding planning with dynamically changing norms in a human-AI setting. We contribute a defeasible calculus for resolving normative conflicts and an approach to using such dynamically changing norms as guard rails on plans. We demonstrate our approach theoretically with formal proofs and empirically with an AI agent, SocialBot, on a natural language dialogue task.

2605.27619 2026-05-28 cs.LG cs.AI

Supervised Distributional Reduction via Optimal Transport and Dependence Maximization

基于最优传输和依赖性最大化的有监督分布约简

Sai-Aakash Ramesh, Archit Sood, Andrew Corbett, Tim Dodwell

发表机构 * digiLab, UK(digilab英国实验室) University of Bristol, UK(布里斯托大学)

AI总结 提出有监督分布约简(SDR)算法,通过结合最优传输和显式依赖性最大化,学习同时保留数据几何结构和目标相关信号的紧凑表示。

详情
AI中文摘要

学习同时捕捉内在数据几何结构和目标相关结构的表示仍然是一个基本挑战,特别是在数据约简必须在压缩与预测保真度之间取得平衡的场景中。虽然分布约简(包括联合聚类和降维)提供了一种原则性的数据总结方法,但其有监督变体仍然相对未被充分探索,尽管保留任务相关信号对于下游预测和决策至关重要。我们提出有监督分布约简(SDR),一种通过结合最优传输和显式依赖性最大化来学习目标感知表示的算法。SDR 基于融合 Gromov-Wasserstein(FGW)目标,将输入分布的关系结构与一组代表点对齐,同时增加一个直接依赖性项,鼓励学习到的嵌入更明确地捕捉预测信号。这产生了反映几何结构和监督的紧凑表示。除了表示学习,SDR 自然地诱导出一种数据依赖的非平稳几何结构,可用于高斯过程(GP)建模等场景。通过目标感知的分布对齐重新定义距离,SDR 能够构建适应数据几何和监督局部变化的自适应核,为非平稳核设计提供了基于最优传输的视角。

英文摘要

Learning representations that capture both intrinsic data geometry and target-relevant structure remains a fundamental challenge, particularly in settings where data reduction must balance compression with predictive fidelity. While distributional reduction (encompassing joint clustering and dimensionality reduction) offers a principled way to summarize data, its supervised variants remain relatively under-explored, despite the importance of retaining task-relevant signal for downstream prediction and decision-making. We propose Supervised Distributional Reduction (SDR), an algorithm for learning target-aware representations by combining optimal transport with explicit dependence maximization. SDR builds on the Fused Gromov-Wasserstein (FGW) objective to align the relational structure of the input distribution with a set of representative points, while augmenting it with a direct dependence term that encourages the learned embeddings to capture predictive signal more explicitly. This results in compact representations that reflect both geometric structure and supervision. Beyond representation learning, SDR naturally induces a data-dependent, non-stationary geometry that can be leveraged for settings such as Gaussian Process (GP) modelling. By redefining distances through target-aware distributional alignment, SDR enables the construction of adaptive kernels that respond to local variations in both data geometry and supervision, offering an optimal transport-based perspective on non-stationary kernel design.

2605.27616 2026-05-28 cs.CV cs.AI

Not All NVFP4 QAT Recipes Are Equal: How Architecture and Scale Shape Model Quality for Anomaly Segmentation

并非所有 NVFP4 QAT 配方都相同:架构和规模如何影响异常分割的模型质量

Zijian Du, Oleg Rybakov

发表机构 * NVIDIA

AI总结 本研究通过统一协议评估多种架构、规模和 FP4 量化感知训练 (QAT) 配方在脑肿瘤异常分割任务中的交互作用,发现架构选择对量化鲁棒性影响最大,注意力机制架构对配方选择具有显著韧性,而 CNN 在大规模下受梯度量化配方影响性能下降。

Journal ref CVPR2026

详情
AI中文摘要

实时异常分割要求高召回率和高效的低精度推理。我们研究了模型架构、模型规模和 FP4 量化感知训练 (QAT) 配方在召回关键的脑肿瘤分割任务中的三方交互,在统一协议下评估了多种架构、规模和 QAT 配方。我们发现架构选择对量化鲁棒性影响最大,基于注意力的架构对配方选择表现出显著的韧性,而 CNN 在大规模下在梯度量化配方下性能下降。在低容量下,FP4 可能离散化 softmax 注意力,但高级 QAT 配方可防止这种崩溃。在更大规模下,高级配方减轻了降低 CNN 质量的梯度量化噪声。五折患者级交叉验证证实这些发现对数据划分具有鲁棒性。我们的结果表明,Swin Transformer 在所有规模下对 QAT 配方选择都具有鲁棒性,使其成为 FP4 量化异常分割的推荐架构。

英文摘要

Real-time anomaly segmentation demands both high recall and efficient low-precision inference. We study the three-way interaction of model architecture, model scale, and FP4 quantization-aware training (QAT) recipe on a recall-critical brain tumor segmentation task, evaluating multiple architectures, scales, and QAT recipes under a unified protocol. We find that architecture choice has the largest impact on quantization robustness, with attention-based architectures showing remarkable resilience to recipe choice while CNN degrades under gradient-quantizing recipes at larger scales. At low capacity, FP4 can discretize softmax attention, but advanced QAT recipes prevent this collapse. At larger scales, advanced recipes mitigate gradient quantization noise that degrades CNN quality. Five-fold patient-level cross-validation confirms these findings are robust to data partition. Our results show that the Swin Transformer is robust to QAT recipe choice across all scales, making it the recommended architecture for FP4-quantized anomaly segmentation.
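To show how coarse a 4-bit floating-point grid is, the sketch below fake-quantizes one block of values onto the E2M1 magnitudes (0, 0.5, 1, 1.5, 2, 3, 4, 6) with a single scale per block. It is only an illustration of round-to-nearest on that grid: the block scale is kept as an ordinary Python float rather than encoded in a low-precision format, and none of the QAT recipes compared in the paper (for example, gradient quantization) is modeled. The function name is hypothetical.

```python
import math

# Non-negative magnitudes representable with 2 exponent bits and 1 mantissa bit.
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def fake_quantize_block(block):
    """Round each value to the nearest scaled E2M1 level and return floats.

    The scale maps the largest magnitude in the block onto 6.0, the largest
    grid value. Ties resolve toward the smaller magnitude.
    """
    max_abs = max(abs(x) for x in block)
    if max_abs == 0.0:
        return [0.0] * len(block)
    scale = max_abs / 6.0
    quantized = []
    for x in block:
        target = abs(x) / scale
        level = min(E2M1_MAGNITUDES, key=lambda g: abs(target - g))
        quantized.append(math.copysign(level * scale, x))
    return quantized


example = fake_quantize_block([6.0, -3.0, 0.4, 2.4])
```

With only eight magnitudes per sign, nearby inputs collapse onto the same level, which gives some intuition for why low-capacity models may be sensitive to the choice of recipe.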

2605.27605 2026-05-28 cs.AI cs.SE

Laguna M.1/XS.2 Technical Report

Laguna M.1/XS.2 技术报告

Julien Abadji, Marah Abdin, Connor Adams, Eric Alcaide, Mustafa Altun, Michele Artoni, Junze Bao, Uday Barar, Vassilis Bekiaris, Arkadii Bessonov, Benjamin Bütikofer, Jonathan Chang, Yen-Chun Chen, Dmitry Chernenkov, Yang Chi, Filippos Christianos, Fenia Christopoulou, Razvan-Andrei Ciocoiu, Tzachi Cohen, Yohann Coppel, Dmitrii Emelianenko, Brandon Fergerson, Brian Fitzgerald, Matthias Gallé, Alex Golonzovskyi, George Grigorev, Yiyang Hao, Christian Hensel, Jan Huenermann, Ye Ji, Sarthak Joshi, Eiso Kant, Kabir Khandpur, Seonghyeon Kim, Vladimir Kirichenko, Umut Kocasarac, Ilya Kochik, Ivan Komarov, Chaerin Kong, Anurag Koul, François-Joseph Lacroix, Sergei Laktionov, Waren Long, Quentin Malartic, Vadim Markovtsev, Afonso Marques, Robert McHardy, Carlos Mocholí, Dmitry Monakhov, Adam Morris, Martin Muller, Christian Mürtz, Robin Nabel, Thien Nguyen, Rok Novosel, Szymon Ozog, Aalhad Patankar, Aleksei Petrov, Alexandre Piché, Arthur Pignet, Teodor Poncu, Phil Potter, Alexander Rakowski, Pierre-Yves Ritschard, Jay Roberts, Joe Rowell, Piotr Sarna, Pierre-André Savalle, Uladzislau Sazanovich, Nikita Shapovalov, Arsenii Shevchenko, Mikhail Shilkov, Andrei Sokol, Mohamed Soliman, Jack Stephenson, Victor Storchan, Dragos-Constantin Tantaru, Artem Tyurin, Adrian Wälchli, Pengming Wang, Jianxiao Yang, Renat Zayashnikov, Alexander Zelenka Martin, Nikolay Zinov, Caroline Bercier, José Caldeira, Margarida Garcia, Tom George, Kabeer Gharzai, Glenn Hitchcock, Carson Klingenberg, Ivo Pinto, Varun Randery, Noah Smith, Arina Sugako, Jason Warner

发表机构 * Poolside Team(Poolside团队)

AI总结 本文介绍了两个用于长周期自主编码的混合专家基础模型 Laguna M.1 和 XS.2,通过端到端训练和模型工厂系统,在软件工程基准测试中达到先进水平。

Comments Technical report to models released here: https://poolside.ai/blog/introducing-laguna-xs2-m1

详情
AI中文摘要

我们介绍了 Laguna M.1 和 Laguna XS.2,两个为长周期自主编码构建的混合专家基础模型:M.1 总参数量为 2258 亿(每 token 激活 234 亿),XS.2 总参数量为 334 亿(每 token 激活 30 亿)。两个模型均在我们称为模型工厂的内部系统中从头到尾端到端训练:这是一个紧密集成的版本化数据、训练、评估和推理组件栈,将模型开发转变为工业流程。我们描述了模型工厂的原理和设计选择,并详细介绍了模型的端到端训练过程,包括预训练数据和架构、后训练阶段、评估和量化。在自主软件工程和终端基准测试(SWE-bench Verified、SWE-bench Multilingual、SWE-Bench Pro 和 Terminal-Bench 2.0)上,M.1 和 XS.2 在其各自的权重级别中与最先进的开源模型具有竞争力。Laguna XS.2 权重在 Apache 2.0 许可下发布,地址为 https://huggingface.co/collections/poolside/laguna-xs2。

英文摘要

We present Laguna M.1 and Laguna XS.2, two Mixture-of-Experts foundation models built for long-horizon, agentic coding: M.1 has $225.8$B total parameters ($23.4$B activated per token) and XS.2 has $33.4$B total ($3$B activated). Both models were trained from scratch end-to-end inside the same internal system that we refer to as our Model Factory: a tightly integrated stack of versioned data, training, evaluation, and inference components that turns model development into an industrial process. We describe the principles and design choices of the Model Factory and also detail the end-to-end training process of our models, covering pre-training data and architecture, post-training stages, evaluation, and quantization. On agentic software engineering and terminal benchmarks (SWE-bench Verified, SWE-bench Multilingual, SWE-Bench Pro, and Terminal-Bench 2.0), M.1 and XS.2 are competitive with state-of-the-art open models in their respective weight classes. Laguna XS.2 weights are released under Apache 2.0 at https://huggingface.co/collections/poolside/laguna-xs2.

2605.27596 2026-05-28 cs.CL

Can Hallucinations Be Useful? Solving Multi-Hop Questions With SLMs By Chaining System-I/II Reasoning

幻觉能否有用?通过链式系统I/II推理用SLM解决多跳问题

Saptarshi Sengupta, Suhang Wang

发表机构 * The Pennsylvania State University(宾夕法尼亚州立大学)

AI总结 提出一种“先回答后推理”的认知启发框架,利用SLM的初始答案(可能包含幻觉)作为假设来检索证据,再通过系统II深度推理,从而在多跳问答任务上超越传统的“先思考后检索”方法。

详情
AI中文摘要

最近,小型语言模型(SLM)引起了越来越多的兴趣,它们速度快、性能好,且硬件需求低于大型语言模型(LLM)。然而,SLM比LLM更容易产生幻觉,影响其解决复杂多步推理问题的能力,因为早期错误会级联到最终响应。为了解决这个问题,现有工作采用先思考后迭代检索的策略来减少幻觉。我们认为先思考策略并非总是必要,因为我们发现:(i)SLM通常对其初始答案有准确的置信度,并且(ii)幻觉实际上可能有助于逼近正确答案。因此,我们将我们的工作定位为这种策略的反转,即先回答后推理。我们提出了一个认知启发的框架,其中模型首先被允许快速回答问题(系统I(零样本)),然后基于从知识源使用初始假设检索到的证据进行更深层次的思考(系统II)。通过结合系统I和系统II风格的推理,我们展示了我们的方法在各种多步问答基准测试中可以优于先前采用传统先思考路径的工作。

英文摘要

Recently, there has been increased interest in Small Language Models (SLMs), which are fast, show good performance, and have lower hardware demands than large language models (LLMs). However, SLMs hallucinate more frequently than LLMs, impacting their ability to solve complex multi-step reasoning problems as early mistakes cascade to the final response. To address this, existing works adopt a think-first strategy followed by iterative retrieval to reduce hallucination. We argue that the think-first strategy is not always necessary, as we find that (i) SLMs are often accurately confident in their initial answer and (ii) hallucinations can actually be beneficial for homing in on the true answer. As such, we position our work as an inversion of this strategy, i.e., answer first, reason later. We propose a cognitively inspired framework where the model is first allowed to quickly answer the question (System-I (zero-shot)) and then resorts to deeper thinking (System-II) based on evidence retrieved from a knowledge source using the initial hypothesis. By combining System-I and System-II style thinking, we show that our method can outperform prior work that takes the traditional think-first route on various multi-step question-answering benchmarks.
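The answer-first, reason-later control flow can be sketched with three pluggable callables. Everything here is a hypothetical outline: the abstract does not specify how confidence is measured, whether a confidence threshold gates the second stage, or how the retrieval query is formed, so `fast_answer`, `retrieve`, `deliberate`, and `confidence_threshold` are assumptions.

```python
from typing import Callable, List, Tuple


def answer_first_reason_later(
    question: str,
    fast_answer: Callable[[str], Tuple[str, float]],
    retrieve: Callable[[str], List[str]],
    deliberate: Callable[[str, str, List[str]], str],
    confidence_threshold: float = 0.8,
) -> str:
    """System-I draft first; fall back to evidence-based System-II reasoning."""
    # System-I: zero-shot draft plus a confidence estimate.
    draft, confidence = fast_answer(question)
    if confidence >= confidence_threshold:
        return draft
    # The draft, even if hallucinated, is used as a hypothesis for retrieval.
    evidence = retrieve(f"{question} {draft}")
    # System-II: reconsider the question given the draft and the evidence.
    return deliberate(question, draft, evidence)


# Stub components standing in for an SLM and a retriever.
def stub_fast(question):
    return ("Paris", 0.4)


def stub_retrieve(query):
    return [f"evidence for: {query}"]


def stub_deliberate(question, draft, evidence):
    return f"{draft} (checked against {len(evidence)} passage)"


result = answer_first_reason_later("Capital of France?", stub_fast, stub_retrieve, stub_deliberate)
```

The point of the sketch is the ordering: retrieval is conditioned on the model's own first guess rather than on a preliminary chain of thought.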