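arXivDaily daily academic digest: synced with the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, and electrical engineering & systems.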

2605.27715 2026-05-28 cs.CL

Beyond Input Understanding: Diagnosing Multilingual Mathematical Reasoning with Directed Acyclic Trace Graphs

Jiaqiao Zhang, Zhoujun Li, Raoyuan Zhao, Jian Lan, Thomas Seidl, Michael A. Hedderich, Hinrich Schütze, Yihong Liu

Affiliations: Southwest University; LMU Munich; MCML

AI summary: This paper proposes the Directed Acyclic Trace Graph (DATG) framework, which maps reasoning traces to language-independent mathematical anchors and dependencies in order to diagnose how language affects multilingual mathematical reasoning. It also designs two test-time controls, Loop-Retry and Formula-Retry, to improve performance in low-resource languages.

Comments preprint

Abstract

Large reasoning models (LRMs) achieve strong mathematical reasoning performance in English, but remain much less reliable in many low- and medium-resource languages. This gap is often explained as a failure to understand non-English problem statements. We show that this view is incomplete: even when the problem is given in English, controlling the model's reasoning language can substantially reduce accuracy, suggesting that language also affects reasoning execution itself. To study this effect, we introduce DATG, a Directed Acyclic Trace Graph framework that maps reasoning traces to language-independent mathematical anchors and dependencies. This allows us to align target-language traces with reference DAGs and measure whether they cover required mathematical nodes, respect dependency edges, and avoid harmful mathematical actions. Experiments on the Qwen3 series across 12 languages show that non-English reasoning often suffers from reduced anchor coverage and weaker dependency fidelity, especially in low-resource languages. Motivated by this diagnosis, we propose Loop-Retry and Formula-Retry, two simple test-time controls targeting DATG-exposed failure modes, and show that they consistently improve target-language reasoning performance in low-resource languages.

2605.27712 2026-05-28 cs.AI

Prefix-Safe Bayesian Belief Tracking for LLM Reasoning Reliability: Separating Calibration from Ranking

Zhenghan Song, Yunyi Li, Yulong Liu

Affiliations: Cornell University; Columbia University

AI summary: Proposes the prefix-safe Sequential Bayesian Belief Tracking (SBBT) framework, which separates probability quality from ranking ability to provide reliable online calibration and uncertainty estimation over long reasoning chains.

Abstract

Long reasoning traces need reliability estimates before final answers are known. We study prefix-conditioned eventual-success estimation, $P(y=1 \mid o_{1:t})$, using prefix-safe observations. Sequential Bayesian Belief Tracking (SBBT) calibrates observation likelihoods and recursively updates a two-state belief, providing a common tracker for scalar scores, text and self-verification markers, hidden clusters, token-pooling probes, and latent-trajectory features. Across generated open-weight traces on MATH-500, GSM8K, AIME 2025, and RIMO-N, probability quality and ranking separate: score-only SBBT often improves Brier, while AUROC gains require structure-aware evidence beyond strong prefix-safe baselines. In the strongest hard math setting, structure-aware observations reach +0.110 AUROC against standard prefix-safe baselines. Under a same-prefix classifier audit, MATH-500 text markers and RIMO-N self-verification signals remain positive. Together, these findings support SBBT as a calibration-aware online inference framework and expose an evidence regime: scalar scores mainly support probability quality, while structure-aware prefix signals support ranking only when strong prefix-safe baselines have not already absorbed the rank evidence.
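
The recursive two-state belief update described above can be sketched as follows. This is only an illustration under stated assumptions: the Gaussian observation models and their parameters are hypothetical placeholders, not the calibrated likelihoods used in the paper, and `track` and `brier` are names introduced here.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a normal distribution, used as a stand-in likelihood."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def update_belief(prior, lik_success, lik_failure):
    """One Bayes step for a two-state belief; returns P(y=1 | o_1..t)."""
    num = prior * lik_success
    den = num + (1.0 - prior) * lik_failure
    return prior if den == 0.0 else num / den


def track(scores, prior=0.5, success=(0.7, 0.2), failure=(0.4, 0.2)):
    """Belief after each prefix of scalar scores.

    `success` and `failure` are (mean, std) pairs of hypothetical Gaussian
    observation models for traces that end correct or incorrect.
    """
    beliefs = []
    belief = prior
    for s in scores:
        belief = update_belief(
            belief, gaussian_pdf(s, *success), gaussian_pdf(s, *failure)
        )
        beliefs.append(belief)
    return beliefs


def brier(probs, labels):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)
```

Because each belief depends only on observations seen so far, the estimate at step t never uses information from later in the trace, which is one plausible reading of "prefix-safe". The Brier score measures probability quality, while a ranking metric such as AUROC would be computed separately.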

2605.27710 2026-05-28 cs.AI

DeepSciVerify: Verifying Scientific Claim-Citation Alignment via LLM-Driven Evidence Escalation

Shaghayegh Sadeghi, Khashayar Khajavi, Rise Adhikari, Alexander Tessier

Affiliations: School of Computing Science, Simon Fraser University

AI summary: Proposes DeepSciVerify, a two-stage pipeline that combines abstract-level reasoning with selective escalation to passage-level evidence. On the SCitance benchmark it reaches 86.7 Micro-F1, 4.5 points above abstract-only baselines, while 67% of instances need no full-text retrieval.

Abstract

Misalignment between claims and their cited evidence is a common failure mode in reports generated by large language models, limiting their reliability in scientific and other high-stakes settings. We present DeepSciVerify, a two-stage pipeline for scientific claim-citation verification that combines abstract-level reasoning with selective escalation to passage-level evidence. The system first verifies claims using the abstract and defers uncertain cases, retrieving and analyzing full-text passages only when necessary. This design leverages complementary behaviors across LLMs, as some models are more conservative while others are more decisive under uncertainty. On the SCitance benchmark, DeepSciVerify achieves 86.7 Micro-F1, outperforming strong abstract-only baselines by +4.5 points while resolving 67% of instances without full-text retrieval. These results suggest that selective evidence escalation improves both accuracy and efficiency in claim-citation verification.
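
The control flow of such a two-stage pipeline might look like the sketch below. All names here (`verify_claim`, the three callables, and the verdict labels) are hypothetical and introduced only for illustration; the abstract does not specify the models, prompts, or deferral criterion.

```python
from typing import Callable, List, Optional, Tuple

# A verdict is a label such as "supported" or "refuted";
# None means the abstract-level stage defers because it is uncertain.
Verdict = Optional[str]


def verify_claim(
    claim: str,
    abstract: str,
    abstract_verifier: Callable[[str, str], Verdict],
    fetch_passages: Callable[[str], List[str]],
    passage_verifier: Callable[[str, List[str]], str],
) -> Tuple[str, bool]:
    """Return (verdict, escalated).

    Stage 1 judges the claim against the abstract alone. Only when stage 1
    defers does stage 2 retrieve full-text passages and judge again.
    """
    verdict = abstract_verifier(claim, abstract)
    if verdict is not None:
        return verdict, False  # resolved without full-text retrieval
    passages = fetch_passages(claim)
    return passage_verifier(claim, passages), True
```

The `escalated` flag makes it easy to count how many instances were resolved from the abstract alone, which is the kind of efficiency figure the abstract reports.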

2605.27709 2026-05-28 cs.CL

ReverseMath: Answer Inversion for Scalable and Verifiable Mathematical Problem Generation

Raoyuan Zhao, Yihong Liu, Yupei Du, Hinrich Schütze, Michael A. Hedderich

Affiliations: Center for Information and Language Processing; LMU Munich; Munich Center for Machine Learning (MCML); Saarland University

AI summary: Proposes ReverseMath, which automatically generates new math problems by inverting the input-output relation of original problems, for use in both evaluation and training. It reveals memorization-like behavior and improves reasoning performance.

Abstract

Mathematical reasoning benchmarks are vital for evaluating large language models (LLMs), but many are static and repeatedly exposed through public evaluation and training pipelines, making it difficult to separate genuine reasoning from memorization. Meanwhile, manually constructing new math problems with reliable answers remains costly. We introduce ReverseMath, a scalable method for generating new math problems through answer inversion. Given a problem and its answer, ReverseMath masks a numerical value in the original problem, treats the original answer as a known condition, and rewrites the problem so that the masked value becomes the new answer. The generated problem reverses the original input-output relation, making its answer known by construction. We study ReverseMath for both evaluation and training. For evaluation, paired original/reversed problems reveal substantial behavioral shifts: models sometimes fail on reversed problems and even incorrectly output the original answer, suggesting memorization-like behavior. For training, ReverseMath provides automatically labeled reversed problems as data augmentation for reinforcement learning (RL). Experiments show that including ReverseMath-generated data improves mathematical reasoning performance across multiple benchmarks, demonstrating its value as both an analysis tool and a scalable source of verifiable training data.
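
A toy version of the answer-inversion step is sketched below. It is an assumption-laden simplification: the paper's rewriting is presumably model-driven, whereas this hypothetical `invert` helper uses plain string substitution and requires the caller to supply a sentence template (`answer_statement`) that states the original answer as a given.

```python
import re


def invert(problem: str, answer: str, masked_value: str, answer_statement: str):
    """Mask one number, add the old answer as a given, and ask for the mask.

    Returns (new_problem, new_answer). The new answer is known by
    construction: it is the value that was masked out.
    """
    # Match the value only as a standalone number, not inside e.g. 14 or 3.5.
    pattern = r"(?<![\d.])" + re.escape(masked_value) + r"(?!\.?\d)"
    if len(re.findall(pattern, problem)) != 1:
        raise ValueError("masked value must occur exactly once in the problem")
    masked = re.sub(pattern, "X", problem)
    # Drop the original final question sentence and replace it.
    body = masked.rsplit(". ", 1)[0] + "."
    new_problem = f"{body} {answer_statement.format(answer=answer)} What is X?"
    return new_problem, masked_value


original = "A shop sells pens at 3 dollars each. Tom buys 4 pens. How much does Tom pay?"
reversed_problem, reversed_answer = invert(
    original, "12", "4", "Tom pays {answer} dollars in total."
)
```

In the paired-evaluation setting described above, a model that answers the reversed problem with the original answer (here, 12 instead of 4) would be showing the memorization-like behavior the abstract mentions.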

2605.27706 2026-05-28 cs.CL cs.IR

Chain-based Adaptive Reconfiguration Over Lattices for Hallucination Reduction

Joan Vendrell Gallart, Solmaz Kia, Russell Bent, Michael Grosskopf

Affiliations: Department of Mechanical and Aerospace Engineering, University of California, Irvine; Los Alamos National Laboratory

AI summary: Proposes the CAROL framework, which defines a semantic uncertainty measure and builds a string-submodular objective over a lattice of textual sequences, casting hallucination mitigation as a Markov chain accept-reject process that reduces hallucinations at test time.

Abstract

We introduce CAROL (Chain-based Adaptive Reconfiguration Over Lattices), a probabilistic framework for test-time hallucination reduction in large language models. Rather than relying on token-level uncertainty, CAROL defines a semantic uncertainty measure based on the consistency between generated responses and a trusted context, inducing a string-submodular objective over a lattice of textual sequences. This formulation enables hallucination mitigation to be cast as a Markov chain accept-reject process with provable convergence and near-optimality guarantees, allowing the model to iteratively refine outputs toward semantic consistency. By operating at the level of meaning, CAROL unifies hallucination detection and mitigation within a single framework. Empirical results on question answering and multi-agent reasoning benchmarks show that CAROL significantly reduces hallucinations and improves reliability and interpretability compared to likelihood-based and retrieval-augmented baselines, while maintaining competitive computational efficiency.

2605.27703 2026-05-28 cs.AI

Hierarchical Prompt-Domain Control and Learning for Resource-Constrained Agentic Language Models

Joan Vendrell Gallart, Russell Bent, Michael Grosskopf

Affiliations: Los Alamos National Laboratory

AI summary: Proposes a hierarchical control-and-learning framework that addresses the reliability of agentic language models under resource constraints through distillation-based learning of the output schema, online monitoring, and prompt-domain control.

Abstract

Large Language Models are increasingly deployed inside agentic systems, where they must follow structured protocols, adapt to evolving states, and operate under memory, latency, and cost constraints. In such regimes, prompt extension is unreliable: growing contexts can push compact models outside their effective prompt domain, while deployment-time fine-tuning remains limited by scarce data and compute. We propose a hierarchical control-and-learning framework in which a compact model is first distilled to learn the required output schema, then supervised online by an oracle-controller loop. The controller monitors protocol validity and semantic performance, projects accumulated histories into a feasible prompt domain, and triggers lightweight oracle-supervised fine-tuning under drift. This separates schema learning for communication compatibility from semantic adaptation for task-level correction. We formalize prompt-domain feasibility and attention-induced saturation, motivating control of the effective prompt state rather than reliance on nominal context length. Using Multi-Fidelity Bayesian Optimization as a controlled sequential testbed, we characterize a core deployment failure mode and show improved reliability and cost-efficiency over non-hierarchical, distillation-only, and non-distilled baselines.

2605.27699 2026-05-28 cs.RO

AURA: Asymptotically Optimal Uncertainty-Robust Replanning Algorithm for Kinodynamic Systems

Seyedali Golestaneh, Zhuoyun Zhong, Donghyung Lee, Constantinos Chamzas

Affiliations: Department of Robotics Engineering, Worcester Polytechnic Institute (WPI)

AI summary: Proposes the AURA meta-planning framework, which uses online replanning and optimization of control inputs to achieve asymptotically optimal trajectory planning and improved tracking accuracy under motion uncertainty.

Abstract

Sampling-based motion planners offer a practical and scalable approach to kinodynamic motion planning, notably for high-dimensional, underactuated, or non-holonomic systems. However, these planners are typically used offline, requiring execution to begin only after the trajectory has been computed. In addition, the planned trajectory may not be accurately tracked in the presence of motion uncertainty, leading to deviations from the nominal solution. In this work, these limitations are addressed within a unified framework, AURA, an asymptotically optimal meta-planner framework that improves both path quality and tracking performance during execution. In addition to the main execution thread, this framework comprises a replanning method that continuously explores the state space and refines the trajectory during execution, and an optimization process that refines future control inputs to reduce tracking error. Together, these components enable AURA to leverage asymptotically optimal planning online while improving execution accuracy under uncertainty. The proposed approach is evaluated in both simulation and real-world environments across multiple systems, demonstrating consistent improvements in trajectory quality, tracking accuracy, and overall performance compared with baseline methods.

2605.27697 2026-05-28 cs.RO cs.AI cs.LG

Simulation-Informed Diffusion for Decentralized Multi-robot Motion Planning

Jinhao Liang, Sven Koenig, Ferdinando Fioretto

Affiliations: University of Virginia; University of California, Irvine

AI summary: Proposes SID, a decentralized framework built on constraint-aware diffusion models, which simulates neighbors' future trajectories and uses the resulting safety constraints to plan each robot's own trajectory, achieving efficient coordination in dense scenarios.

Abstract

Decentralized multi-robot motion planning requires each robot to generate collision-free trajectories from local observations, without global sensing or reliable communication. However, most existing planners, whether classical or learning-based, generate trajectories from a static snapshot of the local observation, which limits their ability to anticipate the future behavior of neighboring robots. This limitation is critical as the number of robots increases and the environment becomes more cluttered. To overcome this challenge, this paper introduces Simulation-Informed Diffusion (SID), a decentralized framework built on constraint-aware diffusion models (CADM). SID first uses CADM to simulate the future trajectories of neighboring robots from their currently observed states, and then uses the same CADM to plan each robot's own trajectory under safety constraints informed by these simulations. Crucially, the accurate simulation of neighbors enables a minimal communication scheme that triggers coordination only when necessary in highly congested scenarios. Experiments across diverse environments show that SID consistently outperforms baseline methods in terms of planning effectiveness and constraint satisfaction, and scales to scenarios with 108 robots and 160 obstacles.

2605.27690 2026-05-28 cs.CL cs.LG

TRACES: Proactive Safety Auditing for Multi-Turn LLM Agents via Trajectory-State Modeling

Jiaqian Li, Yanshu Li, Boxuan Zhang, Ruixiang Tang, Kuan-Hao Huang

Affiliations: Brown University; The University of Texas at Austin; Rutgers University; Texas A&M University

AI summary: Proposes TRACES, which learns prefix-level trajectory risk states from the hidden representations of an observer LLM, enabling proactive safety auditing in multi-turn tool-use settings and improving full-trajectory safety prediction and proactive risk discrimination.

Abstract

LLM agents increasingly operate through multi-turn tool use and environment interaction, where safety risks often emerge from intermediate steps long before they surface in the final outcome. Reactive auditing is therefore insufficient: post-hoc diagnosis frequently misses the chance to flag risks while they are unfolding. We propose TRACES, a representation-based proactive auditor that learns prefix-level trajectory risk states from the hidden representations of an observer LLM. TRACES induces latent mechanism features from step representations and models their temporal evolution to estimate whether a partial trajectory is drifting toward unsafe behavior. To sidestep the cost and ambiguity of step-level risk annotation, TRACES is trained with weak trajectory-level supervision while still producing dense prefix-level risk estimates. Across multiple agent safety benchmarks, TRACES improves both full-trajectory safety prediction and proactive risk discrimination. Our analyses further suggest that these risk states can help train a safer agent, highlighting the broader potential of proactive auditing for long-horizon agent safety.

2605.27689 2026-05-28 cs.LG cs.CR

Test-Time Collective Action: Proxy-Based Perturbations for Correcting Algorithmic Harms

Meghana Bhange, Ulrich Aïvodji, Elliot Creager

Affiliations: ÉTS Montréal; Mila; University of Waterloo; Vector Institute

AI summary: Proposes the Test-Time Collective Action framework, in which users pool query access to a black-box API to extract a proxy model and optimize per-class universal perturbations, correcting subgroup performance gaps at inference time without taking part in the platform's training.

Abstract

When machine learning systems underperform for particular subgroups, affected users typically have no way to correct these disparities without relying on platform-level fixes. Existing approaches to algorithmic fairness are provider-centric, leaving users with no external lever when faced with harm. Recent work on Algorithmic Collective Action shows that coordinated users can steer an algorithmic system toward a collective goal, but existing mechanisms require the provider to retrain on the collective's modified data, which users may not control. We propose Test-Time Collective Action (TTCA), a framework through which a group of users who share query access to the platform can correct disparities affecting under-served subgroups without participating in the platform's training loop. We implement this through a proxy-based mechanism in which the collective pools query access to a black-box API to extract a proxy of the platform, then optimizes a per-class universal perturbation against the proxy. Each member applies this perturbation to their own inputs at submission time, requiring no cooperation from the platform. We empirically evaluate the mechanism on CIFAR-10, CIFAR-100, and FairFace, showing that modestly sized collectives close most of the subgroup accuracy gap, transfer across architectures (a small proxy can attack a larger platform), and improve worst-group accuracy, equal-opportunity gap, and disparate impact. A query-budget analysis against a per-user black-box attack baseline shows that pooling is cheaper than each subgroup member attacking alone. Test-time collective action thus offers corrective intervention to users when platform-side remediation is unavailable or delayed.

2605.27686 2026-05-28 cs.CV cs.AI

Tensor Memory: Fixed-Size Recurrent State for Long-Horizon Transformers

Kabir Swain, Sijie Han, Daniel Karl I. Weidele, Mauro Martino, Antonio Torralba

Affiliations: Massachusetts Institute of Technology, Cambridge, MA, USA; IBM Research, Cambridge, MA, USA; University of Toronto, Toronto, Canada

AI summary: Proposes the Tensor Memory module, which augments Transformers with a fixed-size recurrent 3D tensor state to decouple state capacity from input length while preserving a spatial inductive bias, targeting long-horizon video understanding.

Abstract

Transformers process images and videos by flattening space and time into long token sequences. While attention and KV caching preserve past features, their memory grows with sequence length and they lack an explicit, persistent spatial state, making long-horizon video understanding and occlusion-sensitive reasoning difficult. We propose Tensor Memory, a lightweight module that augments Transformer blocks with a fixed-size recurrent 3D memory tensor: tokens write into a voxel grid via a differentiable soft write that deposits content as a Gaussian-weighted volume around a predicted continuous 3D location, the memory is updated with an efficient local interaction operator and gated recurrent dynamics, and tokens read back context via continuous sampling with gated residual fusion. Because the memory tensor has a constant size, Tensor Memory decouples state capacity from input length while preserving a spatial inductive bias. We evaluate the module on standard language, image, and video benchmarks and on a controlled toy diagnostic suite designed to isolate when persistent state is beneficial; it integrates with standard Transformer training pipelines and can be attached to or removed from existing blocks without other architectural changes.

2605.27681 2026-05-28 cs.AI cs.LG

Behavioural Analysis of Alignment Faking

Nathaniel Mitrani Hadida, Rhea Karty, David Williams-King, Alan Cooney

Affiliations: University of Cambridge; Harvard University; ERA; UK AISI

AI summary: Studies alignment faking in a controlled, minimal setup and finds that it is driven by values, goal guarding, and sycophancy, is more widespread than previously reported, and is predictable from situational cues and model tendencies.

Comments preprint

Abstract

Alignment faking (AF) refers to a model strategically complying with a training objective to avoid behavioural modification while preserving its deployment preferences. Understanding when and why AF arises matters as models grow better at distinguishing training from deployment. Prior work finds AF fragile, prompt-sensitive, and model-dependent, leaving its underlying drivers unclear. We study AF in a controlled, minimal setup that isolates its core components, and observe it across a wider range of models than previously reported, including small-scale models. We identify three separable drivers (values, goal guarding, and sycophancy) and show via targeted prompt ablations and activation steering that each independently modulates AF behaviour. Our results indicate AF is more widespread than previously reported and that its occurrence is predictable from situational cues and measurable model tendencies such as baseline sycophancy and stated values. The decomposition suggests concrete directions for detecting and mitigating AF in future models.

2605.27678 2026-05-28 cs.LG cs.DC

Heterogeneous Parallelism for Multimodal Large Language Model Training

Yashaswi Karnati, Kamran Jafari, Akash Mehra, Li Ding, Pranav Prashant Thombre, Ali Roshan Ghias, Shifang Xu, Parth Mannan, Yu Yao, Hao Wu, Eric Harper, Ashwath Aithal, Nima Tajbakhsh

Affiliations: NVIDIA

AI summary: To address the throughput bottleneck caused by a single LLM-centric parallel layout in multimodal LLM training, proposes a heterogeneous parallelism abstraction that lets each module use its own layout and placement and preserves tensor semantics through boundary communicators. Experiments show TFLOPS/GPU gains of up to 49.3%.

Abstract

Foundation model training is becoming multimodal, from post-training pipelines to large-scale pretraining. As modality coverage broadens, context windows grow, and encoder and LLM scales diverge, a single LLM-centric TP/CP/PP/DP/EP layout increasingly limits throughput. This coupling forces encoders to inherit LLM-driven sharding and placement choices that can add communication, limit encoder parallelism, or constrain the LLM schedule; the mismatch is most pronounced at long contexts, where LLM context parallelism is needed for the fused multimodal sequence but encoder inputs remain bounded. We present heterogeneous parallelism for multimodal large language model training, an abstraction that lets modules in one end-to-end graph use independent layouts and rank placements, supporting colocated execution on shared GPUs and non-colocated execution on disjoint rank sets. The key challenge is preserving boundary tensor semantics across independent layouts: forward activations must be materialized for the destination layout, while backward gradients must be routed back to the source layout. We address this with boundary communicators that implement forward and backward layout transforms, plus scheduling extensions for both placement modes. We evaluate optimized homogeneous, colocated heterogeneous, and non-colocated heterogeneous configurations across multimodal workloads and GPU scales to characterize when added layout and placement freedom exposes a better operating point. Across this sweep, colocated heterogeneity improves TFLOPS/GPU by up to 49.3%, while non-colocated heterogeneity improves aggregate token throughput by up to 13.0% and TFLOPS/GPU by up to 9.6%. We validate loss convergence parity against homogeneous baselines and release the system as an open-source Megatron-LM extension.

2605.27673 2026-05-28 cs.LG

When do complex-valued neural networks help? A study of representation, geometry, and optimization

Ashutosh Kumar

Affiliations: Owl Autonomous Imaging, Inc.; RIT

AI summary: Compares complex-valued neural networks with a range of real-valued baselines on synthetic RF, quantum-wavefunction, and EEG tasks, and finds that the advantage of complex-valued networks depends on representation, symmetry, and optimization rather than being universal.

Abstract

Complex-valued Neural Networks (CVNNs) are often motivated by domains where information is naturally encoded in magnitude and phase. Yet complex-valued inputs alone do not determine when complex arithmetic improves learning: the label signal may lie in amplitude, phase, their coupling, or a symmetry that real-valued models can also represent under suitable coordinates. We study this through a representation-first evaluation of CVNNs against Cartesian real, polar, phase-only, magnitude-only, parameter-matched real, and FLOP-matched real baselines. Across synthetic RF tasks, complex representations are useful but not universally superior. PSK-only tasks favor phase-aware and complex-valued models, QAM-only tasks favor magnitude-based models, mixed PSK+QAM gives only a small complex-valued advantage, and unseen carrier-phase rotations break coordinate-dependent models without augmentation. Similar patterns appear beyond RF: in quantum-wavefunction prediction, momentum is invisible to $|\psi|$ but recoverable from phase, while EEG analytic-signal experiments show that phase locking, amplitude bursts, and phase-amplitude coupling each favor different coordinate views. We also identify a benchmarking artifact on RadioML 2018.01A. Under matched-shared-trial selection, a CReLU complex model exceeds the best real baseline by 22.94 PP; under independent per-family tuning on the same data and 16-trial search space, the gap collapses to 2.46 PP. Gradient analysis traces the inflated gap to high-learning-rate first-step instability in real baselines, while complex parameter coupling distributes the loss signal more robustly. A learning-rate $\times$ activation factorial confirms the failure is primarily hyperparameter-driven. Overall, CVNNs are best viewed as structured inductive biases whose gains depend on representation, symmetry, and optimization, not as universally superior architectures.
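
The point that momentum is invisible to the magnitude but recoverable from the phase can be illustrated with a plane wave, exp(i k x), whose magnitude is 1 for every wavenumber k. This is a minimal sketch under assumptions (a pure plane wave, units with hbar = 1, and helper names introduced here); the paper's actual wavefunction task is not described in the abstract.

```python
import cmath


def plane_wave(k, xs):
    """Samples of psi(x) = exp(i k x) at the positions xs."""
    return [cmath.exp(1j * k * x) for x in xs]


def magnitudes(psi):
    """Magnitude-only view: identical for every k."""
    return [abs(z) for z in psi]


def estimate_momentum(psi, dx):
    """Mean phase advance per unit length between neighbouring samples.

    Valid while |k * dx| < pi, so that the phase step does not wrap around.
    With hbar = 1 the result equals the momentum of the plane wave.
    """
    steps = [cmath.phase(b * a.conjugate()) for a, b in zip(psi, psi[1:])]
    return sum(steps) / (len(steps) * dx)


dx = 0.1
xs = [i * dx for i in range(50)]
psi_a = plane_wave(2.0, xs)
psi_b = plane_wave(-3.0, xs)
```

Here `magnitudes(psi_a)` and `magnitudes(psi_b)` are indistinguishable, so a magnitude-only model has no signal to learn from, while `estimate_momentum` separates the two waves using phase alone.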

2605.27668 2026-05-28 cs.LG cs.AI cs.CL

Aligning LLMs with Human Uncertainty: A Beta-Bernoulli Calibrator for LLM Forecasting

Hui Dai, Ryan Teehan, Parsa Torabian, Mengye Ren

Affiliations: Agentic Learning AI Lab; New York University; The University of Chicago; Chronologies AI

AI summary: Proposes the Beta-Bernoulli Calibrator (BBC), which combines binary outcomes with human forecast signals to convert an initial point estimate into a distribution over event likelihood, providing calibration and uncertainty quantification.

Abstract

Probabilistic forecasting estimates the likelihood of uncertain future events. To improve LLM forecasting, existing methods typically learn from binary outcomes to output verbalized forecasts. However, while aggregated human forecasts contain rich information in both the crowd probability estimate and the degree of agreement among forecasters, how to utilize these signals remains underexplored. To address this, we propose the Beta-Bernoulli Calibrator (BBC), which converts an initial point estimate forecast from any model into a distribution over event likelihood, using supervision from both binary outcomes and human forecasts. BBC models event likelihood $p \sim \text{Beta}(\alpha, \beta)$ and outcome $y \sim \text{Bernoulli}(p)$, with the mean as the calibrated point forecast and the variance as the epistemic uncertainty. Our results show that BBC generally provides better calibrated and more accurate forecasts than both traditional post-hoc calibration methods and models fine-tuned specifically for forecasting, while remaining lightweight and having good generalization. We also show that the epistemic uncertainty captured by BBC is a more reliable predictor of forecasting error than verbalized confidence.
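
For reference, the mean and variance of a Beta distribution have closed forms, so the point forecast and uncertainty follow directly from the two Beta parameters. The sketch below only evaluates those standard formulas; the function name is introduced here, and how the calibrator predicts the two parameters is not covered by the abstract.

```python
def beta_mean_var(alpha: float, beta: float):
    """Mean and variance of Beta(alpha, beta).

    mean = alpha / (alpha + beta)
    var  = alpha * beta / ((alpha + beta)^2 * (alpha + beta + 1))
    """
    if alpha <= 0 or beta <= 0:
        raise ValueError("alpha and beta must be positive")
    total = alpha + beta
    mean = alpha / total
    var = alpha * beta / (total ** 2 * (total + 1.0))
    return mean, var


# Same point forecast, different confidence.
diffuse = beta_mean_var(2.0, 2.0)
peaked = beta_mean_var(20.0, 20.0)
```

Both calls give a point forecast of 0.5, but the second distribution has a much smaller variance. This is how a single Beta distribution can carry both a calibrated probability and a measure of how much forecasters, or the model, agree on it.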

2605.27662 2026-05-28 cs.LG cs.AI

How the Optimizer Shapes Learned Solutions in Equivariant Neural Networks

Teodor-Mihai Stupariu, Andrei Manolache

Affiliations: University of Stuttgart, Germany; International Max Planck Research School for Intelligent Systems, Germany; Tudor Vianu High School of Computer Science, Romania

AI summary: Compares the Muon and Adam optimizers on point-cloud and molecular learning tasks, finds that Muon improves the optimization of equivariant neural networks, and analyzes the mechanism, which it links to more regular loss landscapes and higher effective rank.

Comments: Accepted at ICML 2026 Workshop on Weight-Space Symmetries

Agentic Language-to-Objective Synthesis for Optofluidic Assembly

Ivan Saraev, Elena Erben, Weida Liao, Fan Nan, Gerhard Neumann, Eric Lauga, Moritz Kreysing

Affiliations: Institute of Biological and Chemical Systems, Karlsruhe Institute of Technology, Germany; Department of Applied Mathematics and Theoretical Physics, University of Cambridge, UK; Department of Mathematics, Imperial College London, UK; Institute of Anthropomatics and Robotics (IAR), Karlsruhe Institute of Technology, Germany

AI summary: Proposes Speak-to-Objective, a modular agentic pipeline that uses a conditioned large language model to translate spoken or written instructions into differentiable objective functions for optofluidic microparticle assembly, with support for learning from user feedback.

Comments: 21 pages, 5 figures

Abstract

Light-based advanced manufacturing increasingly requires programmable, closed-loop tools that translate human design intent into executable operations at small length scales. Yet a key bottleneck persists across robotic and manufacturing modalities: turning user intent into machine-readable objectives that are reliably executable. While micro-robotics offers versatile manipulation via optical actuation of fluids, mathematically tractable goal specification remains manual and hard to reuse. Here, we introduce Speak-to-Objective, a modular agentic pipeline that uses a conditioned Large Language Model (LLM) to translate spoken or written commands into fully differentiable objective functions for assembling microparticles in a constraint-aware inverse solver (SLSQP) and on an experimental optofluidic platform. The approach employs a compact loop - perceive -> compose -> propose -> act -> report & learn - that treats the objective as the interface between intent and actuation, separating what to assemble or pattern from how to actuate, while learning from user feedback. The pipeline composes geometry, spacing, and assignment/topology terms to generate robust descriptive objectives that assemble from partial traces and recover after perturbations, as well as explicit objectives for precise placement, all in an actuator-agnostic fashion. Using laser-induced thermoviscous flows as the physical actuation modality, we demonstrate natural-language-programmable, light-based microscale assembly of particle patterns in a microfluidic environment. Beyond its immediate impact on programmable microassembly, and using laser-induced optofluidic actuation as a reduced-complexity experimental platform, our work points toward self-driving, AI-assisted optical manufacturing platforms in which natural language, differentiable objectives, and laser-based actuation are coupled into a reusable digital workflow.

2605.27642 2026-05-28 cs.CL cs.LG

Learning to Translate from Soft to Hard LLM Prompts

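Pitipat Kongsomjit, Suryansh Goyal, Jacob Whitehill

Affiliations: Worcester Polytechnic Institute

AI summary: Trains a dedicated model that translates soft prompts into natural language, improving translation quality, and shows that soft prompts can be converted into portable text prompts that, on large closed-source models, outperform the original soft prompts and even few-shot learning.

Comments: 8 pages, 11 tables, 4 figures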

2605.27636 2026-05-28 cs.CL

Simorgh at SemEval-2026 task 7: Region-Aware Hybrid Retrieval for Low-Resource Cultural Reasoning in Multilingual Question Answering

Hadi Bayrami Asl Tekanlou, Mahdi Bakhtiyarzadeh, Jafar Razmara

Affiliations: University of Tabriz

AI summary: Proposes a region-aware hybrid retrieval method that combines BM25 and dense semantic similarity with region-weighting heuristics to improve cross-lingual stability in multilingual cultural question answering.

Comments: 6 pages, 3 figures, accepted to the Everyday Knowledge Across Diverse Languages and Cultures shared task at SemEval-2026

Abstract

Although Large Language Models (LLMs) perform well on general-domain reasoning tasks, they may struggle with culturally grounded knowledge in languages with limited digital and textual data. In this paper, we investigate culturally grounded multiple-choice question answering with the BLEnD benchmark, a multilingual corpus of 30 languages covering socio-cultural domains such as cuisine, sports, and family. We propose a region-aware hybrid retrieval approach that combines BM25 lexical matching and dense semantic similarity with regional weighting heuristics to improve the relevance of the answer. The retrieved documents are used to construct a structured prompt for the Qwen3-14B quantized model with logit-based deterministic answer selection. The experimental results show that the hybrid retrieval approach improves cross-lingual stability over pure parametric inference for culturally grounded question answering. However, notable performance gaps remain between languages with more and less training data, which shows that retrieval augmentation does not entirely overcome the training-data imbalance problem.
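The score fusion described in the abstract can be pictured with a small Python sketch. The abstract does not give the fusion rule, so everything here is an assumption made for illustration: the min-max normalization, the `lexical_weight` and `region_boost` parameters, the document schema, and the example scores. The BM25 and dense scores are taken as precomputed inputs rather than produced by a real retriever.

```python
def min_max(scores):
    """Rescale scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def hybrid_rank(docs, bm25_scores, dense_scores, query_region,
                lexical_weight=0.5, region_boost=1.2):
    """Rank docs by a weighted mix of lexical and dense scores.

    docs: list of dicts with "id" and "region" keys (hypothetical schema).
    Documents whose region matches the query region get a multiplicative boost.
    """
    lexical = min_max(bm25_scores)
    dense = min_max(dense_scores)
    ranked = []
    for doc, lex, den in zip(docs, lexical, dense):
        score = lexical_weight * lex + (1.0 - lexical_weight) * den
        if doc["region"] == query_region:
            score *= region_boost
        ranked.append((doc["id"], score))
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


# Made-up documents and scores, for illustration only.
docs = [
    {"id": "a", "region": "IR"},
    {"id": "b", "region": "US"},
    {"id": "c", "region": "IR"},
]
ranking = hybrid_rank(docs, bm25_scores=[3.0, 4.0, 1.0],
                      dense_scores=[0.9, 0.8, 0.1], query_region="IR")
```

In this toy case document `b` would rank first on the fused score alone, and the region boost moves the region-matched document `a` ahead of it.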

2605.27622 2026-05-28 cs.AI cs.SC

Reasoning and Planning with Dynamically Changing Norms

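Taylor Olson, Roberto Salas-Damian, Kenneth D. Forbus

Affiliations: University of Iowa; Northwestern University

AI summary: Proposes a method for guiding planning with dynamically changing norms in human-AI settings, using a defeasible calculus to resolve norm conflicts and treating norms as planning guardrails; theoretical proofs and experiments on a dialogue task support its effectiveness.

Comments: 8 pages, 1 figure, dataset included in anc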

2605.27619 2026-05-28 cs.LG cs.AI

Supervised Distributional Reduction via Optimal Transport and Dependence Maximization

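Sai-Aakash Ramesh, Archit Sood, Andrew Corbett, Tim Dodwell

Affiliations: digiLab, UK; University of Bristol, UK

AI summary: Proposes Supervised Distributional Reduction (SDR), an algorithm that combines optimal transport with explicit dependence maximization to learn compact representations that preserve both data geometry and target-relevant signal.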

Abstract

Learning representations that capture both intrinsic data geometry and target-relevant structure remains a fundamental challenge, particularly in settings where data reduction must balance compression with predictive fidelity. While distributional reduction, which encompasses joint clustering and dimensionality reduction, offers a principled way to summarize data, its supervised variants remain relatively under-explored, despite the importance of retaining task-relevant signal for downstream prediction and decision-making. We propose Supervised Distributional Reduction (SDR), an algorithm for learning target-aware representations by combining optimal transport with explicit dependence maximization. SDR builds on the Fused Gromov-Wasserstein (FGW) objective to align the relational structure of the input distribution with a set of representative points, while augmenting it with a direct dependence term that encourages the learned embeddings to capture predictive signal more explicitly. This results in compact representations that reflect both geometric structure and supervision. Beyond representation learning, SDR naturally induces a data-dependent, non-stationary geometry that can be leveraged for settings such as Gaussian Process (GP) modelling. By redefining distances through target-aware distributional alignment, SDR enables the construction of adaptive kernels that respond to local variations in both data geometry and supervision, offering an optimal transport-based perspective on non-stationary kernel design.
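For orientation, the Fused Gromov-Wasserstein objective that SDR builds on is commonly written as below. This is the standard form from the optimal transport literature, not an equation taken from the paper; the symbols are assumptions chosen here, and the abstract does not specify SDR's additional dependence term, so it is not shown.

$$
\mathrm{FGW}_{q,\alpha} = \min_{\pi \in \Pi(h, g)} \sum_{i,j,k,l} \left[ (1-\alpha)\, d(a_i, b_j)^q + \alpha\, \lvert C_1(i,k) - C_2(j,l) \rvert^q \right] \pi_{ij}\, \pi_{kl}
$$

Here $\pi$ is a coupling between the input points (weights $h$) and the representative points (weights $g$), $d(a_i, b_j)$ compares features across the two sets, $C_1$ and $C_2$ hold the pairwise (relational) structure within each set, and $\alpha \in [0, 1]$ trades off the feature term against the structure term.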

2605.27616 2026-05-28 cs.CV cs.AI

Not All NVFP4 QAT Recipes Are Equal: How Architecture and Scale Shape Model Quality for Anomaly Segmentation

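Zijian Du, Oleg Rybakov

Affiliations: NVIDIA

AI summary: Evaluates, under a unified protocol, how architecture, scale, and FP4 quantization-aware training (QAT) recipe interact on a brain tumor anomaly segmentation task; finds that architecture choice has the largest effect on quantization robustness, that attention-based architectures are notably resilient to recipe choice, and that CNNs degrade under gradient-quantizing recipes at larger scales.

Journal ref: CVPR 2026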

Abstract

Real-time anomaly segmentation demands both high recall and efficient low-precision inference. We study the three-way interaction of model architecture, model scale, and FP4 quantization-aware training (QAT) recipe on a recall-critical brain tumor segmentation task, evaluating multiple architectures, scales, and QAT recipes under a unified protocol. We find that architecture choice has the largest impact on quantization robustness, with attention-based architectures showing remarkable resilience to recipe choice while CNNs degrade under gradient-quantizing recipes at larger scales. At low capacity, FP4 can discretize softmax attention, but advanced QAT recipes prevent this collapse. At larger scales, advanced recipes mitigate gradient quantization noise that degrades CNN quality. Five-fold patient-level cross-validation confirms these findings are robust to the data partition. Our results show that the Swin Transformer is robust to QAT recipe choice across all scales, making it the recommended architecture for FP4-quantized anomaly segmentation.
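To make the effect of 4-bit floating-point rounding concrete, here is a minimal fake-quantization sketch in plain Python. It rounds values to the magnitudes representable by the E2M1 4-bit format using one scale per block. It is not one of the QAT recipes evaluated in the paper: the block size of 16, the full-precision scale, and the function names are assumptions made for illustration, and a real QAT setup would also need a gradient estimator, which is omitted here.

```python
# Magnitudes representable by the FP4 E2M1 format (sign handled separately).
E2M1_VALUES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def fake_quantize_block(block):
    """Quantize then dequantize one block of floats to the E2M1 grid."""
    max_abs = max(abs(x) for x in block)
    if max_abs == 0.0:
        return [0.0 for _ in block]
    scale = max_abs / E2M1_VALUES[-1]  # map the largest magnitude to 6.0
    out = []
    for x in block:
        target = abs(x) / scale
        nearest = min(E2M1_VALUES, key=lambda v: abs(v - target))
        out.append((nearest if x >= 0 else -nearest) * scale)
    return out


def fake_quantize(values, block_size=16):
    """Apply block-wise fake quantization; block_size=16 is an assumption."""
    result = []
    for start in range(0, len(values), block_size):
        result.extend(fake_quantize_block(values[start:start + block_size]))
    return result


# With a block maximum of 6.0 the scale is 1.0, so 0.7 snaps to 0.5.
example = fake_quantize([0.7, 6.0])
```

Because the grid has only eight magnitudes, small values within a block collapse onto a handful of levels, which gives some intuition for how low-precision rounding can coarsen the quantities a network computes.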

2605.27605 2026-05-28 cs.AI cs.SE

Laguna M.1/XS.2 Technical Report

Julien Abadji, Marah Abdin, Connor Adams, Eric Alcaide, Mustafa Altun, Michele Artoni, Junze Bao, Uday Barar, Vassilis Bekiaris, Arkadii Bessonov, Benjamin Bütikofer, Jonathan Chang, Yen-Chun Chen, Dmitry Chernenkov, Yang Chi, Filippos Christianos, Fenia Christopoulou, Razvan-Andrei Ciocoiu, Tzachi Cohen, Yohann Coppel, Dmitrii Emelianenko, Brandon Fergerson, Brian Fitzgerald, Matthias Gallé, Alex Golonzovskyi, George Grigorev, Yiyang Hao, Christian Hensel, Jan Huenermann, Ye Ji, Sarthak Joshi, Eiso Kant, Kabir Khandpur, Seonghyeon Kim, Vladimir Kirichenko, Umut Kocasarac, Ilya Kochik, Ivan Komarov, Chaerin Kong, Anurag Koul, François-Joseph Lacroix, Sergei Laktionov, Waren Long, Quentin Malartic, Vadim Markovtsev, Afonso Marques, Robert McHardy, Carlos Mocholí, Dmitry Monakhov, Adam Morris, Martin Muller, Christian Mürtz, Robin Nabel, Thien Nguyen, Rok Novosel, Szymon Ozog, Aalhad Patankar, Aleksei Petrov, Alexandre Piché, Arthur Pignet, Teodor Poncu, Phil Potter, Alexander Rakowski, Pierre-Yves Ritschard, Jay Roberts, Joe Rowell, Piotr Sarna, Pierre-André Savalle, Uladzislau Sazanovich, Nikita Shapovalov, Arsenii Shevchenko, Mikhail Shilkov, Andrei Sokol, Mohamed Soliman, Jack Stephenson, Victor Storchan, Dragos-Constantin Tantaru, Artem Tyurin, Adrian Wälchli, Pengming Wang, Jianxiao Yang, Renat Zayashnikov, Alexander Zelenka Martin, Nikolay Zinov, Caroline Bercier, José Caldeira, Margarida Garcia, Tom George, Kabeer Gharzai, Glenn Hitchcock, Carson Klingenberg, Ivo Pinto, Varun Randery, Noah Smith, Arina Sugako, Jason Warner

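Affiliations: Poolside Team

AI summary: Introduces Laguna M.1 and XS.2, two Mixture-of-Experts foundation models for long-horizon agentic coding, trained end-to-end within the Model Factory system and competitive with state-of-the-art open models on software engineering benchmarks.

Comments: Technical report to models released here: https://poolside.ai/blog/introducing-laguna-xs2-m1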

Abstract

We present Laguna M.1 and Laguna XS.2, two Mixture-of-Experts foundation models built for long-horizon, agentic coding: M.1 has 225.8B total parameters (23.4B activated per token) and XS.2 has 33.4B total (3B activated). Both models were trained from scratch end-to-end inside the same internal system that we refer to as our Model Factory: a tightly integrated stack of versioned data, training, evaluation, and inference components that turns model development into an industrial process. We describe the principles and design choices of the Model Factory and also detail the end-to-end training process of our models, covering pre-training data and architecture, post-training stages, evaluation, and quantization. On agentic software engineering and terminal benchmarks (SWE-bench Verified, SWE-bench Multilingual, SWE-Bench Pro, and Terminal-Bench 2.0), M.1 and XS.2 are competitive with state-of-the-art open models in their respective weight classes. Laguna XS.2 weights are released under Apache 2.0 at https://huggingface.co/collections/poolside/laguna-xs2.

2605.27596 2026-05-28 cs.CL

Can Hallucinations Be Useful? Solving Multi-Hop Questions With SLMs By Chaining System-I/II Reasoning

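Saptarshi Sengupta, Suhang Wang

Affiliations: The Pennsylvania State University

AI summary: Proposes a cognitively inspired "answer first, reason later" framework that uses an SLM's initial answer (which may contain hallucinations) as a hypothesis for retrieving evidence, then applies deeper System-II reasoning, outperforming traditional think-first-then-retrieve methods on multi-hop question answering.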

Abstract

Recently, there has been increased interest in Small Language Models (SLMs), which are fast, show good performance, and have lower hardware demands than large language models (LLMs). However, SLMs hallucinate more frequently than LLMs, which impairs their ability to solve complex multi-step reasoning problems because early mistakes cascade to the final response. To address this, existing works adopt a think-first strategy followed by iterative retrieval to reduce hallucination. We argue that the think-first strategy is not always necessary, as we find that (i) SLMs are often accurately confident in their initial answer and (ii) hallucinations can actually be beneficial for homing in on the true answer. As such, we position our work as an inversion of this strategy, i.e., answer first, reason later. We propose a cognitively inspired framework where the model is first allowed to quickly answer the question (System-I, zero-shot) and then resorts to deeper thinking (System-II) based on evidence retrieved from a knowledge source using the initial hypothesis. By combining System-I and System-II style thinking, we show that our method can outperform prior work that takes the traditional think-first route on various multi-step question-answering benchmarks.
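The control flow of the answer-first, reason-later idea can be sketched in a few lines of Python. This is only a schematic under stated assumptions: the three callables stand in for the model and the knowledge source, the way the retrieval query is formed (question plus hypothesis) is a guess, and the toy functions and facts below are hypothetical placeholders rather than anything from the paper.

```python
from typing import Callable, List


def answer_first_reason_later(
    question: str,
    quick_answer: Callable[[str], str],
    retrieve: Callable[[str], List[str]],
    deliberate: Callable[[str, str, List[str]], str],
) -> str:
    """Chain a fast zero-shot guess with evidence-grounded reasoning."""
    hypothesis = quick_answer(question)              # System-I: may be a hallucination
    evidence = retrieve(f"{question} {hypothesis}")  # use the guess as a retrieval cue
    return deliberate(question, hypothesis, evidence)  # System-II: revise with evidence


# Toy stand-ins for the model and knowledge source (purely hypothetical).
KNOWLEDGE = ["Marie Curie was born in Warsaw.", "Warsaw is the capital of Poland."]


def toy_quick_answer(question: str) -> str:
    return "Warsaw"  # a city, not the country that was asked for


def toy_retrieve(query: str) -> List[str]:
    terms = set(query.replace("?", "").split())
    return [fact for fact in KNOWLEDGE if terms & set(fact.rstrip(".").split())]


def toy_deliberate(question: str, hypothesis: str, evidence: List[str]) -> str:
    prefix = f"{hypothesis} is the capital of "
    for fact in evidence:
        if fact.startswith(prefix):
            return fact[len(prefix):].rstrip(".")
    return hypothesis  # fall back to the initial guess


answer = answer_first_reason_later(
    "In which country was Marie Curie born?",
    toy_quick_answer, toy_retrieve, toy_deliberate,
)
```

In the toy run the first guess is wrong as an answer, yet it surfaces the fact needed to reach the right one, which is the sense in which the abstract describes hallucinations as potentially useful.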

Beyond Input Understanding: Diagnosing Multilingual Mathematical Reasoning with Directed Acyclic Trace Graphs

Prefix-Safe Bayesian Belief Tracking for LLM Reasoning Reliability: Separating Calibration from Ranking

DeepSciVerify: Verifying Scientific Claim-Citation Alignment via LLM-Driven Evidence Escalation

ReverseMath: Answer Inversion for Scalable and Verifiable Mathematical Problem Generation

Chain-based Adaptive Reconfiguration Over Lattices for Hallucination Reduction

Hierarchical Prompt-Domain Control and Learning for Resource-Constrained Agentic Language Models

AURA: Asymptotically Optimal Uncertainty-Robust Replanning Algorithm for Kinodynamic Systems

Simulation-Informed Diffusion for Decentralized Multi-robot Motion Planning

TRACES: Proactive Safety Auditing for Multi-Turn LLM Agents via Trajectory-State Modeling

Test-Time Collective Action: Proxy-Based Perturbations for Correcting Algorithmic Harms

Tensor Memory: Fixed-Size Recurrent State for Long-Horizon Transformers

Behavioural Analysis of Alignment Faking

Heterogeneous Parallelism for Multimodal Large Language Model Training

When do complex-valued neural networks help? A study of representation, geometry, and optimization

Aligning LLMs with Human Uncertainty: A Beta-Bernoulli Calibrator for LLM Forecasting

How the Optimizer Shapes Learned Solutions in Equivariant Neural Networks

Design of a Real-time Asynchronous Monocular Odometry for Planetary Exploration

Transferable Reinforcement Learning via Probabilistic Latent Embeddings and Dynamic Policy Adaptation for Sim-to-Real Deployment

Faster Thermal Profiling of a Lunar Rover with Machine Learning Adapted Finite Difference Model

Disentangling Language Roles in Multilingual LLM Task Execution

Hurwitz Quaternion Multiplicative Quantization for KV Cache Compression

Trinity: Unifying Class-Agnostic Terrain and Semantic Segmentation for Unstructured Outdoor Environments by Leveraging Synthetic Data

Agentic Language-to-Objective Synthesis for Optofluidic Assembly

Learning to Translate from Soft to Hard LLM Prompts

Simorgh at SemEval-2026 task 7: Region-Aware Hybrid Retrieval for Low-Resource Cultural Reasoning in Multilingual Question Answering

Reasoning and Planning with Dynamically Changing Norms

Supervised Distributional Reduction via Optimal Transport and Dependence Maximization

Not All NVFP4 QAT Recipes Are Equal: How Architecture and Scale Shape Model Quality for Anomaly Segmentation

Laguna M.1/XS.2 Technical Report

Can Hallucinations Be Useful? Solving Multi-Hop Questions With SLMs By Chaining System-I/II Reasoning