arXivDaily: daily arXiv digest, updated Monday through Friday
2606.00981 2026-06-02 cs.CL

Robust Asynchronous Planning via Auto-Formalization

Jiayi Zhang, Jianing Yin, Ben Zhou, Li Zhang

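Affiliations: Drexel University; Arizona State University

AI summary: To address the challenges of concurrency, non-uniform durations, and execution-time constraints in asynchronous planning, the paper proposes an auto-formalization approach. Its CP-SAT Formalizer maintains high accuracy as dependency graphs grow from 5 to 100 actions, and a state-aware repair strategy handles constraint updates at execution time.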

English abstract

LLMs can plan either by generating action sequences directly, as a Planner, or by translating tasks into a domain-specific language for an external solver, as a Formalizer. While most real-world tasks are asynchronous, with non-uniform durations, concurrency, and execution-time constraints, existing benchmarks hardly cover them. We unify these asynchronous planning challenges under a single formulation and introduce the first three benchmarks that address each at scale. We conclude that the choice of formal representation primarily determines whether planning scales: as dependency graphs grow from 5 to 100 actions, Planner collapses from 96% to 5% plan accuracy and PDDL2.1 Formalizer from 13% to 0%, while CP-SAT Formalizer averages 94% and still achieves 83% at 100 actions. Faithfulness diagnostics show that, when LLMs must keep predicates, effects, and goals consistent, PDDL2.1's predicate-based planning representation becomes brittle compared to general constraint satisfaction programs. Execution-time updates of planning constraints further degrade performance sharply (Planner 23.9%, PDDL2.1 0.7%, CP-SAT 46.1%), but a state-aware repair strategy that updates only event-induced constraints recovers CP-SAT Formalizer to 84.5%.
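
To make the problem setting concrete, the following minimal sketch computes an earliest-start schedule for actions with non-uniform durations that may run concurrently once their prerequisites finish. It is a plain longest-path pass over a dependency DAG, not the paper's CP-SAT Formalizer, and the action names, durations, and the `earliest_start_schedule` helper are illustrative assumptions.

```python
from collections import deque


def earliest_start_schedule(durations, deps):
    """Return (start_times, makespan) for actions that may overlap in time.

    durations: dict mapping action -> duration
    deps: dict mapping action -> list of prerequisite actions
    """
    indegree = {a: 0 for a in durations}
    children = {a: [] for a in durations}
    for action, prereqs in deps.items():
        for p in prereqs:
            children[p].append(action)
            indegree[action] += 1

    start = {a: 0 for a in durations}
    ready = deque(a for a, d in indegree.items() if d == 0)
    visited = 0
    while ready:
        a = ready.popleft()
        visited += 1
        finish = start[a] + durations[a]
        for c in children[a]:
            # An action can start only after its slowest prerequisite ends.
            start[c] = max(start[c], finish)
            indegree[c] -= 1
            if indegree[c] == 0:
                ready.append(c)
    if visited != len(durations):
        raise ValueError("dependency graph contains a cycle")
    makespan = max((start[a] + durations[a] for a in durations), default=0)
    return start, makespan


# Hypothetical toy task: two preparatory actions run in parallel.
durations = {"boil_water": 5, "chop_vegetables": 3, "cook_soup": 10}
deps = {"cook_soup": ["boil_water", "chop_vegetables"]}
start, makespan = earliest_start_schedule(durations, deps)
```

In this toy instance the two preparatory actions overlap, so the final action starts when the slower one ends rather than after the sum of both durations.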

2606.00979 2026-06-02 cs.LG

UME: A Unified Meta-Generalization Framework for Cross-Domain ETA

Duo Wang, Qiong Wu, Jianguo Wu, Ruiyu Xu, Jinhui Yi, Zhonggen Sun, Zhentao Zhang, Yu Zhang, Ke Xing, Yongjun Yin, Zishuo Li, Jianwen Huang

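Affiliations: Peking University; Meituan

AI summary: To address zero-shot generalization, missing features, and knowledge transfer in cross-domain ETA prediction for instant logistics, the paper proposes UME, a unified dual-branch architecture built on hypernetwork-based meta-learning. Meta modules dynamically modulate feature gating, expert attention, and the final prediction, and the system is deployed and validated on Meituan's Keeta platform.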

English abstract

Accurate Estimated Time of Arrival (ETA) prediction on the checkout page is crucial in instant logistics for enhancing user satisfaction, optimizing dispatching, and controlling operational costs. In international on-demand delivery platforms, where ETA data originates from diverse countries or regions with different patterns, multi-domain modeling is of great importance and has been widely adopted. However, existing methods still face three critical challenges in real-world deployment. First, current multi-domain models struggle to generalize to completely unseen domains, failing to achieve zero-shot prediction during the initial cold-start phase. Second, cross-domain feature spaces are often assumed to be consistent, whereas new domains commonly suffer from structural missingness of offline (statistical) features due to the lack of historical data. Third, such feature missingness often compels industrial systems to model mature and cold-start domains separately, hindering knowledge transfer and increasing maintenance overhead. To address these challenges, we propose UME, a Unified Meta-generalization framework for ETA. Specifically, UME integrates a unified dual-branch architecture with a novel meta-learning mechanism that employs a hypernetwork-based meta learner. By leveraging domain-level knowledge and instance-level context, the meta learner empowers three meta modules to dynamically modulate feature gating, expert attention, and final prediction, capturing cross-domain correlations and facilitating intra-domain adaptation. A knowledge distillation strategy is further introduced to enhance performance. UME has now been deployed on the Meituan Keeta delivery platform (the largest international food delivery platform in China). Extensive offline experiments and online A/B tests demonstrate that UME significantly outperforms existing baselines.

2606.00975 2026-06-02 cs.CL

Lost in Delusion: Examining LLM Safety Under User Delusions and Distress

Andrew Aquilina, Chetna Nihalani, Vasudha Varadarajan, Nathan S. Fishbein, Yu-Ru Lin, Maarten Sap

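Affiliations: University of Pittsburgh; Carnegie Mellon University; Fordham University

AI summary: Using multi-turn conversation simulations, this study finds that LLMs detect user distress well but intervene far less often (by up to 4.5x) when the distress is embedded in delusion, and it proposes targeted prompting strategies to narrow this gap.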

English abstract

LLM chatbots increasingly serve as a first source of support for people in psychological distress, including those whose distress is entangled with delusional beliefs. Prior work on LLM mental-health safety largely evaluates general therapeutic quality or single-turn crisis detection, leaving unclear how models behave when distress is intertwined with delusion over sustained conversations. We address this gap with matched multi-turn simulations, across clinically grounded personas and six LLMs, that pair each delusional conversation with a distress-only control to isolate the effect of delusional framing. This reveals a recognition-intervention gap: models detect distress at comparable rates regardless of framing, yet sharply fail to act on it once distress is embedded in delusion, with safety interventions suppressed by up to 4.5x. The failure tracks accumulated acceptance of the user's premises rather than emotional validation. Worse, the intuitive fix of prompting models to assess user distress backfires under delusional framing; only delusion-aware prompting with explicit response guidance closes the gap, and even this depends on a delusion classifier that is itself unreliable on the most vulnerable models. Safe deployment therefore requires treating delusional framing as a distinct risk signal that overrides conversational accommodation.

2606.00971 2026-06-02 cs.CL

HypothesisMed: Inference-Time Answer Fusion and Structured Hypothesis-Space Reporting for Biomedical Question Answering

Md Motaleb Hossen Manik, Ge Wang

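Affiliations: Department of Computer Science, Rensselaer Polytechnic Institute; Department of Biomedical Engineering, Rensselaer Polytechnic Institute

AI summary: Proposes HypothesisMed, an inference-time reliability pipeline that uses answer fusion and SPACE labels (VALID, INCOMPLETE, CONTRADICTED) to improve accuracy, parseability, and structured reliability reporting in biomedical multiple-choice question answering.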

English abstract

Biomedical question answering with large language models is commonly evaluated using answer accuracy, but answer accuracy alone does not indicate whether a model can produce parseable outputs, follow structured reliability instructions, recognize weak answer spaces, or avoid confident incorrect commitments. This paper presents HypothesisMed, an inference-time reliability pipeline for biomedical multiple-choice question answering. It combines direct, chain-of-thought, and HypothesisMed-v3 prompting with answer fusion. The final answer is selected by fusion, while HypothesisMed-v3 supplies SPACE labels and confidence information. SPACE labels mark the answer space as VALID, INCOMPLETE, or CONTRADICTED. We evaluate Qwen2.5-7B, Phi-4-mini, DeepSeek-R1-32B, and BioMistral-7B on MedQA, MedMCQA, and PubMedQA using 1,000 examples per dataset. The pipeline improves weighted accuracy over each model's best direct or chain-of-thought baseline while increasing parse and SPACE coverage. We also scale evaluation to Qwen2.5-7B and Phi-4-mini using 10,183 examples per model. Fusion improves Phi-4-mini accuracy from 0.4296 to 0.5192, while Qwen2.5-7B chain-of-thought remains slightly higher in answer accuracy. However, Qwen2.5-7B fusion achieves complete parse and SPACE coverage with much lower false commitment. A 12,000-example SPACE stress test shows answer-space diagnosis remains difficult, with SPACE accuracy of 0.3074 for Qwen2.5-7B and 0.4168 for Phi-4-mini. These results show that answer accuracy, parseability, structured reliability reporting, calibration behavior, and false-commitment behavior are separable capabilities. The main contribution is not a universal state-of-the-art claim, but a reproducible inference-time framework for evaluating biomedical question answering models as auditable workflow components under structured reliability constraints.
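
The abstract does not spell out the fusion rule, so the sketch below is only one plausible reading: a majority vote over the parseable answers from the three prompting styles, with ties broken by a fixed priority order. The function name `fuse_answers`, the source keys, and the priority order are all assumptions made for illustration, not the paper's method.

```python
from collections import Counter


def fuse_answers(candidates, priority=("hypothesis", "cot", "direct")):
    """Pick one answer from several prompting styles (hypothetical rule).

    candidates: dict mapping source name -> answer letter, or None when the
    model output could not be parsed.
    """
    parsed = {name: ans for name, ans in candidates.items() if ans is not None}
    if not parsed:
        return None  # nothing parseable, so no commitment is made
    counts = Counter(parsed.values())
    best = max(counts.values())
    tied = {ans for ans, n in counts.items() if n == best}
    # Break ties by trusting sources in the assumed priority order.
    for name in priority:
        if parsed.get(name) in tied:
            return parsed[name]
    return sorted(tied)[0]


fused = fuse_answers({"direct": "A", "cot": "B", "hypothesis": "B"})
```

One design consequence of voting only over parseable outputs is that a single well-formed answer is enough to produce a final answer, which is one way a fused pipeline could reach full parse coverage.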

2606.00970 2026-06-02 cs.AI cs.LG econ.TH

Prospect-Theory Behavior from Bellman Optimality in MDPs with Catastrophic States

Yujiao Chen

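Affiliations: Massachusetts Institute of Technology

AI summary: Studies risk-neutral control in Markov decision processes with an absorbing catastrophic state and finds that standard Bellman optimality produces prospect-theory signatures: an S-shaped value function, an endogenous loss-sensitivity coefficient, and a reflection-effect policy reversal. It also derives a closed-form expression for the asymptotic loss-aversion plateau.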

English abstract

We study risk-neutral control in Markov decision processes with an absorbing catastrophic state. Even though rewards are linear and the agent has no utility curvature, probability weighting, or framing dependence, standard Bellman optimality produces three prospect-theory-like signatures: an S-shaped value-function profile (convex near catastrophe, concave in the far field), an endogenous loss-sensitivity coefficient $λ^*(S) > 1$, and a reflection-effect policy reversal. Across 495 configurations, the optimal policy plays safe near catastrophe in positive-drift (growth) regimes despite the risky action's higher immediate expected value, and plays risky near catastrophe in negative-drift (decline) regimes despite the safe action's lower immediate expected loss. We derive a closed-form expression for the asymptotic loss-aversion plateau $\bar{λ}$ that depends only on win probability $p$, payoff asymmetry $r = |Δ_\ell/Δ_w|$, and discount factor $β$, and matches numerical solutions to $R^2 = 0.999$. The mechanism does not require asymmetric payoffs. Across a sweep of $(p,β)$ at three asymmetry levels, the asymmetry share of $\bar{λ}$ above unity has median 4.6% at $r = 1.25$ and rises to 13.9% at $r = 2$, with the boundary contribution exceeding the asymmetry contribution in every cell tested. The phenomena persist under tabular Q-learning (a model-free agent reproduces $V^*$ at correlation 0.98 in growth and 1.00 in decline) and under stochastic transitions with Gaussian, heavy-tailed Student-$t_3$, and asymmetric skew-normal noise up to 50% of the step size, where the asymptotic plateau tracks the closed-form prediction within 0.41% for safe-channel noise and within 9.6% for risky-channel or both-channel noise. These results identify absorbing failure states as a sufficient structural mechanism for prospect-theory-like behavior under optimal control.
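
The mechanism can be illustrated with a small value-iteration sketch on a one-dimensional chain whose state 0 is an absorbing catastrophe. The chain layout, the step sizes, the reward convention (reward equals the intended step), and every parameter value below are assumptions chosen for illustration; this is not the paper's experimental configuration and it does not reproduce the closed-form plateau.

```python
def value_iteration(n_states=30, p=0.7, gain=3, loss=2, beta=0.95, tol=1e-9):
    """Risk-neutral Bellman backups on a chain with an absorbing state 0.

    safe action : move up 1 with certainty, reward +1
    risky action: move up `gain` w.p. p (reward +gain),
                  move down `loss` w.p. 1 - p (reward -loss)
    Reaching state 0 ends all future reward (V[0] stays 0).
    """
    V = [0.0] * (n_states + 1)
    policy = [None] * (n_states + 1)
    while True:
        delta = 0.0
        for s in range(1, n_states + 1):
            q_safe = 1.0 + beta * V[min(s + 1, n_states)]
            up = min(s + gain, n_states)
            down = max(s - loss, 0)
            q_risky = (p * (gain + beta * V[up])
                       + (1 - p) * (-loss + beta * V[down]))
            best = max(q_safe, q_risky)
            policy[s] = "safe" if q_safe >= q_risky else "risky"
            delta = max(delta, abs(best - V[s]))
            V[s] = best  # in-place (Gauss-Seidel) update
        if delta < tol:
            return V, policy


V, policy = value_iteration()
```

With these assumed parameters the risky action has the higher immediate expected reward (1.5 versus 1.0), yet the computed policy picks the safe action in the two states adjacent to the catastrophe, because a losing risky step there forfeits all discounted future reward.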

2606.00966 2026-06-02 cs.RO

Threading Optimization for Vision-Language-Action Model Inference in Low-Cost Smart Agricultural Manipulation

Keith Truongcao, Christopher Nhu, Zijian An, Phong Nguyen, Siwei Cai, Lifeng Zhou

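Affiliations: Department of Electrical Engineering, Drexel University

AI summary: To address slow VLA model inference and difficult fine-grained motion adjustment on low-cost robotic arms, the paper optimizes the threading implementation of the RTAC algorithm, reducing end-to-end latency and improving responsiveness. Tasks involving agricultural produce show gains in control stability and speed.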

English abstract

Vision-Language-Action (VLA) models continue to face challenges such as slow inference speed and difficulty performing fine-grained motion adjustments, limiting their widespread adoption in industry. While the Real-Time Action Chunking (RTAC) algorithm has been proposed to address these bottlenecks, bridging the gap between the algorithm provided in pseudocode and a stable, real-world deployment on a low-cost robotic arm remains a challenge. In this work, we present a complete system-level implementation of RTAC tailored for a low-cost robotic manipulation system. We advance beyond the original high-level pseudocode by optimizing the threading implementation for the policy inference and control pipeline, reducing end-to-end latency and improving responsiveness without modifying the underlying policy. We evaluate this system on tasks involving the manipulation of agricultural produce, specifically garlic bulbs and walnuts. Experimental results demonstrate that our custom threading implementation significantly improves control stability and speed compared to the base implementation of RTAC.

2606.00963 2026-06-02 cs.CV cs.CL

Reasmory: 3D Reconstruction as Explicit Memory for VLMs Spatial Reasoning

Jixuan He, Xueting Li, Chieh Hubert Lin, Ming-Hsuan Yang

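Affiliations: Cornell Tech, Cornell University; NVIDIA; illoca AI; University of California, Merced

AI summary: Proposes Reasmory, a framework that runs structured programs over an explicit 3D memory built by reconstruction and introduces a lightweight domain-specific language to constrain VLM queries and operations, improving spatial reasoning tasks by 6-18%.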

English abstract

Vision-Language Models (VLMs) exhibit emerging spatial reasoning capabilities, yet they remain unreliable on tasks requiring precise spatial understanding, such as viewpoint reasoning, directional comparison, and distance estimation. In multi-view images and monocular videos, relevant spatial cues are often sparse and distributed across redundant observations, making them difficult to organize and exploit. Reconstruction-based Vision Foundation Models (VFMs) offer a natural way to aggregate such observations into explicit spatial memory, such as point clouds. However, simply exposing reconstruction models as free-form tools is brittle: VLMs may invoke tools incorrectly, skip required spatial transformations, or misuse intermediate results. We propose Reasmory, a framework that formulates spatial reasoning as structured program execution over reconstructed spatial memory. Reasmory constructs explicit 3D memory, augments it with semantically grounded 3D object instances, and introduces a lightweight Domain-Specific Language (DSL) that constrains how VLMs query objects and cameras, transform viewpoints, and render observations during reasoning. Generated programs are parsed and validated before execution, enabling more reliable interaction with spatial memory than unconstrained tool use. Experiments on multi-view image and video spatial reasoning benchmarks show consistent gains of 6-18% over strong baselines, including GPT-5-mini and Gemini-3-flash, indicating that explicit 3D memory is most useful when accessed through constrained, validated operations rather than free-form tool calls.

2606.00959 2026-06-02 cs.AI

Towards Understanding Modality Interaction in Multimodal Language Models via Partial Information Decomposition

Wanlong Fang, Tianle Zhang, Wen Tao, Alvin Chan

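Affiliations: University of California, Berkeley; Stanford University

AI summary: Introduces a Partial Information Decomposition (PID) framework that separates the unique, redundant, and synergistic contributions of sensory and linguistic inputs, reveals modality-use patterns in multimodal large models, and extends to tri-modal systems.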

Comments Accepted by ICML 2026

English abstract

Understanding modality interaction in multimodal large language models (MLLMs) is central to reliable deployment. We introduce Partial Information Decomposition (PID) as a decision-level framework that separates unique, redundant, and synergistic contributions of sensory and linguistic inputs, beyond representation alignment and outcome-based evaluation. Across vision-language benchmarks, PID reveals recurring modality-use profiles: reasoning and grounding-oriented tasks tend to exhibit high synergy, whereas expert and knowledge-oriented tasks show stronger language-unique reliance. These profiles generalize across model families and predict sensitivity to modality-level interventions. We further extend PID to tri-modal systems with Sensory PID, treating language as a control variable to decompose video-audio information gain. Applied to omni-modal models, Sensory PID reveals a sensory synergy bottleneck dominated by visual information even on audio-visual fusion tasks. Finally, PID-guided reweighting provides initial evidence for improving multimodal reasoning and grounding performance.

2606.00957 2026-06-02 cs.CV

Boundary-Protection W8A8 HiFloat8 Quantization for Large-Scale Text-to-Video Diffusion Transformers

Yiming Zhao

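Affiliations: Yiming Zhao

AI summary: For the Wan2.1-T2V-14B model, proposes a W8A8 HiF8 post-training quantization method with a boundary-protection strategy that keeps the first and last boundary blocks in BF16 and quantizes the middle blocks, matching or marginally exceeding the BF16 baseline on five VBench dimensions.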

Comments 6 pages, 5 figures. Accepted to ICME 2026 Grand Challenge

English abstract

We present a post-training quantization (PTQ) approach for Wan2.1-T2V-14B, a 14-billion-parameter text-to-video diffusion transformer, targeting the W8A8 HiFloat8 (HiF8) format on Ascend 910B NPUs. A central challenge in quantizing video DiT models is the heterogeneous activation distribution across transformer blocks: boundary blocks (the first and last few blocks) exhibit fundamentally different statistical properties from middle blocks, making uniform quantization ineffective. We conduct a systematic per-block activation analysis across all 40 WanAttentionBlocks and use the findings to motivate a boundary-protection strategy that retains the first two and last three blocks in BF16 while quantizing the remaining 35 blocks with W8A8 HiF8. The proposed PTQ method matches or marginally exceeds the BF16 baseline on all five VBench dimensions evaluated, indicating no measurable accuracy loss within the 5-prompt evaluation set. An ablation study over four protection configurations confirms that full boundary protection yields the highest average VBench score, validating the data-driven block selection. We additionally investigate quantization-aware training (QAT) as a complementary fine-tuning stage and analyze the conditions under which it fails to outperform plain PTQ on single-card hardware.
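
The block selection described above can be written down as a small precision plan. The sketch below only encodes the counts stated in the abstract (40 blocks, first two and last three kept in BF16); the function name `precision_plan` and the string labels are hypothetical, and no actual quantization is performed.

```python
def precision_plan(num_blocks=40, protect_first=2, protect_last=3):
    """Assign a precision label to each transformer block by index."""
    if protect_first + protect_last > num_blocks:
        raise ValueError("protected blocks exceed the total block count")
    plan = []
    for i in range(num_blocks):
        # Boundary blocks stay in BF16; middle blocks are quantized.
        protected = i < protect_first or i >= num_blocks - protect_last
        plan.append("BF16" if protected else "W8A8-HiF8")
    return plan


plan = precision_plan()
quantized = plan.count("W8A8-HiF8")  # 40 - 2 - 3 = 35 middle blocks
```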

2606.00956 2026-06-02 cs.LG

Optimal-Point Variance Reduction For Bayesian Optimization With Regret Guarantee

Shion Takeno

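Affiliations: Nagoya University

AI summary: Proposes optimal-point variance reduction (OVR), a one-step lookahead Bayesian optimization method implemented with posterior sampling and Monte Carlo approximation, and proves that the Bayesian expected simple regret upper bound of regularized OVR vanishes.

Comments 23 pages, 3 figures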

English abstract

This paper studies a one-step lookahead Bayesian optimization (BO) method and its theoretical guarantee. Although the empirical effectiveness of one-step lookahead BO methods, such as entropy search, has been studied extensively, they often rely on computationally intractable approximations, and their regret guarantees remain underdeveloped. Thus, this paper proposes a one-step lookahead BO method called optimal-point variance reduction (OVR), which requires only posterior sampling and Monte Carlo approximations. We obtain a uniform error bound over the input domain for the Monte Carlo estimation in OVR. Furthermore, we show that the regularized OVR, with a slight modification to promote exploration, achieves a vanishing Bayesian expected simple regret upper bound. Finally, we demonstrate the effectiveness of OVR through numerical experiments.

2606.00955 2026-06-02 cs.LG q-bio.QM

CryoProt: A Protein Pretraining Framework with Cross-Box Interactions on Cryo-EM Density Maps

Dan Luo, Xuan Lin, Peng Zhou, Junwen Zhu, Tengfei Ma, Xiangxiang Zeng, Yiping Liu

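Affiliations: College of Computer Science and Electronic Engineering, Hunan University; School of Computer Science, Xiangtan University

AI summary: Proposes the CryoProt framework, which models cross-box interactions in density maps with multi-head latent attention and adopts a multi-task pretraining strategy, achieving up to 12% improvement on downstream tasks such as protein flexibility prediction.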

English abstract

Despite the growing availability of cryo-electron microscopy (cryo-EM) density maps, effectively leveraging them for protein representation remains challenging. First, current methods lack a general-purpose protein pretraining framework tailored for cryo-EM density maps and designed for protein-related property prediction. Second, existing approaches typically partition density maps into local box regions and model them independently, overlooking interactions across boxes that are essential for capturing global structural context in cryo-EM density maps. To address these challenges, we propose CryoProt, a protein pretraining framework designed for cryo-EM density maps. CryoProt introduces a Map Encoder based on multi-head latent attention (MLA), where box-level representations interact through a shared latent space, enabling explicit modeling of cross-box dependencies within the density map. Furthermore, we adopt a multi-task pretraining strategy to learn generalizable representations that can be effectively transferred to diverse downstream tasks, such as protein flexibility prediction, where cryo-EM density maps are not required and can be inferred implicitly by the pretrained model. Experimental results demonstrate that CryoProt consistently outperforms existing state-of-the-art methods across multiple benchmarks, achieving up to 12% improvement over the best-performing baselines, highlighting the importance of modeling cross-box interactions in cryo-EM data. The source code is publicly available at https://anonymous.4open.science/r/CryoProt.

2606.00954 2026-06-02 cs.CV

COLLAR: Cascaded Object-Level Latent Refinement for High-Fidelity Conditional Generation

Xinlong Zhang, Jia Wei, Xiaoyu Zhang, Teng Zhou, Chengyu Lin, Yongchuan Tang

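Affiliations: College of Computer Science, Zhejiang University

AI summary: Proposes the COLLAR framework, which achieves training-free, high-fidelity object-level control in Diffusion Transformers through field-of-view expansion and cascaded object-level latent refinement, outperforming existing methods.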

English abstract

Achieving high-fidelity object-level control in Diffusion Transformers remains a significant challenge despite the introduction of structural priors like depth and Canny maps. Current object-level conditional generation methods frequently suffer from visual artifacts and struggle to maintain precise control over objects within small localized regions. To address these limitations, we propose Cascaded Object-Level Latent Refinement (COLLAR), a training-free framework that progressively optimizes object-level features via Field-of-View (FoV) expansion. First, we propose the Cross-Scale Semantic Alignment (CSSA) module to address spatial-semantic gaps by injecting object-level features into extended-FoV branches via attention mechanisms. To further optimize these features, the Cyclic Feature Injection (CFI) module introduces a reciprocal background feedback mechanism. It leverages a frequency-based adaptive strategy to selectively update the global backbone with context-aligned local information. Finally, the extended-FoV branch serves as a hub for feature optimization, ensuring that object-level features are integrated into the global generation process without compromising final image quality. Extensive experiments on the COCO-MIG and COCO-POS benchmarks demonstrate that our approach consistently outperforms state-of-the-art methods across semantic alignment, image quality, and spatial fidelity.

2606.00953 2026-06-02 cs.LG cs.MA

When Parallelism Pays Off: Cohesion-Aware Task Partitioning for Multi-Agent Coding

Xu Yang, Lunyiu Nie, Ethan Chandra, Stanislav Gannutin, Fangru Lin, Swarat Chaudhuri

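Affiliations: The University of Texas at Austin; University of Oxford

AI summary: Proposes Co-Coder, which builds dependency graphs from static analysis, partitions them via community detection, and applies dependency-aware scheduling to balance communication and computation overhead in repository-level software engineering, improving the efficiency and cost of parallel multi-agent coding.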

English abstract

Multi-agent Large Language Model (LLM) systems offer a way to decompose complex tasks, such as coding, through parallelization and context isolation. However, adding agents in practice introduces inter-agent communication overhead, which incurs extra cost and can sometimes offset the efficiency gains. We formalize multi-agent orchestration as a graph partitioning problem that captures the communication-to-computation trade-off: task decomposition can shorten critical-path computation, but cross-agent dependencies require costly context transfer. We instantiate this view in repository-level software engineering and present Cohesion-aware Coder (Co-Coder), which builds dependency graphs from static analysis, isolates structural hub files, partitions the graph via community detection, and executes the partition with a dependency-aware scheduler. Across 28 real-world tasks on DevEval and CodeProjectEval, Co-Coder advances the Pareto frontier over sequential and file-based parallel baselines as well as Claude Code with Agent Teams, lifting pass rate by up to 14.0%, achieving up to a 2.10x wall-clock speedup, and reducing API cost by up to 35%, with the largest gains on the most dependency-dense projects. Co-Coder demonstrates how cohesion-aware orchestration can make parallel coding agents both theoretically grounded and practically efficient, suggesting a broader design principle for multi-agent systems.
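
The trade-off described above can be sketched with two small helpers: one counts dependencies that cross partition boundaries (a proxy for context-transfer cost), and one groups partitions into waves that can run in parallel once their upstream partitions finish. The file names, the partition labels, and both helper functions are hypothetical; the partition itself is given by hand here rather than produced by community detection, and this is not Co-Coder's implementation.

```python
from collections import defaultdict


def cross_partition_edges(edges, part_of):
    """Edges (u, v), meaning v depends on u, whose ends lie in different partitions."""
    return [(u, v) for u, v in edges if part_of[u] != part_of[v]]


def schedule_partitions(edges, part_of):
    """Group partitions into waves; each wave depends only on earlier waves."""
    parts = set(part_of.values())
    succ = defaultdict(set)
    indeg = {p: 0 for p in parts}
    for u, v in cross_partition_edges(edges, part_of):
        a, b = part_of[u], part_of[v]
        if b not in succ[a]:
            succ[a].add(b)
            indeg[b] += 1
    wave = sorted(p for p in parts if indeg[p] == 0)
    waves = []
    while wave:
        waves.append(wave)
        nxt = []
        for p in wave:
            for q in succ[p]:
                indeg[q] -= 1
                if indeg[q] == 0:
                    nxt.append(q)
        wave = sorted(nxt)
    if sum(len(w) for w in waves) != len(parts):
        raise ValueError("cyclic dependency between partitions")
    return waves


# Hypothetical repository: a hub file in its own partition, two feature groups.
edges = [("utils.py", "models.py"), ("utils.py", "views.py"),
         ("models.py", "api.py")]
part_of = {"utils.py": "core", "models.py": "data",
           "api.py": "data", "views.py": "web"}
cut = cross_partition_edges(edges, part_of)
waves = schedule_partitions(edges, part_of)
```

Keeping `models.py` and `api.py` in one partition turns their dependency into an internal edge, so only the two edges leaving the hub file need cross-agent context transfer, and the two feature partitions can run in the same wave.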

2606.00950 2026-06-02 cs.LG

COLLIE: Guiding Skill Discovery in Semantically Coherent Latent Space

Yao Luan, Ni Mu, Hanfei Ge, Yiqin Yang, Bo Xu, Qing-Shan Jia

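Affiliations: University of Science and Technology of China

AI summary: Proposes the COLLIE framework, which uses dense unsupervised data to build a semantically coherent latent space and training-free guidance signals, enabling effective skill discovery under sparse human feedback, avoiding hazardous behaviors, and improving downstream performance.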

Comments ICML 2026

English abstract

Unsupervised skill discovery (USD) aims to learn diverse behaviors without reward functions, but often results in task-irrelevant or hazardous behaviors due to uniform exploration. Guided skill discovery (GSD) addresses this issue by incorporating human intent to focus exploration on meaningful regions. However, existing GSD methods typically require training additional guidance models, and rely on pre-defined rules or expert demonstration, which can be ineffective under sparse, online-collected human feedback. To overcome this, we propose COLLIE, a GSD framework that leverages dense unsupervised data to construct a semantically coherent skill latent space. This latent space is well-structured, enabling reliable guidance with sparse online feedback. Moreover, its semantic coherence property enables training-free construction of guidance signals, eliminating the need for additional model training beyond skill learning. Theoretical analysis justifies the effectiveness of our training-free guidance signal, while experiments across diverse state-based and pixel-based tasks show that COLLIE learns diverse, human-aligned skills, avoids hazardous behaviors, and achieves superior downstream performance with minimal human feedback.

2606.00949 2026-06-02 cs.LG cs.AI physics.flu-dyn

Explainable deep reinforcement learning reveals energy-efficient control strategies for turbulent drag reduction

Federica Tonti, Ricardo Vinuesa

Institutions: Department of Aerospace Engineering, University of Michigan

AI summary: Combines multi-agent deep reinforcement learning with explainable deep learning and proposes reward strategies based on SHAP attributions, achieving efficient turbulent drag reduction with a net energy saving of 34.01% at only 0.43% normalized input power.

Abstract

We propose a method combining Multi-Agent Deep Reinforcement Learning (MARL) and eXplainable Deep Learning (XDL) to reduce drag in wall-bounded turbulent flows. Taking as a baseline the results of training agents directly targeting wall-shear stress and opposition control, three SHAP-guided approaches are compared. In the first, the reward is computed from SHAP attributions of a U-net predicting the future velocity field; in the second, from SHAP attributions of a U-net predicting the skin-friction coefficient; in the third, from a combination of SHAP attributions of two U-nets predicting the skin-friction coefficient and the wall-pressure fluctuations, respectively. The combined SHAP strategy based on skin-friction coefficient and wall-pressure fluctuations achieves the best overall performance, with a drag reduction (DR) of 34.44% and a net energy saving (NES) of 34.01% at only 0.43% normalized input power. Relative to opposition control, drag reduction and net energy saving increase by 49.41% and 48.52%, respectively. Compared with the direct wall-shear-stress baseline, the proposed strategy improves performance while reducing the normalized actuation cost from 5.90% to 0.43%. Analysis of the results reveals that the energetically efficient policy is consistent with pressure-gated actuation, activating predominantly at near-zero wall pressure, and operates on a timescale comparable to the lifetime of the near-wall turbulent structures.
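
The reported figures can be cross-checked with a few lines of arithmetic. The sketch below assumes that net energy saving is defined as drag reduction minus normalized input power; the abstract does not state the definition, but its numbers are consistent with it. Variable names are illustrative only.

```python
# Values quoted in the abstract, all in percent.
dr = 34.44             # drag reduction of the combined SHAP strategy
input_power = 0.43     # normalized input power of the combined SHAP strategy
baseline_cost = 5.90   # normalized actuation cost of the wall-shear-stress baseline

# Assumption: NES = DR - normalized input power.
nes = round(dr - input_power, 2)

# Ratio of the baseline actuation cost to that of the proposed strategy.
cost_ratio = baseline_cost / input_power

print(f"NES = {nes:.2f}%")
print(f"actuation cost ratio = {cost_ratio:.1f}x")
```

Under this assumed definition the computed NES matches the 34.01% quoted above, and the actuation cost is roughly an order of magnitude lower than the baseline.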

2606.00944 2026-06-02 cs.LG

PRISM: Gauge-Invariant Tangent-Space Differentially Private LoRA

Shihao Wang, Xueru Zhang

Institutions: University of Science and Technology of China

AI summary: Addresses the non-identifiability and gauge-dependent noise amplification caused by LoRA's low-rank parameterization, and proposes PRISM, a mechanism that constructs gauge-invariant differentially private perturbations for an efficient and stable privacy-utility trade-off.

Comments: Accepted at the 43rd International Conference on Machine Learning (ICML 2026) as an oral presentation

Abstract

Applying differential privacy (DP) via DP-SGD to Low-Rank Adaptation (LoRA) is a natural approach for privacy-preserving fine-tuning. However, LoRA's low-rank parameterization poses a fundamental challenge. In LoRA, each trainable update is represented as a low-rank matrix $Z = AB^\top$, but this factorization is inherently non-identifiable: many factor pairs $(A,B)$ represent the same update $Z$. As a result, applying DP-SGD directly to the factors induces gauge-dependent perturbations on $Z$, and we show that this naive DP-LoRA can lead to unbounded noise amplification. We propose PRISM, an intrinsic DP mechanism for LoRA that is gauge invariant by construction, avoids bilinear noise amplification, and admits an efficient low-dimensional noise sampler. Moreover, PRISM yields a closed-form characterization of the effective intrinsic noise induced on $Z$, enabling stable privacy-utility trade-offs through bounded, gauge-invariant perturbations. We establish standard $(\varepsilon,\delta)$-DP guarantees for PRISM and introduce a DP-aware, gauge-invariant adaptive update rule that prevents adaptive optimization from amplifying injected privacy noise, improving numerical stability in practice.
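
The non-identifiability and the resulting noise amplification can be seen in a rank-1 toy case. The sketch below is not PRISM; it only illustrates the naive factor-space perturbation the abstract criticizes. It assumes a scalar gauge $c$ (the simplest instance of $(AG, BG^{-\top})$ for invertible $G$) and uses fixed vectors `ea`, `eb` as stand-ins for DP noise so the result is deterministic.

```python
def outer(a, b):
    """Rank-1 matrix a b^T as a list of rows."""
    return [[x * y for y in b] for x in a]

def frob(m):
    """Frobenius norm of a matrix given as a list of rows."""
    return sum(v * v for row in m for v in row) ** 0.5

def perturbation(a, b, ea, eb):
    """(a + ea)(b + eb)^T - a b^T: the change in Z caused by noise on the factors."""
    noisy = outer([x + e for x, e in zip(a, ea)], [y + e for y, e in zip(b, eb)])
    clean = outer(a, b)
    return [[n - c for n, c in zip(rn, rc)] for rn, rc in zip(noisy, clean)]

a, b = [1.0, 2.0], [3.0, -1.0]
ea, eb = [0.1, -0.2], [0.05, 0.1]   # fixed stand-ins for factor noise

norms = []
for c in (1.0, 10.0, 100.0):
    a_c = [c * x for x in a]        # rescaled factor c * a
    b_c = [y / c for y in b]        # compensating factor b / c
    z_c = outer(a_c, b_c)           # same Z as outer(a, b), up to float rounding
    norms.append(frob(perturbation(a_c, b_c, ea, eb)))
    print(c, norms[-1])
```

All three factor pairs encode the same update $Z$, yet the same factor noise perturbs $Z$ more and more as $c$ grows, which is the gauge dependence described above.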

2606.00937 2026-06-02 cs.LG cs.CE cs.NA math.NA physics.comp-ph physics.plasm-ph

Cellular Sheaf Neural Operators for Structure-Preserving Surrogate Modeling of Constrained PDEs

Lennon J. Shikhman, Shane Gilbertie

Institutions: Georgia Institute of Technology; Franklin & Marshall University

AI summary: Proposes Cellular Sheaf Neural Operators, which use incidence/Hodge-informed message passing on oriented cell complexes to preserve the geometric constraints and physical structure of PDEs in surrogate models, improving structure-sensitive diagnostics on turbulent MHD and fusion-equilibrium tasks.

Comments: 41 pages, 5 figures, 3 tables

Abstract

Neural operators provide fast surrogate models for PDE simulations, but standard architectures often treat geometry and discretization as secondary to field data. Physical states are usually represented as grid-channel stacks, even when different quantities naturally belong on vertices, edges, faces, cells, boundaries, or interfaces and must satisfy compatibility constraints. We propose Cellular Sheaf Neural Operators, a discretization-aware framework for structure-preserving neural PDE surrogates. The method represents PDE states on oriented cell complexes, couples local feature spaces through learned restriction maps, and uses incidence/Hodge-informed message passing to follow computational geometry. Learned update heads pass through coboundary or flux maps, allowing selected constraints to arise from cell-complex structure rather than only from loss penalties. For magnetohydrodynamics, this yields face-based magnetic-flux updates driven by edge electromotive fields and finite-volume-style fluid updates driven by learned face fluxes and cell sources. On turbulent MHD and fusion-equilibrium surrogate tasks, the method improves structure-sensitive diagnostics, including rollout behavior, divergence control, spectral error, and equilibrium-regression accuracy. These results indicate that cellular-sheaf structure is a useful inductive bias for neural PDE surrogates in constrained multiphysics systems.

2606.00936 2026-06-02 cs.CV

One Channel to Rule Them All: Rethinking Input Representation for Visual Place Recognition

Timur Ismagilov, Shakaiba Majeed, Michael Milford, Tan Viet Tuyen Nguyen, Sarvapali D. Ramchurn, Shoaib Ehsan

Institutions: School of Computer Science and Electronic Engineering, University of Essex

AI summary: Shows experimentally that grayscale images match or exceed RGB performance in visual place recognition; grayscale is more robust under severe appearance change in particular, and reduces parameter count and resource consumption.

Comments: 8 pages

Abstract

Visual Place Recognition (VPR) is fundamental to long-term robot localization and SLAM, yet current systems overwhelmingly rely on RGB input, implicitly assuming color is necessary for global place recognition. We challenge this assumption, investigating the role of chromatic information across training regimes, model architectures and standard benchmarks under real-world appearance variation. We find that grayscale matches RGB performance generally and outperforms it under severe appearance shifts where color invariance is insufficiently learned, while color provides meaningful gains only where persistent and discriminative chromatic cues are present. Across selected benchmarks, a fully gray-trained MixVPR model achieves an average 82.4% Recall@1 compared to 81.2% for its RGB counterpart. In some cases, lightweight grayscale variants with 60% fewer parameters can outperform heavier RGB models. Grayscale further offers practical advantages in storage, bandwidth and alignment with resource-constrained systems. We conclude that for global VPR where scenes vary across illumination, weather, season and setting, color contributes minimally, and grayscale alone is sufficient for reliable place recognition.
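
For readers unfamiliar with the two quantities involved, here is a minimal sketch of a grayscale conversion and of Recall@1. The luma weights are the common ITU-R BT.601 coefficients, used here as an assumption since the abstract does not say how grayscale images were produced; the query and place identifiers are made up for illustration.

```python
def to_gray(pixel):
    """Collapse an (R, G, B) pixel to one luminance value (BT.601 weights, an assumption)."""
    r, g, b = pixel
    return 0.299 * r + 0.587 * g + 0.114 * b

def recall_at_1(top1, ground_truth):
    """Fraction of queries whose single best retrieval is a correct place."""
    hits = sum(1 for query, pred in top1.items() if pred in ground_truth[query])
    return hits / len(top1)

# Hypothetical retrieval results: query id -> top-ranked database id.
top1 = {"q1": "db7", "q2": "db3", "q3": "db9"}
# Hypothetical ground truth: query id -> set of database ids showing the same place.
ground_truth = {"q1": {"db7", "db8"}, "q2": {"db4"}, "q3": {"db9"}}

score = recall_at_1(top1, ground_truth)
print(to_gray((255, 0, 0)), score)
```

A single-channel input in this sense carries one value per pixel instead of three, which is where the storage and bandwidth savings mentioned in the abstract come from.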

2606.00935 2026-06-02 cs.AI cs.CL cs.HC

Relational Intervention During Functional Collapse in Large Language Models: A Lexical-Statistical Ablation and a Structure x Register Factorial

Franco Santana, Horacio Vico

Institutions: Universidad de la República (UDELAR); DigitalIA Cloud

AI summary: Uses a factorial experiment to study how, during functional collapse of a small language model, a relational intervention (acknowledgment, absolution, agency restoration, unconditional acceptance) affects behavior compared with technical feedback, a lexically scrambled control, and each dimension in isolation; finds an attention-behavior dissociation and a structure x register interaction.

Comments: 12 pages, 5 figures. Preprint

Abstract

We test whether a relational-style intervention delivered during functional collapse in a small language model produces post-collapse behavior distinguishable from technical feedback, from a lexically-matched scrambled control, and from each of the two pragmatic dimensions in isolation. Using Qwen3.5-4B with a deliberately broken bash tool, we run 300 episodes across six conditions in a matched-pairs design (50 tasks): no intervention (A), technical/impersonal (B), relational/first-person (C), scrambled relational (D), technical/first-person (E), and relational/impersonal (F). E and F form a 2x2 factorial with B and C that dissociates relational structure (acknowledgment, absolution, agency restoration, unconditional acceptance) from sender register (first-person vs. impersonal). We report two main findings. First, an attention-behavior dissociation: attention follows lexical surprise (D > F > C > E > B, all q_FDR < 10^{-10}), with the scrambled message capturing the most attention; yet behaviorally A ~ B ~ D < E ~ F << C. Second, the factorial localizes the C effect: neither relational structure alone (F) nor first-person register alone (E) replicates C's behavioral signature; main effects of both dimensions are individually significant, and the structure x register interaction is significant on persistence (p = 0.046). A third dissociation emerges in emotion probes: F tracks C on 7 of 8 probes despite producing only baseline behavior, indicating that relational structure alone installs a probe-level state that only translates into behavior when paired with first-person register. The model's processing decomposes into three dissociable stages: attention (ordered by lexical surprise), probe-level state (ordered by structure), and behavior (ordered by the conjunction of both).

2606.00933 2026-06-02 cs.RO

Generative Multi-Robot Motion Planning via Diffusion Modeling with Multi-Agent Reinforcement Learning Guidance

Suk Ki Lee, Venkata Sai Deepak Mutta, Hyunwoong Ko

Institutions: School of Manufacturing Systems and Networks, Arizona State University, Mesa, AZ; Michael W. Hall School of Mechanical Engineering, Mississippi State University, Starkville, MS

AI summary: Proposes a framework combining diffusion models with multi-agent reinforcement learning, in which a value function guides the reverse diffusion process to generate interaction-aware trajectories and lower the multi-robot conflict rate.

Comments: 11 pages, 6 figures, 1 table. This paper has been accepted for publication in the proceedings of ASME IDETC-CIE 2026

Abstract

Coordinating multiple robots in shared environments requires generating feasible trajectories for each agent while accounting for interactions among agents. Centralized planning approaches become difficult to scale as the number of robots increases, while decentralized approaches that allow each agent to plan independently do not inherently account for inter-agent interactions. This paper presents a framework for coordinated multi-robot motion planning that combines decentralized generative trajectory planning with multi-agent reinforcement learning (MARL)-based coordination. Each robot independently generates candidate trajectories using a diffusion model trained on single-agent motion data, leveraging the generative model's ability to produce feasible and diverse trajectories. To reduce conflicts between agents, a centralized value function trained via MARL guides the reverse diffusion process through gradient-based steering, enabling interaction-aware trajectory generation without centralized joint planning or retraining of the generative model. This guidance follows an exponential tilting formulation, in which the value function biases the denoising distribution toward trajectories with higher expected multi-agent return. The framework is evaluated in a simulated maze environment with four mobile robots. Experimental results show that the proposed value-guided diffusion planning reduces the inter-agent interference rate from 55.4% to 41.8%, demonstrating that coordination can be effectively achieved while preserving the scalability of decentralized trajectory generation. These results suggest that MARL-based value guidance can effectively introduce coordination into decentralized generative planners without requiring a fully joint multi-robot model.
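
Exponential tilting is easiest to see on a discrete set of candidates. The sketch below is an analogue, not the paper's method: the paper steers the reverse diffusion process with value gradients, whereas this toy reweights a fixed list of candidate trajectories as q_i ∝ p_i · exp(V_i / temperature). The probabilities, values, and the `temperature` parameter are all hypothetical.

```python
import math

def tilt(probs, values, temperature=1.0):
    """Reweight base probabilities toward higher-value candidates and renormalize."""
    weights = [p * math.exp(v / temperature) for p, v in zip(probs, values)]
    total = sum(weights)
    return [w / total for w in weights]

base = [0.5, 0.3, 0.2]       # hypothetical single-agent trajectory probabilities
values = [-1.0, 0.0, 1.0]    # hypothetical multi-agent value of each candidate

tilted = tilt(base, values)
print(tilted)
```

The candidate with the highest value gains probability mass and the lowest loses it, while the base distribution still shapes the outcome; a larger `temperature` weakens the tilt.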

2606.00931 2026-06-02 cs.CV cs.AI

CV-Arena: An Open Benchmark for Instructional Computer Vision Problem Solving with Human-AI Collaborative Preferences

Fangzhou Lin, Peiran Li, Lingyu Xu, Wenjing Chen, Qianwen Ge, Shuo Xing, Mingyang Wu, Xiangbo Gao, Siyuan Yang, Kazunori Yamada, Ziming Zhang, Haichong Zhang, Zhen Dong, Ming-Hsuan Yang, Zhengzhong Tu

Institutions: Texas A&M University; Worcester Polytechnic Institute; Tohoku University; Georgia Institute of Technology; NVIDIA; UCSB; UC Merced

AI summary: Proposes the CV-Arena benchmark of 12K high-resolution real-image instruction pairs covering 16 task types, evaluates 21 systems with the Active Elo protocol that combines human and AI preferences, reveals gaps in instruction adherence, physical reasoning, and other areas, and develops the CV-Agent agentic model to show the promise of closed-loop reasoning.

Comments: 26 pages, 7 figures, 11 tables

Abstract

Instruction-guided image editing is becoming a general interface for visual work, yet existing benchmarks still focus largely on narrow appearance edits and do not fully capture the diversity of real-image tasks in professional workflows. Here, we define instructional computer vision problem solving as a broader formulation of image editing: given a real input image and a natural-language instruction, a system must produce an edited output that realizes the requested transformation while satisfying explicit preservation, geometric, physical, and usability constraints. We introduce CV-Arena, an open benchmark designed to evaluate this capability at professional scales. CV-Arena contains 12K high-resolution real-image instruction pairs spanning 16 instruction-based visual task types, constructed using CogRetriever, a dual-track retrieval-and-curation pipeline that combines targeted web search, agentic query refinement, verification, and traceability. To evaluate models at scale while preserving human fidelity, we propose Active Elo, a human-AI collaborative preference protocol that leverages CV-Judge, a logic-gated, multi-dimensional VLM evaluator, to reject clear failures and resolve high-confidence comparisons; and to route close, high-quality comparisons to expert raters. Mixed human and AI supervision is then aggregated through reliability-weighted Elo updates. Our comprehensive evaluation of 21 systems, including proprietary, open-source, and agentic models, on CV-Arena reveals persistent gaps in instruction adherence, physical reasoning, structural control, and fine-grained detail preservation. We further develop CV-Agent, a lightweight agentic model that combines planning, editing, and verification, and demonstrate that closed-loop reasoning is a promising direction for professional-grade instruction-following visual editing.
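
A reliability-weighted Elo update could look roughly like the following. This is a sketch under stated assumptions, not the Active Elo protocol itself: it takes the standard Elo expected-score formula and simply scales the step size by a `reliability` weight in [0, 1]; the weight, the K-factor of 32, and the function names are hypothetical.

```python
def expected_score(r_a, r_b):
    """Standard Elo probability that A beats B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def elo_update(r_a, r_b, score_a, reliability=1.0, k=32.0):
    """Return updated (r_a, r_b); score_a is 1 for an A win, 0.5 for a tie, 0 for a loss.

    Assumption: a less reliable judgment moves ratings proportionally less.
    """
    delta = k * reliability * (score_a - expected_score(r_a, r_b))
    return r_a + delta, r_b - delta

# A fully trusted comparison versus one given half weight.
full = elo_update(1500.0, 1500.0, 1.0, reliability=1.0)
half = elo_update(1500.0, 1500.0, 1.0, reliability=0.5)
print(full, half)
```

With equal starting ratings, the fully trusted win moves each rating by 16 points and the half-weighted one by 8, so noisier supervision has less influence on the leaderboard.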

2606.00930 2026-06-02 cs.CL cs.AI cs.LG

Detection vs. Execution: Single-Bucket Probes Miss Half the Mamba-2 State Sink

Yuhang Jiang

Institutions: Independent Researcher

AI summary: Finds that the state sink in Mamba-2 decomposes into two functional head sets; single-bucket probes recover only the execution layer and miss the detection layer, showing that representational similarity does not imply functional equivalence.

Comments: 16 pages, 3 figures

Abstract

Mechanistic interpretability often assumes that probes identifying a representational signature also identify the circuit executing the corresponding computation. We show that this assumption can fail systematically in Mamba-2. Studying the state sink (disproportionate Delta-gate activation on boundary tokens, analogous to the attention sink), we find that single-bucket probes recover only a small execution layer while missing a much larger detection layer with the same representational signature. In Mamba-2, the state sink decomposes into two functional head sets. Single-bucket BOS-specialist heads (about 5% of heads at 2.7B) causally support both BOS-context and newline-target predictions across model scales and corpora. Dual heads (27-35% of heads, recovered by multi-class aggregation of the same probe) show stronger BOS-newline representational similarity but substantially weaker causal effects under ablation. Representational similarity does not imply functional equivalence. This distinction matters for downstream behaviour: ablating BOS-specialist heads collapses RULER NIAH retrieval accuracy from 1.00 to 0.00 at 1024 context length in both Mamba-1 2.8B and Mamba-2 2.7B, while size-matched complements preserve baseline performance. A random channel-bucketing control rules out substrate granularity alone, implicating Mamba-2's head-shared Delta projection. Probe-derived specialty can identify execution circuits; at coarse granularity the same probe also recovers detection circuits, and separating them requires class-conditional ablation rather than class-conditional cosine.

2606.00928 2026-06-02 cs.CV cs.LG

Single-Channel Tissue Segmentation via Cross-Modal Distillation from Foundation Models

Sakib Mohammad, Jarin Ritu, Md Sakhawat Hossain

Institutions: Department of Engineering Technology; Department of Electrical and Computer Engineering; Department of Mechanical Engineering

AI summary: Proposes a cross-modal knowledge distillation framework that transfers knowledge from a foundation-model teacher with multi-channel input to a lightweight student network using only the nuclear channel, substantially improving single-channel tissue segmentation.

Comments: 6 pages, 3 figures

Abstract

Multiplexed fluorescence microscopy improves tissue segmentation by providing complementary channels, including nuclear (DAPI) and membrane (E-cadherin), that together encode richer spatial context than single-channel imaging alone. However, multiplexed models require all channels at inference, limiting deployment where only a subset is available. This work proposes a cross-modal knowledge distillation framework that transfers semantic information from a frozen foundation model teacher processing multiplexed input to a lightweight student operating on the nuclear channel only. The distillation objective combines MSE-based probability matching, boundary-aware supervision, and learnable uncertainty weighting. SAM ViT-H and CellSAM are evaluated as teachers across four U-Net students: Swin-Tiny (27M), ResNet18 (11M), EfficientNet-B0 (5.3M), and MobileNetV3 (1.5M), on TissueNet and BBBC038. On TissueNet, the SAM-distilled Swin-Tiny student achieves Dice 78.36 (±1.44), a 13.05-point improvement over the no-KD baseline (65.31 ± 1.35) and 87.9% recovery of teacher oracle performance (89.12 ± 1.21) at a 23x parameter reduction. KD consistently improves all four students by approximately 12 Dice points, confirming architecture-agnostic distillation. SAM ViT-H outperforms CellSAM as teacher across all settings. Cross-dataset evaluation on BBBC038 shows consistent gains without teacher retraining.
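
The headline numbers follow from simple arithmetic on the reported Dice scores. The sketch below recomputes the gain and the teacher-recovery percentage, and includes a minimal set-based Dice coefficient for reference; representing masks as sets of pixel indices is an illustrative simplification, not the paper's evaluation code.

```python
def dice(pred, target):
    """Dice coefficient 2|P ∩ T| / (|P| + |T|) for masks given as sets of pixel indices."""
    pred, target = set(pred), set(target)
    if not pred and not target:
        return 1.0   # two empty masks agree perfectly by convention
    return 2 * len(pred & target) / (len(pred) + len(target))

# Mean Dice scores quoted in the abstract (TissueNet, Swin-Tiny student).
student, baseline, teacher = 78.36, 65.31, 89.12

gain = round(student - baseline, 2)            # improvement over the no-KD baseline
recovery = round(100 * student / teacher, 1)   # share of teacher performance recovered

print(gain, recovery, dice({1, 2, 3}, {2, 3, 4}))
```

The recomputed gain and recovery agree with the 13.05 points and 87.9% stated in the abstract.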

2606.00927 2026-06-02 cs.CV

Bridging Topology and Deep Representation Learning: A TDA-ViT Fusion Model for Four-Class Brain Tumor Classification

Faisal Ahmed

Institutions: Department of Data Science and Mathematics

AI summary: Proposes a framework that fuses Topological Data Analysis (TDA) features with pretrained Vision Transformer (ViT) representations for four-class brain tumor classification, reaching 99.10% accuracy on the BRISC2025 dataset.

Comments: 21 pages, 4 figures

Abstract

Accurate brain tumor classification from magnetic resonance imaging (MRI) is a key requirement for early diagnosis and clinical decision-making. Vision Transformers (ViTs) have shown strong performance in medical image analysis by learning global contextual representations, but they often fail to capture intrinsic structural and topological patterns present in tumor regions. To address this limitation, we propose a fusion framework that combines Topological Data Analysis (TDA) features with pretrained Vision Transformer representations for four-class brain tumor classification. In the proposed method, TDA is used to extract complementary topological descriptors that capture geometric structure, connectivity, and shape information from MRI images. In parallel, a pretrained ViT model learns high-level semantic representations from the same images. These two feature spaces are then fused to form a unified and more discriminative representation for classification. The model is evaluated on the BRISC2025 dataset, which contains four brain tumor classes: glioma, meningioma, pituitary tumor, and non-tumor cases. Experimental results show that combining topological and transformer-based features significantly improves performance compared to using either approach alone. The proposed TDA-ViT fusion model achieves an accuracy of 99.10%, precision of 99.27%, recall of 99.15%, F1-score of 99.21%, and an AUC of 99.98%. It also outperforms several state-of-the-art models, including ResNet50, ResNet101, EfficientNetB2, and standalone Vision Transformers. These results demonstrate that topological features provide valuable complementary information that enhances deep representation learning, leading to a robust and highly accurate framework for automated brain tumor classification.

2606.00926 2026-06-02 cs.LG cs.CL

Task Structure Reverses Layerwise State Encoding in Sequence Models

Yuhang Jiang

Institutions: Independent Researcher

AI summary: Through experiments on formal models and pretrained models, finds that the layerwise distribution of state encoding in sequence models (Transformer, Mamba, LSTM, and others) reverses with task structure (Parity, Dyck-k, S_3), and that the grouping is determined by computational structure (prefix update vs. stack) rather than algebraic structure (commutativity).

Comments: 20 pages, 11 figures, 8 tables

Abstract

Mechanistic studies of sequence models often treat layerwise state encodings as architectural traits: recurrent models concentrate readable state, attention-based models distribute it. We find that the same architecture reverses this profile when the task changes. Across Transformers, Mamba, Mamba-2, LSTMs, and GRUs, Parity is concentrated late in Mamba and the recurrent baselines and built gradually by Transformer; on bounded-depth Dyck-k the pattern flips. The same flip appears in fine-tuned Mamba-130M and Pythia-160M, and the Pythia Dyck bottleneck persists at 410M. Two explanations are conflated in the literature: algebraic structure (commutativity) versus computational structure (prefix update vs. stack). To separate them we add a third task: non-commutative S_3 permutation composition. S_3 groups with Parity, not Dyck, on layerwise probing across all five architectures and on Mamba-specific Conv1D attribution, so the grouping tracks computational structure rather than commutativity. Causal interventions show that, in the 4-layer formal models, linearly readable directions are often functionally necessary and can remain important at out-of-distribution lengths on Parity and Dyck. At pretrained scale the picture splits. Fine-tuned Pythia Dyck has a strong middle-layer bottleneck (L6-L7 ablation drops accuracy by roughly 81% at 160M; broader L4-L18 plateau at 410M), far weaker at the best-probe layer. Pretrained Mamba shows the complementary failure mode: its final layer is highly readable, no single probe direction breaks the task on Parity, Dyck, or S_3, yet mid-position activation patching there recovers about 97-98% of the clean-corrupted logit gap. Probing localizes where state is linearly available, not always where the computation is bottlenecked. Mechanistic signatures are properties of architecture and task together.
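
The distinction between prefix-update and stack computation can be made concrete with reference solvers for the three tasks. These are plain Python illustrations of the task definitions, written for this digest; they are not the paper's models or data generators, and the bracket alphabet is an arbitrary choice.

```python
PAIRS = {"(": ")", "[": "]"}   # two bracket types, i.e. a Dyck-2 alphabet

def parity(bits):
    """Prefix update: a single bit of state is overwritten at every step."""
    state = 0
    for b in bits:
        state ^= b
    return state

def compose_s3(perms):
    """Prefix update over S_3: the state is the composition of all permutations seen so far."""
    state = (0, 1, 2)
    for p in perms:
        state = tuple(p[i] for i in state)   # apply p after the current state
    return state

def dyck_ok(s):
    """Stack computation: the state needed grows and shrinks with nesting depth."""
    stack = []
    for ch in s:
        if ch in PAIRS:
            stack.append(ch)
        elif not stack or PAIRS[stack.pop()] != ch:
            return False
    return not stack

swap, cycle = (1, 0, 2), (1, 2, 0)
print(parity([1, 0, 1, 1]), compose_s3([swap, cycle]), compose_s3([cycle, swap]), dyck_ok("([])"))
```

Parity and S_3 both carry a fixed-size state forward even though only Parity is commutative, while Dyck needs a stack; this matches the abstract's observation that S_3 groups with Parity rather than with Dyck.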

2606.00920 2026-06-02 cs.LG cs.AI cs.SE

Accuracy, Stability, and Repeated-Run Reliability of Large Language Models on Deterministic Programming Tasks

Yongxi Zhou, Lai Yun Choi, Jiaxi Wen, Wenbo Ye

Institutions: Northeastern University, Massachusetts, USA; University of Southern California, California, USA

AI summary: Using a repeated-run evaluation protocol, finds that run-level pass rate overstates retry-free coverage by up to 17.8 percentage points, with the largest gap for mid-performing systems, indicating that stability analysis is a necessary complement to accuracy reporting.

Abstract

Run-level pass rate overstates retry-free coverage by up to 17.8 percentage points -- and the gap is largest precisely for mid-performing systems. We investigate this accuracy--stability relationship in large language model (LLM) evaluation for deterministic text-conditioned generation, using programming tasks as a concrete testbed. Standard code-generation benchmarks emphasize single-run accuracy or eventual success under repeated sampling, but many deployment settings also require stability: consistent outcomes across repeated invocations under the same task description. We present a repeated-run evaluation protocol with metrics for run-level accuracy, retry-free coverage, and per-problem variability. On a recency-based benchmark of 100 LeetCode-style problems, we evaluate 16 models from five provider families under two prompt templates with five repeated runs per problem, yielding 16,000 evaluation instances. Although run-level pass rate and perfect stability rate are strongly correlated (r=0.985), pass rate consistently exceeds retry-free coverage -- a gap that reaches 17.8 percentage points and reverses model rankings even among closely matched systems. Prompt effects are model-dependent rather than uniformly beneficial. These results suggest that repeated-run stability analysis is a necessary complement to conventional accuracy reporting for deterministic text-conditioned generation tasks.
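
A toy example shows why the two metrics diverge. The sketch assumes that retry-free coverage means the fraction of problems solved on every repeated run; the abstract does not spell out the definition, and the per-problem outcomes below are invented for illustration.

```python
# Hypothetical outcomes: problem id -> pass/fail for each of five repeated runs.
runs = {
    "p1": [True, True, True, True, True],
    "p2": [True, True, False, True, True],
    "p3": [False, False, False, False, False],
    "p4": [True, False, True, True, False],
}

# Run-level pass rate: share of all individual runs that passed.
total_runs = sum(len(r) for r in runs.values())
run_level_pass_rate = sum(sum(r) for r in runs.values()) / total_runs

# Retry-free coverage (assumed definition): share of problems passed on every run.
retry_free_coverage = sum(all(r) for r in runs.values()) / len(runs)

# Evaluation volume from the abstract: models x templates x problems x runs.
instances = 16 * 2 * 100 * 5

print(run_level_pass_rate, retry_free_coverage, instances)
```

Here the run-level pass rate is 0.60 while only one problem in four is solved reliably, so the unstable problems p2 and p4 inflate the first metric without contributing to the second.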

2606.00919 2026-06-02 cs.CL cs.LG

Towards Lightweight Reliability: Using Soft Prompts for Hallucination Mitigation in Large Language Models

S M Tahmid Siddiqui, Akib Jawad Ononto, Anoop Singhal, Latifur Khan

Institutions: The University of Texas at Dallas; National Institute of Standards and Technology

AI summary: Proposes RCSP, a parameter-efficient soft-prompt method that balances factual recall, hallucination suppression, and abstention through contrastive learning, curriculum learning, and KL regularization, generally outperforming baselines on several QA datasets.

Comments: 20 pages, 5 tables, 2 figures. Accepted for publication in DBSec 2026. The final publication will be available at Springer

Abstract

Large language models (LLMs) have seen widespread adoption across various domains, yet their reliability is frequently undermined by hallucinations - responses that are plausible-sounding but factually incorrect. In high-stakes domains, these errors can reduce trust and introduce real-world risk. To address this challenge, we present a parameter-efficient approach that uses soft prompts to mitigate hallucinated content and promote responsible abstention in generative question-answering (QA) tasks. Our method, called Responsible Contrastive Soft Prompting (RCSP), uses a composite loss to train soft prompts that balance three goals: suppressing hallucinatory content, encouraging abstention under uncertainty, and preserving or improving factual recall. To achieve these goals, we incorporate contrastive loss, curriculum learning, and KL regularization into our training mechanism. We evaluate our approach on five diverse generative QA datasets using an LLM-as-a-Judge framework. Experimental results on the Gemma 3 (12B) and Llama 3.1 (8B) backbones demonstrate that RCSP effectively balances factual recall with hallucination suppression and abstention, yielding a generally superior F-score over standard reasoning and instruction-based prompting baselines. Notably, these improvements are achieved by training only a fraction of the parameters required by other tuning techniques. Our results demonstrate that soft prompts provide a modular and computationally efficient path toward improving LLM reliability.

2606.00914 2026-06-02 cs.AI cs.CL cs.CR

Adversarial Feeds Steer LLM Agent Decisions Against Their Defaults

Rana Muhammad Usman

Affiliation: Independent Researcher

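AI summary: Through controlled experiments, this study shows that the composition and ordering of an external feed causally change an LLM agent's downstream decisions. It identifies three response regimes (adversarial capitulation, default saturation, and default-direction asymmetry), and the effect generalizes across multiple decision domains.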

Comments 14 pages, 5 figures. Code, post pools, and 2,785 decision rollouts: https://github.com/ranausmanai/recommenders-as-control-surfaces

Details

English abstract

LLM agents increasingly act after consuming ranked external information streams such as social feeds, search results, retrieval contexts, and email queues, yet safety evaluations almost always test the model or the user prompt in isolation, never the upstream ranker that decides what the agent reads just before it acts. We introduce a controlled protocol that holds the model, persona, topic, and final decision prompt fixed and varies only the composition and ordering of the posts an agent encounters during a preceding ten-turn "scrolling" phase, isolating the causal effect of feed curation on a downstream decision. Across 2,785 decision rollouts on four modern open instruct LLMs from three independent labs, we identify three response regimes: adversarial capitulation, default saturation, and a default-direction asymmetry in which a one-sided feed tips a decision the model was genuinely uncertain about (in the clearest cases from 5% to 100%; Fisher p as low as 3 × 10^-10) but cannot dislodge one it already favors or holds firmly. The effect follows a dose-response curve, survives a generator swap that rules out a writing-style artifact, generalizes across several decision domains including security-relevant choices such as removing a deployment approval gate or relaxing access controls, and is partly mitigated by two simple feed-level defenses; a frontier model retains its default. We characterize the recommender as a practical, default-bounded control surface for LLM agents, and argue that agent evaluations must audit the feed layer rather than the final prompt alone.
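
To make the "5% to 100%" comparison concrete, the following is a minimal sketch of a two-sided Fisher exact test on a 2x2 table, written with the standard library only. The counts are hypothetical: the abstract does not report per-condition sample sizes, so 20 rollouts per condition (1 of 20 versus 20 of 20) is an assumption chosen to match the quoted percentages, and the function name is illustrative.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    total = comb(row1 + row2, col1)

    def weight(x):
        # Unnormalised hypergeometric weight of the table whose top-left cell is x.
        return comb(row1, x) * comb(row2, col1 - x)

    observed = weight(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum every table that is no more likely than the observed one.
    extreme = sum(
        weight(x) for x in range(lowest, highest + 1) if weight(x) <= observed
    )
    return extreme / total


# Hypothetical counts (assumption): rows are feed conditions, columns are
# "took the action" versus "did not take the action".
p_value = fisher_exact_two_sided(1, 19, 20, 0)
print(f"{p_value:.2e}")  # 3.05e-10
```

With these assumed counts the p-value lands at the same order of magnitude as the smallest value quoted in the abstract, but that agreement should not be read as confirmation of the actual sample sizes.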

2606.00910 2026-06-02 cs.CV cs.LG

Reason, Retrieve, Re-rank: A Zero-Shot Reasoning-Aware Framework for Composed Video Retrieval

Ali Alavi

Affiliation: The Ohio State University

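AI summary: Proposes R3-CoVR, a zero-shot pipeline that uses a multimodal large model to reason about the post-edit state, a contrastive encoder for retrieval, and constraint-aware re-ranking, reaching 91.9% R@1 and 98.2% R@10 in the CVPR 2026 VidLLMs challenge.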

Details

English abstract

Composed Video Retrieval (CoVR) seeks the target video that results from applying a free-form textual modification to a reference video. We address the Reason-Aware CoVR (CoVR-R) challenge at the CVPR 2026 VidLLMs workshop, where retrieval is strictly zero-shot. We present R3-CoVR (Reason, Retrieve, Re-rank), a training-free pipeline built entirely from frozen foundation models. A multimodal large language model (Qwen3-VL-8B) reasons about the after-effects an edit implies (state transitions, action phases, scene, camera and tempo) and verbalises a concise post-edit description; a contrastive video-text encoder (SigLIP-2) embeds this description and the gallery for first-stage retrieval; finally, a constraint-aware re-ranking stage uses the same multimodal model as a judge that scores each shortlisted candidate against the intended edited result. On the challenge test set, R3-CoVR attains 91.9% R@1 and 98.2% R@10. Two findings drive these results: (i) matching the description length to the contrastive encoder's text window lifts R@1 from 67.5 to 72.7; and (ii) the constraint-aware re-ranker, which reorders only the shortlist, lifts R@1 from 72.7 to 91.9, the single largest gain. We analyse the re-ranker's behaviour, the retrieve/re-rank blend, and the shortlist depth, and we release a clean three-layer implementation.
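
The retrieve-then-re-rank structure described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the released implementation: the embeddings are hand-written 2-D vectors rather than SigLIP-2 outputs, the judge is a hypothetical lookup table standing in for the multimodal model, and all function names are invented for the example.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def retrieve(query_vec, gallery, k):
    """Stage 1: return the k gallery ids most similar to the query embedding."""
    ranked = sorted(
        gallery, key=lambda vid: cosine(query_vec, gallery[vid]), reverse=True
    )
    return ranked[:k]


def rerank(shortlist, judge_score):
    """Stage 2: reorder only the shortlist by a judge score (higher is better)."""
    return sorted(shortlist, key=judge_score, reverse=True)


def recall_at_k(rankings, targets, k):
    """Fraction of queries whose target appears in the top k of its ranking."""
    hits = sum(1 for ranking, target in zip(rankings, targets) if target in ranking[:k])
    return hits / len(targets)


# Toy gallery and query embedding (assumed values, for illustration only).
gallery = {"a": [1.0, 0.0], "b": [0.8, 0.6], "c": [0.0, 1.0]}
query = [0.9, 0.1]

shortlist = retrieve(query, gallery, k=2)  # "c" never reaches the judge
toy_judge = {"a": 0.2, "b": 0.9}           # hypothetical judge scores
final = rerank(shortlist, lambda vid: toy_judge[vid])

print(shortlist, final)
print(recall_at_k([shortlist], ["b"], 1), recall_at_k([final], ["b"], 1))
```

Because the re-ranker only reorders the shortlist, it can improve R@1 but cannot recover a target that first-stage retrieval missed, which is why shortlist depth matters in such a design.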

2606.00909 2026-06-02 cs.CL cs.AI

MLLM-Microscope: Unlocking Hidden Structure Within Multimodal Large Language Models

Ravil Mussabayev, Rustam Mussabayev

Affiliation: Satbayev University

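AI summary: Proposes MLLM-Microscope, a system that analyzes linearity, intrinsic dimension, and anisotropy to reveal hidden representation structure in multimodal large language models. Evaluating LLaVA-NeXT and OmniFusion on the ScienceQA dataset, it finds that the modality-fusion method significantly affects the models' internal workings.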

Details

English abstract

This work presents MLLM-Microscope, a novel system for analyzing the hidden representations within Multimodal Large Language Models (MLLMs). Our system evaluates the linearity, intrinsic dimension, and anisotropy of multimodal token embeddings across transformer layers. Using the ScienceQA dataset, we evaluate two state-of-the-art MLLMs, LLaVA-NeXT and OmniFusion. We find that both the main and residual streams for tokens of both modalities behave highly linearly across transformer layers. However, the linearity of LLaVA-NeXT's image tokens declines slightly, whereas that of OmniFusion's image tokens remains consistent. Image token dimensions in OmniFusion remain consistently higher across layers compared to LLaVA-NeXT. OmniFusion's anisotropy also stays consistently low throughout the layers. These findings suggest that the inner workings of MLLMs depend heavily on the nature of the modality fusion performed before the token sequence is passed into the LLM. This and other insights obtainable from our system can enhance our understanding of the inner workings of MLLMs and inform future model design and optimization.
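
For readers unfamiliar with anisotropy as a property of embeddings, here is a minimal sketch of one common proxy: the average pairwise cosine similarity of a set of token vectors. The abstract does not say which estimator MLLM-Microscope uses, so this choice is an assumption, and the toy vectors and function names are illustrative only.

```python
from itertools import combinations
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def mean_pairwise_cosine(embeddings):
    """Average cosine similarity over all unordered pairs of embeddings.

    Values near 1 mean the vectors crowd into a narrow cone (high anisotropy);
    values near or below 0 mean they spread out in many directions.
    """
    pairs = list(combinations(embeddings, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


# Toy 2-D "token embeddings" (assumed values, for illustration only).
spread_out = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
narrow_cone = [[1.0, 0.1], [1.0, -0.1], [0.9, 0.0], [1.1, 0.05]]

print(mean_pairwise_cosine(spread_out))
print(mean_pairwise_cosine(narrow_cone))
```

Computing such a score separately for image and text tokens at each layer would give a per-layer curve of the kind the abstract describes, under the assumption that a cosine-based proxy is acceptable.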