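arXivDaily: a daily academic digest synced with the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, and electrical engineering and systems.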

2604.09429 2026-06-01 cs.CV cs.AI cs.LG

Rays as Pixels: Learning A Joint Distribution of Videos and Camera Trajectories

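Wonbong Jang, Shikun Liu, Soubhik Sanyal, Juan Camilo Perez, Kam Woh Ng, Sanskar Agrawal, Juan-Manuel Perez-Rua, Yiannis Douratsos, Tao Xiang

Affiliations: Meta AI

AI summary: Proposes Rays as Pixels, a video diffusion model that represents cameras as dense ray pixels (raxels) sharing a latent space with video frames and denoises the two jointly, enabling both camera trajectory prediction and camera-controlled video generation.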

Comments Accepted to ICML 2026. 9-page main paper plus supplementary material. Project page: https://wbjang.github.io/raysaspixels/

Abstract

Recovering camera parameters from images and rendering scenes from novel viewpoints have been treated as separate tasks in computer vision and graphics. This separation breaks down when image coverage is sparse or poses are ambiguous, since each task depends on what the other produces. We propose Rays as Pixels, a Video Diffusion Model (VDM) that learns a joint distribution over videos and camera trajectories. To our knowledge, this is the first model to predict camera poses and perform camera-controlled video generation within a single framework. We represent each camera as dense ray pixels (raxels), a pixel-aligned encoding that lives in the same latent space as video frames, and denoise the two jointly through a Decoupled Self-Cross Attention mechanism. A single trained model handles three tasks: predicting camera trajectories from video, generating video from input images along a pre-defined trajectory, and jointly synthesizing video and trajectory from input images. We evaluate on pose estimation and camera-controlled video generation, and introduce a closed-loop self-consistency test showing that the model's predicted poses and its renderings conditioned on those poses agree. Ablations against Plücker embeddings confirm that representing cameras in a shared latent space with video is substantially more effective.
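To make the idea of a pixel-aligned camera encoding concrete, the sketch below builds a generic per-pixel ray map for a pinhole camera, plus the standard Plücker embedding that the abstract names as the ablation baseline. It is not the paper's raxel encoding, which the abstract does not specify. The function names, the world-to-camera convention `x_cam = R @ x_world + t`, and the half-pixel offset are all assumptions made for illustration.

```python
import numpy as np

def pixel_rays(K, R, t, height, width):
    """Per-pixel ray origins and unit directions in world coordinates.

    Assumes a pinhole camera with intrinsics K and world-to-camera
    extrinsics x_cam = R @ x_world + t (an illustrative convention).
    """
    # Pixel centers, shape (height, width).
    u, v = np.meshgrid(np.arange(width) + 0.5, np.arange(height) + 0.5)
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1)        # (H, W, 3)
    dirs_cam = pixels @ np.linalg.inv(K).T                     # back-project
    dirs_world = dirs_cam @ R                                  # row form of R.T @ d
    dirs_world = dirs_world / np.linalg.norm(dirs_world, axis=-1, keepdims=True)
    center = -R.T @ t                                          # camera center in world
    origins = np.broadcast_to(center, dirs_world.shape)
    return origins, dirs_world

def plucker_map(origins, dirs):
    """Standard 6-channel Plücker embedding (direction, moment = origin x direction)."""
    moment = np.cross(origins, dirs)
    return np.concatenate([dirs, moment], axis=-1)             # (H, W, 6)
```

Either map has the same spatial layout as an image, which is what lets a camera be treated like extra image channels; how the paper places such a map in the video latent space is beyond what the abstract states.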

2604.20650 2026-06-01 cs.CV

MAPRPose: Mask-Aware Proposal and Amodal Refinement for Multi-Object 6D Pose Estimation

Yang Luo, Yan Gong, Yongsheng Gao, Xiaoying Sun, Jie Zhao

Affiliations: State Key Laboratory of Robotics and Systems, Harbin Institute of Technology; School of Civil Engineering, Harbin Institute of Technology; Shenzhen Infinite Meta Robot Co., Ltd

AI summary: Proposes MAPRPose, a two-stage framework that generates pose proposals from mask-aware correspondences and refines them robustly with amodal-driven ROI prediction, reaching 76.5% average recall on the BOP benchmark, 3.1% above FoundationPose, with a 43x speedup in multi-object inference.

Abstract

6D object pose estimation in cluttered scenes remains challenging due to severe occlusion and sensor noise. We propose MAPRPose, a two-stage framework that leverages mask-aware correspondences for pose proposal and amodal-driven Region-of-Interest (ROI) prediction for robust refinement. In the Mask-Aware Pose Proposal (MAPP) stage, we lift 2D correspondences into 3D space to establish reliable keypoint matches and generate geometrically consistent pose hypotheses based on correspondence-level scoring, from which the top-$K$ candidates are selected. In the refinement stage, we introduce a tensorized render-and-compare pipeline integrated with an Amodal Mask Prediction and ROI Re-Alignment (AMPR) module. By reconstructing complete object geometry and dynamically adjusting the ROI, AMPR mitigates localization errors and spatial misalignment under heavy occlusion. Furthermore, our GPU-accelerated RGB-XYZ reprojection enables simultaneous refinement of all $N \times B$ pose hypotheses in a single forward pass. Evaluated on the BOP benchmark, MAPRPose achieves a state-of-the-art Average Recall (AR) of 76.5%, outperforming FoundationPose by 3.1% AR while delivering a 43x speedup in multi-object inference.
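The proposal-then-batch pattern described above can be sketched in a few lines. This is only an illustration of top-K selection followed by stacking the survivors for a single batched pass; the array shapes, the function names, and the reading of the batch as "objects times kept candidates" are assumptions, not details from the paper.

```python
import numpy as np

def select_top_k(scores, k):
    """scores: (N, H) hypothesis scores per object.

    Returns (N, k) hypothesis indices, best score first.
    """
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(-scores, axis=1, kind="stable")
    return order[:, :k]

def gather_batch(poses, top_idx):
    """poses: (N, H, 4, 4) pose hypotheses; top_idx: (N, k).

    Returns (N * k, 4, 4), i.e. every kept hypothesis stacked so that
    a refiner could process them in one forward pass.
    """
    n = poses.shape[0]
    picked = poses[np.arange(n)[:, None], top_idx]   # (N, k, 4, 4)
    return picked.reshape(-1, 4, 4)
```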

2604.18587 2026-06-01 cs.LG cs.AI cs.LO cs.PL

Compile to Compress: Boosting Formal Theorem Provers by Compiler Outputs

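Guchan Li, Rui Tian, Hongning Wang

Affiliations: Department of Computer Science and Technology, Tsinghua University, Beijing, China

AI summary: Uses the compiler to compress large numbers of proof attempts into structured failure modes and proposes a learning-to-refine framework whose tree search corrects errors locally based on verifier feedback, achieving state-of-the-art performance on PutnamBench under comparable test-time budgets.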

Abstract

Large language models (LLMs) have demonstrated significant potential in formal theorem proving, yet state-of-the-art performance often necessitates prohibitive test-time compute via massive roll-outs or extended context windows. In this work, we address this scalability bottleneck by exploiting an informative structure in formal verification: the observation that compilers map a vast space of diverse proof attempts to a compact set of structured failure modes. We introduce a learning-to-refine framework that leverages this compression to perform efficient learning and proof exploration. We perform tree search that corrects errors locally conditioned on explicit verifier feedback, thereby circumventing the costs associated with accumulating a long history of proof attempts. Extensive evaluations show that our method consistently amplifies the reasoning capabilities of base provers across varying scales. Notably, our approach achieves state-of-the-art performance on PutnamBench among publicly reported $\sim$8B and $\sim$32B parameter models under comparable test-time budgets, offering a scalable paradigm for next-generation verifier-guided reasoning.

2604.17551 2026-06-01 cs.LG cs.AI

SVL: Goal-Conditioned Reinforcement Learning as Survival Learning

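Franki Nguimatsia Tiofack, Fabian Schramm, Théotime Le Hellard, Justin Carpentier

Affiliations: Inria; École Normale Supérieure, PSL Research University, Paris, France

AI summary: Proposes survival value learning (SVL), which recasts goal-conditioned reinforcement learning as a survival learning problem by modeling time-to-goal as a probability distribution and fitting a hazard model by maximum likelihood, matching or surpassing strong baselines on offline benchmarks.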

Comments Accepted to the 43rd International Conference on Machine Learning, Seoul, South Korea

Abstract

Standard approaches to goal-conditioned reinforcement learning (GCRL) that rely on temporal-difference learning can be unstable and sample-inefficient due to bootstrapping. While recent work has explored contrastive and supervised formulations to improve stability, we present a probabilistic alternative, called survival value learning (SVL), that reframes GCRL as a survival learning problem by modeling the time-to-goal from each state as a probability distribution. This structured distributional Monte Carlo perspective yields a closed-form identity that expresses the goal-conditioned value function as a discounted sum of survival probabilities, enabling value estimation via a hazard model trained via maximum likelihood on both event and right-censored trajectories. We introduce three practical value estimators, including finite-horizon truncation and two binned infinite-horizon approximations to capture long-horizon objectives. Experiments on offline GCRL benchmarks show that SVL combined with hierarchical actors matches or surpasses strong hierarchical TD and Monte Carlo baselines, excelling on complex, long-horizon tasks. Webpage and Code: https://simple-robotics.github.io/publications/survival-value-learning/
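The link between a value function and survival probabilities can be checked numerically. The sketch below assumes one common convention, V = E[γ^T] for a discrete time-to-goal T, under which the identity E[γ^T] = 1 − (1 − γ) Σ_t γ^t S(t) holds with S(t) = P(T > t). The paper's exact value definition and estimators are not given in the abstract, so the convention and all function names here are illustrative assumptions.

```python
import numpy as np

def survival_from_hazard(hazard):
    """S(t) = P(T > t) = prod_{k <= t} (1 - h_k), with h_k = P(T = k | T >= k)."""
    return np.cumprod(1.0 - np.asarray(hazard, dtype=float))

def value_from_survival(hazard, gamma):
    """V = 1 - (1 - gamma) * sum_t gamma^t S(t), assuming V = E[gamma^T]."""
    S = survival_from_hazard(hazard)
    t = np.arange(len(S))
    return 1.0 - (1.0 - gamma) * np.sum(gamma ** t * S)

def value_direct(hazard, gamma):
    """Reference computation: E[gamma^T] from the probability mass function."""
    h = np.asarray(hazard, dtype=float)
    S = survival_from_hazard(h)
    at_risk = np.concatenate(([1.0], S[:-1]))   # P(T >= t)
    pmf = h * at_risk                           # P(T = t)
    t = np.arange(len(h))
    return np.sum(pmf * gamma ** t)
```

Setting the last hazard to 1.0 forces the goal to be reached within the horizon, so the truncated sum is exact and the two functions agree to floating-point precision.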

2604.16278 2026-06-01 cs.AI cs.CL cs.LG

Learning to Reason with Insight for Informal Theorem Proving

Yunhe Li, Hao Shi, Bowen Deng, Wei Wang, Mengzhe Ruan, Hanxu Hou, Zhongxiang Dai, Siyang Gao, Chao Wang, Shuang Qiu, Linqi Song

Affiliations: City University of Hong Kong; Tsinghua University; Ke Holdings Inc.; Shenzhen University of Advanced Technology; Chinese University of Hong Kong, Shenzhen

AI summary: Targets the lack of insight (recognizing core techniques) as the bottleneck in informal theorem proving and proposes DeepInsight, a unified training framework combining a hierarchical dataset, progressive multi-stage SFT, and insight-based policy optimization, which significantly improves the mathematical reasoning of large language models.

Abstract

Although most of the automated theorem-proving approaches depend on formal proof systems, informal theorem proving can align better with large language models' (LLMs) strength in natural language processing. In this work, we identify a primary bottleneck in informal theorem proving as a lack of insight, namely the difficulty of recognizing the core techniques required to solve complex problems. To address this, we propose $\texttt{DeepInsight}$, a unified training framework designed to cultivate this essential reasoning skill and enable LLMs to perform insightful reasoning. Our framework consists of three components: (1) $\texttt{DeepInsightTheorem}$, a hierarchical dataset that structures informal proofs by explicitly extracting core techniques and proof sketches alongside the final proof; (2) a Progressive Multi-Stage SFT strategy that mimics the human learning process, teaching the model proof writing, planning, and insight identification; and (3) $\texttt{InsightPO}$, a policy optimization method that assigns structured rewards over this insight hierarchy. Our experiments on challenging mathematical benchmarks demonstrate that this insight-aware generation strategy significantly outperforms baselines. These results demonstrate that teaching models to identify and apply core techniques can substantially improve their mathematical reasoning.

2604.15959 2026-06-01 cs.LG

Multi-Objective Bayesian Optimization via Adaptive $\varepsilon$-Constraints Decomposition

Yaohong Yang, Sammie Katt, Samuel Kaski

Affiliations: Department of Computer Science, Aalto University, Espoo, Finland; ELLIS Institute Finland; Department of Computer Science, University of Manchester, Manchester, United Kingdom

AI summary: Proposes STAGE-BO, which uses adaptive $\varepsilon$-constraint decomposition to turn multi-objective optimization into a sequence of constrained subproblems, achieving uniform Pareto coverage while handling constraints and preferences.

Comments 24 pages, 22 figures, 4 tables. Accepted at the Forty-Third International Conference on Machine Learning (ICML 2026)

Abstract

Multi-objective Bayesian optimization (MOBO) provides a principled framework for optimizing multiple expensive black-box functions. However, existing MOBO methods often struggle with coverage, scalability, and handling constraints and preferences. In this work, we propose STAGE-BO, Sequential Targeting Adaptive Gap-Filling $\varepsilon$-Constraint Bayesian Optimization: by analyzing the coverage of the surrogate Pareto front, our method identifies the Pareto front point with the largest uncovered gap and uses its coordinates to define adaptive constraints for the $\varepsilon$-constraint method. This transforms the problem into a sequence of inequality-constrained subproblems, which are solved efficiently via constrained expected improvement acquisition. Our approach provides uniform Pareto coverage without hypervolume computation and naturally handles constraints and preferences. Experiments on synthetic and real-world benchmarks demonstrate superior coverage and competitive hypervolume performance against state-of-the-art baselines. Our code implementation can be found at https://github.com/YangYaohong1/STAGE-BO.
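The gap-targeting step can be sketched for a two-objective minimization problem. The code below finds the non-dominated points, locates the widest gap between neighbors on the front, and returns the gap midpoint, whose second coordinate could serve as the bound $\varepsilon$ in a subproblem of the form "minimize $f_1$ subject to $f_2 \le \varepsilon$". Measuring gaps by Euclidean distance between neighbors and targeting the midpoint are assumptions for illustration; the abstract does not say how STAGE-BO measures coverage, and this sketch works on observed points rather than a surrogate front.

```python
import numpy as np

def pareto_front(points):
    """Non-dominated subset (minimization), sorted by the first objective."""
    pts = np.asarray(points, dtype=float)
    keep = []
    for i, p in enumerate(pts):
        # p is dominated if some point is <= p everywhere and < p somewhere.
        dominated = np.any(np.all(pts <= p, axis=1) & np.any(pts < p, axis=1))
        if not dominated:
            keep.append(i)
    front = pts[keep]
    return front[np.argsort(front[:, 0])]

def largest_gap_epsilon(front):
    """Midpoint and width of the widest gap between neighboring front points."""
    gaps = np.linalg.norm(np.diff(front, axis=0), axis=1)
    i = int(np.argmax(gaps))
    midpoint = 0.5 * (front[i] + front[i + 1])
    return midpoint, float(gaps[i])
```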

2604.11613 2026-06-01 cs.LG cs.AI

Symmetry Reveals Layerwise Dynamics: How Transformers Perform In-Context Classification

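Patrick Lutz, Themistoklis Haris, Arjun Chandra, Aditya Gangrade, Venkatesh Saligrama

Affiliations: Boston University, Departments of Computer Science

AI summary: By enforcing feature and label permutation equivariance, extracts an explicit depth-indexed recursive update rule from Transformers, revealing the geometry-driven algorithm behind in-context classification.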

Comments appears in the Proceedings of the 43rd International Conference on Machine Learning (ICML '26)

2603.12277 2026-06-01 cs.CL cs.AI cs.CR

Prompt Injection as Role Confusion

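Charles Ye, Jasmine Cui, Dylan Hadfield-Menell

Affiliations: Massachusetts Institute of Technology

AI summary: Using role probes and a CoT Forgery attack, shows that prompt injection stems from LLMs' confused perception of which role a piece of text comes from, and that the degree of role confusion predicts attack success.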

Comments ICML 2026

Abstract

LLMs see the world as a single stream of text, partitioned into roles like <user> or <tool>. We trace prompt injection to role confusion: models perceive the source of text from how it sounds, not its labeled role. A command hidden in a webpage hijacks an agent simply because it sounds like <user> text, despite its <tool> label. We design role probes to measure how LLMs internally perceive "who is speaking," and find that injected text occupies the same representational space as the trusted role it imitates. We demonstrate this with CoT Forgery, a zero-shot attack that injects fabricated reasoning into user prompts and tool outputs. Models mistake the forgery for their own thoughts, yielding 60% attack success against frontier models with near-zero baselines. Strikingly, the degree of role confusion predicts attack success before a single token is generated. This mechanism generalizes beyond CoT Forgery to standard agent prompt injections, revealing prompt injection as a measurable consequence of role perception. To the model, sounding like a role is indistinguishable from being one.

2604.06484 2026-06-01 cs.CL

ValueGround: Evaluating Culture-Conditioned Visual Value Grounding in MLLMs

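Zhipin Wang, Christoph Leiter, Christian Frey, Mohamed Hesham Ibrahim Abdalla, Josif Grabocka, Steffen Eger

Affiliations: University of Technology Nuremberg (UTN)

AI summary: Proposes the ValueGround benchmark, which uses minimally contrastive image pairs to evaluate culture-conditioned visual value judgments in multimodal large language models, and finds that accuracy is markedly lower with visualized options than with textual options.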

Comments Updated preprint

Abstract

Cultural values are expressed not only through language but also through visual scenes and everyday social practices. Yet existing evaluations of cultural values in language models are almost entirely text-only, leaving it unclear whether culture-conditioned judgments remain stable when response options are visualized. We introduce ValueGround, a benchmark for evaluating culture-conditioned visual value grounding in multimodal large language models (MLLMs). Built from World Values Survey questions, ValueGround uses minimally contrastive image pairs to represent opposing response options while controlling irrelevant variation. Given a country, a question, and an image pair, a model must choose the image that best matches the country's value tendency without access to the original response-option texts. Experiments across six MLLMs and 13 countries show that models perform substantially worse with visualized response options than with the original textual options, with average accuracy dropping from 72.8% to 62.6%. Our benchmark provides a controlled testbed for studying cross-modal transfer of culture-conditioned value judgments.

2604.10432 2026-06-01 cs.RO

AnySlot: Goal-Conditioned Vision-Language-Action Policies for Zero-Shot Slot-Level Placement

Zhaofeng Hu, Sifan Zhou, Qinbo Zhang, Rongtao Xu, Qi Su, Jorge Mendez-Mendz, Ci-Jyun Liang

Affiliations: Stony Brook University; Carnegie Mellon University; MBZUAI; Peking University

AI summary: Proposes AnySlot, a framework that converts language instructions into spatial visual goals, decoupling high-level slot selection from low-level execution to achieve precise zero-shot slot-level placement.

Abstract

Vision-Language-Action (VLA) policies have emerged as a versatile paradigm for generalist robotic manipulation. However, precise object placement under compositional language remains challenging for end-to-end VLA policies. Slot-level placement requires reliable slot grounding and centimeter-level geometric precision. To this end, we propose AnySlot, a framework that reduces compositional complexity by introducing an explicit spatial visual goal between language grounding and control. AnySlot converts language into a visual goal by rendering a spatial marker at the intended slot, then executes this goal with a goal-conditioned VLA policy. This hierarchical design decouples high-level slot selection from low-level execution, improving semantic accuracy and spatial robustness. Furthermore, recognizing the lack of benchmarks for such precision-demanding tasks, we introduce SlotBench, a structured simulation benchmark with nine task categories for evaluating spatial reasoning in slot-level placement. Extensive experiments show that AnySlot significantly outperforms flat VLA baselines and modular grounding methods in zero-shot slot-level placement.

2604.10805 2026-06-01 cs.CV

Analytical Modeling and Correction of Distance Error in Homography-Based Ground-Plane Mapping

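Mateusz Szulc, Marcin Iwanowski

Affiliations: Institute of Control and Industrial Electronics, Faculty of Electrical Engineering, Warsaw University of Technology, ul. Koszykowa 75, 00-662 Warsaw, Poland

AI summary: Derives an analytical relationship between homography perturbations and distance error, evaluates two correction strategies based on regression and gradient descent, and validates them in a large-scale simulation.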

Comments 7 pages, 4 figures

Abstract

Accurate distance estimation from monocular cameras is essential for intelligent monitoring systems. In many deployments, image coordinates are mapped to ground positions using planar homographies initialized by manual selection of corresponding regions. Small inaccuracies in this initialization propagate into systematic distance distortions. This paper derives an explicit relationship between homography perturbations and the resulting distance error, showing that the error grows approximately quadratically with the true distance from the camera. Based on this model, two simple correction strategies are evaluated: regression-based estimation of the quadratic error function and direct optimization of the homography via coordinate-based gradient descent. A large-scale simulation study with more than 19 million test samples demonstrates that regression achieves higher peak accuracy when the model is reliably fitted, whereas gradient descent provides greater robustness against poor initial calibration. This suggests that improving geometric calibration may yield greater performance gains than increasing model complexity in many practical systems.
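The regression-based strategy can be illustrated with a minimal sketch: fit a quadratic to the observed distance error on calibration pairs, then subtract the predicted error from new measurements. Fitting the error against the measured distance (rather than the unknown true distance) is an assumption made so the correction can be applied at run time; the function names and the synthetic error coefficient in the usage note are hypothetical and not taken from the paper.

```python
import numpy as np

def fit_quadratic_error(measured, true):
    """Least-squares quadratic fit of error = measured - true vs. measured distance."""
    measured = np.asarray(measured, dtype=float)
    true = np.asarray(true, dtype=float)
    return np.polyfit(measured, measured - true, deg=2)

def correct_distance(measured, coeffs):
    """Subtract the predicted error from homography-derived distances."""
    measured = np.asarray(measured, dtype=float)
    return measured - np.polyval(coeffs, measured)
```

For example, with synthetic ground truth `true = np.linspace(1, 50, 200)` and a made-up distortion `measured = true + 0.002 * true**2`, calling `correct_distance(measured, fit_quadratic_error(measured, true))` should shrink the error well below the uncorrected one, although the fit is only approximate because the error is quadratic in the true distance, not the measured one.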

2604.10495 2026-06-01 cs.CL

Why Don't You Know? Evaluating the Impact of Uncertainty Sources on Uncertainty Quantification in LLMs

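Maiya Goloburda, Roman Vashurin, Fedor Chernogorskii, Nurkhan Laiyk, Daniil Orel, Preslav Nakov, Maxim Panov

Affiliations: Mohamed bin Zayed University of Artificial Intelligence

AI summary: Introduces a new dataset that explicitly categorizes uncertainty sources and uses it to systematically evaluate existing uncertainty quantification methods under each source, finding that most methods perform well when uncertainty stems from model knowledge limitations but degrade or become misleading under other sources.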

Abstract

As Large Language Models (LLMs) are increasingly deployed in real-world applications, reliable uncertainty quantification (UQ) becomes critical for safe and effective use. Most existing UQ approaches for language models aim to produce a single confidence score -- for example, estimating the probability that a model's answer is correct. However, uncertainty in natural language tasks arises from multiple distinct sources, including model knowledge gaps, output variability, and input ambiguity, which have different implications for system behavior and user interaction. In this work, we study how the source of uncertainty impacts the behavior and effectiveness of existing UQ methods. To enable controlled analysis, we introduce a new dataset that explicitly categorizes uncertainty sources, allowing systematic evaluation of UQ performance under each condition. Our experiments reveal that while many UQ methods perform well when uncertainty stems solely from model knowledge limitations, their performance degrades or becomes misleading when other sources are introduced. These findings highlight the need for uncertainty-aware methods that explicitly account for the source of uncertainty in large language models.

2604.10273 2026-06-01 cs.CV

Dual-Exposure Imaging with Events

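Mingyuan Lin, Hongyi Liu, Chu He, Wen Yang, Gui-Song Xia, Lei Yu

Affiliations: School of Electronic Information, Wuhan University; School of Artificial Intelligence, Wuhan University

AI summary: Proposes E-DEI, an event-assisted dual-exposure imaging algorithm that uses the high temporal resolution of event cameras to align and fuse features from dual-exposure images, removing motion artifacts and exposure discrepancies and improving low-light image quality.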

Abstract

By combining the complementary benefits of short- and long-exposure images, Dual-Exposure Imaging (DEI) enhances image quality in low-light scenarios. However, existing DEI approaches inevitably produce artifacts due to spatial displacement from scene motion and image feature discrepancies from different exposure times. To tackle this problem, we propose a novel Event-based DEI (E-DEI) algorithm, which reconstructs high-quality images from dual-exposure image pairs and events, leveraging the high temporal resolution of event cameras to provide accurate inter-/intra-frame dynamic information. Specifically, we decompose this complex task into an integration of two sub-tasks, i.e., event-based motion deblurring and low-light image enhancement, which guides us to design the E-DEI network as a dual-path parallel feature propagation architecture. We propose a Dual-path Feature Alignment and Fusion (DFAF) module to effectively align and fuse features extracted from dual-exposure images with the assistance of events. Furthermore, we build a real-world Dataset containing Paired low-/normal-light Images and Events (PIED). Experiments on multiple datasets show the superiority of our method. The code and dataset are available on GitHub.

2503.09315 2026-06-01 cs.LG

ShuffleGate: Scalable Feature Optimization for Recommender Systems via Batch-wise Sensitivity Learning

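Yihong Huang, Chen Chu, Fan Zhang, Liping Wang, Fei Chen, Yu Lin, Ruiduan Li, Zhihao Li

Affiliations: Bilibili Inc.; Guangzhou University; East China Normal University

AI summary: Proposes ShuffleGate, a mechanism that estimates feature importance differentiably through batch-wise shuffling, unifying feature selection and dimension selection and producing polarized importance distributions that avoid complex threshold tuning; it outperforms existing methods on four benchmarks and, in an industrial deployment, achieves 10x dimension compression and a 91% increase in training throughput.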

Abstract

Feature optimization -- specifically Feature Selection (FS) and Dimension Selection (DS) -- is critical for the efficiency and generalization of large-scale recommender systems. While conceptually related, these tasks are typically tackled with isolated solutions that often suffer from ambiguous importance scores or prohibitive computational costs. In this paper, we propose ShuffleGate, a unified and interpretable mechanism that estimates component importance by measuring the model's sensitivity to information loss. Unlike conventional gating that learns relative weights, ShuffleGate introduces a batch-wise shuffling strategy to effectively "erase" information in an end-to-end differentiable manner. This paradigm shift yields naturally polarized importance distributions, bridging the long-standing "search-retrain gap" and distinguishing essential signals from noise without complex threshold tuning. Extensive experiments across four benchmarks validate that ShuffleGate consistently outperforms state-of-the-art methods in both Feature and Dimension Selection tasks. It achieves a $15\times$ speedup over permutation baselines and demonstrates extreme scalability by processing 270M parameters in just 700 seconds. Finally, in a top-tier industrial deployment, it compressed input dimensions by $10\times$, yielding a 91% increase in training throughput while serving billions of daily requests without performance degradation.
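One way to picture "erasing" a feature by batch-wise shuffling is to blend each feature column with a copy of itself permuted across the batch, weighted by a gate in [0, 1]. The interpolation form below, and the choice of an independent permutation per feature, are assumptions for illustration; the abstract does not give the exact formula, and a real implementation would learn the gates by gradient descent in a deep-learning framework rather than take them as inputs.

```python
import numpy as np

def shuffle_gate(x, gates, rng):
    """Blend each feature with a batch-shuffled copy of itself.

    x:     (batch, features) input matrix.
    gates: (features,) values in [0, 1]; 1 keeps the feature intact,
           0 replaces it with values drawn from other rows of the batch.
    rng:   numpy random Generator used for the permutations.
    """
    x = np.asarray(x, dtype=float)
    gates = np.asarray(gates, dtype=float)
    shuffled = np.empty_like(x)
    for j in range(x.shape[1]):
        # Shuffling within the batch keeps the marginal distribution of the
        # feature but breaks its link to the label for each row.
        shuffled[:, j] = x[rng.permutation(x.shape[0]), j]
    return gates * x + (1.0 - gates) * shuffled
```

Under this reading, a feature whose gate can fall to 0 without hurting the loss carries little information, which is one plausible route to the polarized importance scores the abstract mentions.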

2604.06881 2026-06-01 cs.LG physics.flu-dyn

MENO: MeanFlow-Enhanced Neural Operators for Dynamical Systems

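Tianyue Yang, Xiao Xue

Affiliations: Centre for Computational Science, University College London, London, United Kingdom

AI summary: Proposes MENO, a framework that recovers multi-scale features with an improved MeanFlow method, giving accurate high-resolution predictions from low-resolution training data with inference up to 14x faster than diffusion-enhanced methods.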

Comments 27 pages, 13 figures

Abstract

Neural operators have emerged as powerful surrogates for dynamical systems due to their grid-invariant properties and computational efficiency. However, Fourier-based variants inherently truncate high-frequency components in spectral space, resulting in the loss of small-scale structures and degraded prediction quality at high resolutions when trained on low-resolution data. While diffusion-based enhancement methods can recover multi-scale features, they introduce substantial inference overhead that undermines the efficiency advantage of neural operators. In this work, we introduce MeanFlow-Enhanced Neural Operators (MENO), a novel framework that achieves accurate all-scale predictions with minimal inference cost. By leveraging the improved MeanFlow method, MENO restores both small-scale details and large-scale dynamics with superior physical fidelity and statistical accuracy. We evaluate MENO on three challenging dynamical systems, including phase-field dynamics, 2D Kolmogorov flow, and active matter dynamics, at resolutions up to 256$\times$256. Across all benchmarks, MENO improves the power spectrum density accuracy by up to a factor of 2 compared to baseline neural operators while achieving up to $14\times$ faster inference than the state-of-the-art Denoising Diffusion Implicit Model (DDIM)-enhanced counterparts, effectively bridging the gap between accuracy and efficiency. The flexibility and efficiency of MENO position it as an efficient surrogate model for scientific machine learning applications where both statistical integrity and computational efficiency are paramount.

2509.10078 2026-06-01 cs.CL cs.AI

Human Psychometric Questionnaires Mischaracterize LLM Behavior

Woojung Song, Dongmin Choi, Yoonah Park, Jongwook Han, Eun-Ju Lee, Yohan Jo

Affiliations: Graduate School of Data Science, Seoul National University; Department of Communication, Interdisciplinary Program in Artificial Intelligence, Seoul National University

AI summary: Compares LLM value and personality profiles obtained from Likert questionnaires with those obtained from generation probabilities, finds systematic bias in the questionnaires, and suggests that generation-probability-based profiling is more accurate.

Comments 38 pages, 6 figures

Abstract

We examine whether human psychometric questionnaires can serve as reliable tools for characterizing and predicting LLM behavior in everyday user interactions. We analyze eight open-source LLMs by comparing their value and personality profiles derived from two different methods: Likert self-reports on established questionnaires (PVQ-40/21 and BFI-44/10) and generation probabilities over value-laden responses to everyday user queries. The two profiles diverge substantially. Within-construct item consistency, often cited as evidence of stable LLM dispositions, disappears in generation probabilities. We attribute this gap to the fact that explicit lexical cues in established questionnaire items allow models to recognize the target construct and respond in alignment-consistent, socially desirable ways, whereas realistic user queries provide no such cues. In addition, demographic persona prompts shift models' responses to human questionnaires in ways consistent with real human patterns, but no such shifts appear in the generation probabilities of responses to realistic user queries, showing their limited ability to simulate the behaviors of target demographics in real-world user interactions. Overall, our study shows that human psychometric questionnaires are insufficient tools for predicting LLM behavior and suggests generation-based profiling as a more accurate measure.

2604.01985 2026-06-01 cs.LG cs.AI cs.RO

World Action Verifier: Self-Improving World Models via Forward-Inverse Asymmetry

Yuejiang Liu, Fan Feng, Lingjing Kong, Weifeng Lu, Jinzhou Tang, Kun Zhang, Kevin Murphy, Chelsea Finn, Yilun Du

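Affiliations: Stanford University; UC San Diego; Carnegie Mellon University; Google DeepMind; Harvard University

AI summary: Proposes the World Action Verifier (WAV) framework, which verifies state plausibility and action reachability independently and exploits forward-inverse asymmetry. A diverse subgoal generator obtained from video corpora and a sparse inverse model enforce cycle consistency, letting the world model self-improve in under-explored regimes; across multiple tasks, sample efficiency is 2x higher and downstream policy performance improves by over 22%.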

Comments Project Website: https://world-action-verifier.github.io

Abstract

General-purpose world models promise scalable policy evaluation, optimization, and planning, yet achieving the required level of robustness remains challenging. Unlike policy learning, which primarily focuses on optimal actions, a world model needs to be reliable over a vast space of suboptimal actions, which are often underrepresented in action-labeled robot interactions. To address this challenge, we propose World Action Verifier (WAV), a framework that enables world models to identify their own prediction errors and self-improve. The key idea is to decompose action-conditioned state prediction into two independently verifiable factors: state plausibility and action reachability. We show that verifying these factors is significantly more tractable than direct forward prediction due to two underlying asymmetries: the broader availability of action-free data and the lower dimensionality of action-relevant features. Leveraging these asymmetries, we augment a world model with (i) a diverse subgoal generator obtained from video corpora and (ii) a sparse inverse model that infers actions from a subset of state features. By enforcing cycle consistency among proposed subgoals, inferred actions, and forward rollouts, WAV provides an effective verification mechanism in under-explored regimes, where existing methods often fail. Across nine tasks spanning MiniGrid, RoboMimic, and ManiSkill, our method achieves 2x higher sample efficiency while improving downstream policy performance by over 22%.
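
The cycle-consistency check among subgoals, inferred actions, and forward rollouts can be sketched on a toy one-dimensional world. Everything here is an assumption made for illustration: the integer state, the two hand-written models, and the function names are hypothetical and do not come from the paper.

```python
def inverse_model(state, subgoal):
    # Hypothetical sparse inverse model: infers the action from the position feature only.
    return subgoal - state

def forward_model(state, action):
    # Hypothetical learned world model with a deliberate error for large actions,
    # standing in for an under-explored regime.
    if abs(action) > 2:
        return state + action - 1  # wrong prediction
    return state + action

def cycle_error(state, subgoal):
    """Distance between the proposed subgoal and the forward rollout that should reach it."""
    action = inverse_model(state, subgoal)
    predicted = forward_model(state, action)
    return abs(predicted - subgoal)

def flag_inconsistent(state, subgoals, tol=0):
    """Return the subgoals whose forward rollout disagrees with the proposal."""
    return [g for g in subgoals if cycle_error(state, g) > tol]

proposed = [1, 2, 5, -4]  # stand-in for a subgoal generator's proposals
flagged = flag_inconsistent(0, proposed)
```

Here the two large-action subgoals are flagged, which marks the transitions where the forward model would most plausibly benefit from additional data.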

2603.28579 2026-06-01 cs.RO

EBuddy: a workflow orchestrator for industrial human-machine collaboration

Michele Banfi, Rocco Felici, Stefano Baraldo, Oliver Avram, Anna Valente

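Affiliations: Laboratory of Automation, Robotics and Machines (ARM)

AI summary: Presents EBuddy, a voice-guided workflow orchestrator that formalizes expert practice as a finite-state-machine-driven application to enable natural human-machine collaboration in industrial settings, substantially shortening end-to-end process duration while preserving repeatability.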

Abstract

This paper presents EBuddy, a voice-guided workflow orchestrator for natural human-machine collaboration in industrial environments. EBuddy targets a recurrent bottleneck in tool-intensive workflows: expert know-how is effective but difficult to scale, and execution quality degrades when procedures are reconstructed ad hoc across operators and sessions. EBuddy operationalizes expert practice as a finite state machine (FSM) driven application that provides an interpretable decision frame at runtime (current state and admissible actions), so that spoken requests are interpreted within state-grounded constraints, while the system executes and monitors the corresponding tool interactions. Through modular workflow artifacts, EBuddy coordinates heterogeneous resources, including GUI-driven software and a collaborative robot, leveraging fully voice-based interaction through automatic speech recognition and intent understanding. An industrial pilot on impeller blade inspection and repair preparation for directed energy deposition (DED), realized by human-robot collaboration, shows substantial reductions in end-to-end process duration across onboarding, 3D scanning and processing, and repair program generation, while preserving repeatability and low operator burden.
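
The "interpretable decision frame" (current state plus admissible actions) can be sketched as a small finite state machine. This is a hedged illustration only: the class, the state names, and the action names are hypothetical and are not EBuddy's actual workflow artifacts.

```python
class WorkflowFSM:
    def __init__(self, transitions, start):
        # transitions: {state: {action: next_state}}
        self.transitions = transitions
        self.state = start

    def admissible_actions(self):
        """Actions the current state allows, i.e. the decision frame shown to the operator."""
        return sorted(self.transitions.get(self.state, {}))

    def request(self, action):
        """Apply a recognized intent; reject it if the current state does not allow it."""
        allowed = self.transitions.get(self.state, {})
        if action not in allowed:
            return False
        self.state = allowed[action]
        return True

# Hypothetical states and actions loosely named after the pilot's phases.
fsm = WorkflowFSM(
    {
        "onboarding": {"start_scan": "scanning"},
        "scanning": {"process_scan": "processing", "rescan": "scanning"},
        "processing": {"generate_program": "done"},
    },
    start="onboarding",
)
```

Interpreting a spoken request then reduces to mapping it onto one of `admissible_actions()`; anything else is rejected without changing state.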

2603.28201 2026-06-01 cs.LG stat.ML

A Perturbation Approach to Unconstrained Linear Bandits

Andrew Jacobsen, Dorian Baudry, Shinji Ito, Nicolò Cesa-Bianchi

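Affiliations: Inria, Univ. Grenoble Alpes, Grenoble INP, CNRS, LIG, 38000 Grenoble, France; The University of Tokyo

AI summary: Proposes a perturbation-based framework that reduces unconstrained bandit linear optimization to a standard online linear optimization problem and obtains the first high-probability guarantees for both static and dynamic regret.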

Comments 50 pages; v2: ICML 2026

Abstract

We revisit the standard perturbation-based approach of Abernethy et al. (2008) in the context of unconstrained Bandit Linear Optimization (uBLO). We show the surprising result that in the unconstrained setting, this approach effectively reduces Bandit Linear Optimization (BLO) to a standard Online Linear Optimization (OLO) problem. Our framework improves on prior work in several ways. First, we derive expected-regret guarantees when our perturbation scheme is combined with comparator-adaptive OLO algorithms, leading to new insights about the impact of different adversarial models on the resulting comparator-adaptive rates. We also extend our analysis to dynamic regret, obtaining the first guarantees with optimal $\sqrt{P_T}$ path-length dependencies without prior knowledge of $P_T$. We then develop the first high-probability guarantees for both static and dynamic regret in uBLO. Finally, we discuss lower bounds on the static regret, and prove the folklore $Ω(\sqrt{dT})$ rate for adversarial linear bandits on the Euclidean ball, which is of independent interest.
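
To see why a perturbation can turn bandit feedback into the full loss vector an OLO algorithm expects, consider a generic one-point estimator. This sketch assumes perturbations drawn uniformly from the signed coordinate directions; that choice, the step size `delta`, and all names are assumptions for illustration and not necessarily the scheme analyzed in the paper.

```python
def one_point_estimate(observed_loss, direction, d, delta):
    """Loss-vector estimate built from a single scalar bandit observation."""
    return [(d / delta) * observed_loss * u for u in direction]

def signed_basis(d):
    """All 2d perturbation directions +e_i and -e_i."""
    dirs = []
    for i in range(d):
        for sign in (1.0, -1.0):
            dirs.append([sign if j == i else 0.0 for j in range(d)])
    return dirs

def expected_estimate(g, w, delta):
    """Exact expectation of the estimator under a uniform draw from signed_basis."""
    d = len(g)
    dirs = signed_basis(d)
    total = [0.0] * d
    for u in dirs:
        played = [wi + delta * ui for wi, ui in zip(w, u)]
        loss = sum(gi * pi for gi, pi in zip(g, played))  # only this scalar is observed
        est = one_point_estimate(loss, u, d, delta)
        total = [t + e / len(dirs) for t, e in zip(total, est)]
    return total

g = [0.5, -1.0, 2.0]   # hidden loss vector (illustrative)
w = [3.0, 0.0, -2.0]   # point proposed by the OLO learner
mean_est = expected_estimate(g, w, delta=0.25)
```

Averaged over the perturbation, the estimate equals the hidden loss vector, so the learner can be fed the estimate as if it had full-information feedback. The variance of the estimate grows with the norm of the played point, which is where the unconstrained setting requires care.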

2603.26885 2026-06-01 cs.CV

TTE-CAM: Self-Explainable Class Activation Maps for Pretrained Black-Box CNNs

Kerol Djoumessi, Philipp Berens

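Affiliations: Hertie Institute for AI in Brain Health, University of Tübingen, Germany

AI summary: Proposes TTE-CAM, a framework that turns a pretrained black-box CNN into a self-explainable model by replacing its classification head with a convolutional one, providing faithful explanations while preserving predictive performance.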

Comments Accepted at MIDL 2026 in the short paper track

2603.23977 2026-06-01 cs.LG cs.AI

Circuit-Inspired High-Order Neural Networks with Unified Neural Dynamics Modeling for PDE Solving and Visual Perception

Tongfei Chen, Jingying Yang, Linlin Yang, Juan Zhang, Jinhu Lü, David Doermann, Chunyu Xie, Long He, Tian Wang, Guodong Guo, Baochang Zhang

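Affiliations: Communication University of China; AI Research, Qihoo 360; Eastern Institute of Technology, Ningbo

AI summary: Proposes the Circuit-inspired High-Order Neural Network (CHONN), which builds high-order dynamical operators through Kirchhoff-inspired cascade composition and improves structural fidelity and stability in PDE solving, long-horizon physical forecasting, and ImageNet-1K recognition.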

Abstract

Deep networks often rely on architectural heuristics to shape representation evolution, limiting their ability to model data governed by intrinsic dynamics. We present the Circuit-inspired High-Order Neural Network (CHONN), a modular framework that treats representation evolution as a latent potential process and increases its effective order through Kirchhoff-inspired cascade composition. A single Kirchhoff Neural Cell implements a stable first-order update, while serially composed cells form higher-order dynamical operators within one block. This construction is interpretable, numerically stable and compatible with common neural backbones. Theoretical analysis shows that cascaded cells induce end-to-end high-order operators, and controlled experiments demonstrate that intra-block high-order construction differs from generic depth stacking, especially on derivative-sensitive measures. Across steady-state operator learning, long-horizon physical forecasting and ImageNet-1K recognition, CHONN improves structural fidelity, rollout stability and visual representation learning. These results identify high-order circuit composition as a general principle for neural dynamics modeling.
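
The idea that serially composed first-order units form a higher-order operator has a classical analogue in cascaded RC low-pass stages. The sketch below illustrates only that analogy: the update rule, the `rate` parameter, and the function names are assumptions and are not the paper's Kirchhoff Neural Cell.

```python
def first_order_cell(inputs, rate):
    """Stable first-order update y <- y + rate * (x - y), for 0 < rate < 1."""
    y = 0.0
    outputs = []
    for x in inputs:
        y += rate * (x - y)
        outputs.append(y)
    return outputs

def cascade(inputs, rate, order):
    """Feed the signal through `order` first-order cells in series."""
    signal = list(inputs)
    for _ in range(order):
        signal = first_order_cell(signal, rate)
    return signal

step = [1.0] * 200
one_cell = cascade(step, rate=0.1, order=1)
two_cells = cascade(step, rate=0.1, order=2)
```

Both responses settle at the same value, but the two-cell cascade starts much more slowly, the signature of a second-order response that a single first-order update cannot produce.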

2603.23160 2026-06-01 cs.CL

UniDial-EvalKit: A Unified Toolkit for Evaluating Multi-Faceted Conversational Abilities

Qi Jia, Haodong Zhao, Dun Pei, Xiujie Song, Ye Shen, Shibo Wang, Zijian Chen, Zicheng Zhang, Xiangyang Zhu, Guangtao Zhai

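Affiliations: Shanghai Artificial Intelligence Laboratory; Shanghai Jiao Tong University

AI summary: Presents UniDial-EvalKit, a unified evaluation toolkit that addresses heterogeneous evaluation protocols in multi-turn interactive scenarios through a standardized data format, a modular pipeline, and hierarchical score aggregation. Large-scale experiments show that no current system is consistently best and that memory agents often fall short of full-context baselines.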

Abstract

Benchmarking large language models (LLMs) and agents in multi-turn interactive scenarios is essential for understanding their practical capabilities. However, existing evaluation protocols are highly heterogeneous, differing significantly in dataset formats, model interfaces, and evaluation pipelines, which severely impedes systematic comparison. In this work, we present UniDial-EvalKit (UDE), a unified evaluation toolkit for assessing interactive AI systems. The core contribution of UDE lies in its holistic unification: it standardizes heterogeneous data formats into a universal schema, streamlines complex evaluation pipelines through a modular architecture, and aligns metric calculations under a hierarchical scoring aggregation. It also supports efficient large-scale evaluation through parallel generation and scoring, as well as checkpoint resume to eliminate redundant computation. Leveraging UDE, we conduct an extensive evaluation across diverse multi-dimensional benchmarks. Our empirical analysis shows that no single system consistently outperforms others across all benchmarks, while current memory agents often fail to surpass full-context baselines. Further analyses highlight several future directions, including benchmark deduplication and more adaptive memory architectures.
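
One plausible reading of "hierarchical scoring aggregation" is a macro-average taken level by level. The sketch below assumes a benchmark, task, sample hierarchy with plain means at each level; the nesting, the names, and the scores are hypothetical and do not reflect UDE's actual schema or API.

```python
from statistics import mean

def aggregate(node):
    """Recursively average scores: a leaf is a list of numbers, an inner node is a dict."""
    if isinstance(node, dict):
        return mean(aggregate(child) for child in node.values())
    return mean(node)

# Hypothetical results: benchmark -> task -> per-sample scores.
results = {
    "bench_a": {"task_1": [1.0, 0.0], "task_2": [1.0, 1.0, 1.0, 1.0]},
    "bench_b": {"task_1": [0.5, 0.5]},
}
overall = aggregate(results)
```

Averaging level by level keeps a task with many samples from dominating the total; pooling all eight samples directly would give 0.75 here instead of 0.625.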

2603.22744 2026-06-01 cs.AI

LH-Bench: Skill-Grounded Evaluation of Long-Horizon Agents on Subjective Enterprise Tasks

Abhishek Chandwani, Ishan Gupta

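Affiliations: Metaphi Inc.

AI summary: Proposes LH-Bench, which evaluates autonomous long-horizon execution on subjective enterprise tasks through three pillars: expert-grounded rubrics, curated ground-truth artifacts, and pairwise human preference evaluation.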

Abstract

Large language models excel on objectively verifiable tasks such as math and programming, where evaluation reduces to unit tests or a single correct answer. In contrast, real-world enterprise work is often subjective and context-dependent: success hinges on organizational goals, user intent, and the quality of intermediate artifacts produced across long, multi-tool workflows. We introduce LH-Bench, a three-pillar evaluation design that moves beyond binary correctness to score autonomous, long-horizon execution on subjective enterprise tasks. The pillars are: (i) expert-grounded rubrics that give LLM judges the domain context needed to score subjective work, (ii) curated ground-truth artifacts that enable stepwise reward signals (e.g., chapter-level annotation for content tasks), and (iii) pairwise human preference evaluation for convergent validation. We show that domain-authored rubrics provide substantially more reliable evaluation signals than LLM-authored rubrics (kappa = 0.60 vs. 0.46), and that human preference judgments confirm the same top-tier separation (p < 0.05), evidence that expert-grounded evaluation can scale without sacrificing reliability. We release public datasets and report results on two environments: Figma-to-code (33 real .fig tasks against the Figma API via MCP) and Programmatic content (41 courses comprising 183 individually-evaluated chapters on a course platform serving 30+ daily users).
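
The abstract reports agreement as "kappa" without saying which variant. A minimal sketch, assuming Cohen's kappa between two raters; the labels below are made up and are not the paper's data.

```python
from collections import Counter

def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters assigning categorical labels to the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    # Agreement expected by chance if the raters labeled independently.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label
    return (observed - expected) / (1 - expected)

# Made-up pass/fail judgments from an LLM judge and a human on ten items.
judge = ["pass", "pass", "fail", "pass", "fail", "fail", "pass", "pass", "fail", "pass"]
human = ["pass", "pass", "fail", "fail", "fail", "fail", "pass", "pass", "pass", "pass"]
kappa = cohen_kappa(judge, human)
```

Raw agreement here is 80%, but kappa is lower because part of that agreement would be expected by chance given how often each rater says "pass".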

2603.21558 2026-06-01 cs.AI

Reliable Self-Improvement Training by Verifying Reasoning, Not Just Answers

Xinyu Zhang

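Affiliations: Anyscale

AI summary: Addresses the accumulation of reasoning errors in self-improvement training that filters only on final-answer correctness. The proposed VSI framework filters training data with step-level structural verification (for example, checking arithmetic steps by symbolic computation) and achieves sustained accuracy gains on GSM8K (80.5% to 91.0%).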

Comments Accepted at ICLR 2026 Workshop LLM Reasoning. 10 pages, 3 figures, 5 tables

Abstract

Self-improvement training, where models learn from self-generated solutions, promises sustained capability gains but suffers from a pervasive failure mode: across multiple rounds, compounding reasoning errors cause accuracy to stall or degrade. We trace this drift to standard filtering criteria that retain solutions based solely on final answer correctness, which lets lucky guesses (correct answers with flawed reasoning) contaminate the training data. We propose Verified Self-Improvement (VSI), a framework that conditions data retention on step-level structural integrity rather than just the final output. VSI validates solutions by recomputing arithmetic steps via a computer-algebra library (sympy), checking intermediate consistency, and enforcing domain constraints. Evaluating VSI on GSM8K with Qwen3-4B-Thinking across 5 rounds of self-improvement against four baselines (no verification, outcome verification, majority voting, and VSI with DPO) shows that VSI rejects approximately 34% of correct-answer solutions, successfully isolating lucky guesses. This cleaner training signal drives sustained accuracy gains across all rounds (80.5% to 91.0%), whereas outcome verification plateaus and unverified training collapses. Finally, converting VSI checks into DPO preference pairs trains the model to distinguish sound reasoning from lucky answers, boosting reward accuracy from 46% to 63%. VSI offers a simple, reproducible recipe for robust self-improvement whenever automated reasoning checks are available.
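
The difference between outcome filtering and step-level filtering can be sketched as follows. Note the swap: the abstract says VSI recomputes steps with sympy, whereas this dependency-free sketch uses Python's `fractions` module on steps of the assumed form "a op b = c". The step format, regular expression, and function names are all hypothetical.

```python
import re
from fractions import Fraction

NUM = r"(-?\d+(?:\.\d+)?)"
STEP = re.compile(rf"^\s*{NUM}\s*([+\-*/])\s*{NUM}\s*=\s*{NUM}\s*$")

OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "/": lambda a, b: a / b,
}

def step_is_valid(step):
    """Recompute one 'a op b = c' step exactly; unparseable steps count as invalid."""
    match = STEP.match(step)
    if match is None:
        return False
    left, op, right, claimed = match.groups()
    a, b, c = Fraction(left), Fraction(right), Fraction(claimed)
    if op == "/" and b == 0:
        return False
    return OPS[op](a, b) == c

def keep_solution(steps, final_answer, reference):
    """Retain a solution only if the answer is right AND every step checks out."""
    return final_answer == reference and all(step_is_valid(s) for s in steps)

sound = ["48 / 2 = 24", "24 + 48 = 72"]
lucky = ["48 / 2 = 26", "26 + 48 = 72"]  # right answer, wrong arithmetic
```

An outcome-only filter would keep both solutions because both end in 72; the step-level check rejects the second one.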

2511.11440 2026-06-01 cs.CV cs.CL

Synthetic Stimuli, Real Gains: Rethinking VLM Fine-Tuning Through Fully Controlled Data Generation

Massimo Rizzoli, Simone Alghisi, Seyed Mahed Mousavi, Giuseppe Riccardi

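Affiliations: Signals and Interactive Systems Lab, University of Trento

AI summary: Proposes a fully controlled data generation and annotation pipeline for fine-tuning vision language models (VLMs). Balanced distributions and clean annotations remove bias; on a spatial reasoning task, as few as 130 samples yield uniform performance, and performance on real-world data improves by 13%.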

Abstract

Performance gains of Vision Language Models (VLMs) obtained by fine-tuning are generally based on ad hoc data collection and annotation of real-world scenes. Despite the improvements, this process is often prone to biases, errors, and distribution imbalance, resulting in overfitting and imbalanced performance. Although a few studies have explored synthetic data generation, they typically lack control over data distribution and annotation quality. In this work, we re-evaluate the potential of model fine-tuning by exploring a fully controlled data generation and annotation pipeline, obtaining bias-free data with balanced distribution and clean annotations. Using the spatial reasoning task of identifying the absolute position of an object as a use case, we fine-tune state-of-the-art VLMs and conduct exhaustive evaluations on both synthetic and real-world benchmarks, including transferability to real-world scenes. Our experiments reveal two key findings: 1) fine-tuning on balanced data yields uniform performance across the visual scene and mitigates common biases with as few as 130 samples; and 2) fine-tuning on synthetic stimuli improves performance by 13% on real-world data (COCO), outperforming models fine-tuned on the full COCO train set.
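
A balanced distribution with clean labels is easy to guarantee when the generator controls object placement. The sketch below assumes a 3x3 grid of position labels over a unit-square scene; the grid, the label format, and the function name are illustrative assumptions, since the abstract does not describe the pipeline's internals.

```python
import random
from collections import Counter

def balanced_positions(per_cell, grid=3, size=1.0, seed=0):
    """Sample object centres so every grid cell receives exactly `per_cell` samples."""
    rng = random.Random(seed)
    cell = size / grid
    samples = []
    for row in range(grid):
        for col in range(grid):
            for _ in range(per_cell):
                x = (col + rng.random()) * cell
                y = (row + rng.random()) * cell
                # The label is known by construction, so no annotation step can get it wrong.
                samples.append({"x": x, "y": y, "label": (row, col)})
    rng.shuffle(samples)
    return samples

data = balanced_positions(per_cell=5)
counts = Counter(sample["label"] for sample in data)
```

Because every cell gets the same number of samples, a model fine-tuned on `data` cannot pick up a positional prior from class frequencies alone.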

2510.25110 2026-06-01 cs.CL

DEBATE: A Large-Scale Benchmark for Evaluating Opinion Dynamics in Role-Playing LLM Agents

Yun-Shiuan Chuang, Ruixuan Tu, Chengtao Dai, You Li, Smit Vasani, Binwei Yao, Michael Henry Tessler, Sijia Yang, Dhavan Shah, Robert Hawkins, Junjie Hu, Timothy T. Rogers

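Affiliations: University of Wisconsin–Madison; Google DeepMind; Stanford University

AI summary: Introduces the DEBATE benchmark, which uses multi-round public messages and Likert-scale belief data to evaluate how realistic opinion dynamics are in multi-agent role-playing LLM simulations. Agent groups over-converge in zero-shot settings, while supervised fine-tuning improves stance alignment and reduces group-level convergence error.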

Abstract

Accurately modeling opinion change through social interactions is crucial for understanding and mitigating polarization, misinformation, and societal conflict. Recent work simulates opinion dynamics with role-playing LLM agents (RPLAs), but multi-agent simulations often display unnatural group behavior, such as premature convergence, and lack empirical benchmarks for assessing alignment with real human group interactions. We introduce DEBATE, a large-scale benchmark for evaluating the authenticity of opinion dynamics in multi-agent RPLA simulations. DEBATE contains multi-round public messages and private Likert-scale beliefs from U.S.-based participants across 107 topics; the cleaned benchmark used in our experiments contains 2,788 participants in 697 groups, enabling evaluation at the utterance and group levels and supporting future individual-level analyses. We instantiate "digital twin" RPLAs with seven LLMs and evaluate across two settings: next-message prediction and full dynamics simulation, using stance-based opinion-dynamics metrics. In zero-shot settings, RPLA groups exhibit strong opinion convergence relative to human groups. On the held-out group split, supervised fine-tuning (SFT) for Llama-3.1-8B-Instruct improves auxiliary stance alignment and reduces group-level convergence error, though discrepancies in opinion change and belief updating remain. DEBATE enables rigorous benchmarking of simulated opinion dynamics and supports future research on aligning multi-agent RPLAs with realistic human interactions.
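
Group-level over-convergence can be made concrete with a simple spread-based measure. The abstract does not define its stance-based metrics, so the sketch below is an assumption: it measures convergence as the drop in the standard deviation of Likert beliefs within a group and compares simulated agents with humans. The data are made up.

```python
from statistics import pstdev

def convergence(initial_beliefs, final_beliefs):
    """Drop in within-group belief spread; positive values mean the group converged."""
    return pstdev(initial_beliefs) - pstdev(final_beliefs)

def convergence_error(human_initial, human_final, sim_initial, sim_final):
    """How much more (positive) or less (negative) the simulated group converged."""
    return convergence(sim_initial, sim_final) - convergence(human_initial, human_final)

# Made-up Likert beliefs (1-7) for one four-person group.
human_before, human_after = [1, 3, 5, 7], [2, 3, 5, 6]
agents_before, agents_after = [1, 3, 5, 7], [4, 4, 4, 4]
error = convergence_error(human_before, human_after, agents_before, agents_after)
```

A positive error indicates that the simulated group collapsed toward consensus more strongly than the human group did, the pattern the zero-shot results describe.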

2603.19862 2026-06-01 cs.CV cs.LG

IsoCLIP: Decomposing CLIP Projectors for Efficient Intra-modal Alignment

Simone Magistri, Dipam Goswami, Marco Mistretta, Bartłomiej Twardowski, Joost van de Weijer, Andrew D. Bagdanov

Affiliations: Media Integration and Communication Center (MICC), University of Florence, Italy; Department of Computer Science, Universitat Autònoma de Barcelona, Spain; Computer Vision Center, Barcelona, Spain; IDEAS Research Institute, Warsaw, Poland

AI summary: Analyzes the spectral properties of CLIP projectors, identifies an inter-modally aligned subspace and anisotropic directions, and proposes IsoCLIP, a training-free method that removes the anisotropic directions to improve intra-modal alignment. It lowers latency and outperforms existing methods on intra-modal retrieval and classification tasks.

Comments Accepted at CVPR2026

Abstract

Vision-Language Models like CLIP are extensively used for inter-modal tasks which involve both visual and text modalities. However, when the individual modality encoders are applied to inherently intra-modal tasks like image-to-image retrieval, their performance suffers from intra-modal misalignment. In this paper we study intra-modal misalignment in CLIP with a focus on the role of the projectors that map pre-projection image and text embeddings into the shared embedding space. By analyzing the form of the cosine similarity applied to projected features, and its interaction with the contrastive CLIP loss, we show that there is an inter-modal operator responsible for aligning the two modalities during training, and a second, intra-modal operator that only enforces intra-modal normalization but does nothing to promote intra-modal alignment. Via spectral analysis of the inter-modal operator, we identify an approximately isotropic subspace in which the two modalities are well-aligned, as well as anisotropic directions specific to each modality. We demonstrate that this aligned subspace can be directly obtained from the projector weights and that removing the anisotropic directions improves intra-modal alignment. Our experiments on intra-modal retrieval and classification benchmarks show that our training-free method reduces intra-modal misalignment, greatly lowers latency, and outperforms existing approaches across multiple pre-trained CLIP-like models. The code is publicly available at: https://github.com/simomagi/IsoCLIP.

2603.19262 2026-06-01 cs.CL cs.AI

Empirical Characterization of Inference-Time Elicited Probability Transformations in Large Language Models

Mike Farmer, Abhinav Kochar, Yugyung Lee

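Affiliations: Bloch School of Management, Regnier Institute for Entrepreneurship & Innovation, University of Missouri–Kansas City; Department of Computer Science, School of Science and Engineering, University of Missouri–Kansas City

AI summary: Empirically observes that, under several inference-time pipelines (chain-of-thought, self-refinement, retrieval augmentation, and verifier-guided revision), the transformation of candidate-answer probabilities follows an approximate log-ratio relationship, and analyzes how its coefficients vary and how robust it is.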

Comments 22 pages, 11 figures, 5 tables

Abstract

Large language models increasingly rely on inference-time procedures such as chain-of-thought reasoning, self-refinement, retrieval augmentation, and verifier-guided revision, yet the structure of elicited probability transformations under these procedures remains poorly understood. We study externally elicited probability assignments over candidate answers and observe recurring approximate log-ratio relationships: \[ \log \tilde q_t(i) = α_t \left( \log q_t(i) + \log b_t(i) \right) + c_t, \] where $q_t$ and $\tilde q_t$ are pre- and post-elicitation probabilities, $b_t$ is an externally constructed evidence signal, and $α_t$ is an empirical descriptor of the prompting configuration. Across 4,975 reasoning problems from GPQA Diamond, TheoremQA, MMLU-Pro, and ARC-Challenge, evaluated on multiple instruction-tuned model families, we observe approximate log-ratio relationships with mean $R^2 \approx 0.76$ over about $1.3 \times 10^5$ candidate-level observations. Coefficients vary across elicitation settings, but qualitatively similar relationships persist across evaluated conditions. Robustness analyses using alternative statistical representations, prompting configurations, held-out evaluation, and token-level log-probabilities suggest that the observed structure is not tied to one prompting procedure or probability estimation method. The main contribution is not the algebraic form itself, which is related to generalized Bayesian updating and probability-transformation frameworks, but the empirical observation that diverse inference-time prompting pipelines repeatedly exhibit reproducible log-ratio structure under controlled conditions. The framework provides a protocol-sensitive perspective for analyzing calibration, evidence amplification, uncertainty propagation, and interaction sensitivity in inference-time LLM pipelines.
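
The log-ratio relationship is linear in $\log q_t(i) + \log b_t(i)$, so $α_t$ and $c_t$ can be estimated by ordinary least squares over the candidates of one problem. The sketch below is a sanity check on synthetic numbers, not the paper's fitting procedure; the function name and all values are illustrative.

```python
import math

def fit_log_ratio(q, b, q_tilde):
    """Least-squares fit of log q_tilde = alpha * (log q + log b) + c over candidates."""
    xs = [math.log(qi) + math.log(bi) for qi, bi in zip(q, b)]
    ys = [math.log(ti) for ti in q_tilde]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    alpha = sxy / sxx
    c = mean_y - alpha * mean_x
    return alpha, c

# Synthetic check: build post-elicitation probabilities that follow the relation exactly.
q = [0.1, 0.2, 0.3, 0.4]   # pre-elicitation probabilities (illustrative)
b = [0.5, 1.0, 2.0, 1.5]   # evidence signal (illustrative)
true_alpha = 0.8
unnormalized = [(qi * bi) ** true_alpha for qi, bi in zip(q, b)]
normalizer = sum(unnormalized)
q_tilde = [u / normalizer for u in unnormalized]
alpha, c = fit_log_ratio(q, b, q_tilde)
```

On this synthetic data the fit recovers the chosen slope, and the intercept equals the negative log of the normalizer, which shows that $c_t$ absorbs the renormalization of the post-elicitation probabilities.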

2603.17306 2026-06-01 cs.CL q-bio.NC

Evidence for systematic semantic structure in individual phonemes

Gexin Zhao

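Affiliations: Columbia University

AI summary: Using large language models, cross-linguistic listener experiments, and articulatory feature analysis, shows that individual English phonemes carry structured, multidimensional semantic profiles, challenging the traditional assumption that sound-meaning relations are arbitrary.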

Comments 31 pages, 4 figures

Abstract

A foundational assumption in linguistics holds that sound-meaning relations are largely arbitrary. Here we show that this assumption fails at the level of individual phonemes: each English phoneme carries a structured, multidimensional semantic profile that is recoverable from text, perceived across languages, and grounded in articulation. Three large language models independently detected consistent semantic structure across nine perceptual dimensions in 220 pairwise letter contrasts. Native English speakers (N = 93) confirmed these associations in a preregistered forced-choice task (85.3% agreement with model predictions), and listeners of five typologically diverse languages (N = 155) replicated the effect under audio presentation (73.2%-81.9% accuracy). Articulatory features predicted the structure with cross-validated R^2 of 0.56-0.98, indicating that the bodily act of producing a sound systematically shapes the meaning it conveys. These findings reframe phoneme-level iconicity as a pervasive, embodied property of the phonological system.

2601.05770 2026-06-01 cs.LG cs.CL

Weights to Code: Extracting Interpretable Algorithms from the Discrete Transformer

Yifan Zhang, Wei Bi, Kechi Zhang, Dongming Jin, Jie Fu, Zhi Jin

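Affiliations: Key Laboratory of High Confidence Software Technology (PKU), MOE, Beijing, China; School of Computer Science, Peking University, Beijing, China; Shanghai AI Lab, Shanghai, China; Shanghai Innovation Institute, Shanghai, China

AI summary: Proposes the Discrete Transformer, an architecture that injects discreteness through temperature-annealed sampling and combines hypothesis testing with symbolic regression to extract interpretable algorithms from model weights. It performs on par with an RNN baseline on discrete tasks and extends to tasks with continuous-valued intermediate computations.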

详情

AI中文摘要

算法提取旨在直接从算法任务训练的模型中合成可执行程序，从而无需依赖人工编写的目标程序即可从权重中重新发现可执行机制。然而，将此范式应用于Transformer时，由于表示纠缠（例如叠加），其中重叠方向编码的特征严重阻碍了符号表达式的恢复。我们提出了离散Transformer，这是一种专门设计用于弥合连续表示与离散符号逻辑之间差距的架构。通过温度退火采样注入离散性，我们的框架有效利用假设检验和符号回归来提取人类可读的程序。实验表明，离散Transformer在共享离散任务上实现了与基于RNN的MIPS基线相当的性能，同时将提取扩展到具有连续值中间计算的任务。最后，我们展示了架构归纳偏置对合成程序提供了细粒度控制，使离散Transformer成为算法提取和Transformer可解释性的可控测试平台。

英文摘要

Algorithm extraction aims to synthesize executable programs directly from models trained on algorithmic tasks, enabling de novo recovery of executable mechanisms from weights without relying on human-written target programs. However, applying this paradigm to Transformers is complicated by representation entanglement (e.g., superposition), where features encoded in overlapping directions substantially hinder the recovery of symbolic expressions. We propose the Discrete Transformer, an architecture explicitly designed to bridge the gap between continuous representations and discrete symbolic logic. By injecting discreteness through temperature-annealed sampling, our framework effectively leverages hypothesis testing and symbolic regression to extract human-readable programs. Empirically, the Discrete Transformer achieves performance comparable to the RNN-based MIPS baseline on shared discrete tasks, while broadening extraction to tasks with continuous-valued intermediate computations. Finally, we show that architectural inductive biases provide fine-grained control over synthesized programs, establishing the Discrete Transformer as a controllable testbed for algorithm extraction and Transformer interpretability.
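
How annealing a temperature injects discreteness can be sketched with a temperature-scaled softmax: as the temperature falls, the categorical distribution approaches a one-hot choice. The abstract does not specify the sampler, so the plain categorical sampling, the logits, and the schedule below are assumptions for illustration.

```python
import math
import random

def softmax(logits, temperature):
    """Temperature-scaled softmax; lower temperatures concentrate mass on the argmax."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def annealed_samples(logits, temperatures, seed=0):
    """Draw one index per temperature in a (typically decreasing) schedule."""
    rng = random.Random(seed)
    indices = list(range(len(logits)))
    return [rng.choices(indices, weights=softmax(logits, t))[0] for t in temperatures]

logits = [2.0, 1.0, 0.5]
schedule = [5.0, 1.0, 0.2, 0.01]
peak_probs = [max(softmax(logits, t)) for t in schedule]
samples = annealed_samples(logits, schedule)
```

Early in the schedule the samples explore all options; by the final temperature the draw is effectively deterministic, which is what makes the resulting computation readable as discrete symbolic steps.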
