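arXivDaily: a daily academic digest synced with the full arXiv feed, with AI-generated summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and related fields.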

2606.08737 2026-06-09 cs.RO New submission

Dream-Tac: A Unified Tactile World Action Model for Contact-Rich Robot Manipulation

Yunfan Lou, Yifan Ye, Yankai Fu, Jun Cen, Xiaowei Chi, Yaoxu Lyu, Peidong Jia, Sirui Han, Zhihe Lu, Shanghang Zhang

Affiliations: Peking University; The Hong Kong University of Science and Technology; Nanjing University; State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University

AI summary: Proposes Dream-Tac, a unified tactile world action model that jointly models actions, future visual observations, and tactile dynamics through contact-gated visuotactile fusion and a contact-aware attention bias, improving action accuracy by 31.7% on average across six contact-rich manipulation tasks.

Comments: 16 pages, 13 figures

Abstract

World action models inherit the predictive capability of world models, enabling action generation to be guided by anticipated future observations. However, they rely primarily on vision and often fail in contact-rich manipulation, where critical cues arise from physical interaction. In this paper, we propose Dream-Tac, a unified Tactile-World Action Model that jointly models actions, future visual observations, and tactile dynamics. Specifically, Dream-Tac introduces (i) contact-gated visuotactile fusion to selectively integrate tactile signals and (ii) a contact-aware attention bias to better regulate cross-modal interactions during manipulation. To support real-time deployment, we further design a dual-level acceleration strategy, reformulating the contact-aware bias to preserve the fused attention path during training and introducing cache-based diffusion acceleration at inference, achieving up to 2.9× faster training and 1.8× faster inference. Across six contact-rich manipulation tasks, Dream-Tac improves action accuracy by 31.7% on average, demonstrating the effectiveness of unified visuotactile world modeling. Code is available at https://github.com/LYFCLOUDFAN/Dream-Tac.


2606.08736 2026-06-09 cs.LG cs.DB New submission

Declarative Outcome-Conformant Synthesis: Exact, Closed-Form Specification Satisfaction and a Conformance Benchmark

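Muhammed Rasin

Affiliations: Independent Researcher

AI summary: Addresses the need to satisfy declared analytical outcomes exactly without source data. Defines the task of outcome-conformant synthesis, achieves exact aggregates through closed-form conditional Gamma sampling, builds the SpecBench benchmark, and shows that conformance and fidelity are orthogonal.

Comments: 22 pages, 1 figure. Benchmark and reference implementation (MIT): https://github.com/rasinmuhammed/misata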

Abstract

We study a capability the dominant paradigm in synthetic tabular data does not provide: exact satisfaction of a declared analytical outcome with no source data. Imitation methods (copulas, GANs, diffusion) learn a real distribution and sample from it, and are judged on fidelity to real data. A large, practical class of needs is different: generating data with no source data ("cold start") that reproduces a declared outcome (a revenue curve, a churn rate, a group share) across a relational schema. Off-the-shelf imitation tools offer no interface for such targets, and no sampler can hit an exact aggregate, because sampling has variance. On a real public dataset, off-the-shelf learned synthesizers trained on that very data miss the declared monthly aggregate by 74 to 86 percent; a per-period steelman cuts the miss to about 19 percent and still cannot reach 0; a closed-form generator reaches exactly 0. We name this task outcome-conformant synthesis, argue its evaluation axis is conformance rather than fidelity, and show the two axes are orthogonal. We contribute: (1) a formal account showing a widely-used family of exact-aggregate generators is exactly conditional-sum sampling of a Gamma population (via Lukacs' characterization), with closed-form exactness, a closed-form marginal CV, and scale-invariance; a controlled experiment maps the boundary, enforcing the exact aggregate costs at most 0.006 in 1-Wasserstein distance to an arbitrary external marginal, the rest being shape-family mismatch; (2) SpecBench, to our knowledge the first benchmark to measure conformance to analytical outcomes for cold-start relational synthesis; and (3) a closed-form, deterministic reference system. Exact aggregation alone is trivial; the contribution is conformance jointly with closed-form marginals, integrity, determinism, and zero source data. We concede fidelity to imitation where real data exists.
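
One way to read the conditional-sum claim above, as a minimal sketch rather than the paper's reference implementation: independent Gamma draws with a common scale, divided by their sum, give Dirichlet-distributed shares, and rescaling those shares to the declared total yields values that hit the aggregate exactly (up to floating-point rounding). The function name `exact_total_sample`, the shape parameters, and the revenue figure are illustrative assumptions, not taken from the paper or the misata repository.

```python
import random

def exact_total_sample(total, shapes, seed=0):
    """Draw positive values whose sum equals `total` (hypothetical helper).

    Independent Gamma(shape_i, 1) draws are divided by their sum, giving
    Dirichlet-distributed shares, and the shares are rescaled to `total`.
    """
    rng = random.Random(seed)  # fixed seed keeps the output deterministic
    draws = [rng.gammavariate(shape, 1.0) for shape in shapes]
    draw_sum = sum(draws)
    return [total * d / draw_sum for d in draws]

# Illustrative target: twelve monthly values that add up to a declared annual total.
monthly = exact_total_sample(total=120_000.0, shapes=[2.0] * 12)
doubled = exact_total_sample(total=240_000.0, shapes=[2.0] * 12)
```

Because the shares do not depend on `total`, doubling the declared total doubles every value, which matches the scale-invariance the abstract mentions.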


2606.08735 2026-06-09 cs.AI New submission

Structure-Conditioned Actor-Critic Branches for Quality-Diversity Reinforcement Learning

Lianrong Zuo, Peilan Xu, Yong Liu, Wenjian Luo

Affiliations: School of Artificial Intelligence, Nanjing University of Information Science and Technology; Guangdong Provincial Key Laboratory of Novel Security Intelligence Technologies, Institute of Cyberspace Security, School of Computer Science and Technology, Harbin Institute of Technology

AI summary: Proposes the SV-QD-RL framework, which uses structure-conditioned actor-critic branches and a branch-aware QD archive to build high-quality, behaviorally diverse policy repertoires on MuJoCo tasks.

Abstract

Quality-diversity reinforcement learning (QD-RL) aims to construct policy repertoires that contain both high-performing and behaviorally diverse policies. Existing QD-RL methods mainly diversify policy instances after rollout evaluation or use learned value information to improve policy quality and behavior targeting, while the learning branches that generate candidate policies remain less explored. This paper proposes SV-QD-RL, a structure-value coupled framework that represents each candidate as a structure-conditioned actor-critic branch. Each branch contains an actor, a structural mask, a branch-specific critic, a replay state, and evaluation attributes including behavior, return, sparsity, and value profile. The structural mask defines the actor subspace in which the branch learns, while the branch-specific critic and replay state shape its value-learning trajectory. A branch-aware QD archive then evaluates and retains branches according to behavioral quality, structural footprint, and value-profile information. Experiments on MuJoCo continuous-control tasks show that SV-QD-RL constructs policy repertoires with strong archive quality and behaviorally useful diversity. Ablation and diagnostic analyses further indicate that structural conditioning, critic differentiation, and memory-consistent refinement make complementary contributions to behavioral specialization. Schedule-aware repertoire evaluation shows that the learned archive provides selectable policy alternatives under changing behavior-level requirements. These results suggest that coupling actor structure with branch-specific value learning is an effective mechanism for generating diverse QD-RL policy repertoires.


2606.08729 2026-06-09 cs.RO cs.LG New submission

IR-SIM: A Lightweight Skill-Native Simulator for Navigation, Learning, and Benchmarking

Ruihua Han, Shuai Wang, Chengyang Li, Rui Gao, Xinyi Wang, Zhe Liu, Guoliang Li, Yupu Lu, Qi Hao, Jia Pan, Hengshuang Zhao

Affiliations: The University of Hong Kong; Shenzhen Institutes of Advanced Technology; Southern University of Science and Technology; University of Michigan; University of Macau

AI summary: Proposes IR-SIM, a lightweight skill-native navigation simulator in which scenarios are defined entirely by YAML configuration and can be generated and modified from text prompts. It supports benchmarking of navigation algorithms and automated generation of training data, and bridges to high-fidelity simulators and real-world deployment.

Comments: 12 pages, 6 figures, project website: https://github.com/hanruihua/ir-sim

Abstract

Simulation plays a key role in automated robotics research supported by large language models (LLMs). However, existing simulators often require custom code or complex interfaces, creating a barrier to rapid prototyping and automated algorithm development. To this end, we propose the Intelligent Robot Simulator (IR-SIM), a lightweight skill-native navigation simulator designed for rapid scenario construction, benchmarking, and robot learning. In IR-SIM, scenarios are entirely defined by YAML configuration files that specify mobile robot kinematics, geometric collision checking, LiDAR sensing, visualization, and behavior modules. This design makes robotic simulation fully describable and reproducible, allowing scenarios to be generated and modified from text prompts through the proposed IR-SIM agent skills. The resulting scenarios can be used for automated benchmarking of navigation algorithms and for automated generation of training data for learning methods. Furthermore, IR-SIM provides bridges to high fidelity simulators and real world deployment, allowing users to validate their algorithms in more realistic settings after prototyping without extra coding. The experiments showcase the convenience and versatility of IR-SIM in multiple tasks: constructing navigation scenarios from natural language, training a collision avoidance policy, benchmarking social navigation policies, and bridging to high fidelity simulators and real world deployment. The project website is available at https://github.com/hanruihua/ir-sim.


2606.08728 2026-06-09 cs.AI cs.CL cs.CV cs.LG New submission

Artificial Intelligence for Mathematical Reasoning: An Integrated Survey of Language Models, Neuro-symbolic Systems, and Verified Discovery

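Syed Rifat Raiyan, Mohsinul Kabir, Hasan Mahmud, Md Kamrul Hasan

Affiliations: University of California, Berkeley; University of Cambridge; University of Toronto

AI summary: Surveys the evolution of mathematical reasoning from early rule-based systems to contemporary reasoning models, multi-agent systems, and verified discovery workflows, organized along four axes (informal reasoning, formal reasoning, mathematical discovery, and reasoning techniques), and assesses benchmarks, failure modes, and future directions.

Comments: Under review, 47 pages, 14 figures, 22 tables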

Abstract

Mathematical reasoning has long served as a stringent test of machine intelligence; over the past decade, it has moved from a niche problem within NLP to one of the most consequential AI frontiers. This survey provides a unified account of the field's evolution, from early rule-based math word problem (MWP) solvers and template-driven geometry systems, through neural expression generation and LLM prompting, to contemporary reasoning models, multi-agent systems, neuro-symbolic theorem provers, and verified discovery workflows. We organize the landscape along four axes: (i) informal reasoning over text and diagrams, spanning MWP solving, multimodal geometry, and VLMs; (ii) formal reasoning in proof assistants, including autoformalization, tactic prediction, compiler-guided repair, and proof search; (iii) mathematical discovery, where systems propose constructions, improve bounds, or assist attacks on open problems; and (iv) the inference and training-time techniques, including CoT prompting, tool use, process reward models, and RLVR, that increasingly connect generation with verification. We catalog major benchmarks across grade-school arithmetic, competition mathematics, geometry, formal proving, multimodal and multilingual reasoning, and expert evaluation, and we examine benchmark saturation, contamination, reporting mismatches, and the distinction between pass@1, majority voting, and verifier-assisted pass@$k$. We critically assess failure modes: brittleness under perturbation, reward hacking, multimodal grounding failures, fragile formalization, and the energy cost of reasoning-scale inference. Drawing on recent perspectives from working mathematicians, we identify future directions centered on verified-discovery workflows, reasoning efficiency, and infrastructure to make AI-assisted formalization broadly usable. Companion materials: https://github.com/Starscream-11813/awesome-AI4Math.
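
The distinction between pass@1, majority voting, and pass@k can be made concrete with a small sketch. It uses the commonly used unbiased pass@k estimator, 1 - C(n-c, k)/C(n, k) for n samples of which c are correct. The sample answers and the reference answer are invented for illustration, and verifier-assisted selection is not modeled.

```python
from collections import Counter
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Chance that k samples drawn without replacement from n include a correct one."""
    if n - c < k:
        return 1.0  # fewer than k incorrect samples, so any k draws contain a correct one
    return 1.0 - comb(n - c, k) / comb(n, k)

def majority_vote(answers):
    """Return the most frequent final answer among the samples."""
    return Counter(answers).most_common(1)[0][0]

# Hypothetical final answers from 10 samples; the reference answer is "12".
answers = ["12"] * 3 + ["15"] * 4 + ["9"] * 3
n, c = len(answers), answers.count("12")

p1 = pass_at_k(n, c, 1)         # fraction of single samples that are correct
p5 = pass_at_k(n, c, 5)         # at least one correct among five samples
voted = majority_vote(answers)  # the answer a majority-vote report would score
```

In this toy case majority voting returns an incorrect answer even though pass@5 exceeds 0.9, which illustrates why reports that mix these metrics are hard to compare.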


2606.08725 2026-06-09 cs.RO cs.SY eess.SY New submission

Real-Time and Accurate Collision-Free Teleoperation via Differentiable Constraint-Based Trajectory Planning

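Max Grobbel, Tristan Schneider, Daniel Flögel, Sören Hohmann

Affiliations: FZI - Forschungszentrum Informatik; Karlsruhe Institute of Technology

AI summary: To address self-collisions and collisions with the environment in teleoperation, proposes a trajectory planning method built on duality-based differentiable collision constraints, with the robot modeled by capsules and the environment by polytopes. It achieves lower computation times and more accurate obstacle modeling, enabling smooth, collision-free teleoperation.

Comments: 8 pages, 4 figures, accepted at ICRA 2026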

Abstract

In teleoperation, the human operator typically controls only the end-effector pose, which often leads to self-collisions of the manipulator and collisions with environmental obstacles, since joints and links are not controlled individually. A common strategy to mitigate this issue is to enhance the operator's input using optimal-control-based trajectory planning. As derivative-based solvers require differentiable constraints, existing approaches either approximate robots and obstacles with spheres, reducing geometric accuracy, or approximate derivatives, degrading convergence and increasing computation times. We address these limitations by adapting a recent formulation of differentiable collision-avoidance constraints, based on duality in convex optimization, to the teleoperation setting. The robot is approximated with capsules and the environment with polytopes. We compare the resulting trajectory planning method against state-of-the-art techniques in simulation with varying numbers of obstacles and evaluate it on a UR5e manipulator in a real-world teleoperation test. Results show that our approach achieves lower computation times while enabling more accurate obstacle modeling, leading to smoother and collision-free end-effector teleoperation.


2606.08722 2026-06-09 cs.SD cs.CL New submission

Can LLMs understand LilyPond? A benchmark for symbolic music generation and understanding

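Matteo Spanio, Mohammad Torabi, Andrea Poltronieri, Antonio Rodà

Affiliations: University of Padova; Universitat Pompeu Fabra

AI summary: Proposes LilyBench, a LilyPond-based benchmark that jointly evaluates symbolic music generation and understanding in open-weight LLMs. Experiments show that executable LilyPond can be generated zero-shot, that structural understanding tasks remain challenging, and that metrics disagree systematically.

Comments: Accepted at Ital-IA 2026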

Abstract

Symbolic music evaluation for large language models remains fragmented across representations, datasets, and metrics. We introduce LilyBench, a LilyPond-based benchmark that jointly evaluates symbolic music generation and music understanding on the same family of open-weight LLMs. The benchmark includes a 200-prompt generation suite and ten understanding tasks adapted from ABC-Eval, covering syntax, metadata prediction, structural sequencing, and music recognition. Generation quality is evaluated using compile rate, MusPy descriptor distributions via Jensen-Shannon similarity, and LilyBERT-based Fréchet Music Distance (FMD). Experiments on four open-weight models show that executable LilyPond generation is achievable in zero-shot settings, while structural understanding tasks remain challenging despite strong performance on composer and genre recognition. Our experiments also reveal systematic disagreements between descriptor-based and embedding-based metrics, suggesting that symbolic music evaluation benefits from metric triangulation rather than single-score ranking. We release the benchmark, prompt bank, and evaluation code to support future research in symbolic music generation and understanding at https://github.com/CSCPadova/lilybench.


2606.08721 2026-06-09 cs.LG New submission

A Geometric Measure of Linear Separability for Neural Representations

Yi Wei, Xuan Qi, Furao Shen

Affiliations: State Key Laboratory of Novel Software Technology, School of Intelligence Science and Technology, Nanjing University; AI for Good (AIGO), Istituto Italiano di Tecnologia; DITEN, University of Genoa; State Key Laboratory of Novel Software Technology, School of Artificial Intelligence, Nanjing University

AI summary: Proposes the directional linear separability measure (LSM), which searches over affine halfspaces containing every sample of a target class and measures the minimum intrusion of competing samples, giving an asymmetric, class-wise, target-normalized diagnostic for the class-wise geometry of neural representations.

Abstract

Modern neural classifiers commonly rely on linear readouts, yet predictive metrics alone do not characterize the class-wise geometry of the representations on which such readouts operate. We introduce the directional linear separability measure (LSM), a finite-sample diagnostic for one-sided affine separability. For a target class A and a competing set B, LSM searches over affine halfspaces that contain all samples in A and measures the smallest competing-sample intrusion that must remain on the target side, normalized by |A|. The resulting quantity is asymmetric, class-wise, target-normalized, and applicable to finite representations extracted from neural networks. We establish its supporting-hyperplane characterization, relate it to optimal affine classification accuracy, and prove invariance under full-rank linear embeddings. These results separate changes caused by linear reparameterization from those caused by information loss or nonlinear geometric transformations. We also give a penalty-based affine search for estimating class-wise LSM in high-dimensional features, with reported values computed from the original discrete preservation and violation criterion. Finally, we analyze coordinatewise gated nonlinearities as finite-sample geometric operators and empirically use LSM to diagnose class-wise intrusion across common deep-learning components and architectures.
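
For scalar features the definition can be evaluated by hand, which may help fix intuition. The sketch below assumes closed halfspaces and a one-dimensional representation, so the only candidate halfspaces worth checking are the two half-lines that just cover the target class. The function name `lsm_1d` and the sample values are illustrative, and the paper's penalty-based affine search for high-dimensional features is not reproduced.

```python
def lsm_1d(target, competing):
    """One-sided separability of scalar features under the assumptions stated above.

    A closed half-line containing every target sample is at least as large as
    [min(target), inf) or (-inf, max(target)], so only those two are checked.
    """
    lo, hi = min(target), max(target)
    right = sum(1 for x in competing if x >= lo)  # intruders in [lo, inf)
    left = sum(1 for x in competing if x <= hi)   # intruders in (-inf, hi]
    return min(left, right) / len(target)         # normalized by |A|

class_a = [2.0, 3.0, 4.0]
class_b = [0.0, 1.0, 2.5, 5.0]

a_vs_b = lsm_1d(class_a, class_b)  # target A, competing B
b_vs_a = lsm_1d(class_b, class_a)  # target B, competing A
```

The two directions give different values on this toy data, which reflects the asymmetry the abstract describes, and a value of zero corresponds to a target class that a single threshold separates from its competitors.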


2606.08719 2026-06-09 cs.CV New submission

Thinking Without Images: Internalizing Visual Manipulation with On-Policy Self-Distillation

Yishuo Cai, Jiahui Liu, Yuanxin Liu, Haobo Deng, Linli Yao, Yuhao Zheng, Kun Ouyang, Zhimo Li, Ziyue Wang, Xu Sun, Haoli Bai, Xiaohui Li

Affiliations: State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University; Central South University; University of Science and Technology of China; Peking University; Huawei Technologies

AI summary: Proposes Imagine-OPD, an on-policy self-distillation framework that internalizes the visual reasoning ability of "Thinking with Images" as "Thinking with Imagination", generating internal visual cues without calling external tools and significantly reducing inference overhead while maintaining performance.

Abstract

"Thinking with Images" has emerged as an effective paradigm for fine-grained visual reasoning: by explicitly zooming into relevant regions and reasoning over crops, models can access local evidence that is difficult to recover from a single global image. However, this benefit comes with redundant tool invocations and longer inference traces. Moreover, when such behaviors are learned mainly from outcome reward, the resulting intermediate crops or visual cues can be noisy or fail to faithfully capture task-relevant visual evidence. In this work, we ask whether the reasoning benefits of "Thinking with Images" can be internalized through Thinking with Imagination: an internal process that decides where to look and imagines what visual cues closer inspection would reveal without actually invoking tools. We propose Imagine-OPD, an on-policy self-distillation framework in which a teacher plays the role of a "Thinking with Images" reasoner during training: it receives privileged zoomed evidence views derived from annotated regions, and supervises the model's own imagination reasoning trajectories. Imagine-OPD does not require an external teacher or high-quality imagination demonstrations. Experiments on vision-centric benchmarks show that Imagine-OPD achieves the best average performance among compared models while significantly reducing inference overhead compared with "Thinking with Images" methods.


2606.08715 2026-06-09 cs.CL New submission

Operationalizing Linguistic Methods through Prompt-Engineering Skills: An Automatic Chinese Web Neologism Detection Pipeline

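Yufeng Wu, Meichun Liu

Affiliations: City University of Hong Kong

AI summary: Proposes an automatic method for detecting Chinese web neologisms that turns traditional linguistic identification principles into prompt-engineering skills. A four-stage pipeline detects 4,853 neologisms in 267 million documents, and the evaluation identifies candidate coverage and LLM semantic judgment as the bottlenecks.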

Abstract

We present a method for automatic Chinese web neologism detection that operationalizes traditional linguistic identification principles as prompt-engineering skills. The method has four stages: tokenizer-independent character n-gram candidate generation; dictionary anchoring with a Pointwise Mutual Information pre-filter; a well-formedness skill based on Chinese word-formation principles; and a combined rule and three-way classification skill that distinguishes neologism, entity, and none. Applied to the BAAI CCI 3.0 corpus (267M documents), the method produces 226,959 classified candidates including 4,853 labeled neologisms. To evaluate the method, we develop a per-stage conditional recall decomposition in which the pipeline's strict recall factors mathematically into the product of stage conditional recalls. Applied to Hou (2023) (4,199 entries), the decomposition exposes Stage 1 candidate coverage and Stage 4B LLM semantic judgment as the two bottlenecks (R=41.5% and 60.0% respectively), while intermediate stages are near-lossless. A length-stratified analysis further reveals that the structural well-formedness skill is length-invariant (>= 96.9%) whereas the semantic novelty-classification skill is length-dependent (65.6%/59.0%/44.1% across 2/3/4-character candidates), mapping a current boundary of skill-based linguistic operationalization. We release the method, pipeline outputs, and evaluation protocol as public resources.
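
The per-stage decomposition described above can be sketched with a telescoping product: each stage's conditional recall is the share of gold entries it keeps among those that reached it, and multiplying the stages recovers the end-to-end strict recall. The function name and the counts below are toy assumptions for illustration, not figures from the paper.

```python
from math import prod

def conditional_recalls(gold_size, survivors_per_stage):
    """Recall of each stage, conditioned on the gold items that survived the previous one."""
    recalls, previous = [], gold_size
    for survivors in survivors_per_stage:
        recalls.append(survivors / previous if previous else 0.0)
        previous = survivors
    return recalls

# Toy counts (not from the paper): gold entries still present after each of four stages.
gold_size = 1000
survivors = [500, 480, 470, 282]

recalls = conditional_recalls(gold_size, survivors)
strict_recall = prod(recalls)  # telescopes to survivors[-1] / gold_size
```

Read the same way, the two reported bottlenecks alone give 0.415 × 0.600 ≈ 0.249, so the strict recall on the Hou (2023) list cannot exceed roughly 25% if both figures enter the product as factors, since every other conditional recall is at most 1.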


2606.08712 2026-06-09 cs.LG cs.AI cs.CV New submission

SNR-ST-Mix: Sample-specific Neighborhood Regression Mixup for Augmented Spatial Transcriptomics Imputation with Deep Neural Network

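Hongyi Yu, Yaoyu Fang, Jiahe Qian, Xinkun Wang, Lee A. Cooper, Bo Zhou

Affiliations: Northwestern University; Yale University

AI summary: To address noisy, low-resolution spatial transcriptomics data, proposes SNR-ST-Mix, a data augmentation framework that generates biologically plausible synthetic samples by constraining mixing to spatial neighborhoods and weighting it by expression similarity, improving deep neural network imputation.

Comments: 19 pages, 4 figures, 3 tables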

Abstract

Purpose: Spatial transcriptomics (ST) enables gene expression measurements within the tissue context. However, these measurements are often noisy, low-resolution, and sparsely sampled, which limits the recovery of fine spatial structure. Deep neural networks have become powerful tools for expression imputation from histology, but their performance remains constrained by limited sample sizes and a lack of biologically informed augmentation. Most of the existing augmentation strategies for learning are designed for classification tasks rather than regression, which neglect spatial and transcriptomic relationships, leading to biologically implausible interpolations that hinder prediction performance. Approach: To address these limitations, we propose SNR-ST-Mix, a geometry- and expression-aware data augmentation framework designed specifically for ST data. It constrains mixing to a spot's k-nearest spatial neighbors and adaptively weights interpolation coefficients based on expression similarity, generating augmented samples that preserve local biological structure while ensuring spatial smoothness. This dual conditioning yields synthetic examples that expand the effective training manifold, promote generalization, and enhance prediction stability under sample-specific training. Results: Extensive experiments with various tissue types demonstrate that SNR-ST-Mix consistently outperforms conventional augmentation methods without requiring architectural changes or additional computation. Conclusions: SNR-ST-Mix provides an effective and biologically principled augmentation strategy for spatial transcriptomics regression tasks. By explicitly leveraging spatial geometry and transcriptomic similarity, it expands the effective training manifold and improves predictive performance without increasing model complexity.
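
The neighborhood-constrained mixing idea can be sketched in a few lines. This is a hedged illustration, not the authors' implementation: the abstract does not give the weighting formula, so the rule used here (neighbor weight equal to a cap times the clipped cosine similarity of the two expression vectors) is an assumption, as are the function names, the value of `k`, and the toy spots.

```python
import math
import random

def cosine(u, v):
    """Cosine similarity of two expression vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def blend(a, b, w):
    """Convex combination (1 - w) * a + w * b, element by element."""
    return [(1 - w) * x + w * y for x, y in zip(a, b)]

def neighborhood_mixup(coords, feats, expr, i, k=2, max_weight=0.5, rng=random):
    """Mix spot i with one of its k nearest spatial neighbours (hypothetical helper)."""
    others = [j for j in range(len(coords)) if j != i]
    others.sort(key=lambda j: math.dist(coords[i], coords[j]))
    j = rng.choice(others[:k])  # partner is restricted to the spatial neighbourhood
    # Assumed weighting rule: more similar neighbours contribute more, capped at max_weight.
    w = max_weight * min(1.0, max(0.0, cosine(expr[i], expr[j])))
    return blend(feats[i], feats[j], w), blend(expr[i], expr[j], w), j, w

# Toy spots: 2-D coordinates, image features, and expression vectors.
coords = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (10.0, 10.0)]
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [5.0, 5.0]]
expr = [[1.0, 2.0], [2.0, 4.0], [0.0, 1.0], [9.0, 0.0]]

mixed_feat, mixed_expr, j, w = neighborhood_mixup(
    coords, feats, expr, i=0, rng=random.Random(0)
)
```

Mixing features and expression targets with the same weight keeps the augmented pair consistent for regression, and the distant fourth spot can never be chosen as a partner for spot 0 when `k=2`.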


2606.08708 2026-06-09 cs.CV New submission

PRPO: Perception-Reinforced Policy Optimization via Token-Level Dynamic Advantage Reshaping

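Qiming Li, Tianlun Li, Xiaolong Cheng, Hangyu Li, Ruiyan Gong, Kangning Niu, Kaitao Jiang, Mu Xu

Affiliations: Amap CV Lab, Alibaba Group; Peking University

AI summary: Proposes PRPO, a token-level reinforcement learning framework that identifies pivotal perceptual tokens with the Robust Visual Dependency (RVD) metric and strengthens their learning signal with Perceptual Advantage Reshaping (PAR), achieving average gains of 23.3% (3B model) and 21.1% (7B model) on seven multimodal reasoning benchmarks.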

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has become an effective paradigm for improving the reasoning capability of Large Vision-Language Models (LVLMs). However, existing RLVR methods primarily rely on trajectory-level outcome rewards, which assign identical learning signals across all generated tokens. This coarse-grained credit assignment is fundamentally mismatched to multimodal reasoning, where only a sparse subset of tokens is causally grounded in visual evidence. Consequently, these pivotal perceptual tokens receive weak supervision and are often overwhelmed by language priors or reasoning-template tokens. To address this limitation, we propose Perception-Reinforced Policy Optimization (PRPO), a token-level reinforcement learning framework that explicitly identifies and reinforces pivotal perceptual tokens within long-horizon multimodal reasoning trajectories. PRPO introduces Robust Visual Dependency (RVD), a principled metric that identifies tokens whose predictions are both visually grounded and perturbation-stable, filtering out brittle or noisy visual tokens. Based on RVD, we further propose Perceptual Advantage Reshaping (PAR), a token-level credit assignment technique that amplifies perceptually informative tokens while preserving stable gradients for non-perceptual tokens. Extensive experiments on seven multimodal reasoning benchmarks demonstrate that PRPO consistently outperforms strong LVLM baselines across both 3B and 7B model scales, achieving average gains of 23.3% and 21.1%, respectively. PRPO achieves state-of-the-art performance with improved training efficiency and stronger cross-task generalization. Our findings highlight the importance of fine-grained credit assignment for scalable multimodal reinforcement learning.


2606.08705 2026-06-09 cs.CL New submission

Analyzing the Correlation Between Hallucinations and Knowledge Conflicts in Large Language Models

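Lucrezia Laraspata, Giovanna Castellano, Gennaro Vessio

Affiliations: University of Bari Aldo Moro

AI summary: Uses probing techniques to analyze internal LLM representations and finds that hallucination activation patterns cannot be fully attributed to knowledge conflicts, while probing remains useful for improving model interpretability.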

Abstract

Hallucinations -- factually incorrect or unverifiable outputs -- remain one of the most challenging limitations of Large Language Models (LLMs), especially in knowledge-intensive tasks. One proposed explanation is internal knowledge conflicts arising from fixed, outdated training data. This paper investigates whether internal representations linked to knowledge conflicts correlate with hallucination behaviors in LLMs. Using probing techniques inspired by two prior works, we analyzed activations from hidden, attention, and MLP layers, as well as output logits, across predefined tasks. We probed LLaMA-3-8B on hallucination detection benchmarks and Falcon-7B on a knowledge conflict dataset. Our findings show that, although conceptually related, hallucination activation patterns cannot be fully reduced to or explained by knowledge conflict representations. Nonetheless, probing proves a robust tool across multiple languages and activation types, supporting its role in improving LLM interpretability. This work advances the broader understanding of hallucinations in LLMs and underscores the value of fine-grained analysis of their internal behavior.


2606.08702 2026-06-09 cs.AI New submission

ConMem: Structured Memory-Guided Adaptation in Training-Free Multi-Agent Systems

Zhixun Tan, Qiang Chen, Tairan Huang, Xiu Su, Yi Chen

Affiliations: Central South University; The Hong Kong University of Science and Technology

AI summary: Proposes ConMem, a framework that uses structured memory cards and a relation-aware memory graph for efficient adaptation of multi-agent systems without additional training, improving performance and reducing inference overhead on multiple benchmarks.

Abstract

Recent advances have improved the adaptive capabilities of LLM-based multi-agent systems (MAS) through memory-, skill-, and learning-based approaches, yet these approaches remain challenged by noisy trajectories, insufficient modeling of memory-skill relations, and reliance on additional training or high-quality supervision. To address these limitations, we propose ConMem, a relation-aware and training-free framework that enables efficient multi-agent adaptation through cross-experience coordination. Specifically, ConMem distills historical interaction trajectories into structured memory cards to capture reusable strategies and cues, organizing them into a relation-aware memory graph. At runtime, ConMem retrieves cards according to task needs and coordinates them through the card graph to resolve strategy conflicts and recover their dependencies. Combined, these modules yield structured and relation-aware guidance, enabling robust, lightweight adaptation in multi-agent systems without additional training. Extensive experiments across multiple benchmarks and mainstream MAS architectures show consistent gains over existing memory architectures, with improved inference-time efficiency through pruning more than 50% of expanded candidates and reducing planning overhead by over 80%. Our code is available at https://anonymous.4open.science/r/ConMemCode


2606.08691 2026-06-09 cs.LG stat.ME New submission

Hierarchical Projection for Adaptive Knowledge Transfer

Samhita Pal, Tian Gu

Affiliations: Vanderbilt University Medical Center; Columbia University

AI summary: Proposes the ProjectionTL framework, which combines hierarchical Bayesian modeling with adaptive projection to perform source selection and feature selection, mitigating negative transfer and improving the accuracy, stability, and interpretability of cross-domain learning.

Abstract

Modern data-driven applications increasingly involve learning from multiple heterogeneous sources, where a target dataset is limited but related information is available across domains. Naively combining these sources can degrade performance when relevance varies or spurious signals are present, posing a fundamental challenge for trustworthy cross-domain learning. We propose Projection Transfer Learning (ProjectionTL), a unified framework that integrates hierarchical Bayesian modeling with adaptive projection for selective knowledge transfer. The key idea is to decouple transfer at two levels: first, we construct a source-guided hierarchical prior that aggregates information across sources using data-driven weights, capturing global alignment between each source and the target; second, we refine this borrowing through a posterior-projection step that operates at the feature level, selectively retaining coordinates that exhibit local agreement with the target signal. This two-stage design enables the method to simultaneously perform source selection and feature selection, thereby mitigating negative transfer while preserving interpretability. ProjectionTL provides a principled approach to integrating heterogeneous data across domains, bridging statistical modeling and modern machine learning paradigms for robust and interpretable transfer. Through simulations and real-world biomedical applications, we demonstrate improved accuracy, stability, and interpretability compared to existing methods. Our framework offers a scalable and generalizable strategy for trustworthy cross-domain learning in high-dimensional settings.


2606.08688 2026-06-09 cs.RO cs.CV New submission

PhysAgent: Automating Physics-Based 4D Synthesis via Trajectory-Grounded Multi-Agent Feedback

Chunji Lv, Jiaxi Ye, Yuchen Jiang, Rexar Lin, Changsheng Li

Affiliations: Beijing Institute of Technology

AI summary: Proposes PhysAgent, the first simulator-in-the-loop multi-agent framework, which decouples materials from external forces, extracts trajectories with vision foundation models, and applies LLM commonsense reasoning to achieve automated, physically plausible 4D motion synthesis, significantly improving generation diversity and physical accuracy.

Abstract

Achieving fully automated, physically plausible 3D motion synthesis is a core objective in graphics and generative AI. However, configuring complex environmental force fields still relies entirely on manual expert intervention, creating a severe bottleneck for large-scale simulation data generation. Existing automated methods primarily focus on material optimization and exhibit severe modality gaps and technical flaws when applied to the vastly more complex force field optimization space: naive Large Language Models (LLMs) lack underlying simulation feedback, causing severe physical inaccuracies, while traditional Score Distillation Sampling (SDS) suffers from sluggish gradients, local optima entrapment, and a mathematical inability to dynamically switch discrete force fields. To address this, we propose PhysAgent, the first simulator-in-the-loop multi-agent framework that leverages multimodal inputs for automated, physically grounded 4D synthesis. By decoupling intrinsic materials from extrinsic dynamics, PhysAgent utilizes a Semantic Agent equipped with an externalized Force Field Skill module to master simulation rules and generate valid initializations. Subsequently, the Refine Agents, driven by Trajectory-Grounded Multi-Agent Feedback, leverage vision foundation models to extract dense point trajectories from rendered frames. By converting these explicit motion trajectories into structured textual descriptors, the agent harnesses LLM commonsense reasoning to execute zero-shot macroscopic leaps, effectively escaping local optima and dynamically switching discrete force fields. Extensive experiments demonstrate that PhysAgent rapidly generates stable, diverse physical scenes from arbitrary multimodal prompts, significantly outperforming existing baselines in both generation diversity and physical accuracy.


2606.08687 2026-06-09 cs.CV New submission

Shift-Dependent Asymmetry: Orthogonal Inverse Low-Rank Adaptation for Federated Medical Segmentation

Xingyue Zhao, Wenke Huang, Linghao Zhuang, Haoran Wu, Anwen Jiang, Zhifeng Wang, Wenwen He, Ming Feng, Mang Ye, Bo Xu

Affiliations: University of Science and Technology of China

AI summary: Targets the encoder-decoder asymmetry in federated medical segmentation (the encoder is dominated by appearance shifts, the decoder by supervision variations) and proposes Inverse Asymmetric Tuning (IAT), which personalizes module-specific components and adds a subspace orthogonality regularizer to prevent leakage, improving cross-site generalization.

Comments: Accepted by ICML 2026

Abstract

Low-Rank Adaptation (LoRA) enables efficient federated fine-tuning of segmentation foundation models for medical imaging. However, most federated LoRA methods adopt a uniform aggregation rule, which breaks under the encoder-decoder asymmetry in medical segmentation: the encoder is dominated by appearance shifts, while the decoder is dominated by supervision variations. This mismatch entangles shared anatomy with site-specific biases and harms generalization. To address this, we propose Inverse Asymmetric Tuning (IAT). IAT aligns adaptation with heterogeneity sources by personalizing module-specific components in the encoder to absorb appearance shifts and in the decoder to accommodate site-dependent supervision, while retaining a shared pathway for transferable consensus. However, structural separation alone is insufficient under LoRA's bilinear parameterization, where multiplicative coupling can still cause site-specific updates to leak into the shared direction. We therefore introduce a Subspace Orthogonality Regularizer that penalizes shared-local collinearity in the effective update space, mitigating leakage without extra communication. Experiments show consistent improvements over strong federated LoRA and parameter-efficient FL baselines.
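Editor's sketch (not the authors' implementation): the abstract does not give the exact form of the Subspace Orthogonality Regularizer. One simple way to penalize shared-local collinearity in the effective update space is the squared cosine similarity between the flattened LoRA updates ΔW = B·A of the shared and local branches. The function names, the rank-1 toy matrices, and the choice of squared cosine are all assumptions made for illustration.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    # Plain nested-list matrix product.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def flatten(m: Matrix) -> List[float]:
    return [v for row in m for v in row]


def squared_cosine(u: List[float], v: List[float]) -> float:
    dot = sum(x * y for x, y in zip(u, v))
    return dot * dot / (sum(x * x for x in u) * sum(y * y for y in v))


def collinearity_penalty(b_shared: Matrix, a_shared: Matrix,
                         b_local: Matrix, a_local: Matrix) -> float:
    # Hypothetical penalty: compare the effective updates B @ A, not the
    # factors, because the bilinear product is what changes the weights.
    delta_shared = matmul(b_shared, a_shared)
    delta_local = matmul(b_local, a_local)
    return squared_cosine(flatten(delta_shared), flatten(delta_local))


# Rank-1 toy factors for a 2x2 weight matrix (illustrative values only).
b_shared, a_shared = [[1.0], [0.0]], [[1.0, 0.0]]
orthogonal_penalty = collinearity_penalty(b_shared, a_shared, [[0.0], [1.0]], [[0.0, 1.0]])
leaky_penalty = collinearity_penalty(b_shared, a_shared, [[1.0], [1.0]], [[1.0, 0.0]])
```

In this toy case the orthogonal local update incurs no penalty, while the local update that overlaps the shared direction is penalized. Because the penalty is computed locally from the factors, it needs no extra communication.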


2606.08684 2026-06-09 cs.CV New submission

BLUE: Toward Better Language Use in Efficient Vision-Language-Action Models for Autonomous Driving

George Ling, Lijin Yang, Hao Yang, Zhongzhan Huang

Affiliations: Bosch Research

AI summary: Proposes BLUE, which uses a lightweight gate in vision-language-action models to decide per frame whether to activate language generation, improving performance with a 2.54x inference speedup.

Comments: preprint

Abstract

We present BLUE, a minimal method for better language use in vision-language-action (VLA) models for autonomous driving (AD). Through extensive analysis, we reveal that language matters on only a small fraction of routes, but on those routes it can greatly improve or degrade performance. Generating language at every frame is therefore inefficient, since most computation is spent on frames that do not benefit from language. We further show that pretrained VLA hidden states potentially already encode whether language will benefit a given frame, even though scene complexity and kinematic features alone struggle to predict this. Based on this finding, BLUE trains a lightweight gate on frozen VLA hidden states to decide per frame whether to activate language generation or predict actions directly, without modifying the backbone or requiring additional human annotation. With just a 0.11M-parameter gate, BLUE sets a new state of the art on both benchmarks, achieving 76.2% success rate on Bench2Drive and 36 driving score on Longest6 v2, while delivering 2.54x inference speedup and 8.9% success rate improvement over the backbone. BLUE provides a practical path toward efficient language-augmented AD, showing that VLA models can retain the benefits of language at a fraction of the cost. Our code, data, logs and checkpoints are fully available on https://github.com/George-Ling3/BLUE.
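Editor's sketch (not the released BLUE code): the per-frame routing described above can be pictured as a small classifier over a frozen hidden state. A logistic gate with a fixed threshold is assumed here; the weights, bias, threshold, and two-dimensional hidden states are hypothetical values chosen for illustration, and the real gate architecture may differ.

```python
import math
from typing import List


def gate_probability(hidden: List[float], weights: List[float], bias: float) -> float:
    # Logistic score over the (frozen) hidden state of one frame.
    logit = sum(h * w for h, w in zip(hidden, weights)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


def route_frame(hidden: List[float], weights: List[float], bias: float,
                threshold: float = 0.5) -> str:
    # "language": generate language before acting; "direct": predict actions directly.
    return "language" if gate_probability(hidden, weights, bias) >= threshold else "direct"


# Hypothetical gate parameters and per-frame hidden states.
weights, bias = [2.0, -1.0], -0.5
frames = [[1.0, 0.0], [0.0, 1.0], [0.1, 0.0]]
routes = [route_frame(h, weights, bias) for h in frames]
```

Only the gate parameters would be trained in such a setup; the backbone that produces the hidden states stays untouched, which matches the abstract's description of a gate on frozen hidden states.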


2606.08682 2026-06-09 cs.LG cs.AI New submission

Activation Steering Induces Emergent Misalignment: A More Comprehensive Evaluation

Qi Cao, Jian Lou, Meiting Liu, Wenjie Feng, Dan Li, See-Kiong Ng, Anh Tuan Luu

Affiliations: Nanyang Technological University; Sun Yat-sen University; University of Science and Technology of China; National University of Singapore

AI summary: Studies whether activation steering induces emergent misalignment; with an expanded evaluation scope, finds that activation steering can induce broad misalignment and produces more coherent harmful responses than finetuning, and analyzes the key factors.

Abstract

Activation steering has emerged as a popular inference-time technique for modulating the behavior of large language models (LLMs). By constructing a steering vector from examples of a target behavior and injecting it into intermediate activations during inference, activation steering enables flexible behavioral control while avoiding the permanent parameter updates required by finetuning. Meanwhile, recent work has identified emergent misalignment (EM) as a significant safety concern, wherein models finetuned on unsafe examples from a narrow task may unexpectedly generalize to broadly unsafe behavior on unrelated tasks. Although finetuning-induced EM has been extensively studied, whether activation steering can induce EM remains comparatively under-explored, despite its increasing use as a model-control technique. In this paper, we present a comprehensive study of activation-steering-induced emergent misalignment, substantially expanding the evaluation scope beyond existing pioneering work. First, we show that activation steering can induce broad misalignment, even in the recent Qwen-3.5 series. Moreover, activation-steered models produce harmful responses with stronger semantic relevance and higher coherence than their finetuned counterparts, making the resulting misalignment potentially more harmful. Second, we characterize properties of AS-induced EM by analyzing key steering-specific factors, including steering magnitude, the low-rank structure of the steering subspace, and the number of epochs during steering-vector construction. Third, we evaluate the robustness and sensitivity of AS-induced EM across diverse model families, model scales, target tasks, and intervention layers. Our findings reveal activation steering as a significant yet under-examined source of emergent misalignment and provide an activation-space perspective for understanding the mechanisms and safety risks of EM.
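Editor's sketch (an assumption, not the paper's procedure): the general mechanism of activation steering can be illustrated with a difference-of-means steering vector added to a hidden state with a scaling factor. The abstract mentions epochs during steering-vector construction, so the authors may learn the vector differently; the function names and toy activations below are hypothetical.

```python
from typing import List


def mean_vector(rows: List[List[float]]) -> List[float]:
    # Column-wise mean over a batch of activation vectors.
    return [sum(col) / len(rows) for col in zip(*rows)]


def steering_vector(target_acts: List[List[float]],
                    baseline_acts: List[List[float]]) -> List[float]:
    # Difference of means between target-behavior and baseline activations.
    mu_target = mean_vector(target_acts)
    mu_baseline = mean_vector(baseline_acts)
    return [t - b for t, b in zip(mu_target, mu_baseline)]


def steer(hidden: List[float], vector: List[float], alpha: float) -> List[float]:
    # Inject the steering vector into an intermediate activation at inference time.
    return [h + alpha * v for h, v in zip(hidden, vector)]


# Toy activations at one intervention layer (illustrative values only).
target_acts = [[1.0, 2.0], [3.0, 4.0]]
baseline_acts = [[0.0, 1.0], [2.0, 1.0]]
vector = steering_vector(target_acts, baseline_acts)
steered = steer([0.5, -0.5], vector, alpha=2.0)
```

The scaling factor `alpha` plays the role of the steering magnitude, one of the factors the abstract lists. No model parameters are changed, which is the contrast with finetuning.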


2606.08680 2026-06-09 cs.CV cs.RO New submission

Distortion-Aware PETR for BEV Object Detection with Mixed Pinhole-Fisheye Cameras

Xiangzhong Liu

Affiliations: fortiss GmbH

AI summary: Addresses the way fisheye radial distortion breaks the uniform-sampling assumption of BEV detectors by proposing DAPETR, which uses a distortion-aware positional embedding and a bidirectional feature-geometry co-modulation module; it outperforms the baselines on a converted KITTI-360 benchmark and reveals a conflict between learned adaptation and explicit geometric reparameterization.

Comments: 8 pages, 5 figures, accepted at ICRA 2026

Abstract

Fisheye cameras are widely deployed in autonomous driving perception suites for their low cost and full-coverage field of view (FOV), yet their potential remains underleveraged in 3D object detection. Severe radial distortion challenges most BEV detectors by violating the fundamental assumption of uniform sampling. To bridge this gap, we propose Distortion-Aware PETR (DAPETR), a projection-free detector tailored for mixed pinhole-fisheye camera setups. DAPETR incorporates two key learned-adaptive modules: a unified distortion-aware positional embedding that harmonizes positional encodings for image representations with fisheye geometry, and a bidirectional feature-geometry co-modulation module that mutually adapts image features and 3D positional embeddings. In our experiments on a converted KITTI-360 benchmark, we systematically compare our learned adaptive approach against PETR in polar coordinates (PolarPETR). We find that while both methods improve over the baseline, our learned modules achieve superior performance. Crucially, we uncover a negative interaction when combining both strategies, revealing that learned adaptation and explicit geometric reparameterization can conflict. Our final DAPETR model significantly advances the research and benchmark for fisheye BEV detection, providing critical insights into effective distortion-aware 3D perception design other than image rectification.


2606.08678 2026-06-09 cs.SD cs.LG New submission

Speaker-Invariant Representation Learning for Spoofing Detection via Gradient Reversal and A Variational Information Bottleneck

Anh-Tuan Dao, Driss Matrouf, Mickael Rouvier, Nicholas Evans

Affiliations: Avignon Universite; EURECOM

AI summary: Addresses poor generalization caused by speaker bias in spoofing detection by proposing a teacher-student framework that uses a gradient reversal layer and a variational information bottleneck to disentangle identity information, achieving a 25.7% relative EER reduction across nine datasets.

Abstract

Sophisticated generative speech technology can undermine the reliability of voice biometrics. While spoofing detection systems excel when assessed under in-domain conditions, generalisation to out-of-domain settings is often poor. In this paper, we show that such issues could be caused by speaker bias, where models learn individual voice traits rather than markers of manipulation or generation. We propose a teacher-student framework for speaker-invariant spoofing detection that disentangles identity without requiring speaker labels. We leverage a pre-trained speaker recognition teacher to guide a student model via a gradient reversal layer. To control the balance between suppressing cues related to voice identity and preserving those related to spoofing detection, we integrate a Variational Information Bottleneck. Evaluations across nine datasets show our model achieves a 25.7% relative reduction in EER compared to the MHFA baseline.
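Editor's sketch (generic technique, not the authors' code): a gradient reversal layer acts as the identity in the forward pass and multiplies incoming gradients by a negative factor in the backward pass, so the feature extractor is pushed to make the identity branch's task harder. The framework-free class below only illustrates that contract; the class name, the `lam` parameter, and the toy values are assumptions, and a real model would implement this inside an autograd framework.

```python
from typing import List


class GradientReversal:
    """Identity on the forward pass; scales gradients by -lam on the backward pass."""

    def __init__(self, lam: float = 1.0) -> None:
        self.lam = lam

    def forward(self, features: List[float]) -> List[float]:
        # Features reach the speaker branch unchanged.
        return list(features)

    def backward(self, grad_output: List[float]) -> List[float]:
        # The gradient coming back from the speaker branch is negated and scaled,
        # so the upstream encoder is updated to remove speaker cues.
        return [-self.lam * g for g in grad_output]


grl = GradientReversal(lam=0.5)
passed = grl.forward([0.2, -1.0, 3.0])
reversed_grad = grl.backward([1.0, -2.0, 4.0])
```

The factor `lam` sets how strongly the reversed signal competes with the main spoofing-detection loss.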


2606.08673 2026-06-09 cs.CL New submission

ClinicalAligner26AM: A Cross-Lingual Aligner for Dataset Translation; Evidences from the MultiClinCorpus Shared Task

François Remy

Affiliations: Parallia Healthcare AI

AI summary: Proposes ClinicalAligner26AM, a multilingual alignment model for biomedical and clinical text initialized from ClinicalEncoder26AM; it builds soft alignment targets by fusing multi-level signals with Sinkhorn-Knopp optimal transport and projects entity annotations across languages in the MultiClinCorpus task, with character-weighted F1 above 0.95.

Abstract

Word-level cross-lingual alignment is central to annotation projection, translation auditing, and cross-lingual faithfulness estimation, yet existing neural aligners are rarely adapted to specialized domains. In this paper, we introduce ClinicalAligner26AM, a large-context multilingual aligner model for biomedical and clinical text initialized from ClinicalEncoder26AM. Our training recipe is inspired by AWESoME Align. We build our soft alignment target by using Sinkhorn-Knopp optimal transport to sharpen a cost matrix, established for parallel clinical texts and conversations by fusing sentence-level, phrase-level, and token-level signals. We distill this sharpened alignment matrix directly into our student aligner by encouraging its naive cosine-based token similarity scores to match this target. At inference time, we project source-span scores through the learned token alignment matrix and decode the longest valid high-scoring span in the target text, optionally supported by MultiClinNER predictions summarized in Appendix B. We evaluate CA26AM on the MultiClinCorpus shared task, which projects Spanish clinical entity annotations into six target languages. Our two submitted systems ranked first and second, respectively, across all languages and entity types, with character-weighted F1 scores above 0.95 in nearly all settings.


2606.08672 2026-06-09 cs.CV cs.LG New submission

Learning to Solve Generative ODEs Beyond the Linear Span

Sihyeon Kim, Seunghun Lee, Vikas Singh, Hyunwoo J. Kim

Affiliations: Korea University; KAIST; University of Wisconsin–Madison

AI summary: Addresses the many solver steps required by ODE solvers in diffusion and flow generative models by proposing SpanLift, a lightweight neural solver that augments scalar-coefficient updates with a spatial residual operator, enabling few-step sampling without adding model NFEs and achieving state-of-the-art performance on several tasks.

Comments: 12 pages, 7 figures

Abstract

Diffusion and flow generative models sample by integrating a learned ODE, but high quality still requires many sequential model evaluations. Solver learning reduces this cost by adapting scalar coefficients, timesteps, or both, while keeping the backbone model fixed. In this work, we identify a structural bottleneck in this update family: each step remains span-limited. Since the scalar-coefficient update lies in the span of buffered velocity evaluations, it can fit only the in-span component while leaving any out-of-span residual unreachable by scalar recombination alone. We propose SpanLift, a lightweight neural solver that augments scalar-coefficient updates with a spatial residual operator. SpanLift keeps a fixed base solver as an in-span prior and learns a spatial residual operator over the state and velocity buffer. The operator is trained by endpoint teacher matching, preserves the pretrained backbone, and adds no model NFEs. Empirically, the learned correction transfers across base solvers and is predominantly out-of-span. Across pixel-space diffusion, latent flow matching, and precipitation nowcasting, SpanLift achieves state-of-the-art few-step sampling. With only 3 NFE, it improves CIFAR-10 FID from 8.16 to 5.69 and ImageNet FID from 17.37 to 11.83.
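Editor's sketch (an illustration of the span limitation, not SpanLift itself): if an update is a scalar-weighted sum of buffered velocity evaluations, the best it can do is the least-squares projection of the ideal update onto their span. Whatever is left over is orthogonal to every buffered vector and cannot be reached by changing the coefficients. The three-dimensional vectors below are hypothetical toy values.

```python
from typing import List, Tuple


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def project_onto_span(target: List[float], v1: List[float],
                      v2: List[float]) -> Tuple[Tuple[float, float], List[float], List[float]]:
    """Least-squares projection of target onto span{v1, v2} via the 2x2 normal equations."""
    g11, g12, g22 = dot(v1, v1), dot(v1, v2), dot(v2, v2)
    b1, b2 = dot(v1, target), dot(v2, target)
    det = g11 * g22 - g12 * g12  # non-zero when v1 and v2 are linearly independent
    c1 = (g22 * b1 - g12 * b2) / det
    c2 = (g11 * b2 - g12 * b1) / det
    in_span = [c1 * x + c2 * y for x, y in zip(v1, v2)]
    residual = [t - p for t, p in zip(target, in_span)]
    return (c1, c2), in_span, residual


# Two buffered velocity evaluations and a hypothetical ideal update.
velocity_buffer = ([1.0, 0.0, 0.0], [1.0, 1.0, 0.0])
ideal_update = [2.0, 3.0, 0.5]
coeffs, in_span, residual = project_onto_span(ideal_update, *velocity_buffer)
```

Here the optimal coefficients recover the first two coordinates exactly, but the third coordinate remains as an out-of-span residual. That leftover component is the part a non-scalar correction, such as the spatial residual operator described in the abstract, would have to supply.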


2606.08671 2026-06-09 cs.LG New submission

SkillHone: A Harness for Continual Agent Skill Evolution Through Persistent Decision History

Zhiwei Li, Yong Hu

Affiliations: WeChat, Tencent Inc., China

AI summary: Proposes the SkillHone harness, which records diagnoses, revisions, and evidence in a persistent decision history to enable continual evolution of agent skills, and outperforms existing methods on open-web deep-research benchmarks.

Comments: Work in progress

Abstract

Agent skills extend language-model agents with task-specific procedures, scripts, and references, but the tasks and environments they target continually change. Existing methods improve skills in bounded runs and retain only the final artifact, discarding the decision history that later agents need to interpret prior revisions, evaluations, and rejected alternatives. We introduce SkillHone, a harness for continual agent skill evolution grounded in persistent decision history. SkillHone pairs skill revisions with evaluation-side evidence that supplies practice feedback, recording structured histories of diagnoses, revisions, evidence, and outcomes. Role-separated subagents run candidate skills on practice probes with redacted reporting and propose revisions informed by prior decisions, enabling cross-session refinement without rediscovering past rationale. We evaluate SkillHone on deep-research benchmarks in a raw open-web setting, where agents are not given an integrated search stack and must organize retrieval through portable skills. We compare against a deep-research agent backed by commercial retrieval services. With Qwen3.6-35B-A3B as the evaluation-time backbone, the resulting skills outperform the deep-research agent by 15.8 points on GAIA and 3.2 points on WebWalkerQA-EN, while also exceeding prior skill-evolution methods.


2606.08670 2026-06-09 cs.CV New submission

WaveDiT: Distribution-Aware Wavelet Flow Matching for Efficient 3D Brain MRI Synthesis

Danilo Danese, Angela Lombardi, Giuseppe Fasano, Matteo Attimonelli, Tommaso Di Noia

Affiliations: Politecnico di Bari; Sapienza University of Rome

AI summary: Proposes WaveDiT, a conditional flow matching framework operating in the coefficient space of a 3D Haar discrete wavelet transform; with factorized spatio-depth attention and band-wise heteroscedastic uncertainty modeling based on higher-order wavelet statistics, it enables efficient full-resolution 3D brain MRI synthesis on a single GPU and outperforms existing methods in distribution alignment and downstream tasks.

Comments: Provisionally accepted at MICCAI 2026

Abstract

Large and demographically balanced datasets are essential for reliable neuroimaging biomarkers. Full-resolution 3D brain MRI synthesis can support data augmentation in this setting, but existing approaches either incur prohibitive computational cost at volumetric scale or rely on lossy latent compression that may compromise anatomical detail. As a result, practical 3D generative augmentation often requires specialized compute infrastructure. We propose WaveDiT, a conditional flow matching framework operating in the coefficient space of a 3D Haar Discrete Wavelet Transform. The model combines factorized spatio-depth attention with band-wise heteroscedastic uncertainty modeling derived from higher-order wavelet statistics. Predicted log-variance is integrated directly into both the flow objective and conditioning pathway, enabling adaptive precision consistent with the heavy-tailed and input-dependent variance structure of anatomical detail. This formulation supports full-resolution 3D synthesis under practical memory and time constraints on a single modern GPU. Evaluation on a multi-site cohort demonstrates improved alignment between generated and real MRI distributions, together with enhanced downstream brain age prediction and region-level anatomical agreement relative to diffusion, latent, and wavelet-based baselines. Code is available at https://github.com/sisinflab/WaveDiT


2606.08669 2026-06-09 cs.SD cs.LG New submission

A Comparison of SSL-Based Feature Extractors and Back-End Classifiers for Spoofing Detection: A Multi-Corpus Training and Cross-Linguistic Analysis

Anh-Tuan Dao, Driss Matrouf, Mickael Rouvier, Nicholas Evans

Affiliations: Avignon Universite; EURECOM

AI summary: Using multi-corpus training and cross-linguistic analysis, compares four self-supervised learning feature extractors paired with four back-end classifiers for spoofing detection, reveals a domain bias in the ASVspoof 5 dataset, and finds that fine-tuning with only 8 hours of target-language data improves detection robustness.

Abstract

Voice biometric systems face growing threats from spoofing attacks, yet the evaluation of detection models remains inconsistent across datasets. To investigate these unpredictable fluctuations, we conduct a comprehensive benchmark of four self-supervised learning feature extractors paired with four back-end classifiers. We compare the hierarchical local feature extraction of ResNet with the global sequence and relational modeling of attention and graph-based back-ends. Through multi-corpus training across three scenarios and six evaluation datasets, our empirical analysis yields two critical findings. First, we expose a domain bias within the ASVspoof 5 dataset, showing that naive data scaling actively degrades performance. Second, our cross-linguistic analysis reveals that fine-tuning with just 8 hours of target-language data enhances detection robustness. Together, these findings emphasize the critical need for domain-aware and language-specific adaptation in spoofing detection.


2606.08666 2026-06-09 cs.RO New submission

Language as a Sensor: Calibrated Spatial Belief Estimation in 3D Scenes from Natural Language

Aryan Naveen, Jason Xinyu Liu, Luca Carlone, Andreea Bobu

Affiliations: MIT Laboratory for Information & Decision Systems; MIT Computer Science & Artificial Intelligence Laboratory

AI summary: Proposes a Language Sensor Model (LSM) that converts natural-language descriptions into calibrated spatial distributions and fuses them into the VL-Map probabilistic framework for more accurate target localization.

Comments: 18 pages, 7 figures, 3 tables

Abstract

Robots deployed in human-centric environments routinely receive natural-language descriptions of spatial information ("I left my backpack on the table") that reference parts of the world beyond their perceptual field of view. Traditional metric-semantic mapping ignores this signal, while off-the-shelf multimodal models remain limited in 3D spatial reasoning and are not directly amenable to fusion with other sensor modalities. To convert language observations into a calibrated spatial distribution, we train a Language Sensor Model (LSM) that maps each utterance and its scene-graph context to a multimodal distribution, with mixture weights encoding referential ambiguity (e.g., "which table") and component covariances encoding spatial uncertainty (e.g., where "on the table" the target lies). We then introduce VL-Map (Vision-Language Metric-Semantic Mapping), a probabilistic framework that treats these language predictions as stochastic observations and fuses them with onboard perception within a unified belief map. On the VLA-3D benchmark as well as on a real-world mobile robot, LSM is the only language predictor whose covariance estimates remain within the calibrated regime; fused into VL-Map, it leads to more accurate predictions of the target object location (~70% more probability mass on the true target compared to the strongest foundation-model baseline).
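Editor's sketch (an illustration of the output representation, not the LSM itself): a multimodal spatial belief of the kind described above can be written as a Gaussian mixture, where the weights express which referent is meant and the covariances express where on that referent the object lies. The example assumes 2D positions and diagonal covariances for brevity; the two "table" components and all numbers are hypothetical.

```python
import math
from typing import List, Tuple

Component = Tuple[float, Tuple[float, float], Tuple[float, float]]


def gaussian_2d(point: Tuple[float, float], mean: Tuple[float, float],
                variances: Tuple[float, float]) -> float:
    """Density of an axis-aligned 2D Gaussian (diagonal covariance)."""
    (x, y), (mx, my), (vx, vy) = point, mean, variances
    exponent = -0.5 * ((x - mx) ** 2 / vx + (y - my) ** 2 / vy)
    return math.exp(exponent) / (2.0 * math.pi * math.sqrt(vx * vy))


def mixture_density(point: Tuple[float, float], components: List[Component]) -> float:
    # Each component is (weight, mean, variances); the weights sum to one.
    return sum(w * gaussian_2d(point, m, v) for w, m, v in components)


# "I left my backpack on the table" with two candidate tables in the scene graph.
belief = [
    (0.7, (1.0, 2.0), (0.04, 0.04)),  # table A: more likely referent, tight spread
    (0.3, (4.0, 0.5), (0.25, 0.09)),  # table B: less likely referent, wider spread
]
density_at_a = mixture_density((1.0, 2.0), belief)
density_at_b = mixture_density((4.0, 0.5), belief)
density_far = mixture_density((10.0, 10.0), belief)
```

Because the belief is an ordinary probability density, it can be treated as a stochastic observation and combined with other sensor likelihoods, which is the role the abstract assigns to VL-Map.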


2606.08658 2026-06-09 cs.AI cs.LO New submission

Extending Ontologies: From Dense Embeddings to Hybrid Quantum-Fuzzy Systems


Angjelin Hila

Affiliations * GitHub

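AI summary: Surveys methods for integrating ontologies with dense embedding algorithms and proposes neuro-quantum-fuzzy systems as a new knowledge-representation paradigm that supports both probabilistic and exact reasoning.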

2606.08657 2026-06-09 cs.RO cs.AI New submission

Latent Diffusion Policy: Shaping Latent Spaces for Diffusion-Based Robotic Manipulation


Zhexuan Zhou, Yichen Lai, Jinhao Zhang, Huizhe Li, Youmin Gong, Jie Mei

Affiliations * National University of Singapore; University of Science and Technology of China

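AI summary: Proposes LDP, a two-stage framework that absorbs scene understanding into a CVAE encoder and performs flow matching in a pre-concentrated latent space, simplifying learning and improving performance on multi-arm coordination tasks.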



Abstract

Diffusion-based visuomotor policies operating directly in raw action spaces conflate scene comprehension with trajectory generation within a single denoising process. The resulting velocity field must simultaneously encode scene information and generate precise trajectories, increasing learning complexity and limiting performance on tasks demanding precise temporal coordination across multiple arms. To simplify this joint learning problem, we introduce Latent Diffusion Policy (LDP), a two-stage framework performing flow matching in a deliberately shaped latent space. By absorbing scene understanding into an observation-conditioned CVAE encoder, LDP concentrates the conditional distribution of each observation. Consequently, the flow model avoids implicitly resolving scene-dependent structures; instead, it generates within a pre-concentrated distribution featuring a smoother velocity field, simplifying learning from limited demonstrations. Furthermore, to capture temporal dependencies among latent tokens, LDP trains with per-token diffusion forcing and employs staircase inference sampling to resolve the resulting distributional mismatch. We also propose reconstruction FID (rFID) as a lightweight proxy predicting downstream task success solely from latent space statistics. On coordination-intensive tasks from RoboTwin 2.0, LDP outperforms DP3 by a substantial margin and transfers effectively to real-world bimanual deployments.
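As a rough illustration of flow matching over latent tokens with a separate noise level per token, the sketch below builds regression targets under two assumptions that the abstract does not confirm: a linear interpolation path between noise and latent, and independently sampled per-token times as a stand-in for per-token diffusion forcing. The function names and tensor shapes are hypothetical, and random arrays replace the CVAE latents.

```python
import numpy as np

def flow_matching_targets(noise, latents, t):
    """Build the interpolated input and target velocity for a linear path.

    noise, latents: arrays of shape (batch, tokens, dim)
    t: per-token times in [0, 1], shape (batch, tokens)
    """
    t = t[..., None]                      # broadcast each token's time over its features
    z_t = (1.0 - t) * noise + t * latents
    velocity = latents - noise            # constant along a straight-line path
    return z_t, velocity

def flow_matching_loss(predicted_velocity, target_velocity):
    """Mean squared error between predicted and target velocities."""
    return float(np.mean((predicted_velocity - target_velocity) ** 2))

rng = np.random.default_rng(0)
batch, tokens, dim = 4, 6, 8
latents = rng.normal(size=(batch, tokens, dim))  # stand-in for CVAE latent tokens
noise = rng.normal(size=(batch, tokens, dim))
t = rng.uniform(size=(batch, tokens))            # a different noise level for every token

z_t, v_target = flow_matching_targets(noise, latents, t)
perfect_loss = flow_matching_loss(v_target, v_target)
zero_model_loss = flow_matching_loss(np.zeros_like(v_target), v_target)
```

In a real training loop a network would receive `z_t`, `t`, and the observation and be regressed onto `v_target`; the two loss values here only check that the objective vanishes for an exact prediction and is positive otherwise.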


2606.08656 2026-06-09 cs.CL New submission

From Player to Master: Enhancing Test-Time Learning of LLM Agents via Reinforcement Learning over Memory


Yishuo Cai, Xingyu Guo, Xuancheng Huang, Jinhua Du, Can Huang, Wenxuan Huang, Wenhan Ma, Yuyang Hu, Aohan Zeng, Jie Tang, Xu Sun

Affiliations * University of Science and Technology of China

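AI summary: Proposes MemoPilot, which trains the memory-update process with multi-turn GRPO to improve a frozen LLM agent's test-time learning, substantially raising Elo ratings on multi-round Rock-Paper-Scissors and Limit Texas Hold'em.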

Comments Accepted by ICML 2026



Abstract

Large language model (LLM) agents are increasingly deployed in long-running settings where improving through experience at test time becomes important. A common approach is to update an explicit memory after each interaction to guide future decisions. However, most existing methods rely on hand-designed prompting rules, making it difficult to align memory updates with downstream objectives over multi-step horizons consistently. We propose MemoPilot, a plug-in memory copilot that explicitly trains the memory update process to improve a frozen LLM's performance across sequential interactions. We formulate memory updating as a multi-turn decision problem and optimize it end-to-end with multi-turn GRPO. Our training recipe introduces (i) a turn-wise reward signal and (ii) a context-independent, turn-level advantage estimation across rollouts, enabling finer-grained credit assignment and more stable training in multi-turn settings. We evaluate MemoPilot on two testbeds: multi-round Rock-Paper-Scissors (RPS) and Limit Texas Hold'em (LHE). Across both environments, MemoPilot substantially improves test-time learning of a frozen player over strong baselines, ranking first in Elo ratings on both games (1762 on LHE and 1590 on RPS) and outperforming all baseline memory methods and proprietary models, including DeepSeek-V3.2.
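One plausible reading of turn-level advantage estimation across rollouts is the usual GRPO group normalization applied separately at each turn index. The sketch below implements that reading only; the abstract gives no formula, so the normalization, the `turn_level_advantages` name, and the toy reward matrix are assumptions.

```python
import numpy as np

def turn_level_advantages(rewards, eps=1e-8):
    """Normalize each turn's reward across a group of rollouts.

    rewards: array of shape (num_rollouts, num_turns), one reward per turn.
    Returns an array of the same shape. The baseline for a turn is the mean
    reward of that turn over the group, so it does not depend on the
    context of any individual rollout.
    """
    rewards = np.asarray(rewards, dtype=float)
    mean = rewards.mean(axis=0, keepdims=True)
    std = rewards.std(axis=0, keepdims=True)
    return (rewards - mean) / (std + eps)

# Toy group of four rollouts with three turns each (hypothetical rewards).
rewards = np.array([
    [1.0, 0.0, 1.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 1.0],
    [0.0, 1.0, 1.0],
])
advantages = turn_level_advantages(rewards)
```

A turn on which every rollout earns the same reward (the last column here) receives zero advantage, so it contributes no gradient signal, while turns with mixed outcomes are credited individually rather than through a single trajectory-level return.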
