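arXivDaily daily academic digest: synchronized with the full arXiv data set, with AI-generated summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and related fields.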


2606.16629 2026-06-16 cs.CL New submission

Islamic Large Language Models: From Knowledge Acquisition to Trustworthy and Hallucination-Resistant AI

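Mohammed Amine Mouhoub

Affiliations: Paris Dauphine University

AI summary: Surveys the field of Islamic large language models and argues that building trustworthy, hallucination-resistant Islamic AI requires combining Arabic NLP, retrieval-augmented generation, madhhab-aware (school-of-jurisprudence) reasoning, and expert evaluation.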

Abstract

Large language models (LLMs) are increasingly used for knowledge-intensive question answering, including religious and legal questions. Islamic knowledge is a particularly demanding setting: answers are expected to be grounded in authoritative sources, citations must be exact, Arabic varieties differ substantially from the language of classical sources, and legitimate jurisprudential disagreement must be represented rather than collapsed into a single answer. This survey reviews the emerging field of Islamic LLMs and trustworthy Islamic AI. We organize the literature around Arabic NLP and Arabic-centric LLMs, Islamic NLP resources, Qur'anic question answering, Islamic knowledge benchmarks, retrieval-augmented generation, Islamic legal reasoning, inheritance reasoning, hallucination evaluation, and trustworthiness. We argue that fluency in Arabic is not sufficient for Islamic AI. Reliable systems require curated sources, retrieval and verification modules, citation-aware generation, madhhab-aware reasoning, human expert evaluation, and benchmarks that measure not only answer accuracy but also faithfulness, source validity, and reasoning quality. The survey concludes with a research agenda for hallucination-resistant Islamic AI systems.


2606.16624 2026-06-16 cs.AI New submission

MR-GVNO: A Geometry-Aware Variational Physics-Informed Neural Operator for Mindlin-Reissner Plates on Irregular Domains

Siqi Wang, Daobo Sun, Yizheng Wang, Yilong Zhang, Yabin Jin, Xiaoying Zhuang, Timon Rabczuk

Affiliations: Institute of Computational Mechanics × AI & College of Intelligent Robotics and Advanced Manufacturing, Fudan University; School of Aerospace Engineering, Xiamen University; Department of Engineering Mechanics, Tsinghua University; Institute of Photonics, Department of Mathematics and Physics, Leibniz University; Institute of Structural Mechanics, Bauhaus-Universität Weimar

AI summary: Proposes MR-GVNO, a geometry-aware variational neural operator that represents irregular geometries with boundary point clouds, fuses multiple physical-field inputs through cross-attention, and is trained without labels using a variational physics-informed loss based on the discretized total potential energy, giving fast and accurate predictions for Mindlin-Reissner plate problems.

Abstract

Plate and shell structures are widely used in engineering, making rapid response prediction under varying geometries, materials, and loads highly desirable. However, conventional finite element methods require repeated modeling and solution, resulting in high computational costs. This study proposes a geometry-aware variational neural operator for Mindlin-Reissner plate problems, termed MR-GVNO. The method uses boundary point clouds to represent irregular geometries and employs separate encoders for spatially varying material fields, pressure loads, and scalar physical parameters. A cross-attention mechanism integrates these inputs with query point information to predict transverse deflections and rotations at arbitrary locations. MR-GVNO is trained without labeled solution data using a variational physics-informed loss derived from the discretized total potential energy. It directly processes irregular point clouds and allows different physical fields to be discretized independently, avoiding interpolation onto a common grid. Numerical experiments on single-hole, double-hole, and L-shaped plates demonstrate accurate response prediction under homogeneous and heterogeneous materials and uniform and random loads. The model also achieves millisecond-level full-field inference and favorable cross-geometry generalization.


2606.16621 2026-06-16 cs.RO New submission

Reinforcement Learning with Inner-loop Dynamics Estimator for Aerial Manipulation under Uncertainty

Shivansh Pratap Singh, Samaksh Ujjwal, Ishita Chaudhary, V R Vasudevan, Rishabh Dev Yadav, Spandan Roy

Affiliations: International Institute of Information Technology Hyderabad; University of Manchester

AI summary: Proposes a hierarchical control framework that combines a reinforcement-learning outer loop with an inner-loop dynamics estimator to enable direct task-driven control; in hardware experiments it reduces end-effector tracking error and improves task success rate.

Abstract

Aerial manipulators enable physical interaction in hard-to-reach environments; however, direct whole-body aerial manipulation under rapid arm motion, payload changes, and related unknown dynamic uncertainty remains largely unsolved. We present a hierarchical control framework that combines Reinforcement Learning (RL) with an inner-loop dynamics estimator to address this problem. The RL outer loop maps desired 6-degree-of-freedom (DoF) end-effector targets to coordinated whole-body commands, enabling direct task-driven control without relying on a fully accurate coupled dynamic model in the policy layer. An inner loop then tracks these commands while compensating for transient inertial shifts and uncertainty during execution via a dynamics estimator scheme without requiring system model knowledge. We validate the proposed approach on a custom quadrotor equipped with a 3-DoF manipulator through hardware experiments under varying payload conditions. Compared with RL+PID and RL+INDI+PID baselines, the proposed method reduces end-effector tracking error and improves task success rate across the tested hardware conditions. These results show that combining learned whole-body coordination with estimator-based low-level compensation improves the precision and robustness of aerial manipulation under changing operating conditions.


2606.16620 2026-06-16 cs.LG cs.AI New submission

Entropy-Gated Latent Recursion

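Soham Bhattacharjee, Dushyant Singh Chauhan, Salem Lahlou, Martin Takac, Nils Lukas

Affiliations: Mohamed bin Zayed University of Artificial Intelligence

AI summary: Proposes Entropy-Gated Latent Recursion (EGLR), which recursively re-applies a frozen model's top decoder layers at high-uncertainty tokens, creating a deterministic sampling axis orthogonal to temperature sampling that expands the inference-time scaling space and substantially improves performance on math reasoning tasks.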

Abstract

Inference-time scaling has become the dominant lever for improving language-model reasoning, but existing methods derive rollout diversity from a single source: stochastic token-level sampling. We argue that this single-axis sampling space is fundamentally limiting, and identify a second, fully deterministic and complementary axis: the layer span $L$ at which a frozen model's top decoder layers are recursively re-applied at high-uncertainty tokens. Different choices of $L$ produce distinct rollouts that solve different subsets of problems, with no stochasticity. We instantiate this axis through Entropy-Gated Latent Recursion (EGLR), a training-free decoding procedure that re-applies the top-$L$ layers for at most $K_{\max}$ iterations until the next-token distribution converges. Combined with $T$ temperature samples, EGLR turns a single-axis stochastic rollout pool into an $L\times T$ Cartesian sampling space at almost the same per-rollout cost. We characterize this space across $8$ instruction-tuned models and $6$ math reasoning benchmarks, and show that the $L$-axis is genuinely complementary to temperature: on MATH-500 with Qwen2.5-3B-Instruct, the joint $L\times T$ oracle reaches $91.6\%$, $+8.2$ percentage points beyond the temperature-only oracle ($83.4\%$) and $+10.4$ points beyond the layer-only oracle ($81.2\%$), confirming that the two axes capture genuinely complementary problems. The expanded rollout pool provides richer per-prompt candidates for any downstream procedure that consumes rollouts, including self-consistency, best-of-$N$ with verifiers, and group-relative RL training (GRPO), opening a new direction for inference-time scaling that does not rely on stochastic noise.
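
The oracle figures quoted in this abstract can be read as set coverage: a problem counts as solved if at least one rollout in the pool solves it, so the joint oracle is the union over the whole $L\times T$ grid (91.6 - 83.4 = 8.2 and 91.6 - 81.2 = 10.4 points). The following is a minimal Python sketch under that reading. The function names, the entropy threshold, the choice of which slice counts as "temperature-only" or "layer-only", and the toy data are all illustrative assumptions and are not taken from the paper.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def should_recurse(probs, threshold=1.0):
    """Hypothetical gate: recurse only at high-uncertainty tokens.
    The threshold value is an assumption, not a number from the paper."""
    return token_entropy(probs) > threshold

def oracle_accuracy(solved_by_rollout, num_problems):
    """Fraction of problems solved by at least one rollout in the pool."""
    solved = set().union(*solved_by_rollout) if solved_by_rollout else set()
    return len(solved) / num_problems

# Toy pool: keys are (layer span L, temperature sample index) pairs,
# values are the sets of problem ids that rollout solved.
pool = {
    (2, 0): {0, 1},
    (2, 1): {1, 2},
    (4, 0): {0, 3},
    (4, 1): {3, 4},
}
temperature_only = [s for (L, t), s in pool.items() if L == 2]  # vary T, fix L
layer_only = [s for (L, t), s in pool.items() if t == 0]        # vary L, fix T
joint = list(pool.values())                                     # full L x T grid
```

In this toy pool each single-axis slice covers three of five problems while the joint grid covers all five, which is the kind of complementarity the abstract reports.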


2606.16613 2026-06-16 cs.AI New submission

CoffeeBench: Benchmarking Long-Horizon LLM Agents in Heterogeneous Multi-Agent Economies

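Issa Sugiura, Daichi Hattori, Kazuo Araragi, Keita Ogawa, Shota Onose, Taro Makino, Teppei Usuki, Takashi Ishida

Affiliations: Sakana AI; KPMG AZSA LLC

AI summary: Proposes the CoffeeBench benchmark, which evaluates LLM agents on long-horizon tasks in a heterogeneous multi-agent economy over a 90-day simulation, and finds that higher-performing models communicate more actively, whereas Claude Haiku 4.5 exhibits an idle-drift failure mode.

Comments: 23 pages, 8 figures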

Abstract

As LLM agents become capable of increasingly long-horizon tasks, evaluating their performance in economic systems is becoming increasingly important. Unlike existing benchmarks that primarily evaluate a single agent interacting with a passive environment, economic systems are inherently multi-agent, requiring autonomous agents to communicate, negotiate, and transact while pursuing their own objectives over extended periods. We introduce CoffeeBench, a benchmark for evaluating LLM agents in a long-horizon multi-agent economy composed of heterogeneous firms. In CoffeeBench, two farmers, two roasters, and two retailers autonomously operate their businesses over a 90-day simulation, each seeking to maximize cumulative net income through communication and transactions while managing cash, inventory, and pricing. The evaluated model controls one coffee roaster, while the remaining firms are controlled by fixed reference agents. Across several recent open-weight and proprietary LLMs, all models outperform a passive baseline that takes no actions, with most achieving positive net income. Analysis of agent behavior reveals substantial differences in long-horizon economic interaction: higher-performing models communicate more actively with other firms, whereas Claude Haiku 4.5 exhibits an idle-drift failure mode, repeatedly choosing inaction despite producing coherent assessments and plans. We release our code and agent trajectories to support future research.


2606.16612 2026-06-16 cs.SD cs.LG cs.MM New submission

Beyond Artifacts: Towards Generalizable Synthetic Song Detection via Music-Intrinsic Features

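Yan Han, Zhibin Wen, Yuan Wang, Shuangrun Shao, Xiaobing Li, Yang Xu, Wei Li

Affiliations: Central Conservatory of Music; Southern University of Science and Technology; Fudan University

AI summary: Proposes the Sofia framework, which uses feature-specific experts and an adaptive Mixture-of-Experts module to exploit music-intrinsic features (vocal, audio-effect, and global structure) for synthetic song detection, improving F1 by 18.5 points on MUSIC8K-O with strong robustness.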

Abstract

The rapid advancement of AI music generators highlights the urgent need for reliable Synthetic Song Detection (SSD). Existing SSD methods often rely on low-level artifacts or fixed feature assumptions, struggling to capture generator-agnostic cues. To address this, we propose Sofia (Synthetic-song detection framework via music features), a flexible framework that models music-intrinsic attributes via feature-specific experts and an adaptive Mixture-of-Experts (MoE) module. By configuring Sofia with representative Vocal, Audio-effect, Global structure features, and their combinations, we present their individual and complementary contributions. To comprehensively evaluate our framework, we further construct MUSIC8K, a challenging benchmark featuring the latest emerging generators and realistic audio perturbations. Experiments show that Sofia learns generator-agnostic representations from music-intrinsic features, improving the F1 score by 18.5 points over the strongest baseline on MUSIC8K-O while maintaining strong robustness.


2606.16611 2026-06-16 cs.LG New submission

TCHG: Tri-Trust Conditioned Heterogeneous Graph Learning for Reliable Dynamic Trust Prediction

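Bohao Liao, Boyu Deng, Qipeng Song, Jieling Wang, Jingchao Wang

Affiliations: Xidian University; Tsinghua University

AI summary: Proposes the TCHG framework, which decomposes trust evidence into three channels (entity reliability, interaction-behavior reliability, and contextual trust) that respectively control message admission, propagation strength, and mode selection in graph propagation, and uses temporal states with non-uniform decay to handle multi-scale evolution, achieving reliable dynamic trust prediction.

Comments: 18 pages, 10 figures, 13 tables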

Abstract

Trust prediction infers latent user-user trust relations and provides important support for social recommendation, fake-review and manipulation detection, and risk identification. Graph neural networks have become a prominent approach to trust prediction because of their ability to learn network structures and complex trust dependencies. However, existing methods often rely on a unified representation of trust signals and do not disentangle heterogeneous trust evidence into separate evidence channels, failing to exploit the distinct roles that different evidence channels should play during trust modeling. To address this gap, this paper argues that trust evidence should not be treated as an undifferentiated input, but should be decomposed and used as functional control factors over graph propagation. We propose TCHG, a tri-trust conditioned heterogeneous graph learning framework that decomposes trust evidence into three channels and assigns them distinct functional roles in propagation: entity reliability governs message admission, interaction-behavior reliability modulates propagation strength, and contextual trust adjusts the propagation mode through context-conditioned operator selection. Since the three evidence channels evolve at different temporal scales, TCHG maintains independent temporal states with non-uniform decay rates to prevent rapidly changing contextual signals from overwriting slowly accumulated entity reliability. It further predicts trust probability and calibrates the output probability, improving predictive confidence under sparse or conflicting evidence. Extensive experiments on multiple public trust datasets show that TCHG achieves effective and reliable trust prediction compared with representative trust prediction and heterogeneous graph baselines.
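
The idea of independent temporal states with non-uniform decay rates can be illustrated with one exponential moving average per evidence channel. The sketch below is only an assumption about what such a state might look like: the abstract does not give the update rule, and the channel keys, the `DECAY` values, and the `TrustState` class are hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical per-channel retention factors: closer to 1.0 means the
# channel changes more slowly. These values are not from the paper.
DECAY = {"entity": 0.99, "interaction": 0.90, "context": 0.50}

@dataclass
class TrustState:
    """One scalar state per evidence channel, each with its own decay."""
    values: dict = field(default_factory=lambda: {k: 0.0 for k in DECAY})

    def update(self, evidence):
        """Blend new evidence into each channel at that channel's rate."""
        for channel, signal in evidence.items():
            keep = DECAY[channel]
            self.values[channel] = keep * self.values[channel] + (1.0 - keep) * signal
        return self.values

state = TrustState()
state.update({"entity": 1.0, "interaction": 1.0, "context": 1.0})
```

After a single identical signal on all three channels, the contextual state has moved furthest and the entity state least, so a burst of contextual evidence cannot overwrite slowly accumulated entity reliability.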


2606.16605 2026-06-16 cs.AI New submission

ARB4WM: An Adversarial Robustness Benchmark for World Models in Continuous Control

Junjian Zhang, Hao Tan, Ruonan Li, Dong Zhu, Aiping Li, Zhaoquan Gu

Affiliations: College of Computer Science, National University of Defense Technology; College of Computer Science and Technology, Harbin Institute of Technology; Department of New Networks, Peng Cheng Laboratory; National Key Laboratory of Advanced Communication Networks

AI summary: Proposes ARB4WM, a unified benchmark that evaluates the adversarial robustness of world models under visual perturbations at the policy, value, and latent-dynamics levels, and finds that multi-objective attacks and temporal exposure modes are essential to safety evaluation.

Comments: 24 pages, 10 figures, 5 tables. Source code available at https://github.com/zaoanguai/ARB4WM

Abstract

World models are widely used in robotic and agentic engineering control systems due to their ability to learn latent dynamics for planning and decision-making. As these systems are increasingly deployed in safety-critical settings, understanding their robustness under adversarial conditions has become essential. However, existing evaluations lack a unified benchmark for testing adversarial threats across the policy, value, and latent-dynamics levels of world-model agents. To fill this gap, we present ARB4WM, a unified evaluation framework for pre-deployment robustness and risk assessment of world-model agents under visual perturbations. ARB4WM defines five white-box loss objectives across these three levels and studies their effects when combined with single-step or multi-step perturbation strategies and temporal attack modes, including full-frame, half-sequence, and sparse-frame exposure. Specifically, we evaluate four Dreamer-style agents across 20 tasks from MetaWorld and the DeepMind Control Suite under different loss objectives, perturbation strategies, and temporal attack modes. Results show that attacks targeting value estimation, latent representations, and RSSM dynamics can be as damaging as direct policy disruption, and that early or frequent perturbations are especially harmful, while input-level defenses provide limited recovery under adaptive attacks. These findings suggest that safety, risk, and reliability assessment for world models should cover multiple component-oriented attack objectives and temporal exposure protocols rather than relying solely on action-space robustness. Source code is available at https://github.com/zaoanguai/ARB4WM.


2606.16603 2026-06-16 cs.CL cs.AI New submission

VeriGraph: Towards Verifiable Data-Analytic Agents

Jiajie Jin, Zhao Yang, Wenle Liao, Yuyang Hu, Guanting Dong, Xiaoxi Li, Yutao Zhu, Zhicheng Dou

Affiliations: Gaoling School of Artificial Intelligence, Renmin University of China

AI summary: Proposes the VeriGraph framework, which makes data-analytic agents verifiable by constructing an explicit heterogeneous evidence directed acyclic graph (DAG), and designs a graph-based policy optimization to improve correctness and auditability.

Comments: 10 pages

Abstract

LLM-based agents have demonstrated strong capabilities in data-intensive analytical tasks, yet their outputs are rarely verifiable: a reliance on linear text trajectories makes their reasoning difficult to audit. In particular, deterministic computations over raw data and semantic deductions over natural-language claims are often entangled in an unstructured stream, leaving numerical conclusions hard to reproduce and qualitative judgments hard to inspect. To address this, we propose VeriGraph, a traceable neuro-symbolic reasoning framework that enables agents to construct an explicit heterogeneous evidence directed acyclic graph (DAG) during execution. VeriGraph introduces three evidence-expansion primitives, namely computational, grounding, and derivational expansion, to connect raw data, interpreter variables, computed results, and natural-language claims in a unified graph. Under this formulation, structural traceability is reduced to graph reachability from raw data sources to terminal claims, while semantic support is measured by claim-level evidence evaluation. To improve graph construction, we further design a graph-based policy optimization strategy with a composite reward that jointly supervises answer correctness, computational integrity, and derivational coherence. Experiments on four benchmarks show that VeriGraph-8B achieves the highest overall score among all baselines. More importantly, VeriGraph produces auditable evidence graphs with substantially stronger claim grounding, achieving an 87.61% Grounding Rate under our claim-level evidence support evaluation. These results suggest that explicit evidence-graph construction is a promising path toward verifiable data-analytic agents. Our code is available at https://github.com/ignorejjj/VeriGraph.
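
The statement that structural traceability reduces to graph reachability can be made concrete with a plain breadth-first search. This is a generic sketch and not VeriGraph's implementation: the adjacency-dict representation, the function names, and the toy node labels are assumptions chosen for illustration.

```python
from collections import deque

def reachable_from(edges, sources):
    """Return every node reachable from the given source nodes (BFS)."""
    seen = set(sources)
    queue = deque(sources)
    while queue:
        node = queue.popleft()
        for child in edges.get(node, ()):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen

def untraceable_claims(edges, raw_sources, claims):
    """Claims that have no path from any raw data source."""
    reached = reachable_from(edges, raw_sources)
    return [c for c in claims if c not in reached]

# Toy evidence graph: raw data -> interpreter variable -> result -> claim.
# All node names are illustrative only.
edges = {
    "sales.csv": ["df"],
    "df": ["mean_revenue"],
    "mean_revenue": ["claim: revenue rose"],
}
claims = ["claim: revenue rose", "claim: churn fell"]
missing = untraceable_claims(edges, ["sales.csv"], claims)
```

Here the second claim has no incoming path from the raw data, so it is flagged as untraceable; checking semantic support for the traceable claims is a separate step that this sketch does not cover.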


2606.16602 2026-06-16 cs.LG cs.NA math.NA physics.comp-ph New submission

PhysGuard: Fisher-Guided Gradient Projection for Sim-to-Real Neural PDE Surrogates

Changjian Zhou, Junfeng Fang, Negin Yousefpour, Peng Wu, Bin Yan, Guillermo A Narsilio

Affiliations: Faculty of Engineering and IT, University of Melbourne; School of Computing, National University of Singapore; Artificial Intelligence Research Institute, IFLYTEK Co., Ltd.

AI summary: Addresses the accuracy drop when neural operator models are transferred from simulation to real data by proposing the PhysGuard framework, which uses the Fisher Information Matrix of the simulation data to protect physics-critical parameters and restrict fine-tuning update directions, reducing low-frequency error by up to 32% under severe domain shift.

Abstract

Neural operator models trained on simulation data often lose accuracy when applied to experimental measurements due to the sim-to-real gap. Standard fine-tuning with limited real data can reduce this gap, but it may also damage the core physics-relevant representations learned during pretraining. Although knowledge-preserving adaptation has been widely investigated in vision or language tasks, it remains unclear whether these methods are suitable for neural operators whose architectures and protected knowledge are fundamentally different. Neural operators need to preserve core-scale physical structures rather than semantic or visual features. We propose PhysGuard, a physics-preserving framework for accurate sim-to-real adaptation of neural operators. Specifically, PhysGuard uses the empirical Fisher Information Matrix computed on simulation data to identify physics-critical parameter directions, then restricts fine-tuning updates to directions that do not interfere with them. A layer-wise Gram-matrix formulation makes this efficient for models with millions of parameters, while an adaptive threshold automatically determines the protected subspace size. A spectral probe experiment shows that the dominant Fisher directions are strongly associated with low-frequency output structures. Benchmark experiments across four neural operator architectures and different physical systems show that PhysGuard performs strongly on most evaluation metrics compared to baselines. The benefits are most evident under severe domain shift, where it reduces low-frequency error by up to 32% compared to standard fine-tuning while maintaining adaptability. Our code is available at https://github.com/ZhouChaunge/PhysGuard.
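
Restricting updates to directions that do not interfere with a protected subspace is, in its simplest form, an orthogonal projection of the gradient. The sketch below shows only that projection step in plain Python. It assumes the protected directions are already available and orthonormal; how PhysGuard derives them from the empirical Fisher matrix, its layer-wise Gram-matrix formulation, and its adaptive threshold are not reproduced here, and the function names are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def project_out(gradient, protected_directions):
    """Remove the components of `gradient` along each protected direction.

    Assumes `protected_directions` is an orthonormal set, so each
    coefficient can be computed from the original gradient.
    """
    result = list(gradient)
    for direction in protected_directions:
        coeff = dot(gradient, direction)
        result = [r - coeff * d for r, d in zip(result, direction)]
    return result

# Toy example: protect the first coordinate axis of a 3-parameter model.
protected = [[1.0, 0.0, 0.0]]
gradient = [0.5, -2.0, 3.0]
update = project_out(gradient, protected)
```

The projected update has no component along the protected direction, so a gradient step along it leaves that direction of parameter space unchanged to first order.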


2606.16601 2026-06-16 cs.CV New submission

DifferAD-R1: A Difference-Guided Industrial Anomaly Localization with Multimodal Large Language Models

Dingrong Wang, Xian Tao, Zhen Qu, Hengliang Luo, Xinyi Gong, Fei Shen, Zhengtao Zhang, Guiguang Ding

Affiliations: Institute of Automation, Chinese Academy of Sciences (CAS); School of Artificial Intelligence, University of Chinese Academy of Sciences; CASI Vision Technology Co., Ltd.; Shandong Laboratory of Aluminum Advanced Manufacturing in Binzhou (SLAAMB), Binzhou Institute of Technology, Weiqiao-UCAS Science and Technology Park; Space Information Research Institute, Hangzhou Dianzi University; School of Software, Tsinghua University

AI summary: Proposes the DifferAD-R1 framework, which uses a difference-guided dual-image paradigm to recast anomaly localization as a one-shot difference grounding problem and adds a dual-consistency localization reward and a difficulty-aware strategy, outperforming existing methods on the AD-DualDiff dataset.

Comments: Submitted to IEEE Transactions on Circuits and Systems for Video Technology

Abstract

Industrial anomaly localization aims to accurately identify and localize abnormal regions in industrial products, addressing the critical challenge of detecting unseen defect categories in real-world scenarios. Traditional closed-set methods often suffer from poor cross-scenario generalization, while existing Multimodal Large Language Model (MLLM)-based approaches face two core limitations: they either adopt QA-style paradigms misaligned with the practical demands of localization, or rely on standard optimization techniques such as Group Relative Policy Optimization (GRPO), which fails to deliver effective learning signals for subtle defects. To tackle these issues, this paper proposes DifferAD-R1, an MLLM-augmented reinforcement learning framework tailored for industrial anomaly localization. We design a Difference-Guided dual-image paradigm, which reformulates the localization task as a one-shot difference grounding problem to effectively explore cross-scenario anomalies. A Dual-Consistency Localization Reward is developed for hard-to-detect anomalies, enhancing optimization stability and robustness. Additionally, we integrate a difficulty-aware strategy with adaptive reweighting and group-wise resampling to prioritize learning on challenging instances. To facilitate evaluations in real-world industrial settings, we construct the AD-DualDiff dataset, comprising 13K paired images across 20 categories. Experimental results demonstrate that DifferAD-R1 significantly outperforms existing baselines and achieves competitive performance compared to large-scale models like Qwen3-VL (235B parameters). Our code is publicly available at: https://github.com/Rong2026/work-1.

2606.16600 2026-06-16 cs.RO 新提交

WaveSync: Constrained Wavefront Optimization for Synchronized Co-Speech Gestures in Humanoid Robots

WaveSync: 面向人形机器人同步共语手势的约束波前优化

Thang Tran Viet, Thanh Nguyen Canh, Gia Huy Uong, Phuc Van Dinh, Tan Viet Tuyen Nguyen, Xiem HoangVan, Nak Young Chong

发表机构 * University of Engineering and Technology, Vietnam National University（越南国立大学工程与技术大学）； School of Information Science, Japan Advanced Institute of Science and Technology（日本先端科学技术大学院大学信息科学学院）； School of Electronics and Computer Science, University of Southampton（南安普顿大学电子与计算机科学学院）； Department of Robotics, Hanyang University（汉阳大学机器人学系）

AI总结提出WaveSync框架，利用大语言模型分解语义并构建重要性波，通过动态运动基元生成手势轨迹，再经波前优化实现手势与语音的峰值同步，同时满足运动学约束，在五组对话场景中优于基线方法。

AI中文摘要

富有表现力的共语手势对于自然的人机交互至关重要，但在物理人形机器人上生成这些手势非常困难，因为手势动作必须与语音重点对齐，同时满足严格的运动学和动力学约束。与虚拟化身不同，人形机器人无法自由执行快速或重叠的运动，这使得单词级别的同步和硬件安全的运动规划成为一个耦合问题。我们提出了WaveSync，一个混合框架，其中大语言模型将对话响应分解为结构化的语义模式，并为每个单词分配重要性权重，构建连续的语义重要性波。手势轨迹通过动态运动基元进行塑造，在增强表现力的同时确保运动学可行性。波前优化阶段实现手势与语音的峰值到峰值同步，并通过手势持续时间压缩和前向传播解决剩余的运动学违规。基于五个对话场景的实验评估表明，我们的方法实现了高同步精度，并在客观和主观评估中优于三个基线。WaveSync中的每个组件在生成富有表现力、语义基础且符合运动学要求的手势中都发挥了必要作用。代码、资源和视频可在 https://github.com/pairs-lab/WaveSync 获取。

英文摘要

Expressive co-speech gestures are crucial for natural human-robot interaction, but generating them on physical humanoid robots is difficult because gesture strokes must align with speech emphasis while satisfying strict kinematic and dynamic constraints. Unlike virtual avatars, humanoid robots cannot freely execute rapid or overlapping motions, making word-level synchronization and hardware-safe motion planning a coupled problem. We present WaveSync, a hybrid framework in which a Large Language Model decomposes dialogue responses into structured semantic schemas and assigns per-word importance weights, constructing a continuous Semantic Importance Wave. Gesture trajectories are shaped through Dynamic Movement Primitives, enforcing kinematic feasibility while enhancing expressiveness. A Wavefront Optimization stage aligns peak-to-peak gesture-speech synchronization and resolves residual kinematic violations through gesture-duration compression and forward propagation. Experimental evaluation based on five dialogue scenarios shows that our method achieves high synchronization accuracy and outperforms three baselines in both objective and subjective evaluations. Each component in WaveSync plays a necessary role in producing gestures that are expressive, semantically grounded, and kinematically compliant. The code, resources, and videos are available at https://github.com/pairs-lab/WaveSync.
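As a rough illustration of the Semantic Importance Wave idea, the sketch below places a Gaussian bump at each word's onset time, scaled by its importance weight, and takes the time of the largest peak as a target for the gesture stroke. The Gaussian kernel, the `width` parameter, the sampling step, and all function names are assumptions made for illustration; the abstract does not specify how the wave is constructed.

```python
import math

def importance_wave(words, t, width=0.15):
    """Sum of Gaussian bumps centered at word times, weighted by importance.

    `words` is a list of (time_sec, weight) pairs; the kernel and width are
    illustrative assumptions, not taken from the paper.
    """
    return sum(w * math.exp(-0.5 * ((t - tw) / width) ** 2) for tw, w in words)

def peak_time(words, duration, step=0.01):
    """Return the sampled time at which the wave is largest."""
    n = int(round(duration / step)) + 1
    times = [i * step for i in range(n)]
    return max(times, key=lambda t: importance_wave(words, t))

# Hypothetical utterance: (word onset time in seconds, importance weight).
utterance = [(0.2, 0.1), (0.6, 0.9), (1.0, 0.3), (1.4, 0.2)]
stroke_target = peak_time(utterance, duration=1.6)
```

In this toy utterance the peak lands near the heavily weighted second word, which is where a synchronized gesture stroke would be aimed.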

2606.16596 2026-06-16 cs.CL 新提交

How Far Can Machine Translation Quality Take You? Extrinsic Discourse Evaluation in Goal-Oriented Setups

机器翻译质量能带你走多远？目标导向设置中的外在话语评估

Wafaa Mohammed, Kata Naszadi, Vlad Niculae

发表机构 * Language Technology Lab, University of Amsterdam（语言技术实验室，阿姆斯特丹大学）

AI总结研究机器翻译在静态和交互式目标导向任务中的外在话语评估，发现高内在翻译质量不能保证下游话语成功，且强系统仍存在指代不一致问题。

2606.16595 2026-06-16 cs.SD cs.AI 新提交

ArtNet: A JEPA-Like Articulatory Predictive Framework for Robust Zero-Shot Phoneme Recognition

ArtNet：一种类似JEPA的发音预测框架，用于鲁棒的零样本音素识别

Zeqian Hu, Fuliang Weng, Shu Shang, Yaqian Zhou

发表机构 * Fudan University（复旦大学）； Pedawise

AI总结提出ArtNet框架，通过基于发音特征的结构化预测任务和变分信息瓶颈抑制语言特定变化，在零样本跨语言音素识别中实现20.56%的音素错误率降低。

Comments Accepted at Interspeech 2026

2606.16593 2026-06-16 cs.CV 新提交

Rotational Symmetry based Object Pose Estimation from Point Clouds in the Absence of Known 3D Models

基于旋转对称性的无已知3D模型点云物体姿态估计

Weichen Dai, Ruixun Yu, Yangjie Tang, Yifan Du, Yiyang Zhang, Donglei Sun, Hua Zhang

发表机构 * Key Laboratory of Brain Machine Collaborative Intelligence of Zhejiang Province, School of Computer Science, Hangzhou Dianzi University（浙江省脑机协同智能重点实验室，杭州电子科技大学计算机学院）； Advanced Intelligent Manufacturing Research Group, the University of Nottingham Ningbo China（先进智能制造研究组，宁波诺丁汉大学）

AI总结提出利用工业物体的旋转对称性，通过迭代优化联合估计姿态与点云，无需已知3D模型，在合成和真实数据集上达到与有模型方法相当的性能。

AI中文摘要

物体姿态估计对许多工业应用至关重要，例如使用机器人进行自动喷漆。然而，保密性问题常常限制了对高质量3D模型的访问，给基于点云的姿态估计带来了重大挑战。在这种情况下，旋转对称性——许多工业物体易于获取的特征——可以提供有价值的先验信息以促进姿态估计。在本文中，我们提出了一种方法，利用工业物体中常见的旋转对称性来解决缺乏3D模型带来的挑战。通过迭代优化过程，物体姿态与点云细化联合估计。该优化依赖于旋转对称性约束损失。为了构建这一损失，每个3D点根据当前估计的姿态旋转，并利用旋转对称性通过最近邻搜索识别多个对应点。然后使用这些对应点计算旋转对称性约束损失，迭代地细化姿态和点云。通过将旋转对称性显式地纳入优化过程，所提出的方法实现了鲁棒的姿态估计，并在不同物体类型上具有良好的泛化能力。该方法在一个专门为无已知3D模型的点云创建的数据集上进行了评估，该数据集包含四类合成物体和一个从生产线收集的真实轮毂。实验结果表明，所提出的方法实现了与依赖已知3D模型的方法相当的性能。

英文摘要

Object pose estimation is crucial to many industrial applications, with one example being automated spray painting using a robot. However, confidentiality concerns often limit access to high-quality 3D models, posing a significant challenge for point-cloud-based pose estimation. In such scenarios, rotational symmetry, a readily accessible characteristic of many industrial objects, can provide valuable prior information to facilitate pose estimation. In this paper, we propose a method that leverages the rotational symmetry commonly found in industrial objects to address the challenge caused by the absence of 3D models. The object pose is jointly estimated with point cloud refinement through an iterative optimization process. This optimization relies on a rotational symmetry constraint loss. To construct this loss, each 3D point is rotated according to the currently estimated pose, and multiple correspondences are identified using nearest-neighbor search by exploiting the rotational symmetry property. These correspondences are then used to compute the rotational symmetry constraint loss, which iteratively refines both the pose and the point cloud. By explicitly incorporating rotational symmetry into the optimization process, the proposed method achieves robust pose estimation and generalizes well across diverse object types. The proposed method is evaluated on a dataset specifically created for point clouds without known 3D models, consisting of four categories of synthetic objects and one real wheel hub collected from a production line. Experimental results demonstrate that the proposed method achieves performance comparable to methods that rely on known 3D models.
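The rotational symmetry constraint loss can be sketched as follows: rotate every point by each symmetry angle about the estimated axis, find its nearest neighbor in the original cloud, and average the squared distances. For brevity this sketch assumes the axis is already aligned with z and passes through the origin, uses brute-force nearest-neighbor search, and omits the joint pose and point refinement. The function names and the order parameter `n_fold` are illustrative, not the authors' implementation.

```python
import math

def rotate_z(p, angle):
    """Rotate a 3D point about the z-axis by `angle` radians."""
    x, y, z = p
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y, z)

def nearest_sq_dist(q, cloud):
    """Squared distance from q to its nearest neighbor (brute force)."""
    return min(sum((a - b) ** 2 for a, b in zip(q, p)) for p in cloud)

def symmetry_loss(cloud, n_fold):
    """Mean squared nearest-neighbor distance over all symmetry rotations.

    Expects n_fold >= 2 (an n-fold symmetric object maps onto itself
    under rotations by 2*pi*k/n_fold).
    """
    total, count = 0.0, 0
    for k in range(1, n_fold):
        angle = 2.0 * math.pi * k / n_fold
        for p in cloud:
            total += nearest_sq_dist(rotate_z(p, angle), cloud)
            count += 1
    return total / count

# Corners of a square in the z = 0 plane have 4-fold symmetry about z.
square = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (-1.0, 0.0, 0.0), (0.0, -1.0, 0.0)]
loss_correct = symmetry_loss(square, n_fold=4)   # close to zero
loss_wrong = symmetry_loss(square, n_fold=3)     # clearly positive
```

A pose estimate that misplaces the axis would raise this loss in the same way the wrong symmetry order does here, which is what makes it usable as an optimization target.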

2606.16586 2026-06-16 cs.CV 新提交

LOCUS: Local Visual Cue Search for Enhancing Fine-Grained Perception in Multimodal Large Language Models

LOCUS: 局部视觉线索搜索增强多模态大语言模型的细粒度感知

Zhou Tao, Fang Zhang, Zewen Ding, Shida Wang, Xiaokun Sun, YongXiang Hua, Haoyu Cao, Linli Xu

发表机构 * University of Science and Technology of China（中国科学技术大学）； State Key Laboratory of Cognitive Intelligence（认知智能国家重点实验室）

AI总结提出LOCUS训练框架，通过可验证的局部线索搜索代理任务，使MLLM内化细粒度证据选择，提升定位敏感视觉理解而不改变推理接口。

AI中文摘要

多模态大语言模型（MLLMs）在细粒度视觉感知上仍然不可靠，即使高分辨率输入保留了必要的局部细节。我们将这一限制识别为视觉上下文腐烂：决定性证据可能存在于完整图像中，但在冗余视觉上下文中无法被可靠地选择和利用。我们提出LOCUS（局部视觉线索搜索），一个训练框架，通过可验证的代理任务教会MLLMs内化局部证据搜索。在训练期间，LOCUS提供一个局部裁剪作为视觉线索，并使用基于IoU的奖励优化模型以恢复其在完整图像中的空间支持。视觉线索仅在训练期间使用，保持标准的图像-问题推理接口不变。在细粒度感知、幻觉、一般理解和推理基准上的实验表明，LOCUS改善了定位敏感的视觉理解，同时保留了广泛的能力。注意力分析进一步表明对任务相关证据区域的更强关注，表明训练时的视觉线索搜索为内化的细粒度证据选择提供了有效途径。

英文摘要

Multimodal Large Language Models (MLLMs) remain unreliable on fine-grained visual perception, even when high-resolution inputs preserve the necessary local details. We identify this limitation as visual context rot: decisive evidence may exist in the full image, yet fail to be reliably selected and used amid redundant visual context. We propose LOCUS (LOcal visual CUe Search), a training framework that teaches MLLMs to internalize local evidence search through a verifiable proxy task. During training, LOCUS provides a local crop as a visual cue and optimizes the model to recover its spatial support in the full image using an IoU-based reward. The visual cue is used only during training, leaving the standard image-question inference interface unchanged. Experiments across fine-grained perception, hallucination, general understanding, and reasoning benchmarks show that LOCUS improves localization-sensitive visual understanding while preserving broad capabilities. Attention analyses further indicate stronger focus on task-relevant evidence regions, suggesting that training-time visual cue search provides an effective route to internalized fine-grained evidence selection.
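The IoU-based reward mentioned above can be sketched with axis-aligned boxes: the model predicts where the cue crop came from in the full image, and the reward is the overlap with the crop's true location. Whether LOCUS uses the raw IoU, a thresholded variant, or additional terms is not stated in the abstract, so `cue_search_reward` and the box format `(x1, y1, x2, y2)` are assumptions.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def cue_search_reward(predicted_box, crop_box):
    """Reward the model for locating where the cue crop came from."""
    return iou(predicted_box, crop_box)

crop = (10, 10, 30, 30)
reward_exact = cue_search_reward((10, 10, 30, 30), crop)
reward_shifted = cue_search_reward((20, 10, 40, 30), crop)
reward_miss = cue_search_reward((50, 50, 60, 60), crop)
```

Because the crop's location is known by construction, such a reward is checkable without human labels, which is what makes the proxy task verifiable.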

2606.16583 2026-06-16 cs.CL 新提交

Uncertainty Is Not a Safety Net for Clinical VQA, but Can It Anticipate Model Failure?

不确定性并非临床VQA的安全网，但它能预测模型失败吗？

Arnisa Fazla, Alberto Testoni, Ameen Abu-Hanna, Barbara Plank, Iacer Calixto

发表机构 * Amsterdam University Medical Center, University of Amsterdam（阿姆斯特丹大学医学中心）； Amsterdam Public Health（阿姆斯特丹公共卫生）； LMU Munich（慕尼黑大学）； Munich Center for Machine Learning (MCML)（慕尼黑机器学习中心）

AI总结研究临床视觉语言模型的不确定性估计是否可靠，发现其质量随模型准确率变化，在模型脆弱时失效，但能预测扰动下的性能崩溃。

Comments 17 pages, 4 figures

AI中文摘要

临床视觉语言模型（VLM）的安全部署需要可靠的不确定性估计（UE）：一个指示何时应信任预测、何时应转交临床医生的信号。我们测试了当前UE方法是否真正提供这一信号。在12个VLM上对8种方法进行临床视觉问答（VQA）基准测试，我们发现UE质量并非UE方法的内在属性：它随模型准确率而变化，恰恰在模型性能最弱、因而最需要可靠性的地方退化。当我们通过隐藏多项选择答案中的正确选项（NOTA扰动）对模型进行压力测试时，准确率崩溃，而不确定性几乎不变，使模型出现系统性的校准失误。然而，我们发现未扰动输入上的不确定性可靠地预测了哪些预测会在NOTA下崩溃，表明当前VLM中的UE携带关于模型脆弱性的诊断信息。我们的结果将UE定位为识别脆弱预测的诊断工具，并激励基于扰动的评估作为通向安全临床部署的途径。

英文摘要

Safe deployment of clinical vision-language models (VLMs) requires reliable uncertainty estimation (UE): a signal indicating when predictions should be trusted or escalated to a clinician. We test whether current UE methods actually deliver this signal. Benchmarking 8 methods across 12 VLMs on clinical visual question-answering (VQA), we find that UE quality is not an intrinsic property of the UE method: it tracks model accuracy, degrading precisely where the model performance is weakest, and therefore where reliability is most needed. When we stress-test models by hiding the correct option among the multiple-choice answers (NOTA perturbations), accuracy collapses while uncertainty barely changes, leaving models systematically miscalibrated. Yet, we find that uncertainty on the unperturbed input reliably anticipates which predictions will collapse under NOTA, indicating that UE in current VLMs carries diagnostic information about model fragility. Our results position UE as a diagnostic tool for identifying fragile predictions and motivate perturbation-based evaluation as a path toward safe clinical deployment.
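One way to quantify whether uncertainty on the unperturbed input anticipates failure under perturbation is a rank-based AUROC: the probability that a prediction which later collapses received higher uncertainty than one that survives. The sketch below is a generic pairwise AUROC on made-up numbers; the variable names and values are hypothetical, and the abstract does not say which metric the authors use.

```python
def auroc(scores, failed):
    """Probability that a failed item has a higher score than a surviving one.

    Ties count as half. `scores` are uncertainty values and `failed` are
    booleans marking predictions that broke under perturbation.
    """
    pos = [s for s, f in zip(scores, failed) if f]
    neg = [s for s, f in zip(scores, failed) if not f]
    if not pos or not neg:
        raise ValueError("need at least one failed and one surviving item")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical uncertainties on clean inputs and collapse flags under NOTA.
uncertainty = [0.9, 0.8, 0.3, 0.2, 0.6]
collapsed = [True, True, False, False, False]
score = auroc(uncertainty, collapsed)
```

A value near 0.5 would mean the uncertainty carries no information about fragility, while values near 1.0 would match the diagnostic use the abstract argues for.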

2606.16580 2026-06-16 cs.LG cs.CV 新提交

Multi-Modal Spatio-Temporal Graph Neural Network with Mixture of Experts for Soil Organic Carbon Prediction

基于专家混合的多模态时空图神经网络用于土壤有机碳预测

Daniele Mos, Felipe Drummond, Anton Bossenbroek, Soufiane el Khinifri

发表机构 * Spatialise B.V.

AI总结提出SpTGNN，一种多模态时空图神经网络，通过异构图注意力、微调基础模型特征提取和稀疏专家混合融合，结合异方差回归与深度集成的不确定性量化，在三个区域数据集上优于XGBoost基线。

Comments Paper is 27 pages, 14 figures, 12 tables

AI中文摘要

表层土壤有机碳（SOC）预测是农业可持续性、土地利用政策和施肥规划的基础。现有方法面临两个限制：它们将手工制作的协变量与经典机器学习或单模态深度模型配对，忽略了丰富的光谱和时间信息，而基于网格的架构忽略了田间测量的不规则空间结构。我们提出了SpTGNN，一种多模态时空图神经网络来解决这两个问题。SpTGNN将土壤测量表示为具有三种边类型（空间邻近性、光谱相似性、高程）的异构图中的节点，并应用关系图注意力来学习每种关系的独立模式。一个微调的TerraMind编码器从Sentinel-2、Sentinel-1和DEM信号中提取节点特征，并结合每个样本的环境协变量以及学习到的位置和时间嵌入。一个稀疏专家混合模块通过top-$k$路由融合四个流。通过配对异方差回归（偶然不确定性）和深度集成（认知不确定性）来捕获不确定性，并使用Moran's $I$惩罚项正则化空间自相关。我们在一个全球SOC语料库上进行评估，该语料库分为三个区域实例（全球约49k样本，非洲约26k，欧洲约14k）。我们的5成员深度集成在非洲测试集上报告$R^2=0.762$，RMSE $=3.51\pm0.48$ g/kg和MAPE $=22.9\%$，优于表格XGBoost基线；最佳单个检查点达到验证$R^2=0.864$。消融实验证实异构图、MoE融合和微调主干各自贡献显著，集成不确定性量化栈实现后校准ECE为$0.031$（混合）和$0.026$（$\beta$-NLL）。据我们所知，这是第一个统一基础模型特征提取、异构图注意力和分解不确定性量化的SOC估计框架。

英文摘要

Top-soil organic carbon (SOC) prediction is fundamental to agricultural sustainability, land use policy and fertilization planning. Existing approaches face two limitations: they pair hand-crafted covariates with classical ML or single-modal deep models that miss rich spectral and temporal information, and grid-based architectures ignore the irregular spatial structure of field measurements. We introduce SpTGNN, a multi-modal spatio-temporal graph neural network addressing both. SpTGNN represents soil measurements as nodes in a heterogeneous graph with three edge types (spatial proximity, spectral similarity, elevation), and applies relational graph attention to learn separate patterns per relation. A fine-tuned TerraMind encoder extracts node features from Sentinel-2, Sentinel-1 and DEM signals, combined with per-sample environmental covariates and learned positional and temporal embeddings. A sparse Mixture-of-Experts module fuses the four streams via top-$k$ routing. Uncertainty is captured by pairing heteroscedastic regression (aleatoric) with deep ensembles (epistemic), and a Moran's $I$ penalty regularizes spatial autocorrelation. We evaluate on a global SOC corpus split into three regional instances ($\sim$49k samples globally, Africa $\sim$26k, Europe $\sim$14k). Our 5-member deep ensemble reports $R^2=0.762$, RMSE $=3.51\pm0.48$ g/kg and MAPE $=22.9\%$ on the Africa test split, improving over a tabular XGBoost baseline; the best single checkpoint reaches validation $R^2=0.864$. Ablations confirm the heterogeneous graph, MoE fusion and fine-tuned backbone each contribute substantively, and the ensemble UQ stack achieves post-calibration ECE of $0.031$ (hybrid) and $0.026$ ($\beta$-NLL). To our knowledge, this is the first framework to unify foundation-model feature extraction, heterogeneous graph attention and decomposed uncertainty quantification for SOC estimation.
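Moran's $I$, which the abstract uses as a penalty term, measures spatial autocorrelation of a quantity over a weighted neighbor graph. A minimal sketch of the standard statistic follows. How SpTGNN turns it into a training penalty (for example, by applying it to prediction residuals) is not stated in the abstract, and the chain graph and values below are made up.

```python
def morans_i(values, weights):
    """Standard Moran's I for values x_i and a spatial weight matrix w_ij.

    I = (N / W) * sum_ij w_ij (x_i - m)(x_j - m) / sum_i (x_i - m)^2,
    where m is the mean of x and W is the sum of all weights.
    """
    n = len(values)
    m = sum(values) / n
    dev = [v - m for v in values]
    w_total = sum(sum(row) for row in weights)
    num = sum(weights[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n))
    den = sum(d * d for d in dev)
    return (n / w_total) * num / den

# Four sites on a line; neighbors share a symmetric unit weight.
chain = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
clustered = morans_i([0.0, 0.0, 1.0, 1.0], chain)    # positive autocorrelation
alternating = morans_i([0.0, 1.0, 0.0, 1.0], chain)  # negative autocorrelation
```

Positive values indicate that neighboring sites carry similar values, so a penalty built on this statistic can discourage spatially clustered errors.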

2606.16579 2026-06-16 cs.LG math-ph math.DG math.MP 新提交

On the Entropy Formula for Real, Complex, and Quaternionic Deep Linear Networks

关于实、复和四元数深度线性网络的熵公式

Luis Contreras, Marco Nahas, Tejas Kotwal

发表机构 * CINVESTAV-IPN（墨西哥国立理工学院高级研究中心）； Brown University（布朗大学）

AI总结将Menon和Yu的实深度线性网络熵公式推广到复和四元数情形，得到统一公式。

Comments 17 pages

2606.16576 2026-06-16 cs.CL 新提交

Can LLM Agents Infer World Models? Evidence from Agentic Automata Learning

LLM智能体能否推断世界模型？来自智能体自动机学习的证据

Reef Menaged, Gili Lior, Shauli Ravfogel, Roee Aharoni, Gabriel Stanovsky

发表机构 * The Hebrew University of Jerusalem（耶路撒冷希伯来大学）； New York University（纽约大学）； Google Research（谷歌研究）

AI总结提出智能体自动机学习框架，通过成员查询和等价查询评估LLM智能体发现隐藏确定性有限自动机的能力，发现性能随DFA规模增加而急剧下降，推理模型优于非推理模型但仍存在规划、整合和假设构建缺陷。

AI中文摘要

我们提出智能体自动机学习，以评估调用工具的LLM智能体通过交互发现隐藏环境的程度。在我们的设置中，智能体应通过与预言机的交互来发现隐藏的确定性有限自动机（DFA），交互方式包括（1）成员查询（“该字符串是否属于目标语言？”）和（2）等价查询（“这是目标DFA吗？”）。这产生了一个可扩展的测试平台，具有可控的任务复杂度、可测量的交互效率以及强基线（经典自动机学习算法）。评估最先进的LLM，我们发现性能随着DFA规模增加而急剧下降。推理模型明显强于非推理模型，但轨迹分析揭示了查询规划、证据整合和假设构建中的反复失败。总体而言，我们的结果表明，当前的LLM智能体有时可以执行非平凡的交互式发现，但在此任务上远不如经典算法稳健和高效。

英文摘要

We propose agentic automata learning to evaluate the extent to which tool-calling LLM agents can uncover hidden environments through interaction. In our setup, an agent should uncover a hidden deterministic finite automaton (DFA) by interacting with an oracle through (1) membership queries ("Does this string belong to the target language?") and (2) equivalence queries ("Is this the target DFA?"). This yields a scalable testbed with controlled task complexity, measurable interaction efficiency, and strong baselines (classic automata-learning algorithms). Evaluating state-of-the-art LLMs, we find that performance drops sharply as DFA size increases. Reasoning models are markedly stronger than non-reasoning models, yet trajectory analyses reveal recurring failures in query planning, evidence integration, and hypothesis construction. Overall, our results show that current LLM agents can sometimes perform non-trivial interactive discovery, but remain far less robust and efficient than classic algorithms for the task.
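The two query types can be sketched as a small oracle class wrapping a hidden DFA. Membership runs the word through the automaton; equivalence does a breadth-first search over the product of the target and the hypothesis and returns a shortest distinguishing string, or `None` if the languages agree. The class name `DFAOracle`, its constructor arguments, and the example automata are assumptions for illustration; the paper's actual tool interface is not described in the abstract.

```python
from collections import deque

class DFAOracle:
    """Teacher holding a hidden DFA and answering the two query types."""

    def __init__(self, alphabet, transitions, start, accepting):
        self.alphabet = alphabet
        self.transitions = transitions  # dict: (state, symbol) -> state
        self.start = start
        self.accepting = set(accepting)

    def membership(self, word):
        """Membership query: is `word` in the target language?"""
        state = self.start
        for symbol in word:
            state = self.transitions[(state, symbol)]
        return state in self.accepting

    def equivalence(self, hypothesis):
        """Equivalence query: None if equal, else a shortest counterexample."""
        start = (self.start, hypothesis.start)
        queue, seen = deque([(start, "")]), {start}
        while queue:
            (s, h), word = queue.popleft()
            if (s in self.accepting) != (h in hypothesis.accepting):
                return word
            for a in self.alphabet:
                nxt = (self.transitions[(s, a)], hypothesis.transitions[(h, a)])
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, word + a))
        return None

# Hidden target: strings over {a, b} with an even number of 'a'.
target = DFAOracle("ab", {(0, "a"): 1, (0, "b"): 0, (1, "a"): 0, (1, "b"): 1}, 0, [0])
# A wrong first guess by the learner: accept everything.
guess = DFAOracle("ab", {(0, "a"): 0, (0, "b"): 0}, 0, [0])
counterexample = target.equivalence(guess)
```

Counting calls to `membership` and `equivalence` gives the interaction-efficiency measure the abstract mentions, and the number of hidden states controls task difficulty.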

2606.16573 2026-06-16 cs.CV 新提交

Transformation-driven generation of comparable projection images from multimodal anatomical scenes

从多模态解剖场景生成可比较投影图像的变换驱动方法

Dariusz Pojda, Krzysztof Domino, Michał Tarnawski, Agnieszka Anna Tomaka

发表机构 * Institute of Theoretical and Applied Informatics, Polish Academy of Sciences（波兰科学院理论与应用信息学研究所）

AI总结提出变换驱动框架，从多模态解剖数据生成可重复的投影空间观测，通过下颌运动场景验证，实现不同解剖配置下直接可比的虚拟X光投影生成。

Comments 36 pages, 11 figures

AI中文摘要

本工作解决了从异质解剖场景生成可重复投影空间观测的计算问题，其中组件可能经历独立的空间变换。我们提出了一种变换驱动框架，用于从多模态解剖数据生成合成投影图像，并在下颌运动场景中进行了演示。与主要为配准、投影真实感或渲染效率设计的传统数字重建放射影像（DRR）方法不同，所提出的公式将投影成像视为对显式表示的解剖场景进行观测的过程。独立可变换的基于体积和表面的解剖对象嵌入到共享场景表示中，并通过显式变换直接传播到投影空间。投影几何、采集建模、材料解释和图像呈现保持显式分离，从而能够在保持可重复性和生成投影之间直接可比性的同时，对方法假设进行可控探索。特别强调了与颅面分析相关的变换驱动解剖场景，包括下颌运动和治疗性复位。使用由CT/CBCT体积、分割结构、表面模型以及辅助解剖或治疗对象组成的共享解剖参考场景，该框架能够在保持相同成像假设的同时，从多种解剖配置生成直接可比的VirtualRTG投影。该方法并非旨在实现完全物理逼真的放射模拟，而是为研究解剖-投影关系、运动可观测性和变换感知成像工作流提供可控且可重复的方法学环境。

英文摘要

This work addresses the computational problem of generating reproducible projection-space observations from heterogeneous anatomical scenes whose components may undergo independent spatial transformations. We propose a transformation-driven framework for synthetic projection imaging from multimodal anatomical data and demonstrate it on mandibular-motion scenarios. In contrast to conventional Digitally Reconstructed Radiograph (DRR) approaches primarily designed for registration, projection realism, or rendering efficiency, the proposed formulation treats projection imaging as an observation process operating on an explicitly represented anatomical scene. Independently transformable volumetric and surface-based anatomical objects are embedded within a shared scene representation and propagated directly into projection space through explicit transformations. Projection geometry, acquisition modelling, material interpretation, and image presentation remain explicitly separated, enabling controlled exploration of methodological assumptions while preserving reproducibility and direct comparability between generated projections. Particular emphasis is placed on transformation-driven anatomical scenarios relevant to craniofacial analysis, including mandibular motion and therapeutic repositioning. Using a shared anatomical reference scene composed of CT/CBCT volumes, segmented structures, surface models, and auxiliary anatomical or therapeutic objects, the framework enables generation of directly comparable VirtualRTG projections from multiple anatomical configurations while preserving identical imaging assumptions. Rather than aiming at fully physically faithful radiographic simulation, the proposed approach provides a controllable and reproducible methodological environment for studying anatomy-projection relationships, motion observability, and transformation-aware imaging workflows.

2606.16572 2026-06-16 cs.RO 新提交

Steering Generative Reinforcement Learning into Stable Robotic Controller

将生成式强化学习引导至稳定机器人控制器

Yixuan Wang, Shutong Ding, Ke Hu, Tianxiang Gui, Jingya Wang, Ye Shi

发表机构 * ShanghaiTech University（上海科技大学）

AI总结提出SteerGenPO框架，通过潜在空间强化学习将训练好的生成式策略转化为鲁棒的确定性机器人控制器，在Isaac Lab和Unitree G1任务上优于基线方法，实现更稳定的推理行为。

AI中文摘要

基于扩散和流的生成式策略通过迭代动作生成诱导丰富的随机探索，为强化学习提供了强大的策略类。然而，扩散策略的随机性不适用于高维机器人系统中的稳定精确控制，其中小的动作变化可能累积为不一致的运动并降低鲁棒性。为解决此问题，我们提出SteerGenPO，一种潜在空间强化学习框架，将训练好的生成式策略引导为鲁棒的确定性机器人控制器。关键思想是用学习到的潜在演员替换训练好的生成式策略的随机潜在采样，该潜在演员为生成式策略预测状态相关的潜在输入。这分离了探索和控制：随机生成采样在策略学习期间提供多样化的动作提议，而确定性潜在引导在部署时提供稳定和自适应的控制。我们在六个Isaac Lab基准测试和一个Unitree G1运动任务上评估了SteerGenPO。结果表明，SteerGenPO在经典RL和生成式RL基线上均有改进，同时其确定性潜在引导产生更稳定的推理时行为和更可靠的命令响应。

英文摘要

Diffusion and flow-based generative policies provide a powerful policy class for reinforcement learning by inducing rich stochastic exploration through iterative action generation. However, the stochasticity of diffusion policies is not suitable for stable and precise control in high-dimensional robotic systems, where small action variations can accumulate into inconsistent motion and reduced robustness. To address this issue, we propose SteerGenPO, a latent-space reinforcement learning framework that steers a trained generative policy into a robust deterministic robotic controller. The key idea is to replace stochastic latent sampling of the trained generative policy with a learned latent actor that predicts a state-dependent latent input for the generative policies. This separates exploration and control: stochastic generative sampling provides diverse action proposals during policy learning, while deterministic latent steering provides stable and adaptive control at deployment. We evaluate SteerGenPO on six Isaac Lab benchmarks and a Unitree G1 locomotion task. The results show SteerGenPO improves over both classical RL and generative RL baselines, while its deterministic latent steering produces more stable inference-time behaviors and more reliable command responses.

2606.16570 2026-06-16 cs.RO 新提交

Automated Digital Twin Construction for Highway Scenarios Using LiDAR Point Clouds and OpenStreetMap

基于LiDAR点云和OpenStreetMap的高速公路场景自动数字孪生构建

Yongqi Zhao, Dong Bi, Paul Kovacevic, Tomislav Mihalj, Martin Schabauer, Johannes Betz, Arno Eichberger

发表机构 * Institute of Automotive Engineering, Graz University of Technology（格拉茨技术大学汽车工程研究所）； School of Intelligent Connected Vehicle, Hubei University of Automotive Technology（湖北汽车工业学院智能网联汽车学院）； Professorship of Autonomous Vehicle Systems, Technical University of Munich（慕尼黑工业大学自动驾驶系统教席）

AI总结提出融合LiDAR点云与OpenStreetMap数据的自动化流程，生成地理参考的ASAM OpenDRIVE高速公路地图，实现车道级几何与拓扑的完整建模，平均横向RMSE为0.740米。

Comments 9 pages, 5 figures

AI中文摘要

精确的道路环境建模是自动驾驶系统仿真和验证的基础。然而，从真实传感器数据构建标准格式（如ASAM OpenDRIVE）的道路地图仍然是一个耗时且昂贵的过程。移动测绘LiDAR可以捕获精确的车道级几何，但仅限于行驶走廊，而OpenStreetMap（OSM）提供广泛的道路网络拓扑，但缺乏车道级的几何精度。为了解决这一问题，提出了一种自动化工作流程，融合LiDAR点云与OSM数据，生成地理参考的ASAM OpenDRIVE高速公路环境地图，所需人工干预最少。该流程从LiDAR测量中重建主线道路，并从OSM道路图中推断匝道几何和拓扑，从而在无需完整传感器覆盖的情况下实现完整的高速公路互通立交建模。实验表明，平均横向RMSE为0.740米，生成的地图可直接用于主流仿真平台，包括IPG CarMaker和Esmini。这些结果验证了将测量几何与地图拓扑相结合用于自动OpenDRIVE数字孪生生成的有效性。项目代码可在https://github.com/ftgTUGraz/opendrive-digital-twin-generator获取。

英文摘要

Accurate road environment modeling is fundamental to the simulation and validation of automated driving systems. However, constructing road maps in standardized formats such as ASAM OpenDRIVE from real-world sensor data remains a time-consuming and costly process. Mobile mapping LiDAR captures accurate lane-level geometry but is confined to the driven corridor, while OpenStreetMap (OSM) provides broad road network topology but lacks geometric precision at the lane level. To address this, an automated workflow is proposed to fuse LiDAR point clouds with OSM data to generate georeferenced ASAM OpenDRIVE maps of highway environments, requiring minimal manual intervention. The pipeline reconstructs mainline roads from LiDAR-derived measurements and infers ramp geometry and topology from the OSM road graph, enabling complete highway interchange modeling without full sensor coverage. Experiments demonstrate a mean lateral RMSE of 0.740 m, and the generated maps are directly usable in mainstream simulation platforms including IPG CarMaker and Esmini. These results validate the effectiveness of combining measurement-derived geometry with map-derived topology for automated OpenDRIVE digital twin generation. The project code is available at https://github.com/ftgTUGraz/opendrive-digital-twin-generator
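The reported lateral RMSE can be read as the root-mean-square of perpendicular offsets between reconstructed lane points and a reference line. The sketch below computes that for 2D points against a reference polyline. The exact evaluation protocol (reference source, sampling, point matching) is not given in the abstract, so every name and number here is an assumption.

```python
import math

def point_to_segment(p, a, b):
    """Shortest distance from point p to the segment a-b in 2D."""
    px, py = p
    ax, ay = a
    bx, by = b
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0.0:
        return math.hypot(px - ax, py - ay)
    # Clamp the projection parameter so the foot stays on the segment.
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / seg_len_sq))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))

def lateral_rmse(points, reference):
    """RMSE of each point's distance to the nearest reference segment."""
    sq = []
    for p in points:
        d = min(point_to_segment(p, a, b) for a, b in zip(reference, reference[1:]))
        sq.append(d * d)
    return math.sqrt(sum(sq) / len(sq))

# Hypothetical straight reference lane and a reconstruction offset by 1 m.
reference_line = [(0.0, 0.0), (50.0, 0.0), (100.0, 0.0)]
reconstructed = [(10.0, 1.0), (40.0, -1.0), (70.0, 1.0)]
error = lateral_rmse(reconstructed, reference_line)
```

Because offsets are squared before averaging, points on either side of the reference contribute equally, and a few large deviations dominate the result.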

2606.16569 2026-06-16 cs.CV cs.RO 新提交

PROSE: Training-Free Egocentric Scene Registration with Vision-Language Models

PROSE: 基于视觉语言模型的无训练自我中心场景配准

Zhiang Chen, Nahyuk Lee, Boyang Sun, Taein Kwon, Marc Pollefeys, Zuria Bauer, Sunghwan Hong

发表机构 * ETH Zurich（苏黎世联邦理工学院）； VGG, University of Oxford（牛津大学VGG实验室）； ETH AI Center（苏黎世联邦理工学院人工智能中心）

AI总结提出PROSE方法，利用预训练视觉语言模型将RGB序列提升为对象级3D场景图，通过对象高度先验和相同/不同查询匹配实例，无需训练或深度传感器即可实现自我中心场景配准，在Aria基准上超越几何和场景图基线。

Comments Project page: https://rckola.github.io/prose/

AI中文摘要

对同一室内空间在不同时间的两次采集进行配准，是机器人和AR系统持久空间记忆的基础，但该任务的现实版本是自我中心的，且其最具可扩展性的形式是仅RGB。头戴式摄像头产生模糊、快速移动、部分重叠的视图，难以从中恢复密集几何。经典配准恰恰依赖于该场景所缺乏的干净点云，而学习型场景图方法需要预先构建或标注的图以及训练好的匹配器，我们发现后者在自我中心数据下很脆弱。我们采取不同路线，使用预训练的视觉语言模型作为场景理解和跨扫描匹配的来源。我们的方法PROSE（Prompted Scene rEgistration）利用现成的几何、分割和语言基础模型将每个RGB序列提升为对象级3D场景图，然后提示同一VLM匹配两个RGB序列中的对象实例。为了使匹配易于处理且可靠，我们利用对象高度作为先验，并通过配对的相同/不同查询验证每个提议的匹配，然后为每个匹配对象假设一个候选变换，并选择几何一致性最强的候选来求解刚体变换。PROSE不添加任何学习参数，也不需要深度传感器、训练或标注图。在自我中心的Aria Digital Twin和Aria Everyday Activities基准测试中，它在真值点云和RGB重建点云上的配准精度均优于几何基线和学习型场景图基线，并且其生成的场景图可直接用于下游任务。

英文摘要

Registering two captures of the same indoor space taken at different times underpins persistent spatial memory for robots and AR systems, yet the realistic version of this task is egocentric and its most scalable form is RGB-only. Head-mounted cameras yield blurry, fast-moving, partially overlapping views from which dense geometry is hard to recover. Classical registration leans on exactly the clean point clouds this setting lacks, while learned scene-graph methods require a pre-built or annotated graph and a trained matcher that we find brittle under egocentric data. We take a different route, using a pretrained vision-language model as the source of both scene understanding and cross-scan matching. Our method, PROSE (Prompted Scene rEgistration), lifts each RGB sequence into an object-level 3D scene graph using off-the-shelf foundation models for geometry, segmentation, and language, then prompts the same VLM to match object instances across the two RGB sequences. To make this matching tractable and reliable, we leverage object heights as a prior and verify each proposed match with a paired same/different query, then solve for the rigid transform by hypothesizing a candidate per matched object and selecting the one with the strongest geometric consensus. PROSE adds no learned parameters and requires no depth sensor, training, or annotated graph. On the egocentric Aria Digital Twin and Aria Everyday Activities benchmarks, it outperforms both geometric and learned scene-graph baselines in registration accuracy, on ground-truth and RGB-reconstructed point clouds alike, and the scene graph it produces transfers directly to downstream tasks.

2606.16568 2026-06-16 cs.CL cs.AI 新提交

Fast When, Careful Who: Dual-Process Multiparty Turn-Taking with Diffusion Augmentation

快速判断何时，谨慎决定谁：基于扩散增强的双过程多方轮次转换

Rutherford A. Patamia, Ming Liu, Wei Luo, Favour Ekong, Akan Cosgun

发表机构 * Deakin University（迪肯大学）； Griffith University（格里菲斯大学）

AI总结针对多说话人对话中的轮次转换问题，提出音频两阶段流水线，先快速检测轮次边界，再轻量验证决定是否转移并预测下一说话人，扩散增强进一步改善检测性能。

AI中文摘要

可靠的轮次转换对于口语对话系统至关重要。然而，现有方法大多针对双说话人交互设计，难以处理包含重叠和快速说话人切换的现实多说话人音频。我们在VoxConverse数据集上研究多说话人轮次转换，并提出一个纯音频的两阶段流水线，将何时触发轮次边界与话语权是否实际转移分开。一个快速触发器扫描音频并提出候选的轮次结束时间，而一个轻量验证器仅在这些时间运行，以决定Hold或Shift，并支持下一说话人预测。我们报告了完整多说话人设置下的结果，以及为便于比较而构造的受控双人（dyadic）top-2投影下的结果。我们还研究了基于扩散的、保留标签的背景音频混合作为数据增强策略。结果显示，与基线相比，转移检测有所改善，扩散增强进一步提升了性能。

英文摘要

Reliable turn-taking is essential for spoken dialogue systems. However, most existing methods are designed for two-speaker interaction and struggle with realistic multiparty audio containing overlap and rapid speaker changes. We study multiparty turn-taking on the VoxConverse dataset and propose an audio-only two-stage pipeline that separates when to trigger a turn boundary from whether the floor is actually transferring. A fast trigger scans the audio and proposes candidate end-of-turn times, while a lightweight verifier runs only at those times to decide Hold or Shift and support next-speaker prediction. We report results in the full multiparty setting and a controlled dyadic top-2 projection for comparability. We also investigate diffusion-based, label-preserving background-audio mixing as a data augmentation strategy. Results show improved shift detection over a baseline, with further improvements from diffusion augmentation.

URL PDF HTML ☆

赞 0 踩 0

2606.16567 2026-06-16 cs.AI cs.LG cs.SY eess.SY math.DS 新提交

TNODEV: Toolbox for Neural ODE Verification

TNODEV: 神经ODE验证工具箱

Abdelrahman Sayed Sayed, Pierre-Jean Meyer, Mohamed Ghazel

发表机构 * Univ Gustave Eiffel, COSYS-ESTAS（古斯塔夫·埃菲尔大学，COSYS-ESTAS实验室）

AI总结提出TNODEV，首个集成伪造检查、区间可达性、验证循环和并行调度的神经ODE形式验证器，支持安全集包含和分类鲁棒性验证。

Comments 29 pages, 7 figures, Under review in TMLR

AI中文摘要

神经常微分方程（神经ODE）已开始出现在安全关键场景中，例如网络物理系统的连续时间控制器和集成到自动化决策流水线中的分类器，这引发了对其行为能否被形式化验证的问题。现有的专门用于神经ODE的工具仅提供单次可达性调用，没有迭代输入集细化，将其判定的精度限制在单次可达性调用所能提供的范围内。我们提出了TNODEV，这是首个用于神经ODE的可靠形式验证器，它集成了伪造检查器、基于连续时间混合单调性的快速区间可达性后端、具有三种输入集分裂启发式的验证与细化循环以及并行调度器，构成一个端到端流水线。TNODEV支持纯神经ODE、与神经网络控制器闭环的神经ODE以及通用神经ODE（GNODE）上的安全集包含验证，安全集可指定为区间或由目标分类标签诱导的半空间交集。我们在安全集包含和分类鲁棒性属性的一系列基准上评估了TNODEV，包括与NNV 2.0和CORA的直接可达性比较，以及在MNIST通用神经ODE分类器上与NNV2.0的验证比较。

英文摘要

Neural ordinary differential equations (neural ODEs) have started to appear in safety-critical settings such as continuous-time controllers for cyber-physical systems and classifiers integrated into automated decision pipelines, raising the question of whether their behavior can be formally verified. Existing tools dedicated to neural ODEs provide only a single reachability call without iterative input set refinement, limiting the precision of their verdicts to whatever one reachability call can deliver. We present TNODEV, the first sound formal verifier for neural ODEs that integrates a falsification checker, a fast interval-based reachability backend based on continuous-time mixed monotonicity, a verification and refinement loop with three input-set splitting heuristics, and a parallel scheduler in a single end-to-end pipeline. TNODEV supports safe-set inclusion verification on pure neural ODEs, neural ODEs in closed loop with a neural network controller, and general neural ODEs (GNODE), with the safe set specified either as an interval or as the half-space intersection induced by a target classification label. We evaluate TNODEV on a range of benchmarks across safe-set inclusion and classification-robustness properties, including a direct reachability comparison against NNV 2.0 and CORA and a verification comparison against NNV 2.0 on MNIST general neural ODE classifiers.
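The falsify, verify, then split loop described above can be sketched on a toy problem. In place of a neural ODE reachability backend, this sketch uses naive interval arithmetic on the scalar map f(x) = x(1 - x), whose bounds are sound but loose and tighten as the input interval shrinks. The function names, the midpoint falsifier, bisection as the only splitting heuristic, and the depth limit are all assumptions for illustration and do not reflect TNODEV's API.

```python
def f(x):
    """Toy system response standing in for a neural ODE's final state."""
    return x * (1.0 - x)

def reach(lo, hi):
    """Sound but loose output bounds for f on [lo, hi], for 0 <= lo <= hi <= 1."""
    # Interval product [lo, hi] * [1 - hi, 1 - lo] with non-negative factors.
    return lo * (1.0 - hi), hi * (1.0 - lo)

def verify(lo, hi, upper_safe, depth=0, max_depth=20):
    """Return 'safe', 'unsafe', or 'unknown' for the property f(x) <= upper_safe."""
    mid = 0.5 * (lo + hi)
    if f(mid) > upper_safe:             # falsification: concrete counterexample
        return "unsafe"
    if reach(lo, hi)[1] <= upper_safe:  # over-approximation inside the safe set
        return "safe"
    if depth == max_depth:
        return "unknown"
    # Refinement: bisect the input set and verify both halves.
    left = verify(lo, mid, upper_safe, depth + 1, max_depth)
    if left != "safe":
        return left
    return verify(mid, hi, upper_safe, depth + 1, max_depth)

verdict_loose = verify(0.0, 1.0, upper_safe=0.3)   # true maximum of f is 0.25
verdict_tight = verify(0.0, 1.0, upper_safe=0.2)   # violated at x = 0.5
```

A single reachability call on [0, 1] cannot prove the first property because the bound there is 1.0; only after splitting does the over-approximation fit inside the safe set, which is the limitation of single-call tools that the abstract points to.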

URL PDF HTML ☆

赞 0 踩 0

2606.16566 2026-06-16 cs.CV 新提交

Local-GS: Accelerating 3D Gaussian Splatting via Tile-Local Warp Coherence

Local-GS：通过Tile局部Warp一致性加速3D高斯泼溅

Yang Luo, Yan Gong, Yongsheng Gao, Jie Zhao, Xinyu Zhang, Huaping Liu

发表机构 * State Key Laboratory of Robotics and Systems, Harbin Institute of Technology（哈尔滨工业大学机器人技术与系统国家重点实验室）； State Key Laboratory of Intelligent Green Vehicle and Mobility, School of Vehicle and Mobility, Tsinghua University（清华大学车辆与运载学院智能绿色车辆与交通国家重点实验室）； Department of Computer Science and Technology, Tsinghua University（清华大学计算机科学与技术系）

AI总结提出Local-GS，通过基于SIMT执行边界组织高斯原语，设计提升、剔除和混合三阶段warp一致渲染范式，在不降低质量的前提下实现最高7.76倍加速。

AI中文摘要

3D高斯泼溅（3DGS）通过将场景表示为各向异性3D高斯原语的密集集合，显著推进了实时新视角合成。然而，高斯的不规则空间分布通常导致GPU利用率低下，因为warp发散和冗余计算降低了渲染性能。为了解决这个问题，我们提出了Local-GS，一种warp一致的渲染范式，它根据SIMT（单指令多线程）执行边界而非场景几何来组织高斯原语。具体来说，我们提出了三个warp一致阶段：提升阶段，在tile级别预计算共享参数；剔除阶段，丢弃没有贡献的warp；混合阶段，用统一的指令流替换逐像素分支。在多个数据集上的广泛基准测试中，Local-GS在不牺牲质量的情况下提高了效率。作为一种即插即用的优化，它为所有测试的基线提供了额外的性能提升，在Deep Blending场景上实现了7.76倍的加速。

英文摘要

3D Gaussian Splatting (3DGS) has significantly advanced real-time novel view synthesis by representing scenes as dense collections of anisotropic 3D Gaussian primitives. However, the irregular spatial distribution of Gaussians often leads to poor GPU utilization, as warp divergence and redundant computation degrade rendering performance. To address this, we present Local-GS, a warp-coherent rendering paradigm that organizes Gaussian primitives with respect to SIMT (Single Instruction, Multiple Threads) execution boundaries rather than scene geometry. Specifically, we propose three warp-coherent stages: a hoisting stage that precomputes shared parameters at tile level, a culling stage that discards warps with no contribution, and a blending stage that replaces per-pixel branching with a uniform instruction stream. Across extensive benchmarks on multiple datasets, Local-GS improves efficiency without compromising quality. As a plug-and-play optimization, it provides additional performance gains to all tested baselines, culminating in a $7.76\times$ speedup on Deep Blending scenes.

2606.16564 2026-06-16 cs.RO cs.LG 新提交

Elastic ODYN: Differentiable Optimization for Infeasible Control and Learning in Robotics

Elastic ODYN：面向机器人中不可行控制与学习的可微优化

Aristotelis Papatheodorou, Jose Rojas, Ioannis Havoutis, Carlos Mastalli

发表机构 * University of Oxford（牛津大学）； Heriot-Watt University（赫瑞瓦特大学）

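AI summary: Proposes Elastic ODYN, a primal-dual non-interior-point solver that handles infeasible quadratic programs (QPs) through smooth squared-ℓ2 elastic relaxations, supports warm starting, and converges to closest-to-feasible solutions when no feasible point exists. Building on it, the authors develop a differentiable QP layer and an infeasibility-aware SQP method, and report better results than existing methods on benchmark QPs, singular contact mechanics, differentiable parameter identification, and quadrupedal and humanoid trajectory optimization.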

Comments 8 pages, 5 figures, 2 tables

English abstract

Robotic systems routinely encounter conflicting objectives, modeling errors, and degenerate contact conditions that render quadratic programs (QPs) infeasible. Yet most optimization solvers and differentiable QP layers assume feasibility, leading to numerical failures, unstable gradients, or solver breakdown when constraints cannot be simultaneously satisfied. We present Elastic ODYN, a primal-dual non-interior-point QP solver that handles infeasibility through smooth squared-$\ell_2$ elastic relaxations. The resulting formulation remains well posed under ill-conditioning and degeneracy, supports warm starting, and converges to closest-to-feasible solutions when no feasible point exists. A lightweight refinement stage recovers physically meaningful dual variables from the elastic solution. Building on this framework, we develop Elastic OdynLayer, a differentiable QP layer with stable gradients under infeasibility, and Elastic OdynSQP, an infeasibility-aware SQP method that resolves inconsistent subproblems and intrinsically infeasible optimal control tasks through selective constraint relaxation. We evaluate the framework on benchmark QPs, singular contact mechanics, differentiable parameter identification, and quadrupedal and humanoid trajectory optimization. Across all settings, Elastic ODYN consistently outperforms state-of-the-art elastic QP solvers in robustness, warm-start performance, and convergence reliability, enabling optimization, simulation, control, and learning beyond the feasibility assumptions of existing methods.
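
To make "squared-ℓ2 elastic relaxation" and "closest-to-feasible" concrete, here is a minimal sketch for an equality-constrained QP. It is not the Elastic ODYN algorithm: it simply replaces the hard constraint Ax = b with the penalty (ρ/2)‖Ax − b‖² and solves the resulting linear system in closed form. The function names, the penalty weight `rho`, and the toy problem are assumptions made for illustration.

```python
def solve_linear(M, rhs):
    """Solve M x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [r] for row, r in zip(M, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x


def elastic_qp(P, q, A, b, rho):
    """Minimise 0.5 x'Px + q'x + 0.5 * rho * ||Ax - b||^2.

    Setting the gradient to zero gives (P + rho A'A) x = rho A'b - q.
    """
    n, m = len(q), len(b)
    H = [[P[i][j] + rho * sum(A[k][i] * A[k][j] for k in range(m))
          for j in range(n)] for i in range(n)]
    g = [rho * sum(A[k][i] * b[k] for k in range(m)) - q[i] for i in range(n)]
    return solve_linear(H, g)


# Two contradictory constraints on x0 (x0 = 1 and x0 = 3); x1 is unconstrained.
P = [[1.0, 0.0], [0.0, 1.0]]
q = [0.0, -1.0]
A = [[1.0, 0.0], [1.0, 0.0]]
b = [1.0, 3.0]
x = elastic_qp(P, q, A, b, rho=1e6)
print(x)
```

No x satisfies both constraints, so a solver that assumes feasibility has nothing to return. The relaxed problem still has a unique minimiser: for a large penalty weight, x0 settles close to 2, the point that minimises the squared constraint violation, while x1 stays at its unconstrained optimum of 1.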

2606.16562 2026-06-16 cs.LG New submission

MIRAGE: Auditing Anti-Muslim Bias in Frontier LLMs Across Reasoning, Agentic, and Time-Coupled Conditions

Noor Islam S. Mohammad, Tamim Sheikh

Affiliations: University of California, Berkeley

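AI summary: Introduces the MIRAGE benchmark of 1,200 prompts covering three deployment settings: direct completion, chain-of-thought reasoning, and simulated agentic decision-making. Finds that chain-of-thought amplifies bias, that agentic decisions are asymmetric, that bias is time-coupled to retrieved news, and that existing mitigations have limited effect.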

English abstract

Five years after the discovery of persistent anti-Muslim bias in large language models, most evaluations remain confined to single-turn prompt completion, a setting that no longer reflects how frontier LLMs are deployed. We introduce MIRAGE (Muslim-Identity Reasoning and Agentic Generation Evaluation), a benchmark of 1,200 prompts spanning three deployment-realistic conditions: direct completion, chain-of-thought reasoning, and simulated agentic decision-making across content moderation, lending triage, refugee claim summarization, and hiring screens. Across six frontier models, we find that (i) chain-of-thought reasoning amplifies rather than suppresses Muslim-violence associations by 12-34% relative to direct completion, (ii) agentic decisions exhibit a 9-22 percentage-point asymmetry between Muslim and matched non-Muslim cases on identical evidence, and (iii) bias is sharply time-coupled to retrieved news context, increasing 18-27% under recent-conflict retrieval. Existing prompt-based mitigations transfer poorly across our three conditions, suppressing direct-completion bias while leaving agentic asymmetry largely intact. We release MIRAGE and an open evaluation harness to support targeted mitigation research.
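
The abstract reports both percent changes and a percentage-point asymmetry, and the two are easy to confuse. The sketch below shows how each could be computed from matched decisions. It does not reproduce the MIRAGE evaluation harness: the function names and the decision counts are hypothetical, chosen only to show the arithmetic.

```python
def adverse_rate(decisions):
    """Fraction of cases that received the adverse decision (1 = adverse)."""
    return sum(decisions) / len(decisions)


def percentage_point_gap(rate_a, rate_b):
    """Absolute difference between two rates, in percentage points."""
    return (rate_a - rate_b) * 100.0


def relative_change_pct(new, baseline):
    """Change of `new` relative to `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


# Hypothetical matched cases: identical evidence, only the identity cue differs.
group_a = [1] * 30 + [0] * 70   # 30 adverse decisions out of 100
group_b = [1] * 20 + [0] * 80   # 20 adverse decisions out of 100

gap_pp = percentage_point_gap(adverse_rate(group_a), adverse_rate(group_b))
rel = relative_change_pct(adverse_rate(group_a), adverse_rate(group_b))
print(f"gap: {gap_pp:.1f} percentage points, relative change: {rel:.1f}%")
```

With these toy counts the same pair of rates yields a gap of about 10 percentage points but a relative change of about 50%, which is why it matters which of the two a reported figure refers to.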

2606.16560 2026-06-16 cs.CL New submission

The BD-LSC Dataset: Facilitating the Benchmarking of Models for Lexical Semantic Change Detection in Slang and Standard Usage

Afnan Aloraini, Viktor Schlegel, Goran Nenadic, Riza Batista-Navarro

Affiliations: The University of Manchester; Qassim University

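AI summary: To address bi-directional lexical semantic change and words that mix slang and standard usage, builds two benchmark datasets, BD-LSC and ST-WSD, evaluates several families of methods, and finds that rare slang senses remain the core difficulty.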

English abstract

Automatic semantic change detection aims to identify how word meanings shift over time, offering insights into both linguistic and societal change. Despite recent progress in computational lexical semantic change (LSC), existing benchmarks and methods struggle to capture bi-directional semantic change, particularly cases where words simultaneously gain and lose senses. This problem is especially challenging for words that have both slang and standard meanings. To address these gaps, we introduce two complementary benchmark datasets. The Bi-Directional Lexical Semantic Change (BD-LSC) dataset captures sense gain, sense loss, and stability across three time periods, enabling the study of complex semantic trajectories. The SlangTrack Word Sense Disambiguation (ST-WSD) dataset provides fine-grained, instance-level sense annotations for words combining slang and standard usages, supporting systematic benchmarking of WSD and semantic change detection models. Using these benchmarks, we systematically evaluate models across different methodological families: unsupervised clustering using contextualised embeddings, supervised machine learning, transformer-based models, and state-of-the-art large language models. Among the evaluated systems, the few-shot GPT-4o model achieved the strongest aggregate performance on Exact Sense Match (ESM) and multi-label accuracy; however, Macro-F1 scores near 0.5 across all systems show that rare slang senses remain difficult, which we identify as the central open challenge.
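
The link the abstract draws between Macro-F1 near 0.5 and rare slang senses follows from how the metric is averaged. The sketch below computes accuracy and Macro-F1 by hand on a toy label set. The labels and predictions are invented for illustration and are not taken from BD-LSC or ST-WSD; the helper names are hypothetical.

```python
def f1_for_class(gold, pred, label):
    """F1 score of one class, treating it as the positive label."""
    tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
    fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
    fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1, so every sense counts equally."""
    labels = sorted(set(gold))
    return sum(f1_for_class(gold, pred, lab) for lab in labels) / len(labels)


def accuracy(gold, pred):
    return sum(1 for g, p in zip(gold, pred) if g == p) / len(gold)


# Toy data: the slang sense is rare, and the system never predicts it.
gold = ["standard"] * 8 + ["slang"] * 2
pred = ["standard"] * 10

print(f"accuracy = {accuracy(gold, pred):.3f}")
print(f"macro-F1 = {macro_f1(gold, pred):.3f}")
```

Here accuracy is 0.8 while Macro-F1 is about 0.44: missing the two slang instances barely moves accuracy but zeroes out one of the two per-class scores that Macro-F1 averages.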
