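arXivDaily daily academic digest: synced with the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems science, and more.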

2606.06892 2026-06-08 cs.LG New submission

GRASP: Geometry-aware Residual Alignment for Scalable Pretraining Data Attribution

Yue Min, Ruining Chen, Yujun Li

Affiliations: Wizard Quant; University of Science and Technology of China

AI summary: Proposes GRASP, which models subset interactions through a quadratic geometric penalty and combines low-dimensional feature sketches with a finite lower-confidence-bound selection protocol to achieve scalable pretraining data attribution, substantially improving counterfactual subset fidelity while reducing computational cost.

Abstract

Scalable data attribution methods typically assign isolated utility scores to individual training examples. This prevalent additive assumption fundamentally fails to capture critical subset dynamics, including data redundancy and complementary coverage. In this work, we reframe attribution as subset-level counterfactual utility prediction and introduce GRASP, an interaction-aware surrogate. Grounded in a theoretical smoothness lower bound, GRASP explicitly models subset interactions through a quadratic geometric penalty. To achieve pretraining-scale efficiency without relying on hidden oracle tuning, we couple low-dimensional feature sketches with a strictly finite lower-confidence bound selection protocol. Extensive subset-retraining evaluations demonstrate that GRASP decisively outperforms existing scalable baselines. It more than doubles the task-level rank correlation for counterfactual subset fidelity while reducing upfront artifact construction costs by nearly an order of magnitude. Downstream diagnostics further show that this scoring mechanism transfers to language model curation and cross-domain vision selection, establishing a robust foundation for optimizing massive pretraining corpora.
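
The difference between additive scoring and an interaction-aware surrogate can be illustrated with a small sketch. This is not the GRASP implementation: it assumes a hypothetical surrogate of the form U(S) = sum of per-example scores minus `lam` times the sum of pairwise inner products between feature sketches, plus a greedy selector, purely to show how a quadratic penalty discounts redundant examples. All names and the toy data are illustrative.

```python
def dot(u, v):
    """Inner product of two equal-length feature sketches."""
    return sum(a * b for a, b in zip(u, v))


def subset_utility(scores, sketches, subset, lam=1.0):
    """Additive utility minus a quadratic redundancy penalty (assumed form)."""
    additive = sum(scores[i] for i in subset)
    penalty = 0.0
    for pos, i in enumerate(subset):
        for j in subset[pos + 1:]:
            penalty += dot(sketches[i], sketches[j])  # each unordered pair once
    return additive - lam * penalty


def greedy_select(scores, sketches, k, lam=1.0):
    """Greedily add the example that maximizes the surrogate utility."""
    chosen = []
    remaining = set(range(len(scores)))
    while remaining and len(chosen) < k:
        best = max(
            sorted(remaining),
            key=lambda i: subset_utility(scores, sketches, chosen + [i], lam),
        )
        chosen.append(best)
        remaining.remove(best)
    return chosen


# Examples 0 and 1 are near-duplicates; example 2 covers a different direction.
scores = [1.0, 0.9, 0.5]
sketches = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
print(greedy_select(scores, sketches, k=2, lam=0.0))  # purely additive ranking
print(greedy_select(scores, sketches, k=2, lam=1.0))  # penalizes the duplicate
```

With `lam=0.0` the selector reduces to top-k by isolated score and keeps both duplicates. With the penalty active it trades the second duplicate for the complementary example.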


2606.06891 2026-06-08 cs.CV New submission

Stream3D-VLM: Online 3D Spatial Understanding with Incremental Geometry Priors

Hanxun Yu, Xuan Qu, Lei Ke, Boqiang Zhang, Yuxin Wang, Jianke Zhu, Dong Yu

Affiliations: Zhejiang University; Tencent Hunyuan; HKUST; Shenzhen Loop Area Institute

AI summary: Proposes Stream3D-VLM, an online 3D vision-language model that uses autoregressive streaming control, a lightweight visual-spatial feature integration module, and geometry-adaptive voxel compression to understand 3D space in real time from streaming video. The authors also build a dataset of over one million online 3D QA pairs and report gains over existing models on multiple tasks.

Comments: Project page: https://stream3d-vlm.github.io/

Abstract

Despite advances in 3D scene understanding, existing 3D Large Multimodal Models operate in offline settings, requiring complete scene observations or predefined video clips. In this paper, we present an online 3D vision-language model that enables real-time spatial understanding from streaming video. Our approach adopts autoregressive streaming-control modeling based on the LLM's next-token prediction objective to learn when to respond, and employs a lightweight Visual-Spatial Feature Integration (VSFI) module to incrementally inject temporally aligned geometry priors into the visual stream. To alleviate long-context decoding overhead, we propose a plug-and-play Geometry-Adaptive Voxel Compression (GAVC) module for efficient visual token compression. To address the scarcity of streaming 3D-language data, we further develop a scalable data generation pipeline that curates over 1M online spatio-temporal 3D QA pairs and establishes a comprehensive benchmark spanning 29 tasks. Extensive experiments show that our approach significantly outperforms both proprietary and open-source models across online and offline 3D spatial understanding, reasoning, and grounding tasks. The project page is available at https://stream3d-vlm.github.io/
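
The idea of compressing visual tokens by voxel can be sketched as follows. This is a simplification under stated assumptions, not the GAVC module: it uses a fixed uniform grid (the "geometry-adaptive" part is not modeled), assumes each token carries a 3D position, and merges tokens in the same voxel by averaging. The function and parameter names are hypothetical.

```python
import math
from collections import defaultdict


def voxel_compress(positions, features, voxel_size):
    """Average the feature vectors of all tokens that fall in the same voxel."""
    buckets = defaultdict(list)
    for point, feat in zip(positions, features):
        # Integer voxel index along each axis.
        key = tuple(math.floor(c / voxel_size) for c in point)
        buckets[key].append(feat)

    compressed = {}
    for key, feats in buckets.items():
        dim = len(feats[0])
        compressed[key] = [sum(f[d] for f in feats) / len(feats) for d in range(dim)]
    return compressed


positions = [(0.1, 0.1, 0.1), (0.2, 0.3, 0.4), (1.5, 0.0, 0.0)]
features = [[1.0, 0.0], [3.0, 2.0], [5.0, 5.0]]
tokens = voxel_compress(positions, features, voxel_size=1.0)
print(len(features), "tokens ->", len(tokens), "tokens")
```

Three input tokens collapse to two here because the first two share a voxel. A coarser `voxel_size` compresses more aggressively at the cost of spatial detail.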


2606.06890 2026-06-08 cs.CV cs.LG New submission

Diagnosing Visual Ignorance in Vision-Language Models

Runyu Zhou, Qi Zhang, Qixun Wang, Yisen Wang

Affiliations: Peking University

AI summary: Studies the internal mechanisms behind vision-language models' reliance on language priors, uses layer replacement and probing to reveal a multi-stage bottleneck, and introduces a progressive visual decay metric showing that benchmarks may reward visual ignorance.

Abstract

Vision-Language Models (VLMs) frequently rely on language priors, producing confident answers that are weakly grounded in visual evidence. While this behavior is widely observed, its internal mechanisms and its impact on benchmark evaluation remain insufficiently understood. In this work, we study language-prior reliance from both mechanistic and behavioral perspectives. Internally, we combine counterfactual layer replacement with supervised layer-wise MLP probing to trace how ground-truth visual semantics and language-prior semantics compete across the language decoder. Our analysis reveals a multi-stage bottleneck: intermediate layers often fail to effectively retrieve visual information, while later layers can further suppress surviving visual signals in favor of text-space biases. Externally, we introduce a progressive visual decay metric based on multi-step Gaussian blurring, which identifies instances whose answers remain invariant even as visual content is increasingly destroyed. Across twelve visual question-answering benchmarks and three representative VLMs, we find that a substantial fraction of examples remain answerable under severe or total visual obfuscation, indicating that current benchmarks can inadvertently reward visual ignorance. These findings demonstrate that language-prior reliance is a systematic routing failure affecting both model internals and benchmark validity. Finally, we outline critical pathways for future research, highlighting the necessity of designing training distributions and evaluation protocols built on structurally isolated or counterfactual data to enforce genuine cross-modal grounding.
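
A progressive-blur check of the kind described above might look like the sketch below. It is an illustration under assumptions, not the paper's metric: images are plain 2D lists of floats, `predict` is a hypothetical stand-in for a VLM answering a fixed question, and the score is simply the fraction of blur levels at which the answer matches the answer on the clean image.

```python
import math


def gaussian_kernel(sigma):
    """Normalized 1D Gaussian kernel truncated at about three sigma."""
    radius = max(1, int(3 * sigma))
    weights = [math.exp(-(k * k) / (2 * sigma * sigma)) for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def blur_rows(image, kernel):
    """Convolve every row with the kernel, clamping indices at the borders."""
    radius = len(kernel) // 2
    out = []
    for row in image:
        n = len(row)
        out.append([
            sum(kernel[k + radius] * row[min(max(x + k, 0), n - 1)]
                for k in range(-radius, radius + 1))
            for x in range(n)
        ])
    return out


def gaussian_blur(image, sigma):
    """Separable blur: rows first, then columns via a transpose."""
    kernel = gaussian_kernel(sigma)
    horizontal = blur_rows(image, kernel)
    transposed = [list(col) for col in zip(*horizontal)]
    vertical = blur_rows(transposed, kernel)
    return [list(col) for col in zip(*vertical)]


def answer_invariance(predict, image, sigmas):
    """Fraction of blur levels at which the answer equals the clean-image answer."""
    reference = predict(image)
    unchanged = sum(1 for s in sigmas if predict(gaussian_blur(image, s)) == reference)
    return unchanged / len(sigmas)


# A 5x5 image with one bright pixel, and two toy "models".
image = [[0.0] * 5 for _ in range(5)]
image[2][2] = 1.0
blind = lambda img: "cat"  # ignores the image entirely
seeing = lambda img: "bright" if max(max(r) for r in img) > 0.5 else "dark"
print(answer_invariance(blind, image, [1.0, 2.0]))
print(answer_invariance(seeing, image, [1.0, 2.0]))
```

An invariance near 1 on heavily blurred inputs suggests the answer does not depend on visual content, which is the failure mode the abstract calls visual ignorance.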


2606.06887 2026-06-08 cs.CV New submission

ARAPDiffusion: ARAP Regularization for Diffusion-Based Deformable Shape Space Learning

Haibo Liu, Jinghan Ke, Haitao Yang, Xiangru Huang, Georgios Pavlakos, Qixing Huang

Affiliations: University of Texas at Austin; Westlake University

AI summary: Proposes ARAPDiffusion, a latent diffusion model that injects the ARAP deformation model as regularization losses to learn a continuous shape space for a collection of deforming shapes, reducing reliance on large amounts of 3D training data.

Abstract

This paper introduces ARAPDiffusion, a latent diffusion model that learns the underlying continuous shape space of a deformable shape collection. The key innovation is injecting the as-rigid-as-possible (ARAP) deformation model as regularization losses into latent diffusion (LD), relaxing the requirement for abundant 3D training data when learning generative models. In contrast to the standard LD, we show how the ARAP model can be used to improve both the encoder/decoder and the LD model. The training procedure alternates between using the synthetic distribution defined by the LD model to develop a regularization loss that enhances the shape encoder/decoder and using the shape decoder to develop a regularization loss that improves the LD model. We also show the benefit of the LD paradigm in combining a representation-free LD process and an implicit shape decoder that is applicable to unorganized point clouds. The experimental results of unconditional and conditional shape generation demonstrate the advantages of ARAPDiffusion over baseline approaches.


2606.06885 2026-06-08 cs.CV cs.AI New submission

FreeAnimate: Training-Free Human Image Animation with Preview-Guided Denoising

Yuan Zeng, Yujia Shi, Zongqing Lu, QingMin Liao

Affiliations: National University of Singapore; University of Science and Technology of China

AI summary: Proposes FreeAnimate, a training-free human image animation framework that exploits the inherent capabilities of image diffusion models. A preview generation strategy supplies temporal and structural priors, and Inversion-Boosted Attention and Reference-Anchored Self-Attention modules maintain temporal consistency and identity preservation.

Comments: Accepted to IEEE ICASSP 2026

DOI: 10.1109/ICASSP55912.2026.11462600

Abstract

Human Image Animation has seen significant advancements, primarily driven by diffusion models. However, existing methods typically demand substantial training data and resources to achieve high-quality results, limiting generalization and accessibility. In this work, we introduce FreeAnimate, a training-free framework that leverages the inherent capabilities of image diffusion models to enable temporal consistency, identity preservation, and background stability. Our approach incorporates a novel preview generation strategy that provides temporal and structural priors from generated preview frames, effectively guiding pose alignment and background consistency without training. Additionally, FreeAnimate introduces Inversion-Boosted Attention and Reference-Anchored Self-Attention modules to guarantee temporal consistency and identity preservation. Experimental results demonstrate that FreeAnimate outperforms existing training-free competitors and training-based baseline methods, achieving generation quality comparable to state-of-the-art methods and offering robust generalization across diverse datasets. Our project page is at https://freeani.github.io/.


2606.06881 2026-06-08 cs.LG New submission

GlucoFM-Bench: Benchmarking Time-Series Foundation Models for Blood Glucose Forecasting

Baiying Lu, Zhaohui Liang, Ryan Pontius, Shengpu Tang, Temiloluwa Prioleau

Affiliations: Department of Computer Science; Dartmouth College; Emory University; Quantitative Biomedical Sciences

AI summary: Proposes the GlucoFM-Bench benchmark, which evaluates eight representative architectures (pre-trained time-series foundation models, time-series large language models, and task-specific deep learning models) on blood glucose forecasting across 15 diabetes datasets. Pre-trained models perform well in zero-shot and few-shot settings, but a lightweight LSTM remains best under full-shot training.

Abstract

Blood glucose forecasting models are foundational for modern diabetes management systems, as reliable short-term predictions can enable proactive interventions, support automated insulin delivery, and reduce the risk of hypo- and hyperglycemic events. From a modeling perspective, glucose forecasting poses unique challenges due to heterogeneous physiological dynamics across diabetes populations. Traditional machine learning and deep learning models have been extensively evaluated for glucose prediction, yet recent time-series foundation models (TSFMs) remain much less studied in this setting. To bridge this gap, we present GlucoFM-Bench, a comprehensive benchmark evaluating state-of-the-art TSFMs alongside supervised deep learning models for blood glucose forecasting. We assess eight representative architectures, including pre-trained TSFMs, time-series large language models, and task-specific deep learning models, across 15 publicly available diabetes-relevant datasets comprising 1,117 individuals with type 1 diabetes, type 2 diabetes, prediabetes, and no diabetes. Models are evaluated under zero-shot, few-shot, and full-shot protocols, with systematic variation in context length and prediction horizon. Across datasets, pre-trained TSFMs, especially Chronos-2 and TimesFM, show strong zero-shot and few-shot transfer, with the best zero-shot model performing within 5% of the best full-shot supervised model. Yet, when task-specific data are abundant, a lightweight LSTM remains strongest, outperforming TSFMs by 4-21% under full-shot training. Stratified analyses reveal persistent challenges in T1D cohorts and hypo-/hyperglycemic ranges, highlighting the need for evaluation beyond aggregate error metrics. Together, GlucoFM-Bench provides a standardized and reproducible foundation for evaluating, comparing, and improving foundation models for blood glucose forecasting.
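
The point about evaluating beyond aggregate error can be made concrete with a stratified metric. The sketch below is not taken from the benchmark: it assumes glucose values in mg/dL and uses the commonly cited 70 and 180 mg/dL cut-offs for the hypoglycemic and hyperglycemic ranges. The function name and the toy readings are illustrative.

```python
def stratified_mae(y_true, y_pred, low=70.0, high=180.0):
    """Mean absolute error per glycemic range, keyed by the true value's range."""
    buckets = {"hypo": [], "in_range": [], "hyper": []}
    for truth, pred in zip(y_true, y_pred):
        if truth < low:
            name = "hypo"
        elif truth > high:
            name = "hyper"
        else:
            name = "in_range"
        buckets[name].append(abs(truth - pred))
    # None marks a range with no samples rather than reporting a misleading 0.
    return {name: (sum(errs) / len(errs) if errs else None)
            for name, errs in buckets.items()}


y_true = [60.0, 100.0, 120.0, 200.0]
y_pred = [70.0, 100.0, 110.0, 230.0]
overall = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
print("overall MAE:", overall)
print(stratified_mae(y_true, y_pred))
```

In this toy case the overall MAE sits well below the error in the hyperglycemic range, which is exactly the kind of gap an aggregate metric hides.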


2606.06879 2026-06-08 cs.CL cs.CR New submission

An Expanded Synthetic Conversation Dataset for Multi-Turn Smishing Detection

Carl Lochstampfor, Ayan Roy

Affiliations: GitHub; arXiv

AI summary: Presents COVA-X, an expanded dataset of 10,985 conversations, with an improved generation pipeline that addresses contamination, label mismatch, and related issues. Experiments show Longformer surpassing XGBoost, supporting the claim that transformer models need larger conversational corpora to realize their contextual advantages.

Abstract

Our prior work introduced COVA, a synthetically generated multi-turn conversational smishing dataset of 3,201 labeled conversations, establishing baseline detection benchmarks across eight models. While XGBoost with TF-IDF features achieved the best performance, with 72.5% accuracy and 0.691 macro F1, transformer models underperformed, which was attributed to input truncation and insufficient training data. We present COVA-X, an expanded dataset of 10,985 conversations spanning eight elder-targeted scam categories, produced by an improved generation pipeline addressing contamination, label mismatch, stage-direction bleed, and prompt-design failures from the first iteration. Retraining all classifiers on the expanded dataset yields the central finding of this work: Longformer now surpasses XGBoost on all evaluation metrics, achieving 79.71% accuracy and 0.7786 macro F1 compared with 78.43% and 0.7563 for XGBoost. This directly confirms that transformer models require larger conversational corpora to realize their contextual advantages. We additionally document a quality life-cycle including a 12.7× improvement in label correction rate, from 49.8% to 3.9%, an architectural intervention reducing virtual-kidnapping artifact rates from 67.1% to 46.5%, and a per-scam-type outcome analysis showing that scam categories modulate results in mechanism-consistent ways. A pre/post-cleanup sensitivity analysis confirms that dataset refinement recovers genuine label-relevant signal across all three classifier architectures.
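
Since the comparison above is reported in both accuracy and macro F1, a short sketch of how the two differ may help. This is the standard definition of macro F1 (the unweighted mean of per-class F1 scores) written in plain Python. The labels and predictions are made-up toy data, not drawn from COVA-X.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


def accuracy(y_true, y_pred):
    """Fraction of exactly matching labels."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


y_true = ["scam", "scam", "benign", "benign"]
y_pred = ["scam", "benign", "benign", "benign"]
print("accuracy:", accuracy(y_true, y_pred))
print("macro F1:", round(macro_f1(y_true, y_pred), 4))
```

Macro F1 weights every class equally, so a classifier that misses one class is penalized more than accuracy alone would suggest.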


2606.06878 2026-06-08 cs.RO cs.CV New submission

A Cross-view Fusion Framework for Robust 6-DoF Grasp Pose Estimation

Kangjian Zhu, Haobo Jiang, Jianjun Qian, Jin Xie

Affiliations: Nanjing University of Science and Technology; Nanyang Technological University; Nanjing University

AI summary: Proposes a cross-view fusion framework that uses an auxiliary view to alleviate occlusion, applies self-supervised contrastive learning to strengthen the spatial consistency and direction distinctiveness of point cloud features, and designs a cross-view-aligned cylinder integration module to fuse grasp-relevant geometry, improving the robustness of 6-DoF grasp pose estimation in corner views.

Comments: Corresponding author: Jin Xie

Abstract

In this paper, we propose a cross-view fusion framework that enhances the robustness of 6-DoF grasp pose estimation in corner views. Our framework alleviates occlusion by incorporating an auxiliary view and avoids the time-consuming, task-agnostic multi-view reconstruction through a post-fusion strategy. To enhance cross-view fusion, we propose a self-supervised contrastive learning strategy that leverages cross-view associations to regularize point cloud features. In brief, a cross-view point pair is considered a match if the two points correspond to the same 3D location, and a non-match if they represent distinct grasp directions. The learning strategy significantly enhances the spatial consistency and direction distinctiveness of point features, thereby facilitating cross-view fusion and improving estimation robustness. Furthermore, we propose a cross-view-aligned cylinder integration module to fuse grasp-relevant geometry into a comprehensive representation. Specifically, the module first aligns the cross-view points and features according to their similarity to enhance the robustness against noise. Subsequently, these points are registered into the cylindrical coordinate frame, emphasizing the rotation-symmetric geometry which is important for grasping. Finally, local self-attention and seed cross-attention layers are alternately employed, respectively enabling interactions within single views and across views, which supports fine-grained representation of grasp-relevant geometry. Our framework achieves strong performance on the GraspNet-1Billion benchmark and in real-world applications. Code is available at https://github.com/KJZhuAutomatic/Cross-view-Grasp.
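
Registering points into a cylindrical coordinate frame, as mentioned above, can be sketched with the textbook Cartesian-to-cylindrical conversion. The assumptions here are not from the paper: the cylinder axis is taken to be the local z-axis through a chosen origin, and the function name is hypothetical.

```python
import math


def to_cylindrical(point, origin=(0.0, 0.0, 0.0)):
    """Convert (x, y, z) to (radius, angle, height) about the local z-axis."""
    dx, dy, dz = (p - o for p, o in zip(point, origin))
    radius = math.hypot(dx, dy)  # distance from the axis
    angle = math.atan2(dy, dx)   # rotation around the axis, in radians
    return radius, angle, dz


# Two points related by a 90-degree rotation about the axis.
a = to_cylindrical((2.0, 0.0, 1.0))
b = to_cylindrical((0.0, 2.0, 1.0))
print(a)
print(b)
```

The two points share the same radius and height and differ only in angle, which is why this frame makes rotation-symmetric geometry easy to compare.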


2606.06877 2026-06-08 cs.RO cs.AI New submission

Neuro-Symbolic Learning for Long-Horizon Task Planning Under Complex Logical Constraints

Qiwei Du, Zitong Zhan, Shaoshu Su, Bowen Li, Yi Du, Zhipeng Zhao, Taimeng Fu, Sebastian Scherer, Jiaoyang Li, Chen Wang

Affiliations: Spatial AI & Robotics (SAIR) Lab, University at Buffalo, NY 14260; Robotics Institute, Carnegie Mellon University, PA 15213

AI summary: Proposes an imperative learning-based bilevel optimization framework in which a neural scorer prunes task-irrelevant objects and a 3R strategy (Repair, Restart, Rollback) stabilizes lower-level planning, achieving an 80.04% reduction in failure rate and a 57.14% reduction in planning time across three benchmarks.

Abstract

Task planning often suffers from severe efficiency bottlenecks when robots must reason over long-horizon action sequences under complex logical constraints, including object affordances, spatial relationships, and sequential action dependencies. Recent neuro-symbolic methods improve planning efficiency by learning object-importance scores to prune task-irrelevant objects, but they typically rely on fixed offline supervision generated from full search spaces. This creates a train-test mismatch: at deployment, the planner operates in pruned search spaces induced by the model's own imperfect predictions, leading to exposure bias and degraded planning performance. To address this challenge, we formulate object-importance learning for task planning as an imperative learning-based bilevel optimization problem. The upper level optimizes a neural scorer, while the lower level solves a symbolic planning problem in the score-pruned search space. To stabilize this learning process, we introduce a 3R strategy into the lower-level planning, using parallel Repair, Restart, and Rollback recovery to provide reliable and adaptive feedback for upper-level learning. Experiments on three challenging benchmarks demonstrate state-of-the-art performance, including an 80.04% reduction in failure rate and a 57.14% reduction in planning time. We further validate the framework on a quadruped-based mobile manipulator in simulation and the real world, demonstrating its potential for efficient and deployable neuro-symbolic task planning.


2606.06875 2026-06-08 cs.CV cs.CR New submission

Unified Safe In-context Image Generation in Multimodal Diffusion Transformers via Restricting Unsafe Information Flows

Xiang Yang, Feifei Li, Mi Zhang, Geng Hong, Xiaoyu You, Mi Wen, Min Yang

Affiliations: University of Science and Technology of China

AI summary: Proposes the UVR framework, which analyzes unsafe information flows in attention dynamics and applies training-free attention modulation to output patches for safety control in image generation and editing, reaching erase rates of 91% and 77%, respectively.

Comments: ICML26

Abstract

Diffusion transformers (DiTs) equipped with multimodal attention (MM-Attn) have become a dominant paradigm for image generation. However, preventing the generation of harmful content remains a critical challenge, particularly in image-to-image (I2I) editing tasks. Existing safety mechanisms are primarily designed for text-to-image (T2I) synthesis or U-Net-based architectures, which limits their effectiveness for unified safety mitigation in DiT-based frameworks. To bridge this gap, we propose Unified Visual Safety Regulator (UVR), a training-free safe generation framework that regulates unsafe semantics in generated images. UVR is grounded in an analysis of attention dynamics from the perspective of information flow in MM-Attn. We identify a task-independent start-up stage, during which unsafe semantics in output patches rapidly emerge and can be accurately localized, followed by task-specific semantic amplification and interference stages, where harmful signals are further propagated and entangled with benign content. Based on these observations, UVR mitigates unsafe generation through unified, targeted attention modulation and explicit restriction of harmful information flow over the identified unsafe output patches. Experiments across various concepts show that UVR achieves state-of-the-art safety performance, with erase rates of 91% and 77% in image synthesis and editing tasks, respectively, while preserving visual quality and fidelity with minimal degradation. Code is available at https://github.com/deng12yx/UVR.
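
Restricting information flow through attention can be illustrated with a generic masking sketch. This is not the UVR procedure: it assumes a single query's attention logits over a set of keys and a hypothetical set of flagged key indices, and it suppresses the flagged keys by subtracting a penalty before the softmax.

```python
import math


def modulated_attention(logits, flagged, penalty=float("inf")):
    """Softmax over logits after penalizing the flagged key positions."""
    if all(i in flagged for i in range(len(logits))):
        raise ValueError("at least one key must remain unflagged")
    adjusted = [l - penalty if i in flagged else l for i, l in enumerate(logits)]
    peak = max(adjusted)  # subtract the max for numerical stability
    exps = [math.exp(a - peak) for a in adjusted]
    total = sum(exps)
    return [e / total for e in exps]


logits = [0.0, 0.0, 0.0]
print(modulated_attention(logits, flagged=set()))  # uniform attention
print(modulated_attention(logits, flagged={1}))    # key 1 receives no weight
```

An infinite penalty blocks the flagged positions entirely and renormalizes the rest. A finite penalty would attenuate them instead, which is a softer form of modulation.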


2606.06872 2026-06-08 cs.CV cs.AI New submission

EgoPressDiff: Multimodal Video Diffusion for Egocentric UV-Domain Hand-Pressure Estimation

Yuan Zeng, Zilue Gao, Yujia Shi, Zongqing Lu, Wenming Yang, QingMin Liao

Affiliations: University of Science and Technology of China

AI summary: Proposes EgoPressDiff, a conditional video diffusion framework that generates UV-pressure maps from visual input using a multi-modal conditioning strategy (hand pose, 3D mesh vertices, and depth information), addressing the quantization errors and temporal inconsistencies of existing methods. It achieves state-of-the-art results on the EgoPressure dataset, with a relative Volumetric IoU improvement of over 34%.

Comments: Accepted to IEEE ICASSP 2026

DOI: 10.1109/ICASSP55912.2026.11463813

Abstract

Estimating hand-surface contact pressure from an egocentric view is crucial for AR/VR devices, robotic imitation, and ergonomic analysis. Existing methods often discretize the pressure signal and process frames independently, leading to quantization errors and temporal inconsistencies. We present EgoPressDiff, a conditional video diffusion framework that generates UV-pressure maps from visual input. The core of our approach is a multi-modal conditioning strategy, introducing a PoseNet and a Vertex Encoder to efficiently extract features from hand pose and 3D mesh vertices. These signals, along with depth information, guide the generative process to ensure the pressure fields are physically grounded. To effectively fuse these heterogeneous features, we further propose a Distribution-Calibrated Spatial Layer, which aligns their statistical properties before combination. Evaluated on the EgoPressure ego-view setting, EgoPressDiff achieves state-of-the-art results, improving Volumetric IoU by over 34% relative to the prior baseline, while reducing MAE and maintaining high temporal accuracy. Our project page is at https://egopressdiff.github.io/.


2606.06871 2026-06-08 cs.LG New submission

Evidence-Grounded Ensemble Diagnosis of 802.11 Packet Captures: A Multi-Stage Pipeline with Deterministic Reliability Scoring

Jerome Henry, Swadhin Pradhan, Miroslav Popovic

Affiliations: University of California, Berkeley

AI summary: Proposes PROBE, a multi-stage pipeline that uses a deterministic evidence framework and ensembling to address LLM hallucination, miscalibrated confidence, and evaluation bias in 802.11 diagnosis, achieving a weighted evidence F1 of 0.957 and a 96% auto-accept rate on 87 enterprise Wi-Fi captures.

Comments: 37 pages, 9 figures, 9 tables

Abstract

Diagnosing 802.11 packet captures requires expert protocol knowledge, is slow, inconsistent across engineers, and unscalable. LLM-based approaches sound plausible but fabricate protocol events absent from captures (especially truncated traces), produce uncalibrated confidence scores, and suffer evaluation bias when golden references are co-produced by the model under test. We introduce PROBE (Protocol Reasoning Over evidence-Based Ensembles), a multi-stage pipeline addressing all three failures. It integrates (i) deterministic PCAP-to-text normalization with frame-level verifiability, (ii) multi-run, multi-candidate ensembles with optional cross-model second opinion and progressive obfuscation, (iii) a verdict-aware evidence framework treating absence of failure evidence as contributing evidence, and (iv) a fully deterministic composite reliability score from evidence validity, run-to-run stability, and cross-model agreement without LLM self-assessment. On 87 enterprise Wi-Fi captures (104 capture-reviewer pairs), single-pass LLM analysis raises weighted evidence F1 from 0.871 (expert baseline) to 0.912 but misses critical frames in 35% of cases. Naive ensemble voting drops below baseline (0.842) as majority voting amplifies conservative verdicts: 50% of confirmed failures are misclassified as 'no issue' or 'insufficient evidence.' Adding evidence-grounded reconciliation achieves 0.957 F1, a 96% auto-accept rate, and a worst-case floor above 0.70. LLM self-reported confidence clusters at 0.95 regardless of difficulty (71% report exactly 0.95), confirming it is uninformative. We also introduce a model-agnostic evaluation framework using per-field assertion matching, eliminating circular bias from model-co-produced golden references.
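
A deterministic composite score built from the three signals named above (evidence validity, run-to-run stability, cross-model agreement) could be sketched as a weighted mean. Everything specific here is an assumption for illustration: the weights, the definition of stability as the majority share among repeated runs, and the function names are hypothetical and not taken from PROBE.

```python
from collections import Counter


def run_stability(verdicts):
    """Share of repeated runs that agree with the most common verdict."""
    counts = Counter(verdicts)
    return counts.most_common(1)[0][1] / len(verdicts)


def reliability_score(evidence_validity, verdicts, cross_model_agreement,
                      weights=(0.5, 0.3, 0.2)):
    """Weighted mean of three components, each expected to lie in [0, 1]."""
    components = (evidence_validity, run_stability(verdicts), cross_model_agreement)
    return sum(w * c for w, c in zip(weights, components))


verdicts = ["auth_failure", "auth_failure", "no_issue", "auth_failure"]
score = reliability_score(evidence_validity=1.0, verdicts=verdicts,
                          cross_model_agreement=1.0)
print(round(score, 3))
```

Because no term depends on a model's self-reported confidence, the same inputs always yield the same score, which is the property the abstract contrasts with uncalibrated LLM self-assessment.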


2606.06870 2026-06-08 cs.RO New submission

What Is My Robot Thinking? Design Considerations for Transparent and Trustworthy Shared Autonomy

Atharv Belsare, Zohre Karimi, Connor Mattson, Rushiil Nakka, Daniel S. Brown

Affiliations: Kahlert School of Computing, University of Utah; Robotics Center, University of Utah

AI summary: A user study of how interface transparency (feedback modality and information richness) affects coordination and trust in shared autonomy systems. Feedback improves intent alignment and reduces corrective intervention, visual feedback is preferred over auditory, preferences for information richness depend on task complexity, and revealing the full belief distribution does not consistently improve alignment or trust.

Comments: 9 pages, 5 figures. Code and videos are available at https://sites.google.com/view/design-t2-sa/home. Under review at IROS 2026.

Abstract

Assistive robots operating under shared autonomy must balance user control with autonomous assistance. Because robot actions depend on internal intent inference that is not directly observable, mismatches between inferred and intended goals can undermine coordination and trust. We investigate how interface-level transparency, including feedback modality (visual vs. auditory) and information richness (sparse vs. rich), shapes interaction in a vision-based shared autonomy system. In a user study with N=25 participants across two assistive manipulation tasks, we evaluate how these designs influence coordination and trust. Providing feedback significantly improves intent alignment and reduces corrective intervention, indicating that making the inferred goal legible accelerates convergence in shared control. Participants preferred visual over auditory feedback, while preferences for sparse versus rich information depended on task complexity. We also found that revealing the full belief distribution did not consistently improve alignment or trust. Together, these findings indicate that effective transparency enhances coordination primarily through goal legibility, while trust depends on task-appropriate information exposure rather than maximal disclosure. Based on these results, we outline guidelines for designing transparent shared autonomy systems.


2606.06869 2026-06-08 cs.AI New submission

Evidence-Based Intelligent Diagnostic and Therapeutic Visualization System with Large Language Models: Multi-Turn Interaction and Multimodal Treatment Plan Generation

Yunhan Wang, Yuda Wang, Zhiying Tu, Mingqiang Song, Li Song, Kun Li, Dianhui Chu, Bolin Zhang

Affiliations: Harbin Institute of Technology, Weihai; Harbin Institute of Technology (Weihai) Qingdao Research Institute; Shandong Key Laboratory of Digital Service Computing Technology and Systems; Weihai Municipal Hospital; Shanghai Taizhu Technology Co., Ltd; Tianjin Zhifu Qihuang Medical Technology Co., Ltd

AI summary: Proposes a knowledge-enhanced visual diagnostic system that uses knowledge graph constraints, information gain-driven questioning, and multimodal treatment presentation to improve the transparency of traditional Chinese medicine syndrome differentiation and the interpretability of treatment plans.

Comments: 29 pages, 9 figures, 5 tables, including supporting information

Abstract

Aim: Existing AI-assisted traditional Chinese medicine diagnostic tools suffer from opaque reasoning processes, passive interaction, and limited treatment plan presentation. This study proposes a knowledge-enhanced visual diagnostic system to improve the transparency and interpretability of syndrome differentiation and treatment. Methods: The system is built upon a Neo4j knowledge graph comprising 241 syndromes, 1,263 symptoms, and 2,485 relations. It incorporates a four-stage symptom matching pipeline (exact, semantic, fuzzy, and large language model verification), an information gain-driven proactive questioning strategy optimized with genetic algorithms, and a multimodal treatment presentation integrating artificial intelligence-generated illustrations, three-dimensional meridian-acupoint models, and evidence-based literature. Results: Knowledge graph constraints reduced non-standard outputs by 32%. Case studies validated the effectiveness of the interactive workflow across patient self-assessment, clinician-assisted diagnosis, and traditional Chinese medicine education. Automated paired-comparison evaluation across 30 cases further demonstrated significant improvements in diagnostic trust (Cohen's d = 1.82, p < 0.001), reduced cognitive load (improvements in four of five dimensions), and higher credibility of evidence-based references (4.21 vs. 2.95). Conclusions: The proposed system enhances the transparency of traditional Chinese medicine diagnostic reasoning and the interpretability of treatment plans through knowledge graph-driven visualization and multimodal interaction, offering a practical solution for trustworthy artificial intelligence-assisted traditional Chinese medicine applications.
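The information gain-driven questioning described above can be illustrated with a minimal sketch. It assumes a uniform prior over the remaining candidate syndromes and deterministic symptom membership; the knowledge base `kb`, the placeholder names (`syn_a`, `s1`, ...), and both function names are hypothetical, and the genetic-algorithm optimization mentioned in the abstract is not modeled here.

```python
import math

def entropy(probs):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def information_gain(candidates, symptom, kb):
    """Expected entropy reduction from asking whether `symptom` is present."""
    n = len(candidates)
    yes = [s for s in candidates if symptom in kb[s]]
    no = [s for s in candidates if symptom not in kb[s]]
    prior = entropy([1 / n] * n)
    expected = 0.0
    for group in (yes, no):
        if group:  # an empty branch has probability zero
            expected += len(group) / n * entropy([1 / len(group)] * len(group))
    return prior - expected

def next_question(candidates, asked, kb):
    """Pick the not-yet-asked symptom with the highest information gain."""
    symptoms = set().union(*(kb[s] for s in candidates)) - asked
    return max(sorted(symptoms), key=lambda s: information_gain(candidates, s, kb))

# Hypothetical toy knowledge base: syndrome -> set of symptoms.
kb = {
    "syn_a": {"s1", "s2", "s3"},
    "syn_b": {"s1", "s3"},
    "syn_c": {"s3"},
    "syn_d": {"s3", "s4"},
}
candidates = list(kb)
print(next_question(candidates, set(), kb))
```

In this toy setup, a symptom shared by every candidate (`s3`) carries no information, while one that splits the candidates evenly (`s1`) is the most useful question to ask next.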

2606.06867 2026-06-08 cs.CV 新提交

Multi-FRuGaL: Multimodal Flexible Redundancy-aware Decomposed Gated Learning for Cancer Diagnosis and Prognosis

Multi-FRuGaL：面向癌症诊断与预后的多模态灵活冗余感知分解门控学习

Sanket Kachole, Siddhesh Thakur, Shubham Innani, Sanyukta Adap, Suhang You, Carla Pitarch-Abaigar, Spyridon Bakas

发表机构 * Division of Computational Pathology, Department of Pathology and Laboratory Medicine, Indiana University School of Medicine（计算病理学部，病理学与实验室医学部，印第安纳大学医学院）； IU Melvin and Bren Simon Comprehensive Cancer Center（印第安纳大学Melvin和Bren Simon综合癌症中心）； Departments of Biostatistics and Health Data Science（生物统计学与健康数据科学部）； Radiology and Imaging Sciences（放射学与影像科学部）； Neurological Surgery（神经外科）； Indiana University School of Medicine（印第安纳大学医学院）； Department of Computer Science, Luddy School of Informatics, Computing, and Engineering（计算机科学部，Luddy信息、计算与工程学院）

AI总结提出Multi-FRuGaL框架，通过分解感知自适应门控中间融合，在缺失模态下学习模态级表示，分离冗余与互补信号，提升癌症诊断与预后性能。

AI中文摘要

现代医学依赖于涵盖放射学、病理学、文本报告和结构化临床信息的异构数据源。然而，真实世界的患者数据常常不完整，存在缺失或稀疏获取的模态，限制了标准多模态融合方法的有效性。为此，我们提出了多模态灵活冗余感知分解门控学习（Multi-FRuGaL）框架，这是一种分解感知的自适应门控中间融合框架，可在数据缺失下执行模态级表示学习。Multi-FRuGaL 集成了每个模态的编码器、信号分解层、输入条件门控网络和信息感知融合目标，以将冗余信号与模态特异性互补信号分离，选择性地提升信息丰富的模态并抑制冗余或噪声输入，即使在多个模态缺失时也能保持良好定义。我们在两个多模态头颈癌队列上评估了 Multi-FRuGaL：HANCOCK 挑战数据集（N = 763），包含五种模态和两个预后终点（5年生存率和2年复发率）；以及 HECKTOR 挑战数据集（N = 588），包含三种模态用于人乳头瘤病毒（HPV）状态分类。Multi-FRuGaL 在多个任务上始终比评估的基线方法获得更高的平均性能，将生存预测的 AUC 从 0.601 提高到 0.8496，复发预测的 AUC 从 0.672 提高到 0.8102，并在 HECKTOR 上实现 HPV 预测的 AUC 为 0.975。对于生存分析，它在 HANCOCK 上进一步实现了总生存期的 C-index 为 0.6814，无复发生存期为 0.7421，无进展生存期为 0.7143，在 HECKTOR 上无复发生存期为 0.7203。定性分析进一步表明，即使在严重缺失模态条件下，Multi-FRuGaL 也能学习到判别性和鲁棒的多模态表示。

英文摘要

Modern medicine relies on heterogeneous data sources spanning radiology, pathology, text reports, and structured clinical information. However, real-world patient data are frequently incomplete, with missing or sparsely acquired modalities, limiting the effectiveness of standard multimodal fusion approaches. To this end, we propose the Multimodal Flexible Redundancy-aware decomposed GAted Learning (Multi-FRuGaL) framework, a decomposition-aware, adaptive gated intermediate-fusion framework that performs modality-level representation learning under missing data. Multi-FRuGaL integrates per-modality encoders with a signal decomposition layer, an input-conditioned gating network, and an information-aware fusion objective to separate redundant from modality-specific complementary signals, selectively upweighting informative modalities and suppressing redundant or noisy inputs, and remaining well-defined even when multiple modalities are absent. We evaluate Multi-FRuGaL on two multimodal head and neck cancer cohorts: the HANCOCK challenge dataset (N = 763) comprising five modalities and two prognostic endpoints (5-year survival and 2-year recurrence), and the HECKTOR challenge dataset (N = 588) comprising three modalities for human papillomavirus (HPV) status classification. Multi-FRuGaL consistently achieves higher mean performance than the evaluated baselines across multiple tasks, improving AUC from 0.601 to 0.8496 for survival, from 0.672 to 0.8102 for recurrence, and achieving 0.975 AUC for HPV prediction on HECKTOR. For survival analysis, it further achieves a concordance index of 0.6814 for overall survival, 0.7421 for recurrence-free survival, and 0.7143 for progression-free survival on HANCOCK, and 0.7203 for recurrence-free survival on HECKTOR. Qualitative analyses further show that Multi-FRuGaL learns discriminative and robust multimodal representations, even under severe missing-modality conditions.
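One way to see why gated fusion can remain well-defined when modalities are missing is a masked softmax over the modalities that are actually present. The sketch below is an illustration under stated assumptions, not the Multi-FRuGaL implementation: the gate logits are passed in directly (the abstract describes an input-conditioned gating network), the signal decomposition layer is omitted, and `gated_fusion` and the modality names are hypothetical.

```python
import math

def gated_fusion(embeddings, gate_logits):
    """Fuse per-modality embeddings with a softmax gate over present modalities.

    embeddings: dict modality -> list of floats, or None if the modality is missing.
    gate_logits: dict modality -> float score (assumed to come from a gating network).
    """
    present = [m for m, e in embeddings.items() if e is not None]
    if not present:
        raise ValueError("at least one modality must be present")
    # Softmax restricted to present modalities (numerically stabilized).
    top = max(gate_logits[m] for m in present)
    exps = {m: math.exp(gate_logits[m] - top) for m in present}
    total = sum(exps.values())
    weights = {m: exps[m] / total for m in present}
    dim = len(embeddings[present[0]])
    fused = [sum(weights[m] * embeddings[m][i] for m in present) for i in range(dim)]
    return fused, weights

# Hypothetical patient record with the pathology embedding missing.
emb = {"ct": [1.0, 0.0], "pathology": None, "clinical": [0.0, 1.0]}
logits = {"ct": 0.0, "pathology": 0.0, "clinical": 0.0}
fused, weights = gated_fusion(emb, logits)
print(fused, weights)
```

Because the normalization runs only over present modalities, the weights always sum to one, and a patient with a single available modality simply passes that embedding through.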

2606.06866 2026-06-08 cs.LG nucl-th 新提交

Product units in gated recurrent units improve nuclear-mass prediction

门控循环单元中的乘积单元改进核质量预测

Ziyuan Li, Paulo S. A. Freitas, John W. Clark, Babette Dellen

发表机构 * University of Applied Sciences Koblenz（科布伦茨应用科学大学）； Technical University of Munich（慕尼黑工业大学）； University of Madeira（马德拉大学）； Washington University in St. Louis（圣路易斯华盛顿大学）

AI总结提出基于复数域加法-乘法乘积单元门控循环单元（AM-PU-GRU）的机器学习模型，通过整合乘积单元变换和复数计算，在核质量预测中实现插值RMSE 0.227 MeV和外推RMSE 0.179 MeV，超越现有模型。

Comments Accepted at ICCS 2026

AI中文摘要

使用机器学习预测原子核质量可以补充理论模型，并推进对核图表中未知领域的探索。我们提出了一种基于门控循环单元（GRU）的机器学习技术，该技术通过利用长期依赖关系在核质量预测中展现出竞争性能。通过在循环单元内整合乘法交互和乘积单元变换，我们报告了核质量预测的显著改进。计算在复数域中进行，以联合捕捉幅度和相位动态。对于基于原子质量评估（AME2016和AME2020）的插值和时间外推任务，复数加法-乘法乘积单元门控循环单元（AM-PU-GRU）模型始终实现最低的预测误差，插值RMSE为0.227 ± 0.004 MeV，外推RMSE为0.179 ± 0.015 MeV。这些结果超越了其他最先进的机器学习模型，也优于实值GRU基线和乘积单元消融变体，同时对不同的理论先验（包括WS4和SEMF）保持鲁棒性。我们的发现确立了复数乘积单元循环网络作为基于序列的核质量预测的新基准。

英文摘要

The prediction of masses of atomic nuclei using machine learning can complement theoretical models and advance the exploration of poorly known domains of the nuclear chart. We propose a machine learning technique based on gated recurrent units (GRU), which have demonstrated competitive performance in nuclear-mass prediction by exploiting long-term dependencies. By integrating multiplicative interactions and product-unit transformations within recurrent units, we report significant improvements in nuclear-mass prediction. Computations are performed in the complex domain to jointly capture amplitude and phase dynamics. For interpolation and temporal-extrapolation tasks based on the atomic mass evaluation (AME2016 and AME2020), the complex additive-multiplicative product-unit gated recurrent unit (AM-PU-GRU) model consistently achieves the lowest prediction errors, with an interpolation RMSE of 0.227 $\pm$ 0.004 MeV and an extrapolation RMSE of 0.179 $\pm$ 0.015 MeV. These results surpass other state-of-the-art machine learning models and also outperform the real-valued GRU baseline and product-unit ablation variants, while remaining robust to different theoretical priors, including WS4 and SEMF. Our findings establish complex-valued product-unit recurrent networks as a new benchmark for sequence-based nuclear-mass prediction.
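For readers unfamiliar with product units, the following sketch shows the basic idea in its textbook form: a unit computes a weighted product of its inputs, which can be evaluated as an exponential of a weighted sum of logarithms. Working in the complex domain keeps the unit defined for negative inputs. This is a generic illustration only; the function name `product_unit` is hypothetical, and the AM-PU-GRU recurrent cell from the paper is not reproduced here.

```python
import cmath

def product_unit(inputs, exponents):
    """Compute prod_i x_i ** w_i as exp(sum_i w_i * log(x_i)) in the complex domain.

    Uses the principal branch of the complex logarithm, so negative inputs are
    allowed; an input of exactly zero is not (log(0) is undefined).
    """
    weighted_log_sum = sum(w * cmath.log(complex(x)) for x, w in zip(inputs, exponents))
    return cmath.exp(weighted_log_sum)

# 2**2 * 3**1 = 12 (returned as a complex number with negligible imaginary part).
print(product_unit([2.0, 3.0], [2.0, 1.0]))
# (-1)**0.5 on the principal branch is the imaginary unit.
print(product_unit([-1.0], [0.5]))
```

The second call shows why a real-valued implementation would fail: fractional powers of negative inputs leave the real line, which is one motivation for complex-valued computation.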

2606.06865 2026-06-08 cs.CL 新提交

Are Large Language Models Suitable for Graph Computation? Progress and Prospects

大型语言模型是否适合图计算？进展与展望

Yuting Zhang, Yi Han, Kai Wang, Wei Ni, Angela Bonifati, Wenjie Zhang

发表机构 * University of New South Wales（新南威尔士大学）； Antai College of Economics and Management, Shanghai Jiao Tong University（上海交通大学安泰经济管理学院）； Edith Cowan University（埃迪斯科文大学）； Lyon 1 University（里昂第一大学）

AI总结本文通过角色分类法综述LLM在图计算中的应用，分析作为执行者和规划者的两种范式，指出LLM适用于简单小规模任务，但在大规模和精确性要求高的任务中不可靠，并总结数据集和未来方向。

AI中文摘要

大型语言模型（LLMs）越来越多地被探索用于图计算，其中任务需要对结构化关系和算法操作进行推理。然而，目前尚不清楚LLMs何时能可靠地支持此类计算，以及如何将它们整合到图求解流程中。现有的关于LLMs和图交叉的综述主要关注图学习、文本属性图或图语言建模。为弥补这一空白，我们通过基于角色的分类法对LLMs在图计算中的应用进行了全面综述。具体来说，我们识别出两种主要范式：i) LLMs作为执行者，模型直接从图描述和指令中解决图任务；ii) LLMs作为规划者，模型制定问题、分解推理步骤，并调用外部工具或代理执行。基于此分类法，我们分析了当前方法的优势和局限性。我们的综述表明，LLMs在简单、小规模任务中具有潜力，但在大规模和精确性要求高的任务中仍不可靠。最后，我们总结了可用的数据集，并提出了四个未来方向。

英文摘要

Large language models (LLMs) have been increasingly explored for graph computation, where tasks require reasoning over structured relationships and algorithmic operations. Yet, it remains unclear when LLMs can reliably support such computation and how they should be incorporated into graph-solving pipelines. Existing surveys at the intersection of LLMs and graphs primarily focus on graph learning, text-attributed graphs, or graph-language modeling. To bridge this gap, we provide a comprehensive review of LLMs for graph computation through a role-based taxonomy. Specifically, we identify two major paradigms: i) LLMs as executors, where models directly solve graph tasks from graph descriptions and instructions; and ii) LLMs as planners, where models formulate problems, decompose reasoning steps, and invoke external tools or agents for execution. Based on this taxonomy, we analyze the strengths and limitations of current methods. Our review indicates that LLMs are promising for simple, small-scale tasks, but remain unreliable for large-scale and exactness-demanding tasks. Finally, we summarize available datasets and suggest four future directions.

2606.06864 2026-06-08 cs.CV cs.LG 新提交

LRMIL: Efficient Low-Resolution Multiple Instance Learning via High-Resolution Knowledge Distillation for Whole Slide Image Classification

LRMIL: 通过高分辨率知识蒸馏实现全切片图像分类的高效低分辨率多实例学习

Yonghan Shin, Won-Ki Jeong

发表机构 * Department of Computer Science and Engineering, Korea University, Seoul, Korea（高丽大学计算机科学与工程系）

AI总结提出LRMIL框架，通过两阶段知识蒸馏将高分辨率知识迁移到低分辨率表示，在推理时仅使用低分辨率图像块，显著降低计算成本并提升分类性能。

AI中文摘要

多实例学习（MIL）已成为数字病理学中全切片图像（WSI）分析的标准范式，因为它无需密集标注即可实现切片级预测。现有的MIL方法通常依赖于高分辨率图像块的详尽提取和编码。然而，这种做法在真实临床环境中存在两个关键限制：难以在较低放大倍数下捕获全局视觉线索，并且由于每张切片包含大量高分辨率图像块而导致巨大的计算开销。为了解决这些限制，我们提出了一种高效的低分辨率多实例学习（LRMIL）框架，该框架将高分辨率知识迁移到低分辨率表示。LRMIL采用两阶段蒸馏策略。首先，图像块级别的跨分辨率蒸馏将低分辨率图像块嵌入与高分辨率表示对齐。其次，切片级知识蒸馏在切片级监督和教师指导下训练低分辨率学生MIL模型。在推理时，LRMIL仅处理低分辨率图像块，大幅减少了数据预处理和计算成本。在多个WSI基准上的大量实验表明，LRMIL在实现更高效推理的同时，始终优于最先进的MIL方法。这些结果凸显了LRMIL作为临床病理学中WSI分析的实用且可扩展的解决方案。

英文摘要

Multiple instance learning (MIL) has become a standard paradigm for whole slide image (WSI) analysis in digital pathology, as it enables slide-level prediction without dense annotations. Existing MIL methods typically rely on exhaustive extraction and encoding of high-resolution patches. However, this practice suffers from two critical limitations in real-world clinical settings: it struggles to capture global visual cues at lower magnifications, and incurs substantial computational overhead due to the massive number of high-resolution patches per slide. To address these limitations, we propose an efficient low-resolution multiple instance learning (LRMIL) framework that transfers high-resolution knowledge to low-resolution representations. LRMIL adopts a two-stage distillation strategy. First, patch-level cross-resolution distillation aligns low-resolution patch embeddings with high-resolution representations. Second, slide-level knowledge distillation trains a low-resolution student MIL model under both slide-level supervision and teacher guidance. At inference time, LRMIL operates exclusively on low-resolution patches, substantially reducing data preprocessing and computational cost. Extensive experiments on multiple WSI benchmarks demonstrate that LRMIL consistently outperforms state-of-the-art MIL methods while achieving more efficient inference. These results highlight LRMIL as a practical and scalable solution for WSI analysis in clinical pathology.

2606.06861 2026-06-08 cs.LG cs.AI 新提交

Modeling Nonlinear Feature Interactions with Product-Unit Residual Networks

使用乘积单元残差网络建模非线性特征交互

Ziyuan Li, Uwe Jaekel, Babette Dellen

发表机构 * University of Applied Sciences Koblenz（科布伦茨应用科学大学）； Technical University of Munich（慕尼黑工业大学）

AI总结提出乘积单元残差网络（PURe），通过显式建模特征交互提升鲁棒性和可解释性，在合成和真实数据集上优于MLP。

Comments Accepted at ICCS 2026

AI中文摘要

理解非线性特征交互在科学和工程中至关重要，然而标准多层感知器（MLP）通常仅隐式地捕获此类交互，导致表征纠缠，可能损害鲁棒性和可解释性。我们研究了乘积单元残差网络（PURe），它将乘法乘积单元与残差连接相结合，以显式建模跨特征耦合，同时稳定优化。我们在一个基于交互的合成基准和两个真实世界数据集上进行了系统评估，考察了预测准确性、对高斯特征噪声的鲁棒性以及在有限训练数据下的性能，并在匹配参数预算下比较了实值和复值变体。除了准确性，基于SHapley Additive exPlanations（SHAP）的交互分析表明，与MLP基线相比，PURe学习了更集中且结构更连贯的交互模式。总体而言，PURe实现了具有竞争力或更好的性能，在低数据场景下具有更好的鲁棒性和样本效率，并增强了交互级别的可解释性。

英文摘要

Understanding nonlinear feature interactions is crucial in science and engineering, yet standard multilayer perceptrons (MLPs) often capture such interactions only implicitly, leading to entangled representations that can impair robustness and interpretability. We investigate product-unit residual networks (PURe) that integrate multiplicative product units with residual connections to explicitly model cross-feature couplings while stabilizing optimization. We conduct a systematic evaluation on an interaction-driven synthetic benchmark and two real-world datasets, assessing predictive accuracy, robustness to Gaussian feature noise, and performance under limited training data, and we compare real- and complex-valued variants under a matched parameter budget. Beyond accuracy, SHapley Additive exPlanations (SHAP)-based interaction analyses show that PURe learns more concentrated and structurally coherent interaction patterns than MLP baselines. Overall, PURe achieves competitive or improved performance, better robustness and sample efficiency in low-data regimes, and enhanced interaction-level interpretability.

2606.06857 2026-06-08 cs.CL 新提交

Interpreting Brain Responses to Language with Sparse Features from Language Models

用语言模型稀疏特征解释大脑对语言的响应

Michael A. Lepori, Kendrick Kay, Greta Tuckute

发表机构 * Brown University（布朗大学）； University of Minnesota（明尼苏达大学）； Harvard University（哈佛大学）

AI总结提出增强稀疏编码模型，用分层稀疏自编码器特征替代密集LM隐状态，并加入惊奇度预测器，解释大脑语言皮层响应，发现额颞语言网络各区域由一组共同特征预测，且大脑响应最能由捕捉LM表征中最通用信息的特征解释。

AI中文摘要

认知神经科学的一个核心目标是刻画人类语言皮层所表征的特征。人工语言模型已成为应对这一挑战的有力工具，但将生物表征与人工表征相关联的研究常被批评为将一个黑箱与另一个黑箱相关联。本文引入增强稀疏编码模型，一种用分层组织的稀疏自编码器特征替代密集LM隐状态，并显式包含惊奇度作为预测因子的编码框架。利用该方法，我们(i) 产生对神经响应的解释，并(ii) 测试模型-大脑对齐是否反映了LM表征中的主要变异或特异变异。使用8名参与者聆听200句语言多样性句子的高场7T fMRI数据集，我们首先通过恢复先前对处理难度和意义抽象性调谐的体素群体的解释来验证建模框架。然后，我们解释了一个先前未表征（但可靠）的体素群体，发现其调谐于与人相关的内容。接着，我们显示额颞叶人类语言网络由其组成区域间的共同特征集预测，但发现额叶区域即使在没有LM特征的情况下也能被惊奇度单独较好地解释。最后，我们显示语言处理过程中的大脑响应并非仅能从任意一组LM特征预测。相反，大脑响应最好由倾向于捕捉LM表征中编码的最通用信息的特征解释，表明大脑与LM语言表征之间存在非平凡的对齐。

英文摘要

A central goal of cognitive neuroscience is to characterize the features that are represented by human language cortex. Artificial language models (LMs) have emerged as a powerful tool to address this challenge, but studies relating biological and artificial representations are often criticized as relating one black box to another. The present work introduces Augmented Sparse Encoding Models, an encoding framework that replaces dense LM hidden states with hierarchically-organized sparse autoencoder (SAE) features, while explicitly including surprisal as a predictor. Using this approach, we (i) produce interpretations of neural responses and (ii) test whether model-brain alignment reflects primary or idiosyncratic variation in LM representations. Using a high-field 7T fMRI dataset of eight participants listening to 200 linguistically diverse sentences, we first validate our modeling framework by recovering previous interpretations of voxel populations tuned to processing difficulty and meaning abstractness. We then interpret a previously-uncharacterized (but reliable) voxel population and find that it is tuned to people-related content. Next, we show that the fronto-temporal human language network is predicted by a common set of features across its constituent regions, but find that frontal regions are relatively well-explained by surprisal alone, even in the absence of LM-based features. Finally, we show that brain responses during language processing are not merely predictable from an arbitrary set of LM features. Rather, brain responses are best explained by the features that tend to capture the most general information encoded in LM representations, suggesting a nontrivial correspondence between brain and LM language representation.

2606.06856 2026-06-08 cs.CV 新提交

FS-DVS: A Frequency-Selective Dynamic Visual Sensing Paradigm for Enhancing Information Completeness

FS-DVS：一种增强信息完整性的频率选择性动态视觉传感范式

Feiyu Ji, Xiaokang Yang, Xiaoyun Yuan

发表机构 * University of Science and Technology of China（中国科学技术大学）

AI总结提出FS-DVS范式，通过在事件触发前集成可学习空间滤波器模拟视网膜神经节细胞聚合机制，自发学习中心-环绕模式以增强中频信息，在目标检测和动作识别中取得显著性能提升。

AI中文摘要

动态视觉传感器（DVS）通过异步报告像素级强度变化，提供卓越的时间分辨率和动态范围。然而，传统DVS依赖每像素独立触发机制，忽略了生物视网膜神经节细胞（RGC）执行的空间整合。因此，它们缺乏对比度敏感函数（CSF）及其对中空间频率的固有敏感性，这不可避免地因亚阈值信号丢失而导致信息不完整。为弥补这一差距，我们提出FS-DVS（频率选择性动态视觉传感器），一种新颖范式，它在事件触发过程之前严格集成一个可学习空间滤波器，以模拟RGC聚合机制。通过开发可微分事件模拟框架，空间滤波器可以与下游任务进行端到端优化。我们的研究揭示，从δ函数开始，学习到的空间滤波器自发演变为强调中频分量的中心-环绕模式，与人类CSF一致。除了在目标检测和动作识别中实现显著的性能提升外，不同任务中向类人CSF特性的一致收敛强调了这种中频选择性机制的普遍性。与单纯提高传感器灵敏度或依赖后处理相比，我们的范式实现了具有高噪声鲁棒性的选择性信息增强，为下一代神经形态传感器提供了稳健且生物合理的蓝图。

英文摘要

Dynamic vision sensors (DVS) offer exceptional temporal resolution and dynamic range by asynchronously reporting pixel-level intensity changes. However, conventional DVS rely on a per-pixel independent triggering mechanism, ignoring the spatial integration performed by biological retinal ganglion cells (RGCs). Consequently, they lack the contrast sensitivity function (CSF) and its inherent sensitivity to mid-spatial frequencies, which inevitably leads to information incompleteness due to sub-threshold signal loss. To bridge this gap, we propose FS-DVS (Frequency-Selective Dynamic Vision Sensor), a novel paradigm that integrates a learnable spatial filter strictly preceding the event triggering process to mimic the RGC aggregation mechanism. By developing a differentiable event simulation framework, the spatial filter can be optimized end-to-end with downstream tasks. Our study reveals that starting from a delta function, the learned spatial filters spontaneously evolve into center-surround patterns that emphasize mid-frequency components, consistently aligning with human CSF. Beyond achieving substantial performance gains in object detection and action recognition, the consistent convergence to human-like CSF characteristics across different tasks underscores the universality of this mid-frequency selective mechanism. Compared to naively increasing sensor sensitivity or relying on post-processing, our paradigm achieves selective information enhancement with high noise resilience, providing a robust, biologically plausible blueprint for next-generation neuromorphic sensors.
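The effect of placing a spatial filter before event triggering can be sketched on a one-dimensional row of pixels. The example below assumes a simple event model (an event fires when the change in log intensity reaches a threshold) and a hand-picked center-surround kernel `[-0.5, 2.0, -0.5]`; in the paper the filter is learned end-to-end, so the kernel values, the threshold, and the function names here are purely illustrative.

```python
import math

def filter1d(signal, kernel):
    """Apply a symmetric 1D kernel with replicated borders."""
    radius = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(i + k - radius, 0), len(signal) - 1)
            acc += w * signal[j]
        out.append(acc)
    return out

def events(prev, curr, threshold, kernel=None):
    """Return +1 / -1 / 0 per pixel from the change in (optionally filtered) log intensity."""
    log_prev = [math.log(v) for v in prev]
    log_curr = [math.log(v) for v in curr]
    if kernel is not None:
        # Spatial aggregation happens before the threshold comparison.
        log_prev = filter1d(log_prev, kernel)
        log_curr = filter1d(log_curr, kernel)
    out = []
    for a, b in zip(log_prev, log_curr):
        delta = b - a
        out.append(1 if delta >= threshold else -1 if delta <= -threshold else 0)
    return out

prev = [1.0, 1.0, 1.0, 1.0, 1.0]
curr = [1.0, 1.0, math.exp(0.15), 1.0, 1.0]  # a small, localized brightening
center_surround = [-0.5, 2.0, -0.5]          # hypothetical kernel, sums to 1

print(events(prev, curr, threshold=0.2))                          # per-pixel triggering
print(events(prev, curr, threshold=0.2, kernel=center_surround))  # filtered triggering
```

With independent per-pixel triggering the 0.15 log-intensity change stays below the 0.2 threshold and is lost; with the center-surround kernel the same local change is amplified to 0.3 at the center pixel and produces an event, while its neighbors stay silent.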

2606.06854 2026-06-08 cs.LG 新提交

The Geometry of Last-Layer Model Stealing

最后一层模型窃取的几何学

Snigdha Chandan Khilar

发表机构 * Independent Researcher（独立研究者）

AI总结利用几何学解释如何通过已知方法窃取机器学习模型，展示了完美复制Transformer网络最后一层的条件，并揭示了隐藏层的限制。

2606.06853 2026-06-08 cs.CV cs.AI 新提交

MotionEnhancer: Leveraging Video Diffusion for Motion-Enhanced Vision-Language Models

MotionEnhancer: 利用视频扩散模型增强运动感知的视觉-语言模型

Yifan Xu, Chao Zhang, Ruifei Ma, Fei Gao, Zhifei Yang, Jiaxing Qi, Zhipeng Chen

发表机构 * School of Computer Science and Engineering, Beihang University（北航计算机科学与工程学院）； Beijing Digital Native Digital City Research Center（北京数字原生数字城研究中心）； School of Computer Science, Peking University（北京大学计算机学院）； School of Artificial Intelligence, Beijing University of Posts and Telecommunications（北京邮电大学人工智能学院）

AI总结提出MotionEnhancer，通过从视频扩散模型中提取运动先验并利用注意力对齐增强视觉-语言模型的运动理解能力，无需额外参数或架构修改，在运动级视频理解基准上取得一致提升。

Comments Accepted by CVPR 2026

AI中文摘要

新时代见证了视觉-语言模型（VLM）在视频理解任务中的显著能力扩展。虽然当前的VLM在事件或故事级别的理解上表现出色，但它们捕捉细粒度运动细节的能力仍然有限，这主要是由于它们关注高层静态语义结构和宏观事件逻辑。相比之下，视频扩散模型（VDM）擅长建模动态运动模式，得益于大规模视频数据和时序生成的内在需求。在本文中，我们介绍了MotionEnhancer，一种新颖的方法，它利用从强大视频扩散模型中提取的运动先验作为辅助监督，通过注意力对齐增强VLM的运动理解能力。MotionEnhancer包含两个简单的无参数模块：运动敏感头选择（MHS）和运动显著文本标记识别（MTTI），以仅计算的方式直接从VDM中提取和优化与运动相关的注意力。MotionEnhancer为运动理解提供了可扩展的解决方案，无需额外的训练参数、修改现有架构或工具调用。大量实验表明，在两个运动级视频理解基准上，MotionEnhancer能够在最先进的VLM上实现一致的改进，尤其是在运动相关指标上。

英文摘要

The new era has witnessed a remarkable capability to extend Vision-Language Models (VLMs) for tackling tasks of video understanding. While current VLMs excel at event- or story-level understanding, their ability to capture fine-grained motion details remains limited, primarily due to their focus on high-level static semantic structures and macro-event logic. In contrast, Video Diffusion Models (VDMs) are adept at modeling dynamic motion patterns, benefiting from large-scale video data and the intrinsic requirement of temporal generation. In this paper, we introduce MotionEnhancer, a novel approach that leverages motion priors distilled from a powerful video diffusion model as auxiliary supervision to enhance the motion understanding capability of a VLM via attention alignment. MotionEnhancer comprises two simple parameter-free modules, Motion-sensitive Head Selection (MHS) and Motion-salient Text Token Identification (MTTI), to directly extract and optimize motion-related attentions from the VDM in a computation-only manner. MotionEnhancer provides a scalable solution for motion understanding without additional training parameters, modifications to existing architectures, or tool calling. Extensive experiments demonstrate that MotionEnhancer can achieve consistent improvements over state-of-the-art VLMs on two motion-level video understanding benchmarks, especially on motion-related metrics.

2606.06850 2026-06-08 cs.CV 新提交

CFRNet: Cycle-Consistent Fixed-Point Training for Real-Time Blind Face Restoration on Consumer Embedded NPUs

CFRNet: 用于消费级嵌入式NPU上实时盲脸修复的循环一致不动点训练

Fuchen Li, Xinyang Wang, Yahui Zhang, Yuhan Chen, Jiahong Guo, Zhuohan Qin, Wenbo Ma

发表机构 * University of Florida（佛罗里达大学）； University of Southampton（南安普顿大学）； Chongqing University（重庆大学）； Qingdao University（青岛大学）； Intel Asia-Pacific Research & Development Ltd（英特尔亚太研发有限公司）

AI总结提出CFRNet，一种2.0M参数的ResNet风格修复网络，通过循环一致不动点训练（CCFP）在消费级NPU上实现高质量盲脸修复，兼顾速度与效果，LPIPS比单次循环降低31%。

Comments 12 pages. Code and project page will be released

AI中文摘要

消费设备上的盲脸修复必须在图像质量与速度和内存之间取得平衡。GFPGAN和CodeFormer等强方法提供了良好的感知质量，但它们依赖于大型预训练生成先验以及注意力、码本查找和风格调制等操作，这些操作难以在消费硬件中使用的小型神经处理单元（NPU）上编译和量化。小型卷积修复器运行速度足够快，但往往过度平滑，并在眼睛、鼻子和嘴巴周围留下伪影。我们提出了CFRNet，一个2.0M参数的ResNet风格修复器，用于在消费级NPU上常见的$256\times256$人脸裁剪尺寸的端侧使用。主要思想是循环一致不动点训练（CCFP）。我们不是训练网络进行单次前向传播然后手动多次运行，而是训练它作为一个不动点算子，使得对修复后的人脸再次应用该网络不会改变人脸。CCFP使用三种训练损失，即渐进式多周期监督、幂等损失和重新退化循环损失，并且在推理时不增加任何成本。为了在我们的部署限制下进行公平比较，我们在相同的$256\times256$分辨率下从头重新训练所有基线。在300张图像的测试集上，CFRNet达到了最佳感知分数（三次循环时LPIPS为0.250，比一次循环低31%），并且在两次循环时也达到了最佳PSNR和SSIM。在HiSilicon Hi3402 NPU上，它以INT8格式每次循环运行约23毫秒，而相同的基线无法编译到该芯片上。循环次数$k$作为一个简单的质量旋钮，无需重新训练：PSNR在$k=2$时最佳，LPIPS在$k=3$时持续改善。我们进一步表明，同样的思想适用于更易于部署的普通CNN，并在车载驾驶员监控板上实时运行模型。

英文摘要

Blind face restoration on consumer devices has to balance image quality against speed and memory. Strong methods such as GFPGAN and CodeFormer give good perceptual quality, but they rely on large pretrained generative priors and on operators such as attention, codebook lookup, and style modulation that are hard to compile and quantize on the small neural processing units (NPUs) used in consumer hardware. Small convolutional restorers run fast enough, but they tend to over-smooth and to leave artifacts around the eyes, nose, and mouth. We present CFRNet, a 2.0M-parameter ResNet-style restorer for on-device use at $256\times256$, the common face-crop size on consumer NPUs. The main idea is Cycle-Consistent Fixed-Point Training (CCFP). Instead of training the network for one pass and then running it several times by hand, we train it to act as a fixed-point operator, so that applying it again to a restored face does not change the face. CCFP uses three training losses, namely progressive multi-cycle supervision, an idempotence loss, and a re-degradation cycle loss, and it adds no cost at inference. To compare fairly under our deployment limits, we retrain all baselines from scratch at the same $256\times256$ resolution. On a 300-image test set, CFRNet reaches the best perceptual score (LPIPS 0.250 at three cycles, which is 31% lower than one cycle) and also the best PSNR and SSIM at two cycles. It runs in about 23 ms per cycle in INT8 on a HiSilicon Hi3402 NPU, while the same baselines cannot be compiled to that chip. The cycle count $k$ acts as a simple quality knob that needs no retraining: PSNR is best at $k=2$ and LPIPS keeps improving up to $k=3$. We further show that the same idea works with a plain CNN that is even easier to deploy, and we run the model in real time on an in-car driver-monitoring board.
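The fixed-point idea can be made concrete with a small sketch. It assumes the idempotence term is a mean squared difference between one and two applications of the restorer, which is one plausible form and not necessarily the loss used in the paper; `apply_cycles`, `idempotence_loss`, and the two toy operators are hypothetical stand-ins for a real network.

```python
def apply_cycles(restore, image, k):
    """Run the restorer k times; k plays the role of the inference-time cycle count."""
    for _ in range(k):
        image = restore(image)
    return image

def idempotence_loss(restore, image):
    """Mean squared difference between restore(image) and restore(restore(image)).

    The loss is zero exactly when a second pass leaves the first output unchanged.
    """
    once = restore(image)
    twice = restore(once)
    return sum((a - b) ** 2 for a, b in zip(once, twice)) / len(once)

# Toy operators on flat pixel lists (stand-ins for a restoration network).
def clamp(image):
    return [min(max(v, 0.0), 1.0) for v in image]   # already a fixed-point operator

def halve(image):
    return [0.5 * v for v in image]                 # keeps changing on every pass

print(idempotence_loss(clamp, [-0.5, 0.5, 1.5]))
print(idempotence_loss(halve, [1.0, 2.0]))
print(apply_cycles(halve, [1.0, 2.0], 3))
```

An operator with zero idempotence loss gives the same output for any cycle count of one or more, whereas an operator that is not a fixed-point map keeps drifting as the cycle count grows.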

2606.06842 2026-06-08 cs.CL 新提交

CRAFT: A Unified Counterfactual Reasoning Framework for Tabular Question Answering and Fact Verification

CRAFT：面向表格问答与事实验证的统一反事实推理框架

Chenshuo Pan, Yu Zhao, Jie Zhang, Changzai Pan, Zhenhe Wu, Jiayi Liang, Yujie Mao, Shuangyong Song, Yongxiang Li, Zhongjiang He

发表机构 * Xingchen AGI Lab, China Telecom Artificial Intelligence Technology (Beijing) Co., Ltd（星辰AGI实验室，中国电信人工智能技术（北京）有限公司）

AI总结提出CRAFT统一反事实推理框架，将表格问答和事实验证转化为双向验证过程，通过构建声明及其反事实变体并加权整合证据，显著提升复杂表格推理性能。

Comments 24 pages, 10 figures

AI中文摘要

表格推理对大型语言模型（LLMs）仍然具有挑战性，尤其是在需要多步推理的长且结构化的表格任务中。现有方法主要依赖单向推理，限制了其跨任务探索替代假设的能力。在这项工作中，我们提出了CRAFT，一个统一的反事实推理框架，将表格问答和事实验证重新表述为通用的双向验证过程。我们的方法显式地构建声明性陈述及其反事实变体。然后，沿着原始路径和反事实路径进行推理提取证据，并通过加权机制整合以得出最终答案。实验结果表明，我们的方法在WikiTQ和TabFact等表格推理数据集上持续优于代表性基线，在复杂问答上取得了特别大的改进。我们的框架还显著缩小了不同骨干LLM之间的性能差距。这表明反事实推理有效克服了单向推理的局限性，引导LLM进行更具辨别力的推理，并为结构化推理任务建立了更原则性的范式。我们的代码将在接收后公开。

英文摘要

Table reasoning remains challenging for large language models (LLMs), particularly in tasks that require multi-step inference over long and structured tables. Existing approaches predominantly rely on single-direction reasoning, which limits their ability to explore alternative hypotheses across tasks. In this work, we propose CRAFT, a unified Counterfactual Reasoning Framework that reformulates Tabular question answering and fact verification into a general bidirectional verification process. Our method explicitly constructs both declarative statements and their counterfactual variants. Evidence is then extracted from reasoning along both the original and counterfactual paths, and integrated via a weighted mechanism to arrive at the final answer. Experimental results show that our approach consistently surpasses representative baselines on table reasoning datasets such as WikiTQ and TabFact, achieving especially large improvements on complex question answering. Our framework also significantly mitigates performance gaps between different backbone LLMs. This indicates that counterfactual reasoning effectively overcomes the limitations of single-direction inference, guiding LLMs toward more discerning reasoning and establishing a more principled paradigm for structured reasoning tasks. Our code will be made publicly available upon acceptance.
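The abstract does not specify the weighting mechanism, so the following is only a sketch of how evidence from an original and a counterfactual reasoning path might be combined. It assumes each candidate answer comes with two scores in [0, 1]: support for the declarative statement and support for its counterfactual variant. The functions `combine` and `verify`, the default weight, and the example scores are all hypothetical.

```python
def combine(support_original, support_counterfactual, weight=0.5):
    """Blend direct support with the *lack* of support for the counterfactual."""
    return weight * support_original + (1 - weight) * (1 - support_counterfactual)

def verify(candidates, weight=0.5):
    """candidates: dict answer -> (support_original, support_counterfactual)."""
    scored = {
        answer: combine(original, counterfactual, weight)
        for answer, (original, counterfactual) in candidates.items()
    }
    best = max(scored, key=scored.get)
    return best, scored

# Hypothetical scores: answer "A" looks best in one direction only,
# but its counterfactual is also well supported, which weakens it.
candidates = {"A": (0.7, 0.6), "B": (0.6, 0.1)}
best, scored = verify(candidates)
print(best, scored)
```

In this toy case single-direction scoring would select "A", while the bidirectional score prefers "B" because the table evidence contradicts B's counterfactual much more strongly.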

2606.06840 2026-06-08 cs.CL cs.AI cs.LG 新提交

Characterize Then Distill: Mechanistic Reasoning in Large Output Spaces

先刻画再蒸馏：大输出空间中的机械推理

Debjyoti Saha Roy, Byron C. Wallace, Javed A. Aslam

发表机构 * Khoury College of Computer Sciences, Northeastern University（美国东北大学Khoury计算机科学学院）

AI总结研究现代推理模型在百万级标签空间中实现零样本多标签分类的机制，提出“候选列表生成+精细推理”两阶段模型，并基于此开发机械蒸馏策略，优于标准蒸馏。

2606.06836 2026-06-08 cs.RO cs.AI cs.CV 新提交

Think Like a Pilot: Fine-Grained Long-Horizon UAV Navigation

像飞行员一样思考：细粒度长时程无人机导航

Xiangyi Zheng, Xiangyu Wang, Qinan Liao, Zimu Tang, Yue Liao, Dongyue Lyu, Guodong Wang, Junjie Liu, Si Liu

发表机构 * Colab ； Beihang University（北航）； Meituan（美团）； National University of Singapore（新加坡国立大学）

AI总结提出FLIGHT基准和FLIGHT VLA异步架构，通过低频飞行员推理VLM与高频扩散动作模型解耦，实现无人机长时程语义指令下的平滑连续飞行控制。

AI中文摘要

语言引导的无人机代理必须执行长时程语义指令，同时产生平滑、物理可行的连续飞行命令，然而现有的视觉语言导航（VLN）基准通常使用离散或粗粒度的动作，而现有的无人机视觉-语言-动作（VLA）任务则专注于短时、原子化的机动。为了解决无人机任务设置中的这一空白，我们引入了FLIGHT，一个用于混合无人机导航与推理任务的细粒度长时程指令引导基准，该基准结合了多阶段指令与密集的6-DoF轨迹注释，分为两个数据集：细粒度VLN和长时程流。为了使无人机代理具备对任务执行状态和任务规划进行实时飞行推理的能力，同时适应高频、实时的精确控制，我们进一步提出了FLIGHT VLA，一种异步架构，将用于任务状态推理的低频流式飞行员视觉语言模型（VLM）与用于连续控制的高频扩散动作模型解耦，并由显式的飞行员推理文本进行监督，该文本总结了当前飞行状态并预测下一个子目标。在闭环评估中，FLIGHT VLA在我们的FLIGHT基准上持续优于代表性的VLN和VLA基线，实现了更强的多阶段完成、子目标遵循和终端控制。其训练的流式飞行员推理VLM进一步提升了无人机视频推理，验证了我们设计的有效性。

英文摘要

Language-guided UAV agents must execute long-horizon semantic instructions while producing smooth, physically feasible continuous flight commands, yet existing Vision-Language Navigation (VLN) benchmarks typically use discrete or coarse actions and existing UAV Vision-Language-Action (VLA) tasks focus on short, atomic maneuvers. To address this gap in UAV task settings, we introduce FLIGHT, a Fine-grained Long-horizon Instruction-Guided benchmark for Hybrid UAV navigation and reasoning Tasks, which combines multi-stage instructions with dense 6-DoF trajectory annotations across two dataset splits: Fine-grained VLN and Long-horizon Flow. To endow the UAV agent with the capability of real-time in-flight reasoning over task execution status and mission planning, while simultaneously accommodating high-frequency, real-time precise control, we further propose FLIGHT VLA, an asynchronous architecture that decouples a low-frequency Streaming Pilot Vision-Language Model (VLM) for task-state reasoning from a high-frequency diffusion action model for continuous control, supervised by explicit Pilot Reasoning texts that summarize the current flight state and anticipate the next subgoal. In closed-loop evaluation, FLIGHT VLA consistently surpasses representative VLN and VLA baselines on our FLIGHT benchmarks, achieving stronger multi-stage completion, subgoal adherence, and terminal control. Its trained Streaming Pilot Reasoning VLM further improves UAV video reasoning, validating the effectiveness of our design.

2606.06835 2026-06-08 cs.CL 新提交

Translate-R1: Cost-Aware Translation Tool Use via Reinforcement Learning

Translate-R1：通过强化学习实现成本感知的翻译工具使用

Pratik Jayarao, Chaitanya Dwivedi, Himanshu Gupta, Neeraj Varshney, Adithya M Devraj, Meet Vadera, Priyanka Nigam, Bing Yin

发表机构 * Amazon Stores Foundation AI（亚马逊商店基础人工智能）

AI总结提出一种基于强化学习的门控策略，让LLM自主评估理解能力，仅在必要时调用翻译工具，在22种语言上提升奖励并降低翻译成本。

Comments 14 pages main text plus appendix, 7 figures, 11 tables

AI中文摘要

LLM在不同语言上的性能差距已有充分记录，而原生缩小差距需要对大多数语言不存在的语料库进行预训练或微调。翻译提供了一种替代方案：将输入转换为模型的主导语言，从而立即释放其全部能力。然而，对每个输入都应用翻译对于模型已能处理的语言来说是浪费的，而将选择权留给模型则相反地失败，因为LLM过于自信，即使无法理解输入也会跳过工具。先前的工作通过语言特定规则、领域启发式、语言标识符或外部路由器来解决这一问题，每种方法都需要手动工程。我们转而学习一个单一策略，仅从奖励中决定何时翻译，开发出语言和领域自适应的内省能力，评估自身理解能力，并仅在无法原生解决任务时调用翻译。使用我们保留答案的翻译流水线构建的数据，我们在后训练的Qwen3-4B上继续RL，涵盖3个资源层级（高、低、极低）的22种语言和5个领域，并引入置信度门控GSPO用于成本敏感的工具使用。门控策略在基线基础上将奖励提升：高资源+4.6，低资源+23.5，极低资源+17.5。与几乎总是翻译的无约束策略相比，它以63%的成本保留了全部奖励，并在87%的成本敏感范围内是帕累托最优的。此外，为了模拟在完全未见语言上的行为，我们创建了2种合成语言，在这些语言上，我们的门控策略比过度自信的基线（即使在这些不可理解的输入上也未充分利用工具）提升了+18.7。该策略零样本迁移到9种保留语言，我们分析了工具使用在训练过程中如何按语言和领域出现。

英文摘要

The performance gap across languages in LLMs is well documented, and closing it natively requires pretraining or fine-tuning on corpora that, for most languages, do not exist. Translation offers an alternative: converting an input into the model's dominant language unlocks its full capabilities at once. Applying translation to every input, however, is wasteful for languages the model already handles, while leaving the choice to the model fails in the opposite way, as LLMs are overconfident and skip the tool even when they cannot understand the input. Prior work resolves this with language-specific rules, domain heuristics, language identifiers, or external routers, each requiring manual engineering. We instead learn a single policy that decides when to translate from reward alone, developing language- and domain-adaptive introspection that assesses its own comprehension and invokes translation only when it cannot solve a task natively. Using data built by our answer-preserving translation pipeline, we continue RL on the post-trained Qwen3-4B across 22 languages in 3 resource tiers (High, Low, XLow) and 5 domains, and introduce confidence-gated GSPO for cost-sensitive tool use. The gated policy lifts reward over the baseline by +4.6 on High, +23.5 on Low, and +17.5 on XLow. Against an unconstrained policy that almost always translates, it preserves full reward at 63% of the cost and is Pareto-optimal across 87% of the cost-sensitivity range. Additionally, to simulate behavior on a completely unseen language, we create 2 synthetic languages, where our gated policy improves +18.7 over the overconfident baseline that underutilizes the tool even on these incomprehensible inputs. The policy transfers zero-shot to 9 held-out languages, and we analyze how tool use emerges over training, per language and per domain.
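The trade-off between task reward and translation cost can be illustrated with a simple cost-penalized reward. This is a hedged sketch of the general idea of cost-sensitive tool use, not the confidence-gated GSPO objective from the paper: the linear penalty, the `cost_weight` value, and the episode data below are all made up for illustration.

```python
def net_reward(task_reward, used_translation, cost_weight):
    """Task reward minus a fixed penalty whenever the translation tool is called."""
    return task_reward - cost_weight * (1.0 if used_translation else 0.0)

def average_net_reward(episodes, cost_weight):
    """episodes: list of (task_reward, used_translation) pairs."""
    return sum(net_reward(r, used, cost_weight) for r, used in episodes) / len(episodes)

# Hypothetical episodes: both policies solve every task, but the gated policy
# skips translation on the two inputs it can already handle natively.
always_translate = [(1.0, True), (1.0, True), (1.0, True), (1.0, True)]
gated = [(1.0, False), (1.0, False), (1.0, True), (1.0, True)]

print(average_net_reward(always_translate, cost_weight=0.2))
print(average_net_reward(gated, cost_weight=0.2))
```

Under any positive `cost_weight`, a policy that reaches the same task reward with fewer tool calls scores higher, which is the kind of pressure that would push a learned policy toward translating only when it needs to.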

2606.06833 2026-06-08 cs.LG cs.AI cs.CR New submission

Hearing the Unspoken: Language Model Priors for Acoustic Adversarial Attacks

Jiani Xie, Andrew C. Cullen, Paul Montague, Benjamin I. P. Rubinstein

Affiliations: University of Melbourne; DST Group

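AI summary: Proposes the Semantic Gambit attack, which uses a large language model to supply predictive context in real time and thereby overcome the causality constraint. It raises the word error rate of real-time ASR systems to 35.6%, a threefold improvement over the current state-of-the-art method.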

2606.06832 2026-06-08 cs.RO New submission

STRIPS-WM: Learning Grounded Propositional STRIPS-style World Models from Images

Abhiroop Ajith, Constantinos Chamzas

Affiliations: Worcester Polytechnic Institute

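AI summary: Proposes the STRIPS-WM framework, which learns a symbolic world model from image transitions for robot visual task planning and improves planning success rate.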

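AI abstract (translated from Chinese)

Robots performing long-horizon visual manipulation observe high-dimensional images, but successful planning depends on action-relevant facts: what can be done now and what will change afterward. A useful planning representation should discard irrelevant visual detail while preserving the applicability and effects of actions. Classical task planners exploit this structure through symbolic operators with preconditions and effects, but obtaining such representations from raw visual experience remains challenging. We study a visual task-planning setting in which the robot receives only image transitions: the current image, the executed high-level action, and the resulting image. At test time, given a start image and a goal image, the robot must produce a sequence of high-level actions that reaches the goal. To address this problem, we introduce STRIPS-WM, a framework that learns image-grounded STRIPS-style world models directly from visual transitions. STRIPS-WM first induces a finite abstract transition graph from images, then learns latent binary predicates and one grounded propositional operator per action label. The learned operators form a symbolic action model with sparse preconditions and add/delete effects. Finally, the learned predicates are distilled into a visual encoder, so that classical planning can be run directly from novel start and goal images. Experiments on visual rearrangement tasks show that STRIPS-WM achieves higher image-to-plan success than the tested visual rollout, latent graph-search, and latent-symbolic baselines.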

English abstract

Robots performing long-horizon visual manipulation observe high-dimensional images, but successful plans depend on action-relevant facts: what can be done now and what changes afterward. A useful planning representation should discard irrelevant visual details while preserving action applicability and effects. Classical task planners exploit this structure through symbolic operators with preconditions and effects, but obtaining such representations from raw visual experience remains challenging. We study a visual task-planning setting in which a robot receives only image transitions: the current image, executed high-level action, and the resulting image. At test time, given a start image and a goal image, the robot must produce a sequence of high-level actions that reaches the goal. To address this problem, we introduce STRIPS-WM, a framework for learning image-grounded STRIPS-style world models directly from visual transitions. STRIPS-WM first induces a finite abstract transition graph from images, then learns latent binary predicates and one grounded propositional operator per action label. The learned operators form a symbolic action model with sparse preconditions and add/delete effects. Finally, the learned predicates are distilled into a visual encoder, enabling classical planning directly from novel start and goal images. Experiments on visual rearrangement tasks show that STRIPS-WM improves image-to-plan success over the tested visual rollout, latent graph-search and latent-symbolic baselines.
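
To make the symbolic action model mentioned in the abstract more concrete, here is a minimal sketch of a propositional STRIPS-style operator with preconditions and add/delete effects, plus a breadth-first planner over sets of true propositions. It is an illustration under stated assumptions, not the STRIPS-WM implementation: the names `Operator`, `plan`, and `OPERATORS`, and the toy propositions, are hypothetical and written by hand, whereas the paper learns its predicates and operators from image transitions.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Operator:
    """A propositional STRIPS-style operator (hypothetical toy structure)."""
    name: str
    pre: frozenset      # propositions that must hold before the action
    add: frozenset      # propositions made true by the action
    delete: frozenset   # propositions made false by the action

    def applicable(self, state: frozenset) -> bool:
        # The action applies when every precondition is in the state.
        return self.pre <= state

    def apply(self, state: frozenset) -> frozenset:
        # Standard STRIPS update: remove delete effects, then add add effects.
        return (state - self.delete) | self.add


def plan(start, goal, operators):
    """Breadth-first search; returns a list of operator names or None."""
    start, goal = frozenset(start), frozenset(goal)
    frontier = deque([(start, [])])
    seen = {start}
    while frontier:
        state, path = frontier.popleft()
        if goal <= state:  # all goal propositions hold
            return path
        for op in operators:
            if op.applicable(state):
                nxt = op.apply(state)
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append((nxt, path + [op.name]))
    return None


# Hand-written toy domain (an assumption for illustration only).
OPERATORS = [
    Operator("pick_a",
             pre=frozenset({"a_on_table", "hand_empty"}),
             add=frozenset({"holding_a"}),
             delete=frozenset({"a_on_table", "hand_empty"})),
    Operator("stack_a_on_b",
             pre=frozenset({"holding_a"}),
             add=frozenset({"a_on_b", "hand_empty"}),
             delete=frozenset({"holding_a"})),
]

if __name__ == "__main__":
    print(plan({"a_on_table", "hand_empty"}, {"a_on_b"}, OPERATORS))
```

In this toy domain the planner finds the two-step plan `pick_a`, `stack_a_on_b`. In the setting the abstract describes, the start and goal states would instead come from a visual encoder applied to the start and goal images.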
