arXivDaily: daily arXiv academic digest, updated Monday through Friday
2606.01777 2026-06-02 cs.RO

Trans2Occ: Voxel Occupancy Estimation and Grasp for Transparent Objects from Simulation to Reality

Yixuan Yang, Sha Zhang, Rui Li, Zhenfei Yin, Xinzhu Ma, Yiran Qin, Lei Bai, Xudong Xu, Shilin Shan, Wangmeng Zuo, Yanyong Zhang, Wanli Ouyang, Feng Zheng, Shixiang Tang, Dongzhan Zhou

Affiliations: Shanghai AI Laboratory; SUSTech; CUHK; Harbin Institute of Technology; University of Oxford; Beihang University; Nanyang Technological University; University of Science and Technology of China

AI summary: Proposes a voxel occupancy prediction framework based on single-view RGB input, combining simulated data generation with a rule-based grasping strategy to achieve robust 3D perception and manipulation of transparent objects.

Abstract

Transparent objects remain challenging for robotic perception due to unreliable depth sensing caused by refraction and reflection. While prior approaches rely on multi-view reconstruction or depth completion, they are often difficult to scale or deploy in real-world robotic systems. In this paper, we present a practical framework for transparent object perception and manipulation based on single-view RGB input. Our approach predicts voxel-space occupancy directly from a single image, providing a geometry-aware representation that supports downstream robotic grasping. To enable large-scale training, we construct a simulation pipeline that generates paired RGB images and voxel occupancy annotations under diverse materials and lighting conditions. We demonstrate that the predicted occupancy representation is robust to domain shifts and transfers effectively from simulation to real-world robotic setups without fine-tuning. A simple rule-based grasping strategy built on top of the occupancy further achieves reliable grasp performance on transparent objects. Extensive experiments in both simulation and real-world environments show that our framework provides accurate 3D understanding and enables practical manipulation of transparent objects. These results suggest that single-view occupancy prediction offers a scalable and effective solution for transparent object perception in robotics.

2606.01757 2026-06-02 cs.CV

PillarDETR: YOLO-Backbone and RT-DETR Head for Real-Time 3D Object Detection

Smit Kadvani, Shriya Gumber, Kriti Faujdar, Harsh Dave

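Affiliations: University of California, Berkeley

AI summary: Proposes PillarDETR, an architecture that combines the CSP backbone of YOLOv8 with an RT-DETR decoder for NMS-free, end-to-end real-time 3D object detection, achieving a good balance between accuracy and speed on KITTI and nuScenes.

Comments 6 pages, 1 figure, 8 tables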

Abstract

Real-time 3D object detection is a critical component for the safe operation of autonomous driving systems and robotics. While LiDAR point clouds provide accurate spatial information, processing them efficiently remains a significant challenge. Traditional methods rely on complex 3D convolutions or anchor-based paradigms that struggle to balance detection accuracy with inference speed. In this paper, we propose PillarDETR, a novel end-to-end 3D object detection architecture that combines the efficiency of pillar-based LiDAR encoding with the representational power of modern 2D vision models. Specifically, PillarDETR replaces standard convolutional backbones with a Cross Stage Partial (CSP) network derived from YOLOv8, enabling richer feature extraction from pseudoimages. Furthermore, we discard conventional anchor-based or center-based detection heads in favor of a Real-Time Detection Transformer (RT-DETR) decoder. This hybrid design allows the network to capture global context and directly predict 3D bounding boxes without relying on non-maximum suppression (NMS). Extensive experiments on the KITTI and nuScenes benchmarks demonstrate that PillarDETR achieves a compelling trade-off between mean Average Precision (mAP) and inference latency. Our ablation studies confirm that integrating the YOLOv8 backbone and RT-DETR head yields substantial improvements over the PointPillars baseline, establishing PillarDETR as a highly effective solution for real-time 3D perception.

2606.01756 2026-06-02 cs.CV

EvoCut: Multi-Layer Evolution-Aware Visual Token Compression for Efficient Large Vision-Language Models

Hongyu Lu, Feng Zhang, Wenwei Jin, Huanling Hu, Pengfei Zhang, Yao Hu, Jiawei Li, Shikai Jiang

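Affiliations: Harbin Institute of Technology; Xiaohongshu; Fudan University

AI summary: Proposes EvoCut, a training-free and attention-free visual token compression method that estimates token importance by analyzing multi-layer evolution deviation; on LLaVA-1.5-7B it preserves 94.4% of the average performance while retaining only 11.1% of the visual tokens.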

Comments Preprint. 12 pages, 6 figures, 7 tables

Abstract

Large vision-language models (LVLMs) achieve strong performance on image and video understanding tasks, but their inference efficiency is constrained by the large number of visual tokens produced by vision encoders. Most existing visual token compression methods estimate token importance from attention scores or representation properties at specific layers, overlooking how visual tokens evolve across the vision encoder. Such layer-specific criteria may provide incomplete importance estimates and limit performance preservation after compression. To address this issue, we analyze layer-wise visual token evolution directions and observe that tokens form multiple group evolution directions across vision-encoder layers. Our analysis further shows that informative tokens tend to exhibit persistent deviations from common group evolution directions. Based on this observation, we propose EvoCut, a training-free and attention-free visual token compression method that estimates token importance from multi-layer evolution deviation. Experimental results show that EvoCut can retain only 11.1% of the visual tokens on LLaVA-1.5-7B while preserving 94.4% of the average performance, demonstrating its effectiveness in balancing efficiency and accuracy.

2606.01755 2026-06-02 cs.AI cs.CL

TriAlign: Towards Universal Truth Consistency in Personalized LLM Alignment

Thi-Nhung Nguyen, Linhao Luo, Rollin Omari, Junae Kim, Thuy-Trang Vu, Dinh Phung

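Affiliations: Department of Data Science & AI, Monash University; Defence Science and Technology Group, Australia

AI summary: To address the universal truth inconsistencies that personalized large language models exhibit across social groups, proposes the TriAlign framework, which uses offline multi-agent reinforcement learning to jointly optimize truth accuracy, cross-group consistency, and personalization for fair alignment.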

Abstract

Personalized large language models adapt responses to users' preferences and social attributes, but can introduce substantial universal truth inconsistencies across social groups, where some groups systematically receive less accurate responses on objective tasks. Existing alignment methods either ignore personalization or mainly focus on subjective preference alignment, largely overlooking fairness and consistency in universal truths. To address this gap, we study Truth-Invariant Alignment (TIA), an alignment problem for personalized LLMs that aims to ensure universal truths remain consistent across social groups while preserving personalization. We propose TriAlign, the first offline multi-agent reinforcement learning (MARL) framework for TIA, where each social group is modeled as an interacting agent. TriAlign jointly optimizes universal truth accuracy, cross-group truth consistency, and personalization through a fairness-aware objective and an explicit inconsistency penalty. Experiments across diverse benchmarks demonstrate that TriAlign achieves a stronger balance among these three objectives than strong baselines, reducing universal truth disparities across social groups while improving both objective task performance and personalization quality.

2606.01753 2026-06-02 cs.CV

Quality-Guided Semi-Supervised Learning for Medical Image Segmentation

Kumar Abhishek, Ghassan Hamarneh

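Affiliations: School of Computing Science, Simon Fraser University, Canada

AI summary: Proposes a quality-guided semi-supervised learning framework that estimates segmentation quality with a dedicated network and uses quality-aware regularization and pseudolabel reweighting to improve medical image segmentation.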

Comments Early Accept at MICCAI 2026, 13 pages, 2 figures

Abstract

Training accurate medical image segmentation models requires large amounts of densely annotated data, which is costly and time-consuming to obtain. Semi-supervised learning (SSL) alleviates this by learning from both abundant unlabeled data and limited labeled data. However, most modern SSL methods rely on pseudolabels for unlabeled data, and typically assess their reliability through model confidence or uncertainty, measures that are self-referential and lack explicit grounding in segmentation quality. Instead, we propose a quality-guided SSL framework that trains a dedicated network to estimate segmentation quality from image-mask pairs. The predictor is trained on variable-quality masks generated through synthetic corruptions augmented with imperfect outputs from partially trained segmentation models, capturing realistic error patterns encountered during training. We integrate the quality predictor into SSL through two complementary mechanisms: a quality-aware regularization loss and a quality-based pseudolabel sample reweighting scheme. We show that our method serves as a drop-in enhancement to existing SSL frameworks. Extensive experiments across five datasets and multiple architectures demonstrate consistent improvements over competing SSL methods, advancing the state-of-the-art in semi-supervised medical image segmentation.

2606.01746 2026-06-02 cs.CV cs.LG

Sensitivity as a Double-Edged Sword: A Trade-off Between Discriminability and Adversarial Robustness

Kai Wang

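Affiliations: University of California, Berkeley

AI summary: Finds that the high sensitivity of fully connected classifiers yields discriminability but also vulnerability, whereas the insensitivity of ℓ2 distance classifiers yields robustness but limits performance. Proposes an ℓ2-reclassifier based on the Hybrid Prototype Mixing framework that fuses stable and dynamic prototypes to balance discriminability and robustness, and designs the Mixed Surrogate Attack evaluation protocol.

Comments 13 pages including references, 4 figures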

Abstract

Modern neural networks are highly susceptible to adversarial perturbations. In this work, we identify that part of this vulnerability stems from the sensitivity of the widely used fully connected (FC) classifiers to such perturbations. In contrast, simple $\ell_2$ distance-based classifiers exhibit significantly greater robustness. We provide thorough theoretical and empirical analysis showing that while FC classifiers' high sensitivity makes them discriminative, it also makes them vulnerable. Conversely, $\ell_2$-classifiers' insensitivity grants robustness but limits performance. Motivated by this trade-off, we propose a novel $\ell_2$-reclassifier based on a Hybrid Prototype Mixing (HPM) framework. This method retains the discriminative power of FC classifiers while leveraging the robustness of $\ell_2$ distance. It yields $\ell_2$-distance-based predictions by fusing two prototype types: (1) stable, dataset-level prototypes updated via EMA, and (2) dynamic, batch-level prototypes generated from the FC classifier's predictions using a Straight-Through Estimator (STE). However, this dynamic, STE-based architecture introduces significant challenges for evaluation, such as gradient obfuscation and forward discontinuity. To address this, we propose a new, rigorous evaluation protocol, the Mixed Surrogate Attack (MSA), which uses multiple surrogates along with powerful AutoAttack to ensure a fair and robust assessment. Extensive experiments demonstrate that our lightweight, plug-and-play module, with minimal fine-tuning, effectively enhances the adversarial robustness of various existing SOTA adversarially trained models.
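
The prototype-based prediction described above can be illustrated with a minimal NumPy sketch. It assumes a plain nearest-prototype rule under squared ℓ2 distance and a standard exponential moving average; the function names, the `momentum` value, and the convex mixing weight `alpha` are hypothetical, and the abstract does not specify how HPM actually fuses the two prototype types or how the Straight-Through Estimator is applied.

```python
import numpy as np

def ema_update(prototypes, feats, labels, momentum=0.9):
    """Move each class prototype toward the batch mean feature of that class."""
    prototypes = np.asarray(prototypes, dtype=float)
    feats = np.asarray(feats, dtype=float)
    labels = np.asarray(labels)
    updated = prototypes.copy()
    for c in np.unique(labels):
        batch_mean = feats[labels == c].mean(axis=0)
        updated[c] = momentum * prototypes[c] + (1.0 - momentum) * batch_mean
    return updated

def l2_predict(feats, prototypes):
    """Assign each feature to the class whose prototype is nearest in l2 distance."""
    feats = np.asarray(feats, dtype=float)
    prototypes = np.asarray(prototypes, dtype=float)
    # Pairwise squared distances, shape (n_samples, n_classes).
    d2 = ((feats[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)

def mix_prototypes(stable, dynamic, alpha=0.5):
    """Hypothetical convex mix of dataset-level and batch-level prototypes."""
    return alpha * np.asarray(stable, dtype=float) + (1.0 - alpha) * np.asarray(dynamic, dtype=float)
```

A distance-based rule of this kind changes its prediction only when a perturbation moves a feature closer to another prototype, which is one intuition for the insensitivity the abstract attributes to ℓ2 classifiers.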

2606.01738 2026-06-02 cs.CL cs.AI

THRD: A Training-Free Multi-Turn Defense Framework for Jailbreak Attacks on Large Language Models

Zhiqing Ma, Zhonghao Xu, Dong Yu, Chen Kang, Changliang Li, Pengyuan Liu

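Affiliations: Beijing Language and Culture University

AI summary: Proposes THRD, a training-free framework that defends against multi-turn jailbreak attacks by explicitly modeling temporal risk accumulation (turn-level risk assessment, cross-turn intent detection, response evaluation, and a decision module), reducing the attack success rate to 0.2-4.0% with less than 1.5% loss in model utility.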

Abstract

Multi-turn jailbreak attacks pose a growing threat to LLMs by exploiting conversational dynamics such as gradual escalation and cross-turn coordination. Existing defenses either rely on costly retraining, often degrading model utility, or apply single-turn analysis independently at each turn, failing to capture how risk accumulates along interaction trajectories. We observe that safety behavior in multi-turn interaction is trajectory-dependent: dialogue history continuously reshapes the model's conditioning context, making it insufficient to evaluate each turn in isolation. Motivated by this insight, we present THRD, the first training-free framework that explicitly models temporal risk accumulation for multi-turn jailbreak defense. THRD integrates four modules: a Turn-level Risk Assessor (TRA) for instantaneous risk estimation, a Historical Context Analyzer (HCA) for cross-turn intent escalation detection, a Response Evaluator (RE) for identifying facilitative outputs, and a Decision Module that combines these signals through a time-evolving scoring mechanism with attenuation-based modulation and trend-aware adjustment. Experiments against state-of-the-art multi-turn attacks (including tree-search-based and multi-agent collaborative methods) across two target models show that THRD reduces ASR to 0.2-4.0% while preserving model utility within 1.5% degradation on MMLU and GSM8K. Ablation studies confirm non-redundant module contributions and stable cross-architecture generalization. Analysis of first rejection triggers reveals that over 70% of multi-turn attacks require Turn 2 or later to detect, validating the necessity of explicit temporal aggregation.

2606.01737 2026-06-02 cs.AI

TrafficRAG: A Multimodal RAG Framework for Traffic Accident Liability Determination

Xu Li, Zedong Fu, Xinyi Li, Xun Han

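Affiliations: Southwest Petroleum University; Sichuan Police College

AI summary: Proposes the TrafficRAG framework, which generates structured descriptions with a vision-language model, retrieves regulations and cases through hybrid retrieval, and has a large language model reason over the fused multimodal evidence to produce automated traffic accident liability analysis reports.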

Comments 12 pages, 3 figures, accepted at ICANN 2026

Abstract

Traffic accident liability analysis is a critical yet challenging task in intelligent transportation and legal assistance. Existing methods often suffer from low efficiency, subjective judgment, and inconsistent analysis results. Meanwhile, large language models are constrained by noisy video inputs and insufficient legal domain knowledge. To address these issues, this work presents TrafficRAG, a multimodal retrieval-augmented framework for automated traffic accident analysis and report generation. Specifically, the proposed framework first adopts a vision-language model to produce structured textual descriptions of accident scenarios, which serve as accurate retrieval queries. Based on these textual queries, a hybrid retrieval strategy integrating BM25 sparse retrieval and dense embedding retrieval is employed to fetch relevant traffic regulations and similar historical cases. Finally, the large language model incorporates retrieved legal knowledge and multimodal accident evidence for comprehensive reasoning, and generates standardized, legally grounded liability analysis reports. Extensive experiments show that TrafficRAG consistently outperforms baseline methods, achieving 77.32% Legal Norm Adaptation Accuracy, 81.71% Factual Faithfulness, and a Liability Ratio MAE of 5.48%. The results validate that integrating multimodal factual evidence with legal clauses via retrieval augmentation can effectively improve the reliability and accuracy of traffic accident liability determination.
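
The hybrid retrieval step, which combines BM25 sparse retrieval with dense embedding retrieval, might look roughly like the pure-Python sketch below. Everything beyond the BM25 formula is an assumption made for illustration: the non-negative IDF variant, the min-max normalization, the weighted-sum fusion with `weight`, and the precomputed embedding vectors are hypothetical, since the abstract does not say how TrafficRAG fuses the two score lists.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of each tokenized document for a tokenized query."""
    n = len(docs)
    avg_len = sum(len(d) for d in docs) / n
    df = Counter(term for d in docs for term in set(d))  # document frequency
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for term in query:
            if term not in tf:
                continue
            # Non-negative IDF variant (an assumption; other variants exist).
            idf = math.log(1.0 + (n - df[term] + 0.5) / (df[term] + 0.5))
            s += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * (1 - b + b * len(d) / avg_len))
        scores.append(s)
    return scores

def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def min_max(xs):
    """Rescale scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(xs), max(xs)
    return [0.0 if hi == lo else (x - lo) / (hi - lo) for x in xs]

def hybrid_rank(query_tokens, query_vec, docs_tokens, docs_vecs, weight=0.5):
    """Rank document indices by a weighted sum of normalized sparse and dense scores."""
    sparse = min_max(bm25_scores(query_tokens, docs_tokens))
    dense = min_max([cosine(query_vec, v) for v in docs_vecs])
    fused = [weight * s + (1 - weight) * d for s, d in zip(sparse, dense)]
    return sorted(range(len(fused)), key=lambda i: fused[i], reverse=True)
```

In this sketch the sparse score rewards exact term overlap (for example, a statute that names the same violation), while the dense score can surface paraphrased historical cases that share no tokens with the query.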

2606.01734 2026-06-02 cs.CV cs.LG cs.RO

FlatVPR: Plug-and-play Geo-linear Residual Adapter for Geometric Rectification of Foundation Model Feature Manifolds

Rai Hisada, Kanji Tanaka

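Affiliations: Fundamental Engineering for Knowledge-Based Society, Graduate School of Engineering, University of Fukui

AI summary: Proposes the FlatVPR paradigm, which suppresses feature-manifold curvature with a learnable residual adapter and a Pullback Flatness Loss, enabling reconstruction by linear interpolation under sparse anchors and significantly improving visual place recognition accuracy on the NCLT dataset.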

Comments 5 pages, 1 figure, technical report

Abstract

This paper proposes "FlatVPR," a novel geometric rectification paradigm that effectively bridges the trade-off between map lightweightness and localization accuracy in visual place recognition (VPR) by enforcing a feature manifold structure where any descriptor between two adjacent anchors $\mathbf{z}_A$ and $\mathbf{z}_B$ can be accurately reconstructed via linear interpolation $\hat{\mathbf{z}}_{pseudo} = (1-t)\mathbf{z}_A + t\mathbf{z}_B$, where $t \in [0,1]$ denotes the relative position. While state-of-the-art foundation models such as DINOv2-ViT-S/14 provide robust semantic features, their latent manifolds exhibit prominent curvature, projecting uniform linear motion in physical space onto highly non-linear trajectories in the feature space, which hinders reliable reconstruction under sparse anchor conditions. To enable the aforementioned interpolation-based reconstruction, we introduce a residual transformation $\hat{\mathbf{z}} = \mathbf{z} + \text{Res}(\mathbf{z})$ to the raw foundation features $\mathbf{z}$, where $\text{Res}(\cdot)$ represents a learnable adapter. Our method explicitly suppresses manifold curvature using a mathematically grounded Pullback Flatness Loss that minimizes the deviation of intermediate features from the linear segment connecting adjacent anchors, thereby minimizing the intrinsic curvature of the manifold. Through this spatial flattening, map construction is formulated within an Expectation-Maximization (EM) framework, decoupled into a continuous M-step for manifold adaptation and a conceptual E-step for optimal anchor selection guidelines. Experiments on the NCLT dataset demonstrate that the application of our adapter leads to significant performance improvements even under extremely sparse anchor conditions with 100 m intervals and extreme seasonal changes.
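
The interpolation and the residual transformation stated in the abstract can be written down directly; the sketch below does so in NumPy and adds one plausible flatness penalty. The squared-ℓ2 form of `flatness_deviation` and the one-layer `tanh` adapter are assumptions chosen for illustration, since the abstract gives the idea of the Pullback Flatness Loss and of $\text{Res}(\cdot)$ but not their exact definitions.

```python
import numpy as np

def interpolate(z_a, z_b, t):
    """Pseudo-descriptor (1 - t) * z_A + t * z_B for relative position t in [0, 1]."""
    return (1.0 - t) * np.asarray(z_a, dtype=float) + t * np.asarray(z_b, dtype=float)

def flatness_deviation(z_mid, z_a, z_b, t):
    """Squared l2 distance between an intermediate feature and the chord point at t.

    Assumed form of a flatness penalty; the paper's actual loss may differ.
    """
    return float(np.sum((np.asarray(z_mid, dtype=float) - interpolate(z_a, z_b, t)) ** 2))

def residual_adapter(z, weight, bias):
    """Hypothetical one-layer adapter implementing z_hat = z + Res(z)."""
    z = np.asarray(z, dtype=float)
    return z + np.tanh(z @ weight + bias)
```

If the adapted features of a place halfway between two anchors lie exactly on the chord, `flatness_deviation(..., t=0.5)` is zero, and the map only needs to store the two anchor descriptors to reconstruct it.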

2606.01725 2026-06-02 cs.AI cs.LG

Characterization of Multi-Model Agentic AI Systems on General Tasks via Trace-Driven Simulation

Donghwan Kim, Prakhar Singh, Younghoon Min, Jongryool Kim, Jongse Park, Kiwan Maeng

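Affiliations: The Pennsylvania State University; SK Hynix; KAIST

AI summary: Presents the GAIATrace dataset and the Vidur-Agent simulator, and uses trace-driven simulation to characterize the behavior of multi-model agentic AI systems on general tasks.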

Comments 13 pages, 18 figures, 2 tables

Abstract

Agentic AI completes tasks through iterative planning, tool use, and reasoning based on observed outcomes. Despite its popularity, its system-level behavior remains poorly understood, particularly for complex datasets and agent architectures, owing to highly non-deterministic execution, prohibitive evaluation costs, and limited visibility into proprietary models. This paper presents GAIATrace, the first token-level trace dataset of two state-of-the-art agentic systems (MiroThinker and OWL) running GAIA, a benchmark composed of a heterogeneous mix of general-purpose tasks. Unlike prior trace datasets, GAIATrace captures full reasoning tokens, task-level structures, and activities of every major participating LLM, enabling in-depth systems research. Complementing the dataset, we present Vidur-Agent, a trace-driven simulator that can replay GAIATrace to perform reproducible, low-cost system evaluation across diverse simulated environments. Using both artifacts, we characterize how modern agentic systems handle general tasks and how various system design choices shape their behavior, yielding several unique findings.

2606.01723 2026-06-02 cs.LG cs.AI

Shortcut to Nowhere: Demystifying Deep Spurious Regression

Guanrong Xu, Jessica Li, Hao Wang, Yuzhe Yang

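Affiliations: University of California, Los Angeles; Rutgers University; Yang AI Lab

AI summary: For spurious correlations in continuous prediction, proposes exploiting the similarity among spurious attributes in label and feature spaces to calibrate distributions, improving generalization under distribution shift.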

Abstract

Real-world regression often exhibits shortcuts: attributes that are spuriously correlated with continuous targets in training, yet unreliable under deployment shifts; regressing targets using such shortcuts may fail catastrophically at test time. Existing studies on spurious correlations focus primarily on classification, where labels are categorical and groups are naturally defined. However, many real-world tasks require continuous prediction, where hard label boundaries or discrete group-label pairs do not exist. We define Deep Spurious Regression (DSR) as learning from regression data with attribute-label confounding, addressing continuous spurious correlations, and generalizing to all attribute-label combinations at test time. Motivated by the intrinsic difference between classification and regression shortcuts, we propose to exploit the similarity among spurious attributes in both label and feature spaces, thereby accounting for nearby targets and related groups while calibrating both label and learned feature distributions across attributes. Extensive experiments on common real-world DSR datasets that span computer vision, environmental sensing, and large language model (LLM) regression verify the superior performance of our strategies. Our work fills the gap in benchmarks and techniques for studying spurious correlations in continuous prediction.

2606.01722 2026-06-02 cs.LG cs.AI cs.DC

Post-Deterministic Distributed Systems: A New Foundation for Trustworthy Autonomous Infrastructure

Jun He, Deying Yu

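Affiliations: OpenKedge Inc.

AI summary: Proposes the Post-Deterministic Distributed Systems (PDDS) model for coordinating heterogeneous environments where deterministic code, stochastic models, and autonomous agents coexist, and defines five architectural pillars and a new failure taxonomy.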

Comments 8 pages, 1 table

Abstract

For decades, distributed systems have typically assumed that correct participants execute protocol-specified behavior with stable, externally defined, and deterministic semantics. Classical theory has extensively parameterized network timing, communication topologies, and failure domains, but this participant model has remained comparatively fixed. The integration of autonomous reasoning engines, stochastic model-driven agents, and policy-driven actors into cloud control planes, incident response systems, and financial infrastructure challenges the universality of this assumption. These agents often produce divergent reasoning paths, distinct operational traces, and heterogeneous internal representations while achieving semantically equivalent and correct outcomes. In this paper, we introduce Post-Deterministic Distributed Systems (PDDS) as a research and engineering model for coordinating heterogeneous environments where deterministic code, stochastic models, and autonomous agents coexist. We show that classical distributed computing models form a zero-ambiguity special case of this participant-general model. We do not argue that deterministic systems disappear; rather, deterministic execution can no longer serve as the universal participant assumption for autonomous infrastructure. Finally, we outline five architectural pillars of post-deterministic infrastructure: Protocol-Driven Development, Verifiable Agentic Infrastructure, Autonomous State Control Planes, Semantic Quorum Assurance, and Epistemic State Replication. Epistemic State Replication extends persistence and consistency models from data visibility to knowledge visibility, enabling agentic memory, Verifiable Semantic Rollback, and coherence across reasoning participants. We also define a taxonomy of failure classes that arise in this setting.

2606.01720 2026-06-02 cs.LG

A Note on Stability for Orthogonalized Matrix Momentum with Client Sampling

Da Chang, Qiankun Shi, Lvgang Zhang, Yu Li, Ruijie Zhang

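Affiliations: University of Chinese Academy of Sciences; Sun Yat-sen University; Southern University of Science and Technology; George Washington University

AI summary: Studies finite-sample generalization bounds for orthogonalized momentum updates in distributed matrix optimization with client sampling, deriving an upper-tail guarantee from a coupled-neighbor stability recursion and a weighted concentration step.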

Abstract

We study finite-sample generalization for a client-sampled distributed optimization scheme with matrix-valued parameters and orthogonalized momentum updates. The central quantity is the gap between the population and empirical objectives at the returned model when only a subset of clients participates in each round. Under independent heterogeneous client data, unequal local sample counts, and fixed aggregation weights, we derive a finite-round upper-tail guarantee from a coupled-neighbor stability recursion and a weighted concentration step. The bound keeps the client-selection counts through the amplification factor \(Y_i(\mathcal C)\); in the uniform full-participation full-batch regime, it yields \(\widetilde{\mathcal O}(n^{-1}+n^{-1/2})\) scaling whenever the horizon-dependent amplification terms are controlled. The matrix-orthogonalization rule is required to be Lipschitz along paired trajectories, a condition satisfied by regularized polar-type maps and normalized finite-step Newton-Schulz orthogonalizers. For the unregularized matrix sign, the same argument requires coupled spectral separation, whereas Gaussian smoothing gives a finite-round smoothed variant. A one-dimensional counterexample shows why a gap, smoothing, or regularity condition is necessary.
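
As a concrete picture of a normalized finite-step Newton-Schulz orthogonalizer, the sketch below uses the classic cubic iteration $X \leftarrow \tfrac{3}{2}X - \tfrac{1}{2}XX^{\top}X$ after Frobenius normalization. This particular polynomial, the step count, and the `eps` guard are assumptions; the abstract does not state which coefficients or normalization the analyzed scheme uses.

```python
import numpy as np

def newton_schulz_orthogonalize(m, steps=15, eps=1e-7):
    """Approximate the orthogonal polar factor of m with a fixed number of steps."""
    m = np.asarray(m, dtype=float)
    # Frobenius normalization puts every singular value in (0, 1],
    # inside the convergence region of the cubic iteration.
    x = m / (np.linalg.norm(m) + eps)
    for _ in range(steps):
        x = 1.5 * x - 0.5 * x @ x.T @ x
    return x
```

Each step maps a singular value $\sigma$ to $1.5\sigma - 0.5\sigma^3$, so singular values are pushed toward 1 while the singular vectors are unchanged. With a fixed, finite number of steps the map is a polynomial in the normalized input, which is the kind of smooth update rule the Lipschitz condition in the abstract concerns.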

2606.01719 2026-06-02 cs.LG cs.AI cs.CR

Fair Finetuning Mitigates Distribution Inference Attacks

Rakshit Naidu

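Affiliations: Rakshit Naidu

AI summary: Proposes Fair Fine-tuning (FFt), which fine-tunes on samples from the complementary distribution under an Equalized Odds constraint, links a model's fairness metric to the adversarial advantage in distribution inference attacks through a theoretical bound, and shows experimentally that it reduces attack success.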

Comments 16 pages (11 main, 5 appendix)

Abstract

Machine learning models trained on sensitive data can inadvertently leak population-level information about their training distributions, a threat known as a distribution inference attack (DIA). An adversary with black-box access can infer sensitive demographic properties, such as subgroup proportions, without observing any training data directly. While defenses such as differential privacy and property unlearning have been proposed, the link between fairness constraints and distributional leakage remains unexplored. We propose Fair Fine-tuning (FFt): a trained model is fine-tuned on samples from the complementary distribution under an Equalized Odds (EO) constraint. We provide a complete theoretical characterization, proving the tight bound $\text{Adv}(\mathcal{A},M_f) \le \Delta_{\text{EO}} \cdot W$, where $W$ quantifies how distinguishable the two training distributions are by their sensitive-attribute composition. We also establish a necessary condition for FFt to reduce adversarial advantage and prove tightness of the bound. We evaluate across six datasets spanning tabular (ACS Income, COMPAS, German Credit), image (UTKFaces), and NLP (Bias in Bios) modalities. Rehearsal-based FFt consistently reduces the adversarial accuracy gap below the detection threshold $\tau = 0.1$ across all settings; on ACS Income, the gap falls from $\sim 15\%$ to under $4\%$. Our work provides the first formal bound connecting a model's measured EO disparity directly to its adversarial advantage in the DIA game, opening a new avenue for unified fairness-and-privacy defenses.
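
To make the bound $\text{Adv} \le \Delta_{\text{EO}} \cdot W$ concrete, the sketch below measures an Equalized Odds gap for a binary classifier with two groups and multiplies it by a given $W$. It assumes one common convention for $\Delta_{\text{EO}}$ (the larger of the true-positive-rate gap and the false-positive-rate gap); the paper may define the disparity differently, and how $W$ is computed is not given in the abstract, so it is passed in as a plain number.

```python
import numpy as np

def equalized_odds_gap(y_true, y_pred, group):
    """Largest between-group gap in TPR or FPR for binary labels and two groups.

    Assumes every (label, group) cell contains at least one sample.
    """
    y_true, y_pred, group = (np.asarray(a) for a in (y_true, y_pred, group))
    gaps = []
    for label in (0, 1):  # label 0 gives the FPR gap, label 1 the TPR gap
        rates = [y_pred[(y_true == label) & (group == g)].mean() for g in (0, 1)]
        gaps.append(abs(rates[0] - rates[1]))
    return float(max(gaps))

def advantage_bound(eo_gap, w):
    """Right-hand side of the abstract's bound, Delta_EO * W."""
    return eo_gap * w
```

Read this way, the bound says that shrinking the measured EO gap through fine-tuning directly shrinks the ceiling on what a distribution inference adversary can gain, for a fixed $W$.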

2606.01713 2026-06-02 cs.RO cs.SY eess.SY math.OC

FlipItRight: Stable Pose-Targeted Throw-Flip Across Diverse Objects

FlipItRight: 面向多样物体的稳定姿态目标投掷翻转

Axel Dawne, Shinkyu Park

发表机构 * King Abdullah University of Science and Technology(阿卜杜拉国王科技大学)

AI总结 提出FlipItRight框架,通过将任务分解为物体级规划器和机器人级规划器,利用释放状态作为显式中间表示,实现无需先验数据或学习模型的高自由度机械臂稳定平面姿态目标投掷翻转,在120次实验中达到90%成功率。

详情
AI中文摘要

我们提出了FlipItRight,一个用于高自由度机械臂进行稳定平面姿态目标投掷翻转的框架。该任务被分解为一个物体级规划器,它生成满足期望着陆姿态的候选释放状态,以及一个机器人级规划器,它评估可执行性并构建可行的摆动运动。将释放状态视为显式中间表示,能够实现原则性的候选过滤、释放和预摆动配置的自适应选择,以及结构化的近释放运动设计——特别是在最终摆动阶段保持近似恒定的末端执行器速度,以提高对释放时间不确定性的鲁棒性。我们在一个真实平台上对不同形状、大小和质量的物体进行了验证,在120次试验中达到了90%的成功率。消融研究证实,每个设计选择都对投掷性能有所贡献,并且该框架不需要先验数据或学习模型,能够直接部署到新物体和目标上,无需特定环境的校准或数据收集。

英文摘要

We propose FlipItRight, a framework for stable planar pose-targeted throw-flip with a high-DoF manipulator. The task is decomposed into an object-level planner, which generates candidate release states satisfying the desired landing pose, and a robot-level planner, which evaluates executability and constructs a feasible swing motion. Treating the release state as an explicit intermediate representation enables principled candidate filtering, adaptive selection of release and pre-swing configurations, and structured near-release motion design -- in particular, approximately constant end-effector velocities during the final swing phase to improve robustness to release-timing uncertainty. We validate on a real platform across objects of varying shape, size, and mass, achieving a 90% success rate across 120 trials. Ablation studies confirm that each design choice contributes to throwing performance, and the framework requires no prior data or learned model, enabling direct deployment on new objects and targets without environment-specific calibration or data collection.
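The object-level step described above, mapping a candidate release state to a landing pose, can be illustrated with plain ballistic flight. This is a sketch under strong assumptions that are not taken from the paper: a planar rigid body, no air drag, constant angular velocity in flight, and no bounce or contact dynamics at landing. The names `landing_state` and `satisfies_target` and all tolerances are hypothetical.

```python
import math

G = 9.81  # gravitational acceleration in m/s^2


def landing_state(x0, z0, theta0, vx, vz, omega, z_land=0.0):
    """Propagate a planar release state to the landing height.

    Returns (x_land, theta_land, flight_time). Assumes drag-free flight and
    a constant angular velocity `omega` after release.
    """
    # Solve z0 + vz*t - 0.5*G*t^2 = z_land for the positive root.
    disc = vz ** 2 + 2.0 * G * (z0 - z_land)
    if disc < 0.0:
        raise ValueError("object never reaches the landing height")
    t = (vz + math.sqrt(disc)) / G
    return x0 + vx * t, theta0 + omega * t, t


def satisfies_target(x_land, theta_land, x_goal, theta_goal,
                     pos_tol=0.02, ang_tol=0.1):
    """Check a candidate against a desired landing pose (tolerances are made up)."""
    # Wrap the angular error into [-pi, pi) before comparing.
    ang_err = (theta_land - theta_goal + math.pi) % (2.0 * math.pi) - math.pi
    return abs(x_land - x_goal) <= pos_tol and abs(ang_err) <= ang_tol


# A release at the landing height with vz = G/2 stays airborne for one second.
x_land, theta_land, t_flight = landing_state(
    x0=0.0, z0=0.0, theta0=0.0, vx=1.0, vz=G / 2.0, omega=math.pi)
ok = satisfies_target(x_land, theta_land, x_goal=1.0, theta_goal=math.pi)
```

A planner in this spirit would enumerate many release states, keep those for which `satisfies_target` holds, and hand the survivors to a robot-level feasibility check.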

2606.01708 2026-06-02 cs.LG cs.AI

Two-Fidelity Best-Action Identification for Stochastic Minimax Tree

随机极小极大树的双保真度最优动作识别

Peter Chen, Xi Chen

发表机构 * Department of Mathematics, Columbia University(哥伦比亚大学数学系) Stern School of Business, New York University(纽约大学斯特恩商学院)

AI总结 针对随机极小极大树中的固定置信度最优动作识别问题,提出双保真度树搜索算法2FFS,结合极小极大快速扩展与MCTS随机采样,自适应选择廉价有偏评估或昂贵精确评估,理论证明固定置信度正确性、有限停止及多项式深度成本上界,实验表明比现有BAI-MCTS基线显著减少样本和计算。

Comments 36 pages

详情
AI中文摘要

我们研究随机极小极大树中的固定置信度最优动作识别(BAI)。该问题在现代AI规划中日益重要,其中深度极小极大搜索和带有语言模型长滚动的蒙特卡洛树搜索(MCTS)面临一个基本权衡:启发式评估廉价但有偏,而精确滚动可靠但代价高昂。我们提出2FFS,一种双保真度树搜索算法,将多保真度平面赌博机思想引入树中。该算法结合了极小极大风格的快速扩展和MCTS风格的随机采样,自适应地决定何时利用廉价有偏评估以及何时调用昂贵精确评估进行局部认证。我们证明了固定置信度正确性,建立了精确识别的有限停止性,并给出了通用深度树的多项式深度成本上界。在数值随机树实验中,与现有BAI-MCTS基线相比,2FFS使用的样本和计算操作显著减少。

英文摘要

We study fixed-confidence best-action identification (BAI) in stochastic minimax trees. This problem is increasingly relevant in modern AI planning, where deep minimax search and Monte Carlo Tree Search (MCTS) with language model long rollouts face a fundamental tradeoff: heuristic evaluations are cheap but biased, while accurate rollouts are reliable but prohibitively expensive. We propose 2FFS, a two-fidelity tree-search algorithm that brings multi-fidelity flat bandit ideas into trees. The algorithm combines minimax-style fast expansion with MCTS-style stochastic sampling, adaptively deciding when to exploit cheap biased evaluations and when to invoke expensive accurate evaluations for local certification. We prove fixed-confidence correctness, establish finite stopping for exact identification, and give a polynomial-depth cost upper bound for general-depth trees. Across numerical stochastic-tree experiments, 2FFS uses substantially fewer samples and computational operations compared to existing BAI-MCTS baselines.

2606.01703 2026-06-02 cs.SD cs.AI cs.CV

JenBridge: Adaptive Long-Form Video Soundtracking across Scene Transitions

JenBridge: 跨场景转换的自适应长视频配乐

Jiashuo Yu, Yao Yao, Boyu Chen, Alex Wang

发表机构 * Jen Music AI

AI总结 提出JenBridge框架,通过基于Transformer的生成模型、双文本-视觉条件对齐和LLM代理驱动的自适应过渡机制,实现长视频配乐的高保真生成与场景转换自然连贯。

详情
AI中文摘要

我们解决了在场景转换中生成高保真、长格式配乐并保持连贯性的挑战。现有的AI音乐系统主要针对短片段设计,缺乏确保叙事连续性的机制。我们提出了JenBridge,一个模块化且可解释的自适应长视频配乐框架,确保高保真音频生成和转换自然性。核心架构是一个基于Transformer的生成模型,采用流匹配目标训练,遵循两阶段范式:在大规模文本-音频语料库上进行预训练以建立稳健的音乐先验,然后通过双文本-视觉条件适应视频领域以实现精确的跨模态对齐。关键的是,为了实现跨不同场景变化的长格式连贯性,JenBridge引入了一种新颖的自适应过渡机制。该系统具有一个多功能的过渡风格工具包,包括一种生成式过渡方法,并独特地采用了一个大型语言模型(LLM)代理,作为导演智能地为每个叙事转变选择最合适的过渡。为了严格评估这一任务,我们提出了LVS基准,这是一个新基准,包含一个精选数据集和新的评估指标,侧重于整体和过渡感知评估。在提出的基准上进行的大量实验表明,JenBridge在客观和主观指标上均显著优于现有方法,特别是在转换自然性和整体叙事连贯性方面。JenBridge代表了向全自动、专业质量的视频配乐迈出的重要一步。

英文摘要

We address the challenge of generating high-fidelity, long-form soundtracks that remain coherent across scene transitions. Existing AI music systems are mainly designed for short, isolated clips and lack mechanisms to ensure narrative continuity. We present JenBridge, a modular and interpretable framework for adaptive long-form video soundtracking that ensures both high-fidelity audio generation and transition naturalness. The core architecture is a Transformer-based generative model trained with a flow-matching objective, following a two-stage paradigm: pretraining on large-scale text-audio corpora to establish robust musical priors, then adapting to the video domain with dual text-visual conditioning for precise cross-modal alignment. Crucially, to achieve long-form coherence across diverse scene changes, JenBridge incorporates a novel adaptive transition mechanism. This system features a versatile toolkit of transition styles, including a generative transition method, and uniquely employs a Large Language Model (LLM) Agent that acts as a director to select the most appropriate transition for each narrative shift intelligently. To rigorously assess this task, we propose the LVS Benchmark, a new benchmark that includes a curated dataset and novel evaluation metrics focusing on holistic and transition-aware assessment. Extensive experiments on the proposed benchmark demonstrate that JenBridge significantly outperforms existing methods in both objective and subjective metrics, particularly in terms of transition naturalness and overall narrative coherence. JenBridge represents a significant step towards fully automated, professional-quality video soundtracking.

2606.01701 2026-06-02 cs.CV

Spatio-Temporal Correlation Guided Geometric Partitioning for Versatile Video Coding

时空相关性引导的几何划分用于多功能视频编码

Xuewei Meng, Chuanmin Jia, Xinfeng Zhang, Shanshe Wang, Siwei Ma

发表机构 * Institute of Digital Media, Department of Electronics Engineering and Computer Science, Peking University(数字媒体研究所,电子工程与计算机科学系,北京大学) Information Technology R&D Innovation Center of Peking University(北京大学信息科技研发创新中心) Peng Cheng Laboratory(鹏城实验室) School of Computer Science and Technology, University of Chinese Academy of Sciences(中国科学院大学计算机科学与技术学院)

AI总结 针对VVC中几何划分开销大的问题,提出时空相关性引导的几何划分(STGEO)方案,通过模式预测和运动候选选择减少边信息比特,提升编码效率。

详情
Journal ref
IEEE Transactions on Image Processing, vol. 31, pp. 30-42, 2022
AI中文摘要

几何划分因其在混合视频编码框架中卓越的运动场描述能力而受到越来越多的关注。然而,多功能视频编码(VVC)中现有的几何划分(GEO)方案给边信息的信令带来了不可忽视的负担,从而限制了编码效率。鉴于此,我们提出了一种时空相关性引导的几何划分(STGEO)方案,以有效描述视频编码运动场中的物体信息。所提方法可以节省用于边信息信令的比特,包括划分模式和运动信息。我们首先以统计合理的方式分析了划分模式决策和运动矢量选择的特性。基于观察到的时空相关性,我们设计了一种模式预测和编码方法,以减少表示上述边信息的开销。主要思想是预测具有较高选择可能性的STGEO模式和运动候选,这可以指导熵编码,即用更少的比特表示预测的高概率模式和运动候选。特别地,高概率STGEO模式基于边缘信息和相邻STGEO编码块的历史模式进行预测。相应的运动信息由合并候选列表中的索引表示,该索引基于离线训练的合并候选选择概率自适应地推断。仿真结果表明,与未使用GEO的VTM-8.0相比,所提方法在随机接入和低延迟B配置下平均分别节省了0.95%和1.98%的比特率。

英文摘要

Geometric partitioning has attracted increasing attention for its remarkable motion field description capability in the hybrid video coding framework. However, the existing geometric partitioning (GEO) scheme in Versatile Video Coding (VVC) causes a non-negligible burden for signaling the side information. Consequently, the coding efficiency is limited. In view of this, we propose a spatio-temporal correlation guided geometric partitioning (STGEO) scheme to efficiently describe the object information in the motion field of video coding. The proposed method can economize the bits consumed for side information signaling, including the partitioning mode and motion information. We first analyze the characteristics of partitioning mode decision and motion vector selection in a statistically sound way. Based on the observed spatio-temporal correlation, we design a mode prediction and coding method to reduce the overhead for representing the above-mentioned side information. The main idea is to predict the STGEO modes and motion candidates that have higher selection probabilities, which can guide the entropy coding, i.e., representing the predicted high-probability modes and motion candidates with fewer bits. In particular, the high-probability STGEO modes are predicted based on the edge information and history modes of adjacent STGEO-coded blocks. The corresponding motion information is represented by the index in a merge candidate list, which is adaptively inferred based on the off-line trained merge candidate selection probability. Simulation results show that the proposed approach achieves 0.95% and 1.98% bit-rate savings on average compared to VTM-8.0 without GEO for Random Access and Low-Delay B configurations, respectively.

2606.01700 2026-06-02 cs.CV

MixerSENet: A Lightweight Framework for Efficient Hyperspectral Image Classification

MixerSENet: 一种用于高效高光谱图像分类的轻量级框架

Mohammed Q. Alkhatib, Swalpa Kumar Roy, Ali Jamali

发表机构 * College of Engineering and IT, University of Dubai(迪拜大学工程与信息技术学院) Department of Computer Science and Engineering, Alipurduar Government Engineering and Management College(阿利普杜尔政府工程与管理学院计算机科学与工程系) Department of Geography, Simon Fraser University(西蒙·弗雷泽大学地理系)

AI总结 提出轻量级框架MixerSENet,通过解耦空间与通道维度混合并引入挤压激励模块,在保持低参数量的同时实现高光谱图像分类的高精度与高效率。

Comments Accepted and Published in IEEE Geoscience and Remote Sensing Letters (GRSL)

详情
AI中文摘要

本文提出了一种新颖的框架MixerSENet,用于高光谱图像(HSI)分类,旨在解决计算效率和有限标注数据带来的挑战。所提出的模型处理高光谱图像块,同时在整个网络中保持一致的尺寸和分辨率,有效解耦了空间和通道维度的混合。值得注意的是,MixerSENet轻量且计算高效,与传统模型相比所需参数更少,适用于资源受限环境。模型中嵌入了挤压激励块以细化特征提取,增强网络捕获更多信息特征的能力。在两个基准数据集上的实验结果表明,MixerSENet实现了优越的性能,在Houston13数据集上达到82.47%的总体精度(OA),在Qingyun数据集上达到96.70%,优于包括3D-CNN、HybridKAN、HSIFormer、SimPoolFormer和MorphMamba在内的最先进方法。此外,对计算效率的详细分析表明,MixerSENet在准确性和效率之间实现了良好的平衡,仅需53,146个参数和较低的推理时间,证实了其在实际应用中的实用性。发布时,源代码将在https://github.com/mqalkhatib/MixerSENet公开。

英文摘要

In this paper, a novel framework, MixerSENet, is introduced for hyperspectral image (HSI) classification, designed to address the challenges of computational efficiency and limited labeled data. The proposed model processes hyperspectral image patches while maintaining consistent size and resolution throughout the network, effectively decoupling the mixing of spatial and channel dimensions. Notably, MixerSENet is lightweight and computationally efficient, requiring fewer parameters compared to traditional models, making it suitable for resource-constrained environments. A squeeze and excitation block is incorporated into the model to refine feature extraction, enhancing the network's ability to capture more informative features. Experimental results on two benchmark datasets demonstrate that MixerSENet achieves superior performance, reaching an overall accuracy (OA) of 82.47% on the Houston13 dataset and 96.70% on the Qingyun dataset, outperforming state-of-the-art methods including 3D-CNN, HybridKAN, HSIFormer, SimPoolFormer, and MorphMamba. Furthermore, a detailed analysis of computational efficiency shows that MixerSENet achieves a favorable balance between accuracy and efficiency, with only 53,146 parameters and a low inference time, confirming its practicality for real-world applications. At publication, source code will be publicly available at https://github.com/mqalkhatib/MixerSENet.
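For readers unfamiliar with the squeeze and excitation block mentioned above, here is a minimal NumPy sketch of the generic mechanism: global average pooling per channel, a two-layer bottleneck, and sigmoid gates that rescale each channel. It illustrates the standard technique only; the layer sizes, the reduction ratio, and the function name `squeeze_excitation` are assumptions and do not reproduce MixerSENet's actual layers.

```python
import numpy as np


def squeeze_excitation(x, w1, w2):
    """Generic squeeze-and-excitation over a feature map of shape (H, W, C).

    w1 has shape (C, C // r) and w2 has shape (C // r, C), where r is a
    reduction ratio chosen by the user (an assumption of this sketch).
    """
    squeezed = x.mean(axis=(0, 1))                    # squeeze: one value per channel
    hidden = np.maximum(squeezed @ w1, 0.0)           # bottleneck with ReLU
    gates = 1.0 / (1.0 + np.exp(-(hidden @ w2)))      # sigmoid gates in (0, 1)
    return x * gates                                  # rescale channels, broadcast over H and W


rng = np.random.default_rng(0)
patch = rng.normal(size=(9, 9, 16))   # a hypothetical 9x9 patch with 16 channels
w1 = rng.normal(size=(16, 4)) * 0.1   # reduction ratio r = 4
w2 = rng.normal(size=(4, 16)) * 0.1
out = squeeze_excitation(patch, w1, w2)
```

The output keeps the input's size and resolution, so the block can be dropped between mixing layers without changing tensor shapes.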

2606.01698 2026-06-02 cs.CV

Learning Label-Efficient Interpretable Medical Image Diagnosis via Semi-supervised Hypergraph Concept Bottleneck Model

通过半监督超图概念瓶颈模型实现标签高效的医学图像可解释诊断

Yijun Yang, Ruiqiang Xiao, Lijie Hu, Angelica I Aviles-Rivero, Yunzhu Wu, Jing Qin, Lei Zhu

发表机构 * HKUST(GZ)(香港科技大学(广州)) Joy Future Academy(未来正义学院) MBZUAI(穆罕默德·本·扎耶德人工智能大学) Tsinghua University(清华大学) Sichuan University(四川大学) PolyU

AI总结 提出一种半监督超图概念瓶颈模型,利用双层超图学习建模高阶概念依赖并生成领域自适应伪标签,在胎盘植入谱系等医学图像诊断中实现高可解释性和性能。

详情
AI中文摘要

深度学习在医学图像分析中取得了革命性进展,在多种应用中提供了卓越的诊断准确性。然而,其决策缺乏可解释性阻碍了临床采纳,特别是在高风险医疗场景中,透明度对可信度至关重要。例如,在胎盘植入谱系(PAS)中,超声图像中的细微线索挑战了可靠诊断,使得黑盒模型难以获得准确的评分信任。为了解决这一问题,概念瓶颈模型(CBM)通过将临床上有意义的中间概念嵌入诊断流程,提供了一种有前景的途径,使临床医生能够审查和优化模型输出。然而,传统的CBM在捕捉复杂的概念间依赖关系方面表现不佳,并且需要昂贵、专家驱动的概念注释,限制了其可扩展性。本研究引入了一种新颖的半监督CBM框架,专为医学成像设计,利用双层超图学习来建模高阶概念依赖并生成领域自适应伪标签。我们的方法通过集成概念级超图以增强推理和图像级超图以生成鲁棒的伪标签,实现了卓越的可解释性和性能。在新标注的PAS超声数据集和乳腺超声公共数据集上的实验证明了所提出的概念标签高效可解释框架的有效性。其通用性在皮肤镜图像数据集SkinCon上得到了进一步验证。代码可在https://github.com/scott-yjyang/HyperCBM获取。

英文摘要

Deep learning has revolutionized medical image analysis, delivering exceptional diagnostic accuracy across diverse applications. Yet, the lack of interpretability in its decision-making hinders clinical adoption, particularly in high-stakes medical contexts where transparency is paramount for trustworthiness. For example, in Placenta Accreta Spectrum (PAS), subtle cues in ultrasound imaging challenge reliable diagnosis, rendering black-box models untrustworthy for accurate scoring. To address this, Concept Bottleneck Models (CBMs) offer a promising avenue by embedding clinically meaningful intermediate concepts into the diagnosis pipeline, enabling clinicians to scrutinize and refine model outputs. However, conventional CBMs falter in capturing complex inter-concept dependencies and demand costly, expert-driven concept annotations, limiting their scalability. This study introduces a novel semi-supervised CBM framework designed for medical imaging, which leverages dual-level hypergraph learning to model high-order concept dependencies and generate domain-adaptive pseudo-labels. Our approach achieves superior interpretability and performance by integrating a concept-level hypergraph for enhanced reasoning and an image-level hypergraph for robust pseudo-label generation. Experiments on a newly annotated PAS ultrasound dataset and a breast ultrasound public dataset demonstrate the effectiveness of the proposed concept label-efficient interpretable framework. Its universality is further validated on the dermoscopic image dataset SkinCon. The code is available at https://github.com/scott-yjyang/HyperCBM.

2606.01695 2026-06-02 cs.LG

CANARY: Zero-Label Detection of Fine-Tuning Contamination in Language Models

CANARY: 语言模型中微调污染的无标签检测

Swapnil Parekh

发表机构 * Switzerland(瑞士)

AI总结 提出CANARY方法,通过稀疏自编码器分析隐藏状态差异,在无标签情况下检测微调数据污染,实现1%污染率下AUROC=1.000,并支持检测、验证、优先排序和修复。

详情
AI中文摘要

攻击者可以通过污染仅1%的微调样本来植入潜在的有害行为。这种污染对所有的输出级防御都是不可见的:有害行为潜伏在模型的隐藏状态几何中,直到污染超过7.5%才会在生成的文本中出现。我们提出了CANARY(通过神经激活表示产出的污染审计器),这是一种无标签检查点审计器,可以直接通过对未标记提示集进行两次前向传递来检测这种隐藏的偏移。CANARY通过稀疏自编码器投影隐藏状态差异,过滤风格噪声以隔离有意义的语义漂移。它在四种模型架构和两种训练范式下,在1%污染率下实现了AUROC=1.000(95%置信区间=[0.997, 1.000];Cohen's d=3.28),比任何输出级方法触发点低7.5倍,并且在良性微调上零误报,对风格匹配和梯度噪声自适应攻击具有完全鲁棒性。相同的SAE特征基础驱动了一个完整的治理流程:SAE过滤放大以比标准生成高5倍的速率揭示潜在危害;得分排序的提示带来4.2倍的红队测试提升;在推理时抑制少数污染特定特征将危害从70%降低到10%,且无困惑度惩罚。CANARY是第一个仅从隐藏状态检测、验证、优先排序和修复供应链污染的无标签框架。

英文摘要

Adversaries can implant latent harmful behavior by poisoning as few as 1% of fine-tuning examples. The contamination is invisible to every output-level defense: harmful behavior lies dormant in the model's hidden-state geometry and does not appear in generated text until contamination exceeds 7.5%. We introduce CANARY (Contamination Auditor via Neural Activation Representation Yield), a zero-label checkpoint auditor that detects this hidden shift directly from two forward passes over an unlabeled prompt set. CANARY projects the hidden-state difference through a Sparse Autoencoder, filtering style noise to isolate meaningful semantic drift. It achieves AUROC = 1.000 at 1% contamination (95% CI = [0.997, 1.000]; Cohen's d = 3.28) across four model architectures and two training paradigms, 7.5x below where any output-level method fires, with zero false positives on benign fine-tuning and full robustness to style-matching and gradient-noise adaptive attacks. The same SAE feature basis drives a complete governance pipeline: SAE-filtered amplification surfaces latent harm at a 5x higher rate than standard generation; score-ranked prompts yield 4.2x red-teaming lift; and suppressing a handful of contamination-specific features at inference time reduces harm from 70% to 10% with no perplexity penalty. CANARY is the first zero-label framework to detect, verify, prioritize, and remediate supply-chain contamination from hidden states alone.

2606.01694 2026-06-02 cs.CV cs.AI cs.LG cs.MM

Understanding Identity Continuity in Thermal Video through Scene-Level Consistency

通过场景级一致性理解热视频中的身份连续性

Wei-Chieh Sun, Gyungmin Ko, Heejae Kwon, Hsiang-Wei Huang, Jenq-Neng Hwang

发表机构 * Department of Electrical and Computer Engineering, Information Processing Lab, University of Washington, USA(电气与计算机工程系,信息处理实验室,华盛顿大学,美国)

AI总结 针对热行人多目标跟踪中身份碎片化问题,提出轻量级后处理方法,通过在线短间隙重映射和离线轨迹重链接恢复身份连续性,在PBVS热行人MOT基准上提升IDF1。

Comments Accepted to CVPR 2026 Workshop on SVC. Published in CVPR Workshops proceedings

详情
Journal ref
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2026, pp. 1411-1419
AI中文摘要

热行人多目标跟踪仍然具有挑战性,因为弱外观线索和频繁的检测中断导致严重的轨迹碎片化。我们研究轻量级后处理是否可以在不依赖重型重识别模型或复杂在线关联的情况下恢复身份连续性。从YOLOv8和SORT基线开始,我们添加了一个模块化的身份修复后端,包括基于时间、空间、运动和边界线索的在线短间隙重映射和离线轨迹重链接。在固定验证集上的受控消融实验和在官方PBVS热行人MOT基准上的评估表明,主要身份增益来自保守的重链接,将IDF1从82.25提升到84.93,同时保持MOTA,而许多启发式阈值在广泛的操作范围内保持稳定。这些结果表明,在低信息热图像中,通过高精度轨迹重链接比增加跟踪器复杂性更能有效地实现鲁棒的身份恢复。这些结果提供了对热视频中身份恢复的受控分析,表明与局部帧到帧关联相比,场景级时空一致性在身份连续性中起主导作用。

英文摘要

Thermal pedestrian MOT remains challenging because weak appearance cues and frequent detection interruptions cause severe trajectory fragmentation. We study whether lightweight post-processing can recover identity continuity without relying on heavy re-identification models or complex online association. Starting from a YOLOv8 and SORT baseline, we add a modular identity-repair backend consisting of online short-gap remapping and offline tracklet relinking based on temporal, spatial, motion, and border cues. Controlled ablations on a fixed validation split and evaluation on the official PBVS Thermal Pedestrian MOT benchmark show that the main identity gains arise from conservative relinking, improving IDF1 from 82.25 to 84.93 while preserving MOTA, whereas many heuristic thresholds remain stable across broad operating ranges. These results suggest that, in low-information thermal imagery, robust identity recovery can be achieved more effectively through high-precision trajectory relinking than through increasing tracker complexity. These results provide a controlled analysis of identity recovery in thermal video, showing that scene-level spatial-temporal consistency plays a dominant role in identity continuity compared to local frame-to-frame association.
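The offline tracklet relinking idea can be sketched as a greedy pass that joins a tracklet's end to a later tracklet's start when the two are close in time and space. This sketch uses only temporal and spatial cues; the motion and border cues named in the abstract are omitted, and the `Tracklet` fields, the `relink` function, and both thresholds are hypothetical rather than the authors' implementation.

```python
import math
from dataclasses import dataclass


@dataclass
class Tracklet:
    track_id: int
    start_frame: int
    end_frame: int
    start_xy: tuple  # box centre at the first frame
    end_xy: tuple    # box centre at the last frame


def relink(tracklets, max_gap=30, max_dist=50.0):
    """Return a mapping from each original track id to its merged identity."""
    id_map = {t.track_id: t.track_id for t in tracklets}
    claimed = set()  # tracklets already attached to a predecessor
    # Process predecessors in order of end time so that chains resolve correctly.
    for a in sorted(tracklets, key=lambda t: t.end_frame):
        best, best_dist = None, max_dist
        for b in tracklets:
            if b.track_id == a.track_id or b.track_id in claimed:
                continue
            gap = b.start_frame - a.end_frame
            if not 0 < gap <= max_gap:
                continue  # b must start shortly after a ends
            dist = math.dist(a.end_xy, b.start_xy)
            if dist <= best_dist:
                best, best_dist = b, dist
        if best is not None:
            id_map[best.track_id] = id_map[a.track_id]
            claimed.add(best.track_id)
    return id_map


tracks = [
    Tracklet(1, 0, 10, (90.0, 100.0), (100.0, 100.0)),
    Tracklet(2, 15, 30, (105.0, 100.0), (200.0, 100.0)),
    Tracklet(3, 33, 50, (203.0, 100.0), (260.0, 100.0)),
    Tracklet(4, 12, 20, (500.0, 500.0), (510.0, 500.0)),
]
merged = relink(tracks)
```

Keeping both thresholds tight makes the linking conservative: a fragment is merged only when the match is unambiguous, which favours precision over recall in identity repair.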

2606.01689 2026-06-02 cs.CV cs.AI

RPCASSM: Robust PCA State Space Model For Infrared Small Target Detection

RPCASSM: 基于鲁棒主成分分析的状态空间模型用于红外小目标检测

Pingping Liu, Aohua Li, Yubing Lu, Jin Kuang, Tongshun Zhang, Qiuzhan Zhou

发表机构 * College of Computer Science and Technology, Jilin University(吉林大学计算机科学与技术学院) Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University(教育部符号计算与知识工程重点实验室) College of Software, Jilin University(吉林大学软件学院) School of Geosciences, Yangtze University(长江大学地球科学学院) College of Communication Engineering, Jilin University(吉林大学通信工程学院)

AI总结 针对红外小目标检测中主流状态空间模型难以准确建模目标边缘的问题,提出基于鲁棒主成分分析(RPCA)的RPCASSM网络,通过设计背景状态空间模块(BSSM)和目标状态空间模块(TSSM)分别利用空间异质信号显著性和目标稀疏局部高亮特性进行状态空间建模,有效解决了边缘建模难题。

Comments 12 pages, 8 figures, under review

详情
AI中文摘要

红外小目标的检测与分割在监控安防、海上救援等领域具有重要的应用意义。由于这些目标在远距离成像中占据像素少,主流的视觉状态空间模型效率低下且难以准确建模目标边缘。现有的红外状态空间模型并未从红外小目标的结构特性出发偏离主流视觉状态空间结构框架。为了解决这一问题,本文基于鲁棒主成分分析(RPCA)的模型范式提出了RPCASSM网络,旨在通过红外小目标在空间域的性质设计背景状态空间模块(BSSM)和目标状态空间模块(TSSM)。BSSM旨在利用空间异质信号的显著性设计空间探测扫描机制(SPCM)来建模背景信息。TSSM利用目标的稀疏性和局部高亮特性设计可变形提示扫描机制(DPCM),聚焦于目标的可变形空间进行状态空间建模。通过上述设计,我们有效解决了现有主流视觉状态空间模型难以准确建模红外小目标边缘结构的问题。在现有基准数据集上的实验结果证明了RPCASSM设计的有效性。我们的代码将在https://github.com/PepperCS/RPCASSM公开。

英文摘要

The detection and segmentation of infrared small targets are important in applications such as surveillance, security, and maritime rescue. Because these targets occupy very few pixels in long-distance imaging, mainstream visual state space models are inefficient and struggle to model target edges accurately. Existing infrared state space models still follow the mainstream visual state space framework rather than being designed around the structural properties of infrared small targets. To address this problem, this paper proposes the RPCASSM network, based on the model paradigm of robust principal component analysis (RPCA), which designs a background state space module (BSSM) and a target state space module (TSSM) according to the spatial-domain properties of infrared small targets. The BSSM uses the saliency of spatially heterogeneous signals to design a spatial probe scanning mechanism (SPCM) that models background information. The TSSM uses the sparsity and local highlight of the target to design a deformable prompt scanning mechanism (DPCM) that focuses state space modeling on the deformable space of the target. With this design, we effectively address the difficulty that existing mainstream visual state space models have in accurately modeling the edge structure of infrared small targets. Experimental results on existing benchmark datasets demonstrate the effectiveness of the RPCASSM design. Our code will be made public at https://github.com/PepperCS/RPCASSM.

2606.01686 2026-06-02 cs.SD cs.AI

HAIM: Human-AI Music Datasets for AI Music Production Tracking Benchmark

HAIM: 用于AI音乐制作跟踪基准的人机音乐数据集

Seonghyeon Go, Yumin Kim

发表机构 * KAIST(韩国科学技术院)

AI总结 针对当前AI音乐检测局限于二元分类的不足,提出HAIM数据集,通过多阶段标签定义“AI音乐跟踪”任务,评估现有检测器缺陷,推动向细粒度结构化评估转变。

详情
AI中文摘要

随着Suno和Udio等生成平台达到人类级音频质量,AI的实用性已扩展到整个音乐制作流程。除了简单的音轨生成,这些进步催生了AI驱动方法在各种形式中的应用,包括声音合成、编曲和专业母带处理。然而,当前的检测研究仍主要局限于二元“AI或人类”范式,未能反映当代音乐制作流程的现实。在真实制作中,AI工具越来越多地被用于优化或母带处理人类制作的音轨,而人类工程师同样对AI生成的材料进行后处理以确保专业质量。此外,用户经常采用对抗策略绕过AI检测器,例如对AI生成的音轨应用人类母带处理。这创造了一个简单的二元分类无法捕捉的灰色地带。在本文中,我们定义并研究“AI音乐跟踪”:在音乐制作的多面光谱中识别特定AI集成的挑战。为此,我们引入HAIM,一个具有音乐制作阶段多样化标签的数据集。它旨在隔离AI干预的阶段,包括混合制作和代理级跟踪。我们对最先进检测器的评估揭示了系统性缺陷。通过发布HAIM,我们提出了一个新的基准,将领域从二元分类转向对AI音乐的细粒度结构化评估。

英文摘要

As generative platforms such as Suno and Udio reach human-grade audio quality, the scope of AI's utility has expanded across the entire music production workflow. Beyond simple track generation, these advancements have catalyzed the adoption of AI-driven methodologies in diverse forms. These include vocal synthesis, arrangement, and professional mastering. However, current detection research remains largely confined to a binary "AI-or-human" paradigm. It fails to reflect the realities of contemporary music production workflows. In real-world production, AI tools are increasingly used to refine or master human-produced tracks, and human engineers likewise post-process AI-generated material to ensure professional quality. Moreover, users often employ adversarial tactics to bypass AI detectors, such as applying human mastering to AI-generated tracks. This creates a grey area that a simple binary classification fails to capture. In this paper, we define and investigate "AI Music Tracking": the challenge of identifying specific AI integration across the multifaceted spectrum of music production. To this end, we introduce HAIM, a dataset with diverse labels for stages of music production. It is designed to isolate stages of AI intervention, including hybrid production and agent-level tracking. Our evaluation of state-of-the-art detectors reveals systemic flaws. By releasing HAIM, we propose a new benchmark that shifts the field beyond binary classification toward a granular, structured evaluation of AI music.

2606.01682 2026-06-02 cs.CL cs.AI cs.LG

Off-the-Shelf LLMs as Process Scorers: Training-Free Alternative to PRMs for Mathematical Reasoning

现成的大语言模型作为过程评分器:数学推理中PRM的无训练替代方案

Atoosa Chegini, Soheil Feizi

发表机构 * Department of Computer Science, University of Maryland(马里兰大学计算机科学系)

AI总结 提出Chunk-Level Guided Generation方法,利用现成的大语言模型作为过程评分器,通过固定长度块评分和对比选择规则,无需训练即可在数学推理中匹配或超越PRM引导搜索的性能。

详情
AI中文摘要

使用更强的评分器从多个小模型样本中选择最佳响应是一种简单的推理时策略,但当小模型已经陷入错误推理路径时,该策略会失败。PRM引导搜索通过在生成过程中对候选延续进行评分来避免这一问题,但需要经过步骤级标签训练的奖励模型。我们提出Chunk-Level Guided Generation,一种无训练的替代方案,使用现成的大语言模型作为过程评分器。在每一步,小模型采样k个固定长度的候选块,而大模型使用似然度对候选块进行评分,无需生成任何文本。选中的块在下一步之前被提交,从而在错误传播之前引导生成。我们用两种选择规则实例化该框架:似然引导选择(LGS),选择具有最高长度归一化大模型对数概率的块;以及对比引导选择(CGS),减去小模型的对数概率,以偏向于大模型偏好与小模型偏好不同的块。我们证明,由于系统性的长度偏差(即使在长度归一化后仍然存在),使用大模型似然度对可变长度推理步骤进行评分是不可靠的,而固定长度块避免了这一混淆。在GSM8K、MATH、Minerva Math、AMC23和AIME24上,使用Qwen2.5-32B引导Qwen2.5-1.5B以及Llama-3.1-70B引导Llama-3.2-1B,CGS在多数投票上最多提升28个百分点,并且在匹配的引导预算下,在大多数基准测试中匹配或超越了Qwen2.5-Math-PRM-72B引导搜索,且无需奖励模型训练。使用Qwen2.5-72B引导Qwen2.5-7B,CGS在k=16时在MATH上达到81.8%,在Minerva Math上达到63.6%,超过多数投票4-6个百分点。最后,Chunk-Level Guided Generation产生的推理轨迹比PRM引导搜索短得多。

英文摘要

Selecting the best response from multiple small-model samples using a stronger scorer is a simple inference-time strategy, but fails when the small model has already committed to incorrect reasoning paths. PRM guided search avoids this by scoring candidate continuations during generation, but requires a reward model trained with step-level labels. We propose Chunk-Level Guided Generation, a training-free alternative that uses an off-the-shelf large language model as a process scorer. At each step, a small model samples k fixed-length candidate chunks, while the larger model scores the candidates using likelihoods without generating any text. The selected chunk is committed before the next step, steering generation before errors can propagate. We instantiate this framework with two selection rules: Likelihood-Guided Selection (LGS), which selects the chunk with the highest length-normalized large-model log-probability, and Contrastive-Guided Selection (CGS), which subtracts the small model's log-probability to favor chunks where the large model's preference diverges from the small model's. We show that scoring variable-length reasoning steps with large-model likelihoods is unreliable due to a systematic length bias that persists even after length normalization, and that fixed-length chunks avoid this confound. On GSM8K, MATH, Minerva Math, AMC23, and AIME24 with Qwen2.5-1.5B guided by Qwen2.5-32B and Llama-3.2-1B guided by Llama-3.1-70B, CGS outperforms majority voting by up to 28 pp and, under matched guidance budgets, matches or outperforms Qwen2.5-Math-PRM-72B guided search on most benchmarks without reward-model training. With Qwen2.5-7B guided by Qwen2.5-72B, CGS reaches 81.8% on MATH and 63.6% on Minerva Math at k=16, surpassing majority voting by 4--6 pp. Finally, Chunk-Level Guided Generation produces substantially shorter reasoning traces than PRM guided search.
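The two selection rules described above reduce to an argmax over per-chunk scores. The sketch below assumes the per-token log-probabilities of each candidate chunk have already been obtained from both models as arrays of shape (k, L), with k candidates of fixed length L; how those log-probabilities are computed, and whether the paper weights or normalizes the contrastive term, is not stated in the abstract, so the function names and the unweighted difference are assumptions.

```python
import numpy as np


def lgs_select(large_logprobs):
    """Likelihood-Guided Selection: highest length-normalized large-model log-prob."""
    scores = np.asarray(large_logprobs, dtype=float).mean(axis=1)
    return int(np.argmax(scores))


def cgs_select(large_logprobs, small_logprobs):
    """Contrastive-Guided Selection: large-model score minus small-model score.

    With fixed-length chunks, normalizing by length would not change the argmax.
    """
    large = np.asarray(large_logprobs, dtype=float)
    small = np.asarray(small_logprobs, dtype=float)
    scores = (large - small).mean(axis=1)
    return int(np.argmax(scores))


# Three hypothetical candidate chunks of two tokens each.
large_lp = [[-1.0, -1.0], [-0.5, -0.5], [-2.0, -2.0]]
small_lp = [[-1.0, -1.0], [-0.2, -0.2], [-4.0, -4.0]]
lgs_choice = lgs_select(large_lp)             # chunk the large model finds most likely
cgs_choice = cgs_select(large_lp, small_lp)   # chunk the large model prefers far more than the small one
```

In this toy input the two rules disagree: LGS picks the chunk both models already like, while CGS picks the chunk where the large model's preference diverges most from the small model's.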

2606.01679 2026-06-02 cs.CL

Encoded but Not Routed: Explaining the Table-Chart Gap in Scientific Claim Verification

编码但未路由:解释科学声明验证中的表格-图表差距

Sunisth Kumar, Xanh Ho, Tim Schopf, Andre Greiner-Petter, Florian Boudin, Akiko Aizawa

发表机构 * The University of Tokyo(东京大学) NII LLMC(日本国立情报学研究所大语言模型研究开发中心) National Institute of Informatics(日本国立情报学研究所) University of Göttingen(哥廷根大学) Inria, LS2N, Nantes Université(法国国家信息与自动化技术研究院(Inria)、LS2N实验室、南特大学)

AI总结 通过分层线性探测和注意力分析,发现多模态大语言模型在科学声明验证中,图表信息被编码但未路由到预测位置,导致表格与图表表现差距。

详情
AI中文摘要

多模态大语言模型越来越多地被用于辅助科学同行评审,其核心要求是验证论文中的声明是否得到其证据的支持。先前的研究表明,当证据是表格时,模型在此任务上的表现显著优于相同底层数据的图表。这引发了一个问题:模型是否未能从图表中提取信息,还是提取了信息但在形成预测时未能使用?我们通过分层线性探测和注意力分析,在三个开源视觉语言模型上研究了这个问题,这些模型处理代表相同底层数据的表格和图表证据。我们发现一致的证据支持后者。图表信息被编码在模型的中间表示中,但未到达预测位置,这一差距在表格中不存在,并且在所有测试条件下都成立。注意力分析进一步揭示,这种断开在不同模型家族中表现为两种架构上不同的形式。这些发现将表格-图表差距重新定义为编码的视觉信息在预测时路由失败的问题,而非编码本身失败。

英文摘要

Multimodal LLMs are increasingly used to assist scientific peer review, where a core requirement is verifying whether claims in a paper are supported by its evidence. Prior work has shown that models perform substantially better at this task when the evidence is a table than when it is a chart of the same underlying data. This raises the question of whether models fail to extract information from charts, or extract it but fail to use it when forming their predictions. We study this question through layer-wise linear probing and attention analysis on three open-weight VLMs over table and chart evidence representing the same underlying data. We find consistent evidence for the latter. Chart information is encoded in the models' intermediate representations but does not reach the prediction position, a gap that is absent for tables and holds across all conditions tested. Attention analysis further reveals that this disconnect takes two architecturally distinct forms across model families. These findings reframe the table-chart gap as a failure of how encoded visual information is routed at prediction time, rather than a failure of encoding itself.

2606.01678 2026-06-02 cs.CL

Why Do Self-Harm Prediction Models Struggle to Generalise? Lexical and Semantic Variations in Emergency Department Triage Notes

为什么自伤预测模型难以泛化?急诊科分诊笔记中的词汇和语义变化

Liuliu Chen, Mike Conway, Jo Robinson, Vlada Rozova

发表机构 * School of Computing and Information Systems, The University of Melbourne, Australia(墨尔本大学计算与信息系统学院) Orygen, The National Centre of Excellence in Youth Mental Health, Australia(Orygen青年心理健康国家卓越中心) Centre for Youth Mental Health, The University of Melbourne, Australia(墨尔本大学青年心理健康中心) Centre for Digital Transformation of Health, The University of Melbourne, Australia(墨尔本大学健康数字化转型中心)

AI总结 通过分析两家医院急诊科分诊笔记的词汇特征、预测特征和主题差异,发现机构间文档差异导致自伤检测模型跨站点性能下降,并提出改进泛化性的方法。

Comments Accepted to CLPsych2026

详情
AI中文摘要

急诊科(ED)的自伤表现与较高的自杀风险密切相关。NLP模型在单家医院的分诊笔记中检测自伤方面表现出稳健的性能,但跨机构时性能常常下降。为了探究潜在原因,我们通过分析词汇特征、高度相关的预测特征和显著主题,比较了两家医院的ED分诊笔记。我们的结果揭示了与自伤相关的词汇表达和特征重要性在不同医院之间存在差异,尽管核心主题如自我中毒和自我伤害是一致的。这些文档差异与跨站点性能下降相关。我们的发现揭示了机构差异如何影响临床文本中自伤的识别,并强调了改善模型泛化性的潜在方法。

英文摘要

Self-harm presentations to emergency departments (EDs) are strongly associated with higher suicide risk. NLP models have shown robust performance in detecting self-harm from triage notes within single hospitals, yet performance often declines across institutions. To examine potential causes, we compare ED triage notes from two hospitals by analyzing lexical characteristics, highly associated predictive features, and salient topics. Our results reveal variation in lexical expression and feature importance related to self-harm across hospitals, despite consistent core themes such as self-poisoning and self-injury. These documentation differences are associated with reduced cross-site performance. Our findings provide insight into how institutional variation affects the identification of self-harm in clinical text and highlight potential methods to improve model generalisability.

2606.01677 2026-06-02 cs.SD

UniVocal: Unified Speech-Singing Code-Switching Synthesis

UniVocal: 统一语音-歌唱代码切换合成

Yufei Shi, Qian Chen, Wen Wang, Xiangang Li, Zhen-Hua Ling, Yang Ai

发表机构 * Tongyi Fun Team, Alibaba Group(通义Fun团队,阿里巴巴集团) Independent Researcher(独立研究者)

AI总结 提出UniVocal统一框架,通过两阶段课程学习和链式思维生成,隐式从文本上下文推断发声模式,实现语音-歌唱代码切换合成,在SCSBench上达到最优性能。

Comments accepted by ACL 2026

详情
AI中文摘要

我们提出UniVocal,一个统一框架,隐式地从文本上下文中推断发声模式,开创了语音-歌唱代码切换(SCS)合成任务——其中转换由文本语义自主驱动,类似于无缝的人类语言混合。与单模式生成或依赖切换控制标签的系统不同,我们提出的UniVocal仅从文本上下文隐式推断发声模式。为实现这一点,我们采用了一种数据高效的两阶段课程学习策略,逐步训练一个具有竞争力的TTS系统以获得所需的SCS能力。针对数据稀缺问题,我们引入了一个可扩展的流水线来合成多样化的代码切换数据,这些数据在语义和声学上都很自然,同时引入了一个新的多场景基准SCSBench。为了解决语义分词器在捕捉声学细节方面的局限性,我们还引入了精炼的cent token和链式思维(CoT)生成,在内容生成之前规划韵律,有效增强了共情语音生成和歌唱旋律。实验结果表明,UniVocal在SCSBench上达到了最先进的性能,同时在常规语音和歌唱任务上保持了竞争性能。音频样本可在https://project-univocal-demo.github.io/demo/获取。代码和数据集已在https://github.com/FunAudioLLM/FunResearch/tree/main/UniVocal发布。

英文摘要

We propose UniVocal, a unified framework that implicitly infers vocal modes from text context to pioneer Speech-Singing Code-Switching (SCS) Synthesis - a task where transitions are autonomously driven by textual semantics, akin to seamless human language blending. Unlike single-mode generation or systems relying on switching-control tags, our proposed UniVocal implicitly infers vocal modes solely from text context. To achieve this, we employ a data-efficient two-stage curriculum learning strategy that progressively trains a competitive TTS system to acquire the desired SCS capability. Addressing data scarcity, we introduce a scalable pipeline to synthesize diverse code-switching data that is both semantically and acoustically natural, alongside a new multi-scenario benchmark, SCSBench. To address limitations of semantic tokenizers in capturing acoustic details, we also introduce refined cent token and Chain-of-Thought (CoT) generation for planning prosody before content generation, effectively enhancing empathetic speech generation and singing melody. Experimental results demonstrate that UniVocal achieves state-of-the-art performance on SCSBench while maintaining competitive performance on regular speech and singing tasks. Audio samples are available at https://project-univocal-demo.github.io/demo/. The code and dataset are released at https://github.com/FunAudioLLM/FunResearch/tree/main/UniVocal.

2606.01672 2026-06-02 cs.LG

RDA: Reward Design Agent for Reinforcement Learning

RDA:用于强化学习的奖励设计智能体

Hojoon Lee, Ajay Subramanian, Ben Abbatematteo, Vijay Veerabadran, Pedro Matias, Karl Ridgeway, Nitin Kamra

发表机构 * University of California, Berkeley(加州大学伯克利分校) DeepMind

AI总结 提出基于视觉语言模型的奖励设计智能体RDA,通过任务分解、视觉轨迹评估和失败模式总结迭代优化奖励函数,在操作任务中生成更符合指令的策略。

Comments Accepted to RLC'26

详情
AI中文摘要

强化学习已经能够获得令人印象深刻的机器人技能,但通常需要手工设计的奖励函数,这些函数设计缓慢且难以与人类意图对齐。最近的工作,如Eureka,通过使用LLM从任务描述中迭代生成和优化奖励代码来自动化奖励设计。然而,它们依赖于粗糙的反馈信号,如成功率,这些信号对学习到的行为提供的语义洞察很少。因此,它们训练的策略达到了最终目标,但经常与任务指令对齐不良。我们引入了奖励设计智能体(RDA),一个基于VLM的智能体框架,将语义理解注入奖励设计。RDA分解任务,视觉评估轨迹,总结失败模式,并迭代修订奖励代码以更好地与任务指令对齐。在ManiSkill的12个桌面操作任务和HumanoidBench的4个全身操作任务中,RDA产生的策略在指令对齐方面显著优于其他基线,同时实现了相当的任务成功率。视频和生成的奖励代码可在https://nitinkamra1992.github.io/reward-design-agent获取。

英文摘要

Reinforcement learning has enabled the acquisition of impressive robotic skills, but typically requires hand-crafted reward functions that are slow to design and difficult to align with human intentions. Recent work, such as Eureka, automates reward design by using an LLM to iteratively generate and refine reward code from task descriptions. However, such methods rely on coarse feedback signals such as success rate, which provide little semantic insight into the learned behavior. As a result, their trained policies achieve the final goal but are frequently poorly aligned with task instructions. We introduce the Reward Design Agent (RDA), a VLM-based agentic framework that injects semantic understanding into reward design. RDA decomposes tasks, visually evaluates trajectories, summarizes failure modes, and iteratively revises reward code to better align with task instructions. Across 12 tabletop manipulation tasks from ManiSkill and 4 whole-body manipulation tasks from HumanoidBench, RDA produces policies substantially more instruction-aligned than those of other baselines, while achieving comparable task success rates. Videos and the generated reward code are available at https://nitinkamra1992.github.io/reward-design-agent.
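
The revise-evaluate cycle described in the abstract can be pictured with a minimal Python sketch. It is not the authors' implementation: `refine_reward` and its three callables (`propose`, `train_and_rollout`, `critique`) are hypothetical names, and the toy stand-ins replace the VLM, the RL training run, and the visual trajectory evaluation with simple string operations so the loop can run on its own.

```python
from typing import Callable, List, Optional, Tuple


def refine_reward(
    task: str,
    propose: Callable[[str, Optional[str], List[str]], str],
    train_and_rollout: Callable[[str], List[str]],
    critique: Callable[[str, List[str]], List[str]],
    rounds: int = 3,
) -> Tuple[str, List[Tuple[str, List[str]]]]:
    """Iteratively revise reward code until the critic reports no failure modes.

    All three callables are hypothetical placeholders for the reward-writing
    model, the policy training step, and the trajectory critic.
    """
    if rounds < 1:
        raise ValueError("rounds must be at least 1")
    reward_code: Optional[str] = None
    feedback: List[str] = []
    history: List[Tuple[str, List[str]]] = []
    for _ in range(rounds):
        # Revise the previous reward code using the latest failure summary.
        reward_code = propose(task, reward_code, feedback)
        trajectory = train_and_rollout(reward_code)
        feedback = critique(task, trajectory)
        history.append((reward_code, feedback))
        if not feedback:  # no remaining failure modes: stop revising
            break
    return reward_code, history


# Toy stand-ins (assumptions for illustration only).
def toy_propose(task: str, prev: Optional[str], feedback: List[str]) -> str:
    # "Reward code" is just a '+'-joined list of rewarded sub-skills.
    if prev is None:
        return "reach"
    return prev + "".join("+" + item for item in feedback)


def toy_rollout(reward_code: str) -> List[str]:
    # Pretend the trained policy performs exactly the rewarded sub-skills.
    return reward_code.split("+")


def toy_critique(task: str, trajectory: List[str]) -> List[str]:
    # Report at most one missing sub-skill per round.
    missing = [s for s in ("grasp", "lift") if s not in trajectory]
    return missing[:1]
```

Calling `refine_reward("pick up the cube", toy_propose, toy_rollout, toy_critique, rounds=5)` walks through three revisions before the toy critic has nothing left to report. In a real system each callable would wrap a model call or a training job, and the feedback would be a natural-language failure summary instead of a single keyword.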

2606.01667 2026-06-02 cs.LG

ATLAS: Agentic Test-time Learning-to-Allocate Scaling

ATLAS: 智能体测试时学习分配缩放

Peijia Qin, Qi Cao, Pengtao Xie

发表机构 * University of California, San Diego(加州大学圣迭戈分校)

AI总结 提出ATLAS框架,让LLM编排器通过单一“探索”动作端到端掌控测试时计算分配,自主决定是否收集更多证据、何时停止以及如何综合最终答案。

详情
AI中文摘要

测试时缩放已成为提升大语言模型推理能力的主要方式,但其编排仍由设计者人工设定:固定的样本预算、固定的改进循环、固定的评分规则或固定的搜索策略决定了计算如何分配,模型只负责求解而不负责编排。我们提出ATLAS,一种智能体测试时缩放框架,其中LLM编排器端到端地掌控控制循环。通过单一动作“探索”(在原问题上派发一个全新的独立求解器),编排器决定是否收集更多证据、何时停止以及如何综合最终答案;动作空间是可扩展的,每次探索调用可选地指定求解器、推理努力程度或提示策略。我们在四个基准上评估ATLAS,涵盖科学问答、代码生成和多模态推理,以Claude Sonnet 4.6为骨干模型,在HLE-Verified上达到56.00%,在LiveCodeBench上达到82.29%,在GPQA-Diamond上达到85.75%,在BabyVision上达到23.71%,同时使用的API调用远少于固定工作流基线。多模型扩展ATLAS-MM将求解器选择作为额外动作维度,进一步将HLE-Verified提升至60.00%,LiveCodeBench提升至85.63%,并在GPQA-Diamond和BabyVision上取得一致的提升。将编排器的直接综合替换为独立整合器的消融实验在四个基准中的三个上降低或未能提高准确率,这与有状态证据管理在产生增益中的作用一致。

英文摘要

Test-time scaling has become a major way to improve large language model reasoning, but its orchestration has remained designer-engineered: a fixed sample budget, a fixed refinement loop, a fixed scoring rule, or a fixed search policy decides how compute is spent, leaving the model in charge of solving but not of orchestration. We introduce ATLAS, an agentic test-time scaling framework in which an LLM orchestrator owns the control loop end-to-end. Through a single action, explore, which dispatches a fresh independent solver on the original problem, the orchestrator decides whether to gather more evidence, when to stop, and how to synthesize the final answer; the action space is extensible, with each explore call optionally specifying solver, reasoning effort, or prompting strategy. We evaluate ATLAS on four benchmarks covering scientific question answering, code generation, and multimodal reasoning under a Claude Sonnet 4.6 backbone, where it reaches 56.00% on HLE-Verified, 82.29% on LiveCodeBench, 85.75% on GPQA-Diamond, and 23.71% on BabyVision while using far fewer API calls than fixed-workflow baselines. A multi-model extension, ATLAS-MM, that exposes solver choice as an additional action dimension further improves HLE-Verified to 60.00% and LiveCodeBench to 85.63%, with consistent gains on GPQA-Diamond and BabyVision. Ablations replacing the orchestrator's direct synthesis with a separate integrator degrade or fail to improve accuracy on three of four benchmarks, consistent with the role of stateful evidence management in producing the gains.
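
As a rough illustration of a control loop built around a single explore action, the Python sketch below may help. It is an assumption-laden toy, not the ATLAS implementation: in the paper an LLM orchestrator makes the stop and synthesis decisions, whereas here `majority_stop` and `majority_answer` are simple voting heuristics swapped in so the example runs without any model, and `orchestrate`, `explore`, `should_stop`, and `synthesize` are hypothetical names.

```python
from collections import Counter
from typing import Callable, List, Tuple


def orchestrate(
    problem: str,
    explore: Callable[[str], str],
    should_stop: Callable[[str, List[str]], bool],
    synthesize: Callable[[str, List[str]], str],
    max_calls: int = 8,
) -> Tuple[str, int]:
    """Dispatch independent solvers until the stop policy is satisfied.

    Returns the synthesized answer and the number of explore calls used.
    """
    if max_calls < 1:
        raise ValueError("max_calls must be at least 1")
    evidence: List[str] = []
    while len(evidence) < max_calls:
        # The only action: run a fresh solver on the original problem.
        evidence.append(explore(problem))
        if should_stop(problem, evidence):
            break
    return synthesize(problem, evidence), len(evidence)


# Heuristic stand-ins for the orchestrator's decisions (assumptions).
def majority_stop(problem: str, evidence: List[str], min_votes: int = 2) -> bool:
    # Stop once one answer has at least min_votes and a strict majority.
    _, count = Counter(evidence).most_common(1)[0]
    return count >= min_votes and count > len(evidence) / 2


def majority_answer(problem: str, evidence: List[str]) -> str:
    # Return the most frequent answer collected so far.
    return Counter(evidence).most_common(1)[0][0]
```

With a solver that returns "A", then "B", then "B", the loop stops after three explore calls and returns "B". The adaptive budget is the point of the sketch: easy problems end after a few agreeing samples, while disagreement keeps the loop running until `max_calls` is reached.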