arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.12475 2026-06-12 cs.RO New submission

Learning to Assist: Collaborative VLAs for Implicit Human-Robot Collaboration

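Leo Xu, Letian Li, Alex Cuellar, Michael Hagenow

Affiliations: University of Wisconsin–Madison; Massachusetts Institute of Technology

AI summary: This paper studies human-robot collaboration with vision-language-action (VLA) models trained by imitation learning, finds that action-chunking policies suffer from demonstration action leakage in implicit collaboration, proposes an inference-time steering method to mitigate premature assistance, and validates it with a user study.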

Abstract

Human-robot collaboration (HRC) combines the complementary strengths of humans and robots to improve task efficiency. However, many existing collaborative systems rely on hand-engineered pipelines, limiting their scalability and flexibility for new tasks. In this work, we show that models trained end-to-end with imitation learning, specifically vision-language-action (VLA) models, can support collaborative manipulation, and characterize the key factors affecting their real-world performance. We evaluate two state-of-the-art models and identify a failure mode of action-chunking policies in implicit HRC, where demonstration action leakage (i.e., action chunks crossing latent task transitions) can cause premature assistive behavior. We find that this issue increases with longer execution horizons and occurs in real-world collaborative VLA systems, such as when a robot attempts to hand over a tool before the person is ready. We propose an inference-time steering method to mitigate these erroneous assistive actions while preserving policy performance. Finally, through a 16-participant user study on a long-horizon collaborative assembly task, we show that steering enables a longer execution horizon while mitigating premature assistance, leading to faster collaboration and fewer failures compared to a shorter-horizon policy.

2606.12473 2026-06-12 cs.CV New submission

Stereo Vision-Based Fall Prediction and Detection using Human Pose Estimation on the AMD Kria K26 SOM

Shreyas Narasimhiah Ramesh, P. D. Rathika, Mahasweta Sarkar, Kristen Wells, Michel Audette, Christopher Paolini

Affiliations: San Diego State University; PSG College of Technology; Old Dominion University

AI summary: Proposes a low-power, portable stereo-vision fall prediction and detection system on the AMD Kria K26 SOM. A three-stage pipeline of quantized YOLOX, A2J, and a CNN performs real-time, privacy-preserving fall detection, with the multi-threaded version reaching 4.5 FPS.

Comments 19 pages; 31 figures

Abstract

Background and Objective: Falls among elderly people can cause serious injury and reduce quality of life. Timely prediction and detection are essential to prevent harm and support well-being. We propose a portable, low-power, battery-operated, vision-based fall prediction and detection system using human pose estimation (HPE) on an AMD Kria K26 System-on-Module (SOM). The objective is a non-intrusive, privacy-preserving system for real-time fall detection. Methods: The system uses an Intel RealSense D455 range-sensing camera connected to the K26 SOM by USB. It captures synchronized RGB and depth frames, 640 x 480 x 3 and 640 x 480 pixels respectively, at 60 FPS. The SOM runs a three-stage pipeline with quantized YOLOX, Anchor-to-Joint (A2J), and fall-detection models. YOLOX identifies human bounding boxes from RGB frames, then discards the RGB frames to preserve privacy. A2J uses depth frames to estimate 15 joint keypoints per person. A CNN uses selected joint coordinates (x, y, z) to classify fall activity. YOLOX was trained on CrowdHuman; A2J on ITOP, MP-3DHP, UR Fall Detection, and a custom SDSU PSG dataset; and the CNN on UR Fall Detection and SDSU PSG. The design used a single-core DPU with a serial pipeline and a dual-core DPU running YOLOX and A2J with multiple threads. Results: Quantized accuracy was evaluated using IoU >= 50% for YOLOX, mAP with a 10-cm rule for A2J, and classification accuracy, (TP + TN)/(TP + TN + FP + FN), for the CNN. Accuracies were 74%, 84.13%, and 75.85%, respectively. Throughput improved from 2.5 FPS for the single-threaded pipeline to 4.5 FPS for the multi-threaded version. Conclusion: Results demonstrate the feasibility of privacy-preserving fall detection on an AMD Kria K26 edge device. On-device HPE and fall classification run without cloud dependency, supporting elderly monitoring and assistive healthcare. Future work will improve model accuracy and speed.
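
Editor's sketch: the classification accuracy formula quoted in the abstract, (TP + TN)/(TP + TN + FP + FN), can be written as a small Python function. The function name and the confusion-matrix counts below are hypothetical and chosen only to illustrate the arithmetic; they are not results from the paper.

```python
def classification_accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Return (TP + TN) / (TP + TN + FP + FN)."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("confusion matrix is empty")
    return (tp + tn) / total


# Hypothetical counts, used only to show the calculation.
print(classification_accuracy(tp=40, tn=45, fp=5, fn=10))  # 0.85
```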

2606.12451 2026-06-12 cs.AI cs.IR cs.LG New submission

ToolSense: A Diagnostic Framework for Auditing Parametric Tool Knowledge in LLMs

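Ashutosh Hathidara, Sai Shruthi Sistla, Sebastian Schreiber, Sahil Bansal

Affiliations: SAP Labs

AI summary: Proposes ToolSense, a diagnostic framework that automatically generates three benchmarks, reveals a knowledge-retrieval dissociation in parametric tool retrieval, and finds that model performance drops sharply on ambiguous queries.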

Abstract

Large language models deployed as agents over large tool catalogs face a critical tool-retrieval bottleneck. As embedding-based retrieval approaches rely on compact encoders that may under-capture specialized tool semantics, parametric tool retrieval addresses this by encoding each tool as a virtual token appended to the LLM vocabulary, fine-tuned in two stages (memorization then retrieval SFT) to use the LLM as a retriever, achieving strong performance on standard ToolBench retrieval benchmarks. Yet these benchmarks use verbose, fully-specified queries, and their evaluation applies constrained decoding that restricts outputs to valid token paths; neither reveals whether the model actually understands its tools. We introduce ToolSense, an open-source LLM-powered diagnostic framework that takes any tool catalog as input and automatically generates three benchmarks: a Realistic Retrieval Benchmark (RRB) with queries at three ambiguity tiers, an MCQ probing benchmark, and a QA probing benchmark. Applying ToolSense to ToolBench (~47k tools) and evaluating five parametric model training configurations reveals a knowledge-retrieval dissociation: on RRB queries, several configurations collapse by ~50-64 percentage points compared to fully-specified ToolBench benchmarks, falling below the embedding-model baseline. Additionally, despite strong retrieval performance, some models score near-random on factual probes, further supporting this dissociation. We open-source the ToolSense framework and the ToolBench diagnostic benchmarks at https://github.com/SAP/toolsense.

2606.13287 2026-06-12 cs.LG cs.DC math.OC New submission

Clipping Makes Distributed and Federated Asynchronous SGD Robust to Stragglers

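Samuel Erickson, Mikael Johansson

Affiliations: KTH Royal Institute of Technology

AI summary: Proves that gradient clipping removes the effect of the maximum delay on the complexity of asynchronous SGD under a sub-Weibull gradient noise model, and gives the first high-probability convergence result in asynchronous optimization.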

Abstract

In modern machine learning, parallelization of training is an important strategy for increasing scale. Asynchronous stochastic gradient descent (ASGD) maximizes the utilization of available hardware by avoiding waiting for slow workers. However, with constant step sizes, the convergence of ASGD is nonetheless affected negatively by slow workers due to large delays in updates. At the same time, it has been empirically observed in asynchronous training of deep learning models that gradient clipping "stabilizes" training. In this work, we provide a theoretical justification for this behavior, as we show that clipping removes the dependence of the oracle complexity on the maximum delay. We employ a sub-Weibull model of gradient noise which generalizes sub-Gaussian and sub-exponential distributions to more heavy-tailed distributions, motivated by empirical observations in deep learning. We show convergence in expectation and, for the first time in asynchronous optimization, convergence with high probability.
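
Editor's sketch: the abstract does not state the exact clipping rule or step-size schedule, so the following Python shows only the standard norm-clipping operation applied to a possibly stale (delayed) gradient. The function names, the threshold, and the step size are assumptions for illustration, not the authors' algorithm.

```python
import math


def clip_by_norm(grad, threshold):
    """Rescale grad so its Euclidean norm is at most threshold."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= threshold:
        return list(grad)
    scale = threshold / norm
    return [g * scale for g in grad]


def clipped_step(params, stale_grad, step_size, threshold):
    """Apply one SGD update using a clipped gradient.

    stale_grad stands in for a gradient that a slow worker computed
    at an older copy of the parameters (hypothetical setup).
    """
    clipped = clip_by_norm(stale_grad, threshold)
    return [p - step_size * g for p, g in zip(params, clipped)]


# A gradient of norm 5 is rescaled to norm 1 before the update.
print(clipped_step([0.0, 0.0], [3.0, 4.0], step_size=0.1, threshold=1.0))
```

Whatever the delay, the update moves the parameters by at most step_size times threshold, which gives some intuition for why clipping can limit the damage from stale gradients.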

2606.12876 2026-06-12 cs.LG cs.CL cs.IT math.IT New submission

Multi-Bitwidth Quantization for LLMs Using Additive Codebooks

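Liza Babaoglu, Shuangyi Chen, Ashish Khisti

Affiliations: University of Toronto

AI summary: Proposes Drop-by-Drop, a framework grounded in information theory and successive refinement that uses additive codebooks and Matryoshka-style supervision so that a single model supports multi-precision weight control at inference time, reducing storage overhead while maintaining performance.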

Comments 37 pages, 12 figures

Abstract

As large language models (LLMs) are increasingly deployed across heterogeneous hardware with varying resource constraints, the ability to adaptively manage the trade-off between performance and efficiency without retraining is critical. We propose Drop-by-Drop, a novel multi-bitwidth post-training quantization framework that enables inference-time precision control over LLM weights from a single trained model. Our method is theoretically grounded in information theory and successive refinement. We establish that LLM weights, which commonly follow a Gaussian distribution, can be optimally reconstructed with increasing fidelity as additional bits are incorporated, under a weighted mean squared error distortion motivated by LLM loss functions. To realize this in practice, Drop-by-Drop incorporates Matryoshka-style supervision into the loss function, exploiting the structure of additive codebooks. Drop-by-Drop produces a single model where ordered subsets of codebooks yield accurate partial reconstructions at each precision level. This approach significantly reduces storage and memory overhead by allowing a single checkpoint to serve multiple bitwidths, while maintaining competitive perplexity and accuracy across major architectures, such as Qwen, LLaMA, Gemma, and Mistral.
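
Editor's sketch: the idea that ordered subsets of additive codebooks yield partial reconstructions can be illustrated with a toy Python example. In additive quantization a weight vector is approximated by a sum of one codeword per codebook; using only the first few codebooks gives a coarser reconstruction. The codebooks, codes, and target vector below are made up for illustration and do not come from the paper, and the training procedure (Matryoshka-style supervision) is not shown.

```python
def reconstruct(codebooks, codes, num_codebooks):
    """Sum the selected codeword from each of the first num_codebooks codebooks."""
    dim = len(codebooks[0][0])
    out = [0.0] * dim
    for book, idx in zip(codebooks[:num_codebooks], codes[:num_codebooks]):
        for d, value in enumerate(book[idx]):
            out[d] += value
    return out


def squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Hypothetical codebooks, ordered from coarse to fine.
codebooks = [
    [[1.0, -1.0], [-1.0, 1.0]],
    [[0.25, 0.0], [0.0, 0.25]],
    [[0.0625, 0.0625], [-0.0625, -0.0625]],
]
codes = [0, 1, 1]            # one codeword index per codebook
target = [0.9375, -0.8125]   # the weight vector being approximated

for m in (1, 2, 3):
    approx = reconstruct(codebooks, codes, m)
    print(m, approx, squared_error(approx, target))
```

Each additional codebook adds bits per weight vector and, in this toy case, lowers the reconstruction error, so one stored set of codes can serve several precision levels.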

2606.12710 2026-06-12 cs.LG math.OC New submission

A Stabilized Path-Space Approach to Diffusion-Based Posterior Sampling

Evan Scope Crafts, Umberto Villa, Saviz Mowlavi, Yanting Ma, Hassan Mansour, Wael H. Ali

Affiliations: Oden Institute for Computational Engineering and Sciences, The University of Texas at Austin; Mitsubishi Electric Research Laboratories (MERL); Department of Biomedical Engineering, The University of Texas at Austin

AI summary: Proposes a stabilized path-space framework that uses stochastic optimal control and trust-region optimization to achieve accurate and robust posterior sampling in nonlinear inverse problems.

Abstract

Diffusion models provide expressive data-driven priors for Bayesian inverse problems, but many diffusion posterior samplers rely on heuristic guidance approximations that can fail for nonlinear operators and multimodal posteriors. In this work, we develop a stabilized path-space framework for diffusion-based posterior sampling. Starting from a base diffusion process whose terminal marginal represents the prior, we define a likelihood-weighted target measure on trajectories and cast posterior sampling as learning a controlled stochastic process whose path measure matches this target. This formulation connects diffusion posterior sampling to stochastic optimal control while preserving the Bayesian structure needed for uncertainty quantification. We introduce a time reparameterization that makes the path-space control problem well posed by removing the bias induced by the unknown initial value function, without auxiliary training. We then learn the control via a trust-region path-space optimization method with log-variance objectives. The path-space perspective also unifies our learned control approach with existing guidance-based samplers, quantifies the sampling error induced by approximate controls, and yields importance sampling corrections for asymptotically exact posterior expectations. We evaluate the proposed framework on a suite of benchmark inverse problems with analytically characterized or high-quality reference posteriors, enabling principled assessment of sampling accuracy and uncertainty quantification. These experiments provide insight into the behavior of diffusion-based posterior samplers and demonstrate improved accuracy and robustness over leading approaches.

2606.12611 2026-06-12 cs.LG cs.IT math.IT New submission

Evaluation of AutoML Frameworks for IDS under Imbalanced Data Conditions of the NSL-KDD Dataset

Wiliane Carolina Silva, Evandro César Vilas Boas, Felipe A. P. de Figueiredo

Affiliations: Cybersecurity and Artificial Intelligence Laboratory (CS&I Lab), National Institute of Telecommunications (Inatel); Wireless and Artificial Intelligence Laboratory (WAI Lab), National Institute of Telecommunications (Inatel)

AI summary: Studies how severe class imbalance in the NSL-KDD dataset affects AutoML frameworks for multiclass intrusion detection. Ensemble learning and imbalance-aware optimization improve minority-class detection, and PyCaret performs best (66% macro-F1).

Abstract

This work investigates the impact of severe class imbalance on the performance of automated machine learning (AutoML) frameworks for multiclass network intrusion detection using the NSL-KDD dataset. Unlike previous studies that simplify the problem through binary classification or minority-class removal, we preserve the original five-class distribution, including highly underrepresented attacks such as R2L and U2R, enabling a realistic evaluation of imbalance-sensitive learning behavior. Nine open-source AutoML frameworks were analyzed under a unified and reproducible experimental protocol, considering differences in architectural design, ensemble strategies, validation procedures, hyperparameter optimization, and imbalance-handling mechanisms. The results demonstrate that frameworks incorporating ensemble learning and imbalance-aware optimization achieve better minority-class discrimination. PyCaret obtained the best overall performance, reaching 66% macro-F1, followed by AutoGluon with 55%, whereas frameworks lacking native balancing support exhibited significant degradation in minority-class detection capability. The analysis further shows that accuracy-oriented optimization alone is insufficient for highly imbalanced IDS scenarios, since high weighted metrics may coexist with poor generalization on rare attack categories. As a contribution, this work establishes a standardized benchmark for AutoML-based intrusion detection under severe multiclass imbalance, highlighting current architectural limitations and the need for native integration of imbalance-aware optimization, resampling, and stratified evaluation strategies into automated learning pipelines. The source code is publicly available.
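
Editor's sketch: the point that high accuracy can coexist with poor minority-class performance is easy to reproduce on toy data. The Python below computes accuracy and macro-F1 from scratch for a classifier that always predicts the majority class. The labels and the 95/5 split are hypothetical and are not taken from NSL-KDD or from the paper's experiments.

```python
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_f1(y_true, y_pred, label):
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 0.0 if denom == 0 else 2 * tp / denom


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so rare classes count equally."""
    labels = sorted(set(y_true))
    return sum(per_class_f1(y_true, y_pred, c) for c in labels) / len(labels)


# Hypothetical imbalanced test set: 95 normal flows, 5 rare attacks.
y_true = ["normal"] * 95 + ["U2R"] * 5
y_pred = ["normal"] * 100  # a classifier that never predicts the rare class

print(accuracy(y_true, y_pred))   # 0.95
print(macro_f1(y_true, y_pred))   # about 0.49
```

Because macro-F1 averages over classes without weighting by support, the completely missed rare class pulls the score below 0.5 even though accuracy is 95%.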

2606.13341 2026-06-12 cs.CV cs.AI physics.med-ph New submission

Dual-Domain Equivariant Generative Adversarial Network for Multimodal CT-PET Synthesis

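Gabriel Steele, Alzahra Altalib, Alessandro Perelli

AI summary: Proposes a dual-domain equivariant generative adversarial network (DDE-GAN) that jointly learns in the spatial and frequency domains and incorporates rotational equivariance for high-fidelity multimodal CT-PET image synthesis.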

Comments 4 pages, 3 figures, 1 table, 2026 IEEE 23rd International Symposium on Biomedical Imaging (ISBI)

Abstract

We present a Dual-Domain Equivariant Generative Adversarial Network (DDE-GAN) for multimodal CT-PET image synthesis. Traditional GAN-based approaches often operate solely in the spatial domain and ignore geometric consistency, resulting in limited structural fidelity. DDE-GAN addresses these challenges by jointly learning from both spatial and frequency (Fourier) domains, capturing complementary anatomical and spectral information. Furthermore, the rotational equivariance embedded in the physics of the CT and PET measurements is integrated into the loss of both the generator and discriminator to ensure consistent responses under rotations, improving anatomical accuracy. A hierarchical dual-domain training strategy enforces intra- and inter-domain consistency through multi-stage loss functions. Evaluated on the HECKTOR 2022 CT-PET dataset, DDE-GAN achieves superior synthesis quality over baseline models. The results demonstrate that combining dual-domain learning with geometric equivariance substantially enhances multimodal image synthesis accuracy and robustness, enabling practical applications in PET completion and data augmentation.

2606.12478 2026-06-12 cs.LG cond-mat.stat-mech quant-ph New submission

Boltzmann Attention: Learnable Ising Couplings for Cooperative Attention

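Gilhan Kim, Daniel K. Park

Affiliations: Yonsei University

AI summary: Proposes Boltzmann attention, which strengthens inter-position interactions in the attention mechanism through learnable Ising couplings, outperforms standard softmax attention on character-level language modeling and bracket matching, and shows that training with quantum annealing is effective.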

Comments 19 pages, 5 figures

Abstract

Attention mechanisms are central to modern sequence models, yet standard attention computes relevance primarily through individual query-key similarities. Although softmax normalization introduces competition among positions, a standard attention layer does not explicitly parameterize learnable interactions between attention decisions. This limits its ability to directly model cooperative or antagonistic co-attention structure within the attention mechanism itself. We propose Boltzmann attention, an energy-based generalization in which attention patterns are governed by an interacting Ising model. The method augments the usual data-dependent local fields with learnable pairwise couplings, allowing the model to represent inter-position correlations beyond those captured by softmax or sigmoid attention. Experiments on character-level language modeling and synthetic bracket matching show that Boltzmann attention consistently improves over standard softmax attention within a standard Transformer architecture, with the advantage becoming more pronounced as sequence length increases. A four-way ablation confirms that the improvement arises from the learnable pairwise couplings. These results suggest that explicit inter-position interactions provide a principled enhancement for attention-based sequence modeling. Moreover, the Ising formulation opens a natural path toward quantum-computing-based sampling strategies: we demonstrate that diabatic quantum annealing provides a practical training method while maintaining competitive performance with exact Boltzmann computation.
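
Editor's sketch: to make "local fields plus pairwise couplings" concrete, the Python below computes exact Boltzmann marginals of a tiny Ising-style model by enumerating all states. The abstract does not give the paper's parameterization, so the binary {0, 1} attention variables, the sign convention of the energy, and the brute-force enumeration are all assumptions made for illustration.

```python
import itertools
import math


def energy(state, fields, couplings):
    """E(s) = -sum_i h_i s_i - sum_{i<j} J_ij s_i s_j (assumed convention)."""
    n = len(state)
    e = -sum(fields[i] * state[i] for i in range(n))
    for i in range(n):
        for j in range(i + 1, n):
            e -= couplings[i][j] * state[i] * state[j]
    return e


def attention_marginals(fields, couplings):
    """Exact P(s_i = 1) under p(s) proportional to exp(-E(s)); cost grows as 2^n."""
    n = len(fields)
    states = list(itertools.product((0, 1), repeat=n))
    weights = [math.exp(-energy(s, fields, couplings)) for s in states]
    z = sum(weights)
    return [
        sum(w for s, w in zip(states, weights) if s[i] == 1) / z
        for i in range(n)
    ]


fields = [0.5, -1.0, 0.2]                      # hypothetical local fields
no_coupling = [[0.0] * 3 for _ in range(3)]
cooperative = [[0.0, 2.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]

print(attention_marginals(fields, no_coupling))
print(attention_marginals(fields, cooperative))
```

With all couplings set to zero, each marginal reduces to the logistic sigmoid of its local field, so under these assumptions independent sigmoid attention appears as a special case. A positive coupling between positions 0 and 1 raises the probability that both are attended together.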

2606.12263 2026-06-12 cs.CV New submission

VOID: Defeating Unauthorized Mimicry in Latent Diffusion Models

Chunlin Qiu, Ang Li, Tianxiao Huang, Ruilin Gan, Yunjie Ge, Shenyi Zhang, Huayi Duan, Lingchen Zhao, Chao Shen, Qian Wang

Affiliations: School of Cyber Science and Engineering, Wuhan University; School of Computer Science, Wuhan University; Institute for Math&AI, Wuhan University; The Hong Kong University of Science and Technology (Guangzhou); School of Cyber Science and Engineering, Xi’an Jiaotong University

AI summary: To counter unauthorized mimicry with latent diffusion models, proposes the VOID defense framework, which manipulates the model's intrinsic stochasticity by amplifying latent encoding errors and counteracting target guidance signals. The resulting semantic corruption blocks unauthorized mimicry while perturbations stay confined to human-imperceptible regions.

Comments Extended full version with more comprehensive experimental results. To appear in the 35th USENIX Security Symposium (USENIX Security 2026)

Abstract

While Latent Diffusion Models (LDMs) have revolutionized visual synthesis, they are increasingly exploited for unauthorized mimicry of individuals. Existing defenses inject deceptive perturbations to steer the generated images toward irrelevant targets. However, this approach hinges on an ungrounded assumption: subtle perturbations can maintain their deceptive efficacy throughout an LDM's extensive generation process. In reality, the model's innate restoration mechanism will remove such perturbations and cause individual identities to re-emerge in the images generated. We propose VOID, a defense framework that overcomes this conundrum by manipulating an LDM's intrinsic stochasticity. VOID perturbs the diffusion pipeline in two novel ways: 1) amplifying the latent encoding errors to shatter an image's semantic structure, and 2) counteracting the target guidance signals to suppress the model's restoration capabilities. This results in a semantic corruption that thwarts any unauthorized mimicry. Notably, the security gain does not come at the price of visual utility, as VOID simultaneously manages to confine perturbations to human-imperceptible regions of protected images. Our comprehensive evaluation of 24 state-of-the-art defenses against 10 mimicry attacks on 5 datasets demonstrates VOID's unprecedented protection power: it increases the average Frechet Inception Distance (FID) from 113 to 365, a 223% improvement over the strongest defense to date.
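
Editor's sketch: the 223% figure is consistent with a plain relative increase between the two FID values quoted in the abstract, reading 113 as the strongest prior defense and 365 as VOID. The helper name is hypothetical, and this reading of the percentage is an assumption rather than something the abstract spells out.

```python
def relative_increase_percent(baseline: float, new: float) -> float:
    """Percentage increase of new over baseline."""
    return (new - baseline) / baseline * 100.0


# FID values quoted in the abstract: 113 rises to 365.
print(round(relative_increase_percent(113, 365)))  # 223
```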

2606.12236 2026-06-12 cs.RO cs.CV New submission

DrivingAgent: Design and Scheduling Agents for Autonomous Driving Systems

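Zhongyu Xia, Wenhao Chen, Yongtao Wang, Ming-Hsuan Yang

Affiliations: Wangxuan Institute of Computer Technology, Peking University; University of California, Merced

AI summary: Proposes DrivingAgent, a framework that automates module development (design phase) and uses a lightweight LLM trained with reinforcement learning for real-time scheduling (scheduling phase). It targets the challenges of integrating new models and meeting real-time constraints in autonomous driving systems, and achieves a better speed-accuracy trade-off on nuScenes and Bench2Drive.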

Abstract

Autonomous driving systems increasingly incorporate foundation models to improve generalization and handle long-tail scenarios. However, this trend introduces two key challenges: (i) the manual and labor-intensive process of designing and integrating new models, and (ii) the lack of intelligent, dynamic scheduling mechanisms to meet strict real-time constraints. While Large Language Model (LLM)-based agents offer a promising avenue for automation, existing frameworks are ill-suited for autonomous driving. Specifically, they fail to distinguish between the fundamentally different requirements of system design and real-time scheduling, treat modules as opaque black boxes, and are not designed for continuous operation. To address these limitations, we propose DrivingAgent, a novel agent framework tailored to the dual challenges of autonomous driving system design and scheduling. In the design phase, DrivingAgent automates module development by interpreting system architecture, generating code, and validating modules via super-network training. In the scheduling phase, it employs a lightweight LLM trained with reinforcement learning to dynamically orchestrate system modules in real time, supported by a structured memory that integrates long-term storage with timestamped short-term context. Experimental results demonstrate that DrivingAgent achieves a superior speed-accuracy trade-off on both the nuScenes and Bench2Drive benchmarks.

2606.12160 2026-06-12 cs.CL New submission

A Controlled Study of Decoding-Time Truthfulness Methods on Instruction-Tuned LLMs

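Ao Sun

Affiliations: Independent Researcher

AI summary: A controlled study of decoding-time truthfulness methods on instruction-tuned LLMs. Under a six-control evaluation framework, previously reported TruthfulQA gains shrink substantially and no token-level method achieves a statistically significant improvement, while deliberative prompting methods such as chain-of-thought appear more robust.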

Abstract

Decoding-time truthfulness methods (layer-contrast decoding, inference-time intervention, and learned logit adapters) have demonstrated 10-30 point gains on TruthfulQA when applied to base language models. However, modern instruction-tuned LLMs already achieve substantially higher baselines (61-76%), raising the question of whether these methods remain effective in practice. We design a six-control evaluation framework (out-of-distribution training, multi-judge validation, simple decoding baselines, confound controls, bootstrap confidence intervals, and seed variance) and apply it across 5 models (1B-70B), 3 benchmarks, and 15 methods. We find that previously reported gains shrink substantially under strict controls: on the full TruthfulQA benchmark (N=817), no token-level method achieves statistically significant improvement, and the best learned adapter scores 2.0 points below greedy (p=.23). We identify five evaluation sensitivities (contamination, judge choice, missing baselines, confounds, and statistical noise) that individually or jointly account for these discrepancies. Cross-benchmark validation on HaluEval QA and TriviaQA confirms that these patterns extend beyond TruthfulQA. Deliberative prompting methods (chain-of-thought, self-critique) appear more robust in the evaluated regime, with CoT achieving +5.6 to +19 pp across benchmarks as a training-free, single-pass method. We release a seven-point evaluation checklist and discuss implications for future truthfulness research.

2606.11894 2026-06-12 cs.CV New submission

Wild3R: Feed-Forward 3D Gaussian Splatting from Unconstrained Sparse Photo Collection

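Yuto Furutani, Takashi Otonari, Kaede Shiohara, Toshihiko Yamasaki

Affiliations: The University of Tokyo

AI summary: Proposes Wild3R, a feed-forward 3D Gaussian Splatting method for unconstrained sparse photo collections. Trained on the new WildCity dataset, which covers diverse lighting and transient objects, it learns cross-view appearance consistency and removes transient content, outperforming existing feed-forward methods and competing with per-scene optimization-based methods.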

Comments Project page: https://furuschool.github.io/wild3r-page/

Abstract

Feed-forward 3D Gaussian Splatting (3DGS) removes the need for time-consuming per-scene optimization required by traditional 3DGS. However, existing feed-forward approaches struggle with real-world photo collections that include diverse lighting conditions and transient objects. In this paper, we present Wild3R, a feed-forward approach for unconstrained sparse photo collections. The main bottleneck is the lack of training data that provides multiple viewpoints, a variety of illuminations, and transient variations necessary for learning robust scene representations. To address this, we introduce the WildCity dataset, which comprises 200 scenes, 170 lighting conditions, and transient objects, resulting in 337,500 images in total. By leveraging the dataset, our model learns appearance consistency across viewpoints conditioned on reference views, while removing transient content. Extensive experiments demonstrate that our method outperforms existing feed-forward approaches and achieves results competitive with prior per-scene optimization-based methods.

2606.11836 2026-06-12 cs.SD cs.AI eess.AS New submission

Towards Data-free and Training-free Compression for Speech Foundation Models Using Parameter Clustering

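Haoning Xu, Zhaoqing Li, Huimeng Wang, Youjun Chen, Chengxi Deng, Mengzhe Geng, Xunying Liu

Affiliations: The Chinese University of Hong Kong; National Research Council Canada

AI summary: Proposes a data-free, training-free compression method based on k-means channel clustering, with fine-grained mixed-sparsity pruning through layer-varying numbers of parameter clusters. It significantly reduces WER relative to magnitude-based pruning on HuBERT-large and Whisper-large-v3.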

Comments Accepted by Interspeech 2026

Abstract

This paper presents a novel data-free and training-free compression approach for speech foundation models using channel-wise clustering via k-means. More fine-grained, mixed-sparsity pruning, which varies the number of parameter clusters from layer to layer, is also explored. Experiments conducted on the LibriSpeech dataset suggest that when operating with a pruning sparsity of 50% on HuBERT-large, consistent WER reductions of 27.73%/18.61% absolute (34.37%/21.91% relative) over magnitude-based pruning were obtained on the test-clean and test-other subsets before fine-tuning, and 0.19%/0.79% absolute (3.36%/4.62% relative) after fine-tuning with only 3 epochs. Similar WER reductions of 2.86%/5.02% absolute (59.21%/55.29% relative) were observed against magnitude-based pruning on Whisper-large-v3 at 10% sparsity, all with no significant WER increase relative to the uncompressed baseline.
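
Editor's sketch: the abstract reports each WER reduction in both absolute and relative terms. Assuming the usual definition, relative reduction = absolute reduction / comparison-system WER, the WER of the magnitude-based pruning system can be backed out approximately. The function name is hypothetical, and the derived values are the editor's arithmetic from rounded figures, not numbers reported by the paper.

```python
def implied_reference_wer(absolute_reduction: float, relative_reduction_percent: float) -> float:
    """Back out the reference WER (in %) from absolute and relative reductions."""
    return absolute_reduction / (relative_reduction_percent / 100.0)


# HuBERT-large, 50% sparsity, before fine-tuning (test-clean, test-other).
for absolute, relative in ((27.73, 34.37), (18.61, 21.91)):
    reference = implied_reference_wer(absolute, relative)
    clustered = reference - absolute
    print(f"reference WER ~{reference:.1f}%, clustered WER ~{clustered:.1f}%")
```

Under this assumption, magnitude-based pruning at 50% sparsity without fine-tuning would sit at roughly 81% and 85% WER on the two subsets, which shows how large the pre-fine-tuning gap is.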

2606.11793 2026-06-12 cs.LG cs.AI physics.ao-ph 新提交

Scalable Deep Learning Framework for Global High-Resolution Land Use Reconstruction

AI4Land: 面向全球高分辨率土地利用重建的可扩展深度学习

Amirpasha Mozaffari, Marina Castaño, Stefano Materia, Etienne Tourigny, Oscar Molina-Sedano, Jordi Varela-Agrelo, Dario Garcia-Gasulla, Miguel Castrillo Melguizo, Mario Acosta, Amanda Duarte

发表机构 * Barcelona Supercomputing Center(巴塞罗那超级计算中心)

AI总结 提出AI4Land框架,采用U-Net两阶段方法,结合粗分辨率情景数据与静态地理特征,重建高分辨率年度土地利用与覆盖,减少陆地碳循环不确定性,支持气候模拟。

AI中文摘要

陆地碳循环的不确定性仍是气候预测的主要制约因素,部分源于地球系统模型中陆面表征和变率的不确定性。为解决此问题,我们提出了数据驱动框架AI4Land,用于生成关键陆面变量的高分辨率历史重建和未来预测。该框架采用U-Net架构的两阶段方法。在第一阶段(本文重点),它通过整合粗分辨率情景数据与静态地理特征,重建年度土地利用与土地覆盖。在计划的第二阶段,生成的高分辨率地图将用于在更细时间尺度上预测动态生物物理变量,特别是叶面积指数。模型基于地球观测数据训练,学习再现空间明确且物理一致的陆面模式,并将时间覆盖扩展到缺乏直接观测的时期。AI4Land在MareNostrum5上开发和训练,展示了GPU加速的高性能计算基础设施如何支持全球尺度的气候AI流水线。最终产品是一套开源模拟器,旨在与数字孪生平台(如Destination Earth计划下开发的平台)实时耦合。通过按需提供逼真且演变的陆面条件,本工作旨在减少关键不确定性,提高下一代气候模拟的预测能力。

英文摘要

Uncertainty in the terrestrial carbon cycle remains a major constraint in climate projections, partly driven by the uncertainties affecting the land surface representation and variability in Earth system models. To address this limitation, we present AI4Land, a data-driven framework for generating high-resolution historical reconstructions and future projections of key land surface variables. The framework follows a two-phase approach using a U-Net architecture. In the first phase, which is the focus of this work, it reconstructs annual land use and land cover by integrating coarse-resolution scenario data with static geophysical features. In a planned second phase, the resulting high-resolution maps will be used to predict dynamic biophysical variables, particularly leaf area index, at finer temporal scales. Trained on Earth observation data, the models learn to reproduce spatially explicit and physically consistent land surface patterns, extending temporal coverage to periods lacking direct observations. AI4Land was developed and trained on MareNostrum5, demonstrating how GPU-accelerated HPC infrastructure enables global-scale climate AI pipelines. The final product is a suite of open-source emulators designed for real-time coupling with digital twin platforms, such as those developed under the Destination Earth initiative. By delivering realistic and evolving land surface conditions on demand, this work aims to reduce critical uncertainties and improve the predictive power of next-generation climate simulations.

2606.11792 2026-06-12 cs.CV cs.AI cs.CL 新提交

MultiToP: Learning to Patch Visual Tokens to Mitigate Hallucinations in Video Large Multimodal Models

MultiToP:学习修补视觉令牌以减轻视频大型多模态模型中的幻觉

Yuansheng Gao, Wenbin Xing, Jiahao Yuan, Kaiwen Zhou, Han Bao, Zonghui Wang, Wenzhi Chen

发表机构 * Zhejiang University(浙江大学) Sun Yat-sen University(中山大学) East China Normal University(华东师范大学)

AI总结 提出MultiToP框架,通过轻量级视觉令牌修补器动态替换不可靠视觉令牌,结合信息引导排名校准和稀疏正则化,在不修改原模型情况下减少视频多模态模型幻觉,显著提升F1分数和问答准确率。

Comments Preprint

AI中文摘要

视频大型多模态模型在视频理解方面取得了显著进展,但仍容易产生幻觉,即生成的响应未能忠实于输入视频。在本文中,我们提出MultiToP,一种多模态上下文感知的视觉令牌修补框架,通过在语言生成之前优化不可靠的视觉令牌来减轻幻觉。MultiToP引入了一个轻量级的视觉令牌修补器,用于预测令牌级替换分布,并选择性地用动态全局修补令牌替换不可靠的视觉令牌。为了有效训练修补器,我们进一步提出了信息引导的排名校准,利用从主干网络派生的答案条件帧级信息线索来指导令牌替换。结合真实答案监督和稀疏正则化,MultiToP实现了局部视觉证据优化,而无需修改原始模型。大量实验表明,MultiToP在Vript-HAL上有效减少了幻觉,且推理开销可忽略不计,将Qwen3-VL-4B-Instruct的F1分数相比原始模型提高了50.60%。同时,MultiToP保持了通用的视频理解能力,在ActivityNet-QA上为Video-LLaVA-7B带来了18.58%的相对准确率提升。

英文摘要

Video Large Multimodal Models have achieved remarkable progress in video understanding, yet they remain prone to hallucinations, where generated responses are not faithfully supported by the input video. In this paper, we propose MultiToP, a multimodal-context-aware visual token patching framework that mitigates hallucinations by refining unreliable visual tokens before language generation. MultiToP introduces a lightweight Visual Token Patcher to predict token-level replacement distributions and selectively substitute unreliable visual tokens with a dynamic global patch token. To train the patcher effectively, we further propose information-guided rank calibration, which uses answer-conditioned frame-level information cues derived from the backbone to guide token replacement. Combined with ground-truth answer supervision and sparsity regularization, MultiToP enables localized visual evidence refinement without modifying the original model. Extensive experiments demonstrate that MultiToP effectively reduces hallucinations on Vript-HAL with negligible inference overhead, improving the F1 scores of Qwen3-VL-4B-Instruct by 50.60% over the vanilla model. Meanwhile, MultiToP preserves general video understanding ability, yielding an 18.58% relative accuracy gain on ActivityNet-QA for Video-LLaVA-7B.

2606.11767 2026-06-12 cs.RO cs.AI 新提交

Blind Dexterous Grasping via Real2Sim2Real Tactile Policy Learning

通过真实到仿真到真实触觉策略学习的盲操作灵巧抓取

Shengcheng Luo, Xiyan Huang, Zhe Xu, Wanlin Li, Ziyuan Jiao, Chenxi Xiao

发表机构 * ShanghaiTech University(上海科技大学) Beijing Institute for General Artificial Intelligence(北京通用人工智能研究院)

AI总结 提出一种结合Real2Sim触觉校准、布局感知触觉编码器和触觉条件扩散策略的框架,实现仅依赖触觉的灵巧手盲抓取,在真实机器人上对20个物体达到27%成功率。

Comments 23 pages, 6 figures

AI中文摘要

使用灵巧手进行盲抓取是一项关键的操作能力。然而,由于触觉的仿真到真实差距以及稀疏触觉信号的有限表达能力,为真实机器人学习这种仅依赖触觉的策略仍然具有挑战性。为了弥合这一差距,我们提出了一个仅依赖触觉的盲抓取框架,该框架可部署在物理多指机器人手上。我们的方法结合了三个关键组成部分。首先,我们引入了一个Real2Sim触觉校准流程,构建了一个接触校准的数字孪生模拟器,能够复现真实的触觉信号。其次,我们使用布局感知触觉编码器改进了稀疏触觉观测的表达能力,该编码器通过自监督预训练融入了传感器几何先验。第三,为了提高对未见物体的泛化能力,我们在校准后的模拟器中训练了特定物体的强化学习专家,并将其成功的抓取轨迹聚合为触觉条件扩散策略。我们在配备分布式触觉传感的物理LEAP手上评估了我们的方法,涉及10个见过和10个未见过的物体。部署的策略在所有20个物体上实现了27%的真实世界抓取成功率,无需真实世界的抓取演示或视觉输入。仿真消融实验表明,布局感知触觉预训练提高了抓取性能,而传感级评估确认Real2Sim校准增加了仿真与硬件之间触觉接触事件的一致性。这些结果表明,接触事件校准、几何感知触觉表示学习和基于扩散的策略聚合为真实灵巧机器人手上的仅触觉盲抓取提供了一条有效路径。项目页面:Dex-Blind-Grasp.github.io。

英文摘要

Blind grasping with a dexterous hand is a crucial manipulation capability. Nevertheless, learning such tactile-only policies for real robots remains challenging due to the tactile sim-to-real gap and the limited expressiveness of sparse tactile signals. To bridge this gap, we propose a framework for tactile-only blind grasping that is deployable on a physical multi-fingered robotic hand. Our approach combines three key components. First, we introduce a Real2Sim tactile calibration pipeline that constructs a contact-calibrated digital-twin simulator capable of reproducing real tactile signals. Second, we improve the expressiveness of sparse tactile observations using a layout-aware tactile encoder, which incorporates sensor-geometry priors through self-supervised pretraining. Third, to improve generalization to unseen objects, we train object-specific reinforcement-learning experts in the calibrated simulator and aggregate their successful grasp trajectories into a tactile-conditioned Diffusion Policy. We evaluate our method on a physical LEAP Hand equipped with distributed tactile sensing across 10 seen and 10 unseen objects. The deployed policy achieves a 27% real-world grasp success rate across all 20 objects, without real-world grasping demonstrations or visual input. Simulation ablations show that layout-aware tactile pretraining improves grasping performance, while sensing-level evaluations confirm that Real2Sim calibration increases the consistency of tactile contact events between simulation and hardware. Together, these results suggest that contact-event calibration, geometry-aware tactile representation learning, and diffusion-based policy aggregation provide an effective path toward tactile-only blind grasping on real dexterous robotic hands. Project page: Dex-Blind-Grasp.github.io.

2606.11681 2026-06-12 cs.CL cs.SD 新提交

UR-BERT: Scaling Text Encoders for Massively Multilingual TTS Through Universal Romanization and Speech Token Prediction

UR-BERT:通过通用罗马化和语音标记预测扩展大规模多语言TTS的文本编码器

Sangmin Lee, Eekgyun Ahn, Woongjib Choi, Hong-Goo Kang

发表机构 * Dept. of Electronics and Electrical Engineering, Yonsei University(延世大学电子与电气工程系)

AI总结 提出UR-BERT,一种基于罗马化转录的TTS编码器,通过统一书写系统为罗马化表示,结合语音标记预测目标,在495种语言上实现高效多语言TTS,优于现有基线并泛化到未见语言。

Comments Accepted to Interspeech 2026, Github: https://github.com/sanghyang00/ur-bert

AI中文摘要

我们提出UR-BERT,一种基于罗马化转录的文本到语音(TTS)编码器,用于大规模多语言TTS系统。传统的字素到音素(G2P)方法由于可靠G2P资源的可用性,仅限于约100种语言。相比之下,UR-BERT通过将多样化的书写系统统一为共享的罗马化表示,扩展到495种语言。为了进一步增强语音保真度和文本-语音对齐,我们在训练过程中引入了一个语音标记预测目标,这促使编码器以数据高效的方式学习语音感知的语音表示。实验表明,基于UR-BERT构建的TTS系统在广泛的语言和资源条件下,始终优于最近的文本编码器基线,并展现出对未见语言的强大泛化能力。

英文摘要

We propose UR-BERT, a Romanized transcription-based text-to-speech (TTS) encoder for massively multilingual TTS systems. Conventional grapheme-to-phoneme (G2P)-based approaches are limited to around 100 languages due to the limited availability of reliable G2P resources. In contrast, UR-BERT scales to 495 languages by unifying diverse writing systems into a shared Romanization representation. To further enhance phonetic fidelity and text-speech alignment, we introduce a speech token prediction objective during training, which encourages the encoder to learn speech-aware phonetic representations in a data-efficient manner. Experiments show that TTS systems built on UR-BERT consistently outperform recent text encoder baselines across a wide range of languages and resource conditions, and demonstrate strong generalization to unseen languages.

2606.11255 2026-06-12 cs.LG 新提交

Bernstein-Schur Kernels: Random Features by Sketched Modulation and Radial Randomization

Bernstein-Schur核:通过草图调制和径向随机化的随机特征

Taha Bouhsine

发表机构 * Azetta AI

AI总结 提出一种随机特征构造方法,用于Bernstein-Schur核类,通过草图化有限调制和随机化完全单调径向因子,实现无偏估计和算子范数界,应用于yat核族。

AI中文摘要

Bernstein-Schur核是有限特征核(具有显式有限维特征映射的核)与完全单调平移不变核的乘积:非平稳核介于平移不变和点积模板之间,随机特征通常利用后者,因此一般Bochner采样或多项式草图都不能直接应用于完整核。我们为整个类给出一种随机特征构造,它随机化两个因子:草图化有限调制并随机化完全单调径向因子,对后者的单变量Bernstein-Widder尺度进行采样,然后应用高斯随机傅里叶特征(其频率仍是d维的)。特征维度为Dm,由草图大小m和径向抽取次数D设定,与精确调制特征的O(d^2)大小无关。保持调制精确是可分析极限(m→∞):在那里我们证明无偏性、推荐平坦估计量的精确方差、期望矩阵-Bernstein算子范数界(具有匹配的高概率尾部),该界由核和调制Gram矩阵的最大特征值以及固有维度控制,而非粗糙的N max_{ij}逐元素路径,以及确定性相对谱核岭稳定性结果。通过条件化于草图,双随机化估计量继承了相同的固有维度算子范数保证,加上一个可调加性草图项,该草图项由m独立于D调节。激励实例是有偏yat核k_{yat,b}(w,x)=(w^⊤x+b)^2/(‖w-x‖^2+ε),b≥0,其族通过b的有限差分包含逆多二次核;对于它,径向混合是IMQ谱采样器,每个尺度一个频率在固定径向特征预算下是方差最优的。

英文摘要

Bernstein-Schur kernels are products of a finite-feature kernel and a completely monotone shift-invariant kernel: nonstationary kernels falling between the shift-invariant and dot-product templates random features exploit, so neither Bochner sampling nor polynomial sketching applies to the full kernel directly. We give one random-feature construction for the whole class that randomizes both factors: it sketches the finite modulation and samples the radial factor's one-dimensional Bernstein-Widder scale before applying Gaussian random Fourier features, giving feature dimension $Dm$, free of the $O(d^2)$ size of the exact modulation feature. With the modulation kept exact (the $m\to\infty$ limit), we prove unbiasedness, an exact variance, and a matrix-Bernstein operator-norm bound controlled by the top kernel and modulation eigenvalues and an intrinsic dimension rather than the crude $N\max_{ij}$ route. Whitening this argument at the ridge makes the effective dimension $d_{\mathrm{eff}}(\lambda)$ the exact intrinsic dimension of the matrix variance, so $O((1+\|P\|_{\mathrm{op}}/\lambda)\log(d_{\mathrm{eff}}/\delta))$ radial draws preserve the kernel-ridge solution; tilting the draw by a closed-form whitened leverage improves this to the effective-dimension count $O((1+d_{\mathrm{eff}})\log(d_{\mathrm{eff}}/\delta))$. Conditioning on the sketch carries every guarantee to the deployed doubly-randomized estimator up to one additive sketch term, and all hold for the whole class with the modulation Gram in place of the polynomial one. The flagship instance is the biased yat-kernel $k_{yat,b}(w,x)=(w^\top x+b)^2/(\|w-x\|^2+\varepsilon)$, whose family span contains the inverse-multiquadric kernel by finite differences in $b$.
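
For intuition, the radial part of the yat-kernel can be estimated with a few lines of Monte Carlo. This is only a sketch of the exact-modulation case, not the authors' estimator: it assumes the standard identity 1/(r^2 + ε) = (1/ε) E_s[exp(-s r^2)] with s drawn from an exponential distribution of rate ε, followed by one Gaussian Fourier frequency ω ~ N(0, 2sI) per sampled scale. All function names are hypothetical, and the sketched modulation (the size-m factor) is deliberately left out.

```python
import math
import random

def yat_kernel(w, x, b=0.0, eps=1.0):
    """Exact biased yat-kernel: (w.x + b)^2 / (||w - x||^2 + eps)."""
    dot = sum(wi * xi for wi, xi in zip(w, x))
    sq_dist = sum((wi - xi) ** 2 for wi, xi in zip(w, x))
    return (dot + b) ** 2 / (sq_dist + eps)

def sample_radial_features(dim, num_draws, eps=1.0, seed=0):
    """Draw one (frequency, phase) pair per sampled scale s ~ Exponential(rate=eps)."""
    rng = random.Random(seed)
    draws = []
    for _ in range(num_draws):
        s = rng.expovariate(eps)          # Bernstein-Widder scale of 1/(r^2 + eps)
        std = math.sqrt(2.0 * s)          # exp(-s r^2) <-> omega ~ N(0, 2 s I)
        omega = [rng.gauss(0.0, std) for _ in range(dim)]
        phase = rng.uniform(0.0, 2.0 * math.pi)
        draws.append((omega, phase))
    return draws

def radial_features(x, draws):
    """Random Fourier feature map sqrt(2/D) * cos(omega.x + phase)."""
    scale = math.sqrt(2.0 / len(draws))
    return [scale * math.cos(sum(o * xi for o, xi in zip(omega, x)) + phase)
            for omega, phase in draws]

def yat_estimate(w, x, draws, b=0.0, eps=1.0):
    """Exact modulation times the Monte Carlo estimate of the radial factor."""
    dot = sum(wi * xi for wi, xi in zip(w, x))
    phi_w, phi_x = radial_features(w, draws), radial_features(x, draws)
    radial = sum(a * c for a, c in zip(phi_w, phi_x)) / eps
    return (dot + b) ** 2 * radial

w, x = [0.5, -0.2, 0.1], [0.3, 0.4, -0.1]
draws = sample_radial_features(dim=3, num_draws=20000)
exact = yat_kernel(w, x, b=1.0)
approx = yat_estimate(w, x, draws, b=1.0)
```

With enough radial draws `approx` should land close to `exact`; replacing the exact `(dot + b) ** 2` factor by a sketched feature product would give the doubly randomized variant the abstract describes.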

2606.11190 2026-06-12 cs.LG 新提交

When to Align, When to Predict: A Phase Diagram for Multimodal Learning

何时对齐,何时预测:多模态学习的相图

Ilay Kamai, Hugues Van Assel, Aviv Regev, Hagai B. Perets, Randall Balestriero

发表机构 * Technion(以色列理工学院) Genentech(基因泰克公司) Brown University(布朗大学) Meta AI, FAIR

AI总结 提出统一线性框架,通过信噪比模型揭示跨模态对齐与预测的互补失效模式,构建四区域相图指导多模态学习目标选择,并在非线性实验中验证。

AI中文摘要

跨模态对齐(CA)和跨模态预测(CP)是多模态表示学习的主要范式,但目前缺乏对每种方法何时成功、何时失败以及跨模态训练何时有帮助的系统性理解——这一空白使得从业者,特别是在生物医学或天体物理学等科学领域,面对异构仪器以及多个层次的组织和测量时,无法诊断为什么标准方法不如最佳单模态。我们开发了一个统一的线性框架来解决这两个问题。在具有结构化跨模态干扰相关性的尖峰信号加噪声模型下,我们推导出两个目标的分离比,揭示了互补的失效模式:对齐使每个模态白化,当干扰在视图间强相关时失败;预测通过单侧白化编码任何可跨模态预测的内容,恢复由源模态质量决定。由此产生的相图将多模态问题划分为四个区域:两者、仅CA、仅CP和两者都不。我们提出了一种数据驱动的方法,使用少量标记子样本将真实世界数据集定位在该图中,在任何跨模态训练之前确定首选目标和预测方向。在合成数据、立体视觉基准、图像-文本对和真实天体物理数据上的实验验证了非线性情况下的预测,包括跨模态训练有害的“两者都不”区域。我们的框架使从业者能够诊断其多模态问题,并在投入训练之前选择正确的目标。重现结果的代码可在 https://github.com/IlayMalinyak/mm_align_vs_pred 获取。

英文摘要

Cross-modal alignment (CA) and cross-modal prediction (CP) are the dominant paradigms for multimodal representation learning, yet there is no systematic understanding of when each succeeds, when each fails, and when cross-modal training helps at all -- a gap that leaves practitioners, especially in scientific domains like biomedicine or astrophysics, with heterogeneous instruments and multiple levels of organization and measurement, unable to diagnose why standard methods underperform the best single modality. We develop a unified linear framework that addresses both questions. Under a spiked signal-plus-noise model with structured cross-modal nuisance correlation, we derive separation ratios for both objectives that expose complementary failure modes: alignment whitens each modality and fails when nuisance is strongly correlated across views; prediction encodes whatever is cross-predictable through a one-sided whitening, with recovery governed by source-modality quality. The resulting phase diagram partitions multimodal problems into four regimes: Both, CA only, CP only, and Neither. We present a data-driven procedure to locate real-world datasets in this diagram using a small labeled subsample, identifying the preferred objective and prediction direction before any cross-modal training. Experiments on synthetic data, stereo-vision benchmarks, image-caption pairs, and real astrophysical data validate the predictions in the nonlinear regime, including the Neither regime where cross-modal training is actively harmful. Our framework lets practitioners diagnose their multimodal problem and choose the right objective before committing to training. Code to reproduce the results is available at https://github.com/IlayMalinyak/mm_align_vs_pred.

2606.10716 2026-06-12 cs.CL cs.AI 新提交

Attention Expansion: Enhancing Keyphrase Extraction from Long Documents with Attention-Augmented Contextualized Embeddings

注意力扩展:利用注意力增强的上下文嵌入提升长文档关键短语提取

Roberto Martínez-Cruz, Alvaro J. López-López, José Portela

发表机构 * Institute for Research in Technology, ICAI School of Engineering, Comillas Pontifical University(技术研究所,ICAI工程学院,科米利亚斯宗座大学) DD-AIM, Senior Machine Learning Researcher(DD-AIM,高级机器学习研究员)

AI总结 提出注意力扩展机制,通过预训练词嵌入增强PLM的上下文表示,在不增加计算成本的情况下扩展有效上下文范围,显著提升长文档关键短语提取性能。

AI中文摘要

预训练语言模型(PLM)在关键短语提取(KPE)中取得了强劲性能,主要得益于其生成丰富上下文表示的能力。然而,长文档KPE仍然具有挑战性,因为显著的关键短语证据可能分散在遥远的文档部分,而这些部分无法在大多数PLM有限的上下文窗口内被联合捕获。尽管长上下文大语言模型(LLM)可以处理更广泛的文本上下文,但其计算成本限制了它们在高效和高通量KPE中的实用性。为了克服这一限制,我们提出了一种注意力扩展机制,该机制利用预训练词嵌入,用周围超出上下文的块中的信息来增强PLM的令牌表示。所提出的机制扩展了基于PLM的KPE模型的有效上下文范围,而无需全文档注意力或昂贵的基于LLM的推理。我们在五个PLM骨干网络上评估了我们的方法,包括通用、科学、任务特定和长上下文编码器,使用了两种训练机制和来自科学和新闻领域的五个基准语料库。实验结果表明,注意力扩展在所有评估设置中一致地提升了KPE性能,超越了最先进的模型,并在F1分数上取得了显著改进。这些改进扩展到领域特定、任务专门化和原生长上下文模型,表明所提出的机制提供了互补信息,而不仅仅是补偿有限的输入长度。这些结果确立了注意力扩展作为长文档KPE的一种高效且有效的策略。

英文摘要

Pre-trained language models (PLMs) have achieved strong performance in keyphrase extraction (KPE), largely due to their ability to generate rich contextualized representations. However, long-document KPE remains challenging because salient keyphrase evidence may be scattered across distant document sections that cannot be jointly captured within the limited context window of most PLMs. Although long-context large language models (LLMs) can process broader textual contexts, their computational cost limits their practicality for efficient and high-throughput KPE. To overcome this limitation, we propose an attention expansion mechanism that augments PLM token representations with information from surrounding out-of-context chunks using pre-trained word embeddings. The proposed mechanism expands the effective contextual scope of PLM-based KPE models without requiring full-document attention or expensive LLM-based inference. We evaluate our approach across five PLM backbones, including general-purpose, scientific, task-specific, and long-context encoders, using two training regimes and five benchmark corpora from scientific and news domains. Experimental results demonstrate that attention expansion consistently enhances KPE performance across all evaluation settings, outperforming state-of-the-art models and yielding notable improvements in F1 score. The improvements extend to domain-specific, task-specialized, and native long-context models, showing that the proposed mechanism provides complementary information rather than merely compensating for limited input length. These results establish attention expansion as an efficient and effective strategy for long-document KPE.

2606.10683 2026-06-12 cs.RO cs.AI cs.CV 新提交

UniDexTok: A Unified Dexterous Hand Tokenizer from Real Data

UniDexTok:基于真实数据的统一灵巧手分词器

Dong Fang, Youjun Wu, Yuanxin Zhong, Rui Zhang, Yunlong Wang, Xiaosong Jia, Yu-Gang Jiang

发表机构 * Fudan University(复旦大学) Hefei University of Technology(合肥工业大学) Rimbot Beijing University of Posts and Telecommunications(北京邮电大学)

AI总结 提出统一灵巧手模型(UDHM)将人手和机器人手状态映射到共享22自由度语义接口,并基于此开发UniDexTok,一种免重定向的状态分词器,学习基于真实关节状态的离散token,实现异构灵巧手的统一表示,误差降低98%以上。

AI中文摘要

灵巧手对于精细操作至关重要,但其硬件设计在不同实施例之间存在显著差异。运动学、关节定义和自由度方面的差异使得定义共享状态表示变得困难,与平行夹爪相比更是如此。因此,灵巧手数据仍然碎片化,难以用于联合训练。在这项工作中,我们提出了统一灵巧手模型(UDHM),它将人手和机器人手状态映射到一个共享的22自由度语义接口。基于UDHM,我们引入了UniDexTok,一种免重定向的状态分词器,它从标准化的真实关节状态中学习基于实施例的离散token。UniDexTok为异构灵巧手提供了统一表示,无需依赖重定向或仿真数据。与最近的基线UniHM相比,UniDexTok将MPJAE从15.63度降低到0.16度,MPJPE从18.51毫米降低到0.18毫米,误差分别减少了98.98%和99.03%。这些结果将重建精度从厘米级提升到亚毫米级。实验进一步表明,来自其他实施例的数据提高了目标实施例的重建精度,证明了跨实施例分词的优势。当引入新的灵巧手时,UniDexTok还表现出强大的零样本和少样本重建能力。

英文摘要

Dexterous hands are essential for fine-grained manipulation, but their hardware designs vary substantially across embodiments. Differences in kinematics, joint definitions, and degrees of freedom make it much harder to define a shared state representation for them than for parallel grippers. As a result, dexterous-hand data remains fragmented and difficult to use for joint training. In this work, we propose the Unified Dexterous Hand Model (UDHM), which maps human and robot hand states into a shared 22-DoF semantic interface. Based on UDHM, we introduce UniDexTok, a retargeting-free state tokenizer that learns embodiment-conditioned discrete tokens from standardized real joint states. UniDexTok provides a unified representation for heterogeneous dexterous hands without relying on retargeting or simulation data. Compared with the recent baseline UniHM, UniDexTok reduces MPJAE from 15.63 degrees to 0.16 degrees and MPJPE from 18.51 mm to 0.18 mm, corresponding to error reductions of 98.98% and 99.03%, respectively. These results improve reconstruction from centimeter-scale to sub-millimeter accuracy. Experiments further show that data from other embodiments improves target-embodiment reconstruction accuracy, demonstrating the benefit of cross-embodiment tokenization. UniDexTok also shows strong zero-shot and few-shot reconstruction ability when new dexterous hands are introduced.
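
The quoted error reductions follow directly from the reported before/after values. A quick check, using only the four numbers given in the abstract (the helper name `relative_reduction` is just for illustration):

```python
def relative_reduction(baseline, ours):
    """Percentage by which `ours` lowers the error relative to `baseline`."""
    return 100.0 * (baseline - ours) / baseline

mpjae_reduction = relative_reduction(15.63, 0.16)  # joint angle error, degrees
mpjpe_reduction = relative_reduction(18.51, 0.18)  # joint position error, millimetres
print(f"MPJAE: {mpjae_reduction:.2f}%  MPJPE: {mpjpe_reduction:.2f}%")
```

Both values round to the percentages stated in the abstract, so the relative figures are consistent with the absolute ones.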

2606.10678 2026-06-12 cs.LG 新提交

One Step Closer to Ground Truth: A Multi-Scale Residual-Aware Representation Learning Pipeline for Predicting Time Series Data

更接近真实:一种多尺度残差感知表示学习管道用于时间序列预测

Amrijit Biswas, Mustafa Kamal, Robin Krambroeckers, M. M. Lutfe Elahi, Sifat Momen, Nabeel Mohammed, Shafin Rahman

发表机构 * RobotBulls Labs(RobotBulls实验室) North South University(南北大学)

AI总结 提出两阶段模型无关框架,通过显式解耦预测与残差学习,使用元校正器动态建模结构误差模式,提升Transformer预测精度。

Comments Accepted at the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2 (KDD '26)

AI中文摘要

近年来,基于Transformer的模型已成为时间序列预测的主要范式,利用自注意力机制捕获长程依赖关系。尽管取得了成功,但这些单阶段预测架构由于结构差异、未建模的随机成分或多尺度时间表示不足,表现出持续的系统性残差偏差。当残差被视为不可约噪声时,这一局限性依然存在,阻碍了对结构化误差模式的自适应校正。为解决这一问题,我们引入了一个两阶段、模型无关的框架,将预测和残差学习显式解耦为不同的表示学习阶段。基础Transformer首先生成初始预测。随后,专用的元校正器动态建模跨多元通道的结构化误差模式,保留跨变量依赖关系,并迭代修正基础Transformer的残差偏差。通过将该管道形式化为假设空间扩展,我们的框架解决了单阶段架构固有的近似局限性,消除了对限制性假设的依赖,并实现了复杂误差动态的端到端学习。在八个流行的基准数据集上使用既定协议进行评估,我们的方法达到了最先进的性能,在标准指标(MSE、MAE)上有显著改进。结果表明,该框架能够减轻系统性偏差,增强对复杂时间动态的鲁棒性,推进了基于Transformer的预测模型的实际应用。

英文摘要

Transformer-based models have emerged as leading paradigms in time-series forecasting in recent years, employing self-attention mechanisms to capture long-range dependencies. Despite their success, these single-stage forecasting architectures exhibit persistent systematic residual biases arising from structural discrepancies, unmodeled stochastic components, or inadequate multi-scale temporal representations. This limitation persists when residuals are treated as irreducible noise, precluding adaptive correction of structured error patterns. To address this limitation, we introduce a two-stage, model-agnostic framework that explicitly decouples forecasting and residual learning into distinct stages of representation learning. A base transformer first generates the initial predictions. Subsequently, a dedicated meta-corrector dynamically models structured error patterns across multivariate channels, preserves cross-variable dependencies, and iteratively refines the residual bias of the base transformer. By formalizing this pipeline as a hypothesis space expansion, our framework addresses approximation limitations inherent in single-stage architectures, removes reliance on restrictive assumptions, and enables end-to-end learning of complex error dynamics. Evaluated on eight popular benchmark datasets using established protocols, our approach achieves state-of-the-art performance, with significant improvements in standard metrics (MSE, MAE). The results demonstrate the framework's ability to mitigate systematic biases and enhance robustness to complex temporal dynamics, advancing the practical applicability of transformer-based forecasting models.
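
The decoupling of forecasting from residual learning can be shown on a toy problem. The sketch below is an illustration under strong simplifying assumptions, not the paper's architecture: a persistence forecast stands in for the base transformer, and a single learned bias term stands in for the meta-corrector. All names and the synthetic series are hypothetical.

```python
import math

def base_forecast(history):
    """Stage 1 stand-in: persistence forecast (predict the last observed value)."""
    return history[-1]

def fit_corrector(series):
    """Stage 2 stand-in: learn the mean one-step residual of the base model."""
    residuals = [series[t + 1] - base_forecast(series[: t + 1])
                 for t in range(len(series) - 1)]
    return sum(residuals) / len(residuals)

def mse(preds, targets):
    return sum((p - y) ** 2 for p, y in zip(preds, targets)) / len(targets)

# Synthetic series with a linear trend, so persistence is systematically biased low.
series = [0.5 * t + math.sin(t) for t in range(150)]
train, test = series[:100], series[100:]

bias = fit_corrector(train)          # structured error learned on training data only
targets = test[1:]
base_preds = [base_forecast(test[: i + 1]) for i in range(len(test) - 1)]
corrected_preds = [p + bias for p in base_preds]

base_mse = mse(base_preds, targets)
corrected_mse = mse(corrected_preds, targets)
```

The base model's residual here is not noise but a recoverable offset, which is the kind of structured error a second stage can remove; a real corrector would model residuals per channel and over time rather than as one constant.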

2606.10403 2026-06-12 cs.CL 新提交

KCSAT-ML: Probing Reasoning Models with Nationwide-Cohort Human Difficulty

KCSAT-ML: 用全国队列人类难度探测推理模型

Sanghee Park, Geewook Kim, Kee-Eung Kim

发表机构 * NAVER Cloud AI(NAVER云AI) KAIST AI(韩国科学技术院人工智能系)

AI总结 提出KCSAT-ML基准(含664道韩国高考数学题及339道带官方错误率的核心题)和难度对齐推理增益(DRG)指标,揭示视觉语言模型在人类高错误率题目上准确率崩溃、测试时缩放非单调以及同一模型族内反缩放与过度思考并存的现象。

Comments 18 pages, 14 figures, 8 tables

AI中文摘要

数学推理基准已大量涌现,但大多数缺乏基于实际人类表现的每道题难度信号。我们引入KCSAT-ML,包含十年(2014-2025)韩国大学修学能力考试(KCSAT;修能)数学:664道题,其中339道核心题带有来自数十万考生全国队列的官方每道题错误率。我们将该基准与难度对齐推理增益(DRG)配对:一种分数正交的度量,询问模型的错误是集中在人类认为难的题目上,还是人类认为容易的题目上。两者共同揭示,在广泛的视觉语言模型(以及通过OCR的LLM)中,存在三种模式:(i)低预算准确率在人类高错误率尾部崩溃,无论模型大小;(ii)测试时缩放(TTS)使token使用量大致随队列错误率线性增加,而准确率增益遵循非单调曲线;(iii)在同一模型族内,TTS在最难题目上从反缩放翻转到较容易题目上的过度思考——这是同一对齐失败的两个方面。在DRG上,准确率几乎相同的模型可以处于几乎相反的值:一个模型做错了人类也觉得难的题目,而另一个模型解决了最难的题目却在人类认为容易的题目上失败——这是聚合准确率所隐藏的对比。我们的代码和数据集构建器将在 https://github.com/naver-ai/KCSAT-ML 开源。

英文摘要

Math reasoning benchmarks have proliferated, yet most lack a per-item difficulty signal grounded in actual human performance. We introduce KCSAT-ML, a decade (2014-2025) of Korean College Scholastic Ability Test (KCSAT; Suneung) mathematics: 664 problems with a 339-item core set carrying official per-item error rates from nationwide cohorts of hundreds of thousands of examinees. We pair the benchmark with Difficulty-aligned Reasoning Gain (DRG): a score-orthogonal metric that asks whether a model's mistakes concentrate on the items humans found hard, or on items humans found easy. Together they expose, across a wide range of VLMs (and LLMs via OCR), three patterns: (i) low-budget accuracy collapses on the high-human-error tail at every model size; (ii) test-time scaling (TTS) raises token use roughly linearly with cohort error rate, while accuracy gains follow a non-monotonic curve; (iii) within a single family, TTS flips between anti-scaling on the hardest items and overthinking on easier ones -- two faces of the same alignment failure. On DRG, models with near-identical accuracy can sit at near-opposite values: one model gets wrong what humans also find hard, while another solves the hardest items yet fails on items humans find easy -- a contrast that aggregate accuracy hides. Our code and dataset builder will be open-sourced at https://github.com/naver-ai/KCSAT-ML.

2606.09639 2026-06-12 cs.CV 新提交

CineDance: Towards Next-Generation Multi-Shot Long-Form Cinematic Audio-Video Generation

CineDance: 迈向下一代多镜头长片电影级音视频生成

Yuheng Chen, Teng Hu, Yuji Wang, Qingdong He, Zhucun Xue, Qianyu Zhou, Jason Li, Lizhuang Ma, Jiangning Zhang, Dacheng Tao

发表机构 * Shanghai Jiao Tong University(上海交通大学) University of Electronic Science and Technology of China(电子科技大学) Zhejiang University(浙江大学) The University of Tokyo(东京大学) Nanyang Technological University(南洋理工大学)

AI总结 提出CineDance-1M大规模多镜头长片音视频数据集,通过三阶段筛选流程和CineBench评估体系,实现高质量联合生成。

AI中文摘要

训练数据集的保真度和结构多样性从根本上决定了视频生成模型的能力。尽管商业系统在生成电影叙事方面表现出色,但开源模型的进展仍受限于高质量训练数据的稀缺性。为弥合这一差距,我们引入了CineDance-1M,一个大规模、开放研究文本到音视频(T2AV)数据集,专门用于多镜头、长片联合音视频生成。每个视频平均时长92.8秒,包含24.2个连续镜头,并提供音频和视频模态的可配置、结构化标注。这一卓越质量通过严格的三个阶段筛选流程实现:i) 多样化来源和全面清洗,ii) 基于电影理论的叙事解析,以及iii) 层次化双模态字幕生成。为进行全面评估,我们提出了CineBench,包含多样化的提示套件和六维、与人类对齐的度量系统,专为复杂叙事音视频评估而设计。此外,我们将LTX-2.3适配为CineDance,展示了卓越的单模态质量以及精确的音视频对齐和稳健的主体与环境一致性,有效验证了我们的筛选策略和CineDance-1M的高质量。我们预期这项工作将为加速未来多镜头、长片联合音视频生成研究奠定坚实基础。我们的项目页面可在https://aliothchen.github.io/projects/CineDance/获取。

英文摘要

The fidelity and structural diversity of training datasets fundamentally determine the capabilities of video generation models. While commercial systems show remarkable ability to generate cinematic narratives, the progress of open-source models remains limited by the scarcity of high-quality training data. To bridge this gap, we introduce CineDance-1M, a large-scale, open research Text-to-Audio-Video (T2AV) dataset designed specifically for multi-shot, long-form joint audio-video generation. Averaging 92.8 seconds and 24.2 continuous shots per video, it provides configurable, structured annotations for both audio and video modalities. This exceptional quality is achieved through a rigorous three-stage curation pipeline: i) diverse sourcing and comprehensive cleansing, ii) film-theory-inspired narrative parsing, and iii) hierarchical dual-modal captioning. For a comprehensive assessment, we propose CineBench, featuring a diverse prompt suite and a six-dimensional, human-aligned metric system tailored for complex narrative audio-video evaluation. Furthermore, we adapt LTX-2.3 into CineDance, which demonstrates exceptional single-modality quality alongside precise audio-video alignment and robust subject and environment consistency, effectively validating our curation strategy and the high quality of CineDance-1M. We anticipate that this work will serve as a solid foundation for accelerating future research in multi-shot, long-form joint audio-video generation. Our project page is available at https://aliothchen.github.io/projects/CineDance/.

2606.09073 2026-06-12 cs.LG cs.AI cs.CL 新提交

A Unifying Lens on Reward Uncertainty in RLHF

RLHF中奖励不确定性的统一视角

Ely Hahami, Yoel Zimmermann, Ray Zhou, Jack Benarroch Jedlicki

发表机构 * University of California, Berkeley(加州大学伯克利分校) DeepMind

AI总结 本文提出使用分布奖励模型统一RLHF中的悲观主义方法,通过闭式有效奖励公式连接现有启发式方法,并揭示其隐含假设。

AI中文摘要

基于人类反馈的强化学习(RLHF)受限于奖励破解,即策略利用代理奖励模型(RM)中的错误,产生高RM分数而缺乏真正的质量提升。一种自然的缓解方法是悲观主义:在RM不确定的区域惩罚奖励。然而,标准标量RM没有提供原则性的不确定性概念。我们认为正确的对象是分布奖励模型$p(r\mid x,y)$。在贝叶斯推断或KL分布鲁棒优化(KL-DRO)视角下,KL正则化的RLHF目标具有闭式有效奖励$\tilde r(x,y) = \pm\beta\log\mathbb{E}_p[e^{\pm r/\beta}]$。悲观分支统一了RM集成聚合的先前启发式方法:均值聚合、最坏情况优化(WCO)和不确定性加权优化(UWO)都作为该单一表达式的极限或截断出现。这也澄清了每个现有规则的隐含假设。

英文摘要

Reinforcement learning from human feedback (RLHF) is bottlenecked by reward hacking, where the policy exploits errors in a proxy reward model (RM) and produces high RM scores without genuine quality gains. A natural mitigation is pessimism: lowering rewards in regions where the RM is uncertain. However, standard scalar RMs provide no principled notion of uncertainty. We argue that the right object is a distributional reward model $p(r\mid x,y)$. Under either a Bayesian inference or a KL-distributionally robust optimization (KL-DRO) lens, the KL-regularized RLHF objective admits a closed-form effective reward $\tilde r(x,y) = \pm\beta\log\mathbb{E}_p[e^{\pm r/\beta}]$. The pessimistic branch unifies the prior heuristics for RM ensemble aggregation: mean aggregation, worst-case optimization (WCO), and uncertainty-weighted optimization (UWO) all emerge as limits or truncations of this single expression. This also clarifies the implicit assumptions of each existing rule.
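
The pessimistic branch of the effective reward, $-\beta\log\mathbb{E}_p[e^{-r/\beta}]$, is easy to evaluate on a small ensemble. The sketch below treats the ensemble members as equally weighted samples of $p(r\mid x,y)$, which is an assumption made here for illustration; the three scores are made up. It relies only on standard properties of the log-moment-generating function: large $\beta$ recovers the mean, small $\beta$ approaches the minimum, and a second-order expansion gives mean minus variance over $2\beta$. How the paper pairs these regimes with the named heuristics is not spelled out in the abstract.

```python
import math

def pessimistic_reward(samples, beta):
    """-beta * log E[exp(-r / beta)] over equally weighted reward samples."""
    scaled = [-r / beta for r in samples]
    m = max(scaled)  # log-sum-exp shift for numerical stability
    log_mean = m + math.log(sum(math.exp(s - m) for s in scaled) / len(scaled))
    return -beta * log_mean

ensemble = [1.0, 2.0, 4.0]  # hypothetical scores from three reward models
mean_r = sum(ensemble) / len(ensemble)
var_r = sum((r - mean_r) ** 2 for r in ensemble) / len(ensemble)

near_mean = pessimistic_reward(ensemble, beta=1e4)    # large beta: mean aggregation
near_worst = pessimistic_reward(ensemble, beta=1e-3)  # small beta: worst-case member
mid = pessimistic_reward(ensemble, beta=10.0)
second_order = mean_r - var_r / (2 * 10.0)            # variance-penalized truncation
```

Because the expression is a soft minimum, it never exceeds the ensemble mean, and $\beta$ acts as the single knob that interpolates between averaging and worst-case behavior.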

2606.08765 2026-06-12 cs.RO cs.CV 新提交

RGB-S: Image-Aligned Tactile Saliency for Robust Dexterous Manipulation

RGB-S: 用于鲁棒灵巧操作的图像对齐触觉显著性

Shengcheng Luo, Kefei Wu, Xiaoying Zhou, Wanlin Li, Ziyuan Jiao, Chenxi Xiao

发表机构 * ShanghaiTech University(上海科技大学) Beijing Institute for General Artificial Intelligence(北京通用人工智能研究院)

AI总结 提出RGB-S框架,通过正向运动学和相机标定将触觉传感器位置投影到RGB图像平面,生成力调制高斯显著性图,显式对齐触觉与视觉,在严重遮挡下灵巧操作成功率提升26.7个百分点。

Comments 20 pages, 7 figures

AI中文摘要

有效的视觉-触觉整合对于机器人灵巧操作至关重要,尤其是在视觉观测不可靠或被遮挡时。然而,将稀疏、异构的触觉测量与密集的视觉表示鲁棒地对齐仍然是一个基本挑战。大多数现有方法需要策略从有限的演示中隐式学习跨模态对应关系,而不利用几何先验。因此,它们在视觉观测退化时往往数据效率低且泛化能力差。为解决这一限制,我们提出一个框架,显式地将物理接触锚定在图像域中。利用机器人正向运动学和相机标定,我们将触觉传感器位置直接投影到RGB图像平面上。然后,我们渲染力调制的高斯显著性图,以模拟由运动学和标定误差引起的空间不确定性。通过零初始化的条件架构整合这些2D空间锚点,我们的方法将物理接触先验注入标准视觉骨干网络,同时保留预训练的视觉表示。我们在模拟和现实世界的六项灵巧操作任务中评估了我们的方法,在严重视觉遮挡下。现实世界实验表明,在图像域中显式的RGB-S锚定将现实世界遮挡操作成功率比最强的隐式视觉-触觉基线提高了26.7个百分点,表明其空间推理能力和对遮挡的鲁棒性得到了改善。项目页面:touch-as-saliency.github.io

英文摘要

Effective visuo-tactile integration is critical for robotic dexterous manipulation, especially when visual observations are unreliable or occluded. However, robustly aligning sparse, heterogeneous tactile measurements with dense visual representations remains a fundamental challenge. Most existing approaches require policies to learn cross-modal correspondences implicitly from limited demonstrations, without leveraging geometric priors. As a result, they are often data-inefficient and generalize poorly when visual observations are degraded. To address this limitation, we propose a framework that explicitly grounds physical contacts in the image domain. Using robot forward kinematics and camera calibration, we project tactile sensor locations directly onto the RGB image plane. We then render force-modulated Gaussian saliency maps to model spatial uncertainty arising from kinematic and calibration errors. By integrating these 2D spatial anchors through a zero-initialized conditioning architecture, our method injects physical contact priors into standard visual backbones while preserving pre-trained visual representations. We evaluate our method on six dexterous manipulation tasks in both simulation and the real world under severe visual occlusions. Real-world experiments show that explicit RGB-S grounding in the image domain improves occluded manipulation success rates by 26.7 percentage points over the strongest implicit visuo-tactile baseline, suggesting improved spatial reasoning and robustness to occlusion. Project page: touch-as-saliency.github.io
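
The projection-and-rendering step can be sketched as follows. This is a simplified illustration, not the authors' pipeline: it assumes the tactile sensor positions are already expressed in the camera frame (in the paper they come from forward kinematics and camera calibration), uses an ideal pinhole model, and picks the Gaussian width, force normalization, and intrinsics arbitrarily. All names are hypothetical.

```python
import math

def project_point(point_cam, fx, fy, cx, cy):
    """Pinhole projection of a 3D point in the camera frame to pixel coordinates."""
    x, y, z = point_cam
    return fx * x / z + cx, fy * y / z + cy

def saliency_map(contacts, width, height, intrinsics, sigma=3.0, max_force=10.0):
    """Render one Gaussian blob per contact, with amplitude scaled by force."""
    fx, fy, cx, cy = intrinsics
    image = [[0.0] * width for _ in range(height)]
    for point_cam, force in contacts:
        u, v = project_point(point_cam, fx, fy, cx, cy)
        amplitude = min(force / max_force, 1.0)  # force modulation, clipped to 1
        for row in range(height):
            for col in range(width):
                d2 = (col - u) ** 2 + (row - v) ** 2
                value = amplitude * math.exp(-d2 / (2.0 * sigma ** 2))
                image[row][col] = max(image[row][col], value)
    return image

intrinsics = (100.0, 100.0, 32.0, 24.0)   # fx, fy, cx, cy (made-up values)
contacts = [((0.05, -0.02, 0.5), 5.0)]    # (xyz in metres, force in newtons)
heatmap = saliency_map(contacts, width=64, height=48, intrinsics=intrinsics)
```

The resulting map has the same spatial layout as the RGB frame, so it can be stacked with the image as an extra channel; the Gaussian spread plays the role of the spatial uncertainty from kinematic and calibration errors mentioned in the abstract.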

2606.08436 2026-06-12 cs.CV 新提交

CACR: Reinforcing Temporal Answer Grounding in Instructional Video via Candidate-Aware Causal Reasoning

通过候选感知因果推理增强教学视频中的时间答案定位

Muge Qi, Rong Fu, Pengbin Feng, Xianda Li, Yu Cai, Yifu Guo, Shizhe Zhang, Simon James Fong, Lei Ma, Bin Li

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出候选感知因果推理框架,通过视觉-语言预训练候选选择和基于GRPO的时序逻辑推理,解决教学视频中复杂问题理解和长视频片段定位挑战,在六个基准上取得最优mIoU。

AI中文摘要

教学视频中的时间答案定位任务旨在定位响应自然语言查询的精确视频片段,对于直接视频答案检索日益重要。由于需要理解语义复杂的问题并解决未修剪视频与短目标时刻之间的显著长度不匹配,该任务仍然具有挑战性。现有方法通常对无关内容敏感或视觉推理能力不足。为了解决这些局限性,我们提出了候选感知因果推理框架。我们的方法首先采用基于视觉-语言预训练的候选选择算法高效生成K个候选片段,然后应用由拒绝奖励机制增强并通过组相对策略优化优化的时序逻辑推理模块进行稳健推理。在六个基准上的大量实验表明,我们的方法在平均交并比方面达到了最先进的性能,为长视频中基于推理的检索提供了新视角。

English abstract

The task of temporal answer grounding in instructional video (TAGV), which aims to locate precise video segments that respond to natural language queries, is increasingly important for direct video answer retrieval. This task remains challenging due to the need to comprehend semantically complex questions and to address the significant length mismatch between untrimmed videos and short target moments. Existing methods often suffer from sensitivity to irrelevant content or insufficient visual reasoning capabilities. To tackle these limitations, we propose a Candidate-Aware Causal Reasoning (CACR) framework. Our approach first employs a Visual-Language Pre-training based Candidate Selection (VBCS) algorithm to efficiently generate K candidate segments, then applies a temporal logic reasoning module enhanced by a rejection reward mechanism and optimized via Group Relative Policy Optimization (GRPO) for robust inference. Extensive experiments on six benchmarks demonstrate that our method achieves state-of-the-art performance in terms of mean Intersection-over-Union (mIoU), providing a new perspective for reasoning-based retrieval in long videos.
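For readers unfamiliar with the metric named in this abstract, the sketch below shows the usual way temporal IoU and its mean are computed for segment grounding. It assumes segments are `(start, end)` pairs in seconds; the function names and example values are illustrative and not taken from the paper.

```python
def temporal_iou(pred, gt):
    """Intersection-over-Union of two time segments given as (start, end)."""
    pred_start, pred_end = pred
    gt_start, gt_end = gt
    inter = max(0.0, min(pred_end, gt_end) - max(pred_start, gt_start))
    union = (pred_end - pred_start) + (gt_end - gt_start) - inter
    return inter / union if union > 0 else 0.0

def mean_iou(preds, gts):
    """Average temporal IoU over paired predictions and ground-truth segments."""
    scores = [temporal_iou(p, g) for p, g in zip(preds, gts)]
    return sum(scores) / len(scores)

# Illustrative values only: one partial overlap and one exact match.
preds = [(10.0, 20.0), (0.0, 5.0)]
gts = [(15.0, 25.0), (0.0, 5.0)]
score = mean_iou(preds, gts)
```

In the first pair the overlap is 5 s against a union of 15 s, so a prediction that is half inside the target still scores only one third, which is why long untrimmed videos with short target moments are hard on this metric.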

2606.08098 2026-06-12 cs.AI cs.LG New submission

When Does Delegation Beat Majority? A Delegation-Based Aggregator for Multi-Sample LLM Inference


Yasushi Sakai, Allen Song, Kent Larson

Affiliations: MIT Media Lab

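AI summary: Proposes PPV, a delegation-based aggregator that uses each sample's letter entropy and reasoning-geometry signals to beat majority voting on MMLU-Pro by 1.5 percentage points, with no labels or training.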

Comments: Preprint. 16 pages, 5 figures, 4 tables

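AI-generated Chinese abstract (translated to English)

Majority voting is the dominant method for unsupervised aggregation in multi-sample LLM inference. We show that feeding the signals carried by each sample into a delegation-based aggregator (Propagational Proxy Voting, PPV) yields an unsupervised consensus rule that beats majority voting on MMLU-Pro by 1.5 percentage points overall and by 2.24 percentage points on the non-trivial subset (paired McNemar p ~ 1.0e-14, n = 8,099). Majority voting discards two free signals that every sample carries: within-group letter entropy and between-group reasoning geometry. PPV exposes two per-voter levers that consume exactly these signals: WHEN (how much weight a voter keeps on its own pick) and WHOM (how it distributes the remaining weight among peers). We drive WHEN with letter entropy and WHOM with question-centered embedding cosine. The method needs no gold labels and no auxiliary training: for each question, we partition 128 sampled generations into 16 groups, compute each group's letter-level semantic entropy and reasoning embedding centroid, and feed both into a stochastic delegation matrix whose stationary distribution selects the consensus answer. We illustrate with an example how PPV overturns a clear 10-6 majority for the wrong answer: the 10-vote majority cluster is geometrically incoherent (mean within-cluster cosine -0.02), while the 6-vote minority cluster is tight (+0.26), so the propagated delegation mass concentrates on the minority's answer even though entropy alone would keep the majority ahead. We also report delegation strategies with negative results, which constrain the design space for unsupervised LLM aggregation: no within-question ensemble of confidence modes closes the gap to the oracle.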

English abstract

Majority voting over sampled answers is the dominant unsupervised aggregator for multi-sample LLM inference. In this paper, we show a delegation-based aggregator (Propagational Proxy Voting, PPV; Sakai et al., 2025) yields an unsupervised consensus rule that beats majority on MMLU-Pro by +1.5 pp overall and +2.24 pp on the non-trivial subset (paired McNemar p ~ 1.0e-14, n = 8,099). Majority discards two signals that every sample carries: within-group letter entropy and between-group reasoning geometry. PPV exposes per-voter levers that consume exactly these two signals: When (how much weight a voter keeps on its own pick) and Whom (how it splits the remainder across peers). We drive When with letter entropy and Whom with per-question-centered embedding cosine. Our method needs no gold labels and no auxiliary training: per-question, we partition 128 sampled generations into 16 groups, compute each group's letter-level semantic entropy and reasoning embedding centroid, and feed both into a stochastic delegation matrix whose stationary distribution selects the consensus answer. We walk through an example in which PPV overturns a clear 10-6 majority for the wrong letter: the 10-voter majority cluster is geometrically incoherent (mean within-cluster cosine -0.02) while the 6-voter minority is tight (+0.26), so propagated delegation mass concentrates on the minority's answer even though entropy alone would keep the majority ahead. We further report delegation strategies with negative results that constrain the design space for unsupervised LLM aggregation. No within-question ensemble of confidence modes closes the oracle gap.
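The idea of selecting a consensus answer from the stationary distribution of a stochastic delegation matrix can be sketched as follows. This is a toy illustration under stated assumptions, not the PPV implementation: the `keep` weights (the When lever) are supplied directly rather than derived from letter entropy, the similarity matrix (the Whom lever) is invented, negative similarities are clipped to zero, and the stationary distribution is found by plain power iteration. All function names are hypothetical.

```python
def delegation_matrix(keep, sim):
    """Row-stochastic matrix: voter i keeps keep[i] and splits the rest by similarity."""
    n = len(keep)
    P = [[0.0] * n for _ in range(n)]
    for i in range(n):
        P[i][i] = keep[i]
        # Assumed here: negative similarities are clipped to zero.
        others = [max(sim[i][j], 0.0) if j != i else 0.0 for j in range(n)]
        total = sum(others)
        for j in range(n):
            if j == i:
                continue
            # Fall back to a uniform split when no peer has positive similarity.
            share = others[j] / total if total > 0 else 1.0 / (n - 1)
            P[i][j] = (1.0 - keep[i]) * share
    return P

def stationary(P, iters=500):
    """Approximate the stationary distribution pi = pi P by power iteration."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi

def consensus(answers, pi):
    """Sum stationary mass per answer and return the heaviest answer."""
    mass = {}
    for answer, weight in zip(answers, pi):
        mass[answer] = mass.get(answer, 0.0) + weight
    return max(mass, key=mass.get), mass

# Toy case: two uncertain, mutually dissimilar voters pick "A"; one confident voter picks "B".
answers = ["A", "A", "B"]
keep = [0.2, 0.2, 0.9]
sim = [[1.0, 0.0, 0.5],
       [0.0, 1.0, 0.5],
       [0.5, 0.5, 1.0]]
winner, mass = consensus(answers, stationary(delegation_matrix(keep, sim)))
```

In this toy case the two "A" voters delegate most of their weight to the third voter, so the propagated mass favors "B" even though a head count would pick "A", which mirrors the kind of overturned majority the abstract describes.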

2606.07515 2026-06-12 cs.CL cs.AI cs.HC math.PR New submission

How reliable are LLMs when it comes to playing dice?


Luca Avena, Gianmarco Bet, Bernardo Busoni

Affiliations: Università degli Studi di Firenze (University of Florence)

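AI summary: A benchmark of discrete probability problems shows that LLMs reach 0.96 accuracy on standard problems but only 0.59 on counterintuitive ones, and that they are vulnerable to token bias and misleading prompts.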

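AI-generated Chinese abstract (translated to English)

We investigate the probabilistic reasoning capabilities of large language models through a controlled benchmark study on discrete probability problems. We construct two datasets, one of standard exercises and one of counterintuitive exercises designed to trigger heuristic reasoning, and evaluate 8 state-of-the-art models, each tested with and without Chain-of-Thought prompting. Models reach an average accuracy of 0.96 on standard problems but only 0.59 on counterintuitive ones. We further provide empirical evidence of token bias: performance drops by more than 20% when canonical formulations are replaced by disguised variants. Embedding misleading suggestions in the prompt reduces performance by up to 34%, and no model proves immune. Taken together, the reported results suggest that current LLMs, despite their success on advanced mathematical problems, are not yet genuine probabilistic reasoners.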

English abstract

We investigate the probabilistic reasoning capabilities of large language models through a controlled benchmarking study on discrete probability problems. We constructed two datasets, a set of standard exercises and a set of counterintuitive exercises designed to trigger heuristic reasoning, and evaluated 8 state-of-the-art models, each tested with and without Chain-of-Thought prompting. Models achieve an average accuracy of 0.96 on standard problems but only 0.59 on counterintuitive ones. We further provide empirical evidence of token bias: performance drops by over 20% when canonical formulations are replaced by disguised variants. Embedding misleading suggestions in the prompt reduces performance by up to 34%, with no model proving immune. Taken together, the reported findings suggest that current LLMs are not yet genuine probabilistic reasoners, despite their success in advanced mathematical problems.