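arXivDaily daily academic digest: synchronized with the full arXiv dataset, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and more.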

2606.12475 2026-06-12 cs.RO 新提交

Learning to Assist: Collaborative VLAs for Implicit Human-Robot Collaboration

学习辅助：面向隐式人机协作的协作式VLA模型

Leo Xu, Letian Li, Alex Cuellar, Michael Hagenow

发表机构 * University of Wisconsin–Madison（威斯康星大学麦迪逊分校）； Massachusetts Institute of Technology（麻省理工学院）

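AI summary: This paper studies human-robot collaboration through imitation learning with vision-language-action (VLA) models. It finds that action-chunking policies suffer from demonstration action leakage in implicit collaboration, proposes an inference-time steering method to mitigate premature assistive behavior, and validates the method in a user study.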

AI中文摘要

人机协作（HRC）结合了人类和机器人的互补优势，以提高任务效率。然而，许多现有的协作系统依赖于手工设计的流程，限制了其对新任务的可扩展性和灵活性。在这项工作中，我们展示了通过模仿学习进行端到端训练的模型，特别是视觉-语言-动作（VLA）模型，可以支持协作操作，并刻画了影响其真实世界性能的关键因素。我们评估了两种最先进的模型，并识别了隐式HRC中动作分块策略的一种失败模式，其中演示动作泄漏（即动作块跨越潜在任务转换）可能导致过早的辅助行为。我们发现，这个问题随着执行时域的增长而加剧，并在真实世界的协作VLA系统中出现，例如当机器人试图在人员准备好之前移交工具时。我们提出了一种推理时引导方法，以减轻这些错误的辅助动作，同时保持策略性能。最后，通过一项16名参与者在长时域协作组装任务上的用户研究，我们表明引导能够实现更长的执行时域，同时减轻过早辅助，与短时域策略相比，实现了更快的协作和更少的失败。

英文摘要

Human-robot collaboration (HRC) combines the complementary strengths of humans and robots to improve task efficiency. However, many existing collaborative systems rely on hand-engineered pipelines, limiting their scalability and flexibility for new tasks. In this work, we show that models trained end-to-end with imitation learning, specifically vision-language-action (VLA) models, can support collaborative manipulation, and characterize the key factors affecting their real-world performance. We evaluate two state-of-the-art models and identify a failure mode of action-chunking policies in implicit HRC, where demonstration action leakage (i.e., action chunks crossing latent task transitions) can cause premature assistive behavior. We find that this issue increases with longer execution horizons and occurs in real-world collaborative VLA systems, such as when a robot attempts to hand over a tool before the person is ready. We propose an inference-time steering method to mitigate these erroneous assistive actions while preserving policy performance. Finally, through a 16-participant user study on a long-horizon collaborative assembly task, we show that steering enables a longer execution horizon while mitigating premature assistance, leading to faster collaboration and fewer failures compared to a shorter-horizon policy.
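The link between execution horizon and premature assistance can be illustrated with a toy receding-horizon loop. This is a minimal sketch, not the authors' system: `toy_policy`, the observation labels, and the "leaky" chunk contents are all hypothetical, chosen only to mimic a chunk that crosses a wait-to-handover transition.

```python
def toy_policy(observation, chunk_size):
    """Hypothetical policy: returns a chunk of `chunk_size` future actions."""
    if observation == "human_ready":
        return ["handover"] * chunk_size
    # Leaky chunk (assumption): demonstrations crossed the wait -> handover
    # transition, so later slots already contain the assistive action.
    leaky = ["wait", "wait"] + ["handover"] * chunk_size
    return leaky[:chunk_size]


def run_episode(observations, chunk_size, execution_horizon):
    """Execute the first `execution_horizon` actions of each chunk, then replan."""
    if not 1 <= execution_horizon <= chunk_size:
        raise ValueError("execution_horizon must be in [1, chunk_size]")
    executed = []
    t = 0
    while t < len(observations):
        # The policy only sees the observation at replanning time.
        chunk = toy_policy(observations[t], chunk_size)
        for action in chunk[:execution_horizon]:
            if t >= len(observations):
                break
            executed.append((t, action))
            t += 1
    return executed


def premature_steps(executed, observations):
    """Time steps where a handover was executed while the human was still busy."""
    return [t for t, a in executed
            if a == "handover" and observations[t] == "human_busy"]


observations = ["human_busy"] * 6 + ["human_ready"] * 2
for horizon in (2, 4):
    executed = run_episode(observations, chunk_size=4, execution_horizon=horizon)
    print(horizon, premature_steps(executed, observations))
```

In this toy setting the short horizon replans before the leaked handover slots are reached, while the long horizon executes them at steps 2 and 3 while the human is still busy.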

2606.12473 2026-06-12 cs.CV 新提交

Stereo Vision-Based Fall Prediction and Detection using Human Pose Estimation on the AMD Kria K26 SOM

基于人体姿态估计的立体视觉跌倒预测与检测在AMD Kria K26 SOM上的实现

Shreyas Narasimhiah Ramesh, P. D. Rathika, Mahasweta Sarkar, Kristen Wells, Michel Audette, Christopher Paolini

发表机构 * San Diego State University（圣地亚哥州立大学）； PSG College of Technology（PSG理工学院）； Old Dominion University（欧道明大学）

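AI summary: Proposes a low-power, portable stereo-vision fall prediction and detection system on the AMD Kria K26 SOM. A three-stage pipeline of quantized YOLOX, A2J, and CNN models provides real-time, privacy-preserving fall detection; the multi-threaded version reaches 4.5 FPS.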

Comments 19 pages; 31 figures

AI中文摘要

背景与目标：老年人跌倒可能导致严重伤害并降低生活质量。及时的预测和检测对于预防伤害和支持健康至关重要。我们提出了一种便携式、低功耗、电池供电的基于视觉的跌倒预测与检测系统，在AMD Kria K26系统模块（SOM）上使用人体姿态估计（HPE）。目标是实现非侵入性、保护隐私的实时跌倒检测系统。方法：系统使用Intel RealSense D455距离感应摄像头，通过USB连接到K26 SOM。它捕获同步的RGB和深度帧，分辨率分别为640×480×3和640×480像素，帧率为60 FPS。SOM运行一个三级流水线，包括量化的YOLOX、Anchor-to-Joint（A2J）和跌倒检测模型。YOLOX从RGB帧中识别人体边界框，然后丢弃RGB帧以保护隐私。A2J使用深度帧估计每个人的15个关节点。CNN使用选定的关节坐标（x, y, z）对跌倒活动进行分类。YOLOX在CrowdHuman上训练；A2J在ITOP、MP-3DHP、UR Fall Detection和自定义的SDSU PSG数据集上训练；CNN在UR Fall Detection和SDSU PSG上训练。设计使用了单核DPU的串行流水线和双核DPU运行YOLOX和A2J的多线程版本。结果：量化精度通过YOLOX的IoU≥50%、A2J的10厘米规则mAP以及CNN的分类准确率（TP+TN)/(TP+TN+FP+FN)进行评估。准确率分别为74%、84.13%和75.85%。吞吐量从单线程流水线的2.5 FPS提高到多线程版本的4.5 FPS。结论：结果证明了在AMD Kria K26边缘设备上实现隐私保护跌倒检测的可行性。设备上的HPE和跌倒分类无需依赖云端，支持老年人监测和辅助医疗。未来工作将提高模型精度和速度。

英文摘要

Background and Objective: Falls among elderly people can cause serious injury and reduce quality of life. Timely prediction and detection are essential to prevent harm and support well-being. We propose a portable, low-power, battery-operated, vision-based fall prediction and detection system using human pose estimation (HPE) on an AMD Kria K26 System-on-Module (SOM). The objective is a non-intrusive, privacy-preserving system for real-time fall detection. Methods: The system uses an Intel RealSense D455 range-sensing camera connected to the K26 SOM by USB. It captures synchronized RGB and depth frames, 640 x 480 x 3 and 640 x 480 pixels respectively, at 60 FPS. The SOM runs a three-stage pipeline with quantized YOLOX, Anchor-to-Joint (A2J), and fall-detection models. YOLOX identifies human bounding boxes from RGB frames, then discards the RGB frames to preserve privacy. A2J uses depth frames to estimate 15 joint keypoints per person. A CNN uses selected joint coordinates (x, y, z) to classify fall activity. YOLOX was trained on CrowdHuman; A2J on ITOP, MP-3DHP, UR Fall Detection, and a custom SDSU PSG dataset; and the CNN on UR Fall Detection and SDSU PSG. The design used a single-core DPU with a serial pipeline and a dual-core DPU running YOLOX and A2J with multiple threads. Results: Quantized accuracy was evaluated using IoU >= 50% for YOLOX, mAP with a 10-cm rule for A2J, and classification accuracy, (TP + TN)/(TP + TN + FP + FN), for the CNN. The respective accuracies were 74%, 84.13%, and 75.85%. Throughput improved from 2.5 FPS for the single-threaded pipeline to 4.5 FPS for the multi-threaded version. Conclusion: Results demonstrate the feasibility of privacy-preserving fall detection on an AMD Kria K26 edge device. On-device HPE and fall classification run without cloud dependency, supporting elderly monitoring and assistive healthcare. Future work will improve model accuracy and speed.
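The CNN's accuracy metric is the usual confusion-matrix ratio. A minimal sketch of that formula follows; the counts are hypothetical values picked for illustration and are not results from the paper.

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Classification accuracy: (TP + TN) / (TP + TN + FP + FN)."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("confusion matrix is empty")
    return (tp + tn) / total


# Hypothetical counts, chosen only to exercise the formula.
example = accuracy(tp=40, tn=35, fp=15, fn=10)
print(f"accuracy = {example:.2%}")
```

By the same kind of arithmetic, the reported throughput change from 2.5 FPS to 4.5 FPS corresponds to a 1.8x speedup.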

2606.12451 2026-06-12 cs.AI cs.IR cs.LG 新提交

ToolSense: A Diagnostic Framework for Auditing Parametric Tool Knowledge in LLMs

ToolSense: 审计LLM中参数化工具知识的诊断框架

Ashutosh Hathidara, Sai Shruthi Sistla, Sebastian Schreiber, Sahil Bansal

发表机构 * SAP Labs（SAP实验室）

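AI summary: Proposes ToolSense, a diagnostic framework that automatically generates three types of benchmarks. It reveals a knowledge-retrieval dissociation in parametric tool retrieval and finds that model performance drops sharply on ambiguous queries.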

AI中文摘要

作为大型工具目录上的代理部署的大型语言模型面临关键的工具检索瓶颈。由于基于嵌入的检索方法依赖于可能无法充分捕获专用工具语义的紧凑编码器，参数化工具检索通过将每个工具编码为附加到LLM词汇表的虚拟令牌来解决这一问题，经过两个阶段（记忆然后检索SFT）的微调，将LLM用作检索器，在标准ToolBench检索基准上取得了强劲性能。然而，这些基准使用冗长、完全指定的查询，并且其评估应用了将输出限制为有效令牌路径的约束解码，这并不能揭示模型是否真正理解其工具。我们引入了ToolSense，一个开源LLM驱动的诊断框架，它将任何工具目录作为输入，并自动生成三个基准：具有三个模糊级别查询的现实检索基准（RRB）、MCQ探测基准和QA探测基准。将ToolSense应用于ToolBench（约47k个工具）并评估五个参数化模型训练配置，揭示了知识-检索分离：在RRB查询上，与完全指定的ToolBench基准相比，几个配置下降了约50-64个百分点，低于嵌入模型基线。此外，尽管检索性能强劲，一些模型在事实探测上得分接近随机，表明存在知识-检索分离。我们在 https://github.com/SAP/toolsense 上开源了ToolSense框架和ToolBench诊断基准。

英文摘要

Large language models deployed as agents over large tool catalogs face a critical tool-retrieval bottleneck. Because embedding-based retrieval approaches rely on compact encoders that may under-capture specialized tool semantics, parametric tool retrieval instead encodes each tool as a virtual token appended to the LLM vocabulary and fine-tunes the LLM in two stages (memorization, then retrieval SFT) to act as a retriever, achieving strong performance on standard ToolBench retrieval benchmarks. Yet these benchmarks use verbose, fully-specified queries, and their evaluation applies constrained decoding that restricts outputs to valid token paths; neither reveals whether the model actually understands its tools. We introduce ToolSense, an open-source LLM-powered diagnostic framework that takes any tool catalog as input and automatically generates three benchmarks: a Realistic Retrieval Benchmark (RRB) with queries at three ambiguity tiers, an MCQ probing benchmark, and a QA probing benchmark. Applying ToolSense to ToolBench (~47k tools) and evaluating five parametric model training configurations reveals a knowledge-retrieval dissociation: on RRB queries, several configurations collapse by ~50-64 percentage points compared to fully-specified ToolBench benchmarks, falling below the embedding-model baseline. Additionally, despite strong retrieval performance, some models score near-random on factual probes, which is further evidence of this dissociation. We open-source the ToolSense framework and the ToolBench diagnostic benchmarks at https://github.com/SAP/toolsense.

2606.13287 2026-06-12 cs.LG cs.DC math.OC 新提交

Clipping Makes Distributed and Federated Asynchronous SGD Robust to Stragglers

裁剪使分布式和联邦异步SGD对掉队者具有鲁棒性

Samuel Erickson, Mikael Johansson

发表机构 * KTH Royal Institute of Technology（瑞典皇家理工学院）

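AI summary: Proves theoretically that gradient clipping removes the effect of the maximum delay on the complexity of asynchronous SGD. Under a sub-Weibull gradient-noise model, it gives the first high-probability convergence result for asynchronous optimization.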

AI中文摘要

在现代机器学习中，训练的并行化是扩大规模的重要策略。异步随机梯度下降（ASGD）通过避免等待慢速工作节点来最大化可用硬件的利用率。然而，在恒定步长下，由于更新中的大延迟，慢速工作节点仍然会对ASGD的收敛产生负面影响。同时，在深度学习模型的异步训练中，经验观察到梯度裁剪能“稳定”训练。在这项工作中，我们为这一行为提供了理论依据，证明裁剪消除了最大延迟对预言复杂度的依赖。我们采用次Weibull梯度噪声模型，该模型将次高斯和次指数分布推广到更重尾的分布，受深度学习中的经验观察启发。我们证明了期望收敛，并且首次在异步优化中证明了高概率收敛。

英文摘要

In modern machine learning, parallelization of training is an important strategy for increasing scale. Asynchronous stochastic gradient descent (ASGD) maximizes the utilization of available hardware by avoiding waiting for slow workers. However, with constant step sizes, the convergence of ASGD is still negatively affected by slow workers because of large delays in updates. At the same time, it has been empirically observed in asynchronous training of deep learning models that gradient clipping "stabilizes" training. In this work, we provide a theoretical justification for this behavior: we show that clipping removes the dependence of the oracle complexity on the maximum delay. We employ a sub-Weibull model of gradient noise, which generalizes sub-Gaussian and sub-exponential distributions to heavier-tailed distributions and is motivated by empirical observations in deep learning. We show convergence in expectation and, for the first time in asynchronous optimization, convergence with high probability.
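To make the role of clipping concrete, here is a minimal sketch of a server update that applies standard norm clipping to a delayed worker gradient. The abstract does not give the paper's exact update rule, step size, or clipping threshold, so `async_step`, the threshold, and the numbers below are illustrative assumptions.

```python
import math


def clip(grad, threshold):
    """Rescale grad so that its Euclidean norm is at most threshold."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= threshold or norm == 0.0:
        return list(grad)
    scale = threshold / norm
    return [g * scale for g in grad]


def async_step(params, stale_grad, step_size, threshold):
    """One server update using a (possibly delayed) worker gradient."""
    clipped = clip(stale_grad, threshold)
    return [p - step_size * g for p, g in zip(params, clipped)]


params = [1.0, -2.0]
# Hypothetical gradient computed at an outdated iterate; its norm is 50.
stale_grad = [30.0, 40.0]
params = async_step(params, stale_grad, step_size=0.1, threshold=5.0)
print(params)
```

Whatever the delay, the clipped update moves the parameters by at most `step_size * threshold` in norm, which gives some intuition for why a slow worker's stale gradient cannot dominate a step.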

2606.12876 2026-06-12 cs.LG cs.CL cs.IT math.IT 新提交

Multi-Bitwidth Quantization for LLMs Using Additive Codebooks

使用加性码本的大语言模型多比特宽度量化

Liza Babaoglu, Shuangyi Chen, Ashish Khisti

发表机构 * University of Toronto（多伦多大学）

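AI summary: Proposes Drop-by-Drop, a framework grounded in information theory and successive refinement. It uses additive codebooks and Matryoshka-style supervision so that a single model supports multiple weight precisions at inference time, reducing storage overhead while preserving performance.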

Comments 37 pages, 12 figures

AI中文摘要

随着大语言模型（LLM）在具有不同资源约束的异构硬件上部署越来越广泛，无需重新训练即可自适应管理性能与效率之间权衡的能力变得至关重要。我们提出Drop-by-Drop，一种新颖的多比特宽度训练后量化框架，能够从单个训练模型实现对LLM权重的推理时精度控制。我们的方法在理论上基于信息论和逐次细化。我们证明，通常服从高斯分布的LLM权重，在由LLM损失函数驱动的加权均方误差失真下，随着额外比特的加入可以以递增的保真度最优重建。为了在实践中实现这一点，Drop-by-Drop将Matryoshka风格的监督纳入损失函数，利用了加性码本的结构。Drop-by-Drop生成单个模型，其中有序的码本子集在每个精度级别产生精确的部分重建。这种方法通过允许单个检查点服务于多个比特宽度，显著减少了存储和内存开销，同时在主要架构（如Qwen、LLaMA、Gemma和Mistral）上保持了有竞争力的困惑度和准确度。

英文摘要

As large language models (LLMs) are increasingly deployed across heterogeneous hardware with varying resource constraints, the ability to adaptively manage the trade-off between performance and efficiency without retraining is critical. We propose Drop-by-Drop, a novel multi-bitwidth post-training quantization framework that enables inference-time precision control over LLM weights from a single trained model. Our method is theoretically grounded in information theory and successive refinement. We establish that LLM weights, which commonly follow a Gaussian distribution, can be optimally reconstructed with increasing fidelity as additional bits are incorporated, under a weighted mean squared error distortion motivated by LLM loss functions. To realize this in practice, Drop-by-Drop incorporates Matryoshka-style supervision into the loss function, exploiting the structure of additive codebooks. Drop-by-Drop produces a single model where ordered subsets of codebooks yield accurate partial reconstructions at each precision level. This approach significantly reduces storage and memory overhead by allowing a single checkpoint to serve multiple bitwidths, while maintaining competitive perplexity and accuracy across major architectures, such as Qwen, LLaMA, Gemma, and Mistral.
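The idea that ordered subsets of additive codebooks give progressively finer reconstructions can be sketched on a single scalar weight. This is a simplified illustration using greedy residual quantization with hand-picked codebooks; it is not the Drop-by-Drop training procedure, and the codebook values and the weight are assumptions.

```python
def nearest(codebook, value):
    """Return the codeword closest to value."""
    return min(codebook, key=lambda c: abs(c - value))


def encode(weight, codebooks):
    """Greedy residual encoding: each codebook quantizes what is left over."""
    chosen, residual = [], weight
    for codebook in codebooks:
        word = nearest(codebook, residual)
        chosen.append(word)
        residual -= word
    return chosen


def reconstruct(chosen, num_codebooks):
    """Sum only the first num_codebooks codewords (fewer terms = lower precision)."""
    return sum(chosen[:num_codebooks])


# Hypothetical codebooks, ordered from coarse to fine.
codebooks = [
    [-1.0, 1.0],
    [-0.5, 0.5],
    [-0.25, 0.25],
]
weight = 0.7
chosen = encode(weight, codebooks)
for m in range(1, len(codebooks) + 1):
    approx = reconstruct(chosen, m)
    print(m, approx, abs(weight - approx))
```

For this particular weight the error shrinks as more codebooks are summed. Greedy encoding does not guarantee that in general, which is one reason a method would want supervision at every precision level rather than only at full precision.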

2606.12710 2026-06-12 cs.LG math.OC 新提交

A Stabilized Path-Space Approach to Diffusion-Based Posterior Sampling

一种稳定的路径空间方法用于基于扩散的后验采样

Evan Scope Crafts, Umberto Villa, Saviz Mowlavi, Yanting Ma, Hassan Mansour, Wael H. Ali

Institutions: Oden Institute for Computational Engineering and Sciences, The University of Texas at Austin; Mitsubishi Electric Research Laboratories (MERL); Department of Biomedical Engineering, The University of Texas at Austin

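AI summary: Proposes a stabilized path-space framework that uses stochastic optimal control and trust-region optimization to achieve accurate and robust posterior sampling in nonlinear inverse problems.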

AI中文摘要

扩散模型为贝叶斯逆问题提供了表达性数据驱动先验，但许多扩散后验采样器依赖启发式引导近似，可能对非线性算子和多模态后验失效。本文开发了一种稳定的路径空间框架用于基于扩散的后验采样。从终端边际代表先验的基础扩散过程出发，我们定义了轨迹上的似然加权目标测度，并将后验采样转化为学习一个路径测度匹配该目标的受控随机过程。该公式将扩散后验采样与随机最优控制联系起来，同时保留了不确定性量化所需的贝叶斯结构。我们引入了一种时间重参数化，通过消除未知初始值函数引起的偏差，使路径空间控制问题适定，无需辅助训练。然后通过具有对数方差目标的信任域路径空间优化方法学习控制。路径空间视角还统一了我们的学习控制方法与现有的基于引导的采样器，量化了近似控制引起的采样误差，并产生了用于渐近精确后验期望的重要性采样校正。我们在具有解析表征或高质量参考后验的基准逆问题套件上评估了所提出的框架，从而实现了对采样精度和不确定性量化的原则性评估。这些实验深入揭示了基于扩散的后验采样器的行为，并证明了相比领先方法更高的准确性和鲁棒性。

英文摘要

Diffusion models provide expressive data-driven priors for Bayesian inverse problems, but many diffusion posterior samplers rely on heuristic guidance approximations that can fail for nonlinear operators and multimodal posteriors. In this work, we develop a stabilized path-space framework for diffusion-based posterior sampling. Starting from a base diffusion process whose terminal marginal represents the prior, we define a likelihood-weighted target measure on trajectories and cast posterior sampling as learning a controlled stochastic process whose path measure matches this target. This formulation connects diffusion posterior sampling to stochastic optimal control while preserving the Bayesian structure needed for uncertainty quantification. We introduce a time reparameterization that makes the path-space control problem well posed by removing the bias induced by the unknown initial value function, without auxiliary training. We then learn the control via a trust-region path-space optimization method with log-variance objectives. The path-space perspective also unifies our learned control approach with existing guidance-based samplers, quantifies the sampling error induced by approximate controls, and yields importance sampling corrections for asymptotically exact posterior expectations. We evaluate the proposed framework on a suite of benchmark inverse problems with analytically characterized or high-quality reference posteriors, enabling principled assessment of sampling accuracy and uncertainty quantification. These experiments provide insight into the behavior of diffusion-based posterior samplers and demonstrate improved accuracy and robustness over leading approaches.

2606.12611 2026-06-12 cs.LG cs.IT math.IT 新提交

Evaluation of AutoML Frameworks for IDS under Imbalanced Data Conditions of the NSL-KDD Dataset

NSL-KDD数据集不平衡数据条件下IDS的AutoML框架评估

Wiliane Carolina Silva, Evandro César Vilas Boas, Felipe A. P. de Figueiredo

发表机构 * Cybersecurity and Artificial Intelligence Laboratory (CS&I Lab), National Institute of Telecommunications (Inatel)（网络安全与人工智能实验室（CS&I Lab），国家电信研究所（Inatel））； Wireless and Artificial Intelligence Laboratory (WAI Lab), National Institute of Telecommunications (Inatel)（无线与人工智能实验室（WAI Lab），国家电信研究所（Inatel））

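AI summary: Studies how severe class imbalance in the NSL-KDD dataset affects AutoML frameworks for multiclass intrusion detection. Ensemble learning and imbalance-aware optimization improve minority-class detection; PyCaret performs best (66% macro-F1).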

AI中文摘要

本研究探讨了严重类别不平衡对使用NSL-KDD数据集进行多分类网络入侵检测的自动化机器学习（AutoML）框架性能的影响。与以往通过二分类或移除少数类来简化问题的研究不同，我们保留了原始的五类分布，包括高度欠表示的R2L和U2R攻击，从而能够对不平衡敏感的学习行为进行现实评估。在统一且可重复的实验协议下，分析了九个开源AutoML框架，考虑了架构设计、集成策略、验证程序、超参数优化和不平衡处理机制的差异。结果表明，采用集成学习和不平衡感知优化的框架在少数类判别上表现更好。PyCaret获得了最佳整体性能，macro-F1达到66%，其次是AutoGluon（55%），而缺乏原生平衡支持的框架在少数类检测能力上显著下降。进一步分析表明，仅以准确率为导向的优化不足以应对高度不平衡的入侵检测场景，因为高加权指标可能与对罕见攻击类别的泛化能力差共存。作为贡献，本研究为严重多类不平衡下的AutoML入侵检测建立了标准化基准，指出了当前架构的局限性，以及将不平衡感知优化、重采样和分层评估策略原生集成到自动化学习流水线中的必要性。源代码已公开。

英文摘要

This work investigates the impact of severe class imbalance on the performance of automated machine learning (AutoML) frameworks for multiclass network intrusion detection using the NSL-KDD dataset. Unlike previous studies that simplify the problem through binary classification or minority-class removal, we preserve the original five-class distribution, including highly underrepresented attacks such as R2L and U2R, enabling a realistic evaluation of imbalance-sensitive learning behavior. Nine open-source AutoML frameworks were analyzed under a unified and reproducible experimental protocol, considering differences in architectural design, ensemble strategies, validation procedures, hyperparameter optimization, and imbalance-handling mechanisms. The results demonstrate that frameworks incorporating ensemble learning and imbalance-aware optimization achieve better minority-class discrimination. PyCaret obtained the best overall performance, reaching 66% macro-F1, followed by AutoGluon with 55%, whereas frameworks lacking native balancing support exhibited significant degradation in minority-class detection capability. The analysis further shows that accuracy-oriented optimization alone is insufficient for highly imbalanced IDS scenarios, since high weighted metrics may coexist with poor generalization on rare attack categories. As a contribution, this work establishes a standardized benchmark for AutoML-based intrusion detection under severe multiclass imbalance, highlighting current architectural limitations and the need for native integration of imbalance-aware optimization, resampling, and stratified evaluation strategies into automated learning pipelines. The source code is publicly available.
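The gap between accuracy and macro-F1 under imbalance can be shown with a toy example. The label counts and the always-predict-majority classifier below are hypothetical and are not drawn from NSL-KDD or from the paper's results.

```python
def per_class_f1(y_true, y_pred, label):
    """F1 for one class: 2*TP / (2*TP + FP + FN), defined as 0 when undefined."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    denom = 2 * tp + fp + fn
    return 0.0 if denom == 0 else 2 * tp / denom


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so rare classes count as much as common ones."""
    labels = sorted(set(y_true))
    return sum(per_class_f1(y_true, y_pred, lab) for lab in labels) / len(labels)


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy data: 95 majority-class samples, 5 rare-attack samples.
y_true = ["normal"] * 95 + ["U2R"] * 5
# A degenerate classifier that never predicts the rare class.
y_pred = ["normal"] * 100

print(f"accuracy = {accuracy(y_true, y_pred):.3f}")
print(f"macro-F1 = {macro_f1(y_true, y_pred):.3f}")
```

Here accuracy is 0.95 while macro-F1 falls below 0.5, because the rare class contributes an F1 of zero to the unweighted average.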

2606.13341 2026-06-12 cs.CV cs.AI physics.med-ph 新提交

Dual-Domain Equivariant Generative Adversarial Network for Multimodal CT-PET Synthesis

双域等变生成对抗网络用于多模态CT-PET合成

Gabriel Steele, Alzahra Altalib, Alessandro Perelli

发表机构 * arXiv

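AI summary: Proposes a Dual-Domain Equivariant Generative Adversarial Network (DDE-GAN) that learns jointly in the spatial and frequency domains and incorporates rotational equivariance for high-fidelity multimodal CT-PET image synthesis.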

Comments 4 pages, 3 figures, 1 table, 2026 IEEE 23rd International Symposium on Biomedical Imaging (ISBI)

DOI: 10.1109/ISBI61048.2026.11515956

AI中文摘要

我们提出了一种用于多模态CT-PET图像合成的双域等变生成对抗网络（DDE-GAN）。传统的基于GAN的方法通常仅在空间域中操作，忽略了几何一致性，导致结构保真度有限。DDE-GAN通过联合学习空间域和频率（傅里叶）域，捕捉互补的解剖和频谱信息，解决了这些挑战。此外，嵌入在CT和PET测量物理中的旋转等变性被整合到生成器和判别器的损失中，以确保在旋转下的一致响应，从而提高解剖准确性。一种分层双域训练策略通过多阶段损失函数强制实现域内和域间一致性。在HECKTOR 2022 CT-PET数据集上的评估表明，DDE-GAN在CT-PET图像合成中取得了优于基线模型的合成质量。结果表明，将双域学习与几何等变性相结合，显著增强了多模态图像合成的准确性和鲁棒性，为PET补全和数据增强等实际应用提供了可能。

英文摘要

We present a Dual-Domain Equivariant Generative Adversarial Network (DDE-GAN) for multimodal CT-PET image synthesis. Traditional GAN-based approaches often operate solely in the spatial domain and ignore geometric consistency, resulting in limited structural fidelity. DDE-GAN addresses these challenges by jointly learning from both spatial and frequency (Fourier) domains, capturing complementary anatomical and spectral information. Furthermore, the rotational equivariance embedded in the physics of the CT and PET measurements is integrated into the loss of both the generator and discriminator to ensure consistent responses under rotations, improving anatomical accuracy. A hierarchical dual-domain training strategy enforces intra- and inter-domain consistency through multi-stage loss functions. Evaluated on the HECKTOR 2022 CT-PET dataset, DDE-GAN achieves superior synthesis quality over baseline models for CT-PET image synthesis. The results demonstrate that combining dual-domain learning with geometric equivariance substantially enhances multimodal image synthesis accuracy and robustness, enabling practical applications in PET completion and data augmentation.

2606.12478 2026-06-12 cs.LG cond-mat.stat-mech quant-ph 新提交

Boltzmann Attention: Learnable Ising Couplings for Cooperative Attention

玻尔兹曼注意力：用于协同注意力的可学习伊辛耦合

Gilhan Kim, Daniel K. Park

发表机构 * Yonsei University（延世大学）

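AI summary: Proposes Boltzmann attention, which augments the attention mechanism with learnable Ising couplings to model interactions between positions. It outperforms standard softmax attention on character-level language modeling and bracket matching, and the authors show that training with quantum annealing is effective.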

Comments 19 pages, 5 figures

AI中文摘要

注意力机制是现代序列模型的核心，但标准注意力主要通过单个查询-键相似度计算相关性。尽管softmax归一化引入了位置间的竞争，但标准注意力层并未显式参数化注意力决策之间的可学习交互。这限制了其直接在注意力机制内建模协同或对抗性共注意力结构的能力。我们提出玻尔兹曼注意力，一种基于能量的泛化，其中注意力模式由相互作用的伊辛模型控制。该方法用可学习的成对耦合增强通常的数据依赖局部场，使模型能够表示超出softmax或sigmoid注意力所捕获的位置间相关性。在字符级语言建模和合成括号匹配实验上，玻尔兹曼注意力在标准Transformer架构中持续优于标准softmax注意力，且优势随序列长度增加而更加明显。四路消融实验证实改进来自可学习的成对耦合。这些结果表明，显式位置间交互为基于注意力的序列建模提供了原则性增强。此外，伊辛公式为基于量子计算的采样策略开辟了自然路径：我们证明非绝热量子退火提供了实用的训练方法，同时保持了与精确玻尔兹曼计算相当的性能。

英文摘要

Attention mechanisms are central to modern sequence models, yet standard attention computes relevance primarily through individual query--key similarities. Although softmax normalization introduces competition among positions, a standard attention layer does not explicitly parameterize learnable interactions between attention decisions. This limits its ability to directly model cooperative or antagonistic co-attention structure within the attention mechanism itself. We propose Boltzmann attention, an energy-based generalization in which attention patterns are governed by an interacting Ising model. The method augments the usual data-dependent local fields with learnable pairwise couplings, allowing the model to represent inter-position correlations beyond those captured by softmax or sigmoid attention. Experiments on character-level language modeling and synthetic bracket matching show that Boltzmann attention consistently improves over standard softmax attention within a standard Transformer architecture, with the advantage becoming more pronounced as sequence length increases. A four-way ablation confirms that the improvement arises from the learnable pairwise couplings. These results suggest that explicit inter-position interactions provide a principled enhancement for attention-based sequence modeling. Moreover, the Ising formulation opens a natural path toward quantum-computing-based sampling strategies: we demonstrate that diabatic quantum annealing provides a practical training method while maintaining competitive performance with exact Boltzmann computation.

2606.12263 2026-06-12 cs.CV 新提交

VOID: Defeating Unauthorized Mimicry in Latent Diffusion Models

VOID: 击败潜在扩散模型中的未授权模仿

Chunlin Qiu, Ang Li, Tianxiao Huang, Ruilin Gan, Yunjie Ge, Shenyi Zhang, Huayi Duan, Lingchen Zhao, Chao Shen, Qian Wang

发表机构 * School of Cyber Science and Engineering, Wuhan University（武汉大学网络空间安全学院）； School of Computer Science, Wuhan University（武汉大学计算机学院）； Institute for Math&AI, Wuhan University（武汉大学数学与人工智能研究所）； The Hong Kong University of Science and Technology (Guangzhou)（香港科技大学（广州））； School of Cyber Science and Engineering, Xi’an Jiaotong University（西安交通大学网络空间安全学院）

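AI summary: To counter the use of latent diffusion models for unauthorized mimicry, proposes the VOID defense framework. By manipulating the model's intrinsic stochasticity, VOID amplifies latent encoding errors and counteracts target guidance signals, causing semantic corruption that blocks unauthorized mimicry while confining perturbations to human-imperceptible regions.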

Comments Extended full version with more comprehensive experimental results. To appear in the 35th USENIX Security Symposium (USENIX Security 2026)

AI中文摘要

虽然潜在扩散模型（LDM）彻底改变了视觉合成，但它们越来越多地被用于对个人的未授权模仿。现有防御通过注入欺骗性扰动，将生成图像引导至无关目标。然而，这种方法基于一个无根据的假设：微小的扰动能在LDM的整个生成过程中保持其欺骗效果。实际上，模型固有的恢复机制会移除这些扰动，导致个体身份在生成的图像中重新出现。我们提出VOID，一种通过操纵LDM内在随机性克服这一难题的防御框架。VOID以两种新颖方式扰动扩散管道：1）放大潜在编码误差以破坏图像的语义结构，以及2）抵消目标引导信号以抑制模型的恢复能力。这导致语义破坏，阻止任何未授权模仿。值得注意的是，安全增益不以视觉效用为代价，因为VOID同时设法将扰动限制在受保护图像的人眼不可感知区域。我们在5个数据集上对10种模仿攻击的24种最先进防御进行了全面评估，证明了VOID前所未有的保护能力：它将平均Frechet Inception Distance（FID）从113提高到365，比迄今为止最强的防御提升了223%。

英文摘要

While Latent Diffusion Models (LDMs) have revolutionized visual synthesis, they are increasingly exploited for unauthorized mimicry of individuals. Existing defenses inject deceptive perturbations to steer the generated images toward irrelevant targets. However, this approach hinges on an ungrounded assumption: subtle perturbations can maintain their deceptive efficacy throughout an LDM's extensive generation process. In reality, the model's innate restoration mechanism will remove such perturbations and cause individual identities to re-emerge in the images generated. We propose VOID, a defense framework that overcomes this conundrum by manipulating an LDM's intrinsic stochasticity. VOID perturbs the diffusion pipeline in two novel ways: 1) amplifying the latent encoding errors to shatter an image's semantic structure, and 2) counteracting the target guidance signals to suppress the model's restoration capabilities. This results in a semantic corruption that thwarts any unauthorized mimicry. Notably, the security gain does not come at the price of visual utility, as VOID simultaneously manages to confine perturbations to human-imperceptible regions of protected images. Our comprehensive evaluation of 24 state-of-the-art defenses against 10 mimicry attacks on 5 datasets demonstrates VOID's unprecedented protection power: it increases the average Frechet Inception Distance (FID) from 113 to 365, a 223% improvement over the strongest defense to date.
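The 223% figure is consistent with a relative increase in FID, assuming 113 is the average FID under the strongest prior defense and 365 is the average under VOID (the abstract does not spell out the formula):

```latex
\[
\frac{365 - 113}{113} = \frac{252}{113} \approx 2.23 = 223\%
\]
```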

2606.12236 2026-06-12 cs.RO cs.CV 新提交

DrivingAgent: Design and Scheduling Agents for Autonomous Driving Systems

DrivingAgent: 自动驾驶系统的设计与调度智能体

Zhongyu Xia, Wenhao Chen, Yongtao Wang, Ming-Hsuan Yang

发表机构 * Wangxuan Institute of Computer Technology, Peking University（北京大学王选计算机技术研究所）； University of California, Merced（加州大学默塞德分校）

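AI summary: Proposes the DrivingAgent framework, which automates module development (design phase) and uses a lightweight LLM trained with reinforcement learning for real-time scheduling (scheduling phase). It addresses the challenges of integrating new models and meeting real-time constraints in autonomous driving systems, and achieves a better speed-accuracy trade-off on nuScenes and Bench2Drive.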

AI中文摘要

许多自动驾驶系统越来越多地整合基础模型以提高泛化能力并处理长尾场景。然而，这一趋势带来了两个关键挑战：（i）设计和集成新模型的手动且劳动密集型过程，以及（ii）缺乏智能、动态的调度机制以满足严格的实时约束。虽然基于大语言模型（LLM）的智能体为自动化提供了有前景的途径，但现有框架并不适合自动驾驶。具体来说，它们未能区分系统设计和实时调度的根本不同需求，将模块视为不透明的黑盒，并且并非为持续运行而设计。为了解决这些局限性，我们提出了DrivingAgent，这是一个针对自动驾驶系统设计和调度双重挑战的新型智能体框架。在设计阶段，DrivingAgent通过解释系统架构、生成代码以及通过超网络训练验证模块来自动化模块开发。在调度阶段，它采用一个通过强化学习训练的轻量级LLM来实时动态编排系统模块，并由一个集成长期存储与带时间戳短期上下文的结构化记忆支持。实验结果表明，DrivingAgent在nuScenes和Bench2Drive基准测试上实现了更优的速度-精度权衡。

英文摘要

Many autonomous driving systems are increasingly incorporating foundation models to improve generalization and handle long-tail scenarios. However, this trend introduces two key challenges: (i) the manual and labor-intensive process of designing and integrating new models, and (ii) the lack of intelligent, dynamic scheduling mechanisms to meet strict real-time constraints. While Large Language Model (LLM)-based agents offer a promising avenue for automation, existing frameworks are ill-suited for autonomous driving. Specifically, they fail to distinguish between the fundamentally different requirements of system design and real-time scheduling, treat modules as opaque black boxes, and are not designed for continuous operation. To address these limitations, we propose DrivingAgent, a novel agent framework tailored to the dual challenges of autonomous driving system design and scheduling. In the design phase, DrivingAgent automates module development by interpreting system architecture, generating code, and validating modules via super-network training. In the scheduling phase, it employs a lightweight LLM trained with reinforcement learning to dynamically orchestrate system modules in real time, supported by a structured memory that integrates long-term storage with timestamped short-term context. Experimental results demonstrate that DrivingAgent achieves a superior speed--accuracy trade-off on both the nuScenes and Bench2Drive benchmarks.

2606.12160 2026-06-12 cs.CL 新提交

A Controlled Study of Decoding-Time Truthfulness Methods on Instruction-Tuned LLMs

指令调优大语言模型解码时真实性方法的受控研究

Ao Sun

发表机构 * Independent Researcher（独立研究员）

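AI summary: A controlled study of decoding-time truthfulness methods on instruction-tuned LLMs. Using a six-control evaluation framework across 5 models (1B-70B), 3 benchmarks, and 15 methods, it finds that previously reported gains shrink substantially under strict controls, while deliberative prompting such as chain-of-thought appears more robust.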

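AI abstract

Decoding-time truthfulness methods (layer-contrast decoding, inference-time intervention, and learned logit adapters) have shown 10-30 point gains on TruthfulQA when applied to base language models, but modern instruction-tuned LLMs already reach much higher baselines (61-76%). This study applies a six-control evaluation framework across 5 models (1B-70B), 3 benchmarks, and 15 methods, and finds that previously reported gains shrink substantially under strict controls: on the full TruthfulQA benchmark (N=817), no token-level method achieves a statistically significant improvement. Deliberative prompting methods such as chain-of-thought and self-critique appear more robust in the evaluated regime.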

英文摘要

Decoding-time truthfulness methods -- layer-contrast decoding, inference-time intervention, and learned logit adapters -- have demonstrated 10-30 point gains on TruthfulQA when applied to base language models. However, modern instruction-tuned LLMs already achieve substantially higher baselines (61-76%), raising the question of whether these methods remain effective in practice. We design a six-control evaluation framework -- out-of-distribution training, multi-judge validation, simple decoding baselines, confound controls, bootstrap confidence intervals, and seed variance -- and apply it across 5 models (1B-70B), 3 benchmarks, and 15 methods. We find that previously reported gains shrink substantially under strict controls: on the full TruthfulQA benchmark (N=817), no token-level method achieves a statistically significant improvement, and the best learned adapter scores 2.0 points below greedy decoding (p=.23). We identify five evaluation sensitivities -- contamination, judge choice, missing baselines, confounds, and statistical noise -- that individually or jointly account for these discrepancies. Cross-benchmark validation on HaluEval QA and TriviaQA confirms that these patterns extend beyond TruthfulQA. Deliberative prompting methods (chain-of-thought, self-critique) appear more robust in the evaluated regime, with CoT achieving gains of 5.6 to 19 pp across benchmarks as a training-free, single-pass method. We release a seven-point evaluation checklist and discuss implications for future truthfulness research.
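One of the six controls, bootstrap confidence intervals, can be sketched as follows. This is a generic percentile bootstrap over per-question score differences; the abstract does not say which bootstrap variant the author used, and the `diffs` data below is made up for illustration.

```python
import random


def bootstrap_ci(scores, num_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of scores."""
    rng = random.Random(seed)
    n = len(scores)
    means = []
    for _ in range(num_resamples):
        # Resample n items with replacement and record the resampled mean.
        sample = [scores[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int((alpha / 2) * num_resamples)]
    hi = means[int((1 - alpha / 2) * num_resamples) - 1]
    return lo, hi


# Hypothetical paired differences (method minus greedy) on 100 questions:
# +1 = method right where greedy was wrong, -1 = the reverse, 0 = tie.
diffs = [1.0] * 10 + [-1.0] * 10 + [0.0] * 80
lo, hi = bootstrap_ci(diffs)
print(f"95% CI for mean difference: [{lo:.3f}, {hi:.3f}]")
```

If the interval contains zero, as it does for this toy data, the apparent difference between the method and the baseline is not distinguishable from resampling noise.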

2606.11894 2026-06-12 cs.CV 新提交

Wild3R: Feed-Forward 3D Gaussian Splatting from Unconstrained Sparse Photo Collection

Wild3R: 从无约束稀疏照片集合进行前馈式3D高斯泼溅

Yuto Furutani, Takashi Otonari, Kaede Shiohara, Toshihiko Yamasaki

发表机构 * The University of Tokyo（东京大学）

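AI summary: Proposes Wild3R, a feed-forward 3D Gaussian Splatting method for unconstrained sparse photo collections. Using the new WildCity dataset, which covers diverse lighting and transient objects, the model learns cross-view appearance consistency and removes transient content. It outperforms existing feed-forward methods and is competitive with per-scene optimization-based methods.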

Comments Project page: https://furuschool.github.io/wild3r-page/

AI中文摘要

前馈式3D高斯泼溅（3DGS）消除了传统3DGS所需的耗时逐场景优化。然而，现有的前馈方法难以处理包含多样光照条件和瞬态物体的真实世界照片集合。在本文中，我们提出了Wild3R，一种针对无约束稀疏照片集合的前馈方法。主要瓶颈在于缺乏提供多视角、多种光照和瞬态变化的训练数据，而这些是学习鲁棒场景表示所必需的。为解决这一问题，我们引入了WildCity数据集，该数据集包含200个场景、170种光照条件和瞬态物体，总计337,500张图像。通过利用该数据集，我们的模型在参考视图条件下学习跨视角的外观一致性，同时移除瞬态内容。大量实验表明，我们的方法优于现有的前馈方法，并取得了与先前基于逐场景优化的方法相竞争的结果。

英文摘要

Feed-forward 3D Gaussian Splatting (3DGS) removes the need for time-consuming per-scene optimization required by traditional 3DGS. However, existing feed-forward approaches struggle with real-world photo collections that include diverse lighting conditions and transient objects. In this paper, we present Wild3R, a feed-forward approach for unconstrained sparse photo collections. The main bottleneck is the lack of training data that provides multiple viewpoints, a variety of illuminations, and transient variations necessary for learning robust scene representations. To address this, we introduce the WildCity dataset, which comprises 200 scenes, 170 lighting conditions, and transient objects, resulting in 337,500 images in total. By leveraging the dataset, our model learns appearance consistency across viewpoints conditioned on reference views, while removing transient content. Extensive experiments demonstrate that our method outperforms existing feed-forward approaches and achieves results competitive with prior per-scene optimization-based methods.

2606.11836 2026-06-12 cs.SD cs.AI eess.AS 新提交

Towards Data-free and Training-free Compression for Speech Foundation Models Using Parameter Clustering

面向语音基础模型的无数据无训练压缩：基于参数聚类的方法

Haoning Xu, Zhaoqing Li, Huimeng Wang, Youjun Chen, Chengxi Deng, Mengzhe Geng, Xunying Liu

发表机构 * The Chinese University of Hong Kong（香港中文大学）； National Research Council Canada（加拿大国家研究委员会）

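AI summary: Proposes a data-free, training-free compression method based on k-means channel clustering. Using a different number of parameter clusters per layer yields fine-grained mixed-sparsity pruning, which significantly reduces WER on HuBERT-large and Whisper-large-v3.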

Comments Accepted by Interspeech 2026

2606.11793 2026-06-12 cs.LG cs.AI physics.ao-ph 新提交

更接近真实：一种多尺度残差感知表示学习管道用于时间序列预测

Amrijit Biswas, Mustafa Kamal, Robin Krambroeckers, M. M. Lutfe Elahi, Sifat Momen, Nabeel Mohammed, Shafin Rahman

发表机构 * RobotBulls Labs（RobotBulls实验室）； North South University（南北大学）

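AI summary: Proposes a two-stage, model-agnostic framework that explicitly decouples forecasting from residual learning and uses a meta-corrector to dynamically model structured error patterns, improving Transformer forecasting accuracy.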

Comments Accepted at the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2 (KDD '26)


Abstract

Transformer-based models have emerged as leading paradigms in time-series forecasting in recent years, employing self-attention mechanisms to capture long-range dependencies. Despite their success, these single-stage forecasting architectures exhibit persistent systematic residual biases arising from structural discrepancies, unmodeled stochastic components, or inadequate multi-scale temporal representations. This limitation persists when residuals are treated as irreducible noise, precluding adaptive correction of structured error patterns. To address this limitation, we introduce a two-stage, model-agnostic framework that explicitly decouples forecasting and residual learning into distinct stages of representation learning. A base transformer first generates the initial predictions. Subsequently, a dedicated meta-corrector dynamically models structured error patterns across multivariate channels, preserves cross-variable dependencies, and iteratively refines the residual bias of the base transformer. By formalizing this pipeline as a hypothesis space expansion, our framework addresses approximation limitations inherent in single-stage architectures, removes reliance on restrictive assumptions, and enables end-to-end learning of complex error dynamics. Evaluated on eight popular benchmark datasets using established protocols, our approach achieves state-of-the-art performance, with significant improvements in standard metrics (MSE, MAE). The results demonstrate the framework's ability to mitigate systematic biases and enhance robustness to complex temporal dynamics, advancing the practical applicability of transformer-based forecasting models.
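The decoupling described in the abstract, with a base forecaster followed by a separate model fitted to its residuals, can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the persistence forecaster stands in for the base transformer, a one-feature least-squares fit stands in for the meta-corrector, and the names `base_forecast`, `residual_feature`, `fit_corrector`, and `corrected_forecast` are hypothetical. It is not the authors' architecture.

```python
from statistics import mean

def base_forecast(history):
    # Stage 1 stand-in (assumption): persistence forecast, repeat the last value.
    return history[-1]

def residual_feature(history):
    # Feature given to the corrector (assumption): the latest first difference.
    return history[-1] - history[-2]

def fit_corrector(series, start=2):
    # Stage 2 stand-in: least-squares fit of residual = a * feature + b,
    # using only the base model's errors on past steps.
    xs, rs = [], []
    for t in range(start, len(series)):
        history = series[:t]
        xs.append(residual_feature(history))
        rs.append(series[t] - base_forecast(history))
    x_bar, r_bar = mean(xs), mean(rs)
    var = sum((x - x_bar) ** 2 for x in xs)
    cov = sum((x - x_bar) * (r - r_bar) for x, r in zip(xs, rs))
    a = cov / var if var else 0.0
    b = r_bar - a * x_bar
    return a, b

def corrected_forecast(history, corrector):
    # Final prediction = base prediction + predicted residual.
    a, b = corrector
    return base_forecast(history) + a * residual_feature(history) + b

# Toy series with a systematic trend that the persistence model cannot follow.
series = [float(t * t) for t in range(12)]
train, target = series[:-1], series[-1]
corrector = fit_corrector(train)
base_error = abs(target - base_forecast(train))
corrected_error = abs(target - corrected_forecast(train, corrector))
print(base_error, corrected_error)
```

On this toy series the base model's error is structured rather than random, which is the situation the abstract argues a residual stage can exploit.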


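2606.10403 2026-06-12 cs.CL New submission

KCSAT-ML: Probing Reasoning Models with Nationwide-Cohort Human Difficulty

Sanghee Park, Geewook Kim, Kee-Eung Kim

Affiliations: NAVER Cloud AI; KAIST AI

AI summary: Introduces the KCSAT-ML benchmark (664 Korean college entrance exam math problems, including a 339-item core set with official error rates) and the Difficulty-aligned Reasoning Gain (DRG) metric. The results show that vision-language model accuracy collapses on items with high human error rates, that test-time scaling is non-monotonic, and that anti-scaling and overthinking coexist within a single model family.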

Comments 18 pages, 14 figures, 8 tables


Abstract

Math reasoning benchmarks have proliferated, yet most lack a per-item difficulty signal grounded in actual human performance. We introduce KCSAT-ML, a decade (2014-2025) of Korean College Scholastic Ability Test (KCSAT; Suneung) mathematics: 664 problems with a 339-item core set carrying official per-item error rates from nationwide cohorts of hundreds of thousands of examinees. We pair the benchmark with Difficulty-aligned Reasoning Gain (DRG): a score-orthogonal metric that asks whether a model's mistakes concentrate on the items humans found hard, or on items humans found easy. Together they expose, across a wide range of VLMs (and LLMs via OCR), three patterns: (i) low-budget accuracy collapses on the high-human-error tail at every model size; (ii) test-time scaling (TTS) raises token use roughly linearly with cohort error rate, while accuracy gains follow a non-monotonic curve; (iii) within a single family, TTS flips between anti-scaling on the hardest items and overthinking on easier ones -- two faces of the same alignment failure. On DRG, models with near-identical accuracy can sit at near-opposite values: one model gets wrong what humans also find hard, while another solves the hardest items yet fails on items humans find easy -- a contrast that aggregate accuracy hides. Our code and dataset builder will be open-sourced at https://github.com/naver-ai/KCSAT-ML.


2606.09639 2026-06-12 cs.CV New submission

CineDance: Towards Next-Generation Multi-Shot Long-Form Cinematic Audio-Video Generation

Yuheng Chen, Teng Hu, Yuji Wang, Qingdong He, Zhucun Xue, Qianyu Zhou, Jason Li, Lizhuang Ma, Jiangning Zhang, Dacheng Tao

Affiliations: Shanghai Jiao Tong University; University of Electronic Science and Technology of China; Zhejiang University; The University of Tokyo; Nanyang Technological University

AI summary: Introduces CineDance-1M, a large-scale multi-shot, long-form audio-video dataset built with a three-stage curation pipeline, together with the CineBench evaluation suite, to enable high-quality joint generation.


Abstract

The fidelity and structural diversity of training datasets fundamentally determine the capabilities of video generation models. While commercial systems show remarkable ability to generate cinematic narratives, the progress of open-source models remains limited by the scarcity of high-quality training data. To bridge this gap, we introduce CineDance-1M, a large-scale, open research Text-to-Audio-Video (T2AV) dataset designed specifically for multi-shot, long-form joint audio-video generation. Averaging 92.8 seconds and 24.2 continuous shots per video, it provides configurable, structured annotations for both audio and video modalities. This exceptional quality is achieved through a rigorous three-stage curation pipeline: i) diverse sourcing and comprehensive cleansing, ii) film-theory-inspired narrative parsing, and iii) hierarchical dual-modal captioning. For a comprehensive assessment, we propose CineBench, featuring a diverse prompt suite and a six-dimensional, human-aligned metric system tailored for complex narrative audio-video evaluation. Furthermore, we adapt LTX-2.3 into CineDance, which demonstrates exceptional single-modality quality alongside precise audio-video alignment and robust subject and environment consistency, effectively validating our curation strategy and the high quality of CineDance-1M. We anticipate that this work will serve as a solid foundation for accelerating future research in multi-shot, long-form joint audio-video generation. Our project page is available at https://aliothchen.github.io/projects/CineDance/.


2606.09073 2026-06-12 cs.LG cs.AI cs.CL New submission

A Unifying Lens on Reward Uncertainty in RLHF

Ely Hahami, Yoel Zimmermann, Ray Zhou, Jack Benarroch Jedlicki

Affiliations: University of California, Berkeley; DeepMind

AI summary: Proposes distributional reward models as a way to unify pessimism methods in RLHF, connecting existing heuristics through a closed-form effective reward and exposing their implicit assumptions.


Abstract

Reinforcement learning from human feedback (RLHF) is bottlenecked by reward hacking, where the policy exploits errors in a proxy reward model (RM) and produces high RM scores without genuine quality gains. A natural mitigation is pessimism: lowering rewards in regions where the RM is uncertain. However, standard scalar RMs provide no principled notion of uncertainty. We argue that the right object is a distributional reward model $p(r\mid x,y)$. Under either a Bayesian inference or a KL-distributionally robust optimization (KL-DRO) lens, the KL-regularized RLHF objective admits a closed-form effective reward $\tilde r(x,y) = \pm\beta\log\mathbb{E}_p[e^{\pm r/\beta}]$. The pessimistic branch unifies the prior heuristics for RM ensemble aggregation: mean aggregation, worst-case optimization (WCO), and uncertainty-weighted optimization (UWO) all emerge as limits or truncations of this single expression. This also clarifies the implicit assumptions of each existing rule.
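A small numerical sketch can make the closed-form effective reward concrete. It assumes the expectation over p(r | x, y) is approximated by a uniform average over a finite set of reward samples (for example, one score per ensemble member); that approximation and the function name `effective_reward` are illustrative assumptions, not taken from the paper. Mathematically, the pessimistic branch approaches the sample mean for large beta and the sample minimum for small beta, and its second-order expansion in 1/beta is the mean minus variance / (2 * beta), a mean-minus-variance rule of the kind the abstract associates with uncertainty weighting.

```python
import math

def effective_reward(samples, beta, pessimistic=True):
    # Pessimistic: -beta * log E[exp(-r / beta)]
    # Optimistic:  +beta * log E[exp(+r / beta)]
    # The expectation is a uniform average over `samples` (assumption).
    sign = -1.0 if pessimistic else 1.0
    scaled = [sign * r / beta for r in samples]
    shift = max(scaled)  # log-sum-exp shift for numerical stability
    log_mean_exp = shift + math.log(
        sum(math.exp(s - shift) for s in scaled) / len(samples)
    )
    return sign * beta * log_mean_exp

scores = [1.0, 2.0, 4.0]  # made-up reward samples for one (x, y) pair
avg = sum(scores) / len(scores)
var = sum((r - avg) ** 2 for r in scores) / len(scores)

near_mean = effective_reward(scores, beta=1e4)    # large beta: close to the mean
near_min = effective_reward(scores, beta=1e-3)    # small beta: close to the minimum
mid_beta = effective_reward(scores, beta=10.0)    # compare with mean - var / (2 * beta)
optimistic = effective_reward(scores, beta=10.0, pessimistic=False)
print(near_mean, near_min, mid_beta, avg - var / 20.0, optimistic)
```

By Jensen's inequality the pessimistic value never exceeds the sample mean and the optimistic value never falls below it, so beta acts as a single dial between averaging and worst-case behaviour.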


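2606.08765 2026-06-12 cs.RO cs.CV New submission

RGB-S: Image-Aligned Tactile Saliency for Robust Dexterous Manipulation

Shengcheng Luo, Kefei Wu, Xiaoying Zhou, Wanlin Li, Ziyuan Jiao, Chenxi Xiao

Affiliations: ShanghaiTech University; Beijing Institute for General Artificial Intelligence

AI summary: Proposes the RGB-S framework, which uses forward kinematics and camera calibration to project tactile sensor locations onto the RGB image plane and renders force-modulated Gaussian saliency maps, explicitly aligning touch with vision. Dexterous manipulation success under severe occlusion improves by 26.7 percentage points.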

Comments 20 pages, 7 figures


Abstract

Effective visuo-tactile integration is critical for robotic dexterous manipulation, especially when visual observations are unreliable or occluded. However, robustly aligning sparse, heterogeneous tactile measurements with dense visual representations remains a fundamental challenge. Most existing approaches require policies to learn cross-modal correspondences implicitly from limited demonstrations, without leveraging geometric priors. As a result, they are often data-inefficient and generalize poorly when visual observations are degraded. To address this limitation, we propose a framework that explicitly grounds physical contacts in the image domain. Using robot forward kinematics and camera calibration, we project tactile sensor locations directly onto the RGB image plane. We then render force-modulated Gaussian saliency maps to model spatial uncertainty arising from kinematic and calibration errors. By integrating these 2D spatial anchors through a zero-initialized conditioning architecture, our method injects physical contact priors into standard visual backbones while preserving pre-trained visual representations. We evaluate our method on six dexterous manipulation tasks in both simulation and the real world under severe visual occlusions. Real-world experiments show that explicit RGB-S grounding in the image domain improves real-world occluded manipulation success rates by $26.7$ percentage points over the strongest implicit visuo-tactile baseline, suggesting its improved spatial reasoning and robustness to occlusion. Project page: touch-as-saliency.github.io
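The projection-then-render step in the abstract can be sketched in a few lines. The sketch assumes an ideal pinhole camera without lens distortion, and assumes each tactile sensor position has already been transformed into the camera frame (in the paper's setting that would come from forward kinematics plus the camera extrinsics). How force modulates the Gaussian is not specified in the abstract, so scaling the peak amplitude by a clipped force is a guess, and `project_point`, `saliency_map`, `sigma_px`, and `max_force` are hypothetical names.

```python
import math

def project_point(point_cam, fx, fy, cx, cy):
    # Pinhole projection of a 3D point expressed in the camera frame (metres).
    x, y, z = point_cam
    if z <= 0:
        raise ValueError("point is behind the camera")
    return fx * x / z + cx, fy * y / z + cy

def saliency_map(contacts, width, height, intrinsics, sigma_px=3.0, max_force=10.0):
    # contacts: list of ((x, y, z) in camera frame, force in newtons).
    # Returns a height x width grid with one Gaussian blob per contact.
    fx, fy, cx, cy = intrinsics
    grid = [[0.0] * width for _ in range(height)]
    for point_cam, force in contacts:
        u, v = project_point(point_cam, fx, fy, cx, cy)
        # Assumption: force sets the peak height, clipped to [0, 1].
        amplitude = min(max(force / max_force, 0.0), 1.0)
        for row in range(height):
            for col in range(width):
                d2 = (col - u) ** 2 + (row - v) ** 2
                value = amplitude * math.exp(-d2 / (2.0 * sigma_px ** 2))
                grid[row][col] = max(grid[row][col], value)
    return grid

# Made-up example: one fingertip 0.5 m in front of the camera, pressing with 5 N.
intrinsics = (50.0, 50.0, 16.0, 12.0)  # fx, fy, cx, cy in pixels
grid = saliency_map([((0.1, 0.0, 0.5), 5.0)], width=32, height=24, intrinsics=intrinsics)
peak_value, peak_row, peak_col = max(
    (value, row, col) for row, line in enumerate(grid) for col, value in enumerate(line)
)
print(peak_row, peak_col, peak_value)
```

The Gaussian width plays the role the abstract assigns to spatial uncertainty: a larger `sigma_px` tolerates larger kinematic and calibration errors at the cost of a less precise anchor.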


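2606.08436 2026-06-12 cs.CV New submission

CACR: Reinforcing Temporal Answer Grounding in Instructional Video via Candidate-Aware Causal Reasoning

Muge Qi, Rong Fu, Pengbin Feng, Xianda Li, Yu Cai, Yifu Guo, Shizhe Zhang, Simon James Fong, Lei Ma, Bin Li

Affiliations: University of Science and Technology of China

AI summary: Proposes a candidate-aware causal reasoning framework that combines candidate selection based on vision-language pre-training with GRPO-optimized temporal logic reasoning, addressing complex question understanding and segment localization in long instructional videos and achieving the best mIoU on six benchmarks.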


Abstract

The task of temporal answer grounding in instructional video (TAGV), which aims to locate precise video segments that respond to natural language queries, is increasingly important for direct video answer retrieval. This task remains challenging due to the need to comprehend semantically complex questions and to address the significant length mismatch between untrimmed videos and short target moments. Existing methods often suffer from sensitivity to irrelevant content or insufficient visual reasoning capabilities. To tackle these limitations, we propose a Candidate-Aware Causal Reasoning (CACR) framework. Our approach first employs a Visual-Language Pre-training based Candidate Selection (VBCS) algorithm to efficiently generate K candidate segments, then applies a temporal logic reasoning module enhanced by a rejection reward mechanism and optimized via Group Relative Policy Optimization (GRPO) for robust inference. Extensive experiments on six benchmarks demonstrate that our method achieves state-of-the-art performance in terms of mean Intersection-over-Union (mIoU), providing a new perspective for reasoning-based retrieval in long videos.
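For readers unfamiliar with the metric, temporal Intersection-over-Union compares a predicted time span with the annotated one, and mIoU averages it over queries. The sketch below uses the standard interval definition; the function names and example spans are illustrative, and individual benchmarks may apply their own conventions (for instance, thresholds or frame rounding).

```python
def temporal_iou(pred, gold):
    # pred and gold are (start_seconds, end_seconds) spans.
    (pred_start, pred_end), (gold_start, gold_end) = pred, gold
    intersection = max(0.0, min(pred_end, gold_end) - max(pred_start, gold_start))
    union = (pred_end - pred_start) + (gold_end - gold_start) - intersection
    return intersection / union if union > 0 else 0.0

def mean_iou(preds, golds):
    # Average IoU over paired predictions and annotations.
    return sum(temporal_iou(p, g) for p, g in zip(preds, golds)) / len(preds)

# Made-up spans: partial overlap, no overlap, exact match.
preds = [(10.0, 20.0), (0.0, 5.0), (30.0, 42.0)]
golds = [(15.0, 25.0), (5.0, 10.0), (30.0, 42.0)]
score = mean_iou(preds, golds)
print(score)
```

Because the target moment is short relative to the untrimmed video, a prediction that is only slightly too long or shifted loses IoU quickly, which is part of why the length mismatch mentioned in the abstract makes the task hard.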


2606.08098 2026-06-12 cs.AI cs.LG New submission

When Does Delegation Beat Majority? A Delegation-Based Aggregator for Multi-Sample LLM Inference

Yasushi Sakai, Allen Song, Kent Larson

Affiliations: MIT Media Lab

AI summary: Proposes PPV, a delegation-based aggregator that uses per-sample letter entropy and reasoning-geometry signals to beat majority voting by 1.5 percentage points on MMLU-Pro, with no labels or training.

Comments Preprint. 16 pages, 5 figures, 4 tables


Abstract

Majority voting over sampled answers is the dominant unsupervised aggregator for multi-sample LLM inference. In this paper, we show a delegation-based aggregator (Propagational Proxy Voting, PPV; Sakai et al., 2025) yields an unsupervised consensus rule that beats majority on MMLU-Pro by +1.5 pp overall and +2.24 pp on the non-trivial subset (paired McNemar p ~ 1.0e-14, n = 8,099). Majority discards two signals that every sample carries: within-group letter entropy and between-group reasoning geometry. PPV exposes per-voter levers that consume exactly these two signals: When (how much weight a voter keeps on its own pick) and Whom (how it splits the remainder across peers). We drive When with letter entropy and Whom with per-question-centered embedding cosine. Our method needs no gold labels and no auxiliary training: per-question, we partition 128 sampled generations into 16 groups, compute each group's letter-level semantic entropy and reasoning embedding centroid, and feed both into a stochastic delegation matrix whose stationary distribution selects the consensus answer. We walk through an example in which PPV overturns a clear 10-6 majority for the wrong letter: the 10-voter majority cluster is geometrically incoherent (mean within-cluster cosine -0.02) while the 6-voter minority is tight (+0.26), so propagated delegation mass concentrates on the minority's answer even though entropy alone would keep the majority ahead. We further report delegation strategies with negative results that constrain the design space for unsupervised LLM aggregation. No within-question ensemble of confidence modes closes the oracle gap.
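The idea of a stochastic delegation matrix whose stationary distribution picks the answer can be sketched on a toy example. The abstract does not give the exact mapping from entropy and embedding cosine to matrix entries, so the choices here are assumptions: each voter keeps weight 1 - entropy on itself (the When lever), splits the rest across peers in proportion to positive cosine similarity (the Whom lever), and the stationary distribution is found by power iteration. All names and numbers are hypothetical.

```python
def delegation_matrix(entropies, similarity):
    # Row i says how voter i distributes its voting mass (rows sum to 1).
    n = len(entropies)
    matrix = []
    for i in range(n):
        keep = 1.0 - entropies[i]  # assumption: confident voters keep more
        affinity = [max(similarity[i][j], 0.0) if j != i else 0.0 for j in range(n)]
        total = sum(affinity)
        if total == 0.0:
            keep = 1.0  # no similar peer to delegate to
        row = [(1.0 - keep) * a / total if total else 0.0 for a in affinity]
        row[i] = keep
        matrix.append(row)
    return matrix

def stationary_distribution(matrix, steps=500):
    # Power iteration on the row-stochastic matrix, starting from equal mass.
    n = len(matrix)
    mass = [1.0 / n] * n
    for _ in range(steps):
        mass = [sum(mass[i] * matrix[i][j] for i in range(n)) for j in range(n)]
    return mass

def consensus(picks, entropies, similarity):
    # Sum the stationary mass behind each answer letter and take the largest.
    mass = stationary_distribution(delegation_matrix(entropies, similarity))
    totals = {}
    for pick, m in zip(picks, mass):
        totals[pick] = totals.get(pick, 0.0) + m
    return max(totals, key=totals.get), totals

# Made-up example: three uncertain, mutually dissimilar "A" voters
# against two confident, mutually similar "B" voters.
picks = ["A", "A", "A", "B", "B"]
entropies = [0.8, 0.8, 0.8, 0.1, 0.1]
similarity = [
    [1.0, 0.0, 0.0, 0.3, 0.3],
    [0.0, 1.0, 0.0, 0.3, 0.3],
    [0.0, 0.0, 1.0, 0.3, 0.3],
    [0.3, 0.3, 0.3, 1.0, 0.9],
    [0.3, 0.3, 0.3, 0.9, 1.0],
]
winner, totals = consensus(picks, entropies, similarity)
print(winner, totals)
```

In this toy setup a plain vote count favours "A" three to two, while the propagated mass ends up concentrated on "B"; with all entropies at zero the matrix is the identity and the rule reduces to majority voting.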


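2606.07515 2026-06-12 cs.CL cs.AI cs.HC math.PR New submission

How reliable are LLMs when it comes to playing dice?

Luca Avena, Gianmarco Bet, Bernardo Busoni

Affiliations: Università degli Studi di Firenze

AI summary: A benchmark of discrete probability problems finds that LLMs reach an accuracy of 0.96 on standard problems but only 0.59 on counterintuitive ones, and that they are vulnerable to token bias and misleading prompts.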