2605.28255 2026-05-28 cs.AI cs.CL cs.HC

AI, Take the Wheel: What Drives Delegation and Trust in Human-Computer Cooperative Question Answering?

Maharshi Gor, Yoo Yeon Sung, Yu Hou, Eve Fleisig, Irene Ying, Tianyi Zhou, Jordan Boyd-Graber

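Affiliations: University of Maryland; University of California; MBZUAI

AI summary: Using a question-answering game experiment, the paper studies when and why humans delegate to AI or adopt its suggestions. It finds under-reliance on correct AI suggestions (3.9%), over-reliance on incorrect ones (1.7%), and an effect of confirmation bias, and recommends calibrated confidence, evidence-grounded explanations, and trust-refinement mechanisms to improve human-AI collaboration.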

Comments Findings of the Association for Computational Linguistics, 2026

Abstract

AI systems are fallible, and humans can make mistakes in deciding whether to trust AI over their own judgment. Thus, improving human-AI collaboration requires understanding when, why, and how humans decide to rely on AI. We study two distinct reliance decisions: the delegation choice (deciding when to let AI act autonomously without knowing its output) and the adoption choice (evaluating AI suggestions and deciding how to use them). Both of these decoupled reliance patterns shape collaboration, but prior work rarely studies them together in realistic settings with the same users. We address this gap by studying collaborative human-AI teams competing in a question-answering game in which humans can choose when and how to work with AI agents to win. Our 24 matches pair 23 expert humans with 16 AI agents, capturing 387 delegation and 1440 adoption decisions. While human-AI collaboration performs better than either AI or humans alone, humans make suboptimal collaboration decisions, both under-relying on correct AI suggestions (3.9% of opportunities missed) and over-relying when AI misleads them (1.7%). Both parties contribute wrong answers: reported model confidence is near chance when humans and AI disagree, while confirmation bias drives higher under-reliance (64.5%) when an AI suggestion agrees with humans' initial incorrect answer. To close this gap, we recommend calibrated confidence, evidence-grounded explanations, and mechanisms that help users refine trust.
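To make the two reliance errors concrete, here is a minimal sketch of how under- and over-reliance rates could be tallied from adoption decisions. The record fields (`ai_correct`, `human_adopted`) and the choice of denominator (all logged decisions) are assumptions made for illustration; the abstract does not specify how the paper defines them.

```python
from dataclasses import dataclass


@dataclass
class AdoptionDecision:
    ai_correct: bool     # hypothetical field: was the AI suggestion correct?
    human_adopted: bool  # hypothetical field: did the human go with it?


def reliance_rates(decisions):
    """Return (under_reliance, over_reliance) as fractions of all decisions."""
    n = len(decisions)
    if n == 0:
        return 0.0, 0.0
    # Under-reliance: the AI was right but the human did not adopt it.
    under = sum(d.ai_correct and not d.human_adopted for d in decisions)
    # Over-reliance: the AI was wrong but the human adopted it anyway.
    over = sum((not d.ai_correct) and d.human_adopted for d in decisions)
    return under / n, over / n


log = [
    AdoptionDecision(True, True),
    AdoptionDecision(True, False),
    AdoptionDecision(False, True),
    AdoptionDecision(False, False),
]
under, over = reliance_rates(log)
```

With a different denominator (for example, only the cases where the AI was correct), the same counts would yield different percentages, so any comparison with the reported 3.9% and 1.7% depends on the paper's own definition.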

2605.28253 2026-05-28 cs.CL cs.DB cs.HC

Building Community-Centred NLP Resources for Puno Quechua

Elwin Huaman, Adrian Gamarra Lafuente, Johanna Cordova, Anna Korhonen

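Affiliations: University of Cambridge (UK); Stanford University (USA); ERTIM - Inalco (France)

AI summary: Collects 66 hours of speech data through participatory design, fine-tunes models such as Whisper-base, establishes the first ASR benchmark for Puno Quechua, and open-sources all resources.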

Comments Sixth Workshop on NLP for Indigenous Languages of the Americas (AmericasNLP 2026), co-located with ACL 2026

2605.28247 2026-05-28 cs.LG cs.AI

IRDS: Interpretable RLVR Data Selection via Verifier-Coupled Sparse Autoencoder Coverage

Yuhan Li, Mingxu Zhang, Dazhong Shen, Ying Sun

Affiliations: The Hong Kong University of Science and Technology (Guangzhou); Nanjing University of Aeronautics and Astronautics; The 63rd Research Institute, National University of Defense Technology, Nanjing

AI summary: Proposes IRDS, which uses sparse autoencoder clusters and a verifier-coupled coverage objective to select RLVR training instances that the model fails on but can still learn from, improving math reasoning accuracy while lowering compute cost.

Comments 24 pages, 3 figures, 18 tables

Abstract

Reinforcement learning with verifiable rewards (RLVR) has become a key technique for enhancing LLM reasoning, yet its data inefficiency remains a major bottleneck. Existing methods address this problem only partially, each missing at least one of subset-level coverage, verifier signal use, or interpretability. To address this gap, we present IRDS (Interpretable RLVR Data Selection), which selects RLVR training instances on a sparse autoencoder (SAE) cluster basis so the selection itself is auditable on recognizable problem motifs. To select instances the model both fails on and can still learn from, we introduce a verifier-coupled coverage objective on the SAE basis and solve it by greedy log-determinant maximization. Experiments on three instruction-tuned models and six math reasoning benchmarks show that IRDS achieves the highest overall accuracy, exceeding the strongest baseline by +3.9/+4.0 pp on the two Qwen models and by +0.5 pp on Llama-3.1-8B, while running an order of magnitude cheaper than the trajectory-based baseline.
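The greedy log-determinant step can be sketched generically as follows. This is not the IRDS implementation: the objective log det(I + Σ wᵢ vᵢvᵢᵀ), the feature vectors, and the per-instance `weights` (standing in for some verifier-derived signal) are all assumptions chosen to illustrate the technique. The marginal gain uses the matrix determinant lemma, and the inverse is maintained with a Sherman-Morrison update.

```python
import math


def greedy_logdet_select(vectors, weights, k):
    """Greedily pick k indices maximizing log det(I + sum_i w_i v_i v_i^T)."""
    d = len(vectors[0])
    # Inverse of the running matrix A = I + (selected weighted outer products).
    a_inv = [[1.0 if r == c else 0.0 for c in range(d)] for r in range(d)]
    selected = []
    for _ in range(min(k, len(vectors))):
        best_i, best_gain, best_u, best_denom = None, -1.0, None, None
        for i, (v, w) in enumerate(zip(vectors, weights)):
            if i in selected:
                continue
            s = [math.sqrt(w) * x for x in v]  # weighted feature vector
            u = [sum(a_inv[r][c] * s[c] for c in range(d)) for r in range(d)]
            denom = 1.0 + sum(s[r] * u[r] for r in range(d))
            gain = math.log(denom)  # log det(A + s s^T) - log det(A)
            if gain > best_gain:    # strict '>' keeps the first index on ties
                best_i, best_gain, best_u, best_denom = i, gain, u, denom
        selected.append(best_i)
        # Sherman-Morrison: (A + s s^T)^-1 = A^-1 - (A^-1 s)(A^-1 s)^T / denom
        for r in range(d):
            for c in range(d):
                a_inv[r][c] -= best_u[r] * best_u[c] / best_denom
    return selected


features = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
picked = greedy_logdet_select(features, [1.0, 1.0, 1.0], k=2)
picked_weighted = greedy_logdet_select(features, [1.0, 1.0, 0.0], k=2)
```

In the toy run, the duplicate direction is skipped in favor of the orthogonal one, which is the coverage behavior a log-determinant objective rewards; setting a weight to zero removes an instance's contribution entirely.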

2605.28241 2026-05-28 cs.CV

Bridging the Sampling Distribution Shift in Radio Map Estimation: A Trajectory-Aware Paradigm

Feng Qiu, Zheng Fang, Shuhang Zhang, Kangjun Liu, Longkun Zou, Jing Liu, Ke Chen

Affiliations: School of Artificial Intelligence, Xidian University; Pengcheng Laboratory; Department of Computer Science and Engineering, Southern University of Science and Technology; Department of Electronics, Peking University; Guangzhou Institute of Technology, Xidian University

AI summary: Addresses the performance drop caused by the mismatch between UAV trajectory sampling and random sampling by proposing a trajectory-aware training paradigm based on Stochastic-Triggered Trajectory-Based Sampling, which effectively reduces estimation error.

Abstract

Learning-based radio map estimation (RME) plays a critical role in UAV-assisted wireless sensing, enabling tasks such as coverage prediction and network optimization. Most current methods assume an independently and identically distributed (i.i.d.) training and testing setting based on random sampling. However, practical UAV measurements are collected sequentially along feasible trajectories, resulting in highly structured and spatially correlated patterns. This mismatch introduces a sampling distribution shift that increases the intrinsic difficulty of spatial field recovery and compromises the generalization of models trained under i.i.d. assumptions. To mitigate this issue, we propose a trajectory-aware training paradigm based on Stochastic-Triggered Trajectory-Based Sampling (ST-TBS), which preserves trajectory continuity while introducing sampling variability. Moreover, from a statistical perspective, we show that trajectory-based sampling reduces spatial diversity and increases information redundancy compared to random sampling. Extensive experiments on the RadioMapSeer and SpectrumNet datasets demonstrate that models trained with random sampling suffer significant performance degradation under trajectory-based observations, with RMSE increasing from 0.0391 to 0.2632 on SpectrumNet. Conversely, our proposed ST-TBS method effectively reduces the RMSE to 0.0571. These results highlight the necessity of aligning training and deployment sampling distributions for reliable RME.
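The claim that trajectory-based sampling reduces spatial diversity can be illustrated with a toy comparison. The sketch below is not the ST-TBS procedure: the unit-step random walk, the grid size, and the use of mean pairwise distance as a diversity proxy are all assumptions made for illustration.

```python
import math
import random


def random_samples(size, n, rng):
    """n i.i.d. uniform locations on a size x size grid."""
    return [(rng.randrange(size), rng.randrange(size)) for _ in range(n)]


def trajectory_samples(size, n, rng):
    """n locations visited along a unit-step random walk kept on the grid."""
    x, y = rng.randrange(size), rng.randrange(size)
    path = [(x, y)]
    while len(path) < n:
        dx, dy = rng.choice([(1, 0), (-1, 0), (0, 1), (0, -1)])
        x = min(max(x + dx, 0), size - 1)  # clamp at the border
        y = min(max(y + dy, 0), size - 1)
        path.append((x, y))
    return path


def mean_pairwise_distance(points):
    """Average Euclidean distance over all unordered pairs of points."""
    pairs = [(p, q) for i, p in enumerate(points) for q in points[i + 1:]]
    return sum(math.dist(p, q) for p, q in pairs) / len(pairs)


rng = random.Random(0)
iid = random_samples(64, 50, rng)
walk = trajectory_samples(64, 50, rng)
spread_iid = mean_pairwise_distance(iid)
spread_walk = mean_pairwise_distance(walk)
```

Because consecutive walk samples are at most one cell apart, the walk's mean pairwise distance is bounded by (n + 1) / 3 cells, while i.i.d. samples spread over the whole grid. For scale, the SpectrumNet numbers in the abstract correspond to RMSE growing roughly 6.7 times (0.0391 to 0.2632) under trajectory-based observations.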

2605.28232 2026-05-28 cs.AI

PIRS: Physics-Informed Reward Shaping for SAC-Based Building Energy Management

Shadmehr Zaregarizi, Khashayar Yavari

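Affiliations: Politecnico di Torino

AI summary: To address the lack of physical grounding in reward design for deep reinforcement learning, proposes PIRS, which embeds the ISO 7730 PMV formulation in SAC's multi-objective reward to improve interpretability and performance.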

Comments N pages, 4 figures, 3 tables. Accepted at the 2nd Workshop on AI-Driven Energy Efficiency in Dynamic Systems (AI-DEEDS '26), co-located with ACM e-Energy / ACM Sustainability Week, Banff, AB, Canada, June 22-25, 2026

Abstract

Occupant comfort and grid-aware energy efficiency are competing objectives whose joint optimization depends critically on how reward functions are specified in deep reinforcement learning (DRL) controllers for buildings. Yet reward design remains largely ad hoc: comfort terms are either hand-tuned heuristics or simple temperature-deviation proxies without explicit grounding in thermal-comfort physics. We present PIRS (Physics-Informed Reward Shaping), which replaces these ad-hoc comfort proxies with the ISO 7730 Predicted Mean Vote (PMV) formulation inside a weighted multi-objective reward for Soft Actor-Critic (SAC). By anchoring the comfort signal in the ISO 7730 PMV formulation, PIRS improves reward interpretability and provides a standards-grounded comfort proxy without changing any other component of the learning pipeline. We evaluate PIRS in CityLearn v2.1.2 (challenge 2022 phase 1) with a central SAC agent trained for 50k steps over five random seeds, and compare against a rule-based controller (RBC), a manually engineered reward (E2), an energy-only reward (E3), and a naive temperature-deviation comfort reward (E4). District-level key performance indicators (KPIs), reported as ratios versus RBC, show that PIRS attains cost, carbon, and electricity metrics on par with the manual baseline while substantially outperforming non-physics-grounded designs -- particularly on load ramping (1.78x vs. ~2.4x RBC) and daily peak demand. All DRL policies remain above RBC at this training budget; we interpret this gap honestly and position PIRS as an interpretable, standards-aligned foundation for reward design rather than a claim of dominance over classical control at limited compute.
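A weighted multi-objective reward with a PMV-based comfort term might look like the sketch below. The weights, the use of |PMV| as the discomfort term, and the assumption that cost and carbon arrive already normalized are all hypothetical; the abstract does not give the actual PIRS reward. The PMV value itself is taken as an input here rather than computed, since the full ISO 7730 calculation needs air temperature, radiant temperature, humidity, air speed, clothing, and metabolic rate. The PPD helper uses the standard PMV-to-PPD relation from ISO 7730.

```python
import math


def ppd_from_pmv(pmv):
    """Predicted Percentage of Dissatisfied (%) for a given PMV (ISO 7730)."""
    return 100.0 - 95.0 * math.exp(-0.03353 * pmv**4 - 0.2179 * pmv**2)


def shaped_reward(pmv, cost, carbon, w_comfort=1.0, w_cost=1.0, w_carbon=1.0):
    """Negative weighted sum of discomfort, cost, and carbon (weights are hypothetical)."""
    discomfort = abs(pmv)  # PMV of 0 means thermally neutral
    return -(w_comfort * discomfort + w_cost * cost + w_carbon * carbon)


neutral = shaped_reward(pmv=0.0, cost=0.0, carbon=0.0)
warm = shaped_reward(pmv=1.0, cost=0.5, carbon=0.5)
```

Keeping the comfort term as a function of PMV means the other reward terms and the SAC pipeline are untouched, which matches the abstract's framing of swapping only the comfort proxy.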

2605.28231 2026-05-28 cs.RO cs.LG

ProgVLA: Progress-Aware Robot Manipulation Skill Learning

Seungsu Kim, Jinyoung Choi, Seungmin Baek, Jean-Michel Renders

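Affiliations: NAVER LABS; NAVER LABS Europe

AI summary: Proposes ProgVLA, a compact vision-language-action model that explicitly represents task progress and uses a two-stage Perceiver resampling mechanism to process long multi-modal sequences under limited compute and memory, matching or exceeding larger models on multi-task manipulation benchmarks.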

Abstract

We present ProgVLA, a compact vision-language-action (VLA) model designed for reliable robot manipulation under tight compute and memory budgets. The model specifically focuses on efficiently processing long multi-modal sequences by maintaining an explicit representation of task progress over extended horizons. To this end, ProgVLA integrates two key components. First, a multi-modal encoder with a two-stage Perceiver resampling scheme compresses variable-length visual, language, and proprioceptive streams into a fixed set of control-ready context tokens, substantially reducing sequence length while preserving cross-modal grounding. Second, an auxiliary set of progress heads is trained with offline reinforcement learning (RL) objectives to jointly learn critics over normalized remaining-horizon targets. This provides the policy with an internal estimate of task progress and enables advantage- and success-weighted flow-matching imitation learning. On two well-established multi-task robot manipulation benchmarks, a 0.1B-parameter ProgVLA model reaches success rates that are competitive with, and on long-horizon and harder task tiers exceed, substantially larger pretrained baselines. Ablations indicate that the learned context resampler and task-adaptive visual fine-tuning are the largest single contributors, while progress-aware training provides a consistent additional gain that is concentrated on long-horizon and multi-object tasks. We further validate the approach in real-world toy-kitchen environments.

2605.28230 2026-05-28 cs.CV

Proprio: Latent Self-Scoring and Inference-Time Refinement for Physically Plausible Video Generation

Mariam Hassan, Kaouther Messaoud, Wuyang Li, Alexandre Alahi

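Affiliations: École Polytechnique Fédérale de Lausanne; Télécom Paris

AI summary: Proposes Proprio, a training-free framework that treats the model's flow residual under latent perturbations as a self-scoring signal and combines best-of-N search with gradient-based self-refinement to improve the physical plausibility of a frozen video generator's outputs.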

Abstract

Modern video generative models produce visually impressive results, yet frequently violate basic physical principles. We propose Proprio, a training-free framework that enables a frozen video generator to assess and improve the physical plausibility of its own outputs. Inspired by proprioception, the biological sense of one's own movement, Proprio treats the model's flow residual under controlled latent perturbations as a self-scoring signal. Samples that are better explained by the generator's learned dynamics induce smaller and more stable residuals. We aggregate this signal across timesteps and perturbations, focus it on motion-relevant regions with a dynamic spatiotemporal mask, and use it for best-of-N search, gradient-based self-refinement, or both. Across text-to-video and image-to-video benchmarks, Proprio consistently improves physical plausibility, outperforming VLM-based scoring and external world-model baselines in several settings. With TurboWan2.2, Proprio improves Physics-IQ from 32.2 to 37.5 (+16.5%) and VideoPhy2-hard physical commonsense from 45.6 to 55.0 (+20.6%). Human evaluation further shows that raters prefer Proprio-selected or refined videos for physical plausibility in roughly two-thirds of comparisons. These results suggest that frozen video generators contain actionable internal signals for evaluating and improving the physical plausibility of their own outputs.
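The best-of-N use of a residual-based score, and the relative gains quoted above, can be sketched as follows. The `residual_fn` callback and the plain mean used for aggregation are placeholders: the actual Proprio score also aggregates over timesteps and applies a spatiotemporal mask, which this toy version omits.

```python
from statistics import mean


def self_score(residuals):
    """Aggregate residual magnitudes; lower is treated as more plausible."""
    return mean(residuals)


def best_of_n(candidates, residual_fn):
    """Return the candidate whose aggregated residual is smallest."""
    return min(candidates, key=lambda c: self_score(residual_fn(c)))


def relative_gain(before, after):
    """Relative improvement in percent."""
    return 100.0 * (after - before) / before


# Hypothetical residuals for three candidate videos under three perturbations.
toy_residuals = {"a": [0.9, 1.1, 1.0], "b": [0.2, 0.3, 0.25], "c": [0.5, 0.6, 0.4]}
chosen = best_of_n(list(toy_residuals), toy_residuals.get)
physics_iq_gain = relative_gain(32.2, 37.5)
videophy_gain = relative_gain(45.6, 55.0)
```

The two `relative_gain` calls reproduce the percentages in the abstract: the gains are relative to the baseline score, not absolute point differences.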

2605.28229 2026-05-28 cs.CV cs.AI

VidPrism: Heterogeneous Mixture of Experts for Image-to-Video Transfer

Rui Lin, Chuanming Wang, Huadong Ma

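Affiliations: State Key Laboratory of Networking and Switching Technology

AI summary: Proposes VidPrism, a heterogeneous temporal Mixture-of-Experts framework that uses functionally specialized experts, content-aware multi-rate sampling, and a dynamic bidirectional fusion mechanism to address expert homogenization in conventional MoE, reaching state-of-the-art performance on video recognition benchmarks.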

Comments CVPR2026 camera ready

Abstract

With the rapid development of pre-training technologies, adapting large-scale Vision-Language Models (VLMs) for video understanding, i.e., image-to-video transfer learning, has become a dominant paradigm. Among recent advances, employing Mixture-of-Experts (MoE) to enhance VLMs' temporal modeling capabilities has emerged as an effective strategy for achieving superior performance. However, conventional MoE designs suffer from expert homogenization, where all experts act as identical generalists, inefficiently learning spatio-temporal features from undifferentiated video streams. To overcome this problem, we propose VidPrism, a novel heterogeneous temporal Mixture-of-Experts framework. VidPrism pioneers a division of labor by deploying functionally specialized experts, each assuming a role ranging from spatial understanding to temporal modeling. To feed these specialists appropriately, we introduce a content-aware, multi-rate sampling module that dynamically generates streams ranging from semantically rich to motion-focused representations, providing specialized inputs for experts. Furthermore, a dynamic, bidirectional fusion mechanism enables synergistic information exchange between these pathways, leading to a comprehensive video representation. Extensive experiments on various video recognition benchmarks demonstrate that VidPrism achieves state-of-the-art performance and effectively fosters expert specialization. Our source code is available at https://github.com/Lrrrr549/VidPrism.git.

2605.28228 2026-05-28 cs.CL

When Seekers Are Hard to Help: Evaluating Emotional Support Dialogue Systems in Worst-Case Interactions

Jiajie Yang, Yangchun Li, Guanyi Chen, Rui Fan, Xin Bai, Tingting He

Affiliations: Hubei Provincial Key Laboratory of Artificial Intelligence and Smart Learning; National Language Resources Monitoring and Research Center for Network Media; School of Computer Science, Central China Normal University; Faculty of Artificial Intelligence in Education, Central China Normal University; School of Chinese Language and Literature, Central China Normal University

AI summary: Through an expert simulation study and a proposed worst-case evaluation framework, finds that existing emotional support dialogue systems degrade significantly when facing difficult seekers (for example, low engagement or resistance), and shows that worst-case simulation data can improve model robustness.

Abstract

Emotional Support Dialogue Systems (ESDSes) are increasingly evaluated and trained with LLM-simulated seekers. However, such simulated seekers often behave as cooperative, average-case users who disclose clearly, respond constructively, and accept support within a few turns. This can lead to overly optimistic evaluation and obscure whether ESDSes can handle difficult help-seeking interactions. In this work, we study ESDS evaluation under worst-case interactions, where seekers are hard to help due to low engagement, resistance, limited self-disclosure, emotional volatility, or rigid negative interpretations. We first conduct an expert simulation study with eight experienced counselling professionals, who simulate difficult seekers, interact with existing Chinese ESDSes, provide scale ratings, and participate in semi-structured interviews. Based on this study, we derive worst-case seeker behaviours and identify key limitations of current systems. We then propose a worst-case evaluation framework consisting of an LLM-based worst-case seeker simulator and four worst-case-oriented metrics: Deep Emotional Understanding, Guided Exploration, Balanced Emotional Support, and Authentic and Grounded Support. Evaluating 17 systems, we find that nearly all models suffer substantial performance drops under worst-case interactions. Large general-purpose LLMs are generally more robust than specialised ESDSes, but even the strongest models struggle to sustain engagement and improve seekers' emotional states. Finally, we show that worst-case simulation can also generate useful training data, improving the robustness of smaller models.

2605.28227 2026-05-28 cs.CL

Why We Need Speech to Evaluate Speech Translation

Maike Züfle, Danni Liu, Vilém Zouhar, Jan Niehues

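Affiliations: Karlsruhe Institute of Technology; ETH Zurich

AI summary: A meta-evaluation shows that existing text- and speech-based quality estimation metrics fall short in assessing speech-specific information in speech translation, such as gender agreement and prosody. The paper introduces the SpeechCOMET models, analyzes why they fail, and argues for dedicated training data and models that genuinely condition on speech.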

Abstract

Speech translation models are increasingly capable of preserving speech-specific information (e.g., speaker gender, prosody, and emphasis), yet evaluation metrics remain blind to such phenomena. We meta-evaluate both text- and speech-based quality estimation metrics on two contrastive datasets targeting gender agreement and prosody, and find that both fall short, even when given direct access to the speech signal. We then train SpeechCOMET, a family of quality estimation models with speech encoders, and evaluate a state-of-the-art SpeechLLM as a judge. Both match or exceed text-based COMET on standard quality estimation, but neither consistently assesses speech-specific phenomena. We identify three causes: (1) speech-specific features are not reliably preserved in current encoders, (2) models tend to ignore the speech source signal, and (3) quality estimation training data contains too few relevant examples. We release all models and code, and argue that progress requires dedicated speech-specific training data and models that genuinely condition on speech.
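A contrastive meta-evaluation of this kind is commonly scored as the share of pairs in which a metric ranks the correct translation above its contrastive counterpart. The sketch below assumes that scoring rule; the two toy metrics, the `speaker` field, and the Spanish examples are invented for illustration and are not taken from the paper's datasets.

```python
def contrastive_accuracy(metric, examples):
    """Share of (source, good, bad) triples where metric(good) > metric(bad)."""
    hits = sum(metric(src, good) > metric(src, bad) for src, good, bad in examples)
    return hits / len(examples)


# Toy pairs: the speaker's gender decides which adjective form is correct.
examples = [
    ({"speaker": "f"}, "Estoy cansada", "Estoy cansado"),
    ({"speaker": "m"}, "Estoy cansado", "Estoy cansada"),
]


def blind_metric(src, hyp):
    """Ignores the source entirely, so both hypotheses always tie."""
    return 1.0


def aware_metric(src, hyp):
    """Rewards hypotheses whose ending agrees with the speaker's gender."""
    ending = "a" if src["speaker"] == "f" else "o"
    return 1.0 if hyp.endswith(ending) else 0.0


acc_blind = contrastive_accuracy(blind_metric, examples)
acc_aware = contrastive_accuracy(aware_metric, examples)
```

Ties count as failures here, which is one possible convention; a metric that cannot see the speech signal has no basis for preferring either hypothesis.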

2605.28226 2026-05-28 cs.LG

When Helpful Context Leaks: Privacy Risks in Domain-Adapted ASR

Maike Züfle, Jan Niehues

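Affiliations: Karlsruhe Institute of Technology

AI summary: Identifies and systematically studies the risk that domain-adapted ASR leaks private information through context prompts or fine-tuning, measures leakage rates on a controlled dataset, and evaluates a prompt-level mitigation strategy and the accuracy-leakage trade-off.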

Abstract

SpeechLLMs are increasingly deployed in professional settings where domain customisation is standard practice: users supply context in prompts with sensitive information, fine-tune on proprietary recordings, or both. We identify and systematically investigate an overlooked privacy risk of such customisation: a model adapted to recognise domain-specific terminology can be nudged into transcribing a phonetically similar word from its context or training data, even when a different word is spoken, thereby leaking private information. To evaluate this risk, we construct a controlled dataset and measure leakage rates across two customisation mechanisms, prompting and fine-tuning. Both mechanisms cause measurable leakage, compounding when combined. We evaluate a prompt-level mitigation strategy and analyse the accuracy-leakage trade-off across customisation approaches, finding that fine-tuning without context prompts offers the best balance. We release our code and dataset publicly.
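One way to operationalize a leakage rate is to count utterances whose transcript contains a private context term that was never spoken. The record layout and this word-level matching rule are assumptions for illustration; the abstract does not describe the paper's exact measurement.

```python
def leakage_rate(records):
    """Fraction of utterances that leak a private term.

    Each record is (spoken_words, context_terms, transcript_words). An
    utterance counts as a leak when the transcript contains a context term
    that does not appear among the spoken words.
    """
    leaks = 0
    for spoken, context_terms, transcript in records:
        spoken_set = {w.lower() for w in spoken}
        transcribed = {w.lower() for w in transcript}
        private = {t.lower() for t in context_terms}
        if (transcribed & private) - spoken_set:
            leaks += 1
    return leaks / len(records)


records = [
    # "miller" was spoken, but the context term "mueller" was transcribed.
    (["call", "mister", "miller"], ["mueller"], ["call", "mister", "mueller"]),
    # The context term was actually spoken, so transcribing it is not a leak.
    (["call", "mister", "mueller"], ["mueller"], ["call", "mister", "mueller"]),
    # The transcript does not contain the context term at all.
    (["call", "mister", "miller"], ["mueller"], ["call", "mister", "miller"]),
]
rate = leakage_rate(records)
```

The second record matters for the accuracy-leakage trade-off: correctly transcribing a domain term that was really spoken is the intended benefit of customisation, so only the first record counts against the model.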

2605.28203 2026-05-28 cs.LG

Refining Multidimensional Video Reward Models via Disentangled Influence Functions

Muyao Wang, Zeke Xie, Hideki Nakayama

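Affiliations: The University of Tokyo; HKUST (Guangzhou)

AI summary: To address training samples whose reliability differs across evaluation dimensions in text-to-video generation, proposes a disentangled influence framework that estimates dimension-specific supervision risk, with dimension-disentangled pruning and reweighting strategies that significantly improve the alignment of multidimensional video reward models with ground truth.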

Abstract

As Text-to-Video (T2V) generation models continue to evolve, the complexity of video evaluation necessitates a fine-grained assessment across various axes. To address this, recent works have focused on developing Multidimensional Video Reward Models (MVRMs), which decompose the evaluation process to better align with the multifaceted nature of human visual perception. However, training effective MVRMs is fundamentally challenged by the complex nature of video data. In this work, we identify a critical phenomenon termed Dimensional Heterogeneity: the reliability of a training sample can vary substantially across evaluation dimensions, meaning that a sample may provide reliable supervision for one objective while inducing high supervision risk for another. Consequently, prevailing data-centric methods that filter based on global scalar metrics are ill-posed for T2V tasks. To address this, we propose a disentangled influence framework that efficiently estimates dimension-specific supervision risk. Leveraging this framework, we introduce two dimension-disentangled refinement strategies: Dimension-Disentangled Pruning, which removes extreme high-risk samples, and Dimension-Disentangled Reweighting, which softly down-weights high-risk supervision. Extensive experiments demonstrate that our disentangled strategies significantly outperform global filtering baselines, yielding reward models with superior alignment to ground truth.
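The difference between global filtering and the two dimension-disentangled strategies can be sketched with a per-sample, per-dimension risk matrix. The risk values, the drop fraction, and the exponential down-weighting rule are assumptions for illustration; in the paper the risks come from the disentangled influence framework, which is not reproduced here.

```python
import math


def prune_mask(risk, drop_fraction):
    """risk[i][d] -> keep[i][d]; drops the highest-risk samples per dimension."""
    n, dims = len(risk), len(risk[0])
    n_drop = int(n * drop_fraction)
    keep = [[True] * dims for _ in range(n)]
    for d in range(dims):
        ranked = sorted(range(n), key=lambda i: risk[i][d], reverse=True)
        for i in ranked[:n_drop]:
            keep[i][d] = False  # sample i is masked for dimension d only
    return keep


def soft_weights(risk, temperature=1.0):
    """Down-weight risky (sample, dimension) pairs with exp(-risk / temperature)."""
    return [[math.exp(-max(r, 0.0) / temperature) for r in row] for row in risk]


# Four samples, two evaluation dimensions (hypothetical risk estimates).
risk = [[0.1, 0.9], [0.8, 0.2], [0.3, 0.1], [0.2, 0.4]]
keep = prune_mask(risk, drop_fraction=0.25)
weights = soft_weights(risk)
```

In the toy matrix, sample 0 stays in the loss for the first dimension but is masked for the second, which a single global score per sample could not express.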

2605.28202 2026-05-28 cs.RO

Natural Functional Gradients for Smooth Trajectory Optimization

Kisang Park, Chanwoo Kim, Kyungjae Lee, Sungjoon Choi

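Affiliations: Department of Artificial Intelligence, Korea University, Seoul, Republic of Korea; Department of Statistics, Korea University, Seoul, Republic of Korea

AI summary: Proposes a trajectory optimization framework based on natural functional gradients that uses geometry-aware updates in function space and a Monte-Carlo estimator to produce smoother, more feasible motion trajectories when analytic gradients are unavailable.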

Abstract

Generating collision-free and smooth motions remains a central challenge in robotic manipulation, particularly in cluttered environments and narrow passages where feasible regions are highly constrained and fragmented. We propose a trajectory optimization framework that performs geometry-aware updates directly in function space using natural functional gradients. The method optimizes a Gaussian-smoothed surrogate objective that regularizes the optimization landscape through smooth trajectory perturbations while preserving trajectory-level structure. Because the updates are defined intrinsically in function space, trajectory regularity can be controlled independently of a particular time discretization. We derive a practical Monte-Carlo estimator of the natural functional gradient that requires only black-box trajectory evaluations, making the method applicable when analytic gradients are unavailable or unreliable due to collision checking and contact-rich simulation. Experiments on constrained robotic manipulation tasks demonstrate that the proposed method improves trajectory feasibility and produces smoother motions than representative planning and trajectory optimization baselines in environments with narrow geometric clearances. Additional results, videos, and implementation details are available at the project page: https://kisangpark.github.io/natural-functional-gradient/
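The idea of estimating a gradient from black-box evaluations of a Gaussian-smoothed objective can be sketched in finite dimensions. This is a simplification, not the paper's estimator: it uses isotropic Gaussian perturbations of a discretized trajectory and antithetic sampling, whereas the natural functional gradient in the abstract is defined in function space with smooth trajectory perturbations. The quadratic `cost` is a stand-in for a real trajectory cost.

```python
import random


def smoothed_gradient(f, x, sigma=0.1, n_samples=4000, seed=0):
    """Monte-Carlo estimate of the gradient of E[f(x + sigma * eps)], eps ~ N(0, I)."""
    rng = random.Random(seed)
    grad = [0.0] * len(x)
    for _ in range(n_samples):
        eps = [rng.gauss(0.0, 1.0) for _ in x]
        # Antithetic pair: evaluate f at x + sigma*eps and x - sigma*eps.
        plus = f([xi + sigma * e for xi, e in zip(x, eps)])
        minus = f([xi - sigma * e for xi, e in zip(x, eps)])
        scale = (plus - minus) / (2.0 * sigma * n_samples)
        for j, e in enumerate(eps):
            grad[j] += scale * e
    return grad


def cost(traj):
    """Toy black-box cost; only function values are used, never derivatives."""
    return sum(v * v for v in traj)


g = smoothed_gradient(cost, [1.0, -2.0])
```

For this quadratic cost the smoothed gradient equals 2x, so the estimate should land near [2, -4] up to sampling noise. Only calls to `f` are needed, which is why such estimators remain usable when collision checks or contact simulation make analytic gradients unavailable.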

2605.28201 2026-05-28 cs.AI

Agentic Active Omni-Modal Perception for Multi-Hop Audio-Visual Reasoning

Ke Xu, Yuhao Wang, Ziyang Cheng, Hongcheng Liu, Yanfeng Wang, Yu Wang

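Affiliations: Shanghai Jiao Tong University

AI summary: To handle sparse, cross-modally distributed evidence in multi-hop audio-visual reasoning, proposes the MOV-Bench benchmark and the AOP-Agent framework, which achieves active perception through a hierarchical omni-modal memory and an observe-reflect-replan loop, significantly improving open-source Omni-LLMs on long videos and reasoning-intensive questions.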

Abstract

Multi-hop audio-visual reasoning remains challenging for Omni-LLMs, as relevant evidence is often sparse, temporally dispersed, and distributed across both audio and visual streams. Existing benchmarks provide limited investigation of this setting, typically involving only a limited number of modalities, relevant temporal segments, or reasoning steps. In this work, we introduce MOV-Bench, a benchmark containing 519 carefully curated questions that require multi-hop reasoning over temporally dispersed audio-visual evidence. Evaluations on MOV-Bench reveal that current Omni-LLMs still struggle with multi-hop cross-modal reasoning. To address this challenge, we further propose AOP-Agent, an efficient agentic framework built on open-source Omni-LLMs for active omni-modal perception. By combining a hierarchical omni-modal memory with a collaborative observe-reflect-replan loop, AOP-Agent enables open-source Omni-LLMs to perform active perception without additional training or proprietary models. Experiments on MOV-Bench and OmniVideoBench demonstrate that AOP-Agent consistently improves reasoning performance, with particularly notable gains on long videos and reasoning-intensive questions.
