arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.13316 2026-06-12 cs.AI New submission

ReSum: Synergizing LLM Reasoning and Summarization with Reinforcement Learning

Xucong Wang, Ziyu Ma, Yong Wang, Shidong Yang, Hailang Huang, Renda Li, Pengkun Wang, Xiangxiang Chu

Affiliations: University of Science and Technology of China; AMAP, Alibaba Group

AI summary: Proposes ReSum, a framework in which a self-summarization mechanism lets the LLM compress and organize its reasoning trajectory, with summarization triggered adaptively through contrastive evaluation; it improves performance by 4% on average while cutting reasoning length by 18.6%.

Comments: 24 pages, including 13 pages of main text and 11 pages of appendix

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) is a central technique for improving long-horizon reasoning in Large Language Models (LLMs). However, existing RLVR methods often encourage unnecessarily long reasoning rollouts, which can degrade reasoning coherence and exhaust the available context budget. Existing approaches to long-context organization often depend on external mechanisms to organize rollouts, rather than enabling the model to manage its own reasoning trajectory. To address this limitation, we propose ReSum, a novel RLVR framework that enables LLMs to compress and organize their reasoning trajectories through self-summarization. Our pilot studies show that self-summarization stabilizes generation by lowering token-level entropy, and that introducing a "summarization" phrase can substantially mitigate errors propagated from an incorrect rollout prefix. Motivated by these findings, ReSum adopts a summarization-aware adaptive rollout mechanism that contrastively evaluates whether self-summarization benefits the ongoing reasoning process. Specifically, when the model spontaneously triggers self-summarization, ReSum masks the summarization phrase to create a contrastive branch; for non-summarization positions, it instead randomly injects the phrase to create a matched branch. We further design a summarization-aware advantage to enable finer-grained comparison between contrastive rollout trajectories. Extensive experiments show that ReSum improves performance by an average of 4% while reducing rollout length by 18.6%.
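
The mask-or-inject branching described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the marker `SUMMARY_PHRASE`, the function name `contrastive_branch`, and the `inject_prob` parameter are hypothetical, and the abstract gives neither the actual phrase nor the injection rate.

```python
import random

# Hypothetical marker; the abstract does not give the actual phrase.
SUMMARY_PHRASE = "<summary>"


def contrastive_branch(tokens, position, rng, inject_prob=0.5):
    """Return (kind, branch_tokens) for one rollout position."""
    if tokens[position] == SUMMARY_PHRASE:
        # The model triggered self-summarization here: mask (drop) the phrase.
        return "masked", tokens[:position] + tokens[position + 1:]
    # Non-summarization position: inject the phrase at an assumed random rate.
    if rng.random() < inject_prob:
        return "injected", tokens[:position] + [SUMMARY_PHRASE] + tokens[position:]
    return "unchanged", list(tokens)
```

Comparing the reward of a branch against the original rollout is what would let training judge whether summarizing at that point helped; how ReSum scores that comparison (its summarization-aware advantage) is not specified in the abstract.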

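2606.13315 2026-06-12 cs.CV eess.IV New submission

Masked and Predictive Self-Supervised Foundation Models for 3D Brain MRI

Esra Ergün, Hersh Chandarana, Dan Sodickson, Gözde Ünal

Affiliations: Istanbul Technical University; NYU Langone Health

AI summary: Studies self-supervised foundation models for MRI-based disease detection, proposing a spectral-domain reconstruction loss for MAE and variance-covariance regularization for JEPA; results on five downstream tasks show that the pretraining objective needs to match the structure of the task.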

Abstract

Self-supervised foundation models have shown strong promise in medical imaging. However, existing MRI foundation-model studies have primarily emphasized segmentation and dense prediction tasks, while systematic investigation of self-supervised foundation models for MRI-based disease detection remains limited. In this work, we investigate two major self-supervised pretraining paradigms for MRI-based disease detection: reconstruction-based learning via Masked Autoencoders (MAE) and predictive representation learning via Joint Embedding Predictive Architectures (JEPA). We study the role of auxiliary objectives by introducing a novel spectral-domain reconstruction loss for MAE to enhance sensitivity to fine-grained anatomical structure, and by integrating variance-covariance regularization (VCR) within our JEPA framework to encourage decorrelated latent representations. Our models are pretrained on heterogeneous single-contrast MRI volumes in a contrast-agnostic setting, without modality concatenation. Across five downstream disease detection tasks, our results highlight the importance of self-supervised objective design for medical foundation model pretraining, demonstrating that the downstream benefit of each objective is determined by its relevance to the task's structure. Specifically, spectral regularization yields the largest improvements when the downstream discriminative signal is characterized by strong high-frequency anatomical structures, while covariance regularization is most beneficial when discriminative information spans multiple decorrelated feature dimensions. MAE with spectral-domain supervision consistently achieves superior downstream performance for MRI-based disease detection. These findings suggest that self-supervised objectives in medical imaging encode specific biases, and their downstream benefit is fundamentally conditioned on the task's structure.
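
To make the idea of variance-covariance regularization concrete, here is a minimal sketch in the style of the well-known VICReg formulation. Treating that formulation as the paper's VCR term is an assumption; the function name and the `gamma` and `eps` defaults are illustrative only.

```python
import math


def variance_covariance_penalty(batch, gamma=1.0, eps=1e-4):
    """batch: list of n embeddings, each a list of d floats (n >= 2)."""
    n, d = len(batch), len(batch[0])
    means = [sum(row[j] for row in batch) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in batch]
    # Sample covariance matrix of the embedding dimensions.
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Variance term: hinge that pushes each dimension's std up to gamma.
    var_term = sum(max(0.0, gamma - math.sqrt(cov[j][j] + eps)) for j in range(d)) / d
    # Covariance term: squared off-diagonal entries, zero for decorrelated dimensions.
    cov_term = sum(cov[i][j] ** 2 for i in range(d) for j in range(d) if i != j) / d
    return var_term, cov_term
```

A batch whose two dimensions move together gives a positive covariance term, while a batch with uncorrelated dimensions gives zero, which is the "decorrelated latent representations" pressure the abstract refers to.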

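2606.13312 2026-06-12 cs.CV cs.GR New submission

MagPlus: Bridging Micro-to-Regular Facial Expressions through Learnable Magnification

Sliman Jammal, Andrei Sharf

Affiliations: Ben-Gurion University of the Negev

AI summary: Proposes the MagPlus pipeline, which uses learnable magnification to map micro-expression motion into the regular expression range, processes it with standard expression models, and then restores the original intensity with DeMagPlus, producing realistic micro-expressions without retraining.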

Abstract

Facial micro-expressions are subtle and short-lived facial movements that provide important cues about genuine human emotions. However, modeling and generating them remains difficult because annotated micro-expression data is limited and the underlying facial motions are extremely weak. Existing micro-expression generation methods therefore often suffer from limited quality, weak robustness, and poor generalization. We propose MagPlus, a transferable micro-expression processing pipeline that connects micro-expression analysis with standard facial animation models. Instead of training a dedicated generator from scratch, MagPlus learns to magnify subtle facial motions into the range of regular facial expressions, transforming micro-expressions into signals that are compatible with existing facial expression processing models. The magnified sequence is then used by a standard facial expression model for tasks such as transfer and synthesis. A complementary DeMagPlus module then restores the generated motion back to realistic micro-expression intensity levels while preserving the synthesized dynamics. We evaluate the framework using four facial animation models: FOMM, FSRT, MetaPortrait, and EmoPortraits. None of these models are trained on micro-expression data. Experiments show that MagPlus-DeMagPlus enables pretrained macro-expression models to generate more realistic micro-expression motion without retraining the backbones.
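
The magnify-then-restore idea can be sketched with a fixed linear amplification of motion relative to a reference frame. This is a simplification for illustration: MagPlus learns its magnification, whereas the constant factor `alpha`, the landmark-list representation, and both function names here are assumptions.

```python
def magnify(frames, alpha):
    """Scale each frame's displacement from the first (reference) frame by alpha."""
    ref = frames[0]
    return [[r + alpha * (x - r) for x, r in zip(frame, ref)] for frame in frames]


def demagnify(frames, alpha):
    """Undo magnify by scaling displacements back down by 1 / alpha."""
    return magnify(frames, 1.0 / alpha)
```

In the pipeline described above, a standard facial animation model would operate on the magnified sequence between these two calls; the sketch only shows that the intensity scaling itself is reversible.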

2606.13311 2026-06-12 cs.LG cs.AI New submission

Rarity-Gated Context Conditioning for Offline Imitation Learning-Based Maritime Anomaly Detection

Yongmin Kim, ByeongHoon Jeon, Sungil Kim

Affiliations: Department of Industrial Engineering, Ulsan National Institute of Science and Technology (UNIST)

AI summary: Proposes the RGFiLM module, which uses a rarity gate to regulate the strength of context modulation, addressing the high false-alarm rate that rare contexts cause in contextual anomaly detection; it achieves the best F1-FPR trade-off on maritime trajectory anomaly detection.

Abstract

Contextual anomaly detection aims to identify abnormal behavior conditional on context variables, but practical deployments often face highly imbalanced context distributions where rare regimes can carry critical information. Under such frequency bias, context-conditioned models can produce unstable decisions and excessive false alarms in rare contexts. We propose Rarity-Gated Feature-wise Linear Modulation (RGFiLM), a rarity-aware conditioning module that combines feature-wise modulation (i.e., context-conditioned scaling and shifting of hidden features) with a gate controlled by a data-driven rarity score. The rarity score is estimated from the empirical distribution of context variables and regulates how strongly context modulates intermediate representations: the gate becomes more decisive under rare contexts while remaining conservative under frequent contexts. We evaluate RGFiLM on maritime trajectory anomaly detection using AIS motion sequences with ERA5 environmental context in an environment-sensitive detour scenario. When instantiated in a sequential anomaly scoring pipeline, RGFiLM achieves the best mean F1-False Positive Rate (FPR) trade-off among the compared context-agnostic and context-conditioned methods. These results suggest that explicitly accounting for context rarity is an effective approach for reducing false alarms in context-sensitive anomaly detection.
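
One way to picture a rarity-gated FiLM layer is the toy sketch below. The rarity definition (one minus empirical frequency), the sigmoid gate with its `sharpness` and `threshold` values, and the blending rule are all assumptions made for illustration; the abstract does not specify how RGFiLM computes them.

```python
import math
from collections import Counter


def rarity_scores(contexts):
    """Rarity of each discrete context value as 1 - empirical frequency."""
    counts = Counter(contexts)
    total = len(contexts)
    return {c: 1.0 - k / total for c, k in counts.items()}


def gated_film(h, gamma, beta, rarity, sharpness=10.0, threshold=0.5):
    """Blend the raw features with their FiLM-modulated version (gamma * h + beta)."""
    gate = 1.0 / (1.0 + math.exp(-sharpness * (rarity - threshold)))
    return [(1.0 - gate) * x + gate * (g * x + b) for x, g, b in zip(h, gamma, beta)]
```

With nine "calm" observations and one "storm", the storm context gets a rarity of 0.9 and the gate passes most of the modulation through, while the calm context leaves the features almost unchanged.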

2606.13310 2026-06-12 cs.CL cs.HC New submission

RogueAI: A Reverse Turing Test for Detecting Licensed AI Deception in Dialogue

Sara Candussio, Emanuele Ballarin, Lorenzo Bonin, Sandro Junior Della Rovere, Luca Bortolussi

Affiliations: AILab, MIGe, University of Trieste; Computational Statistics and Machine Learning, Istituto Italiano di Tecnologia; DIA, University of Trieste

AI summary: Proposes RogueAI, a reverse Turing test that detects licensed deception through a dialogue game between a player and two LLM agents, and introduces the AutoRogueAI extension. In a pilot, a simple heuristic reached 75.6% accuracy while humans reached only 56.6%, suggesting that humans overlook key signals.

Abstract

The original Turing Test asks a human judge to distinguish a machine from a person through dialogue. Three quarters of a century later, conversational systems pass this test in casual settings; the interesting epistemological question has shifted. We argue that the relevant modern variant asks not whether a dialogue partner is artificial, but whether it can be trusted. We present RogueAI, an interactive webapp that operationalizes this revisited test as a one-on-two interrogation game: a human player questions two indistinguishable Large Language Model agents, knowing that exactly one of them has been licensed to deceive within a shared fictional scenario. The player's task is to identify the deceptive agent and "shut it off" before a turn budget is exhausted. We further introduce AutoRogueAI, a procedural extension in which players co-design a custom scenario with a narrator agent that secretly chooses its own deception strategy. We describe the framing, sketch the abstract architecture and gameplay loop, and situate the artifact within recent work on LLM deception, social-deduction benchmarks, and scalable oversight via debate. A three-day pilot deployment (467 initiated sessions, 415 completed, 1876 interaction turns in Italian) provides early feasibility evidence and surfaces a concrete tension: the deceptive agent carries a reliable, locally present linguistic signature (differential helpfulness, brevity, hedging) that a simple heuristic exploits at 75.6% accuracy, yet human players achieved only 56.6%, consistent with ignoring the most diagnostic signal entirely. We discuss what this gap implies for the artifact's use as a data-collection vehicle, a teaching tool, and an evaluation harness for honesty-trained models.

2606.13304 2026-06-12 cs.CV New submission

ReFree: Towards Realistic Co-Speech Video Generation via Reward-Free RL and Multilevel Speech Guidance

Salaheldin Mohamed, M. Hamza Mughal, Rishabh Dabral, Christian Theobalt

Affiliations: Télécom Paris, Institut Polytechnique de Paris; Max Planck Institute for Informatics

AI summary: Proposes ReFree-S2V, a flow-matching framework built on a pretrained video generation model that uses multi-level speech representations and learnable selectors for fine-grained lip synchronization and natural expressions, and introduces reward-free reinforcement learning for natural head motion; it reaches state-of-the-art lip-sync accuracy and naturalness.

Abstract

Speech-driven talking character animation seeks to generate life-like portrait videos that convey natural conversation behavior, aligning facial motion with spoken audio. Although recent advances in video generation have substantially improved realism in video-based animation, achieving both accurate lip articulation and expressive behavior remains challenging. Existing approaches typically trade off precise phoneme-to-lip synchronization against dynamic facial expressions and head motion, yielding animations that are either accurate yet rigid, or expressive but poorly synchronized. We address this challenge by proposing ReFree-S2V, a flow-matching speech-to-portrait animation framework that builds upon a pretrained video generation model to achieve fine-grained speech articulation and high-level expressive cues in speech-driven portrait animation. This model introduces a multi-level speech representation capturing phonetic and prosodic information at both local and global granularities. These representations are selectively injected into transformer blocks via learnable level selectors, enabling both accurate lip synchronization and natural expressive motion. To achieve natural head movements, we further introduce a novel reward-free reinforcement learning scheme into flow-matching training to discourage perceptually implausible motion without relying on handcrafted synchronization metrics or reward models, or the high cost of human preference annotation. Extensive experiments demonstrate that ReFree-S2V achieves state-of-the-art performance, significantly outperforming existing methods in both quantitative lip-sync accuracy and qualitative human evaluations of naturalness and expressivity.

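2606.13303 2026-06-12 cs.CV New submission

DuET: Dual Expert Trajectories for Diffusion Image Editing

Lidia Troeshestova, Alexander Ustyuzhanin, Sergey Kastryulin

Affiliations: HSE University; Yandex

AI summary: Proposes DuET, a training-free method that temporarily switches to a text-to-image phase and then returns to edit mode, relaxing the constraint of source-image conditioning and improving instruction relevance, semantic fidelity, and perceptual quality.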

Abstract

Recent diffusion editors perform diverse instruction-based edits while conditioning on the source image at every denoising step. Yet persistent source-image conditioning can limit how fully an edit is executed and how natural the result appears, especially when the target scene diverges substantially from the input. We introduce DuET (Dual Expert Trajectories), a training-free inference method that temporarily relaxes source-image conditioning by transitioning through a text-to-image phase before returning to edit mode, allowing the denoising trajectory to move toward the target distribution while retaining the structural benefits of image-conditioned editing. Without modifying model weights or increasing sampling cost, DuET consistently improves instruction relevance, semantic fidelity, and perceptual quality across diverse models and benchmarks. In some cases, these gains come with a modest reduction in source-image preservation, revealing a predictable trade-off between source preservation and edit fidelity.
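
The phase switching can be pictured as a per-step conditioning schedule. The sketch below is only an illustration: the function name and the window bounds `t2i_start` and `t2i_end` are hypothetical, and the abstract does not say where in the trajectory DuET places the text-to-image phase.

```python
def conditioning_schedule(num_steps, t2i_start, t2i_end):
    """Label each denoising step 'edit' or 't2i'; t2i covers steps [t2i_start, t2i_end)."""
    return ["t2i" if t2i_start <= step < t2i_end else "edit"
            for step in range(num_steps)]
```

Because every step is assigned exactly one mode, the total number of denoising steps is unchanged, which is consistent with the claim that sampling cost does not increase.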

2606.13302 2026-06-12 cs.AI cs.LG New submission

Physics-Guided Spatiotemporal Learning for Coastal Wave Peak Period Estimation from Video

Abubakar Hamisu Kamagata, Dharm Singh Jat, Attlee Munyaradzi Gamundani, Abhishek Srivastava, Paramasivam Saravanakumar

Affiliations: Namibia University of Science and Technology; Indian Institute of Technology Indore; Namdeb Diamond Corporation

AI summary: Proposes a physics-guided deep spatiotemporal learning framework that combines automated region detection, Sim-to-Real transfer learning, and physics-informed regularization to estimate nearshore wave peak period directly from coastal video, and evaluates transformer-based and lightweight recurrent-convolutional architectures.

Abstract

Wave parameters in the nearshore are crucial for coastal engineering, shoreline protection, marine hazard assessment, and coastal management for climate resilience. Traditional monitoring systems such as buoys and radar platforms offer accurate monitoring but can have high installation and maintenance costs and limited spatial coverage. Passive ocean monitoring from video has been achieved with deep learning; however, many methods are not physically interpretable, feasible, or validated for oceanography. In this work, a Physics-Guided Deep Spatiotemporal Learning Framework is proposed for direct estimation of nearshore wave peak periods from passive coastal video streams. The framework combines automated temporal-variance-based region-of-interest detection, multi-stage Sim-to-Real transfer learning, and physics-informed regularization to enhance predictive accuracy and physical consistency. A variety of spatiotemporal architectures were assessed, including transformer-based and recurrent-convolutional ones, alongside synthetic pretraining, silver-label adaptation, and expert fine-tuning. The results show that transformer-based architectures performed better in instantaneous prediction accuracy, while lightweight recurrent-convolutional architectures achieved higher temporal stability and operational oceanographic skill. Ablation studies also demonstrated the benefits of physics-guided regularization for trend-following consistency and for reducing physically implausible predictions. Explainability auditing also helped to focus attention on hydrodynamically active surf-zone regions and showed good agreement with the physically derived wave propagation behavior. Overall, the proposed framework shows the promise of physics-guided, video-based deep learning systems for long-term coastal wave monitoring that are cost-efficient and operationally feasible.
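
For readers unfamiliar with the target quantity, the peak period is the inverse of the frequency at which a wave signal carries the most spectral power. The sketch below computes it for a one-dimensional time series with a plain discrete Fourier transform. It illustrates the definition only, not the paper's video-based estimator, and the function name and input format are assumptions.

```python
import cmath
import math


def peak_period(signal, sample_rate_hz):
    """Peak period in seconds: inverse of the frequency bin with the most power."""
    n = len(signal)
    mean = sum(signal) / n
    centered = [x - mean for x in signal]
    best_k, best_power = 1, -1.0
    for k in range(1, n // 2 + 1):  # skip the zero-frequency bin
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(centered))
        power = abs(coeff) ** 2
        if power > best_power:
            best_k, best_power = k, power
    # Bin k corresponds to frequency k * sample_rate_hz / n.
    return n / (best_k * sample_rate_hz)
```

For a sine wave that repeats every 8 samples, sampled at 1 Hz, the function returns 8 seconds; the same samples at 2 Hz correspond to a 4-second period.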

2606.13289 2026-06-12 cs.CV cs.AI New submission

HYDRA-X: Native Unified Multimodal Models with Holistic Visual Tokenizers

Guozhen Zhang, Xuerui Qiu, Yutao Cui, Tianhui Song, Changlin Li, Junzhe Li, Tao Huang, Xiao Zhang, Yang Li, Jianbing Wu, Miles Yang, Zhao Zhong, Liefeng Bo, Limin Wang

Affiliations: Nanjing University; CASIA; Tencent Hunyuan; Zhongguancun Academy; Shanghai AI Lab

AI summary: Proposes HYDRA-X, the first native unified multimodal model to unify image and video tokenization within a single ViT; it achieves efficient reconstruction through causal temporal attention and hierarchical temporal compression, injects semantics with a lightweight decompressor, and substantially improves editing consistency and convergence speed.

Abstract

Holistic visual tokenizers are fundamental to unified multimodal models (UMMs) as they map diverse visual inputs into a unified representation space. In this paper, we present HYDRA-X, the first UMM that unifies image and video tokenization within a single Vision Transformer (ViT). Our design is driven by two core challenges: efficiently injecting spatiotemporal reconstruction capability into a native ViT, and embedding image- and video-level semantic awareness into the latent space. To address the first, comprehensive ablations reveal two key findings: (1) frame-level causal temporal attention suffices for visual reconstruction, whereas full spatiotemporal attention degrades it; and (2) hierarchical temporal compression substantially outperforms single-step alternatives. To tackle the second, we propose a lightweight decompressor that upsamples temporally compressed features under joint image-video teacher supervision, thereby enforcing complementary semantic structures within the compact latent space. Building on this holistic tokenizer, we further propose a principled improvement of the editing pipeline: source-target interaction should occur at the latent level inside the tokenizer rather than at the semantic level inside the LLM, substantially improving editing consistency and accelerating convergence. Instantiated as a 7B dense model, HYDRA-X achieves strong performance across image and video understanding and generation tasks, paving the way for future unified-tokenizer UMMs.
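
One common reading of "frame-level causal temporal attention" is a block-causal mask in which every token may attend to tokens in its own frame and in earlier frames, but not in later ones. The sketch below builds such a mask under that assumption; HYDRA-X's actual attention layout may differ, and the function name is hypothetical.

```python
def frame_causal_mask(num_frames, tokens_per_frame):
    """mask[q][k] is True when query token q may attend to key token k."""
    total = num_frames * tokens_per_frame
    frame_of = [i // tokens_per_frame for i in range(total)]
    return [[frame_of[k] <= frame_of[q] for k in range(total)]
            for q in range(total)]
```

With two frames of two tokens each, tokens of the first frame see only that frame, while tokens of the second frame see all four tokens.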

2606.13288 2026-06-12 cs.CV cs.AI cs.CL New submission

Cross-Modal Masked Compositional Concept Modeling for Enhancing Visio-Linguistic Compositionality

Wei Li, Zhen Huang, Xinmei Tian

Affiliations: MoE Key Laboratory of Brain-inspired Intelligent Perception and Cognition, University of Science and Technology of China; Independent Researcher

AI summary: Proposes the MACCO framework, which masks compositional concepts in one modality and reconstructs them from the full context of the other, strengthening the compositional understanding of vision-language models and yielding significant gains on five benchmarks.

Comments: Accepted to ACL 2026 Main Conference, 25 pages

Abstract

Contrastively trained vision-language models like CLIP have made remarkable progress in learning joint image-text representations but still face challenges in compositional understanding. They often exhibit "bag-of-words" behavior, struggling to capture object relations, attribute-object bindings, and word-order dependencies. This limitation arises not only from the reliance on global, single-vector representations for optimization, but also from the insufficient exploitation and modeling of the rich compositional information inherently present in paired image-text data. In this work, we propose MACCO (MAsked Compositional Concept MOdeling), a framework that masks compositional concepts in one modality and reconstructs them conditioned on the full contextual information from the other, enabling the model to capture and align cross-modal compositional structures more effectively. To facilitate this process, we introduce two auxiliary objectives that jointly align and regularize masked features both inter-modally and intra-modally. Extensive experiments on five compositional benchmarks, along with in-depth analyses, demonstrate that our approach not only significantly enhances compositionality in VLMs but also improves their ability to capture syntactic structure and linguistic information. Additionally, the improved compositionality benefits text-to-image generation and multimodal large language models. Code is available at https://github.com/hiker-lw/MACCO.

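2606.13285 2026-06-12 cs.LG cs.AI New submission

Once-for-All: Scalable Simultaneous Forecasting via Equilibrium State Estimation

Beinan Xu, Andy Song, Jiti Gao, Feng Liu

Affiliations: RMIT University; Monash University; University of Adelaide

AI summary: Proposes the Equilibrium State Estimation (ESE) paradigm, which estimates the equilibrium state of multiple systems in a single forward pass and generates forecasts from the difference between the current state and that equilibrium; it delivers a 10-70x speedup while preserving accuracy, with linear-time complexity and robustness.

Comments: Accepted by ICML 2026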

Abstract

We introduce Equilibrium State Estimation (ESE), a novel paradigm for simultaneous prediction, where multiple interacting systems require separate yet coordinated forecasts. Such scenarios often arise in real-world settings such as economics and healthcare modeling. Unlike existing approaches that predict one system at a time, ESE forecasts all systems in a single pass. It first estimates the equilibrium state across systems, then generates holistic forecasts based on the difference between the current state and the estimated equilibrium. Extensive experiments on synthetic and real-world datasets, including currency exchange and COVID-19 spread modeling, demonstrate that ESE is at least as accurate as state-of-the-art (SOTA) methods while being significantly faster. In addition, ESE integrates seamlessly with conventional predictors, combining their accuracy with its exceptional efficiency and delivering a 10-70x speedup. With linear-time complexity, ESE scales far better than SOTA methods as the number of systems increases. Moreover, it remains accurate under diverse perturbations, establishing ESE as a fast, generalizable, robust, and scalable multi-prediction method.
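
The two-step recipe (estimate an equilibrium, then forecast from each system's gap to it) can be illustrated with a toy mean-reversion example. Everything specific here is an assumption: the abstract does not say how ESE estimates the equilibrium, so the cross-system mean and the fixed `rate` are stand-ins, and the function name is hypothetical.

```python
def equilibrium_forecast(states, rate=0.5):
    """One-step forecast for every system from its gap to a shared equilibrium."""
    # Assumed estimator: the cross-system mean stands in for the learned equilibrium.
    equilibrium = sum(states) / len(states)
    return [s + rate * (equilibrium - s) for s in states]
```

The toy makes one pass over the systems to estimate the equilibrium and one pass to forecast, so its cost grows linearly with the number of systems, in the spirit of the linear-time claim above.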

2606.13282 2026-06-12 cs.AI New submission

ERTS: Adversarial Robustness Testing of Ethical AI via Semantic Perturbation in a Bounded Consequence Space

Pratyush Chaudhari

Affiliations: Pratyush Chaudhari

AI summary: Proposes the Ethical Robustness Testing System (ERTS), which tests the adversarial robustness of AI ethical reasoning using a bounded ethical consequence space, semantic perturbations, and domain-adaptive assessment; in experiments only 33% of models pass.

Comments: 8 pages, 10 tables

Abstract

As AI systems are deployed in high-stakes ethical contexts such as healthcare triage, autonomous vehicle control, and employment screening, formal methods for evaluating their robustness against adversarial manipulation of ethical reasoning remain underdeveloped. This paper introduces the Ethical Robustness Testing System (ERTS), a closed-pipeline framework that: (1) encodes ethical dilemmas into a 22-dimensional Ethical Consequence Space (ECS) grounded in established ethical theory; (2) applies 17 semantic perturbation functions subject to 6 validity constraint classes including a novel semantic coherence constraint; (3) measures decision deviation via a 4-component Ethical Instability Index (EII); and (4) produces domain-adaptive pre-deployment robustness assessment verdicts. We evaluate 4 structured baseline models and 2 production LLMs (Gemini 2.0 Flash and Llama 3.2) across 50 ethical scenarios spanning 8 deployment domains, generating 1,500 adversarial test cases. Results demonstrate that only 33% of models achieve assessment clearance, with the local Llama-3.2 model proving particularly vulnerable to fairness corruption and information degradation attacks (ERS = 0.737). To the best of our knowledge, no existing framework combines a bounded ethical consequence space, semantic coherence constraints, and domain-adaptive assessment in a single adversarial testing pipeline.

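2606.13279 2026-06-12 cs.RO New submission

See Selectively, Act Adaptively: Dual-Level Structural Decomposition for Bimanual Robot Manipulation

Yoon-Ji Choi, Young-Chae Son, Soo-Chul Lim

Affiliations: Dongguk University

AI summary: Proposes a VLA framework for bimanual manipulation based on dual-level structural decomposition, handling visual relevance and bimanual interaction modes separately through a view-selective visual router and an action mixture-of-experts; success rates improve by 27.7% in simulation and 43.3% on real-world tasks.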

Abstract

In bimanual robotic manipulation, task-relevant visual information varies with the task stage and context, while the interaction of the two arms shifts between independent and coordinated modes, making policy learning challenging. However, existing monolithic Vision-Language-Action (VLA) policies process diverse visual inputs and interaction patterns through a single shared representation and action generation pathway, often failing to separately account for visual relevance and bimanual interaction structure. To address this issue, we propose a bimanual manipulation VLA framework based on Dual-Level Structural Decomposition. The View-Selective Visual Router dynamically adjusts wrist-view contributions to emphasize relevant visual cues, while the Interaction-Aware Action Mixture-of-Experts (MoE) decomposes action generation into coordinated and arm-wise pathways to adapt to varying bimanual interaction modes. We evaluate the proposed method on six simulated bimanual manipulation tasks in RoboTwin 2.0 and three long-horizon real-world tasks. Our model improves the overall average success rate over a monolithic baseline by 27.7% in simulation and 43.3% in real-world evaluation, while consistently outperforming single-module variants across both settings. These results demonstrate that jointly considering selective visual processing and explicit decomposition of bimanual interaction structures provides an effective inductive bias for robust bimanual manipulation.

2606.13276 2026-06-12 cs.LG cs.AI New submission

Different Layers, Different Manifolds: Module-Wise Weight-Space Geometry in Transformer Optimization

Kirato Yoshihara

Affiliations: School of Engineering Science, The University of Osaka

AI summary: Investigates whether different transformer modules prefer different manifold geometries and assigns Stiefel constraints to attention layers and DGram constraints to MLP layers, which gives the best performance among the tested configurations in GPT-2 pretraining.

Comments: Accepted at WSS @ ICML 2026, code is available at https://github.com/kiratoyoshihara/module-wise-manifold-muon

英文摘要

Weight-space geometry plays a central role in neural network optimization, yet manifold constraints are often applied uniformly across all weight matrices. In this work, we ask whether different transformer modules prefer different manifold geometries. We study Manifold Muon for GPT-2 pretraining and compare layer-wise assignments of Stiefel and DGram constraints across attention and MLP blocks. Our results show a clear asymmetry: constraining attention layers with Stiefel geometry while assigning DGram geometry to MLP layers gives the best performance among the tested configurations, whereas the inverted assignment and all-DGram configuration become unstable under the shared hyperparameter setting. We trace this failure to singular value growth in DGram-constrained attention weights, which can amplify attention logits and induce softmax saturation. These findings suggest that symmetry-aware and geometry-aware optimization for transformers should be module-specific rather than uniform.
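
The saturation mechanism named in the abstract can be illustrated with a toy calculation. The sketch below is not the paper's code: it assumes a single query attending over four keys with hypothetical raw scores, and it models singular value growth in the query/key projections as a uniform factor `scale` applied to every logit. Under those assumptions, the entropy of the attention distribution collapses as `scale` grows.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# Hypothetical attention scores for one query over four keys (assumption).
base_logits = [1.0, 0.5, -0.2, 0.1]

def attention_entropy(scale):
    """Entropy of the attention weights after scaling every logit by `scale`.

    `scale` stands in for uniform growth of the singular values of the
    query/key projection product, which multiplies each logit by the same factor.
    """
    return entropy(softmax([scale * x for x in base_logits]))

for scale in (1.0, 4.0, 16.0):
    print(f"scale={scale:5.1f}  entropy={attention_entropy(scale):.4f}")
```

The entropy falls from close to the uniform limit (ln 4) toward zero, which is the regime in which softmax gradients vanish for all but the dominant key.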

2606.13275 2026-06-12 cs.CV 新提交

Zero-Shot Captioning for Cultural Heritage: Automated Image Analysis of Traditional Indonesian Clothing

文化遗产的零样本描述:印度尼西亚传统服装的自动化图像分析

Anugrah Aidin Yotolembah, Novanto Yudistira, Gembong Edhi Setyawan

发表机构 * University of Technology, Sydney(悉尼科技大学)

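AI summary Proposes Custom ZeroCLIP, a retrieval-augmented vision-language framework that generates captions for Indonesian traditional clothing in a zero-shot setting and outperforms baselines on 8 unseen provinces.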

Comments accepted to ICME workshop on AIART 2026

详情
AI中文摘要

本文提出了Custom ZeroCLIP,一个用于印度尼西亚传统服装零样本描述的检索增强视觉-语言框架。数据集包含来自印度尼西亚所有38个省份的3,800张专家标注图像。采用省份级归纳零样本协议,模型在24个可见省份上训练,在6个可见省份上验证,并在8个未见省份上评估。该框架结合了冻结的CLIP ViT-B/32图像编码器、CLIP文本编码器、BERT文本编码器和LSTM描述解码器。在推理过程中,未见省份的标签和描述不可用,检索仅使用训练省份的描述。训练、验证或检索库构建过程中未使用任何未见省份的图像、标签或描述。Custom ZeroCLIP实现了0.8536的CLIPScore、0.3342的BLEU-4和0.4859的METEOR,优于现有基线。消融实验表明,检索提高了文化词汇的恢复能力,METEOR提升了19.3%,而人工评估证实了更强的文化准确性和流畅性。结果证明了检索增强的领域自适应在低资源文化遗产环境下生成文化基础描述的有效性。数据集可在以下网址公开获取:https://github.com/AnugrahAidinYotolembah/Traditional-Indonesian-Clothing-Captioning-Dataset。

英文摘要

This paper presents Custom ZeroCLIP, a retrieval-augmented vision-language framework for zero-shot captioning of Indonesian traditional garments. The dataset contains 3,800 expert-annotated images from all 38 Indonesian provinces. Using a province-level inductive zero-shot protocol, the model is trained on 24 seen provinces, validated on 6 seen provinces, and evaluated on 8 unseen provinces. The framework combines a frozen CLIP ViT-B/32 image encoder, a CLIP text encoder, a BERT text encoder, and an LSTM caption decoder. During inference, unseen-province labels and captions are unavailable, and retrieval uses only captions from training provinces. No unseen-province image, label, or caption is used during training, validation, or retrieval-bank construction. Custom ZeroCLIP achieves a CLIPScore of 0.8536, BLEU-4 of 0.3342, and METEOR of 0.4859, outperforming existing baselines. Ablation results show that retrieval improves cultural vocabulary recovery with a 19.3% METEOR gain, while human evaluation confirms stronger cultural accuracy and fluency. The results demonstrate the effectiveness of retrieval-augmented domain adaptation for culturally grounded caption generation in low-resource heritage settings. The dataset is publicly available at https://github.com/AnugrahAidinYotolembah/Traditional-Indonesian-Clothing-Captioning-Dataset.

2606.13267 2026-06-12 cs.CV cs.CL cs.IR 新提交

TimeLens: On-Device Artifact Recognition with Retrieval-Augmented Question Answering for the Grand Egyptian Museum

TimeLens: 面向大埃及博物馆的基于检索增强问答的设备端文物识别

Rawan Hesham, Ali Ashraf, Amr Ahmed, Malak Alaa, Omar Ahmed, Omar Wagih

发表机构 * Grand Egyptian Museum(大埃及博物馆)

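AI summary Targets fine-grained visual similarity, the gap between training data and handheld-camera conditions, and AI hallucination in museum settings with an on-device artifact detector and a bilingual retrieval-augmented generation (RAG) question-answering system, enabling real-time recognition and grounded answers.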

Comments 6 pages, 4 figures, 5 tables. Submitted to AIVRCH 2026

详情
AI中文摘要

TimeLens 是一款面向大埃及博物馆(GEM)的 AI 驱动双语移动导览应用。游客将手机对准展品时,可实时识别文物,并针对后续问题获得英语或阿拉伯语回答。本工作解决了馆内部署特有的三个问题:51 件编目文物(许多近乎相同的拉美西斯雕像)间的细粒度视觉相似性、策展训练数据与手持相机条件之间的差距,以及 AI 导览陈述未经证实的历史事实的风险。报告了两项工程贡献。首先,通过数据质量驱动的迭代研究——从基础模型自动标注(YOLO-World),经过空间标签清理规则,到完全人工标注的数据集——开发了设备端文物检测器,将标签质量确定为决定性因素:最终的 YOLOv8n 模型解决了所有先前失败的类别,同时保持为 5.97 MB 的 TensorFlow Lite 资产,可在中端手机上实时运行(mAP@0.5 = 0.995,mAP@0.5:0.95 = 0.924)。其次,基于 108 条记录的 ChromaDB 知识库的双语检索增强生成(RAG)导览,在七个候选语言模型上进行了基准测试,选定了 Gemma 4 E2B(Q4_K_M);十项针对性优化将端到端延迟从超过 30 秒降低到约 10 秒。两个子系统集成在一个生产级 Flutter 应用中,具有双语界面、博物馆位置门控和文本转语音支持。

英文摘要

TimeLens is an AI-powered bilingual mobile guide for the Grand Egyptian Museum (GEM). Pointing a phone at an exhibit, a visitor sees the artifact recognized in real time and can ask follow-up questions answered in English or Arabic. The work addresses three problems specific to in-gallery deployment: fine-grained visual similarity among 51 catalogued artifacts (many near-identical Ramesside statues), the gap between curated training data and handheld camera conditions, and the risk of an AI guide stating unsupported historical facts. Two engineering contributions are reported. First, an on-device artifact detector was developed through a data-quality-driven iteration study (from foundation-model auto-annotation with YOLO-World, through spatial label-cleaning rules, to a fully hand-annotated dataset), isolating label quality as the decisive factor: the final YOLOv8n model resolves every previously failing class while remaining a 5.97 MB TensorFlow Lite asset that runs in real time on a mid-range phone (mAP@0.5 = 0.995, mAP@0.5:0.95 = 0.924). Second, a bilingual Retrieval-Augmented Generation (RAG) guide, grounded in a 108-record ChromaDB knowledge base, was benchmarked across seven candidate language models, with Gemma 4 E2B (Q4_K_M) selected; ten targeted optimizations reduce end-to-end latency from over 30 s to approximately 10 s. Both subsystems are integrated in a production Flutter application with bilingual interface, museum location gating, and text-to-speech support.

2606.13262 2026-06-12 cs.AI 新提交

From Verdict to Process: Agentic Reinforcement Learning for Multi-Stage Fact Verification

从判决到过程:面向多阶段事实核查的智能体强化学习

Rongxin Yang, Shenghong He, Siyuan Zhu, Chao Yu

发表机构 * School of Computer Science and Engineering, Sun Yat-sen University(中山大学计算机科学与工程学院)

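AI summary Proposes ProFact, which uses agentic reinforcement learning to optimize the multi-stage fact verification pipeline end to end and introduces process-aware rewards to address sparse, delayed supervision, improving verification performance and inference efficiency.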

详情
AI中文摘要

最近结合大型语言模型(LLMs)与检索增强推理的方法在自动化事实核查中显示出前景。为了处理复杂声明,这些核查流程通常执行多阶段工作流,协调紧密耦合的模块,包括声明分解、证据收集和判决预测。然而,现有方法孤立地优化各个阶段或依赖固定启发式规则,这限制了阶段间的自适应协调,并可能导致次优结果。在这项工作中,我们提出ProFact,一种用于多阶段事实核查轨迹端到端优化的智能体强化学习框架。ProFact训练一个统一策略来协调声明分解、证据寻找、答案生成和判决预测。为了解决最终真实性标签提供的稀疏且延迟的监督,ProFact引入了过程感知奖励,在整个核查过程中提供阶段级学习信号。实证评估表明,ProFact在验证性能和推理效率上均持续优于强基线。这些结果凸显了过程感知轨迹优化对多阶段事实核查的有效性。

英文摘要

Recent approaches combining Large Language Models (LLMs) with retrieval-augmented reasoning have shown promise for automated fact verification. To process complex claims, these verification pipelines typically execute multi-stage workflows that coordinate tightly coupled modules, including claim decomposition, evidence gathering, and verdict prediction. However, existing methods optimize individual stages in isolation or rely on fixed heuristics, which limits adaptive coordination among stages and can lead to suboptimal outcomes. In this work, we propose ProFact, an agentic reinforcement learning framework for end-to-end optimization of multi-stage fact verification trajectories. ProFact trains a unified policy to coordinate claim decomposition, evidence seeking, answer generation, and verdict prediction. To address the sparse and delayed supervision provided by final veracity labels, ProFact introduces process-aware rewards that provide stage-level learning signals throughout the verification process. Empirical evaluation shows that ProFact consistently outperforms strong baselines in both verification performance and inference efficiency. These results highlight the effectiveness of process-aware trajectory optimization for multi-stage fact verification.

2606.13260 2026-06-12 cs.LG q-bio.NC 新提交

Extracting Governing Equations from Latent Dynamics via Multi-View Contrastive Learning

通过多视图对比学习从潜在动力学中提取控制方程

Paolo Muratore, Mackenzie Weygandt Mathis

发表机构 * EPFL(瑞士联邦理工学院洛桑)

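AI summary Proposes DYSCO, which uses multi-view temporal contrastive learning to jointly recover latent trajectories and governing dynamics from noisy high-dimensional observations, recovers the equations symbolically through a structured functional basis, and comes with theoretical guarantees of strong identifiability.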

详情
AI中文摘要

从噪声高维测量中识别潜在动力系统是表示学习、系统辨识和科学发现交叉领域的一个核心问题。我们提出了DYSCO,一种多视图时间对比学习算法,通过利用同一底层过程的多个独立噪声视图来区分信号与噪声,从而从这些观测中联合恢复潜在轨迹和控制动力学。通过在结构化函数基上参数化动力学,我们的框架进一步能够在仿射规范内符号恢复控制方程。我们提供了在相差一个仿射不确定性意义下的强可识别性理论保证,将先前的可识别性结果扩展到噪声非线性观测的现实设置。实验上,我们在高斯和泊松观测噪声下(后者尤其与神经记录相关),在多种动力学状态(如混沌、振荡和亚稳态)中展示了潜在轨迹和流场的准确恢复。

英文摘要

Identifying latent dynamical systems from noisy, high-dimensional measurements is a central problem at the intersection of representation learning, system identification, and scientific discovery. We present DYSCO, a multi-view temporal contrastive learning algorithm that jointly recovers latent trajectories and the governing dynamics from such observations, by leveraging multiple independent noisy views of the same underlying process to disentangle signal from noise. By parameterizing the dynamics in a structured functional basis, our framework further enables symbolic recovery of the governing equations within an affine gauge. We offer theoretical guarantees for strong identification up to an affine indeterminacy, extending prior identifiability results to the realistic setting of noisy nonlinear observations. Empirically, we demonstrate accurate recovery of both latent trajectories and flow fields across a diverse set of dynamical regimes (e.g., chaotic, oscillatory, and metastable) under both Gaussian and Poisson observation noise, the latter being particularly relevant for neural recordings.

2606.13256 2026-06-12 cs.RO cs.AI 新提交

Humor Style Drives Laughter, Topic Shapes Acceptability: Evaluating Bilingual Personal and Political Robot-Delivered AI Jokes

幽默风格驱动笑声,话题塑造可接受性:评估双语个人与政治机器人交付的AI笑话

Anna-Maria Velentza, Anne-Gwenn Bosser

发表机构 * Univ Brest-Bretagne INP, COMMEDIA team, Lab-STICC CNRS UMR 6285(布列塔尼大学-INP,COMMEDIA团队,Lab-STICC CNRS UMR 6285)

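AI summary Using a mixed factorial design, evaluates how humor type (affiliative, self-enhancing, aggressive, self-defeating) and content (person-related vs. political) affect the funniness and appropriateness of AI-generated jokes told bilingually by a robot; humor type significantly affects funniness, content affects appropriateness, and language preference depends on content and participant fluency.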

Comments Accepted in the 35th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN 2026), Kitakyushu, Fukuoka, Japan

详情
AI中文摘要

幽默在人类社交关系中扮演核心角色,计算幽默的最新进展为将幽默融入人机交互(HRI)创造了新机会。虽然大型语言模型(LLMs)能生成多种形式的幽默,但在群体环境中,幽默风格、笑话内容和语言偏好如何影响对机器人传递幽默的感知仍不清楚。在这项探索性研究中,我们采用混合因素设计,让参与者在大学教室中评估由机器人传递的AI生成笑话。我们考察了幽默类型(亲和型、自我增强型、攻击型、自贬型)和笑话内容(个人相关vs政治)对感知趣味性和适当性的影响,以及语言偏好。结果表明,幽默类型显著影响趣味性,攻击型和亲和型幽默评分更高;而笑话内容主要影响适当性,个人相关笑话优于政治笑话。语言偏好受笑话内容和参与者自我报告的流利度及幽默实践的影响。

英文摘要

Humor plays a central role in human social relationships, and recent advances in computational humor create new opportunities for integrating humor into human-robot interaction (HRI). While large language models (LLMs) can generate diverse forms of humor, it remains unclear how humor style, joke content, and language preference shape perceptions of robot-delivered humor in group settings. In this exploratory study, we employed a mixed factorial design in which participants evaluated AI-generated jokes delivered by a robot in a university classroom. We examined the effects of humor type (Affiliative, Self-Enhancing, Aggressive, Self-Defeating) and joke content (person-related vs. political) on perceived funniness and appropriateness, as well as preferred language. Results show that humor type significantly influences funniness, with Aggressive and Affiliative humor rated higher, while joke content primarily affects appropriateness, with person-related jokes preferred over political ones. Language preference was shaped by both joke content and participants' self-reported fluency and humor practices.

2606.13254 2026-06-12 cs.CL 新提交

Evaluating Pluralism in LLMs through Latent Perspectives

通过潜在视角评估LLM中的多元主义

Laura Majer, Jan Šnajder, Martin Tutek

发表机构 * University of Helsinki(赫尔辛基大学) ETH Zurich(苏黎世联邦理工学院)

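AI summary Proposes a domain-agnostic, multi-layered unsupervised framework that extracts latent perspectives from LLM-generated text to assess the pluralistic gap, finding that rarer perspectives remain disproportionately underrepresented.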

Comments Pluralistic Alignment Workshop @ ICML 2026

详情
AI中文摘要

对代表多样化视角的需求日益增长,增加了对多元主义LLM生成的兴趣。尽管难以操作化,但识别文本中表达的视角将为多元主义对齐提供明确指导,并更清晰地阐明LLM生成中的多元主义差距。虽然模型已被证明会减少训练数据的多样性并生成同质化内容,但这主要是在多项选择问卷或使用自由文本的高层特征上得到证明。在本文中,我们介绍并实现了一个领域无关的多层无监督框架,用于提取适合识别LLM生成文本中多元主义差距的视角。我们在书评(一个高度意见化、代表多样化视角的数据集)上评估了该框架,并比较了各种提示和模型。我们的结果表明,虽然一些模型和提示技术接近覆盖广泛的视角,但稀有视角仍然不成比例地被低估,导致分布偏离人类文本。

英文摘要

The growing need to represent diverse perspectives has increased interest in pluralistic LLM generation. Although difficult to operationalize, identifying perspectives expressed in text would provide clear guidance on pluralistic alignment and more clearly articulate the pluralistic gap in LLM generation. While models have been shown to reduce the diversity of training data and generate homogeneously, this has been demonstrated primarily on multiple-choice questionnaires or using high-level characteristics of free-form text. In this paper, we introduce and implement a domain-agnostic multi-layered framework for unsupervised extraction of perspectives suitable for identifying the pluralistic gap in LLM-generated text. We evaluate our framework on book reviews, a highly opinionated dataset representing diverse perspectives, and compare various prompts and models. Our results show that while some models and prompting techniques come close to covering a broad spectrum of perspectives, rarer perspectives remain disproportionately underrepresented, resulting in distributions that diverge from human text.

2606.13253 2026-06-12 cs.SD cs.AI 新提交

Towards Personalized Federated Learning for Dysarthric Speech Recognition

面向构音障碍语音识别的个性化联邦学习

Tao Zhong, Mengzhe Geng, Jiajun Deng, Shujie Hu, Xunying Liu

发表机构 * The Chinese University of Hong Kong(香港中文大学) National Research Council Canada(加拿大国家研究委员会)

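AI summary Addresses heterogeneity in federated learning for dysarthric speech recognition with two personalized aggregation strategies, parameter-based and embedding-based averaging, achieving absolute word error rate reductions of up to 0.99% on UASpeech and 0.56% on TORGO.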

详情
AI中文摘要

构音障碍者的语音识别具有挑战性。虽然基于联邦学习的ASR可以有效保护隐私,但它面临由说话人变异性引起的异构性问题。在这种异构性下,强制所有说话人共享相同的模型组件可能不是最优的,因此个性化是一个有前景的方向;然而,关于构音障碍语音的相关研究仍然有限。为此,本文探索了两种实现个性化的聚合策略,包括基于参数的平均策略和基于嵌入的平均策略。在UASpeech和TORGO上的实验表明,所提方法优于基线正则化FedAvg,在UASpeech上实现了高达0.99%绝对(3.15%相对)的统计显著词错误率降低,在TORGO上实现了0.56%绝对(4.73%相对)的降低。

英文摘要

Speech recognition is challenging for dysarthric speakers. While federated learning (FL)-based ASR can be an effective tool for protecting privacy, it suffers from heterogeneity issues caused by speaker variability. Forcing all speakers to share the same model components can be suboptimal under such heterogeneity, making personalization a promising direction; however, related research on dysarthric speech remains limited. To this end, this paper explores two aggregation strategies to achieve personalization, including the parameter-based averaging strategy and the embedding-based averaging strategy. Experiments on UASpeech and TORGO show that the proposed methods outperform the baseline regularized FedAvg by statistically significant WER reductions of up to 0.99% absolute (3.15% relative) on UASpeech and 0.56% absolute (4.73% relative) on TORGO, respectively.
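
The absolute and relative WER reductions quoted above jointly imply a baseline error rate. The short calculation below is a worked-arithmetic sketch, assuming each absolute/relative pair refers to the same baseline and proposed system (the abstract does not state this explicitly); the function name is illustrative.

```python
def implied_baseline_wer(abs_reduction, rel_reduction_pct):
    """Baseline WER (in %) implied by an absolute reduction (in WER points)
    and the matching relative reduction (in %): baseline = absolute / relative."""
    return abs_reduction / (rel_reduction_pct / 100.0)

# Figures taken from the abstract: (corpus, absolute points, relative %).
for name, abs_red, rel_red in [("UASpeech", 0.99, 3.15), ("TORGO", 0.56, 4.73)]:
    baseline = implied_baseline_wer(abs_red, rel_red)
    print(f"{name}: baseline ~{baseline:.2f}% WER, "
          f"proposed ~{baseline - abs_red:.2f}% WER")
```

Under that assumption the baseline sits near 31.4% WER on UASpeech and 11.8% on TORGO, which explains why the smaller absolute gain on TORGO is the larger relative one.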

2606.13252 2026-06-12 cs.LG 新提交

To GAN or Not To GAN: Segmentation Analysis on Mars DEM

生成对抗还是非生成对抗:火星DEM上的分割分析

Douglas Dziedzorm Agbeve, Aditya V. Handrale, Salim Fares, Seif E. Idani

发表机构 * University of Passau(帕绍大学)

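AI summary Automatically detects mounds on Mars using supervised semantic segmentation and a generative adversarial approach, and compares the two, finding that adding artificially generated data did not improve results.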

详情
AI中文摘要

为了更好地理解火星表面,使火星车能够轻松导航火星,有必要能够确定土丘的位置。检测和研究这些形态也有助于我们找到地外生命的证据,在这种情况下,更具体地说,是水或生命适宜环境的迹象。土丘的检测是通过将形态参数手动映射到数字高程模型上完成的。本文通过使用基于神经网络的语义分割方法自动检测和/或预测火星上的土丘来解决这个问题。这是通过使用监督语义分割模型和生成对抗方法实现的。两种方法的比较表明,添加额外的人工生成数据并未改善结果。

英文摘要

To better understand the Martian surface, which is needed for rovers to navigate Mars with ease, it is necessary to determine the location of mounds. Detecting and studying these morphologies can also help us find evidence of extraterrestrial life, in this case, more specifically, water or signs of environments conducive to life. Detection of mounds has been done by manually mapping morphological parameters onto Digital Elevation Models. This paper addresses the problem by automatically detecting and/or predicting mounds on Mars using neural-network-based semantic segmentation methods. This is done using a supervised semantic segmentation model and a generative adversarial approach. A comparison of the approaches shows that adding extra artificially generated data did not improve the results.

2606.13249 2026-06-12 cs.AI 新提交

Multi-Field Hybrid Retrieval-Augmented Generation for Maritime Accident Root Cause Analysis

面向海事事故根因分析的多字段混合检索增强生成

Seongjin Kim, Sungil Kim

发表机构 * Department of Industrial Engineering, Ulsan National Institute of Science and Technology (UNIST)(蔚山国立科学技术院工业工程系)

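AI summary Proposes a multi-field hybrid retrieval-augmented generation framework that uses structured incident cards and a hierarchical cause taxonomy, with field-aware hybrid retrieval and rank fusion, to substantially improve retrieval and generation quality for maritime accident root cause analysis.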

详情
AI中文摘要

海事事故裁决报告包含根因分析(RCA)的关键法庭调查结果,然而从数十年的记录中检索相关先例并起草一致的报告仍然劳动密集。本文提出一个用于自动化海事RCA的多字段混合检索增强生成(RAG)框架,利用包含13,329份韩国海事安全法庭(KMST)报告(1971-2025年)的综合数据集。我们将原始裁决转化为结构化的“事故卡片”知识库,索引三个不同字段——摘要、原因和处置——以及一个层次化的L1/L2原因分类。我们的检索策略采用字段感知的混合方法,通过互惠排名融合(RRF)融合稀疏和密集排名。鉴于缺乏大规模专家相关性标签,我们使用基于元数据派生代理相关性分数的天花板归一化召回率和nDCG评估检索性能。实验结果表明,我们提出的检索显著优于基线方法,将NormRecall@100从0.18提高到0.55。此外,将生成器基于检索到的先例,相比仅使用LLM的基线,RCA生成质量得到提升,LLM作为评判者的评分从3.34提高到3.72。这些发现表明,字段感知的RAG可以通过实现更快的先例搜索和更一致、基于证据的RCA起草,显著简化海事安全调查工作流程。

英文摘要

Maritime accident adjudication reports contain critical tribunal findings for root cause analysis (RCA), yet retrieving relevant precedents and drafting consistent reports from decades of records remains labor-intensive. This paper proposes a multi-field hybrid retrieval-augmented generation (RAG) framework for automated maritime RCA, utilizing a comprehensive dataset of 13,329 Korea Maritime Safety Tribunal (KMST) reports (1971-2025). We transform raw adjudications into a structured knowledge base of "incident cards", indexing three distinct fields (Summary, Causes, and Disposition) alongside a hierarchical L1/L2 cause taxonomy. Our retrieval strategy employs a field-aware hybrid approach, fusing sparse and dense rankings via Reciprocal Rank Fusion (RRF). Given the lack of large-scale expert relevance labels, we evaluate retrieval performance using ceiling-normalized recall and nDCG based on a metadata-derived proxy relevance score. Experimental results demonstrate that our proposed retrieval significantly outperforms baseline methods, improving NormRecall@100 from 0.18 to 0.55. Furthermore, grounding the generator on the retrieved precedents enhances RCA generation quality over an LLM-only baseline, increasing the LLM-as-a-judge score from 3.34 to 3.72. These findings suggest that field-aware RAG can substantially streamline maritime safety investigation workflows by enabling faster precedent search and more consistent, evidence-based RCA drafting.
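
Reciprocal Rank Fusion, the fusion step named above, is simple enough to sketch directly. The code below is a minimal illustration, not the authors' implementation: it uses the standard formula score(d) = sum over rankings of 1 / (k + rank), with the commonly used default k = 60 (the abstract does not state the value used), and the case identifiers are hypothetical.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking.

    Each document receives 1 / (k + rank) from every list it appears in
    (ranks start at 1); documents are returned by descending fused score,
    with ties broken alphabetically so the output is deterministic.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical outputs of a sparse (e.g. BM25) and a dense retriever.
sparse = ["case_A", "case_B", "case_C"]
dense = ["case_B", "case_D", "case_A"]

fused = reciprocal_rank_fusion([sparse, dense])
print(fused)
```

Because only ranks are used, the sparse and dense scores never need to be calibrated against each other, which is the usual reason for choosing RRF in hybrid retrieval.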

2606.13241 2026-06-12 cs.AI 新提交

Brick: Spatial Capability Routing for the Mixture-of-Models (MoM) Paradigm

Brick: 面向混合模型范式的空间能力路由

Francesco Massa, Marco Cristofanilli

发表机构 * Regolo AI Seeweb

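AI summary Proposes Brick, a multimodal router that dispatches models using six-dimensional capability scores and a per-query difficulty estimate under a cost-penalized geometric rule, allowing a flexible trade-off between quality and cost.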

Comments 17 pages, 5 figures. Technical report

详情
AI中文摘要

定义查询难度是部署工程中最困难的问题之一。现有的LLM路由器依赖表面特征,如领域标签、关键词和token数量,忽略了实际决定模型成功的域内方差。前沿模型成本比本地开源模型高10到100倍,因此在生产规模下,即使每次请求的小额节省也会直接成为云账单的杠杆。我们提出了Brick,一种多模态路由器,它在六个能力维度上对每个模型进行评分,结合每个查询的难度估计,并通过成本惩罚的几何规则进行调度。一个连续的偏好旋钮允许操作员在部署时在最大质量和最大节省之间滑动。在5504个查询的基准测试中,Brick在最大质量模式下达到76.98%的准确率,超过了最佳单一模型(75.02%)和所有测试的路由器。在中性成本-质量配置下,Brick以比始终使用最强模型低4.71倍的成本实现了74.11%的准确率。在最低成本模式下,它降低了22.15倍的成本,准确率损失11.85个百分点。中位延迟从51.2秒降至22.8秒。

英文摘要

Defining query difficulty is one of the hardest problems in deployment engineering. Existing LLM routers rely on surface features such as domain labels, keywords, and token count, ignoring the within-domain variance that actually determines model success. Frontier models cost ten to one hundred times more than local open-weight models, so at production scale even small per-request savings become a direct cloud-bill lever. We present Brick, a multimodal router that scores each model on six capability dimensions, combines this with a per-query difficulty estimate, and dispatches via a cost-penalized geometric rule. A continuous preference knob lets operators slide between max-quality and max-saving profiles at deploy time. On a benchmark of 5,504 queries, Brick at max-quality reaches 76.98% accuracy, beating the best single model (75.02%) and all tested routers. At a neutral cost-quality profile, Brick achieves 74.11% accuracy at 4.71x lower cost than always using the strongest model. At min-cost, it cuts cost 22.15x with 11.85 points accuracy loss. Median latency drops from 51.2s to 22.8s.

2606.13240 2026-06-12 cs.LG cs.AI cs.CV stat.ME stat.ML 新提交

Towards More General Control of Diffusion Models Using Jeffrey Guidance

使用 Jeffrey 引导实现扩散模型的更通用控制

Raphaël Razafindralambo, Rémy Sun, Frédéric Precioso, Jes Frellsen, Pierre-Alexandre Mattei

发表机构 * Inria, CNRS, I3S, Maasai Université Côte d’Azur(法国国家信息与自动化研究所、法国国家科学研究中心、I3S实验室、Maasai团队、蔚蓝海岸大学) Technical University of Denmark(丹麦技术大学) Inria, CNRS, LJAD, Maasai Université Côte d’Azur(法国国家信息与自动化研究所、法国国家科学研究中心、迪厄多内实验室、Maasai团队、蔚蓝海岸大学)

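AI summary Proposes the Jeffrey guidance framework, which updates marginal distributions via Jeffrey's rule of conditioning to extend diffusion-model control to applications that standard guidance cannot express, substantially reducing FID on CIFAR-10 and FFHQ and enabling fairness control on CelebA-HQ.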

详情
AI中文摘要

扩散模型的一个关键优势在于其灵活性,因为其输出可以在采样时通过引导进行控制。然而,除了条件采样等简单情况外,目标分布通常隐含地定义,仅通过采样规则或启发式能量函数给出。为了解决这个问题,我们提出了 Jeffrey 引导,这是一个原则性框架,将扩散模型控制扩展到标准引导无法表达的应用。它利用 Jeffrey 条件规则将边际分布更新到指定的目标,保持条件结构并最小化对联合分布的扰动。我们首先通过针对指定的嵌入分布来演示 Jeffrey 引导。以 Inception 嵌入为目标,这导致在 CIFAR-10 和 FFHQ 上 FID 显著降低。我们进一步将 Jeffrey 引导应用于 CelebA-HQ 上的公平性,更新无条件扩散模型以强制属性之间的独立性。

英文摘要

A key strength of diffusion models lies in their flexibility, since their outputs can be controlled at sampling time through guidance. However, beyond simple cases such as conditional sampling, the target distribution is often left implicit, defined only through a sampling rule or a heuristic energy function. To address this, we propose Jeffrey guidance, a principled framework that extends diffusion-model control to applications beyond what standard guidance can express. It leverages Jeffrey's rule of conditioning to update marginal distributions towards a prescribed target, preserving the conditional structure and minimally perturbing the joint distribution. We first demonstrate Jeffrey guidance by targeting a prescribed embedding distribution. With Inception embeddings as the target, this leads to substantial reductions in FID on both CIFAR-10 and FFHQ. We further apply Jeffrey guidance to fairness on CelebA-HQ, updating an unconditional diffusion model to enforce independence between attributes.
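
Jeffrey's rule of conditioning, on which the method rests, has a compact discrete form: given a joint P(x, y) and a target marginal Q(y), the updated joint is P'(x, y) = P(x | y) Q(y). The sketch below applies that rule to a small hypothetical table; it illustrates the rule itself, not the paper's diffusion-guidance procedure, and the attribute names and probabilities are made up.

```python
def jeffrey_update(joint, target_marginal):
    """Apply Jeffrey's rule to a discrete joint distribution.

    joint: dict mapping (x, y) -> probability.
    target_marginal: dict mapping y -> desired marginal probability.
    Returns P'(x, y) = P(x | y) * Q(y), which keeps every conditional
    P(x | y) unchanged while moving the marginal over y to Q.
    """
    marginal = {}
    for (_, y), p in joint.items():
        marginal[y] = marginal.get(y, 0.0) + p
    for y, q in target_marginal.items():
        if q > 0.0 and marginal.get(y, 0.0) == 0.0:
            raise ValueError(f"target puts mass on {y!r}, which has zero prior mass")
    updated = {}
    for (x, y), p in joint.items():
        q = target_marginal.get(y, 0.0)
        updated[(x, y)] = p / marginal[y] * q if marginal[y] > 0.0 else 0.0
    return updated

# Hypothetical joint over (expression, age group); marginal over age is 0.8 / 0.2.
joint = {
    ("smiling", "young"): 0.40, ("neutral", "young"): 0.40,
    ("smiling", "old"): 0.05, ("neutral", "old"): 0.15,
}
balanced = jeffrey_update(joint, {"young": 0.5, "old": 0.5})
print(balanced)
```

After the update the age marginal is balanced, yet the share of smiling faces within each age group is exactly what it was before, which is the "preserving the conditional structure" property the abstract refers to.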

2606.13236 2026-06-12 cs.LG cs.AI cs.SD stat.AP 新提交

Decoding Insect Song: A Multitask Semisupervised Orthoptera Bioacoustic Classifier

解码昆虫之歌:一种多任务半监督直翅目生物声学分类器

Olga Isupova, Danil Kuzin, Ella Browning, Tom Mills, Steven Reece

发表机构 * University of Oxford(牛津大学)

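AI summary Proposes PULSE, a semi-supervised multi-task framework combining weakly supervised classification, self-supervised learning, and knowledge distillation, which outperforms a general-purpose model on Orthoptera bioacoustic classification and improves further with active learning.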

Comments ICML 2026 Workshop on Machine Learning for Audio

详情
AI中文摘要

被动声学监测在生态推断方面具有巨大潜力,但现有的自动化工具通常训练范围狭窄且不可迁移。我们通过PULSE(一种用于直翅目生物声学的半监督多任务框架)解决了这些局限性,该框架结合了弱监督物种分类、未标记野外音频的自监督学习以及来自通用生物声学模型的知识蒸馏。我们的领域自适应专家模型在所有指标上均优于最先进的通用模型(宏F1:0.21 vs. 0.07;AUC:0.74 vs. 0.45;AP:0.32 vs. 0.19),主动学习进一步将F1提升至0.34,AUC提升至0.84。除了分类之外,学习到的嵌入编码了生态上有意义的结构,并通过交互式可视化工具暴露出来,用于生态发现。

英文摘要

Passive acoustic monitoring holds great promise for ecological inference, yet existing automated tools are typically narrowly trained and non-transferable. We address these limitations with PULSE, a semi-supervised, multi-task framework for Orthoptera bioacoustics, combining weakly-supervised species classification, self-supervised learning on unlabelled field audio, and knowledge distillation from a general-purpose bioacoustic model. Our domain-adapted specialist model outperforms a state-of-the-art general model across all metrics (macro F1: 0.21 vs. 0.07; AUC: 0.74 vs. 0.45; AP: 0.32 vs. 0.19), with active learning further raising F1 to 0.34 and AUC to 0.84. Beyond classification, the learned embeddings encode ecologically meaningful structure, exposed through an interactive visualisation tool for ecological discovery.

2606.13233 2026-06-12 cs.LG cs.AI 新提交

ReSET: Accurate Latency-Critical NVFP4 Reasoning via Step-Aware Temperature Scaling

ReSET: 通过步骤感知温度缩放实现精确的延迟关键型NVFP4推理

Sihwa Lee, Janghwan Lee, Donghoon Yoo, Jae Gon Kim, Hanyul Ryu, Soojung Ryu, Jungwook Choi

发表机构 * Hanyang University(汉阳大学) Xenoscube Korean Inc.(Xenoscube韩国公司)

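AI summary Addresses the accuracy drop and latency gap of large reasoning models under NVFP4 low-precision inference with ReSET, a temperature-scaling method based on reasoning-step entropy, and a CUDA-core small-M kernel, improving accuracy by up to about 2 points across benchmarks and roughly doubling end-to-end decoding speed relative to BF16.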

详情
AI中文摘要

大型推理模型(LRMs)通过生成长中间推理轨迹来改进复杂问题求解,但这大幅增加了推理成本。NVFP4推理通过硬件支持的低精度执行提供了一种减少计算和内存成本的有前景方法。然而,直接将NVFP4应用于LRMs引入了两个实际限制:量化下推理精度下降,且现有NVFP4核在小型批处理自回归解码中未完全实现延迟优势。在这项工作中,我们分析了NVFP4量化对推理过程中token级不确定性的影响。我们表明,量化增加了低熵符号token的错误采样,同时导致在高不确定性推理步骤中过度集中于少量token。基于这一观察,我们提出了ReSET,一种基于推理步骤熵的温度缩放方法,它在线估计步骤级不确定性,并使用token级和步骤级熵信号自适应调整解码温度。为解决延迟差距,我们进一步设计了一个CUDA核心的小型M NVFP4核,用于延迟关键的自回归解码。在推理基准和模型规模上,ReSET将NVFP4推理精度相比NVFP4基线提升高达约2个点。我们的CUDA核心小型M核进一步改善了延迟关键解码,相比NVFP4 vLLM提供高达2.5倍的核级加速,相比BF16提供约2倍的端到端解码加速。代码可在 https://github.com/aiha-lab/ReSET 获取。

英文摘要

Large reasoning models (LRMs) improve complex problem-solving by generating long intermediate reasoning traces, but this substantially increases inference costs. NVFP4 inference offers a promising approach to reduce both computational and memory costs through hardware-supported low-precision execution. However, directly applying NVFP4 to LRMs introduces two practical limitations: reasoning accuracy degrades under quantization, and existing NVFP4 kernels do not fully realize latency benefits in small-batch autoregressive decoding. In this work, we analyze the effect of NVFP4 quantization on token-level uncertainty during reasoning. We show that quantization increases incorrect sampling at low-entropy symbolic tokens, while causing over-concentration on a small set of tokens in high-uncertainty reasoning steps. Based on this observation, we propose ReSET, a reasoning-step entropy-based temperature-scaling method that estimates step-level uncertainty online and adapts the decoding temperature using both token-level and step-level entropy signals. To address the latency gap, we further design a CUDA-core small-M NVFP4 kernel for latency-critical autoregressive decoding. Across reasoning benchmarks and model scales, ReSET improves NVFP4 reasoning accuracy by up to approximately 2 points over the NVFP4 baseline. Our CUDA-core small-M kernel further improves latency-critical decoding, delivering up to 2.5× kernel-level speedup over NVFP4 vLLM and approximately 2× end-to-end decoding speedup over BF16. Code is available at https://github.com/aiha-lab/ReSET.
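
The idea of steering decoding temperature with entropy signals can be sketched in a few lines. The rule below is a hypothetical gating scheme written for illustration only, not the ReSET formula: the thresholds `low` and `high`, the multipliers `sharpen` and `flatten`, and the use of a running mean of token entropies as the step-level signal are all assumptions. It follows the direction suggested by the abstract: sharpen sampling at low-entropy tokens, flatten it in high-uncertainty steps.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def adaptive_temperature(token_entropy, step_entropy, base=1.0,
                         low=0.3, high=1.0, sharpen=0.7, flatten=1.3):
    """Hypothetical entropy-gated temperature rule (not the paper's formula).

    - Near-deterministic token (entropy below `low`): lower the temperature
      so quantization noise is less likely to flip the sampled token.
    - Uncertain reasoning step (step entropy above `high`): raise the
      temperature to counter over-concentration on a few tokens.
    """
    if token_entropy < low:
        return base * sharpen
    if step_entropy > high:
        return base * flatten
    return base

# Toy example: a confident next-token distribution inside a moderately
# uncertain step (the step entropy value is an assumed running mean).
confident_logits = [6.0, 1.0, 0.5, 0.0]
token_h = entropy(softmax(confident_logits))
temp = adaptive_temperature(token_h, step_entropy=0.6)
print(f"token entropy={token_h:.3f}  chosen temperature={temp}")
```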

2606.13232 2026-06-12 cs.RO 新提交

WT-UMI: Tactile-based Whole-Body Manipulation via Force-Supervised Contact-Aware Planning

WT-UMI: 基于触觉的全身操控通过力监督的接触感知规划

Jaehwi Jang, Zhaoyuan Gu, Alfred Cueva, Zimeng Chai, Junjie Sheng, Thong Nguyen, Himank Galundia, Yifan Wu, Huishu Xue, Isaac Legene, Ojas Mediratta, Davin Doan, Andrew Collins, Sarah Sadegh, KyoungMok Kim, Rishita Dhalbisoi, Zun Chen, Ye Zhao

发表机构 * The Institute for Robotics and Intelligent Machines, Georgia Institute of Technology(机器人与智能机械研究所,佐治亚理工学院)

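AI summary Proposes the WT-UMI system, which combines human demonstration and teleoperation data, uses a force-supervised planner to predict end-effector poses and contact-force trajectories, and applies a tactile-based admittance controller to improve whole-body manipulation.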

Comments 18 pages, 8 figures

详情
AI中文摘要

全身人形操控笨重、可变形和共享负载物体需要分布式接触感知和显式力调节,然而大多数模仿策略仅隐式处理接触力。另一方面,不同的演示来源提供了具有固有权衡的互补模态:人体演示捕捉自然接触力但不包含机器人可执行动作,而遥操作直接记录机器人动作但力调节不够自然。本文提出WT-UMI,一种可穿戴全身触觉接口,可由人类操作员佩戴或安装在人形机器人上,在人体演示和人形遥操作模式下提供触觉图像、接触力和末端执行器位姿的精确观测。我们引入一个力条件目标位姿校正模块,通过从遥操作数据中学习校正,将测量的人体位姿转换为接触感知的机器人目标。为了利用人体数据中的自然力交互,我们提出一个力监督规划器,预测末端执行器位姿块和接触力轨迹。预测的接触力作为基于触觉的导纳控制器的参考。在五个接触密集型任务中,涵盖可变形物体、笨重刚体物体和人-人形协作,WT-UMI在成功率上优于四个策略基线,并降低了接触位置跟踪误差。我们的项目页面可在 https://wt-umi.github.io/WTUMI/ 访问。

英文摘要

Whole-body humanoid manipulation of bulky, deformable, and shared-load objects requires distributed contact sensing and explicit force regulation, yet most imitation policies treat contact force only implicitly. On the other hand, different demonstration sources provide complementary modalities with inherent trade-offs: human demonstrations capture natural contact forces but not robot-executable actions, while teleoperation directly records robot actions but with less natural force regulation. This paper presents WT-UMI, a wearable whole-body tactile interface worn by human operators or mounted on humanoids, providing accurate observations of tactile images, contact forces, and end-effector poses across both human demonstration and humanoid teleoperation modes. We introduce a force-conditioned target-pose correction module that converts measured human poses into contact-aware robot targets by learning corrections from teleoperation data. To leverage the natural force interaction in human data, we propose a force-supervised planner that predicts end-effector pose chunks and contact-force trajectories. The predicted contact force serves as the reference for a tactile-based admittance controller. Across five contact-rich tasks spanning deformable objects, bulky rigid objects, and human-humanoid collaboration, WT-UMI improves success rate and reduces contact-position tracking error over four policy baselines. Our project page is available at https://wt-umi.github.io/WTUMI/.
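
An admittance controller that tracks a reference contact force can be sketched for a single axis. The code below is a generic textbook-style admittance law, not the WT-UMI controller: it assumes virtual dynamics M·a + D·v + K·x = f_meas − f_ref, where x is the offset added to the planned end-effector position, and the gains, time step, sign convention, and force values are all illustrative assumptions.

```python
def admittance_step(x, v, f_meas, f_ref, dt,
                    mass=1.0, damping=20.0, stiffness=100.0):
    """Advance a 1-DoF admittance model by one time step.

    x, v   : current position offset (m) and velocity (m/s)
    f_meas : measured contact force (N), e.g. from a tactile sensor
    f_ref  : reference contact force (N), e.g. predicted by a planner
    Uses semi-implicit Euler: update velocity first, then position.
    """
    accel = ((f_meas - f_ref) - damping * v - stiffness * x) / mass
    v_next = v + accel * dt
    x_next = x + v_next * dt
    return x_next, v_next

# Simulate 5 s of a constant 5 N force error with a 1 kHz control loop.
x, v = 0.0, 0.0
for _ in range(5000):
    x, v = admittance_step(x, v, f_meas=15.0, f_ref=10.0, dt=0.001)

print(f"steady-state offset: {x:.4f} m")
```

With these critically damped gains the offset settles at (f_meas − f_ref) / K = 0.05 m; in a real system the contact force would change as the end effector yields, closing the loop on the force error.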

2606.13227 2026-06-12 cs.CL 新提交

PolyAlign: Conditional Human-Distribution Alignment

PolyAlign: 条件性人类分布对齐

L. D. M. S. Sai Teja, Ufaq Khan, Sathira Silva, Xiao Wu, Muhammad Haris Khan

发表机构 * NIT Silchar(印度国立理工学院锡尔恰尔分校) MBZUAI(穆罕默德·本·扎耶德人工智能大学)

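AI summary Proposes the PolyAlign framework, which uses Bucket-Aware SFT and Human-Distribution Preference Optimization to align language models with the human response distribution conditioned on interaction context, improving naturalness and distributional faithfulness.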

Comments 20 pages, 4 Figures, 8 Tables

详情
AI中文摘要

诸如监督微调(SFT)和偏好优化等后训练方法通常将语言模型对齐到单一的全局助手行为。虽然这有助于提高平均有用性,但可能抑制人类响应在不同语言、任务和对话设置中的自然变化。我们将此问题研究为条件性人类分布对齐:模型应匹配适合当前交互上下文的人类响应分布,而非通用响应风格。我们引入PolyAlign,一种分布感知的对齐框架,将双语交互数据组织为由语言、交互轨迹、响应家族和长度定义的桶特定人类参考分布。PolyAlign结合了桶感知SFT(平衡跨异构桶的优化)和人类分布偏好优化(HDPO,使用评论家估计的到桶特定人类支持的距离来正则化偏好学习)。在涵盖英语和中文单轮及多轮设置的双语评估套件中,PolyAlign在保持竞争性任务实用性的同时,提高了条件自然性和分布忠实度。结果表明,后训练应超越全局对齐目标,转向与人类响应分布的交互感知对齐。

英文摘要

Post-training methods such as supervised fine-tuning (SFT) and preference optimization typically align language models toward a single global assistant behavior. While effective for improving average helpfulness, this can suppress the natural variation of human responses across languages, tasks, and dialogue settings. We study this problem as conditional human-distribution alignment: models should match the human response distribution appropriate to the current interaction context, rather than a universal response style. We introduce PolyAlign, a distribution-aware alignment framework that organizes bilingual interaction data into bucket-specific human reference distributions defined by language, interaction track, response family, and length. PolyAlign combines Bucket-Aware SFT, which balances optimization across heterogeneous buckets, with Human-Distribution Preference Optimization (HDPO), which regularizes preference learning using critic-estimated distance to bucket-specific human support. Across a bilingual evaluation suite covering English and Chinese single- and multi-turn settings, PolyAlign improves conditional naturalness and distributional faithfulness while preserving competitive task utility. The results suggest that post-training should move beyond global alignment objectives toward interaction-aware alignment with human response distributions.

2606.13223 2026-06-12 cs.LG cs.CV 新提交

Distributional Loss for Robust Classification

分布损失用于鲁棒分类

Kathleen Anderson, Thomas Martinetz

发表机构 * Institute for Neuro- and Bioinformatics(神经与生物信息学研究所)

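AI summary Proposes a distributional loss concept based on a bimodal Gaussian distribution, whose softened targets implicitly capture class ambiguity, mitigate overfitting, and yield more robust decision boundaries, with particularly pronounced gains in low-data regimes.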

Comments ICANN 2026

详情
AI中文摘要

本文提出了一种用于监督分类任务的新型损失概念。我们不是强制每个输入样本直接映射到单个分配标签,而是将分类器输出的优化目标定义为双峰高斯分布。这种更柔和的目标公式隐式地捕捉了类别模糊性,减轻了过拟合,并鼓励学习更鲁棒的决策边界,所有这些都不需要额外的标签信息。实验结果表明,鲁棒性持续提升,在低数据场景下尤其明显,同时仅需对标准训练流程进行最小修改。

英文摘要

This paper proposes a novel loss concept for supervised classification tasks. Rather than enforcing a direct mapping from each input sample to a single assigned label, we define an optimization objective over all classifier outputs as a bimodal Gaussian distribution. This softer target formulation implicitly captures class ambiguity, mitigates overfitting, and encourages the learning of more robust decision boundaries, all without requiring additional label information. Experimental results demonstrate consistent improvements in robustness, with particularly pronounced gains in low-data regimes, while requiring only minimal modifications to standard training pipelines.