arXivDaily: daily arXiv digest, updated Monday through Friday
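2606.16286 2026-06-16 cs.LG cs.AI cs.RO New submission

FlowMPC: Improving Flow Matching policies with World Models

Chandon Hamel

Affiliation: Stanford University

AI summary: Proposes FlowMPC, a framework that combines a flow-matching imitation policy with a learned world model and uses MPPI planning to improve test-time performance, markedly raising success rates on ManiSkill manipulation tasks.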

Abstract

Flow Matching (FM) is a powerful approach for behavior cloning in multimodal action spaces [Jiang et al., 2025], but because it is not trained to directly maximize expected return, there is still room to improve how FM policies act at test time. This work investigates whether a learned world model can improve FM policies by enabling Model Predictive Path Integral (MPPI) planning over candidate action sequences proposed by the policy. Building on TD-MPC2 [Hansen et al., 2024], I introduce FlowMPC, a framework that combines an imitation-learned FM policy with a learned world model for test-time planning in ManiSkill manipulation tasks [Tao et al., 2025]. Across PickCube and PickSingleYCB, adding the world model improved performance over the FM policy alone, with especially clear gains in end-of-episode success. These results suggest that world-model-based planning can effectively complement flow-based imitation policies without modifying the FM training objective.
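
The planning step described in this abstract can be illustrated with a minimal sketch of MPPI-style weighting. This is an illustration under stated assumptions, not the paper's implementation: the candidate action sequences stand in for samples proposed by the FM policy, the per-candidate returns stand in for values predicted by rolling each sequence through the learned world model, actions are one-dimensional for brevity, and the names `mppi_weights`, `mppi_plan`, and `temperature` are hypothetical.

```python
import math


def mppi_weights(returns, temperature=1.0):
    """Softmax weights over predicted returns (higher return -> larger weight)."""
    best = max(returns)  # subtract the max for numerical stability
    exps = [math.exp((r - best) / temperature) for r in returns]
    total = sum(exps)
    return [e / total for e in exps]


def mppi_plan(candidates, returns, temperature=1.0):
    """Return the return-weighted average of the candidate action sequences.

    candidates: list of action sequences, each a list of floats
                (assumed here to be samples from the policy).
    returns:    one predicted return per candidate
                (assumed here to come from a world-model rollout).
    """
    weights = mppi_weights(returns, temperature)
    horizon = len(candidates[0])
    return [
        sum(w * seq[t] for w, seq in zip(weights, candidates))
        for t in range(horizon)
    ]
```

A lower `temperature` concentrates the plan on the highest-return candidate, while a higher one averages more evenly across the policy's proposals.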

2606.16285 2026-06-16 cs.CL cs.LG New submission

HiMPO: Hindsight-Informed Memory Policy Optimization for Less-Entangled Credit in Long-Horizon Agents

Jiangze Yan, Yi Shen, Wenjing Zhang, Jieyun Huang, Zhaoxiang Liu, Ning Wang, Kai Wang, Shiguo Lian

Affiliations: Unicom Data Intelligence, China Unicom; Data Science & Artificial Intelligence Research Institute, China Unicom

AI summary: Proposes HiMPO, which estimates the local utility of a memory update by comparing the task-relevant information available before and after the update, and uses hindsight relevance as a retrospective filter to reduce credit entanglement for memory-writing actions and improve long-horizon agent performance.

Comments: Preprint. 2 figures

Abstract

Long-horizon agents rely on memory mechanisms to compress interaction history, but optimizing memory writing faces a distinct credit assignment challenge: a memory update may be rewarded or penalized due to downstream tool failures, noisy observations, or reasoning errors rather than its own contribution. This causally entangled credit can lead agents to discard useful evidence or preserve irrelevant information. We propose HiMPO, a Hindsight-Informed Memory Policy Optimization framework for assigning less-entangled credit to memory-writing actions in long-horizon agents. HiMPO first estimates the local utility of a memory update by comparing the task-relevant information recoverable from the previous and updated memories under the same pre-write state. It then uses hindsight relevance as a bounded retrospective filter that attenuates memory credit when local utility is not supported by the target outcome. The resulting memory-specific advantage is applied only to memory tokens, while trajectory-level rewards optimize the rest of the agent behavior. Across judge-based open-domain tasks and objective compressive-memory QA, HiMPO improves over strong memory-based and RL-based baselines while preserving compressed-context efficiency. Controlled interventions further show that HiMPO reduces blame leakage from tool-induced errors and improves attribution fidelity of memory updates.

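2606.16281 2026-06-16 cs.CL cs.AI New submission

Who Should Lead Decoding Now? Tracking Reliable Trajectories for Ensembling Masked Diffusion Language Models

Heecheol Yun, Joonhyung Park, Joowon Kim, Eunho Yang

Affiliations: KAIST; AITRICS

AI summary: Addresses the ensembling of masked diffusion language models with TIE, a framework that tracks confidence dynamics at answer-relevant positions to iteratively identify reliable decoding trajectories and relay them across models, enabling collaborative multi-model generation.

Comments: Preprint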

Abstract

Masked Diffusion Language Models (MDLMs) have emerged as a distinct paradigm for sequence generation. As MDLMs become diverse in capabilities and knowledge coverage, an important question is how to combine their knowledge. Toward this, we first investigate the unique decoding dynamics of MDLMs. We find that successful generations exhibit stable confidence dynamics over answer-relevant positions, while unreliable trajectories can often be corrected by injecting promising intermediate states from other models. Guided by this observation, we propose TIE (Trajectory-based Iterative Ensembling), a knowledge fusion framework in which MDLMs iteratively identify reliable decoding trajectories and relay them across models. TIE tracks confidence dynamics over answer-relevant positions to determine which model currently follows a more reliable trajectory and selectively transfers partially denoised sequences across models. As the model on the more promising trajectory often changes across denoising steps, TIE allows different models to contribute complementary strengths at different stages of generation. Strong performance across diverse reasoning tasks, along with our analyses, suggests that TIE offers a practical approach to the underexplored problem of MDLM ensembling.

2606.16278 2026-06-16 cs.CV cs.AI New submission

RealityBridge: Bridging Editable 3D Gaussian Splatting Driving Simulations and Real-World Videos

Zhenhua Wu, Yun Pang, Mingkun Chang, Yuwei Ning, Liangzhi Wang, Yi Xiao, Guanbin Li

Affiliations: Sun Yat-sen University; Guangdong Key Laboratory of Information Security Technology; Key Laboratory of Machine Intelligence and Advanced Computing, Ministry of Education

AI summary: Proposes RealityBridge, which combines multimodal controls and a lightweight GateNet with autoregressive long-video training and reward-guided post-training to narrow the Sim-to-Real gap of edited 3DGS driving videos, improving visual realism and temporal consistency.

Abstract

Long-tail hazardous scenarios are essential for safety-oriented autonomous driving, yet they are difficult to collect and reproduce at scale. Editable 3D Gaussian Splatting (3DGS) simulation offers a promising alternative by reconstructing real driving scenes and supporting controllable scene editing. However, edited 3DGS-rendered videos still suffer from a significant Sim-to-Real gap, including rendering artifacts, degraded foreground assets, inconsistent illumination, and temporal flickering. Existing restoration and video generation methods are insufficient for this task, as they often fail to jointly repair 3DGS-specific artifacts, improve visual realism, and ensure temporal consistency. To fill this gap, we propose RealityBridge, a structure-preserving and asset-aware Sim-to-Real framework for edited 3DGS driving videos. RealityBridge uses multimodal controls, including rendered videos, foreground masks, edge maps, and semantic masks, together with a lightweight GateNet for adaptive condition allocation across backbone layers. We further construct targeted training data and introduce autoregressive long-video training with reward-guided post-training to improve restoration quality, temporal stability, and hallucination suppression. Extensive experiments on internal and public driving datasets show that RealityBridge outperforms existing methods in artifact removal, illumination harmonization, and long-sequence temporal consistency.

2606.16274 2026-06-16 cs.CV New submission

GraphWorld: Long-Horizon Planning with World Models for End-to-End Autonomous Driving

Ziying Song, Caiyan Jia, Lin Liu, Lei Yang, Shengkai Zhang, Feiyang Jia, Fengda Zhao, Peiliang Wu, Shaoqing Xu, Chen Lv, Yadan Luo

Affiliations: Beijing Key Laboratory of Traffic Data Mining and Embodied Intelligence, School of Computer Science and Technology, Beijing Jiaotong University; School of Artificial Intelligence (School of Software), Yanshan University; School of Mechanical and Aerospace Engineering, Nanyang Technological University; University of Macau; The University of Queensland

AI summary: Proposes GraphWorld, which strengthens long-horizon planning through latent world modeling, models neighboring agents with an ego-centric interaction graph, and generates safe trajectories through world-state-conditioned planning, significantly reducing collision rates.

Comments: 16 pages, 5 figures

Abstract

End-to-end autonomous driving has made significant progress by unifying perception, prediction, and planning within a single learning framework, achieving strong performance in short-horizon decision making. However, most existing E2E-AD methods remain confined to short-horizon planning and lack the ability to model long-term temporal dependencies, which severely limits their generalization and safety in complex and highly interactive driving scenarios. In this work, we propose GraphWorld, an E2E-AD framework that explicitly enhances long-horizon planning through latent world modeling. We introduce an Ego-Centric Interaction Graph, which adaptively models critical neighboring agents based on spatial proximity and propagates relational context to planning queries via cross-node cross-attention. We present World-State-Conditioned Planning, which learns ego-centric latent world representations by modeling interactions between the ego vehicle and surrounding agents. This latent world state captures key interaction dynamics and safety-relevant semantics, and serves as a conditioning signal to guide long-horizon, safety-aware trajectory planning. Extensive experiments on Bench2Drive, NAVSIMv1/2, and nuScenes demonstrate that GraphWorld significantly reduces collision rates and improves long-horizon planning performance, validating its effectiveness in complex driving environments.

2606.16272 2026-06-16 cs.RO New submission

TopoRetarget: Interaction-Preserving Retargeting for Dexterous Manipulation

Jielin Wu, Shenzhe Yao, Guanqi He, Xiaohan Liu, Zhaoqing Zeng, Xiangrui Jiang, Han Yang, Wentao Zhang, Hang Zhao

Affiliation: IIIS, Tsinghua University

AI summary: Proposes TopoRetarget, which uses a sparse interaction graph and distance-weighted Laplacian deformation to preserve hand-object interaction structure during retargeting, improving reinforcement learning policies for dexterous manipulation.

Comments: Project page: https://toporetarget2026.github.io/TopoRetarget/

Abstract

Human hand-object demonstrations provide dense reference motions for training dexterous manipulation reinforcement learning (RL) policies through reference tracking. However, to use such demonstrations for RL policy learning, retargeting must preserve hand pose and task-relevant hand-object contact structure. Otherwise, contact and feasibility artifacts can degrade downstream RL policy performance. We introduce TopoRetarget, an interaction-preserving retargeting framework that uses a single set of parameters across diverse retargeting conditions while maintaining task-relevant hand-object interaction and adapting human demonstrations to dexterous robot hands. The method constructs a sparse interaction graph over hand and object keypoints and optimizes distance-weighted Laplacian deformation with directional consistency, kinematic constraints, and penetration handling. Evaluations show that the generated references improve both interaction fidelity and policy learning: TopoRetarget achieves the best contact precision and alignment over all baselines on the ContactPose Dataset, improves Pen-Spin training success by 40.6 percentage points over the existing baseline methods, and enables zero-shot transfer to Wuji Hand hardware on cube reorientation and pen spinning.

2606.16271 2026-06-16 cs.CV cs.LG New submission

Contrastive Learning for Seismic Horizon Tracking with Domain-Specific Priors

Alexandre Thouvenot, Lionel Boillot, Vincent Gripon

Affiliations: IMT Atlantique, LAB-STICC, UMR CNRS 6285; TotalEnergies, OneTech

AI summary: Proposes a self-supervised fusion of signal-based and texture-based approaches: signal-derived local horizon correspondences serve as domain-specific priors for training a texture-based deep learning model, and contrastive learning preserves horizon identity so that horizons can be tracked across discontinuities.

Comments: 5 pages, 5 figures. Submitted to the IEEE GRSL for possible publication

Abstract

Unsupervised 3D seismic horizon tracking faces a key limitation: signal-based propagators provide accurate trace-level alignment but often fail near faults, whereas texture-driven deep models are more robust to discontinuities, typically at the cost of labeled data requirements and reduced trace-level precision. We propose a self-supervised fusion of both paradigms in which signal-derived local horizon correspondences act as domain-specific priors to train a texture-based deep learning model. Specifically, we estimate reliable trace-to-trace flows from reflector slopes and use them to form positive pairs in a contrastive objective, while restricting training to high-confidence neighborhoods, optionally augmented with a fault mask. The objective is not to infer ambiguous correspondences close to discontinuities, but to preserve horizon identity across them. As a result, the network learns voxel-wise embeddings that preserve local signal continuity while enabling horizon propagation beyond discontinuities through similarity search. Experiments on the public F3 dataset and a faulted synthetic dataset achieve lower mean absolute error (MAE) than unsupervised baselines and competitive performance against a semi-supervised method using a single labeled slice.
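
The role of positive pairs in a contrastive objective can be sketched as follows. The abstract does not name the exact loss, so this example swaps in the widely used InfoNCE form as an assumption. The embeddings are plain Python lists, the positive is assumed to be the embedding of the voxel that a trace-to-trace flow maps the anchor onto, and the names `cosine`, `info_nce`, and `temperature` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss for one anchor: -log softmax score of the positive pair.

    The positive is assumed to lie on the same horizon as the anchor;
    the negatives are assumed to be embeddings from other horizons.
    """
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    peak = max(logits)  # log-sum-exp with max subtraction for stability
    log_norm = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_norm - logits[0]
```

The loss is small when the anchor is closer to its positive than to every negative, which is the property that lets a similarity search follow a horizon from one trace to the next.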

2606.16257 2026-06-16 cs.LG cs.AI New submission

Variance Reduction for Non-Log-Concave Sampling with Applications to Inverse Problems

M. Berk Sahin, Ahmet Ege Tanriverdi, Behzad Sharif, Abolfazl Hashemi

Affiliations: School of Electrical and Computer Engineering, Purdue University; School of Electrical and Computer Engineering, University of Southern California; School of Biomedical Engineering, Purdue University

AI summary: Addresses the high variance of stochastic gradients when sampling from non-log-concave distributions with a unified analysis of variance-reduction methods such as momentum, STORM, and PAGE, proving improved convergence rates in relative Fisher information and in squared total variation distance, and extending the analysis to inverse problems with score-based generative priors.

Comments: Accepted to Uncertainty in Artificial Intelligence (UAI) 2026

Abstract

Sampling from high-dimensional, non-log-concave distributions with unnormalized densities is a fundamental challenge in machine learning, particularly when the exact gradient of the potential is unavailable and must be approximated via stochastic gradients that exhibit high variance under a fixed budget of gradient computations per iteration. Although variance reduction techniques such as SGD with momentum, STORM, and PAGE have demonstrated improved convergence properties in non-convex optimization, their implications for sampling from non-log-concave distributions remain largely unexplored. In this work, we develop the first unified analysis of these estimators for sampling from non-log-concave distributions. We establish improved non-asymptotic convergence rates in $\varepsilon$-relative Fisher information and, under a Poincaré inequality assumption, in squared total variation distance, and further prove weak convergence to the target distribution. We extend our analysis to solving inverse problems with score-based generative priors. We empirically validate our theory and demonstrate that, under a fixed budget of gradient computations per iteration, variance-reduction techniques consistently improve sample quality in two standard imaging applications.

2606.16256 2026-06-16 cs.CV cs.LG New submission

KeepLoRA++: Continual Learning with Layer-Scaled Residual Gradient Adaptation

Mao-Lin Luo, Yi-Lin Zhang, Zi-Hao Zhou, Yankun Hong, Xialiang Tong, Mingxuan Yuan, Tong Wei, Min-Ling Zhang

Affiliations: School of Computer Science and Engineering, Southeast University; Key Laboratory of Computer Network and Information Integration, Southeast University, Ministry of Education; Huawei Noah’s Ark Lab

AI summary: Targets the conflict among retaining pre-trained knowledge, preserving knowledge of earlier tasks, and learning new knowledge in continual learning for pre-trained vision-language models. KeepLoRA++ restricts LoRA parameter updates to the residual subspace and applies shallow-to-deep layer scaling to balance the three, outperforming baselines on image classification, visual question answering, and video understanding.

Abstract

Continual learning for pre-trained vision-language models requires balancing three competing objectives: retaining pre-trained knowledge, preserving knowledge from a sequence of learned tasks, and maintaining the plasticity to acquire new knowledge. This paper presents KeepLoRA++, which balances these objectives through a unified dual-dimensional knowledge retention mechanism. We analyze the knowledge distribution of the Transformer architecture from both inter-layer and intra-layer perspectives. The inter-layer perspective examines how retention is distributed across layers, while the intra-layer perspective focuses on the parameter space within each layer. Our analysis reveals a structural property: general transferable knowledge is mainly encoded in the shallow layers and the principal subspace of the parameters, while task-specific adaptations are localized in the deep layers and the residual subspace. Motivated by this insight, KeepLoRA++ introduces a layer-scaled residual gradient adaptation method. New tasks are learned by restricting LoRA parameter updates to the residual subspace, combined with shallow-to-deep layer scaling, to prevent interference with previously acquired capabilities. Specifically, the gradient of a new task is projected onto a subspace orthogonal to both the principal subspace of the pre-trained model and the dominant directions of previous task features, while smaller update magnitudes are assigned to shallow layers and larger ones to deeper layers. Our theoretical analysis and empirical evaluations confirm that KeepLoRA++ successfully balances these three competing objectives, consistently outperforming representative baselines across image classification, visual question answering, and video understanding tasks.
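
The two ingredients named in this abstract, projecting a gradient onto the orthogonal complement of a protected subspace and scaling updates from shallow to deep layers, can be sketched as below. This is a simplified illustration, not the authors' method: the gradient is a flat vector, `basis` is assumed to be an orthonormal set spanning the protected directions, the linear schedule with bounds `low` and `high` is an assumption, and all function names are hypothetical.

```python
def project_out(grad, basis):
    """Remove from `grad` its components along an orthonormal `basis`.

    The result is orthogonal to every basis vector, i.e. it lies in the
    residual subspace relative to the protected directions.
    """
    residual = list(grad)
    for b in basis:
        coeff = sum(g * x for g, x in zip(residual, b))
        residual = [g - coeff * x for g, x in zip(residual, b)]
    return residual


def layer_scale(layer_idx, num_layers, low=0.1, high=1.0):
    """Assumed linear schedule: `low` at the first layer, `high` at the last."""
    if num_layers == 1:
        return high
    return low + (high - low) * layer_idx / (num_layers - 1)


def scaled_residual_update(grad, basis, layer_idx, num_layers):
    """Project the gradient to the residual subspace, then apply the layer scale."""
    scale = layer_scale(layer_idx, num_layers)
    return [scale * g for g in project_out(grad, basis)]
```

For example, `scaled_residual_update(grad, basis, 0, 12)` yields a much smaller step than the same call with `layer_idx=11`, so shallow layers move less than deep ones.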

2606.16255 2026-06-16 cs.CV New submission

UniDDT: Unifying Multimodal Understanding and Generation with Decoupled Diffusion Transformer

Shuai Wang, Liang Li, Yang Chen, Ruopeng Gao, Yao Teng, Limin Wang

Affiliations: Nanjing University; ByteDance Seed; University of Hong Kong

AI summary: Proposes UniDDT, which unifies visual semantic representation with a Noisy ViT encoder and uses a decoupled diffusion decoder to separate diffusion decoding from text decoding, balancing multimodal understanding and generation and achieving strong results on multiple benchmarks.

Comments: This work was completed in November 2025

Abstract

Unified Multimodal Models (UMMs) have emerged as a critical direction for general-purpose multimodal intelligence, integrating understanding and generation into a single framework. However, existing UMMs face prominent challenges: (1) the inherent learning conflicts between visual understanding and generation tasks, leading to suboptimal modeling in both tasks; (2) different understanding and generation visual spaces impeding scalability; (3) over-reliance on task-specific data that neglects the duality of text-image understanding and generation. To address these challenges, we propose UniDDT, which leverages a Noisy ViT encoder along with an LLM to unify semantic encoding for visual generation and understanding tasks, while employing a separate diffusion decoder to decouple diffusion decoding from text decoding. With this Noisy ViT encoder, UniDDT is able to leverage the latent space as a unified visual representation, enabling seamless compatibility between understanding and generation tasks. Thus, the scalability within the generation tasks and the semantic expressiveness within understanding tasks can be balanced. Also, we construct dual data structures from the same image-text pairs, fostering interdependence between the generation and understanding data to exploit their inherent duality. Extensive experiments demonstrate that UniDDT achieves effective unification of multimodal understanding and generation with enhanced semantic consistency and scalability. For visual generation tasks, our UniDDT achieves 0.87 GenEval score and 86.9 DPG overall score. For multimodal understanding tasks, our UniDDT achieves 1699.5 score on MME benchmark and 76.5 overall score on SEEDbench.

2606.16253 2026-06-16 cs.CV cs.AI New submission

Learned Image Compression for Vision-Language-Action Models

Hyeonjun Kim, Jegwang Ryu, Sangbeom Ha, Junhyeok Lee, Jun-Hyuk Kim, Hyemin Ahn, Jaeho Lee

Affiliations: POSTECH; Soongsil University; Chung-Ang University

AI summary: Proposes SPARC, which uses adaptive bitrate allocation and a tilted rate loss to preserve VLA robot control performance at low bandwidth, outperforming conventional codecs.

Abstract

Vision-language-action (VLA) models increasingly rely on high-frequency multi-camera observations, making visual communication a major bottleneck for real-time robotic control in bandwidth-constrained or distributed deployment settings. Existing image and video codecs, however, are designed to preserve generic visual fidelity rather than the control performance of downstream VLA policies. In this work, we introduce SPARC (SPatially Adaptive Rate Control), a learned image compression framework tailored for VLA-driven robots. Our key observation is that the importance of visual information varies substantially across both camera views and spatial regions within an image. Based on this observation, SPARC employs a lightweight temporal mask selector that adaptively allocates bitrate over latent representations according to task relevance while leveraging temporal context. We further introduce a tilted rate loss that stabilizes training by reducing the tendency of entropy-based objectives to over-suppress rare yet task-critical visual patterns. Experiments on diverse robotic benchmarks, including RoboCasa365, VLABench, and LIBERO, show that SPARC consistently achieves stronger control performance than conventional image/video codecs and recent learned compression methods under the same bitrate budget. We additionally demonstrate real-world deployment benefits in remote-control settings, where our method substantially improves the bitrate-success tradeoff.

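2606.16246 2026-06-16 cs.LG cs.AI cs.CL New submission

Data Augmentations for Data-Constrained Language Model Pretraining

Michael K. Chen, Xikun Zhang, Zhen Wang

Affiliations: UC San Diego; RMIT University

AI summary: Addresses the severe overfitting of standard autoregressive pretraining under data constraints with three categories of data augmentation (token-level noise, sequence permutations, and target offset prediction), which lower validation loss and support training for hundreds of epochs.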

Abstract

As AI labs approach a data ceiling where compute capacity outpaces the rate of new high-quality text generation, language model pretraining is shifting toward a data-constrained, compute-abundant regime that demands productive multi-epoch training on fixed corpora. Standard autoregressive (AR) pretraining overfits severely in this setting, reaching its optimum early and then continuously deteriorating. We investigate data augmentation as a regularizer to mitigate this overfitting and enable productive training for hundreds of epochs on the same data. We introduce three orthogonal categories of augmentation for AR pretraining: token-level noise (masking, random replacement), sequence permutations (right-to-left prediction, Fill-in-the-Middle), and target offset prediction ($x_{t+i}$ for $i > 1$). Through systematic ablations, we find that individual augmentations delay overfitting and lower validation loss relative to the baseline, with random token replacement achieving the best minimum loss among individual methods. Combining augmentation categories further lowers the minimum validation loss. Our experiments demonstrate that data augmentations mitigate AR pretraining's data inefficiency and offer a promising solution to the data-constrained regime. All code and data are available at https://github.com/michaelchen-lab/data-augmentations-for-pretraining
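
One augmentation from each of the three categories in this abstract can be sketched on a list of token ids. This is a minimal illustration under assumptions, not the code from the linked repository: the replacement distribution is assumed uniform over the vocabulary, and the names `random_replace`, `right_to_left`, and `offset_targets` are hypothetical.

```python
import random


def random_replace(tokens, vocab_size, rate, rng):
    """Token-level noise: swap each token for a random id with probability `rate`."""
    return [
        rng.randrange(vocab_size) if rng.random() < rate else t
        for t in tokens
    ]


def right_to_left(tokens):
    """Sequence permutation: reverse the sequence so the model predicts right to left."""
    return list(reversed(tokens))


def offset_targets(tokens, offset):
    """Target offset prediction: pair x_t with x_{t+offset}.

    offset=1 is ordinary next-token prediction; offset > 1 gives the
    shifted targets described in the abstract.
    """
    return [(tokens[t], tokens[t + offset]) for t in range(len(tokens) - offset)]
```

Passing an explicit `random.Random(seed)` instance as `rng` keeps the noise reproducible across epochs, which matters when the same fixed corpus is revisited many times.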

2606.16243 2026-06-16 cs.LG cs.CL New submission

LiFT: Local Search via Linear Programming for Overfitting-Controlled Transformers

Abhishek Shukla, Anikeit Khanna, Ankur Sinha, Faiz Hamid

Affiliations: Department of Management Sciences, Indian Institute of Technology Kanpur; Department of Civil Engineering, Indian Institute of Technology Kanpur; Operations and Decision Sciences, Indian Institute of Management Ahmedabad; Brij Disa Centre for Data Science and AI, Indian Institute of Management Ahmedabad

AI summary: Proposes a linear-programming-based local search framework that jointly updates model parameters and regularization hyperparameters through bilevel optimization, using validation gradients and Hessian information to construct a local descent direction that reduces overfitting while preserving training optimality. Experiments on fine-tuning GPT-2 Small show consistent improvements in test perplexity.

Comments: 22 pages, 6 figures, published in The 20th Learning and Intelligent Optimization Conference (LION 2026)

Abstract

This paper proposes a Linear Programming (LP)-based local search framework for fine-tuning pretrained transformer models with explicit control against overfitting. The approach formulates transformer fine-tuning as a bilevel optimization-based regularization problem, in which model parameters and regularization hyperparameters are jointly updated. Information collected during initial warm-up iterations, including validation gradients and training Hessian information, is used to construct a local descent direction by solving an LP that minimizes a scaled directional derivative while preserving training optimality. This validation-aware descent direction enables focused local updates of both parameters and regularization hyperparameters, reducing overfitting without requiring repeated full retraining cycles. The resulting method, termed Linear Programming-based Fine-Tuning (LiFT) for transformers, differs from conventional fine-tuning by systematically identifying task-specific updates rather than relying on heuristic or grid-based hyperparameter selection. Experiments on GPT-2 Small fine-tuned on WikiText-2 demonstrate that LiFT enables effective adaptation through selective tuning of transformer blocks and regularization parameters, yielding consistent improvements in test perplexity across multiple layer configurations and regularization settings, with particularly pronounced gains in overfitting-prone scenarios. Beyond empirical performance, LiFT establishes a principled connection between transformer fine-tuning, bilevel optimization, local search, and regularization theory.

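2606.16242 2026-06-16 cs.LG cs.CL New submission

Rapid Poison: Practical Poisoning Attacks Against the Rapid Response Framework

David Huang, Jaewon Chang, Avidan Shah, Prateek Mittal, Chawin Sitawarin

Affiliation: Princeton University

AI summary: Reveals poisoning attacks against the Rapid Response framework: prompt injection plants malicious samples in the training set, enabling targeted poisoning and concept-based backdoor attacks, and a poisoning rate of only 1% yields false positive rates of up to 100% and false negative rates of up to 96%.

Comments: Spotlight at ICML 2026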

Abstract

The Rapid Response (RR) framework, deployed in production systems, including Anthropic's ASL-3 safeguards, continuously improves jailbreak-detection classifiers. When new jailbreaks emerge that bypass these classifiers, Rapid Response generates synthetic variants for training, helping the model generalize from the new attacks and quickly adapt. We reveal that prompt injection can infiltrate this pipeline to deliver poisoned samples into the classifier's training set, enabling two attack objectives: (I) targeted poisoning attacks that create false positives on harmless samples by categorizing them as a jailbreak, with a specific desired feature (e.g., certain formatting, subject, or keyword), (II) concept-based backdoor attacks that induce false negatives on jailbreak inputs, generalizing even to jailbreaks from attack strategies the defender explicitly trained against, when the backdoor trigger is present. Importantly, our threat model restricts adversaries to modifying only jailbreak samples (not benign data or labels), a constraint unexplored by prior work that makes the second objective particularly challenging. We address this with Omission Attack, which exploits a new phenomenon: when training on concept-absent unsafe samples, the classifier misassociates that concept's presence with the safe label. Both attacks cause substantial and in some cases near-complete label flipping at only a 1% poisoning rate, achieving up to 100% false positive rates and up to 96% false negative rates.

2606.16241 2026-06-16 cs.CV 新提交

Structure-Semantic Co-optimized Latent Diffusion Model for Fast Visual Anagram Synthesis

结构-语义协同优化的潜扩散模型用于快速视觉字谜合成

Xiang Gao, Yunpeng Jia

发表机构 * School of Digital Media and Design Arts, Beijing University of Posts and Telecommunications(北京邮电大学数字媒体与设计艺术学院)

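AI Summary Proposes S2CO-Anagram, a structure-semantic co-optimization framework that uses null-text structure alignment, semantic enhancement, and attention-guided noise fusion to generate high-resolution visual anagram images with high visual harmony and semantic fidelity at very low computational cost.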

详情
AI中文摘要

视觉字谜是一种有趣的艺术创作形式,其中单个图像在翻转或旋转等变换下呈现不同的概念解释。最近的工作通过利用预训练的文本到图像(T2I)扩散模型实现了视觉字谜合成,但仍存在几个关键限制,包括计算效率低、美学质量次优以及语义保真度和表现力弱。本文专注于以最小的计算成本生成视觉质量显著提升的视觉字谜,从而推进幻觉数字艺术的智能创作。为了提高图像分辨率同时减少时间开销,我们将基于像素的T2I模型中的先进并行去噪算法适配到对抗性蒸馏的潜模型上,并相应地提出了一种结构-语义协同优化(S2CO)框架来抵消随之而来的视觉退化。作为我们方法的核心,S2CO框架包含三个关键创新:(I)空文本结构对齐优化;(II)语义增强优化;(III)注意力引导噪声融合。基于这些组件,我们的方法称为S2CO-Anagram,能够生成比相关SOTA方法具有显著更优视觉和谐性和语义保真度的高分辨率字谜图像,同时实现更快的推理速度。代码将公开。

英文摘要

A visual anagram is an intriguing form of art creation in which a single image presents different conceptual interpretations under transformations such as flipping or rotation. Recent work has achieved visual anagram synthesis by leveraging pretrained text-to-image (T2I) diffusion models, yet still suffers from several key limitations, including computational inefficiency, suboptimal aesthetic quality, and weak semantic fidelity and expressiveness. This work focuses on generating visual anagrams with substantially improved visual quality at minimal computational cost, thereby advancing intelligent creation of illusionary digital art. To increase image resolution while reducing time overhead, we adapt the cutting-edge parallel denoising algorithm from pixel-based T2I models to adversarially distilled latent-based ones, and accordingly propose a structure-semantic co-optimization (S2CO) framework to counteract the consequent visual degradation. As the core of our approach, the S2CO framework comprises three key innovations: (i) null-text structure alignment optimization; (ii) semantic enhancement optimization; and (iii) attention-guided noise fusion. Building upon these components, our method, dubbed S2CO-Anagram, generates higher-resolution anagram images with noticeably better visual harmony and semantic faithfulness than related SOTA approaches, while achieving substantially faster inference. Code will be publicly available.

2606.16240 2026-06-16 cs.CL cs.LG 新提交

Creative Collision: Directorial Persona Steering and Competition in Large Language Models

创意碰撞:大型语言模型中的导演人格引导与竞争

Subramanyam Sahoo, Justin Shenk

发表机构 * AI Safety Camp(AI安全训练营)

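AI Summary Studies steering language model generation by superimposing two semantically opposing directorial persona vectors (Spielberg and Scorsese), finding that the Spielberg vector dominates moral valence, that intermediate mixing points improve coherence, and that both personas share a moral-tone substrate at a specific layer.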

Comments Accepted at ICML 2026 Workshop on Human-AI Co-Creativity

详情
AI中文摘要

激活引导已成为在推理时塑造大型语言模型行为的强大工具,但以往大多数工作向残差流注入单一的语义方向。我们研究了两种语义相反的引导向量叠加的丰富场景——我们称之为“创意碰撞”。具体而言,我们通过在精心策划的剧本语料库上进行均值差异激活对比,构建了史蒂文·斯皮尔伯格(乐观、救赎的道德价值)和马丁·斯科塞斯(黑暗、道德模糊)的导演人格向量,然后通过标量混合参数$α\in[0,1]$和引导系数$λ$在两者之间进行插值。在五个评估轴(道德价值、生成连贯性、表面风格、方向主导性和向量几何)上,出现了三个主要发现:(i)斯皮尔伯格的表征特征表现出稳健的“方向主导性”,在几乎整个插值范围内抑制了斯科塞斯的道德影响;(ii)中间碰撞点在高$λ$下相对于纯单导演引导反而提高了生成连贯性;(iii)两种人格在40层仅解码器Transformer的第28层达到最大定位,揭示了一个共享的“道德基调基底”。这些结果阐明了Transformer残差流中竞争语义方向的几何结构,并对可控创意生成和价值对齐叙事合成具有直接影响。

英文摘要

Activation steering has emerged as a powerful tool for shaping the behaviour of large language models at inference time, yet most prior work injects a single semantic direction into the residual stream. We study the richer setting in which two semantically opposing steering vectors are superimposed, a regime we call Creative Collision. Concretely, we construct directorial persona vectors for Steven Spielberg (optimistic, redemptive moral valence) and Martin Scorsese (dark, morally ambiguous) via mean-difference activation contrast on curated screenplay-derived corpora, then interpolate between them with a scalar mixing parameter $\alpha \in [0,1]$ and a steering coefficient $\lambda$. Across five evaluation axes (moral valence, generation coherence, surface style, directional dominance, and vector geometry), three principal findings emerge: (i) Spielberg's representational signature exhibits robust directional dominance, suppressing Scorsese's moral influence across almost the entire interpolation range; (ii) intermediate collision points paradoxically improve generation coherence relative to pure single-director steering at high $\lambda$; and (iii) both personas localise maximally to layer 28 of a 40-layer decoder-only transformer, revealing a shared moral-tone substrate. These results illuminate the geometry of competing semantic directions in transformer residual streams and have direct implications for controllable creative generation and value-aligned narrative synthesis.
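The abstract above describes the construction only in words, so the following is a minimal sketch of one plausible reading rather than the authors' implementation. It assumes that each persona vector is the mean activation on that director's corpus minus the mean activation on a contrast corpus, and that the two vectors are combined by linear interpolation before being added to a residual-stream vector. The function names, the contrast corpus, the interpolation formula, and the toy two-dimensional activations are all hypothetical.

```python
def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def mean_difference(persona_acts, contrast_acts):
    """Steering direction: mean(persona activations) - mean(contrast activations)."""
    mu_persona = mean_vector(persona_acts)
    mu_contrast = mean_vector(contrast_acts)
    return [p - c for p, c in zip(mu_persona, mu_contrast)]


def collide(hidden, v_a, v_b, alpha, lam):
    """Return hidden + lam * (alpha * v_a + (1 - alpha) * v_b).

    ASSUMPTION: linear interpolation with mixing weight alpha in [0, 1] and
    steering coefficient lam; the abstract does not give the exact formula.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [h + lam * (alpha * a + (1.0 - alpha) * b)
            for h, a, b in zip(hidden, v_a, v_b)]


# Toy activations (made up for illustration, not data from the paper).
contrast = [[0.0, 0.0], [2.0, 0.0]]
spielberg_acts = [[1.0, 0.0], [3.0, 2.0]]
scorsese_acts = [[0.0, -1.0], [0.0, -3.0]]

v_spielberg = mean_difference(spielberg_acts, contrast)
v_scorsese = mean_difference(scorsese_acts, contrast)

# Sweep the mixing parameter from pure Scorsese (0.0) to pure Spielberg (1.0).
for alpha in (0.0, 0.5, 1.0):
    steered = collide([0.0, 0.0], v_spielberg, v_scorsese, alpha, lam=2.0)
    print(alpha, steered)
```

In a real model the same addition would be applied to the residual stream at a chosen layer on every forward pass; the sketch leaves that hook out because it depends on the model library in use.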

2606.16236 2026-06-16 cs.LG cs.NE 新提交

Evolutionary Bilevel Reward Shaping for Generalization in Reinforcement Learning

进化双层奖励塑形以增强强化学习的泛化能力

Ekasit Usaratniwart, Xilin Gao, Marc Ong, Youhei Akimoto

发表机构 * University of Tsukuba(筑波大学) RIKEN Center for Advanced Intelligence Project(理化学研究所革新智能综合研究中心)

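AI Summary Proposes GERS, a bilevel optimization method that adjusts the reward function using scalar validation feedback, improving the generalization of reinforcement learning to unseen environments under restricted trajectory access.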

Comments Accepted at PPSN 2026

详情
AI中文摘要

强化学习(RL)在部署于与训练环境不同的环境时,通常会出现性能下降。现有技术如域随机化(DR)可以缓解这一问题,但需要访问多样化的训练环境和完整的轨迹可观测性,这些假设在隐私保护或受限场景中无法满足,此时仅能获得标量性能指标。我们提出通过进化奖励塑形实现泛化(GERS),一种双层优化方法,仅使用来自验证环境的标量反馈来改善在未见测试环境上的泛化能力。在下层,由上层塑形的奖励函数引导的RL智能体在具有可访问轨迹数据的有限训练环境集上学习策略;在上层,CMA-ES优化奖励塑形参数,以最大化在无法访问轨迹的单独验证环境上的累积未塑形奖励。在连续控制任务上的结果表明,GERS在未见测试环境上优于标准RL基线。尽管DR将GERS的训练和验证环境组合集视为需要轨迹访问的单一训练集,而GERS无法访问验证轨迹,但GERS的性能与DR相当。这些结果证实,GERS在受限数据访问约束下有效增强了泛化能力。

英文摘要

Reinforcement learning (RL) often suffers from performance degradation when deployed in environments that differ from those encountered during training. Existing techniques such as domain randomization (DR) mitigate this, but require access to diverse training environments and full trajectory observability, assumptions that fail in privacy-preserving or restricted scenarios where only scalar performance metrics are available. We propose Generalization via Evolutionary Reward Shaping (GERS), a bilevel optimization approach to improve generalization on unseen test environments using only scalar feedback from validation environments. At the lower level, an RL agent guided via a reward function shaped by the upper level learns a policy on a limited set of training environments with accessible trajectory data; at the upper level, CMA-ES optimizes the reward shaping parameters to maximize the cumulative unshaped reward on separate validation environments for which trajectory access is unavailable. Results on continuous control tasks indicate that GERS outperforms the standard RL baseline on unseen test environments. GERS performance is comparable to DR, despite DR treating the combined set of training and validation environments of GERS as a single training set that requires trajectory access, whereas GERS cannot access validation trajectories. These results confirm that GERS effectively enhances generalization under restricted data access constraints.

2606.16234 2026-06-16 cs.CV cs.AI 新提交

Propagating Structural Guidance: Synthesizing Fluorescein Angiography from Fundus Images and Sparse OCT Scans

传播结构引导:从眼底图像和稀疏OCT扫描合成荧光素血管造影

Tengfei Ma, Ruiqi Wu, Chenran Zhang, Ye Geng, Na Su, Xiangyuan Duanmu, Tao Zhou, Yi Zhou, Wen Fan

发表机构 * School of Computer Science and Engineering, Southeast University(东南大学计算机科学与工程学院) Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications, Ministry of Education(教育部新一代人工智能技术及其跨学科应用重点实验室) Tianyuan Honors School, Nanjing Medical University(南京医科大学天元荣誉学院) Nanjing University of Science and Technology(南京理工大学) Department of Ophthalmology, The First Affiliated Hospital of Nanjing Medical University(南京医科大学第一附属医院眼科)

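AI Summary Proposes a framework that synthesizes fluorescein angiography (FFA) from color fundus photographs (CFP) and sparse OCT scans, using spatially aligned cross-modal fusion and token-level contrastive learning to achieve non-invasive FFA synthesis and improve downstream diagnostic performance.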

Comments Accepted to MICCAI 2026 (Early Accept)

详情
AI中文摘要

眼底荧光素血管造影(FFA)对于评估视网膜血管异常至关重要,但其获取具有侵入性且并非总是可行。相比之下,彩色眼底摄影(CFP)无创且广泛可用,这推动了CFP到FFA合成的研究。然而,先前的工作仅依赖CFP表面纹理,从根本上限制了重建功能性血管信息和细微病理变化的能力。为了解决这个问题,我们提出了一种新颖的框架,该框架利用光学相干断层扫描(OCT)提供的结构引导,从CFP合成FFA。我们构建了一个包含来自3,676只患者眼睛的配对CFP、FFA和OCT的多模态视网膜成像数据集——这是视网膜成像中首个三模态对齐数据集。为了弥合OCT和眼底模态之间的空间差距,我们提出了空间对齐跨模态融合(SACMF)模块,该模块将深度分辨的OCT特征投影到眼底平面,并通过自适应层归一化将其注入CFP编码器。除了特征融合,我们还引入了令牌级跨模态对齐(TCMA),这是一种令牌级对比学习策略,在对应空间位置显式对齐CFP和FFA表示。我们的方法相比最先进的方法实现了更优的合成性能。此外,大量实验表明,我们方法合成的FFA图像在提升下游疾病诊断性能方面比现有方法带来更大的改进,突显了我们的方法作为常规工作流程中无创决策支持工具的临床潜力。代码可在https://github.com/while-plus/OCT-guide-FFA-Syn获取。

英文摘要

Fundus fluorescein angiography (FFA) is critical for assessing retinal vascular abnormalities, but its acquisition is invasive and not always feasible. In contrast, color fundus photography (CFP) is non-invasive and widely accessible, which has motivated studies on CFP-to-FFA synthesis. However, prior works rely solely on CFP surface texture, fundamentally limiting the ability to reconstruct functional vascular information and subtle pathological changes. To address this, we propose a novel framework that synthesizes FFA from CFP with structural guidance provided by optical coherence tomography (OCT). We construct a multi-modal retinal imaging dataset with paired CFP, FFA, and OCT from 3,676 patient eyes, the first tri-modally aligned dataset in retinal imaging. To bridge the spatial gap between OCT and fundus modalities, we propose a Spatially Aligned Cross-Modal Fusion (SACMF) module that projects depth-resolved OCT features onto the fundus plane and injects them into the CFP encoder via adaptive layer normalization. Beyond feature fusion, we further introduce Token-wise Cross-Modality Alignment (TCMA), a token-level contrastive learning strategy that explicitly aligns CFP and FFA representations at corresponding spatial positions. Our method achieves superior synthesis performance compared to state-of-the-art methods. Moreover, extensive experiments demonstrate that the FFA images synthesized by our approach bring greater improvements in downstream disease diagnosis performance than existing methods, highlighting the clinical potential of our approach as a non-invasive decision-support tool in routine workflows. The code is available at https://github.com/while-plus/OCT-guide-FFA-Syn.

2606.16232 2026-06-16 cs.RO 新提交

PolyMerge: Compressing 3D Gaussian Splats with Polytope Coverings for Provably Safe Resource-Constrained Navigation

PolyMerge: 用多面体覆盖压缩3D高斯泼溅以实现可证明安全的资源受限导航

Jihoon Hong, Chih-Yuan Chiu, Sara Fridovich-Keil, Glen Chou

发表机构 * Georgia Institute of Technology(佐治亚理工学院)

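AI Summary Proposes PolyMerge, which converts large 3D Gaussian Splatting models into a covering of convex polytopes guaranteed to cover all obstacles in the original model, and combines it with control barrier functions for real-time safe path planning, validated on a Crazyflie drone.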

Journal ref IEEE Robotics and Automation Letters, vol. 11, no. 7, pp. 8512-8519, July 2026

详情
AI中文摘要

障碍物避免对于安全导航和运动规划至关重要。最近的辐射场重建方法能够以高保真度进行物体检测和建模,但对于机载感知路径规划而言,仍然过于消耗内存和计算资源。为了解决这些限制,我们提出PolyMerge,将场景的大规模、逼真的3D高斯泼溅(3DGS)模型转换为凸多面体的轻量级表示,这些多面体的并集可证明地过度逼近原始3DGS模型中的所有障碍物。PolyMerge调整多面体数量以权衡保守性和计算成本,并与控制障碍函数(CBF)集成以规划无碰撞路径。我们在Crazyflie无人机的仿真和硬件实验中展示了PolyMerge,该无人机在严重的机载计算约束下使用PolyMerge实时计算并跟踪安全轨迹,在保证安全的同时在速度上优于基线。有关我们的代码和视频,请访问https://athlon76.github.io/PolyMerge-website/。

英文摘要

Obstacle avoidance is essential for safe navigation and motion planning. Recent radiance field reconstruction methods enable object detection and modeling with high fidelity, but remain too memory- and compute-intensive for on-board perception-based path planning. To address these limitations, we propose PolyMerge to convert a large, photorealistic 3D Gaussian Splatting (3DGS) model of a scene into a lightweight representation of convex polytopes whose union provably over-approximates all obstacles in the original 3DGS model. PolyMerge tunes the polytope count to trade off conservativeness and compute cost, and integrates with control barrier functions (CBFs) to plan collision-free paths. We showcase PolyMerge in simulation and hardware experiments on a Crazyflie drone, which uses PolyMerge to compute and follow safe trajectories in real time under severe onboard compute constraints, outperforming baselines in speed while guaranteeing safety. For our code and videos, visit https://athlon76.github.io/PolyMerge-website/.

2606.16231 2026-06-16 cs.LG cs.AI 新提交

From Tokens to Regions: CUDA-Sensitive Instruction Tuning for GPU Kernel Generation

从令牌到区域:面向GPU内核生成的CUDA敏感指令微调

Wentao Chen, Jiace Zhu, Xing Zhe Chai, Zeng Qu, Qiaoling Xiao, Liucheng Duan, An Zou

发表机构 * Shanghai Jiao Tong University(上海交通大学) Biren Technology(壁仞科技)

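AI Summary Proposes CuSeT, which uses adaptive token-level masking and region-aware sample reweighting within a simple SFT framework to improve the functional correctness of LLM-generated CUDA kernels.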

详情
AI中文摘要

高性能CUDA内核对于可扩展的AI系统至关重要，而大型语言模型(LLM)由于严格且隐式的执行约束，仍然难以生成正确的内核。现有的基于LLM的方法要么依赖昂贵的智能体或强化学习(RL)流水线，要么采用监督微调(SFT)目标，但未能显式建模CUDA敏感性，即与执行约束紧密耦合的代码令牌或区域。在这项工作中，我们从令牌置信度模式的角度研究CUDA敏感性，表明CUDA敏感性出现在令牌和区域两个层面，其中大多数CUDA敏感令牌以高置信度被预测，而较小的低置信度子集形成对应于执行关键结构的区域。这些发现表明，有效的CUDA内核生成应同时利用高置信度的CUDA敏感令牌并保留低置信度的CUDA敏感区域。基于这些见解，我们提出了CUDA-Sensitive Instruction Tuning (CuSeT)，一种在简单SFT框架内的低成本后训练方法。CuSeT遵循“从令牌到区域”的原则，结合了自适应令牌级掩码和区域感知样本重加权。实验表明，CuSeT在多个模型系列和规模上一致地提高了功能正确性，优于标准SFT和高级SFT变体，同时以显著更低的推理成本达到了与前沿CUDA内核生成模型相竞争的性能。

英文摘要

High-performance CUDA kernels are essential for scalable AI systems, while Large Language Models (LLMs) still struggle to generate correct kernels due to strict and implicit execution constraints. Existing LLM-based approaches either rely on costly agentic or reinforcement-learning (RL) pipelines, or adopt supervised fine-tuning (SFT) objectives that fail to explicitly model CUDA sensitivity, namely code tokens or regions tightly coupled with execution constraints. In this work, we investigate CUDA sensitivity from the perspective of token confidence patterns, showing that CUDA sensitivity appears at both token and region levels, where most CUDA-sensitive tokens are predicted with high confidence, while a smaller low-confidence subset forms regions corresponding to execution-critical structures. These findings suggest that effective CUDA kernel generation should both leverage high-confidence CUDA-sensitive tokens and preserve low-confidence CUDA-sensitive regions. Building on these insights, we propose CUDA-Sensitive Instruction Tuning (CuSeT), a low-cost post-training method within a simple SFT framework. CuSeT follows the principle of "from tokens to regions" by combining adaptive token-level masking with region-aware sample reweighting. Experiments show that CuSeT consistently improves functional correctness across multiple model families and scales, outperforming standard SFT and advanced SFT variants, while achieving competitive performance against frontier CUDA kernel generation models with substantially lower inference cost.

2606.16226 2026-06-16 cs.LG 新提交

Prediction of Runtime Parameters of Parallel Chemistry Applications via Active and Generative Learning

通过主动和生成学习预测并行化学应用的运行时参数

Tanzila Tabassum, Omer Subasi, Ajay Panyala, Epiya Ebiapia, Gerald Baumgartner, Erdal Mutlu, P Sadayappan, Karol Kowalski

发表机构 * Louisiana State University(路易斯安那州立大学) Pacific Northwest National Laboratory(太平洋西北国家实验室) University of Utah(犹他大学)

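AI Summary Proposes machine learning approaches based on active and generative learning, combined with gradient boosted regression tree models, to predict the runtime parameters of parallel chemistry computations, achieving a MAPE as low as 0.023 and an R² as high as 99.9% on CCSD computations.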

详情
AI中文摘要

在这项工作中,我们开发了两种主要的基于机器学习的方法来预测高度可扩展的并行化学计算的运行时参数。这些方法将主动学习和生成学习与经验确定的梯度提升回归树模型相结合,该模型是从丰富的机器学习模型套件中选出的。当在耦合簇单双激发计算上进行评估时,我们的模型实现了低至0.023的平均绝对误差百分比(MAPE)和高达99.9%的决定系数。此外,当与主动学习相结合以缓解缺乏大量训练数据的问题时,我们的模型在使用原始数据集的20-25%时,MAPE约为0.2。

英文摘要

In this work, we develop two main machine-learning-based approaches to predict the runtime parameters of highly scalable parallel chemistry computations. These approaches employ active and generative learning together with empirically determined gradient boosted regression tree models chosen from a rich suite of machine learning models. When evaluated on Coupled-Cluster with Singles and Doubles computations, our models achieve a mean absolute percentage error (MAPE) as low as 0.023 and a coefficient of determination as high as 99.9%. Furthermore, when combined with active learning to mitigate the lack of large amounts of training data, our models achieve a MAPE of about 0.2 with 20-25% of the original dataset.
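For readers unfamiliar with the two metrics quoted above, the following is a small sketch of how MAPE and the coefficient of determination are commonly computed. It assumes MAPE is expressed as a fraction (so 0.023 would mean 2.3%); the abstract does not state which convention the authors use. The function names and the toy runtimes are made up for illustration and are not taken from the paper.

```python
def mape(y_true, y_pred):
    """Mean absolute percentage error, returned as a fraction.

    ASSUMPTION: fraction convention (0.05 means 5%). Targets must be non-zero.
    """
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical measured vs. predicted runtimes in seconds.
measured = [100.0, 200.0, 400.0]
predicted = [110.0, 190.0, 400.0]

print("MAPE:", mape(measured, predicted))
print("R^2:", r_squared(measured, predicted))
```

MAPE weights each error relative to the size of its target, which suits runtimes that span several orders of magnitude, whereas R² is dominated by the largest targets; reporting both, as the abstract does, covers the two views.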

2606.16222 2026-06-16 cs.AI cs.LG 新提交

Latent Thought Flow: Efficient Latent Reasoning in Large Language Models

潜在思维流:大型语言模型中的高效潜在推理

Xiandong Zou, Jing Huang, Jianshu Li, Pan Zhou

发表机构 * Singapore Management University(新加坡管理大学) Ant Group(蚂蚁集团)

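AI Summary Proposes Latent Thought Flow (LTF), which models reasoning as variable-length continuous trajectories and trains a sampler with a continuous GFlowNet to match a reward-induced posterior, improving accuracy by 9.5% while reducing reasoning length by 27.2% on average.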

详情
AI中文摘要

大型语言模型(LLMs)越来越依赖中间推理,然而显式的思维链(CoT)存在语言空间瓶颈:每个思维必须解码为token,导致高推理开销。潜在推理将思考过程转移到连续空间,但现有方法大多学习确定性或奖励最大化路径,缺乏在具有不同正确性和成本的轨迹间分配概率的原则性方法。我们提出潜在思维流(LTF),将推理建模为可变长度连续轨迹,并训练采样器以匹配由答案质量和计算成本定义的奖励诱导后验。我们使用具有随机潜在转移的连续GFlowNet实例化该方法。为处理稀疏答案监督,我们引入熵加权子轨迹平衡目标以获取中间奖励,以及参考先验正则化器以锚定探索。在微调和迁移学习设置下的实验表明,与强潜在推理基线相比,LTF在平均减少推理长度27.2%的同时,准确率提升9.5%,优于显式CoT和潜在推理基线。

英文摘要

Large Language Models (LLMs) increasingly rely on intermediate reasoning, yet explicit Chain-of-Thought (CoT) suffers from a linguistic space bottleneck: each thought must be decoded into tokens, causing high inference overhead. Latent reasoning moves deliberation into continuous space, but existing methods mostly learn deterministic or reward-maximizing paths, lacking a principled way to allocate probability across trajectories with different correctness and costs. We propose Latent Thought Flow (LTF), which models reasoning as variable-length continuous trajectories and trains a sampler to match a reward-induced posterior over answer quality and computation cost. We instantiate this with a continuous GFlowNet using stochastic latent transitions. To handle sparse answer supervision, we introduce an Entropy-Weighted Subtrajectory Balance objective for intermediate rewards and a reference-prior regularizer to anchor exploration. Experiments under finetuning and transfer learning settings show that LTF outperforms explicit CoT and latent reasoning baselines, improving accuracy by 9.5% while reducing reasoning length by 27.2% on average compared with strong latent reasoning baselines.

2606.16215 2026-06-16 cs.CL cs.AI cs.LG 新提交

PACT: Privileged Trace Co-Training for Multi-Turn Tool-Use Agents

PACT: 多轮工具使用智能体的特权轨迹协同训练

Zhenbang Du, Jun Luo, Zhiwei Zheng, Xiangchi Yuan, Kejing Xia, Dachuan Shi, Qirui Jin, Qijia He, Shaofeng Zou, Yingbin Liang, Wenke Lee

发表机构 * Georgia Institute of Technology(佐治亚理工学院) Ohio State University(俄亥俄州立大学) University of Pennsylvania(宾夕法尼亚大学) Arizona State University(亚利桑那州立大学)

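AI Summary Proposes PACT, a framework that uses privileged traces (expert traces) to provide dense supervision at training time, combining trace-conditioned RL with a component-aware SFT loss so that rollouts do not depend on traces, and markedly improving the performance of multi-turn tool-use agents.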

Comments Project page: https://zhenbangdu.github.io/pact-project-page/

详情
AI中文摘要

多轮工具使用智能体必须在多个交互轮次中进行推理、调用工具并适应观察结果。对此类智能体进行后训练具有挑战性,因为强化学习通常面临稀疏奖励和弱信用分配问题(尽管匹配仅提示推理设置),而基于专家轨迹的监督微调提供密集过程监督,但可能过度约束模型到固定轨迹。为解决这一问题,我们提出PACT,一种用于多轮工具使用智能体的特权轨迹协同训练框架。关键思想是仅将专家轨迹作为训练时的优化信号,而非推理时的提示。PACT保持推理生成仅基于提示,然后通过两个互补信号利用专家轨迹指导优化:一个轨迹条件RL代理,在专家轨迹上下文中评估仅提示轨迹;一个组件感知SFT损失,以退火强度监督推理前缀和工具调用。为减少对训练时轨迹上下文的过度依赖,PACT进一步引入仅提示锚定。我们还提供了一个潜在轨迹视角,连接两个基于轨迹的目标,并解释专家轨迹如何在推理生成中不被使用的情况下指导优化。在FTRL、BFCL和ToolHop上的实验表明,PACT持续优于强SFT和RL基线,凸显了特权轨迹协同训练在多轮工具使用学习中的价值。

英文摘要

Multi-turn tool-use agents must reason, call tools, and adapt to observations across several interaction turns. Post-training such agents is challenging, as reinforcement learning often suffers from sparse rewards and weak credit assignment despite matching the prompt-only inference setting, while supervised fine-tuning on expert traces provides dense process supervision but can over-constrain the model to fixed trajectories. To tackle this, we propose PACT, a Privileged trAce Co-Training framework for multi-turn tool-use agents. The key idea is to use expert traces only as training-time optimization signals rather than rollout-time hints. PACT keeps rollout generation prompt-only, then uses expert traces to guide optimization through two complementary signals: a trace-conditioned RL surrogate that evaluates prompt-only rollouts under expert-trace context, and a component-aware SFT loss that supervises reasoning prefixes and tool calls with annealed strength. To reduce over-reliance on the training-only trace context, PACT further introduces prompt-only anchoring. We also provide a latent-trace view that connects the two trace-based objectives and explains how expert traces can guide optimization without being used during rollout generation. Experiments on FTRL, BFCL, and ToolHop show that PACT consistently improves over strong SFT- and RL-based baselines, highlighting the value of privileged trace co-training for multi-turn tool-use learning.

2606.16212 2026-06-16 cs.CV cs.AI 新提交

LUCID: Learned Undersampling-Adaptive Consistency-Guided Inference with Deterministic Flow Matching for Sparse-View CT Reconstruction

LUCID:基于确定性流匹配的学习型欠采样自适应一致性引导稀疏视角CT重建

Jigang Duan, Jiayi Wang, Heran Wang, Ping Yang, Genwei Ma, Xing Zhao

发表机构 * School of Mathematical Sciences, Capital Normal University(首都师范大学数学科学学院) National Center for Applied Mathematics Beijing, Capital Normal University(首都师范大学北京国家应用数学中心) Academy for Multidisciplinary Studies, Capital Normal University(首都师范大学交叉科学研究院)

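AI Summary Proposes LUCID, a framework that uses a Flow Matching generative prior and a sparsity-adaptive strategy, with a degradation-matched initial state and projection-domain consistency correction, to achieve stable sparse-view CT reconstruction across sampling densities while reducing artifacts and hallucinated structures.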

详情
AI中文摘要

稀疏视角CT通过获取更少的投影视图来减少辐射剂量和扫描时间,但角度欠采样使得重建严重病态,导致条纹伪影、结构模糊和细节丢失。现有的监督方法通常受限于特定的采样设置,而生成方法在严重欠采样下可能引入解剖上不一致的幻觉样结构。我们提出Lucid,一种基于流匹配生成先验的稀疏自适应、一致性引导重建框架,用于稀疏视角CT。Lucid仅在高品质CT图像上训练,学习高斯分布与高品质CT图像分布之间的连续传输,与视角采样无关。在推理过程中,显式纳入采样稀疏度水平,以调整单个预训练模型的生成轨迹。具体地,Lucid通过稀疏度加权融合稀疏视角FBP图像和高斯噪声构建退化匹配的初始状态,执行稀疏度调制的流匹配更新,并在每次先验更新后应用投影域数据一致性校正。在多种稀疏视角设置下的实验表明,Lucid在不同采样密度下实现稳定的重建性能,提高图像质量和结构保真度,并降低生成式稀疏视角CT重建中幻觉样结构的风险。

英文摘要

Sparse-view CT reduces radiation dose and scanning time by acquiring fewer projection views, but angular undersampling makes reconstruction severely ill-posed, causing streak artifacts, structural blurring, and loss of fine details. Existing supervised methods are often tied to specific sampling settings, whereas generative methods may introduce anatomically inconsistent hallucination-like structures under severe undersampling. We propose Lucid, a sparsity-adaptive, consistency-guided reconstruction framework based on a Flow Matching generative prior for sparse-view CT. Lucid is trained only on high-quality CT images to learn a continuous transport between a Gaussian distribution and the high-quality CT image distribution, independent of view sampling. During inference, the sampling sparsity level is explicitly incorporated to adapt the generative trajectory of a single pretrained model. Specifically, Lucid constructs a degradation-matched initial state by sparsity-weighted fusion of the sparse-view FBP image and Gaussian noise, performs sparsity-modulated Flow Matching updates, and applies projection-domain data-consistency correction after each prior update. Experiments under multiple sparse-view settings show that Lucid achieves stable reconstruction performance across different sampling densities, improves image quality and structural fidelity, and reduces the risk of hallucination-like structures in generative sparse-view CT reconstruction.

2606.16211 2026-06-16 cs.CL 新提交

Weaving Multi-Source Evidence for Biomedical Reasoning: The BioMedHop Benchmark and BioWeave Framework

编织多源证据进行生物医学推理:BioMedHop基准与BioWeave框架

Xingyu Tan, Shiyuan Liu, Xiaoyang Wang, Qing Liu, Xiwei Xu, Xin Yuan, Liming Zhu, Wenjie Zhang

发表机构 * University of New South Wales(新南威尔士大学) CSIRO(澳大利亚联邦科学与工业研究组织) University of Technology Sydney(悉尼科技大学)

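AI Summary Proposes the BioMedHop benchmark and the BioWeave framework for evaluating and performing biomedical reasoning over multi-source evidence; BioWeave outperforms the ToG-2 baseline by 10.5% on the benchmark and improves the performance of smaller models.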

详情
AI中文摘要

生物医学问答(QA)日益需要对交互实体进行推理,其中支持证据分散在生物医学知识图谱、文献文档和网络可访问资源中。然而,现有的生物医学QA基准主要关注考试式知识、文献理解或短程多跳推理,而源条件图推理和证据拓扑构建尚未充分探索。为填补这一空白,我们引入了BioMedHop,一个多源图基基准,用于评估结构化证据拓扑上的生物医学推理。BioMedHop包含10,045个实例,涵盖知识图谱、文档、网络和混合证据设置,包括共享邻居匹配、交集推理、基于路径的推理和计数,并提供选项式、开放式和数值计数形式。为支持该基准,我们进一步提出了BioWeave,一个源感知推理框架,该框架检索生物医学知识图谱路径,从文档和网络来源收集支持线索,将其组装成统一的证据图,并通过实体级证据支持验证答案。综合实验表明,在BioMedHop上,BioWeave在比较方法中实现了最佳整体性能,在总体平均值上比强混合基线ToG-2高出10.5%。此外,BioWeave一致地改进了不同的大语言模型骨干,并使较小的模型(如Qwen3-4B)能够达到与GPT-4-Turbo相当的推理性能。

英文摘要

Biomedical question answering (QA) increasingly requires reasoning over interacting entities, where supporting evidence is scattered across biomedical knowledge graphs, literature documents, and web-accessible resources. However, existing biomedical QA benchmarks mainly focus on exam-style knowledge, literature comprehension, or short-range multi-hop inference, leaving source-conditioned graph reasoning and evidence topology construction underexplored. To fill this gap, we introduce BioMedHop, a multi-source graph-grounded benchmark for evaluating biomedical reasoning over structured evidence topologies. BioMedHop contains 10,045 instances across KG, document, web, and hybrid evidence settings, covering shared-neighbor matching, intersection reasoning, path-based reasoning, and counting, with option-based, open-ended, and numeric count renderings. To support this benchmark, we further propose BioWeave, a source-aware reasoning framework that retrieves biomedical KG paths, gathers supporting clues from documents and web sources, assembles them into a unified evidence graph, and verifies answers through entity-level evidence support. Comprehensive experiments show that BioWeave achieves the best overall performance among compared methods on BioMedHop, outperforming the strong hybrid baseline ToG-2 by 10.5% in the overall average. Moreover, BioWeave consistently improves different LLM backbones and enables smaller models, such as Qwen3-4B, to achieve reasoning performance comparable to GPT-4-Turbo.

2606.16210 2026-06-16 cs.AI 新提交

Sensor-Conditioned Representation Learning via Scene-Relevant Observation Quotients

基于场景相关观测商的传感器条件表示学习

Yan Jiao, Pin-Han Ho, Limei Peng

发表机构 * Shenzhen Institute for Advanced Study, University of Electronic Science and Technology of China(电子科技大学深圳高等研究院) Department of Electrical and Computer Engineering, University of Waterloo(滑铁卢大学电气与计算机工程系) School of Computer Science and Engineering, Kyungpook National University(庆北国立大学计算机科学与工程学院)

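AI Summary Proposes the scene-relevant observation quotient as a representation target and uses the OQ-TSAE framework to factorize scene and nuisance factors, preserving distinguishability under sensor conditioning and outperforming reconstruction-oriented, metric-learning, and contrastive-learning baselines.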

详情
AI中文摘要

智能传感系统中的学习表示通常通过重建保真度或下游预测精度来评估,但这些标准并未指定哪些潜在区分是由传感过程证明合理的。在传感器条件环境中,干扰因素可以在不改变场景的情况下改变测量值,而不同的场景在有限的传感能力下可能无法区分。本文形式化了传感器条件表示的正确性,即在抑制干扰引起的和传感器不支持的变化的同时,保留传感支持的场景区分。我们引入了场景相关观测商,一种由传感支持的可区分性在干扰规范化后诱导的表示目标,并开发了观测商塔克结构自编码(OQ-TSAE),一种具有假区分、假合并、干扰敏感性和潜在排序一致性诊断的场景-干扰因子分解框架。在受控基准上的实验表明,商一致监督在表示正确性诊断上优于面向重建、度量学习和对比学习的基线。敏感性、扰动和消融研究显示了商对齐监督、可靠商关系和商几何的重要性。互补的真实雷达实验表明,仅重建的OQ-TSAE变体保留了竞争性的下游效用、观测退化下的鲁棒性和低种子间变异性。这些结果表明,传感器条件表示不仅应通过预测效用评估,还应通过其潜在几何是否保留传感证明的场景区分来评估。

英文摘要

Learned representations in intelligent sensing systems are often evaluated by reconstruction fidelity or downstream prediction accuracy, but these criteria do not specify which latent distinctions are justified by the sensing process. In sensor-conditioned environments, nuisance factors can change measurements without changing the scene, while distinct scenes may be indistinguishable under limited sensing capability. This paper formulates sensor-conditioned representation correctness as preserving sensing-supported scene distinctions while suppressing nuisance-induced and sensor-unsupported variation. We introduce the scene-relevant observation quotient, a representation target induced by sensing-supported distinguishability after nuisance canonicalization, and develop Observation-Quotient Tucker-Structured Autoencoding (OQ-TSAE), a scene-nuisance factorized framework with diagnostics for false distinction, false merge, nuisance sensitivity, and latent ordering consistency. Experiments on a controlled benchmark show that quotient-consistent supervision improves representation-correctness diagnostics over reconstruction-oriented, metric-learning, and contrastive-learning baselines. Sensitivity, perturbation, and ablation studies show the importance of quotient-aligned supervision, reliable quotient relations, and quotient geometry. Complementary real-radar experiments show that a reconstruction-only OQ-TSAE variant retains competitive downstream utility, robustness under observation degradation, and low seed-to-seed variability. These results suggest that sensor-conditioned representations should be evaluated not only by predictive utility, but also by whether their latent geometry preserves sensing-justified scene distinctions.

2606.16208 2026-06-16 cs.RO 新提交

ATHENA: Accelerated Multi-Task Heterogeneous Influence Functions for Robot Data Curation

ATHENA: 加速的多任务异构影响函数用于机器人数据筛选

Tao Xu, Jiaxin Wang, Runhao Zhang, Jiayi Guan, Xianchao Zeng, Weixi Song, Xinyu Zhou, Zhetao Chen, Guang Chen, Yong-Lu Li

发表机构 * Tongji University(同济大学) Shanghai Innovation Institute(上海创新研究院) Xi'an Jiaotong University(西安交通大学) Shanghai Jiao Tong University(上海交通大学)

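AI Summary Proposes ATHENA, a framework that accelerates influence function computation using Kronecker gradient structure and a rank-r Random Truncated Approximation, enabling data curation for multitask VLA models and matching or exceeding full-data fine-tuning with less data on simulated and real-robot tasks.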

详情
AI中文摘要

在机器人模仿学习中,影响函数提供了一种原则性方法来量化每个演示对机器人任务结果的影响,但将其扩展到十亿参数的视觉-语言-动作(VLA)模型受到计算和多任务瓶颈的限制。为此,我们提出ATHENA,一个专为十亿参数规模的多任务VLA数据筛选设计的影响函数框架。具体来说,它利用线性层梯度的Kronecker结构来降低投影成本,并通过秩r随机截断近似来近似稠密Hessian矩阵的逆,在影响计算中实现了约313.4倍的加速。此外,ATHENA制定了全局和局部交互影响,以平衡50个联合训练任务间的数据筛选。在RoboTwin 2.0和真实机器人部署上的广泛评估,分别涵盖9.34小时和6.90小时的演示,表明ATHENA在模拟中仅使用50%的演示、在六个真实机器人任务中使用66.7%的数据,即可达到或超过全数据联合微调的性能。总体而言,ATHENA证明了其在十亿参数多任务VLA微调中用于数据筛选的有效性。

英文摘要

In robot imitation learning, influence functions provide a principled approach to quantify each demonstration's effect on robot task outcomes, yet scaling them to billion-parameter Vision-Language-Action (VLA) models is limited by computational and multitask bottlenecks. To this end, we propose ATHENA, an influence function framework tailored for multitask VLA data curation at a billion-parameter scale. Concretely, it leverages the Kronecker structure of linear-layer gradients to reduce projection cost, and approximates dense Hessian inversion with a rank-r Random Truncated Approximation, achieving about a 313.4x speedup in influence computation. Furthermore, ATHENA formulates global and local interactive influence to balance data curation across 50 jointly trained tasks. Extensive evaluations on RoboTwin 2.0 and real-robot deployment, covering 9.34 and 6.90 hours of demonstrations, respectively, show that ATHENA matches or exceeds full-data joint fine-tuning using only 50% of demonstrations in simulation and 66.7% of data across six real-robot tasks. Overall, ATHENA demonstrates its effectiveness for data curation in billion-parameter multitask VLA fine-tuning.
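The Kronecker structure mentioned above can be illustrated with a standard identity; the sketch below is not ATHENA's implementation, and whether ATHENA projects gradients in exactly this way is an assumption. For a linear layer y = W x, the per-sample gradient with respect to W is the outer product of the output gradient `delta` and the input activation `act`, so its row-major flattening is the Kronecker product of the two vectors. If the projection matrix is itself a Kronecker product of two small matrices (hypothetical `p_out` and `p_in` here), the mixed-product property lets the projection be computed factor by factor without ever forming the full gradient.

```python
def matvec(matrix, vector):
    """Dense matrix-vector product on lists of rows."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def kron_vec(u, v):
    """Kronecker product of two vectors (row-major flattening of u v^T)."""
    return [ui * vj for ui in u for vj in v]


def kron_mat(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    return [[a_ij * b_kl for a_ij in a_row for b_kl in b_row]
            for a_row in a for b_row in b]


def project_naive(p_out, p_in, delta, act):
    """Build the full projection and full flattened gradient, then multiply."""
    return matvec(kron_mat(p_out, p_in), kron_vec(delta, act))


def project_factored(p_out, p_in, delta, act):
    """Use (A kron B)(u kron v) = (A u) kron (B v): project each factor first."""
    return kron_vec(matvec(p_out, delta), matvec(p_in, act))


# Toy sizes: 3 output units, 2 input units, projected down to 2 x 1.
delta = [1.0, 2.0, 3.0]            # gradient of the loss w.r.t. layer output
act = [4.0, 5.0]                   # layer input activation
p_out = [[1.0, 0.0, 1.0],
         [0.0, 2.0, 0.0]]          # hypothetical output-side projection
p_in = [[1.0, -1.0]]               # hypothetical input-side projection

print(project_naive(p_out, p_in, delta, act))
print(project_factored(p_out, p_in, delta, act))
```

Both functions return the same vector, but the factored version touches only the two small projections and the two per-sample vectors, which is where the cost reduction comes from when the layer dimensions are large.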

2606.16206 2026-06-16 cs.AI cs.CL cs.CY cs.HC 新提交

Measuring Whether LLM Tutors Teach or Solve: A Diagnostic for Educational Impact

衡量LLM导师是教学还是解题:教育影响的诊断方法

Junyi Yao, Zihao Zheng, Baichuan Li

发表机构 * Washington University in St. Louis(圣路易斯华盛顿大学) Department of Operations Research and Engineering Management, Southern Methodist University(南卫理公会大学运筹学与工程管理系)

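AI Summary Addresses the problem that, for LLMs used as educational tutors, problem-solving ability does not equal teaching support; proposes a diagnostic based on the gap between solving-oriented and pedagogy-oriented benchmark performance, shows through an analysis of MathTutorBench that the two are only partially aligned, and recommends reporting the scores separately and making criteria that preserve student agency explicit.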

详情
AI中文摘要

大型语言模型越来越多地被提议作为教育导师,但更强的任务解决能力并不一定意味着更强的学习支持。受近期呼吁在实践中衡量NLP系统社会影响的启发,我们研究公开的LLM辅导基准是否能够区分支持学习的行为与单纯的答案生成。我们提出了一种轻量级诊断方法,基于解题导向和教学导向基准性能之间的差距。利用公开的MathTutorBench排行榜结果,我们表明这些维度仅部分对齐:在八个公开报告的模型中,解题和教学综合得分之间的相关性为0.421,并且当评估从解题转向教学时,几个模型的排名发生了显著变化。然后,我们分析了公开的TutorBench样本,并表明与能动性相关的行为明确编码在基准评分标准中,尤其是在主动学习环境中,奖励引导性问题、校准提示和非揭露性脚手架。这些发现共同表明,教育影响评估不应将任务成功视为学习支持的充分代理。我们认为,公开的辅导基准可以通过分别报告解题导向和教学导向得分,并使披露敏感、保护学生能动性的标准更加明确,从而更好地支持积极影响评估。

英文摘要

Large language models are increasingly proposed as educational tutors, yet stronger task-solving ability does not necessarily imply stronger learning support. Motivated by recent calls to measure the social impact of NLP systems in practice, we study whether public LLM tutoring benchmarks distinguish learning-supportive behavior from mere answer production. We propose a lightweight diagnostic based on the gap between solving-oriented and pedagogy-oriented benchmark performance. Using public MathTutorBench leaderboard results, we show that these dimensions are only partially aligned: across eight publicly reported models, the correlation between solving and pedagogy composites is 0.421, and several models shift meaningfully in rank when evaluation moves from solving to pedagogy. We then analyze the public TutorBench sample and show that agency-relevant behaviors are explicitly encoded in benchmark rubrics, especially in active-learning settings that reward guiding questions, calibrated hints, and non-disclosive scaffolding. Together, these findings suggest that educational-impact evaluation should not treat task success as a sufficient proxy for learning support. We argue that public tutoring benchmarks can better support positive-impact evaluation by reporting solving-oriented and pedagogy-oriented scores separately and by making disclosure-sensitive, student-agency-preserving criteria more explicit.

2606.16202 2026-06-16 cs.CV cs.AI cs.RO New submission

EgoPhys: Learning Generalizable Physics Models of Deformable Objects from Egocentric Video

EgoPhys: 从第一人称视频学习可变形物体的通用物理模型

Hyunjin Kim, Ri-Zhao Qiu, Guangqi Jiang, Xiaolong Wang

Affiliations * UC San Diego

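AI summary Proposes EgoPhys, a framework that uses generalizable priors to build physical digital twins of deformable objects from egocentric RGB video. It predicts spring stiffness fields without per-spring test-time optimization and outperforms baselines in reconstruction, future prediction, and zero-shot generalization.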

Comments Project Page: https://hjhyunjinkim.github.io/EgoPhys

Details
AI Chinese abstract

人类通过日常互动自然地理解物体物理,但准确预测复杂的可变形动力学(如弹性材料和织物)仍然是计算机视觉和机器人学的主要挑战。我们提出EgoPhys,一个利用可泛化先验从仅RGB的第一人称视频构建可变形物理数字孪生的框架。EgoPhys通过将每个物体的逆物理解蒸馏到紧凑码本中,克服了现有方法的局限性,从而能够为未见物体预测密集的弹簧刚度场,而无需每个弹簧的测试时优化。使用来自多样化第一人称交互的可泛化先验进行训练,EgoPhys在重建、未来预测和零样本泛化方面优于基线。为了支持训练和评估,我们整理了一个涵盖多样化可变形物体、场景和操作风格的第一人称交互数据集。我们将EgoPhys部署在真实的xArm6机器人上,证明从单个第一人称人类游戏视频初始化的数字孪生可以作为内部世界表示,辅助可变形物体规划,突显第一人称RGB观测作为通往真实到模拟管道的可扩展路径。

English abstract

Humans naturally understand object physics through everyday interactions, but faithfully predicting complex deformable dynamics, such as elastic materials and fabrics, remains a major challenge for computer vision and robotics. We present EgoPhys, a framework that constructs deformable physical digital twins from egocentric RGB-only video using generalizable priors. EgoPhys overcomes the limitations of existing methods to enable controllable deformable digital twin generation from egocentric videos by distilling per-object inverse-physics solutions into a compact codebook, enabling prediction of dense spring stiffness fields for unseen objects without per-spring test-time optimization. Trained with generalizable priors from diverse egocentric interactions, EgoPhys outperforms baselines in reconstruction, future prediction, and zero-shot generalization. To support training and evaluation, we curate an egocentric interaction dataset covering diverse deformable objects, scenes, and manipulation styles. We deploy EgoPhys on a real xArm6 robot, demonstrating that a digital twin initialized from a single egocentric human play video can serve as an internal world representation to aid in deformable-object planning, highlighting egocentric RGB observations as a scalable path toward real-to-sim pipelines.

2606.16198 2026-06-16 cs.CV New submission

GRACE: Boosting Video MLLMs with Grounded Action-Centric Evidence for Viewer Sentiment Prediction

GRACE: 基于接地动作中心证据增强视频多模态大语言模型用于观众情感预测

Ruoxuan Yang, Tieyuan Chen, Xiaofeng Huang, Haibing Yin, Jun Wang, Xiping Chen, Jun Yin, Xuesong Gao, Weiyao Lin

Affiliations * Shanghai Jiao Tong University; Hangzhou Dianzi University; The 52nd Research Institute of China Electronics Technology Group Corporation; Hangzhou Bywin Technology Co., Ltd.; Zhejiang Dahua Technology Co., Ltd.; School of Information Science and Engineering, Shandong University; Haihe Laboratory of Information Technology Application Innovation

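AI summary Proposes the GRACE framework, which extracts temporally ordered subject-verb-object triplets and visual entity crops to strengthen how video MLLMs extract and reason over fine-grained sentiment cues, improving Qwen2.5-VL and Qwen3-VL on the Pitts dataset.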

Comments 13 pages, 5 figures

Details
AI Chinese abstract

视频广告中的观众情感预测旨在推断观众中引发的潜在情感反应。为了弥合展示内容与感受之间的差距,模型必须从显性的视觉叙事、具体的角色-物体交互和可见的文本线索中推断隐藏的观众情感。然而,标准的多模态大语言模型(MLLMs)通常依赖整体帧表示,这使得这些细粒度的情感相关事件隐式化,并复杂化了精确的情感推理。为了解决这个问题,我们提出了一种基于接地动作中心的证据增强框架,通过引入显式事件结构和局部化视觉证据来增强视频MLLMs的线索提取和理解能力。我们的方法从以动作中心的视频描述中提取时间排序的主语-动词-宾语(SVO)三元组和辅助可见文本线索,将主语和宾语实体作为视觉实体裁剪进行接地,然后使MLLM基于这些提取的结构化线索执行线索增强的情感推理。通过这种方式,动作三元组指定“发生了什么”,而接地的视觉实体裁剪将“谁或什么参与每个事件”锚定到具体的视觉证据上。在Pitts数据集上的实验显示,相对于Qwen2.5-VL和Qwen3-VL基线有持续改进。消融研究、在AdsQA上的跨数据集评估以及在情感聚焦的TVQA子集上的迁移实验进一步支持了我们方法的有效性和泛化能力。

English abstract

Viewer sentiment prediction in video advertisements aims to infer the latent affective response evoked in the audience. To bridge the gap between what is shown and what is felt, models must deduce hidden viewer emotions from explicit visual narratives, concrete character-object interactions, and visible textual cues. However, standard Multimodal Large Language Models (MLLMs) typically rely on holistic frame representations, which leave these fine-grained, affect-relevant events implicit and complicate precise emotional reasoning. To address this, we propose a grounded action-centric evidence augmentation framework that enhances video MLLMs' clue extraction and comprehension by introducing explicit event structure and localized visual evidence. Our method extracts temporally ordered subject-verb-object (SVO) triplets and auxiliary visible textual cues from action-centric video descriptions, grounds subject and object entities as visual entity crops, and then enables the MLLM to perform clue-enhanced emotional reasoning based on these extracted structured clues. In this way, action triplets specify "what happens", while grounded visual entity crops anchor "who or what participates in each event" to concrete visual evidence. Experiments on the Pitts dataset show consistent improvements over Qwen2.5-VL and Qwen3-VL baselines. Ablation studies, cross-dataset evaluation on AdsQA, and transfer experiments on an emotion-focused TVQA subset further support the effectiveness and generalization of our approach.