arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.08841 2026-06-09 cs.AI cs.CV New submission

ZIPP: Zero-shot Image Personalization from Personas

Harini SI, Somesh Singh, Yaman Kumar Singla, David Doermann, Rajiv Ratn Shah

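Affiliations: Adobe Media and Data Science Research (MDSR); IIIT-Delhi; SUNY at Buffalo

AI summary: Proposes ZIPP, which uses an LLM to rewrite prompts from the perspective of a natural-language persona to achieve zero-shot image personalization without user data or fine-tuning; introduces the ZIPBench benchmark and reports gains of 13-20% across multiple evaluations.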

Abstract

Text-to-image diffusion models are increasingly deployed in open-ended creative contexts, yet their outputs remain impersonal, optimized for aggregate aesthetics rather than individual taste. Human preferences are pluralistic: one user favoring muted, nostalgic portraits may prefer vibrant street photography, while another gravitates toward dreamy film aesthetics. Existing methods require dense interaction histories or per-user fine-tuning, failing in cold-start settings and collapsing context-dependent preferences into a static representation. We introduce zero-shot image personalization from personas (ZIPP), which conditions image generation on natural-language personas (concise descriptors of a user's identity and aesthetic sensibilities) without any user-specific data or weight updates. ZIPP uses an LLM to rewrite prompts from the perspective of a given persona, steering diffusion models toward personalized outputs. To mine personas at scale, we train an inductive Graph Attention Network over a 22M-user Reddit interaction graph with dual contrastive objectives aligning graph structure with visual behavior, then verbalize learned representations into natural-language personas via an MLLM. We introduce ZIPBench, the first zero-shot personalization benchmark with 1.5K users, graph-mined personas, and 40K generated images. Across four benchmarks and 14 LLMs spanning five model families, persona conditioning yields consistent gains (13-20%), with frontier models benefiting most. In the few-shot setting, ZIPP matches or exceeds fine-tuned baselines trained on 100+ examples per user. ZIPP achieves the lowest preference distributional divergence (CMMD 0.16 vs. 0.55), and IPF-normalized demographic evaluation shows it substantially reduces subpopulation bias present in existing methods. Human evaluation confirms a 79% win rate over generic generation and 58-65% over all fine-tuned baselines.

2606.08840 2026-06-09 cs.AI cs.SE New submission

Beyond Pass Rate: A Multilingual, Execution-Grounded Evaluation of Open Code LLMs

Sayed Erfan Arefin

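Affiliations: Sayed Erfan Arefin

AI summary: Evaluates 9 open code LLMs on 2,707 LeetCode problems across 12 programming languages; the best model, Yi-Coder-9B-Chat, reaches only 23.64% correctness, far below the 57.2% human baseline; rankings vary with problem difficulty and language, and compile errors account for 63.25% of failures.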

Abstract

Code generation models are typically compared using compact execution benchmarks and aggregate pass rates, but such summaries obscure how performance varies across programming languages, problem families, and failure modes. We present a large-scale, execution-grounded evaluation of 9 openly accessible LLMs specialized for coding on 2,707 free LeetCode problems across 12 programming languages. Our corpus contains 325,343 problem-model-language jobs, each linked to prompt metadata, extracted code, LeetCode execution outcomes, and static-analysis signals. The results show that current open models remain far from the human acceptance reference: the best model, Yi-Coder-9B-Chat, reaches 23.64% mean correctness, compared with a 57.2% human acceptance baseline. Rankings are also slice-dependent: Qwen2.5-Coder-14B-Instruct is strongest on hard problems and distinct-problem coverage, while Gemma-2-27B-IT achieves the highest all-language lint pass rate. Failure analysis shows that compile errors account for 63.25% of non-accepted best submissions, indicating that many failures occur before semantic correctness can be tested. Static quality further diverges from functional correctness. Together, these findings show that multilingual, artifact-preserving evaluation reveals tradeoffs hidden by single-language or single-metric leaderboards.

2606.08833 2026-06-09 cs.CV New submission

CSFlow: Aligning Flow Matching with Human Contrast Sensitivity

Malgorzata Galinska, Bart Pogodzinski, Jan Eric Lenssen

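Affiliations: Max Planck Institute for Informatics, Saarland Informatics Campus

AI summary: Proposes the CSFlow weighting scheme, which aligns the human contrast sensitivity function with the iterative denoising steps of flow matching, exploiting the soft autoregressive structure that arises in Fourier space, to improve the visual realism of generated images; FID drops by 4.7% and Inception Score rises by 2.2%.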

Abstract

We introduce Contrast Sensitive Flow (CSFlow), a weighting scheme that connects the human eye's Contrast Sensitivity Function (CSF) to the iterative denoising steps of flow matching. Because real-world images concentrate signal at low spatial frequencies, these components reach a high signal-to-noise ratio earlier during continuous diffusion than high-frequency components. When generating images with diffusion or flow matching models, this induces a soft autoregressive structure in Fourier space, where coarse image content stabilizes before fine detail. Meanwhile, the human visual system is unequally sensitive to spatial frequencies: very low and very high frequencies require significantly higher contrast to be perceived. We merge these observations for the first time through two contributions: (1) a metric that estimates which frequencies are generated at each reverse flow interval and (2) timestep weights obtained by aligning the frequencies generated at each noise level with human contrast sensitivity. We validate our contributions experimentally, showing that these weights can improve generative performance, lowering FID by 4.7%, increasing Inception Score by 2.2%, and improving GenEval scores by 2.5%, using inference-only timestep modification or short fine-tuning. Qualitatively, we find that our CSFlow weights lead to better visual realism and a less cartoonish appearance of generated images.
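The claim that low spatial frequencies reach a high signal-to-noise ratio earlier can be illustrated with a small calculation. The sketch below is not the paper's metric: it assumes a linear interpolation path x_t = t * data + (1 - t) * noise with t = 0 as pure noise, unit-variance white noise, and a power-law image spectrum S(f) = f^(-2), which is a common approximation for natural images. The function name and parameters are hypothetical.

```python
import math


def snr_crossing_time(freq: float, spectrum_exponent: float = 2.0) -> float:
    """Return the time t in [0, 1] at which a frequency's SNR reaches 1.

    Assumptions (not taken from the paper):
      * x_t = t * data + (1 - t) * noise, so t = 0 is pure noise
      * noise is white with unit power at every frequency
      * signal power follows S(f) = f ** -spectrum_exponent

    Under these assumptions SNR(f, t) = t^2 * S(f) / (1 - t)^2,
    which equals 1 at t = 1 / (1 + sqrt(S(f))).
    """
    signal_power = freq ** -spectrum_exponent
    return 1.0 / (1.0 + math.sqrt(signal_power))


if __name__ == "__main__":
    for f in (1.0, 4.0, 16.0):
        print(f"frequency {f:>5}: SNR reaches 1 at t = {snr_crossing_time(f):.3f}")
```

Under these assumptions the crossing time grows with frequency, so coarse content is resolved first and fine detail last, which is the soft autoregressive ordering the abstract describes.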

2606.08832 2026-06-09 cs.AI New submission

Instrumental convergence and power-seeking

David Thorstad

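Affiliations: GitHub

AI summary: Examines the argument that artificial agents may seek power, analyzes the instrumental convergence thesis, argues that its strong version has not been adequately defended, and discusses implications for longtermism, AI governance, and the methodology of risk research.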

Abstract

Recent years have seen increasing concern that artificial intelligence may soon pose an existential risk to humanity. One leading ground for concern is that artificial agents may be power-seeking, aiming to acquire power and in the process disempowering humanity. I show how the argument from power-seeking rests on a strong version of a claim known as the instrumental convergence thesis. I explore leading defenses of the instrumental convergence thesis and argue that none establishes the thesis in a strong enough form to ground the argument from power-seeking. I discuss implications for longtermism, the governance of artificial intelligence, and the methodology of studying risks posed by artificial agents.

2606.08831 2026-06-09 cs.AI New submission

Inference-Time Conformal Reasoning with Valid Factuality Control for Large Language Models

Ting Wang, Yuanjie Shi, Yan Yan, Huan Zhang

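Affiliations: Machine Learning, ICML

AI summary: Proposes an inference-time conformal reasoning framework that integrates conformal prediction into reasoning graph generation and calibrates a stopping threshold from graph-level uncertainty, achieving valid factuality control.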

Comments Accepted at ICML 2026

Abstract

Large language models (LLMs) increasingly perform multi-step reasoning, where intermediate claims form implicit directed acyclic graphs whose node correctness is structurally conditioned on their ancestors. This makes factuality uncertainty structural, rather than a trivial accumulation of node-wise errors, and necessitates inference-time uncertainty quantification over the reasoning structure. While conformal prediction (CP) offers flexible user-specified factuality control, existing work remains post-hoc and cannot intervene during generation. To fill the gap between CP's flexibility and its post-hoc limitation, we propose an Inference-Time Conformal Reasoning (ITCR) framework that integrates CP directly into reasoning graph generation. ITCR learns a structure-level factuality uncertainty function that aggregates claim-level factuality signals over reasoning graphs without complex modeling assumptions. We then design the non-conformity score based on graph-level factuality uncertainty and calibrate the conformal threshold to decide when to stop generation. We theoretically show such generation is nested, yielding valid coverage guarantees for factuality control. Experiments over multiple datasets and coverage objectives demonstrate empirically valid coverage. In downstream reasoning tasks, inference-time calibrated graphs yield more accurate generation than post-hoc pruned graphs.
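The idea of calibrating a conformal threshold and using it as a stopping rule can be sketched with generic split conformal prediction. This is not the ITCR procedure: the non-conformity scores here are plain numbers standing in for the paper's graph-level uncertainty, and `conformal_threshold` and `stop_index` are hypothetical helper names.

```python
import math


def conformal_threshold(calibration_scores, alpha):
    """Split-conformal quantile of held-out non-conformity scores.

    Returns the k-th smallest score with k = ceil((n + 1) * (1 - alpha)).
    If k exceeds n, no finite threshold gives the requested coverage,
    so infinity is returned.
    """
    n = len(calibration_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf
    return sorted(calibration_scores)[k - 1]


def stop_index(step_scores, threshold):
    """Count how many generation steps are kept before a score exceeds the threshold.

    step_scores is assumed to hold one uncertainty score per generated step.
    """
    kept = 0
    for score in step_scores:
        if score > threshold:
            break
        kept += 1
    return kept


if __name__ == "__main__":
    threshold = conformal_threshold(list(range(1, 10)), alpha=0.5)
    print("threshold:", threshold)
    print("steps kept:", stop_index([1, 3, 5, 7, 2], threshold))
```

A smaller `alpha` requests higher coverage and therefore a larger threshold, so generation is allowed to continue further before stopping.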

2606.08828 2026-06-09 cs.RO New submission

Video2Sim2Real: Full-Stack Autonomous Dexterous Skill Acquisition from a Single Human Video

Yunhai Han, Jianuo Qiu, Linhao Bai, Ziyu Xiao, Zihang Zeng, Yangcen Liu, Zhaodong Yang, Shalin Jain, Wenrui Ma, Jiaqi Fu, Yuqian Zheng, Manisha Natarajan, Muhammad Zubair Irshad, Kenneth Shaw, Matthew Gombolay, Zsolt Kira, Harish Ravichandar

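Affiliations: Georgia Institute of Technology; University of Pennsylvania; Toyota Research Institute; Carnegie Mellon University

AI summary: Proposes the Video2Sim2Real framework, which reconstructs a digital twin and extracts motion priors from a single human manipulation video, optimizes robot configurations at object-centric keyframes, and combines residual reinforcement learning with collision-aware planning to transfer dexterous skills from simulation to the real world.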

Comments Website: https://video2sim2real.github.io/

Abstract

Human manipulation videos are a convenient and intuitive source for robot learning. However, directly transferring human dexterity to robots remains challenging due to perception errors and the embodiment gap. To address this, we introduce Video2Sim2Real, a full-stack framework for autonomous skill acquisition from a single human manipulation video. Our framework first uses off-the-shelf foundation models to reconstruct a simulator-ready digital twin and extract robot and object motion priors. Rather than treating the extracted robot motion as a reliable reference throughout execution, our key idea is to recover and leverage the most fundamental sources of supervision from the demonstrated skill: we identify object-centric keyframes to optimize the corresponding robot configurations using object information from the simulator, and use these configurations as anchors that refine the robot motion such that it ultimately has the desired impact on the environment. To bridge the remaining sim-to-real gap, we introduce a sim-to-real strategy that decouples robustness to noisy and incomplete perception from variations in hand-object interaction dynamics. Specifically, we learn to recalibrate robot configurations from noisy real-world point clouds via imitation learning (IL), and leverage residual reinforcement learning (RL) to perform local finger-level adaptations to ensure robust and effective interactions. Finally, a collision-aware motion planning module enables spatial generalization to novel object configurations. Across several everyday manipulation tasks, Video2Sim2Real improves simulated task success, safety, and trajectory coherence over numerous baselines, and achieves better sim-to-real transfer than existing techniques. These results demonstrate a promising path toward autonomous dexterous skill acquisition from human videos.

2606.08826 2026-06-09 cs.CV astro-ph.GA New submission

Classifying galaxies in the Galaxy10 DECals dataset using Inception and Residual CNNs

Lanz Anthonee A. Lagman, Prospero C. Naval, Reinabelle C. Reyes

Affiliations: University of the Philippines - Diliman; Department of Computer Science, College of Engineering, University of the Philippines - Diliman; National Institute of Physics, College of Science, University of the Philippines - Diliman

AI summary: Compares ResNet101 and InceptionV4 on galaxy morphology classification; both reach about 90% accuracy, with ResNet101 performing better, suggesting that either CNN architecture can serve as a robust foundation for classifying galaxy images from future surveys.

Comments 4 pages, 3 figures, 2 tables, published in Proceedings of the 42nd Samahang Pisika ng Pilipinas Physics Conference (SPP 2024)

Journal ref
Proc. Samahang Pisika Pilipinas 42, SPP-2024-2E-05 (2024)

Abstract

Image data on galactic morphology is expected to increase in both quantity and quality over the coming years; thus it is important to explore which deep learning architectures adapted for image classification tasks are cost-effective. Residual and Inception networks are ideal for exploring classification convolutional neural networks (CNNs) due to their computational efficiency, achieved through techniques such as residual connections and parallelized inception modules, enabling deeper networks without excessively increasing computational complexity. In this work, we analyze the performance of ResNet101 and InceptionV4 on a spatially-augmented Galaxy10 DECals dataset. Retaining the ten-class classification of galaxies, we modify the image count of each class. We find that ResNet101 and InceptionV4 models achieved accuracies of about 90%, comparable with reported performance in the literature. In terms of performance metrics, ResNet101 is superior to InceptionV4. Our results indicate that either of these CNN architectures could serve as a robust foundation for specialized pipelines for classification of galaxy images from upcoming surveys.

2606.08816 2026-06-09 cs.LG cs.AI New submission

Knowledge Graphs and Reasoning LLMs for Finding Simple Yet Effective Transcriptomic Perturbation Predictors

Jake Fawkes, Liam Hodgson, Jason Hartford

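Affiliations: University College London; University of Manchester; Valence Labs; Recursion

AI summary: Shows that a K-nearest-neighbour method over a knowledge graph performs strongly on gene knockout perturbation prediction, and that combining it with an LLM optimised via reinforcement learning matches state-of-the-art performance.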

Abstract

Predicting the effect of an unseen gene knockout perturbation on transcriptomic gene expression remains a highly challenging problem for virtual cell models. Recent progress has been made by leveraging biological knowledge graphs to provide a notion of similar perturbation, allowing for improved extrapolation beyond the set of training perturbations. In this work, we demonstrate that the simplest model to leverage these assumptions - a K-nearest neighbour from the knowledge graph - achieves highly competitive performance on this task, and that this can be improved further using LLMs optimised via reinforcement learning (RL) for predictive performance. Specifically, we find that the K-nearest neighbour approach beats almost all methods on out-of-distribution perturbation prediction, and when a reasoning LLM is trained via RL to make changes to the neighbourhood, it obtains equivalent performance to current state of the art methods on the cell lines from Replogle et al. (2022). We also demonstrate that the RL training improves the LLM's performance on the downstream task of differential expression prediction, despite not being trained on this directly. Overall, these findings demonstrate the efficacy of knowledge graphs as model priors, and show early signs that RL can refine LLMs into generalizable tools for predicting complex biological responses.
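A K-nearest-neighbour predictor of this kind can be sketched as follows. The sketch assumes each perturbed gene has an embedding derived from a knowledge graph and that the prediction for an unseen perturbation is the mean expression effect of its K most similar training perturbations. The embeddings, gene names, and the choice of cosine similarity are illustrative assumptions, not details taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def knn_predict(query_embedding, train_embeddings, train_effects, k=2):
    """Predict an unseen perturbation's effect from its nearest neighbours.

    train_embeddings: dict mapping gene name -> embedding (hypothetical
                      knowledge-graph embedding).
    train_effects:    dict mapping gene name -> observed expression change vector.
    Returns (predicted effect vector, list of neighbour names).
    """
    ranked = sorted(
        train_embeddings,
        key=lambda gene: cosine(query_embedding, train_embeddings[gene]),
        reverse=True,
    )
    neighbours = ranked[:k]
    n_outputs = len(train_effects[neighbours[0]])
    prediction = [
        sum(train_effects[gene][i] for gene in neighbours) / len(neighbours)
        for i in range(n_outputs)
    ]
    return prediction, neighbours


if __name__ == "__main__":
    embeddings = {"GENE_A": [1.0, 0.0], "GENE_B": [0.9, 0.1], "GENE_C": [0.0, 1.0]}
    effects = {"GENE_A": [1.0, 0.0], "GENE_B": [3.0, 2.0], "GENE_C": [-5.0, 9.0]}
    print(knn_predict([1.0, 0.05], embeddings, effects, k=2))
```

In the abstract's setup, the RL-trained reasoning LLM would then edit the neighbour set before averaging; that step is not modelled here.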

2606.08815 2026-06-09 cs.AI cs.CL cs.LG New submission

Momentum for Reasoning: Dense Intrinsic Signals in Policy Optimization

Hao Chen, Zhanming Shen, Liyao Li, Yanyu Chen, Xuhang Zhu, Xiaomeng Hu, Qi Zhang, Ru Peng, Xiaoyu Shen, Haobo Wang, Junbo Zhao

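Affiliations: Zhejiang University; The Chinese University of Hong Kong; Eastern Institute of Technology

AI summary: Targets two failure modes that binary rewards cause for GRPO in long-chain reasoning, zero-advantage collapse and hallucinated certainty, and proposes ISPO, which densifies the reward with intrinsic signals and consistently outperforms baselines across three base models and five mathematical reasoning benchmarks.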

Comments 14 pages, 6 figures, 8 tables

Abstract

Reinforcement learning with verifiable rewards (RLVR) has emerged as a powerful paradigm for eliciting long-chain reasoning in large language models. However, existing methods based on Group Relative Policy Optimization (GRPO) rely on a binary outcome reward, which induces two structural failure modes: Zero-Advantage Collapse, in which all rollouts in a group share the same outcome and the gradient vanishes, and Hallucinated Certainty, in which the model becomes increasingly confident on incorrect rollouts late in training. We address both modes by densifying the reward with intrinsic signals computed entirely from the policy's own conditional probabilities, and propose ISPO (Intrinsic Signal Policy Optimization), which combines a sequence-level signal measuring how informative the thinking trajectory is for the final answer with a token-level directional reward whose hallucinated-certainty hinge penalizes confidently wrong predictions at critical decision tokens. Across three base models and five mathematical reasoning benchmarks, ISPO consistently outperforms competitive baselines, with the largest gains on the hardest benchmarks where zero-advantage collapse is most frequent, and training-dynamics diagnostics confirm that both failure modes are reduced.
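Zero-Advantage Collapse follows directly from how group-relative advantages are computed. The sketch below uses a common formulation, in which each rollout's reward is standardized against the mean and standard deviation of its group. It is an illustration of the failure mode only, not of ISPO, and the function name and `eps` constant are assumptions.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of rollouts.

    advantage_i = (reward_i - group mean) / (group std + eps)
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # Mixed outcomes: correct rollouts get positive advantage, wrong ones negative.
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
    # Identical binary outcomes: every advantage is zero, so the policy
    # gradient for this group carries no learning signal.
    print(group_relative_advantages([0.0, 0.0, 0.0, 0.0]))
```

With binary rewards, groups where every rollout fails (hard prompts) or every rollout succeeds (easy prompts) contribute nothing to the update, which is why a denser reward signal can help most on the hardest benchmarks.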

2606.08814 2026-06-09 cs.AI cs.LG New submission

STAR: Rethinking MoE Routing as Structure-Aware Subspace Learning

Sumin Park, Noseong Park

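Affiliations: Korea Advanced Institute of Science and Technology (KAIST)

AI summary: Proposes STAR, which learns a principal subspace via the Generalized Hebbian Algorithm to make routing aware of input structure, enabling stable expert specialization and improving routing quality and downstream performance on synthetic data and on language and vision tasks.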

Comments Accepted at ICML 2026

Abstract

Mixture-of-Experts (MoE) scales model capacity efficiently by selectively routing inputs to a specialized subset of experts. However, input-expert specialization, the core motivation of MoE, critically depends on whether the router is actually aware of input structure. In practice, MoE routing is typically implemented as a shallow linear projection with limited awareness of input representation, which often leads to unstable routing. We propose STAR (Structure-Aware Routing), which rethinks MoE routing as a subspace learning problem by augmenting standard learnable routing with an evolving principal subspace that tracks dominant input structure via the Generalized Hebbian Algorithm (GHA). By aligning routing decisions directly with input structure, STAR enables stable expert specialization. We evaluate STAR on a controlled synthetic setup and on large-scale language and vision tasks, where it consistently improves routing quality and downstream performance over strong MoE baselines. Moreover, optional test-time subspace updates further enhance routing robustness and generalization under input distribution shifts.
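The Generalized Hebbian Algorithm (Sanger's rule) that the abstract relies on for tracking a principal subspace can be sketched on its own. The example below is only the textbook online update applied to a toy input stream; how STAR couples the subspace to the router is not shown, and the input distribution, learning rate, and function names are assumptions.

```python
import random


def gha_step(weights, x, lr):
    """Apply one Generalized Hebbian Algorithm (Sanger's rule) update.

    weights: list of k row vectors (lists of floats), one per component.
    Update for component i: w_i += lr * y_i * (x - sum_{j <= i} y_j * w_j),
    where y_j = w_j . x is computed from the weights before the update.
    Returns a new list of updated row vectors.
    """
    dim = len(x)
    outputs = [sum(w[d] * x[d] for d in range(dim)) for w in weights]
    updated = []
    for i, w in enumerate(weights):
        # Part of x already explained by components 0..i (the deflation term).
        explained = [
            sum(outputs[j] * weights[j][d] for j in range(i + 1))
            for d in range(dim)
        ]
        updated.append(
            [w[d] + lr * outputs[i] * (x[d] - explained[d]) for d in range(dim)]
        )
    return updated


def track_subspace(seed=0, steps=5000, lr=0.005):
    """Stream toy inputs whose variance is largest on axis 0, then axis 1."""
    rng = random.Random(seed)
    stds = [3.0, 1.0, 0.2]  # assumed toy input distribution
    weights = [[0.5, 0.5, 0.5], [0.5, -0.5, 0.5]]
    for _ in range(steps):
        x = [rng.gauss(0.0, s) for s in stds]
        weights = gha_step(weights, x, lr)
    return weights


if __name__ == "__main__":
    for row in track_subspace():
        print([round(v, 3) for v in row])
```

On this toy stream the first row should align, up to sign, with axis 0 and the second with axis 1. Because the update is online and needs no gradient, the same step can in principle keep running at test time.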

2606.08802 2026-06-09 cs.LG New submission

Active Flow Expansion for Out-of-Distribution Discovery: from Theory to Molecules

Riccardo De Santi, Bruce Lee, Cristian Perez Jensen, Kimon Protopapas, Sophia Tang, Cheng-Hao Liu, Pranam Chatterjee, Yisong Yue, Andreas Krause

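Affiliations: ETH Zurich; ETH AI Center; University of Pennsylvania; Caltech; FutureHouse

AI summary: Proposes Active Flow Expansion (ActFlow), which uses verifier feedback and active exploration to expand the generable set of a pre-trained flow model so that it covers more of the valid design space; provides statistical learning guarantees and outperforms existing methods on molecule and protein tasks.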

Abstract

Standard flow and diffusion pre-training matches the distribution of available data (e.g., molecules), which often covers only a small fraction of the valid design space. In generative discovery, however, one aims to sample valid new-to-nature designs, which are assigned negligible probability under, and are thus inaccessible to, standard models fitted to the observed data. To overcome this limitation, we depart from data distribution matching and view a generative model through its generable set: the region it covers with non-negligible probability. This allows us to introduce a new learning principle for out-of-distribution flow modeling: enlarging a model's generable set to increase coverage of the valid design space. We propose Active Flow Expansion (ActFlow), a continued pre-training method that employs verifier feedback to expand a pre-trained model over new valid regions by iteratively adapting to synthetic data generated through active exploration in the learned flow representation. Theoretically, we establish, to our knowledge, first-of-their-kind statistical learning guarantees for out-of-distribution flow modeling, analyzing generable set expansion as a local-to-global reachability process over a learned representation. Empirically, we assess ActFlow with suitable out-of-distribution generative modeling metrics across small organic molecules, mid-sized drug-like molecules, therapeutic peptides, and protein sequence design tasks. Results show that ActFlow expands valid coverage far beyond the region modeled by the initial pre-trained model, significantly outperforming widely adopted synthetic flow pre-training methods.

2606.08800 2026-06-09 cs.AI New submission

Bridging Expert Knowledge and Automated Feature Engineering via Self-Evolution

Varun Khurana, Vijval Ekbote, Vashu Chauhan, Yaman Kumar Singla, Rajiv Ratn Shah, Balaji Krishnamurthy

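Affiliations: Adobe Media and Data Science Research; IIIT-Delhi

AI summary: Proposes FEST, which combines dual-stream feature generation, semantic deduplication, and tree-guided iterative evolution to discover auditable features from raw text and images; it improves on the strongest baseline by 4.2 percentage points on average across tasks such as brand classification and covers 60-80% of expert-designed features.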

Abstract

In high-stakes settings such as brand compliance, clinical care, and content moderation, machine learning cannot be deployed as opaque oracles: practitioners inspect the features driving model decisions, and models must leverage the expert documentation governing these domains. In practice, the data arrives as unstructured content, and features extracted from it must be interpretable, discriminative, and aligned with what experts consider important. Existing methods fall short: they target tabular inputs, lack demonstrated expert alignment, and cannot operationalize qualitative criteria such as 'maintain professional tone' into precise features. We present FEST (Feature Engineering with Self-evolving Trees), combining dual-stream feature generation (semantic and deterministic), semantic deduplication, and tree-guided iterative evolution to discover auditable features from raw text and images. FEST leads in 17 of 20 classifier-task combinations across brand classification, content authenticity detection, and stress detection, with a mean gain of 4.2 pp over the strongest baseline across five classifiers. An LLM-as-judge evaluation shows FEST achieves 60-80% coverage of expert-designed brand features at strict semantic-alignment thresholds, corroborated by a human expert study rating features highly on relevance, clarity, and actionability. When seeded with expert guidelines, FEST refines qualitative criteria into operational features, improving accuracy by 6-12 pp on average across brands. To enable systematic evaluation of expert alignment in automated feature engineering, we release BrandGuide, the first dataset pairing expert-designed features with 1M+ assets across 2,683 brands. By grounding feature engineering in expert knowledge, FEST opens a practical pathway for interpretable ML in domains demanding human oversight.

2606.08797 2026-06-09 cs.LG cs.AI New submission

Scaling Decision-Focused Learning to Large Problems with Lagrangian Decomposition

Stéphane Eilles-Chan Way, Hugo Percot, Quentin Cappart, Tias Guns, Louis-Martin Rousseau

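Affiliations: Polytechnique Montréal; Ecole Polytechnique; UCLouvain; Mila - Québec AI Institute; KU Leuven

AI summary: Proposes a decision-focused learning framework built on Lagrangian decomposition, with a new surrogate objective and two loss functions, that handles large-scale constrained optimization problems while remaining parallelizable; experiments show it outperforms traditional methods on instances with up to eight times more variables.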

Abstract

Decision-focused learning has shown great promise for addressing predict-then-optimize problems, particularly in the presence of under-specified models. However, its practical deployment is often hindered by high computational costs and limited scalability, as it requires solving a constrained optimization problem for each training instance at every iteration. To address these challenges, we propose a novel framework that incorporates Lagrangian decomposition into the decision-focused learning paradigm. Specifically, we introduce a new surrogate objective along with two loss functions for evaluating and training the underlying prediction model. We further propose two variants of our approach, which offer different trade-offs between computational efficiency and solution quality. Our framework can be seamlessly integrated with standard decision-focused learning methods, including Smart Predict-then-Optimize (SPO+) and Implicit Maximum Likelihood Estimation (IMLE). Through experiments on two standard benchmarks, the multi-dimensional knapsack problem and quadratic portfolio optimization, we demonstrate that our approach achieves competitive performance while remaining amenable to parallelization. In particular, it consistently outperforms traditional decision-focused learning methods on large-scale instances, involving up to eight times more variables than those typically considered in related work. The implementation is available at https://github.com/corail-research/DFL-LD.

2606.08795 2026-06-09 cs.CV New submission

PairWise Image Finder: An Open-source Tool for Finding Visually Aligned Street-Level Image Pairs for Urban Perception Studies

Jussi Torkko

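Affiliations: Digital Geography Lab, Department of Geosciences and Geography, University of Helsinki

AI summary: Presents the PairWise image finder, a tool that integrates feature detection and matching with semantic segmentation masks to quantify the visual alignment of images from different time periods; it outputs the share of matched features, matched feature distance and coverage, and semantic mask alignment, so users can filter for high-quality image pairs for longitudinal change studies and reduce manual effort.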

Comments 6 pages, two figures, github repo link near the end

Abstract

Change detection and scene recognition techniques have been widely applied to Street View Imagery (SVI) to understand changes in scenes across the years. However, metadata alone is often insufficient to reliably find visually aligned image pairs. This study introduces the PairWise image finder, a tool that integrates feature detection and matching, supported by semantic segmentation masks to quantify the visual alignment of two images of varying time periods. The tool outputs the share of matched key features, the matched feature distance and coverage, and the alignment of semantic masks, which enables the user to filter image pairs depending on the alignment quality and use case. The visually aligned pairs derived from the tool can be used to accurately study explicit longitudinal change and help reduce manual effort for perception studies. The usability of the tool is demonstrated through a comparison of longitudinal changes, highlighting the importance of perspective when quantifying changes. The proposed method provides a scalable and open tool for researchers and stakeholders to find high-quality image pairs for urban analysis, perception and related applications.

2606.08792 2026-06-09 cs.CL 新提交

The Amplifying Mirror: Locating and Steering the Partisan Direction inside a Large Language Model

放大之镜:定位和操控大语言模型内的党派方向

Wendy K. Tam

发表机构 * Vanderbilt University(范德比尔特大学) University of Illinois at Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校)

AI总结 通过线性探针在Llama 3.1 8B Instruct模型的隐藏状态中定位党派政治身份方向,并利用稀疏自编码器分解为可解释特征,因果干预可系统性改变模型输出,证明党派偏见是可定位和操控的几何特征。

AI中文摘要

大型语言模型正迅速取代搜索引擎,成为人与信息之间的主要界面。与检索现有内容的搜索引擎不同,LLM生成受训练期间学到的内部表示影响的新文本。在这里,我们展示了党派政治身份编码在模型的激活空间中,并且这个方向直接塑造生成。使用来自美国国会现任议员的190,491条推文作为标记训练数据,我们在Llama 3.1 8B Instruct模型的隐藏状态上训练线性探针。我们在第18层识别出一个单一的几何轴,该轴以0.945的AUC和1.94的Cohen's d区分共和党和民主党文本,并使用稀疏自编码器将该轴分解为可解释的党派特征。沿该轴进行因果干预,在生成过程中消融或放大党派成分,会产生模型输出的系统性变化。我们观察到立场反转、语域转换以及结构化的权威捏造。我们的结果表明,语言模型中的党派偏见不是模糊的涌现属性,而是可以精确定位和操控的习得几何特征。党派偏见不是需要修补的漏洞,而是这些模型如何编码关于用户信息的结构属性。随着LLM取代搜索引擎成为知识界面,理解产品设计(及其后果)对于驾驭从策划到生成的信息生态系统的法律、社会和政治转型至关重要。

英文摘要

Large language models are rapidly replacing search engines as the primary interface between people and information. Unlike search engines, which retrieve existing content, LLMs generate novel text shaped by internal representations learned during training. Here we show that partisan political identity is encoded in the model's activation space, and that this direction directly shapes generation. Using 190,491 tweets from sitting members of the U.S. Congress as labeled training data, we train linear probes on the hidden states of the Llama 3.1 8B Instruct model. We identify a single geometric axis at layer 18 that separates Republican from Democratic text with an AUC of 0.945 and a Cohen's d of 1.94, and use sparse autoencoders to decompose that axis into interpretable partisan features. Causally intervening along this axis, ablating or amplifying the partisan component mid-generation, produces systematic shifts in the model's output. We witness stance reversals, register shifting, and structured fabrications of authority. Our results demonstrate that partisan bias in language models is not a vague emergent property but a learned geometric feature that can be precisely located and steered. Partisan bias is not a bug to be patched, but a structural property of how these models encode information about their users. As LLMs displace search engines as the interface to knowledge, understanding that product design (and its consequences) will be essential for navigating the legal, social, and political transitions from an information ecosystem that is curated to one that is generated.
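The probe-and-steer idea in the abstract above can be illustrated on toy data. The sketch below finds a separating direction in a 2-D "activation" space, measures the separation with Cohen's d, and rescales the component along that direction. It rests on assumptions: it uses a difference-of-means direction in place of the paper's trained linear probes, and the data, the `steer` helper, and the `alpha` parameter are hypothetical names chosen for illustration.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def mean_vec(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]

def unit(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cohens_d(a, b):
    """Standardized mean difference using a pooled sample variance."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    va = sum((x - ma) ** 2 for x in a) / (len(a) - 1)
    vb = sum((x - mb) ** 2 for x in b) / (len(b) - 1)
    pooled = math.sqrt(((len(a) - 1) * va + (len(b) - 1) * vb) / (len(a) + len(b) - 2))
    return (ma - mb) / pooled

def steer(h, u, alpha):
    """Rescale the component of h along the unit vector u.

    alpha=0 ablates it, alpha=1 leaves h unchanged, alpha>1 amplifies it.
    """
    c = dot(h, u)
    return [x + (alpha - 1.0) * c * ui for x, ui in zip(h, u)]

# Toy hidden states for two groups of texts (made-up numbers).
group_a = [[2.0, 0.0], [3.0, 1.0]]
group_b = [[-2.0, 0.0], [-3.0, 1.0]]

# Hypothetical stand-in for a probe direction: normalized difference of means.
direction = unit([p - q for p, q in zip(mean_vec(group_a), mean_vec(group_b))])
proj_a = [dot(h, direction) for h in group_a]
proj_b = [dot(h, direction) for h in group_b]
separation = cohens_d(proj_a, proj_b)

ablated = steer(group_a[1], direction, alpha=0.0)
amplified = steer(group_a[1], direction, alpha=2.0)
```

In a real model the same projection would be applied to a layer's hidden state during generation; here the vectors are only placeholders.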

2606.08790 2026-06-09 cs.AI cs.CR cs.MA 新提交

RAILS: Verification-Native Clearing For Agentic Commerce

RAILS: 面向代理商务的验证原生清算

Adrian de Valois-Franklin, Alex Bogdan

发表机构 * Evolutionairy AI

AI总结 针对自主代理在商务活动中缺乏中立清算机制的问题,提出RAILS协议,通过可靠性评分、记录和清算函数实现验证原生清算,确保财务结算基于充分证据。

Comments 49 pages, 15 figures

AI中文摘要

自主代理进行谈判、购买、部署代码和转移资金,但缺乏中立机制来确定它们是否履行了委托义务、未履行时谁负责、以及后续采取何种结算行动。这就是代理清算问题。工具协议(MCP)、代理间通信(A2A)、支付轨道(x402)、授权和网络代理协议(AP2、Visa、Mastercard)以及结算风险标准都假设存在这种确定机制,但都没有产生它。清算是缺失的原语。支付不是清算。授权不是清算。LLM作为法官的评估不是清算。结算风险托管不是清算:它消耗清算决策。RAILS(实时代理完整性与账本结算)是代理商务的完整性和清算层,涵盖每个输出的可靠性评分、公开的可靠性记录以及消耗它们的清算函数。其核心的清算协议填补了这一空白。七个原语(义务对象、证据信封、验证网格、清算决策、结算指令、清算护照、终局规则),由可接受性分级验证的形式模型约束,共同产生一个可靠性属性:没有财务上重要的结算得到低于义务可接受性底线的证据支持。该属性可针对规范进行证伪。我们不知道先前的代理商务验证机制陈述过此类属性。最接近的方法输出通过、交付保证、裸分数或均衡。本文详细说明了该清算协议。

英文摘要

Autonomous agents negotiate, purchase, deploy code, and move funds, but no neutral mechanism determines whether they met their delegated obligation, who is responsible when they did not, or which settlement action follows. This is the agentic clearing problem. Tool protocols (MCP), inter-agent communication (A2A), payment rails (x402), mandate and network agent protocols (AP2, Visa, Mastercard), and settlement-risk standards each assume that determination and none produce it. Clearing is the missing primitive. Payment is not clearing. Authorization is not clearing. LLM-as-judge evaluation is not clearing. Settlement-risk escrow is not clearing: it consumes clearing decisions. RAILS (Real-Time Agent Integrity & Ledger Settlement) is the integrity and clearing layer for agentic commerce, spanning a per-output reliability score, a published reliability record, and a clearing function that consumes them. The clearing protocol at its core closes that gap. Seven primitives (Obligation Object, Evidence Envelope, Verification Mesh, Clearing Decision, Settlement Instruction, Clearing Passport, Finality Rules), bound by a formal model of admissibility-graded verification, together yield a soundness property: no financially material settlement is supported by evidence below the obligation's admissibility floor. The property is falsifiable against the spec. We are not aware of a prior agent-commerce verification mechanism that states a property of this kind. The approaches nearest to it emit a pass, a delivery guarantee, a bare score, or an equilibrium. This paper specifies that clearing protocol.

2606.08788 2026-06-09 cs.CV 新提交

MaskAlign: Token-Subset Representation Alignment for Efficient Diffusion Training

MaskAlign: 面向高效扩散训练的令牌子集表示对齐

Lianyu Pang, Tianlin Pan, Cheng Da, Changqian Yu, Huan Yang, Kun Gai, Song Guo, Wenhan Luo

发表机构 * The Hong Kong University of Science and Technology(香港科技大学) Kuaishou Technology(快手科技) University of Chinese Academy of Sciences(中国科学院大学)

AI总结 针对扩散模型与预训练视觉模型表示对齐中令牌级信息不匹配问题,提出MaskAlign方法,通过随机采样令牌子集进行对齐,并引入预掩码令牌混合块减少信息损失,提升训练效率和生成质量。

AI中文摘要

与预训练视觉模型的表示对齐最近显示出加速扩散Transformer训练的潜力。通过将中间扩散特征与来自自监督视觉编码器的干净图像表示对齐,现有方法提高了收敛速度和生成质量。然而,这种对齐也引入了一个非平凡的约束:扩散模型处理噪声输入,其可用信息随时间步变化,而参考特征是从干净图像中提取的。在本文中,我们从令牌级角度重新审视这种不匹配。我们发现,在全令牌表示对齐下,具有较大对齐梯度范数的令牌表现出稳定的空间偏好,这表明对齐目标并非均匀影响所有令牌,可能鼓励模型依赖完整的干净图像令牌集。为了解决这个问题,我们提出了MaskAlign,一种令牌子集表示对齐方法,在训练期间对随机采样的令牌子集应用对齐。通过在不同迭代中向模型暴露不同的令牌子集,MaskAlign减少了表示对齐对完整令牌集的依赖,并鼓励在令牌子集扰动下更稳定的对齐行为。为了缓解直接丢弃令牌导致的信息损失,我们进一步引入了一个轻量级的预掩码令牌混合块,在掩码之前跨令牌共享信息。

英文摘要

Representation alignment with pretrained vision models has recently shown strong potential for accelerating diffusion transformer training. By aligning intermediate diffusion features with clean-image representations from self-supervised vision encoders, existing methods improve convergence and generation quality. However, such alignment also introduces a non-trivial constraint: diffusion models operate on noisy inputs whose usable information varies across timesteps, while the reference features are extracted from clean images. In this paper, we revisit this mismatch from a token-level perspective. We find that, under full-token representation alignment, tokens with large alignment-gradient norms exhibit a stable spatial preference, suggesting that the alignment objective does not affect all tokens uniformly and may encourage the model to rely on the complete set of clean-image tokens. To address this issue, we propose MaskAlign, a token-subset representation alignment method that applies alignment to randomly sampled token subsets during training. By exposing the model to different token subsets across iterations, MaskAlign reduces the dependence of representation alignment on the complete token set and encourages alignment behavior that is more stable under token-subset perturbations. To mitigate the information loss caused by directly dropping tokens, we further introduce a lightweight pre-mask token mixing block that shares information across tokens before masking.
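To make the token-subset idea concrete, here is a minimal sketch of an alignment loss computed on a randomly sampled subset of token positions. It is a generic illustration, not MaskAlign itself: the cosine-based loss, the `keep_ratio` parameter, and the function names are assumptions, and the pre-mask token mixing block is not modeled.

```python
import math
import random

def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den

def subset_alignment_loss(diffusion_tokens, reference_tokens, keep_ratio, rng):
    """Mean (1 - cosine similarity) over a random subset of token positions.

    Returns the loss and the sorted list of positions that were kept.
    """
    n_tokens = len(diffusion_tokens)
    n_keep = max(1, int(n_tokens * keep_ratio))
    kept = rng.sample(range(n_tokens), n_keep)
    loss = sum(1.0 - cosine(diffusion_tokens[i], reference_tokens[i]) for i in kept)
    return loss / n_keep, sorted(kept)

# Toy features: 4 tokens with 2 channels each (made-up numbers).
reference = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
orthogonal = [[0.0, 1.0], [1.0, 0.0], [1.0, -1.0], [1.0, -2.0]]

rng = random.Random(0)
loss_same, kept_same = subset_alignment_loss(reference, reference, 0.5, rng)
loss_orth, kept_orth = subset_alignment_loss(orthogonal, reference, 0.5, rng)
```

Drawing a fresh subset each iteration means no single token position is always part of the alignment target, which is the behavior the abstract describes.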

2606.08780 2026-06-09 cs.CV 新提交

Beyond Consistency: Preserving Temporal Structure in Zero-Shot Video Editing

超越一致性:在零样本视频编辑中保留时间结构

Deyin Liu, Yisheng Ding, Zhe Jin, Xiatian Zhu, Anjan Dutta, Lin Wu

发表机构 * Anhui University(安徽大学) University of Surrey(萨里大学) University of Warwick(华威大学)

AI总结 提出一种零样本视频编辑方法,通过自适应分割视频片段、选取锚帧和令牌合并策略,首次显式保留源视频的时间结构,平衡编辑保真度与计算效率。

AI中文摘要

现有的零样本视频编辑方法依赖预训练的扩散模型,成功实现了空间控制和基本的时间一致性,但根本上未能保留视频的原始时间结构。这一区别至关重要:时间一致性确保视觉平滑,而时间结构决定了视频的高层叙事、节奏和语义流。没有这种保留,编辑输出(尤其是具有复杂语义变化的长视频)在叙事上变得不连贯,语义模糊。为了解决这一局限性,我们提出了一种新颖的零样本编辑方法,首次明确关注保留源视频的时间结构。我们通过基于特征相似性自适应地将视频分割成语义不同的片段,并为每个片段选择一个代表性的锚帧来实现这一点。为了增强片段内保真度和计算效率,我们设计了一种片段自适应的令牌合并策略,利用锚帧的语义主导性来稳定编辑。此外,我们采用交替组合策略,确保片段间无缝过渡,同时保持语义区分。大量实验表明,我们的方法达到了最先进的结果,成功平衡了原始时间结构的保留与计算效率,为零样本视频编辑保真度设立了新基准。

英文摘要

Existing zero-shot video editing methods rely on pre-trained diffusion models and successfully achieve spatial control and basic temporal consistency, but they fundamentally fail to preserve the video's original temporal structure. This distinction is critical: temporal consistency ensures visual smoothness, but temporal structure dictates the video's high-level narrative, rhythm, and semantic flow. Without this preservation, the edited output, especially for long videos with complex semantic variations, becomes narratively incoherent and semantically ambiguous. To address this limitation, we introduce a novel zero-shot editing approach that, for the first time, explicitly focuses on preserving the source video's temporal structure. We achieve this by adaptively partitioning the video into semantically distinct clips based on feature similarity and selecting a representative anchor frame for each clip. To enhance both intra-clip fidelity and computational efficiency, we design a clip-adaptive token merging strategy which leverages the anchor's semantic dominance to stabilize the editing. Furthermore, we employ an alternating combination strategy that ensures seamless inter-clip transitions while maintaining semantic distinction. Extensive experiments demonstrate that our method achieves state-of-the-art results, successfully balancing the preservation of original temporal structure with computational efficiency, and setting a new benchmark for zero-shot video editing fidelity.

2606.08777 2026-06-09 cs.LG cs.AI 新提交

How Many Counterfactuals Does It Take? Probing VLM Hallucinations Through Circuits and Causal Effects

需要多少反事实?通过电路和因果效应探究VLM幻觉

Abhivansh Gupta, Simardeep Singh, Advika Sinha, Shreyansh Modi, Akshat Tomar

发表机构 * University of California, Berkeley(加州大学伯克利分校) DeepMind(深度思维)

AI总结 本文通过定义基于对数概率差异的因果影响度量,并利用电路发现技术,研究视觉语言模型幻觉输出的反事实鲁棒性,推导出检测不稳定所需的最小反事实样本数。

AI中文摘要

视觉语言模型(VLM)已知会产生不基于视觉证据的幻觉预测,但现有方法缺乏对这些预测在反事实扰动下鲁棒性的原则性理解。在这项工作中,我们研究了VLM中幻觉输出的反事实鲁棒性的样本复杂度。我们基于事实、反事实和激活修补运行之间的对数概率差异定义了一个因果影响度量,并用它来表征幻觉预测的稳定性。通过利用电路发现技术(CD-T),我们识别负责这些预测的模型组件,并追踪它们在反事实样本中的激活差异。然后,我们利用浓度不等式和因果影响分布的方差估计,推导出可靠检测幻觉输出不稳定性所需的最小反事实样本数m的经验界限。

英文摘要

Visual Language Models (VLMs) are known to produce hallucinated predictions that are not grounded in visual evidence, yet existing approaches lack a principled understanding of how robust such predictions are under counterfactual perturbations. In this work, we study the sample complexity of counterfactual robustness for hallucinated outputs in VLMs. We define a causal influence metric based on log-probability differences between factual, counterfactual, and activation-patched runs, and use it to characterize the stability of hallucinated predictions. By leveraging circuit discovery techniques (CD-T), we identify model components responsible for these predictions and track their activation differences across counterfactual samples. We then derive empirical bounds on the minimum number of counterfactual samples m required to reliably detect instability in hallucinated outputs, using concentration inequalities and variance estimates of the causal influence distribution.
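A sample-count bound of the kind mentioned above can be sketched with Hoeffding's inequality. This is an illustrative assumption: the abstract only says "concentration inequalities and variance estimates", so the specific inequality, the bounded-range assumption on the causal influence score, and the function name below are not taken from the paper.

```python
import math

def hoeffding_samples(value_range, epsilon, delta):
    """Smallest m such that the mean of m i.i.d. samples, each confined to an
    interval of width value_range, is within epsilon of its expectation with
    probability at least 1 - delta.

    Derived from P(|mean - mu| >= epsilon) <= 2 * exp(-2 * m * epsilon^2 / R^2).
    """
    return math.ceil(value_range ** 2 * math.log(2.0 / delta) / (2.0 * epsilon ** 2))

# Example: influence scores assumed to lie in an interval of width 1.0.
m_loose = hoeffding_samples(1.0, epsilon=0.1, delta=0.05)
m_tight = hoeffding_samples(1.0, epsilon=0.05, delta=0.05)
```

Halving `epsilon` roughly quadruples the required number of counterfactual samples, which is why variance-aware bounds can be attractive when the influence distribution is concentrated.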

2606.08775 2026-06-09 cs.RO cs.AI 新提交

Unifying Object-Centric World Models and Diffusion Policy: A Hierarchical Framework for Multi-Stage Robotic Tasks

统一对象中心世界模型与扩散策略:多阶段机器人任务的分层框架

Raktim Gautam Goswami, Prashanth Krishnamurthy, Yann LeCun, Farshad Khorrami

发表机构 * Tandon School of Engineering, New York University(纽约大学坦登工程学院) Courant Institute of Mathematical Sciences, New York University(纽约大学库朗数学科学研究所) AMI Labs(AMI实验室)

AI总结 提出WorldDP分层框架,结合高层世界模型进行运行时子目标优化和低层扩散策略执行,利用对象中心表示解耦环境实体,实现多阶段机器人操作任务的有效规划与执行。

AI中文摘要

视觉世界模型在学习复杂系统动力学方面显示出巨大潜力。最近的进展利用这些模型作为模型预测控制(MPC)框架中的转移函数来解决各种控制任务。然而,当应用于机器人时,它们仅限于单阶段任务(如抓取或到达),难以处理需要复杂序列规划的多阶段任务。在这项工作中,我们引入了WorldDP,一个专为多阶段机器人操作设计的世界模型框架。我们的分层方法利用高层世界模型作为转移函数,在运行时优化可行的子目标,随后由低层扩散策略实现这些子目标。为了进一步辅助学习动力学和规划,我们结合了对象中心表示,这些表示解耦了环境实体,并使我们能够针对每个实体进行顺序规划。在多个机器人基准测试中,WorldDP始终优于现有基线,验证了将世界模型的物理基础规划与扩散策略的高效执行相结合,能够产生更优的多阶段性能。

英文摘要

Visual world models have shown great potential in learning complex system dynamics. Recent advancements leverage these models as transition functions within Model Predictive Control (MPC) frameworks to solve various control tasks. When applied to robotics, however, they are limited to single-stage tasks such as reaching or grasping, and struggle with multi-stage ones that demand complex sequential planning. In this work, we introduce WorldDP, a world model framework designed for multi-stage robotic manipulation. Our hierarchical approach utilizes a high-level world model as a transition function to optimize for feasible subgoals during runtime, which are subsequently reached by a low-level Diffusion Policy. To further aid in learning dynamics and planning, we incorporate object-centric representations that decouple environmental entities and enable us to plan sequentially with respect to each. Evaluated across several robotics benchmarks, WorldDP consistently outperforms existing baselines, validating that coupling the world model's physically grounded planning with diffusion policy's efficient execution yields superior multi-stage performance.

2606.08770 2026-06-09 cs.CL cs.AI cs.CV cs.LG 新提交

TeamHerald@CHIPSAL 2026: Hate Speech Detection and Sentiment Analysis of Nepali Memes using Transformer-based Architectures and Ensemble Learning

TeamHerald@CHIPSAL 2026:基于Transformer架构和集成学习的尼泊尔语模因仇恨言论检测与情感分析

Ashish Acharya, Anish Khatiwada, Rohit Khadka, Pragya Aryal

发表机构 * Herald College Kathmandu(加德满都赫尔德学院)

AI总结 针对尼泊尔语模因中代码混合和资源匮乏问题,采用OCR提取文本并结合Transformer模型,发现硬/软投票集成策略在二分类和多分类任务中表现不同,软投票在多类情感任务中提升15.8%的Macro F1分数。

Comments Accepted at the 2nd Workshop on Challenges in Processing South Asian Languages (CHiPSAL 2026) at LREC 2026

AI中文摘要

尼泊尔语互联网模因的分析因频繁的代码混合和缺乏已建立的基线资源而变得复杂。虽然模因本质上结合了视觉和文本元素,但本研究侧重于以文本为中心的方法,通过OCR层提取嵌入文本,并使用基于Transformer的架构进行建模。我们评估了六种不同的模型,并研究了硬投票和软投票集成策略在两项任务中的比较效果:二分类仇恨言论检测和三分类情感分析。实验结果表明,独立的仅解码器模型在二分类任务中取得了最高性能,而软投票集成在多类情感任务中表现最佳,相比最强的独立基线,Macro F1分数相对提升了15.8%。这些发现表明,集成策略在二分类和多类任务中表现不同,突出了选择适合分类目标的聚合方法的重要性。

英文摘要

The analysis of internet memes in the Nepali language is complicated by frequent code-mixing and a lack of established baseline resources. While memes inherently combine visual and textual elements, this study focuses on a text-centric approach by extracting embedded text using an OCR layer and modeling it with Transformer-based architectures. We evaluate six distinct models and investigate the comparative effectiveness of Hard and Soft Voting ensemble strategies across two tasks: binary hate speech detection and three-class sentiment analysis. Experimental results show that a standalone decoder-only model achieved the highest performance for binary classification, whereas the Soft Voting ensemble performed best for the multi-class sentiment task, yielding a 15.8% relative improvement in Macro F1-score over the strongest standalone baseline. These findings suggest that ensemble strategies behave differently across binary and multi-class tasks, highlighting the importance of selecting aggregation methods suited to the classification objective.
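The difference between the two ensemble strategies can be shown in a few lines. The sketch below is a generic illustration of hard and soft voting over per-model class probabilities; the probabilities are made up and the function names are assumptions, not the team's code.

```python
from collections import Counter

def hard_vote(prob_rows):
    """Each model votes for its argmax class; the most common vote wins.
    Ties go to the class that was voted for first."""
    votes = [max(range(len(p)), key=p.__getitem__) for p in prob_rows]
    return Counter(votes).most_common(1)[0][0]

def soft_vote(prob_rows):
    """Average the class probabilities across models, then take the argmax."""
    n_classes = len(prob_rows[0])
    avg = [sum(p[c] for p in prob_rows) / len(prob_rows) for c in range(n_classes)]
    return max(range(n_classes), key=avg.__getitem__)

# Three hypothetical models scoring one meme over three sentiment classes.
probs = [
    [0.90, 0.05, 0.05],  # very confident in class 0
    [0.40, 0.60, 0.00],  # mildly prefers class 1
    [0.35, 0.65, 0.00],  # mildly prefers class 1
]

hard_label = hard_vote(probs)
soft_label = soft_vote(probs)
```

Here the two strategies disagree: hard voting follows the two mild votes for class 1, while soft voting lets the single confident model tip the average toward class 0.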

2606.08769 2026-06-09 cs.CL cs.AI 新提交

RadOT-Eval: Auditable Structured-Evidence Transport for Radiology Report Evaluation

RadOT-Eval:用于放射学报告评估的可审计结构化证据传输

Weixin Liu, Juming Xiong, Yang Li, Qingyuan Song, Susannah Rose, Murat Kantarcioglu, Bradley Malin, Zhijun Yin

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出RadOT-Eval框架,通过最优传输对齐结构化临床证据,在独立测试集上与错误负担取得较高的斯皮尔曼相关,优于标准指标和基于LLM的评估器。

Comments 10 pages, 1 figure, 13 tables

AI中文摘要

自动评估对于高风险文本生成至关重要,其中的错误通常涉及遗漏发现、幻觉内容、极性反转、位置变化、不确定性不匹配和时间比较错误,而不仅仅是低表面相似性。放射学报告生成提供了一个具有挑战性的测试案例,因为生成的报告必须跨来源保留结构化临床证据。我们提出了RadOT-Eval,一个可解释的结构化证据最优传输框架,用于离线审计放射学报告生成。RadOT-Eval将参考报告和候选报告分解为属性结构化的临床证据单元,使用熵正则化最优传输对齐相应的证据,并在单调风险模型中使用临床意义的侧信道差异来预测错误负担。所有传输、特征和读出选择均使用ReXVal数据集进行选择,并在独立的RadEvalX数据集上评估冻结系统。RadOT-Eval与总错误负担、临床显著错误负担和临床不显著错误负担的斯皮尔曼相关系数分别为0.715、0.548和0.399,其点估计值高于标准评估指标和基于开源大语言模型(LLM)的评估器GREEN-radllama2-7B。在ReXErr-v1上的冻结辅助腐败敏感性压力测试中,RadOT-Eval达到了0.768的AUROC和0.990的腐败大于干净的配对胜率。这些结果表明,在仅使用ReXVal模型选择和冻结RadEvalX测试下,结构化证据传输为高风险生成的临床文本提供了一个可审计、面向排序的评估工具。

英文摘要

Automatic evaluation is critical for high-stakes text generation, where errors often involve omitted findings, hallucinated content, polarity reversals, location changes, uncertainty mismatches, and temporal-comparison errors rather than low surface similarity alone. Radiology report generation provides a challenging test case because generated reports must preserve structured clinical evidence across sources. We present RadOT-Eval, an interpretable structured-evidence optimal transport framework for offline auditing of radiology report generation. RadOT-Eval decomposes reference and candidate reports into attribute-structured clinical evidence units, aligns corresponding evidence using entropy-regularized optimal transport, and uses clinically meaningful side-channel discrepancies in a monotone risk model to predict error burden. All transport, feature, and readout choices are selected using the ReXVal dataset, and the frozen system is evaluated on the independent RadEvalX dataset. RadOT-Eval achieves Spearman correlations of 0.715, 0.548, and 0.399 with total, clinically significant, and clinically insignificant annotated error burden, respectively, yielding higher point estimates than standard evaluation metrics and the open-source large language model (LLM)-based evaluator GREEN-radllama2-7B. In a frozen auxiliary corruption-sensitivity stress test on ReXErr-v1, RadOT-Eval achieves 0.768 AUROC and a 0.990 corrupted-greater-than-clean paired win rate. These results show that structured evidence transport provides an auditable, rank-oriented evaluation tool for high-stakes generated clinical text under ReXVal-only model selection and frozen RadEvalX testing.
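Entropy-regularized optimal transport, the alignment step named above, is commonly solved with Sinkhorn iterations. The sketch below is a generic pure-Python version on a toy cost matrix; the cost values, marginals, and `reg` setting are assumptions for illustration and do not reflect RadOT-Eval's actual features or hyperparameters.

```python
import math

def sinkhorn(cost, a, b, reg=0.5, n_iters=50):
    """Return a transport plan between marginals a and b that approximately
    minimizes <plan, cost> minus reg times the plan's entropy."""
    n, m = len(a), len(b)
    kernel = [[math.exp(-c / reg) for c in row] for row in cost]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]

# Two reference evidence units vs. two candidate units; low cost on the diagonal
# means unit 0 matches unit 0 and unit 1 matches unit 1 (made-up costs).
cost = [[0.0, 1.0], [1.0, 0.0]]
plan = sinkhorn(cost, a=[0.5, 0.5], b=[0.5, 0.5])
```

Most of the mass lands on the low-cost pairs, and the off-diagonal mass shrinks as `reg` is reduced, which is what makes the resulting alignment easy to inspect.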

2606.08768 2026-06-09 cs.LG 新提交

Understanding the Parameter Space Geometry of Transformers Encoding Boolean Functions

理解编码布尔函数的Transformer参数空间几何

Blanka Köver, Alexandra Butoi, Anej Svete, Michael Hahn, Ryan Cotterell

发表机构 * Machine Learning, ICML(机器学习,ICML)

AI总结 针对Transformer无法学习某些简单布尔函数(如奇偶函数)的问题,通过分析参数空间几何,证明敏感函数在参数空间中占据极小区域,随机初始化几乎必然错过,从而解释了可表达但不可学习的现象。

Comments ICML 2026

AI中文摘要

Transformer始终无法学习某些简单的函数,而这些函数在特定参数设置下是可证明表达的。这种可学习性与可表达性之间的差距对于敏感函数尤为突出——例如奇偶函数,其输出在输入单个比特翻转时很可能改变。虽然先前的研究已经确定Transformer偏向于平均敏感度低的函数,但这种偏向背后的精确机制仍不清楚。为了阐明这一现象,我们研究了Transformer参数空间的几何结构。我们证明,敏感函数——即使可表示——占据了一个极小区域,随机初始化极有可能错过。具体而言,我们将关注点从平均敏感度转移到完整的敏感度分布——所有输入上敏感度值的分布——并证明随机初始化的Transformer几乎必然计算具有低敏感度字符串的函数。因此,任何缺乏此类字符串的函数都是可证明不可学习的。

英文摘要

Transformers consistently fail to learn certain simple functions that are provably expressible with specific parameter settings. This gap between learnability and expressivity is particularly prominent for sensitive functions -- functions whose output is likely to change if a single bit of the input is flipped -- for example, PARITY. While prior work has established that transformers exhibit a bias toward functions with low average sensitivity, the precise mechanism underlying this bias remains poorly understood. To shed light on this phenomenon, we study the geometry of transformers' parameter space. We show that sensitive functions -- even when representable -- occupy a vanishingly small region that random initialization is very likely to miss. Specifically, we shift the focus from average sensitivity to the full sensitivity profile -- the distribution of sensitivity values across all inputs -- and prove that randomly initialized transformers almost surely compute functions which have low-sensitivity strings. Consequently, any function that lacks such strings is provably unlearnable.
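The notion of a sensitivity profile can be checked by brute force on small inputs. The sketch below computes, for every input string, how many single-bit flips change the function's output, and tallies the resulting distribution; the helper names are assumptions, and the enumeration is only practical for small n.

```python
from collections import Counter
from itertools import product

def sensitivity(f, x):
    """Number of single-bit flips of x that change f(x)."""
    fx = f(x)
    count = 0
    for i in range(len(x)):
        flipped = x[:i] + (1 - x[i],) + x[i + 1:]
        if f(flipped) != fx:
            count += 1
    return count

def sensitivity_profile(f, n):
    """Map each sensitivity value to the number of n-bit inputs that have it."""
    return Counter(sensitivity(f, x) for x in product((0, 1), repeat=n))

def parity(x):
    return sum(x) % 2

def conjunction(x):
    return int(all(x))

parity_profile = sensitivity_profile(parity, 3)
and_profile = sensitivity_profile(conjunction, 3)
```

PARITY has maximal sensitivity on every input, so it has no low-sensitivity strings at all, whereas AND is insensitive on half of the 3-bit inputs. That contrast is the property the abstract's argument turns on.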

2606.08755 2026-06-09 cs.CL 新提交

Co-Evolving Skill Generation and Policy Optimization

共同进化技能生成与策略优化

Zhiwei Zhang, Yudi Lin, Nikki Lijing Kuang, Linlin Wu, Xiaomin Li, Songtao Liu, Fenglong Ma

发表机构 * The Pennsylvania State University(宾夕法尼亚州立大学) Nanyang Technological University(南洋理工大学) University of California, San Diego(加州大学圣迭戈分校) University of Utah(犹他大学) Harvard University(哈佛大学)

AI总结 提出在线强化学习框架,通过对比基线和技能增强轨迹的奖励差距估计技能边际效用,实现存储前验证,并利用该信号训练策略作为技能生成器,减少对专有模型的依赖。

AI中文摘要

技能增强的强化学习通过存储从过去经验中获取的可重用程序性知识来改进语言智能体。现有方法通常使用强大的语言模型分析轨迹、生成技能,并在在线训练期间更新可检索的技能库。然而,它们很少在存储和重用新生成的技能之前评估其是否有用。我们发现这一假设不可靠:即使由专有前沿LLM生成的技能也表现出高度混合的效用,许多技能几乎没有益处甚至降低性能。一旦此类技能进入库中,其影响难以识别,因为后续的轨迹反馈是延迟的,并且通常反映多个检索技能的组合效果,而非单个技能的边际贡献。我们提出了一种用于存储前技能验证的在线强化学习框架。该框架估计候选技能是否在当前任务的已检索技能之外贡献了有用信息。它使用标准的轨迹预算,在同一任务和检索上下文下形成两个匹配组:基于当前检索技能的条件基础轨迹,以及基于相同技能加上从基础轨迹中诱导出的一个候选技能的条件技能增强轨迹。这两组之间的奖励差距估计了候选技能的上下文相关边际效用,使框架能够在不增加轨迹开销的情况下促进有用技能,同时过滤无效或有害技能。该框架进一步利用这一边际效用信号来训练策略本身作为技能生成器,减少对专有模型重复调用的依赖。学习到的技能生成似然作为上下文相关的分数,用于检索时的重排序和随着策略演化对过时技能的修剪。

英文摘要

Skill-augmented reinforcement learning improves language agents by storing reusable procedural knowledge acquired from past experience. Existing methods typically use strong language models to analyze trajectories, generate skills, and update a retrievable skill bank during online training. However, they rarely assess whether a newly generated skill is useful before it is stored and reused. We find that this assumption is unreliable: even skills generated by proprietary frontier LLMs exhibit highly mixed utility, with many providing little benefit or even degrading performance. Once such skills enter the bank, their effects are difficult to identify, because subsequent rollout feedback is delayed and usually reflects the combined effect of multiple retrieved skills rather than the marginal contribution of any individual skill. We propose an online reinforcement learning framework for pre-storage skill validation. The framework estimates whether a candidate skill contributes useful information beyond the skills already retrieved for the current task. It uses the standard rollout budget to form two matched groups under the same task and retrieval context: base rollouts conditioned on the currently retrieved skills, and skill-augmented rollouts conditioned on the same skills plus one candidate skill induced from the base trajectories. The reward gap between these two groups estimates the candidate skill's context-dependent marginal utility, enabling the framework to promote useful skills while filtering ineffective or harmful ones without additional rollout overhead. The framework further uses this marginal-utility signal to train the policy itself as a skill generator, reducing reliance on repeated calls to proprietary models. The learned skill-generation likelihood serves as a context-dependent score for retrieval-time reranking and outdated-skill pruning as the policy evolves.
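The reward-gap estimate described above reduces to a difference of group means. The following sketch assumes scalar rewards per rollout and an acceptance threshold `min_gap`; both the threshold and the function names are hypothetical and only illustrate the pre-storage check.

```python
def marginal_utility(base_rewards, augmented_rewards):
    """Mean reward with the candidate skill minus mean reward without it,
    where both groups share the same task and retrieved skills."""
    base_mean = sum(base_rewards) / len(base_rewards)
    augmented_mean = sum(augmented_rewards) / len(augmented_rewards)
    return augmented_mean - base_mean

def should_store(base_rewards, augmented_rewards, min_gap=0.0):
    """Store the candidate skill only if its estimated gap exceeds min_gap."""
    return marginal_utility(base_rewards, augmented_rewards) > min_gap

# One task, a rollout budget of 8 split into two matched groups (made-up rewards).
base = [0.0, 1.0, 0.0, 0.0]
helpful = [1.0, 1.0, 0.0, 1.0]
harmful = [0.0, 0.0, 0.0, 0.0]

gap_helpful = marginal_utility(base, helpful)
gap_harmful = marginal_utility(base, harmful)
```

With only a few rollouts per group the estimate is noisy, so a practical threshold would likely need to account for variance.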

2606.08751 2026-06-09 cs.CV 新提交

Less Is More: Training-Free Acceleration Framework of 3D Diffusion Models for Low-Count PET Denoising via Global-Local Trajectory Reduction

少即是多:通过全局-局部轨迹缩减实现低计数PET去噪的3D扩散模型免训练加速框架

Yuhan Liu, Scott M. Leonard, Marlee Crews, Muhannad Fadhel, Jinkui Hao, Tianqi Chen, Ryan J. Avery, Bo Zhou

发表机构 * Northwestern University(西北大学) Hefei University of Technology(合肥工业大学)

AI总结 提出一种免训练的全局-局部跳跃策略,通过噪声一致变换初始化中间步骤和重用U-Net特征,在加速3D扩散模型去噪的同时提升重建质量。

Comments 19 pages, 10 figures, 5 tables

AI中文摘要

PET中的准确定量和摄取测量对于评估疾病进展和支持临床决策至关重要。虽然高计数PET提供了可靠的图像质量,但相关的辐射剂量和长时间采集仍然是重要的临床问题,促使采用低计数协议。基于扩散模型的方法在将低计数PET恢复至接近高计数质量方面显示出巨大潜力,但其迭代采样过程在应用于高分辨率3D PET体积时变得极其昂贵,导致显著的推理延迟,限制了实际临床部署。为了解决这些挑战,我们提出了一种免训练的全局-局部跳跃策略,该策略加速了基于扩散模型的3D PET去噪,同时提高了重建质量。所提出的方法即插即用,可直接应用于预训练扩散模型,无需重新训练或修改架构。具体而言,我们引入了:(i) 全局去噪步骤跳跃策略,通过使用低计数输入的噪声一致变换从中间去噪步骤初始化反向扩散过程,大幅减少所需的去噪步骤数;(ii) 局部特征重用捷径,在相邻去噪步骤间重用缓慢变化的高级U-Net特征,进一步减少每步计算量同时保持图像保真度。我们在来自内部和公共数据集的多种PET示踪剂上评估了所提出的方法,包括18F-FDG PET、68Ga-DOTATATE PET和18F-PSMA PET,结果显示相对于全步骤基线,实现了超过一个数量级的一致加速以及改进或相当的重建性能。盲法读者研究进一步证实了增强的临床信心和感知诊断质量。

英文摘要

Accurate quantification and uptake measurement in PET are critical for assessing disease progression and supporting clinical decision-making. While high-count PET provides reliable image quality, the associated radiation dose and prolonged acquisition remain significant clinical concerns, motivating the adoption of low-count protocols. Diffusion-model-based methods have demonstrated strong potential for restoring low-count PET to near high-count quality, but their iterative sampling procedure becomes prohibitively expensive when applied to high-resolution 3D PET volumes, introducing substantial inference latency that limits practical clinical deployment. To address these challenges, we propose a training-free Global-Local Skipping Strategy that accelerates diffusion model-based 3D PET denoising while simultaneously improving reconstruction quality. The proposed method is plug-and-play and directly applicable to pre-trained diffusion models without retraining or architectural modification. Specifically, we introduce: (i) a global denoising step skipping strategy that initializes the reverse diffusion process from an intermediate denoising step using a noise-consistent transformation of the low-count input, substantially reducing the number of required denoising steps; and (ii) a local feature reuse shortcut that reuses slowly-varying high-level U-Net features across neighboring denoising steps, further reducing per-step computation while preserving image fidelity. We evaluate the proposed approach on multiple PET tracers from in-house and public datasets, including 18F-FDG PET, 68Ga-DOTATATE PET, and 18F-PSMA PET, demonstrating consistent acceleration of over an order of magnitude alongside improved or comparable reconstruction performance relative to the full-step baseline. Blinded reader studies further confirm enhanced clinical confidence and perceived diagnostic quality.

2606.08748 2026-06-09 cs.CL 新提交

HydraQE: OSU's Submission for the IWSLT 2026 Speech Translation Metrics Shared Task

HydraQE: OSU 在 IWSLT 2026 语音翻译指标共享任务中的提交

Kevin Krahn, Eric Fosler-Lussier

发表机构 * The Ohio State University(俄亥俄州立大学)

AI总结 提出 HydraQE,一个基于 Qwen3-ASR 的端到端无参考语音翻译质量估计系统,通过可学习的稀疏标量混合和轻量双向 Transformer 实现跨模态交互,并在人类标注、MetricX-24 和 xCOMET 伪标签上训练三个预测头,优于级联文本基线。

Comments Accepted to IWSLT 2026; 9 pages, 3 figures, 4 tables

AI中文摘要

我们提出了 HydraQE,这是我们对 IWSLT 2026 语音翻译指标共享任务的贡献。HydraQE 是一个基于 Qwen3-ASR 骨干网络的端到端、无参考质量估计(QE)系统,它接受源音频和翻译假设作为联合输入。来自所有骨干网络层的隐藏状态通过可学习的稀疏标量混合进行组合,然后由轻量级双向 Transformer 重新编码,以便在池化为共享嵌入之前实现完整的跨模态交互。三个独立的预测头在互补的监督信号上训练:人工直接评估(DA)标注、MetricX-24 伪标签和 xCOMET 伪标签。为了解决人工标注数据的稀缺性,我们在合成损坏示例和银色伪标签机器翻译输出的组合上进行训练,采用从合成和银色数据开始并逐渐转向人工标注示例的课程学习。HydraQE 优于级联文本基线和先前的直接语音 QE 系统,证明了端到端语音翻译 QE 与级联方法具有竞争力。

英文摘要

We present HydraQE, our contribution to the IWSLT 2026 Speech Translation Metrics shared task. HydraQE is an end-to-end, reference-free quality estimation (QE) system for speech translation built on a Qwen3-ASR backbone, which accepts source audio and a translation hypothesis as joint input. Hidden states from all backbone layers are combined via a learnable sparsemax scalar mix, then re-encoded by a lightweight bidirectional Transformer to enable full cross-modal interaction prior to pooling into a shared embedding. Three independent prediction heads are trained on complementary supervision signals: human direct assessment (DA) annotations, MetricX-24 pseudo-labels, and xCOMET pseudo-labels. To address the scarcity of human-annotated data, we train on a combination of synthetically corrupted examples and silver pseudo-labeled machine translation outputs, using a curriculum that begins on synthetic and silver data and gradually shifts toward human-annotated examples. HydraQE outperforms cascaded text-based baselines and prior direct speech QE systems, demonstrating that end-to-end speech translation QE is competitive with cascaded approaches.
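A sparsemax scalar mix can be sketched as follows: sparsemax maps per-layer logits to a probability vector that may contain exact zeros, and the mix is the resulting weighted sum of layer hidden states. The logits here are fixed numbers rather than learned parameters, and the layer vectors are placeholders, so this is an illustration of the operation and not HydraQE's implementation.

```python
def sparsemax(logits):
    """Euclidean projection of logits onto the probability simplex.
    Unlike softmax, the output can contain exact zeros."""
    z_sorted = sorted(logits, reverse=True)
    cumulative = 0.0
    tau = 0.0
    for k, z in enumerate(z_sorted, start=1):
        cumulative += z
        # The support is the longest prefix satisfying this condition.
        if 1.0 + k * z > cumulative:
            tau = (cumulative - 1.0) / k
    return [max(z - tau, 0.0) for z in logits]

def scalar_mix(layer_states, logits):
    """Weighted sum of per-layer hidden vectors using sparsemax weights."""
    weights = sparsemax(logits)
    dim = len(layer_states[0])
    return [sum(w * h[d] for w, h in zip(weights, layer_states)) for d in range(dim)]

# Three hypothetical layers with 2-dimensional hidden states.
layers = [[1.0, 1.0], [3.0, 3.0], [5.0, 5.0]]
peaked = sparsemax([2.0, 1.0, -1.0])
spread = sparsemax([0.5, 0.3, 0.1])
mixed = scalar_mix(layers, [2.0, 1.0, -1.0])
```

With the peaked logits only the first layer keeps a nonzero weight, so the mix ignores the other layers entirely; with the flatter logits all three layers contribute.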

2606.08745 2026-06-09 cs.CV 新提交

Stain-Aware Wavelet Regularization for Instant Adversarial Purification in Histopathology

染色感知的小波正则化用于组织病理学中的即时对抗净化

Zhe Li, Bernhard Kainz

发表机构 * FAU Erlangen-Nürnberg(埃尔朗根-纽伦堡大学)

AI总结 提出染色感知小波正则化(SAWR),利用Haar变换的多级小波域正则化分层分离对抗扰动与诊断结构信息,并扩展到组织学通道实现染色特异性频率调节,在即时净化框架中将对抗鲁棒性提升高达10.69%。

Comments 14 pages, 4 figures

详情
AI中文摘要

深度学习在计算病理学流程中已变得普遍,支持癌症筛查和数字病理学分析等任务。然而,神经网络对对抗扰动的敏感性引发了临床实践中可靠部署的安全问题。在组织病理学图像中,由于难以区分高频对抗噪声与细微且具有诊断意义的组织结构,这一挑战更加严峻。为解决此问题,我们提出染色感知小波正则化(SAWR),一种利用基于Haar变换的多级小波域正则化的对抗净化框架,以分层方式将对抗扰动与诊断结构信息分离。该频谱约束进一步扩展到单个组织学通道,实现与苏木精和伊红的生物学特性一致的染色特异性频率调节。当集成到即时净化框架中时,SAWR将对抗鲁棒性相对于基线方法提升高达10.69%,同时在对抗扰动下保持纹理和频谱保真度。

英文摘要

Deep learning has become prevalent in computational pathology pipelines that support tasks such as cancer screening and digital pathology analysis. However, the susceptibility of neural networks to adversarial perturbations raises safety concerns for reliable deployment in clinical practice. In histopathological images, this challenge is exacerbated by the difficulty of distinguishing high-frequency adversarial noise from subtle and diagnostically relevant tissue structures. To address this issue, we propose Stain-Aware Wavelet Regularization (SAWR), an adversarial purification framework that leverages multi-level wavelet-domain regularization based on the Haar transform to hierarchically disentangle adversarial perturbations from diagnostic structural information. This spectral constraint is further extended to individual histological channels, enabling stain-specific frequency regulation consistent with the biological properties of Hematoxylin and Eosin. When integrated into an instant purification framework, SAWR improves adversarial robustness by up to 10.69% over the baseline approach, while maintaining texture and spectral fidelity under adversarial perturbations.
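The Haar-domain intuition can be illustrated in one dimension: a single-level Haar transform splits a signal into low-frequency averages and high-frequency details, and shrinking the details suppresses small high-frequency perturbations. This is a generic sketch, not SAWR: the soft-threshold rule, the threshold value, and the function names are assumptions, and the input length must be even.

```python
import math

SQRT2 = math.sqrt(2.0)

def haar_forward(signal):
    """One level of the orthonormal Haar transform (even-length input)."""
    approx = [(signal[i] + signal[i + 1]) / SQRT2 for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / SQRT2 for i in range(0, len(signal), 2)]
    return approx, detail

def haar_inverse(approx, detail):
    out = []
    for a, d in zip(approx, detail):
        out.extend([(a + d) / SQRT2, (a - d) / SQRT2])
    return out

def soft_threshold(coeffs, t):
    """Shrink each coefficient toward zero by t, clipping at zero."""
    return [math.copysign(max(abs(c) - t, 0.0), c) for c in coeffs]

def suppress_high_freq(signal, t):
    approx, detail = haar_forward(signal)
    return haar_inverse(approx, soft_threshold(detail, t))

row = [4.0, 6.0, 10.0, 12.0]
unchanged = suppress_high_freq(row, 0.0)
smoothed = suppress_high_freq(row, 2.0)
```

A multi-level, per-stain version would repeat the split on the approximation band and apply it separately to each stain channel, but those details are not given in the abstract and are not modeled here.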

2606.08743 2026-06-09 cs.RO 新提交

Guided Discovery of New Behaviors using Diffusion Policies

使用扩散策略引导发现新行为

Dian Yu, Sebastian Sanokowski, Majid Khadiv

发表机构 * Munich Institute of Robotics and Machine Intelligence, Technical University of Munich(慕尼黑工业大学慕尼黑机器人与机器智能研究所)

AI总结 提出结合Feynman-Kac校正器与引导势能的框架,从扩散策略中挖掘并优化罕见但可行的轨迹,再训练策略以系统发现多样化可执行行为。

Comments Preprint. Supplementary video: https://youtu.be/T7MUvMA67VM

AI中文摘要

扩散模型已成为机器人学中生成建模的强大工具,扩散策略在多模态动作-轨迹分布建模方面表现出色。然而,当演示数据有限时,标准采样通常再现主导行为,而忽略有效但罕见的模式,限制了新解决方案的发现。现有方法(如引导方法或将强化学习与扩散结合)要么将样本推入不可行区域,要么难以逃离局部最小值,无法系统地发现多样化行为。为解决这些挑战,我们提出一个框架,将Feynman-Kac校正器与一种新颖的引导势能相结合,系统地将扩散策略样本引导至有前景但代表性不足的样本。这些轨迹通过基于采样的轨迹优化进行精炼,并重新纳入训练集以重新训练扩散策略。我们的方法有效地挖掘和修复新轨迹,实现多样化且可执行行为的系统发现。我们在多种操作环境中展示了该框架的有效性,一致地发现了新行为。

英文摘要

Diffusion models have become a powerful tool for generative modeling in robotics, with diffusion policies excelling at modeling multimodal action-trajectory distributions. However, when demonstrations are limited, standard sampling often reproduces dominant behaviors while neglecting valid but rare modes, limiting the discovery of novel solutions. Existing approaches, such as guidance methods or combining reinforcement learning with diffusion, either push samples into infeasible regions or struggle to escape local minima, failing to systematically uncover diverse behaviors. To address these challenges, we propose a framework that combines Feynman-Kac correctors with a novel guiding potential that systematically guides diffusion policy samples towards promising yet underrepresented samples. These trajectories are refined using sampling-based trajectory optimization and reincorporated into the training set to retrain the diffusion policy. Our method effectively mines and repairs novel trajectories, enabling the systematic discovery of diverse and executable behaviors. We demonstrate the effectiveness of our framework across a range of manipulation environments, consistently discovering new behaviors.

2606.08742 2026-06-09 cs.CV 新提交

AUCp: Pseudo-AUC for Inference Model Selection with Unlabeled Validation Data in Abnormality Detection

AUCp: 用于异常检测中无标注验证数据的推理模型选择的伪AUC

Md Mahfuzur Rahman Siddiquee, Fazle Rafsani, Jay Shah, Teresa Wu, Catherine D Chong, Todd J Schwedt, Baoxin Li

发表机构 * arXiv

AI总结 提出AUCp指标,无需标注验证集即可为无监督/自监督异常检测方法选择最优推理模型,通过将测试集所有样本视为异常计算伪AUC,理论及实验证明其优于传统指标。

Journal ref
IEEE Transactions on Medical Imaging (Early Access), 2026
AI中文摘要

异常检测是医学图像分析中一项关键但具有挑战性的任务。通过学习仅重构正常数据来区分异常与正常数据,减少了对标注数据集的依赖。然而,许多研究即使是无监督的,也依赖标注验证集从多次训练迭代中选择最佳推理模型。对于许多疾病,标注数据不可用且获取耗时。为解决此问题,提出了AUCp——一种支持无监督和自监督方法异常检测的新指标。它不通过评估重构图像的真实性来选择最佳推理模型,而是关注实际检测性能,且无需标注测试集。假设测试集中所有未标注样本的伪真实标签为异常/阳性,并使用传统AUC计算,得到AUCp分数。给定一个包含大量正常样本的代表性训练集,我们通过数学和实证证据表明,使用AUCp分数进行模型选择在无监督和自监督方法中比传统指标更能改善疾病检测。使用两种无监督方法进行神经系统疾病检测以及在不同数据集上的自监督方法,我们的结果表明AUCp分数有效识别最佳推理模型,显著增强异常和疾病检测。相应实现可在https://github.com/mahfuzmohammad/AUCp获取。

English abstract

Abnormality detection is a crucial yet challenging task in medical image analysis. Distinguishing abnormalities from normal data by learning to reconstruct normal-only data alleviates the reliance on labeled datasets. However, many studies, even if unsupervised, rely on a labeled validation set to select the best model for inference from multiple training iterations. For many diseases, labeled data are unavailable and time-consuming to obtain. To address this, we propose AUCp, a novel metric that supports abnormality detection for unsupervised and self-supervised methods. Instead of evaluating the realism of reconstructed images to select the best model for inference, it focuses on actual detection performance, without requiring an annotated test set. AUCp scores are derived by assigning every unannotated sample in the test set the pseudo ground truth abnormal/positive and then applying the traditional AUC calculation. Given a large and representative training set of normal samples, we show mathematical and empirical evidence that model selection using AUCp scores improves disease detection for unsupervised and self-supervised methods over conventional metrics. Using two unsupervised methods for neurologic disease detection and self-supervised methods on diverse datasets, our results demonstrate that the AUCp score effectively identifies the optimal model for inference, significantly enhancing abnormality and disease detection. The corresponding implementations are available at https://github.com/mahfuzmohammad/AUCp.
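The pseudo-labeling idea in this abstract can be illustrated with a small sketch. This is not the authors' implementation (see their repository for that); it only assumes what the abstract states, namely that every unlabeled test sample receives the pseudo label "abnormal" and that a standard AUC is then computed. Using anomaly scores from known-normal samples as the negative class is an assumption made here for illustration, and the function names `auc`, `aucp`, and `select_checkpoint`, the checkpoint names, and the toy scores are all hypothetical.

```python
from typing import Dict, Sequence, Tuple


def auc(neg_scores: Sequence[float], pos_scores: Sequence[float]) -> float:
    """Standard AUC: probability that a positive outscores a negative (ties count 0.5)."""
    if not neg_scores or not pos_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def aucp(normal_scores: Sequence[float], unlabeled_scores: Sequence[float]) -> float:
    """Pseudo-AUC sketch: every unlabeled sample is treated as abnormal (positive).

    Assumption (not stated in the abstract): known-normal samples are the negatives.
    """
    return auc(normal_scores, unlabeled_scores)


def select_checkpoint(
    checkpoints: Dict[str, Tuple[Sequence[float], Sequence[float]]]
) -> str:
    """Return the checkpoint name with the highest pseudo-AUC."""
    return max(checkpoints, key=lambda name: aucp(*checkpoints[name]))


# Toy anomaly scores per checkpoint: (scores on normal data, scores on unlabeled data).
checkpoints = {
    "epoch_10": ([0.1, 0.2, 0.3], [0.15, 0.25, 0.9]),
    "epoch_20": ([0.1, 0.2, 0.3], [0.2, 0.8, 0.9]),
}
best = select_checkpoint(checkpoints)
print(best)
```

In this toy data the second checkpoint separates the unlabeled scores from the normal scores more cleanly, so it is the one selected. No labels for the unlabeled set are consulted at any point.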

2606.08741 2026-06-09 cs.RO New submission

Safe, Fluent and Acceptable Motion Generation and Execution for Human-Robot Interaction in Manufacturing Environments

Thibaut Lopez, Olivier Aycard, Pierre-Brice Wieber, Mohamed Boua, Christine Jeoffrion

Affiliations * GIPSA Lab; Grenoble Institute of Technology; Inria; LIP/PC2S; Univ. Grenoble Alpes; Univ. Savoie Mont Blanc

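AI summary: For shared human-robot environments, proposes a motion generation strategy that combines safety with social awareness. An MPC framework generates four social behaviors, and a user study shows that robot behavior significantly affects social acceptability.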


English abstract

Robots operating in human environments must not only ensure physical safety but also exhibit behaviors that are understandable, fluent, and acceptable to human partners. This paper investigates motion generation strategies that combine safety guarantees with interaction quality considerations, such as motion smoothness and human comfort. While the design of robots capable of ensuring safety in shared human-robot environments has enabled closer and more advanced forms of interaction, these new proximity-based tasks require moving beyond purely technical considerations. In particular, robot behavior must also be addressed from psycho-cognitive and social perspectives. In this context, we argue for the relevance of integrating social-aware motion control into robotic systems. First, we identify the motion parameters that influence human perception and operator experience. Then, we implement a Model Predictive Control (MPC) framework that generates four distinct socially-informed robot behaviors. Finally, we conduct a user study to evaluate and validate these behaviors and assess their social impact on non-expert participants. The results demonstrate that variations in robot behavior significantly affect the perceived social acceptability of the system. These findings highlight the importance of incorporating human-centered considerations into motion generation strategies for robots operating in shared environments.