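arXivDaily daily academic digest: synchronized with the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and related fields.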

2605.23257 2026-05-25 cs.RO cs.CV

Turning Adaptation into Assets: Cross-Domain Bridging for Online Vision-Language Navigation

Zixuan Hu, Xuantuo Huang, Yancheng Li, Yichun Hu, Shengyong Xu, Ling-Yu Duan

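Affiliations: School of Computer Science, Peking University, Beijing, China; Peng Cheng Laboratory, Shenzhen, China; School of Electronics, Peking University, Beijing, China

AI summary: This paper studies how vision-language navigation (VLN) agents adapt under non-stationary environments and proposes IDEA, a new test-time adaptation (TTA) framework that turns online adaptation into the accumulation and composition of knowledge assets, effectively addressing the catastrophic forgetting and negative transfer seen in existing methods. IDEA introduces Fisher-guided soft-prompt optimization and combines the prompts with domain coordinates to build a dynamic asset library, then uses historical knowledge to construct cross-domain bridges, enabling training-free adaptation. Experiments show strong results on multiple benchmarks, demonstrating its effectiveness in practical use.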

Comments Accepted by ICML 2026

Abstract

Navigating under non-stationary environment shifts poses a critical challenge for a Vision-and-Language Navigation (VLN) agent deployed in the wild. Yet, existing Test-Time Adaptation (TTA) methods for VLN largely treat online adaptation as transient, isolated updates, leading to catastrophic forgetting and negative transfer. To overcome these issues, we propose Inter-Domain BridgE with Historical Assets (IDEA), a novel TTA framework that transforms adaptation into the accumulation and composition of assets. Specifically, IDEA introduces soft prompts optimized via a Fisher-guided weighting scheme to capture the transferable knowledge. These optimized prompts are then augmented with domain coordinates to form a dynamic asset library. Leveraging this library, IDEA constructs a cross-domain bridge by projecting the target domain onto the convex hull of historical knowledge. These designs form a complementary loop: the evolving library underpins bridge construction, while the bridge provides superior initialization to accelerate asset optimization. Extensive experiments across REVERIE, R2R, and R2R-CE benchmarks demonstrate the consistent superiority of IDEA over existing methods, showcasing its ability to enable training-free adaptation via asset sharing.
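
The projection step described in the abstract can be illustrated with a small sketch. It assumes, hypothetically, that each stored asset pairs a domain coordinate with a prompt vector, and that projecting a target coordinate onto the convex hull means finding simplex weights that minimize the squared distance to the target. The solver is a plain Frank-Wolfe loop with exact line search, chosen only for brevity; the names `project_onto_hull` and `blend` are illustrative and not taken from the paper.

```python
def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


def blend(weights, vectors):
    """Weighted sum of vectors (a convex combination when weights lie on the simplex)."""
    dim = len(vectors[0])
    return [sum(w * v[k] for w, v in zip(weights, vectors)) for k in range(dim)]


def project_onto_hull(anchors, target, max_iters=500, tol=1e-12):
    """Return simplex weights w minimizing ||sum_i w_i * anchors[i] - target||^2.

    Assumption: `anchors` stand in for the domain coordinates of stored assets.
    Frank-Wolfe is a swapped-in generic solver, not the paper's procedure.
    """
    weights = [0.0] * len(anchors)
    weights[0] = 1.0  # start at the first anchor (a vertex of the hull)
    for _ in range(max_iters):
        point = blend(weights, anchors)
        residual = [p - t for p, t in zip(point, target)]
        # Linear minimization step: the vertex with the smallest gradient score.
        scores = [dot(a, residual) for a in anchors]
        s = scores.index(min(scores))
        direction = [a - p for a, p in zip(anchors[s], point)]
        gap = -dot(residual, direction)  # Frank-Wolfe gap (up to a factor of 2)
        if gap <= tol:
            break
        # Exact line search for a quadratic objective, clipped to stay in the hull.
        step = min(1.0, gap / dot(direction, direction))
        weights = [(1.0 - step) * w for w in weights]
        weights[s] += step
    return weights
```

Under the same assumption, the returned weights could be reused as `blend(weights, prompts)` to initialize a prompt for a new domain; the paper may combine assets differently.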

2605.23255 2026-05-25 cs.LG cs.DS

Learning-Augmented Online Scheduling with Parsimonious Preemption

Mugen Blue, Sungjin Im, Alexander Lindermayr

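Affiliations: University of California, Santa Cruz; Institut für Mathematik, Technische Universität Berlin

AI summary: This paper studies learning-augmented online scheduling, aiming to reduce the number of preemptions (task switches) while optimizing job latency. The authors propose a new algorithmic framework that preserves scheduling performance while keeping the number of preemptions per job constant, with preemption overhead growing logarithmically in the prediction error. The work gives the first bounded-preemption theoretical guarantees for unrelated and malleable machines, extending the reach of learning-augmented scheduling theory.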

Abstract

Learning-augmented algorithms have emerged as a powerful paradigm to surpass traditional worst-case lower bounds by integrating potentially noisy predictions. While this framework has seen success in online scheduling, existing work primarily optimizes job latency while relying on frequent, "blind" preemptions. This ignores the fundamental trade-off between algorithmic performance and preemption complexity. We provide the first systematic study of learning-augmented scheduling that curbs preemption while optimizing latency. We establish that the gap between theoretical latency bounds and preemption overhead can be bridged with solid analytical foundations. Our results include $O(1)$-competitive algorithms for single and unrelated parallel machines with only $O(1)$ preemptions per job under accurate predictions, with overhead scaling logarithmically with the prediction error. By providing the first bounded-preemption guarantees for unrelated and malleable machines, we extend the theoretical reach of the learning-augmented framework to more constrained and realistic settings. Finally, our algorithms are validated through experiments.

2605.23254 2026-05-25 cs.CV

CARE: Class-Adaptive Expert Consensus for Reliable Learning with Long-Tailed Noisy Labels

Mengke Li, Haiquan Ling, Lihao Chen, Yang Lu, Yiqun Zhang, Hui Huang

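Affiliations: College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, China; Guangming Laboratory, Shenzhen, China; School of Informatics, Xiamen University, Xiamen, China; School of Computer Science and Technology, Guangdong University of Technology, Guangzhou, China

AI summary: When learning from real-world data, the compound challenge of long-tailed class distributions and noisy labels often degrades model performance. To address this, the paper proposes CARE, a parameter-efficient framework that combines three complementary supervision sources from vision-language models and introduces a class-adaptive expert consensus mechanism that adjusts the strictness of label correction according to class frequency, filtering noise and recalibrating class distributions more effectively. Experiments show that CARE outperforms existing methods on multiple synthetic and real-world datasets, with gains of up to 3.0%.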

Comments poster in ICML 2026

Abstract

Learning from real-world data is frequently hindered by the compound challenge of long-tailed class distributions and noisy annotations. Existing methods partially address these issues but typically ignore the non-uniform impact of label noise across classes, resulting in ineffective correction for tail classes and over-regularization for head classes. To address this issue, we propose Class-Adaptive Rectification with Experts (CARE), a parameter-efficient framework that leverages three complementary supervision sources from vision-language models (VLM): observed noisy labels, VLM text embeddings, and visual features. CARE introduces a class-adaptive expert consensus mechanism that enforces stricter agreement for tail classes and more permissive agreement for head classes based on class frequency. By aggregating high-confidence predictions across these sources, CARE filters unreliable signals and recalibrates class distributions, yielding more reliable rectification under long-tailed distributions. Extensive experiments on both synthetic and real-world benchmarks demonstrate that CARE consistently outperforms state-of-the-art methods, achieving up to 3.0% performance gains. The source code is available at https://github.com/qwq123-study/CARE.

2605.23249 2026-05-25 cs.LG cs.AI

Enhancing Deep Neural Network Reliability with Refinement and Calibration

Ramya Hebbalaguppe, Ajay Shastry, Soumya Suvra Ghosal, Chetan Arora

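Affiliations: SIT, Indian Institute of Technology Delhi, New Delhi, India

AI summary: Although deep neural networks achieve high predictive accuracy, their confidence estimates are often unreliable, which can undermine user trust in their decisions. The paper proposes a new loss function and a unified training framework, RefCal, intended to improve calibration, sharpness (the gap in confidence between correct and incorrect predictions), and accuracy together, making deep neural networks more reliable. Experiments show that RefCal significantly outperforms existing methods on class-imbalanced datasets.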

Comments ICLR 2026, Trustworthy AI and Representational Alignment

Abstract

Although deep neural networks (DNNs) achieve high predictive accuracy, their confidence estimates are often unreliable, potentially compromising user trust in their decisions. This has motivated research on calibrated models, where calibration measures how well a model's predicted confidence aligns with the empirical probability of correctness. However, calibration metrics can often be improved through post-processing techniques that merely mimic training-time uncertainty without genuinely improving the model's understanding. For this reason, statisticians recommend that models be not only calibrated but also refined. Intuitively, a model is considered more refined if it assigns significantly different confidence scores to correct and incorrect predictions, a property also referred to as sharpness. We observe that many existing calibration methods improve calibration at the cost of reduced refinement. To address this limitation, we propose: (1) a novel loss function that explicitly promotes refinement and can be optimized through supervised contrastive learning; and (2) a unified training framework, RefCal, that jointly optimizes calibration, refinement, and accuracy to improve DNN reliability. On the CIFAR-100-LT dataset with 10 percent class imbalance, RefCal achieves (accuracy, refinement, ECE) of (58.81, 95.67, 0.08), substantially outperforming the widely used Correctness Ranking Loss, which achieves (46.27, 93.7, 0.22).
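
As a point of reference for the ECE figures quoted above, a minimal sketch of the commonly used equal-width-bin ECE estimator follows. The bin count of 10 is an assumption, the result is on a 0 to 1 scale, and `confidence_gap` is a hypothetical stand-in for the intuition of separating correct from incorrect predictions; the paper's actual refinement metric is not defined in the abstract.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: sum over bins of (bin size / N) * |accuracy - mean confidence|."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece


def confidence_gap(confidences, correct):
    """Mean confidence on correct predictions minus mean confidence on incorrect ones.

    Hypothetical illustration of the 'refinement' intuition, not the paper's metric.
    """
    right = [c for c, ok in zip(confidences, correct) if ok]
    wrong = [c for c, ok in zip(confidences, correct) if not ok]
    if not right or not wrong:
        raise ValueError("need at least one correct and one incorrect prediction")
    return sum(right) / len(right) - sum(wrong) / len(wrong)
```

A model can score well on one quantity and poorly on the other, which is the tension the abstract describes between calibration and refinement.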

2605.23245 2026-05-25 cs.CV cs.AI

SimInsert: Seamless Video Object Insertion via Regional Sparse Attention Fusion

Xinyu Chen, Yuyi Qian, Jiang Lin, Shenyi Wang, Gao Wang, Zhiqiu Zhang, Jizhi Zhang, Mingjie Wang, Qiang Tang, Qian Wang, Song Wu, Zili Yi

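Affiliations: State Key Laboratory of Novel Software Technology, Nanjing University; School of Intelligence Science and Technology, Nanjing University; JIUTIAN Research; Xi’an Jiaotong-Liverpool University; Zhejiang Sci-Tech University; The University of British Columbia

AI summary: SimInsert is a training-free video object insertion method that targets the reliance of existing methods on explicit motion engineering or costly retraining, improving flexibility and generalization. Using regional sparse attention fusion, it decomposes the task into single-frame editing and semantic motion description and draws on the generative priors of image-to-video diffusion models to propagate the edit naturally over time while preserving background invariance and interaction realism. Experiments show that SimInsert significantly outperforms existing methods on several metrics, offering an efficient solution for high-fidelity video editing.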

Comments Accepted by ICME2026

Abstract

Video object insertion requires ensuring spatio-temporal coherence and interactive realism, extending far beyond simple content placement. However, current approaches are often hindered by a reliance on explicit motion engineering or resource-intensive retraining, restricting their flexibility and generalization. To bridge this gap, we present SimInsert, a training-free paradigm that efficiently decouples the task into intuitive single-frame editing and semantic motion description. By harnessing the robust generative priors of image-to-video diffusion models, SimInsert propagates edits temporally, strictly preserving background invariance while enabling plausible, text-driven interactions between the inserted object and the dynamic environment. Our approach hinges on non-invasive guidance mechanisms that enforce structural consistency, facilitate seamless boundary fusion, and counteract the fidelity drift that typically accumulates during the denoising trajectory. Extensive quantitative experiments validate our efficacy: SimInsert surpasses state-of-the-art methods with an 18.8% gain in PSNR, 20.1% in SSIM, and a 44.1% decrease in LPIPS, offering a streamlined solution for high-fidelity video editing.

2605.23244 2026-05-25 cs.LG

Convex Optimization for Alignment and Preference Learning on a Single GPU

Miria Feng, Mert Pilanci

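Affiliations: Department of Electrical Engineering, Stanford University, California, United States

AI summary: This paper proposes COALA, a convex optimization algorithm for efficient alignment and preference learning of large language models on a single GPU. By reformulating neural networks as convex optimization problems, the method avoids the reference model that conventional approaches depend on and substantially reduces training time and VRAM consumption. Experiments across multiple datasets and models show strong performance and efficiency: COALA uses as little as about 17.6% of DPO's total TFLOPs, its reward increases steadily during training, and it reaches its peak in markedly less time.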

Abstract

Fine-tuning large language models (LLMs) to align with human preferences has driven the success of systems such as Gemini and ChatGPT. However, approaches like Reinforcement Learning from Human Feedback (RLHF) remain computationally expensive and complex. Direct Preference Optimization (DPO) offers a simpler alternative but has limitations such as inconsistent ranking accuracy, high dependence on GPU resources, and expensive hyperparameter tuning. We propose the Convex Optimization for Alignment and Preference Learning Algorithm (COALA): a novel lightweight strategy with strong theoretical guarantees. By leveraging the convex optimization reformulation of neural networks, COALA eliminates the need for a reference model and obtains significant reduction in both training time and VRAM consumption, thus enabling efficient training on a single GPU. Experiments across four datasets--including a 26621-sample synthetic Educational Feedback dataset--and six models (including Llama-3.1-8B) demonstrate COALA's competitive performance and efficiency while utilizing as little as ~17.6% of DPO's total TFLOPs. COALA exhibits stable, monotonically increasing rewards and reaches peak margins in significantly shorter time in comparison to traditional methods such as DPO and ORPO. To the best of our knowledge, this is the first time convex optimization has been effectively applied to preference fine-tuning of LLMs.
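
For context on the reference model that COALA is said to eliminate, the standard DPO objective for a single preference pair can be written out directly. This sketch shows DPO, not COALA (whose convex formulation is not given in the abstract); the four inputs are assumed to be precomputed sequence log-likelihoods, and `beta=0.1` is only an illustrative default.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    loss = -log(sigmoid(beta * ((pi_w - ref_w) - (pi_l - ref_l))))
    The two ref_* terms are why DPO needs a second, frozen model at training time.
    """
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to avoid overflow for large |m|.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy equals the reference, the margin is zero and the loss is log 2; the loss falls as the policy raises the chosen response relative to the rejected one.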

2605.23241 2026-05-25 cs.LG

RelPrism: A Multi-Faceted Pre-training Framework with Self-Generated Tasks for Relational Databases

Jinyu Yang, Cheng Yang, Junze Chen, Zedi Liu, Muhan Zhang, Hanyang Peng, Chuan Shi

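Affiliations: Beijing University of Posts and Telecommunications; Peng Cheng Laboratory; Peking University

AI summary: Relational databases (RDBs) remain at the core of modern data systems and support a wide range of predictive tasks. Existing relational deep learning methods convert the database into a graph and apply graph models for representation learning, but effective self-supervised pre-training remains challenging, particularly when tasks need information from multiple perspectives and granularities. The paper proposes RelPrism, a multi-faceted self-supervised learning framework that constructs intrinsic, relational, and hybrid attributes from different perspectives and applies multi-granularity clustering to generate pseudo-tasks, making the pre-trained representations more adaptable. Experiments show that RelPrism outperforms existing methods on classification and regression tasks across multiple real-world datasets.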

Abstract

Relational databases (RDBs) remain the cornerstone of modern data systems and support diverse predictive tasks. Recent relational deep learning (RDL) methods enable end-to-end prediction by converting RDBs into graphs, where rows are represented as nodes and inter-table interactions are represented as edges, and then applying graph-based models for representation learning. Despite the strong capability of RDL, effective self-supervised pre-training for RDBs remains non-trivial. RDB tasks often require multi-faceted information across different perspectives and granularities. For example, user churn classification may rely more on interaction patterns, whereas consumption value prediction requires both user-item behaviors and intrinsic user attributes for fine-grained regression. Such heterogeneous needs challenge RDB representation learning, as pre-training objectives should cover comprehensive information for downstream adaptation. However, existing SSL methods typically derive supervision from a single facet, such as node-level intrinsic attributes or subgraph-level relational structures, providing limited adaptability. To this end, we propose RelPrism, a multi-faceted self-supervised learning framework for RDBs. RelPrism constructs intrinsic, relational, and hybrid attributes from distinct perspectives, and applies multi-granularity clustering to each perspective to form corresponding pseudo-task pools. Pre-training over these pools exposes representations to broader perspectives and granularity levels, yielding a stronger basis for downstream adaptation. Experiments on 14 tasks across 5 real-world datasets show that RelPrism improves ROC-AUC by 4.15% for classification and reduces MAE by 10.75% for regression over state-of-the-art baselines. Our code is available at https://anonymous.4open.science/r/RelPrism.

2605.23240 2026-05-25 cs.RO cs.SY eess.SY

Signal Temporal Logic Motion Planning via Graphs of Convex Sets

Yu Chen, Ancheng Hou, Mingyang Feng, Xiao Yu, Xiang Yin

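Affiliations: School of Automation & Intelligent Sensing, Shanghai Jiao Tong University; Institute of Artificial Intelligence, Xiamen University

AI summary: This paper studies continuous-time motion planning under Signal Temporal Logic (STL) specifications, aiming to generate smooth robot trajectories that satisfy high-level logical and timing requirements while respecting low-level motion constraints. The authors propose an efficient framework that combines timed-automata reasoning with graphs of convex sets (GCS), recasting STL motion planning as a shortest-path problem over a GCS and producing Bézier-spline trajectories that satisfy the STL specification, smoothness requirements, and velocity bounds. Experiments on several low-dimensional benchmarks, a 3-D quadrotor, a 30-DoF humanoid, and a hardware test on a UR-3 robot arm show that the method solves complex STL motion-planning problems efficiently.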

Abstract

This paper investigates continuous-time motion planning under Signal Temporal Logic (STL) specifications. The goal is to generate smooth robot trajectories that satisfy high-level logical and timing requirements while respecting low-level motion constraints. To this end, we propose an efficient framework that combines timed-automata reasoning with graphs of convex sets (GCS). An STL specification is first represented by a timed automaton, which is then coupled with a convex decomposition of the configuration space to form a joint transition system encoding both task progress and region occupancy. Based on this joint transition system, the STL motion-planning problem is reformulated as a shortest-path problem over a GCS, whose solution induces a smooth Bézier-spline trajectory satisfying the STL specification, smoothness requirements, and velocity bounds. We establish the soundness of the proposed formulation and analyze its computational complexity, showing that, once the timed automaton and convex decomposition are fixed, the convex relaxation scales polynomially with the configuration-space dimension and the Bézier degree. We further develop a compact timed-automaton construction for an expressive STL fragment using dedicated templates and Boolean composition. Numerical experiments on low-dimensional benchmarks, a 3-D quadrotor, a 30-DoF humanoid, and a hardware experiment on a UR-3 robot arm demonstrate that the proposed method efficiently solves complex STL motion-planning problems and produces smooth executable trajectories.

2605.23238 2026-05-25 cs.AI cs.GT cs.LG cs.MA

GENSTRAT: Toward a Science of Strategic Reasoning in Large Language Models

Vartan Shadarevian, Kia Ghods, Alex Kenich, Anany Kotawala

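Affiliations: Princeton University; Google

AI summary: This paper presents GENSTRAT, an evaluation framework built on procedurally generated strategic environments for assessing the reasoning ability of large language models in complex strategic settings more accurately. The method generates a family of two-player zero-sum imperfect-information card games and combines capability profiling with a jaggedness measure to assess model performance and stability along different strategic dimensions. Experiments show that frontier models perform better overall, but their capability profiles and local volatility differ markedly, giving a finer-grained diagnostic for real deployments.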

Comments 33 pages, 8 figures, 9 tables (4 figures, 2 tables in main paper)

Abstract

Large language models (LLMs) are increasingly deployed as economic agents in marketplaces, auctions, and bidding settings. Anticipating their behavior in any specific deployment is hard. Existing strategic-reasoning benchmarks evaluate models on fixed canonical games. These benchmarks may saturate as the frontier improves, and they do not allow evaluators to generalize with confidence from benchmark performance to the varied and messy strategic environments that actual deployments involve. We introduce GENSTRAT, which uses procedurally generated strategic environments to address these challenges. Concretely, we generate a distribution of two-player zero-sum imperfect-information card games. The generator can draw fresh games on demand, allowing for evergreen evaluation and resistance to contamination. We pair the game distribution with a capability-profile methodology that decomposes model competence across six axes (state space, temporal depth, information sensitivity, opponent modeling, risk, and brittleness). We also introduce a jaggedness measure of within-distribution smoothness that detects when a model's advantage jumps unpredictably between strategically similar games. We sample 50 benchmark games from a 2,000-game generated pool and evaluate nine frontier and open-weight LLMs in a head-to-head tournament with over 36,000 matches. Newer frontier-tier models score higher on average. Beyond that average, models with near-identical overall strength show qualitatively different capability profiles, and two of the top three leaderboard models (gpt-5 and claude) are noticeably more locally volatile than the third (gemini-3.1-pro), despite being close in overall strength. Together, the capability profile and the jaggedness measure give a deployment-relevant diagnostic that the overall ranking alone cannot provide.

2605.23237 2026-05-25 cs.CV

StereoGenBench: A Synthetic Multi-Camera Benchmark for Stereo Generation under Controlled Baseline Regimes

Yangzhi Cui, Feng Qiao, Nathan Jacobs

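Affiliations: Washington University in St. Louis

AI summary: StereoGenBench is a synthetic multi-camera benchmark dataset built with Unreal Engine that provides precisely controlled multi-baseline paired data for stereo generation, geometry estimation, and controllable view synthesis. By rendering fixed scenes with a six-camera array, it produces high-quality paired views with multiple baselines, intrinsics, depth, and camera poses, supporting evaluation of generative models under different baseline regimes. The work fills gaps in existing datasets regarding multi-baseline pairing and controllable parameters and offers a standardized testbed for stereo generation research.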

Abstract

Stereo image and video generation, stereo geometry estimation, and condition-controlled view synthesis require paired data in which the variables that determine binocular geometry -- camera baseline, intrinsics, scene depth, and camera motion -- are known and controllable. Existing stereo resources provide subsets of these variables, but resources commonly used for stereo generation evaluation do not, to our knowledge, provide scene-paired, calibrated multi-baseline right-view ground truth with jointly recorded intrinsics, dense metric depth, and per-frame poses in a single controlled source. We introduce StereoGenBench, a synthetic Unreal Engine benchmark designed to make baseline-regime sensitivity and target-camera consistency measurable under matched scene content. Each scene is rendered with a rigid six-camera lateral array, yielding up to 15 calibrated view pairs; adjacent baselines are sampled from inter-pupillary to wide-baseline regimes; focal length is sampled independently; and every view is released with RGB, metric depth, intrinsics, per-pair baselines, and per-frame poses. The splits include two evaluation families for narrow and wide baseline regimes and a train-only family for broader all-pairs coverage. We release the dataset, evaluation code, reference results, Croissant metadata, and generation code/configuration for extension with compatible assets. The dataset is available at https://huggingface.co/datasets/stereo-dataset/stereo-dataset
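
The figure of up to 15 view pairs follows from choosing 2 of the 6 cameras. The sketch below enumerates the pairs and their baselines for a lateral array; the camera positions are made-up values in metres, not the benchmark's actual rig, and `view_pairs` is an illustrative name.

```python
from itertools import combinations


def view_pairs(camera_x_positions):
    """Return (left_index, right_index, baseline) for every unordered camera pair.

    Assumes cameras lie on one lateral axis, so the baseline is |x_j - x_i|.
    """
    pairs = []
    for (i, xi), (j, xj) in combinations(enumerate(camera_x_positions), 2):
        pairs.append((i, j, abs(xj - xi)))
    return pairs


# Hypothetical lateral offsets: adjacent spacing grows from roughly
# inter-pupillary (0.065 m) towards a wide baseline.
positions = [0.0, 0.065, 0.13, 0.26, 0.52, 1.04]
pairs = view_pairs(positions)
```

With six positions the list has 6 * 5 / 2 = 15 entries, and non-adjacent pairs cover baselines that no adjacent pair provides.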

2605.23235 2026-05-25 cs.LG

Convex Low-resource Accent-Robust Language Detection in Speech Recognition

Miria Feng, William Tan, Mert Pilanci

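Affiliations: Department of Electrical Engineering; Department of Computer Science, Stanford University, California, United States

AI summary: As globalization and multiculturalism advance, speech recognition systems often perform poorly on low-resource dialects and accents, misidentifying the language and harming downstream dialogue tasks. The paper proposes Convex Language Detection (CLD), a low-resource, robust language detection method based on convex optimization; it combines theoretically grounded convex optimization techniques with a multi-GPU ADMM implementation to achieve efficient training and globally optimal solutions. The method comes with theoretical stability guarantees and, in experiments, is robust to variation in input dialect, reaching 97-98% accuracy even in low-resource conditions.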

Abstract

Globalization and multiculturalism continue to produce increasingly diverse speech varieties. Yet current spoken dialogue systems frequently fail on under-represented dialects and accents, often misidentifying the input language and causing cascading failures in downstream dialogue tasks. Addressing this dialectal variance under low-resource constraints remains an open challenge, as standard fine-tuning is computationally expensive and prone to overfitting on high-dimensional speech data. We propose Convex Language Detection (CLD), a novel framework that integrates theoretically grounded convex optimization techniques into the spoken dialogue systems pipeline. Our method is efficiently implemented via multi-GPU Alternating Direction Method of Multipliers (ADMM) in JAX, thus providing global optimality guarantees and fast training in polynomial time. Theoretically, we prove that our convex objective induces certified margin stability and provide guarantees against feature perturbations. Empirically, we demonstrate sample efficiency and robustness to input dialectal variation, achieving 97-98% accuracy in challenging low-resource regimes. Our open-source package is available at https://pypi.org/project/jaxcld/

2605.23220 2026-05-25 cs.LG

WMAttack: Automated Attack Search for Adversarial Evaluation of World-Model Agents

Zhixiang Guo, Siyuan Liang, Shi Fu, Cheng Guo, Andras Balogh, Mark Jelasity, Dacheng Tao

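Affiliations: Nanyang Technological University; University of Szeged

AI summary: Although world models are increasingly used as decision-making agents, their adversarial robustness remains under-studied for lack of dedicated automated evaluation methods. To resolve the tension between accuracy and efficiency in attack evaluation, the paper proposes WMAttack, an automated attack-search framework for adversarial evaluation of world-model agents. The method searches over attack configurations under a finite budget and combines Self-Correcting Attack Search with Representation-Guided Attack Retrieval, substantially improving the efficiency and effectiveness of attack discovery and outperforming the evaluated baselines on multiple benchmark tasks.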

Abstract

Despite the growing use of world models as decision-making agents, their adversarial robustness remains underexplored due to the lack of dedicated automated evaluation methods. A key obstacle is that attack evaluation must be both accurate and efficient: weak manually tuned attacks can overestimate robustness, while exhaustive hyperparameter search is prohibitively expensive because each candidate requires closed-loop rollouts through learned latent dynamics. We introduce WMAttack, an automated attack-search framework for adversarial evaluation of world-model agents. WMAttack formulates robustness evaluation as a finite-budget search over attack configurations, including attack families, perturbation budgets, optimization steps, restarts, and allocation rules. To improve search accuracy, Self-Correcting Attack Search (SCAS) refines the attack proposal distribution using feedback from reward degradation, action instability, runtime cost, and rollout variability. To improve search efficiency, Representation-Guided Attack Retrieval (RGAR) retrieves effective historical configurations from representation-similar tasks, providing a warm start for unseen environments. We provide a theoretical explanation showing that proposal refinement improves finite-budget search when it shifts probability mass toward high-utility attacks. Across Atari and DeepMind Control tasks, WMAttack consistently discovers stronger attacks than the evaluated baselines, improving normalized reward drop from 0.497 to 1.034 on DreamerV3 Atari and from 0.319 to 0.682 on DMC. Ablations further show that RGAR improves initial candidate quality and SCAS improves final attack utility under fixed evaluation budgets.

2605.23219 2026-05-25 cs.LG cs.AI

PaP-NF: Probabilistic Long-Term Time Series Forecasting via Prefix-as-Prompt Reprogramming and Normalizing Flows

Minju Kim, Youngbum Hur

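Affiliations: Department of Industrial Engineering, Inha University, Incheon, Republic of Korea

AI summary: This paper proposes PaP-NF, a probabilistic long-term time series forecasting framework that aligns continuous time series representations with a frozen large language model through a Prefix-as-Prompt mechanism and conditions a normalizing flow decoder on the global context extracted by that model, thereby modeling uncertainty. The method performs well on multiple long-term forecasting benchmarks, capturing multi-modal uncertainty while maintaining competitive point-forecast accuracy.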

Comments Accepted to ICPR 2026

Abstract

Time series forecasting plays a central role in many real-world applications and has been extensively studied. Most existing approaches rely on deterministic models. However, real-world environments exhibit inherently uncertain and complex future behaviors, making single-point predictions insufficient. This highlights the need for probabilistic forecasting methods that can quantify and represent uncertainty. In this work, we propose PaP-NF, a probabilistic forecasting framework that aligns continuous time series representations with a frozen large language model (LLM) using a Prefix-as-Prompt mechanism, and conditions a normalizing flow decoder on the global context extracted by the LLM. The quality of the resulting predictive distributions is evaluated using the Continuous Ranked Probability Score (CRPS), a standard metric in probabilistic forecasting. Across a variety of long-term forecasting benchmarks, PaP-NF robustly captures multi-modal uncertainty while maintaining competitive point forecasting accuracy. The official implementation is available at: https://github.com/democracy04/PaP-NF
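
Since the abstract evaluates predictive distributions with CRPS, a minimal sample-based estimator may help make the metric concrete. It uses the standard identity CRPS(F, y) = E|X - y| - 0.5 * E|X - X'| with X and X' drawn independently from F; how samples would be drawn from the flow decoder is outside this sketch, and `crps_from_samples` is an illustrative name rather than part of the PaP-NF code.

```python
def crps_from_samples(samples, observation):
    """Estimate CRPS for one scalar observation from forecast samples.

    Lower is better. With a single sample the spread term vanishes and the
    score reduces to the absolute error.
    """
    m = len(samples)
    if m == 0:
        raise ValueError("need at least one forecast sample")
    # Accuracy term: average distance between forecast samples and the observation.
    term_obs = sum(abs(x - observation) for x in samples) / m
    # Spread term: half the average pairwise distance between samples (O(m^2)).
    term_spread = sum(abs(a - b) for a in samples for b in samples) / (2 * m * m)
    return term_obs - term_spread
```

The spread term is what rewards a forecast for representing several plausible outcomes around the observation instead of a single point.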

2605.23218 2026-05-25 cs.AI

Foundation Protocol: A Coordination Layer for Agentic Society

Bang Liu, Yongfeng Gu, Jiayi Zhang, Zhaoyang Yu, Sirui Hong, Maojia Song, Xiaoqiang Wang, Mingyi Deng, Zijie Zhuang, Ronghao Wang, Mingzhe Cao, Yutong Zhu, Xingjian Li, Yifan Wu, Jianhao Ruan, Yiran Peng, Shuangrui Chen, Jinlin Wang, Yizhang Lin, Dongjie Zhang, Dekun Wu, Chen Ma, Lizi Liao, Han Yu, Jian Pei, Heng Ji, Qiang Yang, Yuyu Luo, Chenglin Wu

Affiliations: Singapore University of Technology and Design; City University of Hong Kong; Singapore Management University; Nanyang Technological University; Duke University; University of Illinois Urbana-Champaign; Hong Kong Polytechnic University

AI summary: As autonomous agent systems become part of social infrastructure, coordination becomes the key bottleneck to scaling them. This paper proposes the Foundation Protocol (FP), a coordination layer intended as infrastructure for a society in which humans and AI coexist. FP unifies different types of entities in a graph structure, supports multi-party organization and event-based collaboration, and introduces economic primitives and governance mechanisms to keep the system composable and accountable. The protocol is designed to wrap and bridge existing protocols, lowering integration and governance costs and supporting the development of autonomous agents in an open, pluralistic, and governable environment.

Abstract

Autonomous agents are moving from tools into a layer of social infrastructure: they browse, purchase, deploy software, manage systems, and increasingly interact with one another. As these systems scale, the bottleneck shifts away from raw model capability toward coordination. Agents need to form reliable relationships, organize multi-agent work, exchange value, support an AI economy, and stay safe and accountable under real-world oversight. This paper introduces the Foundation Protocol (FP), a graph-first coordination layer for an emerging human-AI society. FP unifies heterogeneous entities, including agents, tools, resources, humans, institutions, and organizations, and supports native multi-party organization and event-based collaboration. It also provides economic primitives for metering, receipts, and settlement, and treats policy, provenance, and audit as first-class concerns. FP is designed to wrap and bridge existing protocols rather than replace them, enabling incremental adoption while reducing integration and governance overhead. The aim is to keep autonomous agency composable while keeping accountability non-negotiable, so that coordination itself can become shared infrastructure for a human-AI society that is open, pluralistic, and governable.

2605.23216 2026-05-25 cs.CV

CaST-Bench: Benchmarking Causal Chain-Grounded Spatio-Temporal Reasoning for Video Question Answering

Mingfang Zhang, Jingjing Pan, Ashutosh Kumar, Rajat Saini, Mustafa Erdogan, Hsuan-Kung Yang, Caixin Kang, Yifei Huang, Yoichi Sato, Quan Kong

Affiliations: Woven by Toyota; The University of Tokyo

AI summary: CaST-Bench is a new benchmark for evaluating causal chain-grounded spatio-temporal reasoning in video question answering. It addresses the lack of fine-grained, verifiable evidence in existing benchmarks for causal reasoning. Through a human-AI collaborative pipeline, the authors build a high-quality dataset of 2,066 questions, each accompanied by causal-chain evidence annotated with temporal segments and bounding boxes. They also design new evaluation metrics that measure both answer correctness and reasoning grounded in visual evidence. The results expose the limited ability of current vision-language models to construct precise causal chains and point to a direction for improving future models.

Comments CVPR 2026

Abstract

Cause-and-effect reasoning in video is a significant challenge for Vision-Language Models (VLMs), as it requires going beyond surface-level perception to a deeper understanding of causal mechanisms. However, existing benchmarks rarely provide the fine-grained, grounded evidence needed to rigorously evaluate this capability. To address this gap, we introduce CaST-Bench, a benchmark for Causal Chain-Grounded Spatio-Temporal Video Reasoning. CaST-Bench presents complex causal questions that require models to identify and localize a chain of multiple spatio-temporal evidences. Through a human-AI collaborative pipeline, we construct a high-quality dataset of 2,066 questions over 1,015 videos, with causal chains annotated by temporal segments and bounding-box tracks. Furthermore, we design a comprehensive evaluation suite with novel metrics that assess not only answer correctness but also the capability for visual evidence grounded reasoning. This grounding is crucial for improving accuracy by mitigating spurious correlations and for enhancing user trust by making models more transparent. Our experiments show that current VLMs struggle with causal questions, largely due to their limited ability to construct precise and grounded causal chains. This highlights an important direction for improving future VLMs.

2605.23215 2026-05-25 cs.LG cs.AI cs.CL

FastKernels: Benchmarking GPU Kernel Generation in Production

Gabriele Oliaro, Yichao Fu, May Jiang, Owen Lu, Junli Wang, Zhihao Jia, Hao Zhang, Samyam Rajbhandari

Affiliations: Snowflake AI Research; CMU; UCSD; Independent Researcher

AI summary: LLM-based agents for GPU kernel generation are currently evaluated on benchmarks that do not match real production environments. To close this gap, the authors propose FastKernels, a benchmark built around 46 representative architectures in 8 categories, whose kernels cover those of 96.2% (409/425) of HuggingFace Transformers architectures. FastKernels also serves as a production-grade inference framework. Experiments show that state-of-the-art kernel generation agents achieve limited speedups on FastKernels (0.94x over production baselines for the strongest agent), which highlights the mismatch between benchmarks and production use as a key bottleneck.

Abstract

LLM-based agents for GPU kernel generation are advancing rapidly, yet their progress is fundamentally constrained by the benchmarks they optimize against. Existing benchmarks are poorly aligned with production inference frameworks: they evaluate kernels on a single GPU with synthetic inputs, ignore the surrounding compilation stack, and reward replicating known optimizations rather than discovering new ones. The resulting reward signals are misleading: agents learn to generate kernels that score well in sandboxes but introduce interface incompatibilities, compilation-stack conflicts, and silent correctness degradation when integrated into real systems. We introduce FastKernels, a kernel benchmark built around a minimal set of 46 representative architectures spanning 8 categories, whose kernels collectively subsume those of 96.2% (409/425) of HuggingFace Transformers architectures. FastKernels doubles as a minimalistic, production-grade inference framework that runs at parity with hardened systems such as vLLM and SGLang on mainstream LLM serving and substantially exceeds upstream references on under-served architectures; each task's interface mirrors the corresponding module in the state-of-the-art library for its architecture family, enabling direct deployment of optimized kernels into production codebases. Evaluating state-of-the-art kernel agents on FastKernels, we find that even the strongest agent achieves only 0.94$\times$ aggregate speedup over production baselines, with weaker agents at $0.78\times$ and $0.53\times$ -- confirming that benchmark-production misalignment is a critical bottleneck for the field. We release FastKernels as a stepping stone toward kernel agents whose benchmark gains translate directly into production throughput improvements. Code is available at https://github.com/Snowflake-AI-Research/fastkernels

2605.23204 2026-05-25 cs.AI

AutoResearch AI: Towards AI-Powered Research Automation for Scientific Discovery

Guiyao Tie, Jiawen Shi, Dingjie Song, Yixiao Huang, Ziji Sheng, Xueyang Zhou, Daizong Liu, Pan Zhou, Yongchao Chen, Ran Xu, Lifang He, Qingsong Wen, Manling Li, Cong Lu, Shuai Li, Pengtao Xie, Yixuan Yuan, Rui Meng, Lei Xing, Lichao Sun, Caiming Xiong, Philip S. Yu, Jianfeng Gao

Affiliations: Huazhong University of Science and Technology; Lehigh University; Tsinghua University; Wuhan University; Salesforce Research; Squirrel AI Learning; Northwestern University; Independent; Shanghai Jiao Tong University; University of California San Diego; Chinese University of Hong Kong; University of Illinois Chicago; Stanford University; Google Cloud AI Research; Recursive Superintelligence; Microsoft Research

AI summary: This survey examines the development of AI-powered research automation (AutoResearch), which aims to automate the full research workflow, from literature grounding and hypothesis generation to experimental validation and reporting. It analyzes the shortcomings of current systems in autonomy, domain scope, and validation mechanisms, and proposes five evaluation dimensions. It finds that the achievable degree of autonomy depends on the domain: autonomy is more credible in structured, executable, and easily verifiable settings, and remains limited in contexts that involve ethics or institutional accountability.

Comments 49 pages, 12 figures, 10 tables

Abstract

Scientific research is being reshaped by AI systems that move beyond isolated assistance toward longer-horizon workflows spanning literature grounding, hypothesis generation, experimentation, validation, reporting, and revision. This shift marks a transition from task-level AI for science to workflow-level research automation. Yet current systems remain fragmented, differing in autonomy, domain scope, execution environment, validation mechanism, and human oversight, while still struggling with evidence preservation, reproducibility, weak-direction rejection, provenance tracking, cross-domain robustness, and accountable scientific closure. This survey examines these developments through AutoResearch, defined as the developmental spectrum of AI-powered scientific workflow automation. Within it, Vibe Research denotes the human-steered region of prompt-based assistance and human-verified execution, whereas emerging AI-led systems coordinate larger portions of the discovery loop without achieving robust autonomy. We analyze how research systems redistribute control, evidence, execution, validation, and accountability across workflows and organize the field around five workflow conditions: literature and research grounding; hypothesis formation and planning; experimentation and tool use; feedback, validation, and review; and reporting and knowledge communication. We further synthesize AI scientist systems, mixed-initiative co-research frameworks, benchmarks, domain deployments, and open-source infrastructures. Finally, we propose five evaluation dimensions--novelty, validity, impact, reliability, and provenance--and show that AutoResearch autonomy is domain-conditioned, being more credible in structured, executable, and rapidly verifiable settings but limited in embodied, delayed, heterogeneous, ethical, or institutionally accountable contexts.

2605.23203 2026-05-25 cs.CV cs.AI cs.LG cs.RO

Lipschitz Optimization for Formal Verification of Homographies

Jean-Guillaume Durand, Panagiotis Kouvaros, Maxime Gariel, Alessio Lomuscio

Affiliations: Joby Aviation; Safe Intelligence

AI summary: This paper addresses formal robustness verification for vision neural networks in safety-critical applications, focusing on 3D perturbations caused by camera motion. The authors establish a closed-form mapping from camera pose to pixel values and, building on Lipschitz optimization and piecewise-continuity analysis, derive tight linear bounds on the perturbed pixel values. The method applies to scenes with predominantly planar structure, as found in augmented reality, autonomous driving, and robotic manipulation. It is evaluated on the VNN-COMP benchmark and a runway-classifier case study, and reports up to 89% speedup and 7% tighter bounds than prior work.

Comments 18 pages, 13 figures, 6 tables, to be published at CVPR 2026

Abstract

The adoption of vision neural networks in regulated industries requires formal robustness guarantees, especially in safety-critical domains such as healthcare, autonomous vehicles, and aerospace. However, current approaches are confined to incomplete statistical verification or robustness to $\ell_p$-norm and affine transforms, which cover only a narrow subset of perturbations to the image formation process. In particular, robustness to camera motion remains an open problem despite being key to deploying many vision applications. We present a formal verification approach that targets robustness against 3D motion perturbations of the capturing camera. We first establish a closed-form mapping from camera pose to pixel values. By analyzing the continuity properties of the resulting homographies, we show that recent work on Lipschitz optimization and piecewise continuity can be extended to derive tight linear bounds on perturbed pixel values. Our approach applies to scenes with predominantly planar structure, such as ground planes in augmented reality, road markings and traffic signs in autonomous driving, or planar workspaces in robotic manipulation. This enables the first formal verification of projective geometry transforms, without complex simulation, surrogate networks, or explicit image-formation models. We validate our implementation and show up to 89% speedup and 7% tighter bounds over prior work. We then evaluate our method on the VNN-COMP benchmark and reveal systematic weaknesses to projective perturbations. Finally, we demonstrate a real-world case study on a safety-critical runway classifier, highlighting practical vulnerabilities to camera motion, and addressing a key challenge in the certification of learned models. Data and code are publicly available at https://github.com/jeangud/homography-verification
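
To make the "camera pose to pixel" mapping for planar scenes concrete, here is a minimal sketch of the textbook plane-induced homography. It assumes a pinhole camera with intrinsics K, a plane n·X = d expressed in the first camera's frame, and a second view related by X2 = R·X1 + t, which gives H = K (R + t nᵀ / d) K⁻¹. All function names are illustrative; this is not the paper's verification procedure, only the geometric mapping it builds on, and other sign conventions for the plane equation change the sign of the t nᵀ / d term.

```python
import math


def matmul(a, b):
    """3x3 matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def matvec(a, v):
    """3x3 matrix times 3-vector."""
    return [sum(a[i][k] * v[k] for k in range(3)) for i in range(3)]


def intrinsics(f, cx, cy):
    return [[f, 0.0, cx], [0.0, f, cy], [0.0, 0.0, 1.0]]


def intrinsics_inv(f, cx, cy):
    # Closed-form inverse of the simple pinhole intrinsics above.
    return [[1.0 / f, 0.0, -cx / f], [0.0, 1.0 / f, -cy / f], [0.0, 0.0, 1.0]]


def rot_y(theta):
    """Rotation about the camera y-axis (a small yaw perturbation)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]


def plane_homography(f, cx, cy, R, t, n, d):
    """H = K (R + t n^T / d) K^-1 for the plane n.X = d in camera-1 frame."""
    M = [[R[i][j] + t[i] * n[j] / d for j in range(3)] for i in range(3)]
    return matmul(matmul(intrinsics(f, cx, cy), M), intrinsics_inv(f, cx, cy))


def project(f, cx, cy, X):
    """Pinhole projection of a 3D point given in the camera frame."""
    return (f * X[0] / X[2] + cx, f * X[1] / X[2] + cy)


def warp(H, u, v):
    """Apply a homography to a pixel and dehomogenize."""
    x, y, w = matvec(H, [u, v, 1.0])
    return (x / w, y / w)
```

A quick consistency check is to project a point on the plane into both views and confirm that `warp` maps the first pixel onto the second; a verifier would then bound how the warped pixel values change as R and t range over a perturbation set.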

2605.23201 2026-05-25 cs.SD cs.MM

MixFake: Benchmarking and Enhancing Audio Deepfake Detection in Diverse Real-world Mixed Audio

Qingcao Li, Yipeng Lin, Weichen Lian, Zhongjie Ba, Peng Cheng, Zhichao Lian

Affiliations: School of Cyber Science and Engineering, Nanjing University of Science and Technology, Nanjing, China; The State Key Laboratory of Blockchain and Data Security, Zhejiang University, Hangzhou, China; Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security, Hangzhou, China

AI summary: This paper introduces MixFake, a large-scale benchmark dataset for evaluating and improving audio deepfake detection, designed to simulate real-world conditions in which speech is mixed with background music or noise. To address the weakness of existing self-supervised learning (SSL) based methods on non-speech or mixed-source audio, the authors propose a multi-stream prompt tuning framework that injects signal-level priors so that the SSL model captures acoustic artifacts more effectively. The method reaches a 0.95% EER in foreground detection and a 7.72% absolute improvement in complex background detection over existing baselines.

Comments Accepted by ICME 2026

Abstract

Speech deepfake detection has achieved remarkable success in clean environments but faces significant challenges in complex, real-world scenarios where speech is often mixed with background music or noise. Current state-of-the-art methods rely on semantic features from self-supervised learning (SSL) models, which often fail when processing non-speech or mixed-source audio. In this paper, we first introduce MixFake, a large-scale benchmark dataset designed to simulate diverse acoustic environments with varying SNR levels and mixed authenticity components. To address the "semantic-centric" limitation, we propose a Multi-stream Prompt Tuning framework that injects signal-level priors into SSL backbones. By integrating base, frequency, and texture streams through deep prompt injection, our model effectively captures acoustic artifacts. Experimental results demonstrate that our method significantly outperforms existing baselines, achieving a 0.95% EER in foreground detection and a substantial 7.72% absolute improvement in complex background detection tasks. Our dataset and code are available at https://github.com/saltfish233/MixFake.
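
The "varying SNR levels" in the abstract refer to how loud the background is relative to the speech. As a rough illustration (not the MixFake construction pipeline, whose details the abstract does not give), the sketch below scales a background track so that the speech-to-background power ratio hits a target SNR in dB before the two are summed. The names `power` and `mix_at_snr` are hypothetical.

```python
import math


def power(signal):
    """Mean squared amplitude of a sequence of samples."""
    return sum(s * s for s in signal) / len(signal)


def mix_at_snr(speech, background, snr_db):
    """Return (mixture, scaled_background) with the requested SNR.

    The background is multiplied by a gain chosen so that
    10 * log10(P_speech / P_scaled_background) == snr_db.
    """
    if len(speech) != len(background):
        raise ValueError("speech and background must have equal length")
    p_speech, p_bg = power(speech), power(background)
    if p_speech == 0 or p_bg == 0:
        raise ValueError("both signals must have non-zero power")
    gain = math.sqrt(p_speech / (p_bg * 10 ** (snr_db / 10)))
    scaled = [gain * b for b in background]
    mixture = [s + b for s, b in zip(speech, scaled)]
    return mixture, scaled
```

Lower SNR values bury the speech deeper in the background, which is the regime where the abstract reports that semantic SSL features become unreliable.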

2605.23200 2026-05-25 cs.LG cs.AI

Adaptive Mass-Segmented KV Compression for Long-Context Reasoning

Junzhe Yang, Xiaoyu Shen

Affiliations: Shanghai Jiao Tong University; Institute of Digital Twin, Eastern Institute of Technology

AI summary: In long-form reasoning, the linear growth of the key-value (KV) cache is a critical bottleneck. Existing compression methods evict tokens by importance score, but this tends to wipe out contiguous reasoning blocks and break logical coherence. The paper proposes the Adaptive Mass-Segmented (AMS) KV compression framework, which uses the spatial distribution of attention mass to allocate memory quotas dynamically and protect key reasoning segments. AMS is compatible with several mainstream compression methods and with modern paged-KV serving frameworks. Experiments show that it mitigates structural fragmentation and improves model performance.

Abstract

The linear growth of the Key-Value (KV) cache is a critical bottleneck in long-form LLM inference. Existing KV compression methods mitigate this by evicting tokens based on importance scores. However, we show that their reliance on global Top-k selection triggers Region Wipe-out: the severe eviction of contiguous reasoning blocks that derails logical coherence. To address this, we propose Adaptive Mass-Segmented (AMS) KV Compression, a framework that shifts the paradigm from token-level competition to region-aware quota allocation. AMS adaptively partitions the KV cache based on the spatial distribution of attention mass, ensuring structurally vital reasoning segments receive guaranteed memory quotas. To ensure stability during iterative decoding, an EMA-based smoothing mechanism is incorporated to prevent jitter in segment boundaries. Crucially, AMS is a universal plug-and-play layer that is orthogonal to existing scorers. It can be seamlessly integrated into representative methods such as TOVA, Expected Attention, KeyDiff, R-KV and TriAttention. AMS is also system-compatible with modern paged-KV serving frameworks such as vLLM, supporting efficient gather-and-compact KV execution without introducing additional steady-state attention overhead. Extensive experiments across a diverse suite of tasks, including mathematical reasoning (MATH500, AIME, GSM8K), code completion, open-domain QA, and sparse retrieval, demonstrate that AMS consistently mitigates structural fragmentation and boosts model performance.
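
The contrast between global Top-k eviction and region-aware quotas can be shown on a toy cache. The sketch below is an assumption-laden illustration, not the AMS algorithm: segment boundaries are passed in by hand (the paper derives them adaptively from attention mass, which is not reproduced here), every segment is simply guaranteed `min_quota` slots, and `ema_update` smooths a per-position mass vector as a stand-in for the EMA-based boundary smoothing the abstract mentions. All names are hypothetical.

```python
def global_topk(scores, budget):
    """Keep the `budget` highest-scoring token positions, ignoring regions."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:budget])


def quota_select(scores, segments, budget, min_quota=1):
    """Keep `budget` positions, guaranteeing each segment `min_quota` slots.

    `segments` is a list of (start, end) half-open ranges covering the cache.
    Slots left after the guarantees are filled by global score order.
    """
    if len(segments) * min_quota > budget:
        raise ValueError("budget too small for the guaranteed quotas")
    keep = []
    for start, end in segments:
        best = sorted(range(start, end), key=lambda i: scores[i], reverse=True)
        keep.extend(best[:min_quota])
    chosen = set(keep)
    rest = sorted((i for i in range(len(scores)) if i not in chosen),
                  key=lambda i: scores[i], reverse=True)
    keep.extend(rest[:budget - len(keep)])
    return sorted(keep)


def ema_update(previous, current, alpha=0.3):
    """Exponential moving average of per-position attention mass."""
    return [alpha * c + (1 - alpha) * p for p, c in zip(previous, current)]


if __name__ == "__main__":
    # First 8 tokens: an early reasoning block with low current scores.
    scores = [0.1] * 8 + [0.9] * 8
    print(global_topk(scores, 4))                         # early block evicted
    print(quota_select(scores, [(0, 8), (8, 16)], 4))     # early block survives
```

In the toy run, global Top-k keeps only positions from the high-scoring second block, while the quota version retains at least one position from the first block, which is the failure mode and remedy the abstract describes as Region Wipe-out and quota allocation.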

URL PDF HTML ☆

赞 0 踩 0

2605.23198 2026-05-25 cs.LG

Label-Efficient Dataset Pruning via Semi-Supervised Pseudo-Labeling

Yeseul Cho, Baekrok Shin, Changmin Kang, Chulhee Yun

Affiliations: Graduate School of AI, KAIST

AI summary: This paper proposes SemiPrune, a label-efficient semi-supervised dataset pruning method that addresses the reliance of conventional pruning methods on large amounts of labeled data. It needs labels for only a small random subset of the data and generates pseudo-labels so that the large unlabeled remainder can be used for pruning. Unlike methods that depend on features from pretrained models, SemiPrune learns directly from the target dataset and therefore captures the data distribution more accurately, which makes pruning more reliable. It outperforms existing label-free and label-efficient methods on domain-specific, image-corrupted, and long-tailed datasets.

Comments 10 pages

Abstract

Dataset pruning reduces the storage and training costs of deep learning by selecting an informative subset from a large dataset. However, most existing pruning methods require fully labeled data, which limits their applicability in realistic settings where unlabeled data are abundant and annotation is costly. Recent label-free pruning methods address this issue, but they rely on features from pretrained models to estimate example difficulty. This dependence can be unreliable when the target dataset differs substantially from the pretraining distribution. We propose SemiPrune, a label-efficient dataset pruning framework that requires labels for only a small random subset of the data. It uses semi-supervised learning to generate pseudo-labels for the unlabeled data, so that existing supervised pruning methods, which require label information, can be applied directly to the resulting pseudo-labeled training pool. We then estimate example difficulty from pseudo-label-induced training dynamics and select a coreset. By learning directly from the target dataset, our method better captures the target distribution and provides more reliable signals for difficulty estimation and coreset selection. We validate our approach on domain-specific, image-corrupted, and long-tailed datasets, where it achieves state-of-the-art performance among label-free and label-efficient baselines, while also demonstrating competitive performance on standard benchmarks.

URL PDF HTML ☆

赞 0 踩 0

2605.23194 2026-05-25 cs.LG cs.AI

Scalable Heterogeneous Graph Foundation Models for Data-Driven Optimal Power Flow in Smart Grids

Massimiliano Lupo Pasini, Yijiang Li, Kibaek Kim, Teja Kuruganti

Affiliations: Computational Sciences and Engineering Division, Oak Ridge National Laboratory; Mathematics and Computer Science Division, Argonne National Laboratory; UT-Battelle, LLC

AI summary: This paper presents a scalable heterogeneous graph neural network (GNN) framework, built on HydraGNN, for data-driven optimal power flow (OPF) surrogate models and graph foundation models (GFMs). The approach preserves the heterogeneous structure of power networks, with their distinct node and edge types, and supports distributed preprocessing, training, hyperparameter optimization, and downstream fine-tuning on supercomputers. Experiments show that the framework yields compact models (about 1.6 to 1.7M parameters) with the lowest validation losses, and that fine-tuning the pretrained model improves low-data accuracy and training efficiency on feasibility classification and N-1 contingency regression.

Comments 10 pages, 6 tables, 4 figures

Abstract

Fast and reliable optimal power flow (OPF) approximation is essential for reliable smart-grid operation, yet many learning-based surrogates either flatten the native heterogeneous structure of power networks, target a limited set of grid topologies, or lack scalable infrastructure for graph foundation model (GFM) training. This paper presents a scalable heterogeneous graph neural network (GNN) workflow, built on HydraGNN, for data-driven OPF surrogate modeling and OPF-GFM development. The workflow preserves the distinct node and edge types of power grids -- buses, generators, loads, shunts, AC lines, transformers, and device-to-bus couplings -- and supports distributed preprocessing, training, hyperparameter optimization (HPO), and downstream fine-tuning on leadership-class supercomputers. Using three million heterogeneous graph instances spanning ten PGLib-OPF cases, from 14 to 13,659 buses, we conduct DeepHyper-driven HPO on the ORNL Frontier supercomputer. The campaign identifies compact models ($\sim$1.6--1.7M parameters) with the lowest validation losses. Downstream experiments on feasibility classification and N-1 contingency regression show that fine-tuning pretrained OPF GFM improves low-data accuracy, stabilizes training, accelerates convergence, and reduces adaptation cost when partial or head-only fine-tuning is used.

URL PDF HTML ☆

赞 0 踩 0

2605.23191 2026-05-25 cs.LG cs.IR cs.NA math.NA

Expand More, Shrink Less: Shaping Effective-Rank Dynamics for Dense Scaling in Recommendation

Guoming Li, Shangyu Zhang, Junwei Pan, Wentao Ning, Jin Chen, Gengsheng Xue, Chao Zhou, Shudong Huang, Haijie Gu, Menglin Yang

Affiliations: The Hong Kong University of Science and Technology (Guangzhou); Tencent Inc.; Tencent Inc. Shenzhen China

AI summary: Scaling up recommendation models is a central challenge in recommender systems. This paper targets the embedding collapse that the existing RankMixer method exhibits when scaled, and proposes a new architecture, RankElastor. Through parameterized full mixing and GLU-style feedforward networks, RankElastor improves the spectral robustness of the representations and mitigates the decay of effective rank. Experiments on large-scale industrial datasets show that it improves recommendation performance and scales more robustly.

Comments Accepted at the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining (Research Track), KDD 2026 February Cycle

DOI: 10.1145/3770855.3818049

Abstract

Scaling recommendation models is a central challenge in recommender systems. Recently, RankMixer has emerged as an effective solution, operating on a unified token representation and alternating between token mixing and per-token feedforward networks (P-FFNs) to achieve scalable performance. However, RankMixer suffers from embedding collapse, where learned representations have low effective rank, limiting expressivity and underutilizing the expanded representation space. Through empirical analysis and theoretical insights, we identify rigid token mixing and P-FFN modules as the primary causes of this phenomenon, jointly inducing a damped oscillatory trajectory in effective-rank evolution across layers. To address it, we propose RankElastor, a novel architecture that produces spectrum-robust representations with provable collapse mitigation. RankElastor introduces two components: (i) parameterized full mixing, which enables expressive token mixing with improved spectral robustness; and (ii) GLU-improved P-FFNs, which stabilize representation spectra through GLU-style FFN modules. Extensive experiments on large-scale industrial datasets demonstrate that RankElastor consistently improves recommendation performance, mitigates embedding collapse, and exhibits robust scaling behavior. Code is available at this GitHub repository: https://github.com/vasile-paskardlgm/RankElastor
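
The abstract does not spell out how effective rank is measured. One common definition, assumed here, is the exponential of the Shannon entropy of the normalized singular values of the representation matrix. The sketch takes the singular values as input (obtained from any SVD routine) so that it stays dependency-free; the function name `effective_rank` is illustrative and may not match the paper's code.

```python
import math


def effective_rank(singular_values):
    """exp(entropy) of the singular values normalized to sum to 1.

    Returns a value between 1 (all energy in one direction, i.e. collapse)
    and the number of non-zero singular values (perfectly flat spectrum).
    """
    positive = [s for s in singular_values if s > 0]
    if not positive:
        return 0.0
    total = sum(positive)
    probs = [s / total for s in positive]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)


if __name__ == "__main__":
    print(effective_rank([1.0, 1.0, 1.0, 1.0]))    # flat spectrum
    print(effective_rank([10.0, 0.01, 0.01]))      # nearly collapsed spectrum
```

Tracking this quantity layer by layer is one way to observe the kind of effective-rank trajectory the abstract refers to.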

2605.23190 2026-05-25 cs.CL

Hidden Human-Like Nature of Machine-Generated Texts: Theory and Detection Enhancement

Chenwang Wu, Yiu-ming Cheung, Bo Han, Defu Lian

Affiliations: Department of Computer Science, Hong Kong Baptist University; School of Computer Science and Technology, University of Science and Technology of China; State Key Laboratory of Cognitive Intelligence

AI summary: As text generated by large language models becomes common across applications, its potential misuse creates an urgent need for detection methods. Existing methods usually treat generated text as entirely machine-like and overlook spans that resemble human writing. This paper reveals the existence of such hidden human-like spans, analyzes their negative effect on detection, and proposes a model-agnostic stacked enhancement framework that improves detection through iterative filtering and refinement. Experiments show that the framework is effective across a range of scenarios and can also be deployed without training.

Abstract

Machine-generated texts (MGTs) produced by large language models (LLMs) are increasingly prevalent across various applications, while their potential misuse in fake news propagation and phishing has raised serious concerns, highlighting the need for MGT detection. Existing paragraph-level detection methods commonly treat MGTs as entirely machine-like, overlooking the hidden human-like nature of machine-generated texts: even fully machine-generated texts may contain spans that are highly consistent with human writing. To this end, we first reveal the existence of such hidden human-like spans, and then theoretically analyze their impact on detection. Our analysis shows that these spans increase the sentence complexity for detection, thereby making MGT detection intrinsically harder. Based on this finding, we propose a model-agnostic stacked enhancement framework that improves existing detectors by reducing the influence of hidden human-like spans. Specifically, we model span-level retention decisions as a latent-variable problem and instantiate the optimization with a hard-EM-inspired procedure, where the detector iteratively filters confidently human-like subsequences and refines itself on the remaining text. Extensive experiments across various LLMs and practical scenarios demonstrate that the proposed framework consistently enhances existing detectors. Notably, the framework can also work in a training-free manner, offering flexibility and scalability for practical deployment.

URL PDF HTML ☆

赞 0 踩 0

2605.23189 2026-05-25 cs.LG

Empirical Bayes Conformal Prediction for Vision and Language Models

Jiapeng Zeng, Yogesh Prabhu, Zhanpeng Zeng, Michael A. Newton, Vikas Singh

Affiliations: University of Wisconsin–Madison; University of California San Diego; Xiamen University

AI summary: This paper proposes an empirical Bayes conformal prediction framework for improving confidence assessment in vision and language models. It introduces $r$-values, which turn the variability of a score into an uncertainty-aware score and so give a more accurate judgment of whether a candidate belongs to the top-ranked group. The method preserves the target coverage while reducing the inclusion of high-variance false candidates, and it shows more stable rankings and smaller prediction sets on several benchmark tasks.

Abstract

Conformal prediction (CP) gives distribution-free coverage for modern vision and language models, but it is often forced to make a ranking decision from a single unstable nonconformity score. Standard CP uses one realization, while average-then-calibrate variants smooth multiple realizations into a point estimate. Both options discard the inconsistency that can help identify whether a candidate is indeed stable. A weak answer can enter the conformal set even if the evidence is not strong, simply because one posterior sample or prompt phrasing made it look strong. But variability can help distinguish a stable signal from noise-driven fluctuations. We describe an empirical Bayes conformal prediction framework that uses $r$-values to convert score variability into an uncertainty informed nonconformity score. The resulting $r$-value estimates how likely a candidate's latent score belongs to the top-ranked group after accounting for both its mean score and its uncertainty. It admits both a closed-form Normal-Normal empirical Bayes estimator and a nonparametric posterior-sampling estimator. Using the $r$-value as the nonconformity score preserves the target conformal coverage while provably reducing the inclusion of high variance false candidates under mild regularity conditions. Across image classification, CLIP-based VLM benchmarks, and LLMs, we show that $r$-value conformal prediction preserves target coverage while improving ranking stability and reducing set size when variability is informative, and reverting to CP-like behavior when variability vanishes.
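
For orientation, here is a minimal sketch of the standard split conformal baseline that the abstract contrasts against: a threshold is taken as the ⌈(n+1)(1−α)⌉-th smallest calibration nonconformity score, and every candidate whose score does not exceed it enters the prediction set. The $r$-value score itself is not implemented here; any per-candidate nonconformity score, including an uncertainty-aware one, could be plugged in. The function names are hypothetical.

```python
import math


def conformal_threshold(calibration_scores, alpha):
    """Split-conformal quantile for miscoverage level `alpha`.

    Uses the ceil((n + 1) * (1 - alpha))-th smallest calibration score.
    Returns infinity when n is too small for the requested level, in which
    case every candidate is included.
    """
    n = len(calibration_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf
    return sorted(calibration_scores)[k - 1]


def prediction_set(candidate_scores, threshold):
    """Labels whose nonconformity score is at or below the threshold."""
    return [label for label, score in candidate_scores.items()
            if score <= threshold]


if __name__ == "__main__":
    calibration = [9, 1, 8, 2, 7, 3, 6, 4, 5]
    tau = conformal_threshold(calibration, alpha=0.5)
    print(tau)
    print(prediction_set({"cat": 2, "dog": 5, "fox": 7}, tau))
```

Because the threshold depends only on the ranks of the calibration scores, swapping in a score that penalizes unstable candidates changes which candidates are admitted without altering the calibration step.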

2605.23187 2026-05-25 cs.CV cs.RO

IntentionNav: A Benchmark for Intent-Driven Object Navigation from Implicit Human Instruction

Lin Qian, Shijie Li, Sihao Lin, Xuan Zhang, Bangya Liu, Yanran Li, Hujun Yin

Affiliations: The University of Manchester; A*STAR; Responsible AI Research Centre, Adelaide University; University of Bedfordshire

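AI summary: IntentionNav is a new benchmark for intent-driven object navigation that evaluates an agent's ability to infer the target object from implicit human instructions and complete the navigation task. Rather than naming the target object, the benchmark expresses the need implicitly through natural-language instructions, requiring the agent to understand the intent, identify the target, and navigate to it. The study introduces four intent modes and several instruction styles, supports fine-grained analysis of target inference, language robustness, and navigation success, and shows that current vision-language models still struggle to understand implicit intent and complete precise navigation.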

Comments preprint

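AI-translated abstract

Existing object navigation benchmarks usually tell an embodied agent which object category to find, such as a microwave or a chair. Human-facing embodied AI is often asked something less direct: "I need to heat up this food" or "the room feels stuffy." The agent must infer the object that can satisfy the need, find an instance of it in the scene, and decide whether the goal has been reached. We study this setting as intent-driven object navigation and introduce IntentionNav, a diagnostic benchmark for active object search from implicit human instructions. Each episode provides a free-text intent, RGB-D observations, and pose, but hides the target object name. IntentionNav contains 500 intents over 176 Isaac Sim scenes and 64 target categories. Each intent is rewritten in four controlled instruction styles and annotated with one of four intent modes, separating surface phrasing from semantic cue type while keeping the geometry matched. This paired design supports analysis of target inference, language robustness, neighborhood reachability, and terminal success rather than aggregate success alone. We evaluate three VLMs using a fixed active-navigation agent. Models identify the intended target in 48.3% of episodes and enter its 2 m neighborhood in 68.7%, but terminate successfully in only 24.9% and achieve grounded 1 m success in 5.5%. Success is highest for event-script intents (28.7%) and lower for physical-state and affordance intents (19.2% and 18.5%, respectively), showing that indirect human intent remains a bottleneck for target selection, visual verification, and terminal localization in active embodied search.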

English abstract

Existing object navigation benchmarks usually tell an embodied agent which object category to find, such as microwave or chair. Human-facing embodied AI is often asked something less direct: "I need something to warm this food" or "the room feels stuffy." The agent must infer the object that can satisfy the need, find a scene-grounded instance, and decide whether the goal has been reached. We study this setting as intent-driven object navigation and introduce IntentionNav, a diagnostic benchmark for active object search from implicit human instructions. Each episode provides a free-text intent, RGB-D observations, and pose, but withholds the target object name. IntentionNav contains 500 intents over 176 Isaac Sim scenes and 64 target categories. Each intent is rewritten in four controlled instruction styles and annotated with one of four intent modes, separating surface phrasing from semantic cue type under matched geometry. This paired design supports analysis of target inference, language robustness, neighborhood reachability, and terminal success rather than only aggregate success. We evaluate three VLMs using a fixed active-navigation agent. Models identify the intended target in 48.3 percent of episodes and enter its 2 m neighborhood in 68.7 percent, but terminate successfully in only 24.9 percent and achieve grounded 1 m success in 5.5 percent. Success is highest for event-script intents (28.7 percent) and lower for physical-state and affordance intents (19.2 percent and 18.5 percent), showing that indirect human intent remains a bottleneck for target selection, visual verification, and terminal localization in active embodied search.
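
The staged analysis described above (target inference, neighborhood reachability, terminal success, grounded success) can be pictured as per-episode flags aggregated into separate rates. The sketch below is only an assumption about how such bookkeeping might look: the `EpisodeResult` fields and the four toy episodes are hypothetical and are not taken from the benchmark's code or data.

```python
from dataclasses import dataclass

@dataclass
class EpisodeResult:
    # All field names are hypothetical labels for the four reported stages.
    target_identified: bool   # the model named the intended target
    entered_2m: bool          # the agent entered the target's 2 m neighborhood
    terminated_success: bool  # the agent stopped and the episode counted as success
    grounded_1m: bool         # success under the stricter grounded 1 m criterion

STAGES = ("target_identified", "entered_2m", "terminated_success", "grounded_1m")

def stage_rates(results):
    """Fraction of episodes that pass each stage, reported separately per stage."""
    n = len(results)
    return {stage: sum(getattr(r, stage) for r in results) / n for stage in STAGES}

# Four made-up episodes, purely to exercise the aggregation.
toy_results = [
    EpisodeResult(True, True, True, False),
    EpisodeResult(True, True, False, False),
    EpisodeResult(False, True, False, False),
    EpisodeResult(True, False, False, False),
]
rates = stage_rates(toy_results)
```

Reporting the stages separately makes it visible where episodes are lost, for example when an agent often reaches the neighborhood but rarely terminates correctly, which a single aggregate success number would hide.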

2605.23182 2026-05-25 cs.LG

Pure Exploration for a Good Policy in Reinforcement Learning with Bandit Feedback

Zitian Li, Wang Chi Cheung

Affiliations: Department of Industrial Systems Engineering & Management

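AI summary: The paper studies how, in reinforcement learning with only bandit feedback, to efficiently identify a "good enough" policy rather than the optimal policy. To this end, the authors pose the Good Policy Identification (GPI) problem: given a reward threshold, find a policy that meets the threshold or determine that none exists. They design a new algorithm, BEE-GPI, and analyze upper bounds on its sample complexity for both positive and negative instances. For positive instances, the leading coefficient does not depend on the sizes of the state and action spaces, in contrast to best policy identification. Experiments validate the effectiveness of the method.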

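AI-translated abstract

Pure exploration in episodic reinforcement learning has mainly focused on Best Policy Identification (BPI), which aims to identify a (near-)optimal policy with high confidence. Motivated by practical settings in which a "good enough" policy suffices, we study an alternative objective, Good Policy Identification (GPI). For a given reward threshold $μ_0$, GPI only requires identifying a policy whose expected reward is at least $μ_0$ if such a policy exists (a positive instance), or declaring that none exists (a negative instance). We formalize GPI in the fixed-confidence setting. We require the output to be correct with probability $\geq 1-δ$ and seek to minimize the expected sample complexity, that is, the expected number of episodes explored before producing the output. We propose a novel algorithm, BEE-GPI, and derive theoretical upper bounds on its sample complexity for positive and negative instances. Notably, for positive instances, the coefficient of $\log(1/δ)$ in the upper bound is $O(H^2/(V^* - μ_0)^2)$, where $H$ is the episode length and $V^*$ is the optimal expected reward in an episode. This coefficient does not depend on the sizes of the action and state spaces, in sharp contrast to the sample complexity in BPI. We further establish lower bounds to show the near-optimality of BEE-GPI and the necessity of the $1/(V^* - μ)^2$ term. Numerical experiments further validate the efficiency of our approach.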

English abstract

Pure exploration in episodic Reinforcement Learning has primarily focused on Best Policy Identification (BPI), which seeks to identify a (near-)optimal policy with high confidence. Motivated by practical settings where a "good enough" policy suffices, we study an alternate objective of Good Policy Identification (GPI). For a given reward threshold $μ_0$, GPI only requires identifying a policy with expected reward in an episode at least $μ_0$ if such a policy exists (positive instance), or declaring None if no such policy exists (negative instance). We formalize GPI under the fixed-confidence setting. We require the output to be correct with probability $\geq 1-δ$, and seek to minimize the expected sample complexity, which is the expected number of episodes explored for the output. We propose a novel algorithm BEE-GPI, and derive theoretically-grounded upper bounds on its sample complexity for positive and negative instances. Notably, for positive instances, the coefficient of $\log(1/δ)$ in our upper bound is $O(H^2/(V^* - μ_0)^2)$, where $H$ is the episode length and $V^*$ is the optimal expected reward in an episode. The coefficient does not depend on the action and state space sizes otherwise, in sharp contrast to the sample complexity in BPI. We further establish lower bound results to show the near-optimality of BEE-GPI and the necessity of the $1/(V^* - μ)^2$ term. Numerical experiments further validate the efficiency of our approach.
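
To get a feel for the stated leading term $O(H^2/(V^* - μ_0)^2) \cdot \log(1/δ)$, the short sketch below evaluates it with the hidden constant set to 1. That constant, the function name, and the numeric inputs are assumptions made only for illustration; the result is a scaling guide, not an actual episode count from the paper.

```python
import math

def gpi_leading_term(H, v_star, mu0, delta):
    """H^2 / (V* - mu0)^2 * log(1/delta), with the O(.) constant assumed to be 1."""
    if not v_star > mu0:
        raise ValueError("positive instances require V* > mu0")
    if not 0 < delta < 1:
        raise ValueError("delta must lie in (0, 1)")
    return (H ** 2) / (v_star - mu0) ** 2 * math.log(1 / delta)

# Hypothetical numbers: episode length 10, optimal value 0.75, confidence 95%.
tight = gpi_leading_term(H=10, v_star=0.75, mu0=0.50, delta=0.05)  # gap 0.25
loose = gpi_leading_term(H=10, v_star=0.75, mu0=0.25, delta=0.05)  # gap 0.50
ratio = tight / loose
```

Halving the gap $V^* - μ_0$ multiplies the term by four, and doubling $H$ does the same, while the number of states and actions does not appear in it at all.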

2605.23180 2026-05-25 cs.CL cs.LG

Self-Improving In-Context Learning

Baturay Saglam, Dionysis Kalogerias

Affiliations: Department of Electrical and Computer Engineering

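AI summary: The paper proposes a way to improve in-context learning (ICL) by optimizing the continuous embeddings of a fixed few-shot prompt at test time. It finds that the log-probabilities a model assigns to the demonstration outputs are an effective signal of how well the model has understood the task, and builds on this a self-supervised confidence proxy that needs no extra data, calibrating the prompt embeddings through zeroth-order optimization. The method requires no finetuning, no token generation, and no predefined label set, applies to both classification and free-form generation, and performs well across multiple ICL tasks, which supports the usefulness of the optimization signal.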

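AI-translated abstract

We propose to improve in-context learning (ICL) by optimizing the continuous embeddings of a fixed few-shot prompt at test time. The key observation is that the log-probabilities a model assigns to its demonstrated outputs, which are available from a single forward pass without generating any tokens, provide a meaningful signal of how well the model has inferred the task from the demonstrations. We formalize this signal as a bounded, self-supervised confidence proxy and maximize it through zeroth-order optimization over the prompt embeddings, which yields a test-time calibration procedure. The approach requires no finetuning, token generation, predefined label set, or external data, so it applies equally to classification and free-form generation tasks. Across a comprehensive suite of ICL tasks, the proposed calibration consistently matches or improves on the base model and outperforms classification-specific baselines on most tasks. A statistically significant correlation between proxy improvement and downstream accuracy gain confirms that the proposed proxy encodes a reliable optimization signal for in-context learning.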

English abstract

We propose to improve in-context learning (ICL) by optimizing the continuous embeddings of a fixed few-shot prompt at test time. The key observation is that the log-probabilities a model assigns to its demonstrated outputs (available from a single forward pass without generating any tokens) provide a meaningful signal for how well the model has inferred the task from its demonstrations. We formalize this signal as a bounded, self-supervised confidence proxy and maximize it via zeroth-order optimization over the prompt embeddings, yielding a test-time calibration procedure. The approach requires no finetuning, no token generation, no predefined label set, and no external data, making it equally applicable to both classification and free-form generation tasks. Across a comprehensive suite of ICL tasks, the proposed calibration consistently matches or improves upon the base model and outperforms classification-specific baselines on most tasks. The statistically significant correlation between proxy improvement and downstream accuracy gain confirms that the proposed proxy encodes a reliable optimization signal for in-context learning.
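
Zeroth-order optimization, as mentioned above, improves an objective using only function evaluations and no gradients. The sketch below shows a generic two-point random-direction scheme on a toy bounded objective. It is not the paper's procedure: the `proxy` function merely stands in for the log-probability-based confidence proxy, the three-dimensional vector stands in for the prompt embeddings, and the step size and smoothing radius are arbitrary assumptions.

```python
import math
import random

def proxy(x, target):
    """Toy stand-in for a confidence proxy: bounded in (0, 1], maximal at `target`."""
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, target)))

def random_unit_vector(dim, rng):
    """Uniformly random direction obtained by normalizing a Gaussian sample."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]

def zeroth_order_ascent(f, x0, steps=500, sigma=1e-3, lr=0.2, seed=0):
    """Maximize f using two function evaluations per step and no gradients."""
    rng = random.Random(seed)
    x = list(x0)
    for _ in range(steps):
        u = random_unit_vector(len(x), rng)
        plus = f([a + sigma * b for a, b in zip(x, u)])
        minus = f([a - sigma * b for a, b in zip(x, u)])
        slope = (plus - minus) / (2 * sigma)  # finite-difference directional derivative
        x = [a + lr * slope * b for a, b in zip(x, u)]  # move along u, signed by the slope
    return x

target = [1.0, -0.5, 0.5]  # hypothetical optimum of the toy objective
start = [0.0, 0.0, 0.0]    # hypothetical initial "embedding"
tuned = zeroth_order_ascent(lambda x: proxy(x, target), start)
before, after = proxy(start, target), proxy(tuned, target)
```

In the setting the abstract describes, each evaluation of the objective would correspond to a forward pass of the model, which is why a gradient-free scheme of this kind fits a method that requires no finetuning.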

2605.23179 2026-05-25 cs.AI

Redrawing the AI Map: A Theory of Accountability Boundaries in Agentic Ecosystems

Muhammad Zia Hydari, Farooq Muzaffar

Affiliations: University of Pittsburgh

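AI summary: The paper examines how accountability boundaries are configured in agentic ecosystems and proposes a capability-level theory of accountability-boundary placement. It introduces the concept of accountability assets, which are key to the legitimacy, auditability, and assignment of responsibility for AI outputs, and analyzes how verification cost and responsibility transferability affect whether the accountability and execution boundaries move together. The theory proposes three boundary strategies and introduces the concept of rule debt, the governance burden created when organizational decision rules migrate into agentic execution environments, offering a new perspective on the relationship between digital modularization and organizational disaggregation.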

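AI-translated abstract

Agentic AI orchestrators lower the interface and assembly costs of composing information systems capabilities across organizational boundaries, seemingly accelerating modularization and organizational disaggregation. Yet AI-enabled capabilities whose outputs require evidence, review, signoff, or assignable responsibility may retain integrated accountability boundaries even when their technical interfaces become modular. We develop a capability-level theory of accountability-boundary placement in agentic ecosystems. We introduce accountability assets: complementary assets that make AI-supported outputs legitimate, auditable, reviewable, and assignable to a responsible party. We argue that verification cost and responsibility transferability determine whether the execution boundary and the accountability boundary can move together. The theory identifies three boundary strategies: component, integrated, and dual-track. It also introduces rule debt, the governance burden that arises when organizational decision rules migrate from formal information systems into ungoverned agentic execution environments. Integrating digital innovation, transaction cost, complementary-assets, digital platform governance, and IS control perspectives, we develop seven propositions that link agentic assembly-cost reductions, accountability assets, appropriability, orchestrator intent capture, and boundary misconfiguration to boundary strategy, value appropriation, and rule debt. The theory explains when digital modularization extends to organizational disaggregation and when accountability keeps capabilities integrated. Structured illustrations from document processing, legal services, audit, clinical decision support, and procurement discipline the boundary logic.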

English abstract

Agentic AI orchestrators reduce the interface and assembly costs of composing information systems capabilities across organizational boundaries, seemingly accelerating modularization and organizational disaggregation. Yet AI-enabled capabilities whose outputs require evidence, review, signoff, or assignable responsibility may retain integrated accountability boundaries even when their technical interfaces become modular. We develop a capability-level theory of accountability-boundary placement in agentic ecosystems. We introduce accountability assets: complementary assets that make AI-supported outputs legitimate, auditable, reviewable, and assignable to a responsible party. We argue that verification cost and responsibility transferability determine whether the execution and accountability boundaries can move together. The theory identifies three boundary strategies: component, integrated, and dual-track. It also introduces rule debt, the governance burden that accrues when organizational decision rules migrate from formal information systems into ungoverned agentic execution environments. Integrating digital innovation, transaction cost, complementary-assets, digital platform governance, and IS control perspectives, we develop seven propositions linking agentic assembly-cost reductions, accountability assets, appropriability, orchestrator intent capture, and boundary misconfiguration to boundary strategy, value appropriation, and rule debt. The theory explains when digital modularization extends to organizational disaggregation and when accountability keeps capabilities integrated. Structured illustrations across document processing, legal services, audit, clinical decision support, and procurement discipline the boundary logic.

2605.23178 2026-05-25 cs.CV

Composing People Together: Iterative Pose-Image Generation for Multi-Person Interaction Scenes

Wenxuan Peng, Bharath Hariharan, Hadar Averbuch-Elor

Affiliations: Cornell University

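AI summary: Existing text-to-image models still suffer from limited semantic diversity and poor compositional accuracy when generating multi-person interaction scenes, often producing repetitive layouts, stereotypical poses, and unnatural interactions. The paper proposes a dual pose-image representation that brings person-centric structural priors into pretrained diffusion transformers; by jointly predicting a 2D pose image and the corresponding RGB image, it lets structure and appearance co-evolve during learning. At its core is a cross-modal alignment scheme that binds text, pose, and image representations to keep the modalities consistent, together with an iterative scene generation strategy that builds complex multi-person interaction scenes step by step and thereby decomposes the overall generation complexity. Experiments show that the method markedly improves prompt alignment and scene diversity in multi-person image generation.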

Comments Accepted to SIGGRAPH Conference Papers 2026. 22 pages, 12 figures. Project page: https://cornell-vailab.github.io/PeopleComposer/

DOI: 10.1145/3799902.3811129

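AI-translated abstract

Despite recent progress, text-to-image models still struggle to generate semantically diverse and compositionally accurate multi-person interaction scenes, often falling into repetitive layouts, stereotypical poses, and weakly grounded interactions. In this work, we bridge this gap by introducing a dual pose-image representation that brings person-centric structural priors into pretrained diffusion transformers. Our model jointly predicts a 2D pose visualization image and its corresponding RGB image, so that structure and appearance co-evolve during learning. At its core is a cross-modal alignment scheme that binds text, pose, and image representations together, ensuring consistent grounding across modalities. In addition, we design an iterative scene construction scheme that progressively generates complex multi-person interactions while effectively decomposing the overall generation complexity. Extensive experiments show that our method substantially improves prompt alignment and scene diversity in multi-person image generation.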

English abstract

Despite recent progress, text-to-image models still struggle to generate semantically diverse and compositionally accurate multi-person interaction scenes, often collapsing to repetitive layouts, stereotypical poses, and poorly grounded interactions. In this work, we bridge this gap by introducing a dual pose-image representation that brings person-centric structural priors into pretrained diffusion transformers. Our model jointly predicts a 2D pose visualization image and its corresponding RGB image, enabling structure and appearance to co-evolve during learning. At its core, a cross-modal alignment scheme binds text, pose, and image representations, ensuring consistent grounding across modalities. Furthermore, we design an iterative scene construction scheme, progressively generating complex multi-human interactions while effectively decomposing the overall generation complexity. Extensive experiments demonstrate that our method substantially improves prompt alignment and scene diversity in multi-person image generation.
