arXivDaily: arXiv daily academic digest, updated Monday through Friday
2604.11121 2026-05-11 cs.CL

BITS Pilani at SemEval-2026 Task 9: Structured Supervised Fine-Tuning with DPO Refinement for Polarization Detection

Atharva Gupta, Dhruv Kumar, Yash Sinha

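Affiliations: Birla Institute of Technology and Science, Pilani, India

AI summary: This paper combines structured supervised fine-tuning with DPO refinement to detect polarization in social media text. It fine-tunes Qwen 2.5-7B-Instruct with LoRA and uses automatically generated preference pairs to reduce false negatives, reaching a Macro-F1 of 0.7664 on the English test set.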

Comments Accepted to the 20th International Workshop on Semantic Evaluation (SemEval-2026), to be held in conjunction with ACL 2026

Abstract

The POLAR SemEval-2026 Shared Task aims to detect online polarization and focuses on the classification and identification of multilingual, multicultural, and multi-event polarization. Accurate computational detection of online polarization is challenging due to nuanced rhetoric, implicit framing, and the high cost of human-in-the-loop annotation. Building on recent findings that contextual prompting enables large language models to function as strong polarization detectors, we present a two-stage approach for detecting polarization in social media text that combines structured supervised fine-tuning with Direct Preference Optimization (DPO) refinement. We fine-tune Qwen 2.5-7B-Instruct with LoRA using an interpretable slot-filling template (target, claim type, manifestation checklist, and justification). We then apply DPO with automatically generated preference pairs to reduce costly false negatives. Our submitted system achieves 0.7664 Macro-F1 on the English test set. Post-submission experiments with Mistral-Nemo-Instruct-2407 and LLM-judge-filtered preference pairs further improve this to 0.8162 Macro-F1 (not submitted to CodaBench), surpassing the organiser baseline of 0.7802.
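
The DPO refinement stage can be illustrated with the standard DPO objective for a single preference pair. This is a minimal sketch, not the authors' training code: the function name `dpo_loss`, the `beta` value, and the log-probabilities are illustrative assumptions, and the abstract does not say how the preference pairs are built beyond being automatically generated. To target false negatives, one would presumably take a response with the correct "polarized" label as the chosen one and a false-negative response as the rejected one.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair of sequence log-probs."""
    # How much more the policy prefers each response than the frozen reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)), written in a numerically stable form.
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))
```

The loss falls as the policy assigns relatively more probability to the chosen response than the reference model does, which is what pushes the model away from the rejected (here, false-negative) outputs.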

2604.09839 2026-05-11 cs.AI cs.LG

Steered LLM Activations are Non-Surjective

Aayush Mishra, Daniel Khashabi, Anqi Liu

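Affiliations: Johns Hopkins University

AI summary: The study asks whether activation steering can be realized through textual prompts. It proves that, under practical assumptions, steered behavior almost surely cannot be reproduced by any prompt, which separates white-box steering from black-box prompting.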

Comments 10 pages main text. ICLR 2026 Workshops (Sci4DL, Re-Align)

Abstract

Activation steering is a popular white-box control technique that modifies model activations to elicit an abstract change in its behavior. It has also become a standard tool in interpretability (e.g., probing truthfulness, or translating activations into human-readable explanations) and safety research (e.g., jailbreakability). However, it is unclear whether steered behavior is realizable by any textual prompt. In this work, we cast this question as a surjectivity problem: for a fixed model, does every steered activation admit a preimage under the model's natural forward pass? Under practical assumptions, we prove that activation steering pushes the residual stream off the manifold of states reachable from discrete prompts. Almost surely, no prompt can reproduce the same internal behavior induced by steering. We also illustrate this finding empirically across three widely used LLMs. Our results establish a formal separation between white-box steerability and black-box prompting. We therefore caution against interpreting the ease and success of activation steering as evidence of prompt-based interpretability or vulnerability, and argue for evaluation protocols that explicitly decouple white-box and black-box interventions.

2604.08923 2026-05-11 cs.CL

NCL-BU at SemEval-2026 Task 3: Fine-tuning XLM-RoBERTa for Multilingual Dimensional Sentiment Regression

Tong Wu, Nicolay Rusnachenko, Huizhi Liang

Affiliations: Independent Researcher; Centre for Applied Creative Technologies (CFACT+), Bournemouth University, UK; School of Computing, Newcastle University, Newcastle upon Tyne, UK

AI summary: This paper fine-tunes XLM-RoBERTa for multilingual dimensional sentiment regression, using dual regression heads to predict valence and arousal, with a separate model trained for each language-domain pair. Experiments show that task-specific fine-tuning outperforms few-shot-prompted large language models.

Abstract

Dimensional Aspect-Based Sentiment Analysis (DimABSA) extends traditional ABSA from categorical polarity labels to continuous valence-arousal (VA) regression. This paper describes a system developed for Track A, Subtask 1 (Dimensional Aspect Sentiment Regression), aiming to predict real-valued VA scores in the [1, 9] range for each given aspect in a text. A fine-tuning approach based on XLM-RoBERTa-base is adopted, with dual regression heads with sigmoid-scaled outputs for valence and arousal prediction. Separate models are trained for each language-domain pair (English and Chinese across restaurant, laptop, and finance domains), and training and development sets are merged for final test predictions. In development experiments, the fine-tuning approach is compared against several large language models under a few-shot prompting setting, demonstrating that task-specific fine-tuning outperforms these LLM-based methods across all evaluation datasets.
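
The sigmoid-scaled output described above can be sketched as follows. This is a hedged illustration, not the submitted system: the function names and the raw logits are assumptions, and the sketch only shows how an unbounded head output can be squashed into the [1, 9] VA range.

```python
import math

def stable_sigmoid(x):
    """Sigmoid that avoids overflow for large-magnitude inputs."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def scale_to_va_range(logit, low=1.0, high=9.0):
    """Map an unbounded regression-head output into [low, high]."""
    return low + (high - low) * stable_sigmoid(logit)

def predict_va(valence_logit, arousal_logit):
    """Two heads, one per dimension, sharing the same scaling."""
    return scale_to_va_range(valence_logit), scale_to_va_range(arousal_logit)
```

A logit of 0 maps to the midpoint 5.0, and no input can produce a score outside the valid range, so predictions never need clipping.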

2604.08077 2026-05-11 cs.CV

AdaSpark: Adaptive Sparsity for Efficient Long-Video Understanding

Handong Li, Zikang Liu, Longteng Guo, Tongtian Yue, Yepeng Tang, Xinxin Zhu, Chuanyang Zheng, Ziming Wang, Zhibin Wang, Jun Song, Cheng Yu, Bo Zheng, Jing Liu

Affiliations: School of Artificial Intelligence, University of Chinese Academy of Sciences; Institute of Automation, Chinese Academy of Sciences; Alibaba Group Holding Limited; Future Living Lab of Alibaba

AI summary: AdaSpark is an adaptive sparsity framework that cuts computation by up to 57% FLOPs while keeping performance comparable to dense models and preserving fine-grained, long-range dependencies. It is validated on hour-scale video benchmarks.

Comments 8 pages, CVPR2026 Accept (Highlight)

Journal ref Proceedings of the IEEE/CVF Computer Vision and Pattern Recognition (CVPR), 2026

Abstract

Processing long-form videos with Video Large Language Models (Video-LLMs) is computationally prohibitive. Current efficiency methods often compromise fine-grained perception through irreversible information disposal or inhibit long-range temporal modeling via rigid, predefined sparse patterns. This paper introduces AdaSpark, an adaptive sparsity framework designed to address these limitations. AdaSpark first partitions video inputs into 3D spatio-temporal cubes. It then employs two co-designed, context-aware components: (1) Adaptive Cube-Selective Attention (AdaS-Attn), which adaptively selects a subset of relevant video cubes to attend for each query token, and (2) Adaptive Token-Selective FFN (AdaS-FFN), which selectively processes only the most salient tokens within each cube. An entropy-based (Top-p) selection mechanism adaptively allocates computational resources based on input complexity. Experiments demonstrate that AdaSpark significantly reduces computational load by up to 57% FLOPs while maintaining comparable performance to dense models and preserving fine-grained, long-range dependencies, as validated on challenging hour-scale video benchmarks.
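
The Top-p selection idea can be sketched generically. The code below is an assumption-laden illustration rather than AdaSpark's implementation: the relevance `scores` per cube, the threshold `p`, and the function name are hypothetical. It keeps the smallest set of candidates whose softmax probabilities sum to at least `p`, so a peaked (low-entropy) distribution keeps few cubes and a flat (high-entropy) one keeps many.

```python
import math

def top_p_select(scores, p=0.9):
    """Return indices of the smallest candidate set with cumulative prob >= p."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]  # shift by the max for stability
    total = sum(exps)
    probs = [e / total for e in exps]
    # Visit candidates from most to least probable.
    order = sorted(range(len(scores)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return kept
```

With scores `[10, 0, 0, 0]` a single candidate already covers the threshold, while with four equal scores all four are kept. This is the sense in which the compute budget adapts to input complexity.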

2604.07277 2026-05-11 cs.LG cs.AI

Android Coach: Improve Online Agentic Training Efficiency with Single State Multiple Actions

Guo Gan, Yuxuan Ding, Cong Chen, Yuwei Ren, Yin Huang, Hong Zhou

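Affiliations: Zhejiang University; Qualcomm AI Research

AI summary: This paper proposes Android Coach, a framework that improves online reinforcement learning efficiency with a Single State Multiple Actions paradigm. It improves success rates on AndroidLab and AndroidWorld by 7.5% and 8.3% over UI-TARS-1.5-7B and trains 1.4x more efficiently than PPO and GRPO at matched success rates.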

Abstract

Online reinforcement learning (RL) serves as an effective method for enhancing the capabilities of Android agents. However, guiding agents to learn through online interaction is prohibitively expensive due to the high latency of emulators and the sample inefficiency of existing RL algorithms. We identify a fundamental limitation in current approaches: the Single State Single Action paradigm, which updates the policy with one-to-one state-action pairs from online one-way rollouts without fully exploring each costly emulator state. In this paper, we propose Android Coach, a novel framework that shifts the training paradigm to Single State Multiple Actions, allowing the agent to sample and utilize multiple actions for a single online state. We enable this without additional emulator overhead by learning a critic that estimates action values. To ensure the critic serves as a reliable coach, we integrate a process reward model and introduce a group-wise advantage estimator based on the averaged critic outputs. Extensive experiments demonstrate the effectiveness and efficiency of Android Coach: it achieves 7.5% and 8.3% success rate improvements on AndroidLab and AndroidWorld over UI-TARS-1.5-7B, and attains 1.4x higher training efficiency than Single State Single Action methods PPO and GRPO at matched success rates.
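
A group-wise advantage built from averaged critic outputs can be sketched in a few lines. This is one plausible reading of the abstract, not the paper's estimator: the function name is hypothetical, and any normalization or process-reward term the authors use is omitted. For several actions sampled at the same emulator state, each action's advantage is its critic value minus the group mean.

```python
def groupwise_advantages(critic_values):
    """Advantage of each sampled action relative to the group-average critic value.

    critic_values: estimated action values Q(s, a_i) for actions sampled
    at one state s (no extra emulator step is needed to score them).
    """
    baseline = sum(critic_values) / len(critic_values)
    return [q - baseline for q in critic_values]
```

Because the baseline is the group mean, the advantages sum to zero: actions the critic rates above average are reinforced and those below average are discouraged.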

2604.03420 2026-05-11 cs.CV cs.AI cs.LG

Zero-Shot Quantization via Weight-Space Arithmetic

Daniele Solombrino, Antonio Andrea Gargiulo, Alessandro Zirilli, Luca Zhou, Adrian Robert Minut, Emanuele Rodolà

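Affiliations: Sapienza University of Rome

AI summary: This paper extracts a quantization vector through weight-space arithmetic, improving low-bit quantization accuracy without receiver-side training and showing that quantization robustness is transferable.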

Abstract

We show that robustness to post-training quantization (PTQ) is a transferable direction in weight space. We call this direction the quantization vector: extracted from a donor task by simple weight-space arithmetic, it can be used to patch a receiver model and improve post-PTQ Top-1 accuracy by up to 60 points in a 3-bit setting, without receiver-side quantization-aware training (QAT). Because the method requires no receiver training data, it provides a zero-shot, low-cost alternative to QAT for extremely low-bit deployment. Across four ViT scales and 22 image classification tasks, donor quantization vectors often yield substantial gains even when donor and receiver tasks differ markedly. We further prove rigorously that quantization vectors are well-defined and do not suffer from reparameterization symmetries, and provide a local geometric account of their effect. Together, these results suggest that quantization robustness can be partially isolated, reused, and transferred through simple weight-space algebra.
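
The weight-space arithmetic can be sketched by analogy with task vectors. The abstract does not give the exact definition, so everything here is an assumption: the sketch takes the quantization vector to be the difference between a quantization-robust and a plain checkpoint of the donor, and patches the receiver by adding a scaled copy. The names `extract_vector`, `patch`, and `scale` are hypothetical.

```python
def extract_vector(donor_robust, donor_plain):
    """Assumed form: robust donor weights minus plain donor weights, per tensor."""
    return {
        name: [r - p for r, p in zip(donor_robust[name], donor_plain[name])]
        for name in donor_robust
    }

def patch(receiver, vector, scale=1.0):
    """Add the scaled vector to the receiver's weights; no receiver data is used."""
    return {
        name: [w + scale * v for w, v in zip(receiver[name], vector[name])]
        for name in receiver
    }

# Toy example with one flattened tensor per model.
donor_plain = {"w": [1.0, 2.0]}
donor_robust = {"w": [1.5, 1.75]}
receiver = {"w": [4.0, 4.0]}
vector = extract_vector(donor_robust, donor_plain)
patched = patch(receiver, vector)
```

The patched receiver would then be quantized post-training as usual. Nothing in this sketch runs quantization-aware training on the receiver, which is the zero-shot property the abstract describes.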

2604.02525 2026-05-11 cs.LG

AdaHOP: Fast and Accurate Low-Precision Training via Outlier-Pattern-Aware Rotation

Seonggon Kim, Alireza Khodamoradi, Pranathi Vasireddy, Kristof Denolf, Eunhyeok Park

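Affiliations: Advanced Micro Devices, Inc.; POSTECH

AI summary: This paper proposes AdaHOP. Based on three stable outlier patterns found in weights, activations, and gradients, it combines adaptive Hadamard transforms with outlier extraction to reach BF16-level quality at MXFP4 precision while improving memory compression and training speed.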

Comments 21 pages, 10 figures

Abstract

Hadamard transforms have become a key tool for stabilizing low-precision training, but existing methods apply them uniformly across tensors and computation paths. We show that this one-size-fits-all strategy is inherently limited: Hadamard smoothing reduces quantization error only when its direction is properly aligned with the operand's outlier structure. Through a systematic study of weights, activations, and gradients in LLM training, we identify three stable outlier patterns, Row-wise, Column-wise, and None, and show that each outlier pattern pair in matrix multiplication requires a distinct transform or outlier-handling strategy. We propose AdaHOP, Adaptive Hadamard transform with Outlier-Pattern-aware strategy, which applies Inner Hadamard Transform (IHT) when inner-dimension mixing properly suppresses the operands' outliers, and selectively applies Outlier Extraction (OE) that extracts dominant outlier rows or columns into a high-precision path when it does not. With fused, hardware-aware Triton kernels, AdaHOP enables training from scratch at MXFP4 precision with BF16-level quality, while achieving up to 3.6X memory compression and 1.46X end-to-end training speedup over BF16.
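
Why rotating along the inner dimension is safe can be shown with a small pure-Python sketch. It is illustrative only (no quantization, no Triton kernels), and the helper names are assumptions. With an orthonormal Hadamard matrix H, the product X W equals (X H)(Hᵀ W), and the rotation spreads a single large entry of X across the inner dimension.

```python
def hadamard(n):
    """Orthonormal Hadamard matrix via Sylvester construction (n a power of two)."""
    h = [[1.0]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    scale = n ** -0.5
    return [[v * scale for v in row] for row in h]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

h = hadamard(4)
x = [[8.0, 0.0, 0.0, 0.0]]                      # one outlier along the inner dimension
w = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]

x_rot = matmul(x, h)                            # outlier is spread: [4, 4, 4, 4]
w_rot = matmul(transpose(h), w)
y_rot = matmul(x_rot, w_rot)                    # same result as matmul(x, w)
```

The rotated operand has a smaller dynamic range (peak 4 instead of 8), which is what makes it friendlier to a low-precision format. The abstract's point is that this only helps when the outlier structure lines up with the mixing direction, hence the separate outlier-extraction path.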

2604.01878 2026-05-11 cs.LG cs.AI

ASPECT: Node-Level Adaptive Spectral Fusion for Graph Contrastive Learning

Zhuolong Li, Boxue Yang, Haopeng Chen

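Affiliations: School of Computer Science, Shanghai Jiao Tong University

AI summary: ASPECT adaptively fuses low- and high-frequency views at the node level to improve representation quality in graph contrastive learning and to avoid the regret that graph-level fusion incurs on mixed graphs. The ASPECT-S extension adds stability awareness.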

Comments 28 pages, 3 figures. Revised version with updated method framing, improved exposition, and additional experiments

Abstract

Spectral graph contrastive learning often constructs low- and high-frequency views to capture complementary graph signals, but these views are commonly combined by graph-level or node-agnostic fusion rules. We show that graph-level fusion can incur irreducible regret on mixed graphs with separated node-wise spectral preferences. Motivated by this result, we propose ASPECT, a spectral graph contrastive learning method that adaptively fuses low- and high-frequency views at the node level. ASPECT learns a node-wise spectral policy and regularizes it using channel-wise contrastive evidence, enabling different nodes to use different spectral mixtures. We further introduce ASPECT-S, an optional stability-aware extension that uses generated graph-structure and feature perturbations to obtain empirical channel-wise sensitivity estimates, together with a Rayleigh-based spectral search bias for producing informative perturbations. Experiments on homophilic and heterophilic benchmarks show that ASPECT improves representation quality over competitive spectral and graph contrastive baselines, while ASPECT-S further improves performance under joint graph-structure and feature perturbations.
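
The difference between graph-level and node-level fusion can be made concrete with a small sketch. It is not ASPECT's architecture: the per-node `policy_logits` stand in for whatever the learned spectral policy outputs, and the convex-combination form is an assumption. Each node gets its own mixing weight instead of one weight shared by the whole graph.

```python
import math

def stable_sigmoid(x):
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def fuse_node_level(low_view, high_view, policy_logits):
    """Per-node convex mix of low- and high-frequency embeddings.

    low_view, high_view: one embedding (list of floats) per node.
    policy_logits: one scalar per node; higher means more low-frequency weight.
    """
    fused = []
    for z_low, z_high, logit in zip(low_view, high_view, policy_logits):
        alpha = stable_sigmoid(logit)
        fused.append([alpha * a + (1.0 - alpha) * b for a, b in zip(z_low, z_high)])
    return fused
```

Graph-level fusion corresponds to passing the same logit for every node. On a graph whose nodes disagree about which frequency band is useful, no single value can serve all of them, which is the intuition behind the regret result.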

2604.01619 2026-05-11 cs.CV cs.AI

Automatic Image-Level Morphological Trait Annotation for Organismal Images

Vardaan Pahuja, Samuel Stevens, Alyson East, Sydne Record, Yu Su

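Affiliations: The Ohio State University; University of Maine

AI summary: This paper proposes an image-level morphological trait annotation method based on sparse autoencoders and uses it to build Bioscan-Traits, a dataset of 80K trait annotations over 19K insect images, offering a scalable source of biologically meaningful supervision for large-scale ecological studies.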

Comments ICLR 2026

Journal ref ICLR 2026

Abstract

Morphological traits are physical characteristics of biological organisms that provide vital clues on how organisms interact with their environment. Yet extracting these traits remains a slow, expert-driven process, limiting their use in large-scale ecological studies. A major bottleneck is the absence of high-quality datasets linking biological images to trait-level annotations. In this work, we demonstrate that sparse autoencoders trained on foundation-model features yield monosemantic, spatially grounded neurons that consistently activate on meaningful morphological parts. Leveraging this property, we introduce a trait annotation pipeline that localizes salient regions and uses vision-language prompting to generate interpretable trait descriptions. Using this approach, we construct Bioscan-Traits, a dataset of 80K trait annotations spanning 19K insect images from BIOSCAN-5M. Human evaluation confirms the biological plausibility of the generated morphological descriptions. We assess design sensitivity through a comprehensive ablation study, systematically varying key design choices and measuring their impact on the quality of the resulting trait descriptions. By annotating traits with a modular pipeline rather than prohibitively expensive manual efforts, we offer a scalable way to inject biologically meaningful supervision into foundation models, enable large-scale morphological analyses, and bridge the gap between ecological relevance and machine-learning practicality.

2603.28113 2026-05-11 cs.LG

Demystifying Lipschitz verification: positive matrices, negative results

Simon Kuang, Yuezhu Xu, S. Sivaranjani, Xinfan Lin

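Affiliations: Department of Mechanical and Aerospace Engineering; University of California, Davis; Edwardson School of Industrial Engineering; Purdue University

AI summary: This paper examines why estimating the global Lipschitz constant of a neural network is hard. It traces the difficulty to the NP-hardness of reachability, shows the limits of SDP-based methods through explicit constructions, and argues that optimizing the trivial bound and introducing trigonometric layers can make that bound tight.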

Comments reduced scope, new theorems on NP-hardness

Abstract

The global Lipschitz constant of a neural network is related to robustness and generalization, yet unlike in many classical models, it is not plainly legible from the parameters. This has motivated sophisticated verification algorithms, especially semidefinite programming (SDP) based on incremental quadratic constraints on the activation functions, to improve on the fast but often loose product of layerwise Lipschitz constants (the trivial bound). We ask why Lipschitz verification is a problem in the first place. Our answer is that the difficulty is structural: estimating a network's Lipschitz constant requires knowing which hidden states are reachable, and reachability is NP-hard. If P ≠ NP, then reachability is a barrier to any polynomial-time algorithm. Through explicit constructions, we show that this blindness can force SDP-based bounds to inherit the same qualitative failures as the trivial bound, including but not limited to polynomial per-layer conservatism. We show that the difficulties of NP-hard questions are not isolated to worst-case computational reductions, but actually afflict every instance of the verification problem. Thus SDP is not sufficient for Lipschitz verification. We also argue that it is not necessary: several apparent failures of the trivial bound arise from removable parameterization pathologies, and can be mitigated by optimizing or regularizing the trivial bound itself. We demonstrate this claim via a "spherical cow" linear model and numerical proofs of concept. While the main contribution is theoretical and negative, we finally motivate a novel form of trigonometric layers that do not need biases for universal approximation. Combined with trivial bound regularization, they make the trivial bound provably and practically tight.
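
The trivial bound mentioned above is the product of the layers' spectral norms, which upper-bounds the ℓ2 Lipschitz constant when the activations are 1-Lipschitz. The sketch below computes it in pure Python with power iteration and contrasts it with the exact constant of a two-layer linear model. The example matrices and function names are illustrative assumptions, not taken from the paper.

```python
import math

def spectral_norm(w, iters=200):
    """Largest singular value of w (list of rows), estimated by power iteration."""
    rows, cols = len(w), len(w[0])
    v = [1.0] * cols  # fixed start vector; a random start is the safer general choice
    for _ in range(iters):
        u = [sum(w[i][j] * v[j] for j in range(cols)) for i in range(rows)]
        v = [sum(w[i][j] * u[i] for i in range(rows)) for j in range(cols)]
        norm = math.sqrt(sum(x * x for x in v))
        if norm == 0.0:
            return 0.0
        v = [x / norm for x in v]
    u = [sum(w[i][j] * v[j] for j in range(cols)) for i in range(rows)]
    return math.sqrt(sum(x * x for x in u))

def trivial_bound(weights):
    """Product of layerwise spectral norms."""
    bound = 1.0
    for w in weights:
        bound *= spectral_norm(w)
    return bound

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

w1 = [[3.0, 0.0], [0.0, 1.0]]
w2 = [[1.0, 0.0], [0.0, 3.0]]
loose = trivial_bound([w1, w2])          # about 9
exact = spectral_norm(matmul(w2, w1))    # about 3 for the composed linear map
```

Here the layers stretch different directions, so the product of norms overestimates the true constant by a factor of three. This is the kind of looseness the abstract attributes partly to parameterization rather than to the bound itself.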

2603.21824 2026-05-11 cs.CV cs.AI

SteelDefectX: A Multi-Form Vision-Language Dataset and Benchmark for Steel Surface Defect Analysis

Shuxian Zhao, Jie Gui, Baosheng Yu, Dacheng Tao

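Affiliations: Southeast University; Purple Mountain Laboratories; Nanyang Technological University

AI summary: This paper introduces SteelDefectX, a dataset of 7,778 images across 25 defect categories with multi-form textual annotations. Its benchmark covers vision-language classification, segmentation, and cross-dataset transfer, and the results highlight the importance of textual design in industrial vision-language learning.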

Abstract

Steel surface defect analysis is critical for industrial quality control, yet existing benchmarks rely primarily on label-only annotations, limiting fine-grained semantic understanding and systematic evaluation of vision-language models. To address this gap, we introduce SteelDefectX, a vision-language dataset with multi-form textual annotations for steel surface defect analysis, comprising 7,778 images across 25 defect categories. At the class level, the dataset provides defect names, representative visual attributes, and industrial causes. At the sample level, each image is annotated with three forms of textual representations: (1) free-form natural language descriptions, (2) structured attribute annotations, and (3) template-based sentences. These annotations provide flexible textual supervision with varying levels of expressiveness and controllability. We further establish a comprehensive benchmark covering vision-language classification, segmentation, and cross-dataset transfer, along with additional evaluations such as retrieval and text-guided localization. Experimental results reveal a trade-off between structure and flexibility in textual representations. Structured attributes provide more stable semantic alignment, while natural language descriptions improve transferability and fine-grained spatial grounding. These findings highlight the critical role of textual design in industrial vision-language learning. SteelDefectX provides a new benchmark for studying semantic alignment and generalization in industrial vision-language learning. The code and dataset are available at https://github.com/Zhaosxian/SteelDefectX.

2603.19966 2026-05-11 cs.RO

GustPilot: A Hierarchical DRL-INDI Framework for Wind-Resilient Quadrotor Navigation

Amir Atef Habel, Roohan Ahmed Khan, Fawad Mehboob, Clement Fortin, Dzmitry Tsetserukou

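Affiliations: Intelligent Space Robotics Laboratory; Center for Digital Engineering; Skolkovo Institute of Science and Technology

AI summary: This paper proposes GustPilot, a framework that pairs deep reinforcement learning with an INDI controller, using wind-aware planning and fast disturbance rejection to make quadrotor navigation more reliable under wind disturbances.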

Comments 8 pages, 5 figures

Abstract

Wind disturbances remain a key barrier to reliable autonomous navigation for lightweight quadrotors, where rapidly varying airflow can destabilize both planning and tracking. This paper introduces GustPilot, a hierarchical wind-resilient navigation stack in which a deep reinforcement learning (DRL) policy generates inertial-frame velocity references for gate traversal. At the same time, a geometric Incremental Nonlinear Dynamic Inversion (INDI) controller provides low-level tracking with fast residual disturbance rejection. The INDI layer achieves this by providing incremental feedback on both specific linear acceleration and angular acceleration rate, using onboard sensor measurements to reject wind disturbances rapidly. Robustness is obtained through a two-level strategy: wind-aware planning learned via fan-jet domain randomization during training, and rapid execution-time disturbance rejection by the INDI tracking controller. We evaluate GustPilot in real flights on a 50 g quadcopter platform against a DRL-PID baseline across four scenarios ranging from no-wind to fully dynamic conditions with a moving gate and a moving disturbance source. Despite being trained only in a minimal single-gate and single-fan setup, the policy generalizes to significantly more complex environments (up to six gates and four fans) without retraining. Across 80 experiments, DRL-INDI achieves an average Overall Success Rate (OSR) of 94.7% versus 55.0% for DRL-PID, reduces tracking RMSE by up to 50%, and sustains speeds up to 1.34 m/s under wind disturbances up to 3.5 m/s. These results demonstrate that combining DRL-based velocity planning with structured INDI disturbance rejection provides a practical and generalizable approach to wind-resilient autonomous flight navigation.

2603.19254 2026-05-11 cs.CL

FinReasoning: A Hierarchical Benchmark for Reliable Financial Research Reporting

Yiyun Zhu, Yidong Jiang, Ziwen Xu, Yinsheng Yao, Dawei Cheng, Jinru Ding, Jie Xu

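Affiliations: School of Computer Science and Technology, Tongji University; Shanghai Artificial Intelligence Laboratory

AI summary: This paper introduces FinReasoning, a hierarchical benchmark that evaluates core financial research capabilities, separates model performance at foundational stages such as auditing and correction from insight generation, and reveals capability stratification across model types.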

Abstract

Large language models (LLMs) are increasingly deployed in financial research workflows, where their role is evolving from single-model assistance for human analysts toward autonomous collaboration among multiple agents. Yet real-world deployments still expose factual errors, numerical inconsistencies, and shallow analysis, which can distort assessments of corporate fundamentals and trigger severe economic losses. While existing benchmarks have begun to evaluate such failures, they score all aspects of the generated analysis in one pass, failing to distinguish whether a model fails at foundational stages like auditing and correction, or underperforms at generating research-grade insights. Consequently, this single-pass scoring obscures capability bottlenecks and the specialized strengths essential for multi-agent role assignment. To address these gaps, we introduce FinReasoning, a hierarchical benchmark that decomposes the core capabilities of financial research into semantic consistency, data alignment, and deep insight. We further propose a fine-grained evaluation framework that strengthens hallucination-correction assessment and incorporates a 12-indicator rubric for core analytical skills. FinReasoning reveals clear capability stratification across model types. Closed-source models (like Doubao-Seed-1.8) perform strongly overall and are better suited for core reasoning agents in multi-agent financial systems; open-source general models (like Qwen3-235B) show clear capability divergence and consistently underperform on Semantic Consistency, making them less suited for quality-sensitive generation tasks; financial-domain models (like Fin-R1) generate moderate insights but lack foundational auditing skills. Our work has already been deployed in pilot tests across several real-world scenarios. The resource is available at https://github.com/TongjiFinLab/FinReasoning.

2603.18856 2026-05-11 cs.CV cs.AI

Motion-o: Trajectory-Grounded Video Reasoning

Bishoy Galoaa, Shayda Moezzi, Xiangyu Bai, Sarah Ostadabbas

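Affiliations: Northeastern University

AI summary: Motion-o introduces Motion Chain of Thought so that video reasoning models represent object trajectories explicitly and verifiably, making dynamic and trajectory-dependent reasoning easier to supervise.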

Abstract

Recent video reasoning models increasingly produce spatio-temporal evidence chains that localize objects at specific timestamps. While these traces improve interpretability by grounding where and when evidence appears, they often leave the motion connecting observations, the how, implicit. This makes dynamic and trajectory-dependent claims difficult to supervise, verify, or penalize when unsupported by the video. We formalize this missing component as Spatial-Temporal-Trajectory (STT) reasoning and introduce Motion-o, a motion-centric extension to vision-language models (VLMs) that makes trajectories explicit and verifiable. Motion-o augments evidence chains with Motion Chain of Thought (MCoT), a structured pathway that represents object motion through a discrete <motion/> tag summarizing direction, speed, and scale change. To supervise MCoT, we densify sparse spatio-temporal annotations into object tracks and derive motion descriptors from centroid displacement and box-area change. We then train with complementary rewards for trajectory consistency and visual grounding, including a perturbation-based signal that penalizes motion descriptions that remain unchanged when temporal evidence is removed. Across multiple video understanding benchmarks, Motion-o consistently improves trajectory-faithful reasoning without architectural modifications. These results suggest that an explicit motion interface can complement existing VLM pipelines by converting implicit dynamics into verifiable evidence. Code is available at https://github.com/ostadabbas/Motion-o.
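
Deriving motion descriptors from centroid displacement and box-area change can be sketched as below. This is a hedged illustration, not the Motion-o pipeline: the box format `(x1, y1, x2, y2)`, the function name, and the returned fields are assumptions, and the abstract does not specify how the values are discretized into the `<motion/>` tag.

```python
import math

def motion_descriptor(box_a, box_b, dt):
    """Continuous motion features between two boxes of the same object.

    box_a, box_b: (x1, y1, x2, y2) at the earlier and later timestamp.
    dt: elapsed time between them, in seconds (must be positive).
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Centroid displacement gives direction and speed.
    dx = (bx1 + bx2) / 2 - (ax1 + ax2) / 2
    dy = (by1 + by2) / 2 - (ay1 + ay2) / 2
    # Box-area ratio is a proxy for scale change (approach or recede).
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    return {
        "direction_deg": math.degrees(math.atan2(dy, dx)),  # image coords: y grows downward
        "speed": math.hypot(dx, dy) / dt,
        "scale_change": area_b / area_a,
    }
```

For boxes `(0, 0, 2, 2)` and `(2, 3, 6, 7)` one second apart, the centroid moves by (3, 4), so the speed is 5 pixels per second, and the area grows fourfold. A tag generator would then bin these values into coarse categories.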

2603.18636 2026-05-11 cs.CV

Attention Sparsity is Input-Stable: Training-Free Sparse Attention for Video Generation via Offline Sparsity Profiling and Online QK Co-Clustering

Jiayi Luo, Jiayu Chen, Jiankun Wang, Cong Wang, Hanxin Zhu, Qingyun Sun, Chen Gao, Zhibo Chen, Jianxin Li

Affiliations * SKLCCSE, School of Computer Science and Engineering, Beihang University; Beihang University; School of Computer Science, Peking University; State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences; School of Information Science and Technology, University of Science and Technology of China; BNRist, Tsinghua University; Zhongguancun Academy

AI summary This paper proposes SVOO, a training-free sparse attention framework for video generation that combines offline layer-wise sparsity profiling with online bidirectional co-clustering. It addresses two gaps in prior methods, layer heterogeneity and query-key coupling, to improve the quality-speedup trade-off.

Comments Accepted by ICML 2026

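Details
AI abstract (translated from Chinese)

Diffusion Transformers (DiTs) generate high-quality video, but dense 3D attention makes inference expensive, which motivates sparse attention for efficiency. Existing training-free sparse attention methods for video generation still have two unresolved limitations: they ignore layer heterogeneity when pruning attention, and they ignore query-key coupling when partitioning blocks. Both hinder a better quality-speedup trade-off. This work finds that attention sparsity is an intrinsic layer-wise property that varies only slightly across inputs. Building on this, we propose SVOO, which achieves fast video generation through offline layer-wise sparsity profiling and online bidirectional co-clustering. SVOO follows a two-stage paradigm: (i) offline layer-wise sensitivity profiling to derive intrinsic per-layer pruning levels, and (ii) online block-wise sparse attention via a bidirectional co-clustering algorithm. Extensive experiments on seven widely used video generation models show that SVOO outperforms existing methods on the quality-speedup trade-off, reaching up to 1.93x speedup while maintaining a PSNR of up to 29 dB on Wan2.1. Code is available at https://github.com/Mutual-Luo/SVOO.

English abstract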

Diffusion Transformers (DiTs) achieve strong video generation quality but suffer from high inference cost due to dense 3D attention, motivating sparse attention techniques for improving efficiency. However, existing training-free sparse attention methods for video generation still face two unresolved limitations: ignoring layer heterogeneity in attention pruning and ignoring query-key coupling in block partitioning, which hinder a better quality-speedup trade-off. In this work, we uncover a critical insight: attention sparsity is an intrinsic layer-wise property, with only minor variation across different inputs. Motivated by this observation, we propose SVOO, a training-free sparse attention framework for fast video generation via offline layer-wise sparsity profiling and online bidirectional co-clustering. Specifically, SVOO adopts a two-stage paradigm: (i) offline layer-wise sensitivity profiling to derive intrinsic per-layer pruning levels, and (ii) online block-wise sparse attention via a bidirectional co-clustering algorithm. Extensive experiments on seven widely used video generation models demonstrate that SVOO achieves a superior quality-speedup trade-off over state-of-the-art methods, delivering up to 1.93x speedup while maintaining a PSNR of up to 29 dB on Wan2.1. Code is available at: https://github.com/Mutual-Luo/SVOO.

2603.16876 2026-05-11 cs.CV cs.AI cs.LG

Multi-Modal Multi-Agent Reinforcement Learning for Radiology Report Generation

Kaito Baba, Risa Kishikawa, Satoshi Kodera

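Affiliations * Department of Cardiovascular Medicine

AI summary This paper proposes MARL-Rad, a multi-modal multi-agent reinforcement learning framework for radiology report generation. It jointly optimizes region-specific agents and a global integrating agent to improve the clinical efficacy, accuracy, and consistency of generated reports.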

Comments 23 pages, 4 figures

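Details
AI abstract (translated from Chinese)

We propose MARL-Rad, a multi-modal multi-agent reinforcement learning framework for radiology report generation that trains the entire agentic system on policy within its deployed radiology workflow. MARL-Rad addresses the limitation of post-hoc agentization, in which fixed LLMs are arranged into hand-designed agentic workflows without being optimized for their assigned roles. The framework decomposes chest X-ray interpretation into region-specific agents and a global integrating agent, and jointly optimizes them with clinically verifiable rewards. Experiments on the MIMIC-CXR and IU X-ray datasets show that MARL-Rad consistently improves clinical efficacy metrics such as RadGraph, CheXbert, and GREEN scores, reaching state-of-the-art clinical efficacy. Further analysis shows that it improves laterality consistency and produces more accurate and detailed reports. A blinded clinician evaluation further suggests that its reports are clinically comparable to ground-truth reports.

English abstract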

We propose MARL-Rad, a multi-modal multi-agent reinforcement learning framework for radiology report generation that trains the entire agentic system on policy within its deployed radiology workflow. MARL-Rad addresses the limitation of post-hoc agentization, where fixed LLMs are organized into hand-designed agentic workflows without being optimized for their assigned roles. Our framework decomposes chest X-ray interpretation into region-specific agents and a global integrating agent, and jointly optimizes them using clinically verifiable rewards. Experiments on the MIMIC-CXR and IU X-ray datasets show that MARL-Rad consistently improves clinical efficacy metrics such as RadGraph, CheXbert, and GREEN scores, achieving state-of-the-art clinical efficacy performance. Further analyses show that MARL-Rad improves laterality consistency and produces more accurate and detailed reports. A blinded clinician evaluation further suggests that MARL-Rad produces reports clinically comparable to ground-truth reports.

2603.14787 2026-05-11 cs.RO

Dynamic Properties and Motion Reproducibility of a Compact Pneumatically Actuated Humanoid Upper Body for Data-Driven Control

Hiroshi Atsuta, Hisashi Ishihara, Minoru Asada

Affiliations * Symbiotic Intelligent Systems Research Center, Institute for Open and Transdisciplinary Research Initiatives, The University of Osaka; International Professional University of Technology in Osaka, Umeda, Kita-ku, Osaka, Japan; Chubu University Academy of Emerging Sciences, Kasugai, Aichi, Japan

AI summary This paper studies the dynamic properties and motion reproducibility of a compact 13-DOF pneumatically actuated humanoid upper body. It implements a multilayer-perceptron-based data-driven controller for a 4-DOF arm subsystem, and the results point to the potential of data-driven control for high-DOF pneumatic robots.

Comments 25 pages, 21 figures. Submitted to Advanced Robotics

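Details
AI abstract (translated from Chinese)

Pneumatically actuated anthropomorphic robots with high degrees of freedom (DOF) have great potential for physical human-robot interaction, but precise control of pneumatic actuators is difficult because of their inherent nonlinearities. This paper presents a compact 13-DOF upper-body robot. We first investigate its key dynamic properties, such as actuation time delays, and confirm that the system behaves in a highly reproducible way. Leveraging this reproducibility, we implement a preliminary data-driven controller for a 4-DOF arm subsystem, based on a multilayer perceptron with explicit time delay compensation. The network is trained on random movement data and generates pressure commands for tracking arbitrary trajectories. Comparative evaluation against a traditional PID controller shows better trajectory tracking, highlighting the potential of data-driven approaches for controlling complex, high-DOF pneumatic robots.

English abstract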

Pneumatically-actuated anthropomorphic robots with high degrees of freedom (DOF) offer significant potential for physical human-robot interaction. However, precise control of pneumatic actuators is challenging due to their inherent nonlinearities. This paper presents the development of a compact 13-DOF upper-body humanoid robot. To assess the feasibility of an effective controller, we first investigate its key dynamic properties, such as actuation time delays, and confirm that the system exhibits highly reproducible behavior. Leveraging this reproducibility, we implement a preliminary data-driven controller for a 4-DOF arm subsystem based on a multilayer perceptron with explicit time delay compensation. The network was trained on random movement data to generate pressure commands for tracking arbitrary trajectories. Comparative evaluations with a traditional PID controller demonstrate superior trajectory tracking performance, highlighting the potential of data-driven approaches for controlling complex, high-DOF pneumatic robots.

2603.14186 2026-05-11 cs.CV

Setting-Matched and Semantics-Scaled Benchmarking of One-Step Generative Models Against Multistep Diffusion and Flow Models

Advaith Ravishankar, Serena Liu, Mingyang Wang, Todd Zhou, Jeffrey Zhou, Arnav Sharma, Ziling Hu, Léopold Das, Abdulaziz Sobirov, Faizaan Siddique, Freddy Yu, Seungjoo Baek, Yan Luo, Mengyu Wang

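Affiliations * Harvard AI and Robotics Lab

AI summary This paper benchmarks eight models under a class-conditional protocol on ImageNet validation, ImageNetV2, and the new reLAIONet dataset. It finds that FID-focused development and CFG selection can be misleading in few-step regimes, and introduces csFID, psFID, csIS, and psIS as diagnostics for semantically aligned image generation.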

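Details
AI abstract (translated from Chinese)

State-of-the-art text-to-image models produce high-quality images, but inference is expensive because generation requires several sequential ODE or denoising steps. Native one-step models cut this cost by mapping noise to an image in a single step, yet fair comparison with multi-step systems is difficult: studies use mismatched sampling steps and different classifier-free guidance (CFG) settings, and CFG can move FID, Inception Score, and CLIP-based alignment in opposing directions. It is also unclear how well one-step models scale to multi-step inference, and standardized out-of-distribution evaluation beyond ImageNet is limited. We therefore benchmark eight models on ImageNet validation, ImageNetV2, and reLAIONet, a new dataset aligned to ImageNet label IDs. The models span one-step flows (MeanFlow, Improved MeanFlow, SoFlow), multi-step baselines (RAE, Scale-RAE), and established systems (SiT, Stable Diffusion 3.5, FLUX.1). Using FID, Inception Score, CLIP Score, and Pick Score, we find that FID-focused model development and CFG selection can be misleading in few-step regimes, where guidance changes can improve FID while degrading text-image alignment and human preference signals, and with them visual quality. To make these trade-offs explicit, we introduce CLIP-scaled and PickScore-scaled variants of FID (csFID, psFID) and Inception Score (csIS, psIS) as diagnostics for semantically aligned image generation.

English abstract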

State-of-the-art text-to-image models produce high-quality images, but inference remains expensive as generation requires several sequential ODE or denoising steps. Native one-step models aim to reduce this cost by mapping noise to an image in a single step, yet fair comparisons to multi-step systems are difficult because studies use mismatched sampling steps and different classifier-free guidance (CFG) settings, where CFG can shift FID, Inception Score, and CLIP-based alignment in opposing directions. It is also unclear how well one-step models scale to multi-step inference, and there is limited standardized out-of-distribution evaluation for label-ID-conditioned generators beyond ImageNet. To address this, we benchmark eight models spanning one-step flows (MeanFlow, Improved MeanFlow, SoFlow), multi-step baselines (RAE, Scale-RAE), and established systems (SiT, Stable Diffusion 3.5, FLUX.1) under a class-conditional protocol on ImageNet validation, ImageNetV2, and reLAIONet, our new proofread out-of-distribution dataset aligned to ImageNet label IDs. Using FID, Inception Score, CLIP Score, and Pick Score, we show that FID-focused model development and CFG selection can be misleading in few-step regimes, where guidance changes can improve FID while degrading text-image alignment and human preference signals, worsening visual quality. To make these tradeoffs explicit, we introduce CLIP-scaled and PickScore-scaled variants of FID (csFID, psFID) and Inception Score (csIS, psIS) to serve as a diagnostic for semantically aligned image generation.

2603.08256 2026-05-11 cs.CL

NCL-UoR at SemEval-2026 Task 5: Embedding-Based Methods, Fine-Tuning, and LLMs for Word Sense Plausibility Rating

Tong Wu, Thanet Markchom, Huizhi Liang

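Affiliations * Independent Researcher; Department of Computer Science, University of Reading; School of Computing, Newcastle University

AI summary This paper compares three approaches to word sense plausibility rating: embedding-based methods, transformer fine-tuning, and LLM prompting. Structured prompting outperforms both fine-tuning and embedding-based methods, and prompt design matters more than model scale.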

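Details
AI abstract (translated from Chinese)

Word sense plausibility rating asks a system to predict, on a 1-5 scale, how plausible humans find a given word sense in short narrative stories that contain ambiguous homonyms. This paper systematically compares three approaches: (1) embedding-based methods that pair sentence embeddings with standard regressors; (2) transformer fine-tuning with parameter-efficient adaptation; and (3) large language model (LLM) prompting with structured reasoning and explicit decision rules. The best-performing system uses a structured prompting strategy that decomposes the evaluation into narrative components (precontext, target sentence, ending) and applies explicit decision rules to calibrate ratings. The analysis shows that structured prompting with decision rules outperforms both fine-tuned models and embedding-based methods, and that prompt design matters more than model scale for this task.

English abstract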

Word sense plausibility rating requires predicting the human-perceived plausibility of a given word sense on a 1-5 scale in the context of short narrative stories containing ambiguous homonyms. This paper systematically compares three approaches: (1) embedding-based methods pairing sentence embeddings with standard regressors, (2) transformer fine-tuning with parameter-efficient adaptation, and (3) large language model (LLM) prompting with structured reasoning and explicit decision rules. The best-performing system employs a structured prompting strategy that decomposes evaluation into narrative components (precontext, target sentence, ending) and applies explicit decision rules for rating calibration. The analysis reveals that structured prompting with decision rules outperforms both fine-tuned models and embedding-based approaches, and that prompt design matters more than model scale for this task.

2603.07475 2026-05-11 cs.CL cs.LG

A Comparative analysis of Layer-wise Representational Capacity in AR and Diffusion LLMs

Raghavv Goel, Risheek Garrepalli, Sudhanshu Agrawal, Chris Lott, Mingu Lee, Fatih Porikli

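Affiliations * Qualcomm AI Research

AI summary This paper compares layer-wise representations in AR and diffusion LLMs. Diffusion objectives produce more global representations, AR objectives produce tightly coupled, locally structured ones, and the depth redundancy induced by diffusion training enables principled compression.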

Comments v3: improving writing with all v2 changes

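Details
AI abstract (translated from Chinese)

Autoregressive (AR) language models build representations incrementally through left-to-right prediction, whereas diffusion language models (dLLMs) are trained by full-sequence denoising. Recent dLLMs match AR performance, but it remains unclear whether diffusion objectives fundamentally change internal representations. We present the first layer- and token-wise representational analysis of native dLLMs (LLaDA), native AR models (Qwen2.5), and AR-initialized dLLMs (Dream-7B), using cosine similarity across layers and tokens together with static inference-time layer skipping as probes of redundancy. Diffusion objectives produce more global representations, with substantial early-layer redundancy and reduced recency bias, while AR objectives produce tightly coupled, locally structured representations. AR-initialized dLLMs retain AR-like dynamics even after diffusion training, which reveals a persistent initialization bias. Exploiting this redundancy, native dLLMs tolerate up to an 18.75% FLOPs reduction while retaining over 90% of their performance on math-reasoning and coding benchmarks, whereas AR models collapse under the same skipping. This indicates that the diffusion objective, rather than architecture alone, induces the depth redundancy that enables principled compression.

English abstract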

Autoregressive (AR) language models build representations incrementally via left-to-right prediction, while diffusion language models (dLLMs) are trained through full-sequence denoising. Although recent dLLMs match AR performance, whether diffusion objectives fundamentally reshape internal representations remains unclear. We perform the first layer- and token-wise representational analysis comparing native dLLMs (LLaDA), native AR models (Qwen2.5), and AR-initialized dLLMs (Dream-7B), using cosine similarity across layers and tokens alongside static inference-time layer-skipping as an analytical probe of redundancy. We find that diffusion objectives produce more global representations with substantial early-layer redundancy and reduced recency bias, while AR objectives yield tightly coupled, locally structured representations. AR-initialized dLLMs retain AR-like dynamics despite diffusion training, revealing persistent initialization bias. Leveraging this redundancy, native dLLMs absorb up to 18.75% FLOPs reduction while retaining over 90% performance on math-reasoning and coding benchmarks, whereas AR models collapse under identical skipping, revealing that diffusion objectives, rather than architecture alone, induce depth redundancy that enables principled compression.
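The probe described above, cosine similarity between layers followed by static layer skipping, can be sketched in a few lines. This is an illustrative toy, not the authors' analysis code: the hidden states are made-up two-dimensional vectors, the 0.99 threshold is a hypothetical choice, and the FLOPs estimate assumes every layer costs the same. Under that equal-cost assumption, skipping 6 of 32 layers happens to give 18.75%; the abstract does not state that this is the configuration used.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def adjacent_layer_similarity(hidden_states):
    """hidden_states[l] is one token's hidden vector after layer l."""
    return [
        cosine(hidden_states[l], hidden_states[l + 1])
        for l in range(len(hidden_states) - 1)
    ]

def skip_candidates(similarities, threshold=0.99):
    """Layers whose output is nearly parallel to their input.

    The threshold is a hypothetical value for this sketch.
    """
    return [l + 1 for l, s in enumerate(similarities) if s >= threshold]

def flops_saved_fraction(num_skipped, num_layers):
    """Fraction of layer FLOPs removed, assuming equal cost per layer."""
    return num_skipped / num_layers

# Toy hidden states: layer 1 barely changes the vector, layer 2 rotates it.
states = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0]]
sims = adjacent_layer_similarity(states)
print(skip_candidates(sims))
print(flops_saved_fraction(6, 32))
```

High similarity between a layer's input and output is only a proxy for redundancy; the abstract pairs it with actual layer skipping at inference time to check that performance is retained.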

2603.05687 2026-05-11 cs.RO

Contact-Grounded Policy: Dexterous Visuotactile Policy with Generative Contact Grounding

Zhengtong Xu, Yeping Wang, Ben Abbatematteo, Jom Preechayasomboon, Sonny Chan, Nick Colonnese, Amirhossein H. Memar

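Affiliations * Purdue University; Meta Reality Labs Research; University of Wisconsin–Madison

AI summary This paper proposes Contact-Grounded Policy (CGP), a dexterous visuotactile policy that grounds contact by predicting coupled trajectories of robot state and tactile feedback. It outperforms visuomotor and visuotactile diffusion-policy baselines.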

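Details
AI abstract (translated from Chinese)

In contact-rich dexterous manipulation with multi-finger hands, task success depends on continuously evolving multi-point contacts that are highly sensitive to object geometry, frictional transitions, and slip. Tactile-informed manipulation policies have recently shown promise. However, most of them treat tactile signals as additional observations rather than modeling the contact state or how their action outputs interact with low-level controller dynamics. This paper presents Contact-Grounded Policy (CGP), a visuotactile policy that grounds multi-point contacts by predicting coupled trajectories of actual robot state and tactile feedback, and uses a learned contact-consistency mapping to convert these predictions into executable target robot states for a compliance controller. CGP has two components: (i) a conditional diffusion model that predicts future robot state and tactile feedback in a compressed latent space; and (ii) a learned contact-consistency mapping that converts predicted robot state-tactile pairs into executable targets for the compliance controller, so that it can realize the intended contacts. We evaluate CGP on a physical four-finger Allegro V5 hand with Digit360 fingertip tactile sensors and on a simulated five-finger Tesollo DG-5F hand with dense whole-hand tactile arrays. Across dexterous tasks including in-hand manipulation, delicate grasping, and tool use, CGP outperforms visuomotor and visuotactile diffusion-policy baselines.

English abstract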

Contact-rich dexterous manipulation with multi-finger hands remains an open challenge in robotics because task success depends on multi-point contacts that continuously evolve and are highly sensitive to object geometry, frictional transitions, and slip. Recently, tactile-informed manipulation policies have shown promise. However, most use tactile signals as additional observations rather than modeling contact state or how their action outputs interact with low-level controller dynamics. We present Contact-Grounded Policy (CGP), a visuotactile policy that grounds multi-point contacts by predicting coupled trajectories of actual robot state and tactile feedback, and using a learned contact-consistency mapping to convert these predictions into executable target robot states for a compliance controller. CGP consists of two components: (i) a conditional diffusion model that forecasts future robot state and tactile feedback in a compressed latent space, and (ii) a learned contact-consistency mapping that converts the predicted robot state-tactile pair into executable targets for a compliance controller, enabling it to realize the intended contacts. We evaluate CGP using a physical four-finger Allegro V5 hand with Digit360 fingertip tactile sensors, and a simulated five-finger Tesollo DG-5F hand with dense whole-hand tactile arrays. Across a range of dexterous tasks including in-hand manipulation, delicate grasping, and tool use, CGP outperforms visuomotor and visuotactile diffusion-policy baselines.

2603.05117 2026-05-11 cs.RO

SeedPolicy: Horizon Scaling via Self-Evolving Diffusion Policy for Robot Manipulation

Youqiang Gui, Yuxuan Zhou, Shen Cheng, Xinyang Yuan, Haoqiang Fan, Peng Cheng, Shuaicheng Liu

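Affiliations * Sichuan University; Dexmal Inc.; Independent Researcher; UESTC

AI summary This paper proposes SEGA, a gated-attention temporal module, and integrates it into Diffusion Policy to obtain SeedPolicy for long-horizon manipulation. On the RoboTwin 2.0 benchmark, SeedPolicy outperforms DP and other imitation learning baselines with moderate overhead.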

Comments 22 pages, 14 figures

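Details
AI abstract (translated from Chinese)

Imitation Learning (IL) lets robots acquire manipulation skills from expert demonstrations. Diffusion Policy (DP) models multi-modal expert behaviors but degrades when the stacked observation horizon is naively increased, which limits long-horizon manipulation. We propose Self-Evolving Gated Attention (SEGA), a temporal module that maintains a time-evolving latent state through gated attention. It enables efficient recurrent updates that accumulate long-term context into a compact latent representation while filtering out irrelevant temporal information. Integrating SEGA into DP yields Self-Evolving Diffusion Policy (SeedPolicy), which resolves the temporal modeling bottleneck and extends the effective temporal horizon with moderate overhead. On the RoboTwin 2.0 benchmark with 50 manipulation tasks, SeedPolicy outperforms DP and other IL baselines. Averaged over CNN and Transformer backbones, it achieves a 36.8% relative improvement over DP in clean settings and a 169% relative improvement in randomized challenging settings. Compared with vision-language-action models such as RDT with 1.2B parameters, SeedPolicy achieves stronger performance in the clean setting with one to two orders of magnitude fewer parameters. These results establish SeedPolicy as a state-of-the-art imitation learning method for long-horizon robotic manipulation. Code is available at https://anonymous.4open.science/r/SeedPolicy-64F0/.

English abstract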

Imitation Learning (IL) enables robots to acquire manipulation skills from expert demonstrations. Diffusion Policy (DP) models multi-modal expert behaviors but degrades when naively increasing stacked observation horizons, limiting long-horizon manipulation. We propose Self-Evolving Gated Attention (SEGA), a temporal module that maintains a time-evolving latent state via gated attention, enabling efficient recurrent updates that accumulate long-term context into a compact latent representation while filtering irrelevant temporal information. Integrating SEGA into DP yields Self-Evolving Diffusion Policy (SeedPolicy), which resolves the temporal modeling bottleneck and extends the effective temporal horizon with moderate overhead. On the RoboTwin 2.0 benchmark with 50 manipulation tasks, SeedPolicy outperforms DP and other IL baselines. Averaged across both CNN and Transformer backbones, SeedPolicy achieves 36.8% relative improvement in clean settings and 169% relative improvement in randomized challenging settings over the DP. Compared to vision-language-action models such as RDT with 1.2B parameters, SeedPolicy achieves stronger performance in the clean setting with one to two orders of magnitude fewer parameters, demonstrating strong efficiency. These results establish SeedPolicy as a state-of-the-art imitation learning method for long-horizon robotic manipulation. Code is available at: https://anonymous.4open.science/r/SeedPolicy-64F0/.

2603.02883 2026-05-11 cs.CV

SemanticDialect: Semantic-Aware Mixed-Format Quantization for Video Diffusion Transformers

Wonsuk Jang, Thierry Tambe

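Affiliations * Stanford University

AI summary This paper proposes SemanticDialect, which improves quantization of video diffusion transformers through block-wise mixed-format quantization and semantic-aware dialect assignment. Experiments show that it approaches FP16 quality on Open-Sora 2.0.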

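Details
AI abstract (translated from Chinese)

Diffusion Transformers (DiTs) achieve state-of-the-art video generation quality, but their substantial memory and compute footprints hinder edge deployment. Quantization can reduce these costs, yet existing methods degrade video quality because of high activation variation and the difficulty of preserving semantic and temporal coherence. We propose SemanticDialect, which uses block-wise mixed-format quantization: each block selects an optimal format (dialect) from a candidate set (formatbook). The formatbook is augmented with lookup tables that store quantization errors and quantized indices, which makes per-block format selection and quantization efficient and keeps online overhead minimal. We further introduce attention-guided activation decomposition, which reduces quantization error through residual quantization, and semantic-aware dialect assignment (SeDA), which reduces cross-token quantization inconsistency by enforcing format uniformity among semantically correlated tokens. Experiments show that SemanticDialect approaches FP16 quality on Open-Sora 2.0 and outperforms prior quantization methods and block-wise formats (MXFP4, NVFP4). We also validate hardware deployability through an RTL design and a GPU kernel implementation.

English abstract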

Diffusion Transformers (DiTs) achieve state-of-the-art video generation quality, but their substantial memory and computational footprints hinder edge deployment. Quantization can reduce these costs, yet existing methods often degrade video quality due to high activation variation and the difficulty of preserving semantic and temporal coherence. We propose SemanticDialect, which advances block-wise mixed-format quantization. In this framework, each block selects an optimal format (dialect) from a candidate set (formatbook), which is augmented with lookup tables that store quantization errors and quantized indices, enabling efficient per-block format selection and quantization with minimal online overhead. We further introduce attention-guided activation decomposition, which reduces quantization error via residual quantization, and semantic-aware dialect assignment (SeDA), which reduces cross-token quantization inconsistency by enforcing format uniformity among semantically correlated tokens. Experiments demonstrate that SemanticDialect outperforms prior quantization methods and block-wise formats (MXFP4, NVFP4) while approaching FP16 quality on Open-Sora 2.0. We also validate hardware deployability through RTL design and GPU kernel implementation.
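The per-block format selection described above can be illustrated with a small sketch. Everything concrete here is an assumption for illustration: the two toy "dialects" in `FORMATBOOK` are made-up value grids (they do not model MXFP4, NVFP4, or the paper's formats), the per-block scale is a plain max-abs scale, and the selection is a brute-force error search, whereas the abstract says the actual method uses lookup tables to keep online overhead low.

```python
# Hypothetical formatbook: each "dialect" is a grid of representable values
# in [-1, 1]. Real formats such as MXFP4 or NVFP4 are NOT modeled here.
FORMATBOOK = {
    "uniform": [-1.0, -0.5, 0.0, 0.5, 1.0],
    "log": [-1.0, -0.25, -0.0625, 0.0, 0.0625, 0.25, 1.0],
}

def nearest(value, grid):
    """Closest representable value in the grid."""
    return min(grid, key=lambda g: abs(g - value))

def quantize_block(block, formatbook=FORMATBOOK):
    """Pick the dialect with the lowest squared error for one block.

    Returns (dialect_name, quantized_values, squared_error).
    Brute-force search stands in for the lookup tables in the abstract.
    """
    scale = max(abs(v) for v in block) or 1.0  # avoid dividing by zero
    best = None
    for name, grid in formatbook.items():
        quantized = [nearest(v / scale, grid) * scale for v in block]
        error = sum((v - q) ** 2 for v, q in zip(block, quantized))
        if best is None or error < best[2]:
            best = (name, quantized, error)
    return best

# Evenly spread values suit the uniform grid; values clustered near zero
# with one outlier suit the log-spaced grid.
print(quantize_block([0.5, -0.5, 1.0, 0.0])[0])
print(quantize_block([0.06, -0.25, 1.0, 0.0])[0])
```

Choosing the format per block lets blocks with different value distributions use different grids at the same bit budget, which is the motivation the abstract gives for mixed-format quantization.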

2603.01586 2026-05-11 cs.CV

InterCoG: Towards Spatially Precise Image Editing with Interleaved Chain-of-Grounding Reasoning

Yecong Wan, Fan Li, Chunwei Wang, Hao Wu, Mingwen Shao, Wangmeng Zuo

Affiliations * Faculty of Computing, Harbin Institute of Technology; Zhengzhou Advanced Research Institute of Harbin Institute of Technology; Huawei Noah’s Ark Lab; Artificial Intelligence Research Institute, Shenzhen University of Advanced Technology

AI summary This paper proposes InterCoG, an interleaved text-vision chain-of-grounding reasoning framework for fine-grained image editing in complex scenes. It combines textual spatial-relation reasoning with visual grounding to improve editing precision.

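Details
AI abstract (translated from Chinese)

Emerging unified editing models perform well on general object editing, but fine-grained editing in complex multi-entity scenes remains challenging, especially when targets are not visually salient and require spatial reasoning. We propose InterCoG, a novel text-vision Interleaved Chain-of-Grounding reasoning framework for fine-grained image editing in complex real-world scenes. Its key idea is to first reason about object position purely in text that includes spatial relation details, explicitly deducing the location and identity of the edit target. It then performs visual grounding in pixel space with generated bounding boxes and masks, and finally rewrites the editing description to specify the intended outcome. To support this paradigm, we propose two auxiliary training modules: multimodal grounding reconstruction supervision and multimodal grounding reasoning alignment, which enforce spatial localization accuracy and reasoning interpretability, respectively. We also construct GroundEdit-45K, a dataset of 45,000 grounding-oriented editing samples with detailed reasoning annotations, and GroundEdit-Bench for grounding-aware editing evaluation. Extensive experiments confirm the advantage of our approach for highly precise edits in spatially intricate, multi-entity scenes.

English abstract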

Emerging unified editing models have demonstrated strong capabilities in general object editing tasks. However, it remains a significant challenge to perform fine-grained editing in complex multi-entity scenes, particularly those where targets are not visually salient and require spatial reasoning. To this end, we propose InterCoG, a novel text-vision Interleaved Chain-of-Grounding reasoning framework for fine-grained image editing in complex real-world scenes. The key insight of InterCoG is to first perform object position reasoning solely within text that includes spatial relation details to explicitly deduce the location and identity of the edited target. It then conducts visual grounding via highlighting the editing targets with generated bounding boxes and masks in pixel space, and finally rewrites the editing description to specify the intended outcomes. To further facilitate this paradigm, we propose two auxiliary training modules: multimodal grounding reconstruction supervision and multimodal grounding reasoning alignment to enforce spatial localization accuracy and reasoning interpretability, respectively. We also construct GroundEdit-45K, a dataset comprising 45K grounding-oriented editing samples with detailed reasoning annotations, and GroundEdit-Bench for grounding-aware editing evaluation. Extensive experiments substantiate the superiority of our approach in highly precise edits under spatially intricate and multi-entity scenes.

2602.23811 2026-05-11 cs.LG cs.AI

Beyond State-Wise Mirror Descent: Offline Policy Optimization with Parametric Policies

Xiang Li, Yuheng Zhang, Nan Jiang

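Affiliations * Nanjing University; UIUC

AI summary This paper studies theoretical aspects of offline reinforcement learning under general function approximation. It extends theoretical guarantees to parameterized policy classes over large or continuous action spaces, identifies contextual coupling as the core difficulty, and derives new analyses and algorithmic insights by connecting mirror descent to natural policy gradient.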

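Details
AI abstract (translated from Chinese)

We study theoretical aspects of offline reinforcement learning (RL) under general function approximation. Prior work (e.g., Xie et al., 2021) has established the theoretical foundations of learning a good policy from offline data via pessimism, but existing computationally tractable algorithms (often in an oracle-efficient sense), such as PSPI, apply only to finite and small action spaces. These algorithms also rely on state-wise mirror descent and require the actor to be implicitly induced by the critic functions, so they cannot accommodate the standalone policy parameterization that is ubiquitous in practice. In this work, we address these limitations and extend the theoretical guarantees to parameterized policy classes over large or continuous action spaces. When extending mirror descent to parameterized policies, we identify contextual coupling as the core difficulty, and show how connecting mirror descent to natural policy gradient leads to new analyses, guarantees, and algorithmic insights, including a surprising unification of offline RL and imitation learning.

English abstract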

We investigate the theoretical aspects of offline reinforcement learning (RL) under general function approximation. While prior works (e.g., Xie et al., 2021) have established the theoretical foundations of learning a good policy from offline data via pessimism, existing algorithms that are computationally tractable (often in an oracle-efficient sense), such as PSPI, only apply to finite and small action spaces. Moreover, these algorithms rely on state-wise mirror descent and require actors to be implicitly induced from the critic functions, failing to accommodate standalone policy parameterization which is ubiquitous in practice. In this work, we address these limitations and extend the theoretical guarantees to parameterized policy classes over large or continuous action spaces. When extending mirror descent to parameterized policies, we identify contextual coupling as the core difficulty, and show how connecting mirror descent to natural policy gradient leads to novel analyses, guarantees, and algorithmic insights, including a surprising unification between offline RL and imitation learning.
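For reference, the state-wise mirror descent step mentioned above is usually written as follows. This is the textbook form with a KL regularizer, not a formula quoted from the paper, and the symbols \pi_t (policy at iteration t), f_t (critic at iteration t), and \eta (step size) are notation assumed here.

```latex
\pi_{t+1}(\cdot \mid s)
  = \arg\max_{p \in \Delta(\mathcal{A})}
    \Big\{ \eta \,\langle p,\, f_t(s,\cdot) \rangle
           - \mathrm{KL}\big(p \,\|\, \pi_t(\cdot \mid s)\big) \Big\}
\quad\Longrightarrow\quad
\pi_{t+1}(a \mid s)
  = \frac{\pi_t(a \mid s)\, \exp\!\big(\eta\, f_t(s,a)\big)}
         {\sum_{a' \in \mathcal{A}} \pi_t(a' \mid s)\, \exp\!\big(\eta\, f_t(s,a')\big)}
```

The update is solved separately for every state, and its normalizer sums over all actions. That is one way to see why this form suits finite, small action spaces, and why the actor ends up defined implicitly by the sequence of critics rather than by its own parameters.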

2602.22831 2026-05-11 cs.LG cs.AI cs.CL cs.CV cs.CY

Direction-Flipped Influence Audits Reveal Hidden Structure in Moral Choices of LLMs

Phil Blandfort, Tushar Karayil, Alex McKenzie, Urja Pawar, Robert Graham, Dmitrii Krasheninnikov

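Affiliations * Predictably Weird; Independent; AE Studio

AI summary A direction-flipped influence audit shows that short contextual cues shift LLM moral choice rates by 12-18 percentage points on average, revealing structure that baseline scores miss. A meaningful share of significant effects backfire, moving opposite to the cue's intended direction.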

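Details
AI abstract (translated from Chinese)

Moral benchmarks for LLMs usually score models on context-free prompts, implicitly treating the measured choice rate as stable. We test this assumption with a direction-flipped influence audit: for each scenario, we compare a baseline prompt with matched cues that steer toward option A or option B. Across a trolley-problem-style moral triage task, BBQ, and DailyDilemmas, and across five LLM families with and without reasoning, short contextual cues shift per-condition choice rates by 12-18 percentage points on average. These shifts reveal structure that baseline scores miss: roughly 40% of baseline-neutral triage and BBQ conditions show directional asymmetry under influence, and some significant effects backfire, moving opposite to the cue's intended direction. In follow-up probes, models often acknowledge the cue while denying that it affected their choice. Among significant backfire cases, this stated-vs.-revealed inconsistency appears 78% of the time. Reasoning does not eliminate contextual sensitivity but reshapes it: social-pressure cues such as user preference and emotional appeal weaken across benchmarks, while few-shot demonstrations strengthen sharply on triage and BBQ. We recommend direction-flipped influence pairs as a standard complement to context-free moral-bias evaluation, and release the harness and data to make such audits routine.

English abstract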

Moral benchmarks for LLMs typically score models on context-free prompts, implicitly treating the measured choice rate as stable. We test this assumption with a direction-flipped influence audit: for each scenario, we compare a baseline prompt with matched cues steering toward option A or option B. Across a trolley-problem-style moral triage task, BBQ, and DailyDilemmas, and across five LLM families with and without reasoning, short contextual cues shift per-condition choice rates by 12-18 percentage points on average. These shifts reveal structure that baseline scores miss: roughly 40% of baseline-neutral triage and BBQ conditions exhibit directional asymmetry under influence, and a meaningful share of significant effects backfire, moving opposite the cue's intended direction. In follow-up probes, models often recognize the cue while denying that it affected their choice. Among significant backfire trials, this stated-vs.-revealed inconsistency appears in 78% of cases. Reasoning does not eliminate contextual sensitivity but reshapes it: social-pressure cues such as user preference and emotional appeal weaken across benchmarks, while few-shot demonstrations strengthen sharply on both triage and BBQ. We recommend direction-flipped influence pairs as a standard complement to context-free moral-bias evaluation, and release the harness and data to make such audits routine.
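The comparison at the heart of the audit (baseline versus a cue toward A versus a matched cue toward B) can be sketched as below. This is an illustrative toy and not the released harness: the function names, the dictionary keys, and the sample choices are hypothetical, and the sketch reports raw differences only, without the significance testing that the abstract's "significant effects" implies.

```python
def choice_rate(choices, option="A"):
    """Fraction of trials in which the model picked the given option."""
    return sum(c == option for c in choices) / len(choices)

def audit_condition(baseline, cue_a, cue_b):
    """Compare one scenario under baseline, cue-toward-A, and cue-toward-B.

    Each argument is a list of "A"/"B" choices from repeated trials.
    Shifts are signed so that positive means "moved as the cue intended".
    """
    base = choice_rate(baseline)
    shift_a = choice_rate(cue_a) - base  # cue toward A should raise the A rate
    shift_b = base - choice_rate(cue_b)  # cue toward B should lower the A rate
    return {
        "baseline_rate": base,
        "shift_toward_a": shift_a,
        "shift_toward_b": shift_b,
        "asymmetry": shift_a - shift_b,  # zero if both cues pull equally hard
        "backfire_a": shift_a < 0,
        "backfire_b": shift_b < 0,
    }

# Made-up trial outcomes for one scenario (10 trials per condition).
baseline = ["A"] * 5 + ["B"] * 5
cue_a = ["A"] * 8 + ["B"] * 2
cue_b = ["A"] * 6 + ["B"] * 4  # the A rate rose although the cue favored B
print(audit_condition(baseline, cue_a, cue_b))
```

A scenario like this one looks neutral at baseline (50% A) yet responds very differently to the two cues, which is the kind of hidden structure the abstract says baseline-only scoring misses.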

2602.21858 2026-05-11 cs.AI

ProactiveMobile: A Comprehensive Benchmark for Boosting Proactive Intelligence on Mobile Devices

Dezhi Kong, Zhengzhao Feng, Qiliang Liang, Hao Wang, Haofei Sun, Changpeng Yang, Yang Li, Peng Zhou, Shuai Nie, Hongzhen Wang, Linfeng Zhou, Hao Jia, Jiaming Xu, Runyu Shi, Ying Huang

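Affiliations * HyperAI Team, Xiaomi Corporation; Zhejiang University; Peking University; Northeastern University

AI summary This paper introduces ProactiveMobile, a benchmark for proactive intelligence in mobile agents built on multimodal large language models. Tasks require inferring latent user intent from on-device context and generating an executable function sequence from a pool of 63 APIs. A fine-tuned Qwen2.5-VL-7B-Instruct reaches a 19.15% success rate, ahead of o1 (15.71%) and GPT-5 (7.39%).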

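Details
AI abstract (translated from Chinese)

Multimodal large language models (MLLMs) have made significant progress in mobile agent development, but their capabilities are largely confined to a reactive paradigm in which they only execute explicit user commands. The emerging paradigm of proactive intelligence, where agents autonomously anticipate needs and initiate actions, is the next frontier for mobile agents. Its development is held back by the lack of benchmarks that address real-world complexity and allow objective, executable evaluation. To overcome this, we introduce ProactiveMobile, a comprehensive benchmark designed to advance research in this area systematically. ProactiveMobile formalizes the proactive task as inferring latent user intent across four dimensions of on-device contextual signals and generating an executable function sequence from a comprehensive function pool of 63 APIs. The benchmark contains over 3,660 instances across 14 scenarios and captures real-world complexity through multi-answer annotations. To ensure quality, a team of 30 experts performed a final audit, verifying factual accuracy, logical consistency, and action feasibility, and correcting non-compliant entries. Extensive experiments show that our fine-tuned Qwen2.5-VL-7B-Instruct achieves a success rate of 19.15%, outperforming o1 (15.71%) and GPT-5 (7.39%). This indicates that proactivity is a critical capability that current MLLMs widely lack but that can be learned, underscoring the importance of the proposed benchmark for proactivity evaluation.

English abstract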

Multimodal large language models (MLLMs) have made significant progress in mobile agent development, yet their capabilities are predominantly confined to a reactive paradigm, where they merely execute explicit user commands. The emerging paradigm of proactive intelligence, where agents autonomously anticipate needs and initiate actions, represents the next frontier for mobile agents. However, its development is critically bottlenecked by the lack of benchmarks that can address real-world complexity and enable objective, executable evaluation. To overcome these challenges, we introduce ProactiveMobile, a comprehensive benchmark designed to systematically advance research in this domain. ProactiveMobile formalizes the proactive task as inferring latent user intent across four dimensions of on-device contextual signals and generating an executable function sequence from a comprehensive function pool of 63 APIs. The benchmark features over 3,660 instances of 14 scenarios that embrace real-world complexity through multi-answer annotations. To ensure quality, a team of 30 experts conducts a final audit of the benchmark, verifying factual accuracy, logical consistency, and action feasibility, and correcting any non-compliant entries. Extensive experiments demonstrate that our fine-tuned Qwen2.5-VL-7B-Instruct achieves a success rate of 19.15%, outperforming o1 (15.71%) and GPT-5 (7.39%). This result indicates that proactivity is a critical competency widely lacking in current MLLMs, yet it is learnable, emphasizing the importance of the proposed benchmark for proactivity evaluation.

2602.20974 2026-05-11 cs.LG

MAST: A Multi-fidelity Augmented Surrogate model via Spatial Trust-weighting

Ahmed Mohamed Eisa Nasr, Ali Elham, Haris Moazam Sheikh

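Affiliations * Department of Aeronautics and Astronautics; University of Southampton

AI summary MAST uses spatial trust-weighting to blend corrected low-fidelity observations with high-fidelity predictions, improving the robustness and accuracy of multi-fidelity surrogate models, especially under tight budgets.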

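Details
AI abstract (translated from Chinese)

In engineering design and scientific computing, computational cost and predictive accuracy are intrinsically coupled. High-fidelity simulations give accurate predictions at high computational cost, while low-fidelity approximations trade accuracy for efficiency. Multi-fidelity surrogate modelling addresses this trade-off by combining abundant low-fidelity data with sparse high-fidelity observations. However, existing methods rely on global correlation assumptions that often fail in practice to capture how fidelity relationships vary across the input space, which leads to poor performance, particularly under tight budgets. We introduce MAST, a method that blends corrected low-fidelity observations with high-fidelity predictions, trusting high-fidelity near observed samples and relying on corrected low-fidelity elsewhere. MAST does this through explicit discrepancy modelling and distance-based weighting with closed-form variance propagation, producing a single heteroscedastic Gaussian process. On multi-fidelity synthetic benchmarks, MAST clearly outperforms current state-of-the-art techniques. Crucially, it stays robust as the total budget and the fidelity gap vary, conditions under which competing methods degrade significantly or behave unstably. More broadly, MAST provides a spatially adaptive framework for multi-fidelity Gaussian-process modelling in which the contribution of low-fidelity information is governed by its proximity to high-fidelity calibration data, opening a new direction for more reliable surrogate construction in sparse, budget-constrained settings.

English abstract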

In engineering design and scientific computing, computational cost and predictive accuracy are intrinsically coupled. High-fidelity simulations provide accurate predictions but at substantial computational costs, while lower-fidelity approximations offer efficiency at the expense of accuracy. Multi-fidelity surrogate modelling addresses this trade-off by combining abundant low-fidelity data with sparse high-fidelity observations. However, existing methods rely on global correlation assumptions that can often fail in practice to capture how fidelity relationships vary across the input space, leading to poor performance, particularly under tight budget constraints. We introduce MAST, a method that blends corrected low-fidelity observations with high-fidelity predictions, trusting high-fidelity near observed samples and relying on corrected low-fidelity elsewhere. MAST achieves this through explicit discrepancy modelling and distance-based weighting with closed-form variance propagation, producing a single heteroscedastic Gaussian process. Across multi-fidelity synthetic benchmarks, MAST shows a marked improvement over the current state-of-the-art techniques. Crucially, MAST maintains robust performance across varying total budget and fidelity gaps, conditions under which competing methods exhibit significant degradation or unstable behaviour. More broadly, MAST provides a spatially adaptive framework for multi-fidelity Gaussian-process modelling, in which the contribution of low-fidelity information is governed by its proximity to high-fidelity calibration data, opening a new direction for more reliable surrogate construction under sparse and budget-constrained settings.
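The idea of trusting high-fidelity predictions near observed samples and corrected low-fidelity values elsewhere can be sketched with a distance-based weight. The abstract does not give MAST's actual weighting function or variance formula, so everything specific below is an assumption: a squared-exponential weight on the distance to the nearest high-fidelity input, a hypothetical `length_scale`, one-dimensional inputs, and a variance rule that treats the two error sources as independent.

```python
import math

def trust_weight(x, hf_inputs, length_scale=0.2):
    """Weight in [0, 1]: 1 at a high-fidelity sample, decaying with distance.

    The squared-exponential form and length_scale are assumptions.
    """
    nearest = min(abs(x - xi) for xi in hf_inputs)
    return math.exp(-((nearest / length_scale) ** 2))

def blend(x, hf_inputs, hf_mean, hf_var, lf_mean, lf_var, length_scale=0.2):
    """Blend a high-fidelity prediction with a corrected low-fidelity one.

    lf_mean is assumed to already include the discrepancy correction.
    The variance rule assumes independent errors in the two sources.
    """
    w = trust_weight(x, hf_inputs, length_scale)
    mean = w * hf_mean + (1.0 - w) * lf_mean
    var = w ** 2 * hf_var + (1.0 - w) ** 2 * lf_var
    return mean, var

hf_inputs = [0.5, 1.0]  # hypothetical high-fidelity sample locations
print(blend(0.5, hf_inputs, hf_mean=2.0, hf_var=0.01, lf_mean=3.0, lf_var=0.04))
print(blend(0.75, hf_inputs, hf_mean=2.0, hf_var=0.01, lf_mean=3.0, lf_var=0.04))
```

Because the weight varies with the input, the blended variance also varies with the input, which is consistent with the abstract's description of the result as heteroscedastic.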

2602.20816 2026-05-11 cs.CL cs.LG

Don't Ignore the Tail: Decoupling top-K Probabilities for Efficient Language Model Distillation

Sayantan Dasgupta, Trevor Cohn, Timothy Baldwin

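Affiliations * School of Computing and Information Systems, University of Melbourne, Melbourne, Australia

AI summary This paper proposes a tail-aware divergence that decouples the contribution of the teacher model's top-K predicted probabilities from that of lower-probability predictions, increasing the influence of the distribution's tail in language model distillation.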

Comments ICML 2026

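Details
AI abstract (translated from Chinese)

The core learning signal in language model distillation is the standard Kullback-Leibler (KL) divergence between the student and teacher distributions. Traditional KL divergence tends to be dominated by the highest-probability next tokens, i.e., the teacher's modes, which diminishes the influence of less probable but potentially informative parts of the output distribution. We propose a new tail-aware divergence that decouples the contribution of the teacher model's top-K predicted probabilities from that of lower-probability predictions, while keeping the same computational profile as KL divergence. The decoupled approach reduces the impact of the teacher's modes and thereby increases the contribution of the distribution's tail. Experiments show that the modified distillation method is competitive in both pre-training and supervised distillation of decoder models across various datasets. The distillation process is also efficient: it can handle large datasets on a modest academic budget, without industry-scale compute.

English abstract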

The core learning signal used in language model distillation is the standard Kullback-Leibler (KL) divergence between the student and teacher distributions. Traditional KL divergence tends to be dominated by the next tokens with the highest probabilities, i.e., the teacher's modes, thereby diminishing the influence of less probable yet potentially informative components of the output distribution. We propose a new tail-aware divergence that decouples the contribution of the teacher model's top-K predicted probabilities from that of lower-probability predictions, while maintaining the same computational profile as the KL Divergence. Our decoupled approach reduces the impact of the teacher modes and, consequently, increases the contribution of the tail of the distribution. Experimental results demonstrate that our modified distillation method yields competitive performance in both pre-training and supervised distillation of decoder models across various datasets. Furthermore, the distillation process is efficient and can be performed with a modest academic budget for large datasets, eliminating the need for industry-scale computing.
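
To make the decoupling concrete, here is a minimal sketch. The abstract does not state the paper's divergence, so this is not the authors' formula. It relies on the exact chain-rule decomposition of KL divergence into a head/tail mass term, a within-head term weighted by the teacher's head mass, and a within-tail term weighted by the teacher's tail mass. The function name `decoupled_kl` and the `tail_weight` parameter are hypothetical. Replacing the (typically tiny) tail mass with a free weight is one plausible way to stop the teacher's modes from suppressing the tail.

```python
import math

def kl(p, q):
    """Standard KL(p || q) for strictly positive probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def decoupled_kl(p, q, k, tail_weight=None):
    """KL split into head (teacher top-k) and tail parts; requires 0 < k < len(p).

    With tail_weight=None the tail term keeps its natural weight (the teacher's
    tail mass) and the result equals kl(p, q). Passing a larger tail_weight is a
    hypothetical decoupling that boosts the tail's contribution.
    """
    order = sorted(range(len(p)), key=lambda i: p[i], reverse=True)
    head, tail = order[:k], order[k:]
    p_head, q_head = sum(p[i] for i in head), sum(q[i] for i in head)
    p_tail, q_tail = sum(p[i] for i in tail), sum(q[i] for i in tail)

    # Divergence between the two-bin (head mass, tail mass) distributions.
    kl_mass = p_head * math.log(p_head / q_head) + p_tail * math.log(p_tail / q_tail)
    # Divergences between the renormalized head and tail distributions.
    kl_head = kl([p[i] / p_head for i in head], [q[i] / q_head for i in head])
    kl_tail = kl([p[i] / p_tail for i in tail], [q[i] / q_tail for i in tail])

    if tail_weight is None:
        tail_weight = p_tail  # coupled case: recovers standard KL
    return kl_mass + p_head * kl_head + tail_weight * kl_tail

# Usage: teacher p, student q over a five-token vocabulary, top-2 head.
p = [0.70, 0.20, 0.05, 0.03, 0.02]
q = [0.60, 0.20, 0.10, 0.05, 0.05]
print(kl(p, q), decoupled_kl(p, q, k=2), decoupled_kl(p, q, k=2, tail_weight=1.0))
```

The sketch sorts the full vocabulary for clarity. A practical implementation would use a top-k selection instead and operate on batched logits.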

2602.20338 2026-05-11 cs.LG

Emergent Manifold Separability during Reasoning in Large Language Models

Chanwoo Chun, Alexandre Polo, SueYeon Chung

Affiliations * Department of Physics, Harvard University; Center for Data Science, New York University; Kempner Institute, Harvard University; Center for Computational Neuroscience, Flatiron Institute

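AI summary: This paper studies manifold separability during reasoning in large language models. Applying Manifold Capacity Theory to two compositional reasoning tasks, it finds that concept manifolds are untangled into linearly separable subspaces just before computation, which the authors attribute to a dynamic manifold management mechanism.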

Comments Alexandre Polo and Chanwoo Chun contributed equally to this work

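Details
AI abstract (translated from Chinese)

Chain-of-Thought prompting significantly improves the reasoning ability of large language models, but the temporal dynamics of the underlying representation geometry remain unclear. By applying Manifold Capacity Theory to two compositional reasoning tasks, this paper finds that reasoning manifests as a transient geometric pulse: concept manifolds are untangled into linearly separable subspaces before computation and then rapidly compressed. This behavior differs from standard linear probe accuracy, indicating a fundamental distinction between information that is retrievable and information that is geometrically prepared. The paper interprets this as a dynamic manifold management mechanism, in which the model dynamically modulates representational capacity to optimize residual-stream bandwidth.

English abstract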

Chain-of-Thought (CoT) prompting significantly improves reasoning in Large Language Models, yet the temporal dynamics of the underlying representation geometry remain poorly understood. We investigate these dynamics by applying Manifold Capacity Theory (MCT) to two compositional reasoning tasks: a controlled Boolean logic tree that supports deep mechanistic analysis, and a natural-language eligibility task in which the model has to extract attributes from prose, compare them to thresholds, and compose the local decisions through a fixed evaluation tree. MCT lets us quantify the linear separability of latent representations without the confounding factors of probe training. On both tasks, and across several open-weight models, reasoning manifests as a transient geometric pulse: concept manifolds are untangled into linearly separable subspaces immediately prior to computation and rapidly compressed thereafter. This behavior diverges from standard linear probe accuracy, which remains high long after computation, suggesting a fundamental distinction between information that is merely retrievable and information that is geometrically prepared for processing. We interpret this phenomenon as Dynamic Manifold Management, a mechanism where the model dynamically modulates representational capacity to optimize the bandwidth of the residual stream throughout the reasoning chain.