2605.21414 2026-05-21 cs.RO cs.CV

PointACT: Vision-Language-Action Models with Multi-Scale Point-Action Interaction

Shizhe Chen, Paul Pacaud, Cordelia Schmid

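Affiliations: Inria; École normale supérieure; CNRS; PSL Research University

AI summary: This paper proposes PointACT, a 3D-aware vision-language-action policy that integrates hierarchical 3D point cloud representations and uses a multi-scale point-action interaction mechanism to improve a robot's fine-grained geometric reasoning and spatial grounding in 3D environments.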

Comments Accepted to RSS 2026; project webpage: https://cshizhe.github.io/projects/pointact.html

Abstract

Vision-Language-Action (VLA) models have shown strong potential for general-purpose robotic manipulation by leveraging large pretrained vision-language backbones. However, most existing VLAs rely primarily on 2D visual representations, which limit their ability to reason about fine-grained geometry and spatial grounding - capabilities that are essential for precise and robust manipulation in 3D environments. In this paper, we propose PointACT, a dual-system 3D-aware VLA policy that integrates hierarchical 3D point cloud representations directly into the action decoding process. PointACT employs a multi-scale point-action interaction mechanism with efficient bottleneck window self-attention, enabling evolving action tokens to densely attend to both local geometric detail and global scene structure. We evaluate PointACT on the LIBERO and RLBench benchmarks and systematically compare it against monolithic and dual-system VLA baselines, including variants augmented with point cloud inputs. PointACT achieves consistent improvements across both benchmarks, increasing success rates by 10% on the challenging RLBench-10Tasks suite over state-of-the-art pretrained VLAs, with even larger gains when the vision-language backbone is frozen and the action expert is trained from scratch. Extensive ablation studies demonstrate that tightly coupling hierarchical 3D geometry with pretrained 2D semantic representations is critical for robust and spatially grounded robot control. Our results also highlight the promise of pretrained 3D representations for 3D-aware VLA policies.

2605.21411 2026-05-21 cs.CV

RoadTones: Tone Controllable Text Generation from Road Event Videos

Chirag Parikh, Siddhi Pravin Lipare, Ravi Kiran Sarvadevabhatla

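Affiliations: CVIT & iHub-Data, IIIT Hyderabad, India

AI summary: This paper presents the RoadTones-51K dataset and the RoadTones-VL-CoT model, which generates tone-conditioned reasoning drafts to improve interpretability, and introduces the RoadTones-Eval evaluation suite. Together these lay the foundation for context-sensitive, tone-controllable video captioning.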

Comments Accepted at CVPR Findings 2026. Project page: https://roadtones.github.io/

Abstract

Existing video-language models can generate factual descriptions of road events but lack control over how these events are expressed: their tone, urgency, or style. This limits deployment in communication-critical settings where the effectiveness of a message depends on both content and presentation, not just factual accuracy. To mitigate this, we introduce a comprehensive dataset-model-evaluation suite for tone-controllable road video captioning. Our human-validated data generation pipeline expands road-video corpora with diverse tonal annotations and multi-tone captions, yielding the RoadTones-51K dataset. We propose RoadTones-VL-CoT, a controllable video-to-text model that also generates tone-conditioned Chain-of-Thought intermediate drafts for interpretability. We also introduce RoadTones-Eval, a new evaluation suite that jointly measures factual consistency and tone adherence. In addition, we conducted a user study whose results validate caption quality, tone control, and factual consistency. Together, these contributions lay the foundation for context-sensitive tone-controllable video captioning.

2605.21406 2026-05-21 cs.RO

MC-Risk: Multi-Component Risk Fields for Risk Identification and Motion Planning

Maximilian Link, Yingjie Xu, Yingbai Hu, Yinlong Liu

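Affiliations: Technical University of Munich; The Chinese University of Hong Kong; City University of Macau

AI summary: This paper presents MC-Risk, a planner-aligned, multi-component risk field for early, calibrated, and class-aware risk localization. The method linearly composes three interpretable modules (a motorized-agent field, a VRU risk field, and a road penalty field) and reports what the authors describe as the first standardized quantitative evaluation on RiskBench's collision subset, showing the best risk localization and the earliest hazard indication.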

Abstract

We present MC-Risk, a planner-aligned, multi-component risk field on a bird's-eye-view grid that yields early, calibrated, and class-aware risk localization. MC-Risk linearly composes three interpretable modules: (i) a motorized-agent field that fuses a black-box multimodal trajectory predictor with an analytic Gaussian-torus construction whose lateral width grows with speed/curvature and whose height attenuates with look-ahead; (ii) a VRU risk field that replaces isotropic pedestrian blobs with a forward-biased anisotropic kernel aligned to heading and speed; and (iii) a road penalty field that exploits full HD-map topology, imposing an off-road penalty and lane-aware risk exposure for same/opposite directions. We conduct, to our knowledge, the first standardized quantitative evaluation of a risk-field formulation on RiskBench's collision subset. MC-Risk attains the best overall risk localization and the earliest hazard indication. Finally, we demonstrate a plug-and-play planning interface by using the field as an MPC cost density, enabling risk-aware trajectory generation without additional training.
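The forward-biased anisotropic kernel and the linear composition described above can be illustrated with a small sketch. Everything here is hypothetical: the function names, the Gaussian form, and the parameters `sigma_lat`, `sigma_back`, and `gain` are assumptions chosen for illustration, since the abstract does not give the kernel's formula.

```python
import math


def vru_kernel(px, py, ax, ay, heading, speed,
               sigma_lat=0.5, sigma_back=0.5, gain=0.8):
    """Risk at point (px, py) from an agent at (ax, ay).

    All parameter names and default values are hypothetical.
    """
    dx, dy = px - ax, py - ay
    # Rotate the offset into the agent frame: lon along heading, lat across it.
    lon = dx * math.cos(heading) + dy * math.sin(heading)
    lat = -dx * math.sin(heading) + dy * math.cos(heading)
    # Forward bias: the spread ahead of the agent grows with speed.
    sigma_lon = sigma_back + gain * speed if lon > 0 else sigma_back
    return math.exp(-0.5 * ((lon / sigma_lon) ** 2 + (lat / sigma_lat) ** 2))


def compose(fields, weights):
    """Linear composition of per-cell risk values from several modules."""
    return sum(w * f for w, f in zip(weights, fields))


# A pedestrian at the origin walking along +x at 2 m/s (toy numbers).
ahead = vru_kernel(1.0, 0.0, 0.0, 0.0, heading=0.0, speed=2.0)
behind = vru_kernel(-1.0, 0.0, 0.0, 0.0, heading=0.0, speed=2.0)
print(ahead, behind)
```

With these assumed parameters, a cell one metre ahead of the agent receives more risk than a cell one metre behind it, which is the qualitative behaviour that distinguishes such a kernel from an isotropic blob.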

2605.21404 2026-05-21 cs.LG

What Twelve LLM Agent Benchmark Papers Disclose About Themselves: A Pilot Audit and an Open Scoring Schema

Mahdi Naser Moghadasi, Faezeh Ghaderi

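Affiliations: Research Division, BrightMind AI; Texas Tech University; University of Texas at Arlington

AI summary: By analyzing twelve well-known LLM agent benchmark papers, this paper reveals gaps in how they disclose their evaluation methods and designs an open scoring schema to improve transparency and reproducibility.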

Comments Pilot audit of 12 LLM agent benchmark papers; schema, codebook, and per-paper scoring sheet released. Submission to IEEE Big Data 2026

Abstract

We read twelve well-known LLM agent benchmark papers and recorded, dimension by dimension, what each paper actually says about how its evaluation was run. The motivation came from a familiar frustration: two papers will report results on the same benchmark with the same model name and disagree, and you cannot tell why -- the scaffold, the sampling settings, the subset, or the evaluator version. In many cases the published artifact does not let you answer. This paper is an implementation report on the attempt. We designed a small audit schema (five fields: benchmark identity, harness specification, inference settings, cost reporting, failure breakdown), wrote a scoring codebook with the boundary cases we hit during pilot scoring, applied it to twelve canonical papers (eight agent, four classical static), and recorded what we saw. We score the disclosure of an agent run, not its correctness, and make no claim that disclosure implies a trustworthy result. The mean audit score across the eight agent-benchmark papers is 0.38 (out of 1.0), and across the four classical static benchmarks 0.66; the largest gap is on cost (none of the eight agent benchmark papers disclose inference cost in any form) and on harness specification (none fully disclose a content-addressed container image of the evaluation environment). We release the schema as a JSON Schema file, the codebook as a Markdown document, and the raw scoring sheet as a CSV. The scoring was performed by a single auditor in one pass; a multi-rater audit is the natural next step, and we discuss what we think it would change.
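The five-field schema lends itself to a small data structure. The sketch below is hypothetical: the field names mirror the five dimensions listed in the abstract, but the 0-to-1 scale per field and the unweighted mean are assumptions made for illustration, not the authors' released schema or codebook.

```python
from dataclasses import dataclass, fields
from statistics import mean


@dataclass
class DisclosureAudit:
    """One paper's disclosure scores; each field is assumed to lie in [0, 1]."""
    benchmark_identity: float
    harness_specification: float
    inference_settings: float
    cost_reporting: float
    failure_breakdown: float

    def score(self) -> float:
        # Assumed aggregation: unweighted mean over the five fields.
        values = [getattr(self, f.name) for f in fields(self)]
        if any(not 0.0 <= v <= 1.0 for v in values):
            raise ValueError("each field score must be in [0, 1]")
        return mean(values)


def mean_audit_score(audits: list[DisclosureAudit]) -> float:
    """Average the per-paper scores across a group of papers."""
    return mean(a.score() for a in audits)


# Made-up example values, not taken from the paper's scoring sheet.
example = DisclosureAudit(1.0, 0.5, 0.5, 0.0, 0.5)
print(example.score())
```

Note that a structure like this records what a paper discloses, in line with the abstract's point that the audit scores disclosure rather than correctness.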

2605.21403 2026-05-21 cs.CL

Quantifying the cross-linguistic effects of syncretism on agreement attraction

Utku Turk, Eva Neu

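Affiliations: University of Maryland, College Park; University of Massachusetts, Amherst

AI summary: The study examines how the effect of syncretism on agreement attraction differs across languages, using surprisal and attention entropy from large language models to analyze four languages and shed light on how syncretism modulates attraction.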

Comments SCiL Conference Paper

Abstract

Agreement attraction errors, in which a verb erroneously agrees with an intervening noun rather than its grammatical head, are amplified by morphological syncretism in some languages (English, German, Russian) but not others (Turkish, Armenian), a cross-linguistic pattern without a principled account. We use surprisal and attention entropy from large language models as processing proxies to investigate this variation across four languages. LLM-derived measures replicate behavioral findings in English and German (syncretism modulates attraction), align with Turkish null results (no modulation), and partially capture Russian patterns. We discuss further directions for better understanding why syncretism affects agreement attraction differently across languages.
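For readers unfamiliar with the two processing proxies, the sketch below shows their textbook definitions. It is a minimal illustration only: how the authors extract probabilities from a model, or aggregate entropy over attention heads and layers, is not stated in the abstract, and the function names are assumptions.

```python
import math


def surprisal(prob: float) -> float:
    """Surprisal in bits of an outcome with probability `prob`."""
    if not 0.0 < prob <= 1.0:
        raise ValueError("prob must be in (0, 1]")
    return -math.log2(prob)


def attention_entropy(weights: list[float]) -> float:
    """Shannon entropy in bits of one attention distribution."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have positive mass")
    probs = [w / total for w in weights]
    return sum(-p * math.log2(p) for p in probs if p > 0)


# Toy numbers, not model outputs.
print(surprisal(0.25))                              # 2.0
print(attention_entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0
print(attention_entropy([1.0, 0.0, 0.0, 0.0]))      # 0.0
```

A verb form the model finds unexpected has high surprisal, and attention spread evenly over several candidate nouns has high entropy, which is why both can serve as proxies for processing difficulty.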

2605.21398 2026-05-21 cs.RO

From swept contact to pose: Probe-aware registration via complementary-shape docking

Chen Chen, Yunwen Li, Yifan Xu, Xiangjie Yan, Chang Shu, Jianxia Hou, Shiji Song, Xiang Li

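Affiliations: Department of Automation, Tsinghua University; Tsingscribe Medical Ltd.; D-MAVT, ETH Zürich; Peking University School and Hospital of Stomatology; Institute for Guo Qiang, Tsinghua University

AI summary: This study proposes a calibration-free registration method that reformulates contact registration as complementary-shape docking between the object and the probe's swept volume, explicitly accounting for probe geometry and using both contact and non-contact evidence. The method runs a global-to-local search via 3D FFT correlation, followed by continuous SE(3) refinement with Lie-algebra updates and analytic contact sensitivities, achieving efficient exploration and metric-grade convergence.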

Comments 8 pages, 9 figures, accepted to ICRA 2026

Abstract

Accurate registration between a prior model and the real scene is essential for high-precision robotic manipulation, yet optical methods suffer from long calibration chains, line-of-sight constraints, and fabrication errors. We propose a calibration-free alternative that reformulates contact registration as complementary-shape docking between the object and the probe's swept volume, explicitly accounting for probe geometry and leveraging both contact and non-contact evidence. Our solver performs a global-to-local search via 3D FFT correlation over low-discrepancy SO(3) samples, followed by continuous SE(3) refinement using Lie-algebra updates and analytic contact sensitivities. This pipeline yields efficient exploration and metric-grade convergence without fragile point correspondences. In simulation across free-form meshes, it achieved sub-0.04 mm and sub-0.4° accuracy and was robust to pose noise and contact loss. On a tooth-preparation robot, our method attained 0.42 mm and 3.75°, outperforming an optical tracker registration while requiring no external sensors. These results demonstrate a practical and precise registration strategy for surgical and industrial robots.

2605.21395 2026-05-21 cs.AI cs.LG

Towards Resilient and Autonomous Networks: A BlueSky Vision on AI-Native 6G

Liang Wu, Kelly Wan, Mayank Darbari, Liangjie Hong

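Affiliations: Nokia

AI summary: This paper presents a BlueSky vision for AI-native 6G that aims to integrate AI natively into 6G, shifting from "Network for AI" to "AI for Network". Using a foundation model and collaborative multi-agent systems, it frames network management as a unified multi-modal, multi-task optimization problem and points 6G toward an intelligent, self-sustaining communication infrastructure.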

Comments Accepted at KDD 2026

Abstract

The proliferation of emerging applications, such as autonomous driving and immersive experiences, demands cellular networks that are not only faster, but fundamentally more resilient and autonomous. This paper presents a BlueSky vision on how Artificial Intelligence will be natively integrated into 6G, shifting the paradigm from "Network for AI" to "AI for Network". We envision that, unlike 5G's reliance on scattered, ad-hoc models each trained for a single task, native AI in the 6G era will be anchored by a foundation model and orchestrated via collaborative multi-agent systems, framing network management as a unified, multi-modal, multi-task optimization problem. Built on this vision, we outline two transformative directions. The first focuses on developing a 6G foundation model as a unified backbone, with task-specific knowledge distilled into compact models suited for diverse edge deployments. The second advances multi-agent systems designed to autonomously diagnose, maintain, and recover networks with minimal human intervention. These directions chart a roadmap for 6G to evolve into an intelligent, self-sustaining communication infrastructure.

2605.21391 2026-05-21 cs.CL

Post-Hoc Understanding of Metaphor Processing in Decoder-Only Language Models via Conditional Scale Entropy

Lawhori Chakrabarti, Jennifer Johnson-Leung, Bert Baumgaertner, Aleksandar Vakanski, Min Xian, Boyu Zhang

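Affiliations: Department of Computer Science, University of Idaho; Department of Mathematics and Statistical Science, University of Idaho; Department of Politics and Philosophy, University of Idaho; Department of Nuclear Engineering and Industrial Management, University of Idaho

AI summary: The study investigates how decoder-only language models process metaphor, using conditional scale entropy (CSE) to analyze engagement across frequency scales at each layer position. It finds that metaphorical tokens produce significantly higher spectral breadth at contiguous layer positions, that the effect is not explained by semantic complexity or propositional content, and that multi-scale coordination is a signature of metaphorical language processing.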

Comments 18 pages, 3 figures, submitted to ICPR workshop

Abstract

Metaphor requires a language model to resolve a token whose contextual meaning diverges from its basic literal sense. Understanding how transformer models organize this reinterpretation across depth remains an open problem in mechanistic interpretability. We introduce conditional scale entropy (CSE), a wavelet-derived measure of how broadly transformer computation engages across frequency scales at each layer position. Two theorems establish that CSE is invariant to update magnitude, isolating the structural pattern of updates from their intensity. Using CSE, we find that metaphorical tokens produce significantly higher spectral breadth than literal tokens at contiguous layer positions on every decoder-only architecture tested, from 124M to 20B parameters (GPT-2 family, LLaMA-2 7B, GPT-oss 20B). The effect survives cluster-based permutation correction, recurs in the early-to-mid relative depth range across models, and converges with an independent analysis of 200 naturalistic VUA pairs. Specificity controls further show that the effect is not explained by semantic complexity or by matched propositional content. These results identify multi-scale coordination as a consistent signature of metaphorical language processing in the decoder-only architectures examined, and establish CSE as a principled tool for characterizing cross-depth structure in transformers.
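The abstract does not give the CSE formula, so the sketch below is not the authors' measure. It only illustrates the general idea behind a magnitude-invariant entropy over wavelet scales: per-scale energies are normalized to a distribution before the entropy is taken, so rescaling the input leaves the value unchanged. The choice of the Haar wavelet, the toy signal, and all function names are assumptions.

```python
import math


def haar_detail_energies(signal: list[float]) -> list[float]:
    """Energy of Haar detail coefficients at each scale (finest first)."""
    n = len(signal)
    if n < 2 or n & (n - 1):
        raise ValueError("length must be a power of two, at least 2")
    approx = list(signal)
    energies = []
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        detail = [(a - b) / math.sqrt(2) for a, b in pairs]
        approx = [(a + b) / math.sqrt(2) for a, b in pairs]
        energies.append(sum(d * d for d in detail))
    return energies


def scale_entropy(signal: list[float]) -> float:
    """Shannon entropy (bits) of the normalized per-scale energy distribution."""
    energies = haar_detail_energies(signal)
    total = sum(energies)
    if total == 0:
        return 0.0  # constant signal: no detail energy at any scale
    probs = [e / total for e in energies]
    return sum(-p * math.log2(p) for p in probs if p > 0)


# Toy signal; scaling it by 10 should leave the entropy unchanged
# up to floating-point rounding.
toy = [0.1, 0.4, -0.3, 0.8, 0.5, -0.2, 0.0, 0.7]
scaled = [10.0 * x for x in toy]
print(scale_entropy(toy), scale_entropy(scaled))
```

Because the normalization divides out any overall factor, a measure built this way responds to how energy is distributed across scales rather than to how large the updates are, which is the kind of separation the abstract attributes to CSE.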

2605.21388 2026-05-21 cs.LG cs.AI cs.NA math.NA stat.ML

On the Regularity and Generalization of One-Step Wasserstein-guided Generative Models for PDE-Induced Measures

Likun Lin, Zhongjian Wang, Jack Xin, Zhiwen Zhang

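Affiliations: Department of Mathematics, The University of Hong Kong; Division of Mathematical Sciences, School of Physical and Mathematical Sciences, Nanyang Technological University; Department of Mathematics, University of California at Irvine

AI summary: This paper studies the regularity and generalization of one-step Wasserstein-guided generative models for PDE-induced probability measures, proving regularity of the transport map and generalization properties of the generative model within a theoretical framework, and supporting the theory with experiments.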

Abstract

Despite the remarkable empirical success of generative models, the available theory on their statistical accuracy in scientific computing remains largely pessimistic. This paper develops a theoretical framework for understanding the regularity of transport maps and the generalization properties of one-step Wasserstein-guided generative models for PDE-induced probability measures. We consider normalized target densities associated with linear elliptic and parabolic equations on bounded domains, as well as diffusion and Fokker-Planck equations on the torus. Under standard structural assumptions, we prove that these target measures satisfy doubling conditions. By combining this fact with regularity theory for optimal transport between doubling measures, we show that the optimal transport map from a uniform source measure to the target measure is Hölder continuous. This regularity yields an approximation-theoretic justification for one-step generative models that learn PDE-induced distributions via a single pushforward map. As a representative instance, we study DeepParticle and derive excess-risk bounds characterizing the discrepancy between the learned map and the population-optimal map. We also establish a robustness estimate under target shift and illustrate the theory with experiments that support the derived rates.

2605.21381 2026-05-21 cs.CV cs.LG

Disentangling Generation and Regression in Stochastic Interpolants for Controllable Image Restoration

Yi Liu, Jia Ma, Wengen Li, Jihong Guan, Shuigeng Zhou, Yichao Zhang

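Affiliations: Tongji University; Fudan University

AI summary: This paper proposes DiSI, a framework that disentangles the generation and regression components of the stochastic interpolant process, enabling a continuous, controllable transition from pure regression to full generation and improving the efficiency and accuracy of image restoration.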

Comments 44 pages, 16 figures, 16 tables

Abstract

Recent advances in Image Restoration (IR) have been largely driven by generative methods such as Diffusion Models and Flow Matching, which excel in synthesizing realistic textures while suffering from slow multi-step inference and compromised pixel fidelity. In contrast, classical regression-based IR methods excel precisely in these aspects, offering single-step efficiency and high pixel-level reconstruction fidelity. To bridge this gap, we propose DiSI, a unified framework that Disentangles the underlying Stochastic Interpolant process into independent generation and regression components. This decoupling endows DiSI with remarkable versatility, enabling a continuous and controllable transition from a pure regression process to a fully generative one. Technically, we instantiate this framework with two specific sampling trajectories, accompanied by a unified sampler for high-quality, few-step inference on arbitrary trajectories. Furthermore, we design a dual-branch U-Net style transformer network in pixel space, using a dedicated branch to enhance conditional guidance while ensuring high throughput. Extensive experiments demonstrate that DiSI efficiently achieves competitive results on various IR tasks, while uniquely offering the inference-time flexibility to control the distortion-perception trade-off within a single model.

2605.21372 2026-05-21 cs.CV cs.AI cs.LG cs.RO

Closed Loop Dynamic Driving Data Mixture for Real-Synthetic Co-Training

Hongzhi Ruan, Pei Liu, Weiliang Ma, Zhengning Li, Xueyang Zhang, Jun Ma, Dan Xu, Kun Zhan

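Affiliations: Li Auto; HKUST; HKUST (GZ)

AI summary: This paper proposes a closed-loop dynamic data mixture method that adjusts the training data mixture through a dynamic optimization process to improve model performance, addressing the problem of optimizing the data mixture under a limited budget.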

Abstract

Data scaling is fundamental to modern deep learning, and grows increasingly critical as autonomous driving shifts to end-to-end learning. Real-world driving data is expensive to annotate and scene-biased, making real-synthetic co-training with near-infinite synthetic data a promising direction. However, naively incorporating all available synthetic data is inefficient and leads to distribution shifts, and optimizing the data mixture under practical training budgets remains a critical yet under-explored problem. We therefore argue that the training data mixture requires clear guidance in terms of scene types and quantities. In this work, we approximately cast data mixing as a dynamic optimization process that iteratively adjusts the training data mixture to maximize model performance, guided by closed-loop evaluation feedback. We propose AutoScale, a fully automated closed-loop data engine unifying scene representation, data mixture optimization and retrieval, and model training and evaluation. Specifically, we propose Graph Regularized AutoEncoder (Graph-RAE) for driving scene representations, introduce Cluster-aware Gradient Ascent (Cluster-GA) for cluster-wise importance estimation and reweighting, and perform cluster-guided vector retrieval to select high-value samples. Experiments on NavSim demonstrate that AutoScale outperforms vanilla co-training and cross-domain baselines, achieving better performance with fewer synthetic samples under constrained budgets.

2605.21371 2026-05-21 cs.CV

A Non-Reference Diffusion-Based Restoration Framework for Landsat 7 ETM+ SLC-off Imagery in Antarctica

Leyue Tang, Jonathan Louis Bamber, Gang Qiao, Yuanhang Kong

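Affiliations: College of Surveying and Geo-Informatics, Tongji University, Shanghai 200092, China

AI summary: This paper proposes DiffGF, a non-reference diffusion-based framework that restores Landsat 7 SLC-off imagery without external reference data. It is trained and evaluated on SLCANT, a dedicated Antarctic dataset, shows high-fidelity restoration of Antarctic SLC-off imagery, and demonstrates practical value through a downstream crevasse segmentation application.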

Comments Submitted to IEEE JSTARS

Abstract

Acquiring usable optical imagery in Antarctica is inherently challenging due to prolonged polar nights and frequent cloud cover. Landsat provides the longest and most continuous optical observations and constitutes one of the most important remote sensing data sources for Antarctic studies. However, the scan-line corrector (SLC) failure in 2003 resulted in approximately 22% missing pixels in Landsat 7 ETM+ SLC-off imagery, severely limiting its usability. Unlike many non-polar environments, Antarctic surfaces undergo rapid and substantial changes, which makes it difficult to obtain reliable reference imagery and reduces the applicability of conventional reference-based gap-filling methods. To address this challenge, we propose DiffGF, a non-reference diffusion-based framework for restoring Landsat 7 SLC-off imagery without requiring any external reference data. DiffGF adopts a two-stage design consisting of a latent-space diffusion process and a pixel-space refinement. A dedicated Antarctic dataset, SLCANT, is constructed for training and evaluation. Quantitative and qualitative results demonstrate that DiffGF restores Antarctic SLC-off imagery with high fidelity. Its practical value is further examined through a downstream crevasse segmentation application. The results suggest that DiffGF provides a useful approach for exploiting Landsat 7 SLC-off archives in Antarctica, enabling the extraction of valuable information from historical records and supporting related Antarctic studies.

2605.21369 2026-05-21 cs.CL

Findings of the Fifth Shared Task on Multilingual Coreference Resolution: Expanding Datasets for Long-Range Entities

Michal Novák, Miloslav Konopík, Anna Nedoluzhko, Martin Popel, Ondřej Pražák, Jakub Sido, Milan Straka, Zdeněk Žabokrtský, Daniel Zeman

Affiliations: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics; University of West Bohemia, Faculty of Applied Sciences, Department of Computer Science and Engineering

AI summary: This paper summarizes the findings of the fifth Shared Task on Multilingual Coreference Resolution, describes how the data was expanded with five new datasets and two new languages to address coreference for long-range entities, and reports how traditional systems and LLM-based approaches performed.

Comments Accepted to CODI-CRAC 2026

Abstract

This paper describes the fifth edition of the Shared Task on Multilingual Coreference Resolution, held in conjunction with the CODI-CRAC 2026 workshop. Building on previous iterations, the task required participants to develop systems capable of mention identification and identity-based coreference clustering. The 2026 edition specifically emphasizes long-range entities, defined as coreferential chains spanning significant distances, across many words and sentences. The task expanded its linguistic scope by incorporating five new datasets and two additional languages. These additions leverage version 1.4 of CorefUD, a harmonized multilingual collection comprising 27 datasets in 19 languages. In total, ten systems participated, including four LLM-based approaches (three fine-tuned models and one few-shot approach). While traditional systems still maintained their lead, LLMs demonstrated significant potential, suggesting they may soon challenge established approaches in future editions.

2605.21362 2026-05-21 cs.CL

LASH: Adaptive Semantic Hybridization for Black-Box Jailbreaking of Large Language Models

Abdullah Al Nomaan Nafi, Fnu Suya, Swarup Bhunia, Prabuddha Chakraborty

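Affiliations: University of Maine; University of Tennessee, Knoxville; University of Florida

AI summary: This paper proposes LASH, a framework based on adaptive semantic hybridization that treats the outputs of multiple base attacks as reusable seed prompts and adaptively composes them for different target models and harm categories, achieving higher attack success rates in black-box red-teaming.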

Abstract

Jailbreak attacks expose a persistent gap between the intended safety behavior of aligned large language models and their behavior under adversarial prompting. Existing automated methods are increasingly effective but each commits to a single attack family (e.g., one refinement loop, one tree search, one mutation space, or one strategy library) and no single family dominates: the best-performing method shifts across target models and harm categories, suggesting complementary strengths that per-prompt composition could exploit. We introduce LASH (LLM Adaptive Semantic Hybridization), a black-box framework that treats outputs from multiple base attacks as reusable seed prompts and adaptively composes them for each target request. Given a seed pool, LASH searches over seed subsets and softmax-normalized mixture weights; a composition module synthesizes a single candidate prompt, and a derivative-free genetic optimizer updates the weights using black-box target feedback and a two-stage fitness function combining keyword-based refusal detection with LLM-judge scoring. On JailbreakBench, which contains 100 harmful prompts across 10 categories, we evaluate LASH on six common target models. LASH achieves an average attack success rate of 84.5% under keyword-based evaluation and 74.5% under two-stage evaluation, where responses are first filtered for refusals and then scored by an LLM judge for whether they substantively fulfill the original harmful request. LASH outperforms five state-of-the-art baselines on both metrics with only 30 mean target queries. LASH also remains competitive under three defense mechanisms and induces more success-like internal representations. These results suggest that adaptive composition across heterogeneous jailbreak strategies is a promising direction for black-box red-teaming.

2605.21348 2026-05-21 cs.LG cs.AI cs.NA math.NA physics.comp-ph

Data-Efficient Neural Operator Training via Physics-Based Active Learning

Alicja Polanska, Lorenzo Zanisi, Vignesh Gopakumar, Stanislas Pamela

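Affiliations: University College London; Atomic Energy Authority

AI summary: This paper proposes a physics-based active learning method that improves the data efficiency of neural operator training by using the partial differential equation residual to guide data selection, and validates its data efficiency in numerical experiments on the 1D Burgers equation and the 2D compressible Navier-Stokes equations.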

Comments Presented at the ICLR 2026 Workshop on Artificial Intelligence and Partial Differential Equations

Abstract

Solving partial differential equations with neural operators significantly reduces computational costs but remains bottlenecked by high training data requirements. Active learning offers a natural framework to mitigate this by selectively acquiring the most informative samples in an iterative manner. We introduce physics-based acquisition - a novel physics-informed active learning algorithm that leverages the partial differential equation residual to guide data selection. We validate the method by presenting numerical experiments for the 1D Burgers equation and the 2D compressible Navier-Stokes equations. We show that, in our experiments, physics-based acquisition consistently outperforms random acquisition and matches the state of the art in data efficiency. At the same time, it has the unique advantage of injecting a physics inductive bias into the training process, ensuring that simulation cost is spent where the model's physical understanding is weakest.
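
The abstract describes ranking candidate samples by the PDE residual of the model's prediction. The following is a minimal sketch of that idea for the 1D Burgers equation, not the authors' implementation: the finite-difference residual, the periodic grid, and the names `burgers_residual`, `select_by_residual`, and `surrogate` are all assumptions made for illustration.

```python
import math

def burgers_residual(u_prev, u_next, dx, dt, nu):
    """Mean absolute finite-difference residual of u_t + u*u_x - nu*u_xx (periodic grid)."""
    n = len(u_next)
    total = 0.0
    for i in range(n):
        left, right = u_next[(i - 1) % n], u_next[(i + 1) % n]
        u_t = (u_next[i] - u_prev[i]) / dt
        u_x = (right - left) / (2 * dx)
        u_xx = (right - 2 * u_next[i] + left) / dx ** 2
        total += abs(u_t + u_next[i] * u_x - nu * u_xx)
    return total / n

def select_by_residual(candidates, surrogate, k, dx, dt, nu):
    """Indices of the k candidate initial conditions whose predictions violate the PDE most."""
    scores = [burgers_residual(u0, surrogate(u0), dx, dt, nu) for u0 in candidates]
    order = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return order[:k]

# Hypothetical setup: an untrained surrogate that predicts "no change" over one step.
n = 32
dx = 2 * math.pi / n
zeros = [0.0] * n
sine = [math.sin(i * dx) for i in range(n)]
big_sine = [2.0 * v for v in sine]
picked = select_by_residual([zeros, sine, big_sine], lambda u: list(u), 2, dx, 0.01, 0.01)
print(picked)
```

In an active learning loop, the selected candidates would then be sent to the numerical solver and the resulting solutions added to the training set.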

2605.21343 2026-05-21 cs.CV

OcclusionFormer: Arranging Z-Order for Layout-Grounded Image Generation

Ziye Li, Henghui Ding

Affiliations * Institute of Big Data, College of Computer Science and Artificial Intelligence, Fudan University, China

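AI summary This paper proposes OcclusionFormer, a Z-order-aware Diffusion Transformer framework that addresses inter-object occlusion in layout-to-image models by decoupling instances and compositing them via volume rendering. A queried alignment loss further improves spatial precision and semantic consistency.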

Comments ICML 2026, Project Page: https://henghuiding.com/OcclusionFormer/

Abstract

Recent layout-to-image models have achieved remarkable progress in spatial controllability. However, they still struggle with inter-object occlusion. When bounding boxes overlap, most existing methods lack explicit occlusion information, which makes the generation in intersection regions inherently ambiguous and hinders the determination of complex occlusion relationships. As a result, they often produce entangled textures or physically inconsistent layering in the overlapped areas. To address this issue, we first construct SA-Z, a large-scale dataset enriched with explicit occlusion ordering and pixel-level annotations. Building upon our proposed dataset, we introduce OcclusionFormer, a novel occlusion-aware Diffusion Transformer framework that explicitly models Z-order priority by decoupling instances and compositing them via volume rendering. Furthermore, to ensure fine-grained spatial precision, we introduce a queried alignment loss that explicitly supervises individual instances and enhances semantic consistency. The proposed method effectively reduces ambiguity in overlapping regions, enforces correct occlusion dependencies, and preserves structural integrity, leading to substantial accuracy gains across diverse scenes.

2605.21338 2026-05-21 cs.CL

Text Analytics Evaluation Framework: A Case Study on LLMs and Social Media

Yuefeng Shi, Nedjma Ousidhoum, Jose Camacho-Collados

Affiliations * School of Computer Science and Informatics, Cardiff University

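AI summary This paper proposes a question-based evaluation framework of 470 manually curated questions that assesses LLMs' semantic understanding and reasoning over aggregated text data, and it exposes performance bottlenecks when LLMs process large-scale text data.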

Abstract

LLMs have demonstrated exceptional proficiency in a wide range of NLP tasks. However, a notable gap remains in practical data analysis scenarios, particularly when LLMs are required to process long sequences of unstructured documents, such as news feeds or, as specifically addressed in this paper, social media posts. To empirically assess the effectiveness of LLMs in this setting, we introduce a question-based evaluation framework comprising 470 manually curated questions designed to evaluate LLMs' semantic understanding and reasoning abilities over aggregated text data. We apply our benchmark to diverse Twitter datasets covering various NLP tasks, including sentiment analysis, hate speech detection, and emotion recognition. Our results reveal that the performance depends heavily on input scale and the complexity of the data sources, declining noticeably in multi-label or target-dependent scenarios. In addition, as task complexity increases, performance drops progressively from basic semantic existence identification to more demanding operations such as comparison, counting, and calculation. Furthermore, as the input size grows beyond 500 instances, we identify a common limitation across LLMs, particularly open-weight models: performance degrades substantially, especially on numerical tasks. These findings highlight critical architectural bottlenecks in current LLMs for performing rigorous quantitative analysis over large text collections.

2605.21333 2026-05-21 cs.CL cs.AI

SymbolicLight V1: Spike-Gated Dual-Path Language Modeling with High Activation Sparsity and Sub-Billion-Scale Pre-Training Evidence

Ting Liu

Affiliations * SymbolicLight Research

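AI summary This paper presents SymbolicLight V1, a spike-gated dual-path language model that combines binary Leaky Integrate-and-Fire spike dynamics with a continuous residual stream. An exponential-decay aggregation path handles long-range memory and a spike-gated local attention path handles short-range precision; the paper reports high activation sparsity and sub-billion-scale pre-training evidence.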

Comments 35 pages, 5 figures, 25 tables; public code and model artifacts linked in manuscript

Abstract

Natively trained spiking language models struggle to combine Transformer-like language quality, stable multi-domain pre-training, and high activation sparsity. We present SymbolicLight V1, a spike-gated dual-path language model that combines binary Leaky Integrate-and-Fire spike dynamics with a continuous residual stream. Its Dual-Path SparseTCAM module replaces dense self-attention with an exponential-decay aggregation path for long-range memory and a spike-gated local attention path for short-range precision, complemented by a dynamic context-conditioned decoding head and a bilingual tokenizer. A 194M-parameter SymbolicLight V1 model trained from scratch on a 3B-token Chinese-English corpus reaches held-out validation PPL 8.88-8.93 across four independent runs at >89% per-element activation sparsity. It trails GPT-2 201M by 7.7% in PPL while surpassing GPT-2 124M under the reported comparison. Component ablations at matched 0.5B-token training budgets show that the spike-gated local attention path is the largest contributor, and that replacing LIF dynamics with a deterministic top-k mask at matched sparsity causes a larger degradation, indicating that temporal integration rather than sparsity alone drives performance. We also report a 0.8B-parameter scale-up run trained on 48.8B tokens as evidence of optimization and sparsity preservation, not as a primary quality comparison. Current dense-hardware inference is slower than GPT-2, so neuromorphic deployment is presented as a future sparsity-driven opportunity rather than an achieved hardware speedup.
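
The ablation above contrasts Leaky Integrate-and-Fire (LIF) dynamics with a deterministic top-k mask at matched sparsity. A minimal sketch of the difference follows, assuming a single unit with a hypothetical decay factor, threshold, and hard reset; none of these values or function names come from the paper, and the actual Dual-Path SparseTCAM module is not reproduced here.

```python
def lif_spikes(inputs, decay=0.9, threshold=1.0):
    """Binary spike train of one leaky integrate-and-fire unit over a sequence of inputs."""
    v = 0.0
    spikes = []
    for x in inputs:
        v = decay * v + x              # leaky integration of the membrane potential
        fired = v >= threshold
        spikes.append(1 if fired else 0)
        if fired:
            v = 0.0                    # hard reset after a spike (an assumption)
    return spikes

def topk_mask(inputs, k):
    """Stateless baseline: mark the k largest inputs, ignoring temporal history."""
    ranked = sorted(range(len(inputs)), key=lambda i: inputs[i], reverse=True)
    keep = set(ranked[:k])
    return [1 if i in keep else 0 for i in range(len(inputs))]

inputs = [0.4, 0.5, 0.3, 0.0, 0.9, 0.2]
print(lif_spikes(inputs))    # spikes depend on accumulated history
print(topk_mask(inputs, 2))  # same number of active positions, chosen per value only
```

Both outputs activate two of six positions, but the LIF unit fires where accumulated input crosses the threshold, while the mask picks the largest individual values. This is the distinction between temporal integration and sparsity alone that the ablation is described as probing.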

2605.21330 2026-05-21 cs.RO

Learning Robust Dexterous In-Hand Manipulation from Joint Sensors with Proprioceptive Transformer

Senlan Yao, Chenyu Yang, Jaehoon Kim, Aristotelis Sympetheros, Robert K. Katzschmann

Affiliations * Soft Robotics Laboratory, ETH Zürich, Switzerland

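AI summary This paper studies how far joint sensing alone can go for in-hand manipulation and proposes the Proprioceptive Transformer (PT), an exteroceptive-free approach. A teacher policy trained with reinforcement learning is distilled into PT, which achieves continuous cube rotation on a tendon-driven dexterous hand and outperforms baselines in rotation speed and position-estimation accuracy.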

Comments 8 pages, 6 figures, 3 tables

Abstract

In-hand object manipulation is a fundamental yet challenging capability for dexterous robots. Despite significant progress in dexterous manipulation, existing approaches rely heavily on vision or tactile sensing to track object states, while joint sensing -- the most readily available modality on any robotic hand -- remains largely overlooked, particularly for tendon-driven hands. In this paper, we study how far joint sensing alone can go by asking: (i) whether motor encoders or direct joint sensing provides better proprioceptive feedback, (ii) how to extract environment information from joint measurements, and (iii) whether joint-only control can achieve competitive real-world performance without external perception. We present the Proprioceptive Transformer (PT), an exteroceptive-free approach for continuous cube rotation on a tendon-driven dexterous hand that uses only joint sensing feedback. A teacher policy is first trained via reinforcement learning with privileged object information, then distilled into PT, which operates solely on joint position and velocity histories. The Transformer architecture effectively extracts implicit object state information from temporal patterns in joint sensor readings. Experiments on the real ORCA hand show that our approach achieves 3.1x higher rotation speed than baselines. We also demonstrate that our PT achieves a 23.4% lower RMSE for cube position estimation than the MLP baseline, indicating superior extraction of exteroceptive information from proprioceptive sources.

2605.21325 2026-05-21 cs.LG

Fast and Stable Triangular Inversion for Delta-Rule Linear Transformers

Aleksandros Sobczyk, Gioele Gottardo, Christos K. Matzoros, Mirko De Vita, Filip Skogh, Anastasios Zouzias, Jiawei Zhuang

Affiliations * Computing Systems Lab; Huawei Technologies Switzerland

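AI summary This paper studies fast and stable triangular matrix inversion for Delta-Rule linear Transformers, analyzing direct and iterative algorithms that are rich in matrix products and can therefore use modern hardware efficiently. Experiments assess the performance and stability of each method in low-precision floating point; the triangular inversion achieves up to a 4.3x speed-up, improving layer-level performance while maintaining end-to-end model accuracy.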

Comments Preprint

Abstract

Linear attention has emerged as a cornerstone for efficient long-context architectures, as evidenced by its integration into state-of-the-art open-source models including Qwen3.5/3.6, Kimi Linear, and RWKV-7. Models that incorporate linear attention layers with the so-called Delta-Rule involve the inversion of triangular matrices as a core sub-routine. This operation often forms a performance bottleneck, and, due to its high sensitivity to numerical errors, it can significantly deteriorate end-to-end model accuracy if it is not carefully implemented. This work provides a systematic analysis of both direct and iterative triangular inversion algorithms, targeting methods that are rich in matrix products, and, therefore, have the potential to efficiently utilize modern hardware. To that end, our analysis covers a broad spectrum of mathematical and practical aspects, with a heavy focus on numerical stability, computational complexity, and, ultimately, hardware efficiency and practical considerations. We provide a rigorous experimental evaluation to verify these properties in practical scenarios, and in low-precision floating-point representations, highlighting the strengths and limitations of each method. Performance benchmarks on NPUs reveal up to $4.3\times$ speed-up against the state-of-the-art implementations of SGLang for triangular matrix inversion, leading to significant performance improvements on the entire layer level, while maintaining full end-to-end model accuracy.
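
To illustrate what a matrix-product-rich inversion can look like, here is a sketch of one textbook identity for a unit lower-triangular matrix L = I + N, where N is strictly lower triangular and therefore nilpotent: L^{-1} = (I - N)(I + N^2)(I + N^4)... It is not claimed to be one of the algorithms the paper analyzes, and it says nothing about low-precision stability, which is the paper's actual concern. The helper names are hypothetical.

```python
def matmul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def invert_unit_lower(l):
    """Invert L = I + N (unit lower triangular) using only matrix products and additions."""
    n = len(l)
    eye = identity(n)
    # X = -N, so L^{-1} = (I - X)^{-1} = (I + X)(I + X^2)(I + X^4)...
    x = [[-(l[i][j] - eye[i][j]) for j in range(n)] for i in range(n)]
    inv = add(eye, x)
    power = 1
    while 2 * power < n:   # X^n = 0, so factors with exponent >= n reduce to the identity
        x = matmul(x, x)   # squares the current power of X
        inv = matmul(inv, add(eye, x))
        power *= 2
    return inv

L = [[1.0, 0.0, 0.0, 0.0],
     [2.0, 1.0, 0.0, 0.0],
     [3.0, 4.0, 1.0, 0.0],
     [5.0, 6.0, 7.0, 1.0]]
L_inv = invert_unit_lower(L)
```

The number of factors grows only logarithmically with the matrix size, and every step is a dense matrix product, which is the property that makes such schemes attractive on matrix-multiply hardware.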

2605.21322 2026-05-21 cs.LG

Optimized Federated Knowledge Distillation with Distributed Neural Architecture Search

Chaimaa Medjadji, Sylvain Kubler, Yves Le Traon, Guilain Leduc, Sadi Alawadi, Feras M. Awaysheh

Affiliations * Interdisciplinary Centre for Security, Reliability and Trust (SnT), University of Luxembourg; Blekinge Institute of Technology; ADSLabs, Umeå University

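AI summary This paper proposes FedKDNAS, a framework that combines client-side neural architecture selection with server-coordinated knowledge distillation to address data heterogeneity, system heterogeneity, and communication efficiency in federated learning, improving the Pareto trade-off between accuracy and efficiency.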

Abstract

Federated Learning (FL) enables collaborative model training without centralizing data. However, real-world deployments must simultaneously address statistical heterogeneity across client data (non-IID), system heterogeneity in device capabilities, and communication efficiency. Existing FL approaches mitigate these challenges through improved aggregation, personalization, or knowledge distillation, but they almost universally assume a fixed client architecture, limiting adaptability to heterogeneous data complexity and hardware constraints. This architectural constraint often leads to suboptimal trade-offs between accuracy and efficiency in real-world FL systems. This work introduces FedKDNAS, a distillation-driven FL framework that combines client-side neural architecture selection with server-coordinated knowledge distillation. Each client autonomously selects a lightweight model under accuracy-resource constraints, trains it locally using a hybrid objective combining supervised learning and knowledge distillation, and shares only predictions on a public reference set. The server then aggregates and smooths these predictions, optionally combining them with a teacher model, to produce stable distillation targets for the next round. Extensive evaluation on six datasets against six representative FL baselines (FedAvg, Ditto, FedMD, FedDF, FedDistill, Local-KD) demonstrates that FedKDNAS consistently achieves superior Pareto efficiency, improving accuracy by up to 15% under non-IID conditions, reducing client CPU usage by approximately 28%, and decreasing communication overhead by up to 44 times while maintaining lightweight logit-based communication.
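
The server step described above (aggregate client predictions, smooth them, optionally mix in a teacher) could be sketched as follows. The abstract does not specify the aggregation or smoothing rule, so the plain averaging of logits, the temperature softmax, and the `teacher_weight` parameter are assumptions for illustration only.

```python
import math

def softmax(logits, temperature=1.0):
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp((z - m) / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def aggregate_targets(client_logits, temperature=2.0, teacher_probs=None, teacher_weight=0.5):
    """Average client logits per reference sample, soften them, optionally blend a teacher.

    client_logits[c][s] is the logit vector of client c on public reference sample s.
    """
    n_clients = len(client_logits)
    targets = []
    for s in range(len(client_logits[0])):
        n_classes = len(client_logits[0][s])
        mean = [sum(c[s][k] for c in client_logits) / n_clients for k in range(n_classes)]
        probs = softmax(mean, temperature)  # temperature > 1 smooths the distribution
        if teacher_probs is not None:
            probs = [(1 - teacher_weight) * p + teacher_weight * t
                     for p, t in zip(probs, teacher_probs[s])]
        targets.append(probs)
    return targets

# Two hypothetical clients that disagree on one reference sample.
clients = [[[2.0, 0.0]], [[0.0, 2.0]]]
targets = aggregate_targets(clients)
```

Because only per-sample prediction vectors cross the network, the payload scales with the size of the reference set and the number of classes rather than with model size, which is consistent with the lightweight logit-based communication the abstract mentions.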

2605.21318 2026-05-21 cs.CL cs.AI cs.LG

TextReg: Mitigating Prompt Distributional Overfitting via Regularized Text-Space Optimization

Lucheng Fu, Ye Yu, Yiyang Wang, Yiqiao Jin, Haibo Jin, B. Aditya Prakash, Haohan Wang

Affiliations * Georgia Institute of Technology; University of Illinois Urbana-Champaign

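AI summary This paper studies prompt distributional overfitting and proposes TextReg, a framework that realizes a soft-penalty objective through regularized textual gradients. It combines Dual-Evidence Gradient Purification, Semantic Edit Regularization, and Regularization-Guided Prompt Update to improve out-of-distribution (OOD) generalization.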

Comments Code: https://github.com/luchengfu6/TextReg

Abstract

Large language models (LLMs) are highly sensitive to the prompts used to specify task objectives and behavioral constraints. Many recent prompt optimization methods iteratively rewrite prompts using LLM-generated feedback, but the resulting prompts often become longer, accumulate narrow sample-specific rules, and generalize poorly beyond the training distribution. We study this failure mode as prompt distributional overfitting and argue that it reflects a lack of representation control in discrete text-space optimization. We formalize this view through representational inefficiency, a dual-factor measure that decomposes prompt inefficiency into capacity cost and scope narrowness, attributing distributional prompt overfitting to their coupled growth during optimization. We propose TextReg, a regularization framework that realizes a soft-penalty objective through regularized textual gradients, combining Dual-Evidence Gradient Purification, Semantic Edit Regularization, and Regularization-Guided Prompt Update. Across multiple reasoning benchmarks, TextReg substantially improves out-of-distribution (OOD) generalization, with accuracy gains of up to +11.8% over TextGrad and +16.5% over REVOLVE.

2605.21317 2026-05-21 cs.LG

CRAFT: Conflict-Resolved Aggregation for Federated Training

Ziqi Wang, Qiang Liu, Nils Thuerey

Affiliations * Department of Mathematics, Friedrich-Alexander-Universität Erlangen-Nürnberg; School of Computation, Information and Technology, Technical University of Munich; Munich Center for Machine Learning

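AI summary This paper proposes CRAFT, a framework that treats the global update as a geometric correction problem in order to aggregate conflicting client updates in federated learning, improving global model accuracy while reducing performance disparity across clients.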

Abstract

The aggregation of conflicting client updates remains a fundamental bottleneck in federated learning (FL) under heterogeneous data distributions. Naive averaging can produce a global update that improves the global objective while conflicting with specific clients, causing degradation for those clients. In this work, we propose CRAFT (Conflict-Resolved Aggregation for Federated Training), a new aggregation framework that treats the global update as a geometric correction problem. We formulate aggregation as finding the update closest to a reference direction while satisfying conflict-free alignment constraints. We derive a closed-form expression for the constrained optimization problem, avoiding the computational overhead of iterative solvers. Furthermore, we use a layer-wise adaptation to address conflicts at varying feature granularities. We provide a theoretical analysis showing that CRAFT promotes a common-descent structure and mitigates conflicts through its projection geometry. Extensive experiments on heterogeneous benchmarks demonstrate that CRAFT improves the accuracy of the global model while reducing performance disparity across clients compared with state-of-the-art baselines. The source code for CRAFT is available at https://github.com/tum-pbs/CRAFT.
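
The formulation above (the update closest to a reference direction subject to conflict-free alignment constraints) has a simple closed form in the special case of a single constraint, which is sketched below. This is only the standard projection onto one half-space, shown as an assumption-laden illustration; CRAFT's actual closed-form expression for multiple clients and its layer-wise adaptation are not given in the abstract and are not reproduced here.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def project_conflict_free(reference, client_update):
    """Closest vector to `reference` whose inner product with `client_update` is >= 0."""
    alignment = dot(reference, client_update)
    if alignment >= 0:
        return list(reference)                     # constraint already satisfied
    scale = alignment / dot(client_update, client_update)
    # Remove only the component of the reference that opposes the client update.
    return [r - scale * c for r, c in zip(reference, client_update)]

reference = [1.0, 0.0]        # e.g. the naive average of client updates
client = [-1.0, 1.0]          # a client whose update conflicts with the average
corrected = project_conflict_free(reference, client)
```

After the correction the update is orthogonal to the conflicting client's direction rather than opposed to it, so to first order it no longer degrades that client.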

2605.21313 2026-05-21 cs.LG

A New Framework to Analyse the Distributional Robustness of Deep Neural Networks

Divij Khaitan, Subhashis Banerjee

Affiliations * Microsoft; Ashoka University

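AI summary This paper proposes a new framework that analyses and quantifies the distributional robustness of deep neural networks by studying the interactions between layer weights and activations. It demonstrates the framework on models trained on CIFAR-10 and ImageNet and shows that the proposed metrics can distinguish networks that have memorised their training data from those that have not.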

Comments 9 pages, 6 figures, 3 tables

Abstract

Deep neural networks have achieved impressive performance on a variety of tasks, but their brittleness to distributional shifts remains a significant barrier to real-world deployment. In this paper, we propose a framework to analyse and quantify the distributional robustness of neural networks by studying the interactions between layer weights and activations. We model these interactions using Bernoulli distributions, using the separation between classes as a diagnostic proxy for robustness. We demonstrate the usefulness of this framework through models trained on CIFAR-10 and ImageNet. We show that our proposed metrics can distinguish between networks that have memorised their training data and those that have not. We also perform analogous experiments in the activation space and find that the same properties do not hold up. Additionally, we investigate the behaviour of our metrics under various distribution shifts and show that these shifts reduce separation under our path-based diagnostics. Our results suggest that this framework provides useful model-level diagnostics of representation structure and robustness.

2605.21311 2026-05-21 cs.LG cs.AI

DeCoR: Design and Control Co-Optimization for Urban Streets Using Reinforcement Learning

Bibek Poudel, Lei Zhu, Kevin Heaslip, Sai Swaminathan, Weizi Li

Affiliations * University of Tennessee, Knoxville, TN, USA; University of North Carolina at Charlotte, Charlotte, NC, USA; University of California, Riverside, CA, USA

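AI summary This paper proposes DeCoR, a reinforcement learning framework that co-optimizes crosswalk layout and network-level signal control for urban streets, reducing pedestrian arrival time to the nearest crosswalk and substantially lowering pedestrian and vehicle wait times.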

Comments 22 pages, 8 figures

Abstract

Modern vision systems can detect, track, and forecast urban actors at scale, yet translating perception outputs to urban design remains limited. We introduce DeCoR, a two-stage reinforcement learning framework that leverages flow observations to co-optimize crosswalk layout and network-level signal control. The design stage encodes the pedestrian network as a graph and learns a generative policy that parameterizes a Gaussian mixture model over crosswalk location and width, from which new crosswalks are sampled. For each layout, a shared control policy learns adaptive signal timings to minimize joint pedestrian and vehicle delay. On a 750 m real-world urban corridor with demand sensed from video and Wi-Fi logs, DeCoR learns a layout that reduces pedestrian arrival time to their nearest crosswalk by 23% while using fewer crosswalks than existing configurations. On the control side, DeCoR reduces pedestrian and vehicle wait time by 79% and 65%, respectively, relative to fixed-time signalization. Further, the control policy generalizes to demands outside of training and is robust to layout changes without retraining.

2605.21309 2026-05-21 cs.CV cs.RO

Hyper-V2X: Hypernetworks for Estimating Epistemic and Aleatoric Uncertainty in Cooperative Bird's-Eye-View Semantic Segmentation

Abhishek Dinkar Jagtap, Sanath Tiptur Sadashivaiah, Andreas Festag

Affiliations * CARISSMA Institute for Electric, COnnected, and Secure Mobility (C-ECOS), Technische Hochschule Ingolstadt; University of Applied Sciences Aschaffenburg

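AI summary This paper proposes Hyper-V2X, a hypernetwork-based framework for estimating epistemic and aleatoric uncertainty in cooperative V2X perception. A partial weight generation scheme and a V2X context embedding module condition a Bayesian hypernetwork to generate weight distributions for stochastic bird's-eye-view segmentation, improving perception reliability.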

Comments Accepted for IEEE Intelligent Vehicle Symposium (IV) 2026

Abstract

Cooperative perception enabled by Vehicle-to-Everything (V2X) communication enhances autonomous driving safety by creating a unified environmental representation through shared sensory data. While recent works have advanced multi-agent fusion for improved perception, uncertainty quantification in such cooperative frameworks remains largely unexplored. This paper introduces Hyper-V2X, a hypernetwork-based framework for estimating both epistemic and aleatoric uncertainties in V2X-based perception. Specifically, we propose a partial weight generation scheme and V2X context embedding module that conditions a Bayesian hypernetwork on fused multi-agent features to generate weight distributions for stochastic Bird's-Eye-View (BEV) segmentation. Unlike existing deterministic BEV models, Hyper-V2X enables efficient uncertainty estimation with little computational overhead. Our approach is architecture-agnostic and can be seamlessly integrated with modern cooperative backbones such as CoBEVT. Experiments on the OPV2V benchmark demonstrate that Hyper-V2X provides accurate, well-calibrated uncertainty estimates and improves overall perception reliability. Our code and benchmark are publicly available under an open-source license: https://github.com/abhishekjagtap1/Hyper-V2X

2605.21308 2026-05-21 cs.CV cs.AI

Deformba: Vision State Space Model with Adaptive State Fusion

Hongyu Ke, Jack Morris, Yongkang Liu, Satoshi Kitai, Kentaro Oguchi, Yi Ding, Haoxin Wang

Affiliations * Department of Computer Science, Georgia State University; University of Tennessee Knoxville

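AI summary This paper proposes Deformba, a context-adaptive method that dynamically augments spatial structural information while maintaining the linear complexity of state space models. It also supports multi-modal fusion in the manner of cross attention and shows broad applicability across 2D and 3D vision tasks.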

Journal ref Forty-Third International Conference on Machine Learning (ICML 2026)

Abstract

State Space Models (SSMs) have emerged as a powerful and efficient alternative to Transformers, demonstrating linear-time complexity and exceptional sequence modeling capabilities. However, their application to vision tasks remains challenging. First, existing vision SSMs largely depend on manually designed fixed scanning methods to flatten image patches into sequences, which imposes predefined geometric structures and increases the complexity. Second, the broader adoption of vision SSMs is hindered in domains that require query-based interactions between distinct information streams. This is a result of the inherently causal and self-referential nature of SSMs designed for 1D sequence modeling tasks. This fusion mechanism is indispensable for critical perception tasks such as multi-view 3D fusion. To address these limitations, we propose Deformba, a context adaptive method that dynamically augments the spatial structural information while maintaining the linear complexity of SSMs. Deformba also allows multi-modal fusion like cross attention. To demonstrate the effectiveness and general applicability of Deformba, we test its performance on general 2D vision tasks such as image classification, object detection, and segmentation, as well as 3D vision tasks like BEV perception. Extensive experiments show that Deformba achieves strong performance across various visual perception benchmarks.

2605.21303 2026-05-21 cs.LG cs.AI cs.LO

From Circuit Evidence to Mechanistic Theory: An Inductive Logic Approach

Nura Aljaafari, Danilo S. Carvalho, Andre Freitas

Affiliations * Department of Computer Science, University of Manchester; Idiap Research Institute; CRUK National Biomarker Centre, University of Manchester

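AI summary This paper proposes an inductive-logic approach that treats circuit interpretation as inductive theory construction, providing formal infrastructure for cumulative mechanistic science. Causal functional signatures and architectural signatures make mechanistic claims explicit and portable across model scales.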

Comments 27 pages, 10 Figures, 14 Tables

Abstract

Mechanistic interpretability produces circuit-level causal analyses of neural network behaviour, but discovered circuits often remain isolated experimental artefacts: there is no shared formal representation for what circuits compute, how they relate, or when two findings provide evidence for the same mechanism. This work provides a formal infrastructure for cumulative mechanistic science by treating circuit interpretation as inductive theory construction. Each circuit is characterised at two levels: a Causal Functional Signature (CFS), which grounds component behaviour in causal attribution evidence and token role profiles, and an architectural signature $\tau_{\mathrm{arch}}$, learned by inductive logic programming (ILP) from scale-invariant structural predicates. Together, these constitute a formal coherence layer that makes mechanistic claims explicit, comparable via $\theta$-subsumption, and portable across model scales. CFS reveals qualitatively distinct computational strategies across task types, including attention-mediated copying versus MLP-mediated binding. ILP signatures achieve substantially better structural separation than graph kernel and feature-vector baselines, and support principled transfer across model scales and architecture families.

2605.21301 2026-05-21 cs.LG cs.CV

Automatic Discovery of Disease Subgroups by Contrasting with Healthy Controls

Robin Louiset, Edouard Duchesnay, Benoit Dufumier, Antoine Grigis, Pietro Gori

Affiliations * NeuroSpin; Université Paris-Saclay; CEA; LTCI Institut Polytechnique de Paris

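AI summary This paper proposes a method that discovers interpretable and homogeneous disease subgroups by contrasting patients with healthy controls, and it reports improved subgroup estimation quality on medical imaging datasets.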

Comments Accepted to Data Mining and Knowledge Discovery, ECML-PKDD 2026 Journal Track

详情
AI中文摘要

在生物医学亚组发现中,研究者致力于在患者群体中发现可解释且同质的亚组。在本文中,我们假设健康个体(即对照组)与患者共享一些无关的变异性因素,从而提出了一种称为Deep UCSL的对比亚组发现方法。通过对比患者与对照组,Deep UCSL识别出仅由病理因素驱动的亚组,忽略与健康个体共享的共同变异性。我们的框架采用深度特征提取器来学习判别性表示空间。数学上,我们基于潜在聚类和患者/对照组标签的条件联合似然推导出一种新的损失函数,并通过期望最大化策略交替优化亚组推断和特征编码器更新。一个正则化项进一步鼓励表示捕捉疾病特异性变异性,同时忽略与对照组共享的变异性。与先前相关工作相比,我们的方法在MNIST示例和四个不同的医学影像数据集上展示了改进的亚组估计质量。代码和数据集可在:https://github.com/rlouiset/deep_ucsl获取。

English abstract

In biomedical Subgroup Discovery, practitioners are interested in discovering interpretable and homogeneous subgroups within a group of patients. In this paper, assuming that healthy subjects (i.e., controls) share common but irrelevant factors of variation with the patients, we motivate and develop a Contrastive Subgroup Discovery method, entitled Deep UCSL. By contrasting patients with controls, Deep UCSL identifies subgroups driven solely by pathological factors, ignoring common variability shared with healthy subjects. Our framework employs a deep feature extractor to learn a discriminative representation space. Mathematically, we derive a novel loss based on the conditional joint likelihood of latent clusters and patient/control labels, optimized via an Expectation-Maximization strategy alternating between subgroup inference and feature encoder updates. A regularization term further encourages representations to capture disease-specific variability while ignoring variability shared with controls. Compared to previous related works, our approach quantitatively improves the quality of the estimated subgroups, as demonstrated on an MNIST example and four distinct real medical imaging datasets. Code and datasets are available at: https://github.com/rlouiset/deep_ucsl.
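The alternation described in the abstract can be illustrated with a deliberately simplified sketch. It is not the Deep UCSL loss: the deep encoder, the control group and the regularization term are all omitted, and a soft k-means (EM for an equal-weight, fixed-variance Gaussian mixture) on fixed 1-D patient features stands in for the real model. The function names, the `temperature` parameter and the toy data are assumptions made for this example.

```python
import math


def e_step(features, centroids, temperature=1.0):
    """Subgroup inference: soft-assign each patient feature to every subgroup."""
    responsibilities = []
    for x in features:
        logits = [-((x - c) ** 2) / temperature for c in centroids]
        top = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(value - top) for value in logits]
        total = sum(exps)
        responsibilities.append([e / total for e in exps])
    return responsibilities


def m_step(features, responsibilities, centroids):
    """Parameter update: move each centroid to its responsibility-weighted mean.

    In the paper this step updates a feature encoder; here it only updates
    centroids, which is the stand-in assumed for this sketch.
    """
    updated = []
    for k, old in enumerate(centroids):
        weight = sum(r[k] for r in responsibilities)
        if weight == 0.0:
            updated.append(old)  # keep an empty subgroup where it was
            continue
        mean = sum(r[k] * x for r, x in zip(responsibilities, features)) / weight
        updated.append(mean)
    return updated


def fit(features, centroids, n_iters=20):
    """Alternate the two steps for a fixed number of iterations."""
    responsibilities = e_step(features, centroids)
    for _ in range(n_iters):
        centroids = m_step(features, responsibilities, centroids)
        responsibilities = e_step(features, centroids)
    return centroids, responsibilities
```

Calling `fit([0.0, 0.2, 0.1, 5.0, 5.1, 4.9], [1.0, 4.0])` separates the two well-spaced groups of toy values. In the actual method, the quantity being optimized would additionally depend on the patient/control labels, which is what lets the subgroups ignore variability shared with controls.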

2605.21300 2026-05-21 cs.CV

Reducing Object Hallucination in LVLMs via Emphasizing Image-negative Tokens

Reducing Object Hallucination in LVLMs by Emphasizing Image-Negative Tokens

Meng Shen, Minghao Wu, Deepu Rajan

Affiliations * Nanyang Technological University; Monash University

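AI summary This paper reduces object hallucination in LVLMs by emphasizing image-negative tokens, proposing to adjust the training weights of different tokens and to filter the training data in order to control hallucination.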

Comments 20 pages, 10 figures, 10 tables

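Details
AI abstract (translated from Chinese)

Object hallucination is a major challenge that hinders the practical use of large vision-language models (LVLMs). We hypothesize that one possible source of hallucination is the model's tendency to prioritize generating text over interacting meaningfully with the image. To explore this, we study the generation process and divide text tokens into three categories, image-positive, invariant, and negative, according to their visual dependence on the input image tokens. Our analysis finds that most generated tokens are only minimally influenced by the image information. This suggests that the training stage places more emphasis on learning to follow textual instructions than on extracting information from images. Based on this finding, we propose adjusting the training weights of tokens according to their visual dependence in order to control hallucination. In addition, as a data filtering strategy, we remove a portion of the training data that is likely to contain more hallucinations. Both methods reduce hallucination without sacrificing response length or introducing additional computational cost during inference. We validate our methods on three LVLM variants, demonstrating their effectiveness and generality.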

English abstract

Object hallucination is a significant challenge that hinders the application of large vision-language models (LVLMs) in practice. We hypothesize that one possible origin of hallucination is the model's tendency to prioritize text generation over meaningful interaction with images. To explore this, we examine the generation process and categorize text tokens into three groups: image-positive, invariant, and negative, based on their visual dependence on input image tokens. Our analysis reveals that most generated tokens are minimally influenced by the image information. This suggests that during the model's training stage, more emphasis is placed on learning how to follow textual instructions, rather than extracting information from images. Based on this finding, we propose adjusting the training weights of different tokens depending on their visual dependence to control hallucination. Additionally, we remove a portion of the training data that potentially contains more hallucinations as a data filtering strategy. Both methods achieve a reduction in hallucination without compromising response length or introducing additional computational costs during inference. We validate our methods across three LVLM variants, demonstrating their effectiveness and general applicability.
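The token re-weighting idea can be sketched as a weighted negative log-likelihood. The abstract does not give the categorization rule or the weight values, so everything specific here is an assumption: visual dependence is taken to be the change in a token's log-probability when the image is provided versus withheld, `eps` is a hypothetical threshold, and the weights are placeholders.

```python
def categorize(logp_with_image, logp_without_image, eps=0.1):
    """Label a token by how the image changes its log-probability.

    Assumed rule for this sketch: the image helps -> "positive",
    the image hurts -> "negative", little change -> "invariant".
    """
    delta = logp_with_image - logp_without_image
    if delta > eps:
        return "positive"
    if delta < -eps:
        return "negative"
    return "invariant"


def weighted_nll(logps_with_image, logps_without_image, weights, eps=0.1):
    """Weighted average of per-token negative log-likelihoods.

    `weights` maps each category name to a non-negative training weight.
    """
    total, norm = 0.0, 0.0
    for lp_with, lp_without in zip(logps_with_image, logps_without_image):
        w = weights[categorize(lp_with, lp_without, eps)]
        total += -w * lp_with
        norm += w
    if norm == 0.0:
        raise ValueError("all token weights are zero")
    return total / norm
```

With every weight set to 1 this reduces to the ordinary mean token loss, so the weights act purely as a re-balancing knob. A call such as `weighted_nll(lp_img, lp_noimg, {"positive": 1.0, "invariant": 1.0, "negative": 2.0})` would up-weight the tokens labelled negative, in the spirit of the paper's title.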