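arXivDaily: a daily academic digest synced with the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and other fields.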

2605.12227 2026-06-17 cs.CL version update

A Recipe for Long-Context Reasoning in Large Language Models via On-Policy Optimization and Distillation

Combining On-Policy Optimization and Distillation to Improve Long-Context Reasoning in Large Language Models

Miguel Moura Ramos, Duarte M. Alves, André F. T. Martins

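Affiliations: Instituto Superior Técnico, Universidade de Lisboa; Instituto de Telecomunicações; TransPerfect

AI summary: This paper proposes dGRPO, which combines on-policy optimization with distillation: a strong teacher model provides dense guidance, improving long-context reasoning while preserving short-context capabilities.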

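AI abstract (translated from Chinese)

Adapting large language models (LLMs) to long-context tasks requires post-training methods that maintain accuracy and coherence. Existing approaches have limitations: (1) offline methods such as supervised fine-tuning (SFT) and knowledge distillation (KD) suffer from exposure bias and struggle to recover from model-generated errors; (2) on-policy reinforcement learning methods such as Group Relative Policy Optimization (GRPO) match the states the model actually generates, but sparse rewards make them unstable and sample-inefficient; (3) on-policy distillation (OPD) provides dense token-level guidance but does not directly optimize arbitrary reward signals. This paper proposes Distilled Group Relative Policy Optimization (dGRPO), which augments GRPO with dense guidance obtained through OPD from a stronger teacher model. We also introduce LongBlocks, a synthetic long-context dataset covering multi-hop reasoning, contextual grounding, and long-form generation. We run extensive experiments and ablations comparing offline training, sparse-reward GRPO, and our combined method, arriving at an improved recipe for long-context alignment. Overall, our results indicate that combining outcome-based policy optimization with knowledge distillation in a single objective gives a more stable and effective approach to long-context reasoning while preserving short-context capabilities.

English abstract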

Existing approaches to post-train models for long-context tasks face complementary limitations: (i) supervised fine-tuning (SFT) provides stable supervision but suffers from exposure bias; (ii) reinforcement learning methods such as Group Relative Policy Optimization (GRPO) train on model-generated trajectories but struggle with long-horizon credit assignment and sparse rewards; and (iii) on-policy distillation (OPD) provides dense token-level guidance but does not directly optimize task rewards. We study these complementary strategies for long-context alignment and derive a recipe that combines GRPO with OPD-style teacher guidance: the student learns from its own rollouts using outcome-level rewards, while a stronger teacher provides dense token-level regularization in place of the standard reference policy. This is especially useful when process-level supervision is difficult to obtain. To support this study, we introduce LongBlocks, a synthetic multilingual dataset spanning multi-hop reasoning, contextual grounding, and long-form generation. Through controlled ablations, we isolate the roles of cold-start initialization, teacher anchoring, and data mixing, showing that our recipe yields a more stable and effective path to long-context reasoning than GRPO or OPD while preserving short-context capabilities.
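
As a rough illustration of the recipe described in this abstract, the sketch below standardizes outcome rewards within a group of rollouts and adds a per-token KL penalty against a teacher distribution, which stands in for the usual reference policy. The function names, the coefficient `beta`, the direction of the KL term, the unclipped surrogate, and all toy numbers are assumptions made for illustration; none of them are taken from the paper.

```python
import math

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize outcome rewards within one group of rollouts for the same prompt."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]

def token_kl(student_probs, teacher_probs):
    """KL(student || teacher) at one token position over a shared vocabulary.

    Assumes the teacher assigns non-zero probability wherever the student does.
    """
    return sum(p * math.log(p / q)
               for p, q in zip(student_probs, teacher_probs) if p > 0)

def rollout_objective(advantage, token_logprobs, token_kls, beta=0.1):
    """Assumed per-rollout objective: advantage-weighted log-likelihood minus a teacher penalty."""
    return advantage * sum(token_logprobs) - beta * sum(token_kls)

# Toy group of four rollouts with binary outcome rewards (made-up numbers).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
kl = token_kl([0.7, 0.2, 0.1], [0.6, 0.3, 0.1])
print(advantages, kl)
```

The outcome reward only says whether a whole rollout succeeded, while the KL term gives a signal at every token, which is the sparse-versus-dense contrast the abstract draws.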

2605.07971 2026-06-17 cs.CV cs.LG version update

DVD: Discrete Voxel Diffusion for 3D Generation and Editing

Zhengrui Xiang, Jiaqi Wu, Fupeng Sun, Heliang Zheng, Yingzhen Li

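Affiliations: Imperial College London; Math Magic; Hitem3D

AI summary: Proposes Discrete Voxel Diffusion (DVD), a discrete diffusion framework that treats voxel occupancy as a discrete variable to support sparse-voxel generation, uncertainty estimation, and editing, avoiding continuous-to-discrete thresholding and offering interpretable generation dynamics.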

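AI abstract (translated from Chinese)

We introduce Discrete Voxel Diffusion (DVD), a discrete diffusion framework for generating, assessing, and editing sparse voxels in SLat (Structured LATent) based 3D generation pipelines. Although discrete diffusion has generally not displaced continuous diffusion in image-like generation, we show that it can serve as an effective first-stage prior for sparse voxel scaffolds. By treating voxel occupancy as a natively discrete variable, DVD avoids continuous-to-discrete thresholding and offers a simple framework for voxel generation, uncertainty estimation, and editing. Beyond quality gains, DVD provides more interpretable generation dynamics through explicit categorical modeling. We also use predictive entropy as a robust uncertainty measure to identify ambiguous voxel regions and complex samples, which helps with tasks such as data filtering and quality assessment. Finally, we propose a lightweight fine-tuning strategy that uses block-structured perturbation patterns. It lets the model inpaint and edit voxels within a single sampling round, with negligible auxiliary computation and no additional model evaluations.

English abstract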

We introduce Discrete Voxel Diffusion (DVD), a discrete diffusion framework to generate, assess, and edit sparse voxels for SLat (Structured LATent) based 3D generative pipelines. Although discrete diffusion has not generally displaced continuous diffusion in image-like generation, we show that it can be an effective first-stage prior for sparse voxel scaffolds. By treating voxel occupancy as a native discrete variable, DVD avoids continuous-to-discrete thresholding and provides a simple framework for voxel generation, uncertainty estimation, and editing. Beyond quality gains, DVD provides more interpretable generation dynamics through explicit categorical modeling. Furthermore, we leverage the predictive entropy as a robust uncertainty metric to identify ambiguous voxel regions and complicated samples, facilitating tasks such as data filtering and quality assessment. Finally, we propose a lightweight fine-tuning strategy using block-structured perturbation patterns. This approach empowers the model to inpaint and edit voxels within a single sampling round, requiring negligible auxiliary computation and no additional model evaluations. Code is available at https://github.com/TeCai/DVD.
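
To make the predictive-entropy idea concrete, here is a minimal sketch that treats each voxel's occupancy as a Bernoulli probability and flags high-entropy voxels as ambiguous. The helper names, the 0.9-bit threshold, and the toy probabilities are hypothetical; the abstract does not say how DVD aggregates or thresholds entropy.

```python
import math

def binary_entropy(p):
    """Entropy in bits of a Bernoulli occupancy probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def ambiguous_voxels(occupancy_probs, threshold=0.9):
    """Return indices of voxels whose predictive entropy exceeds the threshold."""
    return [i for i, p in enumerate(occupancy_probs)
            if binary_entropy(p) > threshold]

# Toy predicted occupancy probabilities for four voxels (made-up numbers).
probs = [0.02, 0.5, 0.97, 0.4]
print(ambiguous_voxels(probs))
```

Voxels predicted near 0 or 1 carry little entropy, while those near 0.5 carry close to one bit, so a per-sample average of these values could plausibly serve as a filter score.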

2512.09373 2026-06-17 cs.CV version update

FUSER: Feed-Forward MUltiview 3D Registration Transformer and SE(3)$^N$ Diffusion Refinement

Haobo Jiang, Jin Xie, Jian Yang, Liang Yu, Jianmin Zheng

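Affiliations: Nanyang Technological University; Alibaba Group; Nanjing University

AI summary: Proposes FUSER, the first feed-forward multiview registration transformer, which predicts global poses directly in a unified latent space without pairwise matching, and introduces FUSER-DF, an SE(3)^N diffusion refinement framework that corrects its estimates.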

Comments Accepted to CVPR 2026 (Oral)

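AI abstract (translated from Chinese)

Registration of multiview point clouds has traditionally relied on extensive pairwise matching to build a pose graph for global synchronization, which is computationally expensive and inherently ill-posed without holistic geometric constraints. This paper presents FUSER, the first feed-forward multiview registration transformer, which jointly processes all scans in a unified, compact latent space and directly predicts global poses without any pairwise estimation. To stay tractable, FUSER encodes each scan into low-resolution superpoint features with a sparse 3D CNN that preserves absolute translation cues, and performs efficient intra-scan and inter-scan reasoning through a Geometric Alternating Attention module. In particular, we transfer 2D attention priors from off-the-shelf foundation models to strengthen 3D feature interaction and geometric consistency. Building on FUSER, we further introduce FUSER-DF, an SE(3)^N diffusion refinement framework that corrects FUSER's estimates by denoising in the joint SE(3)^N space. FUSER serves as a surrogate multiview registration model for constructing the denoiser, and a prior-conditioned SE(3)^N variational lower bound is derived for denoising supervision. Extensive experiments on 3DMatch, ScanNet, and ArkitScenes show that our method achieves superior registration accuracy and excellent computational efficiency.

English abstract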

Registration of multiview point clouds conventionally relies on extensive pairwise matching to build a pose graph for global synchronization, which is computationally expensive and inherently ill-posed without holistic geometric constraints. This paper proposes FUSER, the first feed-forward multiview registration transformer that jointly processes all scans in a unified, compact latent space to directly predict global poses without any pairwise estimation. To maintain tractability, FUSER encodes each scan into low-resolution superpoint features via a sparse 3D CNN that preserves absolute translation cues, and performs efficient intra- and inter-scan reasoning through a Geometric Alternating Attention module. In particular, we transfer 2D attention priors from off-the-shelf foundation models to enhance 3D feature interaction and geometric consistency. Building upon FUSER, we further introduce FUSER-DF, an SE(3)$^N$ diffusion refinement framework to correct FUSER's estimates via denoising in the joint SE(3)$^N$ space. FUSER acts as a surrogate multiview registration model to construct the denoiser, and a prior-conditioned SE(3)$^N$ variational lower bound is derived for denoising supervision. Extensive experiments on 3DMatch, ScanNet and ArkitScenes demonstrate that our approach achieves superior registration accuracy and outstanding computational efficiency.

2604.22748 2026-06-17 cs.AI version update

Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond

Meng Chu, Xuan Billy Zhang, Kevin Qinghong Lin, Lingdong Kong, Jize Zhang, Teng Tu, Weijian Ma, Ziqi Huang, Senqiao Yang, Wei Huang, Yeying Jin, Zhefan Rao, Jinhui Ye, Xinyu Lin, Xichen Zhang, Qisheng Hu, Shuai Yang, Leyang Shen, Wei Chow, Yifei Dong, Fengyi Wu, Quanyu Long, Bin Xia, Shaozuo Yu, Mingkang Zhu, Wenhu Zhang, Jiehui Huang, Haokun Gui, Runyi Li, Chenyu Tang, Dong Huang, Xuhang Chen, Rui Liu, Chengzu Li, Shiyi Du, Xu Huang, Haoxuan Che, Long Chen, Qifeng Chen, Wenya Wang, Wenxuan Zhang, Xiaojuan Qi, Yang Deng, Yanwei Li, Mike Zheng Shou, Zhi-Qi Cheng, See-Kiong Ng, Ziwei Liu, Philip Torr, Jiaya Jia

Affiliations: Hong Kong University of Science and Technology; National University of Singapore; University of Oxford; Nanyang Technological University; Chinese University of Hong Kong; University of Hong Kong; University of Washington; University of Tokyo; Carnegie Mellon University; University of California, Berkeley; University of Cambridge; Singapore University of Technology and Design; Singapore Management University; XGEN Labs

AI summary: Introduces a "levels x laws" taxonomy defining three capability levels and four governing-law regimes, synthesizes over 400 works and summarizes more than 100 systems, analyzes methods, failure modes, and evaluation practices, proposes decision-centric evaluation principles and a reproducible evaluation package, and charts a path from passive prediction toward agentic world models that reshape their environments.

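AI abstract (translated from Chinese)

As AI systems move from generating text to accomplishing goals through sustained interaction, the ability to model environment dynamics becomes a key bottleneck. Agents that manipulate objects, navigate software, coordinate with others, or design experiments need predictive environment models, yet the term "world model" means different things in different research communities. This paper introduces a "levels x laws" taxonomy organized along two axes. The first axis defines three capability levels: the L1 Predictor learns one-step local transition operators; the L2 Simulator composes them into multi-step, action-conditioned rollouts that respect domain laws; the L3 Evolver autonomously revises its model when predictions fail. The second axis identifies four governing-law regimes: physical, digital, social, and scientific. These regimes determine the constraints a world model must satisfy and where it is likely to fail. Using this framework, the paper synthesizes over 400 works and summarizes more than 100 representative systems spanning model-based reinforcement learning, video generation, web and GUI agents, multi-agent social simulation, and AI-driven scientific discovery. It analyzes methods, failure modes, and evaluation practices for each level-regime pair, proposes decision-centric evaluation principles and a minimal reproducible evaluation package, and outlines architectural guidance, open problems, and governance challenges. The resulting roadmap connects previously isolated communities, moving from passive next-step prediction toward world models that can simulate, and ultimately reshape, the environments in which agents operate.

English abstract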

As AI systems move from generating text to accomplishing goals through sustained interaction, the ability to model environment dynamics becomes a central bottleneck. Agents that manipulate objects, navigate software, coordinate with others, or design experiments require predictive environment models, yet the term world model carries different meanings across research communities. We introduce a "levels x laws" taxonomy organized along two axes. The first defines three capability levels: L1 Predictor, which learns one-step local transition operators; L2 Simulator, which composes them into multi-step, action-conditioned rollouts that respect domain laws; and L3 Evolver, which autonomously revises its own model when predictions fail against new evidence. The second identifies four governing-law regimes: physical, digital, social, and scientific. These regimes determine what constraints a world model must satisfy and where it is most likely to fail. Using this framework, we synthesize over 400 works and summarize more than 100 representative systems spanning model-based reinforcement learning, video generation, web and GUI agents, multi-agent social simulation, and AI-driven scientific discovery. We analyze methods, failure modes, and evaluation practices across level-regime pairs, propose decision-centric evaluation principles and a minimal reproducible evaluation package, and outline architectural guidance, open problems, and governance challenges. The resulting roadmap connects previously isolated communities and charts a path from passive next-step prediction toward world models that can simulate, and ultimately reshape, the environments in which agents operate. Code and resources are available at: https://github.com/matrix-agent/awesome-agentic-world-modeling.

2604.10827 2026-06-17 cs.AI version update

Know Thy Reasoner: Not All Language Models Explore Alike

Your Model's Diversity, Not the Method, Determines the Reasoning Strategy

Moulik Choraria, Argyrios Gerogiannis, Anirban Das, Supriyo Chakraborty, Sourya Basu, Sambit Sahu, Lav R. Varshney

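Affiliations: UIUC (University of Illinois Urbana-Champaign); Capital One

AI summary: Argues that a model's diversity shapes which reasoning strategy works, analyzes reasoning uncertainty through a theoretical framework, and verifies that models differ in how they respond to depth-based refinement versus parallel sampling.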

Comments This is a full-length extension of the workshop paper that appeared in the ICLR 2026 Workshop on LLM Reasoning

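AI abstract (translated from Chinese)

Scaling compute for LLM reasoning requires splitting a budget between exploring solution approaches (breadth) and refining promising solutions (depth). Most methods trade the two off implicitly, but why a particular trade-off works remains unclear, and validation on a single model hides the role of the model itself. We argue that the optimal strategy depends on the model's diversity profile, that is, how probability mass is spread across solution approaches, and that this profile must be characterized before adopting any exploration strategy. We decompose reasoning uncertainty within a theoretical framework and derive the conditions under which tree-style depth refinement outperforms parallel sampling. We validate this on the Qwen-3 4B and Olmo-3 7B families, showing that lightweight signals suffice for depth-based refinement on low-diversity aligned models but yield limited utility on high-diversity base models, which we conjecture need stronger compensation for their lower exploration coverage.

English abstract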

Compute scaling for LLM reasoning trades off exploring solution approaches (breadth) against refining promising ones (depth), yet why a given trade-off works, and why it often fails to transfer across models, remains unclear. We argue that the optimal strategy depends on the model's diversity profile, the spread of probability mass across solution approaches, and that this must be characterized before any exploration strategy is adopted. We formalize this with a framework decomposing reasoning uncertainty, deriving when depth-based refinement outperforms parallel sampling, and validate it across three model families at both inference and training. Our central finding is that the diversity regime dictates the strategy: low-diversity aligned models benefit from depth-based refinement with lightweight intrinsic signals, whereas high-diversity base models are often harmed by it, and instead need breadth or stronger signals to compensate.
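
One simple way to picture a diversity profile is the entropy of the empirical distribution over solution approaches in a batch of sampled answers. The sketch below is only that: the approach labels, the function name, and the choice of Shannon entropy are assumptions for illustration, and the paper's own formalization may differ.

```python
import math
from collections import Counter

def diversity_entropy(approach_labels):
    """Shannon entropy (nats) of the empirical distribution over approach labels."""
    counts = Counter(approach_labels)
    n = len(approach_labels)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

# Hypothetical approach labels assigned to eight sampled solutions per model.
aligned_model = ["algebra"] * 7 + ["geometry"]
base_model = ["algebra", "geometry", "casework", "induction"] * 2

print(diversity_entropy(aligned_model), diversity_entropy(base_model))
```

Under this reading, a low value means most samples follow one approach, and a high value means probability mass is spread across many approaches.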

2512.04524 2026-06-17 cs.LG cs.AI version update

Prototype-Based Semantic Consistency Alignment for Domain Adaptive Retrieval

Domain-Adaptive Retrieval via Prototype-Based Semantic Consistency Alignment

Tianle Hu, Weijun Lv, Na Han, Xiaozhao Fang, Jie Wen, Jiaxing Li, Guoxu Zhou

Affiliations: School of Computer Science and Technology, Guangdong University of Technology; School of Automation, Guangdong University of Technology; School of Computer Science, Guangdong Polytechnic Normal University; School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen; School of Artificial Intelligence, Guangzhou University

AI summary: Proposes Prototype-Based Semantic Consistency Alignment (PSCA), a two-stage framework that builds class-level semantic connections through orthogonal prototypes, weights pseudo-label confidence by geometric proximity, and quantizes reconstructed features into unified hash codes, addressing missing class-level alignment and degraded quantization quality in domain adaptive retrieval.

Comments AAAI2026

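AI abstract (translated from Chinese)

Domain adaptive retrieval aims to transfer knowledge from a labeled source domain to an unlabeled target domain, enabling effective retrieval while mitigating domain discrepancies. Existing methods, however, have several fundamental limitations: (1) they neglect class-level semantic alignment and over-pursue pairwise sample alignment; (2) they lack either consideration of pseudo-label reliability or geometric guidance for assessing label correctness; (3) they directly quantize original features affected by domain shift, which harms the quality of the learned hash codes. In view of these limitations, we propose Prototype-Based Semantic Consistency Alignment (PSCA), a two-stage framework for effective domain adaptive retrieval. In the first stage, a set of orthogonal prototypes directly establishes class-level semantic connections, gathering intra-class samples while maximizing inter-class separability. During prototype learning, geometric proximity adaptively weights pseudo-label confidences, providing a reliability indicator for semantic consistency alignment. The resulting membership matrix and prototypes drive feature reconstruction, so that quantization operates on reconstructed rather than original features, which improves subsequent hash coding quality and connects the two stages seamlessly. In the second stage, domain-specific quantization functions process the reconstructed features under mutual approximation constraints, producing unified binary hash codes across domains. Extensive experiments confirm PSCA's superior performance on multiple datasets.

English abstract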

Domain adaptive retrieval aims to transfer knowledge from a labeled source domain to an unlabeled target domain, enabling effective retrieval while mitigating domain discrepancies. However, existing methods encounter several fundamental limitations: 1) neglecting class-level semantic alignment and excessively pursuing pair-wise sample alignment; 2) lacking either pseudo-label reliability consideration or geometric guidance for assessing label correctness; 3) directly quantizing original features affected by domain shift, undermining the quality of learned hash codes. In view of these limitations, we propose Prototype-Based Semantic Consistency Alignment (PSCA), a two-stage framework for effective domain adaptive retrieval. In the first stage, a set of orthogonal prototypes directly establishes class-level semantic connections, maximizing inter-class separability while gathering intra-class samples. During the prototype learning, geometric proximity provides a reliability indicator for semantic consistency alignment through adaptive weighting of pseudo-label confidences. The resulting membership matrix and prototypes facilitate feature reconstruction, ensuring quantization on reconstructed rather than original features, thereby improving subsequent hash coding quality and seamlessly connecting both stages. In the second stage, domain-specific quantization functions process the reconstructed features under mutual approximation constraints, generating unified binary hash codes across domains. Extensive experiments validate PSCA's superior performance across multiple datasets.
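
For readers new to hashing-based retrieval, the sketch below shows the generic last step such pipelines share: features are quantized to binary codes and candidates are ranked by Hamming distance. Plain sign quantization is used here as a stand-in; it is not PSCA's domain-specific quantization function, and the function names and toy vectors are hypothetical.

```python
def sign_hash(features):
    """Quantize a real-valued feature vector to a binary code by its sign."""
    return [1 if x >= 0 else 0 for x in features]

def hamming(a, b):
    """Number of positions where two equal-length binary codes differ."""
    return sum(x != y for x, y in zip(a, b))

def retrieve(query_code, database_codes, k=2):
    """Return indices of the k database codes closest to the query in Hamming distance."""
    ranked = sorted(range(len(database_codes)),
                    key=lambda i: hamming(query_code, database_codes[i]))
    return ranked[:k]

# Toy target-domain query and source-domain database features (made-up numbers).
query = sign_hash([0.3, -1.2, 0.0])
database = [sign_hash(f) for f in ([-0.4, 0.9, -0.1], [0.8, -0.2, 0.5], [0.1, 0.7, 0.2])]
print(retrieve(query, database))
```

Because retrieval only compares bits, any domain shift left in the features before quantization shows up directly as flipped bits, which is one way to read the abstract's motivation for quantizing reconstructed features.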

2603.26551 2026-06-17 cs.CV cs.AI version update

Beyond MACs: Hardware Efficient Architecture Design for Vision Backbones

Moritz Nottebaum, Matteo Dunnhofer, Christian Micheloni

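Affiliations: Machine Learning and Perception Lab, University of Udine; Centre for Vision Research, York University

AI summary: Addresses the shortcomings of MACs as an efficiency metric on edge devices and proposes LowFormer, a backbone family built on hardware-efficiency insights whose lightweight Lowtention module delivers significant speed-ups.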

Comments Accepted at International Journal of Computer Vision (IJCV)

DOI: 10.1007/s11263-026-02873-5
Journal ref: Int J Comput Vis 134, 295 (2026)

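AI abstract (translated from Chinese)

Vision backbone networks play a central role in modern computer vision, and improving their efficiency directly benefits a wide range of downstream applications. To measure efficiency, many publications rely on MACs (multiply-accumulate operations) as a predictor of execution time. This paper shows experimentally where that metric falls short, especially on edge devices. By comparing the MAC count and execution time of common architectural design elements, we identify the key factors for efficient execution and offer insights for optimizing backbone design. Based on these insights, we propose LowFormer, a new family of vision backbones. LowFormer uses a streamlined macro and micro design that includes Lowtention, a lightweight alternative to multi-head self-attention. Lowtention is not only more efficient but also achieves better results on ImageNet. We also present an edge GPU version of LowFormer that further improves on the baseline's speed on both edge GPUs and desktop GPUs. We demonstrate LowFormer's broad applicability by evaluating it on smaller image classification datasets and adapting it to several downstream tasks, including object detection, semantic segmentation, image retrieval, and visual object tracking. Compared with recent state-of-the-art backbones, LowFormer models achieve significant speed-ups across a range of hardware platforms. Code and models are available at https://github.com/altair199797/LowFormer/blob/main/Beyond_MACs.md.

English abstract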

Vision backbone networks play a central role in modern computer vision. Enhancing their efficiency directly benefits a wide range of downstream applications. To measure efficiency, many publications rely on MACs (Multiply Accumulate operations) as a predictor of execution time. In this paper, we experimentally demonstrate the shortcomings of such a metric, especially in the context of edge devices. By contrasting the MAC count and execution time of common architectural design elements, we identify key factors for efficient execution and provide insights to optimize backbone design. Based on these insights, we present LowFormer, a novel vision backbone family. LowFormer features a streamlined macro and micro design that includes Lowtention, a lightweight alternative to Multi-Head Self-Attention. Lowtention not only proves more efficient, but also enables superior results on ImageNet. Additionally, we present an edge GPU version of LowFormer that can further improve upon its baseline's speed on edge GPU and desktop GPU. We demonstrate LowFormer's wide applicability by evaluating it on smaller image classification datasets, as well as adapting it to several downstream tasks, such as object detection, semantic segmentation, image retrieval, and visual object tracking. LowFormer models consistently achieve remarkable speed-ups across various hardware platforms compared to recent state-of-the-art backbones. Code and models are available at https://github.com/altair199797/LowFormer/blob/main/Beyond_MACs.md.
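
As background for the MAC metric discussed in this abstract, the sketch below counts MACs for a standard convolution and for a depthwise plus pointwise pair using the usual formula (output positions times output channels times input channels per group times kernel area). The layer shapes are made up for illustration and are not taken from the paper.

```python
def conv2d_macs(h_out, w_out, c_in, c_out, k, groups=1):
    """MACs of a k x k convolution that produces an h_out x w_out x c_out output."""
    return h_out * w_out * c_out * (c_in // groups) * k * k

# Hypothetical 56x56 feature map with 64 input and 64 output channels.
standard = conv2d_macs(56, 56, 64, 64, 3)
depthwise = conv2d_macs(56, 56, 64, 64, 3, groups=64)
pointwise = conv2d_macs(56, 56, 64, 64, 1)

print(standard, depthwise + pointwise)
```

The depthwise plus pointwise pair needs far fewer MACs in this example, but the count says nothing about memory traffic or how well the operations parallelize, which may explain why MACs and measured latency can disagree in the way the abstract reports.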

2603.25937 2026-06-17 cs.RO cs.LG version update

Can Vision Foundation Models Navigate? Zero-Shot Real-World Evaluation and Lessons Learned

Maeva Guerrier, Karthik Soma, Jana Pavlasek, Giovanni Beltrame

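Affiliations: Polytechnique Montreal

AI summary: Evaluates five visual navigation models zero-shot in real-world environments, finds systematic problems including limited geometric understanding, perceptual confusion, and degradation under distribution shift, and commits to releasing the evaluation code and dataset.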

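AI abstract (translated from Chinese)

Visual Navigation Models (VNMs) promise generalizable robot navigation by learning from large-scale visual demonstrations. Despite growing real-world deployment, existing evaluations rely almost entirely on success rate (whether the robot reaches its goal), which hides trajectory quality, collision behavior, and robustness to environmental change. We conduct a real-world evaluation of five state-of-the-art VNMs (GNM, ViNT, NoMaD, NaviBridger, and CrossFormer) on two robot platforms across five indoor and outdoor environments. Beyond success rate, we combine path-based metrics with vision-based goal-recognition scores and assess robustness through controlled image perturbations (motion blur, sun glare). Our analysis reveals three systematic problems: (a) even architecturally sophisticated diffusion and transformer models collide frequently, indicating limited geometric understanding; (b) models cannot distinguish locations that look similar but differ semantically, which causes goal prediction errors in repetitive environments; (c) performance degrades under distribution shift. We will publicly release our evaluation code and dataset to support reproducible benchmarking of VNMs.

English abstract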

Visual Navigation Models (VNMs) promise generalizable robot navigation by learning from large-scale visual demonstrations. Despite growing real-world deployment, existing evaluations rely almost exclusively on success rate (whether the robot reaches its goal), which conceals trajectory quality, collision behavior, and robustness to environmental change. We present a real-world evaluation of five state-of-the-art VNMs (GNM, ViNT, NoMaD, NaviBridger, and CrossFormer) across two robot platforms and five environments spanning indoor and outdoor settings. Beyond success rate, we combine path-based metrics with vision-based goal-recognition scores and assess robustness through controlled image perturbations (motion blur, sun flare). Our analysis uncovers three systematic limitations: (a) even architecturally sophisticated diffusion and transformer-based models exhibit frequent collisions, indicating limited geometric understanding; (b) models fail to discriminate between locations that are perceptually similar but differ semantically, causing goal prediction errors in repetitive environments; and (c) performance degrades under distribution shift. We will publicly release our evaluation codebase and dataset to facilitate reproducible benchmarking of VNMs.

2602.10384 2026-06-17 cs.CL version update

When Tables Go Crazy: Evaluating Multimodal Models on French Financial Documents

Virginie Mouilleron, Théo Lasnier, Anna Mosolova, Djamé Seddah

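Affiliations: Inria Paris; Sorbonne Université

AI summary: Introduces the Scribe Finance benchmark for evaluating multimodal models on text, table, chart, and multi-turn dialogue understanding of French financial documents, and finds that models are brittle on charts and multi-turn dialogue.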

Comments 16 pages, 13 figures

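AI abstract (translated from Chinese)

Vision-language models (VLMs) perform well on many document understanding tasks, but their reliability in specialized non-English domains remains underexplored. The gap is especially critical in finance, where documents mix dense regulatory text, numerical tables, and visual charts, and where extraction errors can have real consequences. We introduce Scribe Finance, the first multimodal benchmark for evaluating French financial document understanding. The dataset contains 1,204 expert-validated questions covering text extraction, table comprehension, chart interpretation, and multi-turn conversational reasoning, drawn from real investment prospectuses, KIDs, and PRIIPs. We evaluate six open-weight VLMs (8B-124B parameters) with an LLM-as-judge protocol. Models perform strongly on text and table tasks (85-90% accuracy) but poorly on chart interpretation (34-62%). Most notably, multi-turn dialogue exposes a severe failure mode: early errors propagate across turns, pulling accuracy down to roughly 50% regardless of model size. These results indicate that current VLMs are effective on well-defined extraction tasks but remain brittle in interactive, multi-step financial analysis. Scribe Finance provides a challenging benchmark for measuring and driving progress in this high-stakes setting.

English abstract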

Vision-language models (VLMs) perform well on many document understanding tasks, yet their reliability in specialized, non-English domains remains underexplored. This gap is especially critical in finance, where documents mix dense regulatory text, numerical tables, and visual charts, and where extraction errors can have real-world consequences. We introduce Scribe Finance, the first multimodal benchmark for evaluating French financial document understanding. The dataset contains 1,204 expert-validated questions spanning text extraction, table comprehension, chart interpretation, and multi-turn conversational reasoning, drawn from real investment prospectuses, KIDs, and PRIIPs. We evaluate six open-weight VLMs (8B-124B parameters) using an LLM-as-judge protocol. While models achieve strong performance on text and table tasks (85-90% accuracy), they struggle with chart interpretation (34-62%). Most notably, multi-turn dialogue reveals a sharp failure mode: early mistakes propagate across turns, driving accuracy down to roughly 50% regardless of model size. These results show that current VLMs are effective for well-defined extraction tasks but remain brittle in interactive, multi-step financial analysis. Scribe Finance offers a challenging benchmark to measure and drive progress in this high-stakes setting.

2601.16407 2026-06-17 cs.CL cs.AI version update

Jacobian Scopes: token-level causal attributions in LLMs

Toni J. B. Liu, Baran Zadeoğlu, Nicolas Boullé, Raphaël Sarfati, Gurbir Arora, Christopher J. Earls

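Affiliations: Cornell University; Imperial College London; Goodfire AI

AI summary: Proposes Jacobian Scopes, a gradient-based, token-level causal attribution method for interpreting LLM predictions, and uses it to expose political biases, translation strategies, and in-context learning mechanisms.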

Comments 25 pages, 16 figures

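AI abstract (translated from Chinese)

Large language models (LLMs) make next-token predictions from clues in their context, such as semantic descriptions and in-context examples. Yet, because of the proliferation of layers and attention heads in modern architectures, it remains challenging to determine which prior tokens most influence a given prediction. We propose Jacobian Scopes, a suite of gradient-based, token-level causal attribution methods for interpreting LLM predictions. Grounded in perturbation theory and information geometry, Jacobian Scopes quantify how input tokens affect different aspects of a model's prediction, such as specific logits, the full predictive distribution, and model uncertainty (effective temperature). Through case studies covering instruction understanding, translation, and in-context learning (ICL), we show how Jacobian Scopes reveal implicit political biases, uncover word-level and phrase-level translation strategies, and clarify the recently debated mechanisms behind in-context time-series forecasting. To make it easy to explore Jacobian Scopes on custom text, we open-source our implementation and provide a cloud-hosted interactive demo at https://huggingface.co/spaces/Typony/JacobianScopes.

English abstract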

Large language models (LLMs) make next-token predictions based on clues present in their context, such as semantic descriptions and in-context examples. Yet, elucidating which prior tokens most strongly influence a given prediction remains challenging due to the proliferation of layers and attention heads in modern architectures. We propose Jacobian Scopes, a suite of gradient-based, token-level causal attribution methods for interpreting LLM predictions. Grounded in perturbation theory and information geometry, Jacobian Scopes quantify how input tokens influence various aspects of a model's prediction, such as specific logits, the full predictive distribution, and model uncertainty (effective temperature). Through case studies spanning instruction understanding, translation, and in-context learning (ICL), we demonstrate how Jacobian Scopes reveal implicit political biases, uncover word- and phrase-level translation strategies, and shed light on recently debated mechanisms underlying in-context time-series forecasting. To facilitate exploration of Jacobian Scopes on custom text, we open-source our implementations and provide a cloud-hosted interactive demo at https://huggingface.co/spaces/Typony/JacobianScopes.

2603.05171 2026-06-17 cs.CL cs.AI version update

Guidelines for the Annotation and Visualization of Legal Argumentation Structures in Chinese Judicial Decisions

Kun Chen, Xianglei Liao, Kaixue Fei, Yi Xing, Xinrui Li

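Affiliations: Law School, Nanjing University

AI summary: Presents a systematic, operationalizable annotation framework for representing legal argumentation structures in judicial decisions, supporting large-scale analysis of judicial reasoning and legal argument mining.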

Comments This Guideline has been developed through revision and refinement based on the first edition. The element label system has been adjusted, and the annotation granularity and annotation workflow have been further optimized

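AI abstract (translated from Chinese)

This Guideline presents a systematic and operationalizable annotation framework for representing legal argumentation structures in judicial decisions. Grounded in theories of legal reasoning and argumentation, the framework aims to expose the logical organization of judicial reasoning and to provide a reliable basis for computational analysis. At the element level, the Guideline distinguishes a non-propositional layer from a propositional layer. The non-propositional layer consists of two elements: Issue and Non-argumentative Component. At the propositional level, the Guideline defines four proposition types: General Normative Judgment, Particular Normative Judgment, General Factual Judgment, and Particular Factual Judgment. At the relational level, five relation types represent argumentative structure: Support, Attack, Joint, Match, and Identity. These relations capture positive and negative argumentative connections, conjunctive reasoning structures, correspondences between legal norms and case facts, and identity or semantic equivalence between propositions. The Guideline further specifies formal representation rules and visualization conventions for basic and nested structures, so that complex argumentation patterns are visualized consistently. It also establishes a standardized annotation workflow and consistency control mechanisms to ensure that annotated data is reproducible and reliable. By providing a clear conceptual model, formal representation rules, and practical annotation procedures, the Guideline supports large-scale analysis of judicial reasoning and future research in legal argument mining, computational modeling of legal reasoning, and AI-assisted legal analysis.

English abstract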

This Guideline presents a systematic and operationalizable annotation framework for representing legal argumentation structures in judicial decisions. Grounded in theories of legal reasoning and argumentation, the framework aims to reveal the logical organization of judicial reasoning and provide a reliable foundation for computational analysis. At the element level, the Guideline distinguishes between the non-propositional layer and the propositional layer. The non-propositional layer consists of two elements: Issue and Non-argumentative Component. At the propositional level, the Guideline defines four proposition types: General Normative Judgment, Particular Normative Judgment, General Factual Judgment, and Particular Factual Judgment. At the relational level, five relation types are defined to represent argumentative structures: Support, Attack, Joint, Match, and Identity. These relations capture positive and negative argumentative connections, conjunctive reasoning structures, correspondences between legal norms and case facts, and identity or semantic equivalence between propositions. The Guideline further specifies formal representation rules and visualization conventions for both basic and nested structures, enabling consistent visualization of complex argumentation patterns. In addition, it establishes a standardized annotation workflow and consistency control mechanisms to ensure the reproducibility and reliability of annotated data. By providing a clear conceptual model, formal representation rules, and practical annotation procedures, this Guideline supports large-scale analysis of judicial reasoning and future research in legal argument mining, computational modeling of legal reasoning, and AI-assisted legal analysis.

2602.11590 2026-06-17 cs.LG version update

Learn from Your Mistakes: Self-Correcting Masked Diffusion Models

Yair Schiff, Omer Belhasin, Roy Uziel, Guanghan Wang, Marianne Arriola, Gilad Turok, Ran Zilberstein, Michael Elad, Volodymyr Kuleshov

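Affiliations: Cornell University; NVIDIA

AI summary: Proposes the ProSeCo framework, which trains a model to both unmask and correct errors, iteratively revising already decoded tokens during generation to improve sample quality and speed up sampling.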

Comments Code to reproduce our experiments is available here: https://github.com/kuleshov-group/proseco

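AI abstract (translated from Chinese)

Masked diffusion models (MDMs) have emerged as a promising alternative to autoregressive models, enabling parallel token generation while remaining competitive in performance. Despite these strengths, MDMs face a fundamental limitation: once tokens are unmasked they stay fixed, so errors accumulate and eventually degrade sample quality. We address this with a framework that trains a model to perform both unmasking and correction. By reusing the outputs of the MDM denoising network as inputs for corrector training, we train the model to recover from potential mistakes. During generation, we apply additional corrective refinement steps between unmasking steps to change decoded tokens and improve the output. We call our training and sampling method Progressive Self-Correction (ProSeCo) for its distinctive ability to iteratively refine the entire sequence, including tokens that have already been generated. Extensive experiments on multiple conditional and unconditional tasks show that our method yields better quality-efficiency trade-offs (up to about 4x faster sampling) and enables inference-time compute scaling that raises sample quality beyond standard MDMs (up to about 1.2x improvement on benchmarks).

English abstract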

Masked diffusion models (MDMs) have emerged as a promising alternative to autoregressive models, enabling parallel token generation while achieving competitive performance. Despite these advantages, MDMs face a fundamental limitation: once tokens are unmasked, they remain fixed, leading to error accumulation and ultimately degrading sample quality. We address this by proposing a framework that trains a model to perform both unmasking and correction. By reusing outputs from the MDM denoising network as inputs for corrector training, we train a model to recover from potential mistakes. During generation, we apply additional corrective refinement steps between unmasking ones in order to change decoded tokens and improve outputs. We name our training and sampling method Progressive Self-Correction (ProSeCo) for its unique ability to iteratively refine an entire sequence, including already generated tokens. We conduct extensive experimental validation across multiple conditional and unconditional tasks, demonstrating that ProSeCo yields better quality-efficiency trade-offs (up to ~4x faster sampling) and enables inference-time compute scaling to further increase sample quality beyond standard MDMs (up to ~1.2x improvement on benchmarks).

2602.06806 2026-06-17 cs.CV cs.LG version update

RAIGen: Rare Attribute Identification in Text-to-Image Generative Models

Silpa Vadakkeeveetil Sreelatha, Dan Wang, Serge Belongie, Muhammad Awais, Anjan Dutta

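Affiliations: University of California, Berkeley

AI summary: Proposes RAIGen, a framework that uses Matryoshka Sparse Autoencoders and a novel minority metric to discover rare attributes in diffusion models without labels, and supports attribute amplification.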

Comments Accepted at ICML 2026. Webpage and code available at https://github.com/VSSILPA/RAIGen

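AI abstract (translated from Chinese)

Text-to-image diffusion models achieve impressive generation quality, but they inherit and amplify biases in their training data, skewing the coverage of semantic attributes. Prior work tackles this in two ways. Closed-set approaches mitigate bias within predefined fairness categories (such as gender and race), assuming that socially salient minority attributes are known a priori. Open-set approaches frame the task as bias identification, highlighting the majority attributes that dominate outputs. Both overlook a complementary task: surfacing rare or minority features (social, cultural, or stylistic) that are underrepresented in the data distribution yet still encoded in the model's representations. We introduce RAIGen, to our knowledge the first framework for label-free rare-attribute discovery in diffusion models, which requires no predefined minority categories. RAIGen uses Matryoshka Sparse Autoencoders and a novel minority metric that combines neuron activation frequency with semantic distinctiveness to identify interpretable neurons whose top-activating images reveal underrepresented attributes. Experiments show that RAIGen discovers attributes beyond fixed fairness categories in Stable Diffusion, scales to larger models such as SDXL, supports systematic auditing across architectures, and enables targeted amplification of rare attributes during generation. The project page is available at https://vssilpa.github.io/RAIGen_webpage/.

English abstract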

Text-to-image diffusion models achieve impressive generation quality but inherit and amplify training-data biases, skewing coverage of semantic attributes. Prior work addresses this in two ways. Closed-set approaches mitigate biases in predefined fairness categories (e.g., gender, race), assuming socially salient minority attributes are known a priori. Open-set approaches frame the task as bias identification, highlighting majority attributes that dominate outputs. Both overlook a complementary task: uncovering rare or minority features underrepresented in the data distribution (social, cultural, or stylistic) yet still encoded in model representations. We introduce RAIGen, the first framework, to our knowledge, for label-free rare-attribute discovery in diffusion models, requiring no predefined minority categories. RAIGen leverages Matryoshka Sparse Autoencoders and a novel minority metric combining neuron activation frequency with semantic distinctiveness to identify interpretable neurons whose top-activating images reveal underrepresented attributes. Experiments show RAIGen discovers attributes beyond fixed fairness categories in Stable Diffusion, scales to larger models such as SDXL, supports systematic auditing across architectures, and enables targeted amplification of rare attributes during generation. The project page is available at https://vssilpa.github.io/RAIGen_webpage/ .

2602.06276 2026-06-17 cs.LG stat.ML version update

Statistical Learning from Attribution Sets

Lorne Applebaum, Robert Busa-Fekete, August Y. Chen, Claudio Gentile, Tomer Koren, Aryan Mokhtari

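Affiliations: Google Research; Cornell University; Tel Aviv University; UT Austin

AI summary: For the setting where privacy constraints prevent ad clicks and conversions from being linked directly, proposes an unbiased loss estimator based on attribution sets, yielding generalization guarantees for empirical risk minimization and outperforming industry heuristics.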

Comments COLT 2026. 45 pages

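AI abstract (translated from Chinese)

We study the problem of training conversion prediction models for advertising under privacy constraints, where no direct link between clicks and conversions is available. Motivated by privacy-preserving browser APIs and the deprecation of third-party cookies, we consider a setting in which the learner observes a sequence of clicks and a sequence of conversions but can only associate a conversion with a set of candidate clicks (an attribution set) rather than a unique source. We formalize this as learning from attribution sets generated by an oblivious adversary equipped with a prior distribution over the candidates. Despite the absence of explicit labels, we construct an unbiased estimator of the population loss from these coarse signals using a novel approach. With this estimator, we show that empirical risk minimization achieves generalization guarantees that scale with the informativeness of the prior and are robust to estimation errors in the prior, despite complex dependencies among attribution sets. Simple empirical evaluations on standard datasets suggest that our unbiased approach significantly outperforms common industry heuristics, particularly when attribution sets are large or overlapping.

English abstract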

We address the problem of training conversion prediction models in advertising domains under privacy constraints, where direct links between ad clicks and conversions are unavailable. Motivated by privacy-preserving browser APIs and the deprecation of third-party cookies, we study a setting where the learner observes a sequence of clicks and a sequence of conversions, but can only link a conversion to a set of candidate clicks (an attribution set) rather than a unique source. We formalize this as learning from attribution sets generated by an oblivious adversary equipped with a prior distribution over the candidates. Despite the lack of explicit labels, we construct an unbiased estimator of the population loss from these coarse signals via a novel approach. Leveraging this estimator, we show that Empirical Risk Minimization achieves generalization guarantees that scale with the informativeness of the prior and is also robust against estimation errors in the prior, despite complex dependencies among attribution sets. Simple empirical evaluations on standard datasets suggest our unbiased approach significantly outperforms common industry heuristics, particularly in regimes where attribution sets are large or overlapping.

2601.05212 2026-06-17 cs.CV 版本更新

FlowLet: Conditional 3D Brain MRI Synthesis using Wavelet Flow Matching

FlowLet: 基于小波流匹配的条件性3D脑MRI合成

Danilo Danese, Angela Lombardi, Matteo Attimonelli, Giuseppe Fasano, Tommaso Di Noia

发表机构 * Politecnico di Bari（巴里理工学院）； Sapienza University of Rome（罗马萨皮恩扎大学）

AI总结提出FlowLet框架，利用可逆3D小波域中的流匹配生成年龄条件化的3D脑MRI，避免重建伪影并降低计算需求，实验证明其生成高保真体积且提升脑年龄预测模型对低代表性年龄组的性能。

Comments Accepted at Medical Image Analysis (Elsevier)

DOI: 10.1016/j.media.2026.104161

AI中文摘要

脑磁共振成像（MRI）在研究神经发育、衰老和疾病中起着核心作用。一个关键应用是脑年龄预测（BAP），它从MRI数据中估计个体的生物脑年龄。有效的BAP模型需要大规模、多样化和年龄平衡的数据集，而现有的3D MRI数据集在人口统计学上存在偏差，限制了公平性和泛化能力。获取新数据成本高昂且受到伦理约束，这促使了生成性数据增强。当前的生成方法通常基于潜在扩散模型，这些模型在学习的低维潜在空间中操作，以应对体积MRI数据的内存需求。然而，这些方法在推理时通常较慢，可能因潜在压缩而引入伪影，并且很少以年龄为条件，从而影响BAP性能。在这项工作中，我们提出了FlowLet，一个条件生成框架，通过在可逆3D小波域中利用流匹配来合成年龄条件化的3D MRI，有助于避免重建伪影并降低计算需求。实验表明，FlowLet以少量采样步骤生成高保真体积。使用FlowLet生成的数据训练BAP模型可改善低代表性年龄组的性能，基于区域的分析确认了解剖结构的保留。

英文摘要

Brain Magnetic Resonance Imaging (MRI) plays a central role in studying neurological development, aging, and diseases. One key application is Brain Age Prediction (BAP), which estimates an individual's biological brain age from MRI data. Effective BAP models require large, diverse, and age-balanced datasets, whereas existing 3D MRI datasets are demographically skewed, limiting fairness and generalizability. Acquiring new data is costly and ethically constrained, motivating generative data augmentation. Current generative methods are often based on latent diffusion models, which operate in learned low dimensional latent spaces to address the memory demands of volumetric MRI data. However, these methods are typically slow at inference, may introduce artifacts due to latent compression, and are rarely conditioned on age, thereby affecting the BAP performance. In this work, we propose FlowLet, a conditional generative framework that synthesizes age-conditioned 3D MRIs by leveraging flow matching within an invertible 3D wavelet domain, helping to avoid reconstruction artifacts and reducing computational demands. Experiments show that FlowLet generates high-fidelity volumes with few sampling steps. Training BAP models with data generated by FlowLet improves performance for underrepresented age groups, and region-based analysis confirms preservation of anatomical structures.
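
The two ingredients named in the abstract, an invertible wavelet transform and flow matching, can be illustrated with a minimal sketch. Everything here is an assumption for illustration: a one-level 1D Haar transform stands in for the 3D wavelet domain, and a linear noise-to-data path is one common flow-matching choice; the abstract does not state FlowLet's wavelet family, path, or age-conditioning mechanism.

```python
import math
import random


def haar_forward(signal):
    """One-level orthonormal Haar transform of an even-length 1D signal."""
    s = 1.0 / math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) * s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) * s for i in range(0, len(signal), 2)]
    return approx, detail


def haar_inverse(approx, detail):
    """Exact inverse of haar_forward, so no learned decoder is involved."""
    s = 1.0 / math.sqrt(2.0)
    out = []
    for a, d in zip(approx, detail):
        out.extend([(a + d) * s, (a - d) * s])
    return out


def flow_matching_pair(x1, rng):
    """Build one training example on a linear path (an assumed path choice).

    x0 is Gaussian noise, x_t = (1 - t) * x0 + t * x1, and the regression
    target for the velocity network is the constant velocity x1 - x0.
    """
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]
    t = rng.random()
    xt = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    velocity = [b - a for a, b in zip(x0, x1)]
    return t, xt, velocity
```

Because the wavelet transform is exactly invertible, a sample generated in coefficient space maps back to the signal domain without a learned decoder, which is one plausible reading of why working in this domain helps avoid reconstruction artifacts.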

2601.04574 2026-06-17 cs.CL 版本更新

FeedEval: Pedagogically Aligned Evaluation of LLM-Generated Essay Feedback

FeedEval: 面向教学法的LLM生成作文反馈评估

Seongyeub Chu, Jongwoo Kim, Munyong Yi

发表机构 * Graduate School of Data Science, KAIST（数据科学研究生院，韩国科学技术院）； Department of Industrial & Systems Engineering, KAIST（工业与系统工程系，韩国科学技术院）

AI总结提出FeedEval框架，沿特异性、帮助性和有效性三个教学维度评估LLM生成的作文反馈，通过专用评估器筛选高质量反馈，提升下游评分和修订效果。

AI中文摘要

超越数值分数预测，近期自动作文评分研究日益强调生成能够提供理由和可操作指导的高质量反馈。为降低专家标注的高成本，先前工作通常依赖LLM生成的反馈来训练作文评估模型。然而，此类反馈常未经明确的质量验证即被采用，导致噪声在下游应用中传播。为解决这一局限，我们提出FeedEval，一个基于LLM的框架，用于沿三个有教学法依据的维度（特异性、帮助性和有效性）评估LLM生成的作文反馈。FeedEval采用维度专用的LLM评估器，这些评估器在本研究整理的数据集上训练，用于评估多个候选反馈并选出高质量反馈供下游使用。在ASAP++基准上的实验表明，FeedEval与人类专家判断高度一致，且使用经FeedEval筛选的高质量反馈训练的作文评分模型取得了更优的评分性能。此外，使用小型LLM进行的修订实验表明，FeedEval识别出的高质量反馈能带来更有效的作文修订。我们在以下网址发布代码和整理的数据集：https://github.com/BBeeChu/FeedEval.git。

英文摘要

Going beyond the prediction of numerical scores, recent research in automated essay scoring has increasingly emphasized the generation of high-quality feedback that provides justification and actionable guidance. To mitigate the high cost of expert annotation, prior work has commonly relied on LLM-generated feedback to train essay assessment models. However, such feedback is often incorporated without explicit quality validation, resulting in the propagation of noise in downstream applications. To address this limitation, we propose FeedEval, an LLM-based framework for evaluating LLM-generated essay feedback along three pedagogically grounded dimensions: specificity, helpfulness, and validity. FeedEval employs dimension-specialized LLM evaluators trained on datasets curated in this study to assess multiple feedback candidates and select high-quality feedback for downstream use. Experiments on the ASAP++ benchmark show that FeedEval closely aligns with human expert judgments and that essay scoring models trained with FeedEval-filtered high-quality feedback achieve superior scoring performance. Furthermore, revision experiments using small LLMs show that the high-quality feedback identified by FeedEval leads to more effective essay revisions. We release our code and curated datasets at: https://github.com/BBeeChu/FeedEval.git.

2505.23939 2026-06-17 cs.LG cs.NI 版本更新

Searching Neural Architectures for Sensor Nodes on IoT Gateways

搜索物联网网关上传感器节点的神经架构

Andrea Mattia Garavagno, Edoardo Ragusa, Antonio Frisoli, Paolo Gastaldo

发表机构 * University of Genoa（热那亚大学）

AI总结提出一种在物联网网关上自动设计神经网络的方法，保护数据隐私，在Raspberry Pi Zero 2上10小时内搜索出达到SOTA的架构。

DOI: 10.1109/JIOT.2025.3581442
Journal ref: IEEE Internet of Things Journal, vol. 12, no. 21, pp. 44492-44501, 2025

AI中文摘要

本文提出一种在边缘自动设计神经网络的方法，即使在隐私敏感的物联网应用中也能实现机器学习。该方法在物联网网关上运行，为连接的传感器节点设计神经网络，而无需将收集的数据共享到本地网络之外，将数据保留在采集现场。这种方法有潜力为医疗物联网和工业物联网启用机器学习，在边缘设计硬件友好且定制的神经网络，用于个性化医疗和高级工业服务，如质量控制、预测性维护或故障诊断。通过防止数据泄露到云服务，该方法保护了敏感信息，包括工业机密和个人数据。全面的实验结果表明，在Visual Wake Words数据集上，所提出的方法通过在Raspberry Pi Zero 2上运行不到10小时的搜索过程，可以达到最先进的结果。

英文摘要

This paper presents an automatic method for the design of Neural Networks (NNs) at the edge, enabling Machine Learning (ML) access even in privacy-sensitive Internet of Things (IoT) applications. The proposed method runs on IoT gateways and designs NNs for connected sensor nodes without sharing the collected data outside the local network, keeping the data in the site of collection. This approach has the potential to enable ML for Healthcare Internet of Things (HIoT) and Industrial Internet of Things (IIoT), designing hardware-friendly and custom NNs at the edge for personalized healthcare and advanced industrial services such as quality control, predictive maintenance, or fault diagnosis. By preventing data from being disclosed to cloud services, this method safeguards sensitive information, including industrial secrets and personal data. The outcomes of a thorough experimental session confirm that -- on the Visual Wake Words dataset -- the proposed approach can achieve state-of-the-art results by exploiting a search procedure that runs in less than 10 hours on the Raspberry Pi Zero 2.

2506.05797 2026-06-17 cs.LG cs.CE cs.RO 版本更新

EqCollide: Equivariant and Collision-Aware Deformable Objects Neural Simulator

EqCollide: 等变且碰撞感知的可变形物体神经模拟器

Qianyi Chen, Tianrun Gao, Chenbo Jiang, Tailin Wu

发表机构 * Westlake University（西湖大学）； Fudan University（复旦大学）； Tongji University（同济大学）； McGill University（麦吉尔大学）

AI总结提出首个端到端等变神经场模拟器EqCollide，通过等变编码器和碰撞感知消息传递的图神经网络常微分方程，实现可变形物体碰撞的准确、稳定和可扩展模拟。

Comments SIGKDD 2026 Oral AI4S Track. 20 pages, 16 figures

AI中文摘要

模拟可变形物体的碰撞是一项基础但具有挑战性的任务，因为固体力学和多体相互作用的建模十分复杂。现有的数据驱动方法往往缺乏对物理对称性的等变性，碰撞处理不足，且可扩展性有限。本文提出EqCollide，这是首个面向可变形物体及其碰撞的端到端等变神经场模拟器。我们提出一个等变编码器，将物体几何和速度映射到潜在控制点。随后，基于等变图神经网络的神经常微分方程通过碰撞感知的消息传递建模控制点之间的相互作用。为了重建速度场，我们查询一个以控制点特征为条件的神经场，实现连续且与分辨率无关的运动预测。在2D和3D场景上的实验结果表明，EqCollide在不同物体配置下实现了准确、稳定且可扩展的模拟。即使与表现最佳的基线模型相比，其滚动预测（rollout）均方误差也降低了24.34%至57.62%。此外，EqCollide能够泛化到更多碰撞物体和更长的时间范围，并对经群作用变换的输入保持鲁棒。代码可在以下网址获取：https://github.com/AI4Science-WestlakeU/EqCollide

英文摘要

Simulating collisions of deformable objects is a fundamental yet challenging task due to the complexity of modeling solid mechanics and multi-body interactions. Existing data-driven methods often suffer from lack of equivariance to physical symmetries, inadequate handling of collisions, and limited scalability. Here we introduce EqCollide, the first end-to-end equivariant neural fields simulator for deformable objects and their collisions. We propose an equivariant encoder to map object geometry and velocity into latent control points. A subsequent equivariant Graph Neural Network-based Neural Ordinary Differential Equation models the interactions among control points via collision-aware message passing. To reconstruct velocity fields, we query a neural field conditioned on control point features, enabling continuous and resolution-independent motion predictions. Experimental results on 2D and 3D scenarios show that EqCollide achieves accurate, stable, and scalable simulations across diverse object configurations. It achieves $24.34\%$ to $57.62\%$ lower rollout MSE, even compared with the best-performing baseline model. Furthermore, EqCollide could generalize to more colliding objects and extended temporal horizons, and stay robust to input transformed with group action. Code is available at: https://github.com/AI4Science-WestlakeU/EqCollide

2504.14582 2026-06-17 cs.CV 版本更新

NTIRE 2025 Challenge on Image Super-Resolution (x4): Methods and Results

NTIRE 2025 图像超分辨率（×4）挑战赛：方法与结果

Zheng Chen, Kai Liu, Jue Gong, Jingkai Wang, Lei Sun, Zongwei Wu, Radu Timofte, Yulun Zhang, Xiangyu Kong, Xiaoxuan Yu, Hyunhee Park, Suejin Han, Hakjae Jeon, Dafeng Zhang, Hyung-Ju Chun, Donghun Ryou, Inju Ha, Bohyung Han, Lu Zhao, Yuyi Zhang, Pengyu Yan, Jiawei Hu, Pengwei Liu, Fengjun Guo, Hongyuan Yu, Pufan Xu, Zhijuan Huang, Shuyuan Cui, Peng Guo, Jiahui Liu, Dongkai Zhang, Heng Zhang, Huiyuan Fu, Huadong Ma, Yanhui Guo, Sisi Tian, Xin Liu, Jinwen Liang, Jie Liu, Jie Tang, Gangshan Wu, Zeyu Xiao, Zhuoyuan Li, Yinxiang Zhang, Wenxuan Cai, Vijayalaxmi Ashok Aralikatti, Nikhil Akalwadi, G Gyaneshwar Rao, Chaitra Desai, Ramesh Ashok Tabib, Uma Mudenagudi, Marcos V. Conde, Alejandro Merino, Bruno Longarela, Javier Abad, Weijun Yuan, Zhan Li, Zhanglu Chen, Boyang Yao, Aagam Jain, Milan Kumar Singh, Ankit Kumar, Shubh Kawa, Divyavardhan Singh, Anjali Sarvaiya, Kishor Upla, Raghavendra Ramachandra, Chia-Ming Lee, Yu-Fan Lin, Chih-Chung Hsu, Risheek V Hiremath, Yashaswini Palani, Yuxuan Jiang, Qiang Zhu, Siyue Teng, Fan Zhang, Shuyuan Zhu, Bing Zeng, David Bull, Jingwei Liao, Yuqing Yang, Wenda Shao, Junyi Zhao, Qisheng Xu, Kele Xu, Sunder Ali Khowaja, Ik Hyun Lee, Snehal Singh Tomar, Rajarshi Ray, Klaus Mueller, Sachin Chaudhary, Surya Vashisth, Akshay Dudhane, Praful Hambarde, Satya Naryan Tazi, Prashant Patil, Santosh Kumar Vipparthi, Subrahmanyam Murala, Bilel Benjdira, Anas M. Ali, Wadii Boulila, Zahra Moammeri, Ahmad Mahmoudi-Aznaveh, Ali Karbasi, Hossein Motamednia, Liangyan Li, Guanhua Zhao, Kevin Le, Yimo Ning, Haoxuan Huang, Jun Chen

发表机构 * CVPR 2025

AI总结本文介绍NTIRE 2025图像超分辨率（×4）挑战赛，包括恢复和感知两个子赛道，总结比赛设计、数据集、评估协议及25个团队的提交方法。

Comments NTIRE 2025 webpage: https://www.cvlai.net/ntire/2025. Code: https://github.com/zhengchen1999/NTIRE2025_ImageSR_x4

DOI: 10.1109/CVPRW67362.2025.00141
Journal ref: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2025, pp. 1525-1535

AI中文摘要

本文介绍了NTIRE 2025图像超分辨率（×4）挑战赛，这是CVPR 2025第10届NTIRE Workshop的关联竞赛之一。该挑战旨在从通过双三次下采样生成的×4比例低分辨率图像中恢复高分辨率图像，目标是开发有效的网络设计或解决方案以实现最先进的超分辨率性能。为反映图像超分辨率研究的双重目标，挑战包含两个子赛道：（1）恢复赛道，强调像素级精度，根据PSNR对提交结果进行排名；（2）感知赛道，关注视觉真实感，根据感知分数对结果进行排名。共有286名参与者注册了比赛，25个团队提交了有效作品。本报告总结了挑战设计、数据集、评估协议、主要结果以及每个团队的方法。该挑战作为基准，旨在推动图像超分辨率领域的最先进技术并促进其进步。

英文摘要

This paper presents the NTIRE 2025 image super-resolution ($\times$4) challenge, one of the associated competitions of the 10th NTIRE Workshop at CVPR 2025. The challenge aims to recover high-resolution (HR) images from low-resolution (LR) counterparts generated through bicubic downsampling with a $\times$4 scaling factor. The objective is to develop effective network designs or solutions that achieve state-of-the-art SR performance. To reflect the dual objectives of image SR research, the challenge includes two sub-tracks: (1) a restoration track, emphasizes pixel-wise accuracy and ranks submissions based on PSNR; (2) a perceptual track, focuses on visual realism and ranks results by a perceptual score. A total of 286 participants registered for the competition, with 25 teams submitting valid entries. This report summarizes the challenge design, datasets, evaluation protocol, the main results, and methods of each team. The challenge serves as a benchmark to advance the state of the art and foster progress in image SR.
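
The restoration track ranks by PSNR, which follows directly from the mean squared error between the reference and restored images. The sketch below uses the textbook definition on single-channel images stored as nested lists; the challenge's exact protocol (which color channel is scored, whether borders are cropped) is not stated in the abstract, so the function name and defaults here are illustrative assumptions.

```python
import math


def psnr(reference, restored, max_value=255.0):
    """Peak signal-to-noise ratio in dB for two equal-sized grayscale images.

    Images are nested lists of pixel values; max_value is the peak intensity
    (255 for 8-bit images).
    """
    ref = [p for row in reference for p in row]
    out = [p for row in restored for p in row]
    if len(ref) != len(out) or not ref:
        raise ValueError("images must be non-empty and have the same size")
    mse = sum((a - b) ** 2 for a, b in zip(ref, out)) / len(ref)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)
```

PSNR rewards pixel-wise agreement only, which is why the challenge pairs it with a separate perceptual track for visual realism.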

2504.03991 2026-06-17 cs.CL cs.AI cs.HC cs.MA 版本更新

Algorithmic Prompt Generation for Diverse Human-like Teaming and Communication with Large Language Models

基于大型语言模型的算法化提示生成：面向多样化类人团队协作与通信

Siddharth Srikanth, Varun Bhatt, Boshen Zhang, Werner Hager, Charles Michael Lewis, Katia P. Sycara, Aaquib Tabrez, Stefanos Nikolaidis

发表机构 * Thomas Lord Department of Computer Science, University of Southern California（美国南加州大学汤姆·劳德计算机科学系）； School of Computing and Information, University of Pittsburgh（美国匹兹堡大学计算与信息学院）； Robotics Institute, Carnegie Mellon University（卡内基梅隆大学机器人研究所）； Sibley School of Mechanical and Aerospace Engineering, Cornell University（康奈尔大学西伯利机械与航空航天工程学院）

AI总结结合质量多样性优化与LLM代理，自动搜索生成多样化团队行为的提示，捕获人类协作与通信策略，并通过用户研究验证其类人性。

AI中文摘要

理解人类如何在团队中协作和通信对于改善人-代理团队协作和AI辅助决策至关重要。然而，由于后勤、伦理和实际限制，仅依赖大规模用户研究的数据是不切实际的，因此需要多种多样化人类行为的合成模型。最近，基于大型语言模型（LLM）的代理已被证明能够在社交环境中模拟类人行为。但是，获得大量多样化行为需要手动设计提示。另一方面，质量多样性（QD）优化已被证明能够生成多样化的强化学习（RL）代理行为。在这项工作中，我们将QD优化与LLM驱动的代理相结合，以迭代搜索在长时域、多步骤协作环境中生成多样化团队行为的提示。我们首先通过一项人类受试者实验表明，人类在该领域中表现出多样化的协调和通信行为。然后，我们进行一系列实验，表明我们的方法捕获了在没有大规模数据收集的情况下难以观察到的行为，并通过后续用户研究表明这些生成的行为是类人的。我们的发现凸显了QD与LLM驱动代理的结合作为研究多代理协作中团队协作和通信策略的有效工具。

英文摘要

Understanding how humans collaborate and communicate in teams is essential for improving human-agent teaming and AI-assisted decision-making. However, relying solely on data from large-scale user studies is impractical due to logistical, ethical, and practical constraints, necessitating synthetic models of multiple diverse human behaviors. Recently, agents powered by Large Language Models (LLMs) have been shown to emulate human-like behavior in social settings. But, obtaining a large set of diverse behaviors requires manual effort in the form of designing prompts. On the other hand, Quality Diversity (QD) optimization has been shown to be capable of generating diverse Reinforcement Learning (RL) agent behavior. In this work, we combine QD optimization with LLM-powered agents to iteratively search for prompts that generate diverse team behavior in a long-horizon, multi-step collaborative environment. We first show, through a human-subjects experiment, that humans exhibit diverse coordination and communication behavior in this domain. We then present a series of experiments showing that our approach captures behaviors that are difficult to observe without large-scale data collection, and a follow-up user study to show that these generated behaviors are human-like. Our findings highlight the combination of QD and LLM-powered agents as an effective tool for studying teaming and communication strategies in multi-agent collaboration.

2401.14381 2026-06-17 cs.LG math.DG 版本更新

Manifold GCN: Diffusion-based Convolutional Neural Network for Manifold-valued Graphs

Manifold GCN：基于扩散的流形值图卷积神经网络

Martin Hanik, Gabriele Steidl, Christoph von Tycowicz

发表机构 * BIFOLD, Berlin Institute for the Foundations of Learning and Data（柏林学习与数据基础研究院）； Technical University Berlin（柏林工业大学）； Zuse Institute Berlin（柏林楚泽研究所）

AI总结提出两种适用于黎曼流形特征图的图神经网络层：基于流形值图扩散方程的扩散层和受向量神经元启发的切向多层感知器，两者在节点置换和流形等距下等变，在更广泛问题上优于任务特定网络。

Comments Extended ADNI experiment

DOI: 10.1007/s11263-026-02899-9
Journal ref: International Journal of Computer Vision, Volume 134, article number 315 (2026)

AI中文摘要

我们提出了两种适用于黎曼流形中特征图的图神经网络层。首先，基于流形值图扩散方程，我们构建了一个可应用于任意数量节点和图连接模式的扩散层。其次，通过将向量神经元框架的思想迁移到我们的通用设置中，我们建模了一个切向多层感知器。这两层在节点置换和特征流形的等距变换下都是等变的。这些特性在许多深度学习任务中带来了有益的归纳偏置。此外，它们还支持新颖、更灵活的特征设计。合成数据上的数值示例以及基于右海马体三角网格的阿尔茨海默病分类应用证明了我们新层的实用性：虽然它们适用于更广泛的问题类别，但在性能上优于任务特定的最先进网络。

英文摘要

We propose two graph neural network layers for graphs with features in a Riemannian manifold. First, based on a manifold-valued graph diffusion equation, we construct a diffusion layer that can be applied to an arbitrary number of nodes and graph connectivity patterns. Second, we model a tangent multilayer perceptron by transferring ideas from the vector neuron framework to our general setting. Both layers are equivariant under node permutations and the feature manifold's isometries. These properties have led to a beneficial inductive bias in many deep-learning tasks. Furthermore, they enable novel, more flexible feature designs. Numerical examples on synthetic data and an Alzheimer's classification application on triangle meshes of the right hippocampus demonstrate the usefulness of our new layers: While they apply to a much broader class of problems, they outperform task-specific state-of-the-art networks.
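
One way to picture a diffusion step on manifold-valued node features is an explicit Euler update that moves each node along the sum of Riemannian logarithms toward its neighbors. The sketch below does this on the unit sphere with unit edge weights. It is an illustration under assumptions, not the paper's layer: the choice of manifold, the step size, the weights, and the absence of learnable parameters are all simplifications, and the function names are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def norm(u):
    return math.sqrt(dot(u, u))


def sphere_log(p, q):
    """Tangent vector at p pointing toward q, with length equal to the geodesic distance."""
    c = max(-1.0, min(1.0, dot(p, q)))
    tangent = [qi - c * pi for pi, qi in zip(p, q)]  # project q onto the tangent space at p
    n = norm(tangent)
    if n < 1e-12:  # p == q, or antipodal points where the log is not unique
        return [0.0] * len(p)
    theta = math.acos(c)
    return [theta * t / n for t in tangent]


def sphere_exp(p, v):
    """Follow the geodesic from p in direction v for length |v|."""
    n = norm(v)
    if n < 1e-12:
        return list(p)
    return [math.cos(n) * pi + math.sin(n) * vi / n for pi, vi in zip(p, v)]


def diffusion_step(features, edges, step=0.1):
    """One explicit Euler step of graph diffusion for unit-vector node features.

    features: dict node -> unit vector; edges: undirected (i, j) pairs.
    All updates are computed from the old features (synchronous update).
    """
    dim = len(next(iter(features.values())))
    update = {i: [0.0] * dim for i in features}
    for i, j in edges:
        for a, b in ((i, j), (j, i)):
            log_ab = sphere_log(features[a], features[b])
            update[a] = [u + l for u, l in zip(update[a], log_ab)]
    return {i: sphere_exp(features[i], [step * u for u in update[i]]) for i in features}
```

The update is written only in terms of `log` and `exp`, so relabeling nodes or applying an isometry of the sphere (a rotation) to all features commutes with the step, which mirrors the equivariance properties the abstract claims for the layers.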

2412.10139 2026-06-17 cs.CL 版本更新

TACOMORE: Exploring a replicable prompting protocol for LLM-assisted corpus analysis

TACOMORE: 探索一种可复现的提示协议用于LLM辅助语料库分析

Bingru Li, Han Wang, Nicholas Groom

发表机构 * Department of Linguistics and Communication, University of Birmingham（伯明翰大学语言学与传播系）； Department of Information Engineering and Computer Science, University of Trento（特伦托大学信息工程与计算机科学系）； Institute of Foreign Languages and Cultures, University of Tartu（塔尔图大学外国语言与文化研究所）

AI总结提出TACOMORE框架，通过结构化提示将LLM从通用概率预测转向基于语料共现模式的推理，提升关键词、搭配和索引行分析的准确性与可复现性，但幻觉问题仍需人工验证。

AI中文摘要

随着语料库语言学不断扩展，研究者面临日益增长的方法论瓶颈：虽然计算工具可以轻松统计数十亿词，但这些数据的定性解释仍然是一个缓慢且劳动密集型的人工任务。大型语言模型（LLM）提供了一种有前景的自动化方法，然而其整合到该领域常因黑箱不可预测性和缺乏可复现性而受阻。本研究引入TACOMORE，一个结构化的提示框架，旨在将临时的AI交互转化为标准化的语言协议。该框架基于四项基本原则（任务、上下文、模型和可复现性），引导LLM超越通用概率预测，将其推理锚定在目标语料库的特定共现模式上。我们将该框架应用于三个核心语料库任务，即关键词、搭配和索引行分析，使用一个开放的COVID-19研究摘要语料库。在测试三个LLM后，我们发现虽然结构化提示提高了准确性和可复现性，但关于幻觉的固有限制仍然存在。本研究为LLM在语料库语言学中的作用提供了批判性视角，强调了它们作为补充工具的潜力，同时突出了人工验证不可替代的角色。

英文摘要

As corpus linguistics continues to scale, researchers are facing a growing methodological bottleneck: while computational tools can easily count billions of words, the qualitative interpretation of these data remains a slow and labor-intensive human task. Large Language Models (LLMs) offer a promising way to automate this process, yet their integration into the field is often hindered by concerns over black-box unpredictability and a lack of replicability. This study introduces TACOMORE, a structured prompting framework designed to transform ad-hoc AI interactions into a standardized linguistic protocol. Built upon four foundational principles (Task, Context, Model, and Replicability), the framework guides LLMs to move beyond generic probability prediction to anchoring their reasoning in the specific co-occurrence patterns of a target corpus. We applied this framework to three core corpus tasks, i.e., the analysis of keywords, collocates, and concordances, using an open corpus of COVID-19 research abstracts. After testing three LLMs, we found that while structured prompting improves accuracy and replicability, inherent limitations regarding hallucination persist. This research offers a critical lens into the role of LLMs in corpus linguistics, highlighting their potential as complementary tools while emphasizing the irreplaceable role of human validation.

2212.07700 2026-06-17 cs.CV 版本更新

Colab NAS: Obtaining lightweight task-specific convolutional neural networks following Occam's razor

Colab NAS：遵循奥卡姆剃刀原则获取轻量级任务特定卷积神经网络

Andrea Mattia Garavagno, Daniele Leonardis, Antonio Frisoli

发表机构 * Institute of Mechanical Intelligence, Scuola Superiore Sant’Anna of Pisa（机械智能研究所，比萨圣安娜高等学院）

AI总结提出ColabNAS，一种低成本的硬件感知神经架构搜索方法，通过奥卡姆剃刀启发的无导数搜索策略，在免费GPU服务上3.1小时内获得轻量级CNN，在Visual Wake Word数据集上达到最先进结果。

DOI: 10.1016/j.future.2023.11.003
Journal ref: Future Generation Computer Systems, vol. 152, pp. 152-159, 2024

AI中文摘要

当目标应用是一个定制且范围有限的问题，并且有足够数据从头训练网络时，当前流行的做法，即从在大数据集上训练的卷积神经网络（CNN）进行迁移学习，可能是大材小用。另一方面，训练定制且更轻量的CNN，在从头设计的情况下需要专业知识，在硬件感知神经架构搜索（HW NAS）的情况下则需要高端资源，这限制了非专业神经网络开发者对该技术的使用。因此，我们提出了ColabNAS，一种用于生成轻量级任务特定CNN的低成本HW NAS技术。其新颖的无导数搜索策略受奥卡姆剃刀原则启发，仅使用Google Colaboratory和Kaggle Kernel等免费在线GPU服务，就能在3.1 GPU小时内于Visual Wake Word数据集（一个标准的TinyML基准）上获得最先进的结果。

英文摘要

The current trend of applying transfer learning from convolutional neural networks (CNNs) trained on large datasets can be an overkill when the target application is a custom and delimited problem, with enough data to train a network from scratch. On the other hand, the training of custom and lighter CNNs requires expertise, in the from-scratch case, and or high-end resources, as in the case of hardware-aware neural architecture search (HW NAS), limiting access to the technology by non-habitual NN developers. For this reason, we present ColabNAS, an affordable HW NAS technique for producing lightweight task-specific CNNs. Its novel derivative-free search strategy, inspired by Occam's razor, allows to obtain state-of-the-art results on the Visual Wake Word dataset, a standard TinyML benchmark, in just 3.1 GPU hours using free online GPU services such as Google Colaboratory and Kaggle Kernel.

2507.15777 2026-06-17 cs.CV

Label tree semantic losses for rich multi-class medical image segmentation

用于丰富多类医学图像分割的标签树语义损失

Junwen Wang, Oscar MacCormac, William Rochford, Aaron Kujawa, Jonathan Shapey, Tom Vercauteren

发表机构 * School of Biomedical Engineering & Imaging Sciences（生物医学工程与成像科学学院）； Department of Neurosurgery（神经外科部门）

AI总结提出两种基于标签层次结构的树状语义损失函数，在脑MRI全监督分割和神经外科高光谱成像稀疏标注场景理解中取得一致改进。

DOI: 10.3389/frai.2026.1841639

AI中文摘要

丰富且准确的医学图像分割有望通过描绘术前规划的关键解剖结构、指导实时术中导航和支持精确术后评估，为下一代AI定义的临床实践奠定基础。然而，医学和外科成像分割任务中常用的学习方法对所有错误一视同仁，未能利用标签空间中的任何类间语义。随着标签基数和丰富度的增加以包含细微不同的类别，这一问题变得尤为突出。在这项工作中，我们提出了两种基于树的语义损失函数，利用标签的层次组织。我们进一步将我们的损失纳入最近提出的用于稀疏、无背景标注的训练方法中，以扩展所提出损失的适用性。在两个医学和外科成像分割任务上进行了大量实验，即全监督的头部MRI全脑分割和稀疏标注的神经外科高光谱成像场景理解。结果表明，在评估的任务特定基线上取得了一致的改进，其中基于Wasserstein的复合损失在全脑分割中支持最强，而层次加权顶层监督在稀疏HSI设置中表现最佳。

英文摘要

Rich and accurate medical image segmentation is poised to underpin the next generation of AI-defined clinical practice by delineating critical anatomy for pre-operative planning, guiding real-time intra-operative navigation, and supporting precise post-operative assessment. However, commonly used learning methods for medical and surgical imaging segmentation tasks penalise all errors equivalently and thus fail to exploit any inter-class semantics in the label space. This becomes particularly problematic as the cardinality and richness of labels increases to include subtly different classes. In this work, we propose two tree-based semantic loss functions which take advantage of a hierarchical organisation of the labels. We further incorporate our losses in a recently proposed approach for training with sparse, background-free annotations to extend the applicability of our proposed losses. Extensive experiments are reported on two medical and surgical imaging segmentation tasks, namely head MRI for whole brain parcellation with full supervision and neurosurgical hyperspectral imaging for scene understanding with sparse annotations. Results demonstrate consistent improvements over the evaluated task-specific baselines, with the strongest support for the Wasserstein-based compound loss in whole-brain parcellation and for hierarchy-weighted top-level supervision in the sparse HSI setting.

2601.06116 2026-06-17 cs.AI cs.CL cs.CY

The Homogenization Problem in LLMs: Towards Meaningful Diversity in AI Safety

大语言模型中的同质化问题：迈向人工智能安全中的有意义多样性

Ian Rios-Sialer

发表机构 * Independent Researcher（独立研究者）

AI总结本文探讨了大语言模型中同质化问题，提出通过编码价值观系统来促进多样性，通过实验揭示性别偏见并引入xeno-reproduction概念以缓解同质化。

AI中文摘要

生成式AI模型会复现训练数据中的人类偏见，并通过模式崩溃等机制进一步放大这些偏见。多样性的丧失导致同质化，这不仅损害少数群体，也使所有人蒙受损失。我们主张同质化应成为人工智能安全的核心关注点。为有意义地刻画大语言模型（LLM）中的同质化，我们引入一个框架，允许利益相关者编码其情境和价值体系。我们通过一项实验展示该方法，该实验揭示了某个LLM（Claude 3.5 Haiku）在开放式故事提示上的性别偏见。基于酷儿理论，我们从规范性的角度对同质化进行形式化。借用女性主义理论的语言，我们引入xeno-reproduction概念，将其作为一类通过促进多样性来缓解同质化的任务。我们的工作开启了一条协作研究路线，旨在理解和推进AI中的多样性。

英文摘要

Generative AI models reproduce the human biases in their training data and further amplify them through mechanisms such as mode collapse. The loss of diversity produces homogenization, which not only harms the minoritized but impoverishes everyone. We argue homogenization should be a central concern in AI safety. To meaningfully characterize homogenization in Large Language Models (LLMs), we introduce a framework that allows stakeholders to encode their context and value system. We illustrate our approach with an experiment that surfaces gender bias in an LLM (Claude 3.5 Haiku) on an open-ended story prompt. Building from queer theory, we formalize homogenization in terms of normativity. Borrowing language from feminist theory, we introduce the concept of xeno-reproduction as a class of tasks for mitigating homogenization by promoting diversity. Our work opens a collaborative line of research that seeks to understand and advance diversity in AI.

2605.12220 2026-06-17 cs.CV cs.AI cs.LG cs.RO

TriBand-BEV: Real-Time LiDAR-Only 3D Pedestrian Detection via Height-Aware BEV and High-Resolution Feature Fusion

TriBand-BEV：基于高度感知的鸟瞰图与高分辨率特征融合的实时仅LiDAR三维行人检测

Mohammad Khoshkdahan, Alexey Vinel

发表机构 * Karlsruhe Institute of Technology（卡尔斯鲁厄理工学院）

AI总结本文提出TriBand-BEV方法，通过高度感知的鸟瞰图与高分辨率特征融合实现实时LiDAR-only三维行人检测，采用轻量级鸟瞰图张量映射，单网络一次通过检测车辆、行人和自行车，提升检测精度与速度。

Comments Accepted for publication in the Proceedings of the 2026 International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2026)

DOI: 10.65109/INST9866
Journal ref: Proceedings of the 25th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2026)

AI中文摘要

安全的自动驾驶代理和移动机器人需要快速的实时三维感知，尤其是针对行人等易受伤害道路使用者（VRU）。我们提出一种新的鸟瞰图（BEV）编码方法，将完整的三维LiDAR点云映射为包含三个高度带的轻量级二维BEV张量。我们明确地将三维检测重新表述为二维检测问题，然后从BEV输出中重建三维框。单个网络在一次前向传播中检测车辆、行人和骑行者。骨干网络在深层阶段使用区域注意力，覆盖P1到P4的层次化双向颈部网络融合上下文与细节，检测头预测定向框，其中边偏移采用分布焦点学习，并使用旋转IoU损失。训练时在通道空间中应用小幅垂直重新分箱和温和的反射率抖动，以抑制记忆化。在三维重建过程中，我们使用四分位距（IQR）滤波器去除噪声和离群的LiDAR点。在KITTI数据集上，TriBand-BEV在单个消费级GPU上以49 FPS运行，在简单、中等和困难难度下的行人BEV AP分别为58.7%、52.6%和47.2%，超过Complex-YOLO，提升分别为+12.6%、+7.5%和+3.1%。定性场景显示其在遮挡下检测稳定。该流程紧凑，可直接用于实时机器人部署。我们的源代码已在GitHub上公开。

英文摘要

Safe autonomous agents and mobile robots need fast real time 3D perception, especially for vulnerable road users (VRUs) such as pedestrians. We introduce a new bird's eye view (BEV) encoding, which maps the full 3D LiDAR point cloud into a light-weight 2D BEV tensor with three height bands. We explicitly reformulate 3D detection as a 2D detection problem and then reconstruct 3D boxes from the BEV outputs. A single network detects cars, pedestrians, and cyclists in one pass. The backbone uses area attention at deep stages, a hierarchical bidirectional neck over P1 to P4 fuses context and detail, and the head predicts oriented boxes with distribution focal learning for side offsets and a rotated IoU loss. Training applies a small vertical re bin and a mild reflectance jitter in channel space to resist memorization. We use an interquartile range (IQR) filter to remove noisy and outlier LiDAR points during 3D reconstruction. On KITTI dataset, TriBand-BEV attains 58.7/52.6/47.2 pedestrian BEV AP(%) for easy, moderate, and hard at 49 FPS on a single consumer GPU, surpassing Complex-YOLO, with gains of +12.6%, +7.5%, and +3.1%. Qualitative scenes show stable detection under occlusion. The pipeline is compact and ready for real time robotic deployment. Our source code is publicly available on GitHub.
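
The height-banded BEV encoding described above can be sketched as a rasterization of LiDAR points into a three-channel grid, one channel per height band. The grid extent, cell size, band edges, and the use of per-cell point counts as the channel value are all assumptions made for this sketch; the abstract does not specify what each band channel stores.

```python
def triband_bev(points, x_range=(0.0, 40.0), y_range=(-20.0, 20.0),
                z_edges=(-2.0, -1.0, 0.0, 1.0), cell=0.5):
    """Rasterize (x, y, z) points into bev[band][row][col] point counts.

    Rows index the forward axis x, columns the lateral axis y, and the three
    bands are the height intervals between consecutive z_edges. All ranges
    and the count-based encoding are illustrative assumptions.
    """
    rows = int((x_range[1] - x_range[0]) / cell)
    cols = int((y_range[1] - y_range[0]) / cell)
    bev = [[[0] * cols for _ in range(rows)] for _ in range(3)]
    for x, y, z in points:
        inside = (x_range[0] <= x < x_range[1]
                  and y_range[0] <= y < y_range[1]
                  and z_edges[0] <= z < z_edges[3])
        if not inside:
            continue  # drop points outside the region of interest
        band = 0 if z < z_edges[1] else (1 if z < z_edges[2] else 2)
        r = min(int((x - x_range[0]) / cell), rows - 1)
        c = min(int((y - y_range[0]) / cell), cols - 1)
        bev[band][r][c] += 1
    return bev
```

The resulting tensor has the same layout as a three-channel image, which is what lets a 2D detector process it in a single pass before 3D boxes are reconstructed from its outputs.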

2512.03805 2026-06-17 cs.LG

Deep Reinforcement Learning for Dynamic Algorithm Configuration: A Case Study on Optimizing OneMax with the (1+($λ$,$λ$))-GA

面向动态算法配置的深度强化学习：以(1+(λ,λ))-GA优化OneMax的案例研究

Tai Nguyen, Phong Le, André Biedenkapp, Carola Doerr, Nguyen Dang

发表机构 * University of St Andrews, United Kingdom（圣安德鲁斯大学，英国）； Sorbonne Université, CNRS, LIP6, France（索邦大学，法国）； University of Freiburg, Germany（弗赖堡大学，德国）

AI总结本文研究了深度强化学习算法DDQN和PPO在OneMax问题中控制(1+(λ,λ))-GA种群大小的挑战，发现DDQN和PPO存在可扩展性下降和学习不稳定问题，通过自适应奖励转移机制改进DDQN，使其在样本效率上优于传统方法。

Comments arXiv admin note: text overlap with arXiv:2502.20265

DOI: 10.1145/3821217

AI中文摘要

动态算法配置（DAC）研究如何高效地识别参数化优化算法的控制策略。许多研究利用强化学习（RL）应对DAC挑战；然而，应用RL通常需要大量领域专业知识。在本文中，我们对两种深度RL算法，即双深度Q网络（DDQN）和近端策略优化（PPO），进行了全面研究，用于在OneMax实例上控制(1+(λ,λ))-GA的种群大小。尽管OneMax结构简单，但为(1+(λ,λ))-GA学习有效的控制策略会形成极具挑战性的DAC景观，使其成为一个受控而又高要求的基准。我们的研究揭示了限制DDQN和PPO的两个根本挑战：可扩展性下降和学习不稳定，其根源在于探索不足和规划时间跨度覆盖问题。针对探索不足，我们引入一种自适应奖励平移机制，利用奖励分布的统计信息增强DDQN的探索。这免去了针对具体实例的超参数调优，并确保在不同问题规模上效果一致。针对规划时间跨度覆盖问题，我们证明无折扣学习在DDQN中是成功的，而PPO面临根本性的方差问题，需要替代设计。我们进一步表明，超参数优化虽然提升了PPO的稳定性，但始终无法找到有效的策略。最后，采用自适应奖励平移的DDQN取得了与理论推导策略相当的性能，同时样本效率大幅提升，比先前的DAC方法高出数个数量级。我们的发现揭示了标准深度RL方法在这一具有挑战性的DAC设置中面临的根本障碍，并指出了有效学习所需的关键方法要素。

英文摘要

Dynamic Algorithm Configuration (DAC) studies the efficient identification of control policies for parameterized optimization algorithms. Numerous studies leverage Reinforcement Learning (RL) to address DAC challenges; however, applying RL often requires extensive domain expertise. In this work, we conduct a comprehensive study of two deep-RL algorithms--Double Deep Q-Networks (DDQN) and Proximal Policy Optimization (PPO)--for controlling the population size of the $(1+(λ,λ))$-GA on OneMax instances. Although OneMax is structurally simple, learning effective control policies for the $(1+(λ,λ))$-GA induces a highly challenging DAC landscape, making it a controlled yet demanding benchmark. Our investigation reveals two fundamental challenges limiting DDQN and PPO: scalability degradation and learning instability, traced to under-exploration and planning horizon coverage. To address under-exploration, we introduce an adaptive reward shifting mechanism that leverages reward distribution statistics to enhance DDQN exploration. This eliminates instance-specific hyperparameter tuning and ensures consistent effectiveness across problem scales. To resolve planning horizon coverage, we demonstrate that undiscounted learning succeeds in DDQN, while PPO faces fundamental variance issues necessitating alternative designs. We further show that while hyperparameter optimization enhances PPO's stability, it consistently fails to identify effective policies. Finally, DDQN with adaptive reward shifting achieves performance comparable to theoretically derived policies with vastly improved sample efficiency, outperforming prior DAC approaches by orders of magnitude. Our findings provide insights into the fundamental obstacles faced by standard deep-RL approaches in this challenging DAC setting and highlight the key methodological ingredients required for effective learning.
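
For readers unfamiliar with the benchmark, the sketch below implements the $(1+(λ,λ))$-GA on OneMax with a pluggable policy that picks the population size λ at each iteration, which is the decision the RL agents in this study learn. It follows the standard textbook form of the algorithm (mutation rate λ/n, crossover bias 1/λ). The `policy` interface and helper names are assumptions for illustration, and `theory_policy` uses the fitness-dependent choice λ = sqrt(n / (n - f(x))) known from the theory literature, rounded to an integer.

```python
import math
import random


def onemax(bits):
    """Number of one-bits; the optimum is the all-ones string."""
    return sum(bits)


def fixed_policy(lam):
    """Static configuration: always use the same population size."""
    return lambda fitness, n: lam


def theory_policy(fitness, n):
    """Fitness-dependent population size, rounded to an integer >= 1."""
    return max(1, round(math.sqrt(n / (n - fitness))))


def one_plus_lambda_lambda_ga(n, policy, rng, max_evals=100_000):
    """Run the (1+(lambda,lambda))-GA on OneMax; returns (solution, evaluations)."""
    x = [rng.randint(0, 1) for _ in range(n)]
    fx = onemax(x)
    evals = 1
    while fx < n and evals < max_evals:
        lam = policy(fx, n)              # the DAC decision for this iteration
        p, c = lam / n, 1.0 / lam        # mutation rate and crossover bias
        # Mutation phase: one shared strength ell ~ Bin(n, p) for all offspring.
        ell = sum(rng.random() < p for _ in range(n))
        best_mut, best_mut_f = None, -1
        for _ in range(lam):
            y = x[:]
            for i in rng.sample(range(n), ell):
                y[i] ^= 1
            fy = onemax(y)
            evals += 1
            if fy > best_mut_f:
                best_mut, best_mut_f = y, fy
        # Crossover phase: take each bit from the best mutant with probability c.
        best_cross, best_cross_f = None, -1
        for _ in range(lam):
            y = [m if rng.random() < c else a for a, m in zip(x, best_mut)]
            fy = onemax(y)
            evals += 1
            if fy > best_cross_f:
                best_cross, best_cross_f = y, fy
        # Elitist selection: accept the crossover winner if it is not worse.
        if best_cross_f >= fx:
            x, fx = best_cross, best_cross_f
    return x, evals
```

A learned DAC policy would replace `theory_policy` with a network mapping the observed state (here just the current fitness) to λ, and the number of evaluations used per run is the cost signal it is trained to reduce.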

2512.20985 2026-06-17 cs.AI cs.MA

A Blockchain-Monitored Agentic AI Architecture for Trusted Perception-Reasoning-Action Pipelines

基于区块链监控的代理AI架构：可信感知-推理-行动流水线

Salman Jan, Hassan Ali Razzaqi, Ali Akarma, Mohammad Riyaz Belgaum

发表机构 * Faculty of Computer Studies, Arab Open University-Bahrain（巴林阿拉伯开放大学计算机科学学院）； Faculty of Computer and Information System, Islamic University of Madinah, Saudi Arabia（沙特阿拉伯麦地那伊斯兰大学计算机与信息系统学院）

AI总结本文提出一种结合区块链的代理AI架构，用于确保自主决策流程中的信任和可追溯性，通过区块链实现对行动的持续监控和审计，验证输入并记录执行结果。

Comments This paper was presented at the IEEE International Conference on Computing and Applications (ICCA 2025), Bahrain

DOI: 10.1109/ICCA66035.2025.11430865
Journal ref: Proceedings of the 2025 IEEE International Conference on Computing and Applications (ICCA), Bahrain, 2025, pp. 1-7

AI中文摘要

代理AI系统在自主决策中的应用日益广泛，涉及医疗、智慧城市、数字取证和供应链管理等领域。尽管这些系统灵活且能提供实时推理，但它们也引发了关于信任、监督以及其所依据的信息和活动完整性的担忧。本文提出一种统一架构模型，将基于LangChain的多代理系统与许可区块链相结合，以保证对代理行动的持续监控、策略执行和不可篡改的可审计性。该框架将感知-概念化-行动循环与区块链治理层相关联，由后者验证输入、评估推荐行动并记录执行结果。本文介绍了一个基于Hyperledger Fabric的系统，集成了MCP的行动执行器以及LangChain代理，并在智能库存管理、交通信号控制和医疗监控场景中进行了实验。结果表明，区块链安全验证能够有效防止未经授权的操作，在整个决策过程中提供可追溯性，并将操作延迟保持在合理范围内。所提出的框架为实现高影响力、自主且负责任的代理AI应用提供了一种通用方案。

英文摘要

The application of agentic AI systems in autonomous decision-making is growing in the areas of healthcare, smart cities, digital forensics, and supply chain management. Even though these systems are flexible and offer real-time reasoning, they also raise concerns of trust and oversight, and integrity of the information and activities upon which they are founded. The paper suggests a single architecture model comprising of LangChain-based multi-agent system with a permissioned blockchain to guarantee constant monitoring, policy enforcement, and immutable auditability of agentic action. The framework relates the perception conceptualization-action cycle to a blockchain layer of governance that verifies the inputs, evaluates recommended actions, and documents the outcomes of the execution. A Hyperledger Fabric-based system, action executors MCP-integrated, and LangChain agent are introduced and experiments of smart inventory management, traffic-signal control, and healthcare monitoring are done. The results suggest that blockchain-security verification is efficient in preventing unauthorized practices, offers traceability throughout the whole decision-making process, and maintains operational latency within reasonable ranges. The suggested framework provides a universal system of implementing high-impact agentic AI applications that are autonomous yet responsible.
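
The immutable-audit idea can be illustrated without any blockchain platform: each logged agent action stores the hash of the previous entry, so a later edit to any record breaks verification. This is a deliberately simplified stand-in, assuming a single in-memory log; it does not use Hyperledger Fabric, LangChain, or MCP, and the record fields and function names are hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder hash preceding the first record


def _digest(prev_hash, record):
    """SHA-256 over the previous hash and a canonical JSON form of the record."""
    payload = json.dumps(record, sort_keys=True)
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append_record(chain, record):
    """Append an agent action to the log, linking it to the previous entry."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    chain.append({"prev": prev_hash, "record": record,
                  "hash": _digest(prev_hash, record)})


def verify_chain(chain):
    """Return True only if every entry matches its stored hash and link."""
    prev_hash = GENESIS
    for block in chain:
        if block["prev"] != prev_hash or block["hash"] != _digest(prev_hash, block["record"]):
            return False
        prev_hash = block["hash"]
    return True
```

In a permissioned blockchain the same linkage is replicated across multiple organizations and guarded by endorsement rules, which is what turns tamper evidence on one machine into the shared auditability the abstract describes.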

2310.06328 2026-06-17 cs.LG eess.SP

ARC-Fi: Exploiting Antenna Spatial Diversity for Label-Efficient Domain Generalization in Wi-Fi Sensing

Ke Xu, Zhiyong Zheng, Hongyuan Zhu, Lei Wang, Jiangtao Wang

Affiliations: Suzhou Institute for Advanced Research, University of Science and Technology of China; Suzhou Big Data and AI Research and Engineering Center; School of Artificial Intelligence and Data Science, University of Science and Technology of China; Institute for Infocomm Research (I2R), A*STAR; School of Computer Science and Technology, Soochow University

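AI summary: ARC-Fi introduces a physics-informed data augmentation strategy to address domain shift in Wi-Fi sensing and achieve efficient domain generalization.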

Comments: This work has been submitted to the IEEE for possible publication.

Details

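AI abstract (translated from Chinese)

Wi-Fi sensing systems are severely hampered by domain shift when deployed in unseen real-world environments. Existing methods try to address this through unsupervised domain adaptation (UDA) or domain generalization (DG), but they depend heavily on target data that is unavailable or on labeled source datasets that are large and prohibitively expensive. In practice, collecting large amounts of unlabeled channel state information (CSI) is feasible, while manual labeling is severely constrained. This practical dilemma calls for semi-supervised domain generalization (SSDG). To this end, we propose ARC-Fi, the first SSDG framework designed specifically for Wi-Fi sensing. Applying conventional contrastive learning directly to CSI data inevitably triggers domain-specific "shortcut learning", causing the model to memorize the environmental background rather than gesture dynamics. To overcome this, ARC-Fi introduces a physics-informed data augmentation strategy, the Antenna Response Consistency (ARC) module. ARC exploits the intrinsic spatial diversity of multi-antenna systems, treating signals from co-located antennas as natural, semantics-preserving augmented views that explicitly block environmental shortcuts. In addition, we introduce a unified semi-supervised contrastive objective that uses scarce labels and reliable pseudo-labels to align cross-domain features, effectively preventing the blind repulsion of same-class instances. Extensive experiments on the Widar and CSIDA datasets show that ARC-Fi sets a new state of the art, significantly outperforming existing UDA, DG, and SSDG methods. Ultimately, this work offers a physics-grounded, label-efficient solution that advances the large-scale deployment of robust real-world Wi-Fi sensing systems. Code is available at https://github.com/KaoruMiyazono/UniCrossFi.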

English abstract

Wi-Fi sensing systems are severely hindered by domain shifts when deployed in unseen real-world environments. While existing methods attempt to tackle this through Unsupervised Domain Adaptation (UDA) or Domain Generalization (DG), they critically rely on either inaccessible target data or prohibitively expensive, massive labeled source datasets. In practice, collecting abundant unlabeled Channel State Information (CSI) is feasible, whereas manual labeling is severely constrained. This realistic dilemma necessitates Semi-Supervised Domain Generalization (SSDG). To this end, we propose ARC-Fi, the first dedicated SSDG framework for Wi-Fi sensing. Directly applying conventional contrastive learning to CSI data inevitably triggers paradigm-specific "shortcut learning," causing models to memorize environmental backgrounds rather than gesture dynamics. To overcome this, ARC-Fi introduces a physics-informed data augmentation strategy: the Antenna Response Consistency (ARC) module. ARC exploits the intrinsic spatial diversity of multi-antenna systems, treating signals from co-located antennas as naturally semantics-preserving augmented views to explicitly block environmental shortcuts. Furthermore, we introduce a unified Semi-Supervised Contrastive Objective that leverages scarce labels and reliable pseudo-labels to align cross-domain features, effectively preventing the blind repulsion of same-class instances. Extensive experiments on the Widar and CSIDA datasets demonstrate that ARC-Fi establishes a new state-of-the-art, significantly outperforming existing UDA, DG, and SSDG methods. Ultimately, this work provides a physics-grounded, label-efficient solution, advancing the scalable deployment of robust real-world Wi-Fi sensing systems. Code is available at: https://github.com/KaoruMiyazono/UniCrossFi.
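The idea of treating co-located antennas as augmented views can be illustrated with a generic InfoNCE-style contrastive loss. This is a minimal sketch under stated assumptions, not the ARC module or the paper's objective: the abstract does not give the loss, so the function `antenna_info_nce`, the temperature value, and the toy embeddings are all made up for illustration, and the encoder that would produce the embeddings is omitted. The label and pseudo-label terms of the semi-supervised objective are not modeled.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (a zero vector is left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def antenna_info_nce(view_a, view_b, temperature=0.1):
    """InfoNCE loss where view_a[i] and view_b[i] are embeddings of the same
    sample seen by two co-located antennas (the positive pair); every other
    sample in the batch serves as a negative."""
    a = [l2_normalize(v) for v in view_a]
    b = [l2_normalize(v) for v in view_b]
    total = 0.0
    for i, anchor in enumerate(a):
        logits = [dot(anchor, other) / temperature for other in b]
        max_logit = max(logits)  # subtract the max for numerical stability
        log_denominator = max_logit + math.log(
            sum(math.exp(z - max_logit) for z in logits)
        )
        total += log_denominator - logits[i]
    return total / len(a)


# Toy embeddings (made up): two samples, each seen by two antennas.
antenna_1 = [[1.0, 0.0], [0.0, 1.0]]
antenna_2 = [[0.9, 0.1], [0.1, 0.9]]
aligned_loss = antenna_info_nce(antenna_1, antenna_2)
shuffled_loss = antenna_info_nce(antenna_1, antenna_2[::-1])
```

When the two antenna views of each sample are paired correctly, the loss is small; pairing each sample with the other sample's view makes it large. Because both views come from the same environment, the background is shared across the positive pair and cannot by itself separate positives from negatives, which is one plausible reading of how such views discourage environmental shortcuts.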

2505.12620 2026-06-17 cs.CV

BusterX: MLLM-Powered AI-Generated Video Forgery Detection and Explanation

Haiquan Wen, Yiwei He, Zhenglin Huang, Tianxiao Li, Zihan Yu, Xingru Huang, Lu Qi, Baoyuan Wu, Xiangtai Li, Guangliang Cheng

Affiliations: University of Liverpool, UK; Nanyang Technological University, SG; The Chinese University of Hong Kong, Shenzhen, Guangdong, China; Wuhan University; Hangzhou Dianzi University

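AI summary: This paper proposes BusterX, a video forgery detection system based on multimodal large language models that improves detection accuracy and explanation quality through improved datasets and evaluation benchmarks.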

Details

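AI abstract (translated from Chinese)

As generative video models become increasingly realistic, detecting AI-generated videos requires systems that are both accurate and interpretable. However, applying multimodal large language models (MLLMs) to video forensics is currently limited by outdated datasets, simplistic evaluation protocols, and a reliance on black-box classification. To address these issues, we introduce a comprehensive dataset, benchmark, and baseline model for video forgery detection. First, we present GenBuster-200K, a fair dataset of more than 200,000 high-quality videos from state-of-the-art generators, covering diverse real-world scenarios. Second, we propose GenBuster-Bench, a diagnostic benchmark with three progressive tracks (in-domain, out-of-domain, and in-the-wild) for evaluating models under domain shifts and generational shifts. It also introduces an MLLM-as-a-Judge protocol to assess the quality of the generated forensic explanations. Finally, we develop BusterX, an MLLM baseline trained with RL. Rather than performing direct binary classification, BusterX treats detection as a visual reasoning task in which the generated reasoning chain itself serves as the detector. Experimental results show that BusterX outperforms several leading MLLMs (e.g., Qwen3.5, Claude-Sonnet-4.6) in both detection accuracy and reasoning quality.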

English abstract

As generative video models become increasingly realistic, detecting AI-generated videos requires systems that offer both accuracy and interpretability. However, applying Multimodal Large Language Models (MLLMs) to video forensics is currently limited by outdated datasets, simplistic evaluation protocols, and a reliance on black-box classification. To address these issues, we introduce a comprehensive dataset, benchmark, and baseline model for video forgery detection. First, we present GenBuster-200K, a fair dataset of over 200,000 high-quality videos sourced from state-of-the-art generators, featuring diverse real-world scenarios. Second, we propose GenBuster-Bench, a diagnostic benchmark spanning three progressive tracks (In-Domain, Out-of-Domain, and In-the-Wild) to evaluate models across domain shifts and generational shifts. It also introduces an MLLM-as-a-Judge protocol to assess the quality of the generated forensic explanations. Finally, we develop BusterX, an MLLM baseline with RL training. Instead of direct binary classification, BusterX formulates detection as a visual reasoning task, where the generated reasoning chain itself serves as the detector. Experimental results demonstrate that BusterX outperforms several leading MLLMs (e.g., Qwen3.5, Claude-Sonnet-4.6) in both detection accuracy and rationale quality.
