arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.09183 2026-06-09 cs.RO New submission

Autonomous Obstacle Removal for Excavators through Policy Learning with Particle Simulation

Yuki Kadokawa, Sandro M. Alcantara Tacora, Taro Abe, Daisuke Endo, Genki Yamauchi, Takeshi Hashimoto, Takamitsu Matsubara

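Affiliations: Nara Institute of Science and Technology; Public Works Research Institute

AI summary: Proposes a particle-simulation-based curriculum learning framework that, using RGB-D perception and parameterized trajectory outputs, lets an excavator autonomously remove ground obstacles under varying burial depths; robustness is validated on a real 12-ton excavator.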

Comments under review

English abstract

Autonomous obstacle removal from the ground is an important earthwork task, but it is difficult to automate because an excavator must adapt its excavation trajectories over repeated cycles as soil-obstacle conditions change. Learning such state-dependent behavior requires a training environment that reproduces accumulated soil-obstacle interactions, including contact states, terrain deformation, and obstacle visibility. Particle-based simulation is therefore well suited to this policy learning. However, particle simulation is computationally expensive, and repeated excavation cycles further increase the learning cost. We observe that the burial condition of an obstacle governs both task difficulty and simulation cost: deeper burial makes obstacle removal harder while also requiring more particles for accurate simulation. This observation motivates a burial-conditioned curriculum learning strategy. We propose a time-efficient sim-to-real policy learning framework in which the policy observes terrain and obstacle information from RGB-D measurements and outputs a parameterized excavation trajectory; the simulator reproduces, under controllable burial conditions, the same observation-action interface used by the real excavator. The curriculum begins with shallow burial conditions and progressively increases burial depth while adjusting particle count, thus simultaneously controlling task difficulty and simulation cost. Experiments show that the proposed framework successfully learns an effective obstacle-removal policy, whereas baseline methods fail even after a full week of training. The proposed curriculum achieves effective performance within three days and transfers successfully to a real 12-ton excavator operating on open ground with various steel obstacles, demonstrating robust obstacle removal.

2606.09181 2026-06-09 cs.CV cs.LG New submission

Counterfactual Reasoning for Fine-Grained Evidence Disentanglement in VideoQA

Zhou Du, Hamid Krim, Xiao Wu, Zhaoquan Yuan, Liangwei Li, Keisuke Fujii

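Affiliations: School of Optoelectronic Science and Engineering, University of Electronic Science and Technology of China

AI summary: Proposes CREDiT, a counterfactual reasoning framework that uses a structural causal model to decompose cross-modal VideoQA representations into causal and non-causal components and applies feature-level causal interventions under independence constraints, improving answer accuracy and reasoning reliability.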

Comments 10 pages, 6 figures

English abstract

Recent advances in video multimodal models have significantly improved VideoQA performance. However, these systems often rely on spurious statistical correlations rather than answer-relevant causal evidence, resulting in unfaithful and brittle reasoning, especially in complex real-world scenarios. Existing methods either rely on cross-modality correlations, costly curated training resources, or insufficient causal assumptions and constraints, and typically operate at the time-interval level. As a result, they fail to explicitly disentangle causal visual cues from confounders and provide limited fine-grained evidence localization. To address this issue, we propose a Counterfactual Reasoning framework for fine-grained Evidence Disentanglement (CREDiT). CREDiT formulates the VideoQA process using a structural causal model and learns cross-modality representations that are explicitly decomposed into causal and non-causal components under independence and minimality constraints. To facilitate faithful disentanglement, we introduce feature-level causal interventions and construct counterfactual inputs that approximate causal effects while suppressing non-causal correlations. Extensive experiments on NExT-GQA, SportsQA, and SPORTU-video demonstrate that CREDiT consistently improves answer accuracy and reasoning reliability across both generic and complex sports scenarios, leading to more trustworthy VideoQA systems.

2606.09180 2026-06-09 cs.CV New submission

Claude Code-Driving Scenario Mining for the Argoverse 2 Challenge

Wei Deng, Caoshengzhe Xue, Shuaikun Liu, Zhaohong Liu, Mengshi Qi, Huadong Ma

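Affiliations: Beijing University of Posts and Telecommunications

AI summary: Proposes a four-stage pipeline for the Argoverse 2 Scenario Mining Challenge: autonomous code generation by Claude Code, iterative training-set screening, semantic code review, and scene-level verification.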

English abstract

We present our submission to the CVPR 2026 Argoverse 2 Scenario Mining Challenge. Our system uses a four-stage pipeline: (1) autonomous code generation via a Claude Code agent powered by GLM 5.1, (2) iterative training set screening with a Timestamp Balanced Accuracy threshold of 0.8 to curate few-shot examples, (3) semantic code review by a separate Claude Code session, and (4) Qwen3-VL scene-level verification to filter false positives. We report results on the Argoverse 2 test set.
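The screening step in stage (2) can be pictured with a minimal sketch. It assumes the metric behaves like ordinary balanced accuracy (the mean of true-positive and true-negative rates) over per-timestamp binary labels; the challenge's exact definition may differ. The names `balanced_accuracy`, `screen_examples`, and the toy labels are hypothetical and do not come from the submission.

```python
from typing import Dict, List, Sequence


def balanced_accuracy(y_true: Sequence[bool], y_pred: Sequence[bool]) -> float:
    """Mean of true-positive rate and true-negative rate over timestamps."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    tn = sum((not t) and (not p) for t, p in zip(y_true, y_pred))
    pos = sum(bool(t) for t in y_true)
    neg = len(y_true) - pos
    # If one class is absent, score only the class that is present.
    rates = []
    if pos:
        rates.append(tp / pos)
    if neg:
        rates.append(tn / neg)
    return sum(rates) / len(rates) if rates else 0.0


def screen_examples(candidates: Dict[str, dict], threshold: float = 0.8) -> List[str]:
    """Keep the names of generated programs whose score meets the threshold."""
    kept = []
    for name, result in candidates.items():
        score = balanced_accuracy(result["y_true"], result["y_pred"])
        if score >= threshold:
            kept.append(name)
    return kept


# Hypothetical per-timestamp labels for two generated scenario programs.
candidates = {
    "prog_a": {"y_true": [True, True, False, False], "y_pred": [True, True, False, False]},
    "prog_b": {"y_true": [True, True, False, False], "y_pred": [True, False, True, False]},
}
selected = screen_examples(candidates)
```

Here only `prog_a` clears the 0.8 threshold, so it alone would be kept as a few-shot example.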

2606.09178 2026-06-09 cs.CL cs.AI New submission

Culturally-Adapted Red-Teaming Across East and Southeast Asian Contexts: A Methodological and Comparative Analysis

Hyeji Choi, Yongtaek Lim, Minwoo Kim

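Affiliations: University of California, Berkeley; Korea Advanced Institute of Science and Technology

AI summary: For multilingual safety evaluation of large language models, builds paired direct-translation and culturally adapted datasets and finds that culturally adapted prompts raise attack success rate by 9.3 percentage points on average, that direct translation underestimates risk, and that its Cultural Depth scores are markedly lower than those of the culturally adapted versions, indicating that adapting to cultural context is essential for valid evaluation.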

Comments Accepted to ICML 2026 Workshop on AIWILDS

English abstract

Multilingual safety evaluation of large language models (LLMs) has predominantly relied on direct translation (DT) of English benchmarks into target languages - an approach that converts surface-level linguistic form while failing to reflect the cultural context embedded in threat scenarios, social norms, and legal frameworks. We construct paired DT and culturally-adapted (CA) datasets via 1:1 seed matching for four languages - Korean (KO), Japanese (JA), Thai (TH), and Khmer (KM) - and compare Attack Success Rate (ASR) and Cultural Realism scores across four open-source LLMs. CA prompts yield Delta-ASR > 0 across all 16 language x model combinations (mean +9.3 pp), and DT-based evaluation underestimates risk in 44 of 48 category x language combinations. Language-level analysis reveals that the distribution of threat forms is heterogeneous across languages. Cultural Realism analysis further shows that DT Cultural Depth (C3) scores remain consistently below 1.0 out of 3.0 across all four languages (mean 0.17), whereas CA scores reach up to 2.51, indicating that direct translation produces inputs systematically divergent from those encountered in real-world multicultural settings. These findings demonstrate that adapting benchmarks to language-specific cultural contexts - rather than relying on linguistic translation alone - is necessary for valid multilingual LLM safety evaluation.
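The Delta-ASR comparison reported above amounts to a per-combination difference between the CA and DT attack success rates. A minimal sketch follows, assuming ASR is stored in percent per (language, model) pair; the function name and the numbers are hypothetical and are not taken from the paper.

```python
from statistics import mean
from typing import Dict, Tuple

Key = Tuple[str, str]  # (language, model)


def delta_asr(asr_ca: Dict[Key, float], asr_dt: Dict[Key, float]) -> Dict[Key, float]:
    """Per-combination ASR difference (CA minus DT), in percentage points."""
    return {key: asr_ca[key] - asr_dt[key] for key in asr_ca}


# Hypothetical ASR values in percent; not taken from the paper.
asr_dt = {("KO", "model_a"): 20.0, ("JA", "model_a"): 15.0}
asr_ca = {("KO", "model_a"): 30.0, ("JA", "model_a"): 22.0}

deltas = delta_asr(asr_ca, asr_dt)
mean_delta = mean(deltas.values())                   # average gap in percentage points
all_positive = all(d > 0 for d in deltas.values())   # True when CA exceeds DT everywhere
```

A positive value for every key corresponds to the "Delta-ASR > 0 across all combinations" statement; the mean corresponds to the reported average gap.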

2606.09175 2026-06-09 cs.LG cs.AI cs.DC New submission

CANS: Accelerating Multiuser Collaborative Edge Inference via Cooperative Autodidactic NeuroSurgeon

Zheshun Wu, Ziyang Zhang, Changyao Lin, Zenglin Xu, Jie Liu

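Affiliations: Harbin Institute of Technology, Shenzhen; Politecnico di Milano; Harbin Institute of Technology; Fudan University; Shanghai Academy of Artificial Intelligence for Science

AI summary: Proposes the CANS framework, which uses the FedLinUCB-DW algorithm to let heterogeneous devices adaptively learn optimal DNN partitions, accelerating multi-user collaborative edge inference by sharing online inference feedback and offline experience and substantially reducing latency.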

Comments 24 pages, 14 figures, 5 tables, submitted for possible journal publication

English abstract

Recently, mobile edge computing (MEC)-enabled collaborative deep neural network (DNN) inference has emerged as a promising approach for delivering intelligent services to resource-constrained mobile devices. A representative scenario is multi-user collaborative edge inference, where distinct devices independently partition their DNN models and offload backend computation to a common edge server over wireless networks. However, determining the optimal DNN partition for each device is challenging due to unknown and time-varying system conditions, including fluctuating wireless links and diverse device capabilities. To address this problem, we propose Cooperative Autodidactic NeuroSurgeon (CANS), a collaborative edge inference framework that enables devices to adaptively learn optimal DNN partitions by sharing informative feedback during online inference. To handle device heterogeneity and better leverage offline inference experience, we integrate a novel FedLinUCB-DW algorithm that groups devices of the same type and warm-starts online exploration using local offline early-exit inference experience. Furthermore, we provide theoretical guarantees for FedLinUCB-DW by deriving a regret upper bound. We also validate our method on both a simulated environment and a hardware prototype system. Empirical evaluations demonstrate that CANS achieves lower inference latency than state-of-the-art baselines. In particular, in prototype experiments on two edge devices, CANS reduces average inference latency by up to 50% compared to the non-cooperative baseline.

2606.09169 2026-06-09 cs.AI cs.CV cs.MM New submission

IMUG-Bench: Benchmarking Unified Multimodal Models on Interleaved Understanding and Generation

Lingyi Meng, Zecong Tang, Haoran Li, Tengju Ru, Zhejun Cui, Weitong Lian, Qi Kang, Hangshuo Cao, Yichen Zhu, Yechi Liu, Kaixuan Wang, Yu-Jie Yuan, Chunwei Wang, Yu Zhang, Bo Dai

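Affiliations: Zhejiang University; The University of Hong Kong; Institute of Automation, Chinese Academy of Sciences; Huawei

AI summary: Proposes IMUG-Bench, a benchmark for evaluating the understanding and generation capabilities of unified multimodal models in multi-turn interleaved image-text dialogue, with 3,113 samples and 12,034 interaction turns; reveals exposure bias on the generation side and explores test-time scaling strategies.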

English abstract

In recent years, unified multimodal models (UMMs) have emerged to support both understanding and generation within a single framework. Mastering dynamic, multi-turn interleaved image-text dialogues is a crucial task for UMMs in real-world applications. However, existing benchmarks fail to evaluate this important task, as they are often limited to single-turn or static settings, and typically overlook exposure bias in multi-turn interactions. To bridge this gap, we propose IMUG-Bench, a comprehensive benchmark for multi-turn interleaved image-text dialogue of UMMs that jointly evaluates their understanding and generation capabilities. Our IMUG-Bench comprises three classes: Static Spatial, Temporal Causal, and Hybrid, covering 3,113 samples and 12,034 interaction turns. It also includes dynamic understanding questions, thereby supporting evaluation that better reflects real-world multi-turn interaction scenarios. Large-scale experiments on IMUG-Bench systematically evaluate mainstream open-source and closed-source UMMs, revealing their capability boundaries and failure modes, and uncovering pronounced exposure bias on the generation side in multi-turn interactions. We further explore several test-time scaling strategies, including Chain-of-Thought, Self-Verification, and Best-of-N Sampling, which effectively improve generation accuracy and mitigate exposure bias in generation tasks. These findings provide insights into enhancing the robustness and multi-turn interaction capability of future UMMs.

2606.09167 2026-06-09 cs.CV New submission

Vision-Language Guided Hyperspectral Object Tracking via Semantics Fusion and Contextual Template Updating

Rui Yao, Yuhong Zhang, Kunyang Sun, Hancheng Zhu, Jiaqi Zhao, Zhiwen Shao, Abdulmotaleb El Saddik

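Affiliations: China University of Mining and Technology; University of Ottawa

AI summary: Proposes the VLHTrack framework, which mitigates spectral redundancy with a language-guided band selection module, integrates visual and linguistic features with a multimodal fusion module, and handles target deformation with a dynamic template update strategy, outperforming existing methods on HOT2023 and HOT2024.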

Comments 14 pages, 8 figures

English abstract

Hyperspectral object tracking (HOT) leverages the rich spectral information provided by hyperspectral videos (HSVs), offering substantial potential for object tracking. However, efficiently extracting and exploiting spectral information from redundant spectral bands remains a fundamental challenge that severely limits model generalization and tracking performance. Moreover, in dynamic scenes, targets often undergo drastic appearance variations due to factors such as occlusion and illumination changes. These variations lead to large deformations between the current frame and the template, and such discrepancies pose major challenges for existing temporal modeling approaches. In this work, we propose VLHTrack, a novel hyperspectral vision-language (VL) joint tracking framework. Specifically, we incorporate language priors to address spectral redundancy by designing a Language-Guided Band Selection Module (LBSM). By leveraging Large Language Model (LLM) descriptions, LBSM establishes a semantic-to-spectral mapping that mitigates redundancy and accentuates discriminative spectral features. A Multi-Modal Vision-Language Fusion Module is then employed to integrate visual and linguistic embeddings, harnessing their complementary advantages to learn coherent cross-modal representations. To address target deformation in long-term sequences, we propose a dynamic template-feature update strategy implemented via the Dynamic Template Update with Mamba (DTUM) module. By leveraging selective state space modeling, DTUM learns inter-frame dependencies to update template features, ensuring efficient template-feature evolution guided by temporal context. Experiments on HOT2023 and HOT2024 demonstrate that VLHTrack outperforms state-of-the-art (SOTA) methods.

2606.09165 2026-06-09 cs.AI New submission

Reliable to Expressive: A Curriculum for Rubric-Following Safety Judges

Yongtaek Lim, Hyeji Choi, Minwoo Kim

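Affiliations: University of California, Berkeley

AI summary: Proposes a training strategy combining dynamic rubrics with a reliable-to-expressive curriculum so that safety judges evaluate stably across multiple rubrics; the 12B model reaches 94.12-94.88% accuracy with a cross-rubric variance of only 0.76.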

Comments Accepted to ICML 2026 Workshop on AIWILDS

English abstract

Safety judges are increasingly deployed to evaluate model outputs against evolving criteria, yet recent meta-evaluation work shows they remain brittle under prompt and rubric variation, with false-negative-rate swings of up to 0.24 reported for stylistic perturbations alone. We argue that safety judgment is fundamentally a rubric-following problem: a robust judge must apply the given evaluation criteria consistently across rubric formulations rather than memorize one specific template. We propose a training strategy that combines (i) instance-conditioned dynamic rubrics generated from prompt-response-label triples to expose the judge to the variability of evaluation criteria, and (ii) a reliable-to-expressive curriculum that begins with clean fixed-rubric supervision and progressively introduces noisier dynamic-rubric data. We evaluate on a single human-labeled set under three contrasting rubric prompts (HarmBench-style, ShieldGemma-style, and a domain-specific rubric). Our 12B curriculum judge achieves 94.12-94.88% accuracy across the three rubrics with a cross-rubric range of only 0.76, outperforming general-purpose LLMs, dedicated safety classifiers, and reasoning-oriented judges up to 30B in both peak accuracy and stability. An ablation shows that naively mixing dynamic rubrics into SFT increases cross-rubric variance (1.44 -> 3.60); only the curriculum schedule recovers and improves on the fixed-rubric baseline (variance 0.76).
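The cross-rubric range quoted above is simply the gap between the best and worst per-rubric accuracy. A minimal sketch of that arithmetic, assuming accuracies are given in percent: the endpoints 94.12 and 94.88 come from the abstract, while the middle value and the function name are hypothetical.

```python
from typing import Sequence


def cross_rubric_range(accuracies: Sequence[float]) -> float:
    """Spread between the best and worst rubric, in percentage points."""
    return max(accuracies) - min(accuracies)


# Endpoints 94.12 and 94.88 are from the abstract; the middle value is hypothetical.
accuracies = [94.12, 94.50, 94.88]
spread = round(cross_rubric_range(accuracies), 2)
```

The range depends only on the two extremes, so the placeholder middle value does not affect the result of 0.76.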

2606.09162 2026-06-09 cs.CV New submission

Zero-Parameter Geometric Gating for Temporally Stable Low-Altitude UAV Video Semantic Segmentation

Jingpu Yang, Fengxian Ji, Zhengzhao Lai, Juanfan Wu, Mingxuan Cui, Yufeng Wang

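Affiliations: Beihang University; Northeastern University; The Chinese University of Hong Kong, Shenzhen; Beijing Institute of Technology

AI summary: Proposes a zero-parameter geometric gate that routes regions on a 16x16 grid using RANSAC homography inlier ratios and combines it with Semantic Similarity Propagation for temporally stable segmentation, improving mIoU by up to 4.91% on synthetic UAVid.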

English abstract

Video semantic segmentation for low-altitude UAVs requires temporal consistency, yet dense optical flow introduces spatially structured noise in the planar regions that dominate aerial imagery. We propose a zero-parameter geometric gate that uses RANSAC homography inlier ratios on a $16\times16$ spatial grid to route each region to either homography or optical-flow warping before fusion via Semantic Similarity Propagation (SSP). The gate requires no learned parameters, only a median-threshold binary decision on RANSAC statistics, so the method adds only 211K trainable parameters (the SSP fusion layer) to a frozen backbone. On synthetic UAVid, the method achieves a +4.24 to +4.91% mIoU improvement over base models across two architectures (SegFormer-b2 and Hiera-S+UPerNet). Mechanism diagnostics reveal that flow residuals in planar regions are spatially autocorrelated (Moran's I = 0.32, $p < 0.001$), predict boundary instability (Spearman $\rho = 0.66$), and that rigidification recovers temporal consistency from 62% to 92% (+29.5pp) in homography-valid regions.
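The median-threshold decision described above can be sketched in a few lines. This is an illustration only: it assumes the per-cell inlier ratios are already computed, and that cells at or above the median are routed to the homography warp (the abstract does not say which side of the threshold is inclusive). The function name and the toy grid are hypothetical.

```python
from statistics import median
from typing import List


def geometric_gate(inlier_ratios: List[List[float]]) -> List[List[bool]]:
    """Return True where a cell should use the homography warp.

    inlier_ratios holds one RANSAC inlier ratio per grid cell. Cells at or
    above the median ratio are treated as planar (homography); the rest
    fall back to the optical-flow warp. No parameter is learned.
    """
    flat = [r for row in inlier_ratios for r in row]
    threshold = median(flat)
    return [[r >= threshold for r in row] for row in inlier_ratios]


# Toy 2x2 grid standing in for the 16x16 grid; values are hypothetical.
ratios = [[0.9, 0.2],
          [0.8, 0.4]]
gate = geometric_gate(ratios)
```

In this toy grid the left column is routed to homography and the right column to optical flow. Because the threshold is a statistic of the current frame, the gate adapts per frame without trainable weights.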

2606.09160 2026-06-09 cs.LG cs.AI New submission

Crop Recommendation and Agricultural Query Answering System Using Spatio-Temporal Graph Neural Networks and Hybrid Retrieval Augmentation

Prajwal Thapa, Yagya Raj Pandeya

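Affiliations: University of California, Berkeley; Stanford University

AI summary: Proposes a precision-agriculture system that combines a spatio-temporal graph neural network (STGCN) with retrieval-augmented generation (RAG) to provide 30-day weather forecasts, crop recommendations, and agricultural question answering; on data from 1,359 locations in Nepal, the STGCN reaches a forecasting MSE of 0.011.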

Comments 11 pages, 8 figures

English abstract

This paper presents a unified system designed to support precision agriculture by integrating advanced weather prediction, crop recommendation, and a question-answering tool for farmers. We propose two deep learning models -- a Transformer-based Graph Neural Network and a Spatio-Temporal Graph Convolutional Network (STGCN) -- to forecast weather conditions for the next 30 days using data from 1,359 locations in Nepal. The STGCN outperforms the Transformer-based model in accuracy (MSE ~0.011 vs. 0.013), effectively modeling both spatial and temporal dependencies in climate data. These predictions are combined with static soil properties such as pH, moisture, and organic content to generate localized crop recommendations through a scoring algorithm that matches each crop's optimal growing conditions. Additionally, we develop a Retrieval-Augmented Generation (RAG) chatbot that leverages domain-specific agricultural documents to answer farmers' questions in natural language. The entire system is deployed via a mobile application, offering real-time suggestions and conversational support. User feedback confirms the system's usability and relevance, especially in rural settings where personalized farming guidance is limited. Overall, our approach demonstrates how combining machine learning models with local agricultural data can empower farmers with actionable insights, promoting more informed decisions, better crop yields, and increased resilience to climate variability.

2606.09159 2026-06-09 cs.CL cs.AI New submission

Unified Energy for Invariant and Independent Decoding in Diffusion Language Models

Yuchen Yan, Minkai Xu, Zaiquan Yang, Yatao Bian

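Affiliations: National University of Singapore; Stanford University; City University of Hong Kong

AI summary: Addresses the performance gap between diffusion language models, which generate text in parallel, and autoregressive models by proposing a unified energy (Uni-E) that handles model capacity, dependency, and invariance through an invariant energy and an independent energy; it can be computed exactly without sampling and can correct distribution shift.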

English abstract

Diffusion Language Models (DLMs) enable parallel text generation by iteratively denoising a full sequence, offering attractive flexibility compared to autoregressive (AR) decoding. However, existing methods fail to fully capture token relationships, leading to a performance gap relative to AR baselines, especially as the degree of parallelism increases. In this paper, we give a systematic analysis of the gap, identifying three key factors: (i) model capacity, (ii) dependency, and (iii) invariance. To address these issues, we first propose an invariant energy (Inv-E) together with an effective sampling-based estimator to handle the invariance issue. By further combining it with the independent energy (Ind-E), we obtain a unified energy (Uni-E) that accounts for all these factors. Uni-E enjoys a unique advantage: it can be computed exactly without sampling-based partition estimation. Moreover, Uni-E is model-agnostic and can therefore be scaled to models of arbitrary size. We further prove that Uni-E can correct the distribution shift caused by dependency and invariance. Extensive experiments across DLMs and Diffusion Large Language Models (DLLMs) demonstrate the effectiveness of the proposed Uni-E.

2606.09157 2026-06-09 cs.CL cs.AI New submission

SEF-CLGC at SemEval-2026 Task 11: Logical Notation Impact on Language Model Performance

Hanna Abi Akl, Fabien Gandon, Catherine Faron, Pierre Monnin

Affiliations: Université Côte d’Azur, Inria, CNRS, I3S, Sophia Antipolis, France; Data ScienceTech Institute, Paris, France

AI summary: Proposes the SEF-CLGC pipeline, which combines formal logical notation with small language models to evaluate reasoning performance on SemEval-2026 Task 11; the best model reaches a content score of 27.80% while reducing content bias.

Comments Accepted to SemEval-2026 co-located with ACL 2026

English abstract

This paper revisits our pipeline called Syllogistic Evaluation Framework-Common Logic Grammar Construction (SEF-CLGC). We combine formal logical notations with Small Language Models (SLMs) to evaluate reasoning performance on the SemEval-2026 Task 11 Subtask 1: Disentangling Content and Formal Reasoning in Large Language Models. Our experiments show that by relying solely on SLMs, trained on a combination of natural and symbolic languages, our best model achieves a content score of 27.80% on the task while significantly lowering the content bias in reasoning.

2606.09156 2026-06-09 cs.CV New submission

OmniGen-AR: AutoRegressive Any-to-Image Generation

Junke Wang, Xun Wang, Qiushan Guo, Peize Sun, Weilin Huang, Zuxuan Wu, Yu-Gang Jiang

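Affiliations: Institute of Trustworthy Embodied AI, Fudan University; Shanghai Collaborative Innovation Center of Intelligent Visual Computing; ByteDance Seed; The University of Hong Kong

AI summary: Proposes OmniGen-AR, a unified autoregressive framework that uses a shared visual tokenizer and Disentangled Causal Attention to support conditional inputs such as text, spatial signals, and visual context, achieving state-of-the-art or competitive results on multiple benchmarks.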

Comments Accepted by NeurIPS

English abstract

Autoregressive (AR) models have demonstrated strong potential in visual generation, offering superior performance with simple architectures and optimization objectives. However, existing methods are typically limited to single-modality conditions, e.g., text, restricting their applicability in real-world scenarios that demand image synthesis from diverse controls. In this work, we present OmniGen-AR, a unified autoregressive framework for Any-to-Image generation. By discretizing various visual conditions through a shared visual tokenizer and text prompts through a text tokenizer, OmniGen-AR supports a broad spectrum of conditional inputs within a single model, including text (text-to-image generation), spatial signals (segmentation-to-image and depth-to-image), and visual context (image editing, frame prediction, and text-to-video generation). To mitigate the risk of information leakage from condition tokens to content tokens, we introduce Disentangled Causal Attention (DCA), which separates the full-sequence causal mask into condition causal attention and content causal attention. DCA serves as a training-time regularizer without affecting standard next-token prediction during inference. With this design, OmniGen-AR achieves new state-of-the-art or at least competitive results across a range of benchmarks, e.g., 0.63 on GenEval and 80.02 on VBench, demonstrating its effectiveness in flexible and high-fidelity visual generation.

2606.09154 2026-06-09 cs.LG New submission

Improved Convergence Analysis of Topology Dependence in Decentralized SGD

Yuki Takezawa, Anastasia Koloskova, Sebastian U. Stich

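Affiliations: University of Washington

AI summary: Proposes a tighter convergence analysis showing that all eigenvalues of the mixing matrix affect the convergence rate, and verifies experimentally that it is more accurate than analyses based on the spectral gap alone.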

Comments ICML 2026

English abstract

Decentralized SGD is a fundamental algorithm in decentralized learning, but the influence of the underlying network topology on its convergence behavior is not yet fully understood. Existing convergence analyses have shown that topologies with a small spectral gap significantly deteriorate the convergence rate of Decentralized SGD in both homogeneous and heterogeneous cases. However, many prior papers have reported that the choice of topology has a significant experimental impact in the heterogeneous case but little impact on training behavior in the homogeneous case. In this paper, we present a tighter convergence analysis of Decentralized SGD, offering a more precise understanding of how topologies affect the convergence rate than prior analyses. Specifically, unlike existing convergence analyses that use only the spectral gap as a property of the topology, our novel analysis shows that all eigenvalues of the mixing matrix affect the convergence rate. In our experiments, we carefully evaluate the convergence behavior of Decentralized SGD and demonstrate that our convergence analysis describes the effect of topology on the convergence rate more accurately.
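To make "all eigenvalues of the mixing matrix" versus "only the spectral gap" concrete, the following sketch lists the full spectrum for one simple topology. It assumes a ring in which each node averages itself and its two neighbors with weight 1/3, and uses one common definition of the spectral gap (one minus the second-largest eigenvalue magnitude). Neither the topology nor the definition is taken from the paper, and the function names are hypothetical.

```python
import math
from typing import List


def ring_mixing_eigenvalues(n: int) -> List[float]:
    """Eigenvalues of the ring mixing matrix with weight 1/3 on self and each neighbor.

    The matrix is symmetric and circulant, so its eigenvalues have the closed
    form 1/3 + (2/3) * cos(2*pi*k/n) for k = 0, ..., n-1 (valid for n >= 3).
    """
    if n < 3:
        raise ValueError("a ring needs at least 3 nodes")
    return [1 / 3 + (2 / 3) * math.cos(2 * math.pi * k / n) for k in range(n)]


def spectral_gap(eigenvalues: List[float]) -> float:
    """One minus the second-largest eigenvalue magnitude."""
    magnitudes = sorted((abs(v) for v in eigenvalues), reverse=True)
    return 1.0 - magnitudes[1]


eigs = ring_mixing_eigenvalues(4)   # full spectrum of a 4-node ring
gap = spectral_gap(eigs)            # single-number summary of the same matrix
```

The spectral gap compresses the spectrum into one number, whereas `eigs` retains every eigenvalue; two topologies can share a gap while differing in the rest of the spectrum, which is the kind of distinction a spectrum-aware analysis can capture.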

2606.09148 2026-06-09 cs.CL New submission

Explicit Representation Alignment for Multimodal Sentiment Analysis

Baode Wang, Ziming Wang, Huacan Wang, Ronghao Chen, Biao Wu

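Affiliations: AgentAlpha

AI summary: Addresses representation misalignment in multimodal sentiment analysis by using vision-language models to convert visual content into textual descriptions, combined with semantic token selection and batch-level uniformity regularization, to achieve cross-modal alignment; reaches state-of-the-art performance on multiple benchmarks.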

Comments 10 pages, 5 figures

English abstract

Multimodal affective analysis aims to understand human sentiment and emotion by jointly modeling heterogeneous modalities such as text and images. However, multimodal models often fail to consistently outperform strong text-only baselines, with performance varying significantly across fusion strategies. In this work, we identify representation misalignment between independently pretrained modality encoders as a key bottleneck for effective multimodal learning, and show through controlled experiments that alignment prior to fusion is often more important than fusion complexity. To address this issue, we propose a unified multimodal affective analysis framework that leverages vision-language models (VLMs) to convert visual content into structured textual descriptions, projecting heterogeneous modalities into a shared linguistic space and enabling interpretable text-centric reasoning. To further improve robustness, we introduce a hybrid learning strategy that combines semantic token selection with a batch-level uniformity regularization objective, encouraging a more dispersed and stable global feature space while mitigating noise introduced by VLM-generated descriptions. Experiments on multiple multimodal sentiment and emotion benchmarks show that our method consistently outperforms strong unimodal and multimodal baselines, achieving state-of-the-art performance. Our analysis further highlights the critical role of representation alignment in multimodal affective learning.

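2606.09143 2026-06-09 cs.CV New submission

CAMF-Det: Closure-Aware Multimodal Fusion for LiDAR-Camera 3D Object Detection on UAV Platforms

Yanze Jiang, Yanfeng Gu, Xian Li

Affiliations * School of Electronics and Information Engineering, Harbin Institute of Technology

AI summary To address the degradation of multimodal information caused by tree-canopy occlusion in top-down UAV scenes, the authors propose CAMF-Det, a closure-aware fusion framework built on a Beer-Lambert-inspired formulation. It explicitly models dual-modal occlusion intensity and injects it into the detection pipeline, improving hard-level mAP_BEV by 9.43% and 4.88% on two self-built datasets.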

English abstract

Multimodal 3D object detection based on LiDAR and cameras has demonstrated excellent performance in ground-vehicle scenarios, but has not been explored for Unmanned Aerial Vehicle (UAV) platforms. In UAV top-down scenes, frequent ground-object occlusion dominated by tree canopies causes spatially varying and modality-dependent information degradation. Existing multimodal fusion frameworks neither explicitly model such ground-object occlusion nor embed occlusion awareness into the detection pipeline, limiting their performance in occluded UAV scenes. To address these challenges, we propose CAMF-Det, a closure-aware multimodal fusion framework for LiDAR-camera 3D object detection on UAV platforms, which derives dual-modal occlusion intensity through physics-inspired modeling and embeds it as a prior throughout the detection pipeline. First, a dual-modal closure modeling module explicitly constructs occlusion intensity ground truth for both modalities offline via a Beer-Lambert-inspired formulation and building-mask correction. Second, using these ground-truth maps as supervision, a dual-modal prediction network converts the offline modeling results into online occlusion intensity predictions under single-frame inference. Third, both ground-truth and predicted occlusion intensity are injected into data augmentation, feature encoding, multimodal fusion, and the detection head, enabling adaptive detection under spatially varying and modality-dependent information degradation. Experiments on two self-built UAV-based multimodal datasets, SI3D-DI and SI3D-DII, demonstrate that CAMF-Det achieves the best performance across all difficulty levels, with hard-level mAP$_{\mathrm{BEV}}$ improvements of 9.43% and 4.88% over the best competing methods, respectively. These results confirm the effectiveness of explicit occlusion prior modeling and exploitation for robust multimodal 3D detection in UAV scenes.
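The abstract names a Beer-Lambert-inspired formulation but does not give it. As a rough sketch of the underlying physical idea only, the classical Beer-Lambert law says that transmittance decays exponentially with the path length through an absorbing medium. The Python below turns that into an occlusion value in [0, 1). The function name, the attenuation coefficient, and the use of canopy path length as input are all assumptions for illustration, not the paper's model.

```python
import math


def occlusion_intensity(path_length_m, attenuation_per_m):
    """Occlusion in [0, 1) from classical Beer-Lambert attenuation.

    transmittance = exp(-k * L), occlusion = 1 - transmittance.
    Both arguments are hypothetical inputs: the canopy path length L
    (metres) along a sensor ray and an attenuation coefficient k (1/m).
    """
    transmittance = math.exp(-attenuation_per_m * path_length_m)
    return 1.0 - transmittance


# Occlusion grows monotonically as the ray passes through more canopy.
for length in (0.0, 1.0, 2.0, 5.0):
    print(length, round(occlusion_intensity(length, attenuation_per_m=0.5), 3))
```

A per-ray or per-pixel map of such values is one plausible shape for the "occlusion intensity ground truth" the abstract mentions. How the paper handles the building-mask correction and the two modalities is not specified here.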

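2606.09142 2026-06-09 cs.CV cs.AI New submission

Decoding Pedestrian Crossing Intention from Egocentric Vision via Vision Language Models

Danya Li, Xiang Su, Yan Feng, Rico Krueger

Affiliations * Technical University of Denmark; University of Helsinki; Delft University of Technology

AI summary Recasts pedestrian crossing-intention prediction as a visual question answering task for vision language models (VLMs). With parameter-efficient fine-tuning and contextual cues such as ego motion, vehicle motion, and eye gaze, the approach achieves a 14.5% accuracy improvement over a transformer baseline on egocentric video, setting a new state of the art.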

English abstract

Egocentric vision offers a first-person view of human perception and decision making, yet its potential for traffic-safety prediction remains underexplored. In this work, we study the decoding of pedestrian crossing intentions from short egocentric video clips. We approach this by formulating the task as a closed-ended visual question answering (VQA) problem and leveraging vision language models (VLMs) to predict the pedestrians' intent. We first benchmark three families of state-of-the-art VLMs in a zero-shot setting, finding that they achieve moderate gains over random guessing but exhibit limited higher-level traffic reasoning. Motivated by these findings, we further adapt VLMs to the target task using parameter-efficient fine-tuning. Our results show that the fine-tuned models substantially outperform their zero-shot counterparts and achieve a 9% accuracy improvement over a specialized transformer-based baseline. Finally, we demonstrate that incorporating additional contextual cues, including ego motion, vehicle motion, and eye gaze, further improves predictive performance. In particular, the fine-tuned Qwen3-VL-2B model guided by eye gaze and ego motion achieves a 14.5% accuracy improvement over the transformer baseline, establishing a new state of the art for egocentric pedestrian intent decoding.

2606.09140 2026-06-09 cs.CV New submission

DiffSight-Former: Modeling Structural Differences and Temporal Dynamics for Glaucoma Progression Prediction

Yi Huang, Lei Bi, Jinman Kim

Affiliations * The University of Sydney; Shanghai Jiao Tong University

AI summary Proposes DiffSight-Former, a framework that predicts glaucoma progression from sequential fundus images using time-variant feature extraction, multi-structure difference modeling, and a time-aware Transformer. It reports 91.54% AUC and 92.16% sensitivity on SIGF and 87.48% average accuracy on GRAPE.

Comments 12 pages, 6 figures

English abstract

Glaucoma is a leading cause of irreversible blindness worldwide, and early detection from fundus images is critical for effective disease management. While deep learning has achieved promising performance in fundus image analysis, most existing methods rely on single time-point images and fail to capture longitudinal structural and vascular changes associated with disease progression. Sequential fundus images acquired during clinical follow-up provide valuable temporal information; however, current sequential models often struggle to detect subtle early progression signals and commonly depend on fixed-length inputs or diagnostic cues from already glaucomatous images, limiting their clinical utility for early prediction. To address these limitations, we propose DiffSight-Former, a framework for glaucoma progression prediction from sequential fundus images. It incorporates a time-variant feature extraction module based on a fundus-specific foundation model to obtain robust anatomical representations. A multi-structure difference modeling module is introduced to quantify progression-related changes in the optic disc/cup region and retinal vasculature. These representations are integrated with temporal interval embeddings and processed by a time-aware Transformer to model disease progression and estimate the probability of future glaucoma onset. Experiments were conducted on two longitudinal datasets, SIGF (405 sequences) and GRAPE (263 sequences). On SIGF, DiffSight-Former achieved an AUC of 91.54% and a sensitivity of 92.16% for progression prediction. On GRAPE, it achieved an average accuracy of 87.48% across three clinical visual-field progression criteria. Compared with existing approaches, DiffSight-Former demonstrates strong performance and robustness across different temporal settings, highlighting its potential for longitudinal glaucoma monitoring and early risk prediction.

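2606.09139 2026-06-09 cs.CV New submission

A Geometric Framework for Absolute Pose and Velocity Estimation with Event Cameras

Zibin Liu, Shunkun Liang, Banglei Guan, Yang Shang, Qifeng Yu, Ji Zhao

Affiliations * National University of Defense Technology; independent researcher

AI summary Uses geometric constraints between 3D lines and the events they trigger to estimate the absolute pose and velocity of an event camera. Linear and polynomial solvers recover pose, linear and optimization-based solvers recover velocity, and as few as three event-line correspondences suffice for either. The method reports better accuracy and efficiency than existing approaches.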

English abstract

Despite the rapid advancements in event-based motion estimation, current geometric methods primarily focus on velocity estimation. However, absolute pose estimation, which is equally crucial for key applications such as robotic navigation and augmented reality, remains relatively underexplored. Consequently, the simultaneous recovery of absolute pose and velocity from event streams remains an open and challenging problem. To address this gap, we propose a geometric framework for absolute pose and velocity estimation by leveraging 3D lines in the scene and the events they trigger. At the core of the framework lie two key geometric constraints: the orthogonality between a 3D line and the normal vector of its corresponding event plane, and the collinearity of an event with the 2D projection of its associated line. Based on these constraints, we present both linear and polynomial solvers for absolute pose estimation. The former enables efficient computation, while the latter provides a globally optimal solution for rotation. For velocity estimation, we develop an efficient linear solver and a more accurate optimization-based solver to recover both angular and linear velocities. Notably, our methods require a minimum of three event-line correspondences to determine the 6-DoF absolute pose or velocities independently. Extensive experiments in simulation and on real-world datasets demonstrate that our methods achieve state-of-the-art performance, with significant improvements in accuracy and computational efficiency compared to existing methods. The demo code is publicly available at https://github.com/Zibin6/EventPoseVelocity.

2606.09138 2026-06-09 cs.LG cs.CL New submission

Claw-R1: A Step-Level Data Middleware System for Agentic Reinforcement Learning

Daoyu Wang, Mingyue Cheng, Qingchuan Li, Shuo Yu, Jie Ouyang, Qi Liu

Affiliations * State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China

AI summary Proposes Claw-R1, a system whose Gateway Server and Data Pool components turn agent interaction steps into structured data assets, supporting live inspection, quality-based curation, and training-batch configuration to manage the data lifecycle in agentic reinforcement learning.

English abstract

Agentic reinforcement learning (RL) has become an important post-training paradigm for turning LLMs from static chatbots into interactive agents, giving rise to representative applications such as OpenClaw. Existing work mainly focuses on policy optimization algorithms and training frameworks, but pays less attention to the full data lifecycle of agent-environment interactions, from data production to training consumption. To bridge this gap, we present Claw-R1, an interactive step-level data middleware system for agentic RL. Claw-R1 connects heterogeneous agent runtimes with RL training backends through two core components: a Gateway Server and a Data Pool. The Gateway Server captures multi-turn interaction steps through a unified LLM API entry point, while the Data Pool organizes them into step-level records consisting of prompt IDs, response IDs, rewards, and other metadata. In our demo, users can interactively inspect live trajectories, examine the state, action, and reward of each step, curate data by quality and readiness, and configure training-ready batches for different downstream RL algorithms. Overall, Claw-R1 treats agent interaction traces as managed data assets rather than temporary runtime logs. Through this demonstration, we hope to encourage the community to recognize the importance of data management in agentic RL. Our code is available at https://github.com/AgentR1/Claw-R1 and the demonstration video can be found at https://youtu.be/Pw47dAOw6B0.
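To make the idea of a step-level record concrete, here is a small hypothetical sketch in Python. The abstract only says that records consist of prompt IDs, response IDs, rewards, and other metadata. The class name `StepRecord`, the extra fields `trajectory_id` and `step_index`, the reading of the IDs as token IDs, and the `training_ready` filter are all assumptions for illustration and are not taken from the Claw-R1 code base.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class StepRecord:
    """One agent-environment interaction step (hypothetical schema)."""
    trajectory_id: str                    # assumed: groups steps into a trajectory
    step_index: int                       # assumed: position within the trajectory
    prompt_ids: List[int]                 # assumed to be prompt token IDs
    response_ids: List[int]               # assumed to be response token IDs
    reward: Optional[float] = None        # None until a reward is assigned
    metadata: Dict[str, Any] = field(default_factory=dict)


def training_ready(records, min_reward=0.0):
    """Keep only steps that already have a reward of at least min_reward."""
    return [r for r in records
            if r.reward is not None and r.reward >= min_reward]


pool = [
    StepRecord("traj-1", 0, [1, 2, 3], [4, 5], reward=1.0),
    StepRecord("traj-1", 1, [1, 2, 3, 4, 5], [6], reward=None),
    StepRecord("traj-2", 0, [7, 8], [9], reward=-0.5, metadata={"tool": "search"}),
]

batch = training_ready(pool)
print([(r.trajectory_id, r.step_index) for r in batch])
```

Keeping the reward optional separates steps that are still waiting for feedback from steps that are ready for training, which is one simple way to express the "quality and readiness" curation the abstract describes.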

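2606.09134 2026-06-09 cs.RO cs.AI cs.CL cs.CV cs.GR New submission

From USD Scenes to Knowledge Graphs: Zero-Shot Ontology Grounding with LLMs

Jiangtao Shuai, Zongxiong Chen, Manfred Hauswirth, Sonja Schimmler

Affiliations * Technical University of Berlin; Fraunhofer FOKUS

AI summary Studies whether large language models (LLMs) can map 3D scene objects to ontology classes zero-shot and without training. On a kitchen scene the approach reaches 90-96% accuracy with descriptive object names, and ablations show that semantic cues in the scene graph are the key signal.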

Comments Accepted to the IEEE ICRA 2026 International Joint Workshop on Ontologies, Semantic Maps and Autonomous Robotics Standardization (J-WOSMARS 2026), Vienna, 2026

English abstract

Constructing knowledge graphs from 3D simulation scenes is essential for robot task reasoning, but the key bottleneck, grounding scene objects to formal ontology classes, still relies on manually curated dictionaries that are brittle and do not generalize across assets. We investigate whether large language models (LLMs) can automate this grounding step for Universal Scene Description (USD) scenes as a zero-shot, training-free alternative. On a kitchen scene (125 objects) with the SOMA-HOME ontology, LLMs achieve 90-96% exact-match accuracy with descriptive names and 49-89% with abbreviated names, substantially outperforming dictionary and embedding baselines. Under fully opaque names, context-augmented prompting recovers up to 48% accuracy. Feature ablation reveals that LLMs primarily exploit semantic cues in the scene graph (sibling names and parent paths); anonymizing these cues reduces accuracy to 0-6%, while geometry alone yields only 4-17%.

2606.09132 2026-06-09 cs.AI New submission

Vision Language Model Helps Private Information De-Identification in Vision Data

Tiejin Chen, Pingzhi Li, Kaixiong Zhou, Tianlong Chen, Hua Wei

Affiliations * Arizona State University; University of North Carolina at Chapel Hill; North Carolina State University

AI summary Proposes VisShield, a framework that pairs a specialized instruction-tuning dataset, OPTIC, with a tailored training strategy so that vision language models can precisely localize and mask sensitive text, protecting private information in visual data such as medical images.

English abstract

Vision language models (VLMs) have gained significant popularity due to their remarkable capabilities. While various methods exist to enhance privacy in text-based applications, privacy risks associated with visual inputs, such as Protected Health Information (PHI) in medical images, remain largely overlooked. Tackling this problem requires two key tasks: accurately localizing sensitive text and processing it to ensure privacy protection. To this end, we introduce VisShield (Vision Privacy Shield), an end-to-end framework designed to enhance the privacy awareness of VLMs. Our framework consists of two key components: a specialized instruction-tuning dataset, OPTIC (Optical Privacy Text Instruction Collection), and a tailored training methodology. The dataset provides diverse privacy-oriented prompts that guide VLMs to perform targeted Optical Character Recognition (OCR) for precise localization of sensitive text, while the training strategy ensures effective adaptation of VLMs to privacy-preserving tasks. Specifically, our approach ensures that VLMs recognize privacy-sensitive text and output precise bounding boxes for detected entities, allowing for effective masking of sensitive information. Extensive experiments demonstrate that our framework significantly outperforms existing approaches in handling private information, paving the way for privacy-preserving applications in vision-language models. Our dataset and code can be found here.

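2606.09131 2026-06-09 cs.AI cs.CL cs.CV cs.LG New submission

Late-Layer Fusion is Enough: Dual-Path Vision Token Routing for Multimodal Large Language Models under Visual Saturation

Siyuan Liu, Jinyang Wu

Affiliations * School of Mechanics and Engineering Science, Peking University; Department of Automation, Tsinghua University

AI summary To address the saturation of vision tokens in the deeper layers of multimodal large language models, the authors propose Dual-Path Vision Token Routing with Late-Layer Fusion (DPVR-LF). It routes vision tokens at the saturation point into a one-layer trainable branch and fuses them back only at the final layer, preserving performance with about 3% trainable parameters while reducing computation.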

Comments 18 pages, 4 figures. Submitted to Pattern Recognition

English abstract

Multimodal large language models (MLLMs) commonly inherit the deep, symmetric Transformer backbone designed for unimodal text modeling, and apply the same computation uniformly to image and language tokens. This design overlooks a key modality asymmetry: image and text tokens differ substantially in information density, redundancy, and required reasoning depth. Through a layer-wise analysis of LLaVA-1.5, we observe that vision tokens tend to saturate in the middle layers. Specifically, text-to-image attention decreases from 0.68 at layer 0 to 0.07 by layer 4, and stabilizes near 0.04 after layer 18, whereas text tokens continue to benefit from deep semantic processing. These findings suggest a mismatch between architectural symmetry and depth-asynchronous modality evolution, resulting in redundant visual computation and possible drift in perceptual representations during deep task-specific adaptation. Motivated by this, we propose Dual-Path Vision Token Routing (DPVR), a modality-asymmetric routing framework for efficient MLLMs. Its core instantiation, DPVR-LF (Late-Layer Fusion), routes vision tokens at the saturation point into a one-layer trainable side branch, runs a thirteen-layer text-only forward that skips image positions in the deep stack, and re-fuses the visual and textual streams only at the final layer. With approximately 3% trainable parameters, DPVR-LF preserves competitive multimodal performance on standard benchmarks while reducing visual computation in the deep Transformer stack. The results challenge the conventional assumption that vision tokens must traverse all deep language-model layers, and indicate that a single late fusion layer can be sufficient for maintaining strong perceptual competence in LLaVA-style MLLMs.

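2606.09124 2026-06-09 cs.AI New submission

A Regret Minimization Framework on Preference Learning in Large Language Models

Suhwan Kim, Taehyun Cho, Geon-Hyeong Kim, Yu Jin Kim, Youngsoo Jang, Moontae Lee, Jungwoo Lee

Affiliations * KAIST

AI summary Proposes Regret-based Preference Optimization (RePO), which models human preferences through regret minimization rather than reward maximization and yields consistent performance gains on mathematical reasoning benchmarks and human preference datasets.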

English abstract

Reinforcement learning with verifiable rewards (RLVR) has enabled progress on reasoning-intensive tasks by relying on task-specific verifiers that provide automated correctness signals. However, many realistic language tasks are difficult to equip with reliable verifiers, motivating a growing reliance on reinforcement learning from human feedback (RLHF). In this setting, we argue that a closer examination of how human feedback should be interpreted is essential. We introduce Regret-based Preference Optimization (RePO), which reframes RLHF through regret minimization rather than reward maximization. Human preferences are often shaped by prospective anticipation of outcomes and counterfactual comparisons to alternative behaviors, rather than by immediate, outcome-independent utility. RePO captures this structure by modeling preferences as behavior-conditioned assessments of relative suboptimality. Experiments on mathematical reasoning benchmarks and human preference datasets demonstrate consistent performance gains, indicating that RePO is an effective and human-aligned approach for training large language models.

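2606.09118 2026-06-09 cs.AI New submission

ComplexConstraints and Beyond: Expert Rubrics for RLVR

Sushant Mehta, Liudas Panavas, Edwin Chen

Affiliations * Surge AI

AI summary Proposes expert-designed rubrics as both evaluation and training signals, validated on complex instruction following and enterprise agentic tasks, where they substantially improve model performance in RL training.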

Comments Accepted to the GEM workshop at ACL 2026: https://gem-workshop.com/

English abstract

As LLM capabilities advance rapidly, the evaluation methods used to assess them increasingly lag behind. Traditional benchmarks relied on programmatic verification of narrow, surface-level constraints, but real-world instruction following and agentic tasks demand assessment of nuanced, context-dependent behaviors that resist simple scripted checks. We present a systematic analysis of expert-curated rubric-based evaluation as an alternative paradigm, drawing on empirical evidence from two domains: complex instruction following and enterprise agentic tasks. We first articulate five design principles for constructing high-quality rubrics, including Maximum Viable Atomicity, intent-aware criterion design, and iterative LLM-judge calibration. To validate these principles, we introduce ComplexConstraints, a new expert-curated instruction-following dataset in which each prompt is paired with 10-40 atomic rubric criteria. We demonstrate that these expert rubrics are not only better evaluation instruments but also highly effective training signals: training on approximately 1,000 ComplexConstraints examples yields +15.5% improvement for a 4B-parameter model and +12.2% for a 235B-parameter model on instruction following, while single-epoch RL training on a rubric-graded enterprise environment produces gains that transfer to out-of-distribution benchmarks the model was never trained on (+4.5% BFCL, +7.4% Tau2-Bench, +6.8% Tool-Decathlon). Our findings establish that expert-authored rubrics improve both the measurement and the development of frontier LLM capabilities, serving as effective evaluation and RL training signals.

2606.09117 2026-06-09 cs.LG cs.AI New submission

Optimizing Energy-based Neural Network Training with Coherent Ising Machine

Chen-Rui Fan, Bo Lu, Zhi-Hong Zhang, Run-Qing Zhang, Jing-Wei Wen, Chuan Wang

Affiliations * School of Artificial Intelligence, Beijing Normal University; Laboratory for Advanced Computing and Intelligence Engineering, Information Engineering University; China Mobile (Suzhou) Software Technology Company Limited; School of Science, Beijing University of Posts and Telecommunications

AI summary Trains energy-based neural networks with a Coherent Ising Machine combined with Equilibrium Propagation, accelerates convergence with the Adam optimizer, and demonstrates scalability to deeper architectures and convolutional operations, offering a physical framework for next-generation AI hardware.

English abstract

While Ising machines serve as advanced physical solvers for the Ising model, enabling applications in combinatorial optimization and neural network training, their scalability for large-scale neural networks remains constrained by hardware connectivity limitations and suboptimal training methodologies. In this work, we leverage a Coherent Ising Machine (CIM) to train an energy-based neural network using Equilibrium Propagation, achieving performance comparable to existing software-based implementations. We further enhance the algorithm by integrating the Adam optimizer to solve for the ground state of a Hopfield energy network, significantly improving convergence speed and solution accuracy. Additionally, we demonstrate the scalability of our approach across deeper network architectures and convolutional operations. Our results highlight the potential of CIM dynamics as a scalable platform for training complex neural networks, offering a pathway toward energy-efficient implementations via analog circuits, optoelectronics, or integrated photonics. This work establishes a novel physical framework for next-generation AI hardware development.

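2606.09115 2026-06-09 cs.LG New submission

Counterfactual Transport Flows for Offline Conservative Trajectory Refinement

Lena Krieger, Xuan Zhao, Zhuo Cao, Qin Wang, Hanno Scharr, Ira Assent

Affiliations * ETH Zürich; University of Science and Technology of China

AI summary Proposes counterfactual transport flows, a framework that builds local preference pairs by retrieving higher-feedback trajectories and performs conservative trajectory refinement for offline decision-making, improving behavior from historical returns on D4RL benchmarks.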

Comments accepted at RLxF @ ICML 2026

English abstract

Offline reinforcement learning (RL) offers a path to policy improvement from logged data alone, using historical returns or other measurable outcomes as world feedback. A key difficulty is improving observed behavior without extrapolating beyond what the offline data supports. We propose counterfactual transport flows, a source-conditioned trajectory refinement framework for offline decision-making guided by world feedback. Given a low-feedback candidate trajectory, we construct local preference pairs from offline data by retrieving nearby trajectories in latent trajectory space with higher task-specific feedback, and use them as weak supervision for conservative refinement. The framework learns instance-specific refinement directions: at inference time, a refinement strength parameter controls how far the candidate trajectory is transported, enabling a trade-off between preserving the original behavior and applying stronger improvement. Experiments on D4RL benchmarks, including AntMaze and MuJoCo tasks, show that our method improves behavior from historical returns as world feedback, while providing interpretable trajectory-level refinement paths.

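2606.09114 2026-06-09 cs.CL New submission

MAAM: Anchor-Preserving Compression and Contextual Calibration for Chinese Discriminatory Language Detection

Yuxin Fu, Shijing Si

Affiliations * School of Economics and Finance, Shanghai International Studies University

AI summary Proposes MAAM, a framework that preserves discrimination-relevant semantic anchors and calibrates them with contextual priors, improving the accuracy and calibration of Chinese discriminatory-language detection on lightweight models. The authors also build ChLGBT, which they describe as the first Chinese LGBT-focused discriminatory-language corpus.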

English abstract

Chinese discriminatory-language detection is challenging because harmful intent is often implicit and context-dependent. We propose MAAM (Myopia-Astigmatism Anchor Mechanism), a lightweight, model-agnostic framework inspired by functional visual blur: rather than preserving every token equally, MAAM retains discrimination-relevant semantic anchors and calibrates them with C-I-S contextual priors (Contextual Tone, Group Identity, and Stance Polarity). We also introduce ChLGBT, to our knowledge the first Chinese LGBT-focused discriminatory-language dataset, with 8,120 manually annotated samples and three ordinal labels: explicit bias, implicit bias, and emotional intensity. Across strong encoder baselines, MAAM improves all three prediction dimensions, with consistent gains in accuracy, F1, Brier score, and expected calibration error. Compared with frontier LLM baselines under zero-shot and few-shot prompting protocols, MAAM remains competitive while offering stronger compactness and stability. These results suggest that interpretable anchor preservation and contextual calibration provide a practical alternative to heavier model scaling for Chinese discriminatory-language assessment.
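The two calibration metrics named in the abstract, Brier score and expected calibration error (ECE), have standard definitions. The sketch below shows them in Python for the simple binary case. The paper's labels are ordinal, so its exact evaluation protocol may differ, and the function names and the equal-width binning with `n_bins=10` are illustrative assumptions.

```python
def brier_score(probs, labels):
    """Mean squared difference between predicted probability and 0/1 label."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between confidence and accuracy per bin.

    confidences: predicted confidence in [0, 1] for each sample.
    correct: 1 if the prediction was right, else 0.
    Bins are equal-width over [0, 1] (an assumed, common choice).
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, hit))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        ece += len(members) / n * abs(avg_conf - accuracy)
    return ece


# Four predictions made with 0.9 confidence, three of them correct:
# the model is overconfident by 0.9 - 0.75 = 0.15.
print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0]))
print(brier_score([0.5, 0.5], [1, 0]))
```

Lower is better for both metrics, so the "consistent gains" reported in the abstract correspond to reductions in these values.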

2606.09112 2026-06-09 cs.LG cs.AI New submission

Hybridizing Equilibrium Propagation with Ising Machines for Efficient Energy-Based Learning

Chen-Rui Fan, Bo Lu, Xing-Yu Wu, Tie-Jun Wang, Chuan Wang

Affiliations * School of Artificial Intelligence, Beijing Normal University; Laboratory for Advanced Computing and Intelligence Engineering, Information Engineering University; School of Physical Science and Technology, Beijing University of Posts and Telecommunications

AI summary Proposes an Ising-dynamics-inspired equilibrium propagation framework that replaces dissipative Hopfield relaxation with extended phase-space dynamics, accelerating convergence, improving noise robustness, and reaching performance comparable to backpropagation on MNIST, FashionMNIST, and CIFAR-10.

English abstract

The rapid evolution of artificial intelligence has led to substantial advances in deep neural networks. Nonetheless, conventional GPU-based training remains highly energy-demanding, motivating the exploration of physical dynamics and compatible energy-based learning schemes, such as equilibrium propagation (EP). EP-based training, however, frequently suffers from convergence to local minima due to phase-space contraction. Here we introduce an Ising-dynamics-inspired equilibrium-propagation framework in which dissipative Hopfield relaxation is replaced by an extended phase-space dynamics with conjugate variables. The resulting training paradigm keeps the local two-phase learning rule of EP while changing the physical route by which neural states reach equilibrium. We show that this dynamics lowers effective energy barriers, accelerates convergence, improves noise robustness, and trains deep convolutional Hopfield networks on MNIST, FashionMNIST, and CIFAR-10 with performance comparable to backpropagation.

2606.09111 2026-06-09 cs.CV New submission

Illumination-Invariant Anomaly Detection for Sub-Canopy UAV Multispectral Point Clouds

Likun Chen, Yanfeng Gu, Xian Li

Affiliations: School of Electronics and Information Engineering; Harbin Institute of Technology

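AI summary: To address the severe illumination heterogeneity caused by vegetation shadows, proposes a prior-free anomaly detection framework that uses solar angle estimation and illumination-consistent sparse representation to separate dark targets from true shadows, improving the separability of anomalies from background in complex forest environments.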

Comments: 5 pages, 8 figures

English abstract

Unmanned Aerial Vehicle (UAV) multispectral point clouds (MPC) provide high-dimensional spatial-spectral data for sub-canopy target detection; however, their efficacy is significantly compromised by severe illumination heterogeneity caused by vegetation shadows. To address this, we propose a prior-free anomaly detection framework capable of robustly handling lighting variations. First, we formulate solar angle estimation as an inverse optimization problem. By coupling spectral indices with a ray-tracing model, this strategy achieves Prior-Free Shadow Extraction without relying on flight metadata, effectively distinguishing dark objects from true shadows. Second, to mitigate spectral distortions, we introduce an Illumination-Consistent Sparse Representation mechanism. Unlike standard reconstruction methods, we construct a background dictionary strictly from neighbors sharing the same illumination state. This constraint effectively disentangles spectral reflectance from lighting variations, ensuring that targets are represented solely by physically consistent background points. Experimental results indicate that the proposed method significantly improves the separability between anomalies and background in complex forest environments, demonstrating superior performance over state-of-the-art baselines. This framework is particularly suited for identifying camouflaged military targets, mapping fallen tree trunks, and uncovering archaeological ruins hidden beneath dense foliage.
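The idea of building the background dictionary only from neighbors in the same illumination state can be sketched with a toy example. The abstract does not specify the sparse solver, the neighborhood definition, or the shadow labels, so everything here is an assumption: a 1-sparse fit (projection onto the single best atom) stands in for the sparse representation, the function names are hypothetical, and the three-band spectra are made-up values.

```python
from math import sqrt
from typing import List, Sequence, Tuple

Spectrum = Sequence[float]


def residual(x: Spectrum, atom: Spectrum) -> float:
    """Norm of x after removing its least-squares projection onto one atom."""
    coef = sum(a * b for a, b in zip(x, atom)) / sum(b * b for b in atom)
    return sqrt(sum((a - coef * b) ** 2 for a, b in zip(x, atom)))


def anomaly_score(x: Spectrum, x_shadowed: bool,
                  neighbors: List[Tuple[Spectrum, bool]],
                  consistent: bool = True) -> float:
    """Reconstruction residual of x against a background dictionary.

    neighbors holds (spectrum, shadowed_flag) pairs. With consistent=True only
    neighbors in the same illumination state as x are used as atoms; with
    consistent=False all neighbors are used (the unconstrained baseline).
    """
    atoms = [s for s, flag in neighbors if not consistent or flag == x_shadowed]
    if not atoms:
        raise ValueError("no neighbors share the illumination state of x")
    return min(residual(x, atom) for atom in atoms)


# Toy neighborhood: two sunlit and two shadowed background points (made-up values).
NEIGHBORS = [
    ([0.30, 0.50, 0.20], False),
    ([0.28, 0.52, 0.21], False),
    ([0.06, 0.08, 0.07], True),
    ([0.05, 0.09, 0.07], True),
]

# A dark object lying in a sunlit area: spectrally close to shadowed background.
DARK_TARGET = [0.055, 0.085, 0.070]

if __name__ == "__main__":
    print(anomaly_score(DARK_TARGET, False, NEIGHBORS, consistent=True))
    print(anomaly_score(DARK_TARGET, False, NEIGHBORS, consistent=False))
```

In this toy setup the unconstrained dictionary explains the dark target with a shadowed-background atom and gives it a low score, while the illumination-consistent dictionary leaves a much larger residual. That is the disentangling effect the abstract attributes to the constraint, shown here under the stated simplifications.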