arXivDaily: daily arXiv academic digest, updated Monday through Friday
2606.15351 2026-06-16 cs.CV New submission

Facial Affect Analysis for Service-Oriented Systems: Advances, Challenges, and Future Visions

Spyridon Georgiou, Aggelos Psiris, Thomas Lagkas, Vasileios Argyriou, Panagiotis Sarigiannidis, Iraklis Varlamis, Georgios Th. Papadopoulos

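Affiliations: International Hellenic University; Democritus University of Thrace; Kingston University London; University of Western Macedonia

AI summary: This survey reviews progress in facial affect analysis within service-oriented software ecosystems from a systems-engineering perspective, emphasizing composability and dependability requirements. It argues that benchmark gains alone are insufficient for service deployment, which also requires robustness, fairness, privacy, and other runtime guarantees.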

Abstract

Facial Affect Analysis (FAA) is evolving from a stand-alone recognition task into a reusable perception capability for Service-Oriented Software Ecosystems (SoSE). This paper preserves the FAA methodological core while reframing recent advances through systems-engineering requirements for composable and dependable services. We review representative progress in static and dynamic expression analysis, action-unit and micro-expression modeling, and modern CNN, Transformer, graph, and hybrid architectures, then interpret these advances by their operational fit in edge, cloud, and hybrid service pipelines. The synthesis emphasizes SoSE concerns that determine deployability: service contracts for uncertainty-aware outputs, latency and availability envelopes, lifecycle monitoring and recalibration, governance-aware integration, and interoperability across independently evolving components. Our analysis shows that benchmark gains alone are insufficient for SoSE readiness; robustness under shift, intervention stability, fairness, privacy posture, and runtime guarantees are equally critical. We conclude with a roadmap for treating FAA as an operational service component with explicit interfaces, measurable quality attributes, and accountable lifecycle management.

2606.15346 2026-06-16 cs.CV cs.LG cs.MM New submission

DYNA-PRUNER: Input-Adaptive Data-Model Co-Pruning for Efficient and Scalable Spatio-Temporal Media Prediction

Fuyan Zhang, Yuqi Li, Yingli Tian, Edmond S. L. Ho

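Affiliations: The City College of New York; The Graduate Center, CUNY; New York University; University of Glasgow

AI summary: Proposes the Dyna-Pruner framework, which uses a shared-importance synchronization mechanism for input-adaptive co-pruning of data and model structure, reducing FLOPs by up to 70% and achieving a 2.5x speedup across CNN, RNN, and Transformer backbones with less than 1% accuracy loss.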

Comments ICME 2026 Spotlight Paper

Abstract

Spatio-temporal prediction supports radar/satellite nowcasting and city-scale traffic monitoring, but modern models are often too expensive for real-time deployment. This stems from a mismatch between dense computation and strong input-dependent redundancy (e.g., calm seas or clear skies). To enable automated, resource-aware architecture optimization in scalable media analysis, we propose Dyna-Pruner, an end-to-end framework for input-dependent co-pruning of data and model structure. A shared-importance synchronization mechanism generates coupled masks that prune redundant regions and their corresponding computational units (e.g., convolutional filters), yielding per-sample sparse sub-networks at inference time. Experiments on WeatherBench, SEVIR, and TaxiBJ show seamless integration with CNN, RNN, and Transformer backbones, reducing FLOPs by up to $70\%$ and achieving a $2.5\times$ speedup on NVIDIA Jetson AGX Orin with negligible accuracy loss ($<1\%$).
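
As a rough sanity check on the figures above, one can compare the reported speedup with the bound implied by the FLOPs reduction. This assumes, which the abstract does not state, that the 70% reduction and the $2.5\times$ speedup refer to the same configuration and that runtime scales linearly with FLOPs:

```latex
\[
\text{ideal speedup} \;=\; \frac{1}{1 - 0.70} \;\approx\; 3.33\times
\qquad\text{vs.}\qquad
\text{reported speedup} \;=\; 2.5\times
\]
```

Under those assumptions the measured speedup sits below the FLOPs-only bound, which would be consistent with costs that FLOP counts do not capture, such as mask generation and memory traffic.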

2606.15341 2026-06-16 cs.CV New submission

CausalDrive: Real-time Causal World Models for Autonomous Driving

Tianyi Yan, Huan Zheng, Dubing Chen, Meizhi Qu, Yingying Shen, Lijun Zhou, Mingfei Tu, Bing Wang, Guang Chen, Hangjun Ye, Haiyang Sun, Cheng-zhong Xu, Jianbing Shen

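Affiliations: SKL-IOTSC, CIS, University of Macau; Xiaomi EV; CASIA (Institute of Automation, Chinese Academy of Sciences)

AI summary: Proposes CausalDrive, a controllable, real-time driving world renderer that achieves interactive simulation through causal prediction and a Context-Forced DMD architecture, supporting closed-loop evaluation, reinforcement-learning post-training, and human-in-the-loop simulation.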

Abstract

World models have emerged as a promising paradigm for scaling autonomous driving (AD) data, yet existing video generative models fall short as interactive simulators. Layout-conditioned renderers rely on "oracle" future trajectories of all background agents, rendering them strictly non-reactive. Conversely, pure action-conditioned predictors lack semantic control over complex interactions and suffer from prohibitive diffusion latencies, hindering closed-loop policy learning. To bridge this gap, we present CausalDrive, a controllable, real-time foundation driving world renderer. CausalDrive operates solely on the initial front-view frame, the ego-vehicle's trajectory, and a macroscopic text prompt. By excluding future NPC layouts, we compel the model to intrinsically predict causal interactions, enabling text-driven control over Driving Sociology, allowing users to dynamically orchestrate diverse counterfactual reactions to identical ego-actions. To overcome the efficiency bottleneck and address the covariate shift in autoregressive generation, we propose a novel Context-Forced DMD architecture. This combines continuous flow-matching with a self-correcting distillation objective, achieving interactive speeds of 12 FPS. This breakthrough transforms the passive video generator into a playable neural simulator. We demonstrate its versatility across three downstream applications: (1) generative closed-loop evaluation with significantly mitigated collision artifacts, (2) large-scale Reinforcement Learning (RL) post-training driven by a Video2Reward module, and (3) real-time human-in-the-loop simulation. Extensive experiments validate that policies trained within CausalDrive's reactive scenarios exhibit superior interaction capabilities in the real world.

2606.15338 2026-06-16 cs.RO New submission

SimWeaver: Zero-Shot RGB Sim-to-Real for Deformable Manipulation

Wenkang Hu, Haoran Wang, Yitong Li, Liu Liu, Mengao Zhao, Lai Jiang, Xincheng Tang, Junhang Wei, Zhengjie Shu, Zhendong Wang, Zhizhong Su, Huamin Wang, Ruigang Yang

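Affiliations: Shanghai Jiao Tong University; Horizon Robotics; Style3D Research

AI summary: Proposes the SimWeaver framework, which trains zero-shot RGB VLA policies on 200 simulated demonstrations per task and reaches a 91% average real-world success rate across 5 deformable tasks, without teleoperation or per-task calibration.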

Abstract

RGB sim-to-real for deformable manipulation has remained largely unsolved without real-world fine-tuning. We present SimWeaver, which trains zero-shot RGB VLA policies on 200 simulated demonstrations per task, reaching above 80% per-task and 91% average real-world success across 5 diverse deformable tasks including plastic-bag manipulation, without teleoperation or per-task calibration. SimWeaver combines a reliable measurement-backed simulator (SimWeaver-Sim) with an extensible asset framework supporting single-image generation (SimWeaver-Asset), a deterministic topology-aware trajectory synthesizer (SimWeaver-Syn), and a sim-to-real protocol with ISP-aware photometric augmentation (SimWeaver-Real). On silk grasping, the sim-trained policy reaches 100% under visual distribution shifts where real-data baselines drop to 9-70%, at two orders of magnitude lower per-trajectory cost. We will release SimWeaver and a representative asset subset. Project page: https://simweaver.github.io/

2606.15335 2026-06-16 cs.CL cs.AI New submission

Privacy-Preserving Text Sanitization for Distributed Agents Collaboration via Disentangled Representations

Xuan Liu, Hefeng Zhou, Sicheng Chen, Chao Yang, Xingcheng Xu, Jingjing Qu, Jiong Lou, Jie LI, Xia Hu

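Affiliations: Shanghai Artificial Intelligence Laboratory; Shanghai Jiao Tong University

AI summary: Proposes the DiSan framework, which disentangles text into task-semantic and style subspaces and combines federated prototype alignment with adversarial regularization to protect privacy in distributed multi-agent collaboration, substantially reducing stylometric attribution and PII leakage.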

Abstract

When distributed agents exchange text across organizational boundaries, privacy leakage arises not only from explicit identifiers but also from distributional signatures such as formatting conventions, vocabulary choices, and syntactic patterns. We propose DiSan (Disentangled Sanitization), a privacy-preserving sanitization framework and a built-in component of Intern-Shannon for multi-agent collaboration. DiSan uses a two-stream encoder to factorize text into a source-invariant role subspace that preserves task semantics and a source-identifying style subspace that remains local. Federated prototype alignment and adversarial regularization enable joint training without centralizing raw text. Experiments show that identifier-level masking is insufficient: masking 19.2% of tokens reduces TF-IDF stylometric attribution by only 18.6%. By contrast, DiSan reduces answer-level PII exposure by 20 times while maintaining 83% answer faithfulness on a distributed multi-agent RAG benchmark, and lowers Enron stylometric attribution by 73.2% under TF-IDF and 70.6% under a neural probe.

2606.15333 2026-06-16 cs.CL cs.LG New submission

Replay What Matters: Off-Policy Replay for Efficient LLM Reinforcement Unlearning

Zirui Pang, Chenlong Zhang, Haosheng Tan, Zhuoran Jin, Jiaheng Wei, Zixin Zhong

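Affiliations: The Hong Kong University of Science and Technology (Guangzhou); Institute of Automation, Chinese Academy of Sciences; University of Glasgow

AI summary: To address the under-use of hard cases by on-policy optimization in LLM unlearning, proposes ReRULE, which stores low-reward hard-case rollouts in an off-policy replay buffer and reuses them, improving unlearning efficiency while preserving general utility.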

Abstract

LLM unlearning has emerged as a cost-effective alternative to full retraining for removing hazardous knowledge from pretrained models while preserving general utility. Recent RL-based methods such as RULE reformulate unlearning as learning a refusal behavior, but their on-policy optimization repeatedly samples from the same forget and retain/boundary prompts throughout training. We identify a critical inefficiency in this process: easy cases quickly converge and provide little useful gradient signal, while hard cases near the forget/retain boundary continue to produce low-reward rollouts that are discarded after a single use. To address this issue, we propose ReRULE, an off-policy replay enhancement for reinforcement unlearning. ReRULE stores low-reward hard-case rollout groups in a replay buffer during early GRPO training and reuses them in later stages through importance-sampled off-policy updates, redirecting computation toward boundary cases that still require learning. Theoretically, we show that ReRULE yields a tighter hard-case convergence bound than pure on-policy RULE. Empirically, ReRULE improves MUSE-Books Retain Quality from 46.3 to 56.2 while adding only 5-11% training time across benchmarks. Its limited improvement on the simpler TOFU setting further supports the intended conditional behavior: replay is most beneficial when the hard/easy disparity is pronounced.
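
The replay idea described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the class name `HardCaseReplayBuffer`, the mean-reward threshold, the buffer capacity, and the clipping of the importance ratio are all assumptions introduced here.

```python
import math
from collections import deque


class HardCaseReplayBuffer:
    """Keeps rollout groups whose mean reward is low (hypothetical criterion)."""

    def __init__(self, capacity=1000, reward_threshold=0.5):
        # A bounded deque drops the oldest group once capacity is reached.
        self.groups = deque(maxlen=capacity)
        self.reward_threshold = reward_threshold

    def maybe_add(self, prompt, rollouts):
        """Store the group only if it looks like a hard case.

        Each rollout is a dict with 'reward' and 'logp_old', the log-probability
        under the policy that generated it.
        """
        mean_reward = sum(r["reward"] for r in rollouts) / len(rollouts)
        if mean_reward < self.reward_threshold:
            self.groups.append((prompt, rollouts))
            return True
        return False


def importance_weight(logp_new, logp_old, clip=2.0):
    """Ratio pi_new / pi_old for a replayed rollout, clipped from above."""
    return min(math.exp(logp_new - logp_old), clip)
```

In a later training stage, each replayed rollout's loss term would be multiplied by `importance_weight(...)` computed from the current policy's log-probability, so stale samples that the current policy would rarely produce do not dominate the update.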

2606.15332 2026-06-16 cs.LG New submission

Probabilistic Signature Inversion: Learning Conditional Distributions from Truncated Signatures

Junoh Kang, Kiseop Lee, Bohyung Han

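Affiliations: ECE & IPAI, Seoul National University; Department of Statistics, Purdue University

AI summary: To address the ill-posedness of truncated signature inversion, proposes a probabilistic framework that learns the conditional distribution of paths with a signature-conditioned flow matching model, and derives the Bayes-optimal error under linear statistics as a theoretical baseline.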

Abstract

The signature transform is a principled feature map for continuous-time paths, valued for its uniqueness and universality. Recovering a path from its truncated signature is, however, structurally ill-posed because the truncated signature map is not injective. We therefore reframe truncated signature inversion as a probabilistic problem, namely learning the conditional distribution of a path given its truncated signature, and adopt a signature-conditioned flow matching model as a practical estimator. This probabilistic formulation elucidates the fundamental difficulty of inversion: Bayes reconstruction error quantifies the irreducible uncertainty remaining after conditioning on a statistic. We derive the Bayes-optimal error under linear statistics, obtaining a closed form for log-GBM and numerically tractable formulas for log-fBM and OU, yielding a concrete theoretical baseline for model validation. This baseline upper-bounds the Bayes error under truncated-signature conditioning, since truncated signatures provide richer information than linear statistics. Experiments show that empirical reconstruction errors under linear-statistics conditioning faithfully align with the theory-derived baseline, while errors decrease when the statistic is replaced with truncated signatures. Moreover, generated paths faithfully recover the conditioning signature while preserving key distributional and temporal structures, indicating that the estimator is well-calibrated to the target conditional distribution. Together, these results establish a well-posed probabilistic framework for truncated-signature inversion, with applicability demonstrated on real financial data beyond the parametric process families covered by theory.
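
The non-injectivity mentioned in this abstract is easy to see on a toy case. The sketch below computes the first two signature levels of a piecewise-linear path using Chen's identity; the function name and the two example paths are illustrative choices, not taken from the paper.

```python
def truncated_signature_depth2(points):
    """Levels 1 and 2 of the signature of a piecewise-linear path.

    points: sequence of d-dimensional tuples. Returns (s1, s2), where s1 is
    the total increment and s2[i][j] is the iterated integral of dx_i dx_j.
    """
    d = len(points[0])
    s1 = [0.0] * d
    s2 = [[0.0] * d for _ in range(d)]
    for p, q in zip(points, points[1:]):
        delta = [qi - pi for pi, qi in zip(p, q)]
        # Chen's identity for appending one linear segment:
        # level 2 gains s1 (x) delta plus the segment's own delta (x) delta / 2.
        for i in range(d):
            for j in range(d):
                s2[i][j] += s1[i] * delta[j] + 0.5 * delta[i] * delta[j]
        for i in range(d):
            s1[i] += delta[i]
    return s1, s2


# Two different paths from (0, 0) to (1, 1): right-then-up and up-then-right.
path_a = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)]
path_b = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
s1_a, s2_a = truncated_signature_depth2(path_a)
s1_b, s2_b = truncated_signature_depth2(path_b)
```

Truncated at depth 1, both paths have the same signature `[1.0, 1.0]`, so the path cannot be recovered from it; the depth-2 terms differ in their off-diagonal entries, which encode the order of the moves. Any finite truncation discards information in the same way, which is what motivates treating inversion as estimating a conditional distribution.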

2606.15328 2026-06-16 cs.CV New submission

SGFormer++: Semantic Graph Transformer for Incremental 3D Scene Graph Generation

Mengshi Qi, Changsheng Lv, Zijian Fu, Xianlin Zhang, Huadong Ma

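Affiliations: State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications

AI summary: Proposes SGFormer++, which achieves global message passing through a graph embedding layer and a semantic injection layer, and introduces a spatial-guided feature adapter and a cascaded binary prediction head to address catastrophic forgetting in incremental scene graph generation, reaching state-of-the-art performance on the 3DSSG benchmark.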

Abstract

In this paper, we propose SGFormer++, a novel Semantic Graph Transformer for 3D scene graph generation (SGG), which aims to parse point cloud scenes into semantic structural graphs, where nodes denote detected object instances and edges encode their pairwise relationships, with the core challenge lying in modeling complex global scene structure. While existing graph convolutional network (GCN)-based methods suffer from over-smoothing and limited receptive fields, SGFormer++ leverages Transformer layers as its backbone to enable global message passing. Specifically, we introduce two key components tailored for 3D SGG: (1) a Graph Embedding Layer++ that efficiently integrates edge-aware global context with linear computational complexity, and (2) a Semantic Injection Layer++ that enriches visual features with linguistic priors from large language models (LLMs) and vision-language models (VLMs), boosting semantic representation without introducing extra trainable parameters. To further address the practical challenge of incremental SGG (I-SGG), where new relationship categories arrive sequentially, we equip SGFormer++ with a novel Spatial-guided Feature Adapter, which calibrates predicate features using subject-object spatial geometry to counter scale variation, and a Cascaded Binary Prediction Head that mitigates catastrophic forgetting via task-incremental classifier expansion and logit distillation. Extensive experiments on the 3DSSG benchmark demonstrate that SGFormer++ achieves state-of-the-art performance in both standard and incremental settings: it yields a significant 4.49% absolute improvement in Predicate A@1 under the incremental setting. Code and data are available at: https://github.com/Andy20178/SGFormer.

2606.15327 2026-06-16 cs.LG New submission

Semantic DLM+: Improving Diffusion Language Models through Bias-variance Trade-off in Transition Kernel Design

Keyue Jiang, Yuxiang Wang, Yanan Zhao, Xiang Yu, Qifang Zhao, Bohan Tang, Baojian Zhou, Yanghua Xiao, Lin Qu, Xiaoxiao Xu

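Affiliations: Alibaba Group; Fudan University; University College London; Nanyang Technological University; University of Oxford

AI summary: Analyzes three key factors in generalization error and proposes SemDLM+, which addresses the semantic basin problem with a global transition and a semantic-frequency penalty, improving training dynamics and generation quality on LM1B and OpenWebText.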

Abstract

Diffusion Language Models (DLMs) have demonstrated strong scaling capacity as alternatives to autoregressive language models. However, their performance is highly sensitive to the choice of transition kernels, and poorly designed kernels can lead to issues like training instability, slow convergence, and biased sampling. In this paper, we study this sensitivity through a principled analysis of generalization error and identify three critical factors: asymptotic bias (difficulty in approximating the posterior distribution), exposure bias (error propagation during sampling), and optimization variance induced by kernel dispersion. We further compare different transition kernels: masking diffusion yields sparse and easier posterior-approximation targets, while uniform diffusion provides stronger sampling-side repair but induces harder approximation. Motivated by this trade-off, we revisit a previously overlooked variant, semantic DLM (SemDLM), where the transition kernel corrupts tokens to neighborhoods that are semantically similar. Our theory suggests that SemDLM can serve as a plausible middle ground by reducing the posterior approximation difficulty of uniform diffusion while retaining repair ability. However, we find that SemDLM suffers from a semantic basin problem, where sampling repeatedly stays within a semantic region and produces low-diversity text. To address this, we propose SemDLM+, which adds a global transition and a semantic-frequency penalty during sampling. Experiments on LM1B and OpenWebText show that SemDLM+ improves training dynamics and achieves competitive language modeling and generation quality with satisfactory diversity.
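
The three transition kernels compared in this abstract can be contrasted with a toy forward-corruption step. This is only a sketch: the function `corrupt`, the per-token corruption `rate`, and the hand-written `neighbors` table are assumptions for illustration, and the paper's kernels and noise schedule are not specified here.

```python
import random

MASK = "[MASK]"


def corrupt(tokens, rate, kernel, rng, vocab=None, neighbors=None):
    """Corrupt each token independently with probability `rate`.

    kernel = "mask":     replace with a mask symbol (masking diffusion)
    kernel = "uniform":  replace with any vocabulary token (uniform diffusion)
    kernel = "semantic": replace with a semantically similar token, looked up
                         in a hypothetical neighbor table
    """
    out = []
    for tok in tokens:
        if rng.random() >= rate:
            out.append(tok)  # token survives this step
        elif kernel == "mask":
            out.append(MASK)
        elif kernel == "uniform":
            out.append(rng.choice(vocab))
        elif kernel == "semantic":
            out.append(rng.choice(neighbors.get(tok, [tok])))
        else:
            raise ValueError(f"unknown kernel: {kernel}")
    return out
```

With the semantic kernel a corrupted token stays close to the original, which hints at why sampling might linger inside one semantic region, the "semantic basin" the abstract describes; mixing in an occasional uniform replacement would be one reading of the global transition that SemDLM+ adds.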

2606.15325 2026-06-16 cs.CL New submission

Prior over Evidence: Stereotype-Driven Diagnosis in LLM-Based L2 Pronunciation Feedback

Rong Wang, Kun Sun

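Affiliations: University of Tuebingen; Tongji University

AI summary: Tests whether LLMs giving L2 pronunciation feedback base their diagnoses on speech evidence rather than pretraining priors, finding that rating accuracy decouples from reasoning, that phoneme-level feedback converges to a fixed set of difficult phones, and that acoustic evidence improves ratings only when it directly probes the target dimension.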

Comments 12 pages, 2 figures

Abstract

Large language models are increasingly deployed for written pronunciation feedback in second-language (L2) English learning, under the assumption that their diagnoses are grounded in the supplied speech evidence rather than in priors from pretraining. This assumption is tested on 1,800 L2-Arctic utterances spanning six L1 backgrounds, three audio-capable LLMs, four pronunciation dimensions, and five evidence conditions ranging from a text-only baseline to numeric acoustic features and raw audio. Each (utterance x model x condition x dimension) cell is scored on three metrics: Rating Accuracy (RA) against gold labels, Evidence Coherence (EC) assessing internal consistency without ground truth, and Grounded Correctness (GC) evaluated against gold evidence. Results show three findings across models. First, rating accuracy and grounded reasoning decouple: 39.6% of judged cells contain internally coherent reasoning that supports a wrong rating, against only 15.8% where the reasoning supports a correct rating. Second, phoneme-level feedback converges to a fixed inventory of L2-English difficulty phones that recurs across all six L1 backgrounds and all evidence conditions. Third, acoustic evidence improves the rating only when the supplied feature directly probes the target dimension: textualised F0 range raises pitch-variation grounding from (0.18-0.19) to (0.45-0.62) across all three models, while stress and phoneme correctness, which require target-to-realisation alignment, remain ungrounded. The same audio waveform without textualised F0 values does not reproduce this improvement. These findings indicate that current general-purpose LLMs are more reliable as verbalisers of externally computed pronunciation evidence than as standalone diagnostic engines.

2606.15323 2026-06-16 cs.CV New submission

PPDM: Pixel Puzzling Diffusion Model for Speed and Memory Efficient Volumetric Medical Image Translation

Tianqi Chen, Jun Hou, Yinchi Zhou, James S. Duncan, Chi Liu, Bo Zhou

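Affiliations: Department of Radiology, Northwestern University; Department of Biomedical Engineering, Yale University; Department of Radiology and Biomedical Imaging, Yale School of Medicine

AI summary: Proposes the Pixel Puzzling Diffusion Model (PPDM) for 3D medical image translation, which uses a reversible pixel-puzzle operation and a direct bridge diffusion formulation to reduce memory and accelerate inference while preserving global consistency.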

Comments 12 pages, 5 figures, 5 tables

Abstract

Diffusion models have demonstrated superior fidelity for medical image-to-image translation, but their extension to high-resolution 3D volumes is severely constrained by prohibitive computational cost and GPU memory requirements. Existing memory-efficient strategies often compromise global volumetric consistency or fine anatomical detail. In this work, we propose the Pixel Puzzling Diffusion Model (PPDM), a simple and effective framework for memory- and speed-efficient 3D medical image translation. PPDM introduces a reversible pixel puzzle-unpuzzle operator that trades spatial resolution for channel dimensionality, substantially reducing activation memory while preserving global context. To further improve efficiency and stability, we adopt a direct bridge diffusion formulation that starts from the conditional input rather than pure noise, enabling the model to focus on task-relevant residuals. In addition, a puzzle-gradient loss is incorporated to enforce spatial coherence and suppress grid-like artifacts introduced by spatial rearrangement. We evaluate PPDM on multiple challenging 3D medical image translation tasks, including low-count PET denoising, joint PET denoising and attenuation correction, and cross-modal MRI translation. Across all tasks, PPDM consistently matches or outperforms full 3D diffusion models while reducing training GPU memory usage by up to an order of magnitude and significantly accelerating inference, and it outperforms existing memory-efficient diffusion approaches based on latent compression or frequency decomposition. These results demonstrate that PPDM provides a practical and scalable solution for high-fidelity 3D diffusion-based medical image translation under limited computational resources.
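
A reversible operator that trades spatial resolution for channels can be illustrated with a plain space-to-depth rearrangement on a 3D volume. The abstract does not give PPDM's exact operator, so the functions `puzzle` and `unpuzzle` below are an assumed stand-in written with nested lists for self-containment.

```python
def puzzle(vol, r):
    """Rearrange a D x H x W volume into r**3 channels of size D/r x H/r x W/r.

    Channel (dz, dy, dx) holds every r-th voxel starting at that offset, so no
    voxel is discarded and each channel still spans the whole volume.
    """
    D, H, W = len(vol), len(vol[0]), len(vol[0][0])
    assert D % r == 0 and H % r == 0 and W % r == 0
    chans = []
    for dz in range(r):
        for dy in range(r):
            for dx in range(r):
                chans.append([[[vol[z * r + dz][y * r + dy][x * r + dx]
                                for x in range(W // r)]
                               for y in range(H // r)]
                              for z in range(D // r)])
    return chans


def unpuzzle(chans, r):
    """Exact inverse of puzzle: scatter each channel back to its offsets."""
    d, h, w = len(chans[0]), len(chans[0][0]), len(chans[0][0][0])
    vol = [[[None] * (w * r) for _ in range(h * r)] for _ in range(d * r)]
    c = 0
    for dz in range(r):
        for dy in range(r):
            for dx in range(r):
                for z in range(d):
                    for y in range(h):
                        for x in range(w):
                            vol[z * r + dz][y * r + dy][x * r + dx] = chans[c][z][y][x]
                c += 1
    return vol
```

With `r = 2`, a 4 x 4 x 4 volume becomes 8 channels of 2 x 2 x 2, and `unpuzzle(puzzle(vol, 2), 2)` returns the original volume. Because every channel is a strided view of the full volume, a network operating on the rearranged tensor keeps global context, at the price of possible grid-like artifacts of the kind the puzzle-gradient loss is said to suppress.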

2606.15320 2026-06-16 cs.CV New submission

Conditional Multi-Event Temporal Grounding in Long-Form Video

Yuanhao Zou, Arthad Kulkarni, Lucas Tonanez, Lincoln Spencer, Guangyu Sun, Tianxingjian Ding, Andong Deng, Yi Li, Shuangjun Liu, Yuan Li, Dashan Gao, Ning Bi, Taotao Jing, Shuai Zhang, Chen Chen

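Affiliations: University of Central Florida; Qualcomm AI Research

AI summary: Proposes the CoMET-Bench benchmark and the CoMET-Agent framework for localizing every event that satisfies compositional temporal and spatial conditions in long-form video, improving F1@0.5 by 6.1% over GPT-5.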

Abstract

Multimodal large language models have made rapid progress in video temporal grounding, yet real-world applications routinely require localizing every event that satisfies compositional temporal and spatial conditions. Existing benchmarks fall short: they localize only a single moment per query, count without temporal conditions, or treat grounding and counting as disjoint tasks. We introduce CoMET-Bench for Conditional Multi-Event Temporal Grounding in long-form video, comprising 2789 queries over 600 videos averaging 33.8 minutes across five real-world domains, with each query composed from 4 temporal conditions, 3 spatial conditions, and a dedicated negative-query subset. We further propose a unified evaluation protocol jointly measuring counting, grounding, and negative-query recognition, including a new Rejection-F1 metric that prevents trivial gaming by lazy "always-empty" models. Benchmarking a broad suite of MLLMs, agent-based, and grounding-specialized methods reveals that existing approaches remain far from solving this task. Building on these findings, we propose CoMET-Agent, a training-free agentic framework that reformulates the task as structured search-and-aggregate, improving F1@0.5 by 6.1% over GPT-5 purely through structural reasoning. Failure analysis further surfaces three open directions: fine-grained entity tracking, position-uniform retrieval, and causal event pairing.
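
For readers unfamiliar with F1@0.5 in multi-event grounding, the metric can be sketched as below. The greedy one-to-one matching and the handling of empty predictions are assumptions of this sketch; the benchmark's actual protocol, including its Rejection-F1 metric, is not reproduced here.

```python
def temporal_iou(a, b):
    """Intersection over union of two (start, end) intervals in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def f1_at_iou(pred, gold, thr=0.5):
    """F1 over events, counting a prediction as correct if it matches an
    unmatched gold interval with IoU >= thr (greedy, one-to-one)."""
    if not pred and not gold:
        return 1.0  # assumed convention: correctly returning nothing
    matched, tp = set(), 0
    for p in pred:
        best, best_j = 0.0, None
        for j, g in enumerate(gold):
            if j in matched:
                continue
            iou = temporal_iou(p, g)
            if iou > best:
                best, best_j = iou, j
        if best_j is not None and best >= thr:
            matched.add(best_j)
            tp += 1
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall)
```

For example, with gold events `[(0, 10), (20, 30)]` and predictions `[(1, 9), (50, 60)]`, one prediction matches (IoU 0.8) and one does not, giving precision 0.5, recall 0.5, and F1 0.5. A model that always returns an empty list scores 0 on any query with gold events, and a separate check on negative queries is what keeps that strategy from being rewarded.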

2606.15317 2026-06-16 cs.RO New submission

Covariance-Regulated Recursive Koopman Learning for Nonlinear Systems with Uncertain Time-Varying Dynamics

Weibin Gu, Chen Yang, Lu Shi, Chao Gao

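Affiliations: Tsinghua University; China University of Petroleum-Beijing at Karamay; Xinchen Qihang Inc.

AI summary: To address the failure of offline models under time-varying dynamics, proposes a covariance-regulated recursive Koopman learning framework that uses error dead-zone gating and constant-trace normalization to prevent covariance explosion and parameter freezing, yielding numerically stable online modeling. Tracking performance is validated on a non-holonomic differential-drive robot and a flapping-wing micro aerial vehicle.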

Abstract

Offline models for autonomous robots often fail under time-varying dynamics outside their training distribution. Koopman operator theory offers a linear representation of nonlinear dynamics via lifting, but its transition to real-time recursive estimation may suffer numerical vulnerabilities: covariance windup under low excitation when using exponential forgetting, and vanishing gain without forgetting. This paper introduces a Covariance-Regulated Recursive Koopman Learning (CR-RKL) framework with two complementary strategies, error dead-zone gating and constant-trace normalization, each independently capable of preventing covariance explosion and parameter freezing, with the latter additionally preserving the geometric structure of uncertainty. Validated on a non-holonomic differential-drive robot with wheel slip and Stribeck friction and on a 26-gram butterfly-inspired flapping-wing micro aerial vehicle, CR-RKL achieves numerically stable and accurate online modeling, and when embedded in model predictive control, it maintains reliable tracking performance under uncertain, time-varying dynamics.
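
The two safeguards named in this abstract can be illustrated on a generic recursive least-squares (RLS) step with exponential forgetting. This is a textbook-style sketch, not the paper's update law: `rls_step`, its default values, and the choice to apply both safeguards in one function are assumptions. In a Koopman setting, `phi` would be the lifted state and `y` one coordinate of the next lifted state.

```python
def rls_step(theta, P, phi, y, lam=0.98, dead_zone=1e-3, trace_target=None):
    """One regulated RLS step for a scalar output y ~ phi . theta.

    theta: parameter list (length n); P: symmetric n x n covariance as lists.
    Returns (theta, P, prior_error).
    """
    n = len(theta)
    err = y - sum(p * t for p, t in zip(phi, theta))
    # Dead-zone gating: skip the update when the error is negligible, so P is
    # not inflated by forgetting while the data carry no new information.
    if abs(err) <= dead_zone:
        return theta, P, err
    P_phi = [sum(P[i][j] * phi[j] for j in range(n)) for i in range(n)]
    denom = lam + sum(phi[i] * P_phi[i] for i in range(n))
    gain = [v / denom for v in P_phi]
    theta = [t + g * err for t, g in zip(theta, gain)]
    # Standard covariance update with forgetting factor lam (P symmetric).
    P = [[(P[i][j] - gain[i] * P_phi[j]) / lam for j in range(n)]
         for i in range(n)]
    # Constant-trace normalization: rescale P to a fixed trace, which bounds
    # its size while keeping the relative shape of the uncertainty.
    if trace_target is not None:
        scale = trace_target / sum(P[i][i] for i in range(n))
        P = [[scale * v for v in row] for row in P]
    return theta, P, err
```

Rescaling by a scalar leaves the eigenvectors and eigenvalue ratios of `P` unchanged, which is one way to read the abstract's remark that constant-trace normalization preserves the geometric structure of uncertainty.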

2606.15315 2026-06-16 cs.AI New submission

ChatPlanner: A Large Language Model Framework for Personalized Public Transit Routing

Tingting Yang, Chenhao Xue, Jun Chen

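Affiliations: School of Engineering and Materials Science, Queen Mary University of London; Department of Engineering Science, University of Oxford

AI summary: Proposes the ChatPlanner framework, which uses large language models and retrieval-augmented generation to extract user preferences from natural-language queries and incorporate them into a routing algorithm. Experiments show it produces feasible routes that better match personalized needs.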

Comments Under Review at Transportation Research Part C

英文摘要

Personalized routing in public transit systems remains challenging because diverse user preferences are difficult to capture and integrate into routing algorithms. This paper presents ChatPlanner, a novel framework that leverages Large Language Models (LLMs) to enable preference-aware public transit routing. Our approach employs fine-tuned LLMs with Retrieval-Augmented Generation (RAG) to extract routing parameters and interpret nuanced user preferences from natural language queries, subsequently integrating these preferences into the objective function of a public transit routing algorithm. We design preference-aware datasets incorporating eight personas and five contexts to establish scoring standards for both fine-tuning and RAG. Three experiments validate the feasibility of the solutions, the extraction of routing information and preferences, and the quality and completeness of the solution set. Results demonstrate that ChatPlanner generates feasible solutions reliably. Fine-tuning enforces the required output structure and learns general preference patterns, while RAG provides query-specific context to resolve imprecise or conversational expressions and calibrate continuous scores. The combination of both achieves the highest accuracy in routing information extraction and user preference interpretation. Results based on selected case studies show that by capturing user preferences, ChatPlanner identifies valuable route alternatives across different dimensions that existing route planners overlook. This research establishes a new paradigm for integrating natural language understanding into transportation optimization.
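
As a rough illustration of integrating preferences into the objective function of a routing algorithm, the sketch below scores candidate routes with a weighted sum. The `Route` fields, the weight keys, and the weighted-sum form are assumptions for illustration; the abstract does not give ChatPlanner's objective, and in the paper the weights would come from the LLM rather than being written by hand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    name: str
    travel_min: float   # in-vehicle plus waiting time, minutes
    transfers: int      # number of line changes
    walk_min: float     # walking time, minutes


def route_cost(route, weights):
    """Weighted generalized cost; a larger weight penalizes that attribute more."""
    return (weights["time"] * route.travel_min
            + weights["transfers"] * route.transfers
            + weights["walking"] * route.walk_min)


def best_route(routes, weights):
    """Return the candidate with the lowest preference-weighted cost."""
    return min(routes, key=lambda r: route_cost(r, weights))
```

Changing only the weights changes the winner: a time-focused traveller and a traveller who wants to avoid transfers and walking can be served from the same candidate set.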

2606.15314 2026-06-16 cs.LG cs.AI stat.ML 新提交

LLMs on Tabular Data with Limited Semantics: Evidence from Industrial Car Retrofit Prediction

有限语义表格数据上的LLM:来自工业汽车改造预测的证据

Aina Vila Pons, Ioannis Tzachristas, Constantinos Antoniou

发表机构 * Technical University of Munich(慕尼黑工业大学) BMW Group(宝马集团)

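AI summary: Compares LLM-based approaches (embeddings, direct classification, hybrid stacking) with classical tree ensembles on industrial tabular data, finding that LLMs are of limited use when semantics are restricted, though embeddings and hybrid methods remain valuable.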

详情
AI中文摘要

工业改造规划依赖于结构化操作数据而非自由文本:规划者必须估计新注册的原型是否需要改造、需要哪种改造包以及工作将花费多长时间。我们研究了一个工业数据集,该数据集将原型注册系统(284,271辆车)与改造管理系统(48,716次清洗后的访问)相连接,并在行序列化输入上比较了强大的表格机器学习基线与三种基于LLM的策略:嵌入特征(Amazon Titan)、直接提示分类(Claude Sonnet 4)和ML+LLM堆叠方法。在二分类发生预测、15类改造类型分类、每次访问持续时间回归以及聚合的月度基准测试中,经典树集成仍然是最强的独立模型。然而,LLM结果揭示了一致的模式:嵌入在表格上仍然有用(二分类AUC = 0.982),直接提示在通过哈希去除语义信号后崩溃(二分类AUC = 0.500;多类加权F1 = 0.018),而混合堆叠产生了最佳的手动构建多类模型(加权F1 = 0.626)。在月度基准测试中,基于滞后的机器学习优于时间序列基础模型,尽管Chronos-small在零样本预测中仍具有竞争力。结果表明,在隐私受限的工业表格上,LLM作为补充组件比替代强大的表格基线更有效。

英文摘要

Industrial retrofit planning depends on structured operational data rather than free text: planners must estimate whether a newly registered prototype will require a retrofit, which retrofit package it will need, and how long the work will take. We study an industrial dataset linking a prototype-registration system (284,271 vehicles) with a retrofit-management system (48,716 cleaned visits), and compare strong tabular machine learning baselines with three LLM-based strategies on row-serialized inputs: embedding features (Amazon Titan), direct prompted classification (Claude Sonnet 4), and an ML+LLM stacking approach. Across binary occurrence prediction, 15-way retrofit-type classification, per-visit duration regression, and an aggregated monthly benchmark, classical tree ensembles remain the strongest standalone models. However, the LLM results reveal a consistent pattern: embeddings remain useful on tables (binary AUC = 0.982), direct prompting collapses once semantic signal is stripped by hashing (binary AUC = 0.500; multiclass weighted F1 = 0.018), and hybrid stacking yields the best manually built multiclass model (weighted F1 = 0.626). On the monthly benchmark, lag-based machine learning outperforms time-series foundation models, though Chronos-small remains competitive in zero-shot forecasting. The results suggest that on privacy-constrained industrial tables, LLMs are more effective as complementary components than as replacements for strong tabular baselines.
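
A minimal sketch of what "row-serialized inputs" and "semantic signal stripped by hashing" could look like. The serialization template, the salt, the truncation length, and the choice to hash only column names and string values are all assumptions for illustration; the abstract does not specify the authors' scheme.

```python
import hashlib


def serialize_row(row):
    """Turn one table row (a dict) into a single line of text for an LLM."""
    return "; ".join(f"{key} is {value}" for key, value in row.items())


def hash_token(text, salt="demo-salt", length=8):
    """Deterministic opaque token: same input, same output, no readable meaning."""
    digest = hashlib.sha256((salt + text).encode("utf-8")).hexdigest()
    return digest[:length]


def hash_row(row):
    """Hash column names and string values; leave numeric values untouched."""
    return {
        hash_token(str(key)): hash_token(value) if isinstance(value, str) else value
        for key, value in row.items()
    }
```

After hashing, the serialized row keeps its structure and numeric content, but a prompted model can no longer rely on the wording of column names or categories, which is consistent with the collapse to chance-level AUC reported above.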

2606.15308 2026-06-16 cs.AI 新提交

Forced Deferral: Manipulating Routing Decisions in Multimodal LLM Cascades

强制延迟:在多模态大语言模型级联中操纵路由决策

Zhongye Liu, Yaopei Zeng, Yurui Chang, Lu Lin

发表机构 * Pennsylvania State University(宾夕法尼亚州立大学)

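AI summary: Proposes the Forced Deferral Attack (FDA), an adversarial image attack that lowers the weak model's confidence and forces the cascade to route queries to the strong model, exposing a security vulnerability in how MLLM cascades allocate compute.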

详情
AI中文摘要

虽然多模态大语言模型(MLLMs)展示了强大的视觉推理能力,但为每个查询服务大型模型在计算上成本高昂。MLLM级联通过首先查询较弱但更便宜的模型,并在弱模型输出不自信时延迟到强模型来缓解这一成本。然而,由于弱模型的置信度直接控制计算分配,这些系统暴露了一个新的攻击面:对手可以操纵置信度,使其查询持续被延迟到强模型。受此漏洞启发,我们引入了强制延迟攻击(FDA),这是一种对抗性图像攻击,降低弱模型的置信度,导致级联将查询路由到强模型。FDA通过优化一个温度平滑的目标来学习一个通用边界触发器。该目标将弱模型在触发输入上的令牌分布推向从其干净响应中构建的较不集中的目标。跨数据集、模型系列和延迟指标,FDA持续增加强模型路由,同时优于图像扰动和提示注入基线。这些结果表明,MLLM级联容易受到操纵计算分配的攻击,迫使非预期的强模型使用,而不直接针对答案正确性。

英文摘要

While multimodal large language models (MLLMs) have shown strong visual reasoning abilities, serving a large model for every query is computationally expensive. MLLM cascades mitigate this cost by first querying a weak but cheaper model and deferring to a strong model when the weak model's output is unconfident. However, since the weak model's confidence directly controls compute allocation, these systems expose a new attack surface: an adversary can manipulate confidence so that their queries are consistently deferred to the strong model. Motivated by this vulnerability, we introduce the Forced Deferral Attack (FDA), an adversarial image attack that lowers the weak model's confidence and causes cascades to route queries to the strong model. FDA learns a universal border trigger by optimizing a temperature-flattened objective. This objective pushes the weak model's token distribution on triggered inputs toward less concentrated targets constructed from its clean responses. Across datasets, model families, and deferral metrics, FDA consistently increases strong-model routing while outperforming image-perturbation and prompt-injection baselines. These results show that MLLM cascades are vulnerable to attacks that manipulate compute allocation, forcing unintended strong-model usage without directly targeting answer correctness.
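
The deferral rule and the effect of temperature flattening can be sketched in a few lines. This is an illustration only: maximum token probability is just one possible deferral metric, the threshold value is hypothetical, and the code shows how a flatter distribution triggers deferral rather than how FDA optimizes its trigger.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax with temperature; a higher temperature gives a flatter distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def should_defer(probs, threshold=0.7):
    """Cascade rule (assumed): defer to the strong model when confidence is low."""
    return max(probs) < threshold
```

For logits `[4.0, 1.0, 0.0]`, the clean distribution has a top probability of about 0.94 and is answered by the weak model, while the same logits at temperature 5 give about 0.50 and would be deferred.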

2606.15307 2026-06-16 cs.CL cs.AI 新提交

Adapting Reinforcement Learning with Chain-of-Thought Supervision for Explainable Detection of Hateful and Propagandistic Memes

利用思维链监督的强化学习进行仇恨和宣传模因的可解释检测

Mohamed Bayan Kmainasi, Mucahid Kutlu, Ali Ezzat Shahroor, Abul Hasnat, Firoj Alam

发表机构 * Hamad Bin Khalifa University(哈马德·本·哈利法大学) Qatar University(卡塔尔大学)

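AI summary: Proposes a reinforcement learning-based post-training method that combines task-specific rewards with Group Relative Policy Optimization (GRPO) to improve the classification performance and explanation quality of thinking-based multimodal LLMs on hateful and propagandistic meme detection.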

详情
AI中文摘要

仇恨和宣传模因利用图像与文本之间的相互作用来传达有害意图,而这两种模态单独都无法揭示这种意图。尽管基于思考的多模态大语言模型(MLLMs)在视觉-语言理解方面取得了进展,但它们在模因内容审核中的应用仍未得到充分探索。我们提出了一种基于强化学习的后训练方法,通过任务特定奖励和组相对策略优化(GRPO)来提高思考型MLLMs的分类性能和基于参考的解释质量。具体来说,我们(i)对现成的MLLMs在英语和阿拉伯语基准上的仇恨和宣传模因理解进行了系统的实证研究,(ii)通过蒸馏和多LLM细粒度宣传标注,用弱监督的思维链(CoT)理由扩展了现有的模因数据集,(iii)引入了一个基于GRPO的目标函数,带有思考长度正则化,联合优化分类准确性和解释质量,以及(iv)研究基于共识伪标签的无标签模因的自监督GRPO。在Hateful Memes和ArMeme基准上的实验表明,我们的方法在FHM准确率(从79.9%提高到82.0%,提升高达2.1%)和ArMeme宏F1(从0.536提高到0.612,提升高达7.6个百分点,附带解释;与原始ArMeme基准相比提升6.1个百分点)上优于先前报告的结果,同时生成自然语言解释。在ArMeme上,序列分类基线在原始准确率方面仍然更强,而我们的方法提供了更平衡的每类性能以及解释。我们公开发布了代码、数据扩展和评估资源。

英文摘要

Hateful and propagandistic memes exploit the interplay between images and text to convey harmful intent that neither modality reveals alone. Although thinking-based multimodal large language models (MLLMs) have advanced vision-language understanding, their application to meme content moderation remains underexplored. We propose a reinforcement learning-based post-training method that improves classification performance and reference-based explanation quality in thinking-based MLLMs via task-specific rewards and Group Relative Policy Optimization (GRPO). Concretely, we (i) conduct a systematic empirical study of off-the-shelf MLLMs for hateful and propagandistic meme understanding across English and Arabic benchmarks, (ii) extend existing meme datasets with weakly supervised chain-of-thought (CoT) rationales via distillation and multi-LLM fine-grained propaganda annotations, (iii) introduce a GRPO-based objective with thinking-length regularization that jointly optimizes classification accuracy and explanation quality, and (iv) investigate self-supervised GRPO on unlabeled memes using consensus-based pseudo-labels. Experiments on the Hateful Memes and ArMeme benchmarks show that our approach improves over previously reported results on FHM accuracy (up to +2.1 percentage points, from 79.9% to 82.0%) and on ArMeme macro-F1 (up to +7.6 points, from 0.536 to 0.612 with explanations; +6.1 points compared to the original ArMeme benchmark), while also generating natural-language explanations. On ArMeme, sequence-classification baselines remain stronger in terms of raw accuracy, whereas our approach provides more balanced per-class performance along with explanations. We publicly release our code, data extensions, and evaluation resources.
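
The group-relative part of GRPO can be sketched as follows: several responses are sampled for the same meme, each is scored, and rewards are standardized within the group. The reward shape below (accuracy plus an explanation score minus a length penalty) and every parameter name are assumptions for illustration; the abstract does not state the authors' exact reward or regularizer.

```python
import statistics


def meme_reward(pred_label, gold_label, explanation_score, think_tokens,
                max_think_tokens=256, length_penalty=0.1):
    """Hypothetical task reward: correctness + explanation quality - overlong thinking."""
    accuracy = 1.0 if pred_label == gold_label else 0.0
    overflow = max(0, think_tokens - max_think_tokens) / max_think_tokens
    return accuracy + explanation_score - length_penalty * overflow


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one sampled group (population std is assumed)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

Responses that beat the group average receive positive advantages and the rest negative ones, so no separate value network is needed. When all rewards in a group are equal, every advantage is zero.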

2606.15306 2026-06-16 cs.LG cs.AI 新提交

LatentGym: A Testbed For Cross-Task Experiential Learning With Controllable Latent Structure

LatentGym: 具有可控潜在结构的跨任务经验学习测试平台

Daksh Mittal, Tommaso Castellani, Thomson Yen, Naimeng Ye, Fangyu Wu, Minghui Chen, Tiffany Cai, Emmanouil Koukoumidis, William Zeng, Hongseok Namkoong

发表机构 * Columbia University(哥伦比亚大学) Oumi

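AI summary: Proposes the LatentGym testbed, which uses controllable latent variables to separate exploration from exploitation, for studying how LLM agents adapt across sequences of related tasks.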

Comments 61 pages

详情
AI中文摘要

我们设想持续学习的代理系统会随时间变得更加有用:当它们遇到一系列相关任务时,应该推断这些任务之间共享的隐藏结构,并利用它来改进未来的决策。这种跨任务经验学习能力在个性化和交互式辅助等领域至关重要,但现有的训练/评估框架不提供共享的、可控的潜在结构,也无法衡量代理是否改进或改进的原因。我们引入了LatentGym:一个可控的套件,其中每个环境都围绕一个控制任务间结构的地面真实潜在变量组织。我们的构建产生了将探索(代理的行为是否收集关于潜在变量的信息)与利用(代理是否使用收集到的信息)分离的指标。我们在实证研究中展示了我们的套件,解决了三个问题:前沿模型如何以及为什么无法适应相关任务;对相关任务序列进行后训练是否能提高一般的跨任务适应性,以及这些收益来自何处;以及诸如任务间反馈等设计选择如何塑造训练动态和泛化。总之,这些结果为研究LLM代理如何从跨任务经验中学习,以及设计在顺序、个性化和交互式设置中更可靠适应的代理建立了受控基础。

英文摘要

We envision continually learning agentic systems that become more useful over time: as they encounter sequences of related tasks, they should infer the hidden structure shared across those tasks and use it to improve future decisions. This cross-task experiential learning capability is pivotal in domains such as personalization and interactive assistance, but existing training/evaluation frameworks do not provide shared, controllable latent structures and cannot measure whether or why agents improve. We introduce LatentGym: a controllable suite in which each environment is organized around a ground-truth latent variable governing the structure across tasks. Our construction yields metrics that separate exploration (whether the agent's actions gather information about the latent) from exploitation (whether the agent uses what it has gathered). We demonstrate our suite on empirical studies addressing three questions: how and why frontier models fail to adapt across related tasks; whether post-training on related task sequences improves general cross-task adaptation, and where those gains come from; and how design choices such as inter-task feedback shape training dynamics and generalization. Together, these results establish a controlled foundation for studying how LLM agents learn from experience across tasks, and for designing agents that adapt more reliably in sequential, personalized, and interactive settings.

2606.15305 2026-06-16 cs.CV 新提交

CoMNeT: A MedNeXt-CorrDiff Framework for Volumetric Brain Tumor Segmentation

CoMNeT: 一种用于体积脑肿瘤分割的MedNeXt-CorrDiff框架

Michael L. Evans, MD Fayaz Bin Hossen, MD Shibly Sadique, Walia Farzana, Khan M. Iftekharuddin

发表机构 * Old Dominion University(欧道明大学)

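AI summary: Proposes the CoMNeT framework, which combines the 3D convolutional segmentation model MedNeXt with corrective-diffusion postprocessing (CorrDiff) and ensembling to improve glioma segmentation accuracy, achieving a higher average Dice score than the baseline models on the UTSW-Glioma dataset.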

Comments 10 pages, 4 figures, 2 tables

详情
AI中文摘要

从多参数磁共振成像(MRI)中准确分割脑肿瘤对于治疗规划、反应评估和定量神经肿瘤学研究至关重要。然而,由于肿瘤外观和MRI协议在患者扫描之间的差异,自动分割在计算机视觉中仍然是一项困难的任务。此外,临床重要区域如增强肿瘤(ET)和肿瘤核心(TC)通常相对于全脑体积较小,进一步增加了实现高体素级精度的难度。在本文中,我们展示了将现代3D卷积分割模型与基于校正扩散的细化和集成相结合,可以改善UTSW-Glioma数据集上的体积胶质瘤分割。我们提出了CoMNeT,一个MedNeXt-CorrDiff框架,该框架使用四种MRI模态作为输入,并预测ET、TC和全肿瘤(WT)区域,用于自动脑肿瘤分割。MedNeXt作为主要分割模型,使用全局响应归一化进行特征学习,而CorrDiff被训练为后处理残差细化方法,在最终阈值化之前纠正概率图中的错误。使用五折交叉验证,CoMNeT在大多数肿瘤区域取得了最高的Dice分数,ET、TC、WT和平均Dice分数分别为0.7543 +/- 0.0261、0.6806 +/- 0.0166、0.9049 +/- 0.0128和0.7798 +/- 0.0184。CoMNeT优于两个选定的基线模型:SegResNet(平均Dice 0.7555 +/- 0.0190)和独立MedNeXt(平均Dice 0.7697 +/- 0.0154)。我们的研究结果支持将校正扩散和折叠级概率集成作为现有最先进3D卷积模型用于自动胶质瘤分割的实用补充。

英文摘要

Accurate brain tumor segmentation from multiparametric magnetic resonance imaging (MRI) is critical for treatment planning, response assessment, and quantitative neuro-oncology research. However, automated segmentation remains a difficult task in computer vision because of variation in tumor appearance and MRI protocols across patient scans. Moreover, clinically important regions such as enhancing tumor (ET) and tumor core (TC) are often small relative to the full brain volume, further increasing the difficulty of achieving high voxel-level precision. In this paper, we show that combining a modern 3D convolutional segmentation model with corrective diffusion-based refinement and ensembling improves volumetric glioma segmentation on the UTSW-Glioma dataset. We propose CoMNeT, a MedNeXt-CorrDiff framework that uses four MRI modalities as input and predicts ET, TC, and whole tumor (WT) regions for automated brain tumor segmentation. MedNeXt is used as the primary segmentation model with Global Response Normalization for feature learning, while CorrDiff is trained as a postprocessing residual refinement method to correct errors in the probability maps before final thresholding. Using five-fold cross-validation, CoMNeT achieved the highest Dice score for most tumor regions, with ET, TC, WT, and average Dice scores of 0.7543 +/- 0.0261, 0.6806 +/- 0.0166, 0.9049 +/- 0.0128, and 0.7798 +/- 0.0184, respectively. CoMNeT outperformed two selected baseline models: SegResNet (0.7555 +/- 0.0190 average Dice) and standalone MedNeXt (0.7697 +/- 0.0154 average Dice). Our findings support the use of corrective diffusion and fold-level probability ensembling as practical additions to existing state-of-the-art 3D convolutional models for automated glioma segmentation.
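
For readers unfamiliar with the metric and the ensembling step, here is a minimal sketch on flattened voxel lists. It assumes the fold-level ensemble is a plain average of per-fold probability maps followed by a 0.5 threshold; the abstract does not confirm either detail, and the function names are hypothetical.

```python
def dice_score(pred, target, eps=1e-8):
    """Dice = 2|A and B| / (|A| + |B|) for binary masks given as flat 0/1 lists."""
    intersection = sum(p * t for p, t in zip(pred, target))
    return (2.0 * intersection + eps) / (sum(pred) + sum(target) + eps)


def ensemble_probabilities(fold_probs):
    """Average voxel-wise probabilities predicted by the models of each fold."""
    n_folds = len(fold_probs)
    return [sum(voxel) / n_folds for voxel in zip(*fold_probs)]


def threshold(probs, cutoff=0.5):
    """Convert a probability map to a binary mask."""
    return [1 if p >= cutoff else 0 for p in probs]
```

Small regions such as ET are penalized heavily by Dice: with one true voxel, a two-voxel prediction that contains it scores only 2/3.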

2606.15304 2026-06-16 cs.CV 新提交

HemExp: Clinically-Guided Latent Diffusion for Modeling Hematoma Expansion

HemExp: 临床引导的潜在扩散模型用于血肿扩大建模

Orhun Utku Aydin, Satoru Tanioka, Tzu I Chuang, Alexander Koch, Dimitrios Rallios, Marie Gultom, Begum Tahhan, Fujimaro Ishida, Dietmar Frey, Adam Hilbert

发表机构 * CLAIM – Charité Lab for AI in Medicine, Charité – Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt-Universität zu Berlin(柏林夏里特医学院人工智能医学实验室(CLAIM),柏林夏里特医学院,柏林自由大学和柏林洪堡大学成员) Department of Neurosurgery, Mie Chuo Medical Center(三重中央医疗中心神经外科)

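AI summary: Proposes HemExp, a clinically guided latent diffusion model that generates patient-specific follow-up CT images and hematoma segmentations to produce spatial probability predictions of hematoma expansion, and supports uncertainty modeling under perturbed clinical variables.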

详情
AI中文摘要

自发性脑出血(ICH)后的血肿扩大(HE)是神经外科护理中急性分诊和治疗决策的主要决定因素。然而,现有方法大多提供二元扩大风险或单一随访体积,限制了不确定性感知决策。我们提出HemExp,一种临床引导的潜在扩散模型,可生成患者特异性的随访非对比CT图像,以及脑实质和脑室内出血的分割。生成过程以基线成像、临床变量和明确的扩大指标为条件,实现对真实临床场景的可控模拟。HemExp使用血肿感知多头变分自编码器,并通过条件扩散模型将进展建模为基线和随访潜在表示之间的差异。该模型在来自多个中心的450名患者的配对扫描上训练,并在来自一个保留机构的107名患者上评估。HemExp通过为每位患者生成多个合成随访图像来估计可能的随访血肿体积分布,从而生成空间HE概率图。扰动临床输入(如症状发作至成像时间或抗凝状态)会改变预测的随访血肿体积分布。HemExp扩展了二元预测器,并在成像空间中展示了稳健的临床相关结果估计,如血肿体积、脑室内受累和占位效应。总体而言,我们的结果支持可控潜在扩散作为早期ICH进展的不确定性感知建模的一个有前景的方向。

英文摘要

Hematoma expansion (HE) after spontaneous intracerebral hemorrhage (ICH) is a major determinant of acute triage and treatment decisions in neurosurgical care. However, most existing methods provide either a binary expansion risk or a single follow-up volume, limiting uncertainty-aware decisions. We introduce HemExp, a clinically-guided latent diffusion model that generates patient-specific follow-up non-contrast CT images, along with segmentations of intraparenchymal and intraventricular hemorrhage. Generation is conditioned on baseline imaging, clinical variables, and an explicit expansion indicator, enabling controllable simulation of realistic clinical scenarios. HemExp uses a hemorrhage-aware multi-head variational autoencoder and models progression as the difference between baseline and follow-up latent representations with a conditional diffusion model. The model is trained on paired scans from 450 patients across multiple centers and evaluated on 107 patients from a held-out institution. HemExp produces spatial HE probability maps by generating multiple synthetic follow-up images per patient to estimate distributions of plausible follow-up hematoma volumes. Perturbing clinical inputs such as symptom-onset-to-imaging time or anticoagulant status shifts the predicted follow-up volume distribution. HemExp extends binary predictors and demonstrates robust estimation of clinically relevant outcomes in the imaging space, such as hematoma volume, intraventricular involvement, and mass effects. Overall, our results support controllable latent diffusion as a promising direction for uncertainty-aware modeling of early ICH progression.
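
The step from multiple synthetic follow-ups to a spatial probability map and a volume distribution can be sketched as below. The generative model itself is not reproduced: `sampled_masks` stands in for hematoma segmentations of the generated follow-up images, and the voxel volume and function names are hypothetical.

```python
def expansion_probability_map(sampled_masks):
    """Voxel-wise fraction of sampled follow-ups in which the voxel is hematoma."""
    n_samples = len(sampled_masks)
    return [sum(voxel) / n_samples for voxel in zip(*sampled_masks)]


def volume_ml(mask, voxel_volume_ml):
    """Hematoma volume of one binary mask, given the volume of a single voxel."""
    return sum(mask) * voxel_volume_ml


def volume_distribution(sampled_masks, voxel_volume_ml):
    """One follow-up volume per sample: an empirical distribution, not a point estimate."""
    return [volume_ml(mask, voxel_volume_ml) for mask in sampled_masks]
```

Perturbing a clinical input and regenerating the samples would shift this empirical distribution, which is the behaviour the abstract describes for symptom-onset-to-imaging time and anticoagulant status.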

2606.15301 2026-06-16 cs.LG cs.AI 新提交

Discovering Lattice Reduction Strategies via Self-Play

通过自我对弈发现格基约简策略

Mohamed Malhou, Kristin Lauter, Ludovic Perret

发表机构 * FAIR, Meta Superintelligence Labs(Meta超级智能实验室FAIR) Sorbonne Université CNRS, LIP6(索邦大学CNRS/LIP6) EPITA, EPITA Research Lab (LRE)(EPITA研究实验室(LRE))

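AI summary: Uses deep reinforcement learning with AlphaZero-style self-play to learn better lattice reduction strategies in the primitive action space of LLL; trained on 8-dimensional lattices, the policy generalizes zero-shot to dimension 32.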

详情
AI中文摘要

Lenstra-Lenstra-Lovász (LLL) 算法是计算机科学中用于格基约简的开创性贡献,但其多项式时间输出的基随着维数增长远非最优。我们证明,深度强化学习可以通过与LLL的原始动作空间交互,发现严格更优、可泛化的约简策略。我们将格基约简形式化为单人马尔可夫决策过程 (MDP),并使用AlphaZero风格的自我对弈流水线训练深度残差网络,该流水线结合了自适应视界MCTS(蒙特卡洛树搜索),将多步网络预测与熵门控扩展机制耦合。由此产生的策略DeltaStar仅在小的8维q-ary格上训练,且需要的原始行操作少于LLL。关键的是,它无需重新训练即可零样本泛化到未见过的模数和高达n=32的更高维度。

英文摘要

The Lenstra-Lenstra-Lovász (LLL) algorithm is a seminal lattice basis reduction algorithm, yet the bases it outputs in polynomial time are increasingly far from optimal as the dimension grows. We show that deep reinforcement learning can discover strictly superior, generalizable reduction strategies by interacting with the primitive action space of LLL. We formulate lattice reduction as a single-player Markov Decision Process (MDP) and train a deep residual network using an AlphaZero-style self-play pipeline augmented with adaptive-horizon MCTS (Monte Carlo Tree Search), which couples multi-step network predictions with an entropy-gated expansion mechanism. The resulting policy, DeltaStar, is trained exclusively on small $8$-dimensional $q$-ary lattices and requires fewer primitive row operations than LLL. Crucially, it generalizes zero-shot to unseen moduli and higher dimensions up to $n=32$ without retraining.
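
To make the primitive row operations concrete, here is a compact textbook LLL in exact rational arithmetic that counts size-reduction steps and swaps. It is the classical baseline only, not DeltaStar; the counter `ops` and the decision to count each integer row subtraction and each swap as one operation are assumptions about what such a count might mean, and the Gram-Schmidt data are recomputed from scratch for clarity rather than speed.

```python
from fractions import Fraction


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def gram_schmidt(basis):
    """Return orthogonalized vectors and the mu coefficients as exact Fractions."""
    ortho = []
    mu = [[Fraction(0)] * len(basis) for _ in basis]
    for i, b in enumerate(basis):
        v = [Fraction(x) for x in b]
        for j in range(i):
            mu[i][j] = dot(b, ortho[j]) / dot(ortho[j], ortho[j])
            v = [vi - mu[i][j] * oj for vi, oj in zip(v, ortho[j])]
        ortho.append(v)
    return ortho, mu


def lll(basis, delta=Fraction(3, 4)):
    """Textbook LLL on linearly independent integer rows; returns (basis, ops)."""
    basis = [list(b) for b in basis]
    n, k, ops = len(basis), 1, 0
    while k < n:
        # Primitive action 1: size-reduce row k against earlier rows.
        for j in range(k - 1, -1, -1):
            _, mu = gram_schmidt(basis)
            q = round(mu[k][j])
            if q != 0:
                basis[k] = [a - q * b for a, b in zip(basis[k], basis[j])]
                ops += 1
        ortho, mu = gram_schmidt(basis)
        # Lovasz condition decides between advancing and swapping.
        lhs = dot(ortho[k], ortho[k])
        rhs = (delta - mu[k][k - 1] ** 2) * dot(ortho[k - 1], ortho[k - 1])
        if lhs >= rhs:
            k += 1
        else:
            # Primitive action 2: swap adjacent rows and step back.
            basis[k], basis[k - 1] = basis[k - 1], basis[k]
            ops += 1
            k = max(k - 1, 1)
    return basis, ops
```

A learned policy in the paper's setting would choose among the same kinds of operations; the claim in the abstract is that it needs fewer of them than this fixed schedule.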

2606.15300 2026-06-16 cs.AI cs.CL 新提交

CODA-BENCH: Can Code Agents Handle Data-Intensive Tasks?

CODA-BENCH:代码智能体能否处理数据密集型任务?

Yuxin Zhang, Ju Fan, Meihao Fan, Shaolei Zhang, Xiaoyong Du

发表机构 * Renmin University of China(中国人民大学)

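AI summary: Proposes the CODA-BENCH benchmark, which jointly evaluates code and data intelligence in data-intensive environments; it contains 1,009 tasks with an average of 980 files per environment and reveals that current agents struggle to integrate data discovery with code execution.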

Comments Accepted at ICML 2026. 37 pages, 11 figures. Project page: https://coda-bench.github.io/ Code: https://github.com/ruc-datalab/CoDA-Bench Data: https://huggingface.co/datasets/RUC-DataLab/CoDA-Bench

详情
AI中文摘要

高级智能体正日益展现出作为自主工程师的潜力,这催生了对能够捕捉真实世界开发复杂性的评估基准的需求。此类环境通常涉及复杂代码和大规模数据(即文件系统)。然而,现有基准通常孤立地评估代码中心或数据中心能力,与真实开发场景存在明显差距。在本文中,我们通过引入CODA-BENCH来弥合这一差距,这是首个在数据密集型环境中联合评估代码与数据智能的基准。我们基于Kaggle生态系统(包含数百个数据集)构建了一个数据密集型Linux沙箱,其中智能体必须主动探索复杂的文件层次结构以识别相关资源,并为数据驱动的分析任务生成代码。CODA-BENCH包含跨越31个社区的1009个任务,每个任务环境平均包含980个文件,模拟了真实的数据规模和噪声。对高级智能体的评估显示,即使是最优系统也难以有效整合数据发现与代码执行,成功率仅为61.1%。这些结果凸显了当前智能体在数据密集型任务中的能力差距,并为未来研究指明了有希望的方向。

英文摘要

Advanced agents are increasingly demonstrating the potential to operate as autonomous engineers, creating a growing demand for evaluation benchmarks that capture the complexity of real-world development. Such environments typically involve both complex code and large-scale data (i.e., file systems). However, existing benchmarks usually evaluate code-centric or data-centric capabilities in isolation, leaving a clear gap with real development scenarios. In this paper, we bridge this gap by introducing CODA-BENCH, the first benchmark to jointly evaluate code and data intelligence in a data-intensive environment. We construct a data-intensive Linux sandbox based on the Kaggle ecosystem (containing hundreds of datasets), where agents must actively explore complex file hierarchies to identify relevant resources and generate code for data-driven analytical tasks. CODA-BENCH comprises 1,009 tasks spanning 31 communities, with each task environment containing an average of 980 files, simulating realistic data scale and noise. Evaluations of advanced agents reveal that even top-performing systems struggle to effectively integrate data discovery with code execution, achieving a success rate of only 61.1%. These results highlight a substantial gap in current agentic capabilities for data-intensive tasks and point to promising directions for future research.

2606.15291 2026-06-16 cs.AI 新提交

A Formal Framework for Declarative Agentic AI in Business Process Analysis

业务流程分析中声明式智能体AI的形式化框架

Mohammad Azarijafari, Luisa Mich, Michele Missikoff

发表机构 * University of Trento(特伦托大学) Istituto di Analisi dei Sistemi ed Informatica (IASI) “Antonio Ruberti”, National Research Council (CNR)(国家研究委员会(CNR)安东尼奥·鲁贝蒂系统分析与信息学研究所(IASI))

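AI summary: Proposes a formal framework based on the AGO methodology that uses set theory and mathematical logic to define Agent, Goal, and Object entities and their interactions, building a business process knowledge base that supports structured querying, incremental updates, and automatic workflow generation.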

详情
AI中文摘要

智能体AI为自动化业务流程(BP)开辟了新机遇,实现了自主决策和动态适应。然而,要实现这一潜力,需要以形式化精度定义BP实体及其交互。本文通过AGO方法提出了一个用于智能体BP分析的形式化框架。AGO从谁在行动(智能体)、为何执行(目标)以及相关实体是什么(对象)的角度捕获建模视角。基于集合论和数学逻辑,我们形式化定义了AGO实体类型及其交互,将所有定义组织成BP知识库(BPKB)。生成的BPKB支持结构化查询、增量更新和BP工作流的自动生成,同时确保导出路径的健全性和完备性。

英文摘要

Agentic AI opens new opportunities for automating Business Processes (BPs), enabling autonomous decision-making and dynamic adaptation. However, realising this potential requires BP entities and their interactions to be defined with formal precision. This paper presents a formal framework for Agentic BP analysis through the AGO methodology. AGO captures the modelling perspective in terms of who is acting (Agents), why an action is carried out (Goals), and what the relevant entities are (Objects). Grounded in set theory and mathematical logic, we formally define the AGO entity types and their interactions, organising all definitions into a BP Knowledge Base (BPKB). The resulting BPKB supports structured querying, incremental updates, and automatic generation of BP workflows, while ensuring soundness and completeness of the derived paths.

2606.15288 2026-06-16 cs.LG cs.AI physics.ao-ph 新提交

Hybrid NARX-LLM for Greenland Iceberg Discharge: Prompt-Driven Residual Correction

混合NARX-LLM用于格陵兰冰山排放:提示驱动的残差校正

Yiquan Gao, Duohui Xu

发表机构 * Heriot-Watt University(赫瑞瓦特大学) StudioYG

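AI summary: Proposes a hybrid NARX-LLM framework that combines a nonlinear autoregressive model with a large language model for residual correction, and introduces a physics-informed prompting method, to model the complex nonlinear dynamics of Greenland iceberg discharge and improve predictive accuracy.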

详情
AI中文摘要

格陵兰冰山排放表现出复杂的非线性动态,且可观测性有限,对传统预测模型构成挑战。我们提出一个混合NARX-LLM框架,该框架结合了具有外源输入的非线性自回归模型(NARX)和用于残差校正的大型语言模型(LLM)。我们进一步提出了一种物理信息提示(PIP)方法,将非结构化物理知识转化为结构化提示,用于零样本上下文推理。主要目标是探索该框架在建模格陵兰冰山排放方面的校正潜力,而不仅仅是优化预测精度。NARX组件捕获内在的时间依赖性,而由PIP引导的LLM编码冰川动力学和环境驱动因素,并感知关键趋势模式以校正系统预测误差。这种集成允许模型推理未建模因素并产生可解释的残差,从而提升整体预测精度。应用于格陵兰冰山排放时间序列,我们的方法处理了由于罕见变化和非平稳趋势而难以预测的极端事件,这是传统方法经常忽视的局限性。通过融合结构化时间序列建模与知识驱动的Foundation AI,该框架提供了一条可扩展且可解释的路径,将数据受限的气候预测与物理信息LLM推理相结合。代码已公开。

英文摘要

Greenland iceberg discharge exhibits complex nonlinear dynamics with limited observability, challenging traditional predictive models. We present a Hybrid NARX-LLM framework that combines a nonlinear autoregressive model with exogenous inputs (NARX) and a large language model (LLM) for residual correction. We further propose a Physics-Informed Prompt (PIP) method that transforms unstructured physical knowledge into structured prompts for zero-shot in-context reasoning. The primary objective is to explore the corrective potential of this framework for modeling Greenland iceberg discharge, rather than merely optimizing predictive accuracy. The NARX component captures intrinsic temporal dependencies, while the LLM, guided by PIP, encodes glacier dynamics and environmental drivers and perceives key trend patterns to correct systematic prediction errors. This integration allows the model to reason about unmodeled factors and produce interpretable residuals, enhancing overall predictive accuracy. Applied to Greenland iceberg discharge time series, our approach addresses extreme events that are difficult to predict due to rare variations and nonstationary trends, a limitation often overlooked by traditional methods. By fusing structured time-series modeling with knowledge-driven foundation AI, the framework offers a scalable and interpretable pathway to bridge data-limited climate forecasting with physics-informed LLM reasoning. The code is available.
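
The overall structure, a base forecast plus a separately produced residual correction, can be sketched as below. Everything here is a stand-in: the one-layer `tanh` model is a toy NARX, and `residual_correction` replaces the prompted LLM with a simple average of recent residuals purely so the example runs. None of the names or forms come from the paper.

```python
import math


def narx_forecast(y_lags, u_lags, w_y, w_u, bias=0.0):
    """Toy NARX: nonlinear function of lagged outputs y and exogenous inputs u."""
    z = (bias
         + sum(w * y for w, y in zip(w_y, y_lags))
         + sum(w * u for w, u in zip(w_u, u_lags)))
    return math.tanh(z)


def residual_correction(recent_residuals):
    """Placeholder for the LLM step: estimate the systematic error of the base model."""
    if not recent_residuals:
        return 0.0
    return sum(recent_residuals) / len(recent_residuals)


def hybrid_forecast(y_lags, u_lags, w_y, w_u, recent_residuals):
    """Final prediction = base NARX forecast + residual correction."""
    base = narx_forecast(y_lags, u_lags, w_y, w_u)
    return base + residual_correction(recent_residuals)
```

If the base model is consistently low by 0.2, feeding the last few residuals back in moves the hybrid forecast closer to the observed value than the base forecast alone.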

2606.15287 2026-06-16 cs.CV 新提交

G2IA: Geometry-Guided Instance-Aware Retrieval and Refinement for Cross-Modal Place Recognition

G2IA: 几何引导的实例感知跨模态地点识别检索与精炼

Xianyun Jiao, Jingyi Xu, Zhongmiao Yan, Xieyuanli Chen, Lin Pei

发表机构 * Shanghai Jiao Tong University(上海交通大学) National University of Defense Technology(国防科技大学)

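AI summary: Proposes the G2IA framework, which addresses the modality gap and perceptual aliasing in image-to-point-cloud place recognition through geometry-guided instance-aware retrieval and cross-modal verification of local shapes and spatial layouts.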

详情
AI中文摘要

跨模态地点识别(CMPR)使仅搭载相机的机器人在自主导航场景中能够根据预先构建的激光雷达地图进行定位。这种图像到点云的设置面临两种耦合的模糊性:透视RGB外观与稀疏度量几何之间的模态差异,以及具有相似道路、立面、交叉口和物体布局的城市地点之间的感知混淆。我们不将CMPR视为单一的全局描述符匹配问题,而是认为可靠的检索需要几何感知表示对齐和细粒度候选验证。本文提出G2IA,一个几何引导的实例感知框架,用于图像到点云的地点识别。在检索阶段,来自VGGT的视觉几何先验和实例特征被整合,以构建与激光雷达地图表示更兼容的地点描述符。在精炼阶段,通过显式验证局部实例形状及其相对空间布局在跨模态下是否一致,对检索到的候选进行重新排序。在公开基准上的实验表明,G2IA在不同定位阈值下一致地改善了图像到点云的地点识别,并表现出强大的跨数据集泛化能力。

英文摘要

Cross-modal place recognition (CMPR) enables camera-only robots to localize against pre-built LiDAR maps in autonomous navigation scenarios. This image-to-point-cloud setting is challenged by two coupled ambiguities: the modality gap between perspective RGB appearance and sparse metric geometry, and perceptual aliasing among urban places with similar roads, facades, intersections, and object arrangements. Instead of treating CMPR as a single global descriptor matching problem, we argue that reliable retrieval requires both geometry-aware representation alignment and fine-grained candidate verification. In this paper, we propose G2IA, a geometry-guided instance-aware framework for image-to-point-cloud place recognition. In the retrieval stage, visual geometry priors from VGGT and instance features are integrated to construct place descriptors that are more compatible with LiDAR-derived map representations. In the refinement stage, the retrieved candidates are re-ranked by explicitly verifying whether local instance shapes and their relative spatial layouts are consistent across modalities. Experiments on public benchmarks demonstrate that G2IA consistently improves image-to-point-cloud place recognition under different localization thresholds, and exhibits strong cross-dataset generalization.
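
The two-stage pattern described here, global-descriptor retrieval followed by layout-based re-ranking, can be sketched generically. This is not G2IA itself: descriptors are plain vectors, instances are labelled 2D points, and the verification score (share of instance pairs whose distances agree within a tolerance) is a hypothetical stand-in for the paper's shape and layout checks.

```python
import math


def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0


def retrieve(query_desc, database, k=3):
    """Stage 1: top-k places by global descriptor similarity."""
    ranked = sorted(database, key=lambda p: cosine(query_desc, p["descriptor"]),
                    reverse=True)
    return ranked[:k]


def pairwise_distances(instances):
    """Distances between every pair of labelled instances ({label: (x, y)})."""
    labels = sorted(instances)
    return {(a, b): math.dist(instances[a], instances[b])
            for i, a in enumerate(labels) for b in labels[i + 1:]}


def layout_score(query_inst, cand_inst, tol=1.0):
    """Share of query instance pairs whose spacing matches the candidate."""
    q, c = pairwise_distances(query_inst), pairwise_distances(cand_inst)
    if not q:
        return 0.0
    agree = sum(1 for pair in q if pair in c and abs(q[pair] - c[pair]) <= tol)
    return agree / len(q)


def rerank(query_inst, candidates, tol=1.0):
    """Stage 2: reorder retrieved candidates by layout consistency."""
    return sorted(candidates,
                  key=lambda p: layout_score(query_inst, p["instances"], tol),
                  reverse=True)
```

Two places with near-identical global descriptors (perceptual aliasing) can be told apart in the second stage when their object spacing differs.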

2606.15286 2026-06-16 cs.CV 新提交

Decoupled Motion Representation Learning for Moving Infrared Small Target Detection

解耦运动表示学习用于移动红外小目标检测

Guoyi Zhang, Peiwen Wu, Han Wang, Xiangpeng Xu, Xiaohu Zhang

发表机构 * School of Aeronautics and Astronautics, Sun Yat-sen University(中山大学航空航天学院)

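AI summary: To address the difficulty of detection when target, platform, and background motions are highly coupled in dynamic scenes, proposes a decoupled motion representation learning framework in which an explicit motion branch models globally coherent motion, an implicit branch captures local anomalies, and a coherent-motion-guided anomaly reasoning module suppresses false alarms; it outperforms existing methods in complex dynamic scenes.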

详情
AI中文摘要

动态场景中的红外小目标检测仍然具有挑战性,原因是目标、成像平台和动态背景之间的运动高度耦合。现有的多帧方法通常执行隐式时间建模,其中连贯的背景动态主导运动对应学习,导致检测与虚警之间存在固有的权衡。在这项工作中,我们观察到背景运动表现出强烈的全局连贯性,而小目标主要对应稀疏的局部运动异常。此外,许多虚警响应与全局连贯运动模式保持高度一致性,表明它们主要源于连贯的背景动态而非真实目标运动。基于这些观察,我们提出了一种解耦运动表示学习框架用于移动红外小目标检测。具体地,引入显式运动分支,利用预训练的光流先验建模全局连贯运动动态,并采用结构保持的自监督适应策略进行红外运动对应学习。同时,设计了基于可变形特征对齐的隐式运动分支,在连贯运动引导下捕捉目标敏感的局部运动异常。此外,提出了连贯运动引导的局部异常推理模块,在局部运动建模过程中识别并抑制由连贯运动引起的虚假响应。在两个具有挑战性的红外小目标检测基准上的大量实验表明,所提方法在复杂运动的动态场景中持续优于现有最先进方法,同时保持了良好的推理效率。

英文摘要

Infrared small target detection in dynamic scenes remains challenging due to the highly coupled motions among targets, imaging platforms, and dynamic backgrounds. Existing multi-frame methods usually perform implicit temporal modeling, where coherent background dynamics dominate motion correspondence learning, leading to an inherent trade-off between detection and false alarms. In this work, we observe that background motions exhibit strong global coherence, whereas small targets mainly correspond to sparse local motion anomalies. Moreover, many false-alarm responses maintain high consistency with globally coherent motion patterns, indicating that they mainly originate from coherent background dynamics rather than genuine target motions. Based on these observations, we propose a decoupled motion representation learning framework for moving infrared small target detection. Specifically, an explicit motion branch is introduced to model globally coherent motion dynamics using pretrained optical flow priors, together with a structure-preserving self-supervised adaptation strategy for infrared motion correspondence learning. Meanwhile, an implicit motion branch based on deformable feature alignment is designed to capture target-sensitive local motion anomalies under coherent motion guidance. Furthermore, a coherent-motion-guided local anomaly reasoning module is proposed to identify and suppress coherent-motion-induced false responses during localized motion modeling. Extensive experiments on two challenging infrared small target detection benchmarks demonstrate that the proposed method consistently outperforms existing state-of-the-art approaches, particularly in dynamic scenes with complex motions, while maintaining favorable inference efficiency.
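
The observation that background motion is globally coherent while targets appear as sparse local anomalies can be illustrated with a hand-crafted toy. This is not the learned framework: the median flow stands in for the explicit global-motion branch, a fixed deviation threshold stands in for the anomaly reasoning, and both function names are hypothetical.

```python
import math
import statistics


def global_motion(flow):
    """Estimate the coherent background motion as the per-axis median flow."""
    return (statistics.median(dx for dx, _ in flow),
            statistics.median(dy for _, dy in flow))


def local_anomalies(flow, threshold=1.5):
    """Indices of flow vectors that deviate strongly from the coherent motion."""
    gx, gy = global_motion(flow)
    return [i for i, (dx, dy) in enumerate(flow)
            if math.hypot(dx - gx, dy - gy) > threshold]
```

A response that moves with the background has a small deviation and is suppressed, which mirrors the paper's argument that many false alarms are consistent with globally coherent motion.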

2606.15285 2026-06-16 cs.RO 新提交

Acting While Understanding: Asynchronous Semantic-Action Decoupling for Real-Time Vision-Language-Action Models

理解中行动:面向实时视觉-语言-动作模型的异步语义-动作解耦

Shenhao Yan, Ge Wang, Qi Liu, Weilin Meng, Jiahao Yang, Chengsi Yao, Fan Feng, Xiaoguang Ma, Yiming Zhao, Yatong Han

发表机构 * Northeastern University(东北大学) Ising AI CUHK-Shenzhen(香港中文大学(深圳))

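AI summary: Proposes an asynchronous semantic-action decoupling framework that separates semantic understanding from action generation in VLAs, achieving high-frequency closed-loop control through low-frequency semantic updates and high-frequency action inference; validated on LIBERO and a real robot with action-module throughput of up to 35.6 Hz.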

详情
AI中文摘要

视觉-语言-动作模型(VLAs)在机器人操作中展现出强大的任务理解和泛化能力,但全模型推理的高计算成本限制了其在低延迟、高频率闭环控制中的部署。我们提出一种异步语义-动作解耦框架,该框架沿现有VLAs的内部语义-动作接口分离语义理解与动作生成,无需重新设计视觉-语言骨干网络或引入外部规划器。低频理解模块异步更新可复用的语义条件,而高频动作模块持续输出控制动作,无需重复调用完整模型。为缓解陈旧语义与当前执行状态之间的时间不匹配,我们进一步引入历史动作条件化和时间错位训练,提供短时域执行上下文,并在陈旧语义条件下提高反馈控制鲁棒性。在LIBERO上使用$π_{0.5}$和UniVLA进行的实验,以及使用UniVLA的真实机器人部署表明,所提框架实现了高达35.6 Hz的服务端动作模块推理吞吐率,并提供了一条低侵入性路径,无需以控制速率运行完整VLA推理即可实现高频闭环控制。

英文摘要

Vision-Language-Action models (VLAs) have demonstrated strong task understanding and generalization in robotic manipulation, yet the high computational cost of full-model inference limits their deployment in low-latency, high-frequency closed-loop control. We propose an asynchronous semantic-action decoupling framework that separates semantic understanding from action generation along the internal semantic-action interface of existing VLAs, without redesigning the vision-language backbone or introducing an external planner. A low-frequency understanding module asynchronously updates reusable semantic conditions, while a high-frequency action module continuously outputs control actions without repeatedly invoking the full model. To mitigate the temporal mismatch between stale semantics and the current execution state, we further introduce historical action conditioning and time-misalignment training, which provide short-horizon execution context and improve feedback control robustness under stale semantic conditions. Experiments on LIBERO with $π_{0.5}$ and UniVLA, together with real-robot deployment using UniVLA, show that the proposed framework achieves up to 35.6 Hz server-side action-module inference throughput and offers a low-intrusion path to high-frequency closed-loop control without running full VLA inference at control rate.
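
The scheduling idea can be sketched as a single loop in which the expensive understanding step runs only every few ticks and the action step runs on every tick with the cached semantics and a short action history. This is a synchronous simulation of the schedule, not the authors' system: a real deployment would run the understanding module concurrently, and `semantic_period`, `history_len`, and the callback signatures are hypothetical. For scale, 35.6 Hz corresponds to roughly 28 ms per action-module call.

```python
from collections import deque


def run_control_loop(observations, understand, act,
                     semantic_period=5, history_len=3):
    """Run `act` on every tick and refresh semantics every `semantic_period` ticks.

    Returns a log of (step, semantic_age, action), where semantic_age is the
    number of ticks since the cached semantics were last refreshed.
    """
    semantic = None
    semantic_age = 0
    history = deque(maxlen=history_len)   # recent actions as short-horizon context
    log = []
    for step, obs in enumerate(observations):
        if step % semantic_period == 0:
            semantic = understand(obs)    # slow path: low-frequency update
            semantic_age = 0
        else:
            semantic_age += 1             # semantics are stale by this many ticks
        action = act(obs, semantic, tuple(history))  # fast path: every tick
        history.append(action)
        log.append((step, semantic_age, action))
    return log
```

The `semantic_age` column makes the temporal mismatch explicit; it is the quantity that historical action conditioning and time-misalignment training are meant to make the action module robust to.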

2606.15282 2026-06-16 cs.CV 新提交

Enhancing Precision Agriculture with a Hybrid Deep Learning Framework for Multi-Class Plant Disease Classification and Interpretability

利用混合深度学习框架增强精准农业:多类植物病害分类与可解释性

Hasibul Islam Sufi, Ridam Roy, Shayla Alam Setu, Mahimul Islam Nadim

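Affiliations * Department of Computer Science and Engineering, Daffodil International University

AI summary Proposes a hybrid ResNet-ViT architecture for multi-class plant disease classification that reaches 98.58% accuracy on leaf images from 38 classes, combined with interpretability techniques such as Grad-CAM to localize diseased regions.

Details
AI-generated abstract (translated from Chinese)

This study proposes an overall deep learning architecture for multi-class classification of plant diseases from high-resolution leaf images, focusing in particular on the behavior of ResNet-50 and a hybrid ResNet + Vision Transformer (ViT) design. A purpose-collected image database of 15,200 training images and 3,800 validation images, covering 38 classes across multiple crops including tomato, apple, and grape, was preprocessed with resizing, normalization, and data augmentation to improve model robustness. Several architectures, including ResNet-50, MobileNetV2, and EfficientNet-B0, were trained and compared with the hybrid ResNet + ViT model. All models were fine-tuned with the AdamW optimizer and cross-entropy loss, with early stopping applied to prevent overfitting and ensure generalization. In addition, interpretability techniques such as Grad-CAM and saliency maps were implemented to indicate disease-relevant regions, and segmentation-based analysis was performed to identify the affected parts of a leaf. Among all architectures considered, ResNet-50 achieved the highest accuracy at 98.74%, while the hybrid ResNet + ViT model reached a competitive 98.58%, indicating that the hybrid architecture is effective at capturing both local and global information. The experimental results demonstrate the potential of Transformer-based models for building highly accurate, interpretable, and computationally efficient computer-based multi-class, multi-disease classification systems, offering useful support for cultivation management practices and precision agriculture.

English abstract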

This study proposes an overall deep learning architecture for multi-class classification of plant diseases from high-resolution leaf imagery, with a particular interest in the behavior of ResNet-50 and a hybrid ResNet + Vision Transformer (ViT) design. A specially gathered image database with 15,200 training images and 3,800 validation images, spanning 38 classes across multiple crops (including tomato, apple, and grape), was subjected to preprocessing steps such as resizing, normalization, and data augmentation to enhance model robustness. Multiple architectures, including ResNet-50, MobileNetV2, and EfficientNet-B0, were trained and compared with the hybrid ResNet + ViT model. All models were fine-tuned using the AdamW optimizer and cross-entropy loss, with early stopping applied to prevent overfitting and ensure generalization. Furthermore, interpretability techniques such as Grad-CAM and saliency maps were implemented to indicate disease-relevant regions, while segmentation-based analysis was performed to identify the affected parts of a leaf. Among the considered architectures, ResNet-50 achieved the highest accuracy of 98.74%, whereas the hybrid ResNet + ViT model achieved a competitive accuracy of 98.58%, showing that the hybrid architecture was effective in capturing both local and global information. The experimental results showcase the promise of transformer-based models for highly accurate, interpretable, and computationally efficient multi-class, multi-disease classification systems, providing useful support for cultivation management practices as well as for precision farming.
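
The abstract mentions early stopping without giving its settings. As a rough illustration only, the following sketch shows one common patience-based rule in plain Python; the function name `early_stopping_epoch`, the `patience` and `min_delta` values, and the loss curve are all assumptions made for this example and are not taken from the paper.

```python
from typing import Iterable, Tuple


def early_stopping_epoch(
    val_losses: Iterable[float], patience: int = 3, min_delta: float = 0.0
) -> Tuple[int, float]:
    """Return (best_epoch, best_loss) under a patience-based stopping rule.

    Training is considered stopped once `patience` consecutive epochs fail
    to improve the best validation loss by more than `min_delta`.
    """
    best_epoch, best_loss, waited = -1, float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_epoch, best_loss, waited = epoch, loss, 0
        else:
            waited += 1
            if waited >= patience:
                break  # later epochs are never evaluated
    return best_epoch, best_loss


# Hypothetical validation-loss curve, for illustration only.
losses = [0.90, 0.55, 0.40, 0.42, 0.41, 0.43, 0.30]
best_epoch, best_loss = early_stopping_epoch(losses, patience=3)
print(best_epoch, best_loss)  # 2 0.4
```

In this made-up curve the rule stops after epoch 5, so the lower loss at epoch 6 is never reached; that trade-off is why the patience value matters in practice.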

2606.15280 2026-06-16 cs.LG New submission

Rethinking Structural Anomaly Detection: From Decision Boundaries to Projection Operators

Alexander Bauer

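Affiliations * Machine Learning Group, TU Berlin; BIFOLD, Berlin, Germany

AI summary To address the limitations of existing anomaly detection methods on manifold-supported data, the paper proposes a geometric perspective based on projection operators that defines anomalies via the projection residual, unifying reconstruction-based methods and improving performance.

Details
AI-generated abstract (translated from Chinese)

Most existing anomaly detection methods rely on estimating a probability density or learning a closed decision boundary, implicitly assuming that normal data occupies a region of non-zero volume in the ambient space. In contrast, structural anomaly detection considers data lying near a low-dimensional manifold, so the inductive bias of existing methods does not match the structure of the data, which often degrades performance. To resolve this mismatch, we introduce a geometric perspective. Specifically, we learn a projection operator onto the manifold of normal samples and define a sample as anomalous if this projection changes it. This formulation naturally incorporates the inductive bias of manifold-supported data and recasts anomaly detection in terms of a projection residual, thereby resolving the problems caused by modeling degenerate distributions. Notably, it offers a unified interpretation of reconstruction-based methods by explaining their successes and failures in terms of projection quality. In particular, it explains the strong generalization of projection-aligned models as a result of contraction toward the manifold. Moreover, by decoupling anomaly detection from probabilistic modeling, it reduces the tendency to misclassify rare but normal samples, a widely recognized limitation of existing methods. Empirically, we show that projection-aligned methods achieve strong performance, outperforming boundary-based methods while improving on existing reconstruction-based methods.

English abstract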

Most existing anomaly detection methods rely on estimating a probability density or learning an enclosing decision boundary, implicitly assuming that normal data occupies a region of non-zero volume in the ambient space. In contrast, structural anomaly detection considers data that lies near a low-dimensional manifold, creating a mismatch between the inductive bias of existing methods and the structure of the data, often resulting in degraded performance. To address this mismatch, we introduce a geometric perspective. Specifically, we learn a projection operator onto the manifold of normal samples and define a sample as anomalous if it is altered by this projection. This formulation naturally integrates the inductive bias of manifold-supported data and reframes anomaly detection in terms of a projection residual, thereby resolving issues arising from modeling degenerate distributions. Notably, it provides a unifying interpretation of reconstruction-based methods by explaining their success and failure in terms of projection quality. In particular, it explains the strong generalization ability of projection-aligned models as a consequence of contraction behavior toward the manifold. Moreover, by decoupling anomaly detection from probabilistic modeling, it reduces the tendency to misclassify rare but normal samples, a widely recognized limitation of existing approaches. Empirically, we demonstrate that projection-aligned methods achieve strong performance, outperforming boundary-based methods while improving upon existing reconstruction-based approaches.
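
The projection-residual score described here can be made concrete with a deliberately simple stand-in. The sketch below assumes the "manifold" is a straight line fitted by power iteration (a one-component PCA); the paper learns a projection operator, and nothing here claims to reproduce that model. The helper names `fit_line`, `project`, and `residual` and the sample points are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]


def fit_line(points: Sequence[Sequence[float]], iters: int = 50) -> Tuple[Vector, Vector]:
    """Fit a 1-D linear 'manifold': the mean plus the leading principal direction."""
    n, d = len(points), len(points[0])
    mean = [sum(p[j] for p in points) / n for j in range(d)]
    centered = [[p[j] - mean[j] for j in range(d)] for p in points]
    u = [1.0] * d  # power-iteration start vector
    for _ in range(iters):
        # Multiply u by the unnormalized covariance: sum_i x_i * (x_i . u).
        dots = [sum(x[j] * u[j] for j in range(d)) for x in centered]
        v = [sum(dots[i] * centered[i][j] for i in range(n)) for j in range(d)]
        norm = math.sqrt(sum(c * c for c in v))
        if norm == 0.0:
            break  # degenerate data or unlucky start vector
        u = [c / norm for c in v]
    return mean, u


def project(x: Sequence[float], mean: Vector, u: Vector) -> Vector:
    """Orthogonal projection of x onto the fitted line."""
    coeff = sum((x[j] - mean[j]) * u[j] for j in range(len(x)))
    return [mean[j] + coeff * u[j] for j in range(len(x))]


def residual(x: Sequence[float], mean: Vector, u: Vector) -> float:
    """Anomaly score: how far the projection moves the sample."""
    p = project(x, mean, u)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, p)))


# Normal samples lie exactly on the line y = 2x.
normal = [[t, 2.0 * t] for t in (-2.0, -1.0, 0.0, 1.0, 2.0)]
mean, u = fit_line(normal)
on_manifold = residual([3.0, 6.0], mean, u)   # outside the training range, still on the line
off_manifold = residual([3.0, 0.0], mean, u)  # off the line
```

The point `[3.0, 6.0]` lies outside the range of the training samples but on the line, so its residual is numerically zero, whereas `[3.0, 0.0]` is moved by the projection and gets a clearly positive score. A density-based score would tend to penalize both.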

2606.15278 2026-06-16 cs.LG cs.AI New submission

RECTOR: Masked Region-Channel-Temporal Modeling for Affective and Cognitive Representation Learning

Jinhan Liu, Mahsa Shoaran

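Affiliations * Cornell University

AI summary Proposes RECTOR, a self-supervised framework that uses adaptive functional partitioning and masked topology learning to jointly model region-channel-temporal dynamics in EEG/sEEG, setting a new state of the art in emotion recognition and task-engagement classification while remaining robust to missing channels and generalizing across montages.

Details
AI-generated abstract (translated from Chinese)

Affective and cognitive disorders manifest as distributed, time-varying brain network dynamics across regions, channels, and time, which challenges robust representation learning from EEG/sEEG for clinical diagnosis. We propose RECTOR (Masked Region-Channel-Temporal Modeling), an end-to-end self-supervised framework that goes beyond fixed anatomical priors and unifies joint region-channel-temporal representation learning. Its core, RECTOR-SA, is a hierarchical block-sparse self-attention induced by adaptive functional partitioning, which evolves region structure from static anatomical definitions into adaptive functional regions. Self-supervision is driven by masked topology and representation learning, which jointly optimizes three complementary objectives: masked predictive modeling, topological structure modeling, and cross-view consistency. Across multiple benchmarks, RECTOR sets a new state of the art in EEG emotion recognition and sEEG task-engagement classification. Crucially, its strong robustness to missing channels and its cross-montage generalization highlight its potential for large-scale pre-training on heterogeneous EEG/sEEG, while providing interpretable insights at the region and channel levels.

English abstract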

Affective and cognitive disorders manifest as distributed, time-varying brain network dynamics across regions, channels, and time, challenging robust representation learning from EEG/sEEG for clinical diagnosis. We propose RECTOR (Masked Region-Channel-Temporal Modeling), an end-to-end self-supervised framework that unifies joint region-channel-temporal representation learning beyond fixed anatomical priors. At its core, RECTOR-SA is a hierarchical, block-sparse self-attention induced by Adaptive Functional Partitioning that evolves region structures from static anatomical definitions to adaptive functional regions. The self-supervision is driven by Masked Topology and Representation Learning, which jointly optimizes three complementary objectives: Masked Predictive Modeling, Topological Structure Modeling, and Cross-View Consistency. Across diverse benchmarks, RECTOR sets a new state-of-the-art in EEG emotion recognition and sEEG task-engagement classification. Crucially, its strong robustness to missing channels and cross-montage generalization underscores its potential for large-scale pre-training on heterogeneous EEG/sEEG, providing interpretable insights at both region and channel levels.
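
To give a feel for what a partition-induced block-sparse attention pattern looks like, here is a minimal sketch. It assumes a fixed, hand-written channel-to-region assignment and shows only the within-region level; the hierarchical structure and the adaptive partitioning described in the abstract are not modeled, and the function name `block_sparse_mask` and the `regions` list are hypothetical.

```python
from typing import List, Sequence


def block_sparse_mask(region_of_channel: Sequence[int]) -> List[List[bool]]:
    """Build a channel-by-channel attention mask from a region assignment.

    mask[i][j] is True when channel i may attend to channel j, which in this
    sketch means both channels belong to the same region.
    """
    return [[ri == rj for rj in region_of_channel] for ri in region_of_channel]


# Hypothetical assignment of six channels to three regions.
regions = [0, 0, 1, 1, 1, 2]
mask = block_sparse_mask(regions)

# Fraction of channel pairs that remain connected.
allowed = sum(sum(row) for row in mask)
density = allowed / len(regions) ** 2
print(allowed)  # 14 of 36 pairs (2*2 + 3*3 + 1*1)
```

If the region assignment were updated during training, the mask would change with it, which is one way to picture regions moving from static anatomical definitions toward functional ones.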