arXivDaily: daily arXiv academic digest, updated Monday through Friday
2606.18022 2026-06-17 cs.LG New submission

Recursive Scaling in Masked Diffusion Models

Alba Carballo-Castro, Julianna Piskorz, Paulius Rauba, Mihaela van der Schaar, Pascal Frossard

Affiliations: LTS4, EPFL, Lausanne, Switzerland; University of Cambridge, Cambridge, UK

AI summary: Proposes Recursive Masked Diffusion Models (R-MDMs), which increase recursive depth by repeatedly applying the same denoising transformer within each diffusion step. This gives parameter-efficient scaling: on structured generation tasks such as Sudoku and Countdown, R-MDMs match non-recursive baselines with fewer parameters.

Abstract

Masked diffusion models (MDMs) have recently emerged as a promising paradigm for sequence generation. Scaling MDMs is conventionally achieved by increasing the parameter count or the number of denoising steps. We introduce Recursive Masked Diffusion Models (R-MDMs), which add recursive depth as a third scaling axis by repeatedly applying the same denoising transformer within each diffusion step. Recursion enables iterative refinement of the output through parameter reuse, increasing effective model depth without increasing parameter count. Across structured generation tasks, including Sudoku and Countdown, we show that R-MDMs achieve substantially improved parameter efficiency: a model with $L$ recursive iterations often matches the performance of non-recursive baselines with roughly $L\times$ more parameters. Moreover, recursive refinement can partially substitute for additional denoising steps, allowing recursive models to reach the same generation quality with fewer forward passes at inference time. These results suggest that recursive depth is a practically useful scaling mechanism for MDMs, improving both parameter efficiency and the allocation of test-time compute.
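
The central mechanism in this abstract, re-applying one weight-tied denoiser several times inside a single diffusion step, can be illustrated with a toy sketch. This is not the authors' implementation: `make_block` and `recursive_step` are hypothetical names, and the "denoiser" is a two-parameter affine map standing in for a transformer. The point is only that extra recursions add effective depth while the parameter count stays fixed.

```python
def make_block(weight, bias):
    """Toy stand-in for a denoiser: an affine map with clipping and 2 parameters."""
    def block(state):
        return [max(-1.0, min(1.0, weight * x + bias)) for x in state]
    return block, 2  # (callable, parameter count)


def recursive_step(block, state, num_recursions):
    """Apply the SAME block num_recursions times within one diffusion step."""
    for _ in range(num_recursions):
        state = block(state)  # weights are reused, so no new parameters appear
    return state


block, n_params = make_block(weight=0.5, bias=0.1)
state = [1.0, -1.0, 0.0]
out_1 = recursive_step(block, state, num_recursions=1)  # effective depth 1
out_4 = recursive_step(block, state, num_recursions=4)  # effective depth 4, same n_params
```

In this toy setting each recursion moves the state closer to the map's fixed point, which loosely mirrors the iterative refinement the abstract describes.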

2606.18021 2026-06-17 cs.AI cs.CL cs.LG cs.MA New submission

LegalHalluLens: Typed Hallucination Auditing and Calibrated Multi-Agent Debate for Trustworthy Legal AI

Lalit Yadav, Akshaj Gurugubelli

Affiliations: Independent Researcher, Sunnyvale, CA, USA; Independent Researcher, San Diego, CA, USA

AI summary: Targets the concentration and direction of errors that aggregate metrics mask in legal AI. Proposes the LegalHalluLens auditing framework, which combines typed hallucination profiles, a Risk Direction Index (RDI), and a calibrated debate pipeline; it reduces fabricated detections by 45% and surfaces failure modes that aggregate metrics hide.

Comments 15 pages, 5 figures; Published at the Second Workshop on Agents in the Wild: Safety, Security, and Beyond (AIWILD) at ICML 2026

Abstract

AI systems deployed in legal workflows hallucinate at rates that aggregate metrics report at ~52%, but this average conceals where errors concentrate and in which direction they run, leaving compliance officers without an actionable signal for trustworthy deployment. We present LegalHalluLens, an auditing framework with three components: typed hallucination profiles across four legally-motivated claim categories (numeric, temporal, obligation/entitlement, factual) over CUAD (Hendrycks et al., 2021); a Risk Direction Index (RDI) that reduces omission-versus-invention bias to a single deployment-comparable scalar; and a typed debate pipeline calibrated to both magnitudes and directions. Across 510 contracts and 249,252 clause-level instances we measure a within-model gap of approximately 38-40 pp between obligation/numeric and temporal claims that aggregate reporting hides, and show that two systems with matched 52% rates can carry opposite RDIs. The debate pipeline reduces fabricated detections by 45% with per-category gains tracking the diagnosis, matching commercial APIs with a substantially smaller backbone (4B active parameters). Typed profiles and RDI surface failure modes that aggregate metrics hide; we further show these diagnostics serve as calibration inputs for multi-agent debate pipelines, where Skeptic challenges and asymmetric gates targeted at measured failure modes outperform generically-tuned debate. The framework supports direction-aware procurement, accountability, and agent design for legal AI deployed in the wild.

2606.18008 2026-06-17 cs.CV New submission

PhaseWin: An Efficient Search Algorithm for Faithful Visual Attribution

Zihan Gu, Ruoyu Chen, Junchi Zhang, Li Liu, Xiaochun Cao, Hua Zhang

Affiliations: Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences; Shanghai Center for Mathematical Sciences, Fudan University; College of Electronic Science and Technology, National University of Defense Technology; School of Cyber Science and Technology, Shenzhen Campus of Sun Yat-sen University

AI summary: Proposes the PhaseWin algorithm, whose phased window search lowers the model-evaluation complexity of visual attribution from O(n²) to O(n), sharply reducing the number of model evaluations while keeping faithfulness close to that of greedy search.

Comments 26 pages, 29 figures

Abstract

Visual attribution is a fundamental tool for interpreting modern vision and vision-language models, particularly when their decisions must be inspected, diagnosed, or audited. Its goal is to explain how a model's decision depends on local regions of the visual input, typically by assigning an importance ordering over candidate image regions. Given an image partitioned into $n$ regions, faithful attribution can be cast as an ordered subset-search problem, in which progressively inserting the selected regions should recover the target model response as early as possible. Exhaustive search over region subsets incurs exponential cost, while the widely used greedy search still requires a quadratic number of model evaluations, because every selection step rescores all remaining candidates. We propose PhaseWin, an efficient subset-search algorithm for faithful visual attribution. PhaseWin reorganizes greedy region selection into a phased window-search procedure: rather than re-evaluating the full candidate set at every step, it alternates between global candidate screening, adaptive pruning, and localized window refinement, while preserving the essential region-ranking behavior of greedy search. We analyze PhaseWin under monotone evidence-accumulation conditions and show that, under feature-level structural assumptions, it attains controllable linear evaluation complexity together with near-greedy faithfulness guarantees. Extensive experiments on image classification, object detection, visual grounding, and image captioning show that, among all compared attribution methods, PhaseWin reaches high faithfulness with the fewest forward passes, empirically realizing the predicted reduction from $O(n^2)$ to $O(n)$. The code is available at this https URL.
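
The quadratic cost attributed to greedy search follows from rescoring every remaining candidate at each step: n + (n-1) + ... + 1 = n(n+1)/2 evaluations. The sketch below counts those evaluations for the greedy baseline only; it is not PhaseWin. The function name `greedy_ordering` and the additive `toy_score` are assumptions made for illustration, with `toy_score` standing in for a real model's forward pass.

```python
def greedy_ordering(num_regions, score_fn):
    """Greedy insertion: at each step, score every remaining region and keep the best."""
    selected, remaining, evaluations = [], set(range(num_regions)), 0
    while remaining:
        best_region, best_score = None, float("-inf")
        for region in sorted(remaining):
            evaluations += 1  # one model evaluation per remaining candidate
            score = score_fn(selected + [region])
            if score > best_score:
                best_region, best_score = region, score
        selected.append(best_region)
        remaining.remove(best_region)
    return selected, evaluations


# Hypothetical stand-in for the model response: a weighted sum over inserted regions.
weights = [0.1, 0.7, 0.3, 0.9, 0.2]
toy_score = lambda regions: sum(weights[r] for r in regions)

order, evals = greedy_ordering(len(weights), toy_score)
```

With five regions the loop performs 5 + 4 + 3 + 2 + 1 = 15 evaluations, which is the growth pattern a linear-cost method aims to avoid.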

2606.18005 2026-06-17 cs.AI econ.GN New submission

LLM Consumer Behavior Theory: Foundations of a Novel Research Field

Manon Reusens, Sofie Goethals, David Martens

Affiliations: Department of Engineering Management, University of Antwerp

AI summary: Proposes LLM Consumer Behavior Theory, which studies how LLM agents behave when making consumption decisions on behalf of humans in markets. It integrates economics with natural language processing and examines preference expression, market aggregation, and where rationality assumptions break down.

Abstract

Large language models (LLMs) are increasingly deployed as autonomous agents that make consumption decisions on behalf of users. This shift raises fundamental questions for consumer theory, which has traditionally modeled humans as the primary decision-makers. In this paper, we introduce LLM Consumer Behavior Theory, a new field of study concerned with analyzing consumer behavior in agentic markets. Drawing on classical and behavioral economics alongside recent advances in Natural Language Processing, we formalize how human preferences are reflected and acted upon by LLM-based agents, and how agent-level decisions aggregate into market demand. We unify previously fragmented literature on LLM decision-making, human behavior simulation, and preference elicitation under a common economic lens, highlighting where assumptions, such as rationality and heterogeneity, may fail in agentic markets. Rather than providing empirical validation, this paper outlines the scope of LLM consumer behavior and identifies open research questions related to alignment, preference representation, and market dynamics.

2606.18003 2026-06-17 cs.LG cs.AI New submission

C2FL: Clustered Continual Federated Learning under Spatial and Temporal Drift

Davide Domini, Gianluca Aguzzi, Lorenzo Pellegrini, Mirko Viroli, Lukas Esterle

Affiliations: University of Bologna; Aarhus University

AI summary: Addresses privacy-preserving collective adaptation of nodes under spatial heterogeneity and temporal drift. Proposes C2FL, in which nodes self-organize into learning groups through spatial clustering and combine experience replay with dwell-time-aware adaptive averaging to achieve robust collective adaptation.

Abstract

Collective Adaptive Systems (CAS) increasingly rely on machine learning to let each node learn from locally sensed data, aligning its behavior with the surrounding environment. Scaling this intelligence, however, raises fundamental challenges: sensed data is often privacy-sensitive, preventing centralized collection; nodes are mobile, traversing regions where nearby nodes perceive similar phenomena while distant ones observe radically different conditions, creating natural spatial clusters; and these distributions evolve over time due to mobility, introducing temporal drift that makes local models progressively stale. These dynamics arise across domains - vehicular sensing, drone-based monitoring, smartphone crowdsensing - yet the interplay of privacy, spatial heterogeneity, and temporal drift severely undermines conventional learning strategies. Therefore, we propose C2FL, a fully distributed Federated Learning (FL) approach where nodes self-organize into learning groups through spatial clustering, reflecting the geographic structure of the environment. To counteract temporal drift, each node combines experience replay with a dwell-time-aware adaptive averaging step, progressively incorporating the regional consensus as it remains longer within the same area, while preserving previously acquired knowledge under evolving distributions. We evaluate our approach on synthetic experiments that systematically reproduce spatial and temporal shifts, showing that standard federated strategies degrade significantly under these conditions and that our method restores robust collective adaptation.
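
The dwell-time-aware averaging step can be pictured as a blend whose weight on the regional consensus grows the longer a node stays in one area. The abstract does not give the schedule, so the exponential form below, the `time_constant` parameter, and both function names are assumptions chosen purely for illustration.

```python
import math


def consensus_weight(dwell_time, time_constant=5.0):
    """Assumed schedule: weight on the regional model rises from 0 toward 1 with dwell time."""
    return 1.0 - math.exp(-dwell_time / time_constant)


def adaptive_average(local_params, regional_params, dwell_time, time_constant=5.0):
    """Blend local parameters with the regional consensus according to dwell time."""
    alpha = consensus_weight(dwell_time, time_constant)
    return [(1.0 - alpha) * lp + alpha * rp
            for lp, rp in zip(local_params, regional_params)]


local, regional = [0.0, 2.0], [1.0, 0.0]
just_arrived = adaptive_average(local, regional, dwell_time=0.0)   # keeps the local model
long_stay = adaptive_average(local, regional, dwell_time=50.0)     # close to the consensus
```

Under this assumed schedule a node that has just entered a region keeps its own parameters, while one that has stayed a long time ends up close to the regional consensus.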

2606.18001 2026-06-17 cs.LG New submission

Half a Link can Be Enough to Predict a Whole Link: Understanding Generalization in Knowledge Graph Foundation Models

Cosimo Gregucci, Obaidah Theeb, Daniel Hernandez, Antonio Vergari, Steffen Staab

Affiliations: Institute for AI, University of Stuttgart; University of Southampton; University of Edinburgh

AI summary: Analyzes zero-shot generalization of knowledge graph foundation models on unseen graphs, finds that models exploit partially seen "half-links" when predicting, and derives a four-scenario taxonomy that exposes how existing models generalize and where they can improve.

Abstract

Knowledge graph (KG) foundation models (KGFMs) are zero-shot generalizers: trained once, they can predict links on unseen graphs without retraining. However, understanding when and how they can robustly generalize across KGs is still an open question. In this paper, we shed some light on their generalization mechanisms, highlighting that their performance on unseen KGs is not uniform when it comes to partially seen links, which we call half-links. In fact, we show that to predict a test triple $(h,r,t)$, it might suffice in practice to have observed the half-link $(h,r)$ or $(r,t)$ in the inference graph. This yields a taxonomy of four scenarios, according to which combinations of these half-links are observed. In a rigorous stratified analysis over these scenarios, we reveal that SoTA KGFMs use seen half-links for predictions, while unseen half-links pose different challenges. As such, our finer-grained taxonomy can serve as a diagnostic protocol for robust KGFM generalization and highlights where novel KGFMs can improve.
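
The four-scenario taxonomy can be made concrete with a small classifier that checks which half-links of a test triple occur in the inference graph. This is a minimal sketch: the function name, the scenario labels, and the example graph are hypothetical and are not taken from the paper.

```python
def half_link_scenario(test_triple, inference_triples):
    """Classify a test triple by which of its half-links appear in the inference graph."""
    head, relation, tail = test_triple
    head_side = any(h == head and r == relation for h, r, _ in inference_triples)
    tail_side = any(r == relation and t == tail for _, r, t in inference_triples)
    return {
        (True, True): "both half-links seen",
        (True, False): "only (h, r) seen",
        (False, True): "only (r, t) seen",
        (False, False): "no half-link seen",
    }[(head_side, tail_side)]


# Hypothetical inference graph used only to exercise the four cases.
inference_graph = [
    ("alice", "works_at", "acme"),
    ("dave", "works_at", "globex"),
    ("bob", "lives_in", "paris"),
]

both = half_link_scenario(("alice", "works_at", "globex"), inference_graph)
head_only = half_link_scenario(("alice", "works_at", "initech"), inference_graph)
tail_only = half_link_scenario(("carol", "lives_in", "paris"), inference_graph)
neither = half_link_scenario(("carol", "knows", "dave"), inference_graph)
```

Stratifying evaluation metrics by this label is one way to read the diagnostic protocol the abstract proposes.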

2606.17999 2026-06-17 cs.CL New submission

VoidPadding: Let [VOID] Handle Padding in Masked Diffusion Language Models so that [EOS] Can Focus on Semantic Termination

Chunyu Liu, Zhengyang Fan, Kaisen Yang, Alex Lamb

Affiliations: Tsinghua University

AI summary: Proposes VoidPadding, which introduces a [VOID] token for padding and frees [EOS] from its dual role, enabling early stopping and adaptive response expansion under large-block decoding. On mathematical reasoning and code generation tasks it improves the average score by 17.84 points and reduces NFE by 55.7%.

Abstract

MDLMs generate text by denoising a preallocated masked response canvas, making response-length modeling central to instruction tuning. Existing MDLMs often inherit the autoregressive convention of using repeated [EOS] tokens for padding during instruction tuning, giving [EOS] a dual role as both a semantic terminator and a padding token. We show that this dual role is a root cause of [EOS] overflow under large-block decoding. To decouple these roles, we propose VoidPadding, which introduces [VOID] for padding and reserves [EOS] for termination. During inference, the learned [EOS] signal enables early stopping, while the learned [VOID] signal guides adaptive response canvas expansion. On Dream-7B-Instruct, VoidPadding improves the block-size-averaged four-task mean across mathematical reasoning and code generation benchmarks by +17.84 points over the original model and +6.95 points over RainbowPadding, while reducing decoding NFE by 55.7% on average. Code is available at this https URL.
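
The difference between the two padding conventions can be shown on a single training target. The sketch below is an illustration only: the helper names are hypothetical, tokens are plain strings, and the layout of exactly one [EOS] followed by [VOID] fill is an assumption about how the decoupled roles might look, not a detail confirmed by the abstract.

```python
EOS, VOID = "[EOS]", "[VOID]"


def pad_eos_only(response_tokens, canvas_length):
    """Inherited convention: [EOS] serves as both terminator and padding."""
    return response_tokens + [EOS] * (canvas_length - len(response_tokens))


def pad_with_void(response_tokens, canvas_length):
    """Decoupled roles (assumed layout): one [EOS] terminates, [VOID] fills the rest."""
    padded = response_tokens + [EOS]
    return padded + [VOID] * (canvas_length - len(padded))


# Assumes the response is shorter than the canvas.
response = ["The", "answer", "is", "42"]
baseline_target = pad_eos_only(response, canvas_length=8)
void_target = pad_with_void(response, canvas_length=8)
```

In the first target [EOS] dominates the tail of the canvas, which is the dual role the abstract links to [EOS] overflow; in the second it appears once, at the point of semantic termination.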

2606.17998 2026-06-17 cs.CV New submission

AIGS-Net: Compact Illumination Field Modeling via 2D Gaussian Splatting for Fast Low-Light Image Enhancement

Yuhan Chen, Kunyang Huang, Fuchen Li, Zhuohan Qin, Guofa Li, Wenbo Chu, Keqiang Li

Affiliations: College of Mechanical and Vehicle Engineering, Chongqing University; Department of Electrical and Computer Engineering, Carnegie Mellon University; Herbert Wertheim College of Engineering, University of Florida; School of Mathematics and Statistics, Qingdao University; National Innovation Center of Intelligent and Connected Vehicles; School of Vehicle and Mobility, Tsinghua University

AI summary: Proposes AIGS-Net, which performs low-light image enhancement with roughly 40 learnable parameters by combining an input-adaptive 2D Gaussian Splatting illumination field with zero-parameter multiscale contextual encoding, balancing enhancement quality and inference efficiency on the LOL and LSRW benchmarks.

Abstract

Existing low-light image enhancement methods often face a bottleneck between the representation capacity of illumination-field modeling and computational complexity. To address this issue, this paper proposes an Adaptive Illumination Gaussian Splatting Network (AIGS-Net), an ultra-lightweight architecture for fast low-light enhancement. Unlike conventional static priors, AIGS-Net constructs an input-adaptive 2D Gaussian Splatting illumination field. The opacity of Gaussian basis functions is dynamically modulated by relative luminance statistics of the input image, and spatially varying illumination compensation is rendered through ordered alpha compositing. To guide adaptive illumination compensation efficiently, a zero-parameter nonlinear multiscale contextual encoding module is introduced to extract low-frequency structures and local contrast cues without additional convolutional weights. To suppress noise amplification and sensor-induced color bias, AIGS-Net integrates noise-mask estimation, locked single-channel Gamma mapping, cross-channel consistency regularization, and target color-alignment constraints. Experiments on LOL and LSRW benchmarks show that AIGS-Net improves detail recovery and color fidelity while requiring only approximately 40 learnable parameters, achieving an effective trade-off between enhancement quality and extreme inference efficiency.

2606.17996 2026-06-17 cs.LG cs.AI New submission

Multiple cyclicity and Wavelet Decomposition with Channel Correlation for Long-term Time Series Forecasting

Bin Wang, Heming Yang, Jinfang Sheng

Affiliations: School of Computer Science and Engineering, Central South University

AI summary: Proposes the McWC model, which builds multi-layer cyclicity, extracts inter-channel correlations with a multi-layer perceptron, fuses high- and low-frequency information through multi-level wavelet decomposition, and decouples intra-channel autocorrelations in the frequency domain for efficient and accurate long-term forecasting.

Abstract

Cyclicity and trend are important components of time series data, and many studies based on them have achieved good results in long-term time series forecasting. However, we believe that current work neglects the influence of real-world inter-channel correlations in time series data, which leads to suboptimal predictions. Furthermore, these models rely on complex designs to capture diverse information, which results in low computational efficiency. To address these challenges, we propose McWC, a long-term time series forecasting model that separately models the cyclicity, trend, and inter-channel correlations. Specifically, McWC first decouples cyclical information from data using a multi-layer cyclicity construction module. Then, it extracts inter-channel correlations using a multi-layer perceptron. Next, it models and fuses the multi-layer high-frequency and low-frequency information from data using a multi-level wavelet decomposition module. Finally, it aggregates the results of different components to obtain the output. Simultaneously, we decouple intra-channel autocorrelations by calculating a loss function in the frequency domain. Experiments on six real-world datasets demonstrate that McWC achieves state-of-the-art performance, exhibiting excellent computational efficiency and historical information extraction capabilities.
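
A multi-level wavelet decomposition repeatedly splits a series into a low-frequency band and a high-frequency band, then recurses on the low-frequency part. The abstract does not name the wavelet, so the sketch below assumes the Haar wavelet for simplicity; the function names are hypothetical and the input length is assumed to be divisible by 2 raised to the number of levels.

```python
import math


def haar_step(signal):
    """One Haar level: pairwise scaled sums (low band) and differences (high band)."""
    scale = 1.0 / math.sqrt(2.0)
    low = [(signal[i] + signal[i + 1]) * scale for i in range(0, len(signal), 2)]
    high = [(signal[i] - signal[i + 1]) * scale for i in range(0, len(signal), 2)]
    return low, high


def multilevel_haar(signal, levels):
    """Repeatedly split the low band; returns (final_low, [high_1, ..., high_levels])."""
    highs, low = [], list(signal)
    for _ in range(levels):
        low, high = haar_step(low)
        highs.append(high)
    return low, highs


series = [4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0]
trend, details = multilevel_haar(series, levels=2)
```

Because the Haar transform is orthonormal, the energy of `series` equals the combined energy of `trend` and all `details` bands, so nothing is lost before the bands are modeled and fused.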

2606.17989 2026-06-17 cs.CV cs.AI New submission

Recover Semantics First, Generate Better: Improved Latent Modeling for 3D MRI Reconstruction and Cross-Contrast Synthesis

Yonghao Chen, Sicheng Yang, Rui Tang, Lei Zhu

Affiliations: The Hong Kong University of Science and Technology (Guangzhou); Xi’an Jiaotong University

AI summary: Proposes a semantics-first latent modeling framework that uses a Latent Harmonization Encoder, a Semantic Recovery Block, and an Anatomy-aware Frequency Loss to address poor long-range anatomical coherence, semantic degradation, and over-smoothed reconstruction in 3D MRI compression, improving reconstruction and cross-contrast synthesis quality.

Comments Code: this https URL (https://github.com/script-Yang/RSF)

Abstract

Multi-contrast magnetic resonance imaging (MRI) provides complementary information for clinical diagnosis. However, acquiring all MRI sequences is often time-consuming and costly. Recent generative models perform cross-contrast synthesis to address this issue by inferring absent contrasts from the available ones. Nevertheless, synthesizing 3D MRI presents significant challenges. Due to the massive volume sizes, operating directly in the pixel space is computationally prohibitive; therefore, a common approach is to first compress the 3D volumes into a latent space and subsequently train generative models in that space. We observe that existing compression architectures face several critical issues: they under-preserve long-range anatomical coherence, discard clinically meaningful semantics, and rely on optimization objectives that lead to over-smoothed reconstructions. Ultimately, these shortcomings compromise the performance of subsequent generative models. In this work, we propose a semantics-first latent modeling framework for 3D MRI reconstruction and cross-contrast synthesis. Specifically, we introduce a Latent Harmonization Encoder (LHE) to capture global anatomical dependencies, ensuring coherent volumetric representations. To mitigate semantic degradation during latent compression, we further design a Semantic Recovery Block (SRB) that injects high-level priors from a self-supervised semantic teacher, enhancing contrast-aware separability in the latent space. Additionally, we propose an Anatomy-aware Frequency Loss (AFL) to adaptively preserve diagnostically relevant high-frequency structures. Extensive experiments on two public multi-contrast MRI datasets demonstrate consistent improvements in reconstruction fidelity and cross-contrast synthesis quality. Our code is available at this https URL.

2606.17985 2026-06-17 cs.CV New submission

Gaussian Light Field Splatting: A Physical Prior-Driven Vision Transformer for Unsupervised Low-Light Image Enhancement

Yuhan Chen, Wenxuan Yu, Guofa Li, Fuchen Li, Kunyang Huang, Yicui Shi, Ying Fang, Wenbo Chu, Keqiang Li

Affiliations: College of Mechanical and Vehicle Engineering, Chongqing University; Herbert Wertheim College of Engineering, University of Florida; Department of Electrical and Computer Engineering, Carnegie Mellon University; National Innovation Center of Intelligent and Connected Vehicles; School of Vehicle and Mobility, Tsinghua University

AI summary: Proposes GLFS, which brings the continuous physical illumination modeling of Gaussian splatting into a Transformer: scene illumination is represented with anisotropic Gaussian functions, and physics-guided biases are introduced into self-attention. Together with a color-vector angular loss and a luminance-edge loss, it balances exposure and corrects color under non-uniform illumination and reports state-of-the-art performance.

Abstract

Existing unsupervised low-light image enhancement methods often encounter local exposure imbalance and color distortion under complex non-uniform illumination. In addition, most Vision Transformers lack an explicit mechanism for modeling the physical priors of illumination degradation. To address these limitations, we propose GLFS, a Gaussian light field splatting-based Vision Transformer that integrates continuous physical illumination modeling from Gaussian splatting into the Transformer architecture. In GLFS, scene illumination is represented by a superposition of anisotropic Gaussian basis functions. Physics-guided biases are introduced into self-attention to adaptively infer a spatial gain field, enabling accurate and uniform restoration under complex illumination. To reduce color bias and structural degradation during enhancement, a color-vector angular loss and a luminance-edge loss are further developed. These losses enforce hue consistency and improve the structural fidelity of local details. Extensive ablation studies and quantitative evaluations show that GLFS provides clear advantages in illumination correction and detail preservation. It achieves state-of-the-art performance and offers a new representation paradigm for low-light image enhancement.
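
A color-vector angular loss penalizes the angle between RGB vectors, which stays at zero when only brightness changes and grows when hue shifts. The abstract does not spell out the formulation or which image pair is compared, so the sketch below is one plausible reading: `color_angle` and `angular_loss` are hypothetical names, and comparing two images pixel by pixel is an assumption.

```python
import math


def color_angle(rgb_a, rgb_b, eps=1e-8):
    """Angle in radians between two RGB vectors; near 0 when only brightness differs."""
    dot = sum(a * b for a, b in zip(rgb_a, rgb_b))
    norm_a = math.sqrt(sum(a * a for a in rgb_a))
    norm_b = math.sqrt(sum(b * b for b in rgb_b))
    cosine = dot / (norm_a * norm_b + eps)  # eps guards against black pixels
    return math.acos(max(-1.0, min(1.0, cosine)))


def angular_loss(pixels_a, pixels_b):
    """Mean per-pixel angle between two images given as lists of RGB triples."""
    return sum(color_angle(a, b) for a, b in zip(pixels_a, pixels_b)) / len(pixels_a)


dark_pixel, bright_pixel = (0.2, 0.1, 0.05), (0.8, 0.4, 0.2)  # same hue, 4x brighter
same_hue = color_angle(dark_pixel, bright_pixel)
shifted_hue = color_angle((1.0, 0.0, 0.0), (0.0, 1.0, 0.0))   # red versus green
```

A loss of this shape leaves brightness correction unpenalized and discourages color casts, which matches the hue-consistency goal stated in the abstract.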

2606.17982 2026-06-17 cs.RO New submission

LAGO Policy: Latency-Aware Asynchronous Diffusion Policies with Goal-Directed Collision-Free Planning for Smooth Manipulation

Guowei Shi, Xupeng Xie, Yiming Luo, Jian Guo, Jun Ma, Boyu Zhou

Affiliations: The Hong Kong University of Science and Technology (Guangzhou); International Digital Economy Academy; The University of Hong Kong; Southern University of Science and Technology

AI summary: Proposes LAGO Policy, which uses latency-aware conditional guidance and spatial-temporal trajectory optimization to address the discontinuities and collisions of asynchronous diffusion policies, achieving smooth and safe manipulation.

Comments 8 pages, 8 figures

Abstract

Diffusion-based visuomotor policies deployed with asynchronous inference often exhibit inter-chunk discontinuities and lack explicit mechanisms for obstacle-aware execution, leading to jerky motions and collisions that hinder reliable manipulation in real-world scenes. To address these issues, we propose LAGO Policy, a unified asynchronous action-generation framework that integrates trajectory optimization with diffusion policy for smooth and safe execution. LAGO Policy improves inter-chunk consistency via latency-aware classifier-free guidance conditioning on future actions. It further enables goal-directed collision-free trajectory planning by predicting a task-relevant interaction goal from demonstrations. Finally, spatial-temporal trajectory optimization refines the actions to be executed for low-jerk and feasible motion. Extensive real-world experiments demonstrate that LAGO Policy achieves smooth collision-free execution with high task success across challenging manipulation tasks. Project Website: this https URL

2606.17979 2026-06-17 cs.AI New submission

STAR: SpatioTemporal Adaptive Reward Allocation for Text-to-Image RL Post-Training

Jinjie Shen, Wei Deng, Xian Hu, Daiguo Zhou, Jian Luan

AI summary: Addresses the granularity mismatch between rewards and generative trajectories in text-to-image generation. Proposes STAR, which uses text-image attention to build spatiotemporally adaptive allocation maps and applies stronger policy updates to the relevant latent regions, improving semantic alignment and text rendering.

Abstract

Existing RL post-training methods for text-to-image generation usually convert the final-image reward into a single scalar advantage and apply it with the same strength to the entire generative trajectory. However, text-to-image generation naturally has temporal and spatial structure: different denoising steps are responsible for different generation stages, and the content that truly determines text alignment often appears only in part of the image. This granularity mismatch makes it difficult for policy updates to focus on the generative components that actually affect the reward. To address this issue, we propose SpatioTemporal Adaptive Reward (STAR) Allocation for RL post-training of text-to-image diffusion and flow models. STAR uses text-image attention inside the generative model and starts from the core content that the user truly cares about in the prompt. It constructs spatial allocation maps that dynamically vary across denoising steps and rollouts, and allocates the same group-relative advantage to more relevant latent regions with almost no additional computational overhead. STAR then applies stronger policy updates to these regions through a spatially resolved policy objective. We use Stable Diffusion 3.5 Medium as the base model and evaluate on three tasks: GenEval, OCR text rendering, and PickScore. Experimental results show that STAR improves compositional semantic alignment, text rendering, and preference optimization without changing the external reward source, achieving 0.9759, 0.9757, and 23.60 on GenEval, OCR, and PickScore, respectively.

2606.17978 2026-06-17 cs.AI New submission

MoCo-AIS: A Contrastive Learning Framework for Similarity Computation of Vessel Trajectories

Ruixin Song, Md Mahbub Alam, Zahra Sadeghi, Amilcar Soares, José F. Rodrigues-Jr, Gabriel Spadon

Affiliations: Dalhousie University; Linnaeus University; University of Sao Paulo

AI summary: Proposes MoCo-AIS, a unified contrastive learning framework based on Momentum Contrast (MoCo) that learns embeddings from positive and negative trajectory pairs. It evaluates several deep learning models on real-world AIS datasets and significantly improves trajectory similarity learning.

Comments Under review at SIGSPATIAL'26

Abstract

Trajectory similarity is a fundamental task in analyzing mobility patterns, essential for applications such as route pattern extraction, mobility prediction, and anomaly detection. Traditional distance-based measures for computing similarity incur high computational cost, driving the adoption of lightweight learning-based approaches. Supervised methods rely on extensive labels derived from traditional distance measures and often reproduce these metrics, which limits generalization. While self-supervised learning addresses this issue through contrastive learning, it lacks a unified framework, making it difficult to compare deep learning (DL) models for consistent trajectory representation. Accordingly, this paper presents MoCo-AIS, a unified framework for learning vessel trajectory embeddings based on the Momentum Contrast (MoCo) paradigm, which formulates similarity learning through positive and negative trajectory pairs. Within this framework, we evaluate a diverse set of leading DL models on large-scale, real-world vessel-tracking AIS datasets that capture diverse navigation behaviors and operating conditions. Results demonstrate that our framework significantly improves similarity learning over existing baselines, while providing a benchmarking platform for evaluating trajectory representation models.
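
The Momentum Contrast paradigm named here rests on two pieces: a key encoder that tracks the query encoder through a slow moving average, and a contrastive (InfoNCE) loss that asks the query to pick its positive key out of a set of negatives. The sketch below shows both on plain lists of floats. It is a generic MoCo illustration with hypothetical names and toy two-dimensional embeddings, not the MoCo-AIS code.

```python
import math


def momentum_update(key_params, query_params, momentum=0.999):
    """MoCo-style update: the key encoder slowly tracks the query encoder."""
    return [momentum * k + (1.0 - momentum) * q
            for k, q in zip(key_params, query_params)]


def info_nce(query, positive_key, negative_keys, temperature=0.07):
    """Cross-entropy of selecting the positive key among the positive and all negatives."""
    dot = lambda u, v: sum(a * b for a, b in zip(u, v))
    logits = [dot(query, positive_key) / temperature]
    logits += [dot(query, k) / temperature for k in negative_keys]
    max_logit = max(logits)  # subtract the max for numerical stability
    log_denominator = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    return log_denominator - logits[0]


# Toy unit-length embeddings standing in for trajectory representations.
query = [1.0, 0.0]
aligned_loss = info_nce(query, [1.0, 0.0], [[0.0, 1.0], [-1.0, 0.0]])
misaligned_loss = info_nce(query, [0.0, 1.0], [[1.0, 0.0]])
updated_key = momentum_update([1.0], [0.0], momentum=0.9)
```

The loss is small when the query matches its positive and large when a negative is the closer match, which is the signal that shapes the trajectory embeddings.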

2606.17973 2026-06-17 cs.CL 新提交

Fine-tuning LLMs for Passive Depression Severity Estimation from AI Mental Health Dialogue

基于AI心理健康对话的被动抑郁严重程度评估:微调大语言模型

Olivier Tieleman, Ziyi Zhu, Ting Su, Samuel J. Bell, Thomas D. Hull, Caitlin A. Stamatis

发表机构 * GitHub

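AI Summary Fine-tunes a Qwen3.5-27B model on data augmented with pseudolabels to predict PHQ-9 total scores from AI mental health dialogue, enabling passive depression monitoring without self-report questionnaires.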

Comments 12 pages, 1 figure

详情
AI中文摘要

抑郁症是全球致残的主要原因,早期发现症状变化对及时干预至关重要。患者健康问卷-9(PHQ-9)等经过验证的工具支持大规模症状监测,但现实世界中的完成率较低,导致应答偏倚和系统性缺失。从常规生成数据中推断严重程度的被动方法可以弥补这一差距。我们通过直接从用户与AI心理健康应用的对话记录中预测PHQ-9总分来解决这一问题,仅需对话文本,无需额外的临床数据。我们微调了带有回归头的Qwen3.5-27B骨干模型,用推理模型(Claude Opus)生成的伪标签和迭代训练的中间模型扩充了3,111个真实标签,最终得到包含6,283个用户的数据集。在包含842个用户的保留测试集上,我们的最佳模型在PHQ-9 >= 10的临床阈值下实现了MAE = 2.6、RMSE = 4.0、Pearson r = 0.80、AUC = 0.91。我们还发现,从PHQ-9 >= 3到PHQ-9 >= 24的每个严重程度阈值下,AUC均大于0.87,表明该模型能够捕捉整个临床谱系中的抑郁严重程度。这项工作为AI心理健康平台中无需用户完成自评测量的被动、连续症状监测打开了大门。

英文摘要

Depression is the leading cause of disability worldwide, and early detection of symptom change is essential for timely intervention. Validated instruments such as the Patient Health Questionnaire-9 (PHQ-9) support symptom monitoring at scale, but real-world completion rates are low, introducing response bias and systematic missingness. Passive approaches that infer severity from routinely generated data could close this gap. We address this by predicting PHQ-9 total scores directly from transcripts of conversations between users and an AI mental health application, requiring only conversation text and no additional clinical data. We fine-tune a Qwen3.5-27B backbone with a regression head, augment 3,111 ground-truth labels with pseudolabels generated by a reasoning model (Claude Opus) and iteratively trained intermediate models, for a combined dataset of 6,283 users. On a held-out test set of 842 users, our best model achieves MAE = 2.6, RMSE = 4.0, Pearson r = 0.80, and AUC = 0.91 at the PHQ-9 >= 10 clinical threshold. We also find AUC > 0.87 at every severity threshold from PHQ-9 >= 3 to PHQ-9 >= 24, demonstrating that the model captures depression severity across the full clinical spectrum. This work opens the door to passive, continuous symptom monitoring in AI mental health platforms, without requiring users to complete self-report measures.
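
The reported metrics can be computed on any set of true and predicted PHQ-9 totals. The sketch below, in plain Python, shows one way to do so; the function names and any scores passed in are illustrative assumptions, and the AUC uses the pairwise (Mann-Whitney) formulation with the predicted score as the ranking signal.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error between true and predicted scores."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error between true and predicted scores."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def auc_at_threshold(y_true, y_pred, threshold=10):
    """AUC for separating users with true score >= threshold from the rest.

    Counts, over all (positive, negative) pairs, how often the positive user
    receives the higher predicted score; ties count as half.
    """
    pos = [p for t, p in zip(y_true, y_pred) if t >= threshold]
    neg = [p for t, p in zip(y_true, y_pred) if t < threshold]
    if not pos or not neg:
        raise ValueError("both classes must be present at this threshold")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Sweeping `threshold` over a range of cut-offs gives a per-threshold AUC profile of the kind the abstract describes for PHQ-9 >= 3 through PHQ-9 >= 24.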

2606.17972 2026-06-17 cs.CV cs.AI 新提交

SegDINO: Introducing Multi-Scale Structure into DINO for Efficient Medical Image Segmentation

SegDINO: 将多尺度结构引入DINO以实现高效医学图像分割

Sicheng Yang, Hongqiu Wang, Zhaohu Xing, Sixiang Chen, Qiuxia Yang, Yize Mao, Guang Yang, Lei Zhu

发表机构 * The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) Sun Yat-sen University Cancer Center(中山大学肿瘤防治中心) Imperial College London(帝国理工学院)

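AI Summary Proposes the SegDINO framework, which introduces multi-scale structure into DINO through Token Pyramid Adaptation and Scale-Aware Decoding, achieving state-of-the-art medical image segmentation results while remaining efficient.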

Comments Code: this https URL (https://github.com/script-Yang/segdino_v2)

详情
AI中文摘要

自监督DINO模型提供了强大的可迁移视觉表示,但直接应用于图像分割仍具挑战。现有方法通常依赖带有复杂上采样的重型解码器,引入大量参数和计算开销。我们观察到,向DINO特征引入尺度远比增加解码器容量更为关键。本文提出SegDINO,一种高效分割框架,将DINOv3骨干网络与轻量级尺度建模相结合。SegDINO引入令牌金字塔适应(TPA)将中间DINO特征重组为伪多尺度层次,以及尺度感知解码(SAD)实现高效的尺度内细化和自顶向下的多尺度传播。我们进一步整理了PanCT,一个包含284名患者专家标注胰腺肿瘤的新CT数据集,以评估SegDINO处理困难小病灶的能力。在PanCT和三个公共基准上的大量实验表明,SegDINO以高效率实现了最先进的结果。代码见此https链接。

英文摘要

Self-supervised DINO models provide strong transferable visual representations, yet applying them directly to image segmentation remains challenging. Existing approaches commonly rely on heavy decoders with complex upsampling, introducing substantial parameter and computational overhead. We observe that introducing scale into DINO features is far more critical than increasing decoder capacity. In this work, we present SegDINO, an efficient segmentation framework that integrates a DINOv3 backbone with lightweight scale modeling. SegDINO introduces Token Pyramid Adaptation (TPA) to reorganize intermediate DINO features into a pseudo multi-scale hierarchy, and Scale-Aware Decoding (SAD) for efficient intra-scale refinement and top-down multi-scale propagation. We further curate PanCT, a new CT dataset containing 284 patients with expert-annotated pancreatic tumors, to assess SegDINO's ability to handle difficult small-lesion cases. Extensive experiments on PanCT and three public benchmarks demonstrate that SegDINO achieves state-of-the-art results with high efficiency. The code is available at this https URL.

2606.17967 2026-06-17 cs.CL 新提交

Learning task-specific subspaces via interventional post-training of speech foundation models

通过语音基础模型的干预后训练学习任务特定子空间

Jack Cox, Jon Barker

发表机构 * University of Sheffield(谢菲尔德大学)

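AI Summary Proposes an interventional contrastive learning post-training method that maps the entangled representations of speech foundation models into separate content and speaker subspaces, evaluated on speaker verification and keyword spotting, with improved out-of-domain speaker verification.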

Comments Accepted to Interspeech 2026; 6 pages (4 main body), 2 figures

详情
AI中文摘要

语音基础模型在大量未标注语音数据上预训练,产生跨任务通用的表示。然而,这些表示以分布式方式编码显著语音变量的信息,而下游语音任务仅依赖其中部分变量。本文提出一种使用干预对比学习的后训练精炼方法。通过利用干预数据集和多部分对比损失,我们学习从语音基础模型的纠缠表示空间到分离的内容和说话人子空间的变换。我们在说话人验证和关键词识别任务上评估学习到的表示,显示出改进的域外说话人验证性能,并证明说话人和内容信息在学习的子空间中被分离。

英文摘要

Speech foundation models, pre-trained on large corpora of unlabelled speech data, produce general-purpose representations which are useful across tasks. However, these representations encode information about salient speech variables in a distributed manner, while downstream speech tasks rely on only some of this variability. In this work, we propose a post-training refinement approach using interventional contrastive learning. By leveraging an interventional dataset and multi-part contrastive loss, we learn a transformation from the entangled representation space of speech foundation models into separate content and speaker subspaces. We evaluate the learnt representations on speaker verification and keyword spotting tasks, showing improved out-of-domain speaker verification performance and evidence that speaker and content information are separated across the learned subspaces.

2606.17966 2026-06-17 cs.CV 新提交

Reload-Mamba: Hierarchical Anti-Dilution State-Space Modeling for Multi-Class Semantic Segmentation

Reload-Mamba:用于多类语义分割的分层抗稀释状态空间建模

Sheng-Wei Chan, Hsin-Jui Pan, Jen-Shiun Chiang

发表机构 * Department of Electrical and Computer Engineering, Tamkang University(淡江大学电机与计算机工程系)

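AI Summary Proposes the Reload-Mamba framework, which addresses the response dilution caused by Mamba state-space propagation through a boundary-supervised local detail prior, a class-uncertainty-aware Reload Gate, and a hierarchical multi-level Reload mechanism, reaching 47.9% single-scale mIoU on ADE20K, 83.2% on Cityscapes, and 87.8% on PASCAL VOC 2012 val.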

Comments 23 pages, 4 figures, 17 tables. Code will be released soon

详情
AI中文摘要

基于Mamba的状态空间模型为高分辨率密集预测提供了线性时间的长程建模能力,但顺序状态空间传播会削弱多类语义分割中关键的边界敏感和细节敏感响应。我们提出Reload-Mamba,一种语义分割框架,通过三个分割特定设计解决这种传播导致的响应稀释问题:(i) 边界监督的局部细节先验,使用真实边界掩码显式训练,以识别需要响应恢复的区域;(ii) 类不确定性感知的Reload门控,将来自预重载辅助头的逐像素类熵作为额外的门控信号,该公式仅在多类密集预测下提供信息;(iii) 分层多级Reload机制,在三个解码器级别应用抗稀释细化,并自上而下融合恢复的表示。基于ConvNeXt-Tiny编码器、多尺度解码器和具有像素级方向注意力的四方向Mamba扫描,Reload-Mamba在ADE20K上达到47.9%单尺度(48.9%多尺度)mIoU,在Cityscapes上达到83.2%单尺度mIoU。在标准DeepLab风格协议下使用ResNet-101 + COCO预训练,Reload-Mamba在PASCAL VOC 2012 val上达到87.8% mIoU。控制消融实验表明,三个分割特定设计各自贡献了超出直接移植先前为二值化提出的抗稀释架构的性能,在ADE20K上相比直接移植基线累积提升了+2.2 mIoU。

英文摘要

Mamba-based state space models offer linear-time long-range modeling for high-resolution dense prediction, but sequential state-space propagation can attenuate boundary-sensitive and detail-sensitive responses that are critical in multi-class semantic segmentation. We propose Reload-Mamba, a semantic segmentation framework that addresses this propagation-induced response dilution through three segmentation-specific designs: (i) a boundary-supervised local detail prior that is explicitly trained with ground-truth boundary masks to identify regions requiring response restoration; (ii) a class-uncertainty-aware Reload Gate that incorporates per-pixel class entropy from a pre-reload auxiliary head as an additional gating signal, a formulation that is informative only under multi-class dense prediction; and (iii) a hierarchical multi-level Reload mechanism that applies anti-dilution refinement at three decoder levels and fuses the restored representations top-down. Built upon a ConvNeXt-Tiny encoder with a multi-scale decoder and four-directional Mamba scanning with pixel-wise directional attention, Reload-Mamba achieves 47.9% single-scale (48.9% multi-scale) mIoU on ADE20K and 83.2% single-scale mIoU on Cityscapes. With ResNet-101 + COCO pre-training under the standard DeepLab-style protocol, Reload-Mamba reaches 87.8% mIoU on PASCAL VOC 2012 val. Controlled ablations show that each of the three segmentation-specific designs contributes beyond a direct port of the prior anti-dilution architecture proposed for binarization, cumulatively improving over the direct-port baseline by +2.2 mIoU on ADE20K.
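
The per-pixel class entropy that the Reload Gate uses as a gating signal can be illustrated in plain Python. The sketch below only computes a normalized entropy map from per-pixel class logits; how that map enters the gate is not specified in the abstract, so the function names and the normalization by log(C) are assumptions made here for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one pixel's class logits."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def normalized_entropy(logits):
    """Class entropy of one pixel, scaled to [0, 1] by log(number of classes).

    Requires at least two classes; 1.0 means maximally uncertain.
    """
    probs = softmax(logits)
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return entropy / math.log(len(logits))


def entropy_map(logit_grid):
    """Apply normalized_entropy to every pixel of an H x W grid of logit lists."""
    return [[normalized_entropy(pixel) for pixel in row] for row in logit_grid]
```

Pixels near class boundaries tend to have flatter class distributions and therefore higher entropy, which is one plausible reason such a signal is useful for deciding where responses need restoring.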

2606.17961 2026-06-17 cs.CV cs.AI 新提交

Robustness of Similarity-based Positional Encoding Under Rotations: Theoretical Analysis and Experimental Validation

基于相似性的位置编码在旋转下的鲁棒性:理论分析与实验验证

Andrea Santomauro, Luigi Portinale, Giorgio Leonardi

发表机构 * Computer Science Institute, DiSIT, University of Piemonte Orientale, Alessandria, Italy(皮埃蒙特东方大学计算机科学研究所,DiSIT,亚历山德里亚,意大利)

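AI Summary Analyzes theoretically and validates experimentally the stability of similarity-based positional encoding (simPE) under rotational perturbations, proving bounded perturbation in Frobenius norm and showing that simPE outperforms standard learned positional encoding on four datasets.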

详情
AI中文摘要

位置编码是Transformer架构的基本组成部分,因为它注入了关于输入空间或序列排列的信息。在标准绝对位置编码和正弦编码的最新替代方案中,基于相似性的位置编码(simPE)已成为一种通过成对关系表示位置结构的灵活框架。simPE最初是为医学成像应用设计的,其中几何鲁棒性尤为重要:在图像采集过程中,由于成像仪器、患者定位或轻微的采集偏差,自然会产生小旋转。尽管具有经验上的前景,但simPE在几何扰动下的理论行为尚未完全表征。在本文中,我们研究了simPE对旋转的鲁棒性,结合了形式化的理论分析和实验验证。我们首先证明simPE通常不是旋转不变的。然后,我们证明,在基本分量的温和Lipschitz假设下,simPE在旋转扰动下是稳定的,并推导了Frobenius范数下的显式扰动界限。我们在四个受控数据集上实验验证了这些发现——一个合成Arrow数据集、一个合成Shapes数据集(四个几何形状类别)、一个合成Digits数据集和一个基准图像分类数据集(FashionMNIST)——其中训练和验证图像保持固定的规范方向,而测试图像则经受逐渐增大的旋转角度。在所有数据集中,simPE在旋转下的准确率、F1分数、精确率和召回率方面始终优于标准学习位置编码,特别是在小到中等角度范围内,这证实了理论稳定性保证。

英文摘要

Positional encoding is a fundamental component of Transformer architectures, as it injects information about the spatial or sequential arrangement of inputs. Among recent alternatives to standard absolute and sinusoidal encodings, similarity-based positional encoding (simPE) has emerged as a flexible framework for representing positional structure through pairwise relations. simPE was originally designed for medical imaging applications, where geometric robustness is especially relevant: small rotations naturally arise during image acquisition, induced by imaging instruments, patient positioning, or slight acquisition misalignments. Despite its empirical promise, the theoretical behavior of simPE under geometric perturbations has not been fully characterized. In this paper, we study the robustness of simPE with respect to rotations, combining formal theoretical analysis with experimental validation. We first show that simPE is generally not rotation-invariant. We then prove that, under mild Lipschitz assumptions on the elementary components, simPE is stable under rotational perturbations and derive explicit perturbation bounds in Frobenius norm. We validate these findings experimentally on four controlled datasets--a synthetic Arrow dataset, a synthetic Shapes dataset (four geometric shape categories), a synthetic Digits dataset, and a benchmark image classification dataset (FashionMNIST)--in which training and validation images are kept in a fixed canonical orientation while test images are subjected to increasing rotation angles. Across all datasets, simPE consistently outperforms standard learned positional encoding in terms of accuracy, F1 score, precision, and recall under rotation, particularly in the small-to-moderate angle regime, corroborating the theoretical stability guarantees.
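
To give a sense of what a Lipschitz-based rotation bound looks like, the derivation below treats a generic $L$-Lipschitz map $g$ applied row-wise to 2D positions $x_1, \dots, x_n$ (the rows of $X$), with $G(X)$ stacking the outputs $g(x_i)$ as rows and $R_\theta$ a planar rotation by angle $\theta$. This is an illustrative assumption, not the paper's theorem; the symbols $g$, $L$, $G$, and $X$ are introduced here and do not come from the abstract.

```latex
\begin{align*}
\|R_\theta x - x\|_2 &= 2\,\lvert\sin(\theta/2)\rvert\,\|x\|_2 \;\le\; \lvert\theta\rvert\,\|x\|_2, \\
\|g(R_\theta x) - g(x)\|_2 &\le L\,\|R_\theta x - x\|_2 \;\le\; L\,\lvert\theta\rvert\,\|x\|_2, \\
\|G(X R_\theta^\top) - G(X)\|_F
  &= \Big(\sum_{i=1}^{n} \|g(R_\theta x_i) - g(x_i)\|_2^2\Big)^{1/2}
  \;\le\; L\,\lvert\theta\rvert\,\|X\|_F .
\end{align*}
```

A bound of this shape shrinks linearly with the angle but never forces the perturbation to zero, which is compatible with an encoding being stable under small rotations without being rotation-invariant.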

2606.17958 2026-06-17 cs.CV cs.LG 新提交

Beyond Visual Cues: CoT-Enhanced Reasoning for Semi-supervised Medical Image Segmentation

超越视觉线索:CoT增强推理用于半监督医学图像分割

Yuming Chen, Yuxin Xie, Tao Zhou, Yi Zhou

发表机构 * School of Computer Science and Engineering, Southeast University(东南大学计算机科学与工程学院) Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications, Ministry of Education(教育部新一代人工智能技术及其跨学科应用重点实验室) Nanjing University of Science and Technology(南京理工大学)

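AI Summary Proposes the CERS framework, which integrates Chain-of-Thought reasoning and a semantic-aware reference selection strategy to address visual-semantic mismatch in semi-supervised medical image segmentation, outperforming existing methods in cases of boundary ambiguity and semantic inconsistency.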

Comments Accepted to MICCAI 2026

详情
AI中文摘要

半监督医学图像分割已成为医学图像分析中的主导研究问题,通过对未标记数据利用一致性正则化来缓解标注稀缺。然而,现有方法主要通过视觉模式匹配操作,严重依赖像素级相似性。这种以视觉为中心的依赖在临床场景中常常失效,因为视觉上相似的病变可能需要不同的诊断结论,从而无法捕捉专家使用的潜在诊断逻辑。为了解决这个问题,我们超越视觉线索,提出了CERS(CoT增强推理分割),一个集成链式思维(CoT)推理以区分病理上不同案例的框架。具体来说,我们构建了一个知识池,其中包含由大型语言模型(LLMs)生成的丰富语言推理描述。引入了一种语义感知的参考选择策略来识别历史证据,首先通过形态学过滤候选,然后通过CoT一致性进行细化以消除硬负样本。此外,设计了多尺度坐标注意力模块(MCAM)以有效地将这种推理衍生的上下文融合到解码过程中。大量实验证明了CERS相对于最先进方法的优越性,特别是在解决边界模糊和语义不一致方面。代码可在该https URL获取。

英文摘要

Semi-supervised medical image segmentation has emerged as a dominant research problem in medical image analysis, mitigating annotation scarcity by leveraging consistency regularization on unlabeled data. However, existing approaches operate predominantly via visual pattern matching, relying heavily on pixel-level similarities. This visual-centric dependency often falters in clinical scenarios characterized by the visual-semantic mismatch, where visually similar lesions warrant distinct diagnostic conclusions, thus failing to capture the underlying diagnostic logic used by experts. To address this, we move beyond visual cues and propose CERS (CoT-Enhanced Reasoning Segmentation), a framework that integrates Chain-of-Thought (CoT) reasoning to distinguish pathologically distinct cases. Specifically, we construct a knowledge pool enriched with linguistic reasoning descriptions generated by large language models (LLMs). A semantic-aware reference selection strategy is introduced to identify historical evidence, filtering candidates first by morphology, and then refining them via CoT consistency to eliminate hard negatives. Furthermore, a multi-scale coordinate attention module (MCAM) is designed to effectively fuse this reasoning-derived context into the decoding process. Extensive experiments demonstrate the superiority of CERS against state-of-the-art approaches, particularly in resolving boundary ambiguities and semantic inconsistencies. The code is available at this https URL.

2606.17953 2026-06-17 cs.CV 新提交

MLLMs Get It Right, Then Get It Wrong: Tracing and Correcting Late-Layer Textual Bias

MLLMs 先正确后错误:追踪并纠正后层文本偏见

Xingming Li, Ao Cheng, Qiyao Sun, Xixiang He, Xuanyu Ji, Runke Huang, Qingyong Hu

发表机构 * National University of Defense Technology(国防科技大学) Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳)) Intelligent Game and Decision Lab(智能博弈与决策实验室)

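AI Summary Finds that multimodal large language models form correct vision-based predictions in intermediate layers that are then overridden by text in the final output. Using the direction of prediction change (85% of failures shift toward text, 89% of successes shift toward vision), it proposes CALRD, a training-free method that improves results on conflict benchmarks by up to 9.4%.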

Comments Accepted at IJCAI 2026. 16 pages, 10 figures

详情
AI中文摘要

当视觉与文本矛盾时,多模态大语言模型(MLLMs)始终偏向文本,即使图像提供了明确的相反证据。这种偏见对需要视觉基础的应用构成风险,但其原因尚不清楚。本文中,我们揭示了一个令人惊讶的发现:模型最初往往是正确的,在中间层形成基于视觉的正确预测,然后在最终输出中改变主意,偏向文本。我们称之为“后层文本覆盖”。视觉信息已被编码,只是未能保留到输出。更有趣的是,我们发现预测的变化方式揭示了其正确性:85%的失败转向文本,而89%的成功转向视觉。这种方向性特征使得一种简单而有效的干预成为可能:当我们检测到自信的视觉预测被抑制时,我们将其恢复。我们提出了CALRD(冲突感知层参考解码),一种无需训练的方法,在推理时恢复被覆盖的预测。在五种不同架构的MLLM上的实验表明,在冲突基准上绝对提升高达9.4%,同时基本保持标准性能,无需训练或外部知识。它恢复了模型已知但未能保留的信息。

英文摘要

When vision contradicts text, multimodal large language models (MLLMs) consistently favor text, even when images provide clear evidence otherwise. This bias poses risks for applications requiring visual grounding, yet its cause remains unclear. In this paper, we uncover a surprising finding: models often get it right initially, forming correct vision-based predictions in their intermediate layers, before changing their minds and favoring text in the final output. We call this "late-layer textual override". The visual information is encoded; it simply does not survive to the output. More intriguingly, we find that how predictions change reveals whether they're correct: 85% of failures shift toward text, while 89% of successes shift toward vision. This directional signature enables a simple but powerful intervention: when we detect a confident visual prediction being suppressed, we restore it. We propose CALRD (Conflict-Aware Layer Reference Decoding), a training-free method that recovers overridden predictions at inference time. Experiments across five MLLMs of varying architectures demonstrate up to 9.4% absolute improvements on conflict benchmarks while largely preserving standard performance, without training or external knowledge. It recovers what the model already knew but failed to preserve.
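
The idea of detecting a confident intermediate prediction that is later suppressed can be sketched in a few lines of Python. This is not the CALRD algorithm, whose details the abstract does not give; the per-layer probability dictionaries, the function name, and the 0.6 confidence cut-off are all hypothetical choices made for illustration.

```python
def detect_late_override(layer_probs, confidence=0.6):
    """Find a confident intermediate answer that the final layer abandons.

    layer_probs: list with one dict per layer (earliest first), mapping each
        candidate answer to its probability at that layer (hypothetical
        per-layer readout of the model's prediction).
    Returns (layer_index, answer) for the most confident intermediate answer
    that differs from the final answer, or None if there is no such layer.
    """
    final_probs = layer_probs[-1]
    final_answer = max(final_probs, key=final_probs.get)
    best = None  # (layer_index, answer, probability)
    for index, probs in enumerate(layer_probs[:-1]):
        answer = max(probs, key=probs.get)
        if answer != final_answer and probs[answer] >= confidence:
            if best is None or probs[answer] > best[2]:
                best = (index, answer, probs[answer])
    return None if best is None else (best[0], best[1])
```

A decoder built on such a detector could fall back to the flagged intermediate answer whenever one is found and keep the final-layer answer otherwise.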

2606.17952 2026-06-17 cs.LG cs.AI 新提交

SoftMoE: Soft Differentiable Routing for Mixture-of-Experts in LLMs

SoftMoE: 用于大语言模型混合专家网络的软可微路由

Mikołaj Zasada, Łukasz Struski, Jacek Tabor, Marcin Kurdziel

发表机构 * AGH University of Krakow, Poland(克拉科夫AGH大学) Faculty of Mathematics(数学系) Computer Science, Jagiellonian University, Poland(计算机科学系,杰哥利安大学,波兰) Centre for Credible Artificial Intelligence, Warsaw University of Technology(可信人工智能中心,华沙技术大学)

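AI Summary Proposes SoftMoE, which replaces discrete routing with a truncated soft top-k LapSum relaxation to enable gradient-based optimization of expert routing and learns the number of active experts per layer, matching or exceeding sparse MoE on language modeling while activating fewer experts.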

Comments Accepted at ICML 2026

详情
AI中文摘要

稀疏混合专家(MoE)架构通过仅激活一小部分专家(通过top-$k$路由)在固定推理预算下扩展LLM参数。虽然这保持了因果性并适用于自回归语言模型,但离散的top-$k$算子不可微,强制每个输入激活固定数量的专家,导致计算利用效率低下。我们提出SoftMoE,用截断的软top-$k$ LapSum松弛替代离散路由,允许基于梯度的专家路由优化。我们进一步参数化每层平均激活专家数,并施加全局预算约束,使模型能够学习跨层分配专家容量。SoftMoE完全兼容自回归建模,在语言建模和下游任务上达到与稀疏MoE相当或更优的性能,同时激活显著更少的专家。值得注意的是,学习到的分配高度非均匀,后层激活更多专家。源代码已公开$^\dagger$。

英文摘要

Sparse Mixture-of-Experts (MoE) architectures enable scaling LLM parameters under a fixed inference budget by activating only a small subset of experts via top-$k$ routing. While this preserves causality and suits autoregressive language models, the discrete top-$k$ operator is not differentiable, forcing a fixed number of active experts per input and resulting in inefficient use of computation. We propose SoftMoE, which replaces discrete routing with a truncated soft top-$k$ LapSum relaxation, allowing gradient-based optimization of expert routing. We further parameterize the mean number of active experts per layer and impose a global budget constraint, enabling the model to learn how to allocate expert capacity across layers. SoftMoE remains fully compatible with autoregressive modeling and achieves performance comparable to or better than sparse MoE on language modeling and downstream tasks, while activating significantly fewer experts. Notably, the learned allocation is highly non-uniform, with later layers activating more experts. The source code is publicly available$^\dagger$.
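
The contrast between discrete top-$k$ routing and a soft relaxation can be sketched in plain Python. The first function is standard top-$k$ routing with a softmax over the selected experts; the second is a simple sigmoid gate around a threshold, used here only as a stand-in for a differentiable relaxation. It is not the LapSum relaxation from the paper, and the function names, threshold, and temperature are assumptions.

```python
import math


def top_k_routing(scores, k):
    """Discrete routing: keep the k highest-scoring experts, softmax their scores.

    Always activates exactly k experts, and the selection step has no useful gradient.
    """
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    m = max(scores[i] for i in top)
    exps = {i: math.exp(scores[i] - m) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


def sigmoid_soft_gates(scores, threshold, temperature=1.0):
    """Soft routing stand-in: each expert gets a gate in (0, 1).

    The sum of the gates acts as a smooth count of active experts, so the
    threshold (and hence the expected count) can be tuned by gradient descent.
    """
    return [1.0 / (1.0 + math.exp(-(s - threshold) / temperature)) for s in scores]
```

With a soft gate, the number of effectively active experts is no longer fixed per input, which is the property that lets a model learn different expert budgets for different layers.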

2606.17950 2026-06-17 cs.CV cs.AI 新提交

Plug-and-Adapt: Multimodal Coreference Resolution at First Sight with a Pretrained Alignment Model

即插即适应:基于预训练对齐模型的首眼多模态指代消解

Jinghan Wu, Jing Li, Ivor W. Tsang, Xuetao Zhang

发表机构 * State Key Laboratory of Human-Machine Hybrid Augmented Intelligence, Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University(西安交通大学人工智能与机器人研究所人机混合增强智能全国重点实验室) Centre for Frontier AI Research and Institute of High-Performance Computing, Agency for Science, Technology and Research (A*STAR)(新加坡科技研究局前沿人工智能研究中心与高性能计算研究所)

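AI Summary Proposes a plug-and-adapt method that uses a pre-trained fine-grained alignment model and fuses visual and categorical cues with evidence theory, requiring neither training on the target dataset nor large VLLMs; on the CIN benchmark it improves CoNLL F1 by 5.31% and 2.12% over dedicated methods and popular VLLMs, respectively.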

详情
AI中文摘要

视觉信息有助于解决指代消解中的歧义,带来显著的性能提升。然而,现有的多模态指代消解(MCR)方法在应用前需要使用目标数据集的部分标注数据进行训练,这阻碍了其直接可用性并引发泛化担忧。虽然拥有数十亿参数的视觉-语言大模型(VLLM)提供了有前景的零样本能力,但它们仍然难以获取。其庞大的规模限制了部署能力,且许多模型只能通过付费API访问。在本文中,我们提出了一种即插即适应方法,该方法策略性地适配一个精心预训练的\emph{对齐模型},以立即用于MCR任务,旨在消除对稀缺基准数据集的训练或依赖资源密集型VLLM的需求。具体来说,我们首先使用视觉-语言对齐数据集预训练文本与视觉上下文信息之间的细粒度对齐模型。然后,我们通过证据理论融合视觉和类别线索进行相似度聚合,将对齐模型重新用于MCR,从而增强效果。在Coreference Image Narratives (CIN)基准数据集上的实验证明了我们方法的有效性,在CoNLL F1上比最先进的专用方法和流行VLLM分别提高了5.31%和2.12%。我们进一步在掩码CIN数据集上进行鲁棒性测试,并在专门构建的VCR-MCR数据集上进行泛化评估,结果证实了这两种能力。

英文摘要

Visual information helps resolve ambiguity in coreference resolution, leading to notable performance gains. However, existing Multi-modal Coreference Resolution (MCR) methods require training with (partially) annotated data from the target dataset before they can be applied, preventing their direct usability and raising concerns about generalization. While Vision-Language Large Models (VLLMs) with billions of parameters offer promising zero-shot capabilities, they remain largely inaccessible. Their massive size limits deployability, and many are only accessible through paid APIs. In this paper, we propose a plug-and-adapt method that strategically adapts a carefully pre-trained \emph{alignment model} for immediate use in MCR tasks, designed to eliminate the need for training on scarce benchmark datasets or relying on resource-intensive VLLMs. Specifically, we first pre-train a fine-grained alignment model between textual and visual contextual information using vision-language alignment datasets. We then repurpose the alignment model to MCR through similarity aggregation by fusing visual and categorical cues with evidence theory, thereby enhancing effectiveness. Experiments on the Coreference Image Narratives (CIN) benchmark dataset demonstrate the effectiveness of our method, achieving a 5.31\% and 2.12\% improvement in CoNLL F1 over SOTA dedicated methods and popular VLLMs, respectively. We further evaluate our method on a masked CIN dataset for robustness testing and on a specially constructed VCR-MCR dataset for generalization assessment, with results confirming both capabilities.

2606.17945 2026-06-17 cs.AI 新提交

Small Initialization Matters for Large Language Models

小初始化对大语言模型至关重要

Liangkai Hang, Junjie Yao, Zhiyu Li, Feiyu Xiong, Hongkang Yang, Zhi-Qin John Xu

发表机构 * School of Mathematical Sciences, Shanghai Jiao Tong University(上海交通大学数学科学学院) Institute of Natural Sciences, Shanghai Jiao Tong University(上海交通大学自然科学研究院) MemTensor (Shanghai) Technology Co., Ltd.(上海记忆张量科技有限公司) Institute for Advanced Algorithms Research(先进算法研究所)

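AI Summary Finds that reducing the initialization scale consistently improves LLM pretraining, with the largest gains on reasoning tasks, and describes a mechanism in which small initialization drives parameters from low-complexity structures toward richer representations.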

Comments 26 pages, 8 figures

详情
AI中文摘要

大语言模型提供了一个可处理的系统,用于探究智能本身如何涌现,而不仅仅是LLM如何被工程化。尽管进展通常归因于规模、数据和架构,但我们表明参数初始化是训练以及模型能力的基因式决定因素。减小初始化尺度持续改善预训练,在推理密集型任务上收益最大。我们识别出两种限制小初始化优势的常用经验设置,并展示放松这些设置如何恢复有利的缩放。我们进一步发现了一个平衡推理和训练的关键初始化。从机制上讲,小初始化驱动了独特的发展轨迹:参数首先凝聚成低复杂度结构,随后扩展为更丰富的表示,为“压缩即智能”这一观点提供了具体形式。词元级分析表明,收益集中在非平凡、上下文约束的预测上,而非均匀地分布于所有词元。这些结果启发了一个简单的$\gamma$-初始化规则:将初始化范围作为显式旋钮,并默认使用小初始化,这是一种几乎无成本的干预,能改善预训练并跨模型规模增强推理。

英文摘要

Large language models provide a tractable system for asking how intelligence itself emerges, rather than only how LLMs can be engineered. Although progress is usually attributed to scale, data and architecture, we show that parameter initialization is a gene-like determinant of training and, in particular, of model capacity. Reducing the initialization scale consistently improves pretraining, with the largest gains on reasoning-demanding tasks. We identify two widely used empirical settings that restrain the advantage of small initialization, and show how relaxing them restores favorable scaling. We further uncover a critical initialization that balances the reasoning and training. Mechanistically, small initialization drives a distinct developmental trajectory: parameters first condense into low-complexity structures and later expand into richer representations, giving concrete form to the idea that compression is intelligence. Token-level analyses show that the gains concentrate on non-trivial, context-constrained predictions rather than all tokens uniformly. These results motivate a simple $\gamma$-initialization rule: expose initialization range as an explicit knob and use small initialization by default, an almost cost-free intervention that improves pretraining and strengthens reasoning across model scales.

2606.17937 2026-06-17 cs.RO 新提交

ThinkingVLA: Interleaved Vision and Language Reasoning for Robotic Manipulation

ThinkingVLA:用于机器人操作的交叉视觉与语言推理

Tianyi Lu, Hui Zhang, Zijie Diao, Junke Wang, Shengqi Xu, Xingyao Lin, Guojin Zhong, Ziyi Ye, Peng Wang, Zuxuan Wu, Yu-Gang Jiang

发表机构 * Fudan University(复旦大学)

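AI Summary Proposes ThinkingVLA, which interleaves forward and inverse reasoning within a unified Mixture-of-Transformers architecture, substantially improving performance on long-horizon manipulation tasks.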

详情
AI中文摘要

大多数视觉-语言-动作(VLA)模型直接将观测映射到动作,缺乏显式推理,限制了其在推理密集型长时域任务中的能力。为解决此问题,现有方法采用思维链(CoT)推理以实现子目标分解和空间预测。然而,这些方法缺乏用于有效跨模态推理的统一架构,并且未能显式包含基于目标状态的逆向推理能力。我们认为,操作规划自然分解为预测(预测下一个视觉状态)和逆向动力学(推断达到该状态的动作)。连接两者需要一个统一的、在单一生成过程中交织文本和视觉推理的自回归架构。我们提出\textbf{ThinkingVLA},一种在统一的混合Transformer架构中实现此分解的生成模型。ThinkingVLA包含一个前向CoT,用于识别即时子目标并指导视觉预测;预测的图像随后作为目标状态,为逆向CoT提供基础,该逆向CoT基于预测图像推理空间关系和动作意图;最终动作基于完整的推理上下文生成。在仿真和真实世界基准上的大量实验表明,ThinkingVLA持续优于最先进的基线,在长时域操作任务上尤其有大幅提升。

英文摘要

Most Vision-Language-Action (VLA) models map observations directly to actions without explicit reasoning, limiting their capacity for reasoning-intensive long-horizon tasks. To address this, existing approaches adopt Chain-of-Thought (CoT) reasoning to enable subgoal decomposition and spatial anticipation. However, those methods lack a unified architecture for effective cross-modal reasoning and fail to explicitly include inverse reasoning ability based on the target state. We argue that manipulation planning naturally decomposes into prediction, anticipating the next visual state, and inverse dynamics, inferring the actions to reach it. Bridging both requires a unified autoregressive architecture that interleaves textual and visual reasoning in a single generation process. We propose \textbf{ThinkingVLA}, a generative model that realizes this decomposition within a unified Mixture-of-Transformers architecture. ThinkingVLA consists of a forward CoT that identifies the immediate subgoal and guides the visual forecasting; the predicted image then serves as the target state, grounding an inverse CoT that reasons about spatial relationships and action intent based on the predicted image; and the final action is generated conditioned on this full reasoning context. Extensive experiments on simulation and real-world benchmarks demonstrate that ThinkingVLA consistently outperforms state-of-the-art baselines, with particularly large gains on long-horizon manipulation tasks.

2606.17936 2026-06-17 cs.RO 新提交

SPARK: Low Latency Single-Camera 3D Pose Estimation for Autonomous Racing using Keypoints

SPARK: 基于关键点的自动驾驶赛车低延迟单摄像头3D姿态估计

Dominic Ebner, Markus Lienkamp

发表机构 * Technical University of Munich(慕尼黑工业大学) School of Engineering & Design, Department of Mobility Systems Engineering, Institute of Automotive Technology(工程与设计学院,移动系统工程系,汽车技术研究所) Munich Institute of Robotics and Machine Intelligence (MIRMI)(慕尼黑机器人与机器智能研究所)

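AI Summary Proposes SPARK, an algorithm that uses a single camera and keypoint detection for low-latency, high-accuracy 3D pose estimation in autonomous racing, outperforming state-of-the-art monocular camera detection algorithms.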

Comments 9 pages, 6 figures, ITSC 2026, Invited Session

详情
AI中文摘要

在自动驾驶赛车中,快速检测其他参与者的运动对于规划与非合作对手的安全无碰撞轨迹至关重要。LiDAR检测本质上比视觉方法更慢且更难部署在边缘设备上,导致检测延迟,限制了高动态机动中的目标跟踪性能。利用单目3D检测可以实现对赛道上其他参与者的易于部署、低延迟检测。我们提出了SPARK,一种基于关键点检测的自动驾驶赛车单摄像头姿态估计算法。它实现了高精度的远距离检测,超越了最先进的单目摄像头检测算法的性能,同时保持较低的延迟。通过使用经过良好优化的YOLO模型并利用自动驾驶赛车领域的固定几何结构,该算法还表现出低延迟和低资源使用率。我们在真实世界的自动驾驶赛车数据上评估了我们的方法性能,并将其与最先进的LiDAR和摄像头检测算法进行了比较。源代码可在以下网址获取:this https URL

英文摘要

In autonomous racing, fast detection of other participants' movements is required to plan safe, collision-free trajectories with non-cooperative opponents. LiDAR detection is inherently slower and harder to deploy on edge devices than vision methods, causing delayed detections that limit object tracking performance during high-dynamic maneuvering. Utilizing monocular 3D detection enables an easy-to-deploy, low-latency detection of other participants on the racetrack. We present SPARK, a single-camera pose-estimation algorithm for autonomous racing using keypoint detection. It achieves long-range detection with high accuracy, exceeding the performance of state-of-the-art monocular camera detection algorithms while maintaining lower latency. By employing well-optimized YOLO models and leveraging the fixed geometry in the autonomous racing domain, the algorithm also exhibits low latency and resource usage. We evaluate the performance of our approach on real-world autonomous racing data and compare it to state-of-the-art LiDAR and camera detection algorithms. The source code is available at: this https URL

2606.17935 2026-06-17 cs.CV 新提交

MoonSplat: Monocular Online Gaussian Splatting with Sim(3) Global Optimization

MoonSplat: 基于Sim(3)全局优化的单目在线高斯泼溅

Guo Pu, Yixuan Han, Haofeng Li, Yao Zhang, Hui Zhou, Zhouhui Lian

发表机构 * Wangxuan Institute of Computer Technology, Peking University(北京大学王选计算机技术研究所) Beijing Hydrogen Intelligent Tech. Co., Ltd.(北京氢元智能科技有限公司)

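AI Summary Proposes an online voxelized 3DGS framework with global Sim(3) optimization and a color residual learning strategy that accelerates convergence, enabling robust camera tracking and global loop closure and reaching state-of-the-art performance on indoor and outdoor datasets.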

Comments SIGGRAPH 2026

详情
AI中文摘要

从单目图像序列进行在线3D重建是一个具有挑战性且持续的研究课题。3D高斯泼溅(3DGS)凭借其高质量的实时渲染能力,使得在线3D重建能够以更强的表达能力表示密集场景,因此在机器人、AR/VR等广泛应用中具有巨大潜力。然而,现有的在线3DGS方法仍面临一些关键挑战:由于缺乏全局优化导致的脆弱相机位姿估计,以及在大规模或长序列场景中优化效率低下。为了解决这些问题,我们提出了一种鲁棒且高效的在线体素化3DGS重建框架,该框架集成了全局$\text{Sim}(3)$优化,能够实现可靠的相机跟踪以及针对相机位姿和体素化3DGS的高效全局闭环。为了加速体素化3DGS的收敛,我们进一步引入了一种颜色残差学习策略,这不仅提高了优化速度,还增强了渲染质量。在多种室内外数据集上的大量实验表明,我们的方法在相机位姿估计精度和渲染质量方面均达到了最先进的性能,同时保持了实时效率。此外,我们基于所提出的方法开发并部署了一个真实的基于无人机的主动重建系统,验证了其在实际在线3D重建任务中的鲁棒性和泛化能力。我们的代码和数据可在该网址获取。

英文摘要

Online 3D reconstruction from monocular image sequences is a challenging and ongoing research topic. 3D Gaussian Splatting (3DGS), leveraging its high-quality real-time rendering capability, empowers online 3D reconstruction to represent dense scenes with enhanced expressiveness, and thus holds great promise for a wide range of applications such as robotics and AR/VR. However, existing online 3DGS methods still suffer from some key challenges: fragile camera pose estimation due to the lack of global optimization, and low optimization efficiency in large-scale or long-sequence scenarios. To address these issues, we propose a robust and efficient online voxelized 3DGS reconstruction framework integrated with global $\text{Sim}(3)$ optimization, which enables reliable camera tracking and efficient global loop closure for both camera poses and voxelized 3DGS. To accelerate the convergence of the voxelized 3DGS, we further introduce a color residual learning strategy, which not only boosts optimization speed but also enhances rendering quality. Extensive experiments on diverse indoor and outdoor datasets demonstrate that our method achieves state-of-the-art performance in both camera pose estimation accuracy and rendering quality, while retaining real-time efficiency. Additionally, we develop and deploy a real-world UAV-based active reconstruction system grounded on our proposed method, validating its robustness and generalizability for practical online 3D reconstruction tasks. Our code and data are available at this https URL.

2606.17931 2026-06-17 cs.LG 新提交

Predictive Analytics in E-Commerce for Customer Behavior Forecasting using hybrid Ret-DNN with XGBoost Model

电子商务中基于混合Ret-DNN与XGBoost模型的客户行为预测分析

Degala Pushpa Sri, Mayank Atreya, Lakshmi. H, Navin Chhibber, Mukesh Soni

发表机构 * Chewy Inc(Chewy公司) Pace Institute of Technology and Atlanta, USA(佩斯理工学院和亚特兰大美国) Nitte Meenakshi Institute of Sciences(尼特梅恩克希科学学院) Lovely Professional University(洛丽专业大学) Infinity Tech Group(无限科技集团) University(大学)

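AI Summary Proposes a hybrid Ret-DNN and XGBoost model that predicts customer purchase probability through feature extraction followed by gradient boosting, reaching an MAE of 0.2193 on a UK online retail dataset.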

Comments 2025 2nd International Conference on Software, Systems and Information Technology (SSITCON)

详情
AI中文摘要

近年来,电子商务服务在人们的日常生活中迅速增长,帮助他们在线购买产品。然而,零售平台难以理解客户行为,并难以预测其未来购买。为克服这些挑战,本研究提出一种混合零售深度神经网络(Ret-DNN)与极端梯度提升(XGBoost)模型,用于捕捉零售数据的时间特征和表格动态。首先,数据来自一家英国在线零售商,包含近50万条交易记录。然后,使用一系列技术对收集的数据进行预处理,如数据清洗、异常值处理、时间特征提取、特征编码和z-score归一化,以确保数据准备好进行模型训练和测试。随后,预处理后的数据被输入到Ret-DNN模型中,该模型作为特征提取器,理解客户交易的完整上下文。进一步,提取的数据作为输入输入到XGBoost模型,该模型预测最终输出为客户购买概率。最后,提出的Ret-DNN XGBoost模型取得了更好的结果,平均绝对误差(MAE)为0.2193,优于现有的Ret-DNN模型。关键词:客户行为预测,极端梯度提升,电子商务,预测分析,零售深度神经网络。

英文摘要

In recent years, electronic (E) commerce services have rapidly increased in the daily lives of people, helping them to purchase products online. However, retail platforms have struggled to understand customer behavior, which makes it difficult to predict future purchases. To overcome these challenges, this study proposes a hybrid Retail Deep Neural Network (Ret-DNN) with an Extreme Gradient Boosting (XGBoost) model for capturing temporal features and tabular dynamics of retail data. First, data were sourced from a United Kingdom (UK)-based online retailer that contains transactions with almost 500,000 records. Then, the collected data were pre-processed using a series of techniques, such as data cleaning, outlier handling, temporal feature extraction, feature encoding, and z-score normalization, to ensure that the data were ready for model training and testing. Subsequently, the preprocessed data were fed into the Ret-DNN model, which acts as a feature extractor to understand the complete context of customer transactions. Further, the extracted data were fed as input into the XGBoost model, which predicted the final output as the purchase probability of customers. Finally, the proposed Ret-DNN XGBoost model achieved better results by attaining a Mean Absolute Error (MAE) of 0.2193 when compared to the existing Ret-DNN model. Keywords: Customer behavior forecasting, extreme gradient boosting, electronic commerce, predictive analytics, retail deep neural networks.
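
Of the preprocessing steps listed, z-score normalization is the easiest to show concretely. The sketch below, in plain Python, fits the mean and standard deviation on one set of values and applies them to another, as is commonly done to avoid leaking test statistics into training; the function names and the fallback for constant features are assumptions, not details from the paper.

```python
import math


def fit_z_score(values):
    """Return (mean, std) of a feature, using the population standard deviation.

    A constant feature has std 0; 1.0 is returned instead so that later
    division is safe (a choice made for this sketch).
    """
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(variance)
    return mean, (std if std > 0.0 else 1.0)


def apply_z_score(values, mean, std):
    """Standardize values with previously fitted statistics."""
    return [(v - mean) / std for v in values]
```

In a two-stage pipeline like the one described, features standardized this way would be passed to the feature extractor, whose outputs then become the inputs of the gradient-boosted model.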

2606.17930 2026-06-17 cs.AI New submission

How Inference Compute Shapes Frontier LLM Evaluation

推理计算如何塑造前沿LLM评估

Jessica McFadyen, Ole Jorgensen, Harry Coppock, Kevin Wei, Cozmin Ududec

Affiliations * UK AI Security Institute

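AI summary Evaluates up to 12 frontier language models under controlled inference compute (token budgets, context compaction, and repeated submissions), finding that more compute substantially improves performance, that fixed-budget evaluations understate model capability, and that benchmarks differ in how sensitive they are to each inference-scaling method.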

Comments 34 pages, 4 figures

Details
AI Chinese abstract

AI评估正转向更困难的任务,这些任务受益于涉及工具使用和迭代问题解决的更长轨迹。因此,性能对测试时可用的计算量(“推理计算”)及其分配越来越敏感。然而,许多评估仍然在单一限制性预算下报告性能,这意味着低分可能反映评估设置而非模型的潜在能力。为了验证这一点,我们在涵盖软件工程、数学、医学和网络安全的七个具有挑战性的基准上评估了多达12个前沿语言模型。我们使用结合三种简单推理扩展干预的受控设置:更大的token预算、上下文压缩和重复提交尝试,由模型本身或最小正确性反馈引导。我们发现了三个主要结果。首先,更大的token预算在多个领域的基准上显著提升性能,包括网络安全、FrontierMath、Humanity's Last Exam和TerminalBench。其次,随着模型进步,固定预算评估可能越来越低估前沿能力。较新的模型在大型预算下达到更高性能,解锁更困难的任务并更可靠地解决它们。第三,不同基准在哪种推理扩展方法最有效方面存在差异:重复提交广泛提升性能,但更大token预算、外部反馈和并行尝试的价值因基准而异。总体而言,我们的结果表明基准分数是协议依赖的。因此,我们主张评估应将能力报告为推理时间计算的函数,明确指定协议选择,并在匹配预算的大共享计算范围内比较模型代际,特别是在安全或政策相关设置中。

English abstract

AI evaluations are shifting toward harder tasks that benefit from longer trajectories involving tool use and iterative problem solving. As a result, performance is increasingly sensitive to the amount and allocation of compute available at test time ("inference compute"). Yet many evaluations still report performance at a single restrictive budget, meaning that low scores may reflect the evaluation setup rather than the model's underlying capability. To test this, we evaluate up to 12 frontier language models on seven challenging benchmarks spanning software engineering, mathematics, medicine, and cybersecurity. We use a controlled setup combining three simple inference-scaling interventions: larger token budgets, context compaction, and repeated submission attempts, guided either by the model itself or by minimal correctness feedback. We find three main results. First, larger token budgets substantially improve performance on benchmarks across multiple domains, including cybersecurity, FrontierMath, Humanity's Last Exam, and TerminalBench. Second, fixed-budget evaluations can increasingly understate frontier capability as models advance. Newer models reach higher performance at large budgets, where they unlock harder tasks and solve them more reliably. Third, benchmarks differ in which inference-scaling methods help most: repeated submission broadly improves performance, but the value of larger token budgets, external feedback, and parallel attempts varies by benchmark. Overall, our results show that benchmark scores are protocol-dependent. We therefore argue that evaluations should report capability as a function of inference-time compute, specify protocol choices explicitly, and compare model generations over a large shared compute range at matched budgets, especially in safety- or policy-relevant settings.
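
The recommendation to report capability as a function of inference-time compute can be sketched as a solve-rate curve over token budgets. This is a minimal illustration rather than the authors' protocol: the function name, the input format, and the simplifying assumption that a task solved at some budget stays solved at every larger budget are all hypothetical.

```python
def solve_rate_by_budget(min_tokens_to_solve, budgets):
    """Return {budget: fraction of tasks solved within that token budget}.

    min_tokens_to_solve holds, per task, the smallest token budget at which
    the task was solved, or None if it was never solved (assumed format).
    """
    if not min_tokens_to_solve:
        raise ValueError("need at least one task")
    total = len(min_tokens_to_solve)
    curve = {}
    for budget in sorted(budgets):
        solved = sum(
            1 for needed in min_tokens_to_solve
            if needed is not None and needed <= budget
        )
        curve[budget] = solved / total
    return curve


# Hypothetical results for four tasks; the third was never solved.
tasks = [1_000, 50_000, None, 200_000]
print(solve_rate_by_budget(tasks, [10_000, 100_000, 1_000_000]))
```

Reporting the whole curve, instead of the value at one budget, makes visible the case the abstract warns about: a model that looks weak at a restrictive budget but solves more tasks once the budget grows.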

2606.17929 2026-06-17 cs.AI New submission

PreAct: Computer-Using Agents that Get Faster on Repeated Tasks

PreAct:在重复任务上加速的计算机使用代理

Bojie Li

Affiliations * Pine AI

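AI summary Proposes PreAct, which compiles the first successful run of a task into a state-machine program and replays it directly on later runs, avoiding per-step language-model calls for an 8.5-13x speedup, while checking at each step that the screen state matches what the program expects.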

Details
AI Chinese abstract

计算机使用代理通过屏幕操作真实软件——点击和打字——但它们从头解决每个任务:当要求重复一个任务时,代理重新读取屏幕,重新推理每次点击,并再次支付全部成本。我们提出PreAct,让这样的代理在之前做过的任务上更快。首次成功时,PreAct将运行编译成一个小的状态机程序——检查屏幕的状态、执行动作的转换——并在后续运行中直接重放,而不是调用代理,速度提升8.5-13倍,无需每步的语言模型调用。重放并非盲目:每一步PreAct在行动前检查屏幕是否与程序预期匹配,一旦出现异常就将控制权交还给代理。PreAct在决定保留什么时也应用同样的原则:新编译的程序只有在从干净状态重新运行时,独立评估器确认其解决了任务后,才进入存储——捕获那些重放到最后一步但未完成任务的程序。在移动、桌面和网络基准测试中,这种存储时检查将重复运行中因故障程序积累而改善的运行与退化的运行区分开,每个基准测试价值1.75-2.6个任务,三个方向一致;当没有程序匹配时,从头探索的回退使PreAct与强大的记录-重放基线持平。我们还报告了哪些因素不重要:提示措辞、运行时护栏,以及语言模型或普通嵌入检索器选择重用的程序。

English abstract

Computer-using agents drive real software through the screen -- clicking and typing -- but they solve every task from scratch: asked to repeat a task, an agent re-reads the screen, re-reasons every tap, and pays the full cost again. We present PreAct, which lets such an agent get faster on tasks it has done before. The first time it succeeds, PreAct compiles the run into a small state-machine program -- states that check the screen, transitions that act -- and on later runs replays it directly instead of invoking the agent, 8.5-13x faster, with no per-step language-model calls. Replay is not blind: at each step PreAct checks that the screen matches what the program expects before acting, and hands control back to the agent the moment something is off. PreAct applies the same discipline when deciding what to keep: a freshly compiled program enters the store only if, re-run from a clean state, an independent evaluator confirms it solved the task, catching programs that replay to their last step yet leave the task undone. Across a mobile, a desktop, and a web benchmark, this store-time check separates repeated runs that improve from ones that degrade as faulty programs accumulate, worth 1.75-2.6 tasks per benchmark, the same direction on all three; a fallback that explores afresh when no program fits brings PreAct level with a strong record-and-replay baseline. We also report what did not matter: prompt wording, runtime guardrails, and whether a language model or a plain embedding retriever selects which program to reuse.
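
The replay loop and store-time check described in the abstract can be sketched as follows. This is a toy illustration under stated assumptions, not PreAct's implementation: `Step`, `replay`, `admit`, and the string-based screen and action encodings are all hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass
class Step:
    expected_screen: str  # screen state the program expects before acting (assumed encoding)
    action: str           # action to perform when the screen matches (assumed encoding)


def replay(program, read_screen, perform, hand_back_to_agent):
    """Replay a compiled program, checking the screen before every action.

    Returns True if every step ran. On the first mismatch, calls
    hand_back_to_agent(step_index) and returns False without acting.
    """
    for index, step in enumerate(program):
        if read_screen() != step.expected_screen:
            hand_back_to_agent(index)
            return False
        perform(step.action)
    return True


def admit(program, store, rerun_from_clean_state, evaluator):
    """Store-time check: keep a program only if a clean re-run solves the task."""
    final_state = rerun_from_clean_state(program)
    if evaluator(final_state):
        store.append(program)
        return True
    return False
```

A caller would supply `read_screen`, `perform`, and `hand_back_to_agent` from its own environment. The point of the split is that `replay` guards each individual step, while `admit` guards against programs that run to the end but leave the task undone.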