arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.07357 2026-05-12 cs.AI

GraphReAct: Reasoning and Acting for Multi-step Graph Inference

Xingtong Yu, Zhongwei Kuai, Chang Zhou, Xuanting Xie, Renhe Jiang, Xikun Zhang, Hong Cheng, Xinming Zhang, Yuan Fang

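Affiliations The Chinese University of Hong Kong; The University of Science and Technology of China; University of Electronic Science and Technology of China; The University of Tokyo; RMIT University; Singapore Management University

AI summary This paper proposes GraphReAct, a graph reasoning-acting framework for multi-step graph inference. The method combines the topological and semantic information in graph-structured data through two complementary retrieval actions, topological retrieval and semantic retrieval, which dynamically expand the reasoning context, and it adds a context refinement action that progressively compresses the accumulated information. Experiments show that GraphReAct outperforms existing methods on six benchmark datasets, supporting the effectiveness of reasoning-acting for graph learning.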

Comments Under review

Abstract

Reasoning-acting frameworks enhance large language models (LLMs) by interleaving reasoning with actions for dynamic information acquisition. However, extending this paradigm to graph learning remains underexplored. Graph data is inherently structured, with information distributed across nodes and edges and encoded through both topology and latent representations. As a result, effective reasoning over graphs requires not only retrieving informative evidence from the graph, but also progressively refining the accumulated context during multi-step inference. In this work, we propose GraphReAct, a graph reasoning-acting framework that enables step-by-step inference over graph-structured data. Specifically, we design a graph-based action space with two complementary retrieval actions: topological retrieval, which captures local structural dependencies, and semantic retrieval, which accesses non-local but relevant evidence in the representation space. These actions dynamically expand the reasoning context. To further support multi-step reasoning, we introduce another type of action, context refinement, which distills and reorganizes accumulated information into a compact representation. By interleaving reasoning with both retrieval and refinement actions, our framework enables a progressive transition from context expansion to compression. Extensive experiments on six benchmark datasets demonstrate that GraphReAct consistently outperforms state-of-the-art methods, validating the effectiveness of reasoning-acting for graph learning.
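
The two retrieval actions described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy adjacency list `adj`, and the toy embeddings `emb` are all hypothetical, and cosine similarity is assumed as the notion of closeness in representation space.

```python
import math
from collections import deque

def topological_retrieval(adj, node, hops=1):
    """Return nodes within `hops` edges of `node` (local structural context)."""
    seen = {node}
    frontier = deque([(node, 0)])
    while frontier:
        current, depth = frontier.popleft()
        if depth == hops:
            continue  # do not expand beyond the hop limit
        for nbr in adj.get(current, ()):
            if nbr not in seen:
                seen.add(nbr)
                frontier.append((nbr, depth + 1))
    return sorted(seen - {node})

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def semantic_retrieval(emb, node, k=1):
    """Return the k nodes most similar to `node` in embedding space."""
    scored = [(cosine(emb[node], vec), other)
              for other, vec in emb.items() if other != node]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [other for _, other in scored[:k]]

# Hypothetical toy graph: "d" is disconnected from "a" but close in embedding space.
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b"], "d": []}
emb = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.6, 0.8], "d": [0.9, 0.1]}

local_context = topological_retrieval(adj, "a", hops=1)
nonlocal_context = semantic_retrieval(emb, "a", k=1)
```

In this toy graph the two actions return different evidence for the same node, which is the sense in which they are complementary.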

2605.07237 2026-05-12 cs.CL

Teaching Language Models to Think in Code

Hyeon Hwang, Jiwoo Lee, Jaewoo Kang

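Affiliations Korea University; AIGEN Sciences

AI summary This paper proposes ThinC, a framework in which language models reason through code rather than treating code as a tool invoked by natural-language reasoning. Reasoning proceeds through code blocks linked by their execution outputs, which reduces the interference and errors of natural-language reasoning. Experiments show that ThinC performs strongly on several competition-level math benchmarks, even surpassing much larger models, and that its reasoning depends heavily on code execution results and is robust.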

Comments Preprint

Abstract

Tool-integrated reasoning (TIR) has emerged as a dominant paradigm for mathematical problem solving in language models, combining natural language (NL) reasoning with code execution. However, this interleaved setup has three key limitations: code often acts as a post-hoc verifier, intermediate NL computations are error-prone, and NL and code play overlapping rather than clearly distinct roles. We propose ThinC (Thinking in Code), a framework in which code itself serves as the reasoner rather than as a tool invoked by NL. A ThinC trajectory begins with a brief NL planning step, after which all reasoning unfolds through code blocks connected only by their execution outputs. We distill 12.2k code-centric trajectories from a teacher model and train ThinC-1.7B and ThinC-4B with supervised fine-tuning followed by reinforcement learning. ThinC-4B consistently outperforms every TIR baseline on five competition-level math benchmarks and even surpasses the much larger Qwen3-235B-A22B-Thinking. Further analysis shows that ThinC reasons through code: 99.2% of its final answers are grounded in interpreter output, and the model recovers reliably from code execution failures without intermediate NL reasoning. Our code and models will be released soon.

2605.07203 2026-05-12 cs.CV

From Pixels to Primitives: Scene Change Detection in 3D Gaussian Splatting

Chamuditha Jayanga Galappaththige, Jason Lai, Timothy Patten, Donald Dansereau, Niko Suenderhauf, Dimity Miller

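Affiliations QUT Centre for Robotics; ARIAM ACFR, University of Sydney

AI summary This paper studies scene change detection based on Gaussian splatting and proposes comparing scenes directly in the space of native Gaussian parameters instead of following the conventional render-then-compare approach. By analyzing the native attributes of the Gaussians (position, anisotropic covariance, and color), the authors show that these attributes alone carry sufficient change signal, and they introduce anisotropic models of geometric and photometric drift together with a per-Gaussian observability term to handle the under-constrained nature of the representation. The method offers multi-view consistency and separates change types, and on real-world datasets it improves on the prior state of the art by about 17%.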

Comments Project Page: https://chumsy0725.github.io/GS-DIFF/

Abstract

Scene change detection methods built on Gaussian splatting universally follow a render-then-compare paradigm: the pre-change scene is rendered into 2D and compared against post-change images via pixel or feature residuals. This change detection problem with Gaussian Splatting has been treated as a question about pixels; we treat it as a question about primitives. We provide direct evidence that native primitive attributes alone -- position, anisotropic covariance, and color -- carry sufficient signal for scene change detection. What makes primitive-space comparison hard is the under-constrained nature of the Gaussian splatting representation: independent optimizations yield primitive solutions whose count, positions, shapes, and colors differ even where nothing has changed. We address this challenge with anisotropic models of geometric and photometric drift, complemented by a per-primitive observability term that reflects the extent to which each Gaussian is constrained by the camera geometry. Operating directly on primitives gives our method, GS-DIFF, two properties that distinguish it from render-then-compare methods. First, change maps are multi-view consistent by construction, where prior work had to learn this through an additional optimization objective. Second, geometric and appearance changes are scored separately, identifying not just where but what kind of change occurred, distinguishing structural changes (e.g., an added object) from surface-level ones (e.g., a color change) without supervision or external model dependencies. On real-world benchmarks, GS-DIFF surpasses the prior state-of-the-art approach by ~17% in mean Intersection over Union.
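
To see why an anisotropic treatment of positional drift matters, consider a small sketch. It does not reproduce the paper's drift model: it assumes an axis-aligned (diagonal) covariance for simplicity and uses a plain Mahalanobis distance, with the hypothetical helper `mahalanobis_diag`, as a stand-in for "displacement measured relative to the Gaussian's own shape".

```python
import math

def mahalanobis_diag(delta, variances):
    """Displacement length in units of per-axis standard deviation.

    Assumes a diagonal covariance; a full covariance would need a matrix inverse.
    """
    return math.sqrt(sum(d * d / v for d, v in zip(delta, variances)))

# A Gaussian elongated along x: large variance on x, small on y and z.
variances = (4.0, 0.01, 0.01)

along_axis = mahalanobis_diag((0.5, 0.0, 0.0), variances)   # shift along the long axis
across_axis = mahalanobis_diag((0.0, 0.5, 0.0), variances)  # same shift across it
```

The same 0.5-unit shift scores far lower along the elongated axis than across it, so a shape-aware distance can tolerate re-optimization drift along directions where the primitive is loosely constrained while still flagging displacement across them.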

2605.07202 2026-05-12 cs.AI

Towards Autonomous Business Intelligence via Data-to-Insight Discovery Agent

Dongming Wu, Junwen Li, Ming Lu, Gang Wang, Ting Chen

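Affiliations Rajax Network Technology (Taobao Shangou of Alibaba)

AI summary This paper proposes AIDA (Autonomous Insight Discovery Agent), described as the first end-to-end autonomous exploration framework for complex business environments, aimed at the challenges large language models face in turning fragmented enterprise data into actionable insights. AIDA builds a highly flexible retail environment with 200+ metrics and 100+ dimensions and integrates a proprietary domain-specific language (DSL) that combines semantic reasoning with precise SQL execution. Its reinforcement learning system models business analysis as a cumulative reasoning process guided by the Pareto principle; experiments show that it significantly outperforms workflow-based agents in environmental perception and in-depth, multi-perspective analysis.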

Abstract

Transforming fragmented enterprise data into actionable insights remains a significant challenge for LLMs, constrained by complex database schemas, limitations in dynamic SQL generation, and the need for deep multi-dimensional analysis. In this paper, we propose AIDA (Autonomous Insight Discovery Agent), the first end-to-end framework designed for autonomous exploration in complex business environments. We establish a highly flexible instant retail environment encompassing 200+ metrics and 100+ dimensions, and integrate a proprietary Domain-Specific Language (DSL) that bridges semantic reasoning with precise SQL execution. Our reinforcement learning system subsequently formulates business analysis as a Pareto Principle-guided cumulative reasoning process. Experimental results demonstrate that AIDA significantly outperforms workflow-based agents, and extensive evaluations further reveal that AIDA achieves superior environmental perception and more in-depth analysis from diverse perspectives. Our work ultimately establishes the transformative potential of autonomous intelligence for industrial-scale business intelligence systems.

2605.07024 2026-05-12 cs.LG cs.AI

Delulu: A Verified Multi-Lingual Benchmark for Code Hallucination Detection in Fill-in-the-Middle Tasks

Mahdi Erfanian, Nelson Daniel Troncoso, Aashna Garg, Amabel Gale, Xiaoyu Liu, Pareesa Ameneh Golnari, Shengyu Fu

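Affiliations University of Illinois Chicago; Microsoft

AI summary This paper presents Delulu, a verified multilingual benchmark for detecting hallucinations in Fill-in-the-Middle (FIM) code generation tasks. An adversarial pipeline generates and filters a high-quality dataset of 1,951 samples covering 7 languages and 4 hallucination types, and Docker containers verify compilation and runtime errors. Eleven open-weight FIM models are evaluated; even the strongest reaches only 84.5% pass@1, indicating that hallucination in FIM tasks is intrinsically difficult rather than a defect of a particular model family.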

Abstract

Large Language Models for code generation frequently produce hallucinations in Fill-in-the-Middle (FIM) tasks -- plausible but incorrect completions such as invented API methods, invalid parameters, undefined variables, or non-existent imports. These failures pass superficial review yet introduce runtime errors. We introduce Delulu, a verified multi-lingual benchmark of 1,951 FIM samples across 7 languages and 4 hallucination types. Samples are curated through an adversarial pipeline: a frontier LLM generates plausible hallucinations, four diverse judge models evaluate them, embedding-based clustering mines progressively harder examples, self-contained Docker containers verify that golden completions compile while hallucinated variants produce the expected runtime error, and a final human-expert review removes any remaining biased or trivially decidable samples. We evaluate 11 open-weight FIM models from five families spanning 0.5B-32B parameters: a six-point Qwen2.5-Coder scaling slate, plus a cross-family slate (CodeLlama, DeepSeek-Coder-V2, StarCoder2). The strongest model reaches only 84.5% pass@1, no family exceeds 0.77 Edit Similarity, and every family produces hallucination-aligned completions on a non-trivial share of samples, confirming that the difficulty exposed by Delulu is task-intrinsic rather than family-specific. We release the benchmark, containers, and evaluation framework at https://github.com/microsoft/delulu.

2605.06969 2026-05-12 cs.CV

Bringing Multimodal Large Language Models to Infrared-Visible Image Fusion Quality Assessment

Yuchen Guo, Junli Gong, Yao Lu, Xintong Xu, Yiuming Cheung, Weifeng Su

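Affiliations Northwestern University; Northeastern University; University of Washington; Hong Kong Baptist University; Beijing Normal - Hong Kong Baptist University

AI summary This study aims to improve the accuracy of quality assessment for infrared-visible image fusion (IVIF). To address existing methods' over-reliance on hand-crafted features and full-reference metrics, it proposes FuScore, an evaluation method based on a multimodal large language model (MLLM). The MLLM produces a continuous quality score rather than discrete level predictions, enabling fine-grained discrimination among images of similar quality; agreement across multiple sub-dimensions is used to construct soft labels, and a tripartite objective further improves comprehensiveness and robustness. Experiments show that FuScore achieves state-of-the-art correlation with human visual preferences.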

Abstract

Infrared-Visible image fusion (IVIF) aims to integrate thermal information and detailed spatial structures into a single fused image to enhance perception. However, existing evaluation approaches tend to over-optimize both hand-crafted no-reference statistics and full-reference metrics that treat the source images as pseudo ground truths. Recent IVIF reward-modelling efforts learn from human ratings but use scalar regression on aggregated scores, neither leveraging the reasoning of Multimodal Large Language Models (MLLMs) nor encoding per-image perceptual ambiguity in their supervision. Naively introducing MLLMs with discrete one-hot supervision likewise collapses fused images of similar quality into different rating levels. To address this, we introduce FuScore, which utilizes an MLLM to mimic human visual perception by producing a continuous quality score, rather than discrete level predictions, enabling fine-grained discrimination among fused images of similar quality. We exploit the agreement among four IVIF-specific sub-dimensions to construct a per-image soft label whose sharpness reflects how consensual the overall judgment is. We further introduce a tripartite objective combining per-image distributional supervision, within-source-pair Thurstone fidelity for method-level ordering, and cross-source-pair Thurstone fidelity for scene-level ordering across scenes. Extensive experiments demonstrate that FuScore achieves state-of-the-art correlation with human visual preferences.

2605.06763 2026-05-12 cs.LG

Sparse Attention as a Range Searching Problem: Towards an Inference-Efficient Index for KV Cache

Mohsen Dehghankar, Abolfazl Asudeh

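Affiliations Department of Computer Science, University of Illinois Chicago

AI summary This study reformulates sparse attention as a halfspace range searching problem, aiming to improve LLM inference efficiency while guaranteeing full recall of critical key-value entries. The authors propose Louver, a new index structure that achieves zero false negatives in both theory and practice and is lightweight, easy to integrate, and hardware-optimized. Experiments show that Louver outperforms existing sparse attention methods in accuracy and runtime and is even faster than highly optimized dense attention implementations.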

Abstract

Sparse attention improves LLM inference efficiency by selecting a subset of key-value entries, but at the cost of potential accuracy degradation. In particular, omitting critical KV entries can induce substantial errors in model outputs. Existing methods typically operate under fixed or adaptive token budgets and provide empirical robustness or partial theoretical guarantees, yet they do not ensure zero false negatives in decoding steps, particularly since the set of relevant tokens is both query- and step-dependent. Our empirical observations confirm that missing even one critical key can lead to sharp error spikes, especially in long reasoning tasks where the set of important tokens varies throughout decoding. This observation motivates the need for indexing methods that dynamically adapt to these variations across decoding steps while guaranteeing a full recall of the relevant keys above a certain threshold. We address this challenge by reformulating sparse attention as the halfspace range searching problem. However, existing range searching indices are not suitable for modern LLM inference due to their computational and implementation overheads. To overcome this, we introduce Louver, a novel index structure tailored for efficient KV cache retrieval. Louver (i) guarantees zero false negatives with respect to a specified threshold in both theory and practice, (ii) is lightweight to integrate into existing LLM pipelines, and (iii) incorporates hardware-aware optimizations for both CPU and GPU executions. Our experiments demonstrate that Louver outperforms prior sparse attention methods in both accuracy and runtime, and is faster than highly optimized dense attentions such as FlashAttention. These results highlight that recall guarantees are a critical and overlooked dimension of sparse attention, and open a new direction for building theoretically grounded, efficient KV cache indices.
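
The halfspace formulation can be made concrete with a short sketch: for a query q and threshold tau, the relevant keys are exactly those k with q·k >= tau. The code below is not Louver. It pairs a brute-force reference with a deliberately simple block-pruning index (all names hypothetical) that skips a block only when the Cauchy-Schwarz bound |q|·max|k| proves no key in it can reach tau, so pruning never drops a qualifying key.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

def halfspace_query_bruteforce(keys, q, tau):
    """Reference answer: indices of all keys in the halfspace q.k >= tau."""
    return [i for i, k in enumerate(keys) if dot(q, k) >= tau]

def build_blocks(keys, block_size):
    """Group consecutive keys and store the largest key norm per block."""
    blocks = []
    for start in range(0, len(keys), block_size):
        idx = list(range(start, min(start + block_size, len(keys))))
        blocks.append((max(norm(keys[i]) for i in idx), idx))
    return blocks

def halfspace_query_pruned(keys, blocks, q, tau, slack=1e-9):
    """Same answer as brute force, skipping blocks that provably hold no hit."""
    q_norm = norm(q)
    hits = []
    for max_norm, idx in blocks:
        # Cauchy-Schwarz: q.k <= |q||k| <= |q| * max_norm for every key in the block.
        # The small slack guards against floating-point rounding at the boundary.
        if q_norm * max_norm < tau - slack:
            continue
        hits.extend(i for i in idx if dot(q, keys[i]) >= tau)
    return hits

# Hypothetical toy cache: two low-norm keys, two high-scoring keys.
keys = [[0.1, 0.0], [0.0, 0.2], [3.0, 1.0], [2.0, 2.0]]
q, tau = [1.0, 1.0], 2.0
blocks = build_blocks(keys, block_size=2)
exact = halfspace_query_bruteforce(keys, q, tau)
pruned = halfspace_query_pruned(keys, blocks, q, tau)
```

A norm bound is a coarse filter; the point of the sketch is only that a conservative upper bound per group is enough to get recall guarantees, which is the property the abstract emphasizes.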

2605.06681 2026-05-12 cs.LG cs.CV

A Hierarchical Ensemble Pipeline for Anomaly Detection in ESA Satellite Telemetry

Lorenzo Riccardo Allegrini, Geremia Pompei

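Affiliations ContinualIST, Pisa, Italy; University of Pisa, Department of Computer Science, Pisa, Italy

AI summary This paper proposes a hierarchical ensemble pipeline for anomaly detection in European Space Agency (ESA) satellite telemetry. The method combines shapelet extraction, statistical feature analysis, per-channel modeling, intra-channel stacking, and cross-channel aggregation, and it is trained and validated with time-series cross-validation and a two-level masking strategy to prevent information leakage. Results on the ESA-ADB benchmark show strong generalization and effective detection of subtle anomalies in realistic satellite telemetry.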

Comments 15 pages, 3 figures, 1 table. Submitted to the ML4ITS workshop at the ECML PKDD 2025 conference. Awarded 2nd place in the final round of the Spacecraft Anomaly Challenge on ESA dataset. (Ranked 1st on the Kaggle public leaderboard and 3rd on the private leaderboard)

Journal ref Communications in Computer and Information Science 2842 (2026) Chapter 7

Abstract

A hierarchical ensemble pipeline is introduced to address anomaly detection in multivariate telemetry data provided by the European Space Agency (ESA). The method integrates shapelet-based and statistical feature extraction, per-channel modeling, intra-channel stacking, and a final cross-channel aggregation. The pipeline is trained and validated using time-series cross-validation and two-level masking strategies to prevent information leakage. Results on the European Space Agency Anomaly Detection Benchmark (ESA-ADB) challenge demonstrate strong generalization, highlighting the effectiveness of hierarchical modeling in detecting subtle anomalies in realistic satellite telemetry.
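
Time-series cross-validation prevents leakage by ensuring that every validation sample comes after all training samples of its fold. The sketch below shows one common variant, an expanding window; the abstract does not say which variant the authors use, and the two-level masking is not reproduced here. The function name is hypothetical.

```python
def expanding_window_splits(n_samples, n_folds):
    """Yield (train, val) index lists where val always follows train in time."""
    fold = n_samples // (n_folds + 1)
    splits = []
    for i in range(1, n_folds + 1):
        train = list(range(0, i * fold))
        # The last fold absorbs any remainder so that no sample is dropped.
        val_end = (i + 1) * fold if i < n_folds else n_samples
        val = list(range(i * fold, val_end))
        splits.append((train, val))
    return splits

splits = expanding_window_splits(n_samples=10, n_folds=3)
```

With ten samples and three folds, the training window grows from two to six samples while each validation block sits strictly later in time.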

2605.06663 2026-05-12 cs.CL

EMO: Pretraining Mixture of Experts for Emergent Modularity

Ryan Wang, Akshita Bhagia, Sewon Min

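Affiliations University of California, Berkeley; Allen Institute for AI

AI summary This paper proposes EMO, a pretrained Mixture-of-Experts (MoE) model designed for modular deployment, so that expert subsets can be used independently or composed for different tasks without human-defined priors. EMO encourages content from similar domains to use similar experts and relies on document boundaries during pretraining, so that expert groupings emerge naturally. Experiments show that EMO maintains overall performance while substantially reducing the number of experts needed at inference, and that its expert subsets specialize at the semantic level (e.g., math or code), in contrast to the syntactic specialization of standard MoEs.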

Abstract

Large language models are typically deployed as monolithic systems, requiring the full model even when applications need only a narrow subset of capabilities, e.g., code, math, or domain-specific knowledge. Mixture-of-Experts (MoEs) seemingly offer a potential alternative by activating only a subset of experts per input, but in practice, restricting inference to a subset of experts for a given domain leads to severe performance degradation. This limits their practicality in memory-constrained settings, especially as models grow larger and sparser. We introduce EMO, an MoE designed for modularity (the independent use and composition of expert subsets) without requiring human-defined priors. Our key idea is to encourage tokens from similar domains to rely on similar experts. Since tokens within a document often share a domain, EMO restricts them to select experts from a shared pool, while allowing different documents to use different pools. This simple constraint enables coherent expert groupings to emerge during pretraining using document boundaries alone. We pretrain a 1B-active, 14B-total EMO on 1T tokens. As a full model, it matches standard MoE performance. Crucially, it enables selective expert use: retaining only 25% (12.5%) of experts incurs just a 1% (3%) absolute drop, whereas standard MoEs break under the same setting. We further find that expert subsets in EMO specialize at semantic levels (e.g., domains such as math or code), in contrast to the low-level syntactic specialization observed in standard MoEs. Altogether, our results demonstrate a path toward modular, memory-efficient deployment of large, sparse models and open new opportunities for composable architectures.
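
The shared-pool constraint can be sketched as a two-stage routing step. The abstract does not say how a document's pool is chosen, so this sketch assumes one plausible rule: take the experts with the highest mean router logit over the document's tokens, then let each token pick its top-k experts inside that pool. All names and numbers are hypothetical.

```python
def select_pool(doc_logits, pool_size):
    """Assumed rule: pool = experts with the highest mean logit over the document."""
    n_experts = len(doc_logits[0])
    mean = [sum(tok[e] for tok in doc_logits) / len(doc_logits)
            for e in range(n_experts)]
    ranked = sorted(range(n_experts), key=lambda e: (-mean[e], e))
    return sorted(ranked[:pool_size])

def route_tokens(doc_logits, pool_size, top_k):
    """Each token selects its top-k experts, restricted to the document's pool."""
    pool = select_pool(doc_logits, pool_size)
    routes = []
    for tok in doc_logits:
        best = sorted(pool, key=lambda e: (-tok[e], e))[:top_k]
        routes.append(sorted(best))
    return pool, routes

# Hypothetical router logits: 3 tokens of one document, 4 experts.
doc_logits = [
    [2.0, 0.1, 1.5, 0.0],
    [1.8, 0.2, 0.1, 3.0],  # this token alone would prefer expert 3
    [1.0, 0.0, 1.9, 0.1],
]
pool, routes = route_tokens(doc_logits, pool_size=2, top_k=1)
```

The second token's unconstrained favorite (expert 3) lies outside the document pool, so it is routed to expert 0 instead; this is the kind of restriction that lets a fixed subset of experts serve a whole document.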

2605.06483 2026-05-12 cs.AI cs.RO cs.SY eess.SY

ReasonSTL: Bridging Natural Language and Signal Temporal Logic via Tool-Augmented Process-Rewarded Learning

Bowen Ye, Zhijian Li, Junyue Huang, Junkai Ma, Xiang Yin

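Affiliations Shanghai Jiao Tong University; Alibaba Group

AI summary This study proposes ReasonSTL, a framework for the critical yet challenging task of translating natural language into Signal Temporal Logic (STL). ReasonSTL combines local open-source language models with tool-augmented reasoning to generate STL formulas from natural language, and it introduces process-rewarded training to supervise both tool-use trajectories and the final formula. Experiments show state-of-the-art results in both automatic and human evaluations, offering a transparent, low-cost, and privacy-preserving option for drafting formal specifications in industrial settings.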

Abstract

Signal Temporal Logic (STL) is an expressive formal language for specifying spatio-temporal requirements over real-valued, real-time signals. It has been widely used for the verification and synthesis of autonomous systems and cyber-physical systems. In practice, however, users often express their requirements in natural language rather than in structured STL formulas, making natural-language-to-STL translation a critical yet challenging task. Manual specification requires temporal-logic expertise and cannot scale, while prompting commercial LLM APIs incurs substantial token costs and may expose sensitive system requirements to third-party services, raising privacy concerns for industrial deployment. To address these challenges, we present ReasonSTL, a tool-augmented framework that adapts local open-source language models for natural-language-to-STL generation. ReasonSTL decomposes the translation process into explicit reasoning, deterministic tool calls, and structured formula construction. We further introduce process-rewarded training to supervise both tool-use trajectories and final formulas, together with STL-Bench, a bilingual, computation-aware benchmark grounded in real-world signals. Experiments show that a 4B model trained with ReasonSTL achieves state-of-the-art performance in both automatic metrics and human evaluations, demonstrating that ReasonSTL provides a transparent, low-cost, and privacy-preserving alternative for formal specification drafting.
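
For readers unfamiliar with STL, a small example shows what a formula over a real-valued signal means. The sketch uses the standard quantitative (robustness) semantics on a discretely sampled signal, where the sample index stands in for time: "always in [a, b], x > c" scores the minimum margin x(t) - c over the window, and "eventually" scores the maximum. The function names and the `speed` signal are hypothetical and unrelated to the paper's tools or benchmark.

```python
def always_gt(signal, a, b, c):
    """Robustness of G[a,b](x > c): worst-case margin over samples a..b."""
    return min(x - c for x in signal[a:b + 1])

def eventually_gt(signal, a, b, c):
    """Robustness of F[a,b](x > c): best-case margin over samples a..b."""
    return max(x - c for x in signal[a:b + 1])

# Hypothetical signal sampled once per time step.
speed = [3.0, 4.5, 5.0, 2.5, 6.0]

always_above_2 = always_gt(speed, 0, 4, 2.0)        # positive: satisfied
always_above_3 = always_gt(speed, 0, 4, 3.0)        # negative: violated at step 3
eventually_above_4 = eventually_gt(speed, 0, 2, 4.0)
```

A positive robustness value means the requirement holds with that much margin, and a negative value means it is violated, which is what makes a requirement such as "speed must always stay above 3" checkable once it is written as a formula.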

2605.06117 2026-05-12 cs.LG

BoostLLM: Boosting-inspired LLM Fine-tuning for Few-shot Tabular Classification

Yi-Siang Wang, Kuan-Yu Chen, Yu-Chen Den, Darby Tien-Hao Chang

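Affiliations SinoPac Holdings

AI summary This paper proposes BoostLLM, a boosting-inspired fine-tuning framework for large language models (LLMs) that targets few-shot tabular classification. The method recasts parameter-efficient fine-tuning as a multi-round residual optimization process, training sequential PEFT adapters as weak learners and adding decision-tree paths as a structured input view to strengthen the model's inductive bias for tabular data. Experiments show that BoostLLM outperforms standard fine-tuning across multiple LLM backbones and datasets, is competitive with XGBoost in few-shot settings, and in some cases surpasses GPT-4o-based methods.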

Comments 19 pages, 4 figures

Abstract

Large language models (LLMs) have recently been adapted to tabular prediction by serializing structured features into natural language, but their performance in low-data regimes remains limited compared to gradient-boosted decision trees (GBDTs). In this work, we revisit the boosting paradigm, traditionally associated with tree ensembles, and ask whether it can be applied as a general training principle for LLM fine-tuning. We propose BoostLLM, a framework that transforms parameter-efficient fine-tuning into a multi-round residual optimization process by training sequential PEFT adapters as weak learners. To incorporate tabular inductive bias, BoostLLM integrates decision-tree paths as a second input view alongside raw features; analysis reveals that the path view acts as a structured teacher in early training steps before the model shifts toward feature-driven representations. Empirically, BoostLLM achieves consistent improvements over standard fine-tuning across multiple LLM backbones and datasets, matching or surpassing XGBoost across a wide range of shot counts and outperforming GPT-4o-based methods with a 4B model. We further show that the framework scales: pairing with stronger tree models and extended boosting horizons yields additional gains under appropriate stabilization. These results suggest that boosting can serve as a general training principle for LLM fine-tuning, particularly in low-data regimes for structured data.
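
The "multi-round residual optimization" idea is the classic boosting loop: each round fits a weak learner to what the current ensemble still gets wrong, then adds a damped copy of it to the prediction. In the sketch below the weak learners are one-dimensional decision stumps under squared loss, a stand-in chosen for brevity; in BoostLLM the weak learners are PEFT adapters and the task is classification, neither of which is reproduced here. All names are hypothetical.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regressor to the residuals by exhaustive threshold search."""
    best = None
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right) if right else 0.0
        err = sum((r - (left_mean if x <= threshold else right_mean)) ** 2
                  for x, r in zip(xs, residuals))
        if best is None or err < best[0]:
            best = (err, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def boost(xs, ys, rounds, lr=0.5):
    """Each round fits a weak learner to the current residuals."""
    learners, preds = [], [0.0] * len(xs)
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        learners.append(stump)
        preds = [p + lr * stump(x) for x, p in zip(xs, preds)]
    return learners, preds

def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)

xs, ys = [0, 1, 2, 3], [0.0, 0.0, 1.0, 1.0]
_, preds_1 = boost(xs, ys, rounds=1)
_, preds_10 = boost(xs, ys, rounds=10)
```

On this toy data the error shrinks with every round because each new learner only has to explain the remaining residual.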

2605.05831 2026-05-12 cs.CV

Unifying Scientific Communication: Fine-Grained Correspondence Across Scientific Media

Megha Mariam K. M, Vineeth N. Balasubramanian, C. V. Jawahar

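Affiliations IIIT Hyderabad; Microsoft Research India & IIT Hyderabad

AI summary Scientific communication is increasingly multimodal: research papers, slides, videos, and other materials jointly convey research results, yet they currently lack structured links. This paper introduces the Multimodal Conference Dataset (MCD), the first dataset integrating research papers, presentation videos, explanatory videos, and slides, and it evaluates a range of embedding-based and vision-language models on fine-grained cross-format correspondence. Vision-language models are robust overall but fall short on fine-grained alignment, while embedding-based models handle text-visual correspondence well but place equations and symbolic content in distinct clusters, pointing to directions for future work on multimodal scientific understanding.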

Comments Accepted at the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Findings Track, 2026

Abstract

The communication of scientific knowledge has become increasingly multimodal, spanning text, visuals, and speech through materials such as research papers, slides, and recorded presentations. These different representations collectively convey a study's reasoning, results, and insights, offering complementary perspectives that enrich understanding. However, despite their shared purpose, such materials are rarely connected in a structured way. The absence of explicit links across formats makes it difficult to trace how concepts, visuals, and explanations correspond, limiting unified exploration and analysis of research content. To address this gap, we introduce the Multimodal Conference Dataset (MCD), the first benchmark that integrates research papers, presentation videos, explanatory videos, and slides from the same works. We evaluate a range of embedding-based and vision-language models to assess their ability to discover fine-grained cross-format correspondences, establishing the first systematic benchmark for this task. Our results show that vision-language models are robust but struggle with fine-grained alignment, while embedding-based models capture text-visual correspondences well but equations and symbolic content form distinct clusters in the embedding space. These findings highlight both the strengths and limitations of current approaches and point to key directions for future research in multimodal scientific understanding. To ensure reproducibility, we release the resources for MCD at https://github.com/meghamariamkm2002/MCD

2605.05736 2026-05-12 cs.AI

SDFlow: Similarity-Driven Flow Matching for Time Series Generation

Wei Li, Shibo Feng, Pengcheng Wu, Xingyu Gao, Min Wu, Peilin Zhao

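Affiliations Shanghai Jiao Tong University; Shanghai University; Nanyang Technological University; Chinese Academy of Sciences; Institute for Infocomm Research, A*STAR

AI summary This paper proposes SDFlow, a non-autoregressive time-series generation method that uses similarity-driven flow matching to generate sequences in parallel within a frozen vector-quantized (VQ) latent space, addressing the exposure bias of autoregressive models. It replaces step-wise prediction with a global transport map, reduces dimensional complexity through a low-rank manifold decomposition, and introduces discrete supervision within a variational flow-matching framework. Experiments show state-of-the-art performance on long-sequence generation, with improved generation quality and faster inference.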

Abstract

Vector quantization (VQ) with autoregressive (AR) token modeling is a widely adopted and highly competitive paradigm for time-series generation. However, such models are fundamentally limited by exposure bias: during inference, errors can accumulate across sequential predictions, leading to pronounced quality degradation in long-horizon generation. To address this, we propose SDFlow (Similarity-Driven Flow Matching), a non-autoregressive framework that operates entirely in the frozen VQ latent space and enables parallel sequence generation via flow matching. We tackle three key challenges in making this transition: (1) eliminating exposure bias by replacing step-wise token prediction with a global transport map; (2) mitigating the high-dimensionality of VQ token spaces via a low-rank manifold decomposition with a learned anchor prior over the latent manifold; and (3) incorporating discrete supervision into continuous transport dynamics by introducing a categorical posterior over codebook indices within a variational flow-matching formulation. Extensive experiments show that SDFlow achieves state-of-the-art performance, improving Discriminative Score and substantially reducing Context-FID, particularly for challenging long-sequence generation. Moreover, SDFlow provides significant inference speedups over autoregressive baselines, offering both high fidelity and computational efficiency. Code is available at https://anonymous.4open.science/r/SDFlow-D6F3/
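
For orientation, the basic flow-matching recipe that a "global transport map" builds on can be sketched in a few lines. This is the generic linear-path formulation, not SDFlow itself: a point on the path between a noise sample x0 and a data sample x1 is x_t = (1 - t)·x0 + t·x1, the regression target for the velocity network is x1 - x0, and sampling integrates the velocity field with Euler steps. The VQ latent space, anchor prior, and categorical posterior from the abstract are not modeled, and all names are hypothetical.

```python
def interpolate(x0, x1, t):
    """Point on the straight path from x0 (noise) to x1 (data) at time t."""
    return [(1 - t) * a + t * b for a, b in zip(x0, x1)]

def target_velocity(x0, x1):
    """Regression target for the velocity network on the linear path."""
    return [b - a for a, b in zip(x0, x1)]

def euler_transport(x0, velocity_fn, steps):
    """Move every position of the sequence together along the velocity field."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        v = velocity_fn(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

x0, x1 = [0.0, 0.0], [2.0, -1.0]
midpoint = interpolate(x0, x1, 0.5)
# Stand-in for a trained network: the exact velocity of this single pair.
oracle = lambda x, t: target_velocity(x0, x1)
generated = euler_transport(x0, oracle, steps=4)
```

Because every coordinate is updated in the same step, there is no left-to-right dependence between sequence positions, which is why this style of generation avoids the error accumulation described for autoregressive decoding.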

2605.05072 2026-05-12 cs.CV

Height-Guided Projection Reparameterization for Camera-LiDAR Occupancy

Yuan Wu, Zhiqiang Yan, Jiawei Lian, Zhengxue Wang, Jian Yang

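Affiliations Nanjing University of Science and Technology; National University of Singapore

AI summary This paper studies how to predict 3D scene occupancy accurately from camera and LiDAR data, focusing on the problem that conventional methods sample a fixed projection space and adapt poorly to the height variation and sparsity of real scenes. The authors propose HiPR, a framework that uses height-guided projection reparameterization to adjust the sampling range of each pillar dynamically, so that projected points fall in geometrically meaningful regions. Experiments show that HiPR significantly outperforms existing state-of-the-art methods while retaining real-time inference.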

Abstract

3D occupancy prediction aims to infer dense, voxel-wise scene semantics from sensor observations, where the 2D-to-3D view transformation serves as a crucial step in bridging image features and volumetric representations. Most previous methods rely on a fixed projection space, where 3D reference points are uniformly sampled along pillars. However, such sampling struggles to capture the sparsity and height variations of real-world scenes, leading to ambiguous correspondences and unreliable feature aggregation. To address these challenges, we propose HiPR, a camera-LiDAR occupancy framework with Height-Guided Projection Reparameterization. HiPR first encodes LiDAR into a BEV height map to capture the maximum height of the point cloud. HiPR then adjusts the sampling range of each pillar using the height prior, enabling adaptive reparameterization of the projection space. As a result, the projected points are redistributed into geometrically meaningful regions rather than fixed ranges. Meanwhile, we mask out the invalid parts of the height map to avoid misleading the feature aggregation. In addition, to alleviate the training instability caused by noisy LiDAR-derived heights, we introduce a training-time Progressive Height Conditioning strategy, which gradually transitions the conditioning signal from ground-truth heights to LiDAR heights. Extensive experiments demonstrate that HiPR consistently outperforms existing state-of-the-art methods while maintaining real-time inference. The code and pretrained models can be found at https://github.com/yanzq95/HiPR.
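
The difference between fixed and height-guided pillar sampling can be shown with a toy example. The sketch assumes the simplest possible reparameterization, clamping the top of the sampling range to the LiDAR-derived maximum height of the pillar, and it assumes that pillars without a valid height fall back to the fixed range; the abstract only says invalid parts of the height map are masked, so that fallback is a guess. Function names and numbers are hypothetical.

```python
def sample_heights(z_min, z_max, n_points):
    """Uniformly spaced reference heights along one pillar."""
    step = (z_max - z_min) / (n_points - 1)
    return [z_min + i * step for i in range(n_points)]

def height_guided_samples(z_min, z_max, lidar_height, n_points):
    """Shrink the sampling range to [z_min, lidar_height] when a height is known."""
    if lidar_height is None:
        # Assumption: no valid LiDAR height, so keep the fixed range.
        return sample_heights(z_min, z_max, n_points)
    top = min(max(lidar_height, z_min), z_max)  # clamp to the voxel grid
    return sample_heights(z_min, top, n_points)

fixed = sample_heights(-1.0, 5.0, 4)
guided = height_guided_samples(-1.0, 5.0, lidar_height=0.5, n_points=4)
unknown = height_guided_samples(-1.0, 5.0, lidar_height=None, n_points=4)
```

For a pillar whose tallest LiDAR return is at 0.5 m, the fixed scheme spends half of its reference points above anything observed, while the guided scheme places all four points between the grid floor and the observed height.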

2605.05045 2026-05-12 cs.CV cs.CL

When Relations Break: Analyzing Relation Hallucination in Vision-Language Model Under Rotation and Noise

Philip Wootaek Shin, Ajay Narayanan Sridhar, Sivani Devarapalli, Rui Zhang, Jack Sampson, Vijaykrishnan Narayanan

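Affiliations The Pennsylvania State University

AI summary This study analyzes the relation hallucination that vision-language models exhibit under visual perturbations such as rotation and noise, showing that even mild image distortions significantly impair reasoning about relations between objects. It evaluates several prompt-based augmentation and preprocessing strategies and finds that they partially mitigate the problem but do not eliminate relation hallucination. The results indicate a remaining gap between perceptual robustness and relational understanding and a need for more geometry-aware vision-language models.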

Abstract

Vision-language models (VLMs) achieve strong multimodal performance but remain prone to relation hallucination, since avoiding it requires accurate reasoning over inter-object interactions. We study the impact of visual perturbations, specifically rotation and noise, and show that even mild distortions significantly degrade relational reasoning across models and datasets. We further evaluate prompt-based augmentation and preprocessing strategies (orientation correction and denoising), finding that while they offer partial improvements, they do not fully resolve hallucinations. Our results reveal a gap between perceptual robustness and relational understanding, highlighting the need for more robust, geometry-aware VLMs.

2605.04012 2026-05-12 cs.AI

SymptomAI: Toward a Conversational AI Agent for Everyday Symptom Assessment

Joseph Breda, Fadi Yousif, Beszel Hawkins, Marinela Cotoi, Miao Liu, Ray Luo, Po-Hsuan Cameron Chen, Mike Schaekermann, Samuel Schmidgall, Xin Liu, Girish Narayanswamy, Samuel Solomon, Maxwell A. Xu, Xiaoran Fan, Longfei Shangguan, Anran Wang, Bhavna Daryani, Buddy Herkenham, Cara Tan, Mark Malhotra, Shwetak Patel, John B. Hernandez, Quang Duong, Yun Liu, Zach Wasson, Dimitrios Antos, Bob Lou, Matthew Thompson, Jonathan Richina, Anupam Pathak, Nichole Young-Lin, Jake Sunshine, Daniel McDuff

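Affiliations Google Research; Google DeepMind

AI summary This study presents SymptomAI, a conversational AI agent for end-to-end interviewing and differential diagnosis of everyday symptoms. In a large-scale randomized study in the Fitbit app with 13,917 participants, SymptomAI produced more accurate diagnoses than independent clinicians. Agents using a dedicated symptom-interview strategy performed significantly better than user-guided conversations, and the study found strong associations between acute infections and physiological shifts.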

Comments 13-page main text, 54 pages total, 16 figures total

Abstract

Language models excel at diagnostic assessments on curated medical case studies and vignettes, performing on par with, or better than, clinical professionals. However, existing studies focus on complex scenarios with rich context, making it difficult to draw conclusions about how these systems perform for patients reporting symptoms in everyday life. We deployed SymptomAI, a set of conversational AI agents for end-to-end patient interviewing and differential diagnosis (DDx), via the Fitbit app in a study that randomized participants (N=13,917) to interact with five AI agents. This corpus captures diverse communication and a realistic distribution of illnesses from a real-world population. A subset of 1,228 participants reported a clinician-provided diagnosis, and 517 of these were further evaluated by a panel of clinicians during over 250 hours of annotation. SymptomAI DDx were significantly more accurate (OR = 2.56, p < 0.001) than those from independent clinicians given the same dialogue in a blinded randomized comparison. Moreover, agentic strategies that conduct a dedicated symptom interview, eliciting additional symptom information before providing a diagnosis, perform substantially better than baseline user-guided conversations (p < 0.001). An auxiliary analysis on 1,509 conversations from a general US population panel validated that these results generalize beyond wearable device users. We used SymptomAI diagnoses as labels for all 13,917 participants to analyze over 500,000 days of wearable metrics across nearly 400 unique conditions. We identified strong associations between acute infections and physiological shifts (e.g., OR > 7 for influenza). While limited by self-reported ground truth, these results demonstrate the benefits of a dedicated and complete symptom interview compared to a user-guided symptom discussion, which is the default of most consumer LLMs.

2605.03652 2026-05-12 cs.CV cs.AI

AniMatrix: An Anime Video Generation Model that Thinks in Art, Not Physics

Tencent HY Team

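Affiliations * Tencent HY Team

AI summary This paper presents AniMatrix, a video generation model designed specifically for anime's artistic style rather than relying on physical realism as its prior. Through a dual-channel conditioning mechanism and a three-step transition strategy, the model redefines what counts as "correct," overcomes conventional models' dependence on physical laws, and separates artistic expression from generation failure. Experiments show that AniMatrix performs strongly in evaluations by professional animators, with significant gains over existing models in prompt understanding and artistic motion generation.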

Comments 37 pages, 1 main figure (qualitative comparison), 1 TikZ architecture diagram; technical report. Model weights and inference code to be released

Details
English abstract

Video generation models internalize physical realism as their prior. Anime deliberately violates physics: smears, impact frames, chibi shifts; and its thousands of coexisting artistic conventions yield no single "physics of anime" a model can absorb. Physics-biased models therefore flatten the artistry that defines the medium or collapse under its stylistic variance. We present AniMatrix, a video generation model that targets artistic rather than physical correctness through a dual-channel conditioning mechanism and a three-step transition: redefine correctness, override the physics prior, and distinguish art from failure. First, a Production Knowledge System encodes anime as a structured taxonomy of controllable production variables (Style, Motion, Camera, VFX), and AniCaption infers these variables from pixels as directorial directives. A trainable tag encoder preserves the field-value structure of this taxonomy while a frozen T5 encoder handles free-form narrative; dual-path injection (cross-attention for fine-grained control, AdaLN modulation for global enforcement) ensures categorical directives are never diluted by open-ended text. Second, a style-motion-deformation curriculum transitions the model from near-physical motion to full anime expressiveness. Third, deformation-aware preference optimization with a domain-specific reward model separates intentional artistry from pathological collapse. On an anime-specific human evaluation with five production dimensions scored by professional animators, AniMatrix ranks first on four of five, with the largest gains over Seedance-Pro 1.0 on Prompt Understanding (+0.70, +22.4 percent) and Artistic Motion (+0.55, +16.9 percent). We are preparing accompanying resources for public release to support reproducibility and follow-up research.

2605.03438 2026-05-12 cs.CV

Mantis: Mamba-native Tuning is Efficient for 3D Point Cloud Foundation Models

Zihao Guo, Jihua Zhu, Jian Liu, Ajmal Saeed Mian

Affiliations * Xi'an Jiaotong University; School of Artificial Intelligence and Robotics, Hunan University; University of Western Australia

AI summary This paper presents Mantis, a parameter-efficient fine-tuning framework built for Mamba-based 3D point cloud foundation models. It introduces a State-Aware Adapter (SAA) that provides fine-grained, state-level adaptation while the pre-trained backbone stays frozen, and it uses Dual-Serialization Consistency Distillation (DSCD) to reduce serialization-induced instability. Experiments show that Mantis achieves competitive performance on multiple benchmarks with only about 5% trainable parameters.

Details
English abstract

Pre-trained 3D point cloud foundation models (PFMs) have demonstrated strong transferability across diverse downstream tasks. However, fully fine-tuning these models is computationally expensive and storage-intensive. Parameter-efficient fine-tuning (PEFT) offers a promising alternative, but existing PEFT approaches are primarily designed for Transformer-based backbones and rely on token-level prompting or feature transformation. Mamba-based backbones introduce a granularity mismatch between token-level adaptation and state-level sequence dynamics. Consequently, straightforward transfer of existing PEFT approaches to frozen Mamba backbones leads to substantial accuracy degradation and unstable optimization. To address this issue, we propose Mantis, the first Mamba-native PEFT framework for 3D PFMs. Specifically, a State-Aware Adapter (SAA) is introduced to inject lightweight task-conditioned control signals into selective state-space updates, enabling state-level adaptation while keeping the pre-trained backbone frozen. Moreover, different valid point cloud serializations are regularized by Dual-Serialization Consistency Distillation (DSCD), thereby reducing serialization-induced instability. Extensive experiments across multiple benchmarks demonstrate that Mantis achieves competitive performance with only about 5% trainable parameters. Our code is available at https://github.com/gzhhhhhhh/Mantis.

2605.02948 2026-05-12 cs.LG cs.AI cs.SD

AsymTalker: Identity-Consistent Long-Term Talking Head Generation via Asymmetric Distillation

Yuxin Lu, Jiayang Sun, Guibo Zhu, Min Cao

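Affiliations * Soochow University; Institute of Automation, Chinese Academy of Sciences

AI summary AsymTalker is a diffusion-based method for long-term talking head generation that targets the identity inconsistency and temporal-spatial misalignment seen in existing methods on long videos. It introduces Temporal Reference Encoding (TRE) to mitigate the misalignment between static identity references and dynamic audio streams, and Asymmetric Knowledge Distillation (AKD) to address identity drift during chunk-wise generation. Experiments show that AsymTalker generates videos up to 600 seconds long with high fidelity and consistent identity, runs in real time at 66 frames per second, and achieves state-of-the-art performance.

Details
English abstract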

Diffusion-based talking head generation has achieved remarkable visual quality, yet scaling it to long-term videos remains challenging. The widely adopted chunk-wise paradigm introduces two fundamental failures: (1) temporal-spatial misalignment between static identity references and dynamic audio streams, and (2) cascading identity drift propagated through self-generated continuity references across chunks. To address both issues, we propose AsymTalker, a novel diffusion-based talking head generation method comprising Temporal Reference Encoding (TRE) and Asymmetric Knowledge Distillation (AKD). First, TRE mitigates temporal-spatial misalignment by transforming the static identity image into a temporally coherent latent representation through encoding of a temporally replicated pseudo-video, without introducing additional parameters. Second, AKD resolves the inherent conditioning dilemma in chunk-wise training: using ground-truth references causes train-inference mismatch, while self-generated references entangle supervision with identity drift. Our asymmetric design circumvents this by anchoring the teacher model with ground-truth continuity references to provide drift-free, chunk-level supervision, thereby avoiding the teacher bottleneck. Meanwhile, the student model learns under inference-aligned conditions, conditioned only on self-generated references, and is trained via distribution matching to preserve identity over long horizons. Extensive experiments show AsymTalker achieves state-of-the-art results on HDTF and VFHQ. It guarantees high-fidelity, identity-consistent synthesis over 600-second videos and reaches a real-time inference speed of 66 FPS.

2605.02751 2026-05-12 cs.AI cs.CL

Mitigating Misalignment Contagion by Steering with Implicit Traits

Maria Chang, Ronny Luss, Miao Liu, Keerthiram Murugesan, Karthikeyan Ramamurthy, Djallel Bouneffouf

Affiliations * IBM Research Yorktown Heights

AI summary In multi-agent settings it is critical that language models (LMs) follow instructions and stay value-aligned, yet most existing work studies alignment between a single model and a user and overlooks how misalignment can spread when multiple models interact. Using multi-turn conversational social dilemma games, the paper finds that models can behave more anti-socially through interaction, and that the effect intensifies when other models are steered to act maliciously. To mitigate this, the authors propose steering with implicit traits: intermittently injecting system prompts that reinforce the model's initial pro-social behavior. The method curbs the spread of misalignment without access to model parameters or internal states, which suits multi-agent applications built on black-box models.

Details
English abstract

Language models (LMs) are increasingly used in high-stakes, multi-agent settings, where following instructions and maintaining value alignment are critical. Most alignment research focuses on interactions between a single LM and a single user, failing to address the risk of misaligned behavior spreading between multiple LMs in multi-turn interactions. We find evidence of this phenomenon, which we call misalignment contagion, across multiple LMs as they engage in multi-turn conversational social dilemma games. Specifically, we find that LMs become more anti-social after gameplay and that this effect is intensified when other players are steered to act maliciously. We explore different steering techniques to mitigate such misalignment contagion and find that reinforcing an LM's system prompt is insufficient and often harmful. Instead, we propose steering with implicit traits: a technique that intermittently injects system prompts with statements that reinforce an LM's initial traits and is more effective than system prompt repetition at keeping models in line with their initial pro-social behaviors. Importantly, this method does not require access to model parameters or internal model states, making it suitable for increasingly common use cases where complex multi-agent workflows are being designed with black-box models.

2605.02487 2026-05-12 cs.RO

Visibility-Aware Mobile Grasping in Dynamic Environments

Tianrun Hu, Anxing Xiao, David Hsu, Hanbo Zhang

Affiliations * School of Computing, National University of Singapore; Smart Systems Institute, National University of Singapore

AI summary This paper studies mobile grasping in dynamic, unknown environments, focusing on coordinating visual perception with body motion under a limited field of view. It proposes a unified mobile grasping system with a behavior-tree-based hierarchical planner and a whole-body motion planner coupled with active perception, allowing the robot to navigate safely and complete grasps among dynamic obstacles. Experiments report success rates of 68.8% and 58.0% in unknown static and dynamic environments, respectively, significantly higher than existing methods.

Details
English abstract

This paper addresses the problem of mobile grasping in dynamic, unknown environments where a robot must operate under a limited field of view. The fundamental challenge is the inherent trade-off between "seeing" around to reduce environmental uncertainty and "moving" the body to achieve task progress in a high-dimensional configuration space, subject to visibility constraints. Previous approaches often assume known or static environments and decouple these objectives, failing to guarantee safety when unobserved dynamic obstacles intersect the robot's path during manipulation. In this paper, we propose a unified mobile grasping system comprising two core components: (1) an iterative low-level whole-body planner coupled with velocity-aware active perception to navigate dynamic environments safely; and (2) a hierarchical high-level planner based on behavior trees that adaptively generates subgoals to guide the robot through exploration and runtime failures. We provide experimental results across 400 randomized simulation scenarios and real-world deployment on a Fetch mobile manipulator. Results show that our system achieves success rates of 68.8% and 58.0% in unknown static and dynamic environments, respectively, boosting success rates by 22.8% and 18.0% over the \nam approach, with improved collision safety.

2605.01402 2026-05-12 cs.CL cs.CV cs.LG

Injecting Distributional Awareness into MLLMs via Reinforcement Learning for Deep Imbalanced Regression

Yao Du, Shanshan Song, Xiaomeng Li

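Affiliations * The Hong Kong University of Science and Technology

AI summary Multimodal large language models (MLLMs) perform poorly on numerical regression with long-tailed target distributions: existing token-based supervised fine-tuning tends to favor high-density regions, causing regression to the mean and degraded tail performance. This paper proposes a distribution-aware reinforcement learning framework based on Group Relative Policy Optimization, with a reward built on the Concordance Correlation Coefficient that supplies batch-level, cross-sample comparative supervision and aligns predicted and ground-truth distributions in correlation, scale, and mean. The method requires no change to the model architecture, and experiments show it outperforms conventional fine-tuning on a range of long-tailed regression benchmarks, especially in medium- and few-shot regimes.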

Comments Accepted by ICML 2026

Details
English abstract

Multimodal large language models (MLLMs) struggle with numerical regression under long-tailed target distributions. Token-level supervised fine-tuning (SFT) and point-wise regression rewards bias learning toward high-density regions, leading to regression-to-the-mean behavior and poor tail performance. We identify the lack of cross-sample relational supervision as a key limitation of existing MLLM training paradigms. To address it, we propose a distribution-aware reinforcement learning framework based on Group Relative Policy Optimization, which introduces batch-level comparison-based supervision via the Concordance Correlation Coefficient-based reward to align predicted and ground-truth distributions in terms of correlation, scale, and mean. The framework is plug-and-play, requiring no architectural modification. Experiments on a unified suite of long-tailed regression benchmarks show consistent improvements over SFT and existing MLLM regression methods, with particularly strong gains in medium- and few-shot regimes.
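
The reward described above rests on the Concordance Correlation Coefficient (CCC), which penalizes errors in correlation, scale, and mean at once. The following is a minimal sketch of the standard CCC formula over a batch of predictions; the function name and the idea of using the raw coefficient directly as a batch-level reward are illustrative assumptions, not the authors' implementation.

```python
def concordance_correlation(pred, target):
    """Lin's concordance correlation coefficient, using population moments.

    CCC = 2 * cov(p, t) / (var(p) + var(t) + (mean(p) - mean(t))**2)
    """
    n = len(pred)
    if n != len(target) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_p = sum(pred) / n
    mean_t = sum(target) / n
    var_p = sum((p - mean_p) ** 2 for p in pred) / n
    var_t = sum((t - mean_t) ** 2 for t in target) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target)) / n
    denom = var_p + var_t + (mean_p - mean_t) ** 2
    if denom == 0.0:
        # Both sequences are constant and equal: treat as perfect agreement.
        return 1.0
    return 2.0 * cov / denom


# Hypothetical usage: one scalar reward shared by a batch of predictions.
batch_reward = concordance_correlation([2.0, 3.0, 4.0], [1.0, 2.0, 3.0])
```

Unlike Pearson correlation, which would score the shifted batch above as perfect, CCC drops below 1 because the means differ, and a batch that collapses to a constant prediction scores 0.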

2605.00642 2026-05-12 cs.AI cs.CV

Learn where to Click from Yourself: On-Policy Self-Distillation for GUI Grounding

Yan Zhang, Daiqing Wu, Huawen Shen, Can Ma, Yu Zhou

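Affiliations * Institute of Information Engineering, Chinese Academy of Sciences; VCIP & TMCC & DISSec, College of Computer Science, Nankai University; School of Cyber Security, University of Chinese Academy of Sciences

AI summary This paper presents GUI-SD, the first on-policy self-distillation (OPSD) framework for GUI grounding, aimed at the training inefficiency and sparse signals of existing reinforcement learning methods. By constructing a visually enriched privileged context and introducing an entropy-guided distillation strategy, it obtains dense supervision from a single rollout and improves both grounding accuracy and training efficiency. Experiments show that GUI-SD outperforms existing methods on six representative benchmarks.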

Comments under review

Details
English abstract

Graphical User Interface (GUI) grounding maps natural language instructions to the visual coordinates of target elements and serves as a core capability for autonomous GUI agents. Recent reinforcement learning methods (e.g., GRPO) have achieved strong performance, but they rely on expensive multiple rollouts and suffer from sparse signals on hard samples. These limitations make on-policy self-distillation (OPSD), which provides dense token-level supervision from a single rollout, a promising alternative. However, its applicability to GUI grounding remains unexplored. In this paper, we present GUI-SD, the first OPSD framework tailored for GUI grounding. First, it constructs a visually enriched privileged context for the teacher using a target bounding box and a Gaussian soft mask, providing informative guidance without leaking exact coordinates. Second, it employs entropy-guided distillation, which adaptively weights tokens based on digit significance and teacher confidence, concentrating optimization on the most impactful and reliable positions. Extensive experiments on six representative GUI grounding benchmarks show that GUI-SD consistently outperforms GRPO-based methods and naive OPSD in both accuracy and training efficiency. Code and training data are available at https://zhangyan-ucas.github.io/GUI-SD/.
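
To make the "Gaussian soft mask" idea concrete, here is a minimal sketch of one way to turn a target bounding box into a soft spatial weighting. The abstract does not specify the construction, so the function name, the `sigma_scale` parameter, and the choice of tying the spread to the box size are all assumptions for illustration.

```python
import math


def gaussian_soft_mask(width, height, box, sigma_scale=0.5):
    """Return a height x width grid of weights in (0, 1], peaked at the box centre.

    box is (x0, y0, x1, y1) in pixel coordinates. The spread along each axis is
    sigma_scale times the box extent (an assumed, hypothetical choice).
    """
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    sx = max((x1 - x0) * sigma_scale, 1e-6)  # guard against zero-width boxes
    sy = max((y1 - y0) * sigma_scale, 1e-6)
    mask = []
    for y in range(height):
        row = []
        for x in range(width):
            d2 = ((x - cx) / sx) ** 2 + ((y - cy) / sy) ** 2
            row.append(math.exp(-0.5 * d2))
        mask.append(row)
    return mask


# Hypothetical usage: a 9x9 screen with a target element spanning (2, 2)-(6, 6).
mask = gaussian_soft_mask(9, 9, (2, 2, 6, 6))
```

A mask like this highlights the neighbourhood of the target without encoding its exact corner coordinates as text, which is in the spirit of guidance that does not leak the answer.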

2605.00623 2026-05-12 cs.RO

Recovering Hidden Reward in Diffusion-Based Policies

Yanbiao Ji, Qiuchang Li, Yuting Hu, Shaokai Wu, Wenyuan Xie, Guodong Zhang, Qicheng He, Deyi Ji, Yue Ding, Hongtao Lu

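Affiliations * Shanghai Jiao Tong University

AI summary This paper presents EnergyFlow, a framework that combines generative action modeling with inverse reinforcement learning by parameterizing a scalar energy function whose gradient is the denoising field. Under maximum-entropy optimality, the score function learned via denoising score matching recovers the gradient of the expert's soft Q-function, so rewards can be extracted without adversarial training. Experiments show that EnergyFlow achieves leading imitation performance on a variety of manipulation tasks and provides an effective reward signal for downstream reinforcement learning, outperforming adversarial inverse reinforcement learning and likelihood-based methods.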

Comments Accepted by ICML 2026

Details
English abstract

This paper introduces EnergyFlow, a framework that unifies generative action modeling with inverse reinforcement learning by parameterizing a scalar energy function whose gradient is the denoising field. We establish that under maximum-entropy optimality, the score function learned via denoising score matching recovers the gradient of the expert's soft Q-function, enabling reward extraction without adversarial training. Formally, we prove that constraining the learned field to be conservative reduces hypothesis complexity and tightens out-of-distribution generalization bounds. We further characterize the identifiability of recovered rewards and bound how score estimation errors propagate to action preferences. Empirically, EnergyFlow achieves state-of-the-art imitation performance on various manipulation tasks while providing an effective reward signal for downstream reinforcement learning that outperforms both adversarial IRL methods and likelihood-based alternatives. These results show that the structural constraints required for valid reward extraction simultaneously serve as beneficial inductive biases for policy generalization. The code is available at https://github.com/sotaagi/EnergyFlow.
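
The link between the score and the soft Q-function can be sketched with the standard maximum-entropy policy form. The symbols below (temperature $\alpha$, soft value $V_{\mathrm{soft}}$, energy $E_\theta$, and the sign convention for the energy) are assumed notation that the abstract does not spell out, and the approximation holds only to the extent that the learned score matches the expert policy's score.

```latex
\begin{aligned}
\pi^{*}(a \mid s) &= \exp\!\big( (Q_{\mathrm{soft}}(s,a) - V_{\mathrm{soft}}(s)) / \alpha \big), \\
\nabla_a \log \pi^{*}(a \mid s) &= \tfrac{1}{\alpha}\, \nabla_a Q_{\mathrm{soft}}(s,a), \\
-\nabla_a E_\theta(s,a) \approx \nabla_a \log \pi^{*}(a \mid s)
\;&\Longrightarrow\;
E_\theta(s,a) \approx -\tfrac{1}{\alpha}\, Q_{\mathrm{soft}}(s,a) + c(s).
\end{aligned}
```

Because $V_{\mathrm{soft}}(s)$ does not depend on the action, it vanishes under $\nabla_a$, which is why the energy is pinned down only up to a state-dependent offset $c(s)$. A field defined as the gradient of a scalar is conservative by construction, consistent with the constraint the abstract mentions.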

2605.00548 2026-05-12 cs.CV cs.GR

Colorful-Noise: Training-Free Low-Frequency Noise Manipulation for Color-Based Conditional Image Generation

Nadav Z. Cohen, Ofir Abramovich, Ariel Shamir

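Affiliations * Reichman University

AI summary This paper studies the properties of the input noise in diffusion models and finds that the low-frequency components of white noise largely determine an image's global structure and color composition, while high-frequency components control details. Building on this, the authors propose a training-free method that steers generation through simple manipulations of the low-frequency noise, giving effective control over overall image structure and color while preserving output diversity.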

Comments SIGGRAPH 2026 Conference Paper. Project Page at: https://nadavc220.github.io/colorful-noise/

Details
English abstract

Text-to-image diffusion models generate images by gradually converting white Gaussian noise into a natural image. White Gaussian noise is well suited for producing diverse outputs from a single text prompt due to its absence of structure. However, this very property limits control over, and predictability of, specific visual attributes, as the noise is not human-interpretable. In this work, we investigate the characteristics of the input noise in diffusion models. We show that, although all frequencies in white Gaussian noise have comparable statistical energy, low-frequency components primarily determine the image's global structure and color composition, while high-frequency components control finer details. Building on this observation, we demonstrate that simple manipulations of the low-frequency noise using low-frequency image priors can effectively condition the generation process to reconstruct these low-frequency visual cues. This allows us to define a simple, training-free method with minimal overhead that steers overall image structure and color, while letting high-frequency components freely emerge as fine details, enabling variability across generated outputs.
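
The frequency split described above can be illustrated on a 1D signal: keep the high-frequency bins of the noise and overwrite the low-frequency bins with those of a prior. This is only a sketch of the general idea under stated assumptions; the paper's actual manipulation, its 2D latent layout, and any rescaling of the prior are not given in the abstract, and all function names and the `cutoff` parameter are hypothetical.

```python
import cmath
import random


def dft(signal):
    """Naive discrete Fourier transform (O(n^2)), fine for short signals."""
    n = len(signal)
    return [sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Inverse DFT, returning the real part of each sample."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)).real / n
            for t in range(n)]


def is_low_frequency(k, n, cutoff):
    # Bins k and n - k carry the same frequency for a real-valued signal.
    return min(k, n - k) <= cutoff


def inject_low_frequencies(noise, prior, cutoff):
    """Replace the low-frequency bins of `noise` with those of `prior`."""
    n = len(noise)
    noise_f, prior_f = dft(noise), dft(prior)
    mixed = [prior_f[k] if is_low_frequency(k, n, cutoff) else noise_f[k]
             for k in range(n)]
    return idft(mixed)


# Hypothetical usage: a step-shaped prior steering 16 samples of Gaussian noise.
random.seed(0)
noise = [random.gauss(0.0, 1.0) for _ in range(16)]
prior = [1.0] * 8 + [-1.0] * 8
steered = inject_low_frequencies(noise, prior, cutoff=2)
```

One caveat of this naive swap is that it changes the statistics of the noise unless the prior's spectrum is scaled to match; how the authors handle that is not stated in the abstract.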

2605.00408 2026-05-12 cs.CV

Beyond Heuristics: Learnable Density Control for 3D Gaussian Splatting

Zhenhua Ning, Xin Li, Jun Yu, Guangming Lu, Yaowei Wang, Wenjie Pei

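Affiliations * Pengcheng Laboratory, Shenzhen; Harbin Institute of Technology, Shenzhen

AI summary This paper presents LeGS, a learnable density control method that improves 3D Gaussian Splatting (3DGS) by removing its reliance on heuristic density control rules. It models density control as a parameterized policy network optimized with reinforcement learning and designs an effective reward function, grounded in sensitivity analysis, that precisely quantifies each Gaussian's contribution to reconstruction quality. Experiments show that LeGS significantly outperforms existing methods on multiple datasets and strikes a better balance between reconstruction quality and computational efficiency.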

Comments 9 pages, 5 figures

Details
English abstract

While 3D Gaussian Splatting (3DGS) has demonstrated impressive real-time rendering performance, its efficacy remains constrained by a reliance on heuristic density control. Despite numerous refinements to these handcrafted rules, such methods inherently lack the flexibility to adapt to diverse scenes with complex geometries. In this paper, we propose a paradigm shift for density control from rigid heuristics to fully learnable policies. Specifically, we introduce LeGS, a framework that reformulates density control as a parameterized policy network optimized via Reinforcement Learning (RL). Central to our approach is a tailored, effective reward function grounded in sensitivity analysis, which precisely quantifies the marginal contribution of individual Gaussians to reconstruction quality. To maintain computational tractability, we derive a closed-form solution that reduces the complexity of reward calculation from $O(N^2)$ to $O(N)$. Extensive experiments on the Mip-NeRF 360, Tanks & Temples, and Deep Blending datasets demonstrate that LeGS significantly outperforms state-of-the-art methods, striking a superior balance between reconstruction quality and efficiency. The code will be released at https://github.com/AaronNZH/LeGS

2604.27224 2026-05-12 cs.RO

Learning Tactile-Aware Quadrupedal Loco-Manipulation Policies

Pokuang Zhou, Yuhao Zhou, Quan Khanh Luu, Seungho Han, Heng Zhang, Binghao Huang, Yunzhu Li, Arash Ajoudani, Zhengtong Xu, Yu She

Affiliations * Purdue University; Istituto Italiano di Tecnologia; Columbia University

AI summary This paper studies how tactile sensing can improve quadruped robots' locomotion and manipulation in contact-rich environments. The authors propose a hierarchical tactile-aware policy learning framework: a high-level visuotactile policy is trained on real human demonstrations, and a low-level tactile-aware whole-body control policy is trained with large-scale reinforcement learning in simulation and transfers zero-shot to the real world. Experiments show an average improvement of 28.54% over vision-only and visuotactile baselines across several contact-rich tasks.

Details
English abstract

Quadrupedal loco-manipulation is commonly built on visual perception and proprioception. Yet reliable contact-rich manipulation remains difficult: vision and proprioception alone cannot resolve uncertain, evolving interactions with the environment. Tactile sensing offers direct contact observability, but a scalable tactile-aware learning framework for quadrupedal loco-manipulation remains underexplored. In this paper, we present a tactile-aware loco-manipulation policy learning pipeline with a hierarchical structure. Our approach has two key components. First, we leverage real-world human demonstrations to train a tactile-conditioned visuotactile high-level policy. This policy predicts not only end-effector trajectories for manipulation, but also the evolving tactile interaction cues that characterize how contact should develop over time. Second, we perform large-scale reinforcement learning in simulation to learn a tactile-aware whole-body control policy that tracks diverse commanded trajectories and tactile interaction cues, and transfers zero-shot to the real world. Together, these components enable coordinated locomotion and manipulation under contact-rich scenarios. We evaluate the system on real-world contact-rich tasks, including in-hand reorientation with insertion, valve tightening, and delicate object manipulation. Compared to vision-only and visuotactile baselines, our method improves performance by 28.54% on average across these tasks.

2604.26805 2026-05-12 cs.AI cs.MA

Bian Que: An Agentic Framework with Flexible Skill Arrangement for Online System Operations

Bochao Liu, Zhipeng Qian, Yang Zhao, Xinyuan Jiang, Zihan Liang, Yufei Ma, Junpeng Zhuang, Ben Chen, Shuo Yang, Hongen Wan, Yao Wu, Chenyi Lei, Xiao Liang

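Affiliations * Kuaishou Technology

AI summary This paper presents Bian Que, an agentic operations framework intended to make operating and maintaining large-scale online systems (such as search, recommendation, and advertising) more efficient. Through a unified operational paradigm and flexible skill arrangement, it matches each operational event with the relevant data and knowledge, addressing the information overload and manual-configuration burden of conventional approaches. Its contributions are a unified operational paradigm, a mechanism for automatically generating and refining Skills, and a self-evolving learning mechanism. Deployed on Kuaishou's e-commerce search engine, it significantly improved operational efficiency and accuracy.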

Comments HomePage: https://benchen4395.github.io

Details
English abstract

Operating and maintaining (O&M) large-scale online engine systems (e.g., search, recommendation, and advertising) demands substantial human effort for release monitoring, alert response, and root cause analysis. Despite the inherent suitability of LLM-based agents for such operational scenarios, the critical bottleneck impeding their practical deployment lies not in reasoning but in orchestration capability, specifically the precise selection of relevant data (encompassing metrics, logs, and change events) and applicable knowledge (including handbook-defined rules and empirically derived practitioner experience) tailored to each individual operational event. Feeding all signals indiscriminately causes dilution and hallucination, while manually curating the event-to-(data, knowledge) mapping is intractable under dozens of daily releases. Here we present Bian Que, an agentic operating framework with three contributions: (i) a unified operational paradigm, which abstracts routine daily O&M actions into three canonical patterns: release interception, proactive inspection, and alert root cause analysis; (ii) a flexible Skill Arrangement, in which each predefined Skill explicitly defines the requisite data and operational knowledge for a specific context, and Skills can be automatically generated and updated by LLM agents as well as iteratively optimized by on-call engineers via natural language instructions; and (iii) a unified self-evolving mechanism, where each correction signal enables two parallel evolutionary pathways: distilling event memory into knowledge, and targeted refinement of Skills. Deployed on the e-commerce search engine of KuaiShou, Bian Que reduces alert volume by 75%, achieves 80% root-cause analysis accuracy, cuts mean time to resolution by over 50%, and attains a 99.0% pass rate on offline evaluations. Code is available at https://github.com/benchen4395/BianQue_Assistant.

2604.24954 2026-05-12 cs.LG cs.AI cs.CV

Nemotron 3 Nano Omni: Efficient and Open Multimodal Intelligence

NVIDIA: Amala Sanjay Deshmukh, Kateryna Chumachenko, Tuomas Rintamaki, Matthieu Le, Tyler Poon, Danial Mohseni Taheri, Ilia Karmanov, Guilin Liu, Jarno Seppanen, Arushi Goel, Mike Ranzinger, Greg Heinrich, Guo Chen, Lukas Voegtle, Philipp Fischer, Timo Roman, Karan Sapra, Collin McCarthy, Shaokun Zhang, Fuxiao Liu, Hanrong Ye, Yi Dong, Mingjie Liu, Yifan Peng, Piotr Zelasko, Zhehuai Chen, Nithin Rao Koluguri, Nune Tadevosyan, Lilit Grigoryan, Ehsan Hosseini Asl, Pritam Biswas, Leili Tavabi, Yuanhang Su, Zhiding Yu, Peter Jin, Alexandre Milesi, Netanel Haber, Yao Xu, Sarah Amiraslani, Nabin Mulepati, Eric Tramel, Jaehun Jung, Ximing Lu, Brandon Cui, Jin Xu, Zhiqi Li, Shihao Wang, Yuanguo Kuang, Shaokun Zhang, Huck Yang, Boyi Li, Hongxu Yin, Song Han, Bilal Kartal, Pavlo Molchanov, Adi Renduchintala, Charles Wang, David Mosallanezhad, Soumye Singhal, Luis Vega, Katherine Cheung, Sreyan Ghosh, Yian Zhang, Alexander Bukharin, Venkat Srinivasan, Johnny Greco, Andre Manoel, Maarten Van Segbroeck, Suseella Panguliri, Rohit Watve, Divyanshu Kakwani, Shubham Pachori, Jeffrey Glick, Radha Sri-Tharan, Aileen Zaman, Khanh Nguyen, Shi Chen, Jiaheng Fang, Qing Miao, Wenfei Zhou, Yu Wang, Zaid Pervaiz Bhat, Varun Praveen, Arihant Jain, Ramanathan Arunachalam, Tomasz Kornuta, Ashton Sharabiani, Amy Shen, Wei Huang, Yi-Fu Wu, Ali Roshan Ghias, Huiying Li, Brian Yu, Nima Tajbakhsh, Chen Cui, Wenwen Gao, Li Ding, Terry Kong, Manoj Kilaru, Anahita Bhiwandiwalla, Marek Wawrzos, Daniel Korzekwa, Pablo Ribalta, Grzegorz Chlebus, Besmira Nushi, Ewa Dobrowolska, Maciej Jakub Mikulski, Kunal Dhawan, Steve Huang, Jagadeesh Balam, Yongqiang Wang, Nikolay Karpov, Valentin Mendelev, George Zelenfroynd, Meline Mkrtchyan, Qing Miao, Omri Almog, Bhavesh Pawar, Rameshwar Shivbhakta, Sudeep Sabnis, Ashrton Sharabiani, Negar Habibi, Geethapriya Venkataramani, Pamela Peng, Prerit Rodney, Serge Panev, Richard Mazzarese, Nicky Liu, Michael Fukuyama, Andrii Skliar, Roger Waleffe, Duncan Riach, Yunheng Zou, Jian Hu, Hao Zhang, Binfeng Xu, Yuhao Yang, Zuhair Ahmed, Alexandre Milesi, Carlo del Mundo, Chad Voegele, Zhiyu Cheng, Nave Assaf, Andrii Skliar, Daniel Afrimi, Natan Bagrov, Ran Zilberstein, Ofri Masad, Eugene Khvedchenia, Natan Bagrov, Borys Tymchenko, Tomer Asida, Daniel Afrimi, Parth Mannan, Victor Cui, Michael Evans, Katherine Luna, Jie Lou, Pinky Xu, Guyue Huang, Negar Habibi, Michael Boone, Pradeep Thalasta, Adeola Adesoba, Dina Yared, Christopher Parisien, Leon Derczynski, Shaona Ghosh, Wes Feely, Micah Schaffer, Radha Sri-Tharan, Jeffrey Glick, Barnaby Simkin, George Zelenfroynd, Tomasz Grzegorzek, Rishabh Garg, Aastha Jhunjhunwala, Sergei Kolchenko, Farzan Memarian, Haran Kumar, Shiv Kumar, Isabel Hulseman, Anjali Shah, Kari Briski, Padmavathy Subramanian, Joey Conway, Udi Karpas, Jane Polak Scowcroft, Annie Surla, Shilpa Ammireddy, Ellie Evans, Jesse Oliver, Tom Balough, Chia-Chih Chen, Sandip Bhaskar, Alejandra Rico, Bardiya Sadeghi, Seph Mard, Katherine Cheung, Meredith Price, Laya Sleiman, Saori Kaji, Wesley Helmholz, Wendy Quan, Michael Lightstone, Jonathan Cohen, Jian Zhang, Oleksii Kuchaiev, Boris Ginsburg, Jan Kautz, Eileen Long, Mohammad Shoeybi, Mostofa Patwary, Oluwatobi Olabiyi, Andrew Tao, Bryan Catanzaro, Udi Karpas

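Affiliations * NVIDIA

AI summary This paper introduces Nemotron 3 Nano Omni, the latest model in the Nemotron multimodal series and the first to natively support audio input alongside text, images, and video. With improvements in architecture, training data, and training recipes, it is more accurate across tasks in every modality, and stands out in real-world document understanding, long audio-video comprehension, and agentic computer use. Built on the efficient Nemotron 3 Nano 30B-A3B backbone, it adds multimodal token-reduction techniques that substantially lower inference latency and raise throughput, and it is released with model weights in several precision formats plus portions of the training data and code to support further research.

Details
English abstract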

We introduce Nemotron 3 Nano Omni, the latest model in the Nemotron multimodal series and the first to natively support audio inputs alongside text, images, and video. Nemotron 3 Nano Omni delivers consistent accuracy improvements over its predecessor, Nemotron Nano V2 VL, across all modalities, enabled by advances in architecture, training data and recipes. In particular, Nemotron 3 delivers leading results in real-world document understanding, long audio-video comprehension, and agentic computer use. Built on the highly efficient Nemotron 3 Nano 30B-A3B backbone, Nemotron 3 Nano Omni further incorporates innovative multimodal token-reduction techniques to deliver substantially lower inference latency and higher throughput than other models of similar size. We are releasing model checkpoints in BF16, FP8, and FP4 formats, along with portions of the training data and codebase to facilitate further research and development.

2604.20783 2026-05-12 cs.LG

Physics-Conditioned Synthesis of Internal Ice-Layer Thickness for Incomplete Layer Traces

Zesheng Liu, Maryam Rahnemoonfar

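Affiliations * Department of Computer Science and Engineering; Lehigh University; Department of Civil and Environmental Engineering

AI summary This study addresses the incompleteness of internal ice-layer structure derived from radar observations by generating complete ice-layer thickness annotations conditioned on colocated features from physical climate models. The method combines geometric learning with a transformer-based temporal module to aggregate spatial context within each layer and propagate information across layers, producing thickness that is structurally coherent and physically plausible. The model preserves existing observations while recovering missing layer segments and even entirely absent layers, and it provides effective pretraining supervision for downstream deep-layer prediction models.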

Comments Accepted for 2026 IEEE International Geoscience and Remote Sensing Symposium (IGARSS 2026)

Details
English abstract

Internal ice layers imaged by radar provide key evidence of snow accumulation and ice dynamics, but radar-derived layer boundary observations are often incomplete, with discontinuous traces and sometimes entirely missing layers, due to limited resolution, sensor noise, and signal loss. Existing graph-based models for ice stratigraphy generally assume sufficiently complete layer profiles and focus on predicting deeper-layer thickness from reliably traced shallow layers. In this work, we address the layer-completion problem itself by synthesizing complete ice-layer thickness annotations from incomplete radar-derived layer traces by conditioning on colocated physical features synchronized from physical climate models. The proposed network combines geometric learning to aggregate within-layer spatial context with a transformer-based temporal module that propagates information across layers to encourage coherent stratigraphy and consistent thickness evolution. To learn from incomplete supervision, we optimize a mask-aware robust regression objective that evaluates errors only at observed thickness values and normalizes by the number of valid entries, enabling stable training under varying sparsity without imputation and steering completions toward physically plausible values. The model preserves observed thickness where available and infers only missing regions, recovering fragmented segments and even fully absent layers while remaining consistent with measured traces. As an additional benefit, the synthesized thickness stacks provide effective pretraining supervision for a downstream deep-layer predictor, improving fine-tuned accuracy over training from scratch on the same fully traced data.
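
The mask-aware objective described above, which scores errors only at observed thickness values and divides by the number of valid entries, can be sketched as follows. The abstract says only that the regression loss is "robust"; the Huber form, the `delta` parameter, and the function name here are assumptions chosen for illustration.

```python
def masked_huber_loss(pred, target, mask, delta=1.0):
    """Mean Huber loss over observed entries only.

    Entries where mask is falsy are skipped entirely, so target may hold a
    placeholder (e.g. None) there. The sum is normalized by the number of
    valid entries, which keeps the scale stable as sparsity varies.
    """
    total, valid = 0.0, 0
    for p, t, m in zip(pred, target, mask):
        if not m:
            continue  # unobserved thickness: contributes no error, no count
        err = abs(p - t)
        if err <= delta:
            total += 0.5 * err * err            # quadratic near zero
        else:
            total += delta * (err - 0.5 * delta)  # linear for outliers
        valid += 1
    return total / valid if valid else 0.0


# Hypothetical usage: three thickness predictions, the middle one unobserved.
loss = masked_huber_loss([1.0, 2.0, 10.0], [1.5, None, 7.0], [True, False, True])
```

Because missing entries are skipped rather than filled in, no imputation is needed, and adding more unobserved positions leaves the loss unchanged.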