arXivDaily: daily arXiv digest, updated Monday through Friday
2602.04170 2026-06-17 cs.CV Updated version

Partial Ring Scan: Revisiting Scan Order in Vision State Space Models

Yi-Kuan Hsieh, Kuan-Chuan Peng, Xin Li, Ming-Ching Chang, Yu-Chee Tseng, Jun-Wei Hsieh

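AI summary: Proposes PRISMamba, which uses ring scanning and partial channel filtering to improve the rotation robustness and efficiency of vision state space models, reaching 84.5% Top-1 accuracy on ImageNet-1K.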

Comments Accepted to the 43rd International Conference on Machine Learning (ICML 2026)

Abstract

State Space Models (SSMs) have emerged as efficient alternatives to attention for vision tasks, offering linear-time sequence processing with competitive accuracy. Vision SSMs, however, require serializing 2D images into 1D token sequences along a predefined scan order, a factor often overlooked. We show that scan order critically affects performance by altering spatial adjacency, fracturing object continuity, and amplifying degradation under geometric transformations such as rotation. We present Partial RIng Scan Mamba (PRISMamba), a rotation-robust traversal that partitions an image into concentric rings, performs order-agnostic aggregation within each ring, and propagates context across rings through a set of short radial SSMs. Efficiency is further improved via partial channel filtering, which routes only the most informative channels through the recurrent ring pathway while keeping the rest on a lightweight residual branch. On ImageNet-1K, PRISMamba achieves 84.5% Top-1 with 3.9G FLOPs and 3,054 img/s on A100, outperforming VMamba in both accuracy and throughput while requiring fewer FLOPs. It also maintains performance under rotation, whereas fixed-path scans drop by 1-2%. These results highlight scan-order design, together with channel filtering, as a crucial, underexplored factor for accuracy, efficiency, and rotation robustness in Vision SSMs. Code will be released upon acceptance.
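
To make the ring idea concrete, here is a minimal Python sketch of partitioning a grid into concentric rings and aggregating each ring with a mean, which does not depend on the order in which pixels are visited. It is not the authors' implementation: the square ring definition (distance to the nearest border), the mean as the aggregator, and the helper names `ring_index`, `ring_means`, and `rotate90` are all assumptions made for illustration. The radial SSMs and the channel filtering described in the abstract are not modeled.

```python
def ring_index(row, col, height, width):
    """Ring number of a pixel: 0 is the outermost border, increasing toward the center."""
    return min(row, col, height - 1 - row, width - 1 - col)


def ring_means(image):
    """Average the values in each ring; a mean ignores the order of the pixels."""
    height, width = len(image), len(image[0])
    sums, counts = {}, {}
    for r in range(height):
        for c in range(width):
            k = ring_index(r, c, height, width)
            sums[k] = sums.get(k, 0.0) + image[r][c]
            counts[k] = counts.get(k, 0) + 1
    return [sums[k] / counts[k] for k in sorted(sums)]


def rotate90(image):
    """Rotate a 2D list by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


# Toy 4x4 "image" (assumed data); it has two rings: the border and the 2x2 center.
image = [[float((r * 4 + c) ** 2) for c in range(4)] for r in range(4)]
features = ring_means(image)
rotated_features = ring_means(rotate90(image))
```

In this toy setting the per-ring features are identical before and after a 90-degree rotation, because each pixel stays in its ring. For arbitrary rotation angles on a pixel grid the invariance would only be approximate.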

2602.03846 2026-06-17 cs.LG cs.AI Updated version

PLATE: Plasticity-Tunable Efficient Adapters for Geometry-Aware Continual Learning

Romain Cosentino

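AI summary: Proposes PLATE, a continual learning method that needs no old-task data; it exploits the geometric redundancy of pretrained networks and uses structured low-rank updates to explicitly control the plasticity-retention trade-off and improve worst-case retention guarantees.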

Abstract

We develop a continual learning method for pretrained models that requires no access to old-task data, addressing a practical barrier in foundation model adaptation where pretraining distributions are often unavailable. Our key observation is that pretrained networks exhibit substantial geometric redundancy, and that this redundancy can be exploited in two complementary ways. First, redundant neurons provide a proxy for dominant pretraining-era feature directions, enabling the construction of approximately protected update subspaces directly from pretrained weights. Second, redundancy offers a natural bias for where to place plasticity: by restricting updates to a subset of redundant neurons and constraining the remaining degrees of freedom, we obtain update families with reduced functional drift on the old-data distribution and improved worst-case retention guarantees. These insights lead to PLATE (Plasticity-Tunable Efficient Adapters), a continual learning method requiring no past-task data that provides explicit control over the plasticity-retention trade-off. PLATE parameterizes each layer with a structured low-rank update $\Delta W = B A Q^\top$, where $B$ and $Q$ are computed once from pretrained weights and kept frozen, and only $A$ is trained on the new task. The code is available at https://github.com/SalesforceAIResearch/PLATE.
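
The shape of the update $\Delta W = B A Q^\top$ can be sketched in a few lines of plain Python. This is only an illustration of the parameterization, not the PLATE code: the random `b` and `q` stand in for factors that the paper computes from pretrained weights, and the dimensions, the zero initialization of `a`, and the helper names are assumptions.

```python
import random


def matmul(x, y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def transpose(x):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*x)]


def low_rank_update(b, a, q):
    """Return delta_w = B A Q^T for B (d_out x r), A (r x k), Q (d_in x k)."""
    return matmul(matmul(b, a), transpose(q))


random.seed(0)
d_out, d_in, r, k = 6, 5, 2, 3  # assumed toy sizes
# Placeholder frozen factors; the paper derives B and Q from pretrained weights.
b = [[random.gauss(0, 1) for _ in range(r)] for _ in range(d_out)]
q = [[random.gauss(0, 1) for _ in range(k)] for _ in range(d_in)]
# Only A is trainable. Zero initialization (an assumption) leaves the layer unchanged at the start.
a = [[0.0] * k for _ in range(r)]

delta_w = low_rank_update(b, a, q)
trainable, full = r * k, d_out * d_in
```

With these toy sizes only `r * k = 6` entries are trained, against 30 entries in the full weight matrix, which is where the efficiency of a low-rank adapter comes from.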

2602.03420 2026-06-17 cs.SD cs.LG Updated version

CoCoEmo: Composable and Controllable Human-Like Emotional TTS via Activation Steering

Siyi Wang, Shihong Tan, Siyi Liu, Hong Jia, Gongping Huang, James Bailey, Ting Dang

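AI summary: Proposes an activation-steering framework that enables composable mixed-emotion synthesis and text-emotion mismatch synthesis in hybrid TTS models, and finds that emotional prosody is generated mainly by the language module rather than the flow-matching module.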

Abstract

Emotional expression in human speech is nuanced and compositional, often involving multiple, sometimes conflicting, affective cues that may diverge from linguistic content. In contrast, most expressive text-to-speech systems enforce a single utterance-level emotion, collapsing affective diversity and suppressing mixed or text-emotion-misaligned expression. While activation steering via latent direction vectors offers a promising solution, it remains unclear whether emotion representations are linearly steerable in TTS, where steering should be applied within hybrid TTS architectures, and how such complex emotion behaviors should be evaluated. This paper presents the first systematic analysis of activation steering for emotional control in hybrid TTS models, introducing a quantitative, controllable steering framework, and multi-rater evaluation protocols that enable composable mixed-emotion synthesis and reliable text-emotion mismatch synthesis. Our results demonstrate, for the first time, that emotional prosody and expressive variability are primarily synthesized by the TTS language module instead of the flow-matching module, and also provide a lightweight steering approach for generating natural, human-like emotional speech.
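
Activation steering with latent direction vectors is usually written as adding a scaled direction to a hidden state, and a mixed emotion can then be expressed as a weighted sum of several directions. The sketch below shows that arithmetic only. The four-dimensional vectors, the emotion names, the weights, and the function `steer` are hypothetical; the abstract does not specify how directions are extracted or at which layer they are applied.

```python
def steer(hidden, directions, weights):
    """Add a weighted sum of emotion direction vectors to one hidden state."""
    steered = list(hidden)
    for name, alpha in weights.items():
        for i, component in enumerate(directions[name]):
            steered[i] += alpha * component
    return steered


# Hypothetical 4-dimensional direction vectors, for illustration only.
directions = {
    "happy": [1.0, 0.0, 0.0, 0.0],
    "sad": [0.0, 1.0, 0.0, 0.0],
}
hidden = [0.5, 0.5, 0.5, 0.5]
# A mixed emotion: mostly "happy" with some "sad".
mixed = steer(hidden, directions, {"happy": 0.75, "sad": 0.25})
```

Changing the weights changes the blend without retraining anything, which is what makes this style of control lightweight.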

2602.03300 2026-06-17 cs.LG cs.AI cs.CL cs.CV Updated version

R1-SyntheticVL: Is Synthetic Data from Generative Models Ready for Multimodal Large Language Model?

Jingyi Zhang, Tianyi Lin, Huanjin Yao, Xiang Lan, Shunyu Liu, Jiaxing Huang

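AI summary: Proposes Collective Adversarial Data Synthesis (CADS), which uses collective intelligence and adversarial learning to automatically generate high-quality, diverse, and challenging multimodal data that improves multimodal large language models (MLLMs) on complex real-world tasks.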

Comments ICML 2026 Camera Ready

Abstract

In this work, we aim to develop effective data synthesis techniques that autonomously synthesize multimodal training data for enhancing MLLMs in solving complex real-world tasks. To this end, we propose Collective Adversarial Data Synthesis (CADS), a novel and general approach to synthesize high-quality, diverse and challenging multimodal data for MLLMs. The core idea of CADS is to leverage collective intelligence to ensure high-quality and diverse generation, while exploring adversarial learning to synthesize challenging samples for effectively driving model improvement. Specifically, CADS operates with two cyclic phases, i.e., Collective Adversarial Data Generation (CAD-Generate) and Collective Adversarial Data Judgment (CAD-Judge). CAD-Generate leverages collective knowledge to jointly generate new and diverse multimodal data, while CAD-Judge collaboratively assesses the quality of synthesized data. In addition, CADS introduces an Adversarial Context Optimization mechanism to optimize the generation context to encourage challenging and high-value data generation. With CADS, we construct MMSynthetic-20K and train our model R1-SyntheticVL, which demonstrates superior performance on various benchmarks.

2602.03045 2026-06-17 cs.LG Updated version

Clarify Before You Draw: Proactive Agents for Robust Text-to-CAD Generation

Bo Yuan, Zelin Zhao, Petr Molodyk, Bin Hu, Yongxin Chen

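AI summary: Proposes ProCAD, a proactive agentic framework in which a clarifying agent resolves ambiguities in the user prompt before code generation and a CAD coding agent then produces an executable program; robustness improves markedly, with mean Chamfer distance reduced by 79.9%.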

Comments ICML 2026

Abstract

Large language models have recently enabled text-to-CAD systems that synthesize parametric CAD programs (e.g., CadQuery) from natural-language prompts. In practice, however, geometric descriptions can be under-specified or internally inconsistent: critical dimensions may be missing and constraints may conflict. Existing fine-tuned models nonetheless tend to reactively follow user instructions and hallucinate dimensions when the text is ambiguous. To address this, we propose a proactive agentic framework for text-to-CadQuery generation, named ProCAD, that resolves specification issues before code synthesis. Our framework pairs a proactive clarifying agent, which audits the prompt and asks targeted clarification questions only when necessary to produce a self-consistent specification, with a CAD coding agent that translates the specification into an executable CadQuery program. We fine-tune the coding agent on a curated high-quality text-to-CadQuery dataset and train the clarifying agent via agentic SFT on clarification trajectories. Experiments show that proactive clarification significantly improves robustness to ambiguous prompts while keeping interaction overhead low. ProCAD outperforms frontier closed-source models, including Claude Sonnet 4.5, reducing the mean Chamfer distance by 79.9% and lowering the invalidity ratio from 4.8% to 0.9%. Our code and datasets are publicly available at https://github.com/BoYuanVisionary/Pro-CAD.
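
Chamfer distance, the geometric metric quoted above, compares two point sets by nearest-neighbor distances. A minimal Python sketch of one common symmetric variant follows. The abstract does not state which variant the paper uses (some definitions use squared distances or average the two directions), so the formula here and the toy cube point sets are assumptions.

```python
import math


def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer distance: mean nearest-neighbor distance in both directions, summed."""
    def one_way(src, dst):
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)
    return one_way(points_a, points_b) + one_way(points_b, points_a)


# Toy point sets: the corners of a unit cube and the same cube shifted by 0.5 along x.
cube = [(x, y, z) for x in (0.0, 1.0) for y in (0.0, 1.0) for z in (0.0, 1.0)]
shifted = [(x + 0.5, y, z) for x, y, z in cube]
same = chamfer_distance(cube, cube)
moved = chamfer_distance(cube, shifted)
```

Identical shapes score zero and the score grows as the generated geometry drifts from the reference, so a lower mean Chamfer distance indicates closer geometric agreement. The brute-force nearest-neighbor search is quadratic in the number of points; real evaluations typically sample points from the CAD surfaces and use a spatial index.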

2601.22495 2026-06-17 cs.LG Updated version

Gradual Fine-Tuning for Flow Matching Models

Gudrun Thorkelsdottir, Arindam Banerjee

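AI summary: Proposes the Gradual Fine-Tuning (GFT) framework, which uses an annealing schedule to fine-tune flow generative models from target-distribution samples, with a theoretical guarantee of approaching the true target; experiments show better stability, efficiency, and diversity than existing methods.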

Comments Preprint. Added methodology and experimental sections

Abstract

Fine-tuning flow matching models is a central challenge in settings with limited data, evolving distributions, or computational constraints. While recent work has produced significant advances, particularly in the area of reward-based fine-tuning, current methods fail to demonstrate both theoretical correctness and strong empirical results in terms of stability, efficiency, and diversity preservation. In this work, we propose Gradual Fine-Tuning (GFT), a simple yet principled annealing-based framework for fine-tuning flow generative models when only samples from the target distribution are available. For stochastic flows, GFT defines a temperature-controlled sequence of intermediate objectives that smoothly interpolate between the pretrained and target drifts, provably approaching the true target as the temperature approaches zero. We analytically demonstrate that sample generation after GFT can be made substantially more efficient with the use of arbitrary (e.g., optimal transport) couplings, as well as by utilizing few-step inference methods. Empirically, GFT significantly improves convergence stability, while maintaining or improving generation quality, training speed, and generation diversity compared to other fine-tuning methods. Our results position GFT as a simple yet theoretically grounded and practically effective alternative for scalable adaptation of flow matching models under distribution shift.

2510.09468 2026-06-17 cs.LG Updated version

Geodesic Calculus on Implicitly Defined Latent Manifolds

Florine Hartwig, Josua Sassen, Juliane Braunsmann, Martin Rumpf, Benedikt Wirth

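AI summary: Proposes treating the latent manifold of an autoencoder as an implicit submanifold and develops discrete Riemannian calculus tools that approximate classical geometric operators; an approximate projection learned with a denoising objective enables computing geodesic paths and the Riemannian exponential map on the latent manifold.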

Comments 26 pages, 18 figures

Abstract

Latent manifolds of autoencoders provide low-dimensional representations of data, which can be studied from a geometric perspective. We propose to describe these latent manifolds as implicit submanifolds of some ambient latent space. Based on this, we develop tools for a discrete Riemannian calculus approximating classical geometric operators. These tools are robust against inaccuracies of the implicit representation often occurring in practical examples. To obtain a suitable implicit representation, we propose to learn an approximate projection onto the latent manifold by minimizing a denoising objective. This approach is independent of the underlying autoencoder and supports the use of different Riemannian geometries on the latent manifolds. The framework in particular enables the computation of geodesic paths connecting given end points and shooting geodesics via the Riemannian exponential maps on latent manifolds. We evaluate our approach on various autoencoders trained on synthetic and real data.

2601.19099 2026-06-17 cs.CV cs.AI Updated version

m2sv: A Scalable Benchmark for Map-to-Street-View Spatial Reasoning

Yosub Shin, Michael Buriek, Igor Molybog

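AI summary: Proposes the m2sv benchmark, which evaluates VLM spatial reasoning by inferring camera direction from a north-up overhead map matched to a Street View image; the best model reaches 65.2% accuracy, below the human average of 72.0%, revealing gaps in geometric alignment and reasoning consistency.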

Abstract

Vision-language models (VLMs) achieve strong performance on many multimodal benchmarks but remain brittle on spatial reasoning tasks that require aligning abstract overhead representations with egocentric views. We introduce m2sv, a scalable benchmark for map-to-street-view spatial reasoning that asks models to infer camera viewing direction by aligning a north-up overhead map with a Street View image captured at the same real-world intersection. We release m2sv-20k, a geographically diverse benchmark with controlled ambiguity, along with m2sv-sft-11k, a curated set of structured reasoning traces for supervised fine-tuning. Despite strong performance on existing multimodal benchmarks, the best evaluated VLM achieves only 65.2% accuracy on m2sv, below human annotators who reach 72.0% on average (and 95% for an expert) with strong inter-annotator agreement ($\kappa$ up to 0.76). While supervised fine-tuning and reinforcement learning yield consistent gains, cross-benchmark evaluations reveal limited transfer. Beyond aggregate accuracy, we systematically analyze difficulty in map-to-street-view reasoning using both structural signals and human effort, and conduct an extensive failure analysis of adapted open models. Our findings highlight persistent gaps in geometric alignment, evidence aggregation, and reasoning consistency, motivating future work on grounded spatial reasoning across viewpoints.
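
The inter-annotator agreement figure is a kappa statistic, which corrects raw agreement for the agreement expected by chance. The abstract does not say which kappa is used; the sketch below assumes Cohen's kappa for two annotators, and the four-way direction labels and the answer lists are hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators: (p_o - p_e) / (1 - p_e)."""
    n = len(labels_a)
    # p_o: fraction of items on which the two annotators agree.
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    # p_e: agreement expected by chance from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)


# Hypothetical direction answers (N, E, S, W) from two annotators on eight items.
annotator_1 = ["N", "E", "S", "W", "N", "E", "S", "W"]
annotator_2 = ["N", "E", "S", "W", "N", "E", "W", "S"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

Here the annotators agree on 6 of 8 items (0.75) and chance agreement is 0.25, so kappa is 0.5 / 0.75, about 0.67. The function divides by zero when chance agreement is exactly 1, which happens only if both annotators always give the same single label.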

2601.19098 2026-06-17 cs.RO Updated version

SimTO: A two-stage, simulation-driven topology optimization framework for bespoke soft robotic grippers

Kurt Enkera, Josh Pinskier, Marcus Gallagher, David Howard

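AI summary: Proposes SimTO, a two-stage, simulation-driven topology optimization framework that automatically extracts contact loads to design bespoke soft grippers for feature-rich objects; experiments show higher grasp forces than conventional methods and strong generalization.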

Comments 15 pages, 9 figures. Published in Structural and Multidisciplinary Optimization

Abstract

Soft robotic grippers are essential for grasping delicate, geometrically complex objects in manufacturing, healthcare and agriculture. However, existing designs struggle to grasp feature-rich objects with high topological variability, including gears with sharp tooth profiles on automotive assembly lines, corals with fragile protrusions, or vegetables with irregular branching structures like broccoli. Unlike simple geometric primitives such as cubes or spheres, feature-rich objects lack a clear "optimal" contact surface, making them both difficult to grasp and susceptible to damage. Safe handling of such objects therefore requires specialized soft grippers whose morphology is tailored to the object's features. Topology optimization offers a promising approach for producing specialized grippers, but its utility is limited by the need for pre-defined load cases. For soft grippers, these loads arise from hundreds of unpredictable gripper-object contact forces during grasping and are unknown a priori. To address this problem, we introduce SimTO, a two-stage, simulation-driven topology optimization framework that automatically extracts load cases from a dynamic, contact-rich grasping simulation before performing classical topology optimization, eliminating the need for manual load specification. Given an arbitrary feature-rich object, SimTO produces highly customized soft grippers with fine-grained morphological features tailored to the object geometry. Physical experiments confirm that our specialized grippers achieve higher grasp forces than a generalist design produced by conventional topology optimization methods, while numerical experiments show that they achieve high grasp success rates across varying object poses and strong generalization to a set of unseen objects.

2601.18252 2026-06-17 cs.CV cs.AI cs.LG stat.ML Updated version

Co-PLNet: A Collaborative Point-Line Network for Prompt-Guided Wireframe Parsing

Chao Wang, Xuanying Li, Cheng Dai, Jinglei Feng, Yuxiang Luo, Hao Qin, Yuqi Ouyang

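AI summary: Proposes Co-PLNet, a point-line collaborative framework that exchanges spatial cues through a point-line prompt encoder and strengthens point-line consistency with a cross-guidance line decoder, improving wireframe parsing accuracy and robustness on the Wireframe and YorkUrban datasets.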

Abstract

Wireframe parsing aims to recover line segments and their junctions to form a structured geometric representation useful for downstream tasks such as Simultaneous Localization and Mapping (SLAM). Existing methods predict lines and junctions separately and reconcile them post-hoc, causing mismatches and reduced robustness. We present Co-PLNet, a point-line collaborative framework that exchanges spatial cues between the two tasks, where early detections are converted into spatial prompts via a Point-Line Prompt Encoder (PLP-Encoder), which encodes geometric attributes into compact and spatially aligned maps. A Cross-Guidance Line Decoder (CGL-Decoder) then refines predictions with sparse attention conditioned on complementary prompts, enforcing point-line consistency and efficiency. Experiments on Wireframe and YorkUrban show consistent improvements in accuracy and robustness, together with favorable real-time efficiency, demonstrating our effectiveness for structured geometry perception. Our code is available at https://github.com/GalacticHogrider/Co-PLNet.

2410.10137 2026-06-17 cs.LG math.DG stat.CO stat.ML Updated version

Variational autoencoders with latent high-dimensional steady geometric flows for dynamics

Andrew Gracyk

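AI summary: Proposes VAE-DLM, which introduces a steady-state geometric flow in the latent space and solves the high-dimensional flow with a physics-informed approach, making latent representations more expressive and reducing OOD error by 15% to 35% on PDE-type data.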

Journal ref 23rd International Conference of Numerical Analysis and Applied Mathematics (ICNAAM) 2025

Abstract

We develop Riemannian approaches to variational autoencoders (VAEs) for PDE-type ambient data with regularizing geometric latent dynamics, which we refer to as VAE-DLM, or VAEs with dynamical latent manifolds. We redevelop the VAE framework such that manifold geometries, subject to our geometric flow and embedded in Euclidean space, are learned in the intermediary latent space developed by encoders and decoders. By tailoring the geometric flow in which the latent space evolves, we induce latent geometric properties of our choosing, which are reflected in empirical performance. We reformulate the traditional evidence lower bound (ELBO) loss with a carefully chosen prior. We develop a linear geometric flow with a steady-state regularizing term. This flow requires only automatic differentiation of one time derivative, and can be solved in moderately high dimensions with a physics-informed approach, allowing more expressive latent representations. We discuss how this flow can be formulated as a gradient flow and how it maintains entropy away from metric singularity. This, along with an eigenvalue penalization condition, helps ensure that the manifold is sufficiently large in measure, is nondegenerate, and has a canonical geometry, all of which contribute to a robust representation. Our methods focus on the modified multi-layer perceptron architecture with tanh activations for the manifold encoder-decoder. We demonstrate that, on our datasets of interest, our methods perform at least as well as the traditional VAE, and oftentimes better. Our methods can outperform both the traditional VAE and a VAE endowed with our proposed architecture, frequently reducing out-of-distribution (OOD) error by 15% to 35% on select datasets. We highlight our method on ambient PDEs whose solutions maintain minimal variation in late times. We provide empirical justification for how robust learning of external dynamics can be improved with VAEs.

2601.10962 2026-06-17 cs.LG cond-mat.dis-nn Updated version

Noise-Driven Exploration and Transient Freezing Select Flat Minima in Stochastic Gradient Descent

Ning Yang, Yikuan Zhang, Qi Ouyang, Chao Tang, Yuhai Tu

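AI summary: By analyzing SGD learning dynamics, finds a nonequilibrium mechanism that drives solution selection: a transient exploratory phase escapes sharp valleys, noise reshapes the potential to stabilize flat solutions, and delayed freezing improves generalization.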

Comments 12 pages, 4 figures

Abstract

Stochastic gradient descent (SGD) is central to deep learning, yet the dynamical origin of its preference for flatter, more generalizable solutions remains unclear. Here, by analyzing SGD learning dynamics, we identify a nonequilibrium mechanism that governs solution selection during training. Numerical experiments reveal a transient exploratory phase in which SGD trajectories repeatedly escape sharp valleys and migrate toward flatter regions of the loss landscape before becoming confined to a final basin. Using a tractable physical model, we show that SGD noise reshapes the loss landscape into an effective potential that preferentially stabilizes flat solutions. We further uncover a transient freezing mechanism: as training progresses, the flattening landscape suppresses transitions between competing valleys. Stronger SGD noise delays this freezing transition, prolonging the exploratory phase and thereby increasing the probability of convergence to flatter minima. Together, these results provide a unified physical framework connecting learning dynamics, loss-landscape geometry, and generalization, and suggest guiding principles for the design of more effective optimization algorithms.

2512.16420 2026-06-17 cs.SD Updated version

DPDFNet: Boosting DeepFilterNet2 via Dual-Path RNN

Daniel Rika, Nino Sapir, Ido Gus

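AI summary: Proposes DPDFNet, which adds dual-path blocks to the DeepFilterNet2 encoder to strengthen long-range and cross-band modeling; combined with an over-attenuation loss and a fine-tuning strategy, it outperforms existing causal models on several benchmarks and runs in real time on edge NPUs.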

Comments Accepted manuscript version. Accepted for publication in Speech Communication

Abstract

We present DPDFNet, a causal single-channel speech enhancement model that extends the DeepFilterNet2 architecture with dual-path blocks in the encoder, strengthening long-range temporal and cross-band modeling while preserving the original enhancement framework. In addition, we demonstrate that adding a loss component to mitigate over-attenuation in the enhanced speech, combined with a fine-tuning phase tailored for "always-on" applications, leads to substantial improvements in overall model performance. We evaluate DPDFNet on the standard VoiceBank+DEMAND and DNS4 blind test benchmarks, where it shows consistent gains over DeepFilterNet2 and strong overall performance against other causal open-source models. In addition, we introduce a supplementary multilingual low-SNR evaluation set comprising long recordings in 12 languages across everyday noise scenarios, on which DPDFNet delivers superior performance to other causal open-source models, including some that are substantially larger and more computationally demanding. We also propose a holistic metric named PRISM, a composite, scale-normalized aggregate of intrusive and non-intrusive metrics, which demonstrates clear scalability with the number of dual-path blocks. We further demonstrate on-device feasibility by deploying DPDFNet on Ceva-NeuPro-Nano edge NPUs. Results indicate that DPDFNet-4, our second-largest model, achieves real-time performance on NPN32 and runs even faster on NPN64, confirming that state-of-the-art quality can be sustained within strict embedded power and latency constraints.

2601.03872 2026-06-17 cs.CL Updated version

Atlas: Orchestrating Heterogeneous Models and Tools for Multi-Domain Complex Reasoning

Jinyang Wu, Guocheng Zhai, Ruihan Jin, Jiahao Yuan, Yuhao Shen, Shuai Zhang, Zhengqi Wen, Jianhua Tao

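AI summary: Proposes ATLAS, a dual-path framework that dynamically selects the best model-tool combination through training-free cluster-based routing and RL-based multi-step routing; across 15 benchmarks it outperforms GPT-4o and improves on existing routing methods by 10.1% on in-distribution and 13.1% on out-of-distribution tasks.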

Comments Accepted by ACL 2026

Abstract

The integration of large language models (LLMs) with external tools has significantly expanded the capabilities of AI agents. However, as the diversity of both LLMs and tools increases, selecting the optimal model-tool combination becomes a high-dimensional optimization challenge. Existing approaches often rely on a single model or fixed tool-calling logic, failing to exploit the performance variations across heterogeneous model-tool pairs. In this paper, we present ATLAS (Adaptive Tool-LLM Alignment and Synergistic Invocation), a dual-path framework for dynamic tool usage in cross-domain complex reasoning. ATLAS operates via a dual-path approach: (1) training-free cluster-based routing that exploits empirical priors for domain-specific alignment, and (2) RL-based multi-step routing that explores autonomous trajectories for out-of-distribution generalization. Extensive experiments across 15 benchmarks demonstrate that our method outperforms closed-source models like GPT-4o, surpassing existing routing methods on both in-distribution (+10.1%) and out-of-distribution (+13.1%) tasks. Furthermore, our framework shows significant gains in visual reasoning by orchestrating specialized multi-modal tools.
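
One plausible reading of training-free cluster-based routing is: embed the query, find the nearest cluster centroid, and pick the model-tool pair with the best recorded accuracy for that cluster. The sketch below follows that reading and is not the ATLAS implementation. The two-dimensional embeddings, cluster names, model and tool names, and accuracy values are all invented for illustration.

```python
import math


def route(query_embedding, centroids, accuracy_table):
    """Pick the model-tool pair with the best recorded accuracy in the nearest cluster."""
    cluster = min(centroids, key=lambda name: math.dist(query_embedding, centroids[name]))
    pairs = accuracy_table[cluster]
    return cluster, max(pairs, key=pairs.get)


# Hypothetical clusters, model-tool pairs, and accuracies, for illustration only.
centroids = {"math": (1.0, 0.0), "code": (0.0, 1.0)}
accuracy_table = {
    "math": {("model_a", "calculator"): 0.82, ("model_b", "search"): 0.64},
    "code": {("model_a", "calculator"): 0.41, ("model_b", "interpreter"): 0.77},
}
cluster, choice = route((0.9, 0.2), centroids, accuracy_table)
```

Nothing is trained at routing time: the accuracy table plays the role of the empirical prior, so adding a new model or tool only requires measuring it on each cluster. The RL-based path described in the abstract is a separate mechanism and is not covered here.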

2511.01650 2026-06-17 cs.CL cs.AI cs.LG Updated version

EngTrace: A Symbolic Benchmark for Verifiable Process Supervision of Engineering Reasoning

Ayesha Gull, Muhammad Usman Safder, Rania Elbadry, Fan Zhang, Veselin Stoyanov, Preslav Nakov, Zhuohan Xie

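AI summary: Proposes EngTrace, a symbolic benchmark of 1,350 parameterized test cases with a verifiable two-stage evaluation framework (a tiered protocol plus an AI Tribunal) that checks intermediate reasoning traces as well as final answers, revealing a trade-off between numeric precision and trace fidelity.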

Comments 33 pages, includes figures and tables; introduces the EngTrace benchmark

Abstract

Large Language Models (LLMs) are increasingly entering specialized, safety-critical engineering workflows governed by strict quantitative standards and immutable physical laws, making rigorous evaluation of their reasoning capabilities imperative. However, existing benchmarks such as MMLU, MATH, and HumanEval assess isolated cognitive skills, failing to capture the physically grounded reasoning central to engineering, where scientific principles, quantitative modeling, and practical constraints must converge. To enable verifiable process supervision in engineering, we introduce EngTrace, a symbolic benchmark built on 90 parameterized templates, each generating unique, contamination-resistant problem instances, spanning three major engineering branches, nine core domains, and 20 distinct areas, yielding 1,350 test cases that stress-test generalization across diverse physical scenarios. Moving beyond outcome matching, we introduce a verifiable two-stage evaluation framework that uses a tiered protocol to validate intermediate reasoning traces alongside final answers through automated procedural checks and a heterogeneous AI Tribunal. Our evaluation of 27 leading LLMs reveals a distinct trade-off between numeric precision and trace fidelity, identifying a complexity cliff where abstract mathematical pre-training fails to translate into the integrative reasoning required for advanced engineering tasks.
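
A parameterized template of the kind described above can be pictured as a function that samples parameters, renders a prompt, and computes the ground truth from a formula, so that every instance is new but automatically checkable. The sketch below is a toy example of that pattern, not taken from EngTrace: the Ohm's law template, the parameter ranges, the tolerance, and the function names are assumptions. The figure of 15 instances per template is only what 1,350 cases over 90 templates would give if split evenly.

```python
import random


def ohms_law_template(rng):
    """One hypothetical template: sample parameters, render the prompt, compute the answer."""
    voltage = rng.choice([5, 9, 12, 24])
    resistance = rng.choice([10, 47, 100, 220])
    prompt = (
        f"A {voltage} V source drives a {resistance} ohm resistor. "
        "Find the current in amperes."
    )
    return {"prompt": prompt, "answer": voltage / resistance}


def check(instance, predicted, rel_tol=1e-3):
    """Accept a numeric answer within a relative tolerance of the computed ground truth."""
    return abs(predicted - instance["answer"]) <= rel_tol * abs(instance["answer"])


rng = random.Random(42)
# 15 instances from one template (1,350 / 90 = 15, assuming an even split).
instances = [ohms_law_template(rng) for _ in range(15)]
```

Because the answer is recomputed from the sampled parameters, a memorized solution to one instance does not transfer to the next, which is the sense in which such instances resist contamination. This sketch checks final answers only; validating intermediate reasoning traces, as the paper's two-stage framework does, needs additional machinery.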

2507.14632 2026-06-17 cs.CV 版本更新

BusterX++: Towards Unified Cross-Modal AI-Generated Content Detection and Explanation with MLLM

BusterX++: 迈向基于MLLM的统一跨模态AI生成内容检测与解释

Haiquan Wen, Tianxiao Li, Zhenglin Huang, Yiwei He, Guangliang Cheng

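AI summary Proposes BusterX++, a unified MLLM for image and video forgery detection; a single-stage, pure-RL strategy enables cross-modal capability transfer and yields state-of-the-art performance.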

详情
AI中文摘要

生成式AI的快速发展显著提升了图像和视频合成质量,加剧了多模态视觉错误信息的风险。最近的多模态大模型通过推理和解释在透明化AI生成内容检测方面展现出潜力,但现有方法大多将图像和视频取证视为孤立任务,跨模态协同作用尚未充分探索。为解决这一问题,我们提出了\textbf{BusterX++},一个统一的多模态大模型,用于联合图像和视频检测并具备可解释推理能力。我们还引入了\textbf{GenBuster-Bench++},一个精心策划、难度对齐的基准测试,包含平衡的图像和视频样本,覆盖最新的生成模型和多样化的真实场景。利用这一受控设置,我们重新审视了广泛采用的$SFT \rightarrow RL$后训练范式。值得注意的是,我们的发现表明,仅由稀疏结果奖励驱动的单阶段纯RL策略在统一和单模态设置中始终匹配或超越强SFT+RL基线。我们的关键洞察是,SFT降低了策略熵,限制了策略搜索空间并抑制了探索自由度。相比之下,单阶段纯RL在整个训练过程中保持较高的策略熵,有效解锁了图像和视频取证之间跨模态能力迁移的自发涌现。大量实验表明,BusterX++达到了最先进的性能,突显了RL在统一跨模态视觉推理中的强大潜力。

英文摘要

The rapid advancement of generative AI has substantially improved image and video synthesis, amplifying the risk of multimodal visual misinformation. Recent MLLMs have shown promise for transparent AI-generated content detection through reasoning and explanation, yet existing approaches largely treat image and video forensics as isolated tasks, leaving cross-modal synergies underexplored. To address this, we present BusterX++, a unified MLLM for joint image and video detection with interpretable reasoning. We also introduce GenBuster-Bench++, a meticulously curated, difficulty-aligned benchmark containing balanced image and video samples spanning recent generation models and diverse real-world scenarios. Using this controlled setting, we revisit the widely adopted $SFT \rightarrow RL$ post-training paradigm. Notably, our findings demonstrate that a single-stage, pure RL strategy driven strictly by sparse outcome rewards consistently matches or surpasses a strong SFT+RL baseline across both unified and single-modality settings. Our key insight reveals that SFT imposes lower policy entropy, which restricts the policy search space and dampens exploratory freedom. In contrast, single-stage pure RL maintains higher policy entropy throughout training, effectively unlocking the spontaneous emergence of cross-modal capability transfer between image and video forensics. Extensive experiments demonstrate that BusterX++ achieves state-of-the-art performance, highlighting the powerful potential of RL for unified cross-modal visual reasoning.

2601.00215 2026-06-17 cs.CV cs.CL 版本更新

Disentangling Perception and Reasoning in Multimodal LLMs via Reward Design

通过奖励设计解耦多模态大语言模型中的感知与推理

Omar Sharif, Eftekhar Hossain, Nikhil Singh, Patrick Ng

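AI summary Studies the perception and reasoning bottlenecks in multimodal LLMs, finds that perception is the binding constraint, and uses reward design to encourage visually grounded reasoning, yielding a 5.56-point gain over the base model.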

Comments 24 pages, 15 Figures, 10 Tables

详情
AI中文摘要

基于可验证奖励的强化学习推动了LLM推理的重大进步,直观上这种策略应能很好地迁移到多模态模型。然而,多模态模型做两件事:首先感知图像中的内容,然后推理其含义。由于这两个阶段是联合评分的,很难判断推理本身还有多大提升空间。我们在算法视觉谜题上研究这一问题,其中两个组件都是必要的,并表明感知而非推理是约束瓶颈。用简单的文本描述替换图像,Claude模型的平均性能提升超过20点。然后我们评估了六种奖励设计,旨在诱导推理过程中的视觉基础,而无需思维链监督。使用GRPO训练Qwen-2.5-VL-7B,奖励设计诱导出带有自我反思和视觉引用的长结构化推理,相比基础模型获得5.56点的提升。然而,这些提升是不均匀的;没有单一奖励能改善所有类别,并且具有可验证准确性信号的奖励会以域外迁移为代价换取域内准确性。这些结果表明,感知感知的奖励设计是一条前进之路,以便在源头纠正感知,而不是纠正继承其错误的推理。

英文摘要

Reinforcement learning with verifiable rewards has driven major gains in LLM reasoning, and it is intuitive to assume this recipe will transfer well to multimodal models. However, multimodal models do two things: first, perceive what is in an image, then reason about what it implies. Because these stages are graded jointly, it is hard to tell how much room reasoning alone has to grow. We study this on algorithmic visual puzzles, where both components are necessary and show that perception, not reasoning, is the binding constraint. Replacing images with simple textual descriptions raises performance by over 20 points on average for Claude models. We then evaluate six reward designs aimed at inducing visual grounding during reasoning without chain-of-thought supervision. Training Qwen-2.5-VL-7B with GRPO, reward design induces long, structured reasoning with self-reflection and visual references, yielding a 5.56-point gain over the base model. These gains are, however, uneven; no single reward improves all categories, and rewards with verifiable accuracy signals trade out-of-domain transfer for in-domain accuracy. These results point to perception-aware reward design as a path forward, so that signals correct perception at its source rather than the reasoning that inherits its errors.

2512.21315 2026-06-17 cs.LG cs.CV stat.ML 版本更新

Does the Data Processing Inequality Reflect Practice? On the Utility of Low-Level Tasks

数据处理不等式是否反映实践?论低级任务的有用性

Roy Turgeman, Tom Tirer

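AI summary Studies how low-level processing (e.g., denoising, encoding) can improve classification, proves that for any finite number of training samples there exists a pre-classification processing step that improves accuracy, and confirms the theoretical trends empirically.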

Comments ICLR 2026 (camera-ready). Code is available at: https://github.com/serveroy/process-before-you-classify

详情
Journal ref
The Fourteenth International Conference on Learning Representations (ICLR 2026)
AI中文摘要

数据处理不等式是一个信息论原理,指出信号的信息内容不能通过处理观测数据而增加。特别地,它表明在解决分类问题之前,增强信号或对其进行编码没有益处。对于最优贝叶斯分类器,这一断言可以被证明是正确的。然而,在实践中,尽管现代深度神经网络具有强大的能力,但在高级下游任务之前执行“低级”任务仍然很常见。在本文中,我们旨在理解低级处理何时以及为何对分类有益。我们提出了一个二元分类设置的综合理论研究,其中我们考虑一个与最优贝叶斯分类器紧密相连的分类器,并随着训练样本数量的增加而收敛到它。我们证明,对于任何有限数量的训练样本,存在一种预分类处理可以提高分类准确率。我们还探讨了类分离、训练集大小和类平衡对该过程相对增益的影响。我们通过理论设置的经验研究来支持我们的理论。最后,我们进行了一项实证研究,调查去噪和编码对基准数据集上实际深度分类器性能的影响。具体来说,我们改变了训练集的大小和类别分布以及噪声水平,并展示了与我们的理论结果一致的趋势。

英文摘要

The data processing inequality is an information-theoretic principle stating that the information content of a signal cannot be increased by processing the observations. In particular, it suggests that there is no benefit in enhancing the signal or encoding it before addressing a classification problem. This assertion can be proven to be true for the case of the optimal Bayes classifier. However, in practice, it is common to perform "low-level" tasks before "high-level" downstream tasks despite the overwhelming capabilities of modern deep neural networks. In this paper, we aim to understand when and why low-level processing can be beneficial for classification. We present a comprehensive theoretical study of a binary classification setup, where we consider a classifier that is tightly connected to the optimal Bayes classifier and converges to it as the number of training samples increases. We prove that for any finite number of training samples, there exists a pre-classification processing that improves the classification accuracy. We also explore the effect of class separation, training set size, and class balance on the relative gain from this procedure. We support our theory with an empirical investigation of the theoretical setup. Finally, we conduct an empirical study where we investigate the effect of denoising and encoding on the performance of practical deep classifiers on benchmark datasets. Specifically, we vary the size and class distribution of the training set, and the noise level, and demonstrate trends that are consistent with our theoretical results.

2512.16978 2026-06-17 cs.CV 版本更新

A Benchmark for Omni-Modal Reasoning in Long Videos

长视频全模态推理基准

Mohammed Irfan Kurpath, Jaseel Muhammad Kaithakkodan, Jinxing Zhou, Sahal Shaji Mullappilly, Mohammad Almansoori, Noor Ahsan, Beknur Kalmakhanbet, Sambal Shikhar, Rishabh Lalla, Jean Lahoud, Mariette Awad, Fahad Shahbaz Khan, Salman Khan, Rao Muhammad Anwer, Hisham Cholakkal

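AI summary Introduces LongShOTBench, a benchmark for omni-modal reasoning over vision, speech, and ambient audio in long videos, and LongShOTAgent, a training-free omni-modal evidence-seeking agent; across 105 evaluated models, the agent is the strongest training-free system at 66.64% overall.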

详情
AI中文摘要

长形式全模态视频理解需要整合视觉、语音和环境音频,并进行连贯的长上下文推理。现有的视频基准通常在时间尺度、模态覆盖、开放式交互和可解释评分之间进行权衡。为了解决这一差距,我们引入了LongShOTBench,一个围绕三个耦合目标设计的长期视频理解基准:整体全模态集成、意图驱动的开放式交互和规则级诊断。它从真实观看场景构建单轮和多轮问题,通过系统任务探究视觉、语音、环境音频、时间和跨模态推理。每个项目包括一个参考答案和一个加权标准级规则,让评估识别哪些感知事实、时间链接、模态接地要求和推理步骤得到满足或遗漏。所有样本都经过手动验证,以提高接地性、清晰度和规则可靠性。我们还引入了LongShOTAgent,一个无训练的全模态证据搜索代理,将全视频预处理与目标检索、查询自适应片段细化以及基于视觉、语音和非语音音频证据的显式声明验证相结合。其迭代搜索-细化-验证循环暴露中间证据,并让模态特定专家在回答之前重新分析相关时刻。我们评估了105个视频能力模型,涵盖开源全模态模型、视觉语言系统、音频LLM、代理管道和闭源API。当前的MLLM远未饱和LongShOTBench,而我们的LongShOTAgent是最强的无训练系统,达到66.64%的整体性能。通过发布基准、排行榜和方法,我们为推进长形式全模态视频推理提供了一个共享、可解释的测试平台。代码、数据和排行榜可在以下网址获取:此 https URL。

英文摘要

Long-form omni-modal video understanding requires integrating vision, speech, and ambient audio with coherent long-context reasoning. Existing video benchmarks often trade off temporal scale, modality coverage, open-ended interaction, and interpretable scoring. To address this gap, we introduce LongShOTBench, a long video understanding benchmark designed around three coupled goals: holistic omni-modal integration, intent-driven open-ended interaction, and rubric-level diagnosis. It builds single- and multi-turn questions from real viewing scenarios, with systematic tasks probing visual, speech, ambient-audio, temporal, and cross-modal reasoning. Each item includes a reference answer and a weighted criterion-level rubric, letting evaluation identify which perceptual facts, temporal links, modality-grounding requirements, and reasoning steps are satisfied or missed. All samples are manually verified to improve grounding, clarity, and rubric reliability. We also introduce LongShOTAgent, a training-free omni-modal evidence-seeking agent coupling full-video preprocessing with targeted retrieval, query-adaptive segment refinement, and explicit claim verification over visual, speech, and non-speech audio evidence. Its iterative search-refine-verify loop exposes intermediate evidence and lets modality-specific specialists re-analyze relevant moments before answering. We evaluate 105 video-capable models spanning open-source omni-modal models, vision-language systems, audio LLMs, agentic pipelines and closed-source APIs. Current MLLMs remain far from saturating LongShOTBench, while our LongShOTAgent is the strongest training-free system, reaching 66.64% overall. By releasing the benchmark, leaderboard, and method, we provide a shared, interpretable testbed for advancing long-form omni-modal video reasoning. Code, data, and the leaderboard are available at https://longshot.cvmbzuai.com/.

2512.13853 2026-06-17 cs.LG cond-mat.stat-mech math.PR stat.ML 版本更新

Dropout Neural Network Training Viewed from a Percolation Perspective

从逾渗视角看待Dropout神经网络训练

Finley Devlin, Jaron Sanders

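AI summary Studies percolation in training deep neural networks with dropout, builds new percolation models that relate network topology to the input-output path problem, and shows a percolative effect in dropout that can cause training to break down.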

Comments 21 pages, 14 figures

详情
AI中文摘要

在这项工作中,我们研究了使用dropout训练深度神经网络(NNs)时逾渗的存在和影响。Dropout方法是训练NNs的正则化技术,由G. Hinton等人(2012)首次提出。这些方法在训练的每个阶段随机临时移除NN中的连接,并用随机梯度下降(SGD)更新剩余子网络。随机从网络中移除连接的过程类似于逾渗,这是统计物理的一个范式模型。如果dropout移除足够多的连接,使得NN的输入和输出之间没有路径,那么NN就无法根据数据做出预测。我们研究了模拟NN中dropout的新逾渗模型,并刻画了网络拓扑与该路径问题之间的关系。该理论证明了dropout中存在逾渗效应。我们还表明,在使用dropout训练无偏置NN时,这种逾渗效应可能导致训练崩溃;并且我们启发式地论证了这种崩溃也扩展到有偏置的NN。

英文摘要

In this work, we investigate the existence and effect of percolation in training deep Neural Networks (NNs) with dropout. Dropout methods are regularisation techniques for training NNs, first introduced by G. Hinton et al. (2012). These methods temporarily remove connections in the NN, randomly at each stage of training, and update the remaining subnetwork with Stochastic Gradient Descent (SGD). The process of removing connections from a network at random is similar to percolation, a paradigm model of statistical physics. If dropout were to remove enough connections such that there is no path between the input and output of the NN, then the NN could not make predictions informed by the data. We study new percolation models that mimic dropout in NNs and characterise the relationship between network topology and this path problem. The theory shows the existence of a percolative effect in dropout. We also show that this percolative effect can cause a breakdown when training NNs without biases with dropout; and we argue heuristically that this breakdown extends to NNs with biases.
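The path problem described in the abstract can be illustrated with a small Monte Carlo sketch. This is not the authors' percolation model: it assumes a fully connected feed-forward network in which each connection is kept independently with probability `keep_prob`, and the function names, layer widths, and trial count are all hypothetical choices made for illustration.

```python
import random


def has_input_output_path(widths, keep_prob, rng):
    """Return True if some input unit still reaches some output unit.

    widths: layer sizes of a fully connected feed-forward network (assumed).
    keep_prob: probability that each connection survives dropout.
    """
    reachable = set(range(widths[0]))  # every input unit is a starting point
    for n_out in widths[1:]:
        next_reachable = set()
        for dst in range(n_out):
            # One independent draw per (reachable source, dst) connection;
            # dst is reached as soon as one kept connection is found.
            if any(rng.random() < keep_prob for _ in reachable):
                next_reachable.add(dst)
        if not next_reachable:
            return False  # this layer is cut off, so no path can exist
        reachable = next_reachable
    return True


def disconnection_probability(widths, keep_prob, trials=2000, seed=0):
    """Estimate how often dropout leaves no input-to-output path."""
    rng = random.Random(seed)
    cut = sum(
        not has_input_output_path(widths, keep_prob, rng) for _ in range(trials)
    )
    return cut / trials


if __name__ == "__main__":
    for p in (0.2, 0.5, 0.8):
        print(p, disconnection_probability([3, 3, 3, 3], p))
```

In this toy setting, narrow and deep networks with a low keep probability are disconnected far more often than wide ones, which is the regime in which the network cannot make predictions informed by the data.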

2506.24121 2026-06-17 cs.CV 版本更新

TextMesh4D: Zero-shot Text-to-4D Mesh Generation

TextMesh4D: 零样本文本到4D网格生成

Sisi Dai, Xinxin Su, Kai Xu

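AI summary Proposes TextMesh4D, a zero-shot text-to-dynamic-mesh framework that uses a Jacobian Deformation Field and a Local-Global Semantic Regularizer to resolve the mismatch between diffusion guidance and mesh topology constraints, achieving high temporal consistency and geometric fidelity.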

详情
AI中文摘要

大规模、高质量动态3D(4D)资产对于学习物理基础表示至关重要,但大规模捕获和标注成本高昂。这限制了监督式4D学习的可行性,并激发了利用预训练扩散先验的零样本文本到4D生成。为了建模复杂动态,先前方法通常采用隐式3D表示(如NeRF或3DGS)以利用其变形能力。然而,其隐式性质对表面拓扑的控制有限,阻碍了高保真几何,并使时间一致表面重建具有挑战性。为解决这些限制,我们探索零样本文本到4D网格生成。然而,将基于扩散的引导与拓扑约束网格结合时会出现结构不匹配:引导是噪声且空间不一致的,而网格施加严格的拓扑约束,使得直接顶点级变形不稳定。在本文中,我们介绍TextMesh4D,这是首个零样本文本到4D框架,通过在两个互补层面解决上述挑战,直接生成动态网格。几何上,我们通过雅可比变形场(JDF)将变形建模从顶点转移到面,通过可积性强制积分公式实现拓扑感知表面重建。语义上,我们提出局部-全局语义正则化器(LGSR),通过联合约束局部变形合理性和全局形状一致性来随时间保持身份。大量实验表明,在单个24GB GPU上高效运行的同时,达到了最先进的时间一致性、结构保真度和视觉质量。

英文摘要

Large-scale, high-quality dynamic 3D (4D) assets are essential for learning physically grounded representations, but remain costly to capture and annotate at scale. This limits the viability of supervised 4D learning and motivates zero-shot text-to-4D generation leveraging pretrained diffusion priors. To model complex dynamics, prior methods typically adopt implicit 3D representations (e.g., NeRFs or 3DGS) for their deformation capacity. However, their implicit nature provides limited control over surface topology, which hinders high-fidelity geometry and makes temporally coherent surface reconstruction challenging. To address these limitations, we explore zero-shot text-to-4D mesh generation. However, a structural mismatch arises when combining diffusion-based guidance with topology-constrained meshes: the guidance is noisy and spatially inconsistent, while meshes impose severe topological constraints, making direct vertex-level deformation unstable. In this paper, we introduce TextMesh4D, the first zero-shot framework for text-to-4D that directly generates dynamic meshes by addressing the above challenge at two complementary levels. Geometrically, we shift deformation modeling from vertices to faces via a Jacobian Deformation Field (JDF), enabling topology-aware surface reconstruction through an integrability-enforcing integration formulation. Semantically, we propose a Local-Global Semantic Regularizer (LGSR) that preserves identity over time by jointly constraining local deformation plausibility and global shape consistency. Extensive experiments demonstrate state-of-the-art temporal consistency, structural fidelity, and visual quality, while remaining efficient on a single 24GB GPU.

2512.13009 2026-06-17 cs.RO 版本更新

K-VARK: Kernelized Variance-Aware Residual Kalman Filter for Sensorless Force Estimation in Collaborative Robots

K-VARK: 用于协作机器人无传感器力估计的核化方差感知残差卡尔曼滤波器

Oğuzhan Akbıyık, Naseem Alhousani, Fares J. Abu-Dakka

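AI summary Proposes K-VARK, which uses Kernelized Movement Primitives to learn the predictive mean and heteroscedastic variance of residual torques and adapts the Kalman filter noise covariances, achieving sensorless force estimation on a 6-DoF collaborative manipulator with over 20% lower RMSE.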

详情
AI中文摘要

可靠接触力估计对于确保机器人与非结构化环境的安全和精确交互至关重要。然而,由于固有的建模误差以及复杂的残差动力学和摩擦,准确的无传感器力估计仍然具有挑战性。为应对这一挑战,本文提出K-VARK(核化方差感知残差卡尔曼滤波器),一种将关节残差力矩的核化概率模型集成到自适应卡尔曼滤波框架中的新颖方法。通过在优化激励轨迹上训练的核化运动基元,K-VARK捕获残差力矩的预测均值和输入相关的异方差方差,反映数据变异性和距训练样本距离的影响。这些统计信息通过增广测量噪声协方差来通知方差感知的虚拟测量更新,而过程噪声协方差通过变分贝叶斯优化在线自适应以处理动态干扰。在6自由度协作机械臂上的实验验证表明,与最先进的无传感器力估计方法相比,K-VARK的RMSE降低了20%以上,为抛光、装配等高级任务提供了鲁棒且准确的外部力/力矩估计。

英文摘要

Reliable estimation of contact forces is crucial for ensuring safe and precise interaction of robots with unstructured environments. However, accurate sensorless force estimation remains challenging due to inherent modeling errors and complex residual dynamics and friction. To address this challenge, in this paper, we propose K-VARK (Kernelized Variance-Aware Residual Kalman filter), a novel approach that integrates a kernelized, probabilistic model of joint residual torques into an adaptive Kalman filter framework. Through Kernelized Movement Primitives trained on optimized excitation trajectories, K-VARK captures both the predictive mean and input-dependent heteroscedastic variance of residual torques, reflecting data variability and distance-to-training effects. These statistics inform a variance-aware virtual measurement update by augmenting the measurement noise covariance, while the process noise covariance adapts online via variational Bayesian optimization to handle dynamic disturbances. Experimental validation on a 6-DoF collaborative manipulator demonstrates that K-VARK achieves over 20% reduction in RMSE compared to state-of-the-art sensorless force estimation methods, yielding robust and accurate external force/torque estimation suitable for advanced tasks such as polishing and assembly.

2512.11784 2026-06-17 cs.LG stat.ML 版本更新

Softmax as Linear Attention in the Large-Prompt Regime: a Measure-based Perspective

大提示词机制下的Softmax作为线性注意力:基于测度的视角

Etienne Boursier, Claire Boyer

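AI summary Proposes a measure-based framework that builds on the convergence of softmax attention to a linear operator in the infinite-prompt limit and establishes non-asymptotic concentration bounds at finite prompt length, so that optimization analyses for linear attention transfer to large-prompt softmax attention.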

详情
AI中文摘要

Softmax注意力是Transformer架构的核心组成部分,但其非线性结构给理论分析带来了重大挑战。我们开发了一个统一的、基于测度的框架,用于研究有限和无限提示词下的单层softmax注意力。对于独立同分布的高斯输入,我们利用softmax算子在大提示词极限下收敛到作用于底层输入标记测度的线性算子这一事实。基于这一见解,我们建立了softmax注意力输出和梯度的非渐近浓度界,量化了有限提示词模型接近其无限提示词对应模型的速度,并证明了在具有次高斯标记的一般上下文学习设置中,这种浓度在整个训练轨迹上保持稳定。在线性回归的上下文学习中,我们利用易处理的无限提示词动力学来分析有限提示词长度下的训练。我们的结果表明,当提示词足够长时,为线性注意力开发的优化分析可以直接迁移到softmax注意力上,表明大提示词下的softmax注意力继承了其线性对应物的分析结构。这反过来为研究大提示词机制下softmax注意力层的训练动力学和统计行为提供了一个有原则且广泛适用的工具包。

英文摘要

Softmax attention is a central component of transformer architectures, yet its nonlinear structure poses significant challenges for theoretical analysis. We develop a unified, measure-based framework for studying single-layer softmax attention under both finite and infinite prompts. For i.i.d. Gaussian inputs, we lean on the fact that the softmax operator converges in the infinite-prompt limit to a linear operator acting on the underlying input-token measure. Building on this insight, we establish non-asymptotic concentration bounds for the output and gradient of softmax attention, quantifying how rapidly the finite-prompt model approaches its infinite-prompt counterpart, and prove that this concentration remains stable along the entire training trajectory in general in-context learning settings with sub-Gaussian tokens. In the case of in-context linear regression, we use the tractable infinite-prompt dynamics to analyze training at finite prompt length. Our results allow optimization analyses developed for linear attention to transfer directly to softmax attention when prompts are sufficiently long, showing that large-prompt softmax attention inherits the analytical structure of its linear counterpart. This, in turn, provides a principled and broadly applicable toolkit for studying the training dynamics and statistical behavior of softmax attention layers in large prompt regimes.

2509.12742 2026-06-17 cs.CV 版本更新

Effective Gaussian Management for High-fidelity Object Reconstruction

高保真物体重建的有效高斯管理

Jiateng Liu, Hao Gao, Jiu-Cheng Xie, Chi-Man Pun, Jian Xiong, Haolun Li, Junxin Chen, Feng Xu

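AI summary Proposes a Gaussian management framework that combines selective attribute activation, adaptive representation, and task-decoupled pruning with a regularized surface reconstruction module, achieving high-fidelity appearance and geometry reconstruction with fewer parameters.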

详情
AI中文摘要

本文提出了一种有效的高斯管理框架,用于外观和几何的高保真场景重建。与最近将所有基元在优化过程中统一处理的高斯泼溅(GS)管线不同,我们的框架显式管理高斯的属性激活、表示和剪枝。具体来说,我们的框架首先引入GauSep,一种新的致密化策略,选择性地激活高斯颜色或法线属性,以缓解由双重监督产生的破坏性梯度冲突。我们进一步提出GauRep,一种自适应高斯表示,动态调整球谐函数(SHs)阶数并执行任务解耦剪枝,以在个体和全局层面减少冗余。为了为上述管理过程提供可靠的几何监督,我们还引入了CoRe,一个正则化表面重建模块,通过置信度机制从SDF分支蒸馏鲁棒的法线场到高斯表示。值得注意的是,所提出的高斯管理与各种重建架构兼容,可以无缝集成以提高性能同时减小模型大小。大量实验表明,与最先进方法相比,我们的方法在外观和几何重建上实现了优越或可比的性能,同时使用了显著更少的参数。

英文摘要

This paper proposes an effective Gaussian management framework for high-fidelity scene reconstruction of both appearance and geometry. Unlike recent Gaussian Splatting (GS) pipelines that treat all primitives uniformly during optimization, our framework explicitly manages the attribute activation, representation, and pruning of Gaussians. Specifically, our framework first introduces GauSep, a novel densification strategy that selectively activates Gaussian color or normal attributes to alleviate destructive gradient conflicts arising from dual supervision. We further propose GauRep, an adaptive Gaussian representation that dynamically adjusts spherical harmonics (SH) orders and performs task-decoupled pruning to reduce redundancy at both the individual and global levels. To provide reliable geometric supervision for the above management process, we additionally introduce CoRe, a regularized surface reconstruction module that distills robust normal fields from an SDF branch to the Gaussian representation through a confidence mechanism. Notably, the proposed Gaussian management is compatible with various reconstruction architectures and can be seamlessly integrated to improve performance while reducing model size. Extensive experiments demonstrate that our approach achieves superior or comparable performance in appearance and geometry reconstruction compared with state-of-the-art methods, while using significantly fewer parameters.

2511.01352 2026-06-17 cs.LG astro-ph.HE astro-ph.IM hep-ex physics.data-an 版本更新

MiniFool -- Physics-Constraint-Aware Minimizer-Based Adversarial Attacks in Deep Neural Networks

MiniFool——深度神经网络中基于物理约束感知的最小化器对抗攻击

Lucie Flek, Oliver Janik, Philipp Alexander Jung, Akbar Karimi, Timo Saala, Alexander Schmidt, Matthias Schott, Philipp Soldin, Matthias Thiesmeyer, Christopher Wiebusch, Ulrich Willemsen

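AI summary Proposes MiniFool, which generates physics-aware adversarial examples by minimizing a cost function that combines a χ² test statistic with the deviation from a target score, for testing neural network classifiers in particle and astroparticle physics and quantifying the robustness of network decisions.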

Comments Submitted to Computing and Software for Big Science

详情
Journal ref
Published in: Eur.Phys.J.C 86 (2026) 6, 641
AI中文摘要

在本文中,我们提出了一种新算法MiniFool,该算法实现了物理启发的对抗攻击,用于测试粒子物理和天体粒子物理中基于神经网络的分类任务。虽然我们最初为IceCube中微子天文台的天体物理tau中微子搜索开发了该算法,但我们将其应用于其他科学领域的更多数据,从而证明了其通用性。在此,我们将该算法应用于著名的MNIST数据集,以及大型强子对撞机CMS实验的开放数据。该算法基于最小化一个代价函数,该函数结合了基于χ²的检验统计量与期望目标分数的偏差。检验统计量根据实验不确定性量化了应用于数据的扰动的概率。对于我们研究的用例,我们发现翻转分类的可能性对于最初正确分类和错误分类的事件是不同的。当测试分类随攻击参数(该参数缩放实验不确定性)的变化时,可以量化网络决策的鲁棒性。此外,这允许测试未标记实验数据分类的鲁棒性。

英文摘要

In this paper, we present a new algorithm, MiniFool, that implements physics-inspired adversarial attacks for testing neural network-based classification tasks in particle and astroparticle physics. While we initially developed the algorithm for the search for astrophysical tau neutrinos with the IceCube Neutrino Observatory, we apply it to further data from other science domains, thus demonstrating its general applicability. Here, we apply the algorithm to the well-known MNIST data set and, furthermore, to Open Data from the CMS experiment at the Large Hadron Collider. The algorithm is based on minimizing a cost function that combines a $\chi^2$-based test statistic with the deviation from the desired target score. The test statistic quantifies the probability of the perturbations applied to the data based on the experimental uncertainties. For our studied use cases, we find that the likelihood of a flipped classification differs between initially correctly and initially incorrectly classified events. When testing changes of the classifications as a function of an attack parameter that scales the experimental uncertainties, the robustness of the network decision can be quantified. Furthermore, this allows testing the robustness of the classification of unlabeled experimental data.
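The cost function described in the abstract might look roughly like the sketch below. The abstract does not give the exact form, so several things here are assumptions: the squared deviation from the target score, the relative weight `penalty`, the finite-difference gradient descent used as the minimizer, and the hypothetical `toy_score` classifier standing in for a real network. The `scale` argument plays the role of the attack parameter that scales the experimental uncertainties.

```python
import math


def toy_score(x):
    """Hypothetical stand-in classifier: logistic score of a fixed linear model."""
    weights = [1.5, -2.0]
    z = sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))


def attack_cost(x_adv, x_orig, sigma, target, scale=1.0, penalty=10.0):
    """Chi-square size of the perturbation plus squared distance to the target score."""
    chi2 = sum(
        ((a - o) / (scale * s)) ** 2 for a, o, s in zip(x_adv, x_orig, sigma)
    )
    return chi2 + penalty * (toy_score(x_adv) - target) ** 2


def minimize_attack(x_orig, sigma, target, scale=1.0, steps=500, lr=0.01, eps=1e-5):
    """Plain gradient descent with central finite differences (assumed minimizer)."""
    x = list(x_orig)
    for _ in range(steps):
        grad = []
        for i in range(len(x)):
            up = x[:i] + [x[i] + eps] + x[i + 1:]
            down = x[:i] + [x[i] - eps] + x[i + 1:]
            diff = attack_cost(up, x_orig, sigma, target, scale) - attack_cost(
                down, x_orig, sigma, target, scale
            )
            grad.append(diff / (2 * eps))
        x = [xi - lr * g for xi, g in zip(x, grad)]
    return x


if __name__ == "__main__":
    x0, sigma = [1.0, 0.0], [0.5, 0.5]
    x_adv = minimize_attack(x0, sigma, target=0.0)
    print("original score:", toy_score(x0))
    print("attacked score:", toy_score(x_adv))
```

The chi-square term keeps the perturbation within what the uncertainties `sigma` make plausible, while the second term pulls the classifier output toward the target; increasing `scale` loosens the first term.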

2505.03509 2026-06-17 cs.LG astro-ph.IM 版本更新

AnomalyMatch: Discovering Rare Objects of Interest with Semi-supervised and Active Learning

AnomalyMatch: 通过半监督和主动学习发现罕见感兴趣对象

Pablo Gómez, Laslo E. Ruhberg, Maria Teresa Nardone, David O'Ryan

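AI summary Proposes AnomalyMatch, which combines the semi-supervised FixMatch algorithm with active learning, treats anomaly detection as binary classification, and trains on a few labelled and many unlabelled images, achieving high AUROC and AUPRC under severe class imbalance.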

Comments Accepted for publication in RASTI; 17 pages; 12 figures

详情
AI中文摘要

大数据集中的异常检测在天文学和计算机视觉中至关重要。然而,由于标记数据稀缺,通常无法应用监督方法进行异常检测。我们提出了AnomalyMatch,一个结合了使用EfficientNet分类器的半监督FixMatch算法与主动学习的异常检测框架。AnomalyMatch专为大规模应用定制,并集成到ESA Datalabs科学平台中。在该方法中,我们将异常检测视为二分类问题,并有效利用有限的标记图像和丰富的未标记图像进行训练。我们通过用户界面实现主动学习,用于验证高置信度异常并纠正误报。在严重类别不平衡下,对GalaxyMNIST天文数据集和miniImageNet自然图像基准的评估显示出强大性能。从五到十个标记异常开始,我们实现了平均AUROC为0.96(miniImageNet)和0.89(GalaxyMNIST),相应的AUPRC分别为0.82和0.77。经过三个主动学习周期后,按分数排名前1%的图像中,异常精度达到76%(miniImageNet)至94%(GalaxyMNIST)。我们与已建立的Astronomaly软件在来自'Galaxy Zoo - The Galaxy Challenge'数据集的选定'奇特'星系上进行比较,实现了可比较的性能,平均AUROC为0.83。我们的结果强调了该方法在异常发现方面的卓越实用性和可扩展性,突显了针对标签严重稀缺领域的专门方法的价值。

英文摘要

Anomaly detection in large datasets is essential in astronomy and computer vision. However, due to a scarcity of labelled data, it is often infeasible to apply supervised methods to anomaly detection. We present AnomalyMatch, an anomaly detection framework combining the semi-supervised FixMatch algorithm using EfficientNet classifiers with active learning. AnomalyMatch is tailored for large-scale applications and integrated into the ESA Datalabs science platform. In this method, we treat anomaly detection as a binary classification problem and efficiently utilise limited labelled and abundant unlabelled images for training. We enable active learning via a user interface for verification of high-confidence anomalies and correction of false positives. Evaluations on the GalaxyMNIST astronomical dataset and the miniImageNet natural-image benchmark under severe class imbalance display strong performance. Starting from five to ten labelled anomalies, we achieve an average AUROC of 0.96 (miniImageNet) and 0.89 (GalaxyMNIST), with respective AUPRC of 0.82 and 0.77. After three active learning cycles, anomalies are ranked with 76% (miniImageNet) to 94% (GalaxyMNIST) precision in the top 1% of the highest-ranking images by score. We compare to the established Astronomaly software on selected 'odd' galaxies from the 'Galaxy Zoo - The Galaxy Challenge' dataset, achieving comparable performance with an average AUROC of 0.83. Our results underscore the exceptional utility and scalability of this approach for anomaly discovery, highlighting the value of specialised approaches for domains characterised by severe label scarcity.
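The "precision in the top 1% of the highest-ranking images by score" figure quoted above can be computed as in the following sketch. The function name and the toy inputs are hypothetical, and the tie-breaking and rounding choices are assumptions rather than details taken from the paper.

```python
def precision_in_top_fraction(scores, is_anomaly, fraction=0.01):
    """Fraction of true anomalies among the highest-scoring items.

    scores: anomaly scores, higher means more anomalous.
    is_anomaly: ground-truth flags (1 or True for anomalies), same order as scores.
    fraction: share of the dataset to inspect, e.g. 0.01 for the top 1%.
    """
    n_top = max(1, int(len(scores) * fraction))  # always inspect at least one item
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    top = ranked[:n_top]
    return sum(1 for i in top if is_anomaly[i]) / n_top


if __name__ == "__main__":
    toy_scores = [0.9, 0.8, 0.1, 0.2]
    toy_labels = [1, 0, 0, 1]
    print(precision_in_top_fraction(toy_scores, toy_labels, fraction=0.5))
```

Under severe class imbalance this kind of ranking metric is often more informative than accuracy, because it measures how much of a reviewer's limited inspection budget lands on real anomalies.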

2510.23798 2026-06-17 cs.CV cs.AI 版本更新

A geometric and deep learning reproducible pipeline for monitoring floating anthropogenic debris in urban rivers using in situ cameras

一种基于几何和深度学习的可复现流水线,用于利用原位摄像头监测城市河流中的漂浮人为碎片

Gauthier Grimmer, Romain Wenger, Clément Flint, Germain Forestier, Gilles Rixhon, Valentin Chardon

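AI summary Proposes a framework that combines a geometric model with deep learning to continuously quantify and monitor floating debris in urban rivers using fixed cameras, compares models for accuracy and speed under complex environmental conditions, and estimates debris size via projective geometry.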

详情
AI中文摘要

河流中漂浮人为碎片的扩散已成为一个紧迫的环境问题,对生物多样性、水质以及人类活动(如航行和娱乐)产生不利影响。本研究提出了一种新颖的方法框架,利用固定的原位摄像头监测上述废弃物。本研究提供了两个关键贡献:(i)利用深度学习对漂浮碎片进行连续量化和监测;(ii)在复杂环境条件下,识别出在精度和推理速度方面最合适的深度学习模型。这些模型在多种环境条件和学习配置下进行测试,包括与数据泄漏相关的偏差实验。此外,实现了一个几何模型,用于从二维图像估计检测对象的实际尺寸。该模型利用了相机的内参和外参特性。本研究结果强调了数据集构建协议的重要性,特别是在负样本图像的整合和时间泄漏的考虑方面。最后,证明了使用投影几何结合回归校正进行公制物体估计的可行性。该方法为开发稳健、低成本、自动化的城市水生环境监测系统铺平了道路。

英文摘要

The proliferation of floating anthropogenic debris in rivers has emerged as a pressing environmental concern, exerting a detrimental influence on biodiversity, water quality, and human activities such as navigation and recreation. The present study proposes a novel methodological framework for monitoring this debris using fixed, in-situ cameras. This study provides two key contributions: (i) the continuous quantification and monitoring of floating debris using deep learning and (ii) the identification of the most suitable deep learning model in terms of accuracy and inference speed under complex environmental conditions. These models are tested in a range of environmental conditions and learning configurations, including experiments on biases related to data leakage. Furthermore, a geometric model is implemented to estimate the actual size of detected objects from a 2D image. This model takes advantage of both intrinsic and extrinsic characteristics of the camera. The findings of this study underscore the significance of the dataset constitution protocol, particularly with respect to the integration of negative images and the consideration of temporal leakage. In conclusion, the feasibility of metric object estimation using projective geometry coupled with regression corrections is demonstrated. This approach paves the way for the development of robust, low-cost, automated monitoring systems for urban aquatic environments.

2507.11178 2026-06-17 cs.LG cs.AI 版本更新

A Gradient-based Causal Discovery Framework with Applications to Complex Industrial Processes

基于梯度的因果发现框架及其在复杂工业过程中的应用

Meiliang Liu, Huiwen Dong, Xiaoxiao Yang, Yunfang Xu, Mingbao Yang, Zijin Li, Zhengye Si, Xinyue Yang, Zhiwen Zhao

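AI summary Proposes GRNGC, which infers Granger causality by applying L1 regularization to the gradient between the model's input and output, needs only one prediction model, reduces computational overhead, and outperforms existing methods on several benchmark and real-world datasets.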

Comments 9 pages, 3 figures, conference

详情
AI中文摘要

随着深度学习技术的发展,各种基于神经网络的Granger因果模型已被提出。尽管这些模型表现出显著改进,但仍存在若干局限性。大多数现有方法采用组件式架构,需要为每个时间序列构建单独的模型,导致大量计算成本。此外,对神经网络第一层权重施加稀疏性惩罚以提取因果关系,削弱了模型捕捉复杂交互的能力。为解决这些局限性,我们提出基于梯度正则化的神经Granger因果(GRNGC),该方法仅需一个时间序列预测模型,并对模型输入与输出之间的梯度施加$L_{1}$正则化以推断Granger因果。此外,GRNGC不依赖于特定的时间序列预测模型,可通过KAN、MLP和LSTM等多种架构实现,提供增强的灵活性。在DREAM、Lorenz-96、fMRI BOLD和CausalTime上的数值模拟表明,GRNGC优于现有基线,并显著降低计算开销。同时,在真实世界的DNA、酵母、HeLa和膀胱尿路上皮癌数据集上的实验进一步验证了该模型在重建基因调控网络方面的有效性。

英文摘要

With the advancement of deep learning technologies, various neural network-based Granger causality models have been proposed. Although these models have demonstrated notable improvements, several limitations remain. Most existing approaches adopt a component-wise architecture, necessitating the construction of a separate model for each time series, which results in substantial computational costs. In addition, imposing a sparsity-inducing penalty on the first-layer weights of the neural network to extract causal relationships weakens the model's ability to capture complex interactions. To address these limitations, we propose Gradient Regularization-based Neural Granger Causality (GRNGC), which requires only one time series prediction model and applies $L_{1}$ regularization to the gradient between the model's input and output to infer Granger causality. Moreover, GRNGC is not tied to a specific time series forecasting model and can be implemented with diverse architectures such as KAN, MLP, and LSTM, offering enhanced flexibility. Numerical simulations on DREAM, Lorenz-96, fMRI BOLD, and CausalTime show that GRNGC outperforms existing baselines and significantly reduces computational overhead. Meanwhile, experiments on real-world DNA, Yeast, HeLa, and bladder urothelial carcinoma datasets further validate the model's effectiveness in reconstructing gene regulatory networks.

2510.19838 2026-06-17 cs.AI cs.CL cs.LG 版本更新

Branch-and-Browse: Efficient and Controllable Web Exploration with Tree-Structured Reasoning and Action Memory

Branch-and-Browse:具有树状推理与动作记忆的高效可控网页探索

Shiqi He, Yue Cui, Xinyu Ma, Yaliang Li, Bolin Ding, Mosharaf Chowdhury

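AI summary Proposes Branch-and-Browse, a framework that uses tree-structured reasoning, web state replay, and a page action memory for efficient, controllable multi-branch exploration by LLM web agents, reaching a 35.8% task success rate on WebArena and cutting execution time by up to 40.4%.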

详情
AI中文摘要

由大型语言模型(LLM)驱动的自主网页代理在执行目标导向任务(如信息检索、报告生成和在线交易)方面展现出强大潜力。这些代理标志着向开放网络环境中实用具身推理的关键一步。然而,现有方法在推理深度和效率方面仍然受限:简单的线性方法无法进行多步推理且缺乏有效的回溯,而其他搜索策略则粗粒度且计算成本高。我们引入了Branch-and-Browse,一个细粒度的网页代理框架,它统一了结构化推理-行动、上下文记忆和高效执行。它(i)采用显式子任务管理与树状结构化探索,实现可控的多分支推理;(ii)通过高效的网页状态重放与后台推理引导探索;(iii)利用页面动作记忆在会话内和跨会话间共享已探索的动作。在WebArena基准测试中,Branch-and-Browse的任务成功率达到35.8%,相对于最先进的方法执行时间减少高达40.4%。这些结果表明,Branch-and-Browse是一个可靠且高效的基于LLM的网页代理框架。

英文摘要

Autonomous web agents powered by large language models (LLMs) show strong potential for performing goal-oriented tasks such as information retrieval, report generation, and online transactions. These agents mark a key step toward practical embodied reasoning in open web environments. However, existing approaches remain limited in reasoning depth and efficiency: vanilla linear methods fail at multi-step reasoning and lack effective backtracking, while other search strategies are coarse-grained and computationally costly. We introduce Branch-and-Browse, a fine-grained web agent framework that unifies structured reasoning-acting, contextual memory, and efficient execution. It (i) employs explicit subtask management with tree-structured exploration for controllable multi-branch reasoning, (ii) bootstraps exploration through efficient web state replay with background reasoning, and (iii) leverages a page action memory to share explored actions within and across sessions. On the WebArena benchmark, Branch-and-Browse achieves a task success rate of 35.8% and reduces execution time by up to 40.4% relative to state-of-the-art methods. These results demonstrate that Branch-and-Browse is a reliable and efficient framework for LLM-based web agents.

2510.14807 2026-06-17 cs.AI Version update

Beyond the Sampled Token: Preserving Candidate Support in RLVR

超越采样令牌:在RLVR中保留候选支持

Ruotian Peng, Yi Ren, Zhouliang Yu, Weiyang Liu, Yandong Wen

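AI summary Analyzes exploration collapse in RLVR from the perspective of the candidate distribution and proposes CaSP, which preserves probability mass on the top-N candidates to improve pass@K without sacrificing pass@1. The method is validated on multiple benchmarks.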

Comments Technical report (23 pages, 16 figures, project page: https://spherelab.ai/simko/)

AI Chinese abstract

我们从下一个令牌预测的候选分布角度,重新审视了具有可验证奖励的强化学习(RLVR)中的探索崩溃。我们正式证明,当概率集中到前1个候选时,无论采样预算K如何,期望的不同响应数量都会崩溃为1。这一理论含义通过我们在训练过程中对前N个候选概率的实证跟踪得到进一步验证,其中前1个候选逐渐占据主导地位,而其他合理替代方案被抑制。这些发现提出了有效探索的关键需求:在前N个候选上保留不可忽略的概率质量。为此,我们提出了候选感知支持保留(CaSP),包含两个互补设计。具体来说,对于正确响应,CaSP在前N个候选上重新分配正梯度;对于错误响应,则对前1个候选施加更强的惩罚。与许多以牺牲pass@1为代价提高pass@K的探索导向方法不同,CaSP在整个K谱上提高了pass@K。这些增益泛化到6个数学、2个逻辑推理和2个编码基准测试,并扩展到32B参数模型和高达K=1024的采样预算,使其成为RLVR探索的一种原则性、候选级别的方法。

English abstract

We revisit exploration collapse in reinforcement learning with verifiable rewards (RLVR), from the perspective of the candidate distribution for next-token prediction. We formally show that as probability concentrates on the top-1 candidate, the expected number of distinct responses collapses to one regardless of the sampling budget K. This theoretical implication is further verified by our empirical tracking of top-N candidate probabilities during training, where the top-1 candidate progressively dominates while plausible alternatives are suppressed. These findings suggest a key desideratum for effective exploration: preserving non-negligible probability mass on the top-N candidates. To this end, we propose Candidate-aware Support Preservation (CaSP), with two complementary designs. Specifically, CaSP redistributes positive gradients among top-N candidates for correct responses, and applies a stronger penalty to the top-1 candidate for incorrect responses. Unlike many exploration-oriented methods that improve pass@K at the cost of pass@1, CaSP improves pass@K across the full K spectrum. These gains generalize to 6 math, 2 logical-reasoning, and 2 coding benchmarks, and scale to 32B-parameter models and sampling budgets up to K=1024, positioning it as a principled, candidate-level approach for RLVR exploration.
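
The collapse claim in the abstract above can be illustrated with a toy calculation. The sketch below is not the paper's sequence-level analysis; it assumes a single categorical distribution over N candidates and uses the standard identity that K i.i.d. draws yield sum_i (1 - (1 - p_i)^K) distinct outcomes in expectation. The helper names `expected_distinct` and `peaked` and the choice of N = 8 are illustrative assumptions, not taken from the paper.

```python
def expected_distinct(probs, k):
    """Expected number of distinct outcomes in k i.i.d. draws
    from a categorical distribution with probabilities `probs`."""
    # Outcome i is seen at least once with probability 1 - (1 - p_i)^k.
    return sum(1.0 - (1.0 - p) ** k for p in probs)


def peaked(top1, n):
    """Toy distribution (an assumption for illustration): `top1` mass on
    one candidate, the remainder spread evenly over the other n - 1."""
    rest = (1.0 - top1) / (n - 1)
    return [top1] + [rest] * (n - 1)


if __name__ == "__main__":
    budgets = (8, 64, 1024)
    for top1 in (0.5, 0.9, 0.99, 0.9999):
        row = [expected_distinct(peaked(top1, 8), k) for k in budgets]
        print(f"top-1 mass {top1}: " + ", ".join(
            f"K={k}: {v:.3f}" for k, v in zip(budgets, row)))
```

In this toy setting, increasing K recovers diversity only while the non-top-1 candidates keep non-negligible mass; as the top-1 mass approaches 1, the expected count approaches 1 for every fixed K, which is the motivation the abstract gives for preserving mass on the top-N candidates.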