arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.00825 2026-05-04 cs.CV

Posterior Augmented Flow Matching

George Stoica, Sayak Paul, Matthew Wallingford, Vivek Ramanujan, Abhay Nori, Winson Han, Ali Farhadi, Ranjay Krishna, Judy Hoffman

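Affiliations: Georgia Tech; University of Washington; Hugging Face; Ai2; UC Irvine

AI summary: This paper proposes Posterior-Augmented Flow Matching (PAFM), which replaces single-target supervision with an expectation over an approximate posterior given the intermediate state and condition, addressing the sparse, high-variance training signal of flow matching on high-dimensional images and improving generalization.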

Abstract

Flow matching (FM) trains a time-dependent vector field that transports samples from a simple prior to a complex data distribution. However, for high-dimensional images, each training sample supervises only a single trajectory and intermediate point, yielding an extremely sparse and high-variance training signal. This under-constrained supervision can cause flow collapse, where the learned dynamics memorize specific source-target pairings, mapping diverse inputs to overly similar outputs, failing to generalize. We introduce Posterior-Augmented Flow Matching (PAFM), a theoretically grounded generalization of FM that replaces single-target supervision with an expectation over an approximate posterior of valid target completions for a given intermediate state and condition. PAFM factorizes this intractable posterior into (i) the likelihood of the intermediate under a hypothesized endpoint and (ii) the prior probability of that endpoint under the condition, and uses an importance sampling scheme to construct a mixture over multiple candidate targets. We prove that PAFM yields an unbiased estimator of the original FM objective while substantially reducing gradient variance during training by aggregating information from many plausible continuation trajectories per intermediate. Finally, we show that PAFM improves over FM by up to 3.4 FID50K across different model scales (SiT-B/2 and SiT-XL/2), different architectures (SiT and MMDiT), and in both class and text conditioned benchmarks (ImageNet and CC12M), with a negligible increase in the compute overhead. Code: https://github.com/gstoica27/PAFM.git.
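The posterior factorization described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: it assumes a linear interpolation path x_t = (1 - t) x0 + t x1 with a standard Gaussian source x0, so that the likelihood of x_t under a hypothesized endpoint x1 is N(x_t; t x1, (1 - t)^2 I). The function names `posterior_weights` and `mixture_target` are hypothetical.

```python
import math

def posterior_weights(x_t, candidates, t, log_prior=None):
    """Self-normalized weights w_k proportional to p(x_t | x1_k) * prior_k.

    Assumes p(x_t | x1) = N(x_t; t * x1, (1 - t)^2 I); the Gaussian
    normalizer is identical for every candidate and cancels out.
    """
    sigma2 = (1.0 - t) ** 2
    logits = []
    for k, x1 in enumerate(candidates):
        sq_dist = sum((a - t * b) ** 2 for a, b in zip(x_t, x1))
        lp = 0.0 if log_prior is None else log_prior[k]
        logits.append(-0.5 * sq_dist / sigma2 + lp)
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def mixture_target(x_t, candidates, t, log_prior=None):
    """Posterior-weighted average of per-candidate velocities (x1 - x_t) / (1 - t)."""
    weights = posterior_weights(x_t, candidates, t, log_prior)
    target = [0.0] * len(x_t)
    for w_k, x1 in zip(weights, candidates):
        for d in range(len(x_t)):
            target[d] += w_k * (x1[d] - x_t[d]) / (1.0 - t)
    return target, weights
```

With a single candidate the mixture reduces to the usual single-target velocity under this path, which is why a sketch of this kind can be read as a generalization of the standard target rather than a different objective.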

2605.00800 2026-05-04 cs.LG

Generating Statistical Charts with Validation-Driven LLM Workflows

Pavlin G. Poličar, Andraž Pevcin, Blaž Zupan

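Affiliations: University of Ljubljana, Faculty of Computer and Information Science

AI summary: This paper proposes a structured LLM workflow that validates rendered output to generate diverse, readable statistical charts, addressing visualization-specific failure modes such as poor readability and semantic mismatch.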

Abstract

Generating diverse, readable statistical charts from tabular data remains challenging for LLMs, as many failures become apparent after rendering and are not detectable from data or code alone. Existing chart datasets also rarely provide fully aligned artifacts, such as executable code, dataset context, and question-answer pairs. We present a structured LLM-based workflow that decomposes chart generation into dataset screening, plot proposal, code synthesis, rendering, validation-driven refinement, description generation, and question-answer generation. By incorporating rendered-output validation, the workflow addresses visualization-specific failure modes such as readability and semantic mismatch. It treats chart generation as an inspectable process rather than a one-shot prompt-to-code task, retaining each chart with its code, dataset context, description, and question-answer pairs. Applied to UCI datasets, the workflow produces 1,500 charts from 74 datasets, spanning 24 chart families and paired with 30,003 question-answer pairs. We evaluate 16 multimodal LLMs (MLLMs) on these chart-question pairs. The results show that chart-syntax questions are nearly saturated, while value extraction, comparison, and reasoning remain more challenging, illustrating the workflow's utility for diagnostic studies of chart-grounded multimodal reasoning.

2605.00799 2026-05-04 cs.CV

GMGaze: MoE-Based Context-Aware Gaze Estimation with CLIP and Multiscale Transformer

Xinyuan Zhao, Yihang Wu, Ahmad Chaddad, Sarah A. Alkhodair, Reem Kateb

Affiliations: School of Artificial Intelligence, Guilin University of Electronic Technology, Guilin, China; The Laboratory for Imagery, Vision and Artificial Intelligence, École de Technologie Supérieure, Montreal, Canada; Information Technology Department, College of Computer and Information Sciences, King Saud University, Riyadh, Saudi Arabia; Department of Cybersecurity, College of Computer Science and Engineering, Taibah University, Medina, Saudi Arabia; Department of Networked Engineering, College of Computer Science and Engineering, Jeddah University, Jeddah, Saudi Arabia

AI summary: This paper proposes GMGaze, which combines a multi-scale Transformer with CLIP and uses semantic prototype conditioning and sparse MoE modules to improve gaze estimation accuracy; experiments show that it outperforms existing methods on multiple benchmarks.

Comments Accepted in KBS

Abstract

Gaze estimation methods commonly use facial appearance to predict the direction of a person's gaze. However, previous studies show three major challenges with convolutional neural network (CNN)-based, transformer-based, and contrastive language-image pre-training (CLIP)-based methods: late fusion of image features, lack of factor-aware conditioning, and impractical capacity scaling. To address these challenges, we propose Globally-conditioned Multi-scale Gaze estimation (GMGaze), which leverages a multi-scale transformer architecture. Specifically, the model first introduces semantic prototype conditioning, which modulates the CLIP global image embedding using four learned prototype banks (i.e., illumination, background, head pose, and appearance) to generate two complementary context-biased global tokens. These tokens, along with the CLIP patch and CNN tokens, are fused at the first layer. This early unified fusion prevents the information loss common in late-stage merging. Finally, each token passes through sparse Mixture-of-Experts modules, providing conditional computational capacity without uniformly increasing dense parameters. For cross-domain adaptation, we incorporate an adversarial domain adaptation technique with a feature separation loss that encourages the two global tokens to remain de-correlated. Experiments using four public benchmarks (MPIIFaceGaze, EYEDIAP, Gaze360, and ETH-XGaze) show that GMGaze achieves mean angular errors of 2.49°, 3.22°, 10.16°, and 1.44°, respectively, outperforming previous baselines in all within-domain settings. In cross-domain evaluations, it provides state-of-the-art (SOTA) results on two standard transfer routes.
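The mean angular error reported above is, in general, the average angle between predicted and ground-truth 3D gaze directions. A minimal sketch of that metric follows, assuming gaze is given as 3D direction vectors; the benchmarks' own preprocessing and the authors' evaluation code are not reproduced, and the function names are hypothetical.

```python
import math

def angular_error_deg(pred, true):
    """Angle in degrees between two 3D gaze direction vectors."""
    dot = sum(p * q for p, q in zip(pred, true))
    norm = math.sqrt(sum(p * p for p in pred)) * math.sqrt(sum(q * q for q in true))
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    cos_angle = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cos_angle))

def mean_angular_error_deg(preds, trues):
    """Average angular error over paired predictions and ground-truth vectors."""
    errors = [angular_error_deg(p, q) for p, q in zip(preds, trues)]
    return sum(errors) / len(errors)
```

Vectors do not need to be unit length, since the function normalizes them before taking the arccosine.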

2605.00798 2026-05-04 cs.LG cs.CL cs.MA

RunAgent: Interpreting Natural-Language Plans with Constraint-Guided Execution

Arunabh Srivastava, Mohammad A. Khojastepour, Srimat Chakradhar, Sennur Ulukus

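Affiliations: University of Maryland, College Park; NEC Laboratories America, Inc.

AI summary: RunAgent interprets natural-language plans through constraint-guided execution, using explicit control constructs in an agentic language to make structured workflow execution more reliable.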

Abstract

Humans solve problems by executing targeted plans, yet large language models (LLMs) remain unreliable for structured workflow execution. We propose RunAgent, a multi-agent plan execution platform that interprets natural-language plans while enforcing stepwise execution through constraints and rubrics. RunAgent bridges the expressiveness of natural language with the determinism of programming via an agentic language with explicit control constructs (e.g., IF, GOTO, FORALL). Beyond syntactic and semantic verification of each step's output, which is performed based on the specific instruction of that step, RunAgent autonomously derives and validates constraints based on the description of the task and its instance at each step. RunAgent also dynamically selects among LLM-based reasoning, tool usage, and code generation and execution (e.g., in Python), and incorporates error correction mechanisms to ensure correctness. Finally, RunAgent filters the context history by retaining only relevant information during the execution of each step. Evaluations on the Natural-plan and SciBench datasets demonstrate that RunAgent outperforms baseline LLMs and state-of-the-art PlanGEN methods.
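To make the idea of explicit control constructs with per-step validation concrete, here is a toy interpreter. It is purely illustrative: the abstract does not specify RunAgent's language, so the step schema (`op`, `action`, `check`, `then`, `target`, `over`, `out`) and the names `run_plan` and `EXAMPLE_PLAN` are all hypothetical, and plain Python callables stand in for LLM, tool, or code-execution steps.

```python
def _validate(step, pc, result):
    """Raise if the step declares a constraint and the result violates it."""
    check = step.get("check")
    if check is not None and not check(result):
        raise ValueError(f"constraint violated at step {pc}: {result!r}")

def run_plan(steps, state, max_steps=1000):
    """Execute a list of step dicts with a program counter (hypothetical schema)."""
    pc = 0
    executed = 0
    while pc < len(steps):
        executed += 1
        if executed > max_steps:
            raise RuntimeError("step budget exceeded; possible infinite loop")
        step = steps[pc]
        op = step["op"]
        if op == "DO":  # run one action and validate its output
            result = step["action"](state)
            _validate(step, pc, result)
            state[step["out"]] = result
            pc += 1
        elif op == "FORALL":  # apply the action to every item of a collection
            results = [step["action"](state, item) for item in state[step["over"]]]
            for r in results:
                _validate(step, pc, r)
            state[step["out"]] = results
            pc += 1
        elif op == "IF":  # conditional jump, otherwise fall through
            pc = step["then"] if step["cond"](state) else pc + 1
        elif op == "GOTO":  # unconditional jump
            pc = step["target"]
        else:
            raise ValueError(f"unknown op {op!r} at step {pc}")
    return state

# Example: increment n until it reaches 3, then scale every item by n.
EXAMPLE_PLAN = [
    {"op": "DO", "action": lambda s: s["n"] + 1, "out": "n", "check": lambda r: r <= 3},
    {"op": "IF", "cond": lambda s: s["n"] < 3, "then": 0},
    {"op": "FORALL", "over": "items", "action": lambda s, x: x * s["n"], "out": "scaled"},
]
```

Calling `run_plan(EXAMPLE_PLAN, {"n": 0, "items": [1, 2, 3]})` loops through steps 0 and 1 until `n` is 3 and then runs the FORALL step; a violated `check` stops execution at the offending step instead of letting an invalid output propagate.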

2605.00789 2026-05-04 cs.CV cs.AI cs.LG

Make Your LVLM KV Cache More Lightweight

Xihao Chen, Yangyang Guo, Roger Zimmermann

Affiliations: Integrative Sciences and Engineering Programme, National University of Singapore; School of Computing, National University of Singapore

AI summary: This paper proposes LightKV, which exploits redundancy among vision-token embeddings to shrink the KV cache, improving decoding efficiency and reducing GPU memory consumption.

Comments Accepted to Transactions on Machine Learning Research (TMLR), 2026

Abstract

Key-Value (KV) cache has become a de facto component of modern Large Vision-Language Models (LVLMs) for inference. While it enhances decoding efficiency in Large Language Models (LLMs), its direct adoption in LVLMs introduces substantial GPU memory overhead due to the large number of vision tokens processed during the prefill stage. To tackle this problem, we propose LightKV, a novel approach that reduces KV cache size by exploiting the redundancy among vision-token embeddings. Guided by text prompts, LightKV employs cross-modality message passing to aggregate informative messages across vision tokens and progressively compress them during prefill. This prompt-aware guidance distinguishes our method from prior vision-only compression strategies. We evaluate LightKV on eight open-source LVLMs across eight public benchmark datasets, e.g., MME and SeedBench. Experimental results demonstrate that with only 55% of the original vision tokens, LightKV (a) halves the vision-token KV cache size, (b) reduces computation by up to 40%, and (c) preserves general-purpose performance while significantly outperforming existing baselines.
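The memory argument can be made concrete with back-of-the-envelope arithmetic: the KV cache stores one key and one value vector per token, per layer, per KV head. The sketch below is not LightKV itself. The model shape (32 layers, 32 KV heads, head dimension 128, fp16) and the 576 vision tokens are illustrative assumptions, as is the two-stage compression schedule.

```python
def kv_cache_bytes(tokens_per_layer, num_kv_heads, head_dim, bytes_per_element=2):
    """KV cache size when layer l caches tokens_per_layer[l] tokens.

    The factor 2 accounts for storing both keys and values.
    """
    per_token_per_layer = 2 * num_kv_heads * head_dim * bytes_per_element
    return sum(tokens_per_layer) * per_token_per_layer

LAYERS = 32          # assumed decoder depth
VISION_TOKENS = 576  # assumed number of vision tokens at prefill

# Baseline: every layer caches all vision tokens.
full = kv_cache_bytes([VISION_TOKENS] * LAYERS, num_kv_heads=32, head_dim=128)

# Hypothetical progressive schedule: the first 8 layers keep all tokens,
# the remaining 24 layers keep half of them.
progressive = kv_cache_bytes([576] * 8 + [288] * 24, num_kv_heads=32, head_dim=128)

print(full / 2**20, "MiB for the uncompressed vision-token cache")
print(progressive / full, "fraction retained under the assumed schedule")
```

Because the cache is summed over layers, when compression happens during prefill matters as much as how many tokens survive at the end, so the fraction of tokens kept and the fraction of cache kept need not coincide.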

2605.00787 2026-05-04 cs.LG

SAVGO: Learning State-Action Value Geometry with Cosine Similarity for Continuous Control

Stavros Orfanoudakis, Pedro P. Vergara

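Affiliations: Delft University of Technology

AI summary: SAVGO explicitly incorporates value-based similarity into policy updates in the action space, unifying representation learning, value estimation, and policy optimization and improving performance on continuous-control tasks.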

Comments Reinforcement Learning

Abstract

While representation and similarity learning have improved the sample efficiency of Reinforcement Learning (RL), they are rarely used to shape policy updates directly in the action space. To bridge this gap, a geometry-aware RL algorithm that explicitly incorporates value-based similarity into the policy update, State-Action Value Geometry Optimization (SAVGO), is proposed. In detail, SAVGO learns a joint state-action embedding space in which pairs with similar action-value estimates exhibit high cosine similarity, while dissimilar pairs are mapped to distinct directions. This learned geometry enables the generation of a similarity kernel over candidate actions sampled at each update, allowing policy improvement to be guided directly toward higher-value regions beyond local gradient-based updates. As a result, representation learning, value estimation, and policy optimization are unified within a single geometry-consistent objective, while preserving the scalability of off-policy actor-critic training. The proposed method is evaluated on standard MuJoCo continuous-control benchmarks, demonstrating improvements over strong baselines on challenging high-dimensional tasks. Ablation studies are done to analyze the contributions of value-geometry learning and similarity-based policy updates.

2605.00781 2026-05-04 cs.CV

Map2World: Segment Map Conditioned Text to 3D World Generation

Jaeyoung Chung, Suyoung Lee, Jianfeng Xiang, Jiaolong Yang, Kyoung Mu Lee

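Affiliations: Seoul National University; Microsoft Research Asia

AI summary: This paper proposes the Map2World framework, which generates consistent and flexible 3D worlds from user-defined segment maps of arbitrary shape and scale and introduces a detail enhancer network to improve generation quality; experiments show that it outperforms existing methods in controllability, scale consistency, and content coherence.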

Comments project page: https://robot0321.github.io/Map2World/index.html

Abstract

3D world generation is essential for applications such as immersive content creation or autonomous driving simulation. Recent advances in 3D world generation have shown promising results; however, these methods are constrained by grid layouts and suffer from inconsistencies in object scale throughout the entire world. In this work, we introduce a novel framework, Map2World, that first enables 3D world generation conditioned on user-defined segment maps of arbitrary shapes and scales, ensuring global-scale consistency and flexibility across expansive environments. To further enhance the quality, we propose a detail enhancer network that generates fine details of the world. The detail enhancer enables the addition of fine-grained details without compromising overall scene coherence by incorporating global structure information. We design the entire pipeline to leverage strong priors from asset generators, achieving robust generalization across diverse domains, even under limited training data for scene generation. Extensive experiments demonstrate that our method significantly outperforms existing approaches in user-controllability, scale consistency, and content coherence, enabling users to generate 3D worlds under more complex conditions.

2605.00778 2026-05-04 cs.LG q-bio.NC

Observable Performance Does Not Fully Reflect System Organization: A Multi-Level Analysis of Gait Dynamics Under Occlusal Constraint

Jacques Raynal, Pierre Slangen, Jacques Margerit

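Affiliations: Laboratory of Bioengineering and Nanosciences (LBN); University of Montpellier; EuroMov Digital Health in Motion; IMT Mines Ales

AI summary: Using a multi-level analysis, this study reveals a mismatch between observable performance and system organization under occlusal constraint and proposes a structured analysis framework.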

Comments 1 table, 4 figures. Exploratory single-case study

Abstract

In biomechanical systems, observable performance is often used as a proxy for underlying system organization. However, this assumption implicitly presumes a correspondence between output metrics and internal system states that may not hold in adaptive systems. In this study, the vertical dimension of occlusion (VDO) is considered as a constraint applied to an adaptive neuromechanical system, enabling the exploration of system-level responses under controlled variations. A single-case design in a patient with Parkinson's disease allows an intra-individual analysis across repeated conditions. The analysis is structured across three complementary levels: (i) aggregated linear metrics describing observable performance, (ii) a dynamical systems framework describing temporal organization in state space, and (iii) a latent space representation obtained through unsupervised embedding. The results show that conditions with comparable observable performance may correspond to different organizations in both state space and latent space representations. This dissociation highlights a limitation of aggregated metrics and suggests that similar outputs may arise from non-equivalent system states. A fourth level is proposed as a purely conceptual extension describing potential relationships between system states. This level is not implemented and is not derived from experimental data. These observations are strictly exploratory and non-causal. The proposed framework does not establish mechanistic, predictive, or directional relationships, but provides a structured approach for analyzing constraint-driven systems across multiple levels of representation.

2605.00777 2026-05-04 cs.SD cs.CL eess.AS

LASE: Language-Adversarial Speaker Encoding for Indic Cross-Script Identity Preservation

Venkata Pushpak Teja Menta

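Affiliations: GitHub

AI summary: LASE uses adversarial training to improve cross-script speaker identity preservation, reducing the similarity gap that speaker encoders show when the same voice changes script; experiments show a residual gap consistent with zero on both corpora.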

Comments 7 pages, 2 figures, 2 tables. Code, model, and datasets at https://github.com/praxelhq/lase

Abstract

A speaker encoder used in multilingual voice cloning should treat the same speaker identically regardless of which script the audio was uttered in. Off-the-shelf encoders do not, and the failure is accent-conditional. On a 1043-pair Western-accented voice corpus across English, Hindi, Telugu, and Tamil, WavLM-base-plus-sv loses 0.082 absolute cosine similarity when the same voice changes script and ECAPA-TDNN loses 0.105. On a 1369-pair Indian-accented voice corpus, the gap shrinks to 0.006 (WavLM-SV) and 0.044 (ECAPA-TDNN). The leak is largest where it matters most for cross-script TTS: when a system projects a non-Indic-trained voice into Indic scripts. We present LASE (Language-Adversarial Speaker Encoder), a small projection head over frozen WavLM-base-plus trained with two losses: a supervised contrastive loss over voice identity, and a gradient-reversal cross-entropy against a 4-language classifier that pushes the embedding to be language-uninformative while remaining speaker-informative. Trained on 1118 quality-gated cross-script pairs synthesised from 8 commercial multilingual voices, LASE's residual gap is consistent with zero on both corpora (Delta = 0.013 Western, Delta = 0.026 Indian; both bootstrap 95% CIs include zero) and amplifies the cross-script-vs-floor margin 2.4-2.7x over both baselines. An ECAPA+GRL ablation shows the GRL objective improves either backbone but the WavLM choice contributes too. In synthetic multi-speaker diarisation, LASE matches ECAPA-TDNN on cross-script speaker recall (0.788 vs 0.789) with ~100x less training data. We release the r1 checkpoint, both corpora, and the bootstrap recipe.
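The statement that a gap is "consistent with zero" rests on a bootstrap confidence interval. A minimal percentile-bootstrap sketch is shown below, assuming the input is a list of per-pair gaps (for example, same-script cosine similarity minus cross-script cosine similarity for the same voice). It is not the released bootstrap recipe, and the function name and defaults are hypothetical.

```python
import random

def bootstrap_ci_mean(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)  # seeded so the interval is reproducible
    n = len(values)
    means = []
    for _ in range(n_resamples):
        # Resample n values with replacement and record the resample mean.
        sample = [values[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo_index = int((alpha / 2) * n_resamples)
    hi_index = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return means[lo_index], means[hi_index]
```

If the returned interval contains zero, the data do not rule out a zero mean gap at the chosen level; that is weaker than showing the gap is exactly zero.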

2605.00776 2026-05-04 cs.CL cs.AI

Directed Social Regard: Surfacing Targeted Advocacy, Opposition, Aid, Harms, and Victimization in Online Media

Scott Friedman, Ruta Wheelock, Sonja Schmer-Galunder, Drisana Iverson, Jake Vasilakes, Joan Zheng, Jeffrey Rye, Vasanth Sarathy, Christopher Miller

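Affiliations: SIFT; University of Florida; Claremont Graduate University; Tufts University

AI summary: This paper presents the DSR approach to multi-dimensional, multi-valence sentiment analysis, which uses transformer models to detect and score targets of sentiment, surfacing coexisting positive and negative sentiments in online media along with their targets.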

Comments 32 pages, 12 figures, 7 tables

Abstract

The language in online platforms, influence operations, and political rhetoric frequently directs a mix of pro-social sentiment (e.g., advocacy, helpfulness, compassion) and anti-social sentiment (e.g., threats, opposition, blame) at different topics, all in the same message. While many natural language processing (NLP) tools classify or score a text's overall sentiment as positive, neutral, or negative, these tools cannot report that positive and negative sentiments coexist, and they cannot report the target of those sentiments. This paper presents the Directed Social Regard (DSR) approach to multi-dimensional, multi-valence sentiment analysis, comprising a pair of transformer-based models that (1) detect span-level targets of sentiment in a message and then (2) score all spans within the message context along three (-1, 1) axes of regard that are motivated by social science theories of moral disengagement and moral framing. We present a data collection and annotation strategy for DSR dataset construction, a transformer-based architecture for span-level scoring, and a validation study with promising results. We apply the validated DSR model to six third-party datasets of online media and report meaningful correlations between DSR outputs and the labels and topics in these pre-existing social science datasets.

2605.00764 2026-05-04 cs.CV cs.HC

Modeling Subjective Urban Perception with Human Gaze

Lin Che, Xi Wang, Marc Pollefeys, Konrad Schindler, Martin Raubal, Peter Kiefer

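Affiliations: ETH Zurich

AI summary: This paper proposes a gaze-guided urban perception framework that combines gaze data with scene representations to better predict subjective urban perception.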

Abstract

Urban perception describes how people subjectively evaluate urban environments, shaping how cities are experienced and understood. Existing computational approaches primarily model urban perception directly from street view images, but largely ignore the human perceptual process through which such judgments are formed. In this paper, we introduce Place Pulse-Gaze, an urban perception dataset that augments street view images with synchronized eye-tracking recordings and individual perception labels. Based on this dataset, we propose a Gaze-Guided Urban Perception Framework to study how gaze behavior contributes to the modeling of subjective urban perception. The framework systematically investigates three complementary settings: gaze-only modeling, gaze fusion with explicit semantic scene representations, and gaze fusion with implicit richer visual representations. Experiments show that gaze alone already carries useful predictive signals for subjective urban perception, and that integrating gaze with scene representations further improves prediction under both semantic and richer visual representations. Overall, our findings highlight the importance of incorporating human perceptual processes into urban scene understanding and open a direction for gaze-guided multimodal urban computing.

2605.00762 2026-05-04 cs.LG cs.AI cs.MA

Meritocratic Fairness in Budgeted Combinatorial Multi-armed Bandits via Shapley Values

Shradha Sharma, Swapnil Dhamal, Shweta Jain

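Affiliations: Indian Institute of Technology Ropar

AI summary: This paper proposes a new framework for budgeted combinatorial multi-armed bandits that extends the Shapley value to the K-Shapley value to compute individual arm contributions under full-bandit feedback, and introduces the K-SVFair-FBF algorithm, which balances fairness and effectiveness.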

Abstract

We propose a new framework for meritocratic fairness in budgeted combinatorial multi-armed bandits with full-bandit feedback (BCMAB-FBF). Unlike semi-bandit feedback, the contribution of individual arms is not received in full-bandit feedback, making the setting significantly more challenging. To compute arm contributions in BCMAB-FBF, we first extend the Shapley value, a classical solution concept from cooperative game theory, to the $K$-Shapley value, which captures the marginal contribution of an agent restricted to a set of size at most $K$. We show that the $K$-Shapley value is the unique solution concept that satisfies the Symmetry, Linearity, Null player, and Efficiency properties. We next propose K-SVFair-FBF, a fairness-aware bandit algorithm that adaptively estimates the $K$-Shapley value with an unknown valuation function. Unlike the standard bandit literature on full-bandit feedback, K-SVFair-FBF not only learns the valuation function under the full-bandit feedback setting but also mitigates the noise arising from Monte Carlo approximations. Theoretically, we prove that K-SVFair-FBF achieves an $O(T^{3/4})$ bound on fairness regret. Through experiments on federated learning and social influence maximization datasets, we demonstrate that our approach achieves fairness and performs more effectively than existing baselines.
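For orientation, the classical Shapley value that the paper extends can be computed exactly for small games by averaging marginal contributions over all player orderings. The sketch below implements only that classical definition; the $K$-Shapley value's restriction to sets of size at most $K$ is defined in the paper and is not reproduced here. The function name `shapley_values` is hypothetical.

```python
from itertools import permutations

def shapley_values(players, value):
    """Exact classical Shapley values.

    `value` maps a frozenset of players to a real number. The cost grows
    factorially with the number of players, so this is for small games only.
    """
    totals = {p: 0.0 for p in players}
    count = 0
    for order in permutations(players):
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            # Marginal contribution of p when joining the current coalition.
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
        count += 1
    return {p: total / count for p, total in totals.items()}
```

Two of the axioms named in the abstract are easy to check on toy games: in a symmetric game every player receives the same share and the shares sum to the grand-coalition value (Efficiency), and a player that never changes the value receives zero (Null player).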

2605.00760 2026-05-04 cs.LG

Learning the Helmholtz equation operator with DeepONet for non-parametric 2D geometries

Rodolphe Barlogis, Ferhat Tamssaouet, Quentin Falcoz, Stéphane Grieu

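Affiliations: PROMES-CNRS, Université Perpignan Via Domitia (UPVD); LAAS-CNRS, Université de Toulouse (UT)

AI summary: Using the DeepONet framework, this paper solves the Helmholtz equation on non-parametric 2D geometries by learning the operator that maps scatterer geometry to the scattered field, and tests the model's generalization against the finite element method.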

Comments 24 pages, 16 figures

Abstract

This paper deals with solving the 2D Helmholtz equation on non-parametric domains, leveraging a physics-informed neural operator network based on the DeepONet framework. We consider a 2D square domain with an inclusion of arbitrary boundary geometry at its center. This inclusion acts as a scatterer for an incoming harmonic wave. The aim is to learn the operator linking the geometry of the scatterer to the resulting scattered field. A signed distance function to the boundary of the inner inclusion, evaluated at several points in the domain, is used to encode its geometry. It serves as input for the branch part of the DeepONet architecture, while local information is used as input for the trunk part. This approach enables the encoding of arbitrary geometries, whether they are parameterized or not. The evaluation of the model on unseen geometries is compared with its finite element method (FEM) equivalent to test its generalization capabilities. The trained network weights implicitly embed the local physics and their interaction with the domain geometry. If the training space sufficiently covers the target evaluation space, the model can generalize accordingly. Furthermore, it can be refined to extend to another region of interest without retraining from scratch. This framework also avoids the need to remesh the domain for each geometry. The proposed approach delivers a computationally lighter surrogate model than FEM alternatives and avoids relying on FEM-generated training data.
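The geometry encoding can be sketched as follows: a signed distance function (SDF) to the inclusion boundary is sampled at fixed sensor points and the resulting vector feeds the branch network, while the trunk network receives the query location. The sketch below assumes a circular inclusion in the unit square and a sign convention of negative inside the inclusion; the networks themselves are omitted, and only the generic DeepONet combination (a dot product of branch and trunk features) is shown. All names, the grid size, and the circle are illustrative assumptions, not the paper's setup.

```python
import math

def circle_sdf(x, y, cx=0.5, cy=0.5, r=0.2):
    """Signed distance to a circle: negative inside, zero on the boundary."""
    return math.hypot(x - cx, y - cy) - r

def sample_sdf_on_grid(sdf, n=8):
    """Evaluate the SDF on an n-by-n grid over the unit square (branch input)."""
    points = [(i / (n - 1), j / (n - 1)) for j in range(n) for i in range(n)]
    return [sdf(x, y) for x, y in points]

def deeponet_output(branch_features, trunk_features):
    """Generic DeepONet combination: inner product of the two feature vectors."""
    return sum(b * t for b, t in zip(branch_features, trunk_features))

# The branch input has a fixed length regardless of the inclusion's shape.
branch_input = sample_sdf_on_grid(circle_sdf, n=8)
```

Any shape for which a signed distance can be evaluated yields a vector of the same length, which is what lets the encoding handle geometries that have no parameterization.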

2605.00751 2026-05-04 cs.LG

NonZero: Interaction-Guided Exploration for Multi-Agent Monte Carlo Tree Search

Sizhe Tang, Zuyuan Zhang, Mahdi Imani, Tian Lan

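Affiliations: The George Washington University; Northeastern University

AI summary: NonZero uses an interaction-guided proposal rule to run surrogate-guided selection over a low-dimensional nonlinear representation, improving the scalability and exploration efficiency of multi-agent MCTS.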

Comments Accepted by ICML 2026 as Spotlight

Abstract

Monte Carlo Tree Search (MCTS) scales poorly in cooperative multi-agent domains because expansion must consider an exponentially large set of joint actions, severely limiting exploration under realistic search budgets. We propose NonZero, which keeps multi-agent MCTS tractable by running surrogate-guided selection over a low-dimensional nonlinear representation using an interaction-guided proposal rule, instead of directly exploring the full joint-action space. Our exploration uses an interaction score: single-agent deviations are ranked by predicted gain, while two-agent deviations are scored by a mixed-difference measure that reveals coordination benefits even when no single agent can improve alone. We formalize candidate proposal as a bandit problem over local deviations and derive a proposal rule, NonZero, with a sublinear local-regret guarantee for reaching approximate graph-local optima without enumerating the joint-action space. Empirically, NonZero improves sample efficiency and final performance on MatGame, SMAC, and SMACv2 relative to strong model-based and model-free baselines under matched search budgets.
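One way to see why a pairwise score can reveal coordination that single-agent gains miss is the standard second-order mixed difference. The abstract does not give NonZero's exact formula, so the definition below is an assumption, and the helper names are hypothetical; `q` stands for any joint-action value estimate.

```python
def deviate(joint, changes):
    """Return a copy of the joint action with some agents' actions replaced."""
    new = list(joint)
    for agent, action in changes.items():
        new[agent] = action
    return tuple(new)

def single_gain(q, joint, i, a_i):
    """Predicted gain when only agent i deviates to a_i."""
    base = tuple(joint)
    return q(deviate(base, {i: a_i})) - q(base)

def mixed_difference(q, joint, i, a_i, j, a_j):
    """Second-order mixed difference for a joint deviation of agents i and j.

    Positive values mean the pair gains more together than the sum of
    their individual gains.
    """
    base = tuple(joint)
    return (q(deviate(base, {i: a_i, j: a_j}))
            - q(deviate(base, {i: a_i}))
            - q(deviate(base, {j: a_j}))
            + q(base))

def coordination_value(joint):
    """Toy game: reward 1 only when every agent plays action 1."""
    return 1.0 if all(a == 1 for a in joint) else 0.0
```

In the toy coordination game, starting from the joint action (0, 0), neither agent gains by switching alone, yet the mixed difference for the pair is positive, so a proposal rule ranking only single-agent gains would never suggest the joint switch.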

2605.00744 2026-05-04 cs.CV

Quantum Gradient-Based Approach for Edge and Corner Detection Using Sobel Kernels

Mohammad Aamir Sohail, Gabriela Pinheiro, Yasemin Poyraz Kocak, Batuhan Hangun, Emre Camkerten, Simge Yigit, Hafize Asude Ertan

Affiliations: Department of Electrical Engineering and Computer Science; Department of Computer Science; Department of Computer Technologies; Department of Computer Engineering

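AI summary: This paper presents a quantum implementation of Sobel edge detection and Harris corner detection using two quantum image encodings, FRQI and QPIE, with quantum gradient computation followed by classical post-processing to improve detection quality. The outputs are consistent with the classical operators, and the QPIE configuration is more stable under limited measurement shots.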

详情
AI中文摘要

边缘检测是指识别数字图像中强度急剧变化的点,表示物体边界或结构特征。角点是灰度强度在多个方向上突然变化的位置,广泛应用于特征提取、目标跟踪和3D建模。本文提出了一种基于量子梯度计算的Sobel边缘检测和Harris风格角点检测的量子实现。采用两种量子图像编码方法——灵活的量子图像表示(FRQI)和量子概率图像编码(QPIE)——对输入数据进行编码并进行比较分析。所提出的方法引入了一种基于滞后2差分的量子梯度计算方案,使能够在叠加态下评估梯度特征。为提高检测质量和减少误报,对由量子电路识别出的候选角点应用经典后处理步骤。实验结果表明,所提出的量子电路输出与经典Sobel和Harris算子一致。此外,基于QPIE的配置在有限测量次数下比FRQI更稳定和一致。尽管梯度计算可以在电路层面高效执行,但总体成本仍由状态制备、测量和经典后处理主导。所有实验均在无噪声模拟下进行,性能在NISQ硬件上可能受噪声和测量限制的影响。因此,本文展示了经典边缘和角点检测方法的功能性和可扩展的量子实现,而非端到端的加速。

英文摘要

Edge detection refers to identifying points in a digital image where intensity changes sharply, indicating object boundaries or structural features. Corners are locations where gray-level intensity changes abruptly in multiple directions and are widely used in feature extraction, object tracking, and 3D modeling. In this study, we present a quantum implementation of Sobel-based edge detection and Harris-style corner detection. Two quantum image encoding methods - Flexible Representation of Quantum Images (FRQI) and Quantum Probability Image Encoding (QPIE) - are used to encode the input data and are comparatively analyzed. The proposed approach introduces a quantum gradient computation scheme based on lag-2 differences, enabling the evaluation of gradient-like features in superposition. To improve detection quality and reduce false positives, a classical post-processing step is applied to candidate corner points identified by the quantum circuit. Results show that the proposed quantum circuits produce outputs consistent with classical Sobel and Harris operators. Furthermore, the QPIE-based configuration yields more stable and coherent results than FRQI, especially under limited measurement shots. While gradient computation can be performed efficiently at the circuit level, the overall cost remains dominated by state preparation, measurement, and classical post-processing. All experiments are conducted under noiseless simulation, and performance on NISQ hardware may be affected by noise and measurement limitations. Therefore, this work demonstrates a functional and scalable quantum realization of classical edge and corner detection methods rather than an end-to-end speedup.
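For readers unfamiliar with the classical operator, the sketch below shows one way a lag-2 difference relates to the Sobel x-kernel: the kernel factors into a horizontal difference between pixels two columns apart and a vertical [1, 2, 1] smoothing. This is a purely classical illustration under that assumed reading of "lag-2 differences"; it is not the paper's quantum circuit, and the function names are hypothetical.

```python
SOBEL_X = [[-1, 0, 1],
           [-2, 0, 2],
           [-1, 0, 1]]


def sobel_x_direct(img):
    """Correlate the image with the 3x3 Sobel x-kernel (valid region only)."""
    rows, cols = len(img), len(img[0])
    out = []
    for r in range(1, rows - 1):
        row = []
        for c in range(1, cols - 1):
            row.append(sum(SOBEL_X[dr][dc] * img[r - 1 + dr][c - 1 + dc]
                           for dr in range(3) for dc in range(3)))
        out.append(row)
    return out


def sobel_x_lag2(img):
    """Same result via a lag-2 column difference plus [1, 2, 1] row smoothing."""
    rows, cols = len(img), len(img[0])
    # Lag-2 horizontal difference: pixels two columns apart.
    diff = [[img[r][c + 2] - img[r][c] for c in range(cols - 2)]
            for r in range(rows)]
    # Vertical smoothing of the differences with weights 1, 2, 1.
    return [[diff[r][c] + 2 * diff[r + 1][c] + diff[r + 2][c]
             for c in range(cols - 2)]
            for r in range(rows - 2)]


# Hypothetical 4x4 image with a vertical step edge.
image = [[0, 0, 10, 10] for _ in range(4)]
direct = sobel_x_direct(image)
factored = sobel_x_lag2(image)
print(direct, factored)
```

Both functions return the same 2x2 response for the step-edge image, which is why a difference of pixels two apart is enough to recover a Sobel-style gradient once the smoothing is applied.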

2605.00738 2026-05-04 cs.LG

Temporal Data Requirement for Predicting Unplanned Hospital Readmissions

预测非计划性医院再入院的时序数据需求

Ramin Mohammadi, Vahab Vahdat, Sarthak Jain, Amir T. Namin, Ramya Palacholla, Sagar Kamarthi

Affiliations: Northeastern University; Partners Healthcare Connected Health Innovation; MGH Institute for Technology Assessment; Harvard Medical School; Tufts University School of Medicine; Department of Public Health and Community Medicine

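AI summary: This paper studies how the observation window affects prediction of 30-day readmission after hip and knee arthroplasty. The best window for unstructured clinical notes is shorter than for structured data, whose performance plateaus as the window lengthens.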

详情
AI中文摘要

随着电子健康记录(EHRs)的普及,构建预测模型中的关键挑战是确定最优的历史数据时间窗口以最大化准确性。本研究探讨了从手术当天到三年前的各种观察窗口对髋膝关节置换术后30天再入院预测的影响。数据集涵盖超过400万条结构化就诊记录和8万条非结构化临床笔记,来自7174名患者。为了从临床笔记中提取意义,我们采用了非神经网络(BOW、计数BOW、TF-IDF、LDA)和神经网络编码器(BERT、1D CNN、BiLSTM、Average)。随后,我们评估了仅使用临床笔记、仅使用结构化数据以及两者结合的模型。我们的结果显示,非结构化临床笔记的最佳时间窗口显著短于结构化数据,最大预测性能通过手术前三个月至六个月的笔记实现。相比之下,使用结构化数据的性能随着时间窗口的延长而提高,但在十二个月后趋于稳定。这些模态特定的时间模式在模型复杂度或编码器类型无关的情况下保持一致。最终,这些发现挑战了更多历史数据必然带来更好机器学习预测的一般假设,建立了针对优化再入院预测模型的时间窗口指南。

英文摘要

With the proliferation of Electronic Health Records (EHRs), a critical challenge in building predictive models is determining the optimal historical data time window to maximize accuracy. This study investigates the impact of various observation windows, ranging from the day of surgery to three years prior, on predicting 30-day readmission following hip and knee arthroplasties. The dataset encompasses both structured encounter records (over 4 million) and unstructured clinical notes (80,000) from 7,174 patients. To extract meaning from the clinical notes, we employed a suite of non-neural (BOW, count BOW, TF-IDF, LDA) and neural (BERT, 1D CNN, BiLSTM, Average) encoders. We subsequently evaluated models utilizing clinical notes alone, structured data alone, and a combination of both modalities. Our results demonstrate that the optimal time window for unstructured clinical notes is significantly shorter than for structured data: maximum predictive performance was achieved using notes from just three to six months prior to surgery. In contrast, performance using structured data improved as the time window lengthened but plateaued after twelve months. These modality-specific temporal patterns remained consistent regardless of model complexity or encoder type. Ultimately, these findings challenge the general assumption that more historical data inherently yields better machine learning predictions, establishing targeted time-window guidelines for optimizing readmission prediction models.

2604.27977 2026-05-04 cs.AI cs.LG

D3-Gym: Constructing Real-World Verifiable Environments for Data-Driven Discovery

D3-Gym:构建现实世界可验证环境用于数据驱动发现

Hanane Nour Moussa, Yifei Li, Zhuoyang Li, Yankai Yang, Cheng Tang, Tianshu Zhang, Nesreen K. Ahmed, Ali Payani, Ziru Chen, Huan Sun

Affiliations: The Ohio State University; Cisco Research

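AI summary: D3-Gym is the first automatically constructed dataset of verifiable environments for scientific data-driven discovery. Its verification signal is of high quality, and training on trajectories sampled from it improves model performance.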

详情
AI中文摘要

尽管语言模型和代理在科学数据驱动发现方面取得进展,但缺乏可验证的现实世界科学任务环境限制了其发展。为填补这一空白,我们引入D3-Gym,首个自动构建的可验证环境数据集。D3-Gym包含565个来自239个真实科学仓库的跨学科任务,每个任务配备自然语言指令、可执行环境、输入数据集、参考代码解决方案和自动合成评估脚本。严格评估证实D3-Gym的验证信号质量高,与人工标注的黄金标准一致度达87.5%,在领域特定评估逻辑上表现良好。进一步训练发现,基于D3-Gym采样的轨迹在ScienceAgentBench上提升了Qwen3模型性能,Qwen3-32B模型提升7.8个绝对点,大幅缩小与强私有模型的差距。所有D3-Gym成果均可在https://github.com/OSU-NLP-Group/D3-Gym获取。

英文摘要

Despite recent progress in language models and agents for scientific data-driven discovery, further advancing their capabilities is held back by the absence of verifiable environments representing real-world scientific tasks. To fill this gap, we introduce D3-Gym, the first automatically constructed dataset with verifiable environments for scientific Data-Driven Discovery. D3-Gym comprises (1) 565 tasks sourced from 239 real scientific repositories across four disciplines where (2) each task is equipped with a natural language instruction, an executable environment with pre-installed dependencies, input dataset and artifact previews, a reference code solution, and an automatically synthesized evaluation script. Rigorous evaluation of the quality of the verification signal in D3-Gym confirms that our evaluation scripts achieve 87.5% agreement with human-annotated gold standards and strong alignment in domain-specific evaluation logic, showing their scientific soundness. Further, training on trajectories sampled from D3-Gym yields consistent and substantial gains across Qwen3 models of varying sizes on ScienceAgentBench, boosting Qwen3-32B by 7.8 absolute points and substantially shrinking the gap with strong proprietary models. All D3-Gym artifacts (environments, creation workflow, trajectories, and models) can be found at https://github.com/OSU-NLP-Group/D3-Gym.

2604.10418 2026-05-04 cs.CL

Turing or Cantor: That is the Question

图灵还是康托尔:这是一个问题

Eugene Eberbach

Affiliation: Dept. of Eng. and Science, Rensselaer Polytechnic Institute

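AI summary: This paper examines the connection between Turing's achievements and Cantor's set theory, proposes a measure of undecidability based on the probability distribution of input data, and defines three new complexity classes for TM-undecidable problems.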

Comments arXiv admin note: text overlap with arXiv:2106.15969

详情
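AI abstract (translated from Chinese)

Alan Turing is regarded as one of the founders of computer science, alongside Kurt Godel, Alonzo Church and John von Neumann. This paper presents several new research results. It demonstrates that Turing's achievements would not have been possible without Cantor's earlier contributions to set theory and the foundations of mathematics. It proposes a measure of undecidability based on the probability distribution of the input data, and an extension of Turing's work on infinite logics and Oracle machines to a broader class of super-Turing models of computation. It also defines three new complexity classes for TM-undecidable problems: U-complete (Universal complete), D-complete (Diagonalization complete) and H-complete (Hypercomputation complete). According to the paper, these classes had not been defined explicitly by other scientists and were inspired by the Cook/Levin NP-complete class. Finally, the counterpart of the P versus NP question for the NP-complete class is answered negatively for the U-complete class of undecidable problems.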

英文摘要

Alan Turing is considered a founder of modern computer science, together with Kurt Godel, Alonzo Church and John von Neumann. In this paper, multiple new research results are presented. It is demonstrated that Alan Turing's achievements would not have been possible without earlier seminal contributions by Georg Cantor in set theory and the foundations of mathematics. It is proposed to introduce a measure of undecidability for problems unsolvable by Turing machines, based on the probability distribution of their input data, i.e., to quantify the degree of unsolvability by the number of undecidable input instances versus decidable ones. It is also proposed to extend Turing's work on infinite logics and Oracle machines to a whole class of super-Turing models of computation. Next, three new complexity classes for TM-undecidable problems are defined: U-complete (Universal complete), D-complete (Diagonalization complete) and H-complete (Hypercomputation complete). These classes have never been defined explicitly before by other scientists, and they are inspired by the Cook/Levin NP-complete class for intractable problems. Finally, the counterpart of the famous open question of whether P is not equal to NP, posed for the NP-complete class, is answered negatively for the U-complete class of undecidable problems.

2604.06940 2026-05-04 cs.LG cs.AI

A First Guess is Rarely the Final Answer: Learning to Search in the Traveling Salesperson Problem

初次猜测很少是最终答案:学习在旅行商问题中搜索

Andoni Irazusta Garmendia

Affiliation: University of the Basque Country (UPV/EHU)

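AI summary: This paper proposes NICO-TSP, a 2-opt improvement framework for the TSP whose two-stage training improves search efficiency and robustness relative to existing methods.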

详情
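AI abstract (translated from Chinese)

Most neural solvers for the TSP are trained to output a single solution, yet in practice extra sampling or post-hoc search is routinely run at test time. This paper proposes NICO-TSP, which is trained in two stages: imitation learning on short-horizon optimal trajectories, followed by critic-free group-based reinforcement learning over longer rollouts. It improves search efficiency and robustness relative to existing methods.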

英文摘要

Most neural solvers for the Traveling Salesperson Problem (TSP) are trained to output a single solution, even though practitioners rarely stop there: at test time, they routinely spend extra compute on sampling or post-hoc search. This raises a natural question: can the search procedure itself be learned? Neural improvement methods take this perspective by learning a policy that applies local modifications to a candidate solution, accumulating gains over an improvement trajectory. Yet learned improvement for TSP remains comparatively immature, with existing methods still falling short of robust, scalable performance. We argue that a key reason is design mismatch: many approaches reuse state representations, architectural choices, and training recipes inherited from single-solution methods, rather than being built around the mechanics of local search. This mismatch motivates NICO-TSP (Neural Improvement for Combinatorial Optimization): a 2-opt improvement framework for TSP. NICO-TSP represents the current tour with exactly $n$ edge tokens aligned with the neighborhood operator, scores 2-opt moves directly without tour positional encodings, and trains via a two-stage procedure: imitation learning to short-horizon optimal trajectories, followed by critic-free group-based reinforcement learning over longer rollouts. Under compute-matched evaluations that measure improvement as a function of both search steps and wall-clock time, NICO-TSP delivers consistently stronger and markedly more step-efficient improvement than prior learned and heuristic search baselines, generalizes far more reliably to larger out-of-distribution instances, and serves both as a competitive replacement for classical local search and as a powerful test-time refinement module for constructive solvers.
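The 2-opt neighborhood that NICO-TSP scores can be sketched in a few lines. The code below shows only the classical operator (remove two tour edges, reconnect, reverse the segment in between) and the resulting change in tour length. It does not reproduce the learned policy, and the helper names and the four-city instance are hypothetical.

```python
import math


def tour_length(points, tour):
    """Total length of the closed tour visiting points in the given order."""
    n = len(tour)
    return sum(math.dist(points[tour[k]], points[tour[(k + 1) % n]])
               for k in range(n))


def two_opt_delta(points, tour, i, j):
    """Length change from replacing edges (i, i+1) and (j, j+1), 0 <= i < j < n."""
    n = len(tour)
    a, b = points[tour[i]], points[tour[i + 1]]
    c, d = points[tour[j]], points[tour[(j + 1) % n]]
    return (math.dist(a, c) + math.dist(b, d)
            - math.dist(a, b) - math.dist(c, d))


def apply_two_opt(tour, i, j):
    """Reverse the segment between positions i+1 and j (inclusive)."""
    return tour[:i + 1] + tour[i + 1:j + 1][::-1] + tour[j + 1:]


# Hypothetical instance: corners of the unit square, visited in a
# self-crossing order.
points = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
crossing = [0, 2, 1, 3]

delta = two_opt_delta(points, crossing, 0, 2)
improved = apply_two_opt(crossing, 0, 2)
print(improved, tour_length(points, crossing), tour_length(points, improved))
```

A tour over n cities has exactly n edges and each 2-opt move is defined by a pair of them, which is consistent with the abstract's choice of n edge tokens aligned with the neighborhood operator.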

2604.04385 2026-05-04 cs.CL cs.AI cs.LG

How Alignment Routes: Localizing, Scaling, and Controlling Policy Circuits in Language Models

对齐路由:在语言模型中本地化、扩展和控制策略电路

Gregory N. Frank

Affiliation: Independent Researcher

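AI summary: This study localizes the policy routing mechanism in alignment-trained language models, examines how the circuit changes with scale and how it can be controlled, and finds that safety-trained capabilities are gated by routing rather than removed.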

Comments Code and data: https://github.com/gregfrank/how-alignment-routes

详情
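AI abstract (translated from Chinese)

We localize the policy routing mechanism in alignment-trained language models. An intermediate-layer attention gate reads detected content and triggers deeper amplifier heads that strengthen the signal toward refusal. In smaller models the gate and amplifier are single heads; at larger scale they become bands of heads across adjacent layers. The gate contributes less than 1% of output DLA, yet interchange testing (p < 0.001) and knockout cascades confirm that it is causally necessary. At n >= 120, interchange screening detects the same motif in twelve models from six labs, although the specific heads differ by lab. Per-head ablation weakens by up to 58x at 72B and misses gates that interchange identifies; at scale, interchange is the only reliable audit. Continuously modulating the detection-layer signal steers the policy from hard refusal through evasion to factual answering. On safety prompts the same intervention turns refusal into harmful guidance, showing that the safety-trained capability is gated by routing rather than removed. Thresholds vary by topic and input language, and the circuit relocates across generations within a model family even when behavioral benchmarks show no change. Routing is early-commitment: the gate fires at its own layer, before deeper layers finish processing the input. An in-context substitution cipher reduces gate interchange necessity by 70 to 99% across three models, and the model switches to puzzle-solving instead of refusing. Injecting the plaintext gate activation into the cipher forward pass restores 48% of refusals in Phi-4-mini, localizing the bypass to the routing interface. A second method, cipher contrast analysis, uses plain/cipher DLA differences to map the full cipher-sensitive routing circuit in O(3n) forward passes. Any encoding that defeats detection-layer pattern matching bypasses the policy, whether or not deeper layers reconstruct the content.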

英文摘要

We localize the policy routing mechanism in alignment-trained language models. An intermediate-layer attention gate reads detected content and triggers deeper amplifier heads that boost the signal toward refusal. In smaller models the gate and amplifier are single heads; at larger scale they become bands of heads across adjacent layers. The gate contributes under 1% of output DLA, yet interchange testing (p < 0.001) and knockout cascade confirm it is causally necessary. Interchange screening at n >= 120 detects the same motif in twelve models from six labs (2B to 72B), though specific heads differ by lab. Per-head ablation weakens up to 58x at 72B and misses gates that interchange identifies; at scale, interchange is the only reliable audit. Modulating the detection-layer signal continuously controls policy from hard refusal through evasion to factual answering. On safety prompts the same intervention turns refusal into harmful guidance, showing that the safety-trained capability is gated by routing, not removed. Thresholds vary by topic and by input language, and the circuit relocates across generations within a family even while behavioral benchmarks register no change. Routing is early-commitment: the gate fires at its own layer before deeper layers finish processing the input. An in-context substitution cipher collapses gate interchange necessity by 70 to 99% across three models, and the model switches to puzzle-solving rather than refusal. Injecting the plaintext gate activation into the cipher forward pass restores 48% of refusals in Phi-4-mini, localizing the bypass to the routing interface. A second method, cipher contrast analysis, uses plain/cipher DLA differences to map the full cipher-sensitive routing circuit in O(3n) forward passes. Any encoding that defeats detection-layer pattern matching bypasses the policy regardless of whether deeper layers reconstruct the content.

2603.28980 2026-05-04 cs.CV

Stepper: Stepwise Immersive Scene Generation with Multiview Panoramas

Stepper:基于多视角全景图的分步沉浸式场景生成

Felix Wimbauer, Fabian Manhardt, Michael Oechsle, Nikolai Kalischek, Christian Rupprecht, Daniel Cremers, Federico Tombari

Affiliations: Google; University of Oxford; MCML; Technical University of Munich

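AI summary: Stepper expands multi-view panoramas step by step to address the trade-off between visual fidelity and explorability in prior methods, producing high-quality immersive 3D scenes.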

Comments Accepted at CVPR 2026 Findings; Find our project page under https://fwmb.github.io/stepper/

详情
AI中文摘要

从文本合成沉浸式3D场景的技术正迅速成熟,得益于新型视频生成模型和前馈3D重建技术,具有在AR/VR和世界建模中的巨大潜力。尽管全景图像已被证明可用于场景初始化,但现有方法在视觉保真度和可探索性之间存在权衡:自回归扩展受上下文漂移影响,而全景视频生成受限于低分辨率。我们提出了Stepper,一个统一的文本驱动沉浸式3D场景合成框架,通过分步全景场景扩展克服这些限制。Stepper利用一种新颖的多视角360°扩散模型,实现一致的高分辨率扩展,并结合一个几何重建流程,强制几何一致性。在新的大规模多视角全景图数据集上训练,Stepper实现了最先进的保真度和结构一致性,优于先前方法,从而为沉浸式场景生成设定了新标准。

英文摘要

The synthesis of immersive 3D scenes from text is rapidly maturing, driven by novel video generative models and feed-forward 3D reconstruction, with vast potential in AR/VR and world modeling. While panoramic images have proven effective for scene initialization, existing approaches suffer from a trade-off between visual fidelity and explorability: autoregressive expansion suffers from context drift, while panoramic video generation is limited to low resolution. We present Stepper, a unified framework for text-driven immersive 3D scene synthesis that circumvents these limitations via stepwise panoramic scene expansion. Stepper leverages a novel multi-view 360° diffusion model that enables consistent, high-resolution expansion, coupled with a geometry reconstruction pipeline that enforces geometric coherence. Trained on a new large-scale, multi-view panorama dataset, Stepper achieves state-of-the-art fidelity and structural consistency, outperforming prior approaches, thereby setting a new standard for immersive scene generation.

2603.18280 2026-05-04 cs.LG cs.AI cs.CL

Detection Is Cheap, Routing Is Learned: Why Refusal-Based Alignment Evaluation Fails

检测成本低,路由是学习的:为何基于拒绝的对齐评估失败

Gregory N. Frank

Affiliation: Independent Researcher

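AI summary: This study argues that current alignment evaluation overlooks the routing layer. Experiments on political censorship show that probe accuracy alone is non-diagnostic, that routing is model- and lab-specific, and that refusal is no longer the dominant censorship mechanism, which motivates a three-stage framework of detect, route, generate.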

Comments Code and data: https://github.com/gregfrank/routing-is-learned

详情
AI中文摘要

当前对齐评估主要测量模型是否编码危险概念及是否拒绝有害请求,但忽视了对齐通常发生的层:从概念检测到行为策略的路由。本文通过研究中国起源语言模型的政治审查作为自然实验,利用探针、手术性消融和行为测试,在九个开源模型中发现三个结论:首先,探针准确率单独不具诊断性,政治探针、空对照和排列基线均可达到100%,因此保留的类别泛化是信息性测试。其次,手术性消融揭示了实验室特定的路由。移除政治敏感方向会消除审查并恢复大多数模型的准确事实输出,而一个模型因架构将事实知识与审查机制纠缠而编造。跨模型转移失败,表明路由几何是模型和实验室特定的。第三,拒绝不再是主要审查机制。在某一模型家族中,强硬拒绝降至零,而叙述引导升至最高,使拒绝-only基准无法检测审查。这些结果支持三阶段描述框架:检测、路由、生成。模型通常保留相关知识;对齐改变知识的表达方式。仅审计检测或拒绝的评估因此忽视最直接决定行为的路由机制。

英文摘要

Current alignment evaluation mostly measures whether models encode dangerous concepts and whether they refuse harmful requests. Both miss the layer where alignment often operates: routing from concept detection to behavioral policy. We study political censorship in Chinese-origin language models as a natural experiment, using probes, surgical ablations, and behavioral tests across nine open-weight models from five labs. Three findings follow. First, probe accuracy alone is non-diagnostic: political probes, null controls, and permutation baselines can all reach 100%, so held-out category generalization is the informative test. Second, surgical ablation reveals lab-specific routing. Removing the political-sensitivity direction eliminates censorship and restores accurate factual output in most models tested, while one model confabulates because its architecture entangles factual knowledge with the censorship mechanism. Cross-model transfer fails, indicating that routing geometry is model- and lab-specific. Third, refusal is no longer the dominant censorship mechanism. Within one model family, hard refusal falls to zero while narrative steering rises to the maximum, making censorship invisible to refusal-only benchmarks. These results support a three-stage descriptive framework: detect, route, generate. Models often retain the relevant knowledge; alignment changes how that knowledge is expressed. Evaluations that audit only detection or refusal therefore miss the routing mechanism that most directly determines behavior.

2602.14276 2026-05-04 cs.CV

ScreenParse: Moving Beyond Sparse Grounding with Complete Screen Parsing Supervision

ScreenParse:超越稀疏标注的完整屏幕解析监督

A. Said Gurbuz, Sunghwan Hong, Ahmed Nassar, Marc Pollefeys, Peter Staar

Affiliations: IBM Research Zurich, Zurich, Switzerland; ETH Zurich, Computer Vision; ETH AI Center, Switzerland; ETH Zurich, Photogrammetry; Microsoft, Switzerland

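AI summary: ScreenParse provides large-scale, complete screen parsing annotations on which the compact ScreenVLM model is trained. The model performs strongly on dense parsing and transfers well to public benchmarks.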

Comments Accepted at ICML 2026. 28 pages, 15 figures

详情
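AI abstract (translated from Chinese)

Modern computer-use agents (CUAs) must treat the screen as a structured state, perceiving the position and text content of visible elements before carrying out a task. Existing grounding datasets, however, provide only sparse supervision, which limits coverage and generalization. This paper introduces the ScreenParse dataset, with dense annotations for 771K web screenshots generated by an automated pipeline. ScreenVLM, trained on this dataset, performs strongly on dense parsing and shows strong transfer, suggesting that dense supervision provides useful structural priors.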

英文摘要

Modern computer-use agents (CUAs) must perceive a screen as a structured state (what elements are visible, where they are, and what text they contain) before they can reliably ground instructions and act. Yet most available grounding datasets provide sparse supervision, with insufficient and low-diversity labels that annotate only a small subset of task-relevant elements per screen, which limits both coverage and generalization; moreover, practical deployment requires efficiency to enable low-latency, on-device use. We introduce ScreenParse, a large-scale dataset for complete screen parsing, with dense annotations of all visible UI elements (boxes, 55-class types, and text) across 771K web screenshots (21M elements). ScreenParse is generated by Webshot, an automated, scalable pipeline that renders diverse URLs, extracts annotations, and applies VLM-based relabeling and quality filtering. Using ScreenParse, we train ScreenVLM, a compact, 316M-parameter vision language model (VLM) that decodes a compact ScreenTag markup representation with a structure-aware loss that upweights structure-critical tokens. ScreenVLM substantially outperforms much larger foundation VLMs on dense parsing (e.g., 0.592 vs. 0.294 PageIoU on ScreenParse) and shows strong transfer to public benchmarks. Moreover, finetuning foundation VLMs on ScreenParse consistently improves their grounding performance, suggesting that dense screen supervision provides transferable structural priors for UI understanding. Project page: https://saidgurbuz.github.io/screenparse/.

2602.13595 2026-05-04 cs.AI

The Quantization Trap: Breaking Linear Scaling Laws in Multi-Hop Reasoning

量化陷阱:多跳推理中的线性扩展定律破裂

Henry Han, Xiyang Liu, Xiaodong Wang, Fei Han, Xiaodong Li

Affiliations: School of Engineering and Computer Science, Baylor University; School of Computer Science and Technology, Xidian University; School of Computer Science and Communication Engineering, Jiangsu University; Beijing Electronic Science and Technology Institute

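AI summary: This study shows that in multi-hop reasoning, reducing numerical precision through quantization can increase energy consumption and reduce accuracy, and it proposes a Critical Model Scale theory to explain when this happens.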

Comments 23 pages, 8 figures

详情
AI中文摘要

神经扩展定律提供了一种可预测的AI进步配方:降低数值精度应线性提高计算效率和能耗(E ∝ bits)。本文证明在多跳推理中该定律失效。我们揭示了'量化陷阱':将精度从16位降至8/4位反而增加净能耗并降低推理准确性。我们提供严谨的理论分解,指出该失败归因于硬件转换开销、去量化内核的隐藏延迟成本,这些在顺序推理链中成为主导瓶颈,以及顺序能耗摊销失败。因此,扩展定律破裂在实践中不可避免。我们正式化了关键模型规模N*,预测陷阱溶解或加深作为模型大小、批次大小和硬件配置函数,验证范围达120倍(0.6B-72B)六种GPU架构。我们的发现表明,行业'更小更好'的启发式方法对复杂推理任务在数学上是反面的。

英文摘要

Neural scaling laws provide a predictable recipe for AI advancement: reducing numerical precision should linearly improve computational efficiency and energy profile ($E \propto \mathrm{bits}$). In this paper, we demonstrate that this scaling law breaks in the context of multi-hop reasoning. We reveal a 'quantization trap' where reducing precision from 16-bit to 8/4-bit paradoxically increases net energy consumption while degrading reasoning accuracy. We provide a rigorous theoretical decomposition that attributes this failure to hardware casting overhead, the hidden latency cost of dequantization kernels, which becomes a dominant bottleneck in sequential reasoning chains, as well as to a sequential energy amortization failure. As a result, scaling law breaking is unavoidable in practice. We formalize a Critical Model Scale $N^*$ that predicts when the trap dissolves or deepens as a function of model size, batch size, and hardware configuration, validated across a 120$\times$ range (0.6B--72B) on six GPU architectures. Our findings suggest that the industry's "smaller-is-better" heuristic is mathematically counterproductive for complex reasoning tasks.

2602.13305 2026-05-04 cs.CV cs.AI

WildfireVLM: AI-powered Analysis for Early Wildfire Detection and Risk Assessment Using Satellite Imagery

WildfireVLM:基于卫星影像的AI分析用于早期野火检测与风险评估

Aydin Ayanzadeh, Prakhar Dixit, Sadia Kamal, Milton Halem

Affiliations: Department of Computer Science and Electrical Engineering; University of Maryland, Baltimore County

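AI summary: WildfireVLM combines satellite-imagery wildfire detection with language-driven risk assessment: YOLOv12 detects fire zones and smoke plumes, and multimodal large language models generate risk assessments and response recommendations. The system supports real-time processing and long-term wildfire tracking.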

详情
AI中文摘要

野火对生态系统、人类生命和基础设施构成日益增长的威胁,其频率和强度因气候变化和人类活动而增加。早期检测至关重要,但基于卫星的监测因烟雾信号微弱、天气动态变化及大范围实时分析需求而具有挑战性。我们引入WildfireVLM,一种结合卫星影像野火检测与语言驱动风险评估的AI框架。我们使用Landsat-8/9、GOES-16和其他公开地球观测源构建了标注的野火和烟雾数据集,包括对齐的光谱带产品。WildfireVLM利用YOLOv12检测火区和烟雾云,利用其在卫星影像中检测小而复杂模式的能力。我们整合多模态大语言模型(MLLMs),将检测输出转换为上下文化的风险评估和优先级应急响应建议。我们使用LLM-as-judge评估验证风险推理质量,采用服务导向架构部署系统,支持实时处理、风险可视化仪表板和长期野火追踪,展示了计算机视觉与语言推理结合在可扩展野火监测中的价值。代码和数据集已公开在GitHub上:https://github.com/Ayanzadeh93/_WildfireVLM_.

英文摘要

Wildfires are a growing threat to ecosystems, human lives, and infrastructure, with their frequency and intensity rising due to climate change and human activities. Early detection is critical, yet satellite-based monitoring remains challenging due to faint smoke signals, dynamic weather conditions, and the need for real-time analysis over large areas. We introduce WildfireVLM, an AI framework that combines satellite imagery wildfire detection with language-driven risk assessment. We construct a labeled wildfire and smoke dataset using imagery from Landsat-8/9, GOES-16, and other publicly available Earth observation sources, including harmonized products with aligned spectral bands. WildfireVLM employs YOLOv12 to detect fire zones and smoke plumes, leveraging its ability to detect small, complex patterns in satellite imagery. We integrate Multimodal Large Language Models (MLLMs) that convert detection outputs into contextualized risk assessments and prioritized response recommendations for disaster management. We validate the quality of risk reasoning using an LLM-as-judge evaluation with a shared rubric. The system is deployed using a service-oriented architecture that supports real-time processing, visual risk dashboards, and long-term wildfire tracking, demonstrating the value of combining computer vision with language-based reasoning for scalable wildfire monitoring. The code and dataset are publicly available on GitHub at https://github.com/Ayanzadeh93/_WildfireVLM_.

2601.21214 2026-05-04 cs.CL cs.LG

Scaling Reasoning Hop Exposes Weaknesses: Demystifying and Improving Hop Generalization in Large Language Models

推理步扩展揭示弱点:解密和改进大语言模型中的推理泛化

Zhaoyi Li, Jiatong Li, Gangwei Jiang, Linqi Song, Defu Lian, Ying Wei

Affiliations: University of Science and Technology of China; City University of Hong Kong; Zhejiang University; City University of Hong Kong, Shenzhen Research Institute

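AI summary: Across tasks from multiple domains, this paper finds that reasoning errors concentrate at token positions of a few critical error types rather than being uniformly distributed. It proposes dynamically identifying and deactivating erroneous processing heads during inference to improve reasoning hop generalization.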

Comments 52 pages, accepted by ICLR 2026 main conference

详情
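AI abstract (translated from Chinese)

Chain-of-thought (CoT) reasoning has become the standard paradigm for enabling large language models (LLMs) to solve complex problems. However, recent studies reveal a sharp performance drop in reasoning hop generalization scenarios, where the required number of reasoning steps exceeds the training distribution while the underlying algorithm remains unchanged. The internal mechanisms behind this failure remain unclear. In this work, we systematically study tasks from multiple domains and find that errors concentrate at token positions of a few critical error types rather than being uniformly distributed. Closer inspection shows that these token-level mispredictions stem from internal competition mechanisms: certain attention heads, termed erroneous processing heads (ep heads), amplify incorrect reasoning trajectories while suppressing correct ones. Notably, removing individual ep heads during inference often restores the correct prediction. Motivated by these insights, we propose test-time correction of reasoning, a lightweight intervention that dynamically identifies and deactivates ep heads during reasoning. Extensive experiments across tasks and LLMs show that it consistently improves reasoning hop generalization.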

英文摘要

Chain-of-thought (CoT) reasoning has become the standard paradigm for enabling Large Language Models (LLMs) to solve complex problems. However, recent studies reveal a sharp performance drop in reasoning hop generalization scenarios, where the required number of reasoning steps exceeds training distributions while the underlying algorithm remains unchanged. The internal mechanisms driving this failure remain poorly understood. In this work, we conduct a systematic study on tasks from multiple domains, and find that errors concentrate at token positions of a few critical error types, rather than being uniformly distributed. Closer inspection reveals that these token-level erroneous predictions stem from internal competition mechanisms: certain attention heads, termed erroneous processing heads (ep heads), tip the balance by amplifying incorrect reasoning trajectories while suppressing correct ones. Notably, removing individual ep heads during inference can often restore the correct predictions. Motivated by these insights, we propose test-time correction of reasoning, a lightweight intervention method that dynamically identifies and deactivates ep heads in the reasoning process. Extensive experiments across different tasks and LLMs show that it consistently improves reasoning hop generalization, highlighting both its effectiveness and potential.

2512.16762 2026-05-04 cs.LG

NRGPT: An Energy-based Alternative for GPT

NRGPT:一种基于能量的GPT替代方案

Nima Dehmamy, Benjamin Hoover, Bishwajit Saha, Leo Kozachkov, Jean-Jacques Slotine, Dmitry Krotov

Affiliations: IBM Research; Georgia Tech; Brown University; MIT

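AI summary: NRGPT unifies GPT with the energy-based modeling framework through a minimal modification, treating inference as exploration on an energy landscape. The authors prove and verify empirically that under certain conditions this exploration becomes gradient descent, and the model performs well on simple language, algebraic tasks, and richer language modeling.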

Comments Accepted to ICLR 2026 main conference

详情
AI中文摘要

生成式预训练变换器(GPT)架构是语言建模最流行的架构。基于能量的建模是一种不同的范式,将推断视为在能量景观上运行的动态过程。我们提出对GPT设置进行最小修改,以将其与EBM框架统一。我们的模型的推断步骤,称为eNeRgy-GPT(NRGPT),被概念化为在能量景观上探索标记。我们证明并在经验上验证,在某些情况下这种探索会变成梯度下降,尽管它们不一定导致性能最佳的模型。我们展示了我们的模型在简单语言(莎士比亚数据集)、代数ListOPS任务和更丰富的设置如OpenWebText语言建模中表现良好。我们还观察到,我们的模型可能在非常长的训练期间对过拟合更具鲁棒性。

英文摘要

Generative Pre-trained Transformer (GPT) architectures are the most popular design for language modeling. Energy-based modeling is a different paradigm that views inference as a dynamical process operating on an energy landscape. We propose a minimal modification of the GPT setting to unify it with the EBM framework. The inference step of our model, which we call eNeRgy-GPT (NRGPT), is conceptualized as an exploration of the tokens on the energy landscape. We prove, and verify empirically, that under certain circumstances this exploration becomes gradient descent, although this does not necessarily lead to the best-performing models. We demonstrate that our model performs well for simple language (Shakespeare dataset), algebraic ListOPS tasks, and richer settings such as OpenWebText language modeling. We also observe that our models may be more resistant to overfitting, which appears only during very long training.

2512.01116 2026-05-04 cs.CV

Structural Prognostic Event Modeling for Multimodal Cancer Survival Analysis

多模态癌症生存分析中的结构预后事件建模

Yilan Zhang, Li Nanbo, Changchun Yang, Jürgen Schmidhuber, Xin Gao

发表机构 * King Abdullah University of Science and Technology(国王 Abdullah 科学与技术大学)

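AI summary: This paper proposes SlotSPE, which uses slot attention to compress multimodal inputs into structured slots, modeling the complex interactions in cancer survival analysis and improving prognostic relevance and interpretability.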

Comments 37 pages, 14 Figures

Journal ref The Fourteenth International Conference on Learning Representations (ICLR2026)

详情
AI中文摘要

组织学图像与基因谱的整合在改善癌症生存预测方面显示出巨大潜力。然而,当前方法在高效有效建模内模和跨模交互方面面临挑战,由于输入的高维性和复杂性。主要挑战是捕捉关键预后事件,尽管数量少,但这些事件奠定了观察输入的复杂性,并很大程度决定患者结果。这些事件表现为高水平的结构信号,如空间组织学模式或通路协同激活,通常稀疏、患者特异且未标注,使它们难以揭示。为此,我们提出SlotSPE,一种基于槽的结构预后事件建模框架。具体而言,受因子编码原理启发,我们使用槽注意力将每个患者的多模态输入压缩为紧凑、模态特定且互不相同的槽集合。通过利用这些槽表示作为预后事件的编码,我们的框架实现了复杂内模和跨模交互的高效有效建模,同时促进无缝整合生物先验知识以增强预后相关性。在十个癌症基准上的广泛实验表明,SlotSPE在8个队列中优于现有方法,整体改进2.9%。它在缺失基因组数据下仍保持稳健,并通过结构化事件分解显著提高可解释性。

英文摘要

The integration of histology images and gene profiles has shown great promise for improving survival prediction in cancer. However, current approaches often struggle to model intra- and inter-modal interactions efficiently and effectively due to the high dimensionality and complexity of the inputs. A major challenge is capturing critical prognostic events that, though few, underlie the complexity of the observed inputs and largely determine patient outcomes. These events, manifested as high-level structural signals such as spatial histologic patterns or pathway co-activations, are typically sparse, patient-specific, and unannotated, making them inherently difficult to uncover. To address this, we propose SlotSPE, a slot-based framework for structural prognostic event modeling. Specifically, inspired by the principle of factorial coding, we compress each patient's multimodal inputs into compact, modality-specific sets of mutually distinctive slots using slot attention. By leveraging these slot representations as encodings for prognostic events, our framework enables both efficient and effective modeling of complex intra- and inter-modal interactions, while also facilitating seamless incorporation of biological priors that enhance prognostic relevance. Extensive experiments on ten cancer benchmarks show that SlotSPE outperforms existing methods in 8 out of 10 cohorts, achieving an overall improvement of 2.9%. It remains robust under missing genomic data and delivers markedly improved interpretability through structured event decomposition.

2510.22819 2026-05-04 cs.LG

Last-Iterate Analyses of FTRL with the 1/2-Tsallis Entropy in Stochastic Bandits

Jingxin Zhan, Yuze Han, Zhihua Zhang

Affiliations*: School of Mathematical Sciences, Peking University; Center for Applied Statistics and School of Statistics, Renmin University of China

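AI summary: This paper studies FTRL with the 1/2-Tsallis entropy regularizer and shows that the Bregman divergence between the point mass on the optimal arm and the last iterate decays at a rate of $t^{-1/2}$, which only partially confirms the intuition that logarithmic regret corresponds to a $t^{-1}$ last-iterate rate.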

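Details
AI-generated abstract (translated from Chinese)

Convergence analysis of online learning algorithms is central to machine learning theory. Last-iterate convergence is especially important because it captures the learner's actual decisions and describes how the learning process evolves over time. In multi-armed bandits, however, most existing analyses focus on the order of the regret, while the last-iterate (simple regret) convergence rate remains less explored, especially for the widely studied Follow-the-Regularized-Leader (FTRL) algorithms. Recently, FTRL with the $1/2$-Tsallis entropy regularizer $\Psi(p) = -4\sum_{i=1}^d \sqrt{p_i}$ (the $1/2$-Tsallis-INF algorithm, proposed in arXiv:1807.07623) was shown to achieve logarithmic regret in stochastic bandits, but its last-iterate convergence rate had not been studied. Intuitively, logarithmic regret should correspond to a $t^{-1}$ last-iterate convergence rate. This paper studies the $1/2$-Tsallis-INF algorithm and partially confirms this intuition: it shows that the Bregman divergence defined by $\Psi(p)$, between the point mass on the optimal arm and the probability distribution over the arms obtained at iteration $t$, decays at a rate of $t^{-1/2}$.

English abstract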

The convergence analysis of online learning algorithms is central to machine learning theory, where the last-iterate convergence is particularly important, as it captures the learner's actual decisions and describes the evolution of the learning process over time. However, in multi-armed bandits, most existing algorithmic analyses mainly focus on the order of regret, while the last-iterate (simple regret) convergence rate remains less explored, especially for the widely studied Follow-the-Regularized-Leader (FTRL) algorithms. Recently, FTRL with the $1/2$-Tsallis entropy regularizer $\Psi(p) = -4\sum_{i=1}^d \sqrt{p_i}$ (the $1/2$-Tsallis-INF algorithm, by arXiv:1807.07623) was shown to achieve logarithmic regret in stochastic bandits. Nevertheless, its last-iterate convergence rate has not yet been studied. Intuitively, logarithmic regret should correspond to a $t^{-1}$ last-iterate convergence rate. This paper studies the $1/2$-Tsallis-INF algorithm and partially confirms this intuition through theoretical analysis, showing that the Bregman divergence, defined by $\Psi(p)$, between the point mass on the optimal arm and the probability distribution over the arm set obtained at iteration $t$, decays at a rate of $t^{-1/2}$.
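The two objects named in this abstract can be sketched numerically: the FTRL distribution induced by the regularizer $-4\sum_i \sqrt{p_i}$, and the Bregman divergence between the point mass on the optimal arm and that distribution. The sketch below is an illustration under stated assumptions, not the paper's code. The function names, the learning rate `eta`, and the bisection solver are hypothetical, and the abstract specifies neither the learning-rate schedule nor the loss estimator. The stationarity condition of the FTRL objective gives $p_i = 4 / (\eta (L_i - x))^2$ for a normalising multiplier $x < \min_i L_i$, which the code finds by bisection.

```python
import numpy as np

def psi(p):
    """Regularizer from the abstract: Psi(p) = -4 * sum_i sqrt(p_i)."""
    return -4.0 * np.sum(np.sqrt(p))

def bregman_to_point_mass(p, best_arm):
    """D_Psi(e_best, p) for p strictly inside the simplex.

    Uses D(q, p) = Psi(q) - Psi(p) - <grad Psi(p), q - p>,
    with grad Psi(p)_i = -2 / sqrt(p_i).
    """
    p = np.asarray(p, dtype=float)
    q = np.zeros_like(p)
    q[best_arm] = 1.0
    grad = -2.0 / np.sqrt(p)
    return psi(q) - psi(p) - grad @ (q - p)

def ftrl_distribution(cum_loss, eta, iters=100):
    """argmin over the simplex of <cum_loss, p> + Psi(p) / eta (hypothetical helper).

    The multiplier x is bracketed so that sum(p) <= 1 at `lo` and >= 1 at `hi`.
    """
    L = np.asarray(cum_loss, dtype=float)
    d = L.size
    lo = L.min() - 2.0 * np.sqrt(d) / eta
    hi = L.min() - 2.0 / eta
    for _ in range(iters):
        x = 0.5 * (lo + hi)
        p = 4.0 / (eta * (L - x)) ** 2
        if p.sum() > 1.0:
            hi = x
        else:
            lo = x
    p = 4.0 / (eta * (L - 0.5 * (lo + hi))) ** 2
    return p / p.sum()  # remove residual bisection error

if __name__ == "__main__":
    p = ftrl_distribution([0.0, 3.0, 5.0], eta=1.0)
    print(p, bregman_to_point_mass(p, best_arm=0))
```

Expanding the definition gives the closed form $2\sum_i \sqrt{p_i} + 2/\sqrt{p_{i^*}} - 4$ for this divergence, which vanishes exactly when $p$ is the point mass on the optimal arm $i^*$.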

2507.22699 2026-05-04 cs.CV

Image-Guided Shape-from-Template Using Mesh Inextensibility Constraints

Thuy Tran, Ruochen Chen, Shaifali Parashar

Affiliations*: CNRS; École Centrale de Lyon; INSA Lyon; Université Claude Bernard Lyon 1; LIRIS

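AI summary: This paper proposes an unsupervised Shape-from-Template method that uses only image observations and a mesh inextensibility constraint. It reconstructs 400 times faster than the best-performing existing unsupervised method and is better at generating fine details and handling severe occlusions.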

Comments Accepted to ICCV 2025. Total 13 pages, 9 figures, 9 tables

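Details
AI-generated abstract (translated from Chinese)

Shape-from-Template (SfT) refers to the class of methods that reconstruct the 3D shape of a deforming object from images or videos using a 3D template. Traditional SfT methods need point correspondences between the image and the textured 3D template to reconstruct 3D shapes in real time, and their performance degrades sharply under severe occlusion. Modern SfT methods instead use deep neural networks for correspondence-free reconstruction, but require large amounts of supervised data. Recent advances combine differentiable physics and graphics to deform the 3D template to match the input images in a fully unsupervised or self-supervised way. This paper proposes an unsupervised SfT method that uses only image observations (color features, gradients, and silhouettes) together with a mesh inextensibility constraint, reconstructs 3D shapes 400 times faster than the best-performing unsupervised SfT, and substantially outperforms existing methods at generating fine details and handling severe occlusions. Code is available at https://github.com/dvttran/nsft.

English abstract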

Shape-from-Template (SfT) refers to the class of methods that reconstruct the 3D shape of a deforming object from images/videos using a 3D template. Traditional SfT methods require point correspondences between images and the texture of the 3D template in order to reconstruct 3D shapes from images/videos in real time. Their performance degrades severely when faced with severe occlusions in the images because correspondences are unavailable. In contrast, modern SfT methods use a correspondence-free approach by incorporating deep neural networks to reconstruct 3D objects, thus requiring huge amounts of data for supervision. Recent advances use a fully unsupervised or self-supervised approach by combining differentiable physics and graphics to deform a 3D template to match input images. In this paper, we propose an unsupervised SfT which uses only image observations (color features, gradients and silhouettes) along with a mesh inextensibility constraint to reconstruct at a $400\times$ faster pace than the best-performing unsupervised SfT. Moreover, when it comes to generating finer details and handling severe occlusions, our method outperforms the existing methodologies by a large margin. Code is available at https://github.com/dvttran/nsft.
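The abstract does not spell out how the mesh inextensibility constraint is formulated. One common way to express inextensibility on a triangle mesh is to penalise changes in edge length between the template and the deformed mesh, and the sketch below assumes that formulation. The function names and the mean-squared penalty are illustrative assumptions and are not taken from the linked repository.

```python
import numpy as np

def edge_lengths(vertices, edges):
    """Euclidean length of each edge; vertices is (V, 3), edges is (E, 2) of indices."""
    return np.linalg.norm(vertices[edges[:, 0]] - vertices[edges[:, 1]], axis=1)

def inextensibility_loss(template_vertices, deformed_vertices, edges):
    """Mean squared change in edge length (hypothetical penalty).

    Zero for any deformation that preserves every edge length, such as a
    rigid motion or an isometric bend; positive under stretch or compression.
    """
    rest = edge_lengths(template_vertices, edges)
    current = edge_lengths(deformed_vertices, edges)
    return float(np.mean((current - rest) ** 2))

if __name__ == "__main__":
    template = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
    edges = np.array([[0, 1], [1, 2], [2, 0]])
    # A 90-degree rotation about z plus a translation keeps all edge lengths.
    rot_z = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
    rigid = template @ rot_z.T + np.array([0.5, -2.0, 1.0])
    print(inextensibility_loss(template, rigid, edges))           # close to zero
    print(inextensibility_loss(template, 2.0 * template, edges))  # positive: stretched
```

In an optimisation loop, a term of this kind would be added to the image-based losses (color features, gradients, silhouettes) so that the deformed template matches the image without stretching.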