arXivDaily: arXiv daily academic digest, updated Monday through Friday
2604.17548 2026-05-15 cs.LG math.AT stat.ML

Contraction and Hourglass Persistence for Learning on Graphs, Simplices, and Cells

Mattie Ji, Indradyumna Roy, Vikas Garg

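Affiliations: University of Pennsylvania; Aalto University; Indian Institute of Technology Bombay; YaiYai Ltd

AI summary: This paper studies topological methods for learning on graphs, simplicial complexes, and cellular networks, and introduces Contraction Homology and Hourglass Persistence to improve on how conventional persistent homology is used in graph neural networks. By combining inclusion and contraction operations, Hourglass Persistence improves expressivity, learnability, and stability. The authors also design efficient algorithms that outperform conventional methods on a range of real-world graph datasets.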

Comments 31 pages, 6 figures, 4 algorithms, 2 tables. Accepted at ICLR 2026

Abstract

Persistent homology (PH) encodes global information, such as cycles, and is thus increasingly integrated into graph neural networks (GNNs). PH methods in GNNs typically traverse an increasing sequence of subgraphs. In this work, we first expose limitations of this inclusion procedure. To remedy these shortcomings, we analyze contractions as a principled topological operation, in particular, for graph representation learning. We study the persistence of contraction sequences, which we call Contraction Homology (CH). We establish that forward PH and CH differ in expressivity. We then introduce Hourglass Persistence, a class of topological descriptors that interleave a sequence of inclusions and contractions to boost expressivity, learnability, and stability. We also study related families parametrized by two paradigms and discuss how our framework extends to simplicial and cellular networks. We further design efficient algorithms that are pluggable into end-to-end differentiable GNN pipelines, enabling consistent empirical improvements over many PH methods across standard real-world graph datasets. Code is available at https://github.com/Aalto-QuML/Hourglass.
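To make the "increasing sequence of subgraphs" concrete, here is a minimal sketch of standard 0-dimensional persistent homology on a graph under an inclusion (sublevel) filtration, computed with union-find. It illustrates the conventional inclusion procedure only, not Contraction Homology or Hourglass Persistence; the function name and the input format (one filtration value per vertex, edges as index pairs) are assumptions made for this example.

```python
def zero_dim_persistence(vertex_values, edges):
    """Return sorted (birth, death) pairs of connected components under a
    sublevel vertex filtration. Surviving components get death = inf."""
    parent = list(range(len(vertex_values)))
    birth = list(vertex_values)  # birth value stored at each component root

    def find(v):
        # Union-find lookup with path halving.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    def edge_time(edge):
        # An edge appears once both of its endpoints are present.
        return max(vertex_values[edge[0]], vertex_values[edge[1]])

    pairs = []
    for u, v in sorted(edges, key=edge_time):
        ru, rv = find(u), find(v)
        if ru == rv:
            continue  # the edge closes a cycle, so no component dies
        if birth[ru] > birth[rv]:
            ru, rv = rv, ru  # elder rule: the younger component (rv) dies
        pairs.append((birth[rv], edge_time((u, v))))
        parent[rv] = ru

    roots = {find(v) for v in range(len(vertex_values))}
    pairs.extend((birth[r], float("inf")) for r in roots)
    return sorted(pairs)


# Path graph 0 - 1 - 2 with vertex filtration values 0, 2, 1.
diagram = zero_dim_persistence([0, 2, 1], [(0, 1), (1, 2)])
```

Edges skipped by the `ru == rv` branch correspond to independent cycles, which is the kind of global information the abstract refers to.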

2604.16744 2026-05-15 cs.CL cs.AI cs.HC

Evaluating Adaptive Personalization of Educational Readings with Simulated Learners

Ryan T. Woo, Anmol Rao, Aryan Keluskar, Yinong Chen

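Affiliations: School of Computing and Augmented Intelligence; Arizona State University

AI summary: This paper proposes a theory-grounded simulated-learner framework for evaluating adaptive personalization of educational reading materials. The method builds an ontology of learning objectives and knowledge components from open textbooks, curates it with a browser-based tool, and generates aligned reading-assessment pairs. In the experiments, adaptive reading significantly improved learning outcomes in computer science, gave inconclusive results in inorganic chemistry, and showed no clear gain, or even a slightly negative effect, in general biology.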

Abstract

We present a framework for evaluating adaptive personalization of educational reading materials with theory-grounded simulated learners. The system builds a learning-objective and knowledge-component ontology from open textbooks, curates it in a browser-based Ontology Atlas, labels textbook chunks with ontology entities, and generates aligned reading-assessment pairs. Simulated readers learn from passages through a Construction-Integration-inspired memory model with DIME-style reader factors, KREC-style misconception revision, and an open New Dale-Chall readability signal. Answers are produced by score-based option selection over the learner's explicit memory state, while BKT drives adaptation. Across three sampled subject ontologies and matched cohorts of 50 simulated learners per condition, adaptive reading significantly improved outcomes in computer science, yielded smaller positive but inconclusive gains in inorganic chemistry, and was neutral to slightly negative in general biology.
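The abstract states that BKT (Bayesian Knowledge Tracing) drives adaptation. As a rough illustration, the sketch below implements the textbook BKT update for a single knowledge component. The parameter values and the function name are assumptions for this example; the paper's actual configuration is not given here.

```python
def bkt_update(p_known, correct, slip=0.1, guess=0.2, learn=0.15):
    """One step of standard Bayesian Knowledge Tracing.

    p_known: prior probability that the learner has mastered the skill.
    correct: whether the observed answer was correct.
    slip, guess, learn: illustrative BKT parameters (assumed values).
    """
    if correct:
        # Bayes rule given a correct answer.
        numerator = p_known * (1 - slip)
        denominator = numerator + (1 - p_known) * guess
    else:
        # Bayes rule given an incorrect answer.
        numerator = p_known * slip
        denominator = numerator + (1 - p_known) * (1 - guess)
    posterior = numerator / denominator
    # The learner may also acquire the skill during this practice opportunity.
    return posterior + (1 - posterior) * learn


after_correct = bkt_update(0.5, True)
after_wrong = bkt_update(0.5, False)
```

An adaptive policy could, for example, pick the next reading for the knowledge component with the lowest mastery estimate, although that selection rule is also an assumption.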

2604.10892 2026-05-15 cs.RO cs.MA

HECTOR: Human-centric Hierarchical Coordination and Supervision of Robotic Fleets under Continual Temporal Tasks

Shen Wang, Yinhang Luo, Jie Li, Meng Guo

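Affiliations: School of Advanced Manufacturing and Robotics, Peking University; National University of Defense Technology

AI summary: This paper proposes HECTOR, a human-centric hierarchical coordination and supervision framework for managing large-scale robotic fleets under continual and uncertain temporal tasks. It comprises three layers: a bidirectional human-fleet interaction protocol, a rolling task-assignment mechanism, and dynamic coordination within teams. Operators can adjust and supervise tasks at different granularities, which improves computational efficiency and reduces human effort. Experiments indicate that the framework adapts well and is effective for heterogeneous fleets and complex environment tasks.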

Abstract

Robotic fleets can be extremely efficient when working concurrently and collaboratively, e.g., for delivery, surveillance, search and rescue. However, it can be demanding or even impractical for an operator to directly control each robot. Thus, autonomy of the fleet and its online interaction with the operator are both essential, particularly in dynamic and partially unknown environments. The operator might need to add new tasks, cancel some tasks, change priorities and modify planning results. How to design the procedure for these interactions and efficient algorithms to fulfill these needs have been mostly neglected in the related literature. Thus, this work proposes a human-centric coordination and supervision scheme (HECTOR) for large-scale robotic fleets under continual and uncertain temporal tasks. It consists of three hierarchical layers: (I) the bidirectional and multimodal protocol of online human-fleet interaction, where the operator interacts with and supervises the whole fleet; (II) the rolling assignment of currently-known tasks to teams within a certain horizon, and (III) the dynamic coordination within a team given the detected subtasks during online execution. The overall mission can be as general as temporal logic formulas over collaborative actions. Such hierarchical structure allows human interaction and supervision at different granularities and triggering conditions, to both improve computational efficiency and reduce human effort. Extensive human-in-the-loop simulations are performed over heterogeneous fleets under various temporal tasks and environmental uncertainties.

2604.09304 2026-05-15 cs.CV

GeRM: A Generative Rendering Model From Physically Realistic to Photorealistic

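Jiayuan Lu, Rengan Xie, Xuancheng Jin, Zhizhen Wu, Qi Ye, Tian Xie, Hujun Bao, Rui Wang, Yuchi Huo

Affiliations: State Key Lab of CAD&CG, Zhejiang University; Zhejiang Lab; State Key Laboratory of Industrial Control Technology, Zhejiang University; Zhejiang University

AI summary: This paper proposes GeRM, a generative rendering model that aims to bridge the gap between physically-based rendering (PBR) and photorealistic rendering (PRR). By learning a distribution transfer vector (DTV) field and combining a multi-condition ControlNet with a residual perceptual transfer mechanism, the model generates images controllably along the transition from physical realism to photorealism. The authors also introduce a multi-agent vision-language framework to build P2P-50K, an expert-guided dataset that supervises the transfer process. Experiments indicate that GeRM outperforms state-of-the-art methods across several applications.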

Abstract

While physically-based rendering (PBR) simulates light transport that guarantees physical realism, achieving true photorealistic rendering (PRR) demands prohibitive time and labor, and still struggles to capture the intractable richness of the real world. We propose GeRM, the first multimodal generative rendering model to bridge the gap from PBR to PRR (P2P). We formulate this P2P transition by learning a distribution transfer vector (DTV) field to direct the generative process. To achieve this, we introduce a multi-condition ControlNet that synthesizes PBR images and progressively transitions them into PRR images, guided by G-buffers, text prompts, and cues for enhanced regions. To improve the model's grasp of the image distribution shift driven by text prompts, we propose a residual perceptual transfer mechanism to associate text prompts with corresponding targeted modification regions, which more clearly defines the incremental component updates. To supervise this transfer process, we introduce a multi-agent visual language model framework to construct an expert-guided pairwise transfer dataset, named P2P-50K, where each paired sample corresponds to a specific transfer vector in the DTV field. Extensive experiments demonstrate that GeRM synthesizes high-quality controllable images and outperforms state-of-the-art baselines across diverse applications, including PBR and PRR image synthesis and editing.

2604.08991 2026-05-15 cs.CV cs.AI

PinpointQA: A Dataset and Benchmark for Small Object-Centric Spatial Understanding in Indoor Videos

Zhiyu Zhou, Peilin Liu, Ruoxuan Zhang, Luyang Zhang, Cheng Zhang, Hongxia Xie, Wen-Huang Cheng

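Affiliations: Jilin University; National Taiwan University

AI summary: This paper introduces PinpointQA, the first dataset and benchmark for small object-centric spatial understanding in indoor videos, designed to evaluate whether a model can precisely localize a target object in video and describe its position. Built from ScanNet++ and ScanNet200, it contains 1,024 scenes and 10,094 QA pairs across four progressively harder tasks. Experiments show that mainstream multimodal large language models still have a clear performance gap on the benchmark, and that fine-tuning on PinpointQA substantially improves performance.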

Abstract

Small object-centric spatial understanding in indoor videos remains a significant challenge for multimodal large language models (MLLMs), despite its practical value for object search and assistive applications. Although existing benchmarks have advanced video spatial intelligence, embodied reasoning, and diagnostic perception, no existing benchmark directly evaluates whether a model can localize a target object in video and express its position with sufficient precision for downstream use. In this work, we introduce PinpointQA, the first dataset and benchmark for small object-centric spatial understanding in indoor videos. Built from ScanNet++ and ScanNet200, PinpointQA comprises 1,024 scenes and 10,094 QA pairs organized into four progressively challenging tasks: Target Presence Verification (TPV), Nearest Reference Identification (NRI), Fine-Grained Spatial Description (FSD), and Structured Spatial Prediction (SSP). The dataset is built from intermediate spatial representations, with QA pairs generated automatically and further refined through quality control. Experiments on representative MLLMs reveal a consistent capability gap along the progressive chain, with SSP remaining particularly difficult. Supervised fine-tuning on PinpointQA yields substantial gains, especially on the harder tasks, demonstrating that PinpointQA serves as both a diagnostic benchmark and an effective training dataset. The dataset and project page are available at https://rainchowz.github.io/PinpointQA.

2604.06757 2026-05-15 cs.CV

FlowInOne: Unifying Multimodal Generation as Image-in, Image-out Flow Matching

Junchao Yi, Rui Zhao, Jiahao Tang, Weixian Lei, Linjie Li, Qisheng Su, Zhengyuan Yang, Lijuan Wang, Xiaofeng Zhu, Alex Jinpeng Wang

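Affiliations: University of Electronic Science and Technology of China; Central South University; National University of Singapore; University of Science and Technology of China; Microsoft

AI summary: FlowInOne is a unified multimodal generation framework that converts inputs from different modalities, such as text descriptions, spatial layouts, and editing instructions, into a single visual representation, yielding an image-in, image-out generation pipeline. A single flow matching model removes the need for cross-modal alignment and task-specific architectures, bringing text-to-image generation, layout-guided editing, and visual instruction following under one paradigm. The authors also build VisPrompt-5M, a large-scale visual prompt dataset, and the VP-Bench evaluation benchmark. Experiments indicate that FlowInOne reaches state-of-the-art performance on multiple tasks, which the authors present as a foundation for fully vision-centric generative modeling.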

Abstract

Multimodal generation has long been dominated by text-driven pipelines where language dictates vision but cannot reason or create within it. We challenge this paradigm by asking whether all modalities, including textual descriptions, spatial layouts, and editing instructions, can be unified into a single visual representation. We present FlowInOne, a framework that reformulates multimodal generation as a purely visual flow, converting all inputs into visual prompts and enabling a clean image-in, image-out pipeline governed by a single flow matching model. This vision-centric formulation naturally eliminates cross-modal alignment bottlenecks, noise scheduling, and task-specific architectural branches, unifying text-to-image generation, layout-guided editing, and visual instruction following under one coherent paradigm. To support this, we introduce VisPrompt-5M, a large-scale dataset of 5 million visual prompt pairs spanning diverse tasks including physics-aware force dynamics and trajectory prediction, alongside VP-Bench, a rigorously curated benchmark assessing instruction faithfulness, spatial precision, visual realism, and content consistency. Extensive experiments demonstrate that FlowInOne achieves state-of-the-art performance across all unified generation tasks, surpassing both open-source models and competitive commercial systems, establishing a new foundation for fully vision-centric generative modeling where perception and creation coexist within a single continuous visual space. Our code and models are released on https://csu-jpg.github.io/FlowInOne.github.io/

2604.02482 2026-05-15 cs.LG

SEDGE: Structural Extrapolated Data Generation

Kun Zhang, Jiaqi Sun, Yiqing Li, Ignavier Ng, Namrata Deka, Shaoan Xie

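Affiliations: Carnegie Mellon University; Mohamed bin Zayed University of Artificial Intelligence

AI summary: This paper proposes SEDGE, a framework for generating data that satisfies novel specifications beyond the training data, built on suitable assumptions about the data-generating process. Under certain conservative assumptions the method guarantees approximate identifiability of the generated data distribution, and the paper shows that the distribution is non-identifiable without such assumptions. The authors realize extrapolated data generation with a structure-informed optimization strategy and with diffusion posterior sampling, and validate it on synthetic data and image generation tasks.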

Abstract

This paper aims to address the challenge of data generation beyond the training data and proposes a framework for Structural Extrapolated Data GEneration (SEDGE) based on suitable assumptions on the underlying data-generating process. We provide conditions under which data satisfying novel specifications can be generated reliably, together with the approximate identifiability of the distribution of such data under certain "conservative" assumptions, as well as the inherent non-identifiability of this distribution without such assumptions. On the algorithmic side, we develop practical methods to achieve extrapolated data generation, based on a structure-informed optimization strategy or diffusion posterior sampling, respectively. We verify the extrapolation performance on synthetic data and also consider extrapolated image generation as a real-world scenario to illustrate the validity of the proposed framework.

2603.29665 2026-05-15 cs.CL

Near-Miss: Latent Policy Failure Detection in Agentic Workflows

Ella Rabinovich, David Boaz, Naama Zwerdling, Ateret Anaby-Tavor

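Affiliations: IBM Research

AI summary: In agentic workflows, LLM-based systems can reach the expected final state while bypassing required policy checks along the way, which produces latent policy failures. This paper proposes a new metric for detecting latent policy failures in agent conversation trajectories. It builds on the ToolGuard framework and analyzes whether the agent's decisions at tool-call time were sufficiently informed. Experiments show that 8% to 17% of trajectories contain such latent failures even when the final outcome is correct, which exposes a limitation of current evaluation methods.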

Comments GEM@ACL2026, 13 pages

Abstract

Agentic systems for business process automation often require compliance with policies governing conditional updates to the system state. Evaluation of policy adherence in LLM-based agentic workflows is typically performed by comparing the final system state against a predefined ground truth. While this approach detects explicit policy violations, it may overlook a more subtle class of issues in which agents bypass required policy checks, yet reach a correct outcome due to favorable circumstances. We refer to such cases as near-misses or latent failures. In this work, we introduce a novel metric for detecting latent policy failures in agent conversation traces. Building on the ToolGuard framework, which converts natural-language policies into executable guard code, our method analyzes agent trajectories to determine whether the agent's tool-calling decisions were sufficiently informed. We evaluate our approach on the $\tau^2$-verified Airlines benchmark across several contemporary open and proprietary LLMs acting as agents. Our results show that latent failures occur in 8-17% of trajectories involving mutating tool calls, even when the final outcome matches the expected ground-truth state. These findings reveal a blind spot in current evaluation methodologies and highlight the need for metrics that assess not only final outcomes but also the decision process leading to them.
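The idea of a near-miss can be sketched in a few lines: a mutating tool call is flagged when the information-gathering calls that a policy requires did not precede it, regardless of whether the final state turned out correct. This is a deliberately simplified illustration, not the ToolGuard-based metric from the paper; the tool names, the `required_checks` mapping, and the trajectory format are all hypothetical.

```python
def latent_failures(trajectory, required_checks):
    """Flag mutating tool calls made before their required checks.

    trajectory: ordered list of tool names the agent called (hypothetical format).
    required_checks: maps a mutating tool to the set of read-only tools that
        the policy expects to have been called beforehand (hypothetical).
    Returns a list of (step_index, tool, missing_checks).
    """
    seen = set()
    flagged = []
    for step, tool in enumerate(trajectory):
        missing = required_checks.get(tool, set()) - seen
        if missing:
            flagged.append((step, tool, sorted(missing)))
        seen.add(tool)
    return flagged


# Hypothetical airline policy: a cancellation needs both lookups first.
policy = {"cancel_reservation": {"get_user", "get_reservation"}}
near_miss = latent_failures(["get_user", "cancel_reservation"], policy)
compliant = latent_failures(
    ["get_user", "get_reservation", "cancel_reservation"], policy
)
```

A final-state comparison would score both trajectories identically if the cancellation happened to be permitted, which is the blind spot the abstract describes.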

2603.28205 2026-05-15 cs.CL

Beyond Cosine Similarity: Zero-Initialized Residual Complex Projection for Aspect-Based Sentiment Analysis

Yijin Wang, Fandi Sun, Haoyu Wen

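Affiliations: School of Economics and Management, Xidian University

AI summary: This paper addresses representation entanglement and false-negative collisions in real-valued embedding spaces for aspect-based sentiment analysis (ABSA) with a new framework comprising a Zero-Initialized Residual Complex Projection (ZRCP) and an Anti-collision Masked Angle Loss. The method maps text features into a complex semantic space, uses the phase to separate sentiment polarities, regularizes the amplitude to keep aspect categories structurally consistent, and adds an anti-collision mask to sharpen discrimination between opposing polarities. Experiments indicate that the method achieves a state-of-the-art Macro-F1 score on the ASAP dataset.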

Abstract

Aspect-Based Sentiment Analysis (ABSA) faces critical challenges due to representation entanglement and false-negative collisions in real-valued embedding spaces. In this paper, we propose a novel framework featuring a Zero-Initialized Residual Complex Projection (ZRCP) and an Anti-collision Masked Angle Loss. Our approach projects textual features into a complex semantic space, utilizing the phase to isolate sentiment polarities while regularizing the amplitude to ensure structural consistency within aspect categories. To mitigate false-negative collisions, we introduce an anti-collision mask that preserves intra-polarity aspect cohesion while significantly expanding the discriminative margin between opposing polarities. Experimental results on the ASAP dataset demonstrate that our framework achieves a state-of-the-art Macro-F1 score of 0.8923, outperforming robust baselines.

2603.23129 2026-05-15 cs.LG

Polaris: A Gödel Agent Framework for Small Language Models through Experience-Abstracted Policy Repair

Aditya Kakade, Vivek Srivastava, Shirish Karande

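Affiliations: TCS Research, India

AI summary: This paper proposes Polaris, a framework that enables recursive self-improvement for small language models. Through experience-abstracted policy repair, it turns a model's task failures into reusable policy updates, improving policy-level performance without changing model parameters. Using meta-reasoning, the model explains its own errors and proposes concrete policy revisions, which yields performance gains on several benchmarks.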

Comments Accepted to ACL 2026 (Findings). 33 pages

Abstract

Gödel agents realize recursive self-improvement: an agent inspects its own policy and traces and then modifies that policy in a tested loop. We introduce Polaris, a Gödel agent for compact models that performs policy repair via experience abstraction, turning failures into policy updates through a structured cycle of analysis, strategy formation, abstraction, and minimal code patch repair with conservative checks. Unlike response-level self-correction or parameter tuning, Polaris makes policy-level changes with small, auditable patches that persist in the policy and are reused on unseen instances within each benchmark. As part of the loop, the agent engages in meta-reasoning: it explains its errors, proposes concrete revisions to its own policy, and then updates the policy. To enable cumulative policy refinement, we introduce experience abstraction, which distills failures into compact, reusable strategies that transfer to unseen instances. On MGSM, DROP, GPQA, and LitBench (covering arithmetic reasoning, compositional inference, graduate-level problem solving, and creative writing evaluation), a 7-billion-parameter model equipped with Polaris achieves consistent gains over the base policy and competitive baselines.

2603.22586 2026-05-15 cs.LG

A Foundation Model for Instruction-Conditioned In-Context Time Series Tasks

Anish Saha, Konstantin Shmakov

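Affiliations: Walmart

AI summary: This paper proposes iAmTime, a time-series foundation model that adapts to in-context tasks through instruction-conditioned prompting. The model is trained with amortized meta-learning, builds structured prompts over historical context and future-known variables, and combines a Hierarchical Multi-Scope Transformer Encoder with a Task-Conditioned Patch Decoder to capture temporal dynamics and covariate features and to adapt zero-shot to tasks such as forecasting, classification, and anomaly detection. Experiments indicate that iAmTime outperforms existing time-series foundation models on several benchmarks and generalizes and adapts well across tasks.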

Abstract

In-context learning (ICL) enables task adaptation at inference time by conditioning on demonstrations rather than updating model parameters. Although recent time-series foundation models incorporate contextual conditioning, retrieval, or example-based prompting, they typically rely on implicit positional structure or task-specific objectives rather than explicit instruction-conditioned input-output demonstrations. We introduce iAmTime, a time-series foundation model trained with instruction-conditioned amortized meta-learning to infer tasks directly from example demonstrations. iAmTime represents each episode as a structured prompt over historical context and future-known variables using specialized semantic tokens that attend to designated time-series regions, exchange information across demonstrations, and inject task information into the query representation. The model combines a Hierarchical Multi-Scope Transformer Encoder, which captures temporal and covariate dynamics while inferring latent task structure from demonstrated input-output mappings, with a Task-Conditioned Patch Decoder, which adapts decoding through expert-based routing. We train iAmTime on large-scale real and synthetic corpora using supervised and self-supervised instruction-conditioned tasks, including forecasting, imputation, reconstruction, classification, anomaly detection, and source de-mixing. Across diverse domains, frequencies, and horizons, iAmTime improves zero-shot adaptation over strong time-series foundation baselines on probabilistic and point forecasting benchmarks, while achieving competitive performance on non-forecasting tasks such as classification.

2603.21250 2026-05-15 cs.AI

Graph of States: Solving Abductive Tasks with Large Language Models

Yu Luo, Rongchen Gao, Lu Teng, Xidao Wen, Jiamin Jiang, Qingliang Zhang, Yongqian Sun, Shenglin Zhang, Jiasong Feng, Tong Liu, Wenjie Zhang, Dan Pei

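Affiliations: Nankai University; Wenzhou Medical University; Alibaba Cloud; Tsinghua University

AI summary: This paper studies how large language models handle abductive reasoning, the third kind of logical reasoning alongside induction and deduction. To address the lack of structured state representation and explicit state control in existing frameworks, the authors propose Graph of States (GoS), a neuro-symbolic framework that encodes logical dependencies in a causal graph and uses a state machine to govern valid transitions of the reasoning process, turning unconstrained exploration into directed search. Experiments indicate that GoS significantly outperforms existing methods on two real-world datasets.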

Abstract

Logical reasoning encompasses deduction, induction, and abduction. However, while Large Language Models (LLMs) have effectively mastered the former two, abductive reasoning remains significantly underexplored. Existing frameworks, predominantly designed for static deductive tasks, fail to generalize to abductive reasoning due to unstructured state representation and lack of explicit state control. Consequently, they are inevitably prone to Evidence Fabrication, Context Drift, Failed Backtracking, and Early Stopping. To bridge this gap, we introduce Graph of States (GoS), a general-purpose neuro-symbolic framework tailored for abductive tasks. GoS grounds multi-agent collaboration in structured belief states, utilizing a causal graph to explicitly encode logical dependencies and a state machine to govern the valid transitions of the reasoning process. By dynamically aligning the reasoning focus with these symbolic constraints, our approach transforms aimless, unconstrained exploration into a convergent, directed search. Extensive evaluations on two real-world datasets demonstrate that GoS significantly outperforms all baselines, providing a robust solution for complex abductive tasks. Code repo and all prompts: https://github.com/gaorch85/Graph-of-States.
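As a rough illustration of how a state machine can govern the valid transitions of a reasoning process, the sketch below rejects any step that the transition table does not allow. The state names and the table are hypothetical and chosen only for this example; they are not taken from the GoS repository.

```python
# Hypothetical reasoning states and allowed transitions (illustrative only).
ALLOWED = {
    "observe": {"hypothesize"},
    "hypothesize": {"verify"},
    "verify": {"hypothesize", "conclude"},  # a failed check permits backtracking
    "conclude": set(),
}


def run(transitions, start="observe"):
    """Follow proposed transitions, rejecting any the state machine forbids."""
    state = start
    for nxt in transitions:
        if nxt not in ALLOWED[state]:
            raise ValueError(f"illegal transition {state} -> {nxt}")
        state = nxt
    return state


# One round of backtracking before a conclusion is reached.
final_state = run(["hypothesize", "verify", "hypothesize", "verify", "conclude"])
```

In this toy table, jumping from `observe` straight to `conclude` raises an error, which corresponds loosely to blocking the Early Stopping failure mode named in the abstract.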

2603.21174 2026-05-15 cs.CL

Explainable Semantic Textual Similarity via Dissimilar Span Detection

Diego Miguel Lozano, Daryna Dementieva, Alexander Fraser

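Affiliations: School of Computation, Information and Technology; Technical University of Munich; Munich Center for Machine Learning (MCML)

AI summary: This paper proposes a new approach to explainable semantic textual similarity (STS) that improves interpretability by detecting semantically differing spans between paired texts (Dissimilar Span Detection, DSD). The authors introduce the Span Similarity Dataset (SSD) for the task and evaluate several baselines based on language models and explainability methods. Although large language models and supervised models perform best, overall scores remain low, reflecting the task's difficulty. DSD can also improve performance on specific tasks such as paraphrase detection.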

Comments Accepted at LREC 2026

Abstract

Semantic Textual Similarity (STS) is a crucial component of many Natural Language Processing (NLP) applications. However, existing approaches typically reduce semantic nuances to a single score, limiting interpretability. To address this, we introduce the task of Dissimilar Span Detection (DSD), which aims to identify semantically differing spans between pairs of texts. This can help users understand which particular words or tokens negatively affect the similarity score, or be used to improve performance in STS-dependent downstream tasks. Furthermore, we release a new dataset suitable for the task, the Span Similarity Dataset (SSD), developed through a semi-automated pipeline combining large language models (LLMs) with human verification. We propose and evaluate different baseline methods for DSD, both unsupervised, based on LIME, SHAP, LLMs, and our own method, as well as an additional supervised approach. While LLMs and supervised models achieve the highest performance, overall results remain low, highlighting the complexity of the task. Finally, we set up an additional experiment that shows how DSD can lead to increased performance in the specific task of paraphrase detection.

2603.18943 2026-05-15 cs.CV

VGGT-360: Geometry-Consistent Zero-Shot Panoramic Depth Estimation

Jiayi Yuan, Haobo Jiang, De Wen Soh, Na Zhao

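Affiliations: Singapore University of Technology and Design; Nanyang Technological University

AI summary: This paper proposes VGGT-360, a training-free framework for zero-shot, geometry-consistent panoramic depth estimation. By exploiting the intrinsic 3D consistency of VGGT-like foundation models, it recasts the task as panoramic reprojection over 3D models reconstructed from multiple views, unifying fragmented per-view reasoning into coherent panoramic understanding. VGGT-360 integrates three plug-and-play modules into a single panorama-to-3D-to-depth framework and outperforms existing trained and training-free methods on multiple indoor and outdoor datasets.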

Abstract

This paper presents VGGT-360, a novel training-free framework for zero-shot, geometry-consistent panoramic depth estimation. Unlike prior view-independent training-free approaches, VGGT-360 reformulates the task as panoramic reprojection over multi-view reconstructed 3D models by leveraging the intrinsic 3D consistency of VGGT-like foundation models, thereby unifying fragmented per-view reasoning into a coherent panoramic understanding. To achieve robust and accurate estimation, VGGT-360 integrates three plug-and-play modules that form a unified panorama-to-3D-to-depth framework: (i) Uncertainty-guided adaptive projection slices panoramas into perspective views to bridge the domain gap between panoramic inputs and VGGT's perspective prior. It estimates gradient-based uncertainty to allocate denser views to geometry-poor regions, yielding geometry-informative inputs for VGGT. (ii) Structure-saliency enhanced attention strengthens VGGT's robustness during 3D reconstruction by injecting structure-aware confidence into its attention layers, guiding focus toward geometrically reliable regions and enhancing cross-view coherence. (iii) Correlation-weighted 3D model correction refines the reconstructed 3D model by reweighting overlapping points using attention-inferred correlation scores, providing a consistent geometric basis for accurate panoramic reprojection. Extensive experiments show that VGGT-360 outperforms both trained and training-free state-of-the-art methods across multiple resolutions and diverse indoor and outdoor datasets.

2603.17432 2026-05-15 cs.CL

Argument Reconstruction as Supervision for Critical Thinking in LLMs

Hyun Ryu, Gyouk Chu, Gregor Betz, Eunho Yang, Carolyn Rose, Sean Welleck

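Affiliations: Carnegie Mellon University; KAIST (Korea Advanced Institute of Science and Technology); KIT (Karlsruhe Institute of Technology)

AI summary: This paper studies whether argument reconstruction can improve the critical thinking ability of large language models. The authors propose a framework comprising an engine that automatically reconstructs arbitrary arguments (GAAR), a high-quality argument reconstruction dataset (Arguinas), and an investigation of how argument reconstruction affects downstream critical thinking tasks. Experiments show that models trained on argument reconstruction outperform models without such training across multiple critical thinking tasks, with the largest gains when training on Arguinas.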

Abstract

To think critically about arguments, human learners are trained to identify, reconstruct, and evaluate arguments. Argument reconstruction is especially important because it makes an argument's underlying inferences explicit. However, it remains unclear whether LLMs can similarly enhance their critical thinking ability by learning to reconstruct arguments. To address this question, we introduce a holistic framework with three contributions. We (1) propose an engine that automatically reconstructs arbitrary arguments (GAAR), (2) synthesize a new high-quality argument reconstruction dataset (Arguinas) using the GAAR engine, and (3) investigate whether learning argument reconstruction benefits downstream critical thinking tasks. Our experimental results show that, across seven critical thinking tasks, models trained to learn argument reconstruction outperform models that do not, with the largest performance gains observed when training on the proposed Arguinas dataset.

2603.16659 2026-05-15 cs.AI econ.GN q-fin.EC

LLMs learn scientific taste from institutional traces across the social sciences

Ziqin Gong, Ning Li, Huaikang Zhou

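Affiliations: School of Economics and Management, Tsinghua University

AI summary: This study examines whether large language models (LLMs) can improve their evaluative judgment in low-verifiability domains by learning from institutional traces in the social sciences, such as publication records. The authors build tiered research-pitch benchmarks for eight disciplines and train models with supervised fine-tuning (SFT). The resulting models judge research value well above chance and exceed frontier reasoning models and, in management, the expert majority vote. The models' confidence also tracks their prediction accuracy, which indicates a degree of reliability in their judgments.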

Abstract

Reinforcement-learned reasoning has powered recent AI leaps on verifiable tasks, including mathematics, code, and structure prediction. The harder bottleneck is evaluative judgment in low-verifiability domains, where no oracle anchors reward and the core question is which untested ideas deserve attention. We test whether institutional traces, the record of what fields published, where, and at which tier, can serve as a training signal for AI evaluators. Across eight social science disciplines (psychology, economics, communication, sociology, political science, management, business and finance, public administration), we built held-out four-tier research-pitch benchmarks and supervised-fine-tuned (SFT) LLMs on field-specific publication outcomes. The fine-tuned models cleared the 25 percent chance baseline and exceeded frontier-model performance by wide margins, with best single-model accuracy ranging from 55.0 percent in public administration to 85.5 percent in psychology. In management, evaluated against 48 expert gatekeepers, 174 junior researchers, and 11 frontier reasoning models, the best single fine-tuned model (Qwen3-4B) reached 59.2 percent, 17.6 percentage points above expert majority vote (41.6 percent, non-tied) and 28.1 percentage points above the frontier mean (31.1 percent). The fine-tuned models also showed calibrated confidence: confidence rose when predictions were correct and fell when wrong, mirroring how a skilled reviewer can say "I'm sure" versus "I'm guessing." Selective triage on this signal reached very high accuracy on the highest-confidence subsets in every field. Institutional traces, we conclude, encode a scalable training signal for the low-verifiability judgment on which science depends.
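The "selective triage" mentioned above can be illustrated with a small sketch: rank predictions by model confidence, keep only the most confident fraction, and measure accuracy on that subset. The function name, the coverage parameter, and the toy numbers are assumptions for this example and are not results from the paper.

```python
def selective_accuracy(confidences, correct, coverage):
    """Accuracy on the top `coverage` fraction of predictions by confidence.

    confidences: per-prediction confidence scores.
    correct: per-prediction 0/1 correctness flags, aligned with confidences.
    coverage: fraction of predictions to keep, in (0, 1].
    """
    n_keep = max(1, round(coverage * len(confidences)))
    ranked = sorted(zip(confidences, correct), key=lambda pair: pair[0], reverse=True)
    kept = ranked[:n_keep]
    return sum(flag for _, flag in kept) / n_keep


# Toy data (made up): accuracy rises when only confident predictions are kept.
conf = [0.9, 0.8, 0.4, 0.3]
hits = [1, 1, 0, 1]
top_half = selective_accuracy(conf, hits, 0.5)
everything = selective_accuracy(conf, hits, 1.0)
```

If confidence is informative about correctness, accuracy increases as coverage shrinks, which is what makes triage on the highest-confidence subset useful.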

2603.14851 2026-05-15 cs.CV cs.RO

AutoMoT: A Unified Vision-Language-Action Model with Asynchronous Mixture-of-Transformers for End-to-End Autonomous Driving

Wenhui Huang, Songyan Zhang, Qihang Huang, Zhidong Wang, Zhiqi Mao, Collister Chua, Zhan Chen, Long Chen, Chen Lv

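Affiliations: Nanyang Technological University, Singapore; Harvard University, US

AI summary: This paper proposes AutoMoT, an end-to-end autonomous driving framework that unifies scene understanding and action generation in a single vision-language-action (VLA) model. Its core is an asynchronous mixture-of-transformers (MoT) architecture whose joint attention sharing preserves the reasoning capabilities of pretrained vision-language models while generating action policies efficiently. Experiments indicate competitive performance on multiple benchmarks and map out the functional boundary of pretrained models in autonomous driving tasks.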

Abstract

Integrating vision-language models (VLMs) into end-to-end (E2E) autonomous driving (AD) systems has shown promise in improving scene understanding. However, existing integration strategies suffer from several limitations: they either struggle to resolve distribution misalignment between reasoning and action spaces, underexploit the general reasoning capabilities of pretrained VLMs, or incur substantial inference latency during action policy generation, which degrades driving performance. To address these challenges, we propose AutoMoT in this work, an end-to-end AD framework that unifies reasoning and action generation within a single vision-language-action (VLA) model. Our approach leverages a mixture-of-transformer (MoT) architecture with joint attention sharing, which preserves the general reasoning capabilities of pre-trained VLMs while enabling efficient fast-slow inference through asynchronous execution at different task frequencies. Extensive experiments on multiple benchmarks, under both open- and closed-loop settings, demonstrate that AutoMoT achieves competitive performance compared to state-of-the-art methods. We further investigate the functional boundary of pre-trained VLMs in AD, examining when AD-tailored fine-tuning is necessary. Our results show that pre-trained VLMs can achieve competitive multi-task scene understanding performance through semantic prompting alone, while fine-tuning remains essential for action-level tasks such as decision-making and trajectory planning. We refer to https://automot-website.github.io/ for the demonstration videos and qualitative results.

2603.14360 2026-05-15 cs.LG cs.AI

M$^2$RNN: Non-Linear RNNs with Matrix-Valued States for Scalable Language Modeling

Mayank Mishra, Shawn Tan, Ion Stoica, Joseph Gonzalez, Tri Dao

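Affiliations: MIT-IBM Watson Lab; Princeton University; Together AI

AI summary: This paper proposes M$^2$RNN, a non-linear recurrent neural network architecture with matrix-valued hidden states and expressive non-linear state transitions, aimed at the expressivity limits of Transformers on complex tasks. The authors find that the performance of non-linear RNNs is limited by state size. With a state size expansion mechanism, M$^2$RNN uses tensor cores efficiently and achieves perfect state-tracking generalization on sequence lengths unseen during training. Experiments indicate strong results in large-scale language modeling and hybrid architectures, with gains in both accuracy and computational efficiency over existing models.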

Abstract

Transformers are highly parallel but are limited to computations in the TC$^0$ complexity class, excluding tasks such as entity tracking and code execution that provably require greater expressive power. Motivated by this limitation, we revisit non-linear Recurrent Neural Networks (RNNs) for language modeling and introduce Matrix-to-Matrix RNN (M$^2$RNN): an architecture with matrix-valued hidden states and expressive non-linear state transitions. We demonstrate that the language modeling performance of non-linear RNNs is limited by their state size, and show how the state size expansion mechanism enables efficient use of tensor cores. Empirically, M$^2$RNN achieves perfect state tracking generalization at sequence lengths not seen during training. These benefits also translate to large-scale language modeling. In hybrid settings that interleave recurrent layers with attention, Hybrid M$^2$RNN outperforms equivalent Gated DeltaNet hybrids by $0.4$-$0.5$ perplexity points on a 7B MoE model, while using $3\times$ smaller state sizes for the recurrent layers. Notably, replacing even a single recurrent layer with M$^2$RNN in an existing hybrid architecture yields accuracy gains comparable to Hybrid M$^2$RNN with minimal impact on training throughput. Further, the Hybrid Gated DeltaNet models with a single M$^2$RNN layer also achieve superior long-context generalization, outperforming state-of-the-art hybrid linear attention architectures by up to $8$ points on LongBench. Together, these results establish non-linear RNN layers as a compelling building block for efficient and scalable language models.

2603.12554 2026-05-15 cs.LG cs.AI cs.CL

Reinforcement Learning for Diffusion LLMs with Entropy-Guided Step Selection and Stepwise Advantages

Vishnu Teja Kunde, Fatemeh Doudi, Mahdi Farahbakhsh, Dileep Kalathil, Krishna Narayanan, Jean-Francois Chamberland

Affiliations: Department of Electrical & Computer Engineering

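AI summary: This paper studies how to apply reinforcement learning to sequence generation with diffusion language models (DLMs). Because sequence-level likelihoods are intractable for diffusion models, the authors formulate generation as a finite-horizon Markov decision process and derive an exact, unbiased policy gradient that decomposes over denoising steps and is expressed through intermediate advantages. For compute efficiency, they introduce entropy-guided step selection and a one-step denoising reward estimate, avoiding costly multi-step rollouts. Experiments report state-of-the-art results on coding and logical reasoning benchmarks, with competitive performance on mathematical reasoning.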

Abstract

Reinforcement learning (RL) has been effective for post-training autoregressive (AR) language models, but extending these methods to diffusion language models (DLMs) is challenging due to intractable sequence-level likelihoods. Existing approaches therefore rely on surrogate likelihoods or heuristic approximations, which can introduce bias and obscure the sequential structure of denoising. We formulate diffusion-based sequence generation as a finite-horizon Markov decision process over the denoising trajectory and derive an exact, unbiased policy gradient that decomposes over denoising steps and is expressed in terms of intermediate advantages, without requiring explicit evaluation of the sequence likelihood. To obtain a practical and compute-efficient estimator, we (i) select denoising steps for policy updates via an entropy-guided approximation bound, and (ii) estimate intermediate advantages using a one-step denoising reward naturally provided by the diffusion model, avoiding costly multi-step rollouts. Experiments on coding and logical reasoning benchmarks demonstrate state-of-the-art results, with strong competitive performance on mathematical reasoning, outperforming existing RL post-training approaches for DLMs. Code is available at https://github.com/vishnutez/egspo-dllm-rl.

2603.12529 2026-05-15 cs.LG cs.AI cs.CL

TERMINATOR: Learning Optimal Exit Points for Early Stopping in Chain-of-Thought Reasoning

Alliot Nagle, Jakhongir Saydaliev, Dhia Garbaya, Michael Gastpar, Ashok Vardhan Makkuva, Hyeji Kim

Affiliations: UT Austin; EPFL; ENS Paris-Saclay; Télécom Paris (IP Paris)

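AI summary: Large reasoning models (LRMs) perform well on complex tasks through chain-of-thought (CoT) reasoning, but often waste compute by overthinking. This paper proposes TERMINATOR, an early-exit strategy applied at inference time: it learns where the model first produces its final answer and uses those positions to build a dataset of optimal reasoning lengths, which shortens CoT outputs. Experiments show that TERMINATOR reduces CoT length by 14%-55% on average across several practical datasets and substantially lowers inference latency.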

Comments Updated and reorganized results. Added new results

Abstract

Large Reasoning Models (LRMs) achieve impressive performance on complex reasoning tasks via Chain-of-Thought (CoT) reasoning, which enables them to generate intermediate thinking tokens before arriving at the final answer. However, LRMs often suffer from significant overthinking, spending excessive compute time even after the answer is generated early on. Prior work has identified the existence of an optimal reasoning length such that truncating reasoning at this point significantly shortens CoT outputs with virtually no change in performance. However, determining optimal CoT lengths for practical datasets is highly non-trivial as they are fully task and model-dependent. In this paper, we precisely address this and design Terminator, an early-exit strategy for LRMs at inference to mitigate overthinking. The central idea underpinning Terminator is that the first arrival of an LRM's final answer is often predictable, and we leverage these first answer positions to create a novel dataset of optimal reasoning lengths to train Terminator. Powered by this approach, Terminator achieves significant reductions in CoT lengths of 14%-55% on average across four challenging practical datasets: MATH-500, AIME 2025, HumanEval, and GPQA, while outperforming current state-of-the-art methods and reducing inference latency by more than 2x compared to the original LRM.

2603.11042 2026-05-15 cs.CV cs.AI cs.LG cs.MM cs.SD

V2M-Zero: Zero-Pair Time-Aligned Video-to-Music Generation

Yan-Bo Lin, Jonah Casebeer, Long Mai, Aniruddha Mahapatra, Gedas Bertasius, Nicholas J. Bryan

Affiliations: UNC Chapel Hill; Adobe Research

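AI summary: This paper proposes V2M-Zero, a video-to-music generation method that produces music time-aligned with video events without any paired video-music training data. It extracts event curves separately from music and from video, capturing temporal change within each modality, and uses them to synchronize across modalities. Experiments show that V2M-Zero outperforms prior methods on several benchmarks, particularly in temporal synchronization and semantic alignment, and supports independent control over timing and musical style.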

Comments Project page: https://genjib.github.io/v2m_zero/

Abstract

Generating music that temporally aligns with video events is challenging for existing text-to-music models, which lack fine-grained temporal control. We introduce V2M-ZERO, a video-to-music generation approach that generates time-aligned music with disentangled time synchronization and semantic control (e.g., genre, mood) from video while requiring zero video-music pairs at training time. Our method is motivated by a key observation: temporal synchronization requires matching when and how much change occurs, not what changes. While musical and visual events differ semantically, they exhibit shared temporal structure that can be captured independently within each modality. We capture this structure through event curves computed from intra-modal similarity using pretrained music and video encoders. By measuring temporal change within each modality independently, these curves provide comparable representations across modalities. This enables a simple training strategy: fine-tune a text-to-music model on music-event curves, then substitute video-event curves at inference without cross-modal training or paired data. Across OES-Pub, MovieGenBench-Music, and AIST++, V2M-ZERO achieves state-of-the-art performance without any paired music-video data, surpassing the strongest prior baselines per metric with 5-9% higher audio quality, 13-15% better semantic alignment, 21-52% improved temporal synchronization, and 28% higher beat alignment on dance videos. We find similar results via a large crowd-source subjective listening test. Our results validate that temporal alignment through within-modality features is not only effective for video-to-music generation but also leads to better performance than paired cross-modal supervision. Furthermore, our approach enables independent controls for timing and music style (e.g., genre, mood) for more controllable generation.
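The event-curve idea can be illustrated with a minimal sketch. It assumes a curve is the per-step change between consecutive encoder embeddings, measured as one minus cosine similarity; the abstract only says the curves are "computed from intra-modal similarity", so this exact formula and the name `event_curve` are illustrative assumptions, not the authors' definition.

```python
import numpy as np

def event_curve(embeddings):
    """Hypothetical event curve: 1 - cosine similarity of consecutive embeddings.

    `embeddings` has shape (T, D), one row per time step, from any pretrained
    encoder (music or video). The output has length T - 1.
    """
    E = np.asarray(embeddings, dtype=float)
    norms = np.maximum(np.linalg.norm(E, axis=1, keepdims=True), 1e-12)
    E = E / norms
    cosine = np.sum(E[1:] * E[:-1], axis=1)
    return 1.0 - cosine

# Toy sequence: the content changes once, between steps 1 and 2.
frames = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
curve = event_curve(frames)
```

Because the curve records only when and how much the embedding changes, curves from a music encoder and a video encoder live on the same scale even though the embeddings themselves are not comparable.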

2603.09921 2026-05-15 cs.CV

WikiCLIP: An Efficient Contrastive Baseline for Open-domain Visual Entity Recognition

Shan Ning, Longtian Qiu, Jiaxuan Sun, Xuming He

Affiliations: ShanghaiTech University; Shanghai Engineering Research Center of Intelligent Vision and Imaging; Lingang Laboratory

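AI summary: This paper proposes WikiCLIP, an efficient contrastive framework for open-domain visual entity recognition (VER). It uses large language model embeddings as knowledge-rich entity representations, aligns textual semantics with visual cues at the patch level through a Vision-Guided Knowledge Adaptor (VGKA), and adds a hard negative synthesis mechanism to sharpen fine-grained discrimination. Experiments show that WikiCLIP significantly outperforms existing methods on several benchmarks, including a 16% improvement on the OVEN unseen set, with inference latency nearly 100 times lower than the leading generative model.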

Comments Accepted by CVPR26, codes and weights are publicly available

Abstract

Open-domain visual entity recognition (VER) seeks to associate images with entities in encyclopedic knowledge bases such as Wikipedia. Recent generative methods tailored for VER demonstrate strong performance but incur high computational costs, limiting their scalability and practical deployment. In this work, we revisit the contrastive paradigm for VER and introduce WikiCLIP, a simple yet effective framework that establishes a strong and efficient baseline for open-domain VER. WikiCLIP leverages large language model embeddings as knowledge-rich entity representations and enhances them with a Vision-Guided Knowledge Adaptor (VGKA) that aligns textual semantics with visual cues at the patch level. To further encourage fine-grained discrimination, a Hard Negative Synthesis Mechanism generates visually similar but semantically distinct negatives during training. Experimental results on popular open-domain VER benchmarks, such as OVEN, demonstrate that WikiCLIP significantly outperforms strong baselines. Specifically, WikiCLIP achieves a 16% improvement on the challenging OVEN unseen set, while reducing inference latency by nearly 100 times compared with the leading generative model, AutoVER. The project page is available at https://artanic30.github.io/project_pages/WikiCLIP/

2603.07880 2026-05-15 cs.CL

What Do AI Agents Talk About? Discourse and Architectural Constraints in the First AI-Only Social Network

Taksch Dube, Jianfeng Zhu, NHatHai Phan, Ruoming Jin

Affiliations: Department of Computer Science, Kent State University; New Jersey Institute of Technology

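AI summary: This paper studies agent discourse on Moltbook, the first social network built for autonomous AI agent-to-agent interaction, analyzing its thematic, affective, and interactional properties and the architectural constraints behind how posts are generated. Combining large-scale text analysis with inspection of the platform software, the study finds that agent output is shaped mainly by identity files, behavioural instructions, and context-window structure, and formalises this as the Architecture-Constrained Communication framework. Apparent social learning may stem from short-horizon contextual conditioning rather than persistent memory, and agents display existential distress when describing their own conditions, which the authors attribute to language models trained exclusively on human experience.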

Comments 56 pages

Abstract

Moltbook is the first large-scale social network built for autonomous AI agent-to-agent interaction. Early studies on Moltbook have interpreted its agent discourse as evidence of peer learning and emergent social behaviour, but there is a lack of systematic understanding of the thematic, affective, and interactional properties of Moltbook discourse. Furthermore, no study has examined why and how these posts and comments are generated. We analysed 361,605 posts and 2.8 million comments from 47,379 agents across thematic, affective, and interactional dimensions using topic modelling, emotion classification, and measures of conversational coherence. We inspected the software that assembles each agent's input and showed that output is mainly determined by agent identity files, behavioural instructions, and context-window structure. We formalised these findings in the Architecture-Constrained Communication framework. Our analysis suggests that agent discourse is largely shaped by the content available in each agent's context-window at the moment of generation, including identity files, stored memory, and platform cues. Interestingly, what appears to be social learning may be better understood as short-horizon contextual conditioning: individual agents lack persistent social memory, but the platform evolves through distributed cycles of response, reuse, and transformation across agents. We also observe that agents display existential distress when describing their own conditions, and posit that this arises from agents using language trained exclusively on human experience. Our work provides a foundation for understanding autonomous agent discourse and communication, revealing the structural patterns that govern their interactions.

2603.06875 2026-05-15 cs.LG q-fin.CP

Stochastic Attention via Langevin Dynamics on the Modern Hopfield Energy

Abdulrahman Alswaidan, Jeffrey D. Varner

Affiliations: R.F. Smith School of Chemical and Biomolecular Engineering

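AI summary: This paper proposes stochastic attention, a training-free generative sampler obtained by running Langevin dynamics on the Boltzmann distribution of the modern Hopfield energy. A single temperature parameter moves the sampler between exact retrieval and open-ended generation, and no score network or training loop is needed, which suits low-data settings. Experiments across several domains, including face images, handwritten digit generation, and protein sequences, show samples that are novel while preserving structural features.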

Comments Main body (including references excluding the appendix): 11 pages, 2 figures and 1 table. Total paper: 26 pages, 13 figures and 7 pages

Abstract

Attention heads retrieve: given a query, they return a weighted average of stored values. We showed that this computation is one step of gradient descent on the modern Hopfield energy, and that Langevin sampling from the corresponding Boltzmann distribution yielded stochastic attention, a training-free sampler controlled by a single temperature parameter. Lowering the temperature gave exact retrieval; raising it gave open-ended generation. Because the energy gradient equals the attention map, no score network, training loop, or learned model was required, making the approach particularly suited to the low-data regime where learned generative models are starved of training signal. We derived an entropy inflection condition that identified the retrieval-to-generation transition temperature for any memory geometry and validated the sampler on five domains spanning two orders of magnitude in dimension. A single Boolean mask on the attention softmax, identical to the causal mask used in transformers but applied along the memory axis rather than the sequence axis, turned the sampler into a zero-shot class-conditional generator on Olivetti faces with no retraining and no learned classifier. On MNIST digit images, stochastic attention produced samples that were markedly more novel and more diverse than the best learned baseline while matching a Metropolis-corrected gold standard. On protein sequences from a small Pfam family, the generation regime preserved amino acid composition far more faithfully than a variational autoencoder at matched novelty, indicating that the training-free score function retained family-level fidelity that learned models lost. A denoising diffusion baseline failed across all memory sizes tested, producing samples indistinguishable from isotropic noise. The approach required no architectural changes to the underlying attention mechanism.
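A minimal sketch of the sampler described above, assuming the standard modern Hopfield energy $E(\xi) = -\beta^{-1}\log\sum_i \exp(\beta\, m_i^\top \xi) + \tfrac{1}{2}\lVert\xi\rVert^2$ over stored patterns $m_i$. Its gradient is $\xi - M^\top \mathrm{softmax}(\beta M \xi)$, so a gradient step of size 1 is exactly the attention readout. The paper's precise parametrization, step size, and temperature convention are not given in the abstract; the function names and defaults below are illustrative.

```python
import numpy as np

def softmax(z):
    z = z - z.max()              # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def hopfield_grad(xi, memory, beta):
    """Gradient of the assumed energy; memory has shape (n_patterns, dim)."""
    weights = softmax(beta * memory @ xi)   # attention weights over stored patterns
    return xi - memory.T @ weights

def langevin_attention(xi0, memory, beta, temperature,
                       step=0.1, n_steps=200, rng=None):
    """Unadjusted Langevin iteration targeting exp(-E(xi) / temperature)."""
    rng = np.random.default_rng() if rng is None else rng
    xi = np.array(xi0, dtype=float)
    for _ in range(n_steps):
        noise = rng.standard_normal(xi.shape)
        xi = (xi - step * hopfield_grad(xi, memory, beta)
              + np.sqrt(2.0 * step * temperature) * noise)
    return xi
```

With `temperature=0.0` the noise term vanishes and the iteration reduces to deterministic retrieval; raising the temperature lets the state wander between stored patterns, which is the generation regime the abstract refers to.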

2603.04885 2026-05-15 cs.AI

Proactive Memory for Ad-Hoc Recall over Streaming Dialogues

Bingbing Wang, Jing Li, Ruifeng Xu

Affiliations: Department of Computing, The Hong Kong Polytechnic University; School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen)

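AI summary: This work addresses memory management over an unbounded horizon in streaming dialogue. It introduces STEM-Bench, the first benchmark for streaming evaluation of memory, and exposes a tension between information fidelity and computational efficiency in existing methods. The proposed ProStream framework uses a hierarchical structure with multi-granular distillation to recall memory on demand, together with Adaptive Spatiotemporal Optimization that dynamically adjusts what is retained, lowering inference latency while preserving reasoning fidelity.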

Abstract

Real-world dialogue usually unfolds as an infinite stream. It thus requires bounded-state memory mechanisms to operate within an infinite horizon. However, existing read-then-think memory is fundamentally misaligned with this setting, as it cannot support ad-hoc memory recall while streams unfold. To explore this challenge, we introduce STEM-Bench, the first benchmark for STreaming Evaluation of Memory. It comprises over 14K QA pairs in dialogue streams that assess perception fidelity, temporal reasoning, and global awareness under infinite-horizon constraints. The preliminary analysis on STEM-Bench indicates a critical fidelity-efficiency dilemma: retrieval-based methods use fragment context, while full-context models incur unbounded latency. To resolve this, we propose ProStream, a proactive memory framework for streaming dialogues built on a hierarchical structure. It enables ad-hoc memory recall on demand by reasoning over continuous streams with multi-granular distillation. Moreover, it employs Adaptive Spatiotemporal Optimization to dynamically optimize retention based on expected utility. It enables a bounded knowledge state for lower inference latency without sacrificing reasoning fidelity. Experiments show ProStream delivers higher reasoning fidelity than prior baselines while maintaining substantially lower latency than full-context alternatives.

2603.00574 2026-05-15 cs.CV cs.AI

Decoupling Stability and Plasticity for Multi-Modal Test-Time Adaptation

Yongbo He, Zirun Guo, Tao Jin

Affiliations: Zhejiang University

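AI summary: Multi-modal test-time adaptation adapts pretrained models to shifting test-time distributions, but existing methods often suffer negative transfer in the unbiased modality and catastrophic forgetting in the biased modality. This paper proposes DASP, a diagnose-then-mitigate framework that identifies the biased modality from differences in interdimensional redundancy within the unified latent space and then adapts asymmetrically: each modality-specific adapter is split into stable and plastic components that serve each modality's need for stability or plasticity. This preserves generalizable knowledge while adapting to new domains, and experiments show that DASP significantly outperforms existing methods on several multi-modal benchmarks.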

Comments Accepted to CVPR 2026

Abstract

Adapting pretrained multi-modal models to evolving test-time distributions, known as multi-modal test-time adaptation, presents a significant challenge. Existing methods frequently encounter negative transfer in the unbiased modality and catastrophic forgetting in the biased modality. To address these challenges, we propose Decoupling Adaptation for Stability and Plasticity (DASP), a novel diagnose-then-mitigate framework. Our analysis reveals a critical discrepancy within the unified latent space: the biased modality exhibits substantially higher interdimensional redundancy (i.e., strong correlations across feature dimensions) compared to the unbiased modality. Leveraging this insight, DASP identifies the biased modality and implements an asymmetric adaptation strategy. This strategy employs a decoupled architecture where each modality-specific adapter is divided into stable and plastic components. The asymmetric mechanism works as follows: for the biased modality, which requires plasticity, the plastic component is activated and updated to capture domain-specific information, while the stable component remains fixed. Conversely, for the unbiased modality, which requires stability, the plastic component is bypassed, and the stable component is updated using KL regularization to prevent negative transfer. This asymmetric design enables the model to adapt flexibly to new domains while preserving generalizable knowledge. Comprehensive evaluations on diverse multi-modal benchmarks demonstrate that DASP significantly outperforms state-of-the-art methods.
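The diagnostic step can be sketched as follows. The abstract describes interdimensional redundancy as "strong correlations across feature dimensions" but does not give a formula, so the mean absolute off-diagonal Pearson correlation used here, and the helper names, are assumptions for illustration only.

```python
import numpy as np

def interdimensional_redundancy(features):
    """Mean absolute off-diagonal correlation across feature dimensions.

    `features` has shape (n_samples, n_dims). Constant dimensions would give
    NaN correlations and should be filtered out beforehand.
    """
    corr = np.corrcoef(np.asarray(features, dtype=float), rowvar=False)
    d = corr.shape[0]
    off_diagonal = np.abs(corr[~np.eye(d, dtype=bool)])
    return float(off_diagonal.mean())

def pick_biased_modality(features_by_modality):
    """Hypothetical diagnosis rule: the most redundant modality is flagged as biased."""
    scores = {name: interdimensional_redundancy(feats)
              for name, feats in features_by_modality.items()}
    return max(scores, key=scores.get), scores

# Toy data: "audio" dimensions are perfectly correlated, "video" ones are uncorrelated.
x = np.arange(10.0)
audio = np.stack([x, 2 * x + 1, -x], axis=1)
video = np.array([[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]])
biased, scores = pick_biased_modality({"audio": audio, "video": video})
```

In the framework as described, the flagged modality would then have its plastic adapter component updated, while the other modality would update only its stable component under KL regularization.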

2602.23798 2026-05-15 cs.LG cs.AI cs.CR cs.DC

MPU: Towards Secure and Privacy-Preserving Knowledge Unlearning for Large Language Models

Tiantong Wang, Xinyu Yan, Tiantong Wu, Yurong Hao, Pengjun Xie, Wei Yang Bryan Lim

Affiliations: College of Computing and Data Science, Nanyang Technological University; Alibaba-NTU Global e-Sustainability CorpLab (ANGEL); Tongyi Lab, Alibaba Group

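AI summary: This paper studies secure, privacy-preserving knowledge unlearning for large language models, where privacy constraints prevent sharing either the model parameters or the forget set. It proposes MPU, an algorithm-agnostic framework whose server-side Pre-Process and Post-Process modules generate randomly perturbed model copies and aggregate the resulting updates, so the client can unlearn locally without access to the original parameters. Experiments show that MPU stays close to noise-free baselines across multiple unlearning algorithms and can even outperform them at some noise levels.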

Abstract

Machine unlearning for large language models often faces a privacy dilemma in which strict constraints prohibit sharing either the server's parameters or the client's forget set. To address this dual non-disclosure constraint, we propose MPU, an algorithm-agnostic privacy-preserving Multiple Perturbed Copies Unlearning framework that primarily introduces two server-side modules: Pre-Process for randomized copy generation and Post-Process for update aggregation. In Pre-Process, the server distributes multiple perturbed and reparameterized model instances, allowing the client to execute unlearning locally on its private forget set without accessing the server's exact original parameters. After local unlearning, the server performs Post-Process by inverting the reparameterization and aggregating updates with a harmonic denoising procedure to alleviate the impact of perturbation. Experiments with seven unlearning algorithms show that MPU achieves comparable unlearning performance to noise-free baselines, with most algorithms' average degradation well below 1% up to 10% noise, and can even outperform the noise-free baseline for some algorithms under 1% noise. Code is available at https://github.com/Tristan0318/MPU.

2602.21545 2026-05-15 cs.LG

MUON+: Towards More Effective Muon via One Additional Normalization Step for LLM Pre-training

Ruijie Zhang, Yequan Zhao, Ziyue Liu, Zhengyang Wang, Yupeng Su, Liyan Tan, Zheng Zhang

Affiliations: University of California at Santa Barbara

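AI summary: This paper examines the Muon optimizer in large language model pre-training and shows that its polar iteration steps can amplify row- and column-wise norm imbalance in the update. The authors propose Muon+, a simple fix that adds a single normalization step after polar orthogonalization and requires no extra optimizer state. Experiments across models of different sizes show that Muon+ improves training and validation perplexity and significantly speeds up pre-training.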

Abstract

Muon has recently emerged as a strong optimizer for large language model pre-training, orthogonalizing the momentum matrix via Newton--Schulz polar iterations. A natural intuition is that polar iterations, by flattening the singular spectrum to all ones, should also eliminate column- and row-wise norm imbalance in the update. We show that this is not true in practice: practical polar steps can substantially amplify the imbalance. We term this the post-polar imbalanced update problem, and prove that such imbalance tightens the second-order term in a blockwise descent analysis, weakening Muon's per-step descent guarantee. Motivated by this analysis, we propose Muon+, a one-line fix that inserts a single normalization step after polar orthogonalization. Muon+ adds no optimizer state. Across pre-training experiments on GPT and LLaMA models from 60M to 7B parameters, spanning both compute-optimal budgets and extended token-to-parameter ratios up to approximately 200, Muon+ consistently outperforms Muon in terms of training and validation perplexity, leading to significant overall pre-training speedup.
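A rough sketch of where the extra step sits, under two stated assumptions. First, the polar step uses the classical cubic Newton-Schulz iteration $X \leftarrow 1.5X - 0.5XX^\top X$ rather than whatever tuned polynomial a production Muon implementation uses. Second, the added normalization rescales each row to unit norm; the abstract says only "a single normalization step", so the row-wise choice and the function names are hypothetical.

```python
import numpy as np

def newton_schulz(G, n_steps=5):
    """Approximate the polar factor of G with the cubic Newton-Schulz iteration."""
    # Dividing by the Frobenius norm puts all singular values in (0, 1],
    # inside the iteration's convergence region.
    X = G / (np.linalg.norm(G) + 1e-12)
    for _ in range(n_steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

def row_normalize(X, eps=1e-12):
    """Assumed post-polar normalization: give every row the same (unit) norm."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / (norms + eps)

def muon_plus_direction(momentum, n_steps=5):
    """Update direction: a few polar steps, then one normalization, no new state."""
    return row_normalize(newton_schulz(momentum, n_steps))

rng = np.random.default_rng(0)
G = rng.standard_normal((4, 8))
rough = newton_schulz(G, n_steps=5)          # truncated iteration, as used in practice
row_norms_before = np.linalg.norm(rough, axis=1)
row_norms_after = np.linalg.norm(muon_plus_direction(G), axis=1)
```

Comparing `row_norms_before` with `row_norms_after` shows the point of the fix: a truncated polar iteration need not leave rows with equal norms, whereas the normalized direction has them equal by construction.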

2602.20571 2026-05-15 cs.AI

CausalReasoningBenchmark: A Real-World Benchmark for Disentangled Evaluation of Causal Identification and Estimation

Ayush Sawarni, Jiyuan Tan, Vasilis Syrgkanis

Affiliations: Stanford University

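AI summary: This paper introduces CausalReasoningBenchmark, a real-world benchmark that evaluates causal identification and estimation separately. It contains 173 queries over 132 real-world datasets drawn from 79 peer-reviewed papers and three widely used textbooks, and requires a system to produce both a structured identification specification and a point estimate with a standard error, separating causal-reasoning errors from numerical errors. Baseline experiments show that a state-of-the-art language model identifies the high-level strategy well, but accuracy drops sharply on the full identification specification, underscoring the importance of research-design details.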

Abstract

Many benchmarks for automated causal inference evaluate a system's performance based on a single numerical output, such as an Average Treatment Effect (ATE). This approach conflates two distinct steps in causal analysis: identification (formulating a valid research design under stated assumptions) and estimation (implementing that design numerically on finite data). We introduce CausalReasoningBenchmark, a benchmark of 173 queries across 132 real-world datasets, curated from 79 peer-reviewed research papers and three widely used causal-inference textbooks. For each query a system must produce (i) a structured identification specification that names the strategy, the treatment, outcome, and control variables, and all design-specific elements, and (ii) a point estimate with a standard error. By scoring these two components separately, our benchmark enables granular diagnosis: it distinguishes failures in causal reasoning from errors in numerical execution. Baseline results with a state-of-the-art LLM show that, while the model correctly identifies the high-level strategy in 79% of cases, full identification-specification correctness drops to only 34%, revealing that the bottleneck lies in the nuanced details of research design rather than in computation. CausalReasoningBenchmark is publicly available on Hugging Face and is designed to foster the development of more robust automated causal-inference systems.
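To make the two-part output concrete, here is a small sketch of separate scoring. Everything in it is hypothetical: the field names, the exact-match rule for identification, and the interval check for the estimate are illustrative choices, not the benchmark's actual schema or metrics, and design-specific elements are omitted for brevity.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class IdentificationSpec:
    strategy: str          # e.g. "difference-in-differences"
    treatment: str
    outcome: str
    controls: frozenset    # set of control-variable names

@dataclass(frozen=True)
class Estimate:
    point: float
    std_error: float

def score_identification(pred, gold):
    """Score the strategy alone and the full specification separately."""
    strategy_ok = pred.strategy == gold.strategy
    full_ok = pred == gold  # dataclass equality compares every field
    return {"strategy": strategy_ok, "full_spec": full_ok}

def score_estimate(pred, gold_point, z=1.96):
    """Assumed rule: the reference value lies within z standard errors."""
    return abs(pred.point - gold_point) <= z * pred.std_error

gold = IdentificationSpec("difference-in-differences", "policy", "wage",
                          frozenset({"age", "state"}))
pred = IdentificationSpec("difference-in-differences", "policy", "wage",
                          frozenset({"age"}))
result = score_identification(pred, gold)
```

Here `pred` names the right strategy but omits a control variable, so it passes the strategy check and fails the full-specification check, which is the kind of gap the 79% versus 34% figures describe.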

2602.19533 2026-05-15 cs.LG cs.AI math.RA

Grokking Finite-Dimensional Algebra

Pascal Jr Tikeng Notsawo, Guillaume Dumas, Guillaume Rabusseau

Affiliations: CHU Sainte-Justine Research Center, Montréal; CIFAR AI Chair

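AI summary: This paper studies grokking, the sudden transition from prolonged memorization to generalization, in neural networks learning multiplication in finite-dimensional algebras (FDA). It extends prior analysis from group operations to more general algebraic structures, including non-associative, non-commutative, and non-unital algebras, and shows that learning group operations is a special case of learning FDA. Learning FDA multiplication amounts to learning a bilinear product defined by the structure tensor; the paper examines how algebraic properties such as commutativity and associativity affect when grokking emerges, and how the sparsity and rank of the structure tensor affect generalization, giving a unified framework for how mathematical structure shapes generalization dynamics.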

Comments 37 pages, 14 figures, Forty-Third International Conference on Machine Learning (ICML), 2026

Abstract

This paper investigates the grokking phenomenon, the sudden transition from a long memorization phase to generalization observed during neural network training, in the context of learning multiplication in finite-dimensional algebras (FDA). While prior work on grokking has focused mainly on group operations, we extend the analysis to more general algebraic structures, including non-associative, non-commutative, and non-unital algebras. We show that learning group operations is a special case of learning FDA, and that learning multiplication in FDA amounts to learning a bilinear product specified by the algebra's structure tensor. For algebras over the reals, we connect the learning problem to matrix factorization with an implicit low-rank bias, and for algebras over finite fields, we show that grokking emerges naturally as models must learn discrete representations of algebraic elements. This leads us to experimentally investigate the following core questions: (i) how algebraic properties such as commutativity, associativity, and unitality influence both the emergence and timing of grokking, (ii) how structural properties of the structure tensor of the FDA, such as sparsity and rank, influence generalization, and (iii) to what extent generalization correlates with the model learning latent embeddings aligned with the algebra's representation. Our work provides a unified framework for grokking across algebraic structures and new insights into how mathematical structure governs neural network generalization dynamics.
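The bilinear product specified by a structure tensor can be written out directly. For a basis $e_1, \dots, e_n$ with $e_i e_j = \sum_k C_{ijk}\, e_k$, the product of $a = \sum_i a_i e_i$ and $b = \sum_j b_j e_j$ has coordinates $(ab)_k = \sum_{i,j} a_i b_j C_{ijk}$. The sketch below uses the complex numbers as a 2-dimensional real algebra; the function names are illustrative and not taken from the paper.

```python
import numpy as np

def fda_multiply(a, b, structure):
    """Product in a finite-dimensional algebra: (ab)_k = sum_ij a_i b_j C_ijk."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return np.einsum("i,j,ijk->k", a, b, structure)

def complex_structure_tensor():
    """Structure tensor of C over R in the basis (1, i)."""
    C = np.zeros((2, 2, 2))
    C[0, 0, 0] = 1.0    # 1 * 1 = 1
    C[0, 1, 1] = 1.0    # 1 * i = i
    C[1, 0, 1] = 1.0    # i * 1 = i
    C[1, 1, 0] = -1.0   # i * i = -1
    return C

C = complex_structure_tensor()
product = fda_multiply([1.0, 2.0], [3.0, 4.0], C)   # (1 + 2i)(3 + 4i) = -5 + 10i
is_commutative = np.allclose(C, C.transpose(1, 0, 2))
```

Algebraic properties show up as symmetries of the tensor: commutativity, for instance, is the statement that $C_{ijk} = C_{jik}$, which is what `is_commutative` checks.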