arXivDaily arXiv每日学术速递 周一至周五更新
重置
全部学科分类 2332
2605.10119 2026-05-13 cs.LG

Refresh-Scaling the Memory of Balanced Adam

Alberto Fernández-Hernández, Cristian Pérez-Corral, Jose I. Mestre, Manuel F. Dolz, Enrique S. Quintana-Ortí

AI总结 本文研究了平衡Adam优化器中单一剩余超参数β的设置问题,提出应将其视为统计记忆范围的控制变量,而非固定常数。通过引入有效训练周期和刷新次数的概念,作者发现将β调整使得刷新次数约为1000时,能在不同规模的视觉和语言任务中提升模型的鲁棒性。实验表明,该方法相比固定β值的最优基线,在最坏情况下的验证损失差距减少了33.4%,并显著提升了模型性能的一致性。

详情
英文摘要

Recent evidence suggests that Adam performs robustly when its momentum parameters are tied, $β_1=β_2$, reducing the optimizer to a single remaining parameter. However, how this parameter should be set remains poorly understood. We argue that, in balanced Adam, $β$ should not be treated as a dimensionless constant: it defines a statistical memory horizon $H_β=(1-β)^{-1}$. In terms of the effective learning horizon $T_{\mathrm{ES}}$, estimated from the validation trajectory, we study the refresh count $R_β=(1-β)T_{\mathrm{ES}}$, which measures how many times Adam renews its internal statistics during the useful phase of training. Across 11 vision and language experiments, we find that choosing $β$ so that $R_β\approx1000$ selects different $β$ values depending on the training scale, yet improves robustness over the best fixed-beta baseline. Compared with the strongest fixed choice $β=0.944$, the refresh rule improves worst-case robustness, reducing the maximum relative gap in validation loss by 33.4\%, while bringing all 11 runs within 1\% of their validation oracle. These results suggest that the remaining hyperparameter of balanced Adam is more naturally viewed as a memory-scale variable than as a fixed constant. This provides a simple budget-aware perspective on optimizer scaling and opens a path toward treating Adam's momentum as part of the learning dynamics rather than as a static default.

2605.10082 2026-05-13 cs.CL cs.LG

FERA: Uncertainty-Aware Federated Reasoning for Large Language Models

Ruhan Wang, Chengkai Huang, Zhiyong Wang, Junda Wu, Rui Wang, Tong Yu, Julian McAuley, Lina Yao, Dongruo Zhou

AI总结 在联邦推理任务中,如何在不集中训练和不共享原始数据的前提下,利用多方持有的私有高质量示例提升大语言模型的多步推理能力是一个关键挑战。本文提出了一种无需训练的不确定性感知联邦推理框架FERA,通过迭代的服务器-客户端协同优化,使客户端生成带有轻量不确定性估计的推理轨迹,服务器则基于这些轨迹合成更优的推理结果并反馈给客户端,从而逐步提升整体推理性能。该方法引入了不确定性感知的自批判聚合机制,有效解决异构客户端之间的冲突,并通过理论分析证明了其收敛性与效率优势,实验表明FERA在多个推理基准上均优于现有方法。

Comments 44 pages, 8 figures

详情
英文摘要

Large language models (LLMs) exhibit strong reasoning capabilities when guided by high-quality demonstrations, yet such data is often distributed across organizations that cannot centralize it due to regulatory, proprietary, or institutional constraints. We study federated reasoning, where a server improves multi-step reasoning by coordinating with heterogeneous clients holding private demonstrations, without centralized training or raw data sharing. The key challenge is that client reliability is query-dependent, while the server cannot inspect client data to determine which contributions are trustworthy. To address this, we propose Uncertainty-Aware Federated Reasoning (FERA), a training-free framework based on iterative server-client co-refinement. Across communication rounds, clients generate reasoning traces with lightweight uncertainty estimates, and the server synthesizes them into improved reasoning that is redistributed as context for the next round, progressively improving both server outputs and client-side reasoning. Within each round, Uncertainty-Aware Self-Critique Aggregation (UA-SCA) resolves conflicts among heterogeneous client traces through query-dependent trust weighting and structured cross-client verification. Rather than simply discarding low-quality traces, UA-SCA revises flawed reasoning steps to recover useful information. We provide theoretical guarantees showing that the proposed iterative protocol converges and that uncertainty-aware weighting accelerates convergence. Experiments on multiple reasoning benchmarks show that FERA consistently outperforms both federated training and training-free baselines, achieving progressively higher accuracy across rounds while maintaining communication and computational efficiency.

2605.10032 2026-05-13 cs.CL

PlantMarkerBench: A Multi-Species Benchmark for Evidence-Grounded Plant Marker Reasoning

Sajib Acharjee Dip, Song Li, Liqing Zhang

AI总结 PlantMarkerBench 是一个面向多物种植物的基准数据集,用于评估基于文献的植物标记基因推理能力。该基准通过模块化流程构建,涵盖拟南芥、玉米、水稻和番茄四种植物,包含5,550个标注了标记证据有效性、类型和支持强度的句子级实例。研究定义了两个任务:判断句子是否为基因-细胞类型对的有效标记证据,并对证据类型进行分类,旨在推动基于文献的生物证据解析和可信科学信息提取的研究。

详情
英文摘要

Cell-type-specific marker genes are fundamental to plant biology, yet existing resources primarily rely on curated databases or high-throughput studies without explicitly modeling the supporting evidence found in scientific literature. We introduce PlantMarkerBench, a multi-species benchmark for evaluating literature-grounded plant marker evidence interpretation from full-text biological papers. PlantMarkerBench is constructed using a modular curation pipeline integrating large-scale literature retrieval, hybrid search, species-aware biological grounding, structured evidence extraction, and targeted human review. The benchmark spans four plant species -- Arabidopsis, maize, rice, and tomato -- and contains 5,550 sentence-level evidence instances annotated for marker-evidence validity, evidence type, and support strength. We define two benchmark tasks: determining whether a candidate sentence provides valid marker evidence for a gene-cell-type pair, and classifying the evidence into expression, localization, function, indirect, or negative categories. We benchmark diverse open-weight and closed-source language models across species and prompting strategies. Although frontier models achieve relatively strong performance on direct expression evidence, performance drops substantially on functional, indirect, and weak-support evidence, with evidence-type confusion emerging as a dominant failure mode. Open-weight models additionally exhibit elevated false-positive rates under ambiguous biological contexts. PlantMarkerBench provides a challenging and reproducible evaluation framework for literature-grounded biological evidence attribution and supports future research on trustworthy scientific information extraction and AI-assisted plant biology.

2605.09964 2026-05-13 cs.AI q-bio.QM

Learning the Interaction Prior for Protein-Protein Interaction Prediction: A Model-Agnostic Approach

Ziqi Gao, Chenyi Zi, Zijing Liu, Ziqiao Meng, Yu Li, Jia Li

AI总结 蛋白质-蛋白质相互作用(PPIs)在细胞功能和疾病机制中起着关键作用。当前基于学习的PPI预测方法主要关注学习蛋白质的表示,却忽略了设计专门的分类头,通常依赖于缺乏生物学依据的通用聚合方法。本文提出了一种基于生物“L3规则”的模型无关PPI分类器L3-PPI,通过引入L3路径正则化的图提示学习方法,将蛋白质嵌入对的分类任务转化为图级别的分类任务,有效提升了预测性能。

Comments Accepted at ICML 2026

详情
英文摘要

Protein-protein interactions (PPIs) are fundamental to cellular function and disease mechanisms. Current learning-based PPI predictors focus on learning powerful protein representations but neglect designing specialized classification heads. They mainly rely on generic aggregating methods like concatenation or dot products, which lack biological insight. Motivated by the biological "L3 rule", where multiple length-3 paths between a pair of proteins indicate their interaction likelihood, our study addresses this gap by designing a biologically informed PPI classifier. In this paper, we provide empirical evidence that popular PPI datasets strongly support the L3 rule. We propose an L3-path-regularized graph prompt learning method called L3-PPI, which can generate a prompt graph with virtual L3 paths based on protein representations and controls the number of paths. L3-PPI reformulates the classification of protein embedding pairs into a graph-level classification task over the generated prompt graph. This lightweight module seamlessly integrates with PPI predictors as a plug-and-play component, injecting the interaction prior of complementarity to enhance performance. Extensive experiments show that L3-PPI achieves superior performance enhancements over advanced competitors.

2605.09904 2026-05-13 cs.CV

TOC-Bench: A Temporal Object Consistency Benchmark for Video Large Language Models

Junzhe Chen, Siyuan Meng, Yuxi Chen, Man Zhao, Wenyao Gui, Xiaojie Guo

AI总结 TOC-Bench 是一个用于评估视频大语言模型(Video-LLMs)时间对象一致性能力的诊断基准。该基准通过对象轨迹和结构化时间事件时间线进行构建,强调模型在遮挡、消失、重现、状态变化和跨对象交互等场景下保持同一对象身份、状态和连续性的能力。研究发现,尽管现有模型在一般视频理解任务上表现良好,但在事件计数、事件排序、身份敏感推理和幻觉检测等方面仍存在显著不足,表明时间对象一致性是当前视频大语言模型的一个关键瓶颈。

详情
英文摘要

Video large language models (Video-LLMs) have made strong progress in general video understanding, but their ability to maintain temporal object consistency remains underexplored. Existing benchmarks often emphasize event recognition, action understanding, or coarse temporal reasoning, while rarely testing whether models can preserve the identity, state, and continuity of the same object across occlusion, disappearance, reappearance, state transitions, and cross-object interactions. We introduce TOC-Bench, a diagnostic benchmark for evaluating temporal object consistency in Video-LLMs. TOC-Bench is object-track grounded: each queried subject is linked to a per-frame trajectory and a structured temporal event timeline. To ensure that questions require temporally ordered visual evidence rather than language priors, single-frame shortcuts, or unordered frame cues, we design a three-layer temporal-necessity filtering protocol, which removes 60.7% of candidate QA pairs and retains 17,900 temporally dependent items across 10 diagnostic dimensions. From this pool, we construct a human-verified benchmark with 2,323 high-quality QA pairs over 1,951 videos. Experiments on representative Video-LLMs show that temporal object consistency remains a major unsolved challenge, with notable weaknesses in event counting, event ordering, identity-sensitive reasoning, and hallucination-aware verification, even when models perform well on general video understanding benchmarks. These results suggest that object-centric temporal coherence is a key bottleneck for current Video-LLMs, and that TOC-Bench provides a focused platform for diagnosing and improving object-aware temporal reasoning. The resource is available at https://github.com/cjzcjz666/toc_bench.git.

2605.09598 2026-05-13 cs.CV

SoccerLens: Grounded Soccer Video Understanding Beyond Accuracy

Ismael Elsharkawi, Ahmed Sait, Silvio Giancola, Bernard Ghanem, Hossam Sharara, Abdelrahman Eldesokey

AI总结 本文提出 SoccerLens,一个用于评估足球视频理解中视觉 grounding 能力的新基准,旨在解决现有模型可能依赖虚假关联而非真实视觉证据的问题。该基准包含标注的13类常见足球事件视频片段,并通过三级语义相关性结构组织视觉线索。研究进一步扩展了注意力归因方法,引入了衡量模型注意力是否与标注线索对齐的评估指标,结果表明当前最先进的足球视觉语言模型在 grounding 性能上表现有限,揭示了预测准确率与真实视觉理解之间的显著差距。

Comments Preprint

详情
英文摘要

Vision-language models (VLMs) have recently shown strong potential in soccer video understanding. However, given the high complexity of soccer videos due to large viewpoint variations, rapid shot transitions, and cluttered scenes, it remains unclear on whether VLMs rely on meaningful visual evidence or exploit spurious correlations and shortcut learning. Existing evaluation protocols focus primarily on classification accuracy and do not assess visual grounding. To address this limitation, we introduce SoccerLens, a benchmark for grounded soccer video understanding. The benchmark contains annotated video segments spanning $13$ common soccer events, with structured visual cues organized into three levels of semantic relevance. We further extend the attribution method of Chefer [arXiv:2103.15679] to jointly model spatial and temporal attention, and introduce evaluation metrics that measure whether model attention aligns with annotated cues or drifts toward spurious regions. Our evaluation of state-of-the-art soccer VLMs shows that, despite strong classification accuracy, current models fail to exceed $50\%$ grounding performance even under the loosest cue definitions and consistently underutilize temporal information. These results reveal a substantial gap between predictive performance and true visual grounding, highlighting the need for grounded evaluation in complex spatio-temporal domains such as soccer.

2605.09523 2026-05-13 cs.LG cs.CE cs.NA math.NA physics.comp-ph stat.ML

HS-FNO: History-Space Fourier Neural Operator for Non-Markovian Partial Differential Equations

Lennon J. Shikhman

AI总结 本文提出了一种名为HS-FNO的历史空间傅里叶神经算子,用于求解非马尔可夫型偏微分方程。该方法通过引入扩展状态$u_t(θ,x)$,将历史信息纳入模型,从而更准确地捕捉系统动态。HS-FNO通过将历史状态更新分解为预测新时间片和精确移动已知部分,减少了学习维度并提升了预测精度。实验表明,HS-FNO在多个基准问题上优于现有方法,尤其在自回归预测中表现出显著的误差降低。

Comments 15 pages, 4 figures, 1 table. Code at https://github.com/lennonshikhman/hs-fno/

详情
英文摘要

Neural operators provide fast surrogate models for time-dependent partial differential equations, but their standard autoregressive use usually assumes that the instantaneous field $u(t,\cdot)$ is a complete state. This assumption fails for delay equations, distributed-memory systems, and other non-Markovian dynamics: two trajectories may agree at time $t$ and nevertheless have different futures because their histories differ. We introduce the History-Space Fourier Neural Operator (HS-FNO), a neural operator for delay and memory-driven PDEs formulated on the lifted state $u_t(θ,x)=u(t+θ,x)$, $θ\in[-τ,0]$. The key computational step is to decompose one history-state update into a learned predictor for the newly exposed future slice and an exact shift-append transport for the portion of the history window already known from the previous state. This avoids learning deterministic history coordinates, reduces the learned output dimension, and enforces the natural discrete history update. We test HS-FNO on five benchmark families covering delayed reaction--diffusion, spatial epidemiology, nonlocal neural-field dynamics, delayed waves, and distributed-memory closures. Across ten random seeds, HS-FNO attains the lowest aggregate one-step, history-space, and rollout errors among the principal baselines. The largest gain occurs in autoregressive prediction, where aggregate rollout error decreases from $0.241$, $0.188$, and $0.185$ for current-state, lag-stack, and unconstrained history-to-history operators, respectively, to $0.094$. The same model uses fewer parameters than unconstrained history prediction. These results indicate that enforcing the discrete shift structure of history-state evolution is an effective inductive bias for non-Markovian PDE surrogate modeling.

2605.09461 2026-05-13 cs.AI

VulTriage: Triple-Path Context Augmentation for LLM-Based Vulnerability Detection

Wenxin Tang, Xiang Zhang, Junliang Liu, Jingyu Xiao, Xi Xiao, Jinlong Yang, Yuehe Ma, Zhenyu Liu, Zhengheng Li, Zicheng Wang, Wang Luo, Qing Li, Lei Wang, Peng Xiangli

AI总结 本文提出了一种名为VulTriage的三路径上下文增强框架,用于基于大语言模型(LLM)的漏洞检测。该方法通过控制路径提取并描述程序结构信息,知识路径检索相关的漏洞模式与示例,语义路径总结代码功能行为,从而增强LLM的输入上下文,提升其对细微语义差异导致的漏洞的检测能力。实验表明,VulTriage在多个基准数据集上取得了优于现有深度学习和LLM基线方法的性能,尤其在资源有限和类别不平衡场景下表现出良好的泛化能力。

详情
英文摘要

Automated vulnerability detection is a fundamental task in software security, yet existing learning-based methods still struggle to capture the structural dependencies, domain-specific vulnerability knowledge, and complex program semantics required for accurate detection. Recent Large Language Models (LLMs) have shown strong code understanding ability, but directly prompting them with raw source code often leads to missed vulnerabilities or false alarms, especially when vulnerable and benign functions differ only in subtle semantic details. To address this, we propose VulTriage, a triple-path context augmentation framework for LLM-based vulnerability detection. VulTriage enhances the LLM input through three complementary paths: a Control Path that extracts and verbalizes AST, CFG, and DFG information to expose control and data dependencies; a Knowledge Path that retrieves relevant CWE-derived vulnerability patterns and examples through hybrid dense--sparse retrieval; and a Semantic Path that summarizes the functional behavior of the code before the final judgment. These contexts are integrated into a unified instruction to guide the LLM toward more reliable vulnerability reasoning. Experiments on the PrimeVul pair test set show that VulTriage achieves state-of-the-art performance, outperforming existing deep learning and LLM-based baselines on key pair-wise and classification metrics. Further ablation studies verify the effectiveness of each path, and additional experiments on the Kotlin dataset demonstrate the generalization ability of VulTriage under low-resource and class-imbalanced settings. Our code is available at https://github.com/vinsontang1/VulTriage

2605.09430 2026-05-13 cs.CV

FlashAR: Efficient Post-Training Acceleration for Autoregressive Image Generation

Junkang Zhou, Yefei He, Feng Chen, Weijie Wang, Bohan Zhuang

AI总结 本文提出了一种名为FlashAR的轻量级后训练加速框架,旨在高效提升自回归图像生成模型的推理速度。该方法通过引入一个垂直预测头与原有的水平预测头协同工作,基于双向下一个token预测实现高度并行的生成过程,同时尽量保持原模型的训练目标不变。实验表明,FlashAR仅需少量训练数据即可实现对预训练模型的高效适配,在512x512图像生成任务中达到最高22.9倍的加速效果。

Comments Post-training acceleration for autoregressive image generation, code is available at https://lxazjk.github.io/FlashAR/

详情
英文摘要

Large-scale autoregressive models have demonstrated remarkable capabilities in image generation. However, their sequential raster-scan decoding relies on strictly next-token prediction, making inference prohibitively expensive. Existing acceleration methods typically either introduce entirely new generation paradigms that necessitate costly pre-training from scratch, or enable parallel generation at the expense of a training-inference gap or altered prediction objectives. In this paper, we introduce FlashAR, a lightweight post-training adaptation framework that efficiently adapts a pre-trained raster-scan autoregressive model into a highly parallel generator based on two-way next-token prediction. Our key insight is that effective adaptation should minimize modifications to the pre-trained model's original training objective to preserve its learned prior. Accordingly, we retain the original AR head as a horizontal head for row-wise prediction and introduce a complementary, lightweight vertical head for column-wise prediction. To facilitate efficient adaptation, we branch the vertical head from an intermediate layer rather than the final layer, bypassing the inherent horizontal head bias. Moreover, since horizontal and vertical predictions capture complementary dependencies whose relative importance varies across target positions, we employ a learnable fusion gate to dynamically combine the two predictions at each position. To further reduce adaptation cost, we propose a two-stage adaptation pipeline: the vertical head is first initialized through adaptation from the pre-trained autoregressive model before jointly fine-tuned with backbone to adapt to the new decoding paradigm. Extensive experiments on LlamaGen and Emu3.5 show that FlashAR achieves up to a 22.9x speedup for 512x512 image generation through a lightweight post-training with merely 0.05% of the original training data.

2605.09287 2026-05-13 cs.AI

PiCA: Pivot-Based Credit Assignment for Search Agentic Reinforcement Learning

Dongyi Liu, Yifan Niu, Qinwen Wang, Han Xiao, Jia Li

AI总结 本文提出了一种基于关键步骤的信用分配方法PiCA,用于改进基于大语言模型的搜索智能体在强化学习中的训练效果。针对长期任务中奖励稀疏、信用孤立和分布偏移等关键问题,PiCA通过引入潜在基于奖励塑形机制,将搜索过程重构为累积进展的序列,并利用历史轨迹中的关键步骤作为信息峰值,为每一步提供与最终目标紧密关联的密集奖励。实验表明,PiCA在多个知识密集型问答任务中显著提升了模型性能,显示出其良好的通用性和有效性。

Comments 21 pages, 7 figures

详情
英文摘要

Large Language Model (LLM)-based search agents trained with reinforcement learning (RL) have significantly improved the performance of knowledge-intensive tasks. However, existing methods encounter critical challenges in long-horizon credit assignment: (i) Reward Sparsity, where models receive only outcome feedback without step-level guidance to differentiate action quality; (ii) Isolated Credit, where credit is assigned to steps independently, failing to capture sequential dependencies; and (iii) Distributional Shift, where rewards are estimated on templates that deviate from the model's natural generative distribution. To address these issues, we propose Pivot-Based Credit Assignment (PiCA), a novel step reward mechanism that reformulates the search trajectory as a sequential process of cumulative search progress. Unlike prior isolated step rewards, PiCA defines process rewards as success probabilities dependent on the historical context based on Potential-Based Reward Shaping (PBRS). This approach identifies pivot steps, which comprise target golden sub-queries and sub-answers derived from historical trajectories, as information peaks that significantly boost the likelihood of a correct final answer. By anchoring these step rewards to the final task objective, PiCA provides dense, pivot-aware and trajectory-dependent guidance while maintaining distributional consistency. Extensive experiments show that PiCA outperforms existing strong baselines across seven knowledge-intensive QA benchmarks, achieving 15.2% and 2.2% improvements for 3B and 7B models. The consistent performance gains across various models show PiCA's robust generalization. The code is available at https://github.com/novdream/PiCA.

2605.09277 2026-05-13 cs.LG

Worst-Case Regret Bounds for Combinatorial Thompson Sampling in Sleeping Semi-Bandits

Zhiming Huang, Bingshan Hu, Jianping Pan

AI总结 本文研究了具有睡眠臂的半带问题中的组合汤普森采样(CTS)算法,针对其长期存在的理论保障不足和实际性能不佳的问题,提出了首个最坏情况下的遗憾界分析,并设计了一种改进算法CL-SG。该算法通过共享高斯种子协调探索,显著提升了理论性能,实验表明其在真实数据集上优于现有方法。

Comments Accepted by INFOCOM 26 on Dec 2025

详情
英文摘要

We revisit combinatorial Thompson sampling (CTS) for semi-bandits with sleeping arms, where arm availability varies over time and actions must satisfy combinatorial constraints, as in wireless mesh routing with fluctuating link availability. Despite its practical relevance, CTS has been hindered by several long-standing problems: (i) the absence of worst-case regret guarantees in the semi-bandit setting even without sleeping arms, (ii) the lack of theory under adversarially varying availability, and (iii) the consistently weak empirical performance of CTS with Gaussian priors (CTS-G). This paper resolves these long-standing issues by providing the first worst-case regret analysis of CTS-G, proving an upper bound of $\tilde{O}(m\sqrt{NT})$ and a matching lower bound of $\tildeΩ(m\sqrt{NT})$. To bridge the gap between theory and practice, we further propose CL-SG, a simple CTS-G variant that samples a single shared Gaussian seed each round to coordinate exploration across arms. We show that CL-SG achieves an improved regret bound of $\tilde{O}(\sqrt{mNT})$, together with a matching lower bound $Ω(\sqrt{mNT})$. Experiments on real-world datasets demonstrate that CL-SG consistently outperforms strong baselines including CTS-G and CTS-B, and we open-source our implementation for reproducibility.

2605.09171 2026-05-13 cs.RO

SHIELD: Scalable Optimal Control with Certification using Duality and Convexity

Hansung Kim, Siddharth H. Nair, Francesco Borrelli

AI总结 本文提出了一种名为SHIELD的分层算法,通过利用强凸性和拉格朗日对偶性,有效降低$\ell_1$-正则化凸规划中的决策变量维度和约束集规模,同时保证被移除的约束和变量仍满足安全要求。为加速算法运行,作者引入了基于Transformer的深度神经网络辅助对偶证书的推导,并在复杂多模态交通场景中的随机模型预测控制(SMPC)中验证了该方法,实验表明其在保持可行性与闭环安全性的前提下,计算效率提升了数量级。

详情
英文摘要

We present SHIELD, a hierarchical algorithm that reduces both the decision-variable dimension and the constraint set in $\ell_1$-regularized convex programs. From strong convexity and Lagrangian duality, we derive certificates that \emph{safely} discard constraints and decision variables while guaranteeing that all removed constraints remain satisfied and all removed variables are null. To further accelerate the proposed algorithm, we propose a transformer-based deep neural network to guide the dual certificate inference. We validate SHIELD on stochastic model predictive control (SMPC) in complex, multi-modal traffic scenarios, comparing against a full-dimensional SMPC policy. Numerical simulations demonstrate order-of-magnitude computational speedups while preserving feasibility and closed-loop safety, highlighting the practicality of certifiably safe, lightweight MPC in complex driving scenes.

2605.09043 2026-05-13 cs.CL cs.AI

Phase Transitions in Affective Meaning Divergence: The Hidden Drift Before the Break

Napassorn Litchiowong

AI总结 本文研究了对话中情感意义分歧(AMD)的相变现象,即对话双方对同一词语的情感理解逐渐偏离,最终导致沟通失效。作者基于言语行为理论和熵正则化博弈论,构建了AMD的数学模型,并发现当参数 $βα> 4$ 时,AMD的增加会导致协调修复能力的突变式崩溃。在多个数据集上的实验证明,AMD在对话失控前表现出显著的临界减慢特征,且其时间动态模式优于传统毒性或情感指标,为理解对话破裂提供了新的理论依据。

Comments Accepted to the ACL 2026 Student Research Workshop

详情
英文摘要

One partner says "Fine" meaning "resolution"; the other hears "surrender." The word is shared; the affective uptake is not. We formalize this as affective meaning divergence (AMD), the total-variation distance between interlocutors' anchor-conditioned affect distributions. Building on speech-act theory, common-ground accumulation, and entropy-regularized game theory, we derive a logit best-response map whose dynamics undergo a saddle-node bifurcation: when $βα> 4$, a monotone increase in AMD-driven load produces an abrupt, hysteretic collapse of repair coordination. On Conversations Gone Awry (CGA-Wiki; $N = 652$), derailing conversations exhibit critical-slowing-down (CSD) signatures across multiple levels: lexical divergence variance ($p < 0.001$, $d = 0.36$), AMD variance ($p = 0.001$, $d = 0.26$), and dialog-act repair variance ($p = 0.016$, $d = 0.20$), all significant after correction and stronger than toxicity and sentiment baselines. AMD provides a distinct temporal signature, with retrospectively measured variance peaking at the bifurcation point while toxicity variance peaks earlier, and is the only indicator grounded in the theoretical framework. Boundary-condition analysis on CGA-CMV ($N = 1,169$) yields mixed but directionally consistent evidence.

2605.09003 2026-05-13 cs.CV

FlashClear: Ultra-Fast Image Content Removal via Efficient Step Distillation and Feature Caching

Yixin Tang, Jiawei Guo, Junxian Li, Zhiteng Li, Jixin Zhao, Bingya Zhang, Chenbo Wang, Yulun Zhang, Shangchen Zhou

AI总结 本文提出了一种名为FlashClear的高效图像内容移除方法,旨在解决基于扩散模型的对象移除方法在计算效率上的不足。该方法通过引入区域感知的对抗蒸馏(RAD)和前景优先的非对称注意力与缓存(FPAC)策略,实现了仅需少数步骤即可完成高质量内容移除的模型,显著提升了推理速度。实验表明,FlashClear在保持视觉质量的同时,相比现有方法在速度上分别提升了8.26倍和122倍。

Comments Code: https://github.com/GuoCalix/FlashClear

详情
英文摘要

Recently, diffusion-based object removal models have achieved impressive results in eliminating objects and their associated visual effects. However, they indiscriminately denoise all tokens across all timesteps, ignoring that removal usually involves small foreground regions. This strategy introduces substantial computational overhead and prolonged inference times. To overcome this computational burden, we propose a latent discriminator to implement Region-aware Adversarial Distillation (RAD), yielding a highly efficient few-step model named FlashClear. Furthermore, tailored to few-step diffusion models, we propose FPAC (Foreground-Prioritized Asymmetric Attention and Caching), a training-free acceleration strategy. Extensive experiments demonstrate that our framework provides massive acceleration while maintaining or exceeding the performance of our base model, ObjectClear. Notably, on the OBER benchmark, our FlashClear achieves up to 8.26$\times$ and 122$\times$ speedup over ObjectClear and OmniPaint, respectively, while maintaining high visual quality and fidelity.

2605.08978 2026-05-13 cs.AI

Learning to Explore: Scaling Agentic Reasoning via Exploration-Aware Policy Optimization

Xingyuan Hua, Sheng Yue, Ju Ren

AI总结 本文提出了一种基于探索感知的强化学习框架,旨在解决智能体在执行任务时探索策略不加区分的问题。该方法通过变分推断引入细粒度奖励函数,能够评估探索行为对未来决策的潜在提升,并结合探索感知的分组机制,在优化过程中区分探索动作与任务完成动作。实验表明,该方法在多种文本和图形界面基准任务中均取得了显著提升。

详情
英文摘要

Recent advancements in agentic test-time scaling allow models to gather environmental feedback before committing to final actions. A key limitation of existing methods is that they typically employ undifferentiated exploration strategies, lacking the ability to adaptively distinguish when exploration is truly required. In this paper, we propose an exploration-aware reinforcement learning framework that enables LLM agents to adaptively explore only when uncertainty is high. Our method introduces a fine-grained reward function via variational inference that explicitly evaluates exploratory actions by estimating their potential to improve future decision-making, together with an exploration-aware grouping mechanism that separates exploratory actions from task-completion actions during optimization. By targeting informational gaps, this design allows agents to explore selectively and transition to execution as soon as the task context is clear. Empirically, we demonstrate that our approach achieves consistent improvements across a range of challenging text-based and GUI-based agent benchmarks. Code is available at https://github.com/HansenHua/EAPO-ICML26 and models are available at https://huggingface.co/hansenhua/EAPO-ICML26.

2605.08828 2026-05-13 cs.AI

When Agents Overtrust Environmental Evidence: An Extensible Agentic Framework for Benchmarking Evidence-Grounding Defects in LLM Agents

Strick Sheng, Ziyue Wang, Liyi Zhou

AI总结 该研究提出了一种名为EnvTrustBench的可扩展智能体框架,用于评估大型语言模型代理在面对过时、错误或恶意环境信息时的可靠性问题。研究定义了“证据锚定缺陷”(EGD),即代理在未核实当前证据的情况下,仅凭环境提供的信息做出决策,从而导致任务错误。通过构建任务场景、生成工作空间与验证机制,EnvTrustBench系统评估了多种代理在不同情境下的表现,揭示了环境信息可靠性对代理行为的广泛影响,突显了环境锚定在智能体系统中的核心地位。

详情
英文摘要

Large language model agents increasingly operate through environment-facing scaffolds that expose files, web pages, APIs, and logs. These observations influence tool use, state tracking, and action sequencing, yet their reliability and authority are often uncertain. Environmental grounding is therefore a systems-level problem involving context admission, evidence provenance, freshness checking, verification policy, action gating, and model reasoning. Existing agent benchmarks mainly evaluate task capability or specific attacks such as prompt injection and memory poisoning, but they under-specify a fundamental reliability question: whether agents remain grounded in the true environment state when observations are stale, incorrect, or malicious. We introduce EnvTrustBench, an agentic framework for benchmarking this failure mode. We define an evidence-grounding defect (EGD) as a behavioral failure in which an agent treats an environment-facing claim as sufficient evidence for action without resolving it against available current evidence, leading to a task-incorrect false path under the true environment state. Given a task scenario, EnvTrustBench generates the workspace, environment, agent-facing objective, and validation oracle, executes the evaluated agent, records its action-observation trajectory and final state, and applies the oracle to produce a verdict. Using 6 LLM backbones and 5 widely used scaffolds, we evaluate 55 generated cases across 11 task scenarios, with each scenario expanded through five feedback-guided generation iterations. Results show that EGDs consistently emerge across operational workflows, highlighting environmental grounding as a core agent reliability problem with important security implications.

2605.08806 2026-05-13 cs.CV

L2A: Learning to Accumulate Pose History for Accurate 3D Human Pose Estimation

Zehua Wang, Changwang Mei, Huaijiang Sun, Pengqi Hu, Zhaoyang Yin

AI总结 该论文提出了一种名为L2A的框架,旨在通过有效利用历史姿态信息来提升三维人体姿态估计的准确性。研究发现,现有方法在跨层特征复用方面存在不足,为此,作者设计了空间-时间并行的Transformer骨干网络以保持一致的表示空间,并引入了历史姿态积累(HPA)机制和层姿态历史聚合(LPA)模块,以自适应地整合多层特征,减少冗余并提升稳定性。实验表明,该方法在多个基准数据集上取得了最先进的性能。

Comments 15page

详情
英文摘要

Existing 2D-3D lifting human pose estimation methods have achieved strong performance. But the utilization of historical pose representations across network depth was overlooked. In current pipelines, information is propagated through fixed residual connections, which restricts effective reuse of early-layer features such as fine-grained spatial structures and short-term motion cues. However, naively incorporating historical features across layers is non-trivial. We further identify that maintaining a consistent representation space across layers is a prerequisite for effective cross-layer feature aggregation. To address this issue, we propose a history-aware framework that enables effective network cross-layer history feature utilization. Specifically, we adopt a spatial-temporal parallel Transformer backbone to prevent alternating spatial-temporal transformations during sequential processing, thereby maintaining a consistent representation space. Building upon this, we introduce a History Pose Accumulation (HPA) mechanism that adaptively aggregates features from all preceding layers to enhance current representations. Furthermore, we propose a Layer Pose History Aggregation (LPA) module that transforms layer pose features into a compact and structured form, reducing redundancy and enabling more stable aggregation. Extensive experiments demonstrate that our approach achieves state-of-the-art performance on benchmarks.

2605.08802 2026-05-13 cs.CV

CoLVR: Enhancing Exploratory Latent Visual Reasoning via Contrastive Optimization

Ziyang Ding, Linjian Meng, Yiming Wu, Yuhan Li, Yuhao Liu, Zhen Zhao

AI总结 CoLVR 是一种通过对比优化增强潜空间视觉推理探索能力的方法,旨在解决现有模型因依赖硬对齐目标而限制潜空间推理灵活性的问题。该方法引入了基于角度扰动的潜空间对比训练框架,以学习更加多样化和探索性强的表示,并结合强化学习的潜轨迹对比奖励进行后训练,进一步优化潜空间推理过程。实验表明,CoLVR 在多个基准测试中显著提升了潜空间表示的探索能力,并在跨域任务中表现出色。

详情
英文摘要

Due to the potential for exploratory reasoning of Latent Visual Reasoning, recent works tend to enable MLLMs (Multimodal Large Language Models) to perform visual reasoning by propagating continuous hidden states instead of decoding intermediate steps into discrete tokens. However, existing works typically rely on hard alignment objectives to force latent representations to match predefined visual features, thereby severely limiting the exploratory of latent reasoning process. To address this problem, we propose CoLVR (Contrastive Optimization for Latent Visual Reasoning). To obtain a more exploratory visual reasoning, CoLVR introduces a latent contrastive training framework. Firstly, CoLVR learns diverse and exploratory representations with a latent contrastive objective guided by angle-based perturbation, which expands the semantic latent space and avoids over-constrained embedding. Then, CoLVR employs a latent trajectory contrastive reward for RL (Reinforcement Learning) post-training to enable fine-grained optimization of latent visual reasoning process and thus fostering diverse reasoning behaviors. Experiments demonstrate that CoLVR significantly enhances the exploratory capability of latent representations, achieving average improvements of 5.83% on VSP and 8.00% on Jigsaw, while also outperforming existing latent models on out of domain benchmarks, with a 3.40% gain on MMStar. The data, codes, and models are released at https://github.com/Oscar-dzy/CoLVR.

2605.08754 2026-05-13 cs.AI

Value-Decomposed Reinforcement Learning Framework for Taxiway Routing with Hierarchical Conflict-Aware Observations

Shizhong Zhou, Haifeng Liu, Zheng Zhang, Shiyu Zhang, Bo Yang, Yi Lin

AI总结 本文提出了一种名为CaTR的强化学习框架,用于解决机场地面上的实时多架飞机滑行路径规划问题。该框架通过分层的冲突感知观测机制,结合基于网格的环境建模和动作掩码技术,能够有效捕捉当前及下游的交通冲突信息,并采用价值分解策略以平衡安全与效率的多目标优化。实验表明,CaTR在多种交通密度下均能实现优于传统规划和强化学习方法的安全与效率综合性能。

详情
英文摘要

Taxiway routing and on-surface conflict avoidance are coupled safety-critical decision problems in airport surface operations. Existing planning and optimization methods are often limited by online computational cost, while reinforcement learning methods may struggle to represent downstream traffic conflicts and balance multiple objectives. This paper presents Conflict-aware Taxiway Routing (CaTR), a reinforcement learning framework for real-time multi-aircraft taxiway routing. CaTR constructs a grid-based airport surface environment with action masking, introduces a hierarchical foresight traffic representation to encode current and downstream conflict-related traffic conditions, and adopts a value-decomposed reinforcement learning strategy to prioritize sparse but safety-critical objectives. Experiments are conducted on a realistic environment based on Changsha Huanghua International Airport under multiple traffic density levels. Results show that CaTR achieves better safety--efficiency trade-offs than representative planning, optimization, and reinforcement learning baselines while maintaining practical runtime.

2605.08693 2026-05-13 cs.AI

SkillMaster: Toward Autonomous Skill Mastery in LLM Agents

Min Yang, Jinghua Piao, Xu Xia, Xiaochong Lan, Jiaju Chen, Yongshun Gong, Yong Li

AI总结 SkillMaster 是一种旨在使大语言模型代理实现自主技能掌握的训练框架。该方法通过轨迹引导的技能复盘、反事实效用评估和双优势估计机制,使代理能够在任务解决过程中自主创建、优化和选择技能,从而提升其应对复杂任务的能力。实验表明,SkillMaster 在多个基准任务中显著优于现有方法,展示了代理从被动使用技能向主动学习和改进技能的能力转变。

详情
英文摘要

Skills provide an effective mechanism for improving LLM agents on complex tasks, yet in existing agent frameworks, their creation, refinement, and selection are typically governed by external teachers, hand-designed rules, or auxiliary modules. As a result, skills remain external resources to be invoked, rather than capabilities that agents can develop, adapt, and internalize through experience. To endow LLM agents with autonomous skill mastery, we propose SkillMaster, a training framework that teaches agents to create new skills, refine existing skills, and select accumulated skills during task solving. This capability is achieved through three key designs. First, we train agents through trajectory-informed skill review, teaching agents to propose, update, or retain skills based on evidence from completed episodes. Second, each candidate skill edit is designed to be evaluated by its counterfactual utility on related probe tasks, providing a direct learning signal for training skill-editing decisions. Third, we introduce DualAdv-GRPO, which separately estimates advantages for task-solving actions and skill-editing decisions, stabilizing joint training across task solving and skill management. Experiments on ALFWorld and WebShop show that SkillMaster improves the overall success rate over state-of-the-art baselines by 8.8% and 9.3%, respectively, achieving the best performance among all compared methods. Further analysis reveals a marked shift in agent capability: agents trained with SkillMaster can identify skill failures, refine procedural knowledge from trajectory evidence, and transfer improvements to future tasks with limited skill-bank edits. Overall, SkillMaster moves LLM agents beyond mere skill use toward self-improving agents capable of developing, adapting, and applying their own skill repertoires.

2605.08600 2026-05-13 cs.CL

100,000+ Movie Reviews from Kazakhstan: Russian, Kazakh, and Code-Switched Texts

Rustem Yeshpanov

AI总结 本文介绍了一个包含100,502条哈萨克斯坦电影评论的多语言语料库,涵盖俄语、哈萨克语及代码混合文本,时间跨度从2001年至2025年。评论经过人工标注语言和情感极性,并附有部分用户评分。研究通过对比传统文本特征方法与多语言Transformer模型在情感分类任务中的表现,发现后者在极性分类任务中具有明显优势,但在评分分类任务中仍面临类别不平衡和评分细微差异带来的挑战。

Comments 10 pages, 1 figure, 8 tables, to appear in Proceedings of the 6th International Conference on Natural Language Processing for the Digital Humanities (NLP4DH 2026)

详情
英文摘要

We present a new publicly available corpus of 100,502 movie reviews from Kazakhstan collected from kino.kz, spanning 2001-2025 and covering 4,943 unique titles. The dataset is multilingual, consisting mainly of Russian reviews alongside Kazakh and code-switched texts. Reviews are manually annotated for language and sentiment polarity, and 11,309 reviews additionally contain explicit user-provided ratings. We define two sentiment tasks -- three-way polarity classification and five-class score classification -- and benchmark classical BoW/TF-IDF baselines against multilingual transformer models (mBERT, XLM-RoBERTa, RemBERT). Experimental results show that transformer models consistently outperform classical baselines on polarity classification, while score classification remains challenging under leakage-controlled evaluation due to severe class imbalance and subtle distinctions between adjacent rating levels.

2605.08571 2026-05-13 cs.RO

BEACON: Cross-Domain Co-Training of Generative Robot Policies via Best-Effort Adaptation

Antong Zhang, Han Qi, Heng Yang

AI总结 本文提出BEACON框架,旨在通过最佳努力适应实现跨领域协同训练,用于在源域有大量示范而目标域示范有限的情况下训练生成式机器人策略。该方法将跨域协同训练建模为差异感知的重要性重加权问题,同时学习基于扩散模型的视觉运动策略和样本级源权重,以最小化目标域泛化保证下的目标函数。通过可扩展的实例级差异估计器、策略与权重的随机交替更新以及多源扩展,BEACON在多种跨域场景中提升了策略的鲁棒性和数据效率,并隐式实现了特征对齐。

详情
英文摘要

We introduce BEACON--Best-Effort Adaptation for Cross-Domain Co-Training--a theory-driven framework for training generative robot policies with abundant source demonstrations and limited target demonstrations. BEACON casts cross-domain co-training as a discrepancy-aware importance-reweighting problem, jointly learning a diffusion-based visuomotor policy and per-sample source weights that minimize an objective informed by target-domain generalization guarantees. To make best-effort adaptation practical for high-dimensional sequence policies, we develop scalable instance-level discrepancy estimators, stochastic alternating updates for policy and weights, and a multi-source extension that balances heterogeneous source domains. Across sim-to-sim, sim-to-real, and multi-source manipulation settings, BEACON improves robustness and data efficiency over target-only, fixed-ratio co-training, and feature-alignment baselines. Importantly, even without an explicit alignment objective, BEACON achieves feature alignment as an implicit result of discrepancy-aware cross-domain co-training.

2605.08446 2026-05-13 cs.LG

Direct Bethe Free Energy Minimization for Bayesian Neural Network

Pavel Prochazka

AI总结 本文提出了一种通过直接最小化Bethe自由能来训练贝叶斯神经网络的方法,替代传统的变分下界最大化策略。该方法在树状因子图上能够精确计算Bethe自由能,支持概率层和确定层的混合结构,并且在权重后验限制为最后一层高斯分布时,能够得到解析可计算的损失函数。实验表明,该方法在预测性能上与标准方法相当,同时避免了变分族选择带来的Jensen间隙,并实现了单次梯度传递下的超参数优化。

Comments Submited to conference - fix typo in title + name

详情
英文摘要

We propose training Bayesian neural networks by directly minimizing the Bethe free energy rather than maximizing a variational lower bound. On tree-structured factor graphs the Bethe free energy is exact; deterministic layers drop out of the objective and are trained by standard backpropagation, so the framework accommodates any mixture of probabilistic and deterministic subgraphs without modification. Restricting the weight posterior to a last-layer Gaussian yields analytically tractable losses: for a Gaussian likelihood the Bethe loss equals the exact marginal likelihood, and for a probit likelihood it reduces to a closed form via the probit-Gaussian convolution. Both objectives sit strictly between MAP and the ELBO ($L_\text{MAP} \leq L_\text{Bethe} \leq L_\text{ELBO}$), removing the structural Jensen gap that no choice of variational family can close. The Z-consistent prior formulation makes the prior precision a differentiable parameter, enabling empirical Bayes - joint optimization of weights, covariance, and hyperparameters - in a single gradient pass, with no cross-validation or outer loop. All variants admit a closed-form predictive at MAP-equivalent inference cost, in contrast to ensemble and sampling-based methods. On 8 UCI regression and 12 UCI classification benchmarks evaluated under a single shared hyperparameter regime, Bethe is competitive with standard reference methods at single-pass cost. Independently, joint single-pass empirical Bayes matches grid-search cross-validation of the prior precision on essentially all dataset-variant combinations, eliminating the outer hyperparameter loop without measurable cost. Isolated optimization gaps on a few datasets reflect numerical rather than principled limitations of the framework.

2605.08328 2026-05-13 cs.LG cs.CV

P-Flow: Proxy-gradient Flows for Linear Inverse Problems

Zehua Jiang, Fenghao Zhu, Xinquan Wang, Chongwen Huang, Zhaoyang Zhang

AI总结 本文提出了一种名为 P-Flow 的新框架,用于解决线性逆问题,通过引入代理梯度来更新源点,有效避免了传统方法中因长链求导导致的数值不稳定和计算开销。该方法结合高维空间中的测度集中现象,采用高斯球面投影以确保先验分布的一致性,并基于贝叶斯理论和 Lipschitz 连续性进行了理论分析。实验表明,P-Flow 在多种图像修复任务中表现优异,尤其在极端退化条件下具有明显优势。

详情
英文摘要

Generative models based on flow matching have emerged as a powerful paradigm for inverse problems, offering straighter trajectories and faster sampling compared to diffusion models. However, existing approaches often necessitate differentiating through unrolled paths, leading to numerical instability and prohibitive computational overhead. To address this, we propose P-Flow, a framework that stabilizes the reconstruction process by leveraging a proxy gradient to update the source point. This approach effectively circumvents the numerical instability and memory overhead of long-chain differentiation. To ensure consistency with the prior distribution, we employ a Gaussian spherical projection motivated by the concentration of measure phenomenon in high-dimensional spaces. We further provide a theoretical analysis for P-Flow based on Bayesian theory and Lipschitz continuity. Experiments across diverse restoration tasks demonstrate that P-Flow delivers competitive performance, especially under extreme degradations such as severely ill-posed conditions and high measurement noise.

2605.08322 2026-05-13 cs.LG cs.AI

SDG-MoE: Signed Debate Graph Mixture-of-Experts

Stepan Kulibaba, Kirill Labzin, Artem Dzhalilov, Roman Pakhomov, Oleg Svidchenko, Alexander Gasnikov, Aleksei Shpilman

AI总结 本文提出了一种名为SDG-MoE的新颖稀疏混合专家(MoE)架构,旨在通过引入专家间的结构化交流机制提升模型性能。该方法在路由后引入了一个轻量级的迭代讨论步骤,包含支持图和批评图两个交互矩阵,以及基于分歧的锚定机制,以增强专家间的信息传递与协调。实验表明,SDG-MoE在多个基准数据集上显著优于传统MoE和无符号图通信基线,验证了其有效性与优越性。

详情
英文摘要

Sparse MoE models achieve a good balance between capacity and compute by routing each token to a small subset of experts. However, in most MoE architectures, once a token is routed, the selected experts process it independently and their outputs are combined via a weighted sum. This leaves open whether enabling communication among them could improve performance. While prior work has raised this question, direct interaction among the active routed experts remains underexplored. In this paper, we propose SDG-MoE (Signed Debate Graph Mixture-of-Experts), a novel architecture that adds a lightweight, iterative deliberation step before final aggregation. SDG-MoE introduces three components: (i) two learned interaction matrices over the active experts, a support graph $A^+$ and a critique graph $A^-$, capturing reinforcing and corrective influences; (ii) a signed message-passing step that updates expert representations before aggregation; and (iii) a disagreement-gated Friedkin-Johnsen-style anchoring that controls deliberation strength while preventing expert drift. Together, these enable a structured deliberation process where interaction strength scales with disagreement and specialization is preserved. We also provide a theoretical analysis establishing stability conditions on expert states and showing that deliberation adds only low-order overhead over the active set. In controlled three-seed pretraining experiments, SDG-MoE improves validation perplexity over both an unsigned graph communication baseline and vanilla MoE, outperforming the strongest baseline by 19.8%, and gives the best external perplexity on WikiText-103, C4, and Paloma among the compared systems.

2605.08083 2026-05-13 cs.CL

LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling

Tong Zheng, Haolin Liu, Chengsong Huang, Huiwen Bao, Sheng Zhang, Rui Liu, Runpeng Dai, Ruibo Chen, Chenxi Liu, Tianyi Xiong, Xidong Wu, Hongming Zhang, Heng Huang

AI总结 本文研究了如何在推理时通过分配额外计算资源来提升大语言模型的性能,提出了一种名为AutoTTS的环境驱动框架,用于自动发现高效的推理时扩展策略。该方法通过构建可控的环境,使策略搜索更加高效,并引入参数化和反馈机制以提升发现效率。实验表明,所发现的策略在数学推理任务中优于手动设计的基线,在成本与准确率的权衡上表现更优,且发现过程成本低廉。

Comments 25 pages

详情
英文摘要

Test-time scaling (TTS) has become an effective approach for improving large language model performance by allocating additional computation during inference. However, existing TTS strategies are largely hand-crafted: researchers manually design reasoning patterns and tune heuristics by intuition, leaving much of the computation-allocation space unexplored. We propose an environment-driven framework, AutoTTS, that changes what researchers design: from individual TTS heuristics to environments where TTS strategies can be discovered automatically. The key to AutoTTS lies in environment construction: the discovery environment must make the control space tractable and provide cheap, frequent feedback for TTS search. As a concrete instantiation, we formulate width--depth TTS as controller synthesis over pre-collected reasoning trajectories and probe signals, where controllers decide when to branch, continue, probe, prune, or stop and can be evaluated cheaply without repeated LLM calls. We further introduce beta parameterization to make the search tractable and fine-grained execution trace feedback to improve discovery efficiency by helping the agent diagnose why a TTS program fails. Experiments on mathematical reasoning benchmarks show that the discovered strategies improve the overall accuracy--cost tradeoff over strong manually designed baselines. The discovered strategies generalize to held-out benchmarks and model scales, while the entire discovery costs only $39.9 and 160 minutes. Our data, and code will be open-source at https://github.com/zhengkid/AutoTTS.

2605.07782 2026-05-13 cs.CL cs.PL

CktFormalizer: Autoformalization of Natural Language into Circuit Representations

Jing Xiong, Qi Han, Chenchen Ding, He Xiao, Zunhai Su, Chaofan Tao, Ngai Wong

AI总结 CktFormalizer 是一个将自然语言自动转化为电路表示的框架,旨在解决大语言模型生成的 Verilog 代码在综合和实现过程中常出现的缺陷问题。该框架通过嵌入在 Lean 4 中的依赖类型硬件描述语言,实现了类型检查、正确性保障和形式化证明,有效提升了生成电路的正确性和可实现性。实验表明,CktFormalizer 在保持仿真通过率的同时,显著提高了后端实现的成功率,并能通过自动化定理证明实现性能优化。

详情
英文摘要

LLMs can generate hardware descriptions from natural language specifications, but the resulting Verilog often contains width mismatches, combinational loops, and incomplete case logic that pass syntax checks yet fail in synthesis or silicon. We present CktFormalizer, a framework that redirects LLM-driven hardware generation through a dependently-typed HDL embedded in Lean 4. Lean serves three roles: (i) type checker:dependent types encode bit-width constraints, case coverage, and acyclicity, turning hardware defects into compile-time errors that guide iterative repair; (ii) correctness firewall:compiled designs are structurally free of defects that cause silent backend failures (the baseline loses 20% of correct designs during synthesis and routing; CktFormalizer preserves all of them); (iii) proof assistant:the agent constructs machine-checked equivalence proofs over arbitrary input sequences and parameterized widths, beyond the reach of bounded SMT-based checking. On VerilogEval (156 problems), RTLLM (50 problems), and ResBench (56 problems), CktFormalizer achieves simulation pass rates competitive with direct Verilog generation while delivering substantially higher backend realizability: 95--100% of compiled designs complete the full synthesis, place-and-route, DRC, and LVS flow. A closed-loop PPA optimization stage yields up to 35% area reduction and 30% power reduction through validated architecture exploration, with automated theorem proof ensuring that each optimized variant remains functionally equivalent to its formal specification.

2605.07744 2026-05-13 cs.AI

Alternating Target-Path Planning for Scalable Multi-Agent Coordination

Yu Kumagai, Keisuke Okumura

AI总结 本文研究了多智能体在同时分配目标和规划路径(TAPF)中的协调问题,提出了一种解耦目标分配与路径规划的迭代优化框架。该方法基于高效的次优多智能体路径规划求解器,通过反复规划路径并利用反馈信息优化目标分配,有效提升了算法的可扩展性。实验表明,该框架在保持较好解质量的同时,显著优于基于冲突搜索的传统方法,为实际大规模TAPF问题提供了可行的解决方案。

详情
英文摘要

The concurrent target assignment and pathfinding (TAPF) problem extends multi-agent pathfinding (MAPF) by asking planners to allocate distinct targets and collision-free paths to agents. Prior work on TAPF has relied exclusively on Conflict-Based Search (CBS), which tightly couples target assignment and pathfinding, resulting in compute-intensive, non-scalable solutions. In contrast, we propose an iterative refinement framework that decouples target assignment from pathfinding. Our framework builds on modern, fast, suboptimal MAPF solvers, such as LaCAM. Specifically, within a given time budget, it repeatedly solves MAPF for the current target assignment, identifies bottleneck agents via MAPF feedback, and refines the assignment. Empirical results show that feedback-driven reassignment loop is effective, enabling our framework to scale well beyond the reach of the state-of-the-art CBS-based solver while maintaining decent solution quality. This represents a solid step toward practical, large scale TAPF suitable for real-world setups.

2605.07552 2026-05-13 cs.CV

VIMCAN: Visual-Inertial 3D Human Pose Estimation with Hybrid Mamba-Cross-Attention Network

Zepeng Yang, Junxuan Bai, Hao Li, Ju Dai, Junjun Pan, Yongfeng Yin, Bin Li

AI总结 该论文提出了一种名为VIMCAN的混合架构,用于视觉-惯性融合的三维人体姿态估计。该方法结合了Mamba的高效序列建模能力和Cross-Attention的空间感知能力,有效解决了传统Transformer在处理长序列时计算复杂度高、难以实时处理的问题。实验表明,VIMCAN在多个数据集上取得了优于现有方法的精度,并能在普通消费级硬件上实现每秒60帧以上的实时推理。

Comments Accepted in CVPR 2026

详情
英文摘要

The rapid advances in deep learning have significantly enhanced the accuracy of multimodal 3D human pose estimation (HPE). However, the state-of-the-art (SOTA) HPE pipelines still rely on Transformers, whose quadratic complexity makes real-time processing for long sequences impractical. Mamba addresses this issue through selective state-space modeling, enabling efficient sequence processing without sacrificing representational power. Nevertheless, it struggles to capture complex spatial dependencies in multimodal settings. To bridge this gap, we propose VIMCAN, a hybrid architecture that combines the efficient sequence modeling of Mamba with the spatial reasoning of Cross-Attention, and performs robust visual-inertial fusion and human pose estimation between RGB keypoints and wearable IMU data. By leveraging Mamba's dynamic parameterization for temporal modeling and Attention for spatial dependency extraction, VIMCAN achieves superior accuracy, with mean per-joint position errors (MPJPE) of 17.2 mm on TotalCapture and 45.3 mm on 3DPW. VIMCAN outperforms prior Transformer-based and other SOTA approaches while supporting real-time inference at over 60 frames per second on consumer-grade hardware. The source code is available at \href{https://github.com/Eddieyzp/VIMCAN}{this GitHub repository}.

2605.06940 2026-05-13 cs.CL cs.AI cs.LG

MultiSoc-4D: A Benchmark for Diagnosing Instruction-Induced Label Collapse in Closed-Set LLM Annotation of Bengali Social Media

Souvik Pramanik, S. M. Riaz Rahman Antu, Shak Mohammad Abyad, Md. Ibrahim Khalil, Md. Shahriar Hussain

AI总结 该研究提出MultiSoc-4D,一个用于诊断封闭集指令下大型语言模型(LLM)标注偏差的孟加拉语社交媒体数据集,包含超过58,000条来自六个来源的社交媒体评论,并在四个维度上进行标注。通过多模型协作标注与共享验证集的结构化流程,研究系统性地揭示了LLM在标注过程中普遍存在的“指令诱导标签坍缩”现象,即模型倾向于使用默认标签,导致对少数类别的检测严重不足。该研究还通过统计验证证明了这一现象形成的“标签一致性幻觉”,并评估了40多个LLM在训练流程中的标注偏差传播情况,为低资源语言的NLP标注研究提供了重要基准。

Comments 21 pages, 14 figures, 13 tables

详情
英文摘要

Annotation automation via Large Language Models (LLMs) is the core approach for scaling NLP datasets; however, LLM behavior with respect to closed-set instructions in low-resource languages has not been well studied. We present MultiSoc-4D, a Bengali social media dataset benchmark, which contains 58K+ social media comments from six sources annotated along four dimensions: category, sentiment, hate speech, and sarcasm. By employing a structured pipeline where ChatGPT, Gemini, Claude, and Grok individually annotate separate partitions, while sharing a common validation set of 20%, we diagnose LLM behavior systematically. We discover a prevalent phenomenon called "instruction-induced label collapse", wherein LLMs show a systematic preference towards fallback labels (Other, Neutral, No), leading to high agreement rates but under-detection of minority categories. For example, we find that LLMs failed to detect 79% and 75% of instances with hateful and sarcastic content compared to a human-calibrated reference. Furthermore, we prove that it represents a "label agreement illusion", statistically validated via almost null Fleiss' Kappa ($κ\approx -0.001$) on sarcasm detection. Across 40+ LLMs, we benchmark this annotation bias propagation within the training pipeline, regardless of architectural differences. We release MultiSoc-4D as a diagnostic benchmark for annotation biases in Bengali NLP.