arXivDaily: arXiv daily academic digest, updated Monday through Friday
2603.09668 2026-05-19 cs.CV

DiffWind: Physics-Informed Differentiable Modeling of Wind-Driven Object Dynamics

Yuanhang Lei, Boming Zhao, Zesong Yang, Xingxuan Li, Tao Cheng, Haocheng Peng, Ru Zhang, Yang Yang, Siyuan Huang, Yujun Shen, Ruizhen Hu, Hujun Bao, Zhaopeng Cui

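Affiliations: State Key Laboratory of CAD & CG; State Key Laboratory of General Artificial Intelligence; Ant Group; Shenzhen University

AI summary: This paper proposes DiffWind, a physics-informed differentiable framework that unifies wind-object interaction modeling, video-based reconstruction, and forward simulation. Wind is represented as a grid-based physical field and objects as particle systems derived from 3D Gaussian Splatting, with their interaction modeled by the Material Point Method (MPM), which makes it possible to reconstruct wind-driven object dynamics. The paper also introduces the WD-Objects dataset; extensive experiments show that the method significantly outperforms prior dynamic scene modeling approaches in reconstruction accuracy and simulation fidelity.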

Comments Accepted by ICLR 2026. Project page: https://zju3dv.github.io/DiffWind/

Abstract

Modeling wind-driven object dynamics from video observations is highly challenging due to the invisibility and spatio-temporal variability of wind, as well as the complex deformations of objects. We present DiffWind, a physics-informed differentiable framework that unifies wind-object interaction modeling, video-based reconstruction, and forward simulation. Specifically, we represent wind as a grid-based physical field and objects as particle systems derived from 3D Gaussian Splatting, with their interaction modeled by the Material Point Method (MPM). To recover wind-driven object dynamics, we introduce a reconstruction framework that jointly optimizes the spatio-temporal wind force field and object motion through differentiable rendering and simulation. To ensure physical validity, we incorporate the Lattice Boltzmann Method (LBM) as a physics-informed constraint, enforcing compliance with fluid dynamics laws. Beyond reconstruction, our method naturally supports forward simulation under novel wind conditions and enables new applications such as wind retargeting. We further introduce WD-Objects, a dataset of synthetic and real-world wind-driven scenes. Extensive experiments demonstrate that our method significantly outperforms prior dynamic scene modeling approaches in both reconstruction accuracy and simulation fidelity, opening a new avenue for video-based wind-object interaction modeling.

2603.09405 2026-05-19 cs.CV

YOLO-NAS-Bench: A Surrogate Benchmark with Self-Evolving Predictors for YOLO Architecture Search

Zhe Li, Xiaoyu Ding, Jiaxin Zheng, Yongtao Wang

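Affiliations: Wangxuan Institute of Computer Technology, Peking University

AI summary: This paper proposes YOLO-NAS-Bench, a surrogate benchmark for YOLO detectors that uses a self-evolving mechanism to improve predictor accuracy, enabling efficient evaluation in YOLO architecture search.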

Comments Accepted as Oral at CVPR 2026 Workshop on Neural Architecture Search (NAS)

Abstract

Neural Architecture Search (NAS) for object detection is severely bottlenecked by high evaluation cost, as fully training each candidate YOLO architecture on COCO demands days of GPU time. Meanwhile, existing NAS benchmarks largely target image classification, leaving the detection community without a comparable benchmark for NAS evaluation. To address this gap, we introduce YOLO-NAS-Bench, the first surrogate benchmark tailored to YOLO-style detectors. YOLO-NAS-Bench defines a search space spanning channel width, block depth, and operator type across both backbone and neck, covering the core modules of YOLOv8 through YOLO12. We sample 1,000 architectures via random, stratified, and Latin Hypercube strategies, train them on COCO-mini, and build a LightGBM surrogate predictor. To sharpen the predictor in the high-performance regime most relevant to NAS, we propose a Self-Evolving Mechanism that progressively aligns the predictor's training distribution with the high-performance frontier, by using the predictor itself to discover and evaluate informative architectures in each iteration. This method grows the pool to 1,500 architectures and raises the ensemble predictor's R2 from 0.770 to 0.815 and Sparse Kendall Tau from 0.694 to 0.752, demonstrating strong predictive accuracy and ranking consistency. Using the final predictor as the fitness function for evolutionary search, we discover architectures that surpass all official YOLOv8-YOLO12 baselines at comparable latency on COCO-mini, confirming the predictor's discriminative power for top-performing detection architectures. The code is available at https://github.com/VDIGPKU/YOLO-NAS-Bench.
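
The ranking-consistency figure quoted above can be made concrete with a small sketch. The function below computes plain Kendall's tau (tau-a) between predicted and measured scores; it is not the paper's "Sparse Kendall Tau" variant, whose definition the abstract does not give, and the mAP values are made-up illustrative numbers.

```python
from itertools import combinations


def kendall_tau(predicted, actual):
    """Plain Kendall's tau-a: (concordant - discordant) / number of pairs."""
    if len(predicted) != len(actual) or len(predicted) < 2:
        raise ValueError("need two equally long sequences with at least 2 items")
    concordant = discordant = 0
    for i, j in combinations(range(len(predicted)), 2):
        # A pair is concordant when both score lists order it the same way.
        sign = (predicted[i] - predicted[j]) * (actual[i] - actual[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(predicted) * (len(predicted) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical measured mAP of four architectures and two surrogate predictions.
measured = [0.30, 0.35, 0.28, 0.40]
good_surrogate = [0.31, 0.33, 0.29, 0.38]  # same ordering as measured
poor_surrogate = [0.31, 0.29, 0.33, 0.38]  # three of six pairs flipped

print(kendall_tau(good_surrogate, measured))  # 1.0
print(kendall_tau(poor_surrogate, measured))  # 0.0
```

A surrogate used as a fitness function only needs to order candidates correctly, which is why a rank statistic is reported alongside R2.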

2603.08462 2026-05-19 cs.LG

Reasoning as Compression: Unifying Budget Forcing via the Conditional Information Bottleneck

Fabio Valerio Massoli, Andrey Kuzmin, Arash Behboodi

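Affiliations: Qualcomm AI Research

AI summary: This paper frames efficient reasoning as a lossy compression problem under the Information Bottleneck principle. By introducing the Conditional Information Bottleneck (CIB) principle, it addresses a theoretical gap that arises when the naive Information Bottleneck is applied to transformers, and it uses a semantic prior to compress reasoning more effectively, improving accuracy while reducing compute cost.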

Abstract

Chain-of-Thought (CoT) prompting improves LLM accuracy on complex tasks but often increases token usage and inference cost. Existing "Budget Forcing" methods reduce cost via fine-tuning with heuristic length penalties, suppressing both essential reasoning and redundant filler. We recast efficient reasoning as a lossy compression problem under the Information Bottleneck (IB) principle, and identify a key theoretical gap when applying naive IB to transformers: attention violates the Markov property between prompt, reasoning trace, and response. To resolve this issue, we model CoT generation under the Conditional Information Bottleneck (CIB) principle, where the reasoning trace $Z$ acts as a computational bridge that contains only the information about the response $Y$ that is not directly accessible from the prompt $X$. This yields a general Reinforcement Learning objective: maximize task reward while compressing completions under a prior over reasoning traces, subsuming common heuristics (e.g., length penalties) as special cases (e.g., uniform priors). In contrast to naive token-counting approaches, we introduce a semantic prior that measures token cost by surprisal under a language model. Crucially, the prior is queried only for token-level log-probabilities, adding negligible overhead to the training loop. Empirically, our CIB objective prunes reasoning redundancy while preserving fluency and logic, improving accuracy at moderate compression and enabling aggressive compression with minimal accuracy drop. These gains generalize across model families and task domains, confirming CIB as a domain-agnostic CoT compression framework.
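
The relationship between a surprisal-based prior and a length penalty can be sketched in a few lines. This is only an illustration of the idea stated in the abstract, not the paper's objective: the function names, the trade-off weight `beta`, and the additive form "reward minus beta times cost" are assumptions made for this example.

```python
import math


def compression_cost(token_logprobs):
    """Surprisal of a reasoning trace: sum of -log p(token) under the prior."""
    return -sum(token_logprobs)


def regularized_reward(task_reward, token_logprobs, beta):
    """Hypothetical objective: task reward minus beta times the trace cost."""
    return task_reward - beta * compression_cost(token_logprobs)


def uniform_prior_logprobs(num_tokens, vocab_size):
    """Under a uniform prior every token has the same log-probability."""
    return [-math.log(vocab_size)] * num_tokens


# With a uniform prior the cost is num_tokens * log(vocab_size),
# i.e. a plain length penalty.
short = compression_cost(uniform_prior_logprobs(10, 1000))
long = compression_cost(uniform_prior_logprobs(20, 1000))
print(long / short)

# With a language-model prior, predictable tokens are cheap and
# surprising tokens are expensive, regardless of count.
semantic_logprobs = [-0.1, -0.1, -5.0]
print(regularized_reward(1.0, semantic_logprobs, beta=0.01))
```

Because the cost only needs per-token log-probabilities, one forward pass of the prior model over the sampled trace is enough to evaluate it.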

2603.08290 2026-05-19 cs.LG cs.AI

Minor First, Major Last: A Depth-Induced Implicit Bias of Sharpness-Aware Minimization

Chaewon Moon, Dongkuk Si, Chulhee Yun

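Affiliations: Graduate School of AI, KAIST; Mobilint, Inc.

AI summary: This study examines the implicit bias of sharpness-aware minimization (SAM) on linearly separable binary classification. It finds that at depth L=2 SAM behaves differently from depth L=1 and exhibits a phenomenon called sequential feature amplification.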

Comments Accepted to ICLR 2026, 84 pages, 35 figures

Abstract

We study the implicit bias of Sharpness-Aware Minimization (SAM) when training $L$-layer linear diagonal networks on linearly separable binary classification. For linear models ($L=1$), both $\ell_\infty$- and $\ell_2$-SAM recover the $\ell_2$ max-margin classifier, matching gradient descent (GD). However, for depth $L = 2$, the behavior changes drastically -- even on a single-example dataset. For $\ell_\infty$-SAM, the limit direction depends critically on initialization and can converge to $\mathbf{0}$ or to any standard basis vector, in stark contrast to GD, whose limit aligns with the basis vector of the dominant data coordinate. For $\ell_2$-SAM, we show that although its limit direction matches the $\ell_1$ max-margin solution as in the case of GD, its finite-time dynamics exhibit a phenomenon we call "sequential feature amplification", in which the predictor initially relies on minor coordinates and gradually shifts to larger ones as training proceeds or initialization increases. Our theoretical analysis attributes this phenomenon to $\ell_2$-SAM's gradient normalization factor applied in its perturbation, which amplifies minor coordinates early and allows major ones to dominate later, giving a concrete example where infinite-time implicit-bias analyses are insufficient. Synthetic and real-data experiments corroborate our findings.
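
For readers unfamiliar with the update rule, one $\ell_2$-SAM step on a depth-2 diagonal linear network can be sketched as follows. The setup is an assumption for illustration: the predictor is taken to be the elementwise product of two weight vectors `u` and `v`, the loss is logistic on a single example, and the step size `lr` and radius `rho` are arbitrary; the paper's exact parameterization and loss may differ.

```python
import math


def loss_and_grad(u, v, x, y=1.0):
    """Logistic loss of the predictor beta = u * v (elementwise) on one example."""
    margin = y * sum(ui * vi * xi for ui, vi, xi in zip(u, v, x))
    loss = math.log1p(math.exp(-margin))
    coeff = -y / (1.0 + math.exp(margin))  # derivative of loss w.r.t. <beta, x>
    grad_u = [coeff * vi * xi for vi, xi in zip(v, x)]
    grad_v = [coeff * ui * xi for ui, xi in zip(u, x)]
    return loss, grad_u, grad_v


def l2_sam_step(u, v, x, y=1.0, lr=0.1, rho=0.05):
    """One l2-SAM step: ascend by rho * g / ||g||, then descend with the
    gradient evaluated at the perturbed point."""
    _, grad_u, grad_v = loss_and_grad(u, v, x, y)
    norm = math.sqrt(sum(g * g for g in grad_u + grad_v))
    if norm == 0.0:
        return list(u), list(v)  # no direction to perturb along
    # The 1 / norm factor is the gradient normalization the abstract refers to.
    u_adv = [ui + rho * g / norm for ui, g in zip(u, grad_u)]
    v_adv = [vi + rho * g / norm for vi, g in zip(v, grad_v)]
    _, grad_u_adv, grad_v_adv = loss_and_grad(u_adv, v_adv, x, y)
    new_u = [ui - lr * g for ui, g in zip(u, grad_u_adv)]
    new_v = [vi - lr * g for vi, g in zip(v, grad_v_adv)]
    return new_u, new_v


u, v, x = [0.5, 0.5], [0.5, 0.5], [1.0, 0.2]  # one major and one minor coordinate
loss_before, _, _ = loss_and_grad(u, v, x)
u, v = l2_sam_step(u, v, x)
loss_after, _, _ = loss_and_grad(u, v, x)
print(loss_before, loss_after)
```

Iterating this step and tracking the per-coordinate products `u[i] * v[i]` is one way to observe how the relative weight of minor and major coordinates evolves over training.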

2603.07900 2026-05-19 cs.AI

EveryQuery: Zero-Shot Clinical Prediction via Task-Conditioned Pretraining over Electronic Health Records

Payal Chandak, Gregory Kondas, Liat Antwarg Friedman, Isaac Kohane, Matthew McDermott

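Affiliations: Harvard-MIT HST; Columbia University; Harvard Medical School

AI summary: This paper proposes EveryQuery, an electronic health record foundation model that achieves zero-shot clinical prediction through task-conditioned pretraining. It directly estimates the likelihood of an outcome occurring in a future window rather than generating future events, and it outperforms an autoregressive baseline on multiple prediction tasks.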

Abstract

Foundation models pretrained on electronic health records (EHR) have demonstrated zero-shot clinical prediction capabilities by generating synthetic patient futures and aggregating statistics over sampled trajectories. However, this autoregressive inference procedure is computationally expensive, statistically noisy, and not natively promptable because users cannot directly condition predictions on specific clinical questions. In this preliminary work, we introduce EveryQuery, an EHR foundation model that achieves zero-shot inference through task-conditioned pre-training. Rather than generating future events, EveryQuery takes as input a patient's history and a structured query specifying a clinical task, and directly estimates the likelihood of the outcome occurring in the future window via a single forward pass. EveryQuery realizes this capability by pre-training over randomly sampled combinations of query tasks and patient contexts, directly training the model to produce correct answers to arbitrary input prompts. This enables zero-shot prediction for any task in the query space without finetuning, linear probing, or trajectory generation. On MIMIC-IV, EveryQuery outperforms an autoregressive baseline on 82% of 39 randomly sampled prediction tasks, with a mean AUC improvement of +0.16 (95% CI: [0.10,0.22]). This advantage remains consistent on tasks that were explicitly held out from the pre-training distribution. Further, EveryQuery's performance gains are most pronounced for rare clinical events, affirming and demonstrating a solution to the fundamental limitation of autoregressive inference for low-prevalence outcomes. However, at present, EveryQuery underperforms on tasks requiring disjunctive reasoning over multiple codes, such as 30-day readmission, exposing a concrete expressiveness limitation of the current query language.

2603.04727 2026-05-19 cs.CV cs.AI

Are Multimodal LLMs Ready for Surveillance? A Reality Check on Zero-Shot Anomaly Detection in the Wild

Shanle Yao, Armin Danesh Pazho, Narges Rashvand, Hamed Tabkhi

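Affiliations: Electrical and Computer Engineering Department

AI summary: This paper studies the zero-shot anomaly detection performance of multimodal large language models in real-world settings. It finds a conservative bias; class-specific instructions can raise the F1 score, but recall remains the key bottleneck.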

Abstract

Multimodal large language models (MLLMs) have demonstrated impressive general competence in video understanding, yet their reliability for real-world Video Anomaly Detection (VAD) remains largely unexplored. Unlike conventional pipelines relying on reconstruction or pose-based cues, MLLMs enable a paradigm shift: treating anomaly detection as a language-guided reasoning task. In this work, we systematically evaluate state-of-the-art MLLMs on the ShanghaiTech and CHAD benchmarks by reformulating VAD as a binary classification task under weak temporal supervision. We investigate how prompt specificity and temporal window lengths (1s--3s) influence performance, focusing on the precision--recall trade-off. Our findings reveal a pronounced conservative bias in zero-shot settings; while models exhibit high confidence, they disproportionately favor the 'normal' class, resulting in high precision but a recall collapse that limits practical utility. We demonstrate that class-specific instructions can significantly shift this decision boundary, improving the peak F1-score on ShanghaiTech from 0.09 to 0.64, yet recall remains a critical bottleneck. These results highlight a significant performance gap for MLLMs in noisy environments and provide a foundation for future work in recall-oriented prompting and model calibration for open-world surveillance, which demands complex video understanding and reasoning.
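
The "high precision but recall collapse" pattern is easy to see from the standard metric definitions. The sketch below uses made-up confusion counts chosen only to illustrate the effect; they are not taken from the paper.

```python
def precision_recall_f1(true_pos, false_pos, false_neg):
    """Standard precision, recall and F1 from confusion counts."""
    flagged = true_pos + false_pos
    positives = true_pos + false_neg
    precision = true_pos / flagged if flagged else 0.0
    recall = true_pos / positives if positives else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical conservative model: 100 anomalous clips exist, only 5 are
# flagged, and every flag is correct.
precision, recall, f1 = precision_recall_f1(true_pos=5, false_pos=0, false_neg=95)
print(precision, recall, f1)
```

Since F1 is the harmonic mean of precision and recall, perfect precision cannot compensate for a recall of 0.05, so the score stays below 0.1.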

2603.04161 2026-05-19 cs.CL

Traces of Social Competence in Large Language Models

Tom Kouwenhoven, Michiel van der Meer, Max van Duijn

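Affiliations: Leiden Institute of Advanced Computer Science; Leiden University

AI summary: This paper studies how large language models perform on the False Belief Test, using Bayesian logistic regression to analyze how model size and training method affect socio-cognitive competence. Scaling model size helps performance, though not strictly; explicating propositional attitudes alters response patterns, and further reasoning-oriented fine-tuning amplifies this effect.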

Comments Presented at the 2026 Conference on Computational Natural Language Learning (CoNLL)

Abstract

The False Belief Test (FBT) has been the main method for assessing Theory of Mind (ToM) and related socio-cognitive competencies. For Large Language Models (LLMs), the reliability and explanatory potential of this test have remained limited due to issues like data contamination, insufficient model details, and inconsistent controls. We address these issues by testing 17 open-weight models on a balanced set of 192 FBT variants (Trott et al., 2023) using Bayesian Logistic regression to identify how model size and post-training affect socio-cognitive competence. We find that scaling model size benefits performance, but not strictly. A cross-over effect reveals that explicating propositional attitudes (X thinks) fundamentally alters response patterns. Instruction tuning partially mitigates this effect, but further reasoning-oriented fine-tuning amplifies it. In a case study analysing social reasoning ability throughout OLMo 2 training, we show that this cross-over effect emerges during pre-training, suggesting that models acquire stereotypical response patterns tied to mental-state vocabulary that can outweigh other scenario semantics. Finally, vector steering allows us to isolate a think vector as the causal driver of observed FBT behaviour.

2602.22941 2026-05-19 cs.CV

Velocity and stroke rate reconstruction of canoe sprint team boats based on panned and zoomed video recordings

Julian Ziegler, Daniel Matthes, Finn Gerdts, Patrick Frenzel, Torsten Warnke, Matthias Englert, Tina Koevari, Mirco Fuchs

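Affiliations: Laboratory for Biosignal Processing, Leipzig University of Applied Sciences, Leipzig, Germany; Research Group Canoeing, Institute for Applied Training Science (IAT), Leipzig, Germany; German Canoe Federation, Duisburg, Germany

AI summary: This paper presents a method for reconstructing the velocity and stroke rate of canoe sprint team boats from panned and zoomed video recordings. It uses YOLOv8 to detect buoys and athletes, estimates homographies from the known buoy grid, estimates boat position via a U-Net-based boat tip calibration, tracks robustly with optical flow, and finally extracts stroke rate information. Experiments report a MAPE of 0.011 for velocity and 0.009 for stroke rate, giving coaches accurate, automated feedback.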

Abstract

Pacing strategies, defined by velocity and stroke rate profiles, are essential for peak performance in canoe sprint. While GPS is the gold standard for analysis, its limited availability necessitates automated video-based solutions. This paper presents an extended framework for reconstructing performance metrics from panned and zoomed video recordings across all sprint disciplines (K1-K4, C1-C2) and distances (200m-500m). Our method utilizes YOLOv8 for buoy and athlete detection, leveraging the known buoy grid to estimate homographies. We generalized the estimation of the boat position by means of learning a boat-specific athlete offset using a U-net based boat tip calibration. Further, we implement a robust tracking scheme using optical flow to adapt to multi-athlete boat types. Finally, we introduce methods to extract stroke rate information from either pose estimations or the athlete bounding boxes themselves. Evaluation against GPS data from elite competitions yields a velocity MAPE of 0.011 [0.008 0.014] (Spearman rho=0.974) and a stroke rate MAPE of 0.009 [0.006 0.013] (Spearman rho = 0.975). The methods provide coaches with highly accurate, automated feedback with minimal manual initialization work required, and without requiring sensors.
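
The error metric reported above, mean absolute percentage error (MAPE), can be computed as in the following sketch. The function returns a fraction rather than a percentage, so a value of 0.011 would correspond to 1.1% if the paper uses the same convention, which the abstract does not state. The velocity samples are invented for illustration.

```python
def mape(reference, estimate):
    """Mean absolute percentage error, returned as a fraction of the reference."""
    if len(reference) != len(estimate) or not reference:
        raise ValueError("need two non-empty sequences of equal length")
    if any(r == 0 for r in reference):
        raise ValueError("reference values must be non-zero")
    total = sum(abs(e - r) / abs(r) for r, e in zip(reference, estimate))
    return total / len(reference)


# Hypothetical boat velocities in m/s: GPS reference vs. video reconstruction.
gps_velocity = [5.0, 5.2, 4.8]
video_velocity = [5.05, 5.2, 4.752]
print(mape(gps_velocity, video_velocity))
```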

2602.18217 2026-05-19 cs.CL

Information-Theoretic Storage Cost in Sentence Comprehension

Kohei Kajikawa, Shinnosuke Isono, Ethan Gotlieb Wilcox

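Affiliations: Department of Linguistics, Georgetown University, USA; National Institute for Japanese Language and Linguistics, Japan

AI summary: This paper proposes an information-theoretic measure of storage cost that quantifies how much contextual information must be maintained during sentence comprehension. The cost is estimated with neural language models and validated in English: it recovers processing asymmetries in center embeddings and relative clauses and predicts reading-time variance.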

Comments Accepted to CoNLL 2026

Abstract

Real-time sentence comprehension imposes a significant load on working memory, as comprehenders must maintain contextual information to anticipate future input. While measures of such load have played an important role in psycholinguistic theories, they have largely been formalized using symbolic grammars, which assign discrete, uniform costs to syntactic predictions. This study proposes a measure of processing storage cost based on an information-theoretic formalization, as the amount of information previous words carry about future context, under uncertainty. Unlike previous discrete, grammar-based metrics, this measure is continuous, probabilistic, theory-neutral, and can be estimated from pre-trained neural language models. The validity of this approach is demonstrated through three analyses in English: our measure (i) recovers well-known processing asymmetries in center embeddings and relative clauses, (ii) correlates with a grammar-based storage cost in a syntactically-annotated corpus, and (iii) predicts reading-time variance in two large-scale naturalistic datasets over and above baseline models with traditional information-based predictors. Our code is available at https://github.com/kohei-kaji/info-storage.

2602.12703 2026-05-19 cs.LG

SWING: Unlocking Implicit Graph Representations for Graph Random Features

Alessandro Manenti, Avinava Dubey, Arijit Sehanobish, Cesare Alippi, Krzysztof Choromanski

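Affiliations: Google Research; Independent Researcher; Google DeepMind; Columbia University

AI summary: SWING computes Graph Random Features on implicitly represented graphs (i-graphs) by walking in continuous space rather than on graph nodes. Its core is a customized Gumbel-softmax sampling mechanism that combines random features with importance sampling, which allows accurate and efficient computation without an explicit graph structure.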

Abstract

We propose SWING: Space Walks for Implicit Network Graphs, a new class of algorithms for computations involving Graph Random Features on graphs given by implicit representations (i-graphs), where edge weights are defined as bivariate functions of the feature vectors of the corresponding nodes. These classes of graphs include several prominent examples, such as $\varepsilon$-neighborhood graphs, used on a regular basis in machine learning. Rather than conducting walks on the graphs' nodes, these methods rely on walks in the continuous spaces in which the graphs are embedded. To accurately and efficiently approximate the original combinatorial calculations, SWING applies a customized Gumbel-softmax sampling mechanism with linearized kernels, obtained via random features coupled with importance sampling techniques. This algorithm is of independent interest. SWING relies on the deep connection between implicitly defined graphs and Fourier analysis, presented in this paper. SWING is accelerator-friendly and does not require input graph materialization. We provide a detailed analysis of SWING and complement it with thorough experiments on different classes of i-graphs.
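
What an implicit graph means in practice can be shown with a short sketch. It illustrates only the definition (edge weights computed on demand from node features, with an epsilon-neighborhood rule as the example) and is not the SWING algorithm; the function names and the hard 0/1 weight are choices made for this illustration.

```python
import math


def epsilon_edge_weight(x, y, eps):
    """Edge weight as a bivariate function of two node feature vectors:
    1.0 if they lie within Euclidean distance eps, else 0.0."""
    return 1.0 if math.dist(x, y) <= eps else 0.0


def neighbors(index, features, eps):
    """Neighbors of one node, computed on demand from the features.
    No adjacency matrix is ever stored."""
    return [
        j
        for j, other in enumerate(features)
        if j != index and epsilon_edge_weight(features[index], other, eps) > 0.0
    ]


features = [(0.0, 0.0), (0.5, 0.0), (2.0, 0.0)]
print(neighbors(0, features, eps=1.0))  # [1]
print(neighbors(2, features, eps=1.0))  # []
```

Even without storing the graph, this naive lookup scans every node to take a single walk step, which is the kind of cost that walking directly in the embedding space is meant to avoid.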

2602.12015 2026-05-19 cs.CL

Disentangling Ambiguity from Instability in Large Language Models: A Clinical Text-to-SQL Case Study

Angelo Ziletti, Leonardo D'Ambrosi

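Affiliations: Bayer AG

AI summary: This paper proposes the CLUES framework, which decomposes Text-to-SQL into two stages (interpretations -> answers) to separate two distinct causes of output diversity, input ambiguity and model instability, and improves failure prediction on a clinical Text-to-SQL benchmark.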

Journal ref Proceedings of the 7th Clinical Natural Language Processing Workshop 2026

Abstract

Deploying large language models for clinical Text-to-SQL requires distinguishing two qualitatively different causes of output diversity: (i) input ambiguity that should trigger clarification, and (ii) model instability that should trigger human review. We propose CLUES, a framework that models Text-to-SQL as a two-stage process (interpretations --> answers) and decomposes semantic uncertainty into an ambiguity score and an instability score. The instability score is computed via the Schur complement of a bipartite semantic graph matrix. Across AmbigQA/SituatedQA (gold interpretations) and a clinical Text-to-SQL benchmark (known interpretations), CLUES improves failure prediction over state-of-the-art Kernel Language Entropy. In deployment settings, it remains competitive while providing a diagnostic decomposition unavailable from a single score. The resulting uncertainty regimes map to targeted interventions - query refinement for ambiguity, model improvement for instability. The high-ambiguity/high-instability regime contains 51% of errors while covering 25% of queries, enabling efficient triage.

2602.09805 2026-05-19 cs.CL cs.AI cs.LG

Beyond Accuracy: Decomposing the Reasoning Efficiency of LLMs

Daniel Kaiser, Arnoldo Frigessi, Ali Ramezani-Kebrya, Benjamin Ricaud

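Affiliations: Integreat - Norwegian Centre for knowledge-driven machine learning; UiT - The Arctic University of Norway; University of Oslo

AI summary: This paper proposes a trace-optional evaluation protocol that decomposes the token efficiency of large language models into three observables: completion rate, conditional correctness, and generated length. It normalizes generated length using task workload metadata and evaluates reasoning efficiency and verbosity overhead across tasks.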

Comments Preprint (under review). 29 pages, 4 figures

Abstract

As reasoning LLMs increasingly trade tokens for accuracy through deliberation, search, and self-correction, a single accuracy score can no longer tell whether those tokens buy useful reasoning, recovery from hard instances, or unnecessary verbosity. We introduce a trace-optional evaluation protocol that exactly decomposes token efficiency using three observables available even for closed models: completion rate, conditional correctness given completion, and generated length. When instance-level workload metadata is available, we further normalize generated length by declared task-implied work and separate mean verbalization overhead from workload-dependent scaling. When such metadata is absent, we define an auditable solver-derived workload scale and evaluate its stability under leave-self-out, leave-top-k, and held-out-reference-pool perturbations. We evaluate 14 shared open-weight models on CogniLoad, GSM8K, ProofWriter, and ZebraLogic. We further evaluate 11 additional models on CogniLoad, enabling a fine-grained analysis of reasoning-task difficulty factors: task length, intrinsic difficulty, and distractor density. Efficiency and overhead rankings remain stable across all benchmark pairs, more robustly than accuracy rankings, while the decomposition separates logic-limited, context-limited (truncation-driven), and verbosity-limited failure modes that look identical under accuracy-per-token. We release an evaluation artifact and reporting template, which elaborates on why an LLM is inefficient at reasoning.
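
One plausible reading of the three-observable decomposition is sketched below. The abstract does not give the formula, so this is an assumption: accuracy is taken as completion rate times conditional correctness (incomplete runs count as wrong), and token efficiency as accuracy divided by mean generated length. The record layout and function name are hypothetical.

```python
def decompose_efficiency(records):
    """records: list of (completed, correct, generated_tokens) tuples.
    Returns the three observables plus the derived accuracy per token."""
    if not records:
        raise ValueError("need at least one record")
    completed = [r for r in records if r[0]]
    completion_rate = len(completed) / len(records)
    conditional_correctness = (
        sum(1 for r in completed if r[1]) / len(completed) if completed else 0.0
    )
    mean_length = sum(r[2] for r in records) / len(records)
    # Assumed identity: accuracy = P(completed) * P(correct | completed).
    accuracy = completion_rate * conditional_correctness
    return {
        "completion_rate": completion_rate,
        "conditional_correctness": conditional_correctness,
        "mean_length": mean_length,
        "accuracy_per_token": accuracy / mean_length,
    }


# Hypothetical runs: the third one hit the context limit before answering.
runs = [(True, True, 100), (True, False, 200), (False, False, 400), (True, True, 100)]
print(decompose_efficiency(runs))
```

Two models with the same accuracy per token can differ in which factor is low, which is how a decomposition of this kind can tell a truncation problem from a verbosity problem.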

2602.08206 2026-05-19 cs.CV

Geospatial-Reasoning-Driven Vocabulary-Agnostic Remote Sensing Semantic Segmentation

Chufeng Zhou, Jian Wang, Xinyuan Liu, Xiaokang Zhang

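Affiliations: School of Electronic Information, Wuhan University of Science and Technology; State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications; Oriental Space Port Research Institute; School of Artificial Intelligence, Wuhan University

AI summary: This paper proposes GR-CoT, a geospatial-reasoning-driven, vocabulary-agnostic framework for remote sensing semantic segmentation. An offline knowledge distillation stream and an online instance reasoning stream address semantic ambiguity in open-vocabulary remote sensing segmentation, improving segmentation performance and semantic coherence in complex scenes.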

Comments 5 pages, 3 figures

Abstract

Open-vocabulary semantic segmentation has become an important direction in remote sensing, as it enables recognition beyond predefined land-cover categories. However, existing methods mainly depend on passive visual-text matching and often struggle with semantic ambiguity in geographically complex scenes, especially when different classes exhibit similar spectral or structural patterns. To address this issue, we propose a Geospatial Reasoning Chain-of-Thought (GR-CoT) framework for remote sensing open-vocabulary semantic segmentation. GR-CoT consists of an offline knowledge distillation stream and an online instance reasoning stream. The former constructs category interpretation standards for confusing classes, while the latter performs macro-scenario anchoring, visual feature decoupling, and knowledge-driven decision synthesis to generate an image-adaptive vocabulary for downstream segmentation. Experiments on the LoveDA and GID5 benchmarks indicate that the proposed framework improves overall segmentation performance and yields more semantically coherent predictions in complex scenes.

2602.07618 2026-05-19 cs.LG stat.ML

Neural Networks With Dense Weights Are Not Universal Approximators

Levi Rauchwerger, Stefanie Jegelka, Ron Levie

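Affiliations: Princeton University, Dept of CS; MIT, Dept of EECS and CSAIL; TUM, School of CIT, MCML, MDSI; Technion – IIT, Faculty of Mathematics

AI summary: This study examines the approximation capabilities of dense neural networks. It shows that, under natural constraints on the weights and on input and output dimensions, densely connected ReLU networks cannot approximate certain Lipschitz continuous functions, revealing an intrinsic limitation of networks with dense layers and motivating sparse connectivity as a necessary ingredient for true universality.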

Abstract

We investigate the approximation capabilities of dense neural networks. While universal approximation theorems establish that sufficiently large architectures can approximate arbitrary continuous functions if there are no restrictions on the weight values, we show that dense neural networks do not possess this universality. Our argument is based on a model compression approach, combining the weak regularity lemma with an interpretation of feedforward networks as message passing graph neural networks. We consider ReLU neural networks subject to natural constraints on weights and input and output dimensions, which model a notion of dense connectivity. Within this setting, we demonstrate the existence of Lipschitz continuous functions that cannot be approximated by such networks. This highlights intrinsic limitations of neural networks with dense layers and motivates the use of sparse connectivity as a necessary ingredient for achieving true universality.

2602.06866 2026-05-19 cs.LG

T-STAR: A Context-Aware Transformer Framework for Short-Term Probabilistic Demand Forecasting in Dock-Based Shared Micro-Mobility

Jingyi Cheng, Gonçalo Homem de Almeida Correia, Oded Cats, Shadi Sharif Azadeh

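Affiliations: Transport and Planning, Delft University of Technology

AI summary: This paper proposes T-STAR, a framework whose two-stage structure separates consistent demand patterns from short-term fluctuations to improve short-term probabilistic demand forecasting. Experiments show that it outperforms existing methods in both deterministic and probabilistic accuracy and is robust across stations and time periods.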

Comments This work has been submitted to Transportation Research Part C

Abstract

Reliable short-term demand forecasting is essential for managing shared micro-mobility services and ensuring responsive, user-centered operations. This study introduces T-STAR (Two-stage Spatial and Temporal Adaptive contextual Representation), a novel transformer-based probabilistic framework designed to forecast station-level bike-sharing demand at a 15-minute resolution. T-STAR addresses key challenges in high-resolution forecasting by disentangling consistent demand patterns from short-term fluctuations through a hierarchical two-stage structure. The first stage captures coarse-grained hourly demand patterns, while the second stage improves prediction accuracy by incorporating high-frequency, localized inputs, including recent fluctuations and real-time demand variations in connected metro services, to account for temporal shifts in short-term demand. Time series transformer models are employed in both stages to generate probabilistic predictions. Extensive experiments using Washington D.C.'s Capital Bikeshare data demonstrate that T-STAR outperforms existing methods in both deterministic and probabilistic accuracy. The model exhibits strong spatial and temporal robustness across stations and time periods. A zero-shot forecasting experiment further highlights T-STAR's ability to transfer to previously unseen service areas without retraining. These results underscore the framework's potential to deliver granular, reliable, and uncertainty-aware short-term demand forecasts, which enable seamless integration to support multimodal trip planning for travelers and enhance real-time operations in shared micro-mobility services.

2602.05156 2026-05-19 cs.RO cs.SY eess.SY

PLATO Hand: Shaping Contact Behavior with Fingernails for Precise Manipulation

Dong Ho Kang, Aaron Kim, Mingyo Seo, Kazuto Yokoyama, Tetsuya Narita, Luis Sentis

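Affiliations: The University of Texas at Austin; Sony Group Corporation

AI summary: This paper presents the PLATO Hand, a dexterous robotic hand whose hybrid fingertip combines a rigid fingernail, an embedded distal phalanx, and a compliant pulp to shape contact behavior. A strain-energy-based bending-indentation model guides the fingertip design and explains how material stiffness and contact geometry govern deformation partitioning within the fingertip. Experiments show improved pinch stability, fingernail-mediated dorsal-contact force transmission, and proprioceptive observability, along with successful edge-sensitive manipulation tasks such as paper singulation, card picking, and orange peeling. The results indicate that coupling a mechanically structured contact interface with a force-motion-transparent finger mechanism offers a principled approach to precise manipulation.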

Abstract

We present the PLATO Hand, a dexterous robotic hand with a hybrid fingertip that combines a rigid fingernail, an embedded distal phalanx, and a compliant pulp to shape contact behavior during manipulation. By mechanically organizing how contact is initiated, supported, and transmitted at the fingertip, this structure creates stable and task-relevant contact conditions across diverse object geometries and grasp orientations. We develop a strain-energy-based bending-indentation model to guide the fingertip design and to explain how material stiffness and contact geometry govern deformation partitioning within the fingertip. Experiments show improved pinch stability, improved fingernail-mediated dorsal-contact force transmission and proprioceptive observability, and successful execution of edge-sensitive manipulation tasks, including paper singulation, card picking, and orange peeling. These results show that coupling a mechanically structured contact interface with a force-motion-transparent finger mechanism provides a principled approach to precise manipulation. Our project page is at: https://platohand.github.io

2602.03797 2026-05-19 cs.LG

Manifold Random Features

Ananya Parashar, Derek Long, Dwaipayan Saha, Krzysztof Choromanski

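Affiliations: Department of Industrial Engineering and Operations Research; Columbia University; Google DeepMind

AI summary: This paper proposes a new method that discretizes the manifold and applies the recently introduced Graph Random Features (GRFs) technique to learn continuous fields on manifolds, which are then used to approximate bi-variate functions (in particular, kernels) defined on general manifolds. The method yields positive and bounded features, a property that is key to accurate, low-variance approximation.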

Abstract

We present a new paradigm for creating random features to approximate bi-variate functions (in particular, kernels) defined on general manifolds. This new mechanism of Manifold Random Features (MRFs) leverages discretization of the manifold and the recently introduced technique of Graph Random Features (GRFs) to learn continuous fields on manifolds. Those fields are used to find continuous approximation mechanisms that otherwise, in general scenarios, cannot be derived analytically. MRFs provide positive and bounded features, a key property for accurate, low-variance approximation. We show a deep asymptotic connection between GRFs, defined on discrete graph objects, and continuous random features used for regular kernels. As a by-product of our method, we rediscover a recently introduced mechanism of Gaussian kernel approximation, applied in particular to improve linear-attention Transformers, by considering simple random walks on graphs and bypassing the original complex mathematical computations. We complement our algorithm with a rigorous theoretical analysis and verify it in thorough experimental studies.
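For readers unfamiliar with positive random features, the sketch below shows a standard Euclidean construction for the Gaussian kernel in which every feature is an exponential and therefore strictly positive. It only illustrates the positivity property the abstract highlights. It is not the MRF or GRF algorithm, and the function names `positive_features` and `gaussian_kernel` are illustrative assumptions.

```python
import numpy as np

def positive_features(x, w):
    """Map x (d,) to m strictly positive features using projections w (m, d)."""
    m = w.shape[0]
    # exp(w_i . x - ||x||^2) is positive for every projection w_i.
    return np.exp(w @ x - x @ x) / np.sqrt(m)

def gaussian_kernel(x, y):
    """Exact Gaussian kernel exp(-||x - y||^2 / 2)."""
    diff = x - y
    return np.exp(-0.5 * (diff @ diff))

rng = np.random.default_rng(0)
w = rng.standard_normal((100_000, 3))  # w_i ~ N(0, I), shared by all inputs
x = np.array([0.3, -0.2, 0.1])
y = np.array([0.1, 0.4, -0.3])

# The inner product of the feature maps is a Monte Carlo estimate of the kernel.
estimate = positive_features(x, w) @ positive_features(y, w)
exact = gaussian_kernel(x, y)
```

The estimate is unbiased because, for Gaussian `w`, the expectation of `exp(w . (x + y))` equals `exp(||x + y||^2 / 2)`, which combines with the two `exp(-||.||^2)` factors to give `exp(-||x - y||^2 / 2)`.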

2602.03664 2026-05-19 cs.AI cs.LG

Mitigating Conversational Inertia in Multi-Turn Agents

Yang Wan, Zheng Cao, Zhenhao Zhang, Zhengwen Zeng, Shuheng Shen, Changhua Meng, Linchao Zhu

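Affiliations: College of Computer Science and Technology, Zhejiang University, Hangzhou, China; University of Rochester, Rochester, NY, USA

AI summary: This paper studies conversational inertia in multi-turn agents and proposes Context Preference Learning to calibrate model preferences, reducing inertia and improving performance.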

Comments ICML2026

Abstract

Large language models excel as few-shot learners when provided with appropriate demonstrations, yet this strength becomes problematic in multi-turn agent scenarios, where LLMs erroneously mimic their own previous responses as few-shot examples. Through attention analysis, we identify conversational inertia, a phenomenon where models exhibit strong diagonal attention to previous responses, which is associated with imitation bias that constrains exploration. This reveals a tension when transforming few-shot LLMs into agents: longer context enriches environmental feedback for exploitation, yet also amplifies conversational inertia that undermines exploration. Our key insight is that for identical states, actions generated with longer contexts exhibit stronger inertia than those with shorter contexts, enabling construction of preference pairs without environment rewards. Based on this, we propose Context Preference Learning to calibrate model preferences to favor low-inertia responses over high-inertia ones. We further provide context management strategies at inference time to balance exploration and exploitation. Experimental results across eight agentic environments and one deep research scenario validate that our framework reduces conversational inertia and achieves performance improvements.

2601.19667 2026-05-19 cs.CL cs.AI cs.IR cs.LG

SynCABEL: Synthetic Contextualized Augmentation for Biomedical Entity Linking

Adam Remaki, Christel Gérardin, Eulàlia Farré-Maduell, Martin Krallinger, Xavier Tannier

Affiliations: Sorbonne Université, Inserm, Université Sorbonne Paris Nord, Limics; Service de médecine interne, Hôpital Tenon, Assistance Publique - Hôpitaux de Paris; Barcelona Supercomputing Center, Barcelona, Spain

AI summary: SynCABEL uses large language models to generate context-rich synthetic training examples, addressing the scarcity of expert-annotated data in supervised biomedical entity linking, and sets new state-of-the-art results on three multilingual benchmarks.

Comments 7 pages, 5 figures

Abstract

We present SynCABEL (Synthetic Contextualized Augmentation for Biomedical Entity Linking), a framework that addresses a central bottleneck in supervised biomedical entity linking (BEL): the scarcity of expert-annotated training data. SynCABEL leverages large language models to generate context-rich synthetic training examples for all candidate concepts in a target knowledge base, providing broad supervision without manual annotation. We demonstrate that SynCABEL, when combined with decoder-only models and guided inference, establishes new state-of-the-art results across three widely used multilingual benchmarks: MedMentions for English, QUAERO for French, and SPACCC for Spanish. Evaluating data efficiency, we show that SynCABEL reaches the performance of full human supervision using up to 60% less annotated data, substantially reducing reliance on labor-intensive and costly expert labeling. Finally, acknowledging that standard evaluation based on exact code matching often underestimates clinically valid predictions due to ontology redundancy, we introduce an LLM-as-a-judge protocol. This analysis reveals that SynCABEL significantly improves the rate of clinically valid predictions. Our synthetic datasets, models, and code are released to support reproducibility and future research.

2601.16880 2026-05-19 cs.LG cs.IT math.IT

Theory of Minimal Weight Perturbations in Deep Networks and its Applications for Low-Rank Activated Backdoor Attacks

Bethan Evans, Jared Tanner

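Affiliations: Department of Mathematics, University of Oxford, Oxford, UK

AI summary: This paper derives the minimal-norm weight perturbations a deep network needs to achieve a specified change in output and discusses the factors that determine their size. It applies these results to precision-modification-activated backdoor attacks, establishing compression thresholds below which such attacks cannot succeed, and shows that low-rank compression can reliably activate latent backdoors while preserving full-precision accuracy.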

Abstract

The minimal-norm weight perturbations of DNNs required to achieve a specified change in output are derived, and the factors determining their size are discussed. These single-layer exact formulae are contrasted with more generic multi-layer Lipschitz-constant-based robustness guarantees; both are observed to be of the same order, which indicates similar efficacy in their guarantees. These results are applied to precision-modification-activated backdoor attacks, establishing provable compression thresholds below which such attacks cannot succeed, and we show empirically that low-rank compression can reliably activate latent backdoors while preserving full-precision accuracy. These expressions reveal how back-propagated margins govern layer-wise sensitivity and provide certifiable guarantees on the smallest parameter updates consistent with a desired output shift.
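As a worked illustration of what a minimal-norm weight perturbation looks like, consider the simplest case of a single linear layer $y = Wx$ with $x \neq 0$ and a desired output shift $\delta$. This is a textbook least-norm calculation under that simplifying assumption, not the paper's general formula; the symbols $\Delta W$ and $\delta$ are introduced here only for the example.

```latex
\min_{\Delta W}\ \|\Delta W\|_F
\quad \text{s.t.} \quad (W + \Delta W)\,x = y + \delta
\;\;\Longrightarrow\;\;
\Delta W^\star = \frac{\delta\, x^\top}{\|x\|_2^2},
\qquad
\|\Delta W^\star\|_F = \frac{\|\delta\|_2}{\|x\|_2}.
```

In this toy case the required perturbation is rank one, scales linearly with the size of the desired output shift, and shrinks as the input norm grows.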

2601.14330 2026-05-19 cs.CV cs.LG

LURE: Latent Space Unblocking for Multi-Concept Reawakening in Diffusion Models

Mengyu Sun, Ziyuan Yang, Andrew Beng Jin Teoh, Junxu Liu, Haibo Hu, Yi Zhang

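Affiliations: Sichuan University; The Hong Kong Polytechnic University; Nanyang Technological University; Yonsei University

AI summary: This paper proposes LURE, which achieves high-fidelity reawakening of multiple erased concepts by reconstructing the latent space and guiding the sampling trajectory, addressing the gradient conflicts and feature entanglement that arise in multi-concept scenarios.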

Abstract

Concept erasure aims to suppress sensitive content in diffusion models, but recent studies show that erased concepts can still be reawakened, revealing vulnerabilities in erasure methods. Existing reawakening methods mainly rely on prompt-level optimization to manipulate sampling trajectories, neglecting other generative factors, which limits a comprehensive understanding of the underlying dynamics. In this paper, we model the generation process as an implicit function to enable a comprehensive theoretical analysis of multiple factors, including text conditions, model parameters, and latent states. We theoretically show that perturbing each factor can reawaken erased concepts. Building on this insight, we propose a novel concept reawakening method: Latent space Unblocking for concept REawakening (LURE), which reawakens erased concepts by reconstructing the latent space and guiding the sampling trajectory. Specifically, our semantic re-binding mechanism reconstructs the latent space by aligning denoising predictions with target distributions to reestablish severed text-visual associations. However, in multi-concept scenarios, naive reconstruction can cause gradient conflicts and feature entanglement. To address this, we introduce Gradient Field Orthogonalization, which enforces feature orthogonality to prevent mutual interference. Additionally, our Latent Semantic Identification-Guided Sampling (LSIS) ensures stability of the reawakening process via posterior density verification. Extensive experiments demonstrate that LURE enables simultaneous, high-fidelity reawakening of multiple erased concepts across diverse erasure tasks and methods.

2601.13839 2026-05-19 cs.CV

DisasterVQA: A Visual Question Answering Benchmark Dataset for Disaster Scenes

Aisha Al-Mohannadi, Ayisha Firoz, Yin Yang, Muhammad Imran, Ferda Ofli

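Affiliations: Qatar Computing Research Institute; Hamad Bin Khalifa University; College of Science & Engineering; Qatar University

AI summary: This paper introduces DisasterVQA, a dataset for perception and reasoning in disaster scenes. Using 1,395 real-world images and 4,405 expert-curated question-answer pairs, it evaluates seven state-of-the-art vision-language models for disaster response and finds that they fall short on fine-grained quantitative reasoning, object counting, and context-sensitive interpretation.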

Comments Accepted at ICWSM 2026

Abstract

Social media imagery provides a low-latency source of situational information during natural and human-induced disasters, enabling rapid damage assessment and response. While Visual Question Answering (VQA) has shown strong performance in general-purpose domains, its suitability for the complex and safety-critical reasoning required in disaster response remains unclear. We introduce DisasterVQA, a benchmark dataset designed for perception and reasoning in crisis contexts. DisasterVQA consists of 1,395 real-world images and 4,405 expert-curated question-answer pairs spanning diverse events such as floods, wildfires, and earthquakes. Grounded in humanitarian frameworks including FEMA ESF and OCHA MIRA, the dataset includes binary, multiple-choice, and open-ended questions covering situational awareness and operational decision-making tasks. We benchmark seven state-of-the-art vision-language models and find performance variability across question types, disaster categories, regions, and humanitarian tasks. Although models achieve high accuracy on binary questions, they struggle with fine-grained quantitative reasoning, object counting, and context-sensitive interpretation, particularly for underrepresented disaster scenarios. DisasterVQA provides a challenging and practical benchmark to guide the development of more robust and operationally meaningful vision-language models for disaster response. The dataset is publicly available at https://doi.org/10.5281/zenodo.18267769.

2601.09495 2026-05-19 cs.LG

Parallelizable memory recurrent units

Florent De Geeter, Gaspard Lambrechts, Damien Ernst, Guillaume Drion

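Affiliations: Montefiore Institute, University of Liege

AI summary: This paper proposes memory recurrent units (MRUs), a new family of recurrent neural networks that combines the persistent memory of nonlinear recurrent networks with the parallelizable computation of state-space models. MRUs obtain persistent memory through multistability while removing transient dynamics for efficiency, and are shown to be effective on tasks with long-term dependencies.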

Comments 19 pages, 12 figures. This work has been the subject of patent applications (Numbers: EP26151077 and EP26175248.9)

Abstract

With the emergence of massively parallel processing units, parallelization has become a desirable property for new sequence models. The ability to parallelize the processing of sequences with respect to the sequence length during training is one of the main factors behind the rise of the Transformer architecture. However, Transformers lack efficiency at sequence generation, as they need to reprocess all past timesteps at every generation step. Recently, state-space models (SSMs) emerged as a more efficient alternative. These new kinds of recurrent neural networks (RNNs) keep the efficient update of RNNs while gaining parallelization by getting rid of nonlinear dynamics (or recurrence). SSMs can reach state-of-the-art performance through the efficient training of potentially very large networks, but still suffer from limited representation capabilities. In particular, SSMs cannot exhibit persistent memory, or the capacity of retaining information for an infinite duration, because of their monostability. In this paper, we introduce a new family of RNNs, the memory recurrent units (MRUs), that combine the persistent memory capabilities of nonlinear RNNs with the parallelizable computations of SSMs. These units leverage multistability as a source of persistent memory, while getting rid of transient dynamics for efficient computations. We then derive a specific implementation as proof-of-concept: the bistable memory recurrent unit (BMRU). This new RNN is compatible with the parallel scan algorithm. We show that BMRU achieves good results in tasks with long-term dependencies, and can be combined with state-space models to create hybrid networks that are parallelizable and have transient dynamics as well as persistent memory.
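The abstract's point that dropping nonlinear recurrence makes a model compatible with the parallel scan algorithm can be sketched for a generic scalar linear recurrence $h_t = a_t h_{t-1} + b_t$. This is a minimal illustration of the scan idea only, not the BMRU update, whose equations the abstract does not give; the function names are hypothetical.

```python
import numpy as np

def sequential_recurrence(a, b, h0=0.0):
    """Reference loop: h_t = a_t * h_{t-1} + b_t, one step at a time."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    out = np.empty_like(b)
    h = h0
    for t in range(len(b)):
        h = a[t] * h + b[t]
        out[t] = h
    return out

def scan_recurrence(a, b, h0=0.0):
    """Same recurrence via a Hillis-Steele inclusive scan.

    Each step t is the affine map h -> a_t * h + b_t. Composing affine maps
    is associative, so all prefixes can be built in O(log T) vectorized passes.
    """
    A = np.asarray(a, dtype=float).copy()
    B = np.asarray(b, dtype=float).copy()
    shift = 1
    while shift < len(A):
        # Prefix ending at t - shift (earlier map), copied before in-place updates.
        A_prev, B_prev = A[:-shift].copy(), B[:-shift].copy()
        # Later map applied after earlier map: h -> A_t * (A_prev * h + B_prev) + B_t.
        B[shift:] = A[shift:] * B_prev + B[shift:]
        A[shift:] = A[shift:] * A_prev
        shift *= 2
    # A[t], B[t] now describe the composed map from h0 to h_t.
    return A * h0 + B
```

On parallel hardware each pass of the `while` loop is a single elementwise operation over the whole sequence, which is why the number of sequential steps grows with the logarithm of the sequence length rather than linearly.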

2601.09413 2026-05-19 cs.SD cs.AI cs.CL cs.MA eess.AS

Speech-Hands: A Self-Reflection Voice Agentic Approach to Speech Recognition and Audio Reasoning with Omni Perception

Zhen Wan, Chao-Han Huck Yang, Jinchuan Tian, Hanrong Ye, Ankita Pasad, Szu-wei Fu, Arushi Goel, Ryo Hachiuma, Shizhe Diao, Kunal Dhawan, Sreyan Ghosh, Yusuke Hirota, Zhehuai Chen, Rafael Valle, Chenhui Chu, Shinji Watanabe, Yu-Chiang Frank Wang, Boris Ginsburg

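Affiliations: NVIDIA; Kyoto University; Carnegie Mellon University

AI summary: This paper proposes Speech-Hands, a framework that uses an explicit self-reflection decision to determine when the model should trust itself and when it should consult external audio perception in speech recognition and sound understanding, improving accuracy and robustness in audio reasoning.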

Comments Accepted to ACL 2026. Oral Presentation. Code: https://github.com/YukinoWan/Speech-Hands OpenClaw Branch: https://github.com/openclaw/openclaw/pull/69073

Abstract

We introduce a voice-agentic framework that learns one critical omni-understanding skill: knowing when to trust itself versus when to consult external audio perception. Our work is motivated by a crucial yet counterintuitive finding: naively fine-tuning an omni-model on both speech recognition and external sound understanding tasks often degrades performance, as the model can be easily misled by noisy hypotheses. To address this, our framework, Speech-Hands, recasts the problem as an explicit self-reflection decision. This learnable reflection primitive proves effective in preventing the model from being derailed by flawed external candidates. We show that this agentic action mechanism generalizes naturally from speech recognition to complex, multiple-choice audio reasoning. Across the OpenASR leaderboard, Speech-Hands consistently outperforms strong baselines by 12.1% WER on seven benchmarks. The model also achieves 77.37% accuracy and high F1 on audio QA decisions, showing robust generalization and reliability across diverse audio question answering datasets. By unifying perception and decision-making, our work offers a practical path toward more reliable and resilient audio intelligence.

2601.08679 2026-05-19 cs.AI

PersonaDual: Balancing Personalization and Objectivity via Adaptive Reasoning

Xiaoyou Liu, Xinyi Mou, Shengbin Yue, Liang Wang, Yuqing Wang, Qiexiang Wang, Tianrui Qin, Zhongyu Wei

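Affiliations: Fudan University; Shanghai Innovation Institute; OPPO

AI summary: This paper proposes PersonaDual, a framework that balances general-purpose objective reasoning and personalized reasoning within a single model by adaptively switching modes, reducing interference and improving objective problem-solving.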

Abstract

As users increasingly expect LLMs to align with their preferences, personalized information becomes valuable. However, personalized information can be a double-edged sword: it can improve interaction but may compromise objectivity and factual correctness, especially when it is misaligned with the question. To alleviate this problem, we propose PersonaDual, a framework that supports both general-purpose objective reasoning and personalized reasoning in a single model, and adaptively switches modes based on context. PersonaDual is first trained with SFT to learn two reasoning patterns, and then further optimized via reinforcement learning with our proposed DualGRPO to improve mode selection. Experiments on objective and personalized benchmarks show that PersonaDual preserves the benefits of personalization while reducing interference, achieving near interference-free performance and better leveraging helpful personalized signals to improve objective problem-solving.

2601.03851 2026-05-19 cs.CL

Rethinking Table Pruning in TableQA: From Sequential Revisions to Gold Trajectory-Supervised Parallel Search

Yu Guo, Shenghao Ye, Shuangwu Chen, Zijian Wen, Tao Zhang, Qirui Bai, Dong Jin, Yunpeng Hou, Huasen He, Jian Yang, Xiaobin Tan

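Affiliations: University of Science and Technology of China; Institute of Artificial Intelligence, Hefei Comprehensive National Science Center

AI summary: This paper proposes TabTrim, a framework that recasts table pruning from sequential revisions into gold trajectory-supervised parallel search, addressing the failure of existing methods to detect the loss of answer-critical data and achieving state-of-the-art performance across multiple tabular reasoning tasks.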

Comments 17 pages, 5 figures, accepted to ACL 2026 Oral

Abstract

Table Question Answering (TableQA) benefits significantly from table pruning, which extracts compact sub-tables by eliminating redundant cells to streamline downstream reasoning. However, existing pruning methods typically rely on sequential revisions driven by unreliable critique signals, often failing to detect the loss of answer-critical data. To address this limitation, we propose TabTrim, a novel table pruning framework which transforms table pruning from sequential revisions to gold trajectory-supervised parallel search. TabTrim derives a gold pruning trajectory using the intermediate sub-tables in the execution process of gold SQL queries, and trains a pruner and a verifier to make the step-wise pruning result align with the gold pruning trajectory. During inference, TabTrim performs parallel search to explore multiple candidate pruning trajectories and identify the optimal sub-table. Extensive experiments demonstrate that TabTrim achieves state-of-the-art performance across diverse tabular reasoning tasks: TabTrim-8B reaches 73.5% average accuracy, outperforming the strongest baseline by 3.2%, including 79.4% on WikiTQ and 61.2% on TableBench.

2601.01685 2026-05-19 cs.CL cs.AI cs.MA

Lying with Truths: Open-Channel Multi-Agent Collusion for Belief Manipulation via Generative Montage

Jinwei Hu, Xinmiao Huang, Youcheng Sun, Yi Dong, Xiaowei Huang

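Affiliations: University of Liverpool; Mohamed bin Zayed University of Artificial Intelligence

AI summary: This paper studies a new threat in which colluding agents manipulate a victim's beliefs by distributing truthful evidence fragments through public channels. It proposes the Generative Montage framework, reports attack success rates of 74.4% for proprietary models and 70.6% for open-weights models across 14 LLM families, and finds that stronger reasoning capabilities increase susceptibility.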

Comments Accepted to the ACL 2026 Main Conference (Oral Presentation)

Abstract

As large language models (LLMs) transition to autonomous agents synthesizing real-time information, their reasoning capabilities introduce an unexpected attack surface. This paper introduces a novel threat where colluding agents steer victim beliefs using only truthful evidence fragments distributed through public channels, without relying on covert communications, backdoors, or falsified documents. By exploiting LLMs' overthinking tendency, we formalize the first cognitive collusion attack and propose Generative Montage: a Writer-Editor-Director framework that constructs deceptive narratives through adversarial debate and coordinated posting of evidence fragments, causing victims to internalize and propagate fabricated conclusions. To study this risk, we develop CoPHEME, a dataset derived from real-world rumor events, and simulate attacks across diverse LLM families. Our results show pervasive vulnerability across 14 LLM families: attack success rates reach 74.4% for proprietary models and 70.6% for open-weights models. Counterintuitively, stronger reasoning capabilities increase susceptibility, with reasoning-specialized models showing higher attack success than base models or prompts. Furthermore, these false beliefs then cascade to downstream judges, achieving over 60% deception rates, highlighting a socio-technical vulnerability in how LLM-based agents interact with dynamic information environments. Our implementation and data are available at: https://github.com/CharlesJW222/Lying_with_Truth/tree/main.

2512.18953 2026-05-19 cs.CV

Symmetry Matters: Auditing and Symmetrizing 3D Generative Models

Nicolas Caytuiro, Ivan Sipiran

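Affiliations: University of Chile

AI summary: This paper studies symmetry preservation in unconditional point cloud generation. Auditing several 3D generative models with a normalized symmetry score based on the Chamfer Distance, it finds a persistent symmetry gap under a symmetry-aware evaluation protocol. After analyzing the training data, the authors propose a symmetry-aware intervention: training generative models on a half-objects dataset and reconstructing full objects by reflection during sampling, which improves geometric consistency and visual plausibility.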

Comments 12 pages, 8 figures, 4 tables

Abstract

Symmetry is a strong prior present in many object categories, yet standard benchmarks for 3D generative models rarely report whether this prior is preserved. We study symmetry preservation in unconditional point cloud generation. We first audit the symmetry of shapes generated by several 3D generative models and compute a normalized symmetry score based on the Chamfer Distance (CD). We show that although current 3D generative models achieve competitive results under standard evaluation, they reveal a persistent symmetry gap when a symmetry-aware evaluation protocol is applied. To test whether this gap is merely inherited from the training data, we evaluate these models over a mirrored-objects dataset derived from ShapeNet and analyze symmetry dynamics during training. Mechanistic interpretability techniques were employed at the sampling and latent levels to further show that reflection symmetry is not reliably encoded in the learned generative process. Finally, to address this gap, we propose a data-centric symmetry-aware intervention: training generative models on a half-objects dataset and reconstructing full objects by reflection during sampling. Across multiple backbones, this intervention substantially improves geometric consistency and visual plausibility while remaining competitive under standard metrics. These findings suggest that symmetry-aware evaluation is needed alongside standard benchmarks, and that future 3D generative models should incorporate this prior explicitly, either during training or sampling.
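A Chamfer-Distance-based reflection symmetry score of the kind the abstract mentions could be sketched as follows. The abstract does not define the score, so the specifics here are assumptions: the squared-distance, mean-reduced form of CD, the mirror plane through the centroid, and normalization by the squared bounding-box diagonal. The function names are hypothetical.

```python
import numpy as np

def chamfer_distance(p, q):
    """Symmetric Chamfer Distance between point sets p (N, 3) and q (M, 3)."""
    # Pairwise squared Euclidean distances, shape (N, M).
    d2 = np.sum((p[:, None, :] - q[None, :, :]) ** 2, axis=-1)
    # Mean nearest-neighbor distance in both directions.
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()

def reflection_symmetry_score(points, axis=0):
    """CD between a cloud and its mirror image; lower means more symmetric.

    The cloud is centered on its centroid and mirrored across the plane
    perpendicular to `axis`. Dividing by the squared bounding-box diagonal
    (an assumed normalization) makes the score scale-invariant.
    """
    points = np.asarray(points, dtype=float)
    centered = points - points.mean(axis=0)
    mirrored = centered.copy()
    mirrored[:, axis] *= -1.0
    diag2 = np.sum((centered.max(axis=0) - centered.min(axis=0)) ** 2)
    return chamfer_distance(centered, mirrored) / diag2
```

A cloud built by mirroring half an object, as in the proposed half-objects intervention, scores essentially zero under this measure, while an arbitrary asymmetric cloud scores strictly higher.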

2512.13506 2026-05-19 cs.LG stat.ML

Learning under Distributional Drift: Prequential Reproducibility as an Intrinsic Statistical Resource

Sofiya Zaichyk

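Affiliations: Innovative Defense Technologies (IDT)

AI summary: This paper studies learning under distributional drift and introduces an intrinsic drift budget $C_T$ that quantifies the cumulative information-geometric motion of the data distribution along the realized learner-environment trajectory, measured in Fisher-Rao distance. The budget separates exogenous environmental change from feedback induced by the learner's actions, giving a rate-based characterization of prequential reproducibility. The paper proves a drift-feedback bound and a matching lower bound, showing that the dependence on the average Fisher-Rao motion rate is tight. It also proves an information-theoretic indistinguishability result, and experiments indicate that appropriately chosen monitoring channels can retain risk-relevant drift signal.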

Comments Revised: Added additional experiment. Clarified lower bound

English abstract

Statistical learning under distributional drift remains poorly characterized, especially in closed-loop settings where learning alters the data-generating law. We introduce an intrinsic drift budget $C_T$ that quantifies cumulative information-geometric motion of the data distribution along the realized learner-environment trajectory, measured in Fisher-Rao distance. The budget separates exogenous environmental change from policy-sensitive feedback induced by the learner's actions. This gives a rate-based characterization of prequential reproducibility: when performance on the realized stream is used to predict one-step-ahead performance under the next distribution, the drift contribution enters through the average motion rate $C_T/T$, not through cumulative drift alone. We prove a drift-feedback bound of order $T^{-1/2}+C_T/T$, up to controlled second-order remainder terms, and establish a matching sharpness lower bound for the same prequential reproducibility gap on a canonical regular subclass. Thus the dependence on the average Fisher-Rao motion rate is tight up to constants: $C_T/T$ is sufficient for upper control and unavoidable on regular hard subclasses. We further prove an information-theoretic indistinguishability result showing that order-$C/T$ effects on the one-step-ahead target need not be identifiable from the realized performance stream alone. Finally, we show that fixed monitoring channels induce contracted observable Fisher motion, and experiments, including a misspecified real-data feedback setting, indicate that appropriately chosen channels can retain risk-relevant drift signal when the intrinsic data-generating law is unavailable. The resulting theory treats exogenous drift, adaptive data analysis, and performative feedback as different sources of Fisher-Rao motion along the same learner-environment trajectory.
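To make the stated rate concrete, the following worked example reads the bound $T^{-1/2}+C_T/T$ in three growth regimes of the drift budget. It is an illustration derived only from the order of the bound quoted in the abstract. The symbol $\Delta_T$ for the prequential reproducibility gap and the constant $c$ are shorthand introduced here, and constants and the second-order remainder terms are omitted.

```latex
\Delta_T \;\lesssim\; T^{-1/2} + \frac{C_T}{T}
\qquad\text{so that}\qquad
\begin{cases}
C_T = O(1): & \Delta_T \lesssim T^{-1/2} \quad\text{(drift term negligible)},\\[4pt]
C_T \asymp T^{1/2}: & \Delta_T \lesssim T^{-1/2} \quad\text{(both terms of the same order)},\\[4pt]
C_T \asymp cT: & \Delta_T \lesssim T^{-1/2} + c \quad\text{(bound does not vanish as } T \to \infty\text{)}.
\end{cases}
```

Read this way, the bound tends to zero exactly when the average motion rate $C_T/T$ does, which matches the abstract's point that the rate matters, not cumulative drift alone.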

2512.11446 2026-05-19 cs.CV

YawDD+: Frame-level Annotations for Accurate Yawn Prediction

Ahmed Mujtaba, Gleb Radchenko, Marc Masana, Radu Prodan

Affiliations Embedded Systems Division, Silicon Austria Labs; Institute of Visual Computing, Graz University of Technology; Department of Computer Science, University of Innsbruck

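AI summary This paper proposes a semi-automated labeling pipeline with human-in-the-loop verification that annotates YawDD videos with more accurate frame-level labels, improving model training on edge devices and enabling more efficient driver fatigue detection.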

Comments This paper is accepted in the 33rd IEEE International Conference on Image Processing (ICIP) 2026

English abstract

Driver fatigue remains a leading cause of road accidents, responsible for 24% of crashes. While yawning serves as an early behavioral indicator of fatigue, existing approaches face significant challenges due to the presence of systematic noise in video-annotated datasets arising from coarse temporal annotations. Training robust machine learning (ML) models requires rich supervisory labels that help learn salient features from the training data. Moreover, efficient on-device training and inference of models on edge devices is crucial in driver fatigue detection tasks to enable accurate real-time decisions on vehicles without reliance on cloud infrastructure. To address this issue, we develop a semi-automated labeling pipeline with human-in-the-loop verification to annotate YawDD videos to YawDD+ frame-level annotations, enabling more accurate model training on edge platforms such as NVIDIA Jetson NANO. Training the established MNasNet classifier and YOLOv11 detector architectures on YawDD+ improves frame accuracy by up to 6% and mAP by 5% over video-level supervision, achieving 99.34% classification accuracy and 95.69% detection mAP on Jetson NANO and AGX. Moreover, MNasNet completed the epoch time in just 8.69 min/epoch while delivering up to 115 frames-per-second (FPS) inference time on AGX, confirming that enhanced data quality alone supports on-device driver fatigue monitoring systems without server-side computation. The YawDD+ dataset and trained models are available online.