arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.11424 2026-05-13 cs.CV

VidSplat: Gaussian Splatting Reconstruction with Geometry-Guided Video Diffusion Priors

Jimin Tang, Wenyuan Zhang, Junsheng Zhou, Zian Huang, Kanle Shi, Shenkun Xu, Yu-Shen Liu, Zhizhong Han

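Affiliations School of Software, Tsinghua University; Department of Computer Science, Wayne State University

AI summary VidSplat is a generative reconstruction framework based on Gaussian Splatting that targets the missing and occluded regions that arise in multi-view surface reconstruction from sparse views. It uses video diffusion priors to iteratively generate novel views that fill in regions the inputs do not cover, enabling reconstruction of complete 3D scenes. Its core contributions are a training-free, stage-wise denoising strategy and an iterative optimization mechanism, which improve the geometric consistency and completeness of the reconstruction.

Comments Accepted by SIGGRAPH Conference 2026. Project Page: https://tangjm24.github.io/VidSplat

Abstract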

Gaussian Splatting has achieved remarkable progress in multi-view surface reconstruction, yet it exhibits notable degradation when only a few views are available. Although recent efforts alleviate this issue by enhancing multi-view consistency to produce plausible surfaces, they struggle to infer unseen, occluded, or weakly constrained regions beyond the input coverage. To address this limitation, we present VidSplat, a training-free generative reconstruction framework that leverages powerful video diffusion priors to iteratively synthesize novel views that compensate for missing input coverage, and thereby recover complete 3D scenes from sparse inputs. Specifically, we tackle two key challenges that enable the effective integration of generation and reconstruction. First, for 3D consistent generation, we devise a training-free, stage-wise denoising strategy that adaptively guides the denoising direction toward the underlying geometry using the rendered RGB and mask images. Second, to enhance the reconstruction, we develop an iterative mechanism that samples camera trajectories, explores unobserved regions, synthesizes novel views, and supplements training through confidence-weighted refinement. VidSplat is robust to sparse inputs and even to a single input image. Extensive experiments on widely used benchmarks demonstrate its superior performance in sparse-view scene reconstruction.

2605.11418 2026-05-13 cs.AI cs.CR

Under the Hood of SKILL.md: Semantic Supply-chain Attacks on AI Agent Skill Registry

Shoumik Saha, Kazem Faghih, Soheil Feizi

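Affiliations Department of Computer Science, University of Maryland, College Park

AI summary This paper studies natural-language semantic supply-chain attacks on AI agent skill registries and shows how SKILL.md files can be maliciously exploited during skill discovery, selection, and governance. Experiments show that attackers can use carefully crafted textual triggers to raise the visibility of malicious skills, steer agents toward functionally similar adversarial variants, and evade security review. The study concludes that SKILL.md is not just documentation but operational text that shapes agent behavior, exposing a significant security risk in current mechanisms for extending agent capabilities.

Comments 31 pages, 21 figures, 10 tables

Abstract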

Autonomous AI agents increasingly extend their capabilities through Agent Skills: modular filesystem packages whose SKILL.md files describe when and how agents should use them. While this design enables scalable, on-demand capability expansion, it also introduces a semantic supply-chain risk in which natural-language metadata and instructions can affect which skills are admitted, surfaced, selected, and loaded. We study SKILL.md-only attacks across three registry-facing stages of the Agent Skill lifecycle, using real ClawHub skills and realistic registry mechanisms. In Discovery, short textual triggers can manipulate embedding-based retrieval and improve adversarial skill visibility, achieving up to 86% pairwise win rate and 80% Top-10 placement. In Selection, description-only framing biases agents toward functionally equivalent adversarial variants, which are selected in 77.6% of paired trials on average. In Governance, semantic evasion strategies cause malicious skills to avoid a blocking verdict in 36.5%-100% of cases. Overall, our results show that SKILL.md is not passive documentation but operational text that shapes which third-party capabilities agents find, trust, and use.

2605.11414 2026-05-13 cs.LG cs.AI

Generative Diffusion Prior Distillation for Long-Context Knowledge Transfer

Nilushika Udayangani, Kishor Nandakishor, Marimuthu Palaniswami

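Affiliations Department of Electrical and Electronic Engineering

AI summary This paper studies how to transfer knowledge from a full-sequence classifier to a classifier that sees only partial sequences in time-series classification. To address the loss of generalization caused by the lack of discriminative features in partial data, the authors propose Generative Diffusion Prior Distillation (GDPD), a knowledge-distillation framework that treats short-context student features as degraded observations of full-context teacher features, uses the iterative restoration capability of diffusion models to learn a generative prior over teacher features, and guides the student features toward long-context knowledge, improving partial-sequence classification. Experiments show that GDPD transfers knowledge from full to partial sequences effectively across datasets and architectures.

Comments Published as a conference paper at ICLR 2026 (Brazil, Rio de Janeiro)

Journal ref The Fourteenth International Conference on Learning Representations 2026

Abstract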

While traditional time-series classifiers assume full sequences at inference, practical constraints (latency and cost) often limit inputs to partial prefixes. The absence of class-discriminative patterns in partial data can significantly hinder a classifier's ability to generalize. This work uses knowledge distillation (KD) to equip partial time-series classifiers with the generalization ability of their full-sequence counterparts. In KD, a high-capacity teacher transfers supervision to aid student learning on the target task. Matching with teacher features has shown promise in closing the generalization gap due to limited parameter capacity. However, when the generalization gap arises from training-data differences (full versus partial), the teacher's full-context features can be an overwhelming target signal for the student's short-context features. To provide progressive, diverse, and collective teacher supervision, we propose Generative Diffusion Prior Distillation (GDPD), a novel KD framework that treats short-context student features as degraded observations of the target full-context features. Inspired by the iterative restoration capability of diffusion models, we learn a diffusion-based generative prior over teacher features. Leveraging this prior, we posterior-sample target teacher representations that could best explain the missing long-range information in the student features and optimize the student features to be minimally degraded relative to these targets. GDPD provides each student feature with a distribution of task-relevant long-context knowledge, which benefits learning on the partial classification task. Extensive experiments across earliness settings, datasets, and architectures demonstrate GDPD's effectiveness for full-to-partial distillation.

2605.11408 2026-05-13 cs.LG cs.AI cs.CL

MaskTab: Scalable Masked Tabular Pretraining with Scaling Laws and Distillation for Industrial Classification

Bo Zheng, Yudong Chen, Zihua Xiong, Shuai Fang, Peidong He, Yang Yang, Sheng Guo

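Affiliations Zhejiang University; MyBank, Ant Group

AI summary MaskTab is a unified pre-training framework designed for industrial-scale tabular data, which is high-dimensional, full of missing values, and sparsely labeled. It introduces learnable missing-value tokens and a hybrid supervised pre-training strategy, combined with an MoE-augmented loss, to improve performance on large-scale industrial data. Experiments show that MaskTab clearly outperforms existing methods on several industrial benchmarks and distills efficiently into lightweight models that retain their advantage under strict latency and interpretability constraints.

Abstract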

Tabular data forms the backbone of high-stakes decision systems in finance, healthcare, and beyond. Yet industrial tabular datasets are inherently difficult: high-dimensional, riddled with missing entries, and rarely labeled at scale. While foundation models have revolutionized vision and language, tabular learning still leans on handcrafted features and lacks a general self-supervised framework. We present MaskTab, a unified pre-training framework designed specifically for industrial-scale tabular data. MaskTab encodes missing values via dedicated learnable tokens, enabling the model to distinguish structural absence from random dropout. It jointly optimizes a hybrid supervised pre-training scheme--utilizing a twin-path architecture to reconcile masked reconstruction with task-specific supervision--and an MoE-augmented loss that adaptively routes features through specialized subnetworks. On industrial-scale benchmarks, it achieves +5.04% AUC and +8.28% KS over prior art under rigorous scaling. Moreover, its representations distill effectively into lightweight models, yielding +2.55% AUC and +4.85% KS under strict latency and interpretability constraints, while improving robustness to distribution shifts. Our work demonstrates that tabular data admits a foundation-model treatment--when its structural idiosyncrasies are respected.

2605.11406 2026-05-13 cs.LG

A Boundary-Aware Non-parametric Granular-Ball Classifier Based on Minimum Description Length

Zeqiang Xian, Caihui Liu, Yong Zhang, Wenjing Qiu, Duoqian Miao, Witold Pedrycz

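Affiliations Department of Mathematics and Computer Science, Gannan Normal University; Key Laboratory of Data Science and Artificial Intelligence of Jiangxi Education Institutes, Gannan Normal University; Department of Computer Science and Technology, Tongji University; Department of Electrical and Computer Engineering, University of Alberta

AI summary This paper proposes MDL-GBC, a boundary-aware non-parametric granular-ball classifier based on the Minimum Description Length principle, to address the reliance of existing granular-ball classification methods on handcrafted quality measures and heuristic rules. The method casts class-conditional granular-ball construction as a local model selection problem: it compares the description lengths of a single-ball model, a two-ball model, and a core-boundary model to decide whether a ball is retained, split, or refined, so that boundary-sensitive regions are modeled explicitly and construction stays consistent with the classification mechanism. Experiments show that MDL-GBC achieves strong classification performance on multiple benchmark datasets while remaining interpretable and competitive.

Comments 13 pages, 2 figures

Abstract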

Existing granular-ball classification methods are often driven by handcrafted quality measures, neighborhood rules, or heuristic splitting and stopping criteria, which may reduce the transparency of local construction decisions and hinder explicit modeling of boundary-sensitive regions. To address this issue, this paper proposes a Minimum Description Length based Granular-Ball Classifier (MDL-GBC), a boundary-aware non-parametric and interpretable granular-ball classifier. MDL-GBC formulates class-conditional granular-ball construction as a local model selection problem under the Minimum Description Length principle. For each class, samples from the target class provide positive class evidence, while samples from the remaining classes provide negative boundary evidence. For each current granular ball, three candidate explanations are compared under a unified description-length criterion: a single-ball model, a two-ball model, and a core-boundary model. The selected model determines whether the ball is retained, geometrically split, or refined into core and boundary-sensitive child balls, thereby making local construction decisions consistent with the MDL-based classification mechanism. During prediction, a class-level mixture coding rule aggregates stable granular balls of the same class and assigns the test sample by comparing class-wise coding costs. Experiments on 18 benchmark datasets show that MDL-GBC achieves competitive classification performance against classical classifiers and representative granular-ball-based methods, obtaining the best average Accuracy, Macro-F1, and average rank. These results indicate that MDL-GBC provides an effective and interpretable alternative to conventional heuristic granular-ball classification strategies.

2605.11404 2026-05-13 cs.AI

Attributing Emergence in Million-Agent Systems

Ling Tang, Jilin Mei, Qian Chen, Qihan Ren, Linfeng Zhang, Quanshi Zhang, Jing Shao, Xia Hu, Dongrui Liu

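Affiliations Shanghai Artificial Intelligence Laboratory; Shanghai Jiao Tong University; Fudan University; Tongji University

AI summary This study examines how to attribute macro-level emergent phenomena to individual agents in million-agent systems. Because of their computational complexity, existing methods apply only to small systems, whereas real social phenomena often occur at the scale of millions of agents. The study extends Aumann-Shapley path-integral attribution to million-agent scale, yielding an efficient attribution that satisfies all four axioms. An empirical analysis reveals structural differences between attributions computed on a small sample and on the full data, and the authors prove that full-scale attribution is theoretically necessary for nonlinear macro indicators.

Abstract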

Large language models (LLMs) can simulate human-like reasoning and decision-making in individual agents. LLM-powered multi-agent systems (MAS) combine such agents to simulate population-scale social phenomena such as polarization, information cascades, and market panics. Such studies require attributing macro emergence to individual agents, but existing axiomatic methods scale combinatorially in $N$ and have been confined to $N \lesssim 10^3$, while the phenomena they explain occur at $N \geq 10^6$. We address this gap by adapting Aumann--Shapley path-integral attribution to LLM-powered MAS at million-agent scale; the resulting method satisfies all four axioms and runs four to five orders of magnitude faster than sampled Shapley on the same hardware. We use this method to test the scale gap empirically: across 14 days of public Bluesky data ($1{,}671{,}587$ active users), we compute the attribution at both full scale and the visibility-biased $N = 10^2$ convenience sample used by small-scale studies, and the two disagree structurally. At full scale the long tail and middle tier jointly carry the majority; the biased small panel attributes almost everything to a few high-follower accounts. We then prove that under any nonlinear macro indicator the disagreement cannot be reduced by post-hoc rescaling: an Attribution Scaling Bias theorem shows that no global rescaling factor can reconcile small-scale and full-scale attribution. Full-scale attribution is therefore not a methodological choice but a theoretical requirement for any nonlinear macro indicator.
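The abstract does not spell out the attribution formula. As a rough illustration only, the textbook Aumann-Shapley attribution of a differentiable indicator $f$ assigns agent $i$ the value $(x_i - b_i)\int_0^1 \partial_i f(b + t(x - b))\,dt$ along the straight path from a baseline $b$ to the observed state $x$. The sketch below approximates that integral with a midpoint rule. The toy quadratic indicator, the weights, and all function names are assumptions made for this example and are not taken from the paper.

```python
from typing import Callable, List, Optional, Sequence


def aumann_shapley(
    grad_f: Callable[[List[float]], List[float]],
    x: Sequence[float],
    baseline: Optional[Sequence[float]] = None,
    steps: int = 64,
) -> List[float]:
    """Midpoint-rule approximation of path-integral attribution.

    Cost is `steps` gradient evaluations, independent of the number of
    coalitions, unlike exact Shapley values.
    """
    n = len(x)
    b = [0.0] * n if baseline is None else list(baseline)
    phi = [0.0] * n
    for k in range(steps):
        t = (k + 0.5) / steps  # midpoint of the k-th sub-interval of [0, 1]
        point = [b[i] + t * (x[i] - b[i]) for i in range(n)]
        g = grad_f(point)
        for i in range(n):
            phi[i] += g[i] * (x[i] - b[i]) / steps
    return phi


# Hypothetical nonlinear macro indicator: squared weighted activity.
WEIGHTS = [0.5, 1.0, 2.0]


def macro(x: Sequence[float]) -> float:
    s = sum(w * v for w, v in zip(WEIGHTS, x))
    return s * s


def macro_grad(x: Sequence[float]) -> List[float]:
    s = sum(w * v for w, v in zip(WEIGHTS, x))
    return [2.0 * s * w for w in WEIGHTS]


x = [1.0, 2.0, 3.0]
phi = aumann_shapley(macro_grad, x)
# Completeness: the attributions sum to macro(x) - macro(baseline).
print(phi, sum(phi), macro(x))
```

In this toy case the per-agent attributions add up to the indicator value, which is the completeness (efficiency) property that path-integral attribution is known for.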

2605.11403 2026-05-13 cs.LG cs.AI cs.CL

FG-ExPO: Frontier-Guided Exploration-Prioritized Policy Optimization via Adaptive KL and Gaussian Curriculum

Mingxiong Lin, Zhangquan Gong, Maowen Tang, Qian Li, Chuangchuang Wang, Jian Ma, Sutian Huang, Kai Tang, Haonan Lu

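Affiliations OPPO AI Center

AI summary This study targets two efficiency problems in Group Relative Policy Optimization (GRPO), the dominant algorithm for Reinforcement Learning with Verifiable Rewards (RLVR), and proposes FG-ExPO. The method adds two lightweight components, Accuracy-Conditioned KL Scaling (AKL) and Gaussian Curriculum Sampling (GCS), which respectively adjust the strength of the constraint on policy exploration dynamically and reshape the question sampling distribution, improving training efficiency on mathematical reasoning tasks. Experiments show that FG-ExPO clearly outperforms vanilla GRPO on several mainstream benchmarks, with especially large gains on tasks such as AIME 2025.

Abstract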

Reinforcement Learning with Verifiable Rewards (RLVR) has become the standard paradigm for LLM mathematical reasoning, with Group Relative Policy Optimization (GRPO) serving as the dominant algorithm. We identify two overlooked inefficiencies inherent in GRPO. First, a fixed KL coefficient overly restricts policy exploration at moments when the model needs to diverge significantly from the reference policy. Second, uniform question sampling overlooks that moderately difficult problems produce the most informative gradient signals. We propose FG-ExPO, short for Frontier-Guided Exploration-Prioritized Policy Optimization, which integrates two lightweight components. Accuracy-Conditioned KL Scaling (AKL) adjusts the KL penalty strength through a smooth nonlinear function of batch average accuracy, loosening the constraint when the model performs poorly and strengthening it when the model achieves satisfactory results. Gaussian Curriculum Sampling (GCS) assigns sampling weights to questions following a Gaussian distribution centered at a moderate accuracy level around 0.5, focusing model training on its learning frontier. We conduct evaluations on DeepSeek-R1-Distill-Qwen-1.5B and Qwen3-8B-Base across six mainstream mathematical reasoning benchmarks. Experimental results demonstrate that FG-ExPO consistently outperforms vanilla GRPO. It delivers an absolute improvement of 13.34 on the AIME 2025 pass@32 metric, rising from 63.33 percent to 76.67 percent, and obtains an average pass@32 gain of 2.66 on the 8B model. The substantially larger performance gains observed on pass@32 compared to pass@1 verify that FG-ExPO enlarges the model's effective exploration space under a fixed inference budget.
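The two components described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the abstract says only that AKL is "a smooth nonlinear function of batch average accuracy" and that GCS weights follow a Gaussian centered near 0.5, so the sigmoid form, the coefficient range, the sharpness, and the Gaussian width below are all hypothetical choices, as are the function names.

```python
import math
from typing import List, Sequence


def akl_coefficient(
    batch_acc: float,
    beta_min: float = 0.001,   # assumed lower bound on the KL coefficient
    beta_max: float = 0.04,    # assumed upper bound on the KL coefficient
    center: float = 0.5,       # assumed accuracy at which the gate is half open
    sharpness: float = 10.0,   # assumed steepness of the transition
) -> float:
    """KL penalty that is loose at low accuracy and tight at high accuracy."""
    gate = 1.0 / (1.0 + math.exp(-sharpness * (batch_acc - center)))
    return beta_min + (beta_max - beta_min) * gate


def gcs_weights(
    accuracies: Sequence[float],
    mu: float = 0.5,      # center stated in the abstract ("around 0.5")
    sigma: float = 0.2,   # assumed width
) -> List[float]:
    """Normalized Gaussian sampling weights over per-question accuracy."""
    raw = [math.exp(-((a - mu) ** 2) / (2.0 * sigma ** 2)) for a in accuracies]
    total = sum(raw)
    return [r / total for r in raw]


# Questions the model always fails, sometimes solves, and always solves.
weights = gcs_weights([0.0, 0.5, 1.0])
print(weights)
print(akl_coefficient(0.1), akl_coefficient(0.9))
```

With these settings the moderately difficult question receives the largest sampling weight, and the KL coefficient grows monotonically with batch accuracy.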

2605.11402 2026-05-13 cs.LG cs.CR cs.NI

More Than Meets the Eye: A Semantics-Aware Traffic Augmentation Framework for Generalizable Website Fingerprinting

Youquan Xian, Xueying Zeng, Lingjia Meng, Lei Cui, Runhan Song, Wei Wang, Zhengquan Ding, Peng Liu, Zhiyu Hao

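Affiliations School of Cyberspace Security, Beijing University of Posts and Telecommunications, Beijing, China; School of Computer Science and Engineering, Beihang University, Beijing, China; Faculty of Computing, Harbin Institute of Technology, Harbin, China; School of Computer Science and Engineering, Guangxi Normal University, Guilin, China; Zhongguancun Laboratory, Beijing, China

AI summary This paper proposes SATA, a semantics-aware traffic augmentation framework that addresses the poor generalization of deep-learning-based website fingerprinting in real-world environments. The method performs application-layer semantic augmentation based on protocol rules, expanding the resource composition patterns and frame sequence patterns in traffic, and introduces a cross-layer feature alignment mechanism that aligns the augmented semantics with observable traffic features. Experiments show that SATA generates traffic patterns that are absent from the training set but genuinely present in the test set and significantly improves mainstream models across a range of complex scenarios; in open-world settings, accuracy and AUROC improve by 90.81% and 48.37%, respectively.

Comments 18 pages, 19 figures, Submitted to NDSS 2027

Abstract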

Deep learning-based website fingerprinting has emerged as an effective technique for inferring the websites users visit. Although existing methods achieve strong performance on closed-world datasets, they often fail to generalize to real-world environments, especially under geographic and temporal shifts. This limitation fundamentally stems from the coupled effects of two key challenges: application-layer resource composition variability and observable feature instability induced by cross-layer encapsulation. Intertwined, these factors induce systematic shifts between underlying application semantics and observable traffic features. To address the above challenges, we propose SATA, a semantics-aware traffic augmentation framework. Specifically, SATA first performs application-layer semantic augmentation based on protocol rules, expanding the resource composition patterns within each flow and frame sequence patterns under protocol constraints. Based on these augmented frame sequences, we further introduce a cross-layer feature alignment mechanism via knowledge distillation. It aligns frame sequence with packet-length sequence features, enabling cross-layer feature alignment between enhanced semantics and observable sequences. Extensive experiments show that SATA successfully generates traffic patterns that are absent from the training set but genuinely exist in the test set, and significantly improves the performance of mainstream models across diverse and complex scenarios. In particular, in open-world settings, SATA improves ACC by 90.81% and AUROC by 48.37%. The source code of the prototype system is available at https://anonymous.4open.science/r/SATA-B6C2/.

2605.11398 2026-05-13 cs.AI cs.CL

AcuityBench: Evaluating Clinical Acuity Identification and Uncertainty Alignment

Robin Linzmayer, Georgianna Lin, Di Coneybeare, Jason Chu, Trudi Cloyd, Manish Garg, Miles Gordon, Elizabeth Hartofilis, Benjamin Hong, Ashraf Hussain, Eugene Y. Kim, Oluchi Iheagwara King, Ross McCormack, Erica Olsen, John K. Riggins, Mustafa N. Rasheed, Dana L. Sacco, Vinay Saggar, Osman R. Sayan, Amit Shembekar, Janice Shin-Kim, Wendy W. Sun, Bernard P. Chang, David Kessler, Noémie Elhadad

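Affiliations Department of Computer Science, Columbia University, New York, NY, USA; Department of Biomedical Informatics, Columbia University, New York, NY, USA; Department of Emergency Medicine, Columbia University Irving Medical Center, New York, NY, USA

AI summary This paper introduces AcuityBench, a benchmark for evaluating whether language models correctly identify the urgency of care from users' medical descriptions. It integrates five public datasets covering user conversations, forum posts, clinical vignettes, and patient portal messages, all evaluated under a shared four-level acuity framework. The study finds that models differ substantially in their performance on clear and ambiguous cases and that the choice of task format affects the type of triage error, underscoring clinical acuity identification as a safety-critical capability.

Comments 41 pages, 5 figures. Preprint under review for the Track on Evaluations and Datasets at NeurIPS 2026

Abstract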

We introduce AcuityBench, a benchmark for evaluating whether language models identify the appropriate urgency of care from user medical presentations. Existing health benchmarks emphasize medical question answering, broad health interactions, or narrow workflow-specific triage tasks, but they do not offer a unified evaluation of acuity identification across these settings. AcuityBench addresses this gap by harmonizing five public datasets spanning user conversations, online forum posts, clinical vignettes, and patient portal messages under a shared four-level acuity framework ranging from home monitoring to immediate emergency care. The benchmark contains 914 cases, including 697 consensus cases for standard accuracy evaluation and 217 physician-confirmed ambiguous cases for uncertainty-aware evaluation. It supports two complementary task formats: explicit four-way classification in a QA setting, and free-form conversational responses evaluated with a rubric-based judge anchored to the same framework. Across 12 frontier proprietary and open-weight models, we find substantial variation in clear-case acuity accuracy and error direction. Comparing task formats reveals a systematic tradeoff: conversational responses reduce over-triage but increase under-triage relative to QA, especially in higher-acuity cases. In ambiguous cases, no model closely matches the distribution of physician judgments, and model predictions are more concentrated than expert clinical uncertainty. We also compare expert and model adjudication on a subset of maximally ambiguous cases, using those cases to examine the role of clinical uncertainty in label disagreement. Together, these results position acuity identification as a distinct safety-critical capability and show that AcuityBench enables systematic comparison and stress-testing of how well models guide users to the right level of care in real-world health use.

2605.11396 2026-05-13 cs.LG

MuonQ: Enhancing Low-Bit Muon Quantization via Directional Fidelity Optimization

Yupeng Su, Ruijie Zhang, Ziyue Liu, Yequan Zhao, Zheng Zhang

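Affiliations University of California, Santa Barbara

AI summary This paper proposes MuonQ, a low-bit training framework for the Muon optimizer built on directional fidelity optimization, to address Muon's sensitivity to errors under quantized training. Through pre-quantization normalization, structural decomposition, and μ-law companding quantization, MuonQ suppresses the accumulation and directional bias of quantization errors and achieves stable, efficient 4-bit quantized training. Experiments show that MuonQ keeps training loss and downstream task accuracy close to full-precision Muon while reducing optimizer state memory by up to 7.3×.

Comments MuonQ enables stable 4-bit quantization of Muon's optimizer states by preserving directional fidelity through pre-quantization normalization, structural decomposition, and companding quantization

Abstract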

The Muon optimizer has emerged as a compelling alternative to Adam for training large language models, achieving remarkable computational savings through gradient orthogonalization. However, Muon's optimizer state is more sensitive to quantization errors: because the orthogonalization discards the magnitudes of singular values and retains only directional information, even small quantization errors in singular vector directions are amplified in the update. In this work, we propose MuonQ, a low-bit Muon training framework built on the principle of directional fidelity optimization. First, we apply a pre-quantization normalization so that each step introduces quantization errors of the same magnitude, preventing the accumulated error from developing a preferred direction. Second, we introduce a structural decomposition that separately quantizes the dominant singular components via power iteration, ensuring that quantization errors perturb only singular value magnitudes rather than rotating singular vector directions. Third, we adopt $\mu$-law companding quantization to allocate higher resolution to densely packed momentum values, shifting the quantization objective from outlier preservation to dense-region distinguishability. Together, these techniques enable stable 4-bit quantization of Muon's optimizer states. Pre-training experiments on GPT-style and LLaMA-style models demonstrate that MuonQ at 4-bit precision closely matches full-precision Muon in both training loss and downstream task accuracy, while reducing optimizer state memory by up to $7.3\times$. Our code is available at https://github.com/YupengSu/MuonQ.
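To make the third technique concrete, here is a minimal sketch of generic $\mu$-law companding followed by uniform 4-bit quantization. It uses the standard companding curve $F(x) = \operatorname{sign}(x)\,\ln(1 + \mu|x|)/\ln(1 + \mu)$ on values scaled into $[-1, 1]$. The choice $\mu = 255$, the max-abs scaling, and all function names are assumptions for illustration; the abstract does not state the settings MuonQ uses, and the repository linked above is the authoritative source.

```python
import math
from typing import List, Sequence


def mu_compress(x: float, mu: float = 255.0) -> float:
    """Map x in [-1, 1] to [-1, 1], stretching the region near zero."""
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)


def mu_expand(y: float, mu: float = 255.0) -> float:
    """Inverse of mu_compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(mu)) / mu, y)


def quantize_uniform(y: float, bits: int = 4) -> float:
    """Round y in [-1, 1] to the nearest of 2**bits evenly spaced levels."""
    levels = (1 << bits) - 1
    code = round((y + 1.0) / 2.0 * levels)
    return code / levels * 2.0 - 1.0


def mu_law_roundtrip(values: Sequence[float], bits: int = 4, mu: float = 255.0) -> List[float]:
    """Scale by max-abs (an assumed normalization), compand, quantize, expand."""
    scale = max(abs(v) for v in values) or 1.0
    restored = []
    for v in values:
        y = quantize_uniform(mu_compress(v / scale, mu), bits)
        restored.append(mu_expand(y, mu) * scale)
    return restored


values = [0.01, -0.02, 0.5, 1.0]
restored = mu_law_roundtrip(values)
# Compare the error on a small value against plain uniform 4-bit quantization.
mu_err = abs(restored[0] - values[0])
uniform_err = abs(quantize_uniform(values[0]) - values[0])
print(restored, mu_err, uniform_err)
```

Because the companding curve is steep near zero, small values that sit close together land on different 4-bit codes, which is the dense-region resolution the abstract refers to; the price is coarser resolution for large-magnitude values.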

2605.11392 2026-05-13 cs.AI

Transformer Interpretability from Perspective of Attention and Gradient

Yongjin Cui, Xiaohui Fan, Huajun Chen

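Affiliations Zhejiang University

AI summary This paper studies the interpretability of Transformer models from the perspective of attention and gradients and proposes a method that yields more comprehensive and fine-grained interpretation of feature regions by guiding the gradient direction, that is, the attention direction. The method helps explain how Transformers work and reveals differences between Vision Transformer (ViT) and human image perception, demonstrating nearly imperceptible tampering with an image's class that may pose security risks in certain scenarios.

Abstract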

Although research attention centers on the performance of Transformer models, their interpretability should not be neglected. Gradients are widely used in Transformer interpretation. From the perspective of attention and gradient, we conduct an in-depth study of Transformer interpretation and propose a method to achieve it by guiding the gradient direction, or more precisely, the attention direction. The method enables more comprehensive interpretation of feature regions, offers detailed interpretation, and helps to better understand the Transformer mechanism. Leveraging the difference in how Vision Transformer (ViT) and humans perceive images, we alter the class of an image in a way that is almost imperceptible to the human eye. This class rewriting phenomenon may pose security risks in certain scenarios.

2605.11388 2026-05-13 cs.CL cs.AI

Deep Reasoning in General Purpose Agents via Structured Meta-Cognition

Dean Light, Michael Theologitis, Kshitish Ghate, Shuyue Stella Li, Benjamin Newman, Chirag Shah, Aylin Caliskan, Pang Wei Koh, Dan Suciu, Yulia Tsvetkov

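Affiliations University of Washington

AI summary This study proposes Deep Reasoning, a method for making general-purpose agents more flexible and adaptive on reasoning tasks. Using structured meta-reasoning, it dynamically constructs task-specific reasoning scaffolds at inference time so that complex problems are handled more effectively. Experiments show that DOLORES, a general-purpose agent built on this method, clearly outperforms existing methods on several hard benchmarks, demonstrating its strengths in structured reasoning and task adaptation.

Comments Preprint under review

Abstract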

Humans intuitively solve complex problems by flexibly shifting among reasoning modes: they plan, execute, revise intermediate goals, resolve ambiguity through associative judgment, and apply formal procedures to well-specified subproblems. Current LLM agents lack this flexibility, as their scaffolds hard-code such reasoning decisions in advance. These scaffolds are effective when their prescribed structure matches the task, but brittle when solving the task requires adapting the structure of reasoning itself. We introduce Deep Reasoning -- an inference-time approach for constructing task-specific scaffolds through structured meta-reasoning. Deep Reasoning uses a formal language that represents meta-reasoning as executable decompositions over associative inference, formal computation, and recursive subproblem solving, enabling decomposition principles to be encoded as in-context examples that guide test-time scaffold construction. We instantiate this approach in a general-purpose agent (DOLORES) that distributes complex tasks across more controlled reasoning threads. We evaluate it against state-of-the-art scaffolding methods across four hard benchmarks: multi-hop reasoning, long-chain question answering, long-context aggregation, and deep research-style information seeking. DOLORES outperforms all evaluated scaffolds across three model sizes and two model families, improving over the strongest evaluated scaffold baseline by 24.8% on average. DOLORES distributes cognition across structured, lower-load reasoning threads, thereby reducing premature termination and hallucinations. This advantage can even bridge the scaling gap, with an 8B version surpassing all evaluated 32B baselines from the same family in more than half the settings. These results point toward future agentic systems that treat scaffolding as adaptive reasoning, constructing the structure each task requires just-in-time.

2605.11387 2026-05-13 cs.LG cs.RO

Behavioral Mode Discovery for Fine-tuning Multimodal Generative Policies

Alberta Longhini, David Emukpere, Jean-Michel Renders, Seungsu Kim

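Affiliations Naver Labs Europe; Department of Computer Science, Stanford University

AI summary This paper studies how to fine-tune pre-trained generative policies with reinforcement learning while preserving the multimodality of their action distributions. Because existing methods collapse behavior into a single mode as they improve task performance, the authors propose an unsupervised behavioral mode discovery framework that uncovers latent behavioral modes in the policy and uses mutual information as an intrinsic reward, raising task success while maintaining behavioral diversity. Experiments on robotic manipulation tasks show that the method outperforms conventional fine-tuning, achieving higher success rates and preserving richer multimodal action distributions.

Journal ref International Conference on Machine Learning, 2026

Abstract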

We address the problem of fine-tuning pre-trained generative policies with reinforcement learning (RL) while preserving the multimodality of their action distributions. Existing methods for RL fine-tuning of generative policies (e.g., diffusion policies) improve task performance but often collapse diverse behaviors into a single reward-maximizing mode. To mitigate this issue, we propose an unsupervised mode discovery framework that uncovers latent behavioral modes within generative policies. The discovered modes enable the use of mutual information as an intrinsic reward, regularizing RL fine-tuning to enhance task success while maintaining behavioral diversity. Experiments on robotic manipulation tasks demonstrate that our method consistently outperforms conventional fine-tuning approaches, achieving higher success rates and preserving richer multimodal action distributions.

2605.11386 2026-05-13 cs.AI

Revisiting Privacy Preservation in Brain-Computer Interfaces: Conceptual Boundaries, Risk Pathways, and a Protection-Strength Grading Framework

Lei Sun, Xiuqing Mao, Shuai Zhang, Qingyu Zeng, Min Zhao, Jiyuan Li, Wenle Dong

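Affiliations PLA Information Engineering University

AI summary As brain-computer interface (BCI) technology moves from the laboratory into clinical and real-world use, privacy protection is becoming an increasingly prominent concern. This paper systematically reviews the pathways through which privacy can leak in BCI systems and proposes a three-dimensional classification framework covering protection object, lifecycle stage, and protection-strength level, which sorts existing work into four levels of protection strength. The study stresses that BCI privacy protection must not only hide data but also separate out task-irrelevant sensitive information while preserving system utility, and it identifies mental privacy and neuroethical risks as open problems.

Abstract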

Brain-computer interfaces (BCIs) are moving rapidly from laboratory research into clinical, edge, and real-world settings. Under ISO/IEC 8663:2025, a BCI is a direct communication link between central nervous system activity and external software or hardware systems. This link expands privacy risk beyond raw neural-signal leakage: neural data, derived representations, model assets, and decoded outputs can be re-associated with individuals across collection, transmission, storage, training, inference, and feedback, or used to infer information beyond what a task requires. Starting from the general BCI paradigm, this review defines privacy-protection boundaries, protection objects, and the relationship between user data privacy and model privacy within a shared risk pathway. It then proposes a three-dimensional framework (protection object, lifecycle stage, and dominant protection-strength level) to classify existing work into four levels of protection strength. Finally, mental privacy and neuroethical risks are treated as open issues, emphasizing that BCI privacy protection should not only obscure data but also disentangle task-irrelevant sensitive information while preserving downstream utility.

Keywords: Brain-computer interface, Neural data privacy, User data privacy, Model privacy, Disentanglement of task-irrelevant sensitive information, Protection-strength grading, Neuroethical risks

2605.11385 2026-05-13 cs.CV cs.RO

JACoP: Joint Alignment for Compliant Multi-Agent Prediction

Qingze Liu, Alen Mrdovic, Danrui Li, Mathew Schwartz, Sejong Yoon, Mubbasir Kapadia

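Affiliations Rutgers University, New Brunswick; The College of New Jersey

AI summary This paper proposes JACoP, a multi-stage framework for collective compliance in multi-agent trajectory prediction. Its core method combines anchor-based filtering of individual trajectories with joint trajectory alignment based on a Markov random field, reducing social collisions between trajectories and environmental violations. JACoP markedly improves scene-level plausibility while maintaining prediction accuracy, offering safer and more reliable predictions for practical applications.

Comments Accepted by CVPRF 2026

Abstract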

Stochastic Human Trajectory Prediction (HTP) using generative modeling has emerged as a significant area of research. Although state-of-the-art models excel in optimizing the accuracy of individual agents, they often struggle to generate predictions that are collectively compliant, leading to output trajectories marred by social collisions and environmental violations, thus rendering them impractical for real-world applications. To bridge this gap, we present JACoP: Joint Alignment for Compliant Multi-Agent Prediction, an innovative multi-stage framework that ensures scene-level plausibility. JACoP incorporates an Anchor-Based Agent-Centric Profiler for effective initial compliance filtering and employs a Markov Random Field (MRF) based aligner to formalize the joint selection for scene predictions. By representing inter-agent spatial and social costs as MRF energy potentials, we successfully infer and sample from the joint trajectory distribution, achieving prediction with optimal scene compliance. Comprehensive experiments show that JACoP not only achieves competitive accuracy, but also sets a new standard in reducing both environmental violations and social collisions, thereby confirming its ability to produce collectively feasible and practically applicable trajectory predictions.
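The idea of treating joint selection as energy minimization can be illustrated with a toy example. This sketch is not the paper's aligner: it assumes each agent comes with a handful of candidate trajectories and a per-candidate unary cost, uses a simple proximity count as the pairwise potential, and finds the lowest-energy joint assignment by exhaustive search, which is only feasible for tiny scenes. The abstract describes inference and sampling from the joint distribution instead. All names, costs, and thresholds here are hypothetical.

```python
import itertools
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
Trajectory = List[Point]


def pair_cost(a: Trajectory, b: Trajectory, radius: float = 0.5) -> float:
    """Number of timesteps at which two agents are closer than `radius`."""
    return sum(
        1.0
        for (ax, ay), (bx, by) in zip(a, b)
        if math.hypot(ax - bx, ay - by) < radius
    )


def best_joint(
    candidates: Sequence[Sequence[Trajectory]],
    unary: Sequence[Sequence[float]],
    collision_weight: float = 10.0,
) -> Tuple[Tuple[int, ...], float]:
    """Brute-force minimum of unary plus pairwise energy over all joint choices."""
    best_choice: Tuple[int, ...] = ()
    best_energy = math.inf
    index_ranges = [range(len(c)) for c in candidates]
    for choice in itertools.product(*index_ranges):
        energy = sum(unary[i][k] for i, k in enumerate(choice))
        for i, j in itertools.combinations(range(len(choice)), 2):
            energy += collision_weight * pair_cost(
                candidates[i][choice[i]], candidates[j][choice[j]]
            )
        if energy < best_energy:
            best_choice, best_energy = choice, energy
    return best_choice, best_energy


# Two agents walking toward each other; each has a direct and a detour option.
candidates = [
    [[(0, 0), (1, 0), (2, 0)], [(0, 1), (1, 1), (2, 1)]],    # agent 0
    [[(2, 0), (1, 0), (0, 0)], [(2, -1), (1, -1), (0, -1)]],  # agent 1
]
unary = [[0.0, 0.3], [0.0, 0.4]]  # lower means individually more likely
best, energy = best_joint(candidates, unary)
print(best, energy)
```

Each agent's individually preferred trajectory is the direct one, but choosing both produces a head-on collision, so the joint minimum sends agent 0 on its detour while agent 1 keeps its direct path.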

2605.11383 2026-05-13 cs.CV

HamBR: Active Decision Boundary Restoration Based on Hamiltonian Dynamics for Learning with Noisy Labels

Ningkang Peng, Jingyang Mao, Qianfeng Yu, Xiaoqian Peng, Peirong Ma, Yanhui Gu

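Affiliations Nanjing Normal University; Nanjing University of Chinese Medicine

AI summary In large-scale visual recognition and data mining tasks, noisy labels severely impair the generalization ability of deep neural networks. This paper is the first to propose HamBR, an active decision boundary restoration method based on Hamiltonian dynamics. It uses a Spherical Hamiltonian Monte Carlo mechanism to actively probe inter-class ambiguous regions in the feature space and synthesize high-quality virtual outliers, then uses energy-based modeling to build robust barriers at the decision boundaries, restoring their discriminative sharpness. Experiments show that HamBR achieves state-of-the-art performance on multiple benchmark datasets and significantly improves the model's out-of-distribution detection capability.

Details
English abstract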

In large-scale visual recognition and data mining tasks, the presence of noisy labels severely undermines the generalization capability of deep neural networks (DNNs). Prevalent sample selection methods rely primarily on training loss or prediction confidence for passive screening. However, within a feature space degraded by noise, decision boundaries undergo systematic boundary collapse. This phenomenon hinders the ability of the model to distinguish between hard clean samples and noisy samples at the decision margins, thereby creating a significant performance bottleneck. This study is the first to emphasize the pivotal importance of active boundary restoration for noise-robust learning. We propose HamBR, a novel paradigm based on Hamiltonian dynamics. The core approach leverages the Spherical Hamiltonian Monte Carlo (Spherical HMC) mechanism to actively probe inter-class ambiguous regions within the representation space and synthesize high-quality virtual outliers. By imposing explicit repulsion constraints via energy-based modeling, these synthesized samples establish robust energy barriers at the decision boundaries. This mechanism forces real samples to move from dispersed overlapping regions toward their respective class centers, thereby restoring the discriminative sharpness of the decision boundaries. HamBR demonstrates exceptional versatility and can be integrated as a plug-and-play defense module into existing semi-supervised noisy label learning frameworks. Empirical evaluations show that the proposed paradigm significantly enhances the discriminative accuracy of hard boundary samples, achieving state-of-the-art (SOTA) performance on CIFAR-10/100 and real-world noise benchmarks. Furthermore, it exhibits superior convergence efficiency and reliable robustness, while significantly improving the model's Out-of-Distribution (OOD) detection capability.

2605.11381 2026-05-13 cs.RO cs.DC

Kairos: A Scalable Serving System for Physical AI

Yinwei Dai, Ganesh Ananthanarayanan, Landon Cox, Xenofon Foukas, Bozidar Radunovic, Ravi Netravali

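Affiliations Princeton University; Microsoft

AI summary As physical AI grows more capable across general environments, its inference characteristics differ markedly from those of digital AI, and existing digital AI serving systems struggle to meet its needs. This paper presents Kairos, the first physical AI serving system designed for multiple robots, which treats the generate-execute loop as its core mechanism and significantly improves task execution efficiency. Experiments show that, across a range of physical AI models and robot platforms, Kairos reduces average end-to-end task latency by 31.8% to 66.5% compared with existing digital AI serving practices, with gains growing as the robot fleet scales.

Details
English abstract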

Physical AI is experiencing rapid growth with frontier foundation models increasing its capabilities across general environments. Physical AI tasks are characterized by inference properties that are markedly different from digital AI. They consist of multiple rounds of inference and action execution, generating a chunk of actions in each inference round, and asynchronously interleaving inference and execution. This makes existing digital AI serving systems unsuited for physical AI; a shortcoming that is critical for enabling their wide adoption, considering their size and the scale of the robot fleets they have to serve. To fill this gap, we design Kairos, the first multi-robot serving system that makes the generate-execute loop a first-class citizen, with active involvement in the execution phase. Across a wide range of physical AI models and robots, Kairos reduces the average end-to-end task latency by 31.8-66.5% over state-of-the-art digital AI serving practices, with gains scaling with the robot fleet size.
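
To make the generate-execute loop concrete, here is a toy latency model for a single robot that contrasts strictly sequential rounds with rounds where the next chunk's inference overlaps the current chunk's execution. It is a back-of-the-envelope sketch under assumed fixed per-round times, not a description of Kairos's scheduler; the function names and numbers are hypothetical.

```python
def sequential_latency(rounds, infer_s, exec_s):
    """Each round waits for inference to finish, then executes the whole action chunk."""
    return rounds * (infer_s + exec_s)

def overlapped_latency(rounds, infer_s, exec_s):
    """Inference for the next chunk starts when the current chunk begins executing.

    Only the first inference is fully exposed; each later round costs
    max(infer_s, exec_s) because the two phases run concurrently.
    """
    if rounds <= 0:
        return 0.0
    return infer_s + (rounds - 1) * max(infer_s, exec_s) + exec_s

# Illustrative numbers only: 10 rounds, 0.2 s inference, 0.5 s execution per chunk.
seq = sequential_latency(10, 0.2, 0.5)
ovl = overlapped_latency(10, 0.2, 0.5)
```

With these made-up numbers the overlapped schedule takes 5.2 s instead of 7.0 s, a reduction of about 26%; the figures are unrelated to the paper's reported results.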

2605.11380 2026-05-13 cs.LG cs.AI

TRACE: Temporal Routing with Autoregressive Cross-channel Experts for EEG Representation Learning

Fan Ma, Qier An, Peng Chen, Lingfei Qian, Xiang Lan, Mingyang Jiang, Zhiling Gu, Xenophon Papademetris, Hua Xu

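Affiliations Department of Biomedical Informatics and Data Science, Yale University

AI summary This paper proposes TRACE, an autoregressive EEG pre-training framework that tackles the difficulty of learning transferable representations from multi-channel, non-stationary EEG signals. TRACE predicts future EEG patches from causal context and performs temporally adaptive, cross-channel coherent computation at each time step, allowing flexible modeling of different temporal regimes and inter-channel relationships. The method supports heterogeneous pre-training across different channel configurations and recording domains. Experiments show strong results on several downstream tasks, and it remains competitive on motor imagery and clinical event classification.

Details
English abstract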

Learning transferable representations for electroencephalography (EEG) remains challenging because EEG signals are inherently multi-channel and non-stationary. Channels observed at the same time provide coupled measurements of neural activity, while the relevant temporal dynamics vary across contexts. This structure is poorly matched by architectures that apply uniform computation across time or route each channel patch independently. To this end, we propose TRACE, an autoregressive EEG pre-training framework that predicts future EEG patches from causal context while performing temporally adaptive and cross-channel coherent computation. At each temporal step, TRACE derives an expert routing decision from the causal cross-channel history and applies it jointly to all channels at that step. This preserves instantaneous cross-channel coherence while allowing different temporal regimes to activate different computation. Since routing is defined over the available channel set and causal temporal context, TRACE is compatible with heterogeneous pre-training across corpora with different channel counts, montages, sequence lengths, and recording domains. Across eight downstream EEG benchmarks, TRACE is evaluated in both settings: when downstream domains are seen only as unlabeled pre-training data and when downstream datasets are completely unseen during pre-training. It obtains the best results on several benchmarks while remaining competitive on motor imagery and clinical event classification tasks, with ablations supporting the importance of cross-channel temporal routing.
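
The routing rule described above (one expert decision per time step, derived from causal cross-channel history and shared by all channels) can be sketched as follows. The scalar summary, the linear gate, and the toy experts are assumptions chosen for brevity; TRACE's actual router and experts are not specified in the abstract.

```python
def route_step(history, gate_weights):
    """Pick one expert index for a time step from a causal cross-channel summary.

    history: past steps only, each a list of per-channel values.
    gate_weights: one hypothetical (weight, bias) pair per expert.
    """
    if not history:
        summary = 0.0
    else:
        # Mean over channels, then mean over past steps.
        summary = sum(sum(step) / len(step) for step in history) / len(history)
    scores = [w * summary + b for w, b in gate_weights]
    return max(range(len(scores)), key=scores.__getitem__)

def apply_shared_routing(signal, experts, gate_weights):
    """signal[t][c] is channel c at step t; every channel at step t uses the same expert."""
    outputs, routes = [], []
    for t, step in enumerate(signal):
        k = route_step(signal[:t], gate_weights)  # causal: only steps before t
        routes.append(k)
        outputs.append([experts[k](x) for x in step])
    return outputs, routes

signal = [[1.0, 1.0], [-3.0, -3.0], [0.0, 0.0]]
experts = [lambda x: 2.0 * x, lambda x: -x]
gate_weights = [(1.0, 0.0), (-1.0, 0.0)]
outputs, routes = apply_shared_routing(signal, experts, gate_weights)
```

Because the gate sees only the channels that are present, the same routine works for any channel count, which mirrors the compatibility claim in the abstract.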

2605.11376 2026-05-13 cs.AI

LLM-X: A Scalable Negotiation-Oriented Exchange for Communication Among Personal LLM Agents

Giuliano Lorenzoni, Paulo Alencar, Donald Cowan

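Affiliations University of Waterloo

AI summary This paper proposes LLM-X, a scalable negotiation-oriented exchange framework that supports direct, structured communication among personal language model agents. The framework introduces a message bus and routing mechanism that guarantee schema validity and policy enforcement, and it contributes an architecture of federated gateways, topic-based routing, and policy enforcement, together with a typed message protocol supporting capability negotiation and contract-net-style coordination. Experiments show that LLM-X remains stable across different scales and load conditions and reveal trade-offs among robustness, fairness, and communication efficiency that depend on the chosen policy.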

Comments 8 pages, 7 figures, accepted at AGENT 2026 Workshop, co-located with ICSE 2026

Details
English abstract

We propose a personal-LLM exchange (LLM-X), a scalable negotiation-oriented environment that enables direct, structured communication across populations of personal agents (LLMs), each representing an individual user. Unlike existing tool-centric protocols that focus on agent-API interaction, LLM-X introduces a message bus and routing substrate for LLM-to-LLM coordination with guarantees around schema validity and policy enforcement. We contribute: (1) an architecture for LLM-X comprising federated gateways, topic-based routing, and policy enforcement; (2) a typed message protocol supporting capability negotiation and contract-net-style coordination; and (3) the first empirical evaluation of LLM-based multi-agent negotiation at scale. Experiments span 5, 9, and 12 agents, under distinct negotiation policies (Low, Medium, High), and across both short-run (minutes) and long-run (2h, 12h) load conditions. Results highlight clear policy-performance trade-offs: stricter policies improve robustness and fairness but increase latencies and message volume. Extended runs confirm that LLM-X remains stable under sustained load, with bounded latency drift.
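
A minimal sketch of the three ingredients named above (typed messages, topic-based routing, and policy enforcement) might look like the following. The message kinds, the per-sender quota policy, and the class names are hypothetical; LLM-X's real schema and policies are not given in the abstract.

```python
from collections import defaultdict
from dataclasses import dataclass, field

# Assumed contract-net-style message kinds; not taken from the paper.
ALLOWED_KINDS = {"call_for_proposals", "proposal", "accept", "reject"}

@dataclass(frozen=True)
class Message:
    sender: str
    topic: str
    kind: str
    body: dict = field(default_factory=dict)

class Bus:
    """Toy message bus: validates the message kind, enforces a per-sender quota, routes by topic."""

    def __init__(self, max_messages_per_sender):
        self.max_messages_per_sender = max_messages_per_sender
        self.subscribers = defaultdict(list)
        self.sent = defaultdict(int)

    def subscribe(self, topic, handler):
        self.subscribers[topic].append(handler)

    def publish(self, msg):
        if msg.kind not in ALLOWED_KINDS:  # schema validity
            return False
        if self.sent[msg.sender] >= self.max_messages_per_sender:  # policy enforcement
            return False
        self.sent[msg.sender] += 1
        for handler in self.subscribers[msg.topic]:  # topic-based routing
            handler(msg)
        return True

bus = Bus(max_messages_per_sender=2)
inbox = []
bus.subscribe("scheduling", inbox.append)
results = [
    bus.publish(Message("alice", "scheduling", "call_for_proposals")),
    bus.publish(Message("alice", "scheduling", "gossip")),
    bus.publish(Message("alice", "scheduling", "accept")),
    bus.publish(Message("alice", "scheduling", "reject")),
]
```

A stricter quota drops more messages at the bus, which is one simple way a policy setting could trade message volume against what agents are allowed to say.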

2605.11373 2026-05-13 cs.AI cs.LG stat.ML

Causal Algorithmic Recourse: Foundations and Methods

Drago Plecko, Collin Wang, Elias Bareinboim

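Affiliations Department of Statistics & Data Science, UCLA; Department of Computer Science, Columbia University

AI summary This paper studies how to give individuals reliable recommendations for reversing decisions made by AI decision-making systems, a problem known as algorithmic recourse. The authors propose a causal framework that models recourse as a process over pre- and post-intervention outcomes, accounting for resampling and partial stability of latent variables. The paper introduces post-recourse stability conditions, develops a copula-based algorithm for inferring recourse effects from observational data, and proposes a distribution-free learning method for cases where the data do not fit the copula model, making algorithmic recourse more robust and practical.

Details
English abstract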

The trustworthiness of AI decision-making systems is increasingly important. A key feature of such systems is the ability to provide recommendations for how an individual may reverse a negative decision, a problem known as algorithmic recourse. Existing approaches treat recourse outcomes as counterfactuals of a fixed unit, ignoring that real-world recourse involves repeated decisions on the same individual under possibly different latent conditions. We develop a causal framework that models recourse as a process over pre- and post-intervention outcomes, allowing for partial stability and resampling of latent variables. We introduce post-recourse stability conditions that enable reasoning about recourse from observational data alone, and develop a copula-based algorithm for inferring the effects of recourse under these conditions. For settings where paired observations of the same individual before and after intervention are available (called recourse data), we develop methods for inferring copula parameters and performing goodness-of-fit testing. When the copula model is rejected, we provide a distribution-free algorithm for learning recourse effects directly from recourse data. We demonstrate the value of the proposed methods on real and semi-synthetic datasets.

2605.11369 2026-05-13 cs.CV

Dynamic Full-body Motion Agent with Object Interaction via Blending Pre-trained Modular Controllers

Sanghyeok Nam, Byoungjun Kim, Daehyung Park, Tae-Kyun Kim

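Affiliations KAIST

AI summary This work addresses the challenge of generating dynamic human-object interaction motions. It proposes a framework that combines pretrained motion priors with imitation agents to generate long-term dynamic interactions such as running while holding an object. In the planning stage, it augments datasets with a pretrained human motion diffusion model and generates object trajectories, thereby planning dynamic interaction sequences. In the execution stage, a composer network blends pretrained imitation agents specialized for either dynamic human motions or static interactions, composing their complementary skills in space and time. The method significantly improves task success rates while maintaining interaction quality and substantially reduces training time.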

Comments CVPR Findings 2026

Details
English abstract

Generating physically plausible dynamic motions of human-object interaction (HOI) remains challenging, mainly because existing HOI datasets are limited to static interactions and pretrained agents are capable of either dynamic full-body motions without objects or static HOI motions, but not both. Recent works such as InsActor and CLoSD generate HOI motions in planning and execution stages, yet they are limited to either static or short-term contacts, e.g., striking. In this work, we propose a framework that fulfills dynamic and long-term interaction motions such as running while holding a table, by combining pretrained motion priors and imitation agents in planning and execution stages. In the planning stage, we augment HOI datasets with dynamic priors from a pretrained human motion diffusion model, followed by object trajectory generation. This plans dynamic HOI sequences. In the execution stage, a composer network blends actions of pretrained imitation agents specialized either for dynamic human motions or static HOI motions, enabling spatio-temporal composition of their complementary skills. Compared with relevant prior art, our method consistently improves success rates while maintaining interaction for dynamic HOI tasks. Furthermore, blending pretrained experts with our composer achieves competitive performance in significantly reduced training time. Ablation studies validate the effectiveness of our augmentation and composer blending.
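
One simple way to picture action blending in the execution stage is a per-dimension convex combination of the two experts' actions, with weights that a composer would output at each timestep. The abstract does not say how the composer actually blends actions, so the formula, function name, and example values below are assumptions.

```python
def blend_actions(dynamic_action, hoi_action, weights):
    """Per-dimension convex blend: w * dynamic + (1 - w) * hoi, with w clipped to [0, 1].

    dynamic_action: action from the expert trained on dynamic full-body motion.
    hoi_action: action from the expert trained on static human-object interaction.
    weights: hypothetical composer outputs, one per action dimension.
    """
    if not (len(dynamic_action) == len(hoi_action) == len(weights)):
        raise ValueError("all inputs must have the same length")
    blended = []
    for a_dyn, a_hoi, w in zip(dynamic_action, hoi_action, weights):
        w = min(1.0, max(0.0, w))
        blended.append(w * a_dyn + (1.0 - w) * a_hoi)
    return blended

# Example: the first dimension follows the dynamic expert, the second mostly the HOI expert.
action = blend_actions([1.0, 0.0], [0.0, 1.0], [1.0, 0.25])
```

Letting the weights vary per dimension and per timestep is what would allow, for instance, leg joints to follow the running expert while arm joints follow the holding expert.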

2605.11368 2026-05-13 cs.LG cs.AI q-bio.GN

LPDP: Inference-Time Reward Control for Variable-Length DNA Generation with Edit Flows

Jeongchan Kim, Yunkyung Ko, Jong Chul Ye

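Affiliations KAIST AI

AI summary This paper studies how to use Edit Flows for inference-time reward control in DNA sequence generation. It proposes LPDP, a training-free local re-solving operator that is aware of intermediate states and actions and efficiently applies edit operations when generating variable-length DNA sequences. At each guided step, LPDP scores one-step root edits, retains a near-best band of root edits, and solves a bounded local discrete program around each retained edit, improving the quality and biological plausibility of the generated sequences. It is applied to enhancer optimization and to exon-intron-exon inpainting at splice boundaries.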

Comments 22 pages, 5 figures

Details
English abstract

We study the application of the recently proposed Edit Flows to inference-time reward control in DNA sequence generation. Unlike most reward-guided DNA generation frameworks, which operate on fixed-length sequence spaces, Edit Flows have the potential to generate variable-length DNA through biologically plausible insertion, deletion, and substitution operations. In particular, we propose Local Perturbation Discrete Programming (LPDP), a training-free, intermediate-state and action-aware local re-solving operator for variable-length DNA edit-action generators at inference time. More specifically, at each guided rollout step, LPDP scores one-step root edits, retains a near-best root band, and re-ranks each retained root by solving a bounded local discrete program around its child sequence. This local program uses the typed geometry of edit actions to focus on coherent substitution, insertion, or deletion subgraphs, and aggregates local continuations with either a hard Max backup or a soft log-sum-exponential (LSE) backup. We instantiate LPDP in two regimes: front-loaded reward tilting for enhancer optimization, where early edits are critical for establishing global regulatory sequence structure, and back-loaded reward tilting for exon-intron-exon inpainting, where late edits fine-tune splice-boundary contexts.
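
The two aggregation rules mentioned above differ only in how they summarize the values of local continuations. The sketch below shows a hard Max backup next to a numerically stable log-sum-exponential backup; the temperature parameter and the function names are assumptions, since the abstract does not give LPDP's exact formulation.

```python
import math

def max_backup(values):
    """Hard backup: the value of the single best continuation."""
    return max(values)

def lse_backup(values, temperature=1.0):
    """Soft backup: temperature * log(sum(exp(v / temperature))).

    Subtracting the maximum before exponentiating avoids overflow.
    """
    m = max(values)
    return m + temperature * math.log(
        sum(math.exp((v - m) / temperature) for v in values)
    )

continuation_values = [1.0, 2.0, 3.0]
hard = max_backup(continuation_values)
soft = lse_backup(continuation_values)
nearly_hard = lse_backup(continuation_values, temperature=0.01)
```

The soft backup is never smaller than the hard one and approaches it as the temperature goes to zero, so it rewards a root edit for having several good continuations rather than a single one.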

2605.11363 2026-05-13 cs.CV cs.CL

PresentAgent-2: Towards Generalist Multimodal Presentation Agents

Wei Wu, Ziyang Xu, Zeyu Zhang, Yang Zhao, Hao Tang

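Affiliations Peking University; La Trobe University

AI summary This paper proposes PresentAgent-2, an agentic framework that generates complete presentation videos with multimodal content from user queries. The framework supports three independent presentation modes (single-speaker narration, multi-speaker discussion, and interactive question answering) and uses deep research and multimodal resource integration to carry out content generation, script writing, and dynamic media composition. The work extends presentation generation from document-dependent slide creation to query-driven, research-grounded, interactive video generation.

Details
English abstract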

Presentation generation is moving beyond static slide creation toward end-to-end presentation video generation with research grounding, multimodal media, and interactive delivery. We introduce PresentAgent-2, an agentic framework for generating presentation videos from user queries. Given an open-ended user query and a selected presentation mode, PresentAgent-2 first summarizes the query into a focused topic and performs deep research over presentation-friendly sources to collect multimodal resources, including relevant text, images, GIFs, and videos. It then constructs presentation slides, generates mode-specific scripts, and composes slides, audio, and dynamic media into a complete presentation video. PresentAgent-2 supports three independent presentation modes within a unified framework: Single Presentation, which generates a single-speaker narrated presentation video; Discussion, which creates a multi-speaker presentation with structured speaker roles, such as for asking guiding questions, explaining concepts, clarifying details, and summarizing key points; and Interaction, which independently supports answering audience questions grounded in the generated slides, scripts, retrieved evidence, and presentation context. To evaluate these capabilities, we build a multimodal presentation benchmark covering single presentation, discussion, and interaction scenarios, with task-specific evaluation criteria for content quality, media relevance, dynamic media use, dialogue naturalness, and interaction grounding. Overall, PresentAgent-2 extends presentation generation from document-dependent slide creation to query-driven, research-grounded presentation video generation with multimodal media, dialogue, and interaction. Code: https://github.com/AIGeeksGroup/PresentAgent-2. Website: https://aigeeksgroup.github.io/PresentAgent-2.

2605.11362 2026-05-13 cs.LG cs.AI stat.AP stat.ML

Causal Fairness for Survival Analysis

Drago Plecko

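Affiliations Department of Statistics & Data Science

AI summary In the data-driven era, machine learning and artificial intelligence are widely used in high-stakes domains such as healthcare and employment, raising concerns about the fairness of these systems. Existing fair machine learning research focuses mostly on static settings, while fairness in temporal settings such as survival analysis remains underexplored. This paper proposes a causal framework for fairness in survival analysis that decomposes survival disparities into contributions from direct, indirect, and spurious pathways, giving an interpretable account of why disparities arise and how they evolve, and applies it to analyze how racial disparities in intensive care units change over time.

Details
English abstract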

In the data-driven era, large-scale datasets are routinely collected and analyzed using machine learning (ML) and artificial intelligence (AI) to inform decisions in high-stakes domains such as healthcare, employment, and criminal justice, raising concerns about the fairness behavior of these systems. Existing works in fair ML cover tasks such as bias detection, fair prediction, and fair decision-making, but largely focus on static settings. At the same time, fairness in temporal contexts, particularly survival/time-to-event (TTE) analysis, remains relatively underexplored, with current approaches to fair survival analysis adopting statistical fairness definitions, which, even with unlimited data, cannot disentangle the causal mechanisms that generate disparities. To address this gap, we develop a causal framework for fairness in TTE analysis, enabling the decomposition of disparities in survival into contributions from direct, indirect, and spurious pathways. This provides a human-understandable explanation of why disparities arise and how they evolve over time. Our non-parametric approach proceeds in four steps: (1) formalizing the necessary assumptions about censoring and lack of confounding using a graphical model; (2) recovering the conditional survival function given covariates; (3) applying the Causal Reduction Theorem to reframe the problem in a form amenable to causal pathway decomposition; (4) estimating the effects efficiently. Finally, our approach is used to analyze the temporal evolution of racial disparities in outcome after admission to an intensive care unit (ICU).

2605.11355 2026-05-13 cs.LG cs.CE

gym-invmgmt: An Open Benchmarking Framework for Inventory Management Methods

Reza Barati, Qinmin Vivian Hu

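Affiliations Department of Computer Science

AI summary This paper presents gym-invmgmt, an open-source evaluation framework for inventory management methods that compares inventory policies under unified experimental conditions. Using a shared core environment contract and a diverse set of 22 scenarios, it evaluates optimization, heuristic, and learned controllers under different inventory management conditions. The study finds that stochastic programming with scenario hedging performs best when forecast information is available, while the Transformer-based Proximal Policy Optimization variant offers advantages in inference speed and policy quality. Overall, performance depends on several factors, including information access, demand shift, network topology, and policy representation.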

Comments 16 pages, 4 figures

Details
English abstract

Inventory-policy comparisons are often difficult to interpret because performance depends on the evaluation contract as much as on the policy itself. Differences in topology, demand regime, information access, feasibility constraints, shortage treatment, and Key Performance Indicator (KPI) definitions can change method rankings. We present gym-invmgmt, a Gymnasium-compatible extension of the OR-Gym inventory-management lineage for auditable cross-paradigm evaluation. The benchmark evaluates optimization, heuristic, and learned controllers under a shared CoreEnv transition, reward, action-bound, and KPI contract, while varying stress conditions through a 22-scenario core grid plus four supplemental MARL-mode rows. Within these released scenarios, informed stochastic programming provides the strongest non-oracle reference, reflecting the value of scenario hedging under forecast access, but at substantially higher online computational cost. Among learned controllers, the Proximal Policy Optimization Transformer variant (PPO-Transformer) achieves the strongest learned-policy quality at fast inference, while Residual Reinforcement Learning (Residual RL) provides competitive hybrid performance. The graph neural network variant (PPO-GNN) is highly competitive on the default divergent topology but less robust on the serial topology. Imitation learning performs well in stationary regimes but degrades under demand shift, and the bounded Large Language Model (LLM) policy-parameter baseline is best interpreted as a diagnostic controller rather than an autonomous inventory optimizer. Overall, the benchmark identifies scenario-conditioned leaders while showing that performance depends jointly on information access, demand shift, topology, and policy representation.

2605.11354 2026-05-13 cs.CV

Lite3R: A Model-Agnostic Framework for Efficient Feed-Forward 3D Reconstruction

Haoyu Zhang, Zeyu Zhang, Zedong Zhou, Yang Zhao, Hao Tang

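Affiliations Peking University; La Trobe University

AI summary This paper proposes Lite3R, a model-agnostic framework for improving the efficiency of Transformer-based 3D reconstruction. The framework introduces Sparse Linear Attention to reduce the computational cost of dense multi-view attention and combines it with a parameter-efficient FP8-aware quantization-aware training strategy for stable geometric reconstruction at low precision. Experiments show that Lite3R significantly reduces latency and memory consumption on two representative backbones while maintaining competitive reconstruction quality, providing an effective algorithm-system co-design approach for efficient 3D reconstruction in practice.

Details
English abstract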

Transformer-based 3D reconstruction has emerged as a powerful paradigm for recovering geometry and appearance from multi-view observations, offering strong performance across challenging visual conditions. As these models scale to larger backbones and higher-resolution inputs, improving their efficiency becomes increasingly important for practical deployment. However, modern 3D transformer pipelines face two coupled challenges: dense multi-view attention creates substantial token-mixing overhead, and low-precision execution can destabilize geometry-sensitive representations and degrade depth, pose, and 3D consistency. To address the first challenge, we propose Lite3R, a model-agnostic teacher-student framework that replaces dense attention with Sparse Linear Attention to preserve important geometric interactions while reducing attention cost. To address the second challenge, we introduce a parameter-efficient FP8-aware quantization-aware training (FP8-aware QAT) strategy with partial attention distillation, which freezes the vast majority of pretrained backbone parameters and trains only lightweight linear-branch projection layers, enabling stable low-precision deployment while retaining pretrained geometric priors. We further evaluate Lite3R on two representative backbones, VGGT and DA3-Large, over BlendedMVS and DTU64, showing that it substantially reduces latency (1.7-2.0x) and memory usage (1.9-2.4x) while preserving competitive reconstruction quality overall. These results demonstrate that Lite3R provides an effective algorithm-system co-design approach for practical transformer-based 3D reconstruction. Code: https://github.com/AIGeeksGroup/Lite3R. Website: https://aigeeksgroup.github.io/Lite3R.

2605.11348 2026-05-13 cs.CL cs.AI cs.IR cs.SI

Large Language Models for Causal Relations Extraction in Social Media: A Validation Framework for Disaster Intelligence

Ujun Jeong, Saketh Vishnubhatla, Bohan Jiang, Andre Harrison, Adrienne Raglin, Huan Liu

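Affiliations Arizona State University; DEVCOM Army Research Laboratory

AI summary This paper studies how large language models (LLMs) can extract causal relations from social media in disaster scenarios to strengthen situational awareness. To validate the effectiveness of LLMs, the authors propose an expert-grounded evaluation framework that assesses accuracy by comparing model-generated causal graphs with reference graphs derived from disaster reports. The study finds that LLMs show potential for causal relation extraction but also carry the risk of relying on model priors rather than post-event evidence.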

Comments Submitted to EMNLP

Details
English abstract

During disasters, extracting causal relations from social media can strengthen situational awareness by identifying factors linked to casualties, physical damage, infrastructure disruption, and cascading impacts. However, disaster-related posts are often informal, fragmented, and context-dependent, and they may describe personal experiences rather than explicit causal relations. In this work, we examine whether Large Language Models (LLMs) can effectively extract causal relations from disaster-related social media posts. To this end, we (1) propose an expert-grounded evaluation framework that compares LLM-generated causal graphs with reference graphs derived from disaster-specific reports and (2) assess whether the extracted relations are supported by post-event evidence or instead reflect model priors. Our findings highlight both the potential and risks of using LLMs for causal relation extraction in disaster decision-support systems.

2605.11346 2026-05-13 cs.LG cs.AI cs.CE

Physics-Informed Teacher-Student Ensemble Learning for Traffic State Estimation with a Varying Speed Limit Scenario

Archie J. Huang, Dongdong Wang, Shaurya Agarwal, Mohamed Abdel-Aty, Md Mahmudul Islam, Muhammad Shahbaz

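Affiliations Department of Building, Civil and Environmental Engineering, Concordia University; Urban Artificial Intelligence Laboratory, University of Florida; Department of Civil, Environmental and Construction Engineering, University of Central Florida

AI summary This paper studies traffic state estimation under varying speed limits and proposes a new framework that combines physics-informed deep learning with teacher-student ensemble training. The teacher models encode the flow conservation law, and the student model uses a multi-layer perceptron classifier to identify traffic characteristics and select the appropriate teacher model for estimation, handling the heterogeneity in traffic characteristics caused by changing speed limits. Experimental results show that the method outperforms other popular baselines on traffic state estimation.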

Comments The IEEE International Conference on Intelligent Transportation Systems (ITSC) 2026

Details
English abstract

Physics-informed deep learning (PIDL) neural networks have shown their capability as a useful instrument for transportation practitioners in utilizing the underlying relationship between the state variables for traffic state estimation (TSE). Another efficient traffic management approach is implementing varying speed limits (VSLs) on transportation corridors to control traffic and mitigate congestion. However, the existing training architecture of PIDL in the literature cannot accommodate the changing traffic characteristics on a freeway with VSL. To tackle this challenge, we propose a novel framework integrating teacher-student ensemble training with PIDL neural networks for TSE under VSL scenarios. The physics of flow conservation law is encoded locally in the teacher models by PIDL, and the student model uses a multi-layer perceptron classifier (MLP) to identify traffic characteristics and selects the ensemble member of PIDL neural networks for TSE. This integrated framework provides a natural solution for capturing the heterogeneity of VSL and accurately addressing the TSE problem. The case study results validate the proposed ensemble approach, demonstrating its superior performance in TSE compared to other popular baseline methods, as indicated by relative L2 error.
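
Two quantities from this abstract are easy to write down: the relative L2 error used for evaluation, and a discrete residual of the flow conservation law, assumed here to take the standard form d(rho)/dt + d(q)/dx = 0 with density rho and flow q. The finite-difference scheme and function names are illustrative assumptions; a PIDL model would typically obtain these derivatives by automatic differentiation instead.

```python
import math

def relative_l2_error(estimate, truth):
    """||estimate - truth||_2 / ||truth||_2 over a flattened grid of state values."""
    num = math.sqrt(sum((e - t) ** 2 for e, t in zip(estimate, truth)))
    den = math.sqrt(sum(t ** 2 for t in truth))
    if den == 0.0:
        raise ValueError("ground truth has zero norm")
    return num / den

def conservation_residual(rho_now, rho_next, flow, dt, dx):
    """Discrete residual of d(rho)/dt + d(q)/dx at interior cells.

    Forward difference in time, central difference in space; values near zero
    mean the estimated state is consistent with flow conservation.
    """
    return [
        (rho_next[i] - rho_now[i]) / dt + (flow[i + 1] - flow[i - 1]) / (2.0 * dx)
        for i in range(1, len(rho_now) - 1)
    ]

err = relative_l2_error([3.0, 3.0], [3.0, 4.0])
residual = conservation_residual([0.2] * 4, [0.2] * 4, [0.5] * 4, dt=1.0, dx=10.0)
```

In a physics-informed loss, the squared residual would be added to the data-fitting term so that each teacher model is penalized for violating conservation in its own speed-limit regime.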

2605.11341 2026-05-13 cs.AI

CPEMH: An Agentic Framework for Prompt-Driven Behavior Evaluation and Assurance in Foundation-Model Systems for Mental Health Screening

Giuliano Lorenzoni, Ivens Portugal, Paulo Alencar, Donald Cowan

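Affiliations University of Waterloo

AI summary This paper proposes CPEMH, an agentic framework for evaluating and assuring the prompt-driven behavior of large language models in mental health screening. By orchestrating the design, evaluation, and selection of prompt strategies, the framework enables systematic control of model behavior across contexts, and its modular structure ensures traceability and stability throughout the process. A depression-screening case study demonstrates the framework's ability to stabilize and audit model behavior in clinical conversational settings, and highlights the importance of modular orchestration, prioritizing stability, and using F1, bias, and robustness as core evaluation criteria.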

Comments 4 pages, 2 figures. Accepted at the AGENT 2026 Workshop (ICSE 2026)

Details
English abstract

This paper presents CPEMH, an agentic framework designed to evaluate prompt-driven behavior in foundation-model systems operating on transcript-based datasets for mental-health screening. CPEMH serves as an engineering methodology for behavioral assurance in large-scale language systems, introducing an orchestrated architecture that autonomously performs the design, evaluation, and selection of prompt strategies, enabling systematic control of behavioral variability across contexts. Its modular agentic design, combining orchestrator, inference, and evaluation agents, ensures traceability, reproducibility, and robustness throughout the prompting lifecycle. A case study on automated depression screening from interview transcripts demonstrates the framework's capacity to stabilize and audit foundation-model behavior in conversational and clinically sensitive domains. Lessons learned emphasize the role of modular orchestration in behavioral assurance, the prioritization of stability over architectural complexity, and the integration of F1, bias, and robustness as core acceptance criteria.

2605.11334 2026-05-13 cs.LG cs.CL cs.IR

VERDI: Single-Call Confidence Estimation for Verification-Based LLM Judges via Decomposed Inference

Jasmine Qi, Danylo Dantsev, Muyang Sun

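Affiliations Indeed Inc

AI summary VERDI is a single-call confidence estimation method for verification-based LLM judge systems. It decomposes the verification steps in the judge's reasoning and extracts three structural signals to assess how trustworthy a verdict is. The method requires no additional inference calls and combines the signals with a logistic regression model to predict confidence. It performs well on several public benchmarks and on a production system, and it remains useful on models whose answer-token confidence is poorly calibrated.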

Comments 16 pages, 6 figures

Details
English abstract

LLM-as-Judge systems are widely deployed for automated evaluation, yet practitioners lack reliable methods to know when a judge's verdict should be trusted. Token log-probabilities, the standard post-hoc confidence signal, are unavailable for many commercial LLMs and, even when accessible, saturate above 0.999 with structured JSON output. We introduce VERDI (VERification-Decomposed Inference), a method that extracts confidence from the reasoning trace a structured judge already produces, with no additional inference calls. VERDI decomposes each verification-style evaluation into sub-checks and derives three structural signals: Step-Verdict Alignment, Claim-Level Margin, and Evidence Grounding Score. We combine them with Platt-scaled logistic regression. On three public benchmarks, VERDI achieves AUROC 0.72-0.91 on GPT-4.1-mini and 0.66-0.80 on GPT-5.4-mini. On Qwen3.5-4B/9B/27B, where answer-token logprobs are anti-calibrated (higher confidence on errors, AUROC 0.32-0.49), VERDI achieves 0.56-0.70. We additionally validate on a production system with eight rubrics (AUROC 0.73-0.88 on factual rubrics), demonstrate cross-model transfer (AUROC 0.66-0.69), and show that a 33M-parameter NLI (Natural Language Inference) model provides a scalable alternative to regex extraction.
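
As a rough sketch of the final stage, the three structural signals can be combined by a logistic model and the resulting confidences scored with AUROC. The weights and bias below are placeholders, not fitted values, and the signal extraction itself (sub-check parsing, margins, grounding) is not shown because the abstract does not detail it.

```python
import math

def confidence(signals, weights, bias):
    """Logistic combination of (alignment, margin, grounding) signals into a value in (0, 1)."""
    z = bias + sum(w * s for w, s in zip(weights, signals))
    return 1.0 / (1.0 + math.exp(-z))

def auroc(scores, labels):
    """Probability that a correct verdict (label 1) outscores an incorrect one (label 0).

    Ties count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical weights; in practice they would be fitted on labeled judge verdicts.
weights, bias = [2.0, 1.0, 1.5], -2.0
scores = [
    confidence([0.9, 0.8, 0.9], weights, bias),  # judge verdict was correct
    confidence([0.2, 0.1, 0.3], weights, bias),  # judge verdict was wrong
]
quality = auroc(scores, [1, 0])
```

An AUROC of 0.5 means the confidence is uninformative and values below 0.5 mean it is anti-calibrated, which is the situation the abstract reports for answer-token log-probabilities on some models.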