arXivDaily: arXiv daily academic digest, updated Monday to Friday
2605.24383 2026-05-27 cs.AI cs.CY cs.SE

A governance horizon for ethical-use constraints in open-weight AI models

Weiwei Xu, Hengzhi Ye, Haoran Ye, Kai Gao, Vladimir Filkov, Minghui Zhou

Affiliations: School of Computer Science; Ministry of Education; Laboratory of High Confidence Software Technologies; University of Science and Technology Beijing; University of California, Davis

AI summary: An audit of model repositories on Hugging Face Hub finds that disclosure-based governance of open-weight AI has a shallow, structurally limited reach; the paper introduces the governance horizon concept and compares the effects of different policy designs.

Abstract

Ethical constraints on open-weight AI models are both a reflection of societal concerns and a foundation for AI governance policy. They are expected to propagate to downstream derivatives while implemented as voluntary metadata disclosures that must be restated at each generation of reuse. We audit 2,142,823 model repositories on Hugging Face Hub to test whether this disclosure-based governance infrastructure can sustain traceability across deep model lineages. Restriction evidence decays with a half-life of 1.31 derivation steps ($R^2$=0.98), and beyond seven downstream generations at least 80% of descendant models lack sufficient public evidence for a governance determination, a depth boundary we formalize as the governance horizon. Platform-level interventions to restore missing licence metadata reveal that policy design (not enforcement alone) is the binding factor: inheritance-only designs require near-complete enforcement to move the horizon, whereas a mandatory-declaration design that explicitly resolves orphan lineage components shifts the horizon already at moderate enforcement. The structural bottleneck is lineages with no inheritable upstream intent: such orphan components remain undecidable under any inheritance-only policy regardless of enforcement rate, and unresolved upstream nodes additionally create direct downstream undecidability bottlenecks that inheritance rules alone cannot recover. Comparison with PyPI, where governance signals are carried by explicit machine-readable declarations, corroborates that the collapse is topology-specific to open-weight derivation rather than inherent to open ecosystems. These results establish that disclosure-based governance has a shallow, structurally determined reach in open-weight AI, and that achieving deep supply-chain accountability requires provenance mechanisms propagating governance signals through derivation itself.
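The reported half-life can be turned into a rough depth curve. The sketch below assumes that restriction evidence decays as a pure exponential in derivation depth, which is a simplification: the abstract gives only the half-life (1.31 steps) and the fit quality. The function name `remaining_fraction` is illustrative and not taken from the paper.

```python
def remaining_fraction(depth: float, half_life: float = 1.31) -> float:
    """Fraction of restriction evidence left after `depth` derivation steps.

    Assumes pure exponential decay with the half-life quoted in the abstract.
    """
    if depth < 0 or half_life <= 0:
        raise ValueError("depth must be >= 0 and half_life must be > 0")
    return 0.5 ** (depth / half_life)


if __name__ == "__main__":
    # Print the assumed decay curve for the first seven generations.
    for depth in range(8):
        print(depth, round(remaining_fraction(depth), 4))
```

Under this assumption about half of the evidence is gone after one to two derivation steps. The 80% threshold that defines the governance horizon is measured on a different quantity (the share of descendants lacking sufficient evidence), so this curve is not expected to reproduce it.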

2605.24296 2026-05-27 cs.AI cs.IR

When Does Synthetic Patent Data Help? Volume-Fidelity Trade-offs in Low-Resource Multi-Label Classification

Amirhossein Yousefiramandi, Ciaran Cooney

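Affiliations: Clarivate, Intellectual Property

AI summary: Studies the volume-fidelity trade-off when LLM-generated synthetic data is used for multi-label patent classification; the volume effect dominates in low-resource settings, fidelity matters more in high-resource settings, and a mixed-data strategy performs best.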

Abstract

Using LLM-generated synthetic data for multi-label patent classification raises two questions: (i) when such data helps and (ii) why. Answering the first requires controlling for the gain that comes simply from a larger sample size. The experiments cover six open-source LLMs (3.8B to 12B parameters) and four real-data regimes in the classification of 64 WIPO labels for assistive technologies. Both full-synthesis generation, conditioned on the label set, and paraphrasing are applied, each in combination with three classifier categories. The claimed improvement in micro F1 for BERT-for-Patents from 0.120 to 0.702 mainly reflects a volume effect: replication with replacement of 165 examples already yields 0.678. The improvement over this control is therefore +0.024, whereas the improvement over the best baseline (focal-loss reweighting) is +0.219. The second point is how fidelity scores relate to performance as the data regime varies. In low real-data regimes the volume effect dominates, and the correlation between maximum mean discrepancy (MMD) and classification performance is r = +0.95. As more real data is used, the correlation inverts, reaching r = -0.73 at the 1:10 regime (Fisher z = +6.47, p < 0.001, 95% CI on Delta r [ +0.96, +1.00 ]). Under a fixed budget, combining real data (about 20-30%) with synthetic data (70-80%) outperforms both purely synthetic and purely real strategies. Moreover, a corpus that improves raw micro F1 by up to +0.58 may adversely affect a Jaccard-overlap retrieval proxy. Prompt-family variations for other genres may partly explain the phenomenon, but using the standard-patent filter still decreases nDCG@10 by 26%.

2605.24217 2026-05-27 cs.AI cs.DC

Identifying and Mitigating Systemic Measurement Bias in Production LLM Inference Benchmarks

Ashok Chandrasekar, Jason Kramberger

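Affiliations: Google

AI summary: To address measurement bias caused by client-side queuing in production LLM inference benchmarks, proposes an unbiased multi-process evaluation framework and the Normalized Time Per Output Token (NTPOT) metric, enabling accurate performance evaluation under high concurrency.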

Abstract

As Large Language Models (LLMs) transition from research environments to production deployments, evaluating their performance against strict Service Level Objectives (SLOs) has become critical. However, current evaluation methodologies suffer from severe measurement bias at scale. We demonstrate that widely used benchmarking utilities rely on single-process, asyncio-driven architectures that introduce fundamental client-side queuing bottlenecks under high concurrency. By modeling the benchmarking client as an $M/G/1$ queue, we mathematically demonstrate how the Python Global Interpreter Lock (GIL) artificially inflates Time to First Token (TTFT) and Time Per Output Token (TPOT) metrics as request rates scale. To resolve this systematic inaccuracy, we propose an unbiased, multi-process evaluation framework that effectively distributes client-side load, ensuring negligible queuing overhead. Furthermore, we formalize a composite metric, Normalized Time Per Output Token (NTPOT), to robustly amortize end-to-end latency, including prefill and scheduling delays across sequence lengths. Our empirical evaluation demonstrates that this methodology successfully isolates pure serving engine performance, enabling accurate, reproducible profiling of LLMs at production scales exceeding thousands of queries per second.
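To see why an M/G/1 model predicts inflated latency as request rates scale, the standard Pollaczek-Khinchine formula for mean queueing delay is enough. The sketch below is not the paper's derivation; treating the per-request client-side processing time as the service time is an assumed reading of the abstract, and the function name and example numbers are hypothetical.

```python
def mg1_mean_wait(arrival_rate: float, mean_service: float,
                  service_second_moment: float) -> float:
    """Mean time a request waits in an M/G/1 queue before service starts.

    Pollaczek-Khinchine: W_q = lambda * E[S^2] / (2 * (1 - rho)),
    where rho = lambda * E[S] is the utilization of the single server.
    """
    rho = arrival_rate * mean_service
    if arrival_rate < 0 or not 0.0 <= rho < 1.0:
        raise ValueError("queue is unstable or inputs are invalid (need 0 <= rho < 1)")
    return arrival_rate * service_second_moment / (2.0 * (1.0 - rho))


if __name__ == "__main__":
    # Hypothetical client: 1 ms mean per-request handling time,
    # exponentially distributed, so E[S^2] = 2 * E[S]^2.
    mean_s = 0.001
    second_moment = 2 * mean_s ** 2
    for qps in (100, 500, 900, 990):
        wait_ms = 1000 * mg1_mean_wait(qps, mean_s, second_moment)
        print(f"{qps} req/s -> mean client-side wait {wait_ms:.3f} ms")
```

The delay grows without bound as utilization approaches 1, so any latency measured on the client includes a queueing term that belongs to the benchmark tool rather than to the serving engine.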

2605.24152 2026-05-27 cs.AI

Neuro-Inspired Inverse Learning for Planning and Control

Maryna Kapitonova, Tonio Ball

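Affiliations: NeuroMentum AI IMBIT, University of Freiburg, Germany

AI summary: Proposes Inverter, a neuro-inspired framework that uses Inverse Learning (IL) to combine forward/inverse internal models, open-loop multi-step motor commands, and hierarchical action organization for efficient inference in planning and control tasks, with an average performance gain of 24.2% and one to two orders of magnitude less inference compute time.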

Comments Version 2, minor fix in online version of the abstract, pdf unchanged

Abstract

We present a neuro-inspired framework for embodied planning and control. Building on three principles that enable fast and highly effective goal-directed behavior in the mammalian brain - paired forward/inverse internal models, open-loop multi-step motor commands, and sequential, hierarchical organization of action - our Inverter framework uses learned components, trained end-to-end through Inverse Learning (IL) and supplemented where natural by analytic or algorithmic modules; we formalize IL and delineate it from supervised, reinforcement, and imitation learning. IL bridges Reinforcement Learning (RL)-style amortization, which runs in a single forward pass but emits only one action at a time, and Optimal Control (OC)-style sequence planning over whole trajectories, but with iterative test-time computation. Single Inverters or hierarchical n=2 Inverter stacks match or improve on offline-RL and diffusion-planner baselines on all 3 maze2d and 6 antmaze D4RL variants by an average of +24.2% (range -1.9% to +78.2%), at one-to-two orders of magnitude less inference compute time. Distinctively, optimizing through the Forward Model (FoM) over the entire T-step action sequence - rather than per step - lets Inverters produce smooth, goal-coherent, trajectory-wide structure and reach control policies closer to the analytic optimum than the policy underlying the training data itself. We also identify a failure mode of IL: FoM hacking under narrow training-data coverage, which we mitigate by using random training data with broader coverage. As an application example, a Pulse Inverter synthesizes arbitrary single-qubit quantum gates with fidelity matching the standard iterative numerical baseline (GRAPE), at more than 1000x lower per-gate compute time. In summary, we conclude that IL enables a versatile class of world-interfaces, especially for latency- and resource-critical embodied AI.

2605.24071 2026-05-27 cs.LG cs.AI

Not All Transitions Matter: Evidence from PPO

Ajhesh Basnet

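Affiliations: Department of Artificial Intelligence and Data Science; KPR Institute of Engineering and Technology

AI summary: Proposes randomly dropping a fixed fraction of rollout transitions during PPO training to break the repetitive gradient structure and stabilize training, validated across several environments.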

Comments 19 pages, 5 figures. Accepted to 2026 8th Asia Conference on Machine Learning and Computing (ACMLC 2026)

Journal ref Proceedings of the 2026 8th Asia Conference on Machine Learning and Computing

Abstract

Training a reinforcement learning agent on-policy means collecting fresh experience at every update, and that experience comes with a hidden problem. Each state in a rollout is the direct output of the previous one, causally chained together by the agent's own actions. Because of this, consecutive transitions are never truly independent. They carry overlapping information, and the gradient signal the network receives ends up far more repetitive than the batch size suggests. The same directions get reinforced over and over, the value network struggles to keep up as the policy shifts, and training becomes quietly unstable in ways that reward curves alone rarely reveal. This paper asks whether that redundancy can simply be removed. We show that randomly dropping a fixed fraction of transitions from the rollout, at the right stage so the reward signal stays intact, is enough to break the repetitive gradient structure and stabilize training. The change is minimal: one sampling step, no new components, no modification to the core algorithm, and it works with any PPO implementation. Across five environments of increasing difficulty, CartPole-v1, Acrobot-v1, LunarLander-v2, HalfCheetah-v5, and Hopper-v5, the method matches vanilla PPO on reward while producing more consistent training dynamics across KL divergence, policy entropy, and value estimates. Dropping 25% of transitions turns out to be the sweet spot: enough to disrupt the redundancy, not enough to thin the batch.
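The sampling step described above might look like the following sketch. It assumes that "the right stage" means after returns and advantages have been computed on the full rollout, so that dropping rows does not change those targets; that reading, the function name `drop_transitions`, and the batch layout are assumptions, not details from the paper.

```python
import numpy as np


def drop_transitions(batch, drop_fraction=0.25, rng=None):
    """Return a copy of `batch` with a random fixed fraction of rows removed.

    `batch` maps field names (e.g. "obs", "actions", "advantages", "returns")
    to arrays sharing the same first dimension. Advantages and returns are
    assumed to be computed already, so removing rows leaves them unchanged.
    """
    if not 0.0 <= drop_fraction < 1.0:
        raise ValueError("drop_fraction must be in [0, 1)")
    rng = np.random.default_rng() if rng is None else rng
    n = len(next(iter(batch.values())))
    n_keep = n - int(round(drop_fraction * n))
    # Sample row indices without replacement; sorting preserves rollout order.
    keep = np.sort(rng.choice(n, size=n_keep, replace=False))
    return {name: np.asarray(values)[keep] for name, values in batch.items()}
```

The reduced batch would then be passed to the usual PPO minibatch loop unchanged, which matches the abstract's claim that no other part of the algorithm is modified.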

2605.24042 2026-05-27 cs.LG cs.AI

Hidden-State Privacy Has an Empty Middle

Alexander Okezue Bell

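Affiliations: Stanford University

AI summary: Shows through a theoretical lower bound and experiments that Gaussian release mechanisms for hidden-state privacy cannot achieve moderate utility and moderate privacy at the same time, leaving an empty middle region, and identifies the diagonal inverse-Fisher mechanism as the minimax-optimal diagonal release.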

Comments 74 pages, 61 figures

Abstract

Of $1{,}536$ Gaussian release covariances we tested for single-layer hidden-state privacy, zero achieve both moderate utility and moderate privacy against an adaptive retrieval attacker. We prove a complementary Fisher-ball lower bound: every full-rank Gaussian release at $O(1)$ Fisher utility admits a direction whose Mahalanobis signal grows linearly in hidden width, ruling out uniform Gaussian safety in the class and matching the empirical empty middle. The diagonal inverse-Fisher release $\Sigma^\star_{\mathrm{diag}}(\mathcal{K}) = (2\mathcal{K}/d)\,\mathrm{diag}(1/F_{ii})$ is the unique minimax-optimal diagonal mechanism at first-order KL budget $\mathcal{K}$ and the only release with worst-attacker top-1 $\le 0.001$ at every point of a 32 model-layer grid, but it sits on a privacy/utility edge rather than filling the middle. A generalized-eigen mechanism reaching $13\times$ Pareto reduction under Euclidean retrieval collapses to $100\%$ top-1 under the adaptive Mahalanobis attacker, and a full-trajectory sequence inverter recovers $94\%$ of clean GPT-2 prefixes but $0\%$ under $\Sigma_{\mathrm{diag}}$. A split-memory transformer trained from scratch reaches $G_{\mathrm{Mah}} \in [20, 33]$ at 90M and maintains a $6$--$24\times$ advantage over same-budget GPT baselines from 30M to 1B at a fixed-token language-modeling loss penalty; pretrained models top out at 9.3. These results reframe hidden-state release from mechanism-design within the Gaussian class to architecture or release co-design.
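The closed form for the diagonal inverse-Fisher release is simple enough to compute directly. The sketch below builds the per-coordinate variances $(2\mathcal{K}/d)/F_{ii}$ and adds the corresponding Gaussian noise to a hidden state. The function names are hypothetical, and the accompanying check that $\tfrac{1}{2}\sum_i F_{ii}\sigma_i^2 = \mathcal{K}$ rests on an assumed reading of "first-order KL budget"; the abstract does not spell out that definition.

```python
import numpy as np


def diag_inverse_fisher_variances(fisher_diag, kl_budget):
    """Per-coordinate noise variances (2K/d) / F_ii from the abstract's formula."""
    f = np.asarray(fisher_diag, dtype=float)
    if f.ndim != 1 or np.any(f <= 0):
        raise ValueError("fisher_diag must be a 1-D array of positive values")
    return (2.0 * kl_budget / f.size) / f


def gaussian_release(hidden, fisher_diag, kl_budget, rng=None):
    """Add zero-mean Gaussian noise with the diagonal covariance above.

    `hidden` has shape (d,) or (batch, d); noise is independent per coordinate.
    """
    rng = np.random.default_rng() if rng is None else rng
    hidden = np.asarray(hidden, dtype=float)
    std = np.sqrt(diag_inverse_fisher_variances(fisher_diag, kl_budget))
    return hidden + rng.normal(size=hidden.shape) * std
```

Coordinates with large Fisher information receive little noise and coordinates with small Fisher information receive more, so each coordinate contributes an equal share $\mathcal{K}/d$ under the assumed budget definition.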

2605.24001 2026-05-27 cs.CV cs.AI cs.LG

Diff-Instruct with Diffused Reward: Towards Principled One-step Generator RL

Junyi Wu, Weijian Luo, Haoyang Zheng, Ruizhe Zhang, Guang Lin

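Affiliations: Purdue University; hi-lab, Xiaohongshu Inc.

AI summary: To address the mismatch between reward optimization and generative dynamics in reinforcement learning for one-step generators, proposes DIDR, a data-free trajectory-level alignment framework based on Integral KL minimization, which applies a reward-driven correction through the Diffused Reward Score and a proxy estimator; it Pareto-dominates one-step SDXL baselines and transfers to a 6B DiT backbone.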

Comments author list correction

Abstract

Recent advances in one-step text-to-image generation have enabled real-time synthesis with remarkable efficiency and quality. Previous reinforcement learning methods for one-step generators combine image-space reward optimization with diffusion noisy-space distribution matching. This paradigm brings challenges due to a mismatch between terminal reward optimization and the underlying generative dynamics. As a result, optimization tends to exploit stochastic degrees of freedom, often improving reward at the expense of image fidelity. To address this issue, we propose Diff-Instruct with Diffused Reward (DIDR), a data-free trajectory-level alignment framework derived from Integral KL minimization. DIDR propagates the RLHF-optimal reward-tilted clean-image distribution across all noise levels along the diffusion trajectory. We show that this objective admits the same minimizer as clean-image RLHF, while naturally inducing the Diffused Reward Score (DRS), which acts as a reward-driven correction to the reference score function. To make this practical, we further introduce the Diffused Reward Proxy (DRP), an efficient estimator of DRS based on differentiable short-step denoising. Extensive experiments demonstrate that DIDR consistently Pareto-dominates existing one-step SDXL baselines. Moreover, when transferred to a 6B DiT backbone (Z-Image), DIDR surpasses its 50-step teacher in preference alignment while requiring only a single generation step.

2605.23651 2026-05-27 cs.CL

How Human-Like Are Large Language Models? A Register-Aware Linguistic Evaluation Framework

Björn Nieth, Marianna Gracheva, Michaela Mahlberg, Bjoern Eskofier, Emmanuelle Salin

Affiliations: Department Artificial Intelligence in Biomedical Engineering (AIBE); Department of Digital Humanities and Social Studies (DHSS); University of Birmingham; Chair of AI-supported Therapy Decisions; Munich Center for Machine Learning (MCML); Institute of AI for Health, Helmholtz Zentrum München

AI summary: Proposes a register-aware evaluation framework that compares the distributions of lexico-grammatical features in human reference corpora and LLM-generated text (using Maximum Mean Discrepancy and Biber's 67 features); LLMs deviate from the human baseline, and which model is most human-like depends on register rather than model size.

Comments 8.5 pages (main) + 31 pages appendix, 29 figures, 10 tables. Code and data: https://github.com/BjoernNieth/Register_Aware_LLMs

Abstract

While factual correctness and task performance have long been the focus of Large Language Model (LLM) research, the fundamental question of how human-like generated texts are at the linguistic level remains underexplored. From a corpus-linguistic perspective, language production is inherently context-dependent, with distinct communicative contexts giving rise to differences in the frequencies and co-occurrence patterns of linguistic features. A text that fails to adhere to these patterns can be correct in content yet still read poorly to human readers. In this work, we propose a context-aware evaluation framework in which human-likeness is assessed as a two-sample problem between the linguistic feature distribution of a human reference corpus for a given register and that of a corresponding LLM-generated corpus. We implement this framework using the Maximum Mean Discrepancy (MMD) and the 67 lexico-grammatical features introduced by Biber, which are commonly applied in corpus linguistics. In our experiments, we compare seven instruction-tuned, open-source models across five English-language datasets spanning distinct registers against a human baseline. Across all tested setups, LLMs deviate from the human baseline, but which models are closest to human language depends on the register and is not dictated by model size.
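For readers unfamiliar with the two-sample statistic named in this abstract, the following sketch computes a biased estimate of squared MMD between two sets of feature vectors. The RBF kernel and its bandwidth are assumptions made for illustration; the abstract does not state which kernel or estimator the authors use, and the 67-dimensional inputs in the usage example are random stand-ins for real feature vectors.

```python
import numpy as np


def rbf_kernel(a, b, bandwidth):
    """Gaussian kernel matrix k(a_i, b_j) = exp(-||a_i - b_j||^2 / (2 h^2))."""
    sq_dist = (
        np.sum(a ** 2, axis=1)[:, None]
        + np.sum(b ** 2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clip tiny negative values caused by floating-point cancellation.
    return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * bandwidth ** 2))


def mmd_squared(x, y, bandwidth=1.0):
    """Biased (V-statistic) estimate of MMD^2 between samples x and y.

    x has shape (n, d) and y has shape (m, d); rows are feature vectors.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    k_xx = rbf_kernel(x, x, bandwidth).mean()
    k_yy = rbf_kernel(y, y, bandwidth).mean()
    k_xy = rbf_kernel(x, y, bandwidth).mean()
    return k_xx + k_yy - 2.0 * k_xy


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    human = rng.normal(size=(50, 67))          # stand-in for human feature vectors
    model = rng.normal(loc=1.0, size=(50, 67))  # stand-in for LLM feature vectors
    print(mmd_squared(human, model, bandwidth=8.0))
```

A value near zero indicates that the two feature distributions are hard to tell apart under the chosen kernel, while larger values indicate a larger distributional gap.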

2605.23327 2026-05-27 cs.CV

GFSR: Geometric Fidelity and Spatial Refinement for Reliable Lane Detection

Tiancheng Wang, Zhaolu Ding, Richeng Xu, Tianhui Zheng, Hui Liu, Hanyu Xuan, Zhiliang Wu, Guanghui Yue

Affiliations: School of Big Data and Statistics, Anhui University; School of Artificial Intelligence and Data Science, University of Science and Technology of China; Institute of Dataspace, Hefei Comprehensive National Science Center; College of Computing and Data Science, Nanyang Technological University; School of Biomedical Engineering, Shenzhen University Medical School, Shenzhen University

AI summary: Existing lane detection methods decouple classification confidence from geometric quality, and their regression modules weaken correlations among sampling points, which degrades performance in complex scenes. The proposed GFSR framework, comprising LaneIoU-guided confidence calibration and adaptive gated location refinement, achieves state-of-the-art results on CULane and 87.35% F1_50 on CurveLanes.

Comments Submitted to IEEE Transactions on Intelligent Transportation Systems. 12 pages, 6 figures

Abstract

Lane detection stands as a crucial perception task in autonomous driving and advanced driver assistance systems. However, existing methods still degrade in complex real scenarios due to two major limitations. First, classification confidence only characterizes the categorical existence of lane priors and has no strong correlation with geometric quality. If threshold filtering and NMS are conducted merely based on this confidence, the model tends to retain lane priors with high confidence while eliminating those with lower confidence but superior geometric representation. Secondly, the regression modules in existing methods weaken correlations among sampling points, hindering fine-grained optimization of distant, high-curvature and complex-topology lanes and causing underfitting. To address these issues, we propose Geometric Fidelity and Spatial Refinement (GFSR), a framework consisting of LaneIoU-guided Confidence Calibration (LCC) and Adaptive Gated Location Refinement (AGLR). Specifically, LCC adopts LaneIoU as soft supervision to explicitly estimate the geometric fidelity of lane priors, which is further fused with classification confidence to construct the Collaborative Reliability Index (CRI). This index guides lane prior filtering, effectively retaining those with high classification confidence and favorable geometric quality. Meanwhile, cooperating with regression heads in each refinement stage, AGLR predicts sampling point lateral offsets and adopts a gating mechanism to adaptively regulate correction magnitude, strengthen inter-point correlations and boost model adaptability as well as robustness toward complex lane scenarios. Extensive experiments on CULane and CurveLanes demonstrate that our GFSR achieves state-of-the-art performance on CULane, with F1_50 and F1_75 scores of 81.46% and 65.01%, and reaches 87.35% F1_50 on CurveLanes.

2605.22904 2026-05-27 cs.CV cs.AI

Suicide Risk Assessment from AI-powered Video Surveillance: An Interpretable Framework for Prevention in Metro Stations

Safwen Naimi, Wassim Bouachir, Guillaume-Alexandre Bilodeau, Brian Mishara

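Affiliations: Université TÉLUQ; Polytechnique Montréal; Université du Québec à Montréal

AI summary: Proposes the first interpretable framework for assessing suicide risk from surveillance video through person tracking, activity recognition, semantic segmentation of the platform, and trajectory-driven risk heatmap modeling, reaching 83.2% ROC-AUC on real data.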

Comments 9 pages, 6 figures, 1 table. Accepted for Publication in the International Joint Conference of Artificial Intelligence (IJCAI)

Abstract

Understanding and monitoring human behavior in metro stations play an important role in supporting suicide prevention efforts, where early identification of high-risk situations can enable timely intervention. This requires assessing suicide risk from surveillance video by jointly reasoning about the behavior of each passenger, their spatial context, and temporal dynamics. However, this assessment is challenging, as it demands accurate perception of human motion, understanding of platform geometry, and aggregation of heterogeneous behavioral cues over time. In this work, we formalize the task of Suicide Risk Assessment (SRA) in metro stations and introduce the first interpretable framework that addresses this challenge. Unlike approaches that focus on isolated subtasks or attempt to infer intent directly, our formulation assesses suicide risk from accumulated evidence by incorporating person tracking, activity recognition, semantic segmentation of the platform, and trajectory-driven risk heatmap modeling. By formalizing SRA as a distinct task and benchmarking a complete operational pipeline achieving 83.2% ROC-AUC on real surveillance data, this work highlights the complexity of suicide risk assessment and opens new directions for research on interpretable AI systems for social good.

2605.22834 2026-05-27 cs.CL cs.IR

Query-Adaptive Semantic Chunking for Retrieval-Augmented Generation: A Dynamic Strategy with Contextual Window Expansion

Mudit Rastogi

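Affiliations: Independent Researcher

AI summary: Proposes Query-Adaptive Semantic Chunking (QASC), which integrates the query into chunking and dynamically builds relevant, coherent document chunks using sentence-query embedding cosine similarity, contextual window expansion, and chunk-level score aggregation; F1 improves by 18-27% over fixed chunking and by 8-12% over semantic and agentic chunking.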

Abstract

Retrieval-Augmented Generation (RAG) systems depend critically on document chunking quality for retrieving relevant context. Fixed chunking segments documents into uniform units irrespective of semantics or user intent, producing a precision-recall trade-off unresolvable by tuning chunk size alone. Semantic and agentic methods partially address these limitations but do not integrate user queries at the chunking stage. We present Query-Adaptive Semantic Chunking (QASC), which dynamically constructs chunks by integrating queries into segmentation through three mechanisms: cosine similarity scoring between sentence and query embeddings to identify seed sentences, contextual window expansion around seeds to preserve coherence, and chunk-level score aggregation to ensure holistic relevance. We evaluate QASC on 100 technical documents across 200 queries spanning four types, comparing against fixed chunking at five granularities, recursive splitting, semantic chunking, and agentic chunking. QASC achieves an F1-score of 0.85, a relative improvement of 18-27% over fixed chunking and 8-12% over semantic and agentic alternatives. Ablation studies confirm each component contributes meaningfully. Human evaluation by three annotators (Cohen kappa = 0.82) corroborates that QASC produces more relevant and coherent chunks than existing methods.
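The three mechanisms named in this abstract can be sketched on precomputed embeddings as follows. This is an illustrative reading rather than the author's implementation: the seed threshold, the fixed window size, the merging of overlapping windows, and the use of a mean for chunk-level aggregation are all assumptions, and `query_adaptive_chunks` is a hypothetical name.

```python
import numpy as np


def cosine_scores(sentence_emb, query_emb):
    """Cosine similarity between each sentence embedding and the query embedding."""
    s = np.asarray(sentence_emb, dtype=float)
    q = np.asarray(query_emb, dtype=float)
    s = s / np.linalg.norm(s, axis=1, keepdims=True)
    return s @ (q / np.linalg.norm(q))


def query_adaptive_chunks(sentence_emb, query_emb, seed_threshold=0.6, window=1):
    """Return (start, end, score) sentence spans, best-scoring first.

    1. Score sentences against the query and keep seeds above the threshold.
    2. Expand each seed by `window` sentences on both sides, merging spans
       that overlap or touch.
    3. Aggregate sentence scores per chunk (mean, by assumption) and rank.
    """
    scores = cosine_scores(sentence_emb, query_emb)
    n = len(scores)
    spans = []
    for i in np.flatnonzero(scores >= seed_threshold):
        lo, hi = max(0, int(i) - window), min(n - 1, int(i) + window)
        if spans and lo <= spans[-1][1] + 1:
            spans[-1][1] = max(spans[-1][1], hi)
        else:
            spans.append([lo, hi])
    chunks = [(lo, hi, float(scores[lo:hi + 1].mean())) for lo, hi in spans]
    return sorted(chunks, key=lambda chunk: chunk[2], reverse=True)
```

In practice the sentence and query embeddings would come from the same embedding model; the spans returned here index into the original sentence list, so the chunk text is recovered by joining sentences `start` through `end`.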

2605.22774 2026-05-27 cs.LG cs.AI cs.HC

CogAdapt: Transferring Clinical ECG Foundation Models to Wearable Cognitive Load Assessment via Lead Adaptation

Amir Mousavi, Erfan Nourbakhsh, Mohammad Sadegh Sirjani, Mimi Xie, Rocky Slavin, Leslie Neely, John Davis, John Quarles

Affiliations: Department of Computer Science, College of AI, Cyber and Computing, The University of Texas at San Antonio; Department of Educational Psychology, College of Education and Human Development, The University of Texas at San Antonio

AI summary: Proposes the CogAdapt framework, which uses the learnable adapter LeadBridge to convert 3-lead wearable signals into 12-lead representations and the progressive fine-tuning strategy ProFine to transfer clinical ECG foundation models to wearable cognitive load assessment, substantially outperforming baselines trained from scratch in cross-subject validation.

Comments 7 pages, 7 figures. Submitted to IEEE-EMBS International Conference on Biomedical and Health Informatics (BHI 2026)

Abstract

Real-time cognitive load assessment is essential for adaptive human-computer interaction but remains challenging due to limited labeled data and poor cross-subject generalization. Recent ECG foundation models pre-trained on millions of clinical recordings offer rich representations, but cannot be directly applied to wearable devices due to sensor configuration mismatch and task differences. In this paper, we propose CogAdapt, a framework that adapts clinical ECG foundation models to wearable cognitive load assessment. CogAdapt introduces LeadBridge, a learnable adapter that transforms 3-lead wearable signals into anatomically consistent 12-lead representations, and ProFine, a progressive fine-tuning strategy that gradually unfreezes encoder layers while preventing catastrophic forgetting. Evaluations on two public datasets (CLARE and CL-Drive) under leave-one-subject-out cross-validation show that CogAdapt substantially outperforms baselines trained from scratch, achieving macro-F1 scores of 0.626 and 0.768. These results demonstrate the promise of foundation model adaptation for subject-independent cognitive load assessment from wearable sensors.

2605.21883 2026-05-27 cs.CL

Token-weighted Direct Preference Optimization with Attention

Chengyu Huang, Zhuohang Li, Sheng-Yen Chou, Claire Cardie

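Affiliations: Cornell University; Vanderbilt University

AI summary: Proposes Token-weighted DPO (TwDPO) and its instantiation AttentionPO, which uses the LLM's own attention to estimate token weights and improves alignment of large language models with human preferences at the cost of only two extra forward passes per example.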

Abstract

Direct Preference Optimization (DPO) aligns Large Language Models with human preferences without the need for a separate reward model. However, DPO treats all tokens in responses equally, neglecting the differing importance of individual tokens. Existing token-level PO methods compute the token weights using either token-position-based heuristic functions or probability estimates given by a separately trained model; these approaches lack robustness and incur extra training cost. In contrast, we propose Token-weighted DPO (TwDPO), a novel training objective grounded in token-weighted RL, and AttentionPO, an instantiation of TwDPO that uses attention from the LLM itself to estimate token weights. AttentionPO prompts the LLM to serve as a pairwise judge and checks where the model attends when comparing the responses. This design makes AttentionPO content-aware, adjusting weights based on response content, and efficient, incurring only two extra forward passes per example. Experimental results show that AttentionPO significantly improves performance on AlpacaEval, MT-Bench, and ArenaHard, surpassing existing Preference Optimization methods.
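One natural way to weight tokens inside the DPO loss is to scale each token's policy-to-reference log-ratio before summing. The sketch below shows that form for a single preference pair. It is an assumed formulation for illustration only: the abstract does not give the exact TwDPO objective, and the function name, argument layout, and example weights are hypothetical.

```python
import numpy as np


def token_weighted_dpo_loss(policy_logp_w, ref_logp_w, weights_w,
                            policy_logp_l, ref_logp_l, weights_l, beta=0.1):
    """Token-weighted DPO loss for one (chosen, rejected) pair.

    Each argument is a 1-D array over the tokens of the chosen (`_w`) or
    rejected (`_l`) response. With all weights equal to 1 this reduces to
    the standard sequence-level DPO loss.
    """
    margin_w = np.sum(np.asarray(weights_w) * (np.asarray(policy_logp_w) - np.asarray(ref_logp_w)))
    margin_l = np.sum(np.asarray(weights_l) * (np.asarray(policy_logp_l) - np.asarray(ref_logp_l)))
    z = beta * (margin_w - margin_l)
    # -log(sigmoid(z)) computed stably as log(1 + exp(-z)).
    return float(np.logaddexp(0.0, -z))
```

In an AttentionPO-style setup the weight arrays would be derived from the model's attention while it judges the pair; here they are plain inputs so the loss can be inspected in isolation.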

2605.20988 2026-05-27 cs.LG cs.AI

A Sharper Picture of Generalization in Transformers

Transformer 泛化能力的更清晰图景

Paul Lintilhac, Sair Shaikh

发表机构 * Thayer School of Engineering Dartmouth College(达特茅斯学院泰勒工程学院)

AI总结 本文通过PAC-Bayes理论研究Transformer在布尔域上的泛化行为,证明稀疏低阶频谱可实现低锐度构造并得到非平凡的泛化界,解释了思维链为何能改善高阶目标函数的泛化。

Comments 10 pages, 9 figures, 41 pages of supplementary material

详情
AI中文摘要

我们从目标函数的傅里叶谱角度研究Transformer在布尔域上的泛化行为。与先前基于Rademacher复杂度推导泛化界的工作(Edelman等人,2022;Trauger & Tosh,2024)不同,我们探讨了通过PAC-Bayes理论获得泛化界的可行性。我们证明,集中在低阶分量上的稀疏谱能够实现具有良好泛化性质的低锐度构造。我们的思路是证明存在实现任何稀疏度不超过上下文长度的布尔函数的平坦极小值,然后将PAC-Bayes界应用于一个理想化的低锐度学习器,从而得到一个非平凡的泛化界。我们利用这一点正式解释了为什么思维链能改善高阶目标函数的泛化,并展示了我们界中的复杂度参数可以通过性质测试高效估计。我们通过实验评估了预测,并进行了机制可解释性研究,以支持我们的理论构造在真实Transformer中的现实性。

英文摘要

We study transformers' generalization behavior on boolean domains from the perspective of the Fourier spectra of their target functions. In contrast to prior work (Edelman et al., 2022; Trauger & Tosh, 2024), which derived generalization bounds from Rademacher complexity, we investigate the feasibility of obtaining generalization bounds via PAC-Bayes theory. We show that sparse spectra concentrated on low-degree components enable low-sharpness constructions with good generalization properties. Our idea is to show the existence of flat minima implementing any boolean function of sparsity no greater than the context length, and then apply a PAC-Bayes bound to an idealized low-sharpness learner, resulting in a non-vacuous generalization bound. We use this to give a formal account of why chain-of-thought improves generalization for high-degree target functions, and show that the complexity parameters in our bound can be efficiently estimated via property testing. We evaluate predictions empirically and conduct a mechanistic interpretability study to support the realism of our theoretical construction in real transformers.
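The two spectral quantities the abstract leans on, sparsity and degree, can be made concrete with a brute-force computation. The sketch below uses the standard Fourier expansion of a function on {-1, +1}^n; the helper names are hypothetical and the enumeration is exponential in n, so it is only meant for tiny examples, not as the property-testing estimator the paper mentions.

```python
from itertools import combinations, product
from math import prod

def fourier_coefficients(f, n):
    """Return {S: coefficient} for the nonzero Fourier coefficients of
    f: {-1, +1}^n -> R, where S is a tuple of input indices."""
    points = list(product((-1, 1), repeat=n))
    coeffs = {}
    for k in range(n + 1):
        for subset in combinations(range(n), k):
            # Average of f(x) times the parity of x restricted to `subset`.
            total = sum(f(x) * prod(x[i] for i in subset) for x in points)
            c = total / len(points)
            if abs(c) > 1e-12:
                coeffs[subset] = c
    return coeffs

def sparsity_and_degree(coeffs):
    """Number of nonzero coefficients and the size of the largest subset."""
    return len(coeffs), max((len(s) for s in coeffs), default=0)

def parity2(x):
    return x[0] * x[1]

def majority3(x):
    return 1 if sum(x) > 0 else -1

parity_spectrum = fourier_coefficients(parity2, 3)
majority_spectrum = fourier_coefficients(majority3, 3)
```

A parity on two of three bits has a single nonzero coefficient (sparsity 1, degree 2), while three-bit majority spreads its weight over the three singletons and the full triple (sparsity 4, degree 3).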

2605.20914 2026-05-27 cs.CV

RISE: Reliable Improvement in Self-Evolving Vision-Language Models

RISE: 自进化视觉语言模型的可靠改进

Chaoran Xu, Yingmao Miao, Pengfei Zhang, Hao Dou, Lei Sun, Xiangxiang Chu

发表机构 * AMAP, Alibaba Group(阿里集团AMAP实验室)

AI总结 针对视觉语言模型自进化中角色交替粗粒度、问题质量下降和类型坍缩问题,提出RISE框架,通过细粒度角色交替、质量监督器和技能感知动态平衡实现可靠自进化。

详情
AI中文摘要

视觉语言模型(VLM)已具备强大的多模态推理能力,但进一步提升仍严重依赖大规模人工构建的监督信号进行后训练。这种监督信号获取成本高昂,尤其对于推理密集型多模态任务,其中问题、答案和反馈信号必须精心设计。这激发了自进化学习,即模型通过双角色闭环自我改进:提问者自主提出问题,求解者学习解答。然而,我们观察到当前的VLM自进化方法仍面临三大挑战:粗粒度的角色交替延迟了问题生成与求解者适应之间的交互;生成的问题质量可能逐渐下降;问题类型可能坍缩至狭窄分布。这些问题限制了自进化的效率和可靠性。因此,我们提出RISE,一个可靠的视觉语言模型自进化框架。RISE基于三个互补设计:细粒度角色交替,缩短提问者与求解者之间的反馈循环以提高效率;质量监督器,提高问题有效性和伪标签可靠性;以及技能感知动态平衡,在进化过程中缓解模式坍缩并保持广泛的技能覆盖。这些组件共同使得从无标签图像中实现更可靠和有效的自进化成为可能。在两个VLM骨干网络上的七个基准测试实验表明,RISE持续改进基础模型,带来广泛而持久的性能提升。我们的代码已公开在https://github.com/AMAP-ML/RISE。

英文摘要

Vision-language models (VLMs) have achieved strong multimodal reasoning capabilities, but further improving them still relies heavily on large-scale human-constructed supervision for post-training. Such supervision is costly to obtain, especially for reasoning-intensive multimodal tasks where questions, answers, and feedback signals must be carefully designed. This motivates self-evolving learning, where a model improves itself through a dual-role closed loop: a questioner autonomously poses questions and a solver learns to solve them. However, we observe that current VLM self-evolving methods still face three major challenges: coarse-grained role alternation delays the interaction between question generation and solver adaptation; generated questions can progressively degrade in quality; and question types may collapse toward a narrow distribution. These issues limit the efficiency and reliability of self-evolution. Thus, we propose RISE, a reliable self-evolving framework for vision-language models. RISE is built on three complementary designs: fine-grained role alternation, which shortens the feedback loop between the questioner and the solver to improve efficiency; a quality supervisor, which improves question validity and pseudo-label reliability; and skill-aware dynamic balancing, which mitigates mode collapse and maintains broad skill coverage during evolution. Together, these components enable more reliable and effective self-evolution from unlabeled images. Experiments on two VLM backbones across seven benchmarks show that RISE consistently improves the base models, yielding broad and sustained gains. Our code is publicly available at https://github.com/AMAP-ML/RISE.

2605.20690 2026-05-27 cs.AI

Declarative Data Services: Structured Agentic Discovery for Composing Data Systems

声明式数据服务:用于组合数据系统的结构化智能体发现

Shanshan Ye, Duo Lu

发表机构 * Northeastern University(东北大学) Brown University(布朗大学)

AI总结 提出声明式数据服务(DDS)架构,通过分层类型契约将全局搜索分解为有界子搜索,解决无界智能体发现无法稳定收敛的问题,并在交易后端工作负载上验证其有效性。

Comments Accepted at AI Agents for Discovery in the Wild (AID-Wild), Workshop at ACM CAIS 2026

详情
AI中文摘要

智能体发现已表明,在基准条件下,LLM驱动的搜索能够发现新颖的算法、设计和代码。将该范式迁移到多系统数据后端面临一个更困难的问题:搜索空间是异构的,验证器是部署栈是否实际运行,且组合知识在预训练中不均匀地捕获。即使添加了迭代和显式组合知识,无界智能体发现(一个基于失败日志反馈迭代的编码智能体)也无法在运行栈上一致收敛。我们提出声明式数据服务(DDS),一种从声明式用户意图中结构化智能体发现数据系统组合的架构。该框架在连续层(意图、操作DAG、每系统技能、运行时归因)拥有四个类型契约,将全局搜索分解为有界子搜索;子智能体搜索每个类型空间,而框架提供通道,使知识以内联技能引用的方式向前流动,错误以类型信号的方式向后路由。作为交易后端工作负载的生命证明,DDS在无界发现无法收敛的地方收敛;运行时失败成为技能补丁,下一次部署内联引用。我们将其定位为早期原型,报告来自真实世界数据系统组合的经验教训。

英文摘要

Agentic discovery has shown that LLM-driven search can find novel algorithms, designs, and code under benchmark conditions. Translating the paradigm to multi-system data backends surfaces a harder problem: the search space is heterogeneous, the verifier is whether a deployed stack actually runs, and composition knowledge is unevenly captured in pretraining. Unbounded agentic discovery, a coding agent iterating on failure-log feedback, fails to converge consistently on a working stack even when iteration and explicit composition knowledge are added. We propose Declarative Data Services (DDS), an architecture for structured agentic discovery of data-system compositions from declarative user intent. The framework owns four typed contracts at successive layers (intent, operator DAG, per-system skills, runtime attribution) that decompose the global search into bounded sub-searches; sub-agents search each typed space, while the framework provides the channels by which knowledge flows forward as inline skill citations and errors route backward as typed signals. As a proof of life on a trading-backend workload, DDS converges where unbounded discovery does not; runtime failures become skill patches that the next deployment cites inline. We position this as an early prototype reporting lessons from real-world data-system composition.

2605.20606 2026-05-27 cs.CV

Mind Your Margin and Boundary: Are Your Distilled Datasets Truly Robust?

注意你的边界:你的蒸馏数据集真的鲁棒吗?

Muquan Li, Yingyi Ma, Yihong Huang, Hang Gou, Ke Qin, Ming Li, Yuan-Fang Li, Tao He

发表机构 * The Laboratory of Intelligent Collaborative Computing of UESTC, Chengdu, China(电子科技大学智能协同计算实验室,中国成都) Monash University, Melbourne, Australia(莫纳什大学,澳大利亚墨尔本) Guangdong Laboratory of Artificial Intelligence(广东人工智能实验室)

AI总结 针对数据集蒸馏中鲁棒性不足的问题,提出一种结合攻击感知课程学习与对比鲁棒性目标的框架C²R,通过优先处理最小鲁棒边界的对抗样本并扩大类间决策边界分离度,显著提升鲁棒准确率。

Comments Accepted to ICML 2026

详情
AI中文摘要

数据集蒸馏(DD)将大型训练集压缩为小型合成集以进行高效训练,但大多数DD方法仅优化干净准确率而忽略鲁棒性。最近的鲁棒DD方法提高了鲁棒性,但通常面临较差的准确率-鲁棒性权衡,因为它们(i)统一对待所有对抗扰动样本,尽管鲁棒风险主要由接近零的鲁棒边界主导,以及(ii)没有明确增加攻击集中区域的决策边界类间分离。我们提出了对比课程鲁棒数据集蒸馏(C$^2$R),一个将攻击感知课程与对比鲁棒性目标相结合的框架。从鲁棒边界的角度,我们推导出一个扰动分数,近似每个样本的鲁棒铰链,从而能够优先考虑那些最直接驱动鲁棒误差的最小边界对抗样本。同时,一个类平衡的对比鲁棒性损失在明确扩大跨类别边界分离的同时强制执行对抗不变性。在CIFAR-10/100、Tiny-ImageNet和多个ImageNet-1K子集上进行的六种攻击实验表明,C$^2$R实现了最佳的鲁棒准确率,平均优于先前的鲁棒DD方法2.8%。

英文摘要

Dataset distillation (DD) compresses a large training set into a small synthetic set for efficient training, but most DD methods optimize only clean accuracy and leave robustness uncontrolled. Recent robust DD methods improve robustness, yet they often suffer from a poor accuracy-robustness trade-off because they (i) treat all adversarially perturbed examples uniformly, despite robust risk being dominated by near-zero robust margins, and (ii) do not explicitly increase inter-class separation in the decision boundary where attacks concentrate. We present Contrastive Curriculum for Robust Dataset Distillation (C$^2$R), a framework that couples an attack-aware curriculum with a contrastive robustness objective. From a robust-margin perspective, we derive a perturbation score that approximates each sample's robust hinge, enabling a curriculum that prioritizes the smallest-margin adversaries that most directly drive robust error. In parallel, a class-balanced contrastive robustness loss enforces adversarial invariance while explicitly widening boundary separation across classes. Experiments on CIFAR-10/100, Tiny-ImageNet, and multiple ImageNet-1K subsets under six attacks show that C$^2$R achieves the best robust accuracy, outperforming prior robust DD by $2.8$% on average.

2605.20291 2026-05-27 cs.LG

Weasel: Out-of-Domain Generalization for Web Agents via Importance-Diversity Data Selection

Weasel: 通过重要性-多样性数据选择实现Web智能体的域外泛化

Fatemeh Pesaran Zadeh, Seyeon Choi, Xing Han Lù, Siva Reddy, Gunhee Kim

发表机构 * Seoul National University(首尔国立大学) McGill University(麦吉尔大学) Mila -- Quebec AI Institute(Mila魁北克人工智能研究所) Canada CIFAR AI Chair(加拿大CIFAR人工智能讲席)

AI总结 提出Weasel方法,通过优化平衡单步重要性与状态、网站、交互模式成对多样性的目标,选择固定预算的轨迹子集,结合目标中心AXTree剪枝和风格一致理由替换,提升Web智能体离线训练的域外泛化性能并降低训练成本。

Comments ICML 2026. Code is released at https://github.com/fatemehpesaran310/weasel

详情
AI中文摘要

大型语言模型(LLMs)使得Web智能体能够通过多步浏览器交互遵循自然语言目标。然而,在特定轨迹和领域上微调的智能体通常难以泛化到域外,且离线训练可能因噪声、冗余轨迹和长可访问性树(AXTree)状态而计算效率低下。为了解决这两个问题,我们提出了Weasel,一种用于Web智能体离线训练的轨迹选择方法。Weasel通过优化一个平衡状态、网站和交互模式上的单步重要性与成对多样性的目标,选择固定预算的轨迹步骤子集,并使用贪心算法高效求解。我们进一步通过目标中心AXTree剪枝(仅保留真实动作目标周围的内容)提高效率,并通过用模型生成的、风格一致的理由替换专家轨迹,缓解推理原生模型的风格不匹配问题。在AgentTrek和NNetNav训练数据集上,以及在WebArena、WorkArena和MiniWob中的评估,以及使用Qwen2.5-7B、Gemma3-4B和Qwen3-8B的实验表明,Weasel在降低训练成本的同时提高了域外性能,相比标准微调实现了约9.7-12.5倍的训练加速。我们在https://github.com/fatemehpesaran310/weasel提供代码。

英文摘要

Large language models (LLMs) have enabled web agents that follow natural language goals through multi-step browser interactions. However, agents fine-tuned on specific trajectories and domains often struggle to generalize out of domain, and offline training can be compute-inefficient due to noisy, redundant trajectories and long accessibility-tree (AXTree) states. To address both issues, we propose Weasel, a trajectory selection method for offline training of web agents. Weasel selects a fixed-budget subset of trajectory steps by optimizing an objective that balances unary importance with pairwise diversity over states, websites, and interaction patterns, which it solves efficiently with a greedy algorithm. We further improve efficiency with target-centered AXTree pruning that keeps only content around the ground-truth action target, and we mitigate style mismatch for reasoning-native models by replacing expert traces with model-generated, style-consistent rationales. Across AgentTrek and NNetNav training datasets, evaluations in WebArena, WorkArena, and MiniWob, and experiments with Qwen2.5-7B, Gemma3-4B, and Qwen3-8B, Weasel improves out-of-domain performance while reducing training cost, producing roughly 9.7-12.5$\times$ training speedups over standard fine-tuning. We make the code available at https://github.com/fatemehpesaran310/weasel.
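The kind of greedy selection the abstract describes can be sketched generically. The code below is not Weasel's implementation: the objective (sum of unary importance plus `lam` times the pairwise distances inside the chosen set), the `lam` trade-off parameter, and the toy scores are assumptions made for illustration only.

```python
def greedy_select(importance, distance, budget, lam=1.0):
    """Greedily pick up to `budget` indices, at each step adding the item
    with the largest marginal gain under the assumed objective
        sum_i importance[i] + lam * sum_{i<j} distance[i][j]
    taken over the selected set."""
    selected = []
    remaining = set(range(len(importance)))
    while remaining and len(selected) < budget:
        def gain(i):
            # Unary term plus diversity with respect to what is already chosen.
            return importance[i] + lam * sum(distance[i][j] for j in selected)
        best = max(sorted(remaining), key=gain)  # ties go to the lowest index
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy data: steps 0 and 1 are near-duplicates, step 2 is different but less important.
importance = [1.0, 0.9, 0.5]
distance = [
    [0.0, 0.0, 1.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
]
with_diversity = greedy_select(importance, distance, budget=2, lam=1.0)
importance_only = greedy_select(importance, distance, budget=2, lam=0.0)
```

With the diversity term switched on, the second pick is the dissimilar step rather than the redundant near-duplicate, which is the behavior a fixed-budget selector needs when trajectories are noisy and repetitive.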

2605.20255 2026-05-27 cs.LG cs.AI cs.HC cs.RO

Multi-Agent Reinforcement Learning for Safe Autonomous Driving Under Pedestrian Behavioral Uncertainty

行人行为不确定性下安全自动驾驶的多智能体强化学习

Prakash Aryan, Kaushik Raghupathruni, Timo Kehrer, Sebastiano Panichella

发表机构 * University of Bern(伯尔尼大学) AI4I, The Italian Institute of Artificial Intelligence(意大利人工智能研究所)

AI总结 本文使用多智能体近端策略优化(MAPPO)联合训练自动驾驶汽车和12个行人,通过隐藏的行人特质模拟乱穿马路行为,相比固定策略基线显著降低了碰撞率,并揭示了速度差异指标可用于检测未预期的乱穿马路行为。

Comments Accepted to ICRA 2026 Workshop "8th Workshop on Long-term Human Motion Prediction"

详情
AI中文摘要

自动驾驶汽车(SDC)的仿真测试通常依赖脚本化行人模型,这些模型无法捕捉真实过街行为的异质性和不确定性,限制了安全评估的真实性,尤其是对于由车辆无法观察到的潜在人格特质支配的乱穿马路行为。我们假设,通过多智能体强化学习(MARL)联合训练行人和SDC,相比针对固定行人策略训练,能产生更真实的交互场景,并且可预测与不可预测过街行为之间的差距可以直接从轨迹中测量。我们使用多智能体近端策略优化(MAPPO)联合训练一个SDC和12个行人:行人移动遵循脚本化的Dijkstra路径规划,而RL策略控制高层的前进/等待决策,乱穿马路概率取决于每个行人在回合开始时采样并隐藏于SDC的特质。在500回合评估中,联合训练的SDC达到78%的目标完成率,碰撞率为14%,而最佳基于规则的基线分别为35%和33%。速度差异指标显示,在近距离(0-3米)范围内,SDC在乱穿马路者附近比在人行横道使用者附近快2.65米/秒,表明乱穿马路遭遇未被预期。乱穿马路占过街事件的13%,但占碰撞的62%,并且联合训练相比单智能体RL减少了30%的碰撞,因为行人学会了在SDC高速接近时等待。

英文摘要

Simulation-based testing of self-driving cars (SDCs) typically relies on scripted pedestrian models that do not capture the heterogeneity and uncertainty of real crossing behavior, limiting the realism of safety assessments, especially for jaywalking, which is governed by latent personality traits the vehicle cannot observe. We hypothesize that jointly training pedestrians and the SDC with multi-agent reinforcement learning (MARL) yields more realistic interaction scenarios than training against fixed pedestrian policies, and that the behavior gap between predictable and unpredictable crossings can be measured directly from trajectories. We co-train an SDC and 12 pedestrians using Multi-Agent Proximal Policy Optimization (MAPPO): pedestrian locomotion follows scripted Dijkstra pathfinding while an RL policy controls high-level go/wait decisions, and jaywalking probability depends on a per-pedestrian trait sampled at episode start and hidden from the SDC. In 500-episode evaluations, the co-trained SDC reached 78% of goals with a 14% collision rate, versus 35%/33% for the best rule-based baseline. A speed differential metric shows the SDC traveled 2.65 m/s faster near jaywalkers than near crosswalk users at close range (0-3 m), indicating jaywalking encounters were not anticipated. Jaywalking was 13% of crossing events but 62% of collisions, and co-training reduced collisions by 30% relative to single-agent RL as pedestrians learned to wait when the SDC approached at speed.

2605.19969 2026-05-27 cs.LG

Your Neighbors Know: Leveraging Local Neighborhoods for Backdoor Detection in Decentralized Learning

你的邻居知道:利用局部邻居进行去中心化学习中的后门检测

Sayan Biswas, Antoine Boutet, Davide Frey, Romaric Gaudel, Rachid Guerraoui, Maxime Jacovella, Anne-Marie Kermarrec, Dimitri Lerévérend, François Taïani, Martijn de Vos

发表机构 * EPFL(洛桑联邦理工学院) Inria, INSA Lyon, CITI(法国国家信息与自动化研究所、里昂国立应用科学学院、CITI) Univ. Rennes, Inria, CNRS, IRISA(雷恩大学、法国国家信息与自动化研究所、法国国家科学研究中心、IRISA)

AI总结 提出Argus框架,通过局部邻居协作分析模型更新并利用结构相似性度量区分真实后门与数据异构性导致的误报,实现去中心化学习中的后门检测,并提供理论收敛保证。

Comments 34 pages, 10 figures

详情
AI中文摘要

去中心化学习(DL)是一种新兴的机器学习范式,其中节点在没有中央服务器的情况下协作训练模型。然而,DL的协作性质使其容易受到后门攻击,即模型被训练为在标准输入上表现正常,而在遇到带有特定触发器的数据时执行隐藏的恶意行为。DL中的后门攻击仍未得到充分研究,现有防御措施常常忽视DL的约束。我们引入了Argus,一种原生于DL的新型后门检测框架,它既不需要中央协调器,也不需要预先知道触发器。在Argus中,诚实节点本地分析接收到的模型更新以识别潜在的后门触发器。然后,节点集体与邻居共享其触发器,并使用结构相似性度量将真实后门与数据异构性引起的误报区分开。一个关键见解是,假阳性触发器在不同参与者之间表现出不一致性,而真阳性触发器则呈现一致的模式。未通过此协作测试的模型更新被拒绝,持续恶意的发送者最终被驱逐。我们首次为特定于DL的后门检测机制提供了理论收敛保证,表明以高概率过滤可疑模型更新可保持与标准DL相当的收敛速度。我们在三个标准数据集上实现了Argus,并针对三个最先进的基线进行了评估。在各种设置下,与无防御相比,Argus将攻击成功率降低了多达90个百分点,同时将模型效用保持在全知神谕的5个百分点以内。此外,随着数据异构性的增加,Argus相对于基线的有效性也有所提高。

英文摘要

Decentralized learning (DL) is an emerging machine learning paradigm where nodes collaboratively train models without a central server. However, the collaborative nature of DL makes it vulnerable to backdoor attacks, where a model is taught to behave normally on standard inputs while executing hidden, malicious actions when encountering data with specific triggers. Backdoor attacks in DL remain understudied and existing defenses often overlook DL constraints. We introduce Argus, a novel backdoor detection framework native to DL that requires neither a central coordinator nor prior knowledge of the trigger. In Argus, honest nodes locally analyze received model updates to identify potential backdoor triggers. Nodes then collectively share their triggers with their neighbors and use a structural similarity metric to separate true backdoors from false alarms induced by data heterogeneity. A key insight is that false positive triggers exhibit inconsistencies across participants while true positive ones show consistent patterns. Model updates that fail this collaborative test are rejected, and persistently malicious senders are eventually evicted. We provide the first theoretical convergence guarantees for a DL-specific backdoor detection mechanism, showing that filtering out suspicious model updates with high probability preserves a convergence rate comparable to standard DL. We implement and evaluate Argus on three standard datasets and against three state-of-the-art baselines. Across settings, Argus reduces attack success rates by up to 90 points compared to no defense, while preserving model utility within 5 percentage points of an omniscient oracle. Furthermore, the effectiveness of Argus compared to baselines improves as data heterogeneity increases.

2605.19908 2026-05-27 cs.CL

Where Does Authorship Signal Emerge in Encoder-Based Language Models?

作者身份信号在基于编码器的语言模型中出现在哪里?

Francis Kulumba, Guillaume Vimont, Laurent Romary, Florian Cafiero

发表机构 * Inria Paris(法国国家信息与自动化研究所巴黎中心) Sorbonne Université(索邦大学) IRIF(IRIF研究所) LRE, EPITA(EPITA研究实验室LRE) Ecole nationale des chartes – PSL(法国国立文献学院 – 巴黎文理研究大学)

AI总结 通过机制可解释性工具,研究不同评分机制对基于编码器的作者身份归因模型性能的影响,发现评分机制决定了编码器在何处整合作者身份信号。

Comments 12 pages, 6 figures. Under review

详情
AI中文摘要

使用相同的预训练编码器、数据和损失进行微调的作者身份归因模型,其性能可能相差四倍,这仅取决于它们的评分机制。我们使用机制可解释性工具来解释这一差距。诸如词长、标点密度和功能词频率等风格特征在我们探测的每个模型的每一层中都是相似的,包括一个现成的控制编码器,这表明差距不是由它们的线性可读性解释的。相反,因果干预表明,评分器似乎决定了编码器在何处整合作者身份信号。平均池化迫使信号在早期到中期层整合,而后期交互则将其推迟到后期层。我们进一步从每个评分器的梯度结构中推导出这种差异,训练动态揭示了遵循该差异的不同学习轨迹。

英文摘要

Authorship attribution models fine-tuned with the same pretrained encoder, data, and loss can differ four-fold in performance depending only on their scoring mechanism. We use mechanistic interpretability tools to explain this gap. Stylistic features such as word length, punctuation density, and function-word frequency are similarly available at every layer in every model we probe, including an off-the-shelf control encoder, suggesting that the gap is not explained by their linear readability. Instead, causal intervention shows that the scorer appears to determine where the encoder consolidates authorship signal. Mean pooling forces consolidation by early to mid layers, while late interaction defers it to later layers. We further derive this difference from the gradient structure of each scorer, and training dynamics reveal distinct learning trajectories that follow from that difference.

2605.19186 2026-05-27 cs.AI

Discoverable Agent Knowledge -- A Formal Framework for Agentic KG Affordances (Extended Version)

可发现的智能体知识——智能体知识图谱能力的形式化框架(扩展版)

Terry R. Payne, Valentina Tamma, Enrico Daga

发表机构 * School of Computer Science and Informatics, University of Liverpool, UK(利物浦大学计算机科学与信息学学院) Open University(开放大学)

AI总结 本文提出一个四维形式化框架(语义表达性、智能体可发现性、任务相对基础性和认知信任范围),并从中推导出智能体能力概况(AAP),作为VoID和DCAT之上的语义层,支持智能体在规划时进行原则性的知识图谱选择、组合和故障诊断。

详情
AI中文摘要

二十年前,语义网服务社区被问及具有不同本体承诺的智能体如何能够连贯地发现、组合和调用网络服务。答案是OWL-S和WSMO:形式化的能力描述,指定服务能做什么、智能体为了认知上合理调用必须已经知道什么,以及如何形式化地桥接本体不匹配。当前的知识图谱元数据标准(如VoID和DCAT)描述了知识图谱包含什么,但没有说明特定智能体能从中证明什么、空结果受什么封闭假设支配,或者智能体的任务词汇是否在模式中有基础。此外,在已部署的知识图谱中,控制模式描述逻辑和操作性的蕴涵机制可能不同:这是一种当前元数据不可见的认知失效模式。我们针对知识图谱环境重新审视并扩展这些见解,提出了一个四维形式化框架:语义表达性、智能体可发现性、任务相对基础性和认知信任范围,从中我们推导出智能体能力概况(AAP):一个位于VoID和DCAT之上的语义层,使智能体在规划时能够进行原则性的知识图谱选择、组合和故障诊断。这四个维度在单个智能体层面操作化了本体连续体的能力结构,特别用于知识图谱选择、组合和故障诊断。一个来自学术搜索任务的实例具体化了该框架,并通过五点研究议程指出了实现基于AAP的能力匹配规模化所需的形式化、计算和工程工作。

英文摘要

Two decades ago, the Semantic Web Services community was asked how agents with different ontological commitments could discover, compose, and invoke web services coherently. The response was OWL-S and WSMO: formally grounded capability descriptions specifying what a service could do, what the agent must already know for invocation to be epistemically sound, and how ontological mismatches could be formally bridged. Current KG metadata standards such as VoID and DCAT describe what a KG contains, yet say nothing about what a specific agent can prove from it, what closure assumptions govern empty results, or whether the agent's task vocabulary is grounded in the schema. Furthermore, in deployed KGs the governing schema DL and the operative entailment regime can diverge: an epistemic failure mode invisible to current metadata. We revisit and extend these insights for the KG setting with a four-dimensional formal framework (Semantic Expressivity, Agentic Discoverability, Task-Relative Grounding, and Epistemic Trust Scope), from which we derive the Agentic Affordance Profile (AAP): a semantic layer above VoID and DCAT enabling principled KG selection, composition, and failure diagnosis at agent planning time. The four dimensions operationalise the affordance structure of the Ontological Continuum at the individual-agent level, specifically for KG selection, composition, and failure diagnosis. A worked example drawn from a scholarly-search task concretely grounds the framework, and identifies the formal, computational, and engineering work needed to realise AAP-based affordance matching at scale through a five-point research agenda.

2605.17036 2026-05-27 cs.AI cs.LG cs.MA cs.SY eess.SY

Reliability and Effectiveness of Autonomous AI Agents in Supply Chain Management

自主AI代理在供应链管理中的可靠性与有效性

Carol Xuan Long, David Simchi-Levi, Feng Zhu, Huangyuan Su, Andre P. Calmon, Flavio P. Calmon

发表机构 * Harvard University(哈佛大学) MIT/Purdue(麻省理工学院/普渡大学) MIT(麻省理工学院) Harvard University/Kempner Institute(哈佛大学/肯普纳研究所) Georgia Tech(佐治亚理工学院)

AI总结 本文通过MIT啤酒游戏研究多级供应链中的自主生成式AI代理,发现模型能力是性能主导因素,但平均性能掩盖可靠性风险,并引入代理牛鞭效应,提出基于GRPO的后训练框架以提高可靠性。

详情
AI中文摘要

本文使用MIT啤酒游戏研究多级供应链中的自主生成式AI代理。我们确定了影响性能的四个推理时杠杆:模型选择、策略和护栏、集中数据共享以及提示工程。模型能力是主导因素:开箱即用的推理模型超越人类水平性能,优化后的推理模型相对于人类团队将成本降低高达67%。然而,强劲的平均性能掩盖了显著的可靠性风险。我们引入了代理牛鞭效应:自主多级系统中运行间决策不稳定性的放大。其中一个核心组成部分是决策牛鞭效应,即由随机代理决策而非客户需求变化产生的订单变异性部分。我们表明,即使需求路径固定,决策不稳定性也可以在固定时间点跨设施以及同一设施内随时间放大。重复采样(一种自然的测试时补救措施)未能显著减少这种不稳定性,这表明可靠性需要改变底层决策策略,而不仅仅是平均模型输出。为解决这一限制,我们提出了一种基于组相对策略优化(GRPO)的强化学习后训练框架,该框架使用系统级供应链奖励训练共享的基础LLM。后训练显著减少了尾部事件,抑制了代理牛鞭效应,并提高了自主供应链代理的可靠性。

英文摘要

This paper studies autonomous generative AI agents in multi-echelon supply chains using the MIT Beer Game. We identify four inference-time levers that shape performance: model selection, policies and guardrails, centralized data sharing, and prompt engineering. Model capability is the dominant factor: an out-of-the-box reasoning model exceeds human-level performance, and optimized reasoning models reduce costs by up to 67% relative to human teams. However, strong average performance masks substantial reliability risks. We introduce agent bullwhip: the amplification of run-to-run decision instability in autonomous multi-echelon systems. A central component is decision bullwhip, the portion of order variability generated by stochastic agent decisions rather than by changes in customer demand. We show that decision instability can amplify both across facilities at a fixed point in time and within the same facility over time, even when the demand path is held fixed. Repeated sampling, a natural test-time remedy, fails to meaningfully reduce this instability, suggesting that reliability requires changing the underlying decision policy rather than merely averaging over model outputs. To address this limitation, we propose a Group Relative Policy Optimization (GRPO)-based reinforcement-learning post-training framework that trains a shared base LLM using system-level supply-chain rewards. Post-training substantially reduces tail events, curtails agent bullwhip, and improves the reliability of autonomous supply-chain agents.
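One way to quantify run-to-run instability under a fixed demand path is to compare the variance of orders across repeated runs at successive echelons. The sketch below is an assumed measurement, not the paper's definition of agent bullwhip or decision bullwhip: the averaging over periods, the ratio form, and the toy order data are all hypothetical.

```python
from statistics import pvariance

def run_to_run_variance(orders_by_run):
    """orders_by_run[r][t] is the order placed in run r at period t,
    with every run facing the same demand path. Returns the variance
    across runs, averaged over periods."""
    per_period = [pvariance(period) for period in zip(*orders_by_run)]
    return sum(per_period) / len(per_period)

def amplification(upstream_runs, downstream_runs):
    """Ratio of upstream to downstream run-to-run variance. A value above 1
    means instability grows as orders move up the chain. Undefined when the
    downstream echelon shows no run-to-run variation."""
    return run_to_run_variance(upstream_runs) / run_to_run_variance(downstream_runs)

# Two repeated runs over two periods with identical customer demand.
retailer_runs = [[4, 4], [6, 6]]
wholesaler_runs = [[2, 2], [10, 10]]
ratio = amplification(wholesaler_runs, retailer_runs)
```

Because demand is held fixed, any variance measured this way comes from the agents' own stochastic decisions rather than from customers, which is the distinction the abstract draws.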

2605.16457 2026-05-27 cs.LG cs.AI cs.CV

Identifiable Token Correspondence for World Models

可辨识的令牌对应关系用于世界模型

Youngin Kim, Ray Sun, Inho Kim, Bumsoo Park, Hyun Oh Song

发表机构 * Interdisciplinary Program in Artificial Intelligence, Seoul National University(首尔国立大学人工智能交叉学科项目) Department of Computer Science and Engineering, Seoul National University(首尔国立大学计算机科学与工程系)

AI总结 提出可辨识的令牌对应关系(ITC)方法,通过将下一帧预测建模为结构化分配问题,解决基于令牌的Transformer世界模型在长程推演中的时间不一致性,在四个基准上达到最先进性能。

详情
AI中文摘要

基于令牌的Transformer世界模型在视觉强化学习中表现出色,但常在长程推演中出现时间不一致性,包括对象重复、消失和变形。一个关键原因是大多数现有方法将下一帧预测纯粹视为令牌生成问题,而未考虑令牌在时间上的持续性。我们引入可辨识的令牌对应关系(ITC),这是一种用于基于令牌的Transformer世界模型的解码步骤,将下一帧预测建模为具有潜在令牌对应变量的结构化分配问题:每个下一帧令牌要么通过从上一帧复制令牌来解释,要么通过生成新令牌来解释。ITC保持Transformer架构和训练过程不变,可以添加到现有骨干网络上。我们的实验在4个具有挑战性的基准上展示了最先进的性能。所提出的方法在Craftax-classic基准上实现了72.5%的回报率和35.6%的分数,显著超过了之前的最佳结果67.4%和27.9%。我们在https://github.com/snu-mllab/Identifiable-Token-Correspondence上发布了源代码。

英文摘要

Token-based transformer world models have shown strong performance in visual reinforcement learning, but often suffer from temporal inconsistency in long-horizon rollouts, including object duplication, disappearance, and transmutation. A key reason is that most existing approaches treat next-frame prediction purely as a token generation problem, without considering the persistence of tokens across time. We introduce Identifiable Token Correspondence (ITC), a decoding step for token-based transformer world models that formulates next-frame prediction as a structured assignment problem with latent token correspondence variables: each next-frame token is explained either by copying a token from the previous frame or by generating a new one. ITC leaves the transformer architecture and training procedure unchanged and can be added on top of existing backbones. Our experiments show state-of-the-art performance on 4 challenging benchmarks. The proposed method achieves a return of 72.5% and a score of 35.6% on the Craftax-classic benchmark, significantly surpassing the previous best of 67.4% and 27.9%. We release our source code on https://github.com/snu-mllab/Identifiable-Token-Correspondence.

2605.04880 2026-05-27 cs.LG cs.AI

A Harmonic Mean Formulation of Average Reward Reinforcement Learning in SMDPs

SMDP中平均奖励强化学习的调和均值公式

Erel Shtossel, Alicia Vidler, Uri Shaham, Gal A. Kaminka

发表机构 * Bar Ilan University(巴伊兰大学)

AI总结 针对无限时域非回合制任务中的平均奖励强化学习,提出一种修正的调和均值算子,解决SMDP中奖励和持续时间非平稳时的奖励率计算问题,并证明其理论性质及有效性。

Journal ref https://alaworkshop2026.github.io/papers/ALA2026_paper_57.pdf

详情
AI中文摘要

最近的研究重新激发并增强了对无限时域、非回合制(持续)任务中未折扣平均奖励强化学习算法的兴趣。半马尔可夫决策过程(SMDP)尤其引人关注。在SMDP中,离散动作随机产生奖励和持续时间,目标是优化平均奖励率。现有算法通过优化奖励与持续时间的比率来逼近这一目标。然而,当奖励和持续时间(在无限时域中)非平稳时,这种方法可能不正确。本文提出一种新颖的修正调和均值算子,即使在上述条件下也能正确计算奖励率。这产生了可以与SMDP一起工作的无模型学习算法,同时保持对随时间变化的非平稳奖励和持续时间分布的鲁棒性。我们证明了修正调和均值算子的理论性质,并通过实验与现有算法相比展示了其有效性。

英文摘要

Recent research has revived and amplified interest in algorithms for undiscounted average reward reinforcement learning in infinite-horizon, non-episodic (continuing) tasks. Semi-Markov decision processes (SMDPs) are of particular interest. In SMDPs, discrete actions stochastically generate both rewards and durations, and the objective is to optimize the average reward rate. Existing algorithms approach this by optimizing the ratio of rewards to durations. However, when rewards and durations are non-stationary (in the infinite horizon), this can be incorrect. This paper presents a novel modified harmonic mean operator that correctly computes reward rates even under such conditions. This yields model-free learning algorithms that can work with SMDPs, while maintaining robustness to non-stationary reward and duration distributions over time. We prove theoretical properties of the modified harmonic mean operator, and empirically demonstrate its efficacy in comparison to existing algorithms.
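A small worked example shows why the reward rate in an SMDP is not the plain average of per-step reward/duration ratios, and how a harmonic mean enters. The sketch uses only the textbook identity that total reward divided by total duration equals the reward-weighted harmonic mean of the per-step rates; it is not the paper's modified harmonic mean operator, and the function names and numbers are illustrative.

```python
def reward_rate(rewards, durations):
    """Average reward per unit time: total reward over total duration."""
    return sum(rewards) / sum(durations)

def mean_of_ratios(rewards, durations):
    """Arithmetic mean of per-step rates; generally NOT the reward rate."""
    return sum(r / d for r, d in zip(rewards, durations)) / len(rewards)

def weighted_harmonic_mean(values, weights):
    """Weighted harmonic mean: sum(w) / sum(w / v). Assumes nonzero values."""
    return sum(weights) / sum(w / v for w, v in zip(weights, values))

# Two actions, each paying reward 1, lasting 1 and 3 time units.
rewards = [1.0, 1.0]
durations = [1.0, 3.0]
rates = [r / d for r, d in zip(rewards, durations)]

true_rate = reward_rate(rewards, durations)         # 2 / 4
naive_rate = mean_of_ratios(rewards, durations)     # (1 + 1/3) / 2
harmonic_rate = weighted_harmonic_mean(rates, rewards)
```

Here the naive average of ratios overstates the rate (about 0.667 against 0.5), while the reward-weighted harmonic mean of the per-step rates recovers it, the same reason average speed over a trip is a harmonic rather than arithmetic mean.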

2605.02207 2026-05-27 cs.CV cs.AI cs.LG

MultiSense-Pneumo: A Multimodal Learning Framework for Pneumonia Screening in Resource-Constrained Settings

MultiSense-Pneumo:面向资源受限环境中肺炎筛查的多模态学习框架

Dineth Jayakody, Pasindu Thenahandi, Chameli Dommanige

发表机构 * Department of Computer Science, Old Dominion University, VA, USA(欧道明大学计算机科学系,美国弗吉尼亚州)

AI总结 提出MultiSense-Pneumo多模态原型系统,整合症状、咳嗽音频、语音和胸片,通过可解释的后期融合实现肺炎筛查与分诊支持。

详情
AI中文摘要

肺炎仍然是全球发病率和死亡率的主要原因,尤其是在低资源环境中,那里缺乏影像学、实验室检测和专科护理。临床评估依赖于异质性证据,包括症状、呼吸模式、口头描述和胸部影像,使得一线筛查本质上是多模态的。然而,许多现有的计算方法仍然是单模态的,并且主要关注放射影像。在这项工作中,我们提出了MultiSense-Pneumo,一个面向肺炎筛查和分诊支持的多模态研究原型,它整合了结构化症状描述符、咳嗽音频、口语和胸部X光片。该系统结合了确定性症状分诊、基于LightGBM的声学分类、使用ResNet-18的域对抗放射影像分析、基于Transformer的语音识别以及可解释的后期融合算子。每个模态被转换为归一化的关注信号,并聚合为统一的筛查估计。融合权重是手动指定的,被视为启发式、可解释的参数,而不是学习或临床优化的值。MultiSense-Pneumo的设计考虑了在标准笔记本电脑级硬件上的离线执行,但并未作为经过部署验证或临床验证的诊断系统呈现。实验结果表明,在合成域偏移下,放射影像路径具有强大的组件级性能,同时也突出了重要的局限性,特别是咳嗽声学的异常类别召回率降低以及缺乏配对的端到端多模态患者评估。因此,MultiSense-Pneumo旨在作为筛查和分诊研究的框架和组件级原型。

英文摘要

Pneumonia remains a leading global cause of morbidity and mortality, particularly in low-resource settings where access to imaging, laboratory testing, and specialist care is limited. Clinical assessment relies on heterogeneous evidence, including symptoms, respiratory patterns, spoken descriptions, and chest imaging, making frontline screening inherently multimodal. However, many existing computational approaches remain unimodal and focus primarily on radiographs. In this work, we present MultiSense-Pneumo, a multimodal research prototype for pneumonia-oriented screening and triage support that integrates structured symptom descriptors, cough audio, spoken language, and chest radiographs. The system combines deterministic symptom triage, LightGBM-based acoustic classification, domain-adversarial radiograph analysis using ResNet-18, transformer-based speech recognition, and an interpretable late-fusion operator. Each modality is transformed into a normalized concern signal and aggregated into a unified screening estimate. The fusion weights are hand-specified and are treated as heuristic, interpretable parameters rather than learned or clinically optimized values. MultiSense-Pneumo is implemented with offline execution in mind on standard laptop-class hardware, but it is not presented as a deployment-validated or clinically validated diagnostic system. Experimental results demonstrate strong component-level performance of the radiograph pathway under synthetic domain shifts, while also highlighting important limitations, especially reduced abnormal-class recall for cough acoustics and the absence of paired end-to-end multimodal patient evaluation. MultiSense-Pneumo is therefore intended as a framework and component-level prototype for screening and triage research.
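The late-fusion step can be pictured as a weighted average of per-modality concern signals. The sketch below is an assumption-laden illustration, not the system's actual operator: the weighted-average form, the renormalization over available modalities, the modality names, and the weight values are all hypothetical, and nothing here is clinically meaningful.

```python
def late_fusion(signals, weights):
    """Combine normalized concern signals (each in [0, 1]) into one estimate.

    signals maps modality name -> concern value, or None when that modality
    was not collected. Weights are renormalized over the modalities present
    (an assumed way of handling missing inputs)."""
    present = {m: s for m, s in signals.items() if s is not None}
    if not present:
        raise ValueError("at least one modality is required")
    total_weight = sum(weights[m] for m in present)
    return sum(weights[m] * s for m, s in present.items()) / total_weight

# Hand-specified, purely illustrative weights.
weights = {"symptoms": 0.2, "cough": 0.2, "radiograph": 0.6}
signals = {"symptoms": 0.5, "cough": None, "radiograph": 1.0}
estimate = late_fusion(signals, weights)
```

Keeping the weights explicit is what makes such an operator inspectable: each modality's contribution to the final estimate can be read directly from its weight and signal.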

2605.08146 2026-05-27 cs.CV cs.AI

VT-Bench: A Unified Benchmark for Visual-Tabular Multi-Modal Learning

VT-Bench:视觉-表格多模态学习的统一基准

Zi-Yi Jia, Zi-Jian Cheng, Xin-Yue Zhang, Kun-Yang Yu, Zhi Zhou, Yu-Feng Li, Lan-Zhe Guo

发表机构 * National Key Laboratory for Novel Software Technology, Nanjing University, China(新型软件技术国家重点实验室,南京大学,中国) School of Intelligence Science and Technology, Nanjing University, China(智能科学与技术学院,南京大学,中国) School of Artificial Intelligence, Nanjing University, China(人工智能学院,南京大学,中国)

AI总结 提出首个视觉-表格多模态基准VT-Bench,涵盖9个领域14个数据集,评估23个模型,揭示视觉-表格学习的挑战。

详情
AI中文摘要

多模态学习在视觉-文本任务中引起了广泛关注。然而,在医疗和工业等高风险领域起关键作用的视觉-表格数据仍未得到充分探索。本文介绍了VT-Bench,这是第一个用于标准化视觉-表格判别预测和生成推理任务的统一基准。VT-Bench汇集了9个领域(以医疗为中心,同时涵盖宠物、媒体和交通)的14个数据集,超过756K个样本。我们评估了23个代表性模型,包括单模态专家、专门的视觉-表格模型、通用视觉-语言模型(VLM)和工具增强方法,突出了视觉-表格学习的重大挑战。我们相信VT-Bench将激励社区构建更强大的多模态视觉-表格基础模型。 基准:https://github.com/Ziyi-Jia990/VT-Bench

英文摘要

Multi-modal learning has attracted great attention in visual-text tasks. However, visual-tabular data, which plays a pivotal role in high-stakes domains like healthcare and industry, remains underexplored. In this paper, we introduce VT-Bench, the first unified benchmark for standardizing vision-tabular discriminative prediction and generative reasoning tasks. VT-Bench aggregates 14 datasets across 9 domains (medical-centric, while covering pets, media, and transportation) with over 756K samples. We evaluate 23 representative models, including unimodal experts, specialized visual-tabular models, general-purpose vision-language models (VLMs), and tool-augmented methods, highlighting substantial challenges of visual-tabular learning. We believe VT-Bench will stimulate the community to build more powerful multi-modal vision-tabular foundation models. Benchmark: https://github.com/Ziyi-Jia990/VT-Bench

2511.19741 2026-05-27 cs.CV

Efficient Transferable Optimal Transport via Min-Sliced Transport Plans

通过最小切片传输计划的高效可迁移最优传输

Xinran Liu, Elaheh Akbari, Rocio Diaz Martin, Navid NaderiAlizadeh, Soheil Kolouri

发表机构 * Department of Computer Science, Vanderbilt University(范德比大学计算机科学系) Department of Mathematics, Florida State University(佛罗里达州立大学数学系) Department of Biostatistics & Bioinformatics, Duke University(杜克大学生物统计学与生物信息学系) Department of Electrical & Computer Engineering, Vanderbilt University(范德比大学电气与计算机工程系)

AI总结 提出最小切片传输计划(min-STP)框架,研究优化切片器在不同分布对间的可迁移性,并引入小批量公式以提高可扩展性,在点云对齐和流生成建模中实现一次性匹配和摊销训练。

AI中文摘要

最优传输(OT)为寻找分布之间的对应关系以及解决计算机视觉各个领域(包括形状分析、图像生成和多模态任务)中的匹配和对齐问题提供了强大的框架。然而,OT的计算成本阻碍了其可扩展性。基于切片的传输计划最近通过利用一维OT问题的闭式解,在降低计算成本方面显示出前景。这些方法优化一维投影(切片)以获得条件传输计划,该计划最小化环境空间中的传输成本。虽然高效,但这些方法留下了一个问题:学习到的最优切片器是否能够在分布偏移下迁移到新的分布对。理解这种可迁移性对于数据演变或跨密切相关的分布重复进行OT计算的情况至关重要。在本文中,我们研究了最小切片传输计划(min-STP)框架,并探讨了优化切片器的可迁移性:在一个分布对上训练的切片器能否为新的未见对产生有效的传输计划?理论上,我们证明优化后的切片器在数据分布轻微扰动下保持接近,从而能够在相关任务间高效迁移。为了进一步提高可扩展性,我们引入了min-STP的小批量公式,并提供了其准确性的统计保证。实验上,我们证明了可迁移的min-STP实现了强一次性匹配性能,并促进了点云对齐和基于流的生成建模的摊销训练。

英文摘要

Optimal Transport (OT) offers a powerful framework for finding correspondences between distributions and addressing matching and alignment problems in various areas of computer vision, including shape analysis, image generation, and multimodal tasks. The computational cost of OT, however, hinders its scalability. Slice-based transport plans have recently shown promise for reducing this cost by leveraging the closed-form solutions of 1D OT problems. These methods optimize a one-dimensional projection (slice) to obtain a conditional transport plan that minimizes the transport cost in the ambient space. While efficient, these methods leave open the question of whether learned optimal slicers can transfer to new distribution pairs under distributional shift. Understanding this transferability is crucial in settings with evolving data or repeated OT computations across closely related distributions. In this paper, we study the min-Sliced Transport Plan (min-STP) framework and investigate the transferability of optimized slicers: can a slicer trained on one distribution pair yield effective transport plans for new, unseen pairs? Theoretically, we show that optimized slicers remain close under slight perturbations of the data distributions, enabling efficient transfer across related tasks. To further improve scalability, we introduce a minibatch formulation of min-STP and provide statistical guarantees on its accuracy. Empirically, we demonstrate that the transferable min-STP achieves strong one-shot matching performance and facilitates amortized training for point cloud alignment and flow-based generative modeling.
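To make the slice-based idea concrete, the sketch below builds a transport plan by sorting 1D projections (the closed-form 1D OT solution for two equal-size, uniformly weighted point sets) and scores that plan by its squared-Euclidean cost in the ambient space. The function names are hypothetical, and the random search over directions is a simple stand-in chosen for this example, not the optimization procedure used in the paper.

```python
import math
import random
from typing import List, Sequence, Tuple

Point = Sequence[float]


def sliced_plan(xs: Sequence[Point], ys: Sequence[Point],
                theta: Sequence[float]) -> List[Tuple[int, int]]:
    """Pair points by the rank of their projections onto theta.

    Assumes len(xs) == len(ys) and uniform weights, so 1D OT reduces to sorting.
    """
    def project(p: Point) -> float:
        return sum(a * b for a, b in zip(p, theta))

    x_order = sorted(range(len(xs)), key=lambda i: project(xs[i]))
    y_order = sorted(range(len(ys)), key=lambda j: project(ys[j]))
    return list(zip(x_order, y_order))


def ambient_cost(xs: Sequence[Point], ys: Sequence[Point],
                 plan: List[Tuple[int, int]]) -> float:
    """Mean squared Euclidean distance between matched points in the original space."""
    total = sum(sum((a - b) ** 2 for a, b in zip(xs[i], ys[j])) for i, j in plan)
    return total / len(plan)


def random_search_slicer(xs: Sequence[Point], ys: Sequence[Point],
                         n_trials: int = 200, seed: int = 0):
    """Pick the unit direction whose sliced plan has the lowest ambient cost.

    Random search is an illustrative placeholder for a learned slicer.
    """
    rng = random.Random(seed)
    dim = len(xs[0])
    best_theta, best_cost = None, math.inf
    for _ in range(n_trials):
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(c * c for c in v))
        theta = [c / norm for c in v]
        cost = ambient_cost(xs, ys, sliced_plan(xs, ys, theta))
        if cost < best_cost:
            best_theta, best_cost = theta, cost
    return best_theta, best_cost
```

In this framing, the transferability question is whether a direction found for one pair `(xs, ys)` still gives a low `ambient_cost` when `sliced_plan` is applied to a new, related pair without re-running the search.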

2605.18866 2026-05-27 cs.LG cs.AI

FLUIDSPLAT: Reconstructing Physical Fields from Sparse Sensors via Gaussian Primitives

FLUIDSPLAT: 通过高斯原语从稀疏传感器重建物理场

Huaxi Huang, Meng Li, Zhengqing Gao, Xi Zhou, Xiaoshui Huang, Xiao Sun

发表机构 * Shanghai Artificial Intelligence Laboratory(上海人工智能实验室) The Hong Kong University of Science and Technology(香港科技大学) Mohamed bin Zayed University of Artificial Intelligence(穆罕默德·本·扎耶德人工智能大学) Shanghai Jiao Tong University(上海交通大学)

AI总结 提出FLUIDSPLAT模型,利用高斯原语作为空间显式中间表示,从稀疏传感器数据重建流场,理论分析了表示能力与观测数的关系,并在多个基准上实现误差降低11-28%。

Comments 24 pages, 5 figures, preprint

AI中文摘要

从稀疏表面安装的传感器重建连续流场是空气动力学设计、流动控制和数字孪生仪器的核心。现有的神经方法通常将传感器读数编码为隐式潜在代码,空间可解释性差,且关于表示能力应如何随观测数量扩展的正式指导有限。受3D高斯泼溅启发,我们引入FLUIDSPLAT,一种传感器条件模型,预测K个各向异性高斯原语,形成单位划分支架,即流场的空间显式且可解释的中间表示。对于理想化的高斯原语估计器,我们证明了对于具有Sobolev光滑度s的场,逼近率为$O(K^{-s/d})$;结合N个含噪声观测,得到偏差$O(K^{-2s/d})$和方差$O(σ^{2}K/N)$的平方风险分解。平衡两者得到$K^{*}\!\sim\!(N/σ^{2})^{d/(2s+d)}$:在稀疏传感下原语数量不能自由增长,揭示了方差瓶颈,促使用状态条件残差解码器补充支架。在涵盖2D和3D的四个基准(圆柱绕流、AirfRANS、FlowBench LDC-3D和PhySense-Car 3D)上,FLUIDSPLAT相比多个强基线实现了11-28%的误差降低。

英文摘要

Reconstructing continuous flow fields from sparse surface-mounted sensors is central to aerodynamic design, flow control, and digital-twin instrumentation. Existing neural methods for this task typically encode sensor readings into implicit latent codes with little spatial interpretability and limited formal guidance on how representational capacity should scale with observation count. Inspired by 3D Gaussian Splatting, we introduce FLUIDSPLAT, a sensor-conditioned model that predicts $K$ anisotropic Gaussian primitives forming a partition-of-unity scaffold, a spatially explicit and interpretable intermediate representation of the flow. For an idealized Gaussian primitive estimator, we prove an $O(K^{-s/d})$ approximation rate for fields with Sobolev smoothness $s$; incorporating $N$ noisy observations yields a squared-risk decomposition with bias $O(K^{-2s/d})$ and variance $O(\sigma^{2}K/N)$. Balancing the two yields $K^{*}\!\sim\!(N/\sigma^{2})^{d/(2s+d)}$: primitive count cannot grow freely under sparse sensing, revealing a variance bottleneck that motivates complementing the scaffold with a state-conditioned residual decoder. Across four benchmarks spanning 2D and 3D (cylinder flow, AirfRANS, FlowBench LDC-3D, and PhySense-Car 3D), FLUIDSPLAT achieves 11-28% error reduction over several strong baselines.
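The balancing step quoted in the abstract can be reproduced from the two stated rates alone. The short derivation below suppresses all constants and uses only the symbols from the abstract; the closing numerical case ($d=2$, $s=2$) is an illustrative choice made here, not a setting taken from the paper.

```latex
% Squared risk as a function of the primitive count K (constants suppressed):
\mathcal{R}(K) \;\asymp\; \underbrace{K^{-2s/d}}_{\text{bias}}
               \;+\; \underbrace{\frac{\sigma^{2} K}{N}}_{\text{variance}}

% Equating the two terms:
K^{-2s/d} = \frac{\sigma^{2} K}{N}
\;\Longrightarrow\;
K^{(2s+d)/d} = \frac{N}{\sigma^{2}}
\;\Longrightarrow\;
K^{*} \sim \left(\frac{N}{\sigma^{2}}\right)^{d/(2s+d)}

% Risk attained at the balance point:
\mathcal{R}(K^{*}) \;\asymp\; \left(\frac{N}{\sigma^{2}}\right)^{-2s/(2s+d)}

% Illustrative case d = 2, s = 2:
K^{*} \sim \left(\frac{N}{\sigma^{2}}\right)^{1/3}
```

In that illustrative case, multiplying the number of observations by 8 only doubles the balanced primitive count, which is the sense in which the primitive count cannot grow freely under sparse sensing.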

2605.18592 2026-05-27 cs.LG cs.AI cs.CL

AMARIS: A Memory-Augmented Rubric Improvement System for Rubric-Based Reinforcement Learning

AMARIS: 一种用于基于评分标准的强化学习的记忆增强评分标准改进系统

Peilin Wu, Xinlu Zhang, Kun Wan, Wentian Zhao, Gang Wu, Xinya Du, Zhiyu Chen

发表机构 * The University of Texas at Dallas(德克萨斯大学达拉斯分校) Adobe Inc.(Adobe公司) Department of Computer Science, University of California, Santa Barbara(加州大学圣芭芭拉分校计算机科学系)

AI总结 提出AMARIS系统,通过持久化评估记忆存储纵向训练证据来改进评分标准,在科学、医学、指令遵循和创意写作任务上优于静态、局部自适应和无记忆基线方法。

Comments Preprint. Under review

AI中文摘要

基于评分标准的奖励塑形为通过强化学习(RL)微调大语言模型(LLMs)提供了可解释且可编辑的奖励信号,但现有的自适应评分标准方法通常从局部证据(如当前批次或实例级比较)更新标准。这种局部视角丢弃了训练过程中产生的诊断信息,使得难以跟踪重复失败、评估之前的评分标准编辑或在早期标准饱和后提高标准。我们引入了AMARIS,一种记忆增强的评分标准改进系统,它将评分标准更新建立在纵向训练证据之上。AMARIS将轨迹分析、步骤级摘要和评分标准更新记录存储在持久化评估记忆中,然后检索最近和语义相关的历史来修订评分标准。我们在全局和实例特定评分标准设置下,在科学、医学、指令遵循和创意写作任务上评估了AMARIS。AMARIS在静态、局部自适应和无记忆基线上有所改进,例如在GPQA-Diamond上比最强基线高出+2.8分,在IFBench上高出+2.2分,同时分析表明记忆减少了振荡性的评分标准编辑,并支持从早期错误纠正到后期课程推进的进展。AMARIS与正常RL循环异步运行,相对于同步评分标准更新减少了阻塞延迟。

英文摘要

Rubric-based reward shaping provides interpretable and editable reward signals for fine-tuning LLMs via reinforcement learning (RL), but existing adaptive rubric methods typically update criteria from local evidence such as the current batch or instance-level comparisons. This local view discards diagnostic information produced during training, making it difficult to track recurring failures, evaluate previous rubric edits, or raise standards once earlier criteria become saturated. We introduce AMARIS, A Memory-Augmented Rubric Improvement System that grounds rubric updates in longitudinal training evidence. AMARIS stores rollout analyses, step-level summaries, and rubric update records in a persistent evaluation memory, then retrieves recent and semantically relevant history to revise rubrics. We evaluate AMARIS across science, medicine, instruction following, and creative writing under both global and instance-specific rubric settings. AMARIS improves over static, local-adaptive, and memory-ablated baselines, such as +2.8 points on GPQA-Diamond and +2.2 points on IFBench over the strongest baselines, while analysis shows that memory reduces oscillatory rubric edits and supports a progression from early failure correction to later curriculum advancement. AMARIS runs asynchronously alongside the normal RL loop, reducing blocking latency relative to synchronous rubric updates.