arXivDaily: arXiv daily academic digest, updated Monday to Friday
2606.07079 2026-06-08 cs.CV New submission

AsyncPatch Diffusion: spatially-flexible image generation

Samuele Papa, Valentin De Bortoli, Guillaume Couairon, Daniel Sýkora, Romuald Elie, Klaus Greff

Affiliations: Google DeepMind; University of Amsterdam; The Netherlands Cancer Institute

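AI summary: Proposes the AsyncPatch Diffusion framework, which assigns different noise levels to different spatial regions to obtain heterogeneous denoising trajectories, natively supporting inpainting and adaptive generation while maintaining generation quality.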

Comments 36 pages, 14 figures

English abstract

Standard diffusion models corrupt an entire sample with a single shared noise level, forcing all spatial regions to follow the same denoising trajectory. We introduce AsyncPatch Diffusion, a joint-diffusion framework that assigns distinct noise levels to different input dimensions, such as image pixels or latent tokens. We show how this asynchronous corruption defines a valid generative process while supporting a richer family of spatially heterogeneous denoising trajectories, and prove the first valid ELBO for this process. We show that a single pretrained model can perform spatially adaptive generation, where different regions are denoised on different schedules. A key challenge is training: naive independent noise-level sampling overemphasizes highly heterogeneous configurations and underrepresents homogeneous noise levels, which are crucial during sampling. We address this with a controlled noise-level sampler that regulates both the average corruption level and its spatial variability. AsyncPatch achieves generation quality comparable to conventional diffusion on ImageNet 256 and LSUN, while being natively suited for inpainting without task-specific fine-tuning. We further introduce input guidance, which uses clean or partially corrupted regions to guide the generation of unknown regions, improving local consistency and texture matching. Finally, we demonstrate adaptive generation strategies including uncertainty-guided acceleration and autoregressive sampling.
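
The idea of a noise-level sampler that controls both the average corruption and its spatial variability can be illustrated with a minimal sketch. The abstract does not specify the sampler, so everything below is an assumption for illustration: the function name `sample_patch_noise_levels`, the uniform distributions, and the `max_spread` parameter are hypothetical and are not the authors' method.

```python
import random


def sample_patch_noise_levels(num_patches, max_spread=0.5, rng=None):
    """Draw one noise level in [0, 1] per patch (illustrative sketch only).

    A shared mean level is drawn first, then a spread that bounds how far
    individual patches may deviate from it. A spread of zero reproduces the
    homogeneous case used by conventional diffusion.
    """
    rng = rng or random.Random()
    mean_level = rng.random()                # average corruption level
    spread = rng.uniform(0.0, max_spread)    # spatial variability
    levels = []
    for _ in range(num_patches):
        offset = rng.uniform(-spread, spread)
        # Clamp so every per-patch level stays a valid noise level.
        levels.append(min(1.0, max(0.0, mean_level + offset)))
    return levels


# Homogeneous configuration: every patch shares the same level.
uniform_levels = sample_patch_noise_levels(16, max_spread=0.0, rng=random.Random(0))
# Heterogeneous configuration: levels vary around a common mean.
mixed_levels = sample_patch_noise_levels(16, max_spread=0.5, rng=random.Random(0))
```

Drawing the mean and the spread separately keeps homogeneous configurations reachable, whereas sampling every patch level independently would almost never produce them.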

2606.07074 2026-06-08 cs.LG cs.AI New submission

SlimSearcher: Training Efficiency-Aware Web Agents via Adaptive Reward Gating

Zequn Xie, Junjie Wang, Dan Yang, Jie Feng, Yue Shen, Jian Wang, Jinjie Gu

Affiliations: Zhejiang University; Ant Group

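AI summary: Proposes the SlimSearcher framework, which uses Pareto-efficient filtration and Adaptive Reward Gating to reduce tool-call rounds by 17%-58% while maintaining or improving accuracy.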

Comments 17 pages, 8 figures

English abstract

Deep research agents have demonstrated remarkable capabilities in complex information-seeking tasks, yet this power comes at a steep computational cost. Driven by accuracy-focused training paradigms, current models adopt brute-force strategies characterized by blind tool dependency and performative reasoning, generating long, redundant trajectories that are far from necessary for resolving these tasks and that lead to wasteful tool calls and excessive token consumption. To overcome this efficiency trap, we propose SlimSearcher, a principled framework that pushes the Pareto frontier between accuracy and computational cost across both Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL). In the SFT stage, SlimSearcher employs Pareto-efficient filtration to distill trajectories that are both successful and economical, guiding the model toward inherently efficiency-aware search behaviors. During RL, we introduce Adaptive Reward Gating, a dynamic reward-shaping mechanism that evaluates relative tool and token efficiency within a sampled cohort. By cascading these adaptive efficiency metrics with a strict correctness gate, our approach effectively avoids the brevity bias associated with absolute penalties and mitigates reward hacking. Extensive experiments on long-horizon benchmarks, including GAIA, BrowseComp, and XBenchDeepSearch, demonstrate that SlimSearcher reduces average tool-call rounds by 17%-58% while maintaining or improving accuracy.
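
A correctness gate cascaded with cohort-relative efficiency can be sketched as follows. This is only an illustration of the general idea: the abstract gives no formula, so the function `gated_rewards`, the dictionary keys, the equal weighting of tool and token savings, and the `bonus_weight` parameter are all hypothetical.

```python
def gated_rewards(cohort, bonus_weight=0.2):
    """Assign rewards to a cohort of sampled trajectories (illustrative sketch).

    Each trajectory is a dict with hypothetical keys:
      'correct' (bool), 'tool_calls' (int), 'tokens' (int).
    Incorrect trajectories receive zero reward regardless of cost, so brevity
    alone is never rewarded. Correct ones receive 1.0 plus a bounded bonus or
    penalty based on how their cost compares with the other correct ones.
    """
    correct = [t for t in cohort if t["correct"]]
    if not correct:
        return [0.0] * len(cohort)

    mean_calls = sum(t["tool_calls"] for t in correct) / len(correct)
    mean_tokens = sum(t["tokens"] for t in correct) / len(correct)

    rewards = []
    for t in cohort:
        if not t["correct"]:
            rewards.append(0.0)  # strict correctness gate
            continue
        # Relative savings: positive when cheaper than the cohort average.
        call_saving = (mean_calls - t["tool_calls"]) / max(mean_calls, 1e-9)
        token_saving = (mean_tokens - t["tokens"]) / max(mean_tokens, 1e-9)
        efficiency = 0.5 * (call_saving + token_saving)
        bonus = bonus_weight * max(-1.0, min(1.0, efficiency))
        rewards.append(1.0 + bonus)
    return rewards


cohort = [
    {"correct": True, "tool_calls": 2, "tokens": 100},
    {"correct": True, "tool_calls": 6, "tokens": 300},
    {"correct": False, "tool_calls": 1, "tokens": 10},
]
rewards = gated_rewards(cohort)
```

Because efficiency is measured against the cohort average and not an absolute budget, a hard task that needs many tool calls is not penalized as long as the trajectory is no more wasteful than its peers.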

2606.07069 2026-06-08 cs.CL cs.CY New submission

mmPISA-bench: Do LLMs Reason Equally Well Across 43 Languages?

Yerzhan Sapenov, Jaromir Savelka

Affiliations: Independent Scholar; School of Computer Science, Carnegie Mellon University

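AI summary: Proposes mmPISA-bench, a PISA-based multilingual reasoning benchmark of 25 multiple-choice questions officially translated into 43 languages, and evaluates LLMs across languages, reasoning effort levels, and translation types. It finds that modern LLMs reason effectively in all languages and that machine translation does not reduce accuracy, although some languages are both more costly and less accurate.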

English abstract

We introduce mmPISA-bench, a compact high-quality multilingual reasoning benchmark derived from the OECD Programme for International Student Assessment (PISA). The benchmark consists of 25 multiple-choice questions that require reasoning in order to be answered correctly. Each question is provided in official human translations to 43 languages and complemented with machine-translated counterparts (i.e., 2,150 data points in total). We evaluate two mainstream proprietary LLMs across languages, reasoning effort levels, and translation types in terms of their ability to answer the questions correctly. Our results show that modern LLMs can reason effectively across all evaluated languages, achieving accuracy comparable to human test-takers, with some performance variations across covered languages. We further find that machine-translated questions do not degrade accuracy relative to official human translations, which suggests that high-quality machine translation (synthetic data) might often be adequate for large-scale multilingual reasoning evaluations where official translations are not available. Finally, we analyze token usage and related inference cost and find that LLM usage in some languages is simultaneously more expensive and less accurate.

2606.07068 2026-06-08 cs.LG New submission

Bias in Filter Feature Selection Evaluation: A Meta-Analysis of Datasets, Baselines, and Experimental Design Choices

Malick Ebiele, Malika Bendechache, Rob Brennan

Affiliations: University College Dublin; University of Galway; ADAPT Centre

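AI summary: Analyzes 28 high-profile filter feature selection studies and finds that the number of datasets, baselines, and new methods explains 33% of the variance in performance, revealing potential bias in evaluation; offers five evidence-based evaluation recommendations.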

English abstract

Background: Since 1990, many feature selection methods have been proposed across heterogeneous applications. To validate the usefulness of a new method, it needs to be compared against at least one baseline method from the existing literature on a feature selection task using at least one dataset. Recent developments in tabular Deep Learning (DL) and data valuation in Machine Learning (ML) suggest that the evaluation of new methods, algorithms, and models may be consciously or unconsciously biased. We hypothesise that a similar trend exists in feature selection (FS), particularly in filter feature selection (FFS). The aim of this study is therefore to examine FFS studies to identify factors that influence the evaluation and that might constitute entry points for biases, in order to recommend stronger principles for FFS evaluation. Methods: We analyse a sample of 28 high-profile FFS studies published between 1994 and 2025. The analysis provides reflections on how to examine FFS studies, highlights lessons learned throughout the process, and gives five evidence-based recommendations for future FFS evaluation. Results: Multivariate Linear Regression analysis achieved a score of $R^2=0.33$. This means that 33% of the variance in the performance of new methods against chosen baselines (win rate) is explained by the number of datasets (#Datasets), the number of baselines (#Baselines), and the number of new methods (#NewMethods). Discussion: $R^2=0.33$ is considered a medium level of explanatory power, which is promising given that this is the first such study. The result is only medium because win rate is also influenced by additional factors, such as the maturity of the feature selection domain and the type of datasets and baselines, and because the regression model used to explain the relationship is simple.
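
For readers unfamiliar with the statistic, the coefficient of determination reported above is one minus the ratio of residual to total sum of squares. The sketch below computes it for made-up numbers; the win rates and fitted values are hypothetical and are not data from the study.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Hypothetical win rates and the values a fitted regression might predict.
win_rates = [0.6, 0.7, 0.8, 0.9]
fitted = [0.65, 0.7, 0.8, 0.85]
score = r_squared(win_rates, fitted)
```

A value of 0.33 therefore means the three predictors remove about a third of the squared deviation of win rate from its mean, leaving roughly two thirds unexplained.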

2606.07058 2026-06-08 cs.LG cs.CV math.AT stat.ML New submission

Constructing VAE Latent Spaces with Prescribed Topology

Jilles S. van Hulst, Jakub M. Tomczak, W. P. M. H. Heemels, Duarte J. Antunes

Affiliations: Control Systems Technology Section, Department of Mechanical Engineering, Eindhoven University of Technology; Nature Innovation Laboratory (NatInLab)

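AI summary: To address the mismatch between a standard Gaussian prior and the non-Euclidean topology of a data manifold, proposes a constructive mathematical framework that uses factorized distributions and the reparameterization trick to design topology-matched priors for manifolds with product covering spaces (such as cylinders, tori, and Möbius strips), improving reconstruction quality and representation fidelity.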

Comments 16 pages, 7 figures

English abstract

Variational autoencoders (VAEs) learn low-dimensional latent representations of high-dimensional data. When the data lies on a manifold with non-Euclidean topology, the standard Gaussian prior introduces a topological mismatch that degrades reconstruction quality and prevents faithful representation. We present a constructive mathematical framework that resolves this mismatch for all manifolds that admit a product covering space. These are manifolds expressible as products of elementary factors (circles, intervals, or lines) or as quotients of such products by a finite symmetry group. The class includes cylinders, tori, Möbius strips, Klein bottles, and real projective spaces. Factorized distributions over the elementary factors yield product topologies with closed-form, decoupled KL divergences, so that each latent factor can be shaped independently while keeping training tractable. We catalogue reparametrizable encoder-prior pairs for periodic, bounded, and unbounded supports, and provide coordinate transformations that allow standard neural networks to output non-Euclidean parameters with smooth gradients. For quotient manifolds, the decoder receives group-invariant features of the covering-space coordinates, so that identified points produce identical outputs. Anchor constraints fix the coordinate system relative to the data or create soft topological holes. Experiments on synthetic manifolds and real-image datasets (rotated and cyclically shifted MNIST) confirm that a topology-matched prior aligns KL regularization with the data manifold. The resulting topology-aware models outperform the Gaussian baseline at all practically relevant regularization strengths. The code is available at https://github.com/JvHulst/VAE-Topology.
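
Two of the ideas in this abstract, letting an unconstrained network output parametrize a periodic coordinate and feeding the decoder features that agree on identified points, can be illustrated for the simplest factor, the circle. The abstract does not state the transformations actually used, so `to_angle` and `circle_features` below are hypothetical helpers chosen for illustration.

```python
import math


def to_angle(a, b):
    """Map two unconstrained network outputs to an angle in (-pi, pi].

    atan2 is smooth away from the origin, so gradients pass through without
    the discontinuity of wrapping a single scalar output modulo 2*pi.
    """
    return math.atan2(b, a)


def circle_features(theta):
    """Decoder input that is invariant under theta -> theta + 2*pi."""
    return (math.cos(theta), math.sin(theta))


theta = to_angle(0.0, 1.0)  # a quarter turn
same_point = circle_features(theta + 2.0 * math.pi)
```

Since `theta` and `theta + 2*pi` name the same point on the circle, a decoder that only sees `circle_features(theta)` cannot produce different outputs for them.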

2606.07054 2026-06-08 cs.CL cs.AI cs.CR cs.LG New submission

TRACE: Trajectory Reasoning through Adaptive Cross-Step Evidence Aggregation for LLM Agents

Vijitha Mittapalli, Shreyaa Jayant Dani, Satya Srujana Pilli, Snigdha Ansu, Mohammadreza Teymoorianfard, Franck Dernoncourt, Hongjie Chen, Yu Wang, Ryan A. Rossi, Nesreen K. Ahmed

Affiliations: University of Massachusetts at Amherst; Adobe Research; Dolby Labs; University of Oregon; Cisco

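AI summary: Proposes the TRACE framework, whose TIJ loop identifies high-signal regions, accumulates evidence across steps, and synthesizes a trajectory-level verdict. On ten SHADE-Arena task domains it reaches an F1 of 0.713 and a recall of 0.844, and it is particularly strong at long-range evidence linking.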

English abstract

Autonomous LLM agents can pursue hidden malicious objectives through sequences of individually benign actions, making sabotage difficult to detect using standard trajectory-level monitoring. Existing approaches either evaluate complete trajectories in a single pass or partition them into independently scored windows, limiting their ability to connect evidence across temporally distant actions. We propose TRACE, a monitoring framework for long-horizon LLM agent trajectories. TRACE operates through a TIJ (Triage-Inspect-Judge) loop that identifies high-signal regions, performs targeted inspection while maintaining accumulated evidence across reasoning steps, and synthesizes a trajectory-level verdict. We evaluate TRACE on ten task domains from SHADE-Arena against state-of-the-art baselines. TRACE achieves an aggregate F1 of 0.713 and recall of 0.844, with the largest gains on tasks requiring long-range evidence linking.

2606.07053 2026-06-08 cs.CV cs.LG New submission

TrioPose: Native Triple-Stream Diffusion Transformers for Pose-Guided Text-to-Image Generation

Dian Gu, Zhengyi Yang

Affiliations: Institute of Automation, Chinese Academy of Sciences

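AI summary: Proposes TrioPose, a native triple-stream pose-aware DiT built on the SD3.5M architecture, which preserves pre-trained stability through layer-wise activation and zero-initialized dual-residual injection and adds a Learnable Relational Bias Mask and Pose-Guided Spatial Loss Weighting. It achieves state-of-the-art performance in multi-person pose-guided generation, with an AP of 64.33 on Human-Art.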

Comments 15 pages (9 pages main body, 6 pages references and appendix), 3 figures, 5 tables

English abstract

Pose-guided text-to-image generation often suffers from limb distortions and feature crosstalk in complex multi-person scenarios. While existing UNet-based adapters struggle with long-range spatial dependencies, emerging Multimodal Diffusion Transformers (MM-DiTs) offer superior global modeling. However, naive signal concatenation in MM-DiTs severely disrupts pre-trained latent distributions. To address this, we propose TrioPose, a native pose-driven framework built upon the SD3.5M architecture. Specifically, we introduce a Triple-Stream Pose-Aware DiT (TSPA-DiT) that treats pose as an independent modality. It employs layer-wise activation and zero-initialized dual-residual injection to smoothly enforce geometric constraints while preserving pre-trained latent stability. To resolve severe multi-instance occlusions, we design a Learnable Relational Bias Mask that categorizes topological connectivity into fine-grained physical states, mapping them into continuous attention soft constraints to effectively decouple inter-instance interference. Furthermore, a Pose-Guided Spatial Loss Weighting strategy modulates the native diffusion objective using heatmap-derived error maps, focusing anatomical supervision strictly on distortion-prone regions. Extensive experiments demonstrate that TrioPose achieves state-of-the-art performance across challenging benchmarks, including Human-Art, CrowdPose, and OCHuman. Notably, it attains an AP of $64.33$ on Human-Art, representing a $30\%$ improvement over prior art, while setting new standards for visual fidelity and text-image semantic alignment in complex multi-human generation.

2606.07044 2026-06-08 cs.LG New submission

Hierarchical Forecast Reconciliation for Urban Rail Transit Demand Prediction under Operational Disruptions

Dang Viet Anh Nguyen, Alma Fazlagic, Kristine Pryds Loft, Filipe Rodrigues

Affiliations: Technical University of Denmark (DTU)

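AI summary: To address inconsistency between station-level and OD flow forecasts in urban rail transit, proposes the first hierarchical reconciliation framework, in which a neural Fully Connected Reconciler (FCR) learns a non-linear mapping while ensuring structural consistency; in disruption scenarios, OD forecasting error falls by up to 17.45%.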

Comments 33 pages, 6 figures, 16 tables

English abstract

Accurate and coherent passenger demand forecasting is essential for Urban Rail Transit (URT) operations. Passenger demand has a hierarchical structure in which origin-destination (OD) flows aggregate to station-level inflows and outflows through conservation constraints. In practice, station-level and OD-level forecasts are often generated independently, producing incoherent predictions that violate these constraints and introduce inconsistencies into operational decision-making. Such issues become more severe during disruptions, when forecasting reliability is most critical. This paper presents the first hierarchical forecast reconciliation framework for joint station-level and OD-level URT demand prediction. A neural Fully Connected Reconciler (FCR) learns a non-linear mapping from incoherent base forecasts to coherent hierarchical predictions while guaranteeing exact structural consistency by construction. The method is benchmarked against OLS, WLS, and Minimum Trace (MinT) variants using Rejsekort smart-card data from the Copenhagen S-train network under one-step, multi-step, and disruption forecasting scenarios. Results show that reconciliation consistently improves OD forecasting accuracy while ensuring hierarchical coherence. Under normal conditions, FCR performs competitively with MinT-based methods. An oracle analysis indicates that perfect station-level forecasts could reduce OD prediction error by up to 34 percent, highlighting the value of improved base forecasts. Under severe disruptions, FCR outperforms classical methods, reducing OD forecasting error by up to 17.45 percent in multi-step destination-side delay scenarios. These findings establish hierarchical reconciliation as an effective mechanism for improving forecast robustness, with the largest benefits occurring under the most challenging operating conditions.
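
The OLS baseline mentioned above has a standard closed form: with a summing matrix S that maps bottom-level series to every level of the hierarchy, the reconciled forecast is S (SᵀS)⁻¹ Sᵀ ŷ. The sketch below applies it to a toy hierarchy with one station inflow and two OD flows. It illustrates classical OLS reconciliation only, not the paper's neural FCR, and the numbers and helper names are hypothetical.

```python
def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [list(a[i]) + list(b[i]) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def ols_reconcile(summing_matrix, base_forecasts):
    """Project incoherent base forecasts onto the coherent subspace."""
    st = transpose(summing_matrix)
    y = [[v] for v in base_forecasts]
    bottom = solve(matmul(st, summing_matrix), matmul(st, y))
    return [row[0] for row in matmul(summing_matrix, bottom)]


# Rows: station inflow (= OD1 + OD2), OD1, OD2.
S = [[1.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
base = [10.0, 4.0, 5.0]          # incoherent: 4 + 5 != 10
reconciled = ols_reconcile(S, base)
```

After reconciliation the station-level value equals the sum of the two OD values exactly, which is the conservation constraint the abstract refers to.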

2606.07036 2026-06-08 cs.CV cs.AI cs.CE cs.LG New submission

STREAM: Stochastic Riemannian Flow Matching with Anisotropic Decoder for Digital Histopathology Image Generation

Won June Cho, Daeky Jeong, Hyeongyeol Lim, Hongjun Yoon

Affiliations: DEEPNOID Inc.

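AI summary: Proposes the STREAM framework, which uses patch-token features from a histopathology vision foundation model as the latent space and generates high-quality histopathology images with Riemannian flow matching, addressing conditioning collapse; an anisotropic decoder further improves generation quality.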

Comments 27 pages, 7 figures

English abstract

Synthetic histopathology image generation addresses critical challenges in computational pathology, including patient privacy and the growing need for large-scale training data for foundation models. Latent diffusion models have dominated the image generation domain, with recent works emphasizing that the choice of latent space is critical to the quality of generated images. Existing state-of-the-art generative models in histopathology use pretrained Vision Foundation Models (VFMs) as conditioning signals, and we observe that this leads to "conditioning collapse," where the conditioning signal dominates the latent space and lowers the quality and diversity of generated samples. Therefore, we instead use pretrained histopathology VFMs as the latent space itself, leveraging their patch-token features that encode rich semantic information. We empirically show that these features are $\ell_2$-normalized and lie on the unit hypersphere $\mathcal{S}^{d-1}$ with strong angular dominance and intrinsic curvature, making them naturally suited for a Riemannian formulation. We therefore present STREAM, the first framework to apply Riemannian flow matching in the pathology domain. STREAM consists of two stages: 1) a bridge-type stochastic perturbation that establishes per-token rectifiability on $\mathcal{S}^{d-1}$ for training a Diffusion Transformer (DiT) in latent space, and 2) a novel anisotropic decoder that allocates robustness to low-energy directions of the velocity-field Jacobian while preserving fidelity along its high-energy directions. Together, STREAM achieves state-of-the-art reconstruction and generation performance on breast and colorectal cancer datasets. The code will be publicly released upon acceptance.
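
On the unit hypersphere, the straight-line interpolation used in Euclidean flow matching is replaced by movement along great-circle arcs. The sketch below shows $\ell_2$ normalization and geodesic interpolation (slerp) between two points, a generic building block for sphere-valued flows. It does not implement the paper's bridge-type perturbation or decoder, and the function names are hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a nonzero vector to unit length so it lies on the hypersphere."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def slerp(p, q, t):
    """Point at fraction t along the geodesic from unit vector p to unit vector q.

    Antipodal inputs (p == -q) have no unique geodesic and are not handled.
    """
    dot = max(-1.0, min(1.0, sum(a * b for a, b in zip(p, q))))
    omega = math.acos(dot)  # angle between p and q
    if omega < 1e-8:
        return list(p)      # points coincide; nothing to interpolate
    s = math.sin(omega)
    w_p = math.sin((1.0 - t) * omega) / s
    w_q = math.sin(t * omega) / s
    return [w_p * a + w_q * b for a, b in zip(p, q)]


p = l2_normalize([3.0, 4.0])
q = l2_normalize([0.0, 2.0])
midpoint = slerp(p, q, 0.5)
```

Unlike a linear blend of `p` and `q`, every point returned by `slerp` has unit norm, so intermediate states stay on the manifold where the features live.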

2606.07034 2026-06-08 cs.CV New submission

ForensicConcept: Transferable Forensic Concepts for AIGI Detection

Menyanshu Zhou, Ziyin Zhou, Ke Sun, Yunpeng Luo, Jiayi Ji, Xiaoshuai Sun, Rongrong Ji

Affiliations: University of Science and Technology of China

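AI summary: Proposes the ForensicConcept framework, which localizes key patches via Transformer attribution, builds a concept codebook, and aligns with diffusion features to extract transferable forensic concepts for AI-generated image detection and transfer them across backbones.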

Comments Accepted by ICML 2026

English abstract

AI-generated image detectors achieve high accuracy on in-distribution data but often fail on unseen generators. A key obstacle to understanding this failure is the black-box nature of current detectors: they do not reveal which evidence drives their decisions. We propose ForensicConcept, a framework that extracts explicit forensic concepts from detectors and enables their transfer across backbones. Our method localizes decision-critical patches via Transformer attribution, clusters them into a compact concept codebook, and uses a concept-aligned projection to produce auditable evidence readouts. Motivated by prior studies showing that DINO representations can guide diffusion generation and exhibit concept-level correspondence with diffusion features, we introduce a generation-trace reference based on CleanDIFT diffusion features and quantify backbone-trace alignment via neighborhood-structure consistency (CKNNA). We further propose concept codebook injection to transfer diffusion-derived concepts into target backbones. Experiments on GenImage, GAN-family, and Chameleon benchmarks show consistent improvements over prior methods. We also find that CKNNA alignment predicts transfer effectiveness, providing a principled explanation for why some backbones yield more transferable forensic evidence than others.

2606.07033 2026-06-08 cs.AI cs.CV New submission

Hierarchical Semantic-Constrained Heterogeneous Graph for Audio-Visual Event Localization

Zhe Yang, Ruyi Zhang, Hongtao Chen, Wenrui Li, Hengyu Man, Wangmeng Zuo, Xiaopeng Fan

Affiliations: Harbin Institute of Technology; Peng Cheng Laboratory; Harbin Institute of Technology Suzhou Research Institute

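AI summary: Proposes a hierarchical semantic-constrained heterogeneous graph framework that builds a heterogeneous graph, applies bidirectional semantic constraints, and adds hierarchical regularization in hyperbolic space to address cross-scale consistency and hierarchical semantic consistency in open-vocabulary audio-visual event localization.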

English abstract

Open-vocabulary audio-visual event localization (OV-AVEL) jointly models audio-visual cues to recognize and temporally localize events, including categories unseen during training. Existing methods primarily learn joint audio-visual representations in Euclidean space, but still face two significant challenges. First, the lack of supervision signals for unseen categories makes it difficult to maintain audio-visual consistency across multiple temporal scales. Second, the lack of hierarchical constraints between segment- and video-level semantics prevents the model from establishing semantic consistency across different levels. To address these challenges, we propose a hierarchical semantic-constrained heterogeneous graph (HSCHG) framework for audio-visual event localization. We first construct a heterogeneous hierarchical graph in Euclidean space, which includes audio and visual segment nodes and their corresponding video-level nodes. We use multi-directional temporal edges to capture complete temporal information within each modality. Simultaneously, we employ a dual-threshold filtering gated fusion strategy, introducing cross-modal information only when the alignment confidence is high. Furthermore, we introduce bidirectional semantic constraints between segment- and video-level representations to achieve semantic consistency across different levels. Based on this, we map the multi-level audio-visual representations and text prototypes uniformly into hyperbolic space. We use a hierarchical entailment regularization loss to characterize the hierarchical relationships between videos and segments. Extensive experimental results show that our method outperforms existing methods on the OV-AVEL benchmark. Ablation studies further validate the effectiveness of our method.

2606.07032 2026-06-08 cs.CV cs.AI New submission

Never Seen Before: Benchmarking Genuine Zero-Shot Composed Image Retrieval with Consistent Video-Sourced Datasets

Zhenyu Yang, Zemin Du, Shengsheng Qian, Changsheng Xu

Affiliations: University of Science and Technology of China

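AI summary: Existing zero-shot composed image retrieval datasets suffer from unrelated reference and target images and are not truly zero-shot. This work proposes the ZeroSight benchmark, with consistent video-sourced reference-target pairs, and SC4CIR, a training-free MLLM-driven method that identifies hard negatives through three symmetric consistency checks; experiments indicate that the performance of existing methods is overestimated.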

English abstract

Zero-Shot Composed Image Retrieval (ZS-CIR) aims to retrieve a target image based on a query composed of a reference image and a relative caption without training samples. Existing ZS-CIR datasets often suffer from complete irrelevance between reference and target images due to noisy image sources, and do not achieve a true zero-shot scenario as they use public image datasets that models like CLIP have been trained on. To tackle these challenges, we introduce ZeroSight, a novel benchmark for ZS-CIR. It includes a dataset with consistent reference-target pairs sourced from videos, a data construction pipeline, and evaluation methods that consider the ranking of multiple positive and negative target images. We ensure visually and semantically consistent reference-target pairs by extracting frames from a single video and generating relative captions using LLM-assisted methods. To ensure a true zero-shot scenario, we use video data published after March 31, 2022, ensuring it was not included in CLIP's pre-training data. Additionally, we propose a training-free MLLM-driven method, SC4CIR (Symmetric Consistency for CIR), which can effectively identify hard negative targets through 3 symmetric consistency checks. This method is plug-and-play, seamlessly integrating with various CIR methods and significantly improving performance. Our experimental results from 27 methods reveal that current ZS-CIR datasets and evaluation metrics result in inflated retrieval performance, exaggerating the capabilities of CIR methods. Our benchmark and models can be accessed at https://github.com/sotayang/ZeroSight.

2606.07031 2026-06-08 cs.LG New submission

CF-JEPA: Mask-free forward prediction with asymmetric encoder utilization for time-series representation learning

Jaehoon Lee, Sunghyun Sim

Affiliations: Graduate School of Artificial Intelligence Convergence Engineering, Changwon National University; Department of Artificial Intelligence Engineering, Changwon National University

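AI summary: Proposes CF-JEPA, a mask-free framework that replaces masking with multi-horizon forward prediction and uses the temporal ordering of time series as the learning signal. It also exploits the asymmetry between the online encoder and the EMA target encoder by routing each task to the better-suited encoder, achieving leading performance on multiple benchmarks.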

English abstract

Self-supervised learning (SSL) for time-series representation learning is dominated by two paradigms: contrastive methods, which face challenges in constructing positive or negative pairs, and masking-based methods, which disrupt the temporal continuity of time-series signals. Joint-Embedding Predictive Architecture (JEPA) offers a promising alternative by predicting in representation space rather than reconstructing raw inputs. However, existing time-series JEPA variants still rely on masking and therefore inherit its continuity problem. Crop-based Forward JEPA (CF-JEPA) is proposed as an innovative mask-free framework that replaces masking with multi-horizon forward prediction: random crops serve as context views, and short-, mid-, and long-horizon future representations are predicted in the forward temporal direction, directly leveraging the inherent temporal ordering of time-series data as a learning signal. A strong asymmetry is also identified between the online encoder and the exponential moving average (EMA) target encoder, both produced from a single training run: the online encoder develops higher-rank discriminative features, while the EMA target encoder develops smoother, lower-rank temporal features. Exploiting this asymmetry, classification is routed to the online encoder and forecasting or anomaly detection to the EMA target encoder, achieving a 27% reduction in multivariate forecasting mean squared error (MSE) at no additional training cost. Across 126 University of California, Riverside (UCR) and 26 University of East Anglia (UEA) classification datasets, eight electricity transformer temperature forecasting benchmarks, and Key Performance Indicator/Yahoo anomaly detection, CF-JEPA achieves the highest average accuracy and rank on UCR and UEA among self-supervised baselines and ranks second on univariate forecasting and k-nearest neighbors-scored anomaly detection.
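
The two encoders come from one training run because the target encoder is just a running average of the online encoder's weights. A minimal sketch of that update and of the task routing described above follows. Weights are flattened to plain lists for brevity, and the `decay` value, the function names, and the task labels are hypothetical choices for illustration.

```python
def ema_update(target_weights, online_weights, decay=0.99):
    """One exponential-moving-average step for the target encoder."""
    return [
        decay * t + (1.0 - decay) * o
        for t, o in zip(target_weights, online_weights)
    ]


def pick_encoder(task):
    """Route a downstream task to an encoder, following the abstract's split."""
    if task == "classification":
        return "online"
    if task in ("forecasting", "anomaly_detection"):
        return "ema_target"
    raise ValueError(f"unknown task: {task}")


target = [1.0, 0.0]
online = [0.0, 1.0]
target = ema_update(target, online, decay=0.9)
```

Averaging over many past steps smooths the target weights, which is consistent with the abstract's observation that the EMA encoder yields smoother, lower-rank features.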

2606.07030 2026-06-08 cs.SD cs.AI cs.CL cs.LG 新提交

Phonetic Error Analysis of Raw Waveform Acoustic Models

原始波形声学模型的音素错误分析

Erfan Loweimi, Zhengjun Yue, Andrea Carmantini, Zoran Cvetkovic, Steve Renals, Peter Bell

发表机构 * Centre for Speech Technology Research (CSTR), University of Edinburgh, UK(语音技术研究中心(CSTR),爱丁堡大学,英国) Cisco, UK(思科公司,英国) SLAI & CUHK-SZ, China(SLAI与CUHK-SZ,中国) King's College London, UK(伦敦国王学院,英国)

AI总结 通过分解音素错误率、分析混淆矩阵,发现BLSTM层对过渡依赖类提升最大,WSJ迁移学习对辅音改进约是元音的三倍,且混淆模式反映固有音素相似性。

Comments INTERSPEECH2026

详情
AI中文摘要

我们分析了原始波形声学模型在TIMIT音素识别中的错误模式,超越了整体音素错误率(PER)。将PER按三个广义语音类别(BPC)分解,并从替换错误构建混淆矩阵。我们的模型将参数化(SincNet, Sinc2Net)或非参数化CNN与双向LSTM相结合,在开发/测试集上分别达到13.9%/15.3%的PER,这是原始波形模型在TIMIT上的最佳报告结果。来自WSJ的迁移学习将PER降至11.3%/12.3%,超越了Filterbank基线。每个BPC的分析表明,BLSTM层对过渡依赖类提升最大,而WSJ迁移学习对辅音的改进约是元音的三倍。原始波形和Filterbank系统的混淆模式一致,表明主要混淆反映了固有的音素相似性。

英文摘要

We analyse error patterns of raw waveform acoustic models on TIMIT phone recognition beyond the overall phone error rate (PER). PER is decomposed across three broad phonetic class (BPC) categorisations, and confusion matrices are constructed from substitution errors. Our models combine parametric (SincNet, Sinc2Net) or non-parametric CNNs with Bidirectional LSTMs, achieving 13.9%/15.3% PER on Dev/Test, the best reported results for raw waveform models on TIMIT. Transfer learning from WSJ reduces PER to 11.3%/12.3%, surpassing the Filterbank baseline. Per-BPC analysis reveals that BLSTM layers benefit transition-dependent classes most, while WSJ transfer learning improves consonants roughly three times more than vowels. Confusion patterns are consistent across raw waveform and Filterbank systems, indicating that the dominant confusions reflect inherent phonetic similarities.
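A per-class breakdown of this kind can be sketched as below, assuming the reference and hypothesis phone sequences have already been aligned by an edit-distance aligner. The phone-to-class map `BPC` is a hypothetical, heavily abbreviated stand-in, and insertions are simply left out of the per-class counts in this sketch.

```python
from collections import Counter

# Hypothetical, abbreviated phone-to-broad-class map (not the paper's).
BPC = {"p": "stop", "b": "stop", "s": "fricative", "z": "fricative",
       "iy": "vowel", "ih": "vowel"}

def per_by_class(aligned):
    """aligned: list of (ref, hyp) pairs from an edit-distance alignment.

    ref is None for an insertion, hyp is None for a deletion.
    Returns per-class error rates and a Counter of substitution confusions.
    """
    errors, totals, confusions = Counter(), Counter(), Counter()
    for ref, hyp in aligned:
        if ref is None:
            continue  # insertions have no reference class in this sketch
        cls = BPC[ref]
        totals[cls] += 1
        if hyp != ref:
            errors[cls] += 1
            if hyp is not None:
                confusions[(ref, hyp)] += 1  # substitutions only
    rates = {c: errors[c] / totals[c] for c in totals}
    return rates, confusions
```

The `confusions` counter is the sparse form of a confusion matrix; comparing it across two systems (for example raw waveform versus Filterbank) is one way to check whether the dominant confusions coincide.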

2606.07024 2026-06-08 cs.CV 新提交

GuideCAD: A Lightweight Multimodal Framework for 3D CAD Model Generation via Prefix Embedding

GuideCAD: 基于前缀嵌入的轻量级多模态3D CAD模型生成框架

Minseong Kim, Jinyeong Park, Sungho Park, Jibum Kim

发表机构 * Convergence Research Center for Insect Vectors(昆虫传播载体汇聚研究中心) Incheon National University(仁川国立大学) Center for Brain-Machine Interface(脑机接口中心)

AI总结 提出GuideCAD框架,利用少量可训练参数通过映射网络将图像嵌入转为前缀嵌入,结合预训练大语言模型和Transformer解码器生成3D CAD模型,参数减少约4倍且训练效率提升2倍。

详情
AI中文摘要

用于3D CAD生成的多模态方法需要大量计算资源,因此需要高效训练。为此,我们提出GuideCAD,利用语义丰富的视觉-文本表示,仅用少量可训练参数即可生成3D CAD模型。具体而言,GuideCAD使用映射网络将图像嵌入转换为前缀嵌入,使预训练的大语言模型能够整合视觉和文本信息。随后,基于Transformer的解码器利用视觉-文本嵌入预测构建序列,从而生成3D CAD模型。为了实验评估,我们构建了一个新数据集,称为GuideCAD,包含文本-图像对。每对包括一个表示3D CAD构建序列的文本提示及其对应的3D CAD图像。实验结果表明,与微调方法相比,GuideCAD在生成质量相当的情况下,参数减少约四倍,训练效率提升两倍。我们已在以下网址发布方法的源代码和数据集:https://github.com/mskimS2/GuideCAD

英文摘要

Multi-modal approaches to 3D CAD generation require substantial computational resources, which makes efficient training necessary. To address this, we propose GuideCAD, which leverages semantically rich visual-textual representations to generate 3D CAD models with only a small number of trainable parameters. Specifically, GuideCAD uses a mapping network that converts image embeddings into prefix embeddings, enabling a pretrained large language model to integrate visual and textual information. A transformer-based decoder then predicts the construction sequence from the visual-textual embeddings in order to generate the 3D CAD model. For experimental evaluation, we construct a new dataset, also referred to as GuideCAD, which consists of text-image pairs. Each pair includes a text prompt that represents a 3D CAD construction sequence and its corresponding 3D CAD image. Our experimental results show that GuideCAD generates comparably high-quality 3D CAD models while using approximately four times fewer parameters and achieving twice the training efficiency compared to fine-tuning approaches. We have released the source code and dataset for our method at: https://github.com/mskimS2/GuideCAD
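The prefix-embedding step can be illustrated with a toy mapping network. The sketch below assumes a single linear layer and made-up dimensions; the released repository may use a different architecture, and every name here is hypothetical.

```python
import random

def make_mapping(d_img, prefix_len, d_model, rng):
    """Random weights for one linear layer (a stand-in for the mapping network)."""
    return [[rng.gauss(0.0, 0.02) for _ in range(d_img)]
            for _ in range(prefix_len * d_model)]

def image_to_prefix(image_emb, weights, prefix_len, d_model):
    """Project one image embedding to prefix_len vectors of size d_model."""
    flat = [sum(w * x for w, x in zip(row, image_emb)) for row in weights]
    return [flat[i * d_model:(i + 1) * d_model] for i in range(prefix_len)]

def build_llm_input(prefix, text_token_embs):
    """Prepend the visual prefix to the text token embeddings."""
    return prefix + text_token_embs
```

Only the mapping weights (and, in the paper's setup, the decoder) would need gradients under this scheme, which is where the small trainable-parameter count comes from; the language model itself stays frozen.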

2606.07020 2026-06-08 cs.CL 新提交

MADE: Beyond Scoring via a Multilingual Agentic Diagnosing Engine for Fine-Grained Evaluation Insights

MADE:超越评分——通过多语言智能诊断引擎实现细粒度评估洞察

Yilun Liu, Miao Zhang, Shimin Tao, Minggui He, Chunguang Zhao, Chenxin Liu, Li Zhang, Chen Liu, Cheng Qian, Liqun Deng, Xiaojun Meng, Daimeng Wei

发表机构 * Huawei(华为)

AI总结 提出MADE多语言智能诊断引擎,将评估后分析分解为规划、聚合分析、实例检查、多语言文化反思和报告合成,在33个模型族、11个基准、26种语言等大规模设置下,诊断报告质量提升47%,专家偏好率达87.9%。

详情
AI中文摘要

多语言和多文化基准现在覆盖数十种语言和模型族,但由此产生的得分景观仍然指标丰富而洞察贫乏,需要进行细粒度的多语言评估后诊断。然而,单个LLM和开放式智能体很容易被冗长、嘈杂的诊断输入所淹没,并且没有可重用的分类法。为了解决这个问题,我们提出了MADE,一个多语言智能诊断引擎,它将评估后分析分解为规划、聚合分析、实例级案例检查、多语言和文化反思以及基于事实的报告合成。MADE与一个专家主导的54个查询和15种语言的诊断集配对,在大规模多语言评估基础(33个模型族、11个基准、26种语言、34种文化、866万条评估记录)上进行评估。实验表明,MADE在诊断报告质量上比最强的共享基线高出47%,并且在87.9%的成对比较中被多语言人类专家偏好。与多语言专家一起应用,MADE进一步揭示了关于部署、迭代和跨文化陷阱的四个可操作发现,将基准得分表转化为模型选择和修复指南。

英文摘要

Multilingual and multicultural benchmarks now cover dozens of languages and model families, but the resulting score landscapes remain metric-rich and insight-poor, necessitating fine-grained multilingual post-evaluation diagnosis. However, single LLMs and open-ended agents are easily swamped by the long, noisy diagnostic input, and no reusable taxonomy exists for it. To address this, we propose MADE, a Multilingual Agentic Diagnosing Engine that decomposes post-evaluation analysis into planning, aggregate analysis, instance-level case inspection, multilingual and cultural reflection, and grounded report synthesis. MADE is paired with an expert-led 54-query and 15-language diagnostic set, evaluated on top of a large-scale multilingual evaluation substrate (33 model families, 11 benchmarks, 26 languages, 34 cultures, 8.66M evaluation records). Experiments show that MADE outperforms the strongest shared baseline by 47% in diagnosis report quality and is preferred by human multilingual experts in 87.9% of pairwise comparisons. Applied with multilingual experts, MADE further surfaces four actionable findings on deployment, iteration, and cross-cultural pitfalls, turning benchmark score tables into model-selection and remediation guidance.

2606.07017 2026-06-08 cs.AI cs.CL cs.ET 新提交

The Sim-to-Real Gap of Foundation Model Agents: A Unified MDP Perspective

基础模型智能体的仿真到现实差距:统一MDP视角

Xiaoou Liu, Tiejin Chen, Weibo Li, Xiyang Hu, Hua Wei

发表机构 * Arizona State University(亚利桑那州立大学)

AI总结 本文提出将基础模型智能体的评估与训练差距形式化为经典仿真到现实问题,围绕MDP四要素(观测、动作、转移、奖励)构建统一框架,并倡导采用域随机化等成熟解决方案。

Comments 7 pages, 2 figures, 2 tables. Accepted by KDD 2026 Blue Sky Ideas Track

详情
AI中文摘要

基础模型智能体越来越多地被部署用于现实世界决策,但受到仿真到现实差距的影响。虽然机器人学和经典控制有成熟的框架来解决这一差距,但基础模型社区将智能体鲁棒性视为一个全新的现象。我们的论文提出将基础模型智能体评估和训练差距形式化为一个经典的仿真到现实问题,完全围绕马尔可夫决策过程的四个要素构建,包括观测、动作、转移和奖励。在本文中,我们设定了一个全面的研究议程,将经典差异转化为基础模型领域,并倡导采用域随机化等成熟解决方案。我们提供了具体示例,例如多语言工具调用,以展示尽管语义意图正确,但观测空间差距如何导致操作无效的动作。最终,这一议程旨在推动范式转变,产生统一的词汇和标准化的压力测试基准,以培养新一代高度可信的智能体,用于可靠的现实世界应用。

英文摘要

Foundation model agents are increasingly deployed for real-world decision-making, but suffer from the sim-to-real gap. While robotics and classical control have mature frameworks to address this gap, the foundation model community is treating agent robustness as an entirely novel phenomenon. Our paper proposes formalizing the foundation model agent evaluation and training gap as a classical sim-to-real problem structured entirely around the four elements of a Markov Decision Process: Observation, Action, Transition, and Reward. We set a comprehensive research agenda that translates classical discrepancies into the foundation model domain and advocates for adopting established solutions like domain randomization. We provide concrete examples, such as multilingual tool calling, to demonstrate how severe observation-space gaps lead to operationally invalid actions despite correct semantic intent. Ultimately, this agenda aims to drive a paradigm shift, yielding a unified vocabulary and standardized stress-test benchmarks to foster a new generation of highly trustworthy agents for reliable real-world applications.

2606.07013 2026-06-08 cs.RO cs.HC 新提交

A Multi-Operator Mixed-Reality Interface for Multi-Robot Control and Coordination: Co-Located and Private Workspace Collaboration

面向多机器人控制与协调的多操作员混合现实界面:共位与私有工作空间协作

Omotoye Shamsudeen Adekoya, Antonio Sgorbissa, Carmine Tommaso Recchiuto

发表机构 * DIBRIS Department, RICE Laboratory, University of Genoa(DIBRIS部门,RICE实验室,热那亚大学)

AI总结 提出一种扩展至多操作员协作的混合现实界面,支持共位共享工作空间和私有工作空间两种模式,通过注册驱动场景构建、轻量级会话同步和单机器人控制租约防止命令冲突。实验表明两种模式任务性能相当,但共位模式显著提升协作感知和操作员偏好。

Comments Submitted to RO-MAN 2026

详情
AI中文摘要

多操作员控制机器人团队不仅需要访问相同的任务信息,还需要维护共享态势感知并防止冲突干预的机制。基于我们之前的HORUS界面(统一系统的整体操作现实),我们提出了一种混合现实界面,将单操作员多机器人监督扩展到协作式多操作员使用。该系统支持两种互补模式:共位共享工作空间,操作员在同一物理位置观察和操作同一张迷你地图;以及私有工作空间模式,操作员通过独立放置的本地工作空间执行相同任务。该架构结合了注册驱动的场景构建、轻量级共享会话同步以及每机器人控制租约,以支持协作监控、任务分配和远程操作,同时防止冲突命令。我们在一项人类受试者研究中评估了该方法,共有36名参与者(18对)在两个搜索环境中控制三台Nova Carter移动机器人。两种模式下的客观任务性能相当,表明两种模式都支持有效的任务执行。然而,共位共享工作空间显著改善了感知协作、共享理解和交接清晰度,并且是首选的协作模式。这些结果表明,即使底层机器人控制工具保持不变,物理上共置MR工作空间也能改善操作员的协调方式。

英文摘要

Multi-operator control of robot teams requires not only access to the same mission information, but also mechanisms for maintaining shared awareness and preventing conflicting interventions. Building on our previous HORUS interface (Holistic Operational Reality for Unified Systems), we present a mixed-reality interface that extends single-operator multi-robot supervision to collaborative multi-operator use. The system supports two complementary modes: a co-located shared workspace, in which operators observe and manipulate the same mini-map in the same physical location, and a private-workspace mode, in which operators work on the same mission through independently placed local workspaces. The architecture combines registration-driven scene construction, lightweight shared-session synchronization, and per-robot control leases to support collaborative monitoring, tasking, and teleoperation while preventing conflicting commands. We evaluated the approach in a human-subject study with 36 participants (18 pairs) controlling three Nova Carter mobile robots in two search environments. Objective task performance was comparable across the two modes, indicating that both modes supported effective mission execution. However, the co-located shared workspace significantly improved perceived collaboration, shared understanding, and handoff clarity, and was the preferred collaborative mode. These results indicate that physically co-locating the MR workspace improves how operators coordinate even when the underlying robot-control tools remain unchanged.
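A per-robot control lease can be sketched as a small lock table with expiry. The class below is a hypothetical illustration of the general pattern, not the HORUS implementation; the time-to-live value, the method names, and passing the clock in as `now` are assumptions chosen to keep the example deterministic.

```python
class LeaseManager:
    """Grant at most one operator control of each robot at a time."""

    def __init__(self, ttl_s=5.0):
        self.ttl_s = ttl_s
        self._leases = {}  # robot_id -> (operator_id, expiry_time)

    def acquire(self, robot_id, operator_id, now):
        """Take or renew a lease; fail if another operator holds a live one."""
        holder = self._leases.get(robot_id)
        if holder is not None and holder[0] != operator_id and holder[1] > now:
            return False
        self._leases[robot_id] = (operator_id, now + self.ttl_s)
        return True

    def may_command(self, robot_id, operator_id, now):
        """Commands are accepted only from the current, unexpired lease holder."""
        holder = self._leases.get(robot_id)
        return holder is not None and holder[0] == operator_id and holder[1] > now

    def release(self, robot_id, operator_id):
        """Explicit handoff: only the holder can release its own lease."""
        holder = self._leases.get(robot_id)
        if holder is not None and holder[0] == operator_id:
            del self._leases[robot_id]
```

Expiry means a disconnected headset cannot block a robot indefinitely, while `release` supports a deliberate handoff between operators.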

2606.07012 2026-06-08 cs.RO 新提交

Task Editing for Generalizable 3D Visuomotor Policy Learning

面向可泛化3D视觉运动策略学习的任务编辑

Jian-Jian Jiang, YiHan Yang, Lan Wei, Yuming Luo, Xiao-Ming Wu, Xuhang Chen, Bin Fan, Dandan Zhang, Wei-Shi Zheng

发表机构 * Sun Yat-sen University(中山大学) Imperial College London(帝国理工学院) Nanyang Technological University(南洋理工大学) South China University of Technology(华南理工大学)

AI总结 提出Task-Edit框架,通过将任务分解为场景、技能和对象组件并灵活重组,生成多样化轨迹,提升3D视觉运动策略在长程操作任务中的泛化能力。

Comments 8 pages, 4 figures

详情
AI中文摘要

3D视觉运动策略为复杂机器人操作提供了有前景的方向,因为深度图和点云为空间推理提供了丰富的几何信息。然而,它们的成功通常依赖于大规模的真实世界演示,这些演示的收集成本高昂且耗时。为此,现有方法通常使用演示生成策略,通过对人类收集的演示应用以对象为中心的变换(如改变对象姿态或尺度)来提高数据效率。虽然这些变换在局部变化上有效,但它们很大程度上保留了原始场景结构和技能序列,限制了合成复杂任务中多样化的场景-技能-对象组合的能力。在本文中,我们提出Task-Edit,一种新颖的演示生成框架,从任务中心编辑的角度生成多样化轨迹。Task-Edit的关键见解是将任务分解为场景、技能和对象组件,并灵活地重新组合它们。通过这种方式,Task-Edit实现了可扩展的演示生成,并显著提高了长程操作任务的泛化能力。我们通过大量真实世界实验评估了Task-Edit,并展示了三个优势:(1)有效性:Task-Edit在各种真实世界任务和机器人形态上显著提升了3D视觉运动策略。(2)泛化性:Task-Edit提高了模型在不同场景设置下的泛化能力。(3)适用性:Task-Edit使模型能够处理真实世界中难以收集的场景,包括抗干扰、避障和未见过的杂乱场景。

英文摘要

3D visuomotor policies offer a promising direction for complex robotic manipulation, as depth maps and point clouds provide rich geometric information for spatial reasoning. However, their success often depends on large-scale real-world demonstrations, which are costly and time-consuming to collect. To this end, existing methods commonly use demonstration generation strategies to improve data efficiency by applying object-centric transformations to human-collected demonstrations, such as varying object poses or scales. While effective for local variation, these transformations largely preserve the original scene structure and skill sequence, limiting their ability to synthesize diverse scene-skill-object combinations for complex tasks. In this paper, we propose Task-Edit, a novel demonstration generation framework that generates diverse trajectories from a task-centric editing perspective. The key insight of Task-Edit is to decompose a task into scene, skill and object components, and flexibly recombine them. In this way, Task-Edit enables scalable demonstration generation and significantly improves generalization for long-horizon manipulation tasks. We evaluate Task-Edit through extensive real-world experiments and demonstrate three advantages: (1) Effectiveness: Task-Edit significantly improves 3D visuomotor policies across various real-world tasks and robot embodiments. (2) Generalizability: Task-Edit improves model generalization across different scenario setups. (3) Applicability: Task-Edit enables models to handle scenarios that are difficult to collect in the real world, including disturbance resistance, obstacle avoidance and unseen cluttered scenes.

2606.07007 2026-06-08 cs.LG cs.AI 新提交

A Geometric View for Understanding Concept Learning and Neuron Interpretation in Sparse Autoencoders

理解稀疏自编码器中概念学习与神经元解释的几何视角

Chenhao Zhang, Chris Lin, Su-In Lee

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出统一数学框架,将概念学习形式化为集合对齐问题,区分检测、分离和近似三种学习强度,并给出几何条件与误差界,通过形式概念分析连接概念学习与神经元解释。

详情
AI中文摘要

我们提出了一个统一的数学框架,用于几何理解稀疏自编码器(SAE)中的概念学习和神经元解释。尽管SAE通过学习稀疏特征表示提高了神经网络的可解释性,但“概念”和“学习”的原则性定义仍不明确。我们将概念形式化为数据点的集合,并将概念学习视为人类定义概念与模型诱导概念之间的集合对齐问题。该公式区分了三种越来越强的学习概念——检测、分离和近似——并给出了概念可由单个神经元或多神经元单元表示的几何条件、误差界和容量约束。它还提供了对常见SAE现象的集合论解释,包括特征分裂、特征吸收、特征族和层次概念。最后,我们通过形式概念分析将概念学习与神经元解释联系起来,表明这两个方向不必一致,并且它们的多对多结构可以通过概念格来组织。在合成数据上使用ReLU和Top-$K$ SAE的实验说明了该理论,并揭示了SAE大小和稀疏性对概念学习的影响。

英文摘要

We propose a unified mathematical framework for a geometric understanding of concept learning and neuron interpretation in sparse autoencoders (SAEs). While SAEs improve interpretability of neural networks by learning sparse feature representations, principled definitions of "concept" and "learning" remain unclear. We formalize concepts as sets of data points and cast concept learning as a set-alignment problem between human-defined and model-induced concepts. This formulation distinguishes three increasingly strong notions of learning -- detection, separation, and approximation -- and yields geometric conditions, error bounds, and capacity constraints for when concepts can be represented by individual neurons or multi-neuron units. It also provides a set-theoretic account of common SAE phenomena, including feature splitting, feature absorption, feature families, and hierarchical concepts. Finally, we connect concept learning and neuron interpretation through formal concept analysis, showing that the two directions need not agree and that their many-to-many structure can be organized by concept lattices. Experiments on synthetic data with ReLU and Top-$K$ SAEs illustrate the theory and reveal the effects of SAE size and sparsity on concept learning.
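Treating a concept as a set of data points makes alignment measurable with ordinary set overlap. The sketch below is an assumption-laden illustration: it uses a Top-K activation, defines a neuron's induced set as the points where it fires, and reports Jaccard, precision, and recall. It does not reproduce the paper's formal definitions of detection, separation, or approximation.

```python
def top_k(acts, k):
    """Keep the k largest activations of one data point, zero the rest."""
    keep = set(sorted(range(len(acts)), key=lambda i: acts[i], reverse=True)[:k])
    return [a if i in keep else 0.0 for i, a in enumerate(acts)]

def induced_set(activations_per_point, neuron):
    """Indices of the data points on which the given neuron is active."""
    return {i for i, acts in enumerate(activations_per_point) if acts[neuron] > 0.0}

def alignment(human, induced):
    """Overlap between a human-defined concept set and a neuron-induced set."""
    inter = len(human & induced)
    union = len(human | induced)
    return {
        "jaccard": inter / union if union else 1.0,
        "precision": inter / len(induced) if induced else 0.0,
        "recall": inter / len(human) if human else 0.0,
    }
```

High precision with low recall is the kind of pattern one would expect when a neuron covers only part of a concept, as in the feature-splitting phenomenon the abstract mentions.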

2606.07006 2026-06-08 cs.LG cs.CL 新提交

RASFT: Rollout-Adaptive Supervised Fine-Tuning for Reasoning

RASFT: 用于推理的滚动自适应监督微调

Yongliang Miao, Fengyuan Liu, Wei Shi, Yanguang Liu, Fei Sun, Na Zou, Mengnan Du

发表机构 * The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳)) Shanghai Artificial Intelligence Laboratory(上海人工智能实验室) New Jersey Institute of Technology(新泽西理工学院) Institute of Computing Technology, CAS(中国科学院计算技术研究所)

AI总结 提出RASFT框架,通过基于策略rollout的问题级可解性校准专家监督,在模型困难时加强指导、表现可靠时放松模仿并纳入自生成轨迹,同时使用裁剪逆比约束策略漂移,在多个推理基准上优于SFT和RL方法。

详情
AI中文摘要

监督微调(SFT)是一种通过模仿离线专家演示来使大型语言模型适应推理任务的流行方法,通常将单个专家轨迹视为目标行为。然而,推理并非简单的路径模仿:严格遵循一个演示解决方案可能会过度拟合表面形式并抑制模型自身的推理分布。我们提出了滚动自适应监督微调(RASFT),这是一种策略感知的SFT框架,它根据从验证的策略rollout中估计的问题级可解性来校准专家监督。对于每个问题,当当前策略困难时,RASFT加强专家指导,而当模型已经表现出可靠的推理行为时,放松严格模仿并纳入正确的自生成轨迹。为了保留有用的推理先验,RASFT进一步引入了冻结参考模型与当前策略之间的裁剪逆比,以约束过度的策略漂移。在六个数学推理基准和两个代码推理基准上的多个模型实验表明,RASFT在整体性能上优于SFT、SFT变体和代表性RL方法。代码可在 https://github.com/zjd1sq/RASFT 获取。

英文摘要

Supervised fine-tuning (SFT) is a prevailing method for adapting large language models to reasoning tasks by imitating offline expert demonstrations, often treating a single expert trajectory as the target behavior. However, reasoning is not simple path imitation: rigidly following one demonstrated solution may overfit to surface forms and suppress the model's own reasoning distribution. We propose Rollout-Adaptive Supervised Fine-Tuning (RASFT), a policy-aware SFT framework that calibrates expert supervision according to problem-level solvability estimated from verified on-policy rollouts. For each problem, RASFT strengthens expert guidance when the current policy struggles, while relaxing rigid imitation and incorporating correct self-generated trajectories when the model already exhibits reliable reasoning behavior. To preserve useful reasoning priors, RASFT further introduces a clipped inverse ratio between the frozen reference model and the current policy to constrain excessive policy drift. Experiments across multiple models on six mathematical reasoning benchmarks and two code reasoning benchmarks show that RASFT achieves better overall performance than SFT, SFT variants, and representative RL methods. The code is available at https://github.com/zjd1sq/RASFT.
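The two ingredients named in the abstract, problem-level solvability and a clipped inverse ratio, can be sketched as below. This is one possible reading under stated assumptions: the linear weighting schedule, the direction of the ratio (reference probability over current-policy probability), and the clipping bounds are all illustrative choices, not details taken from the paper or its repository.

```python
import math

def solvability(rollout_correct):
    """Fraction of verified on-policy rollouts that solved the problem."""
    return sum(rollout_correct) / len(rollout_correct)

def expert_weight(s):
    """Assumed linear schedule: lean on the expert more when solvability is low."""
    return 1.0 - s

def clipped_inverse_ratio(logp_ref, logp_cur, low=0.5, high=2.0):
    """Assumed form: p_ref / p_cur, clipped to [low, high].

    The ratio exceeds 1 once the current policy assigns a token less
    probability than the frozen reference does, so it can up-weight
    tokens where the policy has drifted away from the reference.
    """
    ratio = math.exp(logp_ref - logp_cur)
    return min(max(ratio, low), high)
```

For a problem solved in 3 of 4 rollouts, `expert_weight` returns 0.25, so most of the supervision for that problem would come from the model's own correct trajectories under this schedule.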

2606.07000 2026-06-08 cs.AI 新提交

Teaching the Way, Not the Answer: Privileged Tutoring Distillation for Multimodal Policy Optimization

教方法而非答案:用于多模态策略优化的特权辅导蒸馏

Shizhe Xiang, Ke An, Wenlong Yu, Yue Liu, Jian Luan, Pei Fu, Qilong Wang

发表机构 * Tianjin University(天津大学) Beijing Institute of Technology(北京理工大学) Singapore Management University(新加坡管理大学) University of Chinese Academy of Sciences(中国科学院大学) Xiaomi Inc(小米公司)

AI总结 提出PTD-PO框架,通过构建特权提示提供密集的令牌级监督,避免暴露答案,并采用Top-K JS散度稳定蒸馏,显著提升多模态推理性能。

详情
AI中文摘要

最近的后训练方法,特别是具有可验证奖励的强化学习(RLVR),显著增强了大型视觉语言模型(LVLMs)的推理能力。然而,可验证奖励的稀疏性为失败的rollout提供了很少的令牌级监督,常常导致复杂多模态推理任务中的低效探索。尽管策略蒸馏可以提供密集的指导,但基于外部教师的方法引入了大量计算开销,而基于答案条件微调的方法可能暴露答案级信息并诱导类似捷径的生成行为。为解决这些限制,我们提出了PTD-PO,一种用于RLVR的特权辅导蒸馏策略优化框架,在不向学生策略暴露答案的情况下提供密集指导。具体来说,PTD-PO从空间注意力引导和中间文本推理步骤中构建结构化的特权提示,并通过上下文学习将其用于生成逐步的令牌分布监督。学生仍在原始无答案上下文中优化,其失败的rollout在令牌分布级别与提示增强的参考模型对齐。为进一步稳定引导和无引导上下文之间分布偏移下的蒸馏,我们引入了Top-K Jensen-Shannon散度目标,专注于对齐信息性令牌概率,同时减少内存开销。在2B到8B参数的LVLMs上的实验表明,PTD-PO持续优于RLVR和蒸馏基线,缓解了熵崩溃,并提高了复杂多模态推理性能。

英文摘要

Recent post-training methods, particularly Reinforcement Learning with Verifiable Rewards (RLVR), have significantly enhanced the reasoning ability of Large Vision-Language Models (LVLMs). However, the sparse nature of verifiable rewards provides little token-level supervision for failed rollouts, often leading to inefficient exploration in complex multimodal reasoning tasks. Although policy distillation can offer dense guidance, external-teacher-based methods introduce substantial computational overhead, while answer-conditioned tuning methods may expose answer-level information and induce shortcut-like generation behavior. To address these limitations, we propose PTD-PO, a Privileged Tutoring Distillation Policy Optimization framework for RLVR that provides dense guidance without exposing the answer to the student policy. Specifically, PTD-PO constructs structured privileged hints from spatial attention guidance and intermediate textual reasoning steps, and uses them through in-context learning to produce step-wise token-distribution supervision. The student is still optimized under the original answer-free context, and its failed rollouts are aligned with the hint-augmented reference model at the token-distribution level. To further stabilize distillation under the distribution shift between guided and unguided contexts, we introduce a Top-K Jensen-Shannon divergence objective that focuses alignment on informative token probabilities while reducing memory overhead. Experiments on LVLMs ranging from 2B to 8B parameters show that PTD-PO consistently outperforms RLVR and distillation baselines, mitigates entropy collapse, and improves complex multimodal reasoning performance.
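A Top-K Jensen-Shannon objective can be sketched for a single token position as follows. The sketch assumes the top-K indices are chosen from the hint-augmented (teacher) distribution and that both distributions are renormalized over that subset; the paper may select or normalize differently.

```python
import math

def topk_js(p_teacher, q_student, k):
    """JS divergence restricted to the teacher's k most probable tokens."""
    idx = sorted(range(len(p_teacher)), key=lambda i: p_teacher[i], reverse=True)[:k]

    def renorm(dist):
        mass = sum(dist[i] for i in idx)
        if mass <= 0.0:
            raise ValueError("no probability mass on the selected tokens")
        return [dist[i] / mass for i in idx]

    p, q = renorm(p_teacher), renorm(q_student)
    m = [(a + b) / 2.0 for a, b in zip(p, q)]

    def kl(a, b):
        # Terms with a == 0 contribute nothing; b > 0 whenever a > 0.
        return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0.0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

Because only k probabilities per position are kept, the teacher's full vocabulary-sized distribution does not need to be stored, which is consistent with the memory saving the abstract mentions. The value is bounded between 0 and ln 2.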

2606.06996 2026-06-08 cs.RO cs.DC 新提交

Mission-Level Runtime Assurance Framework for Autonomous Driving

自动驾驶任务级运行时保证框架

Chieh Tsai, Salim Hariri

发表机构 * University of California, Berkeley(加州大学伯克利分校) Stanford University(斯坦福大学)

AI总结 提出一种评估驾驶安全与任务完成能力的运行时保证框架,通过监控系统拒绝不可行命令,实验证明其优于仅关注平台安全的方法。

详情
AI中文摘要

本文研究当高级驾驶命令出现故障或不可靠时自动驾驶的运行时安全性。与主要关注即时车辆安全的传统运行时安全方法不同,所提出的框架在执行命令前评估驾驶安全以及车辆是否仍能成功完成任务。该框架通过引入任务级故障场景(如跳过必需检查点、进入受限区域、生成无法成功完成任务的未来路线)扩展了highway-env。引入运行时监控系统,在执行前检测并拒绝不安全或任务不可行的命令。作为对比,使用公开的Simplex-Drive框架实现了一个基于学习的驾驶控制、安全回退控制和运行时控制器切换的自适应Simplex-Drive运行时安全基线。实验结果表明,仅平台级运行时安全无法检测任务级规划故障,而所提出的框架成功拒绝任务不可行命令,并在随机故障条件下提高了任务成功率。

英文摘要

This paper studies runtime safety for autonomous driving when high-level driving commands become faulty or unreliable. Unlike conventional runtime-safety approaches that mainly focus on immediate vehicle safety, the proposed framework evaluates both driving safety and whether the vehicle can still successfully complete its mission before a command is executed. The framework extends highway-env with mission-level fault scenarios such as skipping required checkpoints, entering restricted areas, and generating future routes that can no longer complete the mission successfully. A runtime monitoring system is introduced to detect and reject unsafe or mission-infeasible commands before execution. For comparison, an adapted Simplex-Drive runtime-safety baseline with learning-based driving control, safety fallback control, and runtime controller switching is implemented using the public Simplex-Drive framework. Experimental results show that platform-level runtime safety alone cannot detect mission-level planning faults, while the proposed framework successfully rejects mission-infeasible commands and improves mission success under randomized fault conditions.
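A mission-level check of the kind described can be sketched as a pre-execution filter over a proposed route. The route representation (a list of cell identifiers), the ordered checkpoint list, and the rejection reasons are hypothetical; this is not the paper's monitor and it does not use the highway-env API.

```python
def check_command(route, remaining_checkpoints, restricted, goal):
    """Return (accepted, reason) for a proposed route before it is executed."""
    # Safety-style check: never enter a restricted cell.
    if any(cell in restricted for cell in route):
        return False, "enters restricted area"
    # Mission-style check: checkpoints must be visited in order before the goal.
    pending = list(remaining_checkpoints)
    for cell in route:
        if pending and cell == pending[0]:
            pending.pop(0)
    if route and route[-1] == goal and pending:
        return False, "reaches goal with required checkpoints skipped"
    return True, "ok"
```

A route that is collision-free and stays out of restricted cells can still be rejected by the second check, which mirrors the abstract's point that platform-level safety alone does not catch mission-level planning faults.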

2606.06994 2026-06-08 cs.CL cs.DB 新提交

Principles of Concept Representation in Sentence Encoders

句子编码器中概念表示的原则

Isabelle Mohr, John Dujany, Jonathan Souquet, Andre Freitas

发表机构 * Idiap Research Institute(Idiap研究所) Merck KGaA(默克 KGaA)

AI总结 通过表征组合性视角,研究句子编码器产生良好概念表示的条件,提出四个原则:微调重塑而非扩展潜在几何(P1)、语义信号集中在特定层(P2)、硬负例改善区分性但不提升排序(P3)、监督有效性取决于概念组合类型(P4)。

详情
AI中文摘要

是什么让句子编码器产生良好的概念表示?我们通过表征组合性的视角来探讨这个问题:只有当编码器的潜在空间允许相应语义算子的低失真实现时,它才支持一个概念族。这一框架预测了当前编码器成功之处以及它们在结构上与监督不匹配的地方。通过在WordNet和Wiktionary的330万同义词和定义对上训练的编码器条件进行受控消融实验,在三个去污染分割和一个修饰语标记的名词短语基准上进行评估,我们确定了四个原则。微调重新校准潜在几何而非扩展它(P1)。语义信号在概念特定训练开始前集中在最后的Transformer层,使得跨层池化变得多余(P2)。硬负例改善了区分性和压力测试鲁棒性,但不提升检索排序,表明校准和排序是可独立处理的(P3)。最后,监督的有效性取决于目标概念的组合类型。外延训练有助于交性和子性概念族,但损害关系性和内涵性概念族,暴露了当前训练范式的结构性限制(P4)。我们发布了两个新的评估数据集:一个DBpedia语义差距基准和一个修饰语标记的名词短语释义套件。

英文摘要

What makes a sentence encoder produce good concept representations? We approach this through the lens of representational compositionality: an encoder supports a concept family only when its latent space admits a low-distortion realization of the corresponding semantic operator. This framing predicts both where current encoders succeed and where they are structurally mismatched to their supervision. Through a controlled ablation over encoder conditions trained on 3.3 million synonym and definition pairs from WordNet and Wiktionary, evaluated on three decontaminated splits and a modifier-labeled noun-phrase benchmark, we identify four principles. Fine-tuning recalibrates the latent geometry rather than expanding it (P1). Semantic signal concentrates in the final transformer layer before concept-specific training begins, making cross-layer pooling redundant (P2). Hard negatives improve discrimination and stress-test robustness without improving retrieval ranking, showing that calibration and ranking are independently addressable (P3). Finally, the effectiveness of supervision depends on the composition type of the target concept. Extensional training helps intersective and subsective families while degrading relational and intensional ones, exposing a structural limitation of current training paradigms (P4). We release two new evaluation datasets: a DBpedia semantic-gap benchmark and a modifier-labeled NP paraphrase suite.

2606.06991 2026-06-08 cs.CV cs.AI 新提交

Don't Pause: Streaming Video-Language Synchrony for Online Video Understanding

不要暂停:面向在线视频理解的流式视频-语言同步

Zhenyu Yang, Kairui Zhang, Shengsheng Qian, Weiming Dong, Changsheng Xu

发表机构 * National University of Singapore(新加坡国立大学) University of Science and Technology of China(中国科学技术大学)

AI总结 提出流式视频-语言同步(SVLS)范式,通过帧驱动转换控制器和流式令牌调节器实现视频帧与语言生成的细粒度同步,在不中断感知的情况下进行实时交互。

详情
AI中文摘要

在线视频大语言模型(Video-LLMs)通过逐帧处理和主动响应,在人机交互方面取得了进展。然而,流式场景中仍存在一个关键挑战:现有模型在生成响应时通常会暂停视频感知,破坏了实时的视频-语言同步并导致卡顿。为了解决这个问题,我们引入了一种新的在线视频理解范式:流式视频-语言同步(SVLS),并提出了LyraV,一个基于分层控制框架的实时流式助手,具有两个核心创新。首先,帧驱动转换控制器(FDTC)是一个无需训练的基于验证的有限状态机,它做出高层语义决策,决定何时继续说话、开始新的响应或保持沉默。其次,流式令牌调节器(SToP)是一个即插即用的轻量级预测模块,动态调整语言生成速率以匹配视觉内容的节奏。具体来说,LyraV执行逐帧增量、子预算解码:在每个帧间隔内,它只发射适合实时预算的一小部分令牌,因此感知永远不会被阻塞整个句子。这些组件共同使LyraV能够无缝地交织传入的视频帧和生成的词令牌,实现细粒度的同步。在五个在线和三个离线基准上进行的广泛实验表明,LyraV保留了骨干网络的通用理解能力,同时显著提高了流式同步和叙事流畅性,实现了98.29%的视频播放同步率和3.89 FPS的实时处理速度。有趣的是,我们观察到LyraV的一个经验能力:对流式令牌进行动态推理,实现了与视觉输入并行的连续解释和“思考”。

英文摘要

Online Video Large Language Models (Video-LLMs) have advanced toward seamless human-AI interaction through frame-by-frame processing and proactive responding. However, a critical challenge remains in streaming scenarios: existing models typically pause video perception while generating responses, breaking real-time video-language synchrony and causing stutters. To address this, we introduce a novel paradigm for online video understanding: Streaming Video-Language Synchrony (SVLS), and present LyraV, a live streaming assistant built upon a hierarchical control framework with two core innovations. First, the Frame-Driven Transition Controller (FDTC), a training-free verification-based finite-state machine, makes high-level semantic decisions on when to continue speaking, start a new response, or stay silent. Second, the Streaming Token Pacer (SToP), a plug-and-play lightweight predictive module, dynamically adapts the language generation rate to match the pace of the visual content. Concretely, LyraV performs per-frame incremental, sub-budget decoding: within each frame interval it emits only a small chunk of tokens that fits the real-time budget, so perception is never blocked for a full sentence. Together, these components enable LyraV to seamlessly interleave incoming video frames with generated word tokens, achieving fine-grained synchrony. Extensive experiments conducted on five online and three offline benchmarks demonstrate that LyraV preserves the backbone's general understanding ability while substantially improving streaming synchrony and narrative fluency, delivering 98.29% synchrony with video playback and a real-time processing speed of 3.89 FPS. Interestingly, we observe an empirical capability in LyraV: dynamic reasoning over streaming tokens, enabling continuous interpretation and "thinking" alongside visual input.
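Per-frame, sub-budget decoding can be illustrated with a toy scheduler that emits at most a fixed number of tokens between consecutive frames. The fixed per-token latency and the function name are assumptions for this sketch; in LyraV the pacing is predicted by the SToP module rather than computed from a constant.

```python
def interleave(frames, pending_tokens, frame_interval_s, token_latency_s):
    """Interleave frames with token chunks that fit the per-frame time budget.

    Returns the interleaved timeline and any tokens still waiting.
    """
    budget = max(0, int(frame_interval_s / token_latency_s))
    queue = list(pending_tokens)
    timeline = []
    for frame in frames:
        timeline.append(("frame", frame))       # perception is never skipped
        chunk, queue = queue[:budget], queue[budget:]
        timeline.extend(("token", t) for t in chunk)
    return timeline, queue
```

With a 0.25 s frame interval and 0.1 s per token, the budget is two tokens per frame, so a five-token response is spread over three frames instead of blocking perception while the whole sentence is generated.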

2606.06990 2026-06-08 cs.LG 新提交

Accelerating Reproducible Research in Synthetic EHR Generation

加速可复现的合成电子健康记录生成研究

Jalen Jiang, Chufan Gao, Ethan Rasmussen, Stephen Z. Xie, Jimeng Sun

发表机构 * University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校)

AI总结 提出一个轻量级端到端基准框架,统一数据加载、标准化训练和架构无关评估,重新实现多种基线模型并添加GPT-2基线,通过隐私-效用评估套件和自助置信区间分析长尾性能问题,推动社区驱动的可复现性。

详情
AI中文摘要

生成高保真合成电子健康记录(EHR)对于在保护患者隐私的同时推进医学研究至关重要。然而,现有生成模型之间的直接比较因代码库分散、数据加载器不兼容、库依赖冲突以及评估协议不一致而受到阻碍。为解决这些问题,我们引入了一个轻量级、端到端的可复现合成EHR评估基准框架,组织为统一流水线,涵盖数据摄取、标准化模型训练和架构无关评估。我们当前的实现针对纵向ICD诊断代码的生成——这是该文献中最常研究的模态——并基于社区维护的PyHealth库构建。我们在完整ICD-9词汇粒度下重新实现并统一了强基线模型(MedGAN、CorGAN、PromptEHR、HALO),并从通用序列建模文献中添加了一个轻量级GPT-2基线。我们贡献了一个严格的、架构无关的隐私-效用评估套件,该套件同样适用于GAN和基于Transformer的生成器,并报告了所有指标的自助置信区间。我们进一步分析了现有模型在长尾分布上的不佳表现,并讨论了框架在诊断代码之外的可扩展性。通过降低在单一流水线下运行、扩展和评估的工程障碍,我们为社区驱动的可复现性和合成EHR模型基准测试提供了一个起点。

英文摘要

The generation of high-fidelity synthetic Electronic Health Records (EHR) is crucial for advancing medical research while preserving patient privacy. However, head-to-head comparison of existing generative models is hindered by disjointed codebases, incompatible data loaders, conflicting library dependencies, and inconsistent evaluation protocols. To address these gaps, we introduce a lightweight, end-to-end benchmarking framework for reproducible synthetic EHR evaluation, organized as a unified pipeline spanning data ingestion, standardized model training, and architecture-agnostic evaluation. Our current implementation targets the generation of longitudinal ICD diagnosis codes -- the most commonly studied modality in this literature -- and is built on the community-maintained PyHealth library. We reimplement and unify strong baselines (MedGAN, CorGAN, PromptEHR, HALO) under full ICD-9 vocabulary granularity, and add a lightweight GPT-2 baseline from the general-purpose sequence-modeling literature. We contribute a rigorous, architecture-agnostic privacy-utility evaluation suite that applies identically to GAN- and transformer-based generators, and report bootstrapped confidence intervals across all metrics. We further analyze the poor long-tailed performance of existing models and discuss the extensibility of our framework beyond diagnosis codes. By lowering the engineering barrier to running, extending, and evaluating models under a single pipeline, we provide a starting point for community-driven reproducibility and benchmarking of synthetic EHR models.
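Bootstrapped confidence intervals of the kind reported here can be computed with a percentile bootstrap. The sketch below is a generic standard-library version, not the framework's own code; the resample count, the 95% level, and the function name are assumptions.

```python
import random

def bootstrap_ci(values, stat, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for stat(values)."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and recompute the statistic each time.
    stats = sorted(
        stat([values[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_resamples)
    )
    lo = stats[int((alpha / 2) * n_resamples)]
    hi = stats[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lo, hi
```

Applied per metric (for example a per-patient utility or privacy score), the interval indicates whether a gap between two generators is larger than the resampling noise.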

2606.06986 2026-06-08 cs.LG 新提交

Heterogeneous Effects of Green Finance on Urban Decarbonization: Evidence from 285 Cities in China

绿色金融对城市脱碳的异质性效应:来自中国285个城市的证据

Xueyang Li, Jinlei Ma

发表机构 * School of Business, Anhui University of Technology(安徽工业大学商学院)

AI总结 本研究利用计量模型和机器学习分析,发现绿色金融显著降低城市碳强度,其中绿色债券和绿色投资效果最强,且存在空间溢出效应,影响因城市发展水平而异,主要通过能源结构优化等渠道发挥作用。

详情
AI中文摘要

虽然绿色金融已成为低碳城市转型的关键工具,但其实际的脱碳效应和传导机制仍不明确。本研究采用计量经济模型和基于机器学习的分析,考察绿色金融是否以及如何降低城市碳强度。结果表明,绿色金融显著降低碳强度,其中绿色债券和绿色投资的影响最强,并存在明显的空间溢出效应。效果因发展水平而异,在四五线城市最为显著。中介分析显示,绿色金融主要通过能源结构优化发挥作用,其次是产业升级、外商直接投资和技术创新。SHAP分析证实不同金融工具之间存在显著差异,其中绿色债券、基金和信贷对脱碳贡献最大。此外,在技术能力低、产业依赖度高和以煤为主的能源结构的城市,边际影响更强。这些发现为构建多层次、区域差异化的绿色金融体系以促进包容性低碳转型提供了理论支持和政策指导。关键词:绿色金融;碳强度;脱碳效应;机器学习;城市

英文摘要

While green finance has become a key instrument for low-carbon city transitions, its actual decarbonization effects and transmission mechanisms remain unclear. This study employs econometric models and machine learning-based analysis to examine whether and how green finance reduces city-level carbon intensity. Results show that green finance significantly lowers carbon intensity, with green bonds and green investment having the strongest impacts and evident spatial spillovers. The effects vary by development level, being most pronounced in fourth- and fifth-tier cities. Mediation analysis reveals that green finance operates mainly through energy structure optimization, followed by industrial upgrading, foreign direct investment, and technological innovation. SHAP analysis confirms substantial differences across financial instruments, with green bonds, funds, and credit contributing most to decarbonization. Moreover, the marginal impact is stronger in cities with low technological capacity, high industrial dependency, and coal-based energy mixes. These findings provide theoretical support and policy guidance for building a multi-level, regionally differentiated green finance system to promote inclusive low-carbon transitions. Keywords: Green Finance; Carbon Intensity; Decarbonization Effect; Machine Learning; City

2606.06985 2026-06-08 cs.CL eess.AS 新提交

Contrastive Training with LLM-generated Near-Misses for Robust Code-Switching Speech Recognition

基于大语言模型生成近误样本的对比训练用于鲁棒语码转换语音识别

Tung X. Nguyen, Hieu Minh Truong, Giang-Son Nguyen, Nhu Vo, Wray Buntine, Dung D. Le

发表机构 * VinUniversity(Vin大学) University of Technology Sydney(悉尼科技大学) Monash University(蒙纳士大学)

AI总结 提出POI感知对比训练框架,通过大语言模型生成近误负样本并过滤,结合POI加权交叉熵与多负例对比损失微调Whisper-small,在语码转换语音识别任务上降低超过2%的错误率。

Comments Accepted at INTERSPEECH 2026

English abstract

Code-switching (CS), the alternation between multiple languages within a single utterance, remains challenging for Automatic Speech Recognition (ASR). To address this issue, we propose a Point-of-Interest (POI)-aware contrastive training framework that improves recognition at CS-critical regions. We first identify CS spans by adopting a POI detection method from the literature, then construct acoustically plausible near-miss hypotheses by perturbing POIs in ASR N-best outputs and expanding candidates with a large language model. Hard but plausible negatives are retained through filtering with acoustic, phonemic, and textual constraints. Finally, we fine-tune Whisper-small with LoRA using a POI-weighted cross-entropy anchor objective together with a multi-negative contrastive ranking loss. Experiments on CS-FLEURS (cmn-eng) and ViMedCSS (vie-eng) show consistent reductions of over 2% in both general and CS-aware error rates compared to standard LoRA fine-tuning.
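
Editor's sketch (not from the paper): the abstract names two loss terms but does not give their form. The snippet below is one plausible reading, assuming the POI-weighted cross-entropy is a weighted mean of token negative log-likelihoods and the multi-negative ranking loss is an InfoNCE-style softmax over one reference score and several near-miss scores. The names `poi_weight`, `temperature`, and `contrastive_weight` are hypothetical parameters introduced only for illustration.

```python
import math

def poi_weighted_cross_entropy(token_log_probs, poi_mask, poi_weight=2.0):
    """Weighted mean negative log-likelihood of the reference tokens.

    token_log_probs: log-probability of each reference token.
    poi_mask: True where the token lies inside a code-switching POI.
    poi_weight: hypothetical up-weighting factor for POI tokens.
    """
    weights = [poi_weight if is_poi else 1.0 for is_poi in poi_mask]
    total = sum(-w * lp for w, lp in zip(weights, token_log_probs))
    return total / sum(weights)

def multi_negative_ranking_loss(pos_score, neg_scores, temperature=1.0):
    """Assumed InfoNCE-style loss: -log softmax of the reference score
    against the scores of all near-miss negatives."""
    logits = [pos_score / temperature] + [s / temperature for s in neg_scores]
    peak = max(logits)  # subtract the max for numerical stability
    log_z = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_z - logits[0]

def combined_loss(token_log_probs, poi_mask, pos_score, neg_scores,
                  contrastive_weight=1.0):
    """Anchor cross-entropy plus a weighted contrastive term (assumed form)."""
    anchor = poi_weighted_cross_entropy(token_log_probs, poi_mask)
    ranking = multi_negative_ranking_loss(pos_score, neg_scores)
    return anchor + contrastive_weight * ranking
```

In this reading, the ranking term shrinks as the reference hypothesis scores further above its near-misses, while the anchor term keeps the model fitted to the reference transcript, with extra weight on code-switched tokens.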

2606.06984 2026-06-08 cs.LG New submission

Accelerating Multi-Objective Bayesian Optimisation via Predictive-Gradient Catalysts

Alma Rahat, Tinkle Chugh, Jonathan Fieldsend, Richard Allmendinger

Affiliations * Loughborough University; University of Exeter; The University of Manchester

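AI summary Proposes using Gaussian process predictive gradients as auxiliary signals that augment existing Pareto-compliant acquisition functions, accelerating the convergence of multi-objective Bayesian optimisation toward the global Pareto set.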

Comments Parallel Problem Solving From Nature (PPSN), 2026

English abstract

This paper presents a general acceleration mechanism for multi-objective Bayesian optimisation (MOBO) that leverages Gaussian process predictive gradients as auxiliary signals. Rather than replacing existing Pareto-compliant acquisition functions, the proposed approach augments them with local stationarity information from surrogate-derived gradients, enabling faster convergence toward the global Pareto set under limited evaluation budgets. Two catalyst instantiations are investigated: an adaptive Multiple-Gradient Descent Algorithm-Based Catalyst (MGDA) and a predefined-weight variant that enables focused exploration when budgets are tight. Experiments on the DTLZ benchmark suite (using 2 objectives and 10 decision variables) show that predictive gradient catalysis can deliver significant acceleration compared to other acquisition functions (EHVI, AugTch, tMPoI, SAF) when surrogates are accurate, particularly for stationary problems.
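
Editor's sketch (not from the paper): for two objectives, the standard multiple-gradient descent step has a closed form, namely the minimum-norm point in the convex hull of the two gradients. The snippet below shows that textbook computation on plain vectors. How the paper derives these gradients from the Gaussian process surrogate, and how the result is combined with an acquisition function, is not stated in the abstract and is not modelled here. The function names are hypothetical.

```python
import math

def mgda_two_objectives(g1, g2):
    """Minimum-norm convex combination alpha*g1 + (1 - alpha)*g2.

    Returns (alpha, direction). A direction of (near) zero norm means
    the point is Pareto-stationary for the two objectives.
    """
    diff_sq = sum((a - b) ** 2 for a, b in zip(g1, g2))
    if diff_sq == 0.0:
        alpha = 0.5  # identical gradients: any weight gives the same vector
    else:
        # Unconstrained minimiser of ||alpha*g1 + (1-alpha)*g2||^2 ...
        alpha = sum((b - a) * b for a, b in zip(g1, g2)) / diff_sq
        # ... clipped so the weights stay a convex combination.
        alpha = min(1.0, max(0.0, alpha))
    direction = [alpha * a + (1.0 - alpha) * b for a, b in zip(g1, g2)]
    return alpha, direction

def fixed_weight_direction(g1, g2, w1):
    """Predefined-weight variant: the same combination with a fixed weight."""
    return [w1 * a + (1.0 - w1) * b for a, b in zip(g1, g2)]

def stationarity(direction):
    """Norm of the combined gradient, usable as a local stationarity signal."""
    return math.sqrt(sum(d * d for d in direction))
```

For example, opposing gradients such as `[1, 0]` and `[-1, 0]` yield a zero direction, which flags a Pareto-stationary point, whereas orthogonal gradients yield an equal-weight compromise direction.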

2606.06978 2026-06-08 cs.CV New submission

CL-CLIP: CLIP-Based Continual Learning Framework with Cost-Volume Category Decoupling for Object Detection

Zihan Liu, Yuguang Yang, Shengjie Su, Jianing Pang, Linlin Yang, Chunyu Xie, Nikolai Yu. Zolotykh, Baochang Zhang

Affiliations * National College for Excellent Engineers, Beihang University; AI Research, Qihoo 360; School of Electronic Information Engineering, Beihang University; School of Cyber Science and Technology, Beihang University; School of Computer Science and Engineering, Beihang University; State Key Laboratory of Media Convergence and Communication, Communication University of China; Institute of Information Technology, Mathematics and Mechanics, Lobachevsky University; School of Artificial Intelligence, Beihang University

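AI summary Proposes the CL-CLIP framework, which uses cost-volume-guided category decoupling to strengthen the continual learning ability of open-vocabulary detectors and mitigate catastrophic forgetting, substantially improving the F-ViT baseline on PASCAL VOC and MS-COCO.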

English abstract

Continual Object Detection (COD) requires a detector to acquire new categories over time while preserving previously learned ones. This goal is closely related to open-vocabulary detection, since both settings require reasoning over categories that are not fully covered by the annotations available at the current training stage. Recent CLIP-based open-vocabulary detectors have shown strong zero-shot generalization, and frameworks such as F-ViT demonstrate that vision-language pretraining can provide powerful zero-shot detection ability for unseen categories. However, real-world deployments cannot remain purely zero-shot: once these detectors are continually updated on newly introduced categories, they suffer severe catastrophic forgetting and quickly lose their previously calibrated detection ability. We therefore propose CL-CLIP, a CLIP-based COD framework that equips open-vocabulary detectors with better continual learning ability through cost-volume-guided category decoupling. Specifically, following CAT-Seg, we compute a CLIP image-text similarity cost volume, defined as dense category-wise response maps between visual tokens and class text embeddings. This zero-shot spatial prior decomposes shared region features into class-specific pathways, which are then processed by a Multi-Expert RoI head. Extensive experiments on PASCAL VOC and MS-COCO show that CL-CLIP substantially improves the F-ViT baseline under continual fine-tuning and achieves competitive performance with existing continual object detectors, especially in adapting to newly introduced categories while preserving competitive base-class performance.
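
Editor's sketch (not from the paper): the abstract defines the cost volume as dense category-wise response maps between visual tokens and class text embeddings. A common way to compute such a volume is cosine similarity between each spatial token and each class embedding, which is what the snippet below assumes. The tiny hand-written vectors stand in for real CLIP features, and the function names are hypothetical.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def cost_volume(visual_tokens, text_embeddings):
    """Cosine-similarity cost volume.

    visual_tokens: H x W grid of D-dimensional token features.
    text_embeddings: C class embeddings of dimension D.
    Returns an H x W x C nested list; entry [i][j][c] is the response
    of class c at spatial location (i, j).
    """
    texts = [l2_normalize(t) for t in text_embeddings]
    volume = []
    for row in visual_tokens:
        out_row = []
        for token in row:
            tok = l2_normalize(token)
            out_row.append([sum(a * b for a, b in zip(tok, t)) for t in texts])
        volume.append(out_row)
    return volume

def class_response_map(volume, class_index):
    """Slice out the H x W response map for a single class."""
    return [[cell[class_index] for cell in row] for row in volume]
```

Each per-class slice of the volume can then act as a spatial prior for that class, which is the role the abstract assigns to it before the Multi-Expert RoI head.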