arXivDaily: arXiv daily academic digest, updated Monday to Friday
2606.07410 2026-06-08 cs.LG cs.AI New submission

A Comprehensive Anatomy of Human and DeepSeek-R1 LLM Mathematical Reasoning

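Yuxiang Chen, Jun Wang

Affiliations: UCL Centre for Artificial Intelligence

AI summary: By annotating 10,247 reasoning steps across all 30 AIME 2025 problems, the paper finds that DeepSeek-R1 exhibits topological mimicry (reproducing the surface form of reasoning rather than genuinely reasoning), while the stable use of branching and backtracking in successful traces and the effective placement of reflection within deductive inference are signals of genuine reasoning.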

Abstract

The emergence of "Aha moments" in large language models, particularly DeepSeek-R1-0120, has raised the question of whether these systems genuinely reason or merely imitate the appearance of reasoning. We conduct a comprehensive empirical comparison between model and human reasoning across all 30 problems from AIME 2025, exhaustively annotating 10,247 reasoning steps into five functional categories: Analysis, Inference, Branch, Backtrace, and Reflection. We find a clear structural difference. Human solutions maintain a compact alternation between analysis and deduction, whereas DeepSeek-R1 frequently revisits intermediate results, performs shallow and often unnecessary verification, and loops through local checks without meaningful logical progress. We describe this as topological mimicry: reproducing the surface form of reasoning without its functional role. Despite this, we identify two signals of genuine reasoning. First, successful traces exhibit stable use of branching and backtracking, while failed traces either underuse or overuse exploratory actions. Second, reflection is only effective when placed within deductive inference; reflections trapped in analysis loops focus on local numerical details while missing global logical errors. These findings suggest that current long-CoT models may be rewarded more for the appearance of reasoning than for genuine deductive progress. We discuss directions for improving evaluation and training, including measuring cross-trace stability, penalising "spinning-wheel" traces, encouraging deeper logical correction, and reallocating inference-time compute toward deduction and backtracking. Overall, reasoning quality depends not simply on how much reflection occurs, but on whether reflection appears consistently and at the appropriate logical scale.
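One way to picture the proposed "cross-trace stability" measurement is the sketch below. It is only an illustration under stated assumptions: the trace encoding (a list of category labels per trace), the function names, and the example traces are hypothetical and not taken from the paper. It computes the share of exploratory steps (Branch and Backtrace) in each trace and how much that share varies across repeated traces of one problem.

```python
from statistics import mean, pstdev

# The two exploratory categories named in the abstract.
EXPLORATORY = {"Branch", "Backtrace"}

def exploratory_share(trace):
    # Fraction of steps in one trace labelled Branch or Backtrace.
    return sum(step in EXPLORATORY for step in trace) / len(trace)

def cross_trace_stability(traces):
    # Mean and spread of the exploratory share over repeated traces.
    shares = [exploratory_share(t) for t in traces]
    return mean(shares), pstdev(shares)

# Hypothetical annotated traces for a single problem.
traces = [
    ["Analysis", "Inference", "Branch", "Inference",
     "Backtrace", "Inference", "Reflection", "Inference"],
    ["Analysis", "Inference", "Branch", "Inference",
     "Inference", "Backtrace", "Inference", "Reflection"],
    ["Analysis", "Reflection", "Analysis", "Reflection",
     "Analysis", "Reflection", "Analysis", "Inference"],
]
avg_share, spread = cross_trace_stability(traces)
print(avg_share, spread)
```

A small spread would indicate that exploratory actions are used consistently from trace to trace; the third trace, which only alternates Analysis and Reflection, is the kind of "spinning-wheel" pattern that pulls the spread up.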

2606.07404 2026-06-08 cs.LG New submission

Reversible Foundations: Training a 120B Sparse MoE through State-Preserving Scaling

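Rohan Shravan

Affiliations: The School of AI

AI summary: This paper reports end-to-end training of a hundred-billion-parameter sparse mixture-of-experts model on a single 8-GPU node, scaling in four stages from a dense seed to a 120B model under three principles: reversible recurrence, state-preserving growth, and single-node economics.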

Comments 58 pages, 9 figures, 37 tables. Code: https://github.com/The-School-of-AI/LLM. Released models: huggingface.co/theschoolofai/LightningLM-0.1V-{2B, 5B-MoE, 9B-MoE, 120B-MoE}. Companion work: arXiv:2605.29379 (BrahmicTokenizer-131K), arXiv:2605.29459 (Kronecker Embeddings)

Abstract

This paper reports on training a hundred-billion-parameter sparse mixture of experts on a single eight-GPU node, end to end. LightningLM 0.1V is a recurrence-backbone language model family grown in four stages from a small dense seed, through a 5B and a 9B mixture of experts, to a 120B model with 460 routed experts under top-12 routing. Each larger model is grown from the trained weights of the smaller one; active parameters rise monotonically from 1.78B at the dense seed to 5.93B at 120B (about 5% of the 118.67B stored). The full lineage runs on single nodes, the larger stages at 8K context, reaching a released training loss of 1.78 at 120B scale. This is a systems and experience report. It is organized around three disciplines. Reversibility: a reversible recurrence stack reconstructs activations in the backward pass instead of storing them, holding activation memory flat as the model grows. State-preserving growth: each expansion (dense to MoE, shallow to deep, few experts to many) is given as a reproducible principle paired with the failure that results from getting it wrong; several failures are silent. Single-node economics: the 120B trains through TQP, a strategy of quantized base expert weights and trained low-rank adapters that carries optimizer state on 2.26B adapter parameters rather than 100B+ resident in routed experts, cutting expert-path optimizer state by a factor of ~45. What is new is the integration of known primitives, not any primitive in isolation: one grown lineage running end to end on a single node, documented at practitioner level, with per-domain held-out loss as evidence that targeted capabilities (multilingual Indic competence, code) were learned by construction. Model family, tokenizer, and training code are released.
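The reversibility idea, reconstructing activations in the backward pass instead of storing them, can be illustrated with a RevNet-style additive coupling. This is an assumption made for illustration: the abstract does not say how the reversible recurrence stack is built, and the sub-layers `f` and `g` and all function names below are hypothetical.

```python
import math

def f(x):
    # Hypothetical sub-layer; any deterministic function works here.
    return [math.tanh(v) for v in x]

def g(x):
    # Second hypothetical sub-layer.
    return [0.5 * v for v in x]

def forward(x1, x2):
    # Additive coupling: each half is updated using only the other half.
    y1 = [a + b for a, b in zip(x1, f(x2))]
    y2 = [a + b for a, b in zip(x2, g(y1))]
    return y1, y2

def inverse(y1, y2):
    # Undo the two updates in reverse order to recover the inputs.
    x2 = [a - b for a, b in zip(y2, g(y1))]
    x1 = [a - b for a, b in zip(y1, f(x2))]
    return x1, x2

x1, x2 = [0.1, -0.4], [0.7, 0.2]
y1, y2 = forward(x1, x2)
r1, r2 = inverse(y1, y2)

# Largest reconstruction error over all four input values.
max_err = max(abs(a - b) for a, b in zip(x1 + x2, r1 + r2))
print(max_err)
```

Because the inputs can be recomputed from the outputs up to floating-point error, a stack of such blocks only needs to keep the final activations, which is why activation memory can stay roughly flat as depth grows.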

2606.07402 2026-06-08 cs.CL New submission

M$^3$Exam: Benchmarking Multimodal Memory for Realistic User-Agent Interactions

Zhengjun Huang, Wenxuan Liu, Zhoujin Tian, Wei Chen, Junle Chen, Yuqian Wu, Fangyuan Zhang, Qintian Guo, Xiaofang Zhou

Affiliations: The Hong Kong University of Science and Technology; Beijing University of Chemical Technology; The Hong Kong University of Science and Technology (Guangzhou); Harbin Institute of Technology (Shenzhen); Beijing Institute of Technology (Zhuhai); Tencent Hy; Peng Cheng Laboratory

AI summary: Proposes the M$^3$Exam benchmark for evaluating the cross-modal reasoning and implicit information inference of multimodal large language models in realistic user-agent interactions, and designs M$^3$Proctor, which processes visual sources on demand, improving accuracy by 13% while cutting index-construction time and retrieved tokens by over 70%.

Abstract

Language agents are increasingly deployed over accumulating multimodal information, yet existing benchmarks assume a human-human form with sparse visuals and straightforward content, evaluating neither reasoning over authentic multimodal file interaction nor the interpretation of concealed user information. We therefore introduce M$^3$Exam, a query-centric multimodal conversational memory benchmark built on realistic user-agent interaction, with multi-dimensional evaluation spanning cross-modal grounding and implicit information inference. Benchmarking MLLMs and memory systems reveals persistent gaps in cross-modal grounding, cross-session reasoning, and the efficiency cost of accumulating multimodal context. We further propose M$^3$Proctor, a multimodal memory method that detects query modality bias and consumes raw visual sources only on demand, improving accuracy by 13% while cutting index-construction time and retrieved tokens by over 70%.

2606.07401 2026-06-08 cs.CV New submission

RealDocBench: A Benchmark for Field-Level QA and Layout Understanding on Real-World Regulated Documents

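Ameya Joshi, Joon Kim, Gus Eggert, Joseph Bajor, Cindy Hao, Jing Reyhan, Kushal Byatnal, Eli Badgio

Affiliations: Extend AI

AI summary: Proposes RealDocBench, a benchmark with two tasks, field-level QA and layout understanding, that evaluates 18 systems on real regulated documents and reveals performance differences hidden by single metrics as well as cost and latency trade-offs.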

Abstract

Document parsing systems are increasingly deployed in high-stakes, regulated workflows such as mortgage underwriting, financial reporting, supply-chain logistics, and clinical records. Yet most public benchmarks evaluate parsers on clean academic layouts or synthetic prose, and report a single OCR or markdown-level similarity score. Such documents and metrics correlate poorly with what downstream agents actually need: the correct value for a specific field on a messy real-world page. We introduce RealDocBench, a two-track benchmark built from real regulated documents. The QA track contains 1,356 field-level questions over 581 documents spanning four domains, where each question is paired with a typed gold_dict of key-to-value answers and parsers are scored on both per-field and strict per-question accuracy. The layout track contains 1,500 human-verified page images annotated with COCO-style bounding boxes under a nine-class public taxonomy, scored with a Hungarian matcher that includes adjacency-aware split/merge recovery. We evaluate eighteen systems, spanning commercial parsing APIs, general-purpose VLMs, and open-source OCR models, under a uniform extraction-and-scoring protocol, and report accuracy alongside per-page cost and cache-busted latency. RealDocBench exposes a wide performance spread that single-number benchmarks hide, a persistently hard medical sub-domain, and sharp cost/latency trade-offs across operating points. We release the datasets, parser adapters, and evaluation harness to support reproducible, field-level comparison of document parsing systems.
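The difference between per-field and strict per-question accuracy can be sketched as follows. This is a minimal illustration, not the benchmark's released harness: the exact-match comparison, the function names, and the example records are assumptions, and the real scorer may normalise typed values differently.

```python
def score_question(gold_dict, predicted):
    # Count fields whose predicted value exactly equals the gold value.
    correct = sum(
        1 for key, value in gold_dict.items() if predicted.get(key) == value
    )
    return correct, len(gold_dict)

def score_benchmark(examples):
    # examples: list of (gold_dict, predicted_dict) pairs, one per question.
    fields_correct = fields_total = questions_correct = 0
    for gold_dict, predicted in examples:
        correct, total = score_question(gold_dict, predicted)
        fields_correct += correct
        fields_total += total
        # Strict: a question counts only if every one of its fields is right.
        questions_correct += int(correct == total)
    return fields_correct / fields_total, questions_correct / len(examples)

# Hypothetical questions with typed key-to-value answers.
examples = [
    ({"loan_amount": 250000, "borrower": "A. Smith"},
     {"loan_amount": 250000, "borrower": "A. Smith"}),
    ({"invoice_date": "2024-03-01", "total": 99.5},
     {"invoice_date": "2024-03-01", "total": 95.9}),
]
per_field, per_question = score_benchmark(examples)
print(per_field, per_question)
```

In this toy case three of four fields are correct but only one of two questions is fully correct, which shows how a parser can look strong per field while still failing the stricter per-question criterion.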

2606.07400 2026-06-08 cs.LG New submission

Generative Modeling of Discrete Latent Structures via Dynamic Policy Gradients

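Stefan Ivanovic, Ge Liu, Mohammed El-Kebir

Affiliations: University of California, Berkeley

AI summary: Proposes the GReinSS framework, which uses dynamically rescaled rewards to learn latent state distributions that maximize the observed data likelihood; it outperforms baselines on reconstructing simulated latent sets and graphs and reconstructs isoforms from RNA sequencing data more accurately than RSEM.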

Comments ICML 2026

Abstract

Many scientific problems require inferring unobserved mechanistic latent states from indirect observations. While classical approaches, including expectation maximization, do not scale to combinatorially large spaces, deep learning approaches such as variational autoencoders typically form artificial latent states rather than reconstructing the mechanistic ground-truth states. Here, we introduce GReinSS, a policy learning framework that uses dynamically rescaled rewards to learn latent state distributions that maximize the observed data likelihood. We show that GReinSS accurately reconstructs simulated latent sets and latent graphs, outperforming alternative policy learning and generative modeling baselines. Additionally, GReinSS reconstructs isoforms from real short-read RNA sequencing data that better match isoforms detected by orthogonal long-read sequencing than the standard RSEM algorithm. Overall, GReinSS is a principled and practically effective approach for generative modeling and inference of combinatorial latent states from indirect observations.

2606.07397 2026-06-08 cs.SD New submission

Audio-Oscar: A Multi-Agent System for Complex Audio Scene Generation, Orchestration, and Refinement

Yifan Duan, Qixiang Xu, Hengtao Wu, Zhanxun Liu, Wenhao Guan, Junxi Liu, Ziyang Ma, Kelu Xu, Xie Chen

Affiliations: MoE Key Lab of Artificial Intelligence; X-LANCE Lab; Shanghai Jiao Tong University; Shanghai Innovation Institute; Shanghai AI Laboratory; Xiamen University; State Key Laboratory of Complex & Critical Software Environment, China

AI summary: Proposes Audio-Oscar, a multi-agent framework that coordinates specialist agents for character modeling, speech generation, timeline planning, and more to generate and refine complex audio scenes, and builds the ASG-Bench benchmark for evaluation.

Abstract

In recent years, audio generation has made significant progress in tasks such as text-to-speech (TTS), text-to-audio (TTA) and text-to-music (TTM). However, generating long-form and controllable audio from complex audio scene descriptions remains a significant challenge, as such scenes often require coordinated speech, sound effects, music, songs, temporal structure, and post-production. In this work, we introduce Audio-Oscar, a multi-agent framework for generating audio from complex descriptions. Audio-Oscar coordinates a set of specialist agents, each responsible for a different aspect of the audio scene, including character modeling and voice design, speech generation, fine-grained timeline planning, model selection, non-speech generation, and audio post-production. Audio-Oscar further incorporates feedback-driven refinement. In addition, to address the lack of suitable benchmarks for evaluating audio generation from complex audio scene descriptions, we construct ASG-Bench, an Audio Scene Generation Benchmark containing both scene descriptions paired with reference audio and text-only scene descriptions. Each scene is annotated with target audio events and temporal statements to evaluate whether the generated audio faithfully realizes the required scene content and temporal structure. Experimental results show that Audio-Oscar can effectively generate audio that matches complex scene descriptions. Project samples are available at https://audiooscar.github.io/. Our code is available at https://github.com/ziye26/Audio-Oscar.

2606.07394 2026-06-08 cs.CV New submission

Mind the Gap: Disentangling Performance Bottlenecks in Video Instance Segmentation

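Danial Hamdi, Fardin Ayar, Mahdi Javanmardi

Affiliations: Computer Engineering Department, Amirkabir University of Technology (Tehran Polytechnic)

AI summary: Proposes a diagnostic framework based on integer linear programming that separates classification, segmentation, and tracking errors, finding that tracking instability is the main bottleneck in video instance segmentation, especially under occlusion, in long videos, and in dense scenes, and that stronger backbones do not remove this algorithmic problem.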

Abstract

In Video Instance Segmentation (VIS), classification, segmentation, and tracking objectives are jointly evaluated, but their individual contributions to performance loss remain opaque. We introduce a diagnostic framework that formulates identity and class assignment as an Integer Linear Program (ILP), yielding a model-agnostic oracle that hierarchically isolates each error source. Applied to seven VIS methods spanning online and offline paradigms across YouTube-VIS 2019/2021 and a diagnostic subset of OVIS, our analysis reveals a consistent picture. Tracking instability is a critical bottleneck for online methods, with gaps exceeding 20 AP under heavy occlusion, and grows sharply with video length and instance density. While semantic classification contributes meaningfully on standard benchmarks, its impact becomes negligible where tracking fails most. Although stronger backbones substantially lift default scores, they leave AP tracking gaps largely intact, confirming that temporal fragility is algorithmic rather than purely representational. To complement the oracle, we introduce TrackLens, a visual tool that translates gap magnitude into observable, query-level failure modes. Together, these tools provide a systematic foundation for targeting VIS's core challenge: robust long-term temporal association.
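The notion of an oracle identity assignment can be illustrated on a toy scale. The sketch below is a simplified stand-in and not the paper's ILP: it enumerates all one-to-one assignments between predicted and ground-truth tracks and keeps the one with the highest total overlap, which is only practical for a handful of tracks. The overlap matrix and function name are hypothetical.

```python
from itertools import permutations

def best_assignment(iou):
    # iou[i][j]: overlap between predicted track i and ground-truth track j.
    # Assumes a square matrix and tries every one-to-one assignment.
    n = len(iou)
    best_perm, best_score = None, float("-inf")
    for perm in permutations(range(n)):
        score = sum(iou[i][perm[i]] for i in range(n))
        if score > best_score:
            best_perm, best_score = perm, score
    return list(best_perm), best_score

# Hypothetical track-level overlaps for three predictions and three objects.
iou = [
    [0.9, 0.1, 0.0],
    [0.2, 0.3, 0.8],
    [0.1, 0.7, 0.2],
]
assignment, total = best_assignment(iou)
print(assignment, total)
```

Comparing a model's score under its own identities with its score under such an optimal reassignment is one way to attribute a gap to tracking rather than to segmentation quality; an ILP or the Hungarian algorithm would solve the same matching without exhaustive enumeration.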

2606.07392 2026-06-08 cs.AI cs.LG econ.EM stat.ML New submission

Online Pandora's Box for Contextual LLM Cascading

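Alexandre Belloni, Yan Chen, Yehua Wei

Affiliations: The Fuqua School of Business, Duke University

AI summary: For LLM cascading, proposes an online contextual Pandora's Box model that achieves dimension-dependent √T cumulative regret by combining a parametric reservation index and GMM estimation with UCB bounds.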

Abstract

Motivated by Large Language Model (LLM) cascading, we propose an online contextual Pandora's Box model for adaptively querying and selecting LLM APIs. In each period, a decision-maker observes a request context and faces a two-phase decision problem. In the query phase, the decision-maker sequentially queries APIs, where each query reveals a generated output and the decision-maker incurs an (output-dependent) cost. In the selection phase, the decision-maker selects one of the generated outputs to deploy and observes only the downstream reward of the deployed output. This output-mediated feedback structure differs from classical online contextual Pandora's Box models, in which opening a box directly reveals its reward. Rather than estimating the full conditional output and cost distributions of each API, we directly model the reservation index and develop a learning approach for the query phase. Specifically, we impose a parametric structure on the contextual reservation index functions induced by the classical Weitzman's policy. Our policy combines generalized method of moments (GMM) type estimation of these reservation indices with UCB-style confidence bounds for both these indices and the shared output-level reward evaluator. Under regularity conditions, we prove that the resulting policy achieves dimension-dependent $\widetilde O(\sqrt T)$ cumulative regret over a horizon of $T$ periods.
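For readers unfamiliar with reservation indices, the sketch below computes them in the classical Pandora's Box setting, where each box has a known reward distribution and opening cost and the index z solves E[max(X - z, 0)] = cost. It does not cover the paper's output-mediated feedback or its learned, context-dependent indices; the two "APIs", their distributions, and their costs are hypothetical.

```python
def expected_gain(values, probs, z):
    # E[max(X - z, 0)] for a discrete reward distribution.
    return sum(p * max(v - z, 0.0) for v, p in zip(values, probs))

def reservation_index(values, probs, cost, tol=1e-9):
    # Solve E[max(X - z, 0)] = cost by bisection. The left side is
    # continuous and non-increasing in z, so this bracket contains the root.
    lo, hi = min(values) - cost, max(values)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if expected_gain(values, probs, mid) > cost:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Two hypothetical APIs: (reward values, probabilities, query cost).
apis = {
    "cheap": ([0.0, 1.0], [0.5, 0.5], 0.1),
    "strong": ([0.0, 1.0], [0.2, 0.8], 0.4),
}
indices = {name: reservation_index(*spec) for name, spec in apis.items()}

# Weitzman's rule queries boxes in decreasing order of their index.
query_order = sorted(indices, key=indices.get, reverse=True)
print(indices, query_order)
```

Under Weitzman's rule the decision-maker stops as soon as the best reward seen so far exceeds the largest index among unopened boxes. In this toy case the cheaper API is queried first even though the stronger one succeeds more often, because its cost is low relative to its upside.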

2606.07389 2026-06-08 cs.RO New submission

Simulation-Driven Imitation Learning for Biosignals-Free Shared-Autonomy Prosthetic Grasping

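Kaijie Shi, Wanglong Lu, Huiling Chen, Vinicius Prado da Fonseca, Ting Zou, Hanli Zhao, Xianta Jiang

Affiliations: Memorial University of Newfoundland; Wenzhou University

AI summary: Proposes a simulation framework that automatically generates diverse grasp demonstrations by combining physically feasible grasp synthesis, retargeting of natural reaching trajectories, and execution in procedurally generated environments, enabling prosthetic control with high success rates and strong generalization through imitation learning.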

Abstract

Biosignals-free shared-autonomy control of upper-limb prosthetic hands aims to enable natural and low-effort manipulation without relying on EMG or other physiological signals. Recent imitation-learning-based approaches have shown promising results, but their scalability is limited by the cost and variability of collecting large amounts of real-world human demonstration data. In this work, we present a scalable simulation framework that automatically generates diverse reach-to-grasp demonstrations from a wrist-mounted virtual camera. The framework combines physically feasible grasp synthesis, retargeting of natural reaching trajectories, and reach-grasp-lift execution in procedurally generated indoor environments. It records wrist-view observations, proprioception, and actions to build a large-scale demonstration dataset for imitation learning. Through extensive simulation benchmarks, we evaluate object and scene generalization and compare several representative state-of-the-art imitation learning methods. Results show that the simulated demonstrations are sufficiently rich and consistent for effective policy learning. In three realistic settings, the learned sim-to-real policy achieves over 90% grasp success, surpasses baseline methods, and exhibits stronger generalization, highlighting the promise of simulation-driven training for biosignals-free shared-autonomy prosthetic grasping. The demonstrations are available at https://sites.google.com/view/sim-prosthetic-grasp/home.

2606.07387 2026-06-08 cs.LG New submission

Making the Most of Limited Data: Score-Aware Training for Text-to-Music Generation

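Yun-Chen Cheng, Tzu-Hung Huang, Chih-Pin Tan

Affiliations: National Taiwan University

AI summary: Proposes score-aware training, which uses a CLAP-conditioned Beta noise timestep schedule to route low-scoring audio segments to high-noise training and combines segment-level filtering with a two-stage caption strategy to achieve efficient text-to-music generation with limited data; the system placed second in the objective evaluation of the ICME 2026 ATTM challenge.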

Abstract

State-of-the-art text-to-music generation systems rely on massive proprietary datasets and industrial-scale compute, making it impossible to disentangle architectural contributions from resource advantages. We propose score-aware training, which treats the audio-caption alignment score as a direct supervision signal throughout the pipeline. Rather than discarding low-scoring segments, we repurpose them via a CLAP-conditioned Beta noise timestep schedule that routes them to high-noise training regimes, acting as an effective implicit regularizer. Complementarily, segment-level filtering removes the most misaligned examples, and a two-stage caption procedure bridges the distribution gap between verbose training captions and concise inference prompts. A REPA auxiliary loss further transfers structured semantic knowledge from pretrained CLAP and MuQ encoders without additional data. Our 450M-parameter FluxAudio-based system, submitted to the ICME 2026 ATTM Grand Challenge Efficiency Track, ranked 2nd across both tracks in the objective evaluation and 3rd in the Efficiency Track in the final MOS evaluation.
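A score-conditioned Beta timestep schedule might look like the sketch below. The abstract does not give the actual parameterisation, so everything here is assumed for illustration: the mapping from score to Beta parameters, the `strength` constant, a score normalised to [0, 1], and the convention that t = 1 means pure noise.

```python
import random

def beta_params(score, strength=4.0):
    # Hypothetical mapping from an alignment score in [0, 1] to Beta(a, b).
    # Low scores raise a, pushing mass toward t = 1 (taken here as pure noise).
    a = 1.0 + strength * (1.0 - score)
    b = 1.0 + strength * score
    return a, b

def sample_timestep(score, rng):
    # Draw one training timestep for a segment with the given score.
    a, b = beta_params(score)
    return rng.betavariate(a, b)

def mean_timestep(score):
    # The mean of Beta(a, b) is a / (a + b).
    a, b = beta_params(score)
    return a / (a + b)

rng = random.Random(0)
low = [sample_timestep(0.1, rng) for _ in range(2000)]
high = [sample_timestep(0.9, rng) for _ in range(2000)]
print(sum(low) / len(low), sum(high) / len(high))
```

With this mapping, poorly aligned segments are seen mostly at high noise levels, where the caption contributes little to the denoising target, while well-aligned segments dominate the low-noise steps where text conditioning matters most.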

2606.07386 2026-06-08 cs.RO New submission

Spline Policy: A Structured Representation for Robot Policies

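Mengze Tian, Yiming Li, Sichao Liu, Auke Ijspeert, Sylvain Calinon

Affiliations: École Polytechnique Fédérale de Lausanne (EPFL); Idiap Research Institute

AI summary: Proposes Spline Policy (SP), which replaces action chunks with spline parameters while keeping the policy backbone, supporting continuous trajectory decoding, temporal resampling, parameter-space editing, and downstream control, along with a local corrective mechanism.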

Comments This work has been submitted to the IEEE for possible publication

Abstract

Modern imitation-learning policies for robot manipulation often represent actions as fixed-resolution action chunks, which are simple and effective but expose limited geometric and temporal structure before execution. This paper studies Spline Policy (SP), a structured representation that replaces action chunks with spline parameters while keeping the policy backbone unchanged. The predicted spline can be decoded as a compact continuous trajectory, queried at different temporal resolutions, constrained or edited in parameter space, and passed to downstream controllers. For quadratic spline outputs, the same representation can also be converted into a state-dependent vector field through an analytical distance-field construction. Under the regularity and projection assumptions of this construction, the induced dynamics do not increase the distance to the generated spline, yielding a principled local corrective mechanism around the predicted motion. The spline output further supports uncertainty propagation from observations to spline parameters, trajectories, and flow fields, and can be combined with classical control mechanisms such as null-space collision avoidance without retraining the policy backbone. We instantiate SP with diffusion, flow-matching, transformer-based, and vision-language-action backbones. Experiments in low-dimensional motion learning, simulated manipulation under matched backbones, dexterous manipulation, and real-robot case studies show that SP remains compatible with modern policy learners while exposing useful motion-structure properties, including compact decoding, temporal resampling, local correction around predicted motions, uncertainty evaluation, and controller compatibility.
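Temporal resampling from spline parameters can be illustrated with a single quadratic Bézier segment. This is a sketch under assumptions: the abstract mentions quadratic splines but not the exact basis, and the control points and function names below are hypothetical.

```python
def quadratic_bezier(p0, p1, p2, t):
    # Evaluate one quadratic Bezier segment at t in [0, 1].
    w0, w1, w2 = (1 - t) ** 2, 2 * (1 - t) * t, t ** 2
    return tuple(w0 * a + w1 * b + w2 * c for a, b, c in zip(p0, p1, p2))

def decode(control_points, num_steps):
    # Sample the same spline parameters at any resolution (num_steps >= 2).
    p0, p1, p2 = control_points
    return [
        quadratic_bezier(p0, p1, p2, i / (num_steps - 1))
        for i in range(num_steps)
    ]

# Hypothetical 2-D end-effector path predicted by a policy head.
params = [(0.0, 0.0), (0.5, 1.0), (1.0, 0.0)]
coarse = decode(params, 5)   # e.g. for a slow controller
fine = decode(params, 50)    # same motion, denser sampling
print(coarse[2], len(fine))
```

Three control points describe the whole motion, so the output is compact, and a downstream controller can query it at whatever rate it runs without the policy being re-evaluated. A fixed-resolution action chunk would need interpolation to achieve the same.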

2606.07383 2026-06-08 cs.RO cs.LG New submission

RhinoVLA Technical Report

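Huixi Intelligence: Chen Zhang, Chenyang Zhou, Guanglei Ding, Guanghui He, Haibin Gao, Jiajia Chen, Jianyong Zhang, Lianyi Yu, Ningyi Xu, Ping Xu, Qingchen Li, Yingjun Hu, Yijia Zhang, Yuxi Liu

Affiliations: Huixi Intelligence

AI summary: To address the deployment latency of VLA models on edge hardware, proposes RhinoVLA, which achieves real-time closed-loop control through a token-efficient backbone, a continuous action expert, and a unified interface, reaching 11.69 Hz inference on the Huixi R1.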

Abstract

Vision-Language-Action (VLA) models have shown strong potential for robotic manipulation, but real-time deployment on edge hardware remains challenging. In this work, we identify VLM visual and context tokens as a major source of deployment latency: for GEMM-dominated projection operators, computation grows linearly with the number of input tokens when model dimensions are fixed. Motivated by this observation, we propose RhinoVLA, a deployment-oriented VLA model co-designed with the Huixi R1 edge SoC. RhinoVLA adopts a token-efficient Qwen3-VL backbone and a continuous Action Expert, reducing the VLM-side token and computation burden while preserving pretrained multimodal capability. To support cross-robot learning, RhinoVLA further introduces a unified interface that combines a View Registry, a 72D physical state-action slot space, and robot-instance LoRA, allowing heterogeneous robot observations and action schemas to be aligned under a shared policy. On the deployment side, RhinoVLA is optimized through hardware-aware compilation, mixed-precision execution, and parallel visual encoding. Experiments show that RhinoVLA achieves downstream performance comparable to π0.5 at a similar parameter scale, while reaching 11.69 Hz end-to-end inference on Huixi R1, meeting the 10 Hz real-time closed-loop control target.
 The project will be open-sourced at https://github.com/HuixiAI/RhinoVLA.
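The linear dependence of projection cost on token count, and what the reported rates mean as a per-step time budget, can be checked with a few lines of arithmetic. The model width and token counts below are hypothetical placeholders, not values from the report; only the 11.69 Hz and 10 Hz figures come from the abstract.

```python
def projection_flops(num_tokens, d_in, d_out):
    # A dense projection costs one multiply and one add per weight per token.
    return 2 * num_tokens * d_in * d_out

def step_latency_ms(rate_hz):
    # Time available per control step at a given inference rate.
    return 1000.0 / rate_hz

d_model = 2048  # hypothetical width, not taken from the report
base = projection_flops(256, d_model, d_model)
halved = projection_flops(128, d_model, d_model)

ratio = halved / base                    # halving tokens halves the cost
achieved_ms = step_latency_ms(11.69)     # reported end-to-end rate
budget_ms = step_latency_ms(10.0)        # stated real-time target
print(ratio, round(achieved_ms, 1), round(budget_ms, 1))
```

At 11.69 Hz each inference takes roughly 85.5 ms, which leaves about 14.5 ms of headroom against the 100 ms budget implied by a 10 Hz control loop.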

2606.07382 2026-06-08 cs.LG stat.ML New submission

Covariance Shrinkage via Stochastic Interpolation

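Mathieu Chalvidal, Florentin Coeurdoux, Eric Vanden-Eijnden

Affiliations: Capital Fund Management

AI summary: Recasts classical shrinkage for high-dimensional covariance estimation as empirical risk minimization over a parametric stochastic interpolant between a source and a target distribution, reveals three mechanisms for reducing statistical risk, and designs a neural estimator with an upper bound on its risk.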

Comments 18 pages

Abstract

We recast classical shrinkage of high-dimensional covariance estimators as empirical risk minimization over a parametric stochastic interpolant between a source and a target distribution. This formalism recovers known shrinkage estimators as special cases and reveals three distinct mechanisms for reducing statistical risk: (i) Scheduling: the interpolant schedule determines the class of admissible covariances, and hence the achievable risk. (ii) Flow maps and couplings: whereas naive constructions amount to assuming independence between the distributions, specific coupling structures (e.g., solutions of optimal transport problems) can lower the empirical risk. Moreover, non-linear flow maps realizing such couplings free the interpolant covariance from the eigenbasis of the empirical estimate, enabling eigenvector regularization. (iii) Early stopping: estimators defined by integrating a regressed vector field afford an additional bias-variance trade-off through approximation of the true interpolant distribution. We then propose a neural estimator of the interpolant, together with an upper bound on its quadratic risk in terms of the interpolant approximation error, and validate both on synthetic experiments. Finally, we apply the estimator to real neuroimaging data, demonstrating the additional regularization power this approach offers in practice.
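To see how a known shrinkage estimator could arise as a special case, consider a linear interpolant with independent endpoints. The notation below is an assumption made for illustration and is not taken from the paper: $x_0$ is drawn from the source with covariance $\Sigma_0$, $x_1$ from the target with empirical covariance $S$, and $a_t$, $b_t$ are the schedule coefficients.

```latex
\begin{aligned}
x_t &= a_t\,x_0 + b_t\,x_1, \qquad x_0 \perp x_1,\quad
      \operatorname{Cov}(x_0)=\Sigma_0,\ \operatorname{Cov}(x_1)=S,\\
\operatorname{Cov}(x_t) &= a_t^2\,\Sigma_0 + b_t^2\,S,\\
\Sigma_0=\mu I,\ a_t^2=\alpha,\ b_t^2=1-\alpha
  \;&\Longrightarrow\;
  \operatorname{Cov}(x_t) = (1-\alpha)\,S + \alpha\,\mu I .
\end{aligned}
```

The last line is the familiar linear shrinkage of $S$ toward a scaled identity. It also shows why independence is restrictive: the cross term vanishes, so the interpolant covariance keeps the eigenvectors of $S$, whereas a coupling between $x_0$ and $x_1$ would add cross-covariance terms that can change them.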

2606.07368 2026-06-08 cs.CV cs.AI New submission

Mitosis Detection in the Wild: Multi-Tumor and Context-Aware Generalization in the MIDOG 2025 Challenge

Marc Aubreville, Jonas Ammeling, Sweta Banerjee, Viktoria Weiss, Taryn A. Donovan, Robert Klopfleisch, Jiaqi Lv, Shan E Ahmed Raza, Raphaël Bourgade, Thomas Walter, Yasemin Topuz, Songül Varlı, Charles-Antoine Collins-Fekete, Zhuoyan Shen, Navya Sri Kelam, Nitin Singhal, Christian Marzahl, Brian Napora, Tengyou Xu, Hongyan Gu, Mario Vento, Gennaro Percannella, Norbert Ropiak, Izabela Wasiak, Jie Xiao, Shaojun Liu, Seungho Choe, April Khademi, Vidushi Walia, Sujatha Kotte, Andrew Broad, Alex Wright, Guillaume Balezo, Esha Sadia Nasir, Mostafa Jahanifar, Yosuke Yamagishi, Shouhei Hanaoka, Mattia Sarno, Francesco Tortorella, Biwen Meng, Jingxin Liu, Sara Krauss, Daniel Hieber, Lavish Ramchandani, Dev Kumar Das, Mieko Ochi, Yuan Bae, Piotr Giedziun, Mateusz Maniewski, Vangala Govindakrishnan Saipradeep, Naveen Sivadasan, Leire Benito-Del-Valle, Adrian Galdran, Kaustubh Atey, Sameer Anand Jha, Adinath Dukre, Imran Razzak, Maxime W. Lafarge, Viktor H. Koelzer, Nils Porsche, Nikolas Stathonikos, Mitko Veta, Dominik Hirling, Zsanett Zsófia Iván, Peter Horvath, Katharina Breininger, Christof A. Bertram

Affiliations: Flensburg University of Applied Sciences; Technische Hochschule Ingolstadt; University of Veterinary Medicine; Schwarzman Animal Medical Center; Freie Universität Berlin; University of Warwick; MINES Paris - PSL University; Yildiz Technical University; University College London; AIRA MATRIX Private Limited; University of California, Los Angeles; University of Kansas Medical Center; University of Salerno; Cancer Center Sp. z o. o.; th Military Research Hospital in Bydgoszcz; Shenzhen Technology University; Toronto Metropolitan University; Tata Consultancy Services Ltd.; Leeds Teaching Hospitals NHS Trust; The University of Tokyo; Xi’an Jiaotong-Liverpool University; University of Augsburg; Ulm University; Japanese Red Cross Medical Center; Wroclaw University of Science and Technology; TECNALIA, Basque Research and Technology Alliance (BRTA); Indian Institute of Technology Bombay; MBZUAI; University of Basel; University Medical Center Utrecht; TU Eindhoven; HUN-REN Biological Research Centre

AI总结 针对临床实际中组织学多样性的挑战,MIDOG 2025挑战评估了跨12种肿瘤类型和多种扫描平台的算法性能,发现模型在传统热点区域表现可靠,但在困难区域和罕见肿瘤中性能显著下降,集成方法可提升F1分数1.5个百分点。

详情
AI中文摘要

自动有丝分裂检测是计算病理学中一项成熟的任务。虽然之前的基准测试关注扫描仪引起的域偏移,但临床“真实世界”应用要求模型能够对组织学景观中预期的巨大差异具有鲁棒性。MItosis DOmain Generalization (MIDOG) 2025挑战旨在评估算法在空前生物学和上下文多样性下的性能。我们策划了一个包含365个病例的测试数据集,涵盖12种不同的人类、犬和猫肿瘤类型,并在多个扫描平台上数字化。超越手动选择的感兴趣区域(ROI),该挑战还要求在随机组织区域(代表全切片检测情况)和困难区域(富含难负样本的区域)进行检测。在第二个赛道中,我们引入了非典型有丝分裂象(AMF)的分类。检测赛道有18支队伍提交,F1分数最高达0.740。在AMF检测赛道,我们有21个提交,平衡准确率最高达0.908。我们的分析显示,虽然大多数模型在传统热点区域表现可靠,但在困难ROI中性能显著下降,假阳性率增加了两倍。此外,性能在12种肿瘤类型间差异显著,突显了当前最先进架构在遇到罕见或高度多形性恶性肿瘤时的“盲点”。此外,我们评估了集成的有效性,发现F1分数和平衡准确率平均分别提高1.5和1.3个百分点。相比之下,测试时增强(TTA)没有显示出相关改进。MIDOG 2025表明,“野外”有丝分裂检测仍然是一个重大障碍。从仅热点评估到多上下文框架的转变,为临床可靠性提供了更现实的代理指标。

英文摘要

Automated mitosis detection is a well-established task in computational pathology. While previous benchmarks focused on scanner-induced domain shift, clinical "real-world" application requires models to be robust to the vast variance expected across the histological landscape. The MItosis DOmain Generalization (MIDOG) 2025 challenge was designed to evaluate algorithmic performance across unprecedented biological and contextual diversity. We curated a test dataset of 365 cases, encompassing 12 distinct human, canine, and feline tumor types, digitized across multiple scanning platforms. Moving beyond hand-selected hotspots, the challenge also required detection in random tissue areas (representative of the whole-slide detection situation) and challenging areas (areas rich in hard negatives). In the second track, we introduced the classification of atypical mitotic figures (AMFs). There were 18 teams submitting to the detection track, with F1 scores ranging up to 0.740. In the AMF detection track, we had 21 submissions with balanced accuracy values up to 0.908. Our analysis reveals that while most models perform reliably in traditional hotspots, significant performance degradation occurs in challenging regions of interest (ROIs), where false positive rates tripled. Furthermore, performance varied significantly across the 12 tumor types, highlighting "blind spots" in current state-of-the-art architectures when encountering rare or highly pleomorphic malignancies. Moreover, we evaluated the effectiveness of ensembling and found mean increases of 1.5 and 1.3 percentage points in F1 score and balanced accuracy, respectively. In contrast, test-time augmentation (TTA) showed no relevant improvement. MIDOG 2025 demonstrates that "in the wild" mitosis detection remains a significant hurdle. The transition from hotspot-only evaluation to a multi-contextual framework provides a more realistic proxy for clinical reliability.
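
The two metrics reported in the abstract above follow standard definitions, and a short sketch may help when reading the numbers. The functions below are illustrative only: the names `f1_score`, `balanced_accuracy`, and `percentage_point_gain` are hypothetical helpers, not the challenge's evaluation code, and the abstract does not describe how detections are matched to ground truth.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Detection F1 from counts: 2*TP / (2*TP + FP + FN)."""
    denom = 2 * tp + fp + fn
    # With no true positives, false positives, or false negatives, return 0.0.
    return 2 * tp / denom if denom else 0.0


def balanced_accuracy(tp: int, fn: int, tn: int, fp: int) -> float:
    """Mean of sensitivity and specificity for a binary classifier."""
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes need at least one example")
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2


def percentage_point_gain(before: float, after: float) -> float:
    """Difference between two scores in [0, 1], expressed in percentage points."""
    return (after - before) * 100
```

Because F1 counts every false positive in its denominator, a rise in false positives in hard-negative regions lowers F1 even when recall is unchanged.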

2606.07367 2026-06-08 cs.LG 新提交

Self-evolving LLM agents with in-distribution Optimization

自演化分布内优化的LLM智能体

Yudi Zhang, Meng Fang, Zhenfang Chen, Mykola Pechenizkiy

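Affiliations * University of Science and Technology of China

AI summary Proposes Q-Evolve, a framework that unifies process-reward labeling and policy learning through in-distribution reinforcement learning, stabilizes Bellman updates with weighted Implicit Q-Learning, and enables agent self-evolution; it outperforms baselines on AlfWorld and other tasks.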

Comments ICML 2026

详情
AI中文摘要

大型语言模型(LLM)最近已成为复杂环境中交互智能体的强大控制器,但训练它们执行可靠的长期决策仍然是一个基本挑战。一个关键困难在于信用分配:智能体通常仅在回合结束时收到延迟奖励。在本文中,我们提出了Q-Evolve,一个用于LLM智能体的自演化框架,该框架在原则性的分布内强化学习范式中统一了自动过程奖励标注和策略学习。在每个演化迭代中,我们的方法从混合离策略数据集(结合专家演示与智能体生成的轨迹)中学习一个分布内评论家,通过加权隐式Q学习目标在稀疏奖励设置中稳定贝尔曼备份。然后,通过学习到的价值函数通过优势估计推导出逐步过程奖励,无需环境回溯或人工标注即可提供密集且可靠的监督。利用这些信号,我们执行行为近端策略优化,使智能体在用于过程奖励标注的数据上演化,从而在不加剧分布偏移的情况下实现迭代自我改进。我们在AlfWorld、WebShop和ScienceWorld上评估了我们的方法,结果显示Q-Evolve在样本效率、鲁棒性和整体任务性能上优于强基线。我们的结果表明,通过过程级监督和策略的共同演化(两者都基于共享的分布内学习循环),可以实现稳定的智能体自演化。

英文摘要

Large Language Models (LLMs) have recently emerged as powerful controllers for interactive agents in complex environments, yet training them to perform reliable long-horizon decision making remains a fundamental challenge. A key difficulty lies in credit assignment: agents often receive delayed rewards only at the end of episodes. In this paper, we propose Q-Evolve, a self-evolving framework for LLM agents that unifies automatic process-reward labeling and policy learning within a principled in-distribution reinforcement learning paradigm. In each evolving iteration, our method learns an in-distribution critic from a hybrid off-policy dataset that combines expert demonstrations with agent-generated trajectories, stabilizing Bellman backups in sparse-reward settings via a weighted Implicit Q-Learning objective. The learned value function is then used to derive step-wise process rewards through advantage estimation, enabling dense and reliable supervision without environment backtracking or human annotation. Leveraging these signals, we perform behavior-proximal policy optimization that evolves the agent over the data used for process reward labeling, allowing iterative self-improvement without exacerbating distribution shift. We evaluate our method on AlfWorld, WebShop, and ScienceWorld, showing Q-Evolve outperforms strong baselines in sample efficiency, robustness, and overall task performance. Our results demonstrate that stable agent self-evolution is achievable through the co-evolution of process-level supervision and policy, both grounded within a shared in-distribution learning loop.
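
The abstract above mentions two ingredients that can be sketched in a few lines: an Implicit Q-Learning objective for the critic, and step-wise process rewards derived from advantages. The sketch below uses the standard IQL expectile loss, not the paper's weighted variant, whose exact form the abstract does not give. The function names and the default `tau=0.7` are assumptions made for illustration.

```python
def expectile_loss(diff: float, tau: float = 0.7) -> float:
    """Standard IQL expectile loss: |tau - 1(diff < 0)| * diff**2.

    Here diff stands for Q(s, a) - V(s). With tau > 0.5, positive
    differences are penalised more, pushing V toward an upper expectile of Q.
    """
    weight = tau if diff >= 0 else 1.0 - tau
    return weight * diff * diff


def step_advantages(q_values, v_values):
    """Per-step advantage A(s_t, a_t) = Q(s_t, a_t) - V(s_t).

    Used here as a dense process-reward signal, one value per step.
    """
    if len(q_values) != len(v_values):
        raise ValueError("q_values and v_values must have the same length")
    return [q - v for q, v in zip(q_values, v_values)]
```

A positive advantage marks a step that did better than the critic expected from that state, which is what allows a reward delayed until the end of the episode to be redistributed over individual steps.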

2606.07366 2026-06-08 cs.CV cs.LG cs.RO 新提交

Dash2Sim: Closed-Loop Driving Simulation from in-the-wild Dashcam Videos

Dash2Sim: 来自野外行车记录仪视频的闭环驾驶仿真

Anurag Ghosh, Francesco Pittaluga, Khiem Vuong, Angela Chen, Juan Alvarez-Padilla, Manmohan Chandraker, Srinivasa Narasimhan

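Affiliations * Carnegie Mellon University; NEC Labs America; MIT; UC San Diego

AI summary Proposes Dash2Sim, a framework that converts monocular dashcam videos into metric, geo-referenced 4D driving logs for closed-loop simulation, builds the ROADWork4D benchmark dataset, and shows that work zone scenarios challenge planners.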

详情
AI中文摘要

自动驾驶仿真通常依赖于在少数城市收集的数据或手工编写的合成场景。行车记录仪视频覆盖了更广泛的位置和情况,包括罕见或长尾场景。由于难以从单目野外视频中恢复准确的4D场景,它们被认为不太适用于仿真。施工区是行车记录仪捕捉到的一类长尾情况。我们提出Dash2Sim,一个将野外单目行车记录仪视频转化为度量级、地理参考的4D驾驶日志并与现有仿真器兼容的框架,并针对独立维护的地图验证每个日志,无需标注。我们将Dash2Sim应用于大型视频语料库,创建了ROADWork4D基准数据集,涵盖17个城市的4,244个场景和270万个3D对象。在验证子集ROADWork4D-CL(2,201个场景)上,我们研究了特权闭环规划器,发现施工区场景具有挑战性:尽管基于规则和混合规划器的泛化能力优于基于学习的规划器,但所有规划器均表现不足,无法完成临时施工区通道所需的变道。在规划之外,Dash2Sim恢复的密集深度在新视角合成质量上提高了高达19%(基于感知指标),表明其具有为单目视频的闭环传感器仿真提供丰富条件的潜力。

英文摘要

Self-driving simulations typically rely on data collected in a small number of cities or on hand-authored synthetic scenarios. Dashcam videos cover a far broader range of locations and situations, including rare or long-tailed scenarios. They are considered less usable for simulation because it is difficult to recover accurate 4D scenes from monocular in-the-wild videos. Work zones are one such class of long-tailed situations that dashcams capture. We present Dash2Sim, a framework that turns in-the-wild monocular dashcam videos into metric, geo-referenced 4D driving logs compatible with existing simulators, and verifies each one against an independently maintained map without annotations. We apply Dash2Sim to a large video corpus to create the ROADWork4D benchmark dataset, which spans 4,244 scenes with 2.7M 3D objects across 17 cities. On a verified subset ROADWork4D-CL (2,201 scenes), we study privileged closed-loop planners and find that work zone scenarios are difficult: while rule-based and hybrid planners generalize better than learning-based ones, all fall short, failing to make the lane changes that temporary work zone channels require. Beyond planning, dense depth recovered by Dash2Sim improves novel-view synthesis quality by up to 19% on perceptual metrics, suggesting its potential to provide rich conditioning for closed-loop sensor simulation from monocular videos.

2606.07365 2026-06-08 cs.LG cs.AI 新提交

A robust PPG foundation model using multimodal physiological supervision

一种使用多模态生理监督的鲁棒PPG基础模型

Eloy Geenjaar, Vince Calhoun, Scott Daly, Gouthaman KV, Lie Lu, Trisha Mittal, Daniel P. Darcy

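Affiliations * Dolby Laboratories

AI summary Proposes a PPG foundation model that uses electrocardiogram and respiratory signals in ICU datasets to select contrastive samples, requires no high-quality or field-like pretraining data, and improves performance on 14 of 15 downstream tasks.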

详情
AI中文摘要

光电容积描记法(PPG)是一种无创测量血容量变化的方法,广泛应用于可穿戴设备和临床环境。最近的PPG基础模型要么使用开源ICU数据集,采用需要精心整理数据的预训练范式,从而难以泛化到场域数据,要么使用闭源场域PPG数据。相比之下,我们提出了一种PPG基础模型,不需要高质量或场域预训练数据,而是利用ICU数据集中伴随的心电图和呼吸信号在预训练期间选择对比样本。我们的方法允许模型保留并从噪声PPG片段中学习,提高了推理时的鲁棒性。我们的模型在比现有最先进方法少3倍的受试者上预训练,在15个不同的下游任务(包括场域日常活动和心率预测)中的14个上实现了性能提升。我们的结果表明,多模态监督可以整合互补的生理信息,以提高PPG基础模型的鲁棒性,并增强其对消费级数据的泛化能力。

英文摘要

Photoplethysmography (PPG), a non-invasive measure of changes in blood volume, is widely used in both wearable devices and clinical settings. Recent PPG foundation models either use open-source ICU datasets with pretraining paradigms that require curated data and thus complicate generalization to field-like data, or use closed-source field-like PPG data. In contrast, we propose a PPG foundation model that does not require high-quality or field-like pretraining data, and instead leverages accompanying electrocardiogram and respiratory signals in ICU datasets to select contrastive samples during pretraining. Our approach allows the model to retain and learn from noisy PPG segments, improving robustness at inference. Our model, pretrained on 3x fewer subjects than existing state-of-the-art approaches, achieves performance improvements on 14 out of 15 diverse downstream tasks, including field-like daily activity and heart rate prediction. Our results demonstrate that multimodal supervision can integrate complementary physiological information to improve the robustness of PPG foundation models and enhance their generalization to consumer-grade data.

2606.07356 2026-06-08 cs.SD cs.CL 新提交

DirectAudioEdit: Inversion-Free Text-Guided Audio Editing via Diffusion Prediction Contrast

DirectAudioEdit: 基于扩散预测对比的无反演文本引导音频编辑

Zhengkun Ge, Xiaoqian Liu, Haoran Zhang, Yuan Ge, Junxiang Zhang, Zhengtao Yu, Jingbo Zhu, Tong Xiao

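Affiliations * School of Computer Science and Engineering, Northeastern University, Shenyang, China; Kunming University of Science and Technology; NiuTrans Research, Shenyang, China

AI summary Proposes DirectAudioEdit, a training-free and inversion-free text-guided audio editing method that builds the editing path through diffusion prediction contrast, reducing FAD and KL by more than 15% on music and event-level benchmarks and speeding up editing by up to 64.5%.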

详情
AI中文摘要

文本引导音频编辑旨在修改语言指定的声学内容,同时保留与编辑无关的源组件。现有的无训练方法通常依赖于基于反演的编辑。虽然无反演编辑因其减少计算开销和重构误差而具有吸引力,但在音频编辑中仍基本未被探索。关键挑战是通过扩散去噪动力学构建源到目标的编辑路径。在本文中,我们介绍了DirectAudioEdit,这是首次尝试开发一种无需训练和反演的音频编辑方法。在两个骨干网络上的音乐和事件级基准实验表明,与DDPM反演相比,DirectAudioEdit将宏观平均FAD和KL分别降低了15.9%和15.8%,同时实现了高达64.5%的编辑加速。

英文摘要

Text-guided audio editing aims to modify the language-specified acoustic content while preserving edit-irrelevant source components. Existing training-free methods typically rely on inversion-based editing. While inversion-free editing is appealing as it decreases computational overhead and reconstruction errors, it remains largely unexplored for audio editing. The key challenge is to construct a source-to-target editing path through diffusion denoising dynamics. In this paper, we introduce DirectAudioEdit, the first attempt to develop a training-free and inversion-free method for audio editing. Experiments on music and event-level benchmarks across two backbones show that DirectAudioEdit reduces macro-averaged FAD and KL by 15.9% and 15.8% compared with DDPM inversion, while achieving up to 64.5% editing speedup.

2606.07355 2026-06-08 cs.CV 新提交

Spatial-Temporal Decoupled Adapter for Micro-gesture Online Recognition

面向微手势在线识别的时空解耦适配器

Xucheng Shen, Kun Li, Fei Wang, Wei Qian, Jin Jiang, Dan Guo

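Affiliations * Hefei University of Technology; United Arab Emirates University; Institute of Artificial Intelligence, Hefei Comprehensive National Science Center; Anhui Evolution Technology Co., Ltd.

AI summary Proposes a Spatial-Temporal Decoupled Adapter that decomposes video adaptation into independent temporal and spatial branches via lightweight depthwise convolutions, and introduces Adaptive Soft Balanced Augmentation to mitigate the long-tail distribution problem; ranked 1st in Track 2 of the EI-MiGA challenge.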

Comments Technical Report. 1st Place in Micro-gesture Online Recognition in 4th MiGA at IJCAI 2026

详情
AI中文摘要

微手势在线识别旨在对未修剪视频中的细微手势进行时间定位和分类。由于微手势持续时间极短、运动幅度低且视觉线索模糊,捕获判别性的时空表示仍然极具挑战性。现有的参数高效适配器通常采用单分支联合建模时空线索,这可能无法捕获微手势的细粒度模式。为解决这一局限,我们提出了一种时空解耦适配器,通过轻量级深度可分离卷积将视频适配分解为独立的时间和空间分支。此外,为解决基准数据集中的长尾分布问题,我们引入了自适应软平衡增强,该方法根据类别稀有性和学习难度动态分配增强强度,无需手动设置阈值。我们的方法取得了0.43808的F1分数,在第四届EI-MiGA-IJCAI挑战赛的Track 2中排名第一。

英文摘要

Micro-gesture online recognition aims to temporally localize and classify subtle gestures in untrimmed videos. Owing to their extremely short duration, low motion amplitude, and ambiguous visual cues, capturing discriminative spatiotemporal representations remains highly challenging. Existing parameter-efficient adapters typically employ a single branch to model spatial and temporal cues jointly, which may fail to capture the fine-grained patterns of micro-gestures. To address this limitation, we propose a Spatial-Temporal Decoupled Adapter that decomposes video adaptation into independent temporal and spatial branches via lightweight depthwise convolutions. In addition, to address the long-tail distribution problem in the benchmark dataset, we introduce Adaptive Soft Balanced Augmentation, which dynamically allocates augmentation intensity based on class rarity and learning difficulty, without manual thresholds. Our method achieves an F1 score of 0.43808, ranking 1st in Track 2 of the 4th EI-MiGA-IJCAI Challenge.

2606.07345 2026-06-08 cs.LG 新提交

TabSwift: An Efficient Tabular Foundation Model with Row-Wise Attention

TabSwift: 一种高效的基于行注意力的表格基础模型

Si-Yang Liu, Han-Jia Ye

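Affiliations * University of Science and Technology of China

AI summary Proposes TabSwift, which strengthens a lightweight row-wise attention backbone with gated attention stabilization and learnable register tokens, achieving efficient tabular in-context learning that stays competitive while lowering inference cost.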

Comments Accepted to ICML 2026, spotlight

详情
AI中文摘要

以TabPFN为代表的表格基础模型通过上下文学习进行预测,直接从带标签的训练样本推断测试标签。它们已展现出有竞争力的性能,尤其是在中小型数据集上。然而,最近的表格基础模型通常通过日益复杂的架构来提高准确性,导致更高的推理成本并限制了实际部署。在这项工作中,我们重新审视了原始TabPFN设计,并表明一个轻量级的仅行注意力骨干可以通过两个简单的增强保持高度竞争力:门控注意力稳定机制和一组可学习的注册令牌,提供全局上下文并改善预训练质量。由此产生的模型TabSwift支持分类和回归,与更强的表格基础模型(如TabPFN v2和TabICL)竞争,同时推理效率更高。对于延迟敏感的服务,我们进一步引入了一种自适应逐层早期退出机制,动态调整每个样本的推理深度。总体而言,TabSwift为实际部署实现了高效且随时可用的表格上下文学习。

英文摘要

Tabular foundation models, exemplified by TabPFN, perform prediction via in-context learning, inferring test labels directly from labeled training examples. They have demonstrated competitive performance, particularly on small-to-medium datasets. However, recent tabular foundation models often improve accuracy with increasingly complex architectures, incurring higher inference cost and limiting practical deployment. In this work, we revisit the original TabPFN design and show that a lightweight row-wise attention-only backbone can remain highly competitive with two simple enhancements: a gated attention stabilization mechanism and a small set of learnable register tokens that provide global context and improve pretraining quality. The resulting model, TabSwift, supports both classification and regression, and is competitive with stronger tabular foundation models (e.g., TabPFN v2 and TabICL) while being more efficient at inference. For latency-sensitive serving, we further introduce an adaptive layer-wise early-exit mechanism that dynamically adjusts inference depth per sample. Overall, TabSwift enables efficient and anytime tabular in-context learning for practical deployments.

2606.07342 2026-06-08 cs.CL cs.NE 新提交

LLM-Guided Evolution for Medical Decision Pipelines

LLM引导的医疗决策流程进化

Ivan Sviridov, Artem Oskin, Ivan Panin, Iaroslav Bespalov, Dmitry Dylov, Ivan Oseledets, Aleksandr Nesterov

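Affiliations * Sber AI Lab; AIRI

AI summary Proposes LLM-guided MAP-Elites evolution, which optimizes medical decision pipelines without fine-tuning and improves over manually designed baselines on triage, consultation, and image classification tasks.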

详情
AI中文摘要

将大型语言模型(LLM)适应临床工作流程通常需要昂贵的微调或手动提示和流程工程。我们研究了LLM引导的MAP-Elites进化作为一种推理时替代方案,用于发现医疗决策策略,并在https://github.com/univanxx/llm_guided_evo_medical提供实现仓库。我们将紧急分诊、交互式咨询和医学图像分类表述为对可执行工件的进化搜索,这些工件由特定任务的适应度函数优化。在所有三种设置中,进化在实践约束下改进了手工设计的基线。在分诊中,进化程序将Semigran准确率从77.3%提高到87.1%,紧急召回率从0.60提高到0.97,同时改进了安全加权的保留MIMIC-ESI性能。在交互式咨询中,进化策略改进了Llama-3、Qwen-3.5和Gemma-4的准确率-成本前沿,并迁移到保留的iCRAFTMD。在PneumoniaMNIST中,仅提示进化改进了冻结的MedGemma VLM,同时保留了严格的JSON输出。定性分析表明,收益来自可解释的程序级机制、校准的分诊边界、有针对性的证据获取、选择性承诺和面向发现的视觉决策规则,而不仅仅是表面的提示改写。

英文摘要

Adapting large language models (LLMs) to clinical workflows often requires costly fine-tuning or manual prompt and pipeline engineering. We study LLM-guided MAP-Elites evolution as an inference-time alternative for discovering medical decision strategies and provide an implementation repository at https://github.com/univanxx/llm_guided_evo_medical. We formulate urgency triage, interactive consultation, and medical image classification as evolutionary searches over executable artifacts optimized by task-specific fitness functions. Across all three settings, evolution improves over manually designed baselines under practical constraints. In triage, evolved programs increase Semigran accuracy from 77.3% to 87.1% and emergency recall from 0.60 to 0.97, while improving safety-weighted held-out MIMIC-ESI performance. In interactive consultation, evolved policies improve the accuracy-cost frontier across Llama-3, Qwen-3.5, and Gemma-4 and transfer to held-out iCRAFTMD. In PneumoniaMNIST, prompt-only evolution improves frozen MedGemma VLMs while preserving strict JSON outputs. Qualitative analysis shows that the gains come from interpretable program-level mechanisms, calibrated triage boundaries, targeted evidence acquisition, selective commitment, and finding-oriented visual decision rules, rather than superficial prompt rewording alone.
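
For readers unfamiliar with MAP-Elites, the core loop named in the abstract above can be sketched generically. The code below is a minimal illustration under stated assumptions, not the implementation in the linked repository: the `mutate` callback stands in for whatever variation operator is used (in the paper an LLM proposes edits, while here it is any user-supplied function), and the names `map_elites`, `fitness`, and `descriptor` are hypothetical.

```python
import random


def map_elites(fitness, descriptor, mutate, init, iterations=200, seed=0):
    """Generic MAP-Elites loop.

    fitness(candidate)     -> float, higher is better
    descriptor(candidate)  -> hashable cell key (behaviour descriptor)
    mutate(candidate, rng) -> new candidate derived from a parent
    Returns an archive mapping each cell to its best (score, candidate).
    """
    rng = random.Random(seed)
    archive = {}

    def try_insert(candidate):
        cell = descriptor(candidate)
        score = fitness(candidate)
        # Keep the candidate only if its cell is empty or it beats the elite.
        if cell not in archive or score > archive[cell][0]:
            archive[cell] = (score, candidate)

    try_insert(init)
    for _ in range(iterations):
        # Pick a random elite as the parent, then vary it.
        _, parent = rng.choice(list(archive.values()))
        try_insert(mutate(parent, rng))
    return archive


# Toy usage: candidates are numbers in [0, 10], binned by integer part.
def toy_fitness(x):
    return -(x - 7.0) ** 2


def toy_descriptor(x):
    return min(int(x), 9)


def toy_mutate(x, rng):
    return min(10.0, max(0.0, x + rng.gauss(0.0, 1.0)))


archive = map_elites(toy_fitness, toy_descriptor, toy_mutate, init=2.0)
```

Unlike a single-objective search, the archive keeps one elite per behaviour cell, so the result is a set of diverse strategies, not a single winner.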

2606.07338 2026-06-08 cs.CV 新提交

VeriDrive: Verifiable Counterfactual Supervision for Cost-Efficient Vision-Language Planning

VeriDrive: 可验证的反事实监督用于成本高效的视觉-语言规划

Zikai Zhang, Hubert P. H. Shum, Toby P. Breckon

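Affiliations * Department of Computer Science, Durham University

AI summary Proposes VeriDrive, a framework that generates verifiable counterfactual supervision through a structured Perception-Evaluation-Revision chain, lowering the data construction cost of vision-language driving planning; validated on the nuScenes dataset.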

详情
AI中文摘要

视觉-语言驾驶模型越来越多地使用推理监督来连接感知、预测和规划,但现有的驾驶理由通常是自由形式的,且使用前沿模型生成成本高昂。我们提出了VeriDrive,一个构建面向规划的、可验证的反事实监督框架。VeriDrive将驾驶推理转化为结构化的感知-评估-修正链,该链将关键对象锚定于未来运动,使用可规则检查的证据评估替代自我轨迹,将风险意图修正为专家行为,并生成最终规划目标。为了扩展数据构建,VeriDrive结合了本地生成与验证器引导的选择性修正,仅升级无效或困难的样本。我们在nuScenes上构建了VeriDrive数据集,并在Omni-Q协议下进行训练。受控的开环实验表明,VeriDrive在L2、碰撞和交叉指标上优于OmniDrive,同时减少了记录的令牌使用量、生成时间和实际支付的LLM/VLM成本。这些结果表明,可审计的中间字段和结构化修正目标可以在现实注释预算下改进视觉-语言规划监督。代码、提示和验证器脚本即将发布,并将在审稿过程后公开。

英文摘要

Vision-language driving models increasingly use reasoning supervision to bridge perception, prediction, and planning, but existing driving rationales are often free-form and expensive to generate with frontier models. We present VeriDrive, a framework for constructing planning-oriented, verifiable counterfactual supervision. VeriDrive converts driving reasoning into a structured Perception-Evaluation-Revision chain that grounds key objects in future motion, evaluates alternative ego trajectories with rule-checkable evidence, revises risky intent toward expert behavior, and produces final planning targets. To scale data construction, VeriDrive combines local generation with validator-guided selective correction, escalating only invalid or difficult samples. We build the VeriDrive dataset on nuScenes and train under the Omni-Q protocol. Controlled open-loop experiments show that VeriDrive improves L2, Collision, and Intersection over OmniDrive while reducing logged token usage, generation time, and actual paid LLM/VLM cost. These results show that auditable intermediate fields and structured revision targets can improve vision-language planning supervision under realistic annotation budgets. Code, prompts, and validator scripts are coming soon and will be released after the review process.

2606.07333 2026-06-08 cs.CV 新提交

Varifold Moment Invariants for Sustainable and Explainable Contour Feature Extraction

Varifold矩不变量:可持续且可解释的轮廓特征提取

G. Longari, J. -C. Alvarez Paiva, A. B. Tumpach

Affiliations * Computer Vision Lab, Technische Universität Wien, Karlsplatz 13, 1040 Vienna, Austria; Wolfgang Pauli Institut, Oskar-Morgensternplatz 1, 1090 Vienna, Austria; U.M.R. CNRS 8524, U.F.R. de Mathématiques, 59655 Villeneuve d'Ascq Cédex, France; Laboratoire Painlevé, Lille University, 59650 Villeneuve d’Ascq, France

AI summary Proposes Varifold Moment Invariants (VMI), a unifying framework that combines region, boundary, and tangent-line geometry to produce highly discriminative geometric features; paired with lightweight classifiers, it outperforms existing contour-based methods at lower computational cost.

Comments 29 pages, 12 figures

详情
AI中文摘要

我们引入Varifold矩不变量(VMI)作为许多先前提出的矩不变量的统一框架。这些不变量与其他在平移和旋转下不变的轮廓特征(如扩展高斯图像、椭圆傅里叶描述符或形状分布)密切相关。Varifold矩方法的优势在于能够结合区域的几何、其边界以及与之相切的直线族,从而创建大量具有高判别力和清晰几何意义的不变特征。通过将我们的VMI特征提取与轻量特征分类器随机森林或多层感知器相结合,我们在基于轮廓的方法中超越了现有技术水平,同时大幅降低了计算成本,使我们的算法能够在轻量设备上运行。我们在大量广泛使用的不同类型数据集(叶子、物体、细胞)上测试了我们的分类任务,并以少量几何可解释的特征实现了高精度。

英文摘要

We introduce Varifold Moment Invariants (VMI) as a unifying framework for many previously introduced moment invariants. These invariants are closely related to other contour features that are invariant under translations and rotations, such as the Extended Gaussian Image, Elliptic Fourier Descriptors, and Shape Distributions. The advantage of the varifold approach to moments is that it combines the geometry of the region, its boundary, and the family of lines tangent to it, creating a substantial number of invariant features with high discriminating power and clear geometric meaning. By coupling our VMI feature extraction with lightweight classifiers (Random Forest or Multi-Layer Perceptron), we outperform state-of-the-art contour-based approaches while drastically decreasing the computational cost, to the point that our algorithm can run on lightweight devices. We tested our approach on classification tasks over a large number of widely used datasets of various types (leaves, objects, cells) and achieved high accuracy with a small number of geometrically interpretable features.

2606.07326 2026-06-08 cs.CV 新提交

AnchorWorld: Embodied Egocentric World Simulation with View-based Evolution Customization

AnchorWorld: 基于视图演化定制的具身自我中心世界模拟

Yu Li, Menghan Xia, Gongye Liu, Xintao Wang, Conglang Zhang, Lei Ke, Yuxuan Lin, Ruihang Chu, Pengfei Wan, Kun Gai, Yujiu Yang

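Affiliations * Tsinghua University; HUST; Kling Team, Kuaishou Technology; HKUST; WHU

AI summary Proposes AnchorWorld, a framework that uses 3D human motion and auxiliary training with exogenous viewpoints to enhance interaction integrity, and customizes self-evolving worlds flexibly through anchor views and textual descriptions; it significantly outperforms existing methods.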

详情
AI中文摘要

尽管交互式世界建模是一个关键前沿,但在实际场景所需的多样化可控性方面仍未被充分探索。为弥补这一差距,我们提出AnchorWorld,一个通过增强交互完整性和灵活的世界定制机制来推进自我中心模拟的框架。首先,我们利用3D人体运动作为主要交互模态。为了补充自我中心视角中不可见或被截断的身体部位,我们引入了一种辅助训练监督,该监督包含了与智能体第一人称感知解耦的外源视角。这使得模型能够观察智能体相对于环境的全身定位,从而促进人-世界交互更稳健的空间基础。此外,我们提出了一种简单而有效的机制来定制自我演化的世界。这是通过在统一的世界坐标系内定义锚点视图,并结合描述局部场景动态演化的文本描述来实现的。实验结果表明,AnchorWorld显著优于最先进的基线方法,而消融研究验证了我们关键设计的有效性。值得注意的是,我们的定制方案展现出有希望的时空几何一致性,并严格遵守规定的演化动力学。

英文摘要

Despite being a pivotal frontier, interactive world modeling remains underexplored in terms of the versatile controllability required by practical scenarios. To bridge this gap, we present AnchorWorld, a framework that advances egocentric simulation through enhanced interaction integrity and a flexible mechanism for world customization. First, we utilize 3D human motion as the primary interaction modality. To complement the out-of-view or truncated body parts in egocentric views, we introduce an auxiliary training supervision that incorporates exogenous viewpoints decoupled from the agent's first-person sensorium. It allows the model to observe the agent's full-body positioning relative to the environment, facilitating a more robust spatial grounding of human-world interactions. Furthermore, we propose a simple yet effective mechanism for customizing self-evolving worlds. This is achieved by defining anchor views within a unified world coordinate system, coupled with textual descriptions dictating the dynamic evolution of local scenes. Experimental results show that AnchorWorld significantly outperforms state-of-the-art baselines, while ablation studies validate the effectiveness of our key designs. Notably, our customization scheme exhibits promising spatio-temporal geometric consistency and adheres strictly to the prescribed evolutionary dynamics.

2606.07313 2026-06-08 cs.CL cs.AI 新提交

SV-Detect: AI-generated Text Detection with Steering Vectors

SV-Detect: 基于引导向量的AI生成文本检测

Mikhail Vishnyakov, Tatiana Gaintseva

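Affiliations * Independent Researcher; Queen Mary University of London

AI summary Proposes extracting steering vectors from the hidden representations of a frozen language model and training a lightweight classifier on layer-wise projection features, detecting machine-generated text across domains, across source models, and under editing attacks.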

详情
AI中文摘要

检测机器生成文本在分布偏移(如跨域、源模型和编辑攻击的迁移)下尤其困难。我们提出了一种基于从冻结语言模型的隐藏表示中提取的引导向量的假文本检测器。在每一层,我们构建一个分离人类编写文本和机器生成文本的方向,并通过每个输入与这些方向的逐层对齐来表示输入。在这些投影特征上训练的轻量分类器产生最终的检测分数。我们的方法在分布内和分布偏移下均表现出色,包括跨域、跨源模型以及机器编辑转换(如润色和重写)。解释分析表明,学习到的方向与可识别的风格线索一致,同时捕获了超越表面特征的显著额外信号。这些结果将假文本检测定位为表示空间探测问题,并表明引导向量提供了一种简单有效的解决方案。

英文摘要

Detecting machine-generated text is especially difficult under distribution shift, such as transfer across domains, source models, and editing attacks. We propose a fake-text detector based on steering vectors extracted from the hidden representations of a frozen language model. At each layer, we construct a direction that separates human-written from machine-generated text, and represent each input by its layer-wise alignment with these directions. A lightweight classifier trained on these projection features yields the final detection score. Our method achieves strong performance both in-distribution and under distribution shift, including across domains, source models, and machine-editing transformations such as polishing and rewriting. Interpretation analyses show that the learned directions align with recognizable stylistic cues while capturing substantial additional signal beyond surface features. These results position fake-text detection as a representation-space probing problem and show that steering vectors provide a simple and effective solution.
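
The feature construction described in the abstract above (one separating direction per layer, then a projection per layer) can be sketched as follows. The abstract does not say how each direction is built, so this sketch assumes a difference of class means, one common way to obtain a steering vector. All function names are hypothetical, and hidden states are plain Python lists so the example needs no external library.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def steering_direction(human_states, machine_states):
    """Unit vector from the human-text mean to the machine-text mean.

    Assumption: difference of class means, normalised to length 1.
    """
    human_mean = mean_vector(human_states)
    machine_mean = mean_vector(machine_states)
    diff = [m - h for h, m in zip(human_mean, machine_mean)]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0.0:
        raise ValueError("class means coincide; no direction to extract")
    return [d / norm for d in diff]


def projection_features(layer_states, directions):
    """One scalar per layer: dot product of the input's hidden state
    at that layer with that layer's direction."""
    return [
        sum(x * d for x, d in zip(state, direction))
        for state, direction in zip(layer_states, directions)
    ]
```

The resulting feature vector has one entry per layer, which keeps the downstream classifier small regardless of the hidden size of the frozen model.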

2606.07311 2026-06-08 cs.CV cs.AI 新提交

CULTURESCORE: Evaluating Cultural Faithfulness in Video Generation Models

CULTURESCORE: 评估视频生成模型的文化忠实度

Anku Rani, Wei Dai, Shravan Nayak, Pattie Maes, Mahdi M. Kalayeh, Paul Pu Liang

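Affiliations * Massachusetts Institute of Technology; Mila – Quebec AI Institute; Netflix

AI summary Proposes CultureScore, a framework that evaluates the cultural faithfulness of generated video along three dimensions (Identity, Context, and Behavior); experiments show the best current model scores only 56.8%, with Behavior the hardest dimension.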

详情
AI中文摘要

随着Veo 3.1和LTX-2等视频生成模型的进步,它们准确表现多元全球文化的能力仍是一个关键但研究不足的前沿。当前的指标如VideoScore仅衡量视觉质量,无法评估文化忠实度。因此,一个将合十礼替换为握手的模型与正确生成该手势的模型获得相同的分数。我们提出CultureScore,一个将文化忠实度分解为三个细粒度维度的组合评估框架:身份(谁被代表)、背景(文化本地化背景)和行为(规范性手势和互动)。我们通过一个覆盖10个国家的评估套件来实施该框架,在三个最先进的模型上生成了6,180个视频。我们的评估显示,当前没有模型能够实现文化忠实的视频生成:表现最好的模型整体CultureScore仅为56.8%,其中行为是最具挑战性的维度,所有模型在该维度上均低于52%。此外,人类偏好排序与CultureScore方向一致,但与VideoScore相反;在视觉质量上得分最高的模型被标注者排在最后,这强调了文化忠实度是公平视频生成的基本标准。

英文摘要

As video generation models like Veo 3.1 and LTX-2 advance, their ability to accurately represent diverse global cultures remains a critical yet understudied frontier. Current metrics, such as VideoScore, only measure visual quality but offer no mechanism for assessing cultural faithfulness. Consequently, a model that replaces a Namaste with a handshake receives the same score as one that generates the gesture correctly. We propose CultureScore, a compositional evaluation framework that decomposes cultural faithfulness into three granular dimensions: Identity (who is represented), Context (culturally localized background), and Behavior (normative gestures and interactions). We operationalize this framework through an evaluation suite spanning 10 countries, yielding 6,180 generated videos across three state-of-the-art models. Our evaluation reveals that no current model achieves culturally faithful video generation: the best-performing model reaches only 56.8% overall CultureScore, with Behavior the most challenging dimension, which remains below 52% across all models. Furthermore, human preference rankings align directionally with CultureScore but are inverted relative to VideoScore; the highest-scoring model on visual quality was ranked last by annotators, underscoring that cultural faithfulness is an essential criterion for equitable video generation.

2606.07309 2026-06-08 cs.SD cs.AI cs.CL 新提交

Acoustic Cue Alignment in Audio Language Models for Speech Emotion Recognition

语音情感识别中音频语言模型的声学线索对齐

Iosif Tsangko, Andreas Triantafyllopoulos, Björn W. Schuller

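Affiliations * DFG's Reinhart Koselleck project; EU H2020 project

AI summary Studies whether explicit acoustic cues are used in a grounded way by audio language models, deriving six interpretable acoustic concept tokens from eGeMAPS features; aligned tokens improve UAR while shuffled tokens reduce performance, and the models are sensitive to symbolic cues yet remain partly anchored to the audio signal.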

Comments 6 pages, 3 figures, 3 tables

详情
AI中文摘要

指令跟随音频语言模型(ALMs)可以通过显式的声学线索进行增强,但在原始音频已经可用的情况下,这些线索是否以接地的方式被使用仍不清楚。我们通过从标准化的eGeMAPS副语言特征集中推导出六个可解释的声学概念标记来研究语音情感识别(SER)中的这一问题。这些标记总结了能量、音高、动态、亮度、共振峰和语音质量,并被附加到文本提示中,同时保持音频输入不变。在广泛使用的FAU-Aibo和IEMOCAP基准测试中,对齐的标记提高了未加权平均召回率(UAR),而打乱、冲突或损坏的标记相对于对齐标记降低了性能,并将混淆转向中性。重要的是,在强标记扰动下预测不会崩溃,这表明模型对符号线索通道敏感,但部分仍锚定于音频信号。我们认为,仅标记干预提供了一种实用的方法来探测基于ALM的情感计算中音频接地线索的使用、鲁棒性和可解释性。

英文摘要

Instruction-following audio language models (ALMs) can be augmented with explicit acoustic cues, yet it remains unclear whether such cues are used in a grounded way when the raw audio is already available. We study this question in speech emotion recognition (SER) by deriving six interpretable acoustic concept tokens from the standardised eGeMAPS paralinguistic feature set. These tokens summarise energy, pitch, dynamics, brightness, formants, and voice quality, and are appended to the textual prompt while the audio input is kept unchanged. Across the widely used FAU-Aibo and IEMOCAP benchmarks, aligned tokens improve unweighted average recall (UAR), whereas shuffled, conflicting, or corrupted tokens reduce performance relative to aligned tokens and shift confusions toward neutral. Importantly, predictions do not collapse under strong token perturbations, suggesting that the models are sensitive to the symbolic cue channel but remain partly anchored to the audio signal. We argue that token-only interventions provide a practical way to probe audio-grounded cue use, robustness, and interpretability in ALM-based affective computing.
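
Unweighted average recall (UAR), the metric used in the abstract above, is the mean of per-class recalls, so every emotion class counts equally regardless of how often it occurs. A minimal sketch follows; the function name and the labels in the usage note are illustrative and not taken from the paper.

```python
from collections import Counter


def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return sum(hits[label] / totals[label] for label in totals) / len(totals)
```

For example, with eight "neutral" and two "anger" utterances, a model that always predicts "neutral" is 80% accurate but reaches a UAR of only 0.5. This is why a shift of confusions toward neutral shows up clearly in UAR on imbalanced emotion corpora.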

2606.07308 2026-06-08 cs.AI 新提交

Off-Policy Evaluation with Strategic Agents via Local Disclosure

通过局部披露进行具有策略性主体的离线策略评估

Kiet Q. H. Vo, Abbavaram Gowtham Reddy, Julian Rodemann, Siu Lun Chau, Krikamol Muandet

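Affiliations * CISPA Helmholtz Center for Information Security; LMU Munich; Nanyang Technological University

AI summary Studies off-policy evaluation under strategic behavior, using local disclosure to reveal agents' pre-strategic covariates and constructing a doubly robust estimator that mitigates information asymmetry.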

详情
AI中文摘要

我们研究了策略性行为下的离线策略评估(OPE),其中决策主体(或代理)通过策略性地修改其协变量来响应决策者的策略。这种行为导致了策略依赖的协变量偏移,打破了现有方法中协变量外生于策略的标准假设。相关工作通过施加强假设(如重复交互或完全了解代理的响应行为)来应对这一挑战,这极大地限制了其在OPE中的适用性。相比之下,我们考虑一次性OPE设置,其中决策者仅部分了解代理的响应行为。我们的关键见解是,通过事后解释披露局部信息,可以在适应之前揭示代理的策略前协变量,从而减轻策略行为引起的信息损失。利用这一结构,我们估计了代理响应的统计模型,并构建了策略值的双重稳健估计器。通过假设代理的成本敏感性服从条件对数正态分布,我们建立了所提估计量的一致性,并实证验证了我们的方法。更广泛地说,我们的结果强调了交互设计如何通过揭示代理策略响应中原本隐藏的结构来缓解信息不对称。

英文摘要

We study off-policy evaluation (OPE) under strategic behavior where decision subjects (or agents) respond to a decision maker's policy by strategically modifying their covariates. Such behavior induces a policy-dependent covariate shift, breaking the standard assumption in existing methods that covariates are exogenous to the policy. Related work addresses this challenge by imposing strong assumptions such as repeated interactions or full knowledge of agents' response behavior, substantially limiting its applicability to OPE. In contrast, we consider a one-shot OPE setting where the decision maker has only partial knowledge of the agents' response behavior. Our key insight is that disclosing local information through post-hoc explanations reveals agents' pre-strategic covariates prior to adaptation, mitigating the information loss induced by strategic behavior. Leveraging this structure, we estimate a statistical model for the agents' responses and construct a doubly robust estimator for policy value. By assuming that the agents' cost sensitivity follows a conditional log-normal distribution, we establish consistency of the proposed estimator and validate our approach empirically. More broadly, our results highlight how interaction design can mitigate information asymmetry by revealing otherwise hidden structure in agents' strategic responses.
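
As background for the doubly robust estimator mentioned in the abstract above, here is the textbook version for a contextual bandit with a finite action set. It assumes covariates are exogenous to the policy, which is exactly the assumption the paper relaxes, so this is a baseline sketch and not the paper's estimator. The log format `(x, action, reward, propensity)` and all names are assumptions made for illustration.

```python
def doubly_robust_value(logs, target_policy, reward_model, actions):
    """Textbook doubly robust off-policy value estimate.

    logs: iterable of (x, a, r, mu), where mu is the logging policy's
          probability of choosing action a given covariates x
    target_policy(x, a): probability the evaluated policy picks a given x
    reward_model(x, a):  estimated expected reward for (x, a)
    """
    logs = list(logs)
    if not logs:
        raise ValueError("logs must not be empty")
    total = 0.0
    for x, a, r, mu in logs:
        # Direct-method term: model-based value of the target policy at x.
        direct = sum(target_policy(x, b) * reward_model(x, b) for b in actions)
        # Importance-weighted correction of the model's error on the logged action.
        correction = target_policy(x, a) / mu * (r - reward_model(x, a))
        total += direct + correction
    return total / len(logs)
```

The estimate stays consistent if either the reward model or the logged propensities are correct, which is the sense in which it is "doubly" robust.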

2606.07303 2026-06-08 cs.LG 新提交

Bootstrap Theory of Representational Emergence: Explanatory Insufficiency as a Driver of Representation Learning and World Models

表征涌现的自举理论:解释不充分性作为表征学习与世界模型的驱动力

Jacques Raynal, Pierre Slangen, Elsa Raynal, Jacques Margerit

发表机构 * Laboratory of Bioengineering and Nanosciences (LBN), University of Montpellier(生物工程与纳米科学实验室(LBN),蒙彼利埃大学) EuroMov Digital Health in Motion, University of Montpellier, IMT Mines Alès(EuroMov数字健康运动,蒙彼利埃大学,IMT矿山阿尔勒) Certified Sophrologist, Sensorimotor Practice, Montpellier, France(认证Sophrologist,运动觉实践,蒙彼利埃,法国) Emeritus Professor, University of Montpellier(荣誉教授,蒙彼利埃大学)

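AI summary Proposes the Bootstrap Theory of Representational Emergence (TBER), which treats explanatory insufficiency as a positive signal for the emergence of new representations, describes representational innovation as a five-stage recursive process, and discusses applications to representation learning, world models, and scientific discovery.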

Comments 24 pages, 25 references. Theoretical framework relating representation learning, representational emergence, and world models

AI Chinese abstract

表征学习是现代机器学习的核心,实现了从手工特征到学习嵌入、潜在空间、基础模型、世界模型和数字孪生的转变。然而,大多数研究关注在选定表征框架后如何优化表征,而较少关注何时需要新的表征层次。我们引入表征涌现自举理论(TBER),这是一个描述当现有表征变得解释不充分时新表征如何出现的框架。在这种观点下,表征创新不仅由更多数据、更大模型或更强计算能力驱动,还由持续的解释差距驱动:即表征仍能描述观察但无法使其组织或变换变得可理解的情况。TBER将解释不充分性识别为表征转变的积极信号。一个表征变得不充分,并非因为它必然错误,而是因为其解释领域已被超越。自举动态遵循递归序列:观察揭示异常;异常暴露不充分性;不充分性激发新表征;这些新表征产生进一步观察和可能的新异常。我们通过五个阶段形式化这一过程:稳定观察、异常检测、解释不充分性识别、表征涌现和临时稳定。我们讨论了在表征学习、潜在空间、基础模型、世界模型、数字孪生、自适应生物系统和科学发现中的应用。TBER表明,未来AI系统可能受益于检测其内部表征解释极限的机制。

English abstract

Representation learning is central to modern machine learning, enabling transitions from handcrafted features to learned embeddings, latent spaces, foundation models, world models, and digital twins. Yet most research examines how representations are optimized after a representational framework has been selected, while less attention is given to when a new level of representation becomes necessary. We introduce the Bootstrap Theory of Representational Emergence (TBER), a framework describing how new representations arise when existing ones become explanatorily insufficient. In this view, representational innovation is not only driven by more data, larger models, or greater computational power, but also by persistent explanatory gaps: situations in which a representation can still describe observations but can no longer make their organization or transformations intelligible. TBER identifies explanatory insufficiency as a positive signal for representational transition. A representation becomes insufficient not because it is necessarily false, but because its explanatory domain has been exceeded. The bootstrap dynamic follows a recursive sequence: observations reveal anomalies; anomalies expose insufficiencies; insufficiencies motivate new representations; and these new representations generate further observations and possible new insufficiencies. We formalize this process through five stages: stabilized observation, anomaly detection, recognition of explanatory insufficiency, representational emergence, and provisional stabilization. We discuss applications to representation learning, latent spaces, foundation models, world models, digital twins, adaptive biological systems, and scientific discovery. TBER suggests that future AI systems may benefit from mechanisms for detecting the explanatory limits of their own internal representations.

2606.07300 2026-06-08 cs.CL New submission

Phun-Bench: Evaluating LLMs on Phonological Understanding in Chinese

Phun-Bench:评估大语言模型的中文语音理解能力

Xing Yue, Yongliang Shen, Weiming Lu

Affiliations * Zhejiang University

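AI summary Proposes the Phun-Bench benchmark, which systematically evaluates the phonological understanding of large language models along three dimensions (homophony, rhyme, and phonetic similarity), and finds that models fall short in applying phonological knowledge flexibly.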

Comments Accepted to ACL 2026 Main Conference

AI Chinese abstract

语言是思想的载体,与声音、符号和意义紧密相连。然而,大多数大语言模型(LLM)研究关注意义(语义)和符号(拼写),而很大程度上忽略了声音。现有的LLM语音能力基准要么可以通过死记硬背解决,要么与其他能力交织在一起,不足以衡量LLM在语音理解方面的真实能力。在这里,我们提出Phun-Bench,一个专门构建的中文基准,包含跨三个维度(同音、押韵和语音相似性)的多样化任务和设置,旨在系统评估LLM的语音理解能力。我们的结果表明,虽然LLM在回忆正确发音方面表现出色,但它们通常难以像人类说话者那样灵活直观地利用语音知识。此外,通过详细分析,我们提出了关于LLM语音理解和“感知”潜在机制的假设,突出了未来研究的一个未充分探索的前沿。

English abstract

Language is a vehicle for thought, intricately tied to sounds, symbols, and meaning. However, most large language model (LLM) research focuses on meaning (semantics) and symbols (spelling) while largely overlooking sounds. Existing benchmarks on LLMs' phonological abilities are either solvable through rote memorization or intertwined with other abilities, making them inadequate to measure LLMs' genuine ability in phonological understanding. Here, we present Phun-Bench, a purpose-built Chinese benchmark with diverse tasks and settings across three dimensions (Homophony, Rhyme, and Phonetic Similarity), designed to systematically evaluate LLMs' phonological understanding. Our results show that while LLMs excel at recalling correct pronunciations, they generally struggle to leverage phonological knowledge in the flexible and intuitive way that human speakers do. Moreover, through detailed analyses, we propose a hypothesis regarding the underlying mechanism of LLMs' phonological understanding and "perception", highlighting an underexplored frontier for future research.