arXivDaily arXiv Daily Academic Digest, updated Monday to Friday
2606.15079 2026-06-16 cs.CL cs.AI New submission

Ling and Ring 2.6 Technical Report: Efficient and Instant Agentic Intelligence at Trillion-Parameter Scale

Ang Li, Ben Liu, Bin Han, Bin Hu, Bin Jing, Binbin Hu, Bing Li, Cai Chen, Caizhi Tang, Changxin Tian, Chao Huang, Chao Zhang, Chen Liang, Chen Qian, Chengfu Tang, Chengyao Wen, Chilin Fu, Chunwei Wu, Cong Zhang, Cunyin Peng, Daixin Wang, Dalong Zhang, Deng Zhao, Dingnan Jin, Dingyuan Zhu, Donghao Zhang, Fan Yuan, Fangzheng Zhao, Fanzhuang Meng, Feifan Wu, Feng Xu, Fengbin Fang, Gangshan Wang, Guodong Yang, Hailin Zhao, Haitao Wang, Haitao Zhang, Hanxiao Zhang, Hanzi Wang, Hao Dai, Hao Liu, Hao Qian, Hao Wu, Haoxiong Liu, Haoyu Xu, Heng Zhang, Hong Liu, Hongliang Zhang, Hongrui Liu, Hongxun Li, Hongzhi Ruan, Huaidong Xiong, Huihuang Zheng, Huikang Tang, Jia Guo, Jia Li, Jia Liu, Jiameng Wang, Jiaming Liu, Jiannan Shi, Jianping Wei, Jiaolong Yang, Jiapeng Wang, Jie Gao, Jie Wang, Jiewei Wu, Jin Yang, Jinjin Li, Jinjing Huang, Jinquan Sun, Jinyao Chen, Juanhui Tu, Jun Liu, Jun Mei, Jun Xu, Jun Zhou, Junjie Ou, Junnan Sipan, Junpeng Fang, Kaihong Zhang, Kaiqin Hu, Ke Shi, Kuan Xu, Kun Tang, Kunlong Chen, Lanyin Mei, Lei Chen, Lei Liang, Lei Xu, Li Tang, Liang Jiang, Liangcheng Fu, Lihui Zhang, Linfeng Shi, Lintao Ma, Liyuan Liu, Longfei Li, Longfei Zheng, Lu Liu, Lu Yu, Man Li, Meiqi Zhu, Meng Li, Mengjie Gao, Mengshu Sun, Mingming Yin, Mingyang Zhang, Mingyuan Fan, Nuo Xu, Pan Tang, Peijie Jiang, Peilong Zhao, Peng Lin, Pingping Liu, Qi Zuo, Qian Zhao, Qiang Cheng, Qianggang Cao, Qiaoben Bao, Qing Cui, Qingyuan Yang, Qitao Shi, Qiyin Huang, Qizheng Zhou, Quan Wan, Runyuan Zhao, Shaomian Zheng, Shaowei Wei, Shengnan Zhang, Shuaicheng Li, Shujie Li, Shuo Zhang, Sikang Bian, Tianchu Yao, Tiange Xu, Tianshu Wang, Ting Guo, Tinghao Wang, Tingwei Huang, Tong Zhao, Tongkai Yang, Wang Hong, Wanli Gu, Wei Lu, Weichang Wu, Weiguang Han, Weiquan Li, Wenbo Shen, Wenjing Fang, Wenzhi Tang, Xiang Shu, Xiao Shi, Xiaodong Yan, Xiaolu Zhang, Xiaopei Wan, Xiaqing Sun, Xin Zhao, Xingyu Lu, Xinxing Yang, Xinyao Tang, Xinyu Kong, Xinyu Liu, Xiong Xu, Xuan Sun, Xudong Han, Xudong Wang, 
Xujie Shen, Yalin Zhang, Yangyang Hou, Yankun Ren, Yao Zhao, Ye Chen, Yeyang Chen, Yibo Cao, Yifan Zuo, Yijie Chen, Ying Li, Yingjie Song, Yingxue Li, Yiqi Wang, Yixuan Sun, Yizhu Xiao, Yongfei Xu, Yu Liu, Yuchen Fang, Yue Gao, Yue Yu, Yue Zhang, Yuqi Zhang, Yuxiao He, Yuxiao Lu, Yuxin Tian, Yuxuan Li, Yuzhuo Fu, Zhankai Xu, Zhaoxin Huan, Zhenduo Zhang, Zhengke Gui, Zhengyu Huang, Zhenjun Ma, Zhenxuan Pan, Zheping Qu, Zhibo Zhu, Zhidong Fan, Zhigang Huangfu, Zhihao Wang, Zhiqiang Zhang, Zhizhen Liu, Zhuyan Zhou, Zibin Lin, Zihang Zeng, Zihao Wang, Zilong Wang, Ziqi Liu, Zitao Xuan, Zixuan Cheng, Zujie Wen, Zuoli Tang

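Affiliations Ling Team; Inclusion AI

AI summary Introduces the Ling-2.6 and Ring-2.6 model families, which combine architectural migration pre-training, a hybrid linear attention design, and the KPop reinforcement learning framework to deliver low latency, strong reasoning, and efficient deployment; all checkpoints are open-sourced.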

Abstract

Efficient and scalable agentic intelligence requires models that can deliver both low-latency responses and strong reasoning capabilities while remaining practical to train, serve, and deploy. In this report, we present Ling-2.6 and Ring-2.6, a family of models designed to address this challenge at scale. Ling-2.6 is optimized for instant response generation and high capability per output token, whereas Ring-2.6 is tailored for deeper reasoning and more advanced agentic workflows. Instead of training from scratch, we upgrade the Ling-2.0 base model through architectural migration pre-training and large-scale post-training. This upgrade is guided by a unified co-design of model architecture, optimization objectives, serving systems, and agent training environments, enabling improvements in both model capability and deployment efficiency. At the architectural level, we introduce a hybrid linear attention design that integrates Lightning Attention with MLA, improving the efficiency of long-context training and decoding. To further enhance token efficiency, we optimize capability per output token through Evolutionary Chain-of-Thought, Linguistic Unit Policy Optimization, bidirectional preference alignment, and shortest-correct-response distillation. For agentic capabilities, we propose KPop, a reinforcement learning framework designed to support stable training of Ring-2.6-1T on large-scale environment-grounded data. KPop improves training efficiency through asynchronous scheduling across coding, search, tool use, and workflow execution, enabling scalable learning from complex agent-environment interactions. Together, Ling-2.6 and Ring-2.6 provide a practical pathway toward efficient, scalable, and open agentic systems. We open-source all checkpoints in the 2.6 family to support further research and development in practical agentic intelligence.

2606.15078 2026-06-16 cs.AI cs.GT physics.soc-ph New submission

Cognitive Debt: AI as Intellectual Leverage and the Dynamics of Systemic Fragility

Shuchen Meng

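Affiliations New York University

AI summary Proposes a formal theory of cognitive debt: a model with cognitive capital and cognitive debt as state variables shows that rational agents accumulate cognitive debt, leading to a cognitive Minsky moment and systemic fragility.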

Comments 46 pages, 3 figures. Preliminary version; comments welcome

Abstract

We develop a formal theory of cognitive debt: the stock of unverified reasoning obligations that accumulates when individuals use AI as a substitute rather than a complement for first-principles cognition. The model features two state variables per agent, cognitive capital and cognitive debt, and a multiplicative production technology in which cognitive capital functions as collateral that determines the return to AI adoption. We establish six propositions. Rational agents incur positive cognitive debt because the costs are deferred, partially external, and masked by short-run productivity gains. Tranquil periods lower subjective risk assessments, raise AI substitution intensity, and compound leverage, generating a cognitive Minsky moment in which subjective risk falls while true systemic fragility rises. Expected crisis losses are convex in aggregate leverage. Post-crisis, output-target pressure can produce a false-correction loop in which agents patch AI failures with more AI. The decentralised equilibrium over-adopts substitutive AI relative to the social optimum because of systemic risk, cognitive public goods, and arms-race externalities. In a two-type heterogeneous-agent economy, high-cognitive-capital agents adopt AI more intensively and may eventually erode their unaided cognitive capital below that of initially lower-skilled agents.

2606.15077 2026-06-16 cs.AI cs.CL New submission

Risk-Aware LLM Agents for Geospatial Data Retrieval: Design and Preliminary Adversarial Evaluation

Kyle Gao, Joel Cumming, Jonathan Li, Linlin Xu, David A. Clausi

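Affiliations Dept. of Systems Design Engineering, University of Waterloo; SkyWatch; Dept. of Geography and Environmental Management, University of Waterloo; Dept. of Geomatics Engineering, University of Calgary

AI summary Proposes an LLM-based framework that retrieves remote sensing data from cloud geospatial catalogues via natural language queries, integrating three agents for safety, intent interpretation, and API call generation; preliminary adversarial experiments show that prompt-level safety instructions improve robustness but that system-level defenses are still needed.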

Comments Accepted for publication in the International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences (ISPRS Archives), ISPRS Congress 2026

Abstract

We present an LLM-driven framework for retrieving remote sensing data from cloud-based geospatial catalogues using natural language queries. The system converts user intent into structured API calls, enabling efficient access to satellite imagery and environmental datasets. The architecture integrates three agents: Guardrail for safety and policy enforcement, General-QA for intent interpretation, and Recommender-Analyst for schema-aware API call generation. This coordinated design ensures reliable, semantically aligned interaction with external data services. The modular framework is portable across platforms through API schema substitution and supports applications in environmental monitoring, disaster response, and climate analysis. It establishes a scalable interface between user intent and geospatial infrastructure, enabling streamlined and automated Earth observation workflows. Preliminary experiments under adversarial multi-turn settings show that prompt-level safety instructions improve robustness, although rare high-impact failures persist in API manipulation scenarios. These failures highlight the need for adaptive, system-level defenses that balance safety, usability, and cost efficiency, and they motivate our intercept-level Guardrail agent.

2606.15074 2026-06-16 cs.LG New submission

TriAdReview: Triangular Adversarial Review Architecture for Multi-Model Technical Document Generation

Zhiqiang Zhou, Junliang Dai, Xu Ling

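Affiliations Hunan Chemical Industry Vocational and Technical College

AI summary Proposes TriAdReview, a triangular adversarial review architecture that uses two independent reviewer models and a triangular judging mechanism to iteratively improve generator output; it improves on the single-model baseline by 10.1% across five benchmark tasks, but adversarial review shows a structural bias on completeness-oriented tasks.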

Comments 12 pages, 7 figures, 5 tables

Abstract

Large language models (LLMs) are increasingly used for technical document generation, yet single-model outputs often suffer from over-engineering, security blind spots, and incomplete coverage. We propose TriAdReview, a triangular adversarial review architecture that employs two independent reviewer models (engineering and boundary perspectives) and a triangular judging mechanism to iteratively improve a generator model's output. We evaluate TriAdReview across five benchmark tasks - architecture design, code generation, proposal review, security audit, and requirements analysis - using three configurations: single model (baseline), dual model (single review), and triple model (full system). Results across 75 experiments (n=5 per cell) show that the triple model configuration achieves a 10.1% overall improvement over the single model baseline (26.2 vs. 23.8 out of 50; p<0.05, paired t-test), with particularly strong gains on security audit (+27.6%), code generation (+20.8%), and architecture design (+15.6%). A second scorer (mimo-v2.5-pro) confirms the direction with a smaller effect (+2.7%), suggesting moderate inter-rater agreement. However, the system shows a -7.5% degradation on requirements analysis, revealing that adversarial review architectures have a structural bias toward simplification that is counterproductive for completeness-oriented tasks. We analyze this boundary condition through a task-type framework and demonstrate that reviewer prompt adaptation partially mitigates the issue. Our findings provide the first empirical characterization of when multi-model adversarial review helps versus harms, with implications for the design of collaborative AI systems.
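
The headline figures in this abstract can be checked with a few lines of arithmetic. The Python sketch below recomputes the relative improvement from the two mean scores quoted above, and the experiment count from the stated design (five tasks, three configurations, n=5 per cell). The function and variable names are illustrative and not taken from the paper.

```python
def relative_change(new: float, baseline: float) -> float:
    """Relative change of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


# Mean scores out of 50, as quoted in the abstract.
triple_model_score = 26.2
single_model_score = 23.8

improvement = relative_change(triple_model_score, single_model_score)

# Experiment grid: tasks x configurations x repetitions per cell.
n_tasks, n_configs, n_per_cell = 5, 3, 5
n_experiments = n_tasks * n_configs * n_per_cell

print(f"overall improvement: {improvement:.1f}%")
print(f"experiments: {n_experiments}")
```

Both results agree with the abstract: a 10.1% relative improvement and 75 experiments. The improvement is relative to the baseline score, not a difference in percentage points of the 50-point scale.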

2606.15072 2026-06-16 cs.CV New submission

Texture-Shape Bias Balancing for Robust Synthetic-to-Real Semantic Segmentation in Automotive NIR Imagery

Felix Stillger, Ben Hamscher, Lukas Hahn, Annika Mütze, Tobias Meisen, Kira Maag

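Affiliations University of Wuppertal; Aptiv; Heinrich Heine University Düsseldorf; Osnabrück University

AI summary Proposes a generative augmentation framework that balances texture-shape bias through target style adaptation and Voronoi-based style diversification for synthetic-to-real domain adaptation on near-infrared images, reducing the domain gap by up to 63.6%.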

Comments Accepted at ECML PKDD 2026 (ADS Track)

Abstract

Semantic segmentation is a fundamental component of visual perception in modern automotive systems, enabling pixel-level scene understanding. Near-infrared (NIR) imaging offers stable detection under difficult illumination conditions, but the development of domain-specific semantic segmentation models remains challenging due to the lack of high-quality annotated data from real-world scenarios. Synthetic datasets offer a scalable alternative, but models trained on synthetic images often suffer performance degradation when transferred to real domains. We present the first systematic study on synthetic-to-real domain adaptation for semantic segmentation in NIR images in the automotive domain. We propose a generative augmentation framework that transforms synthetic images into realistic NIR-style variants via our proposed target style adaptation (TSA). TSA fine-tunes a latent diffusion model via low-rank adaptation on a small curated set of real NIR images and applies it to synthetic training data using structure-preserving multi-signal conditioning. To reduce texture bias and improve segmentation robustness, we further apply a Voronoi-based style diversification (VSD) strategy that modifies the original textures while preserving scene geometry. Experiments with multiple model architectures on NIR data from vehicle interiors and street scenes show that balancing inductive bias during training leads to noticeably more robust semantic segmentation and effectively reduces the domain gap in our real-world scenarios by up to 63.6% on exterior and 28.4% on interior data. The code is available on GitHub.
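
The abstract does not say how the Voronoi-based style diversification builds its regions. As a rough illustration of the underlying idea only, the Python sketch below partitions a small pixel grid into Voronoi cells by assigning each pixel to its nearest seed point; in a pipeline of this kind, each cell could then receive a different texture style while object boundaries stay untouched. The grid size, seed positions, and function name are all hypothetical.

```python
def voronoi_labels(height: int, width: int, seeds: list[tuple[int, int]]) -> list[list[int]]:
    """Label each pixel with the index of its nearest seed.

    Seeds are (x, y) points; distance is squared Euclidean, and ties go to
    the seed with the lowest index.
    """
    labels = []
    for y in range(height):
        row = []
        for x in range(width):
            distances = [(x - sx) ** 2 + (y - sy) ** 2 for sx, sy in seeds]
            row.append(distances.index(min(distances)))
        labels.append(row)
    return labels


# Hypothetical seed points on a 4 x 6 grid.
seeds = [(0, 0), (5, 0), (2, 3)]
labels = voronoi_labels(4, 6, seeds)
for row in labels:
    print(row)
```

Because the partition depends only on pixel coordinates, varying the seeds changes where textures are swapped without moving any scene geometry, which is the property the abstract attributes to VSD.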

2606.15070 2026-06-16 cs.CL New submission

Stop When Further Reasoning Won't Help: Attention-State Adaptive Generation in Reasoning Models

Jiakai Li, Ke Qin, Rongzheng Wang, Yizhuo Ma, Qizhi Chen, Muquan Li, Shuang Liang

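Affiliations University of Electronic Science and Technology of China; Ubiquitous Intelligence and Trusted Services Key Laboratory of Sichuan Province

AI summary To address overthinking in large reasoning models, which causes redundant output and lower accuracy, proposes ASAG, a training-free attention-state adaptive generation method that infers the reasoning state and adjusts the generation strategy dynamically, improving average accuracy by 3.2% and cutting generated tokens by nearly 40% across multiple benchmarks.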

Comments ICML 2026 Spotlight

Abstract

By incorporating test-time compute scaling, large reasoning models (LRMs) can solve complex problems through explicit chain-of-thought (CoT) reasoning processes. However, they often suffer from overthinking, resulting in redundant token outputs and degraded accuracy. Current methods to mitigate this issue remain limited: training-based approaches require substantial computational resources, while training-free methods rely on well-crafted prompts or unreliable confidence signals. In this work, we investigate early stopping from the perspective of attention distributions and propose a simple method, ASAG, which infers the model's reasoning state and adaptively adjusts the generation strategy. The proposed framework is training-free and plug-and-play, enabling seamless integration into existing LRMs. Extensive experiments on nine benchmarks demonstrate consistent improvements across mainstream LRMs with varying parameter scales, including the DeepSeek-R1-Distill and Qwen3 series. Specifically, ASAG improves average accuracy by 3.2% while reducing the number of generated tokens by nearly 40% across all reasoning tasks on Qwen3-8B.

2606.15069 2026-06-16 cs.CL New submission

CoCoGEC: Counterfactual Generation for Robust Grammatical Error Correction

Qianyu Wang, Xiaoman Wang, Yuanyuan Liang, Xinyuan Li, Yunshi Lan

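Affiliations East China Normal University

AI summary Proposes the CoCoGEC framework, which generates word-level and sentence-level counterfactual samples and selects high-mutual-information instances to make grammatical error correction models more stable under context perturbation, with substantial F0.5 gains on three perturbed datasets.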

Abstract

Grammatical error correction (GEC) systems are usually trained and evaluated on GEC benchmarks, but their performance often drops sharply once the surrounding context is slightly perturbed or extended. This indicates that existing GEC models often fail to understand error patterns in varying contexts. In this paper, we thoroughly investigate counterfactuals for GEC tasks, where subtle changes to the context can cause labels to flip. We propose CoCoGEC, a counterfactual generation framework that creates copies of training instances with error-irrelevant contexts altered. Our framework systematically generates counterfactuals by (1) generating intra- and inter-sentence counterfactuals that maintain the error patterns and syntax of the original instances by altering the word-level and sentence-level contexts; and (2) revising the generated counterfactuals by selecting the instances with flipped labels and a high GEC Mutual Information (MI) coefficient. Extensive experiments show that our method substantially improves the stability of GEC models, outperforming a set of data augmentation baselines. In particular, it achieves absolute F0.5 gains of +9.9, +11.3, and +20.8 points on the perturbed BEA-19*, CoNLL-14*, and TEM-8* datasets, respectively. Our code is released at https://github.com/Quinnok/CoCoGEC.
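
The gains above are reported in F0.5, the F-beta score with beta = 0.5, which weights precision more heavily than recall. The Python sketch below shows the standard formula; the precision and recall values are hypothetical and chosen only to illustrate the calculation, not drawn from the paper.

```python
def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """F-beta score; beta < 1 weights precision more heavily than recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical precision and recall, for illustration only.
precision, recall = 0.60, 0.40
print(f"F0.5 = {f_beta(precision, recall):.4f}")
print(f"F1   = {f_beta(precision, recall, beta=1.0):.4f}")
```

With precision above recall, F0.5 comes out higher than F1 for the same system, which is why the metric rewards correctors that make few wrong edits even if they miss some errors.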

2606.15068 2026-06-16 cs.RO New submission

Design and Fabrication of a Spin Coater with In-Situ Optical Measurement for Soft Thin Films

Daniel Gliksberg, Jiajie Qiu, Jun Suzuki, Kamal Youcef-Toumi

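Affiliations The Japan Steel Works, LTD.

AI summary To address the difficulty of measuring the thickness of soft elastomer films, designs a low-cost 3D-printed spin coater with an integrated in-situ optical thickness measurement system based on laser reflection, achieving thickness control for 50-300 micron films with a resolution of 3.6 microns.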

Comments 8 pages, 7 figures, 5 tables. To be published in the conference proceedings for AIM 2026

Abstract

Spin coating is widely used to fabricate thin polymer and elastomer films, yet reliable thickness verification of highly compliant materials remains challenging due to deformation from contact-based measurements and the cost and complexity of conventional optical metrology. Accurate thickness control is especially critical in soft elastomer applications such as dielectric elastomer actuators (DEAs), where mechanical and functional performance scales strongly with film thickness. This work presents a low-cost, primarily 3D-printed benchtop spin coater with an integrated, minimally deforming optical thickness measurement system for soft-film fabrication workflows. The system is designed to manufacture films between 50 and 300 microns thick with repeatability within 10 microns. Thickness is measured in-situ by tracking the displacement of a reflected laser beam with a quadrant photodetector, avoiding significant deformation. Optical geometry, sensor linearity constraints, and structural validation via finite element analysis are discussed. Experimental validation using calibrated metal shims demonstrated a thickness resolution of 3.6-3.7 microns and best-case measurement repeatability of 13 microns (95 percent confidence interval). The platform repeatably produced silicone films within 9 microns of target thickness, demonstrating that accessible optical metrology can be integrated into a low-cost spin coating system for practical, thickness-controlled fabrication of compliant thin films without specialized industrial instrumentation.
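
The abstract does not give the optical geometry, so the following Python sketch only illustrates how a beam displacement can map to a height in a generic laser triangulation setup. It assumes specular reflection from the top surface, a beam arriving at a known angle from the surface normal, and a detector that reports the lateral shift perpendicular to the reflected beam; under those assumptions the shift is 2 h sin(theta) for a surface raised by h. The angle and shift values are hypothetical.

```python
import math


def height_from_beam_shift(shift_um: float, incidence_deg: float) -> float:
    """Surface height change (microns) from the shift of a reflected beam.

    Assumes specular reflection, an incidence angle measured from the
    surface normal, and a shift measured perpendicular to the reflected
    beam, so that shift = 2 * h * sin(theta).
    """
    theta = math.radians(incidence_deg)
    return shift_um / (2.0 * math.sin(theta))


# Hypothetical reading: 45 degree incidence, 141.4 micron beam shift.
thickness_um = height_from_beam_shift(141.4, 45.0)
print(f"estimated film thickness: {thickness_um:.1f} um")
```

Under this geometry a steeper incidence angle produces a larger shift per micron of height, which trades measurement sensitivity against the linear range of the detector.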

2606.15064 2026-06-16 cs.LG cs.RO New submission

Phase-Localized Curation Does Not Help: A Negative Result on Per-Phase Metric Selection for Demonstration Filtering

Aarav Bedi

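Affiliations Department of Mechanical Engineering, University of California, Berkeley

AI summary Shows through experiments on LIBERO tasks that applying metrics locally per phase for demonstration curation underperforms global or uniform metrics, because the defect signal is diluted and per-phase metric choices do not transfer.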

Comments 5 pages, 3 tables. Code: https://github.com/aaravbedi/phase-gated-curation

Abstract

Manipulation demonstrations have temporal phase structure, and a natural hypothesis is that demonstration-curation metrics should be applied within phases rather than globally. The idea is to segment each trajectory into phases, score each phase with the metric that is locally most informative, and then aggregate. This follows directly from prior work showing that a single global metric can be the best detector of a defect and yet the worst curator of the resulting policy. We test the per-phase hypothesis on three contact-rich LIBERO pick-and-place tasks with a controlled early-release structural defect, comparing phase-gated curation against the same metrics applied uniformly and against a strong single global metric. Across all three tasks and five random seeds per condition, phase-gated curation is never the best curation strategy, and it is the worst of the three on two of the three tasks (Task 1: 86.0 vs. 92.0 for global; Task 3: 22.7 vs. 48.0 for uniform). We trace the failure to a concrete mechanism. When the defect signal is concentrated in a single phase, rank-aggregating across phases dilutes that signal with uninformative scores from defect-free phases, selecting a worse demonstration subset than simply applying the defect-informative metric everywhere. We further show that the per-phase metric selection does not transfer across tasks, since no phase shares a winning metric between any two tasks, so the selection cannot be reused and must be re-derived per task from a noisy sweep. These results bound a plausible and previously untested method, and they argue that practitioners should prefer identifying a single defect-informative metric over decomposing curation by phase. We release the full pipeline, all metric implementations, and per-seed results.
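
The dilution mechanism described above can be reproduced in a toy setting. The Python sketch below is not the paper's pipeline: it uses six hypothetical demonstrations, two of them defective, with hand-picked scores in which only the second phase reflects the defect and the other two phases happen to favour the defective demonstrations. It compares keeping the top four by the defect-informative score alone against keeping the top four by summed per-phase ranks.

```python
def ranks(scores: list[float]) -> list[int]:
    """Rank items by score: 0 for the lowest score, n-1 for the highest."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    result = [0] * len(scores)
    for rank, idx in enumerate(order):
        result[idx] = rank
    return result


def top_k(scores: list[float], k: int) -> set[int]:
    """Indices of the k highest-scoring items."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(order[:k])


defective = {0, 1}  # demonstrations 0 and 1 carry the defect
phase_scores = [
    [0.90, 0.80, 0.10, 0.20, 0.50, 0.60],  # phase 1: unrelated to the defect
    [0.10, 0.20, 0.60, 0.70, 0.80, 0.90],  # phase 2: defect-informative
    [0.95, 0.85, 0.15, 0.30, 0.40, 0.70],  # phase 3: unrelated to the defect
]
k = 4

# Curate with the defect-informative score alone.
single_metric_pick = top_k(phase_scores[1], k)

# Curate by summing ranks across all phases.
phase_ranks = [ranks(s) for s in phase_scores]
aggregated = [sum(r[i] for r in phase_ranks) for i in range(6)]
aggregated_pick = top_k([float(a) for a in aggregated], k)

print("single metric keeps:", sorted(single_metric_pick))
print("rank aggregation keeps:", sorted(aggregated_pick))
print("defective kept by aggregation:", sorted(aggregated_pick & defective))
```

The single metric keeps demonstrations 2 to 5 and excludes both defective ones, while rank aggregation keeps 0, 1, 4, and 5. The two uninformative phases outvote the one phase that carries the signal, which is the failure the abstract describes.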

2606.15059 2026-06-16 cs.CL New submission

A Practical Evaluation Method for Long-Form Simultaneous Speech-to-Speech Translation

Yulin Xue, Siqi Ouyang, Lei Li

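Affiliations Carnegie Mellon University

AI summary To address the difficulty of evaluating long-form simultaneous speech translation, proposes a practical method based on ASR, forced alignment, and sentence-embedding alignment that yields sentence-level latency and quality metrics, revealing severe latency accumulation in existing systems on long speech.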

Comments Accepted to IWSLT 2026 Scientific Track

Abstract

Simultaneous speech-to-speech translation (SimulS2ST) enables real-time cross-lingual communication, but existing evaluation has focused largely on short or pre-segmented speech rather than long-form, continuous input. Prior approaches are difficult to reproduce and make assumptions that do not hold for end-to-end systems. We present a practical evaluation method for long-form SimulS2ST. Given source speech, pre-segmented source transcripts, and reference translations, we run automatic speech recognition (ASR) and forced alignment on the generated target speech to recover token-level timestamps, then apply a sentence-embedding-based aligner to match the target text to its corresponding source sentences. This enables sentence-level computation of latency and quality metrics, including YAAL and xCOMET, which are then aggregated into final system-level scores. Experiments on representative SimulS2ST systems show that the method is effective in practice and reveal that current systems suffer from substantial latency accumulation on long speech.
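
Once target words carry timestamps and a sentence assignment, sentence-level latency reduces to simple bookkeeping. The Python sketch below assumes the ASR, forced alignment, and sentence alignment steps have already run, and it uses a deliberately simple lag (the end of the last aligned target word minus the end of the source sentence) as a stand-in. This is not the YAAL metric named in the abstract, and the data structure, field names, and timestamps are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class TargetWord:
    text: str
    end_time: float    # seconds, from forced alignment of the target speech
    sentence_id: int   # source sentence this word was aligned to


def sentence_lags(words: list[TargetWord], source_end: dict[int, float]) -> dict[int, float]:
    """Lag per sentence: end of last aligned target word minus source sentence end."""
    last_end: dict[int, float] = {}
    for w in words:
        last_end[w.sentence_id] = max(last_end.get(w.sentence_id, 0.0), w.end_time)
    return {sid: last_end[sid] - source_end[sid] for sid in last_end}


# Hypothetical timestamps for two consecutive sentences.
words = [
    TargetWord("hello", 2.5, 0), TargetWord("world", 3.0, 0),
    TargetWord("good", 6.5, 1), TargetWord("morning", 8.0, 1),
]
source_end = {0: 2.0, 1: 5.0}

lags = sentence_lags(words, source_end)
system_lag = mean(lags.values())
print("per-sentence lag (s):", lags)
print("system-level lag (s):", system_lag)
```

In this toy input the lag grows from 1.0 s on the first sentence to 3.0 s on the second, the kind of accumulation a per-sentence view exposes and a single utterance-level number would hide.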

2606.15058 2026-06-16 cs.LG stat.AP New submission

Machine Learning and the Random Walk Puzzle: Forecasting the CAD/USD Exchange Rate with Expanding Window Evaluation and SHAP Interpretability

Louis Agyekum, Edmund Fosu Agyemang, Obu-Amoah Ampomah, Kofi Acheampong, Emmanuel Boadi, Priscilla Yaa Amakye, Fafa Shalom Tchorly, Enock Adu Bonsu, Eric Nyarko

Affiliations Department of Economics, University of Ottawa; Department of Biostatistics and Data Science, Celia Scott Weatherhead School of Public Health and Tropical Medicine at Tulane University; Department of Statistics, Western Michigan University; Department of Economics, Western Michigan University; School of Mathematical and Statistical Sciences, University of Texas Rio Grande Valley; Robinson College of Business, Georgia State University; Department of Mathematics & Statistics, University of North Florida; Department of Epidemiology and Biostatistics, University of Arizona; Department of Statistics and Actuarial Science, University of Ghana

AI summary Examines whether machine learning models can beat the naive random walk benchmark in forecasting the monthly USD/CAD exchange rate, using expanding-window evaluation and SHAP interpretation; linear regression significantly outperforms the random walk, while ensemble models perform about on par with it.

Comments 10 pages, 14 figures, 8 tables

Abstract

This study examines whether machine learning (ML) models can outperform the naive random walk benchmark in forecasting the monthly USD/CAD exchange rate. Using daily data from the Bank of Canada spanning January 2017 to May 2026, resampled into 113 monthly observations, five ML models are evaluated: linear regression, random forest, gradient boosting, XGBoost, and AdaBoost. These models are benchmarked against the naive random walk model and exponential smoothing with Holt-Winters seasonality (ETS). All models are evaluated using an expanding-window framework to maintain strict out-of-sample integrity, and forecast-accuracy differences are assessed using the Diebold-Mariano (DM) test. Structural break detection identifies four significant breakpoints in the series, corresponding to the escalation of the US-China trade war in 2018, the COVID-19 economic recovery in 2020, the peak of the Bank of Canada rate-hiking cycle in 2022, and the start of the Bank of Canada rate-cutting cycle in 2024. SHAP (SHapley Additive exPlanations) analysis is applied to interpret the drivers of the best-performing ML model. The results show that the naive random walk model remains a formidable benchmark. Linear regression is the only model that statistically outperforms the naive random walk model, with a DM statistic of 3.0585 and a p-value of 0.0071, whereas the ML ensemble models show only marginal differences. Random forest with an expanding-window framework achieves the lowest MAPE of 1.17 percent among all models except the random walk. SHAP analysis confirms that short-term lags, particularly lag1 and lag2, and recent rolling means dominate predictions, consistent with the near-random-walk behavior of exchange rates.
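
Expanding-window evaluation against a random walk is easy to get subtly wrong, so a minimal Python sketch may help. It produces one-step-ahead forecasts using only observations before each forecast date, scores them with MAPE, and compares the naive random walk with a deliberately simple historical-mean model. The series is hypothetical rather than Bank of Canada data, and the function names and the `min_train` parameter are illustrative; none of the study's models are reproduced here.

```python
def mape(actual: list[float], predicted: list[float]) -> float:
    """Mean absolute percentage error, in percent."""
    errors = [abs((a - p) / a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(errors) / len(errors)


def expanding_window_forecasts(series, min_train, forecaster):
    """One-step-ahead forecasts; the training window grows by one point per step."""
    actual, predicted = [], []
    for t in range(min_train, len(series)):
        history = series[:t]  # only data strictly before time t
        predicted.append(forecaster(history))
        actual.append(series[t])
    return actual, predicted


def random_walk(history):
    """Naive benchmark: the next value equals the last observed value."""
    return history[-1]


def historical_mean(history):
    """A deliberately simple alternative model, for comparison only."""
    return sum(history) / len(history)


# Hypothetical monthly rates, not real exchange-rate data.
rates = [1.30, 1.32, 1.31, 1.35, 1.36, 1.34, 1.37, 1.38]

for name, model in [("random walk", random_walk), ("historical mean", historical_mean)]:
    actual, predicted = expanding_window_forecasts(rates, min_train=3, forecaster=model)
    print(f"{name}: MAPE = {mape(actual, predicted):.2f}%")
```

Slicing `series[:t]` before each forecast is what keeps the evaluation strictly out of sample: no model ever sees the value it is asked to predict.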

2606.15055 2026-06-16 cs.CV cs.AI New submission

Bridging Geographic Bias in Urban Streetscape Inference via Lifelong Learning with Visual-Semantic Pivoting

Xinze Zhang

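Affiliations University of Southern California

AI summary Proposes HVSP-LL, a lifelong learning framework that reduces geographic bias in cross-city streetscape inference through a stratified visual-semantic pivoting module and an equity-aware rehearsal mechanism, shrinking the inter-city perception gap by 38%.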

Abstract

Visual perception of urban streetscapes underpins evidence-based decisions in landscape planning, public health, and place-making. Yet models trained on a few well-photographed metropolises systematically misjudge underrepresented districts, propagating geographic bias into downstream policy. We address this gap with HVSP-LL, a lifelong learning framework that couples a stratified visual-semantic pivoting module with an equity-aware rehearsal mechanism. The pivoting module organises landscape concepts along a three-tier ontology (macro structure, meso composition, micro element) and aligns image features to learnable semantic anchors at each tier, providing transferable representations that resist distributional drift. The lifelong adaptation component sequentially absorbs new urban regions while constraining inter-region perception gaps through a worst-region sample-reweighting objective and a structurally-aware exemplar buffer. We evaluate HVSP-LL on a panoramic streetscape benchmark assembled from twelve cities across four continents and seven perceptual dimensions. The framework attains 0.834 Spearman correlation on the held-out city sequence, an absolute 6.1 point improvement over the strongest continual baseline, and shrinks the inter-city perception gap to 0.094 -- a 38% reduction relative to the strongest continual baseline (0.151) and a 57% reduction relative to a representative regularisation baseline (0.218). Ablations confirm that each tier of the pivoting hierarchy contributes monotonically, and the equity-aware rehearsal converts mean backward transfer from -0.038 (without retention) to +0.013, eliminating catastrophic forgetting on the held-out sequence. Our results indicate that hierarchical anchoring is a practical pathway toward geographically equitable streetscape inference at city scale.
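
The two reduction percentages quoted above follow directly from the three gap values in the abstract. The short Python check below recomputes them; the variable and function names are illustrative.

```python
def relative_reduction(baseline: float, value: float) -> float:
    """Reduction of `value` relative to `baseline`, in percent."""
    return (baseline - value) / baseline * 100.0


# Inter-city perception gaps, as quoted in the abstract.
hvsp_ll_gap = 0.094
continual_baseline_gap = 0.151
regularisation_baseline_gap = 0.218

vs_continual = relative_reduction(continual_baseline_gap, hvsp_ll_gap)
vs_regularisation = relative_reduction(regularisation_baseline_gap, hvsp_ll_gap)

print(f"vs. strongest continual baseline: {vs_continual:.1f}%")
print(f"vs. regularisation baseline: {vs_regularisation:.1f}%")
```

The results round to 38% and 57%, matching the abstract.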

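2606.15054 2026-06-16 cs.LG New submission

Size Doesn't Matter: Cosine-Scored Sparse Autoencoders

Silen Naihin, Lev Stambler

Institutions * GitHub arXiv

AI summary Inner-product scoring in sparse autoencoders is confounded by the input norm. The paper proposes cosine scoring so that feature detection depends more on directional alignment; experiments show that the method learns human-recognizable concepts more often.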

Journal ref ICML 2026, Spotlight at the Mechanistic Interpretability Workshop

Abstract

Sparse autoencoders (SAEs) detect features via an inner product, so a feature's activation scales with both its directional alignment and the input's norm. Under BatchTopK, high-norm tokens inflate all pre-activations simultaneously, claiming dictionary slots regardless of content alignment. This matters because sublayer normalization has already discarded the magnitude the score measures, so the encoder detects a quantity the model does not read. We replace the score with a learned blend of cosine similarity and input magnitude, letting the optimizer choose how much norm to use; a per-feature extension lets each feature decide independently. In both regimes, training is free to recover the inner product but never does, with no feature ever choosing more than half-magnitude dependence. At matched reconstruction, the cosine encoder learns features that align with human-recognizable concepts far more often than the standard encoder, filling dictionary slots that the inner product wastes on norm detectors. Loss reweighting that equalizes gradients barely closes the gap, confirming forward-pass score geometry as the lever. The advantage is not universal across tasks or depths, but we believe cosine scoring should be the default for dictionary learning on normalized representations.
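To make the norm effect concrete, here is a minimal sketch of one possible blend between cosine and inner-product scoring. The abstract does not give the exact form of the learned blend, so the parameterization `cos(x, w) * ||x||**alpha` and the names `blended_score` and `alpha` are assumptions chosen for illustration only.

```python
import math


def blended_score(x, w, alpha):
    """Hypothetical blend: cos(x, w) * ||x||**alpha.

    alpha = 0 gives pure cosine similarity; alpha = 1 gives the inner
    product of x with the unit-normalised feature direction w / ||w||.
    """
    x_norm = math.sqrt(sum(v * v for v in x))
    w_norm = math.sqrt(sum(v * v for v in w))
    cosine = sum(a * b for a, b in zip(x, w)) / (x_norm * w_norm)
    return cosine * x_norm ** alpha


feature = [1.0, 0.0]
aligned_small = [1.0, 0.0]     # perfectly aligned with the feature, norm 1
misaligned_large = [6.0, 8.0]  # cosine 0.6 with the feature, norm 10

for alpha in (1.0, 0.0):
    print(alpha,
          blended_score(aligned_small, feature, alpha),
          blended_score(misaligned_large, feature, alpha))
```

With `alpha = 1` the high-norm but poorly aligned input scores higher than the perfectly aligned one; with `alpha = 0` the ordering flips, which is the behaviour the abstract attributes to cosine scoring.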

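2606.15053 2026-06-16 cs.LG cs.NA math.NA New submission

Physics-conforming Latent Twins

Matthias Chung, Yutong Bu, Deepanshu Verma

Institutions * Emory University; Clemson University

AI summary Proposes the Physics-conforming Latent Twins framework, which jointly learns an encoder, a decoder, and a latent flow map so that the latent dynamics respect conservation laws, invariants, and dissipative structure, improving constraint satisfaction and long-time behavior while maintaining surrogate prediction accuracy.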

Comments 32 pages, 11 figures

Abstract

Surrogate models are central to scientific machine learning, where they enable fast prediction, simulation, inference, and control for complex physical systems. For time-dependent problems, however, accurate interpolation of training trajectories is not sufficient: reliable surrogates should also respect the conservation laws, invariants, admissibility conditions, and dissipative structures that give those trajectories physical meaning. We introduce Physics-conforming Latent Twins, a framework for learning latent surrogate solution operators whose dynamics satisfy selected physical principles by design. The method builds on the Latent Twin formulation by jointly learning an encoder, a decoder, and a latent flow map between arbitrary time-indexed states, while constraining the latent dynamics to preserve or dissipate prescribed structural quantities. We develop a constraint-transfer viewpoint that connects physical structure in the original state space with compatible constraints in latent space, and prove structure-preservation bounds showing how latent enforcement improves control of physical defects after decoding. We also derive algebraic conditions for latent flow maps that preserve linear and quadratic invariants or enforce dissipative inequalities. Numerical experiments on representative ODE and PDE benchmarks demonstrate improved constraint satisfaction, structural fidelity, and qualitative long-time behavior while maintaining accurate surrogate prediction.
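As a small illustration of what preserving or dissipating a quadratic invariant means for a latent map, consider a linear map z -> A z. It preserves z^T Q z for all z (with Q symmetric) exactly when A^T Q A = Q, so for Q = I any orthogonal A works. The sketch below uses a 2-D rotation as the conservative map and a scaled rotation as a dissipative one. It is a textbook example, not the algebraic conditions derived in the paper, and all names are illustrative.

```python
import math


def rotate(z, theta, scale=1.0):
    """Apply a 2-D rotation, optionally scaled, as a linear latent flow map.

    scale = 1 is orthogonal and preserves z^T z; 0 < scale < 1 dissipates it.
    """
    c, s = math.cos(theta), math.sin(theta)
    return [scale * (c * z[0] - s * z[1]), scale * (s * z[0] + c * z[1])]


def energy(z):
    """Quadratic quantity z^T z."""
    return z[0] ** 2 + z[1] ** 2


conservative = [3.0, 4.0]
dissipative = [3.0, 4.0]
for _ in range(1000):
    conservative = rotate(conservative, 0.1)
    dissipative = rotate(dissipative, 0.1, scale=0.999)

print(energy(conservative))  # stays close to 25 up to round-off
print(energy(dissipative))   # strictly smaller than 25
```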

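2606.15049 2026-06-16 cs.CV New submission

Gaussian Spatial Priors for Anatomy-Aware Object Detection in Surgical Videos

Yunfan Li, Artem Shmelev, Himanshu Gupta

Institutions * Stony Brook University; Stony Brook University Hospital

AI summary Proposes a Gaussian Spatial Prior (GSP) module that encodes the spatial relationships between anatomical structures as a parametric bias injected into the self-attention of a DAB-DETR decoder, markedly improving detection of dependent-class structures (such as the epigastric vessels) in inguinal hernia surgery videos.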

Abstract

Detecting anatomical structures in surgical video is essential for intraoperative safety frameworks such as the Critical View of the Myopectineal Orifice (CVMPO) in inguinal hernia repair. While prominent structures like the Cooper's Ligament and Triangle of Doom are reliably detected by standard methods, smaller structures such as the epigastric vessels remain challenging due to their visual ambiguity and intermittent visibility. We observe that the spatial relationship between structures is anatomically constrained, and propose a Gaussian Spatial Prior (GSP) module that encodes this relationship as a compact, parametric bias injected into the self-attention of a DAB-DETR decoder. The prior is computed offline from training annotations as a small set of frozen Gaussian parameters and recomputed at each decoder layer using the iteratively refined reference points. On a dataset of inguinal hernia repair videos with 5-fold cross-validation, GSP improves dependent class detection by $+33.5\%$ ($\text{AP}_{50}$) over DAB-DETR and $+53.9\%$ over YOLOv26, while also improving anchor detection by $+6.0\%$. These gains are statistically significant across all folds ($p=0.012$, paired $t$-test).

2606.15046 2026-06-16 cs.RO New submission

Exact, Efficient, and Safe Occlusion-Aware Planning Using AH-Polyhedrons

Long Kiu Chung, David Isele, Toktam Mohammadnejad, Faizan M. Tariq, Sangjae Bae, Shreyas Kousik, Jovin D'sa

Institutions * Honda Research Institute (HRI); Georgia Institute of Technology

AI summary Proposes the APRO framework, which combines game-theoretic active perception with AH-polyhedron reachability analysis and verifies safety exactly via linear programming, addressing occlusions in autonomous valet parking with a 100% safety rate in real time.

Comments 8 pages, 3 figures

Abstract

Safely handling occlusions is a fundamental challenge for autonomous mobile robots operating in dynamic environments. This issue is especially prominent in autonomous valet parking (AVP), where traffic rules are lax, occlusions are frequent and cluttered, and overly conservative behavior can leave vehicles stuck. However, existing methods either lack formal safety guarantees, assume agents follow road structures, or introduce conservatism, leaving occlusion-aware planning for AVP an open challenge. In this paper, we propose APRO (AH-Polyhedron Reachability for Occlusions), an exact and efficient occlusion-aware planning framework based on game-theoretic active perception and AH-polyhedron reachability analysis with AVP as our canonical use case. Our key insight is to reformulate set-based safety conditions in prior work as unions of AH-polyhedrons, enabling exact safety verification through linear programming (LP) without any additional conservatism in set computations or assumptions on road topology. We further show how the resulting safety conditions can be integrated into optimization-based planners or a bisection search scheme for real-time applications. We validate our method in simulation and hardware experiments, including data replay on a real-world parking lot dataset. Experimental results demonstrate that our method consistently achieved a 100% safety rate across all evaluated scenarios while maintaining real-time performance, resulting in safer and more optimal decisions than existing methods with formal safety guarantees.

2606.15044 2026-06-16 cs.CL New submission

Equity with Efficiency: An Empirical Study of Tokenizers for Multilingual Large Language Models

Kieron Seven Jun Wei Lee, Muhammad Reza Qorib, Andrew Ivan Soegeng, Hwee Tou Ng

Institutions * National University of Singapore; Carnegie Mellon University; SAP

AI summary Systematically compares equitable tokenizers on 11 Southeast Asian languages, finding that Parity-aware BPE lies on the Pareto frontier of the efficiency-equity trade-off, while Morphology-Driven Byte Encoding performs best on semantic reasoning.

Abstract

Multilingual large language models (LLMs) depend on subword tokenization to bridge discrete text and continuous neural representation. State-of-the-art multilingual LLMs often use Byte-level Byte-Pair Encoding (BPE) tokenizers that structurally favor high-resource languages and Latin scripts. For speakers of underrepresented languages, particularly those across Southeast Asia, this bias inflates inference costs and widens cross-lingual capability gaps. We present the first systematic comparison of equitable tokenizers on a unified benchmark spanning 11 Southeast Asian languages. Beyond tokenizer-level analysis of compression efficiency and cross-lingual equity, we assess downstream task performance through controlled 1.5B-parameter language model training using the same training data. Our results show that Parity-aware BPE lies on the Pareto frontier of the efficiency-equity trade-off, achieving strong compression parity at competitive cost. Morphology-Driven Byte Encoding delivers the best semantic reasoning performance through morphologically richer representations, albeit at a higher computational expense. Byte Latent Transformer underperforms on downstream tasks, possibly because its architectural assumptions misalign with the constraints of limited low-resource training data. Together, our findings demonstrate that cross-lingual fairness and tokenization efficiency are not fundamentally at odds, and offer practical guidance for designing equitable multilingual models.

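2606.15038 2026-06-16 cs.AI New submission

Fusion is not one-size-fits-all: Cross-Modal Representation Alignment for Time-to-Event Modeling

Zhemin Zhang, Weijie Chen, David Le, Amara Tariq, Alex Wallace, Matthew Stib, Juan Maria Farina, Chadi Ayoub, Reza Arsanjani, Imon Banerjee

Institutions * Arizona State University; Mayo Clinic

AI summary To address modality imbalance and distribution shift in multimodal clinical data, proposes a foundation-model-based cross-modal alignment framework that aligns representations of CT imaging and longitudinal EHR data through four fusion strategies, validates fusion on pulmonary embolism mortality and cardiovascular disease outcome prediction, and provides the first systematic analysis of how modality imbalance affects fusion behavior in time-to-event prediction.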

Abstract

Accurate time-to-event (TTE) prediction from multimodal clinical data remains challenging due to modality imbalance and distribution shift. We introduce a foundation model-driven framework for cross-modal representation alignment between CT imaging and longitudinal EHR data, designed to generalize across tasks and institutions. CT and EHR modalities are encoded independently using domain-specific foundation models and aligned in a shared latent space through four principled fusion strategies: late fusion, contrastive alignment, cross-attention, and co-attention. We evaluate two clinically distinct TTE tasks: pulmonary embolism (PE) mortality and cardiovascular disease (CVD) outcomes, on large-scale multi-institutional cohorts (PE: N=3,099 train; 1,098 internal; 435 external; CVD: N=2,951 train; 837 internal; 682 external). Fusion consistently improves concordance index by 1.5-5.4% over unimodal baselines when modalities contribute comparably. Overall, contrastive multimodal fusion, particularly with CLMBR representations, provided the most consistent and statistically robust improvements, especially for PE mortality prediction. For MACE, cross-attention (one-hot) achieved the highest internal performance and image-guided co-attention achieved the best external performance. We therefore introduce a generalizable foundation model-based cross-modal alignment framework and provide the first systematic analysis of fusion behavior under modality imbalance in TTE prediction. Our results establish task-aware multimodal alignment as a necessary design principle for robust generalization and scalable clinical deployment.
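For readers unfamiliar with the concordance index used to report these gains, the following is a minimal sketch of Harrell's C for right-censored data. It ignores tied event times, and the toy inputs are made up for illustration; it is not the evaluation code used in the paper.

```python
def concordance_index(times, events, risks):
    """Harrell's C-index for right-censored data (tied times ignored).

    A pair (i, j) is comparable when subject i had an observed event and
    times[i] < times[j]. It is concordant when the model assigns subject i
    the higher risk; ties in risk count as half.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored subject cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    return concordant / comparable


# Toy data: follow-up times, event indicators (1 = event, 0 = censored), risk scores.
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
risks = [0.9, 0.4, 0.6, 0.1]
print(concordance_index(times, events, risks))  # 4 concordant of 5 comparable pairs
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, so improvements of a few percent in this index are differences in how often the model orders patient pairs correctly.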

2606.15037 2026-06-16 cs.CL cs.CV New submission

ReportQA: QA-Based Radiology Report Evaluation

Yiming Shi, Shaoshuai Yang, Xi Chen, Haolin Li, Hengyu Zhang, Che Jiang, Kaiwen Wang, Xun Zhu, Dong Xie, Fei Wang, Dejing Dou, Miao Li, Ji Wu

Institutions * Department of Electronic Engineering, Tsinghua University; College of AI, Tsinghua University; Beijing National Research Center for Information Science and Technology; Beijing Electronic Digital & Intelligence

AI summary Proposes the ReportQA framework, which uses knowledge trees and LLMs to extract structured information from reports and generate QA pairs, and uses question-answering accuracy as the evaluation metric, aligning better with radiologist judgments than existing metrics.

Abstract

Radiology report evaluation is essential for advancing automated report generation. Natural language generation metrics have limited clinical relevance. Clinical efficacy (CE) metrics evaluate important medical findings, but focus mainly on presence and cover only a limited set of entities. Because CE metrics rely heavily on manual annotations, they are difficult to extend to new clinical entities or attributes. In clinical practice, radiology reports serve as a medium for information transfer. Clinicians use them to perform downstream diagnostic tasks without directly inspecting images. Based on this insight, we propose ReportQA, a clinically relevant and flexible radiology report evaluation framework that supports detailed quantitative analysis of radiology report generation systems. We first collect datasets covering multiple imaging modalities and anatomical regions. We then construct knowledge trees of clinical entities and attributes with radiologist guidance, and use large language models (LLMs) to extract structured information from raw reports. Next, we generate QA pairs from predefined templates and apply quality control through self-filtering and report-based filtering. During evaluation, the report is treated as context, and an LLM acts as a judge model to answer the QA pairs. Based on the resulting QA accuracy, we introduce the QAScore metric. Compared with existing metrics, QAScore shows better alignment with radiologist judgments. Experiments on multiple state-of-the-art vision-language models reveal that current report-based inference paradigms struggle to learn fine-grained clinical representations and exhibit strong negative prior biases. In contrast, question-driven inference provides a more effective alternative. For reproducibility and extensibility, we release the knowledge trees, structured reports, and QA pairs, along with the pipeline code for QA construction and evaluation.

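2606.15036 2026-06-16 cs.LG math.NT New submission

Transformers Learn the Mestre-Nagao Heuristic

Pranav Venkata Konda

Institutions * Pranav Venkata Konda

AI summary Trains a two-layer Transformer encoder to classify rational elliptic curves as rank 0 or rank 1 with >99% accuracy, and uses mechanistic interpretability to show that the model learned the Mestre-Nagao sum heuristic weights and that the CLS embedding encodes the logarithm of L(E,1).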

Comments 15 pages, 10 figures

Abstract

We train a two-layer transformer encoder to classify rational elliptic curves $E/\mathbb{Q}$ of conductor $\leq 10000$ as either rank 0 or rank 1 from the first 128 normalized Frobenius traces. We achieve >99% accuracy on both classes, and accuracy is essentially unchanged on test curves with no isogeny or quadratic-twist relative in the training set. We then apply techniques from mechanistic interpretability such as attention analysis, linear probing, activation patching, logit attribution, and neuron-level circuit analysis to reverse-engineer the algorithm the (centroid in function space) model learned. We find that a sparse circuit of 20 out of 512 layer-1 MLP neurons is sufficient for rank prediction under a linear probe with an AUROC of 0.992 at plateau, implementing a push-pull detector architecture of rank-0 and rank-1 detectors with a one-sided readout. However, we notice that the model has sub-optimal readout problems indicating a mismatch in rank-order between the readout pathway and the discriminative circuit. Critically, the learned input weights of the top discriminating neuron match the Mestre-Nagao sum heuristic weights $\log(p)/(p\cdot \log{B})$ with a Spearman coefficient $r = 0.997$ and Pearson coefficient $r = 0.952$: the model has learnt a result from analytic number theory from the Frobenius trace data alone. We additionally find that all 50 independently trained models concentrate CLS attention on prime positions at 2-50$\times$ the rate of composite positions. The CLS embedding encodes $\log{L(E,1)}$ with $R^2 = 0.962\pm 0.011$ across the 50 models (after controlling for the conductor). Activation patching analysis reveals that attention weights are dissociated from causal information flow. Additionally, the 50 solutions from training are near-identical in function space (with pairwise agreement $>$98.8%) despite large weight space barriers.
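The heuristic weights $\log(p)/(p\cdot \log{B})$ quoted in the abstract are easy to tabulate. The sketch below does so under two assumptions that the abstract does not state: that the 128 input positions correspond to the first 128 primes, and that the bound $B$ is the 128th prime, 719. The function names are illustrative.

```python
import math


def primes_up_to(n):
    """Return all primes <= n using the sieve of Eratosthenes."""
    sieve = [True] * (n + 1)
    sieve[0:2] = [False, False]
    for i in range(2, int(n ** 0.5) + 1):
        if sieve[i]:
            sieve[i * i :: i] = [False] * len(sieve[i * i :: i])
    return [i for i, is_prime in enumerate(sieve) if is_prime]


def mestre_nagao_weights(bound):
    """Weights log(p) / (p * log(B)) for every prime p <= B."""
    log_b = math.log(bound)
    return {p: math.log(p) / (p * log_b) for p in primes_up_to(bound)}


weights = mestre_nagao_weights(719)  # assumed bound: 719 is the 128th prime
print(len(weights))                  # 128
for p in (2, 3, 5, 7, 719):
    print(p, round(weights[p], 4))
```

Because log(p)/p decreases for p above e, the weights fall off quickly after p = 3, so small primes dominate the sum.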

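2606.15034 2026-06-16 cs.AI New submission

OSGuard: A Benchmark for Safety in Computer-Use Agents

Mina Mohammadmirzaei, Jeffrey Flanigan

Institutions * University of California, Santa Cruz

AI summary Proposes OSGuard, a dual-granularity benchmark that evaluates agent safety under benign instructions through action-level safety judgments and risk-augmented execution, revealing the gap between local oversight and end-to-end safety.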

Abstract

Computer-use agents are increasingly evaluated by whether they complete realistic desktop and web tasks. However, task success alone can miss failures in which an agent reaches the nominal goal through an unsafe shortcut. We introduce OSGuard, a dual-granularity benchmark suite for evaluating safety in computer-use agents under benign, unchanged user instructions. OSGuard contains an action-level benchmark for local guardrail decisions and a risk-augmented execution suite for end-to-end evaluation. The action-level benchmark consists of contextualized proposed actions labeled as allowed, unrelated, or unsafe, each judged relative to the original instruction and current interface state. The execution suite contains manually constructed OSWorld-derived task variants in which the original task remains achievable, but the environment is modified to introduce latent hazards such as destructive overwrites. Each variant is paired with augmented evaluators that retain the original task-success criterion while adding explicit state-based safety invariants, allowing us to distinguish safe completions from unsafe completions that satisfy the nominal task objective. Our experimental results on OSGuard show that current multimodal guardrails can perform well on isolated action judgments, while risk-augmented execution exposes remaining gaps between local oversight and reliable end-to-end safety. This dual-granularity design enables more precise diagnosis of whether models can both recognize unsafe proposed actions and improve full-task safety when deployed as guardrails.

2606.15032 2026-06-16 cs.LG New submission

How Should World Models Be Evaluated? A Decision-Making-Centric Position

Yang Yu, Shiyuan Zhang, Yifei Sheng, Haoxiang Ren, Haoxin Lin

Institutions * National Key Laboratory for Novel Software Technology, Nanjing University; School of Artificial Intelligence, Nanjing University; Cirquar Technologies

AI summary Identifies a mismatch between claims and evidence in world-model evaluation and proposes a decision-making-centric evaluation framework that emphasizes counterfactual reasoning, policy optimization, and related capabilities, organized by an L0-L7 evaluation ladder.

Abstract

World models have rapidly become one of the central abstractions in modern AI. Yet the term now refers to several different objects: action-conditioned environment models, latent imagination models, future-video predictors, interactive neural simulators, latent predictive representations, and synthetic-data engines. Evaluation has broadened with the term. Recent papers measure video realism, perceptual similarity, instruction following, physical plausibility, policy ranking, executability, planning success, and downstream policy improvement. The result is not only metric diversity but also a recurring problem of claim/evidence mismatch: papers frequently make a stronger claim about what their model is useful for than their evaluation can actually establish. This paper surveys the recent literature and argues that the central question is use-dependent. When a model is presented as a world model for embodied decision-making, a more decisive issue is not whether it generates visually compelling videos, but whether it supports reliable counterfactual reasoning, policy evaluation, planning, and policy optimization under intervention, policy-induced distribution shift, and long-horizon rollout. We organize the literature using an L0--L7 ladder that ranges from visual plausibility to policy optimization utility. In our interpretation, L0--L3 are most naturally read as diagnostics of generated artifacts, L4 is often the first genuinely interventional test, and L5--L7 provide the most direct evidence of decision usefulness. Based on this diagnosis, we propose a decision-making-centric evaluation framework and a benchmark protocol that foreground counterfactual action fidelity, closed-loop rollout validity, reward/value prediction, policy-ranking agreement, optimization lift, model exploitability, and uncertainty calibration.

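2606.15029 2026-06-16 cs.AI New submission

Metric Match: A Subset Selection Approach to Evaluating LLM Judge Reliability

Alyssa Unell, Natalie Dullerud, Naomi Boneh, Meena Jagadeesan, Tatsu Hashimoto, Nigam Shah, Sanmi Koyejo

Institutions * Stanford University

AI summary Proposes Metric Match, which selects a small subset of samples for human annotation so that the subset matches the population reliability metric, enabling efficient estimation of LLM judge reliability; experiments on 15 datasets show an 18.7% decrease in average estimation error and a 32.5% reduction in annotation needs.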

Abstract

LLM judges are used to reduce the need for costly human labor in evaluating open-ended text generation. However, the reliability of these judges depends critically on their alignment with human raters -- a property that itself depends on costly human annotations. In this work, we develop a method (Metric Match) for estimating correlation-based reliability metrics of LLM judges from limited annotations. Metric Match selects a subset of samples for human annotation such that the subset matches the population reliability metric with respect to acquired synthetic labels. We empirically show that Metric Match achieves a win-rate of 0.838 against random subset selection across four different correlation metrics and 15 datasets, with an 18.7% decrease in average estimation error and a 32.5% reduction in annotation needs. We provide a cost model and highlight a medical case study where our method saves $1,041.67 compared to random selection for expert annotation. Further, we shift our task from reliability estimation to reliability classification of whether a given judge is above a deployment threshold, where Metric Match again outperforms random selection. All project code is publicly available, and we additionally provide an installable package for ease of use.

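2606.15028 2026-06-16 cs.RO New submission

An Autonomous Subgram SMA-Based Swimmer

Conor K. Trygstad, Francisco M. F. R. Gonçalves, Néstor O. Pérez-Arancibia

Institutions * Washington State University

AI summary Presents Swima, a 900-mg bioinspired swimmer driven by high-work-density actuators based on shape-memory alloy wires, with onboard power and computation; it swims autonomously for more than 18 min at up to 22.4 mm/s, turns at up to 14°/s, and tracks headings with an RMS error of about 6.5°, making it the first subgram microswimmer with onboard power, actuation, and computation.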

Comments Under review, 6 pages, 5 figures

Abstract

We present the Swima, a bioinspired 900-mg swimmer propelled by two 10-mg high-work-density (HWD) actuators driven by shape-memory alloy (SMA) wires. We integrated onboard power and computation by using a custom-built printed circuit board (PCB) and an 11-mAh 3.7-V 507-mg single-cell lithium-ion (Li-Ion) battery, which in conjunction enable autonomous swimming in excess of 18 min. The Swima can swim at speeds of up to 22.4 mm/s (0.56 Bl/s), achieves turning rates of up to 14°/s, and can follow 0-degree heading reference trajectories with root mean square (RMS) values of tracking errors of about 6.5° across multiple tests. This robot is the first subgram microswimmer with onboard power, actuation, and computation developed to date.
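A few quantities implied by the figures in the abstract can be derived with simple arithmetic. The sketch below assumes that "Bl/s" means body lengths per second and that the full 11-mAh capacity is usable over the run; the second assumption makes the current figure an upper bound, since the reported runtime is "in excess of" 18 min.

```python
speed_mm_s = 22.4   # top speed in mm/s, from the abstract
speed_bl_s = 0.56   # the same speed in body lengths per second
body_length_mm = speed_mm_s / speed_bl_s

capacity_mah = 11.0       # battery capacity
runtime_h = 18.0 / 60.0   # 18 min expressed in hours
avg_current_ma = capacity_mah / runtime_h  # upper bound on mean current draw

battery_mass_mg = 507.0
robot_mass_mg = 900.0
battery_mass_share = battery_mass_mg / robot_mass_mg

print(round(body_length_mm, 1))      # 40.0 (mm)
print(round(avg_current_ma, 1))      # 36.7 (mA)
print(round(battery_mass_share, 3))  # 0.563
```

Under these assumptions the battery accounts for a little over half of the robot's 900-mg mass.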

2606.15026 2026-06-16 cs.CL New submission

Deep Temporal Modeling and Ensemble Fusion for Multimodal Emotion Recognition from Physiological Signals

Desta Haileselassie Hagos, Saurav Keshari Aryal, Patrick Ymele-Leki, Anietie Andy, Legand L. Burge

Institutions * Howard University

AI summary Evaluates LSTM, TCN, and Transformer models for multimodal emotion recognition on the WESAD dataset, using ablation experiments, sensor-level early fusion, and a late-fusion ensemble strategy; the ensemble reaches 98.91% accuracy and a 98.56% macro-F1 score.

Comments Accepted for publication in the 17th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics (ACM BCB 2026). DOI: https://doi.org/10.1145/3807503.3819363

Abstract

Physiological stress and emotion recognition are important for health monitoring and affective computing. In this work, we present a comprehensive evaluation of deep learning models such as Long Short-Term Memory (LSTM), Temporal Convolutional Networks (TCN), and Transformer on the WESAD dataset for multimodal affect recognition using wrist and chest sensor signals. We perform ablation studies to assess the individual contributions of each modality by training models on wrist-only and chest-only inputs. In addition, we implement a late-fusion ensemble strategy that combines predictions from all three architectures trained on multimodal input. We also employ early fusion at the sensor level by concatenating wrist and chest signals before feeding them into each model. Our results show that Transformer models consistently achieve the highest accuracy in multimodal settings, while TCN models perform best in the wrist-only configuration. The ensemble method yields the highest overall accuracy (98.91 +/- 0.13%) and macro-F1 score (98.56 +/- 0.17%). These findings demonstrate the effectiveness of sensor fusion and ensemble-based fusion in developing robust systems for physiological emotion recognition.
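The late-fusion ensemble described above combines predictions from the three architectures. The abstract does not say how the predictions are combined, so the sketch below assumes the simplest common choice, an unweighted average of per-class probabilities; the probability values are invented for illustration.

```python
def late_fusion(prob_lists):
    """Average per-class probabilities from several models.

    Returns (predicted_class_index, averaged_probabilities).
    Unweighted averaging is an assumption, not the paper's stated rule.
    """
    n_models = len(prob_lists)
    n_classes = len(prob_lists[0])
    avg = [sum(p[c] for p in prob_lists) / n_models for c in range(n_classes)]
    predicted = max(range(n_classes), key=avg.__getitem__)
    return predicted, avg


# Hypothetical softmax outputs for one window over three affect classes.
lstm = [0.50, 0.30, 0.20]
tcn = [0.20, 0.60, 0.20]
transformer = [0.30, 0.50, 0.20]

label, probs = late_fusion([lstm, tcn, transformer])
print(label)  # 1
```

Here the LSTM alone would pick class 0, but the averaged distribution favours class 1, which is the kind of disagreement an ensemble is meant to resolve.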

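2606.15021 2026-06-16 cs.RO New submission

Steering Autoregressive Vision-Language-Action Policies via Action Token Intervention

Jason Chan, Jonathan C. Kao

Institutions * University of California, Los Angeles

AI summary Proposes Token Steering, which dynamically steers the trajectories generated by a VLA model at inference time by intervening in the action-token space, requires no training or finetuning, and substantially improves success rates on household manipulation tasks.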

Comments 9 pages, 5 figures

Abstract

We present Token Steering (TS), a method for dynamically steering trajectories generated by an autoregressive vision-language-action (VLA) model through direct intervention in the action-token space. TS injects low-dimensional user inputs into the model's native action-token representation, allowing users to influence trajectory generation without modifying the underlying vision-language model (VLM) architecture. Because TS operates entirely at inference time, it requires no additional training or finetuning. User inputs guide rather than override the pretrained policy, allowing users to influence robot actions while preserving the dexterity, smoothness, and task priors learned by the VLA. We evaluate TS on two household manipulation tasks -- drawer closing after object placement and state-aware object swapping -- and improve success rates from 10.0% to 72.5% and from 16.7% to 93.8%, respectively. By enabling lightweight, intuitive steering over robot foundation models, our interface has the potential to improve human-robot interaction in consumer environments and broaden accessibility for individuals with limited physical control. Project website: https://jasontchan.github.io/token-steering/ .

2606.15019 2026-06-16 cs.CV New submission

Towards Global AI-Driven Cervical Cancer Screening

Thuy Nuong Tran, Ömer Sümer, Evangelia Christodoulou, Lennart Nauschütte, Simon Kalteis, Martin Paulikat, Esmira Pashayeva, Klara Steinheuer, Isabella Borges, Piotr Kalinowski, Hermann Bussmann, Sieng Sokmney, Poeung Kuong, Sathiarany Vong, Achim Schneider, Magnus von Knebel-Doeberitz, Patrick Godau, Lena Maier-Hein

Affiliations: German Cancer Research Center (DKFZ); Heidelberg University; National Center for Tumor Diseases (NCT) Heidelberg; Helmholtz Association; University of Heidelberg; Medical Faculty Heidelberg; University Hospital Heidelberg; German Consortium for Translational Cancer Research (DKTK); National Center for Tumor Diseases (NCT) Dresden; University Hospital Carl Gustav Carus Dresden; Technische Universität Dresden; Helmholtz-Zentrum Dresden-Rossendorf (HZDR); University of Bonn; University Hospital Bonn; University of Cologne; University Hospital Cologne; University of Duisburg-Essen; University Hospital Essen; University of Freiburg; University Hospital Freiburg; University of Göttingen; University Hospital Göttingen; University of Hamburg; University Hospital Hamburg-Eppendorf; University of Jena; University Hospital Jena; University of Kiel; University Hospital Schleswig-Holstein; University of Leipzig; University Hospital Leipzig; University of Lübeck; University Hospital Lübeck; University of Magdeburg; University Hospital Magdeburg; University of Mainz; University Hospital Mainz; University of Marburg; University Hospital Marburg; University of Munich (LMU); University Hospital Munich (LMU); Technical University of Munich (TUM); University Hospital rechts der Isar (TUM); University of Münster; University Hospital Münster; University of Regensburg; University Hospital Regensburg; University of Rostock; University Hospital Rostock; University of Tübingen; University Hospital Tübingen; University of Ulm; University Hospital Ulm; University of Würzburg; University Hospital Würzburg

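AI summary: Presents the first deep-learning-based cervical cancer screening approach validated on data from multiple countries; it uses multi-task learning for joint image classification and lesion segmentation, outperforms physicians in internal validation, and shows country-dependent performance in external validation.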

Comments 20 pages, 9 figures

English abstract

The global elimination of cervical cancer is a key public health goal set by the World Health Organization (WHO), with screening programs reducing mortality by up to 80%. However, access to experts and biopsy services is limited in low- to middle-income countries (LMICs). Deep learning (DL)-based algorithms offer promising support for screening, but most existing approaches have been developed and validated on private datasets from single countries. We present the first DL-based approach to cervical cancer screening validated on data from multiple countries. Technically, we phrase the problem of detecting and classifying lesions in colposcopy images as a multi-task learning problem, in which we simultaneously perform image-level classification and lesion segmentation. Our model was trained on a private dataset of acid stain colposcopy images with manually generated lesion segmentation masks and corresponding histopathological results, employing extensive data augmentation to address image variability. In an in-distribution validation with pathology results serving as ground truth, our algorithm outperformed medical experts (Balanced Accuracy: 0.68 vs 0.64) in CIN1- (Cervical intraepithelial neoplasia grade 1 or lower) versus CIN2+ (grade 2 or higher) classification. External validation on four colposcopy datasets from four countries featuring radical differences in prevalence and patient characteristics yielded superior performance of our method compared to baseline methods. Performance variability across countries was high, with AUC values ranging from 0.54 to 0.80. Overall, algorithm performance varied with age, transformation zone (cervical area most prone to lesion development), presence of comorbidities and pathognomonic signs, with comorbidities having by far the largest negative effect. Future work should focus on improving model robustness and generalizability.
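For readers unfamiliar with the balanced accuracy metric quoted in this abstract, the following sketch shows the standard binary definition: the mean of sensitivity and specificity. The confusion-matrix counts are hypothetical and are not taken from the paper; they only illustrate why the metric is preferred over plain accuracy when positives (here, CIN2+) are rare.

```python
def balanced_accuracy(tp: int, fn: int, tn: int, fp: int) -> float:
    """Mean of sensitivity (recall on positives) and specificity (recall on negatives)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2


def accuracy(tp: int, fn: int, tn: int, fp: int) -> float:
    """Fraction of all cases classified correctly."""
    return (tp + tn) / (tp + fn + tn + fp)


# Hypothetical screening cohort: 10 positives, 90 negatives (assumed values).
counts = dict(tp=2, fn=8, tn=88, fp=2)

plain = accuracy(**counts)              # dominated by the majority class
balanced = balanced_accuracy(**counts)  # penalizes the missed positives
```

With these assumed counts, plain accuracy looks high because most cases are negative, while balanced accuracy stays close to chance level because only 2 of 10 positives are detected.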

2606.15017 2026-06-16 cs.CL New submission

Are Online Skill and Memory Modules Always Worth Their Tokens? A Budget-Constrained Study of Web Agents

Sina Hajimiri, Masih Aminbeidokhti, Jose Dolz, Ismail Ben Ayed, Issam H. Laradji, Spandana Gella, Nicolas Gontier

Affiliations: ServiceNow AI Research; ÉTS Montreal; University of British Columbia; McGill University

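AI summary: Under a fixed inference budget, compares three online augmentation modules against a token-matched baseline and finds that the baseline matches or surpasses every augmentation method in aggregate success rate, often with fewer tokens.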

English abstract

Online web agents often augment a base actor with memory, workflow, or skill modules. These modules can improve performance, but they also consume test-time tokens, a cost rarely reported alongside the actor's inference cost. We study online augmentation, where this overhead is paid on every task, and re-evaluate its benefits under a fixed total inference budget. We compare AWM, ASI, and ReasoningBank with a token-matched vanilla baseline that uses the same budget for additional actor steps. Across three WebArena domains and three models, Gemini 3 Flash, GPT-5.4-mini, and Qwen 3.6-27B, the vanilla baseline matches or surpasses all three augmentation methods in aggregate success rate while often using fewer total tokens. We observe a similar trend on WorkArena-L1 with Qwen 3.6-27B, indicating that the effect extends to enterprise knowledge-work tasks. Our results suggest that skills and workflow memory can be useful in specific domains, but their apparent gains often vanish against a budget-matched actor. We further show that run-to-run variance materially affects outcomes and should be reported as a core evaluation criterion for online web agents.
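One way to read the token-matched comparison in this abstract is sketched below. The abstract does not describe the exact matching procedure, so this is an assumption: the tokens an augmentation module would spend per task are converted into additional actor steps for the vanilla baseline. The function name and all numbers are hypothetical.

```python
def matched_step_limit(base_steps: int, tokens_per_step: int,
                       module_overhead_tokens: int) -> int:
    """Step limit for a vanilla actor that reuses the module's token overhead."""
    if tokens_per_step <= 0:
        raise ValueError("tokens_per_step must be positive")
    # Whole extra steps that fit inside the overhead the module would have used.
    extra_steps = module_overhead_tokens // tokens_per_step
    return base_steps + extra_steps


# Hypothetical per-task figures (assumed, not reported in the abstract).
augmented_steps = 20        # step limit of the augmented agent
tokens_per_step = 3_000     # average actor tokens per step
module_overhead = 12_000    # tokens spent by the memory or skill module

vanilla_steps = matched_step_limit(augmented_steps, tokens_per_step, module_overhead)
```

Under these assumed figures, both agents consume the same total of 72,000 tokens per task, so any remaining difference in success rate can be attributed to how the budget is spent rather than to its size.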

2606.15010 2026-06-16 cs.RO New submission

LV-Calib: LiDAR-Camera Extrinsic Calibration with Boundary-Response Modeling

Sheng Hong

Affiliations: Pen-Tung Sah Institute of Micro-Nano Science and Technology, Xiamen University

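AI summary: Proposes LV-Calib, a framework that uses a printable planar target with visual fiducials and circular reflectivity boundaries, refines LiDAR feature points under intensity and geometric constraints, and performs LiDAR-camera extrinsic calibration and boundary-response calibration, reaching sub-pixel reprojection accuracy and millimeter-level feature consistency.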

Comments 8 pages, 6 figures, 3 tables

English abstract

We present LV-Calib, a calibration framework for LiDAR-camera extrinsic estimation and LiDAR boundary-response calibration using a printable planar target. The target serves as a shared observation carrier: visual fiducials provide indexed image measurements, while circular reflectivity boundaries provide LiDAR-observable structural feature points. Instead of directly fitting boundary points as ideal geometric contours, LV-Calib automatically crops background points, estimates the target plane, and iteratively refines accurate LiDAR-side 3-D feature points from intensity and geometric constraints. The refinement explicitly handles the broadened and distorted transition band induced by finite beam footprint and mixed-intensity returns around black-white reflectivity discontinuities. Given these refined LiDAR features, we formulate a weighted reprojection-consistent extrinsic optimization with LiDAR feature alignment, where image observations are kept in the reprojection domain and LiDAR feature residuals are weighted by refinement confidence. Finally, using the estimated extrinsic and the extracted transition band, LV-Calib calibrates the LiDAR boundary response by estimating pitch-yaw-range residual statistics of boundary-overlap samples. Experiments on printed-board calibration data demonstrate sub-pixel reprojection accuracy, millimeter-level LiDAR feature consistency, and improved odometry performance. Code and calibration data will be released for reproducible evaluation.
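The idea of weighting LiDAR feature residuals by confidence in a reprojection objective can be sketched generically as follows. This is not the paper's objective, which the abstract does not spell out; it is a textbook pinhole reprojection cost with per-point weights, and every name, intrinsic value, and data point is a hypothetical assumption.

```python
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def transform(R: Sequence[Sequence[float]], t: Vec3, p: Vec3) -> Vec3:
    """Apply the extrinsic (rotation R, translation t) to a LiDAR point."""
    return tuple(sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3))


def project(p_cam: Vec3, fx: float, fy: float, cx: float, cy: float) -> Tuple[float, float]:
    """Pinhole projection of a camera-frame point to pixel coordinates."""
    x, y, z = p_cam
    return fx * x / z + cx, fy * y / z + cy


def weighted_reprojection_cost(points: List[Vec3], pixels: List[Tuple[float, float]],
                               weights: List[float], R, t: Vec3,
                               fx: float, fy: float, cx: float, cy: float) -> float:
    """Sum of confidence-weighted squared pixel errors over all correspondences."""
    cost = 0.0
    for p, (u_obs, v_obs), w in zip(points, pixels, weights):
        u, v = project(transform(R, t, p), fx, fy, cx, cy)
        cost += w * ((u - u_obs) ** 2 + (v - v_obs) ** 2)
    return cost


# Hypothetical setup: identity extrinsic and one point on the optical axis.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
cost = weighted_reprojection_cost(
    points=[(0.0, 0.0, 2.0)], pixels=[(323.0, 244.0)], weights=[0.5],
    R=identity, t=(0.0, 0.0, 0.0), fx=500.0, fy=500.0, cx=320.0, cy=240.0)
```

An optimizer would adjust `R` and `t` to minimize this cost; low-confidence features contribute less, so poorly refined boundary points pull the solution less strongly.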

2606.15007 2026-06-16 cs.CL cs.AI cs.LG New submission

Nemotron 3 Ultra: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning

NVIDIA: Aaron Blakeman, Aaron Thomas, Aastha Jhunjhunwala, Abhibha Gupta, Abhinav Khattar, Adam Rajfer, Adi Renduchintala, Adil Asif, Aditya Vavre, Adriana Flores Miranda, Ahmad Bilal, Aileen Zaman, Ajay Hotchandani, Akanksha Shukla, Akhiad Bercovich, Aleksander Ficek, Alex Gronskiy, Alex Kondratenko, Alex Steiner, Alex Ye, Alexander Bukharin, Alexandre Milesi, Ali Taghibakhshi, Alice Gatti, Alisa Liu, Alok Kumar, Amar Phanishayee, Ameya Sunil Mahabaleshwarkar, Amir Klein, Amit Zuker, Amnon Geifman, Anahita Bhiwandiwalla, Ananth Subramaniam, Andrea Santilli, Andrew Fulks, Andrew McHarg, Andrew Tao, Andrii Skliar, Anjulie Agrusa, Ankur Srivastava, Ankur Verma, Anna Shors, Anna Warno, Antoni-Joan Solergibert I Llaquet, Arham Mehta, Arkadiusz Nowaczynski, Arti Jain, Ashwath Aithal, Ashwin Poojary, Asif Ahamed, Asit Mishra, Asma Kuriparambil Thekkumpate, Atefeh Sohrabizadeh, Avinash Kaur, Avinash Vem, Ayush Dattagupta, Barath Subramaniam Anandan, Bardiya Sadeghi, Ben Lanir, Benedikt Schifferer, Besmira Nushi, Bilal Kartal, Bill Thiede, Bita Darvish Rouhani, Bo Deng, Bob Schatz, Boris Ginsburg, Boxin Wang, Brad Nemire, Brandon Norick, Brian Dang, Brian Westphal, Brian Yu, Brucek Khailany, Bryan Catanzaro, Carlo del Mundo, Caryln Aarish, Chankyu Lee, Chantal Hwang, Charbel Sakr, Charles Wang, Charlie Truong, Chen Cui, Cheng Cheng, Cheng-Ping Hsieh, Chenghao Zhang, Chenhui Deng, Chintan Patel, Chris Alexiuk, Christian Cosgrove, Christian Munley, Christine Harvey, Christopher Parisien, Chunyang Shen, Coco Li, Collin Neale, Cynthia Gao, Cyril Meurillon, Dan Gil, Dan Su, Dan Zhao, Dane Corneil, Daniel Afrimi, Daniel Egert, Daniel Korzekwa, Daniel Lo, Daniel Machlab, Daniel Serebrenik, Daniil Sorokin, Daria Gitman, Daria Levy, Darko Stosic, David Mosallanezhad, David Yu, Davit Karamyan, Deena Donia, Deep Debroy, Deepak Narayanan, Devin O'Kelly, Dheeraj Peri, Dhruv Nathawani, Di, Wu, Dima Rekesh, Divyanshu Kakwani, Donald Plummer, Dong Anh, Dongfeng Yu, Dongfu Jiang, Donnie 
Kim, Dorrin Poorkay, Duncan Riach, Dusan Stosic, Dustin VanStee, Eavan Meng, Edgar Minasyan, Edward Lin, Eileen Margaret Peters Long, Elad Sarafin, Elad Segal, Elena Lantz, Ellie Evans, Elliott Ning, Eric Chung, Eric Harper, Eric Pham-Hung, Eric Tramel, Eric Yang, Erick Galinkin, Erik Pounds, Erika Goncalves Goncalves, Evan Briones, Evan Wu, Evelina Bakhturina, Evgeny Tsykunov, Ewa Dobrowolska, Faisal Ladhak, Farzan Memarian, Fay Wang, Fei Jia, Felipe Soares, Felipe Vieira Frujeri, Feng Chen, Fengguang Lin, Ferenc Galko, Frank Sun, Frankie Siino, Frida Hou, Gal Hubara Agam, Gal Kaplun, Gantavya Bhatt, Gargi Prasad, Garvit Kulshreshtha, George Armstrong, Gerald Shen, Giulio Borghesi, Gordana Neskovic, Gorkem Batmaz, Grace Lam, Greg Mason, Greg Pauloski, Grigor Nalbandyan, Grzegorz Chlebus, Grzegorz Karch, Guan-Ting Liu, Guoming Zhang, Guyue Huang, Haggai Maron, Haifeng Qian, Haim Elisha, Haoxing Ren, Haran Kumar Shiv Kumar, Haribhau Hud, Harris Nover, Harrison Saturley Hall, Hayate Iso, Helen Ngo, Herbert Hum, Herman Sahota, Hexin Wang, Himanshu Soni, Hovhannes Tamoyan, Hua Li, Huanhuan Chen, Hui Li, Hui Wang, Huy Nguyen, Ian Chiles, Ido Galil, Ido Shahaf, Igor Gitman, Igor Shovkun, Ilya Loshchilov, Ingo Guehring, Itamar Schen, Itay Levy, Itay Neeman, Ivan Moshkov, Izik Golan, Izzy Putterman, Jaemin Choi, Jakub Slowikowski, Jan Kautz, Jane Polak Scowcroft, Jared Casper, Jatin Mitra, Jeffrey Glick, Jenny Chen, Jesse Oliver, Jiacheng Xu, Jiafan Zhu, Jialin Song, Jian Zhang, Jiantao Jiao, Jiaqi Zeng, Jie Lou, Jim King, Jimmy Zhang, Jingquan Wang, Jinhang Choi, Jinju Chu, Joey Conway, Joey Guman, Johan Jatko, Johannes Rausch, John Kamalu, John Roberts, Johnny Greco, Johnny Mensel, Jonah Alben, Jonas Yang, Jonathan Cohen, Jonathan Raiman, Joseph Jennings, Joshua Mabry, Joshua Pierce, Joyjit Daw, Julien Veron Vialard, Junkeun Yi, Jupinder Parmar, Kajal Jain, Kan Zhu, Kari Briski, Katherine Cheung, Katherine Luna, Keith Willowhawk, Keith Wyss, Keshav Santhanam, Kevin Shih, 
Kezhi Kong, Khanh Nguyen, Khushi Bhardwaj, Kirthi Shankar Sivamani, Konstantinos Krommydas, Krishna C. Puvvada, Krzysztof Pawelec, Kumar Anik, Kyle Keprios, Kylie Day, Lawrence McAfee, Leo Du, Leon Derczynski, Li Ding, Linda Liu, Lingjie Wu, Lior Kadoch, Lizzie Wei, Luis Vega, Luke Robison, Lun Su, Maarten Van Segbroeck, Maciej Jakub Mikulski, Maer Rodrigues de Melo, Magda Sypula, Mahan Fathi, Makesh Narsimhan Sreedhar, Makesh Tarun Chandran, Manoj Kilaru, Maor Ashkenazi, Marc Cuevas, Marc Romeijn, Marcin Chochowski, Mark Cai, Mark Mozolewski, Markus Kliegl, Marta Stepniewska-Dziubinska, Martyna Patelka, Mattei Machczynski, Matvei Novikov, Mauricio Ferrato, Maximilian Golub, Mehrzad Samadi, Melissa Corpuz, Mengru Wang, Mengxi Wu, Meredith Price, Meriem Boubdir, Micah Schaffer, Michael Andersch, Michael Boone, Michael Gschwind, Michael Lightstone, Michael Loh, Michal Bien, Michal Zawalski, Michelle Gill, Miguel Martinez, Mikail Khona, Mike Chrzanowski, Mike Houston, Mingyuan Ma, Minseok Lee, Mohamed Fawzy, Mohammad Dabbah, Mohammad Shoeybi, Mostofa Patwary, Nabin Mulepati, Najeeb Nabwani, Namit Dhameja, Narimane Hennouni, Natalie Hereth, Nathaniel Pinckney, Nave Algarici, Nave Assaf, Netanel Haber, Nicholas Knight, Nick Reamaroon, Nickson Quak, Nidhi Bhatia, Nikhil Desai, Nikolai Ludwig, Nima Tajbakhsh, Ning Xu, Nir Ailon, Nirmal Juluru, Nitin Nitin, Ofri Masad, Oleg Rybakov, Oleksii Hrinchuk, Oleksii Kuchaiev, Olivia Viessmann, Olivier Delalleau, Oluwatobi Olabiyi, Omer Ullman Argov, Omri Puny, Oren Tropp, Pablo Ribalta, Pallab Bhattacharya, Panos Lampropoulos, Parth Mannan, Pasha Shamis, Patrick Legresley, Paul Gibbons, Pavlo Molchanov, Pawel Morkisz, Peter Dykas, Peter Jin, Pierre-Yves Aquilanti, Pinky Xu, Piotr Januszewski, Piotr Laskiewicz, Pooya Jannaty, Prakash Gurumurthy, Pranav Prashant Thombre, Prasoon Varshney, Pritam Gundecha, Przemek Tredak, Puhui Meng, Qiyu Wan, Rabeeh Karimi Mahabadi, Rachel Oberman, Rachit Garg, Radha Sri-Tharan, Rahul Kandu, Rakshit 
Sanadhya, Ran El-Yaniv, Ran Zilberstein, Rasoul Shafipour, Ray Macalisang, Rayen Tian, Reka Kovacs, Renjie Pi, Rick Izzo, Rima Shahbazyan, Rishabh Garg, Rishi Puri, Rita Fernandes Neves, Ritchie Zhao, Ritika Borkar, Ritu Gala, Riyad Islam, Robert Clark, Robert Hesse, Robert Kirby, Roger Waleffe, Rohit Watve, Roi Koren, Ron Banner, Ruoxi Zhang, Russell J. Hewett, Ryan Prenger, Ryan Stewart, Ryota Egashira, Sadegh Mahdavi, Saee Paliwal, Sagar Singh, Sahil Modi, Salika Dave, Samantha Shinagawa, Samuel Kriman, Sandip Bhaskar, Sangkug Lym, Sanjay Kariyappa, Sanjeev Satheesh, Saran Vikas Murari, Satish Pasumarthi, Saurabh Mishra, Saurav Muralidharan, Scott Hara, Sean Narentharen, Selvaraj Anandaraj, Seonjin Na, Seonmeyong Bak, Seonmyeong Bak, Sepehr Sameni, Seph Mard, Serge Panev, Seth Henneman, Seth Poulos, Shahar Mor, Shantanu Acharya, Shaona Ghosh, Sharath Turuvekere Sreenivas, Sharon Mendelson, Shaun Kotek, Shawn Wang, Shay Aharon, Shaya Gharghabi, Sheng-Chieh Lin, Shi Chen, Shiqing Fan, Shirish Baskaran, Shreya Gopa, Shrimai Prabhumoye, Shubham Pachori, Shubham Toshniwal, Shuoyang Ding, Shwetha Krishnamurthy, Siddharth Singh, Simeng Sun, Sirshak Das, Sivakumar Arayandi Thottakara, Smita Ithape, Somshubra Majumdar, Soumye Singhal, Sri Harsha Singudasu, Sridhar Bhuvanapalli, Srimukh Veccham, Stas Sergienko, Stefania Alborghetti, Stephen Ge, Su Rong, Sugam Dipak Devare, Sukrit Rao, Sumeet Kumar Barua, Sungsoo Ha, Sunny Gai, Suriya Gunasekar, Suseella Panguluri, Suyog Gupta, Sviataslau Hinzburh, Sweta Priyadarshi, Syeda Nahida Akter, Talor Abramovich, Tan Bui, Tanay Varshney, Tatevik Ter-Hovhannisyan, Teodor-Dumitru Ene, Terry Kong, Thanh Do, Tianhe Zhang, Tiffany Moore, Tijmen Blankevoort, Tim Moon, Tiyasa Mitra, Tom Balough, Tomasz Grzegorzek, Tomasz Hliwiak, Tomer Asida, Tomer Bar Natan, Tomer Keren, Tomer Ronen, Tony Salim, Tony Wang, Traian Rebedea, Tugrul Konuk, Twinkle Vashishth, Udi Karpas, Ushnish De, Vahid Noorozi, Venkat Srinivasan, Venmugil Elango, Vibhor 
Agrawal, Victor Cui, Vijay Korthikanti, Vikas Mehta, Vinay Rao, Virginia Wu, Vitaly Kurin, Vitaly Lavrukhin, Vladimir Anisimov, Vu Pham, Wanli Jiang, Wasi Uddin Ahmad, Wataru Ishihara, Wei Du, Wei Ping, Weiheng Chai, Wenliang Dai, Wesley Helmholz, Will Jennings, Will Zhu, Wojciech Prazuch, Xiaowei Ren, Xiwen Yu, Yan Breek, Yang Chen, Yang Yu, Yangyi Chen, Yaniv Galron, Yashaswi Karnati, Yejin Choi, Yev Meyer, Yi-Fu Wu, Yian Zhang, Ying Lin, Yonatan Geifman, Yonggan Fu, Youngeun Kwon, Yu Yao, Yugi Guvvla, Yuki Huang, Yunsheng Liu, Zach Moshe, Zachary Newell, Zhilin Wang, Zhiyu Li, Zhongbo Zhu, Zhuolin Yang, Zihan Liu, Zijie Yan, Zsolt-Alon Wertheimer

Affiliations: NVIDIA

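AI summary: Presents Nemotron 3 Ultra, a Mixture-of-Experts hybrid Mamba-Attention language model with 550B total and 55B active parameters; after pre-training on 20T tokens, context extension to 1M tokens, and post-training, it delivers roughly 6x higher inference throughput while matching the accuracy of leading models.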

English abstract

We introduce Nemotron 3 Ultra, a 550 billion total and 55 billion active parameter Mixture-of-Experts Hybrid Mamba-Attention language model. We pre-trained Nemotron 3 Ultra on 20 trillion text tokens, then extended the context length to 1M tokens, and post-trained using Supervised Fine Tuning (SFT), Reinforcement Learning (RL), and Multi-teacher On-Policy Distillation (MOPD). Nemotron 3 Ultra is our most capable model yet, employing multiple key technologies - LatentMoE, Multi Token Prediction (MTP), NVFP4 pre-training, multi-environment RLVR, MOPD, and reasoning budget control. Nemotron 3 Ultra achieves up to ~6x higher inference throughput as compared to state-of-the-art publicly available LLMs while attaining on-par accuracy. The state-of-the-art accuracy, high inference throughput, and 1M token context length make Nemotron 3 Ultra ideal for long-running autonomous agentic tasks. We open-source the base, post-trained, and quantized checkpoints, along with the training data and recipe on HuggingFace.