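2606.09623 2026-06-09 cs.LG New submission

Constrained user-item allocation for e-commerce marketing campaigns

Maja Lindström, Natalija Glisovic, Jan von Pichowski, Tommy Löfstedt, Martin Rosvall

Affiliations: Umeå University; KTH Royal Institute of Technology; University of Würzburg

AI summary: Proposes an automated targeting method that jointly selects users and items to build multiple non-overlapping marketing campaigns, combining constrained spectral biclustering, greedy local search, and a multi-armed bandit framework; it outperforms simulated annealing on synthetic data, Amazon reviews, and commercial data.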

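Shape Formation for Cooperative Transport of Arbitrary Objects via Multi-Agent Reinforcement Learning

Mohamed Sayed, Wolfram Burgard, Tanja Katharina Kaiser

Affiliations: University of Technology Nuremberg

AI summary: Proposes a multi-agent reinforcement learning method that lets a multi-robot system autonomously form formations supporting objects of arbitrary shape and non-uniform mass distribution while avoiding obstacles, enabling reliable and generalizable cooperative transport.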

Abstract

Cooperative object transportation is essential in numerous domains, ranging from industrial to domestic services. A popular transportation strategy is to carry objects on top of multi-robot systems. The corresponding task is typically solved by decomposing it into three interconnected subproblems: formation control, cooperative navigation, and collision avoidance. A particular challenge posed by real-world objects is their potentially arbitrary shape and non-uniform mass distribution, necessitating robot formations that securely support the object. In this work, we address the challenge of pattern formation control for transporting such real-world objects by proposing a novel multi-agent reinforcement learning approach. Our approach enables a multi-robot system to autonomously position itself underneath an object to support its weight while avoiding obstacles during the formation process. Our evaluations with diverse environments and varying numbers of robots show that our approach leads to policies that reliably produce balanced formations and generalize to cluttered scenes and objects with complex geometry and non-uniform mass distribution.

2606.09608 2026-06-09 cs.CV New submission

TUDSR: Twice Upsampling-Diffusion for Higher Super-Resolution

Zhiqiang Wu, Yitong Dong, Xian Wei

Affiliations: East China Normal University; Zhejiang University

AI summary: Proposes the TUDSR framework, which uses two-stage training (at R-resolution and NR-resolution) with a looped chunk-based strategy to achieve 1024² and 2048² image super-resolution on top of SD2.1, significantly outperforming existing methods.

Abstract

Diffusion-based generative models have achieved remarkable success in real-world image super-resolution (SR). With tiled diffusion techniques, these models can produce high-resolution images that exceed their native-supported resolution. However, the quality of such high-resolution (e.g., $2048^2$) outputs often remains extremely poor, primarily due to two factors: the image upsampling ratio (e.g., $\times 8$) exceeding the model's native-supported upsampling ratio (e.g., $\times 4$), and the model's native-supported resolution. In practice, training a native high-resolution model requires larger architectures, which incur significant computational overhead and GPU memory costs, making it impractical on resource-limited hardware. Thus, we present TUDSR, a Twice Upsampling-Diffusion framework for higher SR. The TUDSR framework mainly consists of two stages: the first involves training at $R$-resolution, and the second introduces a looped chunk-based training strategy at $NR$-resolution. Each stage adopts a one-step GAN architecture comprising a generator and a discriminator. Based on SD2.1-base, we develop TUDSR-S, which achieves state-of-the-art performance across multiple benchmarks. Extensive experiments further demonstrate that TUDSR-S generates high-quality images at resolutions of $1024^2$ and even $2048^2$, significantly outperforming existing approaches. Code is available at https://github.com/wuer5/TUDSR.

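2606.09607 2026-06-09 cs.LG cs.AI New submission

Closure-Validated Circuit Discovery in Attention Heads: Co-activation Proposes, Ablation Disposes

Yongzhong Xu

Affiliations: GitHub

AI summary: Proposes attention-head circuit hypotheses via co-activation clustering and validates closure with causal ablation; the approach works on dense models but fails on an MoE model, showing that co-activation yields circuit proposals rather than confirmations.

Comments: 22 pages, 3 figures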

Abstract

Interpretability increasingly treats groups of components, not individual units, as the basic object, and proposes to find them by clustering co-activation statistics. We ask whether such a cheap signal actually identifies an attention-head circuit. Adapting a sparse-autoencoder clustering recipe to attention heads -- but validating by causal ablation rather than reconstruction -- we cluster heads and then run a closure test: ablate the discovered community and compare per-example damage to matched-random controls. Across two dense 1B-scale models (Pythia 1B, OLMo 1B) and two input distributions, the communities pass closure. In a Mixture-of-Experts model (OLMoE-1B-7B), route-conditional clustering recovers a statistically real signal that nonetheless does not survive closure -- ablation improves loss, the wrong direction. Extending closure across training, attention-target selectivity and participation ratio decouple from function in both directions. We conclude that a cheap signal is a circuit proposal, not a confirmed circuit; closure is what separates them.
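
The closure test described in the abstract (ablate the discovered community, then compare per-example damage to matched-random controls) can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the function name `closure_test`, the inputs (precomputed per-example loss increases), and the pass criterion are hypothetical and are not taken from the paper.

```python
from statistics import mean

def closure_test(community_damage, control_damages):
    """Compare ablation damage of a candidate community against random controls.

    community_damage: per-example loss increase when the community is ablated.
    control_damages:  list of per-example loss-increase lists, one list per
                      matched-random control set (same examples, same order).
    The pass rule below is a hypothetical choice, not the paper's criterion.
    """
    n = len(community_damage)
    # Average control damage on each example.
    avg_control = [mean(c[i] for c in control_damages) for i in range(n)]
    # Per-example excess damage of the community over the average control.
    excess = [community_damage[i] - avg_control[i] for i in range(n)]
    # Fraction of control sets whose mean damage the community exceeds.
    community_mean = mean(community_damage)
    beaten = sum(community_mean > mean(c) for c in control_damages)
    frac_beaten = beaten / len(control_damages)
    return {
        "mean_excess": mean(excess),
        "frac_controls_beaten": frac_beaten,
        "passes": mean(excess) > 0 and frac_beaten == 1.0,
    }

# Dense-model-like case: ablating the community hurts more than controls.
dense = closure_test([0.5, 0.4, 0.6], [[0.1, 0.0, 0.2], [0.0, 0.1, 0.1]])
# MoE-like case: ablation lowers the loss (negative damage), so closure fails.
moe = closure_test([-0.2, -0.1], [[0.1, 0.0]])
```

In this sketch a negative damage value means the loss improved under ablation, which is the "wrong direction" outcome the abstract reports for OLMoE-1B-7B.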

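2606.09605 2026-06-09 cs.AI New submission

Next-Token Prediction Learns Generalisable Representations of Sleep Physiology

Jonathan F. Carter, Lionel Tarassenko

Affiliations: Institute of Biomedical Engineering, University of Oxford

AI summary: Proposes Hypnos, a model that learns generalizable representations from multi-modal physiological signals with a next-token prediction objective and significantly outperforms existing foundation models on tasks such as sleep stage classification and atrial fibrillation detection.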

Abstract

Foundation models offer a promising route to compress multi-modal physiological signals into compact representations of human health, with broad applications across sleep medicine, cardiology, neurology and other healthcare domains. Existing models have typically been trained with masked-reconstruction or contrastive objectives. However, masked reconstruction may be poorly suited to the stochastic nature of these signals, while contrastive approaches rely on positive-pair definitions despite the semantic invariances of physiological signals being poorly understood. In this work, we show that next-token prediction is a simple and scalable alternative. We develop Hypnos, a multi-modal sleep foundation model trained using eight different sensing modalities (e.g. EEG, ECG, respiratory signals) drawn from over 20,000 overnight polysomnography recordings. We tokenize each modality into streams of discrete tokens using residual vector quantization, then train a large auto-regressive RQ-Transformer to jointly predict the next token across all modalities in parallel. After training, Hypnos can be applied to continuous streams of sensor data from any subset of supported modalities, generating embeddings for downstream tasks. Across a range of benchmarks, Hypnos significantly outperforms existing foundation models. In sleep stage classification, we match the performance of strong supervised baselines on held-out test sets whilst using $100\times$ less labelled data. Hypnos even generalises to daytime physiology, surpassing a dedicated ECG foundation model at detecting atrial fibrillation. Our results demonstrate that next-token prediction is a strong self-supervised objective for representation learning from multi-modal physiological signals.
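
The abstract says each modality is tokenized with residual vector quantization. As a rough illustration of that general technique (not of the Hypnos tokenizer itself), the sketch below quantizes a vector with a stack of codebooks, where each stage encodes what the previous stages left over. The toy codebooks and the names `rvq_encode` and `rvq_decode` are assumptions made for this example.

```python
def nearest(codebook, x):
    """Index of the codeword closest to x in squared Euclidean distance."""
    return min(
        range(len(codebook)),
        key=lambda k: sum((c - v) ** 2 for c, v in zip(codebook[k], x)),
    )

def rvq_encode(x, codebooks):
    """Return one token per codebook plus the final unexplained residual."""
    residual = list(x)
    tokens = []
    for codebook in codebooks:
        k = nearest(codebook, residual)
        tokens.append(k)
        # Subtract the chosen codeword; the next stage quantizes what is left.
        residual = [r - c for r, c in zip(residual, codebook[k])]
    return tokens, residual

def rvq_decode(tokens, codebooks):
    """Reconstruct by summing the selected codeword from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for k, codebook in zip(tokens, codebooks):
        out = [o + c for o, c in zip(out, codebook[k])]
    return out

# Toy two-stage quantizer: a coarse codebook followed by a fine one.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 0.0], [0.25, -0.25]],
]
tokens, residual = rvq_encode([1.2, 0.8], codebooks)
approx = rvq_decode(tokens, codebooks)
```

Each input frame therefore becomes a short tuple of discrete tokens (one per stage), which is the kind of token stream an autoregressive model can then be trained to predict.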

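2606.09603 2026-06-09 cs.CL New submission

Automated IEP Generation from Traditional Chinese Parent-Teacher Interviews via Corpus-Grounded Feature Diffusion

Kuanlin Chen, Cheng-En Ou

Affiliations: National University of Singapore; University of California, Berkeley

AI summary: To address data scarcity and privacy constraints in Traditional Chinese Individualized Education Program (IEP) generation, proposes a low-resource fine-tuning pipeline based on Corpus-Grounded Feature Diffusion (CGFD) that produces high-quality samples through seed selection, feature diffusion, and Grammar-Constrained Decoding (GCD); finds that GCD is counterproductive for Traditional Chinese, with the no-GCD path better on both reliability and speed.

Comments: 12 pages, 5 figures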

Abstract

Writing Individualized Education Programs (IEPs) is a high-labor, knowledge-intensive document burden; English-language research has demonstrated that generative AI can significantly reduce drafting time, yet automated IEP generation in Traditional Chinese remains virtually unexplored due to domain data scarcity, strict privacy regulations, and the absence of local evaluation benchmarks. We propose a low-resource fine-tuning pipeline centered on Corpus-Grounded Feature Diffusion (CGFD): (1) 25 dual-expert high-score seed transcripts are selected via a tau threshold with flag-aware score caps; (2) a FeatureProfile (sentence length, structure, quantification templates) is extracted from seeds and injected into LLM prompts alongside Verbalized-Sampling-style diversity control to drive diffusion; (3) 15 expert gold seeds are used as diffusion anchors, targeting 585 samples; 567 valid diffusion samples are obtained, yielding a 582-sample training set used to fine-tune Breeze-7B with QLoRA; (4) schema-constrained inference via Grammar-Constrained Decoding (GCD) enforces a hierarchical SMART Goal Ladder schema at inference time. Ablation results on a 55-sample schema stress set reveal an unexpected finding: GCD is counterproductive under Traditional Chinese token budgets -- the no-GCD path achieves 100% schema pass rate at 34% lower median latency, outperforming GCD on both reliability and speed. On the n=10 formal hold-out, the no-GCD inference path achieves BERTScore F1 = 0.779, exceeding GPT-5.4 (0.726), DeepSeek-V3.2 (0.703), Gemini-3-Flash-Preview (0.703), and Llama-4-Maverick (0.700) zero-shot baselines while maintaining fully local, air-gapped inference. This system addresses a gap in Traditional Chinese special-education NLP and offers a scalable, privacy-preserving local inference solution under an industrial engineering paradigm.

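2606.09590 2026-06-09 cs.CL cs.CR New submission

Clinically Grounded Privacy Evaluation of Medical LMs

Sasha Ronaghi, Sana Tonekaboni, Lena Stempfle, Vivian Utti, Jordan Li Cahoon, Nathaniel Hendrix, Ayin Vala, Marzyeh Ghassemi, Emily Alsentzer

Affiliations: Stanford University; Massachusetts Institute of Technology; American Board of Family Medicine

AI summary: Proposes a clinically grounded framework that evaluates privacy leakage from medical language models across tiers of adversarial access; finds that routine metadata can elicit high rates of verbatim memorization and sensitive-diagnosis recovery, though part of the memorization stems from templated documentation.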

Abstract

Medical language models (LMs) can memorize and reproduce protected health information, but privacy evaluations often focus on recovery of training text rather than disclosure under realistic threat models. We introduce a clinically grounded framework that evaluates leakage along a graded axis of adversarial access, ranging from publicly inferable demographics to leaked note fragments. At each tier, we measure verbatim memorization of patient-specific text and semantic leakage of sensitive diagnoses. Applying the framework to an LM pretrained on 378k clinical notes, we find that routine encounter metadata (i.e. name, date of birth, provider, practice, visit date) elicits high rates of verbatim memorization across a patient's timeline and sensitive-diagnosis recovery (AUROC 0.91 for abortion, 0.81 for HIV). At the same time, exact-match memorization can overstate disclosure: 36% of memorized tokens reflect templated documentation. Our work highlights the risks of training on longitudinal clinical data, providing a practical framework for contextual privacy evaluation of medical LMs.

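2606.09585 2026-06-09 cs.AI New submission

Optical Reasoning: Rethinking Images as an Expressive Reasoning Medium Beyond Text

Yutong Bian, Dongjie Cheng, Heming Xia, Yongqi Li, Wenjie Li

Affiliations: The Hong Kong Polytechnic University

AI summary: Proposes optical reasoning, which treats images as a standalone reasoning medium and is realized in typographic and graphical variants; it matches or exceeds text reasoning on language and multimodal tasks while using fewer reasoning tokens.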

Abstract

Chain-of-Thought (CoT) improves the performance of Large Language Models (LLMs) and has been extended to Multimodal Large Language Models (MLLMs). More recent work further moves from text-based multimodal reasoning toward interleaved-modal reasoning, where intermediate steps can incorporate both textual rationales and visual evidence. In this work, we propose a bolder and more ambitious idea: could images alone serve as the reasoning medium for both language and multimodal tasks? To explore this, we propose optical reasoning, which treats images as a standalone reasoning medium. We instantiate this concept with two variants: typographic-based optical reasoning, which optimizes visual layouts for compact rationale rendering, and graphical-based optical reasoning, which composes text and graphical elements into structured visual rationales. Across mathematical, scientific, and interleaved-modal reasoning benchmarks, optical reasoning can match or even exceed traditional text reasoning while reducing reasoning tokens by an average of 28.57% on language tasks and 16% on multimodal tasks, achieving 1.96 times the token efficiency of text reasoning. These results show that images can effectively and efficiently encode rationales while providing a unified visual canvas for reasoning.

2606.09582 2026-06-09 cs.LG stat.ML New submission

On Choosing the $\mu$ Parameter in Gaussian Differential Privacy

Bogdan Kulynych, Antti Honkela

Affiliations: Lausanne University Hospital; University of Helsinki

AI summary: Provides a principled mapping from pure-DP ε to GDP μ by matching the worst-case success of membership inference attacks by a strong adversary, and recommends μ ≈ ε/5 as a conservative general-purpose conversion.
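
As a small worked example of the rule of thumb in the summary, the sketch below applies the μ ≈ ε/5 conversion and then, as a separate sanity check, evaluates the standard conversion from μ-GDP to (ε, δ)-DP from the Gaussian differential privacy literature, δ(ε) = Φ(−ε/μ + μ/2) − e^ε Φ(−ε/μ − μ/2). The second function goes in a different direction from the paper's mapping and is included only for orientation; the function names are assumptions made for this example.

```python
import math

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def mu_from_eps(eps):
    """Rule of thumb quoted in the summary: mu is roughly eps / 5."""
    return eps / 5.0

def gdp_delta(mu, eps):
    """delta such that a mu-GDP mechanism is (eps, delta)-DP (standard GDP result)."""
    return phi(-eps / mu + mu / 2.0) - math.exp(eps) * phi(-eps / mu - mu / 2.0)

mu = mu_from_eps(5.0)             # pure-DP eps = 5 maps to mu = 1 under the rule of thumb
delta_at_0 = gdp_delta(mu, 0.0)   # delta when eps = 0, i.e. 2 * phi(mu / 2) - 1
delta_at_1 = gdp_delta(mu, 1.0)   # delta shrinks as eps grows
```

The factor of 5 is taken from the summary as stated; the conditions under which it is conservative are given in the paper, not here.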

2606.09578 2026-06-09 cs.AI cs.CL cs.IR New submission

TABVERSE: Benchmarking Cross-Format Table Understanding in LLMs and VLMs

Momina Ahsan, Sarfraz Ahmad, Ming Shan Hee, Roy Ka-Wei Lee, Preslav Nakov

Affiliations: Mohamed bin Zayed University of Artificial Intelligence (MBZUAI); Singapore University of Technology and Design (SUTD)

AI summary: Proposes the TABVERSE benchmark, which holds table content fixed across multiple structural formats (HTML, Markdown, LaTeX) and rendered images to systematically evaluate LLMs and VLMs on question answering, structural understanding, and structure reconstruction; finds that representation format significantly affects table understanding.

Comments: 24 pages, 18 tables, 16 figures; submitted to ARR May 2026

Abstract

Large Language Models (LLMs) and Vision-Language Models (VLMs) are increasingly evaluated on table reasoning tasks, but the role of table representation remains under-explored. In practice, the same table content may appear in different structural formats, such as HTML, Markdown, and LaTeX, or as rendered images. However, existing evaluations often let content, format, layout, and modality vary together, making it difficult to isolate representation effects. We introduce TABVERSE, a controlled multimodal table benchmark that aligns the same table content across multiple structural formats and rendered images, with question category and difficulty tags. This design enables systematic evaluation of representation effects while holding table content fixed. We evaluate LLMs and VLMs across three tasks: Question Answering (QA), Structural Understanding Capability (SUC), and Structure Reconstruction (SR). Our results show that representation choice substantially affects table understanding. Models generally perform better with structured text than with rendered images, but the size of this gap depends on the task, model, and format. HTML is often the most robust text format, while row-sensitive structural tasks and syntactically usable LaTeX reconstruction remain challenging. These findings show that table representation is a key factor in reliable table evaluation.

2606.09577 2026-06-09 cs.CL cs.LG cs.SE New submission

Code Is More Than Text: Uncertainty Estimation for Code Generation

Yuling Shi, Caiqi Zhang, Yuexian Li, Haopeng Wang, Yeheng Chen, Nigel Collier, Xiaodong Gu

Affiliations: Shanghai Jiao Tong University; University of Cambridge

AI summary: To address the reliability risk of incorrect programs in code generation, proposes uncertainty estimation along three orthogonal axes (lexical, algorithmic, and functional), raising AUROC by 8.1 percentage points across five code LLMs.

Abstract

Large language models (LLMs) are increasingly deployed as code generators, where silently wrong programs pose real safety and reliability risks. Reliable uncertainty estimation (UE) is essential for selective prediction, human-in-the-loop review, and downstream agentic decisions. Yet most existing code UE methods are inherited from natural language (NL) generation and ignore properties that make code distinct. We argue that code differs from NL in three ways: a single wrong token can break an entire program (token fragility); algorithmic intent and concrete implementation can disagree independently (intent-code gap); and programs can be executed (executability). We instantiate these properties as three orthogonal uncertainty axes: lexical (Top-K token entropy), algorithmic (pseudo-code consistency), and functional (behavioral consistency). Across five code LLMs, our three-axis ensemble improves average AUROC from 0.696 for the strongest NL-derived baseline to 0.776 (+8.1 points). Notably, on Qwen3-14B, our single-pass Top-K token entropy matches the strongest multi-pass baseline while being over 3x cheaper; across models, it remains a competitive low-cost signal. These results suggest that code UE deserves code-specific design rather than direct NL ports.
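
To make the lexical axis concrete, here is a minimal sketch of a Top-K token entropy signal computed from per-step next-token probabilities. The abstract does not give the exact definition, so the choices below (renormalizing over the top K tokens, averaging across generation steps, the function names) are assumptions for illustration only.

```python
import math

def top_k_entropy(probs, k):
    """Shannon entropy (in nats) of the k most likely tokens, renormalized.

    Renormalizing over the top k is an assumption made for this sketch.
    """
    top = sorted(probs, reverse=True)[:k]
    total = sum(top)
    return -sum((p / total) * math.log(p / total) for p in top if p > 0)

def sequence_uncertainty(step_probs, k=5):
    """Average Top-K entropy over all generation steps (assumed aggregation)."""
    return sum(top_k_entropy(p, k) for p in step_probs) / len(step_probs)

# One confident step and one coin-flip step.
steps = [[1.0, 0.0, 0.0], [0.5, 0.5, 0.0]]
score = sequence_uncertainty(steps, k=2)
```

Because it needs only the probabilities from a single forward generation, a signal of this kind is cheap compared with methods that sample many completions, which matches the single-pass framing in the abstract.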

2606.09572 2026-06-09 cs.RO cs.AI New submission

CT-VAM: A Cerebello-Thalamic-Inspired Vision-Action Model for Efficient Visuomotor Control

Jiacheng Li, Yize Guo, Jiabin Guo, Qingchen Liu, Jiahu Qin

Affiliations: University of Science and Technology of China; AIRLab, Department of Automation

AI summary: Proposes CT-VAM, which fuses heterogeneous inputs with the TARS conditional attention decoder and, with 68M parameters, achieves LIBERO success rates comparable to large VLA models while lowering inference latency and supporting high-frequency control.

Abstract

Vision-language-action models have shown strong promise for robot manipulation, yet raw language is primarily needed to specify task intent rather than to be repeatedly processed during high-frequency low-level execution. Motivated by this separation, we propose a cerebello-thalamic-inspired vision-action model (CT-VAM) for efficient task-conditioned visuomotor control. CT-VAM acts as a compact local execution policy that predicts action chunks from dual-view visual observations, proprioception, and a lightweight task condition, potentially enabling a practical cloud-edge paradigm in which high-level semantic reasoning can be handled by large models while fast closed-loop control runs on local hardware. To fuse heterogeneous inputs effectively, CT-VAM introduces TARS (Thalamic Action Routing Stream), a stream-separated conditional attention decoder that independently routes action, visual and task streams, preventing dense sensory tokens from overwhelming compact task-relevant conditions. With only 68M parameters, CT-VAM achieves LIBERO success rates competitive with substantially larger VLA models, while reducing inference latency. Together with flow-consistent inpainting for asynchronous chunk execution, CT-VAM supports high-frequency control and demonstrates robust real-world deployment on resource-constrained robotic platforms.

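2606.09569 2026-06-09 cs.RO cs.CV New submission

Safe-RULE: Safe Reinforcement Unlearning

Shixiong Jiang, Taozheng Zhu, Fanxin Kong

Affiliations: University of Notre Dame

AI summary: To address the vulnerability of offline safe reinforcement learning to data poisoning attacks, proposes the Safe-RULE framework, which removes the influence of malicious samples through unlearning without retraining from scratch or access to the original environment; experiments show it effectively improves safety.

Comments: 20 pages, 3 figures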

2606.09556 2026-06-09 cs.AI New submission

AI Scientists Are Only as Good as Their Evidence: A Stratified Ablation of Proprietary Data and Reasoning Skills in Drug-Asset Valuation

Yinan Wang

Affiliations: Noah AI Research

AI summary: Through a stratified ablation, finds that the decision ceiling of AI Scientist agents in drug-asset valuation is set by the proprietary evidence set rather than by reasoning scaffolds alone; decision quality improves markedly once proprietary data is added.

Comments: Preprint; 2 figures, 5 tables

Abstract

AI Scientist agents are often evaluated as if capability were mainly a function of model quality, prompting, or reasoning scaffolds. We test a different hypothesis in drug-asset valuation: for knowledge-intensive scientific decisions, the limiting factor is often the evidence substrate the agent can access. We run a controlled three-arm ablation on a production valuation agent: A is a plain web-only LLM analyst, B adds public structured tools plus a 14-dimension valuation playbook, verifier, objectivity policy and red-team, and C adds the proprietary Noah AI corpus of curated pipeline, trial and deal intelligence. Across a 13-asset stratified benchmark, B improves calibration and audit discipline: tier-in-range accuracy rises from 0.80 to 0.89 and objectivity from 3.16 to 3.30. But B does not remove the factual ceiling. Under capability-superset accounting, A and B recover only 0.25 and 0.38 of the curated gold competitive record, while C recovers 0.96; on the curated long-tail subset, C reaches 0.93 vs. 0.26/0.30. Raw blind-panel decision quality is similar for A and B (7.01 vs. 6.96), so we introduce completeness-aware decision utility: informed decision-quality = decision-quality x gold-coverage. On this metric, C reaches 7.43 vs. 1.76/2.57 for A/B. Even a perfect non-proprietary-data report would be capped at 3.83 by B's coverage. The result is not that reasoning scaffolds are unimportant; they improve calibration and discipline. Rather, proprietary evidence sets the upper bound of what the AI Scientist can know and therefore decide.
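
The completeness-aware metric in the abstract is a simple product, informed decision-quality = decision-quality × gold-coverage, and a short calculation shows how it behaves. The sketch below plugs in the aggregate figures quoted in the abstract for arms A and B. Two assumptions are made here and are not stated in the abstract: that decision quality is scored out of 10, and that the quoted coverage values (0.25, 0.38) are the ones entering the product. Multiplying these rounded aggregates gives values close to, but not identical with, the reported 1.76 and 2.57, so the paper presumably aggregates differently (for example per asset) or uses unrounded inputs.

```python
def informed_decision_quality(decision_quality, gold_coverage):
    """Completeness-aware utility as defined in the abstract: a plain product."""
    return decision_quality * gold_coverage

# (raw blind-panel decision quality, gold coverage) as quoted in the abstract.
arms = {"A": (7.01, 0.25), "B": (6.96, 0.38)}
naive = {arm: informed_decision_quality(q, c) for arm, (q, c) in arms.items()}

# Ceiling for a perfect report limited to B's coverage.
# The 10-point maximum is an assumption of this sketch.
cap_b = informed_decision_quality(10.0, 0.38)
```

Under the assumed 10-point scale the ceiling comes out at about 3.8, in line with the 3.83 cap quoted in the abstract once rounding of the coverage figure is taken into account.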

2606.09547 2026-06-09 cs.CV cs.LG New submission

Streaming Interventions: Can Video Large Language Models Correct Mistakes as They Occur?

Apratim Bhattacharyya, Shweta Mahajan, Sanjay Haresh, Rajeev Yasarla, Reza Pourreza, Litian Liu, Risheek Garrepalli, Roland Memisevic

Affiliations: Qualcomm AI Research; York University; Vector Institute for AI

AI summary: Proposes the Ego-MC-Bench benchmark to evaluate the real-time intervention ability of video LLMs in cooking scenarios, and builds Ego-CoMist, a counterfactual synthetic dataset that improves the performance of small models.

Comments: Qualcomm Interactive Cooking: Ego-MC-Bench, available at https://huggingface.co/datasets/neuripsedtracksub/ego-mistake-corrections, and Ego-CoMist, available at https://huggingface.co/datasets/neuripsedtracksub/ego-counterfactual-mistakes

Abstract

Learning everyday skills, like cooking a dish, relies increasingly on instructional media such as online videos. This opens the door to the use of video (and multimodal) large language models (LLMs) as task guidance assistants. A crucial capability for the real-world success of a prospective task guidance assistant is its ability to intervene proactively as soon as a mistake is apparent in order to guide the user. To evaluate this crucial capability, we introduce Ego-MC-Bench (Mistake Corrections), a benchmark for evaluating reactive, step-by-step task guidance in realistic cooking scenarios. Extensive experiments show that Ego-MC-Bench is highly challenging for state-of-the-art video LLMs. We argue that a key reason is the limited availability of training data for fine-tuning models on this task. Although there exists a wide range of cooking video datasets, existing datasets lack examples of mistakes along with appropriately timed interventions. To help address this data limitation, we also introduce Ego-CoMist, a counterfactual synthetic dataset created by transforming non-interactive cooking videos into supervised training examples showing proactive interventions. We show that fine-tuning on Ego-CoMist yields performance gains, especially for smaller and more efficient video LLMs that are well suited for delivering assistance on edge devices.
