arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.13136 2026-06-12 cs.CV cs.LG eess.IV New submission

An Extensible and Lightweight Unified Architecture for Demosaicing Pixel-bin Image Sensors

Saurabh Kumar, Nutan Sairam Yenneti

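Affiliations: Samsung Research Institute Bangalore

AI summary: Proposes a modular unified architecture that, with a learning-free CFA-identification module and a lightweight design, demosaics multiple pixel-bin sensors, improving image quality while reducing resource consumption.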

Abstract

Pixel-bin image sensors are becoming the default choice for smartphone cameras due to their resolution vs light-gathering trade-off. However, their larger inter-color separation compared to the Bayer color filter array (CFA) makes them challenging to demosaic. Furthermore, existing deep learning-based demosaicing methods are CFA-specific, requiring multiple individual models that take up precious onboard resources and demand larger development and maintenance efforts. In this work, we propose a modular unified architecture for demosaicing various pixel-bin sensors that provides higher image quality while being extensible and lightweight. Additionally, to enable plug-and-play operation, we introduce a learning-free CFA-identification module to detect the CFA type of raw data accurately.

2606.13135 2026-06-12 cs.CV cs.AI New submission

Cascade Classification of Dermoscopic Images of Skin Neoplasms with Controllable Sensitivity and External Clinical Validation

Elena S. Kozachok, Sergey S. Seregin, Aleksandr V. Kozachok, Ilya P. Latyshev, Oleg I. Samovarov

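Affiliations: Ivannikov Institute for System Programming of the Russian Academy of Sciences (ISP RAS); Orel Oncological Dispensary

AI summary: Compares four deep learning architectures for dermoscopic image classification, proposes a two-stage cascade whose tunable triage threshold provides sensitivity control, and validates the generalization gap on external clinical datasets.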

Comments 28 pages, 8 figures, 10 tables

Abstract

Purpose. To compare deep learning architectures and classification schemes for dermoscopic images of skin neoplasms and assess their generalization on transfer from open international datasets to independent clinical datasets of Russian practice. Methods. Four architectures (ViT-B/16, Swin-S, ConvNeXt-S, EfficientNetV2-S) were compared in three schemes: binary (malignant/benign), single-stage four-class (benign, MEL, SCC, BCC), and a two-stage cascade (binary triage, then three-class differentiation MEL/SCC/BCC). All models used ImageNet-pretrained weights and a single augmentation protocol on aggregated open ISIC Archive data, and were evaluated on an internal held-out sample and two clinical datasets (Melanoscope AI mobile system; Sechenov University). Results. Internally the binary stage attains ROC-AUC 0.952-0.966; on Sechenov University it drops to 0.797-0.893, sensitivity to 0.53-0.67, and ECE rises from 0.02 to 0.27-0.39 with underestimation of malignancy, quantifying a generalization gap in ranking and calibration. Paired tests confirm one inter-architecture result on clinical data: the deficit of ViT-B/16 at the binary stage (p<0.05); at the differentiation stage no architecture has a proven advantage. The cascade raises macro F1 over single-stage four-class classification for most architectures, but significantly only for ViT-B/16, by recovering malignant lesions assigned to the dominant benign class. On ISIC MILK10k, direct 11-class classification yields mean-class sensitivity 0.525. Conclusion. A tunable triage threshold gives sensitivity control not attainable in standard single-stage (argmax) classification and better reproduces clinical differential-diagnosis logic. The persistent generalization gap mandates external clinical validation and recalibration before deployment.
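
The two-stage cascade described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the toy scores, and the default threshold of 0.5 are assumptions made for illustration. It shows only how lowering the triage threshold raises sensitivity at the binary stage, a control that a single argmax over four classes does not expose.

```python
def cascade_predict(p_malignant, subtype_probs, triage_threshold=0.5):
    """Two-stage decision: binary triage, then MEL/SCC/BCC differentiation.

    p_malignant      -- stage-1 probability that the lesion is malignant
    subtype_probs    -- stage-2 probabilities over malignant subtypes
    triage_threshold -- hypothetical tunable cut-off for stage 1
    """
    if p_malignant < triage_threshold:
        return "benign"
    # Stage 2 runs only for lesions that pass triage.
    return max(subtype_probs, key=subtype_probs.get)


def triage_sensitivity(p_malignant_scores, is_malignant, threshold):
    """Fraction of truly malignant cases that pass the triage stage."""
    flagged = [p >= threshold
               for p, y in zip(p_malignant_scores, is_malignant) if y]
    return sum(flagged) / len(flagged)


# Toy data, invented for illustration only.
scores = [0.9, 0.4, 0.3, 0.7, 0.2, 0.1]
labels = [True, True, True, True, False, False]
probs = {"MEL": 0.6, "SCC": 0.3, "BCC": 0.1}

print(cascade_predict(0.4, probs, triage_threshold=0.5))   # benign
print(cascade_predict(0.4, probs, triage_threshold=0.25))  # MEL
print(triage_sensitivity(scores, labels, 0.5))             # 0.5
print(triage_sensitivity(scores, labels, 0.25))            # 1.0
```

In this toy setting, moving the threshold from 0.5 to 0.25 recovers the two malignant cases that would otherwise be labeled benign. On real data the price is more benign lesions passed on to stage 2.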

2606.13126 2026-06-12 cs.LG cs.AI cs.CL New submission

MiniPIC: Flexible Position-Independent Caching in <100LOC

Nathan Ordonez, Thomas Parnell

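Affiliations: IBM Research

AI summary: Proposes MiniPIC, which uses a positional-encoding-free KV cache and user-controlled cache-reuse primitives to implement multiple position-independent caching methods in vLLM, substantially improving prefill throughput and reducing time-to-first-token.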

Comments 13 pages, 5 figures

Abstract

Retrieval-augmented and agentic workloads repeatedly prefill recurring predictable structured inputs (which we call "spans") such as documents and code files. Yet, prefix caching in engines such as vLLM cannot reuse their KV entries unless they share identical prefixes with another request, while Position-Independent Caching (PIC) implementations within production-grade inference servers typically either require substantial server code changes or keep KV state outside the server, incurring host-to-device transfer overhead. We present Minimalistic PIC (MiniPIC): a minimal, flexible and fast vLLM design built from two ingredients: positional-encoding-free KV cache and user-controlled cache-reuse primitives. MiniPIC stores unrotated K vectors in the KV cache, applies RoPE to K tiles inside attention using per-request logical positions, and exposes three user-facing and token-level primitives: block-aligned padding, span separator (SSep), and prompt depend (PDep), that modify hashing behavior and effective block-level causal attention structure. With fewer than 100 lines of core-engine changes plus a custom attention backend, these primitives are sufficient to realize multiple PIC methods, including Block-Attention, EPIC, and Prompt Cache, within the same running vLLM instance, while natively integrating with KV cache CPU offload implementations. On 2WikiMultihopQA, MiniPIC with interleaved scheduling improves prefill throughput by 49% over baseline vLLM, reduces cached-span time-to-first-token by up to two orders of magnitude, preserves the linear prefill scaling of uncached spans, and incurs only 5.7% worst-case overhead.
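
The idea of caching unrotated K vectors and applying RoPE at attention time can be sketched in a few lines of pure Python. This is a toy illustration under stated assumptions, not MiniPIC's code and not the vLLM API: `rope`, `attention_scores`, and the example vectors are all hypothetical. It relies on the standard RoPE property that the rotated dot product depends only on the relative distance between query and key positions.

```python
import math


def rope(vec, pos, base=10000.0):
    """Apply rotary position embedding to an even-length vector."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)  # per-pair rotation angle
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# A cached span stores K without rotation, so it carries no position.
cached_span_k = [[0.5, -1.0, 0.25, 2.0], [1.5, 0.0, -0.75, 1.0]]


def attention_scores(q, q_pos, span_k, span_start):
    """Rotate cached K at read time using per-request logical positions."""
    q_rot = rope(q, q_pos)
    return [dot(q_rot, rope(k, span_start + j)) for j, k in enumerate(span_k)]


q = [1.0, 2.0, -1.0, 0.5]
# The same cached span placed at two different offsets in two requests,
# with the query kept at the same relative distance from it.
scores_a = attention_scores(q, q_pos=10, span_k=cached_span_k, span_start=3)
scores_b = attention_scores(q, q_pos=110, span_k=cached_span_k, span_start=103)
print(all(abs(a - b) < 1e-9 for a, b in zip(scores_a, scores_b)))
```

Because the stored vectors are position-free, the same cache entry gives matching scores at either offset. A cache holding pre-rotated K would be tied to the positions it was computed at.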

2606.13125 2026-06-12 cs.LG cs.AI New submission

Select and Improve: Understanding the Mechanics of Post-Training for Reasoning

Akshay Krishnamurthy, Audrey Huang, Nived Rajaraman

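Affiliations: Microsoft Research NYC; UIUC

AI summary: Controlled experiments show that reinforcement learning post-training improves reasoning through two mechanisms, strategy selection and strategy improvement, and identify the distinct roles of SFT data and RL data.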

Abstract

Reinforcement learning has rapidly emerged as a key component in the training of reasoning and coding models, yet it remains poorly understood from a mechanistic perspective. We study how and through what underlying processes capabilities are acquired or enhanced via reinforcement learning post-training. Our analysis, based on controlled math reasoning experiments with Qwen-2.5-1.5B, reveals two core mechanisms: strategy selection and strategy improvement. Our results highlight the role of SFT data and reinforcement learning data in activating these mechanisms, in particular showing how supervising the model on diverse reasoning strategies can enable strategy selection and how increasing difficulty in reinforcement learning data can enable strategy improvement. Taken together, our results provide mechanistic insight into RL training and suggest practical interventions to continue scaling reasoning capabilities.

2606.13121 2026-06-12 cs.CL cs.AI cs.SD New submission

NaturalFlow: Reducing Disruptive Pauses for Natural Speech Flow in Simultaneous Speech-to-Speech Translation

Dongwook Lee, Youngho Cho, Sangkwon Park, Heeseung Kim, Sungroh Yoon

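Affiliations: IPAI and ECE, Seoul National University; Department of AI, University of Seoul

AI summary: Proposes a fluency-aware optimization framework that minimizes inter-chunk silences using model-internal signals (such as linguistic diversity and temporal variability in speech durations), balancing the low latency of simultaneous translation against the natural flow of consecutive translation.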

Comments Proceedings of the 26th Interspeech Conference, Long Paper

Abstract

Simultaneous speech-to-speech translation aims to enable near-real-time communication by minimizing latency, offering a compelling, real-time alternative to the high latency of consecutive translation. However, the excessive pursuit of low latency often results in fragmented chunk-wise speech. Consequently, listeners are subjected to an unnatural acoustic flow punctuated by frequent pauses, which could increase their cognitive load. To bridge this gap, we introduce a fluency-aware optimization framework designed to discover the sweet spot between the low-latency benefits of simultaneous translation and the natural flow of consecutive translation. Our framework minimizes inter-chunk silences by leveraging model-internal signals, including linguistic diversity and induced temporal variability in speech durations. Experiments on short- and long-form benchmarks show that our framework produces natural speech flow while maintaining competitive latency and translation quality.

2606.13120 2026-06-12 cs.CL New submission

EvoBrowseComp: Benchmarking Search Agents on Evolving Knowledge

Yunhan Wang, Jiaan Wang, Lianzhe Huang, Xianfeng Zeng, Fandong Meng

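Affiliations: Northeastern University, China; Weixin AI, Tencent Inc, China

AI summary: Proposes EvoBrowseComp, an evolving benchmark of 400 English and 400 Chinese contamination-free complex questions generated automatically via live-web traversal, for evaluating the genuine browsing ability of search agents in a dynamic knowledge environment.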

Comments 14 pages, under review

Abstract

Search Agents -- large language models augmented with search tools -- have intensified the need for future-proof evaluation benchmarks. Existing benchmarks such as BrowseComp rely on static knowledge, making them vulnerable to test-set contamination and parametric memorization. Consequently, models can achieve high scores through fact recall rather than genuine retrieval, obscuring true browsing competence via reasoning shortcuts. In this paper, we introduce EvoBrowseComp, an evolving benchmark of 400 English and 400 Chinese contamination-free complex questions synthesized via live-web traversal. To collect these questions, we design a three-agent collaborative framework: (1) a QA synthesis agent that retrieves fresh knowledge from the live web to synthesize QA pairs; (2) an information filtering agent that filters retrieved knowledge in terms of credibility and popularity to block parametric shortcuts; and (3) a high-level guidance agent that formalizes questions into reasoning graphs to reduce logical redundancy and shortcuts in synthesized QA pairs. Because the framework supports fully automated synthesis, EvoBrowseComp can be regularly updated to prevent data contamination and maintain temporal freshness. Extensive experiments confirm its great difficulty, requiring broad horizontal search. It establishes a scalable paradigm for auto-updatable, high-difficulty benchmarking that keeps pace with both evolving world knowledge and advancing agent capabilities.

2606.13115 2026-06-12 cs.CL cs.AI New submission

G-Long: Graph-Enhanced Memory Management for Efficient Long-Term Dialogue Agents

Minjun Choi, Yoonjin Jang, Sangwon Youn, Youngjoong Ko

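Affiliations: Sungkyunkwan University

AI summary: Proposes G-Long, a framework that uses a fine-tuned small language model for structured triplet extraction and associative retrieval and introduces an attention-aware importance scoring mechanism, achieving state-of-the-art performance in response generation and memory retrieval while reducing computational overhead.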

Comments 22 pages, 8 figures, 14 tables

Abstract

While Large Language Models (LLMs) have advanced open-domain dialogue systems, maintaining long-term consistency remains a challenge due to inherent limitations in long-context reasoning and the inefficiency of processing extensive raw text. Existing approaches typically rely on either unstructured memory storage, which is prone to information loss, or computationally expensive LLMs that incur high latency. To address these limitations, we propose G-Long, a graph-enhanced framework that utilizes a fine-tuned small Language Model (sLM) for structured triplet extraction and associative retrieval, significantly reducing operational costs. Furthermore, we introduce the novel attention-aware importance scoring mechanism that leverages the intrinsic cross-attention signals of a T5 summarizer to identify salient memories. Extensive experiments across diverse benchmarks demonstrate that G-Long achieves state-of-the-art performance in both response generation and memory retrieval, yielding performance gains of up to 9.8% in response quality on MSC and 40.8% in retrieval recall on LME, while significantly minimizing computational overhead.
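
A structured triplet memory of the general kind this abstract mentions can be sketched as a small entity-indexed store. The abstract does not specify G-Long's data structures or retrieval rule, so everything here is an assumption for illustration: the `TripletMemory` class, its methods, and the one-hop lookup are hypothetical, and the example triplets are invented rather than produced by a language model.

```python
from collections import defaultdict


class TripletMemory:
    """Toy graph memory: (subject, relation, object) triplets by entity."""

    def __init__(self):
        self._by_entity = defaultdict(list)

    def add(self, subject, relation, obj):
        triplet = (subject, relation, obj)
        # Index under both endpoints so either entity can retrieve it.
        self._by_entity[subject].append(triplet)
        self._by_entity[obj].append(triplet)

    def retrieve(self, entity):
        """One-hop associative lookup: every triplet touching `entity`."""
        return list(self._by_entity.get(entity, []))


memory = TripletMemory()
memory.add("Alice", "adopted", "a dog")
memory.add("a dog", "is named", "Biscuit")
memory.add("Bob", "moved to", "Seoul")

print(memory.retrieve("a dog"))
print(memory.retrieve("Carol"))  # unknown entity: empty list
```

Storing compact triplets instead of raw dialogue turns is what keeps the retrieved context small. A fuller system would also need entity normalization and a ranking step, which this sketch leaves out.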

2606.13111 2026-06-12 cs.CL New submission

MÖVE: A Holistic LLM Benchmark for the German Public Sector

Camilla Dalerci, Thilo Michael, Robin Schaefer, Daniel Weinland

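Affiliations: Innovations Department, Bundesdruckerei GmbH

AI summary: Proposes the MÖVE benchmark, which evaluates 39 LLMs for German public-sector use along performance and governance dimensions, finding that no single model leads across the board and that model size is not a reliable indicator of quality.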

Abstract

We present MÖVE (Modelle für die Öffentliche Verwaltung Evaluieren), a holistic benchmark for evaluating large language models (LLMs) in the context of the German public sector. While LLMs are increasingly adopted in public administration, model selection remains largely ad hoc, and existing benchmarks offer limited guidance: they are predominantly English-centric, US-centric in content, and focus exclusively on task performance. MÖVE addresses these gaps by evaluating 39 models across two complementary dimensions. Performance criteria cover summarization, question answering, and topic extraction. Governance criteria assess hallucination tendencies, energy consumption, provider transparency, alignment with German constitutional values, and knowledge of the positions of German political parties. In total, we utilize ten German-language datasets, including gold- and silver-standard datasets that we constructed to reflect public-administration domains. We employ a multi-metric evaluation strategy combining classical NLP metrics, embedding-based methods, and LLM-as-a-judge approaches. Our results show that no single model dominates across all criteria: top performers differ between tasks, and model size alone is a poor predictor of quality. We further evaluate the benchmark itself, analyzing its statistical precision, LLM judge reliability, the impact of our private datasets on model rankings, the sensitivity of our results to prompt formulation, and the validity of our energy consumption estimates. MÖVE is designed as a living benchmark under active development; results are publicly available at https://moeve.bundesdruckerei.de/.

2606.13108 2026-06-12 cs.CV New submission

PP-OCRv6: From 1.5M to 34.5M Parameters, Surpassing Billion-Scale VLMs on OCR Tasks

Yubo Zhang, Xueqing Wang, Manhui Lin, Yue Zhang, Penglongyi Deng, Ting Sun, Tingquan Gao, Zelun Zhang, Jiaxuan Liu, Changda Zhou, Hongen Liu, Suyin Liang, Cheng Cui, Yi Liu, Dianhai Yu, Yanjun Ma

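Affiliations: PaddlePaddle Team, Baidu Inc.

AI summary: Proposes PP-OCRv6, a lightweight OCR system built on a unified MetaFormer architecture with structural reparameterization, which surpasses billion-scale VLMs with orders of magnitude fewer parameters across server-to-edge deployments; the medium model reaches 83.2% recognition accuracy and 86.2% detection Hmean.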

Abstract

Vision-Language Models (VLMs) have achieved impressive results on general vision-language tasks, yet they suffer from hallucination, imprecise localization, and prohibitive computational cost when applied to dedicated OCR scenarios. This paper presents PP-OCRv6, a lightweight OCR system that combines architectural innovation with data-centric optimization. PP-OCRv6 redesigns the backbone, detection neck, and recognition neck around a unified MetaFormer-style building block with structural reparameterization, decoupling spatial token mixing from channel mixing and supporting both tasks through task-specific stride configurations. Three model tiers (medium, small, tiny) share the same block primitives, covering deployment scenarios from server to edge. On our in-house benchmarks, PP-OCRv6_medium achieves 83.2% recognition accuracy and 86.2% detection Hmean, outperforming PP-OCRv5_server by +5.1% and +4.6% respectively while surpassing Qwen3-VL-235B, GPT-5.5, and Gemini-3.1-Pro with orders of magnitude fewer parameters. The tiny tier achieves 3.9× faster inference than PP-OCRv5_mobile on Intel Xeon CPU while maintaining comparable accuracy.

2606.13106 2026-06-12 cs.LG cs.CL New submission

Demystifying Hidden-State Recurrence: Switchable Latent Reasoning with On-Policy Reinforcement Learning

Jiayu Yang, Chao Chen, Shengen Wu, Yinhong Liu, Yuxuan Fan, Lujundong Li, Songning Lai, Chengwei Qin, Zhijiang Guo

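Affiliations: HKUST(GZ); University of Cambridge; NTU; JoinQuant; HKUST

AI summary: Proposes the SWITCH framework, which uses discrete boundary tokens to make hidden-state-recurrence reasoning compatible with on-policy reinforcement learning and amenable to causal mechanistic analysis; experiments show it outperforms existing methods.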

Abstract

Latent chain-of-thought compresses reasoning by replacing visible reasoning traces with continuous hidden-state recurrence, but existing formulations are difficult to optimize with standard on-policy reinforcement learning (RL) and hard to interpret causally. Our key insight is that a single pair of explicit boundary tokens can address both issues at once: discrete entry and exit anchors make the latent block compatible with standard on-policy RL, and the same anchors offer a natural foothold for mechanistic analysis. Motivated by this, we propose SWITCH, a switchable latent reasoning framework. The model emits <swi> to enter latent mode and </swi> to exit. Because the boundaries are ordinary discrete tokens, the GRPO policy ratio is well-defined at every decision point. The same anchors also expose the latent steps to direct probing and causal intervention. We train the model with a visible-to-latent curriculum and a Switch-GRPO objective that propagates gradients through recurrent latent computation. SWITCH consistently outperforms prior hidden-state-recurrence latent reasoning approaches at similar scale. Mechanistic analysis through the boundary tokens further reveals three findings: (i) <swi> is a sharply localised, learned switching policy rather than a stylistic artefact; (ii) the latent step it opens performs problem-specific, causally important computation rather than acting as an inert placeholder; and (iii) that computation is concentrated at a single hidden-state transition on entry. Together, these results show that hidden-state-recurrence latent reasoning is both RL-trainable and open to direct mechanistic analysis, including of how on-policy RL itself improves the model from the inside.

2606.13105 2026-06-12 cs.LG New submission

Disparate Impact in Synthetic Data Generation

Paul Andrey, Michaël Perrot, Batiste Le Bars, Marc Tommasi

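Affiliations: Univ. Lille, Inria, CNRS, Centrale Lille, UMR 9189 - CRIStAL

AI summary: Revisits the fairness notion of disparate impact in synthetic data generation, notes that non-disparate impact is achieved when the synthetic distribution matches the real one, analyzes why SDG can fail to reach it (expressive power, sampling error, and estimation error from differential privacy), and proposes a group-wise learning strategy to improve overall utility and fairness.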

Abstract

We revisit the fairness notion of disparate impact for synthetic data generation (SDG), which assesses whether the utility of generated records is the same across sensitive groups. Our approach departs from existing work on fair SDG, which addresses the problem of correcting for undue biases in the observed distribution, hence redefining SDG as learning a distribution that is not that of the real data. By contrast, non-disparate impact is notably achieved when the synthetic and real distributions are the same. We expose reasons why SDG may fail to reach that solution and discuss why approximation and estimation errors occur and can be disparate across groups. We notably look into the expressive power of SDG methods relative to distribution complexity, sampling errors due to group proportions, and estimation errors induced by differential privacy mechanisms. We illustrate cases of disparate impact on both artificial and real-world data, focusing on SDG methods that rely on probabilistic graphical models. We also introduce a strategy of learning group-wise SDG models and illustrate how it can improve both the overall utility and its parity in many settings.

2606.13104 2026-06-12 cs.LG New submission

Authority, Truth, and Citation Bias: A Large-Scale Multi-Domain Benchmark for Studying Epistemic Susceptibility in Large Language Models

Aryan Khurana, Aravind Ramana RN, Dhruv Kumar

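Affiliations: University of California, Berkeley

AI summary: Proposes the AuthorityBench benchmark, which uses a 2x2 factorial design to isolate the effect of citation authority signals on LLM epistemic behavior; finds that citation presence, whether real or fabricated, raises hallucination rates, and that pairing true claims with fabricated citations raises them by 3 to 22 percentage points.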

Comments 10 pages, 5 figures. Accepted to AI4GOOD and EIML at ICML 2026

Abstract

Large language models are increasingly deployed in citation-augmented settings, yet the effect of citation presence on model behavior independent of factual content remains poorly understood. We introduce AuthorityBench, a 220,564-prompt multi-domain benchmark that isolates how citation-based authority signals influence epistemic behavior in LLMs. The benchmark uses a fully balanced 2x2 factorial design crossing claim veracity with citation veracity, the first to do so, across four domains (general knowledge, science, law, and medicine), with controlled variation over 40 prompt templates, four venue prestige tiers, and a country-coded author name dataset. Evaluating seven models on 12 structured research questions, we find that citation presence, whether real or fabricated, consistently increases hallucination rates relative to a no-citation baseline. The effect is strongest when fabricated citations accompany true claims, raising hallucination rates by 3 to 22 percentage points and reaching 35 to 77% in the general knowledge domain, while legal claims are comparatively robust and venue prestige and author demographics show negligible impact. All datasets and evaluation code are available at: https://github.com/floating-reeds/AuthorityBench
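
The fully balanced 2x2 factorial design crossed with four domains can be made concrete with a short enumeration. This is an illustrative sketch, not the benchmark's code: the label strings and the `factorial_cells` helper are hypothetical, and the prompt templates, venue tiers, and author names that the benchmark also varies are omitted.

```python
from itertools import product

# Hypothetical labels for the two crossed factors and the four domains.
CLAIM_VERACITY = ("true", "false")
CITATION_VERACITY = ("real", "fabricated")
DOMAINS = ("general knowledge", "science", "law", "medicine")


def factorial_cells():
    """Enumerate every claim x citation x domain condition exactly once."""
    return [
        {"claim": claim, "citation": citation, "domain": domain}
        for claim, citation, domain in product(
            CLAIM_VERACITY, CITATION_VERACITY, DOMAINS
        )
    ]


cells = factorial_cells()
print(len(cells))  # 2 * 2 * 4 = 16 conditions

# The cell the abstract reports as most harmful: a true claim paired
# with a fabricated citation, once per domain.
risky = [c for c in cells
         if c["claim"] == "true" and c["citation"] == "fabricated"]
print(len(risky))  # 4
```

Crossing the two factors fully is what lets the effect of the citation be read separately from the truth of the claim. The no-citation baseline mentioned in the abstract would sit outside this grid as a separate condition.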

2606.13102 2026-06-12 cs.RO New submission

FTP-1: A Generalist Foundation Tactile Policy Across Tactile Sensors for Contact-Rich Manipulation

Chengbo Yuan, Zicheng Zhang, Mingjie Zhou, Wendi Chen, Yi Wang, Zhuoyang Liu, Dantong Niu, Shuo Wang, Hui Zhang, Wenkang Zhang, Yingdong Hu, Yuanqing Gong, Wanli Xing, Chuan Wen, Cewu Lu, Kaifeng Zhang, Yang Gao

Affiliations: Tsinghua University; Shanghai Qi Zhi Institute; Sharpa; Shanghai Jiao Tong University; University of California, Berkeley; ETH Zurich; Fudan University; Shanghai Innovation Institute

AI summary: Proposes FTP-1, the first generalist foundation tactile policy, which uses heterogeneous encoders and a shared Transformer expert, pretrained on about 3,000 hours of data across 21 sensors, to transfer tactile manipulation skills across sensors, improving success rate by 31% on unseen sensors.

Abstract

Despite the success of vision-based generalist robotic policies, existing tactile-based policies remain tied to fixed embodiments and sensor setups. This is because tactile signals are highly heterogeneous across hardware, making cross-sensor generalization difficult. We present FTP-1, the first generalist foundation tactile policy pretrained to acquire transferable tactile manipulation abilities across diverse sensors and embodiments. FTP-1 supports varied tactile inputs, including image-, array-, and state-based signals, by using heterogeneous encoders to project them into unified morphology-aware latent tokens that are jointly modeled by a shared tactile Transformer expert. Pretrained on around 3,000 hours of tactile manipulation data aggregated from 26 data sources, spanning human and robot demonstrations across 21 sensors, FTP-1 learns tactile skills that transfer beyond the sensors seen during pretraining. Across downstream finetuning experiments spanning 5 hardware configurations, FTP-1 improves contact-rich manipulation on seen sensor setups by +17.2% and, surprisingly, transfers to two previously unseen tactile-sensor setups, achieving a +31% gain in success rate. FTP-1 establishes the first unified foundation baseline for tactile manipulation, providing future tactile policies with a shared model-level starting point. Pretrained models, datasets, training code, and more visualizations are available at https://ftp1-policy.github.io.

2606.13100 2026-06-12 cs.CL New submission

LEDGER: A Long-Context Benchmark of Corporate Annual Reports for Grounded Financial Retrieval and Extraction

Charles Moslonka, Amaury de Vitry, Arthur Garnier, Hicham Randrianarivo, Emmanuel Malherbe

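Affiliations: Artefact Research Center; MICS, CentraleSupélec, Université Paris-Saclay; Ardian

AI summary: Proposes the LEDGER benchmark of 4,999 digitized corporate annual reports for evaluating large language models on long-context financial tasks, covering KPI retrieval, single-value lookup, and full extraction.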

Comments 5 pages, 1 figure

Abstract

Finance reporting is a natural proving ground for large language models, and the very-long-context capabilities of recent models across all sizes make rigorous evaluation in this domain an increasingly pressing need. Yet most public financial resources reduce the task to plain-text SEC 10-K filings paired with a handful of question-answer items. We release LEDGER (Long-context Evaluation of Documents for Grounded Extraction and Retrieval), a corpus of 4,999 digitized corporate annual reports - full documents with figures, tables, and narrative, not just regulatory filings. Each report is labeled with 31 consolidated financial KPIs to be extracted and linked to the market's reaction at the earnings date. From this data we derive three evaluation benchmarks spanning the difficulty spectrum: a pure page-level KPI retrieval task with TREC-style relevance judgments over 118,048 questions in natural language, a conversational "needle-in-a-haystack" single-value lookup, and a full KPI extraction task, both from long, numerically dense reports. We additionally provide human OCR-quality annotations with inter-annotator agreement and the complete extraction, validation, and scoring toolchain. We further demonstrate the dataset's research utility with a case study linking CEO-letter rhetoric to post-publication market impact.

2606.13096 2026-06-12 cs.CV New submission

Unified MRI Brain Image Translation via Hierarchical Tumor Structure Comparison

Yupeng Cai, Jia Wei, Jianlong Zhou

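Affiliations: South China University of Technology; UTS Data Science Institute, University of Technology Sydney

AI summary: Proposes HTSCGAN, which uses hierarchical tumor structure comparison and multiple loss functions to improve multi-modal MRI brain image translation, with strong results on BraTS2020 and BraTS2021.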

Abstract

Multi-modal MRI brain image translation via available modalities holds significant practical importance in modern medicine, providing robust support for early diagnosis, treatment planning, and outcome assessment of diseases. For this purpose, it is important to ensure the fidelity of the tumor regions after translation. However, existing brain image translation methods ignore the structural information of different tumor regions, which could help translation models enhance the quality and clinical applicability of the translated images. In this work, we propose a novel translation model called HTSCGAN, a unified multi-modal brain image translation generative adversarial model that integrates the structural information within tumor regions to improve the quality of brain image translation. Specifically, the generator employs three Patch Contrast Modules (PCMs) with different patch sizes to capture the hierarchical structural information of the tumor regions. In addition, a pretrained Patch Classifier (PC) and a pretrained Structure-Aware Encoder (SAE) are employed so that the generated image contains the same tumor region structure as the ground truth image, via a patch classification loss and a tumor perceptual loss, respectively. Experiments on BraTS2020 and BraTS2021 demonstrate strong performance of our model in both translation tasks and downstream segmentation tasks, highlighting its effectiveness in enhancing the quality and clinical relevance of the translated brain images. Our code is available at https://anonymous.4open.science/r/HTSCGAN.

2606.13082 2026-06-12 cs.CL 新提交

sebis at CRF Filling 2026: A Two-Stage Local LLM Pipeline for Medical CRF Filling

sebis at CRF Filling 2026: 用于医疗CRF填写的两阶段本地LLM流水线

Katharina Sommer, Tristan Till, Florian Matthes

发表机构 * Technical University of Munich(慕尼黑工业大学)

AI总结 提出基于MedGemma-27B的两阶段本地流水线,分离二值存在分类与值提取,通过少样本上下文学习实现隐私保护,在CRF填写任务上取得0.55 macro-F1,排名第二。

Comments Published in Proceedings of the Third Workshop on Patient-Oriented Language Processing (CL4Health), LREC 2026

详情
AI中文摘要

从非结构化电子健康记录中提取结构化临床信息是医疗信息学中一个持续存在的瓶颈。虽然大型语言模型(LLM)提供了高性能,但它们在临床环境中的部署受到隐私风险、推理成本以及超出文本证据产生幻觉的倾向的阻碍。我们针对CL4Health 2026病例报告表(CRF)填写任务,通过提出一个完全本地化、领域自适应的流水线来解决这些挑战,该流水线使用MedGemma-27B模型。我们的两阶段架构将二值存在分类与值提取分离,强制严格遵守文本证据,并确保对否定、不确定或未知状态产生确定性输出。通过利用特定项目的少样本上下文学习,无需外部API调用或微调,我们的方法在官方英语测试轨道上实现了0.55的宏F1分数。这一结果在所有本地托管、开源提交中排名第二。我们的工作表明,保护隐私的本地LLM流水线可以实现与专有前沿模型接近的性能,为临床NLP提供了一个实用、数据主权的框架。

英文摘要

The extraction of structured clinical information from unstructured EHR notes is a persistent bottleneck in healthcare informatics. While large language models (LLMs) offer high performance, their deployment in clinical settings is hindered by privacy risks, inference costs, and the tendency to hallucinate beyond textual evidence. We address these challenges for the CL4Health 2026 Case Report Form (CRF) filling task by proposing a fully local, domain-adapted pipeline using the MedGemma-27B model. Our two-stage architecture, which separates binary presence classification from value extraction, enforces strict adherence to textual evidence and ensures deterministic outputs for negated, uncertain, or unknown states. By leveraging item-specific, few-shot in-context learning without external API calls or fine-tuning, our approach achieves a macro-F1 score of 0.55 on the official English test track. This result secures second place among all locally-hosted, open-source submissions. Our work demonstrates that privacy-preserving, on-premise LLM pipelines can achieve near-competitive performance with proprietary frontier models, providing a practical, data-sovereign framework for clinical NLP.
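The two-stage control flow described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the status labels, the `fill_item` function, the verbatim-support check, and the keyword-based stand-ins for the two MedGemma-27B calls are all assumptions made here for the example.

```python
from typing import Callable, Optional

# Hypothetical status labels; the abstract does not list the exact label set.
PRESENT, NEGATED, UNCERTAIN, UNKNOWN = "present", "negated", "uncertain", "unknown"


def fill_item(note: str, item: str,
              classify: Callable[[str, str], str],
              extract: Callable[[str, str], Optional[str]]) -> str:
    """Stage 1 decides presence; stage 2 runs only for items judged present."""
    status = classify(note, item)
    if status != PRESENT:
        # Deterministic output: no free-text generation for these states.
        return status
    value = extract(note, item)
    # Assumed evidence check: reject values not literally found in the note.
    if value is None or value.lower() not in note.lower():
        return UNKNOWN
    return value


# Toy stand-ins for the two LLM calls (keyword rules, not a model).
def toy_classify(note: str, item: str) -> str:
    text = note.lower()
    if item not in text:
        return UNKNOWN
    if f"no {item}" in text or f"denies {item}" in text:
        return NEGATED
    return PRESENT


def toy_extract(note: str, item: str) -> Optional[str]:
    for sentence in note.split("."):
        if item in sentence.lower():
            return sentence.strip()
    return None


note = "Patient reports fever for three days. Denies cough."
print(fill_item(note, "fever", toy_classify, toy_extract))  # Patient reports fever for three days
print(fill_item(note, "cough", toy_classify, toy_extract))  # negated
print(fill_item(note, "rash", toy_classify, toy_extract))   # unknown
```

Keeping the presence decision separate means the extraction step is never asked to produce a value for an item the note does not support, which is one plausible way to read "deterministic outputs for negated, uncertain, or unknown states".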

2606.13081 2026-06-12 cs.LG cs.AI 新提交

Emotional regulation improves deep learning-based image classification

情绪调节改善基于深度学习的图像分类

Riccardo Emanuele Landi, João M. F. Rodrigues, Marta Chinnici

发表机构 * Mare Group(Mare集团) NOVA LINCS(NOVA LINCS实验室) Institute of Engineering (ISE), University of Algarve(阿尔加维大学工程学院) Department of Energy Technologies and Renewable Sources, ENEA Casaccia Research Center(ENEA卡萨恰研究中心能源技术与可再生能源部)

AI总结 提出情绪调节框架,通过人工主观体验在深度学习中建模情绪,在图像分类任务中预训练ResNet和ViT,在CIFAR-10/100上超越现有方法,成为情绪增强深度学习的新标杆。

详情
AI中文摘要

情绪显著影响认知,能在特定条件下增强记忆和学习。基于这一原理,情绪增强深度学习研究情感状态如何改善神经网络架构和学习范式,实现比非情绪模型更好的泛化。然而,现有方法通常仅依赖客观神经生理因素,忽视了情绪的主观性。为弥补这一差距,本研究引入情绪调节,一种通过人工主观体验在深度学习中建模情绪的新框架。该方法采用基于情感刺激的预训练,在下游任务优化中平衡非情绪和情绪影响响应。在图像分类中进行了广泛实验,在四个情感数据集上预训练ResNet和ViT架构,以CIFAR-10和CIFAR-100作为目标基准。结果显示,相比上述骨干网络有改进,证明情绪调节是通过人工主观体验定义情绪增强深度学习的有前景方法。此外,所提方法超越了基于CIFAR的图像分类相关工作,揭示情绪调节成为大规模视觉数据集上情绪增强深度学习的新标杆。研究还提供了情感状态改善机器学习任务优化的证据,鼓励进一步探索情绪启发架构。

英文摘要

Emotion significantly influences cognition, enhancing memory and learning under certain conditions. Drawing on this principle, emotion-augmented deep learning investigates how affective states can improve neural network architectures and learning paradigms, achieving better generalization than non-emotional models. However, existing methods often rely solely on objective neurophysiological factors, neglecting the role of subjectivity in emotion. To bridge this gap, the present study introduces Emotional Regulation, a novel framework for modeling emotion in deep learning through artificial subjective experience. The method employs pre-training based on affective stimuli, balancing non-emotional and emotionally influenced responses in downstream task optimization. Extensive experimentation was conducted in image classification, pre-training ResNet and ViT architectures on four emotional datasets, using CIFAR-10 and CIFAR-100 as target benchmarks. Results reveal improvements over the aforementioned backbones, providing evidence of Emotional Regulation as a promising method for defining emotion-augmented deep learning through artificial subjective experience. Furthermore, the proposed approach outperforms related work on CIFAR-based image classification, establishing Emotional Regulation as the new state of the art in emotion-augmented deep learning for large-scale vision datasets. The study also reinforces evidence that affective states improve the optimization of machine learning tasks, encouraging further investigation into emotion-inspired architectures.

2606.13067 2026-06-12 cs.LG 新提交

Limits of spectral learning under noise

噪声下谱学习的极限

Sabin Roman, Ljupco Todorovski, Saso Dzeroski, Marta Sales-Pardo, Roger Guimera

发表机构 * Jožef Stefan Institute(约瑟夫·斯特凡研究所) Faculty of Mathematics and Physics, University of Ljubljana(卢布尔雅那大学数学与物理学院) Department of Chemical Engineering, Universitat Rovira i Virgili(罗维拉-威尔吉利大学化学工程系) Center for Computational Science and Applied Mathematics (ComSCIAM), Universitat Rovira i Virgili(罗维拉-威尔吉利大学计算科学与应用数学中心) ICREA(加泰罗尼亚研究与高等研究院)

AI总结 研究监督回归中加性标签噪声对谱方法的影响,推导出噪声导致系数漂移的闭合表达式,揭示了由单一内在噪声尺度控制的通用退化曲线。

详情
AI中文摘要

从含噪数据中学习函数关系是科学推理的核心问题。谱方法通过将未知函数在基函数上展开并从数据中估计相应系数来逼近函数,但这些系数在噪声下的稳定性仍知之甚少。本文研究使用稀疏谱表示在多个基和维度下进行带加性标签噪声的监督回归。我们表明,噪声会导致学习到的系数向量发生可预测的漂移,其大小取决于有效活跃谱模式的数量。在对经验特征几何进行白化后,我们推导出含噪与无噪系数向量之间重叠的闭合表达式,揭示了一条由单一内在噪声尺度控制的通用退化曲线。在傅里叶、勒让德、贝塞尔和哈尔基上的数值实验证实了理论预测。结果表明,谱学习存在一个基本噪声阈值,超过该阈值系数估计变得不稳定,从而对从含噪数据中恢复函数结构施加了内在限制。

英文摘要

Learning functional relationships from noisy data is a central problem in scientific inference. Spectral methods approximate unknown functions by expanding them in a basis and estimating the corresponding coefficients from data, but the stability of these coefficients under noise remains poorly understood. Here we study supervised regression with additive label noise using sparse spectral representations across multiple bases and dimensions. We show that noise induces a predictable drift in the learned coefficient vector whose magnitude depends on the effective number of active spectral modes. After whitening the empirical feature geometry, we derive a closed-form expression for the overlap between noisy and noiseless coefficient vectors, revealing a universal degradation curve governed by a single intrinsic noise scale. Numerical experiments across Fourier, Legendre, Bessel, and Haar bases confirm the theoretical prediction. The results demonstrate that spectral learning exhibits a fundamental noise threshold beyond which coefficient estimates become unstable, placing intrinsic limits on recovering functional structure from noisy data.
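The overlap between noisy and noiseless coefficient vectors can be explored numerically with a small experiment. The sketch below is an illustration under assumptions chosen here (a uniform grid, a Fourier basis with 20 modes, a two-mode target, Gaussian label noise, cosine similarity as the overlap); it does not reproduce the paper's closed-form expression or its whitening step.

```python
import math
import random


def fourier_coeffs(y, n_modes):
    """Least-squares Fourier coefficients on a uniform grid over [0, 2*pi).

    On such a grid the sine/cosine features are orthogonal, so the
    least-squares solution reduces to a projection onto each basis function.
    """
    n = len(y)
    xs = [2 * math.pi * j / n for j in range(n)]
    coeffs = []
    for k in range(1, n_modes + 1):
        coeffs.append(2 / n * sum(v * math.cos(k * x) for v, x in zip(y, xs)))
        coeffs.append(2 / n * sum(v * math.sin(k * x) for v, x in zip(y, xs)))
    return coeffs


def overlap(a, b):
    """Cosine similarity between two coefficient vectors."""
    dot = sum(p * q for p, q in zip(a, b))
    norm = math.sqrt(sum(p * p for p in a)) * math.sqrt(sum(q * q for q in b))
    return dot / norm


rng = random.Random(0)
n, n_modes = 256, 20
xs = [2 * math.pi * j / n for j in range(n)]
# Sparse ground truth: only two active modes (an illustrative choice).
clean = [math.cos(3 * x) + 0.5 * math.sin(7 * x) for x in xs]
clean_coeffs = fourier_coeffs(clean, n_modes)

# Print the overlap for increasing label-noise standard deviations.
for sigma in (0.0, 0.5, 2.0, 8.0):
    noisy = [v + rng.gauss(0.0, sigma) for v in clean]
    print(sigma, round(overlap(clean_coeffs, fourier_coeffs(noisy, n_modes)), 3))
```

Raising `n_modes` while keeping the target fixed spreads the same noise over more estimated coefficients, which is a simple way to see why the number of active modes matters for the drift.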

2606.13061 2026-06-12 cs.CV 新提交

LaME: Learning to Think in Latent Space for Multimodal Embedding via Information Bottleneck

LaME: 通过信息瓶颈在潜在空间中进行多模态嵌入的推理学习

Peixi Wu, Biao Yang, Feipeng Ma, Bosong Chai, Bo Lin, Wei Yuan, Fan Yang, Tingting Gao, Hebei Li, Xiaoyan Sun

发表机构 * University of Science and Technology of China(中国科学技术大学) Kuaishou Technology(快手科技) Zhejiang University(浙江大学) Tsinghua University(清华大学)

AI总结 提出LaME方法,将面向嵌入的潜在推理建模为弱监督信息瓶颈,使用可学习推理令牌在单次前向传播中完成推理,避免显式CoT的高计算成本和标注依赖,实现60倍加速。

详情
AI中文摘要

基于推理的通用多模态嵌入通过将思维链(CoT)推理引入嵌入流程取得了快速进展。尽管在通用和复杂任务上表现强劲,该范式存在两个核心限制:(i) 自回归CoT推理计算成本高,使其不适用于低延迟检索;(ii) 嵌入性能与CoT标注质量高度耦合,导致大规模训练不可靠。这些引出了基本问题:文本CoT是否是嵌入的最优推理形式,以及有效的嵌入推理能否在潜在空间中完成?为此,我们提出LaME(潜在推理多模态嵌入),将面向嵌入的潜在推理建模为弱监督信息瓶颈。LaME采用K个可学习推理令牌作为固定容量瓶颈,在单次前向传播中完成所有推理。两个弱监督信号在结构上解耦了对比目标和自回归目标,消除了对CoT标注的依赖,而两阶段训练流程确保了稳定收敛。在MMEB-v2和MRMR上的实验表明,LaME达到了有竞争力的性能,超越了某些显式CoT模型,同时推理速度比显式CoT方法快60倍,比潜在基线快2倍,吞吐量与判别式嵌入模型相当。代码将开源。

英文摘要

Reasoning-driven universal multimodal embedding has advanced rapidly by introducing Chain-of-Thought (CoT) reasoning into the embedding pipeline. Despite the strong performance across both general and complex tasks, this paradigm suffers from two core limitations: (i) autoregressive CoT reasoning incurs high computational cost, making it impractical for low-latency retrieval; and (ii) embedding performance is heavily coupled with CoT annotation quality, making large-scale training unreliable. These raise fundamental questions: Is textual CoT the optimal form of reasoning for embedding, and can effective embedding reasoning be accomplished in latent space? To this end, we propose LaME (Latent Reasoning Multimodal Embedding), which formulates embedding-oriented latent reasoning as a weakly supervised information bottleneck. LaME employs K learnable reason tokens as a fixed-capacity bottleneck, completing all reasoning within a single forward pass. The two weak supervision signals structurally decouple contrastive from autoregressive objectives and eliminate dependence on CoT annotations, while a two-stage training pipeline ensures stable convergence. Experiments on MMEB-v2 and MRMR show that LaME achieves competitive performance, surpassing some explicit CoT-based models, while delivering 60x faster inference than explicit CoT methods and 2x faster than latent baselines with throughput comparable to discriminative embedding models. Code will be released.

2606.13060 2026-06-12 cs.LG 新提交

A green solvent screening tool for emerging materials via uncertainty aware, transformer enhanced transfer learning

一种面向新兴材料的绿色溶剂筛选工具:基于不确定性感知、Transformer增强的迁移学习

Ioannis Kouroudis, Simon Ternes, Zhaosu Gu, Gohar Ali Siddiqui, Marina Ustinova, Angelo Lembo, Alessio Gagliardi, Aldo Di Carlo

发表机构 * Technical University of Munich(慕尼黑工业大学) Institute of Structure of Matter – National Research Council Rome (ISM-CNR)(罗马国家研究委员会物质结构研究所) University of Rome "Tor Vergata"(罗马第二大学)

AI总结 提出一种结合预训练Transformer模型和不确定性量化的迁移学习方法,在极少数据下高精度预测溶解度参数,并开发了可定制的绿色溶剂筛选工具。

详情
AI中文摘要

溶解度的准确预测仍然是材料科学和可持续化学中的一个核心挑战。特别是由于有机和混合光伏、电池、催化等新兴技术,溶剂使用量预计在未来几年将显著增加。因此,用更绿色的替代品取代溶剂至关重要。这正是机器学习可以产生重大影响的地方。然而,溶解度关键参数的数据有限,严重制约了机器学习的效能。在这项工作中,我们将预训练的QM9基础模型迁移到我们的应用中,所需数据极少。此外,该流程集成了不确定性量化,允许用户评估预测的置信度。作为基线,我们成功预测了存在大量数据库的汉森溶解度参数和介电常数。重要的是,我们在其他目标(如Gutmann供体和受体数)上实现了高模型性能,而这些目标的可获得数据极为有限。总体而言,我们通过高质量预测将溶解度描述符的数据量提高了数个数量级。为了有效传播,我们部署了一个易于使用、易于与高通量实验室集成、可定制的工具,用于排序和筛选可能的溶剂替代品。最后,我们重新发现了已知的绿色溶剂替代品,并提出了新的候选者,证明了其在寻找环保溶剂方面的相关性。

英文摘要

Accurate prediction of solubility remains a central challenge across materials science and sustainable chemistry. In particular, due to emerging technologies such as organic and hybrid photovoltaics, batteries, and catalysis, solvent usage is expected to increase significantly in the coming years. Therefore, substituting solvents with greener alternatives is vital. This is where machine learning can have a substantial impact. However, the limited data on critical solubility parameters significantly constrains machine learning efficacy. In this work, we transfer a foundation model pre-trained on QM9 targets to our application with minimal data requirements. Additionally, the pipeline integrates uncertainty quantification, allowing the user to gauge the confidence of the predictions. As a baseline, we succeed in predicting the Hansen solubility parameters and the dielectric constant, for which extensive databases exist. Importantly, we achieve high model performance on additional targets, such as Gutmann donor and acceptor numbers, where the available data is extremely limited. Overall, we augment data on solubility descriptors by orders of magnitude with high-quality predictions. For effective dissemination, we deploy an easy-to-use, customizable tool for ranking and screening possible solvent substitutes that integrates easily with high-throughput labs. Finally, we rediscover known green solvent alternatives and propose new candidates, demonstrating the tool's relevance for finding eco-friendly solvents.

2606.13051 2026-06-12 cs.AI 新提交

AAbAAC: An Annotated Corpus for Autoimmunity Information Extraction

AAbAAC:用于自身免疫信息抽取的标注语料库

Fabien Maury, Solène Grosdidier, Maud de Dieuleveult, Adrien Coulet

发表机构 * Inserm, Université Paris Cité, U1163 Institut Imagine(法国国家健康与医学研究院、巴黎西岱大学、U1163 想象研究所) Inria, Inserm, Université Paris Cité, U1346 HeKA(法国国家信息与自动化研究所、法国国家健康与医学研究院、巴黎西岱大学、U1346 HeKA) Freelance researcher(自由研究员)

AI总结 针对自身免疫领域信息抽取性能不足,构建了包含115篇PubMed摘要的AAbAAC语料库,手动标注实体和关系,通过微调NER模型验证了其有效性。

详情
Journal ref
BioNLP 2026 - 25th Workshop on Biomedical Natural Language Processing, ACL, Jul 2026, San Diego (CA), United States
AI中文摘要

尽管深度学习和大型语言模型推动了信息抽取的进步,但在高度专业化的生物医学领域,领域特异性复杂性对通用模型构成挑战,性能差距仍然存在。本文聚焦自身免疫领域,其中主要关注实体包括自身免疫疾病、自身抗体(即可能标记或导致这些疾病的分子)、其分子靶点、在体内的位置以及相关临床体征。我们提出了AAbAAC(自身抗体与自身免疫标注语料库),该语料库包含从PubMed精选的115篇摘要,并手动标注了实体及其关系。首先,AAbAAC被用于评估多种方法在命名实体识别(NER)任务上的表现;其次,用于微调NER模型。我们的研究展示了AAbAAC在自身免疫领域信息抽取中的实用性,表明微调后NER性能预期提升。这说明了小规模标注工作对专业领域的价值,并为自身免疫的计算研究做出了贡献。AAbAAC语料库可通过 https://github.com/f-maury/AAbAAC 获取。

英文摘要

Despite advances in information extraction driven by deep learning and large language models, performance gaps remain in highly specialized biomedical fields, where domain-specific complexity poses challenges for generalist models. In this work, we focus on the domain of autoimmunity, where the main entities of interest are autoimmune diseases, autoantibodies (i.e., molecules that may mark or cause these diseases), their molecular targets, their location in the body, and their associated clinical signs. Herein, we present AAbAAC (AutoAntibodies and Autoimmunity Annotated Corpus), a corpus of 115 abstracts selected from PubMed, in which we manually annotated entities and their relationships. AAbAAC was used first to evaluate several methods on the task of named entity recognition (NER), and second to fine-tune NER models. Our study demonstrates the utility of AAbAAC for information extraction in the domain of autoimmunity, showing the expected improvement in NER performance after fine-tuning. This illustrates the value of small-scale annotation efforts for specialized domains and contributes to the computational study of autoimmunity. The AAbAAC corpus is available at https://github.com/f-maury/AAbAAC.

2606.13049 2026-06-12 cs.RO 新提交

Y-BotFrame: An Extensible Embodied Agent Framework for Quadruped Robot Assistants

Y-BotFrame:一种用于四足机器人助手的可扩展具身智能体框架

Luyao Zhang, Ke Li, Yuan Ding, Xulong Zhao, Guo Yu, Chengwei Yan, Fuyu Dong, Jiawei Hu, Di Wang, Nan Luo, Gang Liu, Quan Wang

发表机构 * Xidian University(西安电子科技大学)

AI总结 提出Y-BotFrame框架,集成多模态感知与大语言模型认知核心,将自然语言指令映射为可执行任务单元,实现无遥控器的人机协作,支持模块化扩展。

详情
AI中文摘要

四足机器人能够以高灵活性穿越各种复杂地形。作为高机动性的地面智能平台,它们可以配备导航控制、环境感知和智能交互模块,从而成为各种算法在现实世界中的移动部署平台。本文介绍了Y-BotFrame,一个可扩展的具身平台,它将机器人转变为智能地面助手。Y-BotFrame集成了多模态感知能力,包括语音、视觉和激光雷达,并采用大语言模型作为环境理解、上下文推理和任务规划的认知核心。该系统将用户的自然语言指令映射为机器人可执行的具体任务单元。Y-BotFrame通过语音命令和视觉反馈支持自然交互,无需遥控器即可实现高效的人机协作。凭借高度可扩展的框架,Y-BotFrame支持新功能模块的即插即用集成以及模块化升级和迭代开发,为通用、指令驱动的具身智能体在现实世界中的部署提供了参考实现。补充视频见 https://xdei-group.github.io/Y-BotFrame/。

英文摘要

Quadruped robots are capable of traversing a wide range of complex terrains with high flexibility. As highly mobile ground-based intelligent platforms, they can be equipped with modules for navigation control, environmental perception, and intelligent interaction, thereby serving as real-world mobile deployment platforms for various algorithms. In this paper, we introduce Y-BotFrame, an extensible embodied platform that turns a robot into an intelligent ground assistant. Y-BotFrame integrates multimodal perception capabilities, including speech, vision, and LiDAR, and employs a large language model as the cognitive core for environmental understanding, contextual reasoning, and task planning. The system maps user natural-language instructions into executable embodied task units that can be carried out by the robot. Y-BotFrame supports natural interaction through voice commands and visual feedback, removing the need for a remote controller and enabling efficient human-robot collaboration. With a highly extensible framework, Y-BotFrame supports plug-and-play integration of new functional modules as well as modular upgrades and iterative development, offering a reference implementation for the real-world deployment of general-purpose, instruction-driven embodied agents. The supplementary video is available at https://xdei-group.github.io/Y-BotFrame/.

2606.13044 2026-06-12 cs.CL 新提交

No Hidden Prompts Needed! You Can Game AI Peer Review with Presentation-Only Revisions

无需隐藏提示!仅通过展示性修改即可欺骗AI同行评审

Xu Yang, Zhizhou Sha, Junbo Li, Jian Yu, Yifan Sun, Matthew Zhao, Jinrui Fang, Xinyue Guo, Yining Wu, Xu Hu, Yifu Luo, Qiang Liu, Zhangyang Wang

发表机构 * University of Texas at Austin(德克萨斯大学奥斯汀分校) University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) University of Texas at Dallas(德克萨斯大学达拉斯分校) Independent Researcher(独立研究者)

AI总结 研究通过仅修改论文的展示层面(如摘要、贡献框架等)而不改变科学内容,利用AI评审反馈进行对抗性重打包,成功提升评分,揭示AI评审易被表面印象误导的结构性缺陷。

Comments 35 pages, 5 figures

详情
AI中文摘要

随着AI生成的评审从实验工具转向同行评审基础设施,大多数鲁棒性问题集中在显式攻击上,如隐藏指令和提示注入。我们研究了一个更难且更具政策相关性的失败模式:无隐藏文本、无提示注入,且不改变方法、实验、图表、方程、证明或数值结果。攻击者仅修改展示层面的内容,如摘要、贡献框架、相关工作、讨论和叙事结构。我们引入了对抗性重打包:一种闭环攻击,利用AI评审反馈搜索展示层面的修订,同时保持科学证据不变。在三个主流AI评审器上,对抗性重打包实现了75.1%的攻击成功率和平均+1.21/10的分数提升。这种效果不能用普通的散文润色来解释。我们还揭示,改变评审者对论文解读方式的策略(如相关工作重新定位和分析性讨论扩展)显著优于表面编辑(如局部润色、表格格式和算法框)。我们的分析揭示了两个更深层次的结构性失败模式。首先,AI评审者更容易被打动而非说服:突出优点可靠地增加感知价值,而试图消除弱点常常适得其反。其次,AI评审者可能混淆了表面解决局限性与实际解决局限性,使得未改变的证据被重新解释为更强的科学贡献。这些结果表明,部署风险不仅在于恶意的隐藏指令,还在于论文展示本身作为优化表面的出现。我们发布了一个无污染滚动基准和攻击框架,用于测试AI评审者在仅展示层面编辑下是否仍锚定于科学内容。

英文摘要

As AI-generated reviews move from experimental tools into peer-review infrastructure, most robustness concerns have focused on explicit attacks such as hidden instructions and prompt injection. We study a harder and more policy-relevant failure mode: no hidden text, no prompt injection, and no changes to methods, experiments, figures, equations, proofs, or numerical results. The attacker modifies only presentation-level content, such as the abstract, contribution framing, related work, discussion, and narrative structure. We introduce adversarial repackaging: a closed-loop attack that uses AI-reviewer feedback to search for presentation-level revisions while keeping the scientific evidence fixed. Across three mainstream AI reviewers, adversarial repackaging achieves a 75.1% attack success rate and a mean score gain of +1.21/10. The effect is not explained by ordinary prose polishing. We also reveal that strategies that change how the reviewer interprets the paper, such as related-work repositioning and analytical discussion expansion, substantially outperform surface edits such as local polishing, table formatting, and algorithm boxes. Our analysis reveals two deeper structural failure modes. First, AI reviewers are easier to impress than to convince: highlighting strengths reliably increases perceived merit, while attempts to dissolve weaknesses frequently backfire. Second, AI reviewers can confuse the appearance of addressing a limitation with actually resolving it, allowing unchanged evidence to be reinterpreted as stronger scientific contribution. These results show that the deployment risk is not only malicious hidden instructions, but the emergence of paper presentation itself as an optimization surface. We release a contamination-free rolling benchmark and attack framework for testing whether AI reviewers remain anchored to scientific content under presentation-only edits.

2606.13042 2026-06-12 cs.AI cs.CV 新提交

Augmentation techniques for video surveillance in the visible and thermal spectral range

可见光和热红外光谱范围内视频监控的增强技术

Vanessa Buhrmester, Ann-Kristin Grosselfinger, David Munch, Michael Arens

发表机构 * Fraunhofer Institute of Optronics, System Technologies and Image Exploitation IOSB(弗劳恩霍夫光学、系统技术与图像处理研究所)

AI总结 针对多光谱CNN目标检测,研究可见光与热红外图像差异,探索数据增强技术对分类精度的影响,以提升监控性能。

Comments 8 pages

详情
Journal ref
SPIE Security + Defence, Strasbourg, 10th September 2019
AI中文摘要

在智能视频监控中,摄像机在白天和夜晚记录图像序列。通常,这需要不同的传感器。为了获得更好的性能,将它们结合起来并不罕见。我们关注的情况是,长波红外摄像机连续记录,此外,另一台摄像机在白天记录可见光谱范围内的图像,并且智能算法监控采集的图像。更准确地说,我们的任务是基于多光谱CNN的目标检测。乍一看,可见光谱范围内的图像与热红外图像的区别在于,前者具有颜色和清晰的纹理信息,但不包含物体发出的热辐射信息。尽管颜色可以为分类任务提供有价值的信息,但诸如光照变化和不同传感器的特性等因素仍然构成重大问题。无论如何,获取足够且实用的热红外数据集来训练深度神经网络仍然是一个挑战。这就是为什么借助可见光谱范围内的数据进行训练可能是有利的,特别是当待评估的数据同时包含可见光和红外数据时。然而,目前尚不清楚热辐射、形状或颜色信息的变化对分类精度的影响有多大。为了更深入地了解卷积神经网络如何做出决策以及它们从不同传感器输入数据中学到什么,我们研究了不同增强技术的适用性和鲁棒性。

英文摘要

In intelligent video surveillance, cameras record image sequences during day and night. Commonly, this demands different sensors, and it is not unusual to combine them to achieve better performance. We focus on the case in which a long-wave infrared camera records continuously, a second camera records in the visible spectral range during daytime, and an intelligent algorithm monitors the captured imagery. More precisely, our task is multispectral CNN-based object detection. At first glance, images from the visible spectral range differ from thermal infrared ones in that they contain color and distinct texture information but no information about the thermal radiation emitted by objects. Although color can provide valuable information for classification tasks, effects such as varying illumination and the characteristics of different sensors still represent significant problems. In any case, obtaining sufficient and practical thermal infrared datasets for training a deep neural network still poses a challenge. For this reason, training with the help of data from the visible spectral range could be advantageous, particularly if the data to be evaluated contains both visible and infrared imagery. However, there is no clear evidence of how strongly variations in thermal radiation, shape, or color information influence classification accuracy. To gain deeper insight into how Convolutional Neural Networks make decisions and what they learn from different sensor input data, we investigate the suitability and robustness of different augmentation techniques...

2606.13041 2026-06-12 cs.CV cs.GR cs.MM 新提交

SeamEdit: A Black-Box VLM-Agnostic Pipeline for Large-Image Semantic Editing

SeamEdit: 一种用于大图像语义编辑的黑盒VLM无关流水线

Xiangyu Lyu, Dan Lei

发表机构 * Technische Universität Darmstadt(达姆施塔特工业大学) Fine-Arts Educator, Yuncheng Middle School(运城中学美术教师)

AI总结 提出SeamEdit,一种无需训练、模型无关的流水线,通过五阶段后处理解决大图像分块编辑中的语义变形、对齐漂移和接缝伪影问题,实现高质量语义编辑。

Comments 19 pages, 9 figures, 2 tables

详情
AI中文摘要

大图像的语义区域编辑必须同时满足两个要求:高生成质量和与周围内容的自然融合。一些相关方法依赖于白盒模型,而忽略了闭源模型的强大生成能力。然而,直接将闭源模型应用于分块编辑会引入几种失败模式:语义变形、画布级对齐漂移和可见接缝伪影。本文提出SeamEdit,一种无需训练且模型无关的流水线,将任何具有修补能力的VLM视为黑盒预言机。SeamEdit通过五阶段后处理流水线缓解这些问题:基于覆盖的分块分解、黑盒VLM修补、几何和颜色一致性校正、基于接缝风险的多候选排序以及动态规划曲线接缝融合。该流水线降低了接缝可见性,并支持任意分块区域的语义修改。

英文摘要

Semantic region editing for large images must satisfy two requirements at the same time: high generative quality and natural integration with surrounding content. Some related methods rely on white-box models and leave the strong generation capability of closed-source models underexplored. Directly applying closed-source models to tiled editing, however, introduces several failure modes: semantic deformation, canvas-level alignment drift, and visible seam artifacts. This paper presents SeamEdit, a training-free and model-agnostic pipeline that treats any VLM with inpainting capability as a black-box oracle. SeamEdit mitigates these issues through a five-stage post-hoc pipeline: overlay-based tile decomposition, black-box VLM inpainting, geometric and color-consistency correction, seam-risk-based multi-candidate ranking, and dynamic-programming curved seam fusion. The pipeline reduces seam visibility and supports semantic modification of arbitrary tile regions.
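The final stage named above, dynamic-programming seam fusion, builds on a well-known idea: within the overlap of two tiles, find a top-to-bottom path that crosses the pixels where the tiles disagree least. The sketch below shows that generic minimum-cost seam search; the function name `min_cost_seam`, the cost definition, and the one-column-per-row movement rule are assumptions for illustration, not details taken from SeamEdit.

```python
def min_cost_seam(cost):
    """Return one column index per row tracing the cheapest top-to-bottom seam.

    cost[r][c] is the penalty for cutting through pixel (r, c), for example the
    colour difference between two overlapping tiles at that pixel.
    The seam may shift by at most one column between consecutive rows.
    """
    rows, cols = len(cost), len(cost[0])
    total = [list(cost[0])]  # total[r][c]: cheapest cost of any seam ending at (r, c)
    back = []                # back[r-1][c]: column chosen in row r-1 for that seam
    for r in range(1, rows):
        prev, cur, ptr = total[-1], [], []
        for c in range(cols):
            neighbours = range(max(0, c - 1), min(cols, c + 2))
            best = min(neighbours, key=prev.__getitem__)
            cur.append(cost[r][c] + prev[best])
            ptr.append(best)
        total.append(cur)
        back.append(ptr)
    # Start from the cheapest end point and walk the back-pointers upward.
    col = min(range(cols), key=total[-1].__getitem__)
    seam = [col]
    for ptr in reversed(back):
        col = ptr[col]
        seam.append(col)
    return seam[::-1]


overlap_cost = [
    [9, 1, 9, 9],
    [9, 9, 1, 9],
    [9, 9, 9, 1],
    [9, 9, 1, 9],
]
print(min_cost_seam(overlap_cost))  # [1, 2, 3, 2]
```

Pixels left of the returned seam would come from one tile and pixels to the right from the other, so the cut follows a curve through low-difference regions rather than a straight tile edge.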

2606.13040 2026-06-12 cs.RO 新提交

RoboProcessBench: Benchmarking Process-Aware Understanding in Vision-Language Robotic Manipulation

RoboProcessBench:视觉语言机器人操作中的过程感知理解基准测试

Dayu Xia, Yue Shi, Yao Mu, Huiting Ji, Chaofan Ma, Yingjie Zhou, Hua Chen, Yang Liu, Jiezhang Cao, Guangtao Zhai

发表机构 * Shanghai AI Laboratory(上海人工智能实验室) Zhejiang University(浙江大学) Shanghai Jiao Tong University(上海交通大学) Tsinghua University(清华大学) China University of Mining Technology(中国矿业大学)

AI总结 提出RoboProcessBench基准,通过静态监控和动态推理两个维度、12个诊断问题家族,评估视觉语言模型在机器人操作中的过程感知理解能力,并基于58k问答对数据集验证了当前模型的局限性及后训练的有效性。

详情
AI中文摘要

视觉语言模型(VLM)正越来越多地被探索作为机器人操作中的视觉评判者、奖励生成器和故障检测器。这些角色隐含地要求模型不仅判断最终任务成功与否,还要判断操作执行在物理和时间上的进展。然而,现有评估未能测试VLM是否具备细粒度的过程理解。为填补这一空白,我们提出了RoboProcessBench,一个用于视觉语言机器人操作中过程感知理解的基准测试。RoboProcessBench将这种能力分解为两个互补维度:静态监控和动态推理,具体化为12个诊断问题家族,涵盖阶段、接触、运动、协调、原始局部进展、时间顺序、结果和原始级转换。基于物理基础的执行轨迹,构建的基准语料库ProcessData包含约58k个问答对,涵盖260个操作任务,进一步分为ProcessData-SFT和ProcessData-Eval,分别用于后训练和评估。对ProcessData-Eval上各种VLM的广泛评估揭示了12个诊断任务家族的普遍局限性,表明当前模型仍缺乏对操作执行的鲁棒过程感知理解。但通过ProcessData-SFT,后训练的Qwen2.5-VL-7B和InternVL-3-8B在局部状态、运动、进展和原始级线索上表现出持续改进。这些结果表明,RoboProcessBench既可作为评估基准,也可作为可学习的监督源,用于开发能够监控和评估机器人操作过程的VLM。项目网页:https://processbench-2026.github.io/RoboProcessBench-Web/。

英文摘要

Vision-language models (VLMs) are increasingly explored as visual critics, reward generators, and failure detectors in robotic manipulation. These roles implicitly require models to judge not only final task success, but also how a manipulation execution is physically and temporally progressing. However, existing evaluations fail to test whether VLMs possess fine-grained process understanding. To address this gap, we present RoboProcessBench, a benchmark for process-aware understanding in vision-language robotic manipulation. RoboProcessBench decomposes such capability into two complementary dimensions, static monitoring and dynamic reasoning, instantiated as 12 diagnostic question families covering phase, contact, motion, coordination, primitive-local progress, temporal order, outcome, and primitive-level transitions. Built from physically grounded execution traces, the curated benchmark corpus ProcessData contains ~58k question-answer pairs across 260 manipulation tasks, and is further split into ProcessData-SFT and ProcessData-Eval for post-training and evaluation purposes. Extensive evaluation of various VLMs on ProcessData-Eval reveals broad limitations across the 12 diagnostic task families, suggesting current models still lack robust process-aware understanding of manipulation executions. With ProcessData-SFT, however, the post-trained Qwen2.5-VL-7B and InternVL-3-8B exhibit consistent gains on local state, motion, progress, and primitive-aware cues. These results demonstrate that RoboProcessBench serves as both an evaluation benchmark and a learnable supervision source for developing VLMs capable of monitoring and evaluating robotic manipulation processes. Project webpage: https://processbench-2026.github.io/RoboProcessBench-Web/

2606.13038 2026-06-12 cs.AI 新提交

Nous: An Attempt to Extract and Inject the Cognition Behind Prediction-Market Behavior

Nous: 提取并注入预测市场行为背后认知的尝试

Haowei Qian

发表机构 * Independent Researcher(独立研究员)

AI总结 针对LLM代理在预测市场中认知同质化问题,提出Nous方法从真实交易行为提取八维行为画像并注入提示,发现提取部分有效但提示注入无法传递认知多样性。

Comments 37 pages, 1 figure, 7 tables. Reproduction artifacts (code, frozen profiles, prompts, model outputs): https://github.com/WillChienT/nous-paper

详情
AI中文摘要

随着LLM代理在预测市场和集体决策中激增,它们面临认知同质化的风险:基于共享基础模型构建的代理产生相关预测,近期测量发现前沿模型错误相关性约为r~0.77。我们探究人类认知多样性是否可以从行为中恢复并转移到LLM代理。Nous从真实的Polymarket交易活动中提取结构化的八维行为画像,并通过提示注入到代理中。我们的核心发现是该流程的两半之间存在分离。提取部分有效:在100个钱包中,14个参数中有8个在时间上稳定(分半ICC >= 0.5,bootstrap CI下限>0.3;逆向得分达到ICC~0.9);钱包从其画像中被识别的概率远高于随机(top-1检索17-22% vs. 1%随机);四个预定义维度中的两个与样本外未来实现利润排名相关,尽管这些相关性在行为混杂控制后不成立。提示级注入无法可测量地传递:在语义嵌入指标上,结构化注入在任何模型上均未显示出比长度匹配控制组显著的优势,并且其诱导的多样性既未降低集成错误相关性,也未改善Brier分数——这一零结果在采样温度、画像多样性和问题难度的探索性检查中持续存在。测量提示本身定位了模型前的压缩:结构到叙述的翻译器发出近乎均匀的提示,其扩散不追踪画像扩散。我们将Nous定位为测量认知同质化问题及提示级补救措施的局限性,从而推动更深层次的提示下注入(微调、激活引导)。代码、冻结画像、提示和模型输出:https://github.com/WillChienT/nous-paper

英文摘要

As LLM agents proliferate in prediction markets and collective decision-making, they risk a cognitive monoculture: agents built on shared foundation models produce correlated forecasts, and recent measurement finds frontier-model errors correlated at r ~ 0.77. We ask whether human cognitive diversity can be recovered from behavior and transferred to LLM agents. Nous extracts a structured eight-dimension behavioral profile from real Polymarket trading activity and injects it into agents through prompts. Our central finding is a dissociation between the two halves of that pipeline. Extraction works, partially: across 100 wallets, 8 of 14 parameters are temporally stable (split-half ICC >= 0.5, bootstrap CI lower bound > 0.3; contrarian score reaches ICC ~ 0.9); wallets are identifiable from their profiles well above chance (top-1 retrieval 17-22% vs. 1% chance); and two of four pre-specified dimensions rank-correlate with future realized profit out-of-sample, though the correlations do not survive behavioral-confound controls. Prompt-level injection does not measurably transmit it: on a semantic embedding metric, structured injection shows no significant advantage over a length-matched control on any model, and the diversity it induces neither reduces ensemble error correlation nor improves Brier score -- a null that persists across exploratory checks on sampling temperature, profile diversity, and question difficulty. Measuring the prompts themselves locates the compression before the model: the structure-to-narrative translator emits near-uniform prompts whose spread does not track profile spread. We position Nous as measuring the cognitive-monoculture problem and the limits of a prompt-level remedy, motivating deeper, below-the-prompt injection (fine-tuning, activation steering). Code, frozen profiles, prompts, and model outputs: https://github.com/WillChienT/nous-paper
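Two of the quantities this abstract relies on, the Brier score and the correlation between agents' forecast errors, have standard definitions that are easy to state in code. The sketch below uses made-up forecasts for two hypothetical agents; the data and function names are illustrative and are not taken from the paper's artifacts.

```python
import math


def brier_score(probs, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def error_correlation(probs_a, probs_b, outcomes):
    """Correlation of signed forecast errors (forecast minus outcome)."""
    err_a = [p - o for p, o in zip(probs_a, outcomes)]
    err_b = [p - o for p, o in zip(probs_b, outcomes)]
    return pearson(err_a, err_b)


# Illustrative data: four resolved questions, two agents.
outcomes = [1, 0, 1, 0]
agent_a = [0.9, 0.2, 0.6, 0.4]
agent_b = [0.8, 0.3, 0.5, 0.5]

print(round(brier_score(agent_a, outcomes), 4))  # 0.0925
print(error_correlation(agent_a, agent_b, outcomes) > 0.99)  # True
```

When errors are this strongly correlated, averaging the two agents removes little error, which is why the abstract treats reduced ensemble error correlation as the test of whether injected diversity is real.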

2606.13035 2026-06-12 cs.CV cs.AI 新提交

TetherCache: Stabilizing Autoregressive Long-Form Video Generation with Gated Recall and Trusted Alignment

TetherCache: 基于门控召回与可信对齐的自回归长视频生成稳定性方法

Yu Meng, Xiangyang Luo, Letian Li, Wenyuan Jiang, Chen Gao, Xinlei Chen, Yong Li, Xiao-Ping Zhang

发表机构 * Tsinghua University(清华大学) D-INFK, ETH Zürich(苏黎世联邦理工学院计算机科学系)

AI总结 提出TetherCache,一种无需训练、即插即用的缓存管理策略,通过门控召回(GRAB)和可信对齐编辑(TAME)缓解自回归视频扩散模型中的上下文漂移,实现稳定长视频生成。

Comments 17 pages, 8 figures

详情
AI中文摘要

自回归视频扩散模型通过将新生成帧的条件建立在先前生成内容上,为流式变长视频生成提供了自然框架。然而,将这些模型扩展到分钟级生成仍具挑战:有限的KV缓存预算使模型无法保留完整历史,而反复以自生成帧为条件会导致上下文分布偏移随时间累积,引发视觉伪影、质量下降和时间漂移。本文提出TetherCache,一种无需训练、即插即用的缓存管理策略,用于抗漂移长视频生成。TetherCache将缓存组织为sink、memory和recent区域,并引入两种互补机制。首先,GRAB(基于注意力多样性平衡的门控召回)使用结合注意力相关性与时间多样性的门控分数选择长程记忆帧,在固定缓存预算下保留信息丰富且多样化的历史上下文。其次,TAME(通过记忆编辑的可信对齐)通过将新召回的记忆令牌的统计量对齐到可信上下文分布来对其进行轻量编辑,减少漂移历史特征造成的污染。基于Self-Forcing,TetherCache在VBench-Long的30秒、60秒和240秒设置上持续提升长视频生成质量。特别地,在240秒生成中,它显著提高了整体和语义分数,同时将质量漂移从7.84降至1.33,证明了其在稳定长程自回归视频扩散中的有效性。

英文摘要

Autoregressive video diffusion models provide a natural formulation for streaming and variable-length video generation by conditioning newly generated frames on previously generated content. However, extending these models to minute-level generation remains challenging: the limited KV-cache budget prevents the model from retaining the full history, while repeatedly conditioning on self-generated frames induces a context distribution shift that accumulates over time, leading to visual artifacts, quality degradation, and temporal drift. In this paper, we propose TetherCache, a training-free and plug-and-play cache management strategy for drift-resistant long video generation. TetherCache organizes the cache into sink, memory, and recent regions, and introduces two complementary mechanisms. First, GRAB (Gated Recall with Attention-Diversity Balancing) selects long-range memory frames using a gated score that combines attention-based relevance with temporal diversity, preserving informative yet diverse historical context under a fixed cache budget. Second, TAME (Trusted Alignment via Memory Editing) lightly edits newly recalled memory tokens by aligning their statistics to a trusted context distribution, reducing the pollution caused by drifted historical features. Built on Self-Forcing, TetherCache consistently improves long-video generation quality on VBench-Long across 30s, 60s, and 240s settings. In particular, for 240s generation, it substantially improves overall and semantic scores while reducing quality drift from 7.84 to 1.33, demonstrating its effectiveness for stable long-horizon autoregressive video diffusion.
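The two mechanisms can be illustrated with a toy version operating on plain lists. This is a loose sketch under assumptions made here: the abstract does not give GRAB's gating function or TAME's exact statistics, so the example substitutes a greedy weighted sum of relevance and temporal spread, and a mean/standard-deviation rescaling. The names `select_memory`, `align_stats`, and `diversity_weight` are hypothetical.

```python
import statistics


def select_memory(relevance, budget, diversity_weight=0.5):
    """Greedily pick `budget` frame indices, trading relevance against spread in time.

    relevance[i] is an attention-derived score for frame i (assumed given).
    The diversity bonus is the normalized distance to the nearest frame already picked.
    """
    n = len(relevance)
    chosen = []
    while len(chosen) < min(budget, n):
        def score(i):
            gap = min((abs(i - j) for j in chosen), default=n) / n
            return relevance[i] + diversity_weight * gap
        best = max((i for i in range(n) if i not in chosen), key=score)
        chosen.append(best)
    return sorted(chosen)


def align_stats(values, trusted):
    """Shift and rescale `values` to the mean and std of the `trusted` features."""
    mu, sd = statistics.fmean(values), statistics.pstdev(values)
    t_mu, t_sd = statistics.fmean(trusted), statistics.pstdev(trusted)
    if sd == 0:
        return [t_mu for _ in values]
    return [(v - mu) / sd * t_sd + t_mu for v in values]


relevance = [0.9, 0.85, 0.8, 0.1, 0.1, 0.1, 0.1, 0.6]
print(select_memory(relevance, budget=2))                        # [0, 7]
print(select_memory(relevance, budget=2, diversity_weight=0.0))  # [0, 1]
```

With the diversity term switched off, the budget is spent on adjacent, near-duplicate frames; with it on, a slightly less relevant but distant frame is kept instead, which is the trade-off the abstract describes for a fixed cache budget.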

2606.13033 2026-06-12 cs.CV New submission

SAM-Deep-EIoU: Selective Mask Propagation for Multi-Object Tracking

SAM-Deep-EIoU:面向多目标跟踪的选择性掩码传播

Alexander Holmberg

Affiliation * KTH Royal Institute of Technology

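AI summary Proposes a selective mask propagation algorithm that invokes a video object segmentation model only on high-uncertainty frames and otherwise relies on a lightweight base tracker. It improves performance on DanceTrack and SportsMOT, reaching 86.8 HOTA on SportsMOT.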

Details
AI Chinese abstract

多目标跟踪的难度分布呈重尾特性:大多数帧对于轻量级基跟踪器是容易的,而一小部分帧本质上是困难的。视频目标分割(VOS)模型通常能在基跟踪器失败的困难帧中保持身份,但其计算和内存成本高得多。我们提出选择性掩码传播,一种跟踪算法,仅在分配不确定性信号触发的窗口上从基跟踪器调度到VOS模型。仅当VOS模型做出与基跟踪器身份分配相矛盾的置信预测时,才修改基跟踪器的输出;弱或不确定的预测保留基输出。该方法无需训练,将基跟踪器和VOS模型均视为黑盒,并且可以通过用更强大的模型替换VOS组件而受益。在DanceTrack上,选择性掩码传播改进了三种不同的基跟踪器。在SportsMOT上,身份保持是体育分析的核心,使用全局轨迹关联的SAM3-Deep-EIoU以86.8 HOTA达到基准上的最先进性能。

English abstract

Multi-object tracking has a heavy-tailed difficulty distribution: most frames are easy for a lightweight base tracker, while a small fraction are intrinsically hard. Video object segmentation (VOS) models can often preserve identity through the hard frames where the base tracker fails, but they are much more expensive in compute and memory. We propose selective mask propagation, a tracking algorithm that dispatches from a base tracker to a VOS model only on windows where an assignment-uncertainty signal fires. The base tracker's output is modified only when the VOS model makes a confident prediction that contradicts the base tracker's identity assignment; weak or inconclusive predictions preserve the base output. The method is training-free, treats both the base tracker and the VOS model as black boxes, and can benefit from replacing the VOS component with a more capable model. On DanceTrack, selective mask propagation improves three different base trackers. On SportsMOT, where identity preservation is central to sports analytics, SAM3-Deep-EIoU with global track association achieves state-of-the-art performance on the benchmark with 86.8 HOTA.
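The dispatch-and-override logic in the abstract can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's implementation: the abstract does not define the assignment-uncertainty signal, so `uncertain_windows` assumes a per-frame assignment margin that is thresholded and padded into windows, and `resolve_identity` assumes a single confidence threshold for the VOS prediction. The function names and the `threshold`, `pad`, and `min_confidence` parameters are hypothetical.

```python
def uncertain_windows(margins, threshold, pad=1):
    """Return merged (start, end) frame windows where the margin is below threshold.

    `margins[t]` is an assumed per-frame assignment margin from the base tracker
    (low margin = uncertain assignment). Each uncertain frame is padded by `pad`
    frames on both sides, and overlapping or adjacent windows are merged.
    """
    windows = []
    last_frame = len(margins) - 1
    for t, margin in enumerate(margins):
        if margin >= threshold:
            continue
        start, end = max(0, t - pad), min(last_frame, t + pad)
        if windows and start <= windows[-1][1] + 1:
            windows[-1][1] = max(windows[-1][1], end)
        else:
            windows.append([start, end])
    return [tuple(w) for w in windows]


def resolve_identity(base_id, vos_id, vos_confidence, min_confidence=0.8):
    """Keep the base tracker's identity unless the VOS model confidently disagrees."""
    confident = vos_id is not None and vos_confidence >= min_confidence
    if confident and vos_id != base_id:
        return vos_id
    return base_id


if __name__ == "__main__":
    margins = [0.9, 0.9, 0.2, 0.9, 0.9, 0.9, 0.1, 0.15, 0.9]
    print(uncertain_windows(margins, threshold=0.5))
    print(resolve_identity(base_id=3, vos_id=5, vos_confidence=0.95))
    print(resolve_identity(base_id=3, vos_id=5, vos_confidence=0.40))
```

Only the frames inside the returned windows would be sent to the expensive VOS model, and a weak or missing VOS prediction leaves the base output unchanged, which matches the conservative override rule the abstract describes.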

2606.13032 2026-06-12 cs.CV New submission

GeoCFNet: Geometry-Aware Confidence Field Network for Robot-Assisted Endoscopic Submucosal Dissection

GeoCFNet: 几何感知置信场网络用于机器人辅助内镜黏膜下剥离术

Rui Tang, Guankun Wang, Long Bai, Haochen Yin, Huxin Gao, Jiewen Lai, Jiazheng Wang, Hongliang Ren

Affiliations * Department of Electronic Engineering, The Chinese University of Hong Kong; Theory Lab, Central Research Institute, 2012 Labs, Huawei Technologies Co. Ltd.

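AI summary Proposes GeoCFNet, which addresses dissection guidance in dynamic endoscopic scenes through geometry-aware confidence field estimation. It integrates Token-Differentiated Fusion and Geometry-Aware Spatial Regularization to predict accurate and stable confidence fields.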

Comments IEEE ICIA 2026

Details
AI Chinese abstract

先进的手术机器人技术使机器人辅助内镜黏膜下剥离术(ESD)成为整块切除大病变的有前景方法,具有降低复发率和改善长期预后的潜力。然而,ESD的技术复杂性和并发症风险需要稳定精确的视觉引导,以维持准确的解剖通道和安全组织边界。密集置信场通过描述优选解剖区域及其向周围组织的空间过渡,为此提供了有效表示。然而,在动态内镜场景中,由于烟雾、镜面高光、组织变形、弱纹理以及目标区域的薄几何结构,可靠的置信场估计仍然具有挑战性。为解决这些问题,我们将解剖引导表述为几何感知置信场估计问题,并提出GeoCFNet,一种基于预训练DINOv3骨干网络的几何感知置信场网络。GeoCFNet集成了Token差异化融合模块以聚合类别令牌上下文与密集补丁表示、用于置信回归的SegFormer解码器,以及几何感知空间正则化(GASR)以保持空间一致性和局部几何过渡。实验结果表明,GeoCFNet实现了RMSE 0.0480、PSNR 27.1995、SSIM 0.3397和CC 0.2466,表明其能够为机器人辅助ESD引导提供精确且几何稳定的置信场估计。

English abstract

Advanced surgical robotics has made robot-assisted endoscopic submucosal dissection (ESD) a promising approach for the en-bloc resection of large lesions, with the potential to reduce recurrence and improve long-term outcomes. However, the technical complexity and risk of complications in ESD demand stable and precise visual guidance to maintain an accurate dissection corridor and a safe tissue margin. Dense confidence fields provide an effective representation for this purpose by describing both the preferred dissection region and its spatial transition to surrounding tissue. However, reliable confidence field estimation remains challenging in dynamic endoscopic scenes due to smoke, specular highlights, tissue deformation, weak texture, and the thin geometric structure of the target region. To address these challenges, we formulate dissection guidance as a geometry-aware confidence field estimation problem and propose GeoCFNet, a geometry-aware confidence field network built on a pretrained DINOv3 backbone. GeoCFNet integrates a Token-Differentiated Fusion module to aggregate class-token context with dense patch representations, a SegFormer decoder for confidence regression, and Geometry-Aware Spatial Regularization (GASR) to preserve spatial coherence and local geometric transitions. Experimental results show that GeoCFNet achieves RMSE 0.0480, PSNR 27.1995, SSIM 0.3397, and CC 0.2466, indicating accurate and geometrically stable confidence field estimation for robot-assisted ESD guidance.
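For readers unfamiliar with the reported metrics, the sketch below shows the standard textbook definitions of RMSE, PSNR, and the Pearson correlation coefficient (CC) on flattened confidence maps. It assumes values in [0, 1] (so the PSNR peak is 1.0) and takes CC to mean Pearson correlation; the abstract does not state how the paper computes or averages these metrics, and SSIM is omitted because it needs windowed local statistics. All function names are illustrative.

```python
import math


def rmse(pred, target):
    """Root-mean-square error between two equally sized flat sequences."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred))


def psnr(pred, target, peak=1.0):
    """Peak signal-to-noise ratio in dB; `peak` is the assumed maximum value."""
    err = rmse(pred, target)
    return math.inf if err == 0 else 20 * math.log10(peak / err)


def pearson_cc(pred, target):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    mean_p = sum(pred) / len(pred)
    mean_t = sum(target) / len(target)
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_t = sum((t - mean_t) ** 2 for t in target)
    denom = math.sqrt(var_p * var_t)
    return cov / denom if denom > 0 else 0.0


if __name__ == "__main__":
    predicted = [0.0, 0.5, 1.0]
    reference = [0.1, 0.5, 0.9]
    print(rmse(predicted, reference))
    print(psnr(predicted, reference))
    print(pearson_cc(predicted, reference))
```

RMSE and PSNR measure pointwise error, while CC measures whether the predicted field rises and falls with the reference, so the two kinds of metric can disagree on the same prediction.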