arXivDaily: arXiv daily academic digest, updated Monday to Friday
2606.09542 2026-06-09 cs.CV New submission

A VideoMAE-v2 Approach to Zero-Shot Traffic Accident Anticipation

Siyuan Li, Xiaoyang Bi, Mengshi Qi

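Affiliations: State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications

AI summary: Proposes a VideoMAE-v2-based framework that uses a sliding-window protocol and a per-frame prediction head to generalise, in a zero-shot setting, from coarsely labelled data to unseen dashcam videos for traffic accident anticipation.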

English abstract

Traffic accident anticipation -- predicting the likelihood of an imminent collision at every frame of a dashcam video -- is safety-critical yet difficult to scale, because collecting in-domain annotated accident footage for every deployment scenario is prohibitively expensive. We study this task under a zero-shot setting where no target-domain training data is available: the model must learn exclusively from a publicly available binary-labelled driving-accident dataset and generalise to unseen dashcam footage. We propose a framework that bridges the gap between the frame-level temporal risk estimation task and coarsely labelled binary accident datasets by coupling a VideoMAE-v2 backbone with a per-frame prediction head under a sliding-window protocol. Our method achieves 2nd place in the 2026 CVPR@AUTOPILOT Zero-Shot Accident Anticipation competition. Code is available at https://github.com/TimeSouth/zero-shot-taa-solution.
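
The sliding-window protocol mentioned in the abstract can be illustrated with a minimal Python sketch. It assumes that each window yields one score per frame and that overlapping scores are averaged; the abstract does not specify the aggregation rule, the window length, or the stride. `score_window` is a hypothetical stand-in for the VideoMAE-v2 backbone plus per-frame head.

```python
from typing import Callable, List, Sequence


def sliding_window_scores(
    num_frames: int,
    window: int,
    stride: int,
    score_window: Callable[[int, int], Sequence[float]],
) -> List[float]:
    """Average overlapping per-frame risk scores computed window by window.

    score_window(start, end) is hypothetical: it must return one score per
    frame in the half-open range [start, end).
    """
    if num_frames <= 0 or window <= 0 or stride <= 0:
        raise ValueError("num_frames, window and stride must be positive")
    if stride > window:
        raise ValueError("stride must not exceed window, or frames are skipped")

    last_start = max(num_frames - window, 0)
    starts = list(range(0, last_start + 1, stride))
    # Add a final window so the last frames are always covered.
    if starts[-1] != last_start:
        starts.append(last_start)

    totals = [0.0] * num_frames
    counts = [0] * num_frames
    for start in starts:
        end = min(start + window, num_frames)
        for offset, score in enumerate(score_window(start, end)):
            totals[start + offset] += score
            counts[start + offset] += 1
    return [t / c for t, c in zip(totals, counts)]
```

Averaging is only one option; taking the most recent window's score for each frame would be closer to a strictly causal deployment.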

2606.09539 2026-06-09 cs.LG New submission

Efficient Traffic Prediction at Scale: A Systematic Study of STGCN Architectural Depth

Soban Nasir Lone, Mohamed Abouelela, Taeyoung Yu, Jiwon Kim, Constantinos Antoniou

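Affiliations: Technical University of Munich; The University of Queensland

AI summary: Systematically studies how STGCN architectural depth affects traffic prediction accuracy and computational efficiency, finding that a single-block architecture achieves optimal or near-optimal performance on most datasets at a substantially lower computational cost than the standard two-block architecture.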

Comments Accepted for publication in IEEE ITSC (2026)

English abstract

Spatio-temporal graph neural networks (STGNNs) have become the dominant approach for traffic prediction, yet their computational requirements pose challenges for practical deployment in intelligent transportation systems (ITS). While recent work has proposed efficient alternatives to STGNNs, a fundamental question remains unexplored: are these architectures themselves over-parameterised? We examine this question using the Spatio-Temporal Graph Convolutional Network (STGCN), one of the most widely adopted models in this domain. Through systematic experiments across four diverse traffic datasets, we compare 1-block, 2-block (standard), and 3-block STGCN variants. Our findings reveal that the single-block architecture achieves optimal performance for short-term prediction (10 min) on three of four datasets, while incurring only marginal degradation (≤1.8% relative error) at longer horizons. Crucially, the 2-block variant incurs 61% higher CPU inference latency and 37% lower throughput relative to 1-block, a substantial overhead for resource-constrained ITS deployment. The 3-block architecture offers no favourable tradeoff, more than doubling computational cost for <0.5% relative improvement. These results suggest that the default 2-block STGCN may be over-parameterised for many applications, with implications for both practitioners deploying traffic prediction systems and researchers benchmarking efficiency-focused methods.

2606.09536 2026-06-09 cs.CV New submission

Adversarial Attack and Disturbance Detection by Hadamard-Coded Output Representations for Object Detection and Semantic Segmentation

Lucas Görnhardt, Timo Bartels, Niklas Schwarz, Tim Fingscheidt

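Affiliations: Technische Universität Braunschweig

AI summary: To address the poor calibration and attack vulnerability caused by conventional one-hot encodings, proposes Hadamard-coded output representations with an optimised decoding procedure that yields optimal class probabilities, and uses prediction inconsistency to detect adversarial attacks and disturbances, reaching state-of-the-art detection performance in a single detection pass.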

English abstract

Conventional one-hot encodings often yield poorly calibrated models that are overconfident under attack and cause entropy-based detection algorithms to fail. Previous image classification works have demonstrated that Hadamard-coded output representations can improve adversarial robustness. However, attempts to integrate Hadamard codes into semantic segmentation fall far behind state-of-the-art models in mean intersection-over-union performance. For object detection, such output encodings have not yet been investigated at all. Further, no prior art has addressed intrinsic codeword inconsistencies or actually exploited intrinsic codeword redundancy. Accordingly, we first derive a novel decoding procedure for Hadamard codewords that yields optimal class-wise probabilities, solving the underlying optimization problem by projection onto the probability simplex. Second, our optimization delivers a measure of prediction inconsistency. Third, we are the first to show how to exploit these inconsistencies for adversarial attack and disturbance detection. Fourth, we introduce HadamardNet, a framework employing Hadamard codes as output representations for semantic segmentation and object detection models and tasks. We conduct a comprehensive evaluation on both disturbances and adversarial attacks, achieving state-of-the-art perturbation detection performance for both tasks in only a single detection pass, while delivering performance on clean data that is equivalent or close to the reference.
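
To make the decoding idea concrete, here is a minimal Python sketch under a stated assumption: the abstract does not give the objective, so the sketch assumes a least-squares fit of a probability-weighted mixture of codewords to the network output, constrained to the probability simplex. Because Hadamard codewords are mutually orthogonal, that problem reduces to projecting the normalised correlations onto the simplex, and the fit residual can serve as an inconsistency score. All function names are illustrative, not the authors' API.

```python
from typing import List, Sequence, Tuple


def sylvester_hadamard(n: int) -> List[List[int]]:
    """Build an n x n Hadamard matrix with +1/-1 entries (n a power of two)."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    h = [[1]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-x for x in row] for row in h]
    return h


def project_to_simplex(v: Sequence[float]) -> List[float]:
    """Euclidean projection of v onto {p : p_i >= 0, sum(p) = 1}."""
    cumulative, theta = 0.0, 0.0
    for j, u_j in enumerate(sorted(v, reverse=True), start=1):
        cumulative += u_j
        t = (cumulative - 1.0) / j
        if u_j - t > 0:  # holds exactly for the first rho sorted entries
            theta = t
    return [max(x - theta, 0.0) for x in v]


def decode(y: Sequence[float], codebook: Sequence[Sequence[int]]) -> Tuple[List[float], float]:
    """Return (class probabilities, inconsistency score) for output y.

    Assumed objective: minimise ||C^T p - y||^2 over the simplex, where the
    rows of C are orthogonal codewords of length n, so C C^T = n I.
    """
    n = len(y)
    correlations = [sum(c * yi for c, yi in zip(row, y)) / n for row in codebook]
    probs = project_to_simplex(correlations)
    recon = [sum(p * row[i] for p, row in zip(probs, codebook)) for i in range(n)]
    inconsistency = sum((r - yi) ** 2 for r, yi in zip(recon, y)) / n
    return probs, inconsistency
```

A clean codeword decodes to a one-hot probability vector with zero residual, while an output that matches no mixture of codewords leaves a positive residual that a detector could threshold.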

2606.09535 2026-06-09 cs.CL cs.SD New submission

Overcoming Decoder Inconsistencies in Whisper for Dravidian and Low-Resource Languages

Chowdam Venkata Kumar, Kumud Tripathi, Pankaj Wasnik

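Affiliations: Sony Research India

AI summary: Addressing Whisper's high word error rate on Dravidian languages, linguistic and dataset analysis reveals lexical sparsity and character-level substitution errors; two decoder enhancements, Weighted-Attention and Self-Conditioning, are proposed and significantly reduce WER for low-resource and agglutinative languages.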

Comments Accepted at INTERSPEECH 2026, 5 pages, 1 figure, 5 tables

English abstract

Multilingual ASR models such as Whisper perform well on high-resource languages but exhibit substantially higher Word Error Rates (WER) for Dravidian languages compared to Indo-Aryan ones. Through linguistic and dataset analysis, we show that Dravidian languages have longer words, higher vocabulary diversity, and lower repetition, resulting in sparse token distributions and frequent character-level substitution errors. Baseline fine-tuning further reveals decoder imbalance between self-attention (linguistic context) and cross-attention (acoustic cues). Although synthetic token-repetition experiments indicate potential gains, they are impractical. Motivated by these observations, we introduce two decoder-level enhancements: Weighted-Attention, which adaptively balances attention sources, and Self-Conditioning, which reinjects intermediate predictions to improve token consistency. Experiments demonstrate consistent WER reductions for low-resource and agglutinative languages.

2606.09525 2026-06-09 cs.CL cs.AI New submission

Emergence of Context Characteristics Sensitivity in Large Language Models

Nadya Yuki Wangsajaya, Haeun Yu, Isabelle Augenstein

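Affiliations: Nanyang Technological University; University of Copenhagen

AI summary: By measuring three stages (supervised fine-tuning, direct preference optimization, and reinforcement learning with verifiable rewards), finds that LLMs' sensitivity to context characteristics changes dynamically during instruction fine-tuning: supervised fine-tuning biases models toward easy-to-understand contexts, and later stages may reinforce or alter this preference.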

English abstract

During instruction fine-tuning (IFT), large language models (LLMs) learn to follow instructions by using the provided context to answer a query. While prior work has studied how context characteristics correlate with context usage by the LLM, this analysis has been limited to inference time, leaving open how these relationships are acquired in the first place. Here, we measure how models' sensitivity to such characteristics shifts across successive IFT stages: supervised fine-tuning (SFT), direct preference optimization (DPO), and reinforcement learning with verifiable rewards (RLVR). Experiments across four models and three datasets show that SFT makes models more likely to use contexts that are easy to understand, such as those with greater length, higher context-query similarity, and higher fluency. Post-SFT dynamics may either reinforce or resolve these preferences depending on the training dataset. Our findings reveal that context usage is actively reshaped at each IFT stage, and that designing a balanced IFT dataset is important for ensuring robust context utilization by instruction-tuned models.

2606.09517 2026-06-09 cs.LG New submission

Investigating Calibration Challenges in Probabilistic Electricity Price Forecasting

Jan Niklas Lettner, Hadeer El Ashhab, Benjamin Schäfer

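Affiliations: Institute for Automation and Applied Informatics, Karlsruhe Institute of Technology

AI summary: Argues that the scoring rules currently used in probabilistic electricity price forecasting favour sharpness over calibration, leading to overconfident estimates, and calls for future research to shift toward calibration-aware objectives and architectures.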

Comments Presented at the ACM Sustainability Week Companion 2026, Banff, AB, Canada

English abstract

As renewable energy integration increases market volatility, probabilistic electricity price forecasting has become essential for effective risk management. However, current proper scoring rules often prioritize forecast sharpness at the expense of calibration, leading to overconfident and statistically unreliable uncertainty estimates. This work highlights the critical gap between theoretical scoring and practical calibration, demonstrating that models can become mere proxies for deterministic forecasts when reliability is neglected. We conclude that future research must shift toward calibration-aware objectives and architectures to ensure the distributional integrity of energy market forecasts.
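
The tension between sharpness and calibration can be made tangible with a small Python sketch. It is not the authors' evaluation code; it only computes two standard diagnostics for prediction intervals: empirical coverage (a calibration check against the nominal level) and mean interval width (a sharpness measure). The function name and the toy numbers in the usage note are illustrative.

```python
from typing import Sequence, Tuple


def coverage_and_width(
    lower: Sequence[float],
    upper: Sequence[float],
    observed: Sequence[float],
) -> Tuple[float, float]:
    """Return (empirical coverage, mean interval width) for interval forecasts."""
    n = len(observed)
    if n == 0 or len(lower) != n or len(upper) != n:
        raise ValueError("inputs must be non-empty and of equal length")
    # Coverage: fraction of observations that fall inside their interval.
    hits = sum(1 for lo, hi, y in zip(lower, upper, observed) if lo <= y <= hi)
    # Sharpness: narrower intervals are sharper, regardless of coverage.
    mean_width = sum(hi - lo for lo, hi in zip(lower, upper)) / n
    return hits / n, mean_width
```

For example, intervals `[20, 22]`, `[45, 55]`, `[29, 31]`, `[40, 42]` against observed prices `10, 50, 30, 80` give a coverage of 0.5 with a mean width of 4.0. If those were nominal 90% intervals, the forecaster would be sharp but clearly overconfident.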

2606.09516 2026-06-09 cs.CV New submission

SwiftVR: Real-Time One-Step Generative Video Restoration

Jiaqi Yan, Xiangyu Chen, Xinlin Zhong, Haibin Huang, Chi Zhang, Jie Liu, Jiantao Zhou, Xuelong Li

Affiliations: State Key Laboratory of Internet of Things for Smart City, Department of Computer and Information Science, University of Macau; Institute of Artificial Intelligence (TeleAI), China Telecom; State Key Laboratory for Novel Software Technology, Nanjing University

AI summary: Proposes SwiftVR, a streaming one-step generative video restoration framework that combines a causal chunk-wise protocol, mask-free shifted-window self-attention, and a lightweight Restoration-aware Autoencoder to achieve real-time high-definition video restoration on consumer-grade GPUs.

English abstract

Real-time video restoration (VR) for live streams requires high-resolution outputs under strict per-frame latency constraints. Existing one-step diffusion-based VR models remain difficult to deploy on consumer-grade GPUs due to two main bottlenecks: quadratic spatial attention at high resolutions and the latency-memory overhead of large video autoencoders. We present SwiftVR, a streaming one-step generative VR framework that reduces both bottlenecks under a causal chunk-wise protocol. For attention, mask-free shifted-window self-attention gathers each spatial window into a dense tensor via deterministic indexing, keeping all attention calls on the dense scaled dot-product attention path without masks, cyclic shifts, padding, or hardware-specific sparse kernels. Because SwiftVR uses only standard dense SDPA calls, the trained model transfers to consumer GPUs without retraining or custom kernels. For autoencoding, a lightweight Restoration-aware Autoencoder enables fast chunk-wise decoding while preserving reconstruction quality. On a single H100, SwiftVR sustains 31 FPS at 2560x1440 and 14 FPS at 3840x2160, whereas all compared diffusion-based VR baselines exceed the memory limit at 4K. On a consumer RTX 5090, SwiftVR reaches 26 FPS at 1920x1080. To our knowledge, SwiftVR is the first generative VR model to achieve real-time 1080p streaming on a consumer-grade GPU, while attaining strong no-reference perceptual quality with lower inference cost. The project page is available at https://h-oliday.github.io/SwiftVR.

2606.09514 2026-06-09 cs.LG New submission

BUDDY: BUdget-Driven DYnamic Depth Routing for Adaptive Large Language Model Inference

Yuhua Zhou, Shaoqi Yu, Shichao Weng, Changhai Zhou, Mingze Yin, Fei Yang, Aimin Pan

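Affiliations: University of Science and Technology of China

AI summary: Proposes the BUDDY framework, in which a lightweight decision module dynamically selects the top-k layers based on the input and reuses the KV cache to support adaptive routing at decode time, improving the accuracy-compute trade-off under strict budget control.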

English abstract

Large language models (LLMs) incur high inference cost due to their depth and parameter scale. Depth pruning can reduce latency by skipping redundant Transformer blocks, but existing methods (i) provide limited control under user-specific compute budgets and (ii) typically fix the routing path, failing to adapt as the context grows during decoding. We propose Buddy, a budget-driven dynamic depth routing framework. Buddy uses a lightweight Decision Module to score intermediate layers conditioned on the input and deterministically executes the top-k layers to satisfy a given budget. To support decode-time adaptation, Buddy reuses the first-layer KV cache as a low-overhead global context source and pools it together with the newest token representation before each routing decision. When no explicit budget is provided, an optional Budget Predictor estimates an input-dependent compute level to balance quality and efficiency. Experiments on Llama-family and Qwen models show that Buddy is competitive with strong static pruning baselines and often improves the accuracy-compute trade-off, while uniquely supporting strict budget control, decode-time rerouting, and multiple budgets within a single trained model.
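
The deterministic top-k selection described in the abstract can be sketched in a few lines of Python. This is an illustration under assumptions, not the authors' implementation: the scores would come from the Decision Module (here they are plain numbers), the budget is expressed directly as a layer count `k`, and ties are broken by lower layer index so the result is reproducible.

```python
from typing import Callable, List, Sequence


def select_top_k_layers(scores: Sequence[float], k: int) -> List[int]:
    """Pick the k highest-scoring intermediate layers, returned in depth order."""
    if not 0 <= k <= len(scores):
        raise ValueError("k must be between 0 and the number of scored layers")
    # Sort by descending score; the index breaks ties deterministically.
    ranked = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    # Selected layers must still run in their original order.
    return sorted(ranked[:k])


def run_routed(x, layers: Sequence[Callable], selected: Sequence[int]):
    """Apply only the selected layers, skipping the rest."""
    for i in selected:
        x = layers[i](x)
    return x
```

Because the same scores can be thresholded at any `k`, a single set of scores supports several budgets, which matches the abstract's claim of multiple budgets within one trained model.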

2606.09511 2026-06-09 cs.CV New submission

Securing Self-supervised Data Curation for Foundation Models Robustness

Sandeep Gupta, Roberto Passerone

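Affiliations: Queen's University Belfast; University of Trento

AI summary: To counter the data poisoning risk in self-supervised data curation, proposes PDD, an active defense mechanism built on ImageBind and traditional classifiers that effectively detects poisoned data under multiple attacks, with SVM-PDD performing best.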

Comments 22 pages

English abstract

Self-supervised data curation provides a pathway to scaling and improving the generalization capabilities of machine learning models. By leveraging self-supervised learning (SSL) for data curation, the demand for massive training datasets required by foundation models can be effectively met. SSL greatly alleviates the costs associated with annotation and manual dataset curation while minimizing the need for human oversight. However, the integrity of SSL-curated datasets must be rigorously checked, as reliance on anonymous and unvetted external sources can substantially increase the risk of data poisoning. In this paper, we propose a Poisoned Data Detector (PDD), an active defense mechanism designed to ensure the integrity of SSL-curated datasets prior to foundation model training. PDDs are designed using a combination of the pretrained ImageBind model and traditional classifiers, including Random Forest (RF), k-Nearest Neighbors (KNN), Naive Bayes (NB), and Support Vector Machines (SVM). We rigorously evaluated PDDs using 176,200 images from three diverse datasets and three different adversarial attacks encompassing both in-distribution and out-of-distribution scenarios. Notably, SVM-PDD achieves superior performance for both in-distribution (Set3-Set5) and out-of-distribution (TrueFace and 140K RealFace) datasets. Our design demonstrates strong scalability and enables the rapid integration of new adversarial attack detectors through an ensemble approach.

2606.09508 2026-06-09 cs.AI cs.CL New submission

From Rigid to Dynamic: Entropy-Guided Adaptive Inference for Long-Context LLMs

Zhanchao Xu, Haoyang Li, Qingfa Xiao, Fei Teng, Chen Jason Zhang, Lei Chen, Qing Li

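Affiliations: Department of Computing, PolyU; DSA, HKUST(GZ); CSE, HKUST

AI summary: Proposes the EntropyInfer framework, which uses attention entropy to allocate compute adaptively during prefilling and compresses the KV cache during decoding using generated tokens, enabling efficient inference for long-context LLMs.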

English abstract

Existing sparse attention and KV cache compression methods for long-context LLM inference typically apply fixed sparsity patterns or uniform budgets across all attention heads, overlooking the substantial variation in attention behavior among heads and contexts. We observe two distinct entropy patterns among attention heads: Rigid Heads, whose entropy stays near zero across input segments, and Dynamic Heads, whose entropy fluctuates significantly. Crucially, the distribution of these types is context-dependent and cannot be predetermined offline. We therefore propose EntropyInfer, a training-free framework that uses attention entropy to adaptively allocate compute at the granularity of individual heads and segments during prefilling. For decoding, we introduce a latent KV cache compression scheme that leverages generated output tokens, rather than prefill tokens alone, to identify and retain the most critical cache entries. Extensive experiments on the Llama, Qwen, and openPangu model series show that EntropyInfer consistently outperforms baselines including SnapKV, AdaKV, and CritiPrefill, achieving up to 2.39× end-to-end speedup beyond 100k tokens with minimal quality degradation compared to full attention. The code is released at https://github.com/SHA-4096/EntropyInfer.
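
The Rigid/Dynamic distinction rests on the Shannon entropy of each head's attention distribution, which the following Python sketch computes. The classification rule and both thresholds (`near_zero`, `fluctuation`) are hypothetical placeholders chosen for illustration; the abstract does not give the criteria EntropyInfer actually uses.

```python
import math
from typing import Sequence


def attention_entropy(weights: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one attention distribution."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("attention weights must have a positive sum")
    # Normalise defensively and skip zero entries (0 * log 0 is taken as 0).
    return -sum((w / total) * math.log(w / total) for w in weights if w > 0)


def classify_head(
    segment_entropies: Sequence[float],
    near_zero: float = 0.1,    # hypothetical threshold
    fluctuation: float = 0.5,  # hypothetical threshold
) -> str:
    """Label a head from its per-segment entropies."""
    if max(segment_entropies) <= near_zero:
        return "rigid"    # entropy stays near zero across segments
    if max(segment_entropies) - min(segment_entropies) >= fluctuation:
        return "dynamic"  # entropy fluctuates significantly
    return "other"
```

A head that always attends to a single token has entropy 0, while uniform attention over n tokens has the maximum entropy log(n). Since the label depends on the observed entropies, it has to be computed per input, which is consistent with the abstract's point that head types cannot be fixed offline.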

2606.09507 2026-06-09 cs.CV New submission

Prisma-World: Camera-Controllable Multi-Agent Video World Model

Huiqiang Sun, Zhan Peng, Size Wu, Kun Wang, Kang Liao, Dianyi Wang, Xingyu Zeng, Sheng Jin, Yangguang Li, Zhiguo Cao, Ziwei Liu, Wei Li

Affiliations: School of AIA, HUST; S-Lab, NTU; SenseTime Research; FDU; SUAT; HKU; CUHK

AI summary: Proposes Prisma-World, which achieves cross-view consistency in multi-agent video generation through a joint geometry-aware denoising process and supports flexible agent counts and camera control.

Comments Project page: https://huiqiang-sun.github.io/prisma-world/

English abstract

Video world models have made rapid progress in generating controllable visual experiences, but most of them still simulate the world from a single observer. Extending such models to multiple agents raises a central challenge: if each agent's future state is generated independently, overlapping views may instantiate different versions of the same scene, leading to inconsistent objects, layouts, and appearances across agents. Conventional camera conditioning controls individual trajectories, but it does not explicitly couple the generation of views that should agree under shared scene geometry. We introduce Prisma-World, a camera-controllable multi-agent world model that formulates multi-agent generation as a joint geometry-aware denoising process for cross-view consistency. Prisma-World processes all agent videos within one full-attention sequence, uses a multi-agent RoPE design to distinguish agent identities while preserving synchronized temporal coordinates, and injects relative camera geometry into attention to bias overlapping viewpoints toward shared scene evidence. To further strengthen multi-view consistency and enhance global spatial perception, we augment our framework with an overlap-decaying curriculum training paradigm alongside minimap-conditioned structural guidance. To facilitate the training and evaluation of multi-agent models, we introduce PrismaDataset, a large-scale UE5 dataset with panoramic acquisition across diverse scenes, composable multi-agent view groups with flexible agent counts and complex camera trajectories, and precise camera/action annotations for consistency training and evaluation. Experiments show that a single Prisma-World model can generate high-fidelity multi-agent videos with flexible agent numbers, camera controllability, improved cross-view consistency, and spatial grounding under minimap guidance.

2606.09499 2026-06-09 cs.RO cs.AI cs.CR New submission

Targeting World Models to Compromise Robot Learning Pipelines

Ethan Rathbun, Ahmed Agha, Saaduddin Mahmud, Christopher Amato, Alina Oprea, Eugene Bagdasarian

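Affiliations: Northeastern University; University of Massachusetts Amherst

AI summary: Proposes novel data poisoning attacks targeting world models: by injecting malicious prompts or compromised transition dynamics into seemingly safe data, they cause dangerous training trajectories to be generated, resulting in unsafe downstream policies.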

Comments 8 Pages, CoRL Preprint

English abstract

World models have recently grown rapidly in both popularity and capability as more data-efficient tools for generating robot training data or simulating real-world environments, with many works proposing their integration into the robot learning pipeline. Although world models are highly practical, we demonstrate that they introduce a uniquely stealthy and effective data poisoning entry point into the robot learning supply chain, one that can result in the deployment of unsafe or otherwise compromised robotic policies despite training on seemingly safe ground-truth data. In contrast to traditional data poisoning techniques, which directly implant dangerous trajectories into sold or uploaded datasets, our novel attack methods inject malicious prompts or compromising transition dynamics into visibly safe teleoperated datasets; these are activated only once fed through a world model as input. This can result in the generation of synthetic, dangerous robot training trajectories and, subsequently, unsafe or compromised robot policies. We demonstrate the effectiveness of our attacks against both state-of-the-art action-conditioned and text-conditioned world models, showing a full end-to-end backdoor on a downstream DRL policy and a proof of concept for the VLA setting. Overall, these findings call for research into more secure world models and a re-evaluation of their position within the robot learning supply chain.

2606.09498 2026-06-09 cs.CL New submission

Self-Harness: Harnesses That Improve Themselves

Hangfan Zhang, Shao Zhang, Kangcong Li, Chen Zhang, Yang Chen, Yiqun Zhang, Lei Bai, Shuyue Hu

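Affiliations: Shanghai Artificial Intelligence Laboratory

AI summary: Proposes the Self-Harness paradigm, in which an LLM agent iteratively improves its own harness through weakness mining, harness proposal, and proposal validation, raising the pass rates of three models on Terminal-Bench-2.0 by 21.4, 14.3, and 14.2 percentage points, respectively.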

English abstract

The performance of LLM-based agents is jointly shaped by their base models and the harnesses that mediate their interaction with the environment. Because different models exhibit distinct behaviors, effective harness design is inherently model-specific. Yet agent harnesses are still largely engineered by human experts, a paradigm that scales poorly as modern LLMs become increasingly diverse and rapidly evolving. In this paper, we introduce Self-Harness, a new paradigm in which an LLM-based agent improves its own operating harness, without relying on human engineers or stronger external agents. We operationalize Self-Harness as an iterative loop with three stages: Weakness Mining, which identifies model-specific failure patterns from execution traces; Harness Proposal, which generates diverse yet minimal harness modifications tied to these failures; and Proposal Validation, which accepts candidate edits only after regression testing. We instantiate Self-Harness on Terminal-Bench-2.0 using a minimal initial harness and three base models from diverse families: MiniMax M2.5, Qwen3.5-35B-A3B, and GLM-5. Across all three models, Self-Harness consistently improves performance, with held-out pass rates increasing from 40.5% to 61.9%, 23.8% to 38.1%, and 42.9% to 57.1%, respectively. Qualitative analyses further show that Self-Harness does not simply add generic instructions, but effectively turns model-specific weaknesses into concrete, executable harness changes. These results suggest a path toward LLM-based agents that are not merely shaped by their harnesses, but can also participate in reshaping them.
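
The three-stage loop (Weakness Mining, Harness Proposal, Proposal Validation) has a simple control-flow skeleton, sketched below in Python. Everything here is an assumption for illustration: the harness is modelled as a tuple of rule strings, the three callables `mine`, `propose`, and `validate` are hypothetical stand-ins for LLM-driven components, and accepting the first candidate that passes validation is one possible policy among several.

```python
from typing import Callable, Iterable, List, Tuple

Harness = Tuple[str, ...]


def self_harness_loop(
    harness: Harness,
    mine: Callable[[Harness], List[str]],
    propose: Callable[[Harness, List[str]], Iterable[Harness]],
    validate: Callable[[Harness, Harness], bool],
    rounds: int,
) -> Harness:
    """Iteratively improve a harness, accepting only validated edits."""
    for _ in range(rounds):
        # Stage 1: identify failure patterns for the current harness.
        weaknesses = mine(harness)
        if not weaknesses:
            break
        # Stage 2: generate candidate modifications tied to those failures.
        for candidate in propose(harness, weaknesses):
            # Stage 3: keep a candidate only if it passes regression checks.
            if validate(harness, candidate):
                harness = candidate
                break
    return harness
```

The validation gate is what keeps the loop from drifting: a candidate that fixes one failure but breaks previously passing behaviour is discarded rather than merged.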

2606.09495 2026-06-09 cs.CV New submission

ContextShift: A Controlled Benchmark for Context Dependence in Object Detection

Dan Zlotnikov, Alex Lazarovich, Ohad Ben-Shahar

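Affiliations: Ben-Gurion University of the Negev

AI summary: Proposes the ContextShift benchmark, which systematically manipulates object-context relationships through geometric transformations and background substitution; finds that detector degradation shows up mainly as more missed detections and fewer predictions, that statistical co-occurrence is not linearly related to effective visual context, and that context-aware augmentation improves robustness.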

English abstract

Modern object detectors achieve strong performance on standard benchmarks, yet their robustness to contextual variation remains insufficiently understood. Prior evaluations largely rely on aggregate metrics such as AP on uncontrolled distribution shifts, which can obscure how performance degrades under context change. We introduce ContextShift, a controlled benchmark that systematically manipulates object-context relationships while preserving object appearance. Built on COCO 2017, it isolates context as an independent variable through geometric transformations and synthetic and natural background substitutions, including a continuous compatibility axis based on normalized pointwise mutual information (NPMI). Across diverse detector architectures, we observe a consistent degradation pattern: false negatives increase by up to 227% and prediction volume decreases by up to 44%, while false positives remain stable or decline. This suppression behavior is not captured by aggregate metrics such as AP, which can mask substantial recall loss and changes in prediction dynamics. Further analysis suggests that degradation is driven less by reduced confidence than by a reduced formation of valid detection candidates. Moreover, performance along the statistical compatibility axis is non-monotonic, peaking at intermediate NPMI and degrading toward both extremes, indicating that statistical co-occurrence does not correlate linearly with effective visual context. Finally, we show that context-aware augmentation improves robustness: every augmented variant outperforms the dataset-only baseline on both original and manipulated test images, partially recovering performance lost to prediction-suppression failures by exposing models to object-context decoupling during training.
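
The compatibility axis relies on normalized pointwise mutual information, which has a standard definition: NPMI(x, y) = PMI(x, y) / (-log p(x, y)), with PMI(x, y) = log(p(x, y) / (p(x) p(y))). The Python sketch below computes it from co-occurrence counts. How ContextShift estimates these counts (for example, over which object and scene categories) is not stated in the abstract, so the count-based interface is an assumption.

```python
import math


def npmi(count_xy: int, count_x: int, count_y: int, total: int) -> float:
    """Normalized pointwise mutual information from co-occurrence counts.

    Ranges from -1 (never co-occur) through 0 (independent) to 1 (always
    co-occur).
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not (0 <= count_xy <= min(count_x, count_y) and max(count_x, count_y) <= total):
        raise ValueError("counts are inconsistent")
    if count_xy == 0:
        return -1.0  # limiting value when the pair never co-occurs
    if count_xy == total:
        raise ValueError("NPMI is undefined (0/0) when the pair always occurs")
    p_xy = count_xy / total
    p_x = count_x / total
    p_y = count_y / total
    pmi = math.log(p_xy / (p_x * p_y))
    return pmi / -math.log(p_xy)
```

With 100 images, an object in 50, a background in 50, and both together in 25, the pair is statistically independent and NPMI is 0. The abstract reports that detection performance peaks at intermediate NPMI rather than at the high end of this scale.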

2606.09489 2026-06-09 cs.AI New submission

LLM-Orchestrated Conformance Checking in Stroke Care Without Computer-Interpretable Guidelines

Giorgio Leonardi, Stefania Montani, Manuel Striani, Alessandro Canessa, Delfina Ferrandi

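Affiliations: Computer Science Institute, DiSIT, University of Piemonte Orientale; Integrated Laboratory of AI and Medical Informatics, DAIRI, SS. Antonio e Biagio e Cesare Arrigo Hospital

AI summary: Proposes a modular framework based on LLM orchestration that automatically extracts patient traces from unstructured clinical texts and guidelines, identifies normative rules, and computes a conformance indicator; in the stroke care domain, more than 86% of traces were found to be conformant.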

英文摘要

Objective: Conformance checking in healthcare seeks to assess whether patient care pathways adhere to clinical guidelines. However, its practical application often depends on the availability of formal, machine-interpretable representations of guidelines, such as Computer-Interpretable Guidelines (CIGs), which are seldom available in real-world clinical settings. Methods: This work introduces a modular framework based on the orchestration of Large Language Models (LLMs) to support medical conformance checking directly from unstructured clinical and guideline texts, without requiring predefined CIGs. The proposed architecture integrates multiple LLMs and supporting components to extract patient traces from clinical discharge letters, identify normative rules from textual clinical guidelines, translate these rules into executable scripts, and compute a Trace Conformance Indicator to quantify compliance within the event log. Results: The framework was implemented and evaluated in the stroke care domain at the neurological ward of Alessandria Hospital. Hundreds of patient traces were automatically extracted from hospital data and assessed against 50 rules derived from the reference guideline. The analysis showed that more than 86% of the available traces were conformant. Conclusion: The results demonstrate the feasibility of using orchestrated LLMs for practical healthcare conformance analysis. At the same time, the study provides evidence of a high level of adherence to stroke care guidelines at Alessandria Hospital.
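The idea of turning guideline rules into executable checks over patient traces can be sketched as follows. The abstract does not define the Trace Conformance Indicator, so this sketch simply assumes a trace is conformant when it passes every rule and reports the share of conformant traces. The rule type, the helper names, and the event labels are all hypothetical.

```python
from typing import Callable

Trace = list[str]                 # ordered event labels for one patient
Rule = Callable[[Trace], bool]    # an executable check derived from a guideline

def precedes(first: str, later: str) -> Rule:
    """Rule: if `later` occurs in the trace, `first` must occur before it."""
    def check(trace: Trace) -> bool:
        if later not in trace:
            return True
        return first in trace[: trace.index(later)]
    return check

def conformant_share(traces: list[Trace], rules: list[Rule]) -> float:
    """Fraction of traces that satisfy every rule (assumed indicator)."""
    ok = sum(all(rule(t) for rule in rules) for t in traces)
    return ok / len(traces)

# Hypothetical event labels, for illustration only.
rules = [precedes("imaging", "treatment")]
traces = [
    ["admission", "imaging", "treatment"],
    ["admission", "treatment", "imaging"],
    ["admission"],
]
share = conformant_share(traces, rules)
```

Keeping each rule as a plain predicate means rules can be added or removed without touching the aggregation step.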

2606.09484 2026-06-09 cs.CL 新提交

Detecting Differences Is Not Understanding Structure: Large Language Models Fail at Graph Isomorphism

检测差异不等于理解结构:大型语言模型在图同构任务中失败

Kumar Thushalika, Sukumar Kishanthan, Asela Hevapathige

发表机构 * University of Ruhuna(鲁胡纳大学) University of Moratuwa(莫拉图瓦大学) University of Melbourne(墨尔本大学)

AI总结 本研究通过图同构检测任务揭示LLM的“虚假成功”:虽然LLM在检测同构时准确率接近完美,但面对节点标签置换的相同图时却无法识别,表明其依赖模式而非抽象结构推理。

详情
AI中文摘要

大型语言模型(LLM)在各种推理任务上表现出色,但它们在图结构推理方面的能力仍不明确。我们研究了LLM是否能真正理解图同构——图论中的一个基本问题。尽管LLM在同构检测上达到了近乎完美的准确率,但我们证明这种性能是虚幻的。当相同的图以置换后的节点标签呈现时,LLM无法识别它们的同构性。这一发现表明,LLM利用的是模式而非对抽象图结构的推理。由于置换不变性是有效结构推理的基本要求,这些结果表明,在图推理基准上的成功不应被解释为真正拓扑理解的证据。

英文摘要

Large language models (LLMs) have shown impressive performance on diverse reasoning tasks, yet their capacity for structural reasoning in graphs remains unclear. We investigate whether LLMs can genuinely understand graph isomorphism, a fundamental problem in graph theory. While LLMs achieve near-perfect accuracy on isomorphism detection, we show this performance is illusory. When identical graphs are presented with permuted node labels, LLMs fail to identify their isomorphism. This finding suggests that LLMs exploit patterns rather than reasoning about abstract graph structure. Since permutation invariance is a fundamental requirement for valid structural reasoning, these results indicate that success on graph reasoning benchmarks should not be interpreted as evidence of genuine topological understanding.
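The permuted-label test described in the abstract can be illustrated with a small sketch: relabel the nodes of a graph with a random permutation and confirm that a permutation-invariant check still recognises the two graphs as isomorphic. The brute-force checker below is only practical for tiny simple graphs and is an illustrative assumption, not the evaluation protocol used in the paper.

```python
import random
from itertools import permutations

def relabel(edges, mapping):
    """Apply a node mapping to an edge list; returns a set of undirected edges."""
    return {frozenset((mapping[u], mapping[v])) for u, v in edges}

def is_isomorphic(n, edges_a, edges_b):
    """Brute-force isomorphism test over all n! node permutations."""
    target = {frozenset(e) for e in edges_b}
    if len({frozenset(e) for e in edges_a}) != len(target):
        return False
    return any(relabel(edges_a, perm) == target for perm in permutations(range(n)))

cycle = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]   # 5-cycle
perm = list(range(5))
random.Random(0).shuffle(perm)
permuted = relabel(cycle, perm)                     # same graph, new labels
other = [(0, 1), (0, 2), (0, 3), (0, 4), (1, 2)]    # same edge count, different structure

same = is_isomorphic(5, cycle, permuted)
different = is_isomorphic(5, cycle, other)
```

A reasoner that truly captures structure should answer the `permuted` case as reliably as the unpermuted one, since relabeling changes nothing about the graph itself.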

2606.09483 2026-06-09 cs.CL cs.AI 新提交

Memory Beyond Recall: A Dual-Process Cognitive Memory System for Self-Evolving LLM Agents

超越回忆的记忆:用于自进化LLM代理的双过程认知记忆系统

Tianxiang Fei, Mingyang Song, Mao Zheng, Xiang Yu

发表机构 * Tencent(腾讯)

AI总结 提出DCPM系统,基于双过程理论将代理记忆组织为认知能力层次,通过同步日间写入器和异步夜间引擎分别处理信念修正和模式归纳,在隐式跨会话推理任务上提升显著。

详情
AI中文摘要

LLM代理的长期记忆不仅仅是适时检索正确的段落。当前的记忆系统将信念修正、因果耦合和跨领域抽象压缩到为表面回忆而调整的单一检索面上,因此难以处理需要推理用户如何演变的隐式个性化。我们提出DCPM,它沿着认知能力层次重新组织代理记忆,从原始输入和原子事实,经过历时信念轨迹和身份,上升到领域模式、潜在意图和跨领域模式。该层次由两个过程驱动,继承了双过程理论的架构分裂:一个同步的日间写入器(系统1),记录信念修正为双重链接的取代链;一个异步的夜间引擎(系统2),归纳模式和意图,并扫描跨领域冲突,抽象为更高级的核心模式。在LongMemEval、PersonaMem和PersonaMem-v2上,启用系统2在奖励隐式跨会话推理的基准上贡献最大(在PersonaMem-v2上最高+5.20),在跨度回忆上贡献最小,与架构预测一致。

英文摘要

Long-term memory for an LLM agent is more than retrieving the right passage at the right time. Current memory systems collapse belief revision, causal coupling, and cross-domain abstraction into a single retrieval surface tuned for surface recall, and consequently struggle on implicit personalisation that requires reasoning over how a user has evolved. We propose DCPM, which reorganises agent memory along a cognitive capability hierarchy ascending from raw inputs and atomic facts, through diachronic belief trajectories and identity, to domain schemas, latent intentions and cross-domain patterns. The hierarchy is driven by two processes inheriting the architectural split of dual-process theory: a synchronous daytime writer (System 1) that records belief revisions as doubly linked supersedes chains, and an asynchronous nighttime engine (System 2) that induces schemas and intentions and sweeps for cross-domain collisions abstracted into higher-level core schemas. On LongMemEval, PersonaMem and PersonaMem-v2, enabling System 2 contributes most where the benchmark rewards implicit cross-session inference (up to +5.20 on PersonaMem-v2) and least on span recall, matching the architectural prediction.
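One way to picture a "doubly linked supersedes chain" is a record that points both to the belief it replaces and to the belief that later replaces it. The sketch below is a minimal illustration under that assumption; the class, field, and function names are hypothetical and are not taken from DCPM.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(eq=False)  # identity comparison avoids recursing through the links
class Belief:
    text: str
    supersedes: Optional["Belief"] = None      # older belief this one replaces
    superseded_by: Optional["Belief"] = None   # newer belief that replaced this one

def revise(current: Belief, new_text: str) -> Belief:
    """Record a revision by linking the old and new belief in both directions."""
    new = Belief(text=new_text, supersedes=current)
    current.superseded_by = new
    return new

def trajectory(latest: Belief) -> list[str]:
    """Walk back through the chain and return beliefs from oldest to newest."""
    out = []
    node: Optional[Belief] = latest
    while node is not None:
        out.append(node.text)
        node = node.supersedes
    return out[::-1]

first = Belief("prefers coffee")
second = revise(first, "switched to tea")
third = revise(second, "drinks decaf only")
history = trajectory(third)
```

Keeping both links lets a reader start from any belief and move either toward its history or toward its current replacement, which is what makes the user's evolution traversable rather than overwritten.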

2606.09480 2026-06-09 cs.LG 新提交

Loss-Guided Adaptive Scale Refinement for Molecular Force Prediction

损失引导的自适应尺度细化用于分子力预测

Limin Yu

发表机构 * Tianjin Medical University(天津医科大学)

AI总结 提出损失引导的自适应尺度细化框架,通过插值、路由和尺度池更新自动发现任务有效尺度,在NaCl水溶液体系中降低力预测误差。

Comments 23 pages, 2 figures, 6 tables. Preprint on adaptive scale refinement for molecular force prediction

详情
AI中文摘要

分子系统涉及多个空间尺度的相互作用,从局部配位和短程扰动到长程静电和溶剂介导效应。然而,大多数分子表示学习方法依赖于手动预定义的尺度,而任务最优建模尺度可能与这些固定水平不一致。本研究引入了一个损失引导的自适应尺度细化框架用于分子力预测,将预定义尺度视为初始锚点,通过插值、路由、可微尺度更新和尺度池细化来发现任务有效的分辨率。使用NaCl水溶液离子系统作为最小测试平台,构建了短尺度和长程力预测分支并分析了它们的互补性。Oracle硬路由将整体力MAE从399.65降低到382.67,而连续Oracle插值进一步将其降低到380.96。在最近离子距离低于0.6 nm的紧密接触区域,紧密接触MAE从327.22降低到260.51。一个最小尺度池更新实验表明,从端点锚点{0,1}开始,损失引导的更新自动生成中间尺度并恢复了大部分连续Oracle性能。最终更新的尺度池{0,0.125,0.25,0.375,0.5,0.75,1}实现了381.23的整体MAE。这些结果支持自适应尺度细化作为分子表示学习的一个有前景的方向,特别是当固定尺度建模不足时。

英文摘要

Molecular systems involve interactions across multiple spatial scales, from local coordination and short-range perturbations to long-range electrostatic and solvent-mediated effects. However, most molecular representation learning methods rely on manually predefined scales, and the task-optimal modeling scale may not coincide with these fixed levels. This study introduces a loss-guided adaptive scale refinement framework for molecular force prediction, treating predefined scales as initial anchors and discovering task-effective resolutions through interpolation, routing, differentiable scale updates, and scale pool refinement. Using a NaCl aqueous ionic system as a minimal testbed, this study constructs short-scale and long-range force prediction branches and analyzes their complementarity. Oracle hard routing reduces the overall force MAE from 399.65 to 382.67, while continuous oracle interpolation further reduces it to 380.96. In close-contact regimes with nearest-ion distance below 0.6 nm, the close-contact MAE decreases from 327.22 to 260.51. A minimal scale pool update experiment shows that starting from endpoint anchors {0,1}, loss-guided updates automatically generate intermediate scales and recover most of the continuous oracle performance. The final updated scale pool {0,0.125,0.25,0.375,0.5,0.75,1} achieves an overall MAE of 381.23. These results support adaptive scale refinement as a promising direction for molecular representation learning, especially when fixed-scale modeling is insufficient.
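The gap between hard routing over two endpoint scales and a finer scale pool can be illustrated with a toy sketch. It assumes, purely for illustration, that a scale s in [0, 1] linearly blends a short-scale and a long-range prediction and that an oracle picks the best scale per sample; the abstract does not specify the interpolation form, and the numbers below are made up rather than taken from the paper.

```python
def blend(short_pred: float, long_pred: float, s: float) -> float:
    """Assumed linear interpolation between the two branches at scale s."""
    return (1.0 - s) * short_pred + s * long_pred

def oracle_mae(samples, pool):
    """Mean absolute error when an oracle picks the best scale per sample.

    samples: list of (short_pred, long_pred, target) tuples.
    pool: iterable of candidate scales in [0, 1].
    """
    total = 0.0
    for short_pred, long_pred, target in samples:
        total += min(abs(blend(short_pred, long_pred, s) - target) for s in pool)
    return total / len(samples)

# Toy data, not from the paper.
samples = [(1.0, 3.0, 2.0), (0.0, 4.0, 1.0), (5.0, 5.5, 5.5)]
endpoints = [0.0, 1.0]                       # hard routing between two anchors
refined = [0.0, 0.25, 0.5, 0.75, 1.0]        # pool with intermediate scales

mae_endpoints = oracle_mae(samples, endpoints)
mae_refined = oracle_mae(samples, refined)
```

Because the refined pool contains the endpoints, its oracle error can never be worse than hard routing; the open question a learned method has to answer is how much of that oracle gain it can recover without access to the targets.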

2606.09479 2026-06-09 cs.CV cs.DL 新提交

Optical Music Recognition for Real-World Manuscripts with Synthetic Data

基于合成数据的真实世界手稿光学音乐识别

Jiří Mayer, Martina Dvořáková, Vojtěch Dvořák, Markéta Herzánová Vlková, Filip Bím, Pavel Pecina, Samuel Šomorjai, Petr Žabička, Jan Hajič

发表机构 * Institute of Formal and Applied Linguistics, Charles University(查尔斯大学形式与应用语言学研究所) Moravian Library(摩拉维亚图书馆)

AI总结 针对资源受限场景下真实世界复杂钢琴手稿的识别,提出利用合成手稿图像进行域自适应,显著提升性能,并避免昂贵细粒度标注。

Comments Accepted for publication at the ICDAR 2026 conference

详情
AI中文摘要

光学音乐识别(OMR)在模型设计方面取得了重大进展,端到端方法现在能够识别所有复杂程度的符号。然而,这一进展的影响受到可用训练数据集视觉领域的限制,这些数据集大多是数字原生的。图书馆和其他遗产机构中现有的大量乐谱收藏主要包含手稿,其视觉领域高度多样且不同,因此现有的OMR系统在现实世界中应用时失败。这些机构通常资源受限,因此无法期望大规模领域内数据集。我们在资源受限场景下为具有复杂钢琴符号的真实世界手稿提供了第一个基线。使用细粒度音乐符号图(MuNG)注释和Smashcima合成工具,我们随后表明,虽然领域内数据的一些直接转录仍然是必要的,但使用合成音乐手稿图像进行域自适应带来了显著的改进。此外,所使用的符号不需要是领域内的,因此可以避免昂贵的细粒度注释。因此,我们将OMR更接近其既定目标之一:保护和推广音乐文化遗产。

英文摘要

Optical Music Recognition (OMR) has seen major progress in model design, with end-to-end methods now capable of recognising notation at all levels of complexity. However, the impact of this progress has been limited by the visual domains of available training datasets, which are largely born-digital. Existing large collections of sheet music in libraries and other heritage institutions contain predominantly manuscripts, whose visual domains are highly diverse and different, so existing OMR systems fail when applied in the real world. These institutions are often resource-constrained, so large in-domain datasets cannot be expected. We provide a first baseline on real-world manuscripts with complex piano notation in the resource-constrained scenario. Using fine-grained music notation graph (MuNG) annotations and the Smashcima synthesis tool, we then show that while some direct transcriptions of in-domain data remain essential, domain adaptation using synthetic musical manuscript images brings significant improvement. Furthermore, the symbols used do not need to be in-domain, so the expensive fine-grained annotation can be avoided. We thus bring OMR closer to one of its stated goals: preserving and promoting musical cultural heritage.

2606.09477 2026-06-09 cs.CV 新提交

Efficient Minimal Solvers for Visual-Inertial Relative Pose Estimation in Multi-Camera Systems

多相机系统中视觉-惯性相对位姿估计的高效最小求解器

Tao Li, Zhenbao Yu, Banglei Guan, Jianli Han, Weimin Lv

发表机构 * Naval Aviation University(海军航空大学) National University of Defense Technology(国防科技大学)

AI总结 提出两种基于IMU先验的最小求解器,仅需4个点对应,将多相机相对位姿问题简化为单变量6次多项式,显著降低计算复杂度,在RANSAC框架中表现优异。

详情
AI中文摘要

估计多相机系统的相对位姿是计算机视觉中的一个基本问题,在自动驾驶、移动设备和无人机(UAV)中具有关键应用。然而,现有解决方案通常计算复杂度高或依赖过多的点对应,限制了其实际应用。为解决这些限制,我们提出两种高效的最小求解器,利用新颖的参数化来估计多相机系统的相对位姿。第一种求解器利用惯性测量单元(IMU)提供的垂直方向先验,第二种利用IMU提供的旋转轴方向先验。我们的方法仅需四个点对应,并将多相机相对位姿估计问题简化为求解一个单变量6次多项式,相较于通常涉及8次多项式的现有方法有显著改进。这种计算复杂度和对应要求的降低使得我们的求解器在集成到RANSAC框架中时特别有效,展示了在视觉里程计应用中的强大潜力。通过在合成数据和KITTI基准上的严格评估,我们的方法相比最先进算法实现了卓越的计算效率和具有竞争力的精度。

英文摘要

Estimating the relative poses of multi-camera systems is a fundamental problem in computer vision, with critical applications in autonomous vehicles, mobile devices, and unmanned aerial vehicles (UAVs). However, existing solutions often suffer from high computational complexity or rely on an excessive number of point correspondences, limiting their real-world applicability. To address these limitations, we propose two efficient minimal solvers for estimating the relative poses of multi-camera systems using a novel parameterization. The first solver leverages the vertical direction prior provided by Inertial Measurement Units (IMUs), while the second utilizes the rotation axis direction prior from IMUs. Our methods require only four point correspondences and reduce the problem of multi-camera relative pose estimation to solving a univariate 6th-degree polynomial, a significant improvement over existing approaches, which typically involve 8th-degree polynomials. This reduction in computational complexity and correspondence requirements makes our solvers particularly effective when integrated into RANSAC frameworks, demonstrating strong potential for visual odometry applications. Through rigorous evaluations on synthetic data and the KITTI benchmark, our methods achieved superior computational efficiency and competitive accuracy compared to state-of-the-art algorithms.

2606.09476 2026-06-09 cs.RO 新提交

Goal Sets, Not Goal States: Queryable Robot Goals through Goal-Set Hindsight Relabeling

目标集,而非目标状态:通过目标集事后重标记实现可查询的机器人目标

Carlos Vélez García, Miguel Cazorla, Jorge Pomares

发表机构 * INESCOP(西班牙鞋类及相关技术研究所) University of Alicante(阿利坎特大学)

AI总结 提出目标集事后重标记(GS-HER),将事后重标记从单目标状态推广到谓词级目标集,通过可查询的二值谓词解耦成功条件与状态维度,提升离线GCRL在冗余维度下的性能,并实现单一模型支持多目标谓词。

详情
AI中文摘要

事后重标记通常将实现的未来状态转化为精确目标,当任务成功仅取决于状态子集时,这可能会过度约束离线机器人学习。我们提出目标集事后重标记(GS-HER),这是HER在谓词级别上的推广,其中实现的状态认证查询定义的目标集,而非单一目标状态。一个二值查询指定哪些变量定义成功,使目标谓词成为推理时的输入,同时保持底层离线GCRL算法不变。在OGBench任务和五个离线目标条件学习器上,当全状态目标受到无关维度的瓶颈时,GS-HER提升了性能,并将事后重标记转变为可重用的目标接口:一个检查点可以回答多个机器人目标谓词,而无需重新训练。

英文摘要

Hindsight relabeling usually turns achieved future states into exact goals, which can overconstrain offline robot learning when task success depends only on a subset of the state. We propose Goal-Set Hindsight Relabeling (GS-HER), a predicate-level generalization of HER in which achieved states certify query-defined goal sets rather than singleton goal states. A binary query specifies which variables define success, making the goal predicate an inference-time input while leaving the underlying offline GCRL algorithm unchanged. Across OGBench tasks and five offline goal-conditioned learners, GS-HER improves performance when full-state goals are bottlenecked by nuisance dimensions and turns hindsight relabeling into a reusable goal interface: one checkpoint can answer multiple robot goal predicates without retraining.
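The difference between a singleton goal state and a query-defined goal set can be sketched with a binary mask over state variables. The code below is a minimal illustration assuming states are flat lists of numbers; the function names, the tolerance, and the random mask sampling are hypothetical choices, not the GS-HER implementation.

```python
import random

def satisfies(state, goal, query, tol=1e-6):
    """True if the state matches the goal on every variable selected by the query."""
    return all(abs(s - g) <= tol for s, g, q in zip(state, goal, query) if q)

def relabel_goal_set(trajectory, t, rng, p_keep=0.5):
    """Pick a future achieved state and a binary query marking which variables count."""
    future = rng.choice(trajectory[t + 1:])
    query = [1 if rng.random() < p_keep else 0 for _ in future]
    if not any(query):                      # keep at least one variable in the goal
        query[rng.randrange(len(query))] = 1
    return list(future), query

# Hypothetical 3-variable trajectory.
traj = [[0.0, 0.0, 0.0], [1.0, 0.0, 2.0], [1.0, 5.0, 3.0]]
goal, query = relabel_goal_set(traj, 0, random.Random(0))

# With query [1, 0], only the first variable defines success.
partial_match = satisfies([1.0, 9.0], [1.0, 0.0], [1, 0])
full_match = satisfies([1.0, 9.0], [1.0, 0.0], [1, 1])
```

Setting the query to all ones recovers the usual exact-goal relabeling, so the mask acts as an extra inference-time input rather than a change to the underlying learner.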

2606.09474 2026-06-09 cs.CV 新提交

Training-Free Generalized Few-Shot Segmentation through Open-Vocabulary Semantic Arbitration

无需训练的通用的少样本分割通过开放词汇语义仲裁

Silas Kwabla Gah, Ebenezer Owusu

发表机构 * University of Ghana(加纳大学)

AI总结 提出Open-V框架,通过推理时协调冻结的语义先验(SAM3 PCS与K-shot CLIP支持质心)实现无需训练的通用少样本分割,在多个基准上超越有监督方法。

详情
AI中文摘要

通用少样本语义分割(GFSS)传统上被视为表示学习问题,需要任务特定的适应来从有限的支持样本中引入新类别。然而,最近的基础模型已经展现出强大的开放词汇识别和分割能力,这提出了一个不同的问题:能否通过推理时协调冻结的语义先验而不是参数适应来解决GFSS?我们通过Open-V回答了这个问题,这是一个无需训练的GFSS框架,它结合了Segment Anything (SAM3) 可提示概念分割(PCS)与K-shot CLIP支持质心,通过校准的逐像素语义仲裁。Open-V不引入任何可训练组件,并在推理时支持任意语义类别。除了分割性能,我们的研究还贡献了三个更广泛的发现。首先,我们表明支持信息可以通过推理时语义基础来整合,并且其贡献随着基础模型文本先验在标签不相交词汇表上的减弱而增加。其次,我们识别了基础模型分割中的可重复性混淆,证明了预处理和评估空间的不匹配会无声地扭曲报告的性能。最后,我们在PASCAL-5i、COCO-20i和ADE-OW上验证了Open-V,表明无需训练的基础模型先验协调在常规GFSS和开放词汇评估设置中都能泛化。在PASCAL-5i(1-shot)上,Open-V达到了基础/新类/调和mIoU分别为78.4/77.5/77.9,无需GFSS特定训练,超越了最强有监督基线+17.7 HM。

英文摘要

Generalized Few-Shot Semantic Segmentation (GFSS) has traditionally been approached as a representation-learning problem, requiring task-specific adaptation to incorporate novel classes from limited support examples. Recent foundation models, however, already exhibit strong open-vocabulary recognition and segmentation capabilities, raising a different question: can GFSS be solved through inference-time coordination of frozen semantic priors rather than parameter adaptation? We answer this question with Open-V, a training-free GFSS framework that combines Segment Anything (SAM3) Promptable Concept Segmentation (PCS) with a K-shot CLIP support centroid through calibrated per-pixel semantic arbitration. Open-V introduces no trainable components and supports arbitrary semantic categories at inference time. Beyond segmentation performance, our study contributes three broader findings. First, we show that support information can be incorporated through inference-time semantic grounding, and that its contribution increases as foundation-model text priors weaken on label-disjoint vocabularies. Second, we identify a reproducibility confound in foundation-model segmentation, demonstrating that preprocessing and evaluation-space mismatches can silently distort reported performance. Finally, we validate Open-V across PASCAL-5i, COCO-20i, and ADE-OW, showing that training-free coordination of foundation-model priors generalizes across both conventional GFSS and open-vocabulary evaluation settings. On PASCAL-5i (1-shot), Open-V attains base/novel/harmonic mIoU of 78.4/77.5/77.9 without GFSS-specific training, surpassing the strongest trained baseline by +17.7 HM.
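The reported base/novel/harmonic figures can be sanity-checked with a short calculation, assuming the harmonic score (HM) is the harmonic mean of base and novel mIoU, as is conventional in GFSS evaluation. The function name is illustrative.

```python
def harmonic_mean(base: float, novel: float) -> float:
    """Harmonic mean of base-class and novel-class mIoU."""
    return 2.0 * base * novel / (base + novel)

hm = harmonic_mean(78.4, 77.5)
hm_rounded = round(hm, 1)  # matches the 77.9 reported in the abstract
```

The harmonic mean penalises imbalance, so a method cannot score well by excelling on base classes while neglecting novel ones.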

2606.09471 2026-06-09 cs.LG cs.CL 新提交

Escaping the KL Agreement Trap in On-Policy Distillation

逃离在线策略蒸馏中的KL一致陷阱

Haoran Xin, Anhao Zhao, Ying Sun, Jin Li, Xiaoyu Shen, Hui Xiong

发表机构 * The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) The Hong Kong University of Science and Technology(香港科技大学) The Hong Kong Polytechnic University(香港理工大学) Eastern Institute of Technology, Ningbo(宁波东方理工大学)

AI总结 针对在线策略蒸馏中学生陷入低KL一致陷阱导致训练信号弱的问题,提出KAT动态终止规则,过滤弱监督,在数学基准上提升avg@k 2.66%和pass@k 3.43%,同时减少59.73%的rollout长度。

Comments 13 pages, 8 figures

详情
AI中文摘要

在线策略蒸馏(OPD)通过让教师对学生生成的rollout进行评分,提供密集的token级监督。然而,当学生漂移到不可恢复的前缀时,教师可能局部同意退化状态,产生低反向KL但几乎没有纠正训练信号。我们将这种持续状态识别为低KL一致陷阱。进一步分析表明,陷阱期间及之后的token产生的监督信号效用较低。我们提出KAT(KL一致陷阱终止),一种在线OPD终止规则,通过动态训练自适应阈值检测持续的低KL一致。通过过滤来自退化一致的弱监督,KAT在四个数学基准上将avg@k准确率提升2.66%,pass@k提升3.43%,同时将平均rollout长度减少59.73%。

英文摘要

On-policy distillation (OPD) provides dense token-level supervision by asking a teacher to score student-generated rollouts. However, when the student drifts into an unrecoverable prefix, the teacher may locally agree with the degraded state, producing low reverse KL but little corrective training signal. We identify this persistent regime as a low-KL agreement trap. Further analyses show that tokens during and after such traps produce less useful supervision signals. We propose KAT (KL Agreement Trap Termination), an online OPD termination rule that detects persistent low-KL agreement with a dynamic training-adaptive threshold. By filtering weak supervision from degenerate agreement, KAT improves avg@k accuracy by 2.66% and pass@k by 3.43% across four mathematical benchmarks, while reducing average rollout length by 59.73%.
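A termination rule of the kind described can be sketched as "stop the rollout once the per-token reverse KL stays below a threshold for several consecutive tokens". The sketch below uses a fixed threshold and a fixed patience for simplicity, whereas the abstract describes a dynamic, training-adaptive threshold; all names and numbers are illustrative assumptions rather than the KAT implementation.

```python
import math

def reverse_kl(student, teacher):
    """KL(student || teacher) over a shared vocabulary; assumes teacher probs > 0."""
    return sum(p * math.log(p / q) for p, q in zip(student, teacher) if p > 0)

def trap_index(kls, threshold, patience):
    """Index at which `patience` consecutive low-KL tokens complete, else None."""
    run = 0
    for i, kl in enumerate(kls):
        run = run + 1 if kl < threshold else 0
        if run >= patience:
            return i
    return None

# Hypothetical per-token reverse-KL values along one rollout.
kls = [0.9, 0.4, 0.01, 0.02, 0.6, 0.01, 0.01, 0.01, 0.3]
stop_at = trap_index(kls, threshold=0.05, patience=3)
truncated = kls if stop_at is None else kls[: stop_at + 1]
```

Requiring a sustained run rather than a single low value avoids cutting rollouts at tokens where teacher and student merely happen to agree for one step.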

2606.09470 2026-06-09 cs.CL cs.AI 新提交

A Finetuned SpeechLLM for Joint Multi-Granular L2 Assessment and Natural-Language Rationales

一种用于联合多粒度L2评估和自然语言解释的微调SpeechLLM

Aditya Kamlesh Parikh, Cristian Tejedor-Garcia, Catia Cucchiarini, Helmer Strik

发表机构 * Centre for Language Studies, Radboud University(语言研究中心,拉德堡德大学)

AI总结 提出一种基于评分准则的SpeechLLM,通过混合训练目标联合预测句子级和词/音素级标签并生成自然语言解释,在SpeechOcean762上达到或超越单粒度模型。

Comments Accepted to Interspeech 2026. This publication is part of the project Responsible AI for Voice Diagnostics (RAIVD) with file number NGF.1607.22.013 of the research programme NGF AiNed Fellowship Grants, which is financed by the Dutch Research Council (NWO)

详情
AI中文摘要

自动化的L2语音评估可以分配熟练度标签,但通常缺乏可解释性。我们提出了一种基于评分准则的SpeechLLM,用于多角度、多粒度的评估,采用结合监督微调和有界直接偏好优化的混合目标进行训练。该模型在同一个响应中联合预测句子级(准确性、流利度、韵律)的序数标签、词/音素级准确性,并生成自然语言解释。在SpeechOcean762上,我们的方法匹配或优于单粒度模型,同时与先前方法保持竞争力。我们从两个维度分析解释的可靠性:与模型预测的自一致性和与真实标签的对齐,使用情感一致性(合理性)和基于提及的一致性(忠实性)。解释在句子级别是合理的,但在词/音素级别忠实性下降:参考稀疏且与词元级标签弱对齐。

英文摘要

Automated L2 speech assessment can assign proficiency labels, but often lacks interpretability. We propose a rubric-guided SpeechLLM for multi-aspect, multi-granular assessment, trained with a hybrid objective combining supervised fine-tuning and Bounded Direct Preference Optimization. The model jointly predicts sentence-level ordinal labels (accuracy, fluency, prosody) and word/phoneme-level accuracy, and generates a natural-language rationale in the same response. On SpeechOcean762, our approach matches or outperforms single-granularity models while remaining competitive with prior approaches. We analyze rationale reliability along two axes: self-consistency with model predictions and alignment with ground-truth labels, using sentiment consistency (plausibility) and mention-based agreement (faithfulness). Rationales are plausible at the sentence level, but faithfulness degrades at the word/phoneme level: references are sparse and weakly aligned with token-level labels.

2606.09461 2026-06-09 cs.CL 新提交

H2HMem: A Multimodal Memory Benchmark for Agents in Human-Human Interactions

H2HMem: 面向人际交互中智能体的多模态记忆基准

Shiping Zhu, Yibo Yang, Zhengyang Wang, Tiancheng Shen, Dandan Guo, Ming-Hsuan Yang

发表机构 * Jilin University(吉林大学) Shanghai Jiao Tong University(上海交通大学) University of California at Merced(加州大学默塞德分校)

AI总结 提出H2HMem基准,通过双人和多人多模态对话评估智能体在记忆召回、推理和应用方面的能力,揭示现有模型在多模态、多参与者场景下的显著局限。

Comments 22 pages, 6 figures

详情
AI中文摘要

大型语言模型智能体越来越多地部署在人际交互场景中,例如会议助手和临床文档系统,在这些场景中,它们必须观察对话并保留信息以供后续查询。与传统的人助交互不同,这些环境本质上是多模态的,涉及复杂的语篇现象,如回指和指示,并且包含来自多个参与者的异步或冲突信息。然而,现有的记忆基准主要关注单用户、纯文本交互,未能捕捉这些挑战。为填补这一空白,我们引入了H2HMem,一个面向复杂人际交互中记忆能力评估的人-人多模态记忆基准。H2HMem包括双人和多人对话,包含多模态信息流,并从三个维度评估智能体:记忆召回、推理和应用。使用先进智能体的实验揭示了在跨模态、参与者和会话中构建、保留和利用记忆方面的显著局限性,凸显了下一代LLM智能体需要大幅改进的空间。

英文摘要

Large language model agents are increasingly deployed in human-human interaction settings, such as meeting assistants and clinical documentation systems, where they must observe conversations and retain information for downstream queries. Unlike traditional human-assistant settings, these environments are inherently multimodal, involve complex discourse phenomena such as anaphora and deixis, and contain asynchronous or conflicting information from multiple participants. However, existing memory benchmarks largely focus on single-user, text-only interactions, failing to capture these challenges. To address this gap, we introduce H2HMem, a Human-to-Human Multimodal Memory Benchmark for evaluating memory capabilities in complex human-human interactions. H2HMem includes both dyadic and multi-party conversations with multimodal information streams, and evaluates agents along three dimensions: memory recall, reasoning, and application. Experiments with advanced agents reveal substantial limitations in constructing, retaining, and utilizing memories across modalities, participants, and sessions, leaving considerable room for improvement in next-generation LLM agents.

2606.09459 2026-06-09 cs.CL 新提交

AbstRAG: Learning to Abstract for Retrieval Problems

AbstRAG:面向检索问题的抽象学习

Lei Xu, Xin Quan, Daniel Pedronette, André Freitas

发表机构 * Idiap Research Institute(Idiap 研究所) École Polytechnique Fédérale de Lausanne (EPFL)(洛桑联邦理工学院 (EPFL)) São Paulo State University(圣保罗州立大学) University of Manchester(曼彻斯特大学) CRUK National Biomarker Centre(英国癌症研究中心国家生物标志物中心)

AI总结 针对查询与文档证据间的抽象鸿沟问题,提出AbstRAG方法,通过将抽象作为显式检索对象,并采用反思性精炼机制,在三个基准上提升了检索和生成性能。

详情
AI中文摘要

当查询、文档证据和用户意图以不同抽象级别表达时,检索增强生成常常失败。查询可能询问一个类别、关系或事件,而文档仅陈述具体实例、间接框架或限定表述。我们将这种不匹配定义为抽象鸿沟:将查询意图与可用证据对齐所需的最小类型假设集合。为弥合这一鸿沟,我们引入AbstRAG,将抽象视为显式检索对象。AbstRAG将查询-证据鸿沟分解为表达、概念、意图-证据和事件类型组件,并通过结合匹配质量、查询无关的效用先验以及所需桥梁的成本来评分相关性。其核心机制是反思性精炼:批评者诊断检索失败,定位失败的抽象操作符,提出最小的阶段特定补丁,并仅在充分性和压缩控制下接受补丁。在三个文档内检索基准上与七个基线对比,AbstRAG在21个配对自助法对比中的18个上以nDCG@10胜出,并在三个基准上分别将生成准确率提升1.9%、5.2%和4.0%;消融实验证实,反思性精炼驱动了大部分检索增益,而仅压缩控制就在压力切片上将过度扩展假阳性从73.7%降至0%。

英文摘要

Retrieval-augmented generation often fails when the query, the document evidence, and the user's intent are expressed at different levels of abstraction. A query may ask about a class, a relation, or an event, while the document only states specific instances, indirect framings, or scoped formulations. We define this mismatch as an abstraction gap: the minimal set of typed assumptions required to align query intent with the available evidence. To close this gap, we introduce AbstRAG, which treats abstraction as an explicit retrieval object. AbstRAG decomposes the query--evidence gap into expression, conceptual, intent--evidence, and event-type components, and scores relevance by combining match quality, a query-independent utility prior, and the cost of the required bridges. Its central mechanism is reflective refinement: a critic diagnoses retrieval failures, localizes the failed abstraction operator, proposes a minimal stage-specific patch, and accepts the patch only under sufficiency and compression controls. Across three within-document retrieval benchmarks against seven baselines, AbstRAG outperforms on nDCG@10 in 18 of 21 paired-bootstrap contrasts and improves generation accuracy by 1.9%, 5.2%, and 4.0% across the three benchmarks; ablations confirm that reflective refinement drives most of the retrieval gain and the compression control alone reduces over-expansion false positives from 73.7% to 0% on a stress slice.

2606.09457 2026-06-09 cs.RO 新提交

$ω$-EVA: Envision, Verify, and Act with Latent Interactive World Models

$ω$-EVA:基于潜在交互世界模型的构想、验证与行动

Zhenguo Sun, Yu Sun, Hande Huang, Alois Knoll

发表机构 * Technical University of Munich(慕尼黑工业大学)

AI总结 提出$ω$-EVA框架,通过潜在交互世界模型实现“构想-验证-行动”循环,利用动作条件潜在动力学和语言条件流策略生成动作,无需生成未来视频,在多种机器人操作任务中提升策略性能。

详情
AI中文摘要

具身策略通常直接将当前观测映射到动作,使得候选动作的后果隐含。世界模型提供预测监督、表示或外部模拟,但很少让策略在行动前检查自身提议的想象后果。我们提出$ω$-EVA,一种潜在交互世界模型,实现了用于具身动作生成的构想-验证-行动循环。其三阶段框架学习动作条件潜在动力学,在动力学感知的视觉表示上训练语言条件流策略,并将策略的提议反馈给世界模型。一个三分支精炼器联合推理当前状态、提议条件未来和提议动作,以生成最终动作块。由于后果推理保持在潜在特征空间中,$ω$-EVA在推理时避免了生成未来视频。在多种单臂、双臂、长时域和扰动仿真设置中的评估表明,完整的交互流程持续改进了提议策略,而潜在诊断指示了有意义的动作条件未来结构。拥有约12亿参数且无需额外的机器人数据预训练,$ω$-EVA展示了紧凑且具有竞争力的性能-规模-数据权衡,使世界模型成为主动的动作反馈模块而非被动预测器。

英文摘要

Embodied policies typically map current observations directly to actions, leaving candidate-action consequences implicit. World models provide predictive supervision, representations, or external simulation, but rarely let a policy inspect the imagined consequence of its own proposal before acting. We introduce $ω$-EVA, a latent interactive world model that realizes an Envision--Verify--Act loop for embodied action generation. Its three-stage framework learns action-conditioned latent dynamics, trains a language-conditioned flow policy on dynamics-aware visual representations, and feeds the policy's proposal back through the world model. A tri-branch refiner jointly reasons over the current state, proposal-conditioned future, and proposed action to produce the final action chunk. Because consequence reasoning remains in latent feature space, $ω$-EVA avoids generating future videos at inference. Evaluations across diverse single-arm, bimanual, long-horizon, and perturbed simulation settings show that the complete interaction pipeline consistently improves the proposal policy, while latent diagnostics indicate meaningful action-conditioned future structure. With approximately 1.2B parameters and no additional robot-data pretraining, $ω$-EVA demonstrates a compact and competitive performance--scale--data trade-off, making the world model an active action-feedback module rather than a passive predictor.

2606.09456 2026-06-09 cs.LG 新提交

Breaking the Tokenizer Barrier: On-Policy Distillation across Model Families

打破分词器壁垒:跨模型系列的在线策略蒸馏

Yifan Niu, Han Xiao, Dongyi Liu, Zelong Wang, Dihong Gong, Yasheng Wang, Jia Li

发表机构 * The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) Tencent(腾讯) The Hong Kong University of Science and Technology(香港科技大学)

AI总结 提出跨分词器在线策略蒸馏方法,通过精确的token映射算法使教师模型概率分布信号能跨不同分词器传播,显著提升计算效率。

详情
AI中文摘要

在线策略蒸馏(OPD)已成为大型语言模型(LLM)后训练中从领域专家向学生模型迁移知识的核心技术。然而,现有的OPD蒸馏方法要求教师和学生模型共享相同的分词器,限制了OPD在模型系列内的适用性。当前主流实践通常采用在教师生成的响应上进行监督微调(SFT)来实现跨分词器蒸馏,这未能捕捉到嵌入在教师概率分布中的丰富知识。在这项工作中,我们使标准的在线策略蒸馏方法能够跨模型系列运行,确保高保真的token级信号可以通过精确的token映射算法在不同分词器之间传播。大量实验表明,在各种基准测试上,跨分词器OPD在计算效率上显著优于基线方法。我们的结果为OPD解锁了更广泛的教师-学生配对,为适应和增强LLM之间的交互开辟了新途径。

英文摘要

On-Policy Distillation (OPD) has become a core technique in the post-training of Large Language Models (LLMs) for transferring knowledge from domain experts to student models. However, existing OPD methods require teacher and student models to share the same tokenizer, restricting OPD to teacher-student pairs within a single model family. Current mainstream practice typically employs Supervised Fine-Tuning (SFT) on teacher-generated responses for cross-tokenizer distillation, which fails to capture the rich knowledge embedded in the teacher's probability distribution. In this work, we enable the standard on-policy distillation method to operate across model families, ensuring that high-fidelity token-level signals can propagate across different tokenizers with a precise token-mapping algorithm. Extensive experiments show that cross-tokenizer OPD is significantly more compute-efficient than baselines on various benchmarks. Our results unlock a broader range of teacher-student pairs for OPD, opening up new avenues for adapting and enhancing interactions between LLMs.
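To give a feel for why token mapping is needed, the sketch below aligns two different tokenizations of the same text by grouping tokens into the smallest chunks that cover identical character ranges. This is a generic character-offset alignment offered as an assumption-laden illustration; it is not the token-mapping algorithm proposed in the paper, and the example token splits are made up.

```python
def align(tokens_a, tokens_b):
    """Group two tokenizations of the same string into minimal matching chunks.

    Returns a list of (chunk_from_a, chunk_from_b) pairs whose concatenated
    text is identical. Assumes non-empty tokens.
    """
    if "".join(tokens_a) != "".join(tokens_b):
        raise ValueError("both tokenizations must spell the same text")
    chunks = []
    i = j = 0                  # next unread token in each sequence
    end_a = end_b = 0          # characters consumed so far on each side
    start_i = start_j = 0      # where the current chunk began
    while i < len(tokens_a) or j < len(tokens_b):
        # Advance whichever side has consumed fewer characters.
        if end_a <= end_b and i < len(tokens_a):
            end_a += len(tokens_a[i])
            i += 1
        else:
            end_b += len(tokens_b[j])
            j += 1
        if end_a == end_b:     # boundaries coincide: close the chunk
            chunks.append((tokens_a[start_i:i], tokens_b[start_j:j]))
            start_i, start_j = i, j
    return chunks

# Hypothetical splits of the same word by two tokenizers.
student_tokens = ["un", "believ", "able"]
teacher_tokens = ["unbe", "liev", "able"]
mapping = align(student_tokens, teacher_tokens)
```

Within each chunk the two token sequences describe the same text, so chunk-level probabilities from one tokenizer could in principle be compared against the other; how to do that with high fidelity at the token level is the problem the paper addresses.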

2606.09453 2026-06-09 cs.CV 新提交

GD-MIL: Grade-Disentangled Multiple Instance Learning for Multimodal Biochemical Recurrence Prediction in Prostate Cancer

GD-MIL:用于前列腺癌多模态生化复发预测的等级解耦多实例学习

Dasari Naga Raju

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 针对前列腺癌生化复发预测,提出GD-MIL方法,通过梯度反转等级对抗训练实现H&E全切片图像特征与Gleason等级解耦,融合临床变量后C-index达0.704,显著优于临床基线。

详情
AI中文摘要

根治性前列腺切除术后的生化复发(BCR)是前列腺癌的关键终点,但风险分层几乎完全依赖于以Gleason等级为主的变量。H&E全切片图像(WSI)是否携带超出等级之外的预后信号,以及多实例学习(MIL)能否恢复这些信号,仍无定论。一个关键障碍是许多流程在评估折上选择模型检查点,人为提高了一致性指数。我们在TCGA-PRAD(487名患者,101个BCR事件)上构建了严格的基准测试,使用严格的折外评分,在五个种子重复的五折交叉验证上进行评估。MIL聚合器(ABMIL、CLAM、TransMIL、PatchGCN)的选择影响很小(UNI2-h的C指数为0.61-0.64),而特征提取器是主导因素(ResNet50为0.566,病理基础模型高达0.639)。基于等级、分期和年龄的临床Cox模型达到0.687;没有纯影像模型显著优于它(p > 0.10)。我们引入了等级解耦MIL(GD-MIL),一种门控注意力MIL编码器,通过梯度反转等级对抗训练,使切片表示在与临床变量晚期融合之前对Gleason等级保持不变。GD-MIL实现了C指数0.704,显著优于临床基线(delta-c = +0.029,p = 0.0005)和最佳纯影像模型(delta-c = +0.062,p = 0.039),表明H&E形态学包含与等级互补的预后信息。中位风险分层在BCR无生存期上产生log-rank p < 0.0001的分离(五年约20% vs 约70%)。

英文摘要

Biochemical recurrence (BCR) after radical prostatectomy is a critical endpoint in prostate cancer, yet risk stratification relies almost entirely on variables dominated by Gleason grade. Whether H&E whole slide images (WSIs) carry prognostic signal beyond grade, and whether multiple instance learning (MIL) can recover it, remains unsettled. A key obstacle is that many pipelines select model checkpoints on the evaluation fold, artificially inflating concordance. We construct a rigorous benchmark on TCGA-PRAD (487 patients, 101 BCR events) using strict out-of-fold scoring over five-fold cross-validation repeated across five seeds. The choice of MIL aggregator (ABMIL, CLAM, TransMIL, PatchGCN) has little effect (C-index 0.61-0.64 with UNI2-h), while the feature extractor is the dominant factor (ResNet50 0.566 versus pathology foundation models up to 0.639). A clinical Cox model on grade, stage, and age reaches 0.687; no imaging-only model significantly outperforms it (p > 0.10). We introduce Grade-Disentangled MIL (GD-MIL), a gated-attention MIL encoder trained with a gradient-reversal grade adversary that encourages the slide representation to be invariant to Gleason grade before late fusion with clinical variables. GD-MIL achieves C-index 0.704, significantly outperforming both the clinical baseline (delta-c = +0.029, p = 0.0005) and the best imaging-only model (delta-c = +0.062, p = 0.039), suggesting H&E morphology contains prognostic information complementary to grade. A median risk split yields log-rank p < 0.0001 separation in BCR-free survival (~20% vs ~70% at five years).
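Since the abstract reports every result as a C-index, a small sketch of Harrell's concordance index may help in reading the numbers. The implementation below is a plain quadratic-time version for right-censored data, written for illustration; the function name and toy data are assumptions and the paper's evaluation code may differ.

```python
def concordance_index(times, events, risks):
    """Harrell's C-index for right-censored survival data.

    times: follow-up time per patient.
    events: 1 if the event (e.g. recurrence) was observed, 0 if censored.
    risks: predicted risk score, higher meaning earlier expected event.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when patient i had an observed event
            # strictly before patient j's follow-up time.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    return concordant / comparable

# Toy data, not from TCGA-PRAD.
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
perfect = concordance_index(times, events, [0.9, 0.7, 0.3, 0.1])
inverted = concordance_index(times, events, [0.1, 0.3, 0.7, 0.9])
constant = concordance_index(times, events, [0.5, 0.5, 0.5, 0.5])
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which puts the reported gap between 0.687 (clinical baseline) and 0.704 (GD-MIL) in context.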

2606.09451 2026-06-09 cs.RO cs.CV cs.LG 新提交

Dense Force Estimation with an Event-based Optical Tactile Sensor

基于事件的光学触觉传感器的稠密力估计

Agis Politis, René Zurbrügg, Valentina Cavinato

发表机构 * Sony Advanced Visual Sensing, Zurich, Switzerland(索尼高级视觉传感公司,苏黎世,瑞士) ETH Zürich(苏黎世联邦理工学院)

AI总结 提出首个利用事件相机重建稠密3D力场的方法,通过事件数据估计表面位移并映射为力,平均误差(0.14N,0.10N,0.93N),工作频率100Hz。

详情
AI中文摘要

人类依赖空间稠密、几何和力感知的触觉反馈以高时间分辨率进行灵巧操作。虽然基于视觉的触觉传感器能够实现稠密力估计,但受限于相机帧率、运动模糊和数据带宽。基于事件的光学触觉传感器具有微秒级时间分辨率和低运动模糊的优点,但现有方法仅限于预测净力。我们提出了首个利用基于事件的光学触觉传感器进行稠密3D力场重建的框架。我们的方法从事件数据估计3D表面位移,并通过逆有限元方法(iFEM)将其映射为力。剪切位移通过所提出的事件标记跟踪算法恢复,而法向位移则由卷积神经网络预测,该网络在收集的同步力-位移-事件数据集上训练。实验表明,该方法能够准确重建物理力,在力范围高达(4N,4N,20N)时,平均绝对误差为(0.14N,0.10N,0.93N),同时以平均100Hz的频率运行。这项工作为在机器人抓取和灵巧操作中实现高频控制的稠密力反馈迈出了第一步。

英文摘要

Humans rely on spatially dense, geometry- and force-aware tactile feedback at high temporal resolution for dexterous manipulation. While vision-based tactile sensors enable dense force estimation, they are limited by camera frame rates, motion blur, and data bandwidth. Event-based optical tactile sensors offer an attractive alternative with microsecond temporal resolution and low motion blur, but existing methods are restricted to predicting only net forces. We introduce the first framework for dense 3D force field reconstruction using event-based optical tactile sensors. Our approach estimates 3D surface displacements from event data and maps them to forces via the inverse Finite Element Method (iFEM). Shear displacements are recovered through the proposed event-based marker tracking algorithm, while normal displacements are predicted by a convolutional neural network trained on a collected dataset of synchronized force-displacement-event data. Experiments demonstrate accurate reconstruction of physically grounded forces, achieving a mean absolute error of (0.14 N, 0.10 N, 0.93 N) over force ranges up to (4 N, 4 N, 20 N), while operating at an average of 100 Hz. This work constitutes a first step toward enabling dense force feedback for high-frequency control in robotic grasping and dexterous manipulation.