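arXivDaily: a daily academic digest synced with the full arXiv feed, with AI summaries and translations covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and more.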

2606.02580 2026-06-02 cs.CV

IntraShuffler: A Privacy-Preserving Framework for Heterogeneous Differentially Private Federated Learning

Farhin Farhad Riya, Olivera Kotevska, Jinyuan Stella Sun

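Affiliations: University of Tennessee, Knoxville, USA; Oak Ridge National Laboratory, USA

AI summary: Targets privacy inference attacks in heterogeneous differentially private federated learning, in which an honest-but-curious server infers client attributes from gradient structure. Proposes IntraShuffler, a middleware framework whose privacy-aware shuffling mechanism disrupts persistent gradient structure while preserving ε-aware aggregation, reducing gradient recoverability by over 60% and lowering surrogate inference accuracy from 0.78 to 0.33.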

Abstract

Heterogeneous Differential Privacy (HDP) in Federated Learning (FL) allows clients to select individual privacy budgets ($\varepsilon_i$) according to institutional policies and data sensitivity. In practice, many HDP-FL systems employ $\varepsilon$-aware server aggregation to improve model utility by re-weighting client updates according to their declared privacy budgets. However, gradient updates in FL retain structural patterns induced by non-independent and identically distributed (non-IID) data, and these additional signals exposed by $\varepsilon$-aware aggregation create new opportunities for inference by an honest-but-curious server. In this work, we first show that a server equipped with gradient denoising and surrogate modeling can mount a Privacy Inference Attack that infers distributional attributes of clients and links updates from the same client across training rounds, measured via surrogate inference accuracy and linkage success, under realistic knowledge constraints. The shuffle model has been widely studied as a defense against such inference risks by anonymizing update sources, but it is fundamentally incompatible with HDP-FL $\varepsilon$-aware aggregation. To address this challenge, we propose IntraShuffler, a middleware defense framework designed for HDP-FL systems. IntraShuffler introduces a privacy-aware shuffling mechanism that groups clients into privacy-compatible buckets and performs parameter-level shuffling within each bucket to disrupt persistent gradient structure while preserving $\varepsilon$-aware aggregation. Experiments across four different datasets show that IntraShuffler reduces gradient recoverability by over 60% and decreases surrogate inference accuracy from 0.78 to 0.33 while maintaining comparable model utility across multiple FL aggregation rules.
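
The bucket-then-shuffle idea in this abstract can be illustrated with a small sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the bucketing rule (fixed $\varepsilon$ edges), the function names, and the choice to permute each parameter coordinate independently. The sketch only shows why shuffling inside a bucket is compatible with bucket-level aggregation: each coordinate's sum over the bucket is unchanged, so an aggregate that weights all members of a bucket equally is unaffected, while no shuffled row belongs to a single client any more.

```python
import random
from collections import defaultdict

def bucket_by_epsilon(epsilons, edges):
    """Group client ids by epsilon interval (hypothetical bucketing rule)."""
    buckets = defaultdict(list)
    for client_id, eps in enumerate(epsilons):
        # Bucket index = number of edges that the declared budget meets or exceeds.
        buckets[sum(eps >= edge for edge in edges)].append(client_id)
    return dict(buckets)

def shuffle_within_bucket(updates, members, rng):
    """Permute every parameter coordinate independently across bucket members."""
    dim = len(updates[members[0]])
    mixed = {client_id: [0.0] * dim for client_id in members}
    for j in range(dim):
        column = [updates[client_id][j] for client_id in members]
        rng.shuffle(column)
        for client_id, value in zip(members, column):
            mixed[client_id][j] = value
    return mixed

def column_sums(rows):
    """Per-coordinate sum over a list of update vectors."""
    return [sum(column) for column in zip(*rows)]

rng = random.Random(0)
epsilons = [0.5, 0.6, 4.0, 5.0, 4.5]  # declared privacy budgets (made-up values)
updates = {i: [rng.gauss(0.0, 1.0) for _ in range(4)] for i in range(5)}
buckets = bucket_by_epsilon(epsilons, edges=[1.0])

# Check that shuffling leaves each bucket's per-coordinate sum unchanged
# (up to floating-point rounding from the different summation order).
max_drift = 0.0
for members in buckets.values():
    mixed = shuffle_within_bucket(updates, members, rng)
    before = column_sums([updates[c] for c in members])
    after = column_sums([mixed[c] for c in members])
    max_drift = max(max_drift, max(abs(a - b) for a, b in zip(before, after)))
```

With the single edge at 1.0, clients 0 and 1 land in the low-budget bucket and clients 2 to 4 in the high-budget bucket; `max_drift` stays at rounding-error level.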

2606.02562 2026-06-02 cs.RO cs.AI cs.LG cs.SY eess.SY

Permissive Safety Through Trusted Inference: Verifiable Belief-Space Neural Safety Filters for Assured Interactive Robotics

Haimin Hu

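Affiliations: Department of Computer Science, Johns Hopkins University, USA

AI summary: Addresses safety under human-induced uncertainty in interactive robotics by proposing a conformal-prediction-based method for verifying belief-space safety filters, which guarantees high-probability safety while accounting for inference reliability and reduces conservativeness.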

Comments Accepted to the 17th World Symposium on the Algorithmic Foundations of Robotics (WAFR 2026)

Abstract

Autonomous robots that interact with people must make safe and efficient decisions under human-induced uncertainty, such as their preferences, goals, competency, and willingness to cooperate. Safety filters are a popular approach for ensuring safety in interactive robotics, since their modular design separates safety from performance, allowing robots to operate safely around people with minimal impact on task efficiency. While traditional safety filters typically operate only in the physical space, neglecting the robot's ability to learn and adapt online, the recently proposed belief-space safety filter (BeliefSF) reasons about robot safety in closed-loop with runtime inference that actively reduces the robot's uncertainty online, thereby reducing conservativeness in filtering. However, providing formal safety guarantees for robots deploying BeliefSF remains a significant challenge due to errors in runtime inference and neural approximation of safety filters required to handle the high dimensionality of belief spaces. In this paper, we propose an algorithmic approach to certify high-probability safety of BeliefSF using conformal prediction, while explicitly accounting for the reliability of the robot's runtime inference module. Our method leverages the structure of belief-space safety filtering by focusing verification on a region where inference is expected to be reliable. It preserves the simplicity and sample complexity of standard conformal prediction, yet can certify a substantially less conservative safety filter. Through a simulated human-vehicle interaction benchmark, we show that our approach verifies a significantly more permissive belief-space safety filter than a standard conformal prediction baseline.
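
For readers unfamiliar with the baseline named in this abstract, the calibration step of standard split conformal prediction can be sketched as follows. This is the textbook procedure, not the paper's region-focused variant; the function name and the example scores are hypothetical, and what the score measures (for example, a safety-margin violation) is left open.

```python
import math

def conformal_quantile(scores, alpha):
    """Split-conformal threshold: the ceil((n + 1) * (1 - alpha))-th smallest score.

    Assuming the calibration scores and a fresh test score are exchangeable,
    the test score is at most this threshold with probability >= 1 - alpha.
    """
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration samples to certify this alpha.
        return math.inf
    return sorted(scores)[rank - 1]

calibration_scores = [3, 1, 4, 1, 5, 9, 2]  # made-up nonconformity scores
threshold = conformal_quantile(calibration_scores, alpha=0.25)
too_few = conformal_quantile([1, 2], alpha=0.25)
```

Here `threshold` is the 6th smallest of the 7 scores, and `too_few` is infinite because two samples cannot support a 75% guarantee.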

2606.02559 2026-06-02 cs.CL cs.AI

From Layers to Submodules: Rethinking Granularity in Replacement-Based LLM Compression

Elia Cunegatti, Marcus Vukojevic, Erik Nielsen, Giovanni Iacca

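Affiliations: University of Trento

AI summary: Proposes SubFit, a submodule-level, non-contiguous replacement-based compression method that fits a separate lightweight residual bypass for each attention and feed-forward submodule, achieving a better perplexity-accuracy trade-off across a range of LLMs.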

Abstract

Post-training compression of Large Language Models (LLMs) removes entire architectural components, either deleting them or replacing them with fitted modules. Existing replacement-based methods share two design constraints: full-layer granularity and contiguous selection. We argue that this is overly restrictive: in fact, redundancy in pretrained transformers is not confined to contiguous regions, nor does it evenly distribute between Attention and FeedForward outputs, implying that different strategies best approximate different submodule types and that removable components need not cluster within contiguous depth ranges. Based on this intuition, we introduce SubFit (Submodule-level Fitted residual replacement), which compresses LLMs at the submodule level: Attention and FeedForward submodules are selected non-contiguously, and each receives its own lightweight fitted residual bypass. SubFit operates post-training and requires only calibration data. Across ten LLMs (five base, five instruction-tuned), five sparsity levels from 12.5% to 37.5%, and four replacement-based baselines, SubFit achieves the best aggregate perplexity-accuracy trade-off across the evaluated sparsity levels, with larger gains under aggressive compression. At 25% sparsity, it retains 84.6% of dense downstream accuracy and incurs 2.42x perplexity degradation, against 81.6% and 4.34x for the strongest baselines, while delivering measurable inference speedup and KV-cache savings. Code is available at https://github.com/eliacunegatti/SubFit.

2606.02556 2026-06-02 cs.CL

HERO'S JOURNEY: Testing Complex Rule Induction with Text Games

Anshun Asher Zheng, Kanishka Misra, David I. Beaver, Junyi Jessy Li

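Affiliations: Department of Linguistics; The University of Texas at Austin

AI summary: Introduces the HERO'S JOURNEY benchmark, which uses goal-directed text games to evaluate the rule-reasoning ability of large language models on attribute and procedure induction tasks. It finds that models can induce rules but only to a limited and uneven degree, that procedure execution is the bottleneck, and that surface semantics has little effect.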

Comments 24 pages

2606.02553 2026-06-02 cs.CV

LongLive-RAG: A General Retrieval-Augmented Framework for Long Video Generation

Qixin Hu, Shuai Yang, Wei Huang, Song Han, Yukang Chen

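Affiliations: NVIDIA; USC; MIT

AI summary: Proposes the LongLive-RAG framework, which treats historical latents in autoregressive video generation as retrievable memory, retrieves relevant historical latents with a query embedding, and introduces a Window Temporal Delta Loss to mitigate the error accumulation caused by sliding-window attention and improve long-video quality.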

Comments 20 pages, 7 figures, 4 tables

Abstract

Autoregressive (AR) video diffusion enables variable-length synthesis, but long-horizon generation often suffers from accumulated errors and identity drift. For efficiency, existing methods commonly adopt sliding-window attention during generation. This creates an irreversible generation trajectory: once the active window accumulates appearance errors, subsequent generations can only condition on this degraded trajectory and drift further away. We address this limitation by formulating long video generation as a retrieval-augmented generation (RAG) problem. Rather than relying solely on the recent window, we treat previously generated latents as a dynamic, searchable history. We propose LongLive-RAG, a general retrieval framework for AR video generation. At each new block, LongLive-RAG uses a query embedding to retrieve relevant historical latents. This lightweight retrieval step adds only a small overhead relative to generation and lets the generator condition on non-local context instead of only the recent window. To make retrieval more discriminative, we introduce the Window Temporal Delta Loss that suppresses redundant local similarity and encourages embeddings to capture meaningful temporal changes. Together, these components help reduce error accumulation caused by sliding-window attention. Experiments across multiple AR backbones and generation lengths show improved long-video quality and the best average VBench-Long rank. To our knowledge, among open-ended AR long video generation methods, LongLive-RAG is the first to formulate self-generated latent history as content-addressable retrieval memory. Code is available at https://github.com/qixinhu11/LongLive-RAG.
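
The retrieval step described in this abstract can be pictured as a top-k similarity search over stored embeddings. The sketch below is only an illustration under assumptions: cosine similarity, the `(block_index, embedding)` memory layout, and the function names are hypothetical, and the abstract does not say how the query embedding is produced or scored.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, memory, k):
    """Return the block indices of the k stored embeddings most similar to the query."""
    ranked = sorted(memory, key=lambda item: cosine(query, item[1]), reverse=True)
    return [block_index for block_index, _ in ranked[:k]]

# Hypothetical memory of (block_index, embedding) pairs for earlier blocks.
memory = [
    (0, [1.0, 0.0]),
    (1, [0.0, 1.0]),
    (2, [0.9, 0.1]),
    (3, [-1.0, 0.0]),
]
retrieved = retrieve([1.0, 0.0], memory, k=2)
```

Blocks 0 and 2 are returned because their embeddings point in nearly the same direction as the query, regardless of how far back in the history they sit.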

2606.02552 2026-06-02 cs.CV cs.AI

Modeling Depth Ambiguity: A Mixture-Density Representation for Flying-Point-Free Depth Estimation

Siyuan Bian, Congrong Xu, Jun Gao

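Affiliations: University of Michigan; NVIDIA

AI summary: Proposes MDA, a mixture-density representation that predicts multiple depth hypotheses and their probabilities for each pixel, addressing flying-point artifacts at boundaries in depth estimation; it substantially improves boundary reconstruction and largely removes flying points.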

Abstract

Despite advances in depth estimation, flying points remain a persistent failure mode: near object boundaries, depth estimators often predict spurious 3D points in the empty space between foreground and background surfaces. We trace this artifact to a standard modeling choice: assigning each pixel a single depth hypothesis. At boundaries, a pixel can straddle a foreground and a background surface, so its true depth is ambiguous between the two. A model that predicts a single depth cannot keep both possibilities, so training instead pulls the prediction toward an intermediate depth that lies on neither surface. We address this with MDA, a mixture-density representation that lets the model predict multiple depth hypotheses and their associated probabilities for each pixel. Near boundaries, different hypotheses can align with different surfaces, and the decoded depth is selected from one of these hypotheses rather than placed in the empty space between them. Across different backbones, MDA substantially improves boundary reconstruction and largely removes flying-point artifacts even under severe input blur, while adding negligible runtime overhead. The same mixture-density framework naturally extends to transparent objects, where it predicts multiple depth layers at transparent pixels, and to sky regions, where a dedicated component separates the unbounded sky from finite-depth regions, producing flying-point-free skylines. Project Page: https://biansy000.github.io/mda-site/.
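
A toy numeric example makes the argument in this abstract concrete. For a boundary pixel with two depth hypotheses, averaging produces a depth on neither surface, whereas selecting one hypothesis keeps the point on a surface. The argmax decoding rule and the numbers below are assumptions for illustration; the abstract only states that the decoded depth is selected from one of the hypotheses.

```python
def decode_mean(weights, depths):
    """Probability-weighted average depth: can fall between the two surfaces."""
    return sum(w * d for w, d in zip(weights, depths))

def decode_mode(weights, depths):
    """Pick the depth of the most probable hypothesis (assumed decoding rule)."""
    best = max(range(len(weights)), key=lambda i: weights[i])
    return depths[best]

# Boundary pixel: foreground at 2 m (probability 0.6), background at 10 m (0.4).
weights = [0.6, 0.4]
depths = [2.0, 10.0]

flying_point_depth = decode_mean(weights, depths)  # lies in empty space
surface_depth = decode_mode(weights, depths)       # lies on the foreground
```

The averaged depth sits roughly halfway between the surfaces, which is exactly the flying-point artifact; the selected depth coincides with the foreground surface.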

2606.02551 2026-06-02 cs.RO cs.CV

AFUN: Towards an Affordance Foundation Model for Functionality Understanding

Zhaoning Wang, Yi Zhong, Jiawei Fu, Henrik I. Christensen, Jun Gao

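Affiliations: University of Michigan; University of California, San Diego; NVIDIA

AI summary: Proposes the AFUN model, which predicts a task-conditional functional mask and a 3D post-contact motion curve from a single RGB-D image and a language task description. A large-scale standardized data pipeline supports open-world generalization, and the model substantially outperforms existing methods on multiple benchmarks.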

Abstract

Affordance understanding bridges visual perception and physical action, serving as an explainable interface for robot manipulation in open and unstructured real-world environments. Yet, building an affordance foundation model that not only understands where and how the interaction should happen, but also generalizes across diverse environments, objects, and tasks, remains a long-standing research challenge. Existing methods typically address only part of this challenge, either localizing task-relevant regions without specifying executable motion, or predicting motion but with limited scalability. In this paper, we present AFUN, a step towards an affordance foundation model for functionality understanding. From a single RGB-D observation and a language task description, AFUN predicts a task-conditional functional mask (where to interact) and a 3D post-contact motion curve (how to interact). To support open-world generalization, we build a large-scale standardized data pipeline that converts heterogeneous robot, human, simulation, and real-world scan data into a shared affordance schema with language, masks, and object-centric 3D motion labels. We evaluate AFUN from three aspects: for affordance segmentation, AFUN outperforms all baselines by a large margin across 8 test sets from 4 benchmarks, improving mean gIoU/cIoU by +23.9/+26.3; for contact-point prediction, it predicts substantially more accurate points, with a 12.7--61.3% hit-rate gain over the best baseline; and for 3D motion, it achieves the best performance on all three test sets. AFUN can be deployed for real-world robot manipulation without finetuning for robot embodiment or using task-specific heuristics, demonstrating the ability to adapt to open-world affordance tasks. Project page: https://www.zhaoningwang.com/AFUN

2606.02548 2026-06-02 cs.CL

SN-WER: Script-Normalized WER for Multi-Script Indic ASR Evaluation

Priyaranjan Pattnayak

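Affiliations: Oracle America Inc.

AI summary: Proposes the SN-WER metric, which transliterates reference and hypothesis text into a canonical script before computing WER, addressing WER's overestimation of errors in multi-script settings. Evaluation on Indic languages shows it reduces inflated model gaps by up to 12%.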

Comments Accepted to ACL 2026 MeLLM

Abstract

Word Error Rate (WER) is the dominant metric for automatic speech recognition (ASR), but it can overestimate errors when references and hypotheses encode the same words in different scripts. This issue is common in multilingual settings where ASR models may emit romanized text. We propose Script-Normalized WER (SN-WER), a training-free, evaluation-only scoring method that transliterates both reference and hypothesis text into a language-specific canonical script before computing WER. We evaluate SN-WER on 5 Indic languages, 2 datasets, and 3 ASR models. On curated FLEURS data, SN-WER reduces inflated model gaps by up to 12%, while on noisier Common Voice data the reductions are smaller or inconsistent, indicating genuine recognition weaknesses rather than only script mismatch. Controlled stress tests show a 67% attenuation of artificial romanization-induced WER inflation, while lexical-substitution controls show near-identical sensitivity to semantic errors, with $\Delta$SN-WER / $\Delta$WER $\approx 1.09$. SN-WER is robust to transliterator choice and normalization changes, and shows token-collision rates below 0.1% in the evaluated Indic setting. We argue that SN-WER should be reported alongside WER and CER as a companion metric for script-insensitive ASR evaluation, especially when transcripts feed downstream search, indexing, or multilingual LLM pipelines.
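
The scoring recipe in this abstract (transliterate both sides, then compute WER) can be sketched in a few lines. The two-word lookup table below is a hypothetical stand-in for a real transliterator, which the abstract does not name; the WER routine is the usual word-level edit distance divided by the reference length.

```python
def wer(reference, hypothesis):
    """Word error rate: word-level Levenshtein distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = cur
    return prev[-1] / max(len(ref), 1)

# Toy romanized -> Devanagari lookup (hypothetical; not a real transliterator).
TOY_MAP = {"namaste": "नमस्ते", "duniya": "दुनिया"}

def to_canonical(text):
    """Map each known romanized token to the canonical script; keep others as-is."""
    return " ".join(TOY_MAP.get(token, token) for token in text.split())

def sn_wer(reference, hypothesis):
    """Script-normalized WER: normalize both sides, then score."""
    return wer(to_canonical(reference), to_canonical(hypothesis))

reference = " ".join(TOY_MAP[word] for word in ("namaste", "duniya"))
hypothesis = "namaste duniya"  # same words, romanized

plain_score = wer(reference, hypothesis)
normalized_score = sn_wer(reference, hypothesis)
```

Plain WER counts both romanized words as errors, while the normalized score treats them as correct, which is the inflation the metric is designed to remove.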

2606.02545 2026-06-02 cs.CL

LL-Bench: Rethinking Low-Level Vision Evaluation in the Era of Large-Scale Generative Models

Lu Liu, Huiyu Duan, Chenxin Zhu, Jintong Lu, Haoyun Jiang, Liu Yang, Qiang Hu, Guangtao Zhai, Xiaoyun Zhang

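Affiliations: Shanghai Jiao Tong University

AI summary: Proposes the LL-Bench benchmark, comprising a large set of real-world degraded images with human preference annotations, to systematically evaluate large-scale generative models on low-level vision tasks, and introduces the LL-Score evaluator to better align with human preferences.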

Abstract

Large-scale generative models have demonstrated remarkable capabilities across image generation and editing tasks. However, their performance in low-level vision tasks, which require pixel-wise control, remains insufficiently studied. To address this gap, we introduce LL-Bench, a comprehensive Benchmark for evaluating the capabilities of large-scale generative models on Low-Level vision tasks. The benchmark comprises 2,469 real-world degraded images covering 16 low-level degradation tasks, and 28,919 restored images produced by 10 state-of-the-art large-scale generative models and 21 conventional restoration models, which are annotated with 152,020 expert-level pairwise human preferences and 28,334 quality scores. Built upon LL-Bench, we present a systematic diagnosis that reveals the performance boundaries and unique failure modes of large-scale generative models across diverse low-level vision tasks, compared with conventional representative restoration approaches. Moreover, we investigate the effectiveness of current quality evaluation metrics on LL-Bench, which exhibit significant discrepancy with human ratings. To better align restored-image quality assessment with human preferences, we further propose LL-Score, an MLLM-based evaluator that captures both restoration quality and hallucination existence. Extensive experiments demonstrate that LL-Score not only outperforms existing image quality assessment metrics, but also serves as a promising reward model for training generative models on low-level vision tasks.

2606.02532 2026-06-02 cs.CV

Improving Combined Detection and Classification of TEM Defects via Mask-Conditioned Latent Diffusion Augmentation

Ni Li, Nuohao Liu, Ryan Jacobs, Ajay Annamareddy, Maciej P. Polak, Kevin Field, Izabela Szlufarska, Dane Morgan

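Affiliations: University of Wisconsin-Madison; University of Michigan-Ann Arbor

AI summary: Proposes a generative data augmentation method based on a mask-conditioned latent diffusion model (LDM) that synthesizes TEM images with controllable, automatically labeled multi-class defect masks, to improve the defect detection and classification performance of Mask R-CNN models trained on small datasets.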

Abstract

Analyzing microstructural defects in transmission electron microscopy (TEM) images, particularly in irradiated metal alloys, is often limited by the availability of high-quality, labeled data. To address this, we introduce a generative data augmentation approach using a mask-conditioned latent diffusion model (LDM) for synthesizing realistic TEM images with controllable, automatically labeled multi-class defect masks. Without requiring manual annotations for generation, our method enables the creation of synthetic image-mask pairs by sampling distributions learned from experimental masks. These generated data were used to augment small experimental datasets of varying sizes (10, 50, and 100 labeled experimental images) to train a Mask Region-based Convolutional Neural Network (Mask R-CNN) model for defect detection and classification. Our results show that generative augmentation yields small overall model performance improvements, with up to a 0.02 gain in the harmonic mean of detection and classification F1 scores. However, we also find that the relative contributions to detection and classification improvement depend on the specific train/test data split. These findings highlight the potential of targeted generative models to enhance deep learning performance in data-scarce microscopy-based image quantification tasks.

2606.02530 2026-06-02 cs.AI cs.CL

SafeSteer: Localized On-Policy Distillation for Efficient Safety Alignment

Hao Li, Jingkun An, Zijun Song, Pengyu Zhu, Rui Li, Hao Wang, Wendi Feng, Yesheng Liu, Lijun Li, Jin-Ge Yao, Lei Sha

Affiliations: Beihang University; Beijing Institute of Technology; Beijing University of Posts and Telecommunications; Peking University; Institute of Automation, Chinese Academy of Sciences; Shanghai Artificial Intelligence Laboratory; Beijing Academy of Artificial Intelligence

AI summary: To address the loss of general capability caused by safety alignment of large language models, proposes SafeSteer, which builds a safety teacher via activation steering, selects safety tokens, and applies the reverse KL penalty only to those tokens. Using only 100 harmful samples and no general-purpose data, it achieves a superior balance between safety and general capability.

Comments 19 pages, 8 figures, 14 tables. Submitted to EMNLP 2026

Abstract

Aligning Large Language Models (LLMs) with human values often degrades their general capabilities, termed the alignment tax. Existing methods mitigate this by balancing dual objectives, which heavily rely on massive general-purpose data or auxiliary reward models. In this paper, we argue that, because safety features are inherently sparse within the output distribution, alignment requires localized modifications rather than global trade-offs. To this end, we propose SafeSteer, which performs on-policy distillation confined to safety tokens. First, we construct a safety teacher via activation steering. Based on this teacher, we develop a safety token selection algorithm. Consequently, SafeSteer restricts the reverse KL penalty to these tokens during training to preserve general capabilities. Experimental results across diverse models show that our SafeSteer achieves a superior trade-off between safety and general capability compared with existing methods, attaining strong safety performance on seven safety benchmarks with only minimal degradation on five general capability benchmarks. Notably, SafeSteer requires only 100 harmful samples without using any general-purpose data, less than 1% of what previous baselines used, considerably reducing alignment cost. More details are on our project page at https://anjingkun.github.io/SafeSteer.
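
The phrase "restricts the reverse KL penalty to these tokens" can be made concrete with a small sketch. Reverse KL here means $\mathrm{KL}(p_{\text{student}} \,\|\, p_{\text{teacher}})$, computed per token position and averaged only over positions flagged as safety tokens. The mask, the distributions, and the function names are hypothetical; how the paper selects safety tokens and samples on-policy sequences is not modeled.

```python
import math

def reverse_kl(student, teacher):
    """KL(student || teacher) for one token position.

    Assumes teacher probabilities are strictly positive wherever the student's are.
    """
    return sum(s * math.log(s / t) for s, t in zip(student, teacher) if s > 0)

def masked_reverse_kl(student_dists, teacher_dists, safety_mask):
    """Average reverse KL over positions where safety_mask is True (0.0 if none)."""
    selected = [
        reverse_kl(student, teacher)
        for student, teacher, is_safety in zip(student_dists, teacher_dists, safety_mask)
        if is_safety
    ]
    return sum(selected) / len(selected) if selected else 0.0

# Three token positions over a two-word vocabulary (made-up numbers).
student_dists = [[0.5, 0.5], [0.9, 0.1], [0.5, 0.5]]
teacher_dists = [[0.5, 0.5], [0.1, 0.9], [0.25, 0.75]]
safety_mask = [False, True, False]  # only position 1 counts as a safety token

loss = masked_reverse_kl(student_dists, teacher_dists, safety_mask)
no_penalty = masked_reverse_kl(student_dists, teacher_dists, [False, False, False])
```

Position 2 also disagrees with the teacher, but it contributes nothing because it is outside the mask; that is the sense in which the penalty is localized.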

2606.02526 2026-06-02 cs.CV cs.AI

ToolFG: Towards Well-Grounded Fine-Grained Image Classification

Yu Xue, Haoxuan Qu, Zhuoling Li, Yihang Lou, Yan Bai, Hossein Rahmani, Jun Liu

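Affiliations: Lancaster University; Peking University

AI summary: Proposes the ToolFG framework, which uses MCTS-guided tool-use knowledge distillation and a model-tool co-evolution mechanism so that MLLMs autonomously call external tools to gather reliable visual cues for fine-grained image classification.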

Abstract

Fine-grained image classification (FGIC) has broad applications and has attracted significant research attention. In this paper, we explore a novel paradigm for solving FGIC by proposing ToolFG, the first tool-integrated MLLM-based framework tailored to FGIC. ToolFG enables MLLMs to autonomously and flexibly use external tools during the reasoning process, actively interact with images, and collect verifiable visual cues for distinguishing highly similar categories in a more reliable and well-grounded manner. To equip the model with such tool-use ability, we design a novel MCTS-guided tool-use knowledge distillation mechanism, which effectively mines tool-use- and FGIC-relevant knowledge from advanced proprietary MLLMs for model training. Furthermore, we propose a model-tool co-evolution mechanism that jointly refines the toolset and the model's tool-use policy, driving them toward a mutually adapted and FGIC-specialized state. Extensive experiments demonstrate the effectiveness of our framework.

2606.02510 2026-06-02 cs.CV cs.RO

Not All Points Are Equal: Uncertainty-Aware 4D LiDAR Scene Synthesis

Xiang Xu, Alan Liang, Youquan Liu, Xian Sun, Linfeng Li, Lingdong Kong, Ziwei Liu, Qingshan Liu

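Affiliations: NUAA; NUS; FDU; Duke; NTU; NJUPT; SKL-TI

AI summary: Proposes the U4D framework, which uses spatial uncertainty to guide LiDAR scene generation: entropy maps identify high-uncertainty regions, which are synthesized first, and the remaining regions are then completed, yielding high-fidelity 4D scenes.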

Comments CVPR 2026 E2E3D Workshop; GitHub at https://github.com/worldbench/U4D

Abstract

Constructing faithful 4D worlds from LiDAR-acquired sequences is crucial for embodied AI, yet current generative frameworks apply uniform modeling capacity across all spatial regions. This ignores that perceptual difficulty varies dramatically within a single scan: distant surfaces, occluded boundaries, and small-scale objects carry far higher uncertainty than well-observed structures. We present U4D, a new framework that explicitly leverages spatial uncertainty to guide LiDAR scene generation in a "hard-to-easy" schedule. U4D derives per-point uncertainty maps via Shannon Entropy from a pretrained segmentor, then applies an unconditional diffusion stage to synthesize high-entropy areas with precise geometry, followed by a conditional completion stage that fills in the remaining regions using these structures as priors. A MoST (Mixture of Spatio-Temporal) block further maintains cross-frame coherence by dynamically balancing spatial detail and temporal continuity. Extensive experiments on nuScenes and SemanticKITTI demonstrate state-of-the-art scene fidelity, temporal consistency, and downstream performance.
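
The per-point uncertainty described in this abstract is the Shannon entropy of a segmentor's class probabilities, $H(p) = -\sum_c p_c \log p_c$. The sketch below computes it for a few made-up points and flags the high-entropy ones; the threshold and the idea of a hard cut-off are assumptions for illustration, since the abstract does not say how high-entropy regions are delimited.

```python
import math

def shannon_entropy(probs):
    """Shannon entropy (in nats) of one point's class-probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Class probabilities for three points (made-up segmentor outputs).
point_probs = [
    [1.0, 0.0, 0.0],        # confident: entropy 0
    [0.5, 0.5, 0.0],        # torn between two classes: entropy ln 2
    [1 / 3, 1 / 3, 1 / 3],  # maximally uncertain: entropy ln 3
]
entropies = [shannon_entropy(p) for p in point_probs]

# Hypothetical cut-off separating "hard" points, synthesized first, from the rest.
threshold = 0.5
hard_points = [i for i, h in enumerate(entropies) if h > threshold]
```

Points 1 and 2 exceed the cut-off and would be handled in the first, "hard" stage of a hard-to-easy schedule.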

2606.02509 2026-06-02 cs.CL

When Rating Scales Fall Short: LLM-Assisted Discovery of ADHD Signals in Turkish Teacher Narratives

Baris Karacan, Irem Aktar Songur, Ahmet Ozaslan, Elvan Iseri

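Affiliations: Department of Computer Science, University of Illinois Chicago; Department of Child and Adolescent Psychiatry, Gazi University

AI summary: Analyzes structured scores and open-ended narratives in Turkish teacher evaluation forms and, using an LLM-assisted theme discovery approach, reveals complementary ADHD signals in the narrative text that structured scales do not capture.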

Comments 15 pages. Accepted to CLPsych 2026. Camera-ready author version. The final version will appear in the ACL Anthology

Abstract

Attention Deficit Hyperactivity Disorder (ADHD) is one of the most common neurodevelopmental disorders in childhood, and its diagnosis relies on assessments combining clinician judgment with standardized rating scales and reports from parents and teachers. While structured instruments such as the Conners' Teacher Rating Scale-Revised Short Form (CTRS-R:S) quantify ADHD-related behaviors, teachers also provide open-ended narratives that may contain complementary signals not captured by structured assessments. However, it remains unclear to what extent teacher narratives encode signals overlooked by rating scales. In this study, we analyze de-identified Turkish teacher evaluation forms collected during clinical ADHD assessments, including both CTRS-R:S scores and open-ended teacher narratives. We compare predictive signals from structured scores and narrative text and identify cases where structured assessments fail to clearly distinguish ADHD from non-ADHD students while narrative-based models capture distinct behavioral patterns. Notably, these cases show minimal overlap with those missed by the narrative model, suggesting that structured and narrative information encode complementary signals. To interpret these differences, we apply a large language model (LLM)-assisted theme discovery pipeline that reveals distinct attention, behavioral, and family-related patterns, highlighting the potential of natural language processing (NLP) to uncover clinically relevant signals from teacher narratives and to complement traditional ADHD screening tools.

2606.02502 2026-06-02 cs.CL

CRAM: Centroid-Routing and Adaptive MoE for Multimodal Continual Instruction Tuning

Jun-Tao Tang, Zhen-Hao Xie, Yu-Cheng Shi, Da-Wei Zhou

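Affiliations: School of Artificial Intelligence, Nanjing University, China; State Key Laboratory of Novel Software Technology, Nanjing University, China

AI summary: Proposes CRAM, which isolates task-specific patterns into independent modules, dynamically allocates parameters through adaptive-rank instantiation, activates existing experts through centroid routing, and constrains update directions with an orthogonality penalty, addressing the forgetting caused by task competition and the parameter inefficiency in multimodal continual instruction tuning.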

Abstract

Multimodal Large Language Models (MLLMs) unify heterogeneous vision-language tasks under a shared generative framework via instruction tuning, yet real-world deployment demands continuous capability expansion, making Multimodal Continual Instruction Tuning (MCIT) essential. Existing methods either update all tasks with a shared parameter set or allocate dedicated modules for each new task. Shared updates force heterogeneous tasks to compete, causing forgetting of learned capabilities. Conversely, isolated expansion prevents interference but severely limits parameter efficiency over long task streams. To address this dilemma, we propose CRAM. Specifically, by isolating task-specific patterns into independent modules, CRAM mitigates catastrophic forgetting across tasks. To further boost parameter efficiency, we utilize adaptive-rank instantiation to identify the capability gap between existing expert capability and new task demands, and dynamically allocate only the necessary parameters. To ensure stable reuse among tasks, centroid-guided routing recognizes and activates existing experts' capabilities, while an orthogonality penalty confines new updates to task-specific directions, preventing re-learning general capability. Extensive experiments across diverse benchmarks consistently demonstrate its superiority over existing methods.
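
Centroid-guided routing, as named in this abstract, can be pictured as a nearest-centroid lookup: each existing expert keeps a centroid of the features it was trained on, and an input is sent to the expert whose centroid is closest. The sketch below is a generic illustration under assumptions; the squared Euclidean distance, the fallback threshold, and the expert names are hypothetical and not taken from the paper.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def route(feature, centroids, max_distance):
    """Return the name of the nearest expert, or None if every centroid is too far.

    A None result stands for "no existing expert fits", where a system might
    allocate new capacity (hypothetical fallback).
    """
    name = min(centroids, key=lambda n: squared_distance(feature, centroids[n]))
    if squared_distance(feature, centroids[name]) > max_distance:
        return None
    return name

# Hypothetical expert centroids in a 2-D feature space.
centroids = {"vqa": [0.0, 0.0], "captioning": [10.0, 10.0]}

near_existing = route([1.0, 2.0], centroids, max_distance=25.0)
far_from_all = route([5.0, -20.0], centroids, max_distance=25.0)
```

The first input is routed to the `vqa` expert; the second is far from both centroids, so no existing expert is selected.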
