arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.21874 2026-05-25 cs.SD

Real-time, EDM-inspired sonification of the activity of a supercomputer

Marco Alunno, Paolo Bientinesi

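AI summary: This paper studies how to render a supercomputer's real-time operating data as informative sound. It proposes a sonification approach in the style of electronic dance music (EDM) that reflects the activity of every node continuously, clearly, and engagingly. The approach targets real-time monitoring rather than debugging and generates virtually endless, stylistically coherent music suited to long-term listening.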

Comments 7 pages, 2 figures, accepted conference paper

Abstract

The project described in this paper explores the informative sonification of data received in real time from a supercomputer. These data capture the current activities in all the nodes of the computer; their sonification therefore functions as a form of continuous monitoring of the nodes' behavior and, by extension, of the system as a whole. Because such monitoring is theoretically unending, the resulting sonification must be musically capable of conveying information through sound in a way that remains both intelligible and engaging over long durations. Rather than imposing a predefined musical style onto the data, we sought to identify one which the data themselves could plausibly support. From a small set of candidates, we selected EDM because it is a family of genres whose structural and temporal characteristics align well with continuous, data-driven processes and long-term listening. Through this style-based approach, this research builds on the long tradition of computer data sonification while uniquely combining three elements rarely addressed together: monitoring (rather than debugging) as the primary goal, real-time (rather than post-mortem) data interpretation, and generation of virtually infinite and stylistically coherent (rather than incongruous) music.
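
As a rough illustration of the kind of data-to-music mapping this abstract describes, the sketch below quantizes per-node utilization onto a fixed scale and places nodes on a 16-step bar. The scale, the step grid, and the function names are assumptions made for this example; the abstract does not specify the paper's actual mapping.

```python
# Hypothetical mapping: node utilization in [0, 1] -> MIDI note on a fixed scale.
A_MINOR_PENTATONIC = [57, 60, 62, 64, 67, 69]  # MIDI notes A3, C4, D4, E4, G4, A4


def utilization_to_note(utilization: float) -> int:
    """Quantize a utilization value onto the scale (busier node, higher pitch)."""
    clamped = min(max(utilization, 0.0), 1.0)
    index = min(int(clamped * len(A_MINOR_PENTATONIC)), len(A_MINOR_PENTATONIC) - 1)
    return A_MINOR_PENTATONIC[index]


def render_bar(node_utilizations: list[float], steps_per_bar: int = 16) -> list[list[int]]:
    """Build one bar: node i sounds on step i % steps_per_bar.

    Keeping the bar length fixed is what lets the stream run indefinitely
    while the rhythmic grid stays constant (an assumption of this sketch).
    """
    bar: list[list[int]] = [[] for _ in range(steps_per_bar)]
    for node_id, utilization in enumerate(node_utilizations):
        bar[node_id % steps_per_bar].append(utilization_to_note(utilization))
    return bar
```

Calling `render_bar` once per monitoring interval would yield a new bar for each snapshot of the machine, which is one simple way to turn an unbounded data stream into an unbounded but regular musical one.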

2605.21813 2026-05-25 cs.LG stat.ME stat.ML

Symbolic Density Estimation for Discrete Distributions

Ziwen Liu, Meng Li

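AI summary: This paper proposes symbolic density estimation (SDE), an unsupervised framework that automatically recovers closed-form probability mass functions for discrete distributions. The method composes elementary analytic operations within a structured search space, combines domain-specific structural priors with evolutionary search and a validity-aware inference stage, and extends to richer families such as zero-inflated distributions and finite mixtures. The authors also build a benchmark dataset covering commonly used discrete distributions; the algorithm recovers all benchmark families with accurate parameter estimates and, in a real-data application, improves goodness-of-fit over standard models.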

Comments 28 pages, 5 figures, 22 tables

Abstract

Discrete probability laws underpin statistical modeling, yet the catalog of interpretable distributions has expanded only gradually through centuries of case-by-case mathematical derivations. We introduce symbolic density estimation (SDE), an unsupervised framework that automatically recovers closed-form probability mass functions by composing elementary analytic operations within a structured search space. Our method integrates domain-specific structural priors with evolutionary search and a validity-aware inference stage, and it extends to richer distribution families such as zero inflation and finite mixtures. To support systematic evaluation and future research, we contribute a benchmark dataset spanning a broad collection of commonly used discrete distributions. The proposed algorithm recovers all benchmark families with accurate parameter estimates. A real data application shows that it identifies concise and interpretable mixture models that improve goodness-of-fit over standard models.
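
To make the target of such a search concrete, the sketch below writes out one closed-form family the abstract mentions, a zero-inflated Poisson, together with the two checks any candidate probability mass function has to pass: non-negativity and summing to one. The function names and the truncation point used in the check are assumptions for illustration; this is not the paper's search or inference procedure.

```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    """Closed-form Poisson probability mass at count k."""
    return math.exp(-lam) * lam**k / math.factorial(k)


def zero_inflated_poisson_pmf(k: int, lam: float, pi: float) -> float:
    """Mixture of a point mass at zero (weight pi) and a Poisson(lam) component."""
    base = (1.0 - pi) * poisson_pmf(k, lam)
    return pi + base if k == 0 else base


def looks_like_valid_pmf(pmf, support_limit: int = 60, tol: float = 1e-9) -> bool:
    """Check non-negativity and (truncated) normalization of a candidate PMF."""
    values = [pmf(k) for k in range(support_limit)]
    return all(v >= 0.0 for v in values) and abs(sum(values) - 1.0) < tol


def log_likelihood(counts: list[int], lam: float, pi: float) -> float:
    """Log-likelihood of observed counts under the zero-inflated Poisson."""
    return sum(math.log(zero_inflated_poisson_pmf(k, lam, pi)) for k in counts)
```

A candidate expression that fails `looks_like_valid_pmf` is not a distribution at all, which is presumably the kind of condition a validity-aware stage has to enforce before goodness-of-fit is compared.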

2605.21031 2026-05-25 cs.RO

Modeling and Control of a Pneumatic Morphing Soft Quadrotor based on the SOFA Framework for Dynamic Soft Robotic Simulation

F. Labra Caso, V. Sumathy, P. Ferrentino, B. Vanderborght, J. Haluska, G. Nikolakopoulos

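AI summary: This paper presents a SOFA-based finite element method for modeling and controlling a pneumatic morphing soft quadrotor. The model keeps the physical interpretability and control structure of traditional quadrotor dynamics while capturing the complex, time-varying behavior of the pneumatically actuated soft arms. The arms are discretized in SOFA as a tetrahedral mesh with an elastic material law that produces their internal forces, which enables dynamic simulation and control analysis of the pneumatically driven morphing; simulation results support the modeling framework and the controller design.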

Comments 8 pages, 10 figures

Abstract

This article presents a novel SOFA-based finite element method for the soft body modeling and the corresponding dynamic simulation and control of a pneumatic morphing soft quadrotor. The proposed modeling preserves the physical interpretability and control structure of traditional quadrotor dynamics, while capturing the complex, time-varying behavior of pneumatically actuated soft arms. In SOFA, the soft pneumatically actuated arms are discretized as a tetrahedral mesh following an elastic material law that produces internal forces consistent with the real dynamic behavior of the body. Pneumatic actuation governed by both periodic and error-based control signals is applied within the internal cavities to analyze the morphing capability. Finally, a proportional-integral controller is proposed to study the controlled dynamic behavior and morphing capabilities of the pneumatic arm, wherein the pneumatic actuation to the soft arm is controlled to achieve the desired target position. The simulation results show the effectiveness of the proposed novel modeling framework and the related controller design.
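
The proportional-integral control loop mentioned in the abstract can be sketched in a few lines. The example below drives a toy first-order plant toward a target position; the plant model, the gains, and the time constant are all assumptions chosen so the example runs on its own, and they stand in for the SOFA finite element simulation rather than reproduce it.

```python
class PIController:
    """Discrete-time proportional-integral controller."""

    def __init__(self, kp: float, ki: float, dt: float) -> None:
        self.kp = kp
        self.ki = ki
        self.dt = dt
        self.integral = 0.0

    def update(self, target: float, measured: float) -> float:
        """Return the actuation command for the current tracking error."""
        error = target - measured
        self.integral += error * self.dt
        return self.kp * error + self.ki * self.integral


def simulate(target: float = 1.0, steps: int = 2000, dt: float = 0.01) -> float:
    """Track a target tip position with a toy plant (hypothetical dynamics)."""
    controller = PIController(kp=2.0, ki=1.0, dt=dt)
    position = 0.0
    time_constant = 0.5  # hypothetical lag of the arm's response to pressure
    for _ in range(steps):
        pressure = controller.update(target, position)
        # First-order lag: the position relaxes toward the commanded pressure.
        position += dt * (pressure - position) / time_constant
    return position
```

The integral term is what removes the steady-state offset: with proportional action alone, this toy plant would settle below the target.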

2605.19069 2026-05-25 cs.CL cs.AI

Benchmarking Commercial ASR Systems on Code-Switching Speech: Arabic, Persian, and German

Sajjad Abdoli, Ghassan Al-Sumaidaee, Clayton W. Taylor, Ahmad ElShiekh, Ahmed Rashad

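AI summary: This paper benchmarks commercial automatic speech recognition (ASR) systems on code-switching speech across four language pairs that combine Arabic (Egyptian and Saudi), Persian, and German with English. A two-stage selection pipeline picks 300 samples per pair, and systems are scored with both BERTScore and word error rate (WER); the two metrics agree on system rankings for the Arabic and Persian pairs but differ in the size of the quality gaps they report. The study also exposes performance gaps among commercial ASR systems on code-switched speech and releases the dataset publicly.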

Abstract

Code-switching, the natural alternation between two languages within a single utterance, remains one of the most challenging and under-studied conditions for automatic speech recognition (ASR). We present a benchmark evaluating five commercial ASR providers across four language pairs: Egyptian Arabic-English, Saudi Arabic (Najdi/Hijazi)-English, Persian (Farsi)-English, and German-English, comprising 300 samples per pair selected by a two-stage pipeline combining heuristic filtering with a GPT-4o and Gemini 1.5 Pro ensemble scorer, reducing LLM costs by ≈91%. We evaluate on both WER and BERTScore, showing that while both metrics agree on the ordinal ranking of systems for all Arabic and Persian pairs (τ = 1.0), WER inflates the magnitude of quality gaps by approximately 3× by penalising semantically correct transliteration choices. ElevenLabs Scribe v2 achieves the lowest WER (13.2% overall) and leads on BERTScore (0.936 overall). Difficulty-stratified analysis reveals performance gaps masked by aggregate averages, and BERT embedding projections confirm semantic proximity between reference and hypothesis despite surface-level script differences. The dataset is publicly available at https://huggingface.co/datasets/Perle-ai/ASR_Code_Switch.
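
For readers unfamiliar with the metric, word error rate is the word-level edit distance between hypothesis and reference, divided by the number of reference words. The sketch below is a minimal standard implementation, not the benchmark's evaluation code; it applies no text normalization, which real evaluations usually do.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # Levenshtein distance over words, keeping only the previous row.
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            current.append(
                min(
                    previous[j] + 1,  # deletion of a reference word
                    current[j - 1] + 1,  # insertion of a hypothesis word
                    previous[j - 1] + substitution_cost,  # match or substitution
                )
            )
        previous = current
    return previous[-1] / len(ref_words)
```

Because the comparison is exact string equality, a hypothesis that writes a borrowed English word in Latin script while the reference writes the same word in Arabic or Persian script is charged a full substitution even though the meaning is identical. That is the effect the abstract attributes to WER and the reason an embedding-based score can report a smaller gap.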

2605.18214 2026-05-25 cs.CV

EgoInteract: Synthetic Egocentric Videos Generation for Interaction Understanding and Anticipation

Rosario Leonardi, Francesco Ragusa, Daniele Materia, Alessandro Passanisi, James Fort, Jakob Engel, Giovanni Maria Farinella

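AI summary: This paper presents EgoInteract, a controllable simulator that generates synthetic egocentric videos with dense spatial and temporal annotations, addressing the cost of collecting real data and the limited coverage of interaction patterns. The simulator gives precise control over camera, body and hand motion, object manipulation, and scene composition, and the generated data supports temporal action segmentation, next-active object detection, interaction anticipation, and hand-object interaction detection. Models trained with the simulated data improve consistently over strong baselines on multiple real-world egocentric benchmarks.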

Abstract

Collecting large-scale egocentric video datasets with dense spatial and temporal annotations is costly, slow, and often constrained by environmental biases, privacy constraints, and limited coverage of interaction patterns. While synthetic data has shown strong potential in several vision domains, its use for egocentric perception remains relatively underexplored, especially for tasks requiring temporally coherent human-object interactions. In this work, we introduce EgoInteract, a controllable simulator for egocentric video generation designed to model fine-grained egocentric interactions and their temporal dynamics. The simulator enables precise control over camera, human body and hand motion, object manipulation, and scene composition across diverse environments. Building on this framework, we generate a synthetic egocentric video dataset with dense spatial and temporal annotations for temporal action segmentation, next-active object detection, interaction anticipation, and hand-object interaction detection. We evaluate models trained with simulated data on multiple real-world egocentric benchmarks spanning diverse environments, object categories, and interaction patterns. Results show consistent improvements over strong baselines across tasks and datasets, demonstrating the effectiveness and transferability of our simulation-based approach.

2605.13930 2026-05-25 cs.LG cs.HC cs.NE

Mechanistic Interpretability of EEG Foundation Models via Sparse Autoencoders

William Lehn-Schiøler, Magnus Ruud Kjær, Rahul Thapa, Magnus Guldberg Pedersen, Anton Mosquera Storgaard, Nick Williams, Radu Gatej, Tue Lehn-Schiøler, Andreas Brink-Kjær, Sadasivan Puthusserypady, Sándor Beniczky, James Zou, Lars Kai Hansen

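AI summary: This study aims to make EEG foundation models more interpretable. It uses sparse autoencoders (SAEs) to extract sparse feature dictionaries from three architecturally distinct EEG transformer models and grounds the features in a clinical taxonomy (abnormality, age, sex, and medication). A single hyperparameter procedure transfers across all three architectures, the features are benchmarked for monosemanticity and entanglement, and a "target vs. off-target" probe area metric reveals three operational regimes of concept steering. The study also exposes key failure cases of clinical concept interventions and uses a spectral decoder to map latent manipulations to physiologically interpretable frequency signatures.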

Comments Preprint. 14 pages, 7 figures, 4 tables

Abstract

EEG foundation models achieve state-of-the-art clinical performance, yet the internal computations driving their predictions remain opaque: a barrier to clinical trust. We apply TopK Sparse Autoencoders (SAEs) across three architecturally distinct EEG transformers (SleepFM, REVE, and LaBraM) to extract sparse feature dictionaries from their embeddings. By grounding these features in a clinical taxonomy (abnormality, age, sex, and medication), we benchmark monosemanticity and entanglement across architectures. A single hyperparameter procedure, driven by an intrinsic dictionary health audit, transfers robustly across all three architectures. Via concept steering, we introduce a "target vs. off-target" probe area metric to quantify steering selectivity and reveal three operational regimes: selectively steerable, encoded but entangled, and non-encoded. This framework exposes critical representational failures: "wrecking-ball" interventions that collapse global model performance, and clinical entanglements, such as age-pathology confounding, where it is impossible to suppress one concept without corrupting the other. Finally, a spectral decoder maps these interventions back to the amplitude spectrum, translating latent manipulations into physiologically interpretable frequency signatures, such as pathological slow-wave suppression and α-band restoration.
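
The core of a TopK sparse autoencoder is easy to state in code: encode the embedding into many latent pre-activations, keep only the k largest, and reconstruct from the corresponding dictionary directions. The sketch below uses plain Python lists and made-up weight names; it is a simplified forward pass under those assumptions and says nothing about how the authors trained or sized their dictionaries.

```python
def topk_activation(pre_activations: list[float], k: int) -> list[float]:
    """Keep the k largest pre-activations and zero out the rest."""
    order = sorted(
        range(len(pre_activations)),
        key=lambda i: pre_activations[i],
        reverse=True,
    )
    keep = set(order[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(pre_activations)]


def encode(x: list[float], w_enc: list[list[float]], b_enc: list[float], k: int) -> list[float]:
    """Sparse code: one pre-activation per encoder row, then TopK."""
    pre = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(w_enc, b_enc)]
    return topk_activation(pre, k)


def decode(z: list[float], w_dec: list[list[float]], b_dec: list[float]) -> list[float]:
    """Reconstruction: bias plus active latents times their dictionary directions."""
    out = list(b_dec)
    for activation, direction in zip(z, w_dec):
        if activation != 0.0:
            for d, value in enumerate(direction):
                out[d] += activation * value
    return out
```

In this picture, steering a concept amounts to changing the activation of one latent before decoding, so the row of `w_dec` for that latent determines which directions of the embedding move.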

2605.07919 2026-05-25 cs.CV

MedVIGIL: Evaluating Trustworthy Medical VLMs Under Broken Visual Evidence

Hanqi Jiang, Junhao Chen, Mingyu Kang, Hyeokjae Kwon, Yi Pan, Lifeng Chen, Weihang You, Haozhen Gong, Ruiyu Yan, Jinglei Lv, Lin Zhao, Hui Ren, Quanzheng Li, Tianming Liu, Xiang Li

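AI summary: This paper presents MedVIGIL, a benchmark for evaluating how trustworthy medical vision-language models (VLMs) are when the visual evidence is broken. It tests whether a model correctly refuses to answer when the image or question has been perturbed instead of returning a fluent but wrong answer. MedVIGIL contains 300 radiologist-annotated cases, reports several audit metrics and a composite score across failure scenarios, and releases results for 16 vision-capable models and two text-only baselines.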

Abstract

Medical vision-language models (VLMs) are usually evaluated on intact image-question pairs, but trustworthy clinical use requires a stronger property: a model must recognise when the evidential basis for an answer has failed. We study this through silent failures under perturbed evidence, where a vision-required medical question is paired with a false premise, wording perturbation, knowledge-only rewrite, or ROI-corrupted image, yet the model returns a fluent non-refusal answer. We introduce MedVIGIL, a 300-case evaluation suite drawn from four public medical VQA sources, supervised end to end by four board-certified radiologists: every gold answer, refusal option, candidate-answer set, paraphrase, false-premise trap, ROI box, and clinical risk tier is clinician-authored. Two attending radiologists annotate every case in parallel, a senior radiologist consolidates the released manifest, and a separate fourth radiologist independent of construction answers every probe to provide the human reference baseline. The release contains 2556 MCQ probes, 240 counterfactual triplets, physician-adjudicated risk-tier and answerability flags, ROI boxes, and a paired open-ended variant. We report seven correctness-conditioned audit metrics that summarise into the MedVIGIL Composite Score (MCS), and audit 16 vision-capable models plus two text-only baselines. The independent radiologist scores MCS 83.3 at silent-failure rate 5.8%, leaving a 14.1-point composite headroom above the strongest audited model (Claude Opus 4.7 at 69.2). The benchmark and evaluation harness are publicly released.

2605.02443 2026-05-25 cs.CL

HalluScan: A Systematic Benchmark for Detecting and Mitigating Hallucinations in Instruction-Following LLMs

Ahmed Cherif

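AI summary: HalluScan is a systematic benchmark framework for detecting and mitigating hallucinations in instruction-following large language models. The work proposes HalluScore as a new composite evaluation metric, introduces the Adaptive Detection Routing algorithm to cut detection cost, and uses an error-type decomposition to show that hallucination errors vary substantially across domains. In the experiments, NLI Verification achieves the best detection performance.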

Comments 38 pages, 13 figures, 10 tables. Submitted to Neural Computing and Applications

Abstract

Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse natural language processing tasks, yet they remain susceptible to hallucinations: generating content that is factually incorrect, unfaithful to provided context, or misaligned with user instructions. We present HalluScan, a comprehensive benchmark framework that systematically evaluates hallucination detection and mitigation across 72 configurations spanning 6 detection methods, 4 open-weight model families, and 3 diverse domains. We introduce three key contributions: (1) HalluScore, a novel composite metric that achieves a Pearson correlation of r = 0.41 with human expert judgments; (2) Adaptive Detection Routing (ADR), an intelligent routing algorithm achieving 2.0× cost reduction with only 0.1% AUROC degradation; and (3) systematic error cascade decomposition revealing substantial variation in hallucination error types across domains. Our experiments reveal that NLI Verification achieves the highest overall AUROC of 0.88, while RAV achieves the second-highest AUROC of 0.66.
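
The AUROC figures quoted here have a simple probabilistic reading: the chance that a randomly chosen hallucinated example receives a higher detector score than a randomly chosen faithful one, with ties counted as half. The sketch below computes it directly from that definition; the label convention (1 for hallucinated) is an assumption of the example, and the quadratic loop is meant for clarity rather than speed.

```python
def auroc(labels: list[int], scores: list[float]) -> float:
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0  # detector ranks the hallucination higher
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(positives) * len(negatives))
```

On this scale 0.5 corresponds to random ranking and 1.0 to perfect separation, which gives a sense of the distance between the 0.88 and 0.66 reported for the top two methods.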

2604.24021 2026-05-25 cs.AI math.AP

QED: An Open-Source Multi-Agent System for Generating Mathematical Proofs on Open Problems

Chenyang An, Qihao Ye, Minghao Pan, Jiayaun Zhang

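AI summary: This paper introduces QED, an open-source multi-agent system that turns human-provided research questions into complete mathematical proofs without further human guidance. By separating planning, proving, and verification, it avoids common failures of single-query proof generation: a decomposition agent structures the proof search, prover agents generate candidate arguments, and verifier agents check correctness. In an evaluation with domain experts on 18 research-level projects of varying difficulty, QED produced five original works, three of which experts judged comparable in difficulty and scope to work published in established specialist mathematics venues.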

Abstract

We present QED, an open-source multi-agent system that turns human-provided research questions into complete mathematical proofs without further human guidance. Its pipeline is designed to overcome common failures of single-query proof generation by separating planning, proving, and verification: a decomposition agent structures the proof search, prover agents generate candidate arguments, and verifier agents check correctness. In collaboration with domain experts, we evaluated QED on 18 research-level projects of varying difficulty. QED produced five original works across algebraic geometry, fluid PDEs, probability, and inverse problems. Expert assessments regard these works as solid specialized research contributions, with three comparable in difficulty and scope to work commonly published in established specialist mathematics venues. QED is released at https://github.com/proofQED/QED.

2604.21502 2026-05-25 cs.CV

VFM$^{4}$SDG: Unveiling the Power of VFMs for Single-Domain Generalized Object Detection

Yupeng Zhang, Ruize Han, Ningnan Guo, Wei Feng, Song Wang, Liang Wan

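AI summary: This study addresses the performance drop that environmental changes cause in single-domain generalized object detection (SDGOD) and proposes VFM$^{4}$SDG, a framework built on vision foundation models (VFMs). Its analysis finds that cross-domain degradation is dominated by missed detections that stem from unstable relational structure and weakened query-object binding, while VFMs keep stable relations and object responses under severe domain shift and can therefore serve as a cross-domain stability prior. The method introduces a frozen VFM for relational prior distillation in the encoder and semantic-contextual query enhancement in the decoder, and it outperforms existing methods on standard SDGOD benchmarks.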

Abstract

Real-world weather, illumination, and imaging variations often induce severe domain shifts, degrading single-source detectors in unseen environments. Existing single-domain generalized object detection (SDGOD) methods mainly rely on data augmentation or domain-invariant learning, while largely overlooking how domain shift disrupts detector prediction stability. Through analytical experiments, we find that performance degradation is mainly dominated by increasing missed detections. Further analysis shows that this phenomenon stems from reduced cross-domain stability in DETR-style detectors: domain shift disrupts encoder-side object-background and inter-instance relations, and further weakens the semantic-spatial binding between decoder queries and real objects. Motivated by this, we find that vision foundation models (VFMs) still preserve stable relational structures and object responses under severe shifts, making them suitable cross-domain stability priors to compensate for detector degradation. To this end, we propose VFM$^{4}$SDG, a dual-prior learning framework for SDGOD, which introduces a frozen VFM into encoder representation learning and decoder query modeling. Specifically, we propose Cross-domain Stable Relational Prior Distillation to distill stable object-background and inter-instance relations from the VFM into the encoder, compensating for relational degradation. Meanwhile, we propose Semantic-Contextual Prior-based Query Enhancement, which injects category semantic prototypes and global object context into queries before they enter the decoder layer, enhancing semantic-spatial query-object binding stability. Extensive experiments show that VFM$^{4}$SDG significantly outperforms existing advanced methods on standard SDGOD benchmarks and two mainstream DETR-based detection frameworks, demonstrating its effectiveness, robustness, and generality.

2604.17134 2026-05-25 cs.CL

RoIt-XMASA: Multi-Domain Multilingual Sentiment Analysis Dataset for Romanian and Italian

Andrei-Marius Avram, Aureliu Valentin Antonie, Cosmin-Mircea Croitoru, Vlad Andrei Muntean, Dumitru-Clementin Cercel

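AI summary: This paper introduces RoIt-XMASA, a multilingual dataset that extends the Cross-lingual Multi-domain Amazon Sentiment Analysis dataset to Italian and Romanian, with 36,000 labeled reviews and 202,141 unlabeled samples. To handle cross-lingual and cross-domain challenges, it proposes a multi-target adversarial training framework that uses loss reversal with meta-learned coefficients to balance sentiment discrimination against domain and language invariance. With this approach XLM-R reaches an F1-score of 66.23%, 4.64% above the baseline, and few-shot results illustrate the trade-off between prompting and task-specific fine-tuning.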

Comments Accepted at the International AAAI Conference on Web and Social Media (ICWSM 2026)

Abstract

We present RoIt-XMASA, a multilingual dataset that extends the Cross-lingual Multi-domain Amazon Sentiment Analysis to Italian and Romanian, comprising 36,000 labeled reviews across three domains (books, movies, and music) and 202,141 unlabeled samples. To address cross-lingual and cross-domain challenges, we propose a multi-target adversarial training framework that employs loss reversal with meta-learned coefficients to dynamically balance sentiment discrimination with domain and language invariance. XLM-R achieves an F1-score of 66.23% with our approach, outperforming the baseline by 4.64%. Few-shot evaluation shows that Llama-3.1-8B achieves 58.43% F1-score, revealing a meaningful trade-off between the efficiency of prompting-based approaches and the higher performance of task-specific fine-tuning.

2604.11759 2026-05-25 cs.AI

Retrieval Is Not Enough: Why Organizational AI Needs Epistemic Infrastructure

Federico Bottino, Carlo Ferrero, Nicholas Dosio, Pierfrancesco Beneventano

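AI summary: This paper argues that the organizational knowledge AI agents use usually lacks epistemic structure, so retrieval alone cannot tell apart decisions, hypotheses, contested claims, and open questions. It proposes the OIDA framework, which represents knowledge as typed Knowledge Objects with importance scores and contradiction relations, and models what an organization does not know through a QUESTION primitive. In a small controlled comparison, OIDA's RAG condition scores lower on the Epistemic Quality Score than a full-context baseline that uses about 28 times more tokens; the QUESTION mechanism is statistically validated, while the decisive equal-budget ablation has not yet been run.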

Comments 10 pages, 2 figures, 8 tables, 6 appendices

Abstract

Organizational knowledge used by AI agents typically lacks epistemic structure: retrieval systems surface semantically relevant content without distinguishing binding decisions from abandoned hypotheses, contested claims from settled ones, or known facts from unresolved questions. We argue that the ceiling on organizational AI is not retrieval fidelity but epistemic fidelity: the system's ability to represent commitment strength, contradiction status, and organizational ignorance as computable properties. We present OIDA, a framework that structures organizational knowledge as typed Knowledge Objects carrying epistemic class, importance scores with class-specific decay, and signed contradiction edges. The Knowledge Gravity Engine maintains scores deterministically with proved convergence guarantees (sufficient condition: max degree < 7; empirically robust to degree 43). OIDA introduces QUESTION-as-modeled-ignorance: a primitive with inverse decay that surfaces what an organization does not know with increasing urgency, a mechanism absent from all surveyed systems. We describe the Epistemic Quality Score (EQS), a five-component evaluation methodology with explicit circularity analysis. In a controlled comparison (n = 10 response pairs), OIDA's RAG condition (3,868 tokens) achieves EQS 0.530 vs. 0.848 for a full-context baseline (108,687 tokens); the 28.1× token budget difference is the primary confound. The QUESTION mechanism is statistically validated (Fisher p = 0.0325, OR = 21.0). The formal properties are established; the decisive ablation at equal token budget (E4) is pre-registered and not yet run.

2604.10077 2026-05-25 cs.CV

DocRevive: A Unified Pipeline for Document Text Restoration

Kunal Purkayastha, Ayan Banerjee, Josep Llados, Umapada Pal

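AI summary: DocRevive is a unified pipeline for document text restoration that reconstructs damaged, occluded, or incomplete text. It combines OCR, image analysis, masked language modeling, and diffusion-based models to restore text in a semantically coherent way while preserving visual integrity. The work also builds a synthetic dataset of 30,078 degraded document images and proposes a Unified Context Similarity Metric (UCSM) for evaluating restoration quality.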

Abstract

In Document Understanding, the challenge of reconstructing damaged, occluded, or incomplete text remains a critical yet unexplored problem. Subsequent document understanding tasks can benefit from a document reconstruction process. In response, this paper presents a novel unified pipeline combining state-of-the-art Optical Character Recognition (OCR), advanced image analysis, masked language modeling, and diffusion-based models to restore and reconstruct text while preserving visual integrity. We create a synthetic dataset of 30,078 degraded document images that simulates diverse document degradation scenarios, setting a benchmark for restoration tasks. Our pipeline detects and recognizes text, identifies degradation with an occlusion detector, and uses an inpainting model for semantically coherent reconstruction. A diffusion-based module seamlessly reintegrates text, matching font, size, and alignment. To evaluate restoration quality, we propose a Unified Context Similarity Metric (UCSM), incorporating edit, semantic, and length similarities with a contextual predictability measure that penalizes deviations when the correct text is contextually obvious. Our work advances document restoration, benefiting archival research and digital preservation while setting a new standard for text reconstruction. The OPRB dataset and code are available at Hugging Face (https://huggingface.co/datasets/kpurkayastha/OPRB) and GitHub (https://github.com/kunalpurkayastha/DocRevive), respectively.
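
Two of the ingredients named for UCSM, edit similarity and length similarity, have common textbook definitions that can be sketched directly. The example below uses those definitions and combines them with an externally supplied semantic similarity using equal weights; the weighting, the function names, and the omission of the contextual predictability term are assumptions of this sketch, since the abstract does not give the metric's formula.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance, keeping only the previous row."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost))
        previous = current
    return previous[-1]


def edit_similarity(predicted: str, truth: str) -> float:
    """1 minus the edit distance normalized by the longer string."""
    longest = max(len(predicted), len(truth))
    return 1.0 if longest == 0 else 1.0 - levenshtein(predicted, truth) / longest


def length_similarity(predicted: str, truth: str) -> float:
    """Ratio of the shorter length to the longer one."""
    longest = max(len(predicted), len(truth))
    return 1.0 if longest == 0 else min(len(predicted), len(truth)) / longest


def combined_similarity(predicted: str, truth: str, semantic_similarity: float) -> float:
    """Equal-weight average of the three similarities (hypothetical weighting)."""
    return (
        edit_similarity(predicted, truth)
        + semantic_similarity
        + length_similarity(predicted, truth)
    ) / 3.0
```

The semantic term is passed in as a number because it would normally come from an embedding model, which is outside the scope of a self-contained example.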

2603.28767 2026-05-25 cs.CV

Gen-Searcher: Reinforcing Agentic Search for Image Generation

Kaituo Feng, Manyuan Zhang, Shuang Chen, Yunlong Lin, Kaixuan Fan, Yilei Jiang, Hongyu Li, Dian Zheng, Chenyang Wang, Xiangyu Yue

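AI summary: This paper presents Gen-Searcher, described as the first attempt to train a search-augmented image generation agent, to address the failure of existing models, whose internal knowledge is frozen, on real-world scenarios that are knowledge-intensive or need up-to-date information. The agent uses multi-hop reasoning and search to collect the textual knowledge and reference images needed for generation, and the authors build two high-quality datasets plus the KnowGen benchmark for evaluation. Experiments show gains over Qwen-Image of around 16 points on KnowGen and 15 points on WISE.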

Comments Project page: https://gen-searcher.vercel.app Code: https://github.com/tulerfeng/Gen-Searcher

Abstract

Recent image generation models have shown strong capabilities in generating high-fidelity and photorealistic images. However, they are fundamentally constrained by frozen internal knowledge, thus often failing on real-world scenarios that are knowledge-intensive or require up-to-date information. In this paper, we present Gen-Searcher, as the first attempt to train a search-augmented image generation agent, which performs multi-hop reasoning and search to collect the textual knowledge and reference images needed for grounded generation. To achieve this, we construct a tailored data pipeline and curate two high-quality datasets, Gen-Searcher-SFT-10k and Gen-Searcher-RL-6k, containing diverse search-intensive prompts and corresponding ground-truth synthesis images. We further introduce KnowGen, a comprehensive benchmark that explicitly requires search-grounded external knowledge for image generation and evaluates models from multiple dimensions. Based on these resources, we train Gen-Searcher with SFT followed by agentic reinforcement learning with dual reward feedback, which combines text-based and image-based rewards to provide more stable and informative learning signals for GRPO training. Experiments show that Gen-Searcher brings substantial gains, improving Qwen-Image by around 16 points on KnowGen and 15 points on WISE. We hope this work can serve as an open foundation for search agents in image generation, and we fully open-source our data, models, and code.

2603.23565 2026-05-25 cs.LG cs.AI

Safe Reinforcement Learning with Preference-based Constraint Inference

Chenglin Li, Grant Ruan, Hua Geng

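AI summary: This paper studies how to learn complex safety constraints efficiently and reliably from human preferences in safe reinforcement learning. Because existing methods depend on expert demonstrations or restrictive assumptions, it proposes Preference-based Constrained Reinforcement Learning (PbCRL), which adds a dead zone mechanism and a signal-to-noise ratio (SNR) loss to better model the distribution of safety costs and to improve policy learning. Experiments show that the method outperforms state-of-the-art baselines in both safety and reward.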

Comments Accepted by the 43rd International Conference on Machine Learning (ICML 2026)

英文摘要

Safe reinforcement learning (RL) is a standard paradigm for safety-critical decision making. However, real-world safety constraints can be complex, subjective, and even hard to specify explicitly. Existing works on constraint inference rely on restrictive assumptions or extensive expert demonstrations, which are not realistic in many real-world applications. How to learn these constraints cheaply and reliably is the major challenge we focus on in this study. While inferring constraints from human preferences offers a data-efficient alternative, we find that popular Bradley-Terry (BT) models fail to capture the asymmetric, heavy-tailed nature of safety costs, resulting in risk underestimation. The impact of BT models on downstream policy learning is also rarely studied in the literature. To address these knowledge gaps, we propose a novel approach, Preference-based Constrained Reinforcement Learning (PbCRL). We introduce a novel dead zone mechanism into preference modeling and theoretically prove that it encourages heavy-tailed cost distributions, thereby achieving better constraint alignment. Additionally, we incorporate a Signal-to-Noise Ratio (SNR) loss to encourage exploration through cost variance, which is found to benefit policy learning. Further, a two-stage training strategy is deployed to reduce the online labeling burden while adaptively enhancing constraint satisfaction. Empirical results demonstrate that PbCRL achieves superior alignment with true safety requirements and outperforms state-of-the-art baselines in terms of safety and reward. Our work explores a promising and effective approach to constraint inference in Safe RL, with great potential in various safety-critical applications.
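
As a rough illustration of the preference model discussed in this abstract, the sketch below scores a pairwise safety preference with a Bradley-Terry likelihood and adds a dead zone. The abstract does not state the exact form of the dead zone, so the soft-threshold used here, together with the names `bt_preference_prob`, `preference_nll`, and `dead_zone`, is an assumption made for illustration and not the authors' implementation.

```python
import math


def bt_preference_prob(cost_a: float, cost_b: float, dead_zone: float = 0.0) -> float:
    """Probability that trajectory A is judged safer (lower cost) than B.

    Plain Bradley-Terry uses sigmoid(cost_b - cost_a). The dead zone is an
    ASSUMED soft-threshold: cost gaps smaller than `dead_zone` count as ties.
    """
    gap = cost_b - cost_a
    # Shrink the gap toward zero by `dead_zone`, keeping its sign.
    shrunk = math.copysign(max(abs(gap) - dead_zone, 0.0), gap)
    return 1.0 / (1.0 + math.exp(-shrunk))


def preference_nll(cost_a: float, cost_b: float, a_preferred: bool,
                   dead_zone: float = 0.0) -> float:
    """Negative log-likelihood of one preference label under the model above."""
    p = bt_preference_prob(cost_a, cost_b, dead_zone)
    return -math.log(p if a_preferred else 1.0 - p)
```

With `dead_zone=0.0` the function reduces to the plain Bradley-Terry model, which makes it easy to compare the two settings on the same labels.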

2603.21880 2026-05-25 cs.RO

Optimal Solutions for the Moving Target Vehicle Routing Problem with Obstacles via Lazy Branch and Price

带障碍物的移动目标车辆路径问题的最优解:懒惰分支定价法

Anoop Bhat, Geordan Gutow, Surya Singh, Zhongqiang Ren, Sivakumar Rathinam, Howie Choset

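AI summary This paper studies the Moving Target Vehicle Routing Problem with Obstacles (MT-VRP-O), which plans trajectories for multiple agents to intercept moving targets while satisfying time-window, speed, and capacity constraints. The authors propose Lazy BPRC, a lazy branch-and-price method that uses motion planning with relaxed continuity constraints inside the branch-and-price framework, reducing computational cost and substantially speeding up the solve while preserving optimality.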

详情
AI中文摘要

带障碍物的移动目标车辆路径问题(MT-VRP-O)旨在为多个智能体寻找轨迹,使其共同拦截一组移动目标。每个目标有一个或多个必须被访问的时间窗口,智能体必须避开静态障碍物并满足速度和容量约束。我们引入了具有松弛连续性的懒惰分支定价法(Lazy BPRC),为MT-VRP-O找到最优解。Lazy BPRC应用了VRP的分支定价框架,该框架在受限主问题(RMP)和定价问题之间交替。RMP旨在从有限的路径子集中为每个智能体选择一系列目标-时间窗口配对(称为路径)来执行。定价问题将路径添加到有限子集中。传统上,求解RMP需要计算每个智能体遵循有限子集中每条路径的成本。在MT-VRP-O中计算这些成本是计算密集型的,因为它需要在移动目标之间进行无碰撞运动规划。Lazy BPRC通过使用每条路径成本的下界来求解RMP,从而推迟成本计算,这些下界是通过具有松弛连续性约束的运动规划计算得出的。我们根据需要懒惰地评估路径的真实成本。我们通过在凸集图(GCS)上搜索最短路径来计算路径成本,并使用我们的连续性松弛方法加速搜索。我们证明,Lazy BPRC的运行速度比两种消融方法快一个数量级。

英文摘要

The Moving Target Vehicle Routing Problem with Obstacles (MT-VRP-O) seeks trajectories for several agents that collectively intercept a set of moving targets. Each target has one or more time windows where it must be visited, and the agents must avoid static obstacles and satisfy speed and capacity constraints. We introduce Lazy Branch-and-Price with Relaxed Continuity (Lazy BPRC), which finds optimal solutions for the MT-VRP-O. Lazy BPRC applies the branch-and-price framework for VRPs, which alternates between a restricted master problem (RMP) and a pricing problem. The RMP aims to select a sequence of target-time window pairings (called a tour) for each agent to follow, from a limited subset of tours. The pricing problem adds tours to the limited subset. Conventionally, solving the RMP requires computing the cost for an agent to follow each tour in the limited subset. Computing these costs in the MT-VRP-O is computationally intensive, since it requires collision-free motion planning between moving targets. Lazy BPRC defers cost computations by solving the RMP using lower bounds on the costs of each tour, computed via motion planning with relaxed continuity constraints. We lazily evaluate the true costs of tours as-needed. We compute a tour's cost by searching for a shortest path on a Graph of Convex Sets (GCS), and we accelerate this search using our continuity relaxation method. We demonstrate that Lazy BPRC runs up to an order of magnitude faster than two ablations.
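
The lazy-evaluation idea in this abstract can be sketched independently of branch-and-price: rank candidates by cheap lower bounds and compute an expensive true cost only when a candidate reaches the front of the queue. The function name `lazy_argmin` and the toy tours are hypothetical, and the sketch assumes every lower bound is no larger than the corresponding true cost; it is not the Lazy BPRC algorithm itself.

```python
import heapq
from typing import Callable, Dict, Hashable, Tuple


def lazy_argmin(lower_bounds: Dict[Hashable, float],
                true_cost: Callable[[Hashable], float]) -> Tuple[Hashable, float, int]:
    """Return (best key, its true cost, number of true-cost evaluations).

    Assumes lower_bounds[k] <= true_cost(k) for every key k.
    """
    # Heap entries: (cost, tie-breaker, key, whether the cost is exact).
    heap = [(lb, i, key, False) for i, (key, lb) in enumerate(lower_bounds.items())]
    heapq.heapify(heap)
    evaluations = 0
    while heap:
        cost, i, key, exact = heapq.heappop(heap)
        if exact:
            # An exact cost at the front beats every remaining bound or cost.
            return key, cost, evaluations
        evaluations += 1
        heapq.heappush(heap, (true_cost(key), i, key, True))
    raise ValueError("no candidates")
```

Candidates whose lower bound already exceeds the best exact cost are never evaluated, which is where the savings come from when the true cost requires motion planning.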

2603.07615 2026-05-25 cs.LG cs.CV

Compression as Adaptation: Implicit Visual Representation with Diffusion Foundation Models

压缩即适应:基于扩散基础模型的隐式视觉表示

Zongyu Guo, Jiajun He, Zhaoyang Jia, Xiaoyi Zhang, Jiahao Li, Xiao Li, Bin Li, José Miguel Hernández-Lobato, Yan Lu

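AI summary This paper proposes a new representation framework that encodes a visual signal as a function, parametrized by low-rank adaptation parameters attached to a frozen visual generative model, yielding an implicit representation of visual content. The method can compress a signal such as an 81-frame video into a compact vector, achieving high-quality perceptual video compression at extremely low bitrates. The functional representation also supports inference-time scaling and control, which further improves compression performance, and offers a unified framework for visual compression and generation.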

Comments ICML 2026

详情
AI中文摘要

现代视觉生成模型通过大规模训练获得丰富的视觉知识,但现有的视觉表示(如像素、潜变量或标记)仍独立于模型,无法直接利用这些知识进行紧凑存储或重用。在这项工作中,我们引入了一种新的视觉表示框架,将信号编码为一个函数,该函数通过附加在冻结的视觉生成模型上的低秩适应参数进行参数化。这种视觉信号的隐式表示,例如一个81帧的视频,可以进一步哈希成一个紧凑的向量,在极低比特率下实现强感知视频压缩。除了基本压缩外,这种表示的函数性质使得推理时缩放和控制成为可能,从而在压缩性能上实现额外优化。更广泛地说,由于隐式表示直接作为生成过程的函数,这提出了一个统一视觉压缩与生成的框架。

英文摘要

Modern visual generative models acquire rich visual knowledge through large-scale training, yet existing visual representations (such as pixels, latents, or tokens) remain external to the model and cannot directly exploit this knowledge for compact storage or reuse. In this work, we introduce a new visual representation framework that encodes a signal as a function, which is parametrized by low-rank adaptations attached to a frozen visual generative model. Such implicit representations of visual signals, e.g., an 81-frame video, can further be hashed into a single compact vector, achieving strong perceptual video compression at extremely low bitrates. Beyond basic compression, the functional nature of this representation enables inference-time scaling and control, allowing additional refinement of compression performance. More broadly, as the implicit representations directly act as a function of the generation process, this suggests a unified framework bridging visual compression and generation.

2602.15602 2026-05-25 cs.LG stat.ML

Certified Per-Instance Unlearning Using Individual Sensitivity Bounds

使用个体灵敏度界限的认证逐实例遗忘

Hanna Benarroch, Jamal Atif, Olivier Cappé

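AI summary This paper studies how to achieve certified per-instance unlearning through individual sensitivity bounds. Unlike conventional noise injection calibrated to worst-case sensitivity, the authors propose adaptive noise calibration based on each data point's contribution, reducing the amount of injected noise and improving model performance. Experiments on ridge regression and deep learning validate the method, showing that it substantially reduces the impact of noise while retaining the unlearning certificate.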

详情
AI中文摘要

认证的机器遗忘可以通过注入噪声实现,从而提供差分隐私保证,其中噪声根据最坏情况灵敏度进行校准。这种保守的校准通常会导致性能下降,限制了实际适用性。在这项工作中,我们研究了一种基于自适应逐实例噪声校准的替代方法,该校准针对每个数据点对学习解的个体贡献进行定制。这引发了以下挑战:当机制依赖于要移除的特定点时,如何建立正式的遗忘保证?为了定义噪声梯度动力学中的个体数据点灵敏度,我们考虑使用逐实例差分隐私。对于通过朗之万动力学训练的岭回归,我们推导出高概率的逐实例灵敏度界限,从而在注入显著更少噪声的情况下实现认证遗忘。我们通过线性设置中的实验证实了我们的理论发现,并提供了进一步的经验证据,表明该方法在深度学习设置中的相关性。

英文摘要

Certified machine unlearning can be achieved via noise injection leading to differential privacy guarantees, where noise is calibrated to worst-case sensitivity. Such conservative calibration often results in performance degradation, limiting practical applicability. In this work, we investigate an alternative approach based on adaptive per-instance noise calibration tailored to the individual contribution of each data point to the learned solution. This raises the following challenge: how can one establish formal unlearning guarantees when the mechanism depends on the specific point to be removed? To define individual data point sensitivities in noisy gradient dynamics, we consider the use of per-instance differential privacy. For ridge regression trained via Langevin dynamics, we derive high-probability per-instance sensitivity bounds, yielding certified unlearning with substantially less noise injection. We corroborate our theoretical findings through experiments in linear settings and provide further empirical evidence on the relevance of the approach in deep learning settings.
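
To make the calibration argument concrete, the sketch below uses the classical Gaussian mechanism, where the noise scale is sigma = sensitivity * sqrt(2 ln(1.25 / delta)) / epsilon for epsilon < 1. The paper's own analysis concerns Langevin dynamics and per-instance differential privacy, which this sketch does not reproduce; the sensitivity values and the name `gaussian_sigma` are made up for illustration.

```python
import math


def gaussian_sigma(sensitivity: float, epsilon: float, delta: float) -> float:
    """Noise scale of the classical Gaussian mechanism (requires 0 < epsilon < 1)."""
    if not (0.0 < epsilon < 1.0 and 0.0 < delta < 1.0):
        raise ValueError("expected 0 < epsilon < 1 and 0 < delta < 1")
    return sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / epsilon


# Hypothetical numbers: one worst-case bound versus per-point sensitivities.
worst_case = 1.0
per_instance = {"x1": 0.05, "x2": 0.30, "x3": 1.00}

sigma_worst = gaussian_sigma(worst_case, epsilon=0.5, delta=1e-5)
sigmas = {k: gaussian_sigma(s, epsilon=0.5, delta=1e-5) for k, s in per_instance.items()}
```

Because the noise scale is linear in the sensitivity, a point whose individual sensitivity is 20 times smaller than the worst case needs 20 times less noise under this simplified mechanism.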

2602.11146 2026-05-25 cs.CV cs.AI

Beyond VLM-Based Rewards: Diffusion-Native Latent Reward Modeling

超越基于VLM的奖励:扩散原生潜在奖励建模

Gongye Liu, Bo Yang, Yida Zhi, Zhizhou Zhong, Lei Ke, Didan Deng, Han Gao, Yongxiang Huang, Kaihao Zhang, Hongbo Fu, Wenhan Luo

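AI summary This paper proposes DiNa-LRM, a diffusion-native latent reward model that addresses the need for reward functions in preference optimization of diffusion and flow-matching models. The method performs preference learning directly on noisy diffusion states and introduces a noise-calibrated Thurstone likelihood whose uncertainty depends on the diffusion noise, improving the reward model's discriminative robustness and computational efficiency. Experiments show that DiNa-LRM substantially outperforms existing diffusion-based reward baselines on image alignment tasks, reaches performance comparable to state-of-the-art vision-language models at lower computational cost, and improves preference optimization dynamics.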

Comments Accepted by ICML 2026. Code: https://github.com/HKUST-C4G/diffusion-rm

详情
AI中文摘要

扩散和流匹配模型的偏好优化依赖于既具有判别鲁棒性又计算高效的奖励函数。视觉语言模型(VLM)凭借其丰富的多模态先验,已成为主要的奖励提供者,用于指导对齐。然而,它们的计算和内存成本可能很高,并且通过像素空间奖励优化潜在扩散生成器会引入域不匹配,使对齐复杂化。在本文中,我们提出DiNa-LRM,一种扩散原生潜在奖励模型,直接在噪声扩散状态上制定偏好学习。我们的方法引入了一种噪声校准的Thurstone似然,具有扩散噪声依赖的不确定性。DiNa-LRM利用预训练的潜在扩散骨干网络,配备时间步条件奖励头,并支持推理时噪声集成,提供了一种扩散原生的机制用于测试时缩放和鲁棒奖励。在图像对齐基准测试中,DiNa-LRM显著优于现有的基于扩散的奖励基线,并以一小部分计算成本实现了与最先进VLM竞争的性能。在偏好优化中,我们证明DiNa-LRM改善了偏好优化动态,实现了更快且更资源高效的模型对齐。

英文摘要

Preference optimization for diffusion and flow-matching models relies on reward functions that are both discriminatively robust and computationally efficient. Vision-Language Models (VLMs) have emerged as the primary reward provider, leveraging their rich multimodal priors to guide alignment. However, their computation and memory cost can be substantial, and optimizing a latent diffusion generator through a pixel-space reward introduces a domain mismatch that complicates alignment. In this paper, we propose DiNa-LRM, a diffusion-native latent reward model that formulates preference learning directly on noisy diffusion states. Our method introduces a noise-calibrated Thurstone likelihood with diffusion-noise-dependent uncertainty. DiNa-LRM leverages a pretrained latent diffusion backbone with a timestep-conditioned reward head, and supports inference-time noise ensembling, providing a diffusion-native mechanism for test-time scaling and robust rewarding. Across image alignment benchmarks, DiNa-LRM substantially outperforms existing diffusion-based reward baselines and achieves performance competitive with state-of-the-art VLMs at a fraction of the computational cost. In preference optimization, we demonstrate that DiNa-LRM improves preference optimization dynamics, enabling faster and more resource-efficient model alignment.
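
A Thurstone-style preference likelihood models each reward as Gaussian, so the probability that sample A beats sample B is a normal CDF of the reward gap scaled by the combined uncertainty. The sketch below shows that formula with an uncertainty that grows with the diffusion timestep. The linear schedule in `sigma_from_timestep` and all names are assumptions for illustration; the abstract does not specify how the authors tie uncertainty to noise.

```python
import math


def normal_cdf(x: float) -> float:
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def sigma_from_timestep(t: float, base: float = 0.1, scale: float = 1.0) -> float:
    """HYPOTHETICAL schedule: uncertainty grows linearly with t in [0, 1]."""
    return base + scale * t


def thurstone_prob(r_a: float, r_b: float, sigma_a: float, sigma_b: float) -> float:
    """P(A preferred over B) when rewards are independent Gaussians."""
    return normal_cdf((r_a - r_b) / math.sqrt(sigma_a ** 2 + sigma_b ** 2))


# Same reward gap, judged at a clean and at a noisy diffusion state.
p_clean = thurstone_prob(1.0, 0.5, sigma_from_timestep(0.0), sigma_from_timestep(0.0))
p_noisy = thurstone_prob(1.0, 0.5, sigma_from_timestep(0.9), sigma_from_timestep(0.9))
```

The same reward gap yields a less confident preference at the noisier state, which is the qualitative behavior a noise-dependent uncertainty is meant to capture.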

2602.04431 2026-05-25 cs.LG cs.GT

MaMa: A Game-Theoretic Approach for Designing Safe Agentic Systems

MaMa: 一种基于博弈论的安全智能体系统设计方法

Jonathan Nöther, Adish Singla, Goran Radanovic

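AI summary This paper studies the safe design of LLM-based multi-agent systems when some agents fail or behave adversarially. Inspired by Stackelberg security games, the authors propose MaMa, a new algorithm that automatically designs agentic systems that remain safe in the worst case through a game between a Meta-Adversary and a Meta-Agent. Experiments show that the resulting systems defend against worst-case attacks and generalize well across different attack objectives and underlying LLMs.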

详情
AI中文摘要

基于LLM的多智能体系统展现了令人印象深刻的能力,但当单个智能体失败或表现出对抗行为时,也会引入显著的安全风险。在这项工作中,我们研究了即使部分智能体被攻破时仍能保持安全的智能体系统的自动设计。受Stackelberg安全博弈启发,我们将此问题形式化为系统设计者(元智能体)与一个最佳响应的元对手之间的博弈,该对手选择并攻破一部分智能体以最小化安全性。我们提出了MaMa(元对手-元智能体),一种受此形式化启发的新算法,用于自动设计安全的智能体系统。我们的方法使用基于LLM的对抗搜索,其中元智能体迭代地提出系统设计,并根据元对手发现的最强攻击接收反馈。跨不同环境的实证评估表明,使用MaMa设计的系统能够持续防御最坏情况下的攻击,同时保持与仅优化任务成功率的系统相当的性能。此外,所得系统能够泛化到更强的对手,以及具有不同攻击目标或底层LLM的对手,展示了超越训练设置的鲁棒安全性。

英文摘要

LLM-based multi-agent systems have demonstrated impressive capabilities, but they also introduce significant safety risks when individual agents fail or behave adversarially. In this work, we study the automated design of agentic systems that remain safe even when a subset of agents is compromised. Inspired by Stackelberg security games, we formalize this problem as a game between a system designer (the Meta-Agent) and a best-responding Meta-Adversary that selects and compromises a subset of agents to minimize safety. We propose Meta-Adversary-Meta-Agent (MaMa), a novel algorithm inspired by this formalization for automatically designing safe agentic systems. Our approach uses LLM-based adversarial search, where the Meta-Agent iteratively proposes system designs and receives feedback based on the strongest attacks discovered by the Meta-Adversary. Empirical evaluations across diverse environments show that systems designed with MaMa consistently defend against worst-case attacks while maintaining performance comparable to systems optimized solely for task success. Moreover, the resulting systems generalize to stronger adversaries, as well as ones with different attack objectives or underlying LLMs, demonstrating robust safety beyond the training setting.
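
The Stackelberg framing in this abstract amounts to a max-min choice: the designer commits to a system, the adversary best-responds with the most damaging attack, and the designer wants the design whose worst case is best. In the paper both players are LLM-driven searches; the sketch below replaces them with exhaustive enumeration over a tiny, made-up safety table, so the names `robust_design` and `safety` and every number are hypothetical.

```python
from typing import Dict, Tuple


def robust_design(safety: Dict[str, Dict[str, float]]) -> Tuple[str, float]:
    """Pick the design with the highest worst-case safety score.

    safety[design][attack] is the safety achieved by `design` under `attack`.
    """
    # Adversary's best response: the attack that minimizes safety per design.
    worst = {design: min(by_attack.values()) for design, by_attack in safety.items()}
    # Designer's choice: the design whose worst case is largest.
    best = max(worst, key=worst.get)
    return best, worst[best]


# Hypothetical scores for two designs under "no attack" and one compromised agent.
safety = {
    "chain": {"none": 0.9, "agent_1_compromised": 0.2},
    "chain_with_reviewer": {"none": 0.8, "agent_1_compromised": 0.6},
}
choice = robust_design(safety)
```

In this toy table the reviewer design is chosen even though it scores lower without an attack, mirroring the trade-off between task performance and worst-case safety.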

2601.21513 2026-05-25 cs.LG

Cascaded Transfer: Learning Many Tasks under Budget Constraints

级联迁移:在预算约束下学习多任务

Eloi Campagne, Yvenn Amara-Ouali, Yannig Goude, Mathilde Mougeot, Argyris Kalogeratos

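AI summary Distributed applications such as substation-level electricity demand forecasting or federated learning require training separate models for a large number of related tasks whose relationships are unknown. This paper proposes Cascaded Transfer Learning (CTL), a new paradigm that organizes tasks as a tree rooted at a source task so that model parameters pass from task to task level by level under a global training budget. The method builds a spanning tree by minimizing an objective that combines pairwise task distances with the budget, producing geometry-aware, depth-bounded transfer graphs, and it theoretically analyzes how transfer errors accumulate and attenuate along cascade paths. Experiments show that CTL adapts models more accurately and more cost-effectively than existing methods across a range of task collections, with the largest gains when the budget is tight.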

详情
AI中文摘要

在分布式应用中,如变电站级能源需求预测或联邦学习,大量相关任务必须由不同模型学习,而确切的任务关系未知。我们提出了新颖的级联迁移学习(CTL)范式,其中模型参数通过组织为有根树的任务层级级联,并遵守全局训练预算。从源任务开始,树指定了任务学习和细化的顺序,预算沿其分支分配。我们设计了基于生成树的级联机制,通过最小化结合成对任务距离和可用训练预算的目标来连接所有任务,从而产生几何感知和深度有界的迁移图。我们从理论上刻画了迁移误差如何沿级联路径累积和衰减:任何上游节点引入的误差都会被每个下游细化收缩,而平衡的树拓扑限制了这种累积。在合成和真实多任务场景、时间序列预测和图像分类上的实验表明,CTL能够在大量任务集合中实现比替代方法更准确和成本效益更高的适应,且在预算最紧张时增益最大。

英文摘要

In distributed applications, such as energy demand forecasting at the substation level or federated learning, a large number of related tasks must be learned by different models, while the exact task relationships are unknown. We propose the novel Cascaded Transfer Learning (CTL) paradigm in which model parameters cascade hierarchically through tasks organized as a rooted tree, respecting a global training budget. Starting from a source task, the tree specifies the order in which tasks are learned and refined, with the budget allocated along its branches. We design cascade mechanisms based on spanning trees that connect all tasks by minimizing an objective combining pairwise task distances and the available training budget, which yield geometry-aware and depth-bounded transfer graphs. We theoretically characterize how transfer errors accumulate and attenuate along cascade paths: errors introduced at any upstream node are contracted by every downstream refinement, and balanced tree topologies bound this accumulation. Experiments on synthetic and real many-task settings, time-series forecasting and image classification, show that CTL enables more accurate and cost-effective adaptation across large task collections than alternative approaches, with the largest gains at the tightest budgets.
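
The spanning-tree construction in this abstract can be approximated, in its simplest form, by a minimum spanning tree over pairwise task distances rooted at the source task. The sketch below uses Prim's algorithm and ignores the training budget and the depth bound that the paper's objective also includes, so it is a simplified stand-in and not the CTL mechanism; `prim_transfer_tree` and the distance matrix are hypothetical.

```python
from typing import Dict, List, Optional


def prim_transfer_tree(dist: List[List[float]], root: int = 0) -> Dict[int, Optional[int]]:
    """Return {task: parent task} for a minimum spanning tree rooted at `root`.

    dist is a symmetric matrix of pairwise task distances.
    """
    n = len(dist)
    parent: Dict[int, Optional[int]] = {root: None}
    # best[j] = (cheapest distance from the tree to task j, tree node achieving it)
    best = {j: (dist[root][j], root) for j in range(n) if j != root}
    while best:
        j = min(best, key=lambda k: best[k][0])
        _, p = best.pop(j)
        parent[j] = p  # task j is fine-tuned starting from task p's model
        for k in best:
            if dist[j][k] < best[k][0]:
                best[k] = (dist[j][k], j)
    return parent


# Three tasks: task 2 is closer to task 1 than to the source task 0.
tree = prim_transfer_tree([[0, 1, 4], [1, 0, 2], [4, 2, 0]])
```

Here task 2 inherits from task 1 instead of the source, which is the kind of cascade the abstract describes; a budget-aware variant would additionally penalize deep chains.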

2601.14180 2026-05-25 cs.CV

Progressive $\mathcal{J}$-Invariant Self-supervised Learning for Low-Dose CT Denoising

渐进式 $\mathcal{J}$-不变自监督学习用于低剂量CT去噪

Yichao Liu, Zongru Shao, Yueyang Teng, Junwen Guo

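AI summary This paper studies self-supervised learning for low-dose CT image denoising, aiming to reduce the dependence on paired normal-dose CT data. To address the low training efficiency and limited performance that restricted receptive fields cause in existing methods, it proposes a progressive $\mathcal{J}$-invariant self-supervised learning method that improves denoising through a step-wise blind-spot denoising mechanism and the injection of controlled noise. Experiments show that the method outperforms existing self-supervised methods on the Mayo low-dose CT dataset and matches or exceeds several supervised denoising methods.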

详情
AI中文摘要

自监督学习越来越多地被研究用于低剂量计算机断层扫描(LDCT)图像去噪,因为它减轻了对通常难以收集的配对正常剂量CT(NDCT)数据的依赖。然而,许多现有的自监督盲点去噪方法由于感受野受限,存在训练效率低下和性能次优的问题。为了缓解这一问题,我们提出了一种新颖的渐进式 $\mathcal{J}$-不变学习,最大化利用 $\mathcal{J}$-不变性来增强LDCT去噪性能。我们引入了一种逐步盲点去噪机制,以渐进方式强制执行条件独立性,从而实现更细粒度的去噪学习。此外,我们在训练过程中显式注入受控的高斯噪声和泊松噪声的组合,以正则化去噪过程并减轻过拟合。在Mayo LDCT数据集上的大量实验表明,所提出的方法持续优于现有的自监督方法,并实现了与几种代表性监督去噪方法相当或更好的性能。

英文摘要

Self-supervised learning has been increasingly investigated for low-dose computed tomography (LDCT) image denoising, as it alleviates the dependence on paired normal-dose CT (NDCT) data, which are often difficult to collect. However, many existing self-supervised blind-spot denoising methods suffer from training inefficiencies and suboptimal performance due to restricted receptive fields. To mitigate this issue, we propose a novel Progressive $\mathcal{J}$-invariant Learning method that maximizes the use of $\mathcal{J}$-invariance to enhance LDCT denoising performance. We introduce a step-wise blind-spot denoising mechanism that enforces conditional independence in a progressive manner, enabling more fine-grained learning for denoising. Furthermore, we explicitly inject a combination of controlled Gaussian and Poisson noise during training to regularize the denoising process and mitigate overfitting. Extensive experiments on the Mayo LDCT dataset demonstrate that the proposed method consistently outperforms existing self-supervised approaches and achieves performance comparable to, or better than, several representative supervised denoising methods.

2601.03715 2026-05-25 cs.LG cs.AI

R$^3$L: Reflect-then-Retry Reinforcement Learning with Language-Guided Exploration, Pivotal Credit, and Positive Amplification

R$^3$L: 反思-重试强化学习与语言引导探索、关键信用和正向放大

Weijie Shi, Yanxi Chen, Zexi Li, Xuchen Pan, Yuchang Sun, Jiajie Xu, Xiaofang Zhou, Yaliang Li

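AI summary R$^3$L is a reinforcement learning method that combines language-guided exploration, pivotal credit assignment, and positive amplification to address the exploration and exploitation difficulties of training large language models for reasoning and agentic capabilities. It synthesizes high-quality trajectories through a reflect-then-retry mechanism, uses language feedback to localize errors and repair failed attempts, updates only the diverging trajectory suffix to sharpen credit assignment, and upweights successful trajectories to stabilize training. Experiments show that R$^3$L achieves substantial gains over baselines on multiple tasks while maintaining training stability.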

详情
AI中文摘要

强化学习推动了LLM推理和智能体能力的最新进展,但当前方法在探索和利用方面均存在困难。探索方面,困难任务成功率低且从头开始重复rollout成本高;利用方面,粗粒度的信用分配和训练不稳定:轨迹级奖励因后续错误惩罚有效前缀,且失败主导的群体淹没少数正向信号,使优化缺乏建设性方向。为此,我们提出R$^3$L,即反思-重试强化学习与语言引导探索、关键信用和正向放大。为合成高质量轨迹,R$^3$L通过反思-重试从随机采样转向主动合成,利用语言反馈诊断错误,将失败尝试转化为成功尝试,并通过从识别出的失败点重启来降低rollout成本。在错误被诊断和定位后,关键信用分配仅更新存在对比信号的分叉后缀,排除共享前缀的梯度更新。由于困难任务中失败占主导且反思-重试产生离策略数据,可能导致训练不稳定,正向放大提高成功轨迹的权重,确保正向信号引导优化过程。在智能体和推理任务上的实验表明,与基线相比,相对提升5%到52%,同时保持训练稳定性。我们的代码已发布在https://github.com/shiweijiezero/R3L。

英文摘要

Reinforcement learning drives recent advances in LLM reasoning and agentic capabilities, yet current approaches struggle with both exploration and exploitation. Exploration suffers from low success rates on difficult tasks and high costs of repeated rollouts from scratch. Exploitation suffers from coarse credit assignment and training instability: trajectory-level rewards penalize valid prefixes for later errors, and failure-dominated groups overwhelm the few positive signals, leaving optimization without constructive direction. To this end, we propose R$^3$L, Reflect-then-Retry Reinforcement Learning with Language-Guided Exploration, Pivotal Credit, and Positive Amplification. To synthesize high-quality trajectories, R$^3$L shifts from stochastic sampling to active synthesis via reflect-then-retry, leveraging language feedback to diagnose errors, transform failed attempts into successful ones, and reduce rollout costs by restarting from identified failure points. With errors diagnosed and localized, Pivotal Credit Assignment updates only the diverging suffix where contrastive signals exist, excluding the shared prefix from gradient update. Since failures dominate on difficult tasks and reflect-then-retry produces off-policy data, risking training instability, Positive Amplification upweights successful trajectories to ensure positive signals guide the optimization process. Experiments on agentic and reasoning tasks demonstrate 5% to 52% relative improvements over baselines while maintaining training stability. Our code is released at https://github.com/shiweijiezero/R3L.
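
The credit-assignment step in this abstract, updating only the diverging suffix of a retried trajectory, can be sketched as a token-level loss mask. The code below finds where a failed attempt and its retry first differ and masks out the shared prefix. The function names and the token lists are hypothetical, and the actual implementation in the released repository may differ.

```python
from typing import List, Sequence


def divergence_index(a: Sequence[int], b: Sequence[int]) -> int:
    """Length of the longest common prefix of two token sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def suffix_mask(tokens: Sequence[int], start: int) -> List[int]:
    """1 for tokens at or after `start` (receive gradient), 0 for the shared prefix."""
    return [0 if i < start else 1 for i in range(len(tokens))]


# Hypothetical token ids: the retry restarts from the failure point at index 3.
failed = [1, 2, 3, 4, 5]
retry = [1, 2, 3, 9, 9, 9]
start = divergence_index(failed, retry)
retry_mask = suffix_mask(retry, start)
failed_mask = suffix_mask(failed, start)
```

Multiplying per-token losses by these masks keeps the shared prefix out of the gradient, so a valid prefix is not penalized for an error that occurs later.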

2512.20901 2026-05-25 cs.CV

Benchmarking and Enhancing VLM for Compressed Image Understanding

基准测试与增强VLM对压缩图像的理解

Zifu Zhang, Tongda Xu, Siqi Li, Shengxi Li, Yue Zhang, Mai Xu, Yan Wang

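AI summary As image compression becomes widely used, improving the ability of vision-language models (VLMs) to understand compressed images grows in importance. This paper builds the first comprehensive benchmark for evaluating VLMs across different codecs and tasks and analyzes the sources of the performance gap on compressed images, finding that the gap can be mitigated only by improving the model's generalization. Building on this, the authors propose a universal VLM adaptor that improves model performance by 10%-30% across codecs and bitrates, providing a useful reference for applying VLMs to compressed-image tasks.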

Comments The paper is accepted by ICML 2026

详情
AI中文摘要

随着视觉语言模型(VLM)的快速发展及其应用需求的增长,图像输入的高效压缩变得日益重要。现有VLM主要处理和理解高比特率压缩图像,而它们对低比特率压缩图像的解读能力迄今尚未被探索。本文首次引入全面基准测试,评估VLM对压缩图像的能力,涵盖多种现有广泛使用的图像编解码器和多样化任务,基准测试中包含超过一百万个压缩图像。接着,我们分析性能差距的来源,将其归因于a)压缩过程中的信息损失和b)VLM的泛化失败。我们通过具体示例可视化这些差距,并确定对于压缩图像,只有泛化差距可以缓解。最后,我们提出一个通用VLM适配器,以增强模型对现有编解码器压缩图像的性能。结果证明,单个适配器可以将VLM在不同编解码器和比特率图像上的性能提升10%-30%。我们相信,我们的基准测试和增强方法为弥合VLM与压缩图像之间的差距提供了宝贵的见解和贡献。源代码可在https://github.com/bblgbr/CompressVLMBench获取。

英文摘要

With the rapid development of Vision-Language Models (VLMs) and the growing demand for their applications, efficient compression of image inputs has become increasingly important. Existing VLMs predominantly digest and understand high-bitrate compressed images, while their ability to interpret low-bitrate compressed images has so far remained unexplored. In this paper, we introduce the first comprehensive benchmark to evaluate how well VLMs handle compressed images, covering widely used image codecs and a diverse set of tasks, with over one million compressed images. Next, we analyse the source of the performance gap, attributing it to a) information loss during compression and b) generalisation failure of the VLM. We visualise these gaps with concrete examples and identify that, for compressed images, only the generalisation gap can be mitigated. Finally, we propose a universal VLM adaptor to enhance model performance on images compressed by existing codecs. We demonstrate that a single adaptor can improve VLM performance across images with varying codecs and bitrates by 10%-30%. We believe that our benchmark and enhancement method provide valuable insights and contribute toward bridging the gap between VLMs and compressed images. The source code is available at https://github.com/bblgbr/CompressVLMBench.

2512.15767 2026-05-25 cs.LG cs.AI

Bridging Data and Physics: A Graph Neural Network-Based Hybrid Twin Framework

连接数据与物理:基于图神经网络的混合孪生框架

M. Gorpinich, B. Moya, S. Rodriguez, F. Meraghni, Y. Jaafra, A. Briot, M. Henner, R. Leon, F. Chinesta

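AI summary This study proposes a hybrid twin framework based on graph neural networks to address the "ignorance model", the gap that model simplifications or unmodeled effects introduce in physical simulation. By combining a physics-based model with a data-driven method, it uses a graph neural network to learn the missing physics from sparse spatial measurements, improving simulation accuracy and interpretability while reducing data requirements. Experiments on nonlinear heat transfer problems across different meshes, geometries, and load positions show good generalization and effective corrections.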

Comments 27 pages, 14 figures

详情
AI中文摘要

模拟复杂的非定常物理现象依赖于详细的数学模型,例如通过有限元方法(FEM)进行仿真。然而,由于未建模效应或简化假设,这些模型通常与实际情况存在差异。我们将这种差距称为无知模型。纯数据驱动的方法试图学习整个系统的行为,但需要跨越整个空间和时间域的大量高质量数据。在现实场景中,此类信息不可用,使得完全数据驱动的建模不可靠。为了克服这一限制,我们采用混合孪生方法对无知分量进行建模,而不是从头模拟现象。由于基于物理的模型近似了现象的整体行为,剩余的无知通常比完整的物理响应复杂度低,因此可以用更少的数据进行学习。然而,一个关键困难是空间测量是稀疏的,并且在实际中获取不同空间配置下同一现象的数据具有挑战性。我们的贡献是通过使用图神经网络(GNN)来表示无知模型来克服这一限制。即使测量位置数量有限,GNN也能学习缺失物理的空间模式。这使得我们能够用数据驱动的修正来丰富基于物理的模型,而无需密集的空间、时间和参数数据。为了展示所提出方法的性能,我们在不同网格、几何形状和载荷位置的非线性热传导问题上评估了这种基于GNN的混合孪生方法。结果表明,GNN成功捕获了无知并泛化了跨空间配置的修正,提高了仿真精度和可解释性,同时最小化了数据需求。

英文摘要

Simulating complex unsteady physical phenomena relies on detailed mathematical models, simulated for instance by using the Finite Element Method (FEM). However, these models often exhibit discrepancies from reality due to unmodeled effects or simplifying assumptions. We refer to this gap as the ignorance model. While purely data-driven approaches attempt to learn full system behavior, they require large amounts of high-quality data across the entire spatial and temporal domain. In real-world scenarios, such information is unavailable, making fully data-driven modeling unreliable. To overcome this limitation, we model the ignorance component using a hybrid twin approach, instead of simulating phenomena from scratch. Since physics-based models approximate the overall behavior of the phenomena, the remaining ignorance is typically lower in complexity than the full physical response and can therefore be learned from significantly less data. A key difficulty, however, is that spatial measurements are sparse, and obtaining data that measure the same phenomenon under different spatial configurations is challenging in practice. Our contribution is to overcome this limitation by using Graph Neural Networks (GNNs) to represent the ignorance model. GNNs learn the spatial pattern of the missing physics even when the number of measurement locations is limited. This allows us to enrich the physics-based model with data-driven corrections without requiring dense spatial, temporal and parametric data. To showcase the performance of the proposed method, we evaluate this GNN-based hybrid twin on nonlinear heat transfer problems across different meshes, geometries, and load positions. Results show that the GNN successfully captures the ignorance and generalizes corrections across spatial configurations, improving simulation accuracy and interpretability, while minimizing data requirements.

2512.07078 2026-05-25 cs.CV cs.LG

DFIR-DETR: Frequency-Domain Iterative Refinement and Dynamic Feature Aggregation for Small Object Detection

DFIR-DETR:面向小目标检测的频域迭代细化与动态特征聚合

Bo Gao, Jingcheng Tong, Xingsheng Chen, Han Yu, Zichen Li

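AI summary This paper targets core challenges of small object detection in complex scenes and proposes DFIR-DETR, which uses frequency-domain iterative refinement and dynamic feature aggregation to address the shortcomings of existing networks in attention allocation, feature upsampling, and the retention of high-frequency information. The method achieves notable gains on the NEU-DET and VisDrone datasets while keeping computational cost low, validating its effectiveness across different detection tasks.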

详情
AI中文摘要

复杂场景中的小目标检测暴露了神经网络设计中的基本矛盾:骨干注意力分布均匀而不考虑内容,金字塔颈部在上采样过程中放大激活幅度而不进行归一化补偿,瓶颈卷积通过累积空间滤波逐步平滑高频边缘分量。为此,我们开发了DFIR-DETR,将每个提出的模块追溯到RT-DETR基线中特定的、可测量的缺陷:忽略空间复杂性的均匀注意力、破坏上采样特征稳定性的归一化漂移,以及逐步抑制小目标所依赖的高频分量的空间卷积。在NEU-DET和VisDrone上,DFIR-DETR仅以11.7M参数和47.2 GFLOPs就达到了92.9%和51.6%的mAP50,在两个性质不同的检测领域展示了持续的性能提升。

英文摘要

Small object detection in complex scenes exposes a fundamental tension in neural network design: backbone attention distributes computation uniformly regardless of content, pyramid necks inflate activation magnitudes during upsampling without norm compensation, and bottleneck convolutions progressively smooth high-frequency edge components through accumulated spatial filtering. In response, we develop DFIR-DETR by tracing each proposed module back to a specific, measurable deficiency in the RT-DETR baseline: uniform attention that ignores spatial complexity, norm drift that destabilises upsampled features, and spatial convolutions that progressively suppress the high-frequency components small objects depend on. On NEU-DET and VisDrone, DFIR-DETR achieves 92.9% and 51.6% mAP50 with only 11.7M parameters and 47.2 GFLOPs, demonstrating consistent gains across two qualitatively different detection domains.

2511.22521 2026-05-25 cs.CV cs.AI

DocVAL: Validated Chain-of-Thought Distillation for Grounded Document VQA

DocVAL:用于基于文档的视觉问答的验证链式思维蒸馏

Pinaki Prasad Guha Neogi, Ahmad Mohammadshirazi, Ser-Nam Lim, Rajiv Ramnath

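AI summary DocVAL is a validated chain-of-thought (CoT) distillation framework for document visual question answering (VQA) that transfers the precise spatial reasoning of large vision-language models (VLMs) to more efficient compact models. It combines teacher-generated spatial reasoning supervision, a rule-based dual-mode validator that filters low-quality training signals, and a two-stage training procedure with iterative refinement, so that the final student model runs without OCR or detection modules. Experiments show that DocVAL improves the localization performance of compact models on multiple benchmarks, and the work introduces mAP as a new localization metric.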

详情
AI中文摘要

文档视觉问答要求模型不仅正确回答问题,还要在复杂文档布局中精确定位答案。大型视觉语言模型(VLM)具有强大的空间定位能力,但其推理成本和延迟限制了实际部署。紧凑型VLM更高效,但在标准微调或蒸馏下常出现显著的定位退化。为解决这一问题,我们提出DocVAL,一种验证链式思维(CoT)蒸馏框架,将显式空间推理从大型教师模型转移到紧凑、可部署的学生VLM。DocVAL结合了(1)教师生成的空间CoT监督,(2)基于规则的双模式验证器,过滤低质量训练信号并提供细粒度像素级纠正反馈,以及(3)验证驱动的两阶段训练过程与迭代细化。文本检测仅作为训练时的监督和验证脚手架,使得最终学生模型在推理时作为纯VLM运行,无需OCR或检测。在多个文档理解基准上,DocVAL相比可比的紧凑VLM持续提升高达6-7个ANLS点。我们进一步引入平均精度(mAP)作为文档问答的定位指标,并在此新评估下报告了强大的空间定位性能。我们发布了95K验证器验证的CoT轨迹,并表明高质量、验证过的监督比扩展未过滤数据更有效,实现了高效且可信的文档定位。代码/数据:https://github.com/ahmad-shirazi/DocVAL

英文摘要

Document visual question answering requires models not only to answer questions correctly, but also to precisely localize answers within complex document layouts. While large vision-language models (VLMs) achieve strong spatial grounding, their inference cost and latency limit real-world deployment. Compact VLMs are more efficient, but they often suffer substantial localization degradation under standard fine-tuning or distillation. To address this gap, we propose DocVAL, a validated chain-of-thought (CoT) distillation framework that transfers explicit spatial reasoning from large teacher models to compact, deployable student VLMs. DocVAL combines (1) teacher-generated spatial CoT supervision, (2) a rule-based dual-mode validator that filters low-quality training signals and provides fine-grained, pixel-level corrective feedback, and (3) a validation-driven two-stage training procedure with iterative refinement. Text detection is used only as training-time scaffolding for supervision and validation, enabling the final student to operate as a pure VLM without OCR or detection at inference. Across multiple document understanding benchmarks, DocVAL yields consistent improvements of up to 6-7 ANLS points over comparable compact VLMs. We further introduce mean Average Precision (mAP) as a localization metric for document question answering and report strong spatial grounding performance under this new evaluation. We release 95K validator-verified CoT traces and show that high-quality, validated supervision is more effective than scaling unfiltered data, enabling efficient and trustworthy document grounding. Code/Data: https://github.com/ahmad-shirazi/DocVAL
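
The abstract reports gains in ANLS points. As a reminder of what that metric measures, the sketch below follows the commonly used definition of Average Normalized Levenshtein Similarity: for each question, take the best score over the accepted answers, where a score is 1 minus the normalized edit distance, or 0 when that distance reaches the threshold of 0.5. Lowercasing and whitespace stripping are the usual conventions and are assumed here; the paper's exact evaluation script may differ.

```python
from typing import List, Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insert, delete, and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute or match
        prev = curr
    return prev[-1]


def anls(predictions: Sequence[str], references: Sequence[List[str]],
         threshold: float = 0.5) -> float:
    """Mean over questions of the best thresholded similarity to any reference."""
    scores = []
    for pred, answers in zip(predictions, references):
        best = 0.0
        p = pred.strip().lower()
        for ans in answers:
            a = ans.strip().lower()
            longest = max(len(p), len(a))
            nl = levenshtein(p, a) / longest if longest else 0.0
            best = max(best, 1.0 - nl if nl < threshold else 0.0)
        scores.append(best)
    return sum(scores) / len(scores) if scores else 0.0
```

The threshold makes the metric tolerant of small OCR-style typos while giving no credit to answers that are mostly wrong.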

2511.13904 2026-05-25 cs.CV

Edge Assisted Multi-Camera Vehicle Tracking Framework for Real-Time and Scalable Deployment

面向实时可扩展部署的边缘辅助多摄像头车辆跟踪框架

Yuqiang Lin, Sam Lockyer, Shucheng Zhang, Florian Stanek, Markus Zarbock, Adrian Evans, Wenbin Li, Yinhai Wang, Nic Zhang

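AI summary This paper proposes EASE-MCVT, an edge-assisted multi-camera vehicle tracking framework that addresses the limited real-time performance and scalability of existing methods. The framework uses a distributed edge-server architecture in which object detection, single-camera tracking, and feature extraction run at the edge and only lightweight metadata is sent to a central server for efficient cross-camera association. It is optimized at both the algorithm and system levels, with dynamic workload allocation, a server-side re-match module, and a self-supervised camera link model. Experiments show real-time processing with competitive tracking accuracy, offering a practical route to city-scale real-time traffic management.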

详情
AI中文摘要

摄像头是现代智能交通系统中的核心传感模态,提供关于道路使用者活动的丰富视觉信息。多摄像头车辆跟踪利用这些数据重建跨摄像头网络的车辆轨迹,支持交通流预测和优化等应用。然而,现有大多数MCVT研究强调跟踪精度,而对实时性能和可扩展性关注有限,这两者对于实际城市规模部署至关重要。为弥补这一差距,我们提出边缘辅助、可扩展且高效的MCVT(EASE-MCVT),一种分布式边缘-服务器框架,专为实时吞吐量和可扩展操作设计。在边缘端,每个摄像头流通过目标检测、单摄像头跟踪、地理映射和特征提取进行处理,而仅将轻量级元数据(包括车辆位置和外观特征)发送到中央服务器进行跨摄像头关联。为提高跟踪精度和系统效率,EASE-MCVT从算法和系统角度进行了优化。算法上,它引入了用于轨迹级特征提取的动态工作负载方案、用于重新连接碎片化轨迹的服务器端重新匹配模块,以及一个自监督摄像头链接模型,该模型学习时空约束以加速和稳定跨摄像头关联。系统上,它集成了面向生产的数据工程组件,以标准化大规模操作的部署和数据交换。据我们所知,EASE-MCVT是首个明确设计用于在分布式边缘-服务器设置中同时解决实时性能和可扩展性的MCVT框架。在RoundaboutHD和CityFlow数据集上的实验表明,该框架实现了实时吞吐量并具有竞争力的跟踪精度,为城市范围的实时交通管理铺平了道路。

英文摘要

Cameras are a core sensing modality in modern intelligent transportation systems (ITS), providing rich visual information on road-user activities. Multi-Camera Vehicle Tracking (MCVT) uses this data to reconstruct vehicle trajectories across camera networks, supporting applications such as traffic flow prediction and optimisation. However, most existing MCVT studies emphasise tracking accuracy while paying limited attention to real-time performance and scalability, both essential for real-world and city-scale deployment. To address this gap, we propose Edge-Assisted, Scalable and Efficient MCVT (EASE-MCVT), a distributed edge-server framework designed for real-time throughput and scalable operation. On the edge side, each camera stream is processed through object detection, single-camera tracking, geo-mapping and feature extraction, while only lightweight metadata, including vehicle locations and appearance features, is sent to the central server for cross-camera association. To improve both tracking accuracy and system efficiency, EASE-MCVT is optimised from algorithmic and system perspectives. Algorithmically, it introduces a dynamic workload scheme for tracklet-level feature extraction, a server-side re-match module to reconnect fragmented tracklets, and a self-supervised camera link model that learns spatio-temporal constraints to accelerate and stabilise cross-camera association. Systemically, it integrates production-oriented data engineering components to standardise deployment and data exchange for large-scale operation. To the best of our knowledge, EASE-MCVT is the first MCVT framework explicitly designed to address both real-time performance and scalability in a distributed edge-server setting. Experiments on the RoundaboutHD and CityFlow datasets demonstrate real-time throughput with competitive tracking accuracy, paving the way for city-wide real-time traffic management.

2511.10404 2026-05-25 cs.CL

DELICATE: Diachronic Entity LInking using Classes And Temporal Evidence

DELICATE: 利用类别和时间证据的历时实体链接

Cristian Santini, Sebastian Barzaghi, Paolo Sernani, Emanuele Frontoni, Mehwish Alam

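AI summary This paper proposes DELICATE, a new neuro-symbolic method for entity linking in historical Italian texts that combines a BERT-based encoder with contextual information from Wikidata and selects knowledge-base entities using temporal plausibility and entity type consistency. The study also builds ENEIDE, a multi-domain entity linking corpus covering literary and political texts from the 19th and 20th centuries. Experiments show that DELICATE outperforms other entity linking models on historical text, and its confidence scores and feature sensitivity make the results more interpretable.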

详情
AI中文摘要

尽管自然语言处理领域取得了显著进展,但由于复杂的文档类型、缺乏特定领域的数据集和模型以及长尾实体(即在知识库中代表性不足的实体),实体链接任务在人文学科中仍然具有挑战性。本文旨在通过两个主要贡献解决这些问题。第一个贡献是DELICATE,一种用于历史意大利语的新型神经符号方法,它结合了基于BERT的编码器和来自Wikidata的上下文信息,利用时间合理性和实体类型一致性来选择适当的KB实体。第二个贡献是ENEIDE,一个多领域的历史意大利语实体链接语料库,半自动地从两个注释版本中提取,时间跨度从19世纪到20世纪,包括文学和政治文本。结果表明,即使与拥有数十亿参数的更大架构相比,DELICATE在历史意大利语中的表现也优于其他实体链接模型。此外,进一步的分析揭示了DELICATE的置信度分数和特征敏感性如何提供比纯神经方法更可解释和可解释的结果。

英文摘要

In spite of the remarkable advancements in the field of Natural Language Processing, the task of Entity Linking (EL) remains challenging in the humanities due to complex document typologies, lack of domain-specific datasets and models, and long-tail entities, i.e., entities under-represented in Knowledge Bases (KBs). The goal of this paper is to address these issues with two main contributions. The first contribution is DELICATE, a novel neuro-symbolic method for EL on historical Italian which combines a BERT-based encoder with contextual information from Wikidata to select appropriate KB entities using temporal plausibility and entity type consistency. The second contribution is ENEIDE, a multi-domain EL corpus in historical Italian semi-automatically extracted from two annotated editions spanning from the 19th to the 20th century and including literary and political texts. Results show that DELICATE outperforms other EL models on historical Italian, even when compared with larger architectures with billions of parameters. Moreover, further analyses reveal that DELICATE's confidence scores and feature sensitivity provide results which are more explainable and interpretable than those of purely neural methods.

2511.03882 2026-05-25 cs.CV cs.AI cs.LG cs.RO

Investigating Robot Control Policy Learning for Autonomous X-ray-guided Spine Procedures

Florence Klitzner, Blanca Inigo, Benjamin D. Killeen, Lalithkumar Seenivasan, Michelle Song, Axel Krieger, Mathias Unberath

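AI summary This paper studies the feasibility and challenges of imitation learning-based robot control policies for X-ray-guided spine procedures, focusing on cannula insertion in vertebroplasty. The authors build a highly realistic simulation environment and a dataset of correct trajectories paired with bi-planar X-ray sequences, which they use to train imitation learning policies that rely on visual information alone. The policy succeeded on the first attempt in 68.5% of cases and transferred across varied spinal anatomies and initial conditions, laying groundwork for lightweight, CT-free intra-operative robotic spinal navigation.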

English abstract

Imitation learning-based robot control policies are enjoying renewed interest in video-based robotics. However, it remains unclear whether this approach applies to X-ray-guided procedures, such as spine instrumentation, with sparse inputs. We examine the feasibility, opportunities, and challenges of imitation policy learning in bi-plane-guided cannula insertion. We develop an in silico sandbox for scalable, automated simulation of X-ray-guided spine procedures with a high degree of realism. We curate a dataset of correct trajectories and corresponding bi-planar X-ray sequences that emulate the stepwise alignment of providers. We then train imitation learning policies for planning and open-loop control that iteratively align a cannula in a vertebroplasty setting solely based on visual information. This precisely controlled setup offers insights into the limitations and capabilities of this method. Our policy succeeded on the first attempt in 68.5% of cases, maintaining safe intra-pedicular trajectories across diverse vertebral levels. The policy transferred to complex anatomy, including fractures, and to varied anatomies and initializations. Rollouts on real X-ray images indicate that partial sim-to-real transfer with plausible trajectories is possible. While these preliminary results are promising, we also identify limitations, especially in entry point precision. The current results provide a clear benchmark for future work; with more robust priors and domain knowledge, such models may provide a foundation for lightweight, CT-free robotic intra-operative spinal navigation.