arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.11684 2026-05-13 cs.LG eess.SP math.PR stat.AP

Partial Model Sharing Improves Byzantine Resilience in Federated Conformal Prediction

Ehsan Lari, Reza Arablouei, Stefan Werner

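Affiliations: Department of Electronic Systems, Norwegian University of Science and Technology; CSIRO's Data61; Department of Information and Communications Engineering, Aalto University

AI summary: This paper proposes a Byzantine-robust federated conformal prediction method based on partial model sharing, in which only a subset of model parameters is exchanged each round, improving both security and communication efficiency. The method strengthens robustness in both the training and calibration stages: during training, partial sharing limits the attack surface and reduces the influence of malicious updates; during calibration, histogram-based characterization vectors are used for anomaly detection and conformal quantile estimation. Experiments show that, across a range of Byzantine attack scenarios, the method achieves prediction coverage closer to the nominal level with markedly narrower prediction intervals, offering a more efficient and robust approach to federated uncertainty quantification.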

Comments 5 pages, 4 figures, Accepted for presentation at the 34th European Signal Processing Conference (EUSIPCO 2026) in Bruges, Belgium

Abstract

We propose a Byzantine-resilient federated conformal prediction (FCP) method that leverages partial model sharing, where only a subset of model parameters is exchanged each round. Unlike existing robust FCP approaches that primarily harden the calibration stage, our method protects both the federated training and conformal calibration phases. During training, partial sharing inherently restricts the attack surface and attenuates poisoned updates while reducing communication. During calibration, clients compress their non-conformity scores into histogram-based characterization vectors, enabling the server to detect Byzantine clients via distance-based maliciousness scores and to estimate the conformal quantile using only benign contributors. Experiments across diverse Byzantine attack scenarios show that the proposed method achieves closer-to-nominal coverage with substantially tighter prediction intervals than standard FCP, establishing a robust and communication-efficient approach to federated uncertainty quantification.
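
As a rough illustration of the calibration stage described above, the sketch below compresses each client's non-conformity scores into a normalized histogram, scores each client by its distance to a robust reference, and computes a split-conformal quantile from the clients kept as benign. The bin layout, the L1 distance to the coordinate-wise median, the fixed threshold, and all function names are assumptions made for illustration; the abstract does not specify them, and this is not the authors' implementation.

```python
import math
from statistics import median


def histogram_vector(scores, bins=5, lo=0.0, hi=1.0):
    """Compress scores into a normalized histogram (hypothetical bin layout)."""
    counts = [0] * bins
    for s in scores:
        idx = int((s - lo) / (hi - lo) * bins)
        counts[min(max(idx, 0), bins - 1)] += 1  # clamp out-of-range scores
    return [c / len(scores) for c in counts]


def maliciousness_scores(vectors):
    """L1 distance of each client's vector to the coordinate-wise median (assumed rule)."""
    reference = [median(col) for col in zip(*vectors)]
    return [sum(abs(v - r) for v, r in zip(vec, reference)) for vec in vectors]


def conformal_quantile(scores, alpha):
    """Split-conformal quantile: the ceil((n + 1)(1 - alpha))-th smallest score."""
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    return math.inf if k > n else sorted(scores)[k - 1]


clients = [
    [0.1, 0.3, 0.5, 0.7, 0.9],
    [0.1, 0.3, 0.5, 0.7, 0.9],
    [0.1, 0.3, 0.5, 0.7, 0.9],
    [1.0, 1.0, 1.0, 1.0, 1.0],  # a client reporting inflated scores
]
malice = maliciousness_scores([histogram_vector(c) for c in clients])
threshold = 1.0  # assumed cutoff, not taken from the paper
benign = [c for c, m in zip(clients, malice) if m <= threshold]
pooled = [s for c in benign for s in c]
q_hat = conformal_quantile(pooled, alpha=0.2)
```

In this toy run the fourth client is the only one whose histogram sits far from the median, so it is excluded before the quantile is taken. Pooling raw scores is a simplification; in the setting the abstract describes, the server would work from the compressed vectors.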

2605.11683 2026-05-13 cs.CV

DORA: Dynamic Online Reinforcement Agent for Token Merging in Vision Transformers

Kaixuan He, Song Chen, Yi Kang

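Affiliations: University of Science and Technology of China; Institute of Artificial Intelligence, Hefei Comprehensive National Science Center

AI summary: Vision Transformers (ViTs) are computationally expensive because self-attention has quadratic complexity. To address this, the paper proposes DORA, a reinforcement-learning-based dynamic online inference framework for adaptive token merging in ViTs. DORA models token merging as a Markov decision process: a lightweight RL agent chooses the merging strategy dynamically from the current feature state and layer-wise context, and the agent is optimized with a non-linear knowledge-distillation penalty to balance computational efficiency against feature fidelity. Experiments show that DORA outperforms existing methods across multiple ViT scales, delivering significant computational speedups with minimal accuracy loss.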

Comments Preprint. Under review

Abstract

Vision Transformers (ViTs) incur significant computational overhead due to the quadratic complexity of self-attention relative to the token sequence length. While existing token reduction methods mitigate this issue, they predominantly rely on fixed heuristic metrics, predefined ratios, or static offline masks, which lack the adaptability to capture input-dependent redundancy during inference. In this paper, we propose DORA (Dynamic Online Reinforcement Agent), the first reinforcement learning (RL)-driven online inference framework for dynamic token merging in ViTs. We formulate the merging process as a sequential Markov Decision Process (MDP), where a lightweight RL agent determines the merging strategy for each Transformer block based on the current feature state and layer-specific context. To balance computational efficiency and feature fidelity, the agent is optimized via a dense reward function incorporating a non-linear distillation-based penalty. We implement an asymmetric Actor-Critic architecture that utilizes a high-capacity Critic for stable offline training while retaining a minimal Actor head for low-computation online inference. Evaluations across multiple ViT scales (Tiny to Large) demonstrate that DORA improves the accuracy-efficiency Pareto front compared to current baselines. Under strict negligible accuracy-drop constraints (<= 0.05%), DORA achieves up to a 12.66% token merging rate, and delivers up to a 569.7% relative improvement over the most efficient baseline. On ImageNet-1K, under aligned accuracy constraints, DORA achieves up to a 76% relative improvement in computational savings compared to state-of-the-art methods. Furthermore, on out-of-distribution (OOD) benchmarks such as ImageNet-A and ImageNet-C, DORA attains a relative efficiency advantage of over 430%.

2605.11680 2026-05-13 cs.CV

ShapeCodeBench: A Renewable Benchmark for Perception-to-Program Reconstruction of Synthetic Shape Scenes

Shivam Kumar

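Affiliations: Independent Researcher

AI summary: This paper introduces ShapeCodeBench, a synthetic benchmark for perception-to-program reconstruction: a model must generate an executable drawing program from a rendered image, and the re-rendered result is compared with the target image. Samples are generated from a seeded random number generator, so fresh held-out test sets can be created; the released split contains 150 samples across difficulty tiers and is scored with multiple metrics. Experiments show that current state-of-the-art models still achieve limited exact-match performance, leaving the benchmark far from saturated.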

Comments 14 pages, 5 figures, 2 tables. Code, data, and artifacts: https://github.com/shivamk3r/shape-code-bench ; archival release: https://doi.org/10.5281/zenodo.20132286

Abstract

We introduce ShapeCodeBench, a synthetic benchmark for perception-to-program reconstruction: given a rendered raster image, a model must emit an executable drawing program that a deterministic evaluator re-renders and compares with the target. The v1 DSL has four primitives on a 512 x 512 black-on-white canvas, but every instance is generated from a seeded RNG, so fresh held-out sets can be created to reduce exact-instance contamination. We release a frozen eval_v1 split with 150 samples across easy, medium, and hard tiers, scored by exact match, pixel accuracy, foreground IoU, parse success, and execution success. We evaluate an empty-program floor, a classical computer-vision heuristic, Claude Opus 4.7 at high and max effort, and GPT-5.5 at medium and extra_high reasoning effort. The heuristic is competitive on easy scenes but collapses when overlaps fuse components; the strongest multimodal configuration preserves much of the foreground structure but still misses exact match because of small parameter errors. Best overall exact match remains low, so ShapeCodeBench is far from saturated. The benchmark code, frozen dataset, run artifacts, and paper sources are released to support independent replication and extension.
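
Three of the scores named above (exact match, pixel accuracy, foreground IoU) can be sketched on binary canvases as follows. This is a minimal illustration, assuming canvases are nested lists with 1 for foreground and 0 for background; the function names and the convention that two empty canvases score an IoU of 1.0 are assumptions, not the benchmark's actual evaluator.

```python
def exact_match(pred, target):
    """True only if every pixel agrees."""
    return pred == target


def pixel_accuracy(pred, target):
    """Fraction of pixels whose values agree."""
    total = sum(len(row) for row in target)
    same = sum(p == t for pr, tr in zip(pred, target) for p, t in zip(pr, tr))
    return same / total


def foreground_iou(pred, target):
    """Intersection over union of the foreground (value 1) pixels."""
    inter = union = 0
    for pr, tr in zip(pred, target):
        for p, t in zip(pr, tr):
            inter += int(p == 1 and t == 1)
            union += int(p == 1 or t == 1)
    return 1.0 if union == 0 else inter / union  # assumed convention for empty scenes


target = [[1, 0], [1, 1]]
pred = [[1, 0], [0, 1]]
scores = (exact_match(pred, target), pixel_accuracy(pred, target), foreground_iou(pred, target))
```

A single missed pixel already makes exact match fail while the other two scores stay high, which is consistent with the abstract's remark that small parameter errors cost exact match even when foreground structure is preserved.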

2605.11678 2026-05-13 cs.AI

OOM-Free Alpamayo via CPU-GPU Memory Swapping for Vision-Language-Action Models

Seungwoo Roh, Huiyeong Kim, Jong-Chan Kim

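Affiliations: Graduate School of Automobile and Mobility, Kookmin University, Korea

AI summary: This paper proposes OOM-Free Alpamayo, a framework that uses CPU-GPU memory swapping to run Vision-Language-Action (VLA) models efficiently on VRAM-constrained GPUs without modifying the model. Through layer-level memory management, pipelined parameter transfer, and a resident-layer decision policy, the method substantially reduces VRAM usage and speeds up inference. On NVIDIA's Alpamayo-R1-10B model it achieves up to a 3.55x speedup over existing offloading while maintaining full BF16 precision.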

Comments Submitted to IEEE RTCSA on March 26, 2026 (KST); Accepted on May 4, 2026 (KST)

Abstract

End-to-end Vision-Language-Action (VLA) models for autonomous driving unify perception, reasoning, and control in a single neural network, achieving strong driving performance but requiring 20-60GB of GPU memory, far exceeding the 12-16GB available on commodity GPUs. We present a framework that enables memory-efficient VLA inference on VRAM-constrained GPUs through system-level optimization alone, without model modification. Our work proceeds in three stages: (1) Sequential Demand Layering reduces VRAM usage from model-level to layer-level granularity; (2) Pipelined Demand Layering hides parameter transfer time within layer execution time via transfer-compute overlap; and (3) a GPU-Resident Layer Decision Policy, informed by per-module residency benefit analysis, eliminates the residual transfer overhead that pipelining cannot hide. We further propose a performance prediction model that determines the optimal configuration (both the number and placement of resident layers) from a single profiling run with less than 1.3% prediction error across all configurations. Applied to NVIDIA's Alpamayo-R1-10B (21.52GB) on an RTX 5070Ti (16GB), our work achieves up to 3.55x speedup over Accelerate offloading while maintaining full BF16 precision.
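
The three stages can be pictured with a toy latency model. The sketch below assumes a two-buffer pipeline in which the parameters of layer i+1 are transferred while layer i computes, and it treats a GPU-resident layer as having zero transfer time. The function names and the timing model are hypothetical simplifications for illustration; they are not the paper's performance prediction model.

```python
def sequential_latency(transfer, compute):
    """Load each layer, then run it: nothing overlaps."""
    return sum(transfer) + sum(compute)


def pipelined_latency(transfer, compute):
    """Overlap the transfer of layer i+1 with the compute of layer i (assumed two buffers)."""
    total = transfer[0]  # the first layer's transfer cannot be hidden
    for i in range(len(compute) - 1):
        total += max(compute[i], transfer[i + 1])
    return total + compute[-1]


def with_resident_layers(transfer, resident):
    """Layers kept on the GPU need no transfer."""
    return [0.0 if i in resident else t for i, t in enumerate(transfer)]


transfer = [4.0, 4.0, 4.0]  # per-layer host-to-device time (arbitrary units)
compute = [1.0, 1.0, 1.0]   # per-layer execution time
seq = sequential_latency(transfer, compute)
pipe = pipelined_latency(transfer, compute)
pinned = pipelined_latency(with_resident_layers(transfer, {1}), compute)
```

When transfers are slower than compute, as in this example, pipelining hides only part of the cost, and pinning a layer removes some of what remains; that residual is what the resident-layer policy in stage (3) targets.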

2605.11674 2026-05-13 cs.RO

A Proprioceptive-Only Benchmark for Quadruped State Estimation: ATE, RPE, and Runtime Trade-offs Between Filters and Smoothers

Ylenia Nisticò, João Carlos Virgolino Soares, Joan Solà, Claudio Semini

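Affiliations: Dynamic Legged Systems (DLS) Lab, Istituto Italiano di Tecnologia; Institut de Robòtica i Informàtica Industrial - CSIC

AI summary: This paper compares three state-of-the-art proprioceptive state estimators for quadruped robots (MUSE, IEKF, and IS), evaluating their long-term and short-term accuracy and computational efficiency on the CYN-1 sequence of the GrandTour dataset. IEKF and IS achieve lower long-term trajectory error than MUSE, while short-term errors differ little across methods, and the methods trade accuracy against computational latency differently. The study gives a clear reference on performance and computational cost for choosing a quadruped state estimator, and all evaluation code is open-sourced for reproducibility.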

Comments Submitted to IEEE Robotics and Automation Practice

Abstract

We compare three state-of-the-art proprioceptive state estimators for quadruped robots: MUSE [1], the Invariant Extended Kalman Filter (IEKF) [2], and the Invariant Smoother (IS) [3], on the CYN-1 sequence of the GrandTour Dataset [4]. Our goal is to give practitioners clear guidance on accuracy and computation time: we report long-term accuracy (Absolute Trajectory Error, ATE), short-term accuracy (translational and rotational Relative Pose Error, RPE), and per-update computation time on a fixed hardware/software stack. On this dataset, RPEs are broadly similar across methods, while IEKF and IS achieve a lower ATE than MUSE. Runtime results highlight the accuracy-latency trade-offs across the three approaches. In the discussion, we outline the evaluation choices used to ensure a fair comparison and analyze factors that influence short-horizon metrics. Overall, this study provides a concise snapshot of accuracy and cost, helping readers choose an estimator that fits their application constraints, with all evaluation code and documentation released open-source at https://github.com/iit-DLSLab/state_estimation_benchmark for full reproducibility.
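
For readers unfamiliar with the two accuracy metrics, the sketch below computes a position-only ATE and a translational RPE as root-mean-square errors. It is a deliberate simplification: it assumes the estimated and ground-truth trajectories are already time-synchronized and expressed in the same frame, it skips trajectory alignment, and it ignores orientation, whereas a full RPE is defined on relative SE(3) transforms. The function names are illustrative and unrelated to the released benchmark code.

```python
import math


def ate_rmse(est, gt):
    """RMSE of per-pose position error (long-term, global consistency)."""
    sq = [sum((e - g) ** 2 for e, g in zip(pe, pg)) for pe, pg in zip(est, gt)]
    return math.sqrt(sum(sq) / len(sq))


def rpe_translation_rmse(est, gt, delta=1):
    """RMSE of the error in displacement over `delta` steps (short-term drift)."""
    errors = []
    for i in range(len(gt) - delta):
        d_est = [b - a for a, b in zip(est[i], est[i + delta])]
        d_gt = [b - a for a, b in zip(gt[i], gt[i + delta])]
        errors.append(sum((x - y) ** 2 for x, y in zip(d_est, d_gt)))
    return math.sqrt(sum(errors) / len(errors))


gt = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
est = [(x, y + 1.0, z) for x, y, z in gt]  # constant 1 m lateral offset
ate = ate_rmse(est, gt)
rpe = rpe_translation_rmse(est, gt)
```

The example shows why the two metrics can disagree: a constant offset gives a large ATE but a zero RPE, so estimators with similar RPE can still differ in ATE, as the abstract reports for this dataset.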

2605.11672 2026-05-13 cs.AI cs.DB

A CAP-like Trilemma for Large Language Models: Correctness, Non-bias, and Utility under Semantic Underdetermination

Vinu Ellampallil Venugopal

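Affiliations: International Institute of Information Technology

AI summary: Inspired by the CAP theorem for distributed systems, this paper proposes a CAP-like trilemma for large language models (LLMs): under semantic underdetermination, a model cannot simultaneously guarantee strong correctness, strict non-bias, and high utility. When a prompt does not determine a unique answer, a useful response requires the model to introduce some selection criterion; if that criterion is neither supplied by the user nor justified by the premises, the response may be biased. Conversely, a model that avoids unsupported preferences may preserve correctness and non-bias but sacrifices utility. The paper argues that some LLM failures stem from the semantic underdetermination of the task itself rather than from limits on model capability.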

Abstract

The CAP theorem states that a distributed system cannot simultaneously guarantee consistency, availability, and partition tolerance under network partition. Inspired by this result, this paper formulates a CAP-like conjecture for Large Language Models (LLMs). The proposed trilemma states that, under semantic underdetermination, an LLM cannot always simultaneously guarantee strong correctness, strict non-bias, and high utility. A prompt is semantically underdetermined when the given premises do not determine a unique answer. In such cases, a useful and decisive response requires the model to introduce a selection criterion, preference, prior, or value ordering. If this criterion is not supplied by the user or justified by the available premises, the response becomes biased in a broad selection-theoretic sense. Conversely, if the model avoids unsupported preferences, it may preserve correctness and non-bias but may reduce utility through refusal, hedging, or clarification. The paper formalizes this correctness--non-bias--utility trilemma, develops examples, and argues that certain LLM failures arise not merely from model limitations but from the structure of underdetermined decision requests.

2605.11666 2026-05-13 cs.LG cs.AI

Evolutionary Task Discovery: Advancing Reasoning Frontiers via Skill Composition and Complexity Scaling

Liqin Ye, Yanbin Yin, Michael Galarnyk, Yuzhao Heng, Sudheer Chava, Chao Zhang

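Affiliations: Georgia Institute of Technology

AI summary: This paper proposes Evolutionary Task Discovery (EvoTD), a framework that improves the reasoning ability of large language models through structured evolutionary operators. It treats data synthesis as a directed search over a dual-axis manifold of algorithmic skills and complexity attributes, introducing a crossover operator that diversifies skill compositions and a parametric mutation operator that scales structural constraints to promote robust generalization. Experiments show that EvoTD effectively expands the model's reasoning frontier and generalizes well across model architectures and pretraining regimes.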

Abstract

The reasoning frontier of Large Language Models (LLMs) has advanced significantly through modern post-training paradigms (e.g., Reinforcement Learning from Verifiable Rewards (RLVR)). However, the efficacy of these methods remains fundamentally constrained by the diversity and complexity of the training data. One practical solution is data synthesis; yet, prevalent methods relying on unstructured mutation or exploration suffer from homogeneity collapse, failing to systematically expand the reasoning frontier. To overcome this, we propose Evolutionary Task Discovery (EvoTD), a framework that treats data synthesis as a directed search over a dual-axis manifold of Algorithmic Skills and Complexity Attributes. We introduce structured evolutionary operators to navigate this space: a Crossover operator that synthesizes novel skill compositions to enhance diversity, and a Parametric Mutation operator that scales structural constraints (e.g., input size, tree depth) to drive robust generalization. Crucially, we integrate a dynamic Zone of Proximal Development filter, ensuring tasks lie within the learnable region of the model. Empirically, EvoTD delivers substantial reasoning gains that generalize consistently across model architectures, pretraining regimes, and scales, demonstrating that structured evolutionary curricula can effectively support reasoning improvement. We release our code at https://github.com/liqinye/EvoTD.

2605.11665 2026-05-13 cs.RO

Nautilus: From One Prompt to Plug-and-Play Robot Learning

Yufeng Jin, Jianfei Guo, Xiaogang Jia, Yu Deng, Zechu Li, Han Liu, Weiran Liao, Vignesh Prasad, Mathias Franzius, Gerhard Neumann, Georgia Chalvatzaki

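Affiliations: TU Darmstadt; KIT; FZI; Robotics Institute Germany; Honda Research Institute Europe

AI summary: Robot learning research is fragmented across policy families, benchmarks, and real robot systems, leaving implementations entangled and hard to port or reuse. To address this, the paper proposes NAUTILUS, an open-source framework that turns a single user prompt (for example, "Evaluate policy A with benchmark B") into executable reproduction, evaluation, fine-tuning, and deployment workflows. Through unified interfaces, typed contracts, and automated validation, NAUTILUS flexibly integrates existing and user-provided policies, simulators, benchmarks, and real robots, substantially reducing the engineering burden of cross-family reproduction and evaluation.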

Abstract

Robot learning research is fragmented across policy families, benchmark suites, and real robots; each implementation is entangled with the others in a complex combination matrix, making it an engineering nightmare to port any single element. General-purpose coding agents may occasionally bridge specific setups, but cannot close this gap at scale because they lack the procedural priors and validation practices that characterize robotics research workflows. We propose NAUTILUS, an open-source harness that turns a single user prompt -- for example, "Evaluate policy A with benchmark B" -- into ready-to-use reproduction, evaluation, fine-tuning, and deployment workflows. NAUTILUS provides: plug-and-play agent skill sets with distilled priors from robotics research; typed contracts among policies, simulators/benchmarks, and real-world robots; unified interfaces and execution environments; and a trustworthy agentic coding workflow with explicit, automated validation, and testing at each milestone. NAUTILUS can not only automatically generate the required adapters and containers for existing implementations, but also wrap and onboard new or user-provided policies, simulators/benchmarks, and robots, all connected via a uniform interface. This expands cross-validation coverage without hand-written glue code. Like a nautilus shell that grows by adding chambers, NAUTILUS scales by extending its execution in chambered units, making it a research harness for scalability rather than a hand-curated framework, and aiming to reduce the engineering burden of cross-family reproduction and evaluation in the ever-growing robot learning ecosystem.

2605.11663 2026-05-13 cs.CL

Human-Grounded Multimodal Benchmark with 900K-Scale Aggregated Student Response Distributions from Japan's National Assessment of Academic Ability

Kyosuke Takami, Yuka Tateisi, Satoshi Sekine, Yusuke Miyao

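Affiliations: Osaka Kyoiku University; University of Tokyo; NII LLMC

AI summary: Built from Japan's National Assessment of Academic Ability, this multimodal benchmark dataset covers Science, Mathematics, and Japanese Language and includes real exam items, diagrams, and response distributions aggregated from about 900,000 students. The dataset preserves the structure and content of the real exams and supports comparison of human and model performance under a unified evaluation framework. The study evaluates multimodal large language models with exact-match accuracy and character-level F1, further analyzes the reliability of automatic scoring, and provides a reproducible benchmark for multimodal educational reasoning that supports future work on model evaluation and explainability in authentic assessment settings.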

Abstract

Authentic school examinations provide a high-validity test bed for evaluating multimodal large language models (MLLMs), yet benchmarks grounded in Japanese K-12 assessments remain scarce. We present a multimodal dataset constructed from Japan's National Assessment of Academic Ability, comprising officially released middle-school items in Science, Mathematics, and Japanese Language. Unlike existing benchmarks based on synthetic or curated data, our dataset preserves real exam layouts, diagrams, and Japanese educational text, together with nationwide aggregated student response distributions (N ≈ 900,000). These features enable direct comparison between human and model performance under a unified evaluation framework. We benchmark recent multimodal LLMs using exact-match accuracy and character-level F1 for open-ended responses, observing substantial variation across subjects and strong sensitivity to visual reasoning demands. Human evaluation and LLM-as-judge analyses further assess the reliability of automatic scoring. Our dataset establishes a reproducible, human-grounded benchmark for multimodal educational reasoning and supports future research on evaluation, feedback generation, and explainable AI in authentic assessment contexts. Our dataset is available at: https://github.com/KyosukeTakami/gakucho-benchmark
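
Character-level F1 is a natural fit for Japanese open-ended answers because it needs no word segmentation. The sketch below shows one common way to compute it, using the multiset overlap of characters between prediction and reference. The abstract does not give the exact definition used in the benchmark, so the overlap rule, the handling of empty strings, and the function name are assumptions.

```python
from collections import Counter


def char_f1(prediction, reference):
    """Harmonic mean of character-level precision and recall (multiset overlap)."""
    if not prediction or not reference:
        # Assumed convention: two empty strings match, one empty string does not.
        return float(prediction == reference)
    overlap = sum((Counter(prediction) & Counter(reference)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(prediction)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)


partial = char_f1("abc", "abd")    # two of three characters shared
perfect = char_f1("東京", "東京")   # identical answers
```

Unlike exact match, this score gives partial credit to answers that share most characters with the reference, which is why the two metrics are reported together for open-ended responses.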

2605.11659 2026-05-13 cs.CV cs.AI

Reviving In-domain Fine-tuning Methods for Source-Free Cross-domain Few-shot Learning

Yaze Zhao, Yicong Liu, Yixiong Zou, Yuhua Li, Ruixuan Li

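Affiliations: School of Computer Science and Technology, Huazhong University of Science and Technology

AI summary: This paper studies source-free cross-domain few-shot learning (CDFSL): adapting large models such as CLIP to a target domain from a few samples when source-domain data is unavailable. It finds that adapter-based methods (e.g., LoRA) outperform prompt-based methods in CDFSL, and that the advantage comes from rectifying the attention of the visual CLS token, which strengthens modality alignment and class separation. Based on this finding, the authors propose Semantic Probe, a general attention rectification framework that improves both adapter- and prompt-based methods in CDFSL and achieves state-of-the-art results on multiple benchmarks.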

Abstract

Cross-Domain Few-Shot Learning (CDFSL) aims to adapt large-scale pretrained models to specialized target domains with limited samples, yet the few-shot fine-tuning of vision-language models like CLIP remains underexplored. By establishing multiple fine-tuning baselines of CLIP for CDFSL, we find adapter-based methods (e.g., LoRA) consistently outperform prompt-based ones (e.g., MaPLe), contrary to in-domain scenarios. To make those effective in-domain methods competitive again in CDFSL, we analyze this phenomenon and discover LoRA's superiority stems from rectifying the collapsed attention of the visual CLS token, enhancing modality alignment and class separation by focusing on text-related visual regions. Further, we find the textual EOS token exhibits much better attention to visual samples, and CLIP's standard contrastive loss weakly constrains modality alignment. Based on these insights, we propose Semantic Probe, a plug-and-play attention rectification framework for both adapter- and prompt-based methods. Extensive experiments on four CDFSL benchmarks validate our rationale, achieving state-of-the-art performance and benefiting both fine-tuning paradigms. Code will be released.

2605.11636 2026-05-13 cs.AI

Seirênes: Adversarial Self-Play with Evolving Distractions for LLM Reasoning

Chi Zhang, Haibo Qiu, Qiming Zhang, Yufei Xu, Xinbo Gao, Jing Zhang

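Affiliations: School of Computer Science, Wuhan University; Independent Researchers; Xidian University

AI summary: This paper proposes Seirênes, an adversarial self-play reinforcement learning framework that turns the reasoning failures of large language models in complex contexts into a training signal, improving robustness. A single model both generates distracting contexts and solves the tasks, which forces it to identify the core logic amid noise and strengthens its underlying reasoning. Experiments show that Seirênes delivers significant gains on multiple mathematical reasoning benchmarks and can expose reasoning blind spots in top-tier closed-source models.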

Abstract

We present Seirênes, a self-play RL framework that transforms contextual interference from a failure mode of LLM reasoning into an internal training signal for co-evolving more resilient reasoners. While RL with verifiable rewards has significantly advanced reasoning capabilities, models can still exhibit fragility when encountering non-idealized contexts: scenarios characterized by superfluous information, tangential instructions, or incidental correlations that differ from the clean distributions typical of standard benchmarks. Seirênes harnesses this vulnerability through a parameter-shared and adversarial self-play loop. Within this framework, a single model is trained to both construct plausible yet distracting contexts that expose its own reasoning blind spots, and solve problems by discerning the essential task from these perturbations to recover the core underlying logic. By pitting these competing objectives against each other, Seirênes compels the model to move beyond superficial pattern matching and anchors its capabilities in robust underlying reasoning. This continuous interaction sustains an informative co-evolutionary curriculum as the model improves. Across seven mathematical reasoning benchmarks and model scales from 4B to 30B, Seirênes achieves average gains of +10.2, +9.1, and +7.2 points. Besides, distracting contexts produced by the 4B Seirênes model reduce the accuracy of top-tier closed-source models (GPT and Gemini) by roughly 4--5 points, revealing Seirênes' general ability to uncover reasoning models' blind spots.

2605.11634 2026-05-13 cs.CV cs.AI

Unlocking UML Class Diagram Understanding in Vision Language Models

Artem Naboichenko, René Peinl

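Affiliations: Hof University of Applied Sciences

AI summary: Although vision language models (VLMs) have made significant progress across applications, they still fall short on structured visual content such as diagrams, and UML class diagrams from the computer science domain have received little attention. This paper presents a visual question answering benchmark based on UML class diagrams that is both challenging and manageable, along with a large-scale training dataset of 16,000 image-question-answer triples. Experiments show that a LoRA-based fine-tune outperforms Qwen 3.5 27B, a recent and well-performing VLM, on this task.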

Abstract

Although Vision Language Models (VLMs) have seen tremendous progress across all kinds of use cases, they still fall behind in answering questions regarding diagrams compared to photos. Although progress has been made in the area of bar charts, line charts, and similar diagrams, there is still little research concerned with other types of diagrams, e.g., in the computer science domain. Our work presents a benchmark for visual question answering based on UML class diagrams that is both challenging and manageable. We further construct a large-scale training dataset with 16,000 image-question-answer triples and show that a LoRA-based fine-tune easily outperforms Qwen 3.5 27B, which is a recent and well-performing VLM on many other benchmarks.

2605.11633 2026-05-13 cs.AI

Can LLM Agents Respond to Disasters? Benchmarking Heterogeneous Geospatial Reasoning in Emergency Operations

Junjue Wang, Weihao Xuan, Heli Qi, Pengyu Dai, Kunyi Liu, Hongruixuan Chen, Zhuo Zheng, Junshi Xia, Stefano Ermon, Naoto Yokoya

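Affiliations: The University of Tokyo; RIKEN AIP; Waseda University; Stanford University

AI summary: This paper introduces DORA, a benchmark for evaluating the end-to-end capabilities of large language model agents in disaster emergency response. Its 515 expert-authored tasks span 45 real-world disaster events and cover disaster perception, spatial analysis, evacuation planning, and multimodal report generation, testing agents' reasoning and operation over heterogeneous geospatial data. Experiments reveal three challenges for current LLM agents in disaster response: insufficient domain grounding, difficulty with tool selection and argument grounding, and fragile reasoning over long pipelines, providing a reference for building more reliable disaster-response systems.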

Comments DORA stress-tests LLM agents on real-world disaster operations that demand comprehensive orchestration of 108 specialized tools over heterogeneous geospatial data

Abstract

Operational disaster response goes beyond damage assessment, requiring responders to integrate multi-sensor signals, reason over road networks, populations and key facilities, plan evacuations, and produce actionable reports. However, prior work largely isolates remote-sensing perception or evaluates generic tool use, leaving the end-to-end workflows of emergency operations underexplored. In this paper, we introduce Disaster Operational Response Agent benchmark (DORA), the first agentic benchmark for end-to-end disaster response: 515 expert-authored tasks across 45 real-world disaster events spanning 10 types, paired with expert-verified, replayable gold trajectories totaling 3,500 tool-call steps. Tasks span five dimensions that cover the operational disaster-response pipeline: disaster perception, spatial relational analysis, rescue and evacuation planning, temporal evolution reasoning, and multi-modal report synthesis. Agents compose calls from a 108-tool MCP library over heterogeneous geospatial data: optical, SAR, and multi-spectral imagery across single-, bi-, and multi-temporal sequences (0.015-10m GSD), complemented by elevation and social vector layers. We comprehensively evaluate 13 frontier LLMs on our benchmark, revealing three persistent challenges: 1) disaster-domain grounding exposes unique failure modes (damage-semantic grounding, sensor-modality mismatch, and disaster-pipeline composition); 2) agents are doubly bottlenecked by tool selection and argument grounding, where gold tool-order hints improve accuracy by only 1.08-4.40%, and alternative scaffolds yield at most a 3.24% gain; 3) compositional fragility scales with trajectory length, the agent-to-gold gap widening from 7% to 56% on long pipelines. DORA establishes a rigorous testbed for operationally reliable disaster-response agents.

2605.11629 2026-05-13 cs.CL

OmniThoughtVis: A Scalable Distillation Pipeline for Deployable Multimodal Reasoning Models

Yuanhao Yue, Chengyu Wang, Yuanjie Lyu, Lei Shen, Jun Huang

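Affiliations: Alibaba Group

AI summary: Recent multimodal large language models show strong chain-of-thought reasoning on vision-language tasks, but latency and resource constraints limit their deployment in real systems. To address this, the paper proposes OmniThoughtVis, a scalable data curation and knowledge distillation pipeline for transferring the multimodal reasoning ability of large models to smaller, deployment-friendly ones. The method generates structured reasoning traces and combines several strategies to maintain data quality, and the distilled small models improve markedly on multiple benchmarks.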

Abstract

Recent multimodal large language models (MLLMs) have shown strong chain-of-thought (CoT) reasoning ability on vision-language tasks, but their direct deployment in real-world systems is often limited by latency and resource constraints. In practice, smaller MLLMs are preferred for online serving, yet their reasoning performance is bottlenecked by the lack of large-scale, high-quality multimodal CoT supervision. In this paper, we present OmniThoughtVis, a scalable data curation and distillation pipeline for transferring multimodal reasoning capabilities from high-capacity teacher models to smaller, deployment-oriented MLLMs. Starting from a diverse open-source seed pool, our pipeline generates structured CoT traces and performs joint annotation of reasoning difficulty, answer quality, and semantic task tags. To maintain data quality at scale, we combine rule-based filtering, difficulty-aware selection, and tag-based diversity sampling, resulting in a curated corpus of 1.8M samples that supports controllable subset construction for downstream training. We use OmniThoughtVis to distill Qwen3-VL models from 2B to 8B parameters and evaluate them on nine multimodal reasoning benchmarks. The resulting distilled models show consistent gains across model scales, including improvements of up to +16.8 points on MathVerse and +5.6 points on MMMU-Pro for the 4B model. Notably, the distilled 4B model matches or surpasses the undistilled 8B baseline on several tasks, highlighting the practical value of scalable reasoning distillation for deployment-oriented MLLMs.

2605.11628 2026-05-13 cs.CV

Single-Shot HDR Recovery via a Video Diffusion Prior

Chinmay Talegaonkar, Jinshi He, Christopher McKenna, Nicholas Antipa

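Affiliations: University of California San Diego; Creare LLC

AI summary: This paper proposes a single-shot high dynamic range (HDR) image recovery method based on a video diffusion prior, addressing the trade-off between fidelity and model complexity in existing methods. It recasts HDR reconstruction as conditional video generation: the model generates an exposure sequence that is then fused into the final HDR image, improving accuracy and interpretability. Experiments show that the method outperforms existing approaches on several evaluation metrics and is preferred in human evaluation, and that the framework extends to other image reconstruction tasks.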

Abstract

Recent generative methods for single-shot high dynamic range (HDR) image reconstruction show promising results, but often struggle with preserving fidelity to the input image. They require separate models to handle highlights and shadows, or sacrifice interpretability by directly predicting the final HDR image. We address these limitations by re-casting single-shot HDR reconstruction as conditional video generation and fusing the generated frames into an HDR image. We finetune a video diffusion model to generate an exposure bracket, conditioned on a low dynamic range (LDR) input. We fuse this image bracket using per-pixel weights predicted by a light-weight UNet. This formulation is simple, interpretable, and effective. Rather than directly hallucinating an HDR image, it explicitly reconstructs the intermediate exposure stack and fuses it into the final output. Our method eliminates the need for separate models across exposure regimes and produces HDR reconstructions with high input fidelity. On quantitative benchmarks, we outperform state-of-the-art generative baselines with comparable model capacity on several reconstruction metrics. Human evaluators further prefer our results in 72% of pairwise comparisons against existing methods. Finally, we show that this input-conditioned sequence generation and fusion framework extends beyond HDR to other image reconstruction tasks, such as all-in-focus image recovery from a single defocus-blurred input.
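
The fusion step can be illustrated for a single pixel. The sketch below merges an exposure bracket by a weighted average of per-exposure radiance estimates, using a simple hat function as the weight. It assumes linear pixel values in [0, 1] and known relative exposure times, and the hat weight is only a stand-in for the per-pixel weights that the paper predicts with a lightweight UNet. The function names and the fallback rule are hypothetical.

```python
def hat_weight(z):
    """Trust mid-range pixels most; give clipped pixels (0 or 1) zero weight."""
    return 1.0 - abs(2.0 * z - 1.0)


def fuse_pixel(values, exposure_times, weight_fn=hat_weight):
    """Weighted average of the radiance estimates value / exposure_time."""
    weights = [weight_fn(z) for z in values]
    total = sum(weights)
    if total == 0.0:
        # Every exposure is clipped: fall back to the middle frame (assumed rule).
        mid = len(values) // 2
        return values[mid] / exposure_times[mid]
    return sum(w * z / t for w, z, t in zip(weights, values, exposure_times)) / total


# A pixel with true radiance 0.5, captured at three exposure times;
# the longest exposure saturates at 1.0 and is ignored by the weights.
radiance = fuse_pixel([0.25, 0.5, 1.0], [0.5, 1.0, 2.0])
```

Because each output pixel is an explicit combination of bracket frames, it is easy to inspect which exposure contributed where, which is the interpretability argument the abstract makes for reconstructing the exposure stack instead of predicting the HDR image directly.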

2605.11625 2026-05-13 cs.AI

Nice Fold or Hero Call: Learning Budget-Efficient Thinking for Adaptive Reasoning

Zhaomeng Zhou, Lan Zhang, Junyang Wang, Mu Yuan, Junda Lin

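Affiliations: University of Science and Technology of China; The Chinese University of Hong Kong

AI summary: This paper studies how large reasoning models can reason adaptively and more efficiently under limited compute. The authors propose Budget-Efficient Thinking (BET), a two-stage framework that combines a behavioral cold start with an investment-cost-aware reward so that the model allocates its compute budget by the expected return of reasoning rather than by problem difficulty. BET teaches the model to answer easy problems quickly, abandon unsolvable ones early, and reserve enough compute for hard-but-solvable ones, substantially reducing reasoning cost and improving overall performance across multiple benchmarks.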

Comments 24 pages, 6 figures, 11 tables

Abstract

Large reasoning models (LRMs) improve problem solving through extended reasoning, but often misallocate test-time compute. Existing efficiency methods reduce cost by compressing reasoning traces or conditioning budget on perceived difficulty, yet largely overlook solvability. As a result, they may spend large budgets on queries beyond the model's capability while compressing hard-but-solvable queries that require deeper reasoning. In this work, we formulate adaptive reasoning as a computational investment under uncertainty, where budget should follow the expected return of reasoning rather than perceived difficulty alone. To instantiate this principle, we propose Budget-Efficient Thinking (BET), a two-stage framework that combines behavioral cold-start with GRPO under an investment-cost-aware reward. By aligning solve-or-fold decisions with rollout-derived solvability, BET learns three behaviors: (1) short solve, answering easy queries concisely; (2) nice fold, abstaining early when continued reasoning has near-zero expected return; and (3) hero call, preserving sufficient compute for hard-but-solvable queries. Across seven benchmarks and three base models, BET reduces reasoning tokens by ~55% on average while achieving overall performance improvements, and transfers zero-shot from mathematical reasoning to scientific QA and logical reasoning with comparable efficiency gains.

2605.11622 2026-05-13 cs.CV

RNA-FM: Flow-Matching Generative Model for Genome-wide RNA-Seq Prediction

Yaxuan Song, Jianan Fan, Tianyi Wang, Qiuyue Hu, Hang Chang, Heng Huang, Weidong Cai

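Affiliations: School of Computer Science, The University of Sydney, Australia; Engineering Division, Lawrence Berkeley National Lab, USA; Berkeley Biomedical Data Science Center, Lawrence Berkeley National Lab, USA; Department of Computer Science, University of Maryland College Park, USA

AI summary: This paper proposes RNA-FM, a generative model that predicts genome-wide RNA sequencing (RNA-seq) data from histopathology whole-slide images (WSIs). It formulates transcriptomic prediction as a continuous-time conditional transport problem, learning a morphology-conditioned velocity field that maps a simple prior distribution to the target gene expression distribution and thereby better captures biological heterogeneity and predictive uncertainty. By incorporating pathway-level structure, RNA-FM enables scalable and biologically interpretable genome-wide gene expression imputation, and experiments show that it outperforms existing methods in both performance and biological meaningfulness.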

Comments 15 pages, 13 tables, 3 figures. Accepted by the Forty-Third International Conference on Machine Learning (ICML2026). Code is available at https://github.com/YXSong000/RNA-FM

Abstract

Histopathology whole-slide images (WSIs) are routinely acquired in clinical practice and contain rich tissue morphology but lack direct molecular architecture and functional programs defining pathological states, whereas RNA sequencing (RNA-seq) provides genome-wide transcriptional profiles at substantial cost, thereby motivating WSI-based genome-wide transcriptomic prediction. Existing approaches for predicting gene expression from WSIs predominantly rely on deterministic regression with one-to-one mapping, limiting their ability to capture biological heterogeneity and predictive uncertainty. We propose RNA-FM, a flow-matching generative framework for genome-wide bulk RNA-seq prediction from WSIs. RNA-FM formulates transcriptomic prediction as a continuous-time conditional transport problem, learning a velocity field that maps a simple prior to the target gene expression distribution conditioned on morphologies. By integrating pathway-level structure, RNA-FM enables scalable and biologically interpretable genome-wide gene expression imputation. Extensive experiments demonstrate that RNA-FM consistently outperforms state-of-the-art approaches while maintaining biological meaningfulness. Code is available at https://github.com/YXSong000/RNA-FM.
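
The idea of learning a velocity field that transports a prior sample to a target can be sketched in a few lines. The example below assumes the common linear interpolation path, for which the regression target is the constant velocity x1 - x0, and it integrates a velocity field with Euler steps. The abstract does not state which path RNA-FM uses, the conditioning on morphology is omitted, and the true velocity stands in for a trained network, so every name here is illustrative.

```python
def interpolate(x0, x1, t):
    """Point on the straight path from the prior sample x0 to the target x1."""
    return [(1 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Regression target for the velocity field along the linear path."""
    return [b - a for a, b in zip(x0, x1)]


def euler_sample(x0, velocity_fn, steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t = 0 to t = 1."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        v = velocity_fn(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


x0 = [0.0, 0.0]   # sample from a simple prior
x1 = [2.0, -1.0]  # toy "expression" target
v_star = target_velocity(x0, x1)
midpoint = interpolate(x0, x1, 0.5)
generated = euler_sample(x0, lambda x, t: v_star)
```

In training, a network would be fit to `target_velocity` at points drawn from `interpolate`; sampling different prior points then yields different plausible outputs, which is how a generative formulation can express the heterogeneity that a one-to-one regression cannot.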

2605.11618 2026-05-13 cs.RO

Sampling-Based Follow-the-Leader Motion Planning for Manipulator-Mounted Continuum Robots

Chengnan Shentu, Nicholas Baldassini, Oluwagbotemi D. Iseoluwa, Radian Gondokaryono, Jessica Burgner-Kahrs

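Affiliations * University of Toronto

AI summary This paper studies follow-the-leader (FTL) motion planning for manipulator-mounted continuum robots and proposes a sampling-based motion planner that jointly considers robot configuration and base pose. The method computes the base pose directly through a geometric construction, which avoids iterative optimization during online planning and improves efficiency, and it guarantees resolution-complete shape search and convergent tip tracking. Experiments show a 100% success rate and high trajectory accuracy across multiple test scenarios.

Details
Abstract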

Follow-the-leader (FTL) motion exploits the unique morphology of continuum robots (CRs) to navigate confined spaces by having the body retrace the path of the tip. While extensively studied, existing FTL methods typically assume a fixed base or a single degree-of-freedom insertion mechanism, limiting their applicability to practical systems in which CRs are mounted on robotic manipulators with fully actuated SE(3) base pose. This paper presents a sampling-based motion planner for FTL motion of manipulator-mounted CRs that jointly considers robot configuration and base pose. The key idea is to decouple global shape search from base pose determination by computing the base pose through a closed-form geometric construction, thereby avoiding iterative optimization during online planning. The approach supports general forward models and enables efficient planning by shifting the majority of computation offline. We establish theoretical guarantees including resolution complete shape search and converging tip tracking throughout waypoint traversal and interpolation. Experiments on 120 simulated paths over 3 test classes demonstrate 0% tip error and 1.9% mean shape deviation (w.r.t. robot length) at 100% success rate. We validate the practicality of our approach on a 6-DOF tendon-driven CR mounted on a serial manipulator. Code and visualization available at https://continuumroboticslab.github.io/sb-ftl-cr-planner/.

2605.11616 2026-05-13 cs.CV

Grounding by Remembering: Cross-Scene and In-Scene Memory for 3D Functional Affordances

Qirui Wang, Jingyi He, Yining Pan, Xulei Yang, Shijie Li

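Affiliations * TUM; A*STAR

AI summary This work addresses the grounding of 3D functional affordances, i.e., using vision-language models to accurately identify the specific region of an object that supports an interaction, such as a handle or a button. It proposes AFFORDMEM, a framework that uses two memory mechanisms, cross-scene and in-scene, to build a reusable memory bank from source scenes that assists grounding without model fine-tuning or target-scene annotation. Experiments show that the method markedly improves grounding accuracy on the SceneFun3D dataset, validating its effectiveness for fine-grained localization and spatial-relation understanding.

Details
Abstract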

Functional affordance grounding requires more than recognizing an object: an agent must localize the specific region that supports an interaction, such as the handle to pull or the button to press. This is difficult for training-free vision-language pipelines because actionable regions are often small, visually ambiguous, and repeated across multiple same-category instances in a scene. We propose AFFORDMEM, a framework that grounds 3D functional affordances by remembering geometry at two levels. The first is cross-scene affordance memory: the agent maintains a category-level memory bank of RGB images with affordance regions rendered as overlays, and recalls the most informative examples at query time to guide a frozen VLM toward small operable subregions that text-only prompting consistently misses. The second is in-scene spatial memory: as the agent processes the scene, it organizes candidate instances and their 3D spatial relations into a structured scene graph, enabling the language model to resolve references over distant or currently unobserved candidates such as "the second handle from the top." AFFORDMEM requires no model fine-tuning and no target-scene annotation, using a reusable memory bank built from source scenes. On SceneFun3D, our method improves AP50 over the prior training-free state of the art by 3.23 on Split 0 and 3.7 on Split 1. Ablation studies support complementary benefits: cross-scene affordance memory improves fine-grained localization, while in-scene spatial memory provides the larger gain on spatially qualified queries. A project homepage is available.

2605.11613 2026-05-13 cs.LG cs.AI

From Generic Correlation to Input-Specific Credit in On-Policy Self Distillation

Guobin Shen, Lei Huang, Xiang Cheng, Chenxiao Zhao, Jindong Li, Dongcheng Zhao, Xing Yu

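Affiliations * Xiaohongshu Inc.; Institute of Automation, Chinese Academy of Sciences

AI summary This paper studies how, when self-distillation is used in policy optimization, to move from generic correlation to input-specific reward assignment. The authors show that the standard self-distillation reward is essentially the pointwise mutual information (pMI) between the response and the feedback, and they further decompose it into an input-specific part and a generic-shortcut part. Building on this, they propose CREDIT, which isolates the input-specific reward component through contrastive learning and improves performance across multiple tasks at minimal computational overhead.

Details
Abstract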

On-policy self-distillation has emerged as a promising paradigm for post-training language models, in which the model conditions on environment feedback to serve as its own teacher, providing dense token-level rewards without external teacher models or step-level annotations. Despite its empirical success, what this reward actually measures and what kind of credit it assigns remain unclear. Under a posterior-compatibility interpretation of feedback conditioning, standard in the implicit-reward literature, we show that the self-distillation token reward is a Bayesian filtering increment whose trajectory sum is exactly the pointwise mutual information between the response and the feedback given the input. This pMI can be raised by input-specific reasoning or by input-generic shortcuts, so we further decompose the teacher log-probability along the input axis. Based on this analysis, we propose CREDIT (Contrastive REward from DIsTillation), which isolates the input-specific component with a batch-contrastive baseline. At the sequence level, CREDIT is a teacher-side surrogate for a contrastive pMI objective that also penalizes responses remaining likely under unrelated inputs. Across coding, scientific reasoning, and tool-use benchmarks on two model families, CREDIT delivers the strongest aggregate performance at negligible additional compute.
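
The claim that the token rewards sum to a pointwise mutual information can be sketched as a telescoping identity. The block below assumes the token reward is the log-ratio between the feedback-conditioned teacher and the unconditioned model, and that the teacher is a valid posterior of a joint distribution (the posterior-compatibility reading); the symbols x (input), f (feedback), y (response), and r_t are notation introduced here, not taken from the paper.

```latex
% Assumed token reward: teacher log-probability minus unconditioned log-probability.
\begin{align}
r_t &= \log \pi(y_t \mid x, f, y_{<t}) - \log \pi(y_t \mid x, y_{<t}), \\
\sum_{t=1}^{T} r_t
  &= \log \pi(y \mid x, f) - \log \pi(y \mid x)
   && \text{(chain rule over tokens)} \\
  &= \log \frac{\pi(y, f \mid x)}{\pi(y \mid x)\, \pi(f \mid x)}
   = \mathrm{pMI}(y; f \mid x)
   && \text{(Bayes' rule).}
\end{align}
% Equivalent filtering form of each increment, again by Bayes' rule:
% r_t = \log p(f \mid x, y_{\le t}) - \log p(f \mid x, y_{<t}).
```

Read this way, a response scores highly whenever it makes the feedback more predictable, whether it does so through input-specific reasoning or through a shortcut that works for any input, which is the ambiguity the abstract says CREDIT is designed to separate.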

2605.11612 2026-05-13 cs.CL cs.AI

When Emotion Becomes Trigger: Emotion-style dynamic Backdoor Attack Parasitising Large Language Models

Ziyu Liu, Tao Li, Tianjie Ni, Xiaolong Lan, Wengang Ma, Tao Yang, Guohua Wang, Junjiang He

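Affiliations * School of Cyber Science and Engineering, Sichuan University; School of Computer Science, China West Normal University; School of Electronic and Information Engineering, Lanzhou Jiaotong University

AI summary This work proposes Paraesthesia, a new backdoor attack on large language models that uses emotion as a dynamic trigger to attack the model stealthily. Unlike conventional backdoor attacks based on fixed trigger words, Paraesthesia exploits the fact that emotional styles form separate clusters in the model's representation space: it embeds emotion as the trigger signal in the training data so that the model generates a predefined malicious output when it encounters inputs with a specific emotion at inference time. Experiments show attack success rates of around 99% across multiple tasks and models while preserving the model's normal functionality.

Details
Abstract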

Backdoor vulnerabilities are widespread in the fine-tuning of large language models (LLMs). Most backdoor poisoning methods operate mainly at the token level and lack deeper semantic manipulation, which limits stealthiness. In addition, prior attacks rely on a single fixed trigger to induce harmful outputs. Such static triggers are easy to detect, and clean fine-tuning can weaken the trigger-target association. Through causal validation, we observe that emotion is not directly linked to individual words, but functions as an overall stylistic factor through tone. In the representation space of an LLM, emotion can be decoupled from semantics, forming a distinct cluster separate from the original neutral text. Therefore, we use the emotional factor as the backdoor trigger and propose a parasitic emotion-style dynamic backdoor attack, Paraesthesia. By mixing samples with the emotional trigger into clean data and then fine-tuning the model, the model is able to generate the predefined attack response when encountering emotional inputs during the inference stage. Paraesthesia includes two parts: the quantification and the rewriting of emotional styles. We evaluate the effectiveness of our method on instruction-following generation and classification tasks. The experimental results show that Paraesthesia achieves an attack success rate of around 99% across both task types and four different models, while maintaining the clean utility of the models.

2605.11609 2026-05-13 cs.LG cs.AI cs.CL

Anti-Self-Distillation for Reasoning RL via Pointwise Mutual Information

Guobin Shen, Xiang Cheng, Chenxiao Zhao, Lei Huang, Jindong Li, Dongcheng Zhao, Xing Yu

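Affiliations * Xiaohongshu Inc.; Institute of Automation, Chinese Academy of Sciences

AI summary This work addresses the poor performance of on-policy self-distillation on math reasoning tasks and proposes a new method, Anti-Self-Distillation (AntiSD). A pointwise mutual information analysis finds that the privileged context makes the teacher overconfident on tokens already implied by the solution while it neglects the deliberation steps that drive the reasoning process. AntiSD maximizes the divergence between the student and teacher distributions, which reverses the gradient direction of conventional self-distillation and improves reasoning ability more effectively. Experiments show that the method substantially reduces training steps and improves reasoning accuracy on several large language models.

Details
Abstract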

On-policy self-distillation, where a student is pulled toward a copy of itself conditioned on privileged context (e.g., a verified solution or feedback), offers a promising direction for advancing reasoning capability without a stronger external teacher. Yet in math reasoning the gains are inconsistent, even when the same approach succeeds elsewhere. A pointwise mutual information analysis traces the failure to the privileged context itself: it inflates the teacher's confidence on tokens already implied by the solution (structural connectives, verifiable claims) and deflates it on deliberation tokens ("Wait", "Let", "Maybe") that drive multi-step search. We propose Anti-Self-Distillation (AntiSD), which ascends a divergence between student and teacher rather than descending it: this reverses the per-token sign and yields a naturally bounded advantage in one step. An entropy-triggered gate disables the term once the teacher entropy collapses, completing a drop-in replacement for default self-distillation. Across five models from 4B to 30B parameters on math reasoning benchmarks, AntiSD reaches the GRPO baseline's accuracy in 2 to 10x fewer training steps and improves final accuracy by up to 11.5 points. AntiSD opens a path to scalable self-improvement, where a language model bootstraps its own reasoning through its training signal.

2605.11608 2026-05-13 cs.CL cs.AI cs.LG

PRISM: A Geometric Risk Bound that Decomposes Drift into Scale, Shape, and Head

Chieh-Yen Lin, Shao-Hua Sun

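Affiliations * Appier AI Research; National Taiwan University

AI summary PRISM is a geometric risk bound for analyzing representation drift in post-training LLM variants (such as quantized, LoRA-adapted, and distilled models); it decomposes drift into three independently measurable axes: scale, shape, and output head. The method exploits the model's linear output head and near-isometric backbone structure to derive an upper bound on the cross-entropy risk gap between the target model and a variant, so it identifies not only whether performance has degraded but also the specific cause. Experiments show that PRISM ranks variants well on several benchmarks and that its shape regularizer outperforms conventional methods such as experience replay at preventing catastrophic forgetting.

Details
Abstract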

Comparing post-training LLM variants, such as quantized, LoRA-adapted, and distilled models, requires a diagnostic that identifies how a variant has drifted, not only whether it has degraded. Existing similarity scores such as CKA and SVCCA can flag degradation, but they do not directly link representation drift to risk or mechanism. We propose PRISM, Proxy Risk Inference via Structural Mapping, which exploits the linear output head of LLMs and the empirically near-isometric structure of their backbones to derive a closed-form upper bound on the cross-entropy risk gap between a target model and a post-training variant. The bound is calibrated for variant ranking and decomposes drift into three independently measurable axes: scale mismatch, shape mismatch, and head divergence. Each axis corresponds to a distinct failure mode, including shape distortion under low-bit quantization, scale separability under LoRA forgetting, and head divergence under GGUF k-quantization. As a result, the dominant axis suggests a remediation direction rather than merely raising a degradation flag. Because the shape term is differentiable, the same geometry can also serve as a training-time regularizer against catastrophic forgetting. Across two model families and five benchmarks, PRISM ranks variants with mean Spearman correlations of 0.820 for post-training quantization and 0.831 for LoRA forgetting, and its axis-guided shape regularizer outperforms experience replay in aggregate at mitigating downstream forgetting.
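
The ranking results above are reported as Spearman correlations. As a minimal sketch of what that metric measures, the Python below computes Spearman's rho between a bound value per variant and a measured risk gap per variant. The function names and all numbers are hypothetical illustrations, not values or code from the paper.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(xs)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical example: one bound value and one measured risk gap per variant.
bound_values = [0.10, 0.40, 0.20, 0.90]
measured_gaps = [0.05, 0.30, 0.35, 0.80]
rho = spearman(bound_values, measured_gaps)
```

A rho near 1 means the bound orders the variants almost exactly as the measured degradation does, even if the bound's absolute values are loose, which is the sense in which a bound can be "calibrated for variant ranking".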

2605.11605 2026-05-13 cs.CV cs.AI

Keep What Audio Cannot Say: Context-Preserving Token Pruning for Omni-LLMs

Chaeyoung Jung, Kyeongha Rho, Joon Son Chung

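Affiliations * Korea Advanced Institute of Science and Technology

AI summary Omnimodal large language models (Omni-LLMs) incur high computational overhead when processing multimodal inputs, so effective token reduction is needed. This paper proposes ContextGuard, an inference-time token pruning framework that preserves broad audio-visual context while removing cross-modal redundancy, reducing the number of input tokens while maintaining performance. The method predicts coarse visual semantics from audio, prunes video tokens that are recoverable from audio, retains tokens that carry localized visual details audio cannot express, and merges temporally similar video tokens for further compression. Experiments show that ContextGuard outperforms existing methods on multiple benchmarks and achieves a high pruning ratio and strong performance without fine-tuning the downstream model.

Details
Abstract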

Omnimodal Large Language Models (Omni-LLMs) incur substantial computational overhead due to the large number of multimodal input tokens they process, making token reduction essential for real-world deployment. Existing Omni-LLM pruning methods typically reduce this cost by selecting tokens that are important for the current query or strongly aligned with cross-modal cues. However, such strategies can discard evidence that falls outside these criteria, even when needed for different questions or for understanding context beyond aligned audio-visual cues. To address this limitation, we reframe Omni-LLM token reduction as preserving broad audio-visual context while removing cross-modal redundancy. We propose ContextGuard, an inference-time token pruning framework built on this principle. ContextGuard predicts coarse visual semantics from audio and prunes video tokens whose coarse semantics are likely recoverable from audio, while retaining additional video tokens to preserve localized visual details that audio alone cannot specify. For further compression, our method merges temporally similar video tokens. The framework requires no downstream LLM fine-tuning and uses only an independently trained lightweight predictor. On Qwen2.5-Omni and Video-SALMONN2+ at 3B and 7B scales across six audio-visual benchmarks, ContextGuard outperforms prior inference-time pruning methods while pruning more tokens. Notably, on Qwen2.5-Omni 7B, ContextGuard achieves full-token-level performance on five of six benchmarks while pruning 55% of input tokens.
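
The step "merges temporally similar video tokens" can be illustrated with a small sketch. The Python below greedily averages consecutive token vectors whose cosine similarity to the running group mean exceeds a threshold. The greedy grouping rule, the threshold value, and the function names are assumptions made for illustration; the abstract does not describe ContextGuard's actual merging rule.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def mean_vector(group):
    """Element-wise mean of a non-empty list of vectors."""
    return [sum(column) / len(group) for column in zip(*group)]


def merge_temporal(tokens, threshold=0.9):
    """Merge runs of consecutive, similar tokens into their mean vector."""
    merged, group = [], []
    for token in tokens:
        if group and cosine(mean_vector(group), token) >= threshold:
            group.append(token)  # similar enough: extend the current run
            continue
        if group:
            merged.append(mean_vector(group))  # close the previous run
        group = [token]
    if group:
        merged.append(mean_vector(group))
    return merged


# Hypothetical per-frame token embeddings: two near-duplicates, then a change.
frames = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0]]
compressed = merge_temporal(frames)
```

Here the first two frames collapse into one token and the third survives on its own, so static stretches of video cost fewer tokens while scene changes are kept.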

2605.11603 2026-05-13 cs.AI

GAR: Carbon-Aware Routing for LLM Inference via Constrained Optimization

Disha Sheshanarayana, Rajat Subhra Pal, Manjira Sinha, Tirthankar Dasgupta

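Affiliations * Manipal University Jaipur; TCS Research

AI summary As deployments of large language models (LLMs) scale up, balancing response quality against computational cost across heterogeneous model pools becomes a key problem. This paper proposes Green-Aware Routing (GAR), a constrained-optimization framework that minimizes per-request carbon emissions subject to accuracy and latency constraints. GAR makes real-time routing decisions through adaptive constraint optimization and lightweight estimators, and it combines an online algorithm with heuristic variants to reduce the carbon footprint while maintaining model performance, offering theoretical grounding and a practical approach for sustainable LLM inference.

Details
Abstract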

The growing deployment of large language models (LLMs) makes per-request routing essential for balancing response quality and computational cost across heterogeneous model pools. Current routing methods rarely consider sustainable energy use and CO2 emissions as optimization objectives, despite grid carbon intensity varying by time and region, and models differing significantly in energy consumption. To address this gap, we introduce Green-Aware Routing (GAR), a constrained multi-objective optimization framework that minimizes per-request CO2 emissions subject to explicit accuracy floors and p95-latency service-level objectives (SLOs). GAR employs adaptive constraint optimization through per-dataset floor tuning and incorporates lightweight estimators for correctness, tail latency, and carbon emissions, enabling real-time routing decisions without additional inference passes. We present GAR-PD, a practical online primal-dual routing algorithm for rolling carbon budgets, alongside heuristic variants that achieve high feasibility coverage while limiting accuracy degradation. Comprehensive experiments across standard NLP benchmarks with heterogeneous LLM pools (7B-70B) demonstrate that GAR achieves substantial carbon reductions while maintaining competitive accuracy and p95 latency guarantees, providing a practical, theoretically grounded approach to sustainable LLM inference.
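
The per-request decision described above (minimize CO2 subject to an accuracy floor and a p95-latency SLO) can be sketched as a filter followed by an argmin. The Python below is a simplified illustration only: the `Estimate` fields, the fallback policy, and every number are hypothetical, and it does not implement the paper's GAR-PD primal-dual algorithm or its estimators.

```python
from dataclasses import dataclass


@dataclass
class Estimate:
    """Hypothetical per-request estimates for one candidate model."""
    model: str
    accuracy: float       # estimated probability of a correct answer
    p95_latency_s: float  # estimated tail latency in seconds
    co2_g: float          # estimated emissions in grams of CO2


def route(estimates, accuracy_floor, latency_slo_s):
    """Pick the lowest-emission model that satisfies both constraints."""
    feasible = [
        e for e in estimates
        if e.accuracy >= accuracy_floor and e.p95_latency_s <= latency_slo_s
    ]
    if not feasible:
        # Assumed fallback when no model is feasible: prefer accuracy.
        return max(estimates, key=lambda e: e.accuracy)
    return min(feasible, key=lambda e: e.co2_g)


# Hypothetical pool; all values are made up for illustration.
pool = [
    Estimate("small-7b", accuracy=0.70, p95_latency_s=0.5, co2_g=0.2),
    Estimate("mid-13b", accuracy=0.82, p95_latency_s=1.0, co2_g=0.6),
    Estimate("large-70b", accuracy=0.90, p95_latency_s=2.5, co2_g=3.0),
]
choice = route(pool, accuracy_floor=0.80, latency_slo_s=2.0)
```

With these numbers the mid-sized model is chosen: the small model misses the accuracy floor and the large one misses the latency SLO. Because grid carbon intensity varies by time and region, the `co2_g` estimates would change between requests, so the same constraints can select different models at different times.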

2605.11601 2026-05-13 cs.CL cs.AI

DiffScore: Text Evaluation Beyond Autoregressive Likelihood

Wen Lai, Yingli Shen, Dingnan Jin, Qing Cui, Jun Zhou, Maosong Sun, Alexander Fraser

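Affiliations * Ant Group; Tsinghua University; Technical University of Munich

AI summary This paper proposes DiffScore, a text evaluation method that aims to overcome the limitations caused by positional bias when autoregressive language models are used for text evaluation. DiffScore builds on masked large diffusion language models and scores every token with full bidirectional context, which eliminates positional bias and establishes an evaluation hierarchy from local fluency to global coherence. The method also introduces diagnostic tools such as multi-timestep quality profiles and bidirectional PMI decomposition; experiments show that it outperforms conventional autoregressive models on multiple benchmarks.

Details
Abstract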

Autoregressive language models are widely used for text evaluation; however, their left-to-right factorization introduces positional bias, i.e., early tokens are scored with only leftward context, conflating architectural asymmetry with true text quality. We propose masked reconstruction as an alternative paradigm, where every token is scored using full bidirectional context. We introduce DiffScore, an evaluation framework built on Masked Large Diffusion Language Models. By measuring text recoverability across continuous masking rates, DiffScore eliminates positional bias and naturally establishes an evaluation hierarchy from local fluency to global coherence. We further provide diagnostic tools unavailable to autoregressive frameworks: multi-timestep quality profiles that decompose scores across masking rates, and bidirectional PMI decomposition that disentangles fluency from faithfulness. Experiments across ten benchmarks show that DiffScore consistently outperforms autoregressive baselines in both zero-shot and fine-tuned settings. The code is released at: https://github.com/wenlai-lavine/DiffScore.

2605.11598 2026-05-13 cs.LG cs.AI cs.DB q-bio.QM

EpiCastBench: Datasets and Benchmarks for Multivariate Epidemic Forecasting

Madhurima Panja, Danny D'Agostino, Huitao Li, Tanujit Chakraborty, Nan Liu

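Affiliations * Sorbonne University Abu Dhabi; Duke-NUS Medical School, Singapore

AI summary With the wide adoption of data-driven methods in public health decision-making, epidemic forecasting has become an important research area. To address the lack of high-quality multivariate forecasting benchmarks, this paper presents EpiCastBench, a large-scale benchmarking framework of 40 curated multivariate epidemic datasets that span many infectious diseases and geographical regions and vary in temporal granularity, series length, and sparsity. The study systematically compares 15 multivariate forecasting models under a unified evaluation setting; all data and code are publicly available, which supports the development and validation of epidemic forecasting methods.

Details
Abstract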

The increasing adoption of data-driven decision-making in public health has established epidemic forecasting as a critical area of research. Recent advances in multivariate forecasting models better capture complex temporal dependencies than conventional univariate approaches, which model individual series independently. Despite this potential, the development of robust epidemic forecasting methods is constrained by the lack of high-quality benchmarks comprising diverse multivariate datasets across infectious diseases and geographical regions. To address this gap, we present EpiCastBench, a large-scale benchmarking framework featuring 40 curated (correlated) multivariate epidemic datasets. These publicly available datasets span a wide range of infectious diseases and exhibit diverse characteristics in terms of temporal granularity, series length, and sparsity. We analyze these datasets to identify their global features and structural patterns. To ensure reproducibility and fair comparison, we establish standardized evaluation settings, including a unified forecasting horizon, consistent preprocessing pipelines, diverse performance metrics, and statistical significance testing. By leveraging this framework, we conduct a comprehensive evaluation of 15 multivariate forecasting models spanning statistical baselines to state-of-the-art deep learning and foundation models. All datasets and code are publicly available on Kaggle (https://www.kaggle.com/datasets/aimltsf/epicastbench) and GitHub (https://github.com/aimltsf/EpiCastBench).

2605.11595 2026-05-13 cs.AI

Native Explainability for Bayesian Confidence Propagation Neural Networks: A Framework for Trusted Brain-Like AI

Georgios Makridis, Georgios Fatouros, John Soldatos, George Katsis, Dimosthenis Kyriazis

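Affiliations * CC BY-NC-SA 4.0

AI summary In response to the transparency and trustworthiness requirements that the EU Artificial Intelligence Act places on high-risk AI systems, this paper proposes a native explainability framework for Bayesian Confidence Propagation Neural Networks (BCPNN). The framework establishes a BCPNN-specific explainability taxonomy and sixteen architecture-level explanation primitives to explain the model's decisions systematically, and it introduces five configuration-level explanation primitives to support pre-deployment auditing. The work provides theoretical support for trustworthy BCPNN deployment on edge devices and advances the use of brain-like AI in the industrial IoT.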

Comments 8 pages

Details
Abstract

The EU Artificial Intelligence Act (Regulation 2024/1689), fully applicable to high-risk systems from August 2026, creates urgent demand for AI architectures that are simultaneously trustworthy, transparent, and feasible to deploy on resource-constrained edge devices. Brain-like neural networks built on the Bayesian Confidence Propagation Neural Network (BCPNN) formalism have re-emerged as a credible alternative to backpropagation-driven deep learning. They deliver state-of-the-art unsupervised representation learning, neuromorphic-friendly sparsity, and existing FPGA implementations that target edge deployment. Despite this momentum, no systematic framework exists for explaining BCPNN decisions -- a gap the present paper fills. We argue that BCPNN is, in the sense of Rudin's interpretable-by-design agenda, an inherently transparent model whose architectural primitives map directly onto established explainable-AI (XAI) families. We make four contributions. First, we propose the first XAI taxonomy for BCPNN. It maps weights, biases, hypercolumn posteriors, structural-plasticity usage scores, attractor dynamics, and input-reconstruction populations onto attribution, prototype, concept, counterfactual, and mechanistic explanation modalities. Second, we introduce sixteen architecture-level explanation primitives (P1--P16), several without analogue in standard ANNs. We provide closed-form algorithms for computing each from quantities the model already maintains. Third, we introduce five design-time Configuration-as-Explanation primitives (Config-P1 to Config-P5) that treat BCPNN hyperparameter choices as an auditable pre-deployment explanation artifact. Fourth, we sketch a roadmap for integration into industrial IoT deployments and discuss EU AI Act alignment, edge feasibility, and Industry 5.0 implications.

2605.11594 2026-05-13 cs.CV

PointForward: Feedforward Driving Reconstruction through Point-Aligned Representations

Cheng Chi, Xianqi Wang, Hongcheng Luo, Mingfei Tu, Gangwei Xu, Zehan Zhang, Bing Wang, Guang Chen, Hangjun Ye, Sida Peng, Xin Yang, Haiyang Sun

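Affiliations * Xiaomi EV; Huazhong University of Science and Technology; Zhejiang University

AI summary This paper proposes PointForward, a feedforward driving-scene reconstruction framework that uses point-aligned representations to address the shortcomings of existing methods in multi-view consistency and dynamic-instance modeling. The method initializes sparse 3D queries in the world coordinate frame and aggregates multi-view image information through spatial-temporal fusion, achieving explicit cross-view consistency in a single feedforward pass. In addition, it introduces scene graphs that explicitly organize dynamic instances and uses 3D bounding boxes for instance-level motion propagation, which yields temporally consistent dynamic reconstructions. Experiments show that PointForward achieves state-of-the-art performance on large-scale driving datasets.

Details
Abstract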

High-fidelity reconstruction of driving scenes is crucial for autonomous driving. While recent feedforward 3D Gaussian Splatting (3DGS) methods enable fast reconstruction, their per-pixel Gaussian prediction paradigm often suffers from multi-view inconsistency and layering artifacts. Moreover, existing methods often model dynamic instances via dense flow prediction, which lacks explicit cross-view correspondence and instance-level consistency. In this paper, we propose PointForward, a feedforward driving reconstruction framework through point-aligned representations. Unlike pixel-aligned methods, we initialize sparse 3D queries in world space and aggregate multi-view image information via spatial-temporal fusion onto these queries, enforcing explicit cross-view consistency in a single feedforward pass. To handle scene dynamics, we introduce scene graphs that explicitly organize moving instances during reconstruction. By leveraging 3D bounding boxes, our method enables instance-level motion propagation and temporally consistent dynamic representations. Extensive experiments demonstrate that PointForward achieves state-of-the-art performance on large-scale driving benchmarks. The code will be available upon the publication of the paper.

2605.11592 2026-05-13 cs.LG cs.AI cs.CR

SoK: Unlearnability and Unlearning for Model Dememorization

Mengying Zhang, Derui Wang, Ruoxi Sun, Xiaoyu Xia, Shuang Hao, Minhui Xue

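Affiliations * RMIT University; University of Texas at Dallas; CSIRO and Adelaide University

AI summary This paper systematically studies two key techniques for model dememorization in machine learning, unlearnability and machine unlearning, which aim to prevent misuse of sensitive data. It reveals the commonalities and weaknesses of the two approaches with respect to shallow dememorization, their mutual influence, and theoretical guarantees, and it presents the first unified taxonomy, empirical analysis, and theoretical guarantee, providing a theoretical basis and practical guidance for achieving deeper dememorization.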

Comments The first two authors contributed equally

Details
Abstract

Advanced model dememorization methods, including availability poisoning (unlearnability) and machine unlearning, are emerging as key safeguards against data misuse in machine learning (ML). At the training stage, unlearnability embeds imperceptible perturbations into data before release to reduce learnability. At the post-training stage, unlearning removes previously acquired information from models to prevent unauthorized disclosure or use. While both defenses aim to preserve the right to withhold knowledge, their vulnerabilities and shared foundations remain unclear. Specifically, both unlearnability and unlearning suffer from issues such as shallow dememorization, leading to falsely claimed data learnability reduction or forgetting in the presence of weight perturbations. Moreover, input perturbations may affect the effectiveness of downstream unlearning, while unlearning may inadvertently recover domain knowledge hidden by unlearnability. This interplay calls for deeper investigation. Finally, there is a lack of formal guarantees to provide theoretical insights into current defenses against shallow dememorization. In this Systematization of Knowledge, we present the first integrated analysis of model dememorization approaches leveraging unlearnability and unlearning. Our contributions are threefold: (i) a unified taxonomy of unlearnability and scalable unlearning methods; (ii) an empirical evaluation revealing the robustness, interplay, and shallow dememorization of leading methods; and (iii) the first theoretical guarantee on dememorization depth for models processed through certified unlearning. These results lay the foundation for unifying dememorization mechanisms across the ML lifecycle to achieve a deeper immemor state for sensitive knowledge.