arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.13428 2026-05-14 cs.RO

SID: Sliding into Distribution for Robust Few-Demonstration Manipulation

Yicheng Ma, Wei Yu, Zhian Su, Xidan Zhang, Huixu Dong

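Affiliations: Grasp Lab, Zhejiang University; Torch Kernel Co., Ltd.

AI summary: This paper proposes SID, a framework for robust robotic manipulation from only a few demonstrations. SID learns an object-centric motion field that iteratively guides the system toward the demonstrated manifold and into the reliable operating region of a lightweight egocentric execution policy, which mitigates out-of-distribution execution. Across several real-world tasks, it reaches roughly 90% success under out-of-distribution initial conditions with only two demonstrations, offering a new paradigm for few-shot manipulation.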

Comments 20 pages, 14 figures. Project website: https://sliding-into-distribution.github.io/

Abstract

Generalizing robotic manipulation across object poses, viewpoints, and dynamic disturbances is difficult, especially with only a few demonstrations. End-to-end visuomotor policies are expressive but data-hungry, while planning and optimization satisfy explicit constraints but do not directly capture the interaction strategies demonstrated by humans. We propose Sliding into Distribution (SID), a structured framework that learns an object-centric motion field from canonicalized demonstrations to iteratively slide the system toward the demonstrated manifold and into the reliable operating region of a lightweight egocentric execution policy, mitigating out-of-distribution (OOD) execution. The motion field provides large corrective motions when far from the demonstration manifold and naturally vanishes near convergence, enabling robust reaching under substantial pose and viewpoint shifts. Within the reached regime, an egocentric policy trained with conditioned flow matching performs task-specific manipulation, supported by kinematically consistent point-cloud reprojection augmentation that preserves action-observation consistency. Across six real-world tasks, SID achieves approximately 90% success under OOD initializations with only two demonstrations, with under a 10% drop under distractors and external disturbances. Overall, SID provides a new paradigm for few-shot manipulation: explicitly managing distribution shift via online distribution recovery.

2605.13424 2026-05-14 cs.LG cs.CL

LIFT: Last-Mile Fine-Tuning for Table Explicitation

Divij Khaitan, Ashish Tiwari

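Affiliations: Microsoft Corporation

AI summary: This paper proposes LIFT (Last-Mile Fine-Tuning), a fine-tuning approach for extracting tables from unstructured clipboard text and repairing extraction errors. It combines a pre-trained large language model with a fine-tuned small language model (1B-24B parameters), improving robustness to input format variation without sacrificing accuracy, and it outperforms end-to-end fine-tuning with as few as 1,000 training examples. The study indicates that LIFT is more efficient and more adaptable on table extraction.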

Comments 9 pages, 1 figure, 3 tables

Abstract

We propose last-mile fine-tuning, or Lift, a pipeline in which a pre-trained large language model extracts an initial table from unstructured clipboard text, and a fine-tuned small language model (SLM, 1B-24B parameters) repairs errors in the extracted table. On a benchmark of 2,596 tables from three datasets, Lift matches or exceeds end-to-end SLM fine-tuning on the tree-edit-distance-based similarity (TEDS) metric while requiring as few as 1,000 training examples, where it outperforms end-to-end fine-tuning by up to 0.144 TEDS points. We term this approach last-mile fine-tuning and show it is also more robust to input format variability. Comparisons with self-debug and end-to-end fine-tuning approaches show that last-mile fine-tuning provides an attractive option when training data is limited or when robustness to input variation is sought without compromising on accuracy.

2605.13418 2026-05-14 cs.LG

DP-KFC: Data-Free Preconditioning for Privacy-Preserving Deep Learning

Marc Molina Van den Bosch, Riccardo Taiello, Albert Sund Aillet, Andrea Protani, Miguel Angel Gonzalez Ballester, Luigi Serio

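Affiliations: CERN; Universitat Pompeu Fabra

AI summary: This paper proposes DP-KFC, a data-free preconditioning method that improves optimization in privacy-preserving deep learning. It probes the network with structured synthetic noise to estimate curvature without private or public data, addressing the geometric mismatch between anisotropic loss landscapes and isotropic noise in differentially private optimization. Experiments show that DP-KFC consistently outperforms DP-SGD and other adaptive methods in strong privacy regimes and is promising for data-scarce domains such as medicine.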

Comments Accepted at the International Conference on Machine Learning (ICML 2026). 9 pages main text + appendix, 5 figures, 2 tables. Code: https://github.com/molinamarcvdb/DP-KFC

Abstract

Differentially private optimization suffers from a fundamental geometric mismatch: deep networks have highly anisotropic loss landscapes, yet DP-SGD injects isotropic noise. Second-order preconditioning can resolve this, but estimating curvature typically requires private data (consuming privacy budget) or public data (introducing distribution shift). We show that the Fisher Information Matrix decouples into architectural sensitivity, recoverable via synthetic noise, and input correlations, approximable from modality-specific frequency statistics. We propose DP-KFC, which constructs KFAC preconditioners by probing networks with structured synthetic noise, requiring neither private nor public data. Empirically, DP-KFC consistently outperforms DP-SGD and adaptive baselines across diverse modalities in strong privacy regimes ($\varepsilon \leq 3$). DP-KFC matches private-data preconditioners while public-data variants degrade by up to $4.8\%$, showing that curvature can be estimated without consuming privacy budget or introducing distribution shift. This enables privacy-preserving learning in specialized domains (e.g., medical applications) where regulatory constraints make data scarce.
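
The mismatch described above can be illustrated with a minimal sketch of one DP-SGD aggregation step followed by an optional preconditioner. This is not the DP-KFC algorithm: the function name `dp_step`, its parameters, and the diagonal `precond` vector are hypothetical stand-ins for illustration, whereas the paper builds KFAC preconditioners from synthetic noise. The sketch only shows that the Gaussian noise has the same scale in every coordinate, and that a data-independent preconditioner can be applied afterwards as post-processing.

```python
import math
import random


def dp_step(per_sample_grads, clip_norm, noise_mult, precond=None, rng=random):
    """One DP-SGD aggregation step with an optional diagonal preconditioner.

    Hypothetical helper for illustration only.
    """
    dim = len(per_sample_grads[0])
    total = [0.0] * dim
    for grad in per_sample_grads:
        # Clip each per-sample gradient to L2 norm at most clip_norm.
        norm = math.sqrt(sum(v * v for v in grad))
        factor = min(1.0, clip_norm / norm) if norm > 0 else 1.0
        for d in range(dim):
            total[d] += grad[d] * factor
    # Isotropic Gaussian noise: the same standard deviation in every coordinate.
    sigma = noise_mult * clip_norm
    batch = len(per_sample_grads)
    noisy = [(total[d] + rng.gauss(0.0, sigma)) / batch for d in range(dim)]
    if precond is not None:
        # Assumption: precond does not depend on private data, so rescaling
        # the already-noised gradient is post-processing.
        noisy = [p * v for p, v in zip(precond, noisy)]
    return noisy


grads = [[3.0, 4.0], [0.6, 0.8]]
update = dp_step(grads, clip_norm=1.0, noise_mult=1.0, precond=[2.0, 0.5])
```

Setting `noise_mult=0.0` removes the noise and makes the clipping and rescaling easy to inspect by hand.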

2605.13414 2026-05-14 cs.AI

TRIAGE: Evaluating Prospective Metacognitive Control in LLMs under Resource Constraints

Zabir Al Nazi, Shubhashis Roy Dipta

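Affiliations: University of California, Riverside, USA; University of Maryland, Baltimore County, USA

AI summary: This paper introduces TRIAGE, an evaluation framework for prospective metacognitive control in large language models under resource constraints, that is, how well a model selects, orders, and allocates compute to upcoming tasks. The model receives a task pool and a preset token budget and must commit to a single plan covering task selection, ordering, and allocation; the plan is scored against the model's own solvability and cost on each task to yield a triage efficiency ratio. Experiments show that current mainstream language models fall well short on this capability, exposing a key, previously unmeasured dimension relevant to resource-efficient deployment.

Abstract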

Deploying language models as autonomous agents requires more than per-task accuracy: when an agent faces a queue of problems under a finite token budget, it must decide which to attempt, in what order, and how much compute to commit to each, all before any execution feedback is available. This is the prospective form of metacognitive control studied for decades in human cognition, yet whether language models possess it remains untested. We introduce TRIAGE, an evaluation framework in which a model receives a task pool and a token budget calibrated to its own baseline cost, and commits to a single ordered plan that jointly encodes selection, sequencing, and per-problem allocation. Plans are scored against an oracle with full knowledge of the model's solvability and cost on each problem, yielding a triage efficiency ratio on a common scale. We evaluate frontier and open-source models, with and without reasoning enabled, across competition mathematics, graduate-level science, code generation, and expert multidisciplinary knowledge, and find that current language models exhibit substantial gaps in prospective metacognitive control, revealing a previously unmeasured capability dimension with direct implications for resource-efficient agent deployment.
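
To make the scoring idea concrete, here is a minimal sketch under simplifying assumptions that are not taken from the paper: each task has a known token cost and a solvable flag, a plan is an ordered list of `(task_id, allocated_tokens)` pairs, a task counts as solved when it is solvable and its allocation covers its cost, and the oracle simply takes solvable tasks from cheapest to most expensive. All function names are hypothetical, and the actual TRIAGE metric may be defined differently.

```python
def plan_score(plan, tasks, budget):
    """Count tasks solved by an ordered plan of (task_id, allocated_tokens)."""
    spent, solved = 0, 0
    for task_id, alloc in plan:
        if spent + alloc > budget:
            break  # budget exhausted: the rest of the plan is never executed
        spent += alloc
        cost, solvable = tasks[task_id]
        if solvable and alloc >= cost:
            solved += 1
    return solved


def oracle_score(tasks, budget):
    """Best achievable count when solvability and cost are known in advance."""
    costs = sorted(cost for cost, solvable in tasks.values() if solvable)
    spent, solved = 0, 0
    for cost in costs:  # cheapest-first maximizes the number of solved tasks
        if spent + cost > budget:
            break
        spent += cost
        solved += 1
    return solved


def triage_efficiency(plan, tasks, budget):
    """Ratio of the plan's score to the oracle's score (0.0 if oracle solves none)."""
    best = oracle_score(tasks, budget)
    return plan_score(plan, tasks, budget) / best if best else 0.0


# task_id -> (token cost, solvable by this model)
tasks = {"a": (100, True), "b": (300, True), "c": (50, False), "d": (150, True)}
ratio = triage_efficiency([("b", 300)], tasks, budget=300)
```

In this toy pool, spending the whole budget on the expensive task `b` solves one problem, while the oracle solves two (`a` and `d`), so the plan scores half of what was achievable.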

2605.13412 2026-05-14 cs.CL cs.AI

LLMs as annotators of credibility assessment in Danish asylum decisions: evaluating classification performance and errors beyond aggregated metrics

Galadrielle Humblot-Renaux, Mohammad N. S. Jahromi, Rohat Bakuri-Jørgensen, Marieke Anne Heyl, Asta S. Stage Jarlner, Maria Vlachou, Anna Murphy Høgenhaug, Desmond Elliott, Thomas Gammeltoft-Hansen, Thomas B. Moeslund

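Affiliations: Visual Analysis and Perception Lab; Pioneer Center for AI; Center of Excellence for Global Mobility Law; Department of Computer Science

AI summary: This study examines how well off-the-shelf large language models (LLMs) annotate credibility assessments in Danish asylum decision texts, and what errors they make. It introduces RAB-Cred, a high-quality Danish legal text classification dataset, and systematically evaluates many model and prompt combinations in zero-shot and few-shot settings. The analysis reveals inconsistencies and error patterns among the top-performing models, underlines the limits of relying on a single model's predictions, and concludes that LLM annotators in specialized domains such as law still fall short and should be paired with human judgment and finer-grained evaluation.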

Comments Accepted at the 20th Linguistic Annotation Workshop (LAW XX), co-located with ACL 2026 (https://sigann.github.io/LAW-XX-2026/)

Abstract

Off-the-shelf large language models (LLMs) are increasingly used to automate text annotation, yet their effectiveness remains underexplored for underrepresented languages and specialized domains where the class definition requires subtle expert understanding. We investigate LLM-based annotation for a novel legal NLP task: identifying the presence and sentiment of credibility assessments in asylum decision texts. We introduce RAB-Cred, a Danish text classification dataset featuring high-quality, expert annotations and valuable metadata such as annotator confidence and asylum case outcome. We benchmark 21 open-weight models and 30 system-user prompt combinations for this task, and systematically evaluate the effect of model and prompt choice for zero-shot and few-shot classification. We zoom in on the errors made by top-performing models and prompts, investigating error consistency across LLMs, inter-class confusion, correlation with human confidence and sample-wise difficulty and severity of LLM mistakes. Our results confirm the potential of LLMs for cost-effective labeling of asylum decisions, but highlight the imperfect and inconsistent nature of LLM annotators, and the need to look beyond the predictions of a single, arbitrarily chosen model. The RAB-Cred dataset and code are available at https://github.com/glhr/RAB-Cred

2605.13408 2026-05-14 cs.CL

From Rosetta to Match-Up: A Paired Corpus of Linguistic Puzzles with Human and LLM Benchmarks

Neh Majmudar, Anne Huang, Jinfan Frank Hu, Elena Filatova

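Affiliations: City University of New York (CUNY); Davidson Academy; Phillips Academy

AI summary: This paper studies two puzzle formats common in high school linguistics competitions, Rosetta Stone and Match-Up, and proposes a systematic method for converting the former into the latter, which speeds up the creation of new puzzles. Testing the converted puzzle pairs on human experts and large language models (LLMs) shows that both display an all-or-nothing pattern on Match-Up puzzles: they either solve a puzzle completely or fail entirely. The work contributes a new dataset of paired puzzles and analyzes how difficulty differs across formats, offering a new view of human and machine linguistic reasoning.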

Comments Proceedings of the Fifteenth Language Resources and Evaluation Conference

Abstract

In this paper, we examine linguistic puzzles used in high school linguistics competitions, focusing on two common formats: Rosetta Stone and Match-Up. We propose a systematic procedure for converting existing Rosetta Stone puzzles into corresponding Match-Up counterparts. Because linguistic puzzle creation is complex and time-consuming, our method provides an efficient way to accelerate the generation of new puzzles. We evaluate the resulting Rosetta Stone-Match-Up pairs with both human participants and large language models (LLMs). Our results show that both expert human solvers and LLMs display an all-or-nothing pattern on Match-Up puzzles, either solving them completely or failing entirely. This work contributes a new dataset of paired puzzles and provides a detailed evaluation of puzzle difficulty across formats, offering insights into both human and machine linguistic reasoning.

2605.13407 2026-05-14 cs.LG cs.CE q-fin.ST

Vector-Quantized Discrete Latent Factors Meet Financial Priors: Dynamic Cross-Sectional Stock Ranking Prediction for Portfolio Construction

Namhyoung Kim, Jae Wook Song

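Affiliations: RiskX Hanyang University

AI summary: This paper proposes PRISM-VQ, a dynamic factor framework that addresses the low signal-to-noise ratio and shifting market regimes in cross-sectional stock return prediction. It combines expert prior factors, vector-quantized discrete latent factors learned from cross-sectional structure, and a structure-conditioned Mixture-of-Experts network that generates time-varying factor loadings. Experiments on the CSI 300 and S&P 500 datasets show consistent improvements in return prediction and portfolio performance while preserving interpretability.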

Comments IJCAI 2026 Accepted Paper including Technical Appendix

Abstract

Predicting cross-sectional stock returns is challenging due to low signal-to-noise ratios and evolving market regimes. Classical factor models offer interpretability but limited flexibility, while deep learning models achieve strong performance yet often underutilize financial priors. We address this gap with PRISM-VQ (PRior-Informed Stock Model with Vector Quantization), a dynamic factor framework that integrates expert prior factors, vector-quantized discrete latent factors learned from cross-sectional structure, and a structure-conditioned Mixture-of-Experts to generate time-varying factor loadings. Vector quantization acts as an information bottleneck that suppresses noise while capturing robust market structure, with discrete codes serving both as latent factors and as routing signals for temporal expert specialization. Experiments on CSI 300 and S&P 500 show consistent improvements in cross-sectional return prediction and portfolio performance over strong baselines while preserving interpretability. Our code is available at https://github.com/finxlab/PRISM-VQ.
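
The vector-quantization step mentioned above can be sketched generically as a nearest-codebook lookup. This is a minimal illustration of standard vector quantization, not the PRISM-VQ implementation: the `quantize` function, the toy codebook, and the two-dimensional embeddings are all hypothetical. The point is that each continuous embedding collapses to one discrete code index, which is what lets the code act as a compact latent factor or a routing signal.

```python
def quantize(vectors, codebook):
    """Map each vector to the index of its nearest codebook entry (squared L2)."""
    def sq_dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))

    codes = []
    for vec in vectors:
        nearest = min(range(len(codebook)), key=lambda k: sq_dist(vec, codebook[k]))
        codes.append(nearest)
    return codes


# Hypothetical 2-D codebook with three entries and three per-stock embeddings.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
embeddings = [[0.1, -0.2], [0.9, 1.2], [-0.8, 0.7]]
codes = quantize(embeddings, codebook)
```

Small perturbations of an embedding leave its code unchanged as long as the nearest entry stays the same, which is one way to read the abstract's description of quantization as an information bottleneck that suppresses noise.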

2605.13405 2026-05-14 cs.LG

When is Warmstarting Effective for Scaling Language Models?

Neeratyoy Mallik, Maciej Janowski, Johannes Hog, Herilalaina Rakotoarison, Josif Grabocka, Frank Hutter, Aaron Klein

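Affiliations: University of Freiburg; Zuse School ELIZA; University of Technology Nuremberg; University of Helsinki; Prior Labs; ELLIS Institute Tübingen

AI summary: This paper studies when warmstarting is effective for scaling up language models. The authors note that, despite its potential resource savings, warmstarting sees little use in practical large-scale training, which they attribute to an overemphasis on preserving model performance and to insufficient analysis of growth strategies. They find that preserving the base model's initial performance is not necessary, that simple, general-purpose growth strategies are often more effective, and that there is an upper bound on the growth factor beyond which training from scratch is more efficient. In their experiments, a 2x growth factor most reliably yields convergence speedups, and the results provide practical guidance and empirical limits for model growth.

Abstract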

Model growth from a given checkpoint aims to accelerate training of a larger model, offering potential resource savings. Despite recent interest, warmstarting has seen limited practical adoption in large-scale training. We attribute this to two underexplored factors: (1) an overemphasis on preserving the smaller model's performance at initialization, which constrains operator design for new architectures, and (2) insufficient analysis of how growth interacts with hyperparameters and scaling behavior, compounded by inconsistent growth factors across the literature. We show that preserving the base model's initial post-growth performance is not necessary for strong final performance, and that simple, architecture-agnostic growth strategies can outperform more complex warmstarting operators. Crucially, we empirically identify an upper bound on the growth factor $g$ beyond which training from scratch is more efficient. We observe this across multiple ablation setups. Notably, this limit is also present, but unreported, in prior published results. Across our experiments on dense MLPs and dense language models, we find that a $2\times$ growth factor is the most reliable in yielding convergence speedups, with gains most pronounced under 20 tokens/parameter budgets and diminishing as budget increases. We fit scaling laws over these observations to provide predictive guidance for practitioners deciding when and how much to grow. Together, our analysis provides practical guidelines and empirical limits for model growth.

2605.13404 2026-05-14 cs.SD

Seconds-Aligned PCA-DAC Latent Diffusion for Symbolic-to-Audio Drum Rendering

Konstantinos Soiledis, Maximos Kaliakatsos Papakostas, Dimos Makris, Konstantinos Tsamis

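Affiliations: Dept. of Music Technology and Acoustics, Hellenic Mediterranean University; Athena RC

AI summary: This study proposes Sec2Drum-DAC, a conditional latent diffusion model that generates drum audio from symbolic control information. The model samples event features at physical time points and predicts principal-component coordinates of frozen DAC codebook embeddings rather than waveform samples, producing realistic audio while preserving timing and dynamics. Experiments show that it outperforms deterministic PCA regression and a symbolic rendering baseline on several evaluation metrics, most notably spectral and transient ones.

Abstract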

Symbolic-control drum generation requires preserving explicit event timing and dynamics while synthesizing acoustically plausible waveforms. We present Sec2Drum-DAC, a conditional latent-diffusion model for symbolic-to-audio drum rendering. The model conditions on event features sampled in physical time at codec-frame locations and predicts standardized principal-component coordinates of frozen DAC summed-codebook embeddings rather than waveform samples. In the evaluated DAC configuration, 72 principal components capture the observed training-frame summed-latent subspace under the stated SVD threshold, yielding a compact continuous denoising target with a deterministic reconstruction path to the 1024-dimensional DAC latent space before waveform decoding. Across 1,733 held-out four-beat windows, PCA diffusion improves paired spectral and transient metrics over deterministic PCA regression and a symbolic rendering baseline, while direct regression remains stronger on phase-sensitive waveform L1. Auxiliary RVQ cross-entropy improves short-step diffusion on mel error, onset-flux cosine, and waveform L1, with the most favorable trade-offs occurring at 6-25 denoising steps depending on the metric.

2605.13403 2026-05-14 cs.RO cs.CV

RotVLA: Rotational Latent Action for Vision-Language-Action Model

Qiwei Li, Xicheng Gong, Xinghang Li, Peiyan Li, Quanyun Zhou, Hangjun Ye, Jiahuan Zhou, Yadong Mu

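Affiliations: Wangxuan Institute of Computer Technology, Peking University; Xiaomi Robotics; CASIA

AI summary: This paper proposes RotVLA, a vision-language-action (VLA) framework built on a continuous rotational latent action representation. It targets problems caused by discretized action representations in existing latent action models, such as trivial reconstruction behavior and limited expressiveness. RotVLA models latent actions as elements of SO(n), giving them continuity, compositionality, and a structured geometry that matches real-world action dynamics, and a triplet-frame learning framework reinforces temporal dynamics. Experiments show strong results on several benchmarks, outperforming existing VLA models.

Abstract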

Latent Action Models (LAMs) have emerged as an effective paradigm for handling heterogeneous datasets during Vision-Language-Action (VLA) model pretraining, offering a unified action space across embodiments. However, existing LAMs often rely on discrete quantization encode and decode pipelines, which can lead to trivial frame reconstruction behavior, limited representational capacity, and a lack of physically meaningful structure. We introduce RotVLA, a VLA framework built on a continuous rotational latent action representation. Latent actions are modeled as elements of SO(n), providing continuity, compositionality, and structured geometry aligned with real-world action dynamics. A triplet frame learning framework further enforces meaningful temporal dynamics while avoiding degeneration. RotVLA consists of a VLM backbone and a flow-matching action head, pretrained on large-scale cross-embodiment robotic datasets and human videos with latent-action supervision. For downstream robot control, the flow-matching head is extended into a unified action expert that jointly denoises latent and robot actions. Here, latent actions serve as a latent planner, providing high-level guidance that conditions action generation. With only 1.7B parameters and 1700+ hours of pretraining data, RotVLA achieves 98.2% on LIBERO and 89.6% / 88.5% on RoboTwin2.0 under clean and randomized settings, respectively. It also demonstrates strong real-world performance on manipulation tasks, consistently outperforming existing VLA models.

2605.13401 2026-05-14 cs.LG cs.RO stat.ML

Trajectory-Level Data Augmentation for Offline Reinforcement Learning

Tobias Schmähling, Matthias Burkhardt, Tobias Windisch

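Affiliations: University of Applied Sciences Kempten

AI summary: This paper proposes a trajectory-level data augmentation method for offline reinforcement learning, aimed at training policies from a small number of suboptimal trajectories in tasks such as active positioning. The method exploits task structure and the geometric relationship among reward functions, value functions, and logging policies to improve data quality through trajectory-level augmentation, which in turn improves offline reinforcement learning performance. The study provides theoretical justification and validates the method across different dimensionalities and under partial observability.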

Comments 26 pages, 25 figures, Accepted at ICML 2026

Abstract

We propose a data augmentation method for offline reinforcement learning, motivated by active positioning problems. Particularly, our approach enables the training of off-policy models from a limited number of suboptimal trajectories. We introduce a trajectory-based augmentation technique that exploits task structure and the geometric relationship between rewards, value functions, and mathematical properties of logging policies. During data collection, our augmentation supports suboptimal logging policies, leading to higher data quality and improved offline reinforcement learning performance. We provide theoretical justification for these strategies and validate them empirically across positioning tasks of varying dimensionality and under partial observability.

2605.13399 2026-05-14 cs.LG cs.IT math.IT

The Diffusion Encoder

Akhil Premkumar, Sarah Lucioni

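Affiliations: Department of Physics; University of California San Diego; Independent Researcher

AI summary: This paper proposes a new kind of encoder that uses the expressive power of diffusion models to replace the encoder of a traditional variational autoencoder. Because the diffusion model and the decoder tend to update their latent estimates in opposing directions, the authors design an alternating training scheme inspired by the expectation-maximization algorithm. The scheme synchronizes encoder and decoder reliably while keeping the simple and efficient training objective of diffusion models.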

Comments 22 pages + references, 10 figures

Abstract

We construct a new kind of encoder, leveraging the expressive power of diffusion models. In a traditional variational autoencoder, the encoder and decoder jointly negotiate a latent representation of the input. This is made possible by the reparameterization trick, which simplifies training at the cost of restricting the encoder to a simple family of distributions. Replacing this encoder with a diffusion model requires rethinking how the decoder pressure can be transmitted back to the encoder, given that they tend to update their internal estimates of the latent in opposing directions. We solve this problem with an alternating training scheme, inspired by the expectation-maximization algorithm. Our method enables more reliable synchronization between encoder and decoder, while preserving the simple and efficient training objective of standard diffusion models.

2605.13396 2026-05-14 cs.CV

PreFIQs: Face Image Quality Is What Survives Pruning

Jan Niklas Kolf, Guray Ozgur, Andrea Atzori, Žiga Babnik, Vitomir Štruc, Naser Damer, Fadi Boutros

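Affiliations: Fraunhofer Institute for Computer Graphics Research IGD; University of Ljubljana; Technical University of Darmstadt

AI summary: This paper proposes PreFIQs, a training-free, unsupervised face image quality assessment framework based on the Pruning Identified Exemplar (PIE) hypothesis. It measures image quality as the Euclidean distance between the embeddings produced by a pre-trained face recognition model and by its pruned counterpart. The method is supported theoretically through a Jacobian-vector product analysis and outperforms existing methods on several benchmark datasets, validating parameter pruning as an effective signal for face image quality.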

Comments Accepted at CVPR 2026 Workshops

Abstract

Face Image Quality Assessment (FIQA) evaluates the utility of a face image for automated face recognition (FR) systems. In this work, we propose PreFIQs, an unsupervised and training-free FIQA framework grounded in the Pruning Identified Exemplar (PIE) hypothesis. We hypothesize that low-utility face images rely disproportionately on fragile network parameters, resulting in larger geometric displacement of their embeddings under model sparsification. Accordingly, PreFIQs quantifies image utility as the Euclidean distance between L2-normalized embeddings extracted from a pre-trained FR model and its pruned counterpart. We provide a first-order theoretical justification via a Jacobian-vector product analysis, demonstrating that this empirical drift serves as a computationally efficient approximation of the exact geometric sensitivity of the latent embedding manifold. Extensive experiments across eight benchmarks and four FR models demonstrate that PreFIQs achieves competitive or superior performance compared to state-of-the-art FIQA methods, including establishing new state-of-the-art results on several benchmarks, without any training or supervision. These results validate parameter sparsification as a principled and practically efficient signal for face image utility, and demonstrate that quality is, in essence, what survives pruning.
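
The quantity described above, the Euclidean distance between L2-normalized embeddings from a dense model and its pruned counterpart, can be sketched as follows. This is a minimal illustration, assuming the two embeddings have already been extracted and are non-zero; the function names are hypothetical, and how the paper maps the distance to a final quality score is not stated in the abstract.

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def embedding_drift(emb_dense, emb_pruned):
    """Euclidean distance between the L2-normalized embeddings."""
    a, b = l2_normalize(emb_dense), l2_normalize(emb_pruned)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical embeddings of one image from the dense and the pruned model.
drift = embedding_drift([0.9, 0.1, 0.4], [0.8, 0.3, 0.5])
```

Under the abstract's hypothesis, a larger drift indicates a lower-utility image. Because both vectors are unit-normalized, the drift always lies between 0 and 2.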

2605.13395 2026-05-14 cs.LG cs.CV

Taming the Long Tail: Rebalancing Adversarial Training via Adaptive Perturbation

Lilin Zhang, Yimo Guo, Yue Li, Jiancheng Shi, Xianggen Liu

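Affiliations: Sichuan University; Dongfang Electric (Chengdu) Innovation Research Co., Ltd.; Southwest China Research Institute of Electronic Equipment

AI summary: This paper studies adversarial training of deep neural networks on long-tail data and identifies limitations of conventional adversarial training under class imbalance, including a skewed training objective and unstable adversarial distributions. The authors propose adaptively adjusting adversarial perturbations to improve robustness and class balance at the same time, and design a plug-and-play framework called RobustLT. Experiments on several long-tailed datasets show improved adversarial robustness and class-balance performance.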

Comments accepted by CVPR 2026

Abstract

Deep neural networks are highly vulnerable to adversarial examples, i.e., small perturbations that can significantly degrade model performance. While adversarial training has become the primary defense strategy, most studies focus on balanced datasets, overlooking the challenges posed by real-world long-tail data. Motivated by the fact that perturbations in adversarial examples inherently alter the training distribution, we theoretically investigate their impact. We first revisit adversarial training for long-tail data and identify two key limitations: (i) a skewed training objective caused by class imbalance, and (ii) unstable evolution of adversarial distributions. Furthermore, we show that perturbations can simultaneously address both adversarial vulnerability and class imbalance. Based on these insights, we propose RobustLT, a plug-and-play framework that adaptively adjusts perturbations during adversarial training. Extensive experiments demonstrate that RobustLT consistently enhances adversarial robustness and class-balance on long-tailed datasets. The code is available at https://github.com/zhang-lilin/RobustLT.

2605.13391 2026-05-14 cs.AI

RS-Claw: Progressive Active Tool Exploration via Hierarchical Skill Trees for Remote Sensing Agents

Liangtian Liu, Zeyuan Wang, Ziyu Li, Kai Ouyang, Zichao Tang, Chengfu Liu, Haifeng Li, Hanwen Yu, Wentao Yang, Cheng Yang, Dongyang Hou

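Affiliations: School of Geosciences and Info-Physics, Central South University; School of Resources and Environment, University of Electronic Science and Technology of China; School of Earth Sciences and Spatial Information Engineering, Hunan University of Science and Technology; Sanya Institute of Hunan University of Science and Technology

AI summary: As multi-modal large language models advance, remote sensing intelligence is shifting from perception to action, yet existing remote sensing agents still select tools passively and struggle to balance context load against toolset completeness in complex tasks. This paper proposes RS-Claw, an active-exploration architecture built on hierarchical skill trees. Skill encapsulation describes tools hierarchically so that the agent can load tool information step by step and on demand, which frees context space and raises the hit rate for critical tools. Experiments on the Earth-Bench benchmark show that RS-Claw compresses input tokens effectively and outperforms existing methods.

Abstract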

The rise of multi-modal large language models (MLLMs) is shifting remote sensing (RS) intelligence from "see" to "action", as OpenClaw-style frameworks enable agents to autonomously operate massive RS image-processing tools for complex tasks. Existing RS agents adopt a passive selection paradigm for tool invocation, relying on either full tool registration (Flat) or retrieval-augmented generation (RAG). However, in the massive and multi-source heterogeneous RS tool ecosystem, such passive mechanisms struggle to dynamically balance "context load" and "toolset completeness" throughout task reasoning, thus exhibiting inherent limitations: full tool registration triggers context space deficits during long-horizon tasks, whereas RAG retrieval may omit critical tools in essential steps. To overcome these bottlenecks, this paper redefines tool selection by arguing that the agent should act as an active explorer within the tool space. Based on this perspective, we propose RS-Claw, a novel RS agent architecture. By leveraging Skill encapsulation technology at the tool end, this architecture hierarchically structures tool descriptions, enabling the agent to execute on-demand sequential decision-making: initially selecting relevant skill branches by reading only tool summaries, then dynamically loading detailed descriptions, and ultimately achieving precise invocation. This active paradigm not only significantly liberates the agent's context space but also effectively ensures the accurate hit rate of critical tools during long-horizon reasoning. Systematic experiments on the Earth-Bench benchmark demonstrate that RS-Claw's active exploration mechanism effectively filters semantic noise and substantially frees up reasoning space, achieving an input token compression ratio of up to 86%, and comprehensively outperforming existing Flat and RAG baselines across complex reasoning evaluations.

2605.13386 2026-05-14 cs.LG stat.ML

Support-Conditioned Flow Matching Is Kernel Smoothing

Daniel Matsui Smola

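Affiliations: Department of Computer Science; University of Washington

AI summary: This paper studies how cross-attention-based generative models generate when conditioned on a finite support set. It shows that the velocity field is in essence a Nadaraya-Watson kernel smoother whose bandwidth shrinks as flow time advances, moving from global averaging early on to nearest-neighbor behavior late. The work connects cross-attention to classical kernel methods and identifies three failure regimes; experiments confirm the theoretical predictions and show that IP-Adapter's cross-attention implements approximate kernel smoothing.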

Comments Submitted to NeurIPS 2026. 18 pages, 10 figures, 1 table. Code at https://github.com/BaroqueObama/kernel-flow-matching-code

Abstract

Generative models are often conditioned on a small set of examples via cross-attention. Under the Gaussian optimal-transport path, we show that the exact velocity field induced by a finite support set is a Nadaraya--Watson kernel smoother whose bandwidth decreases with flow time, from broad averaging at early steps to nearest-neighbor at late steps. A single Gaussian-kernel attention head exactly computes this field, connecting cross-attention conditioning to classical kernel theory. The theory predicts three failure regimes: nearest-neighbor collapse of the kernel at high dimension, mismatch between the isotropic kernel and the data geometry, and insufficient support for nonparametric estimation. Experiments on Gaussian mixtures, spherical shells, and DINOv2 ImageNet features confirm that learned conditioning improves in precisely these regimes, and that IP-Adapter's cross-attention implements approximate NW smoothing in practice.
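
The kernel-smoother form can be sketched directly. The sketch assumes the path x_t = (1 - t) x_0 + t x_1 with x_0 drawn from a standard Gaussian and x_1 drawn uniformly from the support set, which is one common reading of "Gaussian optimal-transport path"; the paper's exact parameterization may differ, and `nw_velocity` is a hypothetical name. Under that assumption, the posterior weight of support point x_i given x_t = x is proportional to exp(-||x - t x_i||^2 / (2 (1 - t)^2)), and the velocity is the weighted average of (x_i - x) / (1 - t).

```python
import math


def nw_velocity(x, t, support):
    """Velocity at point x and time t (0 <= t < 1) for a finite support set."""
    scale = 1.0 - t
    # Gaussian log-weights; in units of x / t the bandwidth is (1 - t) / t,
    # which shrinks toward zero as t approaches 1.
    logits = [
        -sum((xd - t * sd) ** 2 for xd, sd in zip(x, s)) / (2.0 * scale ** 2)
        for s in support
    ]
    top = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(l - top) for l in logits]
    total = sum(weights)
    weights = [w / total for w in weights]
    # Nadaraya-Watson average of the per-point conditional velocities.
    return [
        sum(w * (s[d] - x[d]) for w, s in zip(weights, support)) / scale
        for d in range(len(x))
    ]


support = [[1.0, 0.0], [-1.0, 0.0]]
early = nw_velocity([0.0, 0.0], 0.5, support)  # equal weights on both points
late = nw_velocity([0.9, 0.0], 0.9, support)   # dominated by the nearest point
```

At the symmetric early point the two contributions cancel, while at the late point the weight on the far support point is negligible and the field points almost entirely at the nearest neighbor, matching the transition from broad averaging to nearest-neighbor behavior described in the abstract.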

2605.13384 2026-05-14 cs.LG

Teaching and Learning under Deductive Errors

Jan Arne Telle, Brigt Håvardstun, Jose Hernandez-Orallo

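Affiliations: Department of Informatics, University of Bergen; University of Bergen; VRAIN - Universitat Politecnica de Valencia; Universitat Politecnica de Valencia; Leverhulme Centre for the Future of Intelligence - University of Cambridge

AI summary: This paper studies a framework for machine teaching and learning when the learner makes deductive errors. Conventional models assume an error-free learner, but humans and large language models in few-shot regimes often make inconsistent or stochastic errors. The authors propose a new teaching framework and, in a modified PAC setting, analyze how a teacher can find an approximately correct teaching set for a given estimated error level. They also study the complexity of the related computational problems, give parameterized algorithms, and report an experimental validation.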

Comments 15 pages, preprint neurips

Abstract

Most models of machine teaching and learning assume the learner makes no errors in its internal deductive inference. However, humans and large language models in few-shot learning regimes are two important examples of learners where this does not hold. They fail on some consistency checks, and they can fail stochastically. In this paper we introduce a teaching and learning framework that takes these deductive errors into account. We specifically study the case of machine teaching, as different characterizations of the teacher can account for both machine teaching and learning. In an overhauled Probably Approximately Correct (PAC) setting, we study theoretically that, for some estimated error level, the teacher must find a PAC teaching set that with high probability will lead the learner to guess a hypothesis that is approximately correct. We study the computational complexity of six different problems related to computing optimal PAC teaching sets. We give XP algorithms parametrized by size of teaching set, with tight runtime bounds under standard complexity assumptions like ETH. These results are complemented with a small experimental study of which teaching and learning protocols can best represent the observed behavior in some LLM teaching sessions.

2605.13383 2026-05-14 cs.LG

Beyond Oversquashing: Understanding Signal Propagation in GNNs Via Observables

Eden Nagar, Ya-Wei Eileen Lin, Ron Levie

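Affiliations: Technion - Israel Institute of Technology, Faculty of Mathematics; Technical University of Munich, School of Computation, Information and Technology; Munich Center for Machine Learning

AI summary: This paper studies signal propagation in graph neural networks (GNNs), noting that propagation in conventional methods tends to lose information, which shows up as oversmoothing and oversquashing. Inspired by quantum mechanics, the authors propose modeling propagation with observables that describe where a signal lies in the graph, how concentrated it is, and how it propagates, and they prove that standard spectral GNNs propagate signals poorly. Building on this, they propose a new spectral GNN, the Schrödinger GNN, which routes signals across the graph more effectively.

Abstract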

Graph Neural Networks (GNNs) perform computations on graphs by routing the signal between graph regions using a graph shift operator or a message passing scheme. Often, the propagation of the signal leads to a loss of information, where the signal tends to diffuse across the graph instead of being deliberately routed between regions of interest. Two notions that depict this phenomenon are oversmoothing and oversquashing. In this paper, we propose an alternative approach for modeling signal propagation, inspired by quantum mechanics, using the notion of observables. Specifically, we model the place in the graph where the signal lies, how much the signal is concentrated there, and how much of the signal is propagated towards a location of interest when applying a GNN. Using these new concepts, we prove that standard spectral GNNs have poor signal propagation capabilities. We then propose a new type of spectral GNN, termed Schrödinger GNN, which we show has a superior capacity to route the signal across the graph.

2605.13382 2026-05-14 cs.RO

BlockVLA: Accelerating Autoregressive VLA via Block Diffusion Finetuning

Ruiheng Wang, Shuanghao Bai, Haoran Zhang, Badong Chen, Xiangyu Xu

Affiliations: Xi'an Jiaotong University

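AI summary: This paper proposes BlockVLA, a framework for accelerating inference of autoregressive (AR) vision-language-action (VLA) models in robotic tasks. By introducing a block diffusion paradigm, BlockVLA converts a pretrained AR model into an efficient discrete diffusion policy that keeps autoregressive dependencies at the block level while denoising in parallel within each block, combining global causal coherence with local parallel generation. Experiments on the LIBERO and SimplerEnv benchmarks show a 3.3x inference speedup over standard discrete diffusion models, along with better training efficiency and performance gains on complex long-horizon tasks.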

English abstract

While autoregressive (AR) Vision-Language-Action (VLA) models have demonstrated formidable reasoning capabilities in robotic tasks, their sequential decoding process often incurs high inference latency and may amplify error accumulation during long-horizon execution. Discrete Diffusion Language Models (dLLMs) provide a promising alternative through parallel token refinement, but their practical deployment in robotics remains limited by the number of denoising function evaluations (NFEs) they require and by the difficulty of directly applying standard KV caching to bidirectional iterative decoding. To bridge these paradigms, we propose BlockVLA, a framework that adapts pretrained AR backbones into an efficient discrete diffusion policy through a block diffusion paradigm. BlockVLA maintains autoregressive dependencies at the block level while enabling parallel denoising within each block, thereby combining global causal coherence with local parallel generation. This design enables prefix KV-cache reuse across completed blocks, reduces the effective cost of iterative denoising, and provides a smoother transition from AR pretraining to diffusion-based policy fine-tuning. We conduct extensive evaluations on the LIBERO and SimplerEnv benchmarks. Experimental results demonstrate that BlockVLA achieves a 3.3$\times$ inference acceleration over standard discrete diffusion baselines. Furthermore, our model exhibits superior training efficiency, with success rates converging substantially faster than baselines. The gain is particularly pronounced in complex, long-horizon tasks, where BlockVLA achieves significant performance gains in the early stages of training. This work establishes Block Diffusion as a robust bridge between large-scale pretrained AR models and efficient, high-frequency real-time robotic control.
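The block-level dependency structure described in this abstract (causal across blocks, parallel within a block) can be pictured as an attention mask. The sketch below is an illustration under assumptions, not BlockVLA's implementation: the function name `block_causal_mask` and the `block_size` parameter are hypothetical, and the abstract does not specify the actual masking scheme.

```python
def block_causal_mask(seq_len: int, block_size: int) -> list[list[bool]]:
    """Return mask[i][j] == True when position i may attend to position j.

    Hypothetical illustration: tokens see every token in their own block
    (bidirectional, so the block can be denoised in parallel) and every
    token in earlier blocks (causal across blocks).
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [
        [(j // block_size) <= (i // block_size) for j in range(seq_len)]
        for i in range(seq_len)
    ]


# Six tokens split into three blocks of two tokens each.
mask = block_causal_mask(seq_len=6, block_size=2)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Because earlier blocks never attend to later ones in this layout, their keys and values stay fixed once a block is complete, which is the property that makes prefix KV-cache reuse possible.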

2605.13381 2026-05-14 cs.CV cs.MM

Backbone is All You Need: Assessing Vulnerabilities of Frozen Foundation Models in Synthetic Image Forensics

Chiara Musso, Joy Battocchio, Andrea Montibeller, Giulia Boato

Affiliations: University of Trento

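AI summary: As AI-generated images become increasingly realistic, Vision Transformers (ViTs) have become a core technology in modern deepfake detection. However, existing methods commonly rely on frozen pretrained backbones, which introduces a subtle but critical vulnerability. This paper proposes the Surrogate Iterative Adversarial Attack (SIAA), a gray-box attack that relies only on knowledge of the target detector's ViT backbone and crafts effective adversarial examples within the target detector's feature space. Experiments show attack success rates close to white-box performance across multiple scenarios, revealing that backbone knowledge alone can severely undermine detector reliability and underscoring the need for more robust defenses in adversarial multimedia forensics.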

English abstract

As AI-generated synthetic images become increasingly realistic, Vision Transformers (ViTs) have emerged as a cornerstone of modern deepfake detection. However, the prevailing reliance on frozen, pre-trained backbones introduces a subtle yet critical vulnerability. In this work, we present the Surrogate Iterative Adversarial Attack (SIAA), a gray-box attack that exploits knowledge of the detector's ViT backbone alone and operates entirely within the target detector's feature space to craft highly effective adversarial examples. Through our experiments, involving multiple ViT-based detectors and diverse gray-box scenarios, including few-shot learning, complete training misalignment and attack transferability tests, we demonstrate that this vulnerability consistently yields high attack success rates, often approaching white-box performance. By doing so, we reveal that backbone knowledge alone is sufficient to undermine detector reliability, highlighting the urgent need for more resilient defenses in adversarial multimedia forensics.

2605.13380 2026-05-14 cs.RO

Exploring Human-Robot Collaboration: Analysis of Interaction Modalities in Challenging Tasks

Simone Arreghini, Cristina Iani, Alessandro Giusti, Valeria Villani, Lorenzo Sabattini, Antonio Paolillo

Affiliations: Dalle Molle Institute for Artificial Intelligence (IDSIA), USI-SUPSI; Department of Surgery, Medicine, Dentistry and Morphological Sciences with interest in Transplant, Oncology and Regenerative Medicine, University of Modena and Reggio Emilia; Department of Sciences and Methods for Engineering (DISMI), University of Modena and Reggio Emilia

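AI summary: This paper studies three interaction modalities for human-robot collaboration (passive, reactive, and proactive) and analyzes experimentally how they perform on a challenging task. Participants built a seven-layer colored tower from memory under each modality. Although robot assistance increased task time, most participants preferred collaboration, especially the modality in which the robot proactively offered help. The study suggests that timely proactive support can improve user experience in controlled collaborative tasks.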

English abstract

This work compares three interaction modalities for human-robot collaboration: passive, reactive, and proactive. We studied 18 participants assembling a seven-layer colored tower from memory while using nearby and distant blocks. In the passive modality participants worked alone; in the reactive modality a mobile robot helped only upon request; in the proactive modality it initiated brick delivery and error signaling without explicit requests. Although robot assistance increased completion time, most participants preferred collaboration: 67% preferred proactive behavior and 78% judged it most useful. These results suggest that timely proactive support can improve user experience in controlled collaborative tasks.

2605.13375 2026-05-14 cs.CV cs.AI

GRIP-VLM: Group-Relative Importance Pruning for Efficient Vision-Language Models

Mingzhe Huang, Weijun Wang, Xin Ding, Liang Mi, Hao Wen, Yuanchun Li, Lichen Pang, Shansong Yang, Yunxin Liu, Ting Cao

Affiliations: Institute for AI Industry Research (AIR), Tsinghua University; Juhaokan Technology Co.,Ltd; Nanjing University; University of Science and Technology of China

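AI summary: Processing large numbers of visual tokens in vision-language models (VLMs) incurs high computational overhead. To address this, the paper proposes GRIP-VLM, a reinforcement-learning-based Group-Relative Importance Pruning framework that formulates pruning as a Markov decision process and uses Group Relative Policy Optimization (GRPO), anchored by a supervised warm-up, to explore the discrete selection space directly, avoiding the sub-optimal solutions of continuous approximations. Combined with a budget-aware scorer, the method evaluates token importance dynamically and adapts to different compression ratios without retraining. Experiments show that it outperforms heuristic and supervised-learning baselines on multiple multimodal benchmarks, delivering up to a 15% inference speedup at equal accuracy.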

Comments 10 pages, 11 figures

English abstract

In Vision-Language Models (VLMs), processing a massive number of visual tokens incurs prohibitive computational overhead. While recent training-aware pruning methods attempt to selectively discard redundant tokens, they largely rely on continuous-gradient relaxations. However, visual token pruning is inherently a discrete, non-convex combinatorial problem; consequently, these continuous approximations frequently trap the optimization in sub-optimal local minima, especially under aggressive compression budgets. To overcome this fundamental bottleneck, we propose GRIP-VLM, a Group-Relative Importance Pruning framework driven by Reinforcement Learning. Rather than relying on smooth-gradient assumptions, GRIP-VLM formulates pruning as a Markov Decision Process, employing a Group Relative Policy Optimization (GRPO) paradigm anchored by supervised warm-up to directly explore the discrete selection space. Integrated with a budget-aware scorer, our lightweight agent dynamically evaluates per-token importance and adapts to arbitrary compression ratios without retraining. Extensive experiments across diverse multimodal benchmarks demonstrate that GRIP-VLM consistently outperforms heuristic and supervised-learning baselines, achieving a superior Pareto frontier and delivering up to a 15\% inference speedup at equal accuracy.
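The "group-relative" part of GRPO can be sketched as follows: several candidates are sampled for the same input, and each reward is normalized against the group's mean and standard deviation. This is a minimal illustration, assuming a population standard deviation and a small `eps` for stability; the function name and the idea that each reward scores one sampled pruning mask are assumptions, not details taken from the GRIP-VLM paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize each reward against its own sampling group.

    Hypothetical setting: every entry of `rewards` scores one candidate
    token-selection mask sampled for the same image-text input.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled candidates for one input; two succeed and two fail.
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
print(advantages)

# A group with identical rewards carries no learning signal.
flat = group_relative_advantages([2.0, 2.0, 2.0])
print(flat)
```

Normalizing within the group removes the need for a separately learned value baseline, which is the usual motivation given for this style of advantage estimate.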

2605.13373 2026-05-14 cs.CL

Exploiting Pre-trained Encoder-Decoder Transformers for Sequence-to-Sequence Constituent Parsing

Daniel Fernández-González, Cristina Outeiriño Cid

Affiliations: Universidade de Vigo, Departamento de Informática

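AI summary: This paper studies how to use pre-trained encoder-decoder Transformer models for sequence-to-sequence constituent parsing. The authors extend the existing sequence-to-sequence framework by building parsers on pre-trained encoder-decoder models such as BART, mBART, and T5, which they fine-tune and evaluate under different linearization strategies. Experiments show that the approach outperforms prior sequence-to-sequence models on both continuous treebanks and more complex discontinuous benchmarks, and is competitive with state-of-the-art task-specific parsers.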

Comments Preliminary version

English abstract

To achieve deep natural language understanding, syntactic constituent parsing plays a crucial role and is widely required by many artificial intelligence systems for processing both text and speech. A recent approach involves using standard sequence-to-sequence models to handle constituent parsing as a machine translation problem, moving away from traditional task-specific parsers. These models are typically initialized with pre-trained encoder-only language models like BERT or RoBERTa. However, the use of pre-trained encoder-decoder language models for constituency parsing has not been thoroughly explored. To bridge this gap, we extend the sequence-to-sequence framework by investigating parsers built on pre-trained encoder-decoder architectures, including BART, mBART, and T5. We fine-tune them to generate linearized parse trees and extensively evaluate them on different linearization strategies across both continuous treebanks and more complex discontinuous benchmarks. Our results demonstrate that our approach outperforms all prior sequence-to-sequence models and performs competitively with leading task-specific constituent parsers on continuous constituent parsing.

2605.13370 2026-05-14 cs.LG cs.CL

Phasor Memory Networks: Stable Backpropagation Through Time for Scalable Explicit Memory

Sungwoo Goo, Hwi-yeol Yun, Sangkeun Jung

Affiliations: College of Pharmacy; Chungnam National University; Department of Computer Science & Engineering

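AI summary: This paper proposes the Phasor Memory Network (PMNet), a new architecture that addresses the training difficulty explicit-memory models face in language modeling because of gradient instability during backpropagation through time. By introducing unitary phasor dynamics and hierarchical learnable anchors, it structurally stabilizes memory updates and preserves gradient stability without specialized initialization. Experiments show that PMNet achieves near-100% exact retrieval on a synthetic Copy-Paste task and, with only one third of the parameters of a Mamba model, shows comparable long-context robustness, providing a theoretically grounded foundation for scalable sequence modeling.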

English abstract

For over a decade, explicit memory architectures like the Neural Turing Machine have remained theoretically appealing yet practically intractable for language modeling due to catastrophic gradient instability during Backpropagation Through Time. In this work, we break this stalemate with \textit{Phasor Memory Network} (PMNet), a novel architecture that structurally resolves memory volatility through \textit{Unitary Phasor Dynamics} and \textit{Hierarchical Learnable Anchors}. Rather than relying on brute-force scaling, we present a mechanistic proof-of-concept in a controlled byte-level setting. By constraining recurrent state updates to phase rotations on a complex unit circle, PMNet preserves gradient norms and inherently prevents divergence without the need for specialized initialization. We empirically demonstrate the active actuation of the memory module through a synthetic Copy-Paste task, where PMNet utilizes an expansive \textit{85-slot hierarchical memory tree} ($=\sum^{4}_{h=1}4^{h-1}$) to achieve near 100\% exact retrieval across temporal distances that completely exceed the local sliding window attention's receptive field. Furthermore, despite being a compact 119M parameter model trained on 18.8B tokens, PMNet matches the zero-shot long-context robustness of a Mamba model that is three times larger. Our ablation studies and gradient analyses confirm that the historical failure of explicit memory was a structural alignment problem, which PMNet effectively overcomes, providing a theoretically grounded foundation for scalable sequence modeling.
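Two statements in this abstract can be checked numerically: a state updated only by phase rotations stays on the unit circle, and a four-level tree with branching factor 4 has 85 slots. The sketch below is a toy illustration under assumptions; the `rotate` helper and the fixed rotation angles are hypothetical and do not reproduce PMNet's learned dynamics.

```python
import cmath


def rotate(state: list[complex], angles: list[float]) -> list[complex]:
    """Multiply each component by exp(i * theta), a pure phase rotation."""
    return [z * cmath.exp(1j * theta) for z, theta in zip(state, angles)]


# Start on the unit circle and apply 1000 rotation steps (hypothetical angles).
state = [cmath.exp(1j * 0.3), cmath.exp(1j * -1.2)]
for _ in range(1000):
    state = rotate(state, [0.7, -0.4])

# Magnitudes stay at 1 up to floating-point error, so nothing explodes or vanishes.
magnitudes = [abs(z) for z in state]
print(magnitudes)

# Slot count of the memory tree from the abstract: sum over h = 1..4 of 4^(h-1).
tree_slots = sum(4 ** (h - 1) for h in range(1, 5))
print(tree_slots)
```

Since each update has modulus one, repeated application neither amplifies nor shrinks the state, which is the intuition behind the abstract's claim that gradient norms are preserved.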

2605.13368 2026-05-14 cs.CL

What Does LLM Refinement Actually Improve? A Systematic Study on Document-Level Literary Translation

Shaomu Tan, Dawei Zhu, Ke Tran, Michael Denkowski, Sony Trenous, Bill Byrne, Leonardo Ribeiro, Felix Hieber

Affiliations: University of Amsterdam; University of Cambridge; Amazon AGI

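AI summary: This paper systematically studies how well iterative self-refinement actually works for literary translation, examining how different granularities and strategies affect translation quality. It finds that document-level machine translation followed by segment-level refinement yields stable and substantial improvements, whereas document-level refinement is weaker and less reliable. Experiments also show that a general refinement prompt outperforms error-specific prompting and evaluate-then-refine schemes, and that refinement mainly improves fluency, style, and terminology, with limited gains in adequacy. These findings clarify the mechanisms and limitations of current refinement approaches.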

English abstract

Iterative self-refinement is a simple inference-time strategy for machine translation: an LLM revises its own translation over multiple passes. Yet document-scale refinement remains poorly understood: 1) which pipelines work best, 2) what quality dimensions improve, and 3) how refiners behave. In this paper, we present a systematic study of document-level literary translation, covering nine LLMs and seven language pairs. Across nine translation-refinement granularity combinations and five refinement strategies, we find a robust recipe: document-level MT followed by segment-level refinement yields strong and stable improvements. In contrast, document-level refinement often makes fewer edits and leads to smaller or less reliable gains. Beyond granularity, a simple general refinement prompt consistently outperforms error-specific prompting and evaluate-then-refine schemes. Our large-scale human evaluation shows that refinement gains come primarily from fluency, style, and terminology, with limited and less consistent improvements in adequacy. Experiments varying model strength reveal that refinement projects outputs toward the refiner's distribution rather than performing targeted error repair. These findings clarify the mechanisms and limitations of current refinement approaches.

2605.13366 2026-05-14 cs.CV cs.LG

Neural Surrogate Forward Modelling For Electrocardiology Without Explicit Intracellular Conductivity Tensor

Shaheim Ogbomo-Harmitt, Cesare Magnetti, Jakub Grzelak, Oleg Aslanidi

Affiliations: King’s College London; PhysicsX

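AI summary: Addressing the forward modelling problem in non-invasive cardiac electrophysiology, this study proposes a deep learning method that predicts far-field ECGs directly from left atrial intracellular potentials without explicit intracellular conductivity tensor inputs. The model learns the mapping from potentials to ECGs, avoiding the structural errors introduced in conventional physics-based models by conductivity tensors that are hard to measure. Trained on data from only 74 subjects, the model reaches high predictive accuracy (R² of 0.949 ± 0.037), showing potential to improve non-invasive assessment of atrial fibrillation.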

Comments Accepted into the 9th International Conference on Computational and Mathematical Biomedical Engineering (CMBE2026)

English abstract

Accurate forward modelling is essential for non-invasive cardiac electrophysiology, particularly in atrial fibrillation (AF), where electrical activation is highly disorganised. Conventional physics-based forward models require explicit specification of intracellular conductivity tensors, which are not directly measurable in clinical practice and introduce structural modelling errors. This proof-of-concept study presents a deep learning approach that learns a direct mapping from left atrial intracellular electrical potentials to far-field ECGs without requiring explicit intracellular conductivity inputs at inference time. Despite being trained on only 74 subjects, the model achieved an $R^2$ of $0.949 \pm 0.037$, highlighting its potential to reduce structural uncertainty and improve non-invasive AF assessment.

2605.13352 2026-05-14 cs.LG

GeoFlowVLM: Geometry-Aware Joint Uncertainty for Frozen Vision-Language Embedding

Mayank Nautiyal, Li Ju, Andreas Hellander, Ekta Vats, Prashant Singh

Affiliations: Department of Information Technology, Uppsala University; SciLifeLab, Uppsala University

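AI summary: GeoFlowVLM is a post-hoc method that adds geometry-aware joint uncertainty estimation to frozen vision-language embedding models. It learns the joint distribution of paired embeddings on a product of hyperspheres via Riemannian flow matching, capturing both cross-modal ambiguity (aleatoric uncertainty) and lack of training-distribution support (epistemic uncertainty). The model yields a conditional retrieval entropy and a marginal-typicality score, which measure aleatoric and epistemic uncertainty respectively, and shows good calibration across several retrieval and zero-shot classification tasks.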

English abstract

Standard dual-encoder vision-language models that map images and text to deterministic points on a shared unit hypersphere through $\ell_2$ normalization typically expose neither \emph{aleatoric} uncertainty (cross-modal ambiguity) nor \emph{epistemic} uncertainty (lack of training-distribution support). Existing post-hoc methods either recover at most one of the two uncertainty components, or ignore the hyperspherical geometry of these models' embeddings. We propose \textbf{GeoFlowVLM} as a post-hoc adapter that learns the joint distribution of paired $\ell_2$-normalised dual-encoder VLM embeddings on the product hypersphere $\mathbb{S}^{d-1} \times \mathbb{S}^{d-1}$ via Riemannian flow matching with a single masked velocity field. A consistency result shows that, in the population limit, the trained network exposes the joint flow and both cross-modal conditional flows as valid Riemannian flow-matching velocity fields on their respective domains. We derive two quantities from this single model: a conditional retrieval entropy that quantifies aleatoric ambiguity with a decision-theoretic interpretation via a Fano-type bound, and a marginal-typicality epistemic score justified by an exact chain-rule decomposition of the joint NLL. This decomposition isolates a cross-modal pointwise-mutual-information term that is structurally discriminative rather than epistemic, and is empirically the only consistently uninformative standalone component. Empirically, the entropy tracks Recall@1 with near-ideal monotonic calibration across three retrieval benchmarks in both directions, and the marginal-typicality sum yields consistently calibrated selective accuracy across four zero-shot classification benchmarks.
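The chain-rule decomposition of the joint NLL mentioned in this abstract can be written out as a standard identity. The notation is an assumption for illustration: $x$ and $y$ stand for the image and text embeddings and $p$ for the learned joint density with its marginals. The paper's exact symbols and score definitions may differ.

```latex
\begin{aligned}
-\log p(x, y) &= \underbrace{-\log p(x) - \log p(y)}_{\text{marginal terms}} \;-\; \mathrm{PMI}(x, y), \\
\mathrm{PMI}(x, y) &= \log \frac{p(x, y)}{p(x)\, p(y)} .
\end{aligned}
```

Substituting the second line into the first cancels the marginal terms and recovers $-\log p(x, y)$, so the split is exact. Under this reading, the marginal terms would correspond to the marginal-typicality sum and the last term to the pointwise-mutual-information component that the abstract describes as discriminative rather than epistemic.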

2605.13349 2026-05-14 cs.CV

Drag within Prior Distribution: Text-Conditioned Point-Based Image Editing within Distribution Constraints

Haoyang Hu, Masataka Seo, Yen-Wei Chen

Affiliations: Ritsumeikan University, Graduate School of Information; Engineering, Osaka Institute of Technology

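AI summary: This paper studies how to perform text-conditioned point-based editing in a diffusion framework while preserving semantic consistency and staying within distribution constraints. To address the trajectory ambiguity of conventional point-based editing and the unnatural artifacts caused by large editing distances, the authors introduce CLIP-based guidance and a prior-preservation loss that keep the editing process within the diffusion prior distribution. They also propose a directionally weighted point tracking mechanism that improves the accuracy and generation quality of fine-grained edits.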

Comments ICASSP 2026 oral

English abstract

Diffusion-based point editing methods have gained significant traction in image editing tasks due to their ability to manipulate image semantics and fine details by applying localized perturbations on the manifold of the noise latent. However, these approaches face several limitations. Traditional point-based editing relies on pairs of handle and target points to define motion trajectories, which can introduce ambiguity or unnecessary alterations. Furthermore, when the distance between the handle and target points is large, the accumulated perturbations often cause the noise latent to deviate from the inversion score trajectory, resulting in unnatural artifacts. To address these issues in global editing tasks, we introduce a CLIP-based model to evaluate and guide intermediate editing steps, ensuring that the generated results remain semantically aligned. Additionally, we propose a prior-preservation loss that constrains the optimized latent code to stay within the sampling space of the diffusion prior, improving consistency with the original data distribution and ensuring that the model generates images along a familiar score trajectory. For fine-grained tasks, we present a directionally weighted point tracking mechanism that steers the editing process toward the target direction within similar feature regions. This improves both tracking accuracy and generation quality, while also reducing editing time.

2605.13346 2026-05-14 cs.LG

Contextual Bandits for Resource-Constrained Devices using Probabilistic Learning

Marco Angioli, Kevin Johansson, Antonello Rosato, Amy Loutfi, Denis Kleyko

Affiliations: Sapienza University of Rome; Örebro University; Örebro University Linköping University; Örebro University Research Institutes of Sweden

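AI summary: This paper studies how to deploy contextual bandit algorithms efficiently on resource-constrained devices and proposes probabilistic HD-CB, a hyperdimensional contextual bandit method based on a probabilistic update rule. By updating a random subset of vector components with a time-decaying update probability, the method avoids the precision demands and overflow risk that accumulation causes in the original hyperdimensional approach, while reducing compute and storage costs. Experiments show that it outperforms binarized HD-CB at equal precision and approaches the performance of the original HD-CB with only a few bits per component.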

English abstract

Contextual bandits (CB) are online sequential decision-making problems under partial feedback that underpin many adaptive services. There is a growing demand to deploy CB agents directly on-device, under strict constraints on memory, compute, and energy. However, standard linear CB algorithms are often impractical for resource-constrained devices because their computational and memory costs scale unfavorably. Recently, HD-CB, a CB approach based on hyperdimensional computing principles, has been proposed to model and solve CB problems by moving into high-dimensional spaces. HD-CB offers faster convergence, favorable scalability, and improved memory efficiency compared to linear CB algorithms. However, its learning rule is accumulation-based: the values of action vectors grow over time, requiring high precision. While periodic binarization can prevent overflow in low-precision components, it may discard important information about magnitudes and degrade decision quality. This paper introduces probabilistic HD-CB, a low-precision variant that replaces deterministic accumulation with a probabilistic update rule. At each step, only a random subset of vector components is updated, with a time-decaying update probability, and component values are constrained to a predefined range [-k,+k]. This approach enables low-precision components, prevents overflow without periodic binarization, and reduces the expected update cost in proportion to the fraction of updated components. Off-policy evaluation on standardized synthetic CB benchmarks using the Open Bandit Pipeline shows that probabilistic HD-CB consistently outperforms binarized HD-CB at equal precision, while approaching the performance of HD-CB with as few as 3 bits per component.
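The update rule described in this abstract (random subset of components, time-decaying probability, values clipped to [-k,+k]) can be sketched in a few lines. Everything specific here is an assumption for illustration: the decay schedule `p0 / (1 + decay * t)`, the bipolar context encoding, and the function names are hypothetical and are not taken from the paper.

```python
import random


def update_probability(t: int, p0: float = 1.0, decay: float = 0.01) -> float:
    """Hypothetical time-decaying schedule; the paper's schedule may differ."""
    return p0 / (1.0 + decay * t)


def probabilistic_update(action_vec, context_vec, reward_sign, t, k, rng):
    """Update a random subset of components in place, clipped to [-k, +k].

    Returns the number of components touched, which is what the expected
    update cost scales with.
    """
    p = update_probability(t)
    updated = 0
    for i, c in enumerate(context_vec):
        if rng.random() < p:
            new_val = action_vec[i] + reward_sign * c
            action_vec[i] = max(-k, min(k, new_val))  # saturate instead of overflowing
            updated += 1
    return updated


rng = random.Random(0)
dim, k = 1000, 3
action = [0] * dim
context = [rng.choice((-1, 1)) for _ in range(dim)]  # assumed bipolar hypervector
counts = [probabilistic_update(action, context, +1, t, k, rng) for t in range(200)]
print(counts[0], counts[-1], min(action), max(action))
```

With `k = 3` each component takes one of seven integer values, which fits in 3 bits. That is one way to read the abstract's low-precision claim, though the paper's exact encoding is not stated here.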

2605.13345 2026-05-14 cs.AI cs.MA

Multi-Agent Systems in Emergency Departments: Validation Study on a ED Digital Twin

Markus Wenzel, Tobias Strapatsas, Jessika Kress, Dorothea Sauer, Nele Gessler, Horst K. Hahn

Affiliations: Constructor University; Fraunhofer Institute for Digital Medicine MEVIS; Asklepios Kliniken Hamburg GmbH

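AI summary: Addressing the challenges emergency departments face in patient care and resource management, this study proposes a hybrid simulation approach that combines discrete event simulation (DES) with an agent-based model (ABM) to build a highly configurable digital twin of an emergency department. Validating the model across different department sizes, patient loads, and staffing levels and comparing it against real-world data shows that it can replicate the operating dynamics of real emergency departments. The study also introduces a multi-agent system, based on a temporal ledger of event records, that autonomously explores resource allocation strategies, providing a useful simulation tool for emergency department resource optimization.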

English abstract

Emergency departments (EDs) face challenges in patient care and resource management. We propose to explore optimization strategies in a realistic and flexible model and develop a hybrid Discrete Event Simulation (DES) and Agent-Based Model (ABM) simulating highly configurable ED environments. We specifically focus on the validation of the modeling approach. We derive configurations for ED sizes, patient load, and staffing from real-world studies. We then validate the model's expressivity by matching its key performance indicators and metrics with their values known from the literature. We proceed by implementing scientifically established and practice-proven resource optimization strategies. Comparing the documented real-world outcomes with our model's results demonstrates that the DES-ABM-based simulation can effectively replicate real-world ED dynamics under interventions. Lastly, we integrate a proof-of-concept multi-agent system (MAS) that can autonomously explore resource allocation strategies within the simulated ED environment based on a temporal ledger of ED event records. This modular DES-ABM-MAS framework offers a powerful tool to explore resource optimization strategies in emergency departments.