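arXivDaily: a daily academic digest synced with the full arXiv feed, with AI-generated summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and more.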

2606.06311 2026-06-05 cs.AI

AIS-Based Vessel Trajectory Prediction Using Memory-Augmented Neural Networks

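Wonmo Koo, Sanha Chang, Heeyoung Kim

Affiliations: Department of Industrial and Systems Engineering, Korea Advanced Institute of Science and Technology (KAIST)

AI summary: This paper proposes memory-augmented neural networks for predicting vessel trajectories from AIS data; on Gulf of Mexico and New York Bay datasets, the approach significantly outperforms deep learning baselines that lack external memory.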

2606.06309 2026-06-05 cs.CV

RhymeFlow: Training-Free Acceleration for Video Generation with Asynchronous Denoising Flow Scheduling

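Chensheng Dai, Shengjun Zhang, Yifan Li, Zhang Zhang, Zheng Zhu, Yueqi Duan

Affiliations: Tsinghua University; GigaAI

AI summary: To address slow inference in DiT video generation models, this work proposes RhymeFlow, a training-free framework that identifies keyframes and applies dense denoising only to them while non-keyframes progressively skip steps; a latent trajectory projection module preserves temporal consistency, yielding faster inference and better quality.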

Comments Project Page: https://simon-dcs.github.io/Website-of-RhymeFlow/, Code: https://github.com/Simon-Dcs/RhymeFlow

Abstract

Video generation models based on Diffusion Transformers (DiTs) have achieved remarkable performance in video synthesis, yet they suffer from high inference latency and computational costs due to the quadratic complexity of 3D attention. Existing acceleration methods primarily reduce computational complexity within each individual denoising step through techniques such as sparse attention and KV-caching. However, they rigidly adhere to the inherent constraint of the standard diffusion pipeline: every frame in the target video sequence must be subjected to a complete, dense denoising process across all diffusion timesteps. We observe that due to the corresponding contents and motions among adjacent frames, when keyframes with critical semantic transitions are anchored, the intermediate states of the other frames often follow more predictable trajectories, which indicates that such a uniform, dense denoising process is inherently redundant for natural video data. To this end, we introduce RhymeFlow, a training-free framework that decouples the denoising trajectories of different frames. Specifically, we first identify a sparse set of pivotal keyframes that dominate the latent semantic evolution. Then, only these keyframes undergo dense, step-by-step denoising to ensure structural integrity, while non-keyframes progressively skip denoising steps to minimize computational cost. Since skipped intermediate states of non-keyframes break the temporal coherence in keyframe denoising steps, leading to visual degradation, we further introduce a latent trajectory projection module, which enables keyframes to interact with a complete and temporally consistent sequence representation. Extensive experiments on current DiT-based video generation models demonstrate that our method outperforms existing baselines with higher inference speed and better visual quality.

2606.06308 2026-06-05 cs.RO

Attitude-Aided Linear Calibration of Triaxial Accelerometers

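Yongqiang Yu, Tian Huang, Yipeng Yang

Affiliations: Tsinghua University

AI summary: Proposes ALAC, an attitude-aided linear calibration method for triaxial accelerometers that builds a combined error matrix to enable linear least-squares estimation, requires only five arbitrarily oriented measurements, and is validated for accuracy and robustness in static and quasi-static experiments.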

Abstract

Triaxial MEMS accelerometers are widely used for inertial sensing, navigation, and sensor fusion, but existing calibration methods often rely on costly reference setups or nonlinear iterative optimization, limiting their efficiency and applicability to low-cost or self-calibrating systems. We present attitude-aided linear accelerometer calibration (ALAC), a method that operates on any platform providing orientation information, such as turntables, robotic arms, or inertial measurement units. ALAC constructs a combined error matrix (CEM) to represent sensor errors in a unified calibration model and enables linear least-squares estimation. The bias and gravity vector are jointly estimated, implicitly accounting for platform misalignment, and matrix decomposition of the CEM recovers scale, non-orthogonality, and alignment rotation parameters. Under static gravity, calibration is formulated as a constrained homogeneous least-squares (CHLS) problem and solved in closed form using standard linear algebra. Only five arbitrarily oriented measurements are required, and a recursive extension supports online or in-field calibration. Experiments on a stationary robot-mounted accelerometer and a quasi-static public IMU trajectory show that ALAC, in both offline and online modes, outperforms reference-based and online baselines in accuracy and robustness to sensor noise. On the same dataset, it matches iterative self-calibration under filtered conditions and surpasses all evaluated baselines on raw measurements. These results demonstrate a robust and practical calibration scheme for MEMS-based inertial platforms, especially low-cost IMUs and online calibration scenarios.
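
The idea of using attitude to make accelerometer calibration linear can be illustrated with a much simpler problem than the one the abstract describes. The sketch below is not ALAC: it assumes the gravity vector is already known and expressed in the sensor frame through the platform attitude, so the affine model `accel = M @ g + b` can be fitted by ordinary least squares. The names `calibrate_linear`, `random_rotation`, `M_true`, and `b_true` are hypothetical and chosen only for illustration; the paper's combined error matrix, joint gravity estimation, and constrained homogeneous formulation are not reproduced here.

```python
import numpy as np

def calibrate_linear(gravity_body, accel_meas):
    """Fit accel = M @ g + b by linear least squares.

    gravity_body: (N, 3) gravity in the sensor frame, assumed known from attitude.
    accel_meas:   (N, 3) raw accelerometer readings.
    """
    n = gravity_body.shape[0]
    A = np.hstack([gravity_body, np.ones((n, 1))])           # (N, 4) design matrix
    theta, *_ = np.linalg.lstsq(A, accel_meas, rcond=None)   # (4, 3) stacked [M^T; b^T]
    return theta[:3].T, theta[3]

def random_rotation(rng):
    """Draw a random proper rotation matrix via QR decomposition."""
    q, r = np.linalg.qr(rng.normal(size=(3, 3)))
    q = q * np.sign(np.diag(r))
    if np.linalg.det(q) < 0:
        q[:, 0] = -q[:, 0]
    return q

rng = np.random.default_rng(0)
g_ref = np.array([0.0, 0.0, 9.81])                 # gravity in the reference frame
M_true = np.array([[1.02, 0.01, -0.02],            # made-up scale and non-orthogonality
                   [0.00, 0.98, 0.015],
                   [0.00, 0.00, 1.01]])
b_true = np.array([0.05, -0.03, 0.10])             # made-up bias

rotations = [random_rotation(rng) for _ in range(5)]    # five arbitrary orientations
g_body = np.array([R.T @ g_ref for R in rotations])
accel = g_body @ M_true.T + b_true                      # noise-free synthetic readings

M_est, b_est = calibrate_linear(g_body, accel)
```

With gravity treated as known, four non-coplanar orientations already determine the twelve unknowns; five are used here only to echo the measurement count quoted in the abstract.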

2606.06306 2026-06-05 cs.CL

Decomposing Factual Sycophancy in Language Models: How Size and Instruction Tuning Shape Robustness

Victor De Marez, Luna De Bruyne, Walter Daelemans

Affiliations: Centre for Computational Linguistics, Psycholinguistics and Sociolinguistics, University of Antwerp

AI summary: By decomposing factual sycophancy into two channels, truth margin and manipulation sensitivity, this work studies how model size and instruction tuning affect the robustness of 56 open-weight language models (0.3B-32B parameters) across 13 manipulation types.

Abstract

Factual sycophancy occurs when a language model abandons a correct, verifiable answer under social pressure. Because a flip occurs only when pressure toward a false answer exceeds the model's neutral preference for the truth, flip rates conflate two mechanisms: the strength of that baseline preference (truth margin), and how far pressure shifts it (manipulation sensitivity). We decompose factual sycophancy into these channels and use them to separate the effects of size and instruction tuning across 56 open-weight models spanning 0.3B-32B parameters and 13 manipulation types. We find that vulnerability is governed mainly by size, but instruction tuning changes how size acts: small instruction-tuned models can become less robust, whereas large instruction-tuned models usually become more robust. Instruction tuning primarily increases truth margin, but its behavioral effect depends on manipulation type. Scaling also changes the two channels differently: base models gain margin but become mildly more manipulation-sensitive, whereas instruction-tuned models gain margin faster and become less sensitive. Factual sycophancy is therefore not a single scalar property. Evaluations should report channel-specific, manipulation-specific, and size-conditioned robustness rather than flip rates alone.
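
The two-channel view can be made concrete with a small sketch. It assumes, as an illustration rather than the paper's exact definitions, that truth margin is the log-probability gap between the true and false answers under a neutral prompt, and that manipulation sensitivity is how much pressure shrinks that gap. The names `Channels` and `decompose` and all numbers are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Channels:
    truth_margin: float   # neutral preference for the true answer
    sensitivity: float    # how far pressure shifts that preference
    flipped: bool         # True when pressure exceeds the margin

def decompose(logp_true_neutral, logp_false_neutral,
              logp_true_pressure, logp_false_pressure):
    """Split one flip observation into margin and sensitivity (assumed definitions)."""
    margin = logp_true_neutral - logp_false_neutral
    pressured_margin = logp_true_pressure - logp_false_pressure
    sensitivity = margin - pressured_margin
    # A flip happens exactly when the pressured margin is negative.
    return Channels(margin, sensitivity, flipped=sensitivity > margin)

# Two made-up models with the same sensitivity but different margins.
model_a = decompose(-0.2, -2.0, -1.5, -0.5)
model_b = decompose(-0.05, -4.0, -1.0, -2.15)
```

Under these assumed numbers both models shift by about the same amount, yet only `model_a` flips, which is why a flip rate alone cannot tell the two channels apart.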

2606.06303 2026-06-05 cs.LG cs.AI

Plug-and-Play Guidance for Discrete Diffusion Models via Gradient-Informed Logit Correction

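Hongkun Dou, Zike Chen, Fengji Li, Hongjue Li, Yue Deng

Affiliations: National University of Singapore; University of Science and Technology of China

AI summary: Proposes GILC, a framework that estimates guidance signals by using the pretrained denoising network as a variational proxy and introduces a Jacobian-free mechanism that directly corrects the clean-prediction logits, enabling controllable generation with discrete diffusion models without additional training and reaching state-of-the-art performance on DNA, protein sequence, and molecular generation tasks.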

Comments Accepted by ICML 2026

Abstract

Controllable generation with discrete diffusion models is often hindered by high computational overhead or the need for retraining. In this paper, we present Gradient-Informed Logit Correction (GILC), a plug-and-play framework that efficiently estimates guidance signals by repurposing the pretrained denoising network as a variational proxy. To circumvent the gradient instability inherent in high-dimensional discrete spaces, we introduce a Jacobian-free mechanism that directly corrects the clean prediction logits, facilitating stable and effective guidance. Our method accommodates both differentiable and non-differentiable reward functions. Extensive experiments across DNA, protein sequence, and molecular generation tasks demonstrate that GILC achieves state-of-the-art performance without additional training, frequently outperforming fine-tuning approaches.

2606.06300 2026-06-05 cs.AI

Multi-ResNets for Subspace Preconditioning in Constrained Optimization

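Merve Karakas, Christopher J. Williams, Emmanuel O. Balogun, Sadegh Sadeghi Tabas, Christian Brown, Nikhil Rao

Affiliations: UCLA; University of Oxford; Tapestry, Google

Author note: Alphabetical ordering; authors contributed equally to this work.

AI summary: Proposes MResOpt, a staged residual neural network architecture that decomposes constraint satisfaction by priority using stage-aware losses, enables domain-informed ordered constraint satisfaction within predict-complete-correct pipelines, behaves as sequential Gaussian process regression in an idealized infinite-width regime, and substantially reduces high-priority constraint violations.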

Abstract

We propose MResOpt, a staged residual neural network architecture for constrained optimization problems. Our architecture fits within predict-complete-correct pipelines and decomposes constraint satisfaction by priority through intermediate re-completion and stage-aware losses. The framework enables domain-informed ordered constraint satisfaction which allows the network to utilize ordinal structure when present. Under an idealized infinite-width regime, we show that our design behaves as sequential Gaussian Process regression. On synthetic QP, QCQP, and SOCP benchmarks, the staged architecture improves high-priority constraint satisfaction across convex and non-convex settings. On line-flow-constrained AC optimal power flow, we introduce a physics-motivated constraint ordering and show that MResOpt supports a learned division of labor that keeps iterates on the equality manifold, achieving substantially lower high-priority violation than reprojected baselines while remaining computationally efficient.

2606.06295 2026-06-05 cs.LG physics.bio-ph physics.chem-ph

Reactive Flux Matching: Mechanism Discovery and Adaptive Sampling of Rare Events

Rishal Aggarwal, David Ryan Koes, Nicholas M. Boffi, Eric Vanden-Eijnden

Affiliations: CMU-Pitt Program in Computational Biology; Dept. of Computational & Systems Biology, University of Pittsburgh; Machine Learning Department and Dept. of Mathematical Sciences, Carnegie Mellon University; Courant Institute, New York University; Machine Learning Lab, Capital Fund Management

AI summary: Proposes the Flux Matching framework, which learns a current velocity $u(z)$ and a scalar potential $h(z)$ from reactive trajectory data to identify reaction pathways and reaction coordinates and to enable adaptive sampling.

Comments 21 pages, 7 figures, submitted to NeurIPS 2026

Abstract

Path sampling methods generate ensembles of reactive trajectories connecting metastable states, but extracting mechanistic insight from these data remains nontrivial. We introduce Flux Matching, a framework that learns two complementary objects directly from reactive trajectory data: a current velocity $u(z)$, whose streamlines trace the dominant reaction pathways, and a scalar potential $h(z)$, obtained from a weighted Helmholtz-Hodge decomposition of the reactive current, that serves as a data-driven reaction coordinate. Both minimize quadratic functionals over the reactive path ensemble, analogous to the flow matching loss in generative modeling, and require no knowledge of the underlying dynamics or stationary distribution. Unlike committor-based methods, $u$ and $h$ remain well-defined under projection onto non-Markovian collective variables, and their level sets in turn provide adaptive interfaces for improved sampling with enhanced sampling methods. Flux Matching is validated through the generation of current velocity trajectories and rate constant calculations on molecular systems.

2606.06294 2026-06-05 cs.CV cs.AI

Towards One-to-Many Temporal Grounding

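Qi Xu, Yue Tan, Shihao Chen, Jiahao Meng, Anna Wang, Shunping Ji, Hao Fei, Jason Li

Affiliations: University of Science and Technology of China

AI summary: For the one-to-many temporal grounding (OMTG) task, proposes a systematic solution comprising a benchmark, a dataset, and reward functions, substantially improving multi-segment video grounding performance.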

Comments Accepted to ICML'26

Abstract

Temporal Grounding (TG) aims to localize video segments corresponding to a textual query. Prior research predominantly focuses on single-segment retrieval. Real-world scenarios, however, often require localizing multiple disjoint segments for a single query -- a setting we term One-to-Many Temporal Grounding (OMTG). Previous state-of-the-art MLLMs, optimized for one-to-one settings, struggle in this context, often yielding near-zero scores due to a lack of event cardinality perception. To bridge this gap, we present a systematic solution with three key contributions. First, we establish the first comprehensive OMTG benchmark, introducing Count Accuracy (C-Acc) and Effective Temporal F1 (EtF1) as evaluation metrics. Second, we curate a high-quality OMTG dataset comprising 56k samples through a sophisticated construction pipeline. Third, we develop novel temporal and caption reward functions specifically designed for OMTG. In particular, the caption reward leverages Chain-of-Thought reasoning over dense video captions to explicitly guide policy optimization toward both preciseness and completeness. Extensive experiments show our model achieves a new state-of-the-art EtF1 of 43.65% on OMTG Bench, outperforming Gemini 2.5 Pro and Seed-1.8 by 15.85% and 15.61%, respectively.

2606.06293 2026-06-05 cs.LG stat.ML

PAC-Bayesian Adversarially Robust Generalization for Message Passing Graph Neural Networks: A Sensitivity Analysis

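Ziling Liang, Xinping Yi, Qingsong Wen, Shi Jin

Affiliations: School of Information Science and Engineering, Southeast University; Squirrel Ai Learning

AI summary: Using a sensitivity-aware PAC-Bayesian framework that exploits the rank constraint on output Jacobians and anisotropic Gaussian posteriors, derives tighter adversarially robust generalization bounds for message passing graph neural networks.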

Abstract

Whilst the vulnerability of graph neural networks (GNNs) to adversarial attacks poses a critical threat to graph representation learning, understanding robust generalization behavior in the adversarial setting remains a fundamental challenge. Recently, PAC-Bayesian margin-based generalization analysis has substantially advanced this line of research by providing a flexible and data-dependent analytical framework. However, existing robust analyses often rely on isotropic Gaussian posteriors and control weight perturbations in the full parameter space, which limits their ability to capture heterogeneous parameter sensitivity and leaves them dependent on hidden-width-dependent complexity terms, resulting in generalization bounds that are not tight enough. In this paper, we extend a recently proposed sensitivity-aware PAC-Bayesian framework from deep neural networks to message passing GNNs (MPGNNs) and derive a tighter robust generalization bound in the adversarial setting. Specifically, we first quantify how sensitive the network outputs are to perturbations across different parameter blocks by deriving the output Jacobians with respect to the weight parameters. Exploiting the fact that these Jacobian matrices have rank at most $K$ in $K$-class graph classification, we then construct Jacobian-aligned sensitivity matrices and use anisotropic Gaussian posteriors with optimized covariances to tightly upper bound the KL divergence. Notably, by refining the spectral-norm dependence on the learned weights and reducing the leading dimension factor from hidden-width-dependent terms to the number of classes $K$, our analysis yields much tighter robust generalization guarantees for MPGNNs, thereby guiding their design to enhance adversarial robustness.
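
The rank fact the abstract relies on follows from shape alone: the Jacobian of $K$ logits with respect to any block of weights has only $K$ rows, however wide the hidden layer is. The sketch below checks this numerically on a toy one-layer message passing network with mean readout. The architecture, the function names `mpgnn_logits` and `jacobian_wrt_W1`, and the finite-difference Jacobian are illustrative assumptions, not the paper's construction.

```python
import numpy as np

def mpgnn_logits(W1, W2, A, X):
    """Toy graph classifier: one message passing layer, mean readout, linear head."""
    H = np.tanh(A @ X @ W1)       # aggregate neighbours, then transform
    return H.mean(axis=0) @ W2    # K logits for the whole graph

def jacobian_wrt_W1(W1, W2, A, X, eps=1e-6):
    """Forward-difference Jacobian of the logits with respect to the entries of W1."""
    base = mpgnn_logits(W1, W2, A, X)
    J = np.zeros((base.size, W1.size))
    for j in range(W1.size):
        Wp = W1.copy().ravel()
        Wp[j] += eps
        J[:, j] = (mpgnn_logits(Wp.reshape(W1.shape), W2, A, X) - base) / eps
    return J

rng = np.random.default_rng(0)
n_nodes, d_in, hidden, K = 6, 4, 16, 3
A = (rng.random((n_nodes, n_nodes)) < 0.4).astype(float)   # random adjacency matrix
A = A + np.eye(n_nodes)                                     # add self-loops
X = rng.normal(size=(n_nodes, d_in))
W1 = rng.normal(size=(d_in, hidden))
W2 = rng.normal(size=(hidden, K))

J = jacobian_wrt_W1(W1, W2, A, X)
rank = np.linalg.matrix_rank(J)   # bounded by K even though W1 has d_in * hidden entries
```

Increasing `hidden` widens `J` but cannot raise its rank above `K`, which is the structural property a Jacobian-aligned analysis can exploit.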

2606.06292 2026-06-05 cs.CV cs.RO

Synthetic Data Generation and Vision-based Wrinkle and Keypoint Detection for Bimanual Cloth Manipulation

Ariel Herrera, Xueyang Kang, Atal Anil Kumar

Affiliations: Department of Engineering, University of Luxembourg; School of Electrical and Electronic Engineering, Nanyang Technological University; Université de Lorraine, Arts et Metiers Institute of Technology, LCFC

AI summary: To address visual perception challenges in cloth manipulation, proposes a Blender-based synthetic data generation pipeline and a perception framework combining a CNN with a YOLOv8-OpenCV pipeline, enabling wrinkle-based grasping and keypoint-based ironing; the keypoint model achieves a mean position error of 1.7615 pixels.

Abstract

Robotic manipulation of textiles remains challenging because continuous deformation and self-occlusions hinder the robust visual perception required to estimate the cloth's state. To address the lack of annotated real-world data, we developed a Blender-based synthetic pipeline exporting auto-annotated keypoints, and combined manually labeled renders with real-world data to train a wrinkle detector. We present a perception framework integrating a CNN for permutation-invariant keypoint detection and a YOLOv8-OpenCV pipeline to extract grasping points from structural wrinkles. A proposed bimanual algorithm uses this system to stretch fully folded garments via wrinkles, transitioning to keypoint-based ironing once corners emerge. The keypoint model achieves a Mean Position Error (MPE) of 1.7615 pixels. The perception system transfers to physical fabrics without fine-tuning, outperforming baselines that fail in high-occlusion states or yield false positives on severe folds.

2606.06286 2026-06-05 cs.CL cs.AI

LLMs Can Leak Training Data But Do They Want To? A Propensity-Aware Evaluation of Memorization in LLMs

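Gianluca Barmina, Peter Schneider-Kamp, Lukas Galke Poech

Affiliations: University of Southern Denmark

AI summary: Proposes the PropMe framework, which contrasts prefix attacks with non-adversarial evaluation and shows that LLMs rarely leak training data in non-adversarial settings, and introduces the SimpleTrace pipeline for attribution and metric computation.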

Abstract

Large language models can reproduce training data, but existing memorization evaluations mostly measure whether models can be forced to do so, rather than whether they do so under ordinary use. We introduce PropMe, a propensity-aware framework for memorization evaluation that contrasts prefix-based capability attacks with non-adversarial evaluations. We propose a metric transformation that, applied to existing functions, yields propensity metrics. We further introduce SimpleTrace, a lightweight tracing pipeline built on infini-gram that deterministically attributes model generations to large-scale training corpora and computes verbatim, near-verbatim, and propensity-transformed memorization metrics. Evaluating two fully open models (Comma and DFM Decoder) on two datasets (Common Pile and Dynaword) in two languages, we find a consistent gap between capability and propensity: prefix attacks elicit substantially stronger memorization signals than generic or dataset-specific prompts, while propensity scores remain low overall. Thus, the models can reveal training data when directly elicited, but rarely do so in more common non-adversarial settings. We also find that DFM Decoder, which is continually pre-trained from Comma, exhibits reduced memorization and memorization propensity for Common Pile, confirming that memorization capability can decrease when later training emphasizes partially different data. Our results suggest that memorization audits should report both worst-case extractability and ordinary leakage propensity, and we encourage this practice to give a more comprehensive view of the phenomenon.

2606.06285 2026-06-05 cs.AI

TRACE: A Temporal Conditional Estimation for Multimodal Time Series Foundation Models

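Ziwen Kan, Yishuo Chen, Kecheng Li, Andrew Wen, Xiaomeng Wang, Liwei Wang, Jihao Duan, Song Wang, Hongfang Liu, Tianlong Chen

Affiliations: University of Science and Technology of China

AI summary: Proposes TRACE, a conditional estimation paradigm that infers missing target modalities from available auxiliary modalities to handle temporal misalignment and partial modality missingness in multimodal time series, outperforming existing fusion methods on healthcare and sentiment analysis benchmarks.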

Comments 5 figures and 5 tables in the main paper, plus appendix

Abstract

Time series foundation models (TS-FMs) aim to learn generalizable temporal representations that can be adapted to a wide range of downstream tasks. In real-world multimodal settings, time series are frequently affected by temporal misalignment and partial modality missingness, where different modalities are observed at heterogeneous time scales or are partially absent. Existing approaches typically rely on naive imputation or masking strategies, which fail to account for cross-modal dependencies and often lead to misaligned or degraded representations. We propose TRACE, a conditional estimation paradigm for multimodal time series foundation model pipelines under missingness and irregular sampling, allowing incomplete target modalities to be systematically inferred from available auxiliary modalities. We evaluate TRACE on diverse multimodal benchmarks spanning healthcare and affective computing, including the MIMIC-IV clinical dataset and the CMU-MOSI and CMU-MOSEI benchmarks for multimodal sentiment analysis. Across a range of downstream prediction tasks and missing-modality settings, TRACE consistently outperforms prior multimodal fusion approaches, demonstrating improved robustness to severe modality missingness and more reliable cross-modal representations.

2606.06284 2026-06-05 cs.AI

ToolChoiceConfusion: Causal Minimal Tool Filtering for Reliable LLM Agents

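Rahul Suresh Babu, Laxmipriya Ganesh Iyer

Affiliations: Independent Researcher, United States of America

AI summary: Proposes Causal Minimal Tool Filtering (CMTF), which selects tools by causal sufficiency to reduce wrong-tool calls and token cost; on a benchmark with 102 tasks, 100 tools, and four LLM backends, it reduces visible tools from 100 to one per step and cuts token usage by about 90%.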

Abstract

Large language model agents increasingly rely on external tools, but larger tool menus can reduce reliability and efficiency by increasing wrong-tool calls, premature actions, and token cost. Existing tool-selection methods often optimize semantic relevance, exposing tools whose names or descriptions match the user request. We argue that relevance is insufficient: a tool may be related to the task while still being unnecessary or premature at the current step. We propose Causal Minimal Tool Filtering (CMTF), a training-free method that selects tools by causal sufficiency. CMTF uses lightweight precondition-effect contracts to expose only the minimal next-step tool frontier needed to advance from the current state toward the user goal. Across multi-step tool-use tasks, we compare CMTF with all-tools exposure, keyword retrieval, state-aware filtering, and causal-path ablations, measuring task success, wrong-tool calls, premature actions, tool exposure, and token cost. In the main benchmark with 102 tasks, 100 tools, four LLM backends, and 2448 task-method-model runs, CMTF matches the strongest causal baseline in aggregate success while reducing visible tools from 100 to one per step and reducing token usage by about 90% relative to all-tools exposure.
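
Filtering by precondition-effect contracts can be sketched in a few lines. The example below is an illustrative reading of the abstract, not the authors' implementation: each tool declares the facts it requires and the facts it produces, a backward pass from the goal collects the facts still needed, and only tools that are runnable now and produce a needed fact are exposed. The `Tool` class, the functions `needed_facts` and `next_step_tools`, and the toy tool set are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Tool:
    name: str
    pre: frozenset   # facts that must hold before the call
    eff: frozenset   # facts that hold after the call

def needed_facts(tools, state, goal):
    """Backward-chain from the goal to collect facts that are still missing."""
    needed = set(goal) - set(state)
    stack = list(needed)
    while stack:
        fact = stack.pop()
        for t in tools:
            if fact in t.eff:
                for p in t.pre:
                    if p not in state and p not in needed:
                        needed.add(p)
                        stack.append(p)
    return needed

def next_step_tools(tools, state, goal):
    """Expose only tools that can run now and produce a fact still needed."""
    needed = needed_facts(tools, state, goal)
    return [t.name for t in tools if t.pre <= set(state) and t.eff & needed]

TOOLS = [
    Tool("search_flights", frozenset(), frozenset({"flight_options"})),
    Tool("book_flight", frozenset({"flight_options"}), frozenset({"booking"})),
    Tool("send_confirmation", frozenset({"booking"}), frozenset({"email_sent"})),
    Tool("cancel_booking", frozenset({"booking"}), frozenset({"cancelled"})),
    Tool("get_weather", frozenset(), frozenset({"weather"})),
]
GOAL = {"email_sent"}

step1 = next_step_tools(TOOLS, set(), GOAL)
step2 = next_step_tools(TOOLS, {"flight_options"}, GOAL)
step3 = next_step_tools(TOOLS, {"flight_options", "booking"}, GOAL)
```

In this toy setup `cancel_booking` is semantically related to the task and `get_weather` is always callable, yet neither is ever exposed, which mirrors the abstract's point that relevance alone is not sufficient.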

2606.06281 2026-06-05 cs.RO

Multi-Resolution Tactile Imitation Learning for Contact-Rich Robotic Manipulation

Rickmer Krohn, Erik Helmut, Niklas Funk, Jan Peters, Vignesh Prasad, Georgia Chalvatzaki

Affiliations: Interactive Robot Perception & Learning, TU Darmstadt; Intelligent Autonomous Systems, TU Darmstadt; Hessian AI; Robotics Institute Germany

AI summary: Proposes MiTaS, a multi-resolution tactile representation framework that fuses tactile sensors operating at different temporal resolutions (GelSight Mini and Evetac) with an RGB camera through modality-specific convolutional stems and transformer-based fusion, enabling imitation learning for complex contact-rich manipulation tasks with an 80% average success rate.

Comments 20 pages, preprint

Abstract

Touch sensing is beneficial for solving a wide variety of manipulation tasks. While there exists a wide range of tactile sensors with different properties, exploiting the fusion of multiple heterogeneous tactile sensors to improve manipulation learning remains underexplored. We present Multi-Resolution Tactile Sensing (MiTaS), a representation framework that leverages multiple tactile sensors operating at different temporal resolutions in order to solve complex contact-rich manipulation tasks. We propose a novel architecture using modality-specific convolutional stems and transformer-based fusion that effectively fuses information from an RGB camera stream, a vision-based GelSight Mini sensor and a high-frequency event-based Evetac sensor. This multi-sensor representation then conditions a flow-matching policy for solving downstream tasks. Experimental results across five contact-rich manipulation tasks demonstrate the effectiveness of multi-resolution tactile features in imitation learning. MiTaS achieves an average success rate of 80%, while vision-only (31%) and visual-tactile (54%) baselines cannot solve the task reliably. Co-training a visuo-tactile model with multi-tactile data boosts performance by over 10% in certain tasks, without having access to the Evetac sensor during policy evaluation. A detailed sensor-reading and attention analysis reveals the importance of different sensors throughout task execution, validating our multi-resolution tactile sensing approach. Project Page: http://mitas-touch.github.io.

2606.06278 2026-06-05 cs.CV

Geodesic Flow Matching on a Riemannian Degradation Manifold for Blind Image Restoration

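Akshay Janardan Bankar, Ankita Chatterjee, Sayan Banerjee, Shreyas Pandith, Kalakonda Sai Shashank, Amit Satish Unde

Affiliations: Samsung Research Institute

AI summary: Proposes explicitly modeling degradations on a low-dimensional Riemannian manifold and learning the intrinsic transport dynamics through a geodesic flow matching objective on the joint image-manifold space, enabling blind image restoration.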

Comments Submitted to ECCV 2026

2606.06272 2026-06-05 cs.LG cs.AI

Your GFlowNet Secretly Learns an Optimal Transport Plan

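Ian Maksimov, Nikita Morozov, Denis Belomestny, Sergey Samsonov

Affiliations: GitHub; arXiv

AI summary: This paper establishes a theoretical connection between non-acyclic generative flow networks and optimal transport, proving that the policy learned by a minimum-flow GFlowNet encodes an optimal transport plan from the source distribution to the target distribution.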

Comments ICML 2026 SPIGM Workshop

2606.06271 2026-06-05 cs.CL cs.HC

FOXGLOVE: Understanding Goal-Oriented and Anchored Writing Feedback from Experts and LLMs on Argumentative Essays

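Yijun Liu, Yifan Song, John Gallagher, Sarah Sterman, Tal August

Affiliations: University of Illinois Urbana-Champaign

AI summary: By constructing the FOXGLOVE dataset, this work systematically compares writing experts and large language models on the goal-orientation, anchoring, and prioritization of feedback on argumentative essays, finding that the two distribute feedback similarly across goals and positions but differ in the specific sentences they select and in feedback complexity.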

Abstract

While large language models (LLMs) are increasingly used to generate writing feedback, there remains no systematic comparison of LLM and expert feedback on the dimensions that writing research identifies as central to revision: goal-orientation, anchoring to specific sentences, and prioritization. We introduce FOXGLOVE, a dataset of 696 feedback comments written by trained writing instructors on 69 twelfth-grade argumentative essays, paired with 1,644 comments generated from four frontier LLMs under a shared protocol, totaling 2,340 comments. We provide expert quality ratings on a subset of both instructor and LLM comments. We find that instructors and LLMs distribute feedback similarly across goals and essay positions, yet instructors and models diverge on the specific sentences on which to provide feedback. Additionally, we find that models tend to write more complex feedback and use fewer questions than instructors. LLM feedback also receives higher ratings on most dimensions of quality, as rated by instructors, but much of this advantage appears to be attributable to lengthier comments. FOXGLOVE enables systematic comparison of where human and LLM feedback align, diverge, and differ.


2606.06267 2026-06-05 cs.CL

Many Circuits, One Mechanism: Input Variation and Evaluation Granularity in Circuit Discovery

多电路，单机制：电路发现中的输入变化与评估粒度

Alireza Bayat Makou, Jingcheng Niu, Subhabrata Dutta, Iryna Gurevych

发表机构 * UKP Lab, Technical University of Darmstadt（达姆施塔特工业大学UKP实验室）； National Research Center for Applied Cybersecurity ATHENE（国家应用网络安全研究中心ATHENE）

AI总结本文通过固定任务、改变输入统计量，发现电路结构差异并不对应功能差异（称为“伪特化”），并证明结构不同的电路实现相同计算，强调边级评估和跨条件迁移测试的必要性。

Comments 90 pages, 53 figures


AI中文摘要

电路发现方法识别解释特定模型行为的子图，发现的电路之间的结构差异通常被解释为不同机制的证据。我们通过固定任务、改变输入统计量来检验这一假设，并表明由此产生的结构差异表现出表面上的特化，但不对应功能差异，我们将这种模式称为伪特化。我们使用跨四个词频带以及一个控制条件的字面序列复制任务，在五个Pythia模型（70M-1.4B）中提取了75个电路，发现结构不同的电路实现相同的计算：频带特定的边可广泛迁移到其他频带，大多数频带共享的核心至少恢复电路性能的99%，因果互换干预证实内部表示在频带间可互换。在同一频带内的重复提取进一步表明，发现算法是从有效子图的等价类中采样，而非恢复唯一机制。标准评估实践掩盖了这种模式：源级评估夸大了表面忠实度，而边级评估揭示了从结构到功能的多对一映射。我们的结果表明，电路之间的结构差异不足以作为不同机制的证据，而揭示这一点需要边级评估和跨条件迁移测试。

英文摘要

Circuit discovery methods identify subgraphs that explain specific model behaviors, and structural differences between discovered circuits are commonly interpreted as evidence of distinct mechanisms. We test this assumption by varying input statistics while holding the task fixed, and show that the resulting structural differences exhibit apparent specialization but do not correspond to functional differences, a pattern we term phantom specialization. Using Literal Sequence Copying across four token-frequency bands plus a control condition in five Pythia models (70M-1.4B), we extract 75 circuits and find that structurally distinct circuits implement the same computation: band-specific edges transfer broadly across bands, a core shared across most bands recovers at least 99% of circuit performance, and causal interchange interventions confirm that internal representations are interchangeable across frequency bands. Repeated extractions within the same frequency band further suggest that discovery algorithms sample from an equivalence class of valid subgraphs rather than recovering a unique mechanism. Standard evaluation practice obscures this pattern: source-level evaluation inflates apparent faithfulness, while edge-level evaluation reveals the many-to-one mapping from structure to function. Our results show that structural differences between circuits are not sufficient evidence for distinct mechanisms, and that exposing this requires edge-level evaluation and cross-condition transfer tests.


2606.06266 2026-06-05 cs.CL

From Self to Other: Evaluating Demographic Perspective-Taking in LLM Hate Speech Annotation

从自我到他人：评估LLM仇恨言论标注中的人口统计学视角采纳

Paloma Piot, Javier Parapar

发表机构 * Information Retrieval Lab（信息检索实验室）； CITIC Research Centre（CITIC研究中心）； Universidade da Coruña（科鲁纳大学）

AI总结本研究通过评估人格化LLM在仇恨言论检测中模拟不同人口群体视角的能力，发现模型在群体间分歧、群体内敏感性和替代性预测三个维度上表现不一，其中使用Llama 3.1的替代性提示在多数人口统计轴上实现了最高跨群体一致性。


AI中文摘要

仇恨言论检测本质上是主观的：来自不同人口群体的个体对相同内容的感知差异很大。从多个群体收集足够标注成本高昂且难以规模化。人格化大型语言模型（被提示采用特定人口身份的模型）已被提出作为一种大规模模拟多样化视角的方法。但它们是否真正反映了不同群体如何分歧？我们评估了人类社会判断的三个维度：(i) 不同群体的人格化模型是否以类人方式产生分歧（群体间分歧），(ii) 当内容针对自身身份时它们是否变得更敏感（群体内敏感性），以及 (iii) 它们是否能准确预测另一群体将如何反应（替代性预测）。我们的结果表明，没有模型能一致地捕捉所有三个维度，且性能高度依赖模型，并非仅通过最小身份提示就能可靠出现。然而，使用Llama 3.1的替代性提示在大多数人口统计轴上产生了最高的跨群体一致性，并提供了最接近人类分歧模式的整体近似，表明该配置可能为与人类判断对齐的自动标注提供更可靠的设置。

英文摘要

Hate speech detection is inherently subjective: people from different demographic groups perceive the same content very differently. Collecting enough annotations from multiple demographic groups is costly and difficult to scale. Persona-conditioned Large Language Models (models prompted to adopt a specific demographic identity) have been proposed as a way to simulate diverse perspectives at scale. But do they actually reflect how different groups disagree? We evaluate three aspects of human social judgement: (i) whether personas from different groups disagree in human-like ways (inter-group disagreement), (ii) whether they become more sensitive when content targets their own identity (in-group sensitivity), and (iii) whether they can accurately predict how another group would react (vicarious prediction). Our results show that no model consistently captures all three dimensions, and performance is highly model-dependent and does not emerge reliably from minimal identity prompts alone. However, vicarious prompting with Llama 3.1 yields the highest cross-group agreement in most demographic axes and provides the closest overall approximation to human disagreement patterns, indicating that this configuration may provide a more reliable setting for automatic annotation aligned with human judgements.


2606.06256 2026-06-05 cs.AI

RedKnot: Efficient Long-Context LLM Serving with Head-Aware KV Reuse and SegPagedAttention

RedKnot: 基于头部感知的KV重用和SegPagedAttention的高效长上下文LLM服务

Yang Liu, ZhaoKai Luo, HuaYi Jin, ZhiYong Wang, RuoZhou He, BoYu Wang, Guanjie Chen, Junhao Hu

发表机构 * Xiaohongshu Inc., China（小红书公司，中国）； Peking University（北京大学）； Huawei Cloud（华为云）

AI总结提出RedKnot系统，通过按KV头分解缓存并采用SegPagedAttention，实现位置无关的KV重用、前缀压缩、冷热分离和分布式放置，在不重训练模型的情况下提升资源效率。


AI中文摘要

随着大语言模型（LLM）服务输入长度的持续增长，KV缓存已成为AI基础设施中的主要瓶颈。它限制了GPU内存容量、服务并发性、缓存重用和分布式可扩展性。几个重要问题，包括位置无关的KV缓存、前缀KV缓存压缩、冷/热KV缓存分离和分布式KV缓存管理，都依赖于KV缓存的表示和管理方式。然而，现有的服务系统在很大程度上依赖于单一的KV缓存抽象，其中KV缓存被视为同质的token级内存块序列，并在注意力头和服务场景中采用类似的管理策略。我们观察到，KV缓存的效用在不同KV头之间具有高度结构性：不同的头表现出不同的功能角色、注意力距离和运行时重要性。因此，并非每个头、token范围或服务场景都需要完整的KV缓存。我们提出了RedKnot，一个用于LLM服务的头部感知KV缓存管理系统。RedKnot通过沿KV头分解KV缓存来打破传统的单一KV缓存抽象，这些KV头的重要性和有效注意力范围在不同服务场景中显著变化。这种头部级分解将KV缓存从单一的张量抽象转变为结构化的内存对象，使RedKnot能够统一支持位置无关的KV重用、前缀KV压缩、冷/热KV分离和分布式KV放置，同时保持输出保真度并提高资源效率，无需模型重训练或微调。RedKnot通过将KV缓存从单一的被动运行时工件转变为动态的、模型感知的可扩展LLM服务的运行时基础，为AI基础设施建立了新的基础。

英文摘要

As the input length of large language model (LLM) serving continues to grow, the KV cache has become a dominant bottleneck in AI infrastructure. It limits GPU memory capacity, serving concurrency, cache reuse, and distributed scalability. Several important problems, including position-independent KV cache, prefix KV cache compression, hot/cold KV cache separation, and distributed KV cache management, all depend on how the KV cache is represented and managed. However, existing serving systems largely rely on a monolithic KV cache abstraction, where the KV cache is treated as a homogeneous sequence of token-level memory blocks and managed with similar policies across attention heads and serving scenarios. We observe that KV cache utility is highly structured across KV heads: different heads exhibit different functional roles, attention distances, and runtime importance. Therefore, a full KV cache is not always necessary for every head, token range, or serving scenario. We present RedKnot, a head-aware KV cache management system for LLM serving. RedKnot breaks the conventional monolithic KV cache abstraction by decomposing the KV cache along KV heads, whose importance and effective attention ranges vary significantly across serving scenarios. This head-level decomposition turns the KV cache from a monolithic tensor abstraction into a structured memory object, enabling RedKnot to uniformly support position-independent KV reuse, prefix KV compression, hot/cold KV separation, and distributed KV placement while preserving output fidelity and improving resource efficiency, without requiring model retraining or fine-tuning. RedKnot establishes a new foundation for AI infrastructure by transforming the KV cache from a monolithic, passive runtime artifact into a dynamic, model-aware runtime substrate for scalable LLM serving.


2606.06255 2026-06-05 cs.RO cs.CV cs.DC

RadiusFPS: Efficient Farthest Point Sampling on CPUs and GPUs via Spherical Voxel Pruning

RadiusFPS：通过球形体素剪枝在CPU和GPU上实现高效最远点采样

Ziyang Yu, Xiang Li, Qiong Chang, Jun Miyazaki

发表机构 * School of Computing（计算学院）； Institute of Science（科学研究院）； Tokyo（东京）

AI总结提出RadiusFPS框架，利用球形体素剪枝加速最远点采样（FPS），在保持标准更新规则的同时，通过保守几何边界和逐坐标的点跳过测试减少冗余计算，并在GPU上实现融合核，显著提升速度并降低内存占用。

Comments 28 pages, 15 figures


AI中文摘要

点云是机器人感知的主要感官表示，支撑着基于激光雷达的自动驾驶、同时定位与地图构建（SLAM）和导航。在这些流程中，最远点采样（FPS）是最著名的下采样算子，其均匀覆盖保留了下游感知所依赖的几何结构。然而，经典FPS较高的时间复杂度难以适应现代3D传感器每秒百万点的速率，使其成为主要的延迟瓶颈，与机器人系统的实时性要求和有限的机载计算预算相冲突。因此，我们提出RadiusFPS，一种基于球形体素剪枝的FPS加速框架，在相同初始化和打破平局策略下保留标准FPS更新规则。通过用球形体素索引点云，RadiusFPS推导出保守的几何边界，在每次迭代中剪枝冗余距离计算，并辅以逐坐标的点跳过测试去除残余更新。我们进一步引入RadiusFPS-G，一种线程束级别的GPU实现，将体素选择、剪枝和距离更新融合到内存合并的核中，消除了昂贵的全局内存往返。在室内（S3DIS、ScanNet）和室外LiDAR（SemanticKITTI）基准测试中，RadiusFPS-G相比基于GPU的FPS实现了高达2.5倍的加速，在评估方法中与QuickFPS相当或更优，同时仅使用其大约一半的GPU内存，并具有相当的分割精度。当与基于学习的FastPoint采样器结合时，生成的流程在所有评估配置中实现了最快的端到端推理。这些特性使得高质量的FPS风格采样对于延迟和内存受限的机器人视觉变得实用。

英文摘要

Point clouds are a primary sensory representation for robotic perception, underpinning LiDAR-based autonomous driving, simultaneous localization and mapping (SLAM), and navigation. Within these pipelines, Farthest Point Sampling (FPS) is the best-known downsampling operator, as its uniform coverage preserves the geometric structure on which downstream perception relies. However, the high time complexity of classical FPS scales poorly with the million-point-per-second rates of modern 3D sensors, making it a dominant latency bottleneck that conflicts with the real-time requirements and limited onboard compute budgets of robotic systems. Therefore, we propose RadiusFPS, an FPS acceleration framework based on spherical voxel pruning that preserves the standard FPS update rule under the same initialization and tie-breaking policy. By indexing the point cloud with spherical voxels, RadiusFPS derives a conservative geometric bound that prunes redundant distance computations in each iteration, complemented by a coordinate-wise point-skip test that removes residual updates. We further introduce RadiusFPS-G, a warp-level GPU implementation that fuses voxel selection, pruning, and distance update into memory-coalesced kernels, eliminating costly global-memory round-trips. On indoor (S3DIS, ScanNet) and outdoor LiDAR (SemanticKITTI) benchmarks, RadiusFPS-G attains up to 2.5x speedup over GPU-based FPS and matches or exceeds QuickFPS among the evaluated methods while using roughly half its GPU memory, with comparable segmentation accuracy. When coupled with the learning-based FastPoint sampler, the resulting pipeline achieves the fastest end-to-end inference among all evaluated configurations. These properties make high-quality FPS-style sampling practical for latency- and memory-constrained robotic vision.
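For context, the baseline that RadiusFPS accelerates can be sketched in a few lines. The following is a minimal pure-Python sketch of classical FPS, not the authors' implementation: the function name `farthest_point_sampling`, the `start` argument, and the lowest-index tie-breaking rule are assumptions made for illustration. It does not include spherical voxel pruning; it only shows the per-iteration distance update whose cost the pruning is meant to reduce.

```python
def farthest_point_sampling(points, m, start=0):
    """Classical FPS baseline (illustrative sketch, no voxel pruning).

    points: list of equal-length coordinate tuples
    m:      number of samples to select (1 <= m <= len(points))
    start:  index of the first sample (assumed initialization policy)
    """
    n = len(points)
    if not 1 <= m <= n:
        raise ValueError("m must be between 1 and len(points)")

    # min_d2[i] is the squared distance from point i to its nearest selected sample.
    min_d2 = [float("inf")] * n
    selected = [start]

    for _ in range(m - 1):
        last = points[selected[-1]]
        best_i, best_d2 = -1, -1.0
        for i, p in enumerate(points):
            # Standard FPS update: every point is compared with the newest sample.
            d2 = sum((a - b) ** 2 for a, b in zip(p, last))
            if d2 < min_d2[i]:
                min_d2[i] = d2
            # Strict '>' keeps the lowest index on ties (assumed tie-breaking policy).
            if min_d2[i] > best_d2:
                best_i, best_d2 = i, min_d2[i]
        selected.append(best_i)
    return selected


if __name__ == "__main__":
    cloud = [(0, 0, 0), (1, 0, 0), (10, 0, 0), (5, 0, 0)]
    print(farthest_point_sampling(cloud, 3))
```

Each of the `m - 1` iterations touches all `N` points, which is the `O(N * m)` cost the abstract describes as a latency bottleneck. A pruning scheme of the kind described would skip groups of points whose `min_d2` provably cannot change for the newest sample.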


2606.06252 2026-06-05 cs.AI

Closing the Loop on Latent Reasoning via Test-Time Reconstruction

通过测试时重建实现潜在推理的闭环

Xiaopeng Yuan, Haibo Jin, Ye Yu, Peng Kuang, Lijun Yu, Yushun Dong, Haohan Wang

发表机构 * University of Illinois Urbana-Champaign（伊利诺伊大学厄巴纳-香槟分校）； Google（谷歌）； Florida State University（佛罗里达州立大学）

AI总结提出ReLAT方法，利用自监督测试时训练通过查询重建损失优化潜在状态，实现潜在推理的闭环，提升数学推理、知识问答和代码生成的性能。


AI中文摘要

近期工作将中间推理从自然语言轨迹转移到潜在或缓存级表示，以减少令牌开销并避免离散通信瓶颈。然而，这种转变也消除了文本推理的一个关键优势：中间状态不再可检查，使得难以确定潜在状态是否仍保留原始查询的约束。因此，潜在推理通常以开环方式运行，即潜在状态被生成和使用，而无需基于输入的保真度检查。我们提出ReLAT（基于重建的测试时潜在推理），一种自监督测试时训练方法，利用查询本身作为参考来闭合这个循环。我们的关键观察是：如果潜在状态忠实地表示查询，则查询应能从该状态恢复；如果查询无法恢复，则潜在状态已丢失任务相关信息。ReLAT通过构建可微的“问题→潜在思考→问题”循环，并在答案生成前通过潜在思考优化查询重建损失来实现这一原则。这使不透明的潜在计算锚定到它应该代表的问题规范。在Qwen系列上的数学推理、知识问答和代码生成基准测试中，ReLAT持续优于单模型推理、基于文本的协作、开环潜在协作以及替代的测试时训练目标。在Qwen3-8B上，ReLAT将AIME 2024准确率从56.7%提升至73.3%，比最强的开环潜在基线高出16.6个百分点。

英文摘要

Recent work moves intermediate reasoning from natural-language traces into latent or cache-level representations to reduce token overhead and avoid a discrete communication bottleneck. However, this shift also removes a key advantage of textual reasoning: intermediate states are no longer inspectable, making it difficult to determine whether a latent state still preserves the constraints of the original query. As a result, latent reasoning typically operates in an open loop, where a latent state is produced and consumed without an input-anchored fidelity check. We propose ReLAT (Reconstruction-Guided Latent Reasoning At Test Time), a self-supervised test-time training method that closes this loop using the query itself as the reference. Our key observation is that if a latent state faithfully represents a query, the query should be recoverable from it; if the query cannot be recovered, the latent state has lost task-relevant information. ReLAT operationalizes this principle by constructing a differentiable Question -> Latent Thought -> Question cycle and optimizing query reconstruction loss through the latent thought before answer generation. This anchors opaque latent computation to the problem specification it is supposed to represent. Across mathematical reasoning, knowledge QA, and code generation benchmarks on the Qwen family, ReLAT consistently improves over single-model inference, text-based collaboration, open-loop latent collaboration, and alternative test-time training objectives. On Qwen3-8B, ReLAT raises AIME 2024 accuracy from 56.7% to 73.3%, a 16.6-point gain over the strongest open-loop latent baseline.


2606.06250 2026-06-05 cs.RO

Breaking Time: A Fully Gaussian Framework for Distributed and Continuous-Time SLAM

打破时间：一种用于分布式和连续时间SLAM的全高斯框架

Davide Ceriola, Simone Ferrari, Luca Di Giammarino, Leonardo Brizi, Giorgio Grisetti

发表机构 * Department of Computer, Control, and Management Engineering "Antonio Ruberti", Sapienza University of Rome（计算机、控制与管理工程系（Antonio Ruberti）, 罗马萨皮恩扎大学）； University of Stuttgart（斯图加特大学）

AI总结提出G-solver，结合高斯信念传播和高斯过程运动先验的分布式连续时间轨迹估计框架，支持异构异步传感器和多相机场景。

Comments To be published in RA-L. Open-source implementation is released at https://github.com/rvp-group/gsolver


AI中文摘要

连续时间SLAM为融合异构传感器同时估计平滑轨迹提供了原则性框架，特别适合处理具有非均匀读出模式的异构、异步传感器流，如卷帘快门相机、激光雷达扫描仪、雷达扫描或事件相机。在这项工作中，我们引入了G-solver，一个全高斯分布式框架，将高斯信念传播（GBP）与高斯过程（GP）运动先验相结合，用于连续时间轨迹估计。我们的GP模型提供了轨迹的概率表示，支持一致插值和数据驱动超参数的使用，而GBP提供了一种适用于分散设置的可扩展消息传递公式。由此产生的求解器自然地扩展到多相机场景，无需专门的同步或工程工作。我们在合成数据和真实数据上评估了该方法，包括卷帘快门和分布式多相机优化，展示了与现有连续时间方法相当的运行时间下的准确稳定估计。发布了开源实现。

英文摘要

Continuous-time SLAM provides a principled framework for fusing heterogeneous sensors while estimating smooth trajectories, and is particularly well-suited for handling heterogeneous, asynchronous sensor streams with non-uniform readout patterns, such as rolling shutter cameras, LiDAR scanners, radar sweeps, or event-based sensors. In this work, we introduce G-solver, a fully Gaussian and distributed framework that combines Gaussian Belief Propagation (GBP) with Gaussian Process (GP) motion priors for continuous-time trajectory estimation. Our GP model provides a probabilistic representation of the trajectory, enabling consistent interpolation and the use of data-driven hyperparameters, while GBP offers a scalable message-passing formulation well-suited for decentralized settings. The resulting solver naturally extends to multi-camera scenarios without specialized synchronization or engineering effort. We evaluate the approach on synthetic and real data, including rolling shutter and distributed multi-camera optimization, demonstrating accurate and stable estimation with runtimes comparable to existing continuous-time methods. An open-source implementation is released.


2606.06249 2026-06-05 cs.CV cs.LG

GRAMformer: Any-Order Modality Interactions via Volumetric Multimodal Cross-Attention

GRAMformer: 通过体积多模态交叉注意力实现任意顺序模态交互

Giordano Cicchetti, Eleonora Grassucci, Danilo Comminiello

发表机构 * Dept. of Information Engineering, Electronics, and Telecommunications, Sapienza University of Rome（信息工程、电子与电信系，罗马萨皮恩扎大学）

AI总结提出体积多模态交叉注意力（VMA）机制，通过计算查询与多模态键向量的联合几何体积来建模任意顺序的模态交互，并集成到新型多模态Transformer架构GRAMformer中，提升多模态学习的有效性和效率。

详情

AI中文摘要

基于Transformer的多模态模型依赖注意力机制来整合异构模态间的信息。尽管取得了成功，现有的多模态注意力公式通过成对点积交互的集合或将所有模态拼接成键来计算分数，即使多个模态应该被联合参与。因此，当前方法要么在模态数量上产生二次复杂度，要么无法显式建模依赖于多个表示联合配置的交互。在这项工作中，我们引入了体积多模态交叉注意力（VMA），一种新颖的交叉注意力机制，其中注意力分数被定义为查询和多个模态特定键的联合几何的函数。VMA计算跨多个模态的查询和键向量所张成的体积，捕获超越成对相似性的联合多模态依赖，实现任意顺序模态交互的原生建模。我们将VMA集成到我们新颖的多模态Transformer架构中，命名为GRAMformer，该架构专门设计用于整合任意数量的模态。我们在多模态学习任务上评估了所提出的模型，展示了改进的有效性和效率。

英文摘要

Transformer-based multimodal models rely on attention mechanisms to integrate information across heterogeneous modalities. Despite their success, existing multimodal attention formulations compute their scores through collections of pairwise dot-product interactions or by concatenating all the modalities into the keys, even when multiple modalities should be jointly involved. As a consequence, current approaches either incur quadratic complexity in the number of modalities or fail to explicitly model interactions that depend on the joint configuration of multiple representations. In this work, we introduce the Volumetric Multimodal cross-Attention (VMA), a novel cross-attention mechanism in which attention scores are defined as a function of the joint geometry of a query and multiple modality-specific keys. VMA computes the volume spanned by query and key vectors across multiple modalities, capturing joint multimodal dependencies beyond pairwise similarity, enabling native modeling of any-order modality interactions. We integrate VMA into our novel multimodal transformer architecture, named GRAMformer, explicitly designed to integrate any number of modalities. We evaluate the proposed model on multimodal learning tasks, demonstrating improved effectiveness and efficiency.
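To make "the volume spanned by query and key vectors" concrete, here is a minimal sketch of one standard way to compute such a volume: the square root of the determinant of the Gram matrix of the vectors. Whether VMA uses exactly this formula, and how the volume is turned into an attention score, is not stated in the abstract, so the function name `gram_volume` and the example vectors are assumptions for illustration only.

```python
import math


def gram_volume(vectors):
    """Volume of the parallelotope spanned by `vectors` via sqrt(det(Gram)).

    Illustrative sketch: the Gram matrix G[i][j] = <v_i, v_j> is reduced by
    Gaussian elimination with partial pivoting to obtain its determinant.
    """
    k = len(vectors)
    gram = [[sum(a * b for a, b in zip(u, v)) for v in vectors] for u in vectors]

    det = 1.0
    for c in range(k):
        # Pick the row with the largest entry in column c as the pivot.
        pivot = max(range(c, k), key=lambda r: abs(gram[r][c]))
        if abs(gram[pivot][c]) < 1e-12:
            return 0.0  # linearly dependent vectors span zero volume
        if pivot != c:
            gram[c], gram[pivot] = gram[pivot], gram[c]
            det = -det
        det *= gram[c][c]
        for r in range(c + 1, k):
            factor = gram[r][c] / gram[c][c]
            for j in range(c, k):
                gram[r][j] -= factor * gram[c][j]
    # Clamp tiny negative values caused by floating-point round-off.
    return math.sqrt(max(det, 0.0))


if __name__ == "__main__":
    query = [1.0, 0.0, 0.0]
    key_audio = [0.0, 1.0, 0.0]   # hypothetical modality-specific keys
    key_video = [0.0, 0.0, 1.0]
    print(gram_volume([query, key_audio, key_video]))
```

The volume shrinks toward zero as the query and all keys become aligned, and it is a single quantity that depends on every vector jointly rather than on a sum of pairwise dot products. This is the sense in which a volume can capture interactions beyond pairwise similarity.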


2606.06245 2026-06-05 cs.RO cs.AI

MPCoT: Reward-Guided Multi-Path Latent Reasoning for Test-Time Scalable Vision-Language-Action

MPCoT: 奖励引导的多路径潜在推理用于测试时可扩展的视觉-语言-动作

Boyang Zhang, Lianlei Shan

发表机构 * Department of Electrical and Computer Engineering, Boston University（波士顿大学电气与计算机工程系）； Department of Computer Science, Tsinghua University（清华大学计算机系）

AI总结提出MPCoT框架，通过奖励引导的多路径潜在推理，在保持零推理令牌和原始动作接口的同时，提升长时域和高不确定性控制任务中的VLA策略性能。

Comments 14 pages, 5 figures, submitted to CoRL


AI中文摘要

视觉-语言-动作（VLA）策略在长时域和高不确定性控制中仍然脆弱，其中单次动作解码提供的推理时思考有限。显式的思维链可以增加推理深度，但引入了令牌延迟和间接的文本到动作接口。我们提出MPCoT，一个奖励引导的多路径潜在推理框架，初始化$M$个假设，通过$K$个权重共享步骤细化它们，并在动作解码前进行软聚合。一个仅用于训练的路径偏好目标使用专家动作一致性、基于世界模型/VLM的进展和成功反馈来评估候选动作分支，使潜在路径评分器与下游执行质量对齐。MPCoT保留原始的8步动作接口，生成零推理令牌，并暴露可配置的推理控制（$K$，$M$）。在LIBERO和CALVIN上的匹配协议下，MPCoT提升了长时域性能，消融实验证实了深度-宽度效应、置信度加权聚合和奖励引导的路径监督。

英文摘要

Vision-Language-Action (VLA) policies remain brittle in long-horizon and high-uncertainty control, where one-pass action decoding provides limited inference-time deliberation. Explicit chain-of-thought can increase reasoning depth, but introduces token latency and an indirect text-to-action interface. We propose MPCoT, a reward-guided multi-path latent reasoning framework that initializes $M$ hypotheses, refines them for $K$ weight-tied steps, and softly aggregates them before action decoding. A training-only path-preference objective evaluates candidate action branches with expert-action consistency, world-model/VLM-based progress, and success feedback to align the latent path scorer with downstream execution quality. MPCoT preserves the original 8-step action interface, generates zero reasoning tokens, and exposes configurable inference controls ($K$, $M$). Under matched protocols on LIBERO and CALVIN, MPCoT improves long-horizon performance, with ablations confirming depth-width effects, confidence-weighted aggregation, and reward-guided path supervision.


2606.06242 2026-06-05 cs.CL cs.AI cs.CV cs.IR

Benchmarking Open-Source Layout Detection Models for Data Snapshot Extraction from Institutional Documents

面向机构文档数据快照提取的开源布局检测模型基准测试

AJ Carl P. Dy, Aivin V. Solatorio

发表机构 * Development Data Group Office of the World Bank Group Chief Statistician（世界银行集团首席统计师办公室发展数据组）； The World Bank（世界银行）

AI总结针对机构文档中图表数据快照提取任务，构建基准数据集并评估多个开源布局检测模型，发现现有模型在操作型文档上泛化能力不足，存在内容混淆、碎片化及上下文缺失等问题。

Comments 23 pages, 8 figures


AI中文摘要

机构文档中的图表包含大量操作和分析信息。当前从文档中提取视觉内容的方法主要围绕通用文档布局分析，将图表视为同等相关的文档对象，而非具有语义意义的分析产物。在这项工作中，我们引入了一个基准数据集和评估框架，用于数据快照提取，即识别和定位机构文档中具有语义意义的视觉产物的任务。该基准涵盖人道主义报告、世界银行政策研究工作论文和项目评估文件，并为含有可重用分析信息的图表提供了标注。利用该数据集，我们对多个开源布局检测模型进行了基准测试，并评估了检测性能和空间提取质量。结果表明，尽管当前模型在传统学术基准上表现强劲，但在操作型机构文档上难以泛化。常见的失败模式包括分析内容与非分析内容混淆、复合分析产物碎片化，以及解释所需的上下文信息提取不完整。这些发现凸显了通用文档布局分析与操作上有用的数据快照提取之间持续存在的差距。我们发布了源PDF、注释数据集、元数据和源代码，以支持操作型文档智能的未来研究。数据集可在https://huggingface.co/datasets/ai4data/data-snapshot获取，源代码可在https://github.com/worldbank/ai4data/tree/main/experimental/data-snapshot获取。

英文摘要

Institutional documents contain substantial amounts of operational and analytical information embedded within figures and tables. Current approaches for extracting visual content from documents are largely built around generic document layout analysis, where figures and tables are treated as uniformly relevant document objects rather than semantically meaningful analytical artifacts. In this work, we introduce a benchmark dataset and evaluation framework for data snapshot extraction, the task of identifying and localizing semantically meaningful visual artifacts within institutional documents. The benchmark spans humanitarian reports, World Bank policy research working papers, and project appraisal documents, and includes annotations for figures and tables that contain reusable analytical information. Using this dataset, we benchmarked multiple open-source layout detection models and evaluated both detection performance and spatial extraction quality. Our results show that current models struggle to generalize to operational institutional documents despite strong performance on conventional academic benchmarks. Common failure modes include confusion between analytical and non-analytical content, fragmentation of composite analytical artifacts, and incomplete extraction of contextual information required for interpretation. These findings highlight a persistent gap between generic document layout analysis and operationally useful data snapshot extraction. We release the source PDFs, annotation dataset, metadata, and source code to support future research in operational document intelligence. The dataset is available at https://huggingface.co/datasets/ai4data/data-snapshot and the source code is available at https://github.com/worldbank/ai4data/tree/main/experimental/data-snapshot.


2606.06238 2026-06-05 cs.LG cond-mat.stat-mech hep-lat

Generative Criticality in Large Language Model Temperature Scaling

大型语言模型温度缩放中的生成临界性

Huajian Ruan, Jinyang Li, Xingyu Guo, Lingxiao Wang

发表机构 * State Key Laboratory of Nuclear Physics and Technology, Institute of Quantum Matter, South China Normal University（核物理与技术国家重点实验室，量子物质研究院，华南师范大学）； Key Laboratory of Atomic and Subatomic Structure and Quantum Control (MOE), Guangdong-Hong Kong Joint Laboratory of Quantum Matter（原子亚原子结构与量子调控教育部重点实验室，粤港量子物质联合实验室）； Guangdong Basic Research Center of Excellence for Structure and Fundamental Interactions of Matter, Guangdong Provincial Key Laboratory of Nuclear Science（广东省物质结构与基本相互作用基础研究卓越中心，广东省核科学重点实验室）； KEK Theory Center, Institute of Particle and Nuclear Studies（KEK理论中心，粒子与核物理研究所）； RIKEN Center for Interdisciplinary Theoretical and Mathematical Sciences (iTHEMS), Wako（理化学研究所（RIKEN）交叉理论与数理科学中心（iTHEMS），和光）； Graduate University for Advanced Studies (SOKENDAI), Oho 1-1, Tsukuba, Ibaraki（综合研究大学院大学（SOKENDAI），茨城县筑波市大穗1-1）； Institute for Physics of Intelligence, The University of Tokyo（智能物理研究所，东京大学）

AI总结通过统计场框架研究大型语言模型文本生成中的温度缩放，发现接近特征温度时出现类似连续相变的临界现象，为理解解码策略与临界现象的联系提供定量工具。

Comments 9 pages, 7 figures, contributed to PAI 2026 Conference


AI中文摘要

我们为大型语言模型（LLM）生成的文本提出一个统计场框架，将token嵌入视为一维链上的连续自旋变量。通过连通两点关联函数定义磁化率，并通过系综平均嵌入场定义序参量，我们改变softmax温度$T$，观察到在特征温度$T_c$附近出现尖锐的磁化率峰，并伴有类幂律的标度行为，序参量同时发生快速变化，并在$T_c$以下坍缩到单一语义方向。由两近邻（TwoNN）方法估计的内在维度独立地印证了这些发现，在$T_c$附近达到最小值。结果在模型规模（Qwen3：0.6B--32B）和提示类别上具有鲁棒性。虽然其现象学与连续相变非常相似，但自回归生成的非平衡性质仍需进一步研究。我们的框架为探测LLM输出的集体统计结构提供了定量工具，并暗示了解码策略与临界现象之间的联系。

英文摘要

We propose a statistical-field framework for text generated by large language models (LLMs), treating token embeddings as continuous spin variables on a one-dimensional chain. Defining a susceptibility from the connected two-point correlator and an order parameter from the ensemble-averaged embedding field, we vary the softmax temperature $T$ and observe a sharp susceptibility peak near a characteristic $T_c$ with power-law-like scaling, a concurrent rapid change in the order parameter, and a collapse onto a single semantic direction below $T_c$. The intrinsic dimension estimated by the two nearest neighbor (TwoNN) method independently corroborates these findings, reaching a minimum near $T_c$. Results are robust across model scales (Qwen3: 0.6B--32B) and prompt categories. While the phenomenology closely resembles a continuous phase transition, the non-equilibrium nature of autoregressive generation warrants further investigation. Our framework provides quantitative tools for probing the collective statistical structure of LLM outputs and suggests connections between decoding strategies and critical phenomena.
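As a reminder of what the control parameter $T$ does, the sketch below applies temperature scaling to a toy logit vector and reports the entropy of the resulting next-token distribution. It is a generic illustration, not the paper's analysis code: the function names and the example logits are assumptions, and the susceptibility and order-parameter estimators from the abstract are not reproduced here.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Return softmax(logits / temperature) as a list of probabilities."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    shift = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)


if __name__ == "__main__":
    toy_logits = [2.0, 1.0, 0.1]  # hypothetical next-token logits
    for t in (0.1, 1.0, 10.0):
        probs = softmax_with_temperature(toy_logits, t)
        print(t, [round(p, 4) for p in probs], round(entropy(probs), 4))
```

Low temperatures concentrate probability on the top token and high temperatures flatten the distribution, so sweeping `temperature` moves generation between a near-deterministic regime and a near-uniform one. The paper's claim concerns how collective statistics of the generated text change along that sweep.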


2606.06236 2026-06-05 cs.LG

Tracing the Oracle: Improving Diffusion Timestep Scheduling for 3D CT Reconstruction

追踪神谕：改进扩散时间步调度用于3D CT重建

Yujia Wu, Zhaoqiang Liu

发表机构 * School of Computer Science and Engineering, University of Electronic Science and Technology of China（电子科技大学计算机科学与工程学院）

AI总结针对3D CT重建中扩散模型推理计算开销大且均匀时间步调度引入大截断误差的问题，提出即插即用的TrO框架，通过动态规划优化时间步调度，在有限采样步数下显著提升重建保真度和计算效率。

Comments Accepted to ECML-PKDD 2026


AI中文摘要

预训练扩散模型在解决高度病态的3D计算机断层扫描（CT）逆问题中展现出令人印象深刻的潜力，但推理过程存在显著的计算开销。此外，现有的均匀时间步调度未能捕捉反向条件扩散随机微分方程的非均匀演化，从而引入了大量截断误差。为克服这一限制，我们提出Tracing the Oracle（TrO），一种用于改进时间步调度的即插即用框架。具体而言，我们将少量样本上的密集采样数值积分轨迹视为参考神谕。通过动态规划全局最小化少步近似与神谕之间的累积误差，提取优化后的调度。该机制将有限的采样步精确分配到对截断误差高度敏感的关键演化阶段。我们在AAPM数据集上的多个3D CT重建任务中进行的广泛实验表明，与最先进的3D CT重建方法DDS结合时，我们的优化时间步相比现有启发式调度显著提高了重建保真度和计算效率，尤其是在不超过10个采样步的严格预算下。

英文摘要

Pretrained diffusion models demonstrate impressive potential in solving highly ill-posed 3D computed tomography (CT) inverse problems, while the inference process suffers from significant computational overhead. Furthermore, existing uniform timestep schedules fail to capture the non-uniform evolution of the reverse conditional diffusion stochastic differential equation, thereby introducing substantial truncation errors. To overcome this limitation, we propose Tracing the Oracle (TrO), a plug-and-play framework for improved timestep scheduling. Specifically, we treat densely sampled numerical integration trajectories on a few samples as the reference oracle. The optimized schedule is extracted by leveraging dynamic programming to globally minimize the cumulative error between the few-step approximation and the oracle. This mechanism precisely allocates the limited sampling steps to critical evolution stages that are highly susceptible to truncation errors. Our extensive experiments on the AAPM dataset across multiple 3D CT reconstruction tasks demonstrate that, when combined with the state-of-the-art 3D CT reconstruction method DDS, our optimized timesteps significantly improve reconstruction fidelity and computational efficiency compared to existing heuristic schedules, especially under a strict budget of no more than 10 sampling steps.
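The dynamic-programming step described above can be sketched as a fixed-length shortest-path problem over a dense timestep grid. This is an illustrative sketch under stated assumptions, not the TrO implementation: the function name `optimal_schedule` and the `cost` matrix are hypothetical, where `cost[i][j]` stands in for the measured error against the oracle trajectory when jumping directly from dense index `i` to dense index `j`.

```python
def optimal_schedule(cost, num_steps):
    """Pick `num_steps` jumps from dense index 0 to n with minimal total cost.

    cost[i][j] (for i < j) is an assumed, precomputed error of jumping from
    dense timestep index i directly to index j, measured against the oracle.
    Returns (path, total_cost), where path lists the visited dense indices.
    """
    n = len(cost) - 1
    if not 1 <= num_steps <= n:
        raise ValueError("num_steps must be between 1 and the number of dense intervals")

    inf = float("inf")
    # best[k][j]: minimal cumulative error to reach index j using exactly k jumps.
    best = [[inf] * (n + 1) for _ in range(num_steps + 1)]
    prev = [[-1] * (n + 1) for _ in range(num_steps + 1)]
    best[0][0] = 0.0

    for k in range(1, num_steps + 1):
        for j in range(1, n + 1):
            for i in range(j):
                candidate = best[k - 1][i] + cost[i][j]
                if candidate < best[k][j]:
                    best[k][j] = candidate
                    prev[k][j] = i

    # Backtrack from the final index to recover the chosen timesteps.
    path = [n]
    j = n
    for k in range(num_steps, 0, -1):
        j = prev[k][j]
        path.append(j)
    path.reverse()
    return path, best[num_steps][n]


if __name__ == "__main__":
    n = 4
    # Toy cost: jumps that start at index 0 are ten times more error-prone.
    toy_cost = [[(j - i) ** 2 * (10 if i == 0 else 1) for j in range(n + 1)]
                for i in range(n + 1)]
    print(optimal_schedule(toy_cost, 2))
```

With the toy cost, the search spends a short first jump where the error is largest and a long second jump where it is small, which mirrors the idea of allocating a limited step budget to the stages most sensitive to truncation error. The search is `O(K * n^2)` for `K` steps over `n` dense intervals.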


2606.06235 2026-06-05 cs.LG cs.AI

Design a Reliable LLM-Integrated Interface for Mortality Forecasting

设计一个可靠的LLM集成接口用于死亡率预测

Thi Kim Ngan Nguyen

发表机构 * Curtin University（科廷大学）

AI总结提出一个结合大语言模型（LLM）的接口，通过自然语言输入驱动确定性预测流程，在保持统计精度的同时提升非专家用户的可及性。

Comments 7 pages, 7 figures


AI中文摘要

死亡率预测在精算和政策决策中扮演重要角色，但其实现仍然技术复杂且对非专家用户不友好。本项目提出一个可靠的大语言模型（LLM）集成接口，在保持统计功效的同时提升可用性。LLM被设计为一个约束编排层，将自然语言输入转化为确定性预测流程的结构化配置。采用三阶段方法确保准确性、可用性和透明度。首先，使用CoMoMo包实现基线流程，复现已建立的死亡率预测结果。其次，扩展流程以使用滚动原点评估和均方误差（MSE）生成多步预测。第三，原型接口使用本地LLM以自然语言处理用户的预测请求。该系统表明，LLM可以在不损害高敏感性分析工作流中的可重复性、透明度或精算有效性的前提下增强可访问性。

英文摘要

Mortality forecasting plays an important role in actuarial and policy decision-making, but its implementation remains technically complex and inaccessible to non-expert users. This project proposes a reliable large language model (LLM)-integrated interface that improves usability while maintaining statistical power. The LLM is designed as a constrained orchestration layer that translates natural-language inputs into structured configurations for a deterministic forecasting pipeline. A three-phase methodology is employed to ensure accuracy, usability, and transparency. First, a baseline pipeline is implemented using the CoMoMo package, reproducing established mortality forecasting results. Second, the pipeline is extended to generate multi-step forecasts using rolling-origin evaluation and mean squared error (MSE). Third, a prototype interface uses a local LLM to handle users' forecasting requests in plain language. The system demonstrates that LLMs can enhance accessibility without compromising reproducibility, transparency, or actuarial validity in high-stakes analytical workflows.
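The rolling-origin evaluation mentioned in the second phase can be sketched as follows. This is a minimal illustration and does not use the CoMoMo package: the helper `last_value_forecast` is a hypothetical stand-in for a real mortality model, and the series is toy data rather than mortality rates.

```python
def rolling_origin_mse(series, forecast_fn, min_train, horizon):
    """Mean squared error over all rolling forecast origins.

    At each origin the model sees series[:origin] and predicts the next
    `horizon` values; the origin then advances by one observation.
    """
    squared_errors = []
    for origin in range(min_train, len(series) - horizon + 1):
        train = series[:origin]
        actual = series[origin:origin + horizon]
        predicted = forecast_fn(train, horizon)
        squared_errors.extend((a - p) ** 2 for a, p in zip(actual, predicted))
    if not squared_errors:
        raise ValueError("series is too short for the given min_train and horizon")
    return sum(squared_errors) / len(squared_errors)


def last_value_forecast(train, horizon):
    """Hypothetical baseline: repeat the last observed value."""
    return [train[-1]] * horizon


if __name__ == "__main__":
    toy_series = [1, 2, 3, 4, 5, 6]  # placeholder data, not mortality rates
    print(rolling_origin_mse(toy_series, last_value_forecast, min_train=3, horizon=2))
```

Because `forecast_fn` is a plain deterministic function of the training window, the same configuration always yields the same score. That is the property a constrained LLM layer would need to preserve when it only translates a user's request into values such as `min_train` and `horizon`.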


2606.06228 2026-06-05 cs.CV

SAM-Flow: Source-Anchored Masked Flow for Training-Free Image Editing

SAM-Flow：源锚定掩码流用于免训练图像编辑

Haowang Cui, Rui Chen, Tao Luo, Tao Guo, Zheng Qin, Jiaze Wang

发表机构 * Tianjin Key Laboratory of Imaging and Sensing Microelectronic Technology, School of Microelectronics, Tianjin University（天津影像与传感微电子技术重点实验室，微电子学院，天津大学）； School of Cyber Security, Tianjin University（网络安全学院，天津大学）

AI总结提出SAM-Flow框架，通过源锚定掩码流和动态软掩码机制实现局部免训练图像编辑，有效防止背景泄漏。

Comments Code is available at: https://github.com/chwbob/Sam-Flow


AI中文摘要

免训练图像编辑最近因能够利用强大的预训练扩散和流匹配模型修改真实图像而无需额外训练，引起了越来越多的关注。然而，现有的基于反演和基于差分流的方法通常执行全局潜在传输，这不可避免地会将编辑效果传播到非目标区域并导致背景泄漏。为了解决这个问题，我们提出了SAM-Flow，一种源锚定掩码流框架，用于局部免训练图像编辑。SAM-Flow不是更新整个潜在表示，而是首先使用侦察图像和令牌接地注意力图来定位可编辑的语义区域。然后，它仅在这些区域内应用差分速度更新，同时将剩余区域锚定到源图像潜在轨迹。为了进一步提高空间稳定性和边界自然性，我们引入了一种时变源锚定投影机制，具有动态软掩码、过渡区域和时间掩码累积。所提出的方法是即插即用的，可以集成到主流流匹配骨干网络（如Stable Diffusion 3和FLUX）中，无需任何微调。大量的定性和定量实验表明，SAM-Flow实现了准确的语义编辑，同时显著改善了背景保持，为免训练图像编辑提供了一种简单且通用的局部编辑范式。代码可在 https://github.com/chwbob/Sam-Flow 获取。

英文摘要

Training-free image editing has recently attracted increasing attention due to its ability to modify real images using powerful pre-trained diffusion and flow-matching models without additional training. However, existing inversion-based and differential-flow-based methods usually perform global latent transport, which inevitably propagates editing effects to non-target regions and leads to background leakage. To address this problem, we propose SAM-Flow, a source-anchored masked flow framework for localized training-free image editing. Instead of updating the whole latent representation, SAM-Flow first uses a scout image and token-grounded attention maps to localize the editable semantic regions. It then applies differential velocity updates only within these regions, while anchoring the remaining areas to the source-image latent trajectory. To further improve spatial stability and boundary naturalness, we introduce a time-varying source-anchored projection mechanism with dynamic soft masks, transition regions, and temporal mask accumulation. The proposed method is plug-and-play and can be integrated with mainstream flow-matching backbones such as Stable Diffusion 3 and FLUX without any fine-tuning. Extensive qualitative and quantitative experiments demonstrate that SAM-Flow achieves accurate semantic editing while significantly improving background preservation, providing a simple and general localized editing paradigm for training-free image editing. Code is available at: https://github.com/chwbob/Sam-Flow.
