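arXivDaily: a daily academic digest synced with the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems science, and other fields.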

2605.30076 2026-05-29 cs.CL

UniSteer: Text-Guided Flow Matching in Activation Space for Versatile LLM Steering

Yingdong Shi, Ruiming Zhang, Changming Li, Zhiyu Yang, Kaixing Zhang, Jingyi Yu, Kan Ren

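AI summary: Proposes UniSteer, a text-guided activation flow matching model that learns a conditional distribution over residual-stream activations to unify behavioral control, truthfulness steering, fine-grained concept steering, multi-constraint instruction following, and activation-space classification.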

Comments 16 pages, 4 figures

English abstract

Activation-based control steers large language models (LLMs) by intervening on their internal representations during inference, and has emerged as an effective paradigm for controlling behaviors such as persona and style. However, existing methods often rely on fixed steering directions or task-specific intervention modules, making them difficult to adapt to fine-grained concepts and compositional constraints. We propose UniSteer, a text-guided activation flow matching model that learns a conditional distribution over residual-stream activations from natural-language conditions. Instead of fitting a separate intervention for each target behavior, UniSteer learns a universal conditional velocity field in activation space. At inference time, UniSteer performs flow inversion by partially transporting a source activation toward a latent state and regenerating it under a target textual condition before injecting it back into the frozen LLM. The same conditional model supports activation-space classification by selecting the textual label with the lowest reconstruction energy. Experiments on three target LLMs show that UniSteer provides a unified interface across behavioral control, truthfulness steering, fine-grained concept steering, multi-constraint instruction following, and activation-space classification.

2605.30075 2026-05-29 cs.LG cs.DC

Q-ANCHOR: Federated Quantum Learning with ZNE-guided Correction

Hoang M. Ngo, Quan Nguyen, Wanli Xing, My T. Thai

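AI summary: Targets two problems in quantum federated learning, client drift from non-IID data and hardware bias from quantum hardware noise, by proposing the Q-ANCHOR aggregation architecture, which anchors server updates with zero-noise extrapolation and applies stateful client correction; theory shows that it mitigates both kinds of drift, and experiments show more stable training.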

English abstract

Quantum Federated Learning (QFL) offers a promising framework to train quantum models across distributed clients while keeping data strictly local. Due to its simplicity and low communication overhead, Federated Averaging (FedAvg) is the standard aggregation choice in QFL literature. However, deploying QFL on practical hardware exposes a severe double-drift phenomenon: the global model is simultaneously derailed by client drift from non-IID data and hardware bias from noisy quantum gradient estimates. In this work, we first analyze the convergence of FedAvg under these realistic conditions, mathematically demonstrating that quantum hardware bias creates a persistent error floor that standard averaging cannot correct. To overcome this limitation, we propose Q-ANCHOR, a quantum-aware federated aggregation architecture that anchors server updates with zero-noise extrapolation while applying stateful client correction to suppress both client drift and hardware-induced bias. Our convergence theory proves that Q-ANCHOR successfully mitigates classical client drift while actively reducing the hardware-bias floor. Experimental results demonstrate that Q-ANCHOR achieves significantly more stable training than conventional FL baselines.
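
For readers unfamiliar with zero-noise extrapolation (ZNE), the generic idea can be sketched as follows: an expectation value is measured at several amplified noise levels, and the polynomial through those points is evaluated at zero noise. This is a minimal illustration of Richardson-style extrapolation only; it is not Q-ANCHOR's aggregation rule, which the abstract does not specify. The function names, the scale factors, and the synthetic noise model are all assumptions made for the example.

```python
def richardson_zne(scale_factors, noisy_values):
    """Extrapolate measurements taken at amplified noise levels to zero noise.

    Evaluates at 0 the unique polynomial passing through the points
    (scale_factors[i], noisy_values[i]), using Lagrange basis weights.
    """
    if len(scale_factors) != len(noisy_values):
        raise ValueError("need one measurement per scale factor")
    if len(set(scale_factors)) != len(scale_factors):
        raise ValueError("scale factors must be distinct")

    estimate = 0.0
    for i, (c_i, e_i) in enumerate(zip(scale_factors, noisy_values)):
        weight = 1.0
        for j, c_j in enumerate(scale_factors):
            if j != i:
                weight *= c_j / (c_j - c_i)  # Lagrange basis evaluated at 0
        estimate += weight * e_i
    return estimate


def synthetic_expectation(scale):
    """Hypothetical noise model: noiseless value 1.0 plus a quadratic bias."""
    return 1.0 - 0.2 * scale + 0.05 * scale ** 2


# Measure at three noise scale factors, then extrapolate back to scale 0.
scales = [1.0, 2.0, 3.0]
values = [synthetic_expectation(c) for c in scales]
zero_noise_estimate = richardson_zne(scales, values)
```

Because the synthetic bias is exactly quadratic, three scale factors recover the noiseless value up to floating-point error. On real hardware the bias is not an exact polynomial and the measurements carry shot noise, so the extrapolated value is only an estimate.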

2605.30073 2026-05-29 cs.CV

Native Audio-Visual Alignment for Generation

Longbin Ji, Guan Wang, Xuan Wei, Chenye Yang, Xiangrui Liu, Zhenyu Zhang, Shuohuan Wang, Yu Sun, Jingzhou He

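AI summary: Proposes the NAVA framework, which uses native audio-visual alignment and context-conditioned joint denoising to achieve high-quality, synchronized, and controllable audio-video generation.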

Comments Project page: https://ernie-research.github.io/NAVA/

English abstract

Joint audio-video generation aims to synthesize temporally synchronized and semantically coherent visual-acoustic content. However, existing open-source methods mainly rely on either dual-tower designs with posterior alignment or fully unified tri-modal designs that mix textual context, audio and video in one shared space. The former weakens fine-grained audio-video co-evolution, while the latter couples semantic conditioning with low-level synchronization. To address these limitations, we propose NAVA, a Native Audio-Visual Alignment framework for joint audio-video generation. NAVA is built upon context-conditioned native audio-visual alignment: it first establishes audio-video correspondence in a dedicated interaction space, and then uses external context to condition the joint denoising process. Specifically, NAVA is instantiated with an Align-then-Fuse MMDiT architecture, which transitions from modality-aware audio-video alignment to modality-shared joint denoising. Furthermore, we introduce Timbre-in-Context Conditioning to associate reference timbre cues with corresponding speech spans to achieve controllable speech timbre. Experiments on Verse-Bench and Seed-TTS, together with a user study, demonstrate that NAVA achieves superior video quality, precise audio-visual synchronization, competitive audio quality, and stronger reference-timbre controllability using only 6.3B parameters.

2605.30070 2026-05-29 cs.LG cs.AI

A Predictive Law for On-Policy Self-Distillation From World Feedback

Tommy He, Jerome Sieber, Matteo Saponati

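AI summary: Finds a linear relationship in on-policy self-distillation (OPSD) between the initial student-teacher performance gap and the final performance improvement, and proposes a predictive law for anticipating the effect of an OPSD configuration before training.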

English abstract

Moving beyond simple scalar rewards toward richer world feedback is a natural path to more scalable RL post-training. On-policy self-distillation (OPSD) is a promising recent approach that uses arbitrary feedback as a learning signal, yet its reliability compared to established methods, such as GRPO, remains unclear. We identify a strikingly consistent linear correlation between the initial student-self-teacher performance gap and the final performance improvement in OPSD. This relationship holds across context types and model families, providing a powerful predictive law for anticipating the outcome of an OPSD configuration without running the full training procedure. Interestingly, we show that this linear predictability holds with model scale, suggesting a potential basis for new empirical scaling laws on larger models with stronger in-context learning capabilities. In essence, our findings show that OPSD performance can be predicted and tuned before training, offering a principled way to incorporate world feedback as a first-class component of the post-training pipeline.

2605.30065 2026-05-29 cs.CV

Boosting Zero-Shot 3D Style Transfer with 2D Pre-trained Priors

Xin Dong, Yunzhi Teng, Wenfeng Deng, Yansong Tang

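AI summary: Proposes the Data-Sufficient StyleGaussian model, which integrates a decoder pre-trained on large-scale 2D image datasets and combines feature Gaussian splatting with deferred stylization to deliver high-quality, multi-view-consistent rendering for zero-shot 3D style transfer under data scarcity.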

Comments Accepted by IEEE IVMSP2026

English abstract

In this work, we focus on zero-shot 3D style transfer that can generate multi-view consistent stylized views of the 3D scene given an arbitrary style image. We primarily tackle the issue of data scarcity in 3D style transfer, which arises when each model is trained on only a single scene, thereby limiting the number of available content images. This scarcity significantly hampers stylization performance, as model optimization relies on a sufficient number of content-style image pairs to provide supervisory signals. Our core idea is to integrate a decoder pre-trained on large-scale 2D image datasets into the 3D style transfer pipeline, thereby leveraging the prior knowledge encoded in the decoder from learning over numerous content-style image pairs. Our method combines feature Gaussian splatting and deferred stylization, enabling high-quality stylization with the data-sufficient decoder network while ensuring view consistency by unifying view-dependent operations into a view-invariant process. Experiments demonstrate that our Data-Sufficient StyleGaussian (DS-StyleGaussian) model outperforms existing zero-shot 3D style transfer methods in terms of visual quality across various datasets. This work also suggests that 2D pre-training can serve as a strong enhancement for 3D tasks, bridging the data gap between 2D and 3D.

2605.30062 2026-05-29 cs.CV

FakeVLM-R1: Internalizing Physical Laws via CoT for Synthetic Image Detection

Leqi Zhu, Junyan Ye, Kaiqing Lin, Zhiyuan Yan, Conghui He, Weijia Li

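AI summary: Proposes the FakeVLM-R1 framework, which combines supervised fine-tuning, Group Relative Policy Optimization, and a critical-thinking chain-of-thought mechanism; using bidirectional dialectical reasoning and physical commonsense to construct authenticity counter-proofs, it achieves high-precision, logically interpretable synthetic image detection and addresses the over-rejection bias of existing methods.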

English abstract

The development of generative artificial intelligence technologies has propelled the visual realism of synthetic images to an unprecedented level. Although current interpretable detection methods based on Large Multimodal Models (LMMs) have made certain progress, they still rely on imitation learning derived from massive volumes of forged data. Consequently, they lack genuine causal reasoning capabilities and are prone to explanatory hallucinations. To overcome this bottleneck, we propose FakeVLM-R1, aiming to endow the model with human-like critical thinking capabilities when performing synthetic detection tasks. Building upon Supervised Fine-Tuning (SFT), this framework integrates Group Relative Policy Optimization (GRPO) with a Critical Thinking Chain-of-Thought (CoT) mechanism. During the inference phase, the model executes a "bidirectional dialectical reasoning" process: while proposing a forgery hypothesis, it must simultaneously invoke physical commonsense to construct an authenticity counter-proof. Furthermore, we constructed the FakeClue++ dataset with high-quality samples, which extensively introduces annotations guided by the physical laws of authentic images, providing a unified authenticity anchor for the model. Experiments confirm that FakeVLM-R1 achieves SOTA performance among the evaluated models across multiple benchmarks. It not only achieves high-precision, logically interpretable detection but also resolves the over-rejection bias of existing methods against real images, demonstrating generalization and robustness against perturbations.

2605.30059 2026-05-29 cs.LG cond-mat.stat-mech stat.ML

Ridge Regression from Poisson Resetting: A Renewal Perspective on Spectral Regularization

Petar Jolakoski

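AI summary: Connects stochastic resetting in non-equilibrium statistical physics with ridge regularization in statistical learning, shows that for linear gradient flow, resetting to the origin at rate r yields a stationary mean equal to the ridge estimator, and extends the result to general renewal reset laws that generate alternative spectral filters.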

English abstract

We connect stochastic resetting from non-equilibrium statistical physics with ridge regularization in statistical learning. For linear gradient flow, resetting to the origin at rate $r$ produces stationary mean $(X^\top X+rI)^{-1}X^\top y$, exactly the ridge estimator with penalty $\lambda=r$. This uses the known Laplace-transform relationship between ridge regression and exponential-time averaging of gradient flow, with the exponential time now interpreted as the stationary age associated with Poisson resetting. We then extend this identity to general renewal reset laws: the exponential reset time distribution is the unique renewal law whose stationary mean reproduces scalar ridge in every eigendirection as an exact filter identity for every positive curvature, while non-exponential renewal laws generate alternative spectral filters. At the fluctuation level, we study a separate additive Ornstein-Uhlenbeck extension with constant diffusion, interpreted as a stylized SGD approximation. In this setting, the equality holds only at the level of the mean, since the reset process has a nonzero stationary covariance from accumulated OU noise and reset-timing variance, whereas deterministic ridge is a fixed estimator with the same center. Stylized experiments compare the deterministic renewal-induced filters directly and illustrate when filters induced by non-exponential reset-time laws can differ predictively from ridge. The results for the stationary mean and the induced spectral filters are established for continuous-time gradient flow with isotropic resetting on quadratic objectives; the covariance and risk formulas additionally assume additive noise with state-independent covariance.
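
The scalar identity behind this abstract can be checked numerically. The sketch below assumes gradient flow on the quadratic loss ½‖Xθ − y‖² started at the origin, so that along an eigendirection of X^⊤X with eigenvalue s > 0 the flow at time t applies the filter (1 − e^{−st})/s to X^⊤y. Averaging that filter over an Exp(r) age should give 1/(s + r), the ridge filter with penalty r. The function names and the quadrature settings are illustrative choices, not taken from the paper.

```python
import math


def gradient_flow_filter(s, t):
    """Filter applied by gradient flow at time t along curvature s > 0.

    Solves d(theta)/dt = -(s * theta - b) with theta(0) = 0,
    which gives theta(t) = gradient_flow_filter(s, t) * b.
    """
    return (1.0 - math.exp(-s * t)) / s


def poisson_reset_filter(s, r, steps=100_000):
    """Average the gradient-flow filter over an Exp(r) stationary age.

    Composite trapezoid rule on [0, 40 / r]; the neglected tail carries
    a probability mass of about exp(-40).
    """
    t_max = 40.0 / r
    h = t_max / steps
    total = 0.0
    for k in range(steps + 1):
        t = k * h
        weight = 0.5 if k in (0, steps) else 1.0
        total += weight * r * math.exp(-r * t) * gradient_flow_filter(s, t)
    return total * h


def ridge_filter(s, lam):
    """Ridge shrinkage along an eigendirection with eigenvalue s."""
    return 1.0 / (s + lam)
```

Calling `poisson_reset_filter(s, r)` and `ridge_filter(s, r)` for any positive `s` and `r` should agree to within the quadrature error. Swapping the exponential age density for another renewal law would, per the abstract, produce a different spectral filter.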

2605.30058 2026-05-29 cs.CL

HEART-Bench: Do LLM Agents Exhibit Human-like Psychology?

Weihan Peng, Chenxu Zhang, Qianao Wang, Yuling Shi, Heng Lian, Qihong Mao, Jiahao Pang, Chunliang Feng, Bowen Li, Xiaodong Gu

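AI summary: Proposes the HEART-Bench benchmark, which builds virtual characters grounded in Big Five personality traits and autobiographical memories and uses the DIAMONDS situational framework to evaluate whether LLM agents exhibit consistent human psychological characteristics.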

Comments GitHub: https://github.com/peng-weihan/HEART-BENCH

English abstract

While LLM agents have demonstrated remarkable task-oriented abilities such as planning, reasoning, and action, few works have treated them as complete human personalities where emotional dimensions hold equal importance. In this paper, we introduce a novel benchmark to systematically assess whether LLM agents can simulate coherent, human-like psychology. Specifically, our benchmark constructs 11 diverse human characters grounded in orthogonal Big Five personality traits, with each profile deeply integrated with 1,000 structured autobiographical-style episodic memories distributed across theory-grounded developmental life stages. To rigorously evaluate the psychological manifestations of LLMs, we designed a curated suite of 64 decision-making scenarios, guided by the DIAMONDS taxonomy, a psychological framework that characterizes situations along eight dimensions: Duty, Intellect, Adversity, Mating, pOsitivity, Negativity, Deception, and Sociality. By subjecting agents to varying scenarios, the benchmark evaluates whether they can consolidate their innate personality traits and autobiographical memories to make behavioral decisions that are consistent with their specific psychological profiles. After systematic human validation and filtering, we obtained a benchmark consisting of 673 multiple-choice questions (MCQs). We believe this benchmark provides a principled and scalable testbed for studying human-like emotions, personality consistency, and value-consistent behavioural decision-making in LLM-based agents.

2605.30056 2026-05-29 cs.RO cs.LG

Sample-Efficient Diffusion-based Reinforcement Learning with Critic Guidance

Shutong Ding, Zejia Zhong, Zhongyi Wang, Ke Hu, Bikang Pan, Jingya Wang, Ye Shi

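AI summary: To address the imbalance between exploration and exploitation when diffusion policies are used in reinforcement learning, proposes Critic-Guided diffusion Policy Optimization (CGPO), which balances the two with a training-free guidance technique and achieves state-of-the-art performance on MuJoCo and Franka robot tasks.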

Comments accepted by ICML2026

English abstract

Recent advances in reinforcement learning (RL) have achieved great success by leveraging the multimodality and exploration capability of diffusion policies. Among these approaches, one representative branch focuses on sampling-based policy optimization. This design gives the diffusion model better exploration capability, particularly at the beginning of training, but suffers from low exploitation of Q-value information, resulting in slow policy convergence. Another branch focuses on gradient-based policy optimization, which fully exploits the gradient of the Q function yet tends to collapse into a unimodal policy with low diversity. To address this issue, we propose CGPO, Critic-Guided diffusion Policy Optimization, which effectively balances exploration and exploitation with a training-free guidance technique integrated into the denoising process of the diffusion policy. Concretely, CGPO steers action generation toward high-value regions defined by the critic network and uses the guided actions as regression objectives. In this manner, CGPO reduces the time required to obtain high-quality actions and improves final performance through a better exploration-exploitation tradeoff. We validate the effectiveness of CGPO on 5 MuJoCo locomotion tasks, and CGPO achieves state-of-the-art performance compared with existing diffusion-based RL methods. Notably, CGPO is the first method to successfully incorporate diffusion policy into real-world RL, with superior performance on Franka robot arm grasping tasks. Our official page is released at https://dingsht.tech/cgpo-webpage.

2605.30051 2026-05-29 cs.CL cs.CY

Who Am I? History-Aware Profiles for Student Simulation in Tutoring Dialogues

Zhangqi Duan, Shuyan Huang, Alexander Scarlatos, Jaewook Lee, Simon Woodhead, Andrew Lan

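AI summary: Proposes the task of history-conditioned student simulation and trains a profile generator and a simulator with reinforcement learning, using student history to accurately predict dialogue turns; the method significantly outperforms baselines on a dataset from a math learning platform.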

English abstract

A key part of developing large language model (LLM)-powered, automated tutoring tools is student simulation, i.e., using LLMs to role-play as students, which can facilitate tutor model evaluation and training. Existing work mostly focuses on within-dialogue simulation, which lacks context on student knowledge and behavior, partly due to not grounding in past student question-answering or dialogue interactions. In this work, we introduce the task of history-conditioned student simulation, where the goal is to accurately predict student dialogue turns by leveraging information in the student's learning history. We propose a two-component framework in which a profile generator summarizes a student's history and a simulator predicts student turns conditioned on the resulting profile. We train both components with reinforcement learning (RL), yielding profiles optimized for faithful student simulation. We evaluate our method and baselines on the first-of-its-kind real-world dataset of student dialogues and question responses that we collect from a math learning platform. Extensive experiments show that our method significantly outperforms baselines, and demonstrate the importance of history, profiles, and RL training.

2605.30049 2026-05-29 cs.AI

Robust and Generalizable Safety Steering for Text-to-Image Diffusion Transformers

Zihao Xue, Yan Wang, Zhen Bi, Long Ma, Zhonglong Zheng, Zeyu Yang, Bingyu Zhu, Longtao Huang, Jie Xiao, Jungang Lou

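AI summary: Proposes the SafeDIG framework, which performs safety steering for Diffusion Transformers through position-aware sparse feature transfer, effectively reducing target-domain and overall unsafe generation rates while preserving source-domain safety and image quality.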

English abstract

Diffusion Transformers have become a powerful backbone for text-to-image generation, but their layered and cross-modal generation process makes safety control fundamentally different from prompt-level filtering or output-level detection. Harmful semantics may be weakly expressed in text representations, progressively bound to visual latents, and finally entangled with rendering dynamics. As a result, safety steering at a fixed layer can be unstable, and a steering mechanism learned from known risks may not transfer reliably to a shifted target risk domain. We propose SafeDIG, a safety steering framework that formulates DiT safety adaptation as position-aware sparse feature transfer. SafeDIG first constructs Sparse Autoencoders over functionally distinct DiT intervention positions and uses robustness-aware pre-training routing to prioritize intervention sites that are expected to remain stable under source-target risk shift. It then separates transferable safety features from domain-specific activation geometry by freezing the SAE encoder as a reusable sparse safety dictionary and adapting only the decoder to the target-domain activation manifold. During inference, SafeDIG combines Blend and Repel operations to steer unsafe activations toward transferred safety manifolds or away from harmful sparse directions. Experiments on FLUX.1 Dev and Stable Diffusion 3.5 Large show that SafeDIG consistently reduces target-domain and overall unsafe generation rates while preserving source-domain safety and image quality.

2605.30046 2026-05-29 cs.LG cs.AI

Masked Diffusion Modeling for Anomaly Detection

Lixing Zhang, Yuchen Liang, Liyan Xie

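AI summary: Proposes MaskDiff-AD, a method based on masked diffusion models that builds anomaly scores from the difficulty of reconstructing randomly masked coordinates, enabling efficient anomaly detection on categorical, mixed-type, and discrete sequence data.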

English abstract

Anomaly detection aims to identify samples that deviate from the nominal data distribution and is central to many safety-critical applications. However, developing effective anomaly detection methods for categorical, mixed-type, and discrete sequence data remains challenging and relatively underexplored. Masked diffusion models provide a natural way to model such data by learning to recover masked values from the remaining visible context. In this paper, we propose Masked Diffusion for Anomaly Detection (MaskDiff-AD), a forward-only method based on masked diffusion models trained only on nominal data. Given a test sample, MaskDiff-AD constructs anomaly scores from the difficulty of reconstructing randomly masked coordinates, yielding a content-sensitive score that operates directly on discrete state spaces while avoiding reverse-time sampling. We also develop a non-parametric variant of MaskDiff-AD and provide theoretical guarantees by characterizing Type-I and Type-II errors under a fixed detection threshold. Experiments on fourteen categorical and mixed-type tabular datasets from ADBench and UADAD, as well as four text anomaly detection datasets from NLP-ADBench, show that MaskDiff-AD achieves competitive performance against classical, diffusion-based, and recent tabular/text anomaly detection baselines. Notably, MaskDiff-AD achieves the best overall average rank, outperforming all twelve tabular baseline methods.
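
The scoring idea in this abstract (hide random coordinates, then measure how hard they are to reconstruct) can be illustrated with a toy example. The sketch below is not MaskDiff-AD itself: the abstract does not give the scoring rule, the masking schedule, or the model, so the mean negative log-likelihood score, the 50% mask rate, and the smoothed per-column marginal "model" are all assumptions chosen to keep the example dependency-free. A real masked diffusion model would condition on the visible context, which this stand-in ignores.

```python
import math
import random
from collections import Counter

MASK = None  # placeholder marking a hidden coordinate


def fit_marginal_model(rows, alpha=1.0):
    """Toy stand-in for a trained masked model: smoothed column marginals."""
    num_cols = len(rows[0])
    counts = [Counter(row[j] for row in rows) for j in range(num_cols)]

    def cond_prob(masked_row, j, value):
        # Laplace smoothing; one extra slot is reserved for unseen values.
        # masked_row is accepted but unused by this context-free stand-in.
        total = sum(counts[j].values()) + alpha * (len(counts[j]) + 1)
        return (counts[j][value] + alpha) / total

    return cond_prob


def anomaly_score(row, cond_prob, mask_rate=0.5, num_masks=32, seed=0):
    """Mean negative log-likelihood of hidden coordinates over random masks."""
    rng = random.Random(seed)
    total, count = 0.0, 0
    for _ in range(num_masks):
        hidden = [j for j in range(len(row)) if rng.random() < mask_rate]
        masked_row = [MASK if j in hidden else v for j, v in enumerate(row)]
        for j in hidden:
            total += -math.log(cond_prob(masked_row, j, row[j]))
            count += 1
    return total / count if count else 0.0


# Fit on nominal rows only, then score a typical row and an unseen one.
nominal_rows = [("a", "x")] * 9 + [("b", "y")]
model = fit_marginal_model(nominal_rows)
typical_score = anomaly_score(("a", "x"), model)
unseen_score = anomaly_score(("z", "q"), model)
```

Rows made of values never seen in the nominal data receive a higher score than typical rows. Scoring needs only forward evaluations of the model, which matches the abstract's description of a forward-only method that avoids reverse-time sampling.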

2605.30045 2026-05-29 cs.CV

GenEraser: Generalizable Video Object Removal via Balanced Text-Mask Guidance and Decoupled Locator-Preserver

Yuqing Chen, Lin Liu, Haisu Wu, Xiaopeng Zhang, Yaowei Wang, Yujiu Yang, Qi Tian

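AI summary: Proposes the GenEraser framework, which uses a multi-conditional mixture-of-experts, a learnable deep CFG fusion mechanism, and a decoupled expert architecture to tackle the generalization challenge of removing both a target and its physical effects in video object removal, improving results by 2.16 dB on ROSE and 1.44 dB on VOR-Eval.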

English abstract

Video object removal frequently struggles to simultaneously eliminate target objects and their associated physical effects (e.g., smoke, reflections, light, and ripples) in out-of-domain scenarios due to complex spatiotemporal ambiguities. While existing methods primarily rely on spatial masks, they often fail to capture weakly correlated effects, and the potential of explicit textual guidance remains underexplored. Furthermore, a fundamental optimization conflict exists in removal models between high-level semantic generalization and precise pixel-level background preservation. To address these challenges, we propose GenEraser, a novel framework for generalized and high-fidelity video object and effect removal. First, we introduce a Multi-Conditional Mixture-of-Experts (MC-MoE) paired with Bipartite Text guidance to fully exploit the multimodal priors of Diffusion Transformers, significantly enhancing the identification of complex effects. Second, a Learnable Deep "CFG" Fusion mechanism (LD-CFG) is developed to adaptively balance the relative dominance of mask and textual conditions across diverse scenarios. Finally, we propose a Decoupled Expert Architecture, comprising a Locator and a Preserver, to mitigate the inherent trade-off between semantic generalization and pixel alignment. Extensive experiments demonstrate that our GenEraser surpasses recent state-of-the-art approaches, achieving significant quantitative improvements (e.g., 2.16 dB and 1.44 dB on the ROSE Benchmark and VOR-Eval, respectively) while maintaining exceptionally robust generalization in open-world scenarios. https://cyqii.github.io/GenEraser.github.io/

2605.30042 2026-05-29 cs.AI

Learning to Choose: An Empowerment-Guided Multi-Agent System with semantic communication for Adaptive Method Selection

Geremy Loachamín-Suntaxi, Robert Lazar, Dimitrios G. Giovanis, Ioannis G. Kevrekidis, Eleni D. Koronaki

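AI summary: Proposes a multi-agent framework that combines contextual bandits, structured inter-agent communication, and semantic checkpoints, preserving action-outcome causal consistency to improve the convergence, robustness, and generalization of adaptive decision-making in scientific computing workflows.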

English abstract

Automating scientific computing workflows requires more than generating executable code: autonomous systems must also select appropriate computational strategies, implement them faithfully, and ensure that the resulting outcomes remain causally attributable to the decisions that produced them. In multi-agent pipelines, this process is particularly fragile, as small inconsistencies between agent intentions and actions can lead to semantic drift, where the eventually executed procedure no longer reflects the originally selected strategy, thereby corrupting downstream evaluation and adaptation. In this work, motivated by the ATHENA framework (Toscano et al., 2025; Toscano et al., 2026) and the concept of empowerment (Yiu et al., 2025), we introduce a multi-agent framework that combines contextual bandits with structured inter-agent communication and, most importantly, semantic checkpoints that preserve action-outcome fidelity throughout the pipeline. The system integrates specialized large language model (LLM) agents, grounded code generation, and self-healing execution loops within an adaptive decision-making architecture. Interpreting the framework through the lens of empowerment, we show that reliable autonomous learning requires not only identifying high-quality actions, but also preserving the integrity of their propagation across agents. Using sensitivity analysis and uncertainty quantification workflows as representative case studies, we demonstrate that unchecked semantic drift degrades policy learning, whereas the proposed framework improves convergence, robustness, and adaptation to novel problem contexts. These results suggest a broader design principle for scientific multi-agent systems: adaptive decision-making must be coupled with explicit mechanisms that guarantee semantic consistency and reliable information flow across the computational pipeline.

2605.30038 2026-05-29 cs.LG cs.AI cs.CV

Alignment-Guided Score Matching for Text-to-Image Alignment in Diffusion Models

Jaa-Yeon Lee, Yeobin Hong, Taesung Kwon, Jong Chul Ye

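AI summary: Proposes a lightweight, reward-free post-training method that integrates contrastive alignment guidance directly into the score-matching objective of diffusion models, addressing the over-penalization and counting errors seen in text-image alignment.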

Comments ICML 2026, Project page: https://jaayeon.github.io/AGSM

English abstract

Diffusion models generate highly realistic images but often struggle with precise text-image alignment. While recent post-training methods improve alignment using external rewards or human preference signals, their performance depends heavily on reward quality, and they do not directly address alignment within the diffusion process itself. Recent reward-free approaches such as SoftREPA demonstrate that optimizing soft text tokens via contrastive learning can effectively improve text-image representation alignment, outperforming standard parameter-efficient fine-tuning baselines. However, the contrastive formulation can excessively penalize negative pairs, which manifests as characteristic failure cases such as over-counting and repetition. To address this issue, we propose a lightweight, reward-free post-training method that refines soft tokens by integrating contrastive alignment guidance directly into the score-matching objective of diffusion models. By assigning alignment directions at the score level, our approach mitigates these limitations and yields more coherent and semantically faithful generations. Experiments show that our method matches SoftREPA while substantially improving on its failure cases, achieving over 35% improvement in counting accuracy on the GenEval benchmark. Our method is seamlessly applicable to existing diffusion backbones (SD1.5, SDXL, and SD3), and is complementary to existing RL-based diffusion post-training methods. Project page: https://jaayeon.github.io/AGSM

2605.30031 2026-05-29 cs.SD cs.AI cs.CL

Audio Jailbreaks in Large Audio-Language Models: Taxonomy, Attack-Defense Analysis, and Cost-Aware Evaluation

Bo-Han Feng, Yu-Hsuan Li Liang, Chien-Feng Liu, You-Hsuan Chang, Yun-Nung Chen

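AI summary: Presents a unified taxonomy and a controlled empirical evaluation of audio jailbreak attacks and defenses in large audio-language models, showing that Acoustic Best-of-N exposes worst-case audio-space vulnerabilities, that Narrative Framing is an effective low-latency semantic threat, and that existing defenses trade robustness against benign usability.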

Comments Submitted to ACL ARR 2026 May

English abstract

Large Audio Language Models (LALMs) expand jailbreak risks from token-level prompting to the full speech perception-to-reasoning pipeline, where unsafe behavior can be induced through semantics, acoustic style, signal artifacts, or internal representations. Existing work studies these risks under heterogeneous threat models and evaluation protocols, making it difficult to compare attack practicality or defense utility. This paper provides a unified taxonomy and a controlled empirical evaluation of LALM jailbreak attacks and defenses. We organize prior work into semantic, acoustic, signal, and embedding-layer attacks; guard-based, training-free, and training-based defenses; and cross-modal, audio-native, and interactive benchmarks. We then evaluate representative attacks and defenses across ten open-source LALMs, measuring not only attack success rate but also benign refusal and latency. Our results show that Acoustic Best-of-N reveals strong worst-case audio-space vulnerabilities, Narrative Framing is an effective low-latency semantic threat, and current defenses trade robustness against benign usability. These findings support cost- and utility-aware evaluation as a necessary complement to success-rate-only LALM safety benchmarks.

2605.30029 2026-05-29 cs.AI

RAISE: RAG Design as an Architecture Search Problem

Zhen Chen, Yibing Liu, Weihao Xie, Yu Liang, Peilin Chen, Shiqi Wang

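AI summary: Formalizes the design choices of retrieval-augmented generation (RAG) systems as an architecture search problem and builds the RAISE framework and benchmark, which evaluates 13 optimization algorithms on 7 datasets under standardized search spaces and budgets and finds that optimization performance is highly task-dependent.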

English abstract

Retrieval-augmented generation (RAG) systems expose numerous design choices spanning query rewriting, chunking, retrieval depth, reranking, and context compression. In practice, these choices are often configured through heuristics, hindering systematic evaluation and reproducibility across settings. We argue that this challenge is best formulated as RAG architecture search. To support controlled and reproducible study of this problem, we introduce the RAG Intelligence Search Engine (RAISE), a comprehensive framework and benchmark for RAG hyperparameter optimization, which evaluates optimization methods for RAG pipelines under standardized search spaces and budgets. RAISE implements 13 search algorithms and evaluates them across seven public text and multimodal datasets using three random seeds. Our experiments show that optimization performance is highly task-dependent: methods that perform strongly on one dataset may not generalize consistently across others, cautioning against interpreting aggregate rankings as evidence of universally superior strategies. RAISE provides a common experimental substrate for fair, reproducible, and systematic research on RAG hyperparameter optimization.
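
As a rough illustration of what searching a RAG configuration space under a fixed budget and several seeds can look like, the sketch below runs plain random search. The knob names and values in `SEARCH_SPACE`, the `toy_evaluate` scoring function, and the budget of 20 trials are all hypothetical stand-ins; they are not RAISE's actual search space, metrics, or any of its 13 algorithms.

```python
import random

# Hypothetical search space loosely mirroring the design choices named in the
# abstract (query rewriting, chunking, retrieval depth, reranking, compression).
SEARCH_SPACE = {
    "query_rewrite": [False, True],
    "chunk_size": [128, 256, 512],
    "retrieval_depth": [5, 10, 20],
    "rerank": [False, True],
    "compress_context": [False, True],
}


def random_search(evaluate, space, budget, seed):
    """Sample `budget` configurations uniformly and keep the best-scoring one."""
    rng = random.Random(seed)  # a private generator keeps each run reproducible
    best_config, best_score = None, float("-inf")
    for _ in range(budget):
        config = {name: rng.choice(choices) for name, choices in space.items()}
        score = evaluate(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score


def toy_evaluate(config):
    """Made-up scoring function standing in for a real pipeline evaluation."""
    score = 0.5
    if config["rerank"]:
        score += 0.1
    if config["chunk_size"] == 256:
        score += 0.05
    if config["retrieval_depth"] == 20:
        score -= 0.02
    return score


# Same budget, three seeds: one (best_config, best_score) pair per seed.
results = [random_search(toy_evaluate, SEARCH_SPACE, budget=20, seed=s) for s in (0, 1, 2)]
for seed, (config, score) in zip((0, 1, 2), results):
    print(seed, round(score, 3), config)
```

In a real study `toy_evaluate` would be replaced by running the full pipeline on a dataset. Repeating the search across seeds and datasets is what makes it possible to see whether a method's ranking is stable or task-dependent.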

2605.30027 2026-05-29 cs.CV cs.IR

DocRetriever: A Plug-and-Play Framework for Multimodal Document Retrieval with Comprehensive Benchmark

Ruofan Hu, Menghui Zhu, Jieming Zhu, Bo Chen, Shengyang Xu, Minjie Hong, Xiaoda Yang, Sashuai Zhou, Li Tang, Tao Jin, Zhou Zhao

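AI summary: Proposes DocRetriever, a plug-and-play framework that uses layout-aware sparse embeddings and a reasoning-augmented reranker to address semantic ambiguity and generalization bottlenecks in multimodal document retrieval, and builds the MultiDocR benchmark for more rigorous evaluation.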

Comments Accepted at KDD 2026 Research Track

DOI: 10.1145/3770855.3817680

English abstract

Multimodal documents contain diverse elements, such as tables, figures, and layouts, which can complicate retrieval tasks. While current approaches typically combine dense visual embedding models with supervised rerankers to achieve high-precision retrieval, they face inherent limitations. First, the coarse-grained nature of dense embeddings tends to obfuscate explicit semantics, failing to leverage structurally salient information. Second, supervised reranking models suffer from generalization bottlenecks, as their performance heavily relies on domain-specific training data. Furthermore, existing benchmarks often lack diverse assessment dimensions and comprehensive relevance annotations, limiting reliable evaluation. To address these challenges, we propose DocRetriever, a plug-and-play framework. It enhances visual retrieval via a layout-aware sparse embedding technique, enabling effective hybrid encoding without the overhead of optical character recognition (OCR). We also introduce a generalizable reranker that leverages reasoning-augmented demonstrations and optimized sampling to improve accuracy in few-shot settings. Finally, we construct a new benchmark, MultiDocR, to enable more rigorous evaluation. Experiments across diverse benchmarks validate DocRetriever's superiority over state-of-the-art methods.

2605.30022 2026-05-29 cs.CL cs.AI

Give it Space! Explicit Disentangling of Positional and Semantic Representations in Encoders

Pierre-Antoine Lequeu, Camille Barboule, Benjamin Piwowarski

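AI summary: Studies the mechanisms of positional encoding in Transformers by separating positional and semantic signals into three independent streams, finding that the disentangled approach preserves macroscopic structure and improves linguistic representations.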

Comments 8 pages + 10 pages of bibliography and appendix

English abstract

Positional encoding (PE) underpins how permutation-invariant Transformers represent sequence order, yet how positional information is processed and stored remains poorly understood. Modern PE methods such as RoPE still struggle on tasks such as long-context understanding or retrieval (Chen et al., 2025). Hence, a better understanding of the internal positional mechanism could help design better PE. Building on evidence that positional and semantic signals occupy nearly orthogonal subspaces in trained Transformers, we modify an encoder Transformer to process three explicitly disentangled streams: semantic, absolute positional (AP), and relative positional (RP), and confine the masked-language-modeling (MLM) objective to the semantic stream. This decoupling enables a clean mechanistic study and yields three takeaways: (1) the isolated AP subspace spontaneously collapses into a low-frequency two-dimensional manifold that captures the structure of the document; (2) attention heads specialize into structure-oriented and semantic-oriented groups, with RP exclusively supporting the latter; (3) standard positional encodings do not robustly retain macroscopic structure: RoPE and RP only weakly encode it, and entangled AP loses it in the final layers under MLM pressure. The disentangled approach preserves positional encoding, which improves linguistic representation on 49 of the 65 linguistic phenomena of the Flash-Holmes probing benchmark.

2605.30015 2026-05-29 cs.LG cs.AI

Test Time Training for Supervised Causal Learning

Zizhen Deng, Jiaru Zhang, Rui Ding, Huang Bojun, Jinzhuo Wang, Qiang Fu, Shi Han, Dongmei Zhang

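AI summary: To address the weak out-of-distribution generalization of supervised causal learning, proposes TTT-SCL, a test-time training framework that dynamically generates training sets aligned with the test instance and significantly improves causal discovery performance.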

English abstract

Supervised Causal Learning (SCL) has shown promise in causal discovery by framing it as a supervised learning problem. However, it suffers from significant out-of-distribution generalization challenges. We reveal three limitations of previous SCL practices: a significant performance gap between synthetic benchmarks and real-world data, fragility to distribution shifts, and failure in compositional generalization, collectively questioning its real-world applicability. To address this, we propose Test-Time Training for Supervised Causal Learning (TTT-SCL), a novel framework that dynamically generates training sets explicitly aligned with any specific test instance. We demonstrate the correlation between TTT-SCL and score-based methods, and design an efficient module for generating training sets based on the classic scoring function. Experiments on synthetic benchmarks, pseudo-real and real-world datasets demonstrate that TTT-SCL significantly outperforms existing SCL and traditional causal discovery methods.

2605.30014 2026-05-29 cs.AI

From GPS Points to Travel Patterns: Flexible and Semantic Trajectory Generation with LLMs

Silin Zhou, Chenhao Wang, Yuntao Wen, Shuo Shang, Lisi Chen, Panos Kalnis

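AI summary: Proposes HTP, which hierarchically generates travel patterns and then GPS points, using an LLM and an RQ-VAE for flexible, semantically rich trajectory generation, with an average quality improvement of 29.78%.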

Comments This paper is accepted by KDD2026 second round

English abstract

Urban trajectories play a crucial role in modeling urban dynamics and supporting various smart city applications. However, privacy concerns restrict access to large-scale and high-quality trajectory datasets. Trajectory generation provides a promising alternative by synthesizing realistic data to mitigate privacy risks. However, existing methods fail to explicitly capture travel patterns and can only generate fixed-length trajectories under a single condition. To address these limitations, we propose HTP, which Hierarchically generates Travel patterns first and then generates GPS Points by using large language models (LLMs), rather than directly generating GPS points. We first design a trajectory-specific residual quantization variational autoencoder (RQ-VAE) that quantizes micro-level GPS trajectories into compact, macro-level travel pattern tokens in a coarse-to-fine manner. These tokens capture rich segment spatial irregularities, such as point density variations caused by traffic conditions. Then, we extend the LLM vocabulary with travel pattern tokens to align trajectory representations with the LLM input, and apply supervised fine-tuning (SFT) to align the LLM with the trajectory generation task, enabling generation of travel pattern sequences under various conditions. Extensive experiments on two real-world datasets show that HTP outperforms the strongest baseline by an average of 29.78% in terms of generation quality. Our code is available at https://github.com/slzhou-xy/HTP.

2605.30011 2026-05-29 cs.CV cs.AI

VisualThink-VLA: Visual Intermediate Reasoning for Effective and Low-Latency Vision-Language-Action Policies

Mingjian Gao, Wenqiao Zhang, Yuqian Yuan, Yang Dai, Binhe Yu, Zheqi Lv, Haoyu Zheng, Jiaqi Zhu, Zhiqi Ge, Zixuan Wan, Siliang Tang, Yueting Zhuang

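AI summary: Proposes the VisualThink-VLA framework, which uses visual intermediate reasoning and a selective routing mechanism to cut inference latency from several seconds to under a second while maintaining high accuracy.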

English abstract

Recent work has begun to equip vision-language-action (VLA) policies with explicit intermediate reasoning. In embodied control, however, textual chain-of-thought is a poor fit: irrelevant or weakly textual information can interfere with action prediction, while autoregressive text decoding adds too much latency for real-time closed-loop execution. We present VISUALTHINK-VLA, a visual intermediate-reasoning framework for accurate, low-latency VLA policies. Our bootstrapping philosophy is to guide action with effective visual thinking: VISUALTHINK-VLA bootstraps action prediction through a compact visual-evidence interface that preserves spatial precision while avoiding decoding overhead. In addition, to further improve performance and efficiency, VISUALTHINK-VLA adopts a tailored selective routing mechanism to learn the visual evidence tokens, enabling low-latency inference while preserving high-capacity specialization. We also introduce VisualEvidence-Kit, a supervision-and-audit resource centered on a VisualEvidence-Agent that constructs a VisualEvidence-Set of 754.7k VLA instructions for route supervision and counterfactual faithfulness tests. Across multiple benchmarks and real-robot evaluation, VISUALTHINK-VLA achieves the highest success rate on most benchmarks while reducing the multi-second latency of reasoning-augmented baselines to the sub-second regime. For example, on BridgeData V2, it reduces step latency from 8.377 s with ECoT to 0.367 s, achieving a 22.8x speedup.
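
The quoted speedup can be checked directly from the two step latencies given in the abstract. The variable names below are illustrative only.

```python
# Step latencies on BridgeData V2 as quoted in the abstract, in seconds.
ecot_latency_s = 8.377
visualthink_latency_s = 0.367

# Speedup is the ratio of the baseline latency to the new latency.
speedup = ecot_latency_s / visualthink_latency_s
print(f"{speedup:.1f}x")  # prints "22.8x"
```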

2605.30010 2026-05-29 cs.CV

EarlyTom: Early Token Compression Completes Fast Video Understanding

Hesong Wang, Xin Jin, Lu Lu, Chenhaowen Li, Jian Chen, Qiang Liu, Huan Wang

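AI summary: To address the inefficiency of the vision-encoding stage in video large language models, proposes EarlyTom, a training-free token compression framework that compresses early, inside the vision encoder, significantly reducing time-to-first-token and raising throughput.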

Comments Accepted by CVPR 2026. 16 pages, 8 figures, 8 tables. Project page: https://viridisgreen.github.io/EarlyTom

English abstract

Video large language models (Video-LLMs) have demonstrated strong capabilities in video understanding tasks. However, their practical deployment is still hindered by the inefficiency introduced by processing massive amounts of visual tokens. Although recent approaches achieve extremely low token retention ratios while maintaining accuracy comparable to full-token baselines, most of them perform compression only at the late stage of prefilling, leaving the efficiency of the vision encoder unoptimized. In this paper, we first show that vision encoding contributes a large portion of the time-to-first-token (TTFT). Compressing visual tokens inside the vision encoder, rather than only after it, therefore leaves substantial room for exploration. Based on this insight, we propose EarlyTom, a training-free token compression framework that performs early-stage visual token compression inside the vision encoder, enabling significantly better TTFT reduction and higher throughput. In addition, we introduce a decoupled spatial token selection strategy that improves the overall compression effectiveness. EarlyTom reduces TTFT by up to 2.65x and FLOPs by up to 61% on a single NVIDIA A100 GPU for the LLaVA-OneVision-7B model, while maintaining accuracy comparable to the full-token baseline. These improvements substantially enhance the practicality of deploying Video-LLMs in real-world production scenarios.

2605.30002 2026-05-29 cs.AI

KairosAgent: Agentic Time Series Forecasting with Fused Semantic Reasoning

Kun Feng, Ziwei Shan, Yuchen Fang, Yiyang Tan, Sihan Lu, Shuqi Gu, Lintao Ma, Xingyu Lu, Kan Ren

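AI summary: Proposes the KairosAgent framework, which combines an LLM-based reasoner with a TSFM-based forecaster and introduces a reinforcement learning paradigm to achieve zero-shot forecasting for multimodal time series.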

English abstract

Cross-domain multimodal time series forecasting is a challenging task, requiring models to integrate precise numerical comprehension, cross-domain semantic understanding, and effective multimodal fusion. Existing approaches either build Time Series Foundation Models (TSFMs) from scratch or leverage pretrained Large Language Models (LLMs). However, TSFMs often overlook semantic understanding and lack the ability to perform future-oriented semantic reasoning, and LLMs struggle with numerical comprehension and accurate quantitative forecasting. To overcome these limitations, we propose KairosAgent, a novel agentic framework for multimodal time series forecasting, including an LLM-based reasoner and a TSFM-based forecaster. KairosAgent unifies textual reasoning and numerical forecasting by dynamically invoking analytical tools to enhance the numerical understanding and semantic reasoning capabilities of LLMs. The reasoning results are subsequently fused into the TSFM pipeline, enabling more accurate and reliable future predictions. To further improve the reasoning, we curate a large-scale corpus of high-quality trajectories, alongside a reinforcement learning from forecasting paradigm with multi-turn refinement and turn-level credit assignment. Experiments demonstrate that KairosAgent achieves superior zero-shot forecasting performance while maximizing the utility of pretrained LLMs and TSFMs, presenting a promising direction for efficient and interpretable time series agents. The project page is at https://foundation-model-research.github.io/KairosAgent .

2605.29997 2026-05-29 cs.CV

FRUC: Feedforward Dynamic Scene Reconstruction from Uncalibrated Collaborative Driving Views

Yihang Tao, Yu Guo, Zhengru Fang, Haonan An, Yuguang Fang

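AI summary: Proposes FRUC, a framework based on feed-forward 3D Gaussian splatting and a visual geometry Transformer that reconstructs dynamic scenes from uncalibrated multi-vehicle collaborative views in one shot and without calibration, achieving non-destructive geometric supplementation through an ego-centric causal occlusion field and zero-initialized residual denoising.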

English abstract

We present FRUC, a feed-forward 3D Gaussian splatting framework for dynamic scene reconstruction from uncalibrated collaborative driving views. Existing multi-agent reconstruction frameworks are often hindered by rigid prerequisites, demanding precise spatial calibration and slow per-scene optimization. In this paper, we rethink this task by conceptualizing a distributed multi-vehicle network as a spatio-temporally unstructured ego-centric multi-camera system, where the core challenge lies in enhancing ego-centric occluded geometry through collaboration without degrading the ego's accurately observed visible geometry, while preserving reconstruction efficiency. For efficient reconstruction, FRUC is built upon a visual grounded geometric Transformer backbone to enable one-shot, calibration-free inference from a flexible number of multi-vehicle views. To achieve non-destructive geometric supplementation under uncalibrated cross-agent misalignment, FRUC first introduces an ego-centric causal occlusion field that explicitly derives occlusion evolution as latent priors by modeling agent-wise spatio-temporal correlations. Guided by these occlusion priors, it further formulates cross-agent integration as a deterministic residual denoising process via zero-initialized injection, turning challenging cross-agent fusion into bounded residual learning for robust collaborative blind-spot completion. Through extensive evaluations on the real-world V2XReal and UrbanIng-V2X datasets, FRUC is shown to be a new state-of-the-art for the scene reconstruction of dynamic collaborative driving environments, significantly outperforming existing methods in both rendering quality and efficiency.

2605.29992 2026-05-29 cs.CL

Adapting Multilingual Embedding Models to Turkish via Cross-Lingual Tokenizer Surgery and Offline Distillation

M. Ali Bayram, Banu Diri, Savaş Yıldırım

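AI summary: Proposes an efficient three-stage adaptation pipeline (cross-lingual tokenizer optimization, teacher model cloning, and offline distillation) to build embeddingmagibu-200m, a Turkish sentence embedding model that surpasses its teacher on STSbTR and reaches competitive performance on TR-MTEB with fewer parameters.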

Comments 14 pages, 2 figures, 4 tables, Appendix included

English abstract

Sentence embeddings are a foundational component for semantic search, clustering, classification, and retrieval-augmented generation. This paper presents embeddingmagibu-200m, a Turkish-focused sentence embedding model that produces 768-dimensional L2-normalized vectors and supports an 8,192-token context window, far exceeding the 512-token limit of earlier BERT-based Turkish encoders. Instead of full pretraining, an efficient three-stage adaptation pipeline is introduced: (1) construct a Turkish-optimized multilingual tokenizer with a 131,072 vocabulary by pruning redundant tokens from the teacher's vocabulary and incorporating multilingual tokens via frequency analysis on a 40-language corpus, (2) clone a teacher embedding model while preserving transformer backbone weights and initializing a compatible embedding table for the new vocabulary via mean-composition token mapping, and (3) perform offline embedding distillation from precomputed teacher vectors using a cosine similarity objective over a balanced 40-language Wikipedia corpus. The resulting student model contains approximately 200M parameters and trains in roughly four hours on a single GPU by avoiding online teacher inference during training, at a total cost of $5-$20. Empirically, Pearson/Spearman correlations of 77.55%/77.45% are obtained on STSbTR, surpassing the 300M-parameter teacher model (73.84%/72.92%). On TR-MTEB (26 tasks), a mean score of 63.9% is achieved (7th out of 26 models), providing a competitive cost-quality trade-off with 33% fewer parameters than the teacher. To facilitate reproducibility and downstream use, all artifacts are released including model weights, tokenizer files, precomputed embedding datasets, and open-source cloning and distillation tooling.
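
Stages (2) and (3) of the pipeline can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes "mean-composition token mapping" means averaging the teacher vectors of the pieces the teacher tokenizer splits a new token into, and it assumes the cosine objective takes the common form `1 - cos(student, teacher)`. The toy vocabulary, the `teacher_tokenize` lookup, and all function names are hypothetical.

```python
import math


def mean_composition_init(new_vocab, teacher_tokenize, teacher_emb):
    """Initialise each new token as the mean of the teacher vectors of its pieces.

    Assumed reading of "mean-composition token mapping"; the paper may differ.
    """
    table = {}
    for token in new_vocab:
        pieces = teacher_tokenize(token)           # how the teacher splits the token
        vectors = [teacher_emb[p] for p in pieces]
        dim = len(vectors[0])
        table[token] = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    return table


def cosine_distillation_loss(student_vec, teacher_vec):
    """Return 1 - cosine similarity; 0 when the two vectors point the same way."""
    dot = sum(s * t for s, t in zip(student_vec, teacher_vec))
    norm_s = math.sqrt(sum(s * s for s in student_vec))
    norm_t = math.sqrt(sum(t * t for t in teacher_vec))
    return 1.0 - dot / (norm_s * norm_t)


# Toy 2-dimensional teacher table and a lookup standing in for its tokenizer.
teacher_emb = {"ev": [1.0, 0.0], "ler": [0.0, 1.0]}
splits = {"evler": ["ev", "ler"], "ev": ["ev"]}
student_table = mean_composition_init(["evler", "ev"], splits.__getitem__, teacher_emb)

# Offline distillation: the teacher vector is precomputed once and stored, so
# no teacher forward pass is needed while the student trains.
precomputed_teacher_vec = [2.0, 0.0]
loss = cosine_distillation_loss(student_table["ev"], precomputed_teacher_vec)
```

Because the loss depends only on direction, a student vector that is a scaled copy of the teacher vector already scores zero, which fits a model whose outputs are L2-normalized anyway.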

2605.29986 2026-05-29 cs.AI

Accelerating Constrained Decoding with Token Space Compression

Michael Sullivan, Alexander Koller

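AI summary: Proposes CFGzip, which compresses the token search space offline to sharply reduce the overhead of context-free-grammar constrained decoding, achieving up to two orders of magnitude lower latency and up to a 7.5x speedup in total generation.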

Comments 13 pages; 5 figures; under review at EMNLP 2026

2605.29983 2026-05-29 cs.LG cs.CV

Improving Adversarial Robustness of Attribution via Implicit Regularization

Amir Mehrpanah, Matteo Gamba, Hossein Azizpour

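AI summary: Finds that the learning dynamics of standard stochastic gradient descent can implicitly improve the adversarial robustness of attributions, and proves that robustness gains for attention-based attribution are limited under softmax normalization, whereas kernel-based attention restores them.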

Comments 39 pages, 22 figures, to be published in International Conference on Machine Learning 2026

English abstract

The adversarial robustness of attributions is a fundamental requirement for reliable explainability in deep learning, yet existing approaches typically rely on computationally expensive explicit regularization. In this work, we show that attribution robustness can arise implicitly from the learning dynamics of standard stochastic gradient descent. We theoretically motivate this effect through connections between parameter-space and input-space curvature, and validate it across architectures, datasets, and attribution methods, with negligible computational overhead. In contrast, we prove that such robustness gains often do not transfer to attention-based attribution under softmax normalization, due to inherent entropy constraints, and we validate this limitation experimentally. Finally, we show that replacing softmax attention with kernel-based attention restores the robustness gains in transformer models. Our results highlight learning dynamics as a principled and practical mechanism for robust explainability, and reveal fundamental limitations of attention-based attribution under normalization.

2605.29980 2026-05-29 cs.CV cs.AI cs.LG

Genetically Aligned Patient Representations Improve Hematological Diagnosis

Muhammed Furkan Dasdelen, Fatih Ozlugedik, Ilaria Looser, Rao Muhammad Umer, Christian Pohlkamp, Carsten Marr

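AI summary: Proposes a two-stage framework that aligns white blood cell images with chromosomal aberrations and somatic mutations through self-supervised visual pretraining and supervised contrastive learning, improving hematological diagnostic performance.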

Comments Accepted for publication at the 29th International Conference on Medical Image Computing and Computer Assisted Intervention - MICCAI 2026

English abstract

Multimodal alignment of histopathology encoders with transcriptomic and genomic data has been shown to significantly improve performance in downstream diagnostic tasks. Hematological cytology is unique in that visual single-cell evaluation is often paired with cytogenetics and molecular genetics for blood cancer diagnosis. In this study, we present a framework to align single white blood cell images with chromosomal aberrations (karyotype) and somatic mutations from targeted gene panels. Our training strategy follows a two-stage approach: (i) self-supervised, vision-only pretraining of a transformer aggregator using an iBOT head on a cohort of over 1500 patients, and (ii) genetic alignment via supervised contrastive loss on acute myeloid leukemia patients. Our genetically aligned patient encoder improves hematological diagnostic tasks, outperforming slide-level histopathology foundation models. Additionally, the model provides off-the-shelf retrieval capabilities for diseases and genetic alterations. Incorporating genetic data into patient encoders increases the quality of patient representations, providing a framework that aligns with clinical diagnostic workflows and paves the way for future multimodal hematology-specific AI. The code and model weights are available at https://github.com/marrlab/GenBloom.

2605.29975 2026-05-29 cs.LG eess.SP

A Fully Convolutional Approach to Denoising Structural Dynamics Data from X-Ray Photon Correlation Spectroscopy

Nisar Nellikunnummel, Andi Barbour, Lutz Wiegart, Tatiana Konstantinova, Anthony DeGennaro

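AI summary: Proposes a fully convolutional denoising autoencoder (FC-DAE) for denoising two-time intensity-intensity correlation functions in X-ray photon correlation spectroscopy; it supports arbitrary input sizes and recovers complex dynamical features under low signal-to-noise conditions while maintaining structural fidelity.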

English abstract

We present a fully convolutional denoising autoencoder (FC-DAE) for denoising two-time intensity-intensity correlation functions ($C_2$) in X-ray photon correlation spectroscopy (XPCS). Unlike conventional denoising autoencoders that are typically restricted to fixed input sizes, the FC-DAE accepts inputs of arbitrary dimensions while preserving correlation structures across diverse dynamical regimes. The model is trained using experimentally derived $C_2$ data collected at NSLS-II beamlines, with data augmentation applied to expand the diversity of the dataset and reduce overfitting. The FC-DAE successfully recovers intricate dynamical features in low signal-to-noise conditions while maintaining structural fidelity. To assess reconstruction reliability, we employ quantitative metrics to evaluate structural fidelity and identify potential model-induced bias. Our results demonstrate that the FC-DAE provides robust denoising performance with high computational efficiency, enabling recovery of XPCS dynamics under photon-limited and low-dose measurement conditions.
