arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.25850 2026-05-26 cs.CL cs.AI cs.LG

TIAR: Trajectory-Informed Advantage Reweighting for LLM Abstention Learning

Muyu Pan, Shu Zhao, Nan Zhang, Philip Shin, Varun Parekh, Vijaykrishnan Narayanan, Rui Zhang

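AI summary: This paper proposes TIAR, which uses the multiple trajectories in GRPO as a natural abstention signal to dynamically reweight the abstention reward. It achieves the best abstention F1 score in five of six evaluation categories while preserving baseline accuracy.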

Comments 10 pages, 1 figure, 4 tables

English abstract

This paper investigates abstention learning for large language models (LLMs), building on the ternary reward, which incentivizes truthfulness in LLMs. It extends that idea by moving from a ternary reward to Trajectory-Informed Advantage Reweighting (TIAR), which dynamically re-weights the abstention reward during Group Relative Policy Optimization (GRPO) training. The work focuses on abstention learning rather than on improving truthfulness, and serves as an exploration into hallucination reduction. Its novelty lies in methodological innovation, advantage re-weighting, and benchmark selection. Leveraging GRPO's multiple trajectories as a natural abstention signal, the method uses a reward signal to explore knowledge boundaries and encourage consistency. The paper demonstrates that trajectories can serve as an indicator of the policy's confidence on a query, and then uses them to dynamically calculate the abstention advantage. AbstentionBench is used as the evaluation benchmark, as this work aims to contribute to the field of abstention learning. The method and various baselines were tested on all datasets in the benchmark. Empirical results demonstrate that TIAR achieves state-of-the-art abstention F1 scores in five of six evaluation categories, outperforming the static ternary baseline on 17 of 31 benchmark datasets while fully preserving baseline accuracy.
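The abstract does not state the reweighting formula, so the following is only a minimal sketch of the general idea under stated assumptions. It computes GRPO-style group-relative advantages from a ternary reward and scales the abstention reward by a confidence proxy derived from agreement among the rollouts. The function names, the `abstain_bonus` parameter, and the `(1 - confidence)` scaling rule are all hypothetical and are not taken from the paper.

```python
from collections import Counter
from statistics import mean, pstdev


def ternary_reward(answer, gold):
    """+1 for a correct answer, 0 for abstaining (None), -1 for a wrong answer."""
    if answer is None:
        return 0.0
    return 1.0 if answer == gold else -1.0


def group_confidence(answers):
    """Hypothetical confidence proxy: share of all rollouts that agree with
    the most common non-abstaining answer (abstentions lower the score)."""
    attempted = [a for a in answers if a is not None]
    if not attempted:
        return 0.0
    return Counter(attempted).most_common(1)[0][1] / len(answers)


def reweighted_advantages(answers, gold, abstain_bonus=0.5):
    """Group-relative advantages where the abstention reward is scaled by
    (1 - confidence). The scaling rule is an assumption for illustration."""
    conf = group_confidence(answers)
    rewards = [
        abstain_bonus * (1.0 - conf) if a is None else ternary_reward(a, gold)
        for a in answers
    ]
    mu, sigma = mean(rewards), pstdev(rewards)
    # Small epsilon avoids division by zero when all rewards are equal.
    return [(r - mu) / (sigma + 1e-8) for r in rewards]
```

With this toy rule, abstaining gets a negative advantage when the group is consistent and correct, and a positive one when the rollouts disagree.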

2605.25848 2026-05-26 cs.LG cs.AI

Geometric Evolution Maps: Extracting Stable Concept Probes from Transformer Residual Streams

James Henry

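AI summary: Proposes Geometric Evolution Maps (GEMs), which extract stable concept probes by tracking a concept's directional trajectory through the residual stream and identifying the handoff layer where rotation ceases. GEM probes strictly outperform peak-layer probes in 66.2% of 391 concept x model pairs.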

Comments 24 pages, 3 figures. Reference implementation: rosetta_tools v1.3.1 (doi:10.5281/zenodo.20361433)

English abstract

Concept probes extracted from transformer residual streams are only as reliable as the layer from which they are extracted. The common practice of probing at a fixed late layer or at the peak of a separation score function ignores a fundamental structural feature: concept representations undergo substantial directional rotation during their assembly phase, and do not settle into a stable direction until a characteristic handoff layer after the primary Concept Allocation Zone (CAZ). We introduce Geometric Evolution Maps (GEMs), which track the full directional trajectory of a concept through residual stream activations, identify the handoff layer where rotation ceases, and extract the settled probe direction from that layer. Across 23 architectures spanning 70M to 14B parameters and 17 concept types, the entry-to-exit cosine similarity within CAZs has a mean of 0.233, showing that probe direction at CAZ entry does not reliably predict probe direction at exit. Ablation experiments across 391 concept x model pairs (23 models x 17 concepts) show that GEM-extracted probes are at least as precise as peak-layer probes in 268/391 trials (68.5%), and strictly outperform in 259/391 (66.2%). The architecture split is pronounced: MHA models favour the handoff in 173/221 trials (78.3%); GQA models favour the handoff in only 56/119 trials (47.1%). Model-level Wilcoxon: W=214, N=23, p=0.010 (one-sided). An adaptive ablation width rule targets the 79/391 near-final-layer cases: it improves probe quality in 60/79 triggered cases (75.9%), mean gain +7.44pp. A direction-specificity control confirms the ablation effect is concept-direction specific: median 377x suppression rate versus random-direction ablation (99.1% of concept directions beat all 10 random seeds). Reference implementation: rosetta_tools v1.3.1 (doi:10.5281/zenodo.20361433).
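As a rough illustration of locating a layer where rotation ceases, the sketch below takes one probe direction per layer, measures the cosine similarity between consecutive layers, and returns the earliest layer from which the direction stays stable. The `threshold` value and this particular stability criterion are assumptions; the reference implementation in rosetta_tools may define the handoff layer differently.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def handoff_layer(directions, threshold=0.95):
    """Earliest layer index from which every consecutive-layer cosine stays
    at or above `threshold`; the last layer if the direction never settles."""
    sims = [
        cosine(directions[i], directions[i + 1])
        for i in range(len(directions) - 1)
    ]
    for start in range(len(sims)):
        if all(s >= threshold for s in sims[start:]):
            return start
    return len(directions) - 1


# Toy trajectory: the direction rotates over the first layers, then settles.
layers = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8], [0.6, 0.8], [0.6, 0.8]]
settled_at = handoff_layer(layers)
```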

2605.25846 2026-05-26 cs.CL

On the Limits of Model Merging for Multilinguality in Pre-Training

Seth Aycock, Fedor Vitiugin, Aleksandr Umnov, Christof Monz, Khalil Sima'an

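AI summary: A controlled comparison of mixed pre-training, model merging, and monolingual pre-training finds that merging monolingual models leads to performance collapse, suggesting that representational similarity is a prerequisite for model merging.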

Comments MeLLM Workshop 2026

English abstract

Endowing models with consistent multilingual performance can be achieved by mixing pre-training data, or post-training approaches such as language-specific model merging. In this work, we test whether merging can be applied to monolingually pre-trained models. We conduct a controlled study on the efficacy of mixed, merged, and monolingual pre-training setups. We find that while monolingual pre-training results in strong in-language performance, merging any combination of monolingual models leads to performance collapse due to interference. Our analysis suggests representational similarity is a prerequisite for model merging. We therefore conclude that the flexibility of merging in fine-tuning does not extend trivially to language-specific pre-training.
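The abstract does not say which merging method was tested. For readers unfamiliar with the operation, here is a minimal sketch of the simplest variant, weighted parameter averaging, with each model represented as a plain dictionary of flat parameter lists. The representation and the function name are illustrative assumptions.

```python
def merge_by_averaging(state_dicts, weights=None):
    """Weighted average of same-shaped parameter lists keyed by name.

    Assumes every model has identical parameter names and lengths.
    """
    if weights is None:
        weights = [1.0 / len(state_dicts)] * len(state_dicts)
    merged = {}
    for name in state_dicts[0]:
        # Group the i-th value of this parameter across all models.
        columns = zip(*(sd[name] for sd in state_dicts))
        merged[name] = [
            sum(w * x for w, x in zip(weights, col)) for col in columns
        ]
    return merged


# Two hypothetical monolingual models sharing one parameter tensor.
model_en = {"w": [1.0, 2.0]}
model_de = {"w": [3.0, 6.0]}
merged = merge_by_averaging([model_en, model_de])
```

Averaging like this only makes sense when the models' parameters are comparable coordinate by coordinate, which is consistent with the abstract's conclusion that representational similarity is a prerequisite.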

2605.25835 2026-05-26 cs.LG cs.AI

Context-Instrumental Data Distillation for Kubernetes Manifest Generation: Method and Experimental Evaluation

Andrey Kozachok, Anatoliy Bakaev, Aleksandr Kozachok, Shamil Magomedov, Artem Noev

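AI summary: Proposes context-instrumental data distillation, which builds a corpus through synthetic generation and reverse instruction generation, filters it with external validators, and fine-tunes a 1.5B-parameter small language model under resource constraints to generate Kubernetes manifests. Experiments indicate that strict output format requirements matter more than adding training examples.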

Comments 15 pages, 4 figures, 2 tables

English abstract

This paper examines the specialization of Small Language Models (SLMs) with up to 4 billion parameters for generating artifacts in domain-specific languages (DSL). Kubernetes manifests are chosen as the target domain. We propose the context-instrumental data distillation method: the source corpus is formed through synthetic generation and, in an extended scheme, through reverse instruction generation from real Kubernetes YAML files, with pairs included in training only upon passing external validators and matching the domain context model. Unlike classical KL-divergence knowledge distillation, the baseline implementation reduces to supervised fine-tuning on instrumentally verified examples. The experimental section presents a pilot implementation under resource-constrained conditions: the DeepSeek-V4 Flash API serves as the teacher for synthetic generation, while Qwen2.5-Coder-1.5B-Instruct is fine-tuned via LoRA on CPU. On the K8s-Distill-Pilot corpus (train_1200, validation_100, test_200), we achieved full-pass@1 = 91.5% (183/200) with a stricter prompt formulation and max_new_tokens=768. The key empirical finding is that for Kubernetes YAML, result quality in the pilot depended more on strict output format requirements than on simply increasing the number of training examples.
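The validator-gated filtering step can be sketched as follows. The check here is a toy stand-in, not the paper's validators: it only tests that an already-parsed manifest has the `apiVersion`, `kind`, and `metadata` fields and a non-empty `metadata.name`. The function names are hypothetical, and the last helper simply reproduces the pass-rate arithmetic from the abstract.

```python
REQUIRED_TOP_LEVEL = ("apiVersion", "kind", "metadata")


def passes_validators(manifest):
    """Toy stand-in for an external validator: required top-level fields
    must be present and metadata.name must be non-empty."""
    if not isinstance(manifest, dict):
        return False
    if any(key not in manifest for key in REQUIRED_TOP_LEVEL):
        return False
    metadata = manifest["metadata"]
    return isinstance(metadata, dict) and bool(metadata.get("name"))


def filter_pairs(pairs):
    """Keep only (instruction, manifest) pairs whose manifest validates."""
    return [(instr, m) for instr, m in pairs if passes_validators(m)]


def pass_rate(passed, total):
    """Percentage of test items that passed."""
    return 100.0 * passed / total


pairs = [
    ("create a deployment",
     {"apiVersion": "apps/v1", "kind": "Deployment", "metadata": {"name": "web"}}),
    ("create a service", {"kind": "Service"}),  # missing fields, filtered out
]
kept = filter_pairs(pairs)
```

A real pipeline would presumably replace `passes_validators` with schema validation or a dry run against a cluster; the abstract does not name the tools used.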

2605.25832 2026-05-26 cs.RO cs.AI cs.CL cs.CV

When Search Becomes Memory: Turning Robot Design Trials into Transferable Skills

Yunfei Wang, Xiaohao Xu, Yang Li, Xiaonan Huang

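AI summary: Proposes Auto-Robotist, a self-evolving LLM agent that distills morphology-search traces into a natural-language skill library, making robot design knowledge transferable. On EvoGym tasks it improves cold-start search and transfers skills across design spaces.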

Comments 20 pages, 8 figures

English abstract

Large language models (LLMs) are increasingly used as proposal generators for evolutionary robot design, yet most loops remain memoryless: simulator results shape the next population but are not preserved as reusable design knowledge. We present Auto-Robotist, a self-evolving LLM agent that distills morphology-search traces into an explicit natural-language skill library. Each skill stores a structural archetype, evidence-grounded positive and negative rules, and the evaluated designs that support them, making design memory inspectable rather than implicit in a population. During search, the agent retrieves skills to condition LLM edits of elite bodies while retaining a Genetic Algorithm (GA) mutation path for exploration; after evaluation, it updates the library through Add, Diagnose, and Merge. Across seven EvoGym tasks spanning locomotion, traversal, and object interaction, Auto-Robotist improves cold-start 5x5 search and transfers learned skills to 10x10 design spaces, where reference-conditioned transfer outperforms GA on every task. These results suggest that LLM agents can convert expensive physical evaluations into reusable, auditable design principles. Our code will be released upon acceptance.

2605.25831 2026-05-26 cs.CL cs.AI cs.LG

Clarify, Abstain or Answer? Strategising in Conversation with Belief-Augmented Generation

Joris Baan, Wilker Aziz, Barbara Plank, Raquel Fernández

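AI summary: Proposes Belief-Augmented Generation (BAG), which injects an LLM's own belief state into the prompt so that it reasons over multiple sampled responses and chooses a conversational strategy (answer, clarify, or abstain), improving accuracy on multi-turn ambiguous QA and the faithfulness of strategy decisions.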

English abstract

Large language models (LLMs) define a distribution over text, which can be viewed as a probabilistic representation of uncertainty: sampling K responses yields a belief state, that is, the responses a model deems plausible. Existing work exploits this representation for narrow tasks such as decoding or selective prediction, and often requires manual intervention rather than controlling generation directly. We propose Belief-Augmented Generation (BAG): grounding LLMs in their own belief state via the prompt and letting them reason over these K samples to decide on a conversational strategy: answer, clarify, or abstain. In a multi-turn ambiguous QA setting, we find that LLMs by default rarely clarify or abstain, ignoring uncertainty about the input or facts. BAG improves QA accuracy across six models and yields strategy decisions more faithful to the belief state than prompt-only baselines. Disentangling when to clarify from when to abstain, however, remains challenging.
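To make the idea of grounding a model in its own belief state concrete, the sketch below aggregates K sampled responses into relative frequencies and writes them into a prompt. The normalization (lowercasing, trimming whitespace) and the prompt wording are assumptions for illustration; the paper's actual prompt format is not given in the abstract.

```python
from collections import Counter


def belief_state(samples):
    """Collapse K sampled responses into (response, relative frequency)
    pairs, most frequent first."""
    counts = Counter(s.strip().lower() for s in samples)
    total = len(samples)
    return [(resp, n / total) for resp, n in counts.most_common()]


def build_prompt(question, samples):
    """Hypothetical prompt that exposes the belief state to the model."""
    lines = [
        f"Question: {question}",
        "Your sampled responses (relative frequency):",
    ]
    lines += [f"- {resp} ({freq:.0%})" for resp, freq in belief_state(samples)]
    lines.append("Decide on one strategy: answer, clarify, or abstain.")
    return "\n".join(lines)


prompt = build_prompt(
    "What is the capital of France?",
    ["Paris", "paris", "Lyon", "Paris "],
)
```

In practice the samples would come from K stochastic decoding runs of the same model that later receives the prompt.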

2605.25829 2026-05-26 cs.RO cs.AI

OASIS: Observation-Action Space Alignment via SE(3) Trajectory Prediction for Robotic Manipulation

Xinzhe Chen, Sihua Ren, Liqi Huang, Haowen Sun, Mingyang Li, Xingyu Chen, Zeyang Liu, Xuguang Lan

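AI summary: Proposes OASIS, a visuomotor policy that aligns the intermediate representation with the action space via SE(3) end-effector trajectory prediction, outperforming VLA and WAM baselines in simulation and real-world experiments.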

English abstract

Recent vision-language-action (VLA) models and world action models (WAMs) advance robotic manipulation by enriching intermediate representations with auxiliary spatial features or future visual-state prediction. However, these representations largely remain within the observation space and do not share the rigid-body geometry of the action space, forcing the action decoder to implicitly recover this geometry. We propose OASIS, a visuomotor policy that aligns the intermediate representation with the action space via $SE(3)$ end-effector trajectory prediction. OASIS couples a 3D-aware feature encoder that fuses vision-language and metric-depth features with an $SE(3)$ trajectory predictor that produces a camera-frame end-effector trajectory. Conditioned on the predictor's pose-supervised hidden states, the action decoder generates action chunks consistent with rigid-body motion. Across simulation and real-world experiments, OASIS outperforms VLA and WAM baselines in success rate and out-of-distribution generalization. Our project page is available at https://npuhandsome.github.io/OASIS_web.

2605.25821 2026-05-26 cs.CV

[CLS] is Not Enough: Multi-Label Recognition via Patch-Level Inference and Adaptive Aggregation

Akang Wang, Xili Deng, Zhanxuan Hu, Yi Zhao, Yonghang Tai, Huafeng Li

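AI summary: To address the inadequacy of the global [CLS] representation when vision-language models such as CLIP perform multi-label recognition, proposes the PIAA framework, which achieves training-free multi-label recognition through patch-level inference and adaptive aggregation, with an mAP gain of over 6% on NUS-WIDE.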

English abstract

Vision-Language Models such as CLIP exhibit strong zero-shot recognition capability by aligning images with textual concepts, yet they often underperform on multi-label recognition where multiple objects co-exist. A key bottleneck is that the [CLS] token, as a single global visual representation, is insufficient to faithfully encode diverse targets with varying scales, contexts, and co-occurrence patterns. To address this limitation, we present a new multi-label image recognition framework, termed PIAA, which formulates prediction as Patch-level Inference followed by Adaptive Aggregation. Specifically, we first enhance patch-wise predictions from two complementary perspectives: (i) mitigating semantic entanglement in the visual encoder to obtain more discriminative patch representations, and (ii) learning an unsupervised visual classifier to narrow the vision-language modality gap. We then introduce an adaptive aggregation module that consolidates patch-level scores into the final multi-label prediction. Notably, the entire pipeline is fully training-free, requiring no gradient updates or parameter fine-tuning. Experiments show that our method achieves strong improvements with minimal extra computation, exceeding a 6% mAP gain on the challenging NUS-WIDE benchmark over representative baselines. Code is available at https://github.com/akang-wang/PIAA.

2605.25819 2026-05-26 cs.LG cs.CR

On Reliability of Efficient Membership Inference Vulnerability Evaluation

Joonas Jälkö, Gauri Pradhan, Ossi Räisä, Antti Honkela

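AI summary: Reveals two key flaws in efficient membership inference attack evaluation: FPR that is not calibrated across samples, which makes differential privacy auditing unreliable, and a finite population bias that overestimates per-sample vulnerability. Proposes a post-processing calibration method.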

Comments 14 pages, 10 figures

English abstract

Membership inference attacks (MIAs) are popular methods for empirically assessing the leakage of sensitive information in the training data through models or statistics learned from the data. MIA vulnerability is often evaluated through the false positive rate (FPR) and true positive rate (TPR) of a binary classifier that tries to predict whether a particular sample was in the training data. However, reliably estimating the TPR, especially at low FPR values, requires many observations, which for MIAs translates to many target models and thus a large computational cost. To avoid excessive compute requirements, the MIA scores are often averaged over multiple individuals and multiple target models. We demonstrate two key weaknesses in this efficient MIA evaluation pipeline. First, we show that evaluating the TPR based on MIA scores concatenated across multiple individuals, an approach commonly used to study vulnerabilities in the very low FPR regime, is not calibrated across the per-sample FPRs. This makes it unreliable as a tool for auditing differential privacy. To solve this, we propose a post-processing method to effectively calibrate the FPR across different samples. Second, we identify a finite population bias in the commonly used efficient likelihood-ratio attack (LiRA) implementation proposed by Carlini et al. (2022), leading to a positive bias in the per-sample vulnerability.
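For reference, the pooled TPR-at-fixed-FPR metric discussed above can be computed as in the sketch below: a single threshold is chosen from the concatenated non-member scores and then applied to the member scores. This illustrates the pooled evaluation that the abstract critiques, not the proposed calibration method, and the function name and tie handling are assumptions.

```python
def tpr_at_fpr(member_scores, nonmember_scores, target_fpr):
    """Empirical (TPR, FPR) at the lowest non-member score threshold whose
    FPR stays within `target_fpr`. A sample is predicted 'member' when its
    score is strictly greater than the threshold."""
    ranked = sorted(nonmember_scores, reverse=True)
    allowed_fp = int(target_fpr * len(ranked))
    if allowed_fp < len(ranked):
        threshold = ranked[allowed_fp]
    else:
        threshold = float("-inf")  # every sample may be flagged
    fp = sum(s > threshold for s in nonmember_scores)
    tp = sum(s > threshold for s in member_scores)
    return tp / len(member_scores), fp / len(nonmember_scores)


# Toy pooled scores: ten non-members, four members.
nonmembers = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
members = [0.95, 0.5, 0.99, 0.2]
tpr, fpr = tpr_at_fpr(members, nonmembers, target_fpr=0.1)
```

Because one threshold is shared by all samples, the FPR of any individual sample can differ from the pooled value, which is the calibration issue the abstract raises.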

2605.25816 2026-05-26 cs.CL cs.AI

Fine-Tuning Over Architectural Complexity: Broad-Coverage PII Detection on PIIBench with DeBERTa

Pritesh Jha

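AI summary: Fine-tunes DeBERTa on the multi-source PIIBench dataset covering 82 entity types for broad-coverage PII detection; direct fine-tuning achieves markedly higher F1 than the architecturally complex hierarchical model and its curriculum extension.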

English abstract

Personally identifiable information (PII) detection systems are frequently trained within narrow source or domain boundaries, limiting coverage when deployed on heterogeneous text. We study model fine-tuning on a corrected multi-source PIIBench preparation spanning 82 retained entity types across ten source datasets. We evaluate three DeBERTa-based approaches: direct token classification fine-tuning, a source-conditioned hierarchical model (SC+H), and a three-phase curriculum extension (SC+H+Curr). Against eight published comparator systems on a reproducible 5,000-record held-out subset (test_5k), direct fine-tuned DeBERTa achieves F1 0.6476, while SC+H and the curriculum variant achieve 0.5899 and 0.2772 respectively; the strongest published comparator reaches only 0.1723. Because validation initially favoured SC+H, we perform a final streamed evaluation on the complete 100,002-record held-out split. Direct fine-tuning remains superior, achieving F1 0.6455 versus 0.5894 for SC+H. Entity-level analysis shows that direct fine-tuning wins 54 of 82 fine entity types and all ten coarse groups by support-weighted entity F1, while SC+H retains localised advantages on 28 types. The results indicate that diverse task-specific training data and a simple weighted cross-entropy objective contribute more to broad-coverage PII detection than the tested architectural and curriculum complexity.
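One common reading of "support-weighted entity F1" is a per-type F1 averaged with weights equal to each type's number of gold entities. The sketch below implements that reading; whether the paper uses exactly this definition is an assumption, and the counts in the example are invented purely for illustration.

```python
def f1(tp, fp, fn):
    """F1 from true positives, false positives, and false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def support_weighted_f1(per_type):
    """per_type maps entity type -> (tp, fp, fn).

    Support is the number of gold entities of that type, tp + fn.
    """
    total_support = sum(tp + fn for tp, _, fn in per_type.values())
    weighted = sum(
        (tp + fn) * f1(tp, fp, fn) for tp, fp, fn in per_type.values()
    )
    return weighted / total_support


# Invented counts for two entity types with equal support.
counts = {"EMAIL": (8, 2, 2), "NAME": (1, 0, 9)}
score = support_weighted_f1(counts)
```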

2605.25814 2026-05-26 cs.CL cs.AI

Adaptive Graph Refinement and Label Propagation with LLMs for Cost-Effective Entity Resolution

Hongtao Wang, Renchi Yang, Haoran Zheng, Xiangyu Ke

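AI summary: Proposes the Alper framework, which unifies matching and clustering through iterative probabilistic label propagation, adaptively combines weak signals from graph propagation with strong LLM queries, and maximizes marginal gain under a budget constraint for cost-effective entity resolution.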

English abstract

Dirty entity resolution (ER), which identifies records referring to the same real-world entity from a single, messy dataset, is a fundamental task in data management and mining. However, the dominant blocking-matching-clustering paradigm for ER suffers from critical flaws. Its cascaded, decoupled workflow essentially produces a static, sparse graph plagued by missing edges (due to blocking failures) and noisy links (due to matching errors), causing error propagation and yielding suboptimal clusters, particularly when rigid transitivity is imposed in the clustering. We contend that matching and clustering are fundamentally synergistic, both optimizing for the construction of an ideal entity graph. Building upon this insight, we propose Alper, a unified framework that integrates these steps into an iterative probabilistic label propagation process over a global, evolving graph. Unlike disjoint blocking, Alper refines the graph structure and labels dynamically by adaptively integrating "weak but cheap" signals from graph propagation with "strong but expensive" LLM-based pairwise queries. For higher cost-effectiveness, we formulate the signal selection as a constrained optimization problem maximizing cumulative marginal gain under a query budget, solved via our greedy algorithm with provable theoretical guarantees. Our extensive experiments over eight benchmark datasets demonstrate that Alper is consistently superior to state-of-the-art cascaded pipelines.
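The budgeted selection step can be pictured with a generic greedy loop that repeatedly picks the candidate with the largest marginal gain until the query budget runs out. This is a textbook greedy sketch, not Alper's algorithm: the `gain` callback, the unit cost per query, and the coverage-style toy objective are all assumptions.

```python
def greedy_select(candidates, gain, budget):
    """Pick up to `budget` items, each time taking the candidate with the
    largest marginal gain given the current selection."""
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < budget:
        best = max(remaining, key=lambda c: gain(selected, c))
        if gain(selected, best) <= 0:
            break  # no remaining candidate adds anything
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy objective: each hypothetical LLM query clarifies a set of records,
# and the marginal gain is the number of newly clarified records.
covers = {"q1": {1, 2, 3}, "q2": {3, 4}, "q3": {1, 2}}


def coverage_gain(selected, candidate):
    already = set().union(*(covers[s] for s in selected))
    return len(covers[candidate] - already)


chosen = greedy_select(covers, coverage_gain, budget=2)
```

Greedy selection of this kind is known to give approximation guarantees for monotone submodular objectives; the abstract states that the paper's algorithm has provable guarantees but does not say which property they rely on.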

2605.25813 2026-05-26 cs.RO

Extending Embodied Question Answering from Perception to Decision

Xicheng Gong, Qiwei Li, Peiran Xu, Yadong Mu

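AI summary: Proposes the large-scale embodied QA dataset EQA-Decision and the baseline model RoboDecision, systematically covering four dimensions (static scene construction, spatial understanding, task dynamics reasoning, and instant decision) to evaluate perception, reasoning, and action-level decision-making in embodied environments within a unified framework.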

Comments 11 pages, 4 figures

English abstract

Embodied Question Answering (EQA) connects perception, reasoning, and interaction within embodied environments. However, existing datasets and benchmarks remain fragmented, each focusing on a limited subset of reasoning skills such as spatial understanding or procedural reasoning, without offering a unified large-scale framework for comprehensive evaluation. We present EQA-Decision, a large-scale embodied QA dataset that systematically covers four complementary dimensions of embodied reasoning: static scene construction, spatial understanding, task dynamics reasoning, and instant decision. The dataset contains over four million question-answer pairs with hierarchical annotations across diverse embodied scenarios. In addition, we develop RoboDecision, a strong baseline model aligned with the EQA-Decision Benchmark, providing a unified framework that jointly evaluates perception, reasoning, and action-level decision-making in embodied environments. Results demonstrate that EQA-Decision effectively benchmarks and enhances VLM capabilities in spatial and interaction reasoning, providing a solid foundation for advancing embodied intelligence research.

2605.25810 2026-05-26 cs.CV

Data-driven Head Motion Generation through Natural Gaze-Head Coordination

Xiaohan Liu, Yilin Wen, Yusuke Sugano

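AI summary: Proposes the first data-driven approach that automatically extracts natural gaze and head motion, uses a conditional variational autoencoder to generate gaze-correlated head motion, and applies it to gaze-controlled video generation.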

English abstract

We present the first data-driven approach to model temporal gaze-head coordination from large-scale in-the-wild facial videos. To obtain training data for generalizable learning, we propose an automatic pipeline that extracts natural yet diverse gaze and head motions with off-the-shelf appearance-based gaze estimators. To capture the probabilistic correlation and temporal dynamics of gaze-head coordination, we build our model on a generative conditional Variational Autoencoder to generate plausible yet diverse gaze-conditioned head motion. We further apply our framework to gaze-controlled facial video generation, where we enable video generation with natural and realistic head motion correlated with the input gaze, an aspect that has not been emphasized before. Human evaluation and quantitative comparisons demonstrate our method's effectiveness and validate our design choices, with evaluators showing a statistically significant preference for our approach over baseline methods.

2605.25804 2026-05-26 cs.CV

Event-to-Video Reconstruction using Spatio-Temporal and Frequency-Enhanced Deep Neural Networks

Ramna Maqsood, Paulo Nunes, Luís Ducla Soares, Caroline Conti

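AI summary: Proposes MSFET-E2V, which fuses spatio-temporal features with frequency representations from the discrete wavelet transform through a cross-domain attention module and adds a lightweight wavelet-enhanced skip block, achieving high-quality event-to-video reconstruction that surpasses existing methods on multiple datasets.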

English abstract

Event cameras offer significant advantages over conventional frame-based counterparts, including high temporal resolution, low latency, and energy efficiency. These characteristics make them suitable for high-speed and high-dynamic range scene acquisition scenarios; however, the lack of dense intensity frames limits the direct applicability of conventional computer vision methods for scene understanding. Event-to-video (E2V) reconstruction seeks to bridge this gap by converting asynchronous event streams into a sequence of synchronous video frames. Existing E2V reconstruction methods based on convolutional neural networks and transformers operate primarily in the spatial domain and often struggle to recover fine structural details while suppressing severe reconstruction artifacts. To address these issues, we propose MSFET-E2V, a novel multiscale frequency-enhanced transformer model. At its core lies a cross-domain attention module, which fuses spatio-temporal features with frequency-aware representations derived from the discrete wavelet transform. Unlike prior methods relying solely on spatial attention, our approach effectively captures both local and global structures by taking into account low- and high-frequency components, enhancing detail preservation and robustness across various motion scenarios. Furthermore, we propose a lightweight wavelet-enhanced skip block that serves as a skip connection, facilitating artifact suppression and structural detail refinement through joint spatial-frequency domain processing. Extensive experiments demonstrate that MSFET-E2V achieves superior performance over state-of-the-art methods on multiple real-world event datasets, offering significant gains in reconstruction quality. Moreover, compared to the existing transformer-based method, our proposed model significantly reduces the number of parameters, the GPU memory usage, and inference time.
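To show what the low- and high-frequency components of a discrete wavelet transform look like, here is a minimal one-level 1D Haar transform and its inverse. The choice of the Haar wavelet and the 1D setting are assumptions for illustration; the abstract does not specify the wavelet family, and the model operates on feature maps rather than short lists.

```python
import math


def haar_dwt_1d(signal):
    """One level of the orthonormal Haar transform on an even-length
    sequence. Returns (approximation, detail) coefficients."""
    if len(signal) % 2:
        raise ValueError("signal length must be even")
    s = math.sqrt(2.0)
    pairs = [(signal[i], signal[i + 1]) for i in range(0, len(signal), 2)]
    approx = [(a + b) / s for a, b in pairs]  # low-frequency content
    detail = [(a - b) / s for a, b in pairs]  # high-frequency content
    return approx, detail


def haar_idwt_1d(approx, detail):
    """Inverse of haar_dwt_1d."""
    s = math.sqrt(2.0)
    out = []
    for a, d in zip(approx, detail):
        out += [(a + d) / s, (a - d) / s]
    return out


signal = [4.0, 4.0, 2.0, 0.0]
approx, detail = haar_dwt_1d(signal)
restored = haar_idwt_1d(approx, detail)
```

The flat pair `(4.0, 4.0)` produces a zero detail coefficient, while the edge `(2.0, 0.0)` produces a non-zero one, which is why detail coefficients are useful for preserving fine structure.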

2605.25802 2026-05-26 cs.CV

Rethinking VLM Representation for VLA Initialization

Weifeng Lin, Siyuan Huang, Hao Li, Tingwei Chen, Ruichuan An, Xinyu Wei, Jianbo Liu, Hongsheng Li

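AI summary: Treats VLA initialization as a controlled representation-design problem along three axes (capability-level embodied VQA supervision, parameter-update strategy, and robot-data pretraining). Finds that preserving the pretrained VLM representation is critical for action performance, that LoRA gives a more reliable initialization than full fine-tuning, and that staged LoRA-based training yields the strongest variant.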

Comments 9 main-text pages, 5 appendix pages, 4 figures

English abstract

Vision-Language-Action (VLA) models widely adopt pretrained Vision-Language Models (VLMs) as policy backbones, yet it remains unclear what kind of pretrained VLM representation is useful as a VLA initialization. In this paper, we study VLA initialization as a controlled representation-design problem along three axes: capability-level embodied VQA supervision, parameter-update strategy, and robot-data pretraining. Our experiments show that the original pretrained VLM representation is a key source of action performance. However, embodied VQA adaptation does not yield uniform gains: its benefit depends on downstream bottlenecks, and gains from different capability domains are not simply additive. For update strategy, LoRA provides a more reliable initialization than Full Finetune, indicating that overly reshaping the pretrained representation can weaken VLA initialization. Robot-data pretraining further improves VLA initialization, with the strongest variant obtained by staged LoRA-based training. Together, these findings suggest that effective VLM-to-VLA adaptation should inject action-relevant embodied and robot-trajectory signals while preserving the pretrained VLM representation that remains useful for action learning.
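The contrast between LoRA and full fine-tuning is easier to see with the LoRA update written out. In the standard formulation, the frozen weight `W` is augmented with a low-rank product scaled by `alpha / r`. The pure-Python sketch below uses nested lists instead of tensors; the matrix sizes and values are illustrative, not taken from the paper.

```python
def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a * b for a, b in zip(row, col)) for col in zip(*y)]
        for row in x
    ]


def lora_effective_weight(w, a, b, alpha):
    """W + (alpha / r) * B @ A, with B of shape (d_out, r) and A of shape
    (r, d_in). W itself stays frozen; only A and B would be trained."""
    r = len(a)
    delta = matmul(b, a)
    scale = alpha / r
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


w = [[1.0, 0.0], [0.0, 1.0]]  # frozen pretrained weight
a = [[1.0, 2.0]]              # rank r = 1
b_init = [[0.0], [0.0]]       # B starts at zero, so the update starts at zero
b_trained = [[1.0], [0.0]]

at_init = lora_effective_weight(w, a, b_init, alpha=2.0)
after_training = lora_effective_weight(w, a, b_trained, alpha=2.0)
```

Because the update is confined to a rank-`r` subspace and starts at zero, the pretrained representation is perturbed less than under full fine-tuning, which fits the abstract's observation about preserving the pretrained VLM representation.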

2605.25801 2026-05-26 cs.CV

PixelWizard: Towards Efficient High-Fidelity Video Generation at Ultra-Large Spatial Resolution

Wenxue Li, Jingjing Ren, Peng Zhang, Tian Ye, Daiguo Zhou, Jian Luan, Lei Zhu

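AI summary: Proposes the PixelWizard framework, which hierarchically decouples global structure modeling from detail synthesis and introduces Noise-Span Aligned Shortcut Training, enabling efficient high-fidelity video generation at ultra-large resolution with over 10x acceleration.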

English abstract

High-resolution video generation faces a coupled bottleneck of optimization instability and prohibitive computational costs. The massive expansion of the token sequence not only biases optimization toward local textures at the expense of global coherence, leading to structural collapse, but also imposes prohibitive training costs and severe inference latency. To address this, we propose PixelWizard, a framework that hierarchically decouples global structure modeling from fine-grained detail synthesis. PixelWizard first establishes a compact spatiotemporal anchor to concentrate dense structural priors, which then guides fine-grained generation at high resolution. This mitigates the local optimization bias to ensure structural stability without compromising high-frequency details. Leveraging this structural stability, we introduce Noise-Span Aligned Shortcut Training to break the inference bottleneck. By explicitly modeling the step size, this mechanism allows the model to traverse the generation trajectory with large steps. Crucially, we incorporate Exponential Index-Biased Sampling and Adaptive Noise-Span Calibration to align optimization with the shifted noise schedules of high-resolution grids, ensuring robust few-step inference without incurring the heavy overhead of distillation. Extensive experiments demonstrate that PixelWizard achieves superior visual quality while accelerating the generative sampling of native 2K/4K videos by over 10x.

2605.25799 2026-05-26 cs.CV

Addressing Exacerbated Attention Sink for Source-Free Cross-Domain Few-Shot Learning

应对源自由跨域小样本学习中加剧的注意力汇聚问题

Shuai Yi, Yixiong Zou, Yuhua Li, Ruixuan Li

AI总结 针对跨域小样本学习中标准微调加剧注意力汇聚导致判别性下降的问题,提出基于令牌动态重加权的方法抑制简单令牌依赖并增强困难令牌学习,实现新最优性能。

Comments Accepted by CVPR 2026

详情
AI中文摘要

视觉语言模型(如CLIP)展现了令人印象深刻的泛化能力,但其在跨域小样本学习(CDFSL)中的潜力尚未充分探索,该任务需要模型将源域信息迁移到训练数据稀缺的目标域。尽管注意力汇聚现象已在某些任务的视觉语言模型中被观察到,但其在CDFSL场景中的作用尚未被研究。本文揭示了先前工作忽视的一个关键问题:CDFSL中标准的目标域小样本微调显著加剧了注意力汇聚问题,导致类别间判别性差。为理解这一现象,通过大量实验,我们将其解释为模型对域适应的捷径学习:为克服源域与目标域之间的巨大域差距,模型倾向于将初始更接近目标域类别的令牌(即简单令牌)推得更近,从而加剧注意力汇聚,浪费了学习其他有判别性但初始较远的令牌(即困难令牌)的能力。为解决此问题,我们提出一种新方法,在目标域微调期间根据令牌与目标域类别的相关性动态重加权令牌,明确抑制模型对简单令牌的依赖并增强困难令牌的学习,减少汇聚令牌并提升判别性。在四个基准数据集上的大量实验验证了我们方法的合理性,展现了新的最优性能。我们的代码可在 https://github.com/shuaiyi308/TIR 获取。

英文摘要

Vision-language models (VLMs) like CLIP have shown impressive generalization capabilities, yet their potential for Cross-Domain Few-Shot Learning (CDFSL), where the model must transfer source-domain information to target domains with scarce training data, remains underexplored. While the attention sink phenomenon has been observed in VLMs for certain tasks, its role in CDFSL scenarios has not been studied. In this paper, we uncover a critical issue overlooked by prior works: standard target-domain few-shot fine-tuning in CDFSL significantly exacerbates the attention sink problem, leading to poor discriminability across classes. To understand this phenomenon, through extensive experiments, we interpret it as the model's shortcut learning for domain adaptation: to overcome the huge domain gap between the source and target domains, the model tends to push tokens that are initially closer to target-domain classes (i.e., simple tokens) even closer to these classes, exacerbating the attention sink and wasting the capacity to learn other discriminative but initially more distant tokens (i.e., hard tokens). To address this, we propose a novel approach that dynamically re-weights tokens according to their relevance to target-domain classes during target-domain fine-tuning, which explicitly suppresses the model's reliance on these simple tokens and enhances the learning of hard tokens, reducing sink tokens and enhancing discriminability. Extensive experiments on four benchmark datasets validate the rationale of our method, demonstrating new state-of-the-art performance. Our code is available at https://github.com/shuaiyi308/TIR.

2605.25794 2026-05-26 cs.AI

When Can We Trust Early Warnings? Leakage-Excluded Early Outcome Prediction from LMS Interaction Logs

何时可以信任早期预警?从 LMS 交互日志中排除泄漏的早期结果预测

Ngoc Luyen Le, Marie-Hélène Abel, Bertrand Laforge

AI总结 针对学习管理系统日志中早期预测结果因时间泄漏而被高估的问题,提出 LEAP 协议(排除泄漏的早期可用性协议),通过截止优先截断和特征溯源审计防止后截止证据进入基准,并在 OULAD 数据集上验证了多种方法的性能。

详情
AI中文摘要

基于学习管理系统(LMS)日志构建的早期预警模型旨在尽早预测课程结束结果,以便及时提供学习者支持。然而,报告的“早期”性能常常因时间泄漏而被夸大。当流程使用了在预测时尚未可用的信息时,就会发生这种情况。我们在时间可用性约束下形式化了基于截止点的早期结果预测,并引入了 LEAP(排除泄漏的早期可用性协议),该协议在连接和聚合之前强制执行截止优先截断,并审计特征来源以防止后截止证据进入基准。我们在公共开放大学学习分析数据集(OULAD)上实例化 LEAP,作为跨周截止点的泄漏控制评估的多步骤协议。使用几种标准学习方法,我们通过 ROC-AUC、PR-AUC、Brier 分数和 F1@0.5 评估性能。结果显示,随着观察窗口的扩大,性能提高,在第 3 周左右有显著提升;随机森林在最早截止点表现最佳,而梯度提升在此后占主导地位。泄漏消融进一步表明,时间违规,特别是通过评估信息,可能会夸大表观的“早期”性能。

英文摘要

Early-warning models built from Learning Management System (LMS) logs aim to predict end-of-course outcomes early enough to enable timely learner support. However, reported "early" performance is often inflated by temporal leakage. This occurs when the pipeline uses information that would not yet be available at the time of prediction. We formalize cutoff-based early outcome prediction under a temporal availability constraint and introduce LEAP (Leakage-Excluded Early-Availability Protocol), which enforces cutoff-first truncation prior to joins and aggregation and audits feature provenance to prevent post-cutoff evidence from entering the benchmark. We instantiate LEAP on the public Open University Learning Analytics Dataset (OULAD) as a multi-step protocol for leakage-controlled evaluation across weekly cutoffs. Using several standard learning methods, we evaluate performance using ROC-AUC, PR-AUC, Brier score, and F1@0.5. Results show improving performance as the observation window expands, with a marked gain around week 3; Random Forest performs best at the earliest cutoffs, while Gradient Boosting dominates thereafter. Leakage ablations further show that temporal violations, especially through assessment information, can inflate apparent "early" performance.
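The cutoff-first idea in this abstract can be illustrated with a minimal sketch: events are truncated to the cutoff before any aggregation, so later activity cannot leak into the features. The record layout (`student`, `day`, `clicks`) and all function names are hypothetical; this is not the OULAD schema or the authors' pipeline.

```python
from collections import defaultdict

def truncate_to_cutoff(events, cutoff_day):
    """Keep only events observed on or before the cutoff day."""
    return [e for e in events if e["day"] <= cutoff_day]

def total_clicks(events):
    """Aggregate clicks per student over whatever events are passed in."""
    totals = defaultdict(int)
    for e in events:
        totals[e["student"]] += e["clicks"]
    return dict(totals)

def early_features(events, cutoff_day):
    """Cutoff-first: truncate, then aggregate (never the other way round)."""
    return total_clicks(truncate_to_cutoff(events, cutoff_day))

# Hypothetical toy log: s1 has activity both before and after day 21.
events = [
    {"student": "s1", "day": 3, "clicks": 10},
    {"student": "s1", "day": 40, "clicks": 25},
    {"student": "s2", "day": 5, "clicks": 4},
]
leaky = total_clicks(events)                   # aggregates the full log
clean = early_features(events, cutoff_day=21)  # sees only pre-cutoff events
```

Comparing `leaky` with `clean` shows how aggregating before truncating lets post-cutoff activity into an "early" feature.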

2605.25790 2026-05-26 cs.RO

HoLoArm: Deformable Arms for Collision-Tolerant Quadrotor Flight

HoLoArm: 用于碰撞容忍四旋翼飞行的可变形臂

Quang Ngoc Pham, Jonas Eschmann, Yang Zhou, Alejandro Ojeda Olarte, Giuseppe Loianno, Van Anh Ho

AI总结 受蜻蜓翅膀结脉结构启发,提出具有柔性臂的四旋翼HoLoArm,结合强化学习控制策略实现被动变形与快速恢复,在高达7.6 m/s碰撞速度下保持稳定飞行。

Comments 8 pages, 15 figures, 1 table, Accepted at the IEEE Robotics and Automation Letters (RA-L) and the IEEE International Conference on Robotics and Automation (ICRA), 2026

详情
Journal ref
IEEE Robotics and Automation Letters, vol. 11, no. 3, pp. 3582-3589, March 2026
AI中文摘要

无人机在以人为中心的应用中日益普及,凸显了对能够承受碰撞并快速恢复的设计的需求,以最小化对人类和环境的风险。我们提出了HoLoArm,一种具有柔性臂的四旋翼,其灵感来源于蜻蜓翅膀的结脉结构。这种设计在保持飞行稳定性的同时提供了自然的柔韧性和弹性,并通过集成强化学习(RL)控制策略进一步增强了恢复和悬停性能。实验结果表明,HoLoArm可以在任何方向(包括轴向)被动变形,并根据冲击方向和程度在0.3-0.6秒内恢复。无人机能够在高达7.6米/秒的碰撞速度下存活,并携带540克有效载荷,同时保持稳定飞行。这项工作有助于具有高敏捷性和可靠安全性的软体空中机器人的形态设计,使其能够在杂乱和人类共享的环境中运行,并为未来将柔性结构与智能控制相结合的完全软体无人机奠定了基础。

英文摘要

The increasing use of drones in human-centric applications highlights the need for designs that can survive collisions and recover rapidly, minimizing risks to both humans and the environment. We present HoLoArm, a quadrotor with compliant arms inspired by the nodus structure of dragonfly wings. This design provides natural flexibility and resilience while preserving flight stability, which is further reinforced by the integration of a Reinforcement Learning (RL) control policy that enhances both recovery and hovering performance. Experimental results demonstrate that HoLoArm can passively deform in any direction, including the axial direction, and recover within 0.3-0.6 s depending on the direction and level of the impact. The drone can survive collisions at speeds up to 7.6 m/s and carry a 540 g payload while maintaining stable flight. This work contributes to the morphological design of soft aerial robots with high agility and reliable safety, enabling operation in cluttered and human-shared environments, and lays the groundwork for future fully soft drones that integrate compliant structures with intelligent control.

2605.25789 2026-05-26 cs.LG cs.AI cs.IT math.IT stat.ML

On the Benefits of Free Exploration for Regret Minimization in Multi-Armed Bandits

关于自由探索对多臂老虎机遗憾最小化的益处

Yunlong Hou, Zixin Zhong, Vincent Y. F. Tan

AI总结 本文研究在初始自由探索阶段后最小化累积遗憾的多臂老虎机问题,提出一种两阶段算法UFE-KLUCB-H,并证明其相比无自由探索的策略能严格减少遗憾。

Comments 55 pages

详情
AI中文摘要

我们研究了一个随机多臂老虎机问题,其中智能体在遗憾累积之前被授予一个自由探索预算,这是经典遗憾最小化或纯探索范式未涵盖的设置。目标是设计一个自适应策略,在初始自由探索阶段策略性地探索老虎机实例,并在后续阶段最小化累积遗憾。我们形式化了这个带有自由探索的遗憾最小化问题,并识别出一个有趣的区间,其中自由探索预算与时间范围成对数比例。为了量化由于自由探索阶段的可用性而高概率节省的遗憾量,我们引入了一类新的策略,称为$(α,β)$-可能节省策略。我们提出了一种两阶段、可能节省的算法UFE-KLUCB-H,它由一个原则性的自由探索策略UFE和一个历史感知的遗憾最小化策略KLUCB-H组成。推导了UFE-KLUCB-H的实例相关上界,表明UFE-KLUCB-H累积的遗憾严格少于无法访问自由探索阶段的策略。作为补充,我们基于针对自由探索环境定制的多实例扰动论证推导了实例相关下界,证明了UFE-KLUCB-H对于二值老虎机的近乎最优性。我们的上界和下界揭示了累积遗憾中依赖于可用自由探索量的尖锐相变。进行了仿真,表明算法中的强制探索和自适应性导致了更大的遗憾节省。

英文摘要

We study a stochastic multi-armed bandit problem where an agent is granted a free exploration budget before regret accumulates, a setting not captured by the classic regret minimization or pure exploration paradigms. The goal is to design an adaptive policy that strategically explores the bandit instance in the initial free exploration phase and minimizes the cumulative regret in the subsequent phase. We formalize this regret minimization with free exploration problem and identify an interesting regime where the free exploration budget scales logarithmically with the time horizon. To quantify the amount of regret saved with high probability as a result of the availability of the free exploration phase, we introduce a novel set of policies known as $(α,β)$-probably saving policies. We propose a two-phase, probably saving algorithm, UFE-KLUCB-H, which consists of a principled free exploration policy, UFE, and a history-aware regret minimization policy KLUCB-H. Instance-dependent upper bounds on UFE-KLUCB-H are derived, showing that UFE-KLUCB-H accumulates strictly less regret than policies that do not have access to a free exploration phase. Complementarily, we derive instance-dependent lower bounds based on novel multi-instance perturbation arguments tailored to the free-exploration setting, demonstrating the near-optimality of UFE-KLUCB-H for two-valued bandits. Our upper and lower bounds reveal sharp phase transitions in the accumulated regret depending on the amount of available free exploration. Simulations are conducted to demonstrate that forced exploration and adaptivity in the algorithm lead to greater regret savings.

2605.25786 2026-05-26 cs.LG cs.AI

NPSolver: Neural Poisson Solver with Iterative Physics Supervision

NPSolver: 具有迭代物理监督的神经泊松求解器

Bocheng Zeng, Rui Zhang, Runze Mao, Mengtao Yan, Xuan Bai, Yang Liu, Zhi X. Chen, Hao Sun

AI总结 提出NPSolver,通过迭代物理监督(利用少量PCG步骤)训练无标签的神经泊松求解器,并引入边界感知Transolver架构,在2D/3D不规则几何上优于物理信息和数据驱动基线。

Comments KDD 2026

详情
AI中文摘要

在复杂不规则域上高效求解泊松方程仍然是科学计算中的一个基本挑战,因为经典迭代求解器常常因病态系统而面临过长的运行时间。虽然神经算子提供了一种快速的替代方案,但它们通常依赖大规模标记数据集,或者在使用物理信息残差损失时难以处理不稳定的训练动态。我们提出NPSolver,一种通过迭代物理监督训练的无标签神经泊松求解器。NPSolver不依赖完全收敛的数值解或原始PDE残差,而是利用少量预处理共轭梯度(PCG)步骤来优化自身预测,从而提供更稳定且尺度良好的训练信号。理论分析证实,这种迭代监督充当了良态误差代理,并且停止梯度设计对于优化稳定性至关重要。为了更好地捕捉混合边界条件下的边界驱动特征,我们进一步引入了边界感知Transolver(BA-Transolver)架构,该架构明确分离了内部和边界令牌化。在2D和3D不规则几何上的广泛评估表明,NPSolver优于物理信息和数据驱动基线。此外,一个下游热控制任务突出了该模型进行高效可靠的基于梯度的边界控制的能力。我们将在 https://github.com/intell-sci-comput/NPSolver 发布我们的代码和数据。

英文摘要

Efficiently solving Poisson equations on complex, irregular domains remains a fundamental challenge in scientific computing, as classical iterative solvers often suffer from prohibitive runtime due to ill-conditioned systems. While neural operators offer a fast alternative, they typically rely on large-scale labeled datasets or struggle with unstable training dynamics when using physics-informed residual losses. We propose NPSolver, a neural Poisson solver trained without solution labels via iterative physics supervision. Instead of relying on fully converged numerical solutions or raw PDE residuals, NPSolver utilizes a small number of preconditioned conjugate gradient (PCG) steps to refine its own predictions, providing a more stable and well-scaled training signal. Theoretical analysis confirms that this iterative supervision serves as a well-conditioned error proxy and that a stop-gradient design is essential for optimization stability. To better capture boundary-driven features under mixed boundary conditions, we further introduce the Boundary-Aware Transolver (BA-Transolver) architecture that explicitly separates interior and boundary tokenization. Extensive evaluations on 2D and 3D irregular geometries demonstrate that NPSolver outperforms both physics-informed and data-driven baselines. Furthermore, a downstream thermal control task highlights the model's capability for conducting efficient and reliable gradient-based boundary control. We will release our code and data at https://github.com/intell-sci-comput/NPSolver.
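As a rough illustration of refining a prediction with a few PCG steps, the sketch below runs Jacobi-preconditioned conjugate gradient on a small 1D Poisson system and uses the refined vector as a fixed regression target. The Jacobi preconditioner, the toy matrix, the all-zero "prediction", and the function names are assumptions for illustration; the abstract does not specify the paper's preconditioner, discretization, or loss.

```python
def matvec(A, x):
    """Dense matrix-vector product on plain Python lists."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def pcg_refine(A, b, u0, steps):
    """Run a few Jacobi-preconditioned CG steps on A u = b, starting from u0."""
    u = list(u0)
    r = [bi - ai for bi, ai in zip(b, matvec(A, u))]
    inv_diag = [1.0 / A[i][i] for i in range(len(b))]
    z = [d * ri for d, ri in zip(inv_diag, r)]
    p = list(z)
    rz = dot(r, z)
    for _ in range(steps):
        if rz < 1e-30:  # already converged; avoid dividing by ~0
            break
        Ap = matvec(A, p)
        alpha = rz / dot(p, Ap)
        u = [ui + alpha * pi for ui, pi in zip(u, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        z = [d * ri for d, ri in zip(inv_diag, r)]
        rz_new = dot(r, z)
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return u

def supervision_loss(prediction, target):
    """Mean squared error against the refined target, which is held fixed.

    In an autodiff framework the target would be detached (stop-gradient).
    """
    return sum((p - t) ** 2 for p, t in zip(prediction, target)) / len(target)

# Toy 1D Poisson system (tridiagonal Laplacian) and a crude stand-in prediction.
n = 5
A = [[2.0 if i == j else -1.0 if abs(i - j) == 1 else 0.0 for j in range(n)]
     for i in range(n)]
b = [1.0] * n
prediction = [0.0] * n
target = pcg_refine(A, b, prediction, steps=2)
loss = supervision_loss(prediction, target)
```

The refined target moves toward the true solution without requiring a fully converged solve, which is the property the abstract attributes to iterative supervision.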

2605.25784 2026-05-26 cs.CV cs.MM

VertiCue-Bench: Diagnosing Whether MLLMs Use Height Cues to Resolve 2D Ambiguity in Remote Sensing Natural Scenes

VertiCue-Bench: 诊断多模态大语言模型是否利用高度线索解决遥感自然场景中的二维歧义

Jing Huang, Duanchu Wang, Junjie Yang, Zihang Cheng, Cheng Li, Lin Cui, Zhouyi Wu, Di Wang

AI总结 提出VertiCue-Bench基准,通过17个任务1534个实例诊断MLLMs是否真正利用冠层高度模型(CHM)的垂直线索解决遥感自然场景中的语义歧义,发现模型在感知高度线索与语义推理之间存在显著脱节。

详情
AI中文摘要

多模态大语言模型(MLLMs)最近在地理空间推理方面显示出有希望的进展。然而,现有的遥感基准仍然主要围绕二维中心,主要基于光学外观评估模型。在自然环境中,由于严重的光谱混淆,这种范式失效,其中生态上不同的区域共享相似的纹理但在垂直结构上根本不同。在这种情况下,明确的3D结构数据,如冠层高度模型(CHMs),成为语义消歧的基本几何证据。然而,目前尚不清楚当前的MLLMs是否能够真正利用垂直线索来解决外观级别的歧义。为了填补这一空白,我们引入了VertiCue-Bench,这是第一个基于CHM的地理空间推理诊断基准。VertiCue-Bench包含1534个精心策划的实例,涵盖17个任务,明确将低级高度感知与歧义感知的语义推理分离。对14个最先进的通用和遥感专用MLLMs的评估,结合反事实模态测试,揭示了惊人的感知-推理分离。虽然模型在读取原始CHM高度线索方面表现出新兴能力,但它们大多未能将几何感知转化为可靠的语义推理,在需要联合约束时通常表现不如仅使用RGB的基线。总体而言,VertiCue-Bench揭示了自然场景理解中关键的几何到语义的差距,为推进地理空间MLLMs提供了可行的见解。

英文摘要

Multimodal Large Language Models (MLLMs) have recently shown promising progress in geospatial reasoning. However, existing remote sensing benchmarks remain largely 2D-centric, evaluating models primarily on optical appearance. In natural environments, this paradigm breaks down due to severe spectral confusion, where ecologically distinct regions share similar textures but differ fundamentally in vertical structure. In such cases, explicit 3D structural data, such as Canopy Height Models (CHMs), become essential geometric evidence for semantic disambiguation. Yet, it remains unclear whether current MLLMs can genuinely leverage vertical cues to resolve appearance-level ambiguity. To address this gap, we introduce VertiCue-Bench, the first diagnostic benchmark for CHM-grounded geospatial reasoning. VertiCue-Bench comprises 1,534 carefully curated instances across 17 tasks, explicitly disentangling low-level height perception from ambiguity-aware semantic reasoning. Evaluations on 14 state-of-the-art general and remote-sensing-specialized MLLMs, combined with counterfactual modality testing, reveal a striking perception-reasoning dissociation. While models exhibit emerging competence in reading raw CHM height cues, they largely fail to translate geometric perception into reliable semantic reasoning, often underperforming RGB-only baselines when joint constraints are required. Overall, VertiCue-Bench exposes a critical geometry-to-semantics gap in natural scene understanding, offering actionable insights for advancing geospatial MLLMs.

2605.25781 2026-05-26 cs.CL

Double Triangle Annotation: A Scalable Human-in-the-Loop Framework for High-Precision Historical Document Annotation

双三角形标注:一种可扩展的人机协同高精度历史文档标注框架

Yi Ren

AI总结 提出双三角形标注框架,通过两层人机协同和跨模型共识自动完成大部分标注工作,实现高精度历史文档结构化信息提取。

Comments 12 pages, 4 figures. ACL ARR 2026 March submission

详情
AI中文摘要

大规模评估历史文档的结构化信息提取需要高精度的真实标注,但传统人工标注成本高昂,而基于大语言模型的完全自动化流水线容易产生幻觉。我们提出双三角形标注,一种双层人机协同框架,利用跨模型共识自动完成大部分标注工作,同时确保高精度输出。第一层中,两个架构独立的多模态大语言模型并行标注每个文档;当它们一致时,标签自动接受,不一致则提交给人工评审。第二层将两个这样的系统相互交叉检查,将剩余冲突升级给领域专家。该框架基于一个假设——模型之间的错误独立性——不需要分布先验或任务特定校准,并且随着模型能力的提升而变得更加自主。在Guides Rosenwald(一个涵盖1887-1906年的法国医疗目录语料库)上,该框架实现了0.003的最终词错误率。大规模应用时,模型共识自动接受了13,595个字段中的85%以上。我们发布了由此产生的基准——Rosenwald指南的第一个结构化提取真实标注——以支持未来历史文档处理工作。

英文摘要

Evaluating structured-information extraction from historical documents at scale requires high-precision ground-truth annotations, yet traditional manual labeling is expensive and fully automated pipelines built on large language models are prone to hallucination. We propose Double Triangle Annotation, a two-layer human-in-the-loop framework that leverages cross-model consensus to automate the majority of annotation work while ensuring high-precision outputs. In the first layer, two architecturally independent Multimodal Large Language Models annotate each document in parallel; when they agree, the label is auto-accepted, and disagreements are routed to a human jury. A second layer cross-checks two such systems against each other, escalating residual conflicts to a domain expert. The framework rests on a single assumption -- error independence between models -- requires no distributional priors or task-specific calibration, and becomes more autonomous as model capability improves. On the Guides Rosenwald, a corpus of French medical directories spanning 1887-1906, the framework achieves a final Word Error Rate of 0.003. Applied at scale, model consensus auto-accepts over 85% of 13,595 fields. We release the resulting benchmark -- the first structured-extraction ground truth for the Rosenwald Guides -- to support future work on historical document processing.
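The first-layer routing described in this abstract can be sketched as follows: fields on which two annotators agree are auto-accepted, and everything else is queued for review. The dictionary layout, the example field names, and the function name are hypothetical; the second layer and the human jury are not modeled.

```python
def route_by_consensus(labels_a, labels_b):
    """Auto-accept fields where two annotators agree; queue the rest for review."""
    accepted, review = {}, []
    for field in labels_a:
        if field in labels_b and labels_a[field] == labels_b[field]:
            accepted[field] = labels_a[field]
        else:
            review.append(field)
    # Fields seen only by the second annotator also need a human look.
    review.extend(f for f in labels_b if f not in labels_a)
    return accepted, review

# Hypothetical outputs from two independent models for one directory entry.
model_a = {"name": "Dupont", "city": "Paris", "year": "1887"}
model_b = {"name": "Dupont", "city": "Pari", "year": "1887"}
accepted, review = route_by_consensus(model_a, model_b)
```

Under the error-independence assumption stated in the abstract, an incorrect value is auto-accepted only when both models make the same mistake, which is why agreement is used as the acceptance signal.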

2605.25778 2026-05-26 cs.CV

OMGTex: One-stage Multi-style Facial Texture Reconstruction without Geometry Guidance

OMGTex: 无需几何引导的一阶段多风格面部纹理重建

Zitong Xiao, Yuda Qiu, Zisheng Ye, Xiaoguang Han

AI总结 提出OMGTex,一种端到端的扩散框架,无需3D几何先验,直接从多风格面部图像重建高质量、可编辑的UV纹理,通过梯度引导推理和语义感知训练实现鲁棒重建与编辑。

Comments CVPR 2026 (Poster)

详情
AI中文摘要

我们提出OMGTex,一种端到端的基于扩散的框架,用于从多风格面部图像重建高质量且可编辑的面部UV纹理。现有的纹理重建方法面临两个主要限制:(1) 依赖于难以准确估计的3D几何先验,尤其是在面部遮挡或风格化域中,导致脆弱性;(2) 缺乏语义解耦,阻碍了区域特定的纹理编辑和风格迁移。我们的工作同时解决了这两个挑战。 我们的核心创新是一个无几何的流水线,直接将2D面部图像映射到其对应的可编辑UV纹理。我们引入了两种关键技术:首先,为了解决扩散生成中常见的UV错位问题,我们引入了一种推理时的梯度引导细化策略,显式校正结构一致性。其次,我们利用扩散模型固有的语义分布能力,设计了一种新颖的训练范式来增强这种倾向,从而实现面部纹理的语义感知编辑。此外,为了解决多风格纹理重建中的数据稀缺问题,我们构建了CANVAS,这是第一个涵盖真实和多样化风格化领域的全面配对纹理重建数据集。 据我们所知,OMGTex是第一个无几何推理框架,能够在不同领域实现鲁棒、风格一致且可编辑的面部纹理重建。我们的方法在多个面部纹理基准上达到了最先进的性能。

英文摘要

We propose OMGTex, an end-to-end diffusion-based framework for reconstructing high-quality and editable facial UV textures from multi-style facial images. Existing texture reconstruction methods face two major limitations: (1) Fragility due to reliance on 3D geometry priors, which are difficult to estimate accurately, especially under facial occlusions or in stylized domains; and (2) A lack of semantic disentanglement, inhibiting region-specific texture editing and style transfer. Our work addresses both challenges simultaneously. Our core innovation is a geometry-free pipeline that directly maps a 2D face image to its corresponding editable UV texture. We introduce two key techniques: First, to address the challenge of UV misalignment common in diffusion generation, we introduce a gradient-guided refinement strategy at inference time, which explicitly corrects structural consistency. Second, we leverage the inherent semantic distribution capability of diffusion models and design a novel training paradigm to enhance this tendency, enabling semantic-aware editing of facial texture. Furthermore, to address the data scarcity in multi-style texture reconstruction, we construct CANVAS, the first comprehensive paired texture reconstruction dataset covering realistic and diverse stylized domains. To the best of our knowledge, OMGTex is the first geometry-free inference framework that achieves robust, style-consistent, and editable facial texture reconstruction across diverse domains. Our method achieves state-of-the-art performance on multiple facial texture benchmarks.

2605.25775 2026-05-26 cs.CV

DRFusion: Drift-Resilient Temporally Consistent Infrared-Visible Video Fusion

DRFusion: 抗漂移的时间一致红外-可见光视频融合

Xingyuan Li, Haoyuan Xu, Shulin Li, Xiang Chen, Zhiying Jiang, Jinyuan Liu

AI总结 提出一种抗漂移的视频融合方法,将任务重构为历史条件运动生成,通过稳定历史引导和软时间锚定实现时间一致性,并采用解耦结构-运动适应策略,在融合质量和时间稳定性上达到最优。

Comments 11 pages, 7 figures, 4 tables

详情
AI中文摘要

红外和可见光视频融合对于在动态场景中实现全面感知至关重要。然而,保持时间一致性仍然是一个艰巨的挑战。依赖光流的传统方法通常存在几何刚性和重影伪影。此外,标准的基于扩散的融合模型通常以逐帧方式运行;当扩展到自回归设置时,它们缺乏内在的时间约束,并且容易出现严重的误差累积和漂移,其中微小的伪影随时间放大。为了解决这些限制,我们提出了一种抗漂移的视频融合方法,将任务重构为历史条件运动生成。我们引入了稳定历史引导和软时间锚定,将时间一致性重新定义为频谱滤波,无需刚性对齐即可隐式聚合运动动态。此外,我们的解耦结构-运动适应策略通过两阶段训练和潜在细化桥接了预训练先验和结构约束。大量实验表明,我们的方法在融合质量和时间稳定性方面均达到了最先进的性能。

英文摘要

Infrared and visible video fusion is essential for achieving comprehensive perception in dynamic scenes. However, maintaining temporal consistency remains a formidable challenge. Conventional methods relying on optical flow often suffer from geometric rigidity and ghosting artifacts. Moreover, standard diffusion-based fusion models typically operate in a frame-by-frame manner; when extended to autoregressive settings, they lack intrinsic temporal constraints and are prone to severe error accumulation and drifting, where minor artifacts amplify over time. To address these limitations, we propose a drift-resilient video fusion method that reformulates the task as history-conditioned motion generation. We introduce Stabilized History Guidance and Soft Temporal Anchoring to reframe temporal consistency as spectral filtering, implicitly aggregating motion dynamics without rigid alignment. Furthermore, our Decoupled Structure-Motion Adaptation strategy bridges pre-trained priors and structural constraints via two-stage training and latent refinement. Extensive experiments demonstrate that our method achieves state-of-the-art performance in both fusion quality and temporal stability.

2605.25771 2026-05-26 cs.LG cs.AI

MDGMIX: Boundary-Aware Subgraph Mixing for Multi-Domain Graph Pre-Training

MDGMIX: 边界感知的子图混合用于多域图预训练

Ziyu Zheng, Yaming Yang, Ziyu Guan, Wei Zhao, Xinyan Huang

AI总结 针对多域图预训练中的数据冗余问题,提出MDGMIX框架,通过边界感知子图混合与层次判别学习解耦共享和域特定模式,并在适配时使用轻量级提示加权机制,在少样本分类任务中优于强基线且效率更高。

Comments Accepted by ICML 2026

详情
AI中文摘要

多域图预训练是构建具有跨域泛化能力的基础图模型的关键步骤。然而,现有方法主要依赖联合训练所有源域图,导致计算成本高。此外,尚不清楚所有源域图数据是否对有效迁移有同等贡献。本文通过实验揭示了多域图预训练中存在显著的数据冗余。基于这一发现,我们提出了多域图预训练框架MDGMIX,该框架将边界感知的子图混合与层次判别相结合。通过选择边界节点构建具有挑战性的混合域子图,MDGMIX利用粗粒度域判别和细粒度域分解损失来解耦共享模式与域特定模式。在适配过程中,MDGMIX采用轻量级提示加权机制来迁移源域知识。大量实验表明,MDGMIX在少样本分类任务中持续优于强基线,同时表现出优越的时间和内存效率。代码可在 https://github.com/zhengziyu77/MDGMIX 获取。

英文摘要

Multi-domain graph pre-training is a crucial step in constructing foundational graph models with cross-domain generalization capabilities. However, existing methods predominantly rely on jointly training all source domain graphs, resulting in high computational costs. Furthermore, it remains unclear whether all source domain graph data contribute equally to effective transfer. This paper empirically reveals significant data redundancy in multi-domain graph pre-training. Based on this finding, we propose the Multi-domain Graph Pre-training Framework, MDGMIX, which combines boundary-aware subgraph mixing with hierarchical discrimination. By selecting boundary nodes to construct challenging mixed-domain subgraphs, MDGMIX employs coarse-grained domain discrimination and fine-grained domain decomposition losses to decouple shared patterns from domain-specific patterns. During adaptation, MDGMIX employs a lightweight prompt weighting mechanism to transfer source domain knowledge. Extensive experiments demonstrate that MDGMIX consistently outperforms strong baselines in few-shot classification tasks while exhibiting superior time and memory efficiency. The code is available at: https://github.com/zhengziyu77/MDGMIX.

2605.25765 2026-05-26 cs.CV cs.AI cs.LG

Concept Unlearning via Cross-Attention Activation Projection for Diffusion Models

通过交叉注意力激活投影实现扩散模型的概念遗忘

Saemi Moon, Suhyeon Jun, Seoyeon Lee, Dongwoo Kim

AI总结 提出PURE方法,利用交叉注意力激活空间构建遗忘和保留基,通过线性投影编辑权重,在保持保留概念的同时有效消除目标概念。

详情
AI中文摘要

概念遗忘旨在从预训练的文本到图像扩散模型中擦除目标概念,而无需重新训练。闭式方法在此设置中具有吸引力,因为它们对交叉注意力权重应用单一确定性编辑,并且不增加推理时间成本。然而,现有的闭式方法通过文本编码器对少数命名目标概念的简短锚定提示的响应来表示目标概念,而唤起该概念但不一致命名的释义提示可以绕过编辑。我们认为,目标应该改为在交叉注意力激活空间中表示。文本嵌入描述用户的提示,而交叉注意力激活描述模型即将渲染的内容,后者泛化到锚定模板未覆盖的释义。基于这一观察,我们提出了PURE(U-Net渲染中的投影用于擦除),这是一种闭式方法,从沿短去噪轨迹捕获的逐层交叉注意力激活构建遗忘和保留基,并将单个线性投影器应用于交叉注意力键和值权重。在最近涵盖艺术风格、知识产权、名人和NSFW类别中十个概念的整体概念遗忘基准上,PURE显著减少了在释义和对抗性提示下的目标泄露,同时将保留概念保持接近未编辑模型,在评估方法中实现了最佳的总体遗忘-保留权衡。

英文摘要

Concept unlearning aims to erase a target concept from a pretrained text-to-image diffusion model without retraining. Closed-form methods are attractive in this setting because they apply a single deterministic edit to the cross-attention weights and add no inference-time cost. Existing closed-form methods, however, represent the target concept through the text encoder's response to a few short anchor prompts that name it, and paraphrased prompts that evoke the concept without naming it consistently bypass the edit. We argue that the target should instead be represented in the cross-attention activation space. Text embeddings describe the user's prompt, while cross-attention activations describe what the model is about to render, and the latter generalize to paraphrases that the anchor templates do not cover. Building on this observation, we propose PURE (Projection in U-Net Rendering for Erasure), a closed-form method that builds the forget and retain bases from per-layer cross-attention activations captured along a short denoising trajectory and applies a single linear projector to the cross-attention key and value weights. On a recent holistic concept-unlearning benchmark covering ten concepts across artistic style, intellectual property, celebrity, and NSFW categories, PURE significantly reduces target leakage under paraphrased and adversarial prompts while preserving retain concepts close to the unedited model, yielding the best overall forget-retain trade-off among evaluated methods.

2605.25764 2026-05-26 cs.CV cs.AI

Benchmarking Pathology Foundation Models for Spatial Domain Understanding

病理基础模型在空间域理解中的基准测试

Bokai Zhao, Yiyang Zhang, Yuanchi Zhu, Hanqing Chao, Long Bai, Tai Ma, Minfeng Xu, Ming Song, Tianzi Jiang

AI总结 提出SpaPath-Bench基准,通过空间域识别任务评估病理基础模型在区分组织区域和捕获空间关系方面的表示能力。

Comments MICCAI 2026

详情
AI中文摘要

病理基础模型(PFMs)已成为从全切片图像(WSIs)中学习可迁移表示的核心方法,通常通过下游临床终点进行基准测试。虽然这种任务级评估不可或缺,但它们对表示本身编码了什么提供了有限的见解,特别是PFM嵌入是否能够区分有意义的组织区域并捕获其空间关系。我们提出了SpaPath-Bench,一个表示级基准,旨在诊断PFMs中的空间表示能力。SpaPath-Bench将配对全切片图像和空间转录组学(ST)数据上的空间域识别(SDI)制定为诊断任务。它整理了42个公开的配对WSI和ST切片,支持跨19个编码器和7种SDI方法的大规模评估,并使用三个互补标准衡量分区质量:无监督空间一致性、转录组学参考一致性和专家参考一致性。在83K次运行中,SpaPath-Bench揭示了不同的预训练范式捕获了组织空间架构的不同方面,并为构建下一代空间感知计算病理模型提供了实用指导。代码和数据管道公开于https://bokai-zhao.github.io/SpaPath-benchboard/。

英文摘要

Pathology foundation models (PFMs) have emerged as a core approach for learning transferable representations from whole slide images (WSIs), and they are typically benchmarked through downstream clinical endpoints. While such task-level evaluations are indispensable, they offer limited insight into what the representations themselves encode, particularly whether PFM embeddings can distinguish meaningful tissue regions and capture their spatial relationships. We present SpaPath-Bench, a representation-level benchmark designed to diagnose spatial representation capability in PFMs. SpaPath-Bench formulates spatial domain identification (SDI) on paired whole slide image and spatial transcriptomics (ST) data as a diagnostic task. It curates 42 public paired WSI and ST slides, enables large-scale evaluation across 19 encoders and seven SDI methods, and measures partition quality using three complementary criteria: unsupervised spatial coherence, transcriptomics-referenced agreement, and expert-referenced agreement. Across 83K runs, SpaPath-Bench reveals that different pretraining paradigms capture distinct aspects of tissue spatial architecture, and it provides practical guidance for building the next generation of spatially aware computational pathology models. Code and data pipelines are publicly available at https://bokai-zhao.github.io/SpaPath-benchboard/.

2605.25759 2026-05-26 cs.CV

Towards Anatomically Plausible Human Image Generation via Synthetic Localized Preferences

通过合成局部偏好实现解剖学合理的人体图像生成

Bao Li, Yuliang Xiu, Zhen Liu

AI总结 提出 ASAP 框架,利用局部退化机制构建受控偏好对,并结合局部有界 DPO 变体,在保持整体图像质量的同时减少解剖学错误。

详情
AI中文摘要

大规模文本到图像基础模型已实现显著的视觉真实感,但生成具有正确解剖结构的人体图像仍然具有挑战性。现有方法通过在高品质人体照片上进行监督微调时使用部位特定模块或局部损失加权来强制解剖约束,但此类数据集有限,且由于光照、姿态和背景等混杂因素,通常提供模糊的优化信号。基于偏好的对齐提供了一种替代方案,但标准的直接偏好优化(DPO)平等对待所有像素,因此未能利用解剖伪影的局部性。为了解决这个问题,我们提出了通过合成解剖偏好进行对齐(ASAP)的框架,该框架通过对高保真人体图像应用局部退化机制来构建受控偏好对。该机制通过对图像进行受控实验,在目标区域引入明确的解剖错误,同时保留其余内容。利用这一机制,我们创建了人类解剖偏好(HAP)数据集,包含超过10K个精心挑选的对,用于有效对齐文本到图像人体图像生成模型的解剖结构。为了更好地利用这些受控偏好对的局部性,我们引入了DPO的局部有界变体,该变体优先优化目标解剖区域,同时强制有限偏好间隔以防止过度优化并保持全局语义。我们进一步引入了HAF-Bench,一个用于系统评估解剖保真度的基准。大量实验表明,ASAP在多个基础模型上持续减少解剖错误,同时保持整体图像质量。

英文摘要

Large-scale text-to-image foundation models have achieved remarkable visual realism, yet generating human images with correct anatomical structures remains challenging. Existing approaches enforce anatomical constraints through part-specific modules or localized loss weighting during supervised fine-tuning on high-quality human photos, but such datasets are limited and often provide ambiguous optimization signals due to confounding factors such as lighting, pose, and background. Preference-based alignment offers an alternative, but standard Direct Preference Optimization (DPO) treats all pixels equally and therefore fails to exploit the localized nature of anatomical artifacts. To address this, we propose the framework of Alignment via Synthetic Anatomical Preference (ASAP), which constructs controlled preference pairs through a localized degradation mechanism applied to high-fidelity human images. This mechanism performs a controlled experiment on images by introducing explicit anatomical errors in targeted regions while preserving the remaining content. With this mechanism, we create the Human Anatomical Preference (HAP) dataset with over 10K curated pairs for effective anatomical alignment of text-to-image human image generative models. To better leverage the locality of these controlled preference pairs, we introduce a localized and margin-bounded variant of DPO that prioritizes optimization in targeted anatomical regions while enforcing a finite preference margin to prevent over-optimization and preserve global semantics. We further introduce HAF-Bench, a benchmark for systematic evaluation of anatomical fidelity. Extensive experiments demonstrate that ASAP consistently reduces anatomical errors across multiple foundation models while maintaining overall image quality.

2605.25751 2026-05-26 cs.CV

SplitAvatar: One-shot Head Avatar with Autoregressive Gaussian Splitting

SplitAvatar: 基于自回归高斯分裂的单次头部化身

Hongzhe Liao, Chuhua Xian, Hongmin Cai, Haiyang Liu, Fa-Ting Hong

AI总结 提出一种基于自回归高斯分裂的单图像可动画头部化身重建方法,通过图分裂网络渐进生成高斯体,解决高斯数量不匹配和细粒度细节缺失问题。

详情
AI中文摘要

3D高斯泼溅(3DGS)利用各向异性高斯体为高质量场景重建提供了高效方法。最近,基于3DGS的方法显著提升了人类化身的渲染质量,同时实现了实时性能。然而,现有方法存在基于图像和基于3DMM的方法生成的高斯体数量不匹配的问题。这种差异导致重建的表情缺乏细粒度细节。本文提出了一种从单张图像重建可动画头部化身的新方法。我们提出了一种图分裂网络,利用自回归架构从粗到细渐进生成高斯体。为了解决分裂高斯体引起的图不一致性,我们采用网格拓扑扩展方法,使GNN的连通性与增加的高斯数量对齐。此外,我们引入了一种新颖的密度控制方法,包括一个门控机制,为高斯体生成软掩码,防止分裂操作后的过度密集化。这允许对不同面部区域的高斯密度进行动态控制。为了实现平滑快速的训练,我们采用延迟过滤策略,避免在训练期间重新计算图拓扑。实验结果表明,我们的自回归结构通过渐进分裂高斯体有效提升了表情表示能力。这一过程通过GNN引导的分裂实现,合成更精确的面部细节,并达到更高的重建质量。

英文摘要

3D Gaussian Splatting (3DGS) provides an efficient method for high-quality scene reconstruction using anisotropic Gaussians. Recently, 3DGS-based methods have significantly improved the rendering quality of human avatars while enabling real-time performance. However, existing methods suffer from a magnitude mismatch in the number of Gaussians generated by image-based and 3DMM-based approaches. This discrepancy results in reconstructed expressions that lack fine-grained detail. In this paper, we introduce a novel method for reconstructing an animatable head avatar from a single image. We propose a graph splitting network that progressively generates Gaussians from coarse to fine using an autoregressive architecture. To address the graph inconsistency caused by split Gaussians, we employ a mesh topology extension method to align the GNN's connectivity with the increased Gaussian count. Furthermore, we introduce a novel density control method with a gating mechanism that generates soft masks for Gaussians, preventing over-densification after the splitting operation. This allows for dynamic control over Gaussian density across different facial regions. For smooth and rapid training, we employ a delayed filtering strategy to avoid re-computing the graph topology during training. Experimental results demonstrate that our autoregressive structure effectively improves expression representation ability by progressively splitting Gaussians. This process, enabled by the GNN-guided splitting, synthesizes more precise facial details and achieves higher reconstruction quality.