arXivDaily: arXiv daily academic digest, updated Monday to Friday
2605.12051 2026-05-13 cs.LG

Learning plug-in surrogate endpoints for randomized experiments

Alessandro-Umberto Margueritte, Ahmet Zahid Balcıoğlu, Jesse Krijthe, Dave Zachariah, Fredrik D. Johansson

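Affiliations: AstraZeneca; Chalmers University of Technology and University of Gothenburg; TU Delft; Uppsala University

AI summary: In randomized experiments, short-term surrogate endpoints are often used to evaluate an intervention's effect when the long-term outcome is hard to observe. This paper studies a class of plug-in composite surrogate endpoints that can be substituted directly for the primary outcome, proposes two methods that maximize their predictiveness of the true effect, and analyzes when surrogates that yield unbiased effect estimates can be found in representative scenarios. Experiments show that the method based on directly modeling the surrogate effect produces more predictive plug-in endpoints than existing methods.

Comments 29 pages, 5 figures

Abstract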

Surrogate endpoints are used in place of long-term outcomes in randomized experiments when observing the real outcome for a large enough cohort is prohibitively expensive or impractical. A short-term surrogate is good if the result of an experiment using the surrogate is predictive of the result of a hypothetical study using the real outcome. Much attention has been paid to formalizing this property in causal terms, but most criteria are unidentifiable and cannot be turned into practical algorithms for learning surrogate endpoints from data. To address this, we study plug-in composite surrogates, functions of post-treatment variables that may be substituted directly for the primary outcome in a randomized experiment. We propose two methods for learning plug-in surrogates that maximize effect predictiveness, and characterize the possibility of finding endpoints that yield unbiased effect estimates in representative scenarios. Finally, in both synthetic experiments with known effects and in data from a real-world experiment, we find that our method, based on directly modeling the surrogate effect, returns plug-in endpoints more predictive of the primary effect than established methods.
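
As a rough illustration of what "plug-in" means in this abstract, the sketch below computes a difference-in-means effect estimate once on the primary outcome and once on a composite surrogate h(S) substituted for it. The toy data, the fixed linear form of `h`, and the function names are assumptions made for illustration only; this is not the learning method proposed in the paper.

```python
from statistics import mean

def diff_in_means(values, treated):
    """Difference-in-means effect estimate for a randomized experiment."""
    t = [v for v, z in zip(values, treated) if z]
    c = [v for v, z in zip(values, treated) if not z]
    return mean(t) - mean(c)

def plug_in_effect(post_treatment, treated, h):
    """Same estimator, with the composite surrogate h(S) in place of the outcome."""
    return diff_in_means([h(s) for s in post_treatment], treated)

# Made-up toy data: two short-term measurements per unit.
treated = [1, 1, 1, 0, 0, 0]
post_treatment = [(2.0, 1.0), (3.0, 1.0), (4.0, 2.0),
                  (1.0, 1.0), (2.0, 0.0), (3.0, 1.0)]
primary = [5.0, 7.0, 10.0, 3.0, 4.0, 7.0]

# Hypothetical composite surrogate: a fixed linear combination (assumption).
def h(s):
    return 2.0 * s[0] + 1.0 * s[1]

surrogate_effect = plug_in_effect(post_treatment, treated, h)
primary_effect = diff_in_means(primary, treated)
```

In this toy case the surrogate reproduces the primary outcome exactly by construction, so the two estimates coincide; with real data the question is how closely the surrogate effect tracks the primary effect.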

2605.12049 2026-05-13 cs.LG cs.AI cs.IT cs.NE math.IT

Scaling Laws and Tradeoffs in Recurrent Networks of Expressive Neurons

Aaron Spieler, Georg Martius, Anna Levina

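Affiliations: University of Tübingen, Germany; Max Planck Institute for Biological Cybernetics, Tübingen, Germany; Max Planck Institute for Intelligent Systems, Tübingen, Germany

AI summary: This paper asks how a fixed parameter budget should be allocated among the number of units, per-unit complexity, and connectivity in a neural network. It introduces a recurrent architecture built from Expressive Leaky Memory (ELM) neurons that allows network width, unit complexity, and connectivity to be adjusted independently and trains stably across scales. Experiments show that a non-trivial optimal tradeoff exists under a fixed budget and that larger budgets favor both more and more complex neurons; an information-theoretic model explains the mechanism behind the tradeoff.

Comments 25 pages, 21 figures, 3 tables, including derivations. Submitted for peer review

Abstract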

Cortical neurons are complex, multi-timescale processors wired into recurrent circuits, shaped by long evolutionary pressure under stringent biological constraints. Mainstream machine learning, by contrast, predominantly builds models from extremely simple units, a default inherited from early neural-network theory. We treat this as a normative architectural question. How should one split a fixed parameter budget $P$ between the number of units $N$, per-unit effective complexity $k_e$, and per-unit connectivity $k_c$? What controls the optimal allocation? This calls for a model in which per-unit complexity can be tuned independently of width and connectivity. Accordingly, we introduce the ELM Network, whose recurrent layer is built from Expressive Leaky Memory (ELM) neurons, chosen to mirror functional components of cortical neurons. The architecture allows for individually adjusting $N$, $k_e$, and $k_c$ and trains stably across orders of magnitude in scale. We evaluate the model on two qualitatively different sequence benchmarks: the neuromorphic SHD-Adding task and Enwik8 character-level language modeling. Performance improves monotonically along each of the three axes individually. Under a fixed budget, a clear non-trivial optimum emerges in their tradeoff, and larger budgets favor both more and more complex neurons. A closed-form information-theoretic model captures these tradeoffs and attributes the diminishing returns at the two ends to per-neuron signal-to-noise saturation and across-neuron redundancy, respectively. A hyperparameter sweep spanning three orders of magnitude in trainable parameters traces a near-Pareto-frontier scaling law consistent with the framework. This suggests that the simple-unit default in ML is not obviously optimal once this tradeoff surface is probed, and offers a normative lens on cortex's reliance on complex spatio-temporal integrators.

2605.12047 2026-05-13 cs.CL

Is Child-Directed Language Optimized for Word Learning? A Computational Study of Verb Meaning Acquisition

Francesca Padovani, Jaap Jumelet, Yevgen Matusevych, Arianna Bisazza

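Affiliations: Center for Language and Cognition (CLCG), University of Groningen

AI summary: This study examines whether child-directed language (CDL) is optimized for word learning, in particular the acquisition of verb meaning. Comparing neural language models trained on CDL and on adult-directed language (ADL), it finds that models trained on CDL and spoken ADL learn more robustly when syntax is disrupted. It also finds that verb meanings are acquired before syntactic proficiency improves, an asynchrony most pronounced in spoken language, which suggests that CDL's advantage for verb learning may stem from properties of speech itself rather than from an optimization unique to CDL.

Comments 8 pages

Abstract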

Is child-directed language (CDL) optimized to support language learning, and which aspects of linguistic development does it facilitate? We investigate this question using neural language models trained on CDL versus adult-directed language (ADL). We selectively remove syntactic or lexical co-occurrence information from the model training data, and evaluate the impact of these manipulations on verb meaning acquisition. While disrupting syntax impairs learning across all datasets, models trained on CDL and spoken ADL show significantly higher resilience than those trained on written input. Tracking semantic and syntactic performance over training, we observe a semantic-first trajectory, with verb meanings emerging prior to robust syntactic proficiency, an asynchrony most pronounced in the spoken domain, especially CDL. These results suggest that the advantage for verb learning previously attributed to CDL may instead reflect broader properties of the spoken register, rather than a uniquely CDL-specific optimization.

2605.12039 2026-05-13 cs.CL

SkillGraph: Skill-Augmented Reinforcement Learning for Agents via Evolving Skill Graphs

Xiaoyuan Li, Moxin Li, Keqin Bao, Yubo Ma, Wenjie Wang, Dayiheng Liu, Fuli Feng

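Affiliations: University of Science and Technology of China; Alibaba Group; National University of Singapore

AI summary: SkillGraph strengthens reinforcement learning for agents through a dynamically evolving skill graph, addressing the difficulty existing skill libraries have with identifying dependencies and with maintenance on compositional tasks. It represents reusable skills as nodes in a directed graph, with edge types encoding prerequisite, enhancement, and co-occurrence relations between skills, to support multi-step decision making. Experiments show strong results on several complex tasks, clearly outperforming memory-augmented reinforcement learning methods.

Comments Under Review

Abstract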

Skill libraries enable large language model agents to reuse experience from past interactions, but most existing libraries store skills as isolated entries and retrieve them only by semantic similarity. This leads to two key challenges for compositional tasks. First, an agent must identify not only relevant skills but also how they depend on and build upon each other. Second, library maintenance becomes difficult, since the system lacks structural cues for deciding when skills should be merged, split, or removed. We propose SkillGraph, a framework that represents reusable skills as nodes in a directed graph, with typed edges encoding prerequisite, enhancement, and co-occurrence relations. Given a new task, SkillGraph retrieves not just individual skills, but an ordered skill subgraph that can guide multi-step decision making. The graph is continuously updated from agent trajectories and reinforcement learning feedback, allowing both the skill library and the agent policy to improve together. Experiments on ALFWorld, WebShop, and seven search-augmented QA tasks show that SkillGraph achieves state-of-the-art performance against memory-augmented RL methods, with especially large gains on complex tasks that require composing multiple skills.
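
The idea of retrieving an ordered skill subgraph can be sketched with a typed edge list and a topological sort over prerequisite edges. The skill names, the edge list, and the `ordered_subgraph` helper below are hypothetical; the abstract does not describe the actual retrieval or update procedure, so this is only a minimal sketch of the data structure.

```python
from graphlib import TopologicalSorter

# Hypothetical typed edges: (source skill, target skill, relation).
EDGES = [
    ("open_fridge", "take_item", "prerequisite"),
    ("take_item", "heat_item", "prerequisite"),
    ("find_microwave", "heat_item", "prerequisite"),
    ("take_item", "clean_item", "co-occurrence"),
    ("use_tongs", "take_item", "enhancement"),
]

def ordered_subgraph(target, edges):
    """Return the target skill and its transitive prerequisites, prerequisites first."""
    prereqs = {}
    for src, dst, relation in edges:
        if relation == "prerequisite":
            prereqs.setdefault(dst, set()).add(src)
    # Walk backwards from the target to collect everything it depends on.
    needed, stack = set(), [target]
    while stack:
        skill = stack.pop()
        if skill not in needed:
            needed.add(skill)
            stack.extend(prereqs.get(skill, ()))
    # TopologicalSorter expects a mapping from node to its predecessors.
    graph = {skill: prereqs.get(skill, set()) for skill in needed}
    return list(TopologicalSorter(graph).static_order())

plan = ordered_subgraph("heat_item", EDGES)
```

Only prerequisite edges constrain the order here; enhancement and co-occurrence edges are stored but ignored, which is a simplification of this sketch.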

2605.12038 2026-05-13 cs.CV

OmniHumanoid: Streaming Cross-Embodiment Video Generation with Paired-Free Adaptation

Yiren Song, Xiyao Deng, Pei Yang, Yihan Wang, Mike Zheng Shou

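Affiliations: Show Lab, National University of Singapore

AI summary: OmniHumanoid is a streaming generative framework for cross-embodiment video generation that transfers motion from human to robot or from robot to robot. By separating transferable motion learning from embodiment-specific adaptation, it addresses the factor entanglement and reliance on paired data of earlier approaches, and adapts to a new embodiment using only unpaired videos. The work also introduces a branch-isolated attention mechanism and builds a synthetic dataset spanning multiple embodiments and scenes. Experiments show strong motion fidelity and embodiment consistency, and the method extends to new robots without retraining the shared motion model.

Abstract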

Cross-embodiment video generation aims to transfer motions across different humanoid embodiments, such as human-to-robot and robot-to-robot, enabling scalable data generation for embodied intelligence. A major challenge in this setting is that motion dynamics are partly transferable across embodiments, whereas appearance and morphology remain embodiment-specific. Existing approaches often entangle these factors, and many require paired data for every target embodiment, which limits scalability to new robots. We present OmniHumanoid, a framework that factorizes transferable motion learning and embodiment-specific adaptation. Our method learns a shared motion transfer model from motion-aligned paired videos spanning multiple embodiments, while adapting to a new embodiment using only unpaired videos through lightweight embodiment-specific adapters. To reduce interference between motion transfer and embodiment adaptation, we further introduce a branch-isolated attention design that separates motion conditioning from embodiment-specific modulation. In addition, we construct a synthetic cross-embodiment dataset with motion-aligned paired videos rendered across diverse humanoid assets, scenes, and viewpoints. Experiments on both synthetic and real-world benchmarks show that OmniHumanoid achieves strong motion fidelity and embodiment consistency, while enabling scalable adaptation to unseen humanoid embodiments without retraining the shared motion model.

2605.12031 2026-05-13 cs.LG cs.CV

Resilient Vision-Tabular Multimodal Learning under Modality Missingness

Camillo Maria Caruso, Valerio Guarrasi, Paolo Soda

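Affiliations: Research Unit of Artificial Intelligence and Computer Systems, Department of Engineering, Università Campus Bio-Medico di Roma; Department of Diagnostics and Intervention, Radiation Physics, Biomedical Engineering, Umeå University

AI summary: Targeting the modality missingness common in medical multimodal learning, this study proposes a joint vision-tabular learning framework that needs neither imputation nor heuristic model switching. Unimodal representations are weighted by learnable modality tokens and combined through intermediate fusion with masked self-attention, which excludes missing modalities and features. A modality-dropout regularization strategy further improves robustness, and experiments show that the method outperforms existing baselines across missingness scenarios with more stable performance.

Abstract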

Multimodal deep learning has shown strong potential in medical applications by integrating heterogeneous data sources such as medical images and structured clinical variables. However, most existing approaches implicitly assume complete modality availability, an assumption that rarely holds in real-world clinical settings where entire modalities and individual features are frequently missing. In this work, we propose a multimodal transformer framework for joint vision-tabular learning explicitly designed to operate under pervasive modality missingness, without relying on imputation or heuristic model switching. The architecture integrates three components: a vision, a tabular, and a multimodal fusion encoder. Unimodal representations are weighted through learnable modality tokens and fused via intermediate fusion with masked self-attention, which excludes missing tokens and modalities from information aggregation and gradient propagation. To further enhance resilience, we introduce a modality-dropout regularization strategy that stochastically removes available modalities during training, encouraging the model to exploit complementary information under partial data availability. We evaluate our approach on the MIMIC-CXR dataset paired with structured clinical data from MIMIC-IV for multilabel classification of 14 diagnostic findings with incomplete annotations. Two parallel systematic stress-test protocols progressively increase training and inference missingness in each modality separately, spanning fully multimodal to fully unimodal scenarios. Across all missingness regimes, the proposed method consistently outperforms representative baselines, showing smoother performance degradation and improved robustness. Ablation studies further demonstrate that attention-level masking and intermediate fusion with joint fine-tuning are key to resilient multimodal inference.
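
The masking idea can be illustrated with a small scaled dot-product attention in which missing tokens receive a score of negative infinity and therefore zero weight. This is a generic forward-pass sketch, not the paper's architecture: the function name, the toy vectors, and the `present` flag list are assumptions, and the sketch requires at least one token to be present.

```python
import math

def masked_attention(queries, keys, values, present):
    """Scaled dot-product attention that ignores tokens flagged as missing."""
    d = len(keys[0])
    outputs = []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) if ok else -math.inf
            for k, ok in zip(keys, present)
        ]
        top = max(scores)  # finite as long as one token is present
        weights = [math.exp(s - top) for s in scores]  # exp(-inf) is 0.0
        total = sum(weights)
        weights = [w / total for w in weights]
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs

queries = [[1.0, 0.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
values = [[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]]

# The third token stands for a missing feature or modality.
masked = masked_attention(queries, keys, values, [True, True, False])
reference = masked_attention(queries, keys[:2], values[:2], [True, True])
```

Because the masked token gets exactly zero weight, the result matches attention computed over the present tokens alone, whatever placeholder values the missing token carries.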

2605.12028 2026-05-13 cs.CL cs.IR

Caraman at SemEval-2026 Task 8: Three-Stage Multi-Turn Retrieval with Query Rewriting, Hybrid Search, and Cross-Encoder Reranking

David-Maximilian Caraman, Gheorghe Cosmin Silaghi

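Affiliations: Babeş-Bolyai University

AI summary: This paper describes a system submitted to SemEval-2026 Task 8 (MTRAGEval). For multi-turn retrieval it uses a three-stage approach: query rewriting with a LoRA-fine-tuned model, hybrid search combining BM25 and dense retrieval, and cross-encoder reranking. Across four English-language domains the system reaches an nDCG@5 of 0.531, ranks eighth, and clearly outperforms the baseline. The study also finds that tuning the query-generation temperature per domain improves performance, whereas more complex strategies can degrade it.

Comments Accepted at SemEval2026, task 8: MTRAGEval

Abstract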

We describe our system for SemEval-2026 Task 8 (MTRAGEval), participating in Task A (Retrieval) across four English-language domains. Our approach employs a three-stage pipeline: (1) query rewriting via a LoRA-fine-tuned Qwen 2.5 7B model that transforms context-dependent follow-up questions into standalone queries, (2) hybrid BM25 and dense retrieval combined through Reciprocal Rank Fusion, and (3) cross-encoder reranking with BGE-reranker-v2-m3. On the official test set, the system achieves nDCG@5 of 0.531, ranking 8th out of 38 participating systems and 10.7% above the organizer baseline. Development comparisons reveal that domain-specific temperature tuning for query generation, where technical domains benefit from deterministic decoding and general domains from controlled randomness, provides consistent gains, while more complex strategies such as domain-aware prompting and multi-query expansion degrade performance.
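
Stage (2) combines the two retrievers with Reciprocal Rank Fusion, which scores each document by summing 1 / (k + rank) over the ranked lists it appears in. The sketch below shows the standard formula; the document IDs are made up, and `k=60` is the commonly used default rather than a value stated in the abstract.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists: score(d) = sum over lists of 1 / (k + rank of d)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical outputs of the sparse and dense retrievers for one query.
bm25_ranking = ["d3", "d1", "d2"]
dense_ranking = ["d1", "d4", "d3"]

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
```

Here `d1` comes out ahead of `d3` because it sits near the top of both lists, while documents seen by only one retriever fall to the bottom. RRF uses ranks only, so BM25 and dense similarity scores never need to be put on a common scale.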

2605.12027 2026-05-13 cs.CV

4DVGGT-D: 4D Visual Geometry Transformer with Improved Dynamic Depth Estimation

Ying Zang, Xuanyi Liu, Yidong Han, Deyi Ji, Chaotao Ding, Yuanqi Hu, Qi Zhu, Xuanfu Li, Jin Ma, Lingyun Sun, Tianrun Chen, Lanyun Zhu

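Affiliations: Peking University; Zhejiang University; Huzhou University; Huawei; Tongji University

AI summary: This paper proposes 4DVGGT-D, a 4D visual geometry transformer for reconstructing dynamic 4D scenes from monocular video. Its core is a training-free progressive decoupling framework that separates dynamic from static elements to make depth estimation more stable and accurate. The method has three key modules: dynamic-mask-guided pose decoupling, topological subspace surgery, and information-theoretic confidence-aware fusion, which together improve the quality and robustness of 4D reconstruction.

Abstract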

Reconstructing dynamic 4D scenes from monocular videos is a fundamental yet challenging task. While recent 3D foundation models provide strong geometric priors, their performance significantly degrades in dynamic environments. This degradation stems from a fundamental tension: the inherent coupling of camera ego-motion and object motion within global attention mechanisms. In this paper, we propose a novel, training-free progressive decoupling framework that disentangles dynamics from statics in a principled, coarse-to-fine manner. Our core insight is to resolve the tension by first stabilizing the camera pose, followed by geometric refinement. Specifically, our approach consists of three synergistic components: (1) a Dynamic-Mask-Guided Pose Decoupling module that isolates pose estimation from dynamic interference, yielding a stable motion-free reference frame; (2) a Topological Subspace Surgery mechanism that orthogonally decomposes the depth manifold, safely preserving dynamic objects while injecting refined, mask-aware geometry into static regions; and (3) an Information-Theoretic Confidence-Aware Fusion strategy that formulates depth integration as a heteroscedastic Bayesian inference problem, adaptively blending multi-pass predictions via inverse-variance weighting. Extensive experiments on standard 4D reconstruction benchmarks demonstrate that our method achieves consistent and substantial improvements across principal point-cloud metrics. Notably, our approach shows competitive performance in robust 4D scene reconstruction without requiring fine-tuning, suggesting the potential of mathematically grounded dynamic-static disentanglement.
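
Inverse-variance weighting, named in component (3), is the standard way to combine independent noisy estimates of the same quantity: each estimate is weighted by the reciprocal of its variance. The sketch below applies it to a single pixel; the function name and the numbers are illustrative assumptions, and how the paper obtains per-pass variances is not specified in the abstract.

```python
def inverse_variance_fusion(depths, variances):
    """Fuse depth predictions for one pixel, weighting each by 1 / variance."""
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    fused = sum(w * d for w, d in zip(weights, depths)) / total
    fused_variance = 1.0 / total  # valid for independent errors
    return fused, fused_variance

# Two passes predict 2.0 m (variance 1.0) and 4.0 m (variance 3.0).
fused_depth, fused_var = inverse_variance_fusion([2.0, 4.0], [1.0, 3.0])
```

The fused depth of 2.5 m lies closer to the more confident prediction, and the fused variance of 0.75 is smaller than either input variance.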

2605.12025 2026-05-13 cs.LG stat.ML

Approximation Theory of Laplacian-Based Neural Operators for Reaction-Diffusion System

Takashi Furuya, Ryo Ozawa, Jenn-Nan Wang

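Affiliations: Doshisha University, RIKEN AIP; Tohoku University; National Taiwan University

AI summary: This paper studies the approximation theory of Laplacian-based neural operators for nonlinear reaction-diffusion systems, using a generalized Gierer-Meinhardt model to analyze learning the map from initial conditions to time-dependent solutions. Exploiting the Laplacian spectral representation of the PDE's Green's function, the authors establish explicit approximation error bounds in terms of network depth, width, and spectral rank, and show that the required parameter complexity grows polynomially with the target accuracy, which alleviates the exponential growth in parameter complexity faced by generic operator learning. Numerical experiments support the theoretical results.

Abstract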

Neural operators provide a framework for learning solution operators of partial differential equations (PDEs), enabling efficient surrogate modeling for complex systems. While universal approximation results are now well understood, approximation analysis specific to nonlinear reaction-diffusion systems remains limited. In this paper, we study neural operators applied to the solution mapping from initial conditions to time-dependent solutions of a generalized Gierer-Meinhardt reaction-diffusion system, a prototypical model of nonlinear pattern formation. Our main results establish explicit approximation error bounds in terms of network depth, width, and spectral rank by exploiting the Laplacian spectral representation of the Green's function underlying the PDE. We show that the required parameter complexity grows at most polynomially with respect to the target accuracy, demonstrating that Laplacian eigenfunction-based neural operator architectures alleviate the curse of parametric complexity encountered in generic operator learning. Numerical experiments on the Gierer-Meinhardt system support the theoretical findings.
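
For readers unfamiliar with the Laplacian spectral representation mentioned above, the textbook form for a single scalar equation $\partial_t u = d\,\Delta u + f(u)$ is sketched below. The scalar setting, the generic reaction term $f$, and the notation are assumptions for illustration; the paper's generalized Gierer-Meinhardt system is a coupled system whose exact form is not given in the abstract.

```latex
% Assumed setting: bounded domain \Omega, eigenpairs (\lambda_k, \varphi_k) of -\Delta
% under the problem's boundary conditions, \{\varphi_k\} orthonormal in L^2(\Omega).
\begin{align}
  -\Delta \varphi_k &= \lambda_k \varphi_k, \\
  G_d(t, x, y) &= \sum_{k} e^{-d \lambda_k t}\, \varphi_k(x)\, \varphi_k(y), \\
  u(t, x) &= \int_\Omega G_d(t, x, y)\, u_0(y)\, dy
    + \int_0^t \!\! \int_\Omega G_d(t - s, x, y)\, f\bigl(u(s, y)\bigr)\, dy\, ds .
\end{align}
% A rank-K truncation keeps only the first K terms of the sum defining G_d.
```

The last line is Duhamel's formula. Truncating the eigenfunction sum is one natural reading of the "spectral rank" that appears in the error bounds.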

2605.12022 2026-05-13 cs.CL

SAGE: Scalable Automated Robustness Augmentation for LLM Knowledge Evaluation

Xiaoyuan Li, Yuzhe Wang, Moxin Li, Keqin Bao, Rui Men, Yichang Zhang, Dayiheng Liu, Wenjie Wang, Fuli Feng

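Affiliations: University of Science and Technology of China; Alibaba Group; National University of Singapore

AI summary: This study proposes SAGE, a scalable automated robustness-augmentation framework for making LLM knowledge evaluation benchmarks more robust. SAGE fine-tunes small models to generate and verify question variants efficiently: VariantGen generates variants, and VariantQual, trained on human-labeled data, verifies their quality. Experiments show that SAGE builds large-scale robustness-augmented benchmarks at far below the cost of human annotation, and that the fine-tuned models generalize to other benchmarks such as MMLU without benchmark-specific fine-tuning.

Comments Under Review

Abstract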

Large Language Models (LLMs) achieve strong performance on standard knowledge evaluation benchmarks, yet recent work shows that their knowledge capabilities remain brittle under question variants that test the same knowledge in different forms. Robustness augmentation of existing knowledge evaluation benchmarks is therefore necessary, but current LLM-assisted generate-then-verify pipelines are costly and difficult to scale due to low-yield variant generation and unreliable variant verification. We propose SAGE (Scalable Automated Generation of Robustness BEnchmarks), a framework for scalable robustness augmentation of knowledge evaluation benchmarks using fine-tuned smaller models. SAGE consists of VariantQual, a rubric-based verifier trained on human-labeled seed data, and VariantGen, a variant generator initialized with supervised fine-tuning and further optimized with reinforcement learning using VariantQual as the reward model. Experiments on HellaSwag show that SAGE constructs a large-scale robustness-augmented benchmark with quality comparable to the human-annotated HellaSwag-Pro at substantially lower cost, while the fine-tuned models further generalize to MMLU without benchmark-specific fine-tuning.

2605.12021 2026-05-13 cs.CV

What-Where Transformer: A Slot-Centric Visual Backbone for Concurrent Representation and Localization

Ryota Yoshihashi, Masahiro Kada, Satoshi Ikehata, Rei Kawakami, Ikuro Sato

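Affiliations: Institute of Science Tokyo; DENSO IT Laboratory; National Institute of Informatics

AI summary: This paper proposes the What-Where Transformer (WWT), a visual backbone that learns object appearance (what) and location (where) at the same time. Built on the inductive bias of what-where separation, it uses a multi-stream architecture that processes object representations and attention maps separately, yielding a decoupled representation of appearance and spatial location. Experiments show that WWT discovers multiple objects from raw attention maps without extra postprocessing and achieves superior performance on zero-shot object discovery and weakly supervised semantic segmentation.

Abstract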

Many image understanding tasks involve identifying what is present and where it appears. However, tasks that address where, such as object discovery, detection, and segmentation, are often considerably more complex than image classification, which primarily focuses on what. One possible reason is that classification-oriented backbones tend to emphasize semantic information about what, while implicitly entangling or suppressing information about where. In this work, we focus on an inductive bias termed what-where separation, which encourages models to represent object appearance and spatial location in a decomposed manner. To incorporate this bias throughout an attentive backbone in the style of Vision Transformer (ViT), we propose the What-Where Transformer (WWT). Our method introduces two key novel designs: (1) it treats tokens as representations of what and attention maps as representations of where, and processes them in concurrent feed-forward modules via a multi-stream, slot-based architecture; (2) it reuses both the final-layer tokens and attention maps for downstream tasks, and directly exposes them to gradients derived from task losses, thereby facilitating more effective and explicit learning of localization. We demonstrate that even under standard single-label classification-based supervision on ImageNet, WWT exhibits emergent multiple object discovery directly from raw attention maps, rather than via additional postprocessing such as token clustering. Furthermore, WWT achieves superior performance compared to ViT-based methods on zero-shot object discovery and weakly supervised semantic segmentation, and it is transferable to various localization setups with minimal modifications. Code will be published after acceptance.

2605.12019 2026-05-13 cs.LG cs.AI

Efficient and Adaptive Human Activity Recognition via LLM Backbones

Aleksandr Bredikhin, Philippe Lalanda, German Vega

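Affiliations: Univ. Grenoble Alpes, France

AI summary: This paper proposes an efficient and adaptive approach to human activity recognition (HAR) based on large language models (LLMs), aiming to address the computational cost and limited domain adaptability of conventional methods. It uses a pretrained LLM as a generic temporal feature extractor and introduces a structured convolutional projection that maps sensor signals into the LLM's latent space, which sharply reduces the number of trainable parameters and the training cost while improving generalization. Experiments show strong results in low-data and few-shot settings, offering a scalable and efficient basis for HAR systems.

Abstract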

Human Activity Recognition (HAR) is a core task in pervasive computing systems, where models must operate under strict computational constraints while remaining robust to heterogeneous and evolving deployment conditions. Recent advances based on Transformer architectures have significantly improved recognition performance, but typically rely on task-specific models trained from scratch, resulting in high training cost, large data requirements, and limited adaptability to domain shifts. In this paper, we propose a paradigm shift that reuses large pretrained language models (LLMs) as generic temporal backbones for sensor-based HAR, instead of designing domain-specific Transformers. To bridge the modality gap between inertial time series and language models, we introduce a structured convolutional projection that maps multivariate accelerometer and gyroscope signals into the latent space of the LLM. The pretrained backbone is kept frozen and adapted using parameter-efficient Low-Rank Adaptation (LoRA), drastically reducing the number of trainable parameters and the overall training cost. Through extensive experiments on standard HAR benchmarks, we show that this approach enables rapid convergence, strong data efficiency, and robust cross-dataset transfer, particularly in low-data and few-shot settings. At the same time, our results highlight the complementary roles of convolutional frontends and LLMs, where local invariances are handled at the signal level while long-range temporal dependencies are captured by the pretrained backbone. Overall, this work demonstrates that LLMs can serve as a practical, frugal, and scalable foundation for adaptive HAR systems, opening new directions for reusing foundation models beyond their original language domain.
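
Low-Rank Adaptation keeps a pretrained weight matrix W0 frozen and learns a low-rank update, so a layer computes W0 x + (alpha / r) B A x with only A and B trainable. The pure-Python sketch below shows that arithmetic on a tiny layer; the matrices, the `alpha` value, and the function names are illustrative assumptions and say nothing about the configuration used in the paper.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def lora_forward(x, w0, a, b, alpha):
    """Compute W0 x + (alpha / r) * B (A x), where r is the number of rows of A."""
    rank = len(a)
    base = matvec(w0, x)               # frozen pretrained path
    update = matvec(b, matvec(a, x))   # trainable low-rank path
    scale = alpha / rank
    return [h + scale * u for h, u in zip(base, update)]

w0 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # frozen, shape 2 x 3
a = [[1.0, 1.0, 1.0]]                     # trainable, shape r x 3 with r = 1
x = [1.0, 2.0, 3.0]

# With B initialised to zero the adapted layer equals the frozen layer.
out_init = lora_forward(x, w0, a, [[0.0], [0.0]], alpha=2.0)
# After (hypothetical) training, B is non-zero and shifts the output.
out_tuned = lora_forward(x, w0, a, [[0.5], [-1.0]], alpha=2.0)

trainable = len(a) * len(a[0]) + 2 * len(a)      # r * (d_in + d_out) = 5
full = len(w0) * len(w0[0])                       # d_out * d_in = 6
```

At this toy size the saving is negligible, but the trainable count grows as r * (d_in + d_out) rather than d_in * d_out, which is where the reduction comes from at LLM scale.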

2605.12017 2026-05-13 cs.CV

FAME: Feature Activation Map Explanation on Image Classification and Face Recognition

Xinyi Zhang, Manuel Günther

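Affiliations: Department of Informatics, University of Zurich

AI summary: This paper proposes FAME, a feature activation map explanation method for image classification and face recognition, intended to improve the interpretability of deep learning models. FAME combines the strengths of gradient-based feature-map methods and perturbation methods: it manipulates the input image in a gradient-driven way rather than with fixed patches, producing more accurate pixel-level attribution maps. Experiments show that FAME outperforms conventional CAM methods on deep networks and is competitive in qualitative and quantitative evaluations.

Comments Accepted for CVPR Workshop 2026

Abstract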

Deep Learning has revolutionized machine learning, reaching unprecedented levels of accuracy, but at the cost of reduced interpretability. Especially in image processing systems, deep networks transform local pixel information into more global concepts in a highly obscured manner. Explainable AI methods for image processing try to shed light on this issue by highlighting the regions of the image that are important for the prediction task. Among these, Class Activation Mapping (CAM) and its gradient-based variants compute attributions based on the feature map and upscale them to the image resolution, assuming that feature map locations are influenced only by underlying regions. Perturbation-based methods, such as CorrRISE, on the other hand, try to provide pixel-level attributions by perturbing the input with fixed patches and checking how the output of the network changes. In this work, we propose Feature Activation Map Explanation (FAME), which combines both worlds by using network gradients to compute changes to the input image, manipulating it in a gradient-driven way rather than using fixed patches. We apply this technique to two common tasks, image classification and face recognition, and show that CAM's above-mentioned assumption does not hold for deeper networks. We qualitatively and quantitatively show that FAME produces attribution maps that are competitive with state-of-the-art systems. Our code is available at https://github.com/AIML-IfI/fame.

2605.12016 2026-05-13 cs.AI

LLMs and the ZPD

Peter Wallis

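Affiliations: Centre for Policy Modelling

AI summary: This paper explores the relationship between large language models (LLMs) and Vygotsky's theory of the Zone of Proximal Development (ZPD), proposing that LLMs do not "think" with distributed representations but instead carry out a practice-based "primitive thinking". It argues that LLM behavior is closer to dreaming than to hallucination and that interaction is core to human communication rather than a mere aid to understanding, offering a new perspective on the cognitive mechanisms of LLMs.

Comments Short paper submitted to Interspeech 2026 (Desk Reject) 4 pages, plus references. 2 figures

Abstract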

One hundred years ago Vygotsky and his circle were exploring the nature of consciousness and defining what would become psychology in the Soviet Union. They concluded that children develop "scientific thinking" through interacting with enculturated adults in Zones of Proximal Development or ZPDs. The proposal is that, contrary to the claims of some, the LLM mechanism is not doing thinking with "distributed representations," but rather the completion model is doing "primitive thinking" in terms of *practices*. Viewed from this perspective, it would seem our large language models don't hallucinate, but rather dream, and that what is needed is not "guard rails" but an investigation of the set of cognitive tools that enable us to do things that look like common-sense. The proposal here is that *interaction* is core to human communication rather than just an add-on to "real" understanding.

2605.12013 2026-05-13 cs.CV cs.AI

L2P: Unlocking Latent Potential for Pixel Generation

Zhennan Chen, Junwei Zhu, Xu Chen, Jiangning Zhang, Jiawei Chen, Zhuoqi Zeng, Wei Zhang, Chengjie Wang, Jian Yang, Ying Tai

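Affiliations: Nanjing University; Tencent Youtu Lab; Hainan-biuh; Weess Gmbh

AI summary: This paper proposes L2P, an efficient pixel generation framework that addresses the prohibitive compute and data cost of training high-fidelity pixel-space models from scratch. L2P directly reuses the knowledge of a pretrained latent diffusion model (LDM): it replaces the VAE with large-patch tokenization and freezes the LDM's intermediate layers, training only shallow layers to learn the latent-to-pixel mapping. Using only LDM-generated synthetic images as training data, with no real-data collection, it converges quickly, trains on 8 GPUs, and supports 4K ultra-high-resolution generation. Experiments show performance close to the source model and strong results on several benchmarks.

Comments project page: https://nju-pcalab.github.io/projects/L2P/

Abstract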

Pixel diffusion models have recently regained attention for visual generation. However, training advanced pixel-space models from scratch demands prohibitive computational and data resources. To address this, we propose the Latent-to-Pixel (L2P) transfer paradigm, an efficient framework that directly harnesses the rich knowledge of pre-trained LDMs to build powerful pixel-space models. Specifically, L2P discards the VAE in favor of large-patch tokenization and freezes the source LDM's intermediate layers, exclusively training shallow layers to learn the latent-to-pixel transformation. By utilizing LDM-generated synthetic images as the sole training corpus, L2P fits an already smooth data manifold, enabling rapid convergence with zero real-data collection. This strategy allows L2P to seamlessly migrate massive latent priors to the pixel space using only 8 GPUs. Furthermore, eliminating the VAE memory bottleneck unlocks native 4K ultra-high resolution generation. Extensive experiments across mainstream LDM architectures show that L2P incurs negligible training overhead, yet performs on par with the source LDM on DPG-Bench and reaches 93% performance on GenEval.

2605.12010 2026-05-13 cs.LG

Limits of Learning Linear Dynamics from Experiments

Aybüke Ulusarslan, Niki Kilbertus, Nora Schneider

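Affiliations: Technical University of Munich; Helmholtz Munich; Munich Center for Machine Learning (MCML)

AI summary: This paper studies the limits of identifiability when learning linear dynamical systems from experimental data. The authors note that conventional methods usually assume the system is identifiable, and that model predictions can be inaccurate when this assumption fails. Through a geometric analysis, the paper shows that the experimental setup (initial state and control input) bounds the information recoverable from the observed trajectory, derives a closed-form description of all systems consistent with that setup, and proves that the dynamics on the experimentally reachable subspace remain uniquely determined even when the full system is not identifiable.

Abstract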

Learning governing dynamics from data is a common goal across the sciences, yet it is only well-posed when the underlying mechanisms are identifiable. In practice, many data-driven methods implicitly assume identifiability; when this assumption fails, estimated models can yield spurious predictions and invalid mechanistic conclusions. Classical identifiability guarantees for controlled linear time-invariant (LTI) systems provide sufficient conditions -- controllability and persistent excitation -- but leave open whether identifiability holds when these conditions fail, and which parts of the system remain identifiable without full identifiability. We show that the experimental setup, i.e., the realized initial state and control input, dictates a fundamental limit on the information recoverable from the observed trajectory. We develop a geometric characterization of this limit and derive a closed-form description of all systems consistent with the experimental setup. Crucially, we prove that even when the full system is not identifiable, the restricted dynamics on the subspace reachable by the experiment remain uniquely determined.
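
A simple way to see why a single experiment can leave a system only partly determined is the noise-free, discrete-time least-squares picture below. The discrete-time model, the data matrices $Z$ and $X_+$, and the free matrix $M$ are assumptions chosen for illustration; the paper's own setting and closed-form characterization may differ.

```latex
% Assumed discrete-time model (the paper's setting may differ):
%   x_{t+1} = A x_t + B u_t, \quad t = 0, \dots, T-1, \quad \text{noise-free data}.
\begin{align}
  Z &= \begin{bmatrix} x_0 & \cdots & x_{T-1} \\ u_0 & \cdots & u_{T-1} \end{bmatrix},
  \qquad X_+ = \begin{bmatrix} x_1 & \cdots & x_T \end{bmatrix}, \\
  \begin{bmatrix} \hat A & \hat B \end{bmatrix} Z = X_+
  &\iff
  \begin{bmatrix} \hat A & \hat B \end{bmatrix}
    = X_+ Z^{\dagger} + M \,(I - Z Z^{\dagger}) \quad \text{for some matrix } M .
\end{align}
% Since (I - Z Z^\dagger) v = 0 for every v in the column space of Z, all consistent
% pairs (\hat A, \hat B) act identically on the directions the experiment visited.
```

When $Z$ has full row rank the term with $M$ vanishes and the system is identified uniquely; otherwise the ambiguity is confined to directions the experiment never excited.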

2605.12009 2026-05-13 cs.LG

Estimating Subgraph Importance with Structural Prior Domain Knowledge

Changhyun Kim, Seunghwan An, Jong-June Jeon

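Affiliations: University of Seoul; Incheon National University

AI summary: This paper proposes a subgraph importance estimation method for pretrained graph neural networks (GNNs) on graph-level tasks, formulated as a linear Group Lasso regression problem in the embedding space. The method makes effective use of prior domain knowledge about graph substructures, is independent of the form of the GNN's output layer or readout function, and needs no ground-truth target labels. Experiments show that it outperforms existing baselines on several real-world graph datasets, and the method is extended to identify important nodes in a graph.

Abstract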

We propose a subgraph importance estimation method for pretrained Graph Neural Networks (GNNs) on graph-level tasks, formulated as a linear Group Lasso regression problem in the embedding space. Our method effectively leverages prior domain knowledge of graph substructures, while remaining independent of the specific form of the output layer or readout function used in the GNN architecture, and it does not require access to ground-truth target labels. Experiments on real-world graph datasets demonstrate that our method consistently outperforms existing baselines in subgraph importance estimation. Furthermore, we extend our method to identify important nodes within the graph.
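
Group Lasso penalizes the Euclidean norm of each coefficient group, which drives whole groups to zero together; this is what makes it a natural fit for scoring predefined substructures. The sketch below implements the penalty's proximal operator (block soft-thresholding), a standard building block of Group Lasso solvers. The grouping, the threshold, and the omission of per-group size weights are assumptions for illustration, not details from the paper.

```python
import math

def group_soft_threshold(beta, groups, threshold):
    """Proximal operator of threshold * sum_g ||beta_g||_2 (block soft-thresholding)."""
    out = list(beta)
    for indices in groups:
        norm = math.sqrt(sum(beta[i] ** 2 for i in indices))
        # Shrink the whole group toward zero; drop it if its norm is below the threshold.
        scale = max(0.0, 1.0 - threshold / norm) if norm > 0 else 0.0
        for i in indices:
            out[i] = scale * beta[i]
    return out

# Two hypothetical groups of embedding coefficients, one per substructure.
shrunk = group_soft_threshold([3.0, 4.0, 0.3, 0.4], [[0, 1], [2, 3]], 1.0)
```

The first group (norm 5) is shrunk by a factor of 0.8, while the second group (norm 0.5) falls below the threshold and is zeroed out as a unit, marking that substructure as unimportant.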

2605.12006 2026-05-13 cs.CV

Robust Promptable Video Object Segmentation

Sohyun Lee, Yeho Gwon, Lukas Hoyer, Konrad Schindler, Christos Sakaridis, Suha Kwak

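Affiliations: POSTECH; Google; ETH Zürich

AI summary: This paper studies how promptable video object segmentation (PVOS) models degrade when inputs are corrupted, and presents the first comprehensive study of robust PVOS (RobustPVOS). The authors build a comprehensive benchmark of 351 video clips and more than 2,500 object masks covering a range of real-world adverse conditions, and generate synthetic training data with diverse, temporally varying corruptions. They propose MoGA, a new robust PVOS method that uses object-specific representations held in memory to handle degradation per object while keeping predictions temporally consistent. Experiments show significant improvements across corruption types, providing a strong baseline for future RobustPVOS research.

Comments Accepted to CVPR 2026

Abstract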

The performance of promptable video object segmentation (PVOS) models substantially degrades under input corruptions, which prevents PVOS deployment in safety-critical domains. This paper offers the first comprehensive study on robust PVOS (RobustPVOS). We first construct a new, comprehensive benchmark with two real-world evaluation datasets of 351 video clips and more than 2,500 object masks under real-world adverse conditions. At the same time, we generate synthetic training data by applying diverse and temporally varying corruptions to existing VOS datasets. Moreover, we present a new RobustPVOS method, dubbed Memory-object-conditioned Gated-rank Adaptation (MoGA). The key to successfully performing RobustPVOS is two-fold: effectively handling object-specific degradation and ensuring temporal consistency in predictions. MoGA leverages object-specific representations maintained in memory across frames to condition the robustification process, which allows the model to handle each tracked object differently in a temporally consistent way. Extensive experiments on our benchmark validate MoGA's efficacy, showing consistent and significant improvements across diverse corruption types on both synthetic and real-world datasets, establishing a strong baseline for future RobustPVOS research. Our benchmark is publicly available at https://sohyun-l.github.io/RobustPVOS_project_page/.

2605.12004 2026-05-13 cs.CL

Learning Agentic Policy from Action Guidance

Yuxiang Ji, Zengbin Wang, Yong Wang, Shidong Yang, Ziyu Ma, Guanhua Chen, Zonghua Sun, Liaoni Wu, Xiangxiang Chu

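Affiliations * Xiamen University; AMAP, Alibaba Group; Southern University of Science and Technology

AI summary This work addresses the limited exploration capability of large language models in agentic reinforcement learning and proposes ActGuide-RL, a reinforcement learning method based on action guidance. The method uses the abundant action data produced in everyday interactions as plan-style reference guidance, helping the agent overcome reachability barriers to reward states, and uses mixed-policy training to feed the exploration gains of the guided policy back into the unguided policy. Experiments show that the method substantially outperforms zero RL on search-agent benchmarks and is on par with the SFT+RL pipeline, offering a new paradigm for agentic RL that reduces reliance on large amounts of supervised data.

Comments Work in progress

Details
English abstract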

Agentic reinforcement learning (RL) for Large Language Models (LLMs) critically depends on the exploration capability of the base policy, as training signals emerge only within its in-capability region. For tasks where the base policy cannot reach reward states, additional training or external guidance is needed to recover effective learning signals. Rather than relying on costly iterative supervised fine-tuning (SFT), we exploit the abundant action data generated in everyday human interactions. We propose ActGuide-RL, which injects action data as plan-style reference guidance, enabling the agentic policy to overcome reachability barriers to reward states. Guided and unguided rollouts are then jointly optimized via mixed-policy training, internalizing the exploration gains back into the unguided policy. Motivated by a theoretical and empirical analysis of the benefit-risk trade-off, we adopt a minimal intervention principle that invokes guidance only as an adaptive fallback, matching task difficulty while minimizing off-policy risk. On search-agent benchmarks, ActGuide-RL substantially improves over zero RL (+10.7 pp on GAIA and +19 pp on XBench with Qwen3-4B), and performs on par with the SFT+RL pipeline without any cold start. This suggests a new paradigm for agentic RL that reduces the reliance on heavy SFT data by using scalable action guidance instead.

2605.12002 2026-05-13 cs.CV

EDGER: EDge-Guided with HEatmap Refinement for Generalizable Image Forgery Localization

Minh-Khoa Le-Phan, Minh-Hoang Le, Minh-Triet Tran, Trong-Le Do

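Affiliations * University of Science - VNU-HCM; Vietnam National University

AI summary This paper proposes EDGER, an image forgery localization method designed to meet the challenge posed by text-guided inpainting and to improve cross-domain detection. It uses a dual-branch framework that combines frequency-based edge detection with synthetic heatmap localization, localizing forged regions at the pixel level and the patch level respectively, to achieve accurate, high-resolution, generalizable detection. Experiments show that EDGER generalizes well across domains on multiple benchmark datasets and adapts well to high-resolution images.

Comments Accepted for publication in the Proceedings of the 14th International Symposium on Information and Communication Technology (SOICT 2025)

Details
English abstract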

Text-guided inpainting has made image forgery increasingly realistic, challenging both synthetic image detection (SID) and image forgery localization (IFL). However, existing methods often struggle to identify suspicious signals across domains. To address this problem, we propose EDGER, a patch-based, dual-branch framework that localizes manipulated regions in arbitrary-resolution images without sacrificing native resolution. The first branch, Edge-Guided Segmentation, introduces a Frequency-based Edge Detector to emphasize high-frequency inconsistencies at manipulation boundaries, and fine-tunes a SegFormer to fuse RGB and edge features for pixel-level masks. Since edge evidence is informative only when patches contain both authentic and manipulated pixels, we complement Edge-Guided Segmentation with a Synthetic Heatmapping branch, a classification-based localizer that fine-tunes a CLIP-ViT image encoder with LoRA to flag fully synthetic patches. Together, Synthetic Heatmapping provides coarse, patch-level synthetic priors, while Edge-Guided Segmentation sharpens boundaries within partially manipulated patches, yielding comprehensive localization. Evaluated in the setting of the Manipulated Region Localization task of the MediaEval 2025 SynthIM challenge, our approach scales to multi-megapixel imagery and exhibits strong cross-domain generalization. Extensive ablations highlight the complementary roles of frequency-based edge cues and patch-level synthetic priors in driving accurate, resolution-agnostic localization.

2605.11996 2026-05-13 cs.AI

BadSKP: Backdoor Attacks on Knowledge Graph-Enhanced LLMs with Soft Prompts

Xiaoting Lyu, Yufei Han, Hangwei Qian, Haoyuan Yu, Xiang Ao, Bin Wang, Chenxu Wang, Xiaobo Ma, Wei Wang

Affiliations * Ministry of Education Key Lab for Intelligent Networks and Network Security; Xi’an Jiaotong University; INRIA; CFAR, A*STAR; Beijing Key Laboratory of Security and Privacy in Intelligent Transportation; Beijing Jiaotong University; Institute of Computing Technology, Chinese Academy of Sciences; School of Cyber Engineering, Xi’an University of Electronic Science and Technology; Ministry of Education Key Lab for Intelligent Networks and Network Security at Xi’an Jiaotong University

AI summary This paper studies backdoor attacks on knowledge graph-enhanced large language models (KG-LLMs), in particular newer architectures that encode the knowledge graph into soft prompts with a graph neural network. Such architectures introduce a graph-conditioned channel that greatly weakens existing backdoor attacks aimed at the text channel. The authors therefore propose BadSKP, an attack that manipulates graph representations through a multi-stage optimization strategy so that the induced soft prompt carries adversarial semantics. Experiments show that the attack is effective against the target models in a variety of settings, whereas conventional text-only attacks have limited effect.

Details
English abstract

Recent knowledge graph (KG)-enhanced large language models (LLMs) move beyond purely textual knowledge augmentation by encoding retrieved subgraphs into continuous soft prompts via graph neural networks, introducing a graph-conditioned channel that operates alongside the standard text interface. However, existing backdoor attacks are largely designed for the textual channel, and their effectiveness against this dual-channel architecture remains unclear. We show that this architecture creates a robustness gap: text-channel backdoor attacks that readily compromise textual KG prompting systems become largely ineffective against soft-prompt-based counterparts. We interpret this gap through semantic anchoring, whereby graph-derived soft prompts bias the generation-driving hidden state toward query-consistent semantics and suppress surface-level malicious instructions. Because this anchoring effect is itself induced by the graph channel, an attacker who manipulates graph-level representations can in turn redirect it toward adversarial semantics. To demonstrate this risk, we propose BadSKP, a backdoor attack that targets the graph-to-prompt interface through a multi-stage optimization strategy: it constructs adversarial target embeddings, optimizes poisoned node embeddings to steer the induced soft prompt, and approximates the optimized representations with fluent adversarial node attributes. Experiments on two soft-prompt KG-enhanced LLMs across four datasets show that BadSKP achieves high attack success under both frozen and trojaned settings, while text-only attacks remain unreliable even under perplexity-based defenses.

2605.11993 2026-05-13 cs.CL

Towards Visually-Guided Movie Subtitle Translation for Indic Languages

Tarun Chintada, Kshetrimayum Boynao Singh, Asif Ekbal

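Affiliations * Department of Computer Science and Engineering; Indian Institute of Technology Patna

AI summary This study examines movie subtitle translation for low-resource languages such as Hindi, noting that text-only systems often fail to capture the emotion, action, and social context carried by the visuals. It compares two lightweight visual grounding strategies and finds that selectively enhancing low-quality subtitle segments improves translation quality while greatly reducing the amount of visual processing required. Experiments show that visual context summarization based on coarse attributes is more robust at capturing scene-level emotion and subtle context.

Details
English abstract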

Movie subtitle translation is inherently multimodal, yet text-only systems often miss visual cues needed to convey emotion, action, and social nuance, especially for low-resource Indic languages (English to Hindi, Bengali, Telugu, Tamil and Kannada). We present a case study on five full-length films and compare two lightweight visual grounding strategies: structured attribute summaries from a 5-minute sliding window and free-text summaries of inter-subtitle visual gaps. Our analysis shows that temporal misalignment between subtitles and frames is a major obstacle in long-form video, often rendering indiscriminate visual grounding ineffective. However, oracle selective grounding, which replaces only the lowest-quality 20-30% of baseline segments with visually enhanced outputs, consistently improves COMET over the text-only baseline while requiring far less visual processing. Of the two approaches, coarse attribute-based visual context summarization is more robust, capturing scene-level emotion and subtle contextual cues that text alone often misses.
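
The oracle selective grounding step described above can be sketched in a few lines: rank the baseline segments by a quality score and swap in the visually enhanced output only for the worst fraction. This is a minimal illustration, not the authors' pipeline. The function name, the `fraction` default, and the assumption that per-segment quality scores are already available (for example from a reference-based metric, which is what makes the selection an oracle) are all hypothetical.

```python
def selective_grounding(baseline, visual, quality, fraction=0.25):
    """Replace the lowest-quality fraction of baseline segments.

    baseline: text-only translations, one per subtitle segment.
    visual:   visually enhanced translations for the same segments.
    quality:  per-segment quality scores of the baseline (higher is better).
    fraction: share of segments to replace (hypothetical default).
    """
    if not (len(baseline) == len(visual) == len(quality)):
        raise ValueError("all inputs must have one entry per segment")
    n_replace = int(len(baseline) * fraction)
    # Indices of the n_replace segments with the lowest baseline quality.
    ranked = sorted(range(len(baseline)), key=lambda i: quality[i])
    worst = set(ranked[:n_replace])
    return [visual[i] if i in worst else baseline[i]
            for i in range(len(baseline))]
```

Only the replaced segments need visual processing, which is where the reduction in visual workload comes from.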

2605.11987 2026-05-13 cs.AI cs.LG stat.AP stat.ML

Random-Set Graph Neural Networks

Tommy Woodley, Shireen Kudukkil Manchingal, Matteo Tolloso, Davide Bacciu, Fabio Cuzzolin

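Affiliations * School of Engineering, Computing and Mathematics; Oxford Brookes University; Department of Computer Science; University of Pisa; Oxford Brookes Institute for Artificial Intelligence, Data Analysis and Systems (AIDAS)

AI summary This paper proposes a new graph neural network framework, Random-Set Graph Neural Networks (RS-GNN), for quantifying node-level uncertainty more accurately. The method models the epistemic uncertainty of each node in the belief function formalism and outputs both a precise probability prediction and a measure of uncertainty. Experiments show that RS-GNN achieves superior uncertainty quantification on several real-world graph learning datasets.

Comments 23 pages, 6 figures

Details
English abstract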

Uncertainty quantification has become an important factor in understanding the data representations produced by Graph Neural Networks (GNNs). Although their predictive capabilities are useful across industrial workspaces, the inherent uncertainty induced by the nature of the data is a major limiting factor for GNN performance. While aleatoric uncertainty is the result of noisy and incomplete stochastic data such as missing edges or over-smoothing, epistemic uncertainty arises from lack of knowledge about a system or model (e.g., a graph's topology or node feature representation), which can be reduced by gathering more data and information. In this paper, we propose a new framework in which node-level epistemic uncertainty is modelled in a belief function (finite random set) formalism. The resulting Random-Set Graph Neural Networks have a belief-function head predicting a random set over the list of classes, from which both a precise probability prediction and a measure of epistemic uncertainty can be obtained. Extensive experiments on 9 different graph learning datasets, including real-world autonomous driving benchmarks such as nuScenes and ROAD, demonstrate RS-GNN's superior uncertainty quantification capabilities.
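
To make the belief function formalism concrete, the sketch below computes the standard belief, plausibility, and pignistic probability of a mass function defined over subsets of classes. These are textbook definitions from belief function theory. Whether RS-GNN uses exactly these quantities for its probability prediction and its uncertainty measure is an assumption, and the function names and the example mass values are hypothetical.

```python
def belief(mass, event):
    """Bel(A): total mass of focal sets fully contained in A."""
    event = frozenset(event)
    return sum(m for focal, m in mass.items() if focal <= event)


def plausibility(mass, event):
    """Pl(A): total mass of focal sets that intersect A."""
    event = frozenset(event)
    return sum(m for focal, m in mass.items() if focal & event)


def pignistic(mass, classes):
    """BetP(c): each focal set shares its mass equally among its members."""
    return {c: sum(m / len(focal) for focal, m in mass.items() if c in focal)
            for c in classes}


# Hypothetical prediction for one node over classes a, b, c.
example_mass = {
    frozenset({"a"}): 0.5,
    frozenset({"a", "b"}): 0.3,
    frozenset({"a", "b", "c"}): 0.2,
}
```

For `example_mass`, the belief in class `a` is 0.5 and its plausibility is 1.0. The gap between the two is one common way to express how much of the prediction is left undetermined by the evidence, while `pignistic` collapses the random set into a single probability vector.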

2605.11986 2026-05-13 cs.AI

On the Limitations of Large Language Models for Conceptual Database Modeling

Arthur F. Siqueira, Carlos D. S. Nogueira, Eduarda Farias, Claudio E. C. Campelo, Júlia Menezes

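Affiliations * Systems and Computing Department

AI summary This article analyzes the use of large language models (LLMs) to support the conceptual modeling of relational databases, specifically their ability to generate Entity-Relationship (ER) diagrams automatically from natural language requirements. The study combines different language models and prompt engineering techniques to evaluate how consistently they identify entities, relationships, and attributes. The results show that LLMs perform acceptably in simple scenarios, but their reliability drops as the requirements grow more complex, with more inconsistencies, ambiguities, and failures in representing constraints. Current LLMs are therefore not yet mature enough for complex scenarios, and the cost of validation may offset the apparent efficiency gains.

Details
English abstract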

This article analyzes the use of Large Language Models (LLMs) as support for the conceptual modeling of relational databases through the automatic generation of Entity-Relationship (ER) diagrams from natural language requirements. The approach combines different language models with prompt engineering techniques to evaluate their ability to identify entities, relationships, and attributes in a conceptually consistent manner. The experimental evaluation involved three LLMs, each subjected to three prompting techniques (Zero-Shot, Chain of Thought, and Chain of Thought + Verifier), applied to the same requirements scenario with progressively increasing complexity. The generated diagrams were qualitatively analyzed through direct comparison with the textual requirements, considering the structural and semantic adherence of the modeled elements. The results indicate that, although LLMs show reasonable performance in less complex scenarios, their reliability decreases as the complexity of the requirements increases, with a rise in inconsistencies, ambiguities, and failures in representing constraints. These findings reinforce that, in their current state, LLMs are not sufficiently mature for reliable use in complex scenarios, and the cost of validation may offset the apparent productivity gains.

2605.11983 2026-05-13 cs.LG stat.ML

QDSB: Quantized Diffusion Schrödinger Bridges

Tobias Fuchs, Florian Kalinke, Nadja Klein

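Affiliations * Karlsruhe Institute of Technology

AI summary Learning generative models when the source and target distributions are specified only through unpaired samples is becoming increasingly important. This paper proposes quantized diffusion Schrödinger bridges (QDSB), a method for accelerating the training of simulation-free Schrödinger bridges. It computes the endpoint coupling on anchor-quantized distributions and maps the result back to the original data points through cell-wise sampling, reducing computational cost while keeping the global transport structure stable. Experiments show that QDSB significantly improves training efficiency while preserving sample quality.

Details
English abstract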

Learning generative models in settings where the source and target distributions are only specified through unpaired samples is gaining in importance. Here, one frequently used model is the Schrödinger bridge (SB), which represents the most likely evolution between the two endpoint distributions. To accelerate training, simulation-free SBs avoid the path simulation of the original SB models. However, learning simulation-free SBs requires paired data; a coupling of the source and target samples is obtained as the solution of the entropic optimal transport (OT) problem. As obtaining the optimal global coupling is infeasible in many practical cases, the entropic OT problem is iteratively solved on minibatches instead. Still, the repeated cost remains substantial and the locality can distort the global transport geometry. We propose quantized diffusion Schrödinger bridges (QDSB), which compute the endpoint coupling on anchor-quantized endpoint distributions and lift the resulting plan back to original data points through cell-wise sampling. We show that the regularized optimal coupling is stable w.r.t. anchor quantization, with an error controlled by the quality of the anchor approximation. In real-world experiments, QDSB matches the sample quality of existing baselines while requiring substantially less time. Code and data are available at github.com/mathefuchs/qdsb.
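
The quantize, couple, and lift pipeline described above can be illustrated on a toy 1-D problem. The sketch below is not the QDSB code from the linked repository. It assumes fixed, hand-picked anchors, a squared-distance cost, plain Sinkhorn iterations for the entropic OT step, and uniform sampling inside each cell, and every function name and parameter value is hypothetical.

```python
import math
import random


def quantize(points, anchors):
    """Assign each point to its nearest anchor; return cells and cell weights."""
    cells = [[] for _ in anchors]
    for i, x in enumerate(points):
        k = min(range(len(anchors)), key=lambda a: abs(x - anchors[a]))
        cells[k].append(i)
    weights = [len(c) / len(points) for c in cells]
    return cells, weights


def sinkhorn(a, b, cost, eps=0.5, iters=200):
    """Entropic OT plan between histograms a and b via Sinkhorn scaling."""
    K = [[math.exp(-c / eps) for c in row] for row in cost]
    u, v = [1.0] * len(a), [1.0] * len(b)
    for _ in range(iters):
        u = [a[i] / sum(K[i][j] * v[j] for j in range(len(b)))
             for i in range(len(a))]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(len(a)))
             for j in range(len(b))]
    return [[u[i] * K[i][j] * v[j] for j in range(len(b))]
            for i in range(len(a))]


def sample_pair(plan, src_cells, tgt_cells, rng):
    """Lift the anchor-level plan: draw a cell pair, then one point per cell."""
    pairs = [(i, j) for i in range(len(plan)) for j in range(len(plan[0]))]
    probs = [plan[i][j] for i, j in pairs]
    i, j = rng.choices(pairs, weights=probs, k=1)[0]
    return rng.choice(src_cells[i]), rng.choice(tgt_cells[j])


# Toy data: the anchors are chosen by hand for illustration.
source, src_anchors = [0.0, 0.1, 0.9, 1.0], [0.0, 1.0]
target, tgt_anchors = [2.0, 2.1, 2.2, 3.0], [2.0, 3.0]
src_cells, a = quantize(source, src_anchors)
tgt_cells, b = quantize(target, tgt_anchors)
cost = [[(s - t) ** 2 for t in tgt_anchors] for s in src_anchors]
plan = sinkhorn(a, b, cost)
pair = sample_pair(plan, src_cells, tgt_cells, random.Random(0))
```

The OT problem here is solved on a 2 x 2 anchor histogram rather than on all 4 x 4 point pairs, which is the source of the cost saving in this toy setting. `pair` holds the indices of one source point and one target point that can be used as a training pair.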

2605.11978 2026-05-13 cs.CL

On Predicting the Post-training Potential of Pre-trained LLMs

Xiaoyuan Li, Yubo Ma, Kexin Yang, Moxin Li, Keqin Bao, Wenie Wang, Fuli Feng, Dayiheng Liu

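Affiliations * University of Science and Technology of China; Alibaba Group; National University of Singapore

AI summary This paper studies how to predict, at the pre-training stage, the performance a large language model will reach after post-training, in order to make model selection more efficient. The authors propose RuDE, a rubric-based discriminative evaluation framework that assesses a model's plasticity by constructing fine-grained contrastive samples, and introduce a 4C taxonomy to guide the experimental design. Experiments show that RuDE predicts post-training performance with a correlation above 90%, and validation with reinforcement learning confirms its effectiveness, offering a new approach to developing foundation models efficiently.

Comments Under Review

Details
English abstract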

The performance of Large Language Models (LLMs) on downstream tasks is fundamentally constrained by the capabilities acquired during pre-training. However, traditional benchmarks like MMLU often fail to reflect a base model's plasticity in complex open-ended scenarios, leading to inefficient model selection. We address this by introducing a new task, predicting post-training potential: forecasting a base model's performance before post-training. We propose RuDE (Rubric-based Discriminative Evaluation), a unified framework that bypasses the generation gap of base models by leveraging response discrimination. Guided by our systematic 4C Taxonomy, RuDE constructs controlled contrastive pairs across diverse domains through fine-grained rubric violations. Extensive experiments demonstrate a correlation greater than 90% with post-training performance. Crucially, validation via Reinforcement Learning (RL) confirms that RuDE effectively identifies high-potential smaller models that outperform larger counterparts, offering a compute-efficient mechanism for foundation model development.

2605.11977 2026-05-13 cs.CV

Optimizing 4D Wires for Sparse 3D Abstraction

Dong-Yi Wu, Tong-Yee Lee

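Affiliations * National Cheng Kung University

AI summary This paper proposes a unified framework for 3D geometric abstraction based on a single continuous 4D curve (a B-spline) that represents complex shapes by parameterizing spatial coordinates and variable width. Whereas conventional methods use many independent curve segments and produce fragmented structures, this method enforces global topological coherence and yields cleaner, more structurally coherent 3D abstractions. The work introduces a differentiable rendering pipeline that supports gradient-based optimization and shows higher semantic fidelity and structural coherence in tasks such as image-to-3D abstraction and multi-view wire art generation.

Details
English abstract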

We present a unified framework for 3D geometric abstraction using a single continuous 4D wire, parameterized as a B-spline with spatial coordinates and variable width $(x,y,z,w)$. Existing approaches typically represent shapes as collections of many independent curve segments, which often leads to fragmented structures and limited physical realizability. In contrast, we show that a single continuous spline is sufficiently expressive to capture complex volumetric forms while enforcing global topological coherence. By imposing continuity, our method transforms 3D sketching from a local density-accumulation process into a global routing problem, providing a strong inductive bias toward cleaner aesthetics and improved structural coherence. To enable gradient-based optimization, we introduce a differentiable rendering pipeline that efficiently rasterizes variable-width curves with bounded projection error. This formulation supports robust optimization using modern guidance signals such as Score Distillation Sampling (SDS) or CLIP. We demonstrate applications including image-to-3D abstraction, multi-view wire art generation, and differentiable stylized surface filling. Experiments show that our unified representation produces structures with higher semantic fidelity and improved structural coherence compared to approaches based on collections of discrete curves.
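
As a concrete picture of the 4D wire representation described above, the sketch below evaluates a B-spline whose control points carry position and width, $(x,y,z,w)$. The abstract does not state the spline degree or knot vector, so the uniform cubic basis used here is an assumption, and the function name and control point values are hypothetical.

```python
def cubic_bspline_point(ctrl, t):
    """Evaluate a uniform cubic B-spline with 4D control points.

    ctrl: list of (x, y, z, w) tuples, at least four of them.
    t:    curve parameter in [0, len(ctrl) - 3].
    Returns the interpolated (x, y, z, w) at t.
    """
    n_seg = len(ctrl) - 3
    if n_seg < 1:
        raise ValueError("need at least four control points")
    seg = min(max(int(t), 0), n_seg - 1)  # segment index
    u = t - seg                           # local parameter in [0, 1]
    # Uniform cubic B-spline basis functions; they sum to one for every u.
    basis = (
        (1 - u) ** 3 / 6,
        (3 * u ** 3 - 6 * u ** 2 + 4) / 6,
        (-3 * u ** 3 + 3 * u ** 2 + 3 * u + 1) / 6,
        u ** 3 / 6,
    )
    return tuple(sum(basis[k] * ctrl[seg + k][d] for k in range(4))
                 for d in range(4))


# Hypothetical wire: x grows with the control index, width tapers.
wire = [(0.0, 0.0, 0.0, 1.0), (1.0, 0.0, 0.0, 0.8),
        (2.0, 1.0, 0.0, 0.6), (3.0, 1.0, 1.0, 0.4),
        (4.0, 0.0, 1.0, 0.2)]
sample = cubic_bspline_point(wire, 0.5)
```

Because width is simply a fourth coordinate, it is interpolated by the same basis functions as the position, so position and width vary smoothly together along the single curve.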

2605.11974 2026-05-13 cs.LG

Towards Order Fairness: Mitigating LLMs Order Sensitivity through Dual Group Advantage Optimization

Xu Chu, Guanyu Wang, Zhijie Tan, Xinrong Chen, Ziyu Li, Tong Mo, Weiping Li

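Affiliations * School of Software and Microelectronics, Peking University

AI summary Large language models (LLMs) are biased by the order of their input elements, which limits their use in scenarios such as in-context learning and retrieval-augmented generation. To address this, the paper proposes Dual Group Advantage Optimization (DGAO), a reinforcement learning method that balances an intra-group accuracy advantage and an inter-group stability advantage to improve accuracy and order stability at the same time. The paper also introduces two new metrics, Consistency Rate and Overconfidence Rate, for a more comprehensive evaluation. Experiments show that the method significantly improves order fairness while maintaining model performance.

Details
English abstract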

Large Language Models (LLMs) suffer from order bias, where their performance is affected by the arrangement order of input elements. This unfairness limits the model's applications in scenarios such as in-context learning and Retrieval-Augmented Generation (RAG). Recent studies attempt to obtain optimal or suboptimal arrangements based on statistical results or using dataset-based search, but these methods increase inference overhead while leaving the model's inherent order bias unresolved. Other studies mitigate order sensitivity through supervised fine-tuning using augmented training sets with multiple order variants, but often at the cost of accuracy, trapping the model in consistent yet incorrect hallucinations. In this paper, we propose Dual Group Advantage Optimization (DGAO), which aims to improve model accuracy and order stability simultaneously. DGAO calculates and balances intra-group relative accuracy advantage and inter-group relative stability advantage, rewarding the policy model for generating order-stable and correct outputs while penalizing order-sensitive or incorrect responses. This marks the first time reinforcement learning has been used to mitigate LLMs' order sensitivity. We also propose two new metrics, Consistency Rate and Overconfidence Rate, to reveal the pseudo-stability of previous methods and guide more comprehensive evaluation. Extensive experiments demonstrate that DGAO achieves superior order fairness while improving performance on RAG, mathematical reasoning, and classification tasks. Our code is available at: https://github.com/Hyalinesky/DGAO.

2605.11972 2026-05-13 cs.RO cs.AI cs.ET cs.SY eess.SY

Cooperative Robotics Reinforced by Collective Perception for Traffic Moderation

Mohammad Khoshkdahan, John Pravin Arockiasamy, Andy Flores Comeca, Alexey Vinel

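Affiliations * Karlsruhe Institute of Technology, Karlsruhe, Germany; Halmstad University, Halmstad, Sweden

AI summary This study addresses collisions at non-line-of-sight intersections and proposes a traffic moderation system that combines collective perception with a cooperative robot. The system fuses perception from a dual-camera unit and V2X messages to monitor the road in real time, and when a potential collision risk is detected the robot makes a STOP gesture to keep the vehicle from merging unsafely. Experiments show that the approach improves traffic safety under non-line-of-sight conditions and fills the perception and intervention gap that existing V2X technology leaves for unconnected vehicles.

Comments Accepted for publication in the Proceedings of the 2026 IEEE Vehicular Technology Conference (VTC2026-Spring)

Details
English abstract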

Collisions at non-line-of-sight (NLOS) intersections remain a major safety concern because drivers have limited visibility of approaching traffic. V2X-based warnings can reduce these risks, yet many vehicles are not equipped with V2X and drivers may ignore in-vehicle alerts. Collective perception (CP) can compensate for low V2X penetration by extending the awareness of connected vehicles, but it cannot influence unconnected vehicles. To fill this gap, our work introduces a complementary concept that adds a cooperative humanoid robot as an active traffic moderator capable of physically stopping a vehicle that attempts to merge into an unseen traffic stream. The system operates on two parallel perception pathways. A dual-camera infrastructure unit detects the position, speed and motion of approaching vehicles and transmits this information to the robot as a collective perception message (CPM). The robot also receives cooperative awareness messages (CAM) from connected vehicles through its onboard V2X unit and can act as a relay for decentralized environmental notification messages (DENM) when safety events originate elsewhere along the road. A fusion module combines these streams to maintain a robust real-time view of the main road. A Zone of Danger (ZoD) is defined and used to predict whether an approaching vehicle creates a collision risk for a merging road user. When such a risk is detected, the robot issues a human-like STOP gesture and blocks the merging path until the hazard disappears. The full system was deployed at the Future Mobility Park (FMP) in Rotterdam. Experiments show that the combined vision and V2X perception allows the robot to detect approaching vehicles early, predict hazards reliably and prevent unsafe merges in real-world NLOS conditions.
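
The abstract does not say how the Zone of Danger is computed, so the sketch below only illustrates one plausible reading: a fused track is dangerous if the vehicle would reach the merge point within a fixed time horizon. The `Track` fields, the time-to-arrival rule, and the 6-second horizon are all assumptions made for illustration, not details from the paper.

```python
from dataclasses import dataclass


@dataclass
class Track:
    """One fused track of a vehicle on the main road (hypothetical fields)."""
    distance_m: float  # distance to the merge point along the main road
    speed_mps: float   # speed toward the merge point; <= 0 means not approaching


def in_zone_of_danger(track, horizon_s=6.0):
    """True if the vehicle would reach the merge point within horizon_s."""
    if track.speed_mps <= 0:
        return False
    return track.distance_m / track.speed_mps <= horizon_s


def robot_should_stop(tracks, horizon_s=6.0):
    """Issue the STOP gesture while any tracked vehicle is in the zone."""
    return any(in_zone_of_danger(t, horizon_s) for t in tracks)
```

Under these assumptions, a vehicle 50 m away at 10 m/s arrives in 5 s and triggers the gesture, while one 100 m away at the same speed does not. The gesture is released once no track satisfies the condition.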

2605.11967 2026-05-13 cs.CV

H2G: Hierarchy-Aware Hyperbolic Grouping for 3D Scenes

ByungHa Ko, Youngmin Lee, Dong Hwan Kim

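Affiliations * Department of Computer Science and Engineering, Korea University; Intelligence and Interaction Research Center, Korea Institute of Science and Technology

AI summary This paper proposes H2G, a hierarchy-aware hyperbolic grouping method for grouping 3D scenes at multiple granularities without semantic labels. The method converts affinity cues from 2D foundation models into hierarchical supervision and embeds it in a hyperbolic feature field, which is better suited to modeling tree-like structures. Through a hierarchy-aware objective, H2G jointly models fine-grained parts, object structure, and hierarchical ordering, so that multi-level semantic grouping is carried out in a single feature space.

Details
English abstract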

Hierarchical 3D grouping aims to recover scene groups across multiple granularities, from fine object parts to complete objects, without relying on semantic labels or a fixed vocabulary. The main challenge is to transform 2D foundation-model cues into coherent hierarchy supervision and embed that hierarchy in a 3D representation. We propose H2G, a hyperbolic affinity field for hierarchical 3D grouping. Our method derives semantically organized tree supervision by interpreting foundation-model affinities through Dasgupta's objective for similarity-based hierarchical clustering. This supervision is distilled into a single Lorentz hyperbolic feature field, whose geometry is well suited for tree-like branching structures. A hierarchy-aware objective aligns the field with fine-level assignments, coarse object structure, compact feature clusters, and LCA (Lowest Common Ancestor) ordering. This formulation represents multiple grouping levels in one feature space, enabling semantic hierarchical grouping grounded in 2D foundation-model knowledge.
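
Dasgupta's objective, which the abstract uses to turn affinities into tree supervision, scores a hierarchy by summing, over every pair of leaves, their similarity times the number of leaves under their lowest common ancestor. Lower cost means similar items are merged low in the tree. The sketch below computes that cost for a tree given as nested tuples. It shows only the standard objective, not how H2G builds or optimizes its trees, and the function names and similarity values are hypothetical.

```python
def leaves(tree):
    """Return the leaf indices under a node; a leaf is any non-tuple value."""
    if not isinstance(tree, tuple):
        return [tree]
    return [leaf for child in tree for leaf in leaves(child)]


def dasgupta_cost(tree, w):
    """Sum over leaf pairs of w[i][j] * (number of leaves under their LCA)."""
    if not isinstance(tree, tuple):
        return 0.0
    child_leaves = [leaves(child) for child in tree]
    size = sum(len(group) for group in child_leaves)
    # Pairs that are split at this node have this node as their LCA.
    cost = sum(dasgupta_cost(child, w) for child in tree)
    for a in range(len(child_leaves)):
        for b in range(a + 1, len(child_leaves)):
            for i in child_leaves[a]:
                for j in child_leaves[b]:
                    cost += w[i][j] * size
    return cost


# Hypothetical affinities: items 0 and 1 are similar, item 2 is not.
affinity = [[0.0, 1.0, 0.1],
            [1.0, 0.0, 0.1],
            [0.1, 0.1, 0.0]]
good = dasgupta_cost(((0, 1), 2), affinity)  # merges 0 and 1 first
bad = dasgupta_cost(((0, 2), 1), affinity)   # merges 0 and 2 first
```

Here `good` is smaller than `bad`, because the tree that merges the two similar items first pays for their high affinity at a node with only two leaves.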