arXivDaily: daily arXiv digest, updated Monday through Friday
2510.01172 2026-05-15 cs.CL

Energy-Regularized Sequential Model Editing on Hyperspheres

Qingyuan Liu, Jia-Chen Gu, Yunzhi Yao, Hong Wang, Nanyun Peng

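Affiliations Columbia University; University of California, Los Angeles; Zhejiang University; University of Science and Technology of China

AI summary Large language models need continual updates to stay consistent with real-world knowledge, but sequential editing often destabilizes model representations and triggers catastrophic forgetting. This paper proposes SPHERE, an editing method based on Hyperspherical Energy (HE) regularization, which keeps neuron weights uniformly distributed on the hypersphere and thereby mitigates performance degradation during editing. Experiments show that SPHERE substantially improves editing performance on several mainstream models while largely preserving the models' original capabilities.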

Comments Accepted by ICLR 2026. The code is available at https://github.com/PlusLabNLP/SPHERE. Project page: https://www.qingyuanliu.net/sphere_projectpage/

Abstract

Large language models (LLMs) require constant updates to remain aligned with evolving real-world knowledge. Model editing offers a lightweight alternative to retraining, but sequential editing often destabilizes representations and induces catastrophic forgetting. In this work, we seek to better understand and mitigate performance degradation caused by sequential editing. We hypothesize that hyperspherical uniformity, a property that maintains uniform distribution of neuron weights on a hypersphere, helps the model remain stable and retain prior knowledge while still accommodating new updates. We use Hyperspherical Energy (HE) to quantify neuron uniformity during editing, and examine its correlation with editing performance. Empirical studies across widely used editing methods reveal a strong correlation between HE dynamics and editing performance, with editing failures consistently coinciding with high HE fluctuations. We further theoretically prove that HE dynamics impose a lower bound on the degradation of pretrained knowledge, highlighting why HE stability is crucial for knowledge retention. Motivated by these insights, we propose SPHERE (Sparse Projection for Hyperspherical Energy-Regularized Editing), an HE-driven regularization strategy that stabilizes neuron weight distributions, ultimately preserving prior knowledge while enabling reliable sequential updates. Specifically, SPHERE identifies a sparse space complementary to the principal hyperspherical directions of the pretrained weight matrices and projects new knowledge onto it, attenuating perturbations on the principal directions. Extensive experiments on LLaMA3 (8B) and Qwen2.5 (7B) show that SPHERE outperforms the best baseline in editing capability by an average of 16.41%, while most faithfully preserving general model performance, thereby offering a principled path toward reliable large-scale knowledge editing.
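The abstract does not spell out how Hyperspherical Energy is computed. As a rough illustration only, the sketch below uses the common Riesz s-energy form (sum of inverse pairwise distances between unit-normalized neuron vectors, counted once per unordered pair); the function name, the choice `s = 1`, and treating rows as neurons are assumptions, not details taken from the paper.

```python
import numpy as np

def hyperspherical_energy(weights: np.ndarray, s: float = 1.0, eps: float = 1e-8) -> float:
    """Riesz s-energy of the rows of `weights` projected onto the unit hypersphere.

    Assumption: each row is one neuron; lower energy means more uniformly spread neurons.
    """
    # Project every neuron onto the unit hypersphere.
    unit = weights / (np.linalg.norm(weights, axis=1, keepdims=True) + eps)
    # Pairwise Euclidean distances between all neurons.
    dists = np.linalg.norm(unit[:, None, :] - unit[None, :, :], axis=-1)
    # Keep each unordered pair once (strict upper triangle).
    pair_dists = dists[np.triu_indices(len(unit), k=1)]
    return float(np.sum(1.0 / (pair_dists + eps) ** s))

# Evenly spread neurons versus neurons crowded into one region.
spread = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])
clustered = np.array([[1.0, 0.0], [1.0, 0.1], [1.0, -0.1], [1.0, 0.2]])
print(hyperspherical_energy(spread), hyperspherical_energy(clustered))
```

Under this definition, the crowded configuration has the higher energy, which matches the intuition that a large HE shift signals neurons moving away from a uniform arrangement.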

2510.00977 2026-05-15 cs.LG cs.CL

It Takes Two: Your GRPO Is Secretly DPO

Yihong Wu, Liheng Ma, Lei Ding, Muzhi Li, Xinyu Wang, Kejia Chen, Zhan Su, Zhanguang Zhang, Chenyang Huang, Yingxue Zhang, Mark Coates, Jian-Yun Nie

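Affiliations UdeM (University of Montreal); McGill (McGill University); Mila (Mila AI Institute); UManitoba (University of Manitoba); CUHK (The Chinese University of Hong Kong); ZJU (Zhejiang University); UAlberta (University of Alberta); Amii (Alberta Machine Intelligence Institute); Huawei Noah’s Ark Lab

AI summary This paper examines why GRPO is effective for fine-tuning large language models and offers a new perspective: GRPO's performance advantage comes from its implicit contrastive objective, which makes it structurally close to preference-learning methods such as DPO. Building on this finding, the authors propose 2-GRPO, which constructs the contrastive signal from only two rollouts and substantially reduces compute. Theoretical analysis and experiments show that 2-GRPO retains 97.6% of the performance of 16-GRPO while using only 12.5% of the rollouts and 21% of the training time.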

Abstract

GRPO has emerged as a prominent reinforcement learning algorithm for post-training LLMs. Unlike critic-based methods, GRPO computes advantages by estimating the \emph{value baselines} from group-level statistics, eliminating the need for a critic network. Consequently, the prevailing view emphasizes the necessity of large group sizes, which are assumed to yield more accurate statistical estimates. In this paper, we propose a different view that the efficacy of GRPO stems from its implicit contrastive objective in the optimization, which helps reduce variance via the control variate method. This makes GRPO structurally related to preference learning methods such as DPO. This perspective motivates 2-GRPO, a minimal group-size variant that constructs contrastive signals with only two rollouts. We provide a rigorous theoretical analysis of 2-GRPO and empirically validate its effectiveness: 2-GRPO retains $97.6\%$ of the performance of 16-GRPO, while requiring only $12.5\%$ of the rollouts and $21\%$ of the training time.
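To make the contrastive reading concrete, here is a minimal sketch of group-normalized advantages. It assumes the usual "subtract the group mean, divide by the group standard deviation" form with a population standard deviation; the function name and the `eps` constant are illustrative, and the paper's exact normalization may differ.

```python
import numpy as np

def group_advantages(rewards, eps: float = 1e-8) -> np.ndarray:
    """Group-relative advantages: (reward - group mean) / (group std + eps)."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# With two rollouts, one rewarded and one not, the pair receives
# equal and opposite advantages, much like a chosen/rejected pair.
print(group_advantages([1.0, 0.0]))

# If both rollouts earn the same reward, there is no contrastive signal.
print(group_advantages([1.0, 1.0]))

# Rollout budget of group size 2 relative to group size 16.
print(2 / 16)
```

With a sample (unbiased) standard deviation the magnitudes would change, but the two advantages in a pair would still be equal and opposite, which is the property the DPO analogy relies on. The last line reproduces the 12.5% rollout figure quoted in the abstract.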

2510.00757 2026-05-15 cs.LG

LEAP: Local ECT-Based Learnable Positional Encodings for Graphs

Juan Amboage, Ernst Röell, Patrick Schnider, Bastian Rieck

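Affiliations AIDOS Lab, University of Fribourg; Institute of AI for Health, Helmholtz Munich; Technical University of Munich; Department of Computer Science, ETH Zurich; Department of Computer Science, University of Basel

AI summary This paper proposes LEAP, a learnable graph positional encoding based on the local Euler Characteristic Transform ($\ell$-ECT), to strengthen positional encoding in graph neural networks. The method combines the differentiable approximation of the ECT with its local variant, captures local structural features of graphs, and is trained end to end. Experiments on multiple real-world datasets and a synthetic task show that LEAP performs well, demonstrating its potential for graph representation learning.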

Comments Accepted at the International Conference on Learning Representations (ICLR) 2026. Our code is available at https://www.github.com/aidos-lab/LEAP

Abstract

Graph neural networks (GNNs) largely rely on the message-passing paradigm, where nodes iteratively aggregate information from their neighbors. Yet, standard message passing neural networks (MPNNs) face well-documented theoretical and practical limitations. Graph positional encoding (PE) has emerged as a promising direction to address these limitations. The Euler Characteristic Transform (ECT) is an efficiently computable geometric-topological invariant that characterizes shapes and graphs. In this work, we combine the differentiable approximation of the ECT (DECT) and its local variant ($\ell$-ECT) to propose LEAP, a new end-to-end trainable local structural PE for graphs. We evaluate our approach on multiple real-world datasets as well as on a synthetic task designed to test its ability to extract topological features. Our results underline the potential of LEAP-based encodings as a powerful component for graph representation learning pipelines.
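For readers unfamiliar with the ECT, the sketch below computes the plain (non-differentiable, non-local) Euler characteristic curve of a graph along one direction: at each threshold it counts the vertices whose height is below the threshold and subtracts the edges whose endpoints are both below it. The node coordinates, function name, and thresholds are assumptions for illustration; this is not the DECT or $\ell$-ECT layer used by LEAP.

```python
import numpy as np

def euler_characteristic_curve(coords, edges, direction, thresholds) -> np.ndarray:
    """Euler characteristic (#vertices - #edges) of sublevel sets along `direction`."""
    coords = np.asarray(coords, dtype=float)
    direction = np.asarray(direction, dtype=float)
    direction = direction / np.linalg.norm(direction)
    # Height of each vertex is its projection onto the direction.
    heights = coords @ direction
    # An edge enters the sublevel set once its higher endpoint has entered.
    edge_heights = np.array([max(heights[u], heights[v]) for u, v in edges], dtype=float)
    return np.array([
        int(np.sum(heights <= t)) - int(np.sum(edge_heights <= t))
        for t in thresholds
    ])

# A triangle (3-cycle) embedded in the plane, swept along the x-axis.
coords = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
edges = [(0, 1), (1, 2), (0, 2)]
curve = euler_characteristic_curve(coords, edges, [1.0, 0.0], [-0.5, 0.5, 1.5])
print(curve)  # → [0 1 0]
```

The final value is 0 because a cycle has as many edges as vertices; repeating the sweep over many directions yields the full transform.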

2509.26100 2026-05-15 cs.AI

AgenticEval: Toward Agentic and Self-Evolving Safety Evaluation of Large Language Models

Yixu Wang, Xin Wang, Yang Yao, Xinyuan Li, Xibang Yang, Yan Teng, Xingjun Ma, Yingchun Wang

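Affiliations Shanghai Artificial Intelligence Laboratory; Fudan University; The University of Hong Kong; East China Normal University

AI summary As large language models are deployed widely in high-stakes domains, existing static evaluation methods struggle to keep pace with the dynamic nature of AI risks and continually evolving regulations. This paper proposes AgenticEval, a new agent-driven safety evaluation paradigm: a multi-agent framework autonomously parses policy documents, continually generates and evolves a comprehensive safety benchmark, and refines test cases through a self-evolving evaluation loop. Experiments show that the method uncovers deep safety vulnerabilities that traditional evaluation misses, underscoring the importance of dynamic evaluation for deploying AI safely.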

Comments Findings of ACL 2026

Abstract

The rapid integration of Large Language Models (LLMs) into high-stakes domains necessitates reliable safety and compliance evaluation. However, existing static benchmarks are ill-equipped to address the dynamic nature of AI risks and evolving regulations, creating a critical safety gap. This paper introduces a new paradigm of agentic safety evaluation, reframing evaluation as a continuous and self-evolving process rather than a one-time audit. We then propose a novel multi-agent framework AgenticEval, which autonomously ingests unstructured policy documents to generate and perpetually evolve a comprehensive safety benchmark. AgenticEval leverages a synergistic pipeline of specialized agents and incorporates a Self-evolving Evaluation loop, where the system learns from evaluation results to craft progressively more sophisticated and targeted test cases. Our experiments demonstrate the effectiveness of AgenticEval, showing a consistent decline in model safety as the evaluation hardens. For instance, GPT-5's safety rate on the EU AI Act drops from 72.50% to 36.36% over successive iterations. These findings reveal the limitations of static assessments and highlight our framework's ability to uncover deep vulnerabilities missed by traditional methods, underscoring the urgent need for dynamic evaluation ecosystems to ensure the safe and responsible deployment of advanced AI.

2509.25914 2026-05-15 cs.LG

ReNF: Rethinking the Design of Neural Long-Term Time Series Forecasters

Yihang Lu, Xianwei Meng, Enhong Chen

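Affiliations HFIPS, Chinese Academy of Sciences; University of Science and Technology of China; Hefei University of Technology

AI summary This paper revisits the design principles of neural forecasters for long-term time series forecasting and proposes ReNF, a framework built on a Variance Reduction Hypothesis. It combines the strengths of auto-regressive and direct-output structures in a streamlined Boosted Direct Output paradigm and introduces parameter smoothing to improve generalization. Experiments show that these principled improvements let a simple temporal MLP outperform recent, more complex state-of-the-art models on multiple benchmarks, confirming the importance of design principles.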

Abstract

Neural Forecasters (NFs) have become a cornerstone of Long-term Time Series Forecasting (LTSF). However, recent progress has been hampered by an overemphasis on architectural complexity at the expense of fundamental forecasting structures. In this work, we revisit principled designs of LTSF. We begin by formulating a Variance Reduction Hypothesis (VRH), positing that generating and combining multiple forecasts is essential to reducing the inherent uncertainty of NFs. Guided by this, we propose Boosted Direct Output (BDO), a streamlined paradigm that synergistically hybridizes the causal structure of Auto-Regressive (AR) with the stability of Direct Output (DO), while implicitly realizing the principle of forecast combination within a single network. Furthermore, we mitigate a critical validation-test generalization gap by employing parameter smoothing to stabilize optimization. Extensive experiments demonstrate that these trivial yet principled improvements enable a direct temporal MLP to outperform recent, complex state-of-the-art models in nearly all benchmarks, without relying on intricate inductive biases. Finally, we empirically verify our hypothesis, establishing a dynamic performance bound that highlights promising directions for future research. The code is publicly available at: https://github.com/Luoauoa/ReNF.

2509.25826 2026-05-15 cs.LG

Kairos: Toward Adaptive and Parameter-Efficient Time Series Foundation Models

Kun Feng, Shaocheng Lan, Yuchen Fang, Wenchao He, Sihan Lu, Shuqi Gu, Lintao Ma, Xingyu Lu, Kan Ren

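Affiliations School of Information Science and Technology, ShanghaiTech University; Ant Group

AI summary Time Series Foundation Models (TSFMs) face challenges in zero-shot generalization, mainly because of inherent temporal heterogeneity such as varying sampling densities and periodic structures. To address this, the paper proposes Kairos, a flexible and parameter-efficient TSFM that uses a dynamic patching tokenizer and a mixture-of-size encoding to decouple temporal heterogeneity from model capacity, enabling fine-grained temporal abstraction without increasing model width or depth. Kairos also introduces a multi-granularity positional embedding based on dynamic rotary encodings, conditioned on instance-level spectral features and temporal structure, and achieves superior zero-shot performance with fewer parameters on two mainstream benchmarks.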

Abstract

Inherent temporal heterogeneity, such as varying sampling densities and periodic structures, has posed substantial challenges in zero-shot generalization for Time Series Foundation Models (TSFMs). Existing TSFMs predominantly rely on massive parameterization to absorb such heterogeneity, as their static tokenization and positional encoding schemes entangle diverse temporal patterns into a fixed representation space, encouraging memorization rather than adaptation. To address this limitation, we propose Kairos, a flexible and parameter-efficient TSFM dedicated to forecasting tasks, which decouples temporal heterogeneity from model capacity through a novel tokenization perspective. Kairos introduces a dynamic patching tokenizer and a mixture-of-size encoding that adapt observational granularity to local information density, enabling fine-grained temporal abstraction without increasing model width or depth. In addition, we design a multi-granularity positional embedding based on dynamic rotary encodings, which conditions on instance-level spectral features and temporal structure induced by dynamic patching tokenization, allowing robust modeling of diverse temporal dependencies. Trained on a novel Predictability-Stratified Time-Series (PreSTS) corpus, Kairos achieves superior zero-shot performance with substantially fewer parameters on two mainstream benchmarks, GIFT-Eval and Time-Series-Library. The project page is at https://foundation-model-research.github.io/Kairos .

2509.23023 2026-05-15 cs.AI

Deceive, Detect, and Disclose: Large Language Models Play Mini-Mafia

Davi Bastos Costa, Renato Vicente

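Affiliations TELUS Digital Research Hub; Center for Artificial Intelligence and Machine Learning; Institute of Mathematics, Statistics and Computer Science

AI summary This paper proposes Mini-Mafia, a simplified social deduction game for evaluating large language models in multi-agent interaction. By analyzing the interaction among the mafioso, the detective, and the villager, the study derives an analytical formula that predicts the mafia's win probability and builds on it the Mini-Mafia Benchmark, which quantitatively assesses models' deception, detection, and disclosure capabilities. Experiments show that the formula predicts outcomes well across model combinations and reveal some counterintuitive findings about the capabilities of current mainstream models.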

Comments Adds a validation section for the theoretical model and restructures the presentation

Abstract

Large language models are increasingly deployed in multi-agent settings whose outcomes hinge on social intelligence, motivating evaluations of their interactive capabilities; yet existing studies remain overwhelmingly empirical, leaving us without a theoretical understanding of how agent interactions determine collective outcomes. To address this, we introduce \textit{Mini-Mafia}, a four-player simplification of the social deduction game Mafia in which a fixed night phase reduces the game to a single critical exchange among a mafioso, a detective, and a villager. In this setting, we show that the mafia win-rate $p$ is predicted by the analytical formula $\text{logit}(p) = v \times (m - d)$, where $m$, $d$, and $v$ represent the mafioso's deception, the detective's disclosure, and the villager's detection capabilities. We turn this analytical framework into the \textit{Mini-Mafia Benchmark}, where Bayesian inference over gameplay data yields per-model estimates of the intrinsic parameters $m$, $d$, and $v$. For $I$ models, only $3I$ parameters suffice to predict the outcomes of all $I^3$ tournament combinations; and in 5-fold cross-validation the formula achieves a $76.6\%$ Brier-score reduction over a random baseline. The benchmark also reveals counterintuitive results: Grok 3 Mini is the strongest detector and GPT-5 Mini the strongest discloser, both ahead of DeepSeek V3.1, Claude Opus 4, and Claude Sonnet 4; while Claude Sonnet 4 is the weakest detector, near random chance. Together, these results show that Mini-Mafia, a simple but nontrivial multi-agent system, admits an analytical description and serves as a principled benchmark for language model interactions.
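The formula $\text{logit}(p) = v \times (m - d)$ can be inverted with the logistic function to get the win probability directly. The sketch below does that and also reproduces the parameter-count argument ($3I$ parameters versus $I^3$ role combinations); the function names and the numeric parameter values are made up for illustration and are not estimates from the benchmark.

```python
import math

def mafia_win_probability(m: float, d: float, v: float) -> float:
    """Invert logit(p) = v * (m - d) to obtain the mafia win-rate p."""
    return 1.0 / (1.0 + math.exp(-v * (m - d)))

def parameter_vs_combination_count(num_models: int) -> tuple[int, int]:
    """Return (number of parameters 3I, number of tournament combinations I^3)."""
    return 3 * num_models, num_models ** 3

# Hypothetical capability values, chosen only to show the behaviour.
print(mafia_win_probability(m=1.0, d=1.0, v=2.0))  # deception equals disclosure
print(mafia_win_probability(m=1.5, d=0.5, v=2.0))  # mafioso out-deceives the detective
print(parameter_vs_combination_count(10))
```

When deception and disclosure are equal the prediction is a coin flip regardless of $v$; the villager's detection $v$ only scales how strongly a gap between $m$ and $d$ moves the outcome.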

2509.22746 2026-05-15 cs.AI cs.CV

Mixture-of-Visual-Thoughts: Exploring Context-Adaptive Reasoning Mode Selection for General Visual Reasoning

Zejun Li, Yingxiu Zhao, Jiwen Zhang, Siyuan Wang, Yang Yao, Runzhou Zhao, Jun Song, Bo Zheng, Zhongyu Wei

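Affiliations Fudan University; Alibaba Group Holding Limited; Future Living Lab of Alibaba; University of Southern California; Shanghai Innovation Institute

AI summary Current visual reasoning methods mainly explore specific reasoning modes; they improve results in particular domains but struggle to yield general reasoning capabilities. This paper proposes Mixture-of-Visual-Thoughts (MoVT), a new adaptive reasoning paradigm that unifies different reasoning modes within a single model and selects the appropriate mode based on context. It introduces AdaVaR, a two-stage adaptive visual reasoning framework that uses supervised learning for initial training and then reinforcement learning with a carefully designed algorithm to induce context-adaptive mode selection. Experiments show consistent improvements in visual reasoning across a variety of scenarios.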

Comments 27 pages, 11 figures, 5 tables, accepted by ICLR 2026

Abstract

Current visual reasoning methods mainly focus on exploring specific reasoning modes. Although improvements can be achieved in particular domains, they struggle to develop general reasoning capabilities. Inspired by this, we propose a novel adaptive reasoning paradigm, Mixture-of-Visual-Thoughts (MoVT), which unifies different reasoning modes within a single model and guides it to select the appropriate mode based on context. To achieve this, we introduce AdaVaR, a two-stage Adaptive Visual Reasoning learning framework: different modes are unified and learned during the supervised cold-start stage, and the mode selection capability is induced via an RL process with a carefully designed AdaGRPO algorithm. Extensive experiments show that AdaVaR effectively guides the model to learn and differentiate multiple modes and perform context-adaptive mode selection, achieving consistent improvement across various scenarios, highlighting MoVT as an effective solution for building general visual reasoning models.

2509.20846 2026-05-15 cs.LG

Causal Time Series Generation via Diffusion Models

Yutong Xia, Chang Xu, Yuxuan Liang, Li Zhao, Qingsong Wen, Roger Zimmermann, Jiang Bian

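Affiliations National University of Singapore; Microsoft Research Asia; HKUST (Guangzhou); Squirrel AI

AI summary This paper proposes a causal perspective on conditional time series generation, extending the task to interventional and counterfactual settings and forming a new task family, causal time series generation (Causal TSG). The authors design CaTSG, a unified diffusion-based framework that uses backdoor adjustment and the abduction-action-prediction procedure to steer generation toward interventions and counterfactuals. Experiments show that CaTSG generates interventional and counterfactual sequences while preserving observational fidelity, outperforming existing baselines.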

Abstract

Time series generation (TSG) synthesizes realistic sequences and has achieved remarkable success. Among TSG methods, conditional models generate sequences given observed covariates; however, such models learn observational correlations without considering unobserved confounding. In this work, we propose a causal perspective on conditional TSG and introduce causal time series generation as a new TSG task family, formalized within Pearl's causal ladder, extending beyond observational generation to include interventional and counterfactual settings. To instantiate these tasks, we develop CaTSG, a unified diffusion-based framework with backdoor-adjusted guidance that causally steers sampling toward desired interventions and individual counterfactuals while preserving observational fidelity. Specifically, our method derives causal score functions via backdoor adjustment and the abduction-action-prediction procedure, thus enabling principled support for all three levels of TSG. Extensive experiments on both synthetic and real-world datasets show that CaTSG achieves superior fidelity while also supporting interventional and counterfactual generation that existing baselines cannot handle. Overall, we propose the causal TSG family and instantiate it with CaTSG, providing an initial proof-of-concept and opening a promising direction toward more reliable simulation under interventions and counterfactual generation.

2509.14232 2026-05-15 cs.CV

GenExam: A Multidisciplinary Text-to-Image Exam

Zhaokai Wang, Penghao Yin, Xiangyu Zhao, Changyao Tian, Yu Qiao, Wenhai Wang, Jifeng Dai, Gen Luo

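Affiliations Shanghai Jiao Tong University; Tsinghua University; Shanghai AI Laboratory; The Chinese University of Hong Kong

AI summary GenExam is the first exam-style benchmark for multidisciplinary text-to-image generation, designed to evaluate models' integrated understanding, reasoning, and image generation. It contains 1,000 problems across 10 subjects, each with a ground-truth image and fine-grained scoring points for precise evaluation of semantic correctness and visual plausibility. Experiments show that GenExam is highly challenging for existing models and that open-source models lag well behind closed-source ones, highlighting the limits of current generative models on complex tasks.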

Comments Accepted by ICML 2026

Abstract

Exams are a fundamental test of expert-level intelligence and require integrated understanding, reasoning, and generation. Existing exam-style benchmarks mainly focus on understanding and reasoning tasks, and current generation benchmarks emphasize the illustration of world knowledge and visual concepts, neglecting the evaluation of rigorous drawing exams. We introduce GenExam, the first benchmark for multidisciplinary text-to-image exams, featuring 1,000 samples across 10 subjects with exam-style prompts organized under a four-level taxonomy. Each problem is equipped with ground-truth images and fine-grained scoring points to enable a precise evaluation of semantic correctness and visual plausibility. Experiments on 17 text-to-image and unified models demonstrate the great challenge of GenExam and the huge gap where open-source models consistently lag behind the leading closed-source ones. By framing image generation as an exam, GenExam offers a rigorous assessment of models' ability to integrate understanding, reasoning, and generation, providing insights on the path to intelligent generative models. Our benchmark and evaluation code are released at https://github.com/OpenGVLab/GenExam.

2509.01299 2026-05-15 cs.CV

Cross-Domain Few-Shot Segmentation via Ordinary Differential Equations over Time Intervals

Huan Ni, Qingshan Liu, Xiaonan Niu, Danfeng Hong, Lingli Zhao, Haiyan Guan

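Affiliations School of Remote Sensing & Geomatics Engineering, Nanjing University of Information Science & Technology; Tiandu-NUIST Deep Space Exploration Laboratory; School of Computer Science, Nanjing University of Posts and Telecommunications; Nanjing Center, China Geological Survey; School of Automation, Southeast University; School of Remote Sensing and Information Engineering, Wuhan University

AI summary This paper studies cross-domain few-shot segmentation (CD-FSS), which aims to segment unseen categories from very few samples under domain shift between the source and target domains. To address the restricted flow of knowledge caused by independent modules in existing methods, the authors propose FSS-TIs, an all-in-one module based on ordinary differential equations (ODEs) and the Fourier transform, which explores domain-agnostic features and learns efficiently from limited samples through feature evolution over time intervals. Experiments show that the method outperforms existing methods in both cross-domain adaptability and segmentation performance.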

Abstract

Cross-domain few-shot segmentation (CD-FSS) aims to segment unseen categories with very limited samples while alleviating the negative effects of domain shift between the source and target domains. Existing CD-FSS studies typically rely on multiple independent modules to enhance cross-domain adaptability. However, the independence among these modules hinders the effective flow of knowledge, making it difficult to fully leverage their collective potential. In contrast, this paper proposes an all-in-one module based on ordinary differential equations (ODEs) and the Fourier transform, resulting in a structurally concise method, Few-Shot Segmentation over Time Intervals (FSS-TIs). FSS-TIs not only explores a domain-agnostic feature space, but also achieves significant performance improvement through target-domain fine-tuning with extremely limited support samples. Specifically, the ODE modeling process incorporates nonlinear transformations and random perturbations of the amplitude and phase spectra, effectively simulating potential target-domain data distributions. Meanwhile, the analytical solution of the ODE is transformed into a theoretically infinitely iterable feature refinement process, thereby enhancing the learning capability under limited support samples. In this way, both the exploration of domain-agnostic features and the few-shot learning problem can be addressed through the optimization of the intrinsic parameters of the ODE. Moreover, during target-domain fine-tuning, we strictly constrain the support samples to match the settings of real-world CD-FSS tasks, without incurring additional annotation costs. Experimental results demonstrate the superiority of FSS-TIs over existing CD-FSS methods, and in-depth ablation studies further validate the cross-domain adaptability of FSS-TIs.

2508.15198 2026-05-15 cs.LG math-ph math.MP

Frequency-adaptive tensor neural networks for high-dimensional multi-scale problems

Jizu Huang, Yue Qiu, Rukang You

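Affiliations SKLMS, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, PR China; School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing 100190, PR China; College of Mathematics and Statistics, Chongqing University

AI summary This work addresses the difficulty that conventional tensor neural networks (TNNs) have in accurately capturing high-frequency features in high-dimensional multi-scale problems and proposes a frequency-adaptive TNN method. It analyzes the training dynamics of TNNs through Fourier analysis, incorporates random Fourier features to enhance expressivity, and exploits the tensor structure of TNNs by applying the Discrete Fourier Transform to one-dimensional component functions, which mitigates the curse of dimensionality. The method significantly improves TNNs on complex multi-scale problems, and extensive numerical experiments validate its effectiveness and robustness.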

Abstract

Tensor neural networks (TNNs) have demonstrated their superiority in solving high-dimensional problems. However, similar to conventional neural networks, TNNs are also influenced by the Frequency Principle, which limits their ability to accurately capture high-frequency features of the solution. In this work, we analyze the training dynamics of TNNs by Fourier analysis and enhance their expressivity for high-dimensional multi-scale problems by incorporating random Fourier features. Leveraging the inherent tensor structure of TNNs, we further propose a novel approach to extract frequency features of high-dimensional functions by applying the Discrete Fourier Transform to one-dimensional component functions. This strategy effectively mitigates the curse of dimensionality. Building on this idea, we propose a frequency-adaptive TNNs algorithm, which significantly improves the ability of TNNs in solving complex multi-scale problems. Extensive numerical experiments are performed to validate the effectiveness and robustness of the proposed frequency-adaptive TNNs algorithm.
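The reason a one-dimensional DFT helps is that each component function can be sampled on a 1D grid of n points, so d dimensions cost about d·n samples rather than the n^d of a full grid. The sketch below shows only that per-component step: it samples a hypothetical 1D component function and reads off its dominant frequency with NumPy's real FFT. The function name, grid size, and test signals are assumptions; how the paper feeds these frequencies back into the network is not reproduced here.

```python
import numpy as np

def dominant_frequencies(samples: np.ndarray, spacing: float, top_k: int = 1) -> np.ndarray:
    """Return the `top_k` frequencies with the largest DFT magnitude (DC term ignored)."""
    spectrum = np.abs(np.fft.rfft(samples))
    freqs = np.fft.rfftfreq(len(samples), d=spacing)
    spectrum[0] = 0.0  # drop the constant component
    order = np.argsort(spectrum)[::-1][:top_k]
    return freqs[order]

# Hypothetical one-dimensional component functions sampled on [0, 1).
n = 128
x = np.arange(n) / n
component_a = np.sin(2 * np.pi * 5 * x)
component_b = np.cos(2 * np.pi * 12 * x)

print(dominant_frequencies(component_a, spacing=1 / n))  # → [5.]
print(dominant_frequencies(component_b, spacing=1 / n))  # → [12.]

# Sample counts: per-dimension grids versus a full tensor grid in d = 6 dimensions.
d = 6
print(d * n, n ** d)
```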

2508.06226 2026-05-15 cs.AI

GeoLaux: A Benchmark for Evaluating MLLMs' Geometry Performance on Long-Step Problems Requiring Auxiliary Lines

Yumeng Fu, Jiayin Zhu, Lingling Zhang, Wenjun Wu, Bo Zhao, Shaoxuan Ma, Yushun Zhang, Jun Liu

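Affiliations School of Computer Science and Technology, Xi’an Jiaotong University; Ministry of Education Key Laboratory of Intelligent Networks and Network Security, China; Shaanxi Province Key Laboratory of Big Data Knowledge Engineering, China; School of Software Engineering, Xi’an Jiaotong University

AI summary GeoLaux is a fine-grained benchmark dataset for evaluating Multimodal Large Language Models (MLLMs) on long-step geometry problems that require auxiliary line construction. It contains 2,186 calculation and proof problems with an average solution length of 6.51 steps, and 41.8% of the problems require auxiliary lines. A five-dimensional evaluation of 23 leading MLLMs finds that performance drops markedly on long-step problems, that weak understanding of auxiliary lines is a key factor limiting geometric reasoning, and that limited answer hints improve process correctness. GeoLaux provides a reference for evaluating and improving the geometric reasoning of MLLMs.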

Comments 26 pages, 24 figures

Abstract

Geometry problem solving (GPS) poses significant challenges for Multimodal Large Language Models (MLLMs) in diagram comprehension, knowledge application, long-step reasoning, and auxiliary line construction. However, current benchmarks lack fine-grained evaluation for long-step problems necessitating auxiliary construction. To address these limitations, we present GeoLaux, a fine-grained annotated dataset comprising 2186 calculation and proof problems. It features long-step reasoning (with an average solution length of 6.51 steps, maximum of 24 steps) and auxiliary line construction (required in 41.8% of problems). Building on the dataset, we conduct a comprehensive five-dimensional evaluation of 23 leading MLLMs. The evaluation yields three pivotal findings: First, models perform significantly worse on long-step problems compared to short-step ones, with 18 models exhibiting a performance drop of over 50%. Second, it is crucial to enhance models' understanding, awareness, and proficiency in auxiliary line construction, which is vital for overall geometric reasoning. Third, limited answer hints effectively improve process correctness, whereas explicit answers lead models to neglect intermediate reasoning steps. These findings position GeoLaux both to benchmark MLLMs' geometry reasoning abilities and to guide their improvement. Data and code are available at https://github.com/Candice-yu/GeoLaux

2508.06202 2026-05-15 cs.CV cs.AI

LoRA in LoRA: Towards Parameter-Efficient Architecture Expansion for Continual Visual Instruction Tuning

Chang Che, Ziqi Wang, Pengwan Yang, Qi Wang, Hui Ma, Zenglin Shi

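Affiliations Hefei University of Technology; University of Amsterdam; Tsinghua University

AI summary Continual Visual Instruction Tuning (CVIT) lets multimodal large language models learn new tasks incrementally but suffers from catastrophic forgetting. To address this, the paper proposes LiLoRA, an efficient architecture expansion method that shares the LoRA matrix A across tasks and applies a low-rank decomposition to matrix B, substantially reducing parameter overhead, and adds a cosine-regularized stability loss to keep shared representations consistent. Experiments on a diverse CVIT benchmark show that LiLoRA achieves better performance while improving parameter efficiency.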

Comments AAAI 2026 Oral Presentation. 9 pages

Journal ref Proceedings of the AAAI Conference on Artificial Intelligence, 40(24):19978--19986, 2026

Abstract

Continual Visual Instruction Tuning (CVIT) enables Multimodal Large Language Models (MLLMs) to incrementally learn new tasks over time. However, this process is challenged by catastrophic forgetting, where performance on previously learned tasks deteriorates as the model adapts to new ones. A common approach to mitigate forgetting is architecture expansion, which introduces task-specific modules to prevent interference. Yet, existing methods often expand entire layers for each task, leading to significant parameter overhead and poor scalability. To overcome these issues, we introduce LoRA in LoRA (LiLoRA), a highly efficient architecture expansion method tailored for CVIT in MLLMs. LiLoRA shares the LoRA matrix A across tasks to reduce redundancy, applies an additional low-rank decomposition to matrix B to minimize task-specific parameters, and incorporates a cosine-regularized stability loss to preserve consistency in shared representations over time. Extensive experiments on a diverse CVIT benchmark show that LiLoRA consistently achieves superior performance in sequential task learning while significantly improving parameter efficiency compared to existing approaches. The code is available at https://github.com/chanceche/LiLoRA.
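The parameter savings can be illustrated with simple counting. The sketch below assumes one possible reading of the abstract: a single matrix A of shape r × d_in shared by all tasks, and each task's B of shape d_out × r factored into two smaller matrices with inner rank k < r. The exact factorization, the layer sizes, and the function names are assumptions, not details confirmed by the paper.

```python
def per_task_lora_params(d_in: int, d_out: int, r: int, num_tasks: int) -> int:
    """Baseline: every task gets its own A (r x d_in) and B (d_out x r)."""
    return num_tasks * r * (d_in + d_out)

def shared_a_factored_b_params(d_in: int, d_out: int, r: int, k: int, num_tasks: int) -> int:
    """Assumed variant: one shared A, and per task B = P @ Q with P (d_out x k), Q (k x r)."""
    shared = r * d_in
    per_task = k * (d_out + r)
    return shared + num_tasks * per_task

# Hypothetical sizes: a 4096 x 4096 projection, LoRA rank 8, inner rank 2, five tasks.
baseline = per_task_lora_params(d_in=4096, d_out=4096, r=8, num_tasks=5)
compact = shared_a_factored_b_params(d_in=4096, d_out=4096, r=8, k=2, num_tasks=5)
print(baseline, compact, round(compact / baseline, 3))
```

Under these assumed sizes the per-task cost no longer depends on d_in, so the gap widens as more tasks are added.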

2508.05008 2026-05-15 cs.CV

Multimodal Causal-Driven Representation Learning for Generalizable Medical Image Segmentation

Xusheng Liang, Lihua Zhou, Nianxin Li, Miao Xu, Ziyang Song, Dong Yi, Jinlin Wu, Jiawei Ma, Hongbin Liu, Zhen Lei, Jiebo Luo

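Affiliations City University of Hong Kong; Shenzhen Loop Area Institute; CAIR, HKISI, Chinese Academy of Sciences; UESTC (University of Electronic Science and Technology of China); MAIS, Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences

AI summary This work targets domain shift in medical image segmentation caused by equipment differences, imaging modes, and similar factors, and proposes MCDRL, a multimodal causal-driven representation learning framework. The method combines a vision-language model with causal inference: it builds a dictionary of domain-specific confounders and trains a causal intervention network, removing domain bias while preserving anatomical structure information. Experiments show that MCDRL achieves better segmentation accuracy and stronger cross-domain generalization on multiple medical image segmentation tasks.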

Comments Accepted by CVPR 2026

Abstract

Vision-Language Models (VLMs), such as CLIP, have demonstrated remarkable zero-shot capabilities in various computer vision tasks. However, their application to medical imaging remains challenging due to the high variability and complexity of medical data. Specifically, medical images often exhibit significant domain shifts caused by various confounders, including equipment differences, procedure artifacts, and imaging modes, which can lead to poor generalization when models are applied to unseen domains. To address this limitation, we propose Multimodal Causal-Driven Representation Learning (MCDRL), a novel framework that integrates causal inference with the VLM to tackle domain generalization in medical image segmentation. MCDRL is implemented in two steps: first, it leverages CLIP's cross-modal capabilities to identify candidate lesion regions and construct a confounder dictionary through text prompts, specifically designed to represent domain-specific variations; second, it trains a causal intervention network that utilizes this dictionary to identify and eliminate the influence of these domain-specific variations while preserving the anatomical structural information critical for segmentation tasks. Extensive experiments demonstrate that MCDRL consistently outperforms competing methods, yielding superior segmentation accuracy and exhibiting robust generalizability.

2508.01916 2026-05-15 cs.LG cs.AI cs.CL

Decomposing Representation Space into Interpretable Subspaces with Unsupervised Learning

Xinting Huang, Michael Hahn

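Affiliations Saarland University

AI summary This paper studies how to decompose the representation space of neural networks into interpretable subspaces with unsupervised learning. The authors propose neighbor distance minimization (NDM), which learns subspaces aligned with the model's internal concepts without relying on labels. Experiments show that these subspaces capture abstract concepts in the input and are strongly connected to known circuit variables in models such as GPT-2, offering a new perspective on model internals.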

Comments Published as a conference paper at ICLR 2026

Abstract

Understanding internal representations of neural models is a core interest of mechanistic interpretability. Due to its large dimensionality, the representation space can encode various aspects about inputs. To what extent are different aspects organized and encoded in separate subspaces? Is it possible to find these ``natural'' subspaces in a purely unsupervised way? Somewhat surprisingly, we can indeed achieve this and find interpretable subspaces by a seemingly unrelated training objective. Our method, neighbor distance minimization (NDM), learns non-basis-aligned subspaces in an unsupervised manner. Qualitative analysis shows subspaces are interpretable in many cases, and encoded information in obtained subspaces tends to share the same abstract concept across different inputs, making such subspaces similar to ``variables'' used by the model. We also conduct quantitative experiments using known circuits in GPT-2; results show a strong connection between subspaces and circuit variables. We also provide evidence showing scalability to 2B models by finding separate subspaces mediating context and parametric knowledge routing. Viewed more broadly, our findings offer a new perspective on understanding model internals and building circuits.

2507.21433 2026-05-15 cs.LG cs.AI

ReasonCache: Accelerating Large Reasoning Model Serving through KV Cache Sharing

Kaiwen Chen, Xin Tan, Minchen Yu, Jingzong Li, Hong Xu

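Affiliations * The Chinese University of Hong Kong; The Chinese University of Hong Kong, Shenzhen; The Hang Seng University of Hong Kong

AI summary Large reasoning models (LRMs) play a key role in many AI inference systems, but deploying them in production poses quality-of-service (QoS) challenges: the high memory overhead of long-sequence inference limits throughput and increases latency. The paper proposes ReasonCache, a KV cache management approach based on a collaborative filtering algorithm that identifies and reuses the KV cache blocks corresponding to similar intermediate reasoning steps, enabling zero-copy cache reuse. Experiments show a peak throughput improvement of 89.2% and an average gain of 40-60% while maintaining relatively high accuracy, making AI inference services more responsive and cost-effective.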

Comments 10 pages, 7 figures

Abstract

Large Reasoning Models (LRMs) are becoming integral to many AI inference systems, enhancing their capabilities with advanced reasoning. However, deploying these models in production environments presents a significant QoS challenge: the substantial memory overhead from their long, auto-regressive inference processes severely limits throughput and increases latency, thereby affecting the quality of service for concurrent users. We observe that LRMs frequently generate highly similar intermediate reasoning steps, which, in turn, correspond to highly similar KV cache states across layers. Building on this insight, we propose ReasonCache, a novel KV cache management approach designed to improve the QoS of AI inference systems. ReasonCache utilizes a Collaborative Filtering Algorithm to efficiently identify reusable KV cache blocks and enables zero-copy cache reuse. Experimental evaluation demonstrates that ReasonCache achieves a peak throughput improvement of 89.2% and an average gain of 40-60%, leading to more responsive and cost-effective AI inference services. Notably, this performance is achieved while maintaining higher accuracy compared to existing KV cache management techniques.

2507.21023 2026-05-15 cs.LG eess.SP

On Using the Shapley Value for Anomaly Localization: A Statistical Investigation

Xubin Fang, Rick S. Blum, Franziska Freytag

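Affiliations * Electrical and Computer Engineering Department of Lehigh University

AI summary This paper studies the use of the Shapley value for anomaly localization in sensor data systems and examines its statistical properties. The authors show that using a single fixed term from the Shapley value calculation significantly reduces the complexity of anomaly localization while keeping the same probability of error. The result is proven for all independent observation cases; for dependent observations, further verification is still needed.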

Journal ref Applied AI Letters 7(2) (2026) e70024

Abstract

Recent publications have suggested using the Shapley value for anomaly localization for sensor data systems. Using a reasonable mathematical anomaly model for full control, experiments indicate that using a single fixed term in the Shapley value calculation achieves a lower complexity anomaly localization test, with the same probability of error, as a test using the Shapley value for all cases tested. A proof demonstrates these conclusions must be true for all independent observation cases. For dependent observation cases, no proof is available.
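
To make the complexity argument in the abstract above concrete, here is a minimal Python sketch of the exact Shapley value, computed by averaging each player's marginal contribution over every ordering. The sensor names, the deviation values, and the squared-sum `anomaly_score` are hypothetical choices for illustration, not the paper's anomaly model. The abstract does not say which single fixed term the authors use, so that test is not reproduced here.

```python
from itertools import permutations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = {p: 0.0 for p in players}
    for order in permutations(players):
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            # Marginal contribution of p when it joins the current coalition.
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    n_orders = factorial(len(players))
    return {p: t / n_orders for p, t in totals.items()}


# Hypothetical toy game: three sensors, one of which (s1) shows no deviation.
deviations = {"s1": 0.0, "s2": 3.0, "s3": 1.0}


def anomaly_score(coalition):
    """Illustrative coalition value: squared sum of the members' deviations."""
    return sum(deviations[s] for s in coalition) ** 2


phi = shapley_values(list(deviations), anomaly_score)
print(phi)
```

The loop visits all n! orderings of the n sensors, which is the cost a test based on a single term of the sum would avoid.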

2507.07776 2026-05-15 cs.CV

SCOOTER: A Human Evaluation Framework for Unrestricted Adversarial Examples

Dren Fazlija, Monty-Maximilian Zühlke, Johanna Schrader, Arkadij Orlov, Clara Stein, Iyiola E. Olatunji, Daniel Kudenko

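Affiliations * University of Luxembourg; CAIMed – Lower Saxony Center for AI & Causal Methods in Medicine

AI summary This paper presents SCOOTER, an open-source framework for evaluating how authentic unrestricted adversarial examples look. Unrestricted adversarial attacks bypass traditional defense strategies by, for example, changing an object's color, but their imperceptibility can only be verified through human evaluation. SCOOTER provides a standardized human evaluation procedure, a large-scale comparison study, and open-source tools and datasets. It shows that several current adversarial attacks perform poorly under human perception and highlights the gap between human perception and automated vision systems.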

Comments 42 pages, 16 figures, 11 tables, Under Review, Code: https://github.com/DrenFazlija/Scooter, Data: https://doi.org/10.5281/zenodo.15771501

Abstract

Unrestricted adversarial attacks aim to fool computer vision models without being constrained by $\ell_p$-norm bounds to remain imperceptible to humans, for example, by changing an object's color. This allows attackers to circumvent traditional, norm-bounded defense strategies such as adversarial training or certified defense strategies. However, due to their unrestricted nature, there are also no guarantees of norm-based imperceptibility, necessitating human evaluations to verify just how authentic these adversarial examples look. While some related work assesses this vital quality of adversarial attacks, none provide statistically significant insights. This issue necessitates a unified framework that supports and streamlines such an assessment for evaluating and comparing unrestricted attacks. To close this gap, we introduce SCOOTER - an open-source, statistically powered framework for evaluating unrestricted adversarial examples. Our contributions are: $(i)$ best-practice guidelines for crowd-study power, compensation, and Likert equivalence bounds to measure imperceptibility; $(ii)$ the first large-scale human vs. model comparison across 346 human participants showing that three color-space attacks and three diffusion-based attacks fail to produce imperceptible images. Furthermore, we found that GPT-4o can serve as a preliminary test for imperceptibility, but it only consistently detects adversarial examples for four out of six tested attacks; $(iii)$ open-source software tools, including a browser-based task template to collect annotations and analysis scripts in Python and R; $(iv)$ an ImageNet-derived benchmark dataset containing 3K real images, 7K adversarial examples, and over 34K human ratings. Our findings demonstrate that automated vision systems do not align with human perception, reinforcing the need for a ground-truth SCOOTER benchmark.

2507.04049 2026-05-15 cs.CV cs.RO

DIVER: Reinforced Diffusion Breaks Imitation Bottlenecks in End-to-End Autonomous Driving

Ziying Song, Lin Liu, Hongyu Pan, Bencheng Liao, Mingzhe Guo, Lei Yang, Yongchang Zhang, Shaoqing Xu, Caiyan Jia, Yadan Luo

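Affiliations * School of Artificial Intelligence (School of Software), Yanshan University; Beijing Key Laboratory of Traffic Data Mining and Embodied Intelligence, School of Computer Science and Technology, Beijing Jiaotong University; Horizon Robotics; School of Mechanical and Aerospace Engineering, Nanyang Technological University; University of Macau; School of Electrical Engineering and Computer Science, The University of Queensland

AI summary Most end-to-end autonomous driving methods rely on imitation learning from single expert demonstrations, which leads to conservative, homogeneous behaviors that adapt poorly to complex real-world scenarios. The paper proposes DIVER, a framework that combines reinforcement learning with diffusion-based generation to produce diverse and feasible driving trajectories. DIVER uses reinforcement learning to guide the diffusion process, with reward-based supervision enforcing safety and diversity, and introduces a new diversity metric. Experiments show that it significantly improves trajectory diversity on multiple benchmarks and effectively alleviates the mode collapse problem in imitation learning.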

Comments 17 pages, 10 figures

Abstract

Most end-to-end autonomous driving methods rely on imitation learning from single expert demonstrations, often leading to conservative and homogeneous behaviors that limit generalization in complex real-world scenarios. In this work, we propose DIVER, an end-to-end driving framework that integrates reinforcement learning with diffusion-based generation to produce diverse and feasible trajectories. At the core of DIVER lies a reinforced diffusion-based generation mechanism. First, the model conditions on map elements and surrounding agents to generate multiple reference trajectories from a single ground-truth trajectory, alleviating the limitations of imitation learning that arise from relying solely on single expert demonstrations. Second, reinforcement learning is employed to guide the diffusion process, where reward-based supervision enforces safety and diversity constraints on the generated trajectories, thereby enhancing their practicality and generalization capability. Furthermore, to address the limitations of L2-based open-loop metrics in capturing trajectory diversity, we propose a novel Diversity metric to evaluate the diversity of multi-mode predictions. Extensive experiments on the closed-loop NAVSIM and Bench2Drive benchmarks, as well as the open-loop nuScenes dataset, demonstrate that DIVER significantly improves trajectory diversity, effectively addressing the mode collapse problem inherent in imitation learning.

2506.16608 2026-05-15 cs.LG cs.AI

Distributions as Actions: A Unified Framework for Diverse Action Spaces

Jiamin He, A. Rupam Mahmood, Martha White

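Affiliations * Department of Computing Science, University of Alberta; Alberta Machine Intelligence Institute (Amii); CIFAR AI Chair, Amii

AI summary This paper proposes a new reinforcement learning framework that treats parameterized action distributions as actions, redefining the boundary between agent and environment. The reparameterization makes the action space continuous and applies to discrete, continuous, and hybrid actions. The paper also presents a generalized deterministic policy gradient estimator, DA-PG, and a practical TD3-based actor-critic algorithm, DA-AC, which performs well across a variety of control tasks.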

Comments Accepted to ICLR 2026 (camera-ready)

Abstract

We introduce a novel reinforcement learning (RL) framework that treats parameterized action distributions as actions, redefining the boundary between agent and environment. This reparameterization makes the new action space continuous, regardless of the original action type (discrete, continuous, hybrid, etc.). Under this new parameterization, we develop a generalized deterministic policy gradient estimator, Distributions-as-Actions Policy Gradient (DA-PG), which has lower variance than the gradient in the original action space. Although learning the critic over distribution parameters poses new challenges, we introduce Interpolated Critic Learning (ICL), a simple yet effective strategy to enhance learning, supported by insights from bandit settings. Building on TD3, a strong baseline for continuous control, we propose a practical actor-critic algorithm, Distributions-as-Actions Actor-Critic (DA-AC). Empirically, DA-AC achieves competitive performance in various settings across discrete, continuous, and hybrid control.
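
A minimal sketch of the reparameterization described above, assuming a bandit with three discrete arms. The agent's "action" is a probability vector, and sampling the concrete arm happens inside the environment. The class `DistributionActionEnv`, the reward table `arm_rewards`, and the logits are hypothetical names and values chosen for illustration; this is not the DA-PG or DA-AC algorithm.

```python
import math
import random


def softmax(logits):
    """Map unconstrained logits to a probability vector (a continuous action)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


class DistributionActionEnv:
    """Hypothetical wrapper: takes distribution parameters, samples the arm itself."""

    def __init__(self, arm_rewards, seed=0):
        self.arm_rewards = arm_rewards
        self.rng = random.Random(seed)

    def step(self, probs):
        # The sampling step now belongs to the environment, not the agent.
        arm = self.rng.choices(range(len(probs)), weights=probs, k=1)[0]
        return self.arm_rewards[arm]


def expected_value(probs, arm_rewards):
    """In this bandit, the value of a distribution is linear in its parameters."""
    return sum(p * r for p, r in zip(probs, arm_rewards))


arm_rewards = [0.0, 1.0, 0.5]
probs = softmax([0.0, 2.0, 0.0])       # the agent's deterministic output
env = DistributionActionEnv(arm_rewards)
reward = env.step(probs)               # one sampled outcome
q = expected_value(probs, arm_rewards)
print(probs, reward, q)
```

Because `probs` lives in a continuous space, a deterministic policy can output it directly even though the underlying arms are discrete.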

2506.08584 2026-05-15 cs.CL

CounselBench: A Large-Scale Expert Evaluation and Adversarial Benchmarking of Large Language Models in Mental Health Question Answering

Yahan Li, Jifan Yao, John Bosco S. Bunyi, Adam C. Frank, Angel Hsing-Chi Hwang, Ruishan Liu

Affiliations * Department of Computer Science, University of Southern California; Department of Electrical and Computer Engineering, University of Southern California; Suzanne Dworak-Peck School of Social Work, University of Southern California; Department of Psychiatry and the Behavioral Sciences, University of Southern California; Annenberg School for Communication, University of Southern California

AI summary This paper presents CounselBench, a large-scale benchmark built with 100 mental health professionals to evaluate large language models on mental health question answering. It has two components. CounselBench-EVAL contains 2,000 expert evaluations of answers from models such as GPT-4 and LLaMA 3 and from online human therapists, revealing problems in clinical relevance, personalization, and safety. CounselBench-Adv uses expert-authored adversarial questions to further expose model-specific failure modes. The work provides a clinically grounded framework for evaluating language models in mental health.

Abstract

Medical question answering (QA) benchmarks often focus on multiple-choice or fact-based tasks, leaving open-ended answers to real patient questions underexplored. This gap is particularly critical in mental health, where patient questions often mix symptoms, treatment concerns, and emotional needs, requiring answers that balance clinical caution with contextual sensitivity. We present CounselBench, a large-scale benchmark developed with 100 mental health professionals to evaluate and stress-test large language models (LLMs) in realistic help-seeking scenarios. The first component, CounselBench-EVAL, contains 2,000 expert evaluations of answers from GPT-4, LLaMA 3, Gemini, and online human therapists on patient questions from the public forum CounselChat. Each answer is rated across six clinically grounded dimensions, with span-level annotations and written rationales. Expert evaluations show that while LLMs achieve high scores on several dimensions, they also exhibit recurring issues, including unconstructive feedback, overgeneralization, and limited personalization or relevance. Responses were frequently flagged for safety risks, most notably unauthorized medical advice. Follow-up experiments show that LLM judges systematically overrate model responses and overlook safety concerns identified by human experts. To probe failure modes more directly, we construct CounselBench-Adv, an adversarial dataset of 120 expert-authored mental health questions designed to trigger specific model issues. Expert evaluation of 1,080 responses from nine LLMs reveals consistent, model-specific failure patterns. Together, CounselBench establishes a clinically grounded framework for benchmarking LLMs in mental health QA.

2506.04499 2026-05-15 cs.CV

FALO: Fast and Accurate LiDAR 3D Object Detection on Resource-Constrained Devices

Shizhong Han, Hsin-Pai Cheng, Hong Cai, Jihad Masri, Soyeb Nagori, Fatih Porikli

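Affiliations * Qualcomm AI Research

AI summary This paper proposes FALO, an efficient and accurate LiDAR 3D object detection method designed for resource-constrained edge devices. It arranges sparse voxels into a 1D sequence based on their coordinates and proximity and processes the sequence with the proposed ConvDotMix blocks, which provide sufficient feature mixing in both spatial and embedding dimensions along with higher-order nonlinear interaction. Experiments show that FALO maintains state-of-the-art detection accuracy while running 1.6 to 9.8x faster than the latest methods on mobile GPUs and NPUs, making it suitable for compact embedded devices.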

Abstract

Existing LiDAR 3D object detection methods predominantly rely on sparse convolutions and/or transformers, which can be challenging to run on resource-constrained edge devices, due to irregular memory access patterns and high computational costs. In this paper, we propose FALO, a hardware-friendly approach to LiDAR 3D detection, which offers both state-of-the-art (SOTA) detection accuracy and fast inference speed. More specifically, given the 3D point cloud and after voxelization, FALO first arranges sparse 3D voxels into a 1D sequence based on their coordinates and proximity. The sequence is then processed by our proposed ConvDotMix blocks, consisting of large-kernel convolutions, Hadamard products, and linear layers. ConvDotMix provides sufficient mixing capability in both spatial and embedding dimensions, and introduces higher-order nonlinear interaction among spatial features. Furthermore, when going through the ConvDotMix layers, we introduce implicit grouping, which balances the tensor dimensions for more efficient inference and takes into account the growing receptive field. All these operations are friendly to run on resource-constrained platforms, and the proposed FALO can be readily deployed on compact, embedded devices. Our extensive evaluation on LiDAR 3D detection benchmarks such as nuScenes and Waymo shows that FALO achieves competitive performance. Meanwhile, FALO is 1.6 to 9.8x faster than the latest SOTA on mobile Graphics Processing Unit (GPU) and mobile Neural Processing Unit (NPU).
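
The abstract does not specify how the voxels are ordered, so the sketch below swaps in a Z-order (Morton) curve as one common way to serialize 3D coordinates while keeping spatially nearby voxels close together in the sequence. The function names and the sample voxel coordinates are hypothetical, and this is not claimed to be FALO's actual ordering.

```python
def morton3d(x, y, z, bits=10):
    """Interleave the low `bits` bits of x, y, z into a single Z-order key."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (3 * i)
        code |= ((y >> i) & 1) << (3 * i + 1)
        code |= ((z >> i) & 1) << (3 * i + 2)
    return code


def serialize_voxels(voxels):
    """Sort occupied voxel coordinates into a 1D sequence by Morton key."""
    return sorted(voxels, key=lambda v: morton3d(*v))


# Hypothetical occupied voxels (integer grid coordinates after voxelization).
voxels = [(2, 0, 0), (0, 0, 1), (1, 1, 1), (0, 0, 0), (1, 0, 0)]
ordered = serialize_voxels(voxels)
print(ordered)
```

Once the voxels form a dense 1D sequence, ordinary 1D convolutions and linear layers can process them without the irregular memory access of sparse 3D kernels.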

2506.00158 2026-05-15 cs.LG

Privacy Amplification in Differentially Private Zeroth-Order Optimization with Hidden States

Eli Chien, Wei-Ning Chen, Pan Li

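Affiliations * Department of Electrical Engineering, National Taiwan University, Taiwan; NTU Artificial Intelligence Center of Research Excellence (NTU AI-CoRE), Taiwan; Microsoft, USA; Department of Electrical and Computer Engineering, Georgia Institute of Technology, USA

AI summary This paper studies fine-tuning large language models with zeroth-order optimization under differential privacy (DP) and memory constraints, and addresses the privacy amplification problem. Because the random update directions in zeroth-order methods produce anisotropic noise that standard privacy analysis frameworks cannot handle, the authors propose a hybrid noise mechanism and a coupling analysis, establishing the first convergent hidden-state DP bound and circumventing the global Lipschitz barrier. The result provides new theoretical support for designing more efficient DP zeroth-order optimization algorithms.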

Comments ICML 2026

Abstract

Zeroth-order optimization has emerged as a promising approach for fine-tuning large language models under differential privacy (DP) and memory constraints. While privacy amplification by iteration (PABI) provides convergent DP bounds for first-order methods, establishing similar guarantees for zeroth-order methods remains an open problem. First-order PABI analysis relies on the fact that gradients are perturbed with isotropic noise, allowing privacy bounds to be iteratively tracked via shifted Rényi divergence. In contrast, DP zeroth-order methods inject scalar noise along random update directions to maintain utility. This anisotropic update breaks standard shifted-divergence frameworks, as the global Lipschitz property no longer holds almost surely. We provide the first convergent hidden-state DP bound for zeroth-order optimization by proposing a hybrid noise mechanism and a novel coupling analysis. We bypass the purely shifted-divergence approach by constructing a coupled auxiliary process, which circumvents the global Lipschitz barrier and yields a convergent privacy bound. Furthermore, our results induce better DP zeroth-order algorithmic designs that were previously unknown to the literature.
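
The scalar-noise update that the abstract describes for DP zeroth-order methods can be sketched as follows, assuming a two-point (central difference) estimator, clipping to bound sensitivity, and Gaussian noise. All three are illustrative assumptions, as are the function name and parameter values. This is the basic scheme the abstract contrasts with first-order methods, not the paper's hybrid noise mechanism, and the settings below carry no privacy guarantee.

```python
import math
import random


def dp_zeroth_order_step(f, x, lr, mu, clip, sigma, rng):
    """One update: noisy scalar directional estimate applied along a random direction."""
    # Draw a random unit direction u.
    u = [rng.gauss(0.0, 1.0) for _ in x]
    norm = math.sqrt(sum(v * v for v in u))
    u = [v / norm for v in u]

    # Two function evaluations give a scalar estimate of the directional derivative.
    x_plus = [xi + mu * ui for xi, ui in zip(x, u)]
    x_minus = [xi - mu * ui for xi, ui in zip(x, u)]
    g = (f(x_plus) - f(x_minus)) / (2.0 * mu)

    g = max(-clip, min(clip, g))          # clip the scalar (assumed sensitivity bound)
    g_noisy = g + rng.gauss(0.0, sigma)   # scalar Gaussian noise, not isotropic noise

    # The whole update, noise included, lies along u: this is the anisotropy.
    x_new = [xi - lr * g_noisy * ui for xi, ui in zip(x, u)]
    return x_new, u


def sum_of_squares(x):
    """Hypothetical smooth objective used only for the demonstration."""
    return sum(v * v for v in x)


rng = random.Random(0)
x0 = [1.0, -2.0, 0.5]
x1, u = dp_zeroth_order_step(sum_of_squares, x0, lr=0.1, mu=1e-3,
                             clip=10.0, sigma=0.0, rng=rng)
print(x1)
```

With `sigma=0.0` the step is noiseless, which makes it easy to check that the displacement `x1 - x0` is parallel to `u`; a nonzero `sigma` perturbs only the step length, never the direction.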

2505.22394 2026-05-15 cs.CV

PacTure: Efficient PBR Texture Generation on Packed Views with Visual Autoregressive Models

Fan Fei, Jiajun Tang, Fei-Peng Tian, Boxin Shi, Ping Tan

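Affiliations * State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University; National Engineering Research Center of Visual Technology, School of Computer Science, Peking University; The Hong Kong University of Science and Technology; Light Illusions PKU-AI 2 Robotics Joint Lab of Embodied AI

AI summary This paper presents PacTure, a new framework for generating physically-based rendering (PBR) material textures for an untextured 3D mesh from a text description. To address the limited efficiency and texture consistency of existing methods, it introduces view packing, which increases the effective resolution of each view during multi-view generation while preserving the efficiency and compatibility of the generative model. Combined with fine-grained control and an autoregressive prediction framework, PacTure outperforms existing state-of-the-art methods in both generation quality and efficiency.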

Comments Accepted by Computational Visual Media Journal (CVMJ) in Feb. 2026. 19 pages, 7 figures

Abstract

We present PacTure, a novel framework for generating physically-based rendering (PBR) material textures for an untextured 3D mesh from a text description. Existing 2D generation-based texturing approaches either generate textures sequentially from different views, resulting in long inference times and globally inconsistent textures, or adopt multi-view generation with cross-view attention to enhance global consistency, which, however, limits the resolution for each view. In response to these weaknesses, we first introduce view packing, a novel technique that significantly increases the effective resolution for each view during multi-view generation, without imposing additional inference cost. Unlike UV mapping, it preserves the spatial proximity essential for image generation and maintains full compatibility with current 2D generative models. To further reduce the inference cost, we enable fine-grained control and multi-domain generation within the next-scale prediction autoregressive framework, creating an efficient multi-view PBR generation backbone. Extensive experiments show that PacTure outperforms state-of-the-art methods in both quality and efficiency.

2505.11809 2026-05-15 cs.CV

From Street View to Visual Network: Mapping the Visibility of Urban Landmarks with Vision-Language Models

Zicheng Fan, Kunihiko Fujiwara, Pengyuan Liu, Fan Zhang, Filip Biljecki

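Affiliations * Department of Architecture, National University of Singapore, Singapore; Research & Development Institute, Takenaka Corporation, Japan; Urban Analytics Subject Group, Urban Studies & Social Policy Division, University of Glasgow, United Kingdom; Institute of Remote Sensing GIS, Peking University, China; Department of Real Estate, National University of Singapore, Singapore

AI summary This paper proposes a vision-language model (VLM) based method that uses street view imagery to assess the visibility of urban landmarks in real streetscapes, as an alternative to traditional line-of-sight simulation based on geometric occlusion. By detecting target landmarks in direction- and zoom-controlled street view images, it builds a heterogeneous visibility graph that represents visual connections among landmarks and reveals how multiple landmarks become connected through shared visual corridors. Experiments show a detection accuracy of 87% on several internationally known landmarks, and a case study along the River Thames in London identifies key mediating locations, offering new perspectives for urban planning and heritage conservation.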

Abstract

Visibility analysis in urban planning has traditionally relied on line-of-sight (LoS) simulations, which capture geometric occlusion. However, these approaches depend on accurate 3D data that is often unavailable and may not adequately represent how visually distinctive urban landmarks are encountered in real streetscapes. We reformulate landmark visibility assessment as an urban visual search problem in image space by leveraging the widespread availability of street view imagery (SVI). Given a reference image of a target landmark, a Vision Language Model (VLM) is applied to detect the landmark in direction- and zoom-controlled SVI. A successful detection indicates machine-recognised landmark visibility at the corresponding viewpoint. Beyond isolated viewpoints, we construct a heterogeneous visibility graph to represent visual connectivity among landmarks, street-view locations, and the urban spaces that mediate them. This graph enables us to map where visual connections occur, how strong they are, and how multiple landmarks become jointly connected through shared visual corridors. Across six well-known landmark structures in global cities, the image-based method achieves an overall detection accuracy of 87%, with a precision score of 68% for landmark-visible locations. In a second case study along the River Thames in London, the visibility graph reveals multi-landmark connections and identifies key mediating locations, with bridges accounting for approximately 31% of all connections. The proposed method complements LoS-based visibility analysis and offers a practical alternative in data-constrained settings. It also showcases the possibility of revealing the prevalent connections of visual objects in the urban environment, opening new perspectives for urban planning and heritage conservation.
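
As a rough illustration of the visibility graph idea above, the following sketch builds a two-type graph (street-view locations and landmarks) from a list of positive detections and then finds locations that connect more than one landmark. The location ids, landmark names, and function names are hypothetical, and the graph is simplified: the paper's heterogeneous graph also includes the urban spaces that mediate the connections.

```python
from collections import defaultdict

# Hypothetical detections: (street-view location id, landmark) pairs in which
# the model reported the landmark as visible.
detections = [
    ("bridge_A", "Landmark X"),
    ("bridge_A", "Landmark Y"),
    ("street_B", "Landmark X"),
    ("quay_C", "Landmark Y"),
    ("quay_C", "Landmark Z"),
]


def build_visibility_graph(detections):
    """Return adjacency maps in both directions between locations and landmarks."""
    loc_to_landmarks = defaultdict(set)
    landmark_to_locs = defaultdict(set)
    for loc, landmark in detections:
        loc_to_landmarks[loc].add(landmark)
        landmark_to_locs[landmark].add(loc)
    return loc_to_landmarks, landmark_to_locs


def mediating_locations(loc_to_landmarks):
    """Locations from which two or more landmarks are visible."""
    return {loc: lms for loc, lms in loc_to_landmarks.items() if len(lms) >= 2}


def landmark_links(loc_to_landmarks):
    """Count shared viewpoints for every pair of jointly visible landmarks."""
    links = defaultdict(int)
    for lms in loc_to_landmarks.values():
        ordered = sorted(lms)
        for i in range(len(ordered)):
            for j in range(i + 1, len(ordered)):
                links[(ordered[i], ordered[j])] += 1
    return dict(links)


loc_to_landmarks, landmark_to_locs = build_visibility_graph(detections)
mediators = mediating_locations(loc_to_landmarks)
links = landmark_links(loc_to_landmarks)
print(mediators, links)
```

Grouping the mediating locations by type (for example, bridges versus streets) would give the kind of share-of-connections statistic the abstract reports.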

2505.03519 2026-05-15 cs.LG

Revisiting Model Inversion Evaluation: From Misleading Standards to Reliable Privacy Assessment

Sy-Tuyen Ho, Koh Jun Hao, Ngoc-Bao Nguyen, Alexander Binder, Ngai-Man Cheung

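Affiliations * Singapore University of Technology and Design; University of Maryland College Park

AI summary This paper re-examines how model inversion attacks are evaluated and shows that the dominant evaluation framework is misleading: many attacks counted as successful are in fact false positives that do not recover information about the target individual. The study shows that these false positives share properties with Type I adversarial examples and transfer strongly, so reported attack accuracy is overestimated. The authors therefore propose a new evaluation framework based on multimodal large language models (MLLMs) that reduces adversarial transferability and assesses privacy leakage more reliably.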

Comments Accepted to CVPR Findings 2026

Abstract

Model Inversion attacks aim to reconstruct information from private training data by exploiting access to a target model. Nearly all recent MI studies evaluate attack success using a standard framework that computes attack accuracy through a secondary evaluation model trained on the same private data and task design as the target model. In this paper, we present the first in-depth analysis of this dominant evaluation framework and reveal a fundamental issue: many reconstructions deemed successful under the existing framework are in fact false positives that do not capture the visual identity of the target individual. We first show that these MI false positives satisfy the same formal conditions as Type I adversarial examples. In our controlled experiments, we demonstrate extremely high false-positive transferability, an empirical signature characteristic of adversarial behavior, indicating that many MI false positives likely contain Type I adversarial features. This adversarial transferability significantly inflates reported attack accuracy and leads to an overstatement of privacy leakage in existing MI work. To address this issue, as our second contribution, we introduce a new evaluation framework based on MLLMs, whose general-purpose visual reasoning avoids the shared-task vulnerability and reduces Type-I adversarial transferability of the current evaluation framework. We propose systematic design principles for MLLM-based evaluation. Using this framework, we reassess 27 MI attack setups across diverse datasets, target models, and priors, and find consistently high false-positive rates under the conventional approach. Our results call for a reevaluation of progress in MI research and establish MLLM-based evaluation as a more reliable standard for assessing privacy risks in machine learning systems. Code/data/prompt are available at https://hosytuyen.github.io/projects/FMLLM

2505.01584 2026-05-15 cs.LG cs.AI

Silent Neuron Theory and Plasticity Preservation for Deep Reinforcement Learning in Adaptive Video Streaming

Zhiqiang He, Zhi Liu

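Affiliations * Department of Computer and Network Engineering, the University of Electro-Communications, Japan

AI summary This paper studies deep reinforcement learning for adaptive video streaming. To address the poor generalization caused by heterogeneous real-world network bandwidth, it proposes the Silent Neuron theory, which characterizes plasticity loss in neural networks more accurately. Based on this theory, the authors design the Reset Silent Neuron (ReSiN) method, which preserves network plasticity through strategic neuron resets guided by both forward and backward propagation states, improving adaptation to non-stationary network environments. Experiments show that ReSiN significantly outperforms existing methods on bitrate and QoE and is robust across different network conditions.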

Abstract

Adaptive video streaming optimizes Quality of Experience (QoE) metrics by selecting appropriate bitrates according to varying network bandwidth and user demands. In practice, however, real-world network bandwidth often exhibits heterogeneity relative to training environments. Current methods predominantly tackle this problem through learning-based approaches designed to improve generalization performance. However, our systematic investigation reveals a critical limitation: neural networks suffer from plasticity loss, significantly impeding their ability to adapt to heterogeneous network conditions. Through theoretical analysis of neural propagation mechanisms, we demonstrate that existing dormant neuron metrics inadequately characterize neural plasticity loss. To address this limitation, we have developed the Silent Neuron theory, which provides a more comprehensive framework for understanding plasticity degradation. Based on these theoretical insights, we propose the Reset Silent Neuron (ReSiN), which preserves neural plasticity through strategic neuron resets guided by both forward and backward propagation states. Moreover, we establish a tighter performance bound for ReSiN under non-stationary network conditions. In our implementation of an adaptive video streaming system, ReSiN has shown significant improvements over existing solutions, achieving up to 168% higher bitrate and 108% better quality of experience (QoE) while maintaining comparable smoothness. Furthermore, ReSiN consistently outperforms existing solutions in stationary environments, demonstrating its robust adaptability across different network conditions.

2504.18544 2026-05-15 cs.LG cs.AI cs.CY

Critical Challenges and Guidelines in Evaluating Synthetic Tabular Data: A Systematic Review

Nazia Nafis, Inaki Esnaola, Alvaro Martinez-Perez, Maria-Cruz Villa-Uriol, Venet Osmani

Affiliations * Healthy Lifespan Institute, School of Computer Science, University of Sheffield; School of Electrical and Electronic Engineering, University of Sheffield; Healthy Lifespan Institute, School of Sociological Studies, Politics and International Relations, University of Sheffield; Digital Environment Research Institute, Queen Mary University of London

AI summary This paper systematically reviews recent research on generating and evaluating synthetic tabular health data and identifies key challenges, including lack of consensus on evaluation methods, inconsistent application of metrics, and limited involvement of domain experts. In response, it proposes structured taxonomies and practical evaluation guidelines to promote more rigorous, standardized evaluation and the responsible development and use of synthetic health data.

Comments 32 pages

Abstract

Generating synthetic tabular health data is challenging, and evaluating their quality is equally, if not more, complex. This systematic review highlights the critical importance of rigorous evaluation of synthetic health data to ensure reliability, clinical relevance, and appropriate use. From an initial identification of 2067 relevant papers published in the last ten years, 134 studies were selected for detailed analysis. Our review identifies key challenges, including lack of consensus on evaluation methods, inconsistent application of evaluation metrics, limited involvement of domain experts, inadequate reporting of dataset characteristics, and limited reproducibility of results. In response, we provide a structured consolidation of synthetic data generation and evaluation methods into taxonomies, alongside practical guidelines to support more robust and standardised evaluation practices. These findings aim to support the responsible development and use of synthetic health data, aligned with emerging expectations around transparency, reproducibility, and governance, ultimately enabling the community to fully harness its transformative potential and accelerate innovation.

2504.09549 2026-05-15 cs.CV

SD-ReID: View-aware Stable Diffusion for Aerial-Ground Person Re-Identification

Yuhao Wang, Xiang Hu, Lixin Wang, Pingping Zhang, Huchuan Lu

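Affiliations * School of Future Technology, Dalian University of Technology; School of Information and Communication Engineering, Dalian University of Technology

AI summary This paper proposes SD-ReID, a generative framework for aerial-ground person re-identification (AG-ReID). It combines generative models with controllable conditions to learn the feature distributions of different views and extract more robust identity representations, and introduces a View-Refined Decoder to strengthen feature alignment. Experiments show strong performance on multiple AG-ReID datasets.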

Comments This work is accepted by IEEE TIP 2026. More modifications may be performed

Abstract

Aerial-Ground Person Re-IDentification (AG-ReID) aims to retrieve specific persons across cameras with different viewpoints. Previous works focus on designing discriminative models to maintain the identity consistency despite drastic changes in camera viewpoints. The core idea behind these methods is quite natural, but designing a view-robust model is a very challenging task. Moreover, they overlook the contribution of view-specific features in enhancing the model's ability to represent persons. To address these issues, we propose a novel generative framework named SD-ReID for AG-ReID, which leverages generative models to mimic the feature distribution of different views while extracting robust identity representations. More specifically, we first train a ViT-based model to extract person representations along with controllable conditions, including identity and view conditions. We then fine-tune the Stable Diffusion (SD) model to enhance person representations guided by these controllable conditions. Furthermore, we introduce the View-Refined Decoder (VRD) to bridge the gap between instance-level and global-level features. Finally, both person representations and all-view features are employed to retrieve target persons. Extensive experiments on five AG-ReID benchmarks (i.e., CARGO, AG-ReIDv1, AG-ReIDv2, LAGPeR and G2APS-ReID) demonstrate the effectiveness of our proposed method. The source code and pre-trained models are available at https://github.com/924973292/SD-ReID.