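arXivDaily: a daily academic digest synced with the full arXiv feed, with AI-generated summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and more.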

2512.08964 2026-05-18 cs.LG

T2T-LA: A Topology-to-Topology LLM Agent for Graph Learning with Neither Feature Access nor Task Knowledge

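Yongyu Wang

Affiliation: MTU

AI summary: This paper proposes T2T-LA, a topology-to-topology LLM agent that needs neither feature access nor task knowledge. It performs topology reasoning for graph learning by inferring the relationship between previously failed topologies and their scores.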

Abstract

Graph learning aims to convert data into graph representations, which are fundamental to many problems in machine learning for CAD, where circuits, layouts, designs, and optimization states are often modeled as graph-structured objects. Existing graph learning methods usually rely on carefully designed graph construction rules, extensive parameter tuning, and sophisticated mathematical theory; moreover, achieving good performance often requires task-specific graph construction tailored to the downstream objective. In this work, we study whether a large language model (LLM) can reason about graph structure and infer a useful topology without observing the feature matrix, without knowing the downstream task, and without relying on any carefully designed graph construction algorithm or parameter tuning process. To this end, we propose T2T-LA, a Topology-to-Topology LLM Agent that receives no input other than a set of previously failed topologies and the scores assigned to them by a private scorer. The agent is not told what task or algorithm produces the scores, how these topologies are generated, or what the scores mean. Since none of the observed topologies is satisfactory, T2T-LA cannot simply imitate a good example. Instead, it is forced to infer hidden relationships between graph connectivity patterns and the observed scores, a capability that is particularly relevant to CAD scenarios where useful design structures may be difficult to specify manually. Experimental results show that T2T-LA can generate, in one shot, a graph topology that enables the downstream algorithm to produce a sufficiently good solution, suggesting a new LLM-driven direction for topology reasoning and graph representation learning in ML-for-CAD workflows.


2512.08052 2026-05-18 cs.RO cs.LG

An Introduction to Deep Reinforcement and Imitation Learning

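Pedro Santana

Affiliation: ISCTE – University Institute of Lisbon

AI summary: This paper introduces deep reinforcement learning and deep imitation learning for embodied agents, covering Markov decision processes, core algorithms such as REINFORCE and PPO, and foundational imitation methods such as behavioral cloning, DAgger, and GAIL.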

Abstract

Embodied agents, such as robots and virtual characters, must continuously select actions to execute tasks effectively, solving complex sequential decision-making problems. Given the difficulty of designing such controllers manually, learning-based approaches have emerged as promising alternatives, most notably Deep Reinforcement Learning (DRL) and Deep Imitation Learning (DIL). DRL leverages reward signals to optimize behavior, while DIL uses expert demonstrations to guide learning. This document introduces DRL and DIL in the context of embodied agents, adopting a concise, depth-first approach to the literature. It is self-contained, presenting all necessary mathematical and machine learning concepts as they are needed. It is not intended as a survey of the field; rather, it focuses on a small set of foundational algorithms and techniques, prioritizing in-depth understanding over broad coverage. The material ranges from Markov Decision Processes to REINFORCE and Proximal Policy Optimization (PPO) for DRL, and from Behavioral Cloning to Dataset Aggregation (DAgger) and Generative Adversarial Imitation Learning (GAIL) for DIL.
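As a small illustration of one quantity behind REINFORCE, which the abstract lists among the covered algorithms: the policy gradient weights each action's log-probability by the discounted return that follows it. The sketch below only computes those returns for one episode; the function name and the example rewards are illustrative assumptions and are not taken from the paper.

```python
from typing import List


def discounted_returns(rewards: List[float], gamma: float) -> List[float]:
    """Compute G_t = r_t + gamma * G_{t+1} for every step of one episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step reuses the return of the step after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


if __name__ == "__main__":
    # Hypothetical three-step episode.
    print(discounted_returns([1.0, 0.0, 2.0], gamma=0.5))
```

In a full REINFORCE implementation these returns would multiply the gradient of the log-probability of each chosen action; that part depends on the policy model and is omitted here.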


2512.06655 2026-05-18 cs.LG cs.AI

Graph-Regularized Sparse Autoencoders for LLM Safety Steering

Jehyeok Yeon, Federico Cinus, Yifan Wu, Luca Luceri

Affiliations: ELLIS Institute Tübingen; Max Planck Institute for Intelligent Systems; Intesa Sanpaolo; University of Southern California

AI summary: This paper proposes graph-regularized sparse autoencoders, which smooth decoder vectors over a neuron co-activation graph and apply the resulting direction bank to improve safety steering, substantially raising harmful-request refusal rates on several benchmarks.

Abstract

Sparse autoencoders (SAEs) are increasingly used to extract activation directions for inference-time steering, but their standard sparsity objective treats latent features as independent. This prior can be poorly matched to high-level safety behaviors, where refusal and harmful compliance appear to depend on distributed structure in activation space. We introduce Graph-Regularized Sparse Autoencoders (GSAE), a dictionary-learning method that learns safety-steering directions by smoothing SAE decoder vectors over a neuron co-activation graph and applying the resulting direction bank through a two-gate runtime controller. Empirically, GSAE improves selective refusal across JailbreakBench, HarmBench, and XSTest, increasing harmful-request refusal while keeping benign-prompt refusals low. On Llama-3-8B, replacing the standard SAE with GSAE in an otherwise identical pipeline improves $Δ_s$ by $20.1$ points on JailbreakBench and $16.8$ points on HarmBench. GSAE outperforms activation-steering baselines and black-box guardrails, preserves benign-task performance, generalizes across Llama-3, Mistral, Qwen 2.5, and Phi-4, and remains strong under black-box and gray-box jailbreak attacks.
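The abstract does not give the exact regularizer, so the following is only a sketch of one standard way to "smooth vectors over a graph": a Laplacian-style penalty that sums the weighted squared distance between vectors joined by an edge. The function name, the edge-dictionary format, and the toy values are assumptions made for illustration, not the GSAE loss itself.

```python
from typing import Dict, List, Tuple


def graph_smoothness_penalty(
    decoder: List[List[float]],
    edges: Dict[Tuple[int, int], float],
) -> float:
    """Return the sum over edges (i, j) of w_ij * ||d_i - d_j||^2.

    decoder[i] is the vector attached to graph node i; edges maps a
    node pair to its (hypothetical) co-activation weight.
    """
    total = 0.0
    for (i, j), weight in edges.items():
        sq_dist = sum((a - b) ** 2 for a, b in zip(decoder[i], decoder[j]))
        total += weight * sq_dist
    return total


if __name__ == "__main__":
    vectors = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
    graph = {(0, 1): 1.0, (1, 2): 0.5}
    print(graph_smoothness_penalty(vectors, graph))
```

A penalty of this form is zero when connected nodes share the same vector and grows as strongly connected nodes diverge, which is the sense in which it encourages smoothness over the graph.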


2512.00417 2026-05-18 cs.CL

CryptoBench: A Dynamic Benchmark for Expert-Level Evaluation of LLM Agents in Cryptocurrency

Jiacheng Guo, Suozhi Huang, Zixin Yao, Yifan Zhang, Yifu Lu, Jiashuo Liu, Zihao Li, Nicholas Deng, Qixin Xiao, Jia Tian, Kanghong Zhan, Tianyi Li, Xiaochen Liu, Jason Ge, Chaoyang He, Kaixuan Huang, Lin Yang, Wenhao Huang, Mengdi Wang

Affiliations: Princeton University; Zenith Lab; University of California, Los Angeles; University of California, Berkeley; University of Michigan

AI summary: This paper presents CryptoBench, the first expert-curated dynamic benchmark for rigorously evaluating the real-world capabilities of LLM agents in the cryptocurrency domain. With 50 new questions per month and a fine-grained task taxonomy, it assesses data-retrieval and prediction abilities separately and reveals an imbalance between retrieval and prediction.

Abstract

This paper introduces CryptoBench, the first expert-curated, dynamic benchmark designed to rigorously evaluate the real-world capabilities of Large Language Model (LLM) agents in the uniquely demanding and fast-paced cryptocurrency domain. Unlike general-purpose agent benchmarks for search and prediction, professional crypto analysis presents specific challenges: extreme time-sensitivity, a highly adversarial information environment, and the critical need to synthesize data from diverse, specialized sources, such as on-chain intelligence platforms and real-time Decentralized Finance (DeFi) dashboards. CryptoBench thus serves as a much more challenging and valuable scenario for LLM agent assessment. To address these challenges, we constructed a live, dynamic benchmark featuring 50 questions per month, expertly designed by crypto-native professionals to mirror actual analyst workflows. These tasks are rigorously categorized within a four-quadrant system: Simple Retrieval, Complex Retrieval, Simple Prediction, and Complex Prediction. This granular categorization enables a precise assessment of an LLM agent's foundational data-gathering capabilities alongside its advanced analytical and forecasting skills. Our evaluation of ten LLMs, both directly and within an agentic framework, reveals a performance hierarchy and uncovers a failure mode. We observe a retrieval-prediction imbalance, where many leading models, despite being proficient at data retrieval, demonstrate a pronounced weakness in tasks requiring predictive analysis. This highlights a problematic tendency for agents to appear factually grounded while lacking the deeper analytical capabilities to synthesize information.


2511.18225 2026-05-18 cs.LG stat.ML stat.OT

Adaptive Conformal Prediction for Quantum Machine Learning

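Douglas Spencer, Samual Nicholls, Michele Caprio

Affiliations: Mathematical Institute, University of Oxford; Department of Computer Science, The University of Manchester

AI summary: This paper proposes Adaptive Quantum Conformal Prediction, which addresses the effect of time-varying noise in quantum processors on conformal guarantees by maintaining validity through repeated recalibration. Experiments on an IBM quantum processor show that it reaches the target coverage and behaves more stably.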

Comments Accepted at TMLR 05/2026. 27 pages, 5 figures

Journal ref Transactions on Machine Learning Research, May 2026, ISSN 2835-8856

Abstract

Quantum machine learning seeks to leverage quantum computers to improve upon classical machine learning algorithms. Currently, robust uncertainty quantification methods remain underdeveloped in the quantum domain, despite the critical need for reliable and trustworthy predictions. Recent work has introduced quantum conformal prediction, a framework that produces prediction sets that are guaranteed to contain the true outcome with a user-specified probability. In this work, we formalise how the time-varying noise inherent in quantum processors can undermine conformal guarantees, even when calibration and test data are exchangeable. To address this challenge, we draw on Adaptive Conformal Inference, a method which maintains validity over time via repeated recalibration. We introduce Adaptive Quantum Conformal Prediction (AQCP), an algorithm which provides asymptotic average coverage guarantees under arbitrary hardware noise conditions. Empirical studies on an IBM quantum processor demonstrate that AQCP achieves the target coverage level and exhibits greater stability than quantum conformal prediction.
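The recalibration idea behind Adaptive Conformal Inference, which the abstract builds on, can be sketched with its standard online update: after each prediction, the working miscoverage level moves by a step size times the gap between the target level and the observed error. This is a minimal classical sketch of that update only; the function names, step size, and coverage sequence are illustrative assumptions, and nothing here reflects the quantum-specific parts of AQCP.

```python
from typing import Iterable, List


def aci_update(alpha_t: float, target_alpha: float, covered: bool, gamma: float) -> float:
    """One Adaptive Conformal Inference step.

    alpha_{t+1} = alpha_t + gamma * (target_alpha - err_t),
    where err_t is 1 if the prediction set missed the true label, else 0.
    """
    err = 0.0 if covered else 1.0
    return alpha_t + gamma * (target_alpha - err)


def run_aci(coverage_flags: Iterable[bool], target_alpha: float, gamma: float) -> List[float]:
    """Track the working miscoverage level over a stream of outcomes."""
    alphas = [target_alpha]
    for covered in coverage_flags:
        alphas.append(aci_update(alphas[-1], target_alpha, covered, gamma))
    return alphas


if __name__ == "__main__":
    # Hypothetical stream: one covered prediction, then one miss.
    print(run_aci([True, False], target_alpha=0.1, gamma=0.05))
```

A miss lowers the working level, so the next prediction set is built more conservatively; a run of covered predictions slowly raises it again. That feedback is what keeps long-run average coverage near the target even when the noise drifts.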


2511.13108 2026-05-18 cs.CV

DGS-Net: Distillation-Guided Gradient Surgery for CLIP Fine-Tuning in AI-Generated Image Detection

Jiazhen Yan, Ziqiang Li, Fan Wang, Boyu Wang, Ziwen He, Zhangjie Fu

Affiliations: School of Computer Science, Nanjing University of Information Science; University of Macau

AI summary: This paper proposes DGS-Net, which uses a gradient-space decomposition to separate harmful and beneficial descent directions, improving CLIP fine-tuning for AI-generated image detection. Experiments show better detection performance and generalization than existing methods.

Comments Accepted by ICML 2026 Spotlight

Abstract

The rapid progress of generative models such as GANs and diffusion models has led to the widespread proliferation of AI-generated images, raising concerns about misinformation, privacy violations, and trust erosion in digital media. Although large-scale multimodal models like CLIP offer strong transferable representations for detecting synthetic content, fine-tuning them often induces catastrophic forgetting, which degrades pre-trained priors and limits cross-domain generalization. To address this issue, we propose the Distillation-guided Gradient Surgery Network (DGS-Net), a novel framework that preserves transferable pre-trained priors while suppressing task-irrelevant components. Specifically, we introduce a gradient-space decomposition that separates harmful and beneficial descent directions during optimization. By projecting task gradients onto the orthogonal complement of harmful directions and aligning with beneficial ones distilled from a frozen CLIP encoder, DGS-Net achieves unified optimization of prior preservation and irrelevant suppression. Extensive experiments on 50 generative models demonstrate that our method outperforms state-of-the-art approaches by an average margin of 6.6, achieving superior detection performance and generalization across diverse generation techniques.
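The projection step mentioned in the abstract, "projecting task gradients onto the orthogonal complement of harmful directions", can be illustrated for the simplest case of a single harmful direction. This is a sketch of the linear algebra only: how DGS-Net identifies harmful and beneficial directions, and how it aligns with the distilled ones, is not described in the abstract and is not shown. The function name and example vectors are assumptions.

```python
from typing import List


def project_out(grad: List[float], harmful: List[float]) -> List[float]:
    """Remove from grad its component along a single harmful direction.

    Computes g - (<g, h> / <h, h>) * h, which lies in the orthogonal
    complement of h. A zero harmful vector leaves grad unchanged.
    """
    norm_sq = sum(h * h for h in harmful)
    if norm_sq == 0.0:
        return list(grad)
    coeff = sum(g * h for g, h in zip(grad, harmful)) / norm_sq
    return [g - coeff * h for g, h in zip(grad, harmful)]


if __name__ == "__main__":
    task_grad = [3.0, 4.0]
    harmful_dir = [1.0, 0.0]
    print(project_out(task_grad, harmful_dir))
```

With several harmful directions, the same idea applies after orthonormalizing them and subtracting each component in turn.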


2511.09378 2026-05-18 cs.AI cs.LG

Frontier Large Language Models Rival State-of-the-Art Planners

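Augusto B. Corrêa, André G. Pereira, Jendrik Seipp

Affiliations: University of Oxford; Federal University of Rio Grande do Sul; Linköping University

AI summary: The study shows that frontier large language models now rival state-of-the-art planners on planning tasks. Gemini 3.1 Pro outperforms the strongest baseline on standard tasks, GPT-5 performs comparably to the baselines, and the models remain competitive on purely symbolic planning, indicating a clear upward trend in LLM planning ability.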

Abstract

A series of influential studies established that large language models cannot reliably solve even simple planning tasks. We show that the latest generation of frontier models overturns this conclusion. We evaluate three families of frontier LLMs on a challenging set of planning tasks based on the most recent International Planning Competition following rigorous evaluation guidelines: solutions are verified with a validation tool, tasks are freshly created to avoid data contamination, and performance is compared against state-of-the-art classical planners. On standard task descriptions, Gemini 3.1 Pro outperforms the strongest planner baseline (245 vs. 234 solved tasks out of 360), while GPT-5 achieves comparable performance to the baselines. When all semantic information is obfuscated from the descriptions to test for pure symbolic planning, performance degrades but Gemini 3.1 Pro remains competitive with the strongest baselines. A longitudinal comparison across model generations -- from GPT-3.5, which solves zero tasks, to GPT-5 -- reveals a striking upward trajectory. Frontier LLMs might finally be able to plan; the question now is how far this capability will extend.


2510.25404 2026-05-18 cs.LG cs.AI

SemanticOpt: Towards LLM-Based Semantic Black-Box Optimization

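Jamison Meindl, Yunsheng Tian, Tony Cui, Veronika Thost, Zhang-Wei Hong, Jie Chen, Wojciech Matusik, Mina Konaković Luković

Affiliations: MIT; MIT-IBM Watson AI Lab

AI summary: SemanticOpt uses LLMs to exploit semantic information. It fine-tunes them on structured Bayesian optimization trajectories augmented with natural-language context, improving black-box optimization and outperforming classical and existing LLM-based methods on several real-world problems.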

Abstract

Optimizing an experimental system can be extremely challenging when each experiment is expensive, time-consuming, or difficult to perform. Existing optimizers for expensive black-box problems, such as Bayesian optimization, are typically limited to numerical or categorical observations. They do not make use of broader domain knowledge, such as expert heuristics, relevant scientific papers, or similar previous experiments. Large language models (LLMs) can interpret this semantic information; however, even state-of-the-art LLMs struggle to reliably solve black-box optimization problems. We introduce SemanticOpt, a framework for semantic black-box optimization that equips LLMs with optimization capabilities by fine-tuning them on structured Bayesian optimization trajectories augmented with natural-language context. SemanticOpt jointly uses numerical and semantic evidence when proposing new experiments, while producing interpretable predictions aligned with Bayesian surrogate models. We construct a range of real-world optimization problems paired with semantic information to create a diverse benchmark for evaluating semantic black-box optimization. Across these domains, SemanticOpt outperforms both classical optimizers and existing LLM-based approaches on average when given relevant semantic information.


2510.24457 2026-05-18 cs.RO cs.SY eess.SY

Flatness-based trajectory planning for 3D overhead cranes with friction compensation and collision avoidance

Jorge Vicente-Martinez, Edgar Ramirez-Laboreo

Affiliations: Departamento de Informatica e Ingenieria de Sistemas (DIIS); Instituto de Investigacion en Ingenieria de Aragon (I3A); Universidad de Zaragoza

AI summary: This paper proposes a trajectory generation method for 3D overhead cranes based on differential flatness, which directly incorporates complex constraints such as nonlinear friction and collision avoidance to achieve safe and efficient motion.

Comments 6 pages, 8 figures. Final version, after peer review and acceptance, submitted to the 23rd IFAC World Congress

2510.23634 2026-05-18 cs.LG cs.AI

Monotone and Separable Set Functions: Characterizations and Neural Models

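Soutrik Sarangi, Yonatan Sverdlov, Nadav Dym, Abir De

Affiliations: IIT Bombay; Technion

AI summary: This paper studies the design of set-to-vector functions that preserve the natural partial order on sets, proposes a model with a weakly MAS property, and demonstrates its advantages on set containment tasks.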

Abstract

Motivated by applications for set containment problems, we consider the following fundamental problem: can we design set-to-vector functions so that the natural partial order on sets is preserved, namely $S \subseteq T$ if and only if $F(S) \leq F(T)$? We call functions satisfying this property Monotone and Separating (MAS) set functions. We establish lower and upper bounds for the vector dimension necessary to obtain MAS functions, as a function of the cardinality of the multisets and the underlying ground set. In the important case of an infinite ground set, we show that MAS functions do not exist, but provide a model which provably enjoys a relaxed MAS property we name "weakly MAS" and is stable in the sense of Hölder continuity. We also show that MAS functions can be used to construct universal models that are monotone by construction and can approximate all monotone set functions. Experimentally, we consider a variety of set containment tasks. The experiments show the benefit of using our model, in comparison with standard set models which do not incorporate set containment as an inductive bias. Our code is available at https://github.com/structlearning/MASNET.
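To make the MAS property concrete, here is a minimal sketch for the easy case of plain sets over a small finite ground set: a multi-hot indicator vector, compared elementwise, satisfies "S is a subset of T if and only if F(S) <= F(T)". This is a textbook construction whose dimension equals the size of the ground set, chosen only for illustration; it is not the paper's model, and all names below are hypothetical.

```python
from itertools import combinations
from typing import FrozenSet, List, Sequence, Tuple


def multi_hot(subset: FrozenSet[str], ground_set: Sequence[str]) -> Tuple[int, ...]:
    """Map a subset to its 0/1 indicator vector over the ground set."""
    return tuple(1 if item in subset else 0 for item in ground_set)


def leq(u: Sequence[int], v: Sequence[int]) -> bool:
    """Elementwise partial order on vectors."""
    return all(a <= b for a, b in zip(u, v))


def all_subsets(ground_set: Sequence[str]) -> List[FrozenSet[str]]:
    """Enumerate every subset of a small ground set."""
    items = list(ground_set)
    return [frozenset(c) for r in range(len(items) + 1) for c in combinations(items, r)]


def is_mas_on(ground_set: Sequence[str]) -> bool:
    """Check S <= T (subset) iff F(S) <= F(T) (elementwise) for all pairs."""
    subsets = all_subsets(ground_set)
    return all(
        (s <= t) == leq(multi_hot(s, ground_set), multi_hot(t, ground_set))
        for s in subsets
        for t in subsets
    )


if __name__ == "__main__":
    print(is_mas_on(("a", "b", "c")))
```

The interesting questions in the abstract start where this construction stops: how small the vector dimension can be, and what happens for multisets or an infinite ground set.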


2510.13842 2026-05-18 cs.CL cs.AI cs.CR

ADMIT: Few-shot Knowledge Poisoning Attacks on RAG-based Fact Checking

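Yutao Wu, Xiao Liu, Yinghui Li, Yifeng Gao, Yifan Ding, Jiale Ding, Xiang Zheng, Xingjun Ma

Affiliations: Deakin University; Fudan University; City University of Hong Kong

AI summary: ADMIT proposes a few-shot attack that requires no access to the target model and flips fact-checking decisions by injecting real evidence. Experiments show a success rate of 86% across a range of systems, exposing a major vulnerability in RAG-based fact-checking systems.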

2510.08398 2026-05-18 cs.CV

VideoVerse: Does Your T2V Generator Have World Model Capability to Synthesize Videos?

Zeqing Wang, Xinyu Wei, Bairui Li, Zhen Guo, Jinrui Zhang, Hongyang Wei, Keze Wang, Lei Zhang

Affiliations: Sun Yat-sen University; Hong Kong Polytechnic University; Tsinghua University; OPPO Research Institute

AI summary: VideoVerse evaluates how well T2V models understand complex temporal causality and world knowledge, revealing the gap between existing models and the desired world-modeling capability.

Comments 26 Pages, 10 Figures, 14 Tables

Abstract

The recent rapid advancement of Text-to-Video (T2V) generation technologies is endowing trained models with stronger world-model abilities, making existing benchmarks increasingly insufficient for evaluating state-of-the-art T2V models. First, current evaluation dimensions, such as per-frame aesthetic quality and temporal consistency, are no longer able to differentiate state-of-the-art T2V models. Second, event-level temporal causality, an essential property that differentiates videos from other modalities, remains largely unexplored. Third, existing benchmarks lack a systematic assessment of world knowledge, which is an essential capability for building world models. To address these issues, we introduce VideoVerse, a comprehensive benchmark that evaluates whether current T2V models can understand complex temporal causality and world knowledge when synthesizing videos. We collect representative videos across diverse domains and extract their event-level descriptions with inherent temporal causality, which are then rewritten into text-to-video prompts by independent annotators. For each prompt, we design ten evaluation dimensions covering dynamic and static properties, resulting in 300 prompts, 815 events, and 793 evaluation questions. We then develop a QA-based evaluation pipeline aligned with human preferences, using modern vision-language models to systematically benchmark leading open- and closed-source T2V systems, revealing the current gap between T2V models and the desired world modeling abilities.


2510.03161 2026-05-18 cs.CV cs.AI

UniShield: An Adaptive Multi-Agent Framework for Unified Forgery Image Detection and Localization

Qing Huang, Zhipei Xu, Xuanyu Zhang, Xiangyu Yu, Jian Zhang

Affiliations: School of Electronic and Computer Engineering, Peking University; School of Future Technology, South China University of Technology; School of Electronic and Information Engineering, South China University of Technology; Guangdong Provincial Key Laboratory of Ultra High Definition Immersive Media Technology, Shenzhen Graduate School, Peking University

AI summary: UniShield uses a multi-agent framework to detect and localize forged images across domains, improving the adaptability and practicality of detection.

2510.02453 2026-05-18 cs.LG cs.AI cs.CL

How to Train Your Advisor: Steering Black-Box LLMs with Advisor Models

Parth Asawa, Alan Zhu, Abigail O'Neill, Matei Zaharia, Alexandros G. Dimakis, Joseph E. Gonzalez

Affiliations: University of California, Berkeley; Bespoke Labs

AI summary: This paper proposes Advisor Models, which train small open-weight models to generate dynamic, per-instance advice that improves black-box frontier models. Experiments show sizable gains on several tasks, along with good transferability and robustness.

Comments International Conference on Machine Learning (ICML) 2026

Abstract

Frontier language models are deployed as black-box services, where model weights cannot be modified and customization is limited to prompting. We introduce Advisor Models, a method to train small open-weight models to generate dynamic, per-instance natural language advice that improves the capabilities of black-box frontier models. Advisor Models improve GPT-5.2's performance on RuleArena (Taxes) by 27.4%, reduce Gemini 3 Pro's steps taken in SWE agent tasks by 24.6%, and outperform static prompt optimizers in personalizing GPT-5 to user preferences (85-100% vs. 40-60%). We also find that advisors are transferable: an advisor trained with a low-cost student model still transfers improvements to a frontier model. Moreover, Advisor Models are robust: we observe no degradation on benchmarks other than those the pipeline is trained on. Our method shows how to perform parametric optimization for black-box frontier models in a practical and cost-effective way.


2510.02278 2026-05-18 cs.LG

Metropolis-Scale Road Network Datasets for Fine-Grained Urban Traffic Modeling

Fedor Velikonivtsev, Oleg Platonov, Ekaterina Alimaskina, Gleb Bazhenov, Liudmila Prokhorenkova

Affiliations: HSE University; Yandex Research; BRAIn Lab

AI summary: This paper presents fine-grained road network datasets for two major cities, aimed at the challenges of large-scale traffic forecasting, with high-resolution time series and rich static road attributes.

Abstract

Modeling traffic dynamics is a critical challenge for urban computing, with applications from real-time traffic management to infrastructure planning. However, progress in this area is fundamentally constrained by a lack of large-scale public datasets that capture the subtle properties of real city road networks. Existing benchmarks are often limited by their small scale, reliance on sparse highway traffic sensors, absence of true road connectivity information, and lack of information about road properties. To address this issue, we introduce datasets representing fine-grained road networks of two major cities, which are unique in their scale (up to 100,000 road segments), use of real road connectivity, presence of time series measurements for both traffic speed and volume at a 5-minute resolution, and inclusion of rich static road attributes. These datasets enable in-depth analysis of spatiotemporal traffic patterns and can serve as benchmarks for various ML applications. As a practical demonstration of the utility of our datasets and the challenges they present, we use them for the task of traffic forecasting. The size of the real-world road networks in our datasets reveals significant scalability issues in current traffic forecasting models. To address them, we propose a simple and efficient baseline that not only scales to large road graphs but also achieves forecasting performance competitive with other established spatiotemporal models. We hope that the proposed datasets will serve as a foundational resource for a broad range of research in traffic modeling, urban computing, and smart city development.


2509.24550 2026-05-18 cs.LG cs.SD

Training-Free Multimodal Guidance for Video to Audio Generation

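Eleonora Grassucci, Giuliano Galadini, Giordano Cicchetti, Aurelio Uncini, Fabio Antonacci, Danilo Comminiello

Affiliation: Dept. of Information Engineering, Electronics, and Telecomm., Sapienza University of Rome, Italy

AI summary: This paper proposes a training-free multimodal guidance mechanism for video-to-audio diffusion generation that uses the volume spanned by the modality embeddings to enforce unified alignment across video, audio, and text, improving generation quality and multimodal alignment.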

Journal ref ICASSP 2026 - 2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)


DOI: 10.1109/ICASSP55912.2026.11462881

Abstract

Video-to-audio (V2A) generation aims to synthesize realistic and semantically aligned audio from silent videos, with potential applications in video editing, Foley sound design, and assistive multimedia. Despite excellent results, existing approaches either require costly joint training on large-scale paired datasets or rely on pairwise similarities that may fail to capture global multimodal coherence. In this work, we propose a novel training-free multimodal guidance mechanism for V2A diffusion that leverages the volume spanned by the modality embeddings to enforce unified alignment across video, audio, and text. The proposed multimodal diffusion guidance (MDG) provides a lightweight, plug-and-play control signal that can be applied on top of any pretrained audio diffusion model without retraining. Experiments on VGGSound and AudioCaps demonstrate that our MDG consistently improves perceptual quality and multimodal alignment compared to baselines, demonstrating the effectiveness of joint multimodal guidance for V2A.
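One common way to measure "the volume spanned by" a few embedding vectors is the square root of the determinant of their Gram matrix. The sketch below computes that quantity for exactly three vectors, standing in for video, audio, and text embeddings. It assumes this Gram-determinant definition, which the abstract does not spell out, and it does not show how MDG turns the volume into a guidance signal; all names and values are illustrative.

```python
import math
from typing import List, Sequence


def gram_volume(vectors: Sequence[Sequence[float]]) -> float:
    """Volume of the parallelotope spanned by three vectors.

    Computed as sqrt(det(G)) with G[i][j] = <v_i, v_j>. The vectors may
    live in any dimension; only their pairwise inner products matter.
    """
    if len(vectors) != 3:
        raise ValueError("this sketch handles exactly three vectors")
    g: List[List[float]] = [
        [sum(a * b for a, b in zip(u, v)) for v in vectors] for u in vectors
    ]
    det = (
        g[0][0] * (g[1][1] * g[2][2] - g[1][2] * g[2][1])
        - g[0][1] * (g[1][0] * g[2][2] - g[1][2] * g[2][0])
        + g[0][2] * (g[1][0] * g[2][1] - g[1][1] * g[2][0])
    )
    # Guard against tiny negative values caused by floating-point rounding.
    return math.sqrt(max(det, 0.0))


if __name__ == "__main__":
    orthogonal = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    aligned = [[1.0, 0.0, 0.0]] * 3
    print(gram_volume(orthogonal), gram_volume(aligned))
```

For unit-norm embeddings the volume shrinks toward zero as the three vectors line up, so a small volume can be read as a joint alignment measure across all modalities at once instead of a set of pairwise similarities.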


2509.23352 2026-05-18 cs.CV cs.AI

Dynamic-TreeRPO: Breaking the Independent Trajectory Bottleneck with Structured Sampling

Xiaolong Fu, Lichen Ma, Zipeng Guo, ShiPing Dong, Lan Yang, Tan Lit Sin, Gaojing Zhou, Yu He, Jingling Fu, Shizhe Zhou, Junshi Huang, Jason Li

Affiliations: Sun Yat-sen University; Tsinghua University; Beijing University of Chemical Technology

AI summary: This paper proposes Dynamic-TreeRPO, which improves the quality and efficiency of text-to-image generation through a tree-structured sampling strategy with dynamic noise intensities, combined with the LayerTuning-RL method, and performs strongly on several benchmarks.

Comments Fig.3 updated

Abstract

The integration of Reinforcement Learning (RL) into flow matching models for text-to-image (T2I) generation has driven substantial advances in generation quality. However, these gains often come at the cost of exhaustive exploration and inefficient sampling strategies due to slight variation in the sampling group. Building on this insight, we propose Dynamic-TreeRPO, which implements the sliding-window sampling strategy as a tree-structured search with dynamic noise intensities along depth. We perform GRPO-guided optimization and constrained Stochastic Differential Equation (SDE) sampling within this tree structure. By sharing prefix paths of the tree, our design effectively amortizes the computational overhead of trajectory search. With well-designed noise intensities for each tree layer, Dynamic-TreeRPO can enhance the variation of exploration without any extra computational cost. Furthermore, we seamlessly integrate Supervised Fine-Tuning (SFT) and RL paradigm within Dynamic-TreeRPO to construct our proposed LayerTuning-RL, reformulating the loss function of SFT as a dynamically weighted Progress Reward Model (PRM) rather than a separate pretraining method. By associating this weighted PRM with dynamic-adaptive clipping bounds, the disruption of exploration process in Dynamic-TreeRPO is avoided. Benefiting from the tree-structured sampling and the LayerTuning-RL paradigm, our model dynamically explores a diverse search space along effective directions. Compared to existing baselines, our approach demonstrates significant superiority in terms of semantic consistency, visual fidelity, and human preference alignment on established benchmarks, including HPS-v2.1, PickScore, and ImageReward. In particular, our model outperforms SoTA by $4.9\%$, $5.91\%$, and $8.66\%$ on those benchmarks, respectively, while improving the training efficiency by nearly $50\%$.

2509.22739 2026-05-18 cs.CL cs.AI cs.LG stat.ML

Painless Activation Steering: An Automated, Lightweight Approach for Post-Training Large Language Models

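Sasha Cui, Zhongren Chen

Affiliations * Yale University

AI summary: This paper proposes Painless Activation Steering, an automated method that uses labeled data to improve model performance without human intervention. It performs well on behavior tasks but has limited effect on intelligence-oriented tasks.

Details

AI abstract

Language models are usually post-trained through weight-based or prompt-based steering, but the former is slow and expensive, while the latter is imprecise and requires manual trial and error. Activation steering (AS) offers a cheaper, faster, and more controllable alternative, yet existing techniques need hand-crafted prompt pairs or extensive feature annotation, which makes them less convenient than methods such as RL and SFT. This paper introduces Painless Activation Steering (PAS), a fully automated approach that applies AS to any labeled dataset with no prompt construction, feature annotation, or human intervention. Evaluated on three open-weight models and 18 tasks, PAS is reliable on behavior tasks but has limited effect on intelligence-oriented tasks. The introspective variant (iPAS) yields gains of 10.1%, 5.2%, and 34.8% on bias, morality, and alignment tasks, respectively. PAS also adds gains on top of in-context learning (ICL) and SFT. It builds a fast, lightweight activation vector that is cheap to train, store, and activate. The results give clear guidance on where AS applies and show its potential as a practical, automated post-training method.

English abstract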

Language models (LMs) are typically post-trained for desired capabilities and behaviors via weight-based or prompt-based steering, but the former is time-consuming and expensive, and the latter is not precisely controllable and often requires manual trial-and-error. While activation steering (AS) promises a cheap, fast, and controllable alternative to the two existing post-training methods, current AS techniques require hand-crafted prompt pairs or labor-intensive feature annotation, making them more inconvenient than the plug-and-play methods such as Reinforcement Learning (RL) and Supervised Fine-Tuning (SFT). We introduce Painless Activation Steering (PAS), a family of fully automated methods that make AS readily usable with any given labeled dataset, with no need for prompt construction, feature labeling, or human intervention. We evaluate PAS on three open-weight models (Llama3.1-8B-Instruct, DeepSeek-R1-Distill-8B, and Nous-Hermes-2) and 18 tasks; we find that PAS reliably improves performance for behavior tasks, but not for intelligence-oriented tasks. The introspective variant (iPAS) delivers the strongest causal steering effects (10.1% on Bias, 5.2% on Morality, and 34.8% on Alignment). We also show PAS delivers additional gains on top of In-Context Learning (ICL) and SFT. PAS constructs a fast, lightweight activation vector that can be cheaply trained, easily stored, and activated at will. Our results provide a characterization of where AS helps, where it fails, and how to deploy it as a practical, automated LM post-training option.
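
The abstract does not say how PAS builds its activation vector, so the following is only a generic sketch of activation steering using the common difference-of-means construction as a stand-in. The function names, the toy 2-dimensional activations, and the scaling factor `alpha` are all hypothetical.

```python
from typing import List, Sequence


def mean_vector(rows: Sequence[Sequence[float]]) -> List[float]:
    """Column-wise mean of a non-empty list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def steering_vector(
    pos: Sequence[Sequence[float]], neg: Sequence[Sequence[float]]
) -> List[float]:
    """Difference of means between activations of desired and undesired examples."""
    mu_pos, mu_neg = mean_vector(pos), mean_vector(neg)
    return [p - q for p, q in zip(mu_pos, mu_neg)]


def steer(hidden: Sequence[float], vector: Sequence[float], alpha: float) -> List[float]:
    """Shift a hidden state along the steering vector by a factor alpha."""
    return [h + alpha * v for h, v in zip(hidden, vector)]


if __name__ == "__main__":
    # Toy activations (assumed); a real model would supply layer outputs here.
    pos = [[1.0, 2.0], [3.0, 4.0]]
    neg = [[0.0, 0.0], [2.0, 2.0]]
    vec = steering_vector(pos, neg)
    print(vec, steer([0.0, 0.0], vec, alpha=0.5))  # [1.0, 2.0] [0.5, 1.0]
```

A single stored vector of this kind is what makes steering cheap to keep and to switch on or off at inference time.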

2509.22151 2026-05-18 cs.CV cs.CL

MultiMat: Multimodal Program Synthesis for Procedural Materials using Large Multimodal Models

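Jonas Belouadi, Tamy Boubekeur, Adrien Kaiser

Affiliations * University of Mannheim; Adobe Research

AI summary: MultiMat uses large multimodal models for multimodal program synthesis, generating procedural material graphs more efficiently and with higher visual quality than text-only baselines.

Comments Accepted at ICLR 2026 (poster)

Details

AI abstract

Material node graphs are programs that generate the 2D channels of procedural materials, including geometry channels such as roughness and displacement maps and reflectance channels such as albedo and conductivity maps. They are essential in computer graphics for representing the appearance of virtual 3D objects parametrically and at arbitrary resolution. Their directed acyclic graph structure and intermediate states in particular give interactive appearance modeling a modular, interpretable workflow. Creating such graphs nevertheless remains difficult and usually requires professional training. Recent neural program synthesis methods try to simplify the process, but they represent graphs only as textual programs and miss the inherently visual-spatial nature of node graphs that makes them accessible to humans. To close this gap, we present MultiMat, a multimodal program synthesis framework that uses large multimodal models to process both visual and textual graph representations and so improve the generation of procedural material graphs. We train our models on a new dataset of production-quality procedural materials and pair them with a constrained tree-search inference algorithm that guarantees static correctness while navigating the program space efficiently. Experiments show that our multimodal approach is more efficient than text-only baselines in both unconditional and conditional graph synthesis, with higher visual quality and fidelity, setting a new state of the art.

English abstract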

Material node graphs are programs that generate the 2D channels of procedural materials, including geometry such as roughness and displacement maps, and reflectance such as albedo and conductivity maps. They are essential in computer graphics for representing the appearance of virtual 3D objects parametrically and at arbitrary resolution. In particular, their directed acyclic graph structure and intermediate states enable a modular, interpretable workflow for interactive appearance modeling. However, creating such graphs remains challenging and typically requires professional training. While recent neural program synthesis approaches attempt to simplify this process, they solely represent graphs as textual programs, failing to capture the inherently visual-spatial nature of node graphs that makes them accessible to humans. To address this gap, we present MultiMat, a multimodal program synthesis framework that leverages large multimodal models to process both visual and textual graph representations for improved generation of procedural material graphs. We train our models on a new dataset of production-quality procedural materials and combine them with a constrained tree search inference algorithm that ensures static correctness while efficiently navigating the program space. Our experimental results show that our multimodal program synthesis method is more efficient in both unconditional and conditional graph synthesis with higher visual quality and fidelity than text-only baselines, establishing new state-of-the-art performance.

2509.21663 2026-05-18 cs.LG cs.AI cs.LO

Logic of Hypotheses: from Zero to Full Knowledge in Neurosymbolic Integration

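Davide Bizzaro, Alessandro Daniele

Affiliations * University of Padua; Fondazione Bruno Kessler; University of Bozen-Bolzano

AI summary: This paper proposes the LoH language, which flexibly unifies data-driven rule learning with symbolic priors and expert knowledge for neurosymbolic integration, and compiles formulas into differentiable computational graphs through fuzzy logic.

Details

AI abstract

Neurosymbolic integration (NeSy) combines neural-network learning with symbolic reasoning. The field divides into methods that inject hand-crafted rules into neural models and methods that induce symbolic rules from data. We introduce Logic of Hypotheses (LoH), a new language that unifies these strands and allows data-driven rule learning to be flexibly combined with symbolic priors and expert knowledge. LoH extends propositional logic syntax with a choice operator that has learnable parameters and selects a subformula from a pool of options. Using fuzzy logic, LoH formulas compile directly into a differentiable computational graph, so the optimal choices can be learned by backpropagation. The framework subsumes some existing NeSy models and additionally permits arbitrary degrees of knowledge specification. With Gödel fuzzy logic and the recently developed Gödel trick, the models can be discretized into hard Boolean-valued functions with no loss in performance. Our experimental analysis shows strong results on tabular data and on two NeSy tasks with a perceptual component.

English abstract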

Neurosymbolic integration (NeSy) blends neural-network learning with symbolic reasoning. The field can be split between methods injecting hand-crafted rules into neural models, and methods inducing symbolic rules from data. We introduce Logic of Hypotheses (LoH), a novel language that unifies these strands, enabling the flexible integration of data-driven rule learning with symbolic priors and expert knowledge. LoH extends propositional logic syntax with a choice operator, which has learnable parameters and selects a subformula from a pool of options. Using fuzzy logic, formulas in LoH can be directly compiled into a differentiable computational graph, so the optimal choices can be learned via backpropagation. This framework subsumes some existing NeSy models, while adding the possibility of arbitrary degrees of knowledge specification. Moreover, the use of Gödel fuzzy logic and the recently developed Gödel trick yields models that can be discretized to hard Boolean-valued functions without any loss in performance. We provide experimental analysis on such models, showing strong results on tabular data and on two NeSy tasks with a perceptual component.
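
To make the idea of a learnable choice operator under Gödel semantics concrete, here is a minimal sketch. The Gödel connectives (minimum for conjunction, maximum for disjunction) are standard; the particular relaxation `soft_choice`, which treats the choice as a weighted disjunction, and the `hard_choice` discretization are assumptions made for illustration and may differ from the operator defined in the paper.

```python
from typing import Sequence


def godel_and(*xs: float) -> float:
    """Goedel t-norm: conjunction is the minimum truth value."""
    return min(xs)


def godel_or(*xs: float) -> float:
    """Goedel t-conorm: disjunction is the maximum truth value."""
    return max(xs)


def soft_choice(weights: Sequence[float], values: Sequence[float]) -> float:
    """Assumed relaxation: weighted disjunction max_i min(w_i, v_i), w_i in [0, 1]."""
    return max(min(w, v) for w, v in zip(weights, values))


def hard_choice(weights: Sequence[float], values: Sequence[float]) -> float:
    """Discretized choice: keep only the option with the largest weight."""
    best = max(range(len(weights)), key=lambda i: weights[i])
    return values[best]


if __name__ == "__main__":
    a, b = 0.9, 0.2                        # fuzzy truth values of two atoms
    options = [a, b, godel_and(a, b)]      # pool of candidate subformulas
    weights = [0.8, 0.1, 0.1]              # hypothetical learned parameters
    print(soft_choice(weights, options), hard_choice(weights, options))  # 0.8 0.9
```

Because minimum and maximum pass gradients to the selected argument, a structure like this can be trained by backpropagation and then hardened by keeping the top-weighted option.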

2509.21465 2026-05-18 cs.LG

Talking Trees: Reasoning-Assisted Induction of Decision Trees for Tabular Data

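George Yakushev, Alina Shutova, Ivan Rubachev, Natalia Bereberdina, Renat Sergazinov, Artem Babenko

Affiliations * Yandex; HSE University

AI summary: This paper uses reasoning-capable LLMs to induce decision trees for small tabular datasets, producing lightweight trees that outperform CART and non-greedy tree learners and remain competitive with tree ensembles on low-resource tabular problems.

Comments Preprint, code at https://github.com/yandex-research/TalkingTrees

Details

AI abstract

Tabular foundation models are increasingly popular for low-resource tabular problems. They compensate for small training sets by pretraining on large volumes of synthetic data. The prior knowledge gained in pretraining gives excellent performance, but the resulting model is a black box that is hard to interpret and expensive at inference time. This paper explores an alternative: using reasoning-capable LLMs in an agentic setup to induce decision trees for small tabular datasets. We design a minimal set of tools for constructing, analyzing, and manipulating decision trees. With these tools, the LLM combines its prior knowledge with learning from data to produce a lightweight decision tree that outperforms CART and recent non-greedy tree learners and stays competitive with tree ensembles on low-resource tabular problems. A single agentic decision tree is competitive with state-of-the-art black-box models and also comes with a human-readable reasoning trace that can be checked for biases and data leaks. The reasoning-based construction process further allows additional human input to be incorporated into the tree even when that input is not captured in the data.

English abstract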

Tabular foundation models are becoming increasingly popular for low-resource tabular problems. These models make up for small training datasets by pretraining on large volumes of synthetic data. The prior knowledge obtained via pretraining provides exceptional performance, but the resulting model becomes a black box that is difficult to interpret and costly at inference time. In this work, we explore an alternative strategy: using reasoning-capable LLMs to induce decision trees for small tabular datasets in an agentic setup. We design a minimal set of tools for constructing, analyzing, and manipulating decision trees. Equipped with these tools, the LLM combines its prior knowledge with learning from data to produce a lightweight decision tree that outperforms CART and recent non-greedy tree learners and remains competitive with tree ensembles on low-resource tabular problems. While a single agentic decision tree is competitive with state-of-the-art black box models, it also comes with a human-readable reasoning trace that can be checked for biases and data leaks. Furthermore, the reasoning-based LLM's creation process allows for additional human input to be incorporated into the tree without it being captured in data.

2509.21173 2026-05-18 cs.CV cs.AI cs.LG

Less Precise Can Be More Reliable: A Systematic Evaluation of Quantization's Impact on VLMs Beyond Accuracy

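Aymen Bouguerra, Daniel Montoya, Alexandra Gomez-Villa, Chokri Mraidha, Fabio Arnez

Affiliations * Computer Vision Center

AI summary: This paper systematically evaluates how quantization affects VLM reliability and finds that it can improve accuracy, calibration, out-of-distribution detection, and robustness to noise, but not robustness to covariate shift or spurious correlations.

Comments Accepted at ICML 2026

Details

AI abstract

Vision-language models (VLMs) such as CLIP have transformed zero-shot classification and safety-critical tasks such as out-of-distribution (OOD) detection, but their high computational cost hinders real-world deployment. Quantization is the standard route to efficiency, yet its effect on reliability metrics beyond Top-1 accuracy has been neglected. This paper evaluates VLM quantization across more than 700k experiments and finds that quantization noise can actually improve accuracy, calibration, OOD detection, and robustness to noise, though not robustness to covariate shift or spurious correlations. Building on these counterintuitive findings, we show that quantization dampens high-rank spectral components and pushes the model to rely on robust low-rank features, which improves generalization and noise tolerance and opens a path to deploying faster, more reliable VLMs.

English abstract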

Vision-Language Models (VLMs) such as CLIP have revolutionized zero-shot classification and safety-critical tasks, including Out-of-Distribution (OOD) detection. However, their high computational cost hinders efficient real-world deployment. While quantization is a standard solution for efficiency, its broader impact on reliability metrics beyond simple Top-1 accuracy remains critically under-explored. In this study, we conduct a large-scale evaluation of VLM quantization across a comprehensive experimental suite of over 700k evaluation runs with varying configurations. We find that, contrary to the assumption that quantization's noise degrades performance, it can simultaneously improve accuracy, calibration, OOD detection, and robustness to noise, though not to covariate shift or spurious correlations. We leverage these counterintuitive findings to characterize the mechanics of quantization beyond simple regularization: we show that quantization dampens high-rank spectral components, compelling the model to rely more heavily on robust, low-rank features. Ultimately, this spectral filtering effect drives the observed improvements in generalization and noise tolerance, establishing a pathway to deploy faster, more reliable VLMs by utilizing quantization beyond its conventional role.

2509.20641 2026-05-18 cs.LG cs.SD

Investigating Modality Contribution in Audio LLMs for Music

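Giovana Morais, Magdalena Fuentes

Affiliations * Audio Research Lab, New York University, USA; Integrated Design & Media, New York University, USA

AI summary: This paper uses the MM-SHAP framework to quantify the contribution of each modality in Audio LLMs. It finds that the more accurate model relies more on text to answer questions, although the models still localize key sound events in the audio. It is the first application of MM-SHAP to Audio LLMs.

Comments 5 pages, 2 figures, accepted at ICASSP 2026

Details

DOI: 10.1109/ICASSP55912.2026.11463350

AI abstract

Audio large language models (Audio LLMs) can hold human-like conversations about music, but it is unclear whether they truly listen to the audio or rely only on textual reasoning, as recent benchmarks suggest. This paper examines the question by quantifying how much each modality contributes to the model's output. We adapt the MM-SHAP framework, a performance-agnostic score based on Shapley values that measures the relative contribution of each modality to a model's prediction. We evaluate two models on the MuChoMusic benchmark and find that the more accurate model relies more on text to answer questions. Closer inspection shows that even when the overall audio contribution is low, the models still localize key sound events, which indicates that audio is not ignored entirely. This is the first application of MM-SHAP to Audio LLMs, and we hope it lays a foundation for future research on explainable AI and audio.

English abstract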

Audio Large Language Models (Audio LLMs) enable human-like conversation about music, yet it is unclear if they are truly listening to the audio or just using textual reasoning, as recent benchmarks suggest. This paper investigates this issue by quantifying the contribution of each modality to a model's output. We adapt the MM-SHAP framework, a performance-agnostic score based on Shapley values that quantifies the relative contribution of each modality to a model's prediction. We evaluate two models on the MuChoMusic benchmark and find that the model with higher accuracy relies more on text to answer questions, but further inspection shows that even if the overall audio contribution is low, models can successfully localize key sound events, suggesting that audio is not entirely ignored. Our study is the first application of MM-SHAP to Audio LLMs and we hope it will serve as a foundational step for future research in explainable AI and audio.
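
As a sketch of the kind of score involved: given per-token Shapley values that have already been computed, a modality's share can be taken as its summed absolute Shapley values divided by the total over both modalities. The function name and the toy numbers are hypothetical, and the paper's adaptation to audio may differ in how tokens or audio segments are defined.

```python
from typing import Sequence, Tuple


def modality_shares(
    text_shap: Sequence[float], audio_shap: Sequence[float]
) -> Tuple[float, float]:
    """Return (text share, audio share) from per-token Shapley values.

    Absolute values are summed so that positive and negative contributions
    both count as use of the modality, independent of task performance.
    """
    text_total = sum(abs(v) for v in text_shap)
    audio_total = sum(abs(v) for v in audio_shap)
    total = text_total + audio_total
    if total == 0:
        raise ValueError("all Shapley values are zero; shares are undefined")
    return text_total / total, audio_total / total


if __name__ == "__main__":
    # Toy Shapley values (assumed), not results from the paper.
    print(modality_shares([2.0, -1.0], [0.5, -0.5]))  # (0.75, 0.25)
```

The two shares always sum to one, so a text share well above 0.5 signals a model that leans mostly on the question text.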

2509.20349 2026-05-18 cs.LG

Process-Informed Forecasting of Complex Thermal Dynamics in Pharmaceutical Manufacturing

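Ramona Rubini, Siavash Khodakarami, Aniruddha Bora, George Em Karniadakis, Michele Dassisti

Affiliations * Department of Agricultural and Environmental Sciences, University of Bari; Department of Mechanical, Mathematics, and Management (DMMM), Polytechnic University of Bari; Division of Applied Mathematics, Brown University

AI summary: This paper proposes a process-informed forecasting approach that covers both classical models and deep learning architectures, integrates process priors to improve predictive accuracy and physical consistency, and validates the approach on pharmaceutical lyophilization.

Details

AI abstract

Accurate time-series forecasting is essential to modern industrial monitoring and control of complex physical systems, but deep learning models often lack the physical consistency required in regulated environments. To bridge this gap, we introduce Process-Informed Forecasting (PIF) models for temperature prediction in pharmaceutical lyophilization, which use the deterministic production recipe as a macro-structural prior. We study classical methods such as the autoregressive integrated moving average (ARIMA) model and modern deep learning architectures, including Kolmogorov-Arnold networks (KANs). We compare three loss formulations that integrate a process-informed trajectory prior: a fixed-weight loss, a dynamic uncertainty-based loss, and a residual-based attention (RBA) mechanism. We evaluate all models for accuracy and physical consistency and also for robustness to sensor noise. We further test how well the best model generalizes in a transfer-learning scenario that adapts it to a new process. The results show that PIF models outperform purely data-driven models in accuracy, physical plausibility, and noise robustness, providing a scalable framework for reliable and generalizable forecasting in critical manufacturing.

English abstract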

Accurate time-series forecasting for complex physical systems is the backbone of modern industrial monitoring and control, yet deep learning models often lack the physical consistency required in regulated environments. To bridge this gap, we introduce Process-Informed Forecasting (PIF) models for temperature in pharmaceutical lyophilization, embedding deterministic production recipes as macro-structural priors. We investigate classical methods (e.g., the Autoregressive Integrated Moving Average (ARIMA) model) and modern deep learning architectures, including Kolmogorov-Arnold Networks (KANs). We compare three different loss function formulations that integrate a process-informed trajectory prior: a fixed-weight loss, a dynamic uncertainty-based loss, and a Residual-Based Attention (RBA) mechanism. We evaluate all models not only for accuracy and physical consistency but also for robustness to sensor noise. Furthermore, we test the practical generalizability of the best model in a transfer-learning scenario to a new process. Our results show that PIF models outperform their data-driven counterparts in terms of accuracy, physical plausibility, and noise resilience, offering a scalable framework for reliable and generalizable forecasting solutions in critical manufacturing.
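
The simplest of the three formulations, a fixed-weight loss, can be sketched as a data-fit term plus a constant-weighted penalty for straying from the recipe trajectory. The abstract does not give the exact expression, so the mean-squared-error form, the weight `lam`, and the toy values below are assumptions for illustration.

```python
from typing import Sequence


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def fixed_weight_loss(
    pred: Sequence[float],
    target: Sequence[float],
    prior: Sequence[float],
    lam: float,
) -> float:
    """Data-fit error plus lam times the deviation from the process prior."""
    return mse(pred, target) + lam * mse(pred, prior)


if __name__ == "__main__":
    pred = [1.0, 2.0]      # model forecast (toy values)
    target = [1.0, 4.0]    # measured temperatures (toy values)
    prior = [0.0, 2.0]     # trajectory implied by the recipe (toy values)
    print(fixed_weight_loss(pred, target, prior, lam=0.5))  # 2.25
```

The dynamic uncertainty-based and residual-based attention variants named in the abstract would replace the constant `lam` with weights that change during training.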

2509.15267 2026-05-18 cs.CV cs.AI cs.LG

Autoguided Online Data Curation for Diffusion Model Training

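Valeria Pais, Luis Oala, Daniele Faccio, Marco Aversa

Affiliations * University of Glasgow; Dotphoton

AI summary: This paper studies how autoguidance and online data selection affect the training efficiency of diffusion models, and confirms the advantage of autoguidance in sample quality and diversity on synthetic data tasks.

Comments Accepted non-archival paper at ICCV 2025 Workshop on Curated Data for Efficient Learning (CDEL)

Details

AI abstract

The compute cost of generative models has rekindled hopes for efficient data curation. This paper asks whether recently developed autoguidance and online data selection methods can improve the time and sample efficiency of diffusion model training. We integrate joint example selection (JEST) and autoguidance into a unified code base for fast ablation and benchmarking. We evaluate combinations of data curation methods on a controlled 2-D synthetic data generation task and on (3x64x64)-D image generation. Comparisons are made at equal wall-clock time and equal numbers of samples, explicitly accounting for selection overhead. Across all experiments, autoguidance consistently improves sample quality and diversity. Early AJEST (selection applied only at the start of training) matches or slightly exceeds the data efficiency of autoguidance alone on both tasks, but its time overhead and added complexity make autoguidance or uniform random data selection preferable in most cases. These findings suggest that targeted online selection can yield efficiency gains early in training, while robust improvements in sample quality come mainly from autoguidance. We discuss limitations and scope and outline when data selection may be beneficial.

English abstract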

The costs of generative model compute have rekindled promises and hopes for efficient data curation. In this work, we investigate whether recently developed autoguidance and online data selection methods can improve the time and sample efficiency of training generative diffusion models. We integrate joint example selection (JEST) and autoguidance into a unified code base for fast ablation and benchmarking. We evaluate combinations of data curation methods on a controlled 2-D synthetic data generation task as well as (3x64x64)-D image generation. Our comparisons are made at equal wall-clock time and equal number of samples, explicitly accounting for the overhead of selection. Across experiments, autoguidance consistently improves sample quality and diversity. Early AJEST (applying selection only at the beginning of training) can match or modestly exceed autoguidance alone in data efficiency on both tasks. However, its time overhead and added complexity make autoguidance or uniform random data selection preferable in most situations. These findings suggest that while targeted online selection can yield efficiency gains in early training, robust sample quality improvements are primarily driven by autoguidance. We discuss limitations and scope, and outline when data selection may be beneficial.

2509.10310 2026-05-18 cs.CV math.OC

A Stochastic Birth-and-Death Approach for Street Furniture Geolocation in Urban Environments

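Evan Murphy, Marco Viola, Vladimir A. Krylov

Affiliations * School of Mathematical Sciences, Dublin City University

AI summary: This paper proposes a stochastic birth-and-death optimization algorithm based on energy maps to geolocate urban street furniture precisely. Integrating geospatial information improves localization accuracy, and the paper demonstrates feasibility for large-scale asset mapping.

Comments Accepted for publication in the Proceedings of the 27th Irish Machine Vision and Image Processing Conference (IMVIP 2025)

Details

AI abstract

This paper addresses the precise geolocation of street furniture in complex urban environments and proposes a probabilistic framework based on energy maps. Representing the energy in a map-based, geopositioned format lets the optimization process integrate external geospatial information such as GIS layers, road maps, or placement constraints, which improves contextual awareness and localization accuracy. A stochastic birth-and-death optimization algorithm infers the most probable configuration of assets. A realistic simulation based on street lighting infrastructure in Dublin city centre demonstrates the feasibility of the method and its potential for large-scale, accurate urban asset mapping. The implementation will be made available in the GitHub repository https://github.com/EMurphy0108/SBD_Street_Furniture.

English abstract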

In this paper we address the problem of precise geolocation of street furniture in complex urban environments, which is a critical task for effective monitoring and maintenance of public infrastructure by local authorities and private stakeholders. To this end, we propose a probabilistic framework based on energy maps that encode the spatial likelihood of object locations. Representing the energy in a map-based geopositioned format allows the optimisation process to seamlessly integrate external geospatial information, such as GIS layers, road maps, or placement constraints, which improves contextual awareness and localisation accuracy. A stochastic birth-and-death optimisation algorithm is introduced to infer the most probable configuration of assets. We evaluate our approach using a realistic simulation informed by a geolocated dataset of street lighting infrastructure in Dublin city centre, demonstrating its potential for scalable and accurate urban asset mapping. The implementation of the algorithm will be made available in the GitHub repository https://github.com/EMurphy0108/SBD_Street_Furniture.

2507.17572 2026-05-18 cs.RO

Sampling-Based Global Optimal Control and Estimation via Semidefinite Programming

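Antoine Groudiev, Fabian Schramm, Éloïse Berthier, Justin Carpentier, Frederike Dümbgen

Affiliations * École Normale Supérieure, PSL Research University; Inria, Département d’informatique de l’ENS, CNRS, PSL Research University; U2IS, ENSTA, Institut Polytechnique de Paris

AI summary: This paper applies KernelSOS theory to control and robotics, addresses practical issues such as restarting strategies and hyperparameter calibration, and shows its benefits in high-dimensional, non-parametric trajectory optimization.

Details

AI abstract

Global optimization has gained traction over the past decades thanks to advances in theoretical foundations and numerical methods. Kernel Sum of Squares (KernelSOS) combines the expressivity of kernel methods with the guarantees of SOS optimization and provides a powerful theoretical framework. This paper takes KernelSOS from theory to practice and demonstrates it on challenging control and robotics problems. We identify and address the practical considerations needed to make the method work in applied settings: restarting strategies, systematic hyperparameter calibration, methods for recovering minimizers, and combination with fast local solvers. As a proof of concept, applying KernelSOS to robot localization shows that it is competitive with existing SOS approaches that rely on heuristics and handcrafted reformulations. Even in the high-dimensional, non-parametric setting of trajectory optimization with the simulator treated as a black box, we show how KernelSOS can be combined with fast local solvers to find high-quality solutions without compromising overall runtime.

English abstract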

Global optimization has gained traction over the past decades, thanks to the development of both theoretical foundations and efficient numerical routines. Among recent advances, Kernel Sum of Squares (KernelSOS) provides a powerful theoretical framework, combining the expressivity of kernel methods with the guarantees of SOS optimization. In this paper, we take KernelSOS from theory to practice and demonstrate its use on challenging control and robotics problems. We identify and address the practical considerations required to make the method work in applied settings: restarting strategies, systematic calibration of hyperparameters, methods for recovering minimizers, and the combination with fast local solvers. As a proof of concept, the application of KernelSOS to robot localization highlights its competitiveness with existing SOS approaches that rely on heuristics and handcrafted reformulations to render the problem polynomial. Even in the high-dimensional, non-parametric setting of trajectory optimization with simulators treated as black boxes, we demonstrate how KernelSOS can be combined with fast local solvers to uncover higher-quality solutions without compromising overall runtimes.

2507.15970 2026-05-18 cs.SD cs.AI eess.AS

CIS-BWE: Chaos-Informed Speech Bandwidth Extension

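Tarikul Islam Tamiti, Tonmoy Das, Nursadul Mamun, Anomadarshi Barua

Affiliations * Chittagong University of Engineering and Technology; George Mason University

AI summary: This paper proposes the NDSI-BWE framework, which uses four discriminators inspired by nonlinear dynamical systems to capture the complex temporal behavior of speech, reduces parameters through depth-wise convolution, and improves speech bandwidth extension performance.

Details

AI abstract

Recovering high-frequency components lost to bandwidth constraints is essential for telecommunications and for high-fidelity audio on limited resources. We introduce NDSI-BWE, a new adversarial bandwidth extension (BWE) framework that uses four new discriminators inspired by nonlinear dynamical systems to capture diverse temporal behaviors: a Multi-Resolution Lyapunov Discriminator (MRLD), which captures deterministic chaos to measure sensitivity to initial conditions; a Multi-Scale Recurrence Discriminator (MS-RD) for self-similar recurrence dynamics; a Multi-Scale Detrended Fractal Analysis Discriminator (MSDFA) for long-range, slowly varying, scale-invariant relationships; and a Multi-Resolution Poincaré Plot Discriminator (MR-PPD) for hidden latent-space relationships. These are joined by a Multi-Period Discriminator (MPD) for cyclical patterns, and a Multi-Resolution Amplitude Discriminator (MRAD) and Multi-Resolution Phase Discriminator (MRPD) for intricate amplitude-phase transition statistics. Using depth-wise convolution at the core of the convolutional block in each discriminator, NDSI-BWE achieves an eight-fold reduction in parameters. The seven discriminators guide a complex-valued ConformerNeXt-based generator with a dual-stream Lattice-Net architecture that refines magnitude and phase simultaneously. The generator draws on the global dependency modeling of the transformer-based Conformer and the local temporal modeling of ConvNeXt blocks. Across six objective metrics and a subjective test with five human judges, NDSI-BWE sets a new state of the art in BWE.

English abstract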

Recovering high-frequency components lost to bandwidth constraints is crucial for applications ranging from telecommunications to high-fidelity audio on limited resources. We introduce NDSI-BWE, a new adversarial Bandwidth Extension (BWE) framework that leverages four new discriminators inspired by nonlinear dynamical systems to capture diverse temporal behaviors: a Multi-Resolution Lyapunov Discriminator (MRLD) for determining sensitivity to initial conditions by capturing deterministic chaos, a Multi-Scale Recurrence Discriminator (MS-RD) for self-similar recurrence dynamics, a Multi-Scale Detrended Fractal Analysis Discriminator (MSDFA) for long-range, slowly varying, scale-invariant relationships, and a Multi-Resolution Poincaré Plot Discriminator (MR-PPD) for capturing hidden latent-space relationships. They are complemented by a Multi-Period Discriminator (MPD) for cyclical patterns, and a Multi-Resolution Amplitude Discriminator (MRAD) and Multi-Resolution Phase Discriminator (MRPD) for capturing intricate amplitude-phase transition statistics. By using depth-wise convolution at the core of the convolutional block within each discriminator, NDSI-BWE attains an eight-fold parameter reduction. These seven discriminators guide a complex-valued ConformerNeXt-based generator with a dual-stream Lattice-Net-based architecture for simultaneous refinement of magnitude and phase. The generator leverages the transformer-based Conformer's global dependency modeling and the ConvNeXt block's local temporal modeling capability. Across six objective evaluation metrics and subjective tests with five human judges, NDSI-BWE establishes a new SoTA in BWE.

2507.14200 2026-05-18 cs.CL cs.AI cs.LG

A Scalable Multi-LLM Collaboration System with Retrieval-based Selection and Exploration-Exploitation-Driven Enhancement

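Shengji Tang, Jianjian Cao, Weihao Lin, Jiale Hong, Bo Zhang, Shuyue Hu, Lei Bai, Tao Chen, Wanli Ouyang, Peng Ye

Affiliations * Shanghai Artificial Intelligence Laboratory; The Chinese University of Hong Kong; Fudan University; Shanghai Jiao Tong University

AI summary: This paper proposes SMCS, which coordinates multiple open-source LLMs through a Retrieval-based Prior Selection module and an Exploration-Exploitation-Driven Posterior Enhancement module. Experiments show that it outperforms closed-source models on multiple tasks and exceeds the average of the best open-source results across datasets.

Details

AI abstract

Existing multi-LLM collaboration systems often face scalability challenges when integrating new LLMs and tasks, which leads to suboptimal performance. We therefore propose SMCS, a Scalable Multi-LLM Collaboration System designed to coordinate multiple open-source LLMs effectively. The system has two core modules: a Retrieval-based Prior Selection (RPS) module that dynamically selects the most suitable LLMs, and an Exploration-Exploitation-Driven Posterior Enhancement (EPE) module that promotes response diversity and selects high-quality outputs through a hybrid scoring mechanism. Experiments on eight mainstream benchmarks validate the system: by integrating fifteen open-source LLMs, SMCS outperforms prevailing closed-source LLMs such as GPT-4.1 (+5.36%) and GPT-o3-mini (+5.28%) across multiple tasks. It even exceeds the average of the best open-source results on the individual datasets (+2.86%), advancing the empirical performance frontier of open-source collaboration. The code is released at https://github.com/magent4aci/SMCS.

English abstract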

Existing multi-LLM collaboration systems often encounter scalability challenges when integrating new LLMs and tasks, leading to suboptimal performance. To address this, we propose SMCS, a Scalable Multi-LLM Collaboration System designed to effectively coordinate multiple open-source LLMs. The system consists of two core components: a Retrieval-based Prior Selection (RPS) module, which dynamically selects the most suitable LLMs for each input, and an Exploration-Exploitation-Driven Posterior Enhancement (EPE) module, which fosters response diversity and selects high-quality outputs through a hybrid scoring mechanism. Experiments on eight mainstream benchmarks validate the effectiveness of our system: by integrating fifteen open-source LLMs, SMCS outperforms prevailing closed-source LLMs, e.g., GPT-4.1 (+5.36%) and GPT-o3-mini (+5.28%), across multiple tasks. Remarkably, it even exceeds the average of best results on different datasets with open-source LLMs (+2.86%), significantly advancing the empirical performance frontier of open-source collaboration. The code is released at https://github.com/magent4aci/SMCS.

2507.10236 2026-05-18 cs.CV

Navigating the Challenges of AI-Generated Image Detection in the Wild: What Truly Matters?

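Despina Konstantinidou, Dimitrios Karageorgiou, Christos Koutlis, Olga Papadopoulou, Emmanouil Schinas, Symeon Papadopoulos

Affiliations * Information Technologies Institute - Centre for Research and Technology Hellas

AI summary: This paper studies the challenges of detecting AI-generated images in the wild, analyzes how design choices affect detection performance, and proposes optimizations that raise AUC by 26.87% on average.

Comments ACM International Workshop on Multimedia AI against Disinformation 2026 (MAD 2026)

Details

DOI: 10.1145/3810988.3812665

AI abstract

As generative AI advances, AI-generated images have become realistic enough to deceive even vigilant human observers. Yet current AI-generated image detection (AID) methods, although they excel on controlled benchmark datasets, perform poorly on real-world cases. To study this, we introduce ITW-SM, a curated collection of real and AI-generated images drawn from major social media platforms. We use it to analyze the key design choices in building a detector, including its architecture, pre-trained latent space, training data, and pre-processing. We show that simply scaling up the pre-training stage or choosing more training data does not always improve detection. Instead, each design choice must be optimized so that the processing pipeline can propagate and effectively analyze both low-level traces and high-level image semantics. Building on these findings, we achieve an average AUC improvement of 26.87% across several state-of-the-art detection methods, providing a roadmap for more robust detectors. Our assets are available at https://mever-team.github.io/itw-sm.

English abstract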

As generative Artificial Intelligence (AI) advances, the realism of AI-generated imagery has reached a threshold capable of deceiving even vigilant human observers. Yet, while current AI-generated Image Detection (AID) approaches perform exceptionally well on controlled benchmark datasets, they struggle significantly with real-world cases. To study this behavior, we introduce the ITW-SM dataset, a curated collection of real and AI-generated images originating from major social media platforms. We employ it to analyze the effects of key design choices typically considered when building a detector, involving its architecture, pre-trained latent spaces, training data, and pre-processing approaches. We show that naively scaling the pre-training stage or opting for more training data does not always lead to better detection performance. Instead, our work reveals that it is crucial to optimize each design choice to enable the processing pipeline to propagate and effectively analyze both low-level traces and high-level image semantics. Building on our findings, we achieve a substantial average improvement of 26.87% in AUC across multiple state-of-the-art detection approaches and under real-world conditions, providing a roadmap for developing more resilient detectors. Our assets are available at https://mever-team.github.io/itw-sm.
