arXivDaily: daily arXiv digest, updated Monday to Friday
2606.13835 2026-06-15 cs.CL cs.AI cs.MA New submission

When Plausible Is Not Realistic: Evaluating Human Mobility in LLM-Based Urban Simulation

Gustavo H. Santos, Aline Carneiro Viana, Thiago H. Silva

Affiliations: UTFPR; Inria; U. of Toronto

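AI summary: Proposes a validation framework that evaluates the human mobility patterns generated by LLM-based urban simulators using metrics such as mobility laws and temporal rhythms, and finds a substantial gap between narrative plausibility and empirical mobility realism.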

Comments 14 pages, 10 figures

Abstract

LLM-based generative agents are increasingly used in urban simulators, yet it remains unclear whether they reproduce empirically realistic human mobility patterns or merely generate plausible mobility narratives. We introduce a validation framework for evaluating the mobility of generative agents in LLM-based urban simulators against real-world mobility data. To this end, we use mobility laws, temporal rhythms, network motifs, semantic activity transitions, and behavioral mobility profiles. Using datasets from the Greater Paris region and Shanghai, we evaluate AgentSociety and CitySim across multiple dimensions of mobility realism. Our analysis reveals a substantial gap between narrative plausibility and empirical mobility realism. Although the simulators capture some high-level semantic activity distributions, they struggle to reproduce core spatial and temporal constraints, including realistic trip-length distributions, origin-destination flows, dwell times, and transition dynamics. We further observe that realistic mobility diversity is unstable across default prompting configurations and may require explicit profile-aware initialization. To support reproducible evaluation, we also contribute scalable and open LLM-driven infrastructure for regional-scale map generation, observability-enhanced simulation, mobility-metric computation, and traffic simulation. Our findings highlight the need for rigorous empirical validation of LLM-based urban simulators and provide practical tools for building more realistic and reproducible urban simulation systems.
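
The abstract lists trip-length distributions among the spatial properties the simulators fail to reproduce. A minimal sketch of how such a comparison can be scored, assuming planar (x, y) coordinates and using the radius of gyration and a two-sample Kolmogorov-Smirnov statistic as illustrative metrics; these are common choices in mobility analysis, not necessarily the paper's exact metrics, and all helper names are hypothetical:

```python
import math
from bisect import bisect_right


def radius_of_gyration(points):
    """Root-mean-square distance of an agent's visited (x, y) points from their centroid."""
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    return math.sqrt(sum((x - cx) ** 2 + (y - cy) ** 2 for x, y in points) / n)


def trip_lengths(points):
    """Straight-line length of each consecutive trip in a trajectory."""
    return [math.dist(a, b) for a, b in zip(points, points[1:])]


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)

    def ecdf(sorted_vals, v):
        return bisect_right(sorted_vals, v) / len(sorted_vals)

    # The supremum of |F_a - F_b| is attained at one of the sample points.
    return max(abs(ecdf(a, v) - ecdf(b, v)) for v in a + b)


# Usage: compare observed vs. simulated trip lengths (toy coordinates, in km).
observed = [(0, 0), (3, 4), (3, 0), (0, 0)]
simulated = [(0, 0), (1, 0), (1, 1), (0, 1)]
gap = ks_statistic(trip_lengths(observed), trip_lengths(simulated))
```

A statistic near 0 means the two trip-length distributions are close; in the toy usage the simulated agent only makes 1 km hops while the observed one makes 3 to 5 km trips, so the statistic is 1.0.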

2606.13823 2026-06-15 cs.LG eess.SP stat.ML New submission

A Stationarity-and-Coupling Criterion for Training-Free Time-Lagged Spectral Embeddings of Multivariate Time Series

Siddharth Pal, Viktoria Rojkova

Affiliations: University of California, Berkeley

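AI summary: Proposes a fixed-length descriptor D(τ) built from a truncated time-lagged correlation matrix and uses a stationary Gaussian VAR(1) model to derive when it applies: the signals are approximately stationary and the class information lies in cross-channel temporal coupling rather than in marginal power.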

Comments 25 pages, 2 figures, 10 tables

Abstract

We study training-free fixed-length descriptors for multivariate time series and ask not merely whether such a descriptor performs well, but when it can be expected to work at all. Our object of study is $D(\tau)$, built from a time-lagged correlation matrix truncated at the Marchenko-Pastur edge so that only signal-bearing eigenvalues survive, and classified by cosine similarity to class centroids with zero learned parameters. The central contribution is not the descriptor but a falsifiable applicability criterion for it. Working from a stationary Gaussian VAR(1) model, we argue that $D(\tau)$ separates two classes when the signals are approximately stationary and the class information lives in their cross-channel temporal coupling rather than in marginal per-channel power. We derive, semi-formally, three consequences: a distinguishability condition, why the static ($\tau=0$) covariance collapses to chance, and why a stationary but power-discriminated paradigm defeats the descriptor. The criterion is operational: a two-part pre-flight test (an augmented Dickey-Fuller stationarity check and a power-baseline saturation check) predicts applicability before any training. We validate both halves on a mixed assortment of paradigms. On four paradigms that satisfy the criterion (Sleep-EDF, BCI-IV-2a, MIT-BIH, ESC-50), the descriptor is competitive with strong baselines at a fraction of their cost, reaching $88.5\pm4.5\%$ under 20-subject leave-one-subject-out evaluation on Sleep-EDF on a single CPU thread. On three that violate it (non-stationary ERPs, and financial-volatility and wearable-stress regimes that are power-discriminated), it fails exactly as the pre-flight test predicts, and these negatives are the more informative half. We are explicit that $D(\tau)$ is not the most accurate representation; its value is a compact, training-free embedding whose domain of validity is known in advance.
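
The construction described in the abstract can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the authors' code: the lag-$\tau$ matrix is symmetrized before eigendecomposition, eigenvalues are kept only when their magnitude exceeds the standard Marchenko-Pastur upper edge $(1+\sqrt{N/T})^2$ (the null spectrum of a lagged matrix differs, so this threshold is an assumption), and the descriptor is the upper triangle of the cleaned matrix. All function names are hypothetical.

```python
import numpy as np


def lagged_descriptor(x, tau):
    """x: array (T, N) of T time steps and N channels. Returns a vector of length N(N+1)/2."""
    T, N = x.shape
    z = (x - x.mean(axis=0)) / x.std(axis=0)            # standardize each channel
    c = z[: T - tau].T @ z[tau:] / (T - tau)              # lag-tau cross-correlation matrix
    c = (c + c.T) / 2                                     # symmetrize so eigh applies
    lam_plus = (1 + np.sqrt(N / T)) ** 2                  # MP upper edge (assumed noise threshold)
    w, v = np.linalg.eigh(c)
    w = np.where(np.abs(w) > lam_plus, w, 0.0)            # keep only eigenvalues beyond the edge
    c_clean = (v * w) @ v.T                               # rebuild V diag(w) V^T
    return c_clean[np.triu_indices(N)]                    # fixed length, independent of T


def fit_centroids(descriptors, labels):
    """Mean descriptor per class; nothing is learned beyond these averages."""
    return {lab: np.mean([d for d, l in zip(descriptors, labels) if l == lab], axis=0)
            for lab in sorted(set(labels))}


def predict(d, centroids):
    """Assign the class whose centroid has the highest cosine similarity to d."""
    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
    return max(centroids, key=lambda lab: cosine(d, centroids[lab]))
```

Descriptors of labeled training recordings feed `fit_centroids`; a new recording is labeled with `predict(lagged_descriptor(x, tau), centroids)`. Because the output length depends only on the channel count N, recordings of different lengths yield directly comparable vectors.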

2606.13821 2026-06-15 cs.LG New submission

Attention-Based Estimation of the Individual Treatment Benefit Probability under Dose Variation

Lev V. Utkin, Andrei V. Konstantinov, Stanislav K. Kogan, Natalya M. Verbova, Maksim I. Goriunov

Affiliations: Higher School of Artificial Intelligence Technologies, Peter the Great St. Petersburg Polytechnic University

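AI summary: Proposes the Dose-AIPTB framework, which extends individual treatment benefit probability estimation to discrete-dose settings and aggregates pseudo-labels with an attention mechanism to support personalized dose selection.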

Abstract

Estimating the probability that a treatment outperforms a control for an individual patient, called the Individual Probability of Treatment Benefit (IPTB), offers a clinically intuitive alternative to population-average metrics. However, existing methods for IPTB estimation are largely confined to binary treatment settings, despite the prevalence of dose-varying interventions in clinical practice. We propose a general framework for IPTB estimation with ordinal outcomes under discrete dose assignments, called Dose-AIPTB (Dose Attention-based IPTB). Our approach recasts the problem as binary classification over the unobserved sign of the individual treatment effect, constructing pseudo-labels from covariate-similar pairwise comparisons and aggregating them via attention mechanisms or Nadaraya-Watson kernel regression. This formulation naturally accommodates multiple discrete dose levels, extending beyond the binary treatment paradigm. Through numerical experiments on real-world and synthetic data under covariate shift, varying sample sizes, and heterogeneous outcomes, we demonstrate that attention-based aggregation consistently outperforms kernel alternatives. The framework provides a foundation for personalized dose selection grounded in individual-level benefit probabilities. Code implementing the model is publicly available at https://github.com/NTAILab/AIPTBDose.
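
A minimal sketch of the pseudo-label idea with Nadaraya-Watson aggregation, for one treated dose against a reference dose. Assumptions not taken from the paper: covariates are numeric tuples, a pair counts as covariate-similar when its Euclidean distance is at most `max_dist`, each pair is placed at the midpoint of its two covariate vectors, ties count as no benefit, and the kernel is Gaussian with bandwidth `bandwidth`. For several dose levels one would repeat the procedure per dose. Function names are hypothetical.

```python
import math


def pairwise_pseudo_labels(treated, control, max_dist):
    """treated/control: lists of (covariates, ordinal_outcome). Returns (pair_location, label)
    for covariate-similar pairs; label is 1.0 if the treated outcome is higher, else 0.0."""
    pairs = []
    for xt, yt in treated:
        for xc, yc in control:
            if math.dist(xt, xc) <= max_dist:
                mid = tuple((a + b) / 2 for a, b in zip(xt, xc))
                pairs.append((mid, 1.0 if yt > yc else 0.0))
    return pairs


def nadaraya_watson_iptb(x, pairs, bandwidth):
    """Gaussian-kernel-weighted mean of pseudo-labels: an estimate of P(treatment beats control | x)."""
    weights = [math.exp(-math.dist(x, p) ** 2 / (2 * bandwidth ** 2)) for p, _ in pairs]
    total = sum(weights)
    if total == 0:
        return float("nan")                      # no pair carries weight near x
    return sum(w * label for w, (_, label) in zip(weights, pairs)) / total
```

The Gaussian weights are a softmax over scores $-\lVert x - p \rVert^2 / (2h^2)$, so kernel regression can be read as attention with a fixed scoring rule; an attention-based aggregator would replace that fixed rule with a learned one.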

2606.13818 2026-06-15 cs.LG New submission

Uncertainty Estimation and Generalization Bounds for Modern Deep Learning

Luis A. Ortega

Affiliations: Andrés Department of Computer Science Machine Learning Group; Madrid, June 2026

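AI summary: Unifies inference, function-space modeling, and large-deviation theory from a Bayesian perspective; proposes methods including DVIP, VaLLA, and FMGP to improve uncertainty estimation; and uses PAC-Bayesian and large-deviation theory to explain the generalization of over-parameterized neural networks.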

Comments PhD Thesis, Autonomous University of Madrid

Abstract

This thesis investigates how Bayesian principles can deepen our understanding of modern deep learning systems. While neural networks achieve remarkable predictive performance, their ability to generalize and to quantify uncertainty remains only partly understood. This thesis approaches this challenge from both methodological and theoretical angles: unifying Bayesian inference, function-space modeling, and large-deviation theory under a common probabilistic perspective. On the methodological side, the thesis introduces the Deep Variational Implicit Process (DVIP), a scalable Bayesian framework that extends implicit processes to deep architectures. Complementing this, two post-hoc methods -- the Variational Linearized Laplace Approximation (VaLLA) and the Fixed-Mean Gaussian Process (FMGP) -- are proposed to equip pretrained deterministic networks with calibrated uncertainty estimates. The theoretical contributions focus on one of the central open questions in modern machine learning: why do large, over-parameterized neural networks generalize so well? To address this, the thesis develops a unified probabilistic framework that connects three key mechanisms -- diversity, smoothness, and stochasticity -- within the language of PAC-Bayesian and large-deviation theory.

2606.13817 2026-06-15 cs.RO cs.LG New submission

FlowMo-WM: A World Model with Object Momentum and Hidden Ambient Drift

Yitao Jiang, Luyang Zhao, Muhao Chen, Devin Balkcom

Affiliations: Dartmouth College; Clemson University; University of Houston

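AI summary: Proposes FlowMo-WM, an end-to-end trainable visual world model that factorizes image-action history into a short-history latent state and a long-history context to model object motion and ambient drift separately, improving long-horizon prediction in settings such as aquatic surface vehicles.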

Abstract

World models in robot learning predict future states from visual observations and actions, enabling agents to reason about the consequences of their controls. However, many action-conditioned models are evaluated in settings where motion is dominated by immediate control, whereas aquatic surface vehicles and other real-world objects continue moving under inertia and are displaced by hidden ambient drift, such as water currents or wind. We propose FlowMo-WM, an end-to-end trainable visual world model that infers object-centric motion state and a predictive long-history context associated with hidden drift from image-action histories without direct supervision of flow fields. FlowMo-WM factorizes image-action history into a short-history latent state, trained to summarize object-centric motion, and a longer-history context, trained to summarize slowly varying exogenous influences. A zero-context residual transition separates action-conditioned base dynamics from context-dependent drift effects during latent rollout. In simulated aquatic surface-vehicle environments with diverse hidden flows, disturbances, and randomized vehicle dynamics, FlowMo-WM improves long-horizon rollout accuracy over representative action-conditioned latent world models. Prediction-time context ablations, in which the inferred context is zeroed or shuffled during rollout, show that the ambient context is important for stable prediction under hidden drift, while frozen linear probes characterize information encoded in the learned factors.
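
One way to realize a zero-context residual transition is to offset the drift term by its own value at a zero context, so that zeroing the context recovers the base dynamics exactly. The sketch below assumes small random linear-plus-tanh maps in place of the learned networks; the sizes and the names `base_step`, `drift_residual`, `transition`, and `rollout` are hypothetical, and the paper may enforce the property differently.

```python
import numpy as np

rng = np.random.default_rng(0)
Z, A, C = 4, 2, 3                                    # latent, action, context sizes (hypothetical)
W_BASE = rng.normal(scale=0.3, size=(Z, Z + A))      # stand-in for a learned base-dynamics net
W_RES = rng.normal(scale=0.3, size=(Z, Z + A + C))   # stand-in for a learned drift net


def base_step(z, a):
    """Action-conditioned base dynamics; never sees the drift context."""
    return z + np.tanh(W_BASE @ np.concatenate([z, a]))


def drift_residual(z, a, c):
    """Context-dependent correction, offset so it is exactly zero when the context is zero."""
    def h(ctx):
        return np.tanh(W_RES @ np.concatenate([z, a, ctx]))
    return h(c) - h(np.zeros_like(c))


def transition(z, a, c):
    return base_step(z, a) + drift_residual(z, a, c)


def rollout(z0, actions, c):
    """Latent rollout with a fixed inferred context c; returns an array of shape (len(actions) + 1, Z)."""
    states = [z0]
    for a in actions:
        states.append(transition(states[-1], a, c))
    return np.stack(states)
```

The prediction-time ablations in the abstract correspond to calling `rollout` with `np.zeros(C)` (base dynamics only) or with a context inferred from a different episode (shuffled).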

2606.13815 2026-06-15 cs.AI cs.CL New submission

Poker Arena: Multi-Axis Profiling of Strategic Reasoning and Memory in LLMs

Pratham Singla, Shivank Garg, Vihan Singh

Affiliations: Indian Institute of Technology Roorkee; Raeth AI

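AI summary: Proposes the Poker Arena platform, which decomposes strategic reasoning through a three-layer memory architecture and a nine-axis cognitive profile, revealing that scalar leaderboards systematically misrank the capability structure of models.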

Comments 33 pages, ICML Workshop

Abstract

Strategic reasoning under uncertainty underpins consequential decisions in negotiation, finance, and policy, but prevailing game-play benchmarks collapse heterogeneous reasoning dimensions into a single scalar, leaving the capability structure of frontier LLMs unexamined. We introduce Poker Arena, a no-limit Texas Hold'em tournament platform that couples a three-layer memory architecture (within-hand, session, and cross-session) with a nine-axis cognitive profile decomposing strategic reasoning into interpretable dimensions such as bet-sizing calibration and positional awareness. We evaluate seven frontier models across 50 sessions of 1,000 hands and a controlled memory ablation; tournament chips and aggregate axis score order the field differently: Claude Opus 4.6 wins +15,730 chips with 14 first-place finishes, yet ranks only fifth of seven on mean axis score, while persistent memory helps some models and hurts others. These findings show that multi-axis evaluation surfaces capability structure that scalar leaderboards systematically misrank, with cross-dimensional consistency outweighing peak performance on any single axis.

2606.13809 2026-06-15 cs.CV New submission

Compressing Image Style Training into a Single Model Forward

Zhongjie Duan, Yingda Chen

Affiliations: ModelScope Team, Alibaba Group

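AI summary: Proposes the i2L framework, which predicts LoRA weights in a single forward pass for efficient style transfer without per-style optimization, and surpasses existing baselines in style fidelity, prompt alignment, and perceptual quality.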

Comments 11 pages, 9 figures

Abstract

Diffusion-based style transfer must balance inference efficiency with stylization fidelity. Adapter-based methods are efficient, but they inject style as an external condition and can either weaken reference-specific appearance or copy reference semantics into the generated image. Optimization-based personalization methods such as LoRA internalize style more effectively, but require a separate training process for every new style. We introduce i2L (image-to-LoRA), a framework that amortizes style LoRA training into a single forward pass. Given one or more reference images, i2L predicts LoRA weights for a text-to-image model, enabling immediate style instantiation without per-style optimization. The architecture combines an image encoder, learnable LoRA queries, and compressed decoding heads that generate adapted matrices. Training on semantically diverse style pairs encourages the predictor to preserve appearance cues while suppressing reference-content copying. Experiments on Z-Image, FLUX.2, and Hidream-O1 show that i2L improves style fidelity, prompt alignment, and perceptual quality over existing baselines. Because i2L produces explicit LoRA weights, it also supports asymmetric classifier-free guidance, multi-reference style fusion, and composition with controllable-generation modules.
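
Because i2L outputs explicit LoRA weights, applying them reduces to ordinary LoRA arithmetic. A minimal sketch, assuming the predictor returns a factor pair `(B, A)` per adapted weight matrix and that multi-reference fusion is a weighted sum of the per-reference updates; the abstract does not state the paper's fusion rule, and both function names are hypothetical.

```python
import numpy as np


def merge_lora(W, B, A, scale=1.0):
    """Fold one low-rank update into a base weight: W' = W + scale * (B @ A)."""
    return W + scale * (B @ A)


def fuse_styles(W, factors, weights):
    """Blend several references by a weighted sum of their low-rank updates (assumed rule)."""
    delta = sum(wt * (B @ A) for (B, A), wt in zip(factors, weights))
    return W + delta
```

Since every update is a product of low-rank factors, merging adds no inference cost once folded into `W`, and fusion weights can be tuned without re-running the predictor.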

2606.13808 2026-06-15 cs.CL New submission

The Culture Funnel: You Can't Align What isn't in the Data

Ananya Sahu, Mehrnaz Mofakhami, Daniel D'Souza, Thomas Euyang, Julia Kreutzer, Marzieh Fadaee

Affiliations: Cohere Labs

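AI summary: Addressing the reliance of current cultural alignment methods on inference-time interventions while overlooking training data, proposes the concept of a cultural data funnel and analyzes pretraining, fine-tuning, alignment, and reasoning datasets with a multidimensional tagging framework. Finds that cultural signals drop sharply during post-training and that multilinguality increases geographic diversity but does not ensure balance; the proposed tags improve downstream cultural benchmark performance.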

Abstract

Current cultural alignment approaches focus on inference-time interventions, assuming models already contain sufficient cultural knowledge. We argue modern LLM pipelines suffer from a cultural data funnel. Using a multidimensional tagging framework across pretraining, fine-tuning, alignment, and reasoning datasets, we show explicit cultural signals decline sharply during post-training, while geographically concentrated, task-specialized data dominates. Multilinguality enhances geographic diversity of cultural knowledge but does not ensure balanced representation. Our tags improve downstream cultural benchmark performance, demonstrating that advances require shifting focus in training data pipelines. To facilitate future research, we release our culturally tagged dataset with 5.6M samples at https://huggingface.co/datasets/CohereLabs/CultureMarkers.

2606.13803 2026-06-15 cs.LG New submission

Neural Slack Variables for Shape Constraints

Ruben Wiedemann, Antoine Jacquier, Lukas Gonon

Affiliations: Imperial College London; University of St. Gallen

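AI summary: Proposes neural slack variables, which turn constraint enforcement into a regression problem and achieve zero measured violations by jointly learning an auxiliary network; applied to monotonicity and convexity constraints and to learning financial volatility surfaces.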

Abstract

Enforcing functional inequality constraints such as monotonicity and convexity in neural networks is a fundamental challenge in many industrial and scientific applications. Classical one-sided penalty methods, along with primal-dual methods gated by complementary slackness, provide constraint gradients only at violated locations, resulting in fragile satisfaction. Architectures that guarantee feasibility by construction, on the other hand, remain largely limited to elementary cases and impose additional inductive biases. We introduce neural slack variables, a deep learning native primal-side approach that converts constraint enforcement into a regression problem by coupling the primary network with a jointly learned auxiliary network. The auxiliary network serves as a valid target for the primary network's constraint quantities, inducing feasibility and regularity. Neural slack variables achieve zero measured violations on dense-grid monotonicity and convexity test cases, where penalty and primal-dual baselines leave residual violations, and enable arbitrage-free learning of volatility surfaces, an open industrial challenge in quantitative finance.
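
A toy sketch of the slack-variable coupling for a monotonicity constraint $f'(x) \ge 0$. To stay self-contained it replaces both networks with cubic polynomials trained by plain gradient descent: the primary model $f$ fits the data, and an auxiliary model passed through softplus yields a nonnegative slack $s(x)$ onto which $f'$ is regressed over a dense grid. The squared-error coupling, the polynomial models, the weight `lam`, and all names are assumptions for illustration.

```python
import numpy as np


def softplus(u):
    return np.logaddexp(0.0, u)                      # log(1 + e^u): smooth and always >= 0


def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))                  # derivative of softplus


def features(x, degree):
    """Monomial features [1, x, x^2, ...], shape (len(x), degree + 1)."""
    return np.stack([x ** k for k in range(degree + 1)], axis=1)


def feature_derivs(x, degree):
    """d/dx of each monomial feature, so f'(x) is linear in theta."""
    return np.stack([k * x ** (k - 1) if k > 0 else np.zeros_like(x)
                     for k in range(degree + 1)], axis=1)


def fit_monotone(x, y, degree=3, lam=1.0, lr=0.01, steps=2000, grid_size=50, seed=0):
    """Primary f(x) = features(x) @ theta; slack s(x) = softplus(features(x) @ phi) >= 0.
    The constraint f'(x) >= 0 is enforced by regressing f' onto s over a dense grid."""
    rng = np.random.default_rng(seed)
    theta = rng.normal(scale=0.1, size=degree + 1)
    phi = rng.normal(scale=0.1, size=degree + 1)
    grid = np.linspace(x.min(), x.max(), grid_size)
    Phi, D, Psi = features(x, degree), feature_derivs(grid, degree), features(grid, degree)
    history = []
    for _ in range(steps):
        r_fit = Phi @ theta - y                      # data-fit residual
        slack = softplus(Psi @ phi)                  # learned nonnegative target for f'
        r_con = D @ theta - slack                    # coupling residual f'(g) - s(g)
        history.append(np.mean(r_fit ** 2) + lam * np.mean(r_con ** 2))
        g_theta = 2 * Phi.T @ r_fit / len(x) + 2 * lam * D.T @ r_con / len(grid)
        g_phi = -2 * lam * Psi.T @ (r_con * sigmoid(Psi @ phi)) / len(grid)
        theta = theta - lr * g_theta
        phi = phi - lr * g_phi
    return theta, phi, grid, history
```

Where $f' \ge 0$ the slack can match it and the coupling residual vanishes; where $f' < 0$ the slack cannot follow below zero, so the residual keeps pushing $f'$ upward. Unlike a one-sided penalty, the coupling term is not switched off at satisfied points: it also pulls $f'$ toward the smooth slack there, which is one reading of the regularity the abstract mentions.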

2606.13801 2026-06-15 cs.LG q-bio.NC New submission

Neural Variability Enhances Artificial Network Robustness

Robin Preble, Praveen Venkatesh, Stefan Mihalas, Kameron Decker Harris

Affiliations: Department of Computer Science, Western Washington University; Allen Institute

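AI summary: Studies whether introducing structured noise that mimics cortical neural variability improves the robustness of artificial neural networks to adversarial attacks and naturalistic image modifications; finds that noise structure can substantially increase robustness and that noise structure derived from adversarial attacks generalizes to other attack types.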

Abstract

Neural responses in cortex exhibit substantial trial-to-trial variability in response to repeated stimuli, while peripheral sensory neurons respond far more consistently, leading many to wonder whether stochasticity may carry meaning. Existing work has argued that noise and signal correlations may be optimized for discrimination in animals, whereas artificial neural network (ANN) studies have shown similar benefits of noise in machine learning tasks, although most ANN work has neglected the effects of correlations. Here we investigate whether correlated noise improves the robustness of artificial neural networks to adversarial attacks and naturalistic image modifications. Using the covariance of activations under modified versus clean inputs, we find that structured noise may significantly improve network robustness. Robustness to naturalistic image modifications benefits most from structure, but this structure transfers poorly across modification types. In contrast, noise structure from adversarial attacks can generalize to other kinds of attacks. These results suggest that structured noise in ANN activations generally improves robustness, establishing a biologically plausible strategy for creating robust artificial neural networks that only relies on local information.
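
The covariance-based noise described in the abstract can be sketched with NumPy: estimate the covariance of the activation shift that a modification induces, then inject zero-mean Gaussian noise with that covariance into the same layer. The array layout (rows are samples, columns are units), the `scale` knob, and the function names are assumptions for illustration, not the authors' implementation.

```python
import numpy as np


def noise_covariance(acts_clean, acts_modified):
    """Covariance of the activation shift caused by a modification; rows are samples, columns units."""
    delta = acts_modified - acts_clean
    return np.cov(delta, rowvar=False)


def structured_noise(cov, n_samples, scale, rng):
    """Draw zero-mean Gaussian noise whose correlations follow `cov`."""
    return rng.multivariate_normal(np.zeros(cov.shape[0]), scale * cov, size=n_samples)


def noisy_layer(acts, cov, scale, rng):
    """Add structured noise to a batch of activations of shape (batch, units)."""
    return acts + structured_noise(cov, acts.shape[0], scale, rng)
```

A natural control is isotropic noise with the same total variance, `np.eye(n) * np.trace(cov) / n`, which separates the effect of the noise's structure from that of its magnitude.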

2606.13767 2026-06-15 cs.LG cs.AI cs.IT math.IT New submission

Beyond LoRA: Is Sparsity-Induced Adaptation Better?

Elijah Cadenhead, Cristian McGee, Xin Li, El Houcine Bergou, Aritra Dutta

Affiliations: School of Data, Mathematical and Statistical Sciences, University of Central Florida, United States; College of Computing, Mohammed VI Polytechnic University (UM6P), Morocco; Department of Computer Science, University of Central Florida, United States

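AI summary: Proposes Cheap LoRA (cLA) and its variants, which introduce sparsity into LoRA for parameter-efficient fine-tuning; derives generalization error bounds and shows experimentally that performance is comparable to parameter-matched baselines across many tasks while reducing training time and peak GPU memory.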

Comments Overview of the paper and code can be found here: https://elicaden.github.io/Beyond_LoRA/

Abstract

Low-rank adaptation (LoRA) and its variants provide a memory- and compute-efficient alternative to full fine-tuning of pre-trained models. However, questions remain about the comparative generalizability of these approaches and how the structural restrictions on low-rank updates preserve effective adaptation performance. We present a historical framing, covering the past (full fine-tuning and original LoRA), the present (different variants of LoRA), and propose simpler, cheaper, parameter-efficient extensions by inducing sparsity within existing LoRA variants: Cheap LoRA (cLA), training a single low-rank factor with the other fixed (deterministically or, in its randomized variant, stochastically), and the chained circulant variant, ${c}^3$LA. We frame cLA as a structured instance of asymmetric LoRA, serving as a controlled column-subspace restriction of full fine-tuning. We derive information-theoretic generalization error bounds for these variants, marking one of the first endeavors in this area. Empirically, we evaluate 11 fine-tuning methods across 10 pre-trained models and 14 datasets, analyzing the fine-tuned models' performance and generalization using tools such as loss landscapes and spectral analysis. Despite the sensitivity of fine-tuned models to the pre-trained model, datasets, and other factors, our study suggests that restricting LoRA-based PEFT methods' adaptation to a sparse, structured column space remains competitive across tasks with their parameter-matched baselines while reducing training time by up to 10% and peak GPU memory by up to 15%, even with a naïve, non-optimized, sparse implementation. Our theoretical and empirical generalization measures provide a more consistent and principled approach to their cost-effective adaptation than commonly used analytical tools. Overview and code are available at: https://elicaden.github.io/Beyond_LoRA/.
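
A minimal NumPy sketch of the "train one factor, freeze the other" idea on a linear layer $W = W_0 + BA$. Assumptions: the frozen factor is a random $B$ (as in a randomized variant) and the trainable factor is $A$, so every learned update stays in the column space of $B$, which matches the column-subspace framing above; which factor the paper fixes, and how, is not stated here. Function names are hypothetical.

```python
import numpy as np


def init_cla(d_out, d_in, rank, rng):
    """Frozen random B (d_out x rank) and trainable A (rank x d_in), zero-initialized."""
    B = rng.standard_normal((d_out, rank)) / np.sqrt(rank)
    A = np.zeros((rank, d_in))                         # zero init: the adapter starts as a no-op
    return B, A


def cla_forward(X, W0, B, A):
    """Outputs of the adapted linear layer for inputs X of shape (n, d_in)."""
    return X @ (W0 + B @ A).T


def cla_step(X, Y, W0, B, A, lr):
    """One gradient step on mean squared error; only A changes, B stays frozen."""
    R = cla_forward(X, W0, B, A) - Y                   # residuals, shape (n, d_out)
    grad_W = 2 * R.T @ X / len(X)                      # dL/dW, shape (d_out, d_in)
    return A - lr * (B.T @ grad_W)                     # chain rule: dL/dA = B^T dL/dW
```

Only `A` (rank × d_in entries) receives gradients, so at the same rank the trainable parameter count is roughly half of standard LoRA's when d_in and d_out are similar.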

2606.13756 2026-06-15 cs.CL New submission

QIAS 2026: Overview of the Shared Task on Islamic Inheritance Reasoning

Abdessalam Bouchekif, Somaya Eltanbouly, Samer Rashwani, Shahd Gaben, Mutaz Al-Khatib, Heba Sbahi, Emad Mohamed, Mohammed Ghaly

Affiliations: Hamad bin Khalifa University; Nazarbayev University

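AI summary: Introduces the QIAS 2026 shared task, which evaluates the complex reasoning ability of large language models in the domain of Islamic inheritance. Based on the MAWARITH dataset and evaluated with the multi-step MIR-E metric, it drew 16 participating teams; the results show the task remains highly challenging for current models.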

Abstract

This paper presents a comprehensive overview of the QIAS 2026 shared task, organized as part of the OSACT7 Workshop and co-located with LREC 2026. The shared task was designed to evaluate the ability of large language models to perform complex reasoning in the religious and legal domain of Islamic inheritance. Unlike conventional question-answering benchmarks, QIAS 2026 focuses on end-to-end reasoning from natural language cases, requiring systems to perform the full inheritance calculation process, from identifying the eligible heirs to assigning the correct share to each beneficiary. To support this evaluation, the task was based on the MAWARITH benchmark, a dataset of $12{,}500$ Arabic inheritance cases annotated with intermediate reasoning steps and final answers. System submissions were evaluated using MIR-E, a multi-step metric that measures performance across the main stages of inheritance reasoning. A total of $16$ teams participated in the shared task, investigating a range of approaches, including prompting-based methods, retrieval-augmented generation, and fine-tuning strategies. The results show that Islamic inheritance remains a highly challenging benchmark for current language models, especially in stages that require precise legal interpretation and structured numerical reasoning. This overview summarizes the task design, dataset, evaluation framework, participating systems, and main results.

2606.13754 2026-06-15 cs.LG New submission

D2H-AD: A Hybrid Model Utilizing Hyperdimensional Computing for Advanced Anomaly Detection

Ghazal Ghajari, Elaheh Ghajari, Ashutosh Ghimire, Saeid Ataei, Faris Alsulami, Fathi Amsaad

Affiliations: Wright State University; Azad University; Stevens Institute of Technology; University of Jeddah

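AI summary: Proposes D2H-AD, an anomaly detection framework based on hyperdimensional computing that unifies distance-based similarity and density-aware encoding in one representation; it outperforms existing methods on multiple benchmark datasets and is lightweight, interpretable, and efficient.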

Abstract

Anomaly detection is a fundamental component of intelligent systems with applications in healthcare, cybersecurity, smart grids, and IoT environments. Although conventional machine learning and deep learning methods have demonstrated effectiveness in identifying anomalies, they often rely on large labeled datasets, incur high computational costs, and face scalability challenges in edge and high-dimensional settings. This paper presents D2H-AD, a novel anomaly detection framework based on Hyperdimensional Computing (HDC), a brain-inspired paradigm that represents information using high-dimensional distributed vectors. Unlike existing HDC-based methods, D2H-AD integrates distance-based similarity and density-aware encoding within a unified framework, improving anomaly representation and detection performance. Ablation studies show that hyperdimensional encoding alone yields up to 5.4% higher ROC-AUC than applying the same density-distance scoring directly in the original feature space. Furthermore, D2H-AD consistently outperforms five established baselines, namely HDAD, ODHD, One-Class SVM, Isolation Forest, and Autoencoders, across all evaluated datasets. The framework is lightweight, interpretable, and computationally efficient, making it suitable for resource-constrained and real-time applications. We validate D2H-AD on five benchmark datasets and demonstrate superior F1-score and ROC-AUC performance, together with robustness to class imbalance, noise, and data complexity. In addition to improved accuracy, D2H-AD offers scalability, a small memory footprint, and low-latency operation enabled by binary computations and a compact design. These properties make it particularly attractive for TinyML and edge AI deployments. The proposed framework highlights the potential of HDC for accurate, interpretable, and energy-efficient anomaly detection in dynamic environments.
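
A toy sketch of combining a distance term and a density term in hyperdimensional space. Assumptions, not the paper's design: features are encoded by random projection followed by a sign (which preserves angles but not magnitudes, so features should be standardized or centered in practice), the normal-class prototype is a majority-vote bundle, density is the mean Hamming distance to the `k` nearest stored hypervectors, and the two terms are mixed with a weight `alpha`. The class and parameter names are hypothetical.

```python
import numpy as np


class HDAnomalyScorer:
    """Bipolar random-projection encoding, then a distance-to-prototype term plus a k-NN
    density term, both measured by normalized Hamming distance (higher = more anomalous)."""

    def __init__(self, n_features, dim=4096, k=5, alpha=0.5, seed=0):
        rng = np.random.default_rng(seed)
        self.proj = rng.standard_normal((n_features, dim))   # random base vectors (assumed encoder)
        self.k, self.alpha = k, alpha

    def encode(self, X):
        return np.where(X @ self.proj >= 0, 1, -1).astype(np.int8)    # bipolar hypervectors

    def fit(self, X_normal):
        self.memory = self.encode(X_normal)
        self.prototype = np.where(self.memory.sum(axis=0) >= 0, 1, -1)  # majority-vote bundling
        return self

    @staticmethod
    def hamming(H, h):
        return np.mean(H != h, axis=-1)                       # normalized Hamming distance in [0, 1]

    def score(self, X):
        scores = []
        for h in self.encode(X):
            dist_proto = self.hamming(h, self.prototype)
            knn = np.sort(self.hamming(self.memory, h))[: self.k]
            scores.append(self.alpha * dist_proto + (1 - self.alpha) * knn.mean())
        return np.array(scores)
```

The `int8` hypervectors here could be packed into bits, so that the Hamming comparisons reduce to XOR and popcount; binary operations of this kind are what the abstract credits for the small memory footprint and low latency.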

2606.13753 2026-06-15 cs.LG cs.AI New submission

The Weight Norm Sets the Grokking Timescale: A Causal Delay Law

Truong Xuan Khanh, Doan Hoang Viet, Luu Duc Trung, Phan Thanh Duc

Affiliations: H&K Research Studio / Clevix LLC; Bac A Bank; Banking Academy of Vietnam

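AI summary: By intervening on the weight norm during training, finds that networks grok when the norm reaches a critical value Wc and that the grokking delay grows exponentially with the multiple at which the norm is held, revealing a causal role of the norm in grokking.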

Comments 14 pages, 9 figures, 3 tables

Abstract

Grokking is the delayed onset of generalization in neural networks, arising long after they fit the training data. Whether the weight norm causes this delay is disputed: some studies report a critical norm at the transition, others observe grokking with no fixed norm at all. We settle this by intervening on the norm during training rather than only observing it. Under free training with weight decay, networks grok when the weight norm reaches a value $W_c$ that varies little across seeds and learning rates (coefficient of variation 1 to 2%) and grows with the modular base as a power law. When we instead clamp the norm to a fixed multiple $\rho$ of $W_c$ and hold it there, the network still groks, but the delay follows $T_{\mathrm{grok}} \propto \exp(\alpha\rho)$. One exponent, $\alpha \approx 7.5$, fits this delay across four moduli ($R^2 = 0.996$). Over the swept ranges, varying the held norm changes the delay by about 19x, whereas varying the learning rate changes it by only about 2x, and holding the norm above $W_c$ slows grokking rather than preventing it. A final LayerNorm removes the dependence by decoupling weight scale from the network function; without it the exponential law returns. This pinned-norm delay is the exponential counterpart to the logarithmic delay predicted for a freely contracting norm.
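
Given measured delays at several held multiples $\rho$, the exponent in $T_{\mathrm{grok}} \propto \exp(\alpha\rho)$ can be estimated by ordinary least squares on $\log T$. A minimal sketch; the function names are hypothetical, and $R^2$ is computed in log space, which may differ from how the paper reports it. The second helper inverts the law: with $\alpha \approx 7.5$, a 19x change in delay would correspond to a change in $\rho$ of $\ln 19 / 7.5 \approx 0.39$, if the law held over the whole sweep.

```python
import math


def fit_delay_law(rhos, delays):
    """Least-squares fit of log T = log C + alpha * rho; returns (alpha, C, r_squared in log space)."""
    logs = [math.log(t) for t in delays]
    n = len(rhos)
    mx, my = sum(rhos) / n, sum(logs) / n
    sxx = sum((x - mx) ** 2 for x in rhos)
    sxy = sum((x - mx) * (y - my) for x, y in zip(rhos, logs))
    alpha = sxy / sxx
    intercept = my - alpha * mx
    ss_res = sum((y - intercept - alpha * x) ** 2 for x, y in zip(rhos, logs))
    ss_tot = sum((y - my) ** 2 for y in logs)
    return alpha, math.exp(intercept), 1 - ss_res / ss_tot


def rho_span_for_ratio(ratio, alpha):
    """Change in the held multiple rho that multiplies the grokking delay by `ratio`."""
    return math.log(ratio) / alpha
```

For example, `rho_span_for_ratio(19, 7.5)` gives about 0.3926.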

2606.13748 2026-06-15 cs.LG New submission

FedSPC: Shared Parameter Correction for Personalized Federated Learning

Kannanthodath Induchoodan Ajay Menon, Christian Prehofer, Yunfei Xu, Toru Hirano

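Affiliations: DENSO AUTOMOTIVE Deutschland GmbH; DENSO International America, Inc.; Technical University of Munich

AI Summary: Targets conflicting updates to shared parameters in personalized federated learning, which arise because clients optimize inconsistent local objectives. Proposes FedSPC, a modular correction method that applies control-variate correction only to the shared parameters, improving performance across several PFL settings.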

Comments Accepted for presentation at FL@FM-IJCAI'26, in conjunction with IJCAI 2026. 9 pages

Abstract

Personalized federated learning (PFL) is one of the important approaches in federated learning for addressing statistical heterogeneity while enabling client-specific adaptation. Many PFL methods split the model into shared and personalized parameters, which are jointly trained on each client. However, this creates an optimization issue: shared parameters are updated by clients optimizing different local objectives, which can lead to inconsistent shared updates and weaken the shared representation. To address this problem, we propose Federated Shared Parameter Correction (FedSPC), a modular correction method for PFL. FedSPC applies control-variate correction only to the shared parameters of a given PFL method, while leaving personalized parameters unchanged. It can be integrated into three common PFL settings: shared feature extractors, shared classifiers, and fully shared models with local regularization. Experiments on CIFAR-100 and Tiny-ImageNet with ViT, ResNet-34, and VGG-11 show that FedSPC improves performance across representative PFL methods, including FedPer, FedRep, FedBABU, LG-FedAvg, and Ditto.
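
The abstract does not give FedSPC's exact update rule, so the sketch below borrows the SCAFFOLD-style correction (local gradient minus the client control variate plus the server control variate) and applies it only to parameters listed in `shared_keys`, leaving personalized parameters on plain SGD. Parameters are dicts of float lists purely for illustration, and both function names are hypothetical.

```python
def corrected_local_step(params, grads, shared_keys, c_global, c_local, lr):
    """One local SGD step; only parameters in shared_keys get a control-variate correction.

    params, grads: dict mapping parameter name -> list of floats.
    c_global, c_local: server and client control variates, keyed by shared names only.
    """
    new_params = {}
    for name, values in params.items():
        g = grads[name]
        if name in shared_keys:
            # SCAFFOLD-style correction: g - c_i + c
            g = [gi - ci + cs for gi, ci, cs in zip(g, c_local[name], c_global[name])]
        new_params[name] = [v - lr * gi for v, gi in zip(values, g)]
    return new_params


def refresh_client_control(c_local, c_global, start, end, shared_keys, lr, num_steps):
    """SCAFFOLD 'option II' control-variate refresh, restricted to shared parameters.

    start: shared parameters received from the server; end: after num_steps local steps.
    """
    return {
        name: [
            ci - cs + (s - e) / (num_steps * lr)
            for ci, cs, s, e in zip(c_local[name], c_global[name], start[name], end[name])
        ]
        for name in shared_keys
    }
```

With a single local step (`num_steps=1`), the refreshed control variate equals the raw gradient of the shared parameters, which is a quick sanity check for an implementation of this kind.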

2606.13746 2026-06-15 cs.RO New submission

Scalable Dynamic Tactile Sensing Enabled by Passive and Flexible Acoustic Waveguides

Guimin Long, Changhong Linghu, Chuanping Liu, Ke Xu, Xingjian Jing

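Affiliations: Department of Mechanical Engineering, City University of Hong Kong

AI Summary: Proposes a passive distributed tactile-sensing paradigm based on deep sub-wavelength acoustic waveguides, using elastic-membrane-capped Helmholtz resonators and a spring-reinforced microtube network to achieve bending invariance. Combined with a sparse microphone array and a lightweight neural network, it achieves 4 mm spatial resolution and >99% localization accuracy in a 64-node array read by 4 microphones, supports waveform reconstruction of low-frequency signals, and is demonstrated in prototypes including fingertip arrays, a tactile glove, and large-area skins.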

Comments 40 pages, 6 figures

Abstract

Artificial dynamic tactile sensing requires sensitivity, robustness, and compliance, yet existing technologies face trade-offs when scaling to large-area arrays, compounded by wiring complexity and cost. Here, we report a passive distributed paradigm using deep sub-wavelength acoustic waveguides that decouples performance from structural flexibility. Elastic-membrane-capped Helmholtz resonators interconnected by spring-reinforced microtubes form an enclosed network whose acoustic transmission remains invariant under macroscopic bending. By sparsely embedding microphones, the system achieves real-time localization (4 mm highest spatial resolution; >99% accuracy in a 64-node sensing array read by 4 microphones) and waveform reconstruction of low-frequency signals (<100 Hz). A fast continuous wavelet transform and a lightweight neural network enable inference within 5.5 ms. We demonstrate conformable prototypes (fingertip arrays, a tactile glove, and large-area skins) that detect stimuli including single-hair contact, 5-mg particle impacts, arterial pulse waves, feather touches, and finger contact. This establishes a scalable, flexible, low-cost paradigm for next-generation human-machine interfaces.

2606.13742 2026-06-15 cs.LG cs.AI physics.comp-ph physics.flu-dyn stat.ML New submission

A fully GPU-based workflow for building physics emulators of hypersonic flows

Fabian Paischer, Dylan Rubini, Deniz A. Bezgin, Aaron B. Buhendwa, David Hauser, Florian Sestak, Johannes Brandstetter, Sebastian Kaltenbach, Nikolaus A. Adams

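Affiliations: TU Munich; Institute for Machine Learning, JKU Linz; ELLIS Unit; EMMI AI

AI Summary: Proposes a fully GPU-based workflow that integrates accelerated data generation with the training of neural emulators enhanced by uncertainty quantification, using the differentiable solver JAX-Fluids for residual-driven refinement that improves physical consistency and supports extrapolation.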

Comments First authors contributed equally

Abstract

The ability to resolve complex physical phenomena with high fidelity and at low computational cost is central to addressing key challenges in modern engineering. A prime example lies in hypersonic flows, where the precise prediction of the full flowfield topology, in particular with respect to shock wave location and intensity, is critical. Yet supersonic and hypersonic flows continue to be a stumbling block for traditional reduced-order models and neural emulators that struggle to capture steep gradients in flow states with physical consistency in applications of industrial relevance. To that end, we introduce a fully GPU-based workflow that integrates accelerated data generation with the training of neural emulators augmented by uncertainty quantification and physics-aware refinement. Our workflow is enabled by a differentiable high-fidelity solver (JAX-Fluids) which we employ for rapid dataset creation and residual-based improvement of the neural emulator to enhance physical consistency. Building on this framework, we first present a suite of model architectures and analyze their scaling behavior to expose their strengths and shortcomings. We then show that residual-based refinement enables training on cases where only mesh and input parameters are available, substantially reducing residuals and improving physical consistency. Together, differentiable simulation and residual-based refinement yield physics emulators that remain reliable beyond their training distribution, a key requirement for deploying surrogates in real-world engineering design loops.

2606.13741 2026-06-15 cs.LG New submission

High-Frequency Pricing at Scale for E-Commerce

Stefan Birr, Tobias Huelden, Mones Raslan, Adele Gouttes, Andreas Schmitt, Mateusz Koren, Johannes Stephan, Robert Streek, Manuel Kunz, Tim Januschowski

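Affiliations: Zalando SE; Databricks

AI Summary: Proposes a forecast-then-optimize framework that combines gradient-boosted trees with multi-objective optimization for daily, high-frequency pricing of promotional campaigns in fashion e-commerce; validated through 23 A/B tests, it increased profit by about 6%.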

Abstract

This paper presents the design, development, and implementation of a specialized forecast-then-optimize algorithmic pricing tool for sales campaigns in fashion e-commerce. Sales events present unique challenges for pricing including volatile demand patterns, rapid pricing decisions, and the need to balance short-term revenue with long-term profitability. We describe our approach combining daily-resolution demand forecasting using gradient-boosted trees with a multi-objective optimization framework that maximizes both long-term profit and net merchandise value for more than 5 million articles. Our solution addresses key limitations of existing weekly-granularity systems by implementing a forecast-then-optimize architecture that reduces pricing decision time from hours to minutes. We validate our approach through 23 A/B tests across 12 markets during 2023-2024 sales campaigns at Zalando, one of Europe's leading online fashion retailers. Experimental results demonstrate that the new pricing system achieves approximately 6% higher profit while maintaining equivalent performance on sales and revenue compared to the previous manual-algorithmic hybrid approach. Based on these results, the algorithm was successfully deployed to production and now handles the majority of algorithmic pricing decisions for sales campaigns at the company.
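
A toy version of the forecast-then-optimize loop, for intuition only: the demand forecast is replaced by a constant-elasticity curve (the paper uses gradient-boosted trees), net merchandise value (NMV) is approximated by revenue, and the objective is a weighted sum with a hypothetical weight `weight_nmv`. The class, function names, and numbers are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Article:
    base_price: float
    unit_cost: float
    base_demand: float  # forecast units per day at base_price (hypothetical input)
    elasticity: float   # assumed constant price elasticity, negative


def forecast_demand(article, price):
    """Stand-in demand forecast: a constant-elasticity curve (not gradient-boosted trees)."""
    return article.base_demand * (price / article.base_price) ** article.elasticity


def choose_discount(article, discounts, weight_nmv):
    """Return the discount maximizing profit + weight_nmv * NMV.

    NMV is approximated here by revenue, which is an assumption.
    """
    best_score, best_discount = None, None
    for d in discounts:
        price = article.base_price * (1.0 - d)
        units = forecast_demand(article, price)
        profit = (price - article.unit_cost) * units
        nmv = price * units
        score = profit + weight_nmv * nmv
        if best_score is None or score > best_score:
            best_score, best_discount = score, d
    return best_discount


if __name__ == "__main__":
    shirt = Article(base_price=100.0, unit_cost=60.0, base_demand=10.0, elasticity=-3.0)
    grid = [0.0, 0.1, 0.2, 0.3]
    print(choose_discount(shirt, grid, weight_nmv=0.0))  # profit only
    print(choose_discount(shirt, grid, weight_nmv=1.0))  # profit plus NMV
```

In this toy setting, weighting NMV shifts the choice toward deeper discounts, which is the kind of trade-off a multi-objective formulation has to balance; a production system would run such a search per article across millions of articles.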

2606.13740 2026-06-15 cs.LG New submission

Efficient On-Device Diffusion LLM Inference with Mobile NPU

Tuowei Wang, Yanfan Sun, Ju Ren

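Affiliations: Tsinghua University; Beihang University

AI Summary: Proposes Diffusion-LLM-on-NPU, the first NPU-aware inference framework, which accelerates diffusion LLM inference on mobile devices through multi-block speculative decoding, dual-path progressive revision, and a swap-optimized memory runtime, reducing latency by 17-42x compared with a CPU baseline.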

Abstract

Diffusion large language models (dLLMs) accelerate generation by denoising multiple tokens in parallel, making them attractive for latency-sensitive mobile inference. However, repeated denoising introduces substantial computation on smartphones. Mobile neural processing units (NPUs) offer high-throughput dense matrix computation, but efficiently exploiting them remains challenging: token commitment shrinks per-block effective workloads, token revision complicates KV cache reuse, and limited NPU-visible address space incurs costly remapping and data transfer overheads. In this paper, we propose llada.cpp, the first NPU-aware inference framework for accelerating dLLMs on smartphones. llada.cpp aligns block-wise dLLM inference with the execution characteristics of mobile NPUs through three techniques. (1) Multi-Block Speculative Decoding fills the shrinking workload in late-stage current-block decoding with speculative future-block tokens. (2) Dual-Path Progressive Revision keeps committed tokens revisable until stable and refreshes unstable tokens through a CPU-side path without stalling dense NPU execution. (3) Swap-Optimized Memory Runtime compacts NPU-visible address layouts and overlaps data staging with NPU computation to reduce remapping and transfer overheads. We implement llada.cpp as an end-to-end framework and evaluate it across diverse hardware platforms and dLLM workloads. llada.cpp reduces LLaDA-8B generation latency by 17x-42x over the CPU baseline with prefix KV cache reuse, while preserving generation quality.
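
One way to picture the workload-filling idea behind multi-block speculative decoding: the NPU kernel processes a fixed number of token slots per denoising step, and as tokens in the current block are committed, the freed slots are filled with positions from the next block. The sketch assumes a fixed width `npu_width`, a toy commit rule, and hypothetical function names; it is not the paper's scheduler and ignores whether speculative tokens are later kept.

```python
def plan_step(current_uncommitted, next_block_positions, npu_width):
    """Choose token positions for one denoising step on a fixed-width NPU kernel.

    Uncommitted positions of the current block come first; any spare slots are
    filled with positions from the next block (speculative work).
    """
    current = list(current_uncommitted)[:npu_width]
    spare = npu_width - len(current)
    speculative = list(next_block_positions)[:spare]
    return current, speculative


def average_occupancy(block_size, commits_per_step, npu_width, speculate):
    """Mean fraction of kernel slots occupied while one block is being decoded."""
    uncommitted = list(range(block_size))
    next_block = list(range(block_size, 2 * block_size)) if speculate else []
    occupancy = []
    while uncommitted:
        current, speculative = plan_step(uncommitted, next_block, npu_width)
        occupancy.append((len(current) + len(speculative)) / npu_width)
        # Toy commit rule: a fixed number of tokens is committed after each step.
        uncommitted = uncommitted[commits_per_step:]
    return sum(occupancy) / len(occupancy)


if __name__ == "__main__":
    for speculate in (False, True):
        print(speculate, average_occupancy(8, 2, 8, speculate))
```

Without speculation, occupancy decays as the block empties (8, 6, 4, then 2 of 8 slots in this toy run); with speculation, every step uses the full width.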

2606.13736 2026-06-15 cs.CV New submission

Connections Between Pairs of Filters Improve the Accuracy of Convolutional Neural Networks

Kathleen Anderson, Philipp Grüning, Erhardt Barth

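Affiliations: GitHub

AI Summary: Proposes learnable connection functions between pairs of filters in convolutional neural networks as an alternative to conventional pointwise nonlinear activations, improving performance by letting the connection adapt differently in each layer.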

Comments IJCNN 2023

Abstract

While researchers continue to find new and improved network structures for CNNs, most of the newly invented architectures still rely on the traditional pattern of stacking convolutional blocks and separating them with pointwise activation functions. However, there are drawbacks to a network purely building on pointwise nonlinearities. One alternative is to introduce a pairwise connection between two filters of a network. Typical connection functions use multiplications or the minimum operation to realize logical AND connections. In this paper, we go one step further by demonstrating that CNNs can benefit from more general connections, which include parameters that are learned. With such parameters, the network is able to implement different connections in different network layers and better adapt the connection function to the task at hand.
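
Product and minimum both behave like a soft logical AND on nonnegative activations: the output is large only when both inputs are. The sketch contrasts them with a hypothetical parametric connection whose weights `w_prod`, `w_min`, `w_a`, `w_b` could be learned per layer; this blended form only illustrates "connections with learnable parameters" and is not the connection function defined in the paper. Feature maps are flattened to lists for brevity.

```python
def and_product(a, b):
    """Elementwise product of two (flattened) filter responses: a soft AND."""
    return [x * y for x, y in zip(a, b)]


def and_min(a, b):
    """Elementwise minimum: another AND-like connection."""
    return [min(x, y) for x, y in zip(a, b)]


def learnable_connection(a, b, w_prod, w_min, w_a, w_b):
    """Hypothetical parametric pairwise connection.

    (1, 0, 0, 0) recovers the product and (0, 1, 0, 0) the minimum; other
    settings, learned separately in each layer, give intermediate behaviors.
    """
    return [
        w_prod * x * y + w_min * min(x, y) + w_a * x + w_b * y
        for x, y in zip(a, b)
    ]


if __name__ == "__main__":
    relu_a = [0.0, 1.0, 0.5]  # responses of filter A after ReLU
    relu_b = [1.0, 1.0, 0.2]  # responses of filter B after ReLU
    print(and_product(relu_a, relu_b))
    print(and_min(relu_a, relu_b))
    print(learnable_connection(relu_a, relu_b, 0.5, 0.5, 0.0, 0.0))
```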

2606.13734 2026-06-15 cs.AI New submission

AI Receptivity or AI Adoption Breadth? A Tool-Specific Reanalysis of the Lower-Literacy/Higher-Usage Link

Hristo Inouzhe

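Affiliations: Universidad Autónoma de Madrid

AI Summary: Reanalyzes the study by Tully et al. (2025) and finds that the negative association between AI literacy and AI usage varies by tool type: lower literacy predicts only the breadth of adoption of non-text AI tools, not the intensity of their use.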

Comments 11 pages, 2 tables, 1 figure

Abstract

Recent evidence reported by Tully, Longoni, and Appel (2025) suggests that lower artificial intelligence (AI) literacy predicts greater receptivity toward AI. We revisit this claim using the public data from Study 3 of that article, which measures past usage of five AI tool categories on a five-point frequency scale. We first reproduce the negative association between AI literacy and aggregate AI usage using OLS on participant-level averages, binary logit, ordered logit, and multinomial logit specifications. We then show that the aggregate relationship masks substantial heterogeneity by tool type. In our demographic-adjusted primary specification, AI literacy does not significantly predict text AI usage (ordered-logit $\beta = -0.090$, p = .387), whereas it remains a strong predictor of non-text AI adoption ($\beta = -0.377$, p < .001). The non-text effect is also robust under Tully et al.'s original Study 3 control specification ($\beta = -0.502$, p < .001). Binary, ordered-logit, and multinomial specifications suggest that the non-text relationship is primarily an adoption/non-adoption pattern rather than evidence of intensive use: the demographic-adjusted odds ratio of ever having used a non-text AI tool is 0.68. Thus, in the study that measures self-reported past usage rather than stated preferences, the evidence does not support a simple claim that lower AI literacy predicts greater receptivity to AI in general. It points instead to a narrower pattern of broader adoption across lower-penetration, non-text AI tools.
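
For readers relating the coefficients to odds ratios: a logit coefficient $\beta$ corresponds to an odds ratio of $e^{\beta}$ per one-unit increase in the predictor (here, the AI-literacy score on the original study's scale). The sketch converts the coefficients quoted above and shows how an ordered (proportional-odds) logit turns a linear predictor into category probabilities; the cutpoints are made up. The abstract's 0.68 comes from a separate binary ever-used model, so it need not equal $e^{-0.377}$ exactly.

```python
import math


def odds_ratio(beta):
    """Odds ratio per one-unit increase in the predictor for a logit coefficient."""
    return math.exp(beta)


def ordered_logit_probs(eta, cutpoints):
    """Category probabilities in a proportional-odds (ordered logit) model.

    P(Y <= j) = sigmoid(kappa_j - eta) for increasing cutpoints kappa_j, so a
    larger linear predictor eta shifts probability toward higher categories.
    """
    def sigmoid(z):
        return 1.0 / (1.0 + math.exp(-z))

    cumulative = [sigmoid(k - eta) for k in cutpoints] + [1.0]
    probs, previous = [], 0.0
    for c in cumulative:
        probs.append(c - previous)
        previous = c
    return probs


if __name__ == "__main__":
    for label, beta in [("text", -0.090), ("non-text", -0.377), ("non-text, original controls", -0.502)]:
        print(f"{label}: odds ratio {odds_ratio(beta):.3f}")
    print(f"coefficient implied by an odds ratio of 0.68: {math.log(0.68):.3f}")
    # Illustrative five-point usage scale with made-up cutpoints.
    print(ordered_logit_probs(eta=0.0, cutpoints=[-1.0, 0.0, 1.0, 2.0]))
```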

2606.13732 2026-06-15 cs.AI New submission

When Sample Selection Bias Precipitates Model Collapse

Xinbao Qiao, Xianglong Du, Wei Liu, Jingqi Zhang, Peihua Mai, Meng Zhang, Yan Pang

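Affiliations: University of Science and Technology of China

AI Summary: Shows that in low-resource verification settings, data selection based on locally biased reference distributions accelerates rather than prevents model collapse, and proposes Wasserstein proxy references built collaboratively across multiple data silos to mitigate diversity degradation.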

Comments Accepted at the 43rd International Conference on Machine Learning (ICML 2026)

Abstract

The proliferation of recursive training on synthetic data can alleviate data scarcity but risks model collapse, where repeated training erodes distributional tails and homogenizes outputs. Data selection is widely viewed as a remedy, yet its reliability depends critically on the reference distribution used by the verifier. We show that in low-resource verification regimes, where each verifier observes only a small, fragmented, and biased slice of the target manifold, selection itself becomes biased. This situation naturally arises in low-resource data silos such as healthcare consortia or proprietary financial institutions, where raw data cannot be pooled and local references are inherently incomplete. As a result, selection preferentially retains samples aligned with the local manifold while pruning globally relevant tail modes, turning from a safeguard against collapse into a mechanism that precipitates it. We theoretically prove that such siloed selection accelerates collapse and induces power-law diversity decay. As an initial mitigation, we construct Wasserstein proxy references from multiple silos without sharing raw data. Empirical results confirm that local-reference selection fails on skewed distributions, whereas collaborative proxy references mitigate diversity degradation, suggesting that recursive synthetic-data pipelines require particular caution when real-data coverage is fragmented or scarce.
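
The abstract does not say how the proxy references are built. One illustrative option in one dimension: the 2-Wasserstein barycenter of several distributions has a quantile function equal to the weighted average of their quantile functions, so each silo could share a grid of quantiles instead of raw samples. The sketch assumes that construction, equal silo weights, and hypothetical helper names; a verifier would then compare candidate synthetic data against the proxy rather than against a single silo's local reference.

```python
def empirical_quantiles(samples, levels):
    """Simple inverse-CDF estimate at each level in (0, 1)."""
    data = sorted(samples)
    n = len(data)
    return [data[min(n - 1, int(q * n))] for q in levels]


def proxy_reference(silo_quantiles, weights=None):
    """1-D Wasserstein-2 barycenter: weighted average of quantile functions, level by level."""
    k = len(silo_quantiles)
    if weights is None:
        weights = [1.0 / k] * k
    return [
        sum(w * qs[i] for w, qs in zip(weights, silo_quantiles))
        for i in range(len(silo_quantiles[0]))
    ]


def w1_from_quantiles(qa, qb):
    """Approximate 1-Wasserstein distance from quantiles on a shared uniform grid."""
    return sum(abs(a - b) for a, b in zip(qa, qb)) / len(qa)


if __name__ == "__main__":
    levels = [(i + 0.5) / 20 for i in range(20)]
    silo_a = empirical_quantiles(range(100), levels)       # local data covering 0..99
    silo_b = empirical_quantiles(range(100, 200), levels)  # local data covering 100..199
    proxy = proxy_reference([silo_a, silo_b])
    print(w1_from_quantiles(silo_a, proxy), w1_from_quantiles(silo_b, proxy))
```

Each silo sits 50 units (in $W_1$) from the proxy while the two local references are 100 apart, a small illustration of how far a single silo's reference can be from the pooled distribution.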

2606.13731 2026-06-15 cs.AI cs.MA New submission

TwinBI: An Agentic Digital Twin for Efficient Augmented Interactions with Business Intelligence Dashboards

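Jisoo Jang, Wen-Syan Li

Affiliations: Graduate School of Data Science, Seoul National University

AI Summary: Proposes TwinBI, a framework that couples LLM agents with executable dashboard state to unify conversation, dashboard manipulation, semantics, and provenance, improving state consistency in multi-step analysis: exact-match accuracy rises from 43.3% to 63.3% and the timeout rate falls from 40% to 10%.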

Abstract

Business intelligence (BI) increasingly combines dashboard interaction with LLM-based assistance, but these two modes often fall out of sync during multi-step analysis. As users switch between direct dashboard manipulation and natural-language queries, it becomes difficult to preserve a consistent analytical state across filters, hierarchies, metrics, and chart context. We present TwinBI, an agentic digital-twin framework that couples an LLM-based agent system with an executable BI dashboard state. TwinBI unifies conversational interaction, dashboard manipulation, semantic grounding, and provenance tracking through a shared analytical state reconstructed from a unified interaction log. It also exposes artifacts such as schema views, SQL, logs, and an /insights command for state-grounded analytical summaries. We evaluate TwinBI in two complementary ways. In a controlled A/B benchmark with the same backbone agent, TwinBI improves exact-match accuracy from 43.3% to 63.3%, partial-credit accuracy from 48.3% to 70.8%, and substantially reduces timeout rate from 40.0% to 10.0% relative to Dashboard alone. In a usability study, participants benefited from the integrated dashboard-and-chat workflow, with high task accuracy, moderate workload, and favorable ratings for state-aware interaction mechanisms. These results suggest that TwinBI improves both agent-level analytical reliability and user-facing analytical support by turning visible dashboard state into richer actionable context. Our dataset and source code are available at: https://github.com/simonjisu/TwinBI
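
Reconstructing a shared analytical state from a single interaction log is essentially event sourcing: dashboard clicks and chat-issued actions both append events, and the state any component sees is a replay of the log. The event types, state fields, and function names below are hypothetical and chosen only to illustrate the pattern; they are not TwinBI's schema.

```python
from dataclasses import dataclass, field


@dataclass
class AnalyticalState:
    """Hypothetical shared state seen by both the dashboard and the chat agent."""
    filters: dict = field(default_factory=dict)
    hierarchy_level: str = "year"
    metric: str = "revenue"
    chart: str = "bar"


def apply_event(state, event):
    """Apply one logged event, whichever interface (dashboard or chat) produced it."""
    kind = event["type"]
    if kind == "set_filter":
        state.filters[event["field"]] = event["value"]
    elif kind == "clear_filter":
        state.filters.pop(event["field"], None)
    elif kind == "drill_down":
        state.hierarchy_level = event["level"]
    elif kind == "set_metric":
        state.metric = event["metric"]
    elif kind == "set_chart":
        state.chart = event["chart"]
    else:
        raise ValueError(f"unknown event type: {kind}")
    return state


def replay(log):
    """Rebuild the current analytical state by replaying the unified log from the start."""
    state = AnalyticalState()
    for event in log:
        apply_event(state, event)
    return state


if __name__ == "__main__":
    log = [
        {"source": "dashboard", "type": "set_filter", "field": "region", "value": "EU"},
        {"source": "chat", "type": "set_metric", "metric": "margin"},
        {"source": "dashboard", "type": "drill_down", "level": "quarter"},
    ]
    print(replay(log))
```

Because every component derives its view from the same log, a chat turn issued after a dashboard click sees the click's effect, and replaying a prefix of the log recovers the state at any earlier step, which is one way to support provenance tracking.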

2606.13727 2026-06-15 cs.RO New submission

Occupancy-Grounded Room Segmentation for Hierarchical 3D Scene Graphs

Carlos Cueto Zumaya, Iacopo Catalano, Jorge Peña-Queralta, Wallace Moreira Bessa

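Affiliations: University of Turku; Centre for Artificial Intelligence, Zürich University of Applied Sciences

AI Summary: Proposes anchoring room nodes to regions obtained from an occupancy decomposition when building hierarchical 3D scene graphs, recovering more room instances than the baseline method on the Matterport3D dataset.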

Abstract

Hierarchical 3D scene graphs (3DSGs) for indoor robots organize geometric and semantic information across spatial scales, with a room layer that connects object-level perception to room-scale reasoning. Existing systems construct this layer from different spatial substrates (e.g., place clusters, wall planes, or segmentation outputs), and as a result, room nodes are not evaluated on a common geometric criterion. We present an occupancy-grounded 3DSG pipeline in which room nodes are anchored to tracked free-space regions derived from occupancy decomposition, giving each room an explicit polygonal footprint. We evaluate the pipeline on 12 Matterport3D scenes by matching predicted room polygons to annotated room instances and compare against Hydra, a representative state-of-the-art place-connectivity baseline. The results show that occupancy-grounded anchoring recovers substantially more room instances than place-connectivity construction, at the cost of lower precision, and that wall-accurate room boundaries remain an open problem for both methods. Code is available at https://github.com/crcz25/OccuSG.
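
Matching predicted rooms to annotated rooms and reporting precision and recall can be sketched by representing each room footprint as a set of occupancy-grid cells and greedily pairing rooms by IoU. The paper matches polygons; the grid-cell representation, the 0.5 IoU threshold, greedy one-to-one matching, and the helper names here are assumptions.

```python
def iou(a, b):
    """Intersection over union of two sets of grid cells."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def match_rooms(predicted, annotated, threshold=0.5):
    """Greedy one-to-one matching by descending IoU; returns (precision, recall, pairs)."""
    candidates = sorted(
        ((iou(p, g), i, j) for i, p in enumerate(predicted) for j, g in enumerate(annotated)),
        reverse=True,
    )
    used_p, used_g, pairs = set(), set(), []
    for score, i, j in candidates:
        if score < threshold:
            break
        if i in used_p or j in used_g:
            continue
        used_p.add(i)
        used_g.add(j)
        pairs.append((i, j, score))
    precision = len(pairs) / len(predicted) if predicted else 0.0
    recall = len(pairs) / len(annotated) if annotated else 0.0
    return precision, recall, pairs


def rect(x0, y0, x1, y1):
    """Grid cells covered by an axis-aligned rectangle [x0, x1) x [y0, y1)."""
    return {(x, y) for x in range(x0, x1) for y in range(y0, y1)}


if __name__ == "__main__":
    annotated = [rect(0, 0, 4, 4), rect(4, 0, 8, 4)]
    predicted = [rect(0, 0, 4, 4), rect(4, 0, 7, 4), rect(0, 4, 2, 6)]  # last one is spurious
    print(match_rooms(predicted, annotated))
```

A spurious predicted room lowers precision without affecting recall, which is the trade-off the abstract reports for occupancy-grounded anchoring.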

2606.13723 2026-06-15 cs.CV cs.AI New submission

Morphology-Aware Sample Assignment: Overcoming IoU Insensitivity for Surface Defect Detection

Pengfei Liu, Yuhan Guo

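Affiliations: School of Management, Harbin Institute of Technology; College of Computing and Data Science, Nanyang Technological University

AI Summary: To address the insensitivity of IoU in defect detection, proposes morphological similarity measures based on area, shape, and aspect ratio to refine positive sample assignment. Theoretical analysis shows that the method reshapes the response distribution of the matching function; built on the YOLOv9 framework, it achieves consistent performance gains on the NEUDET and GC10-DET datasets with zero additional inference overhead.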

Abstract

Intersection-over-Union (IoU), as a pivotal metric for evaluating the spatial alignment between candidate proposals and ground-truth annotations, directly determines the quality of positive sample sets and the training efficacy of visual detection models. Through theoretical modeling and analysis, we uncover a non-sensitive region on the IoU response curve, within which samples yield nearly identical IoU scores despite distinct geometric overlaps. To overcome this limitation, we introduce a set of morphological similarity metrics covering area, shape, and aspect ratio, to refine the positive sample assignment process, thereby ensuring more discriminative and reliable matching. A supplementary matching score is derived via mean-based aggregation of these multidimensional similarities, compensating for the intrinsic limitation of IoU in representing structural correspondence. Theoretically, incorporating morphological similarity reshapes the response distribution of the matching function, yielding both effective directional gradients and polygon-like iso-response contours, which tightly confine high-response regions around each ground-truth instance and substantially enhance the precision of positive sample selection. Experiments based on the YOLOv9 framework demonstrate consistent performance gains on both NEUDET and GC10-DET datasets. Notably, the proposed approach is fully plug-and-play and incurs zero additional inference overhead, thereby ensuring deployment efficiency for industrial visual inspection.
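
A small numeric illustration of the IoU insensitivity described above: two candidate boxes with exactly the same IoU against a ground-truth box can differ sharply in morphology. The abstract does not give the similarity formulas, so the min/max ratio similarities, the definition of "shape" as the average of width and height ratios, and the helper names are assumptions; only the mean-based aggregation follows the text.

```python
def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def ratio_sim(u, v):
    """Symmetric ratio similarity in (0, 1]; equals 1 when u == v."""
    return min(u, v) / max(u, v)


def morphology_score(a, b):
    """Mean of area, shape, and aspect-ratio similarities (assumed definitions)."""
    wa, ha = a[2] - a[0], a[3] - a[1]
    wb, hb = b[2] - b[0], b[3] - b[1]
    area = ratio_sim(wa * ha, wb * hb)
    shape = 0.5 * (ratio_sim(wa, wb) + ratio_sim(ha, hb))
    aspect = ratio_sim(wa / ha, wb / hb)
    return (area + shape + aspect) / 3.0  # mean-based aggregation


if __name__ == "__main__":
    gt = (0, 0, 10, 10)
    shifted = (5, 0, 15, 10)  # same shape as gt, offset by half its width
    wide = (0, 0, 30, 10)     # contains gt but three times as wide
    for name, box in [("shifted", shifted), ("wide", wide)]:
        print(name, round(iou(gt, box), 4), round(morphology_score(gt, box), 4))
```

Both candidates score an IoU of 1/3, yet the morphology score separates them (1.0 versus 4/9 under these assumed definitions), which is the kind of extra signal a supplementary matching score can add during sample assignment.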

2606.13722 2026-06-15 cs.AI cs.MA New submission

YeasierAgent: Agentic Social Sandbox as a Canvas for Intent-Driven Creation of Platform-Agnostic Symbiotic Agent-Native Applications

Jory He

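Affiliations: Yeasier AI

AI Summary: Proposes the YeasierAgent paradigm, which uses platform-agnostic interactive units and spatial multi-agent collaboration to rapidly build symbiotic agent-native applications across platforms, unifying emotional companionship with tool execution.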

Abstract

This paper introduces YeasierAgent, an application-building paradigm based on symbiotic agents, narrative worlds, and scene-aware interaction. It challenges the conventional device-coupled model of software by redefining applications as collaborative spaces among users, agents, and worlds. We present a system architecture that achieves two primary contributions: (1) enabling the rapid, cross-platform construction of agent-native applications by utilizing platform-agnostic interactive units (agents, scenes, dialogue) rather than fixed graphical layouts; and (2) unifying the emotional companionship and practical tool execution attributes of intelligent agents within a single experiential sandbox. By integrating automated generation, user-created worlds, and spatial multi-agent collaboration, YeasierAgent formalizes the category of Symbiotic Agent-Native Applications, demonstrating a shift from isolated, tool-specific chatbots toward cohesive, socially embedded computational environments.

2606.13720 2026-06-15 cs.AI New submission

Refusal Beyond a Single Direction: A Preliminary Comparison of Diff-in-Means and INLP

Elisabetta Rocchetti, Alfio Ferrara

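Affiliations: Department of Computer Science, Università degli Studi di Milano

AI Summary: Compares DiM and INLP for steering refusal behavior in safety fine-tuned chat models, finding that INLP counterfactual flipping can match DiM directional ablation while nullspace projection is weaker, and that the two methods place activations in geometrically different regions of activation space.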

Abstract

Arditi et al. (2024) showed that refusal in safety fine-tuned chat models is mediated by a single linear direction in the residual stream, recoverable as the difference in means (DiM) between harmful and harmless activations. We compare DiM-based interventions (activation addition and directional ablation) with two interventions derived from Iterative Nullspace Projection (INLP), namely nullspace projection and counterfactual flipping, on five open-weight chat models. We ask whether INLP can match DiM at steering refusal and whether its richer parameterisation yields more tunable interventions. INLP counterfactual flipping is competitive with DiM directional ablation on refusal suppression, while nullspace projection is consistently weaker. Restricting INLP to the leading directions of the extracted subspace preserves most of the suppression effect at near-baseline perplexity, making the intervention tunable. Geometrically, the two INLP interventions land in qualitatively different regions of activation space: nullspace projection collapses transformed activations between the harmful and harmless clusters, while counterfactual flipping moves them into the opposite cluster. This suggests that the model encodes the absence of a concept differently from its opposite, an intriguing distinction that warrants further investigation.
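
The DiM operations named above reduce to a few lines of vector arithmetic. The sketch computes a unit difference-in-means direction from two sets of activations, then applies directional ablation (removing the component along the direction) and activation addition. Activations are toy 2-D lists and the helper names are hypothetical. For a single direction, an INLP nullspace projection has the same form as the ablation step, except that the direction is the weight vector of a trained linear classifier rather than a difference in means.

```python
import math


def mean_vector(vectors):
    """Componentwise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def dim_direction(harmful, harmless):
    """Unit difference-in-means direction between two sets of activations."""
    diff = [h - b for h, b in zip(mean_vector(harmful), mean_vector(harmless))]
    norm = math.sqrt(sum(d * d for d in diff))
    return [d / norm for d in diff]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def ablate(x, r_hat):
    """Directional ablation: remove the component of x along the unit direction r_hat."""
    c = dot(x, r_hat)
    return [xi - c * ri for xi, ri in zip(x, r_hat)]


def add_direction(x, r, scale=1.0):
    """Activation addition: shift x along r by the given scale."""
    return [xi + scale * ri for xi, ri in zip(x, r)]


if __name__ == "__main__":
    harmful = [[2.0, 1.0], [4.0, 1.0]]
    harmless = [[0.0, 1.0], [0.0, 1.0]]
    r_hat = dim_direction(harmful, harmless)
    print(r_hat, ablate([5.0, 7.0], r_hat), add_direction([0.0, 0.0], r_hat, scale=3.0))
```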

2606.13715 2026-06-15 cs.AI cs.CL cs.MA New submission

WorkBench Revisited: Workplace Agents Two Years On

Olly Styles

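Affiliations: GitHub

AI Summary: Re-evaluates agent progress on the WorkBench benchmark from 2024 to 2026, finding that frontier models have improved substantially in both capability and safety, while open-weight models have lowered the cost of reaching high performance.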

Comments 8 pages, 3 figures. Follow-up to arXiv:2405.00823

Abstract

The best agent on WorkBench in March 2024, GPT-4, completed 43% of tasks and took an unintended harmful action, such as emailing the wrong person, on 26% of them. We revisit the benchmark in June 2026 and find that the best agent to date, Claude Opus 4.8, completes 89% and takes an unintended harmful action on 2.5%. Aside from this considerable progress in frontier agent performance, three things stand out. First, capability and safety go together on WorkBench rather than trade off, so the models that finish the most tasks also do the least unintended damage. Second, while several classes of error have been totally eliminated, frontier models still make some basic mistakes that occasionally result in irreversible harm, such as sending an email to the wrong person. Third, the rise of open-weight models has drastically lowered costs for a performance level that was previously only accessible to proprietary models, while frontier costs have stayed relatively stable. We release an updated version of the benchmark with data and code quality improvements, new model scores, and analysis of agent progress on WorkBench since 2024.

2606.13714 2026-06-15 cs.CV New submission

TSA: Temporal Slot Activation for Persistent Object-Centric Video Representation

Duc Nguyen, Sieu Tran, Hao Vo, Khoa Vo, Duy Minh Ho Nguyen, Nghi D. Q. Bui, Anh Nguyen, Long Mai, Ngan Le

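Affiliations: University of Arkansas, USA; Max Planck Research School for Intelligent Systems; Google Research, Google; University of Liverpool, UK; Adobe Research

AI Summary: Proposes Temporal Slot Activation (TSA), which learns per-slot, per-frame activation scores to model the lifecycle of persistent slots, addressing the state drift and reconstruction interference caused by unconditional propagation and improving object decomposition and temporal identity preservation on multiple benchmarks.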

Abstract

Unsupervised video object-centric learning aims to decompose dynamic scenes into temporally persistent entity representations. Existing recurrent video slot-attention methods propagate a fixed set of slots across frames, but typically assume unconditional slot propagation: every slot is updated and decoded at every frame, regardless of whether its corresponding object is visible. We show that this design violates a basic lifecycle requirement for persistent slots: when an object is absent or fully occluded, its slot should preserve its previous state and avoid explaining unrelated visible content. Instead, unconditional propagation creates two failure pathways: update-induced state drift, where current-frame evidence overwrites the absent object's representation, and decoder-induced reconstruction interference, where the inactive slot remains coupled to reconstruction through decoder attention. We propose Temporal Slot Activation (TSA), a mechanism that learns a per-slot, per-frame activation score $\alpha_{k,t} \in (0, 1)$ without visibility supervision. TSA uses this activation as a shared latent control variable for slot lifecycle modeling. When a slot is inactive, TSA anchors its state to the previous slot via activation-gated updating and suppresses its decoder participation through an activation-dependent additive bias on attention logits before softmax normalization. This jointly reduces state drift and reconstruction-driven interference. To improve decisions under partial occlusion and gradual reappearance, TSA further conditions activation prediction on a per-slot temporal memory produced by a Temporal Context Encoder. We evaluate TSA on MOVi-C/E, YT-VIS, and OVIS benchmarks using both standard and tracking-based metrics (FG-ARI, mBO, IDF1, HOTA). TSA consistently improves object decomposition and temporal identity preservation, with large gains on long, heavily occluded videos.
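
The two gating mechanisms described above can be sketched for a single slot and a single decoder query. The convex-combination update and the choice of $\log(\alpha + \epsilon)$ as the additive attention bias are assumptions made for illustration; the abstract states only that the update is activation-gated and that the bias is activation-dependent and added before the softmax. Function names are hypothetical.

```python
import math


def gated_update(prev_slot, candidate_slot, alpha):
    """Activation-gated update: alpha near 0 anchors the slot to its previous state."""
    return [alpha * c + (1.0 - alpha) * p for p, c in zip(prev_slot, candidate_slot)]


def slot_attention_weights(logits, alphas, eps=1e-6):
    """Softmax over slots for one decoder query, after an activation-dependent bias.

    Adding log(alpha + eps) scales each slot's unnormalized weight by about alpha,
    so an inactive slot (alpha -> 0) receives almost no attention.
    """
    biased = [logit + math.log(a + eps) for logit, a in zip(logits, alphas)]
    m = max(biased)  # subtract the max for numerical stability
    exps = [math.exp(b - m) for b in biased]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    prev, cand = [1.0, 2.0], [5.0, 5.0]
    print(gated_update(prev, cand, alpha=0.05))  # stays close to prev
    print(slot_attention_weights([0.0, 0.0], [1.0, 0.0]))  # second slot is suppressed
```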

2606.13712 2026-06-15 cs.SD cs.CL New submission

Multimodal Speaker Identification in Classroom Environments

Michael L. Chrzan, Meghavarshini Krishnaswamy, Robert Gibboni, Katie Wetstone, Wei Ai, Jing Liu

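Affiliations: University of Michigan; University of Pennsylvania; University of Maryland

AI Summary: To address the low accuracy of acoustic-only models caused by classroom background noise and variability in children's speech, proposes a multimodal framework that fuses acoustic embeddings with LLM-derived semantic context, raising student identification accuracy from 39.0% to 50.3%, reaching 76.9% accuracy on long utterances, and distinguishing teacher and student roles with 99.3% accuracy.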

Comments 9 pages, 5 tables, 3 figures

Abstract

Automated analysis of K-12 classroom dynamics faces challenges due to background noise and variable child speech, often confounding acoustic-only models. This study evaluates a multimodal speaker identification framework anchoring acoustic embeddings with LLM-derived semantic context. Using a subset of the EDSI dataset (8 math classrooms, N = 2,801 utterances), we found an acoustic baseline (ECAPA-TDNN) achieved only 39.0% accuracy. By integrating transcript-based "contextual anchoring" into a gradient boosting classifier, our multimodal approach raised student identification to 50.3%. Performance also improved for utterances over 5 seconds, reaching 76.9% accuracy (vs. 64.9% baseline) with a 90.9% Top-3 accuracy. Additionally, the model distinguished teacher vs. student roles with 99.3% accuracy. This approach advances the feasibility of automated feedback systems capable of considering individual student participation, a crucial step for supporting equitable instruction at scale.
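
The metrics quoted above (top-1 and top-3 accuracy) can be computed from per-utterance candidate rankings. In the sketch, candidates are ranked by cosine similarity between an utterance embedding and enrolled speaker embeddings, a common way to use ECAPA-TDNN-style embeddings; the 2-D vectors and speaker names are invented, and the study's multimodal gradient-boosting classifier over acoustic and transcript features is not reproduced here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def rank_speakers(utterance_emb, enrolled):
    """Enrolled speaker ids sorted from most to least similar to the utterance."""
    return sorted(enrolled, key=lambda spk: cosine(utterance_emb, enrolled[spk]), reverse=True)


def top_k_accuracy(rankings, truths, k):
    """Fraction of utterances whose true speaker is among the first k ranked candidates."""
    hits = sum(1 for ranked, truth in zip(rankings, truths) if truth in ranked[:k])
    return hits / len(truths)


if __name__ == "__main__":
    enrolled = {"teacher": [1.0, 0.0], "student_a": [0.0, 1.0], "student_b": [0.6, 0.8]}
    utterances = [([0.9, 0.2], "teacher"), ([0.5, 0.9], "student_b"), ([0.1, 1.0], "student_a")]
    rankings = [rank_speakers(emb, enrolled) for emb, _ in utterances]
    truths = [truth for _, truth in utterances]
    print(top_k_accuracy(rankings, truths, 1), top_k_accuracy(rankings, truths, 3))
```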