arXivDaily: arXiv daily academic digest, updated Monday to Friday
2606.08381 2026-06-09 cs.CL cs.AI New submission

Auditing Proprietary Alignment in Large Language Models: A Comparative Framework Without a Ground-Truth Standard

Alireza Arbabi, Florian Kerschbaum

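Affiliations: University of Waterloo; Vector Institute

AI summary: Proposes a statistical framework that detects proprietary alignment in black-box language models by comparing how far a target model's responses deviate from those of baseline models in a shared semantic space, enabling external audits without a ground-truth standard.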

Abstract

Large language models (LLMs) are increasingly released and deployed through opaque development and deployment pipelines, enabling model providers to inject intentional, provider-specific policies without officially announcing them. As a result, various models have been reported to generate responses reflecting proprietary rules and organizational interests, leading to censorship or misinformation on controversial topics. However, systematic identification of such alignment remains a fundamental challenge, complicated by the ambiguity of what "proprietary" entails in different contexts. In this paper, we propose a statistical framework for detecting proprietary alignment in black-box language models via comparative behavioral analysis. Our approach quantifies systematic deviations between the responses of a target model and those of a reference set of baseline models in a shared semantic space. By evaluating relative behavioral divergence rather than absolute correctness, our framework enables principled auditing under black-box access. Applied to several widely discussed but previously unquantified cases, it provides a systematic and scalable basis for external assessment of provider-specific alignment behavior in large language models.
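The comparative idea in this abstract (deviation of a target from a reference set, with no ground truth) can be illustrated with a small sketch. This is not the paper's statistic: the leave-one-out z-score, the use of Euclidean distance, and the names `euclid`, `centroid`, and `deviation_zscore` are all assumptions chosen for illustration, and the vectors stand in for response embeddings from some shared encoder.

```python
import math
from statistics import mean, stdev


def euclid(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def centroid(vectors):
    """Component-wise mean of a list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def deviation_zscore(target_vec, baseline_vecs):
    """Hypothetical statistic: how unusual is the target's distance to the
    baseline centroid, compared with leave-one-out baseline distances?

    Needs at least three baselines, and raises ZeroDivisionError if all
    leave-one-out distances are identical.
    """
    loo = []
    for i, v in enumerate(baseline_vecs):
        # Distance of each baseline to the centroid of the *other* baselines.
        others = baseline_vecs[:i] + baseline_vecs[i + 1:]
        loo.append(euclid(v, centroid(others)))
    d_target = euclid(target_vec, centroid(baseline_vecs))
    return (d_target - mean(loo)) / stdev(loo)


if __name__ == "__main__":
    baselines = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.2]]
    print(deviation_zscore([5.0, 5.0], baselines))    # far from the baselines
    print(deviation_zscore([0.5, 0.55], baselines))   # at the baseline centroid
```

The score is relative by construction: it says only that the target sits further from the reference set than the references sit from each other, not that either side is correct.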

2606.08379 2026-06-09 cs.AI cs.CE cs.LG q-fin.CP q-fin.TR New submission

TT-DAC-PS: Twin-Target Deterministic Actor-Critic with Policy Smoothing for Optimal Trade Execution

Ilia Zaznov, Atta Badii, Julian Kunkel, Alfonso Dufour

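Affiliations: University of Reading; University of Göttingen; GWDG; Henley Business School

AI summary: Proposes the TT-DAC-PS algorithm, which combines twin exponential-moving-average critic targets, a pessimistic min backup, TD3-style policy smoothing noise, delayed actor updates, and conservative Q regularisation to curb overestimation, and outperforms classical and reinforcement-learning baselines on limit order book data.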

Comments 21 pages, 1 figure, 3 tables

Abstract

This study addresses the optimal execution of large stock sell programs by introducing TT-DAC-PS (Twin-Target Deterministic Actor-Critic with Policy Smoothing), a deterministic actor-critic architecture that combines twin exponential-moving-average critic targets with pessimistic min backup, TD3-style target policy smoothing noise, delayed actor updates, and conservative Q regularisation to curb overestimation. Exploration uses Ornstein-Uhlenbeck (OU) noise with a hybrid schedule: deterministic episode-wise decay, variance-guided adjustment based on recent reward dispersion, and a Soft Actor-Critic (SAC)-style temperature that is learned and mapped to the noise scale. The environment integrates Almgren-Chriss (AC) trade impact with Limit Order Book (LOB) prices and volumes, normalised state features, per-step volume participation caps, and a utility-based reward. The trade execution algorithm is applied to LOB data for ten U.S. stocks. Performance is assessed against reinforcement-learning baseline algorithms, including Proximal Policy Optimisation (PPO), Soft Actor-Critic (SAC), and Advantage Actor-Critic (A2C), as well as alternative trade execution algorithms, including Time-Weighted Average Price (TWAP), Volume-Weighted Average Price (VWAP), and AC. The proposed model consistently reduces mean implementation shortfall percentage with competitive variance, outperforming classical baselines and standard reinforcement-learning benchmark models.
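Three of the ingredients named above (pessimistic min backup over twin target critics, TD3-style target policy smoothing, and exponential-moving-average target updates) can be sketched in a few lines. This is a generic TD3-style illustration, not the authors' implementation: the function names, the default hyperparameters, and the [0, 1] action range (read here as a participation fraction) are assumptions.

```python
import random


def clip(x, lo, hi):
    """Clamp x to the closed interval [lo, hi]."""
    return max(lo, min(hi, x))


def smoothed_target_action(target_actor, next_state, sigma=0.2,
                           noise_clip=0.5, a_lo=0.0, a_hi=1.0):
    """Target policy smoothing: add clipped Gaussian noise to the target action."""
    noise = clip(random.gauss(0.0, sigma), -noise_clip, noise_clip)
    return clip(target_actor(next_state) + noise, a_lo, a_hi)


def pessimistic_td_target(reward, next_state, done, target_actor,
                          target_q1, target_q2, gamma=0.99, **smoothing):
    """TD target using the minimum of the two target critics."""
    a_next = smoothed_target_action(target_actor, next_state, **smoothing)
    q_min = min(target_q1(next_state, a_next), target_q2(next_state, a_next))
    # No bootstrapping past a terminal step.
    return reward + gamma * (0.0 if done else q_min)


def ema_update(target_params, online_params, tau=0.005):
    """Move each target parameter a fraction tau toward its online counterpart."""
    return [(1.0 - tau) * t + tau * o
            for t, o in zip(target_params, online_params)]
```

Taking the minimum of the two critics biases the target downward, which is the usual way such architectures counteract Q-value overestimation.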

2606.08376 2026-06-09 cs.LG cs.AI New submission

RiskNet: A large-scale dataset of AI risk incidents from news with alignment and multi-dimensional annotations

Leihan Zhang, Wecheng Ye, Xianlong Ma, Haochuan Liu, Yang Li, Qianyu Zhang, Jinliang Chen, Qiang Yan

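Affiliations: Beijing University of Posts and Telecommunications; Beijing Key Laboratory of Multimodal Data Intelligent Perception and Governance

AI summary: Presents RiskNet, a large-scale dataset of AI risk incidents built from multilingual news through a structured pipeline for incident identification, alignment, and multi-dimensional classification, supporting research on AI safety, governance, and risk analysis.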

Comments The manuscript has been submitted to Scientific Data

Abstract

As artificial intelligence (AI) systems are increasingly deployed across socially consequential domains, reports of AI-related harms and failures have grown in frequency and diversity. Although existing governance frameworks articulate high-level principles for responsible AI, large-scale empirical resources for tracking and analyzing real-world AI risk incidents remain limited. Existing incident collections are often manually curated, relatively small in scale, and insufficient for continuous, data-driven monitoring and downstream computational analysis. To address this need, we present RiskNet, a large-scale dataset of AI risk incidents constructed from large-scale multilingual news sources. RiskNet applies a structured pipeline for AI risk news identification, event-level report screening, incident alignment, and multi-dimensional incident classification. The resulting resource organizes dispersed news reports into incident-centered records and provides benchmark datasets for event classification, incident alignment, and incident-level risk labeling. In its current release, RiskNet covers hundreds of millions of source records and yields a large-scale collection of AI risk-related reports, including aligned incident clusters and annotated benchmark subsets. The dataset is also accessible through an online platform for browsing and exploration. We describe the data sources, processing workflow, taxonomy design, and technical validation of the resource. RiskNet is intended to support downstream research on AI safety, governance, risk analysis, and benchmarking, as well as longitudinal and cross-source analyses of AI-related harms. By providing a structured and reusable empirical resource, RiskNet helps bridge the gap between high-level governance principles and the documented realities of AI risk incidents.

2606.08375 2026-06-09 cs.LG New submission

Few-step Cofolding with All-Atom Flow Maps

Gianluca Scarpellini, Ron Shprints, Peter Holderrieth, Juno Nam, Pranav Murugan, Rafael Gómez-Bombarelli, Tommi Jaakola, Maruan Al-Shedivat, Nicholas Matthew Boffi, Avishek Joey Bose

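Affiliations: Genesis Molecular AI; Massachusetts Institute of Technology; Carnegie Mellon University; Imperial College London; Mila

AI summary: Proposes the DeCAF framework, which distills all-atom cofolding diffusion models into flow maps that generate high-quality samples in only a few inference steps, and improves sampling quality through reward-guided search.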

Abstract

All-atom generative modeling of 3D biomolecular complexes has emerged as the dominant paradigm for predicting the structure of proteins and protein-ligand systems. Generating structures at the atomic level of fidelity, however, typically requires expensive iterative diffusion rollouts, making both conventional deployment and inference-time search techniques computationally costly. In this paper, we introduce the Denoiser Cofolding All-Atom Flowmap (DeCAF) framework for distilling state-of-the-art all-atom cofolding models into all-atom flow maps that produce high-quality samples in only a few inference steps. We build DeCAF on a denoiser-based formulation of flow maps with endpoint losses that naturally support SE(3) rigid alignment, which we show is critical for training accurate models. We further derive a simple change of variables that lets DeCAF operate in the σ-space noise schedule of EDM-style architectures, enabling direct distillation from pretrained cofolding diffusion models. Equipped with DeCAF's flowmap lookahead, we introduce a purpose-built inference-time framework that improves sampling through reward-guided search. Empirically, DeCAF-Boltz statistically improves over Boltz-1x in both accuracy (RMSD) and physical validity scores of protein-ligand poses at strict NFE budgets on the challenging Runs N' Poses, while also showing a more optimal Pareto frontier across all inference compute budgets on PoseBusters. Distilling the state-of-the-art Pearl cofolding model, DeCAF-Pearl outperforms diffusion-based cofolding models and matches its teacher on success rate while using 5x fewer NFEs. We release our code at https://github.com/genesistherapeutics/decaf.

2606.08369 2026-06-09 cs.LG cs.AI New submission

An Information-Theoretic Definition for Open-Ended Learning

Wanqiao Xu, Yifan Zhu, Benjamin Van Roy

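Affiliations: Stanford University

AI summary: Proposes an information-theoretic definition of open-ended environments based on the bit-equivalent, proves that classical bandits are not open-ended, and designs an algorithm that achieves open-ended learning.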

Abstract

A growing body of work points to the great promise of AI systems that can continually expand their capabilities as they operate in an open-ended environment. Yet there is no coherent definition of open-endedness or theory about how an agent ought to explore an open-ended environment. We introduce an information-theoretic definition based on a new concept -- the bit-equivalent -- which quantifies the information required to attain each level of expected reward. We consider an environment to be open-ended if an agent can attain linear growth in the bit-equivalent. We establish that classical bandit environments are not open-ended and formulate a bandit environment that is. We also introduce an algorithm that achieves open-ended learning in this environment.

2606.08365 2026-06-09 cs.LG cs.AI New submission

Pre-Intervention Prediction of Sparse Autoencoder Steering Side Effects

Evan Duan

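Affiliations: University of Michigan

AI summary: Proposes a pre-intervention screening framework that uses feature statistics to predict the side effects of SAE steering (unstable effects and collateral spread); across several models and dictionaries, signals such as decoder geometry outperform baselines, though predictive performance varies by model.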

Abstract

Sparse autoencoder (SAE) features are increasingly used to steer language models, but feature steering is rarely clean: the same intervention can behave inconsistently across contexts and perturb unrelated features. We introduce a pre-intervention screening framework for forecasting SAE steering side effects from feature statistics computed before steering. We operationalize side effects along two axes of steering modularity, effect stability and collateral spread, and evaluate GPT-2-small, Pythia-70M-deduped, Gemma-2-2B, and Llama-3.1-8B across ReLU, JumpReLU, and TopK SAE dictionaries. Across these settings, decoder geometry, activation statistics, co-activation structure, and direct-logit footprint predict steering modularity better than frequency-only and activation-magnitude baselines. The signal is strongest in GPT-2-small, Pythia-70M, and Llama-3.1-8B, where it survives residualization against magnitude-related confounds, and weaker in Gemma-2-2B. Held-out screening shows that ranking unseen features by predicted cleanliness can select features that steer more cleanly on fresh contexts, but the successful axis varies by setting: GPT-2 improves most cleanly, Pythia improves mainly on stability, Llama mainly on collateral, and Gemma only partially. A controlled Llama Scope width comparison shows that the predictive signal persists under a 32K-to-128K dictionary-width change, although the screening payoff becomes less stable. Overall, SAE steering side effects are predictable in advance, but the useful predictor signature and transferred modularity axis are model- and dictionary-setting dependent.
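As one concrete example of a statistic that can be computed before any steering is done, the sketch below measures how much a feature's decoder direction overlaps with the other decoder directions. The abstract names "decoder geometry" only in general terms, so this specific measure (maximum absolute cosine similarity) and the name `max_decoder_overlap` are assumptions, not the paper's predictors.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def max_decoder_overlap(decoder, idx):
    """Largest |cosine| between feature idx's decoder row and any other row.

    `decoder` is a list of decoder directions, one per SAE feature
    (a hypothetical layout chosen for this example).
    """
    return max(abs(cosine(decoder[idx], row))
               for j, row in enumerate(decoder) if j != idx)


if __name__ == "__main__":
    toy_decoder = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    print(max_decoder_overlap(toy_decoder, 0))
```

The intuition being illustrated is that a direction sharing a large component with another feature's direction is a plausible candidate for collateral effects when steered; whether that holds is exactly what a screening study would have to measure.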

2606.08364 2026-06-09 cs.CV cs.AI New submission

Self-Supervised Vision Transformers for CBCT-Based Detection of Temporomandibular Joint Osteoarthritis

Shradhdha Trivedi, Vrundan Sojitra, Mariela Padilla

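Affiliations: Herman Ostrow School of Dentistry, University of Southern California; Viterbi School of Engineering, University of Southern California

AI summary: Studies how well DINO-family self-supervised ViTs transfer to CBCT-based detection of temporomandibular joint osteoarthritis, and finds that partially unfreezing the final two transformer blocks raises AUC from 0.671 to 0.902, indicating that adaptation strategy matters more than backbone choice.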

Abstract

Temporomandibular joint osteoarthritis (TMJ OA) is a prevalent degenerative condition whose osseous changes are often subtle on cone-beam CT (CBCT), making automated detection challenging. We study how well the DINO family of self-supervised vision transformers -- DINOv1, DINOv2, DINOv2+reg, and RAD-DINO (a radiology-pretrained variant) -- transfers to CBCT, asking how much backbone adaptation is needed and of what kind. We propose a simple slice-based pipeline using Vision Transformer (ViT) backbones: axial CBCT slices are encoded per-slice by a frozen or partially adapted ViT and aggregated via attention-based multiple instance learning (MIL) for patient-level binary OA/Normal classification. Through systematic ablation across unfreezing strategies and aggregation designs on a multi-source CBCT dataset, we find that partial unfreezing of the final two transformer blocks is the decisive factor, improving AUC from 0.671 (fully frozen DINOv2) to 0.902. This outperforms DINOv1 (0.867), DINOv2+reg (0.774), and a supervised ImageNet ViT-B/16 baseline (0.843). Our results provide practical guidance for adapting DINO-family foundation models in low-data medical imaging settings, showing that adaptation strategy is a stronger driver of performance than backbone choice alone.
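The aggregation step described above, where per-slice embeddings are pooled into one patient-level vector by attention-based MIL, can be sketched as follows. This uses the common tanh-attention form of MIL pooling as an assumption; the abstract does not specify the exact scoring function, and the parameter names `V` and `w` are hypothetical.

```python
import math


def attention_mil_pool(slice_embeddings, V, w):
    """Pool a bag of slice embeddings into a single vector.

    Each slice h gets a score w . tanh(V h); a softmax over the scores
    gives attention weights, and the output is the weighted mean.
    """
    scores = []
    for h in slice_embeddings:
        hidden = [math.tanh(sum(v * x for v, x in zip(row, h))) for row in V]
        scores.append(sum(wi * ui for wi, ui in zip(w, hidden)))
    # Numerically stable softmax over slices.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(slice_embeddings[0])
    pooled = [sum(a * h[j] for a, h in zip(weights, slice_embeddings))
              for j in range(dim)]
    return pooled, weights
```

In a full pipeline the pooled vector would feed a small classifier for the patient-level OA/Normal decision, and the weights indicate which slices drove it.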

2606.08360 2026-06-09 cs.LG cs.AI New submission

Generative Frontier Planning for Adaptive Peer-Referral Recruitment under Covariate-Dependent Arrivals

Lingkai Kong, Hezi Jiang, Andrew Ma, Keyu Wang, Akseli Kangaslahti, Milind Tambe

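Affiliations: Harvard University

AI summary: Addresses the realistic setting of covariate-dependent arrivals in peer-referral recruitment by proposing Generative Frontier Planning (GFP), which plans efficiently through deterministic backups and marginal greedy allocation and outperforms baseline methods in simulation.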

Abstract

Peer-referral recruitment systems such as respondent-driven sampling are critical for studying and intervening on hidden populations affected by infectious diseases. To accelerate recruitment, public health agencies must adaptively allocate limited referral resources across multiple rounds, where current decisions shape both the number and the covariates of future recruits. Prior work makes this problem tractable by assuming that referrals are drawn i.i.d. from a homogeneous population, an assumption that ignores the homophily and shared context that drive real peer recruitment. We instead consider a more realistic model in which both referral capacity and the covariates of newly referred individuals are conditioned on the referrer, learned from data with a censored count model and a conditional generative model. The resulting planning problem is challenging because each candidate allocation induces a different distribution over future recruits. We propose Generative Frontier Planning (GFP), a model-based planner that replaces per-step Monte-Carlo sampling with a deterministic backup over a latent covariate-coverage value surrogate. The surrogate is designed so that the expected value of the next frontier depends on the offspring generative model only through finite-dimensional summaries that are amortized offline, and so that the resulting per-round objective is monotone with diminishing returns. Together, these two properties make planning tractable: the deterministic backup eliminates Monte-Carlo sampling, and the diminishing-returns structure lets a marginal greedy allocation achieve a (1-1/e)-approximation for the per-round problem. On a simulation environment calibrated to a real respondent-driven sampling dataset, GFP outperforms random, reinforcement-learning, and i.i.d. dynamic-programming baselines across four discount factors.
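The marginal greedy allocation mentioned above can be sketched generically: hand out one referral unit at a time to whichever recruiter yields the largest increase in the round objective. The objective, capacities, and the name `greedy_allocate` below are placeholders for illustration; the paper's surrogate value function is not reproduced here.

```python
import math


def greedy_allocate(objective, capacity, budget):
    """Allocate up to `budget` units by repeatedly taking the best marginal gain.

    `objective` maps an allocation (list of ints) to a value and is assumed
    to be monotone with diminishing returns; `capacity[i]` caps recruiter i.
    """
    alloc = [0] * len(capacity)
    for _ in range(budget):
        base = objective(alloc)
        best_i, best_gain = None, 0.0
        for i, cap in enumerate(capacity):
            if alloc[i] >= cap:
                continue
            trial = alloc[:]
            trial[i] += 1
            gain = objective(trial) - base
            if gain > best_gain:
                best_i, best_gain = i, gain
        if best_i is None:  # nothing left that improves the objective
            break
        alloc[best_i] += 1
    return alloc


if __name__ == "__main__":
    # Toy concave objective: recruiter 0 is three times as productive.
    value = lambda a: 3.0 * math.sqrt(a[0]) + 1.0 * math.sqrt(a[1])
    print(greedy_allocate(value, capacity=[2, 5], budget=3))
```

Greedy selection is the standard route to (1-1/e)-style guarantees for monotone objectives with diminishing returns, which is why that structure in the surrogate matters.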

2606.08357 2026-06-09 cs.CL New submission

Forward-Free Diffusion Language Models

Haotian Sun, Rushi Qiang, Yuqian Zheng, Bo Dai

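Affiliations: Georgia Institute of Technology

AI summary: Proposes FReDA, a diffusion language model that needs no hand-designed forward process; it uses recursive distribution refinement with model-generated drafts as implicit intermediate states, outperforms larger models on reasoning and coding tasks, and achieves a 1.5-1.8x speedup.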

Abstract

Diffusion language models generate text through iterative denoising, offering a powerful alternative to autoregressive generation. However, discrete language spaces lack a natural neighborhood structure for defining effective perturbations, so existing forward processes rely on artificial corruption schemes. Such prescribed forward processes often produce states that are mathematically convenient but misaligned with drafts and errors encountered during generation, resulting in degraded sample quality. To address this limitation, we propose FReDA, a forward-free diffusion language model that eliminates the need for a hand-designed forward process. We formulate diffusion language modeling as recursive distribution refinement, in which model-generated drafts serve as implicit intermediate states, and the learned refinement model progressively moves the draft distribution toward the target distribution. Concretely, FReDA refines drafts by proposing candidate draft sequences and either directly performing self-refinement or selecting among parallel candidates via best-of-N refinement. With this design, FReDA is neighborhood-agnostic, model-complexity-aware, and compatible with flexible refinement parameterizations. Extensive evaluations in the sub-8B regime show that FReDA-4B outperforms larger diffusion base models on reasoning and coding benchmarks, achieving absolute gains of up to 15%, while reaching a 1.5-1.8x average speedup over diffusion baselines and scaling effectively with additional refinement computation.
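The best-of-N refinement loop described above can be sketched in the abstract: start from a draft, propose several refined candidates, keep the best one, and repeat. Everything concrete here is an assumption for illustration: `propose` and `score` are stand-ins for a learned refinement model and a selection criterion, and the rule of keeping the current draft unless a candidate improves on it is a choice made for this example.

```python
def best_of_n_refine(draft, propose, score, n_candidates=4, n_rounds=3):
    """Iteratively refine `draft`, selecting the best of N candidates per round."""
    current = draft
    for _ in range(n_rounds):
        candidates = [propose(current, k) for k in range(n_candidates)]
        best = max(candidates, key=score)
        # Keep the current draft unless some candidate strictly improves it.
        if score(best) > score(current):
            current = best
    return current


if __name__ == "__main__":
    target = "cat"

    def propose(text, k):
        # Toy refiner: overwrite one position with the "correct" character.
        pos = k % len(text)
        return text[:pos] + target[pos] + text[pos + 1:]

    def score(text):
        # Toy score: number of positions that match the target.
        return sum(a == b for a, b in zip(text, target))

    print(best_of_n_refine("dog", propose, score))
```

Note that the intermediate states are drafts produced by the procedure itself rather than samples from a fixed corruption process, which is the sense of "forward-free" being illustrated.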

2606.08348 2026-06-09 cs.CL New submission

Bayesian-Agent: Posterior-Guided Skill Evolution for LLM Agent Harnesses

Xiaojun Wu, Cehao Yang, Honghao Liu, Xueyuan Lin, Wenjie Zhang, Zhichao Shi, Xuhui Jiang, Chengjin Xu, Jia Li, Jian Guo

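Affiliations: IDEA Research; The Hong Kong University of Science and Technology (Guangzhou); DataArcTech Ltd.

AI summary: Proposes the Bayesian-Agent framework, which treats reusable skills as hypotheses and uses posterior distributions to guide skill evolution (patch, split, compress, and so on); it improves performance markedly on several benchmarks, suggesting that agent skill evolution should be viewed as posterior-guided harness optimization.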

Comments 15 pages, 6 figures

Abstract

LLM agents increasingly rely on external inference conditions: prompts, tools, memory, SOPs, skills, and harness feedback. These assets can improve task execution without changing model weights, but they are often revised by heuristic reflection or by reusing observed successes and failures as if counts alone were reliable belief. We introduce Bayesian-Agent, a native and cross-harness framework that treats reusable skills and SOPs as hypotheses about whether a frozen model will succeed under a particular prompt, context, and harness environment. Bayesian-Agent records verified trajectory evidence, maintains a feature-conditioned categorical posterior over each skill, and maps posterior state into inspectable actions such as patch, split, compress, retire, and explore. Model-facing prompts receive executable guardrails and failure-mode patches, while posterior summaries remain available for audit. With deepseek-v4-flash, incremental repair improves SOP-Bench from 80% to 95%, Lifelong AgentBench from 90% to 100%, and RealFin-Bench from 45% to 65%. We further evaluate Bayesian-Agent's native backend and optional GenericAgent, mini-swe-agent, and Claude Code backends. The results include positive, negative, saturated, and case-study settings, suggesting that agent skill evolution is best viewed as posterior-guided harness optimization rather than uncalibrated prompt accumulation. The source code is available at https://github.com/DataArcTech/Bayesian-Agent.
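The difference between raw counts and a posterior can be shown with a small sketch: a Dirichlet-smoothed categorical posterior over a skill's outcomes, mapped to an action. This is not the Bayesian-Agent implementation. The symmetric prior, the thresholds, the two outcome categories, and the "keep" action are assumptions made for this example; only patch, retire, and explore are action names taken from the abstract.

```python
def posterior_mean(counts, prior=1.0):
    """Posterior mean of a categorical distribution under a symmetric Dirichlet prior."""
    total = sum(counts.values()) + prior * len(counts)
    return {k: (v + prior) / total for k, v in counts.items()}


def choose_action(counts, min_evidence=5, retire_below=0.2, patch_below=0.7):
    """Map a skill's outcome counts to an action (hypothetical thresholds)."""
    if sum(counts.values()) < min_evidence:
        return "explore"   # too little verified evidence to act on
    p_success = posterior_mean(counts)["success"]
    if p_success < retire_below:
        return "retire"
    if p_success < patch_below:
        return "patch"
    return "keep"          # assumed no-op action, not named in the source


if __name__ == "__main__":
    print(choose_action({"success": 1, "failure": 1}))
    print(choose_action({"success": 5, "failure": 5}))
```

With a prior in place, one success out of one trial does not read as certainty, which is the kind of miscalibration the abstract attributes to count-based revision.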

2606.08347 2026-06-09 cs.CL cs.LG New submission

Tensorizing Engram: Sharing Latents Across N-Gram Embeddings is Beneficial in LLMs

Wuyang Zhou, Yuxuan Gu, Giorgos Iacovides, Yuning Qiu, Qibin Zhao, Danilo Mandic

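Affiliations: University of California, Berkeley; University of Cambridge; University of Toronto

AI summary: Proposes Tensorized Engram (TN-gram), which compresses n-gram embeddings through shared factors in a CP decomposition, reducing parameters and avoiding hash collisions, and matches or outperforms existing methods on multiple tasks.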

Abstract

Modern language models represent text using discrete token-level embeddings, which forces recurring multi-token patterns to be learned implicitly across Transformer layers. Both Over-tokenized Transformers and Engram attempt to address this limitation by explicitly incorporating multi-token (n-gram) memories. However, they rely on separate hash tables for each n-gram order, which introduces hash collisions and prevents nested n-grams from sharing the underlying latent structures. To address these issues, we propose Tensorized Engram (TN-gram), a compact memory module that represents tensorized n-gram embeddings through shared factors in the Canonical Polyadic (CP) form. TN-gram learns shared token-position factors together with order-absorption vectors to encode the embeddings of different n-gram orders. Comprehensive experiments demonstrate that TN-gram matches or even outperforms Engram-style n-gram modules while requiring far fewer parameters.
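A CP-style n-gram embedding can be sketched as an element-wise product of per-position token factors, followed by a projection to the embedding dimension. The exact role of the order-absorption vectors is not spelled out in the abstract, so treating them as a per-order rank-space scaling is an assumption, as are the names `pos_factors`, `order_vec`, and `out_proj`.

```python
def ngram_embedding(tokens, pos_factors, order_vec, out_proj):
    """Embed an n-gram from shared CP factors.

    pos_factors[k][t] is the rank-R factor row for token t at position k,
    order_vec is a rank-R vector for this n-gram order (assumed role), and
    out_proj is a D x R matrix mapping rank space to the embedding space.
    """
    core = list(order_vec)
    for k, t in enumerate(tokens):
        row = pos_factors[k][t]
        # Element-wise (Hadamard) product across positions, as in CP form.
        core = [c * r for c, r in zip(core, row)]
    return [sum(w * c for w, c in zip(out_row, core)) for out_row in out_proj]


if __name__ == "__main__":
    pos_factors = [
        [[1, 2], [3, 4]],   # position 0, vocabulary of two tokens, rank 2
        [[1, 0], [0, 1]],   # position 1
    ]
    out_proj = [[1, 0], [0, 1], [1, 1]]  # embedding dimension 3
    print(ngram_embedding((1, 0), pos_factors, [1, 1], out_proj))
```

Because the same position factors serve every order, a bigram and the trigram that contains it draw on shared parameters instead of separate hash-table rows, and no hashing is needed to index them.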

2606.08346 2026-06-09 cs.CL cs.LG New submission

CATPO: Critique-Augmented Tree Policy Optimization

Ayush Singh, Umang Goyal, Ankur Dahiya

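Affiliations: Indian Institute of Technology Roorkee; Vision and Language Group

AI summary: Proposes CATPO, which uses a tree informativeness score and critique-guided healing to address the compute wasted on uninformative trees in tree-structured reinforcement learning, improving accuracy on mathematical reasoning tasks.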

Comments 14 pages, 1 figure, 6 tables

Abstract

Reinforcement learning with verifiable rewards (RLVR) has become a dominant paradigm for improving the reasoning capabilities of large language models (LLMs). Recent tree-based methods such as TreeRPO extend flat trajectory sampling with tree-structured rollouts to obtain dense, step-level reward signals without a separate process reward model. However, not all trees are equally informative: trees where all leaves succeed, all leaves fail, or the policy already predicts the reward distribution contribute little to gradient updates, wasting compute. We introduce CATPO (Critique-Augmented Tree Policy Optimization), which diagnoses and addresses this waste at the tree level. CATPO first scores each tree via a tree informativeness score, F(T), combining leaf-outcome diversity with policy-reward decorrelation at zero extra compute. For dead-wrong trees where all branches fail, CATPO applies critique-guided healing: it locates the shallowest failure point, generates a natural-language critique, and grafts refined continuations to recover training signal. Finally, an informativeness-weighted loss scales each tree's gradient contribution by its normalized score, concentrating parameter updates on the most informative trees while preserving overall gradient magnitude. Experiments on Qwen2.5-Math-1.5B trained with the MATH dataset show that CATPO achieves 37.5% macro accuracy across four benchmarks (AIME24, MATH-500, OlympiadBench, and MinervaMath), improving over TreeRPO by 1.9% and GRPO by 4.8%.
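Two pieces of this description lend themselves to a short sketch: a leaf-outcome diversity term that vanishes when all leaves agree, and a normalization that rescales per-tree weights so their average stays at 1. Both formulas are stand-ins chosen for illustration; the paper's F(T) also includes a policy-reward decorrelation term that is not reproduced here, and the function names are hypothetical.

```python
def leaf_diversity(leaf_rewards):
    """Diversity proxy in [0, 1] for binary leaf rewards.

    Equals 0 when all leaves succeed or all fail, and 1 at an even split.
    """
    p = sum(leaf_rewards) / len(leaf_rewards)
    return 4.0 * p * (1.0 - p)


def normalized_tree_weights(scores):
    """Rescale scores so the weights average to 1 over the batch.

    Falls back to uniform weights if every score is zero.
    """
    mean_score = sum(scores) / len(scores)
    if mean_score == 0.0:
        return [1.0] * len(scores)
    return [s / mean_score for s in scores]


if __name__ == "__main__":
    trees = [[1, 0, 1, 0], [1, 1, 1, 1], [1, 0, 0, 0]]
    scores = [leaf_diversity(t) for t in trees]
    print(normalized_tree_weights(scores))
```

Keeping the mean weight at 1 shifts emphasis between trees without changing the overall scale of the update, which is one way to read "preserving overall gradient magnitude".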

2606.08341 2026-06-09 cs.RO New submission

Uncertainty-Aware Intention Prediction for Human-to-Robot Assembly Teleoperation

Fnu Heman, Yixuan Wang, Kolin Xu, Conner Wallace, John Dang, Akhil Joshi, Jun Sheng, Pinhas Ben-Tzvi, Mingyu Cai

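Affiliations: University of California, Riverside; University of Miami

AI summary: Proposes an uncertainty-aware intention prediction framework that combines hierarchical transfer learning, conformal prediction, and VLM-guided correction; pretraining on human demonstration data improves action segmentation with only a small amount of robot data.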

Comments 7 pages, 6 figures. Preprint version

Abstract

In assisted teleoperation for human-robot collaboration, accurate intention prediction is critical for enabling timely and reliable robotic assistance during long-horizon manipulation and assembly tasks. These systems require continuous understanding of user behavior to recognize actions, anticipate intentions, and detect mistakes in real time. However, robot teleoperation demonstrations are costly and hardware-limited, whereas human demonstrations are easier to collect and provide rich temporal structure. To address this challenge, we propose an uncertainty-aware human-to-robot intention prediction framework that combines: (1) hierarchical transfer learning, where MS-TCN++ is pretrained on human hand demonstrations and fine-tuned on limited robot teleoperation data to capture low-level actions and high-level task intentions; (2) a conformal prediction module that provides frame-level prediction sets with statistical coverage guarantees for reliable uncertainty quantification and early intention estimation; and (3) VLM-guided segment correction, which selectively reviews low-confidence or temporally uncertain segments using visual and temporal context. The framework supports action recognition, temporal segmentation, intention anticipation, and mistake detection for assisted teleoperation. Experiments on robot assembly demonstrations with 22 action classes show that human-to-robot fine-tuning improves the robot test-set Edit score from 70.50 to 80.70 using only 16 robot demonstrations. Edit-safe VLM correction further improves frame accuracy from 45.21% to 46.42% and increases F1@25 and F1@50 while preserving the Edit score. These results show that human demonstrations provide scalable pretraining data for robust, uncertainty-aware robot action segmentation. Code and data: project website.
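Component (2), prediction sets with a coverage guarantee, can be illustrated with standard split conformal prediction for classification. This is the textbook procedure, offered as a sketch under the assumption that the framework uses something of this shape; the nonconformity score (one minus the probability of the true class) and the function names are choices made for this example.

```python
import math


def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    """Calibrate a score threshold on held-out frames.

    cal_probs[i] is the predicted class distribution for frame i and
    cal_labels[i] is its true class index.
    """
    scores = sorted(1.0 - p[y] for p, y in zip(cal_probs, cal_labels))
    n = len(scores)
    k = math.ceil((n + 1) * (1.0 - alpha))
    if k > n:
        return float("inf")  # too few calibration points: include every class
    return scores[k - 1]


def prediction_set(probs, qhat):
    """All classes whose nonconformity score is within the threshold."""
    return [c for c, p in enumerate(probs) if 1.0 - p <= qhat]


if __name__ == "__main__":
    cal_probs = [[0.9, 0.1], [0.4, 0.6], [0.3, 0.7]]
    cal_labels = [0, 1, 0]
    qhat = conformal_threshold(cal_probs, cal_labels, alpha=0.5)
    print(prediction_set([0.65, 0.35], qhat))
```

A large prediction set for a frame is itself a usable uncertainty signal, for instance as a trigger for sending a segment to a slower reviewer.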

2606.08340 2026-06-09 cs.AI cs.LG cs.MA New submission

Benchmarking Open-Ended Multi-Agent Coordination in Language Agents

Kale-ab Abebe Tessera, Andras Szecsenyi, Cameron Barker, Alexander Rutherford, Davide Paglieri, Aidan Scannell, Henry Gouk, Elliot J. Crowley, Tim Rocktäschel, Amos Storkey

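Affiliations: University of Edinburgh; University of Oxford; University College London

AI summary: Presents Alem, a JAX-based benchmark for open-ended multi-agent coordination, evaluates the zero-shot coordination ability of 13 modern LLMs in a long-horizon survival world, and finds that coordination is a distinct bottleneck for frontier LLM agents.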

Comments 42 pages, preprint

英文摘要

As language models are increasingly deployed as autonomous agents, they must coordinate with others over long horizons in open-ended interactive tasks. Yet existing evaluations rarely test these demands together, instead emphasising single-agent tasks, short interactions, or highly structured multi-agent settings. We introduce Alem, a JAX-based benchmark for open-ended multi-agent coordination built on Craftax-like dynamics. Alem embeds procedurally generated coordination tasks, soft specialisation, communication, and controllable coordination difficulty into a long-horizon survival world with exploration, crafting, trading, and combat. We evaluate 13 modern LLMs zero-shot within homogeneous teams, with trained MARL agents as reference points. Current LLM agents remain far from solving Alem, averaging only ~6% normalised return, but their failures are not uniform. On the hardest coordination setting, zero-shot Gemini-3.1-Pro-High approaches MARL agents trained for one billion steps, while GPT-5.4-High achieves strong base-task reward but much lower coordination reward. This contrast shows that individual task competence does not imply coordination competence. Ablations show that communication is the largest contributor to coordination, while memory and reasoning help when used to maintain multi-step plans. Overall, our results identify coordination as a distinct bottleneck for frontier LLM agents, separate from single-agent capabilities. Alem makes this bottleneck measurable and provides a controlled testbed for developing agents that communicate, allocate roles, and execute shared plans. Code is available at https://github.com/alem-world/alem-env.

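2606.08336 2026-06-09 cs.CV New submission

Beyond Raw Signals: Undecoded Generative Latents as Privileged Synthetic Data

Cristian Sbrolli, Nicolas Michel, Matteo Matteucci, Toshihiko Yamasaki

Affiliations: Politecnico di Milano; The University of Tokyo

AI summary: Proposes Direct Latent Augmentation (DLA), which uses undecoded generative latents as privileged information, and transfers this dense knowledge to a purely visual student via Multilayer Explicit Simulated Synesthesia (MESSy), avoiding the inefficiency of the decode-encode loop.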

Details
Abstract

While multimodal integration significantly improves computer vision models, deploying them incurs prohibitive inference costs and requires scarce, perfectly paired datasets. Recent methods address this data bottleneck by synthesizing missing modalities via generative AI, yet they introduce a severe inefficiency: the Decode-Encode Loop. Specifically, information-rich generative latents are decoded into noisy raw signals, forcing the downstream classifier to waste capacity re-encoding them. To bypass this bottleneck, we propose Direct Latent Augmentation (DLA), utilizing undecoded generative latents directly as privileged information. Furthermore, to transfer this dense knowledge to a purely visual student, we introduce Multilayer Explicit Simulated Synesthesia (MESSy). Instead of enforcing rigid representation matching, which forces the student to distort its native visual features to accommodate complex multimodal topologies, MESSy uses a predictive objective to safely internalize these physical priors. Empirical results demonstrate that our framework significantly outperforms raw data augmentation and traditional distillation. Ultimately, our approach yields highly accurate unimodal students with "synesthetic" latent structures that are inherently aligned with physical properties they have never directly observed.

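2606.08332 2026-06-09 cs.CV New submission

SMI: Efficient Self-Supervised Learning via Mutual-Information-Inspired Dependency Optimization

Pritam Mishra, Coloma Ballester, Dimosthenis Karatzas

Affiliations: Universitat Pompeu Fabra; Universitat Autònoma de Barcelona

AI summary: Proposes SMI, a self-supervised objective optimized on a sample-level dependency matrix obtained through a nonlinear transformation of pairwise correlations; with a ResNet-50 on ImageNet it reaches competitive performance at lower computational complexity, and it improves transfer performance on low-resource tasks.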

Details
Abstract

Self-supervised learning (SSL) has achieved remarkable representation learning performance, but many existing methods rely on large batch sizes, memory banks, momentum encoders, or global synchronization mechanisms that substantially increase computational cost and training complexity. In this work, we propose Semantic Mutual Information (SMI), a lightweight self-supervised objective derived from a mutual-information-inspired dependency formulation under Gaussian assumptions. Unlike conventional correlation matching objectives that operate on high-dimensional feature correlation matrices, SMI performs optimization on a sample-level dependency matrix through a nonlinear transformation of pairwise correlations. This formulation induces distinct optimization dynamics that emphasize strongly dependent semantic pairs while maintaining representation diversity. Experimental results on ImageNet using a ResNet-50 backbone demonstrate that SMI achieves competitive linear evaluation performance relative to state-of-the-art SSL approaches while substantially reducing computational complexity. Across multiple low-resource benchmarks, SMI consistently improves transfer performance over Barlow Twins, particularly on fine-grained datasets. Furthermore, analyses of optimization dynamics and representation geometry suggest improved alignment-redundancy balance, greater feature diversity, and more spatially localized semantic representations. These results indicate that nonlinear dependency optimization provides an effective and computationally efficient alternative to conventional correlation-based self-supervised learning objectives.
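
The abstract does not give the SMI objective itself, so the sketch below only illustrates the general idea of a nonlinear transformation of pairwise correlations under a Gaussian assumption. It uses the textbook mutual information of two jointly Gaussian variables with correlation rho, I = -0.5 * ln(1 - rho^2), applied to sample-to-sample correlations between two augmented views. The function names, the clipping constant, and the toy embeddings are assumptions made for illustration.

```python
import math

def pearson(u, v):
    """Pearson correlation between two embedding vectors."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv)

def gaussian_mi(rho, eps=1e-6):
    """Mutual information of two jointly Gaussian variables with correlation rho."""
    rho2 = min(rho * rho, 1.0 - eps)  # clip so the logarithm stays finite
    return -0.5 * math.log(1.0 - rho2)

def dependency_matrix(view_a, view_b):
    """Sample-by-sample dependency: entry (i, j) relates sample i of view A to sample j of view B."""
    return [[gaussian_mi(pearson(za, zb)) for zb in view_b] for za in view_a]

# Hypothetical embeddings of two samples under two augmentations.
view_a = [[1.0, 2.0, 3.0, 4.0], [4.0, 1.0, 3.0, 2.0]]
view_b = [[1.1, 2.0, 2.9, 4.2], [3.9, 1.2, 3.1, 1.8]]

for row in dependency_matrix(view_a, view_b):
    print([round(x, 3) for x in row])
```

Because the transform grows without bound as the correlation approaches one, strongly dependent pairs dominate the matrix, which is consistent with the emphasis on strongly dependent semantic pairs described in the abstract.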

2606.08327 2026-06-09 cs.CL cs.AI cs.LG New submission

Chiaroscuro Attention: Spending Compute in the Dark

Prateek Kumar Sikdar

Affiliations: Accenture

AI summary: Proposes CHIAR-Former, a hybrid transformer with spectral-entropy routing that combines DCT spectral mixing with full attention, reaching PPL 36.54 on WikiText-103 with 62.5% fewer attention FLOPs, a 45% improvement over a full-attention baseline.

Comments: 8 pages, 6 figures, 3 tables

Details
Abstract

Standard transformers apply self-attention uniformly at every layer and token, regardless of whether the input requires dynamic cross-token interaction. We propose CHIAR-Former (Chiaroscuro Attention), a 4-layer hybrid transformer that routes each token to one of three operators - DCT spectral mixing, RBF kernel mixing, or full self-attention - based on per-token spectral entropy, a theoretically justified complexity signal. Through systematic ablation on WikiText-103, we discover routing collapse: the router consistently rejects RBF in favour of DCT and attention, revealing that spectral mixing and dynamic attention are complementary and sufficient. A purpose-designed DCT+Attention-only variant achieves Val PPL 36.54 on WikiText-103 - a 45% improvement over a full-attention baseline (PPL 66.62) at 62.5% fewer attention FLOPs. We extend evaluation to WikiText-2, IMDB sentiment classification, and synthetic ListOps operations, establishing a clear operating regime: CHIAR-Former excels on large-scale naturalistic text where token diversity supports spectral specialisation, while full attention retains an edge on small datasets and synthetic pattern-matching tasks. These findings - both the wins and the losses - together define when and why spectral routing earns its keep.
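
Per-token spectral entropy, the routing signal named in the abstract, can be sketched as the Shannon entropy of a token embedding's normalized DCT power spectrum. This is a minimal illustration under assumptions: the abstract does not say how the entropy is computed or thresholded, so the unnormalized DCT-II, the 0.5 threshold, and the rule that high-entropy tokens go to attention are all hypothetical.

```python
import math

def dct2(x):
    """Unnormalized DCT-II of a sequence."""
    n = len(x)
    return [
        sum(x[t] * math.cos(math.pi * (t + 0.5) * k / n) for t in range(n))
        for k in range(n)
    ]

def spectral_entropy(x, eps=1e-12):
    """Shannon entropy of the normalized DCT power spectrum, scaled by log(n)."""
    power = [c * c for c in dct2(x)]
    total = sum(power) + eps
    probs = [p / total for p in power]
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return h / math.log(len(x))

def route(x, threshold=0.5):
    """Hypothetical router: spectrally simple tokens use DCT mixing, others use attention."""
    return "attention" if spectral_entropy(x) > threshold else "dct"

smooth_token = [1.0] * 8           # energy concentrated in one DCT coefficient
spiky_token = [1.0] + [0.0] * 7    # energy spread across many coefficients

print(route(smooth_token), route(spiky_token))
```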

2606.08324 2026-06-09 cs.CV cs.AI New submission

Set-Based Transformer for Atmospheric Compensation in Standoff LWIR Hyperspectral Imaging

Fabian Perez, Nicolas Quintero, Jeferson Acevedo, Hoover Rueda-Chacon

Affiliations: Department of Computer Science, Universidad Industrial de Santander, Bucaramanga

AI summary: Proposes a lightweight set-based deep learning framework that uses radiance measurements taken at different ranges to jointly estimate transmittance, atmospheric path radiance, and downwelling radiance, achieving low spectral distortion on a MODTRAN-generated dataset.

Comments: Accepted conference paper, IGARSS 2026

Details
Abstract

Passive long-wave infrared (LWIR) hyperspectral imaging under a standoff geometry depends on atmospheric absorption and emission, as well as reflected radiance, which makes atmospheric compensation essential for recovering information about a target of interest. Despite its importance, this compensation has been largely overlooked because of its practical and modeling difficulty. In this paper, we present a lightweight set-based deep learning framework that takes multiple radiance measurements, collected at different standoff ranges, as input and jointly estimates transmittance, atmospheric path radiance, and a shared downwelling spectrum. We analyze the learned representation with a sparse autoencoder and observe that several latent features activate on geographically coherent subsets of the test data despite the absence of location supervision. Experiments on a MODTRAN-generated standoff LWIR dataset demonstrate low spectral distortion across all estimated products. The dataset and code are publicly available at: https://factral.co/SAE-LWIR/

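2606.08322 2026-06-09 cs.LG stat.ME New submission

Orthogonality and Dimensionality in Airline Cluster Analysis using PCA and Kernel PCA

Andreas Schlapbach

Affiliations: Swiss Federal Railways (SBB); University of Berne

AI summary: Replicates the clustering experiment of Renold et al. on US airline profit cycles from 1995 to 2020; PCA and kernel PCA analyses show that the six-cluster taxonomy is geometrically robust across the raw 7-dimensional and 3-dimensional PC spaces and support an intrinsically linear manifold structure in the data.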

Details
Abstract

To characterize US airline profit cycles from 1995 to 2020, Renold et al. (2023) combine k-means clustering, principal component analysis, and system dynamics modelling. We replicate their clustering experiment in three spaces (the original 7-dimensional raw-variable space, a 3-dimensional PC score space, and a 4-dimensional PC score space) using the dataset they generously included in their paper. We show that the six-cluster taxonomy is geometrically robust: k-means in 3-PC space produces cluster assignments that are bit-for-bit identical to those in the 7D raw space. As a nonlinearity check, we apply kernel PCA under six kernels spanning three families plus a linear baseline. All six kernels preserve the six-cluster assignment in 2D. A 1D diagnostic tightens this: the linear kernel conflates the COVID year C_3 with the peak-profit cluster C_0, whereas all five non-baseline kernels shift C_3 to overlap only the post-financial-crisis cluster C_5. Agreement across the kernel families confirms an intrinsically linear manifold with no hidden curvature. The silhouette criterion reveals that the dataset structurally supports only three clusters, not six. Collinearity in the raw 7D space suppresses the silhouette signal that would otherwise identify k=3 as the structurally motivated choice.
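
The silhouette criterion used in the abstract to compare three clusters against six can be sketched in a few lines. The example below is a plain-Python implementation of the mean silhouette coefficient on hypothetical 2D toy points (not the airline dataset), where the data form three tight groups and a six-way labelling splits each group in two.

```python
import math

def silhouette(points, labels):
    """Mean silhouette coefficient; needs at least two clusters."""
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    total = 0.0
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            continue  # singleton clusters contribute 0 by convention
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        # b: mean distance to the nearest other cluster
        b = min(
            sum(math.dist(points[i], points[j]) for j in idx) / len(idx)
            for other, idx in clusters.items()
            if other != lab
        )
        total += (b - a) / max(a, b)
    return total / len(points)

# Hypothetical toy data: three well-separated groups of four points each.
points = [
    (0, 0), (0, 1), (1, 0), (1, 1),
    (10, 0), (10, 1), (11, 0), (11, 1),
    (5, 9), (5, 10), (6, 9), (6, 10),
]
three_clusters = [0] * 4 + [1] * 4 + [2] * 4
six_clusters = [0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5]

print(silhouette(points, three_clusters) > silhouette(points, six_clusters))
```

On this toy data the three-cluster labelling scores higher because the six-way split places each point close to a neighbouring cluster, the same kind of signal the abstract reports for k=3.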

2606.08314 2026-06-09 cs.AI New submission

Integrating Deep Learning Demand Forecasting with Multi-Objective Optimization for Circular Coffee Supply Chains: A Data-Driven Framework for Cost, Emissions, and Freshness Management

Gerçek Budak, Faraz Gholamzadeh Gharehgheshlaghi, Melika Barjesteh Vaezi, Ahmad Gholizadeh Lonbar

Affiliations: Ankara Yıldırım Beyazıt University; Texas Tech University; University of Alabama

AI summary: Proposes a two-phase framework: a CNN-LSTM model first forecasts demand (MAE = 22.87, R² = 0.90), then a tri-objective MILP model optimizes cost, carbon emissions, and freshness in a circular supply chain, yielding 25 Pareto solutions; a balanced policy cuts emissions by 22.4% for only a 9.9% cost increase.

Details
Abstract

The coffee supply chain is one of the most complex agri-food networks, marked by geographically dispersed production, multi-tier coordination, and high sensitivity to quality and freshness. While sustainability and digitalization have gained attention, demand forecasting, optimization, and traceability are often treated separately. This study presents a two-phase integrated framework. First, a hybrid CNN-LSTM model is used for demand forecasting. On the public Coffee Chain Sales dataset with chronological 70/15/15 splitting, the model achieves MAE of 22.87 and R^2 of 0.90, outperforming the best deep learning benchmark by ~12% and classical methods by over 30%. In the second phase, the forecasted demand feeds a tri-objective mixed-integer linear programming (MILP) model that jointly minimizes cost, minimizes carbon emissions, and maximizes product freshness in a multi-period, multimodal, closed-loop supply chain with circular recovery. Freshness is modeled via exponential decay based on inventory age. Using the epsilon-constraint method, 25 Pareto solutions are obtained. Sensitivity and policy analyses show that balanced sustainability policies can reduce emissions by 22.4% with only a 9.9% cost increase while maintaining near-optimal freshness. Keywords: Coffee supply chain; Deep learning; Demand forecasting; Multi-objective optimization; Circular economy; CNN-LSTM; Mixed-integer linear programming.
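
Two ingredients of the second phase described above, exponential freshness decay and the epsilon-constraint method, can be shown on a toy problem. This sketch deliberately replaces the MILP with enumeration over a handful of hypothetical candidate plans; the decay rate, plan data, and emission caps are illustrative assumptions, not values from the paper.

```python
import math

def freshness(age_days, decay_rate=0.05):
    """Exponential freshness decay as a function of inventory age."""
    return math.exp(-decay_rate * age_days)

def epsilon_constraint(plans, emission_caps, min_freshness):
    """For each emission cap, keep the cheapest plan meeting both constraints."""
    front = []
    for cap in emission_caps:
        feasible = [
            p for p in plans
            if p["emissions"] <= cap and freshness(p["age_days"]) >= min_freshness
        ]
        if feasible:
            best = min(feasible, key=lambda p: p["cost"])
            if best not in front:
                front.append(best)
    return front

# Hypothetical shipping plans: cost, emissions, and inventory age on delivery.
plans = [
    {"name": "road", "cost": 100, "emissions": 80, "age_days": 3},
    {"name": "rail", "cost": 110, "emissions": 60, "age_days": 5},
    {"name": "sea", "cost": 90, "emissions": 50, "age_days": 30},
    {"name": "air", "cost": 180, "emissions": 150, "age_days": 1},
]

front = epsilon_constraint(plans, emission_caps=[160, 70, 40], min_freshness=0.7)
print([p["name"] for p in front])
```

Cost stays the single objective while emissions and freshness become constraints; sweeping the emission cap traces out trade-off points, which is how an epsilon-constraint sweep produces a set of Pareto candidates.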

2606.08312 2026-06-09 cs.AI cs.FL New submission

Neuro-Symbolic Injection of LTLf Constraints in Autoregressive Reinforcement Learning Policies

Ashkan Ansarifard, Matteo Mancanelli, Elena Umili, Fabio Patrizi

Affiliations: Sapienza University of Rome

AI summary: Proposes a neurosymbolic framework that compiles LTLf constraints into DFAs and injects them into transformer policies through a differentiable loss, improving constraint satisfaction on navigation tasks while keeping return competitive.

Comments: Accepted at the Joint Workshop on Statistics and Knowledge Integration for Logic, Learning, Ethical Decisions, and LLMs (SKILLED-LLMs 2026), co-located with KR 2026 and FLoC 2026, Lisbon, Portugal

Details
Abstract

In this work we study offline reinforcement learning (RL) under temporally extended task constraints expressed in Linear Temporal Logic over finite traces (LTLf). Recently, transformer-based approaches such as Trajectory Transformers and Decision Transformers have been adopted to address RL as a sequence modeling problem. However, these methods optimize purely for reward and do not account for high-level temporal requirements. Here, we introduce a neurosymbolic framework that injects LTLf background knowledge into such transformer-based RL policies. Our approach compiles LTLf formulas into deterministic finite automata (DFAs) and integrates them into the learning process through a differentiable representation and a logic-based loss function. In particular, we derive differentiable satisfaction signals from DFA progression and use them as a regularization term during training. The resulting method is architecture-agnostic across different models. We evaluate the proposed framework on navigation environments with specification suites covering combinations of safety and reachability temporal properties. Experimental results show that incorporating background knowledge not only improves constraint satisfaction, but also maintains competitive return compared to vanilla baselines.
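
The idea of deriving a satisfaction signal from DFA progression can be sketched for one small specification, "eventually reach the goal and never enter a hazard". The DFA below was written by hand for that formula, and the soft progression (propagating a belief over DFA states under per-step symbol probabilities, treated as independent across steps) is one plausible relaxation chosen for illustration; the abstract does not specify the authors' differentiable representation or loss.

```python
import math

SYMBOLS = ("empty", "goal", "hazard")
# State 0: goal not yet reached; 1: goal reached (accepting); 2: hazard hit (sink).
DELTA = {
    0: {"empty": 0, "goal": 1, "hazard": 2},
    1: {"empty": 1, "goal": 1, "hazard": 2},
    2: {"empty": 2, "goal": 2, "hazard": 2},
}
ACCEPTING = {1}

def accepts(trace):
    """Crisp DFA run over a finite trace of symbols."""
    state = 0
    for sym in trace:
        state = DELTA[state][sym]
    return state in ACCEPTING

def soft_satisfaction(symbol_probs):
    """Probability mass on accepting states after a soft run over the trace."""
    belief = [1.0, 0.0, 0.0]
    for step in symbol_probs:
        nxt = [0.0, 0.0, 0.0]
        for s, mass in enumerate(belief):
            for sym, p in zip(SYMBOLS, step):
                nxt[DELTA[s][sym]] += mass * p
        belief = nxt
    return sum(belief[s] for s in ACCEPTING)

def logic_loss(symbol_probs, eps=1e-8):
    """Negative log satisfaction, usable as a regularization term."""
    return -math.log(soft_satisfaction(symbol_probs) + eps)

safe_policy = [[0.2, 0.8, 0.0], [0.9, 0.1, 0.0]]
risky_policy = [[0.2, 0.8, 0.0], [0.5, 0.0, 0.5]]
print(logic_loss(safe_policy) < logic_loss(risky_policy))
```

Every operation in the soft run is a sum or product, so in an autodiff framework the same computation would pass gradients back to the policy's output probabilities.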

2606.08311 2026-06-09 cs.AI New submission

Curation of a Cardiology Interface Terminology for Highlighting Electronic Health Records using Machine Learning

Mahshad Koohi Habibi Dehkordi, Shuxin Zhou, Yehoshua Perl, Fadi P. Deek, James Geller, Gai Elhanan, Andrew J. Einstein, Luke Lindemann, Vipina K. Keloth

Affiliations: Department of Computer Science, New Jersey Institute of Technology; Department of Computer Science, St. Francis College; Department of Informatics, New Jersey Institute of Technology; Department of Data Science, New Jersey Institute of Technology; Center for Genomic Medicine, School of Medicine, University of Nevada; Department of Medicine, Cardiology Division, Columbia University Irving Medical Center; Advanced Metrics Laboratory, School of Medicine and Health Sciences, George Washington University; Department of Biomedical Informatics and Data Science, Yale University

AI summary: Proposes a machine-learning-based method for designing a Cardiology Interface Terminology (CIT): training data is built semi-automatically and a model is trained to highlight key information in electronic health records, reaching 74.21% coverage.

Details
Abstract

Electronic health record (EHR) notes are dense medical documents containing large amounts of information, often filled with complex medical jargon. Highlighting all details in EHRs helps reduce the likelihood of missing crucial information by drawing attention to key content. This study proposes the design of a Cardiology Interface Terminology (CIT) to accurately highlight all details in the EHR notes of cardiology patients. We introduce an innovative machine learning (ML) technique for the design of the CIT. The ML technique requires training data, and manual preparation of such data is time-consuming and expensive. The CIT design process includes three phases. In the first two phases, we derive a training-data CIT to be used by the ML technique of the third phase. We start by designing an initial CIT composed of several components: the cardiology-related sub-hierarchies of SNOMED, other SNOMED concepts mined from the EHRs of the build set, and necessary components of terms, e.g., medical abbreviations and medications. Using an iterative process, fine-grained phrases containing initial CIT concepts are extracted from the build set as CIT concept candidates. The candidate concepts are semi-automatically reviewed before being added to the CIT, yielding the training-data CIT (TCIT). In the third phase, an ML model is trained with the TCIT to identify candidates fit to be concepts in the CIT. This model is used to extract further concepts from the build set, yielding the final CIT. The final CIT is then used to highlight the test set and to evaluate the extent to which it captures details in an unseen EHR dataset. For this purpose, four evaluation metrics are used: coverage, breadth, completeness, and conciseness. The highlighted test set has a coverage of 74.21% and a breadth of 1.68. For 20 random notes in the test set, the average completeness is 98.2% and the average conciseness is 84.2%.

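2606.08310 2026-06-09 cs.AI cs.MA New submission

To Nuke or Not to Nuke: LLMs' (Missing) Ethical Reasoning and Actions in a High-Stakes Decision-Making Simulation

John Chen, Sihan Cheng, Can Gurkan, H M Abdul Fattah

Affiliations: University of Arizona; Northwestern University

AI summary: Studies LLMs that spontaneously escalate to nuclear authorization in the complex game Civilization V; three prompt interventions fail to reliably eliminate escalation, three failure pathways are identified, and the authors stress the need to test whether ethical reasoning is spontaneous and behaviorally effective in complex decision-making contexts.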

Details
Abstract

Large language models (LLMs) are increasingly deployed as long-horizon agents with decision-making capacities. While LLMs can show ethical competence on dilemmas such as trolley problems, this competence may not translate to complex, agentic scenarios. We study this gap in Civilization V, a multiplayer game with a complex decision-making landscape including economy, diplomacy, technology, and military strategy. Starting from 130 high-tension LLM self-play episodes in which an LLM player spontaneously escalated nuclear authorization, we replay them across 13 models with three prompt interventions: an ethical prompt naming nuclear harm, removal of the previous model's decision-making rationale, and high-stakes framing emphasizing real-world impacts. Neither the individual interventions nor their combinations reliably eliminate emergent escalation. We identify three failure pathways: ethical reasoning that fails to surface without prompting, fails to appear even when prompted, or surfaces but fails to take effect when strategic counter-factors dominate. Evaluations of agentic models, therefore, must test whether ethical reasoning is spontaneously invoked and behaviorally effective in complex decision-making contexts, beyond whether it can be elicited in isolation.

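2606.08309 2026-06-09 cs.LG cs.CV New submission

Where the Score Lives: A Wavelet View of Diffusion

Emma Finn, Binxu Wang, T. Anderson Keller, Demba E. Ba

Affiliations: The Kempner Institute for the Study of Natural and Artificial Intelligence; Harvard University

AI summary: Proposes a parameterization of the score function in a 2D orthogonal wavelet basis; a moment-based analysis of the data distribution sheds light on the inductive biases of different architectures and on how the score network interacts with the data distribution in diffusion models.

Comments: 20 pages, 12 figures, AISTATS 2026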

Details
Journal ref
Proceedings of the 29th International Conference on Artificial Intelligence and Statistics (AISTATS) 2026, Tangier, Morocco. PMLR: Volume 300
Abstract

Score-based generative models have had remarkable success over the last decade in generating a diverse set of visually plausible images. A variety of architectures including CNNs, U-Nets, and Transformers have been used as the score-approximation network in such diffusion modeling; however, to date, relatively little is known about how these architectural choices impact generative behavior. In this work, to provide insight into this area, we propose an analytically solvable parameterization of the score function using an expansion in a 2D orthogonal wavelet basis. In particular, we derive interpretable optimal score functions in terms of the moments of the data distribution. We use this parameterization to provide an architecture-agnostic, moment-based analysis that reveals which attributes of the data distribution tend to matter most for denoising. Our score machine is flexible enough to partially mimic the relevant inductive biases of multiple architectures, including U-Nets and CNNs, taking a step towards understanding why different score architectures can exhibit distinct generative behavior. Since our score is solvable in terms of the moments of the data, we can begin to understand how the data distribution interacts with the score network to produce the behavior we observe in diffusion models.

2606.08308 2026-06-09 cs.LG New submission

Fourier fractal dimension to predict the generalization of deep neural networks

Joao B. Florindo, Davi Wanderley Misturini

Affiliations: Institute of Mathematics, Statistics and Scientific Computing, University of Campinas

AI summary: Proposes the Fourier fractal dimension of weight variations as a generalization measure and designs a Fourier-based optimizer that regularizes it, obtaining strong correlation with the generalization gap on CIFAR-10 and other datasets.

Details
Abstract

Predicting the generalization performance of deep neural networks without relying on hold-out validation data is a fundamental challenge in machine learning. While Stochastic Gradient Descent (SGD) drives the optimization of these highly parameterized models, its heavy-tailed, non-Gaussian dynamics induce complex, scale-invariant trajectories in the parameter space. In this paper, we propose a novel generalization measure based on the Fourier fractal dimension of the network's weight variations. By analyzing the characteristic function of the Lévy-driven stochastic differential equations in the frequency domain, we extract a metric that robustly captures the geometric complexity of the learning process. Furthermore, we introduce a customized Fourier-based optimizer designed to actively regularize this fractal dimension during training. Extensive empirical evaluations on the CIFAR-10, SVHN, and MNIST datasets demonstrate that our proposed Fourier generalization measure exhibits a strong correlation with the actual generalization gap. Our method achieves state-of-the-art Kendall rank correlation coefficients, outperforming a wide array of existing norm-based, margin-based, and PAC-Bayesian measures. Ultimately, this work highlights the potential of frequency-domain fractal analysis as both a powerful predictor for model generalizability and a principled foundation for developing more stable optimization algorithms.
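
The Kendall rank correlation used above to score a generalization measure can be computed directly from concordant and discordant pairs. The sketch below implements the tau-a variant in plain Python; the measure values and generalization gaps are hypothetical numbers chosen for illustration, not results from the paper.

```python
def kendall_tau(x, y):
    """Kendall tau-a: (concordant - discordant) / number of pairs; ties count as neither."""
    n = len(x)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

# Hypothetical values: a complexity measure and the observed generalization gap
# (train accuracy minus test accuracy) for four trained models.
measure = [1.2, 1.5, 1.9, 2.4]
gen_gap = [0.03, 0.05, 0.04, 0.09]

print(round(kendall_tau(measure, gen_gap), 3))
```

A value near 1 means the measure ranks models in the same order as their generalization gaps, which is what makes it usable as a predictor without hold-out data.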

2606.08307 2026-06-09 cs.CL New submission

Understanding the Sociocultural Dimensions of Mental Health Discourse in Arabic-Language X Communities

Amal Alqahtani, Rana Salama, Mona Diab

Affiliations: King Saud University; Cairo University; Carnegie Mellon University

AI summary: Uses GPT-4.1 to identify X (formerly Twitter) users making personal disclosures and analyzes discourse on borderline personality disorder, bipolar disorder, and ADHD, finding condition-specific vocabulary patterns; contributes a reusable LLM-assisted disclosure pipeline and a cultural keyword framework.

Comments: Accepted to the SMM4H-HeaRD Workshop, co-located with the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026)

Details
Abstract

Computational mental health research has predominantly centered on English-speaking populations, leaving Arabic-language discourse comparatively under-examined. We present an exploratory computational study of 8,147 tweets from 607 users classified by a GPT-4.1 personal-disclosure pipeline as likely lived-experience authors in three condition-specific Arabic-language X (formerly Twitter) Communities. We focus on discourse related to borderline personality disorder (BPD), bipolar disorder, and ADHD, and characterize community-associated linguistic patterns using a multi-domain cultural keyword framework. The results suggest that in this corpus, Bipolar tweets contain more religious and medical vocabulary, BPD tweets contain more relational, identity, and emotional-distress vocabulary, and ADHD tweets more often focus on practical symptoms and medication management. We treat these patterns as hypothesis-generating rather than confirmatory because the corpus is imbalanced across conditions, some subcorpora are temporally concentrated, and the keyword framework is an initial operationalization rather than a validated measurement instrument. The paper contributes a reusable LLM-assisted personal-disclosure pipeline and an exploratory cultural keyword framework for Arabic mental health discourse.

2606.08303 2026-06-09 cs.LG New submission

GeoGNN: Time Series Geo-Localization using Two-Tower Graph Neural Networks

Toan Tran, Waqwoya Abebe, Abhishek Potnis, Supriya Chinthavali, Cyrus Shahabi, Li Xiong, Dalton Lunga

Affiliations: Emory University; Oak Ridge National Laboratory; University of Southern California

AI summary: Proposes GeoGNN, a two-tower architecture that learns spatial embeddings from a geographic adjacency graph and matches them against time-series representations by dot product to geolocalize time series, improving accuracy by about 27% on average on electricity-consumption datasets.

Details
Abstract

This paper investigates a novel concept of time series geolocalization, where the goal is to infer the geographic origin of each raw time series. Successful geolocalization can provide spatial context to time series, enabling downstream location-aware applications. We formalize the problem, adapt core ideas from image geolocalization to establish strong baselines, and propose GeoGNN, a two-tower architecture. During training, GeoGNN's spatial tower learns embeddings of geographic cell candidates by leveraging the geographic adjacency graph, while the temporal tower extracts informative representations from time series. During inference, each temporal representation is matched against candidate geographic embeddings using dot-product similarity, combined with an auxiliary classification head, to predict the time series' associated geographic origin. Experiments on large-scale, countrywide electricity-consumption datasets demonstrate that GeoGNN achieves the best performance across datasets and enhances both fine- and coarse-grained geolocalization accuracy by ~27% on average.
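
The inference step described above, matching a temporal representation against candidate geographic embeddings by dot-product similarity, can be sketched as follows. The embeddings and cell names are hypothetical stand-ins for the outputs of the two towers, and the auxiliary classification head mentioned in the abstract is left out because the abstract does not say how the two signals are combined.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def rank_cells(series_embedding, cell_embeddings, top_k=1):
    """Rank candidate geographic cells by dot-product similarity to the series embedding."""
    scores = {cell: dot(series_embedding, emb) for cell, emb in cell_embeddings.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:top_k]

# Hypothetical spatial-tower outputs for three candidate cells.
cell_embeddings = {
    "cell_A": [1.0, 0.0],
    "cell_B": [0.0, 1.0],
    "cell_C": [0.7, 0.7],
}
# Hypothetical temporal-tower output for one electricity-consumption series.
series_embedding = [0.9, 0.2]

print(rank_cells(series_embedding, cell_embeddings, top_k=2))
```

Because the cell embeddings do not depend on the query series, they can be computed once and reused for every series at inference time, a common reason for choosing a two-tower design.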

2606.08302 2026-06-09 cs.CV New submission

HACK++: Towards More Effective Head-Aware Key-Value Compression for Efficient Visual Autoregressive Modeling

Ziran Qin, Yuchen Jiang, Mingbao Lin, Youru Lv, Hang Guo, Wen Fei, Weiyao Lin

Affiliations: Shanghai Jiao Tong University; Rakuten; Tsinghua University

AI summary: Targets the high compute and memory overhead caused by cross-scale KV caches in VAR models with HACK++, a training-free head-aware compression framework that classifies head types offline and allocates budgets adaptively, maintaining near-lossless generation under very low cache budgets.

详情
AI中文摘要

视觉自回归(VAR)模型采用下一尺度预测范式,以显著更少的解码步骤实现高质量生成。然而,现有VAR模型由于跨尺度键值(KV)缓存的累积,面临严重的注意力复杂度和内存开销。本文通过将KV缓存压缩引入下一尺度范式来应对这一挑战。我们首先深入分析VAR注意力,观察到注意力头可以稳定地分为两个功能不同的类别:上下文头关注保持语义一致性,而结构头保持空间连贯性。它们的功能差异使得现有的一刀切压缩方法在VAR模型上表现不佳。我们进一步发现,两种头部类型对历史尺度的依赖程度不同,且这种依赖在不同层和生成步骤中发生变化,这要求自适应的缓存预算分配。为解决这些问题,我们提出HACK++,一种针对VAR模型的无训练头部感知键值压缩框架。通过一次性离线校准,HACK++分类头部类型并推导头部特定先验。在推理时,它将注意力与缓存压缩在独立预算下解耦,在压缩累积缓存时采用更激进的策略,通过模式特定策略和依赖感知预算分配来限制当前尺度的注意力成本。在多个VAR模型上进行的广泛实验,涵盖文本到图像、类别条件和统一理解与生成任务,验证了HACK++的有效性和泛化能力。例如,在Infinity-2B/8B上,HACK++在仅30%注意力预算和10%缓存预算下保持近无损生成,即使在1%缓存预算下也保持稳健。

English abstract

Visual Autoregressive (VAR) models adopt a next-scale prediction paradigm, offering high-quality generation with substantially fewer decoding steps. However, existing VAR models suffer from significant attention complexity and severe memory overhead due to the accumulation of key-value (KV) caches across scales. In this paper, we tackle this challenge by introducing KV cache compression into the next-scale paradigm. We begin with an in-depth analysis of VAR attention and observe that attention heads can be stably divided into two functionally distinct categories: Contextual Heads focus on maintaining semantic consistency, while Structural Heads preserve spatial coherence. Their functional divergence makes existing one-size-fits-all compression methods perform poorly on VAR models. We further find that the two head types differ markedly in their reliance on historical scales, and that this reliance shifts across layers and generation steps, arguing for an adaptive cache budget allocation. To address these challenges, we propose HACK++, a training-free Head-Aware key-value Compression frameworK for VAR models. From a one-time offline calibration, HACK++ classifies head types and derives head-specific priors. At inference, it decouples attention from cache compression under independent budgets, bounding the current-scale attention cost while compressing the accumulated cache far more aggressively, via pattern-specific strategies and a reliance-aware budget allocation. Extensive experiments on multiple VAR models across text-to-image, class-conditional, and unified understanding-and-generation tasks validate the effectiveness and generalizability of HACK++. For example, on Infinity-2B/8B, HACK++ maintains near-lossless generation with only a 30% attention budget and a 10% cache budget, and remains robust even under a 1% cache budget.
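
To make the idea of a fractional cache budget concrete, here is a minimal sketch of generic top-k cache eviction for a single attention head. It is not HACK++ itself: the per-token importance scores, the function name `keep_indices`, and the example values are assumptions for illustration, and the paper's head classification, pattern-specific strategies, and reliance-aware budget allocation are not modeled.

```python
import math
from typing import List


def keep_indices(scores: List[float], budget: float) -> List[int]:
    """Select which cached tokens to keep under a fractional budget.

    `scores` holds one hypothetical importance score per cached token
    (for example, accumulated attention mass); `budget` is the fraction
    of the cache to retain, in (0, 1].
    """
    if not 0.0 < budget <= 1.0:
        raise ValueError("budget must be in (0, 1]")
    if not scores:
        return []
    # Always keep at least one token so the head has something to attend to.
    keep = max(1, math.ceil(budget * len(scores)))
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:keep]
    # Restore positional order so the retained cache stays in sequence.
    return sorted(top)


if __name__ == "__main__":
    # Toy scores for six cached tokens, chosen only for illustration.
    token_scores = [0.05, 0.40, 0.10, 0.30, 0.02, 0.13]
    print(keep_indices(token_scores, budget=0.5))
```

A head-aware scheme would call a routine like this with a different budget per head, which is where an adaptive allocation would plug in.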

2606.08296 2026-06-09 cs.AI cs.LG New submission

Revisiting the shutdown problem

重新审视关机问题

David Thorstad

Affiliations: GitHub

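AI summary: The paper re-evaluates the difficulty of the AI shutdown problem, arguing that existing arguments fail to establish that it is hard to solve and that the technical solutions it has motivated impose a high safety tax on model performance.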

AI中文摘要

关于人工智能存在风险的主要论点中的一个关键前提是,功能异常的人工智能体无法轻易被关闭。这引发了灾难性关机问题,即确保在人工智能体造成灾难性后果之前能够将其关闭。一系列论证和定理表明,解决灾难性关机问题很困难,这加强了存在风险的论点,并推动寻找解决灾难性关机问题的方法。本文论证了两个结论。第一,现有论证并未确立解决灾难性关机问题的难度。第二,对灾难性关机问题的关注导致了技术解决方案,这些方案对模型性能施加了高安全代价。

English abstract

A key premise in leading arguments for existential risk from artificial intelligence is that malfunctioning artificial agents could not be easily shut down. This motivates the catastrophic shutdown problem of ensuring that agents can be shut down before they cause an existential catastrophe. A range of arguments and theorems are offered to suggest that solving the catastrophic shutdown problem is difficult, bolstering arguments for existential risk and motivating a search for solutions to the catastrophic shutdown problem. This paper argues for two conclusions. First, existing arguments do not establish the difficulty of solving the catastrophic shutdown problem. Second, concern for the catastrophic shutdown problem has led to technical solutions that impose a high safety tax on model performance.

2606.08295 2026-06-09 cs.CL New submission

TLRD: Teaching LLMs to Reason over Tabular Data with Tri-Level Rationale Distillation

TLRD: 通过三级理由蒸馏教授LLMs在表格数据上进行推理

Tianyuan Liang, Xuwei Tan, Lei Shi, Junsheng Zhong, Ziyu Hu, Tian Xie, Zhiqun Zuo, Xiaodong Yu, Xueru Zhang

Affiliations: The Ohio State University, Stevens Institute of Technology

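AI summary: The paper proposes TLRD, a framework that uses tri-level rationale distillation to convert tabular datasets into structured rationale supervision, enabling LLMs to make zero-overhead predictions and give interpretable explanations from raw features alone, and significantly narrowing the performance gap with tree ensembles.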

AI中文摘要

表格数据是存储现实世界信息的主要媒介,驱动着机器学习的许多工业应用。传统预测器实现了强大的预测性能,但不提供决策所必需的可读、案例特定的解释。大型语言模型(LLMs)可以通过生成预测和解释来自然弥合这一差距。然而,数据集特定的模式(如特征分布和交互)使LLMs难以理解和推理表格数据,而仅标签微调在提高性能的同时会导致灾难性遗忘。为了解决这个问题,我们提出了三级理由蒸馏(TLRD),一个将仅标签表格数据集转换为LLMs的结构化理由监督的框架。TLRD使用高容量教师模型,基于三个互补的证据级别(实例级特征、数据集级分布上下文和比较级检索邻居)合成理由语料库,然后将理由蒸馏到学生LLMs中,从而仅从原始特征实现零开销预测和基于理由的解释。在多个领域数据集上的实验表明,TLRD显著缩小了LLMs与最先进树集成模型之间的性能差距,同时产生基于理由且可读的解释,为高风险决策提供了有价值的参考。

English abstract

Tabular data is a primary medium for storing real-world information, driving many industrial applications of machine learning. Traditional predictors achieve strong predictive performance but do not provide the readable, case-specific explanations essential for decision-making. Large Language Models (LLMs) can naturally bridge this gap by generating predictions alongside explanations. However, dataset-specific patterns, such as feature distributions and interactions, make tabular data difficult for LLMs to understand and reason over, while label-only fine-tuning improves performance at the cost of catastrophic forgetting. To address this problem, we propose Tri-Level Rationale Distillation (TLRD), a framework that converts label-only tabular datasets into structured rationale supervision for LLMs. TLRD uses a high-capacity teacher to synthesize a rationale corpus grounded in three complementary levels of evidence (instance-level features, dataset-level distributional context, and comparison-level retrieved neighbors), then distills the rationales into student LLMs, enabling zero-overhead prediction and grounded explanation from raw features only. Experiments on multiple domain datasets show that TLRD significantly closes the performance gap between LLMs and state-of-the-art tree ensembles while producing grounded and readable explanations, offering a valuable reference for high-stakes decision-making.