Large Language Models / LLM

2606.19108 2026-06-18 cs.LG new submission 60%

JourneyFormer: Encoding Airbnb Guest Journey with Sequence Modeling

JourneyFormer: 使用序列建模编码Airbnb客人旅程

Daochen Zha, Chun How Tan, Xin Liu, Bin Xu, Han Zhao, Xiaowei Liu, Tracy Yu, Hui Gao, Huiji Gao, Liwei He, Stephanie Moyerman, Sanjeev Katariya

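Affiliations * Airbnb

Topic match Other LLM: sequence modeling for recommendation; not core LLM work.

AI summary To address the long, exploratory guest sequences and sparse labels at Airbnb, proposes JourneyFormer, a sequence modeling solution that refines data selection, ID embeddings, model architecture, and label attribution, and validates it through online A/B tests on two production surfaces.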

Comments Accepted by KDD 2026

AI中文摘要

序列建模因其能够建模用户历史行为并推断用户意图，在推荐和排序算法中越来越受欢迎。尽管理论简单，但由于序列的复杂性和稀疏标签，序列模型在生产中的实际部署并非易事。例如，在Airbnb中，客人序列通常较长、具有探索性且复杂，我们关注的是稀疏的预订标签。因此，我们经常需要在数据和建模方面做出各种设计决策，以在有效性和可扩展性之间取得平衡。本文深入探讨了这些生产挑战，并部署了JourneyFormer，一种用于Airbnb搜索排序的序列建模解决方案。我们详细介绍了关键的设计考虑，涵盖客人事件选择、ID嵌入、模型架构和标签归因等方面。此外，我们描述了几种加速模型训练和推理的定制策略。JourneyFormer已成功部署在Airbnb的生产环境中，其有效性和影响不仅通过改进的离线排序指标得到证明，而且通过两个生产面上的在线A/B测试在关键业务指标上取得了显著提升。

英文摘要

Sequence modeling has become increasingly popular in recommendation and ranking algorithms, owing to its capacity to model users' historical behaviors and infer user intentions. Despite its theoretical simplicity, the practical deployment of a sequence model in production is non-trivial due to the complexity of the sequences and the sparsity of labels. For example, at Airbnb, guest sequences are often long, exploratory, and complex, and we focus on booking labels, which are sparse. As such, we are often required to make various design decisions regarding data and modeling to strike a balance between effectiveness and scalability. This work delves into these production challenges and deploys JourneyFormer, a sequence modeling solution for search ranking at Airbnb. We detail crucial design considerations, covering aspects such as guest event selection, ID embeddings, model architecture, and label attribution. Additionally, we describe several tailored strategies to accelerate model training and inference. JourneyFormer has been successfully deployed in production at Airbnb, where its effectiveness and impact have been evidenced not only by improved offline ranking metrics but also by significant gains in key business metrics through online A/B testing across two production surfaces.

2606.18923 2026-06-18 cs.LG new submission 60%

GrapNet: A Programmable Dynamic-Architecture Neural Graph Substrate

GrapNet: 一种可编程的动态架构神经图基板

Zirong Li

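Affiliations * Zirong Li

Topic match Other LLM: proposes a programmable neural graph substrate; not core LLM work.

AI summary Proposes GrapNet, a neural substrate that treats the graph as an executable architecture and exposes a programmable interface for structural edits, subgraph freezing, and local audits; it improves accuracy by 12.08 and 3.81 percentage points on Split Fashion-MNIST and Split CIFAR-10, respectively.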

Comments 8 pages, 1 figure, preprint

AI中文摘要

可编程性是固定张量神经网络中缺失的一流接口：编辑关系、冻结子图、审计局部函数或更改执行后端应是对神经程序的操作，而非临时参数手术。GrapNet研究这种图即网络的设置。图是架构和可执行程序，而非输入数据图。每个计算节点拥有其下一层子节点引用和与这些引用对齐的可训练分配向量；删除关系会物理移除子节点引用和相应的分配坐标。结构规则和执行策略位于节点核心之外，因此同一子节点拥有的图可以被增长、冻结、结构编辑、分组为可训练族块、通过注意力在活动关系上路由，或在拓扑稳定后降级为密集快照。GrapNet通过向量值父接口与常规模块组合：密集层、CNN编码器、ResNet特征提取器、注意力块和Transformer表示都可以为每个坐标提供一个感知GrapNode。评估组织为可编程性压力测试套件，而非新的重放基准。在匹配的十种子Split Fashion-MNIST研究中，可塑GrapNet+ER头在相同已见类损失和重放记忆下达到63.16%的已见类准确率，而参数更大的密集MLP+ER为51.08%，配对差值为12.08点，p=1.3e-5。在Split CIFAR-10上使用冻结的ImageNet ResNet-18编码器时，相同基板将在线头比MLP-256提高3.81点，p=0.0026。这些结果支持GrapNet作为可编辑的神经图基板，其核心价值在于具有忠实执行视图的结构可编程性。

英文摘要

Programmability is a missing first-class interface in fixed-tensor neural networks: editing a relation, freezing a subgraph, auditing a local function, or changing the execution backend should be an operation on the neural program rather than ad-hoc parameter surgery. GrapNet studies this graph-as-network setting. The graph is the architecture and executable program, not an input data graph. Each compute node owns its next-layer child references and a trainable allocation vector aligned with those references; deleting a relation physically removes both the child reference and the corresponding allocation coordinate. Structural rules and execution policies live outside the node core, so the same child-owned graph can be grown, frozen, structurally edited, grouped into trainable family blocks, routed by attention over active relations, or lowered to dense snapshots after topology stabilizes. GrapNet composes with conventional modules through a vector-valued parent interface: dense layers, CNN encoders, ResNet feature extractors, attention blocks, and transformer representations can all feed one sensory GrapNode per coordinate. The evaluation is organized as a programmability stress suite rather than as a new replay benchmark. In a matched ten-seed Split Fashion-MNIST study, a plastic GrapNet+ER head reaches 63.16 percent seen-class accuracy versus 51.08 percent for a parameter-larger dense MLP+ER under the same seen-class loss and replay memory, with paired delta 12.08 points and p=1.3e-5. On Split CIFAR-10 with a frozen ImageNet ResNet-18 encoder, the same substrate improves the online head over MLP-256 by 3.81 points, with p=0.0026. These results support GrapNet as an editable neural graph substrate whose core value is structural programmability with faithful execution views.
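
The child-owned structure described in the abstract can be pictured with a small sketch. The `GrapNode` class below is hypothetical and is not the paper's implementation: it only assumes that a node stores a list of child references plus an allocation vector of the same length, and that deleting a relation drops both entries together.

```python
from dataclasses import dataclass, field


@dataclass
class GrapNode:
    """Hypothetical node: owns child references and an aligned allocation vector."""
    name: str
    children: list = field(default_factory=list)    # next-layer child references
    allocation: list = field(default_factory=list)  # one trainable weight per child

    def add_relation(self, child, weight=0.0):
        # Growing the graph appends a reference and its allocation coordinate together.
        self.children.append(child)
        self.allocation.append(weight)

    def delete_relation(self, child):
        # Deleting a relation removes both the reference and its coordinate.
        idx = next(i for i, c in enumerate(self.children) if c is child)
        del self.children[idx]
        del self.allocation[idx]


parent = GrapNode("parent")
left, right = GrapNode("left"), GrapNode("right")
parent.add_relation(left, 0.7)
parent.add_relation(right, 0.3)
parent.delete_relation(left)  # the 0.7 coordinate disappears with the reference
```

Keeping the two lists index-aligned is what makes a structural edit a single local operation in this sketch, rather than a mask applied to a fixed-size weight tensor.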

2606.18856 2026-06-18 cs.CL cs.LG new submission 60%

Approximate Structured Diffusion for Sequence Labelling

近似结构化扩散用于序列标注

Nicolas Floquet, Joseph Le Roux, Nadi Tomeh

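Affiliations * Université Sorbonne Paris Nord, CNRS, Laboratoire d’Informatique de Paris Nord, LIPN

Topic match Other LLM: diffusion models for sequence labelling; not core LLM work but related.

AI summary Proposes a diffusion-based training method for conditional random fields (CRFs) that conditions on noised labels to capture long-range dependencies and, combined with approximate inference, achieves a 16.5% error reduction on part-of-speech tagging.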

2606.18852 2026-06-18 cs.CL cs.AI new submission 60%

Aligning Implied Statements for Implicit Hate Speech Generalizability with Context-Bounded Semi-hard Negative Mining

对齐隐含陈述：通过上下文边界半硬负挖掘实现隐式仇恨言论的泛化性

Wicaksono Leksono Muhamad, Yunita Sari

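Affiliations * Mantera Studio; Universitas Gadjah Mada

Topic match Other LLM: implicit hate speech classification using contrastive learning.

AI summary Proposes ImpSH, a triplet-based framework that aligns posts with implied statements and focuses learning with context-bounded semi-hard negatives, often improving cross-domain performance over contrastive baselines on IHC, SBIC, and DynaHate.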

AI中文摘要

隐式仇恨言论分类仍然是一个挑战，因为意图通常通过暗示和上下文而非明确辱骂来掩盖。先前的监督对比方法改进了域内检测，但可能过拟合表面线索，且难以跨数据集迁移。我们提出ImpSH，一个基于三元组的框架，当隐含陈述可用时将其与帖子对齐，并使用上下文边界半硬负样本将学习聚焦于近混淆项。我们还研究了AugSH，它通过数据增强形成正样本。在使用BERT和HateBERT对IHC、SBIC和DynaHate进行的受控评估中，ImpSH是标准监督对比基线的可行替代方案，并且在匹配的预处理和调优预算下通常能提高跨域性能。使用对齐性和均匀性进行的表示分析表明，正样本对更紧密且全局分布平衡，定性最近邻案例研究展示了域转移下的典型假负例。这些结果表明，通过上下文边界挖掘将帖子与其隐含陈述对齐，提供了到相关暗示的更稳定、类似双射的映射，克服了传统基于聚类的表示学习固有的波动性。

英文摘要

Classifying implicit hate speech remains a challenge, as intent is often masked through insinuation and context rather than explicit slurs. Prior supervised contrastive approaches improve in-domain detection but can overfit surface cues and struggle to transfer across datasets. We propose ImpSH, a triplet-based framework that aligns posts with implied statements when available and uses context-bounded semi-hard negatives to focus learning on near confusions. We also examine AugSH, which forms positives via data augmentation. In controlled evaluations on IHC, SBIC, and DynaHate with BERT and HateBERT, ImpSH is a viable alternative to standard supervised contrastive baselines and often improves cross-domain performance under matched preprocessing and tuning budgets. Representation analysis using alignment and uniformity indicates tighter positive pairs with balanced global spread, and qualitative nearest-neighbor case studies illustrate typical false negatives under domain shift. These results demonstrate that aligning posts with their implied statements via context-bounded mining provides a more stable, bijective-like mapping to related insinuations, overcoming the volatility inherent in traditional clustering-based representation learning.
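
To make the mining step concrete, here is a minimal sketch of the standard semi-hard criterion from the triplet-loss literature: keep negatives that are farther from the anchor than the positive, but by less than the margin. The abstract does not spell out how the context bound is computed, so it is not modeled here; the function names, the margin, and the toy embeddings are all assumptions for illustration.

```python
import math


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def semi_hard_negatives(anchor, positive, negatives, margin=0.5):
    """Return negatives farther than the positive but still inside the margin."""
    d_pos = euclidean(anchor, positive)
    return [n for n in negatives if d_pos < euclidean(anchor, n) < d_pos + margin]


def triplet_loss(anchor, positive, negative, margin=0.5):
    # Hinge on the gap between the positive and negative distances.
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


anchor = [0.0, 0.0]    # post embedding (toy values)
positive = [0.3, 0.0]  # implied-statement embedding (toy values)
negatives = [[0.1, 0.0], [0.5, 0.0], [2.0, 0.0]]  # closer than positive, semi-hard, easy
selected = semi_hard_negatives(anchor, positive, negatives)
```

Only the middle negative survives the filter: the first is closer than the positive, and the last already satisfies the margin, so it would contribute zero loss.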

2606.18820 2026-06-18 cs.LG cs.AI new submission 60%

Maturing Markov Decision Processes: Decision Making under Increasing Information and Shrinking Action Sets

成熟马尔可夫决策过程：信息增加与动作集缩小下的决策制定

Jiaxi Liu, Aiping Yang, Yuhang Yang, Shuqi Zhang, Zewei Dong, Jiangming Yang, Xuebin Chen

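Affiliations * Ant International; School of Economics, Sichuan University; School of Economics, Fudan University

Topic match Other LLM: proposes the MMDP framework with structure-aware reinforcement learning; weakly related to LLMs.

AI summary Targeting the asymmetry between growing information and shrinking action sets in decision processes, proposes the Maturing Markov Decision Process (MMDP) framework and, building on an expiring-action priority principle, develops a structure-aware reinforcement learning method that experiments show improves learning efficiency.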

Comments 25 pages, 9 figures

AI中文摘要

序列决策问题通常表现出信息和决策灵活性的不对称演化：随着决策周期的展开，智能体获得更丰富的信息，而由于操作截止、承诺或资源约束，可行动作逐渐过期。标准的MDP公式通常将这种结构扁平化为阶段相关的状态描述和动作掩码，从而掩盖了嵌套的信息-动作不对称性，而这种不对称性决定了哪些决策是紧急的、哪些可以推迟。我们引入了成熟马尔可夫决策过程（MMDP），这是一种围绕这种信息-动作不对称性构建的公式。我们通过一个过期动作优先级原则来刻画其关键后果之一，该原则识别出必须在下一阶段之前解决的动作。受此结构启发，我们开发了一个结构感知的强化学习框架，包括阶段感知的策略设计、过期动作抽象以及带有蒸馏的搜索增强学习。在受控的多供应商补货问题、复杂度递增的简化现金管理环境以及生产级模拟器上的实验表明，显式建模这种不对称性可以提高学习效率，并且随着决策问题的规模扩大，其价值日益增加。

英文摘要

Sequential decision problems often exhibit an asymmetric evolution of information and decision flexibility: as a decision cycle unfolds, the agent receives richer information while feasible actions expire due to operational cutoffs, commitments, or resource constraints. Standard MDP formulations typically flatten this structure into stage-dependent state descriptions and action masks, thereby obscuring the nested information--action asymmetry that determines which decisions are urgent and which can be deferred. We introduce Maturing Markov Decision Processes (MMDPs), a formulation built around this information--action asymmetry. We characterize one of its key consequences through an expiring-action priority principle, which identifies the actions that must be resolved before the next stage. Motivated by this structure, we develop a structure-aware reinforcement learning framework with stage-aware policy design, expiring-action abstraction, and search-augmented learning with distillation. Experiments on a controlled multi-supplier replenishment problem, simplified cash-management environments of increasing complexity, and a production-scale simulator show that explicitly modeling this asymmetry improves learning efficiency and becomes increasingly valuable as decision problems scale.

2606.18790 2026-06-18 cs.SD cs.AI cs.LG new submission 60%

Closing the Loop: PID Feedback Control for Interpretable Activation Steering in Symbolic Music Generation

闭环：用于符号音乐生成中可解释激活引导的PID反馈控制

Ioannis Prokopiou, Pantelis Vikatos, Maximos Kaliakatsos-Papakostas, Theodoros Giannakopoulos, Themos Stafylakis

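Affiliations * Athens University of Economics and Business; Orfium Research; Hellenic Mediterranean University; Archimedes / Athena Research Center

Topic match Other LLM: activation steering in symbolic music generation; weakly related to LLMs.

AI summary Proposes an inference-time activation steering framework based on PID feedback control, extracts latent pitch and duration directions with the Difference-in-Means method, and decouples multi-attribute steering with Gram-Schmidt orthogonalization for fine-grained, interpretable attribute modulation in symbolic music generation.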

Comments Accepted at Learning to Listen: ICML 2026 Workshop on Machine Learning for Audio (43rd International Conference on Machine Learning - ICMLMLA26), 4 pages main (11 total), 2 figures

AI中文摘要

基于Transformer的架构在生成复杂符号序列方面取得了显著进展，但在实现对离散信号属性的细粒度、可解释控制方面仍存在明显差距。本文研究了多轨音乐Transformer（MMT）的机制可解释性，并提出了一种无需重新训练即可通过推理时激活引导实现确定性属性调制的框架。利用差分均值（DiffMean）方法，我们在残差流中分离出信号属性（特别是音高和时长）的潜在方向。我们验证了该领域的线性表示假设，实现了引导幅度与属性偏移之间的高相关性。为了解决多属性引导中固有的特征纠缠问题，我们引入了一种利用Gram-Schmidt正交化的双引导框架。实验结果表明，与朴素向量加法相比，这种几何解耦减少了概念干扰和信号退化，即使在强自回归条件下也能实现独立的确定性控制。

英文摘要

Transformer-based architectures have significantly advanced the generation of complex symbolic sequences, yet a significant gap remains in achieving fine-grained, interpretable control over discrete signal attributes. This paper investigates the mechanistic interpretability of the Multitrack Music Transformer (MMT) and proposes a framework for deterministic attribute modulation without retraining to bridge this gap via inference-time activation steering. Utilizing the Difference-in-Means (DiffMean) methodology, we isolate latent directions for signal attributes, specifically Pitch and Duration, within the residual stream. We validate the Linear Representation Hypothesis in this domain, achieving high correlation between steering magnitude and attribute shift. To address the inherent feature entanglement in multi-attribute steering, we introduce a Dual Steering framework utilizing Gram-Schmidt Orthogonalization. Experimental results demonstrate that this geometric decoupling reduces conceptual interference and signal degradation compared to naive vector addition, enabling independent deterministic control even against strong autoregressive conditioning.
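
The two geometric steps named in the abstract, a difference-in-means direction and a Gram-Schmidt step that removes one direction's component along the other, can be sketched in a few lines. This is an illustration under assumptions, not the authors' code: the activations are toy 3-d vectors rather than MMT residual-stream states, the helper names are invented, and the steering magnitudes `alpha` and `beta` are arbitrary.

```python
def mean_vector(rows):
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def diff_mean(high_acts, low_acts):
    """Difference-in-means direction between two sets of activations."""
    return [h - l for h, l in zip(mean_vector(high_acts), mean_vector(low_acts))]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def orthogonalize(v, u):
    """One Gram-Schmidt step: remove from v its component along u."""
    scale = dot(v, u) / dot(u, u)
    return [a - scale * b for a, b in zip(v, u)]


# Toy activations; real ones would be collected from the model's residual stream.
pitch_dir = diff_mean([[2.0, 1.0, 0.0], [4.0, 1.0, 0.0]], [[0.0, 1.0, 0.0], [2.0, 1.0, 0.0]])
duration_dir = diff_mean([[1.0, 3.0, 0.0], [1.0, 5.0, 0.0]], [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]])
duration_orth = orthogonalize(duration_dir, pitch_dir)

hidden = [0.5, 0.5, 0.5]
alpha, beta = 1.0, 2.0  # steering magnitudes (illustrative)
steered = [h + alpha * p + beta * d for h, p, d in zip(hidden, pitch_dir, duration_orth)]
```

After orthogonalization, changing `beta` no longer moves the hidden state along `pitch_dir`, which is the decoupling that naive vector addition of the raw directions lacks.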

2606.18548 2026-06-18 cs.CY cs.AI new submission 60%

Engagement Intensity as a Learner-Modeling Signal for Adaptive AI Ethics Instruction

参与强度作为自适应AI伦理教学的学习者建模信号

Yongkyung Oh, Lynn Talton, Alex Bui

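Affiliations * University of California, Los Angeles (UCLA)

Topic match Other LLM: studies the relationship between LLM usage frequency and AI perceptions.

AI summary Compares three learner features (usage frequency, self-rated familiarity, and prior AI education) against AI perception outcomes and finds that usage frequency is significantly associated with all five outcomes, offering a simple intake signal for learner modeling in adaptive AI ethics instruction.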

AI中文摘要

在研究生研究训练中，自适应AI伦理教学受益于反映先前LLM经验差异的入学者测量指标。先前的课程或研讨会参与是一个明显的候选指标，但尚不清楚它是否与关键AI感知项目的教学前评分相关。我们比较了三种候选入学者特征：自我报告的使用频率、自评LLM熟悉度和先前AI教育，针对93名参加必修研究伦理课程的生命科学研究生和博士后学员的五项基线感知结果。使用频率与所有五项结果显示出Holm校正的关联，自评熟悉度与三项结果相关，而先前AI教育与任何结果均无关联。在量表低端呈现阈值模式，在训练兴趣和准确性信任方面最为明显，而非在所有五项结果上呈现均匀梯度。在简短的入学者调查中，报告的LLM使用比先前的课程或研讨会更一致地与这些感知相关，自评熟悉度作为次要指标。这些结果表明，简单的教学前行为信号可以为自适应AI伦理教育的轻量级入学者画像提供信息。

英文摘要

Adaptive AI ethics instruction in graduate research training benefits from intake measures that reflect differences in prior LLM experience. Prior coursework or workshop attendance is an obvious candidate, but it is not clear whether it is associated with pre-instruction ratings on key AI perception items. We compare three candidate intake features, self-reported usage frequency, self-rated LLM familiarity, and prior AI education, across five baseline perception outcomes in 93 bioscience graduate and postdoctoral trainees enrolled in a required research ethics course. Usage frequency shows Holm-corrected associations with all five outcomes, self-rated familiarity with three, and prior AI education with none. A threshold-like pattern at the lower end of the scale is most visible for training interest and accuracy trust rather than appearing as a uniform gradient across all five outcomes. In a short intake survey, reported LLM use is more consistently associated with these perceptions than prior coursework or workshops, with self-rated familiarity serving as a secondary indicator. These results suggest that simple pre-instruction behavioral signals can inform lightweight intake profiling for adaptive AI ethics education.

2606.18539 2026-06-18 cs.LG stat.ML new submission 60%

TS-Fault: Benchmarking Time Series Forecasters Against Structural Faults

TS-Fault: 针对结构性故障的时间序列预测器基准测试

Yuyang Zhao, Lian Xu, Hao Miao, Chenxi Liu, Hao Xue

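Affiliations * Ray-zyy

Topic match Other LLM: evaluates the robustness of time series forecasting models.

AI summary Proposes the TS-Fault benchmark, which evaluates forecaster robustness under parameterized fault scenarios along two axes (observation- vs mechanism-level, univariate vs multivariate), and finds that clean-data accuracy anti-correlates with robustness, mechanism-level faults reshuffle rankings, and foundation models are the most fragile.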

AI中文摘要

时间序列预测（TSF）支撑着能源、交通、金融和医疗等领域的关键决策，然而TSF模型几乎普遍通过在干净保留数据上的单一数字（如平均误差）进行排名，隐含假设该数字能预测部署可靠性。但实际故障并非独立同分布噪声，而是具有时间形状的结构化事件、断裂的跨变量依赖、伴随缺失的机制变化以及跨传感管道的因果传播。将TSF鲁棒性视为数据质量问题，我们提出TS-Fault，一个在显式、参数化且具有可控语义难度的故障场景下评估预测模型的基准。TS-Fault将重复出现的故障沿两个正交轴（观测级 vs 机制级；单变量 vs 多变量）组织为四种模式，并通过统一重要性评分将每种故障注入最关键的预测窗口。该设计使得鲁棒性能够针对模型实际依赖的结构进行测试，而非简化为通用噪声敏感性。我们在6个数据集、4种模式和5个难度级别上，采用配对干净/损坏协议评估了21个模型。结果揭示了三个与常见排行榜直觉相悖的发现：（i）干净数据准确性与鲁棒性负相关；（ii）干净排名在观测级故障下保持不变，但在机制级故障下重新洗牌；（iii）所有灾难性故障均发生在机制级故障下，基础模型在干净数据上准确率最高但表现出最大的脆弱性。代码已公开于该URL。

英文摘要

Time series forecasting (TSF) underpins consequential decisions in energy, transportation, finance, and healthcare, yet TSF models are almost universally ranked by a single number (e.g., average error) on clean held-out data, under the implicit assumption that it predicts deployed reliability. However, real faults are not i.i.d. noise but structured events with temporal shape, broken cross-variable dependencies, regime change coupled with missingness, and causal propagation across a sensing pipeline. Treating TSF robustness as a data-quality problem, we present TS-Fault, a benchmark that evaluates forecasting models under explicit, parameterized fault scenarios with controllable semantic difficulty. TS-Fault organizes recurring failures into four modes along two orthogonal axes (observation- vs mechanism-level; univariate vs multivariate) and injects each fault into the most prediction-critical window via a unified importance score. This design enables robustness to be tested against the structures models actually rely on, rather than reduced to generic noise sensitivity. We evaluate 21 models across 6 datasets, 4 modes, and 5 difficulty levels under a paired clean/corrupt protocol. The results reveal three findings that contradict common leaderboard intuition: (i) clean-data accuracy anti-correlates with robustness; (ii) clean rankings are preserved under observation-level faults but reshuffled under mechanism-level faults; and (iii) all catastrophic failures occur under mechanism-level faults, with foundation models achieving the highest clean-data accuracy yet exhibiting the greatest fragility. The code is publicly available at https://github.com/Ray-zyy/TS-Fault.

2606.18525 2026-06-18 cs.LG new submission 60%

Hierarchical Attention via Domain Decomposition

基于区域分解的层次注意力机制

Stephan Köhler, Oliver Rheinbach

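Affiliations * Faculty of Mathematics and Computer Science

Topic match Other LLM: proposes a hierarchical attention mechanism to improve Transformers.

AI summary Proposes a hierarchical attention mechanism based on two-level overlapping Schwarz domain decomposition, combining local low-rank attention blocks with a coarse attention block to train faster and reach higher accuracy with fewer parameters.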

Comments 20 pages, 10 figures

AI中文摘要

我们提出了一种基于两水平重叠Schwarz区域分解的层次注意力机制。该方法的动机源于观察到两水平Schwarz区域分解方法将局部子域校正与一个传达全局、长程信息的粗水平相结合。我们在一个具有齐次Dirichlet边界条件的一维扩散问题背景下，测试了其在有限维算子学习中的实用性。尽管该问题简单，但它提供了一个受控的序列到序列设置，其中精确的非局部解算子已知。离散化后，学习解算子相当于逼近一个对称正定矩阵的逆。作为基线，我们使用一个全局无softmax的低秩注意力算子，形式为$QK^T$。所提出的构造将这个密集的全局分解替换为一个两水平加性结构：重叠子域上的局部低秩注意力块与一个粗注意力块相结合。得到的算子形式为$$M_{\theta}^{-1} = \Phi Q_0 K_0^T \Phi^T + \sum_{i=1}^{N} R_i^T D_i^{1/2} Q_i K_i^T D_i^{1/2} R_i.$$ 这里$R_i$限制到重叠子域，$D_i$是单位划分权重，$\Phi$是粗插值（或延拓）矩阵。针对合成Fourier右端项的数值实验表明，区域分解注意力算子能够比全局低秩注意力基线训练更快，并在使用显著更少参数的情况下提供更精确的逼近。

英文摘要

We propose a hierarchical attention mechanism based on two-level overlapping Schwarz domain decomposition. The method is motivated by the observation that two-level Schwarz domain decomposition methods combine local subdomain corrections with a coarse level that communicates global, long-range information. We test its usefulness in the context of finite-dimensional operator learning using a simple, one-dimensional diffusion problem with homogeneous Dirichlet boundary conditions. Although elementary, this problem provides a controlled sequence-to-sequence setting in which the exact nonlocal solution operator is known. After discretization, learning the solution operator amounts to approximating the inverse of a symmetric positive definite matrix. As a baseline, we use a global softmax-free low-rank attention operator of the form $QK^T$. The proposed construction replaces this dense global factorization by a two-level additive structure: local low-rank attention blocks on overlapping subdomains are combined with a coarse attention block. The resulting operator has the form $$M_{\theta}^{-1} = \Phi Q_0 K_0^T \Phi^T + \sum_{i=1}^{N} R_i^T D_i^{1/2} Q_i K_i^T D_i^{1/2} R_i.$$ Here $R_i$ restricts to an overlapping subdomain, $D_i$ is a partition-of-unity weight, and $\Phi$ is a coarse interpolation (or prolongation) matrix. Numerical experiments for synthetic Fourier right-hand sides indicate that the domain-decomposition attention operator is able to train faster and can give more accurate approximations than a global low-rank attention baseline while using significantly fewer parameters.
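
As a rough illustration of the two-level additive structure, the sketch below applies the operator to a vector term by term: a coarse low-rank block wrapped in the interpolation matrix, plus local low-rank blocks on overlapping index sets weighted by a partition of unity. Everything concrete here is an assumption for the example (grid size, two subdomains, piecewise-constant coarse interpolation, random untrained factors); it is not the authors' implementation.

```python
import math
import random


def matvec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def low_rank_apply(Q, K, x):
    """Apply Q K^T to x without forming the dense product."""
    return matvec(Q, matvec(transpose(K), x))


def two_level_apply(x, Phi, Q0, K0, subdomains):
    """y = Phi Q0 K0^T Phi^T x + sum_i R_i^T D_i^{1/2} Q_i K_i^T D_i^{1/2} R_i x."""
    y = matvec(Phi, low_rank_apply(Q0, K0, matvec(transpose(Phi), x)))
    for idx, weights, Q, K in subdomains:
        local = [math.sqrt(w) * x[j] for j, w in zip(idx, weights)]  # D_i^{1/2} R_i x
        local = low_rank_apply(Q, K, local)                          # Q_i K_i^T (...)
        for j, w, v in zip(idx, weights, local):                     # R_i^T D_i^{1/2} (...)
            y[j] += math.sqrt(w) * v
    return y


def rand_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
n, rank = 6, 2
# Two overlapping subdomains; the weights sum to 1 at every grid index.
index_sets = [([0, 1, 2, 3], [1.0, 1.0, 0.5, 0.5]), ([2, 3, 4, 5], [0.5, 0.5, 1.0, 1.0])]
subdomains = [(idx, w, rand_matrix(len(idx), rank, rng), rand_matrix(len(idx), rank, rng))
              for idx, w in index_sets]
# Piecewise-constant coarse interpolation: one coarse unknown per half of the grid.
Phi = [[1.0, 0.0]] * 3 + [[0.0, 1.0]] * 3
Q0, K0 = rand_matrix(2, rank, rng), rand_matrix(2, rank, rng)

y = two_level_apply([1.0] * n, Phi, Q0, K0, subdomains)
```

In this toy setup the local factors have 4 rows each instead of 6, which hints at where the parameter savings come from: each local block only scales with its subdomain size, while the coarse block carries the long-range coupling.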

2606.18519 2026-06-18 cs.RO cs.AI new submission 60%

As You Wish: Mission Planning with Formal Verification using LLMs in Precision Agriculture

如您所愿：利用LLM在精准农业中进行形式化验证的任务规划

Marcos Abel Zuzuárregui, Stefano Carpin

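Affiliations * University of California, Merced

Topic match Other LLM: LLM-based mission planning with formal verification.

AI summary To address the ambiguity of natural language, proposes an LLM mission planning system with feedback loops based on linear temporal logic (LTL), splitting specification generation and verification between two LLMs to improve the reliability of mission planning in precision agriculture.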

Journal ref Published in Proceedings of 2026 International Conference on Robotics and Automation (ICRA)

AI中文摘要

尽管机器人系统现已商业化并部署于各行各业，但许多系统高度专业化，通常需要高级技能才能操作并确保其按指令执行。为缓解这一问题，我们近期引入了一个任务规划器，利用大语言模型（LLM）根据自然语言描述的任务描述合成精准农业中的任务计划。虽然该系统表现出色，但也存在自然语言固有的歧义性。本文通过引入多个基于线性时序逻辑（LTL）的反馈循环来扩展我们的系统，以确保任务规划系统满足用户制定的规范，同时仍使用自然语言。为减轻潜在偏差，我们使用两个不同的商业LLM分别负责规范生成和验证子任务。通过大量实验，我们强调了将任务验证集成到全自主流水线中的优势与局限，特别是关于LLM生成有效LTL公式的能力，并展示了我们的实现如何应对和解决这些挑战。

英文摘要

Though robotic systems are now being commercialized and deployed in various industries, many of these systems are highly specialized and often require an advanced skill set to operate and ensure they perform as instructed. To mitigate this problem, we recently introduced a mission planner leveraging LLMs to synthesize mission plans in precision agriculture based on mission descriptions provided in natural language. While the system demonstrates impressive performance, it also suffers from the inherent ambiguities of natural language. In this paper, we extend our system to address this issue by introducing multiple feedback loops in the planning architecture that leverage linear temporal logic (LTL) to ensure the mission planning system meets the specifications formulated by the user while still using natural language. To mitigate potential bias, this is achieved by using two different commercial LLMs in charge of the specification and verification subtasks. Through extensive experiments, we highlight the strengths and limitations of integrating mission verification into a fully autonomous pipeline, particularly regarding an LLM's ability to generate valuable LTL formulas, and show how our proposed implementation addresses and solves these challenges.

2606.18312 2026-06-18 cs.CR cs.DC cs.LG new submission 60%

TIGER: Inverting Transformer Gradients via Embedding-Subspace Distance Optimization

TIGER：通过嵌入子空间距离优化反转Transformer梯度

William Kalikman, Ivo Petrov, Dimitar I. Dimitrov, Martin Vechev

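Affiliations * ETH Zürich; INSAIT, Sofia University "St. Kliment Ohridski"

Topic match Other LLM: proposes TIGER, a gradient inversion attack on Transformers.

AI summary Proposes the TIGER attack, which turns the subspace signal into a differentiable objective and directly optimizes token embeddings to minimize their distance to the subspace, improving reconstruction quality and speed on encoder models and robustness under differential privacy on decoder models.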

Comments 16 pages, 13 pages main text,

AI中文摘要

联邦学习允许多个客户端通过向中央服务器发送梯度更新来联合训练共享模型，同时保持原始输入在本地。然而，先前的梯度反转攻击表明，这些更新可以泄露足够的信息来重建客户端输入。现有的针对Transformer的攻击要么优化虚拟输入以匹配真实的客户端更新，这对于现代模型来说成本高昂且不稳定；要么利用注意力梯度的低秩性来识别包含真实层嵌入的子空间，然后对候选令牌进行离散成员测试。然而，这种令牌测试在数值噪声（例如来自量化或差分隐私）下很脆弱，并且对于具有非因果注意力的编码器模型扩展性差。我们引入了TIGER，一种连续的梯度反转攻击，它将这种子空间信号转化为可微目标。TIGER不是搜索令牌或匹配完整梯度，而是直接优化令牌嵌入以最小化它们到子空间的距离。我们的实验表明，在仅编码器模型上，TIGER在重建质量和运行时间上均显著优于现有攻击；而在解码器模型上，TIGER比先前基于子空间的攻击更鲁棒，从而在受差分隐私保护的联邦学习设置中实现了首次成功的重建。

英文摘要

Federated learning allows multiple clients to jointly train a shared model by sending gradient updates to a central server while keeping raw inputs local. However, prior gradient inversion attacks show that these updates can reveal enough information to reconstruct client inputs. Existing attacks on transformers either optimize dummy inputs to match the true client updates, which is costly and unstable for modern models, or exploit the low rank of attention gradients to identify a subspace containing the true layer embeddings, followed by a discrete membership test for candidate tokens. However, this token test is brittle under numerical noise, e.g., from quantization or Differential Privacy (DP), and scales poorly for encoder models with non-causal attention. We introduce TIGER, a continuous gradient inversion attack that turns this subspace signal into a differentiable objective. Instead of searching over tokens or matching full gradients, TIGER directly optimizes token embeddings to minimize their distance to the subspace. Our experiments demonstrate that on encoder-only models, TIGER substantially improves both reconstruction quality and runtime over existing attacks, while on decoder models, TIGER is more robust than prior subspace-based attacks, enabling the first successful reconstructions in DP-defended federated learning settings.
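
The core idea of a differentiable subspace-distance objective can be shown on a toy problem. The sketch below assumes an orthonormal basis for the subspace is already available (in the attack setting it would be derived from attention gradients; here it is hand-picked) and runs plain gradient descent on the squared distance. It is not the TIGER implementation and omits everything specific to tokens, vocabularies, and model layers.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def project(x, basis):
    """Orthogonal projection of x onto span(basis); basis must be orthonormal."""
    proj = [0.0] * len(x)
    for b in basis:
        coeff = dot(x, b)
        proj = [p + coeff * bi for p, bi in zip(proj, b)]
    return proj


def subspace_distance_sq(x, basis):
    residual = [xi - pi for xi, pi in zip(x, project(x, basis))]
    return dot(residual, residual)


def descend(x, basis, lr=0.25, steps=50):
    # The gradient of ||x - P x||^2 with respect to x is 2 (x - P x).
    for _ in range(steps):
        proj = project(x, basis)
        x = [xi - lr * 2.0 * (xi - pi) for xi, pi in zip(x, proj)]
    return x


basis = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # toy orthonormal basis of the subspace
x0 = [0.3, -0.2, 1.0]                        # candidate embedding, initially off-subspace
x_final = descend(x0, basis)
```

Because the objective is continuous in the embedding, small perturbations of the basis only shift the minimizer slightly, which is one plausible reading of why a continuous objective tolerates noise better than a hard membership test.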

2606.16214 2026-06-18 cs.LG cs.AI new submission 60%

Calibrated Sampling-Free Uncertainty Estimation in Bayesian Deep Learning

贝叶斯深度学习中的校准无采样不确定性估计

Tobias Jan Wieczorek, Leon de Andrade, Thomas Möllenhoff, Marcus Rohrbach

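Affiliations * TU Darmstadt & hessian.AI, Darmstadt, Germany; RIKEN Center for Advanced Intelligence Project, Tokyo, Japan

Topic match Other LLM: uncertainty estimation in Bayesian deep learning; applicable to LLMs.

AI summary Proposes Calibrated Variance Propagation (CVP), which combines a new propagation method for normalization layers, techniques for handling activation functions, and a light calibration step to estimate uncertainty in a single forward pass, matching the accuracy of MC sampling on Transformers and CNNs at much lower cost.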

AI中文摘要

现代深度学习模型仍然以过度自信而闻名，限制了它们在高风险应用中的可靠性。贝叶斯方法通过学习模型参数的分布来应对这一问题，最近的进展使得在大规模架构上以与AdamW相当的成本实现这一目标成为可能。然而，测试时仍存在一个挑战：预测必须对从后验中采样的权重进行多次前向传播的平均，这代价高昂。方差传播提供了一种高效的替代方案，在单次前向传播中计算每层不确定性的解析近似。虽然此类技术对MLP有效，但由于现代架构的深度增加和层类型多样性，其扩展仍然具有挑战性。为填补这一空白，我们提出了校准方差传播（CVP），它引入了一种新的归一化层传播方法，结合了处理激活函数的近期技术，并通过轻量校准步骤吸收残差误差。CVP在Transformer和CNN上产生与MC采样相当准确的不确定性估计，而成本仅为极小部分。与先前的方差传播工作相比，CVP在BEiT-3上对视觉推理（NLVR2）的$0.5\%$风险覆盖率从$8.2\%$提高到$14.6\%$，在ViLT上对VQAv2从$2.6\%$提高到$10.8\%$，且增益扩展到卷积架构。

英文摘要

Modern deep learning models remain notoriously prone to overconfidence, limiting their reliability in high-stakes applications. Bayesian methods aim to counter this by learning a distribution over model parameters, and recent advances now make this feasible for large-scale architectures at costs comparable to AdamW. However, a challenge remains at test time: predictions must be averaged across many forward passes with weights sampled from the posterior, which is prohibitively expensive. Variance propagation offers an efficient alternative, computing layer-wise analytical approximations of uncertainty in a single forward pass. While such techniques are effective for MLPs, their extension to modern architectures remains challenging, due to increased depth and diversity of layer types. To fill this gap, we propose Calibrated Variance Propagation (CVP), which introduces a new propagation method for normalization layers, combines it with recent techniques for handling activation functions, and absorbs residual error through a light calibration step. CVP yields comparably accurate uncertainty estimates to MC sampling across transformers and CNNs, at a fraction of the cost. Against prior variance propagation work, CVP improves coverage at $0.5\%$ risk from $8.2\%$ to $14.6\%$ with BEiT-3 on Visual Reasoning (NLVR2) and from $2.6\%$ to $10.8\%$ with ViLT on VQAv2, with gains extending to convolutional architectures.
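
For readers unfamiliar with variance propagation, the simplest case is a single linear unit. Assuming the weights and inputs are mutually independent (a mean-field assumption made here for illustration, not a statement of CVP's exact setup), the output mean and variance follow in closed form from the identity Var(wx) = σ²m² + μ²v + σ²v, where w has mean μ and variance σ² and x has mean m and variance v. The function name below is hypothetical, and the sketch covers neither normalization layers nor the calibration step.

```python
def propagate_linear(in_mean, in_var, w_mean, w_var):
    """Moments of y = sum_i w_i x_i for mutually independent w_i and x_i."""
    out_mean = sum(m * mu for m, mu in zip(in_mean, w_mean))
    out_var = sum(
        s2 * m * m + mu * mu * v + s2 * v
        for m, v, mu, s2 in zip(in_mean, in_var, w_mean, w_var)
    )
    return out_mean, out_var


# Toy unit with two inputs: the first input is deterministic, the second is uncertain.
mean_out, var_out = propagate_linear(
    in_mean=[1.0, 2.0], in_var=[0.0, 0.5],
    w_mean=[0.5, -1.0], w_var=[0.1, 0.2],
)
```

A sampling-based estimate would need many forward passes with freshly drawn weights to approximate the same two numbers, which is the cost that a single analytical pass avoids.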

2602.11557 2026-06-18 cs.LG stat.ML cross-list 60%

The Implicit Bias of Steepest Descent with Mini-batch Stochastic Gradient

小批量随机梯度下降的隐式偏差

Jichu Li, Xuan Tang, Difan Zou

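Topic match Other LLM: theory of the implicit bias of mini-batch stochastic steepest descent.

AI summary Studies the implicit bias of mini-batch stochastic steepest descent in multi-class classification, showing how batch size, momentum, and variance reduction affect max-margin behavior and convergence rates, and proving that momentum enables small-batch convergence while variance reduction recovers the full-batch implicit bias.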

AI中文摘要

多种广泛使用的优化方法，如SignSGD和Muon，可以被解释为在不同范数诱导几何下的最陡下降实例。在这项工作中，我们研究了多类分类中小批量随机最陡下降的隐式偏差，刻画了批大小、动量和方差缩减如何在一般逐项和Schatten-$p$范数下塑造极限最大间隔行为和收敛率。我们证明，在没有动量时，最坏情况下的收敛和成功分类只能通过全批量梯度保证。相反，动量通过批量-动量权衡使得小批量收敛到近似最大间隔解成为可能，尽管会减慢收敛速度。该方法提供了完全显式、与维度无关的收敛率，优于先前的结果。此外，我们证明方差缩减可以恢复任意批大小下的精确全批量隐式偏差，尽管收敛速度较慢。最后，我们进一步研究了无动量的单批量最陡下降，并通过一个具体数据示例揭示了其收敛到根本不同偏差的特性，这揭示了纯随机更新的一个关键局限性。总体而言，我们的统一分析阐明了随机优化何时与全批量行为一致，并为更深入地探索随机梯度最陡下降算法的训练行为铺平了道路。

英文摘要

A variety of widely used optimization methods like SignSGD and Muon can be interpreted as instances of steepest descent under different norm-induced geometries. In this work, we study the implicit bias of mini-batch stochastic steepest descent in multi-class classification, characterizing how batch size, momentum, and variance reduction shape the limiting max-margin behavior and convergence rates under general entry-wise and Schatten-$p$ norms. We show that, without momentum, worst-case convergence and successful classification can only be guaranteed with full-batch gradients. In contrast, momentum enables small-batch convergence to an approximate max-margin solution through a batch-momentum trade-off, though it slows convergence. This approach provides fully explicit, dimension-free rates that improve upon prior results. Moreover, we prove that variance reduction can recover the exact full-batch implicit bias for any batch size, albeit at a slower convergence rate. Finally, we further investigate batch-size-one steepest descent without momentum and reveal its convergence to a fundamentally different bias via a concrete data example, which exposes a key limitation of purely stochastic updates. Overall, our unified analysis clarifies when stochastic optimization aligns with full-batch behavior, and paves the way for deeper explorations of the training behavior of stochastic gradient steepest descent algorithms.
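
The link between the choice of norm and the update rule can be seen in a small sketch of normalized steepest descent, where the step direction is the unit-norm vector d that maximizes the inner product with the gradient. Under the ℓ∞ norm this gives the sign update of SignSGD; under the ℓ2 norm it gives the normalized gradient. The function names are illustrative, and the matrix case (the spectral-norm geometry usually associated with Muon) as well as mini-batching and momentum are outside this sketch.

```python
import math


def steepest_direction(grad, norm="linf"):
    """Unit-norm direction maximizing <grad, d> under the chosen norm."""
    if norm == "linf":
        # Sign update, as in SignSGD.
        return [math.copysign(1.0, g) if g != 0 else 0.0 for g in grad]
    if norm == "l2":
        length = math.sqrt(sum(g * g for g in grad))
        return [g / length for g in grad] if length > 0 else [0.0] * len(grad)
    raise ValueError(f"unsupported norm: {norm}")


def step(params, grad, lr, norm):
    return [p - lr * d for p, d in zip(params, steepest_direction(grad, norm))]


grad = [3.0, -4.0, 0.0]
sign_step = step([1.0, 1.0, 1.0], grad, lr=0.1, norm="linf")
l2_dir = steepest_direction(grad, "l2")
```

The sign update moves every nonzero coordinate by the same amount regardless of gradient magnitude, which is one way to see why different geometries can lead to different limiting solutions.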

2601.23018 2026-06-18 cs.HC cs.AI cs.LG cross-list 60%

Integrating Multi-Label Classification and Generative AI for Scalable Analysis of User Feedback

整合多标签分类与生成式AI实现用户反馈的可扩展分析

Sandra Loop, Erik Bertram, Sebastian Juhl, Martin Schrepp

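Affiliations * SAP SE; Hochschule Fresenius Heidelberg; University of Missouri

Topic match Other LLM: uses generative AI to analyze user feedback; an LLM application.

AI summary Proposes combining supervised multi-label classification with generative AI to process large volumes of user comments efficiently, assigning topic labels and generating summaries automatically, and finds that sentiment analysis does not reliably reflect product satisfaction.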

Comments 8 pages, 2 figures, submitted to Springer Nature

AI中文摘要

在高度竞争的软件市场中，用户体验（UX）评估对于确保软件质量和促进产品长期成功至关重要。此类UX评估通常将标准化问卷的定量指标与通过开放式问题收集的定性反馈相结合。虽然开放式反馈为改进提供了有价值的见解，并有助于解释定量结果，但分析大量用户评论具有挑战性且耗时。在本文中，我们介绍了一家大型软件公司在长期UX测量项目中开发的技术，以高效处理和解释大量用户评论。为了提供收集到的评论的高层概述，我们采用监督机器学习方法，为每条评论分配有意义的预定义主题标签。此外，我们展示了如何利用生成式AI（GenAI）创建简洁且信息丰富的用户反馈摘要，促进向组织尤其是高层管理人员有效传达发现。最后，我们研究了用户评论中表达的情感是否可以作为整体产品满意度的指标。我们的结果表明，仅凭情感分析并不能可靠地反映用户满意度。相反，产品满意度需要在调查中明确评估，以衡量用户对产品的感知。

英文摘要

In highly competitive software markets, user experience (UX) evaluation is crucial for ensuring software quality and fostering long-term product success. Such UX evaluations typically combine quantitative metrics from standardized questionnaires with qualitative feedback collected through open-ended questions. While open-ended feedback offers valuable insights for improvement and helps explain quantitative results, analyzing large volumes of user comments is challenging and time-consuming. In this paper, we present techniques developed during a long-term UX measurement project at a major software company to efficiently process and interpret extensive volumes of user comments. To provide a high-level overview of the collected comments, we employ a supervised machine learning approach that assigns meaningful, pre-defined topic labels to each comment. Additionally, we demonstrate how generative AI (GenAI) can be leveraged to create concise and informative summaries of user feedback, facilitating effective communication of findings to the organization and especially upper management. Finally, we investigate whether the sentiment expressed in user comments can serve as an indicator for overall product satisfaction. Our results show that sentiment analysis alone does not reliably reflect user satisfaction. Instead, product satisfaction needs to be assessed explicitly in surveys to measure the user's perception of the product.

2606.18918 2026-06-18 cs.LG cs.CC new submission 55%

Some Complexity Results for Robustness Verification for Binarized Neural Networks

二值化神经网络鲁棒性验证的一些复杂性结果

Harshit Goyal, Sudakshina Dutta

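Affiliations * Indian Institute of Technology Goa

Topic match Other LLM: robustness verification for binarized neural networks; not LLM work.

AI summary Proves via a reduction from Boolean satisfiability that satisfiability for binarized neural networks is NP-complete, and exploits the piecewise-constant structure of network outputs under uniform occlusion to give a polynomial-time robustness checking algorithm.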

2606.18867 2026-06-18 cs.LG cs.CY stat.ML new submission 55%

Strategic Feature Selection

战略特征选择

Jivat Neet Kaur, Pratik Patil, Divya Shanmugam, Emma Pierson, Michael I. Jordan, Nika Haghtalab, Meena Jagadeesan, Ahmed Alaa, Serena Wang

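Affiliations * University of California, Berkeley; University of Texas, Austin; Cornell Tech; Stanford University; University of Pennsylvania; Harvard University; Inria, Paris

Topic match Other LLM: strategic feature selection and classification; not core LLM work but algorithmic.

AI summary Studies classification under strategic manipulation through feature selection and ridge regularization, finds that excluding features based on manipulability alone is generally suboptimal, proposes an algorithm that jointly chooses the feature set and the regularization level, and illustrates it on a healthcare payments benchmark.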

AI中文摘要

当算法预测器在高风险领域（如医疗）中指导资源分配时，这些预测器必须考虑输入特征的战略操纵。典型的解决方案是重新设计预测器本身以明确考虑战略互动。然而在实践中，决策者通常受限于调整现有预测管道中的粗粒度杠杆。例如，医疗组织通常根据感知的可操纵性选择排除哪些特征，同时使用标准正则化程序来收缩保留特征的系数。在这项工作中，我们通过特征选择及其与岭正则化的相互作用，启动了对战略分类的形式化研究。我们的主要发现是，仅基于可操纵性排除单个特征通常是次优的。我们提供了在最优正则化下特征子集性能的细粒度刻画，为政策设计提供了新的见解。受此刻画启发，我们开发了一种实用算法，用于联合选择特征集和岭正则化水平。通过一个关于医疗支付基准的真实世界案例研究，我们说明了我们的算法如何指导实践中粗粒度政策杠杆的设计。我们的结果为减轻算法决策系统中战略行为的影响提供了一个有原则的、实用的框架。

英文摘要

When algorithmic predictors inform resource allocation in high-stakes domains such as healthcare, these predictors must account for strategic manipulation of input features. The typical solution is to redesign the predictor itself to explicitly account for strategic interactions. In practice, however, decision makers are often constrained to adjusting coarser levers within existing prediction pipelines. For example, healthcare organizations often select which features to exclude based on perceived manipulability, while using standard regularization procedures to shrink the coefficients of retained features. In this work, we initiate a formal study of strategic classification through feature selection and its interaction with ridge regularization. Our main finding is that excluding individual features based on their manipulability alone is generally suboptimal. We provide a fine-grained characterization of the performance of a feature subset under optimal regularization, yielding new insights for policy design. Motivated by this characterization, we develop a practical algorithm for jointly choosing the feature set and the level of ridge regularization. Through a real-world case study on a healthcare payments benchmark, we illustrate how our algorithm can guide the design of coarse policy levers in practice. Our results provide a principled, practical framework for mitigating the effects of strategic behavior in algorithmic decision-making systems.

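2606.18833 2026-06-18 cs.LG New submission 55%

Seed-Guided Semi-Supervised Clustering by A-Contrario Anomaly Detection

Nassir Mohammad

Affiliations: Cyber Innovation Lab, Airbus, Newport, UK

Topic match (Other LLM): semi-supervised clustering framework; not LLM-centric.

AI summary: Proposes a semi-supervised clustering framework based on a statistical duality between grouping and anomaly detection. Using a-contrario reasoning and the Perception algorithm, it initializes clusters from seed labels and iteratively excludes anomalies, achieving robust clustering with strong performance from few seeds.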

Abstract

This paper introduces a semi-supervised clustering framework grounded in the statistical duality between grouping principles and anomaly detection. We address the challenge of robust cluster definition in noisy environments, a task where partitioning algorithms often over-assign outliers and density-based methods remain sensitive to heuristic global parameters. Drawing on a-contrario statistical reasoning and Gestalt proximity principles, we define a cluster as a maximal subset of data points containing no anomalies relative to a null hypothesis of uniform randomness. Central to this approach is the Perception algorithm, which utilises a principled expectation-based threshold ($\mathbb{E} < 1$) to identify outliers without manual parameter tuning. By treating clustering as the dual of anomaly detection, we employ an iterative "clustering-by-exclusion" mechanism. The algorithm is seed-guided, leveraging minimal user-provided labels to initialise robust cluster medians and form initial groups, which are subsequently expanded by admitting non-anomalous points. This approach naturally isolates fringe points, isolated noise, and emerging unknown clusters. We evaluate the method on synthetic and real-world benchmarks, including image and text datasets represented through raw, linear-reduced, and neighbourhood-preserving embeddings. Results demonstrate that with as few as 10-30 seeds per cluster, the proposed method achieves competitive and often very strong performance under a practical low-tuning benchmarking protocol, while maintaining linear scalability with respect to both observations and dimensionality for a fixed number of seeded clusters and iterations.

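2606.18807 2026-06-18 cs.DS cs.LG New submission 55%

Learning Augmented Exact Exponential Algorithms

Tatiana Belova, Yuriy Dementiev, Danil Sagunov

Affiliations: ITMO University

Topic match (Other LLM): learning-augmented exponential-time algorithms; weakly related to LLMs.

AI summary: Proposes a general approach in which a noisy predictor only marginally better than random guessing provably reduces the search space for NP-hard subset selection problems. The runtime speedup scales smoothly with prediction quality, and the algorithms need only pairwise independence of predictions or, alternatively, no knowledge of the predictor's accuracy.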

Abstract

The field of learning-augmented algorithms has demonstrated that machine-learned predictions can bypass worst-case lower bounds across a wide range of problems. So far, however, the focus has been almost exclusively on polynomial-time algorithms, where predictions improve competitive ratios, approximation guarantees, or running times. In this paper, we raise the question of whether predictions can push the frontier of exact exponential-time algorithms for NP-hard problems. We answer this question affirmatively by proposing a general approach that augments an entire family of state-of-the-art exact algorithms for a variety of subset selection problems. We show that a noisy predictor that is only marginally better than random guessing suffices to provably reduce the search space, and that the resulting runtime speedup scales smoothly with the prediction quality. Importantly, our algorithms require only pairwise independence of predictions or, alternatively, do not require knowledge of the predictor's accuracy; both settings are strictly weaker and more realistic than those typically assumed.
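
The idea that a prediction can shrink an exhaustive subset search can be illustrated with a toy sketch. This is not the algorithm from the paper, and it carries none of its guarantees: it simply assumes a predicted subset is available and tries candidates in order of increasing Hamming distance from it. The function name `search_near_prediction` and the subset-sum instance are hypothetical choices made for illustration.

```python
from itertools import combinations

def search_near_prediction(n, predicted, is_solution):
    """Try subsets of range(n) in order of Hamming distance from `predicted`.

    Returns (solution, candidates_checked), or (None, candidates_checked)
    if no subset satisfies `is_solution`.
    """
    predicted = frozenset(predicted)
    checked = 0
    for distance in range(n + 1):
        # Flip exactly `distance` memberships relative to the prediction.
        for flips in combinations(range(n), distance):
            candidate = predicted.symmetric_difference(flips)
            checked += 1
            if is_solution(candidate):
                return candidate, checked
    return None, checked

# Toy subset-sum instance (illustrative numbers, not from the paper).
weights = [3, 5, 8, 13, 21, 34]

def hits_target(subset):
    return sum(weights[i] for i in subset) == 42

# A partly wrong prediction versus no prediction at all (empty set).
solution, guided = search_near_prediction(len(weights), {2, 4}, hits_target)
_, blind = search_near_prediction(len(weights), set(), hits_target)
print(sorted(solution), guided, blind)
```

On this instance the guided search checks fewer candidates than the search started from the empty set. The worst case is still all 2^n subsets, which is why the paper's provable bounds depend on the prediction quality.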

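2606.18624 2026-06-18 cs.CL New submission 60%

PragReST: Self-Reinforcing Counterfactual Reasoning for Pragmatic Language Understanding

Jihyung Park, Minchao Huang, Leqi Liu, Elias Stengel-Eskin

Affiliations: The University of Texas at Austin

Topic match (Instruction tuning): trains models through supervised fine-tuning and reinforcement learning.

AI summary: Proposes PragReST, a framework that constructs pragmatic QA data in a self-supervised way, generates counterfactual reasoning traces, and combines supervised fine-tuning with reinforcement learning to improve the pragmatic reasoning of large language models, substantially outperforming baseline models on four benchmarks.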

Comments First two authors contributed equally. Code and models: https://github.com/jihyung803/PragReST

Abstract

Natural language understanding often depends on meanings that are implied rather than explicitly stated, requiring pragmatic reasoning. Despite strong performance on math and logical reasoning, large language models (LLMs) still struggle with making pragmatic inferences, often choosing literal interpretations. To improve LLM pragmatic reasoning, we introduce PragReST, a self-supervised framework that constructs pragmatic QA data, generates counterfactual reasoning traces, and trains models to internalize them through supervised fine-tuning and reinforcement learning, without human-labeled training data or distillation from a stronger teacher. Across four pragmatic benchmarks (PragMega, Ludwig, MetoQA, and AltPrag), PragReST improves over backbone models, task-specific pragmatic tuning baselines, and non-counterfactual variants of the same pipeline. On accuracy-based benchmarks, PragReST improves over the instruct backbone by 5.37 and 5.50% (absolute) for Qwen3-8B and Qwen3-14B, respectively. Our error analysis and ablations underscore the importance of counterfactual reasoning: PragReST primarily reduces errors caused by failures to contrast observed utterances with plausible alternatives, and removing counterfactual reasoning substantially reduces performance. Moreover, our training preserves out-of-domain performance on general-knowledge and mathematical reasoning benchmarks.

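2606.18524 2026-06-18 cs.LG New submission 60%

On the Residual Scaling of Looped Transformers: Stability and Transferability

Shaowen Wang, Bingrui Li, Ge Zhang, Wenhao Huang, Shen Yan, Jian Li

Affiliations: Tsinghua University

Topic match (Pre-training): analyzes residual scaling in looped Transformers.

AI summary: For looped Transformers, argues that the residual scaling factor should be 1/N rather than 1/√L, and derives a factored parameterization for multi-layer blocks that enables hyperparameter transfer from few loops to many.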

Comments 19 pages, 9 figures

Abstract

Looped (weight-tied) Transformers apply a shared residual block $N$ times ($h \leftarrow h + \varepsilon\,f(h)$, same $f$ at each step), increasing effective depth without adding parameters. Prior depth-scaling analyses prescribe $\varepsilon = 1/\!\sqrt{L}$ for depth-$L$ residual networks. We show that this is insufficient for looped architectures: weight sharing makes residual updates correlated across iterations, requiring the stronger scaling $\varepsilon = 1/N$. For multi-layer blocks ($L$ unique layers looped $N$ times), we derive a factored parameterization $\varepsilon = \lambda/(N\!\sqrt{L})$ that separates the two sources of growth: $1/N$ controls the within-layer loop correlation, and $1/\!\sqrt{L}$ controls the across-layer variance. A key consequence is that the optimal learning rate depends only on the number of unique layers $L$, not on the loop count $N$, enabling direct hyperparameter transfer from small to large $N$ without retuning. Experiments on looped Transformers confirm that $1/N$ scaling improves trainability and yields better loss than $1/\!\sqrt{N}$ scaling across loop counts.
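
The scaling argument can be checked numerically in a toy setting. The sketch below is not the paper's experiment. It assumes the extreme case in which the shared block contributes the same unit-norm direction on every iteration (fully correlated updates) and contrasts it with independent random directions, the situation that square-root scaling is designed for. The names `unit_vector` and `accumulated_norm` are hypothetical.

```python
import math
import random

def unit_vector(dim, rng):
    """Draw a random direction with unit Euclidean norm."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def accumulated_norm(n_loops, eps, correlated, dim=256, seed=0):
    """Norm of h after n_loops updates h <- h + eps * u_t, starting from h = 0.

    correlated=True reuses one direction every step (a stand-in for a tied block);
    correlated=False draws a fresh direction each step (a stand-in for untied layers).
    """
    rng = random.Random(seed)
    shared = unit_vector(dim, rng)
    h = [0.0] * dim
    for _ in range(n_loops):
        u = shared if correlated else unit_vector(dim, rng)
        h = [hi + eps * ui for hi, ui in zip(h, u)]
    return math.sqrt(sum(x * x for x in h))

for n in (4, 16, 64):
    print(
        n,
        round(accumulated_norm(n, 1 / math.sqrt(n), correlated=True), 3),   # grows with N
        round(accumulated_norm(n, 1 / n, correlated=True), 3),              # stays near 1
        round(accumulated_norm(n, 1 / math.sqrt(n), correlated=False), 3),  # stays near 1
    )
```

With fully correlated updates the accumulated norm is N times ε, so square-root scaling leaves growth of order √N while 1/N keeps it constant. Real looped blocks are only partially correlated, which is where the paper's analysis goes beyond this toy case.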

2606.18324 2026-06-18 cs.LG cs.AI New submission 60%

Why SWAVE May Not Be All You Need: A Concept-Evolution Retrospective on Complex-Valued Recurrent Language Models

Ramprasath Ganesaraja, Swathika N, Sahil Dilip Panse

Affiliations: EdgeVerve Systems Limited

Topic match (Pre-training): retrospective on the evolution of SWAVE, a complex-valued recurrent language model.

AI summary: Reviews the evolution of the complex-valued recurrent language model SWAVE, exposes flaws in its design assumptions, and offers theoretical insights such as cos-domination collapse along with engineering principles.

Abstract

SWave is a complex-valued recurrent language model (169.26M parameters, D=384, L=16, T=2048) trained on FineWeb-Edu using 2xH100 NVL. It was designed around three founding premises: that representing language as complex waves rather than real-valued numbers enables richer information encoding; that a Cayley-parameterised unitary transition provides a mathematical guarantee against state decay or explosion; and that a hidden state which rotates rather than shrinks preserves signal integrity over arbitrarily long contexts. The core of SWave evolved substantially across three development phases. The Resonance Head was found to structurally admit imaginary-channel collapse as a global loss minimum (a failure mode we term cos-domination collapse) and was superseded by an untied head with independent real and imaginary embedding tables from the Phase-Associative Memory (PAM) architecture. This resolved the degenerate minimum and enabled stable 200,000-step training (best-step PPL 22.0 at step 89,861). ComplexNorm and the Wave Propagation Scan proved load-bearing throughout all three phases and were retained to the final architecture. ProtectGatedScan was reframed as a structural prior rather than a learned behaviour. The four multi-scale retention concepts showed no measurable improvement under controlled evaluation and were found non-load-bearing. The ComplexGatedUnit was superseded by a real-valued squared-ReLU channel mixer with fewer parameters. The auxiliary training objectives showed no benefit once structural constraints were resolved. The investigation yields a formal characterisation of cos-domination collapse, a parallel scan with a log-space backward pass for numerical stability, six transferable engineering principles for complex-valued recurrent training, and a plan-to-code traceability methodology for catching structural divergences that conventional test suites miss.

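2606.18372 2026-06-18 cs.CL cs.AI New submission 60%

Redact or Keep? A Fully Local AI Cascade for Educational Dialogue De-Identification

Haocheng Zhang, Zhuqian Zhou, Kirk Vanacore, Bakhtawar Ahtisham, René F. Kizilcec

Affiliations: Cornell University

Topic match (Domain-specific LLMs): de-identification of educational dialogue with a local LLM cascade.

AI summary: To address the confusion between curricular terms and personally identifiable information in educational dialogue, proposes a fully local cascade framework that performs constrained privacy triage with a recall-first union proposer and a context-aware reviewer. It reaches 0.958 macro F1 on math tutoring dialogues, outperforming a commercial API and LLM-only baselines.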

Abstract

Educational dialogue is a valuable but sensitive resource for research: the same transcripts that capture authentic learning often capture personally identifiable information (PII) entangled with curricular content, where "Riemann" may refer to a real student or to a mathematical concept. Existing approaches force a tradeoff between governance and accuracy. Commercial Large Language Models (LLMs) can handle this ambiguity but require sending student data to third parties, while local named entity recognition (NER) systems preserve governance but over-redact curricular terms. We propose a fully local cascade framework that reframes de-identification from open-ended entity recognition to constrained privacy triage. A recall-first union proposer combines two lightweight encoders with deterministic rules to over-generate candidate spans; a context-aware reviewer then makes a binary Redact/Keep decision for each candidate using surrounding dialogue and speaker role. We evaluate three reviewer configurations against same-family LLM-only baselines and a commercial API on math tutoring transcripts from two large platforms. The strongest local configuration reaches 0.958 macro F1, compared with 0.767 for a same-family LLM-only baseline and 0.706 for the commercial API, while running entirely on a single laptop. On a targeted challenge set of curricular-personal name ambiguity, the same configuration degrades by only 0.03 F1 versus 0.19 to 0.25 for smaller reviewers. These results suggest that for educational de-identification, problem formulation matters more than model scale.
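
The two-stage shape of the cascade (over-generate candidate spans, then decide Redact or Keep per span) can be sketched in a few lines. This is a minimal illustration under strong assumptions, not the authors' system: the proposers here are two hypothetical regex rules rather than lightweight encoders, and the reviewer is a keyword heuristic standing in for the paper's context-aware reviewer, which also uses speaker role. Every name below is invented for the example.

```python
import re
from dataclasses import dataclass

@dataclass(frozen=True)
class Span:
    start: int
    end: int
    text: str

# Hypothetical cue words marking a curricular use of a name.
CURRICULAR_CUES = {"sum", "sums", "integral", "hypothesis", "theorem"}

def propose_mid_sentence_capitals(utterance):
    """Rule proposer: capitalized words that follow a lowercase letter or comma."""
    return {Span(m.start(), m.end(), m.group())
            for m in re.finditer(r"(?<=[a-z,] )[A-Z][a-z]+", utterance)}

def propose_after_greeting(utterance):
    """Rule proposer: a capitalized word right after a greeting or thanks."""
    pattern = r"\b(?:[Hh]i|[Hh]ello|[Tt]hanks)\s+([A-Z][a-z]+)"
    return {Span(m.start(1), m.end(1), m.group(1))
            for m in re.finditer(pattern, utterance)}

def review(utterance, span):
    """Stand-in reviewer: Keep if the next word is a curricular cue, else Redact."""
    following = re.match(r"\s+(\w+)", utterance[span.end:])
    if following and following.group(1).lower() in CURRICULAR_CUES:
        return "Keep"
    return "Redact"

def deidentify(utterance):
    # Stage 1: recall-first union of all proposers (duplicate spans collapse in the set).
    candidates = propose_mid_sentence_capitals(utterance) | propose_after_greeting(utterance)
    # Stage 2: one binary decision per candidate, made on the original text.
    to_redact = [s for s in candidates if review(utterance, s) == "Redact"]
    # Apply edits right to left so earlier character offsets stay valid.
    for span in sorted(to_redact, key=lambda s: s.start, reverse=True):
        utterance = utterance[:span.start] + "[NAME]" + utterance[span.end:]
    return utterance

print(deidentify("Thanks Maria, can you explain the Riemann sum again?"))
```

Because the proposers only need high recall, they can be cheap and noisy; precision is the reviewer's job. That split is the design point the abstract attributes to reframing the task as constrained triage.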

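2606.18256 2026-06-18 cs.HC cs.AI New submission 60%

Dynamic In-Group Persona Generation for Enhancing Human-AI Rapport

Yoonseok Oh, Inseo Jung, Jinkyu Kim, Jungbeom Lee, Minwoo Kang, Suhong Moon

Affiliations: Korea University; Kakao Mobility; University of California, Berkeley

Topic match (Domain-specific LLMs): LLM chatbots that build rapport through in-group personas.

AI summary: Proposes dynamic in-group persona generation, which identifies a user's primary concern and generates an in-group persona sharing a similar concern. It significantly improves human-AI rapport, and experiments show it outperforms both a no-persona condition and a minimal self-disclosure baseline.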

Abstract

LLM-based chatbots are increasingly applied in interpersonal domains such as counseling and peer support, where establishing human-AI rapport is crucial yet remains challenging. In this work, we introduce a novel approach for conditioning LLMs with in-group personas, which (i) first identifies a user's primary concern and brief personal context (e.g., a computer science undergraduate worried about future career prospects), and (ii) generates a synthetic in-group persona that shares a similar primary concern while differing in background and narrative details, such as age or profession (e.g., a junior researcher at an AI startup). Furthermore, we conduct a human-subject study to systematically evaluate the effectiveness of in-group persona agents in enhancing human-AI rapport. We compare our approach against two baseline conditions: a conventional agent without persona conditioning and an agent exhibiting minimal self-disclosure (e.g., "I've felt that too"). Results from post-task questionnaires assessing rapport and user experience indicate that the in-group persona agent significantly improves perceived rapport and personal relevance compared to the baselines, and also yields a more positive user experience, most notably higher engagement.

1. Other LLM (18 papers)

JourneyFormer: Encoding Airbnb Guest Journey with Sequence Modeling

GrapNet: A Programmable Dynamic-Architecture Neural Graph Substrate

Approximate Structured Diffusion for Sequence Labelling

Aligning Implied Statements for Implicit Hate Speech Generalizability with Context-Bounded Semi-hard Negative Mining

Maturing Markov Decision Processes: Decision Making under Increasing Information and Shrinking Action Sets

Closing the Loop: PID Feedback Control for Interpretable Activation Steering in Symbolic Music Generation

Engagement Intensity as a Learner-Modeling Signal for Adaptive AI Ethics Instruction

TS-Fault: Benchmarking Time Series Forecasters Against Structural Faults

Hierarchical Attention via Domain Decomposition

As You Wish: Mission Planning with Formal Verification using LLMs in Precision Agriculture

TIGER: Inverting Transformer Gradients via Embedding-Subspace Distance Optimization

Calibrated Sampling-Free Uncertainty Estimation in Bayesian Deep Learning

The Implicit Bias of Steepest Descent with Mini-batch Stochastic Gradient

Integrating Multi-Label Classification and Generative AI for Scalable Analysis of User Feedback

Some Complexity Results for Robustness Verification for Binarized Neural Networks

Strategic Feature Selection

Seed-Guided Semi-Supervised Clustering by A-Contrario Anomaly Detection

Learning Augmented Exact Exponential Algorithms

2. Instruction tuning (1 paper)

PragReST: Self-Reinforcing Counterfactual Reasoning for Pragmatic Language Understanding

3. Pre-training (2 papers)

On the Residual Scaling of Looped Transformers: Stability and Transferability

Why SWAVE May Not Be All You Need: A Concept-Evolution Retrospective on Complex-Valued Recurrent Language Models

4. Domain-specific LLMs (2 papers)

Redact or Keep? A Fully Local AI Cascade for Educational Dialogue De-Identification

Dynamic In-Group Persona Generation for Enhancing Human-AI Rapport