语言大模型 / LLM

2410.15595 2026-06-18 cs.AI cs.CL cs.LG 版本更新 95%

A Comprehensive Survey of Direct Preference Optimization: Datasets, Theories, Variants, and Applications

直接偏好优化综述：数据集、理论、变体及应用

Wenyi Xiao, Zechuan Wang, Leilei Gan, Shuai Zhao, Zongrui Li, Ruirui Lei, Wanggui He, Luu Anh Tuan, Long Chen, Hao Jiang, Zhou Zhao, Fei Wu

发表机构 * Zhejiang University（浙江大学）； Nanyang Technological University（南洋理工大学）； Alibaba Group（阿里巴巴集团）

专题命中后训练：综述DPO，一种大模型后训练对齐方法

AI总结综述直接偏好优化（DPO）在理论、变体、数据集和应用方面的进展，指出其作为RL-free替代方案的潜力与局限，并提出未来研究方向。

Comments Accepted by TPAMI 2026. Project page: https://github.com/Mr-Loevan/DPO-Survey

详情

DOI: 10.1109/TPAMI.2026.3704314

AI中文摘要

随着大语言模型（LLMs）的快速发展，将策略模型与人类偏好对齐变得日益关键。直接偏好优化（DPO）作为一种有前景的对齐方法，作为从人类反馈中强化学习（RLHF）的无RL替代方案而出现。尽管DPO取得了各种进展并存在固有局限性，但文献中目前缺乏对这些方面的深入综述。在这项工作中，我们对DPO中的挑战和机遇进行了全面回顾，涵盖理论分析、变体、相关偏好数据集和应用。具体而言，我们基于关键研究问题对近期DPO研究进行分类，以提供对DPO当前格局的透彻理解。此外，我们提出了几个未来研究方向，为研究社区提供模型对齐的见解。相关论文的更新合集可在此https URL找到。

英文摘要

With the rapid advancement of large language models (LLMs), aligning policy models with human preferences has become increasingly critical. Direct Preference Optimization (DPO) has emerged as a promising approach for alignment, acting as an RL-free alternative to Reinforcement Learning from Human Feedback (RLHF). Despite DPO's various advancements and inherent limitations, an in-depth review of these aspects is currently lacking in the literature. In this work, we present a comprehensive review of the challenges and opportunities in DPO, covering theoretical analyses, variants, relevant preference datasets, and applications. Specifically, we categorize recent studies on DPO based on key research questions to provide a thorough understanding of DPO's current landscape. Additionally, we propose several future research directions to offer insights on model alignment for the research community. An updated collection of relevant papers can be found on https://github.com/Mr-Loevan/DPO-Survey.

URL PDF HTML ☆

赞 0 踩 0

2606.01249 2026-06-18 cs.LG cs.CL 版本更新 85%

Trust Region On-Policy Distillation

信任区域在线策略蒸馏

Xingrun Xing, Haoqing Wang, Boyan Gao, Ziheng Li, Yehui Tang

发表机构 * Samsung Research（三星研究院）； University of Oxford（牛津大学）； Peking University（北京大学）

专题命中后训练：信任区域在线策略蒸馏，用于LLM后训练

AI总结提出信任区域在线策略蒸馏（TrOPD），通过信用分配策略和信任区域学习解决师生分布差异导致的训练不稳定问题，在数学推理、代码生成和通用基准上超越现有方法。

详情

AI中文摘要

在线策略蒸馏（OPD）是大型语言模型（LLM）高效后训练的基本技术，在智能体学习、多任务增强和模型压缩中具有广泛应用。然而，当教师和学生分布差异较大时，OPD训练变得不稳定，因为教师对学生生成token的监督可能产生不可靠的策略梯度，甚至导致优化失败。本文通过信用分配策略解决可靠的在线策略token级监督问题，并提出信任区域在线策略蒸馏（TrOPD）。它具有以下特点：1）信任区域在线策略学习：TrOPD仅在教师提供可靠监督的区域进行OPD，缓解了分布不匹配下K1反向KL估计的优化困难。2）异常值估计：对于异常区域，我们探索梯度裁剪、掩码和前向KL估计，以减少不可靠监督的不利影响。3）离策略引导：学生从教师前缀继续生成，并使用前向KL模仿离策略引导，鼓励向可靠区域进行在线策略探索。实验表明，TrOPD在数学推理、代码生成和通用领域基准上始终优于最先进的OPD基线，包括OPD、EOPD和REOPOLD。

英文摘要

On-Policy Distillation (OPD) is a fundamental technique for efficient post-training of large language models (LLMs), with broad applications in agent learning, multi-task enhancement, and model compression. However, OPD training becomes unstable when the teacher and student distributions differ substantially, as teacher supervision on student-generated tokens may yield unreliable policy gradients and even cause optimization failure. This work addresses reliable on-policy token-level supervision through credit assignment strategies, and proposes Trust Region On-Policy Distillation, TrOPD. It features the following characteristics: 1) Trust-Region On-Policy Learning: TrOPD performs OPD only in regions where the teacher provides reliable supervision, mitigating the optimization difficulty of the K1 reverse-KL estimator under distribution mismatch. 2) Outlier Estimation: For outlier regions, we explore gradient clipping, masking, and forward-KL estimation to reduce the adverse effects of unreliable supervision. 3) Off-Policy Guidance: The student continues generation from teacher prefixes and uses forward KL to imitate off-policy guidance, encouraging on-policy exploration toward reliable regions. Experiments show that TrOPD consistently outperforms SoTA OPD baselines, including OPD, EOPD, and REOPOLD, across mathematical reasoning, code generation, and general-domain benchmarks.

URL PDF HTML ☆

赞 0 踩 0

2601.17226 2026-06-18 cs.CL cs.AI 版本更新 85%

Retell, Reward, Repeat: Reinforcement Learning for Narrative Theory-Informed Story Retelling

复述、奖励、重复：面向叙事理论启发的故事复述的强化学习

David Y. Liu, Xanthe Muston, Dipankar Srirag, Aditya Joshi, Sebastian Sequoiah-Grayson

发表机构 * University of New South Wales（新南威尔士大学）

专题命中后训练：使用强化学习提升LLM故事复述能力

AI总结提出RRR强化学习框架，结合结构主义叙事学与标量叙事性，通过d-RLAIF从文本特征中获取训练信号，无需参考输出，提升LLM故事复述的逻辑性、合理性和完整性。

Comments 8 Pages, 7 figures

详情

AI中文摘要

反事实故事复述暴露了LLM在受限叙事解空间中的缺陷，此时它们无法依赖回忆记忆的训练数据。基于真实值的后训练（如SFT）无法教会LLM生成逻辑合理的叙事事件。本文提出Retell, Reward, Repeat (RRR)，一个基于强化学习的流水线，将结构主义叙事学与标量叙事性相结合，以教授故事结构。我们扩展了TimeTravel数据集，加入人工标注的叙事平衡阶段，以评估奖励模型。通过d-RLAIF，RRR从文本特征的叙事性中推导训练信号，无需参考输出。评估表明，RRR训练的LLM在逻辑性、合理性和完整性上优于少样本和SFT基线，输出质量通过盲人偏好验证。RRR仅依赖小型查询数据集，为故事讲述——一个目前缺乏有效后训练方法的领域——提供了一种基于语言学、成本效益高的后训练机制。RRR强调了将既定语言学理论整合到当代NLP中的持续相关性。

英文摘要

Counterfactual story retelling exposes LLM shortcomings in constrained narrative solution spaces where they can no longer rely on recalling memorised training data. Ground-truth-based post-training, such as SFT, fails to teach LLMs how to generate logical and rational narrative events. In this paper, we introduce Retell, Reward, Repeat (RRR), an RL-based pipeline synthesising Structuralist Narratology with scalar narrativity to teach storytelling structure. We extend the TimeTravel dataset with human-annotated stages of narrative equilibrium to evaluate reward models. By using d-RLAIF, RRR derives training signals from the narrativity of textual features without the need for reference outputs. Evaluations demonstrate that RRR-trained LLMs outperform few-shot and SFT baselines in logic, rationality, and completeness, with output quality additionally validated by blind human preference. Relying on a small, query-only dataset, RRR provides a linguistically grounded, cost-effective post-training mechanism for storytelling--a domain currently lacking effective post-training methods. RRR highlights the continued relevance of integrating established linguistic theories into contemporary NLP.

URL PDF HTML ☆

赞 0 踩 0

2506.14126 2026-06-18 cs.LG cs.AI 版本更新 85%

From Memorization to Parameter Interference: How Overtraining Experts Harms Model Merging

从记忆到参数干扰：过度训练专家如何损害模型合并

Stefan Horoi, Guy Wolf, Eugene Belilovsky, Gintare Karolina Dziugaite

发表机构 * Concordia University（康科德大学）； Mila -- Québec AI Institute（魁北克人工智能研究所）； Google DeepMind（谷歌深Mind）

专题命中后训练：研究专家模型微调对合并的影响

AI总结本文研究专家模型微调过度对模型合并的影响，发现长时间微调导致记忆困难样本，造成参数干扰，降低合并性能，并提出任务相关的早停策略改善合并效果。

Comments Proceedings of the 43rd International Conference on Machine Learning, Seoul, South Korea. PMLR 306, 2026

详情

AI中文摘要

现代深度学习日益以使用开放权重基础模型为特征，这些模型可以在专门数据集上进行微调。这导致了专家模型和适配器的激增，通常通过HuggingFace和AdapterHub等平台共享。模型合并最近成为一种有效利用这些现有资源的方法，使得能够组合不同模型检查点的能力。因此，形成了一种自然的流程来利用迁移学习的好处并分摊沉没训练成本：模型在通用数据上预训练，在特定任务上微调，然后合并多个检查点以获得更强大的模型。一个普遍假设是，该流程中某一阶段的改进会向下游传播，从而在后续步骤中带来收益。在这项工作中，我们通过研究专家微调如何影响模型合并来挑战这一假设。我们表明，针对个体性能优化的专家长时间微调会导致跨视觉和语言模态、多种模型规模以及完全微调和LoRA适配模型的合并性能下降。我们将这种退化追溯到对一小部分困难样本的记忆，这些样本主导了微调后期步骤。这会导致负参数干扰，并编码在合并过程中被遗忘的知识。最后，我们证明任务相关的激进早停策略可以显著改善模型合并性能。

英文摘要

Modern deep learning is increasingly characterized by the use of open-weight foundation models that can be fine-tuned on specialized datasets. This has led to a proliferation of expert models and adapters, often shared via platforms like HuggingFace and AdapterHub. Model merging has recently emerged as an effective way to leverage these existing resources, enabling the composition of capabilities from different model checkpoints. A natural pipeline has thus formed to harness the benefits of transfer learning and amortize sunk training costs: models are pre-trained on general data, fine-tuned on specific tasks, and then multiple checkpoints are merged to obtain a more capable model. A prevailing assumption is that improvements at one stage of this pipeline propagate downstream, leading to gains at subsequent steps. In this work, we challenge that assumption by examining how expert fine-tuning affects model merging. We show that long fine-tuning of experts that optimizes for their individual performance leads to degraded merging performance across vision and language modalities, multiple model scales, and both fully fine-tuned and LoRA-adapted models. We trace this degradation to the memorization of a small set of difficult examples that dominate late fine-tuning steps. This causes negative parameter interference and encodes knowledge that is forgotten during merging. Finally, we demonstrate that task-dependent aggressive early stopping strategies can significantly improve model merging performance.

URL PDF HTML ☆

赞 0 踩 0

2603.26557 2026-06-18 cs.CL 版本更新 70%

MemBoost: A Memory-Boosted Framework for Cost-Aware LLM Inference

MemBoost：一种面向成本感知的LLM推理的内存增强框架

Joris Köster, Zixuan Liu, Siavash Khajavi, Zizhan Zheng

发表机构 * University of Cambridge（剑桥大学）； ETH Zurich（苏黎世联邦理工学院）

专题命中后训练：记忆增强框架降低LLM推理成本

AI总结提出MemBoost框架，通过轻量模型重用历史答案和检索支持信息，并选择性将困难查询路由到强模型，以降低LLM推理成本，同时保持回答质量。

Comments ICML MemFM 2026 Workshop

2602.05992 2026-06-18 cs.CL 版本更新 85%

DSB: Dynamic Sliding Block Scheduling for Diffusion LLMs

DSB: 扩散语言模型的动态滑动块调度

Lizhuo Luo, Shenggui Li, Yonggang Wen, Tianwei Zhang

发表机构 * Nanyang Technological University（南洋理工大学）

专题命中其他LLM ：改进扩散语言模型的推理调度

AI总结针对扩散语言模型固定块调度忽视语义难度的问题，提出无训练的动态滑动块方法DSB及配套KV缓存机制DSB Cache，显著提升生成质量和推理效率。

Comments Accepted at the 43rd International Conference on Machine Learning (ICML 2026)

详情

AI中文摘要

扩散大语言模型（dLLMs）已成为文本生成的一种有前景的替代方案，其特点在于原生支持并行解码。在实践中，块推理对于避免全局双向解码中的顺序错乱以及提高输出质量至关重要。然而，广泛使用的固定、预定义块（朴素）调度忽略了语义难度，使其在质量和效率上均非最优策略：它可能迫使模型对不确定的位置过早做出承诺，同时延迟块边界附近的简单位置。在这项工作中，我们分析了朴素块调度的局限性，并揭示了根据语义难度动态调整调度对于可靠高效推理的重要性。受此启发，我们提出了动态滑动块（DSB），一种无训练的块调度方法，它使用动态大小的滑动块来克服朴素块的刚性。为了进一步提高效率，我们引入了DSB Cache，一种针对DSB量身定制的无训练KV缓存机制。跨多个模型和基准的大量实验表明，DSB与DSB Cache一起，持续提升了dLLMs的生成质量和推理效率。代码已发布在 https://this https URL。

英文摘要

Diffusion large language models (dLLMs) have emerged as a promising alternative for text generation, distinguished by their native support for parallel decoding. In practice, block inference is crucial for avoiding order misalignment in global bidirectional decoding and improving output quality. However, the widely-used fixed, predefined block (naive) schedule is agnostic to semantic difficulty, making it a suboptimal strategy for both quality and efficiency: it can force premature commitments to uncertain positions while delaying easy positions near block boundaries. In this work, we analyze the limitations of naive block scheduling and disclose the importance of dynamically adapting the schedule to semantic difficulty for reliable and efficient inference. Motivated by this, we propose Dynamic Sliding Block (DSB), a training-free block scheduling method that uses a sliding block with a dynamic size to overcome the rigidity of the naive block. To further improve efficiency, we introduce DSB Cache, a training-free KV-cache mechanism tailored to DSB. Extensive experiments across multiple models and benchmarks demonstrate that DSB, together with DSB Cache, consistently improves both generation quality and inference efficiency for dLLMs. Code is released at https://github.com/lizhuo-luo/DSB.

URL PDF HTML ☆

赞 0 踩 0

2602.23092 2026-06-18 cs.AI 版本更新 85%

Enhancing CVRP Solver through LLM-driven Automatic Heuristic Design

通过LLM驱动的自动启发式设计增强CVRP求解器

Zhuoliang Xie, Fei Liu, Zhenkun Wang, Qingfu Zhang

发表机构 * Southern University of Science and Technology（南方科技大学）； City University of Hong Kong（香港城市大学）

专题命中其他LLM ：利用LLM自动设计启发式求解CVRP，属于LLM应用

AI总结提出AILS-AHD方法，结合进化搜索框架与大语言模型动态生成和优化破坏启发式，并引入加速机制，在中等和大规模CVRP实例上优于现有求解器，在CVRPLib大规模基准中10个实例上取得8个新最优解。

详情

AI中文摘要

容量受限车辆路径问题（CVRP）是一个基本的组合优化挑战，专注于在车辆容量约束下优化车队运营。尽管在运筹学中得到了广泛研究，CVRP的NP-hard性质仍然带来显著的计算挑战，特别是对于大规模实例。本研究提出了AILS-AHD（自适应迭代局部搜索与自动启发式设计），一种利用大语言模型（LLMs）革新CVRP求解的新方法。我们的方法将进化搜索框架与LLMs集成，在AILS方法中动态生成和优化破坏启发式。此外，我们引入了一种基于LLM的加速机制以提高计算效率。针对最先进的求解器（包括AILS-II和HGS）的综合实验评估表明，AILS-AHD在中等和大规模实例上均表现出优越性能。值得注意的是，我们的方法在CVRPLib大规模基准的10个实例中为8个建立了新的最佳已知解，突显了LLM驱动的启发式设计在推进车辆路径优化领域的潜力。

英文摘要

The Capacitated Vehicle Routing Problem (CVRP), a fundamental combinatorial optimization challenge, focuses on optimizing fleet operations under vehicle capacity constraints. While extensively studied in operational research, the NP-hard nature of CVRP continues to pose significant computational challenges, particularly for large-scale instances. This study presents AILS-AHD (Adaptive Iterated Local Search with Automatic Heuristic Design), a novel approach that leverages Large Language Models (LLMs) to revolutionize CVRP solving. Our methodology integrates an evolutionary search framework with LLMs to dynamically generate and optimize ruin heuristics within the AILS method. Additionally, we introduce an LLM-based acceleration mechanism to enhance computational efficiency. Comprehensive experimental evaluations against state-of-the-art solvers, including AILS-II and HGS, demonstrate the superior performance of AILS-AHD across both moderate and large-scale instances. Notably, our approach establishes new best-known solutions for 8 out of 10 instances in the CVRPLib large-scale benchmark, underscoring the potential of LLM-driven heuristic design in advancing the field of vehicle routing optimization.

URL PDF HTML ☆

赞 0 踩 0

2602.15851 2026-06-18 cs.CL cs.AI 版本更新 85%

Narrative Theory-Driven LLM Methods for Automatic Story Generation and Understanding: A Survey

叙事理论驱动的LLM方法在自动故事生成与理解中的应用：综述

David Y. Liu, Aditya Joshi, Paul Dawson

发表机构 * School of Computer Science and Engineering（计算机科学与工程学院）； School of Arts and Media（艺术与媒体学院）； University of New South Wales (UNSW)（新南威尔士大学）

专题命中其他LLM ：综述叙事理论驱动的LLM故事生成与理解

AI总结综述叙事理论驱动的大语言模型方法在自动故事生成与理解中的应用，分析现状并指出生成任务在理论应用、后训练方法、非虚构叙事及叙事层次等方面落后于理解任务，提出未来方向。

Comments 31 pages

详情

AI中文摘要

使用大语言模型（LLM）的叙事理论应用在自动故事生成和理解任务中提供了有前景的方法。本综述考察了自然语言处理（NLP）研究如何利用LLM方法处理叙事研究中的不同概念。我们使用叙事学中的既定区分来分类当前工作，并发现以下内容：(a) 叙事文本来源多样，不仅限于文学；(b) 理论综合与验证是潜在成果；(c) 生成任务在多个方面落后于理解任务：理论应用、后训练方法、探索非虚构叙事以及处理超出故事与话语层面的叙事层次。对于未来方向，我们相信，与其追求单一的、通用的“叙事质量”基准，进步可以受益于以下方面的努力：定义和改进针对单个叙事属性的基于理论的度量；继续开展大规模、理论驱动的文学/社会/文化分析；在情境化上下文中生成叙事；以及继续进行实验，其输出可用于验证或完善叙事理论。本文通过概述当前研究工作和更广泛的叙事研究领域，为NLP中更系统、更具理论依据的叙事研究提供了背景基础。

英文摘要

Applications of narrative theories using large language models (LLMs) deliver promising methods in automatic story generation and understanding tasks. Our survey examines how natural language processing (NLP) research uses LLM methods to engage with diverse concepts from narrative studies. We use established distinctions from narratology to categorise ongoing efforts and discover the following: \redtext{(a) narrative texts come from diverse sources beyond just literature, (b) theoretical synthesis and validation are potential outcomes, (c) generation tasks lag behind understanding in several ways: theoretical application, post-training methods, exploring non-fiction narratives and addressing narrative levels beyond fabula and discourse.} For future directions, instead of the pursuit of a single, generalised benchmark for `narrative quality', we believe that progress can benefit from efforts that focus on the following: defining and improving theory-based metrics for individual narrative attributes; continue conducting large-scale, theory-driven literary/social/cultural analysis; generating narratives in situated contexts; and continuing experiments where outputs can be used to validate or refine narrative theories. This work provides a contextual foundation for more systematic and theoretically informed narrative research in NLP by providing an overview to ongoing research efforts and the broader narrative studies landscape.

URL PDF HTML ☆

赞 0 踩 0

2510.15551 2026-06-18 cs.CL cs.AI cs.LG 版本更新 85%

Rethinking Cross-lingual Gaps from a Statistical Viewpoint

从统计视角重新思考跨语言差距

Vihari Piratla, Purvam Jain, Darshan Singh, Trevor Cohn, Preethi Jyothi, Partha Talukdar

发表机构 * Google DeepMind（谷歌深Mind）

专题命中其他LLM ：研究LLM跨语言差距，属于LLM应用

AI总结提出跨语言差距源于目标语言响应方差，通过形式化偏差和无偏误差，并采用推理时集成方法降低方差，使跨语言迁移得分提升8%-50%以上。

Comments 30 pages

详情

AI中文摘要

任何知识片段通常以一种或少数几种自然语言表达在网页或大型语料库中。大型语言模型（LLMs）通过从源语言获取知识，并在使用目标语言查询时使其可访问，从而充当桥梁。跨语言差距是指使用目标语言而非源语言查询知识时准确率的下降。现有研究侧重于导致跨语言差距的建模或训练失败。在这项工作中，我们采取另一种视角来表征跨语言错误的性质，并假设目标语言中响应的方差是造成这一差距的关键原因。我们首次将跨语言差距形式化为有偏误差和无偏误差。通过多种控制方差并减少跨语言差距的推理时干预，我们实证验证了我们的假设。我们展示了几种测试时集成方法，这些方法降低了响应方差，从而将源-目标迁移得分提高了多达12个绝对百分点，在各种LLMs上实现了8%到超过50%的相对提升。

英文摘要

Any piece of knowledge is usually expressed in one or a handful of natural languages on the web or in any large corpus. Large Language Models (LLMs) act as a bridge by acquiring knowledge from a source language and making it accessible when queried using target languages. A cross-lingual gap is a drop in accuracy incurred when querying knowledge in a target language rather than the source language. Existing research focused on modeling or training failures leading to cross-lingual gaps. In this work, we take an alternative view to characterize the nature of cross-lingual error, and hypothesize that the variance of responses in the target language is a key cause of this gap. For the first time, we formalize the cross-lingual gap in terms of biased and unbiased errors. We empirically validate our hypothesis through multiple inference-time interventions that control variance and reduce the cross-lingual gap. We demonstrate a few test-time ensemble methods that reduce response variance, and thereby improve source-target transfer scores by up to 12 absolute points yielding relative gains of 8% to over 50% across various LLMs.

URL PDF HTML ☆

赞 0 踩 0

2510.04120 2026-06-18 cs.CL cs.AI 版本更新 85%

Probing Semantic Alignment, Lexical Invariance, and Syntactic Influence in LLM Metaphor Processing

探究大语言模型隐喻处理中的语义对齐、词汇不变性和句法影响

Fengying Ye, Shanshan Wang, Lidia S. Chao, Derek F. Wong

发表机构 * NLP 2 CT Lab, Department of Computer and Information Science, University of Macau（自然语言处理2CT实验室，计算机与信息科学系，澳门大学）

专题命中其他LLM ：LLM隐喻处理机制分析

AI总结通过几何探测、上下文替换和句法扰动三种方法，分析LLM在隐喻处理中的语义漂移、词汇稳定性及句法敏感性，揭示强行为表现可能源于异质信号。

Comments Accepted to ACL 2026

详情

AI中文摘要

大语言模型（LLM）在隐喻检测和解释任务上表现出色，但尚不清楚这种行为成功揭示了隐喻处理的哪些方面。我们通过探测三个互补维度：语义属性对齐、词汇不变性和句法敏感性，对行为证据的局限性进行诊断分析。使用几何探测，我们评估模型生成的解释是否与参考语义属性对齐；通过上下文变化替换，分析隐喻和字面表达之间词汇关联的稳定性；通过受控句法扰动，检查隐喻检测的敏感性。我们的分析表明，LLM生成的解释可能相对于参考属性出现语义漂移；稳定的词汇锚点在不同上下文条件下持续存在，可能支持常规隐喻，同时使需要上下文整合的新奇隐喻产生偏差；检测性能对句法不规则性敏感。这些发现表明，强行为表现可能反映了异质的潜在信号，强调在将隐喻基准解释为稳健、集成语义理解的证据时需要谨慎。

英文摘要

Large language models (LLMs) achieve strong performance on metaphor detection and interpretation tasks, yet it remains unclear what such behavioral success reveals about metaphor processing. We present a diagnostic analysis that examines the limits of behavioral evidence by probing three complementary dimensions: semantic attribute alignment, lexical invariance, and syntactic sensitivity. Using geometric probing, we assess whether model-generated interpretations align with reference semantic attributes; through context-varying substitution, we analyze the stability of lexical associations between metaphorical and literal expressions; and via controlled syntactic perturbations, we examine sensitivity in metaphor detection. Our analysis reveals that LLM-generated interpretations can exhibit semantic drift relative to reference attributes; stable lexical anchors persist across contextual conditions, potentially supporting conventional metaphors while biasing novel metaphors requiring contextual integration; and detection performance is sensitive to syntactic irregularities. These findings suggest that strong behavioral performance may reflect heterogeneous underlying signals, highlighting the need for caution when interpreting metaphor benchmarks as evidence of robust, integrated semantic understanding.

URL PDF HTML ☆

赞 0 踩 0

2508.09191 2026-06-18 cs.LG cs.AI 版本更新 85%

From Values to Tokens: An LLM-Driven Framework for Context-aware Time Series Forecasting via Symbolic Discretization

从数值到标记：一种基于符号离散化的LLM驱动上下文感知时间序列预测框架

Xiaoyu Tao, Shilong Zhang, Mingyue Cheng, Daoyu Wang, Tingyue Pan, Bokai Pan, Changqing Zhang, Shijin Wang

发表机构 * State Key Laboratory of Cognitive Intelligence（认知智能国家重点实验室）； University of Science and Technology of China（中国科学技术大学）； College of Intelligence and Computing（智能科学与计算学院）； iFLYTEK Research（iFLYTEK研究院）

专题命中其他LLM ：提出TokenCast框架，利用LLM进行时间序列预测。

AI总结提出TokenCast框架，利用大语言模型通过符号离散化将连续时间序列转化为标记，与上下文文本对齐，实现上下文感知的预测，实验证明有效。

详情

AI中文摘要

时间序列预测在能源、医疗和金融等关键应用领域支持决策中起着重要作用。尽管近期取得了进展，但由于将历史数值序列与通常包含非结构化文本数据的上下文特征整合的挑战，预测精度仍然有限。为了解决这一挑战，我们提出了TokenCast，一个由大语言模型（LLM）驱动的框架，利用基于语言的符号表示作为上下文感知时间序列预测的统一中介。具体来说，TokenCast采用离散分词器将连续数值序列转化为时间标记，实现与基于语言输入的结构对齐。为了有效弥合模态之间的语义差距，时间和上下文标记通过预训练的LLM嵌入到共享表示空间中，并通过生成目标进一步优化。基于这一统一语义空间，对齐的LLM随后以监督方式进行微调，以预测未来的时间标记，然后解码回原始数值空间。在真实世界数据集上的大量实验证明了我们框架的有效性，并突显了其作为上下文感知时间序列预测生成框架的潜力。代码可从此https URL获取。

英文摘要

Time series forecasting plays a vital role in supporting decision-making across a wide range of critical applications, including energy, healthcare, and finance. Despite recent advances, forecasting accuracy remains limited due to the challenge of integrating historical numerical sequences with contextual features, which often comprise unstructured textual data. To address this challenge, we propose TokenCast, a large language model (LLM) driven framework that leverages language-based symbolic representations as a unified intermediary for context-aware time series forecasting. Specifically, TokenCast employs a discrete tokenizer to transform continuous numerical sequences into temporal tokens, enabling structural alignment with language-based inputs. To effectively bridge the semantic gap between modalities, both temporal and contextual tokens are embedded into a shared representation space via a pre-trained LLM, further optimized with generative objectives. Building upon this unified semantic space, the aligned LLM is subsequently fine-tuned in a supervised manner to predict future temporal tokens, which are then decoded back into the original numerical space. Extensive experiments on real-world datasets demonstrate the effectiveness of our framework and highlight its potential as a generative framework for context-aware time series forecasting. The code is available at https://github.com/Xiaoyu-Tao/TokenCast.

URL PDF HTML ☆

赞 0 踩 0

2506.15066 2026-06-18 cs.AR cs.MA 版本更新 85%

ChatModel: Automating Reference Model Design and Verification with LLMs

ChatModel: 利用LLMs自动化参考模型设计与验证

Jianmin Ye, Tianyang Liu, Qi Tian, Shengchu Su, Zhe Jiang, Xi Wang

专题命中其他LLM ：利用LLM自动化参考模型设计与验证，提升效率。

AI总结提出ChatModel平台，通过设计标准化和分层敏捷建模，利用LLM自动生成参考模型，在300个设计上验证，效率提升最高58.99%，验证周期加速7.11倍。

详情

AI中文摘要

随着集成电路设计复杂性的不断升级，功能验证变得越来越具有挑战性。参考模型对于加速验证过程至关重要，但其自身也变得越来越复杂且耗时。尽管大型语言模型（LLM）在代码编程方面显示出潜力，但有效生成复杂参考模型仍然是一个重大障碍。因此，我们引入了ChatModel，一个LLM辅助的敏捷参考模型生成与验证平台。ChatModel通过集成设计标准化和分层敏捷建模，简化了从设计规范到功能完备参考模型的过渡。采用构建块生成策略，不仅增强了LLM对参考模型的设计能力，还显著提高了验证效率。我们在300个不同复杂度的设计上评估了ChatModel，证明了参考模型生成在效率和质量上的显著提升。与替代方法相比，ChatModel实现了最高58.99%的性能提升，生成稳定性显著增强，并且其生成参考模型设计的能力提高了9.18倍。此外，ChatModel将参考模型设计与验证周期平均加速了7.11倍，相比传统手动方法。这些结果突显了ChatModel在推动参考模型生成与验证自动化方面的巨大潜力。

英文摘要

As the complexity of integrated circuit designs continues to escalate, functional verification becomes increasingly challenging. Reference models, critical for accelerating the verification process, are themselves becoming more intricate and time-consuming to develop. Despite the promise shown by large language models (LLMs) in code programming, effectively generating complex reference models remains a significant hurdle. Therefore, we introduce ChatModel, an LLM-aided agile reference model generation and verification platform. ChatModel streamlines the transition from design specifications to fully functional reference models by integrating design standardization and hierarchical agile modeling. Employing a building-block generation strategy, it not only enhances the design capabilities of LLMs for reference models but also significantly boosts verification efficiency. We evaluated ChatModel on 300 designs of varying complexity, demonstrating substantial improvements in both efficiency and quality of reference model generation. ChatModel achieved a peak performance improvement of 58.99% compared to alternative methods, with notable enhancements in generation stability, and delivered a 9.18x increase in its capacity to produce reference model designs. Moreover, ChatModel accelerates the reference model design and validation cycles by an average of 7.11x over traditional manual approaches. These results highlight the potential of ChatModel to significantly advance the automation of reference model generation and validation.

URL PDF HTML ☆

赞 0 踩 0

2412.15557 2026-06-18 cs.SE cs.CL 版本更新 85%

MORTAR: Multi-turn Metamorphic Testing for LLM-based Dialogue Systems

MORTAR：基于LLM的对话系统的多轮蜕变测试

Aaron Guoxiang Guo, Aldeida Aleti, Neelofar Neelofar, Chakkrit Tantithamthavorn, Yuanyuan Qi, Tsong Yueh Chen

发表机构 * Faculty of Information Technology, Monash University（墨尔本大学信息科技学院）； School of Computing Technologies, RMIT University（皇家墨尔本理工大学计算技术学院）； School of Science, Computing and Emerging Technologies, Swinburne University of Technology（斯威本理工大学科学、计算与新兴技术学院）

专题命中其他LLM ：LLM对话系统多轮测试方法

AI总结提出MORTAR方法，通过多轮蜕变关系自动化生成测试用例，解决LLM对话系统多轮测试中的预言问题，相比单轮测试每个用例发现更多且更高质量的缺陷。

Comments Accepted for publication in IEEE Transactions on Software Engineering (TSE)

详情

DOI: 10.1109/TSE.2026.3701230

AI中文摘要

随着基于LLM的对话系统在日常生活中的广泛应用，质量保证变得比以往更加重要。最近的研究成功引入了在单轮测试场景中识别意外行为的方法。然而，多轮交互是对话系统常见的实际使用方式，但针对此类交互的测试方法仍未得到充分探索。这主要是由于多轮测试中的预言问题，它仍然是对话系统开发人员和研究人员面临的重大挑战。在本文中，我们提出了MORTAR，一种蜕变式多轮对话测试方法，它缓解了测试基于LLM的对话系统时的测试预言问题。MORTAR形式化了对话系统的多轮测试，并自动生成问答对话测试用例，其中包含多种对话级扰动和蜕变关系（MRs）。自动化的MR匹配机制使MORTAR在蜕变测试中具有更高的灵活性和效率。所提出的方法完全自动化，无需依赖LLM评判。在测试六个流行的基于LLM的对话系统时，与单轮蜕变测试基线相比，MORTAR每个测试用例发现的错误数量增加了150%以上，效果显著更好。在错误质量方面，MORTAR在多样性、精确性和唯一性方面揭示了更高质量的错误。MORTAR有望激发更多的多轮测试方法，并帮助开发人员在有限的测试资源和预算下更全面地评估对话系统性能。

英文摘要

With the widespread application of LLM-based dialogue systems in daily life, quality assurance has become more important than ever. Recent research has successfully introduced methods to identify unexpected behaviour in single-turn testing scenarios. However, multi-turn interaction is the common real-world usage of dialogue systems, yet testing methods for such interactions remain underexplored. This is largely due to the oracle problem in multi-turn testing, which continues to pose a significant challenge for dialogue system developers and researchers. In this paper, we propose MORTAR, a metamorphic multi-turn dialogue testing approach, which mitigates the test oracle problem in testing LLM-based dialogue systems. MORTAR formalises the multi-turn testing for dialogue systems, and automates the generation of question-answer dialogue test cases with multiple dialogue-level perturbations and metamorphic relations (MRs). The automated MR matching mechanism allows MORTAR more flexibility and efficiency in metamorphic testing. The proposed approach is fully automated without reliance on LLM judges. In testing six popular LLM-based dialogue systems, MORTAR reaches significantly better effectiveness with over 150\% more bugs revealed per test case when compared to the single-turn metamorphic testing baseline. Regarding the quality of bugs, MORTAR reveals higher-quality bugs in terms of diversity, precision and uniqueness. MORTAR is expected to inspire more multi-turn testing approaches, and assist developers in evaluating the dialogue system performance more comprehensively with constrained test resources and budget.

URL PDF HTML ☆

赞 0 踩 0

2602.06470 2026-06-18 cs.CL cs.AI 版本更新 80%

Improve Large Language Model Systems with User Logs

通过用户日志改进大型语言模型系统

Changyue Wang, Weihang Su, Qingyao Ai, Xingzhao Yue, Rui Zhang, Xiaojia Chang, Yiqun Liu

发表机构 * Department of Computer Science and Technology, Tsinghua University（清华大学计算机科学与技术系）

专题命中其他LLM ：提出UNO框架利用用户日志改进LLM系统。

AI总结本文提出UNO框架，通过用户日志提炼规则和偏好对，利用查询反馈驱动聚类处理数据异质性，量化模型知识与日志数据间的认知差距，提升LLM系统性能。

详情

AI中文摘要

扩大训练数据和模型参数规模长期以来推动了大型语言模型（LLMs）的发展，但这一范式日益受到高质量数据稀缺和计算成本上升导致的边际效益递减的限制。因此，近期研究更加关注从真实世界部署中持续学习，其中用户交互日志提供了丰富的真人类反馈和过程知识。然而，从用户日志学习具有挑战性，因为它们是无结构和嘈杂的。传统的LLM系统往往难以区分有用的反馈信号与嘈杂的用户行为，且用户日志收集与模型优化之间的差异（例如，非策略优化问题）进一步加剧了这一问题。为此，我们提出UNO（用户日志驱动的优化），一个统一的框架，用于通过用户日志改进LLM系统（LLMsys）。UNO首先将日志提炼为半结构化的规则和偏好对，然后利用查询和反馈驱动的聚类来管理数据异质性，最后量化模型先验知识与日志数据之间的认知差距。这一评估指导LLMsys自适应地过滤掉嘈杂的反馈并构建不同模块，以处理从用户日志中提取的初级和反思性经验，从而提升未来的响应。广泛的实验表明，UNO在效果和效率上均达到最先进的水平，显著优于检索增强生成（RAG）和基于记忆的基线方法。我们已开源代码至https://github.com/bebr2/UNO。

英文摘要

Scaling training data and model parameters has long driven progress in large language models (LLMs), but this paradigm is increasingly constrained by the scarcity of high-quality data and diminishing returns from rising computational costs. As a result, recent work is increasing the focus on continual learning from real-world deployment, where user interaction logs provide a rich source of authentic human feedback and procedural knowledge. However, learning from user logs is challenging due to their unstructured and noisy nature. Vanilla LLM systems often struggle to distinguish useful feedback signals from noisy user behavior, and the disparity between user log collection and model optimization (e.g., the off-policy optimization problem) further strengthens the problem. To this end, we propose UNO (User log-driveN Optimization), a unified framework for improving LLM systems (LLMsys) with user logs. UNO first distills logs into semi-structured rules and preference pairs, then employs query-and-feedback-driven clustering to manage data heterogeneity, and finally quantifies the cognitive gap between the model's prior knowledge and the log data. This assessment guides the LLMsys to adaptively filter out noisy feedback and construct different modules for primary and reflective experiences extracted from user logs, thereby improving future responses. Extensive experiments show that UNO achieves state-of-the-art effectiveness and efficiency, significantly outperforming Retrieval Augmented Generation (RAG) and memory-based baselines. We have open-sourced our code at https://github.com/bebr2/UNO .

URL PDF HTML ☆

赞 0 踩 0

2602.00161 2026-06-18 cs.LG cs.AI cs.CL quant-ph 版本更新 80%

LLM Compression by Block Removal with Constrained Binary Optimization

通过带约束二进制优化的块移除进行LLM压缩

David Jansen, Roman Rausch, Ali Hashemi, David Montero, Román Orús

发表机构 * Multiverse Computing（多维计算公司）； Donostia International Physics Center（多斯蒂亚国际物理中心）； Ikerbasque Foundation for Science（伊克尔巴斯克科学基金会）

专题命中其他LLM ：提出LLM压缩方法，通过块移除优化，属于LLM。

AI总结提出将大语言模型块移除压缩问题建模为约束二进制优化，映射到Ising玻璃系统，实现高效排序和高质量非连续块移除，在50%压缩时MMLU提升近23个百分点，且计算高效、通用性强。

Comments 16 pages, 3 figures

详情

AI中文摘要

在本文中，我们将通过最优删除Transformer块（“块移除”）来压缩大语言模型（LLM）的问题，表述为一个约束二进制优化（CBO）问题，该问题可以映射到物理系统（Ising玻璃），其能量是下游模型性能的强代理。这种表述使得能够高效地对大量候选块移除配置进行排序，产生许多高质量、非平凡的解决方案，而不仅仅是移除连续区域。我们的方法在深度压缩场景中表现强劲，例如在Llama-3.3-70B-Instruct的50%压缩中，与其他最先进的块移除方法相比，我们在MMLU基准上取得了近23个百分点的提升。对于较轻的压缩，它在多个基准上与这些方法表现相当，适用于Llama-3.1-8B-Instruct、Qwen3-14B（重训练前后）以及Llama-3.3-70B-Instruct。该方法计算效率高，仅需在校准数据集上对少数活跃参数进行前向和反向传播。此外，我们证明，当无法精确求解CBO问题时，使用良好的启发式求解器可以在可忽略的运行时间内提供在下游任务上表现良好的解决方案。该方法可以轻松应用于任何架构。我们在最近的NVIDIA-Nemotron-3-Nano-30B-A3B-FP8模型上展示了这种通用性，该模型具有高度不均匀且具有挑战性的块结构，并且在移除2个注意力层或3个混合专家层时，我们在AIME25和GPQA上超越了最先进水平。

英文摘要

In this paper, we formulate the compression of large language models (LLMs) by optimally deleting transformer blocks (``block removal'') as a constrained binary optimization (CBO) problem that can be mapped to a physical system (Ising glass), whose energies are a strong proxy for downstream model performance. This formulation enables an efficient ranking of a large number of candidate block-removal configurations yielding many high-quality, non-trivial solutions beyond those only removing consecutive regions. Our method performs strongly in the deep compression regime, such as for 50% compression of Llama-3.3-70B-Instruct, where we achieve an almost 23 percentage point increase on the MMLU benchmark compared to other state-of-the-art (SOTA) block-removal methods. For lighter compression, it performs on par with those methods across several benchmarks for Llama-3.1-8B-Instruct, Qwen3-14B (both before and after retraining), as well as Llama-3.3-70B-Instruct. The approach is computationally efficient and requires only forward and backward passes on a calibration dataset for a few active parameters. Additionally, we demonstrate that using good heuristic solvers for the CBO problem provides solutions that perform well on downstream tasks in negligible runtime when it is unfeasible to solve the problem exactly. The method can be readily applied to any architecture. We illustrate this generality on the recent NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 model, which exhibits a highly inhomogeneous and challenging block structure, and where we outperform SOTA for AIME25 and GPQA when removing either 2 attention layers or 3 mixture-of-experts layers.

URL PDF HTML ☆

赞 0 踩 0

2601.14968 2026-06-18 cs.LG cs.AI 版本更新 80%

InstructTime++: Time Series Classification with Multimodal Language Modeling via Implicit Feature Enhancement

InstructTime++: 通过隐式特征增强的多模态语言建模进行时间序列分类

Mingyue Cheng, Xiaoyu Tao, Huajian Zhang, Qi Liu, Zhiding Liu, Yucong Luo, Yiheng Chen, Enhong Chen

发表机构 * State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China（中国科学技术大学认知智能国家重点实验室）

专题命中其他LLM ：将时间序列分类转化为多模态生成任务

AI总结提出将时间序列分类转化为多模态生成任务，通过离散化模块和对齐投影层弥合模态差距，并利用隐式特征建模提升语言模型性能。

详情

AI中文摘要

大多数现有的时间序列分类方法采用判别范式，将输入序列直接映射到独热编码的类别标签。虽然有效，但这种范式难以融入上下文特征，也无法捕捉类别间的语义关系。为了解决这些局限性，我们提出了InstructTime，一种将时间序列分类重新定义为多模态生成任务的新框架。具体来说，连续的数值序列、上下文文本特征和任务指令被视为多模态输入，而类别标签则通过调优的语言模型作为文本输出生成。为了弥合模态差距，InstructTime引入了一个时间序列离散化模块，将连续序列转换为离散的时间标记，同时结合对齐投影层和生成式自监督预训练策略，以增强跨模态表示对齐。在此框架基础上，我们进一步提出了InstructTime++，通过引入隐式特征建模来扩展InstructTime，以补偿语言模型有限的归纳偏差。InstructTime++利用专门的工具包从原始时间序列和上下文输入中挖掘信息丰富的隐式模式，包括统计特征提取和基于视觉-语言模型的图像描述，并将其转化为文本描述以实现无缝集成。在多个基准数据集上的大量实验证明了InstructTime++的优越性能。

英文摘要

Most existing time series classification methods adopt a discriminative paradigm that maps input sequences directly to one-hot encoded class labels. While effective, this paradigm struggles to incorporate contextual features and fails to capture semantic relationships among classes. To address these limitations, we propose InstructTime, a novel framework that reformulates time series classification as a multimodal generative task. Specifically, continuous numerical sequences, contextual textual features, and task instructions are treated as multimodal inputs, while class labels are generated as textual outputs by tuned language models. To bridge the modality gap, InstructTime introduces a time series discretization module that converts continuous sequences into discrete temporal tokens, together with an alignment projection layer and a generative self-supervised pre-training strategy to enhance cross-modal representation alignment. Building upon this framework, we further propose InstructTime++, which extends InstructTime by incorporating implicit feature modeling to compensate for the limited inductive bias of language models. InstructTime++ leverages specialized toolkits to mine informative implicit patterns from raw time series and contextual inputs, including statistical feature extraction and vision-language-based image captioning, and translates them into textual descriptions for seamless integration. Extensive experiments on multiple benchmark datasets demonstrate the superior performance of InstructTime++.

URL PDF HTML ☆

赞 0 踩 0

2508.07375 2026-06-18 cs.CL cs.SD eess.AS 版本更新 80%

TurnGuide: Enhancing Meaningful Full Duplex Spoken Interactions via Dynamic Turn-Level Text-Speech Interleaving

TurnGuide: 通过动态轮次级文本-语音交错增强有意义的全双工口语交互

Wenqian Cui, Lei Zhu, Xiaohui Li, Zhihan Guo, Haoli Bai, Lu Hou, Irwin King

发表机构 * The Chinese University of Hong Kong（香港中文大学）； Huawei Technologies（华为技术）

专题命中其他LLM ：全双工语音语言模型中的文本-语音交错生成

AI总结提出TurnGuide方法，通过动态分割助手语音为对话轮次并交错生成轮次级文本和语音，解决全双工语音语言模型在连续双通道音频中集成离散文本令牌导致的时间对齐问题，显著提升语义连贯性和轮次交互性能。

Comments Interspeech 2026 Long Paper Track

详情

AI中文摘要

全双工语音语言模型（FD-SLMs）是专门的基础模型，旨在通过建模复杂的对话轮次（如打断、反馈和重叠语音）来实现自然的实时口语交互。端到端（e2e）FD-SLMs利用真实世界的双通道对话数据捕捉细微的双说话者对话模式以实现类人交互，但由于语音序列过长和高质量口语对话数据有限，其对话能力往往比纯文本对话有所下降。尽管交错文本-语音生成可以缓解这种退化，但将离散文本令牌集成到连续双通道音频流中可能会破坏流畅交互所需的时间对齐。为了解决这个问题，我们提出了TurnGuide，一种用于e2e FD-SLMs的新型文本-语音交错生成方法，该方法动态地将助手语音分割成对话轮次，并交错生成轮次级文本和语音。这种方法使FD-SLMs能够整合LLMs的语义智能，同时不损害自然的声学流畅性。大量实验表明，TurnGuide不仅显著提升了e2e FD-SLMs生成语义有意义且连贯语音的能力，而且在各种轮次事件上达到了最先进的性能。演示请访问此https URL。代码请访问此https URL。

英文摘要

Full-Duplex Speech Language Models (FD-SLMs) are specialized foundation models designed to enable natural, real-time spoken interactions by modeling complex conversational turn-taking such as interruptions, backchannels, and overlapping speech. End-to-end (e2e) FD-SLMs leverage real-world double-channel conversational data to capture nuanced two-speaker dialogue patterns for human-like interactions, but their conversational abilities often degrade compared to pure-text conversation due to prolonged speech sequences and limited high-quality spoken dialogue data. Although interleaved text-speech generation could mitigate this degradation, integrating discrete text tokens into continuous double-channel audio streams could disrupt the precise time alignment required for fluid interaction. To address this, we propose TurnGuide, a novel text-speech interleaved generation approach for e2e FD-SLMs that dynamically segments assistant speech into dialogue turns and interleaves turn-level text and speech generation. This approach allows FD-SLMs to integrate the semantic intelligence of LLMs without compromising the natural acoustic flow. Extensive experiments show that TurnGuide not only significantly improves e2e FD-SLMs to produce semantically meaningful, coherent speech but also achieves state-of-the-art performance on various turn-taking events. Demos are available at https://dreamtheater123.github.io/TurnGuide-Demo/. Code is available at https://github.com/dreamtheater123/TurnGuide.

URL PDF HTML ☆

赞 0 踩 0

2512.04144 2026-06-18 cs.AI 版本更新 80%

RippleBench: Capturing Ripple Effects Using Existing Knowledge Repositories

RippleBench: 利用现有知识库捕捉涟漪效应

Roy Rinberg, Usha Bhalla, Igor Shilov, Flavio P. Calmon, Rohit Gandikota

发表机构 * Harvard University（哈佛大学）； Imperial College London（伦敦帝国学院）； Northeastern University（东北大学）

专题命中其他LLM ：评估语言模型遗忘的涟漪效应

AI总结提出RippleBench-Maker自动管道，从知识库检索语义邻居生成选择题，评估八种遗忘方法在Llama3-8B-Instruct上的涟漪效应，发现准确率下降随语义距离衰减且跨模型一致。

详情

AI中文摘要

针对语言模型的目标干预，如遗忘或模型编辑，旨在修改特定信息，但其效果往往传播到相关的、非预期的领域（例如，删除病毒学内容可能降低对过敏任务的性能）；这些副作用通常被称为涟漪效应。我们引入RippleBench-Maker，一个自动管道，从知识库中检索任何源概念的语义邻居，并生成不同语义距离的多选题。我们使用WikiRAG（一个基于英文维基百科的开源RAG系统）实例化该框架，构建RippleBench-WMDP-Bio（584个种子主题，352,961个问题），并在Llama3-8B-Instruct上评估八种遗忘方法。所有八种方法在遗忘目标附近准确率下降最大，并随语义距离衰减，每种方法具有不同的传播曲线。我们在Mistral-7B、Zephyr-7B和Yi-34B上复现了这些发现；跨模型的差值曲线几乎相同，表明涟漪效应是遗忘方法的属性而非基础模型。我们通过一项包含四个实验的Mechanical Turk研究（5,200+次响应，61名工作者）验证了所有主要管道阶段。我们发布所有代码、数据和基础设施。

英文摘要

Targeted interventions on language models, such as unlearning or model editing, aim to modify specific information, but their effects often propagate to related, unintended areas (e.g., removing virology content may degrade performance on allergies); these side-effects are commonly referred to as the ripple effect. We introduce RippleBench-Maker, an automatic pipeline that retrieves semantic neighbors of any source concept from a knowledge repository and generates multiple-choice questions at varying semantic distances. We instantiate this framework using WikiRAG, an open-source RAG system over English Wikipedia, to construct RippleBench-WMDP-Bio (584 seed topics, 352,961 questions), and evaluate eight unlearning methods on Llama3-8B-Instruct. All eight exhibit accuracy drops that are largest near the unlearned target and decay with semantic distance, each with a distinct propagation profile. We replicate these findings across Mistral-7B, Zephyr-7B, and Yi-34B; cross-model delta curves are nearly identical, suggesting ripple effects are a property of the unlearning method rather than the base model. We validate all major pipeline stages using a four-experiment Mechanical Turk study (5,200+ responses, 61 workers). We release all code, data, and infrastructure.

URL PDF HTML ☆

赞 0 踩 0

2510.09905 2026-06-18 cs.AI cs.CL 版本更新 80%

The Personalization Trap: How User Memory Alters Emotional Reasoning in LLMs

个性化陷阱：用户记忆如何改变大语言模型的情感推理

Xi Fang, Weijie Xu, Yuchong Zhang, Stephanie Eckman, Scott Nickleach, Chandan K. Reddy

发表机构 * Amazon（亚马逊）

专题命中其他LLM ：LLM情感推理偏差研究

AI总结研究用户记忆如何导致大语言模型在情感推理中产生系统性偏差，发现高绩效模型对优势背景用户的情感解读更准确，个性化机制可能嵌入社会等级。

Comments 19 pages 5 figures

详情

AI中文摘要

当AI助手记住Sarah是一位打两份工的单亲母亲时，它对她压力的解读是否与她是富有的高管时不同？随着个性化AI系统越来越多地融入长期用户记忆，理解这种记忆如何塑造情感推理至关重要。我们通过在人验证的情感智能测试上评估15个模型，研究用户记忆如何影响大语言模型（LLMs）的情感智能。我们发现，相同的场景搭配不同的用户画像会产生系统性不同的情感解读。在经验证的独立于用户的情感场景和多样化的用户画像中，几个高性能LLM出现了系统性偏差，其中优势背景的用户画像获得了更准确的情感解读。此外，LLM在情感推理和支持性推荐任务中表现出跨人口统计因素的显著差异，表明个性化机制可以将社会等级嵌入模型的情感推理中。这些结果凸显了记忆增强AI的一个关键挑战：为个性化设计的系统可能会强化社会不平等。为缓解这些差异，我们整理了一个通用偏好数据集，旨在减少人口统计画像对情感理解的影响。

英文摘要

When an AI assistant remembers that Sarah is a single mother working two jobs, does it interpret her stress differently than if she were a wealthy executive? As personalized AI systems increasingly incorporate long-term user memory, understanding how this memory shapes emotional reasoning is critical. We investigate how user memory affects emotional intelligence in large language models (LLMs) by evaluating 15 models on human-validated emotional intelligence tests. We find that identical scenarios paired with different user profiles produce systematically divergent emotional interpretations. Across validated user-independent emotional scenarios and diverse user profiles, systematic biases emerged in several high-performing LLMs where advantaged profiles received more accurate emotional interpretations. Moreover, LLMs demonstrate significant disparities across demographic factors in emotion reasoning and supportive recommendations tasks, indicating that personalization mechanisms can embed social hierarchies into models' emotional reasoning. These results highlight a key challenge for memory-enhanced AI: systems designed for personalization may reinforce social inequalities. To mitigate these disparities, we curate a general-purpose preference dataset designed to reduce demographic profiles' influence on emotional understanding.

URL PDF HTML ☆

赞 0 踩 0

2506.09046 2026-06-18 cs.LG cs.AI cs.MA 版本更新 80%

Self-Evolving Multi-Agent Systems via Textual Backpropagation

通过文本反向传播的自进化多智能体系统

Xiaowen Ma, Yunpu Ma, Chenyang Lin, Sikuan Yan, Jinhe Bi, Zixuan Cao, Yijun Tian, Volker Tresp, Hinrich Schuetze

发表机构 * Ludwig Maximilian University of Munich（慕尼黑路德维希-马克西米利安大学）； Technical University of Munich（慕尼黑技术大学）； Munich Center for Machine Learning（慕尼黑机器学习中心）； University of Notre Dame（诺丁汉大学）

专题命中其他LLM ：利用多个LLM构建多智能体神经网络框架。

AI总结提出Agentic Neural Network框架，将多智能体协作建模为分层神经网络，通过前向分解任务和反向传播反馈实现智能体角色、提示和协作的自进化，在七个基准数据集上超越现有方法。

详情

AI中文摘要

利用多个大型语言模型（LLM）已被证明对处理复杂、高维任务有效，但当前方法通常依赖静态、手动设计的多智能体配置。为克服这些限制，我们提出Agentic Neural Network（ANN）框架，该框架将多智能体协作概念化为分层神经网络架构。在此设计中，每个智能体作为节点运行，每一层形成一个专注于特定子任务的协作团队。我们的框架遵循两阶段优化策略：（1）前向阶段——受神经网络前向传播启发，任务被动态分解为子任务，并逐层构建具有合适聚合方法的协作智能体团队。（2）反向阶段——模仿反向传播，我们通过迭代反馈优化全局和局部协作，使智能体能够自进化其角色、提示和协调。这种神经符号方法使我们的框架能够在训练后创建新的或专门的智能体团队，在准确性和适应性方面带来显著提升。在七个基准数据集上，我们的工作在相同配置下超越了领先的多智能体基线，显示出持续的性能改进。

英文摘要

Leveraging multiple Large Language Models (LLMs) has proven effective for addressing complex, high-dimensional tasks, but current approaches often rely on static, manually engineered multi-agent configurations. To overcome these constraints, we present the Agentic Neural Network (ANN), a framework that conceptualizes multi-agent collaboration as a layered neural network architecture. In this design, each agent operates as a node, and each layer forms a cooperative team focused on a specific subtask. Our framework follows a two-phase optimization strategy: (1) Forward Phase - Drawing inspiration from neural network forward passes, tasks are dynamically decomposed into subtasks, and cooperative agent teams with suitable aggregation methods are constructed layer by layer. (2) Backward Phase - Mirroring backpropagation, we refine both global and local collaboration through iterative feedback, allowing agents to self-evolve their roles, prompts, and coordination. This neuro-symbolic approach enables our framework to create new or specialized agent teams post-training, delivering notable gains in accuracy and adaptability. Across seven benchmark datasets, our work surpasses leading multi-agent baselines under the same configurations, showing consistent performance improvements.

URL PDF HTML ☆

赞 0 踩 0

2507.01414 2026-06-18 cs.LG 版本更新 80%

Decomposing Prediction Mechanisms for In-Context Recall

分解上下文召回中的预测机制

Sultan Daniels, Dylan Davis, Dhruv Gautam, Wentinn Liao, Gireeja Ranade, Anant Sahai

发表机构 * University of California, Berkeley（加州大学伯克利分校）； University of Pennsylvania（宾夕法尼亚大学）

专题命中其他LLM ：分析Transformer上下文学习机制

AI总结通过设计结合连续上下文学习与离散关联召回的新玩具问题，发现Transformer模型在上下文召回任务中存在两种具有不同学习动态的独立机制：一种依赖离散符号标签进行关联召回，另一种基于前一个token和上下文进行贝叶斯式预测。

Comments 45 pages, 47 figures, 2 tables

详情

AI中文摘要

我们引入了一类新的玩具问题，将线性回归风格的连续上下文学习（ICL）特征与离散关联召回相结合。我们在该玩具的样本轨迹上预训练Transformer模型，具体是从随机抽取的线性确定性动力系统中提取的符号标记交错状态观测。我们研究当模型被提示使用相应的上下文标签时，是否能够召回先前在其上下文中见过的序列的状态。仔细观察这个任务，很明显模型必须执行两个功能：（1）识别应召回哪个系统的状态，并将该系统应用于其最后看到的状态；（2）继续应用正确的系统来预测后续状态。训练动态表明，第一个能力在模型训练中后期才出现。令人惊讶的是，第二个能力（继续预测恢复的序列）发展得更早。通过分布外实验和通过边缘剪枝对模型权重的机制分析，我们发现这个玩具问题的下一个token预测涉及至少两个独立的机制。一种机制使用离散符号标签进行关联召回，以预测先前见过的序列恢复的开始。第二种机制在很大程度上与离散符号标签无关，基于前一个token和上下文进行“贝叶斯式”预测。这两种机制具有不同的学习动态。为了确认这种多机制现象（表现为不同的相变）不仅仅是玩具设置的人为产物，我们使用OLMo在ICL翻译任务上的训练检查点观察到了类似的现象：第一个任务token的性能与第二个任务token的性能出现决定性差距。

英文摘要

We introduce a new family of toy problems that combine features of linear-regression-style continuous in-context learning (ICL) with discrete associative recall. We pretrain transformer models on sample traces from this toy, specifically symbolically-labeled interleaved state observations from randomly drawn linear deterministic dynamical systems. We study if the transformer models can recall the state of a sequence previously seen in its context when prompted to do so with the corresponding in-context label. Taking a closer look at this task, it becomes clear that the model must perform two functions: (1) identify which system's state should be recalled and apply that system to its last seen state, and (2) continuing to apply the correct system to predict the subsequent states. Training dynamics reveal that the first capability emerges well into a model's training. Surprisingly, the second capability, of continuing the prediction of a resumed sequence, develops much earlier. Via out-of-distribution experiments, and a mechanistic analysis on model weights via edge pruning, we find that next-token prediction for this toy problem involves at least two separate mechanisms. One mechanism uses the discrete symbolic labels to do the associative recall required to predict the start of a resumption of a previously seen sequence. The second mechanism, which is largely agnostic to the discrete symbolic labels, performs a "Bayesian-style" prediction based on the previous token and the context. These two mechanisms have different learning dynamics. To confirm that this multi-mechanism (manifesting as separate phase transitions) phenomenon is not just an artifact of our toy setting, we used OLMo training checkpoints on an ICL translation task to see a similar phenomenon: a decisive gap in the emergence of first-task-token performance vs second-task-token performance.

URL PDF HTML ☆

赞 0 踩 0

2601.21626 2026-06-18 cs.LG cs.AI 版本更新 75%

HeRo-Q: A General Framework for Stable Low Bit Quantization via Hessian Conditioning

HeRo-Q: 通过Hessian条件化实现稳定低比特量化的通用框架

Jinhao Zhang, Yunquan Zhang, Zicheng yan, Boyang Zhang, Jun Sun, Daning Cheng

发表机构 * Beijing University of Posts and Telecommunications（北京邮电大学）； Institute of Computing Technology, Chinese Academy of Sciences（中国科学院计算技术研究所）； University of Science and Technology of China（中国科学技术大学）； Zhejiang Lab（浙江实验室）； Peng Cheng Laboratory（鹏城实验室）

专题命中其他LLM ：提出HeRo-Q算法用于LLM低比特量化，属于LLM。

AI总结针对后训练量化中“低误差、高损失”的矛盾，提出HeRo-Q算法，通过轻量可学习的旋转压缩矩阵重塑损失景观，降低最大Hessian特征值，增强对量化噪声的鲁棒性，在Llama和Qwen模型上优于现有方法。

详情

AI中文摘要

后训练量化（PTQ）是一种主流的模型压缩技术，但由于其仅专注于最小化量化误差，常常导致矛盾的“低误差、高损失”现象。根本原因在于LLM损失景观的Hessian矩阵：少数高曲率方向对扰动极其敏感。为了解决这个问题，我们提出了Hessian鲁棒量化（HeRo Q）算法，该算法在量化前对权重空间应用一个轻量级、可学习的旋转压缩矩阵。这个联合框架通过降低最大的Hessian特征值并减小其最大特征值来重塑损失景观，从而显著增强对量化噪声的鲁棒性。HeRo-Q不需要修改架构，计算开销可忽略不计，并且可以无缝集成到现有的PTQ流程中。在Llama和Qwen模型上的实验表明，HeRo Q在标准W4A8设置下不仅持续优于包括GPTQ、AWQ和SpinQuant在内的最先进方法，而且在极具挑战性的W3A16超低比特场景中表现出色，将Llama3 8B在GSM8K上的准确率提升至70.15%，并有效避免了激进量化中常见的逻辑崩溃。

英文摘要

Post Training Quantization (PTQ), a mainstream model compression technique, often leads to the paradoxical 'low error, high loss' phenomenon because it focuses solely on minimizing quantization error. The root cause lies in the Hessian matrix of the LLM loss landscape: a few high curvature directions are extremely sensitive to perturbations. To address this, we propose the Hessian Robust Quantization (HeRo Q) algorithm, which applies a lightweight, learnable rotation-compression matrix to the weight space prior to quantization. This joint framework reshapes the loss landscape by reducing the largest Hessian eigenvalue and reducing its max eigenvalue, thereby significantly enhancing robustness to quantization noise. HeRo-Q requires no architectural modifications, incurs negligible computational overhead, and integrates seamlessly into existing PTQ pipelines. Experiments on Llama and Qwen models show that HeRo Q consistently outperforms state of the art methods including GPTQ, AWQ, and SpinQuant not only achieving superior performance under standard W4A8 settings, but also excelling in the highly challenging W3A16 ultra low bit regime, where it boosts GSM8K accuracy on Llama3 8B to 70.15\% and effectively avoids the logical collapse commonly seen in aggressive quantization.

URL PDF HTML ☆

赞 0 踩 0

2604.13082 2026-06-18 cs.LG cs.AI 版本更新 70%

The Long Delay to Arithmetic Generalization: When Learned Representations Outrun Behavior

算术泛化的长延迟：当学习到的表征超越行为时

Laura Gomezjurado Gonzalez

发表机构 * Stanford University（斯坦福大学）

专题命中其他LLM ：研究Transformer泛化机制，与LLM相关

AI总结研究Transformer在算术任务中泛化延迟的原因，发现编码器早期已学到结构，但解码器瓶颈导致延迟，通过移植编码器或冻结编码器可加速泛化，且数字基的选择影响学习难度。

Comments 19 pages, 10 fugures

详情

AI中文摘要

在算法任务上训练的Transformer中的grokking现象以训练集拟合与突然泛化之间的长延迟为特征，但该延迟的来源仍不清楚。在编码器-解码器算术模型中，我们认为这种延迟反映了对已学习结构的有限访问，而非未能首先获得该结构。我们研究一步Collatz预测，发现编码器在最初几千训练步内组织了奇偶性和残差结构，而输出精度在数万步内仍接近随机。因果干预支持解码器瓶颈假说。将训练好的编码器移植到新模型中将grokking加速2.75倍，而移植训练好的解码器则有害。冻结收敛的编码器并仅重新训练解码器完全消除了平台期，并达到97.6%的准确率，而联合训练为86.1%。解码器任务的难易取决于数字表示。在15种基中，那些分解与Collatz映射算术对齐的基（例如基24）达到99.8%的准确率，而二进制完全失败，因为其表示崩溃且无法恢复。基的选择作为归纳偏置，控制解码器可利用的局部数字结构量，从而在相同底层任务上产生巨大的可学习性差异。

英文摘要

Grokking in transformers trained on algorithmic tasks is characterized by a long delay between training-set fit and abrupt generalization, but the source of that delay remains poorly understood. In encoder-decoder arithmetic models, we argue that this delay reflects limited access to already learned structure rather than failure to acquire that structure in the first place. We study one-step Collatz prediction and find that the encoder organizes parity and residue structure within the first few thousand training steps, while output accuracy remains near chance for tens of thousands more. Causal interventions support the decoder bottleneck hypothesis. Transplanting a trained encoder into a fresh model accelerates grokking by 2.75 times, while transplanting a trained decoder actively hurts. Freezing a converged encoder and retraining only the decoder eliminates the plateau entirely and yields 97.6% accuracy, compared to 86.1% for joint training. What makes the decoder's job harder or easier depends on numeral representation. Across 15 bases, those whose factorization aligns with the Collatz map's arithmetic (e.g., base 24) reach 99.8% accuracy, while binary fails completely because its representations collapse and never recover. The choice of base acts as an inductive bias that controls how much local digit structure the decoder can exploit, producing large differences in learnability from the same underlying task.

URL PDF HTML ☆

赞 0 踩 0

2601.18511 2026-06-18 cs.CR 版本更新 70%

Scaling up FHE-based Privacy-Preserving ML: Higher Throughput, Longer Inputs for LLama-3-8B

扩展基于FHE的隐私保护机器学习：LLama-3-8B的更高吞吐量和更长输入

Jaiyoung Park, Sejin Park, Jai Hyun Park, Jung Ho Ahn, Jung Hee Cheon, Guillaume Hanrot, Jung Woo Kim, Minje Park, Damien Stehlé

专题命中其他LLM ：提出基于FHE的隐私保护LLM推理加速方法。

AI总结针对FHE-based LLM推理中输入长度扩展性差和非线性层评估受异常值影响的问题，采用令牌预置、正交旋转和稀疏密文多项式求值方法，结合快速同态线性代数技术，实现128加密令牌推理加速，并扩展至数千令牌的异构输入，在Llama-3-8B上取得显著性能提升。

详情

AI中文摘要

随着大型语言模型（LLM）变得无处不在，与推理相关的隐私问题日益突出。全同态加密（FHE）已成为非交互式机密LLM推理的主要密码学解决方案。然而，现有解决方案在输入令牌长度上扩展性差，主要关注小模型或小输入尺寸。它们还受到大的异常值影响，这强烈影响非线性层的评估，导致高昂的多项式逼近成本。我们从两个方向扩展基于FHE的LLM推理。首先，我们加速了128个加密令牌的基于FHE的推理。我们采用机器学习技术（令牌预置和正交旋转）来减轻异常值对非线性层FHE评估的影响。另外，我们设计了一种新颖的稀疏密文多项式求值方法，以加速我们的同态SoftMax实现。我们将这些与最近的快速同态线性代数技术相结合，实现了显著提高的效率。其次，我们将提示大小扩展到数千个令牌，适用于只有输入的最终部分敏感且加密的场景。处理此问题需要处理标准的明文-明文和密文-密文组件，以及针对新颖的明文-密文组件的宽同态计算。为了解决这个问题，我们设计了一种专用的同态线性代数算法，构建了一个浅层同态注意力电路，以最小化自举成本。基于这些要素，我们提出了一个基于CKKS的Llama-3-8B私有推理端到端实现。在8个NVIDIA RTX PRO 6000 GPU上，128个加密令牌的摘要生成需要20秒，生成每个令牌需要18秒（远超SOTA在更昂贵的H100 GPU上的295秒）。对于4096个令牌的异构输入（最后128个加密），摘要生成需要64秒，生成每个令牌需要22秒。

英文摘要

As large language models (LLMs) become ubiquitous, privacy concerns pertaining to inference keep growing. Fully homomorphic encryption (FHE) has emerged as a primary cryptographic solution for non-interactive confidential LLM inference. However, existing solutions scale poorly with input token length, focusing on small models or input sizes. They also suffer from large outlier values, which strongly impact the evaluation of non-linear layers, leading to heavy polynomial approximation costs. We scale up FHE-based LLM inference in two directions. First, we accelerate FHE-based inference for 128 encrypted tokens. We adopt ML techniques (token prepending and orthogonal rotations) to mitigate outlier impacts on the FHE evaluation of non-linear layers. Separately, we devise a novel polynomial evaluation method for sparsely-packed ciphertexts to speed up our homomorphic SoftMax implementation. We combine these with recent fast homomorphic linear algebra techniques, achieving significantly improved efficiency. Second, we expand the prompt size up to thousands of tokens for contexts where only the final part of the input is sensitive and encrypted. Processing this requires handling standard plaintext-plaintext and ciphertext-ciphertext components, alongside a wide homomorphic computation for a novel plaintext-ciphertext component. To address this, we devise a dedicated homomorphic linear algebra algorithm, building a shallow homomorphic attention circuit that minimizes bootstrapping costs. Based on these ingredients, we present a CKKS-based end-to-end implementation of Llama-3-8B private inference. On 8 NVIDIA RTX PRO 6000 GPUs, 128 encrypted tokens take 20s for summarization and 18s/token for generation (vastly outperforming the SOTA 295s on costlier H100 GPUs). For a heterogeneous 4096-token input (last 128 encrypted), it takes 64s for summarization and 22s/token for generation.

URL PDF HTML ☆

赞 0 踩 0

2510.27353 2026-06-18 cs.AI 版本更新 70%

An In-depth Study of LLM Contributions to the Bin Packing Problem

LLM对装箱问题贡献的深入研究

Julien Herrmann, Guillaume Pallez

发表机构 * CNRS-IRIT ； Inria

专题命中其他LLM ：研究LLM对装箱问题的贡献，分析LLM生成启发式算法。

AI总结通过分析LLM生成的启发式算法，发现其虽可读但难以解释，进而提出更简单高效的新算法，质疑LLM对装箱问题的实际贡献。

Comments Accepted for publication in ACM Transactions on Evolutionary Learning and Optimization

详情

DOI: 10.1145/3821574

AI中文摘要

近期研究表明，大型语言模型（LLM）可能为数学发现提供有趣的思路。该主张基于报告称，基于LLM的遗传算法在均匀分布和Weibull分布下为在线装箱问题产生了具有新见解的启发式算法。本文通过详细分析LLM产生的启发式算法，考察其行为和可解释性，重新评估了这一主张。尽管这些启发式算法是人类可读的，但即使对领域专家而言，它们仍然在很大程度上是不透明的。基于此分析，我们提出了一类针对这些特定装箱实例的新算法。推导出的算法显著更简单、更高效、更可解释且更具泛化性，表明所考虑的实例本身相对简单。然后，我们讨论了关于LLM对该问题贡献的主张的局限性，该主张似乎基于一个错误的假设，即这些实例先前已被研究过。我们的发现反而强调了在评估LLM生成输出的科学价值时，需要进行严格的验证和情境化。

英文摘要

Recent studies have suggested that Large Language Models (LLMs) could provide interesting ideas contributing to mathematical discovery. This claim was motivated by reports that LLM-based genetic algorithms produced heuristics offering new insights into the online bin packing problem under uniform and Weibull distributions. In this work, we reassess this claim through a detailed analysis of the heuristics produced by LLMs, examining both their behavior and interpretability. Despite being human-readable, these heuristics remain largely opaque even to domain experts. Building on this analysis, we propose a new class of algorithms tailored to these specific bin packing instances. The derived algorithms are significantly simpler, more efficient, more interpretable, and more generalizable, suggesting that the considered instances are themselves relatively simple. We then discuss the limitations of the claim regarding LLMs' contribution to this problem, which appears to rest on the mistaken assumption that the instances had previously been studied. Our findings instead emphasize the need for rigorous validation and contextualization when assessing the scientific value of LLM-generated outputs.

URL PDF HTML ☆

赞 0 踩 0

2506.12311 2026-06-18 cs.CL cs.SD eess.AS 版本更新 70%

Phonikud: Overcoming Phonetic Underspecification for Hebrew Text-To-Speech

Phonikud：克服希伯来语文本转语音中的语音欠指定问题

Yakov Kolani, Maxim Melichov, Cobi Calev, Morris Alper

发表机构 * Independent Researcher（独立研究者）； Reichman University（雷赫曼大学）； Tel Aviv University（特拉维夫大学）； Carnegie Mellon University（卡内基梅隆大学）

专题命中其他LLM ：希伯来语TTS，涉及语言模型

AI总结提出Phonikud框架，通过开源G2P系统、语料库、基准和评估模型，解决希伯来语TTS中重音等语音特征欠指定问题，实现更准确的音素预测。

Comments Accepted to Interspeech 2026. Project page: https://phonikud.github.io

详情

AI中文摘要

现代希伯来语的文本转语音（TTS）受到该语言正字法复杂性的挑战，现有解决方案忽略了诸如重音等欠指定的语音特征。我们提出了一个更准确的希伯来语TTS框架，包含四个贡献：（1）Phonikud，一个开源的希伯来语字素到音素（G2P）系统，输出完全指定的国际音标（IPA）转录，通过增强基础注音器设计而成。（2）ILSpeech语料库，包含配对的希伯来语音频、文本和专家IPA标注。（3）针对先前未测量的希伯来语G2P转换任务的基准。（4）希伯来语音频到IPA模型，捕获先前忽略的语音细节，用于自动TTS评估。我们的结果表明，Phonikud比先前方法更准确地预测希伯来语音素，并且使用Phonikud语音输入的小型本地TTS模型接近大型专有系统。我们在以下网址发布代码、数据和模型：this https URL。

英文摘要

Text-to-speech (TTS) for Modern Hebrew is challenged by the language's orthographic complexity, with existing solutions ignoring underspecified phonetic features such as stress. We present a framework for more phonetically accurate Hebrew TTS with four contributions: (1) Phonikud, an open-source Hebrew grapheme-to-phoneme (G2P) system that outputs fully-specified International Phonetic Alphabet (IPA) transcriptions, designed by augmenting a base diacritizer. (2) The ILSpeech corpus of paired Hebrew audio, text, and expert IPA annotations. (3) A benchmark for the previously unmeasured task of Hebrew G2P conversion. (4) Hebrew audio-to-IPA models capturing previously disregarded phonetic details for automatic TTS evaluation. Our results show that Phonikud more accurately predicts Hebrew phonemes than prior methods, and that small, local TTS models with phonetic input from Phonikud approach large proprietary systems. We release our code, data, and models at https://phonikud.github.io.

URL PDF HTML ☆

赞 0 踩 0

2605.26903 2026-06-18 cs.CR cs.AI 版本更新 60%

Practical Anonymous Two-Party Gradient Boosting Decision Tree

实用的匿名两方梯度提升决策树

Chenyu Huang, Fan Zhang, Minxin Du, Sherman S. M. Chow, Huangxun Chen, Huaming Rao, Danqing Huang, Bo Qian, Peng Chen

发表机构 * Tencent（腾讯）； Hong Kong Polytechnic University（香港理工大学）； Chinese University of Hong Kong（香港中文大学）； HKUST-GZ

专题命中其他LLM ：梯度提升决策树安全训练，非LLM但涉及AI安全

AI总结针对两方垂直分割数据上的梯度提升决策树训练，提出一种基于双电路隐私集合求交和遗忘可编程伪随机函数的匿名协议，在隐藏记录标识符的同时保持效率。

Comments 19 pages; 2026 IEEE Symposium on Security and Privacy (SP)

Journal ref 2026 IEEE Symposium on Security and Privacy (SP)

详情

DOI: 10.1109/SP63933.2026.00084

AI中文摘要

梯度提升决策树（GBDT）擅长处理结构化数据，通常用于在互不信任的各方之间垂直分割的特征上进行训练。高速和可解释性使得GBDT在金融和医疗领域广受欢迎，而神经网络在这些领域可能表现不佳。为GBDT启用安全计算带来了独特的挑战，需要安全的记录对齐以进行比较。依赖隐私集合求交（PSI）是一种事实上的方法。将PSI误认为是安全措施实际上会暴露数据集中哪些记录标识符（ID）是共享的。尽管电路PSI可以提供帮助，但对于通用用途来说成本高昂。需要新的思路来在“黑暗森林”中高效训练。为了隐藏ID，我们启动了对两方持有的分割数据上的匿名GBDT训练的研究。我们设计中的双电路PSI让双方交替作为接收者，对本地特征执行“选取后求和”。通过遗忘可编程伪随机函数，我们将电路PSI的输出作为共享状态在运行之间传播。避免通用对齐，我们解决了被忽视的困境：隐藏ID会带来与域大小成比例的成本。接下来，我们将用于将单指令多数据同态加密从（环）学习误差转换的密文打包成本减半，相比之前的安全GBDT（Usenix Security' 23）和相关安全机器学习计算。对比实验表明，我们的协议在效率上与有泄漏的方法相比仍具有竞争力。通过启用隐藏ID的聚合，我们的技术可以扩展到其他垂直分割的分析场景。

英文摘要

Structured data is well handled by gradient-boosted decision trees (GBDT), which are usually trained on vertically partitioned features across mutually distrustful parties. High speed and interpretability make GBDTs popular in finance and healthcare, where neural networks may fall short. Enabling secure computation for GBDTs poses unique challenges, requiring secure record alignment for comparison. Relying on private set intersection (PSI) is a de facto approach. Mistaking PSI for a safety measure actually exposes which record identifiers (IDs) are shared between the datasets. Although circuit-PSI could help, it is costly for generic uses. New ideas are needed to efficiently train in a "dark forest". Aiming to hide the IDs, we initiate the study of anonymous GBDT training on split data held by two parties. Dual circuit-PSI in our design lets the parties alternate as receiver to run pick-then-sum over local features. Via oblivious programmable pseudorandom functions, we propagate circuit-PSI outputs as shared state across runs. Avoiding universal alignment, we resolve the neglected dilemma that ID hiding incurs a cost that scales with domain size. Next, we halve the cost of ciphertext packing used to convert single-instruction multiple-data homomorphic encryption from (ring) learning with errors in prior secure GBDT (Usenix Security' 23) and related secure machine-learning computations. Comparative experiments show our protocol remains competitive with leaky approaches in efficiency. Enabling ID-hiding aggregation, our techniques can extend to other vertically partitioned analytics.

URL PDF HTML ☆

赞 0 踩 0

2605.07036 2026-06-18 physics.ed-ph 版本更新 60%

Using Large Language Models to Analyze Engagement in Computational Thinking via Computational Physics Essays

使用大型语言模型通过计算物理论文分析计算思维中的参与度

Sean Savage, Amir Bralin, Paul Hur, N. Sanjay Rebello

专题命中其他LLM ：利用LLM自动评估学生计算物理论文中的计算思维。

AI总结本研究利用多模态大型语言模型自动评估100篇学生计算物理论文中的计算思维参与度，在明确子任务上达到84%的准确率，但主观整体质量评估准确率仅71%。

Comments 13 pages, 3 figures, 3 tables. Submitted to Physical Review Physics Education Research

详情

AI中文摘要

随着计算思维（CT）在物理教育中日益重要，对真实、基于项目的评估的需求也在增长。虽然开放式多模态作业（如计算物理论文，CPEs）有助于捕捉学生的推理并鼓励主动学习，但它们引入了显著的评估瓶颈。手动根据复杂的计算实践分类法对这些复杂笔记本进行评分是资源密集型的，并限制了大规模课程的扩展性。在本研究中，我们调查了使用多模态大型语言模型（LLM）自动评估100篇学生生成的CPEs的可行性。使用人工编码的基线，我们系统评估了模型在20个不同的CT子实践和整体质量评分中检测学生参与度的能力。结果表明，LLM在明确定义的任务上表现非常好，在二元子实践上达到了84%的精确一致率。然而，更主观的构念被证明具有挑战性，模型在整体质量分析中仅达到71%的一致率。我们的发现表明，虽然LLM可以可靠地自动化检测特定的计算实践，但主观评估仍然是一个障碍。

英文摘要

As computational thinking (CT) becomes increasingly important to physics education, the need for authentic, project-based assessments has grown. While open-ended multimodal assignments, such as Computational Physics Essays (CPEs), help capture student reasoning and encourage active learning, they introduce a significant evaluation bottleneck. Manually grading these complex notebooks across a complex taxonomy of computational practices is resource-intensive and limits scalability in large-enrollment courses. In this study, we investigated the viability of using a multimodal Large Language Model (LLM) to automate the evaluation of 100 student-generated CPEs. Using a human-coded baseline, we systematically evaluated the model's capacity to detect student engagement across 20 distinct CT sub-practices and a holistic overall quality score. The results showed that the LLM performs very well on clearly defined tasks, achieving an 84% exact agreement with human raters on the binary sub-practices. However, more subjective constructs proved challenging, with the model reaching only a 71% agreement for the holistic quality analysis. Our findings demonstrated that while LLMs can reliably automate the detection of specific computational practices, subjective evaluation remains a hurdle.

URL PDF HTML ☆

赞 0 踩 0

2604.04342 2026-06-18 cs.LG stat.ML 版本更新 60%

Generative models for decision-making under distributional shift

分布偏移下决策的生成模型

Xiuyuan Cheng, Yunqin Zhu, Yao Xie

发表机构 * Department of Mathematics, Duke University（杜克大学数学系）； H. Milton Stewart School of Industrial and Systems Engineering, Georgia Institute of Technology（佐治亚理工学院H. Milton Stewart工业与系统工程学院）

专题命中其他LLM ：生成模型用于决策，与LLM弱相关

AI总结本文提出基于流和分数生成模型的统一框架，通过传输映射、速度场等工具处理分布偏移下的决策问题，实现鲁棒性、条件分布生成及不确定性量化。

Comments INFORMS TutORials in Operations Research, 2026

详情

AI中文摘要

许多数据驱动的决策问题使用从历史数据估计的名义分布来制定，而性能最终由可能发生偏移、依赖于上下文、部分观测或由压力引起的部署分布决定。本教程介绍了现代生成模型，特别是基于流和分数的方法，作为构建决策相关分布的数学工具。从运筹学的角度来看，它们的主要价值不在于无约束的样本合成，而在于通过传输映射、速度场、分数场和引导随机动力学来表示和变换分布。我们提出了一个基于前推映射、连续性、Fokker-Planck方程、Wasserstein几何和概率空间优化的统一框架。在此框架内，生成模型可用于学习名义不确定性、构建用于鲁棒性的受压或最不利分布，以及在侧信息和部分观测下生成条件或后验分布。我们还强调了代表性的理论保证，包括迭代流模型的前向-反向收敛、传输映射空间中的一阶极小极大分析，以及具有生成先验的后验采样的误差传递界。本教程为在分布偏移下使用生成模型进行场景生成、鲁棒决策、不确定性量化及相关问题提供了原则性的介绍。

英文摘要

Many data-driven decision problems are formulated using a nominal distribution estimated from historical data, while performance is ultimately determined by a deployment distribution that may be shifted, context-dependent, partially observed, or stress-induced. This tutorial presents modern generative models, particularly flow- and score-based methods, as mathematical tools for constructing decision-relevant distributions. From an operations research perspective, their primary value lies not in unconstrained sample synthesis but in representing and transforming distributions through transport maps, velocity fields, score fields, and guided stochastic dynamics. We present a unified framework based on pushforward maps, continuity, Fokker-Planck equations, Wasserstein geometry, and optimization in probability space. Within this framework, generative models can be used to learn nominal uncertainty, construct stressed or least-favorable distributions for robustness, and produce conditional or posterior distributions under side information and partial observation. We also highlight representative theoretical guarantees, including forward-reverse convergence for iterative flow models, first-order minimax analysis in transport-map space, and error-transfer bounds for posterior sampling with generative priors. The tutorial provides a principled introduction to using generative models for scenario generation, robust decision-making, uncertainty quantification, and related problems under distributional shift.

URL PDF HTML ☆

赞 0 踩 0

2602.17187 2026-06-18 stat.ML cs.LG 版本更新 60%

Anti-causal domain generalization: Leveraging unlabeled data

反因果域泛化：利用无标签数据

Sorawit Saengkyongam, Juan L. Gamella, Andrew C. Miller, Jonas Peters, Nicolai Meinshausen, Christina Heinze-Deml

发表机构 * Apple（苹果公司）； ETH Zürich（苏黎世联邦理工学院）

专题命中其他LLM ：域泛化方法，可应用于LLM但非核心

AI总结针对反因果设置下的域泛化问题，提出利用无标签数据估计环境扰动方向，通过惩罚模型对协变量均值和协方差变化的敏感性实现鲁棒性，并提供最坏情况最优性保证。

Comments Accepted at the International Conference on Machine Learning (ICML) 2026

详情

AI中文摘要

域泛化问题关注的是学习在部署到新的、未见过的环境时对分布变化具有鲁棒性的预测模型。现有方法通常需要来自多个训练环境的标记数据，这在标记数据稀缺时限制了它们的适用性。在这项工作中，我们研究了反因果设置下的域泛化，其中结果导致观察到的协变量。在这种结构下，影响协变量的环境扰动不会传播到结果，这促使我们对模型对这些扰动的敏感性进行正则化。关键在于，估计这些扰动方向不需要标签，使我们能够利用来自多个环境的无标签数据。我们提出了两种方法，分别惩罚模型对跨环境协变量均值和协方差变化的敏感性，并证明这些方法在特定环境类别下具有最坏情况最优性保证。最后，我们在一个受控物理系统和一个生理信号数据集上展示了我们方法的实证性能。

英文摘要

The problem of domain generalization concerns learning predictive models that are robust to distribution shifts when deployed in new, previously unseen environments. Existing methods typically require labeled data from multiple training environments, limiting their applicability when labeled data are scarce. In this work, we study domain generalization in an anti-causal setting, where the outcome causes the observed covariates. Under this structure, environment perturbations that affect the covariates do not propagate to the outcome, which motivates regularizing the model's sensitivity to these perturbations. Crucially, estimating these perturbation directions does not require labels, enabling us to leverage unlabeled data from multiple environments. We propose two methods that penalize the model's sensitivity to variations in the mean and covariance of the covariates across environments, respectively, and prove that these methods have worst-case optimality guarantees under certain classes of environments. Finally, we demonstrate the empirical performance of our approach on a controlled physical system and a physiological signal dataset.

URL PDF HTML ☆

赞 0 踩 0

1. 后训练 5 篇

A Comprehensive Survey of Direct Preference Optimization: Datasets, Theories, Variants, and Applications

Trust Region On-Policy Distillation

Retell, Reward, Repeat: Reinforcement Learning for Narrative Theory-Informed Story Retelling

From Memorization to Parameter Interference: How Overtraining Experts Harms Model Merging

MemBoost: A Memory-Boosted Framework for Cost-Aware LLM Inference

2. 其他LLM 25 篇

DSB: Dynamic Sliding Block Scheduling for Diffusion LLMs

Enhancing CVRP Solver through LLM-driven Automatic Heuristic Design

Narrative Theory-Driven LLM Methods for Automatic Story Generation and Understanding: A Survey

Rethinking Cross-lingual Gaps from a Statistical Viewpoint

Probing Semantic Alignment, Lexical Invariance, and Syntactic Influence in LLM Metaphor Processing

From Values to Tokens: An LLM-Driven Framework for Context-aware Time Series Forecasting via Symbolic Discretization

ChatModel: Automating Reference Model Design and Verification with LLMs

MORTAR: Multi-turn Metamorphic Testing for LLM-based Dialogue Systems

Improve Large Language Model Systems with User Logs

LLM Compression by Block Removal with Constrained Binary Optimization

InstructTime++: Time Series Classification with Multimodal Language Modeling via Implicit Feature Enhancement

TurnGuide: Enhancing Meaningful Full Duplex Spoken Interactions via Dynamic Turn-Level Text-Speech Interleaving

RippleBench: Capturing Ripple Effects Using Existing Knowledge Repositories

The Personalization Trap: How User Memory Alters Emotional Reasoning in LLMs

Self-Evolving Multi-Agent Systems via Textual Backpropagation

Decomposing Prediction Mechanisms for In-Context Recall

HeRo-Q: A General Framework for Stable Low Bit Quantization via Hessian Conditioning

The Long Delay to Arithmetic Generalization: When Learned Representations Outrun Behavior

Scaling up FHE-based Privacy-Preserving ML: Higher Throughput, Longer Inputs for LLama-3-8B

An In-depth Study of LLM Contributions to the Bin Packing Problem

Phonikud: Overcoming Phonetic Underspecification for Hebrew Text-To-Speech

Practical Anonymous Two-Party Gradient Boosting Decision Tree

Using Large Language Models to Analyze Engagement in Computational Thinking via Computational Physics Essays

Generative models for decision-making under distributional shift

Anti-causal domain generalization: Leveraging unlabeled data