Large Language Models / LLM

2606.18989 2026-06-18 cs.CL cs.AI New submission 75%

G-IdiomAlign: A Gloss-Pivoted Benchmark for Cross-Lingual Idiom Alignment

Fengying Ye, Yanming Sun, Runzhe Zhan, Zheqi Zhang, Lidia S. Chao, Derek F. Wong

Affiliations: NLP2CT Lab, Department of Computer and Information Science, University of Macau; Faculty of Arts and Humanities, University of Macau

Topic match (domain LLMs): Builds a cross-lingual idiom alignment benchmark to evaluate LLM translation ability.

AI summary: Proposes the G-IdiomAlign benchmark, which anchors each idiom with an English Wiktionary gloss, builds a high-confidence reference alignment set, and defines a multiple-choice equivalence test and a gloss-contrastive generation protocol, revealing a literal-translation bias in LLM idiom translation.

Comments: Accepted to ACL 2026

Abstract

Idioms are difficult to transfer across languages due to their non-compositionality and weak surface-form grounding, making literal mappings unreliable. We present G-IdiomAlign, a gloss-pivoted benchmark where each idiom is anchored by an English gloss from Wiktionary. We further construct a high-confidence reference alignment set for reproducible evaluation. G-IdiomAlign supports two protocols: (1) a controlled Multiple-Choice Idiom Equivalence with typed distractors for error attribution; and (2) a Gloss-Contrastive Generation contrasting No-gloss and With-gloss inputs to isolate the effect of an explicit semantic pivot. Across diverse LLMs, a bias to literal translation is a dominant failure mode, especially when the target is a low-resource language. Glosses consistently improve Gloss-Contrastive Generation under an embedding-based semantic proxy, but performance remains modest, indicating substantial headroom in the open output space. Subsequent analysis on Qwen3-8B further suggests that cross-condition differences are concentrated more in attention heads than in layers, while better With-gloss generations coincide with stronger gloss anchoring.


2606.18986 2026-06-18 cs.CL cs.AI New submission 75%

Beyond Tokenization: Direct Timestep Embedding and Contrastive Alignment for Time-Series Question Answering

Yafeng Wu, Huu Hiep Nguyen, Thin Nguyen, Hung Le

Affiliations: Deakin University

Topic match (domain LLMs): Proposes a time-series question answering framework that embeds timesteps directly to avoid the tokenization bottleneck.

AI summary: Proposes CADE, which embeds each timestep directly through a point-wise linear encoder to avoid the tokenization bottleneck and uses a one-directional supervised contrastive loss to align time series with text anchors, improving performance on six TSQA tasks on the Time-MQA benchmark.

Abstract

Recent advances in large language models (LLMs) have given rise to time-series question answering (TSQA), which formulates time-series analysis as natural-language question answering. However, directly feeding raw numerical series into LLMs suffers from a tokenization bottleneck: Byte Pair Encoding fragments continuous values into unstable tokens whose embeddings lack meaningful metric structure, resulting in the loss of magnitude, scale, and trend information. Prior methods use patch-based encoders that split the series into fixed windows, locking in one granularity that breaks patterns and hides exact timesteps, through a separate module that rarely transfers across datasets with different lengths or sampling rates. To address this challenge, we propose CADE (Contrastive Alignment with Direct Embedding), a novel framework for TSQA built upon two key components: direct timestep embedding and semantic alignment. The proposed framework maps each timestep directly into the LLM embedding space through a point-wise linear encoder and MLP projector, preserving exact index-level access while eliminating the need for patching and padding. To further bridge the semantic gap between time-series and language representations, we introduce a novel one-directional supervised contrastive loss that aligns time-series embeddings with frozen class-name text anchors. Experimental results on the public Time-MQA benchmark demonstrate that our framework consistently improves performance across six TSQA tasks, outperforming both open-source and proprietary LLM baselines.
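As a rough illustration of the two components named in the abstract, the sketch below embeds each timestep with a point-wise linear map and scores the pooled series embedding against frozen class anchors in the series-to-anchor direction only. The embedding size, mean pooling, temperature, anchor values, and the exact loss form are assumptions made for illustration, and the MLP projector is omitted; none of these details come from the paper.

```python
import math

def embed_timesteps(series, weight, bias):
    """Point-wise linear encoder: each scalar x_t maps to weight * x_t + bias."""
    return [[w * x + b for w, b in zip(weight, bias)] for x in series]

def mean_pool(vectors):
    """Average per-timestep embeddings into one series embedding (assumed pooling)."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def one_directional_loss(z, anchors, label, temperature=0.1):
    """Cross-entropy of the series embedding over frozen class anchors.

    Only the series-to-anchor direction is scored; the anchors are treated
    as constants and would receive no gradient in a real training loop.
    """
    logits = [cosine(z, a) / temperature for a in anchors]
    top = max(logits)
    log_norm = top + math.log(sum(math.exp(l - top) for l in logits))
    return log_norm - logits[label]

series = [0.1, 0.4, 0.9, 1.6]
weight, bias = [1.0, -0.5, 0.25], [0.0, 0.1, 0.2]
anchors = [[1.0, -0.5, 0.5], [-1.0, 0.5, 0.5]]  # hypothetical frozen text anchors

tokens = embed_timesteps(series, weight, bias)  # one embedding per timestep
z = mean_pool(tokens)
loss_match = one_directional_loss(z, anchors, label=0)
loss_mismatch = one_directional_loss(z, anchors, label=1)
```

Because every timestep keeps its own embedding, `tokens[t]` still corresponds to index `t` of the input, which is the index-level access the abstract contrasts with patch-based encoders.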


2606.18803 2026-06-18 cs.AI cs.CY New submission 75%

ProfiLLM: Utility-Aligned Agentic User Profiling for Industrial Ride-Hailing Dispatch

Tengfei Lyu, Zirui Yuan, Xu Liu, Kai Wan, Zihao Lu, Li Ma, Hao Liu

Affiliations: Didichuxing Co. Ltd

Topic match (domain LLMs): Applies LLMs to industrial dispatch, which falls under domain LLMs.

AI summary: Proposes ProfiLLM, an agentic LLM data pipeline that combines tool-augmented global knowledge mining with utility-aligned profile exploration to profile users from large-scale behavioral logs in industrial ride-hailing dispatch, reporting up to +6.14% relative AUC in outcome prediction and up to +4.35% GMV in dispatching simulation on DiDi's production dispatcher.

Abstract

Bringing Large Language Models (LLMs) into industrial ride-hailing dispatch as semantic feature extractors over platform-scale behavioral logs is a compelling but under-explored data systems problem. Production matching pipelines remain dominated by structured numerical features, yet decisive behavioral signals (e.g., a driver's habitual aversion to certain regions) are inherently contextual and naturally expressible as LLM-generated user profiles. However, scaling such profiling to a live, millisecond-latency dispatcher faces three intertwined constraints rarely addressed together: on a platform with millions of daily orders, logs exceed any LLM's context window by orders of magnitude; most users are long-tail, with too few interactions for per-user profiling; and surface-fluent profiles do not necessarily improve downstream prediction utility. We present ProfiLLM, an agentic LLM data pipeline that operationalizes utility-aligned user profiling for production matching systems through two modules. (1) Tool-Augmented Global Knowledge Mining equips an LLM agent with 27 analytical tools to mine platform-scale data, producing reusable global knowledge, adaptive user clustering rules, and region-level supply-demand priors. (2) Utility-Aligned Profile Exploration generates multiple candidate profiles per cluster, evaluates them via a lightweight downstream utility proxy, iteratively refines the best candidates and constructs preference pairs for DPO fine-tuning. Deployed on DiDi's production dispatcher, ProfiLLM achieves up to +6.14% relative AUC improvement in outcome prediction, up to +4.35% GMV gain in dispatching simulation, and consistent improvements in a 14-day online A/B test including +0.47% GMV, +0.33% Completion Rate, and -0.82% Cancel-Before-Accept rate.


2606.18597 2026-06-18 cs.CL New submission 75%

Low-resource Language Discrimination Towards Chinese Dialects with Transfer learning and Data Augmentation

Fan Xu, Yangjie Dan, Keyu Yan, Yong Ma, Mingwen Wang

Affiliations: Jiangxi Normal University

Topic match (domain LLMs): Transfer learning and data augmentation for Chinese dialect discrimination.

AI summary: To address the scarcity of annotated resources for Chinese dialects, proposes the CDDTLDA framework, which combines transfer learning with data augmentation: a source-side ASR model, target-side augmentation and fine-tuning, and a self-attention mechanism that captures shared semantic features. It significantly outperforms state-of-the-art methods on two benchmark corpora.

Comments: Published in ACM TALLIP

Abstract

Chinese dialects discrimination is a challenging natural language processing task due to scarce annotation resource. In this article, we develop a novel Chinese dialects discrimination framework with transfer learning and data augmentation (CDDTLDA) in order to overcome the shortage of resources. To be more specific, we first use a relatively larger Chinese dialects corpus to train a source-side automatic speech recognition (ASR) model. Then, we adopt a simple but effective data augmentation method (i.e., speed, pitch, and noise disturbance) to augment the target-side low-resource Chinese dialects, and fine-tune another target ASR model based on the previous source-side ASR model. Meanwhile, the potential common semantic features between source-side and target-side ASR models can be captured by using self-attention mechanism. Finally, we extract the hidden semantic representation in the target ASR model to conduct Chinese dialects discrimination. Our extensive experimental results demonstrate that our model significantly outperforms state-of-the-art methods on two benchmark Chinese dialects corpora.


2606.19167 2026-06-18 cs.SE New submission 70%

Teaching Software Engineering with LLM and MCP Integration: From Classroom to Industry Practice

Kehui Chen, Jacky Keung, Weining Li, Xiangbing Shao, Yishu Li, Xiaoxue Ma

Topic match (domain LLMs): Uses LLMs to support software engineering teaching, but model innovation is not the core contribution.

AI summary: Integrates LLMs and MCP into a collaborative teaching model for software engineering by embedding LLM- and MCP-driven tools into daily teaching, code assistance, and engineering simulations, bridging the gap between traditional instruction and industrial workflows and improving students' programming, problem-solving, and intelligent-tool skills.

Comments: Accepted by International Symposium on Educational Technology (ISET) 2026

Abstract

The rapid integration of Large Language Models (LLMs) and the Model Context Protocol (MCP) into industrial software engineering has created a pressing need to update software engineering education to align with emerging technologies and evolving industry demands. This study investigates an innovative approach that integrates LLMs and MCP into a collaborative teaching model for software engineering education, aiming to build a practical learning framework closely connected to real-world engineering practices. By embedding LLM and MCP driven tools into daily teaching, code assistance, and engineering simulations, the model effectively bridges the gap between traditional instruction and industrial workflows. This integration enhances students' programming competence, practical problem-solving abilities, and proficiency in using intelligent engineering tools. Furthermore, through partnerships with industry internships, students can apply these technologies in real-world settings, further strengthening the connection between academic preparation and professional practice. Overall, this research offers a practical pathway for reforming and innovating software engineering education in the era of artificial intelligence.


2606.18789 2026-06-18 eess.SY cs.SY New submission 70%

PowerAgentBench-SS: A Benchmark for Agentic AI in Power System Steady-State Studies

Costas Mylonas, Magda Foti, Andrea Pomarico, Matheus Duarte, Qian Zhang, Emmanouel Varvarigos

Topic match (domain LLMs): A benchmark for LLM agents in the power systems domain.

AI summary: Proposes the PowerAgentBench-SS benchmark framework for evaluating whether LLM agents can execute engineering workflows in power system steady-state studies, distinguishing agents through a tool API, a validation budget, and risk-sensitive metrics.

Abstract

Power system benchmarks usually evaluate numerical solvers, prediction models, or sequential controllers. These benchmarks are necessary, but they do not directly test whether a Large Language Model (LLM) agent can execute an engineering workflow: inspect a grid case, select tools, call simulators, screen contingencies, propose admissible mitigations, validate results, and produce an auditable evidence trail. This paper introduces PowerAgentBench-SS, a steady-state benchmark framework for evaluating tool-using agents in power system operation and planning studies. The benchmark exposes public case data, action constraints, a tool API, and a validation budget to an agent, while a hidden evaluator recomputes physical validity and scores the submitted report. We define the agent interface, tool contract, evidence log, and risk-sensitive metrics, including submitted recall, evidence-backed recall, found recall, false-safe penalties, severity regret, residual violation score, action cost, tool-use efficiency, and workflow diagnostics. To make the framework concrete, we instantiate the protocol in a reproducible DC thermal N-2 contingency-search pilot on deterministic IEEE 39-bus operating-point variants, with scripted baselines, an LLM JSON-command adapter, three locally hosted Ollama LLM agents, and one OpenAI API agent. The results show why solver-only or answer-only evaluation is insufficient: agents are distinguished not only by top-contingency discovery, but also by validation-budget use, explicit submission, type coercions, duplicate validations, evidence-backed reporting, and mitigation behavior.


2606.18636 2026-06-18 cs.CL cs.AI New submission 70%

PEC-Home: Interpretation of Progressively Elliptical Commands in Smart Homes

Yingyu Shan, Zeming Liu, Silin Li, Boao Qian, Jiashu Yao, Yuhang Guo, Haifeng Wang

Affiliations: Beijing Institute of Technology; Beihang University; Baidu Inc.

Topic match (domain LLMs): Interpretation of progressively elliptical commands in smart homes.

AI summary: Targets the referential and intention ambiguity that arises when smart-home users issue progressively elliptical commands as shared context accumulates. Introduces PEC-Home, the first simulated home dataset for this task; experiments show that existing LLM assistants struggle to execute elliptical commands accurately.

Comments: Accepted by ACL 2026 Findings

Abstract

Recent advancements in Large Language Models (LLMs) have empowered home assistants with natural language interaction capabilities. However, current assistants overlook the progressive omission that occurs in human dialogue as shared context accumulates, leading to more elliptical expressions for efficient communication. Thus, current assistants still struggle to interpret such elliptical expressions accurately, which limits their effectiveness in real-world applications. In practical smart home scenarios, assistants face two major challenges caused by elliptical commands: (1) referential ambiguity caused by different environmental expectations among multiple users; and (2) intention ambiguity resulting from user preferences that evolve over time or change with the environment. To address these challenges, we introduce PEC-Home, the first simulated home dataset specifically designed for interpreting progressively elliptical commands in smart homes. Extensive experiments on various LLMs, including GPT-4o, show that existing home assistants struggle to execute user-intended operations based solely on elliptical commands. Even when equipped with tools for storing and retrieving user dialogue history, execution accuracy remains below that achieved with complete commands.


2606.18910 2026-06-18 cs.LG cs.CL New submission 75%

REVES: REvision and VErification--Augmented Training for Test-Time Scaling

Yuanxin Liu, Ruida Zhou, Xinyan Zhao, Amr Sharaf, Hongzhou Lin, Arijit Biswas, Mohammad Ghavamzadeh, Zhaoran Wang, Mingyi Hong

Affiliations: Northwestern University; Amazon AGI; Qualcomm AI Research; University of Minnesota

Topic match (post-training): Proposes a two-stage training framework to optimize reasoning.

AI summary: Proposes REVES, which converts intermediate "near-miss" answers into decoupled revision and verification prompts, enabling efficient off-policy data generation and improving multi-step LLM reasoning; it scores 6.5 points above the RL baseline on LiveCodeBench.

Abstract

Test-time scaling via sequential revision has emerged as a powerful paradigm for enhancing Large Language Model (LLM) reasoning. However, standard post-training methods primarily optimize single-shot objectives, creating a fundamental misalignment with multi-step inference dynamics. While recent work treats this as multi-turn reinforcement learning (RL), conventional approaches optimize over the multi-step trajectories directly, failing to further exploit the high-quality mistakes in intermediate steps that the model can learn from by correcting them. We propose a two-stage iterative framework that alternates between online data/prompt augmentation and policy optimization. By converting the intermediate steps ("near-miss" answers) in the successful recovery trajectories into decoupled revision and verification prompts, our approach concentrates training on both effective answer transformation and error identification. This approach enables efficient off-policy data generation and reduces the computational overhead of long-horizon sampling compared to standard multi-turn RL. On LiveCodeBench, using publicly available test cases as feedback, we observe gains of +6.5 points over the RL baseline and +4.0 points over standard multi-turn training. Beyond coding, our approach matches the previously reported SOTA result on circle packing while using the smallest base model (4B) and far fewer rollouts than the much larger evolutionary search systems. Math results under ground-truth verification further confirm improved correction ability. It also generalizes to out-of-distribution constraint-satisfaction puzzles such as n_queens and mini_sudoku, where correctness is defined entirely by problem constraints. Code is available at https://github.com/yxliu02/REVES.git.
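The conversion of recovery trajectories into decoupled prompts can be pictured with a small sketch. The prompt templates, the yes/no verification targets, and the choice of the final correct answer as the revision target are all hypothetical; the abstract does not specify them, and this is not the authors' data pipeline.

```python
def build_prompts(problem, trajectory):
    """Turn one revision trajectory into revision and verification examples.

    trajectory is a list of (answer, is_correct) pairs in the order produced.
    Only trajectories that end in a correct answer (successful recoveries)
    contribute examples.
    """
    if not trajectory or not trajectory[-1][1]:
        return [], []
    final_answer = trajectory[-1][0]
    revision, verification = [], []
    for answer, is_correct in trajectory:
        # Verification example: judge a single candidate in isolation.
        verification.append({
            "prompt": f"Problem: {problem}\nCandidate: {answer}\nIs the candidate correct?",
            "target": "yes" if is_correct else "no",
        })
        if not is_correct:
            # Revision example: a near-miss answer paired with the recovered answer.
            revision.append({
                "prompt": f"Problem: {problem}\nPrevious attempt: {answer}\nRevise the attempt.",
                "target": final_answer,
            })
    return revision, verification

trajectory = [("x = 3", False), ("x = 5", False), ("x = 4", True)]
revision, verification = build_prompts("Solve 2x = 8.", trajectory)
failed_revision, failed_verification = build_prompts("Solve 2x = 8.", [("x = 3", False)])
```

Each example stands alone as a single-turn prompt, which is one way to read the abstract's point about avoiding long-horizon sampling during training.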


2606.18627 2026-06-18 cs.LG New submission 70%

PACT: Preserving Anchored Cores in Task-vectors for Model Merging

Ningyuan Shi, Zhipeng Zhou, Hao Wang, Chunyan Miao, Peilin Zhao

Affiliations: Shanghai Jiao Tong University; Nanyang Technological University; The Hong Kong University of Science and Technology (Guangzhou)

Topic match (post-training): A model merging method that preserves core dimensions held in the pre-trained weights.

AI summary: Proposes PACT, which identifies Load-Bearing Wall dimensions in the pre-trained weights and preserves the anchored task-specific cores within task vectors, addressing the task conflicts and performance degradation of the task-vector paradigm and improving model merging.

Comments: 33 pages, 14 figures

Abstract

Model merging has emerged as a training-free alternative to multi-task learning, aiming to combine multiple task-specific fine-tuned models into a single multi-task model. Most existing model merging approaches follow the Task Arithmetic paradigm, which decomposes fine-tuned weights into pre-trained parameters and task vectors, and performs merging exclusively in the task-vector space. The effectiveness of this paradigm implicitly relies on the assumption that task-specific knowledge is encoded solely within task vectors. We argue that this assumption generally does not hold due to the intrinsic task preferences of pre-trained models. Specifically, we identify Load-Bearing Wall (LBW) dimensions, namely some task-critical knowledge that remains embedded in the pre-trained weights rather than being fully transferred into task vectors. We characterize LBW dimensions from both scalar-weight and subspace perspectives, thereby covering the major paradigms of existing model merging methods. Our analysis reveals that, by ignoring LBW dimensions, task-vector-based approaches fail to fully resolve task conflicts and may inadvertently damage task-specific knowledge encoded in the pre-trained model, leading to degradation. To address this issue, we propose PACT, which preserves the anchored task-specific cores (i.e., LBW dimensions) within task vectors by aligning their orthogonal complements with the subspace of the pre-trained weights. These aligned subspace components are then removed from the task vectors before applying existing model merging algorithms. Furthermore, we develop an efficient variant based on randomized SVD to improve scalability. PACT can be seamlessly integrated with existing methods. Extensive experiments across multiple benchmarks demonstrate that PACT consistently enhances mainstream model merging approaches and establishes new state-of-the-art performance.
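For orientation, the Task Arithmetic paradigm that the abstract takes as its starting point can be sketched on flat weight lists. This shows only the baseline decomposition (fine-tuned weights = pre-trained weights + task vector) and a scaled sum of task vectors; it does not implement PACT's subspace alignment, and the toy weights and the `scale` value are made up for illustration.

```python
def task_vector(finetuned, pretrained):
    """Task vector: element-wise difference between fine-tuned and pre-trained weights."""
    return [f - p for f, p in zip(finetuned, pretrained)]

def merge(pretrained, finetuned_models, scale=1.0):
    """Add the scaled sum of all task vectors back onto the pre-trained weights."""
    vectors = [task_vector(model, pretrained) for model in finetuned_models]
    summed = [sum(values) for values in zip(*vectors)]
    return [p + scale * s for p, s in zip(pretrained, summed)]

pretrained = [1.0, 2.0, 3.0]
model_a = [1.5, 2.0, 3.0]  # toy model fine-tuned on task A
model_b = [1.0, 2.0, 2.0]  # toy model fine-tuned on task B
merged = merge(pretrained, [model_a, model_b], scale=0.5)
```

Note that the merge touches the pre-trained weights only through addition, which is the assumption the abstract questions: anything task-critical that lives in `pretrained` itself is invisible to operations on the task vectors.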


2606.18850 2026-06-18 cs.CL cs.IR New submission 75%

ScholarSum: Student-Teacher Abstractive Summarization via Knowledge Graph Reasoning and Reflective Refinement

Bohou Zhang, Xiaoyu Tao, Mingyue Cheng, Huijie Liu, Qi Liu

Affiliations: State Key Laboratory of Cognitive Intelligence

Topic match (instruction tuning): Scientific literature summarization with a student-teacher framework and a knowledge graph.

AI summary: Proposes ScholarSum, which builds a hierarchical knowledge graph to guide a student in generating an initial draft and uses a teacher-like reviewer to iteratively check and correct it, yielding scientific summaries that are both fluent and factually consistent.

Abstract

Abstractive summarization plays a crucial role in enabling efficient understanding of scientific literature, yet it inherently demands both linguistic fluency and factual faithfulness. Existing approaches often fail to reconcile these two requirements. Extractive methods rely on rigid sentence splicing that disrupts macro-level logical coherence, while large language model (LLM)-based generative approaches, despite mastering linguistic fluency, exhibit limited factual consistency. In this work, we propose ScholarSum, a hierarchical reflective graph-based framework that emulates a student-teacher writing process for fluent and faithful scientific summarization. ScholarSum first organizes the document into a hierarchical knowledge graph by segmenting it into semantically coherent units, whose multi-layered community structure captures global logic and macro-level themes. Guided by this global structure, the student generates an initial draft, which is subsequently refined through fine-grained evidence retrieval. To ensure factual consistency, a teacher-like reviewer then iteratively examines the draft, identifies unsupported content, and prompts targeted re-retrieval and rewriting until the summary meets rigorous quality standards. Extensive experiments demonstrate that ScholarSum significantly outperforms previous baselines in terms of both completeness and faithfulness. Our code is available at https://github.com/Xiaoyu-Tao/ScholarSum.


2606.18902 2026-06-18 cs.CL New submission 70%

SAGE: Stochastic Prompt Optimization via Agent-Guided Exploration

Ziyi Zhu, Luka Smyth, Saki Shinoda, Jinghong Chen

Affiliations: Slingshot AI; Department of Engineering, University of Cambridge

Topic match (instruction tuning): Automatic prompt optimization is an LLM application.

AI summary: Proposes SPO, a stochastic prompt optimization framework, in which the SAGE method performs black-box search through a multi-agent pipeline with diagnostic code execution. Across three benchmarks no single strategy dominates, with effectiveness depending on error type, and continuous optimization on a mental-health chatbot yields a statistically robust gain in next-day retention.

Abstract

Context engineering has emerged as a primary lever for improving AI systems without parameter updates. Recent work showing that textual gradients do not function as real gradients motivates treating automatic prompt optimization (APO) as black-box search. We introduce SPO (Stochastic Prompt Optimization), a framework for stochastic search over prompt space, and compare three strategies of increasing sophistication: error-informed random search, a genetic algorithm with evolutionary operators, and SAGE (SPO via Agent-Guided Exploration), a multi-agent pipeline with diagnostic code execution. Across three benchmarks, no single strategy dominates; effectiveness depends on the interaction of landscape structure with error type. We further deploy SAGE on a mental-health chatbot under a continuous optimization paradigm, where it compounds eight cycles of individually-noisy A/B tests into a statistically robust gain in next-day retention. We argue that coupling qualitative diagnosis with quantitative validation is what makes agentic optimization effective for open-ended task-oriented dialogue.


2606.18829 2026-06-18 cs.LG cs.CL New submission 75%

GateMem: Benchmarking Memory Governance in Multi-Principal Shared-Memory Agents

Zhe Ren, Yibo Yang, Yimeng Chen, Zijun Zhao, Benshuo Fu, Zhihao Shu, Bingjie Zhang, Yangyang Xu, Dandan Guo, Shuicheng Yan

Affiliations: School of Artificial Intelligence, Jilin University; Shanghai Jiao Tong University; King Abdullah University of Science and Technology (KAUST); Tsinghua University; National University of Singapore

Topic match (other LLM): Evaluates memory governance in multi-principal shared-memory agents, involving LLM agents.

AI summary: Proposes the GateMem benchmark, which evaluates multi-principal shared-memory agents on utility, access control, and forgetting, and finds that no existing method satisfies all three at once.

Comments: 24 pages, 8 figures. Code and dataset are available at https://github.com/rzhub/GateMem and https://huggingface.co/datasets/Ray368/GateMem

Abstract

Memory benchmarks for LLM agents largely assume single-user settings, leaving shared assistants for hospitals, workplaces, campuses, and households understudied. In these deployments, multiple principals write to a common memory pool and query it under different roles, scopes, and relationships, so memory quality requires governance as well as recall. We introduce GateMem, a benchmark for multi-principal shared-memory agents. GateMem jointly evaluates utility for legitimate long-horizon requests with state updates, access control across contextual authorization boundaries, and agent-facing active forgetting after explicit deletion requests. It spans medical, office, education, and household domains, with long-form multi-party episodes, incremental memory injection, hidden checkpoints, structured judging, and leak-target annotations. Across diverse baselines and backbone models, no method simultaneously achieves strong utility, robust access control, and reliable forgetting. Long-context prompting often yields the best governance score at high token cost, while retrieval-based and external-memory methods reduce cost yet still leak unauthorized or deleted information. These results show current memory agents remain far from reliable shared institutional deployment.


2606.18389 2026-06-18 cs.CL New submission 75%

Want Better Synthetic Data? Steer It: Activation Steering for Low-Resource Language Generation

Jan Cegin, Daniil Gurgurov, Yusser Al Ghussin, Simon Ostermann

Affiliations: Kempelen Institute of Intelligent Technologies; German Research Institute for Artificial Intelligence (DFKI)

Topic match (other LLM): Activation steering for low-resource language synthetic data generation.

AI summary: Proposes activation steering as an alternative for low-resource synthetic data generation, covering Language Steering and Quality Steering; experiments show that steering on early layers improves the diversity of generated data and often improves downstream model performance.

Comments: 25 pages

Abstract

Large language models (LLMs) have become an effective tool for synthetic data generation, including for low-resource languages, where generated data can improve downstream task performance. Current best-performing approaches typically rely on few-shot prompting with target-language examples, which increases inference costs and may reduce diversity through lexical anchoring. In this work, we investigate activation steering as an alternative for low-resource synthetic data generation. We study two steering strategies: Language Steering, which targets the linguistic identity of a language, and Quality Steering, which captures well-formedness by contrasting human-written and backtranslated text representations. We evaluate these methods across four open-source LLMs, multiple layers, and 11 typologically diverse languages by generating sentiment and topic classification data and finetuning smaller classifiers. Steering is applied in both zero-shot and few-shot prompting settings and compared against non-steered counterparts. Our results show that steering on early layers consistently improves the diversity of generated data while often yielding stronger downstream model performance, particularly for low-resource languages.
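A common way to build a steering vector is the difference of mean hidden states between two sets of texts, added back to a layer's activations at generation time. The sketch below shows that generic recipe on toy two-dimensional states; whether the paper computes its Language Steering and Quality Steering vectors exactly this way is an assumption, and the values and `alpha` are illustrative only.

```python
def mean_vector(rows):
    """Column-wise mean of a list of equal-length hidden-state vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def steering_vector(target_states, contrast_states):
    """Difference of means: direction from the contrast set toward the target set."""
    target_mean = mean_vector(target_states)
    contrast_mean = mean_vector(contrast_states)
    return [t - c for t, c in zip(target_mean, contrast_mean)]

def steer(hidden, vector, alpha=1.0):
    """Shift one hidden state along the steering direction with strength alpha."""
    return [h + alpha * v for h, v in zip(hidden, vector)]

# Toy hidden states, e.g. human-written (target) vs. backtranslated (contrast) text.
target_states = [[1.0, 0.0], [3.0, 2.0]]
contrast_states = [[0.0, 0.0], [2.0, 0.0]]

vector = steering_vector(target_states, contrast_states)
steered = steer([0.5, 0.5], vector, alpha=2.0)
```

In a real model the same shift would be applied to the residual stream of a chosen layer on every forward pass, which needs no extra prompt tokens, in contrast to the few-shot prompting cost the abstract mentions.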


2606.18304 2026-06-18 cs.LG cs.AI New submission 75%

Attribution-Guided and Coverage-Maximized Pruning for Structural MoE Compression

Yifu Ding, Jiacheng Wang, Ge Yang, Yongcheng Jing, Jinyang Guo, Xianglong Liu, Dacheng Tao

Affiliations: School of Computer Science and Engineering, Beihang University; School of Artificial Intelligence, Beihang University; Nanyang Technological University

Topic match (other LLM): Structural pruning for MoE models, which falls under LLM compression and deployment.

AI summary: Expert-level pruning of MoE models is too coarse and misses fine-grained redundancy. Proposes an attribution-guided, coverage-maximized structural pruning framework that recasts prune-ratio allocation as a channel-score coverage maximization problem, preserving accuracy under 50% pruning combined with 4-bit quantization and reducing memory footprint by 5.27× on Qwen3-30B-A3B.

Comments: 9 pages, 5 figures. Submitted to ICML 2026

Abstract

Mixture-of-Experts (MoE) models scale compute efficiently, yet remain expensive to deploy due to their substantial memory footprint and inference overhead. Prior compression methods mainly operate at the expert level, either removing entire experts or ranking experts by coarse-grained importance scores. However, such expert-wise decisions are often too coarse to capture fine-grained redundancy, leading to misallocated pruning budgets and limited compression. To address this problem, we observe that information within MoE experts is highly concentrated in a small subset of channels, leaving substantial redundancy even in experts deemed important. Based on this observation, we propose a structural pruning framework tailored for MoE models. Our method reformulates prune-ratio allocation as a channel-score coverage maximization problem and solves it efficiently using an attribution-based approximation. Experiments on DeepSeek and Qwen MoE models show that our method preserves model accuracy under 50% or 25% structured pruning when combined with 4-bit quantization. On Qwen3-30B-A3B, our approach reduces memory footprint by 5.27× and consistently outperforms state-of-the-art baselines across diverse benchmarks.

2606.18105 2026-06-18 cs.NI cs.LG 新提交 75%

OmniPlan: An Adaptive Framework for Timely and Near-Optimal Network Planning Optimization

OmniPlan：一种用于及时且近乎最优的网络规划优化的自适应框架

Longlong Zhu, Jiashuo Yu, Zedi Chen, Yuhan Wu, Zhifan Jiang, Yuchen Xian, Yimeng Liu, Jiajie Su, Shaopeng Zhou, Xingyuan Li, Hongyan Liu, Xuan Liu, Dong Zhang, Chunming Wu, Xiang Chen

发表机构 * Zhejiang University（浙江大学）； Fuzhou University（福州大学）； Yangzhou University（扬州大学）； The State Key Laboratory of Blockchain and Data Security（区块链与数据安全国家重点实验室）； College of Computer Science and Technology（计算机科学与技术学院）

专题命中其他LLM ：LLM用于解析用户意图进行网络规划

AI总结提出OmniPlan自适应框架，利用大语言模型解析用户意图，通过混合专家架构动态选择MIP求解器、启发式算法或深度强化学习模型，实现网络规划优化的及时性与近乎最优性，在分布式机器学习推理卸载任务中延迟降低97.8%，资源消耗降低11.5%。

Comments Accepted by ACM KDD 2026

详情

AI中文摘要

网络规划优化是跨多个领域（包括交通系统、通信网络和电网）的基本问题。它需要在复杂约束下同时优化多个相互竞争的目标。现有的网络规划优化框架依赖混合整数规划（MIP）求解器、启发式算法和深度强化学习（DRL）模型来计算规划决策。然而，它们缺乏对多样化和动态用户意图的有效适应性，从而导致执行时间与最优性之间的权衡。在本文中，我们提出OmniPlan，一种自适应框架，在网络规划优化中同时实现及时性和近乎最优性。为了实现现有解决方案所缺乏的适应性，OmniPlan采用基于大语言模型（LLM）的解释器，将异构的自然语言意图转换为统一且可量化的用户偏好向量。然后，它采用混合专家架构，集成MIP求解器、启发式算法和DRL模型作为专门专家，OmniPlan通过动态选择及时且近乎最优的专家来适应多样化的意图。最后，它包含一个基于DRL的专家配置模块，该模块微调优化目标权重，使规划决策与用户特定偏好对齐。我们使用代表性的真实工作负载（即分布式机器学习（ML））评估OmniPlan，其中我们利用OmniPlan将广泛的ML推理任务（例如决策树、SVM、朴素贝叶斯、XGBoost和随机森林）卸载到硬件设备网络。我们在真实测试平台上的实验表明，OmniPlan为真实ML推理任务实现了近乎最优且低执行时间的卸载，延迟降低高达97.8%，网络设备资源消耗降低高达11.5%。

英文摘要

Network planning optimization is a fundamental problem across diverse domains, including transportation systems, communication networks, and power grids. It requires simultaneous optimization of multiple competing objectives under complex constraints. Existing network planning optimization frameworks rely on mixed integer programming (MIP) solvers, heuristics, and deep reinforcement learning (DRL) models to compute planning decisions. However, they lack effective adaptability to diverse and dynamic user intents, thus leading to the trade-off between execution time and optimality. In this paper, we propose OmniPlan, an adaptive framework that achieves both timeliness and near-optimality in network planning optimization. To achieve the adaptability lacking in existing solutions, OmniPlan employs a large language model (LLM)-based interpreter to convert heterogeneous natural-language intents into a unified and quantifiable user-preference vector. Then it employs a mixture-of-experts architecture that integrates MIP solvers, heuristics, and DRL models as specialized experts, where OmniPlan adapts to diverse intents by dynamically selecting timely and near-optimal experts. Finally, it incorporates a DRL-based expert configuration module that fine-tunes optimization objective weights to align planning decisions with user-specific preferences. We evaluate OmniPlan with a representative real-world workload, i.e., distributed machine learning (ML), where we leverage OmniPlan to offload a wide spectrum of ML inference tasks, e.g., decision trees, SVM, naive Bayes, XGBoost, and random forests, onto a network of hardware devices. Our experiments on a real-world testbed indicate that OmniPlan achieves near-optimal and low-execution-time offloading for real-world ML inference tasks, reducing latency by up to 97.8% and network device resource consumption by up to 11.5%.

2606.17276 2026-06-18 cs.IR cs.LG 新提交 75%

On the Memorization Behavior of LLMs in Generative Recommendation: Observations, Implications, and Training Strategies

LLM在生成式推荐中的记忆行为：观察、启示与训练策略

Sunwoo Kim, Sunkyung Lee, Clark Mingxuan Ju, Donald Loveland, Bhuvesh Kumar, Kijung Shin, Neil Shah, Liam Collins

发表机构 * KAIST（韩国科学技术院）； Sungkyunkwan University（成均馆大学）； Snap Inc.（Snap公司）

专题命中其他LLM ：研究LLM在生成式推荐中的记忆行为

AI总结研究LLM在生成式推荐中的记忆倾向，发现其过度依赖一跳记忆，提出IIRG训练策略以学习多跳协同与语义关系，显著提升对非一跳记忆用户的推荐效果。

详情

AI中文摘要

生成式推荐（GR）已成为推荐系统的一个有前景的方向。最近，大型语言模型（LLM）越来越多地被用于GR，因为其丰富的预训练知识有望帮助它们泛化到传统以记忆为导向的基线所能捕捉的常见用户行为模式之外。然而，现有的基于LLM的GR工作很大程度上忽略了LLM众所周知的记忆倾向，如果这种倾向存在于为GR微调的LLM中，将限制它们对预训练知识的利用。在这项工作中，我们通过检查一跳记忆（即模型推荐训练数据中项目的直接后继项目）来研究这一担忧。我们表明，LLM比非LLM的GR模型更频繁地这样做——事实上，它们相对于GR基线的大部分增益实际上来自那些目标项目可以通过一跳记忆预测的用户。我们直觉认为，提高剩余用户的性能需要LLM学习更丰富的项目-项目关系，超越一跳转换。为此，我们提出了IIRG，一种新颖的训练策略，教导LLM捕获：（1）从用户序列中跨多跳的项目共现导出的协同关系，以及（2）具有相似主题的项目之间的语义关系，这两者都可以作为有用的推荐信号。我们表明，IIRG显著优于仅使用标准下一项目预测训练的LLM，尤其是对于那些测试项目在训练时的一跳转换中未覆盖的用户，增益尤为显著。

英文摘要

Generative recommendation (GR) has emerged as a promising direction for recommender systems. Recently, large language models (LLMs) have been increasingly adopted for GR, as their rich pretrained knowledge is expected to help them generalize beyond common user behavior patterns that traditional memorization-oriented baselines can capture. However, existing LLM-based GR works largely ignore LLMs' well-known tendency to memorize, which, if present in LLMs fine-tuned for GR, would restrict their utilization of pretrained knowledge. In this work, we investigate this concern by examining one-hop memorization, where a model recommends items that are direct successors of items in the training data. We show that LLMs do this more than non-LLM-based GR models; in fact, the vast majority of their gains over GR baselines are actually on users whose target items can be predicted through one-hop memorization. We intuit that improving performance on the remaining users requires LLMs to learn richer item-item relations beyond one-hop transitions. To achieve this, we propose IIRG, a novel training strategy that teaches LLMs to capture: (1) collaborative relations derived from item co-occurrences across multiple hops in user sequences, and (2) semantic relations among items with similar themes, both of which can serve as useful recommendation signals. We show that IIRG significantly improves over LLMs trained solely with standard next-item prediction, with especially large gains for users whose test items are not covered by train-time one-hop transitions.
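The notion of one-hop memorization described above can be illustrated with a minimal sketch. It assumes, as a simplification not stated in the abstract, that a target counts as one-hop predictable when it directly follows the user's last history item somewhere in the training sequences; the function names are hypothetical.

```python
from collections import defaultdict

def one_hop_successors(train_sequences):
    """Map each item to the set of items that directly follow it in training."""
    successors = defaultdict(set)
    for seq in train_sequences:
        for prev_item, next_item in zip(seq, seq[1:]):
            successors[prev_item].add(next_item)
    return successors

def is_one_hop_predictable(history, target, successors):
    """Assumption: check only the last history item's direct successors."""
    return bool(history) and target in successors.get(history[-1], set())

train = [["a", "b", "c"], ["b", "d"]]
succ = one_hop_successors(train)
covered = is_one_hop_predictable(["x", "b"], "d", succ)      # b -> d was seen
uncovered = is_one_hop_predictable(["x", "a"], "d", succ)    # a -> d was not
```

Splitting test users by this flag is one way to separate gains that come from memorized transitions from gains on the remaining users.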

2601.21626 2026-06-18 cs.LG cs.AI 版本更新 75%

HeRo-Q: A General Framework for Stable Low Bit Quantization via Hessian Conditioning

HeRo-Q: 通过Hessian条件化实现稳定低比特量化的通用框架

Jinhao Zhang, Yunquan Zhang, Zicheng Yan, Boyang Zhang, Jun Sun, Daning Cheng

发表机构 * Beijing University of Posts and Telecommunications（北京邮电大学）； Institute of Computing Technology, Chinese Academy of Sciences（中国科学院计算技术研究所）； University of Science and Technology of China（中国科学技术大学）； Zhejiang Lab（浙江实验室）； Peng Cheng Laboratory（鹏城实验室）

专题命中其他LLM ：提出HeRo-Q算法用于LLM低比特量化，属于LLM。

AI总结针对后训练量化中“低误差、高损失”的矛盾，提出HeRo-Q算法，通过轻量可学习的旋转压缩矩阵重塑损失景观，降低最大Hessian特征值，增强对量化噪声的鲁棒性，在Llama和Qwen模型上优于现有方法。

详情

AI中文摘要

后训练量化（PTQ）是一种主流的模型压缩技术，但由于其仅专注于最小化量化误差，常常导致矛盾的“低误差、高损失”现象。根本原因在于LLM损失景观的Hessian矩阵：少数高曲率方向对扰动极其敏感。为了解决这个问题，我们提出了Hessian鲁棒量化（HeRo-Q）算法，该算法在量化前对权重空间应用一个轻量级、可学习的旋转压缩矩阵。这个联合框架通过降低最大的Hessian特征值来重塑损失景观，从而显著增强对量化噪声的鲁棒性。HeRo-Q不需要修改架构，计算开销可忽略不计，并且可以无缝集成到现有的PTQ流程中。在Llama和Qwen模型上的实验表明，HeRo-Q持续优于包括GPTQ、AWQ和SpinQuant在内的最先进方法：不仅在标准W4A8设置下取得更优性能，而且在极具挑战性的W3A16超低比特场景中表现出色，将Llama3 8B在GSM8K上的准确率提升至70.15%，并有效避免了激进量化中常见的逻辑崩溃。

英文摘要

Post-Training Quantization (PTQ), a mainstream model compression technique, often leads to the paradoxical 'low error, high loss' phenomenon because it focuses solely on minimizing quantization error. The root cause lies in the Hessian matrix of the LLM loss landscape: a few high-curvature directions are extremely sensitive to perturbations. To address this, we propose the Hessian Robust Quantization (HeRo-Q) algorithm, which applies a lightweight, learnable rotation-compression matrix to the weight space prior to quantization. This joint framework reshapes the loss landscape by reducing the largest Hessian eigenvalue, thereby significantly enhancing robustness to quantization noise. HeRo-Q requires no architectural modifications, incurs negligible computational overhead, and integrates seamlessly into existing PTQ pipelines. Experiments on Llama and Qwen models show that HeRo-Q consistently outperforms state-of-the-art methods including GPTQ, AWQ, and SpinQuant, not only achieving superior performance under standard W4A8 settings but also excelling in the highly challenging W3A16 ultra-low-bit regime, where it boosts GSM8K accuracy on Llama3 8B to 70.15% and effectively avoids the logical collapse commonly seen in aggressive quantization.

2606.19317 2026-06-18 cs.LG cs.AI 新提交 70%

Explaining Attention with Program Synthesis

用程序合成解释注意力机制

Amiri Hayes, Belinda Li, Jacob Andreas

发表机构 * NJIT（新泽西理工学院）； MIT（麻省理工学院）； MIT CSAIL（麻省理工学院计算机科学与人工智能实验室）

专题命中其他LLM ：用程序合成解释注意力头

AI总结提出用可执行程序近似深度网络组件行为的方法，针对Transformer注意力头，通过生成Python程序再现注意力模式，实现可解释性。

详情

AI中文摘要

可解释深度学习研究的一个长期目标是，用人类可理解的符号描述取代不透明的神经计算。本文提出了一种用可执行程序近似深度网络组件行为的方法。我们专注于Transformer语言模型中的注意力头。对于给定的注意力头，我们首先在一组随机选择的训练样本上计算其关联的注意力矩阵。接着，我们向预训练语言模型提供这些矩阵的摘要，并指示它生成一组Python程序，这些程序仅根据输入句子中的文本即可再现相关的注意力模式。最后，我们根据最终程序集在保留输入上预测行为的效果对程序进行重新排序。我们证明，少于1000个这样的生成程序即可再现GPT-2、TinyLlama-1.1B和Llama-3B中注意力头的注意力模式，在TinyStories上平均交并比相似度超过75%。此外，最佳匹配程序可以替代神经注意力头而不会显著影响模型行为：在三个模型中用程序替代25%的注意力头仅导致平均困惑度增加16%，同时在各种下游问答基准上保持性能。这项工作为使用人类可读、可执行的代码逆向工程Transformer模型中的注意力头提供了一个可扩展的流程，推动了神经模型向符号透明性的发展。

英文摘要

A longstanding goal of research on interpretable deep learning is to replace opaque neural computations with human-meaningful symbolic descriptions. In this paper, we propose an approach for approximating the behavior of components of deep networks with executable programs. We focus on attention heads in transformer language models. For a given head, we first compute its associated attention matrices on a collection of randomly selected training examples. Next, we prompt a pre-trained language model with a summary of these matrices, and instruct it to generate a set of Python programs that can reproduce the associated attention patterns given only text from the input sentence. Finally, we re-rank programs according to how well our final set of programs predict behavior on held-out inputs. We demonstrate that a set of fewer than 1,000 such generated programs can reproduce the attention patterns of heads in GPT-2, TinyLlama-1.1B, and Llama-3B, achieving an average Intersection-over-Union similarity above 75% on TinyStories. Moreover, the best-fit programs can replace neural attention heads without substantially affecting model behavior: replacing 25% of attention heads with programmatic surrogates across the three models incurs only a 16% average perplexity increase, while maintaining performance on a variety of downstream question answering benchmarks. This work contributes a scalable pipeline for reverse-engineering attention heads in transformer models using human-readable, executable code, advancing a path toward symbolic transparency in neural models.
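To make the evaluation idea concrete, here is a minimal sketch of scoring a candidate program against an observed attention pattern with Intersection-over-Union. It assumes attention is binarized into a set of (query position, key position) pairs, which is an illustrative choice and not necessarily the paper's exact protocol; `previous_token_program` is a hypothetical example of a generated program.

```python
def attention_iou(predicted, observed):
    """IoU between two sets of (query_pos, key_pos) attention edges."""
    predicted, observed = set(predicted), set(observed)
    union = predicted | observed
    if not union:
        return 1.0  # two empty patterns agree trivially
    return len(predicted & observed) / len(union)

def previous_token_program(tokens):
    """Hypothetical program: every token attends to the token before it."""
    return {(i, i - 1) for i in range(1, len(tokens))}

tokens = ["The", "cat", "sat"]
observed_edges = {(1, 0), (2, 1), (2, 0)}  # assumed binarized head output
score = attention_iou(previous_token_program(tokens), observed_edges)
```

Here the program recovers two of the three observed edges and predicts no extra ones, so the score is 2/3.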

2606.19264 2026-06-18 cs.LG cs.CL 新提交 70%

Structured Inference with Large Language Gibbs

大语言吉布斯结构化推理

Sanghyeok Choi, Henry Gouk, Esmeralda S. Whitammer

发表机构 * University of Edinburgh, School of Informatics（爱丁堡大学信息学院）

专题命中其他LLM ：利用LLM条件分布进行结构化概率推理

AI总结提出大语言吉布斯方法，利用大语言模型的条件分布作为转移算子进行结构化概率推理，通过迭代重采样变量避免顺序偏差，在合成分布、一致性推理和贝叶斯结构学习中验证有效性。

Comments Code: https://github.com/hyeok9855/large-language-gibbs

详情

AI中文摘要

大型语言模型（LLMs）中编码的知识可以作为描述复杂世界变量的结构化推理的基础，但以概率一致的方式访问这些知识构成了一个困难的推理问题。我们提出了大语言吉布斯，一种结构化概率推理方案，它使用LLM的条件分布作为转移算子。不是通过单次自回归生成来采样结构化对象，而是利用LLM的下一个标记条件分布，在给定其他变量的条件下迭代地重采样单个变量。这种方法避免了顺序依赖偏差，并产生一个反映所有局部条件分布之间折衷的平稳分布。我们将这种方法应用于从合成分布中采样、一致性推理任务和贝叶斯结构学习。结果表明，在通过噪声LLM条件分布可访问的世界先验下，MCMC中使用LLM条件分布是用于结构化概率推理的一次性生成的实际替代方案。

英文摘要

The knowledge encoded in large language models (LLMs) can serve as a substrate for structured reasoning over variables describing a complex world, but accessing this knowledge in a probabilistically coherent manner poses a difficult inference problem. We propose Large Language Gibbs, a scheme for structured probabilistic inference that uses conditional distributions of an LLM as transition operators. Rather than sampling structured objects through single-pass autoregressive generation, we iteratively resample individual variables conditioned on others using an LLM's next-token conditionals. This approach avoids order-dependent biases and produces a stationary distribution that reflects a compromise between all local conditionals. We apply this approach to sampling from synthetic distributions, consistent reasoning tasks, and Bayesian structure learning. The results suggest that the use of LLM conditionals in MCMC is a practical alternative to one-pass generation for structured probabilistic inference under a world prior accessible through noisy LLM conditionals.
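The iterative resampling scheme described above follows the shape of a standard Gibbs sampler. The sketch below is illustrative only: `toy_conditional` is a hypothetical stand-in for querying an LLM's next-token conditional, and the two binary variables are invented for the example.

```python
import random

def gibbs_sample(variables, conditional, init, sweeps, rng):
    """Resample each variable in turn from its conditional given the others."""
    state = dict(init)
    samples = []
    for _ in range(sweeps):
        for name in variables:
            others = {k: v for k, v in state.items() if k != name}
            probs = conditional(name, others)  # dict: value -> probability
            values = list(probs)
            weights = [probs[v] for v in values]
            state[name] = rng.choices(values, weights=weights)[0]
        samples.append(dict(state))
    return samples

def toy_conditional(name, others):
    """Stand-in for an LLM conditional: the two binary variables tend to agree."""
    other_value = next(iter(others.values()))
    return {other_value: 0.9, 1 - other_value: 0.1}

rng = random.Random(0)
samples = gibbs_sample(["rain", "wet"], toy_conditional,
                       {"rain": 0, "wet": 1}, sweeps=2000, rng=rng)
agreement = sum(s["rain"] == s["wet"] for s in samples) / len(samples)
```

Because every variable is resampled conditioned on all the others, no single generation order is privileged, which is the property the abstract contrasts with single-pass autoregressive sampling.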

2606.19218 2026-06-18 cs.CL 新提交 70%

RECOM: A Validity Discrimination Tradeoff in Automatic Metrics for Open Ended Reddit Question Answering

RECOM：开放式 Reddit 问答中自动评估指标的有效性与区分性权衡

Pushwitha Krishnappa, Amit Das, Vinija Jain, Aman Chadha, Tathagata Mukherjee

发表机构 * University of Alabama Huntsville（阿拉巴马大学亨茨维尔分校）； University of North Alabama（北阿拉巴马大学）； Stanford University（斯坦福大学）； Meta AI ； Amazon GenAI（亚马逊生成人工智能）

专题命中其他LLM ：评估LLM生成文本的自动指标，属于LLM应用

AI总结提出 RECOM 数据集，发现自动评估指标在开放式问答中无法同时兼顾有效性和区分性，余弦相似度有效性高但区分性差，BERTScore 区分性受长度影响且有效性弱。

详情

AI中文摘要

自动评估指标是评估 LLM 生成文本的默认方法，但一个指标被默默要求完成两项任务：区分真实内容对齐与表面巧合（有效性），以及区分更好的系统与更差的系统（区分性）。在开放式、观点驱动的问答中，这两者存在矛盾。我们引入了 RECOM（Reddit Evaluation for Correspondence of Models），一个无污染评估数据集，包含 15,000 个 r/AskReddit 问题（2025 年 9 月），每个问题都配有真实的社区回复，这些回复的发布时间晚于所有被评估模型的训练截止日期。我们用每个指标对五个开源 LLM（7-10B）的输出与每条回复进行评分，并为每个指标配上随机错位排列得到的噪声下限，结果发现没有指标能同时做好这两项工作。余弦相似度能很好地区分真实回答与随机回答（Cohen's $d \approx 2$），但无法对五个模型进行排序（$|d| < 0.1$）；BERTScore 精确度看似能对模型排序（原始 $|d|$ 高达 0.63），但一旦控制回复长度，这一数值骤降至 $|d| = 0.09$，且其有效性较弱（$d \approx 0.8$，而余弦相似度约为 2）。由于每个指标对相同的输出进行评分，这种有效性与区分性的权衡是指标的属性，而非模型的属性，我们认为这源于表示设计。三个独立的 LLM 评判员再现了有效性差距，同样只能微弱地区分五个模型。我们建议在两个轴上报告指标，并明确给出随机基线下限。RECOM 公开发布于 https://anonymous.4open.science/r/recom-D4B0

英文摘要

Automatic metrics are the default for evaluating LLM-generated text, yet a metric is quietly asked to do two jobs: tell genuine content alignment from surface coincidence (validity), and tell a better system from a worse one (discriminative power). On open-ended, opinion-driven question answering, the two are in tension. We introduce RECOM (Reddit Evaluation for Correspondence of Models), a contamination-free evaluation dataset of 15,000 r/AskReddit questions (September 2025), each paired with its authentic community replies, which postdate every evaluated model's training cutoff. Scoring five open-source LLMs (7-10B) against every reply, with each metric paired with a random-derangement noise floor, we find that no metric does both jobs well. Cosine similarity separates real from random answers (Cohen's $d \approx 2$) but cannot rank the five models ($|d| < 0.1$); BERTScore precision appears to rank the models (raw $|d|$ up to 0.63), but once response length is controlled this collapses to $|d| = 0.09$ and its validity is weak ($d \approx 0.8$, versus cosine's $\approx 2$). Because every metric scores the same outputs, this validity-discrimination tradeoff is a property of the metrics, not the models, and we argue it stems from representation design. Three independent LLM judges reproduce the validity gap and likewise separate the five models only weakly. We recommend reporting metrics on both axes, with an explicit random-baseline floor. RECOM is publicly available at https://anonymous.4open.science/r/recom-D4B0
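The validity measurement described above (real pairings versus a random-derangement floor, summarized by Cohen's d) can be sketched as follows. This assumes the pooled-standard-deviation form of Cohen's d and a generic `metric(answer, reply)` callable; both are illustrative assumptions, and the function names are hypothetical.

```python
import random
from statistics import mean, stdev

def cohens_d(a, b):
    """Cohen's d using a pooled standard deviation (assumed variant)."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)
    return (mean(a) - mean(b)) / pooled_var ** 0.5

def random_derangement(n, rng):
    """Random permutation with no fixed points, by rejection sampling (n >= 2)."""
    while True:
        perm = list(range(n))
        rng.shuffle(perm)
        if all(i != p for i, p in enumerate(perm)):
            return perm

def validity_effect(metric, answers, replies, rng):
    """Effect size between matched scores and deranged (mismatched) scores."""
    perm = random_derangement(len(answers), rng)
    matched = [metric(a, r) for a, r in zip(answers, replies)]
    floor = [metric(a, replies[j]) for a, j in zip(answers, perm)]
    return cohens_d(matched, floor)
```

The derangement guarantees that no answer is scored against its own question's reply, so the floor reflects surface coincidence only. Note that `cohens_d` divides by zero if both groups have zero variance.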

2606.19172 2026-06-18 cs.AI 新提交 70%

User as Engram: Internalizing Per-User Memory as Local Parametric Edits

用户作为印迹：将每用户记忆内化为局部参数编辑

Bojie Li

发表机构 * Pine AI

专题命中其他LLM ：将用户记忆内化为参数编辑，属于LLM个性化

AI总结提出User as Engram方法，将用户事实存储为Engram模型的哈希键控记忆表中的局部编辑，推理技能共享一个适配器，实现高精度间接推理且内存占用极小。

详情

AI中文摘要

语言模型中的个人记忆涉及两个问题：内容和推理技能。大脑将两者分开（每个情节在海马体中有一个稀疏的局部印迹，解释它的共享技能在缓慢的新皮层中），因此新事实不必覆盖其他一切。如今大多数个性化方法将用户事实保存在权重之外，存储在自然语言记忆文件或检索索引中。当事实被写入模型时，标准方法是每用户的LoRA适配器，这与大脑相反，将内容和技能折叠成一个全局权重增量。将用户事实写为LoRA会污染与它们无关的文本；将相同事实写为局部Engram行则数学上保持不变，导致内存占用大约减少33,000倍。因此，我们提出User as Engram：将用户内容存储为对Engram模型的哈希键控记忆表的手术式编辑，并将推理技能携带在一个共享适配器中。这种分层设计匹配了每用户LoRA的直接召回，同时平均提供5.6倍更高的间接推理准确性，并且从未使单个用户在推理方面比未触及的基座更差。编辑是一个玻璃盒：写入一个事实会在精确触发时打开其查找，添加答案所需的值，保持其他每个位置不变到最后一位，如果写入错误层则失败。由于不同用户的事实落在不相交的哈希槽中，它们的编辑可组合：许多用户同时共享一个表，可加性且无损地堆叠，而每用户LoRA（一个全局权重增量）只允许一个。在检索时，每用户Engram表不会随着检索器必须搜索的群体增长，因此在大约100个事实后，它超越了在2.5倍更大模型上的检索流水线。

英文摘要

Personal memory in a language model is two problems: content and reasoning skill. The brain keeps the two apart (a sparse, local engram in the hippocampus for each episode, a slow neocortex for the shared skills that interpret it), so a new fact need not overwrite everything else. Most personalization today keeps a user's facts outside the weights, in a natural-language memory file or a retrieval index. When facts are written into the model instead, the standard recipe is the per-user LoRA adapter, which does the opposite of the brain, folding content and skill into one global weight delta. Writing a user's facts as a LoRA contaminates text unrelated to them; writing the same facts as local Engram rows leaves it mathematically untouched, resulting in a roughly 33,000x smaller memory footprint. We therefore propose User as Engram: store a user's content as surgical edits to the hash-keyed memory table of an Engram model, and carry the reasoning skill in one shared adapter. This layered design matches per-user LoRA's direct recall while delivering 5.6x higher indirect-reasoning accuracy on average, and never makes a single user worse at reasoning than the untouched base. The edit is a glass box: writing a fact switches on its lookup at exactly the trigger, adds the value the answer needs, leaves every other position unchanged to the last bit, and fails if written into the wrong layer. Because different users' facts land in disjoint hash slots, their edits compose: many users live in one shared table at once, stacking additively and losslessly, where a per-user LoRA, a single global weight delta, admits only one. Upon retrieval, a per-user Engram table does not grow with the population the retriever must search, so past ~100 facts it overtakes a retrieval pipeline on a 2.5x larger model.

2606.18851 2026-06-18 eess.SY cs.SY 新提交 70%

From Tokens to Energy Flexibility: Quantization-Enabled Demand Response for Data Centers with LLM Inference Workloads

从令牌到能量灵活性：面向LLM推理工作负载的数据中心量化使能需求响应

Bojun Du, Xiaoyi Fan, Ershun Du, Long Chen, Jianpei Han, Qingchun Hou, Ning Zhang, Chongqing Kang

专题命中其他LLM ：LLM推理数据中心需求响应，量化管理。

AI总结提出一种量化使能的能量管理框架，通过建立量化-功率模型和两阶段需求响应模型，实现多园区协同优化，降低数据中心运营成本34.3%。

Comments 10 pages, 7 figures

详情

AI中文摘要

大型语言模型（LLM）推理的快速增长正在造成显著的数据中心负载，在日益紧张的电网条件和需求响应（DR）要求下，这些负载面临着越来越多的能量管理挑战。传统的数据中心能量管理主要依赖于时间和空间上的工作负载转移以及园区级能量资产调度，但通常将LLM推理需求视为聚合负载。因此，这些方法未能利用LLM服务的内部特性，从而忽视了模型量化等LLM特定技术所提供的灵活性。为了释放这种灵活性，本文提出了一种面向电网响应型LLM推理数据中心的量化使能能量管理框架。首先，建立了一个量化-功率模型，将每个模型-量化配置映射到一个紧凑的可调度参数集。其次，开发了一个两阶段量化使能的需求响应模型，以考虑模型实例切换、请求路由和精度选择。第三，引入了一种多园区协同优化方法，通过将电网侧电力和碳信号与量化使能的需求响应模型相结合，参与需求响应。案例研究表明，所提出的框架在不减少服务令牌量的情况下，将数据中心总运营成本降低了34.3%，验证了模型量化作为电网响应型LLM数据中心能量管理的有效灵活性杠杆。

英文摘要

The rapid growth of large language model (LLM) inference is creating significant data-center loads that face increasing energy-management challenges under tightening grid conditions and demand response (DR) requirements. Conventional data-center energy management mainly relies on temporal and spatial workload shifting and campus-level energy asset scheduling, but it usually treats LLM inference demand as an aggregate load. As a result, these approaches fail to exploit the internal characteristics of LLM serving and therefore overlook the flexibility offered by LLM-specific techniques such as model quantization. To unlock this flexibility, this paper proposes a quantization-enabled energy management framework for grid-responsive LLM inference data centers. First, a quantization-to-power model is established to map each model-quantization configuration to a compact set of dispatchable parameters. Second, a two-stage quantization-enabled DR model is developed to account for model instance switching, request routing, and precision selection. Third, a multi-campus co-optimization method is introduced for DR participation by integrating grid-side electricity and carbon signals with the quantization-enabled DR model. Case studies show that the proposed framework reduces total data-center operating cost by 34.3% without curtailing served token volume, validating model quantization as an effective flexibility lever for grid-responsive LLM data-center energy management.

2606.18832 2026-06-18 cs.LG cs.AI 新提交 70%

Target-confidence Recourse Using tSeTlin machines: TRUST

使用Tsetlin机器的目标置信度追索：TRUST

K. Darshana Abeyrathna, Sara El Mekkaoui, Nils Enric Canut Taugbøl, Anuja Vats

发表机构 * Group Research and Development Det Norske Veritas (DNV)（挪威船级社（DNV）集团研发部）

专题命中其他LLM ：提出TRUST框架，使用概率Tsetlin机器生成反事实解释；该方法基于Tsetlin机器，本身不涉及LLM。

AI总结提出TRUST框架，通过概率Tsetlin机器和贝叶斯优化直接搜索满足用户指定置信度目标的最小输入变化，生成更稳健和可解释的反事实解释。

详情

AI中文摘要

反事实解释被广泛用于高风险决策系统中的算法追索。大多数现有方法寻求最小化改变输入以翻转模型决策。然而，决策者通常不仅依赖预测标签，还依赖置信度阈值和风险边际。刚好越过决策边界的反事实在噪声或模型变化下可能脆弱且不稳定。本文提出使用Tsetlin机器的目标置信度追索（TRUST），一种用户明确指定追索所需预测置信度的框架。TRUST不是先生成反事实再评估置信度，而是直接搜索满足用户定义置信度目标的最小变化，从而在成本、置信度和鲁棒性方面比较追索选项。我们使用概率Tsetlin机器（PTM）结合贝叶斯优化实例化TRUST。PTM基于概率子句的结构将预测置信度与决策规则的稳定性联系起来。我们表明，满足相同规则的反事实在可靠性上可能差异很大，取决于它们满足这些规则的安全程度，揭示了决策是由稳健还是脆弱的子句激活支持的。在合成和真实数据集上的实验表明，目标置信度反事实比传统的基于边界的方法产生更稳健和可解释的追索。在多个基准测试中，TRUST实现了完美的鲁棒性，同时保持较低的追索成本，包括在Haberman数据集上以0.92置信度达到0.10的L2距离。通过显式控制置信度和暴露规则级稳定性，TRUST为高风险决策支持提供了可操作的追索。

英文摘要

Counterfactual explanations are widely used to provide algorithmic recourse in high-stakes decision-making systems. Most existing methods seek the smallest change to an input that flips a model's decision. However, decision-makers often rely not only on predicted labels but also on confidence thresholds and risk margins. Counterfactuals that barely cross a decision boundary can be fragile and unstable under noise or model variation. In this paper, we propose Target-confidence Recourse Using tSeTlin machines (TRUST), a framework in which users explicitly specify the desired prediction confidence for recourse. Rather than generating counterfactuals and evaluating confidence afterward, TRUST directly searches for minimal changes that satisfy a user-defined confidence target, enabling comparison of recourse options in terms of cost, confidence, and robustness. We instantiate TRUST using a Probabilistic Tsetlin Machine (PTM) combined with Bayesian optimization. The probabilistic clause-based structure of PTM links prediction confidence to the stability of decision rules. We show that counterfactuals satisfying the same rules can still differ substantially in reliability depending on how securely they satisfy those rules, revealing whether decisions are supported by robust or fragile clause activations. Experiments on synthetic and real-world datasets demonstrate that target-confidence counterfactuals produce more robust and interpretable recourse than conventional boundary-based approaches. Across multiple benchmarks, TRUST achieves perfect robustness while maintaining low recourse cost, including an L2 distance of 0.10 on the Haberman dataset at 0.92 confidence. By explicitly controlling confidence and exposing rule-level stability, TRUST provides actionable recourse for high-stakes decision support.

2606.18795 2026-06-18 cs.SI 新提交 70%

Opinion Polarization in LLM-Based Social Networks: Manipulation and Mitigation

基于LLM的社交网络中的意见极化：操纵与缓解

Ali Safarpoor Dehkordi, Mohammad Shirzadi, Ahad N. Zehmakan

专题命中其他LLM ：基于LLM的社交网络意见极化研究

AI总结研究在基于大语言模型模拟的社交网络中，对手如何通过有限预算操纵意见极化，并评估两种防御机制（反应性和主动性）的效果，发现两者均无法完全恢复基线极化状态。

Comments 14 pages, 7 figures

详情

AI中文摘要

在线社交网络在面对试图通过操纵意见来放大意见极化的对手时有多脆弱？缓解这种操纵有多困难？现有研究使用意见动态的数学模型来探讨这一问题。虽然这些模型提供了有价值的理论见解，但它们依赖于关于交互、消息内容和意见更新的简化假设，限制了它们能够捕捉的对抗策略及其发现在现实环境中的适用性。基于大语言模型的模拟提供了一种更丰富的替代方案：智能体可以被赋予多样化的角色，通过自然语言进行交流，并以上下文相关的方式回应说服性或对抗性内容。这使得研究难以用经典数学模型表示的操纵策略成为可能。据我们所知，本研究首次在基于LLM的模拟社交网络框架中系统分析了极化的放大和缓解。在我们的框架中，具有多样化角色的LLM智能体通过交换自然语言帖子在社交网络上进行交互，并相应地更新他们的意见。我们表明，即使预算有限的对手也能显著增加极化。然后，我们研究了两类防御机制：反应性缓解（指派特定用户主动对抗操纵）和主动性干预（通过不针对特定用户的一般机制增加抵抗力）。我们的结果表明，尽管这些机制减少了对抗攻击的影响，但它们通常无法将网络恢复到其基线极化状态。这些发现表明，这两种方法都不能完全克服网络的脆弱性，凸显了此类攻击的潜在风险。

英文摘要

How vulnerable are online social networks to adversaries who seek to amplify opinion polarization by manipulating opinions, and how difficult is it to mitigate such manipulation? Existing studies have examined this question using mathematical models of opinion dynamics. While these models offer valuable theoretical insights, they rely on simplified assumptions about interactions, message content, and opinion updates, limiting the adversarial strategies they can capture and the applicability of their findings to real-world settings. Large language model (LLM)-based simulations provide a richer alternative: agents can be assigned diverse personas, communicate through natural language, and respond to persuasive or adversarial content in a context-dependent way. This enables the study of manipulation strategies that are difficult to represent using classical mathematical models. To the best of our knowledge, this study provides the first systematic analysis of polarization amplification and mitigation in an LLM-based simulated social network framework. In our framework, LLM agents with diverse personas interact over a social network by exchanging natural language posts and updating their opinions accordingly. We show that even an adversary with a limited manipulation budget can considerably increase polarization. We then study two classes of defense mechanisms: reactive mitigations, which assign specific users to actively counter manipulation, and proactive interventions, which increase resistance through general mechanisms not tied to particular users. Our results show that although these mechanisms reduce the impact of adversarial attacks, they generally do not restore the network to its baseline polarization state. These findings suggest that neither approach fully overcomes the vulnerability of the network, highlighting the potential risk of such attacks.

2606.18726 2026-06-18 cs.LG cs.AI 新提交 70%

Graph Grounded Cross Attention Transformer Neural Network for Structurally Constrained Full Event Sequence Generation in Predictive Process Monitoring

基于图锚定交叉注意力Transformer神经网络的预测过程监控中结构约束完整事件序列生成

Fang Wang, Ernesto Damiani

发表机构 * Department of Computer Science, University of Milan（米兰大学计算机科学系）

专题命中其他LLM ：预测过程监控，图锚定交叉注意力Transformer。

AI总结提出图锚定交叉注意力Transformer（GGATN），通过全局过程图作为结构化记忆、Transformer自注意力编码序列位置、图锚定交叉注意力注入过程拓扑，结合维特比式图约束解码，一次性生成完整事件序列，在六个基准日志上优于LLM基线。

Comments 40 pages

详情

AI中文摘要

结构约束的事件序列生成仍然具有挑战性，因为生成的路径必须保持转移可行性、时间顺序、终止和属性一致性。在预测过程监控（PPM）中，这一挑战表现为完整事件序列生成，而现有工作主要处理子任务，如下一个活动、剩余时间、结果和属性预测。本文提出了图锚定交叉注意力Transformer神经网络（GGATN）用于这一统一的PPM任务。GGATN使用全局过程图作为结构化活动记忆，通过Transformer自注意力对序列位置进行上下文化，并通过图锚定交叉注意力注入过程拓扑。与自回归解码不同，GGATN一次性生成活动、时间戳、长度以及事件级和序列级属性，随后进行维特比风格的图约束解码以获得可行路径和显式终止。在六个基准事件日志上的实验表明，其生成质量优于局部指令提示的LLM基线。GGATN在序列相似性、Damerau-Levenshtein相似性、基于二元组的控制流相似性和持续时间分布方面取得了强劲性能，同时保持零幻觉活动和零序列级属性不一致。消融分析证实了全局图编码器作为稳定的结构先验。可解释性分析展示了图结构、序列上下文、反馈细化和约束解码如何塑造生成过程。

英文摘要

Structurally constrained event sequence generation remains challenging because generated paths must preserve transition feasibility, temporal order, termination, and attribute consistency. In predictive process monitoring (PPM), this challenge appears as full event sequence generation, whereas existing work mainly addresses component tasks such as next activity, remaining time, outcome, and attribute prediction. This paper proposes the Graph Grounded Cross Attention Transformer Neural Network (GGATN) for this unified PPM task. GGATN uses a global process graph as structured activity memory, contextualizes sequence positions through Transformer self attention, and injects process topology through graph grounded cross attention. Unlike autoregressive decoding, GGATN generates activities, timestamps, length, and event level and sequence level attributes in a single pass, followed by Viterbi style graph constrained decoding for feasible paths and explicit termination. Experiments on six benchmark event logs show more reliable generation quality than local instruction prompted LLM baselines. GGATN achieves strong performance on sequence similarity, Damerau Levenshtein similarity, bigram based control flow similarity, and duration distribution, while maintaining zero hallucinated activities and zero sequence level attribute inconsistency. Ablation analyses confirm the global graph encoder as a stable structural prior. Interpretability analyses show how graph structure, sequence context, feedback refinement, and constrained decoding shape generation.
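The Viterbi-style graph-constrained decoding step mentioned above can be illustrated with a small sketch. It assumes the single-pass model yields a log-score per activity at each position and that the process graph is given as a successor map; explicit termination handling is omitted, and all names and the toy graph are hypothetical.

```python
import math

def constrained_viterbi(log_scores, allowed):
    """Best-scoring activity path that only uses transitions in `allowed`.

    log_scores: list of dicts, activity -> log-score at that position.
    allowed: dict, activity -> set of permitted successor activities.
    """
    activities = list(log_scores[0])
    best = dict(log_scores[0])
    back = []
    for scores in log_scores[1:]:
        new_best, pointers = {}, {}
        for a in activities:
            candidates = [(best[p], p) for p in activities
                          if a in allowed.get(p, ()) and best[p] > -math.inf]
            if candidates:
                prev_score, prev = max(candidates)
                new_best[a], pointers[a] = prev_score + scores[a], prev
            else:
                new_best[a], pointers[a] = -math.inf, None
        best = new_best
        back.append(pointers)
    last = max(best, key=best.get)
    if best[last] == -math.inf:
        return None  # no feasible path of this length
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]

allowed = {"A": {"B"}, "B": {"C"}, "C": set()}
log_scores = [
    {"A": 0.0, "B": -5.0, "C": -5.0},
    {"A": -5.0, "B": -1.0, "C": -0.5},  # C scores best locally, but A -> C is infeasible
    {"A": -5.0, "B": -5.0, "C": 0.0},
]
path = constrained_viterbi(log_scores, allowed)
```

A per-position argmax would pick C at the second position; the constrained search instead returns a path that respects the graph.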

2606.18717 2026-06-18 cs.CL cs.AI 新提交 70%

Morpheus: A Morphology-Aware Neural Tokenizer and Word Embedder for Turkish

Morpheus: 一种面向土耳其语的形态感知神经分词器和词嵌入器

Tolga Şakar

发表机构 * Independent Researcher（独立研究者）

专题命中其他LLM ：土耳其语形态感知分词器与词嵌入。

AI总结针对土耳其语粘着特性，提出Morpheus神经词素边界模型，实现无损可逆分词与结构化词嵌入，在可逆分词器中达到最低比特每字符（1.425），词素对齐F1提升至0.61，GPU内存节省约19%。

详情

AI中文摘要

土耳其语是粘着语：意义由词素承载，然而驱动现代语言模型的子词分词器根据语料库统计分割单词，切碎了承载语义的后缀，并且在WordPiece和基于规则的分析器的情况下，无法将其输出解码回原始文本。本文提出Morpheus，一个面向土耳其语的神经词素边界模型，它同时是一个无损的、形态感知的分词器和一个词嵌入生成器。一个可微的泊松-二项式动态规划程序在训练期间将每个字符的边界概率转化为软词素隶属度，在推理时转化为精确的片段，无需字符串归一化，因此$\mathrm{decode}(\mathrm{encode}(w)) = w$由构造保证。由于该模型是神经模型，相同的正向传播在分词的同时也输出结构化的词嵌入。在可逆分词器中（唯一适用于生成的分词器），Morpheus达到了最低的比特每字符（1.425），将子词家族的金标准词素对齐大致翻倍（MorphScore宏F1从约0.32提升至0.61），并且相比64K词汇量的子词分词器节省了约19%的GPU内存。作为嵌入器，冻结的Morpheus向量在词汇检索（根家族MAP 0.85）和同根验证（ROC-AUC 1.00）上领先，超越了多语言检索器BGE-M3和BERTurk；在上下文和屈折依赖的任务（NER、格/数探测）上，更重的上下文编码器仍然领先，我们将这一权衡归因于Morpheus以词根为中心的几何结构。代码：https://github.com/lonewolf-rd/TurkishMorpheus；模型：https://huggingface.co/lonewolflab/Morpheus-TR-50K；交互演示：https://huggingface.co/spaces/lonewolflab/morpheus-tr-demo。

英文摘要

Turkish is agglutinative: meaning is carried by morphemes, yet the subword tokenizers that drive modern language models split words by corpus statistics, fragmenting semantically loaded suffixes and, in the case of WordPiece and rule-based analyzers, failing to decode their output back to the original text. This paper presents Morpheus, a neural morpheme-boundary model for Turkish that is at once a lossless, morphology-aware tokenizer and a word-embedding producer. A differentiable Poisson-binomial dynamic program turns per-character boundary probabilities into soft morpheme memberships during training and exact segments at inference, with no string normalization, so $\mathrm{decode}(\mathrm{encode}(w)) = w$ holds by construction. Because the model is neural, the same forward pass that tokenizes also emits a structured word embedding. Among reversible tokenizers (the only ones valid for generation), Morpheus attains the lowest bits-per-character (1.425), roughly doubles the gold morphological alignment of the subword family (MorphScore macro-F1 0.61 vs. ~0.32), and uses ~19% less GPU memory than 64K-vocabulary subword tokenizers. As an embedder, frozen Morpheus vectors lead on lexical retrieval (root-family MAP 0.85) and same-root verification (ROC-AUC 1.00), surpassing the multilingual retriever BGE-M3 and BERTurk; on context- and inflection-dependent tasks (NER, case/number probing) the heavier contextual encoders remain ahead, a trade-off we attribute to Morpheus's root-centric geometry. Code: https://github.com/lonewolf-rd/TurkishMorpheus; model: https://huggingface.co/lonewolflab/Morpheus-TR-50K; interactive demo: https://huggingface.co/spaces/lonewolflab/morpheus-tr-demo.
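One plausible reading of the Poisson-binomial dynamic program is sketched below, under stated assumptions: boundaries are treated as independent Bernoulli events, and a character's soft membership in segment k is the probability that exactly k boundaries precede it. The function names, the 0.5 threshold, and the example probabilities are hypothetical and not taken from the Morpheus code.

```python
def soft_memberships(boundary_probs):
    """rows[i][k] = P(character i lies in segment k).

    boundary_probs[i] is the assumed probability of a boundary right after
    character i, so a word of n characters has n - 1 entries.
    """
    n = len(boundary_probs) + 1
    dist = [1.0]  # character 0 is always in segment 0
    rows = [dist + [0.0] * (n - 1)]
    for p in boundary_probs:
        new = [0.0] * (len(dist) + 1)
        for k, mass in enumerate(dist):
            new[k] += mass * (1 - p)      # no boundary: stay in segment k
            new[k + 1] += mass * p        # boundary: move to segment k + 1
        dist = new
        rows.append(dist + [0.0] * (n - len(dist)))
    return rows

def segment(word, boundary_probs, threshold=0.5):
    """Hard segmentation at inference; slices the string without normalizing it."""
    pieces, start = [], 0
    for i, p in enumerate(boundary_probs):
        if p > threshold:
            pieces.append(word[start:i + 1])
            start = i + 1
    pieces.append(word[start:])
    return pieces

word = "evlerde"  # ev + ler + de
probs = [0.05, 0.9, 0.1, 0.05, 0.95, 0.1]  # made-up boundary probabilities
pieces = segment(word, probs)
rows = soft_memberships(probs)
```

Because `segment` only slices the input string, joining the pieces reproduces the word exactly, which mirrors the lossless round-trip property the abstract states.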

2606.18709 2026-06-18 cs.CL 新提交 70%

LLMs Struggle to Measure What Distinguishes Students of Different Proficiency Levels: A Study of Item Discrimination in Reading Comprehension Assessment

LLM难以衡量题目对不同水平学生的区分能力：阅读理解评估中的题目区分度研究

Han Chen, Ming Li, Chenguang Wang, Yijun Liang, Dawei Zhou, Hong Jiao, Tianyi Zhou

发表机构 * MBZUAI（穆罕默德·本·扎耶德人工智能大学）； University of Maryland（马里兰大学）； Virginia Tech（弗吉尼亚理工大学）

专题命中其他LLM ：评估LLM预测题目区分度能力。

AI总结本研究评估42个LLM在零样本设置下预测题目区分度的能力，发现直接预测与人类校准的区分度相关性弱（最高Spearman 0.152），基于CTT的响应校准相关性有限（0.241），表明LLM尚不能可靠捕捉题目区分度。

Abstract

Item discrimination is a fundamental psychometric property of educational assessment, which measures whether an item meaningfully distinguishes students with higher proficiency from students with lower proficiency. While various existing works have explored whether large language models (LLMs) can estimate item difficulty, it remains unclear whether they can capture item discrimination. In this work, we evaluate 42 proprietary and open-weight LLMs in zero-shot settings using two complementary approaches: direct discrimination prediction, where models explicitly estimate an item's discrimination value from its content, and response-based Classical Test Theory (CTT) calibration, where LLM answers are treated as synthetic student responses to compute discrimination scores. Our results show that direct prediction yields weak alignment with human-calibrated discrimination: the best-performing model reaches only a Spearman correlation of 0.152. Response-based CTT calibration provides a stronger but still limited signal, with the all-persona synthetic respondent pool reaching a Spearman correlation of 0.241. These findings highlight item discrimination as an open challenge for LLM-based psychometric evaluation: current LLMs contain non-random discrimination-relevant signal, but they do not yet reliably capture how assessment items distinguish human students.
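
As a rough illustration of response-based CTT calibration, the sketch below treats a 0/1 matrix of answers (rows are respondents, columns are items) as student responses and computes a discrimination score per item. The abstract does not say which CTT statistic the authors use; the item-rest point-biserial correlation here is one common choice and is an assumption, as are the function names. The `spearman` helper assumes no tied values.

```python
import numpy as np

def item_rest_discrimination(responses):
    """Item-rest point-biserial correlation for each item.

    responses: (n_respondents, n_items) matrix of 0/1 correctness.
    Items with no variance (or a constant rest score) get NaN.
    """
    responses = np.asarray(responses, dtype=float)
    total = responses.sum(axis=1)
    disc = np.full(responses.shape[1], np.nan)
    for j in range(responses.shape[1]):
        item = responses[:, j]
        rest = total - item  # score on all other items
        if item.std() > 0 and rest.std() > 0:
            disc[j] = np.corrcoef(item, rest)[0, 1]
    return disc

def spearman(x, y):
    """Spearman correlation as Pearson on ranks (assumes no ties)."""
    rank_x = np.argsort(np.argsort(x))
    rank_y = np.argsort(np.argsort(y))
    return np.corrcoef(rank_x, rank_y)[0, 1]

# Toy data: four synthetic respondents, three items.
synthetic = [[1, 1, 1],
             [1, 1, 0],
             [1, 0, 0],
             [0, 0, 0]]
predicted = item_rest_discrimination(synthetic)
```

In a setup like the one described, `predicted` would then be compared against human-calibrated discrimination values with `spearman`, which is the kind of agreement the reported 0.152 and 0.241 figures measure.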

2606.18620 2026-06-18 cs.CL cs.AI New submission 70%

BCL: Bayesian In-Context Learning Framework for Information Extraction

Haoliang Liu, Chengkun Cai, Xu Zhao, Han Zhu, Shizhou Huang, Xinglin Zhang, Tao Chen, Jenq-Neng Hwang, Zhang Huaping, Lei Li

Affiliations: HiThink Research; University College London; University of Edinburgh; The Hong Kong University of Science and Technology; East China Normal University; Shanghai Medical Image Insights; University of Waterloo; University of Washington; Beijing Institute of Technology

Topic match: Other LLM. A Bayesian in-context learning framework for information extraction.

AI summary: Proposes the BCL framework, which uses Bayesian updating and particle filtering to optimize in-context learning for information extraction, with significant gains on sequence labeling and relation classification tasks.

Comments ACL 2026 Findings

2606.18587 2026-06-18 cs.CL cs.AI New submission 75%

Dual Dimensionality for Local and Global Attention

Zhiyuan Wang, Xuan Luo, Sirui Zeng, Xifeng Yan

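Affiliations: UC Santa Barbara

Topic match: Pretraining. Proposes a distance-adaptive representation to optimize Transformer attention.

AI summary: Proposes Distance-Adaptive Representation (DAR), which keeps full-dimensional representations for the local context and uses lower-dimensional representations for distant tokens, reducing the KV cache while maintaining performance.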

Abstract

Decoder-only Transformers compute attention over the KV cache of preceding tokens. Keys (and Values) are typically represented with the same dimensionality, regardless of their distance from the prediction target. In natural language, however, the next word is most strongly influenced by the immediately preceding tokens. We hypothesize that local and distant tokens impose asymmetric demands on representational capacity: local tokens are more critical for predicting immediate outputs and thus require richer representations, whereas distant tokens primarily serve as long-range memory, for which lower-dimensional representations may suffice. We formalize this idea as Distance-Adaptive Representation (DAR), implemented in a controlled setting that preserves full-dimensional representations within a local context window while assigning reduced-dimensional representations (e.g. 1/4 of the original dimensionality) to tokens beyond that window. Across multiple pretraining scales (70M to 410M parameters), as well as continued supervised fine-tuning on a 1B-scale model, this approach closely matches the performance of full-dimensional baselines. In contrast, uniformly reducing dimensionality across all token positions leads to worse performance. These results challenge the common assumption that key and value dimensionality should be uniform across token positions. Our findings suggest a new direction for designing attention architectures that adaptively allocate representational capacity across sequences, enabling further reductions in KV cache during inference.
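
To make the idea concrete, here is a toy single-query sketch in which tokens inside a local window are scored with full-dimensional keys and tokens beyond it keep only `r` coordinates. The abstract does not describe how DAR produces the reduced representations, so plain truncation to the first `r` coordinates, the shared `sqrt(d)` scaling, and the function names are all assumptions for illustration. The second function counts the floats a KV cache would hold under this layout.

```python
import numpy as np

def dar_attention(q, K, V, window, r):
    """Toy attention with reduced-dimensional distant tokens.

    q: (d,) query; K, V: (T, d) keys/values of preceding tokens, oldest first.
    window: number of most recent tokens kept at full dimensionality.
    r: dimensionality kept for tokens outside the window (assumed truncation).
    """
    T, d = K.shape
    n_far = max(T - window, 0)
    scores = np.empty(T)
    # Distant tokens: only the first r coordinates are stored and scored.
    scores[:n_far] = K[:n_far, :r] @ q[:r]
    # Local tokens: full-dimensional dot product.
    scores[n_far:] = K[n_far:] @ q
    scores /= np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # Distant values are zero-padded back to d dimensions before mixing.
    V_eff = V.copy()
    V_eff[:n_far, r:] = 0.0
    return weights @ V_eff

def kv_cache_floats(T, window, d, r):
    """Number of floats stored for keys plus values under this layout."""
    n_far = max(T - window, 0)
    return 2 * (min(T, window) * d + n_far * r)

full = kv_cache_floats(1024, 1024, 64, 16)    # every token at full dimension
reduced = kv_cache_floats(1024, 128, 64, 16)  # 128-token window, 1/4 dims beyond
```

With a 128-token window and 1/4 dimensionality beyond it, this layout stores 45056 floats per head for 1024 tokens instead of 131072, about 34% of the full cache. When `window` covers the whole sequence the function reduces to ordinary softmax attention.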

2606.19170 2026-06-18 cs.CL New submission 70%

Dango: A Strictly L1-Only Large Language Model for Studying Second Language Acquisition

Shiho Matta, Yin Jou Huang, Fei Cheng, Takashi Kodama, Hirokazu Kiyomaru, Yugo Murawaki

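Affiliations: Kyoto University; NII-LLMC

Topic match: Pretraining. An LLM that simulates second language acquisition, with a focus on pretraining data.

AI summary: Proposes Dango, a 1.8B-parameter model that filters L2 contamination out of its pretraining data and is fine-tuned on L2-learning lessons to simulate human L2 production patterns, outperforming unfiltered and multilingual baselines.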

Comments 8 pages main text, 20 pages total including references and appendices

Abstract

We introduce Dango, a 1.8B-parameter large language model designed for controlled studies of L1-to-L2 (Japanese-to-English) transfer in second language acquisition (SLA). While previous studies have explored SLA in language models, they have predominantly relied on smaller or non-decoder models, limiting their ability to generate open-ended text and reducing their suitability as practical L2 simulators. We identify a key challenge when scaling models to this size: L2 contamination within the "monolingual" pretraining corpus used for L1 acquisition. To address this, we propose a filtering method to reduce premature exposure to English while preserving realistic, minimal exposure. We then fine-tune the model on LLM-generated L2-learning lessons to simulate the L2 acquisition process. Our evaluations confirm that Dango develops human-like L2 production patterns, outperforming both unfiltered and standard multilingual baselines. We release the model, data, and code to facilitate reproducible computational SLA research and learner-facing applications.
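
The abstract does not specify how the contamination filter works. As a purely hypothetical stand-in, the sketch below drops documents whose share of ASCII letters exceeds a threshold, which removes mostly-English text while keeping Japanese documents that contain a few Latin-script tokens. The names `latin_ratio` and `filter_corpus` and the threshold value are assumptions for illustration, not the authors' method.

```python
def latin_ratio(text):
    """Fraction of alphabetic characters that are ASCII letters."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return 0.0
    latin = sum(1 for c in letters if c.isascii())
    return latin / len(letters)

def filter_corpus(docs, max_ratio=0.3):
    """Keep documents whose Latin-letter share is at most max_ratio (assumed cutoff)."""
    return [d for d in docs if latin_ratio(d) <= max_ratio]

docs = [
    "今日は天気がいい",              # pure Japanese
    "This is an English sentence.",  # pure English
    "私はPCを使う",                  # Japanese with a short Latin token
]
kept = filter_corpus(docs)
```

A nonzero threshold is what lets a small amount of incidental Latin script through, in the spirit of the "realistic, minimal exposure" the abstract mentions; a real pipeline would likely need language identification rather than a character-level heuristic.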

1. Domain-specific LLMs (7 papers)

G-IdiomAlign: A Gloss-Pivoted Benchmark for Cross-Lingual Idiom Alignment

Beyond Tokenization: Direct Timestep Embedding and Contrastive Alignment for Time-Series Question Answering

ProfiLLM: Utility-Aligned Agentic User Profiling for Industrial Ride-Hailing Dispatch

Low-resource Language Discrimination Towards Chinese Dialects with Transfer learning and Data Augmentation

Teaching Software Engineering with LLM and MCP Integration: From Classroom to Industry Practice

PowerAgentBench-SS: A Benchmark for Agentic AI in Power System Steady-State Studies

PEC-Home: Interpretation of Progressively Elliptical Commands in Smart Homes

2. Post-training (2 papers)

REVES: REvision and VErification--Augmented Training for Test-Time Scaling

PACT: Preserving Anchored Cores in Task-vectors for Model Merging

3. Instruction tuning (2 papers)

ScholarSum: Student-Teacher Abstractive Summarization via Knowledge Graph Reasoning and Reflective Refinement

SAGE: Stochastic Prompt Optimization via Agent-Guided Exploration

4. Other LLM (17 papers)

GateMem: Benchmarking Memory Governance in Multi-Principal Shared-Memory Agents

Want Better Synthetic Data? Steer It: Activation Steering for Low-Resource Language Generation

Attribution-Guided and Coverage-Maximized Pruning for Structural MoE Compression

OmniPlan: An Adaptive Framework for Timely and Near-Optimal Network Planning Optimization

On the Memorization Behavior of LLMs in Generative Recommendation: Observations, Implications, and Training Strategies

HeRo-Q: A General Framework for Stable Low Bit Quantization via Hessian Conditioning

Explaining Attention with Program Synthesis

Structured Inference with Large Language Gibbs

RECOM: A Validity Discrimination Tradeoff in Automatic Metrics for Open Ended Reddit Question Answering

User as Engram: Internalizing Per-User Memory as Local Parametric Edits

From Tokens to Energy Flexibility: Quantization-Enabled Demand Response for Data Centers with LLM Inference Workloads

Target-confidence Recourse Using tSeTlin machines: TRUST

Opinion Polarization in LLM-Based Social Networks: Manipulation and Mitigation

Graph Grounded Cross Attention Transformer Neural Network for Structurally Constrained Full Event Sequence Generation in Predictive Process Monitoring

Morpheus: A Morphology-Aware Neural Tokenizer and Word Embedder for Turkish

LLMs Struggle to Measure What Distinguishes Students of Different Proficiency Levels: A Study of Item Discrimination in Reading Comprehension Assessment

BCL: Bayesian In-Context Learning Framework for Information Extraction

5. Pretraining (2 papers)

Dual Dimensionality for Local and Global Attention

Dango: A Strictly L1-Only Large Language Model for Studying Second Language Acquisition