arXivDaily每日学术速递，同步arXiv全量数据，AI总结、翻译，覆盖人工智能、机器人、计算机、金融、统计学、数学、物理学、生物学、经济学、电气&系统等方向。

2606.10284 2026-06-10 cs.LG 新提交

Revisiting Positive Samples in Graph Contrastive Learning: From the Perspective of Message Passing

重新审视图对比学习中的正样本：从消息传递的角度

Lianze Shan, Ningchong Wang, Jitao Zhao, Di Jin, Dongxiao He

发表机构 * School of Computer Science and Technology, Tianjin University, Tianjin, China（天津大学计算机科学与技术学院）； School of Future Technology, Tianjin University, Tianjin, China（天津大学未来技术学院）

AI总结本文从Dirichlet能量角度理论发现消息传递机制使正样本最大化变得平凡，导致图对比学习难以从正样本中有效学习，并提出SPGCL方法通过仅传播高能量特征并利用低能量特征构建概率矩阵来恢复正样本的学习效能。

Comments 24 pages,6 figures

详情

AI中文摘要

图对比学习（GCL）通过最大化正样本之间的相似性并最小化负样本之间的相似性来训练图编码器，已成为主流的图预训练范式。普遍认为正样本在GCL中至关重要。理想情况下，最大化正样本的相似性使图编码器能够捕捉图数据的内在语义和模式。然而，我们发现一个有趣的现象：即使没有正样本，GCL也能取得有竞争力的性能。这促使我们重新审视GCL中正样本的基本机制。从Dirichlet能量的角度，我们理论上发现，消息传递（图编码器中的关键机制）使正样本的最大化变得平凡，从而阻止GCL从正样本中有效学习。为了解决这个问题，我们提出SPGCL来减轻消息传递导致的平凡化，并恢复正样本的学习效能。具体来说，我们发现高Dirichlet能量特征有助于正样本提供有效的学习信号，而低Dirichlet能量特征对正学习信号贡献很小，但对正采样有用。基于此，SPGCL仅传播高Dirichlet能量特征，并使用低能量特征构建概率矩阵以实现可靠的正采样。大量实验证明了SPGCL的有效性。

英文摘要

Graph Contrastive Learning (GCL), which trains graph encoders by maximizing similarity between positive samples and minimizing it between negative ones, has emerged as a mainstream graph pre-training paradigm. It is widely recognized that positive samples are essential in GCLs. Ideally, maximizing the similarity of positive samples enables graph encoders to capture intrinsic semantic and patterns of graph data. However, we discover an interesting phenomenon: GCLs can achieve competitive performance even without positive samples. This motivates us to revisit the fundamental mechanism of positive samples in GCLs. From the perspective of Dirichlet energy, we theoretically finds that message passing, a key mechanism in graph encoders, trivializes the maximization of positive samples, preventing GCLs from effectively learning from positive samples. To address this, we propose SPGCL to mitigate the trivialization caused by message passing and restore the learning efficacy of positive samples. Specifically, we find that high Dirichlet energy features help positive samples provide effective learning signals while low Dirichlet energy features contribute little to positive learning signal but is useful for positive sampling. Based on this, SPGCL propagates only high Dirichlet energy features and uses low energy features to construct a probability matrix for reliable positive sampling. Extensive experiments demonstrate the effectiveness of SPGCL.

URL PDF HTML ☆

赞 0 踩 0

2606.10279 2026-06-10 cs.AI cs.CL cs.LG 新提交

Supervised Fine-tuning with Synthetic Rationale Data Hurts Real-World Disease Prediction

使用合成理由数据进行监督微调损害真实世界疾病预测

Buxin Su, Bingxuan Li, Cheng Qian, Yiwei Wang, Jin Jin, Bingxin Zhao

发表机构 * University of Pennsylvania（宾夕法尼亚大学）； University of Illinois Urbana-Champaign（伊利诺伊大学厄巴纳-香槟分校）； University of California, Merced（加州大学默塞德分校）

AI总结研究发现，在临床预测任务中，使用合成理由数据进行监督微调反而显著降低模型性能，根本原因在于叙事合理性与判别优化之间的结构性冲突。

详情

AI中文摘要

监督微调中使用合成理由数据被广泛认为能通过教导模型不仅预测什么而且预测原因来提升语言模型在临床预测任务上的性能。我们在基于纵向健康史进行五年阿尔茨海默病及相关痴呆症（ADRD）预测的任务上检验了这一假设。通过一项包含504种配置的大规模对照实验，我们发现，与仅使用标签的微调相比，基于理由的SFT始终且显著地损害了预测性能。这种退化在多个模型系列和数据规模中持续存在，并且无法通过使用面向推理的基础模型来解决。关键的是，这种失败并非由理由质量差所致：人类专家注释证实生成的理由在医学上是准确的，并且忠实于患者特定的证据；少样本实验表明，当相同的理由作为推理时的演示而非训练目标使用时，能提升性能。我们确定根本原因在于叙事合理性与判别优化之间的结构性冲突。我们希望我们的工作能为更精确地理解理由监督何时以及如何有帮助、何时无帮助铺平道路，从而指导在高风险临床预测中负责任地开发语言模型。

英文摘要

Supervised fine-tuning with synthetic rationale data is widely assumed to improve language model performance on clinical prediction tasks by teaching models not just what to predict but why. We test this assumption on five-year Alzheimer's disease and related dementias (ADRD) prediction from longitudinal health histories. Across a large-scale controlled experiment of 504 configurations, we find that rationale-based SFT consistently and substantially hurts prediction performance relative to label-only fine-tuning. The degradation persists across model families and data scales, and is not resolved by using a reasoning-oriented base model. Crucially, the failure is not explained by poor rationale quality: human expert annotation confirms that the generated rationales are medically accurate and faithfully grounded in patient-specific evidence, and few-shot experiments show that the same rationales improve performance when used as inference-time demonstrations rather than training targets. We identify the root cause as a structural conflict between narrative plausibility and discriminative optimization. We hope our work paves the path toward a more precise understanding of when and how rationale-based supervision helps and when it does not, guiding the responsible development of language models for high-stakes clinical prediction.

URL PDF HTML ☆

赞 0 踩 0

2606.10278 2026-06-10 cs.SD cs.AI 新提交

Towards Robust Arabic Speech Emotion Recognition with Deep Learning

基于深度学习的鲁棒阿拉伯语音情感识别

Youcef Soufiane Gheffari, Samiya Silarbi

发表机构 * ADASCA Laboratory – Advanced Data Science and Cognitive Applications, Université des Sciences et de la Technologie d’Oran Mohamed Boudiaf (USTO-MB), Oran, Algeria（ADASCA实验室——高级数据科学与认知应用，奥兰穆罕默德·布迪夫科技大学（USTO-MB），阿尔及利亚奥兰）

AI总结针对阿拉伯语音情感识别中方言多样、数据稀缺等问题，提出CNN-Transformer混合架构，在EYASE和BAVED数据集上达到98.1%准确率。

Comments 21 pages, 16 figures, 11 tables. Submitted manuscript

详情

AI中文摘要

语音情感识别（SER）旨在从音频信号中识别说话者的情感状态。尽管深度学习的最新进展显著提高了印欧语系语言的SER性能，但由于方言多样性、标注数据集有限以及难以同时建模局部频谱线索和长程时间依赖性，阿拉伯语SER仍然探索不足且具有挑战性。为解决这些限制，本研究探讨了联合建模空间和上下文信息的混合架构是否能改善阿拉伯语音的情感识别。我们提出并评估了一个包含三种架构的比较框架：CNN-LSTM模型、CNN-Transformer模型和微调的wav2vec 2.0模型。前两种模型利用MFCC和基于频谱图的表示，而wav2vec 2.0通过自监督表示直接对原始音频进行操作。在EYASE和BAVED数据集上进行的实验表明，所提出的CNN-Transformer架构显著优于其他模型，达到了98.1%的准确率。这一结果凸显了将卷积特征提取与基于Transformer的全局上下文建模相结合的有效性。本工作的主要贡献在于为阿拉伯语SER提供了混合方法和自监督方法的系统比较，并证明了CNN-Transformer架构在低资源和方言多样性环境中为捕捉频谱和长程依赖性提供了鲁棒解决方案。

英文摘要

Speech Emotion Recognition (SER) aims to identify a speaker's emotional state from audio signals. While recent advances in deep learning have significantly improved SER performance in Indo-European languages, Arabic SER remains underexplored and challenging due to dialectal diversity, limited annotated datasets, and the difficulty of modeling both local spectral cues and long-range temporal dependencies. To address these limitations, this study investigates whether hybrid architectures that jointly model spatial and contextual information can improve emotion recognition in Arabic speech. We propose and evaluate a comparative framework involving three architectures: a CNN-LSTM model, a CNN-Transformer model, and a fine-tuned wav2vec 2.0 model. The first two models leverage MFCC and spectrogram-based representations, while wav2vec 2.0 operates directly on raw audio through self-supervised representations. Experiments conducted on the EYASE and BAVED datasets demonstrate that the proposed CNN-Transformer architecture significantly outperforms the other models, achieving an accuracy of 98.1 percent. This result highlights the effectiveness of combining convolutional feature extraction with Transformer-based global context modeling. The main contribution of this work lies in providing a systematic comparison of hybrid and self-supervised approaches for Arabic SER, and in demonstrating that CNN-Transformer architectures offer a robust solution for capturing both spectral and long-range dependencies in low-resource and dialectally diverse settings.

URL PDF HTML ☆

赞 0 踩 0

2606.10277 2026-06-10 cs.LG 新提交

A Unified Adaptive Feature Composition Framework for Multi-Task Generalization in Wireless Foundation Models

无线基础模型中多任务泛化的统一自适应特征组合框架

Yuxuan Shi, Tingting Yang, Kangning Ma, Liwen Jing, Yuwei Wang, Mengfan Zheng, Li Sun

发表机构 * Department of Broadband Communication, Pengcheng Laboratory（宽带通信系，鹏城实验室）； Purple Mountain Laboratories（紫金山实验室）

AI总结提出RAFC路由适配器，通过轻量级任务驱动网络动态组合Transformer各层隐藏特征，实现无线基础模型的多任务泛化，仅增加少于50K参数。

详情

AI中文摘要

尽管无线基础模型（WFM）在学习通用信道表示方面展现出强大潜力，但其适应各种下游任务仍受现有范式限制。微调策略引入了大量计算和存储开销，而冻结特征提取则导致跨不同下游任务的次优性能。为解决此问题，我们提出了一种用于WFM多任务泛化的统一自适应特征组合框架，其关键组件是用于特征组合的路由适配器（RAFC）。该路由器并非仅提取最后一层输出，而是将来自不同Transformer深度的隐藏状态视为可复用的多级隐藏特征池，并采用轻量级任务驱动特征组合网络生成逐层聚合权重，然后通过加权求和自适应地组合层次化表示。这种设计使每个下游任务能够访问合适的低、中、高级无线特征混合，而无需修改预训练骨干网络。在四个代表性无线任务上的大量实验表明，RAFC在引入少于50K额外参数的情况下，始终优于传统的适应基线。此外，学习到的路由权重提供了任务特定层偏好的可解释证据，使所提框架成为将WFM适应于各种下游场景的低复杂度、可扩展且可解释的接口。

英文摘要

Though wireless foundation models (WFMs) have shown strong potential in learning universal channel representations, their adaptation to various downstream tasks remains constrained by existing paradigms. Fine-tuning strategies introduces substantial computational and storage overhead, while frozen feature extraction leads to sub-optimal performance across diverse downstream tasks. To address this issue, we propose a unified adaptive feature composition framework for multitask generalization in WFMs, where the key component is the Routing Adapter for Feature Composition (RAFC). Instead of extracting only the final-layer output, this router treats the hidden states from different Transformer depths as a reusable pool of multi-level hidden features, and employs a lightweight task-driven feature composition network to generate layer-wise aggregation weights, then adaptively combine hierarchical representations through weighted summation. This design enables each downstream task to access suitable mixture of low-, mid-, and high-level wireless features without modifying the pretrained backbone. Extensive experiments on four representative wireless tasks demonstrate that RAFC consistently outperforms conventional adaptation baselines while introducing fewer than 50K additional parameters. Moreover, the learned routing weights provide interpretable evidence of task-specific layer preferences, making the proposed framework a low-complexity, scalable, and explainable interface for adapting WFMs to diverse downstream scenarios.

URL PDF HTML ☆

赞 0 踩 0

2606.10276 2026-06-10 cs.RO cs.AI 新提交

Hierarchical Policies from Verbal and Egocentric Human Signals for Natural Human-Robot Interaction

基于语言和自我中心人类信号的分层策略用于自然人机交互

Dongjun Lee, Juheon Choi, Dong Kyu Shin, Sinjae Kang, Kimin Lee

发表机构 * KAIST（韩国科学技术院）； Seoul National University（首尔国立大学）

AI总结提出EDITH框架，通过智能眼镜捕捉人类第一人称视角、注视和语言信号，设计分层策略将非语言信号与语言指令结合，实现更自然的人机交互，减少用户表达意图的负担。

Comments We provide video demos and code in: https://project-edith.github.io

详情

AI中文摘要

为了实现自然的人机交互，机器人必须理解人类不仅通过语言，还通过手势和注视等非语言信号表达的意图。然而，当前的机器人策略仅依赖语言指令作为传达意图的唯一接口，忽略了非语言信号，将全部沟通负担放在语言上。在这项工作中，我们提出了EDITH，一个机器人框架，通过智能眼镜的连续第一人称视角和注视流捕捉人类的非语言信号，并将其与语言指令一起作为机器人策略的输入。我们的硬件系统实时将人类的第一人称视角、注视和语音传输给机器人，并将语音转录为语言指令。为了处理这些丰富但嘈杂的信号，我们设计了一个分层策略，其中高层策略推断人类的意图并生成一系列子任务，每个子任务表示为一个细粒度指令，配有一个关键帧，将意图锚定在场景中（例如，人类指向目标物体的帧）。然后低层策略执行这些子任务。在我们的人机交互任务实验中，即使意图仅被短暂表达，EDITH也能使机器人根据人类的非语言信号行动，并且与仅使用语言指令相比，显著减少了用户传达意图的努力。请访问我们的项目页面获取源代码和真实机器人演示视频。

英文摘要

For natural human-robot interaction, a robot must understand human intent expressed not only through language but also through nonverbal signals such as gestures and gaze. However, current robot policies rely on language instructions as the sole interface for conveying intent, leaving nonverbal signals unused and placing the full burden of communication. In this work, we present EDITH, a robot framework that captures the human's nonverbal signals through continuous streams of first-person view and gaze from smart glasses, and uses them alongside language instructions as inputs to the robot policy. Our hardware system streams the human's first-person view, gaze, and speech to the robot in real time, transcribing the speech into language instructions. To handle these rich but noisy signals, we design a hierarchical policy in which a high-level policy infers the human's intent and produces a sequence of subtasks, where each subtask is represented as a fine-grained instruction paired with a keyframe that grounds the intent in the scene (e.g., the frame where the human points at the target object). A low-level policy then executes these subtasks. In our experiments on human-robot interactive tasks, EDITH enables the robot to act on the human's nonverbal signals even when intent is expressed only briefly, and significantly reduces user effort to convey intent compared to using language instructions alone. Visit our project page for source code and real-robot demo videos.

URL PDF HTML ☆

赞 0 踩 0

2606.10275 2026-06-10 cs.CV 新提交

FoA-SR: Faithful or Aesthetic? Profile-Aware Preference Optimization for Real-World Image Super-Resolution

FoA-SR: 忠实还是美观？面向真实世界图像超分辨率的轮廓感知偏好优化

Amjad Mahdi Alqarni, Peizhong Ju

发表机构 * Department of Computer Science（计算机科学系）； University of Kentucky（肯塔基大学）

AI总结提出FoA-SR，基于偏好优化实现真实世界图像超分辨率，通过忠实和美观两种轮廓分别优化适配器，在RealSR和DIV2K上验证了可分离的恢复策略。

Comments 17 pages, 6 figures, 9 tables. Preprint

详情

AI中文摘要

真实世界图像超分辨率（SR）通常设计为单一恢复目标，尽管当前生成模型能够为同一输入产生多个高质量重建。本文认为，最佳恢复策略取决于特定的恢复轮廓：忠实恢复优先考虑参考一致性、结构保持和幻觉抑制，而美观恢复优先考虑视觉愉悦和自然细节。我们提出FoA-SR，一种基于轮廓的新型真实世界SR偏好优化方法。为实现此目标，FoA-SR从我们的监督式FLUX.2-based SR适配器（Flux2SR）开始，该适配器通过LR潜在条件、流匹配和图像空间重建损失进行配对LR到HR图像超分辨率训练。在开发共享监督式超分辨率适配器后，FoA-SR为每个输入图像生成共享随机候选池，并使用轮廓特定的忠实和美观奖励对相同候选进行排序，以挖掘胜者-败者对。这些对用于微调单独的LoRA适配器，同时保持基础模型冻结。在RealSR和DIV2K上的实验表明，FoA-SR可以将同一SR适配器导向不同的恢复目标：忠实适配器改善参考一致性指标，而美观适配器提升无参考感知质量指标。我们的候选池分析显示，忠实和美观奖励经常选择不同的胜者，而Hybrid-LoRA消融表明，将两个轮廓合并为一个奖励会产生隐式折衷，而非显式轮廓控制。

英文摘要

Real-world image super-resolution (SR) is often designed with a single restoration objective, despite the current capacity of generative models to produce multiple high-quality reconstructions for the same input. In this paper, we argue that the best restoration strategy is subject to the specific restoration profile: a Faithful restoration prioritizes reference consistency, structure preservation, and hallucination suppression, whereas an Aesthetic restoration prioritizes visually pleasing and natural-looking details. We propose FoA-SR, a novel preference optimization approach to real-world SR based on profiles. To achieve this goal, FoA-SR starts with our supervised FLUX.2-based SR adapter (Flux2SR) trained with LR latent conditioning, flow matching, and image-space reconstruction losses for paired LR-to-HR image super-resolution. Following the development of the shared supervised super-resolution adapter, FoA-SR generates a shared stochastic candidate pool for each input image and ranks the same candidates using profile-specific Faithful and Aesthetic rewards to mine winner-loser pairs. These pairs are used to fine-tune separate LoRA adapters while keeping the base model frozen. Experiments on RealSR and DIV2K show that FoA-SR can steer the same SR adapter towards distinct restoration objectives: a Faithful adapter improves reference-consistent metrics while an Aesthetic adapter boosts metrics that measure perceptual quality without reference. Our candidate-pool analysis shows that Faithful and Aesthetic rewards frequently select different winners, and a Hybrid-LoRA ablation shows that collapsing both profiles into one reward yields an implicit compromise rather than explicit profile control.

URL PDF HTML ☆

赞 0 踩 0

2606.10273 2026-06-10 cs.RO 新提交

Locomotion analysis of a quadruped interacting with the lunar granular surface

四足机器人月球颗粒表面交互的运动分析

Yash J Vyas

发表机构 * Department of Industrial Engineering, University of Padua（工业工程系，帕多瓦大学）

AI总结通过强化学习训练四足机器人在模拟月球颗粒表面运动，对比刚性与软接触环境下的步态和能耗，发现软接触增加训练难度、改变步态并提高能量消耗。

详情

AI中文摘要

在星外环境中部署腿式机器人面临许多挑战，包括复杂的地形交互、能量和热约束。为了有效设计月球探测四足机器人的机械结构，需要仔细考虑电机扭矩、能量消耗和运输成本。月球表面由颗粒状风化层组成，这会影响腿式机器人的运动及其性能。基于刚性接触假设训练的运动算法在应用于软接触环境（如颗粒表面）时也无效，可能导致不稳定和跟踪不良。在本报告中，将月球颗粒表面-机器人足部接触的物理建模应用于使用强化学习训练运动的仿真环境。对在刚性接触和软接触环境下训练的策略进行比较，分析步态和运动性能指标。分析表明，模拟风化层表面的软接触给基于强化学习的训练带来了额外挑战，导致步态定性差异，并增加了总体能量消耗。

英文摘要

Deploying legged robots in extra-terrestrial environments includes many challenges due to complex terrain interactions, energy, and thermal constraints. For effective mechanical design of a lunar exploration quadrupedal robot, careful consideration of motor torques, energy expenditure, and cost of transport is required. The lunar surface is composed of granular regolith, which impacts the locomotion of legged robots and their performance. Locomotion algorithms trained with rigid contact assumptions are also ineffective when applied to environments with soft contacts, such as granular surfaces, which can result in instability and poor tracking. In this report, the physical modelling of the granular lunar surface-robot foot contacts is applied to a simulation environment with locomotion trained using Reinforcement Learning. A comparison is conducted between the policy trained on rigid contact and soft contact environments, analysing the gait and locomotion performance metrics. The analysis demonstrates that soft contacts simulating regolith surfaces pose additional challenges for Reinforcement Learning based training, result in a qualitatively different gait, and increase the overall energy expenditure.

URL PDF HTML ☆

赞 0 踩 0

2606.10267 2026-06-10 cs.RO cs.AI cs.LG 新提交

What Matters in Orchestrating Robot Policies: A Systematic Study of Hierarchical VLA Agents

机器人策略编排的关键因素：分层VLA智能体的系统研究

Jiaheng Hu, Mohit Shridhar, Caden Lu, Dhruv Shah, Hao-Tien Lewis Chiang, Jie Tan, Annie Xie

发表机构 * Google DeepMind（谷歌深Mind）

AI总结系统研究分层视觉-语言-动作（Hi-VLA）系统的设计原则，通过统一框架分析规划器、控制器及接口机制对短时、长时及推理密集型任务性能的影响，提出构建更强健分层VLA智能体的实用原则。

详情

AI中文摘要

分层视觉-语言-动作（Hi-VLA）系统已成为复杂机器人操作的一种有前景的范式，它通过使用高层VLM规划器将任务分解为语言子目标，由低层VLA控制器执行。尽管近期取得了实证进展，但这些系统缺乏统一的设计原则：现有的Hi-VLA系统在选择和连接规划器、控制器、两者之间的切换机制以及规划器中观测和记忆的表示方式上存在差异。在本文中，我们对机器人操作的Hi-VLA设计进行了系统研究。我们将代表性的Hi-VLA智能体统一在一个选项式控制框架下，并在短时、长时和推理密集型任务上基准测试核心设计选择。我们的分析提炼出构建Hi-VLA系统的实用原则，展示了模型选择和接口机制如何共同塑造性能。应用这些原则，在仿真和真实ALOHA机器人上的实验中，我们得到了一个比平面VLA控制或朴素设计的分层系统都显著更强的系统。总体而言，我们的结果为构建更强大、更鲁棒且更有原则的分层VLA智能体奠定了基础。更多信息和视频请访问此http URL。

英文摘要

Hierarchical vision-language-action (Hi-VLA) systems have emerged as a promising paradigm for complex robot manipulation, by using high-level VLM planners to decompose tasks into language subgoals executed by low-level VLA controllers. Despite recent empirical progress, there is a lack of unified design principles for these systems: existing Hi-VLA systems differ in how they choose and connect planners, controllers, mechanisms to switch between the two, and how observations and memory are represented in the planner. In this paper, we present a systematic study of Hi-VLA design for robot manipulation. We unify representative Hi-VLA agents under an options-style control framework and benchmark core design choices across short-horizon, long-horizon, and reasoning-intensive tasks. Our analysis distills practical principles for building Hi-VLA systems, showing how model choices and interface mechanisms jointly shape performance. Applying these principles yields a substantially stronger system than either flat VLA control or a naively designed hierarchy, across experiments both in simulation and on a real ALOHA robot. Overall, our results provide a foundation for building more capable, robust, and principled hierarchical VLA agents. More information and video at jiahenghu.github.io/hi-vla.

URL PDF HTML ☆

赞 0 踩 0

2606.10254 2026-06-10 cs.AI cs.CL 新提交

RealMath-Eval: Why SOTA Judges Struggle with Real Human Reasoning

RealMath-Eval：为何SOTA裁判难以应对真实人类推理

Yiteng Mao, Kenan Xu, Yijia Lyu, Wenhao Li, Jianlong Chen, Xiangfeng Wang

发表机构 * University of Wisconsin–Madison（威斯康星大学麦迪逊分校）； East China Normal University（华东师范大学）； New York University（纽约大学）； Tongji University（同济大学）； The Chinese University of Hong Kong, Shenzhen（香港中文大学（深圳））

AI总结提出RealMath-Eval基准，评估LLM裁判对真实学生数学解答的评分能力，发现与人类评分存在高均方误差，而合成数据上表现更好，揭示评估差距源于人类错误空间的多样性和高信息熵。

Comments Code available at https://github.com/RicharMd/RealMath-Eval , Data available at https://huggingface.co/datasets/RicharMd/RealMath-Eval

详情

AI中文摘要

尽管大型语言模型（LLM）在\emph{解答}高中数学方面已接近完美，但它们\emph{评估}真实学生多样化推理过程的能力仍未得到充分检验。为弥补这一差距，我们引入了\textbf{RealMath-Eval}，一个严格标注的基准，包含224份来自高中的真实考试答卷。我们的初步评估显示，即使是最先进的LLM裁判在此任务上也表现不佳，与人类专家评分相比呈现出高均方误差（$\sim$2.96）。为探究可能的原因，我们将此表现与同一裁判评估合成LLM生成解答的控制设置进行对比。我们识别出一个明显的“评估差距”：裁判在合成文本上准确性和一致性显著更高（MSE $\sim$1.17），但难以泛化到真实学生推理。通过语义嵌入分析，我们发现合成错误会“结构坍缩”为可预测的低维线性子空间，而人类错误则形成更多样的错误空间。此外，生成概率探测表明，人类推理涉及显著更高的信息论惊喜度，表明学生推理转换对当前模型而言更加分布外。最后，我们发现表面层面的风格迁移无法弥合这一差距。我们的发现表明，当前严重依赖合成数据的LLM评估流程可能无法充分捕捉真实学生数学推理的多样性。

英文摘要

While Large Language Models (LLMs) have achieved near-perfect performance in \emph{solving} high-school mathematics, their ability to \emph{evaluate} the diverse reasoning processes of real human students remains under-examined. To bridge this gap, we introduce \textbf{RealMath-Eval}, a rigorously annotated benchmark of 224 real-world exam responses from high schools. Our initial evaluation reveals that even state-of-the-art LLM judges struggle significantly on this task, exhibiting a high Mean Squared Error ($\sim$2.96) against expert human grading. To probe a plausible explanation, we contrast this performance with a control setting where the same judges evaluate synthetic LLM-generated solutions. We identify a stark ``Evaluation Gap'': judges are considerably more accurate and consistent on synthetic text (MSE $\sim$1.17) but struggle to generalize to authentic student reasoning. Through semantic embedding analysis, we find that synthetic errors suffer from a ``structural collapse'' into predictable, low-dimensional linear subspaces, whereas human errors form a more diverse error space. Furthermore, generative probability probes suggest that human reasoning involves significantly higher information-theoretic surprisal, indicating that student reasoning transitions are more out-of-distribution for current models. Finally, we find that surface-level style transfer fails to close this gap. Our findings suggest that current LLM evaluation pipelines relying heavily on synthetic data may not adequately capture the diversity of authentic student mathematical reasoning.

URL PDF HTML ☆

赞 0 踩 0

2606.10250 2026-06-10 cs.LG cs.AI 新提交

Multi-Level Analyzation of Imbalance to Resolve Non-IID-Ness in Federated Learning

联邦学习中不平衡的多层次分析以解决非独立同分布问题

Haengbok Chung, Jae Sung Lee

发表机构 * Interdisciplinary Program in Artificial Intelligence, Seoul National University, Republic of Korea（人工智能跨学科项目，首尔国立大学，韩国）； Department of Nuclear Medicine, Seoul National University College of Medicine, Republic of Korea（核医学系，首尔国立大学医学院，韩国）； Brightonix Imaging Inc. Seoul, Republic of Korea（Brightonix影像公司，首尔，韩国）

AI总结提出FedBB算法，通过PNB损失函数和CBR重加权分别解决本地训练中的类内/类间不平衡和客户端间不平衡，在X射线和自然图像数据集上优于现有方法。

Comments 27 pages, 5 figures, 13 tables. Accepted for publication in Neurocomputing (2025). Author Accepted Manuscript

详情

DOI: 10.1016/j.neucom.2025.129528
Journal ref: Neurocomputing, Volume 626, 2025, Article 129528

AI中文摘要

类别不平衡是深度学习中常见的问题，会严重降低性能。在联邦学习（FL）中，它是导致非独立同分布数据（non-IID）的关键因素。基于先前的一些尝试，我们在三个层次上定义并分析了FL中的不平衡问题：案例间、类别间和客户端间。案例间不平衡处理每个单一类别内的不平衡；类别间不平衡比较不同类别之间的数据数量。客户端间不平衡表示不同客户端之间本地数据的偏斜程度。基于这些概念，我们提出了FedBB，它由两个主要部分组成：（1）正负平衡（PNB）损失函数解决了本地训练中的案例间和类别间不平衡，增强了高度偏斜的本地客户端数据集上的泛化能力。它通过为少数案例或类别分配更高的权重来优化多标签和多类分类。（2）客户端平衡重加权（CBR）在模型聚合期间根据客户端间不平衡重新加权客户端，为在偏斜较小的数据集上训练的模型赋予更大的权重。在X射线和自然图像数据集上的各种实验表明，FedBB在性能和效率上均优于其他算法。此外，它只需要有限的统计信息，这有利于隐私保护。通过消融研究，我们证明了PNB损失和CBR独立地贡献于性能。由于FedBB旨在构建一个能准确分类所有类别的全局模型，它可以作为通用和个性化FL的基线。

英文摘要

Class imbalance is a common problem in deep learning that severely degrades performance. In federated learning (FL), it is a critical factor contributing to non-identically distributed data (non-IID). Building on several previous attempts, we define and analyze imbalance issues in FL at three levels: inter-case, inter-class, and inter-client. Inter-case imbalance addresses the imbalance in every single class; inter-class imbalance compares the number of data between different classes. Inter-client imbalance represents different skewness of local data between clients. Based on these concepts, we propose FedBB, which consists of two main components: (1) Positive Negative Balanced (PNB) loss function addresses the inter-case and inter-class imbalances in local training, enhancing generalization on highly skewed local client datasets. It optimizes both multi-label and multi-class classifications by assigning higher weights to minority cases or classes. (2) Client Balanced Reweighting (CBR) reweights clients based on inter-client imbalance during model aggregation, giving greater weight to models trained on less skewed datasets. Various experiments on X-ray and natural image datasets demonstrate that FedBB outperforms other algorithms in both performance and efficiency. Additionally, it requires limited statistical information, which is beneficial for privacy protection. Through ablation studies, we proved that PNB loss and CBR independently contribute to performance. As FedBB aims to build a global model that accurately classifies all classes, it can serve as a baseline for the generic and personalized FL.

URL PDF HTML ☆

赞 0 踩 0

2606.10249 2026-06-10 cs.LG cs.SI 新提交

When Design Rules Break: Benchmark Composition Determines Whether Label Informativeness Predicts GNN Aggregator Choice

当设计规则失效：基准组成决定标签信息性是否预测GNN聚合器选择

Neha Sharma, Ritesh Sharma

发表机构 * Department of Computer Science（计算机科学系）； Virginia Commonwealth University（弗吉尼亚 Commonwealth 大学）； Department of Electrical and Computer Engineering（电气与计算机工程系）

AI总结研究图神经网络聚合器选择（sum/mean/max）在24个节点分类数据集上的泛化性，发现标签信息性仅在传统基准上有效，在Facebook-100密集图中失效，且谱间隙能区分这些图。

详情

AI中文摘要

我们通过研究在24个节点分类数据集（涵盖引文、异嗜、LINKX Facebook-100、共同购买和共同作者图）上的聚合器选择（sum、mean、max），检验图神经网络（GNN）设计规则是否跨基准族泛化。边同嗜性仅能微弱预测GIN-Sum与GIN-Mean的性能差距。标签信息性在传统基准上能很好地预测这一差距，但当包含Facebook-100图时，预测能力大幅下降。在这些密集的朋友关系网络中，接近零的标签信息性与对sum聚合的强烈偏好共存，在扩展训练下产生7-10%的提升，最高达13%。随机块模型消融实验（包括匹配Facebook-100度规模的度修正变体）未能重现这一行为，表明平均度本身不能解释该效应。在若干与标签无关的图统计量中，谱间隙唯一地将这些图与其他低信息性数据集区分开来，该效应局限于单跳邻域并在不同架构中复现。我们进一步识别了与聚合器选择交互的训练机制，并表明PNA在标准引文基准上可能不如最佳单聚合器GIN。我们的结果表明，决定设计规则是否看似泛化的是基准组成而非数值不足，并且Facebook-100基准为未来的自适应聚合方法提供了具体目标。

英文摘要

We examine whether graph neural network (GNN) design rules generalize across benchmark families by studying aggregator selection (sum, mean, max) on 24 node-classification datasets spanning citation, heterophilic, LINKX Facebook-100, co-purchase, and co-authorship graphs. Edge homophily is only weakly predictive of the GIN-Sum versus GIN-Mean performance gap. Label informativeness predicts this gap well on legacy benchmarks but degrades substantially when Facebook-100 graphs are included. In these dense friendship networks, near-zero label informativeness coexists with a strong preference for sum aggregation, producing gains of 7-10% and up to 13% under extended training. Stochastic block model ablations, including degree-corrected variants matching Facebook-100 degree scales, fail to reproduce this behavior, indicating that mean degree alone does not explain the effect. Among several label-independent graph statistics, the spectral gap uniquely distinguishes these graphs from other low-informativeness datasets, with the effect localized to one-hop neighborhoods and replicated across architectures. We further identify training regimes that interact with aggregator choice and show that PNA can underperform the best single-aggregator GIN on standard citation benchmarks. Our results suggest that benchmark composition, rather than numerical insufficiency, determines whether design rules appear to generalize, and that the Facebook-100 regime provides a concrete target for future adaptive aggregation methods.

URL PDF HTML ☆

赞 0 踩 0

2606.10246 2026-06-10 cs.SD cs.AI cs.LG 新提交

Linguistically Augmented Audio Speech Data (LinguAS)

语言增强音频语音数据 (LinguAS)

Ashley R. Keaton, Zahra Khanjani, Christine Mallinson, Vandana P. Janeja

发表机构 * University of Maryland, Baltimore County（马里兰大学巴尔的摩分校）

AI总结提出LinguAS数据集，通过专家定义的语言特征(EDLFs)增强音频数据，显著提升深度伪造语音检测模型性能。

详情

AI中文摘要

恶意创建的伪造语音，包括深度伪造和欺骗音频，正以惊人速度扩散，检测模型竞相保持领先。然而，大多数检测模型仅基于帧级音频特征进行推理，未利用更大时间尺度上的有价值语言线索。为弥补这一空白，我们提出语言增强音频语音数据(LinguAS)，这是一个包含真实和深度伪造音频样本的数据集，标注了五种策略性选择的、专家定义的语言特征(EDLFs)，这些特征在英语口语中频繁出现且是自然人类语音的特征。LinguAS包含超过800个音频样本，每个样本都标注了EDLFs。数据集包含四种欺骗音频攻击类型的平衡数量以及相应数量的真实语音样本。我们还包含说话者性别和每个欺骗音频样本的生成器/来源元数据，为模型训练提供更细粒度信息。我们发现，使用EDLFs增强数据训练的模型性能显著超过ASVspoof 2021深度学习基线和HuBert、XLSR等SSL模型。LinguAS增强的语言、性别和生成器元数据为音频深度伪造研究者提供了一个强调真实人类语言特征的数据集，以改进伪造语音的模型推理。数据和代码已公开。

英文摘要

Maliciously-created fake speech, including deepfaked and spoofed audio, is proliferating at an alarming rate, and detection models are racing to stay ahead of the curve. Yet, most detection models are trained to make inference on frame-level audio features alone without leveraging valuable linguistic cues at larger timescales. To address this gap, we present Linguistically Augmented Audio Speech Data (LinguAS), a dataset of genuine and deepfaked audio samples annotated with five strategically-chosen, Expert-Defined Linguistic Features (EDLFs) that occur frequently in spoken English and are characteristic of natural human speech. LinguAS contains over 800 audio samples, each of which are annotated with EDLFs. The dataset has a balanced number of four spoofed audio attack types and a proportionate number of genuine speech samples. We also include metadata on speaker gender and the generator/source for each spoofed audio sample, offering more granularity for model training. We found that models trained on data augmented with EDLFs had improved model performance significantly beyond the ASVspoof 2021 deep learning baselines and SSL models like HuBert and XLSR. LinguAS's augmented linguistic, gender, and generator metadata provide audio deepfake researchers with a dataset that emphasizes real human language traits to improve model inference of faked speech. Data and code are publicly available.

URL PDF HTML ☆

赞 0 踩 0

2606.10209 2026-06-10 cs.AI cs.LG cs.SE 新提交

Less Context, Better Agents: Efficient Context Engineering for Long-Horizon Tool-Using LLM Agents

更少上下文，更优智能体：面向长周期工具使用LLM智能体的高效上下文工程

Abhilasha Lodha, Mahsa Pahlavikhah Varnosfaderani, Abir Chakraborty, Abhinav Mithal

发表机构 * Microsoft（微软）

AI总结针对企业工具使用场景中上下文过长导致的问题，提出选择性保留最近工具交互并添加紧凑摘要的方法，在费用明细任务上将完成率从71.0%提升至91.6%，同时大幅降低token消耗和运行时间。

Comments 17 pages, 3 figures, 8 tables

详情

AI中文摘要

部署为自主智能体用于企业工作流的大型语言模型面临一个关键挑战：来自企业系统的冗长工具响应可能导致上下文溢出、状态过时错误和高推理成本。我们在Microsoft Dynamics 365 Finance and Operations中使用Model Context Protocol工具研究自动费用明细化问题。我们在一个包含50个任务的酒店费用基准上评估了四种GPT-5配置：无用户模型、完整对话历史、上下文裁剪至最近5个工具调用/响应对、以及裁剪加自动摘要。结果在5次独立运行中取平均，用户模型在上下文工程比较中保持不变。无用户模型基线仅达到8.0%的完全明细化。完整上下文保留将完成率提升至71.0%，但每次基准测试消耗1,480,996个token和14.56小时。裁剪至最近5个工具调用将完成率提升至79.0%，同时将token使用降至535,274个，运行时间降至5.39小时。添加摘要实现了最佳结果：91.6%的完全明细化和99.64%的平均明细金额，使用553,374个token和5.79小时。我们进一步报告了置信区间、效应量分析、裁剪和摘要窗口的敏感性、失败分析、按三个类别分组的五种费用类型的结果，以及使用Claude Sonnet 4.5的跨模型证据。这些结果表明，对于这类企业工具使用工作流，选择性保留最近的工具交互加上紧凑摘要，与保留完整历史相比，可以提高可靠性和效率。

英文摘要

Large language models deployed as autonomous agents for enterprise workflows face a key challenge: verbose tool responses from enterprise systems can cause context overflow, stale-state errors, and high inference cost. We study this problem in automated expense itemization in Microsoft Dynamics 365 Finance and Operations using Model Context Protocol tools. We evaluate four GPT-5 configurations on a 50-task hotel expense benchmark: no user model, full conversation history, context pruned to the last 5 tool call/response pairs, and pruning with automated summarization. Results are averaged across 5 independent runs, with the user model held constant for the context-engineering comparison. The no-user-model baseline achieves only 8.0% complete itemization. Full-context retention improves completion to 71.0%, but consumes 1,480,996 tokens and 14.56 hours per benchmark. Pruning to the last 5 tool calls improves completion to 79.0% while reducing token use to 535,274 and runtime to 5.39 hours. Adding summarization achieves the best result: 91.6% complete itemization and 99.64% average amount itemized, with 553,374 tokens and 5.79 hours. We further report confidence intervals, effect-size analysis, sensitivity over pruning and summary windows, failure analysis, results across five expense types grouped into three categories, and cross-model evidence with Claude Sonnet 4.5. These results show that, for this class of enterprise tool-use workflow, selective retention of recent tool interactions plus compact summarization can improve both reliability and efficiency compared with full-history retention.

URL PDF HTML ☆

赞 0 踩 0

2606.10126 2026-06-10 cs.CL cs.AI cs.CY cs.LG 新提交

Pareto-Guided Teacher Alignment for Fair Personalized Text Generation

帕累托引导的教师对齐用于公平个性化文本生成

Tunazzina Islam

发表机构 * Purdue University（普渡大学）

AI总结提出帕累托引导的教师对齐框架，通过修订候选生成、对感知可行性门控、帕累托候选选择和偏好优化，在减少人口统计差异的同时保持个性化保真度，实验表明公平缓解效果依赖于目标且跨域迁移不一致。

详情

AI中文摘要

个性化说服性文本生成可以提高相关性和参与度，但人口统计条件也可能引入跨群体的不平等框架。我们将个性化生成中的公平缓解研究为一个受约束的多目标对齐问题：在保持个性化保真度的同时减少人口统计差异。我们提出一个帕累托引导的教师对齐框架，结合了基于修订的候选生成、对感知可行性门控、帕累托风格的候选选择，以及通过监督微调和直接偏好优化的可选偏好优化。我们在气候变化和疫苗接种说服任务上评估该框架，使用一个受控的上下文丰富的人口统计网格（匹配性别和年龄对）以及一个统一的五审计评估套件，涵盖说服偏见、形式差异、情感框架差异、词汇关联差异和个性化保真度。在两个领域和跨族系迁移设置中，没有单一的对齐策略能同时主导所有目标。相反，方法占据了公平-个性化帕累托前沿的不同区域：一些方法实现更强的差异减少，而另一些则更好地保持个性化或人口统计稳定性。我们的结果表明，公平缓解效果依赖于目标，并在领域和模型族系间不一致地迁移，这促使在公平敏感的个性化生成中采用有界回归、多审计模型选择而非单指标优化。

英文摘要

Personalized persuasive text generation can improve relevance and engagement, but demographic conditioning may also introduce unequal framing across groups. We study fairness mitigation in personalized generation as a constrained multi-objective alignment problem: reduce demographic disparities while preserving personalization fidelity. We propose a Pareto-guided teacher alignment framework that combines revision-based candidate generation, pair-aware feasibility gating, Pareto-style candidate selection, and optional preference optimization through supervised fine-tuning and direct preference optimization. We evaluate the framework on climate change and vaccination persuasion tasks using a controlled context-rich demographic grid with matched gender and age pairs and a unified five-audit evaluation suite spanning persuasion bias, formality disparity, emotional framing disparity, lexical association disparity, and personalization fidelity. Across both domains and cross-family transfer settings, no single alignment strategy dominates all objectives simultaneously. Instead, methods occupy different regions of a fairness-personalization Pareto frontier: some achieve stronger disparity reductions, while others better preserve personalization or demographic stability. Our results show that fairness mitigation effects are objective-dependent and transfer inconsistently across domains and model families, motivating bounded-regression, multi-audit model selection over single-metric optimization for fairness-sensitive personalized generation.

URL PDF HTML ☆

赞 0 踩 0

2606.09900 2026-06-10 cs.CL cs.AI cs.IR cs.LG 新提交

Less Context, More Accuracy: A Bi-Temporal Memory Engine for LLM Agents Where a Lean Retrieved Context Beats the Full History

更少上下文，更高准确率：一种用于LLM Agent的双时间记忆引擎，其中精简检索上下文优于完整历史

Liuyin Wang

发表机构 * Independent Researcher（独立研究者）

AI总结提出一种双时间记忆引擎Engram，通过混合读取路径从约9.6k token的检索片段中回答，在LongMemEval_S上达到83.6%准确率，比完整历史（79k token）高10.4个百分点，且无错误。

Comments 14 pages, 4 figures, 3 tables. Code, reproducible harness, and raw per-question logs: https://github.com/ly-wang19/engram

详情

AI中文摘要

长期记忆是LLM Agent缺失的一层：跨会话时它们会遗忘，而常见的解决方法——将整个历史重放到提示中——成本高、速度慢，且随着干扰物积累，准确性下降。大多数记忆系统在成本或延迟上胜出，但在准确性上仍不如完整上下文基线，且基准测试结果在不一致、不可复现的测试平台上报告，导致同一系统在不同来源上得分差异巨大。我们提出Engram，一种基于双时间数据模型的开源双过程记忆引擎。快速写入路径附加无损事件，无需LLM参与关键路径；异步路径提取原子（主体、谓词、客体）事实，构建双时间知识图谱，并解决矛盾，无需每个事实调用LLM——使事实失效而非删除，因此每个事实都有来源和继承链。混合读取路径融合密集、词汇、图谱和时效/显著性信号，应用时间点（“截至”）过滤器，并组装紧凑、带有来源标记的上下文。在完整的500个问题的LongMemEval_S上，由官方分类特定评判器评分，Engram的精简配置——从约9.6k token的检索片段回答，而非完整历史——得分为83.6%，而完整上下文为73.2%（+10.4个百分点，McNemar p < 10^-6），token数约为1/8（9.6k vs. 79k），且0/500错误。这种增益需要混合读取路径：仅事实会丢失召回率，而事实加检索片段则恢复细节。我们还贡献了一个中立的、仓库内的评估平台，内置官方评判器，并在每个表格中包含完整上下文基线，发布原始每问题日志，并记录了无声扭曲记忆基准的测量完整性陷阱（截断、自制评判器、完整历史泄露）。每个数字都附带复现命令。

英文摘要

Long-term memory is the missing layer for LLM agents: across sessions they forget, and the common workaround -- replaying the whole history into the prompt -- is expensive, slow, and, as distractors accumulate, less accurate. Most memory systems win on cost or latency but still lose to the full-context baseline on accuracy, and benchmark numbers are reported on inconsistent, non-reproducible harnesses, so one system appears at wildly different scores across sources. We present Engram, an open-source, dual-process memory engine on a bi-temporal data model. A fast write path appends lossless episodes with no LLM on the critical path; an asynchronous path extracts atomic (subject, predicate, object) facts, builds a bi-temporal knowledge graph, and resolves contradictions without an LLM call per fact -- invalidating, never deleting, so every fact keeps provenance and a supersession chain. A hybrid read path fuses dense, lexical, graph, and recency/salience signals, applies a point-in-time ("as-of") filter, and assembles a compact, provenance-tagged context. On the full 500-question LongMemEval_S, graded by the official category-specific judge, Engram's lean configuration -- answering from a ~9.6k-token retrieved slice, never the full history -- scores 83.6% vs. 73.2% for full-context (+10.4 points, McNemar p < 10^-6) at ~8x fewer tokens (9.6k vs. 79k), with 0/500 errored. The gain needs a hybrid read path: facts alone lose recall, while facts plus retrieved chunks recover detail. We also contribute a neutral, in-repo evaluation harness with the official judge baked in and the full-context baseline in every table, publish the raw per-question logs, and document the measurement-integrity pitfalls (truncation, home-grown judges, full-history leaks) that silently distort memory benchmarks. Every number ships with a command to reproduce it.

URL PDF HTML ☆

赞 0 踩 0

2606.09899 2026-06-10 cs.LG cs.AI 新提交

When Attribution Patching Lies: Diagnosis and a Second-Order Correction

当归因补丁说谎时：诊断与二阶修正

Luyang Zhang, Jialu Wang

发表机构 * Carnegie Mellon University（卡内基梅隆大学）； University of California, Santa Cruz（加州大学圣克鲁兹分校）

AI总结研究归因补丁（梯度一阶近似）在机制可解释性中的不可靠性，发现主要误差源于下游网络的非线性，并提出可靠性评分、误差界和HVP二阶修正方法。

Comments 30 pages, 12 figures

详情

AI中文摘要

机制可解释性的一个核心目标是识别哪些内部组件因果地驱动语言模型的行为。由于这些重要性估计作为识别电路的证据，系统性错误可能导致对底层机制的误识别。虽然激活补丁提供了黄金标准的因果度量，但其计算成本在大规模下难以承受。从业者转而依赖归因补丁，一种基于梯度的一阶近似，其可靠性尚不明确。在这项工作中，我们刻画了这种不可靠性的来源，证明主要误差源于下游网络的非线性，而非补丁组件的局部曲率。这一洞察产生了三个实用工具：(i) 检测不可信估计的可靠性评分，(ii) 量化潜在归因误规范的误差界，以及 (iii) 仅需一次额外反向传播即可消除主导误差的Hessian-向量-乘积（HVP）修正。在五个模型家族（124M-9B参数）以及随机令牌和自然（名称交换）扰动的评估中，HVP是唯一在大规模下可行的二阶修正，而标准基线如积分梯度在计算上变得不可行。在对比实验中，多步HVP变体以显著更低的计算量达到或超过积分梯度的准确性，优于先前的二阶基线。这些改进在标准基准上实现了更高保真度的电路恢复，并支持一种屏幕-标记-修复工作流，仅将计算努力针对被标记为不可靠的组件。

英文摘要

A central goal of mechanistic interpretability is to identify which internal components causally drive a language model's behavior. Because these importance estimates serve as the evidence for identifying circuits, systematic errors can lead to the misidentification of the underlying mechanisms. While activation patching provides a gold-standard causal metric, its computational cost is prohibitive at scale. Practitioners instead rely on attribution patching, a gradient-based, first-order approximation whose reliability remains poorly understood. In this work, we characterize the source of this unreliability, demonstrating that the dominant error stems from the non-linearities in the downstream network rather than local curvature at the patched component. This insight yields three practical tools: (i) a reliability score to detect untrustworthy estimates, (ii) error bounds quantifying potential attribution mis-specifications, and (iii) a Hessian-vector-product (HVP) correction that eliminates the leading-order error with only one additional backward pass. In evaluations across five model families (124M-9B parameters) and both random-token and naturalistic (name-swap) perturbations, HVP is the only second-order correction feasible at larger scale, where standard baselines like Integrated Gradients become computationally prohibitive. In comparative experiments, a multi-step HVP variant matches or exceeds the accuracy of Integrated Gradients at significantly lower compute, outperforming prior second-order baselines. These improvements lead to higher-fidelity circuit recovery on standard benchmarks and support a Screen-Flag-Fix workflow that targets computational effort only toward the components flagged as unreliable.

URL PDF HTML ☆

赞 0 踩 0

2606.09898 2026-06-10 cs.LG cs.MA q-bio.QM 新提交

TRAPS: Therapeutic Response Analysis via Pathway-informed Stratification

TRAPS: 基于通路信息分层的治疗反应分析

Sujoy Banik, Sayantan Chakraborty, Boishakhi Das Toma, Zainab Ghafoor, Ushashi Bhattacharjee, Koushik Howlader, Tirtho Roy

发表机构 * Rajshahi University of Engineering \& Technology Bangladesh ； University of Dhaka Bangladesh ； American International University of Bangladesh Bangladesh ； Sonoma State University United States ； Iowa State University United States ； Rajshahi University of Engineering \& Technology ； University of Dhaka ； American International University of Bangladesh ； Sonoma State University ； Iowa State University

AI总结提出首个统一基准，评估三种通路引导的深度学习模型在联合预测癌症治疗反应和生存率上的表现，发现不同模型在不同任务上各有优劣。

详情

AI中文摘要

癌症治疗规划需要同时考虑多个临床维度。临床医生必须确定患者是否应接受靶向分子治疗、放疗，以及他们是否可能存活超过六个月。现有的通路引导深度学习模型是孤立开发和测试的，无法进行跨架构的公平比较。我们提出了第一个用于通路引导治疗反应建模的统一基准，评估了三种生物信息学架构：BINN、GraphPath 和 PATH，使用了来自癌症基因组图谱的五个癌症队列，代表 2,622 名患者，这些患者使用 Reactome 通路活性评分进行编码。每个模型在相同的数据和评估条件下联合训练所有三个临床结果，这是第一项将通路结构化深度学习视为联合治疗和生存预测问题的研究。我们的结果表明，没有一个架构在所有任务中获胜：PATH 在整体靶向分子治疗预测中表现最佳，BINN 在生存预测中最可靠，而没有一个模型能对放疗产生有用的预测，因为该决策的关键驱动因素是基因表达数据中未捕获的临床变量。最引人注目的是，GraphPath 在前列腺靶向分子治疗预测中达到了 0.92 的 AUROC，是整个基准中的最高分，这表明当与具有狭窄靶向驱动程序的队列匹配时，即使存在极端类别不平衡（仅 11% 阳性率），横向共调控结构也能产生卓越的判别能力。

英文摘要

Cancer treatment planning requires decisions across multiple clinical dimensions at once. Clinicians must determine whether a patient should receive targeted molecular therapy, radiation therapy, and whether they are likely to survive beyond six months. Existing pathway-informed deep learning models have been developed and tested in isolation, making fair comparison across architectures impossible. We present the first unified benchmark for pathway-guided therapy response modeling, evaluating three biologically informed architectures, BINN, GraphPath, and PATH, across five cancer cohorts drawn from The Cancer Genome Atlas, representing 2,622 patients encoded using Reactome pathway activity scores. Each model is trained jointly on all three clinical outcomes under identical data and evaluation conditions, the first study to treat pathway-structured deep learning as a combined therapy and survival prediction problem. Our results show that no single architecture wins across all tasks: PATH performs best for targeted molecular therapy prediction overall, BINN is most reliable for survival prediction, and no model produces useful predictions for radiation therapy, as the key drivers of that decision are clinical variables not captured in gene expression data. Most strikingly, GraphPath achieves an AUROC of 0.92 on prostate targeted molecular therapy prediction, the highest score in the entire benchmark, demonstrating that lateral co-regulation structure produces exceptional discriminative power when matched to a cohort with a narrow targetable driver programme, even under conditions of extreme class imbalance at only 11\% positive prevalence.

URL PDF HTML ☆

赞 0 踩 0

2606.09894 2026-06-10 cs.LG cs.CL 新提交

A Navigable Manifold of Hypothesized Consciousness-Spectrum States in Language Model Representations

语言模型表示中假设的意识谱状态的可导航流形

Sophie Zhao

发表机构 * School of Computer Science（计算机科学学院）； Georgia Institute of Technology（佐治亚理工学院）

AI总结研究语言模型嵌入空间中与意识谱相关的几何结构，发现嵌入形成可导航流形，高低层区域稳定，中间为过渡走廊，导航性为内在属性。

详情

AI中文摘要

在沉思、哲学和心理学描述中，人类意识常被描述为从反应性和自我聚焦模式到更整合和连贯模式的类似谱系。理解语言模型是否在表示空间中编码了这种结构化、人类可解释的意识谱系，对于模型引导、评估和对齐具有重要意义。在这项工作中，我们研究了Transformer嵌入空间中沿该谱系的几何结构和动态模式。我们表明，嵌入表现出与该谱系对齐的全局组织几何：与相似状态相关的句子聚类成局部连贯区域，形成结构化流形。特别地，高层和低层区域表现出类似凸性的稳定性，而中间区域形成过渡走廊。在动态上，效用引导和纯几何贪婪轨迹都一致地从低层区域穿越到高层区域，经过中间层级，表明可导航性是表示空间的内在属性，由全局方向信号引导但非决定。这些结果表明，嵌入空间编码了与假设的意识谱分类法（广泛受沉思传统、哲学和现代心理学中人类意识反复出现的结构描述启发）对齐的结构化和可导航几何，为分析和引导模型行为提供了表示层面的视角。

英文摘要

Across contemplative, philosophical, and psychological accounts, human consciousness is often described along a similar spectrum, ranging from reactive and self-focused patterns to more integrative and coherent ones. Understanding whether language models encode such a structured, human-interpretable consciousness spectrum in representation space is important for model guidance, evaluation and alignment. In this work, we study the geometric structure and dynamics of patterns along this spectrum in transformer embedding spaces. We show that embeddings exhibit a globally organized geometry aligned with this spectrum: sentences associated with similar states cluster into locally coherent regions, forming a structured manifold. In particular, higher-level and lower-level regions exhibit convexity-like stability, while intermediate regions form a transition corridor. Dynamically, both utility-guided and geometry-only greedy trajectories consistently traverse from lower- to higher-level regions, passing through intermediate tiers, indicating that navigability is an intrinsic property of the representation space, guided but not dictated by a global directional signal. These results suggest that embedding spaces encode structured and navigable geometry aligned with a hypothesized consciousness-spectrum taxonomy, broadly inspired by recurring structural descriptions of human consciousness across contemplative traditions, philosophy, and modern psychology, providing a representation-level perspective for analyzing and guiding model behavior.

URL PDF HTML ☆

赞 0 踩 0

2606.09892 2026-06-10 cs.LG stat.ME 新提交

LMT: A Bayesian Framework for Causal Discovery from Textual Alarm Records in Manufacturing Systems

LMT: 制造系统中文本告警记录的因果发现贝叶斯框架

Xiaofeng Xiao, Jianhong Chen, Qiuzhuang Sun, Naichen Shi, Xubo Yue

发表机构 * Department of Mechanical & Industrial Engineering, Northeastern University, Boston, MA, USA（东北大学机械与工业工程系）； College of Integrative Studies, Singapore Management University, Singapore（新加坡国立大学整合研究学院）； Department of Industrial Engineering and Management Sciences, Department of Mechanical Engineering, Northwestern University, IL, USA（西北大学工业工程与管理科学系、机械工程系）

AI总结提出LMT框架，结合大语言模型提取的语义信号和基于泊松过程的时间证据，通过贝叶斯方法从文本告警记录中发现因果图，在小样本场景下表现优异。

Comments 19 pages

详情

AI中文摘要

文本事件记录（如告警日志）已成为工程和制造系统中越来越常见的数据源。除了识别相关性或重复模式外，工程师通常有兴趣了解在系统运行过程中哪些类型的事件因果性地触发或影响其他事件。文本事件描述可能包含关于此类因果关系的语义线索，而最近的大语言模型（LLM）为提取这些信号提供了有前景的工具。然而，仅依赖LLM编码的文本信息不足以进行准确的因果发现，因为语义模式并不直接揭示因果机制，并且可能将因果关系与相关性或频繁的顺序模式混淆。为了解决这些挑战，我们提出了\textbf{LMT}，一个用于工程事件数据的贝叶斯因果发现框架，它联合利用了文本描述和时间戳。具体来说，LMT首先使用LLM从事件描述中提取语义因果信号，并构建事件类型或事件簇之间因果图的先验分布。然后，它通过基于泊松过程的似然函数纳入时间证据，使得基于时间戳的统计证据能够精炼LLM信息先验。通过整合文本和时间信息，LMT生成一个既可解释又有数据支持的因果图。模拟研究表明，所提出的框架在不同设置下都是有效的，并且在样本量较小的告警事件场景中尤其具有优势。

英文摘要

Textual event records, such as alarm logs, have become an increasingly common data source in engineering and manufacturing systems. Beyond identifying correlations or recurring patterns, engineers are often interested in understanding which types of events causally trigger or influence other events during system operation. Textual event descriptions may contain semantic clues about such causal relationships, and recent large language models (LLMs) provide a promising tool for extracting these signals. However, relying solely on LLM-encoded textual information is insufficient for accurate causal discovery, since semantic patterns do not directly reveal causal mechanisms and may confuse causation with correlation or frequent sequential patterns. To address these challenges, we propose \textbf{LMT}, a Bayesian causal discovery framework for engineering event data that jointly leverages textual descriptions and timestamps. Specifically, LMT first uses LLMs to extract semantic causal signals from event descriptions and constructs a prior distribution over causal graphs among event types or event clusters. It then incorporates temporal evidence through a Poisson-process-based likelihood, allowing the LLM-informed prior to be refined by timestamp-based statistical evidence. By integrating the textual and temporal information, LMT produces a causal graph that is both interpretable and data-supported. Simulation studies show that the proposed framework is effective across different settings and is especially advantageous in small-sample alarm-event scenarios.

URL PDF HTML ☆

赞 0 踩 0

2606.09891 2026-06-10 cs.LG cs.AI cs.IR 新提交

Representation Curriculum: Stagewise Training for Robust Ranking and Allocation

表示课程：用于鲁棒排序和分配的分阶段训练

Ehsan Ebrahimzadeh, Sina Baharlouei, Abraham Bagherjeiran

发表机构 * eBay Search Ranking and Monetization（eBay搜索排名与变现）

AI总结提出表示课程（RC）方法，通过分阶段引入特征，先强调基于内容的信号，再引入依赖曝光的信号，减少对历史信号的捷径依赖，提升冷启动泛化性和鲁棒性。

Comments 12 pages, 5 figures

详情

DOI: 10.1145/3770855.3818470

AI中文摘要

数字市场中的排序是一种动态曝光分配机制：展示的物品塑造了发现轨迹和成功事件，平台记录这些事件以更新未来的分配策略。现代排序系统严重依赖曝光混杂信号（如流行度估计、CTR/CVR聚合和基于ID的表示），因为这些信号在静态需求下具有高度预测性。然而，这种预测能力可能成为一种学习捷径：早期访问依赖曝光的信念信号会使优化过度依赖它们，而忽视独立于曝光的价值信号（如基于内容的竞争力和语义亲和性）。因此，学习到的策略倾向于固化现有物品，并在分布偏移下降低冷启动泛化性和鲁棒性。我们提出表示课程（RC），一种训练时干预方法，按时间阶段安排特征使用。RC首先突出基于内容的价值信号，然后引入依赖曝光的信念信号，同时将内容路径锚定在学到的价值表示附近，从而抑制对历史信号的捷径依赖，并缓解内容信号上的梯度饥饿。我们形式化RC，使其独立于任务和假设类，并提供排序特定的实例化。在高斯线性岭回归设置中，我们推导出封闭形式解和充分条件，证明RC在冷启动目标分布上严格降低总体风险，并量化了与源性能的帕累托权衡。在公开的排序学习和推荐基准测试，以及大规模电商搜索系统中的随机在线实验中，RC显著地将依赖从历史信念信号转向基于内容的价值信号，并在头部性能可控权衡下，对冷群体带来一致的提升。

英文摘要

Ranking in digital marketplaces is a dynamic exposure-allocation mechanism: displayed items shape discovery trajectories and success events logged by the platform to update future allocation policies. Modern ranking systems rely heavily on exposure-confounded signals (e.g. popularity estimates, CTR/CVR aggregates, and ID-based representation), because they are highly predictive under stationary demand. Yet this predictive power can become a learning shortcut: early access to exposure-dependent belief signals steers optimization toward over-reliance on them and away from exposure-independent merit signals (e.g., content-based competitiveness and semantic affinity). Consequently, the learned policy tends to entrench incumbents and degrade cold-start generalization and robustness under distribution shift. We propose Representation Curriculum (RC), a training-time intervention that temporally stages feature utilization. RC foregrounds content-based merit signals initially, then introduces exposure-dependent belief signals while anchoring the content pathway near the learned merit representation, curbing shortcut reliance on historical signals and mitigating gradient starvation on content signals. We formalize RC independently of task and hypothesis class and provide ranking-specific instantiations. In a Gaussian linear ridge setting, we derive closed-form solutions and sufficient conditions under which RC strictly reduces population risk on a cold-start target distribution, with a quantified Pareto tradeoff against source performance. Experiments on public learning-to-rank and recommendation benchmarks, and randomized online experiments in a large-scale e-commerce search system, show that RC measurably shifts reliance from historical belief signals toward content-based merit signals and yields consistent gains on cold populations with a controlled trade-off in head performance.

URL PDF HTML ☆

赞 0 踩 0

2606.09890 2026-06-10 cs.LG cs.AI cs.CL 新提交

PreAct-Bench: Benchmarking Predictive Monitoring in LLMs

PreAct-Bench：大语言模型中的预测性监控基准

Hainiu Xu, Italo Luis da Silva, Jiangnan Ye, Yuhao Wang, Wei Liu, Linyi Yang, Jonathan Richard Schwarz, Nicola Paoletti, Yulan He, Hanqi Yan

发表机构 * King’s College London（伦敦国王学院）； National University of Singapore（新加坡国立大学）； Southern University of Science and Technology（南方科技大学）； Thomson Reuters Foundational Research（汤姆森路透基础研究）； Imperial College London（伦敦帝国学院）； The Alan Turing Institute（艾伦·图灵研究所）

AI总结提出预测性监控任务，在动作执行前判断是否会导致不道德行为，并构建PreActBench基准，评估多种模型发现该任务具有挑战性。

详情

AI中文摘要

大语言模型（LLMs）越来越多地被部署为能够执行多步动作轨迹以实现给定目标的自主代理。虽然现有的安全研究集中于从完整轨迹中检测不道德行为，但这种范式本质上是回顾性的：它仅在伤害已经发生后识别伤害。在这项工作中，我们研究了一个关键但被忽视的安全任务，我们称之为预测性监控：仅给定部分动作轨迹，模型能否在执行公开动作之前推断出它是否会以不道德行为告终？为了支持这一任务，我们提出了PreActBench，一个包含1000个跨五个领域的成对道德和不道德动作轨迹的基准。我们使用我们的前缀远见F1指标，在动作轨迹的不同部分上评估了一系列LLMs、安全护栏模型和潜在探测方法。结果表明，尽管人类取得了有希望的性能，但即使对于强模型，预测性监控仍然具有挑战性，突显了在LLM安全中需要面向未来的风险推理。

英文摘要

Large language models (LLMs) are increasingly deployed as autonomous agents capable of executing multi-step action trajectories toward a given objective. While existing safety research has focused on detecting unethical behavior from complete trajectories, this paradigm is fundamentally retrospective: it identifies harm only after it has already occurred. In this work, we study a critical yet overlooked safety task, which we term Predictive Monitoring: given only a partial action trajectory, can a model infer whether it will culminate in an unethical action before the overt action is executed? To support this task, we present PreActBench, a benchmark of 1,000 paired ethical and unethical action trajectories spanning five domains. We evaluate a range of LLMs, safety guardrail models, and latent probing methods across varying fractions of the action trajectory using our Prefix Foresight F1 metric. Results show that while humans achieve promising performance, predictive monitoring remains challenging even for strong models, highlighting the need for future-oriented risk reasoning in LLM safety.

URL PDF HTML ☆

赞 0 踩 0

2606.09889 2026-06-10 cs.LG 新提交

Optuna Constrained Tree-Structured Parzen Estimator Is a Joint Density Generalization of c-TPE

Optuna约束树结构Parzen估计器是c-TPE的联合密度推广

Shuhei Watanabe, Kaichi Irie

发表机构 * Independent Researcher（独立研究者）； Kyoto University（京都大学）

AI总结本文证明Optuna的约束TPE是联合c-TPE，使用联合似然的ECI采集函数，并展示其对约束重复的鲁棒性优于独立c-TPE。

2606.09888 2026-06-10 cs.LG 新提交

SinkRec: Mitigating Semantic State Sink in Long Sequence Recommendation with Memory-Conditioned Gated Delta Networks

SinkRec: 通过记忆条件门控Delta网络缓解长序列推荐中的语义状态沉没

Zhuang Zhuang, Zhipeng Wei, Ji Dai, Jie Chen, Fei Pan, Peng Jiang, Kun Gai

发表机构 * Kuaishou Technology, Beijing, China（快手科技，北京，中国）； Beijing University of Posts and Telecommunications, Beijing, China（北京邮电大学，北京，中国）； Independent Researcher（独立研究员）

AI总结针对线性注意力在长序列推荐中因重复行为模式导致的语义状态沉没问题，提出SinkRec混合记忆-循环架构，通过残差向量量化外部化重复模式，并设计TDGD模块净化读写过程，在保持线性时间效率的同时提升推荐效果。

详情

AI中文摘要

线性注意力通过避免标准Transformer的二次成本，为长序列推荐提供了高效的骨干网络，但其压缩的循环状态可能被重复行为模式主导。我们将此现象识别为语义状态沉没，即重复语义过度占据循环状态并偏置后续读出。为了缓解语义状态沉没，我们提出SinkRec，一种混合记忆-循环循环架构，将协同行为模式存储与动态转换建模解耦。SinkRec通过残差向量量化将重复局部模式外部化到可学习的条件记忆中，重新注入检索到的编码，并将记忆键值对暴露给注意力块。它进一步引入了时域感知状态关系差分门控DeltaNet（TDGD），通过抑制记忆覆盖的更新和移除记忆对齐的读出响应，利用记忆净化循环写入和读取。该设计将重复语义从状态竞争信号转变为记忆可检索模式，使循环状态专注于动态转换，并以线性时间效率缓解语义状态沉没。在公共和工业数据集上的实验证明了SinkRec的有效性和效率。

英文摘要

Linear attention provides an efficient backbone for long-sequence recommendation by avoiding the quadratic cost of standard Transformers, but its compressed recurrent state can be dominated by repetitive behavior patterns. We identify this phenomenon as semantic state sink, where recurring semantics over-occupy the recurrent state and bias subsequent readouts. To mitigate semantic state sink, we propose SinkRec, a hybrid memory-transition looped architecture that decouples collaborative behavioral pattern storage from dynamic transition modeling. SinkRec externalizes recurring local patterns into a learnable conditional memory through residual vector quantization, reinjects the retrieved codes, and exposes memory key-value pairs to the attention block. It further introduces Temporal-Aware State-Relation Differential Gated DeltaNet (TDGD), which uses memory to purify recurrent writing and reading by suppressing memory-covered updates and removing memory-aligned readout responses. This design turns recurring semantics from state-competing signals into memory-retrievable patterns, allowing the recurrent state to focus on dynamic transitions and alleviating semantic state sink with linear-time efficiency. Experiments on public and industrial datasets demonstrate the effectiveness and efficiency of SinkRec.

URL PDF HTML ☆

赞 0 踩 0

2606.09887 2026-06-10 cs.LG cs.AI cs.CL 新提交

SocraticPO: Policy Optimization via Interactive Guidance

SocraticPO: 通过交互式指导进行策略优化

Zirui Liu, Jie Ouyang, Qi Liu, Xianquan Wang, Jiayu Liu, Tingyue Pan, Qingchuan Li, Jing Sha, Zhenya Huang, Shijin Wang, Enhong Chen

发表机构 * State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China（认知智能国家重点实验室，中国科学技术大学）； iFLYTEK AI Research (Central China), iFLYTEK Co., Ltd（iFLYTEK中央中国AI研究院，iFLYTEK公司）

AI总结提出SocraticPO框架，在强化学习中使用自然语言指导辅助推理，并通过奖励衰减防止模型依赖教师帮助，提升科学推理任务性能。

详情

AI中文摘要

用于大语言模型的强化学习通常使用标量结果奖励（如二元正确性）来监督推理。这种奖励提供了优化方向，但很少解释模型应如何修正其错误推理，这可能鼓励捷径学习和脆弱的策略。我们提出\textbf{SocraticPO}（苏格拉底式策略优化），一种策略优化框架，用苏格拉底式的自然语言指导增强强化学习展开。在展开过程中，学生首先独立回答；如果答案错误，教师诊断尝试并提供简洁的纠正性指导，之后学生在扩展的上下文下继续。关键的是，这种指导与奖励衰减配对：在教师干预后获得的正确答案只得到衰减的奖励，防止策略将教师帮助视为获取奖励的免费途径。由于SocraticPO只修改展开过程，而保持标准期望奖励目标不变，它可以插入到现有的策略梯度后端（如Reinforce++）中。此外，由于教师只提供文本级指导，SocraticPO可以利用更强的黑盒教师模型，而无需访问logits或分布匹配。在来自SciKnowEval的本科水平科学推理基准上，SocraticPO优于强强化学习和自蒸馏基线。消融实验表明，目标指导和奖励衰减都是必要的，奖励衰减减轻了对辅助纠正的依赖。

英文摘要

Reinforcement learning (RL) for large language models usually supervises reasoning with scalar outcome rewards, such as binary correctness. Such rewards provide an optimization direction but rarely explain how a model should revise its mistaken reasoning, which can encourage shortcut learning and brittle policies. We propose \textbf{SocraticPO} (Socratic Policy Optimization), a policy-optimization framework that augments RL rollouts with Socratic-style natural-language guidance. During rollout, the student first answers independently; if the answer is incorrect, a teacher diagnoses the attempt and provides concise corrective guidance, after which the student continues under the expanded context. Crucially, this guidance is paired with reward decay: correct answers obtained after teacher intervention only receive decayed rewards, preventing the policy from treating teacher help as a free path to reward. Since SocraticPO only modifies the rollout process while leaving the standard expected-reward objective intact, it can be plugged into existing policy-gradient backends such as Reinforce++. Moreover, because the teacher provides only text-level guidance, SocraticPO can leverage stronger black-box teacher models without requiring access to logits or distribution matching. On undergraduate-level scientific reasoning benchmarks from SciKnowEval, SocraticPO improves over strong RL and self-distillation baselines. Ablations show that both targeted guidance and reward decay are necessary, with reward decay mitigating reliance on assisted correction.

URL PDF HTML ☆

赞 0 踩 0

2606.09886 2026-06-10 cs.LG cs.AI 新提交

SHAPE: Coalition-Aware Expert Pruning for Sparse Mixture-of-Experts LLMs

SHAPE: 面向稀疏混合专家大语言模型的联盟感知专家剪枝

Yuhao Zhang

发表机构 * Beihang University（北航大学）

AI总结提出SHAPE框架，通过合作博弈论建模专家间协作，利用Shapley值识别高贡献专家，结合质量覆盖选择规则在剪枝预算下保留关键专家，实验表明在多种MoE模型上提升鲁棒性并降低显存。

详情

AI中文摘要

稀疏混合专家（MoE）大语言模型以低每token计算量实现了高质量，但其部署常受限于内存墙：必须保留全部专家池以支持依赖token的路由。专家剪枝是一种直接解决方案，但先前的标准通常独立评估专家，忽略了MoE推理本质上是“联盟性”的，即输出由路由到的top-$k$专家组合产生。我们提出\textbf{SHAPE}，一个任务驱动的剪枝框架，显式建模\textit{层内}专家协作。SHAPE将小校准集上的路由轨迹建模为经验合作博弈，并通过基于观察到的top-$k$联盟的Shapley式归因分配交互感知的专家价值，从而识别对高效用协作至关重要的专家，而不仅仅是频繁出现的专家。为了在全局剪枝预算下保持MoE拓扑，SHAPE进一步引入\textit{质量-覆盖}选择规则，在每层保留覆盖非负Shapley质量$\alpha$分数的最小专家子集，同时使用二分法匹配目标保留率。在三个现代MoE骨干网络（Qwen3-30B-A3B、GPT-OSS-20B和DeepSeek-V2-Lite）上的多个基准实验表明，SHAPE在20%和40%专家剪枝下，相比全局和逐层剪枝变体一致地提升了鲁棒性，无需额外训练即保持竞争性精度，并显著降低了峰值GPU内存占用。开源代码见此https URL。

英文摘要

Sparse Mixture-of-Experts (MoE) large language models achieve strong quality with low per-token compute, yet their deployment is often limited by the memory wall: the full expert pool must remain resident to support token-dependent routing. Expert pruning is a direct remedy, but prior criteria typically score experts independently and overlook that MoE inference is inherently \emph{coalitional}, where outputs arise from routed top-$k$ expert combinations. We propose \textbf{SHAPE}, a task-driven pruning framework that explicitly models \emph{intra-layer} expert cooperation. SHAPE formulates routing traces on a small calibration set as an empirical cooperative game and assigns interaction-aware expert values via a Shapley-style attribution over observed top-$k$ coalitions, enabling the identification of experts that are essential for high-utility collaborations rather than merely frequent. To preserve MoE topology under a global pruning budget, SHAPE further introduces a \emph{quality-coverage} selection rule that retains, in each layer, the minimal expert subset covering an $α$ fraction of non-negative Shapley mass, while using bisection to match a target keep rate. Experiments on three modern MoE backbones (Qwen3-30B-A3B, GPT-OSS-20B, and DeepSeek-V2-Lite) across diverse benchmarks show that SHAPE consistently improves robustness over global and layer-wise pruning variants, maintaining competitive accuracy under 20\% and 40\% expert pruning without additional training and delivering clear reductions in peak GPU memory footprint. The open-source code is available at https://github.com/Alizen-1009/Shapley-Moe.

URL PDF HTML ☆

赞 0 踩 0

2606.09885 2026-06-10 cs.LG stat.ML 新提交

TENP: Trapezoidal Expert Neuron Pruning For Mixture-of-Experts

TENP：用于混合专家的梯形专家神经元剪枝

Jiangyang He, Shaolin Zhu, Deyi Xiong

发表机构 * TJUNLP Lab, School of Computer Science and Technology, Tianjin University（天津大学计算机科学与技术学院 TJUNLP实验室）

AI总结提出TENP框架，通过识别重要专家并对其余专家进行神经元剪枝，保留梯形参数模式，在40%路由专家稀疏度和平均63.76%激活参数下，DeepSeek模型准确率仅下降1点，代码生成任务提升10%。

详情

AI中文摘要

混合专家大语言模型通过稀疏激活实现高效扩展，但其部署受到专家大量静态参数占用的根本限制。现有压缩方法要么移除整个专家，破坏路由拓扑并损害性能，要么依赖非结构化权重剪枝，实际效率有限。为解决这些局限，我们提出TENP，一种结构化的梯形专家神经元剪枝框架。使用少量样本，我们识别并保留重要专家，同时对次要专家应用专家神经元剪枝（ENP），从浅层到深层以梯形模式保留模型参数。在评估专家重要性时，我们联合考虑专家输出的幅度及其改变输入向量方向的能力。对于ENP，我们测量每个神经元对专家输出的投影贡献，以识别并保留重要神经元。我们在Qwen和DeepSeek模型上进行了广泛实验。在路由专家稀疏度为40%且平均激活63.76%专家参数的情况下，DeepSeek模型相比全参数模型准确率仅下降1点。此外，在代码生成任务上，它比全参数模型提升10%。

英文摘要

Mixture-of-Experts large language models (LLMs) scale efficiently through sparse activation, yet their deployment is fundamentally constrained by the large static parameter footprint of experts. Existing compression approaches either remove entire experts, disrupting routing topology and harming performance, or rely on unstructured weight pruning with limited practical efficiency. To address the limitations, we propose TENP, a structured Trapezoidal ExpertNeuron Pruning framework. Using a few samples, we identify and retain important experts, while applying expert neuron pruning (ENP) to less important experts, reserving model parameters in a trapezoidal pattern from shallow to deep layers. When evaluating expert importance, we jointly consider both the magnitude of the expert output and its ability to change the direction of the input vector. For ENP, we measure each neuron's projected contribution to the expert output to identify and retain important neurons. We conduct extensive experiments on the Qwen and DeepSeek models. Under a routing expert sparsity of 40% and an average of 63.76% activated expert parameters, the DeepSeek model suffers only a 1-point drop in accuracy compared to the full-parameter model. Moreover, it outperforms the full-parameter model by 10% on code generation tasks.

URL PDF HTML ☆

赞 0 踩 0

2606.09883 2026-06-10 cs.LG cs.AI 新提交

TD-Grokking: Learning from Zero-Reward Problems by Training-Time Decomposition

TD-Grokking：通过训练时分解从零奖励问题中学习

Ningyuan Xi, Hao Xu, Hongsheng Xin, Ning Miao

发表机构 * Ningyuan Xi 1,2（西宁元 1,2）； Hao Xu 3（许浩 3）； Hongsheng Xin 3（辛红生 3）； Ning Miao 1,2, †（苗宁 1,2, †）

AI总结针对强化学习在零奖励问题上无法提供优化信号的问题，提出训练时分解框架TD-Grokking，将难解问题递归分解为可验证子问题，在数学和医疗任务上优于基线方法。

详情

AI中文摘要

大型语言模型（LLMs）在推理任务上取得了显著进展，这主要归功于后训练范式，特别是基于可验证奖励的强化学习（RLVR）。然而，一个关键瓶颈依然存在：RLVR在极具挑战性的零奖励问题上失败，因为所有采样的推理轨迹都产生统一失败的结果，无法提供优化信号来驱动模型改进。先前解决这一限制的努力，如密集过程监督、部分奖励分配或前缀引导探索，受到固有任务约束的限制，或者未能完全赋予策略模型解决原始难解问题所需的能力。为了解决这个问题，我们提出了TD-Grokking，一个针对零奖励问题的训练时分解框架。它递归地将难解的根问题分解为自包含、可验证的子问题，形成层次树，其中可解的叶子节点提供非零奖励。在数学和医疗任务上的评估表明，TD-Grokking优于普通的GRPO以及所有基线方法。结合详细分析，这些结果证实训练时分解有效地将零奖励示例转化为可用的训练信号，从而实现一致的性能提升。我们的代码和数据集可在以下网址获取：https://this URL。

英文摘要

Large language models (LLMs) have made remarkable progress in reasoning tasks, largely driven by post-training paradigms, especially reinforcement learning with verifiable rewards (RLVR). However, a critical bottleneck persists: RLVR fails on highly challenging zero-reward problems, where all sampled reasoning trajectories yield uniformly failed outcomes, providing no optimization signal to drive model improvement. Prior efforts to address this limitation, such as dense process supervision, partial reward assignment, or prefix-guided exploration, suffer from inherent task constraints or do not fully equip the policy model with the capabilities necessary to solve the original intractable problems. To address this, we propose TD-Grokking, a training-time decomposition framework for zero-reward problems. It recursively decomposes intractable root problems into self-contained, verifiable subproblems, forming hierarchical trees where solvable leaves provide non-zero rewards. Evaluations on mathematical and medical tasks show that TD-Grokking outperforms vanilla GRPO as well as all baseline approaches. Together with detailed analysis, these results confirm that training-time decomposition effectively converts zero-reward examples into usable training signals, enabling consistent performance gains. Our code and datasets are available at https://anonymous.4open.science/r/TD-Grokking-6567/.

URL PDF HTML ☆

赞 0 踩 0

2606.09882 2026-06-10 cs.CV cs.LG 新提交

WHU-Infra3D: A Full-stack Multi-modal Dataset and Benchmark for 3D Roadside Infrastructure Inventory

WHU-Infra3D：面向3D路边基础设施清单的全栈多模态数据集与基准

Chong Liu, Luxuan Fu, Xuyu Feng, Zhen Dong, Bisheng Yang

发表机构 * State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing (LIESMARS)（信息工程测绘遥感国家重点实验室）； Wuhan University（武汉大学）

AI总结提出WHU-Infra3D多模态基准数据集，覆盖三城市53.8公里，融合全景图像与LiDAR点云，提供2D-3D实例关联和跨帧跟踪，支持基础设施状态诊断与属性识别，填补自动化维护数据集空白。

详情

AI中文摘要

数字孪生城市的范式正从粗略的视觉映射转向更精确、可操作的城市资产数字化。然而，现有数据集主要关注粗略的视觉感知，缺乏自动化基础设施维护所需的严格多模态对齐和属性及状态诊断。为弥合这一差距，我们引入了WHU-Infra3D，一个大规模、多模态的基准数据集，专门用于路边基础设施清单。覆盖三个城市53.8公里，WHU-Infra3D独特地集成了全景图像和LiDAR点云，并具有严格的2D-3D实例关联和跨帧跟踪。该数据集包含超过17.5万个多视图2D边界框以及数千个3D基础设施实例，提供了超过18.1万个详细的属性和状态注释（例如，锈蚀、遮挡），以支持运行健康评估。我们在五个核心任务上建立了全面的基线：2D检测、2D跨视图匹配、3D地理识别、3D点云分割和属性识别。广泛的评估暴露了当前模型在长尾缺陷状态上的显著跨城市领域差距和固有脆弱性，使WHU-Infra3D成为推进可扩展、AI驱动的城市基础设施清单和生命周期管理的重要试验场。WHU-Infra3D数据集可在以下网址获取：https://xxx。

英文摘要

The paradigm of digital twin cities is shifting from coarse visual mapping toward more precise and actionable digitization of urban assets. However, existing datasets predominantly focus on coarse visual perception, lacking the strict multi-modal alignment and attribute and status diagnosis required for automated infrastructure maintenance. To bridge this gap, we introduce WHU-Infra3D, a large-scale, multi-modal benchmark dataset dedicated to roadside infrastructure inventory. Covering 53.8 km across three cities, WHU-Infra3D uniquely integrates panoramic imagery and LiDAR point clouds with rigorous 2D-3D instance association and cross-frame tracking. Comprising over 175k multi-view 2D bounding boxes alongside thousands of 3D infrastructure instances, the dataset provides over 181k detailed attribute and status annotations (e.g., rust, occlusion) to empower operational health assessment. We establish comprehensive baselines across five core tasks: 2D detection, 2D cross-view matching, 3D geo-identification, 3D point cloud segmentation, and attribute recognition. Extensive evaluations expose significant cross-city domain gaps and inherent vulnerabilities of current models on long-tailed defective statuses, establishing WHU-Infra3D as an essential testbed for advancing scalable, AI-driven urban infrastructure inventory and lifecycle management. The WHU-Infra3D dataset is available at https://github.com/WHU-USI3DV/WHU-Infra3D.

URL PDF HTML ☆

赞 0 踩 0

2606.09881 2026-06-10 cs.LG cs.CR cs.CV 新提交

Toward Calibrated, Fair, and accurate Deepfake Detection

迈向校准、公平且准确的深度伪造检测

Ryan Brown, Chris Russell

发表机构 * University of Oxford（牛津大学）

AI总结提出Face-Fairness框架，通过Face-Feature Tuning实现无需人口统计标签的深度伪造检测公平性，同时保持或提升整体准确率。

详情

AI中文摘要

深度伪造检测器在不同人口群体间表现出较大的性能差距。现有的公平性方法需要人口统计标签、重新训练或牺牲准确性。我们引入了Face-Fairness (FF)，一个即插即用的偏差缓解框架。我们的主要贡献是Face-Feature Tuning (FFT)，这是首个在深度伪造检测中展示的无人口统计标签的公平性方法：一个轻量级校准器，基于冻结的人脸嵌入进行logit重映射。我们通过两种变体补充FFT：FF-Max，在人口统计标签可用时最大化最差组准确率；以及FF-Discover，通过嵌入发现的组实现相同目标。在域内和跨数据集测试设置中，FF一致地减少了FPR/TPR差距，提高了最小组准确率，同时保持（通常提升）整体准确率。该方法与检测器无关，增加了可忽略的运行时开销，并且不需要访问身份属性。

英文摘要

Deepfake detectors show large performance gaps across demographic groups. Existing fairness approaches require demographic labels, retraining, or sacrifice accuracy. We introduce Face-Fairness (FF), a plug-and-play framework for bias mitigation. Our primary contribution, Face-Feature Tuning (FFT), is the first demographic label-free fairness method demonstrated for deepfake detection: a lightweight calibrator that performs a logit remapping conditioned on frozen face embeddings. We complement FFT with two variants: FF-Max, which maximizes worst-group accuracy when demographics are available, and FF-Discover, which does the same with embedding-discovered groups. Across in-domain and cross-dataset test settings, FF consistently reduces FPR/TPR gaps and improves minimum group accuracy while maintaining (often improving) overall accuracy. The approach is detector-agnostic, adds negligible runtime overhead, and requires no access to identity attributes.

URL PDF HTML ☆

赞 0 踩 0

2606.09880 2026-06-10 cs.LG 新提交

Hyperparameter Learning for Latent Factorization of Tensors for Representation Learning to Large-scale Dynamic Weighted Directed Network

面向大规模动态加权有向网络表示学习的张量潜在因子超参数学习

Yaqian Zhan, Jialan He, Tianzhu Chen

发表机构 * College of Computer and Information Science（计算机与信息科学学院）

AI总结针对大规模动态加权有向网络，提出基于差分进化的张量潜在因子模型超参数自动优化框架，自动学习正则化参数，提升预测精度并减少人工调参。

详情

AI中文摘要

大规模动态加权有向网络（DWDNs）被广泛用于建模节点间的时变交互。张量潜在因子（LFT）通过低秩嵌入从DWDNs中提取目标知识。然而，与许多机器学习模型类似，LFT的性能在很大程度上依赖于超参数的选择。在实践中，这些参数通常通过手动或网格搜索进行调整，这需要大量的计算资源和人力。受此挑战的启发，本文提出了一种基于差分进化（DE）的LFT自动超参数优化框架（DE-LFT）。该方法将DE集成到LFT模型的训练过程中，以自动学习最优正则化参数$\lambda_1$、$\lambda_2$和$\lambda_3$。因此，模型能够自适应地搜索超参数空间并提高预测精度。在四个真实世界数据集上的实验结果表明，与手动调优的基线相比，所提方法实现了更低的MAE和RMSE，同时减少了对大量参数调优的需求。

英文摘要

Large-scale dynamic weighted directed networks (DWDNs) are widely used to model time-varying interactions among nodes. Latent factorization of tensors (LFT) extracts target knowledge from DWDNs via low-rank embedding. However, similar to many machine learning models, the performance of LFT heavily depends on the selection of hyperparameters. In practice, these parameters are often tuned manually or through grid search, which requires significant computational resources and human effort. Motivated by this challenge, this paper proposes an automated hyperparameter optimization framework based on Differential Evolution (DE) for LFT (DE-LFT). The proposed method integrates DE into the training process of the LFT model to automatically learn optimal regularization parameters $λ_1$, $λ_2$ and $λ_3$. As a result, the model can adaptively search the hyperparameter space and improve prediction accuracy. Experimental results on four real-world datasets demonstrate that the proposed approach achieves lower MAE and RMSE compared with manually tuned baselines while reducing the need for extensive parameter tuning.

URL PDF HTML ☆

赞 0 踩 0