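arXivDaily daily academic digest: synchronized with the full arXiv dataset, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and related fields.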

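2602.02285 2026-06-11 cs.LG cs.CL math.ST stat.TH Version update

AI4SLT: Empirical Processes in Lean 4 for Formal Statistical Learning Theory

Yuanhe Zhang, Jason D. Lee, Fanghui Liu

Affiliations: University of California, Berkeley

AI summary: The first comprehensive Lean 4 formalization of statistical learning theory, grounded in empirical process theory. A human-AI collaborative workflow produces a verifiable toolbox of proved theorems, and the formalization exposes implicit assumptions in standard textbooks.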

Comments Accepted by ICML 2026

Abstract

We present the first comprehensive Lean 4 formalization of statistical learning theory (SLT) grounded in empirical process theory. Our end-to-end formal infrastructure implements the missing contents in the latest Lean library, including a complete development of Gaussian Lipschitz concentration, Dudley's entropy integral theorem for sub-Gaussian processes, and an application to least-squares (sparse) regression with a sharp rate. The project was carried out using a human-AI collaborative workflow, in which humans design proof strategies and AI agents execute tactical proof construction, leading to a human-verified Lean 4 toolbox for SLT. Beyond implementation, the formalization process exposes and resolves implicit assumptions and missing details in standard SLT textbooks, enforcing a granular, line-by-line understanding of the theory. This work establishes a reusable formal foundation and opens the door for future developments in machine learning theory. The code is provided at https://github.com/YuanheZ/lean-stat-learning-theory.
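For orientation, Dudley's entropy integral bound mentioned in the abstract is usually stated in roughly the following textbook form. This is the standard statement, not the exact Lean theorem from the repository; the notation $(T, d)$, $X_t$, $N(T, d, \varepsilon)$, and the absolute constant $C$ is introduced here for illustration.

```latex
Let $(X_t)_{t \in T}$ be a separable, mean-zero process on a metric space $(T, d)$
with sub-Gaussian increments:
\[
  \mathbb{E}\, e^{\lambda (X_s - X_t)} \le e^{\lambda^2 d(s,t)^2 / 2}
  \quad \text{for all } s, t \in T,\ \lambda \in \mathbb{R}.
\]
Then, for an absolute constant $C > 0$,
\[
  \mathbb{E} \sup_{t \in T} X_t
  \;\le\; C \int_0^{\infty} \sqrt{\log N(T, d, \varepsilon)}\, d\varepsilon ,
\]
where $N(T, d, \varepsilon)$ is the $\varepsilon$-covering number of $T$ under $d$.
```

The integrand vanishes once $\varepsilon$ exceeds the diameter of $T$, so the integral is effectively over a bounded range.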

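2602.02229 2026-06-11 cs.LG eess.SP Version update

Prediction-Powered Risk Monitoring of Deployed Models for Detecting Harmful Distribution Shifts

Guangyi Zhang, Yunlong Cai, Guanding Yu, Osvaldo Simeone

Affiliations: University of California, Berkeley

AI summary: Proposes prediction-powered risk monitoring (PPRM), a semi-supervised method based on prediction-powered inference. It combines synthetic labels with a small number of true labels to build an anytime-valid lower bound on the running risk, which is used to detect harmful shifts, and is validated on image classification, large language model, and telecommunications monitoring tasks.

Comments Accepted by ICML 2026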
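The idea of combining synthetic labels with a few true labels can be illustrated with the basic prediction-powered point estimate of a mean loss. This is a minimal sketch of generic prediction-powered inference, not the PPRM procedure itself: it does not construct the anytime-valid lower bound, and the function name `ppi_risk_estimate` and its arguments are hypothetical.

```python
from statistics import fmean
from typing import Sequence


def ppi_risk_estimate(
    proxy_unlabeled: Sequence[float],
    proxy_labeled: Sequence[float],
    true_labeled: Sequence[float],
) -> float:
    """Prediction-powered estimate of the average loss.

    proxy_unlabeled: losses computed with synthetic labels on unlabeled data.
    proxy_labeled:   losses computed with synthetic labels on the labeled subset.
    true_labeled:    losses computed with true labels on the same labeled subset.
    """
    if len(proxy_labeled) != len(true_labeled):
        raise ValueError("labeled proxy and true losses must be paired")
    # Cheap estimate from plentiful synthetic-label losses.
    proxy_mean = fmean(proxy_unlabeled)
    # Rectifier: average gap between true and proxy losses on labeled data.
    rectifier = fmean(t - p for t, p in zip(true_labeled, proxy_labeled))
    return proxy_mean + rectifier
```

The rectifier term corrects the bias of the synthetic labels, so the estimate stays unbiased for the true risk even when the proxy losses are systematically off.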

2602.00945 2026-06-11 cs.CL cs.AI Version update

Neural FOXP2 -- Language Specific Neuron Steering for Targeted Language Improvement in LLMs

Anusa Saha, Tanmay Joshi, Vinija Jain, Aman Chadha, Amitava Das

Affiliations: Meta, USA; Apple, USA; Pragya Lab, BITS Pilani Goa, India

AI summary: Proposes Neural FOXP2, which localizes language-specific neurons, computes steering directions, and applies sparse activation shifts to switch a model's default language from English to Hindi or Spanish, giving controllable language dominance.

A Survey on Evaluating the Quality and Trustworthiness of LLM-Generated Data

Kaituo Zhang, Mingzhi Hu, Hoang Anh Duy Le, Fariha Kabir Torsha, Zhimeng Jiang, Minh Khai Bui, Chia-Yuan Chang, Yu-Neng Chuang, Zhen Xiong, Ying Lin, Guanchu Wang, Na Zou

Affiliations: University of Houston; Worcester Polytechnic Institute; Rice University; Texas A&M University; University of Wisconsin - Madison; University of Southern California; University of North Carolina at Charlotte

AI summary: Proposes the LLM Data Auditor framework, which systematically categorizes evaluation metrics along two dimensions, quality and trustworthiness, analyzes evaluation gaps in generation methods across six data modalities, and offers recommendations for improvement.

Comments Published at TMLR. Title changed in the final version

Journal ref: Transactions on Machine Learning Research, 2026

Abstract

Large Language Models (LLMs) have emerged as powerful tools for generating data across various modalities. By transforming data from a scarce resource into a controllable asset, LLMs mitigate the bottlenecks imposed by the acquisition costs of real-world data for model training, evaluation, and system iteration. However, ensuring the high quality of LLM-generated synthetic data remains a critical challenge. Existing research primarily focuses on generation methodologies, with limited direct attention to the quality of the resulting data. Furthermore, most studies are restricted to single modalities, lacking a unified perspective across different data types. To bridge this gap, we propose the LLM Data Auditor framework. In this framework, we first describe how LLMs are utilized to generate data across six distinct modalities. More importantly, we systematically categorize intrinsic metrics for evaluating synthetic data along two dimensions: quality and trustworthiness. This approach shifts the focus from extrinsic evaluation, which relies on downstream task performance, to the inherent properties of the data itself. Using this evaluation system, we analyze the experimental evaluations of representative generation methods for each modality and identify substantial deficiencies in current evaluation practices. Based on these findings, we offer concrete recommendations for the community to improve the evaluation of data generation. Finally, the framework outlines methodologies for the practical application of synthetic data across different modalities.

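2601.17360 2026-06-11 cs.LG cs.AI cs.CR Version update

Robust Privacy: Inference-Stage Privacy through Certified Robustness

Jiankai Jin, Xiangzheng Zhang, Zhao Liu, Wenzhuo Xu, Dongdong Yang, Deyue Zhang, Quanchen Zou

Affiliations: University of Science and Technology of China

AI summary: Introduces Robust Privacy (RP), which uses certified robustness to guarantee that predictions are invariant within a neighborhood of the input, thereby limiting privacy leakage at inference time. Experiments show that RP improves the privacy-utility trade-off under attribute inference and model inversion attacks.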

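FronTalk: Benchmarking Front-End Development as Conversational Code Generation with Multi-Modal Feedback

Xueqing Wu, Zihan Xue, Da Yin, Shuyan Zhou, Kai-Wei Chang, Nanyun Peng, Yeming Wen

Affiliations: Meta Superintelligence Labs; University of California, Los Angeles; Duke University

AI summary: Proposes the FronTalk benchmark, which evaluates front-end code generation through multi-turn dialogues with multi-modal feedback (textual and visual instructions). Models are found to suffer from forgetting and to struggle with visual feedback; the proposed AceCoder method effectively reduces forgetting and improves performance.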

Abstract

We present FronTalk, a benchmark for front-end code generation that pioneers the study of a unique interaction dynamic: conversational code generation with multi-modal feedback. In front-end development, visual artifacts such as sketches, mockups, and annotated screenshots are essential for conveying design intent, yet their role in multi-turn code generation remains largely unexplored. To address this gap, we focus on the front-end development task and curate FronTalk, a collection of 100 multi-turn dialogues derived from real-world websites across diverse domains such as news, finance, and art. Each turn features both a textual instruction and an equivalent visual instruction, each representing the same user intent. To comprehensively evaluate model performance, we propose a novel agent-based evaluation framework that leverages a web agent to simulate users and explore the website, thus measuring both functional correctness and user experience. Evaluation of 20 models reveals two key challenges that are systematically under-explored in the literature: (1) a significant forgetting issue where models overwrite previously implemented features, resulting in task failures, and (2) a persistent challenge in interpreting visual feedback, especially for open-source vision-language models (VLMs). We propose a strong baseline to tackle the forgetting issue with AceCoder, a method that critiques the implementation of every past instruction using an autonomous web agent. This approach reduces forgetting to nearly zero and improves the performance by up to 9.3% (56.0% to 65.3%). Overall, we aim to provide a solid foundation for future research in front-end development and the general interaction dynamics of multi-turn, multi-modal code generation. Code and data are released at https://github.com/shirley-wu/frontalk

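2601.03326 2026-06-11 cs.CV cs.LG Version update

Higher order PCA-like rotation-invariant features for detailed shape descriptors modulo rotation

Jarek Duda

Affiliations: Jarek Duda

AI summary: Proposes extending PCA to higher-order tensors (such as third-order central moments) or to polynomials multiplied by a Gaussian, to obtain more precise rotation-invariant shape descriptors, with applications to molecular shape description, object recognition, and shape similarity metrics.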

Comments 5 pages, 4 figures

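2506.08473 2026-06-11 cs.LG Version update

Pass@K Policy Optimization: Solving Harder Reinforcement Learning Problems

Christian Walder, Deep Karkhanis

Affiliations: Google DeepMind

AI summary: Proposes Pass-at-k Policy Optimization (PKPO), which transforms rewards to optimize pass@k performance directly using low-variance unbiased estimators. Annealing k during training can improve both pass@1 and pass@k and enables solving harder problems.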

Abstract

Reinforcement Learning (RL) algorithms sample n > 1 solution attempts for each problem and reward them independently. This optimizes for pass@1 performance and prioritizes the strength of isolated samples at the expense of the diversity and collective utility of sets of samples. This under-utilizes the sampling capacity, limiting exploration and eventual improvement on harder examples. As a fix, we propose Pass-at-k Policy Optimization (PKPO), a transformation on the final rewards which leads to direct optimization of pass@k performance, thus optimizing for sets of samples that maximize reward when considered jointly. Our contribution is to derive novel low-variance unbiased estimators for pass@k and its gradient, in both the binary and continuous reward settings. We show that optimization with our estimators reduces to standard RL with rewards that have been jointly transformed by a stable and efficient transformation function. While previous efforts are restricted to k = n, ours is the first to enable robust optimization of pass@k for any arbitrary k <= n. Moreover, instead of trading off pass@1 performance for pass@k gains, our method allows annealing k during training, optimizing both metrics and often achieving strong pass@1 numbers alongside significant pass@k gains. We validate our reward transformations on toy experiments, which reveal the variance-reducing properties of our formulations. We also include real-world examples using the open-source LLM, GEMMA-2. We find that our transformation effectively optimizes for the target k. Furthermore, higher k values enable solving more and harder problems, while annealing k boosts both pass@1 and pass@k. Crucially, for challenging task sets where conventional pass@1 optimization stalls, our pass@k approach unblocks learning, likely due to better exploration by prioritizing joint utility over the utility of individual samples.
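For readers unfamiliar with the metric, the binary-reward pass@k of a problem can be estimated from n samples with the widely used combinatorial unbiased estimator sketched below. This is the standard evaluation estimator, shown only to make the quantity concrete; it is not the paper's gradient estimator or reward transformation, and the function name `pass_at_k` is chosen here for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k from n samples, c of which are correct.

    Equals the probability that a uniformly random size-k subset of the
    n samples contains at least one correct sample.
    """
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= n")
    if not 0 <= c <= n:
        raise ValueError("need 0 <= c <= n")
    # comb(n - c, k) counts the size-k subsets made only of incorrect samples;
    # math.comb returns 0 when k > n - c, so no special case is needed.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With n = 4 samples and c = 1 correct, the estimate rises from 0.25 at k = 1 to 0.5 at k = 2, which shows why a larger k rewards sets of samples rather than any single attempt.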

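2512.11393 2026-06-11 cs.CV Version update

The N-Body Problem: Parallel Execution from Single-Person Egocentric Video

Zhifan Zhu, Yifei Huang, Yoichi Sato, Dima Damen

Affiliations: University of Bristol; The University of Tokyo

AI summary: Introduces the N-Body Problem: predicting, from a single-person egocentric video, how N people could perform the same tasks in parallel. A structured prompting strategy guides a vision-language model to reason about the 3D environment, object usage, and temporal dependencies, substantially improving action coverage and reducing conflicts on the EPIC-Kitchens and HD-EPIC datasets.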

Comments project webpage: https://zhifanzhu.github.io/ego-nbody

Abstract

Humans can intuitively parallelise complex activities, but can a model predict this from observing a single person? Given one egocentric video, we introduce the N-Body Problem: predicting how N individuals can hypothetically perform the same set of tasks. The goal is to maximise speed-up, but naive assignment of video segments to individuals often violates real-world constraints, leading to physically impossible scenarios like two people using the same object or occupying the same space. To quantify this, we formalise the N-Body Problem and propose a suite of metrics to evaluate both performance (speed-up, task coverage) and feasibility (spatial collisions, object conflicts, and causal constraints). As a proof of concept, we introduce a structured prompting strategy that guides a Vision-Language Model (VLM) to reason about the 3D environment, object usage, and temporal dependencies, producing a viable parallel execution. On 100 videos from EPIC-Kitchens and HD-EPIC, for N = 2, our structured prompt improves action coverage by 45% over a baseline prompt for Gemini 2.5 Pro, while simultaneously reducing collision rates, object conflicts, and causal conflicts by 51%, 52%, and 55%, respectively.
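One of the feasibility checks described above, two people using the same object at the same time, can be sketched as an interval-overlap count. This is an illustrative assumption about how such a check might look, not the paper's metric definition; the `Segment` layout and the function name `object_conflicts` are hypothetical.

```python
from typing import List, Tuple

# Hypothetical segment layout: (start_time, end_time, object_name).
Segment = Tuple[float, float, str]


def object_conflicts(person_a: List[Segment], person_b: List[Segment]) -> int:
    """Count pairs of segments where both people use the same object
    during overlapping time intervals."""
    conflicts = 0
    for start_a, end_a, obj_a in person_a:
        for start_b, end_b, obj_b in person_b:
            same_object = obj_a == obj_b
            # Half-open intervals overlap when each starts before the other ends.
            overlap = start_a < end_b and start_b < end_a
            if same_object and overlap:
                conflicts += 1
    return conflicts
```

Segments that merely touch at an endpoint are treated as non-overlapping, so one person can hand an object over at the exact moment the other starts using it.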

2512.08211 2026-06-11 cs.LG Version update

MobileFineTuner: A Mobile-Native Framework for On-Device LLM Fine-Tuning in Real-World Embedded AI Applications

Jiaxiang Geng, Lunyu Zhao, Yiyi Lu, Bing Luo

Affiliations: Duke Kunshan University; The University of Hong Kong

AI summary: Proposes MobileFineTuner, a mobile-native framework implemented in C++ with a resource-aware training runtime (memory-efficient attention, activation checkpointing, and more) that enables end-to-end LLM fine-tuning on commodity phones, substantially reducing memory pressure and improving executability.

Comments 26 pages, 25 figures

Abstract

Large language models (LLMs) are moving from cloud-centric services toward on-device embedded AI, where models interact with private, longitudinal signals sensed from users and their physical environments. Mobile phones are a natural platform for such applications because they are continuously carried by users, connected to wearable sensors, and deeply integrated with daily mobile applications. However, practical LLM fine-tuning on commodity phones remains difficult. Existing fine-tuning frameworks are largely Python-based and server-oriented, making them hard to deploy inside mobile applications. We present MobileFineTuner, a mobile-native open-source framework for end-to-end LLM fine-tuning on commodity mobile phones. MobileFineTuner is implemented in C++ and provides a reusable training stack. To make fine-tuning feasible under mobile resource constraints, MobileFineTuner integrates a resource-aware training runtime with memory-efficient attention, activation checkpointing, gradient accumulation, parameter sharding, and energy-aware scheduling. We evaluate MobileFineTuner on real mobile phones using GPT-2, Gemma 3, and Qwen2.5 models across multiple fine-tuning tasks. The results show that MobileFineTuner reproduces standard Full-FT and LoRA fine-tuning behavior, substantially reduces memory pressure, and improves executability on memory-constrained phones. We further demonstrate MobileFineTuner through a private campus health-agent application, where a local LLM is fine-tuned on user-specific wearable-sensing records to provide more personalized responses while keeping raw records on the phone. These results establish MobileFineTuner as a practical toolkit for studying and building on-device LLM fine-tuning applications in embedded AI and sensing systems.
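Of the memory-saving techniques listed above, gradient accumulation is the simplest to show in isolation. The sketch below uses a one-parameter least-squares model in plain Python to show that summing gradients over small micro-batches reproduces the full-batch gradient while only one micro-batch is processed at a time. It is a generic illustration, not MobileFineTuner's C++ implementation; all names are hypothetical.

```python
from typing import List, Sequence, Tuple

Example = Tuple[float, float]  # (input x, target y)


def grad_sum(w: float, batch: Sequence[Example]) -> float:
    """Sum of per-example gradients of the squared error (w * x - y) ** 2."""
    return sum(2.0 * (w * x - y) * x for x, y in batch)


def accumulated_gradient(
    w: float, data: List[Example], micro_batch_size: int
) -> float:
    """Mean gradient over `data`, computed one micro-batch at a time."""
    if micro_batch_size < 1:
        raise ValueError("micro_batch_size must be positive")
    total = 0.0
    for start in range(0, len(data), micro_batch_size):
        micro_batch = data[start:start + micro_batch_size]
        # Only this micro-batch needs to be held in memory at once.
        total += grad_sum(w, micro_batch)
    # Normalize once at the end so uneven micro-batches are weighted correctly.
    return total / len(data)
```

Normalizing by the total example count, rather than averaging per micro-batch means, keeps the result equal to the full-batch gradient even when the last micro-batch is smaller.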

2508.21380 2026-06-11 cs.LG cs.AI Version update

The Algorithm Is Not the Behavior: Learned Priors Override Look-Ahead in a Chess-Playing Neural Network

Elias Sandmann, Sebastian Lapuschkin, Wojciech Samek

Affiliations: Fraunhofer HHI

AI summary: Finds that the chess neural network Leela Chess Zero often computes the correct solution in intermediate layers, but that the final output is overridden by priors favoring safe play, producing wrong answers.

Abstract

Recent mechanistic work has uncovered learned algorithms within neural networks, from modular arithmetic to search and planning in game-playing agents. But does algorithmic structure guarantee algorithmic behavior? We investigate this in Leela Chess Zero, the strongest neural chess engine, where prior work identified learned look-ahead. By extending the logit lens to its move-selecting policy network, we discover that correct puzzle solutions, including immediate checkmates, often appear in intermediate layers but are systematically overridden in the final output, a phenomenon we term "forgotten puzzles". Replicating prior analyses on these positions, we find that look-ahead operates normally: future moves of the correct continuation are represented, causally important, and linearly decodable, which rules out a failure of the algorithm itself. Instead, late layers increasingly shift toward prioritizing safe play over aggression. To test whether this shift drives the override, we steer the model against these preferences and recover 61.7% of forgotten puzzles, providing causal evidence that safety priors override algorithmically computed solutions. These findings demonstrate that algorithmic structure does not guarantee algorithmic behavior: a model can internally solve a problem and still output the wrong answer.
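The logit lens idea used above can be sketched in a few lines: decode each layer's hidden state with the output projection and see which option that layer currently prefers. This is a simplified toy in plain Python, assuming a single linear readout matrix; real applications usually also apply the model's final normalization, and the paper's extension to the Leela Chess Zero policy network is not reproduced here. All names are hypothetical.

```python
from typing import List, Sequence


def readout(hidden: Sequence[float], unembed: Sequence[Sequence[float]]) -> List[float]:
    """Project a hidden state to output scores, one row of `unembed` per output."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in unembed]


def logit_lens(
    layer_states: Sequence[Sequence[float]],
    unembed: Sequence[Sequence[float]],
) -> List[int]:
    """Return the top-scoring output index at every layer."""
    top_per_layer = []
    for hidden in layer_states:
        scores = readout(hidden, unembed)
        # Index of the highest score; ties resolve to the lowest index.
        top_per_layer.append(max(range(len(scores)), key=scores.__getitem__))
    return top_per_layer
```

A per-layer result such as `[0, 1, 0]` would mean that output 1 is preferred at the middle layer and then overridden by the final layer, which is the kind of pattern the abstract calls a forgotten puzzle.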

2511.19314 2026-06-11 cs.AI cs.CL cs.LG Version update

PRInTS: Reward Modeling for Long-Horizon Information Seeking

Jaewoo Lee, Archiki Prasad, Justin Chih-Yao Chen, Zaid Khan, Elias Stengel-Eskin, Mohit Bansal

Affiliations: University of North Carolina at Chapel Hill; University of Texas at Austin

AI summary: Proposes PRInTS, a generative process reward model that uses dense scoring and trajectory summarization to improve tool interaction and reasoning in long-horizon information seeking, surpassing frontier models on several benchmarks.

Comments ACL 2026, 19 pages, code: https://github.com/G-JWLee/PRInTS

Abstract

Information-seeking is a core capability for AI agents, requiring them to gather and reason over tool-generated information across long trajectories. However, such multi-step information-seeking tasks remain challenging for agents backed by language models. While process reward models (PRMs) can guide agents by ranking candidate steps at test time, existing PRMs, designed for short reasoning with binary judgment, cannot capture richer dimensions of information-seeking steps, such as tool interactions and reasoning over tool outputs, nor handle the rapidly growing context in long-horizon tasks. To address these limitations, we introduce PRInTS, a generative PRM trained with dual capabilities: (1) dense scoring based on the PRM's reasoning across multiple dimensions of step quality (e.g., interpretation of tool outputs, tool call informativeness) and (2) trajectory summarization that compresses the growing context while preserving essential information for step evaluation. Extensive evaluations across FRAMES, GAIA (levels 1-3), and WebWalkerQA (easy-hard) benchmarks on multiple models reveal that best-of-n sampling with PRInTS enhances information-seeking in open-source models as well as specialized agents, matching or surpassing frontier models with a much smaller backbone agent and outperforming other strong reward modeling baselines.
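Best-of-n sampling with a reward model, as used in the evaluation above, amounts to scoring each candidate next step and keeping the highest-scoring one. The sketch below shows only that selection logic; the scorer `toy_score` is a hypothetical stand-in for a process reward model and has nothing to do with how PRInTS computes its scores.

```python
from typing import Callable, Sequence


def best_of_n(candidates: Sequence[str], score: Callable[[str], float]) -> str:
    """Return the candidate step with the highest score.

    When several candidates tie, the earliest one is returned.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=score)


def toy_score(step: str) -> float:
    """Hypothetical scorer: counts words that suggest an informative tool call."""
    useful_words = {"search", "open", "read", "cited"}
    return float(sum(word in useful_words for word in step.lower().split()))
```

In a long-horizon agent loop this selection would run once per step, with the chosen step appended to the trajectory before the next round of candidates is sampled.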

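2511.00044 2026-06-11 cs.LG nlin.AO Version update

Time-multiplexed layer reuse for physical neural networks

Kohei Tsuchiyama, Andre Roehm, Takatomo Mihana, Ryoichi Horisaki

Affiliations: Graduate School of Information Science and Technology, The University of Tokyo

AI summary: To address the bottleneck of slow weight adjustment in physical neural networks, proposes TIDAL-Net, which increases effective depth through time-multiplexed layers and improves performance on image classification and natural language processing tasks.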

Abstract

Physical neural networks (PNNs) are promising candidates for next-generation computing, but existing demonstrations remain several orders of magnitude smaller than modern digital neural networks, whose recent advances have been driven by rapid growth in trainable parameters. This situation resembles the constraints of early digital neural networks, which led to ideas around parameter reuse. We investigate what similarly efficient hardware architectures may look like, focusing specifically on the common bottleneck of slow re-adjustment of the weights in PNNs. We propose the Time-Indexed Deep Alternating Layers Network (TIDAL-Net), which occupies an intermediate regime between recurrent and deep neural networks, specifically aimed at the scales and restrictions of common PNN prototypes. TIDAL-Net leverages the timescale separation found in many PNNs between fast forward dynamics and slowly trainable weights and biases, using layer-by-layer time multiplexing to increase effective depth while limiting implementation cost. Numerical experiments on image classification and natural language processing tasks show that TIDAL-Net improves performance with only minor modifications to conventional PNNs.
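The general idea of reusing a small set of fixed layers over several time steps can be sketched as follows. This is a loose illustration of time-multiplexed layer reuse under the assumption that layers are visited in a simple round-robin order; the actual TIDAL-Net architecture and its scheduling are not specified in the abstract, and the function name `time_multiplexed_forward` is hypothetical.

```python
from typing import Callable, Sequence

Layer = Callable[[float], float]


def time_multiplexed_forward(x: float, layers: Sequence[Layer], steps: int) -> float:
    """Apply a small set of fixed layers repeatedly, one per time step.

    The effective depth is `steps`, while the number of distinct
    (slow-to-adjust) layers stays at len(layers).
    """
    if not layers:
        raise ValueError("need at least one layer")
    for t in range(steps):
        # Assumed schedule: cycle through the layers in order.
        x = layers[t % len(layers)](x)
    return x
```

With two layers and four time steps, each layer is traversed twice, so the signal passes through four transformations while only two sets of weights ever need to be set.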
