arXivDaily: arXiv Daily Academic Digest, updated Monday through Friday
2606.07489 2026-06-12 cs.AI econ.GN q-fin.EC New submission

How AI Agents Reshape Knowledge Work: Autonomy, Efficiency, and Scope

Jeremy Yang, Kate Zyskowski, Noah Yonack, Jerry Ma

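Affiliations: Harvard Business School; Perplexity AI

AI summary: Using Perplexity product data, the study finds that an AI agent executing tasks end to end performs 26 minutes of autonomous work per session versus 33 seconds for Search, shortens completion time by 87%, cuts cost by 94%, and expands the scope and cognitive level of the work users attempt.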

Abstract

Frontier AI systems are bridging the gap between intelligence and utility by shifting from conversational assistants to autonomous agents that execute tasks end to end. Using production data from Perplexity's Search and Computer products, we study this transition by examining how AI agents accelerate and reshape knowledge work. Three key empirical findings emerge. First, using sessions with near-identical initial query pairs as natural experiments for the same underlying task attempted with both products, Computer performs 26 minutes of autonomous work per user session, versus 33 seconds for Search. Computer automates task decomposition and execution that Search users might otherwise manually orchestrate and implement. As a result, Computer shifts follow-up query distribution toward higher-order work such as verification and extension. Autonomy also increases execution quality, with per-query dissatisfaction rates 55% lower on Computer than on Search. Second, due to its autonomy advantage, Computer reduces completion time from 269 to 36 minutes on matched tasks, lowering estimated time and cost by 87% and 94%, respectively, compared to humans equipped with Search alone. Third, Computer changes the scope of work that users attempt: Computer queries more often cross occupational boundaries, require higher-order cognition, draw on broader expertise, take the form of composite tasks that bundle interdependent subtasks into a single query, and unlock work activities that are essentially absent from Search usage among the same users. Together, the evidence indicates that AI agents accelerate workflows, enhance output quality, reduce costs, and expand the breadth and depth of automated work.
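
The reported time saving can be checked from the two completion times quoted in the abstract. The following is a minimal sketch; the function name is illustrative, and the 94% cost figure is not reproduced because the abstract does not give the underlying cost inputs.

```python
def percent_reduction(before: float, after: float) -> float:
    """Return the relative reduction from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before

# Completion times on matched tasks, in minutes, as quoted in the abstract.
time_saving = percent_reduction(269, 36)
print(f"{time_saving:.1f}%")  # 86.6%, which rounds to the reported 87%

# Autonomous work per session: 26 minutes versus 33 seconds.
autonomy_ratio = (26 * 60) / 33
print(f"{autonomy_ratio:.0f}x")  # 47x
```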

2606.07436 2026-06-12 cs.CV New submission

Skill-3D: Evolving Scene-Aware Skills for Agentic 3D Spatial Reasoning

Haoyuan Li, Zhengdong Hu, Jun Wang, Hehe Fan, Yi Yang

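Affiliations: Zhejiang University; University of Technology Sydney; OPPO Research Institute

AI summary: Proposes Skill-3D, a framework in which a scene memory and a skill library co-evolve so that the agent selects tools adaptively per scene, substantially improving the correctness and sufficiency of tool use in 3D spatial reasoning.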

Abstract

This paper explores agentic 3D spatial understanding, i.e., MLLM agents performing 3D reasoning through tool use. Existing methods often misuse tools and exhibit biased tool preferences in 3D scenarios, leaving the agentic paradigm with only marginal gains over non-agentic strategies. We reveal that 3D spatial reasoning tasks are heterogeneous across scenes, while these agents apply a uniform tool-use strategy to all scenes rather than selecting tools according to the specific scene and task. To address this, we propose Skill-3D, a framework that learns self-evolving scene-aware skills. Specifically, Skill-3D identifies the task scene and records the agent's tool-use trajectory into a Scene Memory, where successful trajectories from similar scenes are aggregated and distilled into a reusable scene-aware skill, with failed ones attached to the skill as lessons. During training, once a similar scene recurs, the corresponding skill is injected to guide the agent, producing new trajectories whose successes and failures further refine the skill, forming a loop in which the memory and the skill library co-evolve. Experiments show that Skill-3D substantially improves tool utilization in 3D spatial reasoning (from 39% to 78% on VSI-Bench), driving the agent toward correct and sufficient tool use. For instance, it improves Gemini-3-Flash by 67% on MMSI-Bench. Furthermore, we conduct agentic post-training over skill-guided trajectories, which boosts Qwen3-VL-8B by 43% on VSI-Bench.

2605.18898 2026-06-12 cs.LG stat.ML Cross-list

A Two-Parameter Weibull Framework for Diagnosing Transformer Weight Distributions

Tiexin Ding

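Affiliations: Independent Researcher

AI summary: Proposes a two-parameter Weibull framework for analyzing element-wise weight magnitude distributions in transformers; experiments characterize how the shape parameter k is distributed across modules and how lambda changes during training.

Comments: 27 pages, 14 figures. Companion library npm-weibull-py and benchmark database available at https://github.com/tiexinding/NPM-Weibull-public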

Abstract

We apply the Weibull distribution -- a two-parameter family from extreme-value theory -- as a diagnostic framework for element-wise weight magnitude distributions in transformers. At initialization, i.i.d. Gaussian weights give |w| ~ HalfNormal, yielding k ~ 1.20 via middle-80% probability-plot fit (the protocol used throughout this work). This anchor makes k a principled, architecture-independent measuring stick for training dynamics; fitting each weight matrix independently at every layer at every checkpoint enables per-component, per-layer, and per-step diagnostics that aggregate statistics cannot resolve. Applying this framework to 12 model entries spanning 7 architectural families (Pythia, OLMo-1/2, LLaMA-3, Mistral, Qwen2.5/3) reveals three findings. First, FFN modules and the attention output projection W_o -- the Transmission Class -- fall in a narrow k band: median terminal k in [1.186, 1.204] across 12 entries (cross-family CV = 0.51%), shared across SwiGLU/GeLU activations, Pre-LN/QK-Norm placements, and 70M-14B sizes. Second, the attention input projections W_q, W_k -- the Selection Class -- depart from the Weibull family, with severity shaped by storage: separately-stored Q/K (OLMo-1, OLMo-2) yields k in [0.76, 0.99] (deep); GQA models yield k in [1.10, 1.16] (mild); Pythia's merged W_qkv occupies a transitional zone tracking training budget T/tau monotonically. Third, lambda grows substantially during training and scales with sqrt(eta/lambda_wd) within the Pythia family (Pearson r = 0.94, three Transmission kinds), directionally consistent with Fan et al. (2025). The two parameters carry independent information: k labels the functional class, lambda labels training progress. We release npm-weibull-py v0.4 (Python library) and DATABASE_v9_1 at https://github.com/tiexinding/NPM-Weibull-public .
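
The initialization anchor k ~ 1.20 can be illustrated with a probability-plot fit on exact half-normal quantiles. The sketch below is an assumption about what a "middle-80% probability-plot fit" looks like (an ordinary least-squares line through ln(-ln(1 - F)) versus ln|w| for F between 0.10 and 0.90); it does not use npm-weibull-py, and the paper's actual protocol may differ in grid, weighting, or estimator. The function name and defaults are illustrative.

```python
import math
from statistics import NormalDist

def weibull_shape_half_normal(lo: float = 0.10, hi: float = 0.90, n: int = 801) -> float:
    """Least-squares slope of the Weibull probability plot for |w|, w ~ N(0, 1)."""
    std_normal = NormalDist()
    xs, ys = [], []
    for i in range(n):
        p = lo + (hi - lo) * i / (n - 1)          # CDF level inside the middle band
        q = std_normal.inv_cdf((1.0 + p) / 2.0)   # half-normal quantile at level p
        xs.append(math.log(q))                    # ln|w|
        ys.append(math.log(-math.log(1.0 - p)))   # Weibull plot ordinate
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx                              # slope estimates the shape k

k = weibull_shape_half_normal()
print(round(k, 2))  # close to the k ~ 1.20 anchor quoted in the abstract
```

The slope is the shape parameter because a Weibull CDF F(x) = 1 - exp(-(x/lambda)^k) satisfies ln(-ln(1 - F)) = k ln x - k ln lambda, so a straight-line fit in these coordinates recovers k from the slope and lambda from the intercept.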

2606.11092 2026-06-12 cs.RO cs.AI Updated version

RoboNaldo: Accurate, Stable and Powerful Humanoid Soccer Shooting via Motion-Guided Curriculum Reinforcement Learning

Yichao Zhong, Yidan Lu, Yuhang Lu, Tianyang Tang, Haoguang Mai, Yixuan Pan, Tianyu Li, Li Chen, Jingbo Wang, Zhongyu Li, Peng Lu, Hongyang Li

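Affiliations: The University of Hong Kong; The Chinese University of Hong Kong; Archon Robotics

AI summary: Proposes RoboNaldo, a three-stage motion-guided curriculum RL framework that starts from a single human-kick reference and progressively optimizes shooting performance; in simulation it lowers shot error by 48.6% and raises shot velocity 2.96x, and on a real robot it attains 0.73-0.86 m average shooting error from 3 m with a post-contact ball velocity of 13.10 m/s.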

Abstract

Elite humanoid soccer shooting requires whole-body stability, high-impulse whole-body interactions, and accuracy to targets. Motion tracking-driven reinforcement learning (RL) provides stability in whole-body movement coordination, but a fixed reference makes it hard to adapt to varied ball positions and strike timings; in contrast, task reward-driven RL struggles to explore and discover valid kicks from scratch. We therefore introduce RoboNaldo, a three-stage motion-guided curriculum RL framework for high-impulse humanoid interaction. A single human-kick reference serves as a scaffold, and optimization progressively shifts toward shooting performance. The curriculum first learns a stable whole-body kicking prior, then adapts the kick to free-kick settings where the ball is stationary at random positions, and finally extends it to moving-ball shooting through a locomotion-command and kick-trigger interface. A high-level heuristic planner controls this interface during training, while alternative high-level controllers can drive the same low-level policy at inference. In simulation, RoboNaldo achieves 48.6% lower free-kick shot error and 2.96x higher shot velocity than prior-work baselines. In the real world, on a Unitree G1 with onboard perception, RoboNaldo attains 0.73 m and 0.86 m average target shooting error from 3 m away in the free-kick and moving-ball cases, respectively. The post-contact ball velocity reaches 13.10 m/s, which is 59-71% of reported professional open-play shot speed. Project page: https://opendrivelab.com/RoboNaldo.

2606.11042 2026-06-12 cs.AI Updated version

Workflow-GYM: Towards Long-Horizon Evaluation of Computer-use Agentic tasks in Real-World Professional Fields

Liya Zhu, Jingzhe Ding, Jian Zhang, Jianbo Xue, Shihao Liang, Ge Zhang, Yi Zhu, Duju Zeng, Xiang Gao, Qingshui Gu, Mailun Gao, Huimin Che, Yan Zhao, Peiheng Zhou, Haojun Wang, Chaobo Xian, Lili Le, Chi Wu, Yiwei Liu, Shengda Long, Jiale Yang, Fangzhi Xu, Sijin Wu, Haodong Duan, Chao He, Zhaojian Li, Minchao Wang, Huan Zhou, Jiani Hou, Chuqian Yu, Weiran Shi, Hongwan Gao, Jiamin Chen, Guanhong Chen, Tingqin Luo, Kaiyuan Zhang, Zhixin Yao, Qing Hua, Yuhao Jiang, Jin Chen, Pu Chen, Zhenyu Hu, Xingyu Li, Zhengxuan Jiang, Meng Cao, Tianfeng Long, Haozhe Wang, Mingzhang Wang, Yichen Zhang, Yiming Dai, Chenchen Zhang, Jiaying Wang, Xinying Liu, Xingzu Liu, Lingling Zhang, Xinjie Chen, Yujia Qin, Wangchunshu Zhou, Zhiyong Wu, Yang Liu, Jiaheng Liu, Lei Zhang, Shen Yan, Wenhao Huang, Zaiyuan Wang, Xiaolong Chang

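Affiliations: ByteDance Seed; M-A-P; Humanlaya

AI summary: Introduces the Workflow-GYM benchmark for evaluating AI agents on long-horizon, high-value workflows in professional software; even the strongest model achieves a success rate only slightly above 30%, exposing serious weaknesses in maintaining long-horizon workflow consistency.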

Abstract

Recent years have witnessed the rapid evolution of AI agents toward handling increasingly complex, real-world tasks. However, existing benchmarks rarely evaluate whether agents can operate graphical user interfaces to complete long-horizon, high-value professional workflows across diverse domains. Current GUI benchmarks still predominantly focus on general-purpose software, relatively simple applications, and short-horizon tasks, leaving it largely unknown whether modern agents can follow user instructions to autonomously operate domain-specific professional software and accomplish economically valuable work in an end-to-end manner. To bridge this gap, we introduce Workflow-GYM, a benchmark for long-horizon GUI tasks centered on professional domains and specialized software environments. Through extensive experiments on state-of-the-art models, we find that even the strongest models achieve only slightly above 30% success rates, highlighting that professional long-horizon GUI workflows remain highly challenging for current GUI agents. Further analysis reveals that current agents struggle to maintain long-horizon workflow consistency, frequently exhibiting workflow stage omission, error propagation, objective drift, and insufficient understanding of professional software environments. Our findings provide important insights into the limitations of current agent systems and suggest key directions for the next generation of GUI-agent research.

2606.06113 2026-06-12 cs.CV Updated version

Where, What, Why, and Importance: Structured Defect Grounding for Text-to-Image Feedback

Huaisong Zhang, Hao Yu, Yuxuan Zhang, Jiahe Wang, Xinrui Chen, Haoxiang Cao, Feng Lu, Wendong Zhang, Changqian Yu, Chun Yuan

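Affiliations: Tsinghua University; Kolors Team, Kuaishou Technology; University of British Columbia; Vector Institute; South China Normal University

AI summary: Proposes Structured Defect Grounding (SDG), which models defect diagnosis in text-to-image generation as structured set prediction; builds the SDG-30K dataset and the SDG-Eval protocol, uses a vision-language model as the detector, and applies BoxFlow-GRPO to turn predicted defect sets into spatial rewards that improve diffusion model alignment.

Comments: 25 pages, 9 figures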

Abstract

Despite generating increasingly photorealistic images, text-to-image (T2I) models still exhibit localized, subtle, and structurally complex failures. Diagnosing these failures requires instance-level feedback that answers where a defect occurs, what type it is, why it is defective, and its importance to overall image quality. While recent dense-feedback methods move beyond scalar supervision, their heatmap-centric representations still formulate diagnosis as pixel-field regression, making it difficult to localize variable-cardinality defects and bind semantic reasons to individual failures. To address this representation bottleneck, we propose Structured Defect Grounding (SDG), which casts T2I diagnosis as structured set prediction by modeling each defect as a (location, type, reason, importance) tuple. To make this formulation trainable and measurable, we introduce SDG-30K, a 30K-image dataset with box-grounded annotations across four modern T2I generators, together with a dedicated evaluation protocol, SDG-Eval. Building on this structured representation, we further present a diagnosis-to-alignment framework in which a Vision-Language Model (VLM) serves as the SDG detector, and BoxFlow-GRPO converts predicted defect sets into box-derived, importance-weighted spatial rewards for diffusion model alignment. Extensive experiments show that our SDG detector outperforms leading proprietary VLMs on structured defect grounding, while SDG-guided rewards consistently improve T2I alignment and support localized image refinement. These results establish SDG as a unified, instance-level interface for diagnosing, evaluating, and enhancing modern generative models.
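
The (location, type, reason, importance) tuple and the idea of a box-derived, importance-weighted spatial signal can be sketched as a small data structure. Everything below is hypothetical: the field names, the [0, 1] importance scale, the example defects, and the max-over-boxes aggregation are assumptions for illustration, not the SDG-30K schema or the BoxFlow-GRPO reward.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Defect:
    box: tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixels, x1/y1 exclusive
    kind: str                       # defect type
    reason: str                     # why the region is defective
    importance: float               # assumed scale in [0, 1]

def penalty_map(defects: list[Defect], width: int, height: int) -> list[list[float]]:
    """Per-pixel penalty: the largest importance among defects covering the pixel."""
    grid = [[0.0] * width for _ in range(height)]
    for d in defects:
        x0, y0, x1, y1 = d.box
        for y in range(max(0, y0), min(height, y1)):
            for x in range(max(0, x0), min(width, x1)):
                grid[y][x] = max(grid[y][x], d.importance)
    return grid

# A variable-size set of defects for one image (made-up examples).
defects = [
    Defect((0, 0, 2, 2), "anatomy", "hand has six fingers", 0.9),
    Defect((1, 1, 4, 3), "text", "sign lettering is garbled", 0.4),
]
grid = penalty_map(defects, width=4, height=3)
for row in grid:
    print(row)
```

A set of tuples handles any number of defects per image and keeps each reason attached to its own box, which is the property the abstract contrasts with heatmap regression.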

2606.05860 2026-06-12 cs.LG Updated version

GenAutoML: An Agentic Framework for Dynamic Architecture Generation and Optimization in Time-Series Analysis

Oleeviya Babu Poikarayil, Cédric Schockaert, Abdulrahman Nahhas, Christian Daase, Mursal Dawodi, Jawid Ahmad Baktash

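Affiliations: Paul Wurth S.A.; Otto-von-Guericke University; Technical University of Munich

AI summary: Proposes GenAutoML, a framework that uses large language models as neural architects, with a sandboxed reflection loop and a signature-aware runtime, to automatically generate and optimize neural architectures for time-series forecasting and anomaly detection; introduces dynamic reversible instance normalization to improve robustness under non-stationary conditions.

Comments: 26 pages, 17 figures, 12 tables. Under review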

Abstract

Designing neural architectures for time-series forecasting and anomaly detection remains a resource-intensive task that often requires substantial domain expertise. Traditional Automated Machine Learning (AutoML) systems typically rely on static, predefined search spaces, limiting their ability to adapt to diverse data characteristics. We present GenAutoML, an agentic framework that leverages Large Language Models (LLMs) as neural architects to bridge natural-language requirements and executable PyTorch implementations. The framework incorporates a Sandboxed Reflection Loop for autonomous code refinement and a Signature-Aware Runtime that enforces architectural consistency and execution safety. To improve robustness under non-stationary conditions, we further introduce a Dynamic Reversible Instance Normalization (Dyn-RevIN) wrapper. Experiments on the ETTh1, ETTm1, and Weather benchmarks demonstrate that GenAutoML can dynamically generate task-specific neural architectures tailored to dataset characteristics. Among the generated models, WaveInterferenceNet achieves inference latency below 0.01 ms per sample while maintaining competitive predictive performance. By emphasizing computational efficiency, architectural adaptability, and stable optimization behavior, GenAutoML enables the creation of ultra-lightweight neural networks suitable for resource-constrained and latency-sensitive Edge AI deployments.
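
For readers unfamiliar with reversible instance normalization, the sketch below shows the plain idea that a wrapper such as Dyn-RevIN builds on: normalize each input window by its own statistics, run the model, then map the outputs back to the original scale. It is not the paper's Dyn-RevIN; the dynamic component and any learnable affine parameters are omitted, and the function names are illustrative.

```python
import math

def revin_normalize(series: list[float], eps: float = 1e-5):
    """Normalize one instance by its own mean and standard deviation."""
    mean = sum(series) / len(series)
    var = sum((v - mean) ** 2 for v in series) / len(series)
    std = math.sqrt(var + eps)
    return [(v - mean) / std for v in series], (mean, std)

def revin_denormalize(values: list[float], stats: tuple[float, float]) -> list[float]:
    """Map model outputs back to the original scale of the instance."""
    mean, std = stats
    return [v * std + mean for v in values]

window = [10.0, 12.0, 14.0, 16.0]
normalized, stats = revin_normalize(window)
# A forecasting model would consume `normalized` here; the identity stands in for it.
restored = revin_denormalize(normalized, stats)
```

Because the statistics are computed per window, a shift in the level or scale of the series between training and deployment is removed before the model sees the data and restored afterwards.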

2606.05405 2026-06-12 cs.AI cs.CL cs.LG Updated version

Agents' Last Exam

Yiyou Sun, Xinyang Han, Weichen Zhang, Yuanbo Pang, Tianyu Wang, Yuhan Cao, Yixiao Huang, Chris Duroiu, Haoyun Zhang, Jeffrey Lin, Weishu Zhang, Tyler Zeng, Ying Yan, Bo Liu, Hanson Wen, Mingyang Xu, Xiaoyuan Liu, Zimeng Chen, Weiyan Shi, Amanda Dsouza, Vincent Sunn Chen, Patrick Bryant, Carl Boettiger, Yamini Rangan, Bradley Rothenberg, Kyle Steinfeld, Arvind Rao, Tapio Schneider, Georgios Yannakakis, Laure Zanna, Kaan Ozbay, Ida Sim, Tarek Zohdi, George Em Karniadakis, Jack Gallant, Teresa Head-Gordon, Yushan Li, Wenxi Deng, Tao Sun, Huiqi Wang, Zhun Wang, Justin Xu, Chris Yuhao Liu, Yafei Cheng, Rongwang Hu, Aras Bacho, Shengcao Cao, Zengyi Qin, Yixiong Chen, Hengduan Fan, Hao Liu, Lin Zeng, Shashank Muralidhar Bharadwaj, Litian Gong, Yingxuan Yang, Maojia Song, Ruheng Wang, Zongzheng Zhang, Honglin Bao, Shuo Lu, Jianhong Tu, Zhonghua Wang, Zheng Zhang, Zijiao Chen, Yanqiong Jiang, Zhendong Li, Bohan Lyu, Chang Ma, Peiran Xu, Benran Zhang, Shangding Gu, Haoyue Hua, Haoyang Li, Wanzhe Liao, Chengzhi Liu, Junbo Peng, Haoran Sun, Zechen Xu, Bo Chen, Jiayi Cheng, Yi Jiang, Keying Kuang, Yuan Li, Youbang Pan, Ziyan Rao, Alexander Schubert, Yifan Shen, Vincent Siu, Xiatao Sun, Kangqi Zhang, Xiaopan Zhang, Yuchen Zhu, Ishaan Singh Chandok, Lei Ding, Jingxuan Fan, Andrew Glover, Jiaming Hu, Yiran Hu, Wenbo Huang, Zixin Jiang, Haoran Jin, Lukas Kim, Ming Liu, Yang Liu, Alireza Rafiei, Xuhuan Shen, Kunyang Sun, Sophia Sun, Ting Sun, Eric Wang, Yixin Wang, Hanwen Xing, Sihan Xu, Yuzheng Xu, Zhongxing Xu, Zhiling Yan, Boqin Yuan, Ruiqi Zhang, Yifan Zhang, Zibo Zhao, Liana, Santanu Bosu Antu, Haoyue Bai, Carlo Bosio, Joseph Cavanagh, Patricia Cavazos-Rehg, Tianxing Chen, Xuewen Chen, Yipu Chen, Chenyu Zhu, Chen Dai, Stefano De Castro, Yunfu Deng, Kaustubh Dhole, Jiayuan Ding, Chenchen Du, Zhehang Du, Hao Fan, Run-Ze Fan, Hengyu Fu, Shi Gu, Yifan Gu, Charlie Guo, Baihe Huang, Baixiang Huang, Rimika Jaiswal, Zhihan Jiang, Ran Jin, Erin Kasson, Xin Lan, Joseph Lee, Deren Lei, 
Chenyu Li, Daofeng Li, Haitao Li, Hongwei Li, Jingyan Li, Xiao Li, Yi Li, Yinsheng Li, Yuangang Li, Zhixu Li, Wenyu Liang, Longtai Liao, Kevin Qinghong Lin, Andy Zeyi Liu, Che Liu, Jiaming Liu, Kaiyuan Liu, Xuan Liu, Pan Lu, Wenbo Lv, Yicheng Lyu, Qiuyang Mang, Kyle Montgomery, Yuzhou Nie, Ruoxi Ning, Jorin Overwiening, Xu Pan, Layna Paraboschi, Core Francisco Park, Justin Purnomo, Swati Rajwal, Scott Rankin, Bixuan Ren, Yiren Rong, HaoYang Shang, Ventus Shaw, Fiona Shen, Jiawei Shen, Minqi Shi, Shi Qiu, Huaxiu Yao, Tianneng Shi, Jonah So, Vladislav Susoy, Hannah Szlyk, Haocheng Wang, Jialu Wang, Wei Wang, Xinyu Wang, Zehao Wang, Dowling Wong, Angela Wu, Dehao Wu, Fangyu Wu, Mengyuan "Millie" Wu, Yu Wu, Yuchen Wu, Yuhao Wu, Qingpo Wuwu, Weihang Xiao, Yongyi Xiong, Fan Xu, Ruiling Xu, Mingxuan Yan, Benjamin Yang, Jirong Yang, Sen Yang, Xiaoli Yang, Yushi Yang, Haoran Ye, Xiaohu Yu, Zhengming Yu, Chenlong Zhang, Chi Zhang, Hanning Zhang, Hanwen Zhang, Junge Zhang, Kunpeng Zhang, Song Zhang, Wenjin Zhang, Wenshuo Zhang, Ying Zhang, Yizhi Zhang, Brian Zhao, Qijian Zhao, Yimin Zhao, Yuhaohua Zheng, Liwei Zhou, Tianyue Zhou, Sichen Zhu, Siqi Zhu, Yan Zhu, Yishu Zhu, Jierui Zuo, Chonghao Cai, Helena Casademunt, Wenjia Chen, Cheng Cheng, Nawen Deng, Rao Fu, Tianfu Fu, Yifan Han, He Ren, Zhenyu He, Qiao Jin, Langlang Li, Yuetai Li, Sylvia Liu, Lu Lu, Luqing Zhou, Subhabrata Mukherjee, Yunqi Ouyang, Yin Ren, Dawei Shi, Haoran Wu, Zhiyue Wu, Hannah Yao, Zhuoran Yi, Jenny Yu, Rhea Zhan, Hang Zhou, Blake Zhu, Junfan Zhu, Alan Yuille, Yang Liu, Russell Alan Poldrack, Jiachen Li, Zhenglu Li, Molei Tao, Jing Huang, Wenqi Shi, Costas Spanos, Lichao Sun, Chenguang Wang, Orson Xu, Zhen Dong, Hector Gomez, Aylin Caliskan, Ali Emami, Haimin Hu, Zhi Li, Lihui Liu, Murphy Niu, Yi Shao, Jianxin Sun, Mikko Tolonen, Ting Wang, Sanjiv Das, Yanjun Gao, Wenbo Guo, Erika J Schneider, Zhiyong Lu, Yian Ma, Mark Mueller, Radha Poovendran, Somayeh Sojoudi, Yinglun Zhu, Dawn Song

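Affiliations: arXiv

AI summary: To address the lack of economically meaningful deployment of AI systems in professional domains, proposes the Agents' Last Exam (ALE) benchmark, built with 250+ experts and comprising 1,000+ long-horizon, real-world economic tasks across 55 subfields in 13 industry clusters; the average pass rate on the hardest tier is currently only 2.6%.

Comments: Project website: https://agents-last-exam.org Code: https://github.com/rdi-berkeley/agents-last-exam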

Abstract

Recent AI systems have achieved strong results on a wide range of benchmarks, yet these gains have not translated into economically meaningful deployment across many professional domains. We argue that this gap is largely an evaluation problem: widely used benchmarks lack sustained performance measurement on real and economically valuable workflows. This paper introduces Agents' Last Exam (ALE), a benchmark designed to evaluate AI agents on long-horizon, economically valuable, real-world tasks with verifiable outcomes. Developed in collaboration with 250+ industry experts, ALE covers non-physical industries defined with reference to O*NET / SOC 2018 (the U.S. federal occupational taxonomy). It is organized around a task taxonomy with 55 subfields grouped into 13 industry clusters covering 1K+ tasks. Current results show that the hardest tier remains far from saturated: across mainstream harness and backbone configurations, the average full pass rate is 2.6%. ALE is designed as a living benchmark: its task pool grows continuously as new workflows and industries are onboarded. More broadly, ALE is intended not merely as another leaderboard, but as an instrument for closing the gap between benchmark success and GDP-relevant impact.

2606.04935 2026-06-12 cs.AI Updated version

What Type of Inference is Active Inference?

Wouter W. L. Nuijten, Mykola Lukashchuk, Thijs van de Laar, Bert de Vries

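Affiliations: Department of Electrical Engineering, Eindhoven University of Technology, Eindhoven, the Netherlands; Lazy Dynamics, Utrecht, the Netherlands

AI summary: Using the variational free energy framework, decomposes expected free energy minimization in active inference into entropy-correction and planning-correction terms, clarifying what kind of inference it performs, and examines the role of each correction in grid-world experiments.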

Abstract

Active inference casts decision-making as inference, with the Expected Free Energy (EFE) unifying goal-directed and information-seeking behavior. Recent work showed that EFE minimization can be written as Variational Free Energy (VFE) minimization on a generative model augmented with epistemic priors. We prove that the VFE of the augmented model can be rewritten as the VFE of the predictive model plus explicit entropy-correction terms, making the EFE contribution transparent. We then show that proper EFE-based planning requires combining these epistemic corrections with a planning correction that turns marginal inference into policy optimization, yielding a full variational characterization of EFE-based planning. This clarifies which corrections are needed for cross-entropy planning and for full EFE-based planning. The same entropy-corrected formulation leads to a detailed message-passing scheme for EFE-based planning together with simpler ablations. Experiments on three grid-world environments show that the planning correction already helps when observations are decisive, while the additional observation-side epistemic correction matters most when observations are only suggestive.

2606.04602 2026-06-12 cs.AI Updated version

Parthenon Law: A Self-Evolving Legal-Agent Framework

Hejia Geng, Leo Liu

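Affiliations: tapntell.ai

AI summary: Proposes the Parthenon framework, which factors an agent into components such as model, tools, and knowledge and adds an anti-leakage learning loop so that legal-domain LLM agents improve from experience, substantially raising performance on legal matters.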

Abstract

As agents grow more capable, legal-domain LLM agents promise to turn document-heavy matters into reviewable work products -- yet reliable deployment faces three obstacles: no large-scale evidence on how today's strongest model-and-harness combinations behave on end-to-end legal matters; no agent architecture adapted to the legal vertical, only general-purpose harnesses; and, in a setting that keeps shifting with new facts, authorities, and deadlines, no mechanism for systems to learn from their own outcomes. We address each. A large-scale empirical study on Harvey LAB -- 12,510 agent trajectories -- shows that even frontier agents remain far from completing matters in a single pass: per-criterion accuracy climbs with stronger models while strict matter completion stalls. We then introduce Parthenon, a self-evolving legal-agent framework that factors Model, Harness, Agent roles, legal Knowledge, deterministic Tools, and procedural Skills into auditable surfaces for source traceability, date and number grounding, deliverable compliance, and issue closure. Finally, an anti-leakage learning loop converts scored failures into task-agnostic edits to skills, tools, and knowledge, letting the system improve with experience -- as a firm refines its checklists and playbooks after each matter -- without touching model weights. Across our large-scale empirical analysis, Parthenon substantially improves the performance of state-of-the-art models and harnesses on legal-matter tasks.

2606.04525 2026-06-12 cs.CL cs.LG q-bio.GN Updated version

GENEB: Why Genomic Models Are Hard to Compare

Daria Ledneva, Mikhail Nuridinov, Denis Kuznetsov

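Affiliations: GitHub; arXiv

AI summary: To address fragmented evaluation of genomic foundation models, proposes the GENEB benchmark, which compares 40 models on 100 tasks under a unified probing protocol and finds that model rankings are unstable and that scaling yields only limited gains.

Comments: Changed first-page figure, fixed model sizes, improved consistency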

Abstract

Progress in genomic foundation models is difficult to assess due to fragmented benchmarks, incompatible evaluation protocols, and task-specific reporting. As a result, claims of superiority or generality across models are often not directly comparable. We introduce GENEB, a large-scale diagnostic benchmark that evaluates frozen representations from 40 genomic foundation models across 100 tasks spanning 13 functional categories under a unified probing-based protocol, including few-shot regimes. GENEB enables controlled comparison across model scale, architecture, tokenization, and pretraining data while explicitly exposing task-level trade-offs. Our analysis shows that aggregate leaderboards are unstable: model rankings vary sharply across task categories, scale provides only modest and inconsistent gains, and architectural and pretraining alignment frequently outweigh parameter count. These results highlight limitations of current evaluation practices and position GENEB as a reference framework for principled comparison and category-aware model selection in genomic machine learning.

2606.04474 2026-06-12 cs.CL eess.AS Updated version

Entity Binding Failures in Speech LLM Reasoning: Diagnosis and Chain-of-Thought Intervention

Ming-Hao Hsu, Xiaohai Tian, Jun Zhang, Zhizheng Wu

Affiliations: School of Data Science, The Chinese University of Hong Kong, Shenzhen, China; ByteDance, China

AI summary: Diagnoses entity binding failures in the logical reasoning of speech LLMs and proposes Entity-Aware Chain-of-Thought, which substantially improves reasoning accuracy.

Comments: INTERSPEECH 2026

Abstract

Speech Large Language Models (SLLMs) underperform their text counterparts on complex reasoning. We reveal that this gap is not a uniform cognitive deficit. Evaluating two architecturally diverse SLLMs, we show speech-to-text (S2T) matches or exceeds text-to-text (T2T) on spatial, syntactic, and factual tasks. Yet on logical tasks requiring entity tracking, S2T accuracy collapses to chance. We diagnose this as an entity binding failure: continuous speech features blur precise entity-property associations during implicit reasoning. To validate this diagnosis, we introduce Entity-Aware Chain-of-Thought (EA-CoT), a lightweight inference-time intervention forcing SLLMs to enumerate entities and bind them to claims before reasoning. EA-CoT bridges the gap, even when spoken names are misrecognized, yielding up to a 24.4 percentage-point accuracy gain. Ablations confirm the gains stem from explicit semantic binding, reframing the gap as an elicitation failure rather than a missing capability.
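
An inference-time intervention of this kind can be pictured as a prompt that forces enumeration and binding before any reasoning. The sketch below is hypothetical: the wording, step order, and function name are assumptions for illustration, and the paper's actual EA-CoT prompt is not given in the abstract.

```python
def build_ea_cot_prompt(question: str) -> str:
    """Assemble an instruction that asks for entity binding before reasoning."""
    steps = [
        "1. List every entity mentioned in the spoken input.",
        "2. For each entity, write the claims or properties attached to it.",
        "3. Reason step by step using only the bindings from step 2.",
        "4. State the final answer.",
    ]
    header = "Follow these steps before answering."
    return "\n".join([header, *steps, f"Question: {question}"])

prompt = build_ea_cot_prompt("Who owns the red car?")
print(prompt)
```

Writing the bindings out as text turns the entity-property associations into explicit tokens the model can attend to, rather than leaving them implicit in continuous speech features.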

2606.04364 2026-06-12 cs.CV cs.LG updated version

Spatially Grounded Concept Bottleneck Models via Part-Factorized Attention

通过部分分解注意力的空间基础概念瓶颈模型

Dhanesh Ramachandram

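Affiliations: Vector Institute

AI summary: Proposes a part-factorized concept bottleneck model that constrains attention with a spatial prior, providing interpretability and improving localization accuracy in fine-grained recognition.

Comments: Updated results with GlobalAttention Tokens

AI Chinese abstract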

概念瓶颈模型(CBM)在预测类别之前预测一层人类命名的属性,从而使其决策可审计。在细粒度识别任务中,概念头通常可以自由关注图像中的任何位置,因此以某个身体区域命名的头可能被其他区域的证据满足。本研究通过构造一个部分分解的CBM来消除这种自由度。该方法基于冻结的DINOv3视觉变换器,包含三个组件。一个学习到的前景门控,基于DINOv3块特征训练,抑制部分注意力内的背景块。一组部分查询交叉关注块特征,并且312个CUB属性中的每一个通过固定的概念到部分映射被路由,仅从其名称所暗示的部分令牌读取。一个可学习的二维高斯先验,以对数空间加性注入注意力logits,打破部分查询之间的排列对称性;其均值从每个部分的数据集平均关键点位置初始化,在训练或测试时不需要每张图像的关键点监督。在CUB-200-2011上,空间先验模型匹配完全监督基线(top-1准确率88.85%对88.95%),同时将指向精度提高16个百分点(52.6%对36.4%)。用PCA前景目标替换边界框监督,并与高斯先验结合,消除了所有每张图像监督,达到88.6%的top-1准确率和约70%的指向精度。关键点分数扫描显示,训练集的0.5%(约27张图像)足以初始化先验,且无显著损失。完全移除部分身份是更困难的情况:没有任何空间先验,指向精度降至2.9%。

English abstract

Concept bottleneck models (CBMs) predict a layer of human-named attributes before predicting a class, which makes their decisions auditable. On fine-grained recognition tasks the concept heads are usually free to attend anywhere in the image, so a head named for one body region can be satisfied by evidence on another. This work studies a part-factorized CBM that removes that freedom by construction. The method has three components built on a frozen DINOv3 vision transformer. A learned foreground gate, trained on DINOv3 patch features, suppresses background patches inside the part attention. A set of part queries cross-attends to patch features and each of the 312 CUB attributes is routed, through a fixed concept-to-part map, to read only from the part token its name implies. A learnable two-dimensional Gaussian prior, injected additively in log space into the attention logits, breaks the permutation symmetry among part queries; its means are initialized from the dataset-average keypoint location of each part, which requires no per-image keypoint supervision at training or test time. On CUB-200-2011 the spatial-prior model matches a fully supervised baseline (88.85% versus 88.95% top-1) while raising pointing accuracy by 16 points (52.6% versus 36.4%). Replacing bounding-box supervision with a PCA foreground target and combining it with the Gaussian prior removes all per-image supervision and reaches 88.6% top-1 at about 70% pointing accuracy. A keypoint-fraction sweep shows that 0.5% of the training set (about 27 images) suffices to initialize the prior with no measurable loss. Removing part identity entirely is the harder case: without any spatial prior, pointing accuracy collapses to $2.9\%$.
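The additive log-space prior described in this abstract can be illustrated with a minimal sketch. It is not the author's implementation: the isotropic Gaussian, the shared `sigma`, the 3x3 patch grid, and all function names are assumptions made for illustration. The abstract only states that a learnable 2D Gaussian prior is added to the attention logits.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a flat list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def log_gaussian_prior(grid_h, grid_w, mean, sigma):
    """Unnormalized log-density of an isotropic 2D Gaussian at each patch.

    `mean` is (row, col) in patch units; `sigma` is a hypothetical shared
    standard deviation (the abstract only says the prior is learnable).
    """
    mean_r, mean_c = mean
    return [
        -((r - mean_r) ** 2 + (c - mean_c) ** 2) / (2.0 * sigma ** 2)
        for r in range(grid_h)
        for c in range(grid_w)
    ]

def part_attention(content_logits, log_prior):
    """Add the log-prior to the content logits, then normalize."""
    return softmax([l + p for l, p in zip(content_logits, log_prior)])

# Toy 3x3 patch grid with uninformative content logits: the prior alone
# decides where this part query looks.
content_logits = [0.0] * 9
log_prior = log_gaussian_prior(3, 3, mean=(0.0, 0.0), sigma=1.0)
attention = part_attention(content_logits, log_prior)
```

Adding the prior in log space is equivalent to multiplying the softmax weights by the Gaussian and renormalizing. Two part queries with identical content logits but different means therefore produce different attention maps, which is one way the permutation symmetry among part queries can be broken.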

2606.03096 2026-06-12 cs.CL updated version

Can Factual Opinions Be Edited (Manipulated) in Large Language Models?

大型语言模型中的事实性观点能否被编辑(操纵)?

Yuanpu Cao, Ziyi Yin, Fenglong Ma, Jinghui Chen

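Affiliations: The Pennsylvania State University

AI summary: Introduces the FOE benchmark to evaluate how well current knowledge editing techniques can manipulate factual opinions (such as the stances of public figures), finds that they achieve only superficial changes without keeping opinion and evidence consistent, and proposes a Self-Generated Evidence-Aligned method to achieve opinion-evidence alignment.

Comments: Accepted to the ACL 2026 Main Conference

AI Chinese abstract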

大型语言模型(LLMs)正日益融入各个领域,这使得知识编辑技术变得至关重要,但也存在潜在危险。当前的编辑方法主要针对原子事实,忽视了操纵事实性观点(例如,公众人物在社会问题上的有记录的立场)所带来的重大风险。这种操纵可能重塑公众形象、影响选举并改变社会观点。为了系统评估这一威胁,我们引入了事实性观点编辑与证据(FOE)基准,涵盖261位公众人物、19个问题类别和2,178条完整的观点记录。我们的评估表明,当前的编辑技术在处理事实性观点时面临显著困难,通常仅能实现表面修改,而无法保持编辑后的观点与模型生成的支撑证据之间的一致性。为解决这一局限,我们进一步提出了一种简单而有效的自生成证据对齐方法,无需依赖显式指令即可实现观点-证据对齐。我们的基准和方法共同为理解LLMs中事实性观点编辑的新兴安全影响奠定了基础。

English abstract

Large Language Models (LLMs) are increasingly integrated into various domains, making knowledge editing techniques crucial yet potentially hazardous. Current editing methods primarily target atomic facts, overlooking the significant risks associated with manipulating factual opinions, e.g., documented stances of public figures on societal issues. Such manipulation could reshape public images, influence elections, and alter societal views. To systematically assess this threat, we introduce the Factual Opinion Editing with Evidence (FOE) benchmark, which encompasses 261 public figures, 19 issue categories, and 2,178 complete opinion records. Our evaluations demonstrate that current editing techniques struggle significantly with factual opinions, often achieving only superficial changes while failing to preserve consistency between the edited opinion and the supporting evidence generated by the model. To address this limitation, we further propose a simple yet effective Self-Generated Evidence-Aligned method that achieves opinion-evidence alignment without relying on explicit instructions. Together, our benchmark and method provide a foundation for understanding the emerging security implications of factual opinion editing in LLMs.

2606.02133 2026-06-12 cs.LG cs.AI updated version

Variational Learning for Insertion-based Generation

基于插入生成的变分学习

Yangtian Zhang, Zhe Wang, Arthur Gretton, Rex Ying, David van Dijk, Michalis K. Titsias, Jiaxin Shi

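Affiliations: University of Cambridge

AI summary: Proposes the Insertion Process (IP), which jointly learns where to insert, what to insert, and when to terminate via permutation-based variational inference, supporting variable-length generation and improving non-autoregressive sequence modeling quality.

AI Chinese abstract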

非单调序列生成方法,如掩码扩散模型,通过允许以非固定和预设的顺序生成token,为从左到右的自回归建模提供了一种灵活的替代方案。尽管具有实际优势,但大多数现有的非单调模型是顺序无关的,并依赖于固定长度的网格,限制了它们支持变长生成和自适应插入顺序的能力。在这项工作中,我们引入了一个概率框架,用于在变长插入模型中学习插入顺序。我们形式化了插入轨迹与排列之间的双射对应关系,这使得数据似然能够精确重参数化为排列上的和。基于这一结果,我们提出了插入过程(IP),这是一种随机生成模型,它联合学习在哪里插入、插入什么以及何时终止,并通过基于排列的变分推断进行训练。与先前的固定画布方法不同,IP原生支持变长生成,并学习数据驱动的插入顺序偏好。在目标条件规划和分子字符串生成上的实验表明,在缺乏规范从左到右结构的领域中,学习插入顺序提高了建模质量和泛化能力。

English abstract

Non-monotonic sequence generation methods, such as masked diffusion models, provide a flexible alternative to left-to-right autoregressive modeling by allowing tokens to be generated in non-fixed and prescribed orders. Despite their practical advantages, most existing non-monotonic models are order-agnostic and rely on a fixed-length grid, limiting their ability to support variable-length generation and adaptive insertion order. In this work, we introduce a probabilistic framework for learning insertion order in variable-length insertion models. We formalize a bijective correspondence between insertion trajectories and permutations, which enables an exact reparameterization of the data likelihood as a sum over permutations. Building on this result, we propose the Insertion Process (IP), a stochastic generative model that jointly learns where to insert, what to insert, and when to terminate, trained via permutation-based variational inference. Unlike prior fixed-canvas approaches, IP natively supports variable-length generation and learns data-driven preferences over insertion orders. Experiments on goal-conditioned planning and molecular string generation demonstrate that learning insertion order improves both modeling quality and generalization in domains without a canonical left-to-right structure.

2606.01621 2026-06-12 cs.CV cs.RO updated version

Goal2Pixel: Grounding Goals to Pixels for Vision-Language Navigation

Goal2Pixel: 将目标锚定到像素以实现视觉语言导航

Muyi Bao, Yuxin Cai, Hang Xu, Zongtai Li, Jinxi He, Jingfan Tang, Chen Lv, Ji Zhang, Yaqi Xie, Wenshan Wang

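Affiliations: Carnegie Mellon University; Nanyang Technological University

AI summary: Proposes the Goal2Pixel paradigm, which reformulates vision-and-language navigation in continuous environments (VLN-CE) as navigable pixel grounding. Using the image plane as a unified spatial interface, it predicts a visible navigable pixel and back-projects it into a 3D waypoint; combined with a visibility-aware keyframe memory and coordinate-aware auxiliary losses, it achieves competitive performance with fewer VLM calls.

Comments: 8 pages

AI Chinese abstract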

视觉语言模型(VLM)已成为连续环境中视觉语言导航(VLN-CE)的常见基础。然而,大多数基于VLM的方法将导航视为低级动作预测,这种接口模糊、受限于短视运动基元,且由于重复的VLM查询而效率低下。我们提出Goal2Pixel,一种纯基于像素的范式,将VLN-CE重新定义为可导航像素锚定。Goal2Pixel不预测动作,而是使用图像平面作为VLM推理与机器人运动之间的统一空间接口:模型预测一个对智能体可见的可导航像素,该像素被反投影为3D航点以进行前向导航。对于非前向动作,我们在图像平面上附加辅助指令区域,其中左/右/下区域分别解释为左转、右转和停止。为了实现长程导航,我们提出了一种可见性感知的关键帧记忆,用于紧凑且信息丰富的历史表示。为了将预训练的VLM适应于可导航像素锚定,我们引入了语义嵌入和坐标感知辅助损失。Goal2Pixel在需要比先前方法更少的VLM推理调用的情况下,实现了具有竞争力的最新性能。在R2R-CE Val-Unseen上,它以每集仅7.75次VLM调用达到54.1%的SR和52.5%的SPL,而直接动作预测在32.9%的SR下需要46.62次调用,减少了6倍。同样的趋势在RxR-CE上也成立。项目页面:https://baobao0926.github.io/Goal2Pixel/。

English abstract

Vision-language models (VLMs) have become a common foundation for vision-and-language navigation in continuous environments (VLN-CE). Yet most VLM-based methods cast navigation as low-level action prediction, an interface that is ambiguous, tied to short-horizon motion primitives, and inefficient due to repeated VLM querying. We propose Goal2Pixel, a pure pixel-based paradigm that reformulates VLN-CE as navigable pixel grounding. Rather than predicting actions, Goal2Pixel uses the image plane as a unified spatial interface between VLM reasoning and robot motion: the model predicts a navigable pixel that is visible to the agent, which is back-projected into a 3D waypoint for forward navigation. For non-forward actions, we append auxiliary directive regions to the image plane, where the left/right/bottom regions are interpreted as turning left, turning right, and stopping, respectively. To enable long-horizon navigation, we propose a visibility-aware keyframe memory for compact and informative history representation. To adapt pretrained VLMs to navigable pixel grounding, we introduce semantic embeddings and coordinate-aware auxiliary losses. Goal2Pixel achieves competitive state-of-the-art performance while requiring fewer VLM inference calls than prior methods. On R2R-CE Val-Unseen it achieves 54.1% SR and 52.5% SPL with just 7.75 VLM calls per episode, 6x fewer than the 46.62 required by direct action prediction at 32.9% SR. The same trend holds on RxR-CE. Project Page: https://baobao0926.github.io/Goal2Pixel/.
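The pixel-to-action interface described in this abstract can be sketched as follows. This is an illustration under stated assumptions, not the authors' code. The canvas layout (a band of `pad` pixels on the left, right, and bottom), the pinhole camera model, the availability of a depth value at the chosen pixel, and every function and parameter name are hypothetical. The abstract does not specify how the back-projection is computed.

```python
def decode_pixel(u, v, width, height, pad):
    """Map a predicted pixel on the padded canvas to an action label.

    Assumed layout: the camera image occupies columns [pad, pad + width)
    and rows [0, height); the bands outside it are directive regions.
    """
    if v >= height:
        return "stop"          # bottom band
    if u < pad:
        return "turn_left"     # left band
    if u >= pad + width:
        return "turn_right"    # right band
    return "forward"           # inside the image: a navigable pixel

def backproject_pixel(u, v, depth, fx, fy, cx, cy):
    """Standard pinhole back-projection of image pixel (u, v) with known
    depth into camera-frame coordinates (x right, y down, z forward)."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)

# Hypothetical 640x480 image padded by 40 pixels; intrinsics are made up.
action = decode_pixel(u=460, v=240, width=640, height=480, pad=40)
waypoint = backproject_pixel(460 - 40, 240, depth=2.0,
                             fx=500.0, fy=500.0, cx=320.0, cy=240.0)
```

A single forward prediction of this kind replaces a sequence of short motion primitives, which fits the reported call counts: 46.62 / 7.75 is roughly 6.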

2606.01172 2026-06-12 cs.LG stat.ME stat.ML updated version

Revisiting Neural Processes via Fourier Transform and Volterra Series

通过傅里叶变换和Volterra级数重新审视神经过程

Peiman Mohseni, Nick Duffield, Raymond K. W. Wong

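Affiliations: University of Cambridge

AI summary: Uses the Volterra expansion and set Fourier convolutions to propose two new conditional neural process models, addressing the interpretability and computational-efficiency limitations of existing translation-equivariant neural processes.

AI Chinese abstract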

从有限的、不规则采样的测量中建模未知的潜在函数是科学和工程中的一个反复出现的挑战。神经过程(NPs)是一类概率函数模型,是有前景的解决方案——尤其是当赋予领域特定的对称性(如平移等变性)时,这提高了样本效率和泛化能力。然而,现有的平移等变NPs面临两个局限性:(i)它们堆叠带有非线性的通用组件,模糊了诱导的函数类并限制了可解释性;(ii)卷积设计依赖于具有局部感受野的核,并需要密集的均匀输入网格,而基于注意力的方法避免了这些问题,但随观测数量呈二次方缩放。我们通过两个贡献解决了这两个问题。首先,利用Volterra展开,我们将连续平移等变算子表征为高阶卷积的和,实现了分析透明性,同时允许通过一阶卷积进行高效近似。其次,我们引入了集合傅里叶卷积(SFConvs),这是一种频域参数化方法,直接在不规则采样点上操作,实现近似全局感受野,并在观测数量上线性缩放。基于这些思想,我们提出了两种条件神经过程(CNPs):SFConvCNPs,它堆叠带有非线性的SFConv块,以及SFVConvCNPs,它整合了Volterra公式。在合成和真实世界数据集上的实验证明了我们的方法相对于最先进基线的有效性。

English abstract

Modeling unknown latent functions from finite, irregularly sampled measurements is a recurring challenge across science and engineering. Neural processes (NPs), a family of probabilistic functional models, are promising solutions -- especially when endowed with domain-specific symmetries like translation equivariance, which improve sample efficiency and generalization. Yet existing translation-equivariant NPs face two limitations: (i) they stack generic components with non-linearities, obscuring the induced function class and limiting interpretability; and (ii) convolutional designs rely on kernels with local receptive fields and require dense uniform input grids, while attention-based methods avoid these issues but scale quadratically with the number of observations. We address both with two contributions. First, using the Volterra expansion, we characterize continuous translation-equivariant operators as sums of higher-order convolutions, yielding analytical transparency while admitting efficient approximation by first-order convolutions. Second, we introduce set Fourier convolutions (SFConvs), a frequency-domain parameterization that operates directly on irregularly sampled points, achieves approximately global receptive fields, and scales linearly in the number of observations. Building on these ideas, we propose two conditional NPs (CNPs): SFConvCNPs, which stack SFConv blocks with non-linearities, and SFVConvCNPs, which integrate the Volterra formulation. Experiments on synthetic and real-world datasets demonstrate our methods' efficacy against state-of-the-art baselines.

2606.00807 2026-06-12 cs.AI cs.HC updated version

Interaction-Centered Intelligence: Toward an Interaction-Based Theory of Human-AI Co-Creation

以交互为中心的智能:迈向基于交互的人机共创理论

Nicholas Davis

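Affiliations: Georgia Institute of Technology; Co-Creative AI Consulting

AI summary: Proposes interaction as the primary unit of analysis and, drawing on theories such as distributed and embodied cognition, argues that intelligence emerges from interaction dynamics rather than isolated computation, introducing the Interaction-Centered Intelligence framework.

AI Chinese abstract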

传统人工智能很大程度上将智能概念化为发生在有界代理内的孤立计算。在经典AI、机器学习以及许多生成系统中,主要的分析单元仍然是单个模型或自主系统,通过输出、基准、预测准确性或优化性能进行评估。尽管这些方法取得了重大进展,但它们往往低估了交互在智能、创造力、意义和适应性行为涌现中的作用。本文提出将交互作为共创AI和更广泛的以交互为中心的智能的主要分析单元。借鉴分布式认知、具身认知、生成、参与式意义建构、人机交互和计算创造力,本文追溯了向越来越关系性智能观的历史进程。基于先前在创造性意义建构、量化共创以及诸如绘图学徒和AI绘图伙伴等共创系统上的工作,本文论证了智能通过代理、环境和社会技术系统之间不断演化的交互动态涌现,而非仅仅通过内部计算。本文引入了以交互为中心的智能作为理解人机共创、协作涌现、适应性参与和交互动态的框架。该框架不通过生成的输出单独评估智能,而是强调随时间展开的交互轨迹、协调模式、参与性参与、适应性调节和交互漂移。讨论了可解释的共创AI、混合智能、生成AI和未来人机系统的启示。

English abstract

Traditional artificial intelligence has largely conceptualized intelligence as isolated computation occurring within bounded agents. Across classical AI, machine learning, and many generative systems, the dominant unit of analysis remains the individual model or autonomous system evaluated through outputs, benchmarks, prediction accuracy, or optimization performance. While these approaches have produced major advances, they often under-theorize the role of interaction in the emergence of intelligence, creativity, meaning, and adaptive behavior. This paper proposes interaction as the primary unit of analysis for co-creative AI and interaction-centered intelligence more broadly. Drawing from distributed cognition, embodied cognition, enaction, participatory sense-making, human-computer interaction, and computational creativity, the paper traces a historical progression toward increasingly relational accounts of intelligence. Building upon prior work in Creative Sense-Making, quantified co-creation, and co-creative systems such as the Drawing Apprentice and AI Drawing Partner, it argues that intelligence emerges through evolving interaction dynamics among agents, environments, and socio-technical systems rather than solely through internal computation. The paper introduces Interaction-Centered Intelligence as a framework for understanding human-AI co-creation, collaborative emergence, adaptive participation, and interactional dynamics. Rather than evaluating intelligence solely through generated outputs, the framework emphasizes interaction trajectories, coordination patterns, participatory engagement, adaptive regulation, and interactional drift unfolding through time. Implications for explainable co-creative AI, hybrid intelligence, enactive AI, and future human-AI systems are discussed.

2605.31419 2026-06-12 cs.CV cs.RO updated version

Triangle Splatting SLAM

三角形泼溅SLAM

Nicholas Fry, Eric Dexheimer, Kirill Mazur, Paul H. J. Kelly, Andrew J. Davison

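Affiliations: Software Performance Optimisation Group; Department of Computing

AI summary: Presents the first dense RGB-D SLAM system that uses differentiable triangles as the 3D map representation, performing tracking and mapping through online differentiable rendering and supporting on-the-fly mesh conversion and editing.

Comments: 26 pages, 11 figures

AI Chinese abstract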

我们提出了一种密集RGB-D SLAM系统,使用可微三角形作为3D地图表示。虽然3D高斯泼溅已成为新颖视角合成的主要方法,但三角形仍然是传统渲染硬件、游戏引擎以及需要显式几何的下游任务(如模拟、碰撞和编辑)的标准图元。最近的离线方法表明,通过在一组带姿态的图像上进行Delaunay三角剖分,可以将非结构化的“三角形汤”优化为照片级逼真的网格。基于这一见解,我们提出了第一个密集SLAM系统,通过在线可微渲染三角形汤来执行跟踪和建图。地图可以通过受限Delaunay三角剖分实时转换为连通网格,从而实现网格变形和碰撞检测等新的在线功能。在Replica和TUM-RGBD数据集上,我们的系统在3D几何方面优于基线,匹配相机跟踪精度,并支持基于网格的在线场景编辑。

English abstract

We present a dense RGB-D SLAM system using differentiable triangles as the 3D map representation. While 3D Gaussian Splatting has emerged as the leading method for novel-view synthesis, triangles remain the standard primitive for traditional rendering hardware, game engines, and downstream tasks requiring explicit geometry such as simulation, collision, and editing. Recent offline methods have demonstrated that an unstructured 'triangle soup' can be optimised into a photorealistic mesh via Delaunay triangulation across a set of posed images. Building upon this insight, we present the first dense SLAM system to employ Triangle Splatting to perform both tracking and mapping through online differentiable rendering of a triangle soup. The map can be converted into a connected mesh on-the-fly via restricted Delaunay triangulation, enabling new online capabilities such as mesh deformation and collision checking. On Replica and TUM-RGBD, our system outperforms baselines on 3D geometry, matches the camera-tracking accuracy, and enables online mesh-based scene editing.

2605.28507 2026-06-12 cs.LG updated version

Universal Time Series Generation with Neural Controlled Differential Equations

基于神经受控微分方程的通用时间序列生成

Torben Berndt, Elyes Farjallah, Leif Seute, Raeid Saqur, Benjamin Walker, Jan Stühmer

Affiliations: Heidelberg Institute for Theoretical Studies; IAR, Karlsruhe Institute of Technology; Max Planck Institute for Polymer Research; IWR, Heidelberg University; Dept. of Computer Science, University of Toronto; Mathematical Institute, University of Oxford; Vector Institute, Toronto, Canada

AI summary: Proves that Structured Linear Controlled Differential Equations (SLiCEs) are universal time-series generators and proposes Generative SLiCEs (G-SLiCEs) for flow matching on path space, with strong results in probabilistic forecasting and downstream tasks, particularly on irregular grids.

AI Chinese abstract

最近关于状态空间模型(SSMs)序列通用性的工作引入了高效、最大表达性的连续时间方法用于时间序列建模。虽然这些工作侧重于判别设置,我们将这一视角扩展到生成式时间序列建模,通过证明最大表达性的结构化线性受控微分方程(SLiCEs)是通用时间序列生成器,即它们可以在$W_\infty$中逼近紧致潜在集上连续因果推前映射的诱导路径律。基于这些理论结果,我们提出了生成式SLiCEs(G-SLiCEs),一种用于路径空间上流匹配的最大表达性连续时间模型。实验上,我们表明表达性提高了概率预测和下流任务的性能,同时保留了连续时间模型的优势,例如泛化到任意观测网格。这对于不规则网格尤其有利,而固定网格模型通常难以处理此类网格。

English abstract

Recent work on the sequence universality of State Space Models (SSMs) has introduced efficient, maximally expressive continuous-time approaches for time-series modelling. While these works focus on discriminative settings, we extend this perspective to generative time-series modelling by proving that maximally expressive Structured Linear Controlled Differential Equations (SLiCEs) are universal time-series generators, in the sense that they can approximate the induced path laws of continuous causal pushforwards on compact latent sets in $W_\infty$. Building on these theoretical results, we propose Generative SLiCEs (G-SLiCEs), a maximally expressive continuous-time model for flow matching on path-space. Empirically, we show that expressivity improves performance in probabilistic forecasting and downstream tasks, while retaining the advantages of continuous-time models such as generalising to arbitrary observation grids. This is particularly beneficial for irregular grids, where fixed-grid models often struggle.

2605.29906 2026-06-12 cs.LG updated version

Plan, Don't Pose: Long Composite Motion Generation with Text-Aligned BFM

计划,而非摆姿势:基于文本对齐的BFM的长复合运动生成

Nikolay Shvetsov, Maksim Bobrin, Nazar Buzun, Anton Bozhedarov, Dmitry V. Dylov

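Affiliations: AvaCapo; Potsdam University; Applied AI Institute; Computational Imaging Lab; AXXX; Innopolis University

AI summary: Proposes the Text2BFM framework, which aligns natural language with a pretrained Behavioral Foundation Model and generates long composite motions in its latent policy space without an end-to-end motion generator.

AI Chinese abstract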

文本到运动(T2M)生成在角色动画、虚拟化身和人机交互中具有广泛应用。现有方法通常直接从语言生成姿态轨迹或运动令牌,迫使单个模型处理语义解释、长程结构和低级物理实现。这种耦合使得它们在处理长、复合或语义密集的提示时成本高昂且往往不可靠。我们提出Text2BFM,这是第一个将自然语言与预训练行为基础模型(BFM)对齐用于T2M生成的框架,无需依赖重型端到端运动生成器。Text2BFM在冻结的BFM的潜在策略空间中操作,将其用作可执行的运动先验。一个文本对齐的变分行为瓶颈将BFM策略潜在序列压缩成与语言兼容且保留长程行为结构的紧凑运动表示。生成在这个紧凑的行为流形上通过轻量级条件生成器进行,得到的潜在编码行为被解码为驱动预训练冻结BFM的策略潜在。通过将语义规划与运动执行解耦,Text2BFM实现了高效、鲁棒的T2M生成,并在长复合文本描述上表现出色。

English abstract

Text-to-motion (T2M) generation has broad applications in character animation, virtual avatars, and human-robot interaction. Existing methods typically generate pose trajectories or motion tokens directly from language, forcing a single model to handle semantic interpretation, long-horizon structure, and low-level physical realization. This coupling makes them costly and often unreliable for long, compositional, or semantically dense prompts. We propose Text2BFM, the first framework that aligns natural language with pretrained Behavioral Foundation Models (BFMs) for T2M generation without relying on heavy end-to-end motion generators. Text2BFM operates in the latent policy space of a frozen BFM, using it as an executable motion prior. A text-aligned variational behavioral bottleneck compresses BFM policy-latent sequences into compact motion representations that are compatible with language and preserve long-horizon behavioral structure. Generation is performed in this compact behavioral manifold with a lightweight conditional generator, and the resulting latent encoded behaviors are decoded into policy latents that drive the pretrained frozen BFM. By decoupling semantic planning from motion execution, Text2BFM achieves efficient, robust T2M generation and strong performance on long, compositional textual descriptions.

2601.01901 2026-06-12 cs.LG updated version

FedBiCross: Personalized One-Shot Federated Learning on Medical Images

FedBiCross: 医学图像上的个性化一次性联邦学习

Yuexuan Xia, Yinghao Zhang, Yalin Liu, Hong-Ning Dai, Yong Xia

Affiliations: School of Computer Science and Engineering, Northwestern Polytechnical University, China; School of Science and Technology, Hong Kong Metropolitan University, Hong Kong; Department of Computer Science, Hong Kong Baptist University, Hong Kong

AI summary: Proposes the FedBiCross framework, which uses clustering, bi-level cross-cluster optimization, and personalized distillation to address weak knowledge distillation in one-shot federated learning under non-IID data, outperforming existing methods on four medical image datasets.

Comments: Accepted by BlockSys 2026. This version of the contribution has been accepted for publication, after peer review (when applicable) but is not the Version of Record and does not reflect post-acceptance improvements, or any corrections

AI Chinese abstract

基于无数据知识蒸馏的一次性联邦学习(OSFL)在单轮通信中训练模型,无需共享原始数据,这使得OSFL对隐私敏感的医疗应用具有吸引力。然而,现有方法聚合所有客户端的预测以形成全局教师。在非独立同分布数据下,冲突的预测在平均过程中相互稀释,产生信息量较少的软标签,从而削弱蒸馏效果。我们提出FedBiCross,一个个性化OSFL框架,包含三个阶段:(1)根据模型输出相似性对客户端进行聚类,形成连贯的子集成;(2)双层跨簇优化,学习自适应权重以选择性利用有益的跨簇知识,同时抑制负迁移;(3)针对客户端特定适应的个性化蒸馏。在四个医学图像数据集上的实验表明,FedBiCross在不同非独立同分布程度下始终优于最先进的基线方法。

English abstract

Data-free knowledge distillation-based one-shot federated learning (OSFL) trains a model in a single communication round without sharing raw data, making OSFL attractive for privacy-sensitive medical applications. However, existing methods aggregate predictions from all clients to form a global teacher. Under non-IID data, conflicting predictions dilute each other during averaging, yielding less informative soft labels that weaken distillation. We propose FedBiCross, a personalized OSFL framework with three stages: (1) clustering clients by model output similarity to form coherent sub-ensembles, (2) bi-level cross-cluster optimization that learns adaptive weights to selectively leverage beneficial cross-cluster knowledge while suppressing negative transfer, and (3) personalized distillation for client-specific adaptation. Experiments on four medical image datasets demonstrate that FedBiCross consistently outperforms state-of-the-art baselines across different non-IID degrees.
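The dilution effect this abstract describes, where averaging conflicting client predictions yields less informative soft labels, can be shown with a toy calculation. The two clients, their class probabilities, the fixed cross-cluster weight of 0.1, and the function names are all hypothetical. FedBiCross learns its weights through bi-level optimization, which is not reproduced here.

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(q * math.log(q) for q in p if q > 0)

def average(preds):
    """Uniform average of per-client class-probability vectors."""
    n = len(preds)
    return [sum(col) / n for col in zip(*preds)]

def weighted_teacher(own, other, cross_weight):
    """Mix a cluster's own prediction with another cluster's prediction.

    `cross_weight` is fixed here purely for illustration.
    """
    return [(1.0 - cross_weight) * a + cross_weight * b
            for a, b in zip(own, other)]

# Two clients trained on different (non-IID) data disagree confidently.
client_a = [0.90, 0.05, 0.05]
client_b = [0.05, 0.90, 0.05]

global_teacher = average([client_a, client_b])           # about [0.475, 0.475, 0.05]
cluster_teacher = weighted_teacher(client_a, client_b, cross_weight=0.1)
```

The global average has higher entropy than either client's own prediction, so it carries a weaker training signal. Down-weighting the conflicting cluster keeps the soft label sharper, which is the intuition behind selective cross-cluster aggregation.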

2605.25225 2026-06-12 cs.LG cs.AI updated version

Transformer Field Theory: A Response-Theoretic Approach to Mechanistic Interpretability

Transformer场论:机制可解释性的响应理论方法

David N. Olivieri, Antonio F. Pérez Rodríguez

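Affiliations: Universidade de Vigo; Independent Researcher

AI summary: Proposes a field-theoretic framework that treats the residual stream as a field over layer depth and token position, organizing and predicting Transformer activation patching interventions through localized source insertion, sensitivity-field prediction, Green-function responses, and an adjoint inverse problem, and tests the forward response theory in GPT-2-style autoregressive Transformers.

AI Chinese abstract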

机制可解释性通常使用激活修补、因果追踪、路径修补和引导方向来揭示Transformer激活空间中行为有意义的子空间。本文发展了一个场论框架来组织和预测此类干预。将残差流视为深度-标记场,我们将修补公式化为局部源插入,修补效应作为灵敏度场预测,下游传播作为经验格林函数响应,修补选择作为伴随变分问题。实验上,我们通过在GPT-2风格自回归Transformer中应用局部残差场干预并观察诱导的残差场差异和logit差异响应来测试前向响应理论。我们识别出有界的局部线性区域;从跨残差站点的一阶灵敏度预测修补效应;测量跨深度和标记位置的结构化各向异性传播;从高灵敏度站点和切片格林算子构建响应描述;并表明提示诱导的残差位移可以传递答案行为。这些结果将响应对象(即灵敏度、传播场和格林算子切片)确立为组织修补实验的实用语言,以及制定修补站点推断和跨尺度迁移的前向数学基础。

English abstract

Mechanistic interpretability often studies Transformer behavior by intervening on internal activations through activation patching, causal tracing, path patching, and steering directions. This paper develops Transformer Field Theory: a response-theoretic framework in which the residual stream of a fixed forward pass is treated as a Transformer field over layer depth and token position. In this formulation, patching becomes a localized source insertion into the Transformer field, first-order sensitivity fields predict patch effects, Green functions describe downstream propagation, and patch selection is posed as an adjoint inverse problem. Empirically, we test the theory's forward response objects in GPT-2-style autoregressive Transformers. Localized Transformer-field interventions exhibit a bounded local linear regime; first-order sensitivities predict patch effects across layer-token sites; localized sources generate structured anisotropic Transformer-field propagation; high-sensitivity sites and sliced Green operators provide reduced response descriptions; and prompt-induced Transformer-field displacements partially transfer answer behavior. These results establish sensitivities, Transformer-field responses, and sliced Green operators as practical objects for organizing patching experiments, while providing the forward mathematical basis for patch-site inference and cross-scale response transfer.
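The claim that first-order sensitivities predict patch effects within a bounded local linear regime can be illustrated on a toy readout. This is not the authors' model: a scalar `tanh` readout over a three-dimensional "residual" vector stands in for a Transformer, and the weights, vectors, and names are hypothetical.

```python
import math

def readout(h, w):
    """Toy nonlinear readout of a residual vector h (stands in for a logit)."""
    return math.tanh(sum(wi * hi for wi, hi in zip(w, h)))

def sensitivity(h, w):
    """Analytic gradient of the readout with respect to h."""
    slope = 1.0 - readout(h, w) ** 2
    return [slope * wi for wi in w]

def predicted_effect(h, delta, w):
    """First-order prediction: gradient dotted with the patch displacement."""
    return sum(g * d for g, d in zip(sensitivity(h, w), delta))

def patch_effect(h, delta, w):
    """Actual change in the readout after adding the patch to h."""
    patched = [hi + di for hi, di in zip(h, delta)]
    return readout(patched, w) - readout(h, w)

w = [0.5, -1.0, 2.0]
h = [0.1, 0.2, 0.0]
small_patch = [0.01, 0.0, 0.01]
large_patch = [1.0, 0.0, 1.0]

small_error = abs(predicted_effect(h, small_patch, w) - patch_effect(h, small_patch, w))
large_error = abs(predicted_effect(h, large_patch, w) - patch_effect(h, large_patch, w))
```

For the small patch the linear prediction is accurate to within about 1e-4. For the large patch the readout saturates and the prediction is off by more than 1, which is the sense in which the linear regime is bounded.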

2605.03460 2026-06-12 cs.AI cs.LG updated version

FinSTaR: Towards Financial Reasoning with Time Series Reasoning Models

FinSTaR:面向时间序列推理模型的金融推理

Seunghan Lee, Jun Seo, Jaehoon Lee, Sungdong Yoo, Minjae Kim, Tae Yoon Lim, Dongwan Kang, Hwanil Choi, Soonyoung Lee, Wonbin Ahn

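Affiliations: LG AI Research

AI summary: To address the failure of time series reasoning models in the financial domain, proposes FinSTaR, built on a 2x2 capability taxonomy, which uses Compute-in-CoT and Scenario-Aware CoT strategies to reach 78.9% average accuracy on the FinTSR-Bench benchmark.

Comments: KDD Workshop on SciSoc Agents & LLMs 2026

AI Chinese abstract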

时间序列推理模型在通用领域表现出色,但在具有独特特征的金融领域却持续失败。我们提出一个通用的2x2能力分类法,通过交叉1)单实体与多实体分析,以及2)当前状态评估与未来行为预测来划分TSRM能力。我们在金融领域实例化该分类法——其中确定性评估与随机性预测的区分尤为关键——形成十个金融推理任务,并基于标普股票构建FinTSR-Bench基准。为此,我们提出FinSTaR(金融时间序列思考与推理),在FinTSR-Bench上训练,并针对每个类别采用不同的思维链策略。对于评估(确定性,即可从可观测数据计算得出),我们采用Compute-in-CoT,一种程序化思维链,使模型能够直接从原始价格推导答案。对于预测(本质上是随机的,即受不可观测因素影响),我们采用场景感知思维链,在做出判断前生成多种场景,模拟金融分析师在不确定性下的推理方式。所提方法在FinTSR-Bench上达到78.9%的平均准确率,显著优于LLM和TSRM基线。此外,我们展示了四个能力类别通过联合训练具有互补性和相互增强性,并且场景感知思维链相比标准思维链持续提升预测准确率。代码已公开:https://github.com/seunghan96/FinSTaR。

English abstract

Time series (TS) reasoning models (TSRMs) have shown promising capabilities in general domains, yet they consistently fail in the financial domain, which exhibits unique characteristics. We propose a general 2 x 2 capability taxonomy for TSRMs by crossing 1) single-entity vs. multi-entity analysis with 2) assessment of the current state vs. prediction of future behavior. We instantiate this taxonomy as ten financial reasoning tasks in the financial domain, where the distinction between deterministic assessment and stochastic prediction is particularly critical, forming the FinTSR-Bench benchmark based on S&P stocks. To this end, we propose FinSTaR (Financial Time Series Thinking and Reasoning), trained on FinTSR-Bench with distinct chain-of-thought (CoT) strategies tailored to each category. For assessment, which is deterministic (i.e., computable from observable data), we employ Compute-in-CoT, a programmatic CoT that enables models to derive answers directly from raw prices. For prediction, which is inherently stochastic (i.e., subject to unobservable factors), we adopt Scenario-Aware CoT, which generates diverse scenarios before making a judgment, mirroring how financial analysts reason under uncertainty. The proposed method achieves 78.9% average accuracy on FinTSR-Bench, substantially outperforming LLM and TSRM baselines. Furthermore, we show that the four capability categories are complementary and mutually reinforcing through joint training, and that Scenario-Aware CoT consistently improves prediction accuracy over standard CoT. Code is available at https://github.com/seunghan96/FinSTaR.

2605.24488 2026-06-12 cs.CV cs.GR updated version

Appearance-Invariant Detection of Suggestive Motion via Laban Movement Descriptors

基于拉班运动描述子的暗示性运动外观不变检测

Jaehoon Ahn, Jeonghan Kong, Moon-Ryul Jung

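Affiliations: Sogang University

AI summary: Proposes a motion-only classification pipeline that detects suggestive and explicit movement from SMPL skeleton trajectories using Laban Movement Analysis descriptors, reaching 68% binary SFW/NSFW accuracy with logistic regression under a leak-free evaluation protocol.

Comments: 5 pages, 2 figures, 3 tables. Extended version of a poster accepted to SIGGRAPH 2026

AI Chinese abstract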

在线多人3D虚拟环境中的内容审核最近已交由自动化、基于AI的流程处理。然而,该领域主要涉及图像、视频和音频中非法内容的检测,在暗示性运动的检测技术上存在盲点。我们提出一种仅基于运动的分类流程,使用拉班运动分析(LMA)描述子从SMPL骨架轨迹中检测暗示性和露骨动作。在涵盖四个有序层级(日常、艺术、暗示、露骨)的20,514个运动片段(17小时以上)上,基于110个LMA特征的逻辑回归实现了57.3%的四分类准确率(随机概率的2.3倍)、72.1%的三分类准确率和78.7%的二元SFW/NSFW准确率。混淆主要集中在相邻层级,证实分类错误集中在相邻层级而非非相邻层级。此外,不同运动质量在分类体系的每个层级占主导地位——没有单一特征驱动分类,表明四层级结构反映了真正不同的运动模式。

English abstract

Content moderation in online multiplayer 3D virtual environments is increasingly automated, yet detection has focused on images, video, and audio, leaving suggestive motion a blind spot. We present a motion-only classification pipeline that detects suggestive and explicit movement from SMPL skeleton trajectories using Laban Movement Analysis (LMA) descriptors. On a dataset spanning everyday, artistic, suggestive, and explicit movement (17+ hours of video), a logistic regression trained on 61-feature LMA descriptors reaches 68% binary SFW/NSFW accuracy (70% random forest) under a leak-free evaluation protocol. At this level, our descriptor performs comparably to a learned video model trained on the same motion re-rendered as appearance-free video, a gray figure with no clothing, skin, or scene. The indirectness (tortuosity) of each joint's trajectory, measured as the ratio of the joint's path length to its net displacement, peaks at the suggestive tier, showing that the Direct-to-Indirect polarity of Laban's Space factor provides an interpretable marker of the shift from functional to suggestive motion. Ultimately, Laban-based kinematic descriptors offer a lightweight, interpretable approach to suggestive-motion detection: every decision decomposes into named, theory-grounded features. Because the classifier operates on pose trajectories alone, moderation can run directly on avatar poses in virtual environments, with no appearance data.
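The indirectness (tortuosity) measure defined in this abstract, the ratio of a joint's path length to its net displacement, can be computed as in the minimal sketch below. The function name, the input format (a list of per-frame joint positions), and the handling of zero net displacement are assumptions for illustration; the abstract does not say how the authors treat that edge case.

```python
import math

def tortuosity(points):
    """Path length divided by net displacement for one joint trajectory.

    `points` is a sequence of per-frame positions (tuples of equal length).
    A perfectly direct movement gives 1.0; larger values are more indirect.
    """
    path_length = sum(math.dist(a, b) for a, b in zip(points, points[1:]))
    net_displacement = math.dist(points[0], points[-1])
    if net_displacement == 0:
        # Assumed convention: a stationary joint counts as direct, while a
        # closed loop with non-zero path is treated as maximally indirect.
        return 1.0 if path_length == 0 else math.inf
    return path_length / net_displacement

straight = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
zigzag = [(0.0, 0.0, 0.0), (1.0, 1.0, 0.0), (2.0, 0.0, 0.0)]
loop = [(0.0, 0.0), (1.0, 0.0), (0.0, 0.0)]
```

Here `tortuosity(straight)` is exactly 1.0 and `tortuosity(zigzag)` is the square root of 2. Because the feature is a named geometric quantity, each decision of a linear classifier built on it can be decomposed into named contributions.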

2605.17770 2026-06-12 cs.AI cs.CL updated version

Entropy-Gradient Inversion: Moving Toward Internal Mechanism of Large Reasoning Models

熵梯度反转:迈向大型推理模型的内部机制

Junyao Yang, Chen Qian, Kun Wang, Linfeng Zhang, Quanshi Zhang, Yong Liu, Dongrui Liu

Affiliations: National University of Singapore; Renmin University of China; Shanghai Jiao Tong University; Nanyang Technological University

AI summary: Identifies a robust negative correlation between token entropy and logit gradients in large reasoning models (Entropy-Gradient Inversion) and proposes Correlation-Regularized Group Policy Optimization (CorR-PO), which embeds it into RL reward regularization to improve reasoning performance.

Comments: The authors are withdrawing this manuscript due to fundamental inaccuracies in the institutional affiliations and administrative attributions provided at the time of submission. As this version cannot be validated under the correct institutional framework, the authors request its formal withdrawal from the repository. No immediate replacement is intended

AI Chinese abstract

大型推理模型(LRMs)的进步推动了从反应式“快思考”文本生成向系统性、逐步“慢思考”推理的范式转变,在复杂数学和逻辑任务中实现了最先进的性能。然而,该领域面临着令牌级行为分析与内部推理机制之间的根本差距,以及依赖昂贵外部验证器的推理优化强化学习(RL)的不稳定性。我们识别并正式定义了熵梯度反转,即令牌熵与logit梯度之间的稳健负相关,它作为LRM推理能力的明确几何指纹。在此基础上,我们提出相关性正则化组策略优化(CorR-PO),将这种反转特征嵌入RL奖励正则化。在多个模型规模的各种推理基准上的大量实验表明,CorR-PO始终优于最先进的基线,证实了更强的反转直接与更优的推理性能相关。

English abstract

The advancement of Large Reasoning Models (LRMs) has catalyzed a paradigm shift from reactive "fast thinking" text generation to systematic, step-by-step "slow thinking" reasoning, unlocking state-of-the-art performance in complex mathematical and logical tasks. However, the field faces the fundamental gap between token-level behavioral analysis and internal reasoning mechanisms, and the instability of reinforcement learning (RL) for reasoning optimization relying on costly external verifiers. We identify and formally define Entropy-Gradient Inversion, a robust negative correlation between token entropy and logit gradients that acts as a definitive geometric fingerprint for LRM reasoning capability. Building on this, we propose Correlation-Regularized Group Policy Optimization (CorR-PO), which embeds this inversion signature into RL reward regularization. Extensive experiments on various reasoning benchmarks across multiple model scales show CorR-PO consistently outperforms state-of-the-art baselines, confirming that stronger inversion directly correlates with superior reasoning performance.

2605.22641 2026-06-12 cs.CL cs.AI cs.LG updated version

More Context, Larger Models, or Moral Knowledge? A Systematic Study of Schwartz Value Detection in Political Texts

更多上下文、更大模型还是道德知识?政治文本中施瓦茨价值观检测的系统研究

Víctor Yeste, Paolo Rosso

发表机构 * PRHLT Research Center, Universitat Politècnica de València, Spain(PRHLT研究中心,瓦伦西亚理工大学,西班牙) School of Science, Engineering and Design, Universidad Europea de Valencia, Spain(瓦伦西亚欧洲大学科学、工程与设计学院,西班牙) Valencian Graduate School and Research Network of Artificial Intelligence (ValgrAI)(瓦伦西亚人工智能研究生学院与研究网络(ValgrAI))

AI总结 本研究系统比较了上下文范围、检索增强道德知识和模型规模对政治文本中施瓦茨价值观检测的影响,发现全文档上下文和检索知识对监督编码器有效,但对零样本大语言模型帮助有限,且模型扩展不保证性能提升。

Comments Code: https://github.com/VictorMYeste/human-value-detection-context-rag, best model: https://huggingface.co/VictorYeste/value-context-rag-deberta-v3-base-doc-rag, 18 pages, 3 figures

详情
AI中文摘要

检测政治文本中的施瓦茨价值观具有挑战性,因为隐含线索通常依赖于周围的论证和相邻价值观之间的细微差别。我们研究了上下文和显式道德知识何时有助于句子级别的价值观检测。使用ValuesML/Touché ValueEval格式,我们比较了句子、窗口和全文档输入;无检索增强和基于检索增强的设置(使用精心策划的道德知识库);监督的DeBERTa-v3-base/large编码器;以及参数规模从12B到123B的零样本大语言模型。结果表明,更多上下文并非总是更好:全文档上下文使监督的DeBERTa编码器相比仅句子输入提高了3.8-4.8个宏F1点,但对零样本大语言模型没有一致帮助。在匹配比较中,检索到的道德知识更一致地有用,在早期融合下改善了每个测试的模型系列和上下文条件。然而,从DeBERTa-v3-base扩展到large以及从12B扩展到更大的大语言模型并不保证收益,并且简单的早期融合优于测试的后期融合和交叉注意力检索增强生成变体。按价值观分析表明,上下文和检索对社交情境化或概念上易混淆的价值观帮助最大。这些发现表明,价值观敏感的NLP应联合评估上下文、知识和模型系列,而不是将更长的输入或更大的模型视为通用改进。

英文摘要

Detecting Schwartz values in political text is difficult because implicit cues often depend on surrounding arguments and fine-grained distinctions between neighboring values. We study when context and explicit moral knowledge help sentence-level value detection. Using the ValuesML/Touché ValueEval format, we compare sentence, window, and full-document inputs; no-RAG and retrieval-augmented settings with a curated moral knowledge base; supervised DeBERTa-v3-base/large encoders; and zero-shot LLMs from 12B to 123B parameters. The results show that more context is not uniformly better: full-document context improves supervised DeBERTa encoders by 3.8-4.8 macro-F1 points over sentence-only input, but does not consistently help zero-shot LLMs. Retrieved moral knowledge is more consistently useful in matched comparisons, improving each tested model family and context condition under early fusion. However, scaling from DeBERTa-v3-base to large and from 12B to larger LLMs does not guarantee gains, and simple early fusion outperforms the tested late-fusion and cross-attention RAG variants for encoders. Per-value analyses show that context and retrieval help most for socially situated or conceptually confusable values. These findings suggest that value-sensitive NLP should evaluate context, knowledge, and model family jointly rather than treating longer inputs or larger models as universal improvements.
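The gains above are reported in macro-F1 points. As a reminder of what that metric averages, here is a minimal sketch for multi-label sentence-level predictions. The data layout (one set of value indices per sentence) and the convention of scoring a label with no true or predicted instances as 0.0 are assumptions for illustration. The paper's evaluation code may differ.

```python
def macro_f1(y_true, y_pred, num_labels):
    """Unweighted mean of per-label F1 scores.

    y_true, y_pred: lists of sets of label indices, one set per sentence.
    ASSUMPTION: a label with no true and no predicted instances scores 0.0.
    """
    f1_scores = []
    for label in range(num_labels):
        tp = sum(1 for t, p in zip(y_true, y_pred) if label in t and label in p)
        fp = sum(1 for t, p in zip(y_true, y_pred) if label not in t and label in p)
        fn = sum(1 for t, p in zip(y_true, y_pred) if label in t and label not in p)
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / num_labels

# Toy usage with two hypothetical value labels (0 and 1) over three sentences.
gold = [{0}, {1}, {0, 1}]
pred = [{0}, {0}, {0, 1}]
score = macro_f1(gold, pred, num_labels=2)
```

Because every label counts equally, rare values weigh as much as frequent ones, which is why a change of a few macro-F1 points can reflect improvements concentrated in a handful of values.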

2602.00122 2026-06-12 cs.CV cs.AI cs.MM 版本更新

VDE Bench: Evaluating The Capability of Image Editing Models to Modify Visual Documents

VDE Bench: 评估图像编辑模型对视觉文档进行修改的能力

Hongzhu Yi, Yujia Yang, Yuanxiang Wang, Tong Li, Zhenyu Guan, Tianyu Zong, Jiahuan Chen, Chenxi Bao, Tiankun Yang, Haopeng Jin, Yixuan Yuan, Xinming Wang, Tao Yu, Ruilin Gao, Ruiwen Tao, Haijin Liang, Jin Ma, Jinwen Luo, Yeshani, Xinyu Zuo, Jungang Xu

发表机构 * UCAS(中国科学院大学) CASIA(中国科学院自动化研究所) Tencent(腾讯) CMU(卡内基梅隆大学) WashU(圣路易斯华盛顿大学) SJTU(上海交通大学) XDU(西安电子科技大学)

AI总结 本文提出VDE Bench,一个专门评估图像编辑模型在双语中文-英文和复杂视觉文档编辑任务性能的基准,通过高质量数据集和新的评估框架,系统量化了文本修改的准确性。

详情
AI中文摘要

近年来,图像编辑模型取得了显著进展,使用户能够通过自然语言指令灵活、交互式地操作视觉内容。然而,密集视觉文档图像编辑仍是一个重要但尚未充分探索的研究方向,它需要修改图像中的文本内容,同时忠实保留原始文本风格和背景上下文。现有方法主要集中在英语场景和文本相对稀疏的图像上,因此无法充分处理密集、结构复杂的文档或中文等非拉丁文字。为弥合这一差距,我们提出了VDE Bench(Visual Doc Edit Bench),这是一个经过严格人工标注和评估的基准,专门用于评估图像编辑模型在中英双语和复杂视觉文档编辑任务上的性能。该基准包含一个由942个基于指令的图像编辑样本组成的高质量数据集,其种子图像涵盖密集的中文和英文文本文档,包括学术论文、海报、演示文稿、考试材料和报纸。此外,我们引入了一个新的评估框架,在OCR解析层面系统地量化编辑性能,从而实现对文本修改准确性的细粒度评估。基于此基准,我们对代表性图像编辑模型进行了全面评估。人工验证表明,人类判断与自动化评估指标之间高度一致。VDE Bench是首个用于评估图像编辑模型在双语密集文本视觉文档上性能的系统性基准。

英文摘要

In recent years, image editing models have made significant progress, enabling users to manipulate visual content in a flexible and interactive manner through natural language instructions. However, an important yet underexplored research direction remains dense visual document image editing, which involves modifying textual content within images while faithfully preserving the original text style and background context. Existing methods primarily focus on English scenarios and images with relatively sparse text, and thus cannot adequately address dense, structurally complex documents or non-Latin scripts such as Chinese. To bridge this gap, we propose VDE Bench (Visual Doc Edit Bench), a rigorously human annotated and evaluated benchmark specifically designed to assess the performance of image editing models on bilingual Chinese-English and complex visual document editing tasks. The benchmark comprises a high quality dataset of 942 instruction based image editing samples, whose seed images encompass dense Chinese and English text documents including academic papers, posters, presentation slides, examination materials, and newspapers. Furthermore, we introduce a novel evaluation framework that systematically quantifies editing performance at the OCR parsing level, thereby enabling fine grained assessment of text modification accuracy. Based on this benchmark, we conduct a comprehensive evaluation of representative image editing models. Human verification demonstrates a high degree of consistency between human judgments and automated evaluation metrics. VDE Bench constitutes the first systematic benchmark for evaluating the performance of image editing models on bilingual dense text visual documents.
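The abstract states that editing performance is quantified at the OCR parsing level but does not name a metric. One plausible way to score a text edit, sketched here purely as an assumption, is to compare the OCR transcript of the edited region against the text the instruction asked for, using a normalized character-level edit distance. The function names are hypothetical, and the OCR step itself is out of scope: both inputs are plain strings.

```python
def edit_distance(a, b):
    # Levenshtein distance via a rolling one-row dynamic programme.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]

def char_accuracy(ocr_text, expected_text):
    # ASSUMPTION: score = 1 - normalized edit distance, clipped to [0, 1].
    if not expected_text:
        return 1.0 if not ocr_text else 0.0
    return max(0.0, 1.0 - edit_distance(ocr_text, expected_text) / len(expected_text))
```

Working on characters rather than whitespace-separated words keeps the same score usable for both Chinese and English text, which matters for a bilingual benchmark.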

2605.20763 2026-06-12 cs.LG 版本更新

ShapeBench: A Scalable Benchmark and Diagnostic Suite for Standardized Evaluation in Aerodynamic Shape Optimization

ShapeBench: 一种可扩展的基准和诊断套件,用于气动形状优化的标准化评估

Shaghayegh Fazliani, Krissh Chawla, Jack Guo, Yiren Shen, Matthias Ihme, Madeleine Udell

发表机构 * Stanford University(斯坦福大学) Spinoza Labs(斯皮诺扎实验室)

AI总结 本文提出ShapeBench,一个开源的气动形状优化基准,提供统一的API,涵盖103个任务和八个形状类别,通过验证的代理模型和高保真CFD流程进行系统分析,展示了不同形状类别和问题形式中优化器排名的显著差异,强调了需要更通用方法的必要性。

详情
AI中文摘要

气动形状优化(ASO)的快速进展已超过了现有标准化评估框架的发展。公平比较需要一个覆盖多样形状类别、目标函数形式以及预算匹配的最先进基线的统一基准。我们引入ShapeBench,一个提供统一API的开源ASO基准,涵盖103个任务,跨越八个形状类别和多种优化模式。每个ShapeBench任务都包含一个经过验证的代理模型以实现快速搜索;在可行时,还提供高保真计算流体动力学(CFD)流程用于最终验证,从而支持系统化的保真度差距分析。ShapeBench提供可复现的协议和配置良好的基线,以一致的预算度量进行公平比较,支持经典方法与LLM驱动方法之间的比较,包括通用优化器和一个新的领域专用进化LLM基线ShapeEvolve。ShapeBench上的结果表明,优化器排名在不同形状类别和问题形式之间差异显著,平均成对斯皮尔曼ρ=0.013,因此单任务结论无法可靠地推广到不同问题类别。该基准还远未饱和;经典方法很少能适用于所有形状类别和任务,这进一步凸显了对更通用方法的需求。

英文摘要

Rapid progress in aerodynamic shape optimization (ASO) has outpaced currently available standardized evaluation frameworks. Fair comparison requires a unified benchmark spanning diverse shape classes, objective formulations, and matched-budget state-of-the-art baselines. We introduce ShapeBench, an open-source ASO benchmark with a unified API spanning 103 tasks across eight shape categories and multiple optimization regimes. Each ShapeBench task includes a validated surrogate for fast search; when feasible, a high-fidelity Computational Fluid Dynamics (CFD) pipeline for final verification is available, enabling systematic fidelity-gap analysis. ShapeBench provides a reproducible protocol with well-configured baselines to compare fairly using a consistent budget metric, allowing for comparison among both classical and LLM-driven methods, including general-purpose optimizers and a new domain-specialized evolutionary LLM baseline, ShapeEvolve. Results on ShapeBench demonstrate substantial variance in optimizer rankings across shape categories and problem formulations, with mean pairwise Spearman $\rho = 0.013$, so single-task conclusions do not reliably generalize across problem classes. The benchmark is also far from saturation; classical methods are rarely applicable across all shape categories and tasks, further highlighting the need for more general-purpose approaches.
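The "mean pairwise Spearman" statistic above can be sketched as follows: rank the optimizers on each task, compute Spearman's rank correlation for every pair of tasks, and average. The input layout (a dict from task name to a list of scores, one per optimizer, in a fixed optimizer order) and the function names are assumptions for illustration. How the benchmark groups tasks before averaging is not stated in the abstract.

```python
import math
from itertools import combinations

def ranks(scores):
    # Average ranks (1 = smallest score); tied scores share the mean rank.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    result = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0.0 or vy == 0.0:
        raise ValueError("correlation is undefined when all scores are tied")
    return cov / math.sqrt(vx * vy)

def spearman(a, b):
    # Spearman's rho is the Pearson correlation of the rank vectors.
    return pearson(ranks(a), ranks(b))

def mean_pairwise_spearman(task_scores):
    # task_scores: {task_name: [score of optimizer 0, optimizer 1, ...]}
    pairs = list(combinations(task_scores.values(), 2))
    return sum(spearman(a, b) for a, b in pairs) / len(pairs)
```

A value near zero, like the 0.013 reported, means that knowing which optimizer wins on one task says almost nothing about which wins on another.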

2605.18817 2026-06-12 cs.LG 版本更新

Multi-Token Residual Prediction

多令牌残差预测

Yufeng Xu, Zishuo Bao, Qian Wang, Zeshen Zhang, Haoqi Zhang, Bowen Peng, Ang Li, Rahul Chalamala, Yucheng Lu

发表机构 * New York University(纽约大学) New York University Shanghai(纽约大学上海) Nos Research(Nos研究) Modal

AI总结 本文提出了一种轻量级模块Multi-token Residual Prediction,通过利用去噪过程中相邻步骤的logit分布相似性,在单次骨干网络前向传播中实现依赖感知的多令牌去噪,从而在成本较低的情况下提高去噪效率。

详情
AI中文摘要

扩散语言模型(DLMs)通过迭代去噪掩码令牌序列生成文本,相较于自回归模型在并行性和质量之间提供了一种权衡。在当前实践中,每步解码的令牌数量由置信度阈值控制,随着每步去噪的令牌数量增加,质量单调下降。我们引入了多令牌残差预测(MRP),这是一种轻量级模块,能够在单次骨干网络前向传播中实现依赖感知的多令牌去噪。MRP利用了去噪过程的一个关键性质:相邻去噪步骤的logit分布非常相似。MRP无需再次运行骨干网络来获得下一步的logits,而是根据骨干网络的隐藏状态预测步骤间的残差,从而以很小的代价在每次骨干网络前向传播中去噪更多令牌。我们将MRP应用于DLM解码的两种运行模式。在高质量、低吞吐的静态去噪模式下,MRP作为推测解码的草稿器:其提案由骨干网络验证,在SGLang中实现最高1.4倍的无损加速。在低质量、高吞吐的动态去噪模式下,MRP转而驱动一种重掩码方案,撤销过早揭示的令牌,恢复了激进低阈值解码损失的大部分准确率,在代码生成任务HumanEval上最多提升22.6个点,在推理任务GSM8K上最多提升17.7个点。

英文摘要

Diffusion Language Models (DLMs) generate text by iteratively denoising masked token sequences, offering a tradeoff between parallelism and quality compared to autoregressive models. In current practice, the number of tokens decoded per step is controlled by a confidence threshold, and quality degrades monotonically as more tokens are denoised per step. We introduce Multi-token Residual Prediction (MRP), a lightweight module that enables dependency-aware multi-token denoising within a single backbone forward pass. MRP exploits a key property of the denoising process: the logit distributions at adjacent denoising steps are remarkably similar. Rather than running the backbone a second time to obtain the next-step logits, MRP predicts the residual between steps from the backbone's hidden states, effectively denoising more tokens per backbone forward at a fraction of the cost. We apply MRP across the two operating regimes of DLM decoding. In the high-quality-low-throughput static denoising regime, MRP serves as a drafter for speculative decoding: its proposals are verified against the backbone, yielding lossless acceleration of up to 1.4x in SGLang. In the low-quality-high-throughput dynamic denoising regime, MRP instead drives a remasking scheme that revokes over-eager reveals, recovering most of the accuracy lost to aggressive low-threshold decoding and improving accuracy by up to 22.6 points on code generation task HumanEval and 17.7 points on reasoning task GSM8K.
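The baseline the abstract starts from, where a confidence threshold controls how many masked tokens are decoded per step, can be sketched in a few lines. This illustrates only that thresholding step, not MRP itself. The function name, the dict-based input, and the fallback of always revealing the single most confident position (so that decoding makes progress) are assumptions for illustration.

```python
def select_tokens_to_reveal(probs_by_position, threshold):
    """Pick which masked positions to decode in one denoising step.

    probs_by_position: {masked position: list of token probabilities}.
    Returns {position: token id} for positions whose top probability
    meets the threshold.
    """
    # Most likely token at each masked position.
    best = {pos: max(range(len(probs)), key=probs.__getitem__)
            for pos, probs in probs_by_position.items()}
    revealed = {pos: tok for pos, tok in best.items()
                if probs_by_position[pos][tok] >= threshold}
    # ASSUMPTION: if nothing clears the threshold, reveal the single most
    # confident position so every step decodes at least one token.
    if not revealed and best:
        pos = max(best, key=lambda q: probs_by_position[q][best[q]])
        revealed = {pos: best[pos]}
    return revealed

# Toy step with three masked positions and a two-token vocabulary.
step_probs = {0: [0.9, 0.1], 1: [0.6, 0.4], 2: [0.2, 0.8]}
strict = select_tokens_to_reveal(step_probs, threshold=0.75)
```

Lowering `threshold` reveals more positions per step, which is the aggressive regime where, according to the abstract, quality degrades and a remasking scheme becomes useful.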