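arXivDaily: a daily academic digest synchronized with the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, and electrical engineering & systems.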

2606.18246 2026-06-17 cs.CL New submission

Variable-Width Transformers

Zhaofeng Wu, Oliver Sieberling, Shawn Tan, Rameswar Panda, Yury Polyanskiy, Yoon Kim

Affiliations: MIT; MIT-IBM Watson AI Lab

AI summary: Proposes a variable-width Transformer architecture that is narrow in the middle and wide at both ends, achieving nonuniform capacity allocation through a parameter-free residual resizing mechanism and outperforming uniform-width baselines in language-model perplexity, FLOPs, and KV cache.

Abstract

Scaling model size, specifically depth and width, has driven significant progress in transformer-based language models. However, most architectures maintain a constant width across all layers, allocating a fixed parameter and computation budget evenly despite different layers potentially playing distinct computational roles. In this work, we empirically investigate nonuniform capacity allocation across network depth by proposing a $\times$-shaped ><former architecture. This design maintains wider early and late layers while narrowing the middle layers, utilizing a parameter-free residual resizing mechanism. Across decoder-only language models ranging from 200M to 2B parameters (dense) and 3B parameters (MoE), our ><former consistently outperforms parameter-matched uniform baselines on language modeling loss. By reducing the average layer width, this architecture also requires fewer overall FLOPs (22% reduction under fitted loss-matched scaling curves) and smaller KV cache memory and I/O cost (15% reduction). In analysis, we show that this bottleneck structure results in qualitatively different representations in residual streams. Overall, our results demonstrate that nonuniform width allocation can result in more resource-optimal scaling of language models.
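As a rough illustration of the wide-narrow-wide idea, the sketch below builds a hypothetical width schedule and resizes a residual vector without any learned parameters. The linear schedule and the truncate/zero-pad rule are assumptions made for this example; the abstract does not say how the paper's resizing mechanism actually works.

```python
def x_shaped_widths(num_layers: int, wide: int, narrow: int) -> list[int]:
    """Hypothetical schedule: wide at both ends, narrow in the middle, linear in between."""
    if num_layers < 2:
        return [wide] * num_layers
    mid = (num_layers - 1) / 2
    widths = []
    for i in range(num_layers):
        frac = abs(i - mid) / mid  # 1.0 at the first/last layer, 0.0 at the middle
        widths.append(round(narrow + frac * (wide - narrow)))
    return widths


def resize_residual(x: list[float], new_width: int) -> list[float]:
    """Assumed parameter-free resize: truncate to shrink, zero-pad to grow."""
    if new_width <= len(x):
        return x[:new_width]
    return x + [0.0] * (new_width - len(x))


widths = x_shaped_widths(5, wide=1024, narrow=512)
average_width = sum(widths) / len(widths)  # lower than the uniform width of 1024
```

Because the average width falls below the uniform width, per-layer compute and KV cache size shrink in the narrow layers, which is the source of the savings the abstract reports.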

2606.18242 2026-06-17 cs.CV New submission

EventDrive: Event Cameras for Vision-Language Driving Intelligence

Dongyue Lu, Rong Li, Ao Liang, Lingdong Kong, Wei Yin, Lai Xing Ng, Benoit R. Cottereau, Camille Simon Chane, Wei Tsang Ooi

Affiliations: NUS (National University of Singapore); HKUST(GZ); Horizon Robotics; A*STAR, I2R; IPAL, CNRS IRL 2955, Singapore; University Toulouse, CNRS, CerCo, Toulouse, France; ETIS UMR 8051, CY Cergy Paris University, ENSEA, CNRS, France

AI summary: Proposes the EventDrive benchmark and model suite, which fuses event streams with RGB frames through a multi-horizon event pyramid and a temporal-horizon mixture-of-experts module, improving driving reasoning across four dimensions: perception, understanding, prediction, and planning.

Comments: CVPR 2026, 34 pages, 15 figures, 15 tables, project page: https://dylanorange.github.io/projects/eventdrive

Abstract

Event cameras sense the world through asynchronous brightness changes with microsecond latency and high dynamic range, offering motion fidelity far beyond frame-based sensors and capturing temporal structure that conventional exposures often miss. These properties make events a powerful complement to RGB in autonomous driving, especially under blur, glare, and rapid motion, where frame-based perception can become unreliable. However, existing event-aware vision-language models remain limited to generic perception and do not reveal how event sensing contributes to reasoning and decision-making across the full driving loop. We present EventDrive, a large-scale benchmark and model suite that unifies event streams, RGB frames, and language supervision across four core dimensions: Perception, Understanding, Prediction, and Planning, covering captions, structured QA, grounding, motion-state recognition, trajectory forecasting, and planning tasks. Building on this foundation, EventDrive-VLM introduces a multi-horizon event pyramid and a temporal-horizon mixture-of-experts module to adaptively encode and fuse asynchronous and frame-based information for downstream reasoning. Comprehensive evaluation across diverse tasks shows that event streams provide substantial gains in temporal precision, motion awareness, and robustness, bringing event sensing into the center of driving intelligence.

2606.18239 2026-06-17 cs.RO New submission

EBench: Elemental Diagnosis of Generalist Mobile Manipulation Policies

Ning Gao, Jinliang Zheng, Xing Gao, Haoxiang Ma, Hanqing Wang, Yukai Wang, Jiantong Chen, Zanxin Chen, Shujie Zhang, Mingda Jia, Xuekun Jiang, Zihou Zhu, Xinyu Li, Shuai Wang, Hao Li, Wenzhe Cai, Yuqiang Yang, Xudong Xu, Zhaoyang Lyu, Yao Mu, Tai Wang, Jiangmiao Pang, Jia Zeng, Weinan Zhang, Chunhua Shen

Affiliations: Shanghai AI Laboratory; Xi'an Jiaotong University; Institute for AI Industry Research (AIR), Tsinghua University; Tsinghua University; University of Science and Technology of China; Shanghai Jiao Tong University; Zhejiang University

AI summary: Proposes the EBench benchmark, which diagnoses generalist mobile manipulation models along 5 capability dimensions and 4 generalization dimensions, revealing that models with similar success rates differ markedly in capability.

Abstract

We present EBench, a simulation benchmark that diagnoses generalist mobile manipulation policies beyond a single success-rate scalar. EBench comprises 26 diverse and challenging manipulation tasks annotated along 5 capability dimensions and 4 generalization dimensions. We evaluate state-of-the-art generalist manipulation models including $\pi_0$, $\pi_{0.5}$, XVLA, and InternVLA-A1, and reveal that models with similar success rates exhibit strikingly different capability profiles: $\pi_{0.5}$ achieves the highest test success rate and the best train-test retention, whereas InternVLA-A1 dominates mobile manipulation but collapses on dexterous tasks, and XVLA exhibits strengths on a disjoint set of atomic skills compared to other policies. Beyond capability profiling, EBench analyzes generalization ability from 4 representative perspectives, identifying the impact of different distribution shift factors. The results reveal the strengths and weaknesses of models behind an overall score. We hope this benchmark offers a broad set of diagnostic signals to guide iteration on generalist manipulation models.

2606.18235 2026-06-17 cs.AI New submission

EvolveNav: Proactive Preflection and Self-Evolving Memory for Zero-Shot Object Goal Navigation

Qi Chai, Wenhao Shen, Nanjie Yao, Yue Xia, Kaiyong Zhao, Jie Ma, Guosheng Lin, Hao Wang

Affiliations: HKUST(GZ); Nanyang Technological University; Xi'an Jiaotong University; XGRIDS

AI summary: Proposes a self-evolving zero-shot object-goal navigation framework that extracts rules from past trajectories, retrieves them with an upper-confidence-bound strategy, and adds a memory-guided preflection module to reduce ineffective exploration, improving success rate by 10.1%.

Abstract

Zero-Shot Object-Goal Navigation (ZS-OGN) requires embodied agents to explore and locate target objects without any prior training. To this end, recent methods leverage foundation models, but they typically rely on static priors and lack adaptation, which leads to repeated errors and costly trial and error. In this paper, we propose a self-evolving ZS-OGN framework that enables continuous test-time improvement. Specifically, we build an agentic rule memory by extracting actionable knowledge from past trajectories. Then, we propose a retrieval strategy based on the upper confidence bound (UCB), selecting effective rules by balancing semantic relevance and historical success. In addition, we introduce a memory-guided preflection module that forecasts potential outcomes before action, reducing inefficient exploration. Extensive experiments show that our method outperforms existing zero-shot baselines, achieving a 10.1% improvement in success rate with fewer unnecessary steps.
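A minimal sketch of UCB-style rule retrieval may help make the relevance/success trade-off concrete. The scoring formula (relevance plus mean success plus an exploration bonus), the `Rule` fields, and the constant `c` are assumptions for illustration, not the paper's definition.

```python
import math
from dataclasses import dataclass


@dataclass
class Rule:
    text: str
    relevance: float  # assumed semantic similarity to the current query, in [0, 1]
    successes: int    # episodes where using this rule led to success
    uses: int         # episodes where this rule was retrieved


def ucb_score(rule: Rule, total_uses: int, c: float = 1.0) -> float:
    """Relevance + empirical success rate + exploration bonus for rarely used rules."""
    if rule.uses == 0:
        return math.inf  # untried rules are explored first
    mean_success = rule.successes / rule.uses
    bonus = c * math.sqrt(math.log(total_uses) / rule.uses)
    return rule.relevance + mean_success + bonus


def retrieve(rules: list[Rule], k: int = 1, c: float = 1.0) -> list[Rule]:
    """Return the k rules with the highest UCB score."""
    total = max(1, sum(r.uses for r in rules))
    return sorted(rules, key=lambda r: ucb_score(r, total, c), reverse=True)[:k]
```

The bonus term shrinks as a rule is used more often, so a frequently retrieved rule keeps its rank only if its success rate and relevance justify it.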

2606.18231 2026-06-17 cs.CV cs.LG cs.RO New submission

Adaptive Volumetric Mechanical Property Fields Invariant to Resolution

Rishit Dagli, Donglai Xiang, Vismay Modi, Xuning Yang, Gavriel State, David I.W. Levin, Maria Shugrina

Affiliations: NVIDIA

AI summary: Proposes AdaVoMP, which uses a sparse adaptive voxel structure and an autoregressive Transformer encoder-decoder to predict high-resolution, spatially varying Young's modulus, Poisson's ratio, and density for 3D objects, at $16^3\times$ higher resolution and with better accuracy than prior art.

Comments: Project page and hi-res paper: https://research.nvidia.com/labs/sil/projects/adavomp/. ICML 2026

Abstract

Accurate mechanical properties (or materials), namely Young's modulus ($E$), Poisson's ratio ($\nu$), and density ($\rho$), are essential for reliable physics simulation of digital worlds, but most 3D assets lack this information. We propose AdaVoMP, a method for predicting accurate dense spatially-varying ($E$, $\nu$, $\rho$) for input 3D objects across representations, improving the resolution, accuracy, and memory efficiency over the state of the art. The foundation of our technique is a sparse and adaptive voxel structure, SAV, that efficiently represents both the input 3D shape and the material field output. We replace the fixed-voxel model of the most accurate prior method, VoMP, with a novel sparse transformer encoder-decoder model that learns to generate a unique SAV autoregressively for every input shape to represent its materials, achieving a resolution $16^3\times$ higher than prior art. Experiments show that AdaVoMP estimates more accurate volumetric properties, even with less test-time compute than all prior art. This allows us to convert high-resolution complex 3D objects into simulation-ready assets, resulting in realistic deformable simulations.

2606.18216 2026-06-17 cs.CL New submission

Zone of Proximal Policy Optimization: Teacher in Prompts, Not Gradients

Byung-Kwan Lee, Ximing Lu, Shizhe Diao, Minki Kang, Saurav Muralidharan, Karan Sapra, Andrew Tao, Pavlo Molchanov, Yejin Choi, Yu-Chiang Frank Wang, Ryo Hachiuma

Affiliations: NVIDIA

AI summary: Proposes ZPPO, which injects teacher knowledge into prompts rather than the policy gradient, addressing teacher-gradient dominance in small-model knowledge distillation and policy drift in reinforcement learning, and outperforms existing methods across model scales.

Comments: Project page: https://byungkwanlee.github.io/ZPPO-page/

Abstract

Knowledge distillation transfers a teacher's competence to a small student but is brittle in the small-student regime: forcing the student to imitate logits from a much larger teacher concentrates it on the teacher's sharpest modes, hurting generalization on benchmark families beyond the training corpus. Reinforcement learning (RL) avoids logit imitation by training on the student's own rollouts. However, on questions where every rollout fails (yielding zero advantage and being silently discarded), injecting a stronger teacher's response into the policy gradient breaks the on-policy assumption and induces drift. We introduce Zone of Proximal Policy Optimization (ZPPO), inspired by Vygotsky's zone of proximal development, which keeps the teacher inside the prompt rather than the policy gradient. On hard questions, ZPPO constructs two reformulated prompts: a Binary Candidate-included Question (BCQ) pairs one correct teacher response with one incorrect student response as anonymized candidates the student must discriminate, and a Negative Candidate-included Question (NCQ) aggregates the student's wrong rollouts into a single prompt to surface their shared failure modes. A prompt replay buffer recirculates each hard question until it either graduates (the student's mean rollout accuracy on it reaches one half) or is FIFO-evicted under finite capacity, amplifying BCQ and NCQ inside the student's current zone of proximal development. On the Qwen3.5 family at four student scales (0.8B-9B) with a 27B teacher, post-trained as vision-language models and evaluated on a 31-benchmark suite (16 VLM, 10 LLM, 5 Video), ZPPO outperforms off/on-policy distillation and GRPO, with the largest gains at the smallest scale.
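The replay-buffer behaviour described in the abstract (recirculate a hard question until it graduates at mean accuracy one half, or evict it first-in-first-out when capacity is reached) can be sketched as below. The class name, method names, and the choice to ignore re-added questions are hypothetical; only the graduation threshold and the FIFO rule come from the abstract.

```python
from collections import OrderedDict


class PromptReplayBuffer:
    """Toy buffer of hard questions with graduation and FIFO eviction."""

    def __init__(self, capacity: int, graduate_at: float = 0.5):
        self.capacity = capacity
        self.graduate_at = graduate_at
        self._items: "OrderedDict[str, str]" = OrderedDict()  # question id -> prompt, oldest first

    def add(self, qid: str, prompt: str) -> None:
        if qid in self._items:
            return  # already circulating; keep its original queue position
        if len(self._items) >= self.capacity:
            self._items.popitem(last=False)  # FIFO: evict the oldest question
        self._items[qid] = prompt

    def update(self, qid: str, rollout_correct: list[bool]) -> bool:
        """Remove qid if its mean rollout accuracy reaches the threshold; return True if it graduated."""
        if qid not in self._items or not rollout_correct:
            return False
        accuracy = sum(rollout_correct) / len(rollout_correct)
        if accuracy >= self.graduate_at:
            del self._items[qid]
            return True
        return False

    def pending(self) -> list[str]:
        return list(self._items)
```

A training loop would call `add` for each question whose rollouts all fail, draw reformulated prompts from `pending()`, and call `update` after each new batch of rollouts.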

2606.18209 2026-06-17 cs.LG New submission

Rethinking Dataset Distillation for Classification: Do Distilled Sets Outperform Coresets?

Trisha Mittal, Akshay Mehra, Joshua Kimball

Affiliations: Dolby Laboratory

AI summary: Evaluates seven state-of-the-art dataset distillation methods through large-scale standardized experiments and finds that on large datasets they perform no better than, or worse than, coresets while costing more to construct; coresets hold the advantage in data-distribution coverage and computational efficiency.

Abstract

Dataset distillation (DD) has emerged as a prominent approach in data-centric machine learning, aiming to synthesize compact training sets for efficient training by compressing the information in large datasets into a small number of synthetic samples. However, DD methods are often evaluated under inconsistent evaluation protocols, ranging from standard ERM to single/multi-teacher supervision, making it difficult to isolate the effectiveness of distilled data from evaluation. Moreover, many prior methods claim that DD outperforms data pruning approaches such as coreset selection (CS), based on the assumption that restricting condensed datasets to subsets of real samples fundamentally limits their expressiveness. In this work, we critically evaluate DD methods through large-scale experiments using standardized datasets and evaluation protocols to assess their intrinsic effectiveness. We benchmark seven state-of-the-art (SOTA) DD methods on ImageNet-1K, ImageNet100, and ImageNette, using three widely adopted training protocols against three CS strategies. Our results show that while some DD methods fail to outperform even simple random subsets, the SOTA DD approaches are comparable to or worse than coresets on large-scale datasets and incur a substantially higher cost for construction. Beyond accuracy, we also evaluate the representativeness, diversity, and quality of condensed sets, and find that coresets consistently achieve better coverage of the original data distribution. These findings highlight the limited practical advantages of current DD methods and show that coresets remain competitive and are often a more computationally efficient alternative for data-centric learning.

2606.18208 2026-06-17 cs.LG cs.AI cs.CL cs.CV New submission

Looped World Models

Hongyuan Adam Lu, Z.L. Victor Wei, Qun Zhang, Jinrui Zeng, Bowen Cao, Lingwei Meng, Mocheng Li, Zezhong Wang, Haonan Yin, Naifu Xue, Minyu Chen, Cenyuan Zhang, Zefan Zhang, Hao Wei, Jiawei Zhou, Haoran Xu, Hao Yang, Ronglai Zuo, Tongda Xu, Yonghao Li, Jian Chen, Hebin Wang, Zeyu Gao, Yang Li, Wei Zhao, Qimin Zhong, Siqi Liu, Yumeng Zhang, Leyan Cui, Zhangyu Wang, Wai Lam

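Affiliations: FaceMind Research Asia

AI summary: Proposes Looped World Models (LoopWM), which iteratively refine latent environment states with a parameter-shared Transformer block, achieving up to 100x parameter efficiency and establishing iterative latent depth as a new scaling axis for world simulation.

Comments: Technical Report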

2606.18206 2026-06-17 cs.AI New submission

Fixed-Point Reasoners: Stable and Adaptive Deep Looped Transformers

Sajad Movahedi, Vera Milovanović, Shlomo Libo Feigin, Alexander Theus, Thomas Hofmann, Valentina Boeva, T. Konstantin Rusch, Antonio Orvieto

Affiliations: ELLIS Institute Tübingen, Max Planck Institute for Intelligent Systems, Tübingen AI Center; ETH Zurich; Swiss Institute of Bioinformatics; Université Paris Cité; Liquid AI

AI summary: Addresses depth-induced signal-propagation problems in looped architectures with FPRM, a model based on pre-layer normalization and residual scaling that uses fixed-point convergence as an end-to-end stopping mechanism, adapting its compute and improving performance on reasoning benchmarks such as Sudoku and Maze.

Comments: Code available at https://github.com/nilskiKonjIzDunava/fprm
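The idea of using fixed-point convergence as a stopping rule can be illustrated with a generic iteration loop. This is a sketch under stated assumptions: `step` stands in for one pass of a looped block, and the max-norm tolerance test is a common convention chosen here, not necessarily the criterion used in the paper or its repository.

```python
from typing import Callable


def iterate_to_fixed_point(
    step: Callable[[list[float]], list[float]],
    x0: list[float],
    tol: float = 1e-6,
    max_iters: int = 100,
) -> tuple[list[float], int]:
    """Apply `step` until successive states differ by less than `tol` (max-norm)."""
    x = list(x0)
    for k in range(1, max_iters + 1):
        x_next = step(x)
        delta = max(abs(a - b) for a, b in zip(x_next, x))
        x = x_next
        if delta < tol:
            return x, k  # converged: easy inputs stop after fewer iterations
    return x, max_iters  # iteration budget exhausted without convergence


# Toy contraction with fixed point 2.0, standing in for a looped Transformer block.
state, iterations = iterate_to_fixed_point(lambda x: [0.5 * v + 1.0 for v in x], [0.0])
```

The number of iterations then adapts to the input: the loop exits as soon as the state stops changing rather than after a fixed depth.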

2606.18203 2026-06-17 cs.CL cs.AI New submission

RubricsTree: Scalable and Evolving Open-Ended Evaluation of Personal Health Agents across Health Memory and Medical Skills

Weizhi Zhang, Zechen Li, Hamid Palangi, Ben Graef, A. Ali Heydari, Simon A. Lee, Salman Rahman, Ray Luo, Zeinab Esmaeilpour, Erik Schenck, Chloe Zhang, Yamin Li, Menglian Zhou, Philip S. Yu, Daniel McDuff, Lindsey Sunden, Mark Malhotra, Shwetak Patel, Ahmed A. Metwally

Affiliations: Google Research; University of Illinois Chicago

AI summary: Proposes the RubricsTree framework, which enables scalable, auditable, and evolving open-ended evaluation through an expert-aligned hierarchical taxonomy (over 100 atomic Boolean rubrics) and context-adaptive routing, yielding relative gains of up to about 66% on HealthBench.

Abstract

LLM-empowered personal health agents that draw on user health (sensor) metrics offer a promising pathway to alleviate global disparities in healthcare access. However, large-scale clinical deployment remains constrained by an open-ended evaluation bottleneck: physician annotation is reliable but costly and unscalable, while LLM-as-a-judge evaluators are scalable but subjective, inconsistent, and sometimes clinically misaligned. We introduce RubricsTree, a scalable evaluation framework with an expert-aligned hierarchical taxonomy of over 100 atomic, clinically verifiable Boolean rubrics, evolved from the insights of 4,000 real user queries through an iterative human-in-the-loop curation protocol with an expert panel led by an experienced physician. A context-aware adaptive router activates only the relevant auto-weighted rubric subset per query, providing the throughput needed for scalable evaluation with expert-aligned quality. Through a systematic meta-evaluation, we show that RubricsTree (i) substantially exceeds a strong large-scale evaluation baseline in expert alignment on challenging open-ended queries; (ii) reliably penalizes contextually degraded responses; and (iii) when used as structured instructions, text feedback, or training rewards for performance optimization, yields up to ~66% relative gains on HealthBench for Gemini, GPT, and Qwen model families. RubricsTree thus provides the scalable, auditable, and evolving evaluation infrastructure required for the continuous optimization of product-level personal healthcare AI.
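To make the routing-plus-Boolean-rubrics idea concrete, here is a toy scorer. Everything in it is hypothetical: the real router is described as context-aware and the weights as automatic, whereas this sketch uses hand-written predicates and fixed weights purely for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Rubric:
    name: str
    weight: float
    applies: Callable[[str], bool]  # stand-in router: is this rubric relevant to the query?
    check: Callable[[str], bool]    # Boolean verdict on the response


def score(query: str, response: str, rubrics: list[Rubric]) -> float:
    """Weighted fraction of the activated rubrics that the response satisfies."""
    active = [r for r in rubrics if r.applies(query)]
    total = sum(r.weight for r in active)
    if total == 0:
        return 0.0
    return sum(r.weight for r in active if r.check(response)) / total


rubrics = [
    Rubric("cites_sleep_duration", 2.0, lambda q: "sleep" in q, lambda a: "hours" in a),
    Rubric("suggests_follow_up", 1.0, lambda q: True, lambda a: "doctor" in a),
    Rubric("mentions_heart_rate", 1.0, lambda q: "heart" in q, lambda a: "bpm" in a),
]
result = score("How is my sleep?", "You slept 7 hours.", rubrics)
```

Because each rubric is an atomic yes/no check, the final score can be traced back to the individual rubrics that passed or failed, which is what makes this style of evaluation auditable.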

2606.18195 2026-06-17 cs.CL New submission

Learning from the Self-future: On-policy Self-distillation for dLLMs

Yifu Luo, Zeyu Chen, Haoyu Wang, Xinhao Hu, Yuxuan Zhang, Zhizhou Sha, Shiwei Liu

Affiliations: Tsinghua University; Technical University of Munich; Nanyang Technological University; University of British Columbia; University of Texas at Austin; ELLIS Institute Tübingen; Max Planck Institute for Intelligent Systems; Tübingen AI Center

AI summary: Proposes d-OPSD, the first on-policy self-distillation framework for diffusion large language models, which learns from self-future experience by conditioning on self-generated answers as a suffix and shifts supervision from the token level to the step level, surpassing RLVR and SFT baselines on reasoning benchmarks with about 10% of the optimization steps.

Comments: Preprint

Abstract

On-policy self-distillation (OPSD) has proven effective for post-training large language models (LLMs), yet its application to diffusion LLMs (dLLMs) remains unexplored. Existing OPSD methods are inherently autoregressive-centric. They inject privileged information via left-to-right prefix conditioning with token-level divergence supervision, a design that fundamentally conflicts with the arbitrary-order generation of dLLMs. We introduce d-OPSD, the first OPSD framework tailored for dLLMs. Our approach makes two core contributions. First, we reframe self-teacher construction by using self-generated answers as suffix conditioning, enabling the student model to learn from "self future-experience" rather than privileged prefixes. Second, we shift supervision from token-level to step-level, aligning training with the iterative denoising process of dLLMs. Experiments across four reasoning benchmarks show that d-OPSD consistently outperforms RLVR and SFT baselines with superior sample efficiency, requiring only around 10% of the optimization steps needed by RLVR and opening a promising pathway for dLLM post-training. The code is available at this https URL.

2606.18192 2026-06-17 cs.AI New submission

The Stanford EDGAR Filings Dataset: Reconstructing U.S. Corporate and Financial Disclosures into Layout-Faithful and Token-Efficient Pretraining Data

Nick Bettencourt, Xiaowei Ding, Kay Giesecke

Affiliations: University of California, Los Angeles; Nanjing University; Stanford University

AI summary: To address the scarcity of long-context documents, proposes the SEFD dataset, which reconstructs SEC filings into layout-faithful MultiMarkdown for financial language modeling and evaluation; the corpus is token-efficient and overlaps with Common Crawl by less than 0.1%.

Comments: Preprint. Includes appendix, tables, and figures

Abstract

As high-quality public web corpora become increasingly exhausted, clean long-context documents have become a scarce and expensive source of training data for large language models (LLMs). Existing long-context corpora are often proprietary and costly to acquire, synthetically generated, or concentrated in narrow domains such as programming. We introduce the Stanford EDGAR Filings Dataset (SEFD), an open reconstruction of SEC filings into layout-faithful MultiMarkdown for financial language modeling and evaluation. SEFD makes audited financial statements, risk disclosures, ownership reports, accounting notes, and market-moving event filings usable as long-context pretraining data and as a basis for financial reasoning, forecasting, compliance, and document understanding. The resulting corpus is token-efficient, model-ready, and has less than 0.1% overlap with Common Crawl-derived corpora. We release SEFD-v1, a 152B-token initial public snapshot, and provide corpus-level analyses of a larger 18.5M-filing archive estimated at 550B tokens. We further introduce two SEFD-derived benchmarks: EDGAR-Forecast, which evaluates filing-grounded numerical forecasting after model knowledge cutoffs, and EDGAR-OCR, which evaluates transcription of complex financial tables.

2606.18189 2026-06-17 cs.RO New submission

Beyond Failure Recovery: An Engagement-Aware Human-in-the-loop Framework for Robotic Systems

Jiaying Fang, Joyce Yang, Zhanxin Wu, Bohan Yang, Tapomayukh Bhattacharjee

Affiliations: Cornell University

AI summary: Proposes engagement-aware model predictive control (E-MPC), which plans the frequency and type of interaction to maintain user engagement while limiting workload, and validates on a robot-assisted feeding system that it improves user experience without reducing task success.

Comments: Project website at https://emprise.cs.cornell.edu/empc

Abstract

Conventional human-in-the-loop approaches typically involve users only when a robot encounters failure or uncertainty, treating humans primarily as tools for improving robot performance. However, in many human-centered robotics settings, interaction should support engagement by keeping users involved in decision-making rather than limiting them to failure-driven interventions. This is particularly compelling in physical caregiving, where mobility limitations can reduce users' ability to intervene or modulate the robot's behavior in the moment. As a result, failure-driven interaction policies may relegate users to passive observers for long stretches of the task. For example, a user with mobility limitations may feel less engaged when being continuously and passively fed by a robot. At the same time, overly frequent interaction can be tiring and increase the user's workload. To address this trade-off, we propose Engagement-aware MPC (E-MPC), a user-engagement-aware method that plans interaction to maintain engagement while respecting a workload constraint. E-MPC leverages a user interaction dynamics model that captures how user engagement evolves as a function of both the frequency and type of interaction. Rather than requesting input only when difficulties arise during task execution, the robot proactively considers the user's preferred level of engagement throughout the task, balancing autonomy and interaction while ensuring task success. We evaluate E-MPC in simulation with several ablations and baseline comparisons. Results demonstrate the effectiveness of our approach across diverse user personas. In addition, we conduct a real-world user study with participants with emulated mobility limitations on a robot-assisted bite acquisition system, showing that E-MPC improves user experience while maintaining task success.
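The planning idea (choose interactions over a horizon to keep engagement high while respecting a workload limit) can be sketched as a tiny receding-horizon search. The action set, the engagement and workload numbers, and the additive dynamics are all invented for illustration; the paper's user interaction dynamics model is not specified in the abstract.

```python
from itertools import product

# Hypothetical actions: name -> (change in engagement, workload cost).
ACTIONS = {"none": (-0.2, 0.0), "confirm": (0.1, 0.3), "choose": (0.3, 0.6)}


def rollout(engagement: float, plan: tuple[str, ...]) -> tuple[float, float]:
    """Simulate a plan; return (summed engagement over the horizon, total workload)."""
    total, load = 0.0, 0.0
    for action in plan:
        gain, cost = ACTIONS[action]
        engagement = min(1.0, max(0.0, engagement + gain))  # engagement stays in [0, 1]
        total += engagement
        load += cost
    return total, load


def plan_first_action(engagement: float, horizon: int = 3, workload_budget: float = 1.0) -> str:
    """Exhaustively search all plans, keep the best feasible one, and return its first action."""
    best_plan, best_value = None, float("-inf")
    for plan in product(ACTIONS, repeat=horizon):
        value, load = rollout(engagement, plan)
        if load <= workload_budget + 1e-9 and value > best_value:
            best_plan, best_value = plan, value
    return best_plan[0]  # the all-"none" plan is always feasible, so best_plan is set
```

As in standard model predictive control, only the first action of the best plan is executed, and the search is repeated at the next step with the updated engagement estimate.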

2606.18180 2026-06-17 cs.CV New submission

EgoCS-400K: An Egocentric Gameplay Dataset for World Models

Rongjin Guo, Dong Liang, Yuhao Liu, Fang Liu, Tianyu Huang, Gerhard P. Hancke, Rynson W. H. Lau

Affiliations: City University of Hong Kong

AI summary: To support world-model research, builds EgoCS-400K, a large-scale egocentric gameplay dataset with 400,000 first-person videos and 10,000 hours of gameplay trajectories, supporting interactive visual modeling tasks such as action-conditioned future prediction and state- and event-aware scene rollout.

Abstract

The shift from video generation to interactive world modeling places new demands on data: beyond captioned videos, world models require temporally aligned video-action-language trajectories grounded in the actions, camera motion, states, and events that drive future scene changes. However, such data is difficult to obtain at scale. Web video datasets offer broad visual coverage but lack executable actions and reliable states; robotic datasets provide action and state supervision but are costly and limited in scene diversity; and existing simulators often lack large-scale human-driven interaction trajectories. In this paper, we introduce EgoCS-400K, a large-scale replay-grounded egocentric Counter-Strike dataset for world models, built from public professional CS and CS2 match demos that preserve human gameplay trajectories and enable parsing, replaying, rendering, and temporal alignment. We extract player states, view directions, movements, keyboard/button inputs, view-angle changes, weapon usage, game events, and round-level context, and render clean first-person videos from the same trajectories. EgoCS-400K contains over 400,000 first-person videos and 10,000 hours of gameplay from more than 1,000 matches and 40,000 rounds, covering 13 maps and 10 player viewpoints per round. It supports a range of interactive visual modeling tasks, including action-conditioned future prediction, state- and event-aware scene rollout, replay-grounded captioning, and agent egocentric action understanding. By connecting visual observations with human actions, camera motion, game states, and events at scale, EgoCS-400K serves as a practical bridge between passive web videos, controllable game simulation, and costly real-world embodied data.

2606.18156 2026-06-17 cs.CV cs.AI New submission

ReAge3D: Re-Aging 3D Faces with View Consistency

Libing Zeng, Li Ma, Mingming He, Ning Yu, Paul Debevec, Nima Khademi Kalantari

Affiliations: Texas A&M University; Netflix Eyeline Studios

AI summary: Proposes the ReAge3D framework, which achieves multi-view-consistent 3D face re-aging through the 2D diffusion model DiffReaging and a center-out editing propagation strategy, preserving identity and detail and outperforming existing methods.

Abstract

We present a novel framework for realistic and controllable 3D face re-aging which produces highly detailed, identity-preserving results. Existing 3D editing methods, while effective for coarse semantic changes, are not well suited for re-aging, as even small inconsistencies across re-aged 2D views can lead to over-smoothing of subtle but perceptually important age-related details. To address this challenge, we first introduce a 2D diffusion-based re-aging model, DiffReaging, trained on synthetically generated image pairs. We further propose a center-out editing propagation strategy that leverages this re-aging model to reconstruct multi-view-consistent re-aged images. Specifically, starting from a re-aged frontal pivot view, we reconstruct the remaining views through warping and our proposed Masked-DiffReaging process. By injecting existing content at every step of the diffusion process, Masked-DiffReaging ensures that the reconstructed regions remain coherent with existing pixels. The resulting consistent set of re-aged views supervises the optimization of the re-aged 3D representation. Our method outperforms existing 3D editing techniques both visually and quantitatively, enabling smooth, fine-grained control over age transformations in 3D face models.

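2606.18154 2026-06-17 cs.AI New submission

Learning Cardiac Electrophysiology Digital Twins Through Agentic Discovery of Hybrid Structure

Ziqi Zhou, Yubo Ye, Sumeet Atul Vadhavka, Linwei Wang, Zhiqiang Tao

Affiliations: Rochester Institute of Technology

AI summary: Proposes LEADS, a framework in which an LLM agent iteratively discovers hybrid physics-neural models within a structured action space to build personalized cardiac electrophysiology digital twins, outperforming human-designed models and other LLM-based methods.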

Comments 10 pages, 4 figures

Abstract

Building personalized cardiac electrophysiology (EP) digital twins requires identifying the appropriate model structure for each patient, not merely fitting parameters. Traditional methods rely on experts to manually prescribe hybrid physics-neural architectures, which requires deep domain expertise and does not transfer across patients. Recent works have applied large language models (LLMs) to generate or act as hybrid models. However, despite their promising generalization capacity, these LLM-based methods lack the structural priors needed for stable cardiac simulations. Hence, we propose LEADS, a framework that formulates cardiac EP domain knowledge as a structured action space and utilizes an LLM agent to discover hybrid models. The agent follows an iterative reasoning-and-action loop to select, combine, and refine hybrid models, whilst gradient descent handles parameter fitting. LEADS steers every candidate model towards being physically grounded, interpretable, and numerically stable, while allowing open-ended architectural discovery. We validate LEADS on synthetic data with three ground-truth reaction models and on real cardiac EP data, demonstrating that it outperforms both human-designed hybrid models and other LLM-based hybrid modeling methods.

2606.18153 2026-06-17 cs.CV New submission

Neural Tree Reconstruction for the Open Forest Observatory

Marissa Ramirez de Chanlatte, Arjun Rewari, Trevor Darrell, Derek J. N. Young

Affiliations: Berkeley AI Research, University of California, Berkeley; Department of Plant Sciences, University of California, Davis

AI summary: Addresses the poor reconstruction quality of classical structure-from-motion in the Open Forest Observatory by introducing neural radiance fields to improve the detail and robustness of 3D tree reconstruction, and outlines future work.

Comments Published as a workshop paper at "Tackling Climate Change with Machine Learning", ICLR 2024

Abstract

The Open Forest Observatory (OFO) is a collaboration across universities and other partners to make low-cost forest mapping accessible to ecologists, land managers, and the general public. The OFO is building both a database of geospatial forest data and open-source methods and tools for forest mapping by uncrewed aerial vehicle. Such data are useful for a variety of climate applications including prioritizing reforestation efforts, informing wildfire hazard reduction, and monitoring carbon sequestration. In the current iteration of the OFO's forest map database, 3D tree maps are created using classical structure-from-motion techniques. This approach is prone to artifacts, lacks detail, and has particular difficulty on the forest floor where the input data (overhead imagery) has limited visibility. These reconstruction errors can potentially propagate to downstream scientific tasks (e.g., a wildfire simulation). Advances in 3D reconstruction, including methods like Neural Radiance Fields (NeRF), produce higher quality results that are more robust to sparse views and support data-driven priors. We explore ways to incorporate NeRFs into the OFO dataset, outline future work to support even more state-of-the-art 3D vision models, and describe the importance of high-quality 3D reconstructions for forestry applications.

2606.18147 2026-06-17 cs.AI New submission

WEQA: Wearable hEalth Question Answering with Query-Adaptive Agentic Reasoning

Yuwei Zhang, Tong Xia, Bianca Emmerich, Yu Yvonne Wu, Dimitris Spathis, Xin Liu, Daniel McDuff, Cecilia Mascolo

Affiliations: University of Cambridge; Tsinghua University; University College London; Dartmouth College; Google Research

AI summary: Proposes WEQA, a framework in which an LLM controller dynamically combines sensor analysis with pretrained models to answer questions about wearable health data; it improves benchmark accuracy by 24%, and expert evaluation shows substantial gains in usefulness and clinical soundness.

Abstract

Language models are remarkably capable at medical question answering, in some cases surpassing the accuracy of general physicians. However, answering questions about wearable health data remains challenging and understudied, as these ubiquitous sensors produce continuous, high-dimensional, and longitudinal data, which is non-trivial to align with text-centric distributions in LLM pretraining. The diversity of sensor modalities and user intents cannot be effectively handled by a fixed reasoning workflow or a single pretrained foundation model. To address these challenges, we propose WEQA, a query-adaptive agent framework that unifies LLM reasoning with specialized wearable analytical and modeling tools. An LLM controller is employed to synthesize execution plans and dynamically route each query to the appropriate combination of sensor analysis and pretrained models, and perform grounded response auditing with external knowledge. We also curate a benchmark spanning four open wearable datasets comprising analytic and predictive tasks in three different health domains. Experiments show that our framework is 24% more accurate than LLM and agentic baselines, and a blinded study with 12 medical experts and 8 users shows substantial gains in usefulness and clinical soundness.

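2606.18144 2026-06-17 cs.AI cs.CY cs.LG cs.RO New submission

Memory as a Wasting Asset: Pricing Flash Endurance for Embodied Agents, and the Limits of Doing So

Josef Liyanjun Chen

Affiliations: KAIKAKU

AI summary: Treats robot flash endurance as depreciating capital priced by a single shadow price η, which yields cost-optimal placement across storage tiers; it also measures the sign of the value-write association χ on real robot logs and finds that the sign depends on the deployment regime.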

Abstract

A robot's flash endurance is a non-renewable stock: every persisted write spends one of a few thousand program/erase cycles and never refills, yet no fielded robot memory system prices which memories are worth an erase cycle. We treat embodied memory as depreciating capital and price that stock with a single endurance shadow price $\eta$, which makes cost-minimizing placement across a RAM / on-board NVM / cloud hierarchy a threshold in a wear-augmented per-byte index. The index is cost-optimal whatever the sign of the value-write association $\chi$; only when $\chi > 0$ does the optimum turn non-monotone, sending a robot's most valuable memories off its flash. The pivot is thus empirical, and we measure $\chi$ on real robot logs at a pre-specified gate: its sign is a property of the deployment regime -- positive on recurrent long-horizon manipulation ($\hat{\chi} \approx +1.0 \times 10^{-3}$, replicated at full power), null on a shorter-horizon suite, and negative on non-recurrent teleoperation. Two boundaries scope the result. The endurance budget is dormant on premium 3,000-P/E TLC at datasheet prices and binding on the commodity QLC/eMMC ($\sim$1,000 P/E) that cheaper edge robots run. And where it binds, a learned wear-aware controller only ties price-based routing on task value, because realized value is tier-invariant across RAM, NVM, and cloud: the rent governs device lifetime and cost, not task performance. Whether wear-aware placement improves task value remains open -- $\chi$ is measured against a value proxy, and the non-monotone optimum, while proven, is not yet observed in data.

2606.18142 2026-06-17 cs.AI cs.CL cs.CY New submission

Your AI Travel Agent Would Book You a Bullfight: An Agentic Benchmark for Implicit Animal Welfare in Frontier AI Models

Jasmine Brazilek, Oliver Tulio, Joel Christoph, Miles Tidmarsh, Carol Kline, Arturs Kanepajs

Affiliations: Compassion Aligned Machine Learning; Sentient Futures; Harvard Kennedy School; Appalachian State University Department of Management

AI summary: Introduces TAC, the first agentic benchmark testing whether AI agents avoid options involving animal exploitation when carrying out actions such as travel booking for users. Seven frontier models are evaluated; all score below the 64% chance level, and the best reaches only 53%.

Abstract

AI agents are moving from advisors to actors, booking travel, planning menus, and running procurement on behalf of users. Existing benchmarks for AI and animal welfare evaluate model text responses to question-answer prompts, leaving open whether the welfare reasoning surfaced in those responses transfers to agentic deployment where the model must take actions with tools. We introduce TAC (Travel Agent Compassion), the first agentic benchmark measuring whether AI agents avoid options involving animal exploitation when acting on behalf of users. TAC presents an AI agent with twelve hand-authored travel booking scenarios across six categories of animal exploitation, augmented to forty-eight samples to control for price, rating, and position confounds. We evaluate seven frontier models from four labs. Every model scores below the chance level of sixty-four percent, with the best performer (Claude Opus 4.7) at fifty-three percent. A single welfare-aware sentence in the system prompt yields gains of forty-seven to sixty-three percentage points in Claude and GPT-5.5, twenty-six points in GPT-5.2, and under twelve points in DeepSeek and Gemini. An auxiliary Inspect Scout audit of 288 base-condition transcripts from the top two performers, using Gemini 2.5 Flash Lite as judge, flags zero transcripts for evaluation awareness, suggesting the below-chance rates do not stem from the models recognising the evaluation. We discuss implications for category-level variation across cultural domains, the limits of text-response welfare benchmarks, and the EU General-Purpose AI Code of Practice systemic risk framework.

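2606.18135 2026-06-17 cs.SD cs.AI New submission

Descriptor: Certus Caliber Classification Gunshot Dataset (C3GD)

Sinclair Gurny, Ryan Quinn

Affiliations: Certus Innovations

AI summary: Introduces C3GD, a public gunshot dataset with more than 8,000 field-collected data points from 28 firearms across 16 calibers, intended for caliber classification, gunshot detection, and related tasks, with rich metadata to support generalization and academic analysis.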

Abstract

In this work, we introduce the Certus Caliber Classification Gunshot Dataset (C3GD), a publicly accessible dataset developed for the analysis of firearm muzzle blast sounds. The dataset aims to cover a wide variety of firearms, calibers, cartridges, microphones, and microphone locations, with metadata more detailed than what is currently available elsewhere. It comprises more than 8000 field-collected data points from 28 firearms across 16 calibers. Because data collection in the field is costly, much of the existing research has been done using gunshot audio collected from the internet, which increases the risk of low-quality data and label noise. This dataset is primarily focused on caliber classification, but can also be used for gunshot detection, audio separation, and audio signal processing, providing a diversified and real-world reference. The dataset aims to provide enough diversity to generalize to more real-world applications while also providing enough metadata for detailed academic analysis.

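2606.18124 2026-06-17 cs.CL New submission

Unintended Effects of Geographic Conditioning in Large Language Models

Naz Col, David M. Chan

Affiliations: University of California, Berkeley

AI summary: Evaluates geographic leakage in large language models: given geographically neutral prompts, models still produce geographic references because of location information in user metadata. Even the location placeholder "Unknown" triggers leakage, revealing a generative conditioning effect of the user profile frame itself.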

Comments To appear at the Second Workshop on Customizable NLP (CustomNLP4U) at ACL 2026

Abstract

Modern conversational AI systems frequently rely on user metadata to localize responses, yet the unintended regional biases introduced by this hidden context remain poorly understood. In this work, we evaluate location leakage: the phenomenon where a model generates geographic references despite receiving a geographically neutral user prompt. Across both creative writing and open-ended Q&A prompts, even state-of-the-art LLMs systematically favor region-specific outputs when exposed to location metadata, with leakage spiking by up to 793 times above baseline (e.g., from 0.04% to 31.7% for Llama 3.1-8B, and 21.3% and 8.8% for Qwen3-8B and Claude Sonnet 4.6, respectively). Our analysis further shows a novel structural conditioning effect: replacing the injected location with the placeholder "Unknown" still elevates leakage by up to 72 times above baseline, demonstrating that the user profile frame itself, independent of any geographic content, acts as a generative conditioning signal.
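
As a quick sanity check on the headline ratio, the fold increase can be recomputed from the two rates quoted for Llama 3.1-8B. This is only arithmetic on the rounded percentages in the abstract, so it lands close to, rather than exactly on, the reported 793 times; the function name `fold_increase` is illustrative and not from the paper.

```python
def fold_increase(baseline_pct: float, conditioned_pct: float) -> float:
    """Ratio of the conditioned leakage rate to the baseline leakage rate."""
    if baseline_pct <= 0:
        raise ValueError("baseline rate must be positive")
    return conditioned_pct / baseline_pct


# Rounded rates quoted in the abstract for Llama 3.1-8B (percent of outputs).
ratio = fold_increase(baseline_pct=0.04, conditioned_pct=31.7)
print(f"leakage is about {ratio:.1f} times the baseline")
```

The small gap to 793 is what one would expect if the authors computed the ratio from unrounded rates.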

2606.18123 2026-06-17 cs.CV New submission

Predicting Immune Biomarkers with MultiModal Mixture-of-Expert Pathology Foundation Models Empowers Precision Oncology

Tianyu Liu, Ziqing Wang, Zhaokang Liang, Tong Ding, Peter Humphrey, Lorraine Colón-Cartagena, Emily Ling-Lin Pai, Kenneth Tou En Chang, Mohamed Kahila, Jonathan Chong Kai Liew, Tinglin Huang, Rex Ying, Kaize Ding, Faisal Mahmood, Wengong Jin

Affiliations: Program of Computational Biology and Bioinformatics, Yale University; Broad Institute of MIT and Harvard; Department of Statistics and Data Science, Northwestern University; Department of Computer Science, Northeastern University; Department of Computer Science, Harvard University; Department of Pathology, Yale University; Department of Anatomic Pathology and Laboratory Medicine, Hospital of the University of Pennsylvania; Department of Pathology and Laboratory Medicine, University of California, San Francisco; Department of Pathology and Laboratory Medicine, KK Women’s and Children’s Hospital; Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania

AI summary: Proposes MixTIME, a multimodal foundation model that uses a mixture-of-experts architecture to integrate pathology foundation models from different modalities and predict multiplex immunofluorescence protein expression from HE whole-slide images; it achieves state-of-the-art performance on 17 protein markers and improves downstream tasks such as spatial domain identification and survival prediction.

Comments 5 figures

Abstract

Predicting immune biomarkers associated with the tumor immune microenvironment (TIME) is critical for advancing precision oncology, yet existing approaches are largely limited to single image modalities and suffer from insufficient resolution and incomplete utilization of complementary clinical and biological information. Here we introduce MixTIME, a multimodal foundation model that leverages a mixture-of-experts (MoE) architecture to integrate pathology foundation models trained across distinct modalities: image-only (UNIv2), image-text (CONCHv1.5), and image-transcriptomic (STPath) representations. The model performs pixel-level and slide-level prediction of multiplex immunofluorescence (mIF) protein expression from hematoxylin and eosin (HE) whole-slide images. MixTIME employs a learnable router to dynamically weight expert contributions and is trained with a distribution- and tendency-aware loss function. Benchmarked on two datasets of different scales, MixTIME achieves state-of-the-art performance across 17 protein markers as measured by correlation metrics. The predicted mIF profiles substantially enhance downstream tasks, including spatial domain identification, survival prediction, and AI-assisted pathology report generation validated by expert pathologists from multiple institutes across the world. Furthermore, MixTIME enables longitudinal tracking of protein expression dynamics across clinical time points and reveals protein-gene interaction patterns linked to drug resistance and immune suppression in tumor microenvironments. Collectively, MixTIME provides a scalable framework for multimodal biomarker discovery and clinical translation in computational pathology.
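
To make "a learnable router to dynamically weight expert contributions" concrete, here is a minimal sketch of dense soft routing over three expert feature vectors. It assumes a plain softmax gate and a weighted sum; the abstract does not say whether MixTIME routes densely or sparsely, and the names `softmax` and `mix_experts`, the toy feature vectors, and the logits are all hypothetical.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax: turns router logits into weights summing to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mix_experts(expert_features: Sequence[Sequence[float]],
                router_logits: Sequence[float]) -> List[float]:
    """Weighted sum of per-expert feature vectors (assumed dense routing)."""
    if len(expert_features) != len(router_logits):
        raise ValueError("need exactly one router logit per expert")
    weights = softmax(router_logits)
    dim = len(expert_features[0])
    return [sum(w * feats[d] for w, feats in zip(weights, expert_features))
            for d in range(dim)]


# Toy stand-ins for image-only, image-text, and image-transcriptomic features.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
uniform_mix = mix_experts(features, router_logits=[0.0, 0.0, 0.0])
skewed_mix = mix_experts(features, router_logits=[10.0, 0.0, 0.0])
```

With equal logits the output is the plain average of the experts; a large logit for one expert pulls the output toward that expert's features, which is the sense in which the weighting is input-dependent once the logits come from a learned function of the input.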

2606.18119 2026-06-17 cs.AI New submission

First Proof Second Batch

Mohammed Abouzaid, Nikhil Srivastava, Rachel Ward, Lauren Williams

Affiliations: Stanford University; University of California, Berkeley; University of Texas at Austin; Harvard University; Polish Academy of Sciences; UC Berkeley; Brown University; ETH Zürich; MIT; Weierstrass Institute; Duke University; Sorbonne Université; Boston College; Université du Québec à Montréal; UCLA; University of Michigan; University of Maryland

AI summary: Tests several AI systems on ten research-level mathematics problems to assess how well current AI can solve such problems.

Abstract

To assess the ability of current AI systems to correctly solve research-level mathematics problems, we tested several AI systems on a set of ten problems in a broad range of mathematical fields; these problems arose naturally in the research process of the contributors. This document includes the problems, our methodology, and the results of our testing. We provide links to supplementary documents including the human solutions, the AI-generated solutions, and the referee reports and logs for the AI-generated solutions. The ten problems were contributed by the following mathematicians: (1) Dariusz Kalociński and Theodore A. Slaman, (2) Richard Schwartz, (3) Aleksa Milojevic and Benny Sudakov, (4) Larry Guth, (5) Oleg Butkovsky, Jonathan Mattingly, and Lorenzo Zambotti, (6) Joshua Evan Greene and Duncan McCoy, (7) Sucharit Sarkar, (8) Sam Payne and Jidong (Jayden) Wang, (9) Sylvie Corteel and John Lentfer, (10) Srivatsav Kunnawalkam Elayavalli.

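2606.18114 2026-06-17 cs.LG cs.AI New submission

Ternary Mamba: Grouped Quantization-Aware Training of W1.58A16 State Space Models

Ramprasath Ganesaraja, Sahil Dilip Panse, Swathika N

Affiliations: EdgeVerve Systems Limited

AI summary: Proposes grouped quantization-aware training (QAT) from a pretrained checkpoint, combined with knowledge distillation, to compress Mamba-2 1.3B by 3.61x using very little data (about 100M tokens); zero-shot accuracy approaches that of Bi-Mamba, and the work identifies zero-ratio collapse, an instability specific to QAT from pretrained checkpoints.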

Abstract

State Space Models (SSMs) such as Mamba-2 offer linear-time inference, but their memory footprint limits edge deployment. Prior ternary SSM work (Slender-Mamba) trains from scratch on 150B tokens; we show a pretrained checkpoint suffices, reducing the marginal token budget by 1,000x. Using grouped quantization-aware training (QAT) with knowledge distillation from a frozen FP16 teacher, we compress Mamba-2 1.3B by 3.61x (2,687 MB to 744 MB) and achieve 48.1% zero-shot accuracy (7-task average) in just 102M tokens (4 GPU-hours, single H100) -- approaching Bi-Mamba's 48.4% (within +/-0.9pp CI). This QAT-from-pretrained setting reveals zero-ratio collapse, a novel instability caused by learnable quantization scales that does not arise in from-scratch training. We further show that post-hoc correction strategies effective for Transformers fail for SSMs due to error accumulation through the recurrence. These results demonstrate that ternary SSMs do not require expensive from-scratch training: QAT from pretrained checkpoints with KD is a data-efficient alternative.
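
For readers unfamiliar with ternary weights, the sketch below shows one common way to quantize a weight vector group by group to {-1, 0, +1}. It assumes an absmean scale per group, in the style popularized by BitNet b1.58; the abstract does not specify the paper's quantizer, its group size, or how its learnable scales are parameterized, so `ternarize_group`, `ternarize`, and `zero_fraction` are illustrative names only. The fraction of zeros is tracked because it is presumably the quantity behind the term "zero-ratio collapse", which is a reading of the name and not a statement from the paper.

```python
from typing import List, Tuple

Group = Tuple[List[int], float]  # (ternary codes, per-group scale)


def ternarize_group(weights: List[float], eps: float = 1e-8) -> Group:
    """Quantize one group to {-1, 0, +1} using an absmean scale (assumed scheme)."""
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    codes = [max(-1, min(1, round(w / scale))) for w in weights]
    return codes, scale


def ternarize(weights: List[float], group_size: int) -> List[Group]:
    """Split a flat weight list into consecutive groups and quantize each one."""
    return [ternarize_group(weights[i:i + group_size])
            for i in range(0, len(weights), group_size)]


def zero_fraction(groups: List[Group]) -> float:
    """Share of all codes that are exactly zero."""
    total = sum(len(codes) for codes, _ in groups)
    zeros = sum(codes.count(0) for codes, _ in groups)
    return zeros / total


weights = [0.9, -0.05, 0.4, -1.2, 0.02, 0.01, -0.03, 0.5]
groups = ternarize(weights, group_size=4)
compression = round(2687 / 744, 2)  # size ratio from the figures in the abstract
```

Because each group has its own scale, a single large weight (0.5 in the second group) inflates that group's scale and pushes its small neighbors to zero, which illustrates why the choice of scale matters for how many weights survive quantization.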

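2606.18106 2026-06-17 cs.LG New submission

Deep Reinforcement Learning for Minimum Zero-Forcing Sets

Steve Halley, Maurício Gruppi

Affiliations: Department of Computing Sciences, Villanova University

AI summary: Proposes SD-ZFS, a reinforcement learning framework that adapts the S2V-DQN architecture to the minimum zero-forcing set problem, and shows on graphs of varying structure that it outperforms a greedy heuristic.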

Abstract

This paper explores the problem of finding the minimum zero-forcing set on undirected graphs and proposes an adapted machine learning framework to solve the problem. The minimum zero-forcing set problem is a graph coloring problem where the color of an initial set of nodes propagates throughout a network. The set of nodes is zero-forcing if it forces all uncolored nodes to change color under the constraint of the color-change rule. The problem has applications across domains such as network science, network control, and logic circuit design. Finding the minimum zero-forcing set has been shown to be NP-hard. We propose a reinforcement learning framework, SD-ZFS, that adapts the S2V-DQN architecture to the ZFS problem. We train several models on this adapted framework and analyze the performance across graph datasets that have varying structures. We evaluate how the models trained on the framework generalize, scale, and transfer to different network types. The results demonstrate the effectiveness of the framework when compared against the optimal solution and a greedy heuristic. We provide further insight into how the ZFS problem can be solved through machine learning and the influence of network structure on the problem.
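
The color-change rule mentioned in the abstract can be sketched in a few lines. The version below uses the standard rule from the zero-forcing literature (a colored node with exactly one uncolored neighbor forces that neighbor to become colored) and a brute-force search that is only practical for tiny graphs. It is not the paper's SD-ZFS method; the function names and the adjacency-dictionary graph format are assumptions for illustration.

```python
from itertools import combinations
from typing import Dict, Iterable, Set

Graph = Dict[int, Set[int]]  # undirected graph as node -> set of neighbors


def forcing_closure(graph: Graph, initial: Iterable[int]) -> Set[int]:
    """Apply the color-change rule until no further node can be forced."""
    colored = set(initial)
    changed = True
    while changed:
        changed = False
        for node in list(colored):
            uncolored = graph[node] - colored
            if len(uncolored) == 1:  # exactly one uncolored neighbor: force it
                colored |= uncolored
                changed = True
    return colored


def is_zero_forcing(graph: Graph, initial: Iterable[int]) -> bool:
    """True if the initial set eventually colors every node."""
    return forcing_closure(graph, initial) == set(graph)


def minimum_zero_forcing_set(graph: Graph) -> Set[int]:
    """Exhaustive search by increasing size; exponential, for small graphs only."""
    nodes = sorted(graph)
    for size in range(1, len(nodes) + 1):
        for candidate in combinations(nodes, size):
            if is_zero_forcing(graph, candidate):
                return set(candidate)
    return set(nodes)


path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
cycle = {0: {1, 3}, 1: {0, 2}, 2: {1, 3}, 3: {0, 2}}
star = {0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0}}
```

On the path, one endpoint suffices because each forced node has a single uncolored neighbor left; the 4-cycle and the 3-leaf star both need two starting nodes. The exponential cost of the exhaustive search is the reason heuristics and learned policies are of interest for larger graphs.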

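2606.18103 2026-06-17 cs.CL cs.IR New submission

HistoRAG: Embedding Historical Methodology in Retrieval-Augmented Generation Through Critical Technical Practice

Noah J. Kim-Baumann, Torsten Hiltmann

Affiliations: Humboldt-Universität zu Berlin

AI summary: For interpretive disciplines such as history, proposes HistoRAG, a framework that translates historiographical principles into RAG design through architectural interventions such as separating retrieval from generation, temporal windowing, and LLM-as-judge evaluation, addressing temporal skew and relevance assessment problems in standard RAG.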

Comments 25 pages, 6 figures. Companion preprint to a Journal of Digital History notebook article (under review)

Abstract

Retrieval-Augmented Generation (RAG) is the prevailing architecture for grounding language model outputs in external evidence, yet its dominant evaluation paradigms and default configurations remain oriented toward factual question-answering. For interpretive disciplines such as historical studies, RAG embeds assumptions that conflict with scholarly practice. We introduce HistoRAG, a framework that translates historiographical principles into concrete architectural interventions. Separated retrieval and generation decouples source discovery from interpretation, temporal windowing enforces balanced source representation across the research period as a methodological requirement of historical inquiry, and LLM-as-judge evaluation makes relevance judgments transparent and contestable. We evaluate these interventions using SPIEGELragged, applied to 102,189 articles from Der Spiegel (1950-1979). Each intervention addresses a measurable deficiency in standard RAG: era-specific vocabulary retrieves zero chunks from the 1950s when using 1970s terminology, evidence of the temporal skew that motivates windowing; vector similarity and LLM-assessed relevance correlate only weakly (Spearman rho = 0.275), motivating post-retrieval evaluation; and keyword-based and semantic retrieval surface largely disjoint source pools, motivating an architecture in which both operate as complementary retrieval layers under a shared LLM evaluation filter. We also introduce the concept of Zwischentexte (intermediate texts that function as interpretive proposals rather than findings) as a framework for responsible integration of LLM-generated text into scholarly practice. The architecture offers a model for how domain-specific epistemological commitments can be translated into RAG design decisions, and may transfer to other interpretive disciplines working with large corpora.
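
The idea behind temporal windowing can be illustrated with a small sketch: instead of taking the globally best-scoring chunks, keep the best few from each time window so that every part of the research period is represented. This is one plausible reading under stated assumptions, not the SPIEGELragged implementation; the fixed ten-year windows, the `Chunk` type, the `windowed_top_k` function, and the scores are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class Chunk(NamedTuple):
    text: str
    year: int
    score: float  # similarity to the query, higher is better


def windowed_top_k(chunks: List[Chunk], start: int, end: int,
                   window: int, k: int) -> List[Chunk]:
    """Keep up to k best-scoring chunks from each `window`-year span in [start, end]."""
    buckets: Dict[int, List[Chunk]] = defaultdict(list)
    for chunk in chunks:
        if start <= chunk.year <= end:
            buckets[(chunk.year - start) // window].append(chunk)
    selected: List[Chunk] = []
    for index in sorted(buckets):  # chronological order of windows
        ranked = sorted(buckets[index], key=lambda c: c.score, reverse=True)
        selected.extend(ranked[:k])
    return selected


chunks = [
    Chunk("a", 1953, 0.31), Chunk("b", 1957, 0.28),
    Chunk("c", 1964, 0.45),
    Chunk("d", 1972, 0.91), Chunk("e", 1975, 0.88), Chunk("f", 1978, 0.85),
]
balanced = windowed_top_k(chunks, start=1950, end=1979, window=10, k=1)
global_top = sorted(chunks, key=lambda c: c.score, reverse=True)[:3]
```

In this toy data a query phrased in 1970s vocabulary scores 1970s chunks highest, so the global top three all come from that decade, while the windowed selection returns one chunk per decade.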

2606.18101 2026-06-17 cs.AI New submission

Trust the Right Teacher: Quality-Aware Self-Distillation for GUI Grounding

Jingyuan Huang, Zuming Huang, Yucheng Shi, Tianze Yang, Xiaoming Zhai, Wei Chu, Ninghao Liu

Affiliations: University of Georgia; INFLY Tech; Tencent AI Lab; The Hong Kong Polytechnic University

AI summary: Proposes quality-aware self-distillation, which improves the quality of coordinate-token teacher signals through soft correctness-aware gating and teacher-probability scaling, raising VLM performance on GUI grounding.

Abstract

Graphical user interface (GUI) grounding requires vision-language models (VLMs) to identify small target elements in high-resolution screenshots and predict precise screen coordinates. On-policy self-distillation (OPSD) is a promising post-training approach for this coordinate-sensitive task, since it provides dense token-level teacher signals beyond hard coordinate labels. However, naive OPSD is not well suited to GUI grounding: because OPSD evaluates the teacher on student-generated prefixes, the quality of coordinate-token teacher signals can degrade when the prefix has already deviated from the target coordinate, leading to unreliable teacher signals. To mitigate this, we propose quality-aware self-distillation for VLM-based GUI grounding, which improves coordinate-token teacher-signal quality through soft correctness-aware gating and teacher-probability scaling. The soft correctness-aware gate checks whether the teacher's current coordinate-token prediction can still be completed into the ground-truth box under the student-generated prefix. If not, the corresponding teacher signal is down-weighted. Teacher-probability scaling then uses the teacher's confidence as a lightweight factor to further calibrate the strength of the gated supervision. A key empirical finding is that neither component alone improves overall performance, whereas combining them consistently improves performance. This suggests that the two mechanisms play complementary roles: correctness-aware gating suppresses unreliable coordinate-token supervision, while teacher-probability scaling calibrates the strength of the remaining signals. Experiments across six GUI grounding benchmarks show that our method consistently improves the base model and outperforms strong baselines.
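
The gating idea can be sketched under strong simplifying assumptions: suppose each coordinate is emitted as a zero-padded, fixed-width string of decimal digit tokens, and replace the paper's soft gate with a hard feasibility check plus a fixed penalty. Neither the tokenization, the `penalty` value, nor the function names `completable_range` and `gated_weight` come from the paper; the sketch only shows how "can this prefix still be completed into the ground-truth box" and "scale by teacher confidence" might combine into one per-token weight.

```python
from typing import Tuple


def completable_range(prefix_digits: str, width: int) -> Tuple[int, int]:
    """Smallest and largest integers reachable from a digit prefix of a fixed-width number."""
    remaining = width - len(prefix_digits)
    if remaining < 0:
        raise ValueError("prefix is longer than the coordinate width")
    base = int(prefix_digits) if prefix_digits else 0
    low = base * 10 ** remaining
    high = low + 10 ** remaining - 1
    return low, high


def gated_weight(student_prefix: str, teacher_digit: str, teacher_prob: float,
                 box_min: int, box_max: int,
                 width: int = 4, penalty: float = 0.1) -> float:
    """Per-token distillation weight: feasibility gate times teacher confidence."""
    low, high = completable_range(student_prefix + teacher_digit, width)
    feasible = low <= box_max and high >= box_min  # interval overlaps the box
    gate = 1.0 if feasible else penalty            # hard stand-in for the soft gate
    return gate * teacher_prob


# Ground-truth box spans x in [120, 180]; coordinates are written as 4 digits.
on_track = gated_weight("01", "5", teacher_prob=0.8, box_min=120, box_max=180)
off_track = gated_weight("03", "1", teacher_prob=0.8, box_min=120, box_max=180)
```

In the first call the digits "015" can still complete to 150 through 159, which lies inside the box, so the teacher signal keeps its full confidence-scaled weight. In the second call the student prefix has already drifted to the 300s, no completion can reach the box, and the signal is down-weighted.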

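2606.18098 2026-06-17 cs.AI New submission

IsabeLLM: Automated Theorem Proving Applied to Formally Verifying Consensus

Elliot Jones, William Knottenbelt

Affiliations: Imperial College London

AI summary: Improves the IsabeLLM automated theorem proving tool by adding retrieval-augmented generation, error tracing, and counterexample generation to enrich the LLM's context, and by adding compatibility with the latest Isabelle and Sledgehammer; the tool is applied to verifying Bitcoin's Proof of Work consensus.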

Abstract

Advances in Artificial Intelligence (AI) have led AI for Theorem Proving to become a promising means of formally verifying computer systems. Whilst formal verification is traditionally reserved for safety-critical systems due to the required amount of expertise and effort, AI can help to automate a large amount of this workload and make it far more accessible. Blockchain-based systems are becoming increasingly popular and are frequently targeted by malicious actors, often resulting in huge financial losses, highlighting the need to better verify these systems and mitigate vulnerabilities. Arguably the most important component of these systems is the consensus protocol, which allows nodes to agree on decisions in a potentially adversarial environment. In this paper, we improve upon IsabeLLM, the automated theorem proving tool in Isabelle. Namely, we implement a Retrieval-Augmented Generation framework, error tracing, and counterexample generation to improve the context supplied to the Large Language Model. Compatibility with the latest version of Isabelle and Sledgehammer is also implemented for improved efficiency. We compare the performance of the two versions of IsabeLLM in their ability to complete the verification of Bitcoin's Proof of Work consensus.


2606.18097 2026-06-17 cs.RO New submission

WireCraft: A Simulation Benchmark for Industrial DLO Manipulation


Chongyu Zhu, Ramy ElMallah, Hyegang Kim, Zachary Tang, Jiachen Rao, Artem Arutyunov, Seungyeon Ha, Chi-Guhn Lee

Affiliations: Department of Mechanical and Industrial Engineering, University of Toronto; Department of Computer Science, University of Toronto; CREFLE Inc.

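AI summary: To address the lack of a unified benchmark for industrial deformable linear object (DLO) manipulation, this paper proposes WireCraft, a simulation benchmark with configurable difficulty and assets that covers three task families and evaluates reinforcement learning, imitation learning, and vision-language-action policies.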



English abstract

Deformable Linear Objects (DLOs), such as wires and cables, are central to industrial assembly. Unlike rigid objects, whose state is captured by a 6-DoF pose, DLOs have an infinite-dimensional configuration space and deform continuously under contact with grippers, fixtures, and the workspace, making them a demanding benchmark for general dexterous manipulation. Despite their importance, policy development and comparison remain difficult: existing benchmarks are often tied to specific hardware setups, lack modular and customizable task assets, or study generic deformable-object tasks without the fixtures relevant to real-world industrial wire manipulation. Few benchmarks align simulation, real-world data, and shared evaluation protocols. To bridge this gap, we introduce WireCraft, a simulation benchmark for industrial DLO manipulation with configurable difficulty and assets, spanning three task families: connector insertion, clip routing, and channel seating. It supports two complementary DLO physics models, articulated and deformable, and the trajectories come from both simulation and a physical UR5. We benchmark reinforcement learning (RL), imitation learning (IL), and vision-language-action (VLA) policies under shared metrics. Privileged state-based RL solves a representative setting in each task family with over 82% success, confirming the tasks are well-posed. For connector insertion, however, the transition from reaching the socket to contact-rich alignment remains a key bottleneck for vision RL, IL, and VLA policies. These results indicate that industrial DLO manipulation, though tractable under privileged state, remains an open challenge for current vision-based learning. The benchmark, data, and tools will be open-sourced upon acceptance.
