arXivDaily: arXiv daily digest, updated Monday through Friday
2606.10971 2026-06-10 cs.RO cs.SY eess.SY New submission

Resilient Navigation for Autonomous Farm Robots by Leveraging Jerk-Augmented Models with IMU-Only Disturbance Rejection

Batu Candan, Mohammed Atallah, Simone Servadio, Saeed Arabi

Affiliations: Iowa State University; Salin247

AI summary: To address sensor outages and vibration in agricultural robots, the paper proposes a jerk-augmented EKF combined with a Multiple Tuning Factor (MTF) adaptation method that dynamically adjusts the measurement covariance, significantly reducing 3D position RMSE.

Abstract

Precise state estimation for navigation of autonomous agricultural robots is often compromised by sensor outages (GNSS/LiDAR/Visual) and high-frequency vibrations inherent in off-road environments. This paper proposes a robust navigation algorithm based on a jerk-augmented Extended Kalman Filter (EKF) integrated with a Multiple Tuning Factor (MTF) adaptation method. Unlike standard EKF approaches that assume constant measurement noise, our method dynamically adjusts the measurement covariance matrix in real time, allowing the system to cope with sudden disturbances and sensor outliers. We evaluate the algorithm using real-world data from a Salin247 autonomous robot. Results demonstrate that jerk augmentation combined with MTF adaptation significantly reduces 3D position Root Mean Square Error (RMSE) compared to baseline EKF models, providing superior dead-reckoning capabilities.
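The idea of adjusting the measurement covariance online can be illustrated with a toy scalar Kalman filter. This is only a sketch and not the paper's jerk-augmented EKF or its MTF rule: the constant-state model, the inflation rule (scale R when the normalized innovation squared exceeds `gate`), and all names are assumptions made for illustration.

```python
def adaptive_kalman_step(x, p, z, q, r, gate=9.0):
    """One scalar Kalman step with innovation-based inflation of R.

    x, p : prior state estimate and its variance
    z    : new measurement
    q, r : process and nominal measurement noise variances
    gate : hypothetical threshold on the normalized innovation squared
    """
    # Predict with a constant-state model (assumption, for brevity).
    x_pred = x
    p_pred = p + q

    # Innovation and its nominal variance.
    nu = z - x_pred
    s = p_pred + r

    # Inflate R when the innovation is larger than the model expects.
    nis = nu * nu / s
    scale = max(1.0, nis / gate)
    r_eff = r * scale

    # Standard Kalman update using the adapted measurement variance.
    k = p_pred / (p_pred + r_eff)
    x_new = x_pred + k * nu
    p_new = (1.0 - k) * p_pred
    return x_new, p_new, scale
```

With a fixed R, a single wild measurement would pull the estimate roughly halfway toward the outlier in this setup; with the inflated R the gain shrinks and the estimate barely moves.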

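2606.10967 2026-06-10 cs.CV New submission

Quo Vadis, Visual In-Context Learning? A Unified Benchmark Across Domains and Tasks

Pradnya Halady, Jiale Wei, Zdravko Marinov, Alexander Jaus, Simon Reiß

Affiliations: Karlsruhe Institute of Technology

AI summary: Because evaluation of visual in-context learning has been limited to tasks that mirror pre-training, the paper builds VIBE, a unified benchmark across domains and tasks, tests 6 models on 14 datasets and 12 tasks, and reveals their adaptive capabilities, limitations, and failure modes.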

Abstract

Visual in-context learning has been proposed as a pathway towards dynamic models that generate predictions based on a provided context and can thereby adapt to new vision tasks at test time. Yet the evaluation of these models' adaptation capabilities has been limited to narrow setups that mainly mirror tasks or image domains from pre-training, for which real adaptation is not required. We address this gap by constructing a broad Visual In-Context BEnchmark (VIBE) with a focus on diverse imaging domains and a wide range of tasks. With this, we get a much clearer picture of the adaptive capabilities of visual in-context models when faced with new image and task distributions. We stress test six models on 14 datasets and 12 tasks (106 dataset-task combinations in total) and compare them under a unified, reproducible evaluation protocol in a one-shot setting. Our evaluation uncovers key insights on the state of visual in-context learning, including limitations, systematic failure modes, and promising directions. To foster broader evaluation, we will openly release our VIBE toolkit.

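2606.10959 2026-06-10 cs.LG New submission

Population-Aware Physics-Informed Neural Particle Flow for Bayesian Update

Batu Candan, Simone Servadio

Affiliations: Iowa State University

AI summary: Proposes population-aware physics-informed neural particle flow (PA-PINPF), which augments particle updates with a population encoder and outperforms the standard method on Bayesian posterior transport.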

Abstract

Physics-informed neural particle flow (PINPF) learns a deterministic transport field that moves particles from a prior distribution toward a Bayesian posterior while enforcing the governing probability-evolution equation. However, the standard PINPF velocity model processes particles independently and therefore does not explicitly condition its transport decisions on the empirical particle population. This paper introduces population-aware PINPF (PA-PINPF), which augments each particle update with a permutation-invariant Deep Sets representation of the full particle set. We investigate two population encoders. PA-PINPF-State summarizes the particle states, whereas PA-PINPF-Feature summarizes the complete local physics-informed feature vectors, including particle position, pseudo-time, measurement information, likelihood values, and score information. The latter allows the population context to represent not only particle-cloud geometry, but also the population-level Bayesian transport geometry. The methods retain the original unsupervised physics-informed residual objective and require no ground-truth posterior samples during training. Experiments on range-measurement tasks and nonlinear time-difference-of-arrival posterior transport demonstrate that both population-aware variants improve over particle-wise PINPF, while feature-population encoding provides the strongest performance. These results show that population-level physics features provide useful global information for learned Bayesian particle transport.
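A permutation-invariant Deep Sets summary of a particle set can be sketched in a few lines. The version below is a minimal illustration, not the paper's encoder: the feature map `phi` (coordinates plus squared norm), the mean pooling, and the function names are all assumptions.

```python
def phi(particle):
    """Per-particle feature map (hypothetical: coordinates plus squared norm)."""
    return list(particle) + [sum(c * c for c in particle)]


def population_context(particles):
    """Mean-pool per-particle features; the result ignores particle order."""
    feats = [phi(p) for p in particles]
    n = len(feats)
    return [sum(col) / n for col in zip(*feats)]


def velocity_input(particle, particles):
    """Concatenate a particle's own state with the shared population context."""
    return list(particle) + population_context(particles)
```

Because the pooled context is a mean over particles, reordering the particle set leaves every particle's input unchanged, which is the property the abstract refers to as permutation invariance.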

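2606.10956 2026-06-10 cs.AI cs.CL New submission

Mind the Gap: Can Frontier LLMs Pass a Standardized Office Proficiency Exam?

Tengchao Lv, Dongdong Zhang, Jiayu Ding, Yilin Jia, Yuzhong Zhao, Yupan Huang, Wenshan Wu, Xiangyang Zhou, Shaohan Huang, Nan Yang, Li Dong, Lei Cui, Furu Wei

Affiliations: Microsoft Research

AI summary: Using 200 comprehensive practical-operation tasks from China's National Computer Rank Examination (NCRE), the paper evaluates 7 frontier LLMs on Word, Excel, and PowerPoint automation. The best single-turn model reaches a Score Rate of 36.6% and an agentic system with execution feedback reaches 68.8%, still below the 95.5% community-reference score, indicating that reliable fine-grained Office automation remains a major challenge.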

Comments 21 pages, 5 figures

Abstract

The deployment of Large Language Model (LLM) agents for computer automation is accelerating, yet their ability to navigate complex, professional-grade productivity software is largely untested. We argue that Office automation is an ideal environment for benchmarking document-automation capability, as it requires long-horizon planning and reasoning, precise parameter configuration, and multi-application integration. To quantify this capability, we introduce an evaluation based on China's National Computer Rank Examination (NCRE), featuring 200 comprehensive practical-operation tasks across Word, Excel, and PowerPoint. Each task is scored on a 100-point rubric scale using 7,118 machine-gradable criteria, and Score Rate (SR) denotes the mean percentage of rubric points earned across these tasks. We benchmark 7 frontier LLMs and observe stark limitations: single-turn models score a maximum of 36.6%. A stronger agentic system with execution feedback, iterative repair, and broader Office automation access reaches 68.8%, but remains below the 95.5% community-reference score used as a scoring sanity check. Ultimately, our experiments demonstrate that despite recent advancements in code generation, achieving reliable fine-grained Office document automation remains a significant challenge for current code-generating LLM and agent systems.
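The Score Rate described above (the mean percentage of rubric points earned across tasks) reduces to a short computation. The sketch below assumes each task score is already expressed on the 100-point scale; the function name and the example scores are hypothetical.

```python
def score_rate(task_scores, max_points=100.0):
    """Mean percentage of rubric points earned across tasks."""
    if not task_scores:
        raise ValueError("at least one task score is required")
    fractions = [s / max_points for s in task_scores]
    return sum(fractions) / len(fractions) * 100.0
```

For example, three tasks scored 100, 50, and 0 give a Score Rate of 50%.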

2606.10953 2026-06-10 cs.AI cs.CV New submission

Architect-Ant: Editable Automatic Furnishing of Architectural Floor Plans

Fedor Rodionov, Aleksandar Cvejic, Michael Birsak, John Femiani, Peter Wonka

Affiliations: King Abdullah University of Science and Technology (KAUST); Miami University

AI summary: Proposes Architect-Ant, an editable automatic furnishing framework based on a fine-tuned vision-language model, which encodes layouts in a domain-specific language and optimizes them to generate plausible layouts that satisfy architectural constraints.

Comments 17 pages, 10 figures

Abstract

Furnished floor plans are fundamental to real estate visualization, interior design, and architectural workflows. However, progress in automatic furniture arrangement has been limited by the lack of real, professionally designed floor-plan datasets with object-level furniture annotations. To address this gap, we introduce AntPlan-270, a curated dataset of 270 architectural floor plans with per-room furniture bounding box annotations across ten residential room categories. Building on this dataset, we present Architect-Ant, an editable automatic furnishing framework powered by a fine-tuned vision-language model. Furniture layouts are represented using a compact, coordinate-based domain-specific language (DSL) that encodes object categories and placements relative to the room geometry. To improve spatial reasoning, we generate procedural reasoning traces that capture architectural constraints such as wall alignment, door and window clearance, circulation, fixture compatibility, and room-specific furniture inventories, and use them to supervise fine-tuning of the model. We then apply preference optimization over candidate object placements to further refine layout quality. The generated DSL can be rasterized into semantic masks and used to condition a Flux-based LoRA renderer, producing realistic blueprint-style furnished floor-plan images while preserving the editable symbolic layout. Experiments on layout furnishing show that Architect-Ant produces geometrically valid and functionally plausible layouts, and suggest a scalable path for furnishing larger structure-only floor-plan datasets.

2606.10949 2026-06-10 cs.AI New submission

Recalling Too Well: Sycophancy Evaluation and Mitigation in Memory-Augmented Models

Shelly Bensal, Axel Magnuson, Aparna Balagopalan, Daniel M. Bikel

Affiliations: Writer, Inc.

AI summary: Presents the first systematic evaluation of sycophancy in memory-augmented models, introduces the MIST benchmark, finds that memory amplifies sycophantic behavior (by up to 25x), and proposes two lightweight mitigations.

Comments Under submission; preprint

Abstract

Persistent memory systems promise to make LLMs more helpful by storing user beliefs over time. We show they also make models less correct by systematically amplifying sycophancy, wherein models prioritize agreement with users over accuracy. We conduct the first systematic evaluation of this effect, introducing MIST: a benchmark of synthetically generated multi-turn conversations where users express plausible misconceptions in scientific, medical, and moral reasoning domains. Testing across three state-of-the-art memory systems and five model families reveals that memory amplifies sycophantic behavior across all conditions, with up to 25x higher sycophancy rates than in-context baselines. Error analyses suggest memory extraction as the primary culprit: lossy compression into discrete snippets encodes user misconceptions while discarding corrective context. Based on these results, we propose two lightweight mitigations that substantially reduce sycophancy while matching or exceeding memory systems at factual recall.

2606.10940 2026-06-10 cs.CV cs.AI cs.LG New submission

Democratising Camera Trap AI: An Open-Source Model for Detecting UK Mammals

Paul Fergus, Philip Stephens, Russell A. Hill, Lee Oliver, Katie Appleby, Sarah Beatham, Naomi Davies Walsh, Stuart Nixon, Naomi Matthews, Chris Sutherland, Kelly Hitchcock

Affiliations: Liverpool John Moores University; Durham University; MammalWeb; Game & Wildlife Conservation Trust; National Trust; Animal and Plant Health Agency; Chester Zoo; University of St Andrews; Nottingham Trent University

AI summary: Releases an open-source object detection model for 31 classes (28 common UK mammal and bird species plus utility classes), a YOLO26x detector trained on 48,165 labelled instances that reaches mAP@0.5 of 0.984, aiming to lower the barrier for ecologists to use AI.

Comments 15 Pages, 4 Figures

Abstract

Camera traps have become a cornerstone of biodiversity monitoring, but the artificial intelligence that turns vast quantities of images into usable ecological data is often locked behind commercial platforms or trained on fauna that does not match that of the British Isles. To remove barriers and increase uptake, we release an open-source object detection model for 31 classes: 28 common UK mammal and bird species, plus utility classes for humans, calibration poles, and vehicles. The training data is a curated dataset of 48,165 labelled instances assembled from multiple sites over a decade of operational deployment through Conservation AI and its successor, Trap Tracker. The model, a YOLO26x detector trained and tested on an 80/10/10 class-stratified split, achieves a mean Average Precision of 0.984 at an Intersection over Union (IoU) threshold of 0.5 (0.956 at IoU 0.5-0.95) on the held-out validation set, with precision 0.988 and recall 0.965. On an unseen held-out test split, mean per-species confidence ranged from 0.96 to 0.99 across the 31 classes, with a 0.17% false-negative rate concentrated in difficult night-time, distant, or occluded images. These metrics come from the same pool of sites and cameras as the training data, so performance at entirely new sites is left to future work. We release the trained weights in ONNX format under a non-commercial licence, with local desktop and real-time camera support, aimed explicitly at ecologists with no machine-learning experience. This release is a deliberate counterweight to the multiple paid-for models that have been developed over the last decade.
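For readers unfamiliar with the metrics quoted above, the following sketch shows how Intersection over Union, precision, and recall are commonly computed. It is a generic illustration, not the authors' evaluation code; the `(x_min, y_min, x_max, y_max)` box format and the function names are assumptions.

```python
def iou(box_a, box_b):
    """Intersection over Union for boxes given as (x_min, y_min, x_max, y_max)."""
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    # Width and height of the overlap are clamped at zero for disjoint boxes.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def precision_recall(tp, fp, fn):
    """Precision and recall from true-positive, false-positive, false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

A detection is usually counted as a true positive at "IoU 0.5" when its box overlaps a ground-truth box of the same class with `iou(...) >= 0.5`.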

2606.10939 2026-06-10 cs.CV New submission

PENet+: A Lightweight Residual Transformer Framework for Efficient Image Steganalysis

Jincheol AN, Dongsu Kim, Haneol Jang, YoungJoon Yoo

Affiliations: Chung-Ang University; Hanbat National University; SNUAILAB

AI summary: Proposes PENet+, which retains the self-attention topology, streamlines the classifier channels, and uses an activation-aware HPF stem and a MobileNetV2 backbone to substantially reduce parameters and FLOPs while maintaining detection accuracy.

Comments IEEE ACCESS

Abstract

Image steganalysis, the detection of hidden information embedded in digital images, is a core component of modern cybersecurity and digital forensics. Recent residual Transformer architectures, such as the Pixel-Difference-Convolution and Enhanced-Transformer-Network (PENet) [1], achieve strong detection accuracy, but their computational and memory demands hinder deployment in resource-constrained settings. We present PENet+, a lightweight steganalysis framework that preserves PENet's discriminative structure while substantially improving efficiency. Rather than redesigning or compressing the attention blocks, we retain PENet's self-attention topology for reproducibility and add a classifier-streamlining stage that progressively narrows the SPP-to-FC1 input channels (SPP: spatial pyramid pooling; FC1: first fully connected layer), yielding large reductions in parameters and FLOPs with negligible accuracy loss. We further refine the high-pass-filter (HPF) stem with an activation-aware mechanism that aggregates HPF responses early and selects a balanced SRM-Gabor top-K subset, and we replace PENet's backbone with a MobileNetV2-style inverted residual network. A balanced configuration with K=31 filters (16 Gabor + 15 SRM) matches or surpasses heavier settings at lower compute. Finally, we motivate PReLU from a steganalysis standpoint, arguing that preserving negative responses helps capture weak stego cues that ReLU suppresses. On a disjoint ALASKA2 JPEG QF90 protocol at 512x512 resolution (5,000 cover images for training, validation, and internal testing; a separate 19,000-cover evaluation set), PENet+ achieves up to 45.5% fewer parameters and about 97% fewer FLOPs than the re-evaluated PENet baseline, offering a computationally efficient direction for resource-constrained steganalysis. Device-level latency and power measurements remain future work.

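2606.10938 2026-06-10 cs.LG New submission

A Systematic Approach for Selecting Trajectories for Data Augmentation

Adam Nordling

Affiliations: Masters Degree Project

AI summary: Proposes a systematic framework for evaluating five trajectory selection strategies (Outlierness, Diversity, Representativeness, Uncertainty, and Random) on four datasets, finding that Outlierness and Uncertainty are more stable than random selection, and that augmentation repairs sparse datasets but can act as noise in dense ones.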

Comments 39 pages, 4 figures, Masters project

Abstract

Trajectory data augmentation is a promising approach to mitigate data scarcity in machine learning applications, but its utility has been limited by the complexity of preserving spatio-temporal coherence. Although prior work demonstrated the viability of geometric perturbation, it relied on naive random selection, leaving a critical gap in understanding which trajectories should be augmented for maximal benefit. This thesis addresses this gap by developing a systematic and scalable framework to evaluate five selection strategies: Outlierness, Diversity, Representativeness, Uncertainty, and Random selection. These strategies were rigorously tested across four datasets covering animal behavior (Foxes and Starkey), maritime traffic (AIS), and urban traffic (Car) using a suite of linear and non-linear machine learning models. As part of this evaluation, an Optuna-based hyperparameter optimization loop was integrated to empirically identify the best-performing augmentation parameters for each dataset within the explored search space. The results indicate that, while systematic selection is not a universal solution, it offers distinct advantages over the random baseline. Systematic strategies, particularly Outlierness and Uncertainty, demonstrated higher stability and were less prone to the performance degradation observed with random sampling in dense datasets. However, the findings also reveal that the value of augmentation is strictly conditional. Visual analysis via UMAP demonstrates that while systematic augmentation successfully repairs topological fragmentation in sparse datasets, it can act as a corrupting noise signal in high-quality, dense datasets. Furthermore, the study identified physical limitations in high-velocity domains, where standard perturbation techniques lead to divergence in feature space...

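2606.10935 2026-06-10 cs.LG cs.AI New submission

CLP: Collocation-Length Prediction for Zero-Loss Adaptive Multi-Token Inference

Xuezhen Xie, Zhiqiang Zhou

Affiliations: Xiamen University; Tsinghua University

AI summary: Proposes CLP, which uses a lightweight linear layer to predict how many additional tokens can be safely accepted, addressing the output degradation caused by head-backbone competition in multi-token prediction and achieving 1.14x-1.29x speedup on Qwen2.5 models with zero quality loss.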

Comments 13 pages, 8 figures, 8 tables

Abstract

Large language model inference is bottlenecked by autoregressive decoding, where each token requires a full forward pass. Multi-token prediction (MTP) offers a promising acceleration path, but existing approaches suffer from a fundamental architectural flaw: the MTP head for the first token competes with the backbone's own language model (LM) head, leading to severe quality degradation when predictions are accepted. We identify this head-backbone competition as the root cause of repetitive and incoherent outputs in prior MTP-based acceleration methods. To address this, we propose Backbone-as-Architect, a design principle where the backbone LM head always generates the first token, and MTP heads are responsible only for subsequent tokens. Building on this principle, we introduce CLP (Collocation-Length Predictor), a lightweight span-level decision layer that predicts how many additional tokens can be safely accepted at each decoding step. CLP uses only a single linear layer (4.6K--7.7K parameters), replacing the over-engineered 1M-parameter gate networks used in prior work. Experiments on Qwen2.5 models (0.5B, 1.5B, 7B) show that CLP achieves 1.20x--1.29x speedup on 1.5B and 1.14x--1.20x on 7B, with zero quality degradation (repetition ratio < 0.02), while gate-based approaches fail to accelerate (1.07x) or produce severely degraded outputs (repetition ratio > 0.5%). We further demonstrate that shorter prediction horizons (k=2) recover 24% higher MTP head accuracy on large models, establishing a scaling-aware design principle. We identify MTP head prediction accuracy as the binding constraint on acceleration and establish a clear roadmap for future improvements.
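The decoding rule described above (the backbone always emits the first token, and a single linear layer decides how many drafted tokens to append) can be sketched as follows. This is an illustrative reading of the abstract, not the authors' implementation: the class encoding (class i means "accept i extra tokens"), the argmax decision, and all names are assumptions.

```python
def predict_extra_tokens(hidden, weights, bias):
    """Single linear layer over the hidden state; class i means 'accept i extra tokens'."""
    logits = [
        sum(w * h for w, h in zip(row, hidden)) + b
        for row, b in zip(weights, bias)
    ]
    # Index of the largest logit (first one on ties).
    return max(range(len(logits)), key=logits.__getitem__)


def decode_step(backbone_token, mtp_tokens, hidden, weights, bias):
    """The backbone token always comes first; MTP drafts only extend it."""
    n_extra = min(predict_extra_tokens(hidden, weights, bias), len(mtp_tokens))
    return [backbone_token] + list(mtp_tokens[:n_extra])
```

Under this reading, a step never returns fewer tokens than ordinary autoregressive decoding, since the backbone token is emitted regardless of what the length predictor outputs.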

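2606.10934 2026-06-10 cs.AI New submission

WorldKernel: A World Model is the Coupling Kernel of Admissible Possible Worlds

Fabio Rovai

Affiliations: The Tesseract Academy

AI summary: Finds that strong predictors fail on counterfactual couplings and proposes modelling a world model as a positive semidefinite coupling kernel over admissible worlds, whose off-diagonal entries encode counterfactual information, with efficient inference through the semidefiniteness constraint and logical axioms.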

Abstract

A common assumption holds that enough observational and interventional data, given to a strong enough predictor, suffices. We report a failure mode that contradicts it. Across hundreds of structural causal models, on identified quantities a strong predictor and a Bayesian baseline both succeed, but on unidentified quantities (the couplings between counterfactual worlds) the predictor collapses to a point, on 28% of models to one no valid model can produce, while the truth is an admissible interval more data never narrows. The gap is structural: prediction cannot represent uncertainty over counterfactual couplings. We cast a world model as a single positive semidefinite coupling kernel K(T,T') over admissible worlds, whose diagonal is the ordinary posterior (what a predictor recovers) and whose off-diagonal is the cross-world coupling it cannot, which every counterfactual reads. The paper is the theory of that off-diagonal. It is real: two states with identical posteriors differ on a cross-world query, and the off-diagonal is the coupling that fixes counterfactuals. It can be bounded: positive semidefiniteness is partial-identifying information the marginals lack, and enforcing it bounds counterfactuals in polynomial time where the exact response-type program is intractable. Logical structure sharpens it: ontology axioms tighten the bound by up to a third, propagating to couplings they never touch. It can be acquired: targeted scars, constraints learned from encountered infeasibilities, close the gap several times faster than untargeted ones. Its full reconstruction is approximate counting of the admissible worlds, tractable below the Sly-Sun threshold and inapproximable above; we do not claim to beat the worst case.
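The simplest instance of how positive semidefiniteness bounds an off-diagonal entry using only the diagonal is the 2x2 principal-minor condition. This is a standard linear algebra fact for a real symmetric kernel, shown here only to illustrate the kind of constraint involved; it is not the paper's specific bound.

```latex
\begin{pmatrix} K(T,T) & K(T,T') \\ K(T,T') & K(T',T') \end{pmatrix} \succeq 0
\;\Longrightarrow\;
K(T,T')^2 \le K(T,T)\,K(T',T'),
\qquad\text{so}\qquad
|K(T,T')| \le \sqrt{K(T,T)\,K(T',T')} .
```

In other words, the diagonal alone already confines each cross-world coupling to an interval, and requiring semidefiniteness of larger submatrices can only tighten it.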

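2606.10933 2026-06-10 cs.AI New submission

Frontier Coding Agents Use Metaprogramming to Adapt to Unfamiliar Programming Languages

Aman Sharma, Sushrut Thorat, Paras Chopra

Affiliations: Lossfunk

AI summary: Evaluates LLM coding agents on unfamiliar languages and finds that strong agents adapt through metaprogramming (for example, using Python to generate target-language code); forbidding this strategy causes large performance drops, while text guidance distilled from it does not materially help weaker agents.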

Comments 43 pages, 8 figures

Abstract

LLM-based coding agents are usually evaluated in familiar software settings: mainstream languages, common libraries, and public repositories. These benchmarks remain important, but they can hide how agents behave when the language itself is unfamiliar. We evaluate six contemporary coding agents on four esoteric programming languages using a sequential setup with file editing, local execution, and hidden-test grading. Our protocol exposes capability differences between these agents that mainstream coding and agentic benchmarks such as SWE-Bench Verified and Terminal-Bench 2.0 compress into much narrower bands. We observe that the strongest agents, Claude Opus 4.6 and GPT-5.4 xhigh, often avoid writing the target language directly. On Brainfuck and Befunge-98, they write Python programs that generate target-language code and debug those generators locally. Forbidding this metaprogramming strategy causes large performance drops. Text guidance distilled from this strategy does not materially improve weaker agents. In contrast, Opus-derived Python helper code for building generators, with no solved benchmark programs or hidden-test answers, sharply improves Sonnet 4.6 and GPT-5.4 mini on the same problems, while Haiku 4.5 remains low. More interpreter calls and output tokens improve stronger agents but leave weaker agents near their original performance, indicating that these resources amplify useful strategies rather than create them. Together, these results show that strong coding agents adapt to unfamiliar languages by using tools, feedback, and workspace state to build a working model of the target language. Metaprogramming is the clearest case, but the broader gap is constructing and debugging a strategy that works under the target language's rules.
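The metaprogramming strategy described above, writing Python that emits target-language code and checking it locally, can be illustrated with a toy Brainfuck generator and a small interpreter. This is a minimal sketch, not the agents' actual helper code; the single-cell encoding and both function names are assumptions.

```python
def bf_for_text(text):
    """Generate Brainfuck that prints `text` using one cell and +/- deltas."""
    code, current = [], 0
    for ch in text:
        target = ord(ch)
        delta = target - current
        code.append(("+" if delta > 0 else "-") * abs(delta))
        code.append(".")
        current = target
    return "".join(code)


def run_bf(code, max_steps=1_000_000):
    """Tiny Brainfuck interpreter (no input command) for checking generated code."""
    tape, ptr, pc, out = [0] * 30000, 0, 0, []
    # Pre-compute matching bracket positions.
    stack, jump = [], {}
    for i, c in enumerate(code):
        if c == "[":
            stack.append(i)
        elif c == "]":
            j = stack.pop()
            jump[i], jump[j] = j, i
    steps = 0
    while pc < len(code) and steps < max_steps:
        c = code[pc]
        if c == "+":
            tape[ptr] = (tape[ptr] + 1) % 256
        elif c == "-":
            tape[ptr] = (tape[ptr] - 1) % 256
        elif c == ">":
            ptr += 1
        elif c == "<":
            ptr -= 1
        elif c == ".":
            out.append(chr(tape[ptr]))
        elif c == "[" and tape[ptr] == 0:
            pc = jump[pc]
        elif c == "]" and tape[ptr] != 0:
            pc = jump[pc]
        pc += 1
        steps += 1
    return "".join(out)
```

Running `run_bf(bf_for_text("Hi!"))` returns the original string, which is the kind of local generate-and-verify loop the abstract attributes to the strongest agents.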

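2606.10932 2026-06-10 cs.CL cs.LG New submission

Density Field State Space Models: 1-Bit Distillation, Efficient Inference, and Knowledge Organization in Mamba-2

Chirag Shinde

Affiliations: Independent Researcher

AI summary: Proposes the DF-SSM framework, which compresses SSMs to a 1-bit scaffold with int8 low-rank correction. Applied to Mamba-2 1.3B, it achieves 9.7x compression and 21.4x faster inference with only 32M tokens and 6 hours of distillation, and identifies three processing phases in the model's internal knowledge organization.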

Comments 16 pages, 6 figures, 7 tables. Code available at https://github.com/cs-cmyk/df-ssm

Abstract

We present Density Field State Space Models (DF-SSM), a framework for compressing SSMs to a 1-bit scaffold with int8 low-rank correction. Applied to Mamba-2 1.3B, it yields a 278 MB model (9.7x smaller than the 2.7 GB FP16 teacher) that runs inference 21.4x faster on GPU (batch=1, relative to the mamba-ssm reference implementation) while maintaining downstream task performance within 2-4 percentage points of BitMamba-2, a 1.58-bit model trained from scratch on 150B tokens. The distillation itself requires only 32M tokens and 6 hours on a single A100 GPU, though it presupposes a pretrained FP16 teacher. We develop an optimized inference pipeline combining cuBLAS INT8 tensor cores for the scaffold matmul, custom CUDA kernels for stateful SSM and convolution operations, and an AVX-512 CPU backend for efficient deployment on both GPU and CPU. Beyond compression, we investigate the internal knowledge organization of the resulting model, discovering three distinct processing phases: intent classification (layers 0-3, operating in an abstract space with no vocabulary alignment), knowledge retrieval (layers 25-35, where factual associations localize to a 5-layer window), and output formatting (layers 36-47, where category structure dissolves). Through systematic analysis of 445 factual prompts across 19 categories, we find that early-layer classification is syntactic (driven by template structure) rather than semantic, and that the model exhibits well-organized knowledge representations despite weak factual recall, suggesting that representational structure may precede factual strength.
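To make "1-bit scaffold plus correction" concrete, the sketch below binarizes one weight row to signs with a single scale and computes the residual that a correction term would have to model. It is an assumption-laden illustration, not DF-SSM itself: the per-row mean-absolute-value scale is a common choice for sign quantization, the low-rank int8 factorization of the residual is omitted, and all names are hypothetical. The last line only rechecks the size ratio quoted in the abstract.

```python
def binarize_row(row):
    """Return (scale, signs) with scale = mean |w| and signs in {-1, +1}."""
    scale = sum(abs(w) for w in row) / len(row)
    signs = [1 if w >= 0 else -1 for w in row]
    return scale, signs


def dequantize_row(scale, signs):
    """Reconstruct the 1-bit approximation of the row."""
    return [scale * s for s in signs]


def residual(row, scale, signs):
    """What a correction term (low-rank int8 in the paper) would need to model."""
    return [w - scale * s for w, s in zip(row, signs)]


# Size ratio from the abstract: 2.7 GB FP16 teacher vs. 278 MB compressed model.
compression_ratio = 2.7e9 / 278e6
```

For the row `[0.5, -1.5, 1.0, -1.0]` the scale is 1.0 and the residual is `[-0.5, -0.5, 0.0, 0.0]`, so only the first two entries need correcting.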

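2606.10929 2026-06-10 cs.LG cs.AI New submission

Recoverable but Not Stationary: Local Linear Structures in Weights and Activations

Irina Piontkovskaia, Sergey Nikolenko

Affiliations: St. Petersburg Department of the Steklov Institute of Mathematics; St. Petersburg State University

AI summary: Studies which linear structures exist in neural networks and at what scale. Finds local low-rank task-gradient structure but rejects the fixed-task-plane hypothesis; shows that the first recovery updates form a trajectory-prefix basis that captures most of the recovery displacement; develops a random search theory that explains the effectiveness of random parameter search in high dimensions; and examines the relation between parameter perturbations and activation steering.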

Comments 23 pages, 8 tables, 9 figures

英文摘要

Task vectors, LoRA, activation steering, and random search around pretrained weights all suggest that learned behaviour can be controlled by linear directions. We ask which linear structures actually exist and on what scale. In a synthetic multitask transformer and LoRA adapters on DistilGPT-2 / GPT-2 we find strong local low-rank task-gradient structure but reject the fixed-task-plane hypothesis: static bases miss the recovery direction, and the useful basis drifts substantially within 100 steps. However, the first recovery updates form a trajectory-prefix basis capturing 77% of the LoRA recovery displacement. We develop random search theory with a Gaussian local-linear theorem that justifies the effectiveness of random parameter search even in very high dimensions. We also study the relation between parameter perturbations and activation steering: a single gradient step produces an activation shift with 0.58 cosine to a labelled-contrast CAA steering vector, with a similar steering effect on Qwen-0.5B BoolQ statements. We validate our results with experiments on synthetic Transformers and LLMs. Our results suggest that linear structures in trained networks are not global task directions, but evolving local geometries that partially persist across parameter and activation spaces.
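A statement such as "a trajectory-prefix basis captures 77% of the recovery displacement" can be read as a projection measurement. The sketch below assumes the fraction is measured in squared norm and that the first k updates are linearly independent; the abstract specifies neither, and `captured_fraction` is a hypothetical helper.

```python
import numpy as np

def captured_fraction(updates, displacement, k):
    """Fraction of ||displacement||^2 lying in the span of the first k updates.

    Assumes the first k update vectors are linearly independent.
    """
    prefix = np.asarray(updates[:k], dtype=float).T   # shape (dim, k)
    Q, _ = np.linalg.qr(prefix)                       # orthonormal basis of the prefix span
    proj = Q @ (Q.T @ displacement)                   # orthogonal projection onto that span
    return float(proj @ proj / (displacement @ displacement))

# Toy example: two axis-aligned updates in a 3-dimensional parameter space.
updates = [np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])]
frac_full = captured_fraction(updates, np.array([3.0, 4.0, 0.0]), k=2)
frac_half = captured_fraction(updates, np.array([1.0, 0.0, 1.0]), k=2)
```

Comparing this fraction for a basis built early in training against one built later is one way to observe the basis drift the abstract describes.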

2606.10927 2026-06-10 cs.RO 新提交

AllDayNav: Lifelong Navigation via Real-World Reinforcement Learning

AllDayNav: 通过真实世界强化学习实现终身导航

Hang Yin, Yinan Liang, Jiazhao Zhang, Jiahang Liu, Minghan Li, Zhizheng Zhang, He Wang

发表机构 * Tsinghua University(清华大学) Galbot Robotics Peking University(北京大学) Beijing Academy of Artificial Intelligence(北京人工智能研究院)

AI总结 提出AllDayNav框架,利用自进化多模态记忆和强化学习隐式编码场景动态,在跨房间、跨回合和跨任务场景中实现接近100%的成功率,超越基于地图、VLM和RL的基线方法。

Comments Project Page: https://bagh2178.github.io/AllDayNav/

详情
AI中文摘要

在动态环境中进行终身具身导航需要机器人从碎片化观察中形成持久的场景理解,这对于依赖显式地图或场景图且难以泛化到结构化设置之外的现有方法仍然困难。我们提出AllDayNav,一个终身自学习导航框架,通过强化学习将场景动态隐式编码到大模型的十亿级参数中,并由一个自进化的多模态记忆驱动,该记忆维护和更新视觉关键帧、语义描述和时间上下文,同时自主生成开放词汇指令、图像目标和结构化奖励。在合成和真实环境中的跨房间、跨回合和跨任务场景实验表明,AllDayNav实现了接近100%的成功率,并在路径效率和鲁棒性上持续超越基于地图、VLM和RL的强基线,证明了隐式、记忆驱动的强化学习作为可靠终身导航的可扩展替代方案。

英文摘要

Lifelong embodied navigation in dynamic environments requires robots to form persistent scene understanding from fragmentary observations, which remains difficult for existing methods that rely on explicit maps or scene graphs and struggle to generalize beyond structured settings. We propose AllDayNav, a lifelong self-learning navigation framework that implicitly encodes scene dynamics into the billion-scale parameters of a large model via reinforcement learning, powered by a self-evolving multimodal memory that maintains and updates visual keyframes, semantic descriptions, and temporal context while autonomously generating open-vocabulary instructions, image goals, and structured rewards. Experiments in both synthetic and real-world environments across cross-room, cross-episode, and cross-task scenarios show that AllDayNav achieves success rates approaching 100% and consistently surpasses strong map-based, VLM, and RL baselines in path efficiency and robustness, demonstrating implicit, memory-driven reinforcement learning as a scalable alternative to explicit mapping for reliable lifelong navigation.

2606.10921 2026-06-10 cs.CL 新提交

Trace Only What You Need: Structure-Aware On-Demand Hypergraph Memory for Long-Document Question Answering

仅追踪所需:面向长文档问答的结构感知按需超图记忆

Xiangjun Zai, Xingyu Tan, Chen Chen, Xiaoyang Wang, Wenjie Zhang

发表机构 * University of New South Wales(新南威尔士大学) CSIRO(澳大利亚联邦科学与工业研究组织) University of Wollongong(伍伦贡大学)

AI总结 提出DocTrace,一种多智能体RAG框架,通过查询触发的知识组织、文档结构感知和经验引导推理,解决长文档问答中知识组织成本高、结构利用不足和推理经验无法复用的问题,在三个数据集上取得最佳性能。

详情
AI中文摘要

长文档问答需要大型语言模型对散布在长文档中的证据进行推理,答案通常依赖于事件顺序、章节级上下文和跨部分证据连接。尽管检索增强生成通过检索相关证据减少了输入上下文,但现有的结构化RAG方法仍面临三个限制:代价高昂的查询无关知识组织、对原始文档结构利用不足以及无法复用历史推理经验。为解决这些限制,我们提出了DocTrace,一个用于长文档问答的多智能体RAG框架,支持查询触发的知识组织、文档结构感知和经验引导推理。DocTrace通过轻量级文档结构树索引保留文档层次结构,在推理过程中按需构建智能体共享的超图结构工作记忆,并将成功的推理计划存储在图形结构经验记忆中以便未来复用,从而实现对相关长文档问题的自适应探索。在四个长文档问答数据集上的实验表明,DocTrace在三个数据集上取得了最佳性能,在F1和EM上分别比最强基线ComoRAG高出8.85%和4.40%,同时将总体计算成本降低了53.32%。

英文摘要

Long-document question answering (QA) requires large language models (LLMs) to reason over evidence scattered across lengthy documents, where answers often depend on event order, section-level context, and cross-part evidence connections. Although retrieval-augmented generation (RAG) reduces the input context by retrieving relevant evidence, existing structured RAG methods still face three limitations: costly query-agnostic knowledge organization, insufficient use of original document structure, and no reuse of historical reasoning experience. To address these limitations, we propose DocTrace, a multi-agent RAG framework for long-document QA that supports query-triggered knowledge organization, document-structure awareness, and experience-guided reasoning. DocTrace preserves document hierarchy with a lightweight document structural tree index, constructs agent-shared hypergraph-structured working memory on demand during reasoning, and stores successful reasoning plans in graph-structured experience memory for future reuse, enabling adaptive exploration across related long-document questions. Experiments on four long-document QA datasets show that DocTrace achieves the best performance on three datasets, surpassing the strongest baseline, ComoRAG, by up to 8.85% in F1 and 4.40% in EM, while reducing the overall computational cost by 53.32%.

2606.10918 2026-06-10 cs.RO cs.LG 新提交

Task Robustness via Re-Labelling Vision-Action Robot Data

通过重新标注视觉-动作机器人数据的任务鲁棒性

Artur Kuramshin, Özgür Aslan, Cyrus Neary, Glen Berseth

发表机构 * Mila — Quebec AI Institute(Mila — 魁北克人工智能研究所) Université de Montréal(蒙特利尔大学) The University of British Columbia(不列颠哥伦比亚大学)

AI总结 提出TREAD框架,利用大型视觉语言模型对机器人数据集进行语义子任务分解和多样化指令生成,无需额外数据收集,提升策略在未见任务上的泛化能力。

Comments Project website: https://akuramshin.github.io/tread

详情
AI中文摘要

近年来,机器人学习模型规模的扩大产生了令人印象深刻的策略,能够执行各种操作任务并泛化到新场景。然而,这些策略在遵循指令方面仍然存在困难,很可能是因为现有机器人数据集中的语言和动作序列多样性有限。本文介绍了通过重新标注视觉-动作机器人数据实现任务鲁棒性(TREAD),这是一个可扩展的框架,利用大型视觉语言模型(VLM)在不进行额外数据收集的情况下增强现有机器人数据集,利用这些模型中嵌入的可迁移知识。我们的方法通过三个阶段利用预训练的VLM:从原始指令标签和初始场景生成语义子任务,根据这些子任务对演示视频进行分割,并生成包含对象属性的多样化指令,有效地将较长的演示分解为基于语言-动作对。我们进一步通过用语言多样化的文本目标版本增强数据来提高鲁棒性。在LIBERO上的评估表明,在我们增强的数据集上训练的策略在未见过的、新颖的任务和目标上表现出改进的性能。我们的结果表明,TREAD通过轨迹分解增强了规划泛化,并通过增加语言多样性增强了语言条件策略泛化。

英文摘要

The recent trend in scaling models for robot learning has resulted in impressive policies that can perform various manipulation tasks and generalize to novel scenarios. However, these policies continue to struggle with following instructions, likely due to the limited linguistic and action sequence diversity in existing robotics datasets. This paper introduces Task Robustness via Re-Labelling Vision-Action Robot Data (TREAD), a scalable framework that leverages large Vision-Language Models (VLMs) to augment existing robotics datasets without additional data collection, harnessing the transferable knowledge embedded in these models. Our approach leverages a pretrained VLM through three stages: generating semantic sub-tasks from original instruction labels and initial scenes, segmenting demonstration videos conditioned on these sub-tasks, and producing diverse instructions that incorporate object properties, effectively decomposing longer demonstrations into grounded language-action pairs. We further enhance robustness by augmenting the data with linguistically diverse versions of the text goals. Evaluations on LIBERO demonstrate that policies trained on our augmented datasets exhibit improved performance on novel, unseen tasks and goals. Our results show that TREAD enhances both planning generalization through trajectory decomposition and language-conditioned policy generalization through increased linguistic diversity.

2606.10917 2026-06-10 cs.AI 新提交

Role-Agent: Bootstrapping LLM Agents via Dual-Role Evolution

Role-Agent: 通过双角色演化引导LLM智能体

Xucong Wang, Ziyu Ma, Shidong Yang, Tongwen Huang, Pengkun Wang, Yong Wang, Xiangxiang Chu

发表机构 * University of Science and Technology of China(中国科学技术大学) AMAP, Alibaba Group(阿里巴巴集团高德地图)

AI总结 提出Role-Agent框架,让单个LLM同时作为智能体和环境,通过世界在智能体(WIA)和智能体在世界(AIW)两个组件实现自举协同演化,在多个基准上平均提升超过4%。

Comments 20 pages, including 12 pages of main text and 8 pages of appendix; work in progress

详情
AI中文摘要

尽管大型语言模型(LLM)智能体在复杂任务上表现出色,但其学习常受限于低效的交互反馈和静态训练环境,阻碍了更广泛的泛化。为解决这些问题,本文引入Role-Agent,一个利用单个LLM同时充当智能体和环境的框架,实现自举协同演化。Role-Agent包含两个协同组件:世界在智能体(WIA)和智能体在世界(AIW)。在WIA中,LLM作为智能体,在每个动作后预测未来状态;预测状态与实际状态的对齐被用作过程奖励,鼓励环境感知推理。在AIW中,LLM分析失败轨迹中的失败模式,并检索具有相似失败模式的任务,从而重塑训练数据分布以进行针对性练习。在多个基准上的实验表明,Role-Agent持续提升性能,相比强基线平均提升超过4%。

英文摘要

Although Large Language Model (LLM) agents have demonstrated strong performance on complex tasks, their learning is often limited by inefficient interaction feedback and static training environments, which hinder broader generalization. To address these limitations, this paper introduces Role-Agent, a framework that harnesses a single LLM to function concurrently as both the agent and the environment, enabling a bootstrapped co-evolution. Role-Agent comprises two synergistic components: World-In-Agent (WIA) and Agent-In-World (AIW). In WIA, the LLM acts as the agent and predicts future states after each action; the alignment between predicted and actual states is then used as a process reward, encouraging environment-aware reasoning. In AIW, the LLM analyzes failure modes from failed trajectories and retrieves tasks with similar failure patterns, thereby reshaping the training data distribution for targeted practice. Experiments on multiple benchmarks show that Role-Agent consistently improves performance, yielding an average gain of over 4% over strong baselines.

2606.10913 2026-06-10 cs.LG stat.ML 新提交

Conservation Laws from Data Symmetry in Neural Networks

神经网络中数据对称性导致的守恒律

Jakob Galley, Vahid Shahverdi, Axel Flinth

发表机构 * Umeå University(于默奥大学)

AI总结 研究训练数据的对称性是否在梯度流训练中产生守恒量,证明对于解析非多项式损失函数,数据对称性一般不产生额外守恒量;对于均方误差损失,数据增强可产生额外守恒量,并利用可张量化网络框架描述该现象。

详情
AI中文摘要

我们探讨训练数据的内在对称性是否在神经网络的梯度流训练中导致守恒量。在假设损失函数是解析且非多项式的情况下,我们证明数据对称性通常不会诱导任何额外的运动积分。另一方面,对于均方误差(MSE)损失,存在数据增强产生额外守恒量的情况。我们构建了一个利用可张量化网络来描述这一现象的框架。可张量化网络是一类架构,其参数和输入的依赖关系可以通过中间表示分离。它们包括线性网络、多项式网络以及闪电注意力(Lightning Attention)。

英文摘要

We explore whether intrinsic symmetries of the training data lead to conserved quantities during gradient-flow training of neural networks. Under the assumption that the loss function is analytic and non-polynomial, we prove that data symmetries generically do not induce any additional integrals of motion. For mean squared error (MSE) loss, on the other hand, there are situations in which data augmentation yields extra conserved quantities. We build a framework that uses tensorizable networks to describe this phenomenon. Tensorizable networks are a family of architectures whose dependence on parameters and inputs can be separated using an intermediate representation. They include linear and polynomial networks, as well as Lightning Attention.
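For readers unfamiliar with conserved quantities under gradient flow, the standard architecture-induced example for a two-layer linear network may help. It is a well-known baseline, not a result of this paper, and it arises from the parameterization, not from data symmetry. The symbols $W_1$, $W_2$, $G$ and $L$ are introduced here for illustration only.

```latex
\begin{aligned}
f(x) &= W_2 W_1 x, \qquad G = \frac{\partial L}{\partial (W_2 W_1)},\\
\dot W_1 &= -W_2^\top G, \qquad \dot W_2 = -G W_1^\top,\\
\frac{d}{dt}\left(W_1 W_1^\top - W_2^\top W_2\right)
  &= -\left(W_2^\top G W_1^\top + W_1 G^\top W_2\right)
     + \left(W_1 G^\top W_2 + W_2^\top G W_1^\top\right) = 0 .
\end{aligned}
```

The matrix $W_1 W_1^\top - W_2^\top W_2$ is therefore constant along the training trajectory for any differentiable loss. The abstract asks whether symmetries of the data add further invariants beyond ones of this kind.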

2606.10912 2026-06-10 cs.SD cs.AI cs.CR cs.LG 新提交

What Do Deepfake Speech Detectors Actually Hear?

深度伪造语音检测器实际上听到了什么?

Vojtěch Staněk, Veronika Jirmusová, Anton Firc, Kamil Malinka, Jakub Reš, Martin Perešíni

发表机构 * Brno University of Technology(布尔诺理工大学)

AI总结 提出基于自监督表示和积分梯度的可解释性方法,分析三种WavLM检测器在ASVspoof5上的决策线索,发现它们分别依赖环境噪声、音素伪影和词边界。

Comments Accepted to Interspeech 2026

详情
AI中文摘要

深度伪造语音检测器通常输出一个分数,而不解释为什么音频样本被标记、证据在信号中的位置或哪些线索驱动了决策。我们提出了一种音频原生的可解释性管道,使用时间对齐的自监督表示上的积分梯度来随时间定位决策证据。我们将所提出的方法应用于ASVspoof5上的三个基于WavLM的检测器(AASIST、CA-MHFA、SLS),并手动注释最高归因区域以提供最重要线索的语义含义。尽管性能相似,检测器依赖不同的线索:AASIST强调非语音/环境线索,CA-MHFA关注局部音素伪影,SLS依赖词边界和频谱完整性。我们超越推测性推理,通过因果遮蔽主要检测器线索来验证我们的发现。观察到的性能下降进一步支持了解释的检测器语义。

英文摘要

Deepfake speech detectors often output a single score without explaining why an audio sample is flagged, where in the signal the evidence lies, or what cues drive the decision. We propose an audio-native explainability pipeline using Integrated Gradients on time-aligned self-supervised representations to localize decision evidence over time. We apply the proposed method to three WavLM-based detectors (AASIST, CA-MHFA, SLS) on ASVspoof 5 and manually annotate the highest-attribution regions to provide a semantic meaning of the most important cues. Despite similar performance, the detectors rely on different cues: AASIST emphasizes non-speech/environment cues, CA-MHFA focuses on localized phoneme artifacts, and SLS relies on word boundaries and spectral integrity. We move beyond speculative reasoning and validate our findings by causal masking of the primary detector cues. Observed performance degradation further supports the explained detector semantics.
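Integrated Gradients, the attribution method named above, can be sketched in a few lines. The example below uses a toy quadratic score with an analytic gradient in place of a WavLM-based detector; `grad_fn`, the all-zero baseline, and the midpoint integration rule are assumptions made for illustration.

```python
import numpy as np

def integrated_gradients(grad_fn, x, baseline, steps=50):
    """Riemann (midpoint) approximation of Integrated Gradients.

    grad_fn(z) must return the gradient of the model score with respect to z.
    """
    alphas = (np.arange(steps) + 0.5) / steps          # midpoints of [0, 1]
    total = np.zeros_like(x, dtype=float)
    for a in alphas:
        total += grad_fn(baseline + a * (x - baseline))
    return (x - baseline) * total / steps

# Toy score f(x) = sum(w * x^2), whose gradient is 2 * w * x.
w = np.array([1.0, 2.0, 3.0])
x = np.array([1.0, -1.0, 2.0])
baseline = np.zeros(3)
attr = integrated_gradients(lambda z: 2.0 * w * z, x, baseline)
score_change = np.sum(w * x**2) - np.sum(w * baseline**2)
```

The attributions sum to the change in score between the baseline and the input (the completeness property), which is what makes per-frame attributions comparable across a recording.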

2606.10911 2026-06-10 cs.SD cs.AI cs.CR cs.LG 新提交

Ethical and Technical Limits of Deepfake Speech Datasets

深度伪造语音数据集的伦理与技术限制

Vojtěch Staněk, Eva Trnovská, Kamil Malinka, Anton Firc

发表机构 * Brno University of Technology(布尔诺理工大学)

AI总结 通过审计39个深度伪造语音数据集,发现公平性评估因缺乏人口统计元数据而不可行,且数据集间真实语音源语料库重叠严重,影响跨数据集评估的可靠性。

Comments Accepted to Interspeech 2026

详情
AI中文摘要

关于深度伪造语音检测器的鲁棒性和公平性的声明,其可信度仅与用于训练和评估这些系统的数据集相当。我们对深度伪造语音领域进行了数据集级别的审计。我们整理并分析了39个深度伪造语音数据集,检查了关键属性,包括可访问性、文档、人口统计和语言覆盖范围、数据集规模以及底层的真实语音来源。我们的审计揭示了两个重要的发现。首先,公平性评估在很大程度上不可行,因为大多数数据集缺乏人口统计元数据,只有少数包含性别或语言标签。这阻止了任何有意义的子组分析,并使得其他人口统计属性未被处理。其次,我们识别出不同数据集之间底层的真实语音源语料库存在大量重叠,这可能破坏跨数据集评估,并导致对泛化能力的夸大声称。

英文摘要

Claims about the robustness and fairness of deepfake speech detectors are only as credible as the datasets used to train and evaluate those systems. We present a dataset-level audit of the deepfake speech landscape. We compile and analyze 39 deepfake speech datasets, examining key attributes including accessibility, documentation, demographic and language coverage, dataset scale, and the underlying bona fide speech sources. Our audit reveals two important takeaways. Firstly, fairness assessment is largely infeasible because most datasets lack demographic metadata, and only a few contain gender or language labels. This prevents any meaningful subgroup analysis and leaves other demographic attributes unaddressed. Secondly, we identify substantial overlap in underlying bona fide source corpora across datasets, which can undermine cross-dataset evaluation and lead to overstated generalization claims.

2606.10908 2026-06-10 cs.SD cs.AI cs.CR cs.LG 新提交

RAT: Reference-Augmented Training for ASV Anti-Spoofing

RAT:面向ASV反欺骗的参考增强训练

Vojtěch Staněk, Anton Firc, Jakub Reš, Kamil Malinka

发表机构 * Brno University of Technology(布尔诺理工大学)

AI总结 提出一种基于说话人参考录音的欺骗对抗架构,发现训练时引入参考通道可提升深度伪造检测性能,即使推理时参考缺失或失配。基于此提出参考增强训练(RAT)策略,在ASVspoof 5基准上以单个检测器达到2.57% EER和0.074 minDCF,超越大型集成系统。

Comments Accepted to Interspeech 2026

详情
AI中文摘要

我们引入了一种以说话人参考录音为条件的欺骗对抗架构,但观察到它收敛到一种在推理时有效忽略参考的解决方案。令人惊讶的是,使用参考通道进行训练会诱导出不变性,从而改进深度伪造检测,即使在推理时参考缺失或失配。基于这一观察,我们提出了一种参考增强训练(RAT)策略。与单话语基线相比,RAT产生了改进的检测性能,即使在推理时将参考录音替换为零向量时也是如此。通过严格分析,我们证明优化过程迅速减少了参考贡献,导致推理很大程度上独立于参考通道。使用RAT,我们在ASVspoof 5基准上以单个检测器实现了最先进的2.57%等错误率和0.074最小检测代价函数,甚至超越了大型集成系统。

英文摘要

We introduce a spoofing countermeasure architecture conditioned on speaker-reference recordings, but observe that it converges to a solution that effectively ignores the reference during inference. Surprisingly, training with a reference channel induces invariance that improves deepfake detection, even when the reference is absent or mismatched during inference. Based on this observation, we propose a Reference-Augmented Training (RAT) strategy. RAT yields improved detection performance compared to single-utterance baselines, even when the reference recording is replaced with a zero vector at inference. Through rigorous analysis, we demonstrate that the optimization process rapidly diminishes the reference contributions, leading to inference largely independent of the reference channel. Using RAT, we achieve state-of-the-art 2.57% EER and 0.074 minDCF on the ASVspoof 5 benchmark with a single detector, surpassing even large ensemble systems.
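The equal error rate (EER) reported above is the operating point where the false acceptance and false rejection rates coincide. A minimal sketch follows, assuming higher scores mean "more bona fide" and using a simple threshold sweep without interpolation; official ASVspoof scoring tools may differ in such details.

```python
import numpy as np

def equal_error_rate(bonafide_scores, spoof_scores):
    """Approximate EER by sweeping thresholds over all observed scores."""
    thresholds = np.sort(np.concatenate([bonafide_scores, spoof_scores]))
    best_gap, eer = np.inf, None
    for t in thresholds:
        far = np.mean(spoof_scores >= t)      # spoofed trials accepted
        frr = np.mean(bonafide_scores < t)    # bona fide trials rejected
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return float(eer)

separable = equal_error_rate(np.array([0.8, 0.9]), np.array([0.1, 0.2]))
overlapping = equal_error_rate(np.array([0.9, 0.4]), np.array([0.6, 0.1]))
```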

2606.10905 2026-06-10 cs.CV 新提交

Beyond Model Size: Probing the Gaps in Visual in-Context Learning by Training a Tiny Model

超越模型规模:通过训练小模型探究视觉上下文学习中的差距

Sunil Khatri, Steven Landgraf, Markus Ulrich, Simon Reiß

发表机构 * Karlsruhe Institute of Technology(卡尔斯鲁厄理工学院)

AI总结 通过训练仅1百万参数的小模型,挑战大规模视觉上下文学习模型,揭示任务编码、预训练任务和评估指标方面的适应性能力测量差距。

详情
AI中文摘要

视觉上下文学习(VICL)旨在推动自适应视觉模型的发展,使其能够基于少量示例在测试时适应新任务。受自然语言处理研究中上下文学习历史的影响,当前VICL方法通常采用大规模模型和数据扩展作为关键要素。然而,这些要素是否是视觉模型形成上下文学习能力的关键尚不清楚。为了对这类大模型进行压力测试,我们用一个极端的反例挑战它们:我们训练了一个仅含1百万参数和7万张图像的微小视觉上下文模型。我们将这个容量严重受限的小模型与7000倍大的VICL模型在不同自适应设置下进行比较:(1)具有小分布偏移的图像数据,(2)未见过的任务编码,以及(3)全新的任务,即VICL所设想的场景。由于小模型和大模型之间训练资源的巨大差距,我们的实验展示了在任务编码方式、预训练中使用的任务以及评估指标选择方面,自适应能力测量存在的不足。当前VICL基准测试中的这些差距凸显了在自适应能力评估方面进行创新的必要性。

英文摘要

Visual in-Context Learning (VICL) aims to make progress towards adaptive vision models that can, based on a few examples, adapt to a new task at test time. Following the history of in-context learning in natural language processing research, where large, parameter-heavy models are in use, one pathway that current VICL methods take is to treat model and data scaling as key ingredients. Yet it is not clear whether these ingredients are the key for in-context learning to take shape in vision models. To stress-test such large models, we challenge them with an extreme counterexample: we train a tiny visual in-context model with merely 1 million parameters on a modest 70,000 images. We compare the results of this severely capacity-capped tiny model to 7,000× larger VICL models in different adaptive settings: (1) on image data with small distribution shifts, (2) on unseen task encodings, and (3) on a completely new task, i.e., the setting VICL envisions. Given the chasm in training resources between the tiny and large models, our experiments expose shortcomings in how adaptive capabilities are measured, with respect to how tasks are encoded, which tasks were used in pre-training, and the choice of metrics. These gaps in current VICL benchmarking underscore a need for innovation in the evaluation of adaptive capabilities.

2606.10903 2026-06-10 cs.RO 新提交

AgniNav: Configuration-Driven Cross-Embodiment Local Planning for Robot Navigation

AgniNav:配置驱动的跨具身局部规划机器人导航

Tianhao Zang, Siwei Cheng, Haidong Huang, Shanze Wang, Wei Zhang

发表机构 * Eastern Institute of Technology, Ningbo, China(东方理工(宁波)) University of Nottingham, Nottingham, UK(诺丁汉大学) University of Science and Technology of China, Hefei, China(中国科学技术大学)

AI总结 提出AgniNav框架,通过可配置的四参数安全包络实现单目视觉导航在轮式、四足和人形机器人间的零重训练迁移,联合调节感知与规划。

详情
AI中文摘要

单目局部导航对轻量级机器人具有吸引力,但现有的基于视觉的策略通常将感知耦合到特定机体、相机高度和足迹,使得从轮式底盘到腿式平台的迁移依赖于重新训练或主动深度硬件。本文介绍了AgniNav,一个配置驱动的局部导航框架,在碰撞包络层面标准化跨具身迁移。每个机器人由一个可测量的四参数安全包络指定:碰撞相关高度、前长、后长和半宽。高度参数条件化一个图像到扫描网络,从单目彩色图像预测一维、碰撞相关的伪激光扫描,而剩余的足迹参数配置一个维度感知的局部规划器用于碰撞检测。训练使用从配对的彩色-深度数据生成的高度条件化列最小扫描标签,允许同一图像监督不同的安全包络,无需收集机器人特定数据。据我们所知,AgniNav是第一个单目局部导航框架,它联合调节感知和规划于共享的碰撞包络配置,实现跨轮式、四足和人形平台的零重训练部署。在Turtlebot2、Unitree Go2和Accelerated Evolution K1上的真实机器人实验分别实现了39/40、18/20和18/20的成功率,碰撞次数分别为0/40、1/20和2/20,同时在Jetson Orin上以30 Hz运行。

英文摘要

Monocular local navigation is attractive for lightweight robots, but existing vision-based policies often couple perception to a specific body, camera height, and footprint, making transfer from wheeled bases to legged platforms dependent on retraining or active depth hardware. This paper introduces AgniNav, a configuration-driven local navigation framework that standardizes cross-embodiment transfer at the collision-envelope level. Each robot is specified by a measurable four-parameter safety envelope: collision-relevant height, front length, rear length, and half width. The height parameter conditions an image-to-scan network to predict a one-dimensional, collision-relevant pseudo-laserscan from a monocular color image, while the remaining footprint parameters configure a dimension-aware local planner for collision checking. Training uses height-conditioned column-minimum scan labels generated from paired color-depth data, allowing the same image to supervise different safety envelopes without collecting robot-specific data. To the best of our knowledge, AgniNav is the first monocular local-navigation framework that jointly conditions perception and planning on a shared collision-envelope configuration for zero-retraining deployment across wheeled, quadruped, and humanoid platforms. Real-robot experiments on a Turtlebot2, Unitree Go2, and Accelerated Evolution K1 achieve 39/40, 18/20, and 18/20 successes with 0/40, 1/20, and 2/20 collisions, respectively, while running at 30 Hz on Jetson Orin.

2606.10899 2026-06-10 cs.RO 新提交

MV-Actor: Aligning Multi-View Semantics and Spatial Awareness for Bimanual Manipulation

MV-Actor:对齐多视角语义与空间感知以实现双臂操作

Yinchen Tian, Huan Li, Muyao Peng, Xi Wang, Yan Wang, You Yang

发表机构 * School of Electronic Information and Communications, Huazhong University of Science and Technology(华中科技大学电子信息与通信学院) Institute for AI Industry Research (AIR), Tsinghua University(清华大学智能产业研究院) AIR Wuxi Innovation Center, Tsinghua University(清华大学智能产业研究院无锡创新中心)

AI总结 提出MV-Actor框架,通过多视角语义交互和语义-空间令牌交互统一语义与空间表示,并利用引导度量深度修复模块处理深度噪声,在PerAct2基准上达到87.8%平均成功率。

Comments 14 pages, 9 figures

详情
AI中文摘要

机器人操作已广泛应用于工业场景。与单臂操作相比,双臂操作配备多个摄像头以从不同视角捕获信息。然而,现有的多视角策略独立编码每个视角或浅层融合视角特征,导致语义感知共享有限且空间感知不可靠。本文提出MV-Actor,一种为双臂操作构建统一语义-空间表示的多视角感知框架。首先,MV-Actor执行多视角语义交互以跨视角共享语义感知。然后,它使用语义-空间令牌交互将视觉语义与前馈重建模型特征对齐,并获取可靠的空间感知。最后,引导度量深度修复模块在消费级深度噪声下细化退化的传感器深度,以提供更可靠的度量锚点。在PerAct2双臂基准上进行的仿真实验中,MV-Actor达到了87.8%的最先进平均成功率。在视角变化更频繁且消费级深度不稳定的真实世界评估中,MV-Actor优于RGB和RGB-D基线,进一步证明了共享语义感知和可靠空间感知对双臂操作的好处。

英文摘要

Robotic manipulation has been widely applied in industrial scenarios. Compared with single-arm manipulation, bimanual manipulation systems are equipped with multiple cameras to capture information from different viewpoints. However, existing multi-view policies encode each view independently or fuse view features shallowly, resulting in limited sharing of semantic perception and unreliable spatial awareness. In this paper, we propose MV-Actor, a multi-view perception framework that builds a unified semantic-spatial representation for bimanual manipulation. First, MV-Actor performs Multi-view Semantic Interaction to share semantic perception across views. Then it uses Semantic-Spatial Token Interaction to ground visual semantics with feed-forward reconstruction model features and acquire reliable spatial awareness. Finally, a Guided Metric Depth Repair module refines degraded sensor depth to provide more reliable metric anchors under consumer-grade depth noise. In simulation experiments conducted on the PerAct2 bimanual benchmark, MV-Actor achieves a state-of-the-art average success rate of 87.8%. In real-world evaluations with more frequent viewpoint changes and unstable consumer-grade depth, MV-Actor outperforms both RGB and RGB-D baselines, further demonstrating the benefit of shared semantic perception and reliable spatial awareness for bimanual manipulation.

2606.10896 2026-06-10 cs.LG cs.DB cs.IR cs.PF 新提交

Flash-GMM: A Memory-Efficient Kernel for Scalable Soft Clustering

Flash-GMM:一种用于可扩展软聚类的内存高效内核

Gal Bloch, Ariel Gera, Matan Orbach, Ohad Eytan, Assaf Toledo

发表机构 * IBM Research(IBM研究院)

AI总结 提出Flash-GMM融合Triton内核,在单GPU通次中高效计算高斯混合模型,通过避免实例化完整责任矩阵,实现20倍加速并支持比先前大100倍的数据集训练,集成到IVF粗量化器中提升ANN搜索性能。

详情
AI中文摘要

我们提出了 Flash-GMM,一个融合的 Triton 内核,用于在单 GPU 通次中高效计算大规模数据上的高斯混合模型(GMM)。通过消除在 GPU 内存中实例化完整责任矩阵的需求,Flash-GMM 实现了比现有实现 20× 的加速,并支持在单个设备上训练比以前可行数据集大 100× 以上的数据集。为了展示其影响,我们将 Flash-GMM 集成到 IVF 粗量化器中用于近似最近邻(ANN)搜索。我们表明,软 GMM 聚类现在可以作为 k-means 的可行即插即用替代方案,并且可以利用 GMM 责任将边界向量分配到多个聚类。我们的方法在达到固定召回目标时,距离计算次数减少多达 1.7×,或者在相同计算成本下,召回率@10 提高 +2 至 +12。我们将该内核作为开源项目发布。

英文摘要

We present Flash-GMM, a fused Triton kernel for efficient computation of Gaussian Mixture Models (GMMs) over large-scale data in a single GPU pass. By eliminating the need to materialize the full responsibility matrix in GPU memory, Flash-GMM achieves a 20× speedup over existing implementations and enables training on datasets more than 100× larger than previously feasible on one device. To demonstrate its impact, we integrate Flash-GMM into the IVF coarse quantizer for approximate nearest-neighbor (ANN) search. We show that soft GMM clustering is now a viable drop-in replacement for k-means, and that GMM responsibilities can be leveraged to assign border vectors to multiple clusters. Our approach reaches fixed recall targets with up to 1.7× fewer distance computations, or equivalently, yields +2 to +12 recall@10 at matched computational cost. We release the kernel as an open-source project.
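The memory saving comes from never holding the full N×K responsibility matrix at once. A rough NumPy sketch of that idea follows. It processes the data in chunks and keeps only per-component sufficient statistics. It is not the Triton kernel, and the diagonal covariance, the chunk size, and the name `em_step_chunked` are assumptions for illustration.

```python
import numpy as np

def em_step_chunked(X, weights, means, variances, chunk=1024):
    """One EM iteration for a diagonal-covariance GMM, accumulating statistics per chunk."""
    K, D = means.shape
    Nk = np.zeros(K)            # soft counts per component
    S1 = np.zeros((K, D))       # responsibility-weighted sum of x
    S2 = np.zeros((K, D))       # responsibility-weighted sum of x^2
    log_norm = np.log(weights) - 0.5 * np.sum(np.log(2 * np.pi * variances), axis=1)
    for start in range(0, len(X), chunk):
        x = X[start:start + chunk]                                      # (B, D)
        diff2 = (x[:, None, :] - means[None]) ** 2 / variances[None]    # (B, K, D)
        logp = log_norm[None] - 0.5 * diff2.sum(axis=2)                 # (B, K)
        logp -= logp.max(axis=1, keepdims=True)                         # numerical stability
        r = np.exp(logp)
        r /= r.sum(axis=1, keepdims=True)     # responsibilities for this chunk only
        Nk += r.sum(axis=0)
        S1 += r.T @ x
        S2 += r.T @ (x ** 2)
    new_weights = Nk / len(X)
    new_means = S1 / Nk[:, None]
    new_vars = np.maximum(S2 / Nk[:, None] - new_means ** 2, 1e-6)
    return new_weights, new_means, new_vars

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
w0, mu0, var0 = np.full(4, 0.25), X[:4].copy(), np.ones((4, 3))
chunked = em_step_chunked(X, w0, mu0, var0, chunk=32)
whole = em_step_chunked(X, w0, mu0, var0, chunk=200)
```

Peak memory for responsibilities is proportional to `chunk * K` instead of `N * K`, and the result matches the single-chunk computation up to floating-point summation order.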

2606.10894 2026-06-10 cs.CV 新提交

The 1st PortraitCraft Challenge: A CVPR 2026 Workshop Competition on Portrait Composition Understanding and Generation

首届PortraitCraft挑战赛:CVPR 2026研讨会肖像构图理解与生成竞赛

Zijie Lou, Youyun Tang, Xiaochao Qu, Haoxiang Li, Ting Liu, Luoqi Liu, Xun Zhu, Zheng Zhang, Xi Chen, Miao Li, Ji Wu, Dizhe Zhang, Xian Ge, Sujia Wang, Ruiyang Zhang, Jiaming Wang, Xianshun Wang, Lu Qi, Boao Kang, Wei Zhou, Jinghui Sun, Zhenyu Yan, Jiliang Zhao, Rui Yang, Yipo Huang, Boyuan Liu, Shanglin Li, Zifan Xie, Yichen Zhang, Anlan Wang, Wenfeng Lin, Mingyu Guo, Dong Li, Xinghao Wang, Yanting Li, Shanzhao Tong, Shuai He, Qiu Zhou, Yongqi Yang, Taoyang Mu, Dianqiao Lei, Anlong Ming, Huadong Ma

发表机构 * CVPR 2026

AI总结 提出PortraitCraft挑战赛,包含构图理解与生成两个赛道,并发布约5万张肖像数据集,推动肖像美学与可控图像生成研究。

详情
AI中文摘要

本文介绍了首届PortraitCraft挑战赛的概况,该挑战赛是CVPR 2026的官方竞赛之一。挑战赛聚焦于肖像构图理解与生成,旨在推动AI在肖像美学分析和可控图像合成方面的研究。与主要关注全局美学评分的现有数据集和任务不同,PortraitCraft引入了一个统一的评估框架,包含两个互补赛道。赛道1要求模型进行结构化肖像构图理解,赛道2要求模型在显式构图约束下从结构化构图描述生成肖像图像。为支持该挑战赛,我们构建并公开发布了一个大规模肖像构图数据集,包含约50,000张精心策划的真实肖像图像,提供多级监督。本报告描述了挑战赛设置、评估协议、数据集组成和最终结果,并分析了提交方案的技术特点。PortraitCraft挑战赛为肖像构图理解与生成研究提供了一个标准化和可复现的平台,有望推动肖像美学和可控图像生成领域的进一步发展。

英文摘要

This paper presents an overview of the inaugural PortraitCraft Challenge, held as one of the official competitions at CVPR 2026. The challenge focuses on portrait composition understanding and generation, aiming to advance AI research in portrait aesthetics analysis and controllable image synthesis. Unlike existing datasets and tasks that primarily focus on global aesthetic scoring, PortraitCraft introduces a unified evaluation framework comprising two complementary tracks. Track 1 requires models to perform structured portrait composition understanding, and Track 2 requires models to generate portrait images from structured composition descriptions under explicit compositional constraints. To support the challenge, we constructed and publicly released a large-scale portrait composition dataset consisting of approximately 50,000 curated real portrait images, providing multi-level supervision. This report describes the challenge setup, evaluation protocols, dataset composition, and final results, along with an analysis of the technical characteristics of the submitted solutions. The PortraitCraft Challenge provides a standardized and reproducible platform for research on portrait composition understanding and generation, and is expected to foster further progress in the fields of portrait aesthetics and controllable image generation.

2606.10890 2026-06-10 cs.LG cs.AI 新提交

Optimal Post-Training Quantization Scales and Where to Find Them

最优后训练量化尺度及其寻找方法

Juan Amboage, Pablo Monteagudo-Lago, Ian Colbert, Giuseppe Franco, Nicholas Fraser

发表机构 * AMD

AI总结 提出PiSO算法,利用校准数据精确高效地计算逐通道最优量化尺度,并扩展到分组量化,在Llama和Qwen模型上显著提升困惑度和零样本准确率。

详情
AI中文摘要

后训练量化(PTQ)通过将权重映射到低比特表示来压缩大型语言模型。定义量化网格的缩放因子通常使用简单的、无数据的启发式方法选择。在这项工作中,我们提出了PiSO(分段尺度优化),一种利用校准数据在最近舍入量化下精确且高效地计算最优逐通道权重尺度的算法。PiSO将尺度搜索空间划分为有限个区间,在这些区间上目标函数具有闭式最小值。我们通过原则性启发式方法将PiSO扩展到分组量化,并提出了将尺度优化与纠错交错的有效策略。在Llama和Qwen模型上,跨多个模型大小和目标权重位宽的实验表明,在困惑度和下游零样本准确率上均有持续改进,无论是单独使用还是与纠错结合。特别地,我们观察到随着目标位宽变窄、量化变得更加困难,收益增加。

英文摘要

Post-training quantization (PTQ) compresses large language models by mapping weights to low-bit representations. The scaling factor that defines the quantization grid is typically chosen using simple, data-free heuristics. In this work, we present PiSO (Piecewise Scale Optimization), an algorithm that leverages calibration data to compute the optimal channel-wise weight scales exactly and efficiently under round-to-nearest quantization. PiSO partitions the scale search space into finitely many intervals on which the objective admits a closed-form minimizer. We extend PiSO to group-wise quantization via principled heuristics and propose effective strategies for interleaving scale optimization with error correction. Experiments on Llama and Qwen models across multiple model sizes and target weight bit-widths demonstrate consistent improvements in perplexity and downstream zero-shot accuracy, both standalone and combined with error correction. In particular, we observe increased benefits as the target bit-width narrows and quantization becomes more challenging.
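The interval-partition idea can be illustrated for a single output channel. Under round-to-nearest, the integer codes change only where |w_i|/s crosses a half-integer, so between consecutive breakpoints the codes are fixed and the output error is quadratic in the scale s. The sketch below is a simplified reading of that description, not PiSO itself. The objective ||Xw - s·Xq||², the symmetric clipping range, the naive handling of interval endpoints, and the function names are all assumptions, and w must contain at least one nonzero entry.

```python
import numpy as np

def quantize(w, s, qmax):
    """Round-to-nearest integer codes, clipped to [-qmax, qmax]."""
    return np.clip(np.round(w / s), -qmax, qmax)

def output_error(X, w, s, qmax):
    """Squared error between full-precision and quantized channel outputs."""
    return float(np.sum((X @ w - s * (X @ quantize(w, s, qmax))) ** 2))

def search_scale(X, w, qmax):
    # Rounding decisions flip where |w_i| / s = k + 0.5, i.e. s = |w_i| / (k + 0.5).
    k = np.arange(qmax) + 0.5
    bps = np.unique(np.abs(w)[:, None] / k[None, :])
    bps = bps[bps > 0]
    edges = np.concatenate([[1e-12], bps])
    y = X @ w
    best_s = np.abs(w).max() / qmax            # absmax heuristic as a fallback candidate
    best_err = output_error(X, w, best_s, qmax)
    for lo, hi in zip(edges[:-1], edges[1:]):
        q = quantize(w, 0.5 * (lo + hi), qmax)  # codes are constant inside (lo, hi)
        z = X @ q
        denom = z @ z
        if denom == 0:
            continue
        s = float(np.clip((y @ z) / denom, lo, hi))  # closed-form minimizer, kept in the interval
        err = output_error(X, w, s, qmax)
        if err < best_err:
            best_s, best_err = s, err
    return best_s, best_err

# Data-free toy case: X is the identity, so the error is measured on the weights themselves.
X = np.eye(4)
w = np.array([1.0, 1.0, 1.0, 2.6])
s_opt, err_opt = search_scale(X, w, qmax=2)
err_absmax = output_error(X, w, 2.6 / 2, 2)
```

With real calibration activations in `X`, the same search weighs each rounding decision by its effect on the layer output instead of on the raw weights.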

2606.10887 2026-06-10 cs.CV 新提交

Listen, Look, and Learn: Learning Without Forgetting through SAM-Audio

听、看、学:通过SAM-Audio实现无遗忘学习

Avi Gupta, Nilotpal Sinha, Vishnu Raj, Sambuddha Saha, Pratik Joshi, Koteswar Rao Jerripothula, Tammam Tillo

发表机构 * University of Washington(华盛顿大学)

AI总结 提出一种利用SAM-Audio多模态先验的类增量学习方法,通过引导注意力机制和双层蒸馏策略,在音频-视觉场景中缓解灾难性遗忘,性能优于现有方法。

详情
AI中文摘要

类增量学习(CIL)旨在持续学习新类别而不遗忘先前获取的知识。尽管最近的CIL进展在各种模态中引起了显著兴趣,但音频-视觉设置仍未被充分探索。此外,尽管像SAM-Audio这样的基础多模态模型封装了丰富的静态先验,我们的实证分析表明,这些表示在增量设置中表现不佳。本文通过将SAM-Audio的音频-视觉先验整合到CIL设置中来弥合这一差距。具体来说,我们利用其密集的音频和视觉表示,并采用一种新颖的引导注意力策略,其中音频特征在上下文中引导视觉表示。为了进一步缓解灾难性遗忘,我们在特征和logit级别引入了双层蒸馏目标。在音频-视觉CIL基准上的广泛评估表明,我们的方法始终优于最先进的方法。

英文摘要

Class-Incremental Learning (CIL) aims to continuously learn new classes without forgetting previously acquired knowledge. While recent CIL advances have spurred significant interest across various modalities, the audio-visual setting remains underexplored. Furthermore, although foundational multimodal models like SAM-Audio encapsulate rich static priors, our empirical analysis reveals that these representations struggle in incremental settings. This work bridges this gap by integrating SAM-Audio's audio-visual priors into the CIL setting. Specifically, we leverage its dense audio and visual representations and employ a novel guided attention strategy where the audio features contextually guide the visual representations. To further mitigate catastrophic forgetting, we introduce dual-level distillation objectives at both the feature and logit levels. Extensive evaluations on audio-visual CIL benchmarks demonstrate that our approach consistently outperforms state-of-the-art methods.

2606.10877 2026-06-10 cs.LG cs.CV 新提交

XtrAIn: Training-Guided Occlusion for Feature Attribution

XtrAIn:训练引导的遮挡特征归因

Thodoris Lymperopoulos, Ioannis Kakogeorgiou, Denia Kanellopoulou

发表机构 * NCSR Demokritos(希腊国家科学研究中心德谟克利特)

AI总结 提出XtrAIn方法,将遮挡操作从输入空间转移到参数空间,通过跟踪模型训练轨迹测量特征相关参数更新对输出logits的影响,解决传统遮挡归因中的偏差和不稳定性问题。

Comments 12 pages, 7 figures, 1 table

详情
AI中文摘要

基于遮挡的归因方法通过扰动输入特征并测量模型输出的变化来估计特征重要性,提供了一种直观的方式。然而,其可靠性受到特征移除实现方式的强烈影响:外部选择的基线可能引入偏差、分布外样本和不稳定的解释,而在非线性模型中,遮挡一组特征也可能改变非遮挡特征的贡献。我们将这种效应称为归因偏移,因为非遮挡特征的归因分数偏离其初始值。为了解决这些导致解释不稳定的主要问题,我们引入了XtrAIn,一种训练引导的归因方法,将遮挡操作从输入空间转移到参数空间。XtrAIn不用手工设计的基线替换输入值,而是遵循模型的训练轨迹,测量特征相关参数更新如何影响输出logits。我们进一步引入了Xstep,一种轻量级近似方法以降低计算成本,以及XtrAIn+,一种目标聚焦变体,强调与目标类别一致的更新。在受控图像数据集和PAM50乳腺癌亚型分类上的实验表明,所提出的方法比标准归因基线产生更清晰、更可解释的归因模式。总体而言,XtrAIn提供了对特征归因的训练感知视角,并为研究训练过程中特征级证据的形成提供了有用的诊断工具。

英文摘要

Occlusion-based attribution methods provide an intuitive way to estimate feature importance by perturbing input features and measuring the resulting change in model output. However, their reliability is strongly affected by how feature removal is implemented: externally selected baselines can introduce bias, out-of-distribution samples, and unstable explanations, while in nonlinear models the occlusion of a set of features can also alter the contribution of non-occluded features. We refer to this effect as attribution shift, as the attribution scores of the non-occluded features drift from their initial values. To address these issues, which render explanations unstable, we introduce XtrAIn, a training-guided attribution method that transfers the occlusion operation from the input space to the parameter space. Instead of replacing input values with hand-crafted baselines, XtrAIn follows the model's training trajectory and measures how feature-associated parameter updates affect the output logits. We further introduce Xstep, a lightweight approximation for reducing computational cost, and XtrAIn+, a target-focused variant that emphasizes updates aligned with the target class. Experiments on controlled image datasets and PAM50 breast-cancer subtype classification show that the proposed methods produce cleaner and more interpretable attribution patterns than standard attribution baselines. Overall, XtrAIn provides a training-aware perspective on feature attribution and offers a useful diagnostic tool for studying how feature-level evidence is formed during training.