2510.17633 2026-06-15 cs.SD cs.CR Version update

Efficient Online 3D Multi-Camera Multi-Object Tracking and Pose Estimation

Linh Van Ma, Tran Thien Dat Nguyen, Juhua Hu, Wei Cheng, Moongu Jeon

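Affiliations: School of Electrical Engineering and Computer Science; School of Electrical Engineering, Computing and Mathematical Sciences; School of Engineering and Technology

AI summary: Proposes a fast online method for 3D multi-object tracking and pose estimation with multiple monocular cameras. It needs only 2D detections and no 3D training data, achieves efficient and accurate tracking through Bayes-optimal filtering, and supports camera-disconnection scenarios.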

2604.14892 2026-06-15 cs.LG cs.AI Version update

Can LLMs Accurately Score Medical Diagnoses and Clinical Reasoning?

Amy Rouillard, Sitwala Mundia, Linda Camara, Ziyaad Dangor, Michael Cameron Gramanie, Ismail Kalla, Shabir A. Madhi, Kajal Morar, Marlvin T. Ncube, Haroon Saloojee, Bruce A. Bassett

Affiliations: Wits MIND Institute, University of the Witwatersrand, Johannesburg, South Africa; Grai Labs, Cape Town, South Africa; South African Medical Research Council Vaccines and Infectious Diseases Analytics Research Unit, Faculty of Health Sciences, University of the Witwatersrand, Johannesburg, South Africa; Department of Internal Medicine, Charlotte Maxeke Johannesburg Academic Hospital, and Faculty of Health Sciences, University of the Witwatersrand, Johannesburg, South Africa; Department of Paediatrics and Child Health, Faculty of Health Sciences, University of the Witwatersrand, Johannesburg, South Africa

AI summary: The study uses an LLM jury to score 3334 diagnoses for 300 hospital cases from low- and middle-income countries. It finds that calibrated LLM scores agree closely with expert scores and carry a lower risk of severe errors, so they can serve as a reliable proxy for expert evaluation.

Abstract

Evaluating medical AI systems using expert clinician panels is costly and slow, motivating the use of large language models (LLMs) as alternative adjudicators. Here, we evaluate an LLM Jury, composed of three frontier AI models, for scoring 3334 diagnoses on 300 real-world low- and middle-income country (LMIC) hospital cases. Both LLM- and clinician-generated diagnoses are scored against expert panel diagnoses across four dimensions: diagnosis, differential diagnosis, clinical reasoning, and negative treatment risk. The LLM Jury scores are compared with expert and independent re-scoring panel scores to assess error metrics, inter-rater agreement, severe-risk errors, and the effect of post hoc calibration using isotonic regression. In our data, we find that: (i) the uncalibrated LLM Jury scores preserve ordinal agreement with the expert clinician panel scores, but are systematically lower; (ii) the probability of severe-risk errors is lower for the LLM Jury than for the human expert re-score panels; (iii) the LLM Jury combined with LLM diagnoses can be used to identify diagnoses at high risk of error, enabling targeted expert review and improved panel efficiency; (iv) the calibrated LLM Jury scores and rankings of diagnosing agents show excellent agreement with those of the primary expert panels; (v) LLM Jury models show no self-preference bias: they did not score diagnoses generated by their own underlying model or models from the same vendor more (or less) favourably than those generated by other models. Together, these results provide evidence that a calibrated LLM Jury is a trustworthy and reliable proxy for expert clinician evaluation in medical AI benchmarking. Confirming these findings in other clinical settings is an important direction for future work.
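
The post hoc calibration step mentioned in the abstract can be illustrated with a minimal sketch of isotonic regression using the pool-adjacent-violators algorithm. The function names `isotonic_fit` and `isotonic_predict`, and the idea of fitting jury scores against expert scores on a held-out set, are illustrative assumptions; the abstract states only that isotonic regression was used, not how it was implemented.

```python
from collections import defaultdict


def isotonic_fit(raw_scores, target_scores):
    """Fit a non-decreasing step function mapping raw scores to target scores.

    Hypothetical helper: pool-adjacent-violators on per-score averages.
    Returns a list of (right_edge, fitted_value) pairs sorted by right_edge.
    """
    # Pool tied raw scores first: keep the sum and count of their targets.
    pooled = defaultdict(lambda: [0.0, 0])
    for r, t in zip(raw_scores, target_scores):
        pooled[r][0] += t
        pooled[r][1] += 1

    blocks = []  # each block is [sum of targets, count, right-most raw score]
    for r in sorted(pooled):
        total, count = pooled[r]
        blocks.append([total, count, r])
        # Merge backwards while the previous block's mean exceeds the last one's.
        while len(blocks) > 1 and blocks[-2][0] * blocks[-1][1] > blocks[-1][0] * blocks[-2][1]:
            total, count, right = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = right
    return [(right, total / count) for total, count, right in blocks]


def isotonic_predict(model, raw_score):
    """Return the fitted value of the first block whose right edge covers raw_score."""
    for right, value in model:
        if raw_score <= right:
            return value
    return model[-1][1]  # clamp scores above the fitted range
```

Because the fitted map is monotone, it can shift systematically low scores upward without changing their ordering, which is consistent with the abstract's observation that uncalibrated jury scores preserve ordinal agreement.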

2604.15173 2026-06-15 cs.CV Version update

Boundary-Centric Clip-Budgeted Active Learning for Temporal Action Segmentation

Halil Ismail Helvaci, Sen-ching Samson Cheung

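Affiliations: University of California, Berkeley

AI summary: Proposes the B-ACT framework, whose boundary-centric supervision strategy prioritizes annotating action-boundary frames under a limited annotation budget. It substantially improves label efficiency for temporal action segmentation and outperforms existing methods on multiple datasets.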

Abstract

Temporal action segmentation (TAS) in untrimmed videos requires dense temporal supervision. However, most of the annotation cost is spent identifying action transitions where segmentation errors concentrate and small temporal shifts can disproportionately degrade segment-level metrics. We introduce B-ACT, a clip-budgeted active learning framework that explicitly allocates supervision to these error-prone boundary regions. B-ACT operates in a hierarchical two-stage loop: (i) it ranks and queries unlabeled videos using predictive uncertainty, and (ii) within each selected video, it detects candidate transitions from the current model predictions and selects the top-$K$ boundaries via a novel boundary score. The boundary score fuses neighborhood uncertainty, class ambiguity, and temporal prediction dynamics to reveal the underlying importance of each frame. Importantly, our annotation protocol requests labels only at the boundary frames while still training on boundary-centered clips to exploit temporal context through the model's receptive field. Extensive experiments on GTEA, 50Salads, and Breakfast demonstrate that boundary-centric supervision delivers strong label efficiency and consistently surpasses representative TAS active learning baselines and prior state of the art under sparse budgets. Gains are largest on datasets where performance is highly sensitive to boundary placement, as measured by edit and overlap-based F1 metrics.
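
The second stage described above (detect candidate transitions, then keep the top-K by a boundary score) can be sketched as follows. This is not the authors' formula: the abstract names three ingredients (neighborhood uncertainty, class ambiguity, temporal prediction dynamics) but not how they are computed or weighted, so the entropy, top-2 margin, L1 change, equal weighting, and all function names below are assumptions chosen for illustration.

```python
import math


def entropy(p):
    """Shannon entropy of one frame's class distribution."""
    return -sum(q * math.log(q) for q in p if q > 0)


def candidate_boundaries(probs):
    """Frames whose predicted class differs from the previous frame's."""
    labels = [max(range(len(p)), key=p.__getitem__) for p in probs]
    return [t for t in range(1, len(labels)) if labels[t] != labels[t - 1]]


def boundary_score(probs, t, radius=1):
    """Assumed score: sum of three equally weighted terms (t >= 1, >= 2 classes)."""
    lo, hi = max(0, t - radius), min(len(probs), t + radius + 1)
    # Mean entropy over a small window around the candidate frame.
    neighborhood_uncertainty = sum(entropy(p) for p in probs[lo:hi]) / (hi - lo)
    # A small margin between the two most likely classes means high ambiguity.
    top2 = sorted(probs[t], reverse=True)[:2]
    class_ambiguity = 1.0 - (top2[0] - top2[1])
    # How much the prediction changed relative to the previous frame.
    dynamics = sum(abs(a - b) for a, b in zip(probs[t], probs[t - 1]))
    return neighborhood_uncertainty + class_ambiguity + dynamics


def select_top_k(probs, k):
    """Return the k highest-scoring candidate boundary frames, in temporal order."""
    ranked = sorted(candidate_boundaries(probs),
                    key=lambda t: boundary_score(probs, t), reverse=True)
    return sorted(ranked[:k])
```

Here `probs` is a list of per-frame class-probability lists from the current model; only the returned frame indices would be sent for annotation under this sketch.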

2604.14193 2026-06-15 cs.CV eess.IV q-bio.NC Version update

QualiaNet: An Experience-Before-Inference Network

Paul Linton

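Affiliations: Columbia University

AI summary: Proposes QualiaNet, a two-stage architecture modeled on human stereo vision: it first simulates experience through a disparity map, then uses a CNN to estimate distance from disparity gradients, demonstrating that distance can be recovered from disparity gradients.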

Journal ref Extended abstract presented at the 9th Conference on Cognitive Computational Neuroscience, New York, NY, USA, 2026

2602.22822 2026-06-15 cs.AI cs.LG Version update

FlexMS: A Unified Public Benchmark for Molecule Tandem Mass Spectrum Prediction

Yunhua Zhong, Yixuan Tang, Yifan Li, Pan Liu, Zhiwen Yang, Jie Yang, Jun Xia

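Affiliations: The Hong Kong University of Science and Technology (Guangzhou); The Hong Kong University of Science and Technology; The University of Hong Kong; Yangzhou University; Fudan University

AI summary: Proposes the FlexMS benchmark framework, which standardizes preprocessing, metadata conditioning, and evaluation protocols to enable fair comparison across public resources, and introduces difficulty-aware diagnostics to guide model selection.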

Comments preprint version v3

Abstract

Tandem mass spectrometry (MS/MS) is central to small molecule identification, but current deep learning systems for spectrum prediction remain difficult to evaluate and deploy in practice. While novel architectures constantly claim state-of-the-art performance, inconsistent metadata conditioning and entangled preprocessing pipelines hinder fair architectural comparisons. In addition, existing evaluations are often restricted to curated datasets, failing to capture the heterogeneity and cross-domain shifts of real-world metabolomics. Furthermore, current benchmarks lack difficulty-aware diagnostics and are blind to how models behave under specific compute or data constraints. To address this, we present FlexMS, a modular public-data benchmark framework that standardizes MS/MS prediction across public resources while keeping molecular encoders, metadata conditioning, predictor heads, and downstream retrieval under one protocol. FlexMS establishes a fair evaluation playground that significantly lowers the barrier for integrating new predictive tools. Rather than solely optimizing for average scores, FlexMS augments aggregate accuracy with difficulty-aware diagnostics, providing actionable guidance on model selection across different compute constraints, data scales, and downstream retrieval objectives. Ultimately, FlexMS provides the community with a reproducible standard to identify which algorithmic conclusions are stable and which operating points are most viable in practice.

2604.09737 2026-06-15 cs.LG cs.AI Version update

Towards Physically Realizable Adversarial Attenuation Patch against SAR Object Detection

Yiming Zhang, Weibo Qin, Feng Wang

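Affiliations: Key Laboratory for Information Science of Electromagnetic Waves (MoE), School of Information Science and Technology, Fudan University

AI summary: Proposes the Adversarial Attenuation Patch (AAP) method, which balances attack effectiveness and stealthiness through energy-constrained optimization and an attenuation-based deployment framework, and grounds physical realizability in signal-level electronic jamming mechanisms.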

Comments 5 pages, 4 figures. Source code is available at https://github.com/boremycin/SAAP. Accepted and published in IEEE CAIT 2026. DOI: 10.1109/CAIT70489.2026.11553874

Journal ref Proc. 2026 China Aerospace Information Technology Conference (CAIT), Tongxiang, China, May 2026

DOI: 10.1109/CAIT70489.2026.11553874

Abstract

Deep neural networks have demonstrated excellent performance in SAR target detection tasks but remain susceptible to adversarial attacks. Existing SAR-specific attack methods can effectively deceive detectors; however, they often introduce noticeable perturbations and are largely confined to the digital domain, neglecting the physical implementation constraints of attacking SAR systems. In this paper, a novel Adversarial Attenuation Patch (AAP) method is proposed that employs an energy-constrained optimization strategy coupled with an attenuation-based deployment framework to achieve a seamless balance between attack effectiveness and stealthiness. More importantly, AAP exhibits strong potential for physical realization by aligning with signal-level electronic jamming mechanisms. Experimental results show that AAP effectively degrades detection performance while preserving high imperceptibility, and shows favorable transferability across different models. This study provides a physically grounded perspective for adversarial attacks on SAR target detection systems and facilitates the design of more covert and practically deployable attack strategies. The source code is made available at https://github.com/boremycin/SAAP.

2603.23530 2026-06-15 cs.CL cs.AI cs.LG Version update

Did You Forget What I Asked? Prospective Memory Failures in Large Language Models

Avni Mittal

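Affiliations: University of Washington

AI summary: Viewing the problem through the lens of prospective memory from cognitive psychology, this study finds that LLM compliance with formatting instructions drops by 2-21% while performing complex tasks, and proposes a salience-enhanced format that restores compliance.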

Abstract

Large language models often fail to satisfy formatting instructions when they must simultaneously perform demanding tasks. We study this behaviour through a prospective-memory-inspired lens from cognitive psychology, using a controlled paradigm that combines verifiable formatting constraints with benchmark tasks of increasing complexity. Across three model families and over 8,000 prompts, compliance drops by 2-21% under concurrent task load. Vulnerability is highly type-dependent: terminal constraints (requiring action at the response boundary) degrade most, with drops up to 50%, while avoidance constraints remain comparatively robust. A salience-enhanced format (explicit instruction framing plus a trailing reminder) recovers much of the lost compliance, restoring performance to 90-100% in many settings. Interference is bidirectional: formatting constraints can also reduce task accuracy, with one model's GSM8K accuracy dropping from 93% to 27%. In additional stacking experiments, joint compliance declines sharply as constraints accumulate. All results use deterministic programmatic checkers without an LLM-as-judge component on publicly available datasets.
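
As an illustration of what a deterministic programmatic checker for such constraints might look like, here is a minimal sketch. The specific constraints (ending with an `END` marker, avoiding commas) and the function names are hypothetical examples of the terminal and avoidance types the abstract describes; the paper's actual constraint set is not given here.

```python
def check_terminal(response, marker="END"):
    """Terminal constraint: the response must finish with the marker.

    Trailing whitespace is ignored so a final newline does not count as a failure.
    """
    return response.rstrip().endswith(marker)


def check_avoidance(response, forbidden=","):
    """Avoidance constraint: the forbidden string must not appear anywhere."""
    return forbidden not in response


def joint_compliance(response, checks):
    """Stacked constraints: compliant only if every individual check passes."""
    return all(check(response) for check in checks)
```

Since each check is a pure string predicate, the same response always receives the same verdict, which is the property that lets results be reported without an LLM-as-judge component. Joint compliance is an `all(...)` over checks, so it can only stay equal or fall as constraints are stacked.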

2512.21201 2026-06-15 cs.RO cs.AI cs.CV Version update

Schrödinger's Navigator: Imagining an Ensemble of Futures for Zero-Shot Object Navigation

Yu He, Da Huang, Zhenyang Liu, Zixiao Gu, Qiang Sun, Guangnan Ye, Yanwei Fu, Yu-Gang Jiang

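Affiliations: Fudan University; Shanghai Jiao Tong University; Shanghai University of International Business and Economics; Shanghai Innovation Institute

AI summary: Proposes a belief-aware framework that, at inference time, imagines multiple future scenes with a trajectory-conditioned 3D world model. Combined with adaptive occluder-aware sampling and a future-aware value map, it improves hidden-target discovery and risk-aware path selection for zero-shot object navigation in occlusion-heavy environments.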

Abstract

Zero-shot object navigation (ZSON) requires robots to find target objects in unseen environments without task-specific fine-tuning or pre-built maps, a key capability for general-purpose service robots. Yet methods that perform well in simulation often degrade in cluttered real-world scenes with severe occlusion and latent hazards, where large unseen regions make single-scene inference brittle and unsafe. We propose Schrödinger's Navigator, a belief-aware framework that reasons at inference time over multiple trajectory-conditioned imagined 3D futures. Given candidate paths, a trajectory-conditioned 3D world model predicts hypothetical observations and maintains a superposition of plausible scene realizations rather than committing to one map. An adaptive occluder-aware sampler directs imagination to uncertainty-critical regions, while a Future-Aware Value Map (FAVM) aggregates imagined futures for robust, proactive action selection. Experiments in simulation and on a physical Go2 quadruped show that Schrödinger's Navigator outperforms strong ZSON baselines, improving hidden-target discovery and risk-aware waypoint selection in occlusion-heavy navigation scenarios. These results highlight imagined 3D futures as a scalable and generalizable strategy for zero-shot navigation in uncertain real-world environments.

2603.18464 2026-06-15 cs.LG Version update

AcceRL: A Distributed Asynchronous Reinforcement Learning and World Model Framework for Vision-Language-Action Models

Chengxuan Lu, Shukuan Wang, Yanjie Li, Yingying Fang, Huoyan Wang, Tian Zhang, Wei Liu, Shiji Jin, Fuyuan Qian, Peiming Li, Chao Xu, Baigui Sun, Yang Liu

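Affiliations: IROOTECH TECHNOLOGY Wolf 1069 b Lab, Sany Group

AI summary: Proposes the AcceRL framework, which achieves distributed asynchronous reinforcement learning by physically isolating environment interaction, model inference, and gradient updates. It eliminates the idle bubbles of synchronous systems, raises hardware utilization, and supports plug-and-play world-model integration, delivering a 2.4x throughput speedup and up to a 200x sample-efficiency gain on LIBERO tasks.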

Abstract

Reinforcement learning (RL) for large-scale Vision-Language-Action (VLA) models is severely bottlenecked by synchronization barriers and the high cost of environment data acquisition. To overcome these challenges, we propose AcceRL, a distributed asynchronous RL framework that physically isolates environment rollouts, model inference, and gradient updates. By eliminating the cascading long-tail idle bubbles inherent in synchronous systems, AcceRL maximizes hardware utilization and ensures scalable throughput. Furthermore, AcceRL features a modular design that supports the integration of diverse, plug-and-play world models into its distributed pipeline. Extensive experiments demonstrate that the base framework achieves highly competitive performance across all four LIBERO task suites. At the system level, the asynchronous architecture delivers a $2.4\times$ throughput speedup over leading synchronous baselines. Algorithmically, by leveraging a world model pre-trained on 1,000 offline trajectories, AcceRL achieves up to a $200\times$ improvement in online sample efficiency on LIBERO-Spatial, establishing a robust framework that is both sample-efficient and time-efficient for embodied AI. Code is included in the supplementary material and is available at https://github.com/distanceLu/AcceRL.

2603.16073 2026-06-15 cs.CL Version update

The Curse and Blessing of Mean Bias in FP4-Quantized LLM Training

Hengjie Cao, Zhendong Huang, Mengyi Chen, Yifeng Yang, Fang Dong, Anrui Chen, Ruijun Huang, Xin Zhang, Mingzhi Dong, Yujiang Wang, Jinlong Hou, Qin Lv, Robert P. Dick, Yuan Cheng, Tun Lu, Fan Yang, Yixuan Chen, Li Shang

Affiliations: Fudan University; University of Bath; Shanghai Innovation Institute; University of Oxford; Oxford Suzhou Centre for Advanced Research; University of Colorado Boulder; University of Michigan; Shenzhen Loop Area Institute

AI summary: Finds that FP4 training failures stem from activation outliers dominated by a rank-one mean bias, and proposes Averis, a mean-residual splitting quantization method that enables robust W4A4G4 training on Qwen3 models with a smaller loss gap than NVIDIA's Hadamard-based method.

Abstract

FP4 training promises substantial memory and compute savings for large language models, but remains fragile because blockwise quantization is dictated by extreme activation magnitudes, which inflate dynamic range and compress long-tail signals. We identify a counterintuitive source of this failure: dominant activation outliers are not merely arbitrary sparse events, but are largely induced by a coherent rank-one mean bias, whose direction aligns with the leading anisotropic spectral component. This mean component strengthens during training, is amplified and reshaped by attention and FFN operators, and increasingly dominates top activation magnitudes. Crucially, this discovery reveals that a seemingly complex outlier-suppression problem admits a truly simple solution: isolate the coherent mean before quantization. We therefore propose Averis, a mean-residual splitting quantization method that separates the mean component using only reductions and elementwise subtractions before FP4 quantization. Across Qwen3 0.6B Dense trained on 100B tokens and Qwen3 7B A1.5B MoE trained on 50B tokens, Averis enables robust W4A4G4 FP4 training, reducing BF16 loss gaps to 1.19%/0.81% versus 2.05%/1.10% for NVIDIA's recently released Hadamard-based outlier-smoothing method, while limiting downstream gaps to 0.89/0.71 points. With only 2.20% end-to-end overhead over vanilla NVFP4, about 30% of NVIDIA's Hadamard-based design, Averis provides a hardware-efficient path to stable low-bit LLM training. Complementary to Hadamard, Averis further reduces the Qwen3-0.6B loss and downstream gaps to 0.94% and 0.73 points when combined. Code is available at: https://anonymous.4open.science/r/averis-504D.
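
The core idea above (isolate the coherent mean with a reduction and an elementwise subtraction, then quantize only the residual) can be sketched on a toy activation matrix. This is a simplified illustration, not the Averis implementation: it assumes one scale per row instead of NVFP4's block layout, keeps the mean in full precision, and uses hypothetical function names. The grid holds the magnitudes representable in the 4-bit E2M1 format.

```python
import math

# Magnitudes representable in FP4 (E2M1); the sign is stored separately.
FP4_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def quantize_block(block):
    """Fake-quantize one block: scale its peak to 6, snap to the grid, rescale."""
    peak = max(abs(v) for v in block)
    if peak == 0.0:
        return [0.0] * len(block)
    scale = peak / FP4_GRID[-1]
    out = []
    for v in block:
        mag = min(FP4_GRID, key=lambda g: abs(g - abs(v) / scale))
        out.append(math.copysign(mag * scale, v))
    return out


def quantize_direct(x):
    """Baseline: quantize each row (tokens x channels) as one block."""
    return [quantize_block(row) for row in x]


def quantize_mean_split(x):
    """Assumed variant: subtract the per-channel mean over tokens, quantize the rest."""
    n = len(x)
    mean = [sum(col) / n for col in zip(*x)]                      # reduction
    residual = [[v - m for v, m in zip(row, mean)] for row in x]  # subtraction
    quantized = [quantize_block(row) for row in residual]
    return [[q + m for q, m in zip(row, mean)] for row in quantized]


def mean_abs_error(x, y):
    """Average absolute elementwise difference between two matrices."""
    diffs = [abs(a - b) for ra, rb in zip(x, y) for a, b in zip(ra, rb)]
    return sum(diffs) / len(diffs)
```

When every token shares a large per-channel offset, the direct path spends its dynamic range on that offset and crushes the small token-to-token variation; after the split, the scale is set by the residual alone. Comparing `mean_abs_error` for the two paths on such a matrix shows the effect.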

2603.09377 2026-06-15 cs.CV Version update

SinGeo: Unlock Single Model's Potential for Robust Cross-View Geo-Localization

Yang Chen, Xieyuanli Chen, Junxiang Li, Jie Tang, Tao Wu

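Affiliations: College of Intelligence Science and Technology, National University of Defense Technology

AI summary: Proposes the SinGeo framework, which uses a dual discriminative learning architecture and a curriculum learning strategy so that a single model achieves robust cross-view geo-localization under unknown fields of view and orientations, reaching state-of-the-art performance on four benchmark datasets.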

Comments v2

Abstract

Robust cross-view geo-localization (CVGL) remains challenging despite the surge in recent progress. Existing methods still rely on field-of-view (FoV)-specific training paradigms, where models are optimized under a fixed FoV but collapse when tested on unseen FoVs and unknown orientations. This limitation necessitates deploying multiple models to cover diverse variations. Although studies have explored dynamic FoV training by simply randomizing FoVs, they failed to achieve robustness across diverse conditions -- implicitly assuming all FoVs are equally difficult. To address this gap, we present SinGeo, a simple yet powerful framework that enables a single model to realize robust cross-view geo-localization without additional modules or explicit transformations. SinGeo employs a dual discriminative learning architecture that enhances intra-view discriminability within both ground and satellite branches, and is the first to introduce a curriculum learning strategy to achieve robust CVGL. Extensive evaluations on four benchmark datasets reveal that SinGeo sets state-of-the-art (SOTA) results under diverse conditions, and notably outperforms methods specifically trained for extreme FoVs. Beyond superior performance, SinGeo also exhibits cross-architecture transferability. Furthermore, we propose a consistency evaluation method to quantitatively assess model stability under varying views, providing an explainable perspective for understanding and advancing robustness in future CVGL research. Code will be available upon acceptance.

2603.05556 2026-06-15 cs.LG Version update

IntSeqBERT: Learning Arithmetic Structure in OEIS via Modulo-Spectrum Embeddings

Kazuhisa Nakasho

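Affiliations: Iwate Prefectural University

AI summary: Proposes IntSeqBERT, a dual-stream Transformer encoder that fuses a continuous log-magnitude embedding with sin/cos modulo embeddings for 100 moduli, and jointly trains three prediction heads on OEIS sequences, substantially improving sequence prediction accuracy.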

Abstract

Integer sequences in the OEIS span values from single-digit constants to astronomical factorials and exponentials, making prediction challenging for standard tokenised models that cannot handle out-of-vocabulary values or exploit periodic arithmetic structure. We present IntSeqBERT, a dual-stream Transformer encoder for masked integer-sequence modelling on OEIS. Each sequence element is encoded along two complementary axes: a continuous log-scale magnitude embedding and sin/cos modulo embeddings for 100 residues (moduli $2$--$101$), fused via FiLM. Three prediction heads (magnitude regression, sign classification, and modulo prediction for 100 moduli) are trained jointly on 274,705 OEIS sequences. At the Large scale (91.5M parameters), IntSeqBERT achieves 95.85% magnitude accuracy and 50.38% Mean Modulo Accuracy (MMA) on the test set, outperforming a standard tokenised Transformer baseline by $+8.9$ pt and $+4.5$ pt, respectively. An ablation removing the modulo stream confirms it accounts for $+15.2$ pt of the MMA gain and contributes an additional $+6.2$ pt to magnitude accuracy. A probabilistic Chinese Remainder Theorem (CRT)-based Solver converts the model's predictions into concrete integers, yielding a 7.4-fold improvement in next-term prediction over the tokenised-Transformer baseline (Top-1: 19.09% vs. 2.59%). Modulo spectrum analysis reveals a strong negative correlation between Normalised Information Gain (NIG) and Euler's totient ratio $\varphi(m)/m$ ($r = -0.851$, $p < 10^{-28}$), providing empirical evidence that composite moduli capture OEIS arithmetic structure more efficiently via CRT aggregation.
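
To make the modulo-spectrum encoding and the role of the Chinese Remainder Theorem concrete, here is a minimal sketch. It assumes each residue is placed on the unit circle as (sin, cos) of the angle 2π(n mod m)/m, which is one natural reading of "sin/cos modulo embeddings", and it uses the classical deterministic CRT on pairwise coprime moduli rather than the paper's probabilistic solver. All function names are hypothetical.

```python
import math

MODULI = range(2, 102)  # the 100 moduli 2..101 named in the abstract


def modulo_embedding(n):
    """Return 200 features: (sin, cos) of the residue angle for each modulus."""
    feats = []
    for m in MODULI:
        angle = 2.0 * math.pi * (n % m) / m
        feats.extend((math.sin(angle), math.cos(angle)))
    return feats


def crt(residues, moduli):
    """Recover the unique x in [0, prod(moduli)) with x % m == r for each pair.

    Requires pairwise coprime moduli; pow(M, -1, m) needs Python 3.8+.
    """
    x, M = 0, 1
    for r, m in zip(residues, moduli):
        # Choose k so that x + M*k is congruent to r modulo m.
        k = ((r - x) * pow(M, -1, m)) % m
        x += M * k
        M *= m
    return x
```

The circular encoding keeps residues 0 and m-1 adjacent, which a plain integer feature would not, and the CRT step shows why residue predictions for several moduli can pin down a concrete integer.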

2603.05230 2026-06-15 cs.CV cs.RO Version update

Digital Twin Driven Textile Classification and Foreign Object Recognition in Automated Sorting Systems

Serkan Ergun, Tobias Mitterer, Hubert Zangl

Affiliations: Institute of Smart Systems Technologies; University of Klagenfurt; AAU SAL USE Laboratory; Silicon Austria Labs

AI summary: Proposes a digital-twin-driven robotic sorting system that combines grasp prediction, multimodal perception, and vision-language models to classify textiles and detect foreign objects; Qwen models reach up to 87.9% accuracy.

Comments 10 pages, single column, 5 figures, preprint for Photomet Edumet 2026 (Klagenfurt, Austria)

Abstract

The increasing demand for sustainable textile recycling requires robust automation solutions capable of handling deformable garments and detecting foreign objects in cluttered environments. This work presents a digital-twin-driven robotic sorting system that integrates grasp prediction, multimodal perception, and semantic reasoning for real-world textile classification. A dual-arm robotic cell equipped with RGBD sensing, capacitive tactile feedback, and collision-aware motion planning autonomously separates garments from an unsorted basket, transfers them to an inspection zone, and classifies them using state-of-the-art Visual Language Models (VLMs). We benchmark nine VLMs from five model families on a dataset of 223 inspection scenarios comprising shirts, socks, trousers, underwear, foreign objects (including garments outside of the aforementioned classes), and empty scenes. The evaluation assesses per-class accuracy, hallucination behavior, and computational performance under practical hardware constraints. Results show that the Qwen model family achieves the highest overall accuracy (up to 87.9%), with strong foreign object detection performance, while lighter models such as Gemma3 offer competitive speed-accuracy trade-offs for edge deployment. A digital twin combined with MoveIt enables collision-aware path planning and integrates segmented 3D point clouds of inspected garments into the virtual environment for improved manipulation reliability. The presented system demonstrates the feasibility of combining semantic VLM reasoning with conventional grasp detection and digital twin technology for scalable, autonomous textile sorting in realistic industrial settings.
