arXivDaily每日学术速递，同步arXiv全量数据，AI总结、翻译，覆盖人工智能、机器人、计算机、金融、统计学、数学、物理学、生物学、经济学、电气&系统等方向。

2512.07765 2026-05-19 cs.RO

Toward Seamless Physical Human-Humanoid Interaction: Insights from Control, Intent, and Modeling with a Vision for What Comes Next

迈向无缝的人与人形机器人物理交互：来自控制、意图与建模的见解及对未来的展望

Gustavo A. Cardona, Shubham S. Kumbhar, Panagiotis Artemiadis

发表机构 * University of Delaware（特拉华大学）

AI总结本文探讨了人与人形机器人物理交互（pHHI）领域的三个核心支柱：控制、意图估计和计算人类模型，总结了研究现状、开放挑战与局限，并提出了跨支柱整合的路径，旨在推动更鲁棒、安全和直观的物理交互研究。

Comments 60 pages, 5 figures, 3 tables

DOI: 10.1007/s10846-026-02401-0

AI中文摘要

人与人形机器人物理交互（pHHI）是一个快速发展的领域，对在非结构化、以人为中心的环境中部署机器人具有重要意义。在本综述中，我们通过三个核心支柱审视pHHI的研究现状：（i）人形机器人的建模与控制，（ii）人类意图估计，以及（iii）计算人类模型。对于每个支柱，我们梳理了代表性方法，指出了开放挑战，并分析了阻碍实现鲁棒、可扩展和自适应交互的现有局限。这些局限包括：需要能够处理不确定人体动力学的全身控制策略，需要在有限感知条件下进行实时意图推断，以及需要能够刻画人体物理状态差异的建模技术。尽管每个领域都取得了显著进展，但跨支柱的整合仍然有限。我们提出了统一这些领域方法的路径，以实现连贯的交互框架。这种结构不仅使我们能够描绘当前的研究版图，还能提出旨在弥合这些领域之间差距的具体未来研究方向。此外，我们引入了一种统一的交互类型分类法：按模态区分直接交互（如物理接触）和间接交互（如以物体为中介），并按机器人参与程度区分协助、合作与协作。对于该分类法中的每个类别，我们都从三个核心支柱出发进行梳理，突出跨支柱统一的机会。我们的目标是提出推动鲁棒、安全和直观物理交互的途径，为未来研究提供路线图，使人形系统能够在多样化的现实环境中有效地理解、预测人类伙伴并与之协作。

英文摘要

Physical Human-Humanoid Interaction (pHHI) is a rapidly advancing field with significant implications for deploying robots in unstructured, human-centric environments. In this review, we examine the current state of the art in pHHI through three core pillars: (i) humanoid modeling and control, (ii) human intent estimation, and (iii) computational human models. For each pillar, we survey representative approaches, identify open challenges, and analyze current limitations that hinder robust, scalable, and adaptive interaction. These include the need for whole-body control strategies capable of handling uncertain human dynamics, real-time intent inference under limited sensing, and modeling techniques that account for variability in human physical states. Although significant progress has been made within each domain, integration across pillars remains limited. We propose pathways for unifying methods across these areas to enable cohesive interaction frameworks. This structure enables us not only to map the current landscape but also to propose concrete directions for future research that aim to bridge these domains. Additionally, we introduce a unified taxonomy of interaction types based on modality, distinguishing between direct interactions (e.g., physical contact) and indirect interactions (e.g., object-mediated), and on the level of robot engagement, ranging from assistance to cooperation and collaboration. For each category in this taxonomy, we provide the three core pillars that highlight opportunities for cross-pillar unification. Our goal is to suggest avenues to advance robust, safe, and intuitive physical interaction, providing a roadmap for future research that will allow humanoid systems to effectively understand, anticipate, and collaborate with human partners in diverse real-world settings.

2512.05136 2026-05-19 cs.CV cs.AI

Fine-tuning an ECG Foundation Model to Predict Coronary CT Angiography Outcomes

微调一种心电图基础模型以预测冠状动脉CT血管造影结果

Yujie Xiao, Qinghao Zhao, Gongzheng Tang, Hao Zhang, Zhuoran Kan, Deyun Zhang, Jun Li, Guangkun Nie, Xiaocheng Fang, Haoyu Wang, Shun Huang, Tong Liu, Jian Liu, Kangyin Chen, Shenda Hong

发表机构 * Institute of Medical Technology, Peking University Health Science Center（北京大学医学部医学技术研究院）； National Institute of Health Data Science, Peking University（北京大学国家健康数据科学研究院）； Department of Cardiology, Peking University People’s Hospital（北京大学人民医院心内科）； Tianjin Key Laboratory of Ionic-Molecular Function of Cardiovascular Disease, Department of Cardiology, Tianjin Institute of Cardiology, The Second Hospital of Tianjin Medical University（天津市心血管病离子与分子机能重点实验室，天津医科大学第二医院心脏科，天津心脏病学研究所）； Heart Voice Medical Technology（心声医疗科技）； School of Intelligence Science and Technology, Peking University（北京大学智能科学与技术学院）

AI总结本文通过微调心电图基础模型来预测冠状动脉CT血管造影（CCTA）结果：在一项多中心研究中，以CCTA作为解剖学参考标准，开发并验证了用于预测血管特异性冠状动脉狭窄的AI-ECG模型，报告了模型在内部和外部验证中的表现及其临床应用价值。

AI中文摘要

冠状动脉疾病（CAD）仍然是全球公共卫生的主要负担，然而可扩展的筛查工具有限。尽管冠状动脉CT血管造影（CCTA）是一线的非侵入性诊断方法，但其使用受到资源需求和辐射暴露的限制。AI-ECG可能为CAD风险分层提供补充方法。在这项多中心研究中，我们以CCTA作为解剖学参考标准，开发并验证了用于预测血管特异性冠状动脉狭窄的AI-ECG模型。在内部验证中，模型在各血管上的AUC值为0.683-0.744，并在外部验证中表现一致。模型在临床判读正常的ECG中保持了区分能力，并在各亚组中大体稳定。模型预测的概率随CCTA定义的狭窄严重程度单调增加。依据预先设定的基于灵敏度和特异性的阈值，模型概率被转换为血管特异性的低、中、高风险分层。校准分析显示预测风险与观察风险一致，而决策曲线分析（DCA）表明，与“全部治疗”和“全部不治疗”策略相比，模型具有净临床获益。将AI得出的风险分层与基于指南的验前概率（PTP）类别相结合，提高了排除诊断的性能，降低了灰区比例，并且与单独使用PTP相比实现了正的净重分类改善（NRI）。在纵向随访队列中，Kaplan-Meier分析显示，模型定义的各风险组在主要不良心血管事件风险上明显分离。基于波形和归因的分析进一步识别了与高风险预测相关的结构化ECG形态差异和具有生理意义的信号区域。这些发现支持AI-ECG作为补充性CAD筛查、解剖学风险估计和临床分诊的可行工具，但仍需前瞻性研究来确认其临床影响。

英文摘要

Coronary artery disease (CAD) remains a major global public health burden, yet scalable screening tools are limited. Although coronary computed tomography angiography (CCTA) is a first-line non-invasive diagnostic modality, its use is constrained by resource requirements and radiation exposure. Artificial intelligence-enabled electrocardiography (AI-ECG) may offer a complementary approach for CAD risk stratification. In this multicenter study, we developed and validated an AI-ECG model using CCTA as the anatomical reference standard to predict vessel-specific coronary stenosis. In internal validation, the model achieved area under the curve (AUC) values of 0.683-0.744 across vessels and showed consistent external performance. Discrimination was maintained in clinically normal ECGs and remained broadly stable across subgroups. Model-predicted probabilities increased monotonically with CCTA-defined stenosis severity. Model probabilities were converted into vessel-specific low-, intermediate-, and high-risk strata using predefined sensitivity- and specificity-based thresholds. Calibration analysis showed agreement between predicted and observed risk, while decision curve analysis (DCA) indicated net clinical benefit over treat-all and treat-none strategies. Integrating AI-derived risk strata with guideline-based pre-test probability (PTP) categories improved rule-out performance, reduced the gray-zone proportion, and achieved a positive net reclassification improvement (NRI) compared with PTP alone. In a longitudinal follow-up cohort, Kaplan-Meier analysis showed clear separation of major adverse cardiovascular event risk across model-defined risk groups. Waveform- and attribution-based analyses further identified structured ECG morphology differences and physiologically meaningful signal regions associated with high-risk predictions. These findings support AI-ECG as a feasible tool for complementary CAD screening, anatomical risk estimation, and clinical triage, while prospective studies are needed to confirm its clinical impact.
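The abstract does not say how the sensitivity- and specificity-based thresholds were chosen, so the following is only a minimal sketch of one common construction: a rule-out cut-off that keeps a target sensitivity and a rule-in cut-off that keeps a target specificity. The function names, the default targets of 0.90, and the assumption that both classes are present are illustrative choices, not details from the paper.

```python
import math


def rule_out_threshold(probs, labels, target_sensitivity=0.90):
    """Largest cut-off t such that calling 'p >= t' positive keeps
    at least the target share of true positives (assumed construction)."""
    positives = sorted(p for p, y in zip(probs, labels) if y == 1)
    need = max(1, math.ceil(target_sensitivity * len(positives)))
    return positives[len(positives) - need]


def rule_in_threshold(probs, labels, target_specificity=0.90):
    """Smallest observed cut-off t such that 'p <= t' covers at least
    the target share of true negatives (assumed construction)."""
    negatives = sorted(p for p, y in zip(probs, labels) if y == 0)
    need = max(1, math.ceil(target_specificity * len(negatives)))
    return negatives[need - 1]


def stratify(p, t_low, t_high):
    """Map a probability to a risk stratum; assumes t_low <= t_high."""
    if p < t_low:
        return "low"
    if p > t_high:
        return "high"
    return "intermediate"
```

In practice the two cut-offs would be fixed on a development set and then applied unchanged to validation data. If the rule-out cut-off ends up above the rule-in cut-off, the intermediate stratum is empty and the targets need revisiting.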

2512.01843 2026-05-19 cs.CV

PhyDetEx: Detecting and Explaining the Physical Plausibility of T2V Models

PhyDetEx: 检测和解释T2V模型的物理合理性

Zeqing Wang, Keze Wang, Lei Zhang

发表机构 * Zeqing Wang 1,3（王泽清 1,3）； Keze Wang 1（王凯泽 1）； Lei Zhang 2（张磊 2）

AI总结本文提出PhyDetEx，通过构建PID数据集和轻量级微调方法，评估T2V模型在生成物理合理视频方面的性能，并发现尽管模型在视频生成上有所进步，但理解并遵循物理定律仍存在挑战。

Comments 23 pages, 10 figures

AI中文摘要

受模型容量和训练规模增长的推动，文本到视频（T2V）生成模型在视频质量、长度和遵循指令的能力方面取得了显著进展。然而，这些模型是否能理解物理并生成物理上合理的视频仍是一个问题。尽管视觉语言模型（VLMs）已被广泛用于各种应用中的通用评估，但它们难以识别生成视频中的物理不可能内容。为研究此问题，我们构建了一个PID（物理不可信检测）数据集，包含500个手动标注的测试视频和2,588对训练视频，其中每个不可信视频都是通过仔细修改其对应真实视频的描述来生成的，以诱导T2V模型生成物理上不可信的内容。利用构建的数据集，我们提出了一种轻量级微调方法，使VLMs不仅能检测物理不可信事件，还能生成违反物理原理的文本解释。将微调后的VLM作为物理合理性检测器和解释器，即PhyDetEx，我们评估了一系列最先进的T2V模型，以评估它们对物理定律的遵守程度。我们的发现表明，尽管最近的T2V模型在生成物理合理内容方面取得了显著进展，但理解和遵守物理定律仍是一个具有挑战性的问题，特别是对于开源模型。我们的数据集、训练代码和检查点可在https://github.com/Zeqing-Wang/PhyDetEx获取。

英文摘要

Driven by growing model capacity and training scale, Text-to-Video (T2V) generation models have recently achieved substantial progress in video quality, length, and instruction-following capability. However, whether these models can understand physics and generate physically plausible videos remains an open question. While Vision-Language Models (VLMs) have been widely used as general-purpose evaluators in various applications, they struggle to identify physically impossible content in generated videos. To investigate this issue, we construct a PID (Physical Implausibility Detection) dataset, which consists of a test split of 500 manually annotated videos and a train split of 2,588 paired videos, where each implausible video is generated by carefully rewriting the caption of its corresponding real-world video to induce T2V models to produce physically implausible content. With the constructed dataset, we introduce a lightweight fine-tuning approach, enabling VLMs to not only detect physically implausible events but also generate textual explanations of the violated physical principles. Taking the fine-tuned VLM as a physical plausibility detector and explainer, namely PhyDetEx, we benchmark a series of state-of-the-art T2V models to assess their adherence to physical laws. Our findings show that although recent T2V models have made notable progress toward generating physically plausible content, understanding and adhering to physical laws remains a challenging issue, especially for open-source models. Our dataset, training code, and checkpoints are available at https://github.com/Zeqing-Wang/PhyDetEx.

2512.01537 2026-05-19 cs.SD cs.AI cs.IT cs.LG eess.SP math.IT

Two-Dimensional Quantization for Geometry-Aware Audio Coding

二维量化用于几何感知的音频编码

Tal Shuster, Eliya Nachmani

发表机构 * School of Electrical and Computer Engineering, Ben-Gurion University of the Negev, Be’er Sheva, Israel（以色列贝尔谢巴，内盖夫本-古里安大学电气与计算机工程学院）

AI总结本文提出了一种二维量化方法Q2D2，通过将特征对投影到结构化的2D网格上，提高了音频压缩效率，同时保持了最先进的重建质量。

Comments Accepted to ICML 2026

AI中文摘要

最近的神经音频编解码器在重建质量上取得了显著成就，通常依赖于残差向量量化（RVQ）、向量量化（VQ）和有限标量量化（FSQ）等量化方法。然而，这些量化技术限制了潜在空间的几何结构，使特征之间的相关性捕捉变得更加困难，导致表示学习、代码本利用和令牌速率的效率低下。在本文中，我们引入了二维量化（Q2D2），一种将特征对投影到结构化2D网格（如六边形、菱形或矩形铺砌）并量化到最近网格值的量化方案，从而生成由网格级别乘积定义的隐式代码本，其代码本大小与传统方法相当。尽管其简单的几何公式，Q2D2在音频压缩效率方面有所提升，具有低令牌速率和高代码本利用率，同时保持了最先进的重建质量。具体而言，Q2D2在语音、音频和音乐领域广泛实验中，在各种客观和主观重建度量上实现了具有竞争力甚至更优的性能。全面的消融研究进一步证实了我们设计选择的有效性。

英文摘要

Recent neural audio codecs have achieved impressive reconstruction quality, typically relying on quantization methods such as Residual Vector Quantization (RVQ), Vector Quantization (VQ), and Finite Scalar Quantization (FSQ). However, these quantization techniques limit the geometric structure of the latent space and make it harder to capture correlations between features, leading to inefficiency in representation learning, codebook utilization, and token rate. In this paper, we introduce Two-Dimensional Quantization (Q2D2), a quantization scheme in which feature pairs are projected onto structured 2D grids, such as hexagonal, rhombic, or rectangular tilings, and quantized to the nearest grid values, yielding an implicit codebook defined by the product of grid levels, with codebook sizes comparable to conventional methods. Despite its simple geometric formulation, Q2D2 improves audio compression efficiency, with low token rates and high codebook utilization, while maintaining state-of-the-art reconstruction quality. Specifically, Q2D2 achieves competitive to superior performance on various objective and subjective reconstruction metrics compared to state-of-the-art models, across extensive experiments in the speech, audio, and music domains. Comprehensive ablation studies further confirm the effectiveness of our design choices.
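To make the idea of quantizing a feature pair to the nearest point of a 2D grid concrete, here is a minimal sketch in plain Python. It is not the Q2D2 implementation: the clipping to [-1, 1], the index layout, the lattice spacing, and the function names are all assumptions made for illustration.

```python
import math


def quantize_rect(x, y, levels_x=5, levels_y=5):
    """Snap (x, y) to the nearest point of a levels_x-by-levels_y grid on
    [-1, 1]^2. Returns the grid point and a single code index, so the
    implicit codebook has levels_x * levels_y entries per feature pair."""
    def snap(v, levels):
        v = max(-1.0, min(1.0, v))          # clip to the grid's range
        step = 2.0 / (levels - 1)
        idx = int(math.floor((v + 1.0) / step + 0.5))
        return idx, -1.0 + idx * step

    ix, qx = snap(x, levels_x)
    iy, qy = snap(y, levels_y)
    return (qx, qy), ix * levels_y + iy


def quantize_hex(x, y, scale=1.0):
    """Snap (x, y) to the nearest point of an (unbounded) hexagonal lattice.
    The lattice is the union of two rectangular lattices, one shifted by
    half a cell, so we round on both and keep the closer candidate."""
    h = math.sqrt(3.0) * scale              # row spacing of each sub-lattice
    best = None
    for ox, oy in ((0.0, 0.0), (scale / 2.0, h / 2.0)):
        qx = ox + scale * math.floor((x - ox) / scale + 0.5)
        qy = oy + h * math.floor((y - oy) / h + 0.5)
        dist = (x - qx) ** 2 + (y - qy) ** 2
        if best is None or dist < best[0]:
            best = (dist, (qx, qy))
    return best[1]
```

A trained codec would additionally need a gradient path through the rounding step (for example a straight-through estimator), which this sketch leaves out.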

2511.20857 2026-05-19 cs.CL cs.AI

Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory

Evo-Memory：通过自演化记忆基准测试LLM代理的测试时间学习

Tianxin Wei, Noveen Sachdeva, Benjamin Coleman, Zhankui He, Yuanchen Bei, Xuying Ning, Mengting Ai, Yunzhe Li, Jingrui He, Ed H. Chi, Chi Wang, Shuo Chen, Fernando Pereira, Wang-Cheng Kang, Derek Zhiyuan Cheng

发表机构 * Google DeepMind（谷歌DeepMind）； University of Illinois Urbana-Champaign（伊利诺伊大学厄巴纳-香槟分校）

AI总结本文提出Evo-Memory，一个用于评估LLM代理自演化记忆能力的综合流基准和框架，通过构建序列任务流数据集，要求LLM在每次交互后搜索、适应和演化记忆，并在10个多样化的多轮目标导向和单轮推理与问答数据集上评估了超过十种代表性的记忆模块。

AI中文摘要

状态保持能力对于大型语言模型（LLM）代理进行长期规划和问题解决至关重要。这使得记忆成为关键组件，但其管理和演化仍未得到充分探索。现有的评估主要集中在静态对话设置上，其中记忆只是被动地从对话中检索以回答查询，忽略了在不断演化的任务流中积累和复用经验的动态能力。在交互式问题助手或具身代理等现实环境中，LLM需要处理连续的任务流，却往往无法从积累的交互中学习，从而丢失有价值的上下文信息；这一局限要求实现测试时演化，即LLM在部署期间持续检索、整合和更新记忆。为了弥合这一差距，我们引入了Evo-Memory，一个用于评估LLM代理自演化记忆能力的综合性流式基准和框架。Evo-Memory将数据集组织为顺序任务流，要求LLM在每次交互后搜索、适应和演化记忆。我们统一并实现了十多种代表性的记忆模块，并在10个多样化的多轮目标导向数据集以及单轮推理与问答数据集上评估了它们。为了更好地衡量经验复用，我们提供了一个基线方法ExpRAG，用于检索和利用先前经验，并进一步提出ReMem，一个将推理、任务动作和记忆更新紧密结合的“行动-思考-记忆精炼”流程，以实现持续改进。

英文摘要

Statefulness is essential for large language model (LLM) agents to perform long-term planning and problem-solving. This makes memory a critical component, yet its management and evolution remain largely underexplored. Existing evaluations mostly focus on static conversational settings, where memory is passively retrieved from dialogue to answer queries, overlooking the dynamic ability to accumulate and reuse experience across evolving task streams. In real-world environments such as interactive problem assistants or embodied agents, LLMs are required to handle continuous task streams, yet often fail to learn from accumulated interactions, losing valuable contextual insights, a limitation that calls for test-time evolution, where LLMs retrieve, integrate, and update memory continuously during deployment. To bridge this gap, we introduce Evo-Memory, a comprehensive streaming benchmark and framework for evaluating self-evolving memory in LLM agents. Evo-Memory structures datasets into sequential task streams, requiring LLMs to search, adapt, and evolve memory after each interaction. We unify and implement over ten representative memory modules and evaluate them across 10 diverse multi-turn goal-oriented and single-turn reasoning and QA datasets. To better benchmark experience reuse, we provide a baseline method, ExpRAG, for retrieving and utilizing prior experience, and further propose ReMem, an action-think-memory refine pipeline that tightly integrates reasoning, task actions, and memory updates to achieve continual improvement.
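The streaming protocol described here (retrieve prior experience, act, then update memory after each task) can be sketched as a small loop. This is an illustration only: the abstract does not specify how ExpRAG retrieves, so the token-overlap scoring, the `ExperienceMemory` class, and the `solve` callback are hypothetical stand-ins.

```python
def jaccard(a, b):
    """Token-overlap similarity between two task descriptions."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


class ExperienceMemory:
    """Toy memory of (task, outcome) pairs (hypothetical, for illustration)."""

    def __init__(self):
        self.items = []

    def retrieve(self, task, k=2):
        # Return up to k stored experiences, most similar first.
        ranked = sorted(self.items, key=lambda item: jaccard(task, item[0]),
                        reverse=True)
        return ranked[:k]

    def update(self, task, outcome):
        self.items.append((task, outcome))


def run_stream(tasks, solve, memory, k=2):
    """Process tasks in order; each task sees only experience from earlier ones."""
    outcomes = []
    for task in tasks:
        context = memory.retrieve(task, k)   # search memory
        outcome = solve(task, context)       # act using retrieved experience
        memory.update(task, outcome)         # evolve memory after the interaction
        outcomes.append(outcome)
    return outcomes
```

Here `solve` stands in for the LLM call; a real agent would format the retrieved pairs into its prompt.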

2511.11654 2026-05-19 cs.LG cs.AI cs.MA

Convergence of Multiagent Learning Systems for Traffic control

多智能体学习系统在交通控制中的收敛性

Sayambhu Sen, Shalabh Bhatnagar

发表机构 * Amazon Alexa（亚马逊Alexa）； Indian Institute of Science（印度科学研究院）

AI总结本文研究了多智能体强化学习在交通信号控制中的收敛性问题，通过随机逼近方法分析学习动态，并证明了在特定条件下该算法能够收敛。

Comments 14 pages, 2 figures

AI中文摘要

快速城市化使班加罗尔等城市面临严重的交通拥堵，高效的交通信号控制（TSC）因此变得至关重要。多智能体强化学习（MARL）通常将每个交通信号建模为一个使用Q学习的独立智能体，已成为减少平均通勤延误的一种有前景的策略。尽管Prashant L A等人的先前工作已从实证上证明了这种方法的有效性，但在交通控制背景下，对其稳定性和收敛性的严谨理论分析尚未开展。本文专注于该多智能体算法的理论基础，以填补这一空白。我们研究了在合作性TSC任务中使用独立学习者所固有的收敛问题。利用随机逼近方法，我们对学习动态进行了形式化分析。本文的主要贡献是证明了这一特定的交通控制多智能体强化学习算法在给定条件下收敛，从而将异步值迭代的单智能体收敛性证明推广到该场景。

英文摘要

Rapid urbanization in cities like Bangalore has led to severe traffic congestion, making efficient Traffic Signal Control (TSC) essential. Multi-Agent Reinforcement Learning (MARL), often modeling each traffic signal as an independent agent using Q-learning, has emerged as a promising strategy to reduce average commuter delays. While prior work by Prashant L A et al. has empirically demonstrated the effectiveness of this approach, a rigorous theoretical analysis of its stability and convergence properties in the context of traffic control has not yet been carried out. This paper bridges that gap by focusing squarely on the theoretical basis of this multi-agent algorithm. We investigate the convergence problem inherent in using independent learners for the cooperative TSC task. Utilizing stochastic approximation methods, we formally analyze the learning dynamics. The primary contribution of this work is a proof that the specific multi-agent reinforcement learning algorithm for traffic control converges under the given conditions, extending single-agent convergence proofs for asynchronous value iteration to this setting.
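For readers unfamiliar with the independent-learner setup, the sketch below shows the standard tabular Q-learning update that each signal would run on its own table. It is a generic illustration, not the specific algorithm analysed in the paper; the class name, state encoding, and hyperparameters are assumptions.

```python
from collections import defaultdict


class IndependentQLearner:
    """One agent (e.g. one traffic signal) with its own Q-table.
    Other agents are treated as part of the environment."""

    def __init__(self, n_actions, alpha=0.1, gamma=0.9):
        self.alpha = alpha                  # learning rate
        self.gamma = gamma                  # discount factor
        self.q = defaultdict(lambda: [0.0] * n_actions)

    def update(self, state, action, reward, next_state):
        # Standard Q-learning target: r + gamma * max_a' Q(s', a').
        target = reward + self.gamma * max(self.q[next_state])
        self.q[state][action] += self.alpha * (target - self.q[state][action])

    def greedy(self, state):
        values = self.q[state]
        return values.index(max(values))
```

In a multi-signal simulation, one such learner would be created per junction and each would call `update` only with its local observation and reward. Because every agent's environment then changes as the others learn, convergence is not automatic, which is the question the paper addresses.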

2511.07288 2026-05-19 cs.LG cs.AI

Enabling Off-Policy Imitation Learning with Deep Actor Critic Stabilization

通过深度演员-评论家稳定化实现离策略模仿学习

Sayambhu Sen, Shalabh Bhatnagar

发表机构 * Amazon Alexa（亚马逊Alexa）； Indian Institute of Science（印度科学研究院）

AI总结本文提出一种结合离策略学习的对抗模仿学习算法，通过双Q网络稳定化和价值学习（无需奖励函数推断）来提高样本效率，从而更高效地匹配专家行为。

Comments 14 pages and 4 images

2511.06316 2026-05-19 cs.AI

ALIGN: A Vision-Language Framework for High-Accuracy Accident Location Inference through Geo-Spatial Neural Reasoning

ALIGN：一种通过地理空间神经推理进行高精度事故位置推断的视觉-语言框架

MD Thamed Bin Zaman Chowdhury, Moazzem Hossain

发表机构 * Department of Civil Engineering, Bangladesh University of Engineering and Technology（孟加拉工程与技术大学土木工程系）

AI总结本文提出ALIGN框架，通过视觉-语言模型整合文本和图像数据，以高精度推断事故位置，显著优于传统文本解析方法，实现了亚公里级的定位精度。

AI中文摘要

在低收入和中等收入国家，公共安全和城市规划项目经常面临准确的、针对具体位置的道路事故数据严重短缺的问题。从非结构化文本中提取可靠的地理空间信息，需要克服传统的基于文本的地理编码工具的局限性，这些工具在地点描述含糊的多语言环境中常常失效。本研究引入ALIGN（通过地理空间神经推理进行事故位置推断），一种视觉-语言框架，旨在模拟人类空间推理能力，从非结构化的孟加拉语新闻报道和基于地图的线索中推断出精确的事故坐标。我们开发了一个多阶段自动化流程来处理多样化的文本和视觉数据，将用于线索提取的大语言模型与用于地图验证的视觉-语言模型相结合。借助智能体式架构，我们构建了一个迭代推理循环，结合光学字符识别（OCR）、基于网格的空间扫描以及三次运行的几何投票方法，以数学方式隔离并减少视觉幻觉。研究结果表明，多模态ALIGN框架显著优于传统的纯文本地理解析基线。例如，所提出的系统在验证数据集上将平均定位误差从不可用的10.915公里降低到亚公里级的0.593公里。此外，与达卡大都会警察局的官方记录对照测试，该框架取得了0.465公里的平均误差，证实了其可靠性。这些结果为数据稀缺地区的自动化事故制图提供了一个高精度、无需训练的基础，支持循证的道路安全政策制定，并促进多模态AI在交通分析中的整合。

英文摘要

In low- and middle-income countries, public safety and urban planning initiatives frequently face a critical shortage of accurate, location-specific road crash data. Extracting reliable geospatial information from unstructured text requires overcoming the limitations of traditional text-based geocoding tools, which often fail in multilingual environments with ambiguous place descriptions. This study introduces ALIGN (Accident Location Inference through Geo-Spatial Neural Reasoning), a vision-language framework designed to emulate human spatial reasoning to infer precise accident coordinates from unstructured Bangla news reports and map-based cues. A multi stage automated pipeline was developed to process diverse textual and visual data, integrating large language models for cue extraction with vision-language models for map verification. Using an agentic architecture, we modelled an iterative reasoning loop that combines Optical Character Recognition (OCR), grid-based spatial scanning, and a 3-run geometric voting method to mathematically isolate and reduce visual hallucinations. The findings highlight that the multimodal ALIGN framework significantly outperforms traditional text-only geoparsing baselines. For example, the proposed system successfully reduced the mean localization error from an unusable 10.915 km to a sub-kilometer precision of 0.593 km on a validation dataset. Furthermore, testing the framework against official Dhaka Metropolitan Police records confirmed its reliability by achieving a mean error of 0.465 km. The results provide a high-accuracy, training-free foundation for automated crash mapping in data-scarce regions, supporting evidence-driven road-safety policymaking and the integration of multimodal AI in transportation analytics.
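The abstract mentions a 3-run geometric voting step and reports mean localization error in kilometres, but gives no formula for either. The sketch below shows one plausible reading under stated assumptions: the haversine great-circle distance as the error measure, and a vote that keeps the candidate closest to the other runs (the medoid), so an isolated outlier is discarded. The function names and the medoid rule are illustrative guesses, not the authors' method.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def geometric_vote(candidates):
    """Return the candidate with the smallest total distance to the others
    (the medoid). With three runs, a single hallucinated location far from
    the other two cannot win."""
    return min(candidates,
               key=lambda c: sum(haversine_km(c, other) for other in candidates))


def mean_error_km(predictions, ground_truth):
    """Mean localization error over paired predicted and true coordinates."""
    errors = [haversine_km(p, g) for p, g in zip(predictions, ground_truth)]
    return sum(errors) / len(errors)
```

For example, if two runs land within a few hundred metres of each other and the third is tens of kilometres away, `geometric_vote` returns one of the two agreeing points.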

2511.00392 2026-05-19 cs.RO cs.AI cs.CV

SonarSweep: Fusing Sonar and Vision for Robust 3D Reconstruction via Plane Sweeping

SonarSweep: 通过平面扫描融合声纳与视觉以实现鲁棒的3D重建

Lingpeng Chen, Jiakun Tang, Apple Pui-Yi Chui, Ziyang Hong, Junfeng Wu

发表机构 * Chinese University of Hong Kong, Shenzhen（香港中文大学（深圳））； Chinese University of Hong Kong, Hong Kong（香港中文大学）； Department of Automation, Harbin Institute of Technology（哈尔滨工业大学自动化系）

AI总结本文提出SonarSweep，一种端到端的深度学习框架，通过将平面扫描算法应用于声纳与视觉数据的跨模态融合，克服了单一模态方法在水下环境中进行3D重建的局限性，实现了更精确和稳定的深度图生成。

Comments 8 pages, 9 figures, conference

AI中文摘要

在视觉退化的水下环境中实现准确的3D重建仍是一个严峻的挑战。单一模态方法不足：基于视觉的方法因可见性差和几何约束而失败，而声纳则因固有的高度歧义和低分辨率而受限。因此，先前的融合技术依赖于启发式方法和错误的几何假设，导致显著的伪影和无法建模复杂场景。在本文中，我们引入了SonarSweep，一种新颖的端到端深度学习框架，通过将原理性的平面扫描算法应用于声纳与视觉数据的跨模态融合，克服了这些限制。在高保真模拟和真实环境中的大量实验表明，SonarSweep能够一致地生成密集且准确的深度图，在挑战性条件下，特别是在高浊度情况下，显著优于最先进的方法。为了促进进一步研究，我们将公开我们的代码和一个新型的数据集，该数据集包含同步的立体相机和声纳数据，这是首次公开的此类数据集。

英文摘要

Accurate 3D reconstruction in visually degraded underwater environments remains a formidable challenge. Single-modality approaches are insufficient: vision-based methods fail due to poor visibility and geometric constraints, while sonar is crippled by inherent elevation ambiguity and low resolution. Consequently, prior fusion techniques rely on heuristics and flawed geometric assumptions, leading to significant artifacts and an inability to model complex scenes. In this paper, we introduce SonarSweep, a novel, end-to-end deep learning framework that overcomes these limitations by adapting the principled plane sweep algorithm for cross-modal fusion between sonar and visual data. Extensive experiments in both high-fidelity simulation and real-world environments demonstrate that SonarSweep consistently generates dense and accurate depth maps, significantly outperforming state-of-the-art methods across challenging conditions, particularly in high turbidity. To foster further research, we will publicly release our code and a novel dataset featuring synchronized stereo-camera and sonar data, the first of its kind.

2510.26745 2026-05-19 cs.LG cs.AI cs.CL stat.ML

Deep sequence models tend to memorize geometrically; it is unclear why

深度序列模型倾向于以几何方式记忆；原因尚不清楚

Shahriar Noroozizadeh, Vaishnavh Nagarajan, Elan Rosenfeld, Sanjiv Kumar

发表机构 * Machine Learning Department & Heinz College, Carnegie Mellon University, Pittsburgh, PA, USA ； Google Research, NY, USA

AI总结研究探讨了深度序列模型中原子事实的存储机制，发现几何记忆能编码全局关系，即使在训练中未共现的实体间也能建立联系，挑战了传统关联记忆的观点。

Comments Forty-third International Conference on Machine Learning (ICML 2026)

AI中文摘要

深度序列模型被认为主要以关联记忆的形式存储原子事实，即对共现实体进行暴力查找。我们识别出一种截然不同的原子事实存储形式，称为几何记忆。此时，模型合成的嵌入编码了所有实体之间新的全局关系，包括训练中从未共现的实体。这种存储形式十分强大：例如，我们展示了它如何将涉及ℓ重复合的困难推理任务转化为易于学习的一步导航任务。从这一现象中，我们提炼出神经嵌入几何中难以解释的若干基本方面。我们认为，这种几何（而非对局部关联的查找）的出现，不能简单归因于典型的监督、架构或优化压力。反直觉的是，即使几何比暴力查找更复杂，它仍然会被学到。随后，通过分析与Node2Vec的联系，我们展示了这种几何源于一种谱偏置；与主流理论相反，这种偏置即使在缺乏上述各种压力的情况下也会自然产生。这一分析还向从业者指出，使Transformer记忆更具几何性仍有明显的提升空间。我们希望参数化记忆的几何视角能促使研究者重新审视在知识获取、容量、发现和遗忘学习等领域中的默认直觉。

英文摘要

Deep sequence models are said to store atomic facts predominantly in the form of associative memory: a brute-force lookup of co-occurring entities. We identify a dramatically different form of storage of atomic facts that we term as geometric memory. Here, the model has synthesized embeddings encoding novel global relationships between all entities, including ones that do not co-occur in training. Such storage is powerful: for instance, we show how it transforms a hard reasoning task involving an $\ell$-fold composition into an easy-to-learn $1$-step navigation task. From this phenomenon, we extract fundamental aspects of neural embedding geometries that are hard to explain. We argue that the rise of such a geometry, as against a lookup of local associations, cannot be straightforwardly attributed to typical supervisory, architectural, or optimizational pressures. Counterintuitively, a geometry is learned even when it is more complex than the brute-force lookup. Then, by analyzing a connection to Node2Vec, we demonstrate how the geometry stems from a spectral bias that -- in contrast to prevailing theories -- indeed arises naturally despite the lack of various pressures. This analysis also points out to practitioners a visible headroom to make Transformer memory more strongly geometric. We hope the geometric view of parametric memory encourages revisiting the default intuitions that guide researchers in areas like knowledge acquisition, capacity, discovery, and unlearning.

2510.24680 2026-05-19 cs.RO

InFeR: Informed Failure Resilience in Learned Visual Navigation Control

InFeR：在学习视觉导航控制中的有信息故障韧性

Zishuo Wang, Joel Loo, David Hsu

发表机构 * School of Computing & Smart Systems Institute, National University of Singapore（新加坡国立大学计算机学院与智能系统研究所）

AI总结该研究提出InFeR框架，通过变分信息瓶颈损失重构潜在空间以检测OOD故障，并利用Grad-CAM技术局部化故障源，从而在无需额外训练数据的情况下实现故障自恢复，提升了复杂环境中的长距离导航鲁棒性。

AI中文摘要

尽管模仿学习（IL）已在许多常见环境中实现了成功的视觉导航，但在分布外（OOD）场景下，IL策略容易出现不可预测的故障。这需要具有故障韧性的策略，不仅能够检测故障，还能识别其来源并自主恢复。我们提出了InFeR，一种通用框架，用于构建具有有信息故障韧性的IL策略，而无需故障或恢复演示。InFeR通过变分信息瓶颈（VIB）损失重新训练IL策略，以结构化其潜在空间以检测OOD故障。它应用视觉可解释性技术Grad-CAM，以局部化图像区域作为故障源，并告知恢复的启发式策略。所有这些都在不需要额外训练数据的情况下实现。现实世界实验表明，InFeR在两种不同的策略架构上实现了有信息的故障恢复，从而在复杂环境中实现了稳健的长距离导航。

英文摘要

While imitation learning (IL) has enabled successful visual navigation in many common environments, IL policies are prone to unpredictable failures under out-of-distribution (OOD) scenarios. This necessitates failure-resilient policies, which not only detect failures, but also recognise their sources and recover from them autonomously. We propose InFeR, a general framework for building IL policies with informed failure resilience without failure or recovery demonstrations. InFeR retrains an IL policy with a Variational Information Bottleneck (VIB) loss to structure its latent space for OOD failure detection. It applies a visual explainability technique, Grad-CAM, to localise an image region as the source of failure and inform a heuristic policy for recovery. All these are achieved without requiring additional training data. Real-world experiments show that InFeR enables informed failure recovery across two different policy architectures, yielding robust long-range navigation in complex environments.

2510.24208 2026-05-19 cs.CL cs.LG

Beyond Neural Incompatibility: Cross-Scale Knowledge Transfer in Language Models through Latent Semantic Alignment

超越神经不兼容：通过潜在语义对齐实现语言模型中的跨尺度知识转移

Jian Gu, Aldeida Aleti, Chunyang Chen, Hongyu Zhang

发表机构 * Monash University（莫纳什大学）； Technical University of Munich（慕尼黑工业大学）； Chongqing University（重庆大学）

AI总结本文提出SemAlign方法，通过潜在语义对齐实现跨尺度知识转移，解决了不同架构和参数化模型间参数重用受限的问题，通过激活值作为转移介质，利用语义分解与重组稳定地实现知识迁移。

Comments an early-stage version

AI中文摘要

语言模型（LMs）在其参数中编码了大量知识，但如何以细粒度方式转移此类知识，即参数化知识转移（PKT）仍不明确。核心挑战是当源模型和目标模型在架构和参数化上存在差异时，如何实现有效的、高效的跨尺度转移，这使得直接参数重用受到神经不兼容的限制。在本文中，我们识别出潜在语义对齐是跨尺度知识转移的关键前提。与直接移动层参数不同，我们的方法使用激活值作为转移介质。SemAlign包含两个阶段：一个层归因阶段，用于归因任务相关的源层并为每个目标层选择恰好一个源层；一个语义对齐阶段，通过逐层配对并优化目标模型，利用源侧语义监督。对齐通过语义分解和重组在潜在空间中进行。在浅层到深层的转移过程中，只有前沿目标层是可训练的。层目标通过匹配中心化的词-词关系几何与对齐的监督残差来监督该层的残差贡献，而输出KL保持源级预测行为。因此，转移介质既不是参数块也不是绝对的隐藏状态，而是由配对源层监督诱导的目标空间残差几何。在四个基准测试中的评估证实了SemAlign的有效性，进一步分析确认语义分解和重组为跨尺度知识转移提供了一个稳定的机制。

英文摘要

Language Models (LMs) encode substantial knowledge in their parameters, yet it remains unclear how to transfer such knowledge in a fine-grained manner, namely parametric knowledge transfer (PKT). A central challenge is to make cross-scale transfer effective and efficient when source and target models differ in architecture and parameterization, so that direct parameter reuse is strongly limited by neural incompatibility. In this paper, we identify latent semantic alignment as the key prerequisite for cross-scale knowledge transfer. Instead of directly moving layer parameters, our approach uses activations as the transfer medium. SemAlign has two stages: a layer attribution stage that attributes task-relevant source layers and selects exactly one source layer for each target layer, and a semantic alignment stage that pairs them layer by layer and optimizes the target with source-side semantic supervision. The alignment is carried out in latent space through semantic decomposition and recomposition. During the shallow-to-deep transfer, only the frontier target layer is trainable. The layer objective supervises the residual contribution of that layer by matching centered token-token relation geometry against an aligned supervisory residual, while output KL preserves source-level predictive behavior. The transferred medium is therefore neither a parameter block nor an absolute hidden state, but target-space residual geometry induced by paired source-layer supervision. Evaluations on four benchmarks demonstrate the efficacy of SemAlign, and further analysis confirms that semantic decomposition and recomposition provide a stable mechanism for cross-scale knowledge transfer.

2510.20584 2026-05-19 cs.CL cs.AI

Automated Coding of Communication Data Using ChatGPT: Consistency Across Subgroups

使用ChatGPT对交流数据进行自动编码：跨子群体的一致性

Jiangang Hao, Wenju Cui, Patrick Kyllonen, Emily Kerzabi

发表机构 * ETS Research Institute（ETS研究机构）

AI总结本文研究了使用ChatGPT对交流数据进行编码时在不同性别和种族/族裔群体间的一致性，发现其跨群体表现与人类评分者一致，为大规模评估协作与沟通提供了可能。

Comments Accepted to the Journal of Educational Measurement

AI中文摘要

大规模评估沟通与协作依赖于一项劳动密集型任务：按照不同的框架将交流数据编码为若干类别。先前研究已证实，可以直接用编码评分细则指示ChatGPT对交流数据进行编码，其准确性与人类评分者相当。然而，ChatGPT或类似AI技术的编码在不同人口群体（如性别和种族）之间是否表现一致仍不清楚。为填补这一空白，我们借鉴自动评分文献中的已有框架，提出了三项检查，用于评估基于LLM的编码的子群体一致性。使用一个典型的协作问题解决编码框架和来自三类协作任务的数据，我们考察了基于ChatGPT的编码在性别和种族/族裔群体中的表现。结果表明，基于ChatGPT的编码在各性别和种族/族裔群体中的表现与人类评分者同样一致，说明其可用于协作与沟通的大规模评估。

英文摘要

Assessing communication and collaboration at scale depends on the labor-intensive task of coding communication data into categories according to different frameworks. Prior research has established that ChatGPT can be directly instructed with coding rubrics to code communication data and achieves accuracy comparable to human raters. However, whether the coding from ChatGPT or similar AI technology performs consistently across different demographic groups, such as gender and race, remains unclear. To address this gap, we introduce three checks for evaluating subgroup consistency in LLM-based coding by adapting an existing framework from the automated scoring literature. Using a typical collaborative problem-solving coding framework and data from three types of collaborative tasks, we examine ChatGPT-based coding performance across gender and racial/ethnic groups. Our results show that ChatGPT-based coding performs as consistently as human raters across gender and racial/ethnic groups, demonstrating the possibility of its use in large-scale assessments of collaboration and communication.

2510.11391 2026-05-19 cs.CV cs.AI cs.CL

DocReward: A Document Reward Model for Structuring and Stylizing

DocReward: 一种用于文档结构化和风格化的文档奖励模型

Junpeng Liu, Yuzhong Zhao, Bowen Cao, Jiayu Ding, Yilin Jia, Tengchao Lv, Yupan Huang, Wenshan Wu, Shaohan Huang, Nan Yang, Li Dong, Lei Cui, Tao Ge, Xun Wang, Huitian Jiao, Sun Mao, FNU Kartik, Si-Qing Chen, Wai Lam, Furu Wei

发表机构 * CUHK（香港中文大学）； UCAS（中国科学院大学）； XJTU（西安交通大学）； UMich（密歇根大学）； Microsoft（微软）

AI总结本文提出DocReward，一种用于评估文档结构和风格的奖励模型，通过构建包含117,000对文档的DocPair数据集，采用Bradley-Terry损失训练，有效提升了文档生成的结构和风格专业性。

AI中文摘要

近期的代理工作流程自动化了专业文档生成，但主要关注文本质量，忽视了结构和风格的专业性，这对于可读性同样至关重要。这一差距主要源于缺乏有效的奖励模型，无法引导代理生成结构和风格专业的文档。我们引入DocReward，一种评估文档结构和风格的文档奖励模型。为此，我们提出了一种文本质量无关的框架，确保评估不受内容质量的影响，并构建了包含117,000对文档的DocPair数据集，涵盖32个领域和267种类型。每对文档内容相同，但结构和风格专业性不同。DocReward使用Bradley-Terry损失进行训练。在人工标注的基准测试中，DocReward在相同设置下比GPT-5高出14.6个百分点。强化学习实验进一步表明，DocReward能有效引导代理生成具有更一致结构和风格专业性的文档，突显了其实际应用价值。

英文摘要

Recent agentic workflows automate professional document generation but focus narrowly on textual quality, overlooking structural and stylistic professionalism, which is equally critical for readability. This gap stems mainly from a lack of effective reward models capable of guiding agents toward producing documents with high structural and stylistic professionalism. We introduce DocReward, a document reward model that evaluates documents based on their structure and style. To achieve this, we propose a textual-quality-agnostic framework that ensures assessments are not confounded by content quality, and construct DocPair, a dataset of 117K paired documents covering 32 domains and 267 types. Each pair shares identical content but differs in structural and stylistic professionalism. DocReward is trained using the Bradley-Terry loss. On a manually annotated benchmark, DocReward outperforms GPT-5 by 14.6 percentage points in the same setting. Reinforcement learning experiments further show that DocReward effectively guides agents toward generating documents with consistently higher structural and stylistic professionalism, highlighting its practical utility.
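For reference, the Bradley-Terry loss for a preference pair is $-\log \sigma(r_w - r_l)$, where $r_w$ and $r_l$ are the scalar rewards assigned to the preferred and the less preferred item. The sketch below computes it in plain Python; how DocReward produces the scores and batches the pairs is not stated in the abstract, so the function names and the example scores are assumptions.

```python
import math


def bradley_terry_loss(score_preferred, score_rejected):
    """-log(sigmoid(score_preferred - score_rejected)), computed stably."""
    diff = score_preferred - score_rejected
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    # For negative diff, rewrite to avoid overflow in exp(-diff).
    return -diff + math.log1p(math.exp(diff))


def mean_pairwise_loss(pairs):
    """Average loss over (score_preferred, score_rejected) pairs."""
    return sum(bradley_terry_loss(w, l) for w, l in pairs) / len(pairs)
```

The loss is log 2 when the two documents receive equal scores and shrinks toward zero as the more professional document is scored increasingly higher, which is what pushes a reward model to rank each pair correctly.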

2510.10930 2026-05-19 cs.CL cs.AI

Evaluating Language Models' Evaluations of Games

评估语言模型对游戏的评估

Katherine M. Collins, Cedegao E. Zhang, Graham Todd, Lance Ying, Mauricio Barba da Costa, Ryan Liu, Prafull Sharma, Adrian Weller, Ionatan Kuperwajs, Lionel Wong, Joshua B. Tenenbaum, Thomas L. Griffiths

发表机构 * University of Cambridge（剑桥大学）； MIT（麻省理工学院）； Princeton University（普林斯顿大学）； NYU（纽约大学）； Harvard University（哈佛大学）； Stanford University（斯坦福大学）

AI总结本文研究了语言模型对游戏评估的能力，通过比较现代语言模型和人类及符号计算代理的评估结果，发现推理模型在游戏评估上更接近人类，但随着模型接近博弈最优，其与人类数据的匹配度会减弱，且在评估趣味性时表现出更大的波动。

AI中文摘要

推理不仅仅是解决问题，也是评估哪些问题值得解决。人工智能系统的历史评估主要集中在解决问题上，通过研究模型如何玩国际象棋和围棋等游戏。在本文中，我们倡导一种新的范式，即评估人工智能系统对游戏的评估。首先，我们引入了一种评估此类评估的形式化方法。然后利用超过100种新型棋盘游戏和450份人类判断的大型数据集，将现代语言和推理模型的评估结果与人类和符号计算代理的评估结果进行比较。我们考虑了两种类型的评估查询：评估游戏的收益（或公平性）和趣味性。这些查询涵盖了两个与AI评估设计相关的重要维度：计算查询的复杂性和量化查询的难度。我们的结果表明，推理模型在游戏评估上通常比非推理语言模型更接近人类。然而，我们观察到非单调的关系：随着模型接近博弈最优，其与人类数据的匹配度会减弱。我们还发现，在评估趣味性时，模型之间存在更多的波动性，这与量化该查询的难度更大有关。在各种查询和游戏中，推理模型在评估查询时表现出高度变化和不可预测的资源使用，这表明在语言和推理模型中加入更多资源理性的元推理非常重要。

英文摘要

Reasoning is not just about solving problems -- it is also about evaluating which problems are worth solving at all. Evaluations of artificial intelligence (AI) systems primarily focused on problem solving, historically by studying how models play games such as chess and Go. In this paper, we advocate for a new paradigm that assesses AI systems' evaluation of games. First, we introduce a formalism for evaluating such evaluations. We then leverage a large-scale dataset of over 100 novel board games and over 450 human judgments to compare evaluations produced by modern language and reasoning models against those of people and symbolic computational agents. We consider two kinds of evaluative queries: assessing the payoff (or fairness) and the funness of games. These queries span two dimensions relevant to the design of evaluations of AI evaluations: how complex a query is to compute and how difficult a query is to quantify. Our results show that reasoning models are generally more aligned to people in their evaluations of games than non-reasoning language models. However, we observe a non-monotonic relationship: as models get closer to game-theoretic optimal, their fit to human data weakens. We also observe more "jaggedness" across models for assessing funness, in line with the greater difficulty of quantifying this query. Across queries and games, reasoning models show highly variable and unpredictable resource usage when assessing queries, pointing to the importance of imbuing more resource-rational meta-reasoning in language and reasoning models.

2510.08141 2026-05-19 cs.LG

SCOPE-RL: Stable and Quantitative Control of Policy Entropy in RL Post-Training

SCOPE-RL: 稳定和定量控制强化学习后训练中的策略熵

Chen Wang, Zhaochun Li, Jionghao Bai, Hexuan Deng, Ge Lan, Yue Wang

发表机构 * College of Software, Nankai University（南开大学软件学院）； Zhongguancun Academy（中关村学院）； Beijing Institute of Technology（北京理工大学）； Zhejiang University（浙江大学）； Harbin Institute of Technology（哈尔滨工业大学）

AI总结本文提出SCOPE-RL框架，通过温度自适应的正样本构造正则化项，稳定并定量控制强化学习后训练中的策略熵，实验表明其在Pass@1和Pass@$k$任务上优于现有基线方法。

AI中文摘要

强化学习（RL）是大型语言模型（LLMs）后训练的关键范式，但广泛使用的分组相对策略优化（GRPO）常面临熵崩溃问题：探索迅速消失，策略过早收敛，样本多样性下降，最终损害训练效果。现有解决方案（包括熵奖励和基于裁剪的方法）很少能将熵保持在稳定的探索范围内，且常引入熵振荡或奖励退化。在本文中，我们识别出熵动态中一个此前被忽视的不对称性：在高温度采样下，正样本和负样本对策略熵有相反的影响。具体而言，高温度正样本促进熵增长，而负样本抑制熵增长。我们为此现象提供了理论解释：当熵在策略更新过程中下降时，在正样本更新下其对温度的导数严格为正，表明高温度正样本可以抵消熵衰减，从而减缓熵崩溃并可能将其逆转。受此启发，我们提出了SCOPE-RL，一个稳定且定量的熵控制框架，其正则化项由温度自适应的正样本构造。大量实验表明，SCOPE-RL在Pass@1和Pass@$k$上均一致优于强RL基线方法。我们的结果提供了证据，表明摆脱熵崩溃可以提高推理性能，同时也显示这种收益是非单调的：对于推理LLMs的RL后训练，存在一个最优的探索水平。

英文摘要

Reinforcement learning (RL) is a key paradigm for post-training large language models (LLMs), but the widely used Group Relative Policy Optimization (GRPO) often suffers from entropy collapse: exploration quickly disappears, policies converge prematurely, and sample diversity declines, ultimately harming training effectiveness. Existing remedies, including entropy bonuses and clip-based methods, rarely keep entropy within a stable exploration regime and often introduce oscillatory entropy or reward degradation. In this work, we identify a previously overlooked asymmetry in entropy dynamics: under high-temperature sampling, positive and negative samples have opposite effects on policy entropy. Specifically, high-temperature positive samples promote entropy growth, whereas negative samples suppress it. We provide a theoretical explanation for this phenomenon: when entropy decreases during policy updates, its derivative with respect to temperature is strictly positive under positive-sample updates, indicating that high-temperature positive samples can counteract entropy decay, thereby slowing entropy collapse and potentially reversing it. Motivated by this insight, we propose SCOPE-RL, a stable and quantitative entropy control framework through a regularization term constructed from temperature-adaptive positive samples. Extensive experiments show that SCOPE-RL consistently outperforms strong RL baselines on both Pass@1 and Pass@$k$. Our results provide evidence that escaping entropy collapse can improve reasoning performance, while also showing that the benefit is non-monotonic, with an optimal level of exploration for RL post-training in reasoning LLMs.

2510.06809 2026-05-19 cs.CV

VA-Adapter: Adapting Ultrasound Foundation Model to Echocardiography Probe Guidance

VA-Adapter：将超声基础模型适应于超声心动图探头引导

Teng Wang, Haojun Jiang, Yuxuan Wang, Zhenguo Sun, Yujiao Deng, Shiji Song, Gao Huang

发表机构 * Department of Automation, BNRist, Tsinghua University, Beijing, China（清华大学自动化系、BNRist，中国北京）； School of Computer Science and Technology, Xidian University（西安电子科技大学计算机科学与技术学院）； Beijing Academy of Artificial Intelligence（北京人工智能研究院）； Chinese PLA General Hospital（中国人民解放军总医院）

AI总结本文提出VA-Adapter，通过将超声基础模型与理解个体三维结构的能力相结合，提高超声心动图探头引导的精度和效率，实验表明其在参数量较少的情况下表现优于现有模型。

Comments MICCAI2026 Early Accept Paper

AI中文摘要

超声心动图是检测心脏疾病的关键工具，但其操作难度高导致专业人员短缺。探头引导系统通过辅助获取高质量图像，提供了降低操作门槛的有前景的解决方案。然而，由于显著的个体差异，稳健的探头引导仍具挑战性。这种差异表现为二维图像中低级特征的差异，这使得图像特征理解复杂化，以及个体三维结构的差异，这给精确导航带来挑战。为了解决这些挑战，我们首先提出利用超声基础模型从大量数据集中学习的稳健图像表示。然而，将这些模型应用于探头导航是困难的，因为它们缺乏对个体三维结构的理解。为此，我们精心设计了视觉-动作适配器（VA-Adapter）以在线注入理解个体三维结构的能力。具体来说，通过将VA-Adapter嵌入基础模型的图像编码器中，模型可以从历史视觉-动作序列中推断心脏解剖结构，模拟超声技师的认知过程。在包含超过131万样本的数据集上进行的广泛实验表明，VA-Adapter在参数量少约33倍的情况下优于现有探头引导模型。代码可在https://github.com/LeapLabTHU/VA-Adapter上获得。

英文摘要

Echocardiography is a critical tool for detecting heart diseases, yet its steep operational difficulty causes a shortage of skilled personnel. Probe guidance systems, which assist in acquiring high-quality images, offer a promising solution to lower this operational barrier. However, robust probe guidance remains challenging due to significant individual variability. This variability manifests as differences in low-level features within two-dimensional (2D) images, which complicates image feature understanding, and differences in individual three-dimensional (3D) structures, which poses challenges for precise navigation. To address these challenges, we first propose leveraging the robust image representations learned by ultrasound foundation models from vast datasets. Yet, applying these models to probe navigation is non-trivial due to their lack of understanding of individual 3D structures. To this end, we meticulously design a Vision-Action Adapter (VA-Adapter) to online inject the capability of understanding individual 3D structures. Specifically, by embedding the VA-Adapter into the foundation model's image encoder, the model can infer cardiac anatomy from historical vision-action sequences, mimicking the cognitive process of a sonographer. Extensive experiments on a dataset with over 1.31M samples demonstrate that the VA-Adapter outperforms strong probe guidance models while requiring approximately 33 times fewer trained parameters. Code is available at https://github.com/LeapLabTHU/VA-Adapter.

2510.04930 2026-05-19 cs.LG

Egalitarian Gradient Descent: A Simple Approach to Accelerated Grokking

平等梯度下降：一种加速 Grokking 的简单方法

Ali Saheb Pasand, Elvis Dohmatob

发表机构 * McGill University（麦吉尔大学）； Mila Institute（Mila研究院）； Concordia University（康科迪亚大学）

AI总结本文提出平等梯度下降（EGD）方法，通过规范化梯度使所有主方向的动态以相同速度演化，从而加速模型的 Grokking 过程，消除测试性能的停滞现象。

AI中文摘要

Grokking 是这样一种现象：训练性能在训练早期就达到峰值，而模型的测试/泛化性能却会在任意多个周期内停滞，然后突然跃升至通常接近完美的水平。在实践中，缩短此类停滞期是有益的，也就是让学习过程更快地“grok”。在本工作中，我们提供了对 Grokking 的新见解。首先，我们从实证和理论两方面证明，（随机）梯度下降沿梯度的不同主方向（即奇异方向）速度不对称，可以诱发 Grokking。然后，我们提出了一种简单的修改：对梯度进行归一化，使所有主方向上的动态以完全相同的速度演化。接着，我们证明这种修改后的方法（我们称之为平等梯度下降，EGD；它可被视为一种精心修改的自然梯度下降）能够显著更快地 grok。事实上，在某些情况下，停滞被完全消除。最后，我们通过实验表明，在模加法和稀疏奇偶等经典算术问题上（这类停滞在这些问题上已被广泛观察并深入研究），我们提出的方法消除了停滞期。

英文摘要

Grokking is the phenomenon whereby, unlike the training performance, which peaks early in the training process, the test/generalization performance of a model stagnates over arbitrarily many epochs and then suddenly jumps to usually close to perfect levels. In practice, it is desirable to reduce the length of such plateaus, that is, to make the learning process "grok" faster. In this work, we provide new insights into grokking. First, we show both empirically and theoretically that grokking can be induced by asymmetric speeds of (stochastic) gradient descent along different principal (i.e., singular) directions of the gradients. We then propose a simple modification that normalizes the gradients so that the dynamics along all the principal directions evolve at exactly the same speed. Then, we establish that this modified method, which we call egalitarian gradient descent (EGD) and which can be seen as a carefully modified form of natural gradient descent, groks much faster. In fact, in some cases the stagnation is completely removed. Finally, we empirically show that on classical arithmetic problems such as modular addition and sparse parity, where this stagnation has been widely observed and intensively studied, our proposed method eliminates the plateaus.

2510.02590 2026-05-19 cs.LG

Use the Online Network If You Can: Towards Fast and Stable Reinforcement Learning

在可以的时候使用在线网络：迈向快速且稳定的强化学习

Ahmed Hendawy, Henrik Metternich, Théo Vincent, Mahdi Kallel, Jan Peters, Carlo D'Eramo

发表机构 * Technical University of Darmstadt（达姆施塔特工业大学）； German Research Center for AI (DFKI)（德国人工智能研究中心（DFKI））； Robotics Institute Germany (RIG)（德国机器人研究所（RIG））； University of Würzburg（维尔茨堡大学）

AI总结本文提出了一种新的更新规则，通过在目标网络和在线网络之间取最小估计来改进价值函数学习，从而实现更快且更稳定的强化学习。

Comments Accepted at the Fourteenth International Conference on Learning Representations (ICLR 2026)

AI中文摘要

在深度强化学习（RL）中，使用目标网络来估计价值函数是一种流行的方法。目标网络虽然有效，但仍是一种折中方案：它以目标移动缓慢为代价来保持稳定性，从而延迟了学习。相反，使用在线网络作为自举（bootstrapped）目标在直觉上很有吸引力，但众所周知会导致学习不稳定。在本文中，我们旨在兼得两者之长，引入一种新的更新规则，使用目标网络和在线网络两者估计中的最小值来计算目标，由此得到我们的方法MINTO。通过这一简单而有效的修改，我们表明MINTO能够缓解使用在线网络进行自举时潜在的过估计偏差，从而实现更快且稳定的价值函数学习。值得注意的是，MINTO可以无缝集成到多种基于价值的算法和演员-评论家算法中，额外成本可以忽略不计。我们在多种基准上对MINTO进行了广泛评估，涵盖在线和离线RL以及离散和连续动作空间。在所有基准上，MINTO都一致地提高了性能，展示了其广泛的适用性和有效性。

英文摘要

The use of target networks is a popular approach for estimating value functions in deep Reinforcement Learning (RL). While effective, the target network remains a compromise solution that preserves stability at the cost of slowly moving targets, thus delaying learning. Conversely, using the online network as a bootstrapped target is intuitively appealing, albeit well-known to lead to unstable learning. In this work, we aim to obtain the best out of both worlds by introducing a novel update rule that computes the target using the MINimum estimate between the Target and Online network, giving rise to our method, MINTO. Through this simple, yet effective modification, we show that MINTO enables faster and stable value function learning, by mitigating the potential overestimation bias of using the online network for bootstrapping. Notably, MINTO can be seamlessly integrated into a wide range of value-based and actor-critic algorithms with a negligible cost. We evaluate MINTO extensively across diverse benchmarks, spanning online and offline RL, as well as discrete and continuous action spaces. Across all benchmarks, MINTO consistently improves performance, demonstrating its broad applicability and effectiveness.
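The update rule described above (bootstrapping from the minimum of the target-network and online-network estimates) can be sketched for a Q-learning-style setting. This is a sketch under stated assumptions: the abstract does not say whether the minimum is taken before or after the max over actions, so the greedy-value form below, the function name, and the list-of-floats inputs are all hypothetical.

```python
def min_bootstrap_target(reward, gamma, q_target_next, q_online_next, done=False):
    """One-step TD target that bootstraps from the smaller of two estimates.

    q_target_next / q_online_next: action values for the next state, as produced
    by the target network and the online network respectively (plain lists here).
    Assumption: each network's greedy value is computed first, then the minimum
    of the two is used; the paper may define the minimum differently.
    """
    if done:
        # No bootstrapping past a terminal transition.
        return reward
    bootstrap = min(max(q_target_next), max(q_online_next))
    return reward + gamma * bootstrap


# The online network is more optimistic here (3.0 vs 2.0), so the target
# network's estimate is the one that ends up in the TD target.
print(min_bootstrap_target(1.0, 0.9, [1.0, 2.0], [3.0, 0.5]))
```

Taking the smaller estimate is what limits the overestimation that pure online-network bootstrapping can cause, while still letting the fresher online estimate be used whenever it is the lower of the two.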

2509.23183 2026-05-19 cs.LG cs.NI

ZeroSiam: An Efficient Asymmetry for Test-Time Entropy Optimization without Collapse

ZeroSiam: 一种高效的非对称方法用于测试时熵优化而不发生崩溃

Guohao Chen, Shuaicheng Niu, Deyu Chen, Jiahao Yang, Zitian Zhang, Mingkui Tan, Pengcheng Wu, Zhiqi Shen

发表机构 * Nanyang Technological University（南洋理工大学）； Joint WeBank-NTU Research Institute on Fintech（微众银行-南洋理工大学金融科技联合研究院）； South China University of Technology（华南理工大学）

AI总结本文提出ZeroSiam，一种针对测试时熵最小化的高效非对称架构，通过非对称散度对齐防止崩溃，并通过可学习预测器和stop-gradient操作符高效实现；实证和理论证据表明，它既能防止崩溃，又能正则化有偏的学习信号，从而提升性能，在易崩溃的小模型上也表现稳定。

AI中文摘要

测试时熵最小化有助于使模型适应新环境并激发其推理能力，让模型在推理过程中利用自身预测实时演化和改进，从而释放模型潜力并取得有前景的性能。然而，纯粹的熵最小化可能偏好不可泛化的捷径，例如放大logit范数并将所有预测推向某个主导类别以降低熵，从而有陷入崩溃解（例如恒定的独热输出）的风险；这类解无需任何有意义的学习就能平凡地最小化目标函数。在本文中，我们揭示了非对称性是防止崩溃的关键机制，并引入了ZeroSiam：一种专为测试时熵最小化设计的高效非对称孪生架构。ZeroSiam通过非对称散度对齐来防止崩溃，这一点由位于分类器之前的可学习预测器和stop-gradient操作符高效实现。我们提供了实证和理论证据，表明ZeroSiam不仅能够防止崩溃，还能正则化有偏的学习信号，即使在未发生崩溃时也能提升性能。尽管方法简单，大量结果表明，ZeroSiam以可忽略的开销取得了比先前方法更稳定的表现，并在具有挑战性的测试场景和多种模型（包括特别容易崩溃的微型模型）上，对视觉适应和大语言模型推理任务均展现出有效性。

英文摘要

Test-time entropy minimization helps adapt a model to novel environments and incentivize its reasoning capability, unleashing the model's potential during inference by allowing it to evolve and improve in real-time using its own predictions, achieving promising performance. However, pure entropy minimization can favor non-generalizable shortcuts, such as inflating the logit norm and driving all predictions to a dominant class to reduce entropy, risking collapsed solutions (e.g., constant one-hot outputs) that trivially minimize the objective without meaningful learning. In this paper, we reveal asymmetry as a key mechanism for collapse prevention and introduce ZeroSiam--an efficient asymmetric Siamese architecture tailored for test-time entropy minimization. ZeroSiam prevents collapse through asymmetric divergence alignment, efficiently achieved by a learnable predictor and a stop-gradient operator before the classifier. We provide empirical and theoretical evidence that ZeroSiam not only prevents collapse, but also regularizes biased learning signals, enhancing performance even when no collapse occurs. Despite its simplicity, extensive results show that ZeroSiam performs more stably over prior methods using negligible overhead, demonstrating efficacy on both vision adaptation and large language model reasoning tasks across challenging test scenarios and diverse models, including particularly collapse-prone tiny models.
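The shortcut mentioned above, lowering entropy simply by inflating the logit norm, is easy to reproduce numerically. The sketch below illustrates only that failure mode, not ZeroSiam itself; the logits and scale factors are made-up values.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


logits = [2.0, 1.0, 0.5]  # hypothetical classifier outputs for one input
for scale in (1.0, 4.0, 16.0):
    scaled = [scale * z for z in logits]
    print(scale, entropy(softmax(scaled)))
```

The predicted class never changes, yet the entropy falls toward zero as the scale grows, which is why an entropy objective alone can be driven down without any meaningful learning.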

2509.22244 2026-05-19 cs.CV

FlashEdit: Decoupling Speed, Structure, and Semantics for Precise Image Editing

FlashEdit: 解耦速度、结构和语义以实现精确图像编辑

Junyi Wu, Zhiteng Li, Haotong Qin, Yulun Zhang, Xiaokang Yang

发表机构 * Shanghai Jiao Tong University（上海交通大学）； ETH Zürich（苏黎世联邦理工学院）

AI总结本文提出FlashEdit，一种高效的局部图像编辑框架，通过解耦速度、结构和语义来实现精确编辑，实验表明其在保真度和效率之间取得了良好的平衡。

Comments Our code will be made publicly available at https://github.com/JunyiWuCode/FlashEdit

AI中文摘要

基于扩散模型的文本引导图像编辑已取得了出色的质量，但往往受困于过高的延迟。我们介绍了FlashEdit，一种面向标准的基于反演（inversion）的编辑设置的实时局部图像编辑框架。其效率和精度源于三个关键创新：（1）循环一致的一步反演（COSI）管道，通过循环一致性促使一步反演与流形对齐；（2）背景屏蔽（BG-Shield）技术，通过对结构自注意力的干预更好地保留非编辑区域；（3）稀疏化空间交叉注意力（SSCA）机制，通过抑制语义泄漏促进精确编辑。在PIE-Bench上的实验表明，FlashEdit在保真度和效率之间取得了良好的权衡，编辑可在0.2秒内完成，比基于DDIM的多步编辑快150倍以上。我们的代码将在https://github.com/JunyiWuCode/FlashEdit上公开发布。

英文摘要

Text-guided image editing with diffusion models has achieved remarkable quality but often suffers from prohibitive latency. We introduce FlashEdit, a real-time localized image editing framework for the standard inversion-based editing setting. Its efficiency and precision stem from three key innovations: (1) a Cycle-Consistent One-Step Inversion (COSI) pipeline that encourages manifold-aligned one-step inversion through cycle consistency; (2) a Background Shield (BG-Shield) technique that improves preservation of non-edited regions via structural self-attention intervention; and (3) a Sparsified Spatial Cross-Attention (SSCA) mechanism that promotes precise edits by suppressing semantic leakage. Experiments on PIE-Bench demonstrate a strong preservation-efficiency trade-off, with edits completed in under 0.2 seconds and an over 150× speedup over DDIM-based multi-step editing. Our code will be made publicly available at https://github.com/JunyiWuCode/FlashEdit.

2509.17680 2026-05-19 cs.CL

When TableQA Meets Noise: A Dual Denoising Framework for Complex Questions and Large-scale Tables

当表格问答遇见噪声：为复杂问题和大规模表格设计的双去噪框架

Shenghao Ye, Yu Guo, Dong Jin, Yikai Shen, Yunpeng Hou, Shuangwu Chen, Jian Yang, Xiaofeng Jiang

发表机构 * University of Science and Technology of China（中国科学技术大学）； The University of Melbourne（墨尔本大学）； Institute of Artificial Intelligence, Hefei Comprehensive National Science Center（合肥综合性国家科学中心人工智能研究院）

AI总结本文提出EnoTab双去噪框架，通过改进相关性过滤和表格修剪能力，解决复杂问题和大规模表格中的噪声问题，提升表格问答性能。

Comments 24 pages, 24 figures, accepted to ACL 2026 Main

AI中文摘要

表格问答（TableQA）是自然语言处理（NLP）中的基本任务。大语言模型（LLMs）强大的推理能力在这一领域带来了显著进展。然而，随着实际应用中问题日益复杂且表格规模增大，大量噪声数据被引入，严重降低了推理性能。为了解决这一挑战，我们专注于提升两个核心能力：相关性过滤，即识别并保留与推理真正相关的信息；以及表格修剪，即在保留必要内容的同时减少表格规模。基于这些原则，我们提出了EnoTab，一种为复杂问题和大规模表格设计的双去噪框架。具体来说，我们首先进行基于证据的问题去噪，将问题分解为最小的语义单元，并根据一致性和可用性标准过滤掉与答案推理无关的单元。然后，我们提出证据树引导的表格去噪，构建一条明确且透明的表格修剪路径，逐步移除无关数据。在每一步修剪过程中，我们观察表格的中间状态，并应用后序节点回滚机制来处理异常表格状态，最终产生一个高度可靠的子表格用于最终答案推理。最后，大量实验表明，EnoTab在涉及复杂问题和大规模表格的TableQA任务中取得了出色的性能，证实了其有效性。

英文摘要

Table question answering (TableQA) is a fundamental task in natural language processing (NLP). The strong reasoning capabilities of large language models (LLMs) have brought significant advances in this field. However, as real-world applications involve increasingly complex questions and larger tables, substantial noisy data is introduced, which severely degrades reasoning performance. To address this challenge, we focus on improving two core capabilities: Relevance Filtering, which identifies and retains information truly relevant to reasoning, and Table Pruning, which reduces table size while preserving essential content. Based on these principles, we propose EnoTab, a dual denoising framework for complex questions and large-scale tables. Specifically, we first perform Evidence-based Question Denoising by decomposing the question into minimal semantic units and filtering out those irrelevant to answer reasoning based on consistency and usability criteria. Then, we propose Evidence Tree-guided Table Denoising, which constructs an explicit and transparent table pruning path to remove irrelevant data step by step. At each pruning step, we observe the intermediate state of the table and apply a post-order node rollback mechanism to handle abnormal table states, ultimately producing a highly reliable sub-table for final answer reasoning. Finally, extensive experiments show that EnoTab achieves outstanding performance on TableQA tasks with complex questions and large-scale tables, confirming its effectiveness.

2509.14004 2026-05-19 cs.CL

Early Stopping Chain-of-thoughts in Large Language Models

大语言模型中的早期停止思维链

Minjia Mao, Bowen Yin, Yu Zhu, Xiao Fang

发表机构 * University of Delaware（特拉华大学）； Peking University（北京大学）

AI总结本文提出了一种在推理阶段减少思维链生成长度的方法ES-CoT，通过检测答案收敛并提前停止来降低推理成本，同时保持与标准思维链相当的准确性。

AI中文摘要

大语言模型（LLMs）在通过生成长链式思维（CoT）解决复杂问题时表现出卓越的能力，但这种长CoT会带来较高的推理成本。先前的推理阶段高效推理方法要么需要白盒模型监控推理过程，要么通过直接提示不可靠。为此，我们引入了ES-CoT，一种在推理时缩短CoT生成的方法，通过检测答案收敛并提前停止几乎不损失性能。当观察到推理过程中的语言标记（如“wait”）时，我们提示LLM输出其当前最终答案，称为步骤答案。我们跟踪连续相同步骤答案的运行长度作为答案收敛的度量。我们通过实证和理论证明，步骤答案稳定地收敛到最终答案，且大运行长度跳跃可靠地标记这种收敛。在六个跨三个LLM的推理数据集上的实验表明，ES-CoT在平均上减少了16.08%的推理令牌数量，同时保持与标准CoT相当的准确性。

英文摘要

Reasoning large language models (LLMs) have demonstrated superior capacities in solving complicated problems by generating long chain-of-thoughts (CoT), but such a lengthy CoT incurs high inference costs. Previous methods on inference-stage efficient reasoning either require white-box models to monitor the reasoning process or are not reliable through direct prompting. In response, we introduce ES-CoT, an inference-time method that shortens CoT generation by detecting answer convergence and stopping early with almost no performance loss. When observing a linguistic marker (such as "wait") in the reasoning process, we prompt the LLM to output its current final answer, denoted as a step answer. We then track the run length of consecutive identical step answers as a measure of answer convergence. We show both empirically and theoretically that step answers steadily converge to the final answer, and large run-length jumps reliably mark this convergence. Experiments on six reasoning datasets across three LLMs show that ES-CoT reduces the number of inference tokens by 16.08% on average while maintaining accuracy comparable to standard CoT.
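The convergence signal described above, the run length of consecutive identical step answers, can be sketched as follows. The abstract says stopping is triggered by large run-length jumps; the fixed threshold used here is a simplified stand-in for that criterion, and the function names and sample answers are hypothetical.

```python
def run_lengths(step_answers):
    """For each step, count how many consecutive steps have given the same answer."""
    lengths = []
    previous = object()  # sentinel that never equals a real answer
    current = 0
    for answer in step_answers:
        current = current + 1 if answer == previous else 1
        previous = answer
        lengths.append(current)
    return lengths


def first_stop_index(step_answers, min_run_length):
    """Index of the first step whose run length reaches the threshold, else None.

    Assumption: a fixed threshold replaces the paper's run-length-jump test.
    """
    for index, length in enumerate(run_lengths(step_answers)):
        if length >= min_run_length:
            return index
    return None


# Step answers elicited each time a marker such as "wait" appears in the reasoning.
answers = ["12", "15", "15", "7", "7", "7", "7"]
print(run_lengths(answers))          # [1, 1, 2, 1, 2, 3, 4]
print(first_stop_index(answers, 3))  # 5
```

Generation would be cut off at the returned step and the current step answer reported, which is where the token savings come from.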

2509.06984 2026-05-19 cs.LG cs.AI

FediLoRA: Practical Federated Fine-Tuning of Foundation Models Under Missing-Modality Constraints

FediLoRA: 在缺失模态约束下联邦微调基础模型的实用方法

Lishan Yang, Wei Emma Zhang, Nam Kha Nguygen, Po Hu, Yanjun Shu, Weitong Chen, Mong Yuan Sim

发表机构 * Adelaide University（阿德莱德大学）； Central China Normal University（华中师范大学）； Harbin Institute of Technology（哈尔滨工业大学）

AI总结本文提出FediLoRA，一种轻量级的联邦LoRA聚合框架，旨在解决联邦学习中异构环境下的缺失模态问题，通过联合简单平均和结构化编辑提升全局和个性化模型性能，实现在多个通用领域和医疗领域基准数据集上的强大表现。

Comments 8 pages, 7 figures

AI中文摘要

联邦学习与LoRA微调相结合，为各机构协作利用其大规模数据集训练VLLMs提供了一种高效且注重隐私的解决方案。然而，参与机构通常拥有异构的计算资源，导致LoRA秩不平衡，这对有效协作构成重大挑战。此外，医疗和交通等领域的现实应用常因用户失误或设备故障而出现模态缺失，这会显著降低联邦设置中的全局模型性能。据我们所知，此前尚无工作在联邦VLLMs中同时解决这两个挑战。为了解决这些问题，我们提出FediLoRA，一种轻量级的联邦LoRA聚合框架，能够有效减轻异构环境中模态缺失的影响。FediLoRA的明确动机来自这样一个观察：简单平均和结构化编辑可以共同使全局模型和个性化模型受益。我们的方法在多个通用领域和医疗领域基准数据集上取得了强大性能。在医疗数据上的额外实验进一步表明，FediLoRA非常适合实际的现实部署场景。我们的代码已发布在https://github.com/gotobcn8/FediLoRA。

英文摘要

Federated Learning with LoRA fine-tuning offers an efficient and privacy-aware solution for institutions to collaboratively leverage their large datasets to train VLLMs. However, participating institutions often possess heterogeneous computational resources, resulting in imbalanced LoRA ranks, which pose a major challenge for effective collaboration. In addition, real-world applications in domains such as healthcare and transportation frequently suffer from missing modalities due to user mistakes or device failures, which significantly degrade global model performance in federated settings. To the best of our knowledge, no prior work has addressed these two challenges simultaneously in federated VLLMs. To tackle these issues, we propose FediLoRA, a lightweight federated LoRA aggregation framework that effectively mitigates the impact of missing modalities in heterogeneous environment. FediLoRA is explicitly motivated by the observation that simple averaging and structured editing can jointly benefit both global and personalized models. Our approach achieves strong performance across multiple general-domain and medical-domain benchmark datasets. Additional experiments on healthcare data further demonstrate that FediLoRA is well-suited for practical, real-world deployment scenarios. Our code is released at https://github.com/gotobcn8/FediLoRA.

2508.17431 2026-05-19 cs.CV cs.AI cs.LG

FedKLPR: KL-Guided Pruning-Aware Federated Learning for Person Re-Identification

FedKLPR: 面向行人重识别的KL引导剪枝感知联邦学习

Po-Hsien Yu, Yu-Syuan Tseng, Shao-Yi Chien

发表机构 * Media IC and System Lab, the Graduate Institute of Electronics Engineering and Department of Electrical Engineering, National Taiwan University（媒体IC与系统实验室，电子工程研究所及电气工程系，国立台湾大学）

AI总结本文提出FedKLPR框架，通过KL散度引导训练、非结构化剪枝和跨轮次恢复技术，解决联邦学习在行人重识别中的统计异质性和通信开销问题；实验表明，与现有最先进方法相比，其在ResNet-50上将通信成本降低40%至42%，并取得更好的总体性能。

Comments 10 pages, 3 figures, 5 tables, submitted to IEEE Transactions on Multimedia

AI中文摘要

行人重识别（re-ID）是智能监控和公共安全中的基本任务。联邦学习（FL）提供了一种隐私保护的协同模型训练范式，无需集中收集数据。然而，由于非独立同分布（non-IID）客户端数据导致的统计异质性，以及频繁传输大规模模型带来的巨大通信开销，在现实世界的re-ID系统中部署FL仍然具有挑战性。为了解决这些挑战，我们提出了FedKLPR，一种轻量且通信高效的行人重识别联邦学习框架。FedKLPR包含三个关键组件。首先，引入KL散度引导训练，包括KL散度正则化损失（KLL）和KL散度聚合权重（KLAW），以缓解统计异质性并提高非IID设置下的收敛稳定性。其次，引入非结构化剪枝以减少通信开销，并提出剪枝率聚合权重（PRAW）来衡量剪枝后客户端参数的相对重要性。PRAW与KLAW共同构成KL散度-剪枝加权聚合（KLPWA），使得在异构数据分布下能够有效聚合剪枝后的本地模型。第三，跨轮次恢复（CRR）在各通信轮次间自适应地控制剪枝，以防止过度压缩并保持模型准确性。在八个基准数据集上的实验表明，FedKLPR在保持有竞争力的准确性的同时实现了显著的通信节省。与现有最先进方法相比，FedKLPR在ResNet-50上将通信成本降低了40%至42%，同时取得了更好的总体性能。

英文摘要

Person re-identification (re-ID) is a fundamental task in intelligent surveillance and public safety. Federated learning (FL) provides a privacy-preserving paradigm for collaborative model training without centralized data collection. However, deploying FL in real-world re-ID systems remains challenging due to statistical heterogeneity caused by non-IID client data and the substantial communication overhead incurred by frequent transmission of large-scale models. To address these challenges, we propose FedKLPR, a lightweight and communication-efficient federated learning framework for person re-ID. FedKLPR consists of three key components. First, KL-Divergence-Guided training, including the KL-Divergence Regularization Loss (KLL) and KL-Divergence-aggregation Weight (KLAW), is introduced to mitigate statistical heterogeneity and improve convergence stability under non-IID settings. Second, unstructured pruning is incorporated to reduce communication overhead, and the Pruning-ratio-aggregation Weight (PRAW) is proposed to measure the relative importance of client parameters after pruning. Together with KLAW, PRAW forms KL-Divergence-Prune Weighted Aggregation (KLPWA), enabling effective aggregation of pruned local models under heterogeneous data distributions. Third, Cross-Round Recovery (CRR) adaptively controls pruning across communication rounds to prevent excessive compression and preserve model accuracy. Experiments on eight benchmark datasets demonstrate that FedKLPR achieves substantial communication savings while maintaining competitive accuracy. Compared with state-of-the-art methods, FedKLPR reduces communication cost by 40% to 42% on ResNet-50 while achieving better overall performance.

2508.16663 2026-05-19 cs.CV cs.AI cs.LG

The Loupe: A Plug-and-Play Attention Module for Amplifying Discriminative Features in Vision Transformers

The Loupe: 一种用于放大Vision Transformer中判别特征的即插即用注意力模块

Naren Sengodan

发表机构 * Jain University（贾因大学）

AI总结本文提出The Loupe，一种插入在Vision Transformer中间特征阶段的轻量级即插即用空间门控模块：它用小型CNN预测单通道空间掩码并据此对特征激活重新加权，训练时采用交叉熵目标和l1稀疏项，从而提升细粒度视觉分类性能。

AI中文摘要

细粒度视觉分类（FGVC）要求模型关注细微的、与任务相关的区域，而非宽泛的物体上下文。我们提出了The Loupe，一种面向层次化Vision Transformer的轻量级即插即用空间门控模块。该模块插入在中间特征阶段，用一个小型CNN预测单通道空间掩码，并在以交叉熵目标和l1稀疏项进行的端到端训练中，用该掩码对特征激活重新加权。在CUB-200-2011数据集上，The Loupe将Swin-Base的准确率从88.36%提升至91.72%，将Swin-Tiny的准确率从85.14%提升至88.61%，额外参数不到0.1%。消融实验表明，这一改进依赖于插入点和稀疏正则项，说明在此设置下受控的空间门控比朴素的多尺度掩码更有效。定性结果表明，学习到的掩码通常与具有判别性的鸟类部位对齐，不过该模块不能替代部位级监督，在存在遮挡或部位内部细粒度差异时可能失效。

英文摘要

Fine-Grained Visual Classification (FGVC) requires models to focus on subtle, task-relevant regions rather than broad object context. We present The Loupe, a lightweight plug-and-play spatial gating module for hierarchical Vision Transformers. The module is inserted at an intermediate feature stage, predicts a single-channel spatial mask with a small CNN, and uses that mask to reweight feature activations during end-to-end training with a cross-entropy objective and an l1 sparsity term. On CUB-200-2011, The Loupe improves Swin-Base from 88.36% to 91.72% and Swin-Tiny from 85.14% to 88.61%, with under 0.1% additional parameters. Ablations show that the improvement depends on the insertion point and the sparsity regularizer, suggesting that controlled spatial gating is more effective than naive multi-scale masking in this setting. Qualitative results indicate that the learned masks often align with discriminative bird parts, although the module is not a substitute for part-level supervision and can fail under occlusion or fine-grained intra-part differences.

2508.14769 2026-05-19 cs.LG cs.DC

Federated Distillation on Edge Devices: Efficient Client-Side Filtering for Non-IID Data

边缘设备上的联邦蒸馏：面向非IID数据的高效客户端过滤

Ahmed Mujtaba, Gleb Radchenko, Radu Prodan, Marc Masana

发表机构 * Embedded Systems Division, Silicon Austria Labs, Graz, Austria； Department of Computer Science, University of Innsbruck, Austria； Institute of Information Technology, University of Klagenfurt, Austria； TU-Graz SAL DES Lab, Silicon Austria Labs, Graz, Austria； Institute of Visual Computing, Graz University of Technology, Austria

AI总结本文提出了一种高效的边缘联邦蒸馏方法EdgeFD，通过在客户端使用基于KMeans的密度比估计器来过滤分布内和分布外的代理数据，从而降低计算复杂度并提高知识共享质量，适用于非IID数据分布。

Comments This paper was accepted at the International Conference on Federated Learning Technologies and Applications, 2025. The final version is available at IEEE Xplore

DOI: 10.1109/FLTA67013.2025.11336390

AI中文摘要

联邦蒸馏已成为一种有前景的协同机器学习方法：它交换的是模型输出（软logits）而非完整的模型参数，因而相比传统联邦学习具有更强的隐私保护和更低的通信量。然而，现有方法采用复杂的选择性知识共享策略，要求客户端通过计算代价高昂的统计密度比估计器来识别分布内代理数据。此外，服务器端对模糊知识的过滤会给流程引入延迟。为了解决这些挑战，我们提出了一种鲁棒且资源高效的EdgeFD方法，它降低了客户端密度比估计的复杂度，并消除了对服务器端过滤的需求。EdgeFD引入了一种高效的基于KMeans的密度比估计器，用于在客户端上有效过滤分布内和分布外的代理数据，显著提高了知识共享的质量。我们在多种实际场景中评估了EdgeFD，包括客户端上的强非IID、弱非IID和IID数据分布，且无需在服务器上使用预训练的教师模型进行知识蒸馏。实验结果表明，EdgeFD优于最先进的方法，即使在异构且具有挑战性的条件下，也能持续达到接近IID场景的准确率。基于KMeans的估计器计算开销显著降低，适合部署在资源受限的边缘设备上，从而增强了联邦蒸馏的可扩展性和实际适用性。代码已在线提供以供复现。

英文摘要

Federated distillation has emerged as a promising collaborative machine learning approach, offering enhanced privacy protection and reduced communication compared to traditional federated learning by exchanging model outputs (soft logits) rather than full model parameters. However, existing methods employ complex selective knowledge-sharing strategies that require clients to identify in-distribution proxy data through computationally expensive statistical density ratio estimators. Additionally, server-side filtering of ambiguous knowledge introduces latency to the process. To address these challenges, we propose a robust, resource-efficient EdgeFD method that reduces the complexity of the client-side density ratio estimation and removes the need for server-side filtering. EdgeFD introduces an efficient KMeans-based density ratio estimator for effectively filtering both in-distribution and out-of-distribution proxy data on clients, significantly improving the quality of knowledge sharing. We evaluate EdgeFD across diverse practical scenarios, including strong non-IID, weak non-IID, and IID data distributions on clients, without requiring a pre-trained teacher model on the server for knowledge distillation. Experimental results demonstrate that EdgeFD outperforms state-of-the-art methods, consistently achieving accuracy levels close to IID scenarios even under heterogeneous and challenging conditions. The significantly reduced computational overhead of the KMeans-based estimator is suitable for deployment on resource-constrained edge devices, thereby enhancing the scalability and real-world applicability of federated distillation. The code is available online for reproducibility.

2508.10678 2026-05-19 cs.CV

HyperTea: A Hypergraph-based Temporal Enhancement and Alignment Network for Moving Infrared Small Target Detection

HyperTea: 一种基于超图的时序增强与对齐网络用于移动红外小目标检测

Zhaoyuan Qi, Weihua Gao, Wenlong Niu, Jie Tang, Yun Li, Xiaodong Peng

发表机构 * Key Laboratory of Electronics and Information Technology for Space Systems, National Space Science Center, Chinese Academy of Sciences（空间系统电子信息技术重点实验室，国家空间科学中心，中国科学院）； School of Advanced Interdisciplinary Studies, University of Chinese Academy of Sciences（中国科学院大学交叉学科学院）； School of Computer Science and Technology, University of Chinese Academy of Sciences（中国科学院大学计算机科学与技术学院）

AI总结本文提出HyperTea网络，通过整合全局和局部时序视角，有效建模特征的高阶时空相关性，首次将CNN、RNN和HGNN结合用于MIRSTD，显著提升检测性能。

Comments Accepted by Knowledge-Based Systems

Abstract

In practical application scenarios, moving infrared small target detection (MIRSTD) remains highly challenging due to the target's small size, weak intensity, and complex motion pattern. Existing methods typically only model low-order correlations between feature nodes and perform feature extraction and enhancement within a single temporal scale. Although hypergraphs have been widely used for high-order correlation learning, they have received limited attention in MIRSTD. To explore the potential of hypergraphs and enhance multi-timescale feature representation, we propose HyperTea, which integrates global and local temporal perspectives to effectively model high-order spatiotemporal correlations of features. HyperTea consists of three modules: the global temporal enhancement module (GTEM) realizes global temporal context enhancement through semantic aggregation and propagation; the local temporal enhancement module (LTEM) is designed to capture local motion patterns between adjacent frames and then enhance local temporal context; additionally, we further develop a temporal alignment module (TAM) to address potential cross-scale feature misalignment. To the best of our knowledge, HyperTea is the first work to integrate convolutional neural networks (CNNs), recurrent neural networks (RNNs), and hypergraph neural networks (HGNNs) for MIRSTD, significantly improving detection performance. Experiments on DAUB and IRDST demonstrate its state-of-the-art (SOTA) performance. Our source code is available at https://github.com/Lurenjia-LRJ/HyperTea.
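To make "high-order correlation" concrete, here is a minimal sketch of one round of node-to-hyperedge-to-node mean aggregation, where a single hyperedge may join more than two feature nodes. It is a generic simplification and not HyperTea's GTEM, LTEM, or TAM; the function name `hypergraph_propagate` and the unweighted mean are assumptions made for illustration.

```python
def hypergraph_propagate(features, hyperedges):
    """One round of node -> hyperedge -> node mean aggregation.

    features: list of per-node feature vectors (lists of floats).
    hyperedges: list of node-index lists; each may join more than two nodes.
    """
    dim = len(features[0])
    # Step 1: each hyperedge feature is the mean of its member nodes.
    edge_feats = [
        [sum(features[v][d] for v in edge) / len(edge) for d in range(dim)]
        for edge in hyperedges
    ]
    # Step 2: each node averages the features of the hyperedges it belongs to.
    out = []
    for v in range(len(features)):
        incident = [e for e, edge in enumerate(hyperedges) if v in edge]
        if not incident:  # an isolated node keeps its own feature
            out.append(list(features[v]))
            continue
        out.append(
            [sum(edge_feats[e][d] for e in incident) / len(incident) for d in range(dim)]
        )
    return out
```

With an ordinary graph every edge links exactly two nodes; here the first hyperedge in a call such as `hypergraph_propagate(x, [[0, 1, 2], [2, 3]])` mixes three nodes at once, which is the sense in which the correlation is higher-order.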

2508.08080 2026-05-19 cs.LG cs.NE stat.AP

Symbolic Quantile Regression for the Interpretable Prediction of Conditional Quantiles

Cas Oude Hoekstra, Floris den Hengst

Affiliations: Independent researcher; Vrije Universiteit Amsterdam

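AI summary: This paper proposes Symbolic Quantile Regression (SQR), a method for predicting conditional quantiles and explaining how predictive variables influence the outcome. An airline fuel usage case study that compares models predicting extreme and central outcomes illustrates SQR's suitability for high-stakes applications.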

Journal ref: Transactions on Machine Learning Research, May 2026, https://openreview.net/pdf?id=x9OYbyPJOG

Abstract

Symbolic Regression (SR) is a well-established framework for generating interpretable or white-box predictive models. Although SR has been successfully applied to create interpretable estimates of the average of the outcome, it is currently not well understood how it can be used to estimate the relationship between variables at other points in the distribution of the target variable. Such estimates of e.g. the median or an extreme value provide a fuller picture of how predictive variables affect the outcome and are necessary in high-stakes, safety-critical application domains. This study introduces Symbolic Quantile Regression (SQR), an approach to predict conditional quantiles with SR. In an extensive evaluation, we find that SQR outperforms transparent models and performs comparably to a strong black-box baseline without compromising transparency. We also show how SQR can be used to explain differences in the target distribution by comparing models that predict extreme and central outcomes in an airline fuel usage case study. We conclude that SQR is suitable for predicting conditional quantiles and understanding interesting feature influences at varying quantiles.
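Quantile regression is conventionally trained with the pinball (quantile) loss, whose minimizer is a quantile of the target rather than its mean. The abstract does not state which objective SQR optimizes, so treating the pinball loss as the fitness function is an assumption here; `pinball_loss` and `best_constant` are hypothetical helper names.

```python
def pinball_loss(y_true, y_pred, tau):
    """Mean pinball (quantile) loss at quantile level tau in (0, 1)."""
    total = 0.0
    for y, q in zip(y_true, y_pred):
        diff = y - q
        # Under-prediction is weighted by tau, over-prediction by (1 - tau).
        total += tau * diff if diff >= 0 else (tau - 1) * diff
    return total / len(y_true)


def best_constant(y, tau):
    """Return the sample value that minimizes the pinball loss as a constant prediction."""
    return min(y, key=lambda q: pinball_loss(y, [q] * len(y), tau))
```

On the sample `[1, 2, 3, 4, 100]`, `best_constant` picks 3 at tau = 0.5 and 100 at tau = 0.9: the outlier moves only the upper quantile, which is why models fitted at different quantile levels can tell different stories about the same data.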

2508.07292 2026-05-19 cs.AI cs.CL cs.CV

EndoCogniAgent: Closed-Loop Agentic Reasoning with Self-Consistency Validation for Endoscopic Diagnosis

Yi Tang, Kai-Ni Wang, Yang Chen, Xiaopu He, Guangquan Zhou

Affiliations: School of Biological Science and Medical Engineering, Southeast University; Jiangsu Key Laboratory of Biomaterials and Devices, Southeast University; State Key Laboratory of Digital Medical Engineering, Southeast University; The First Affiliated Hospital of Nanjing Medical University; Department of Computer Science and Engineering, The Chinese University of Hong Kong; Jiangsu Provincial Joint International Research Laboratory of Medical Information Processing; Laboratory of Image Science and Technology, the School of Computer Science and Engineering, Southeast University

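AI summary: This work proposes EndoCogniAgent, a framework that improves the accuracy and reliability of endoscopic diagnosis through closed-loop agentic reasoning and self-consistency validation. Its core approach models diagnosis as a controlled state update process, and it introduces the EndoAgentBench benchmark for evaluation.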

Comments: 10 pages, 8 figures, 2 tables. Revised version with major updates on methodology and extended evaluation on EndoAgentBench. Code and data are available at https://github.com/Tyyds-ai/EndoCogniAgent

Abstract

Endoscopic diagnosis is an iterative process in which clinicians progressively acquire, compare, and verify local visual evidence before reaching a conclusion. Current AI systems do not adequately support this process because fine-grained evidence acquisition and multi-step reasoning remain weakly coupled. This gives rise to two failure modes, hallucinated evidence and uncorrected error accumulation, that undermine diagnostic reliability. We propose EndoCogniAgent, a closed-loop agentic framework that formulates endoscopic diagnosis as a controlled state update process. At each reasoning round, a central planner selects the next evidence acquisition action, specialized expert tools extract the corresponding observation, and a self-consistency validation mechanism examines the observation along two dimensions, knowledge consistency against the input image and temporal consistency with prior validated findings, before updating the diagnostic state. Validated observations are admitted into the evolving state to condition subsequent planning, while insufficiently supported findings are retained with corrective feedback that redirects the planner toward additional verification. We further introduce EndoAgentBench, a workflow-oriented benchmark comprising 6,132 question-answer pairs from 11 endoscopic datasets, designed to evaluate diagnostic agents across a comprehensive diagnostic chain, from fine-grained visual perception to high-level diagnostic reasoning. Experiments show that EndoCogniAgent achieves 85.23% average accuracy on perception tasks and 71.13% clinical acceptance rate on reasoning tasks, with ablation analysis confirming that self-consistency validation and episodic state maintenance are individually critical to these gains.
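The round structure described in the abstract (plan, observe, validate, update) can be sketched as a small control loop. This is only an outline under stated assumptions, not the released implementation: `DiagnosticState`, `run_closed_loop`, and the callable signatures for `planner`, `tools`, and `validator` are all hypothetical, and the two consistency checks are collapsed into a single `validator` call.

```python
from dataclasses import dataclass, field


@dataclass
class DiagnosticState:
    validated: list = field(default_factory=list)  # observations admitted into the state
    feedback: list = field(default_factory=list)   # rejected findings kept with corrective notes


def run_closed_loop(planner, tools, validator, image, max_rounds=5):
    """Plan -> observe -> validate -> update, for at most max_rounds rounds."""
    state = DiagnosticState()
    for _ in range(max_rounds):
        action = planner(state)      # the planner sees validated findings and feedback
        if action is None:           # the planner signals it has enough evidence
            break
        observation = tools[action](image)
        ok, note = validator(observation, image, state)
        if ok:
            state.validated.append(observation)
        else:
            # Keep the unsupported finding with its note so the planner can re-verify.
            state.feedback.append((observation, note))
    return state
```

Because rejected findings stay in `state.feedback` instead of being discarded, a planner written against this sketch can choose a follow-up action that targets exactly the claim that failed validation.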
