2511.14427 2026-06-11 cs.RO cs.LG (version update)

Self-Supervised Multisensory Pretraining for Contact-Rich Robot Reinforcement Learning

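Rickmer Krohn, Vignesh Prasad, Gabriele Tiboni, Georgia Chalvatzaki

Affiliations: Interactive Robot Perception & Learning (PEARL) Lab, TU Darmstadt, Germany; Hessian.AI; Robotics Institute Germany (RIG)

AI summary: Proposes the MSDP framework, which learns multisensory representations through masked autoencoding and cross-modal prediction, and uses an asymmetric architecture (the critic extracts dynamic features via cross-attention, while the actor uses a stable pooled representation) to accelerate policy learning, showing robustness and efficiency on simulated and real-robot tasks.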

Comments 8 pages, 11 figures

DOI: 10.1109/LRA.2026.3681156
Journal ref: IEEE Robotics and Automation Letters, 2026, Vol. 11, No. 6, pp. 6799-6806

Abstract

Effective contact-rich manipulation requires robots to synergistically leverage vision, force, and proprioception. However, Reinforcement Learning agents struggle to learn in such multisensory settings, especially amidst sensory noise and dynamic changes. We propose MultiSensory Dynamic Pretraining (MSDP), a novel framework for learning expressive multisensory representations tailored for task-oriented policy learning. MSDP is based on masked autoencoding and trains a transformer-based encoder by reconstructing multisensory observations from only a subset of sensor embeddings, leading to cross-modal prediction and sensor fusion. For downstream policy learning, we introduce a novel asymmetric architecture, where a cross-attention mechanism allows the critic to extract dynamic, task-specific features from the frozen embeddings, while the actor receives a stable pooled representation to guide its actions. Our method demonstrates accelerated learning and robust performance under diverse perturbations, including sensor noise and changes in object dynamics. Evaluations in multiple challenging, contact-rich robot manipulation tasks in simulation and the real world showcase the effectiveness of MSDP. Our approach exhibits strong robustness to perturbations and achieves high success rates on the real robot with as few as 6,000 online interactions, offering a simple yet powerful solution for complex multisensory robotic control. Website: https://msdp-pearl.github.io/
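
The masking step behind "reconstructing multisensory observations from only a subset of sensor embeddings" can be illustrated with a small sketch. This is not the authors' implementation: the token names, the `keep_ratio` parameter, and the helper `mask_sensor_tokens` are hypothetical, and uniform random masking over tokens is an assumption borrowed from generic masked autoencoders.

```python
import random

def mask_sensor_tokens(tokens, keep_ratio=0.5, seed=None):
    """Split sensor tokens into a visible subset and masked reconstruction targets.

    Hypothetical helper: uniform random masking is an assumption, not
    necessarily the masking scheme used by MSDP.
    """
    rng = random.Random(seed)
    n_keep = max(1, round(len(tokens) * keep_ratio))
    keep_idx = sorted(rng.sample(range(len(tokens)), n_keep))
    keep_set = set(keep_idx)
    # The encoder would see only the visible tokens.
    visible = [(i, tokens[i]) for i in keep_idx]
    # A decoder would be trained to reconstruct the masked ones.
    masked = [(i, tokens[i]) for i in range(len(tokens)) if i not in keep_set]
    return visible, masked

# Illustrative token names, one or more per modality.
tokens = ["vision_0", "vision_1", "force_0", "proprio_0"]
visible, masked = mask_sensor_tokens(tokens, keep_ratio=0.5, seed=0)
```

Because the visible subset can omit a whole modality, a reconstruction objective of this kind forces the encoder to predict one sensor from the others, which is the cross-modal prediction the abstract refers to.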

2604.20348 2026-06-11 cs.RO cs.AI cs.MA (version update)

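Auditing Demographic Bias in Facial Landmark Detection for Fair Human-Robot Interaction

Pablo Parte, Roberto Valle, José M. Buenaposada, Luis Baumela

Affiliations: Departamento de Inteligencia Artificial, Universidad Politécnica de Madrid; Departamento de Informática y Estadística, Universidad Rey Juan Carlos; ELLIS Unit Madrid

AI summary: This study systematically audits age, gender, and race bias in facial landmark detection, using a controlled statistical methodology to separate demographic effects from confounding visual factors. It finds that confounders such as head pose and resolution have a larger effect, but that a significant age bias remains.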

Journal ref: 35th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN 2026)

Abstract

Fairness in human-robot interaction critically depends on the reliability of the perceptual models that enable robots to interpret human behavior. While demographic biases have been widely studied in high-level facial analysis tasks, their presence in facial landmark detection remains unexplored. In this paper, we conduct a systematic audit of demographic bias in this task, analyzing the age, gender, and race biases. To this end, we introduce a controlled statistical methodology to disentangle demographic effects from confounding visual factors. Our analysis demonstrates that visual confounders, particularly head pose and face resolution, heavily outweigh the impact of demographic attributes. Notably, after accounting for these confounders, performance disparities across gender and race vanish. However, we identify a statistically significant age-related bias, with higher localization errors for older individuals. This shows that fairness issues can emerge even in low-level vision components and can propagate through the HRI pipeline. We argue that auditing and correcting such biases is a necessary step toward trustworthy and equitable robot perception systems.

2509.09794 2026-06-11 cs.AI cs.LG (version update)

Synthetic Homes: A Multimodal Generative AI Pipeline for Residential Building Data Generation under Data Scarcity

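Jackson Eshbaugh, Chetan Tiwari, Jorge Silveyra

Affiliations: Lafayette University; Georgia State University

AI summary: Proposes a multimodal generative AI framework that integrates image, tabular, and simulation components to generate synthetic residential building datasets from public records and images, addressing the scarcity of building parameter data.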

Comments 37 pages; 2 appendices; 6 figures; 2 tables. Code available at https://github.com/Lafayette-EshbaughSilveyra-Group/synthetic-homes

Abstract

Computational models have emerged as powerful tools for multi-scale energy modeling research at the building and urban scale, supporting data-driven analysis across building and urban energy systems. However, these models require large amounts of building parameter data that is often inaccessible, expensive to collect, or subject to privacy constraints. We introduce a modular, multimodal generative Artificial Intelligence (AI) framework that integrates image, tabular, and simulation-based components and produces synthetic residential building datasets from publicly available county records and images, and present an end-to-end pipeline instantiating this framework. To reduce typical Large Language Model (LLM) challenges, we evaluate our model's components using occlusion-based visual focus analysis. Our analysis demonstrates that our selected vision-language model achieves greater visual focus than a GPT-based alternative for building image processing. We also assess realism of our results against a national reference dataset, finding that our synthetic data overlaps more than 95% for three of the four selected variables. This work reduces dependence on costly or restricted data sources, lowering barriers to building-scale energy research and Machine Learning (ML)-driven urban energy modeling, and therefore enabling scalable downstream tasks such as energy modeling, retrofit analysis, and urban-scale simulation under data scarcity.

2510.01157 2026-06-11 cs.CL cs.CR cs.SD (version update)

Where Do Backdoors Live? A Component-Level Analysis of Backdoor Propagation in Speech Language Models

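Alexandrine Fortier, Thomas Thebaud, Jesús Villalba, Najim Dehak, Patrick Cardinal, Peter West

Affiliations: University of British Columbia; École de technologie supérieure; Johns Hopkins University

AI summary: Uses backdoor attacks as a lens for a component-level analysis of speech language models, revealing how backdoors propagate across components. Backdoor persistence depends strongly on the targeted component, and poisoned samples are not directly separable from benign ones in the shared embeddings.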

Comments Interspeech 2026 (long paper)

Abstract

Speech language models (SLMs) are systems of systems: independent components that unite to achieve a common goal. Despite their heterogeneous nature, SLMs are often studied end-to-end; how information flows through the pipeline remains obscure. We investigate this question through the lens of backdoor attacks. We first establish that backdoors can propagate through the SLM, leaving all tasks highly vulnerable. From this, we design a component analysis to discover the role each component takes in backdoor learning. We find that backdoor persistence or erasure is highly dependent on the targeted component. Beyond propagation, we examine how backdoors are encoded in shared multitask embeddings, showing that poisoned samples are not directly separable from benign ones, challenging a common separability assumption used in filtering defenses. Our findings emphasize the need to treat multimodal pipelines as intricate systems with unique vulnerabilities, not solely extensions of unimodal ones.

2603.14867 2026-06-11 cs.LG cs.AI cs.GT cs.MA (version update)

Sample-Efficient Hypergradient Estimation for Decentralized Bi-Level Reinforcement Learning

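Mikoto Kudo, Takumi Tanabe, Akifumi Wachi, Youhei Akimoto

Affiliations: University of Tokyo; National Institute of Information and Communications Technology

AI summary: For decentralized bi-level reinforcement learning, where the leader cannot intervene in the follower's optimization process, proposes a hypergradient estimator based on the Boltzmann covariance trick that enables sample-efficient optimization in high-dimensional decision spaces, and applies it for the first time to 2-player Markov games.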

Comments 29 pages. Extended version of the paper accepted to ICAPS 2026

Abstract

Many strategic decision-making problems, such as environment design for warehouse robots, can be naturally formulated as bi-level reinforcement learning (RL), where a leader agent optimizes its objective while a follower solves a Markov decision process (MDP) conditioned on the leader's decisions. In many situations, a fundamental challenge arises when the leader cannot intervene in the follower's optimization process; it can only observe the optimization outcome. We address this decentralized setting by deriving the hypergradient of the leader's objective, i.e., the gradient of the leader's strategy that accounts for changes in the follower's optimal policy. Unlike prior hypergradient-based methods that require extensive data for repeated state visits or rely on gradient estimators whose complexity can increase substantially with the high-dimensional leader's decision space, we leverage the Boltzmann covariance trick to derive an alternative hypergradient formulation. This enables efficient hypergradient estimation solely from interaction samples, even when the leader's decision space is high-dimensional. Additionally, to our knowledge, this is the first method that enables hypergradient-based optimization for 2-player Markov games in decentralized settings. Experiments highlight the impact of hypergradient updates and demonstrate our method's effectiveness in both discrete and continuous state tasks.
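
For readers unfamiliar with the name, the general identity usually meant by a "Boltzmann covariance trick" is sketched below. The symbols $f_\theta$, $g$, and $Z_\theta$ are illustrative, and this is the textbook form for a single Boltzmann distribution; whether it matches the paper's exact hypergradient formulation is an assumption.

```latex
% Boltzmann distribution with parameter-dependent energy f_\theta:
\pi_\theta(a) = \frac{\exp f_\theta(a)}{Z_\theta},
\qquad Z_\theta = \sum_{a'} \exp f_\theta(a').

% Score function: \nabla_\theta \log Z_\theta = \mathbb{E}_{\pi_\theta}[\nabla_\theta f_\theta], so
\nabla_\theta \log \pi_\theta(a)
  = \nabla_\theta f_\theta(a) - \mathbb{E}_{a' \sim \pi_\theta}\!\left[\nabla_\theta f_\theta(a')\right].

% For any g that does not depend on \theta:
\nabla_\theta \, \mathbb{E}_{a \sim \pi_\theta}\!\left[g(a)\right]
  = \mathbb{E}\!\left[g(a)\,\nabla_\theta f_\theta(a)\right]
    - \mathbb{E}\!\left[g(a)\right]\mathbb{E}\!\left[\nabla_\theta f_\theta(a)\right]
  = \operatorname{Cov}_{a \sim \pi_\theta}\!\left(g(a),\, \nabla_\theta f_\theta(a)\right).
```

The practical appeal is that a covariance can be estimated from samples of $\pi_\theta$ alone, which is consistent with the abstract's claim of hypergradient estimation "solely from interaction samples".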

2603.22934 2026-06-11 cs.AI (version update)

ProGRank: Probe-Gradient Reranking to Defend Dense-Retriever RAG from Corpus Poisoning

Xiangyu Yin, Yi Qi, Chih-Hong Cheng

Affiliations: Chalmers University of Technology, Sweden; University of Leeds, United Kingdom; Carl von Ossietzky University of Oldenburg, Germany

AI summary: Proposes ProGRank, a training-free, post hoc retriever-side defence that extracts instability signals from probe gradients under randomized perturbations and reranks passages, defending dense-retriever RAG against corpus poisoning.

Comments accepted by ECML PKDD 2026

Abstract

Retrieval-Augmented Generation (RAG) improves large language model applications by grounding generation in retrieved evidence, but also introduces corpus poisoning as a new attack surface. In this setting, an adversary injects or edits passages so that they enter the Top-$K$ results for target queries and influence downstream generation. Existing defences often rely on content filtering, auxiliary models, or generator-side reasoning, which complicates deployment. We propose ProGRank, a post hoc, training-free retriever-side defence for dense-retriever RAG. ProGRank stress-tests each query--passage pair under mild randomized perturbations, extracts probe gradients from a small fixed parameter subset, and derives two instability signals: representational consistency and dispersion risk. It then combines these signals with a score gate for reranking. ProGRank preserves the original passage content, requires no retraining, and supports a surrogate-based variant when the deployed retriever is unavailable. Experiments across datasets, retrievers, attacks, and retrieval-stage and end-to-end settings show that ProGRank improves robustness and maintains a favorable robustness--utility trade-off, including under adaptive evasive attacks.

2603.24080 2026-06-11 cs.CL cs.DB (version update)

BioMamba: Domain-Adaptive Biomedical Language Models

Ling Yue, Mingzhi Zhu, Sixue Xing, Yunning Cao, Yanbo Wang, Shimin Shan, Jinfei Liu, Vijil Chenthamarakshan, Shaowu Pan, Payel Das, Tianfan Fu

Affiliations: Rensselaer Polytechnic Institute; Jiaxing New Jies Thermal Power Co. Ltd.; North University of China; Zhejiang University; IBM Research; Nanjing University

AI summary: Proposes BioMamba, a Mamba2-based domain-adaptive pretraining approach that continues training on a mixture of PubMed, C4, and Wikipedia, substantially lowering biomedical perplexity while preserving general language ability.

DOI: 10.34133/hds.0454

Abstract

Background. Biomedical language models should improve performance on biomedical text while retaining general-language-modeling fluency. For Mamba-based models, this trade-off has not been systematically studied across biomedical literature and clinical text. Methods. We developed BioMamba, a family of biomedical Mamba2 models at five scales obtained by continued pretraining of released public Mamba2 checkpoints on a balanced 80%/10%/10% mixture of PubMed abstracts, the Colossal Clean Crawled Corpus (C4), and Wikipedia. The contribution is the adaptation recipe and the accompanying open-weight checkpoints. Results. Across five scales, BioMamba consistently lowered PubMed perplexity, improved Wikipedia-style held-out perplexity by 1.46-4.72 PPL, and left C4 perplexity essentially unchanged. On six out-of-domain multiple-choice benchmarks, BioMamba stayed within +/-3 percentage points of Mamba2 with no systematic regression. After supervised fine-tuning, BioMamba+SFT matched or exceeded Mamba2+SFT on MIMIC-IV note completion and discharge summary generation at every evaluated scale, and improved PubMedQA at every scale. The strongest model (BioMamba-2.7B) reached a PubMed perplexity of 5.28 and accuracies of 90.24% and 73.00% on BioASQ and PubMedQA, respectively. Conclusions. A balanced domain-adaptive continued pretraining recipe strengthens Mamba2 language models on biomedical literature and clinical text while preserving general-language-modeling fluency.
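
Two quantities in this abstract are easy to make concrete: the 80%/10%/10% data mixture and perplexity. The sketch below is illustrative only. The source names, the per-example sampling scheme, and the helper functions are assumptions, not the BioMamba training code; only the mixture weights come from the abstract.

```python
import math
import random

# Mixture weights taken from the abstract; the dictionary keys are illustrative.
MIXTURE = {"pubmed": 0.8, "c4": 0.1, "wikipedia": 0.1}

def sample_sources(n, seed=0):
    """Draw n source labels according to the mixture weights (assumed per-example sampling)."""
    rng = random.Random(seed)
    names = list(MIXTURE)
    weights = [MIXTURE[name] for name in names]
    return rng.choices(names, weights=weights, k=n)

def perplexity(token_nlls):
    """Perplexity = exp(mean per-token negative log-likelihood, natural log)."""
    return math.exp(sum(token_nlls) / len(token_nlls))

batch_sources = sample_sources(1000)
# A model assigning probability 1/4 to every token has perplexity 4.
uniform_ppl = perplexity([math.log(4)] * 3)
```

Under this definition lower perplexity is better, so a reported change in held-out perplexity should be read with the sign convention in mind.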

2601.10724 2026-06-11 cs.RO (version update)

Adaptive Sliding Mode Control for Vehicle Platoons with State-Dependent Friction Uncertainty

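Rishabh Dev Yadav

Affiliations: Robotics Research Center, International Institute of Information Technology Hyderabad

AI summary: For unknown, state-dependent friction forces in vehicle platoons, proposes an adaptive sliding mode controller that handles friction uncertainty without prior knowledge, achieving speed regulation and spacing maintenance.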

Abstract

Multi-robot formation control has various applications in domains such as vehicle troops, platoons, payload transportation, and surveillance. Maintaining formation in a vehicle platoon requires designing a suitable control scheme that can tackle external disturbances and uncertain system parameters while maintaining a predefined safe distance between the robots. A crucial challenge in this context is dealing with the unknown/uncertain friction forces between wheels and the ground, which vary with changes in road surface, wear in tires, and speed of the vehicle. Although state-of-the-art adaptive controllers can handle a priori bounded uncertainties, they struggle with accurately modeling and identifying frictional forces, which are often state-dependent and cannot be a priori bounded. This thesis proposes a new adaptive sliding mode controller for wheeled mobile robot-based vehicle platoons that can handle the unknown and complex behavior of frictional forces without prior knowledge of their parameters and structures. The controller uses the adaptive sliding mode control techniques to regulate the platoon's speed and maintain a predefined inter-robot distance, even in the presence of external disturbances and uncertain system parameters. This approach involves a two-stage process: first, the kinematic controller calculates the desired velocities based on the desired trajectory; and second, the dynamics model generates the commands to achieve the desired motion. By separating the kinematics and dynamics of the robot, this approach can simplify the control problem and allow for more efficient and robust control of the wheeled mobile robot.

2511.05203 2026-06-11 cs.RO (version update)

SIL: Symbiotic Interactive Learning for Language-Conditioned Human-Agent Co-Adaptation

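Linus Nwankwo, Bjoern Ellensohn, Christian Rauch, Elmar Rueckert

Affiliations: Technical University of Leoben

AI summary: Proposes the Symbiotic Interactive Learning (SIL) framework for bidirectional co-adaptation between human and agent in a shared latent task space. Using joint belief states, FM-based spatial reasoning, and a memory architecture, it reaches a 90.4% task completion rate and a belief alignment score of 0.83 on tasks such as instruction following.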

Abstract

Today's autonomous agents, largely driven by foundation models (FMs), can understand natural language instructions and solve long-horizon tasks with human-like reasoning. However, current human-robot interaction frameworks largely follow a one-way master-apprentice technique where the embodied agent passively executes commands without reciprocal learning. This neglects the co-adaptive, multi-turn nature of everyday human-to-human interactions. We introduce symbiotic interactive learning (SIL), a bidirectional co-adaptation framework in a shared latent task space, where both the human and the agent maintain joint belief states that evolve with the interaction history. This enables proactive clarification, adaptive suggestions, and shared plan refinement. SIL leverages FMs for spatial perception and reasoning, together with a triplet-loss-trained neural encoder that grounds the FMs' outputs into task-specific latent representations. To support long-term stability as tasks evolve, SIL utilises episodic and semantic memory architectures, regularised via elastic weight consolidation to mitigate catastrophic forgetting. We evaluate SIL on simulated and real-world embodied tasks, including instruction following, information retrieval, query-oriented reasoning, and interactive dialogue, achieving a $90.4\%$ task completion rate and a belief alignment score of $\rho \approx 0.83$, an absolute improvement of about $20$ percentage points over the best ablations. Demos and resources: https://linusnep.github.io/SIL/.
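
The two training ingredients named in this abstract, a triplet loss and an elastic weight consolidation (EWC) penalty, have standard textbook forms that can be sketched in a few lines. The function names, the Euclidean distance, the margin, and the diagonal Fisher approximation below are assumptions for illustration and are not taken from the SIL code.

```python
import math

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss: max(0, d(a, p) - d(a, n) + margin).

    Euclidean distance is an assumption; other metrics are common too.
    """
    d_ap = math.dist(anchor, positive)
    d_an = math.dist(anchor, negative)
    return max(0.0, d_ap - d_an + margin)

def ewc_penalty(params, anchor_params, fisher_diag, lam=1.0):
    """EWC regulariser: (lam / 2) * sum_i F_i * (theta_i - theta_star_i)^2.

    fisher_diag is a diagonal Fisher information estimate (assumed given).
    """
    return 0.5 * lam * sum(
        f * (p - a) ** 2 for p, a, f in zip(params, anchor_params, fisher_diag)
    )

# The negative is closer to the anchor than the positive, so the loss is positive.
loss = triplet_loss([0.0, 0.0], [2.0, 0.0], [1.0, 0.0], margin=0.5)
# Only the first parameter moved away from its anchor value.
penalty = ewc_penalty([1.0, 2.0], [0.0, 2.0], [2.0, 5.0], lam=3.0)
```

The EWC term penalises movement of parameters that the Fisher estimate marks as important for earlier tasks, which is how it mitigates catastrophic forgetting.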

2511.16672 2026-06-11 cs.CV (version update)

EvoLMM: Self-Evolving Large Multimodal Models with Continuous Rewards

Omkar Thawakar, Shravan Venkatraman, Ritesh Thawkar, Abdelrahman Shaker, Hisham Cholakkal, Rao Muhammad Anwer, Salman Khan, Fahad Khan

Affiliations: Mohamed bin Zayed University of AI; Aalto University; Australian National University; Linköping University

AI summary: Proposes the EvoLMM framework, which instantiates two cooperative agents, a Proposer and a Solver, from a single backbone model and improves LMM reasoning without supervision through a continuous self-rewarding process, gaining about 3% on benchmarks such as ChartQA.

Comments 9 pages, 6 figures

Abstract

Recent advances in large multimodal models (LMMs) have enabled impressive reasoning and perception abilities, yet most existing training pipelines still depend on human-curated data or externally verified reward models, limiting their autonomy and scalability. In this work, we strive to improve LMM reasoning capabilities in a purely unsupervised fashion (without any annotated data or reward distillation). To this end, we propose a self-evolving framework, named EvoLMM, that instantiates two cooperative agents from a single backbone model: a Proposer, which generates diverse, image-grounded questions, and a Solver, which solves them through internal consistency, where learning proceeds through a continuous self-rewarding process. This dynamic feedback encourages both the generation of informative queries and the refinement of structured reasoning without relying on ground-truth or human judgments. When using the popular Qwen2.5-VL as the base model, our EvoLMM yields consistent gains of up to $\sim$3\% on multimodal math-reasoning benchmarks, including ChartQA, MathVista, and MathVision, using only raw training images. We hope our simple yet effective approach will serve as a solid baseline easing future research in self-improving LMMs in a fully unsupervised fashion. Our code and models are available at https://github.com/mbzuai-oryx/EvoLMM.

2603.12261 2026-06-11 cs.LG cs.AI cs.CV (version update)

The Latent Color Subspace: Emergent Order in High-Dimensional Chaos

Mateusz Pach, Jessica Bader, Quentin Bouniot, Serge Belongie, Zeynep Akata

Affiliations: University of California, Berkeley; University of Toronto; University of Cambridge; University of Oxford

AI summary: Reveals an HSL structure in how color is represented in the latent space of the FLUX.1 variational autoencoder, and proposes a training-free, closed-form latent-space manipulation method for predicting and explicitly controlling the color of generated images.

Comments Accepted at ICML 2026

2603.09715 2026-06-11 cs.AI (version update)

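Language Shapes Mental Health Evaluations in Large Language Models

Jiayi Xu, Xiyang Hu

Affiliations: University of North Carolina at Chapel Hill; Arizona State University

AI summary: Studies whether multilingual LLMs show systematic language-dependent shifts in mental health evaluation, finding that Chinese prompts yield higher stigma-related scores and more conservative depression severity judgments than English prompts.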

Abstract

Multilingual large language models (LLMs) are increasingly used in socially sensitive mental health contexts, including support chatbots, screening, and content moderation. This raises a reliability question: do semantically equivalent mental health inputs elicit comparable evaluations across languages, or systematic shifts consistent with language-associated social and cultural contexts? We examine this question in an English-Chinese setting with GPT-4o and Qwen3-32B using a two-level framework: construct-level evaluative orientation, measured by psychometric stigma instruments, and decision-level behavior, measured by binary stigma detection and four-class depression severity classification. Across instruments and models, Chinese prompts elicit higher stigma-related scores than English prompts. At the decision level, Chinese prompts reduce sensitivity to stigmatizing content and produce more conservative depression severity judgments, leading to more under-estimation errors. These findings show that prompt language can shift both evaluative orientation and downstream behavior in LLM-based mental health evaluation. They highlight the need to evaluate multilingual LLMs not only for aggregate performance, but also for whether they apply comparable evaluative standards across languages in socially sensitive domains.

2603.05573 2026-06-11 cs.LG (version update)

Why Depth Matters in Parallelizable Sequence Models: A Lie Algebraic View

Gyuryang Heo, Timothy Ngotiaoco, Kazuki Irie, Samuel J. Gershman, Bernardo L. Sabatini

Affiliations: Howard Hughes Medical Institute, Department of Neurobiology, Harvard Medical School; Kempner Institute for the Study of Natural and Artificial Intelligence, Harvard University; Department of Psychology and Center for Brain Science, Harvard University

AI summary: From a Lie-algebraic control perspective, studies how the expressivity of parallelizable sequence models (such as Transformer variants and state space models) depends on depth, proving that the error decreases exponentially as depth increases.

Comments v2: Format update; split former Theorem 3.4 into Theorem 3.4 and Corollary 3.5 for clarity; corrected an indexing error affecting Corollary 3.6, Proposition 3.7, and Figure 2

2504.09762 2026-06-11 cs.AI (version update)

Position: Stop Anthropomorphizing Intermediate Tokens as Reasoning/Thinking Traces!

Subbarao Kambhampati, Karthik Valmeekam, Siddhant Bhambri, Vardhan Palod, Lucas Saldyt, Kaya Stechly, Soumya Rani Samineni, Durgesh Kalwar, Upasana Biswas

Affiliations: University of California, Berkeley

AI summary: Argues that anthropomorphizing model-generated intermediate tokens as "reasoning traces" or "thinking traces" is misleading, and calls on the community to avoid such anthropomorphization.

Comments Appears in ICML 2026. [This is a fork of v1. This fork, while overlapping with v1 in background section, differs both in the overall focus as well as the specific argument against anthropomorphization of reasoning traces]