arXivDaily: daily arXiv academic digest, updated Monday through Friday
2606.17474 2026-06-17 cs.CL cs.AI New submission

AIPatient Arena: EHR-grounded evaluation of large language models in end-to-end clinical consultation workflows

Jiahui Niu, Huizi Yu, Wenkong Wang, Guangxin Dai, Jingxian He, Xiang Li, Zhiying Liang, Xinxin Lin, Kent CY So, Bryan YP Yan, Yun Kwok Wing, Yanqiu Xing, Xin Ma, Lizhou Fan

Affiliations: School of Control Science and Engineering, Shandong University; Key Laboratory of Machine Intelligence and System Control, Shandong University; Department of Medicine and Therapeutics, The Chinese University of Hong Kong; Department of Geriatric Medicine, Qilu Hospital of Shandong University; Department of Psychiatry, The Chinese University of Hong Kong; Li Chiu Kong Family Sleep Assessment Unit, Department of Psychiatry, Faculty of Medicine, The Chinese University of Hong Kong; Li Ka Shing Institute of Health Sciences, Faculty of Medicine, The Chinese University of Hong Kong; Gerald Choa Neuroscience Institute, Department of Medicine and Therapeutics, The Chinese University of Hong Kong

AI summary: Proposes the AIPatient Arena framework, which builds patient knowledge graphs from electronic health records and evaluates eight clinical competencies of large language models in multi-turn physician-patient interactions. Models fall short in areas such as information coverage and diagnostic reasoning, which underscores the importance of process-based evaluation.

Comments 49 pages, 12 figures, 11 tables

Abstract

Large language models (LLMs) are increasingly considered for use in clinical consultation tasks, yet most medical evaluations remain static, single-turn, or narrowly outcome-based, limiting their ability to reflect the sequential, uncertain, and interactive nature of real-world care. Here, we propose AIPatient Arena, an EHR-grounded evaluation framework for assessing the clinical utility of LLMs across eight dimensions of clinical competence. The framework integrates EHR data into patient-specific knowledge graphs, enabling multi-turn physician-patient interactions. We applied AIPatient Arena to a primary cohort of 437 patients and two out-of-distribution validation cohorts of 119 and 67 patients. LLMs performed well in medical interview questioning skills (QS; mean scores, 4.43-4.99/5), ethical and professional conduct (ET; 4.38-4.93/5), and clarity and transparency of clinical explanations (EX; 3.80-4.72/5). Performance was moderate in information integration (II; 3.19-4.21/5) and medication safety and justification (MS; 3.13-3.78/5), but persistent weaknesses were observed in handling of ambiguous patient responses (HR; 2.57-3.32/5), information coverage (IC; 2.08-3.02/5), and diagnostic accuracy and reasoning (Dx; 2.63-3.55/5). Process-based evaluation revealed recurrent interaction failures, including repetitive questioning, omission of past medical history, and inadequate handling of uncertainty. Richer conversational context improved diagnostic reasoning but yielded limited gains in treatment planning. These findings indicate that final-answer accuracy alone is insufficient for evaluating clinical readiness and highlight the importance of assessing how models gather, interpret, and communicate information throughout a consultation. AIPatient Arena provides an EHR-grounded framework for workflow-oriented pre-deployment evaluation of medical LLMs.

2606.17471 2026-06-17 cs.LG cs.SY eess.SY New submission

ReRAM-aware Model Finetuning addressing I-V Non-linearity and Retention Errors

Ching-Yi Lin, Shamik Kundu, Arnab Raha, Sahil Shah

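Affiliations: Intel Corporation

AI summary: Proposes a finetuning-based hardware-aware training algorithm that mitigates I-V non-linearity with a range-shrunk sinh transformation and folds retention errors into a regularization loss, enabling efficient DNN deployment on ReRAM with minimal accuracy loss on image classification and question-answering tasks.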

Comments 11 pages, 12 figures, 2 tables, with appendix (5 pages, 9 figures)

Abstract

Traditional CPU, GPU, and NPU architectures are increasingly limited by the von Neumann bottleneck. While In-Memory Computing (IMC) using ReRAM crossbar arrays offers a high-density, energy-efficient alternative, its practical deployment is constrained by device non-idealities. Existing hardware-aware training frameworks often require training from scratch, which is computationally prohibitive for modern large-scale models. In this work, we propose a finetuning-based hardware-aware training algorithm that enables robust DNN deployment on ReRAM with minimal training overhead. Our approach mitigates I-V non-linearity by applying a range-shrunk sinh transformation and incorporates retention errors directly into a regularization loss during the finetuning process. We evaluate our framework across models and tasks such as image classification and question-answering (QA). Experimental results demonstrate that our method achieves accuracy similar to that of the base model on large-scale models such as ResNet18 and DeiT-Tiny. For the MobileNetV3 family on ImageNet, the technique incurs less than 2% accuracy degradation. Further, applying the technique to the SQuAD v2 dataset results in only a 1-point degradation in F1 score.

2606.17465 2026-06-17 cs.LG cs.SY eess.SY New submission

Perron-Frobenius Operator Matching for Generative Modeling

Shiqi Zhang, Wuwei Wu, Jaemin Oh, Jie Chen, Xiaoning Qian

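Affiliations: Texas A&M University; City University of Hong Kong

AI summary: Proposes Perron-Frobenius Operator Matching (PFOM), a generative framework that matches density evolution via the integral PF operator and unifies flow, diffusion, and jump models. Proves that the KL divergence is the only Bregman divergence that preserves the equivalence between density-level and sample-conditioned objectives, and develops Nesterov-accelerated training and sampling methods.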

Abstract

We introduce Perron-Frobenius Operator Matching (PFOM), a generative framework that matches density evolution via the integral PF operator, subsuming flow, diffusion, and jump models. We prove that among Bregman divergences, only Kullback-Leibler divergence preserves equality between density-level and sample-conditioned objectives, yielding a practical loss equivalent to Koopman path matching. We further develop Nesterov-accelerated training and sampling that stabilize discretization and accelerate convergence. PFOM unifies operator-theoretic identification with modern generative modeling and opens paths to adaptive dictionaries and high-dimensional applications.

2606.17464 2026-06-17 cs.LG New submission

CheckMIABench: Firm Foundations For Membership Inference Attacks on Language Models

Jeffrey G. Wang, Jason Wang, Marvin Li, Seth Neel

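Affiliations: Harvard University; Harvard Business School

AI summary: To address distribution shift in the evaluation of membership inference attacks, proposes a benchmark framework built on the observation that training data seen before and after a fixed point in training come from the same distribution, evaluates multiple attacks on Pythia and OLMo models, and open-sources a modular library.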

Abstract

Membership inference attacks (MIAs) are a canonical way to assess a machine learning model's privacy properties. Although several attempts have been made to evaluate MIAs on language models, the extant literature has suffered numerous difficulties in constructing clean evaluations to test new techniques. In particular, subtle distribution shifts between member and non-member sets can undermine the statistical validity of MIAs; recent work has underscored this by showing that "blind" methods with no access to the underlying model can perform far better than published methods on the same benchmarks. This paper constructs a benchmark for principled evaluation of MIAs against LLMs, by leveraging the insight that training data before and after a fixed point during training are drawn from the same distribution. Therefore, all open-source models with intermediate checkpoints and public training data can be converted into MIA testbeds. We apply our framework to a half-dozen published attacks on the Pythia and OLMo family of models, from 70M to 7B parameters. To facilitate further privacy research, we open-source a modular library for designing and implementing attacks in this setting: https://github.com/safr-ai-lab/pandora_llm.
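
The evaluation idea in this abstract can be illustrated with a minimal sketch. It assumes a simple loss-thresholding attack, which is not necessarily one of the six attacks the paper evaluates, and the loss values are made up for illustration. Sequences the checkpoint trained on before the fixed point play the role of members, sequences scheduled after it play the role of non-members, and the attack is scored by AUC. The function names `attack_auc` and `loss_attack_scores` are hypothetical and do not come from the paper's library.

```python
def attack_auc(member_scores, nonmember_scores):
    """Probability that a random member outscores a random non-member.

    Ties count as half a win, which matches the usual AUC definition.
    """
    wins = 0.0
    for m in member_scores:
        for n in nonmember_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(member_scores) * len(nonmember_scores))


def loss_attack_scores(losses):
    """Loss-thresholding attack (assumed here): lower loss means a higher membership score."""
    return [-loss for loss in losses]


# Hypothetical per-sequence losses measured at one intermediate checkpoint.
seen_before_checkpoint = [2.1, 1.9, 2.4, 2.0]  # members
seen_after_checkpoint = [2.3, 2.2, 2.6, 2.5]   # non-members drawn from the same distribution

auc = attack_auc(loss_attack_scores(seen_before_checkpoint),
                 loss_attack_scores(seen_after_checkpoint))
print(f"AUC = {auc:.3f}")
```

Because both sets come from the same data distribution, an AUC above 0.5 in such a setup would reflect what the model memorized rather than a difference between the two sets.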

2606.17463 2026-06-17 cs.CV cs.RO New submission

WeaveLA: Event Driven Cross-Subtask Latent Memory Weaving for Repetitive Robot Manipulation

Shoujing Zhu, Zhenyang Liu, Fungmiu Wang, Jiafeng Wang, Bo Yue, Guiliang Liu, Simo Wu, Xiangyang Xue, Taiping Zeng

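Affiliations: Fudan University; School of Data Science, The Chinese University of Hong Kong, Shenzhen; Shanghai Innovation Institute; Shenzhen Loop Area Institute

AI summary: To address the lack of cross-subtask information transfer in short-window VLA policies, proposes WeaveLA, which on each sub-goal completion event compresses the completed sub-task into latent tokens and injects them into the action-generation path of the next sub-task. This adds a lightweight cross-subtask channel while keeping the base policy's short-window interface, and raises the success rate on the hardest repetition task from 0% to 47.8%.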

Abstract

Vision-Language-Action (VLA) policies have achieved remarkable single-step manipulation, yet they remain brittle precisely where each stage depends on what was just completed. The core issue is structural: short-window VLAs lack an explicit channel for routing information across sub-task boundaries, and existing memory-augmented variants either write at every frame, retrieve from demonstration-time stages, or fire at sub-goal events without performing an explicit sub-task-to-sub-task hand-off into the action expert. We identify the sub-goal completion event as the natural temporal unit for cross-subtask memory hand-off, and present WeaveLA (Weave Latent memory for Vision-Language-Action policies), a cross-subtask memory interface that, on top of a frozen VLA backbone, compresses each completed segment into latent tokens via query-driven attention pooling and routes them directly into the action-generation path of the next sub-task. This event-triggered, action-side design preserves the base policy's short-window interface while adding a lightweight cross-subtask channel. Through stratified evaluation on RoboMME with a $\pi_{0.5}$ backbone, WeaveLA's gains land exactly where the channel is needed: on the hardest repetition slice (SwingXtimes, $N=3$), success rises from 0% to 47.8%, while single-execution episodes remain unchanged. Per-episode paired analysis confirms the gains are confined to tasks whose causal structure requires cross-subtask information.
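
The compression step can be pictured with a minimal sketch, assuming plain scaled dot-product attention pooling with a fixed set of query vectors. The function name `attention_pool`, the toy feature values, and the absence of learned key/value projections are illustrative assumptions, not details taken from the paper.

```python
import math


def attention_pool(queries, frames):
    """Compress T frame features into len(queries) latent tokens.

    Each query attends over all frames of one completed segment with
    scaled dot-product softmax attention; the token is the weighted mean.
    """
    dim = len(frames[0])
    tokens = []
    for query in queries:
        logits = [sum(q * f for q, f in zip(query, frame)) / math.sqrt(dim)
                  for frame in frames]
        peak = max(logits)  # subtract the max for numerical stability
        weights = [math.exp(x - peak) for x in logits]
        total = sum(weights)
        tokens.append([sum(w * frame[j] for w, frame in zip(weights, frames)) / total
                       for j in range(dim)])
    return tokens


# Hypothetical 2-D features for a three-frame segment and two queries.
segment = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
queries = [[0.0, 0.0], [4.0, 0.0]]
latent_tokens = attention_pool(queries, segment)
```

The number of output tokens depends only on the number of queries, not on the segment length, which is what keeps such a hand-off lightweight. Routing the tokens into the next sub-task's action-generation path when a sub-goal completion event fires is not shown here.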

2606.17462 2026-06-17 cs.LG cs.NI New submission

ResAware: Cross-Environment Website Fingerprinting via Resource-Privileged Distillation

Chongru Fan, Wei Wang, Wentao Huang, Zhenquan Ding, Jinqiao Shi, Lei Cui, Zhiyu Hao, Xiaochun Yun

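Affiliations: Beijing University of Posts and Telecommunications; Zhongguancun Laboratory

AI summary: Proposes the ResAware framework, which trains a teacher model on resource-level features and guides a student model through heterogeneous knowledge distillation, improving cross-environment robustness with no added online overhead and markedly improving baseline methods on a large-scale dataset spanning five months.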

Comments 18 pages, 9 figures

Abstract

While Website Fingerprinting (WF) attacks achieve high accuracy in controlled laboratory settings, they often degrade substantially in real-world environments due to factors such as spatio-temporal drift, browser heterogeneity, and proxy obfuscation. This limitation stems from their sole reliance on low-level traffic features that are noisy and highly sensitive to environmental perturbations. To address this problem, we propose ResAware, a cross-environment resource-aware distillation framework under a training-rich/inference-poor asymmetric setting. Specifically, ResAware trains a teacher model on resource-level features, and then distills the resulting privileged knowledge into a student model through heterogeneous knowledge distillation. At deployment time, the student model performs inference using only encrypted traffic, incurring zero additional cost. We evaluate ResAware on a large-scale dataset collected over five months from six globally distributed vantage points, comprising more than 160,000 paired samples. The results show that ResAware significantly enhances the cross-environment robustness of diverse WF baselines. Under a 150-day temporal drift, for example, ResAware improves the F1-score of Var-CNN from 72.77% to 81.49% and the open-world TPR@1%FPR from 22.40% to 27.20%. Our results demonstrate that resource-level supervision improves WF robustness without expanding online observation capabilities.
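
The open-world metric quoted in this abstract, the true-positive rate at a 1% false-positive rate, can be computed as in the sketch below. It assumes that a higher classifier score means "more likely a monitored site" and that the operating point is chosen by sweeping a score threshold; the function name `tpr_at_fpr` and the toy scores are hypothetical and are not taken from the paper.

```python
def tpr_at_fpr(monitored_scores, unmonitored_scores, max_fpr=0.01):
    """Highest TPR reachable by a score threshold whose FPR stays <= max_fpr.

    A trace is flagged as monitored when its score is >= the threshold.
    """
    best_tpr = 0.0
    # Only observed scores need to be tried as thresholds.
    for threshold in sorted(set(monitored_scores) | set(unmonitored_scores)):
        false_pos = sum(s >= threshold for s in unmonitored_scores)
        fpr = false_pos / len(unmonitored_scores)
        if fpr <= max_fpr:
            true_pos = sum(s >= threshold for s in monitored_scores)
            best_tpr = max(best_tpr, true_pos / len(monitored_scores))
    return best_tpr


# Hypothetical scores: 100 unmonitored traces and 4 monitored traces.
unmonitored = [i / 100 for i in range(100)]
monitored = [0.995, 0.985, 0.97, 0.5]
result = tpr_at_fpr(monitored, unmonitored)
print(result)
```

Fixing the false-positive budget first reflects the open-world setting, where unmonitored traffic vastly outnumbers monitored traffic and even a small false-positive rate produces many false alarms.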

2606.17459 2026-06-17 cs.AI New submission

Can LLMs Be CEOs? Benchmarking Strategic Resource Reallocation with Multi-Role Agent Simulation

Yuyang Dai, Xueqing Peng, Lingfei Qian, Zhuohan Xie

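Affiliations: MBZUAI (Mohamed bin Zayed University of Artificial Intelligence); Yale University

AI summary: Proposes CEO-Bench, a multi-agent benchmark that evaluates LLMs on multi-round strategic resource reallocation in a constraint-rich organizational environment, and finds that models do well on structural validity but show systematic failure modes in strategic calibration.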

Comments 13 pages

Abstract

Evaluating the decision-making capabilities of large language models (LLMs) is a growing research priority, yet existing benchmarks focus on isolated cognitive tasks such as reasoning, knowledge retrieval, and economic rationality in stylized settings. These evaluations overlook the defining challenge of real executive decision-making: integrating conflicting recommendations from specialized stakeholders under information asymmetry, organizational constraints, and temporal dependencies. We introduce CEO-Bench, a multi-agent benchmark that evaluates LLMs on CEO-level strategic resource reallocation, the process of redirecting capital across business units in a multi-round, constraint-rich organizational environment. In CEO-Bench, LLM agents receive conflicting advice from four role-conditioned C-suite advisors (CFO, CTO, COO, CMO), each with private signals and distinct priorities, and must synthesize these into a concrete allocation plan evaluated along four dimensions: role integration, conditional boldness, history-sensitive judgment, and plan validity. Experiments across five frontier models on 13 scenarios reveal that all models achieve high structural validity but diverge sharply on strategic calibration, the hardest capability layer. We identify systematic failure modes including single-advisor capture, conservative default under ambiguity, and historical amnesia, and uncover a structural integration-boldness tradeoff: models that engage more deeply with conflicting perspectives tend to produce less decisive action. These findings delineate the current capability boundary of LLMs as organizational decision-makers and inform the design of future AI-assisted executive systems.

2606.17456 2026-06-17 cs.RO q-bio.NC New submission

Embodiment Shapes Rolling Behavior in a Multimodal Infant Model

Leon Philipp, Francisco M. López, Jochen Triesch

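Affiliations: Frankfurt Institute for Advanced Studies; Goethe University Frankfurt; University of New South Wales

AI summary: Studies how changes in embodiment during infant motor development affect behavior by having the virtual infant MIMo learn supine-to-prone rolling, and finds developmental trends and coordination patterns consistent with those of real infants.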

Comments 7 pages, 7 figures. Accepted at the 2026 IEEE ICDL Conference. Cite as: L. Philipp, F. M. López, and J. Triesch, "Embodiment Shapes Rolling Behavior in a Multimodal Infant Model", in 2026 IEEE International Conference on Development and Learning (ICDL). IEEE, 2026, pp. 1-7

Abstract

Rolling over is one of the earliest milestones in infant motor development, reflecting the emergence of coordinated, whole-body sensorimotor control. Here, we conduct a computational study of infant rolling using MIMo, a virtual infant embodiment equipped with proprioception and vestibular sensation. MIMo learns supine-to-prone rolls with reinforcement learning. Interestingly, the learned behaviors capture developmental trends and coordination patterns consistent with those reported in real infants, including improved performance and faster execution with age. Our results explain how infant capabilities and constraints can give rise to realistic behaviors in artificial agents, with a particular emphasis on how motor development is shaped by the changing body morphology. This work highlights the role of embodied computational models as a powerful tool for studying sensorimotor development.

2606.17455 2026-06-17 cs.RO New submission

Continual Online Personalization of Exoskeleton Control via Manifold-Aware Experience Replay

Changseob Song, Inseung Kang

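Affiliations: Carnegie Mellon University

AI summary: Proposes a manifold-aware experience replay framework that preserves user-specific representations through a replay buffer and avoids catastrophic forgetting during online adaptation, improving torque and gait phase tracking accuracy by 40% and 60%, respectively, on emulated hemiplegic gait.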

Abstract

Personalizing exoskeleton control remains a critical challenge for clinical users with gait disabilities. Online adaptation (OA) offers an effective solution by adapting in real time to subject variability, device fit, and diverse locomotor tasks. However, OA involves a continual stream of user state data, which can lead to catastrophic forgetting of previously learned locomotor contexts. Here, we develop a manifold-aware experience replay-based online personalization framework designed to maintain user-specific representations across diverse tasks during OA of exoskeleton control. By replaying previously experienced tasks from a replay buffer, we preserve the personalized exoskeleton assistance across all learned tasks. Furthermore, we capture a gait manifold that distinguishes between different locomotor tasks, removing the need for explicit task labeling when selecting target replay bins. We evaluated our framework on emulated hemiplegic gait, which largely deviates from able-bodied patterns, across multiple forgetting scenarios with speed and incline transitions. Our manifold-aware replay framework achieved 40% and 60% improvements in torque and gait phase tracking accuracy, respectively, compared to a baseline framework without replay, which exhibited catastrophic forgetting during task transitions. This demonstrates that our proposed framework personalizes exoskeleton control in real time across diverse locomotor contexts in daily ambulation of clinical populations.

2606.17450 2026-06-17 cs.AI New submission

A Machine-Learned Comorbidity Index

Suleman Baloch, Kishlay Jha, Alberto M. Segre, Philip M. Polgreen, Bijaya Adhikari

Affiliations: Department of Electrical and Computer Engineering, University of Iowa, Iowa, USA; Department of Computer Science, University of Iowa, Iowa, USA; Department of Internal Medicine, University of Iowa, Iowa, USA

AI summary: Proposes a Machine-Learned Comorbidity Index (MLCI) that maps diagnosis codes to a single scalar by maximizing the normalized Hilbert-Schmidt Independence Criterion (nHSIC) between the learned score and multiple clinical outcomes, captures nonlinear risk-outcome dependence, and outperforms baseline methods on multiple EHR datasets.

Comments Accepted at the 43rd International Conference on Machine Learning (ICML 2026), Seoul, South Korea. 35 pages

Abstract

Traditional comorbidity scores (e.g., Charlson and Elixhauser) are widely used for risk adjustment and patient stratification, but they have two key limitations: (i) they are largely mortality-centric and do not align well with other clinical outcomes, and (ii) their linear, rule-based structure cannot capture nonlinear, outcome-specific risk relationships. We propose a Machine-Learned Comorbidity Index (MLCI) that maps diagnosis codes to a single scalar by maximizing the normalized Hilbert-Schmidt Independence Criterion (nHSIC) between the learned score and multiple clinical outcomes. MLCI captures nonlinear risk-outcome dependence and is supported by a theory that characterizes when a unified, informative admission-level ordering can be achieved across outcomes. Empirical results on multiple benchmark electronic health record (EHR) datasets show that MLCI outperforms strong baselines across multiple evaluation metrics.
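
For orientation, the dependence measure named in this abstract can be written out as follows. This is a sketch that assumes the standard biased empirical HSIC estimator and a cosine-style normalization; the abstract does not state the paper's exact estimator, kernels, or how the outcomes are combined. The symbols are notation introduced here: $s_i$ is the learned score for admission $i$, $y_i$ is one clinical outcome, $k$ and $l$ are kernels, and $n$ is the number of admissions.

```latex
\mathrm{HSIC}(S, Y) = \frac{1}{(n-1)^2}\,\operatorname{tr}(K H L H),
\qquad K_{ij} = k(s_i, s_j),\quad L_{ij} = l(y_i, y_j),\quad
H = I_n - \tfrac{1}{n}\mathbf{1}\mathbf{1}^{\top}

\mathrm{nHSIC}(S, Y) =
\frac{\mathrm{HSIC}(S, Y)}{\sqrt{\mathrm{HSIC}(S, S)\,\mathrm{HSIC}(Y, Y)}}
\in [0, 1]
```

Under these assumptions the normalization makes values comparable across outcomes measured on different scales, which is one plausible reason to prefer it when a single score is trained against several outcomes at once.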

2606.17449 2026-06-17 cs.CL cs.AI cs.CV cs.LG cs.MM New submission

MODE-RAG: Manifold Outlier Diagnosis and Energy-based Retrieval-Augmented Generation Evaluation

Zehang Wei, Jiaxin Dai, Jiamin Yan, Xiang Xiang

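Affiliations: School of Computer Science & Tech, Huazhong University of Science and Technology; School of AI and Automation, Huazhong University of Science and Technology

AI summary: Proposes MODE-RAG, a multi-agent system that uses variational free energy and internal attention states to dynamically gate interventions, combining Monte Carlo Tree Search with logit perturbations to reduce hallucinations and logical fabrications in multimodal retrieval-augmented generation.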

Comments To be presented at ACL 2026

Abstract

While Multimodal Retrieval-Augmented Generation (M-RAG) enhances Large Vision-Language Models, it remains highly susceptible to cross-modal hallucinations, causal fabrications, and sycophancy. Furthermore, existing mitigation pipelines often face an intervention paradox: static rules tend to unnecessarily disrupt accurate generations, whereas leaving the multi-modal reasoning completely unguided allows existing mismatches to cascade into severe logical fabrications. To quantify and mitigate these hallucinations, we propose a Multi-Agent system, MODE-RAG, driven by Variational Free Energy (VFE) and internal attention states to dynamically gate interventions. High-risk queries are routed to five stage-specific agents, integrating Monte Carlo Tree Search (MCTS) for rigorous causal derivation and logit perturbations to penalize sycophancy. Dedicated Correction and Overseer agents ensure formatting stability and perform post-hoc factual verification. To objectively evaluate our approach, we introduce ModeVent, a challenging subset derived from the MultiVent dataset. Extensive experiments indicate that our system effectively reduces hallucination rates and logical fabrication, significantly improving the robustness of M-RAG systems.

2606.17446 2026-06-17 cs.RO cs.CV New submission

AnnotateAnything: Automatic Annotation of 3D Assets for Robot Manipulation

Haoran Lu, Mutian Shen, Shuyang Yu, Yu Xiao, Songling Liu, Jianshu Zhang, Shang Wu, Yue Chen, Guo Ye, Jiayi Wang, Zhaoran Wang, Han Liu

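Affiliations: Northwestern University; Peking University

AI summary: Proposes the AnnotateAnything framework, which automatically generates executable manipulation labels for 3D assets through two pipelines, vision-language annotation and physics annotation, improving simulation data-collection efficiency and task success rates.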

Abstract

Simulation enables scalable robot data collection, but raw 3D assets provide only geometry, lacking the semantic, interactive, and physical knowledge needed to specify where and how robots should act. In this work, we present AnnotateAnything, a general automatic annotation framework that converts passive 3D assets into manipulation-ready assets with structured, diverse, and executable manipulation labels. AnnotateAnything is built around two complementary pipelines. First, a unified visual-language annotation pipeline uses vision-language reasoning to infer object semantics, interaction constraints, and 3D-grounded cues, providing human-prior guidance for identifying meaningful interaction regions. Second, a fully automatic and massively parallel physics annotation pipeline grounds these priors in each asset's geometry and physical constraints through candidate generation, geometry optimization, and trajectory generation. This pipeline produces diverse and executable action annotations, including grasp poses, dexterous contacts, articulation waypoints, insertion directions, hanging affordances, and navigation targets. Using the generated annotations, we further build an asynchronous parallel simulation data-collection system across diverse objects, tasks, and robot embodiments. Experiments demonstrate that AnnotateAnything achieves superior annotation efficiency, data-collection efficiency, and task success rates over existing annotation and data-generation pipelines, while also supporting downstream tasks such as affordance detection, robotic VQA, and visual instruction finetuning. We provide project materials on the project page and plan to release the full code, annotations, and benchmark to facilitate future research. Videos, code, demo assets, and annotations are provided in supplementary materials. Project page: https://tourmaline-caramel-169490.netlify.app.

2606.17443 2026-06-17 cs.AI cs.CL cs.CY New submission

Incumbent Advantage: Brand Bias and Cognitive Manipulation Dynamics in LLM Recommendation Systems

Xi Chu, Yupeng Hou

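Affiliations: Trine University; Texas A&M University

AI summary: Studies brand dynamics in LLM recommendations. Well-known brands receive 100% of recommendations (IAI = 10.0) when specifications are identical, but a slight rating advantage for a competitor breaks this monopoly; authority-style marketing language (such as fabricated clinical evidence) breaks it at a Bias Surplus Value equal to +0.17 rating points; and multi-brand GEO competition poses a social dilemma in which collective optimization lowers individual payoffs.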

Comments 16 pages, 4 figures, 11 tables

Abstract

Large language models (LLMs) are becoming a major way for consumers to find products, but we do not yet understand how brands compete in this new channel. We study brand dynamics in LLM recommendations using skincare products -- a category where consumers cannot easily judge quality before buying and must rely on brand reputation -- across three commercial LLMs (GPT-4o-mini, Claude Sonnet, Gemini 3 Flash), with a robustness check on search goods. In three experiments, we find: (1) a Conditional Monopoly where well-known brands get recommended 100% of the time (IAI = 10.0) when all products have the same specifications, but this dominance disappears with less than a +0.1-star rating advantage for a competitor; (2) authority-style marketing language, including fabricated clinical-evidence claims, breaks this monopoly at a Bias Surplus Value equal to +0.17 rating points, with each model responding differently; and (3) a social dilemma in multi-brand GEO competition: when all brands adopt the same optimization strategy, individual payoff falls from +0.802 to +0.007 in our payoff proxy, and non-participating brands receive zero recommendations in our tests. Our results suggest that generative engine optimization (GEO) should be studied not only as a security risk, but also as an emerging marketing practice that shapes market competition.

2606.17438 2026-06-17 cs.CV New submission

Contact-Based Fringe Projection Profilometry for High-Resolution 3-D Surface Measurement of Reflective and Transparent Objects

Ingu Yeo, Hyung-Gun Chi, Jae-Sang Hyun

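Affiliations: Department of Mechanical Engineering, Yonsei University; Yonsei Institute for Embodied Intelligence, Yonsei University

AI summary: To address the limited depth accuracy of GelSight sensors on reflective or transparent objects and their calibration difficulty, proposes a contact-based 3-D measurement method based on digital fringe projection that achieves high-accuracy, full-field 3-D reconstruction through triangulation.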

Abstract

This paper presents a contact-based 3-D surface measurement method based on a Digital Fringe Projection (DFP) system, belonging to the vision-based tactile sensing family pioneered by the commercially successful GelSight sensor. Such sensors have proven effective for robotic fingertip manipulation and contact sensing. However, because GelSight employs photometric stereo with RGB LEDs, it does not measure absolute depth directly but instead infers it by integrating estimated surface gradients, which can accumulate reconstruction errors; in addition, it becomes increasingly difficult to calibrate as the sensing area grows, and its depth accuracy is challenged on highly reflective or transparent objects. To overcome these drawbacks, we propose a fringe-projection-based contact measurement technique that performs triangulation-based 3-D reconstruction on a coated silicone contact surface, providing dense per-pixel surface geometry and full-field 3-D shape measurement over the contact region. By integrating high-accuracy digital fringe projection into the sensor, our approach simplifies calibration over larger areas and enhances depth precision for complex surfaces. Experimental results, including a direct comparison with a GelSight Mini sensor, a sphere-fitting accuracy evaluation, and an uncertainty analysis, confirm that the proposed method significantly improves the accuracy and stability of structured-light-based 3-D measurements, allowing reliable reconstruction of objects with diverse optical properties.
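
The abstract does not say how phase is retrieved from the projected fringes. A common choice in digital fringe projection is N-step phase shifting, and the sketch below illustrates that technique for a single pixel under that assumption. The function name `wrapped_phase` and the synthetic values for background `A`, modulation `B`, and phase `phi` are hypothetical.

```python
import math


def wrapped_phase(intensities):
    """Recover the wrapped phase at one pixel from N equally shifted samples.

    Assumes I_k = A + B*cos(phi + 2*pi*k/N) for k = 0..N-1, with N >= 3.
    The background A and modulation B cancel out of the ratio.
    """
    n = len(intensities)
    num = -sum(i_k * math.sin(2 * math.pi * k / n)
               for k, i_k in enumerate(intensities))
    den = sum(i_k * math.cos(2 * math.pi * k / n)
              for k, i_k in enumerate(intensities))
    return math.atan2(num, den)  # result lies in (-pi, pi]


# Synthetic pixel: hypothetical background, modulation, and true phase.
A, B, phi = 120.0, 80.0, 0.7
samples = [A + B * math.cos(phi + 2 * math.pi * k / 3) for k in range(3)]
recovered = wrapped_phase(samples)
```

In a full system the wrapped phase would still need to be unwrapped and converted to depth through the calibrated projector-camera triangulation; neither step is shown here.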

2606.17437 2026-06-17 cs.CV cs.AI 新提交

Spatio-Temporal Fusion Model for Standard View Classification of Echocardiographic Videos

超声心动图视频标准视图分类的时空融合模型

Bo Gou, Jicheng Zhang, Jianlong Xiong, Tao He, Bentian Liu, Hai Wu, Yijiao Wang, Yu Zhang, Yujia Yang, Yun Dai, Jian Liu, Jie Wang

Affiliations * Department of Ultrasound, The First Affiliated Hospital of Chengdu Medical College, School of Clinical Medicine, Chengdu Medical College; College of Computer Science, Sichuan University; Department of Medical Ultrasound, West China Hospital of Sichuan University; Cancer Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College

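AI summary: To address data scarcity and the difficulty of fusing spatial and temporal features in echocardiographic view classification, the paper proposes an uncertainty-aware dual-stream CNN-LSTM fusion model that achieves competitive performance on EV9V, which the authors describe as the largest publicly available echocardiography video dataset.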

详情
AI中文摘要

超声心动图标准视图的自动分类对于高效的临床工作流程至关重要,但面临三个主要挑战。首先,公开可用的数据集稀缺,且规模和视图覆盖范围有限。其次,一些现代视频级架构在超声心动图视图分类中的性能尚未得到充分探索。第三,某些视图类别在空间外观上高度相似,使得单帧特征不足以区分,而异质的帧质量使得鲁棒的时序信息融合变得复杂。为了解决这些挑战,我们发布了九视图超声心动图视频(EV9V)数据集,包含5,138个视频、910,579帧和9个标准视图,据我们所知,这是最大的公开超声心动图视频数据集。利用EV9V,我们系统地基准测试了代表性的视频分类架构,包括卷积神经网络(CNN)、循环神经网络(RNN)和Transformer。此外,我们提出了一种时空融合模型(STFM),一种高效的双流CNN-LSTM(长短期记忆)框架,联合捕获空间解剖结构和时间心脏动力学。所提出的框架利用不确定性感知学习在训练期间优先采样代表性视频片段,并在推理期间进行基于证据的融合,提高了对超声心动图视频中帧质量变化的鲁棒性。大量实验表明,我们的方法在各种视频分类模型中取得了竞争性能,验证了不确定性感知时空学习在超声心动图视图分类中的有效性。代码可在以下网址获取:https://github.com/bgx666/stfm。

英文摘要

Automated classification of standard echocardiographic views is crucial for efficient clinical workflow but faces three main challenges. First, publicly available datasets are scarce and limited in scale and view coverage. Second, the performance of some modern video-level architectures for echocardiographic view classification remains underexplored. Third, some view categories exhibit highly similar spatial appearances, making single-frame features insufficient for discrimination, while heterogeneous frame quality complicates robust temporal information fusion. To address these challenges, we release the Echocardiographic Videos of Nine Views (EV9V) dataset, comprising 5,138 videos, 910,579 frames, and 9 standard views, which is, to the best of our knowledge, the largest publicly available echocardiography video dataset. Using EV9V, we systematically benchmark representative video classification architectures, including Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Transformers. Furthermore, we propose a Spatio-Temporal Fusion Model (STFM), an efficient dual-stream CNN-LSTM (Long Short-Term Memory) framework that jointly captures spatial anatomical structures and temporal cardiac dynamics. The proposed framework leverages uncertainty-aware learning to preferentially sample representative video segments during training and evidence-based fusion during inference, improving robustness to variations in frame quality across echocardiographic videos. Extensive experiments demonstrate that our method achieves competitive performance across diverse video classification models, validating the effectiveness of uncertainty-aware spatio-temporal learning for echocardiographic view classification. The code is available at https://github.com/bgx666/stfm.
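The abstract mentions evidence-based fusion at inference time without giving its form. The sketch below shows one common evidential formulation, stated here purely as an assumption: each clip produces non-negative per-class evidence, evidence is summed across clips, and the Dirichlet parameters `alpha = evidence + 1` yield both class probabilities and an uncertainty `K / sum(alpha)`. The function name and the example values are hypothetical and are not taken from the STFM code.

```python
def fuse_evidence(clip_evidence):
    """Sum per-class evidence over clips; return (probabilities, uncertainty).

    clip_evidence: list of per-clip lists of non-negative evidence values.
    Uses the Dirichlet view alpha_k = evidence_k + 1 (an assumed formulation).
    """
    num_classes = len(clip_evidence[0])
    total = [sum(clip[k] for clip in clip_evidence) for k in range(num_classes)]
    alpha = [e + 1.0 for e in total]
    strength = sum(alpha)
    probs = [a / strength for a in alpha]
    uncertainty = num_classes / strength
    return probs, uncertainty


# One confident clip and one low-quality clip with little evidence.
clips = [
    [9.0, 0.5, 0.5],
    [0.2, 0.3, 0.1],
]
fused_probs, fused_uncertainty = fuse_evidence(clips)
print(fused_probs, fused_uncertainty)
```

With this rule a low-quality clip contributes little evidence and therefore barely shifts the fused prediction, which is the behavior the abstract attributes to uncertainty-aware fusion under varying frame quality.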

2606.17436 2026-06-17 cs.CV 新提交

UoU: A Universal Fingerprint Foundation Model Based on Large-Scale Unsupervised Learning

UoU:基于大规模无监督学习的通用指纹基础模型

Xiongjun Guan, Jianjiang Feng, Jie Zhou

Affiliations * Department of Automation, Tsinghua University

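AI summary: Proposes UoU, a fingerprint foundation model that uses a multi-level representation hierarchy and a training strategy combining supervised, weakly supervised, and unsupervised learning to extract features that transfer across sensors, image qualities, and applications.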

详情
AI中文摘要

指纹识别仍然由特定任务流水线主导,其中增强、结构解析、对齐和匹配被独立优化。尽管在狭窄场景中有效,但这种设计限制了表示在传感器、质量和下游应用中的重用。因此,我们提出UoU,即“基于大规模无监督学习的通用指纹基础模型”,它将指纹特征提取重新定义为领域特定的基础模型问题。UoU围绕一个多级表示层次组织,涵盖图像恢复、结构场、语义标记、点级生物特征实体和紧凑的全局描述符。其训练方案结合了在精确标注上的监督冷启动、大规模弱监督细化以及大规模无监督巩固,后两个阶段在大规模训练中迭代,使得弱监督拓宽语义覆盖,而无监督学习稳定对应关系、不变性和表示几何。UoU不将指纹图像视为通用纹理,而是利用领域特定的对称性和中间结构,包括方向流、周期性脊模式、稀疏生物特征实体和空间等变性。该框架有意与架构无关:虽然本研究包含一个基于transformer的结构化预测初始实例,但更广泛的设计支持多任务学习、可扩展模型配置以及针对匹配、对齐、增强、配准和相关指纹应用的下游专业化。本文介绍了UoU的技术动机、系统设计和验证协议,部分基线实现已公开于 https://github.com/XiongjunGuan/UoU。

英文摘要

Fingerprint recognition is still dominated by task-specific pipelines, where enhancement, structural parsing, alignment, and matching are optimized in isolation. Although effective in narrow settings, this design limits representation reuse across sensors, qualities, and downstream applications. We therefore present UoU, short for "a Universal fingerprint foundation model based on large-scale Unsupervised learning," which reframes fingerprint feature extraction as a domain-specific foundation-model problem. UoU is organized around a multi-level representation hierarchy spanning image restoration, structural fields, semantic tokens, point-level biometric entities, and compact global descriptors. Its training recipe combines a supervised cold start on precise annotations, large-scale weakly supervised refinement, and large-scale unsupervised consolidation, with the latter two stages iterated during large-scale training so that weak supervision broadens semantic coverage while unsupervised learning stabilizes correspondences, invariances, and representation geometry. Rather than treating fingerprint imagery as generic texture, UoU exploits domain-specific symmetries and intermediate structure, including orientation flow, periodic ridge patterns, sparse biometric entities, and spatial equivariance. The framework is intentionally architecture-agnostic: while the present study includes an initial transformer-based structured-prediction instantiation, the broader design supports multi-task learning, scalable model configurations, and downstream specialization for matching, alignment, enhancement, registration, and related fingerprint applications. This paper presents the technical motivation, system design, and validation protocol of UoU, and part of the baseline implementation is publicly available at https://github.com/XiongjunGuan/UoU.

2606.17435 2026-06-17 cs.LG 新提交

MorphStrata: Layer-Specific Perturbations for Generating Morphence Students in Time-Series Moving Target Defense

MorphStrata: 时间序列移动目标防御中生成Morphence学生的层特定扰动

Abhishek Bhardwaj, Arnav Doshi, Anusri Nagarajan, Thanh Quynh Nhu Ta, Mohammad Masum, Robert Chun, Jaydip Sen, Saptarshi Sengupta

Affiliations * Department of Computer Science, San José State University, San José, CA, USA; Department of Computer Engineering, San José State University, San José, CA, USA; Praxis Business School, Kolkata, India

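AI summary: Proposes MorphStrata, which generates structurally heterogeneous student models through selective, layer-specific stochastic noise injection. It preserves moving-target-defense robustness while adding less than 1% training overhead in most experiments, and on a high-entropy periodic dataset it improves adversarial RMSE over the static baseline by up to 24.11% under FGSM and 97.97% under BIM.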

Comments 13 pages, 9 figures, 11 tables

详情
AI中文摘要

时间序列预测模型仍然容易受到基于梯度的对抗攻击,而现有的防御机制通常会在鲁棒性与有限响应和计算成本之间进行权衡。这个问题在移动目标防御中尤为突出,因为维护多个随机化模型实例会显著增加训练开销。在这项工作中,我们引入了MorphStrata,一种具有选择性、层特定随机噪声注入的学生生成策略,扩展了传统的Morphence防御。MorphStrata使用Transformer骨干网络作为教师,并随机扰动选定的架构块,以在学生模型之间创建结构异质性,以应对不同的数据分布和威胁模型。我们在包括Jena Climate、Electricity Load Diagrams和Appliances Energy Prediction在内的一系列基准测试上,使用FGSM、BIM和PGD攻击以及多种攻击强度,与原始Transformer和Morphence骨干网络进行了评估。在不同的数据集和攻击机制下,所提出的集成模型保持了可比较的对抗RMSE。具体来说,对于高熵、周期性的数据集(如AEP数据),MorphStrata在所有攻击和扰动预算下实现了最低的RMSE,在30次随机试验中,在epsilon值为0.5时,相对于静态基线,在FGSM和BIM下分别提高了24.11%和97.97%。在大多数实验中,针对层生成MorphStrata学生导致的训练时间增加不到Morphence MTD基线的1%,同时实现了两位数的对抗RMSE降低。我们还观察到生成的学生的成对L2距离与整体防御有效性之间存在正相关。总之,与现有基线相比,MorphStrata以边际成本差异保持了作为MTD防御的对抗鲁棒性。

英文摘要

Time-series forecasting models remain vulnerable to gradient-based adversarial attacks, while existing defense mechanisms typically trade off robustness against bounded response and compute cost. The problem is pronounced in Moving Target Defense (MTD), where maintaining multiple randomized model instances substantially increases training overhead. In this work, we introduce MorphStrata, a student generation strategy with selective, layer-specific stochastic noise injection that extends the traditional Morphence defense. MorphStrata uses a Transformer backbone as the teacher and perturbs randomly selected architectural blocks to create structured heterogeneity across student models in response to varied data distributions and threat models. We evaluate against vanilla Transformer and Morphence backbones on a suite of benchmarks including Jena Climate, Electricity Load Diagrams, and Appliances Energy Prediction (AEP), using FGSM, BIM, and PGD attacks across multiple attack strengths. Across datasets and attack regimes, the proposed ensemble maintains comparable adversarial RMSE. Specifically, for high-entropy, periodic datasets such as the AEP data, MorphStrata achieves the lowest RMSE across all attacks and perturbation budgets, improving over the static baseline by up to 24.11% and 97.97% under FGSM and BIM respectively at an epsilon value of 0.5 over 30 randomized trials. Targeting specific layers to generate MorphStrata students adds less than 1% to training time over the Morphence MTD baseline in most experiments, while delivering double-digit reductions in adversarial RMSE. We also observe a positive correlation between higher pairwise L2 distance (among generated students) and overall defense effectiveness. In summary, MorphStrata maintains adversarial robustness as an MTD defense at marginal additional cost compared to existing baselines.
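A minimal sketch of the idea of layer-specific noise injection and of the pairwise L2 distance between students may help here. It is not the authors' implementation: the teacher is a toy dictionary of weight lists, and the block names, noise scale `sigma`, and helper names `make_student` and `l2_distance` are all assumptions.

```python
import math
import random


def make_student(teacher, layers_to_perturb, sigma, rng):
    """Copy the teacher and add Gaussian noise only to the selected blocks."""
    student = {}
    for name, weights in teacher.items():
        if name in layers_to_perturb:
            student[name] = [w + rng.gauss(0.0, sigma) for w in weights]
        else:
            student[name] = list(weights)
    return student


def l2_distance(a, b):
    """L2 distance between two models over all of their weights."""
    return math.sqrt(
        sum((x - y) ** 2 for name in a for x, y in zip(a[name], b[name]))
    )


rng = random.Random(0)
# Hypothetical teacher with four named blocks.
teacher = {
    "embed": [0.1, -0.2, 0.3],
    "attn_block_1": [0.5, 0.4, -0.1],
    "attn_block_2": [-0.3, 0.2, 0.6],
    "head": [0.05, 0.7],
}
candidate_blocks = ["attn_block_1", "attn_block_2"]

students = []
for _ in range(3):
    chosen = set(rng.sample(candidate_blocks, k=1))  # randomly selected block
    students.append(make_student(teacher, chosen, sigma=0.05, rng=rng))

pairwise = [
    l2_distance(students[i], students[j])
    for i in range(len(students))
    for j in range(i + 1, len(students))
]
print(pairwise)
```

Only the chosen blocks differ from the teacher, so generating a student costs a copy plus a small perturbation. The `pairwise` list is the kind of diversity measure the abstract reports as correlating with defense effectiveness.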

2606.17433 2026-06-17 cs.CV 新提交

LADBench: A Benchmark for Logical Fault Detection in Images

LADBench: 图像中逻辑故障检测的基准

Sahasra Kondapalli, Lara Radovanovic, Aadi Palnitkar, Mingyang Mao, Xiaomin Lin

Affiliations * University of South Florida

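AI summary: Introduces LAD-Bench, a benchmark of more than 1,000 synthetic images containing logical anomalies across four domains, and evaluates models with a tiered prompting protocol, revealing that current VLMs fall short at implicit logical fault detection.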

Comments Accepted to the IEEE International Conference on Development and Learning (ICDL 2026)

详情
AI中文摘要

大型视觉语言模型在视觉问答和语义定位方面表现出色,但其自主逻辑推理能力仍未被充分探索。现有的异常基准强调视觉错误或直接提示,而非开放世界部署所需的物理和社会常识。为此,我们引入了LAD-bench,一个包含1000多张精心策划的合成图像基准,涵盖四个领域的逻辑异常:住宅、城市、协作和自然。我们进一步提出了一种基于渐进式揭示的分层提示协议,该协议衡量模型在定位和推理逻辑故障时需要多少显式帮助。评估领先的基础模型揭示了重大弱点:即使最好的模型也仅达到70.11%的整体准确率,表明隐式逻辑故障检测仍未解决。关键的是,即使在更深层次收到显式提示后,模型也常常无法识别异常。通过揭示这些序列多模态推理中的局限性,LAD-Bench为推进自主视觉系统的安全性、可靠性和认知对齐提供了一个严格的框架。数据集和代码:https://huggingface.co/datasets/SahasraK/LADBench

英文摘要

Large Vision Language Models (VLMs) excel at visual question answering and semantic grounding, but their capacity for autonomous logical reasoning remains underexplored. Existing anomaly benchmarks emphasize visual errors or direct prompting rather than the physical and social common sense needed for open-world deployment. To address this, we introduce LAD-bench, a benchmark of more than 1,000 curated synthetic images with logical anomalies across four domains: Residential, Urban, Collaborative, and Nature. We further propose a Tiered Prompting Protocol based on progressive disclosure, which measures how much explicit assistance a model needs to localize and reason about a logical fault. Evaluating leading foundation models reveals substantial weaknesses: even the best achieves only 70.11% overall accuracy, showing that implicit logical fault detection remains unsolved. Crucially, models often fail to identify anomalies even after receiving explicit hints in deeper tiers. By surfacing these limitations in sequential multimodal reasoning, LAD-Bench offers a rigorous framework for advancing the safety, reliability, and cognitive alignment of autonomous visual systems. Dataset and Code: https://huggingface.co/datasets/SahasraK/LADBench
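The Tiered Prompting Protocol measures how much explicit help a model needs before it identifies a logical fault. One way to read that is as a loop over increasingly explicit prompts that records the first tier at which the answer is judged correct. The sketch below assumes that reading. The prompts, the stub model, and the correctness check are hypothetical and are not taken from the benchmark.

```python
def first_successful_tier(ask_model, image, tier_prompts, is_correct):
    """Return the 1-based tier at which the model first succeeds, else None."""
    for tier, prompt in enumerate(tier_prompts, start=1):
        answer = ask_model(image, prompt)
        if is_correct(answer):
            return tier
    return None


# Hypothetical prompts, ordered from implicit to explicit.
tier_prompts = [
    "Describe this image.",
    "Is anything in the kitchen area logically wrong?",
    "Is the kettle on the lit stove made of plastic?",
]


def stub_model(image, prompt):
    """Stand-in for a VLM call: it only notices the fault when hinted."""
    if "kitchen" in prompt or "kettle" in prompt:
        return "A plastic kettle is sitting on a lit stove."
    return "A tidy kitchen with a stove and a window."


def mentions_fault(answer):
    return "plastic kettle" in answer


tier = first_successful_tier(stub_model, "kitchen.png", tier_prompts, mentions_fault)
print(tier)
```

A lower tier means the model needed less assistance. A result of `None` corresponds to the case the abstract highlights, where a model still misses the anomaly after explicit hints.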

2606.17431 2026-06-17 cs.CV 新提交

Visual Retrieval-Augmented Generation for Silhouette-Guided Animal Art

视觉检索增强生成:基于轮廓引导的动物艺术创作

Quoc-Duy Tran, Anh-Tuan Vo, Trung-Nghia Le

Affiliations * University of Science, VNU-HCM; Vietnam National University, Ho Chi Minh

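AI summary: Proposes Visual Retrieval-Augmented Generation (Visual-RAG), a framework that retrieves animal shapes structurally similar to a natural silhouette and uses them with ControlNet and IP-Adapter to guide a diffusion model in generating animal art, as a step toward computational pareidolia.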

Comments SOICT 2025

详情
AI中文摘要

生成式AI已经提升了渲染逼真或艺术图像的能力,但在人类创造力的一个关键方面仍然有限:解释模糊形状。这种现象根植于空想性视错觉,使人类能够从云、石头或树叶等随机图案中感知有意义的形状。为了在计算上复制这一想象过程,我们引入了视觉检索增强生成(Visual-RAG),这是一个直接从自然轮廓生成动物艺术的框架。我们的方法从包含28,586个高质量轮廓的精选语料库中检索结构相似的动物形状,并将其作为参考示例,通过ControlNet和IP-Adapter引导基于扩散的生成。消融研究证实,使用RANSAC的形状上下文提供了最准确的匹配,而去除形状标准化会使内点比率降至仅13.4%,强调了结构保真度在Visual-RAG中的重要性。一项包含12名参与者的用户研究从美学、轮廓保真度和整体印象方面评估了输出结果。结果表明,虽然Visual-RAG提供了合理的解释,但在实现高感知影响力方面仍存在挑战。这项工作为计算空想性视错觉奠定了基础,展示了机器如何为想象发现的早期阶段做出贡献。

英文摘要

Generative AI has advanced the ability to render photorealistic or artistic images, yet it remains limited in a key aspect of human creativity: interpreting ambiguous shapes. This phenomenon, rooted in pareidolia, allows humans to perceive meaningful forms in random patterns such as clouds, stones, or leaves. To computationally replicate this imaginative process, we introduce Visual Retrieval-Augmented Generation (Visual-RAG), a framework that generates animal art directly from natural silhouettes. Our method retrieves structurally similar animal shapes from a curated corpus of 28,586 high-quality silhouettes and uses them as reference exemplars to guide diffusion-based generation with ControlNet and IP-Adapter. Ablation studies confirm that Shape Context with RANSAC provides the most accurate alignment, while removing shape standardization reduces the inlier ratio to just 13.4%, underscoring the importance of structural fidelity in Visual-RAG. A user study with 12 participants evaluated the outputs in terms of aesthetics, silhouette fidelity, and overall impression. Results reveal that while Visual-RAG provides plausible interpretations, challenges remain in achieving high perceptual impact. This work lays the foundation for computational pareidolia, showing how machines can contribute to the early stages of imaginative discovery.

2606.17430 2026-06-17 cs.CV 新提交

CIAN: Multi-Stage Framework for Event-Enriched Image Captioning via Retrieval-Augmented Generation

CIAN:基于检索增强生成的事件丰富图像描述的多阶段框架

Trinh Thi Thu Hien, Trung-Nghia Le

Affiliations * University of Science, Ho Chi Minh City; Vietnam National University, Ho Chi Minh City

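AI summary: Proposes CIAN, a multi-stage framework for event-enriched image captioning that retrieves relevant articles, generates narratives with a LoRA-fine-tuned Qwen model, and applies N-gram-based refinement, raising CIDEr from 0.030 to 0.094 on OpenEvents-V1.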

Comments SOICT 2025

详情
AI中文摘要

事件丰富的图像描述不仅描述可见内容,还描述事件的更广泛背景,包括时间、地点和参与者,这是大多数基于像素的模型所缺乏的能力。我们提出了上下文图像-文章叙述器(CIAN),这是一个多阶段框架,通过外部叙述丰富描述。CIAN使用SigLIP检索相关文章,总结它们以指导使用LoRA微调的Qwen模型进行叙事生成,并应用基于N-Gram的精炼以提高流畅性和连贯性。在OpenEvents-V1基准上,CIAN实现了高检索性能(mAP 0.979),并提高了描述质量,将CIDEr从0.030提升到0.094。这些结果突显了检索增强推理与语言精炼相结合在生成上下文感知、类人描述方面的有效性。

英文摘要

Event-enriched image captioning describes not only visible content but also the broader context of events, including timing, location, and participants, capabilities missing in most pixel-bound models. We propose the Contextual Image-Article Narrator (CIAN), a multi-stage framework that enriches captions with external narratives. CIAN retrieves relevant articles using SigLIP, summarizes them to guide a Narrative Generation stage with a LoRA-fine-tuned Qwen model, and applies N-Gram-based Refinement for fluency and coherence. On the OpenEvents-V1 benchmark, CIAN achieves high retrieval performance (mAP 0.979) and improves caption quality, increasing CIDEr from 0.030 to 0.094. These results highlight the effectiveness of retrieval-augmented reasoning combined with linguistic refinement for generating context-aware, human-like captions.

2606.17427 2026-06-17 cs.CV cs.HC 新提交

Impact of Hand Impairment and Occlusions on Hand Pose Estimation Accuracy in Augmented Reality Applications

手部损伤和遮挡对增强现实应用中手部姿态估计精度的影响

Damian M. Manzone, Mathew Szymanowski, Olga Taran, Shuo Cai, Melissa Marquez-Chin, Tammy Zeng, Hardeep Singh, Cesar Marquez-Chin, José Zariffa

Affiliations * KITE Research Institute, Toronto Rehabilitation Institute, University Health Network; Institute of Biomedical Engineering, University of Toronto; Department of Health Sciences and Technology, ETH Zürich; Department of Occupational Science & Occupational Therapy and the Rehabilitation Sciences Institute, University of Toronto

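AI summary: Evaluates the accuracy of the HoloLens 2 and several pose estimation algorithms under hand impairment and object occlusion, finding that the predictions generalize to people with hand impairment and that clear objects give a small accuracy advantage over opaque ones.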

详情
AI中文摘要

混合现实应用可设计用于手部康复。增强现实(AR)头戴式显示器(HMD)特别允许生态有效的任务,因为个体可以看到真实环境并与真实物体交互,同时在HMD上接收额外提示。虽然这些应用依赖于准确的手部姿态估计,但在调查手部损伤或真实物体交互遮挡对姿态估计精度的影响方面存在空白。此外,AR HMD预测与最先进姿态估计方法之间的比较尚未建立。本研究评估了HoloLens 2 HMD和最先进姿态估计算法(WiLoR、HaMeR、WildHands和MediaPipe)在颈椎损伤(cSCI;n=13,神经损伤水平:C3-C6;美国脊柱损伤协会损伤量表:A-D)和15名未受伤对照者与透明和不透明物体交互时的姿态估计精度。通过多摄像头设置三角测量生成3D关节位置的真实值。姿态估计精度在cSCI和未受伤对照组之间没有差异,表明HoloLens 2和姿态估计算法的3D关节预测可以泛化到手部损伤人群。此外,透明物体比不透明物体提供了微小的精度优势(0.1毫米),WiLoR和HaMeR的预测比HoloLens 2略精确(2毫米)。总体而言,这些结果表明HoloLens 2可能适用于手部康复应用,生成的数据集可用于改进手部损伤人群的姿态估计方法。

英文摘要

Mixed reality applications can be designed for hand rehabilitation. Augmented reality (AR) head mounted displays (HMDs) specifically allow for ecologically valid tasks because individuals can see their real environment and interact with real objects while receiving additional cues on the HMD. While these applications rely on accurate hand pose estimation, there is a gap in investigating the influence of hand impairment or occlusion from real-object interactions on pose estimation accuracy. Further, comparisons between AR HMD predictions and state-of-the-art pose estimation methods have not been established. The current study assessed pose estimation accuracy of the HoloLens 2 HMD and state-of-the-art pose estimation algorithms (WiLoR, HaMeR, WildHands, and MediaPipe) while individuals with cervical spinal cord injury (cSCI; n = 13, Neurological Level of Injury: C3-C6; American Spinal Injury Association Impairment Scale: A-D) and 15 uninjured controls interacted with clear and opaque objects. Ground truth estimates of 3D joint positions were generated via triangulation from a multi-camera setup. Pose estimation accuracy did not differ between the cSCI and uninjured control groups suggesting that 3D joint predictions from the HoloLens 2 and pose estimation algorithms can generalize to populations with hand impairment. Further, clear objects provided a small accuracy advantage over opaque objects (0.1 mm) and predictions from both WiLoR and HaMeR were slightly more accurate than the HoloLens 2 (2 mm). Overall, these results suggest that the HoloLens 2 may be viable for hand rehabilitation applications and the dataset generated can be used to refine pose estimation methods for hand-impaired populations.
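The accuracy differences above are reported in millimetres against triangulated 3D joint positions. The abstract does not name the exact metric, so the sketch below assumes a mean per-joint Euclidean error, a common choice for 3D pose evaluation. The joint coordinates are made up for illustration.

```python
import math


def mean_joint_error(predicted, ground_truth):
    """Mean Euclidean distance between matching 3D joints (same units as input)."""
    distances = [math.dist(p, t) for p, t in zip(predicted, ground_truth)]
    return sum(distances) / len(distances)


# Two hypothetical joints, coordinates in millimetres.
ground_truth = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
predicted = [(0.0, 0.0, 0.0), (13.0, 4.0, 0.0)]

error_mm = mean_joint_error(predicted, ground_truth)
print(f"{error_mm:.2f} mm")
```

Computing this value per participant group, per object type, and per method would give the kind of comparisons the abstract summarizes, for example a 0.1 mm difference between clear and opaque objects.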

2606.17418 2026-06-17 cs.RO 新提交

DexLink Hand: A Compact, Affordable, 16-DOF Linkage-Driven Hand with Human-Like Dexterity

DexLink Hand:一款紧凑、经济、16自由度连杆驱动且具有类人灵巧性的手部

Hao Wu, Yanzhe Wang, Yu Feng, Jian Liu, Jihao Li, Jianshu Zhou, Huixu Dong

Affiliations * Zhejiang University; National University of Singapore

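AI summary: Presents a compact, low-cost, linkage-driven anthropomorphic hand whose hybrid planar and spatial linkage mechanisms drive 20 joints with 16 independent actuators. The prototype weighs 320 g, costs less than USD 400, reaches the maximum Kapandji score, and reproduces all 33 Feix grasp types.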

详情
AI中文摘要

灵巧机器人手在灵巧性、紧凑性和经济性之间长期面临权衡。特别是,高自由度设计通常需要复杂的驱动和传动,阻碍了其集成到人形尺寸中。为解决这些挑战,本文提出一种紧凑、低成本的连杆驱动仿人手,实现了高灵巧性、结构集成和类人手功能。该手集成了由16个独立驱动器驱动的20个关节,所有驱动、传感和传动组件紧凑地嵌入人手大小的结构中。最终原型仅重320g,总成本低于400美元。为实现这些目标,提出了一种结合平面和空间连杆机构的混合机械架构,实现了解耦的多向运动、仿生关节协同和高被动承载能力。拇指进一步采用了支持类人重构和对掌运动的仿生特征。通过这些机构和结构布局的协调集成,原型实现了具有仿生灵巧性的高度集成设计。实验评估表明,该手达到了最大Kapandji评分,复现了所有33种Feix抓取类型,并在多种日常物品和工具上实现了稳定抓取和灵巧操作。这些结果验证了所提出的手作为面向以人为中心环境中灵巧操作、遥操作和机器人学习的低成本、紧凑且机械高效的平台。

英文摘要

Dexterous robotic hands face a longstanding trade-off among dexterity, compactness, and affordability. Particularly, high-degree-of-freedom designs typically demand complex actuation and transmission, hindering integration into human-scale forms. To address these challenges, this work presents a compact, low-cost linkage-driven anthropomorphic hand that achieves high dexterity, structural integration, and human-hand-like functionality. The hand integrates 20 joints driven by 16 independent actuators, with all actuation, sensing, and transmission components compactly embedded within a human-hand-sized structure. The resulting prototype weighs only 320g at a total cost below USD 400. To meet these objectives, a hybrid mechanical architecture combining planar and spatial linkage mechanisms is proposed, enabling decoupled multidirectional motion, biomimetic joint synergies, and high passive load-bearing capability. The thumb further incorporates biomimetic features supporting human-like reconfiguration and opposition movements. Through the coordinated integration of these mechanisms and structural layout, the prototype achieves a highly integrated design with anthropomorphic dexterity. Experimental evaluations demonstrate that the hand achieves the maximum Kapandji score, reproduces all 33 Feix grasp types, and performs stable grasping and dexterous manipulation across a wide variety of daily objects and tools. These results validate the proposed hand as an affordable, compact, and mechanically efficient platform for dexterous manipulation, teleoperation, and robot learning in human-centered environments.

2606.17417 2026-06-17 cs.SD cs.LG 新提交

A Closer Look at Failure Modes in Temporal Understanding of Large Audio-Language Models

大型音频语言模型时间理解失败模式的深入分析

Apoorva Kulkarni, Kaousheik Jayakumar, Sreyan Ghosh, Sarah Wiegreffe, Dinesh Manocha, Ramani Duraiswami

Affiliations * University of Maryland, College Park

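AI summary: Uses behavioral and causal mechanistic analyses to study why large audio-language models fail at temporal reasoning. It finds that models under-use audio when textual cues are available, that modality imbalance alone does not explain the failures, and that an attention intervention at bottleneck layers raises accuracy from 55.9% to 59.1% without fine-tuning.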

Comments Accepted to Interspeech 2026

详情
AI中文摘要

大型音频语言模型(LALMs)在各种音频理解任务上表现出色,但在时间推理这一人类听觉感知的核心能力上仍存在困难。理解这些失败的原因仍然具有挑战性,因为现有基准报告了性能差距,但没有探究潜在机制。为此,我们引入了一个包含1657个问题的基准测试,涵盖三项基础任务,专门用于机制分析。检查模型在不同输入设置下的输出(行为分析)表明,当文本线索可用时,模型往往未充分利用音频。我们还首次对LALMs中的时间推理失败进行了因果机制分析。比较注意力加权与缩放,我们发现重新分配音频令牌上的注意力比增加音频注意力更有效。针对任务相关令牌进一步提升了效果。这些发现表明,模态不平衡本身不能解释失败。瓶颈层的注意力缩放在不进行微调的情况下将准确率从55.9%提高到59.1%,为未来工作展示了一个有前景的方向。

英文摘要

Large Audio Language Models (LALMs) achieve strong performance on a variety of audio understanding tasks but continue to struggle with temporal reasoning, a fundamental capability central to human auditory perception. Understanding the causes of these failures remains challenging as existing benchmarks report performance gaps without probing underlying mechanisms. To address this, we introduce a benchmark with 1,657 questions across three foundational tasks designed specifically for mechanistic analysis. Examining model outputs across varying input settings (behavioral analysis) reveals that models often under-utilize audio when textual cues are available. We also provide the first causal mechanistic analysis of temporal reasoning failures in LALMs. Comparing attention upweighting against scaling, we find that redistributing attention across audio tokens is more effective than increasing audio attention. Targeting task-relevant tokens yields further gains. These findings suggest that modality imbalance alone cannot explain failures. Attention scaling at bottleneck layers improves accuracy from 55.9% to 59.1% without fine-tuning, demonstrating a promising direction for future work.
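The abstract contrasts increasing the attention given to audio with redistributing attention across audio tokens. It does not define the interventions precisely, so the sketch below is only one possible reading. `increase_audio_mass` raises the total attention on audio tokens. `redistribute_within_audio` keeps that total fixed and shifts it toward task-relevant audio tokens. All names, indices, and factors are assumptions.

```python
def increase_audio_mass(weights, audio_idx, factor):
    """Multiply audio-token weights by `factor`, then renormalize all weights."""
    scaled = [w * factor if i in audio_idx else w for i, w in enumerate(weights)]
    total = sum(scaled)
    return [w / total for w in scaled]


def redistribute_within_audio(weights, audio_idx, relevant_idx, boost):
    """Keep total audio attention fixed; move it toward relevant audio tokens."""
    audio_mass = sum(weights[i] for i in audio_idx)
    reweighted = {
        i: weights[i] * (boost if i in relevant_idx else 1.0) for i in audio_idx
    }
    norm = sum(reweighted.values())
    out = list(weights)
    for i in audio_idx:
        out[i] = reweighted[i] / norm * audio_mass
    return out


# One query's attention over two text tokens (0, 1) and three audio tokens (2-4).
weights = [0.4, 0.3, 0.1, 0.1, 0.1]
audio_idx = {2, 3, 4}
relevant_idx = {3}  # hypothetical task-relevant audio token

more_audio = increase_audio_mass(weights, audio_idx, factor=2.0)
redistributed = redistribute_within_audio(weights, audio_idx, relevant_idx, boost=3.0)
print(more_audio)
print(redistributed)
```

The two functions separate how much attention audio receives from where within the audio that attention lands, which is the distinction behind the abstract's conclusion that modality imbalance alone cannot explain the failures.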

2606.17416 2026-06-17 cs.SD cs.AI 新提交

L-Proto: Language-Aware Episodic Prototypical Training for Multilingual Speaker Verification

L-Proto: 面向多语言说话人验证的语言感知情景原型训练

Hyung-Seok Oh, Deok-Hyeon Cho, Seung-Bin Kim, Seong-Whan Lee

Affiliations * Department of Artificial Intelligence, Korea University

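AI summary: Language-dependent acoustic variability entangles speaker identity with linguistic characteristics in multilingual speaker verification. To address this, the paper proposes L-Proto, a language-aware episodic prototypical training strategy that builds language-consistent training episodes to reduce language-driven variation and improve cross-lingual generalization.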

Comments Accepted by INTERSPEECH 2026

详情
AI中文摘要

多语言说话人验证仍然具有挑战性,因为语言相关的声学变异导致说话人身份与语言特征纠缠,降低了跨语言的泛化能力。在多语言训练中,嵌入向量通常将语言线索与说话人身份一起编码,导致说话人形成特定语言的聚类。我们提出L-Proto,一种语言感知的情景原型训练策略,该策略构建语言一致的训练情景。通过在每个情景中从单一语言采样说话人,L-Proto减少了训练期间的语言驱动变异,并鼓励嵌入向量更直接地关注说话人身份。在TidyVoice挑战基准上的实验表明,与传统的微调和随机情景采样相比,在多种骨干架构上均取得了一致的性能提升。

英文摘要

Multilingual speaker verification remains challenging because language-dependent acoustic variability causes speaker identity to become entangled with linguistic characteristics, degrading generalization across languages. In multilingual training, embeddings often encode language cues with speaker identity, causing speakers to form language-specific clusters. We propose L-Proto, a language-aware episodic prototypical training strategy that constructs language-consistent episodes. By sampling speakers from a single language per episode, L-Proto reduces language-driven variation during training and encourages embeddings to focus more directly on speaker identity. Experiments on the TidyVoice Challenge benchmark demonstrate consistent performance improvements over conventional fine-tuning and random episodic sampling across multiple backbone architectures.
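Language-consistent episode construction, as described above, amounts to picking one language per episode and then sampling speakers and utterances only from that language. The sketch below assumes a nested dictionary layout (`language -> speaker -> utterance ids`) and episode sizes that are not specified in the abstract.

```python
import random


def sample_episode(corpus, n_speakers, n_utts, rng):
    """Sample one episode whose speakers all come from a single language."""
    # Keep only languages with enough speakers that have enough utterances.
    eligible = [
        lang
        for lang, speakers in corpus.items()
        if sum(len(utts) >= n_utts for utts in speakers.values()) >= n_speakers
    ]
    language = rng.choice(eligible)
    pool = [s for s, utts in corpus[language].items() if len(utts) >= n_utts]
    chosen = rng.sample(pool, n_speakers)
    episode = {s: rng.sample(corpus[language][s], n_utts) for s in chosen}
    return language, episode


# Hypothetical toy corpus.
corpus = {
    "en": {"spk_a": ["a1", "a2", "a3"], "spk_b": ["b1", "b2"], "spk_c": ["c1", "c2"]},
    "ko": {"spk_d": ["d1", "d2"], "spk_e": ["e1", "e2", "e3"]},
    "de": {"spk_f": ["f1"]},  # too little data, never selected
}

rng = random.Random(0)
language, episode = sample_episode(corpus, n_speakers=2, n_utts=2, rng=rng)
print(language, episode)
```

Because every speaker in an episode shares a language, the prototypes within that episode cannot be separated by language cues, which is the mechanism the abstract credits for pushing embeddings toward speaker identity.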

2606.17413 2026-06-17 cs.LG stat.AP 新提交

Amortized Probabilistic Retrieval of Atmospheric CO2 from OCO-2 Spectra Using Deep Learning with Laplace Approximations and Normalizing Flows

基于深度学习的OCO-2光谱大气CO2摊销概率检索:结合拉普拉斯近似与归一化流

Alejandro Calle-Saldarriaga, Felix Jimenez, Jack Grosskreuz, Jiazheng Wang, Jonathan Hobbs, Matthias Katzfuss

Affiliations * University of Wisconsin–Madison; Jet Propulsion Laboratory, California Institute of Technology

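AI summary: Proposes a deep learning framework that uses Laplace approximations and normalizing flows to retrieve atmospheric CO2 from OCO-2 spectra with quantified uncertainty. On simulated data it runs orders of magnitude faster than operational full-physics retrievals and reports higher accuracy than baseline methods.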

Comments 23 pages, 8 figures

详情
AI中文摘要

基于空间的大气二氧化碳(CO2)监测对于约束全球碳收支至关重要。NASA的轨道碳观测者-2号(OCO-2)利用高分辨率光谱估算柱平均干空气CO2摩尔分数(XCO2)。然而,当前的操作检索算法计算成本高且未能正确量化不确定性。我们提出了一种新颖的深度学习框架来解决这些挑战。由于真实卫星观测的地面真值数据难以获取,我们使用高保真模拟数据集开发并验证了我们的方法。该数据集旨在支持OCO-2不确定性量化(UQ),并包含了真实的前向模型误差。我们的架构使用多分支神经网络编码光谱波段,并通过两种可扩展的UQ方法——拉普拉斯近似和归一化流——来估计完整CO2柱或其所需汇总的后验分布。与操作性的“全物理”求解器相比,我们的方法具有五个关键优势:(1)摊销:推理速度提高数个数量级,能够实时处理海量数据流;(2)模型误差鲁棒性:通过在明确包含模型差异的模拟数据上训练,我们的方法考虑了标准反演中常被忽略的系统误差;(3)点估计精度:与基线方法相比,我们实现了更优的预测精度;(4)改进的UQ:概率输出提供了校准更好的不确定性估计;(5)非高斯后验:当使用归一化流时,我们的框架成功建模了复杂、非对称的后验分布,克服了高斯假设的局限性。这些结果表明,基于模拟的深度学习是迈向下一代操作处理系统的可行路径。

英文摘要

Space-based monitoring of atmospheric carbon dioxide (CO2) is essential for constraining the global carbon budget. NASA's Orbiting Carbon Observatory-2 (OCO-2) estimates column-averaged dry-air mole fractions of CO2 (XCO2) using high-resolution spectra. However, current operational retrieval algorithms are computationally expensive and do not properly quantify uncertainties. We present a novel deep learning framework that addresses these challenges. Because ground-truth data for real satellite observations are difficult to obtain, we develop and validate our approach using a high-fidelity simulation dataset. This dataset, created to support OCO-2 uncertainty quantification (UQ), incorporates realistic forward model errors. Our architecture encodes spectral bands using a multi-branch neural network and estimates posteriors of the full CO2 column or desired summaries thereof using two scalable UQ methods: Laplace approximations and normalizing flows. Our approach has five key advantages relative to operational "full-physics" solvers: (1) Amortization: Inference is orders of magnitude faster, enabling real-time processing of massive data streams; (2) Model error robustness: By training on simulations that explicitly include model discrepancies, our method accounts for systematic errors often neglected by standard inversions; (3) Point estimate accuracy: We achieve superior predictive accuracy compared to baseline methods; (4) Improved UQ: The probabilistic outputs yield better-calibrated uncertainty estimates; and (5) Non-Gaussian posteriors: When utilizing normalizing flows, our framework successfully models complex, asymmetric posterior distributions, overcoming the limitations of the Gaussian assumption. These results suggest that simulation-based deep learning is a viable path toward next-generation operational processing systems.

2606.17410 2026-06-17 cs.CV 新提交

Attention Alignment Between Humans and Vision-Language Models

人类与视觉语言模型之间的注意力对齐

Isaac R. Christian, Udith Haputhanthrige, Hanna Hornfeld, Declan Campbell, Samuel Nastase, Taylor Webb, Michael Graziano

Affiliations * Princeton Neuroscience Institute, Princeton University; Department of Psychology, Princeton University; Department of Computer Science, Princeton University; Department of Psychology and Center for Computational Language Sciences, University of Southern California; Department of Psychology, Université de Montréal

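AI summary: Compares spatial attention maps from six vision-language models with human fixation heatmaps and finds that decoder architecture (LSTM vs. Transformer) dominates alignment. LSTM decoders align more closely with human fixations but are spatially diffuse and weakly task-differentiated, whereas Transformer decoders show more concentrated attention and stronger task differentiation.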

详情
AI中文摘要

视觉感知依赖于自上而下的目标和自下而上的感觉机制。视觉语言模型同时实现了这两种机制,使我们能够将每个组成部分视为关于驱动我们注视位置的可分离假设。我们比较了六种视觉语言模型的空间注意力图与在200张图像上两个任务(一般描述和社交字幕)中记录的人类注视热图。这六种模型跨越了CNN与ViT编码器乘以LSTM与Transformer解码器的2×2因子设计,外加Molmo 7B-D和Qwen3.5 9B。我们发现解码器和编码器架构都影响对齐,但解码器选择占主导地位。LSTM与Transformer解码器使对齐度提高了40-50个百分点(分别达到人类噪声上限的80-87%和40-59%)。相比之下,CNN与ViT编码器根据解码器家族的不同贡献了5-20个百分点的次要优势,其中CNN-LSTM是整体对齐度最高的模型(85-87%)。尽管对齐度有优势,但LSTM解码器的注意力图在空间上分散且任务区分度最小;而对齐度最弱的ViT-Transformer则显示出最尖锐的空间集中度和最强的任务区分度。一项半空间忽略模拟证实,消融注意力对LSTM解码器的影响大于Transformer解码器。在使用TRIBE模拟的合成神经反应的探索性扩展中,注视对齐和神经相关性分离:CNN-Transformer注意力图尽管注视对齐度较低,但能更好地预测合成大脑活动,其中注意力图最佳预测早期视觉皮层。总之,自上而下和自下而上的组件在行为和合成神经数据中预测的内容上存在权衡。

英文摘要

Visual perception depends on top-down goals and bottom-up sensory mechanisms. Vision-language models implement both, allowing us to treat each component as a separable hypothesis about what drives where we look. We compared spatial attention maps from six vision-language models against human fixation heatmaps recorded on 200 images during two tasks (general description and social captioning). The six models spanned a 2×2 factorial of CNN vs. ViT encoders crossed with LSTM vs. Transformer decoders, plus Molmo 7B-D and Qwen3.5 9B. We found that both decoder and encoder architecture shaped alignment, but decoder choice dominated. LSTM vs. Transformer decoders increased alignment by 40-50 percentage points (80-87% vs. 40-59% of the human noise ceiling). In contrast, CNN vs. ViT encoders contributed a secondary 5-20 point advantage depending on decoder family, with CNN-LSTM the most aligned model overall (85-87%). Despite their alignment advantage, LSTM-decoder attention maps were spatially diffuse and minimally task-differentiated; ViT-Transformer, the weakest in alignment, showed the sharpest spatial concentration and strongest task differentiation. A hemispatial-neglect simulation confirmed that ablating attention impacted LSTM decoders more than Transformer decoders. In an exploratory extension using TRIBE-simulated synthetic neural responses, fixation alignment and neural relevance dissociate: CNN-Transformer attention maps better predicted synthetic brain activity despite lower fixation alignment, with attention maps best predicting early visual cortex. Together, top-down and bottom-up components trade off what they predict in behavioral and synthetic neural data.

2606.17409 2026-06-17 cs.LG cs.AI 新提交

Discrete Autoregressive Transformer for Generative Mechanism Synthesis

用于生成式机构综合的离散自回归Transformer

Anar Nurizada, Anurag Purwar

Affiliations * Computer-Aided Design and Innovation Lab, Department of Mechanical Engineering, Stony Brook University

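AI summary: Proposes a discrete autoregressive Transformer that casts planar path synthesis as conditional sequence modeling, generating quantized joint coordinates conditioned on a VAE latent of the target curve and a mechanism-type token to produce diverse, accurate mechanism designs.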

详情
AI中文摘要

平面路径综合要求机构的连杆曲线与预定轨迹相匹配;从曲线到连杆机构的映射本质上是一对多的,涵盖四杆、六杆和八杆拓扑。我们在一个包含一百多万个机构的精选语料库上,采用基于仿真的评估来解决这一设计问题,并报告经过正向运动学和几何对齐后的Chamfer距离和动态时间规整。我们将综合问题表述为条件自回归序列建模:关节坐标被均匀量化为令牌,并由一个仅解码器的Transformer生成,该Transformer以目标曲线的变分自编码器(VAE)潜在变量和一个显式的机构类型令牌为条件。训练将令牌交叉熵与一个高斯平滑的分箱辅助损失相结合,该损失保留了各分箱之间的序数结构。在推理时,一个有界的潜在噪声调度在每个噪声水平下解码所有机构类型;我们按几何误差保留前五个候选,从而无需查找数据集即可得到多样且准确的机构族。在留出测试集上,总体平均Chamfer距离为$0.0132$,平均动态时间规整为$0.153$;一个潜在$k$近邻基线以VAE空间中训练集邻居的潜在变量为条件,使用相同的解码器,在匹配拓扑下取得了平均Chamfer距离$0.0071$和平均动态时间规整$0.117$。

英文摘要

Planar path synthesis requires mechanisms whose coupler curves match a prescribed trajectory; the mapping from curve to linkage is inherently one-to-many across four-, six-, and eight-bar topologies. We address this design problem with simulation-grounded evaluation on a curated corpus of over one million mechanisms, reporting Chamfer distance and dynamic time warping after forward kinematics and geometric alignment. We formulate synthesis as conditional autoregressive sequence modeling: joint coordinates are uniformly quantized to tokens and generated by a decoder-only transformer with a variational-autoencoder (VAE) latent of the target curve and an explicit mechanism-type token. Training combines token cross-entropy with a Gaussian-smoothed bin auxiliary loss that respects ordinal structure among bins. At inference, a bounded latent-noise schedule decodes all mechanism types at each noise level; we retain the top five candidates by geometric error, yielding diverse accurate families without dataset lookup. On held-out tests, aggregate mean Chamfer distance is $0.0132$ and mean dynamic time warping is $0.153$; a latent $k$-nearest-neighbor baseline that conditions on training-set neighbor latents in VAE space achieves matched-topology mean Chamfer distance $0.0071$ and mean dynamic time warping $0.117$ using the same decoder.
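Two ingredients named in the abstract are easy to sketch: uniform quantization of joint coordinates into tokens, and a Gaussian-smoothed target over bins that respects their ordering. The coordinate range, the number of bins, and `sigma` below are assumptions, and the exact auxiliary loss used in the paper may differ.

```python
import math


def quantize(x, lo, hi, n_bins):
    """Map a coordinate in [lo, hi] to a bin index in 0..n_bins-1."""
    x = min(max(x, lo), hi)
    idx = int((x - lo) / (hi - lo) * n_bins)
    return min(idx, n_bins - 1)  # x == hi falls into the last bin


def dequantize(idx, lo, hi, n_bins):
    """Return the centre of bin `idx`."""
    return lo + (idx + 0.5) * (hi - lo) / n_bins


def gaussian_soft_target(idx, n_bins, sigma):
    """Normalized Gaussian weights over bins, centred on the true bin."""
    weights = [math.exp(-0.5 * ((k - idx) / sigma) ** 2) for k in range(n_bins)]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical settings: coordinates in [-1, 1], eight bins.
lo, hi, n_bins = -1.0, 1.0, 8
token = quantize(0.0, lo, hi, n_bins)
centre = dequantize(token, lo, hi, n_bins)
target = gaussian_soft_target(token, n_bins, sigma=1.0)
print(token, centre, target)
```

A cross-entropy against `target` instead of a one-hot vector penalizes a prediction in a neighbouring bin less than one in a distant bin, which is one way to respect the ordinal structure among bins that the abstract describes. The round-trip error of `dequantize(quantize(x))` is at most half a bin width.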

2606.17408 2026-06-17 cs.RO cs.CV cs.LG New submission

Where Should Action Generation Begin? A Learnable Source Prior for Generative Robot Policies

Meipo Dai, Qiyuan Zhuang, He-Yang Xu, Ying-Jie Shuai, Yijun Wang, Qi Dou, Xiu-Shen Wei

Affiliations: Southeast University; The Chinese University of Hong Kong

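AI summary: Proposes LeaP, which uses a lightweight MLP to predict a proprioception-conditioned diagonal Gaussian as the source prior for action generation in place of the standard Gaussian; it reaches an 81.6% average success rate across 15 RoboTwin tasks, outperforming baselines by 6.5 to 25.5 percentage points.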

Abstract

Generative robot policies typically begin action generation from an observation-independent standard Gaussian distribution, leaving the choice of source distribution underexplored. This work asks a simple question: where should action generation begin? We propose LeaP, a Learnable source Prior that replaces the standard Gaussian with a proprioception-conditioned diagonal Gaussian over action chunks. Parameterized by a lightweight MLP, LeaP jointly predicts the mean and state-adaptive variance of the source distribution, while keeping the downstream generator architecture and inference solver unchanged. This design provides an observation-informed yet stochastic initialization, allowing the generator to focus on precise action refinement rather than transporting samples from an uninformed noise source. On 15 RoboTwin manipulation tasks, LeaP achieves an average success rate of 81.6%, outperforming four representative baselines -- including deterministic-source methods, a no-prior counterpart, and a diffusion-bridge policy -- by 6.5 to 25.5 percentage points. The same prior consistently improves both flow-matching and diffusion-bridge generators, while using fewer parameters and converging faster. The advantage carries over to real-world deployment, where LeaP attains the best performance. These results suggest that the source distribution is an independent and reusable design axis for generative robot policies, complementary to the choice of generative dynamics.
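
The idea of a state-conditioned source distribution can be sketched in a few lines. This is an illustrative toy, not LeaP itself: the layer sizes, the untrained random weights, the single hidden layer, and the `SourcePrior` name are all assumptions. It only shows the sampling step, where the generator's starting point is drawn as mean plus scaled noise instead of from a standard Gaussian.

```python
import math
import random
from typing import List, Tuple


class SourcePrior:
    """Tiny untrained MLP mapping a proprioceptive state to a diagonal Gaussian."""

    def __init__(self, state_dim: int, hidden_dim: int, chunk_dim: int, seed: int = 0):
        rng = random.Random(seed)

        def init(rows: int, cols: int) -> List[List[float]]:
            return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

        self.w1 = init(hidden_dim, state_dim)
        self.w_mean = init(chunk_dim, hidden_dim)     # head for the mean
        self.w_logvar = init(chunk_dim, hidden_dim)   # head for the log-variance

    @staticmethod
    def _matvec(w: List[List[float]], v: List[float]) -> List[float]:
        return [sum(wi * vi for wi, vi in zip(row, v)) for row in w]

    def forward(self, state: List[float]) -> Tuple[List[float], List[float]]:
        """Return (mean, log-variance) of the source distribution for this state."""
        hidden = [math.tanh(h) for h in self._matvec(self.w1, state)]
        return self._matvec(self.w_mean, hidden), self._matvec(self.w_logvar, hidden)

    def sample(self, state: List[float], rng: random.Random) -> List[float]:
        """Draw x0 = mean + std * eps, with eps ~ N(0, 1) per dimension."""
        mean, logvar = self.forward(state)
        return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mean, logvar)]


# Hypothetical 4-D joint state and a flattened 6-D action chunk.
prior = SourcePrior(state_dim=4, hidden_dim=8, chunk_dim=6)
state = [0.1, -0.2, 0.3, 0.0]
x0 = prior.sample(state, random.Random(1))  # would be handed to the generator
```

Because only the starting sample changes, a flow-matching or diffusion-bridge generator downstream could in principle consume `x0` unchanged, which matches the abstract's claim that the generator architecture and solver stay the same.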

2606.17406 2026-06-17 cs.CV cs.AI New submission

Graph Neural Networks for Semi-Supervised Image Classification with Multi-Feature Aggregation

Marina Chagas Bulach Gapski, Vinicius Atsushi Sato Kawai, Gustavo Rosseto Leticio, Lucas Pascotti Valem, Daniel Carlos Guimarães Pedronette, Mohand Said Allili

Affiliations: Department of Statistics, Applied Mathematics, and Computing (DEMAC), São Paulo State University (UNESP); Institute of Mathematics and Computer Science (ICMC), University of São Paulo (USP); Department of Computer Science and Engineering, University of Quebec in Outaouais (UQO)

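AI summary: Proposes a GNN method for semi-supervised image classification that combines multiple feature extractors and graph representations, improving classification accuracy through manifold learning and rank aggregation.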

Abstract

Feature extraction involves the identification and extraction of salient characteristics or patterns, including edges, textures, shapes, and color attributes. Contemporary feature extractors predominantly leverage deep learning architectures, such as Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs). The availability of diverse feature extractors in the literature provides a wide range of feature representations. Features extracted from an image depend on the specific application, the chosen extractor, and its configuration. Therefore, integrating complementary information by combining distinct extractors offers a promising way to enhance performance. Graph Neural Networks (GNNs), particularly Graph Convolutional Networks (GCNs), have emerged as powerful and widely adopted approaches for semi-supervised image classification, as they effectively leverage both labeled and unlabeled data while exploiting the underlying graph structures that capture relationships among samples. This study proposes a novel approach for GNNs in scenarios where labeled data is scarce, by integrating diverse sets of feature and graph representations derived from various extractors in classification scenarios. Experimental investigations were conducted, encompassing combinations of distinct feature and graph extractors, as well as rank aggregation strategies. The primary contributions of this work are underscored by the experimental findings, which demonstrate that the strategic combination of feature and graph representations, coupled with the application of manifold learning for graph processing, leads to significant improvements in classification accuracy across the majority of experimental conditions. Furthermore, the utilization of rank aggregation techniques to integrate features from different extractors was shown to enhance classification accuracy.
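
Rank aggregation across extractors can be illustrated with a Borda count. The abstract does not name the aggregation strategies it evaluates, so Borda is a stand-in chosen for simplicity; the image ids and the three ranked lists below are made up for the example.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def borda_aggregate(rankings: Sequence[Sequence[str]]) -> List[str]:
    """Fuse ranked lists: an item at position p in a list of length n earns n - p points."""
    scores: Dict[str, int] = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for position, item in enumerate(ranking):
            scores[item] += n - position
    # Sort by descending score; ties are broken by item id so the result is deterministic.
    return sorted(scores, key=lambda item: (-scores[item], item))


# Hypothetical neighbor rankings for one query image, one list per feature extractor.
cnn_rank = ["img7", "img2", "img9", "img4"]
vit_rank = ["img2", "img7", "img4", "img9"]
other_rank = ["img2", "img9", "img7", "img4"]

fused = borda_aggregate([cnn_rank, vit_rank, other_rank])
print(fused)  # ['img2', 'img7', 'img9', 'img4']
```

Under these assumptions, the top entries of each fused list could serve as the neighbors of that image when building the graph a GCN operates on.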

2606.17405 2026-06-17 cs.AI New submission

Treatment Response Optimized Clinical Decision Support AI System via Digital Twin Simulation

Xinyu Qin, Anil K. Sood, Ruiheng Yu, Sara Corvigno, Elaine Stur, Lu Wang

Affiliations: The Cancer Genome Atlas (TCGA)

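AI summary: Proposes an online adaptive framework that combines treatment effect estimation, a patient digital twin, and reinforcement learning to optimize treatment recommendations in real time under safety constraints; validated as effective and stable on synthetic and real clinical data.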

Comments Accepted for presentation at the IEEE Engineering in Medicine and Biology Conference (EMBC) 2026

Abstract

Clinical decision support AI systems (CDSASs) must adapt to evolving patient conditions in real-time while adhering to strict safety constraints. We present an online adaptive framework that integrates Treatment Effect (TE) estimation to quantify clinical benefits, a patient Digital Twin (DT) to simulate treatment trajectories, and Reinforcement Learning (RL) for sequential decision-making. The AI system is initially trained on historical medical records and operates in a continuous learning loop. To ensure safety, a rule-based module monitors vital signs and blocks contraindicated treatments. Cases with strong internal model disagreement are flagged for clinician review, simulated in our experiments via a pre-trained outcome model. We validate our framework using both a synthetic clinical simulator and a real-world ovarian cancer dataset from The Cancer Genome Atlas (TCGA). In both simulated and clinical settings, our method demonstrated superior effectiveness and stability in recommending treatments compared to standard computational baselines. Furthermore, the AI system maintains low latency and requires expert consultation for only a minority of cases in our experimental validation, demonstrating its potential as a safe, clinician-supervised tool for personalized medicine that continuously improves through practical use.
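
The two safety mechanisms in this abstract, a rule-based block on contraindicated treatments and a flag for clinician review when internal models disagree, can be sketched as a simple gate. Everything concrete here is hypothetical: the treatment names, vital-sign limits, disagreement threshold, and the use of standard deviation as the disagreement measure are illustrative assumptions, not details from the paper and not clinical guidance.

```python
from dataclasses import dataclass
from statistics import pstdev
from typing import Dict, List, Optional, Tuple


@dataclass
class Decision:
    treatment: Optional[str]
    needs_review: bool
    reason: str


# Hypothetical rules: treatment -> (vital sign, minimum allowed value).
CONTRAINDICATIONS: Dict[str, Tuple[str, float]] = {
    "drug_a": ("systolic_bp", 90.0),
    "drug_b": ("heart_rate", 50.0),
}
DISAGREEMENT_THRESHOLD = 0.15  # hypothetical spread above which a clinician is asked


def decide(
    ranked_treatments: List[str],
    vitals: Dict[str, float],
    benefit_estimates: Dict[str, List[float]],
) -> Decision:
    """Return the best-ranked treatment that passes the rules, flagging disagreement."""
    for treatment in ranked_treatments:
        rule = CONTRAINDICATIONS.get(treatment)
        if rule is not None:
            vital, minimum = rule
            value = vitals.get(vital)
            if value is None or value < minimum:
                continue  # blocked: vital missing or below the allowed minimum
        # Spread of the per-model benefit estimates stands in for model disagreement.
        spread = pstdev(benefit_estimates[treatment])
        if spread > DISAGREEMENT_THRESHOLD:
            return Decision(treatment, True, "model disagreement")
        return Decision(treatment, False, "ok")
    return Decision(None, True, "all candidates blocked")


vitals = {"systolic_bp": 85.0, "heart_rate": 72.0}
estimates = {"drug_a": [0.60, 0.62], "drug_b": [0.20, 0.70], "drug_c": [0.40, 0.41]}
result = decide(["drug_a", "drug_b", "drug_c"], vitals, estimates)
```

In this example `drug_a` is blocked by the blood-pressure rule, and `drug_b` passes the rules but is flagged because its two estimates differ widely. Treating a missing vital sign as a block is a conservative design choice made for the sketch.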