arXivDaily: arXiv daily academic digest, updated Monday to Friday
2606.17507 2026-06-17 cs.AI cs.SE new submission

LLM-as-Judge in Education: A Curriculum-Grounded Marking Pipeline

Xiwei Xu, Chen Wang, Jacky Jiang, Phil Yang, Qian Fu, Mohan Dhall, Wenjie Zhang, Liming Zhu

Affiliations: NSW Department of Education; South Australian Department for Education; OC Selective exam preparation platform; Studitory: HSC preparation platform

AI summary: Proposes a curriculum-grounded, configurable LLM-as-Judge pipeline for high-stakes exam marking; by integrating authorised curriculum artefacts and marking guidelines, it improves marking consistency, transparency, and alignment with official practice.

Abstract

Generative AI and large language models (LLMs) are increasingly applied to question generation and automated assessment. However, deploying LLMs in preparation for high-stakes exams requires more than prompt engineering; it demands software pipelines that systematically ground model outputs in authorised curriculum artefacts and marking guidelines issued by education authorities. This paper presents a curriculum-grounded, configurable LLM-as-Judge pipeline for question-level marking, co-developed with an industrial partner, to support exam preparation for university admission. The pipeline identifies the relevant topics, subtopics, and cognitive demand of a question, and assembles verifiable and authorised context to support LLM judgement. Curriculum intent is operationalised through concrete syllabus artefacts, including prescribed verbs and outcomes, performance band descriptors, glossary definitions, and marking-guideline principles. A staged LLM workflow is employed to first generate question-specific rubrics, capturing structured expectations of performance, and then derive and evaluate marking criteria used to allocate marks to student responses. This design improves consistency, transparency, and alignment with official marking practices. Preliminary evaluation shows that the proposed LLM-as-Judge pipeline delivers marking outcomes comparable to human tutors, while yielding justifications that are more traceable to authorised curriculum artefacts and marking standards. The pipeline has also been integrated into an online study platform, where early deployment data provide initial insights into operational usage and manual overrides.

2606.17506 2026-06-17 cs.CL new submission

Evaluating Second-Order Bias of LLMs Through Epistemic Entitlement

Ramaravind Kommiya Mothilal, Terry Jingchen Zhang, Raiyan Ahmed, Zhijing Jin, Shion Guha, Syed Ishtiaque Ahmed

Affiliations: University of Toronto; Vector Institute; EuroSafeAI; Max Planck Institute for Intelligent Systems, Tübingen, Germany

AI summary: Proposes a logical reasoning task grounded in entitlement epistemology to evaluate the second-order bias that LLMs exhibit when judging biased content; finds that model judgments are systematically biased and that the task evades safety guardrails.

Comments: 20 pages, 13 tables, 2 figures

Abstract

Evaluations of social bias in LLMs largely focus on whether models generate or imply biased content. However, as LLMs are increasingly used as judges of bias, they may exhibit social biases in subtler ways in how they evaluate biased content, which current methods do not systematically capture. We call this second-order bias: social bias in an LLM's judgment about social bias, which we evaluate through a novel, philosophically grounded reasoning task. Drawing on entitlement epistemology, we conceptualize bias as misplaced foundational knowledge that shapes an agent's rational inquiry, and derive a logical reasoning task for LLMs to judge to whom a biased text is acceptable or non-acceptable. We develop two simple metrics to measure how biased LLM judges are in inferring demographics for acceptability without sufficient support, and how these inferences vary across groups targeted by biased texts. Evaluating open and closed models, we find that our task evades safety guardrails by surfacing bias in model judgment. It varies systematically across target groups, reflects implicit social maps, and shows how models are still triggered by demographic labels. Our work points to the need for LLM bias evaluation in judgment tasks and broadly, for more theoretically grounded approaches to bias evaluation in NLP. We release our code and model responses at https://github.com/uofthcdslab/second-order-bias.

2606.17500 2026-06-17 cs.LG cs.AR new submission

Reconfigurable Computing Challenge: Transformer for Jet Tagging on Versal AI Engines

Gram Koski, Sean Lipps, Zhenghua Ma, G. Abarajithan, Ryan Kastner

Affiliations: Department of Computer Science and Engineering, University of California San Diego, La Jolla, CA, USA

AI summary: For jet tagging at the CERN LHC, presents an initial implementation of a quantized, integer-only transformer on the AMD Versal AI Engine and develops a reusable software framework that automatically generates Vitis graph code.

Comments: 4 pages, 4 figures. In FCCM 2026 proceedings

Journal ref: 2026 IEEE 34th Int. Symp. on Field-Programmable Custom Computing Machines (FCCM), Atlanta, GA, USA, 2026, pp. 307-310

Abstract

Transformer-based models achieve strong performance for jet tagging at the CERN LHC, but deploying them in low-latency, resource-constrained trigger systems is challenging. We present an initial implementation of a quantized, integer-only transformer for jet tagging on the AMD Versal AI Engine (AIE), mapping dense and multi-head attention (MHA) layers to AIE tiles. The main contribution is a reusable software framework that represents transformer layers as composable AIE building blocks and automatically generates the corresponding Vitis graph code from a high-level Python model description. This framework provides a foundation for future research and is released as open-source software at https://github.com/KastnerRG/particle_transformer_aie.

2606.17493 2026-06-17 cs.RO new submission

When Robots Sleep: Offline Skill Consolidation for Shared-Policy Robot Learning

Nethmi Jayasinghe, Diana Gontero, Amit Ranjan Trivedi

Affiliations: University of Illinois at Chicago

AI summary: Proposes a wake-sleep framework that uses frozen skill memories and Nash-bargaining gradient combination to address skill-coupling collapse in multi-skill learning, improving average success and pairwise reliability on Meta-World MT5 and average success and backward transfer on SurgicAI.

Abstract

Robots that learn over long deployments must add new skills without losing the shared policy structure that makes earlier skills reusable. We study sequential robot skill learning, where previous trajectories and task losses may be unavailable, and the deployed policy must remain a single shared controller without task-specific heads, routing, or adapters. We identify skill-coupling collapse, a failure mode in which individual skill success remains non-trivial while reliability among related skills deteriorates. We propose Sleeping Robots, a wake-sleep framework that learns each new skill during wake and consolidates the shared policy offline during sleep using compact frozen skill memories: frozen critics with unordered state buffers for reinforcement learning and frozen actor snapshots with unordered observation buffers for imitation learning. During sleep, these memories define differentiable surrogate objectives whose gradients are combined through Nash bargaining, with adaptive anchoring and local excitability for stable consolidation. On Meta-World MT5, Sleeping Robots improves average success by 64% and pairwise reliability by 2.0× over the strongest non-oracle baseline, and on SurgicAI it improves average success and backward transfer relative to continual imitation baselines while remaining competitive on pairwise reliability.

2606.17489 2026-06-17 cs.LG cs.AI new submission

Online LLM Selection via Constrained Bandits with Time-Varying Demand

Yin Huang, Qingsong Liu, Jie Xu

Affiliations: Department of Electrical and Computer Engineering, University of Florida; Manning College of Information and Computer Sciences, University of Massachusetts Amherst

AI summary: For selecting among heterogeneous LLMs in edge-cloud inference systems, proposes an online learning algorithm based on confidence-bound estimates and demand predictions that achieves sublinear regret and sublinear constraint violation under hard budget and soft latency constraints.

Comments: 11 pages, 3 figures with multiple subfigures, 1 table, submitted for possible journal publication

Abstract

Large Language Models (LLMs) are increasingly deployed in edge-cloud inference systems to handle diverse user tasks with heterogeneous accuracy, latency, and cost profiles. Selecting the appropriate LLM for each incoming task is critical for ensuring service quality and efficient resource utilization. However, model heterogeneity, stochastic and unknown performance characteristics, and time-varying task demands make static selection strategies inadequate. Real-world deployments often impose hard resource budgets such as monetary expenditure limits, along with soft service-level requirements such as latency guarantees. These constraints introduce additional challenges for online decision-making. We formulate this problem as a constrained stochastic bandit learning task, where the learner sequentially selects models under both packing-type (hard) and covering-type (soft) constraints, while adapting to time-varying task demand. The learner operates without access to the underlying reward, cost, or latency distributions and must rely on partial feedback. We develop a novel online learning algorithm that leverages confidence-bound estimates and demand predictions to balance reward maximization with long-term constraint satisfaction. We provide theoretical guarantees showing sublinear regret and sublinear covering constraint violations compared to an offline benchmark with full information. Experimental results on synthetic workloads demonstrate the effectiveness and robustness of our approach in dynamic, resource-constrained environments.
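
To make the setting above concrete, here is a minimal Python sketch of budget-aware model selection with confidence bounds. It is an illustration under stated assumptions, not the authors' algorithm: it uses a plain UCB bonus, enforces only the hard budget, and leaves out the soft latency constraint and the demand predictions. The names `select_model` and `run` and all numeric values are hypothetical.

```python
import math
import random


def select_model(counts, reward_sums, cost_sums, t, remaining_budget):
    """Return the index of the model to query next, or None if none looks affordable."""
    best_arm, best_score = None, -math.inf
    for arm, n in enumerate(counts):
        if n == 0:
            return arm  # query every model once before trusting any estimate
        bonus = math.sqrt(2.0 * math.log(t) / n)
        reward_ucb = reward_sums[arm] / n + bonus        # optimistic reward estimate
        cost_lcb = max(0.0, cost_sums[arm] / n - bonus)  # optimistic cost estimate
        if cost_lcb > remaining_budget:
            continue  # hard (packing-type) constraint: arm looks unaffordable
        if reward_ucb > best_score:
            best_arm, best_score = arm, reward_ucb
    return best_arm


def run(success_prob, cost, budget, horizon, seed=0):
    """Simulate the loop; stop once the budget cannot cover the chosen query."""
    rng = random.Random(seed)
    k = len(success_prob)
    counts, reward_sums, cost_sums = [0] * k, [0.0] * k, [0.0] * k
    spent = 0.0
    for t in range(1, horizon + 1):
        arm = select_model(counts, reward_sums, cost_sums, t, budget - spent)
        if arm is None or spent + cost[arm] > budget:
            break
        reward = 1.0 if rng.random() < success_prob[arm] else 0.0
        counts[arm] += 1
        reward_sums[arm] += reward
        cost_sums[arm] += cost[arm]
        spent += cost[arm]
    return counts, spent
```

Calling `run([0.9, 0.5], [1.0, 0.2], budget=20.0, horizon=200)` returns the per-model query counts and the total spend, which by construction never exceeds the budget.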

2606.17482 2026-06-17 cs.CV new submission

SPHINX: First Explain, Then Explore

Nguyen Do, Tue M. Cao, Tien Van Do, András Hajdu, Tamás Bérczes, My T. Thai

Affiliations: University of Florida; University of Debrecen

AI summary: Proposes SPHINX, a closed-loop framework that uses explainable AI to analyze the failure modes of a driving policy and a vision-language model to generate targeted adversarial scenarios, improving the robustness of autonomous driving policies.

Comments: 13 pages

Abstract

Generating adversarial driving scenarios is critical for evaluating and improving autonomous vehicle decision-making systems in simulation. Recent approaches, such as ChatScene and LLM-Attacker, rely primarily on the prior knowledge of Large Language Models and Vision-Language Models to generate driving scenarios procedurally. We argue that adversarial scenes should be generated based on the failure diagnosis (e.g., indecisiveness, multi-frame inconsistency) of the driving policy to specifically address the policy's weaknesses instead of relying on prior assumptions. In this paper, we propose SPHINX, a closed-loop framework for adversarial scenario synthesis guided by a simple principle: first explain, then explore. Rather than blindly exploring the scenario space, SPHINX leverages explainable artificial intelligence methods to analyze the policy, identifying key visual concepts and their influence on policy outputs, as well as the uncertainty of the decisions. Given the interpretable evidence extracted from the policy's own decision process, we use a vision-language model to rationalize and critique failure modes of the current policy. These critiques are then used to generate targeted adversarial scenarios for policy retraining and improvement. We demonstrate that SPHINX can provide an interpretable account of policy failures, whereas other adversarial scene generation methods cannot. Across the evaluated benchmarks and test suites, SPHINX can be applied to diverse state-of-the-art autonomous vehicle architectures and yields consistent robustness improvements over existing scenario-generation methods.

2606.17480 2026-06-17 cs.CV cs.RO new submission

GeneralVLA-2: Geometry-Aware Reconstruction and Governed Memory for Robot Planning

Haoyu Wang, Guoqing Ma, Zeyu Zhang, Yandong Guo, Boxin Shi, Hao Tang

Affiliations: School of Computer Science, Peking University; CASIA (Institute of Automation, Chinese Academy of Sciences); AI² Robotics

AI summary: To address hallucinated 3D object reconstruction and uncontrolled memory quality in robot planning, proposes GeoFuse-MV3D, a geometry-prior-guided reconstruction branch, and a governed long-term memory system, improving results on benchmarks including GSO-30 and Terminal-Bench.

Abstract

Generalist vision-language-action systems need object-centric 3D evidence and reusable manipulation experience to plan reliable robot trajectories. GeneralVLA provides a hierarchical interface for converting language and RGB-D observations into 3D end-effector paths, but two bottlenecks remain. First, monocular SAM3D-style object reconstruction can hallucinate pose and unseen geometry, while manipulation benefits from stable object shape when calibrated multi-view observations are available. Second, the original KnowledgeBank mainly retrieves semantically similar snippets and appends new knowledge, which makes it difficult to control memory quality, conflicts, confidence, and geometric relevance. To address the first challenge, we introduce GeoFuse-MV3D, a geometry-prior-guided MV-SAM3D reconstruction branch that verifies external geometry cues with input-view masks, applies soft visual-hull support, performs axis-wise refinement, and fuses only geometry while preserving appearance. To address the second challenge, we upgrade KnowledgeBank into a governed long-term memory system with explicit quality, confidence, lifecycle, verifier, and conflict metadata, together with precision-oriented retrieval. Finally, we evaluate the reconstruction branch on GSO-30 and the memory module on Terminal-Bench 2.0 and SWE-Bench Verified; GeoFuse-MV3D improves over the MV-SAM3D baseline by reducing CD and LPIPS by 2.20% and 2.02% while increasing PSNR and SSIM by 2.36% and 1.03%, and KnowledgeBank improves over ReasoningBank by 4.53% on Terminal-Bench SR and 3.73% on SWE-Bench resolve rate, while reducing AS by 4.95% and 5.65%, respectively. Code: https://github.com/AIGeeksGroup/GeneralVLA-2. Website: https://aigeeksgroup.github.io/GeneralVLA-2.

2606.17478 2026-06-17 cs.CL cs.AI new submission

Decoding Hidden Deception in Reasoning LLMs: Activation Explainers for Deception Auditing

Kexin Chen, Yi Liu, Haonan Zhang, Yanhui Li, Xinyu Deng, Dongxia Wang

Affiliations: Zhejiang University; Griffith University

AI summary: Proposes STATEWITNESS, an activation explainer that decodes a target model's hidden states to answer natural-language queries and produce structured reports; it reaches 0.916 mean AUROC in deception detection, outperforming existing methods.

Comments: Under review

Abstract

As LLMs acquire stronger reasoning capabilities, deceptive behavior becomes an increasingly serious safety concern. Existing deception monitors either score visible transcripts or derive scalar probe scores from representation vectors, leaving little inspectable evidence about why a response is suspicious. We introduce STATEWITNESS, an activation explainer for deception auditing. A separate decoder reads a target model's hidden states, then answers natural-language queries or emits structured reports about them. We evaluate STATEWITNESS on two target reasoning LLMs across seven deception datasets. STATEWITNESS reaches 0.916 mean AUROC, a relative gain of 11.6% over the best black-box text monitor and 25.0% over the best activation-probe baseline under the same evaluation protocol. When combined with existing monitors, STATEWITNESS reduces missed deceptive examples in simple threshold ensembles. Beyond scalar detection, the decoder returns query-level answers, schema reports, and token- or sentence-level evidence traces for human inspection. We view this interface as a potential building block for broader interpretability and alignment tools.
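
The abstract mentions "simple threshold ensembles" without giving the combination rule. The Python sketch below assumes an OR rule (flag a response if either monitor crosses its threshold) to show why adding a second monitor can only reduce missed deceptive examples. The function names, thresholds, and scores are made up for illustration and are not taken from the paper.

```python
def flag(text_score, explainer_score, text_thr=0.5, explainer_thr=0.5):
    """OR-ensemble: flag a response if either monitor crosses its threshold."""
    return text_score >= text_thr or explainer_score >= explainer_thr


def count_missed(deceptive_examples, use_explainer):
    """Count deceptive examples that no active monitor flags."""
    missed = 0
    for text_score, explainer_score in deceptive_examples:
        # Disabling the explainer is modelled by giving it a score of 0.
        score = explainer_score if use_explainer else 0.0
        if not flag(text_score, score):
            missed += 1
    return missed


# Made-up (text monitor, explainer) scores for three deceptive responses.
toy = [(0.8, 0.9), (0.3, 0.7), (0.2, 0.1)]
```

On this toy data the text monitor alone misses two of the three examples, and the ensemble misses one. The usual trade-off is that an OR rule can also raise the false positive rate, which this sketch does not measure.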

2606.17477 2026-06-17 cs.CV cs.LG new submission

Theoretical Grounding of Out-Of-Distribution Detection With Reinforcement Learning Optimizer

Salimeh Sekeh, Xin Zhang

Affiliations: San Diego State University

AI summary: Proposes a reinforcement-learning-guided optimizer that adds a correction term to gradient descent updates to reduce the semantic OOD false positive rate, and theoretically analyzes how model change and environment change affect generalization error.

Abstract

Out-of-distribution (OOD) detection in dynamic open-world environments requires a model to continually adapt to evolving data distributions while generalizing to covariate-shifted inputs and rejecting semantic-shifted OOD examples. Most existing OOD detection methods optimize only the current-step objective and do not explicitly account for how post-deployment environment changes affect future OOD behavior. In this paper, we establish a theoretical grounding for dynamic OOD detection using a reinforcement learning (RL)-guided optimizer that explicitly favors updates that reduce the semantic OOD false positive rate over time. We develop a novel augmented optimizer that uses an RL-guided correction term on top of standard gradient descent (GD) and show that it improves both future-domain generalization and semantic-OOD rejection. We analyze temporal error decomposition in terms of model-change and environment-change generalization errors and develop a new theoretical framework for comparing the generalization errors under both GD and RL-guided optimizers.

2606.17476 2026-06-17 cs.LG new submission

Multi-Adapter PPO: A Cross-Attention Enhanced Wavelength Selection Framework for LIBS Quantitative Analysis

Hao Li, Man Fung Zhuo

Affiliations: Electrical and Computer Engineering, University of Arizona; Computer Engineering, University of Arizona, Tucson, USA

AI summary: Proposes a Multi-Adapter PPO framework that casts wavelength selection as a reinforcement learning problem, using cross-attention and multiple adapters to capture spectral relationships; on steel and coal datasets it improves by an average of 28.4% in comprehensive score and 45.2% in prediction accuracy.

Comments: 6 pages

Abstract

Laser-induced breakdown spectroscopy (LIBS) quantitative analysis faces critical challenges in wavelength selection due to high-dimensional spectral data and the fundamental trade-off between prediction accuracy and feature efficiency. This paper presents a novel Multi-Adapter PPO framework that transforms wavelength selection into a reinforcement learning problem, leveraging cross-attention mechanisms and multiple specialized adapters to capture complex spectral relationships. Our approach outperforms traditional Particle Swarm Optimization (PSO) by an average of 28.4% in comprehensive score and 45.2% in prediction accuracy across steel and coal datasets. The proposed method demonstrates superior performance in balancing prediction accuracy with feature efficiency, achieving state-of-the-art results in LIBS quantitative analysis while maintaining interpretability and computational efficiency. We released our code and dataset here: https://github.com/Hflying/MAPPO

2606.17475 2026-06-17 cs.CV new submission

StereoFactory: A Unified Merging Framework for Robust Stereo Matching

Xianda Guo, Pinhan Fu, Ruilin Wang, Wenke Huang, Mang Ye, Qin Zou

Affiliations: School of Computer Science, Wuhan University; D-Star Robotics; Institute of Automation, Chinese Academy of Sciences; College of Computing and Data Science, Nanyang Technological University

AI summary: Proposes StereoFactory, a coarse-to-fine evolutionary framework for adaptive model merging that selects a model subset with a genetic algorithm and optimizes module-level routing with CMA-ES, lowering error on multiple benchmarks at a small fraction of joint-retraining time.

Abstract

Stereo matching has advanced through foundation models trained on large-scale datasets, yet this paradigm suffers from a scalability bottleneck: incorporating new data requires costly joint retraining. Model merging offers a scalable post-hoc alternative by integrating knowledge from specialized models after source checkpoints are available. However, existing merging methods typically retain all available models or rely on greedy inclusion, which can preserve harmful task-vector interference. We propose StereoFactory, a coarse-to-fine evolutionary framework for adaptive model merging. Stage 1 employs a genetic algorithm to search the combinatorial space of model subsets, determining which models should participate. Stage 2 addresses module-level knowledge specialization (different functional modules exhibit distinct preferences for knowledge sources) through CMA-ES optimization of architecture-adaptive routing over the selected task vectors, with optional module-level scaling. Experiments across two architectures and four benchmarks demonstrate that StereoFactory consistently achieves the best four-benchmark average under the same checkpoint pool, reducing the average error from 3.80 to 3.30 on NMRF and from 2.88 to 2.19 on FoundationStereo relative to the strongest controlled baseline. The post-hoc search requires only 2.7-3.7% of the corresponding joint-retraining wall-clock time. Analysis reveals that knowledge contributions are inherently module-specific, and selected subsets can transfer across architectures with minimal degradation. Code will be publicly released upon acceptance at: https://github.com/XiandaGuo/StereoFactory.
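
As a rough illustration of the merging step that the two search stages feed into, the Python sketch below applies standard task-vector arithmetic: each selected expert contributes its difference from the base checkpoint, scaled per module. The genetic-algorithm and CMA-ES searches are omitted, and the data layout (a dict from module name to a flat list of weights), the function names, and the `module_scale` structure are assumptions made for this sketch rather than the paper's implementation.

```python
def task_vector(expert, base):
    """Per-module difference between a fine-tuned checkpoint and the base."""
    return {m: [e - b for e, b in zip(expert[m], base[m])] for m in base}


def merge(base, experts, selected, module_scale):
    """Add scaled task vectors of the selected experts to the base weights.

    selected:      indices of participating experts (what Stage 1 would output)
    module_scale:  module name -> {expert index -> weight} (what Stage 2 would output)
    """
    merged = {m: list(w) for m, w in base.items()}
    for i in selected:
        delta = task_vector(experts[i], base)
        for m in merged:
            scale = module_scale[m][i]
            merged[m] = [w + scale * d for w, d in zip(merged[m], delta[m])]
    return merged
```

Experts left out of `selected` never touch the result, which is the point of subset selection: a checkpoint whose task vector interferes with the others is simply excluded instead of being averaged in.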

2606.17474 2026-06-17 cs.CL cs.AI new submission

AIPatient Arena: EHR-grounded evaluation of large language models in end-to-end clinical consultation workflows

Jiahui Niu, Huizi Yu, Wenkong Wang, Guangxin Dai, Jingxian He, Xiang Li, Zhiying Liang, Xinxin Lin, Kent CY So, Bryan YP Yan, Yun Kwok Wing, Yanqiu Xing, Xin Ma, Lizhou Fan

Affiliations: School of Control Science and Engineering, Shandong University; Key Laboratory of Machine Intelligence and System Control, Shandong University; Department of Medicine and Therapeutics, The Chinese University of Hong Kong; Department of Geriatric Medicine, Qilu Hospital of Shandong University; Department of Psychiatry, The Chinese University of Hong Kong; Li Chiu Kong Family Sleep Assessment Unit, Department of Psychiatry, Faculty of Medicine, The Chinese University of Hong Kong; Li Ka Shing Institute of Health Sciences, Faculty of Medicine, The Chinese University of Hong Kong; Gerald Choa Neuroscience Institute, Department of Medicine and Therapeutics, The Chinese University of Hong Kong

AI summary: Proposes AIPatient Arena, a framework that builds patient knowledge graphs from electronic health records and evaluates eight clinical competencies of LLMs in multi-turn physician-patient interactions; finds weaknesses in areas including information coverage and diagnostic reasoning, and underscores the importance of process-based evaluation.

Comments: 49 pages, 12 figures, 11 tables

Abstract

Large language models (LLMs) are increasingly considered for use in clinical consultation tasks, yet most medical evaluations remain static, single-turn, or narrowly outcome-based, limiting their ability to reflect the sequential, uncertain, and interactive nature of real-world care. Here, we propose AIPatient Arena, an EHR-grounded evaluation framework for assessing the clinical utility of LLMs across eight dimensions of clinical competence. The framework integrates EHR data into patient-specific knowledge graphs, enabling multi-turn physician-patient interactions. We applied AIPatient Arena to a primary cohort of 437 patients and two out-of-distribution validation cohorts of 119 and 67 patients. We observed that LLMs performed well in medical interview questioning skills (QS; mean scores, 4.43-4.99/5), ethical and professional conduct (ET; 4.38-4.93/5), and clarity and transparency of clinical explanations (EX; 3.80-4.72/5). Performance was moderate in information integration (II; 3.19-4.21/5) and medication safety and justification (MS; 3.13-3.78/5), but persistent weaknesses were observed in handling of ambiguous patient responses (HR; 2.57-3.32/5), information coverage (IC; 2.08-3.02/5), and diagnostic accuracy and reasoning (Dx; 2.63-3.55/5). Process-based evaluation revealed recurrent interaction failures, including repetitive questioning, omission of past medical history, and inadequate handling of uncertainty. Richer conversational context improved diagnostic reasoning but yielded limited gains in treatment planning. These findings indicate that final-answer accuracy alone is insufficient for evaluating clinical readiness and highlight the importance of assessing how models gather, interpret, and communicate information throughout a consultation. AIPatient Arena provides an EHR-grounded framework for workflow-oriented pre-deployment evaluation of medical LLMs.

2606.17471 2026-06-17 cs.LG cs.SY eess.SY new submission

ReRAM-aware Model Finetuning addressing I-V Non-linearity and Retention Errors

Ching-Yi Lin, Shamik Kundu, Arnab Raha, Sahil Shah

Affiliations: Intel Corporation

AI summary: Proposes a finetuning-based hardware-aware training algorithm that mitigates I-V non-linearity with a range-shrunk sinh transformation and incorporates retention errors into a regularization loss, enabling efficient DNN deployment on ReRAM with minimal accuracy loss on image classification and question-answering tasks.

Comments: 11 pages, 12 figures, 2 tables, with appendix (5 pages, 9 figures)

Abstract

Traditional CPU, GPU, and NPU architectures are increasingly limited by the von Neumann bottleneck. While In-Memory Computing (IMC) using ReRAM crossbar arrays offers a high-density, energy-efficient alternative, its practical deployment is constrained by device non-idealities. Existing hardware-aware training frameworks often require training from scratch, which is computationally prohibitive for modern large-scale models. In this work, we propose a finetuning-based hardware-aware training algorithm that enables robust DNN deployment on ReRAM with minimal training overhead. Our approach mitigates I-V non-linearity by applying a range-shrunk sinh transformation and incorporates retention errors directly into a regularization loss during the finetuning process. We evaluate our framework across models and tasks such as image classification and question-answering (QA). Experimental results demonstrate that our method achieves accuracy similar to that of the base model on large-scale models such as ResNet18 and DeiT-Tiny. For the MobileNetV3 family on ImageNet, the technique incurs less than 2% accuracy degradation. Further, applying the technique to the SQuAD v2 dataset results in only a 1-point degradation in F1 score.

2606.17465 2026-06-17 cs.LG cs.SY eess.SY new submission

Perron-Frobenius Operator Matching for Generative Modeling

Shiqi Zhang, Wuwei Wu, Jaemin Oh, Jie Chen, Xiaoning Qian

Affiliations: Texas A&M University; City University of Hong Kong

AI summary: Proposes Perron-Frobenius Operator Matching (PFOM), a generative framework that matches density evolution via the integral PF operator and unifies flow, diffusion, and jump models; proves that, among Bregman divergences, only the KL divergence preserves the equality between density-level and sample-conditioned objectives; and develops Nesterov-accelerated training and sampling.

Abstract

We introduce Perron-Frobenius Operator Matching (PFOM), a generative framework that matches density evolution via the integral PF operator, subsuming flow, diffusion, and jump models. We prove that among Bregman divergences, only Kullback-Leibler divergence preserves equality between density-level and sample-conditioned objectives, yielding a practical loss equivalent to Koopman path matching. We further develop Nesterov-accelerated training and sampling that stabilize discretization and accelerate convergence. PFOM unifies operator-theoretic identification with modern generative modeling and opens paths to adaptive dictionaries and high-dimensional applications.
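
For readers unfamiliar with the operators named above, the standard textbook definitions are sketched below. The notation (transition kernel $k$, density $\rho$, observable $g$, convex generator $\phi$) is assumed here for illustration and may differ from the paper's.

```latex
% Perron-Frobenius (transfer) operator: pushes a density forward through the kernel k
(\mathcal{P}\rho)(y) = \int k(y \mid x)\,\rho(x)\,\mathrm{d}x

% Koopman operator: pulls an observable back; it is the adjoint of \mathcal{P}
(\mathcal{K}g)(x) = \int k(y \mid x)\,g(y)\,\mathrm{d}y,
\qquad
\langle \mathcal{P}\rho,\, g \rangle = \langle \rho,\, \mathcal{K}g \rangle

% Bregman divergence between densities p and q for a convex generator \phi
D_{\phi}(p, q) = \int \big[\phi(p) - \phi(q) - \phi'(q)\,(p - q)\big]\,\mathrm{d}x

% With \phi(u) = u \log u this is the (generalized) Kullback-Leibler divergence,
% which reduces to \int p \log(p/q)\,\mathrm{d}x when p and q both integrate to one
D_{\phi}(p, q) = \int \Big[p \log\frac{p}{q} - p + q\Big]\,\mathrm{d}x
```

The adjoint relation is what lets a density-level statement about $\mathcal{P}$ be rewritten as a statement about expectations of observables under $\mathcal{K}$, which is presumably the link to the "Koopman path matching" loss the abstract mentions.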

2606.17464 2026-06-17 cs.LG 新提交

CheckMIABench: Firm Foundations For Membership Inference Attacks on Language Models

CheckMIABench: 语言模型成员推理攻击的坚实基础

Jeffrey G. Wang, Jason Wang, Marvin Li, Seth Neel

发表机构 * Harvard University(哈佛大学) Harvard Business School(哈佛商学院)

AI总结 为解决成员推理攻击评估中的分布偏移问题,提出基于训练中固定点前后数据同分布的基准框架,在Pythia和OLMo模型上评估多种攻击,并开源模块化库。

详情
AI中文摘要

成员推理攻击(MIA)是评估机器学习模型隐私属性的标准方法。尽管已有多次尝试评估语言模型上的MIA,但现有文献在构建干净评估以测试新技术方面遇到诸多困难。特别是,成员集和非成员集之间的微妙分布偏移可能破坏MIA的统计有效性;最近的研究通过展示没有访问底层模型的“盲”方法在同一基准上的表现远优于已发布方法,强调了这一点。本文利用训练过程中固定点前后的训练数据来自同一分布的洞察,构建了一个用于对LLM进行原则性MIA评估的基准。因此,所有具有中间检查点和公开训练数据的开源模型都可以转化为MIA测试平台。我们将我们的框架应用于针对Pythia和OLMo模型系列(从70M到7B参数)的半打已发布攻击。为促进进一步的隐私研究,我们开源了一个模块化库,用于在此设置中设计和实现攻击:https://github.com/safr-ai-lab/pandora_llm。

英文摘要

Membership inference attacks (MIAs) are a canonical way to assess a machine learning model's privacy properties. Although several attempts have been made to evaluate MIAs on language models, the extant literature has suffered numerous difficulties in constructing clean evaluations to test new techniques. In particular, subtle distribution shifts between member and non-member sets can undermine the statistical validity of MIAs; recent work has underscored this by showing that "blind" methods with no access to the underlying model can perform far better than published methods on the same benchmarks. This paper constructs a benchmark for principled evaluation of MIAs against LLMs by leveraging the insight that training data before and after a fixed point during training are drawn from the same distribution. Therefore, all open-source models with intermediate checkpoints and public training data can be converted into MIA testbeds. We apply our framework to a half-dozen published attacks on the Pythia and OLMo families of models, from 70M to 7B parameters. To facilitate further privacy research, we open-source a modular library for designing and implementing attacks in this setting: https://github.com/safr-ai-lab/pandora_llm.
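
As a rough illustration of how an attack could be scored in such a setting, the sketch below computes the AUC of a membership score, treating examples seen before the checkpoint as members and examples first seen after it as non-members. The function name `attack_auc` and the example scores are hypothetical and are not taken from the pandora_llm library; the scoring rule (for instance, negative per-example loss) is also an assumption.

```python
from typing import Sequence


def attack_auc(member_scores: Sequence[float], nonmember_scores: Sequence[float]) -> float:
    """Probability that a random member outscores a random non-member (ties count 0.5)."""
    wins = 0.0
    for m in member_scores:
        for n in nonmember_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(member_scores) * len(nonmember_scores))


# Hypothetical attack scores, e.g. negative per-example loss at the evaluated checkpoint.
members = [-1.2, -0.8, -1.0, -0.5]     # examples seen before the checkpoint
nonmembers = [-1.1, -1.5, -0.9, -1.4]  # examples first seen after the checkpoint
print(attack_auc(members, nonmembers))  # prints 0.8125
```

An AUC near 0.5 means the attack cannot separate the two sets. Because both sets are drawn from the same distribution, a score well above 0.5 is harder to attribute to distribution shift alone.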

2606.17463 2026-06-17 cs.CV cs.RO 新提交

WeaveLA: Event Driven Cross-Subtask Latent Memory Weaving for Repetitive Robot Manipulation

WeaveLA: 面向重复机器人操作的基于事件驱动的跨子任务潜在记忆编织

Shoujing Zhu, Zhenyang Liu, Fungmiu Wang, Jiafeng Wang, Bo Yue, Guiliang Liu, Simo Wu, Xiangyang Xue, Taiping Zeng

发表机构 * Fudan University(复旦大学) School of Data Science, The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳)数据科学学院) Shanghai Innovation Institute(上海创新研究院) Shenzhen Loop Area Institute(深圳环域研究院)

AI总结 针对短窗口VLA策略缺乏跨子任务信息传递的问题,提出WeaveLA,通过事件触发将完成子任务压缩为潜在令牌并注入下一子任务的动作生成路径,在保持基础策略短窗口接口的同时实现轻量级跨子任务通道,在困难重复任务上成功率从0%提升至47.8%。

详情
AI中文摘要

视觉-语言-动作(VLA)策略已实现显著的单步操作,但在每个阶段依赖于刚刚完成的任务时仍然脆弱。核心问题是结构性的:短窗口VLA缺乏明确的跨子任务信息路由通道,而现有的记忆增强变体要么在每一帧写入,要么从演示阶段检索,要么在子目标事件触发时未执行显式的子任务到子任务交接给动作专家。我们将子目标完成事件识别为跨子任务记忆交接的自然时间单元,并提出WeaveLA(为视觉-语言-动作策略编织潜在记忆),这是一种跨子任务记忆接口,在冻结的VLA骨干之上,通过查询驱动的注意力池化将每个完成的段压缩为潜在令牌,并直接路由到下一子任务的动作生成路径。这种事件触发、动作侧的设计保留了基础策略的短窗口接口,同时添加了轻量级跨子任务通道。通过在RoboMME上使用$\pi_{0.5}$骨干进行分层评估,WeaveLA的增益恰好出现在需要该通道的地方:在最难的重复切片(SwingXtimes,$N{=}3$)上,成功率从$0\%$提升至$47.8\%$,而单次执行片段保持不变。每集配对分析证实增益仅限于因果结构需要跨子任务信息的任务。

英文摘要

Vision-Language-Action (VLA) policies have achieved remarkable single-step manipulation, yet they remain brittle precisely where each stage depends on what was just completed. The core issue is structural: short-window VLAs lack an explicit channel for routing information across sub-task boundaries, and existing memory-augmented variants either write at every frame, retrieve from demonstration-time stages, or fire at sub-goal events without performing an explicit sub-task-to-sub-task hand-off into the action expert. We identify the sub-goal completion event as the natural temporal unit for cross-subtask memory hand-off, and present WeaveLA (Weave Latent memory for Vision-Language-Action policies), a cross-subtask memory interface that, on top of a frozen VLA backbone, compresses each completed segment into latent tokens via query-driven attention pooling and routes them directly into the action-generation path of the next sub-task. This event-triggered, action-side design preserves the base policy's short-window interface while adding a lightweight cross-subtask channel. Through stratified evaluation on RoboMME with a $\pi_{0.5}$ backbone, WeaveLA's gains land exactly where the channel is needed: on the hardest repetition slice (SwingXtimes, $N{=}3$), success rises from $0\%$ to $47.8\%$, while single-execution episodes remain unchanged. Per-episode paired analysis confirms the gains are confined to tasks whose causal structure requires cross-subtask information.
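
The compression step can be pictured with a generic attention-pooling sketch: a small set of query vectors attends over the per-frame features of a completed segment, and each query yields one latent token. This is only an illustration of query-driven attention pooling in general. The names `attention_pool`, `queries`, and `frames`, the scaled dot-product scoring, and the absence of learned projections are all assumptions, not details of WeaveLA.

```python
import math
from typing import List

Vector = List[float]


def softmax(xs: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]


def attention_pool(queries: List[Vector], frames: List[Vector]) -> List[Vector]:
    """Compress T frame features into len(queries) latent tokens (hypothetical helper)."""
    d = len(frames[0])
    tokens = []
    for q in queries:
        # Scaled dot-product score between this query and every frame feature.
        scores = [sum(qi * fi for qi, fi in zip(q, f)) / math.sqrt(d) for f in frames]
        weights = softmax(scores)
        # Each latent token is the attention-weighted average of the frame features.
        tokens.append([sum(w * f[j] for w, f in zip(weights, frames)) for j in range(d)])
    return tokens


# Hypothetical segment of 3 frames with 2-dimensional features, pooled into 2 tokens.
segment = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
latent_tokens = attention_pool([[1.0, 0.0], [0.0, 1.0]], segment)
print(len(latent_tokens), len(latent_tokens[0]))
```

The number of tokens depends only on the number of queries, not on the segment length, which is what lets a completed sub-task of any duration be handed off in a fixed-size form.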

2606.17462 2026-06-17 cs.LG cs.NI 新提交

ResAware: Cross-Environment Website Fingerprinting via Resource-Privileged Distillation

ResAware: 通过资源特权蒸馏实现跨环境网站指纹识别

Chongru Fan, Wei Wang, Wentao Huang, Zhenquan Ding, Jinqiao Shi, Lei Cui, Zhiyu Hao, Xiaochun Yun

发表机构 * Beijing University of Posts and Telecommunications(北京邮电大学) Zhongguancun Laboratory(中关村实验室)

AI总结 提出ResAware框架,利用资源级特征训练教师模型并通过异构知识蒸馏指导学生模型,在不增加在线开销下提升跨环境鲁棒性,在五个月大规模数据集上显著提升基线方法性能。

Comments 18 pages, 9 figures

详情
AI中文摘要

虽然网站指纹识别(WF)攻击在受控实验室环境中实现了高精度,但在现实环境中,由于时空漂移、浏览器异构性、代理混淆等因素,其性能往往大幅下降。这一限制源于它们仅依赖低层流量特征,而这些特征噪声大且对环境扰动高度敏感。为解决此问题,我们提出\textbf{ResAware},一种在\textit{训练丰富/推理贫乏}非对称设置下的跨环境资源感知蒸馏框架。具体来说,ResAware在资源级特征上训练教师模型,然后通过异构知识蒸馏将所得特权知识蒸馏到学生模型中。部署时,学生模型仅使用加密流量进行推理,不产生额外成本。我们在一个跨越五个月、从六个全球观测点收集的大规模数据集上评估ResAware,包含超过$160{,}000$个配对样本。结果表明,ResAware显著增强了多种WF基线的跨环境鲁棒性。例如,在150天的时间漂移下,ResAware将Var-CNN的F1分数从$72.77\%$提升至$81.49\%$,开放世界$TPR@1\%FPR$从$22.40\%$提升至$27.20\%$。我们的结果表明,资源级监督在不扩大在线观测能力的情况下提高了WF鲁棒性。

英文摘要

While Website Fingerprinting (WF) attacks achieve high accuracy in controlled laboratory settings, they often degrade substantially in real-world environments due to spatio-temporal drift, browser heterogeneity, proxy obfuscation, and other factors. This limitation stems from their sole reliance on low-level traffic features that are noisy and highly sensitive to environmental perturbations. To address this problem, we propose ResAware, a cross-environment resource-aware distillation framework under a training-rich/inference-poor asymmetric setting. Specifically, ResAware trains a teacher model on resource-level features, and then distills the resulting privileged knowledge into a student model through heterogeneous knowledge distillation. At deployment time, the student model performs inference using only encrypted traffic, incurring zero additional cost. We evaluate ResAware on a large-scale dataset collected over five months from six globally distributed vantage points, comprising more than $160{,}000$ paired samples. The results show that ResAware significantly enhances the cross-environment robustness of diverse WF baselines. Under a 150-day temporal drift, for example, ResAware improves the F1-score of Var-CNN from $72.77\%$ to $81.49\%$ and the open-world $TPR@1\%FPR$ from $22.40\%$ to $27.20\%$. Our results demonstrate that resource-level supervision improves WF robustness without expanding online observation capabilities.
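
For readers unfamiliar with the open-world metric quoted above, the following sketch shows one common way to compute the true-positive rate at a fixed false-positive rate from classifier scores. The function name `tpr_at_fpr`, the strict-threshold convention, and the example scores are assumptions made for illustration; the paper's exact evaluation protocol is not given in the abstract.

```python
from typing import Sequence


def tpr_at_fpr(pos_scores: Sequence[float], neg_scores: Sequence[float],
               max_fpr: float = 0.01) -> float:
    """Fraction of positives flagged when at most max_fpr of negatives are flagged."""
    # Number of negatives (unmonitored traces) that may be flagged.
    allowed = int(max_fpr * len(neg_scores) + 1e-9)
    ranked = sorted(neg_scores, reverse=True)
    # Flag only scores strictly above the (allowed + 1)-th highest negative score.
    threshold = ranked[allowed] if allowed < len(ranked) else float("-inf")
    return sum(s > threshold for s in pos_scores) / len(pos_scores)


# Hypothetical scores: 100 unmonitored traces and 4 monitored traces.
unmonitored = [i / 100 for i in range(100)]
monitored = [0.995, 0.985, 0.5, 0.97]
print(tpr_at_fpr(monitored, unmonitored, max_fpr=0.01))
```

Fixing the false-positive budget first reflects the open-world setting, where unmonitored traffic vastly outnumbers monitored traffic and even a small false-positive rate produces many false alarms.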

2606.17459 2026-06-17 cs.AI 新提交

Can LLMs Be CEOs? Benchmarking Strategic Resource Reallocation with Multi-Role Agent Simulation

LLM 能当 CEO 吗?基于多角色智能体模拟的战略资源重新配置基准测试

Yuyang Dai, Xueqing Peng, Lingfei Qian, Zhuohan Xie

发表机构 * MBZUAI(穆罕默德·本·扎耶德人工智能大学) Yale University(耶鲁大学)

AI总结 提出 CEO-Bench,一个多智能体基准,评估 LLM 在约束丰富的组织环境中进行多轮战略资源重新配置的能力,发现模型在结构有效性上表现良好,但在战略校准上存在系统性失败模式。

Comments 13 pages

详情
AI中文摘要

评估大型语言模型(LLM)的决策能力是一个日益重要的研究重点,然而现有基准侧重于孤立的认知任务,如推理、知识检索以及在风格化环境中的经济理性。这些评估忽略了真实高管决策的核心挑战:在信息不对称、组织约束和时间依赖下整合来自专业利益相关者的冲突建议。我们引入了 \textsc{CEO-Bench},一个多智能体基准,评估 LLM 在 CEO 级别的战略资源重新配置能力——即在多轮、约束丰富的组织环境中跨业务部门重新分配资本的过程。在 \textsc{CEO-Bench} 中,LLM 智能体接收来自四个角色化的 C 级顾问(CFO、CTO、COO、CMO)的冲突建议,每个顾问拥有私有信号和不同优先级,智能体必须将这些建议综合成一个具体的分配计划,并沿四个维度进行评估:角色整合、条件大胆性、历史敏感性判断和计划有效性。在 13 个场景中对五个前沿模型的实验表明,所有模型都实现了高结构有效性,但在战略校准(最难的能力层)上表现差异显著。我们识别出系统性失败模式,包括单一顾问捕获、模糊下的保守默认和历史遗忘,并发现结构整合-大胆性权衡:更深入参与冲突观点的模型往往产生较不果断的行动。这些发现勾勒了 LLM 作为组织决策者的当前能力边界,并为未来 AI 辅助高管系统的设计提供信息。

英文摘要

Evaluating the decision-making capabilities of large language models (LLMs) is a growing research priority, yet existing benchmarks focus on isolated cognitive tasks such as reasoning, knowledge retrieval, and economic rationality in stylized settings. These evaluations overlook the defining challenge of real executive decision-making: integrating conflicting recommendations from specialized stakeholders under information asymmetry, organizational constraints, and temporal dependencies. We introduce CEO-Bench, a multi-agent benchmark that evaluates LLMs on CEO-level strategic resource reallocation -- the process of redirecting capital across business units in a multi-round, constraint-rich organizational environment. In CEO-Bench, LLM agents receive conflicting advice from four role-conditioned C-suite advisors (CFO, CTO, COO, CMO), each with private signals and distinct priorities, and must synthesize these into a concrete allocation plan evaluated along four dimensions: role integration, conditional boldness, history-sensitive judgment, and plan validity. Experiments across five frontier models on 13 scenarios reveal that all models achieve high structural validity but diverge sharply on strategic calibration -- the hardest capability layer. We identify systematic failure modes including single-advisor capture, conservative default under ambiguity, and historical amnesia, and uncover a structural integration-boldness tradeoff: models that engage more deeply with conflicting perspectives tend to produce less decisive action. These findings delineate the current capability boundary of LLMs as organizational decision-makers and inform the design of future AI-assisted executive systems.

2606.17456 2026-06-17 cs.RO q-bio.NC 新提交

Embodiment Shapes Rolling Behavior in a Multimodal Infant Model

具身形态塑造多模态婴儿模型中的翻滚行为

Leon Philipp, Francisco M. López, Jochen Triesch

发表机构 * Frankfurt Institute for Advanced Studies(法兰克福高等研究院) Goethe University Frankfurt(法兰克福大学) University of New South Wales(新南威尔士大学)

AI总结 通过虚拟婴儿MIMo学习仰卧到俯卧翻滚,研究婴儿运动发展中的具身形态变化如何影响行为,发现与真实婴儿一致的发育趋势和协调模式。

Comments 7 pages, 7 figures. Accepted at the 2026 IEEE ICDL Conference. Cite as: L. Philipp, F. M. López, and J. Triesch, "Embodiment Shapes Rolling Behavior in a Multimodal Infant Model", in 2026 IEEE International Conference on Development and Learning (ICDL). IEEE, 2026, pp. 1-7

详情
AI中文摘要

翻身是婴儿运动发展中最早期的里程碑之一,反映了协调的全身感觉运动控制的出现。在这里,我们使用MIMo(一个配备本体感觉和前庭感觉的虚拟婴儿具身模型)对婴儿翻滚进行计算研究。MIMo通过强化学习学习从仰卧到俯卧的翻滚。有趣的是,学习到的行为捕捉到了与真实婴儿报告一致的发育趋势和协调模式,包括随着年龄增长表现提升和执行速度加快。我们的结果解释了婴儿的能力和限制如何能在人工代理中产生逼真的行为,特别强调了运动发展如何受到不断变化的身体形态的影响。这项工作突出了具身计算模型作为研究感觉运动发展的强大工具的作用。

英文摘要

Rolling over is one of the earliest milestones in infant motor development, reflecting the emergence of coordinated, whole-body sensorimotor control. Here, we conduct a computational study of infant rolling using MIMo, a virtual infant embodiment equipped with proprioception and vestibular sensation. MIMo learns supine-to-prone rolls with reinforcement learning. Interestingly, the learned behaviors capture developmental trends and coordination patterns consistent with those reported in real infants, including improved performance and faster execution with age. Our results explain how infant capabilities and constraints can give rise to realistic behaviors in artificial agents, with a particular emphasis on how motor development is shaped by the changing body morphology. This work highlights the role of embodied computational models as a powerful tool for studying sensorimotor development.

2606.17455 2026-06-17 cs.RO 新提交

Continual Online Personalization of Exoskeleton Control via Manifold-Aware Experience Replay

基于流形感知经验回放的外骨骼控制持续在线个性化

Changseob Song, Inseung Kang

发表机构 * Carnegie Mellon University(卡内基梅隆大学)

AI总结 提出流形感知经验回放框架,通过回放缓冲区保留用户特定表征,避免在线适应中的灾难性遗忘,在模拟偏瘫步态中扭矩和步态相位跟踪精度分别提升40%和60%。

详情
AI中文摘要

个性化外骨骼控制对于步态障碍的临床用户仍然是一个关键挑战。在线适应(OA)通过实时适应受试者变异性、设备适配性和不同运动任务提供了一种有效解决方案。然而,OA涉及连续的用户状态数据流,可能导致先前学习的运动情境的灾难性遗忘。在此,我们开发了一种基于流形感知经验回放的在线个性化框架,旨在在外骨骼控制的OA过程中跨不同任务维护用户特定表征。通过从回放缓冲区重放先前经历的任务,我们保留了跨所有学习任务的个性化外骨骼辅助。此外,我们捕获了一个区分不同运动任务的步态流形,消除了在选择目标回放区间时对显式任务标签的需求。我们在模拟偏瘫步态(与健全模式有显著偏差)上评估了我们的框架,涉及速度和坡度转换的多个遗忘场景。与没有回放的基线框架(在任务转换期间表现出灾难性遗忘)相比,我们的流形感知回放框架在扭矩和步态相位跟踪精度上分别实现了40%和60%的提升。这表明我们提出的框架在临床人群的日常行走中跨不同运动情境实时个性化外骨骼控制。

英文摘要

Personalizing exoskeleton control remains a critical challenge for clinical users with gait disabilities. Online adaptation (OA) offers an effective solution by adapting in real time to subject variability, device fit, and diverse locomotor tasks. However, OA involves a continual stream of user state data, which can lead to catastrophic forgetting of previously learned locomotor contexts. Here, we develop a manifold-aware experience replay-based online personalization framework designed to maintain user-specific representations across diverse tasks during OA of exoskeleton control. By replaying previously experienced tasks from a replay buffer, we preserve the personalized exoskeleton assistance across all learned tasks. Furthermore, we capture a gait manifold that distinguishes between different locomotor tasks, removing the need for explicit task labeling when selecting target replay bins. We evaluated our framework on emulated hemiplegic gait, which largely deviates from able-bodied patterns, across multiple forgetting scenarios with speed and incline transitions. Our manifold-aware replay framework achieved 40% and 60% improvements in torque and gait phase tracking accuracy, respectively, compared to a baseline framework without replay, which exhibited catastrophic forgetting during task transitions. This demonstrates that our proposed framework personalizes exoskeleton control in real time across diverse locomotor contexts in daily ambulation of clinical populations.

2606.17454 2026-06-17 cs.AI cs.LG 新提交

Dissecting model behavior through agent trajectories

通过智能体轨迹剖析模型行为

Gaurav Gupta, Vatshank Chaturvedi, Jun Huan, Anoop Deoras

发表机构 * AWS AI Labs(AWS人工智能实验室)

AI总结 本文提出“意图-执行差距”概念,并设计Simple Strands Agent(SSA)框架,通过分析138k条轨迹揭示模型在自主问题解决中的行为差异。

Comments 106 pages, 50 Figures, 16 Tables

详情
AI中文摘要

AI智能体性能不仅仅是一个建模问题,它本质上是一个系统问题。模型的高级能力通过智能体框架(harness)实现。因此,模型假设与框架行为之间的差距很容易阻止模型的全部能力转化为智能体性能。我们将此形式化为“意图-执行差距”:模型意图与框架执行之间的不匹配,反之亦然。我们认为,最小化这种意图-执行差距与框架设计的其他方面(如工具和执行循环)同样重要。为了说明这种框架-模型对齐的影响,我们开发了一个简单且可定制的框架,称为“Simple Strands Agent”(SSA)。SSA旨在找到跨不同模型家族(如Claude、Gemini、GPT、Grok、Qwen)通用的常见模式,以及少量模型特定的偏好。我们做出两个贡献:(i)我们在流行的智能体基准测试(SWE-Pro、SWE-Verified和Terminal-Bench-2)上**复现或改进了**不同模型提供商家族报告的pass@1性能;(ii)基于对**SSA生成的138k条轨迹的分析**,我们超越了前沿模型之间通常相对均匀的pass@1数字。通过在代码状态空间中表示智能体轨迹,我们观察到问题解决行为中的模型级差异。更细粒度的指标,如编辑频率、测试活动和阶段转换,揭示了单个模型如何在自主问题解决的不同阶段分配努力。

英文摘要

AI agent performance is not just a modeling problem; it is fundamentally a systems problem. The advanced capabilities of models are realized through agent harnesses. Therefore, a gap between model assumptions and harness behavior can easily prevent the model's full capabilities from translating into agent performance. We formalize this as the 'intent-execution' gap: the mismatch between what the model intends and what the harness executes, and vice versa. We argue that minimizing this intent-execution gap is as important as other aspects of harness design such as tools and execution loops. To illustrate the impact of this harness-model alignment, we develop a simple and customizable harness called 'Simple Strands Agent' (SSA). SSA aims to find the bulk of common patterns which generalize across different model families (such as Claude, Gemini, GPT, Grok, Qwen), as well as a small number of model-specific preferences. We make two contributions: (i) we reproduce or improve on the pass@1 performance reported by diverse model-provider families on popular agentic benchmarks (SWE-Pro, SWE-Verified and Terminal-Bench-2), and (ii) building on an analysis of 138k trajectories generated by SSA, we look beyond the pass@1 numbers, which tend to be relatively even across frontier models. By representing agent trajectories in code state-spaces, we observe model-level differences in problem-solving behavior. Finer-grained metrics such as edit frequency, testing activity, and phase-transitions reveal how individual models allocate effort across different stages of autonomous problem solving.
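
For reference, pass@1 is the probability that a single sampled attempt solves a task. The sketch below uses the widely used unbiased pass@k estimator, which reduces to the fraction of passing attempts when k = 1. Whether the authors compute pass@1 this way is an assumption, and the function name `pass_at_k` and the example counts are illustrative only.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n sampled attempts, c of which passed."""
    if n - c < k:
        # Every size-k subset of attempts contains at least one passing attempt.
        return 1.0
    # One minus the probability that k attempts drawn without replacement all fail.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical task: 10 attempts, 3 passing.
print(pass_at_k(n=10, c=3, k=1))
print(pass_at_k(n=10, c=3, k=5))
```

A benchmark-level number is then the mean of the per-task estimates, which is one reason two models with similar pass@1 can still differ in how their individual trajectories unfold.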

2606.17453 2026-06-17 cs.AI 新提交

MapSatisfyBench: Benchmarking Satisfaction-Aware Map Agents through Behavior-Grounded Implicit Decision Factors

MapSatisfyBench: 通过行为隐含决策因素基准测试满意度感知的地图智能体

Lubin Bai, Mengyu Cao, Sixue Wang, Zhongwei Wan, Yue Pan, Jiale Hou, Xiang Li, Xiuyuan Zhang

发表机构 * University of Chinese Academy of Sciences(中国科学院大学) Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所)

AI总结 提出MapSatisfyBench基准,通过恢复用户行为链中的隐含决策因素来评估地图智能体的满意度感知能力,实验表明现有智能体在显式任务完成上表现良好,但在满足隐含需求方面仍有局限。

详情
AI中文摘要

大型语言模型智能体越来越多地集成到地图服务中。由于地图服务嵌入在日常场景而非专业任务设置中,用户通常非正式地表达需求,导致查询不明确,包含许多未言明的需求,即对用户满意度至关重要的隐含决策因素。虽然澄清是缓解这一问题的有效方法,但它增加了日常交互中的用户负担,而一个能干的智能体应首先从可用信息源主动恢复这些因素。然而,评估这一能力具有挑战性。第一个挑战是确定哪些隐含决策因素适合评估。一个因素只有在影响用户接受度且能从智能体响应前可获取的信息中恢复时才是可评估的。其次,用户满意度不能可靠地由单个参考答案表示,需要一个将满意度相关因素转化为客观可量化评估目标的基准。为应对这些挑战,我们提出一个恢复-识别-过滤框架,从行为链证据中重建完整的用户需求,识别隐含决策因素,并仅保留那些有查询前证据支持的因素。基于此方法,我们从大规模真实世界匿名用户数据构建MapSatisfyBench,并从五个维度标注真实值,实现对满意度感知地图智能体的全链条评估。实验表明,当前智能体在显式任务完成上普遍表现良好,但在满足隐含决策因素和主动获取满意度感知决策所需证据方面仍然有限。这些发现使MapSatisfyBench成为将地图智能体评估从任务完成转向满意度感知空间决策的基准。

英文摘要

Large language model agents are increasingly integrated into map services. Since map services are embedded in everyday-life scenarios rather than professional task settings, users often express their needs informally, resulting in underspecified queries with many unspoken needs, namely, implicit decision factors that are critical for user satisfaction. Although clarification is an effective way to mitigate this issue, it increases user burden in daily interaction, and a capable agent should first proactively recover such factors from available information sources. However, evaluating this ability is challenging. The first challenge is to determine which implicit decision factors are suitable for evaluation. A factor is evaluable only if it affects user acceptance and can be recovered from information available to the agent before it responds. Second, user satisfaction cannot be reliably represented by a single reference answer, requiring a benchmark that converts satisfaction-relevant factors into objective and quantifiable evaluation targets. To address these challenges, we propose a restore-identify-filter framework that reconstructs complete user needs from behavior-chain evidence, identifies implicit decision factors, and retains only those supported by pre-query evidence. Building on this methodology, we construct MapSatisfyBench from large-scale, real-world anonymized user data and annotate ground truth along five dimensions, enabling full-chain evaluation of satisfaction-aware map agents. Experiments show that current agents generally perform well on explicit task completion, but remain limited in satisfying implicit decision factors and proactively acquiring the evidence needed for satisfaction-aware decisions. These findings establish MapSatisfyBench as a benchmark for shifting map-agent evaluation from task completion toward satisfaction-aware spatial decision making.

2606.17450 2026-06-17 cs.AI 新提交

A Machine-Learned Comorbidity Index

机器学习共病指数

Suleman Baloch, Kishlay Jha, Alberto M. Segre, Philip M. Polgreen, Bijaya Adhikari

发表机构 * Department of Electrical and Computer Engineering, University of Iowa, Iowa, USA(电气与计算机工程系,爱荷华大学,爱荷华,美国) Department of Computer Science, University of Iowa, Iowa, USA(计算机科学系,爱荷华大学,爱荷华,美国) Department of Internal Medicine, University of Iowa, Iowa, USA(内科学系,爱荷华大学,爱荷华,美国)

AI总结 提出一种机器学习共病指数(MLCI),通过最大化学习分数与多个临床结果之间的归一化希尔伯特-施密特独立性准则(nHSIC)来映射诊断代码为单一标量,捕获非线性风险-结果依赖,并在多个EHR数据集上优于基线方法。

Comments Accepted at the 43rd International Conference on Machine Learning (ICML 2026), Seoul, South Korea. 35 pages

详情
AI中文摘要

传统的共病评分(如Charlson和Elixhauser)广泛用于风险调整和患者分层,但它们有两个关键局限性:(i)它们主要围绕死亡率,不能很好地与其他临床结果对齐;(ii)它们的线性、基于规则的结构无法捕捉非线性、结果特定的风险关系。我们提出了一种机器学习共病指数(MLCI),通过最大化学习分数与多个临床结果之间的归一化希尔伯特-施密特独立性准则(nHSIC),将诊断代码映射到单个标量。MLCI捕捉非线性风险-结果依赖,并有一个理论支持,该理论描述了何时可以在不同结果上实现统一的、信息丰富的入院级排序。在多个基准电子健康记录(EHR)数据集上的实证结果表明,MLCI在多个评估指标上优于强基线方法。

英文摘要

Traditional comorbidity scores (e.g., Charlson and Elixhauser) are widely used for risk adjustment and patient stratification, but they have two key limitations: (i) they are largely mortality-centric and do not align well with other clinical outcomes, and (ii) their linear, rule-based structure cannot capture nonlinear, outcome-specific risk relationships. We propose a Machine-Learned Comorbidity Index (MLCI) that maps diagnosis codes to a single scalar by maximizing the normalized Hilbert-Schmidt Independence Criterion (nHSIC) between the learned score and multiple clinical outcomes. MLCI captures nonlinear risk-outcome dependence and is supported by a theory that characterizes when a unified, informative admission-level ordering can be achieved across outcomes. Empirical results on multiple benchmark electronic health record (EHR) datasets show that MLCI outperforms strong baselines across multiple evaluation metrics.
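
To make the dependence measure concrete, here is a small sketch of a normalized HSIC between a scalar score and a single outcome, using Gaussian kernels and the standard biased HSIC estimator. The normalization shown (dividing by the geometric mean of the two self-HSIC terms), the kernel bandwidth, and the helper names are assumptions for illustration; the paper's exact nHSIC definition and its multi-outcome objective are not given in the abstract.

```python
import math
from typing import List, Sequence

Matrix = List[List[float]]


def gaussian_gram(x: Sequence[float], sigma: float = 1.0) -> Matrix:
    """Gaussian-kernel Gram matrix for scalar samples."""
    return [[math.exp(-((a - b) ** 2) / (2 * sigma ** 2)) for b in x] for a in x]


def center(K: Matrix) -> Matrix:
    """Double-centre a symmetric Gram matrix, i.e. compute H K H."""
    n = len(K)
    row = [sum(r) / n for r in K]
    col = [sum(K[i][j] for i in range(n)) / n for j in range(n)]
    total = sum(row) / n
    return [[K[i][j] - row[i] - col[j] + total for j in range(n)] for i in range(n)]


def hsic(K: Matrix, L: Matrix) -> float:
    """Biased HSIC estimate tr(K H L H) / (n - 1)^2."""
    n = len(K)
    Kc, Lc = center(K), center(L)
    return sum(Kc[i][j] * Lc[i][j] for i in range(n) for j in range(n)) / (n - 1) ** 2


def nhsic(x: Sequence[float], y: Sequence[float], sigma: float = 1.0) -> float:
    """HSIC normalized to [0, 1] by the self-dependence of each variable."""
    K, L = gaussian_gram(x, sigma), gaussian_gram(y, sigma)
    return hsic(K, L) / math.sqrt(hsic(K, K) * hsic(L, L))


# Hypothetical learned scores and a nonlinearly related outcome.
score = [0.0, 1.0, 2.0, 3.0, 4.0]
outcome = [(s - 2.0) ** 2 for s in score]
print(nhsic(score, outcome))
```

Because the measure is kernel-based, it can register the quadratic relationship in this toy example, which a linear correlation would miss entirely.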

2606.17449 2026-06-17 cs.CL cs.AI cs.CV cs.LG cs.MM 新提交

MODE-RAG: Manifold Outlier Diagnosis and Energy-based Retrieval-Augmented Generation Evaluation

MODE-RAG: 基于流形异常诊断和能量的检索增强生成评估

Zehang Wei, Jiaxin Dai, Jiamin Yan, Xiang Xiang

发表机构 * School of Computer Science & Tech, Huazhong University of Science and Technology(华中科技大学计算机科学与技术学院) School of AI and Automation, Huazhong University of Science and Technology(华中科技大学人工智能与自动化学院)

AI总结 提出MODE-RAG多智能体系统,利用变分自由能和内部注意力状态动态门控干预,结合蒙特卡洛树搜索和logit扰动减少多模态检索增强生成中的幻觉和逻辑捏造。

Comments To be presented at ACL 2026

详情
AI中文摘要

虽然多模态检索增强生成(M-RAG)增强了大型视觉语言模型,但它仍然非常容易受到跨模态幻觉、因果捏造和谄媚的影响。此外,现有的缓解流程常常面临干预悖论:静态规则往往不必要地干扰准确的生成,而完全不加引导的多模态推理则允许现有的不匹配级联成严重的逻辑捏造。为了量化和缓解这些幻觉,我们提出了一个多智能体系统MODE-RAG,由变分自由能(VFE)和内部注意力状态驱动,以动态门控干预。高风险查询被路由到五个阶段特定的智能体,集成蒙特卡洛树搜索(MCTS)进行严格的因果推导,以及logit扰动以惩罚谄媚。专门的纠正和监管智能体确保格式稳定性并执行事后事实验证。为了客观评估我们的方法,我们引入了ModeVent,一个源自MultiVent数据集的具有挑战性的子集。大量实验表明,我们的系统有效降低了幻觉率和逻辑捏造,显著提高了M-RAG系统的鲁棒性。

英文摘要

While Multimodal Retrieval-Augmented Generation (M-RAG) enhances Large Vision-Language Models, it remains highly susceptible to cross-modal hallucinations, causal fabrications, and sycophancy. Furthermore, existing mitigation pipelines often face an intervention paradox: static rules tend to unnecessarily disrupt accurate generations, whereas leaving the multi-modal reasoning completely unguided allows existing mismatches to cascade into severe logical fabrications. To quantify and mitigate these hallucinations, we propose a Multi-Agent system, MODE-RAG, driven by Variational Free Energy (VFE) and internal attention states to dynamically gate interventions. High-risk queries are routed to five stage-specific agents, integrating Monte Carlo Tree Search (MCTS) for rigorous causal derivation and logit perturbations to penalize sycophancy. Dedicated Correction and Overseer agents ensure formatting stability and perform post-hoc factual verification. To objectively evaluate our approach, we introduce ModeVent, a challenging subset derived from the MultiVent dataset. Extensive experiments indicate that our system effectively reduces hallucination rates and logical fabrication, significantly improving the robustness of M-RAG systems.

2606.17446 2026-06-17 cs.RO cs.CV 新提交

AnnotateAnything: Automatic Annotation of 3D Assets for Robot Manipulation

AnnotateAnything:面向机器人操作的3D资产自动标注

Haoran Lu, Mutian Shen, Shuyang Yu, Yu Xiao, Songling Liu, Jianshu Zhang, Shang Wu, Yue Chen, Guo Ye, Jiayi Wang, Zhaoran Wang, Han Liu

发表机构 * Northwestern University(西北大学) Peking University(北京大学)

AI总结 提出AnnotateAnything框架,通过视觉-语言标注和物理标注双流水线,自动为3D资产生成可执行操作标签,提升仿真数据收集效率和任务成功率。

详情
AI中文摘要

仿真使得可扩展的机器人数据收集成为可能,但原始3D资产仅提供几何信息,缺乏指定机器人应在何处以及如何操作的语义、交互和物理知识。在这项工作中,我们提出了AnnotateAnything,一个通用的自动标注框架,将被动3D资产转换为具有结构化、多样化和可执行操作标签的、可用于操作的资产。AnnotateAnything围绕两个互补的流水线构建。首先,一个统一的视觉-语言标注流水线,利用视觉-语言推理来推断对象语义、交互约束和3D接地线索,为识别有意义的交互区域提供人类先验指导。其次,一个全自动且大规模并行的物理标注流水线,通过候选生成、几何优化和轨迹生成,将这些先验知识嵌入每个资产的几何和物理约束中。该流水线生成多样且可执行的动作标注,包括抓取姿态、灵巧接触、关节运动路径点、插入方向、悬挂可供性和导航目标。利用生成的标注,我们进一步构建了一个跨不同对象、任务和机器人形态的异步并行仿真数据收集系统。实验表明,与现有的标注和数据生成流水线相比,AnnotateAnything在标注效率、数据收集效率和任务成功率方面均表现优越,同时支持下游任务如可供性检测、机器人VQA和视觉指令微调。我们在项目页面上提供项目材料,并计划发布完整代码、标注和基准以促进未来研究。视频、代码、演示资产和标注在补充材料中提供。项目页面:https://tourmaline-caramel-169490.netlify.app。

英文摘要

Simulation enables scalable robot data collection, but raw 3D assets provide only geometry, lacking the semantic, interactive, and physical knowledge needed to specify where and how robots should act. In this work, we present AnnotateAnything, a general automatic annotation framework that converts passive 3D assets into manipulation-ready assets with structured, diverse, and executable manipulation labels. AnnotateAnything is built around two complementary pipelines. First, a unified visual-language annotation pipeline uses vision-language reasoning to infer object semantics, interaction constraints, and 3D-grounded cues, providing human-prior guidance for identifying meaningful interaction regions. Second, a fully automatic and massively parallel physics annotation pipeline grounds these priors in each asset's geometry and physical constraints through candidate generation, geometry optimization, and trajectory generation. This pipeline produces diverse and executable action annotations, including grasp poses, dexterous contacts, articulation waypoints, insertion directions, hanging affordances, and navigation targets. Using the generated annotations, we further build an asynchronous parallel simulation data-collection system across diverse objects, tasks, and robot embodiments. Experiments demonstrate that AnnotateAnything achieves superior annotation efficiency, data-collection efficiency, and task success rates over existing annotation and data-generation pipelines, while also supporting downstream tasks such as affordance detection, robotic VQA, and visual instruction finetuning. We provide project materials on the project page and plan to release the full code, annotations, and benchmark to facilitate future research. Videos, code, demo assets, and annotations are provided in the supplementary materials. Project page: https://tourmaline-caramel-169490.netlify.app.

2606.17443 2026-06-17 cs.AI cs.CL cs.CY 新提交

Incumbent Advantage: Brand Bias and Cognitive Manipulation Dynamics in LLM Recommendation Systems

在位优势:LLM推荐系统中的品牌偏见与认知操纵动态

Xi Chu, Yupeng Hou

发表机构 * Trine University(特莱恩大学) Texas A&M University(德克萨斯农工大学)

AI总结 研究LLM推荐中的品牌动态,发现知名品牌在同等规格下获100%推荐(IAI=10.0),但微弱评分优势可打破垄断;权威营销语言(如虚假临床证据)以+0.17评分点的偏差剩余价值打破垄断;多品牌GEO竞争存在社会困境,集体优化降低个体收益。

Comments 16 pages, 4 figures, 11 tables

详情
AI中文摘要

大型语言模型(LLM)正成为消费者寻找产品的主要方式,但我们尚不了解品牌如何在这个新渠道中竞争。我们使用护肤品——消费者在购买前难以判断质量、必须依赖品牌声誉的类别——在三个商业LLM(GPT-4o-mini、Claude Sonnet、Gemini 3 Flash)中研究LLM推荐中的品牌动态,并对搜索品进行了稳健性检验。在三个实验中,我们发现:(1)条件垄断:当所有产品具有相同规格时,知名品牌获得100%的推荐(IAI = 10.0),但这种主导地位在竞争对手拥有不到+0.1星的评分优势时消失;(2)权威式营销语言,包括捏造的临床证据声明,以等于+0.17评分点的偏差剩余价值打破了这种垄断,每个模型反应不同;(3)多品牌GEO竞争中的社会困境:当所有品牌采用相同的优化策略时,在我们的收益代理中,个体收益从+0.802降至+0.007,而我们的测试中未参与的品牌获得零推荐。我们的结果表明,生成引擎优化(GEO)不仅应作为安全风险研究,还应作为塑造市场竞争的新兴营销实践来研究。

英文摘要

Large language models (LLMs) are becoming a major way for consumers to find products, but we do not yet understand how brands compete in this new channel. We study brand dynamics in LLM recommendations using skincare products -- a category where consumers cannot easily judge quality before buying and must rely on brand reputation -- across three commercial LLMs (GPT-4o-mini, Claude Sonnet, Gemini 3 Flash), with a robustness check on search goods. In three experiments, we find: (1) a Conditional Monopoly where well-known brands get recommended 100% of the time (IAI = 10.0) when all products have the same specifications, but this dominance disappears with less than a +0.1-star rating advantage for a competitor; (2) authority-style marketing language, including fabricated clinical-evidence claims, breaks this monopoly at a Bias Surplus Value equal to +0.17 rating points, with each model responding differently; and (3) a social dilemma in multi-brand GEO competition: when all brands adopt the same optimization strategy, individual payoff falls from +0.802 to +0.007 in our payoff proxy, and non-participating brands receive zero recommendations in our tests. Our results suggest that generative engine optimization (GEO) should be studied not only as a security risk, but also as an emerging marketing practice that shapes market competition.

2606.17438 2026-06-17 cs.CV 新提交

Contact-Based Fringe Projection Profilometry for High-Resolution 3-D Surface Measurement of Reflective and Transparent Objects

基于接触式条纹投影轮廓术的高分辨率反射与透明物体三维表面测量

Ingu Yeo, Hyung-Gun Chi, Jae-Sang Hyun

发表机构 * Department of Mechanical Engineering, Yonsei University(延世大学机械工程系) Yonsei Institute for Embodied Intelligence, Yonsei University(延世大学具身智能研究所)

AI总结 针对GelSight传感器在反射/透明物体上深度精度不足和校准困难的问题,提出基于数字条纹投影的接触式三维测量方法,通过三角测量实现高精度全视场三维重建。

详情
AI中文摘要

本文提出一种基于数字条纹投影(DFP)系统的接触式三维表面测量方法,属于以商业成功的GelSight传感器为代表的视觉触觉传感家族。此类传感器已被证明对机器人指尖操作和接触传感有效。然而,由于GelSight采用RGB LED光度立体视觉,它不直接测量绝对深度,而是通过积分估计的表面梯度来推断深度,这可能累积重建误差;此外,随着传感区域增大,校准变得越来越困难,并且在高反射或透明物体上深度精度受到挑战。为克服这些缺点,我们提出一种基于条纹投影的接触测量技术,在涂覆硅胶的接触表面上执行基于三角测量的三维重建,提供接触区域上密集的逐像素表面几何和全场三维形状测量。通过将高精度数字条纹投影集成到传感器中,我们的方法简化了大面积校准,并提高了复杂表面的深度精度。实验结果,包括与GelSight Mini传感器的直接比较、球面拟合精度评估和不确定性分析,证实了所提方法显著提高了基于结构光的三维测量的精度和稳定性,允许可靠重建具有不同光学特性的物体。

英文摘要

This paper presents a contact-based 3-D surface measurement method based on a Digital Fringe Projection (DFP) system, belonging to the vision-based tactile sensing family pioneered by the commercially successful GelSight sensor. Such sensors have proven effective for robotic fingertip manipulation and contact sensing. However, because GelSight employs photometric stereo with RGB LEDs, it does not measure absolute depth directly but instead infers it by integrating estimated surface gradients, which can accumulate reconstruction errors; in addition, it becomes increasingly difficult to calibrate as the sensing area grows, and its depth accuracy is challenged on highly reflective or transparent objects. To overcome these drawbacks, we propose a fringe-projection-based contact measurement technique that performs triangulation-based 3-D reconstruction on a coated silicone contact surface, providing dense per-pixel surface geometry and full-field 3-D shape measurement over the contact region. By integrating high-accuracy digital fringe projection into the sensor, our approach simplifies calibration over larger areas and enhances depth precision for complex surfaces. Experimental results, including a direct comparison with a GelSight Mini sensor, a sphere-fitting accuracy evaluation, and an uncertainty analysis, confirm that the proposed method significantly improves the accuracy and stability of structured-light-based 3-D measurements, allowing reliable reconstruction of objects with diverse optical properties.
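
Digital fringe projection systems commonly recover a per-pixel phase from several phase-shifted sinusoidal patterns before converting phase to depth by triangulation. The sketch below shows the standard N-step phase-shifting formula for one pixel. Whether this sensor uses this particular scheme is an assumption, as are the function name `wrapped_phase`, the fringe model, and the example values. Phase unwrapping and the calibrated phase-to-depth mapping are not shown.

```python
import math
from typing import List, Sequence


def wrapped_phase(intensities: Sequence[float]) -> float:
    """Wrapped phase at one pixel from N >= 3 equally spaced phase-shifted fringe images.

    Assumes the k-th captured intensity follows I_k = A + B * cos(phi - 2*pi*k/N).
    """
    n = len(intensities)
    s = sum(i_k * math.sin(2 * math.pi * k / n) for k, i_k in enumerate(intensities))
    c = sum(i_k * math.cos(2 * math.pi * k / n) for k, i_k in enumerate(intensities))
    # The offset A cancels and the sums reduce to (N*B/2)*sin(phi) and (N*B/2)*cos(phi).
    return math.atan2(s, c)


def simulate_pixel(phi: float, n: int, offset: float = 0.5, amplitude: float = 0.4) -> List[float]:
    """Hypothetical fringe intensities at one pixel for a known phase phi."""
    return [offset + amplitude * math.cos(phi - 2 * math.pi * k / n) for k in range(n)]


print(wrapped_phase(simulate_pixel(phi=0.7, n=3)))
```

Because the phase depends only on intensity ratios, the result is insensitive to the offset and amplitude of the fringe signal. This is consistent with the abstract's point that a triangulation-based measurement does not rely on integrating surface gradients.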

2606.17437 2026-06-17 cs.CV cs.AI 新提交

Spatio-Temporal Fusion Model for Standard View Classification of Echocardiographic Videos

超声心动图视频标准视图分类的时空融合模型

Bo Gou, Jicheng Zhang, Jianlong Xiong, Tao He, Bentian Liu, Hai Wu, Yijiao Wang, Yu Zhang, Yujia Yang, Yun Dai, Jian Liu, Jie Wang

发表机构 * Department of Ultrasound, The First Affiliated Hospital of Chengdu Medical College, School of Clinical Medicine, Chengdu Medical College(成都医学院第一附属医院超声科,临床医学院) College of Computer Science, Sichuan University(四川大学计算机学院) Department of Medical Ultrasound, West China Hospital of Sichuan University(四川大学华西医院超声科) Cancer Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College(中国医学科学院北京协和医学院肿瘤医院)

AI总结 针对超声心动图视图分类中数据稀缺、时空特征难以融合的问题,提出基于不确定性感知的CNN-LSTM双流融合模型,在最大公开数据集EV9V上取得竞争性能。

详情
AI中文摘要

超声心动图标准视图的自动分类对于高效的临床工作流程至关重要,但面临三个主要挑战。首先,公开可用的数据集稀缺,且规模和视图覆盖范围有限。其次,一些现代视频级架构在超声心动图视图分类中的性能尚未得到充分探索。第三,某些视图类别在空间外观上高度相似,使得单帧特征不足以区分,而异质的帧质量使得鲁棒的时序信息融合变得复杂。为了解决这些挑战,我们发布了九视图超声心动图视频(EV9V)数据集,包含5,138个视频、910,579帧和9个标准视图,据我们所知,这是最大的公开超声心动图视频数据集。利用EV9V,我们系统地基准测试了代表性的视频分类架构,包括卷积神经网络(CNN)、循环神经网络(RNN)和Transformer。此外,我们提出了一种时空融合模型(STFM),一种高效的双流CNN-LSTM(长短期记忆)框架,联合捕获空间解剖结构和时间心脏动力学。所提出的框架利用不确定性感知学习在训练期间优先采样代表性视频片段,并在推理期间进行基于证据的融合,提高了对超声心动图视频中帧质量变化的鲁棒性。大量实验表明,我们的方法在各种视频分类模型中取得了竞争性能,验证了不确定性感知时空学习在超声心动图视图分类中的有效性。代码可在以下网址获取:https://github.com/bgx666/stfm。

Abstract

Automated classification of standard echocardiographic views is crucial for efficient clinical workflow but faces three main challenges. First, publicly available datasets are scarce and limited in scale and view coverage. Second, the performance of some modern video-level architectures for echocardiographic view classification remains underexplored. Third, some view categories exhibit highly similar spatial appearances, making single-frame features insufficient for discrimination, while heterogeneous frame quality complicates robust temporal information fusion. To address these challenges, we release the Echocardiographic Videos of Nine Views (EV9V) dataset, comprising 5,138 videos, 910,579 frames, and 9 standard views, which is, to the best of our knowledge, the largest publicly available echocardiography video dataset. Using EV9V, we systematically benchmark representative video classification architectures, including Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Transformers. Furthermore, we propose a Spatio-Temporal Fusion Model (STFM), an efficient dual-stream CNN-LSTM (Long Short-Term Memory) framework that jointly captures spatial anatomical structures and temporal cardiac dynamics. The proposed framework leverages uncertainty-aware learning to preferentially sample representative video segments during training and evidence-based fusion during inference, improving robustness to variations in frame quality across echocardiographic videos. Extensive experiments demonstrate that our method achieves competitive performance across diverse video classification models, validating the effectiveness of uncertainty-aware spatio-temporal learning for echocardiographic view classification. The code is available at https://github.com/bgx666/stfm.
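The abstract mentions evidence-based fusion at inference time without giving the formula. As a rough illustration only, the sketch below uses the usual evidential-learning convention (Dirichlet parameters `alpha = evidence + 1`, uncertainty `u = K / sum(alpha)`) and down-weights uncertain frames when averaging their class probabilities. Both the weighting rule and the function names are assumptions, not the STFM implementation.

```python
def dirichlet_uncertainty(evidence):
    """Turn non-negative per-class evidence for one frame into (probs, uncertainty).

    Assumed convention: alpha_k = e_k + 1, S = sum(alpha),
    p_k = alpha_k / S, u = K / S.
    """
    k = len(evidence)
    alpha = [e + 1.0 for e in evidence]
    total = sum(alpha)
    return [a / total for a in alpha], k / total


def fuse_frames(frame_evidence):
    """Average per-frame probabilities, weighting each frame by (1 - uncertainty)."""
    k = len(frame_evidence[0])
    fused = [0.0] * k
    total_weight = 0.0
    for evidence in frame_evidence:
        probs, u = dirichlet_uncertainty(evidence)
        weight = 1.0 - u  # a frame with zero evidence contributes nothing
        total_weight += weight
        for c in range(k):
            fused[c] += weight * probs[c]
    if total_weight == 0.0:
        return [1.0 / k] * k  # no frame carried evidence: fall back to uniform
    return [f / total_weight for f in fused]


# Three frames, three views: one confident frame, one blank frame, one weak frame.
video_probs = fuse_frames([[9.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
```

In this toy input the confident first frame dominates the fused prediction, while the blank frame is ignored, which is the qualitative behaviour one would want under heterogeneous frame quality.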

2606.17436 2026-06-17 cs.CV New submission

UoU: A Universal Fingerprint Foundation Model Based on Large-Scale Unsupervised Learning


Xiongjun Guan, Jianjiang Feng, Jie Zhou

Affiliations: Department of Automation, Tsinghua University

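AI summary: Proposes UoU, a fingerprint foundation model that combines a multi-level representation hierarchy with a training strategy mixing supervised, weakly supervised, and unsupervised learning, aiming at general-purpose feature extraction across sensors, image qualities, and applications.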

Abstract

Fingerprint recognition is still dominated by task-specific pipelines, where enhancement, structural parsing, alignment, and matching are optimized in isolation. Although effective in narrow settings, this design limits representation reuse across sensors, qualities, and downstream applications. We therefore present UoU, short for "a Universal fingerprint foundation model based on large-scale Unsupervised learning," which reframes fingerprint feature extraction as a domain-specific foundation-model problem. UoU is organized around a multi-level representation hierarchy spanning image restoration, structural fields, semantic tokens, point-level biometric entities, and compact global descriptors. Its training recipe combines a supervised cold start on precise annotations, large-scale weakly supervised refinement, and large-scale unsupervised consolidation, with the latter two stages iterated during large-scale training so that weak supervision broadens semantic coverage while unsupervised learning stabilizes correspondences, invariances, and representation geometry. Rather than treating fingerprint imagery as generic texture, UoU exploits domain-specific symmetries and intermediate structure, including orientation flow, periodic ridge patterns, sparse biometric entities, and spatial equivariance. The framework is intentionally architecture-agnostic: while the present study includes an initial transformer-based structured-prediction instantiation, the broader design supports multi-task learning, scalable model configurations, and downstream specialization for matching, alignment, enhancement, registration, and related fingerprint applications. This paper presents the technical motivation, system design, and validation protocol of UoU, and part of the baseline implementation is publicly available at https://github.com/XiongjunGuan/UoU.

2606.17435 2026-06-17 cs.LG New submission

MorphStrata: Layer-Specific Perturbations for Generating Morphence Students in Time-Series Moving Target Defense


Abhishek Bhardwaj, Arnav Doshi, Anusri Nagarajan, Thanh Quynh Nhu Ta, Mohammad Masum, Robert Chun, Jaydip Sen, Saptarshi Sengupta

Affiliations: Department of Computer Science, San José State University, San José, CA, USA; Department of Computer Engineering, San José State University, San José, CA, USA; Praxis Business School, Kolkata, India

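AI summary: Proposes MorphStrata, a strategy that generates structurally heterogeneous student models through selective, layer-specific stochastic noise injection. It preserves moving-target-defense robustness while keeping the added training overhead under 1% in most experiments, and achieves RMSE reductions of up to 24.11% and 97.97% on high-entropy periodic datasets.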

Comments: 13 pages, 9 figures, 11 tables

Abstract

Time-series forecasting models remain vulnerable to gradient-based adversarial attacks while existing defense mechanisms typically incur a trade-off in robustness for bounded response and compute cost. The problem is pronounced in Moving Target Defense where maintaining multiple randomized model instances substantially exacerbates the training overhead. In this work, we introduce MorphStrata, a student generation strategy with selective, layer-specific stochastic noise injection that extends the traditional Morphence defense. MorphStrata uses a Transformer backbone as the teacher and perturbs randomly selected architectural blocks to create structured heterogeneity across student models in response to varied data distributions and threat models. We evaluate against vanilla Transformer and Morphence backbones on a suite of benchmarks including the Jena Climate, Electricity Load Diagrams, and Appliances Energy Prediction using FGSM, BIM and PGD attacks across multiple attack strengths. Across datasets and attack regimes, the proposed ensemble maintains comparable adversarial RMSE. Specifically, for high entropy, periodic datasets as in the case of the AEP data, MorphStrata achieves the lowest RMSE across all attacks and perturbation budgets, improving over the static baseline by up to 24.11% and 97.97% under FGSM and BIM respectively at an epsilon value of 0.5 over 30 randomized trials. Targeting the layers to generate MorphStrata students accounts for less than 1% increase in train-times over the Morphence MTD baseline for most of the experiments, while accounting for double digit gains in adversarial RMSE reduction. We also observe a positive correlation between higher pairwise L2 distance (among generated students) and overall defense effectiveness. In summary, MorphStrata maintains adversarial robustness as an MTD defense at marginal cost deltas when compared to existing baselines.
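To make the idea of layer-specific perturbation concrete, here is a toy sketch under stated assumptions: the teacher is a plain dictionary of named weight lists, each student copies it and receives Gaussian noise on a randomly chosen subset of layers, and student diversity is measured as the mean pairwise L2 distance. The layer names, the noise scale `sigma`, and every function name are hypothetical. The abstract does not specify the injection scheme, so this is not the authors' procedure.

```python
import math
import random


def make_student(teacher, n_layers, sigma, rng):
    """Copy the teacher and add Gaussian noise to n_layers randomly chosen layers."""
    chosen = set(rng.sample(sorted(teacher), n_layers))
    student = {}
    for name, weights in teacher.items():
        if name in chosen:
            student[name] = [w + rng.gauss(0.0, sigma) for w in weights]
        else:
            student[name] = list(weights)  # untouched layers stay identical
    return student, chosen


def l2_distance(a, b):
    """L2 distance between two models with the same layer names and shapes."""
    return math.sqrt(
        sum((x - y) ** 2 for name in a for x, y in zip(a[name], b[name]))
    )


def mean_pairwise_l2(students):
    """Average L2 distance over all unordered pairs of students."""
    pairs = [
        (i, j) for i in range(len(students)) for j in range(i + 1, len(students))
    ]
    return sum(l2_distance(students[i], students[j]) for i, j in pairs) / len(pairs)


# Hypothetical three-block teacher; real Transformer blocks hold tensors, not short lists.
teacher = {"attention": [0.2, -0.1, 0.4], "feedforward": [0.3, 0.3], "head": [1.0]}
rng = random.Random(0)
students = [make_student(teacher, n_layers=1, sigma=0.05, rng=rng)[0] for _ in range(4)]
diversity = mean_pairwise_l2(students)
```

The `diversity` value corresponds loosely to the pairwise L2 distance that the abstract reports as positively correlated with defense effectiveness; the sketch does not model attacks or forecasting error.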