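arXivDaily: a daily academic digest synced with the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering & systems, and more.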

2606.13945 2026-06-15 cs.CL New submission

MedLatentDx: Latent Multi-Agent Communication for Cross-Hospital Rare-Disease Diagnosis

Ziqing Wang, Lili Zhao, Kaize Ding

Affiliations: Northwestern University

AI summary: Proposes MedLatentDx, a latent multi-agent communication framework for cross-hospital rare-disease diagnosis. Hospital agents transmit latent KV blocks rather than clinical text to protect privacy, and the framework supports both same-backbone and cross-family LLM deployments. On CrossRare-Bench it improves diagnostic performance while reducing reconstructable clinical content.

Abstract

Rare diseases affect over $300$ million patients across more than $7{,}000$ conditions, yet no single hospital encounters enough cases of any one condition for reliable diagnosis. Cross-hospital collaboration could help by allowing a diagnosing institution to use distributed, case-specific diagnostic evidence, but privacy regulations restrict the transmission of identifiable clinical text across institutional boundaries. This setting raises two challenges: existing medical agent systems often rely on textual evidence exchange, while raw latent states such as hidden states and KV caches may still reveal prompt-derived clinical content. We introduce MedLatentDx, a latent multi-agent communication framework in which hospital agents keep private clinical records and retrieved cases local, and send compact latent KV blocks to a host agent for rare-disease diagnosis. MedLatentDx supports two deployment settings: same-backbone hospital agents use latent KV distillation, while hospitals with different LLM backbones use cross-family latent alignment. On CrossRare-Bench, a self-built large-scale rare-disease benchmark with hospital-level partitions, MedLatentDx improves cross-hospital diagnostic performance while reducing reconstructable clinical content relative to raw-latent communication baselines.

2606.13944 2026-06-15 cs.CL New submission

LLMs Contain Multitudes: How Deployment Context Reshapes Model-Level Preferences and Values

Filip Trhlik, Aoife O'Flynn, Angela Yu, Arduin Findeis, Paula Buttery

Affiliations: University of Cambridge; ALTA Institute; Leverhulme Centre for the Future of Intelligence; Microsoft UK

AI summary: Using two pairwise comparison paradigms (country preference ranking and utility judgements), this study finds that deployment context (for example, writing a Reddit post or a news article) affects LLM preferences and values far more than prompt paraphrasing or temperature controls, indicating that model-level preferences are context-dependent.

Comments: 68 pages, 54 figures, 54 tables

Abstract

Large language models (LLMs) are increasingly characterised in recent evaluation work as having stable, model-level preference and value systems. However, accompanying robustness checks are limited to incidental prompt perturbations such as syntax variation and option reordering. This leaves open whether the measured properties survive when the surrounding task context changes, as it does in most real deployments. We test this directly across two established pairwise paradigms: ranking country preferences and eliciting utility judgements. In both, we make the deployment context -- the high-level task the model is performing while making concrete value-dependent choices -- our controlled variable, varied across framings such as writing a Reddit post or a news article. Across five LLMs and over 1.2M pairwise decisions, deployment context produces variation far larger than prompt paraphrasing and temperature controls. In country preference rankings over 15 countries, context induces widespread, statistically significant rank shifts; the aggregate Global North favouritism reported in prior work is itself context-dependent, with each model's bias shifting systematically across contexts. In utility elicitation over 50 outcomes, broad cross-category ordering is preserved, but fine-grained rankings within domains vary substantially, and cardinal exchange rates between outcomes (e.g. how many lives in one region equal one in another) shift by a factor of 2.47 at the median. Reported model-level preferences and utilities are therefore better understood as context-conditioned measurements than fixed model-level properties: safety guarantees obtained under one framing provide limited assurance in another.

2606.13940 2026-06-15 cs.CL New submission

Can Post-Training Turn LLMs into Good Medical Coders? An Empirical Study of Generative ICD Coding

Ziqing Wang, Weihao Li, Shijie Chen, Yuan Luo, Kaize Ding

Affiliations: Northwestern University

AI summary: Compares prompting, supervised fine-tuning, and reinforcement learning (GRPO and the newly proposed PHI curriculum) on ICD coding, and finds that post-training, especially SFT and GRPO, substantially improves the coding performance of generative LLMs. This suggests that the bottleneck lies in model adaptation rather than in the generative paradigm itself.

Abstract

Automated International Classification of Diseases (ICD) coding is a core medical-coding task for billing, epidemiology, and clinical decision support. Generative large language models (LLMs) are often reported as weak medical coders, but this finding mainly comes from inference-time settings such as prompting, retrieval, reranking, or tool use, leaving the role of task-specific post-training underexplored. We present a controlled empirical study of post-training for generative ICD coding, comparing discriminative baselines with LLM coders across prompting, supervised fine-tuning, and reinforcement learning under a common protocol and metric set. To our knowledge, this is the first study to evaluate RL-based post-training for generative LLM coders in ICD coding. We further introduce PHI, a diagnostic curriculum that extends GRPO to refine missed-code cases. Our results show that prompting-only evaluation substantially underestimates the potential of LLMs for ICD coding. SFT provides the main capability jump, GRPO further improves code-set prediction beyond SFT, and PHI provides targeted gains on macro-level performance. These findings suggest that the main bottleneck is not the generative formulation alone, but how the model is adapted and optimized for full-taxonomy recall. We release our code, data splits, and checkpoints at https://github.com/AlexandreWANG915/LLM4ICD.
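
As a rough illustration of how RL-based post-training for code-set prediction might score sampled outputs, the sketch below computes a set-level F1 reward for each sampled ICD code set and then normalizes rewards within the sampling group, which is the group-relative idea commonly associated with GRPO. The reward choice (set F1), the function names, and the toy codes are assumptions made for illustration; the abstract does not specify the reward used in the paper, and PHI is not modeled here.

```python
from statistics import mean, pstdev


def set_f1(predicted, gold):
    """F1 between a predicted code set and the gold code set (assumed reward)."""
    predicted, gold = set(predicted), set(gold)
    if not predicted and not gold:
        return 1.0
    tp = len(predicted & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)


def group_relative_advantages(rewards, eps=1e-8):
    """Center and scale rewards within one group of samples for the same note."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy example: four sampled code sets for one clinical note (hypothetical data).
gold = {"I10", "E11.9", "N18.3"}
samples = [
    ["I10", "E11.9", "N18.3"],  # exact match
    ["I10", "E11.9"],           # one missed code
    ["I10", "J45.909"],         # one hit, one false positive, two misses
    ["Z00.00"],                 # no overlap
]
rewards = [set_f1(s, gold) for s in samples]
advantages = group_relative_advantages(rewards)
```

Samples that recover more of the gold code set receive positive advantages and the rest receive negative ones, so the update pressure depends only on how a sample compares with its siblings, not on the absolute reward scale.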

2606.13934 2026-06-15 cs.AI New submission

Adversarial Concept Search: Predicting Compositional Errors From Feature Geometry

Jennifer Meng Lu, Ruochen Zhang, Isabelle Lee, David Alvarez-Melis, Ellie Pavlick, Naomi Saphra

Affiliations: Brown University; University of Southern California; Harvard University; Boston University

AI summary: Uses an LLM's representational geometry to predict its compositional failure modes: concepts encoded near-orthogonally compose reliably, whereas concepts with close encodings interfere and fail, so errors can be predicted without evaluating specific inputs.

Abstract

Humans cannot always intuit what scenarios are most challenging to LLMs. Hoping to capture challenging edge cases, developers either design problems to be difficult for humans or curate extensive benchmarks. What if we could instead anticipate which scenarios a model will fail on? In this paper, we use an LLM's representational geometry to predict which concept combinations it will fail on. We attribute this compositional failure to interference between salient features. In tasks that require systematic composition (toy programmatic settings, multihop reasoning, multilingual factual recall), we find that when a pair of concepts is encoded near-orthogonally, the model reliably composes them. When their linear encodings are close, producing interference, the model fails to compose them. Our method reliably anticipates failure modes across different compositional tasks, without evaluating specific inputs. These results lay the groundwork to use representational geometry to identify high-risk examples, construct targeted stress tests, and provide a scalable foundation for active learning in real-world deployment.
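
The geometric intuition in this abstract can be sketched in a few lines: given one direction vector per concept, rank concept pairs by how far they are from orthogonal. Using absolute cosine similarity as the measure of "closeness", along with the concept names and vectors below, is an assumption for illustration; the abstract does not state how the paper extracts or compares linear encodings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_pairs_by_interference(directions):
    """Return (|cosine|, concept_a, concept_b) tuples, highest overlap first.

    Pairs near the top are the hypothesized high-risk combinations;
    pairs near zero are close to orthogonal.
    """
    names = sorted(directions)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            pairs.append((abs(cosine(directions[a], directions[b])), a, b))
    return sorted(pairs, reverse=True)


# Hypothetical concept directions in a 3-dimensional feature space.
directions = {
    "color": [1.0, 0.0, 0.0],
    "shape": [0.0, 1.0, 0.0],
    "size": [0.9, 0.1, 0.0],
}
ranked = rank_pairs_by_interference(directions)
```

In this toy setup the "color" and "size" directions nearly coincide, so that pair is flagged first, while "color" and "shape" are exactly orthogonal and land last. No model inputs are evaluated at any point, which mirrors the input-free prediction the abstract describes.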

2606.13931 2026-06-15 cs.CL New submission

DLawBench: Evaluating LLMs Through Multi-Turn Legal Consultation

Li Zhang, Yuzhen Shi, Yiran Hu, Jingwen Zhang, Wenbo Lv, Yubo Ma, Wei Wang, Rongyao Shi, Yuanyang Qiu, Xinran Xu, Yuemeng Qi, Linlin Miao, Jaromir Savelka, Yun Liu, Kevin Ashley, Bing Zhao, Hu Wei, Lin Qu

Affiliations: University of Pittsburgh; Alibaba Group; University of Waterloo; Shandong University; Skylenage; China University of Political Science and Law; Qwen Team, Alibaba Group; Fordham University; Shanghai Jiao Tong University; Carnegie Mellon University; Tsinghua University

AI summary: Proposes the DLawBench benchmark, which evaluates the interactive ability of large language models in realistic legal consultation through multi-turn dialogues with four simulated client types (cooperative, dependent, withdrawn, and adversarial).

Comments: 37 pages, 8 figures, 26 tables. Code and data: https://github.com/SKYLENAGE-AI/DLawBench

Abstract

Lawyer-client consultation is a critical starting point for legal services. Effective legal assistance hinges on eliciting sufficient and truthful information from clients in order to devise strategies that best protect their interests. This task requires Large Language Models (LLMs) not only to perform robust legal reasoning, but also to strategically elicit material facts through multi-turn interactions and effectively guide clients with diverse personalities. Yet existing legal benchmarks overlook this interactive capability. To fill this gap, we introduce DLawBench, a diagnostic benchmark for real-world legal consultation. Drawing on realistic client behavior, we characterize lawyer-client interactions into four types: Cooperative, Dependent, Withdrawn, and Adversarial. Using dialogues grounded in real cases, DLawBench evaluates whether LLMs can effectively conduct legal consultation under realistic conditions. DLawBench comprises 461 cases from Chinese and U.S. law, 5,532 paired fact entries, 3,411 inquiry rubrics, and 3,348 issue-resolution rubrics, and evaluates 26 representative LLMs. Systematic experiments show substantial headroom: the best-performing model, GPT-5.5, achieves only 0.562 on consultation-grounded legal reasoning. More importantly, DLawBench exposes both sycophancy in legal consultation and a paradox: models perform worse when clients need guidance most.

2606.13929 2026-06-15 cs.CV cs.LG New submission

Self-Evolving Visual Questioner

Yijun Liang, Hengguang Zhou, Ming Li, Lichen Li, Cho-Jui Hsieh, Tianyi Zhou

Affiliations: University of Maryland, College Park; University of California, Los Angeles; Peking University; Arena; MBZUAI

AI summary: Proposes a self-evolving framework in which a vision-language model acts as both questioner and filter to generate harder, more informative, and more visual-centric questions without external supervision, while preserving exploration diversity to avoid training collapse. The approach markedly improves the quality and difficulty boundary of autonomous question generation.

Comments: 21 pages, including references and appendix. Project Page is available at https://joliang17.github.io/SelfEvolvingVQG/

Abstract

Vision-language models (VLMs) are typically trained as passive answerers, while their ability to actively ask diverse, non-trivial, visual-centric and grounded questions remains underexplored. Existing visual questioners' performance is bottlenecked by the availability of high-quality training data or the cost of curating them. We show that a VLM can continuously improve itself as a visual questioner without any external supervision. We propose a self-evolving framework that uses a VLM itself as both a proposer and a filter to produce harder, more informative, and visual-centric questions, while maintaining their exploration diversity to avoid training collapse. These questions are then used to train the VLM in both questioner and answerer modes. To evaluate the questioner, we introduce an agentic protocol that assesses questions along perception, reasoning, and diversity dimensions. Experiments across various backbone VLMs show that our method substantially enhances the quality and substantially expands the difficulty boundary of autonomous question generation. Under the same budget, our self-supervision is more effective than training on the static source data. Moreover, the self-evolving questioner remains a competitive or even better answerer.

2606.13925 2026-06-15 cs.AI math.AG New submission

Sorries Are Not the Hard Part: An Expert-Review Case Study of a Semi-Autonomous Formalization

Vasily Ilin, Brian Nugent

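Affiliations: GitHub

AI summary: Through a case study of a semi-autonomous formalization of Grothendieck's vanishing theorem, the paper shows that large language models fall short in choosing definitions and designing APIs, and argues that expert review, rather than sorry-free status alone, should be the evaluation standard.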

2606.13916 2026-06-15 cs.AI New submission

A Multi-Agent AI System for Automated High School Transcript Processing: Collaborative Document Analysis at Scale

Ben Torkian, Jun Zhou

Affiliations: University of South Carolina

AI summary: Proposes a multi-agent AI system in which three specialized agents collaborate to automatically process high school transcripts in varied formats, using GPA extraction as a coordination signal, and reports 96.7% accuracy at 45 seconds per transcript.

Abstract

Each year, college admissions offices face an overwhelming challenge: processing millions of high school transcripts, each with unique formats, grading systems, and layouts. This manual process creates operational bottlenecks that delay admissions decisions and consume valuable resources. We present a transformative solution through a multi-agent AI system where specialized agents collaborate to automatically process diverse transcript formats through intelligent coordination and communication. Our multi-agent architecture consists of three specialized agents (a Pattern Recognition Agent for format-specific parsing, a Semantic Analysis Agent for natural language understanding, and a Vision Intelligence Agent for multimodal document analysis) coordinated by an Orchestration Agent that manages agent communication and result reconciliation. Our key innovation lies in agent-based quality control using GPA extraction as a coordination signal, ensuring reliable agent collaboration and preventing critical information loss. When evaluated on 40 real-world transcripts from high schools across 13 U.S. states, our agent system successfully processed every document, achieving 96.7% accuracy compared to expert manual review while maintaining practical processing speeds of 45 seconds per transcript. This work demonstrates how multi-agent coordination can solve complex document processing challenges, offering institutions a scalable, collaborative AI solution that preserves accuracy while dramatically reducing processing time.

2606.13915 2026-06-15 cs.RO cs.SY eess.SY New submission

Learning Dynamic Swing-Up of an Inverted Pendulum using Remote Magnetic Actuation

Viacheslav Sydora, Jasan Zughaibi, Denis von Arx, Quentin Boehler, Michael Muehlebach

Affiliations: University of Zurich; ETH Zurich; University of Strasbourg

AI summary: Addressing the gap in far-from-equilibrium trajectory tracking for electromagnetic navigation systems, the paper combines trajectory optimization, time-varying LQR, and iterative learning control to achieve the first magnetically actuated swing-up of an inverted pendulum, succeeding within six iterations. The learned ILC correction closely matches the torque discrepancy predicted by a high-fidelity magnetic field model.

Abstract

Electromagnetic Navigation Systems (eMNS) have gained considerable attention for minimally invasive surgery and targeted drug delivery. While most of the literature relies on quasi-static control of these systems, recent work has demonstrated the benefits of dynamic approaches. However, trajectory tracking far from equilibrium states remains largely unaddressed. We close this gap by demonstrating the first swing-up of a magnetically actuated inverted pendulum using the clinically-ready Navion eMNS. Although the inverted pendulum is not clinically relevant in itself, the proposed method utilizes torques and forces as control objectives, making it applicable to other magnetically actuated devices such as catheters and guidewires. Our approach combines trajectory optimization that accounts for internal eMNS dynamics with time-varying Linear Quadratic Regulator (LQR) state feedback and Iterative Learning Control (ILC), which leverages previous trial data and the system's dynamic model to progressively refine the feedforward command. While LQR alone fails due to the complex phenomena of magnetic actuation, ILC enables successful swing-up within six iterations. Furthermore, post-experimental analysis reveals that the learned ILC correction closely matches the torque discrepancy predicted by high-fidelity magnetic field model calibration, suggesting learning and adaptation as a promising tool to deal with uncertainties in electromagnetic actuation arising, e.g., from patient-specific physiological motion patterns and field model calibration inaccuracies.
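
To make the role of Iterative Learning Control concrete, here is a minimal sketch of the trial-to-trial idea: replay the same task, measure the tracking error, and fold that error back into the feedforward command for the next trial. It uses a simple proportional (P-type) ILC update on a toy first-order plant with an unmodeled repeating disturbance. This is a deliberate simplification and not the paper's method: the abstract describes a model-based ILC combined with time-varying LQR feedback on a pendulum, neither of which is modeled here, and the plant parameters, gain, and disturbance are assumptions.

```python
import math


def run_trial(u, a=0.5, b=0.5, disturbance=None):
    """Simulate one trial of x[t+1] = a*x[t] + b*u[t] + d[t], starting at x = 0."""
    x = 0.0
    outputs = []
    for t, u_t in enumerate(u):
        d_t = disturbance[t] if disturbance else 0.0
        x = a * x + b * u_t + d_t
        outputs.append(x)  # output produced in response to u[t]
    return outputs


def ilc(reference, iterations=10, gain=1.0):
    """P-type ILC: u_next[t] = u[t] + gain * error[t], repeated over trials."""
    n = len(reference)
    # Repeating disturbance the controller has no model of (assumed for the demo).
    disturbance = [0.2 * math.sin(0.3 * t) for t in range(n)]
    u = [0.0] * n
    rms_history = []
    for _ in range(iterations):
        y = run_trial(u, disturbance=disturbance)
        error = [r - y_t for r, y_t in zip(reference, y)]
        rms_history.append(math.sqrt(sum(e * e for e in error) / n))
        u = [u_t + gain * e_t for u_t, e_t in zip(u, error)]
    return u, rms_history


reference = [math.sin(0.2 * t) for t in range(30)]
feedforward, rms_history = ilc(reference)
```

Because the disturbance repeats identically in every trial, the learned feedforward absorbs it without ever identifying it explicitly. That is loosely analogous to the abstract's observation that the learned correction matched a torque discrepancy later explained by field-model calibration.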

2606.13911 2026-06-15 cs.CV New submission

Overhead Wildlife Locator (OWL): Benchmarking Weakly Supervised Learning for Aerial Wildlife Surveys

Isai Daniel Chacón, Zhongqi Miao, Bruno Demuro, Caleb Robinson, Rahul Dodhia, Lasha Otarashvili, Jason Holmberg, Kirk Larsen, Howard Frederick, Nathan J. Pamperin, Pablo Arbeláez, Juan M. Lavista Ferres

Affiliations: Microsoft AI for Good Lab; Center for Research and Formation in Artificial Intelligence (Cinfonia), Universidad de los Andes; Conservation X Labs; Kirk Larsen Consulting; Tanzania Wildlife Research Institute; Alaska Department of Fish and Game

AI summary: Proposes OWL, a weakly supervised density-estimation framework with three variants, which achieves the highest AP on four of five public aerial datasets and is validated for operational feasibility on an Alaska caribou census.

Comments: 16 pages, 4 figures, 3 tables

Abstract

Automated aerial wildlife surveys increasingly rely on deep learning, yet standard object detectors require bounding-box annotations, reported to be up to seven times slower and three times more expensive to produce than point-level labels. To address this bottleneck, we introduce the Overhead Wildlife Locator (OWL), a weakly supervised density-estimation framework with three variants: OWL-C, a fully convolutional model for high-throughput screening; OWL-T, a Swin-augmented hybrid for heterogeneous, cluttered scenes; and OWL-D, built on a frozen DINOv3 ViT-H+/16 encoder with a DPT-style fusion decoder. We benchmark all three against POLO, YOLOv11n, and YOLOv11l across five public aerial datasets, from sparse fixed-wing savanna surveys to dense UAV paddock imagery, and against the published HerdNet baseline on its native Delplanque split. OWL-D sets a new state of the art on Delplanque (0.934 AP vs. HerdNet's 0.840) and records the highest AP on four of the five datasets. Performance is regime-dependent: on the extreme-density SheepCounter UAV dataset the hybrid OWL-T leads (0.978 AP) and the convolutional variants attain the lowest counting error, whereas the foundation-based OWL-D degrades, indicating which variant suits which survey type. We further validate operational readiness on the Alaska Department of Fish and Game's 2022 Central Arctic Caribou census: under cross-herd and cross-temporal transfer, OWL-C fine-tuned on the 2017 Porcupine Caribou Herd split attains F1 = 0.965 on a held-out patch test set, with a signed count error of +3.1% aggregated across the released test patches. We release the OWL code, model weights, and the annotated Porcupine Caribou Herd 2017 (PCH) and Central Arctic Herd 2022 (CAH) patches, the first open patch-level datasets for large-scale caribou aerial surveys, at https://github.com/microsoft/MegaDetector-Overhead.
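
The two census metrics quoted at the end of this abstract can be illustrated with a short sketch. F1 is computed from detection counts in the usual way. The signed count error is assumed here to be the difference between total predicted and total true counts, divided by the total true count, so that a positive value means overcounting. That aggregation is an assumption, since the abstract does not define it, and the patch counts below are made up for the example.

```python
def f1_score(tp, fp, fn):
    """F1 from true positives, false positives, and false negatives."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def signed_count_error(predicted_counts, true_counts):
    """Aggregate signed error: positive means overcounting (assumed definition)."""
    total_true = sum(true_counts)
    return (sum(predicted_counts) - total_true) / total_true


# Hypothetical per-patch animal counts.
predicted = [10, 22, 0]
actual = [10, 20, 1]
error = signed_count_error(predicted, actual)
score = f1_score(tp=8, fp=2, fn=2)
```

Keeping the sign matters operationally: an unsigned error would hide whether a census estimate is biased toward overcounting or undercounting, and per-patch errors of opposite sign partly cancel in the aggregate.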

2606.13910 2026-06-15 cs.CV New submission

PMOF: A Dataset and Benchmark for Passenger Monitoring Using Overhead Fisheye Cameras

Stella Katharina Wermuth, Qazi Arbab Ahmed, Klaus Neumann, Thorsten Jungeblut

Affiliations: Bielefeld University; Fraunhofer IOSB-INA

AI summary: Introduces PMOF, the first public dataset of top-view fisheye imagery captured inside a moving vehicle, with more than 19k annotated frames supporting detection, tracking, and action recognition. Cross-domain fine-tuning achieves high accuracy and narrows the domain gap between static and moving environments.

Comments: 6 pages, 7 figures. Accepted to the 22nd IEEE International Conference on Advanced Visual and Signal-Based Systems (AVSS 2026)

Abstract

Autonomous staff-free public transport requires reliable in-vehicle passenger monitoring. However, perception inside moving vehicles is challenged by confined spaces, variable illumination, motion-induced background variation, occlusion, and limited viewpoints. To mitigate these spatial constraints, ceiling-mounted fisheye cameras provide full-scene coverage from a single viewpoint. Yet existing public overhead fisheye datasets are recorded in static environments and do not capture the domain shift introduced by vehicle motion. To fill this gap, we introduce PMOF, Passenger Monitoring using Overhead Fisheye cameras, the first public dataset of top-view fisheye imagery captured inside a moving vehicle, comprising over 19k manually annotated frames. PMOF provides rotated bounding boxes, tracking identifiers, and action labels, supporting object detection, tracking, and action recognition. We benchmark PMOF using YOLO26m-obb models fine-tuned under multiple dataset configurations that combine PMOF with existing overhead fisheye datasets. Cross-domain fine-tuning with custom rotation-aware augmentation achieves 94.8% AP50 on PMOF and 96.5% AP50 on an unseen overhead fisheye dataset from a different domain. Our results highlight the domain gap between static and moving environments and show that incorporating PMOF improves detection performance and advances generalization beyond passenger monitoring to broader fisheye-based person detection tasks. The dataset and code are available at https://swermuth.github.io/pmof/.

2606.13904 2026-06-15 cs.CL cs.AI cs.DB New submission

SANA: What Matters for QA Agents over Massive Data Lakes?

Austin Senna Wijaya, Jiaxiang Liu, Haonan Wang, Eugene Wu

Affiliations: Columbia University

AI summary: Proposes SANA, a diagnostic framework that uses ablation experiments to analyze failures in search, planning, data analysis, and action policy in exploratory question answering over data lakes, and identifies data analysis as a consistent bottleneck.

Comments: 9 pages, 7 figures

Abstract

Exploratory question answering (EQA) over data lakes requires an LLM agent to discover relevant sources, analyze retrieved data, and adapt its actions based on intermediate results. End-to-end accuracy alone cannot distinguish failures in search, planning, data analysis, or the agent's Action Policy: its decisions about what to do next and when to submit an answer. We present SANA (Search Agent Navigation Ablation framework), a diagnostic ablation framework that transforms EQA tasks into runtime profiles containing gold source sequence, sanitized subquestions, and execution records. SANA uses these profiles to construct idealized search, planning, and data-analysis tools, allowing each component to be ablated; the residual gap is diagnostic evidence for policy failures. To illustrate SANA as a reusable evaluation framework, we adapted two recent EQA benchmarks, LakeQA and KramaBench, and evaluated lightweight and mid-sized agents under fixed prompts, budgets, data lakes, and runtimes. Across both benchmarks, data analysis is a consistent bottleneck while planning is less so. Search is a major limitation in LakeQA's large data-lake setting, but less so for the smaller-scale KramaBench. SANA thus deconstructs end-to-end task accuracies into a diagnosis of where data-lake agents fail, and allows for systematic comparisons of progress in search, planning, data analysis, and agent design.

2606.13901 2026-06-15 cs.LG cs.NE New submission

SpikF-GO: Spiking Fourier Graph Operators for Multivariate Time Series Forecasting

Jafar Bakhshaliyev, Niels Landwehr

Affiliations: Data Science Group, University of Hildesheim

AI summary: Existing SNN forecasting methods lack a mechanism for modeling inter-variable dependencies. SpikF-GO addresses this with a hypervariate graph formulation and spike-driven spectral processing, combined with a learnable sparse frequency gate and a Complex LIF gate. Under a unified protocol it achieves the best average rank among SNN methods at reduced energy cost.

Comments: 23 pages, 2 figures, 11 tables. Accepted for presentation at ECML PKDD 2026. Code: https://github.com/jafarbakhshaliyev/SpikF-GO

Abstract

Spiking Neural Networks (SNNs) have emerged as an energy-efficient alternative to conventional neural networks, demonstrating strong performance in computer vision and robotics. More recently, SNNs have been applied to time series forecasting (TSF), with methods exploring spiking temporal backbones, spike-compatible positional encodings, Fourier-domain processing, and redesigned neuron dynamics. However, existing SNN forecasting approaches process variables independently, lacking explicit mechanisms for modeling inter-variable dependencies. This is a critical limitation in multivariate settings, where cross-variable correlations carry substantial predictive information. We propose Spiking Fourier Graph Operators (SpikF-GO), which addresses this gap by combining a hypervariate graph formulation in which every scalar observation becomes a graph node with spike-driven spectral processing. SpikF-GO introduces a Hard Concrete frequency gate for learnable sparse frequency selection and a Complex LIF gate that applies independent spiking neurons to real and imaginary Fourier components, preserving binary, event-driven computation throughout the spectral domain. We further present a variant incorporating Central Pattern Generator-based positional encodings for stronger long-range temporal modeling. Evaluated on eight benchmarks under a unified experimental protocol, SpikF-GO achieves the best average rank among all SNN methods and outperforms its ANN counterpart, FourierGNN, at reduced energy cost. SpikF-GO maintains competitive accuracy even at substantially smaller embedding dimensions, thereby achieving significant energy reductions. To our knowledge, this is among the first works to bring graph-based multivariate modeling into the spiking domain for TSF and the first to provide a unified comparison across SNN forecasting architectures under a common experimental protocol.
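
The "Complex LIF gate" idea, independent spiking neurons on the real and imaginary parts of each Fourier coefficient, can be sketched with a textbook leaky integrate-and-fire neuron. The decay factor, threshold, hard reset, and the choice to let only positive membrane potential trigger spikes are assumptions for illustration; the abstract does not give the neuron equations, and the Hard Concrete frequency gate is not modeled.

```python
class LIFNeuron:
    """Textbook leaky integrate-and-fire neuron with a hard reset."""

    def __init__(self, threshold=1.0, decay=0.5):
        self.threshold = threshold
        self.decay = decay
        self.membrane = 0.0

    def step(self, current):
        # Leak the previous potential, then integrate the new input.
        self.membrane = self.decay * self.membrane + current
        if self.membrane >= self.threshold:
            self.membrane = 0.0  # hard reset after firing
            return 1
        return 0


def complex_lif_gate(spectra, threshold=1.0, decay=0.5):
    """Map complex coefficients to binary spike pairs (real_spike, imag_spike).

    spectra: list over time steps, each a list of complex Fourier coefficients.
    One neuron pair is kept per frequency bin, so state persists across steps.
    """
    n_freq = len(spectra[0])
    real_neurons = [LIFNeuron(threshold, decay) for _ in range(n_freq)]
    imag_neurons = [LIFNeuron(threshold, decay) for _ in range(n_freq)]
    spikes = []
    for coefficients in spectra:
        spikes.append([
            (real_neurons[k].step(c.real), imag_neurons[k].step(c.imag))
            for k, c in enumerate(coefficients)
        ])
    return spikes


# Two time steps, two frequency bins (hypothetical coefficients).
spectra = [
    [0.8 + 0.0j, 2.0 - 1.0j],
    [0.8 + 0.0j, 0.1 + 0.2j],
]
spikes = complex_lif_gate(spectra)
```

Every output is 0 or 1, so downstream layers see only binary events even though the inputs are complex-valued. In the first bin the real part does not fire until the second step, once the leaked potential from the first step has accumulated past the threshold.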

2606.13898 2026-06-15 cs.CV cs.AI New submission

HiLo-Token: Input-Adaptive High-Low Frequency Token Compression for Efficient Image Editing

Haoran You, Yotam Nitzan, Lingzhi Zhang, Yifan Gong, Mang-Tik Chiu, Connelly Barnes, Yan Kang, Yuqian Zhou, Eli Shechtman, Sohrab Amirghodsi

Affiliations: Adobe ART AI Lab; Adobe Research

AI summary: To address the high latency of Diffusion Transformers (DiTs) in image editing, proposes HiLo-Token, an input-adaptive token compression framework that allocates token budget according to spatial frequency, achieving up to 3.13x speedup while preserving generation quality.

Comments: 14 pages, 10 figures, Patent filed

Abstract

Creative image editing tools, such as Photoshop's Remove or Generative Fill buttons, are central to everyday customer use and account for a major share of traffic in Photoshop and Lightroom. However, current generative AI models face significant latency challenges, which become even more pronounced when transitioning from convolution-based U-Nets to Diffusion Transformers (DiTs). In our evaluation on hundreds of representative image editing samples spanning a wide range of mask ratios, the DiT module alone accounts for an average of 73% of the total model latency, even after being distilled from 50 timesteps down to 8 timesteps. To tackle this challenge, we propose $\textbf{HiLo-Token}$, an input-adaptive token compression framework that allocates more token budget to high-frequency, rich-context regions while assigning fewer tokens to low-frequency areas. Specifically, for the editing region specified by the user mask, we retain all tokens within a dilated mask to preserve strong locality and contextual relevance. Outside the editing region, we introduce a simple yet effective high-frequency token selection strategy based on spatial frequency to capture important local details, while using tokens from a 16x downsampled image to represent low-frequency components and preserve the blurry but global structure. Extensive experiments on production-level evaluation data validate the effectiveness of the proposed method, achieving 3.13x, 2.59x, and 1.67x DiT speedups on A100-80GB for image editing tasks across small, medium, and large mask ratio categories with average ratios of 6.38%, 15.92%, and 35.36%, respectively, without any regression in generation quality.
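
The token budget allocation described in this abstract can be sketched on a small token grid: keep every token inside a dilated copy of the edit mask, and outside it keep only the tokens with the highest frequency score. The square dilation, the fixed keep ratio, and the precomputed `freq_score` grid are all assumptions for illustration. The abstract does not say how spatial frequency is measured or how the budget is set, and the 16x-downsampled low-frequency tokens would be appended separately.

```python
def dilate(mask, radius=1):
    """Grow a boolean mask by `radius` cells in every direction (square kernel)."""
    height, width = len(mask), len(mask[0])
    out = [[False] * width for _ in range(height)]
    for i in range(height):
        for j in range(width):
            if not mask[i][j]:
                continue
            for di in range(-radius, radius + 1):
                for dj in range(-radius, radius + 1):
                    ii, jj = i + di, j + dj
                    if 0 <= ii < height and 0 <= jj < width:
                        out[ii][jj] = True
    return out


def select_tokens(freq_score, edit_mask, keep_ratio=0.25, radius=1):
    """Return (tokens kept inside the dilated mask, high-frequency tokens outside).

    freq_score[i][j] is a hypothetical high-frequency score for token (i, j).
    """
    dilated = dilate(edit_mask, radius)
    height, width = len(edit_mask), len(edit_mask[0])
    cells = [(i, j) for i in range(height) for j in range(width)]
    inside = [p for p in cells if dilated[p[0]][p[1]]]
    outside = [p for p in cells if not dilated[p[0]][p[1]]]
    budget = int(len(outside) * keep_ratio)
    ranked = sorted(outside, key=lambda p: freq_score[p[0]][p[1]], reverse=True)
    return inside, ranked[:budget]


# 4x4 token grid with a one-token edit region in the top-left corner.
edit_mask = [[i == 0 and j == 0 for j in range(4)] for i in range(4)]
freq_score = [[i * 4 + j for j in range(4)] for i in range(4)]
inside, high_freq = select_tokens(freq_score, edit_mask)
```

Here 7 of the 16 full-resolution tokens survive (4 inside the dilated mask, 3 outside). A smaller edit mask leaves more of the image eligible for compression, which is consistent with the larger speedups the abstract reports for small mask ratios.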


2606.13896 2026-06-15 cs.CV cs.AI New submission

How do Self-Supervised Remote Sensing Vision Models Transfer to Downstream Tasks?

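Julia Romero, Qin Lv, Morteza Karimzadeh

Affiliation: University of Colorado Boulder

AI summary: Studies how six representative self-supervised geospatial foundation models (GeoFMs) transfer to downstream tasks. Model rankings change with the task and adaptation setting, intermediate-layer features carry more task-relevant information than final-layer features, and adaptation settings such as decoder design can matter as much as the choice of model.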

Abstract

Self-supervised geospatial foundation models (GeoFMs) learn transferable representations from remote sensing data, but their downstream behavior is difficult to characterize. We study six representative GeoFMs spanning joint-embedding, reconstruction, and multimodal pretraining families, and evaluate transfer across classification, regression, and segmentation benchmarks under different label availability and downstream pipelines. We find that model rankings change across tasks and adaptation settings. Layerwise probing shows that, in most cases, task-relevant information is more accessible in intermediate transformer blocks compared to final-layer embeddings, and that GeoFMs exhibit distinct depthwise profiles. In segmentation case studies on PASTIS and Sen1Floods11, downstream adaptation settings such as decoder design and fine-tuning can be as impactful as the choice of GeoFM, and standard dense-prediction heads may be poorly aligned with how GeoFMs organize information over depth. Finally, CKA analysis on case studies shows that fine-tuning does not rewrite GeoFMs uniformly across depth, and the strongest changes are localized to the first linear layer of the MLP in ViT blocks. These results help explain why GeoFM rankings shift across benchmarks and motivate more representation-aware evaluation and adaptation strategies.
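For readers unfamiliar with CKA, the sketch below computes the widely used linear variant between two sets of layer activations. The abstract does not say which CKA variant or kernel the authors use, so treat the linear form and the function name `linear_cka` as assumptions.

```python
import numpy as np

def linear_cka(x, y):
    """Linear CKA between activation matrices x (n, d1) and y (n, d2).

    Rows are the same n inputs passed through two layers (or two checkpoints).
    """
    # Center each feature so the score ignores constant offsets.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, "fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, "fro")
    norm_y = np.linalg.norm(y.T @ y, "fro")
    return cross / (norm_x * norm_y)
```

Comparing a block's activations before and after fine-tuning with such a score gives a value near 1 where the representation is left largely intact and a lower value where fine-tuning has changed it.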


2606.13894 2026-06-15 cs.LG cs.AI cs.CL cs.CV New submission

Gefen: Optimized Stochastic Optimizer

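Nadav Benedek, Tomer Koren, Ohad Fried

Affiliations: Reichman University; Tel Aviv University; Google Research

AI summary: Proposes Gefen, an optimizer that shares second-moment estimates across parameter blocks and quantizes the first moment, cutting AdamW's optimizer memory footprint by about 8x at matching performance and enabling larger batches and higher throughput.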

Abstract

AdamW is a default optimizer for modern deep learning, but its first and second moment states add roughly two parameter-sized buffers to training memory. We propose Gefen, a memory-efficient optimizer that automatically shares second-moment estimates across parameter blocks and quantizes the first moment using a learned codebook, thereby reducing AdamW's memory footprint by ~8x while maintaining the same performance, corresponding to a reduction of 6.5 GiB per billion parameters. The method is motivated by a theoretical result showing that large mixed Hessian entries constrain the ratio of squared gradients toward one, suggesting that Hessian-aligned parameters are natural candidates for sharing second-moment statistics. Since computing Hessians is impractical at scale, Gefen infers block structure from the initial squared gradients, requiring no architecture-specific metadata or hyperparameters beyond AdamW defaults. Gefen learns an exact histogram-based dynamic-programming quantization codebook and reuses the same blocks for first-moment scaling. Across diverse experiments, Gefen achieves the lowest peak optimizer memory among the compared AdamW-like methods while maintaining AdamW-level performance. In FSDP and DDP training, the reduced memory footprint enables larger microbatches and improves throughput significantly over AdamW, providing a practical drop-in replacement with lower memory usage that can increase throughput and enable training larger models or using larger batch sizes. We provide the complete Python implementation, including fused CUDA kernels at https://github.com/ndvbd/Gefen
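The memory figures quoted above can be sanity-checked with a few lines of arithmetic. The sketch assumes both AdamW moment buffers are stored in fp32 (4 bytes per value), which the abstract does not state explicitly; the helper names are illustrative.

```python
GIB = 2 ** 30  # bytes per GiB

def adamw_state_gib(n_params, bytes_per_value=4):
    """Memory of AdamW's two moment buffers, assuming fp32 storage."""
    return 2 * n_params * bytes_per_value / GIB

def saved_gib(n_params, reduction=8.0, bytes_per_value=4):
    """Memory saved if the optimizer state shrinks by `reduction`x."""
    full = adamw_state_gib(n_params, bytes_per_value)
    return full - full / reduction

# One billion parameters: about 7.45 GiB of AdamW state, about 6.52 GiB saved at 8x.
print(round(adamw_state_gib(1e9), 2), round(saved_gib(1e9), 2))
```

Under that assumption an 8x reduction removes seven eighths of roughly 7.45 GiB, which agrees with the quoted 6.5 GiB per billion parameters.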


2606.13886 2026-06-15 cs.RO cs.CV cs.LG New submission

PhysVLA: Towards Physically-Grounded VLA for Embodied Robotic Manipulation

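Namai Chandra, Shriram Damodaran, Lin Wang

Affiliations: IIT Madras; Nanyang Technological University

AI summary: Proposes PhysVLA, a plug-and-play inference-time framework that uses a phase-aware finite-state machine and a selective Euler-Lagrange gate to inject physical constraints into any frozen VLA backbone without retraining, improving success rate, stability, and trajectory efficiency.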

Comments 9 pages, 5 figures, supplementary material included

Abstract

Vision-Language-Action (VLA) models excel at mapping visual inputs and natural language instructions directly to robotic control policies. However, because they are trained primarily to fit behavioural demonstration data, they do not explicitly enforce fundamental physical principles such as rigid-body dynamics or contact constraints. This exposes a critical physics gap: standard temporal smoothing applied on top of single-step or chunked VLAs trades trajectory quality for added failures that short-term memory cannot resolve. To bridge this gap, we introduce PhysVLA (Physics-VLA), a plug-and-play, inference-time framework designed to wrap any frozen VLA backbone without retraining, fine-tuning, or weight access, with less than 1 ms of overhead per control step. PhysVLA intercepts the predicted control action, captures only the simulator or system state, and applies a dual-layered correction: (i) a phase-aware finite-state machine that structures discrete task segments (approach, grasp, transport, and place), and (ii) a selective Euler-Lagrange gate that activates only when a dynamics oracle detects kinodynamic inconsistency. Evaluated across OpenVLA, OpenVLA-OFT, Force-VLA, and Generalist-VLA on LIBERO-Spatial with a 7-DoF Franka Panda, the framework delivers absolute success rate increases of up to 17% and stability increases of up to 19% with no per-task regressions, improves trajectory efficiency by up to 15% across all four backbones, and shows up to a 10x improvement in trajectory jerk robustness on a Robosuite Lift cross-simulator sweep. We further validate the framework on a real Agilex Piper arm with a pick-and-place task, confirming that PhysVLA transfers to physical hardware without retraining, with success-rate improvements of up to 50%, establishing physical awareness as a composable, backbone-agnostic runtime module.


2606.13883 2026-06-15 cs.RO New submission

Guided Diffusion with Distilled Vision-Language Reliability for Aerial Navigation

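Ivan Valuev, Iana Zhura, Valerii Serpiva, Didar Seyidov, Dzmitry Tsetserukou

Affiliations: University of California, Berkeley; Stanford University

AI summary: Proposes a reliability-aware diffusion planner that distills a vision-language model into a lightweight network producing scene-level reliability heatmaps and steers the denoising process around unreliable regions, markedly lowering the obstacle-violation rate in UAV navigation and raising the reliability of traversed regions.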

Abstract

Autonomous UAV navigation is conventionally solved by pipelines that separate perception, mapping, and planning into distinct stages, which propagates errors, accumulates latency, and requires environment-specific retuning. End-to-end generative models remove these interfaces by mapping raw observations directly to trajectories, but inherit a subtle failure mode: trained on clean data, they cannot recognise when an observation is unreliable, and treat degraded regions such as glass, mirrors, and overexposed surfaces as valid evidence for planning. We present a reliability-aware diffusion planner for 3D UAV navigation. It conditions trajectory generation on the observation together with a scene-level reliability heatmap that marks where perception cannot be trusted, produced by a lightweight network that distils the open-vocabulary reasoning of a vision-language model within the real-time planning budget. To generalise to unseen environments without retraining, we steer the denoising process with a differentiable two-stage ESDF cost that treats physical obstacles from depth and virtual obstacles from highly unreliable regions on equal footing. In simulation and on a real quadrotor, our planner produces markedly safer trajectories than a state-of-the-art diffusion baseline, reducing the obstacle-violation rate from 40.3% to 9.6% and raising the mean reliability of traversed regions from 0.588 to 0.925. Ablating the reliability term alone drops mean reliability from 0.898 to 0.783, confirming it as the decisive component, while distillation runs the framework up to 2 times faster than the full vision-language model.


2606.13880 2026-06-15 cs.LG q-fin.RM New submission

A Longitudinal Attribute-Conditioned Neural Network for Modeling Health-State Transition Probabilities in Temporally Irregular Data: The LANTERN Framework

Bright Kwaku Manu, Beckett Sterner, Petar Jevtic

Affiliations: School of Computing and Augmented Intelligence, Arizona State University; School of Life Sciences, Arizona State University; School of Mathematical and Statistical Sciences, Arizona State University

AI summary: Proposes the LANTERN framework, an attribute-conditioned neural network that estimates multi-state transition probabilities from longitudinal health data with irregular observation intervals and covariate histories, and that outperforms benchmarks such as logistic regression on Health and Retirement Study data.

Comments 35 pages, 17 figures

Abstract

Accurate estimation of long-term care transition probabilities is central to disability insurance pricing, reserving, and solvency assessment. Classical actuarial multi-state models commonly rely on Markov, semi-Markov, or proportional-hazard specifications, which provide a direct connection to cohort projection but may be restrictive for irregular longitudinal health data with nonlinear aging patterns and heterogeneous covariate histories. This paper develops a well-calibrated estimator of multi-state transition probabilities for irregular longitudinal health data. The model learns from individual health history, incorporates the time elapsed between observations, and conditions transition probabilities on demographic and socioeconomic attributes. It produces a valid probability distribution over the next observed health state, with four possible states: healthy, mild disability, severe disability, and death. Individual probabilities are aggregated by age group and origin state to form transition matrices compatible with actuarial cohort projection. Using longitudinal data from the Health and Retirement Study, we compare the proposed estimator with logistic regression, gradient-boosted trees, a recurrent neural network, and a last-state persistence benchmark. The evaluation considers probabilistic accuracy, endpoint discrimination and calibration for severe disability and death, risk concentration, and transition matrix error after aggregation. The proposed estimator improves severe disability discrimination relative to logistic regression and gradient-boosted tree benchmarks, maintains strong calibration, and yields the lowest transition matrix error among the evaluated models in the held-out test analysis. Results show that a structured machine learning estimator can support long-term care transition modeling when judged by calibration and projection fidelity, beyond discrimination.
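The aggregation step described above, from individual predictions to cohort-level transition matrices, might look like the sketch below. Simple averaging within each (age group, origin state) cell is an assumption, as are the function and argument names; the abstract only says that individual probabilities are aggregated by age group and origin state.

```python
import numpy as np

# State order assumed for illustration: healthy, mild disability, severe disability, death.
def aggregate_transition_matrices(age_group, origin, probs, n_groups):
    """Average individual next-state distributions per (age group, origin state).

    age_group: (N,) integer age-group index of each observation
    origin:    (N,) integer origin-state index of each observation
    probs:     (N, S) predicted distribution over the next observed state
    Returns an array of shape (n_groups, S, S); cells with no data are NaN.
    """
    n_states = probs.shape[1]
    sums = np.zeros((n_groups, n_states, n_states))
    counts = np.zeros((n_groups, n_states))
    # Unbuffered accumulation so repeated (group, state) pairs all contribute.
    np.add.at(sums, (age_group, origin), probs)
    np.add.at(counts, (age_group, origin), 1)
    with np.errstate(invalid="ignore", divide="ignore"):
        return sums / counts[..., None]
```

Because each input row is a valid probability distribution, every populated row of the resulting matrices also sums to one, which is what a cohort projection needs.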


2606.13878 2026-06-15 cs.RO New submission

AnyGoal: Vision-Language Guided Multi-Agent Exploration for Training-Free Lifelong Navigation

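MoniJesu James, Marcelino Julio Fernando, Miguel Altamirano Cabrera, Dzmitry Tsetserukou

Affiliation: University of California, Berkeley

AI summary: Proposes AnyGoal, a training-free multi-robot architecture that drives frontier-based exploration with a vision-language model (VLM) and coordinates agents through a shared 2D Gaussian Bayesian Value Map (BVM) that accumulates evidence across subtasks. It reaches a 52.4% subtask success rate on GOAT-Bench, 27.5 percentage points above Modular GOAT.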

Comments 17 pages, 3 figures

Abstract

End-to-end navigation policies trained on large simulation corpora degrade sharply when transferred to out-of-distribution scenes, categories, or goal modalities. Modular pipelines such as Modular GOAT are bottlenecked by closed-set object detection recall, while 3D snapshot-memory systems (e.g. 3D-Mem) accumulate dense, view-dependent representations that are heavy to maintain. We present AnyGoal, a training-free multi-robot architecture that places a Vision-Language Model (VLM) at the core of frontier-based exploration and coordinates agents through a shared 2D Gaussian Bayesian Value Map (BVM). The BVM maintains a per-pixel (mu, sigma^2) posterior over goal relevance, updated via precision-weighted fusion of VLM scores through a depth-cone mask, and is never reset between subtasks, yielding lifelong evidence accumulation. Frontiers are ranked by a convex blend of a VLM-as-judge softmax and a Bayesian UCB term on the BVM. A greedy allocator with spatial-separation penalty and commitment hysteresis distributes frontiers across agents without a centralized controller. On the full GOAT-Bench val unseen split (360 episodes, 2,669 subtasks), our dual-agent system achieves 52.4% Subtask SR at 12.7% SPL--state of the art under the strict physical regime (discrete 0.25 m steps, no teleportation, 42 deg HFOV) and a +27.5 pp improvement over Modular GOAT (24.9%). Single-agent AnyGoal achieves 41.9% Subtask SR, showing gains arise from the decision architecture. A four-way perception ablation shows that open-vocabulary detectors shift the dominant failure mode from exploration to goal verification.
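The per-pixel update of the value map can be illustrated with the standard precision-weighted fusion of two Gaussians, followed by a UCB-style score. This is a sketch under assumptions: the observation variance `obs_var`, the exploration weight `beta`, and the function names are illustrative, and the paper's exact update and frontier score may differ.

```python
import numpy as np

def fuse(mu, var, score, obs_var, mask):
    """Precision-weighted Gaussian fusion of a VLM score into the map.

    mu, var: per-pixel posterior mean and variance of goal relevance
    score:   VLM relevance score for the current view (scalar or array)
    obs_var: assumed variance of the VLM score
    mask:    1.0 where the depth cone covers the pixel, 0.0 elsewhere
    """
    precision = 1.0 / var + mask / obs_var
    new_mu = (mu / var + mask * score / obs_var) / precision
    return new_mu, 1.0 / precision

def ucb(mu, var, beta=1.0):
    """Optimistic value: posterior mean plus a bonus for remaining uncertainty."""
    return mu + beta * np.sqrt(var)
```

Pixels outside the mask keep their previous mean and variance, so evidence gathered during earlier subtasks persists when the map is never reset.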


2606.13877 2026-06-15 cs.RO New submission

ContactWorld: What Matters in Vision-Tactile World Models for Contact-Rich Manipulation

Zhiyuan Zhang, Pokuang Zhou, Kaidi Zhang, Adeesh Desai, Temitope Amosa, Davood Soleymanzadeh, Jiuzhou Lei, Minghui Zheng, Yu She

Affiliations: School of Industrial Engineering, Purdue University; Department of Mechanical Engineering, Texas A&M University

AI summary: Across 12 contact-rich manipulation tasks, finds that spatially structured and temporally continuous representations such as point clouds substantially improve planning success, and that the benefit of tactile sensing depends on cross-modal representation compatibility.

Comments 32 pages, 12 figures, supplementary material included

Abstract

Contact-rich manipulation requires world models to reason over complex contact dynamics from multimodal sensory observations. However, it remains unclear which representation properties fundamentally support stable long-horizon planning in contact-rich settings. In this paper, we present ContactWorld, a benchmark and systematic empirical study of vision-tactile world models spanning 12 contact-rich manipulation tasks, including insertion, disassembly, screwing, and exploratory interaction. Across extensive experiments, we find that representations that are both spatially structured and temporally continuous consistently achieve the strongest planning performance. In particular, point-cloud observations improve average planning success rates from 20.7% with wrist-view observations and 22.0% with front-view observations to 32.1%. We further find that the effectiveness of tactile sensing depends critically on cross-modal representation compatibility rather than modality scaling alone. Combining point-cloud observations with tactile force-field representations, which preserve richer spatial structure and interaction dynamics, further improves performance to 36.1%, yielding the strongest overall planning performance across all evaluated tasks. Moreover, tactile sensing becomes increasingly important under long-horizon planning objectives, where compounding prediction errors and contact uncertainty accumulate over time. Together, these findings highlight the importance of representation structure, multimodal compatibility, and long-horizon robustness in vision-tactile world models for contact-rich robotic manipulation.


2606.13873 2026-06-15 cs.LG cs.CL New submission

Natively Unlearnable Large Language Models

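Gaurav R. Ghosal, Pratyush Maini, Aditi Raghunathan

Affiliation: University of California, Berkeley

AI summary: Proposes NULLs, models that separate the contributions of individual data sources using a shared backbone plus sparsely activated sinks, so that a source can be unlearned without gradient updates. Wikipedia experiments validate single-article unlearning, and a Harry Potter case study shows robustness.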

Abstract

Unlearning aims to remove the influence of specific training data sources, but this has proved challenging because the contributions of different sources are entangled within the model. Isolating source contributions to disjoint parameters makes removal easier, though it obstructs joint learning across sources. We propose NULLs (Natively Unlearnable LLMs), a model class that satisfies the two opposing goals of isolating source-specific contributions and learning jointly across sources, by training a set of shared backbone neurons alongside a pool of sparsely activated sinks. During training, information specific to a source naturally concentrates in its sinks while information shared across sources accumulates in the backbone. A source is then unlearned at deployment by disabling its corresponding sinks, with no gradient updates and no access to the retained data. We show that NULLs scales to Wikipedia's ~6M articles, isolating each as an independent source. Unlearning a single article removes knowledge specific to it while preserving facts shared with semantically related articles, closely matching retraining from scratch. We note that unlearning with NULLs is also robust: in a case study of unlearning the Harry Potter books, NULLs resists both adversarial extraction and relearning that reverses post-hoc unlearning. Finally, NULLs preserves general language capabilities, matching a standard transformer on downstream benchmarks. Together, these results suggest that source-level unlearning need not be an afterthought. It can be built natively into LLM training while retaining the benefits of shared representation learning.


2606.13872 2026-06-15 cs.CV New submission

Avatar V: Scaling Video-Reference Avatar Video Generation

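Benjamin Liang, Ce Chen, Desmond Lin, Ivan Somov, Jiajun Zhao, Jiewei Yuan, Jingfeng Zhang, Junhao Huang, Nik Nolte, Pedram Haqiqi, Penghan Wang, Rong Yan, Rui Zhang, Sam Prokopchuk, Sivan Wang, Viktor Goriachko, Yi Ren, Yuanming Li, Yutao Chen, Zhenhui Ye, Zhibin Hong, Zilong Nie, Zujin Guo

Affiliation: HeyGen Research

AI summary: Proposes Avatar V, a framework that models identity by conditioning on a reference video, using Sparse Reference Attention for linear-complexity conditioning on long references together with a motion representation stream and an identity-aware super-resolution refiner. It generates 1080p video of unlimited duration and surpasses existing methods in identity preservation, lip synchronization, and quality.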

Comments 31 pages, 15 figures. All contributors are listed in alphabetical order by first name

Abstract

Generating avatar videos that are not merely visually similar to a target individual but behaviorally recognizable, faithfully reproducing their talking rhythm, gestural tendencies, and expression dynamics, remains an open challenge. Existing methods predominantly condition on single static images, which provide insufficient identity information and cannot capture dynamic motion traits, while standard pixel-level objectives underserve the perceptually critical facial regions that determine avatar fidelity. We present Avatar V, a production-scale framework that addresses these limitations through video-reference-conditioned identity modeling. Rather than compressing identity into fixed-size embeddings, the model conditions directly on the full token sequence of a reference video, learning to reproduce both static identity attributes (facial geometry, skin texture) and dynamic behavioral patterns (talking rhythm, micro-expressions) through attention over the reference context. We introduce Sparse Reference Attention, an asymmetric mechanism achieving linear-complexity conditioning on arbitrarily long references; a motion representation stream enabling closed-loop talking style transfer; and an identity-aware super-resolution refiner inheriting the full reference conditioning. These are supported by a data engine curating 100M+ training clips from 50M raw videos, and a five-stage training pipeline with flow matching pre-training, personality fine-tuning, two-phase distillation (>10x acceleration), and RLHF alignment, deployed across thousands of GPUs. Avatar V generates 1080p videos of unlimited duration, achieving state-of-the-art identity preservation, lip synchronization, and generation quality on our cross-scene benchmark, consistently outperforming leading systems including Seedance 2.0, Kling O3 Pro, Veo 3.1, and OmniHuman 1.5 in both automated metrics and human evaluation.


2606.13871 2026-06-15 cs.AI cs.DB New submission

Hyperdimensional computing for structured querying on tabular data embeddings

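Sebastián Bugedo, Stijn Vansummeren

Affiliation: UHasselt, DSI Diepenbeek

AI summary: To address the lack of interpretable similarity scores in tabular embeddings, proposes a framework based on hyperdimensional computing (HDC) that uses the Holographic Reduced Representations model for structured querying, derives closed-form expected similarities for equality and non-equality predicates, and supports reliable zero-match detection.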

Comments 15 pages with appendices. 8 figures. Under review

Abstract

Tabular data embeddings have become a cornerstone of data profiling and data integration pipelines, enabling tasks such as entity annotation and resolution; schema matching; column type detection; and table search, among others. Existing approaches embed rows, columns, or entire tables into a vector space and rely on nearest-neighbor search to retrieve candidate matches. A fundamental limitation of current embedding methods is the lack of interpretable similarity scores: the concrete similarity value between a query and its nearest neighbour carries no intrinsic meaning, making it impossible to determine whether that neighbour is a true match or simply the least-dissimilar item in a corpus that contains no valid answer. This inability to set principled thresholds for retrieval undermines practical deployment, particularly for zero-match detection. We investigate the use of HyperDimensional Computing (HDC), specifically the Holographic Reduced Representations (HRR) model, as a framework for tabular row embeddings when the retrieval task corresponds to answering structured select-project queries in vector space. Exploiting the algebraic properties of HDC operations, we derive closed-form expected similarity values for both equality and non-equality retrieval predicates, which converge to interpretable values as dimensionality increases, and use these to identify suitable retrieval thresholds. We evaluate HDC against EmbDI, a graph-based baseline, on two real-world datasets across varying table sizes and predicate lengths. Our results show that HDC matches or outperforms EmbDI for row retrieval across all configurations, handles non-equality predicates more robustly, and achieves perfect attribute projection accuracy at sufficient dimensionality -- while uniquely enabling reliable identification of zero-match predicates through its principled thresholds.
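To make the HRR operations concrete, the sketch below encodes a two-attribute row as a sum of role-value bindings and then projects one attribute back out. It uses the textbook HRR operators (circular convolution for binding, the involution as approximate inverse); the vector dimension, the attribute names, and the helper names are illustrative assumptions, and the paper's exact row encoding may differ.

```python
import numpy as np

def bind(a, b):
    """HRR binding: circular convolution, computed via the FFT."""
    return np.fft.irfft(np.fft.rfft(a) * np.fft.rfft(b), n=len(a))

def approx_inverse(a):
    """HRR approximate inverse (involution): a*[i] = a[-i mod d]."""
    return np.concatenate(([a[0]], a[:0:-1]))

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

d = 2048
rng = np.random.default_rng(0)
# Random role and value vectors with elements drawn from N(0, 1/d).
city, age, paris, rome, thirty = rng.normal(0.0, 1.0 / np.sqrt(d), size=(5, d))

# Row {city: paris, age: thirty} as a sum of role-value bindings.
row = bind(city, paris) + bind(age, thirty)

# Projection: unbind the `city` role and compare with candidate values.
decoded = bind(row, approx_inverse(city))
sim_match, sim_other = cosine(decoded, paris), cosine(decoded, rome)
```

The similarity to the stored value is far from zero while the similarity to an unrelated value concentrates around zero as `d` grows, which is the kind of separation that makes a principled retrieval threshold possible.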


2606.13870 2026-06-15 cs.CV cs.AI cs.LG New submission

Mirage Probes: How Vision Models Fake Visual Understanding

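Daniel Ben-Levi, Judah Goldfeder, Weiliang Zhao, Raz Lapid, Amit LeVi, Allen G. Roush, Ravid Shwartz-Ziv, Hod Lipson

Affiliations: Columbia University; Intuit; Technion; Thoughtworks; New York University

AI summary: Proposes Mirage Probes, a contrastive probing framework showing that the mirage behavior of vision-language models, answering image-based questions when no image is given, comes in two forms, textual biases and spurious images, and that the latter requires intervention at the representation level.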

Abstract

Vision-language models (VLMs) can answer image-based questions confidently, and often correctly, even when no image is provided. This mirage behavior inflates benchmark scores without reflecting visual grounding. Prior work treats this as a single failure mode. We argue it is two. Using Mirage Probes, a contrastive probing framework that pairs paraphrased question variants with matched mirage and non-mirage labels on the same image, we show that mirage behavior is linearly decodable from internal activations across residual stream, MLP, post-attention, and attention-head sites in two open-source VLMs. We demonstrate that a Naive Bayes text baseline cannot recover this signal, ruling out surface lexical confounds. Cross-benchmark separability patterns, together with a novel Prior Harnessing Index (PHI) measuring how much a model can answer from text alone, expose two distinct regimes: textual biases, where the model answers from language priors without engaging visual representations, and spurious images, where it constructs false visual content in latent space and answers as if grounded. The distinction has direct mitigation consequences: text-distribution cleaning can address the first regime but cannot reach the second, since spurious-image mirages live in the model's visual representations rather than its text. Faithful visual grounding will require interventions at the representational level.


2606.13867 2026-06-15 cs.LG New submission

Muon$^p$: Muon with Fractional Spectral Powers

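Yihe Dong, Will Sawin

Affiliation: Princeton University

AI summary: Proposes the Muon$^p$ optimizer, which interpolates between Muon and gradient descent through fractional spectral-power updates $US^pV^\top$, proves that these updates cannot be computed by a fixed univariate polynomial iteration, and derives low-degree bivariate recurrences that approximate them while keeping the matrix-multiplication-only structure and improving finetuning performance.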

Abstract

Muon is an increasingly widely used optimizer that replaces a gradient $G=USV^\top$ with its polar factor $UV^\top$, thereby flattening the singular spectrum. However, full flattening discards singular-value information that may matter for adaptation. We introduce Muon$^p$, a Muon-style optimizer that instead uses fractional spectral-power updates $US^pV^\top$ for rational $p\in(0,1)$, interpolating between Muon and gradient descent. To make it practical, we prove that fractional spectral powers cannot be computed by any fixed univariate polynomial iteration, and furthermore derive low-degree odd bivariate recurrences that approximate $US^pV^\top$ using only matrix multiplications, preserving Muon's matrix-multiplication-only structure and compute complexity. We show that Muon$^p$ maximizes the linear improvement in loss under the Schatten $q$-norm for $q=1+\frac{1}{p}$. Empirically, Muon$^p$ is especially effective for finetuning: on billion-scale models, Muon$^p$ improves validation perplexity and downstream task performance. We further analyze when Muon$^p$ is less suitable, through the lens of spectral geometry. Our results reveal important insights on when preserving the singular spectrum can bring significant gains, and introduce a principled way to achieve them.
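The update $US^pV^\top$ itself is easy to state with an explicit SVD, which makes the interpolation visible. The sketch below is only a reference computation for small matrices: the paper's point is to approximate this quantity with matrix-multiplication-only recurrences, which are not reproduced here, and the function name `spectral_power` is an assumption.

```python
import numpy as np

def spectral_power(g, p):
    """Reference computation of U S^p V^T for a gradient matrix g.

    p = 1 returns g itself (plain gradient direction);
    p = 0 returns the polar factor U V^T (the Muon-style update).
    """
    u, s, vt = np.linalg.svd(g, full_matrices=False)
    # Scale the columns of U by s**p, then recombine with V^T.
    return (u * s ** p) @ vt
```

For intermediate values such as `p = 0.5` the singular values are compressed toward one another without being flattened completely, so part of the spectral information in the gradient is retained.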


2606.13862 2026-06-15 cs.LG cs.AI cs.CL New submission

SuperThoughts: Reasoning Tokens in Superposition

Zheyang Xiong, Shivam Garg, Max Yu, Vaishnavi Shrivastava, Haoyu Zhao, Anastasios Kyrillidis, Dimitris Papailiopoulos

Affiliations: University of Wisconsin-Madison; Microsoft Research; Independent; Princeton University; Rice University

AI summary: Proposes SuperThoughts, which compresses pairs of consecutive CoT tokens into a single latent representation and decodes them with a multi-token prediction module, preserving token-level supervision during training while doubling inference throughput. It shortens CoT by about 20-30% with minimal accuracy loss.

AI中文摘要

长链思维（CoT）推理提升了LLM的问题解决能力，但由于顺序生成令牌导致计算成本高昂。尽管近期工作探索在连续潜在空间中进行推理以绕过离散令牌生成，但这些方法常面临训练稳定性问题，且因缺乏监督信号而难以扩展到复杂的长程任务。我们提出SuperThoughts，将连续的CoT令牌对压缩为单一潜在表示，并通过轻量级多令牌预测（MTP）模块每步解码两个令牌。这既在训练时保留了离散令牌监督，又在推理时使吞吐量翻倍。我们在Qwen2.5-Math-1.5B-Instruct、Qwen2.5-Math-7B-Instruct、Qwen2.5-Math-14B-Instruct上进行微调，并在MATH500、AMC、OlympiadBench和GPQA-Diamond上评估。通过基于置信度的自适应机制（在不确定时回退到标准解码），SuperThoughts实现了约20-30%的CoT长度缩减，同时保持精度，在大多数任务上仅下降1-2个准确率点。

英文摘要

Long Chain-of-Thought (CoT) reasoning improves LLM problem-solving but is computationally expensive due to sequential token generation. While recent works explore reasoning in continuous latent spaces to bypass discrete token generation, they often struggle with training stability and fail to scale to complex, long-horizon tasks due to a lack of supervision signal. We propose SuperThoughts, which compresses pairs of consecutive CoT tokens into single latent representations and decodes two tokens per step via a lightweight Multi-Token Prediction (MTP) module. This preserves discrete token supervision at training time while doubling throughput at inference time. We finetune Qwen2.5-Math-1.5B-Instruct, Qwen2.5-Math-7B-Instruct, and Qwen2.5-Math-14B-Instruct, and evaluate on MATH500, AMC, OlympiadBench, and GPQA-Diamond. With a confidence-based adaptive mechanism that falls back to standard decoding when uncertain, SuperThoughts achieves $\sim$20--30\% CoT length reduction while maintaining accuracy with minimal degradation (a 1--2 point accuracy drop on most tasks).
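
The confidence-based fallback mentioned above can be illustrated with a toy decoding loop. This is only a sketch under stated assumptions: `adaptive_decode`, `toy_proposer`, the confidence score, and the threshold value are hypothetical names and choices, not the paper's implementation or any library API.

```python
from typing import Callable, List, Tuple

# Hypothetical interface: given the tokens so far, return the next two tokens
# and a confidence score in [0, 1] for the second one.
Proposer = Callable[[List[str]], Tuple[str, str, float]]


def adaptive_decode(propose: Proposer, target_len: int,
                    threshold: float = 0.8) -> Tuple[List[str], int]:
    """Emit two tokens per step when confident, otherwise fall back to one."""
    out: List[str] = []
    steps = 0
    while len(out) < target_len:
        first, second, conf = propose(out)
        steps += 1
        out.append(first)  # the first token is always kept
        if conf >= threshold and len(out) < target_len:
            out.append(second)  # keep the second token only when confident
    return out, steps


def toy_proposer(tokens_so_far: List[str]) -> Tuple[str, str, float]:
    """Deterministic stand-in for a model (assumption for illustration)."""
    i = len(tokens_so_far)
    # Pretend the model is unsure about every second token whose index
    # is a multiple of 3.
    conf = 0.5 if (i + 1) % 3 == 0 else 0.9
    return f"t{i}", f"t{i + 1}", conf


tokens, steps = adaptive_decode(toy_proposer, target_len=8)
print(tokens, steps)
```

In this toy run, 8 tokens are produced in 5 decoding steps instead of 8. Setting the threshold above every confidence value recovers one token per step, which corresponds to standard decoding.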


2606.13861 2026-06-15 cs.CV 新提交

Temporal Backtracking Search for Test-time Generative Video Reasoning

时间回溯搜索：测试时生成式视频推理

Sejoon Jun, Zheng Ding, Huangyuan Su, Weirui Ye, Yilun Du

发表机构 * Northeastern University（东北大学）； Independent Researcher（独立研究者）； Harvard & Kempner（哈佛大学与肯普纳研究所）； MIT（麻省理工学院）

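AI summary: Proposes Temporal Backtracking Search (TBS), which shifts the search space to the temporal axis, localizes failures in a generated rollout, and restarts generation from verified prefixes, substantially improving the test-time reasoning of video models.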

AI中文摘要

虽然测试时扩展已彻底改变了大型语言模型的推理能力，但生成式视频推理仍受限于单次生成范式。我们证明，在去噪步骤上进行搜索无法挽救逻辑有缺陷的生成结果，因为空间轨迹在扩散过程的早期就已确定。根级最佳N（BoN）采样同样低效：推理错误在时间轴上早期聚集，而重新采样盲目丢弃已验证的先前进展。为了解锁视频模型的有效测试时扩展，我们引入了时间回溯搜索（TBS），它将搜索空间转移到时间轴。TBS通过三个核心机制将视频生成转化为迭代的生成-验证-重启循环：（1）可变K条件化，从任意干净的起始前缀恢复生成；（2）时间过程验证，定位失败并提取有效的重启锚点；（3）基于前缀的搜索，将计算重新分配给扩展正确轨迹，而不是根重采样。在算法、导航和机器人领域，TBS帕累托优于同等预算的BoN。在严格的分布外设置中，单次生成崩溃（BoN为0.7%），TBS达到22.7%，每个解决的片段都来自重启的分支。最终，TBS揭示了视频模型的局部推理能力远超单次生成所显示的水平，提供了一个可扩展的测试时框架来释放这种能力。

英文摘要

While test-time scaling has revolutionized reasoning in large language models, generative video reasoning remains bottlenecked by a single-shot paradigm. We demonstrate that searching over denoising steps cannot rescue logically flawed rollouts because spatial trajectories commit early in the diffusion process. Root-level Best-of-N (BoN) sampling is similarly inefficient: reasoning errors cluster early in the temporal axis, and resampling blindly discards verified upstream progress. To unlock effective test-time scaling for video models, we introduce Temporal Backtracking Search (TBS), which shifts the search space to the temporal axis. TBS transforms video generation into an iterative generate-verify-restart loop via three core mechanisms: (1) variable-K conditioning to resume generation from arbitrary clean prefixes; (2) temporal process verification to localize failures and extract valid restart anchors; and (3) prefix-based search to reallocate compute toward extending correct trajectories rather than root resampling. Across algorithmic, navigation, and robotics domains, TBS Pareto-dominates matched-budget BoN. In a strict out-of-distribution setting where one-shot generation collapses (0.7% for BoN), TBS achieves 22.7%, with every solved episode stemming from a restarted branch. Ultimately, TBS reveals that the local reasoning competence of video models far exceeds what single-shot rollouts indicate, providing a scalable test-time framework to unlock it.
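
The generate-verify-restart loop described above can be sketched on a toy problem. Everything here is an illustrative assumption: frames are modeled as booleans (correct or incorrect), `generate` and `first_failure` are hypothetical stand-ins for a video model and a process verifier, and the per-frame success probability is arbitrary.

```python
import random
from typing import List, Optional, Tuple


def generate(prefix: List[bool], total: int, rng: random.Random,
             p_ok: float) -> List[bool]:
    """Toy generator: keep the clean prefix, sample the remaining frames."""
    return prefix + [rng.random() < p_ok for _ in range(total - len(prefix))]


def first_failure(frames: List[bool]) -> Optional[int]:
    """Toy process verifier: index of the first incorrect frame, or None."""
    for i, ok in enumerate(frames):
        if not ok:
            return i
    return None


def backtracking_search(total: int, budget: int, rng: random.Random,
                        p_ok: float = 0.7
                        ) -> Tuple[bool, List[bool], int, List[int]]:
    """Restart from the longest verified prefix instead of from the root."""
    prefix: List[bool] = []
    prefix_lengths: List[int] = []
    for call in range(1, budget + 1):
        frames = generate(prefix, total, rng, p_ok)
        fail = first_failure(frames)
        if fail is None:
            return True, frames, call, prefix_lengths
        prefix = frames[:fail]  # keep verified progress as the restart anchor
        prefix_lengths.append(len(prefix))
    return False, prefix, budget, prefix_lengths


solved, frames, calls, lengths = backtracking_search(12, 50, random.Random(0))
print(solved, calls, lengths)
```

Because each restart keeps the verified prefix, the recorded prefix lengths never shrink. Root-level resampling would instead need a single rollout in which all frames are correct at once.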


2606.13856 2026-06-15 cs.RO 新提交

Output-Level Regularization Eliminates the Seed Lottery in Single-GPU VLA Fine-Tuning

输出级正则化消除单GPU VLA微调中的种子彩票

Jeffrin Sam, Dzmitry Tsetserukou

发表机构 * Skolkovo Institute of Science and Technology (Skoltech)（斯科尔科沃科学技术研究所）

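AI summary: Identifies a seed lottery when fine-tuning VLA-JEPA on a single GPU (one random seed silently lowers the success rate by about 29 percentage points), attributes it to output collapse, and reports that output-level regularization (VICReg, Dropout, or a halved learning rate) eliminates the catastrophic seeds in the runs tested.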

Comments 10 pages, 8 figures, submitted to CoRL 2026

AI中文摘要

在单GPU上微调视觉-语言-动作模型（VLA-JEPA）本应简单：加载预训练检查点、运行训练、部署。但存在一个隐藏风险。使用相同数据和架构、不同随机种子运行同一微调代码十三次，其中十二次产生的机器人成功率为91-94%，而一次运行无声地降至65.2%：29个百分点的差距，无错误消息、无警告，且无法预测哪个种子会失败。我们称此为种子彩票。我们将原因追溯到输出坍塌：动作预测器学会产生几乎相同的输出，无论机器人看到什么。现有的权重级方法（L2、EWC）在结构上对此坍塌视而不见——它们惩罚权重变化，但坍塌发生在权重可自由移动而不影响输出的方向上，我们通过Jacobian零空间形式化了这一差距。在7种方法×最多13个种子×3个LIBERO基准测试中，三种输出级正则化器——VICReg（n=12个种子）、Dropout（n=4）和减半学习率（n=5）——各自消除了所有灾难性种子（0/21次联合坍塌 vs. 基线1/13次；F(12,11)=28.7，p<0.001），而权重级方法（L2、EWC）保留了彩票。最简单的修复是在优化器配置中更改一个数字。

英文摘要

Fine-tuning a vision-language-action model (VLA-JEPA) on a single GPU should be simple: load a pretrained checkpoint, run training, deploy. There is a hidden danger. Run the same fine-tuning code thirteen times -- same data, same architecture, different random seed -- and twelve runs produce a robot succeeding 91--94% of the time, while one run silently degrades to 65.2%: a 29 pp gap with no error message, no warning, and no way to predict which seed will fail. We call this the seed lottery. We trace the cause to output collapse: the action predictor quietly learns to produce nearly identical outputs regardless of what the robot sees. Existing weight-level methods (L2, EWC) are structurally blind to this collapse -- they penalize weight changes, but collapse occurs in directions weights can move freely without affecting outputs, a gap we formalize via the Jacobian null-space. Across 7 methods x up to 13 seeds x 3 LIBERO benchmarks, three output-level regularizers -- VICReg (n=12 seeds), Dropout (n=4), and a halved learning rate (n=5) -- each eliminate every catastrophic seed (0/21 combined collapses vs. 1/13 Baseline; F(12,11)=28.7, p<0.001), while weight-level methods (L2, EWC) preserve the lottery. The simplest fix is changing one number in your optimizer config.
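
To make the idea of an output-level penalty concrete, the sketch below computes the variance hinge term from the published VICReg loss on a small batch of outputs. How the paper applies VICReg to the action predictor is not stated in the abstract, so the function name `variance_penalty`, the batch values, and the choices `gamma=1.0` and `eps=1e-4` are assumptions for illustration.

```python
import math
import statistics
from typing import List


def variance_penalty(batch: List[List[float]], gamma: float = 1.0,
                     eps: float = 1e-4) -> float:
    """VICReg-style variance term: mean over output dimensions of
    max(0, gamma - sqrt(Var + eps)), using the sample variance."""
    columns = list(zip(*batch))  # one tuple of values per output dimension
    hinge = [
        max(0.0, gamma - math.sqrt(statistics.variance(col) + eps))
        for col in columns
    ]
    return sum(hinge) / len(hinge)


# Collapsed predictor: the same output for every input in the batch.
collapsed = [[0.5, -0.2], [0.5, -0.2], [0.5, -0.2], [0.5, -0.2]]
# Healthy predictor: outputs vary across the batch.
diverse = [[-2.0, 1.5], [0.0, -1.5], [2.0, 3.0], [1.0, -3.0]]

print(variance_penalty(collapsed))  # close to gamma: collapse is penalized
print(variance_penalty(diverse))    # zero: enough spread in every dimension
```

The penalty depends only on the outputs, so it reacts to collapse regardless of which weights moved. A weight-level penalty such as L2 measures distance in weight space and has no direct view of the output spread.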


2606.13852 2026-06-15 cs.CL 新提交

Hybrid Classical-Quantum Variational Autoencoder for Neural Topic Modeling

用于神经主题建模的混合经典-量子变分自编码器

Ivan Kankeu

发表机构 * Ivan Kankeu

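AI summary: Proposes a hybrid classical-quantum variational autoencoder that embeds parameterized quantum circuits in the inference network and uses a modified Gaussian Softmax posterior to reduce qubit requirements, outperforming existing neural topic models on the AgNews dataset.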

AI中文摘要

神经主题模型能够实现可扩展的语义发现，但它们与量子硬件的集成在很大程度上仍未被探索。我们提出了一种概念验证的混合经典-量子变分自编码器（VAE）用于主题建模，在VAE推理网络中嵌入参数化量子电路，同时保留经典的主题-词解码器。为了解决量子硬件的资源限制，我们提出了一种改进的高斯Softmax后验，将潜在空间维度与要提取的主题数量解耦，使模型能够在低资源（10量子比特）的量子设备上运行。在AgNews数据集上，混合VAE优于最先进的神经主题模型（NTMs），达到了0.71的$C_v$连贯性分数和0.20的NPMI分数，同时保持了高主题多样性。为了比较，我们还构建了一个完全经典的变体，它在AgNews上也优于最先进的模型，并在潜在空间中表现出清晰的类别分离。这些结果表明，即使在NISQ时代的设备上，混合VAE在计算上也是可行的，并且代表了量子增强主题建模的一个有前景的方向。

英文摘要

Neural topic models enable scalable semantic discovery, but their integration with quantum hardware remains largely unexplored. We present a proof-of-concept hybrid classical-quantum variational autoencoder (VAE) for topic modeling, embedding parameterized quantum circuits within the VAE inference network while retaining a classical topic-word decoder. To address the resource constraints of quantum hardware, we propose a modified Gaussian Softmax posterior that decouples latent space dimensionality from the number of topics to be extracted, enabling the model to operate with a low-resource 10-qubit quantum device. On the AgNews dataset, the hybrid VAE outperforms state-of-the-art neural topic models (NTMs), reaching a $C_v$ coherence score of 0.71 and an NPMI score of 0.20 while preserving high topic diversity. For comparison, we also construct a fully classical variant, which also outperforms state-of-the-art models on AgNews and exhibits clear class separation in the latent space. These results demonstrate that hybrid VAEs are computationally viable even on NISQ-era devices and represent a promising direction for quantum-enhanced topic modeling.
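
One way to see how a latent space can be decoupled from the number of topics is a Gaussian Softmax construction with a linear map in between. The sketch below is a classical toy under stated assumptions: it does not reproduce the paper's modified posterior or its quantum circuit, and the dimensions (a 10-dimensional latent, 50 topics), the random weights, and the function names are hypothetical.

```python
import math
import random
from typing import List


def softmax(xs: List[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def sample_topic_proportions(mu: List[float], log_var: List[float],
                             weights: List[List[float]],
                             rng: random.Random) -> List[float]:
    """Sample z ~ N(mu, sigma^2) by reparameterization, map it linearly to
    one logit per topic, and normalize with softmax."""
    z = [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
         for m, lv in zip(mu, log_var)]
    logits = [sum(w * zi for w, zi in zip(row, z)) for row in weights]
    return softmax(logits)


rng = random.Random(0)
latent_dim, num_topics = 10, 50  # assumed sizes; latent_dim != num_topics
mu = [0.0] * latent_dim
log_var = [0.0] * latent_dim
# Hypothetical projection from the small latent space to topic logits.
weights = [[rng.gauss(0.0, 1.0) for _ in range(latent_dim)]
           for _ in range(num_topics)]

theta = sample_topic_proportions(mu, log_var, weights, rng)
print(len(theta), sum(theta))
```

The sampled vector has one entry per topic and sums to one, even though only a 10-dimensional Gaussian was sampled. In the hybrid model, the parameters of that Gaussian would come from the inference network that contains the quantum circuit.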
