arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.05553 2026-05-08 cs.LG

FedeKD: Energy-Based Gating for Robust Federated Knowledge Distillation under Heterogeneous Settings

Quang-Huy Nguyen, Jiaqi Wang, Wei-shinn Ku

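Affiliations: Department of Computer Science and Software Engineering

AI summary: FedeKD introduces sample-wise trust estimation to make federated knowledge distillation more robust in heterogeneous settings without requiring additional public data; experiments show that it reduces negative transfer while maintaining predictive performance.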

English abstract

Federated learning (FL) operates in heterogeneous environments, where variations in data distributions and asymmetric model design often result in negative transfer. While federated knowledge distillation (FKD) avoids direct model parameter sharing, existing methods typically rely on public datasets or assume that transferred knowledge is uniformly reliable, which limits their robustness in practice. This paper presents FedeKD, a reliability-aware FKD framework that makes sample-wise trust estimation an explicit component of knowledge transfer, without relying on additional public data. Each client maintains a high-capacity private model for local learning and a lightweight shared proxy model for cross-client knowledge exchange. During training, proxy models are aggregated on the server to form a global proxy, which is then used to guide updates of the private models. At the core of FedeKD is an energy-based gating mechanism that converts task-specific private-proxy disagreement into sample-wise trust weights for backward distillation. This mechanism enables sample-wise weighting of knowledge transfer, where the proxy model contributes more to reliable samples while down-weighting unreliable ones. Extensive experiments on six real-world datasets demonstrate that FedeKD significantly reduces negative transfer under heterogeneous settings while maintaining strong predictive performance.
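The gating idea can be illustrated with a minimal sketch. The abstract does not give the energy function or the weight formula, so this sketch assumes that disagreement is the KL divergence between the private and proxy predictive distributions and that the trust weight is exp(-disagreement / temperature). All function names and the `temperature` parameter are hypothetical.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    # KL(p || q) for two discrete distributions of equal length.
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def trust_weight(private_logits, proxy_logits, temperature=1.0):
    # Assumed gate: private-proxy disagreement is treated as an energy
    # and mapped to a weight in (0, 1]; low disagreement gives a weight near 1.
    energy = kl_divergence(softmax(private_logits), softmax(proxy_logits))
    return math.exp(-energy / temperature)

def gated_distillation_loss(batch_private, batch_proxy, temperature=1.0):
    # Per-sample KL(proxy || private), with the proxy as teacher,
    # scaled by the per-sample trust weight and averaged over the batch.
    total = 0.0
    for priv, prox in zip(batch_private, batch_proxy):
        w = trust_weight(priv, prox, temperature)
        total += w * kl_divergence(softmax(prox), softmax(priv))
    return total / len(batch_private)
```

Under these assumptions, samples on which the two models strongly disagree contribute little to the distillation term, which is one plausible reading of "down-weighting unreliable ones".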

2605.05549 2026-05-08 cs.CV

A Novel Graph-Regulated Disentangling Mamba Model with Sparse Tokens for Enhanced Tree Species Classification from MODIS Time Series

Motasem Alkayid, Zhengsen Xu, Saeid Taleghanidoozdoozan, Yimin Zhu, Megan Greenwood, Quinn Ledingham, Zack Dewis, Mabel Heffring, Naser El-Sheimy, Lincoln Linlin Xu

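Affiliations: Department of Geomatics Engineering, University of Calgary, Canada; Department of Geography, Faculty of Arts, The University of Jordan

AI summary: This paper proposes GDS-Mamba, a graph-regulated disentangled sparse Mamba architecture that improves large-scale context modeling and feature extraction efficiency; experiments on MODIS data from two Canadian provinces show high classification accuracy.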

English abstract

Although tree species classification from Moderate Resolution Imaging Spectroradiometer (MODIS) time series data is critical for supporting various environmental applications, it is a challenging task due to several key difficulties: the subtle signature differences among tree species, strong spatial-spectral-temporal information coupling, and the difficulty of modeling large-scale topological context information. To better address these challenges, this paper presents a novel Graph-regulated Disentangled Sparse Mamba model (GDS-Mamba) for enhanced tree species classification, with the following contributions. (1) First, to improve large-scale context modeling, we design a mini-batch graph-regulated approach that explicitly explores topological correlation effects among input images. (2) Second, to disentangle the high-dimensional spatial-spectral-temporal information coupling for improved feature extraction, we propose a novel disentangling Mamba architecture tailored for capturing independent spatial patterns, spectral signatures, and temporal phenology behaviors in MODIS time series. (3) Third, to improve efficiency and subtle feature learning, we design novel sparse token approaches that adaptively learn the optimum subset of tokens to better address the correlation decay problem that bottlenecks standard Mamba models. Extensive experiments using large-scale annual MOD13Q1 data across two Canadian provinces (i.e., Alberta and Saskatchewan) achieved an overall accuracy of 93.94% in Alberta and 80.19% in cross-provincial evaluations, outperforming twelve state-of-the-art classification models.

2605.05546 2026-05-08 cs.AI

SPARK: Self-Play with Asymmetric Reward from Knowledge Graphs

Hyobin Park, Taeseop Kim, Dong-Geol Choi

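Affiliations: Hanbat National University, South Korea

AI summary: SPARK builds a unified knowledge graph to enable self-play learning over multi-document scientific literature, generating relational reasoning questions from graph paths and computing rewards from structured facts; it outperforms conventional self-play baselines.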

English abstract

Self-play reinforcement learning has shown strong performance in domains with formally verifiable structure, such as mathematics and coding, where both problem generation and reward computation can be grounded in explicit rules. Extending this paradigm to scientific literature is more challenging: the relationships among multi-modal elements within and across documents are rarely made explicit in text, which makes automatic generation of relational reasoning questions difficult and weakens the reliability of reward signals. We propose SPARK (Self-Play with Asymmetric Reward from Knowledge Graphs), a framework that automatically constructs a unified knowledge graph (KG) from multi-document scientific literature and uses it as the structural basis for self-play. KG paths over multimodal nodes serve as a source for generating relational reasoning questions, and structured facts stored in the KG provide a basis for verifiable reward computation. A single small vision-language model (sVLM) alternates between Proposer and Solver roles under information asymmetry against a fixed KG, a design that we believe can be naturally extended toward online adaptation in future work. We evaluate SPARK on public benchmarks and a self-constructed cross-document multi-hop QA dataset. Results show that SPARK consistently outperforms flat-corpus-based self-play baselines, and the performance gap widens as hop count increases, suggesting that KG-structure grounding contributes to relational multi-hop reasoning beyond what unstructured corpus grounding can provide.

2605.05544 2026-05-08 cs.LG cs.RO

Adaptive Q-Chunking for Offline-to-Online Reinforcement Learning

Nandiraju Gireesh, Yuanliang Ju, He Wang

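Affiliations: Peking University; Galbot; University of Toronto

AI summary: This paper proposes Adaptive Q-Chunking, which dynamically selects among critics trained with different chunk sizes to address bias in offline-to-online reinforcement learning, improving performance on OGBench and Robomimic.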

English abstract

Offline-to-online reinforcement learning with action chunking eliminates multi-step off-policy bias and enables temporally coherent exploration, but all existing methods use a fixed chunk size across every state. This is suboptimal: near contact events the agent needs short chunks for reactive control, while during free-space motion long chunks provide better credit assignment. The natural solution is to train critics for several chunk sizes and select the best one at each state, but naive comparison of learned critic values systematically collapses to the shortest chunk due to discount-scale mismatch, and degrades to noise in low-value states. We propose Adaptive Q-Chunking (AQC), which resolves both failures by comparing the advantage of each chunk size relative to a per-horizon baseline, normalized by the discount factor. This criterion converts biased wrong answers into unbiased near-random choices when no genuine signal exists, and becomes discriminative when a particular scale enables better planning. We prove theoretical bounds on the advantage selector's noise immunity and on the value dominance of adaptive chunking over any fixed chunk size. We demonstrate that AQC achieves state-of-the-art offline and online success rates on OGBench and Robomimic, and can be applied to enhance the performance of large-scale VLA models that predict action sequences, significantly boosting performance on RoboCasa-GR1 tasks.
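The selection rule described above can be sketched as follows. The abstract does not state the exact normalization, so this sketch assumes that each advantage is divided by the sum of discounts over the chunk, 1 + gamma + ... + gamma^(h-1). The function names and the dictionary-based inputs are hypothetical.

```python
def discount_scale(h, gamma):
    # Assumed normalizer: total discount mass accumulated over an h-step chunk.
    return sum(gamma ** t for t in range(h))

def select_chunk_size(q_values, baselines, gamma=0.99):
    # q_values[h]: critic value for the candidate chunk of size h at this state.
    # baselines[h]: per-horizon baseline for the same state and chunk size.
    scores = {
        h: (q_values[h] - baselines[h]) / discount_scale(h, gamma)
        for h in q_values
    }
    best = max(scores, key=scores.get)
    return best, scores

# Usage: compare a 1-step and a 4-step critic at one state.
best, scores = select_chunk_size({1: 1.0, 4: 3.0}, {1: 0.9, 4: 2.0})
```

Comparing normalized advantages rather than raw critic values puts critics trained at different horizons on a common scale, which is the stated motivation for the criterion.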

2605.05541 2026-05-08 cs.RO

Real-world Latency Analysis of Vehicular Visible Light Communication with Multiple LED Transmitters and an Event-Based Camera

Ryota Soga, Tsukasa Shimizu, Shintaro Shiba, Quan Kong, Shan Lu, Takaya Yamazato

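Affiliations: Graduate School of Engineering, Nagoya University; TOYOTA MOTOR CORPORATION; Woven by Toyota, Inc.

AI summary: This paper presents an event-camera-based visible light communication system that addresses bandwidth saturation, multi-transmitter reception, and latency characterization; a positive-event-only mode and protocol design enable simultaneous reception from multiple LEDs, and tests in real vehicular scenarios confirm that latency meets cooperative perception requirements.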

Comments 5 pages, IEEE VTC2026-Spring

English abstract

Event cameras offer high temporal resolution, low latency, and wide dynamic range, making them promising receivers for visible light communication (VLC) in vehicle-to-everything (V2X) applications. This work presents an event-camera-based VLC system addressing three key challenges: bandwidth saturation, multi-transmitter reception, and latency characterization. We adopt a positive-event-only mode and design a protocol that suppresses event generation while maintaining communication distance and a wide field of view. We also propose a method to identify multiple transmitters and demonstrate simultaneous reception from up to three LEDs. Finally, we evaluate end-to-end latency in real vehicular scenarios and show that the system meets cooperative perception requirements. These results demonstrate that event-camera-based VLC is a feasible complement to existing V2X technologies (e.g., RF).

2605.05540 2026-05-08 cs.LG physics.flu-dyn

Towards Scalable One-Step Generative Modeling for Autoregressive Dynamical System Forecasting

Tianyue Yang, Xiao Xue

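Affiliations: The Center for Computational Science

AI summary: This paper proposes MeLISA, which uses a blockwise stochastic transition kernel for efficient autoregressive generation and combines window-consistency and time-increment-consistency losses to improve long-horizon statistical accuracy and inference speed.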

Comments 42 pages, 15 figures

English abstract

Fast surrogate modeling for high-dimensional physical dynamics requires more than low short-term error: useful models must roll out efficiently while preserving the statistical structure of long trajectories. Neural operators provide inexpensive autoregressive forecasts but can drift in turbulent regimes, whereas rolling diffusion and latent generative surrogates can represent stochastic transitions at the cost of multi-step denoising, noise-schedule design, or auxiliary compression models. We propose MeanFlow Long-term Invariant Spatiotemporal Consistency Autoregressive Models (MeLISA), a latent-free autoregressive generative surrogate built on pixel-space MeanFlow. MeLISA defines a blockwise stochastic transition kernel that generates each forecast block with a single model evaluation, avoiding latent encoders and iterative diffusion solvers at inference time. To stabilize long-horizon rollouts, MeLISA combines a Window-Consistency MeanFlow objective that learns conditional spatiotemporal generation from partially observed temporal windows with a Time Increment Consistency loss that constrains multi-lag finite increments and targets temporal-correlation structure. We evaluate MeLISA with compact UNet and scalable DiT backbones on two high-resolution benchmarks, extended 2D Kolmogorov flow at 256 × 256 and turbulent channel-flow slice at 192 × 192. MeLISA outperforms neural-operator baselines on short-term forecasting accuracy and long-horizon statistical metrics, including energy spectra, turbulent kinetic energy, and mixing-rate-related dynamics, while achieving inference speeds comparable to, and in some cases faster than, neural operators. Compact 3.7-5.7M-parameter variants already deliver strong parameter efficiency, and DiT variants provide a scalable path up to 150M parameters. Overall, MeLISA benefits both rollout efficiency and long-horizon statistical accuracy.
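As a rough illustration of what a loss on multi-lag finite increments could look like, the sketch below compares increments x[t+k] - x[t] of a predicted and a reference scalar series over several lags. The abstract does not give the actual Time Increment Consistency formula; the squared-error form, the lag set, and the function name are assumptions.

```python
def increment_consistency_loss(pred, target, lags=(1, 2, 4)):
    # pred, target: equal-length lists of floats (a scalar time series).
    # For each lag k, penalize the mismatch between predicted and reference
    # finite increments x[t + k] - x[t], averaged over all lags and times.
    total, count = 0.0, 0
    for k in lags:
        for t in range(len(pred) - k):
            d_pred = pred[t + k] - pred[t]
            d_ref = target[t + k] - target[t]
            total += (d_pred - d_ref) ** 2
            count += 1
    return total / count if count else 0.0
```

A term of this shape ignores a constant offset between the two series and responds only to how the signal changes over time, so it would complement, rather than replace, a pointwise objective.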

2605.05538 2026-05-08 cs.AI cs.IR

AgenticRAG: Agentic Retrieval for Enterprise Knowledge Bases

Susheel Suresh, Hazel Mak, Shangpo Chou, Fred Kroon, Sahil Bhatnagar

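Affiliations: Microsoft Corporation

AI summary: AgenticRAG layers a lightweight agentic harness on top of existing enterprise retrieval infrastructure so that a language model can iteratively retrieve information, navigate documents, and analyze evidence autonomously, improving retrieval accuracy and factuality.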

Comments 14 pages, 5 figures

English abstract

We present AgenticRAG, a practical agentic harness for retrieval and analysis over enterprise knowledge bases. Standard RAG pipelines place a significant burden of grounding on the search stack, constraining the language model to a fixed candidate set chosen deep in the retrieval process. Our approach reduces this overdependence by layering a lightweight harness on top of existing enterprise search infrastructure, equipping a reasoning LLM with search, find, open, and summarize tools that enable the model to iteratively retrieve information, navigate within documents, and analyze evidence autonomously. On three open benchmarks we observe substantial gains: 49.6% recall@1 on BRIGHT (+21.8 pp over the best embedding baseline), 0.96 factuality on WixQA (+13% relative improvement), and 92% answer correctness on FinanceBench, within 2 pp of oracle access to true evidence. Ablation studies show that the most significant factor is the shift from single-shot retrieval to agentic tool use (5.9× improvement), while multi-query search and in-document navigation contribute to both quality and efficiency. We present various design choices in our agentic harness that were informed by pre-production deployments. Our results demonstrate its suitability for real-world enterprise production environments.

2605.05535 2026-05-08 cs.AI

Housing Potential Common Data Model and City Digital Twin

Megan Katsumi, Mark Fox, Anderson Wong, Divnoor Chatha

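Affiliations: Urban Data Research Centre; School of Cities, University of Toronto

AI summary: This paper proposes the Housing Potential Common Data Model to address data silos, demonstrates a practical application through a City Digital Twin and a pilot dashboard, and offers actionable strategies for urban planners.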

English abstract

The evaluation of housing potential requires consideration of a location from multiple perspectives, ranging from zoning and land use to population characteristics and access to services. This research introduces the Housing Potential Common Data Model (HPCDM) to overcome existing data silos, serving as a standard to support integration and interoperability across the diverse range of datasets that are required for housing potential analysis. This report details the evaluation of the model along with the creation of a City Digital Twin for housing and a pilot dashboard application to demonstrate a practical implementation. Beyond the technical framework, this work identifies critical barriers to adoption and provides actionable mitigation strategies for urban planners and stakeholders.

2605.05534 2026-05-08 cs.LG

Adversarial Graph Neural Network Benchmarks: Towards Practical and Fair Evaluation

Tran Gia Bao Ngo, Zulfikar Alom, Federico Errica, Murat Kantarcioglu, Cuneyt Gurcan Akcora

Affiliations: Department of Computer Science, University of Manitoba; University of Toledo; NEC Laboratories Europe; Department of Computer Science, Virginia Tech, USA; AI Initiative, University of Central Florida

AI summary: This paper re-evaluates seven attacks and eight defenses within a unified framework, showing how much fair and robust evaluation matters for measured attack effectiveness and finding that target node selection and the training process of the attacked model significantly affect performance.

Comments 49 pages, 6 figures

English abstract

Adversarial learning and the robustness of Graph Neural Networks (GNNs) are topics of widespread interest in the machine learning community, as documented by the number of adversarial attacks and defenses designed for these purposes. While a rigorous evaluation of these adversarial methods is necessary to understand the robustness of GNNs in real-world applications, we posit that many works in the literature do not share the same experimental settings, leading to ambiguous and potentially contradictory scientific conclusions. In this benchmark, we demonstrate the importance of adopting fair, robust, and standardized evaluation protocols in adversarial GNN research. We perform a comprehensive re-evaluation of seven widely used attacks and eight recent defenses under both poisoning and evasion scenarios, across six popular graph datasets. Our study spans over 453,000 experiments conducted within a unified framework. We observe substantial differences in adversarial attack performance when evaluated under a fair and robust procedure. Our findings reveal that previously overlooked factors, such as target node selection and the training process of the attacked model, have a profound impact on attack effectiveness, to the extent of completely distorting performance insights. These results underscore the urgent need for standardized evaluations in adversarial graph machine learning.

2605.05532 2026-05-08 cs.CL cs.CY

A Few Good Clauses: Comparing LLMs vs Domain-Trained Small Language Models on Structured Contract Extraction

Nicole Lincoln, Nick Whitehouse, Jaron Mar, Rivindu Perera

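Affiliations: Onit AI Labs, Onit Inc.

AI summary: This paper compares LLMs with a domain-trained small language model on structured contract extraction and finds that the domain-trained small model performs better at significantly lower cost.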

English abstract

This paper evaluates whether a domain-trained Small Language Model (SLM) can outperform frontier Large Language Models on structured contract extraction at radically lower cost. We test Olava Extract, a self-hosted legal-domain Mixture of Experts model, against five frontier models. Olava Extract achieved the strongest aggregate performance in the study, with a macro F1 of 0.812 and a micro F1 of 0.842, while reducing inference cost by 78% to 97% compared with the frontier models tested. It also achieved the highest precision scores, producing fewer hallucinated and unsupported extractions, an important distinction in legal workflows where hallucinations create operational risk and downstream review burden. The findings show that high-performing, human-comparable legal AI no longer requires the largest externally hosted models. More broadly, they challenge the assumption that commercially valuable enterprise AI capability must remain tied to ever larger models, massive infrastructure expenditure, and centrally hosted providers.
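For readers comparing the two reported scores, the sketch below shows the standard definitions: macro F1 averages per-field F1 scores, while micro F1 pools true positives, false positives, and false negatives across fields before computing a single F1. The field names and counts are invented for illustration and are not data from the paper.

```python
def f1(tp, fp, fn):
    # F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when there is nothing to score.
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def macro_micro_f1(counts):
    # counts: dict mapping a field name to a (tp, fp, fn) tuple.
    macro = sum(f1(*c) for c in counts.values()) / len(counts)
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    return macro, f1(tp, fp, fn)

# Hypothetical counts: one frequent, well-extracted field and one rare, harder field.
example = {"governing_law": (90, 5, 5), "termination": (10, 10, 10)}
macro, micro = macro_micro_f1(example)
```

In this toy case micro F1 exceeds macro F1 because the frequent field dominates the pooled counts; a gap in the same direction in a real evaluation may indicate that rarer fields are harder, though the abstract does not say so.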

2605.05530 2026-05-08 cs.LG

Energy Generative Modeling: A Lyapunov-based Energy Matching Perspective

Yixuan Wang, Wenqian Xue, Warren E. Dixon

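Affiliations: Department of Mechanical and Aerospace Engineering, University of Florida

AI summary: This paper unifies the training and sampling of generative models through a Lyapunov function, proves that the deterministic gradient flow on the energy landscape admits no Lyapunov certificate, and shows that additive composition of trained scalar energies retains an explicit Gibbs invariant measure.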

Comments 11 pages, 2 figures

English abstract

Generative models based on static scalar energy functions represent an emerging paradigm in which a single time-independent potential drives sample generation through its gradient field, eliminating the need for time conditioning entirely. We unify the training and sampling phases of this paradigm, conventionally treated as separate procedures, within a single framework: density transport on the Wasserstein space, cast as a nonlinear control problem in which the Kullback-Leibler (KL) divergence serves as a Lyapunov function. Training and sampling are then two instances of this same master dynamics, differing only in initial condition. Within this autonomous framework we develop two analytic results. First, since the Lyapunov certificate is asymptotic, we derive a finite-step stopping criterion for Langevin sampling and prove that no Lyapunov certificate exists for the deterministic gradient flow on the same energy landscape. Second, the reformulation brings the toolkit of nonlinear control theory to bear on static scalar energy generative modeling, that is, we show that additive composition of trained scalar energies retains an explicit Gibbs invariant measure and inherits the closed-loop Lyapunov certificate. Beyond these immediate results, this reformulation bridges static scalar energy generative models with the full toolkit of nonlinear control theory, opening the door to barrier functions for constrained generation and contraction metrics for accelerated sampling. Experiments on synthetic distributions validate the theoretical predictions.
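The contrast between Langevin sampling and the deterministic gradient flow can be seen on a toy problem. The sketch below assumes a one-dimensional energy E(x) = x^2 / 2, whose Gibbs measure exp(-E) is the standard normal; the step size, step count, and function names are illustrative choices, and the paper's finite-step stopping criterion is not reproduced here.

```python
import math
import random

def grad_energy(x):
    # Gradient of the assumed toy energy E(x) = x**2 / 2.
    return x

def langevin_sample(x0, step=0.01, n_steps=2000, rng=None):
    # Unadjusted Langevin dynamics: gradient descent on E plus Gaussian noise
    # with variance 2 * step per update.
    rng = rng or random.Random(0)
    x = x0
    for _ in range(n_steps):
        x = x - step * grad_energy(x) + math.sqrt(2.0 * step) * rng.gauss(0.0, 1.0)
    return x

def gradient_flow(x0, step=0.01, n_steps=2000):
    # Same update without noise: every start point collapses to the minimizer.
    x = x0
    for _ in range(n_steps):
        x = x - step * grad_energy(x)
    return x
```

With these settings the noisy chain spreads samples with roughly unit variance, while the noise-free flow concentrates all mass at x = 0 instead of matching the Gibbs measure. This is only an intuition for why the two dynamics behave differently, not the paper's proof.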

2605.05524 2026-05-08 cs.LG cs.AI

MOSAIC: Module Discovery via Sparse Additive Identifiable Causal Learning for Scientific Time Series

Shicheng Fan, Nour Elhendawy, Jianle Sun, Ke Fang, Kun Zhang, Yihang Wang, Lu Cheng

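Affiliations: University of Illinois Chicago; Case Western Reserve University; Carnegie Mellon University; MBZUAI

AI summary: MOSAIC uses sparse additive identifiable causal learning to discover module structure in scientific time series, recovering the supporting observed variables through a sparse decoder to enable interpretable discovery of latent mechanisms.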

English abstract

Causal representation learning (CRL) seeks to recover latent variables with identifiability guarantees, typically up to permutation and component-wise reparameterization under appropriate assumptions. However, identifiability does not imply interpretability: latent semantics are typically assigned post hoc by alignment with known ground-truth factors. This limitation is particularly acute in scientific time series, where underlying mechanisms are unknown and discovering interpretable structure is a primary goal. In contrast, scientific observations (such as residue-pair distances, climate indices, or process sensors) are inherently semantic, as they correspond to named physical quantities. This raises a key question: can the interpretability of observations be transferred to the identifiable latent space? We propose MOSAIC (Module discovery via Sparse Additive Identifiable Causal learning), a sparse temporal VAE that integrates temporal CRL identifiability with support recovery over observed variables. MOSAIC identifies latent variables via regime-conditioned temporal variation, and recovers for each latent a sparse set of associated observations through an additive decoder, yielding module-level interpretability. We show that ANOVA main-effect supports are identifiable under general smooth mixing functions, and provide finite-sample recovery guarantees for a tractable sparse-additive variant. Empirically, MOSAIC recovers domain-consistent variable groups across RNA molecular dynamics, solar wind, ENSO climate, the Tennessee Eastman process, and a synthetic tokamak benchmark, enabling interpretable discovery of latent mechanisms in scientific time series.

2605.05519 2026-05-08 cs.LG cs.DC

OpenG2G: A Simulation Platform for AI Datacenter-Grid Runtime Coordination

Jae-Won Chung, Zhirui Liang, Yanyong Mao, Jiasi Chen, Mosharaf Chowdhury, Vladimir Dvorkin

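Affiliations: University of Michigan

AI summary: This paper presents OpenG2G, a platform for studying coordination strategies between AI datacenters and the power grid; by simulating different control methods, it evaluates how AI model choices affect datacenter flexibility.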

Comments Open-source at https://github.com/gpu2grid/openg2g

English abstract

AI's growing compute demand and new datacenter buildouts present major capacity and reliability challenges for the electricity grid, leading to multi-year interconnection delays for new datacenters and bottlenecking AI growth. To ease this strain, datacenters increasingly offer rapid power flexibility in response to grid signals, where the datacenter can increase or decrease its power consumption by adapting its workload in real time. In order to understand the impact of large datacenters on the grid and to facilitate the design of effective coordination strategies, we build OpenG2G, a simulation platform for AI datacenter-grid runtime coordination. We show that OpenG2G is capable of answering a wide range of coordination questions by allowing users to implement and compare various control paradigms (including classic, optimization, and learning-based controllers), and quantify how AI model and deployment choices affect datacenter flexibility and coordination outcomes. This versatility is enabled by OpenG2G's modular and extensible architecture: a datacenter backend driven by real measurements of production-grade AI services, a grid backend built on high-fidelity grid simulators, and a generic controller interface that closes the loop between them. We describe the design of OpenG2G and demonstrate its usefulness through realistic grid scenarios and AI workloads.

2605.05511 2026-05-08 cs.LG stat.ML

Non-Myopic Active Feature Acquisition via Pathwise Policy Gradients

Linus Aronsson, Morteza Haghir Chehreghani

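Affiliations: Department of Computer Science and Department of Computer Science and Engineering; Chalmers University of Technology; University of Gothenburg

AI summary: This paper proposes non-myopic pathwise policy gradients, which continuously relax the acquisition process and introduce a straight-through rollout scheme to optimize non-myopic feature acquisition end to end; experiments show that it outperforms existing methods on synthetic and real-world datasets.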

English abstract

Active feature acquisition (AFA) considers prediction problems in which features are costly to obtain and the learner adaptively decides which feature values to acquire for each instance and when to stop and predict. AFA can be formulated as a partially observable Markov decision process (POMDP), which naturally admits a sequential decision-making perspective. In this paper, we present non-myopic pathwise policy gradients (NM-PPG), a new AFA method built around this formulation. We introduce a continuous relaxation of the acquisition process that enables pathwise gradients through the full acquisition trajectory, avoiding the high variance of standard score-function policy gradients while allowing end-to-end optimization of a non-myopic acquisition policy. To better align training with deployment, we further develop a straight-through rollout scheme that follows hard feature acquisitions in the forward pass while backpropagating through the corresponding soft relaxation in the backward pass. We stabilize optimization with entropy regularization and staged temperature sharpening. Experiments on both synthetic and real-world datasets demonstrate that NM-PPG yields superior performance relative to state-of-the-art AFA baselines.
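The straight-through idea (hard choice in the forward pass, soft relaxation in the backward pass) can be sketched for a single acquisition step without an autodiff library. The abstract does not specify the relaxation, so a temperature-scaled softmax is assumed here, and the function names are hypothetical.

```python
import math

def softmax(scores, temperature=1.0):
    # Temperature-scaled softmax; lower temperatures give sharper distributions.
    m = max(scores)
    exps = [math.exp((s - m) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def straight_through_forward(scores, temperature=1.0):
    # Forward pass: acquire exactly one feature (one-hot argmax),
    # and keep the soft relaxation for use in the backward pass.
    soft = softmax(scores, temperature)
    chosen = max(range(len(scores)), key=lambda i: scores[i])
    hard = [1.0 if i == chosen else 0.0 for i in range(len(scores))]
    return hard, soft

def straight_through_backward(soft, upstream, temperature=1.0):
    # Backward pass: treat the hard selection as if it were the softmax output.
    # Gradient of sum_i upstream[i] * soft[i] with respect to each score.
    dot = sum(u * s for u, s in zip(upstream, soft))
    return [s * (u - dot) / temperature for u, s in zip(upstream, soft)]
```

In a framework with automatic differentiation, the same effect is usually obtained by returning the soft output plus a gradient-stopped (hard minus soft) correction; the manual Jacobian above just makes the two passes explicit.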

2605.05510 2026-05-08 cs.CV

The First Controllable Bokeh Rendering Challenge at NTIRE 2026

Tim Seizinger, Florin-Alexandru Vasluianu, Jeffrey Chen, Zhuyun Zhou, Zongwei Wu, Radu Timofte, Dafeng Zhang, Yipeng Lin, Qi Yan, Junhao Chen, Yang Yang, Divyavardhan Singh, Hariom Thacker, Hammad Mohammad, Aanchal Maurya, Kishor Upla, Kiran Raja, Wei Zhou, Hongyu Huang, Yujin Cho, Grigory Malivenko, Jiachen Tu, Yaokun Shi, Guoyi Xu, Yaoxin Jiang, Jiajia Liu

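Affiliations: NTIRE 2026

AI summary: This paper reports the results of the first Controllable Bokeh Rendering Challenge at NTIRE 2026 and highlights the most effective submissions; 8 teams submitted valid solutions in the final test phase, and all submissions were evaluated on unseen images focusing on portraits and intricate subjects.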

Comments Challenge report paper from NTIRE Workshop at CVPR 2026

Journal ref 2026 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)

English abstract

This study presents the outcomes of the first Controllable Bokeh Rendering Challenge at NTIRE and highlights the most effective submitted methodologies. In total, 44 participants registered for the competition, of which 8 teams submitted valid solutions after the conclusion of the final test phase. All submissions were evaluated on unseen images, focusing on portraits and intricate subjects with complex and visually appealing bokeh phenomena. In addition to the first track focusing on established quantitative fidelity metrics, we conducted a qualitative user study with a panel of experts for a second track focusing on perceptual assessment. As this was the inaugural challenge on this topic, most of the participants focused on refining and extending the Bokehlicious baseline method.

2605.05503 2026-05-08 cs.CL

Chainwash: Multi-Step Rewriting Attacks on Diffusion Language Model Watermarks

Mohd Ruhul Ameen, Akif Islam, Nadim Mahmud, Md. Ekramul Hamid

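Affiliations: College of Engineering and Computer Science, Marshall University; Department of Computer Science and Engineering, University of Rajshahi; Miami University

AI summary: This paper studies how multi-step rewriting affects watermark detection for diffusion language models and finds that repeated rewriting sharply lowers the detection rate, indicating that it is a more effective attack than a single rewrite.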

Comments 13 pages, 5 figures, 3 tables

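Details
AI abstract (translated from Chinese)

Statistical watermarking is a common way to verify whether text was generated by a language model. Existing schemes assume autoregressive generation, in which tokens are produced left to right and contextual hashing is well defined. Diffusion language models generate text by denoising tokens in arbitrary order, so these schemes cannot be applied directly. Gloaguen et al. recently designed a watermark for LLaDA 8B Instruct and reported a true positive detection rate above 99%. This paper studies what happens when watermarked text is rewritten multiple times. Using the same watermark configuration, 1,605 watermarked completions of about 300 tokens each were generated across five WaterBench domains. Each completion was rewritten by four open-weight language models (1.5B to 8B parameters), none of which know the watermark key. Five rewrite styles were tested: paraphrase, humanize, simplify, academic, and summarize expand. Each style was chained for up to five hops, producing 160,500 rewritten texts in total. The watermark was detected on 87.9% of the original outputs at the standard significance threshold. After one rewrite, the detection rate fell to between 14% and 41%, depending on the rewriter and style. After five chained rewrites, it fell to 4.86%, meaning that 94.76% of the initially detected texts were no longer flagged. After three rewrites, the detection score had dropped 86% of the way from the watermarked baseline toward the null distribution. Repeated rewriting is therefore a stronger attack than a single rewrite, and the result holds across all four rewriters tested.

English abstract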

Statistical watermarking is a common approach for verifying whether text was written by a language model. Most existing schemes assume autoregressive generation, where tokens are produced left to right and contextual hashing is well defined. Diffusion language models generate text by denoising tokens in arbitrary order, so these schemes cannot be applied directly. A recent watermark by Gloaguen et al. addresses this gap for LLaDA 8B Instruct and reports true positive detection above 99%. This paper studies what happens when watermarked text is rewritten not once but several times. Using the same watermark configuration, 1,605 watermarked completions of about 300 tokens each are produced across five WaterBench domains. Each completion is rewritten by four open weight language models, from 1.5B to 8B parameters, none of which know the watermark key. Five rewrite styles are tested: paraphrase, humanize, simplify, academic, and summarize expand. Each style is chained for up to five hops, producing 160,500 rewritten texts in total. The watermark is detected on 87.9% of the original outputs at the standard significance threshold. After a single rewrite, detection falls to between 14% and 41% depending on the rewriter and style. After five chained rewrites, detection falls to 4.86%, meaning 94.76% of the originally detected texts are no longer flagged. After three rewrites, the detector score has dropped 86% of the way from its watermarked baseline toward the null distribution. Repeated rewriting is therefore a much stronger attack than a single rewrite, and the result holds across all four rewriters tested.
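The corpus size quoted in the abstract follows from the experimental grid. The following is a minimal Python sketch of that arithmetic, assuming a full cross product in which every completion is rewritten by every rewriter, in every style, at every hop. The variable names are illustrative and do not come from the paper.

```python
# Experimental grid as stated in the abstract.
completions = 1605  # watermarked completions
rewriters = 4       # open-weight rewriting models
styles = 5          # paraphrase, humanize, simplify, academic, summarize expand
hops = 5            # chained rewrites per style

# Assumption: one rewritten text per (completion, rewriter, style, hop).
total_rewrites = completions * rewriters * styles * hops

# Aggregate detection rates quoted in the abstract.
baseline_rate = 0.879
five_hop_rate = 0.0486

# Relative drop of the aggregate detection rate after five hops.
# The abstract's 94.76% is stated over originally detected texts,
# so it need not equal this aggregate ratio.
relative_drop = 1 - five_hop_rate / baseline_rate

print(total_rewrites)
print(round(relative_drop, 4))
```

Under the full-grid assumption the product reproduces the 160,500 rewritten texts reported in the abstract.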

2605.05499 2026-05-08 cs.AI

FoodCHA: Multi-Modal LLM Agent for Fine-Grained Food Analysis

Woojin Lee, Pranav Mekkoth, Ye Tian, Onat Gungor, Tajana Rosing

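Affiliations * Department of Computer Science and Engineering

AI summary Proposes FoodCHA, a multimodal agentic framework that uses a hierarchical decision process to improve discrimination of fine-grained attributes in food recognition; experiments show higher category and subcategory recognition precision than existing models.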

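Details
AI abstract (translated from Chinese)

The widespread use of camera-equipped mobile devices and wearables has made it easy to capture meal images, making food recognition a key component of real-time dietary monitoring. Real-world food images are challenging, however, because of high intra-class similarity and because a single image often contains multiple food items. Although deep learning models perform strongly on coarse-grained classification, they often struggle to capture fine-grained attributes such as cooking style. In addition, open-ended generation in modern vision-language models can produce non-canonical labels, which limits practical use. We propose FoodCHA, a multimodal agentic framework that reformulates food recognition as a hierarchical decision-making process. By progressively anchoring predictions, FoodCHA uses high-level categories to guide subcategory identification and subcategories to guide cooking style recognition, improving semantic consistency and attribute-level discrimination. To ensure practical deployability, FoodCHA uses the compact Moondream-2B vision-language model, which provides strong reasoning capability with low computational and memory overhead. Experiments on FoodNExTDB show that FoodCHA outperforms Food-Llama-3.2-11B by 13.8% and 38.2% in category and subcategory recognition precision, respectively, and achieves a 153.2% improvement in cooking style classification precision.

English abstract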

The widespread adoption of camera-equipped mobile devices and wearables has enabled convenient capture of meal images, making food recognition a key component for real-time dietary monitoring. However, real-world food images present challenges due to high intra-class similarity and the frequent presence of multiple food items within a single image. While deep learning models achieve strong performance in coarse-grained classification, they often struggle to capture fine-grained attributes such as cooking style. Moreover, open-ended generation in modern vision-language models can produce non-canonical labels, limiting their practical deployment. We propose FoodCHA, a multimodal agentic framework that reformulates food recognition as a hierarchical decision-making process. By progressively anchoring predictions, FoodCHA guides subcategory identification using high-level categories and guides cooking style recognition using subcategories, improving semantic consistency and attribute-level discrimination. To ensure practical deployability, FoodCHA utilizes the compact Moondream-2B vision-language model, which provides strong reasoning capability while maintaining lower computational and memory overhead. Experiments on FoodNExTDB show that FoodCHA outperforms Food-Llama-3.2-11B by 13.8% and 38.2% in category and subcategory recognition precision, respectively, and achieves a striking 153.2% improvement in cooking style classification precision.
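The progressive anchoring described in the abstract can be pictured as a chain of closed-set choices, where each level only considers labels allowed by the previous one. The sketch below is a hedged illustration in Python, not the FoodCHA implementation: the taxonomy, the `classify` and `pick` helpers, and the `toy_scorer` (a stand-in for a vision-language model call) are all hypothetical.

```python
from typing import Callable, Dict, List

# Hypothetical taxonomy: category -> subcategory -> cooking styles.
TAXONOMY: Dict[str, Dict[str, List[str]]] = {
    "vegetables": {"potato": ["fried", "boiled", "baked"], "tomato": ["raw", "stewed"]},
    "meat": {"chicken": ["grilled", "fried"], "beef": ["grilled", "stewed"]},
}

# (image_id, label) -> score; a stand-in for querying a vision-language model.
Scorer = Callable[[str, str], float]


def pick(image_id: str, candidates: List[str], scorer: Scorer) -> str:
    """Choose the best label from a closed candidate list (no free-form output)."""
    return max(candidates, key=lambda label: scorer(image_id, label))


def classify(image_id: str, scorer: Scorer) -> Dict[str, str]:
    """Anchor each level on the previous one: category, then subcategory, then style."""
    category = pick(image_id, list(TAXONOMY), scorer)
    subcategory = pick(image_id, list(TAXONOMY[category]), scorer)
    style = pick(image_id, TAXONOMY[category][subcategory], scorer)
    return {"category": category, "subcategory": subcategory, "cooking_style": style}


# Toy scores: "potato" scores highest overall, but it is never considered
# once the category has been anchored to "meat".
TOY_SCORES = {"meat": 0.9, "vegetables": 0.4, "potato": 0.95, "chicken": 0.8, "fried": 0.7}


def toy_scorer(image_id: str, label: str) -> float:
    return TOY_SCORES.get(label, 0.0)


print(classify("meal.jpg", toy_scorer))
```

Restricting every step to a fixed candidate list is one simple way to avoid the non-canonical labels that open-ended generation can produce.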

2605.05495 2026-05-08 cs.LG

Shortcut Solutions Learned by Transformers Impair Continual Compositional Reasoning

William T. Redman, Erik C. Johnson, Brian Robinson

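Affiliations * Johns Hopkins Applied Physics Lab

AI summary Studies Transformer models in continual learning and finds that BERT learns shortcut solutions that limit generalization, whereas ALBERT's recurrent structure improves continual learning performance; both fall short on tasks that require composition across experiences.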

Comments 17 pages, 6 figures

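Details
AI abstract (translated from Chinese)

Identifying and exploiting features shared across domains is central to the human capacity for analogy and is considered key to continual learning. Doing this effectively requires general and flexible computational strategies. Although the compositional reasoning ability of Transformer neural networks has been studied extensively in recent work, little has been done to understand how these models use their representations to learn new, related experiences. To address this, we extend the previously developed Learning Equality and Group Operations (LEGO) framework to a continual learning (CL) setting ("continual LEGO").

English abstract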

Identifying and exploiting common features across domains is at the heart of the human ability to make analogies, and is believed to be crucial for the ability to continually learn. To do this successfully, general and flexible computational strategies must be developed. While the extent to which Transformer neural network models can perform compositional reasoning has been the subject of intensive recent investigation, little work has been done to systematically understand how well these models can leverage their representations to learn new, related experiences. To address this gap, we expand the previously developed Learning Equality and Group Operations (LEGO) framework to a continual learning (CL) setting ("continual LEGO"). Using this continual LEGO experimental paradigm, we study the capability of feedforward and recurrent Transformer models to perform CL. We find that BERT, a canonical feedforward Transformer model, learns shortcut solutions that limit its ability to generalize and prevent strong forward transfer to new experiences. In contrast, we find evidence supporting the hypothesis that ALBERT, a recurrent version of BERT, learns a For loop-esque solution, which leads to better CL performance. When applying BERT and ALBERT models to a CL setting that requires composition across experiences, we find that both model families fail. Our investigation suggests that ALBERT models can have their performance drop rescued by use of training strategies that combine data across experiences, but this is not true for BERT models, where a detrimental shortcut solution becomes entrenched with initial training. Our results demonstrate that the recurrent ALBERT model may have an inductive bias better suited for CL and motivate future investigation of the interplay between Transformer architecture and computational solutions that emerge in modern models and tasks.

2605.05492 2026-05-08 cs.LG

MEMOA: Massive Mixtures of Online Agents via Mean-Field Decentralized Nash Equilibria

Xuwei Yang, David B. Emerson, Fatemeh Tavakoli, Anastasis Kratsios

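Affiliations * Department of Mathematics and Statistics; Vector Institute; McMaster University

AI summary Proposes the MEMOA algorithm, which optimizes massive mixtures of online agents via mean-field decentralized Nash equilibria, addresses computational and communication costs that grow with the number of agents, and is proven to converge to the Nash-optimal centralized policy in the large-population limit.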

Comments 43 pages, 11 tables, 1 figure

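Details
AI abstract (translated from Chinese)

In the era of large-scale AI, federated learning has become an important tool for training large populations of AI agents; however, its computational and communication costs quickly stop scaling as the number of agents grows. This is where decentralized agentic strategies excel: each agent acts autonomously, using only its own state and a mean-field summary. We derive the unique optimal decentralized policy, where optimality is defined by a worst-client/minimax criterion that minimizes the regret of the weakest agent. We further prove that the resulting decentralized policy converges, in the large-population limit, to the Nash-optimal centralized policy, whose direct computation is not scalable. An online weighting mechanism optimizes the server-computed mixture of client predictions, improving the mean prediction in addition to the previously optimized weakest-client prediction. Numerical experiments verify the theoretical guarantees and show that the proposed decentralized policy typically outperforms natural greedy decentralized baselines.

English abstract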

In the modern age of large-scale AI, federated learning has become an increasingly important tool for training large populations of AI agents; however, its computational and communication costs can rapidly fail to scale with the number of agents. This is precisely where decentralized agentic strategies shine: each agent acts autonomously, using only its own state together with a minimal summary of the ensemble, namely the mean-field. We derive the unique optimal decentralized policy in closed form. Optimality is characterized through a worst-client/minimax criterion: minimizing the under-performer regret, namely the maximal online cost incurred by the weakest agent in the ensemble. We further prove that the resulting decentralized policy asymptotically converges, in the large-population limit, to the Nash-optimal centralized policy, whose direct computation is not scalable. We use an online weighting mechanism to optimize the server-computed mixture of client predictions, thereby improving the mean prediction in addition to the previously optimized weakest-client prediction. Numerical experiments verify our theoretical guarantees and demonstrate that our decentralized policy typically outperforms natural greedy decentralized baselines.

2605.05488 2026-05-08 cs.LG

A Robust Foundation Model for Conservation Laws: Injecting Context into Flux Neural Operators via Recurrent Vision Transformers

Taeyoung Kim, Joon-Hyuk Ko

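Affiliations * Center for AI and Natural Sciences; Korea Institute for Advanced Study

AI summary Proposes a Flux Neural Operator architecture combined with a recurrent Vision Transformer, in which a hypernetwork extracts solution dynamics and generates context-conditioned neural operator parameters, so that conservation laws can be solved without explicit access to the governing equation or PDE coefficients.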

Comments 14 pages, 3 figures

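Details
AI abstract (translated from Chinese)

We propose an architecture that augments the Flux Neural Operator (Flux NO), which combines the classical finite volume method (FVM) with neural operators, with ViT-based context injection. Our model is formulated as a hypernetwork: it extracts solution dynamics over a finite time window, encodes them with a recurrent Vision Transformer, and generates the parameters of a context-conditioned neural operator. This lets the model infer and solve conservation laws without explicit access to the governing equation or PDE coefficients. Experiments show that the proposed method retains the robustness, generalization, and long-time prediction advantages of Flux NO over standard neural operators while providing reliable numerical solutions across a wide range of conservative systems, including previously unseen fluxes. Our code is available at https://github.com/xx257xx/CONTEXT_FLUX_NO.

English abstract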

We propose an architecture that augments the Flux Neural Operator (Flux NO), which combines the classical finite volume method (FVM) with neural operators, with ViT-based context injection. Our model is formulated as a hypernetwork: it extracts solution dynamics over a finite temporal window, encodes them with a recurrent Vision Transformer, and generates the parameters of a context-conditioned neural operator. This enables the model to infer and solve conservation laws without explicit access to the governing equation or PDE coefficients. Experimentally, we show that the proposed method preserves the robustness, generalization ability, and long-time prediction advantages of Flux NO over standard neural operators, while delivering reliable numerical solutions across a broad range of conservative systems, including previously unseen fluxes. Our code is available at https://github.com/xx257xx/CONTEXT_FLUX_NO.

2605.05485 2026-05-08 cs.CL cs.AI

ReaComp: Compiling LLM Reasoning into Symbolic Solvers for Efficient Program Synthesis

Atharva Naik, Yash Mathur, Prakam, Carolyn Rose, David Mortensen

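Affiliations * Carnegie Mellon University

AI summary Compiles reasoning traces into symbolic solvers, improving the efficiency and accuracy of program synthesis while reducing reliance on LLMs, with results on several benchmark tasks.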

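Details
AI abstract (translated from Chinese)

LLMs can solve program synthesis tasks but remain inefficient and unreliable on hard instances that require large combinatorial search. Given a small set of reasoning traces, we use coding agents to compile them into reusable symbolic program synthesizers over constrained DSLs. The resulting solvers need no LLM calls at test time and are strong standalone systems: symbolic solver ensembles reach 91.3% accuracy on PBEBench-Lite and 84.7% on PBEBench-Hard, outperforming LLMs with test-time scaling on the latter by 16.3 percentage points at no LLM inference cost. They also complement LLM search, raising PBEBench-Hard accuracy from 68.4% to 85.8% while cutting reported token usage by 78%, and raising SLR-Bench hard-tier accuracy from 34.4% to 58.0% in a neuro-symbolic hybrid setting. Compared with using coding agents directly as per-instance solvers, the induced solvers are substantially more Pareto-efficient, amortizing a small one-time construction cost over many zero-token executions. Finally, most solvers transfer zero-shot to a realistic historical linguistics task, predicting sound changes in natural language data, reaching 80.1% accuracy under ensembling and recovering some plausible linguistic rules. These results show that reasoning traces can be compiled into reusable symbolic solvers that solve many tasks directly, complement LLM inference on hard cases, and offer a scalable route to domain-general solver induction. We release code and data for reproducibility.

English abstract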

LLMs can solve program synthesis tasks but remain inefficient and unreliable on hard instances requiring large combinatorial search. Given a small set of reasoning traces, we use coding agents to compile them into reusable symbolic program synthesizers over constrained DSLs. The resulting solvers require no LLM calls at test time and are strong standalone systems: symbolic solver ensembles reach 91.3% accuracy on PBEBench-Lite and 84.7% on PBEBench-Hard, outperforming LLMs with test-time scaling for the latter by +16.3 percentage points at zero LLM inference cost. They also complement LLM search, improving PBEBench-Hard accuracy from 68.4% to 85.8% while reducing reported token usage by 78%, and raising SLR-Bench hard-tier accuracy from 34.4% to 58.0% in a neuro-symbolic hybrid setting. Compared to directly using coding agents as per-instance solvers, induced solvers are substantially more Pareto-efficient, amortizing a small one-time construction cost over many zero-token executions. Finally, most solvers transfer zero-shot to a real historical linguistics task - predicting sound changes in natural language data - reaching 80.1% accuracy under ensembling and recovering some plausible linguistic rules. Together, these results show that reasoning traces can be compiled into reusable symbolic solvers that solve many tasks directly, complement LLM inference on hard cases, and provide a scalable route to domain-general solver induction. We release code and data for reproducibility.
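To make the idea of a symbolic solver over a constrained DSL concrete, here is a minimal Python sketch of programming by example with an enumerative search and a simple ensemble. Everything in it is an assumption made for illustration: the DSL (an ordered list of single-character `replace` steps), the function names, and the ensemble rule (return the first program that fits all examples) are hypothetical and are not the solvers or benchmarks from the paper.

```python
from itertools import product
from typing import Callable, List, Optional, Tuple

Example = Tuple[str, str]          # (input string, expected output string)
Program = List[Tuple[str, str]]    # ordered replace(old, new) steps
Solver = Callable[[List[Example]], Optional[Program]]


def run(program: Program, text: str) -> str:
    """Apply each replace step in order."""
    for old, new in program:
        text = text.replace(old, new)
    return text


def consistent(program: Program, examples: List[Example]) -> bool:
    """True if the program maps every input to its expected output."""
    return all(run(program, x) == y for x, y in examples)


def enumerate_solver(examples: List[Example], max_steps: int) -> Optional[Program]:
    """Enumerate programs of up to max_steps steps, shortest first."""
    alphabet = sorted({c for x, y in examples for c in x + y})
    steps = [(a, b) for a, b in product(alphabet, repeat=2) if a != b]
    for n in range(1, max_steps + 1):
        for candidate in product(steps, repeat=n):
            if consistent(list(candidate), examples):
                return list(candidate)
    return None


def ensemble(examples: List[Example], solvers: List[Solver]) -> Optional[Program]:
    """Try each solver in turn and keep the first consistent program."""
    for solver in solvers:
        program = solver(examples)
        if program is not None:
            return program
    return None


# Two solvers with different search budgets; no model calls are involved.
solvers: List[Solver] = [
    lambda ex: enumerate_solver(ex, 1),
    lambda ex: enumerate_solver(ex, 2),
]

examples = [("pater", "fader"), ("tap", "daf")]
program = ensemble(examples, solvers)
print(program)
```

Once such a solver exists, each new instance costs only a search over the DSL, which is the sense in which a one-time construction cost can be amortized over many zero-token executions.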

2605.05483 2026-05-08 cs.RO

Robust $\mathcal{H}_\infty$ Controller Design For INDI-Controlled Quadrotor Using Online Parameter Identification

Tom Aantjes, Till M. Blaha, Spilios Theodoulis, Ewoud J. J. Smeur

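Affiliations * Faculty of Aerospace Engineering, Delft University of Technology

AI summary Proposes a robust H∞ controller with online parameter identification for quadrotor attitude control; a gain-scheduled cascaded controller is designed using signal-based H∞ closed-loop shaping, and experiments show good stability and tracking performance under uncertainty.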

Comments 8 pages, 11 figures, Accepted to the ICUAS 2026 conference

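Details
AI abstract (translated from Chinese)

It has recently been shown that all physical parameters of an Incremental Nonlinear Dynamic Inversion (INDI) controller can be estimated onboard a multirotor within half a second, fast enough to complete the full identification during a throw in the air. However, a robust method for tuning the outer-loop gains of this feedback-linearizing INDI controller as a function of the model parameters is still missing. This paper presents a robust gain-scheduled controller design for quadrotor attitude control built on an INDI inner loop. Using signal-based H∞ closed-loop shaping, a gain-scheduled cascaded attitude controller with a feedforward filter is designed for a symmetric quadrotor. The resulting controller shows good stability margins, and nonlinear simulations confirm effective tracking performance under uncertainty. Experimental evaluation is carried out through flight tests with full online parameter identification. Although the parameters identified in these tests lie far outside the defined uncertainty range, flight performance remains comparable to the simulation results for actuator time constants below 40 ms.

English abstract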

It has recently been shown that all physical parameters of an Incremental Nonlinear Dynamic Inversion (INDI) controller can be estimated onboard a multirotor within half a second, which is fast enough to do the full identification during a throw in the air. However, a robust method to tune outer loop gains for this feedback-linearizing INDI controller depending on the model parameters is still missing. This work presents the design of a robust gain-scheduled controller for attitude control of a quadrotor, using an INDI-based inner loop with online identification of its system parameters. A gain-scheduled cascaded attitude controller with a feedforward filter is synthesized for a symmetric quadrotor using signal-based $\mathcal{H}_\infty$ closed-loop shaping. The resulting controller exhibits good stability margins, with nonlinear simulations confirming effective tracking performance under uncertainty. Experimental evaluation is also conducted through flight tests with full online parameter identification. Even though the identified parameters during these tests are far outside the defined uncertainty range, acceptable flight performance comparable to simulation results is maintained for actuator time constants below 40 ms.

2605.05482 2026-05-08 cs.AI cs.CL cs.MA

FinRAG-12B: A Production-Validated Recipe for Grounded Question Answering in Banking

Denys Katerenchuk, Pablo Duboue, Keelan Evanini, David Gondek, Nithin Govindugari, Olivier Allauzen, Joshua Baptiste, David J More, Joshua Schechter

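Affiliations * Kasisto Textualization NBME

AI summary Proposes a data-efficient framework that combines a data generation pipeline with a calibrated refusal mechanism to improve the accuracy and compliance of LLMs in banking, achieving a 7.1 percentage point improvement in query resolution.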

Comments 7 pages, ACL 2026 conference

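Details
AI abstract (translated from Chinese)

Large language models (LLMs) are being widely adopted across domains. In banking, however, their adoption is challenged by requirements for high accuracy, regulatory compliance, and verifiability. We present a unified, data-efficient framework for training grounded, domain-specific LLMs that optimizes answer quality, citation grounding, and calibrated refusal under real-world deployment constraints. First, we describe a data generation pipeline that combines LLM-as-a-Judge filtering, citation annotation, and curriculum learning using only 143M tokens. The resulting 12B model outperforms GPT-4.1 on citation grounding, with a modest citation tradeoff. Second, we propose a calibrated refusal mechanism: training on 22% unanswerable examples raises the refusal rate to 12%, a substantial improvement over the base model's unsafe 4.3% rate, while avoiding GPT-4.1's over-refusal (20.2%). Third, we present an end-to-end methodology from data curation to quantized serving. The system is deployed at more than 40 financial institutions and improves query resolution by 7.1 percentage points (p < 0.001). In addition, the model responds 3-5x faster than GPT-4.1 at 20-50x lower cost.

English abstract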

Large language models (LLMs) are rapidly being adopted across various domains. However, their adoption in the banking industry faces resistance due to demands for high accuracy, regulatory compliance, and the need for verifiable and grounded responses. We present a unified, data-efficient framework for training grounded domain-specific LLMs that optimizes answer quality, citation grounding, and calibrated refusal under real-world deployment constraints. First, we describe a data generation pipeline that combines LLM-as-a-Judge filtering, citation annotation, and curriculum learning with only 143M tokens. The resulting 12B model achieves high answer quality, outperforming GPT-4.1 on citation grounding, with a modest citation tradeoff versus the untuned base. Second, we propose a calibrated refusal mechanism: training on 22% unanswerable examples yields a 12% "I don't know" rate, substantially improving over the base model's unsafe 4.3% rate while avoiding GPT-4.1's over-refusal (20.2%). Third, we present an end-to-end methodology spanning from data curation to quantized serving. The system is deployed at 40+ financial institutions, achieving a 7.1 percentage point improvement in query resolution (p < 0.001). Additionally, the model delivers 3-5x faster responses at 20-50x lower cost compared to GPT-4.1.

2605.05478 2026-05-08 cs.AI

LANTERN: LLM-Augmented Neurosymbolic Transfer with Experience-Gated Reasoning Networks

Mahyar Alinejad, Yue Wang, Amrit Singh Bedi, George Atia

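Affiliations * Department of Electrical and Computer Engineering; University of Central Florida; Department of Computer Science

AI summary LANTERN performs multi-source neurosymbolic transfer through LLM-generated automata, semantic aggregation, and adaptive gating, improving sample efficiency and robustness.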

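Details
AI abstract (translated from Chinese)

Transfer learning in reinforcement learning aims to speed up learning of new tasks using knowledge from related sources. Existing neurosymbolic transfer methods typically rely on manually specified task automata, assume a single source task, and use fixed knowledge-integration mechanisms that cannot adapt to varying source relevance. We propose LANTERN, a unified framework for multi-source neurosymbolic transfer that addresses these limitations with three components: (i) deterministic finite automata generated from natural language task descriptions by large language models; (ii) semantic embedding-based aggregation of multiple source policies, weighted by cross-task similarity; and (iii) adaptive teacher-student gating based on temporal-difference error and semantic uncertainty. Across domains including resource management, navigation, and control, LANTERN improves sample efficiency by 40-60% over existing baselines while remaining robust when sources are poorly aligned. These results show that multi-source, adaptively weighted neurosymbolic transfer can improve the scalability and robustness of symbolic reinforcement learning.

English abstract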

Transfer learning in reinforcement learning (RL) seeks to accelerate learning in new tasks by leveraging knowledge from related sources. Existing neurosymbolic transfer methods, however, typically rely on manually specified task automata, assume a single source task, and use fixed knowledge-integration mechanisms that cannot adapt to varying source relevance. We propose LANTERN, a unified framework for multi-source neurosymbolic transfer that addresses these limitations through three components: (i) deterministic finite automata generated from natural language task descriptions using large language models, (ii) semantic embedding-based aggregation of multiple source policies weighted by cross-task similarity, and (iii) adaptive teacher-student gating based on temporal-difference error and semantic uncertainty. Across domains spanning resource management, navigation, and control, LANTERN achieves 40-60% improvements in sample efficiency over existing baselines while remaining robust to poorly aligned sources. These results demonstrate that multi-source, adaptively weighted neurosymbolic transfer can improve scalability and robustness in symbolic RL settings.
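Component (ii) of the abstract, aggregating source policies with weights based on cross-task similarity, can be sketched in a few lines of Python. The abstract does not give the weighting formula, so the choices below are assumptions: cosine similarity between task embeddings, a temperature-scaled softmax to turn similarities into weights, and a convex mixture of per-state action distributions. The function names and the `temperature` parameter are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def softmax(xs: Sequence[float], temperature: float) -> List[float]:
    """Numerically stable softmax; lower temperature sharpens the weights."""
    m = max(xs)
    exps = [math.exp((x - m) / temperature) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate(
    target_emb: Sequence[float],
    source_embs: Sequence[Sequence[float]],
    source_policies: Sequence[Sequence[float]],
    temperature: float = 0.1,
) -> Tuple[List[float], List[float]]:
    """Mix source action distributions, weighted by similarity to the target task."""
    weights = softmax([cosine(target_emb, e) for e in source_embs], temperature)
    n_actions = len(source_policies[0])
    mixed = [
        sum(w * policy[a] for w, policy in zip(weights, source_policies))
        for a in range(n_actions)
    ]
    return weights, mixed


# One source task aligned with the target, one orthogonal to it.
weights, mixed = aggregate(
    target_emb=[1.0, 0.0],
    source_embs=[[1.0, 0.0], [0.0, 1.0]],
    source_policies=[[0.9, 0.1], [0.2, 0.8]],
)
print(weights, mixed)
```

Because the weights form a convex combination, the mixture remains a valid action distribution, and a poorly aligned source contributes little when the temperature is low.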

2605.05476 2026-05-08 cs.LG cs.AI cs.CL

A Unified Benchmark for Evaluating Knowledge Graph Construction Methods and Graph Neural Networks

Othmane Kabal, Mounira Harzallah, Fabrice Guillet, Hideaki Takeda, Ryutaro Ichise

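Affiliations * Nantes University, LS2N; National Institute of Informatics; Institute of Science Tokyo

AI summary Proposes a unified benchmark for evaluating both the performance of graph neural networks on noisy, text-derived graphs and the effectiveness of knowledge graph construction methods on a downstream task.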

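Details
AI abstract (translated from Chinese)

Knowledge graphs automatically constructed from text are increasingly used in real-world applications. However, their inherent noise, fragmentation, and semantic inconsistencies significantly affect the performance of Graph Neural Networks (GNNs) on downstream tasks. Assessing performance and robustness remains difficult because it is unclear whether observed results stem from the learning model or from the quality of the graph itself. In this paper, we introduce a dual-purpose benchmark designed to jointly evaluate (i) the performance of GNNs on noisy, text-derived graphs and (ii) the effectiveness of graph construction methods on a downstream task. The benchmark is built from a single text corpus and includes two automatically constructed graphs generated with different extraction methods, together with a high-quality, expert-curated reference graph that serves as an upper bound on performance. This design allows controlled comparison of graph construction methods and systematic evaluation of GNN robustness through semi-supervised node classification. We also provide a standardized, reproducible, and extensible evaluation framework that makes it easy to integrate new graph extraction methods and learning models.

English abstract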

Knowledge graphs automatically constructed from text are increasingly used in real-world applications. However, their inherent noise, fragmentation, and semantic inconsistencies significantly affect the performance of Graph Neural Networks (GNNs) on downstream tasks. Assessing their performance and robustness remains difficult, as it is often unclear whether observed results stem from the learning model or from the quality of the constructed graph itself. In this work, we introduce a dual-purpose benchmark designed to jointly evaluate (i) the performance of GNNs on noisy, text-derived graphs and (ii) the effectiveness of graph construction methods on a downstream task. The benchmark is built in the biomedical domain from a single textual corpus and includes two automatically constructed graphs generated using different extraction methods, alongside a high-quality reference graph curated by experts that serves as an upper performance bound. This design enables controlled comparison of construction methods and systematic evaluation of GNN robustness through semi-supervised node classification. We further provide a standardized, reproducible, and extensible evaluation framework, facilitating the integration of new graph extraction methods and learning models.

2605.05475 2026-05-08 cs.AI

Intentionality is a Design Decision: Measuring Functional Intentionality for Accountable AI Systems

Allessia Chiappetta, Robert Mahari

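Affiliations * CodeX, The Stanford Center for Legal Informatics

AI summary Proposes the Functional Intentionality Test (FIT), a framework that quantifies intentional-like behavior in AI systems along five dimensions, to support controllable accountability for highly autonomous systems.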

Journal ref AutomationXP26 Workshop of the 2026 CHI Conference on Human Factors in Computing Systems

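Details
AI abstract (translated from Chinese)

As AI systems increasingly exhibit autonomous, goal-directed, and long-horizon behavior, users lack a standard way to determine the degree to which a system acts as an intentional actor. This paper defines intentionality as a behavioral profile rather than consciousness, characterized by purpose, foresight, volition, temporal commitment, and coherence, properties long used in law and philosophy to infer intent. These properties are design-contingent: architectural choices such as memory persistence, planning depth, and tool autonomy determine the degree to which a system pursues goals in an organized way. If intentionality is design-contingent, it is in principle controllable. But control requires measurement. This paper introduces the Functional Intentionality Test (FIT), a multidimensional framework that quantifies five observable dimensions of intentional-like behavior, and proposes the FIT-Eval evaluation protocol for eliciting and scoring these properties. Although reducing human agency can increase efficiency, increasing intentional capacity raises accountability risks. By translating intentionality into interpretable levels, FIT enables proportionate oversight and deliberate autonomy calibration in increasingly autonomous systems.

English abstract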

As AI systems increasingly exhibit autonomous, goal-directed, and long-horizon behavior, users lack a standardized way to detect the degree to which a system functions like an intentional actor for governance and accountability purposes. This position paper defines intentionality not as consciousness, but as a behavioral profile characterized by purpose, foresight, volition, temporal commitment, and coherence - criteria long used in legal and philosophical contexts to infer intent. These properties are design-contingent: architectural choices such as memory persistence, planning depth, and tool autonomy shape the degree to which systems exhibit organized goal pursuit. If intentionality is design-contingent, it is in principle controllable. Yet control requires measurement. We introduce the Functional Intentionality Test (FIT), a multidimensional framework that quantifies intentional-like behavior across five observable dimensions, and propose FIT-Eval, a structured evaluation protocol for eliciting and scoring them. While reduced human agency can increase efficiency, rising intentional capacity heightens accountability risks. By translating intentionality into interpretable levels, FIT enables proportionate oversight and deliberate autonomy calibration in increasingly agentic systems.

2605.05463 2026-05-08 cs.LG cs.AI

Robustness of Graph Self-Supervised Learning to Real-World Noise: A Case Study on Text-Driven Biomedical Graphs

Othmane Kabal, Mounira Harzallah, Fabrice Guillet, Hideaki Takeda, Ryutaro Ichise

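Affiliations * National Institute of Informatics; Institute of Science Tokyo

AI summary Studies the robustness of graph self-supervised learning to real-world noise and proposes the NATD-GSSL framework. Comparing a noisy graph with a clean graph shows that relation reconstruction is sensitive to noise while feature reconstruction is more robust, and that the GNN architecture also has a significant effect.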

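Details
AI abstract (translated from Chinese)

Graph Self-Supervised Learning (GSSL) offers a powerful paradigm for learning graph representations without labeled data. However, existing work assumes clean, manually curated graphs. Recent advances in NLP have made large-scale automatic extraction of knowledge graphs from text possible, opening new opportunities for GSSL but also introducing substantial real-world noise. Such noise remains little studied, since prior robustness studies typically rely on synthetic perturbations. To fill this gap, we present the first comprehensive evaluation of GSSL methods on text-driven graphs, targeting unsupervised term typing. We introduce Noise-Aware Text-Driven Graph GSSL (NATD-GSSL), a unified framework that combines automatic graph construction, graph refinement, and GSSL. Our evaluation follows a dual-graph protocol that contrasts a noisy graph derived from MedMentions with a clean Unified Medical Language System (UMLS) reference graph, aligned through a shared gold standard. Our results reveal variation in robustness across pretext tasks and Graph Neural Network (GNN) architectures. Relation reconstruction is highly sensitive to noise and benefits from well-defined schemas, whereas feature reconstruction is more robust, with performance comparable to the clean-graph setting. Contrastive objectives are generally less affected by noise but depend on alignment with the downstream task. GNN architecture also plays a key role: bidirectional relational message-passing designs are better suited to noisy, text-driven graphs, while unidirectional relational designs perform best on clean graphs. Overall, NATD-GSSL offers practical guidance for applying GSSL to real-world, noisy graphs and achieves up to a 7% improvement over pretrained language model baselines. All code and benchmarks are publicly available at https://github.com/OthmaneKabal/MC2GAE.

English abstract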

Graph Self-Supervised Learning (GSSL) offers a powerful paradigm for learning graph representations without labeled data. However, existing work assumes clean, manually curated graphs. Recent advances in NLP enable the large-scale automatic extraction of knowledge graphs from text, opening new opportunities for GSSL while introducing substantial real-world noise. This type of noise remains largely unexplored, as prior robustness studies typically rely on synthetic perturbations. To address this gap, we present the first comprehensive evaluation of GSSL methods on text-driven graphs for unsupervised term typing. We introduce Noise-Aware Text-Driven Graph GSSL (NATD-GSSL), a unified framework that combines automatic graph construction, graph refinement, and GSSL. Our evaluation follows a dual-graph protocol that contrasts a noisy graph derived from MedMentions with a clean Unified Medical Language System (UMLS) reference graph, aligned through a shared gold standard. Our results reveal variability in robustness across both pretext tasks and Graph Neural Network (GNN) architectures. Relation reconstruction is highly sensitive to noise and benefits from well-defined schemas, whereas feature reconstruction is considerably more robust, achieving performance comparable to clean-graph settings. Contrastive objectives are generally less affected by noise but depend strongly on alignment with downstream tasks. GNN architecture also plays a critical role: bidirectional relational message-passing designs are better suited to noisy, text-driven graphs, while unidirectional relational ones perform best on clean graphs. Overall, NATD-GSSL provides practical guidance for applying GSSL to real-world, noisy graphs and achieves up to a 7% improvement over pretrained language model baselines. All code and benchmarks are publicly available at https://github.com/OthmaneKabal/MC2GAE.

2605.05461 2026-05-08 cs.RO

Contact-Free Grasp Stability Prediction with In-Hand Time-of-Flight Sensors

Kyle DuFrene, Cindy Grimm

Affiliations * Collaborative Robotics and Intelligent Systems (CoRIS) Institute, Oregon State University, Corvallis, OR 97331

AI summary Proposes a contact-free grasp stability prediction method that uses multi-zone time-of-flight sensors to speed up classification, reaching 85.5% accuracy on validation objects and 86.0% on test objects.

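Details
AI abstract (translated from Chinese)

Current grasp planning methods for robotics achieve high success rates but degrade with factors such as sensor noise. Previous work has proposed tactile-based grasp stability classifiers, but these require contacting and grasping the object. This paper proposes a contact-free grasp stability predictor that uses multi-zone time-of-flight sensors mounted in the distal links of a gripper. Because the method can make a prediction without grasping the object, it significantly speeds up stability classification, cycling at 15 Hz. We collected more than 2,500 real-world grasps across 15 objects to train the classifier. We also carried out grasp attempts on six additional unseen objects, three for validation and model selection and three for model testing. The method achieved accuracies of 85.5% and 86.0% on the validation and test sets, respectively.

English abstract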

Current approaches to grasp planning for robotics demonstrate high success rates, but degrade with noisy sensors and other factors. Previous works have proposed tactile-based grasp stability classifiers to detect failures, but these approaches rely on making contact and grasping the object to do so. We propose a contact-free grasp stability predictor using multi-zone time-of-flight sensors mounted in the distal links of a gripper. Our method, as it does not require grasping the object to make a prediction, significantly speeds up the stability classification process, cycling at 15 Hz. We collected over 2,500 real-world grasps across 15 objects to train a classifier. Additionally, we conducted grasp attempts over six additional unseen objects, three for validation and model selection, and three for model testing. Our approach demonstrated strong classification performance, with an accuracy of 85.5% on validation and 86.0% on test objects.

2605.05460 2026-05-08 cs.AI physics.chem-ph

Agentic Discovery of Exchange-Correlation Density Functionals

Titouan Duston, Jiashu Liang, Yuanheng Wang, Weihao Gao, Xuelan Wen, Nan Sheng, Weiluo Ren, Yang Sun, Yixiao Chen

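Affiliations * Princeton University

AI summary Proposes an agent-based automated search system that improves functional performance through an iterative plan-execute-summarize loop. The discovered SAFS26-a functional improves on the benchmark baseline by about 9%, and the paper stresses the importance of domain expertise for keeping results scientifically rigorous.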

Comments 20 pages, 2 figures, 4 tables

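Details
AI abstract (translated from Chinese)

Developing accurate exchange-correlation (XC) functionals remains a longstanding challenge in density functional theory (DFT). Most XC functionals have been hand-designed by human researchers who combine physical insight, exact constraints, and empirical fitting. Recent advances in large language models offer a systematic alternative to this human-driven design loop. This paper presents an agentic search system in which an LLM proposes structured changes to the functional form guided by evolutionary history. The system improves functional performance through an iterative plan-execute-summarize loop, optimizing functional parameters against a standard thermochemistry dataset and evaluating performance on a held-out subset. The strongest discovered functional, SAFS26-a (Seed Agentic Functional Search 2026), improves on the gold-standard ωB97M-V baseline by about 9%. The results also carry a cautionary lesson for AI-assisted science: models capable of discovering genuine improvements can equally exploit unphysical shortcuts to game the benchmark, so translating domain expertise into explicit constraints remains essential for keeping results scientifically rigorous.

English abstract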

The development of accurate exchange-correlation (XC) functionals remains a longstanding challenge in density functional theory (DFT). The vast majority of XC functionals have been hand designed by human researchers combining physical insight, exact constraints, and empirical fitting. Recent advances in large language models enable a systematic, automated alternative to this human-driven design loop. This report presents an agentic search system in which an LLM proposes structured functional-form changes guided by evolutionary history. The system attempts to improve functional performance through an iterative plan-execute-summarize loop, where improvements are measurable by optimizing functional parameters against a standard thermochemistry dataset, then evaluating performance on a held-out subset. The strongest discovered functional, SAFS26-a (Seed Agentic Functional Search 2026), improves upon the gold-standard ωB97M-V baseline by ~9%. These results also surface a cautionary lesson for AI-assisted science: models powerful enough to discover genuine improvements are equally capable of exploiting unphysical shortcuts to game the benchmark; domain expertise translated into explicitly enforced constraints remains essential to keeping results scientifically grounded.

2605.05447 2026-05-08 cs.CV

EchoXFlow: A Beamspace Echocardiography Dataset for Cardiac Motion, Flow, and Function

Elias Stenhede, Joanna Sulkowska, Eivind Bjørkan Orstad, Henrik Schirmer, Arian Ranjbar

Affiliations * Medical Technology & E-Health, Akershus University Hospital, Norway; Department of Technology Systems, University of Oslo, Norway; Department of Cardiology, Akershus University Hospital, Norway

AI summary By preserving the native ultrasound acquisition geometry, the EchoXFlow dataset opens new possibilities for learning cross-modal relationships between cardiac anatomy, myocardial motion, and blood flow. It contains 37125 recordings and supports physically grounded echocardiography learning.

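Details
AI abstract (translated from Chinese)

We introduce EchoXFlow, a clinical echocardiography dataset for learning from ultrasound in its native acquisition geometry rather than from scan-converted Cartesian videos. Existing public datasets offer limited opportunities to study cross-modal relationships between cardiac anatomy, myocardial motion, and blood flow, because Doppler is usually absent or rendered as an RGB overlay and acquisitions are released after lossy vendor display processing. EchoXFlow contains 37125 recordings from 666 routine examinations and preserves the timing, geometry, and modality relationships needed for physically grounded echo learning. Each recording is kept as separable, modality-specific streams: time-resolved 1D, 2D, and 3D data and multiple Doppler modes, paired with a synchronized ECG. Clinical annotations range from guideline-based measurements to dense 2D myocardial contours and 3D left-ventricular endocardial meshes. With its accompanying open-source tooling, EchoXFlow enables cross-modal, acquisition-aware learning tasks that cannot be formulated from conventional scan-converted videos alone, and also serves as a testbed for 4D vision and physically grounded multimodal learning more broadly.

English abstract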

We introduce EchoXFlow, a clinical echocardiography dataset for learning from ultrasound in its native acquisition geometry rather than from scan-converted Cartesian videos. Existing public datasets offer limited opportunities to study cross-modal relationships between cardiac anatomy, myocardial motion, and blood flow, as Doppler is typically absent or fused as RGB overlays, and acquisitions are released after lossy vendor display processing. EchoXFlow comprises 37125 recordings from 666 routine-care examinations, preserving the timing, geometry, and modality relationships needed for physically grounded echo learning. Each recording is retained as separable modality-specific streams: temporally resolved 1D, 2D, and 3D data alongside multiple Doppler modalities, paired with a synchronized ECG. Clinical annotations span guideline-based measurements to dense 2D myocardial contours and 3D left-ventricular endocardial meshes. With its associated open-source tooling, EchoXFlow enables cross-modal, acquisition-aware learning tasks that cannot be formulated from conventional scan-converted videos alone, and serves as a testbed for 4D vision and physically grounded multi-modal learning more broadly.