arXivDaily: arXiv daily academic digest, updated Monday to Friday

2605.26358 2026-06-12 physics.flu-dyn cs.LG version update

Deep Learning-based Algebraic Reynolds Stress Closures for RANS Simulations of Turbulent Flows

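Daniel Dehtyriov, Jonathan F. MacArt, Justin Sirignano

Affiliations Mathematical Institute, University of Oxford; Aerospace and Mechanical Engineering, University of Notre Dame

AI summary Proposes DARSM, a physics-derived deep learning closure model in which a neural network maps flow invariants to the empirical parameters of an implicit algebraic Reynolds stress equation, with adjoint equations enabling end-to-end optimisation; on square-duct and periodic-hill benchmarks it reduces average velocity error by 2-4x.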

Abstract

Turbulence is ubiquitous in engineering and science, yet direct simulation is prohibitively expensive. The Reynolds-averaged Navier-Stokes (RANS) equations provide savings exceeding ten orders of magnitude but introduce unclosed terms (the closure problem). Offline-trained machine-learning (ML) closures suffer distribution shift in predictive simulations, while ML methods that bypass the governing equations struggle to generalise from scarce high-fidelity data. We develop a physics-derived deep learning closure model for RANS, the Deep Algebraic Reynolds Stress Model (DARSM), which can be trained on small datasets and generalises accurately across Reynolds numbers, to unseen geometries, and to different flow regimes. A neural network maps flow invariants to empirical parameters in an implicit algebraic Reynolds stress equation, derived from the Reynolds stress transport equations under the weak-equilibrium assumption, imposing physics-based structure on the ML closure. End-to-end optimisation through the governing PDEs and the coupled implicit closure eliminates distribution shift, but both unrolled and implicit automatic differentiation fail on the stiff coupled solver. We derive adjoint equations that exploit the solver's implicit-explicit structure for efficient optimisation. On canonical square-duct and periodic-hill benchmarks, DARSM reduces average test velocity error over baseline RANS by $2$-$4\times$ across Reynolds numbers, geometries, and flow regimes, with peak case-level reductions of $12\times$. The model trained on attached, anisotropy-dominated flows (square duct) accurately generalises without retraining to separated flows (periodic hills), a regime change in the underlying physics. DARSM also outperforms five established ML methods: offline training, tensor-basis neural networks, field-inversion machine learning, DeepONets, and physics-informed neural networks.

2605.26144 2026-06-12 cs.SE cs.AI cs.CV version update

VISTA: An End-to-End Benchmark for Visual Spec-to-Web-App Coding Agents

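JunJia Guo, Yuhang Yao, Jiawei Zhou, Jingdi Chen

Affiliations University of Arizona; Zoom; Stony Brook University

AI summary Proposes the VISTA benchmark, which uses multiple input conditions and evaluation metrics to measure how well LLM-based agents generate functional, visually coherent web apps from visual specifications.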

Comments Project page: https://kaboider.github.io/VIS_APP/; Code: https://github.com/kaboider/VIS_APP_Code; Dataset: https://huggingface.co/datasets/JunJiaGuo/VIS-APP-Bench

Abstract

We present VISTA (VIsual Spec-To-App Benchmark), a benchmark for evaluating the end-to-end web-app generation capabilities of LLM-based agents. Unlike prior code generation benchmarks that focus on algorithmic tasks, VISTA targets realistic UI-centric development, where agents must produce functional, visually coherent applications from underspecified inputs. We define five prompt-information conditions that vary along two axes, visual/structural fidelity and stack constraint: (1) text only with free stack choice, (2) text with reference screenshots under three specified stacks, (3) text with reference screenshots under free stack choice, (4) text with screenshots and pruned Figma structure under a single specified stack, and (5) text with screenshots and pruned Figma structure under free stack choice. To enable robust evaluation, each page in the benchmark is manually annotated with interactive UI components and around three visual anchor points, addressing the well-known limitations of script-based testing tools such as Playwright in open-ended code generation settings. Evaluation combines DOM-grounded reference matching, behavior-specific browser tests, and CLIP-based visual similarity, jointly measuring structural alignment, behavioral completeness, and overall visual fidelity. We use VISTA to assess four agent systems drawn from two model families and two harnesses, finding that visual fidelity and functional correctness are partially decoupled across both input conditions and agents, and that agent editing style varies sharply but is largely orthogonal to task quality. VISTA establishes a rigorous and reproducible foundation for advancing agent-based software engineering research.

2512.15133 2026-06-12 cs.CE cs.AI version update

HD-Prot: A Protein Language Model for Joint Sequence-Structure Modeling with Continuous Structure Tokens

Yi Zhou, Haohao Qu, Yunqing Liu, Shanru Lin, Le Song, Wenqi Fan

Affiliations The Hong Kong Polytechnic University; Mohamed bin Zayed University of Artificial Intelligence

AI summary Proposes HD-Prot, a hybrid diffusion protein language model that extends a sequence-based pLM to multiple modalities through continuous structure tokens, enabling joint sequence-structure modeling with competitive performance across several tasks.

Comments This is the long version of the corresponding paper to appear at KDD 2026

Abstract

Proteins inherently possess a consistent sequence-structure duality. The abundance of protein sequence data, which can be readily represented as discrete tokens, has driven fruitful developments in protein language models (pLMs). A key remaining challenge, however, is how to effectively integrate continuous structural knowledge into pLMs. Current methods often discretize protein structures to accommodate the language modeling framework, which inevitably results in the loss of fine-grained information and limits the performance potential of multimodal pLMs. In this paper, we argue that such concerns can be circumvented: a sequence-based pLM can be extended to incorporate the structure modality through continuous tokens, i.e., high-fidelity protein structure latents that avoid vector quantization. Specifically, we propose a hybrid diffusion protein language model, HD-Prot, which embeds a continuous-valued diffusion head atop a discrete pLM, enabling seamless operation with both discrete and continuous tokens for joint sequence-structure modeling. It captures inter-token dependencies across modalities through a unified absorbing diffusion process, and estimates per-token distributions via categorical prediction for sequences and continuous diffusion for structures. Extensive results demonstrate that HD-Prot achieves competitive performance in unconditional sequence-structure co-generation, motif-scaffolding, protein structure prediction, and inverse folding tasks. Furthermore, our method can perform on par with state-of-the-art multimodal pLMs, despite being developed under limited computational resources (i.e., less than one-tenth the budget for modality extension fine-tuning). It highlights the viability of simultaneously estimating categorical and continuous distributions within a unified language model architecture, offering a promising alternative direction for multimodal pLMs.

2605.14568 2026-06-12 cs.SE cs.CL cs.LG version update

Given, When, Then, Again: Mining Subscenario Refactoring Candidates in Behaviour-Driven Test Suites with ML Classifiers and LLM-Judge Baselines

Ali Hassaan Mughal, Noor Fatima, Muhammad Bilal

Affiliations Independent Researcher (Applied MBA (Data Analytics), Texas Wesleyan University); Independent Researcher (B.E. Computer Engineering, National University of Sciences and Technology (NUST)); Independent Researcher (M.Sc. Management, Technical University of Munich)

AI summary Uses ML classifiers and LLM-judge baselines to identify extractable subscenarios in behaviour-driven development test suites and quantifies their prevalence across the public BDD ecosystem.

Comments 31 pages, 10 figures, 6 tables, 56 references. v2: retitled; reference list fully corrected and verified; decision-threshold sensitivity analysis and imbalance-robust baseline metrics added; figures restyled. Reproduction package at https://github.com/amughalbscs16/cukereuse_subscenarios_release (Apache-2.0). Upstream cukereuse corpus at https://doi.org/10.5281/zenodo.19754359

Abstract

Context. Behaviour-Driven Development (BDD) test suites accumulate duplicated step subsequences. Three published refactoring patterns are available (within-file Background, within-repo reusable-scenario invocation, cross-organisational shared higher-level step), but no prior work automates which recurring subsequences are worth extracting or which mechanism applies. Objective. Rank recurring step subsequences ("slices") by refactoring suitability (extraction-worthy), pre-map each to one of the three patterns, and quantify prevalence across the public BDD ecosystem. Method. Every contiguous L-step window (L in [2, 18]) in a 339-repository / 276-upstream-owner Gherkin corpus is keyed by paraphrase-robust cluster identifiers and counted under three scopes. SBERT / UMAP / HDBSCAN clustering recovers paraphrase-equivalent slices. Three authors label a stratified 200-slice pool against a written rubric. An XGBoost extraction-worthy classifier trained under 5-fold cross-validation is compared with a tuned rule baseline and two open-weight Large Language Model (LLM) judges. Results. The miner produces 5,382,249 slices collapsing to 692,020 recurring patterns. Three-author Fleiss' kappa = 0.56 (extraction-worthy) and 0.79 (mechanism). The classifier reaches out-of-fold F1 = 0.891 (95% CI [0.852, 0.927]), outperforming both the rule baseline (F1 = 0.836, p = 0.017) and the better LLM judge (F1 = 0.728, p = 1.5e-4). 75.0%, 59.5%, and 11.7% of scenarios carry a within-file Background, within-repo reusable-scenario, and cross-organisational shared-step candidate, respectively; the figures are stable under a sweep of the classifier decision threshold. Conclusion. Paraphrase-robust subscenario discovery yields a corpus-wide census of BDD refactoring candidates; pipeline, classifier predictions, labelled pool, and rubric are released under Apache-2.0.
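The windowing step in the Method can be sketched as follows. This is a minimal illustration rather than the authors' pipeline: it keys each window by exact step text instead of the paraphrase-robust cluster identifiers the abstract describes, and the function name `mine_slices` and the sample scenarios are hypothetical.

```python
from collections import Counter


def mine_slices(scenarios, min_len=2, max_len=18):
    """Count every contiguous window of min_len..max_len steps.

    Assumption: windows are keyed by exact step text; the abstract instead
    keys them by paraphrase-robust cluster identifiers.
    """
    counts = Counter()
    for steps in scenarios:
        longest = min(max_len, len(steps))
        for length in range(min_len, longest + 1):
            for start in range(len(steps) - length + 1):
                counts[tuple(steps[start:start + length])] += 1
    return counts


# Hypothetical two-scenario suite sharing its first two steps.
scenarios = [
    ["Given a user", "When they log in", "Then they see home"],
    ["Given a user", "When they log in", "Then they see an error"],
]
counts = mine_slices(scenarios)
total_slices = sum(counts.values())
recurring = {s: n for s, n in counts.items() if n >= 2}
```

Here `total_slices` counts every window, while `recurring` keeps only the windows seen at least twice, which mirrors the distinction between slices and recurring patterns in the Results.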

2605.12542 2026-06-12 astro-ph.IM astro-ph.EP cs.LG version update

Earth Science Foundation Models: From Perception to Reasoning and Discovery

Xiangyu Zhao, Bo Liu, Yuehan Zhang, Zelin Song, Wanghan Xu, Feng Liu, Fengxiang Wang, Ben Fei, Fenghua Ling, Wangxu Wei, Wenlong Zhang, Xiao-Ming Wu

Affiliations Department of Data Science and Artificial Intelligence, The Hong Kong Polytechnic University; Shanghai Artificial Intelligence Laboratory

AI summary Reviews Earth science foundation models, tracing the evolution of their capabilities from perception to multimodal reasoning and scientific discovery, and summarizes their broad applications across the atmosphere, hydrosphere, lithosphere, and other spheres.

Abstract

Large foundation models (FMs) are transforming Earth science by integrating heterogeneous multimodal data, such as multi-platform imagery, gridded reanalysis data, diverse geophysical and geochemical observations, and domain-specific text, to support tasks ranging from basic perception to advanced scientific discovery. This paper provides a unified review of Earth science foundation models (Earth FMs) through two complementary dimensions: depth, which traces the evolution of model capabilities from perception to multimodal reasoning and agentic scientific workflows, and breadth, which summarizes their expanding applications across the atmosphere, hydrosphere, lithosphere, biosphere, anthroposphere, and cryosphere, as well as coupled Earth system processes. Using this framework, we review representative multimodal Earth foundation models and compile more than 200 datasets and benchmarks spanning diverse Earth science tasks and modalities. We further discuss key challenges in multimodal data heterogeneity, scientific reliability and continual updating, scalability and sustainability, and the transition from foundation models to agentic and embodied Earth intelligence, and outline future directions toward more integrated, trustworthy, and actionable AI Earth scientists. Overall, this paper offers a structured roadmap for understanding the development of Earth foundation models from both capability depth and application breadth.

2604.24806 2026-06-12 cs.IR cs.AI cs.DB version update

Versioned Late Materialization for Ultra-Long Sequence Training in Recommendation Systems at Scale

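Liang Guo, Ge Song, Litao Deng, Jianhui Sun, Chufeng Hu, Lu Zhang, Zhen Ma, Shouwei Chen, Weiran Liu, Sarang Masti Sreeshylan, Xiaoxuan Meng, Yanzun Huang

Affiliations Meta Platforms, Inc.

AI summary Proposes a versioned late materialization paradigm that eliminates data redundancy through normalized storage and just-in-time sequence reconstruction, supporting training on ultra-long user interaction histories while reducing storage and I/O overhead and improving model quality.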

Abstract

Modern Deep Learning Recommendation Models (DLRMs) follow scaling laws with sequence length, driving the frontier toward ultra-long User Interaction History (UIH). However, the industry-standard "Fat Row" paradigm, which pre-materializes these sequences into every training example, creates a storage and I/O wall: data infrastructure usage exceeds GPU training capacity because of data redundancy, which is amplified in multi-tenant environments where models with vastly different sequence length requirements share a union dataset. We present a versioned late materialization paradigm that eliminates this redundancy by storing UIH once in a normalized, immutable tier and reconstructing sequences just-in-time during training via lightweight versioned pointers. The system ensures Online-to-Offline (O2O) consistency through a bifurcated protocol that prevents future leakage across both streaming and batch training, while a read-optimized immutable storage layer provides multi-dimensional projection pushdown for heterogeneous model tenants. Disaggregated data preprocessing with pipelined I/O prefetching and data-affinity optimizations masks the latency of training-time sequence reconstruction, keeping training throughput compute-bound by GPUs. Deployed on production DLRMs, the system reduces training data infrastructure resource usage while enabling aggressive sequence length scaling that delivers significant model quality gains, serving as the foundational data infrastructure for modern recommendation model architectures, including HSTU and ULTRA-HSTU.
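The pointer-based reconstruction described in this abstract can be illustrated with a toy in-memory store. This is a sketch under stated assumptions, not the production system: the class `UIHStore`, its method names, and the convention that a version equals the number of events visible are all hypothetical.

```python
class UIHStore:
    """Toy append-only store holding each user's interaction history once."""

    def __init__(self):
        self._events = {}

    def append(self, user_id, event):
        """Append an event and return the new version (events visible so far)."""
        log = self._events.setdefault(user_id, [])
        log.append(event)
        return len(log)

    def materialize(self, user_id, version, max_len):
        """Rebuild the sequence a training example saw at `version`."""
        if max_len <= 0:
            return []
        visible = self._events.get(user_id, [])[:version]
        return visible[-max_len:]


store = UIHStore()
store.append("u1", "click:a")
pointer = store.append("u1", "click:b")  # a training example stores only this
store.append("u1", "click:c")  # arrives after the example was logged
short = store.materialize("u1", pointer, max_len=1)
full = store.materialize("u1", pointer, max_len=100)
```

Two consumers with different sequence-length budgets read the same stored history, and the event appended after the pointer was taken never appears in the rebuilt sequence, which is the future-leakage property the abstract mentions.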

2604.16548 2026-06-12 cs.CR cs.AI cs.CL version update

A Survey on Long-Term Memory Security in LLM Agents: Attacks, Defenses, and Governance Across the Memory Lifecycle

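Zehao Lin, Xixuan Hao, Renyu Fu, Shaobo Cui, Kai Chen, Chunyu Li, Zhiyu Li, Feiyu Xiong

Affiliations MemTensor; Shanghai Jiao Tong University

AI summary Proposes a Memory Lifecycle Framework that systematically analyzes the new threats facing long-term memory in LLM agents, and introduces the architectural primitives of Verifiable Memory Governance (VMG), emphasizing the key role of storage-time provenance and versioning in security.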

Abstract

The emergence of writable, cross-session persistent memory in LLM agents introduces a qualitatively different threat landscape from conventional input-centric security concerns, characterized by three properties: persistence, statefulness, and propagation. To systematically characterize this landscape, we propose a Memory Lifecycle Framework that organizes attacks, defenses, and their cross-phase dependencies along two axes: six lifecycle phases (Write, Store, Retrieve, Execute, Share & Propagate, Forget & Rollback) and four security objectives (Integrity, Confidentiality, Availability, Governance). This analysis in turn exposes the need for formal security guarantees at the system level, motivating Verifiable Memory Governance (VMG), a framework of five architectural primitives that specifies what verifiable mechanisms a long-term-memory system must provide to maintain auditable, recoverable control over its memory state. Our analysis indicates that robust Long-Term Memory (LTM) security cannot be retrofitted at retrieval or execution time alone, but must be anchored in storage-time provenance, versioning, and policy-aware retention from the outset.

2604.07590 2026-06-12 cs.IR cs.AI version update

DCD: Domain-Oriented Design for Controlled Retrieval-Augmented Generation

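Valerii Kovalskii, Nikita Belov, Nikita Miteyko, Igor Reshetnikov, Maksim Maksimov

Affiliations red_mad_robot

AI summary Proposes DCD (Domain-Collection-Document), a hierarchical design that uses structured knowledge representation and multi-stage routing to control retrieval and generation scopes without modifying the language model, improving the robustness and accuracy of RAG on heterogeneous corpora and multi-step queries.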

Comments 14 pages, 4 figures, 2 links, link to HF https://huggingface.co/datasets/redmadrobot-rnd/dcd, link to GIT https://github.com/redmadrobot-rnd/dcd

Abstract

Retrieval-Augmented Generation (RAG) is widely used to ground large language models in external knowledge sources. However, when applied to heterogeneous corpora and multi-step queries, naive RAG pipelines often degrade in quality due to flat knowledge representations and the absence of explicit workflows. In this work, we introduce DCD (Domain-Collection-Document), a domain-oriented design to structure knowledge and control query processing in RAG systems without modifying the underlying language model. The proposed approach relies on a hierarchical decomposition of the information space and multi-stage routing based on structured model outputs, enabling progressive restriction of both retrieval and generation scopes. The architecture is complemented by smart chunking, hybrid retrieval, and integrated validation and generation guardrail mechanisms. We describe the DCD architecture and workflow and discuss evaluation results on a synthetic evaluation dataset, highlighting their impact on robustness, factual accuracy, and answer relevance in applied RAG scenarios.

2503.02178 2026-06-12 stat.ML cs.LG version update

Central Limit Theorems for Stochastic Gradient Descent Quantile Estimators

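Ziyang Wei, Jiaqi Li, Likai Chen, Wei Biao Wu

Affiliations Department of Statistics, University of Chicago; Department of Statistics and Data Science, Washington University in St. Louis

AI summary For quantile estimation by constant-learning-rate SGD, uses Markov chain theory to prove that the stationary distribution converges to a Gaussian as the learning rate tends to zero, giving the first CLT-type theoretical guarantee, and proposes a recursive algorithm for confidence intervals.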

Abstract

This paper develops asymptotic theory for quantile estimation via stochastic gradient descent (SGD) with a constant learning rate. The quantile loss function is neither smooth nor strongly convex. Beyond conventional perspectives and techniques, we view quantile SGD iteration as an irreducible, periodic, and positive recurrent Markov chain, which cyclically converges to its unique stationary distribution regardless of the arbitrarily fixed initialization. To derive the exact form of the stationary distribution, we analyze the structure of its characteristic function by exploiting the stationary equation. We also derive tight bounds for its moment generating function (MGF) and tail probabilities. Synthesizing the aforementioned approaches, we prove that the centered and standardized stationary distribution converges to a Gaussian distribution as the learning rate $\eta \rightarrow 0$. This finding provides the first central limit theorem (CLT)-type theoretical guarantees for the quantile SGD estimator with constant learning rates. We further propose a recursive algorithm to construct confidence intervals of the estimators with statistical guarantees. Numerical studies demonstrate the effective finite-sample performance of the online estimator and inference procedure. The theoretical tools developed in this study are of independent interest for investigating general SGD algorithms formulated as Markov chains, particularly in non-strongly convex and non-smooth settings.
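For readers unfamiliar with quantile SGD, the iteration studied here can be sketched with the standard subgradient recursion on the quantile (pinball) loss, $\theta_{k+1} = \theta_k + \eta\,(\tau - \mathbf{1}\{X_{k+1} \le \theta_k\})$. The abstract does not state the recursion explicitly, so this form, the function name `sgd_quantile`, the Gaussian test data, and the tail-averaging step are assumptions made for illustration.

```python
import random


def sgd_quantile(samples, tau, eta, theta0=0.0):
    """Constant-learning-rate SGD on the quantile (pinball) loss.

    Assumed recursion: theta += eta * (tau - 1{x <= theta}).
    """
    theta = theta0
    path = []
    for x in samples:
        # Move up by eta * tau when x > theta, down by eta * (1 - tau) otherwise.
        theta += eta * (tau - (1.0 if x <= theta else 0.0))
        path.append(theta)
    return path


rng = random.Random(0)
data = [rng.gauss(0.0, 1.0) for _ in range(200_000)]
path = sgd_quantile(data, tau=0.5, eta=0.01)

# With a constant learning rate the iterates keep fluctuating around the
# target quantile, so average the second half of the path to smooth them.
tail = path[len(path) // 2:]
estimate = sum(tail) / len(tail)
```

Every step has size `eta * tau` or `eta * (1 - tau)`, so the iterates live on a discrete grid determined by the initialization, which gives some intuition for why the chain is treated as a Markov chain with its own stationary distribution rather than as a convergent sequence.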

2305.08175 2026-06-12 cs.DB cs.CR cs.LG version update

ResidualPlanner+: a scalable matrix mechanism for marginals and beyond

Guanlin He, Yingtai Xiao, Levent Toksoz, Zeyu Ding, Danfeng Zhang, Daniel Kifer

Affiliations The Pennsylvania State University; Binghamton University; Duke University; TikTok Inc.

AI summary Proposes two scalable matrix mechanisms, ResidualPlanner and ResidualPlanner+, which respectively optimize the accuracy of marginal queries and support more complex workloads (such as range queries), substantially outperforming existing methods in speed and memory.

Abstract

Noisy marginals are a common form of confidentiality-protecting data release and are useful for many downstream tasks such as contingency table analysis, construction of Bayesian networks, and even synthetic data generation. Privacy mechanisms that provide unbiased noisy answers to linear queries (such as marginals) are known as matrix mechanisms. We propose ResidualPlanner and ResidualPlanner+, two highly scalable matrix mechanisms. ResidualPlanner is both optimal and scalable for answering marginal queries with Gaussian noise, while ResidualPlanner+ provides support for more general workloads, such as combinations of marginals and range queries or prefix-sum queries. ResidualPlanner can optimize for many loss functions that can be written as a convex function of marginal variances (prior work was restricted to just one predefined objective function). ResidualPlanner can optimize the accuracy of marginals in large-scale settings in seconds, even when the previous state of the art (HDMM) runs out of memory. It even runs on datasets with 100 attributes in a couple of minutes. Furthermore, ResidualPlanner can efficiently compute variance/covariance values for each marginal (prior methods quickly run out of memory, even for relatively small datasets). ResidualPlanner+ provides support for more complex workloads that combine marginal and range/prefix-sum queries (e.g., a marginal on race, a range query on age, and a combined race/age tabulation that answers age range queries for each race). It even supports custom user-defined workloads on different attributes. With this added flexibility, ResidualPlanner+ is not necessarily optimal, but it is still extremely scalable and outperforms the prior state of the art (HDMM) on prefix-sum queries in terms of both accuracy and speed.
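To make the term "noisy marginal" concrete, the sketch below tabulates a marginal over chosen attributes and adds independent Gaussian noise to every cell. It is a baseline illustration only: it does not implement ResidualPlanner's optimized strategy, it does not calibrate `sigma` to any privacy budget, and the function name, record layout, and attribute values are hypothetical.

```python
import random
from collections import Counter
from itertools import product


def noisy_marginal(records, attrs, domains, sigma, rng):
    """Tabulate the marginal on `attrs` and add Gaussian noise to each cell.

    Assumption: `sigma` is supplied by the caller; no privacy calibration
    is performed here.
    """
    counts = Counter(tuple(r[a] for a in attrs) for r in records)
    # Enumerate every cell of the marginal, including cells with zero count.
    cells = product(*(domains[a] for a in attrs))
    return {cell: counts[cell] + rng.gauss(0.0, sigma) for cell in cells}


records = [
    {"race": "A", "age": "young"},
    {"race": "A", "age": "old"},
    {"race": "B", "age": "old"},
]
domains = {"race": ["A", "B"], "age": ["young", "old"]}

exact = noisy_marginal(records, ["race"], domains, 0.0, random.Random(0))
noisy = noisy_marginal(records, ["race", "age"], domains, 1.0, random.Random(0))
```

Zero-count cells are noised as well, since releasing only the non-empty cells would reveal which combinations occur in the data.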

2601.19072 2026-06-12 cs.SE cs.AI version update

HalluJudge: A Reference-Free Hallucination Detection for Context Misalignment in Code Review Automation

Kla Tantithamthavorn, Hong Yi Lin, Patanamon Thongtanunam, Wachiraphan Charoenwet, Minwoo Jeong, Ming Wu

Affiliations Monash University, Australia; The University of Melbourne, Australia; Atlassian, USA

AI summary Proposes HalluJudge, a reference-free hallucination detection method that assesses the grounding of generated comments through context alignment, with strategies that include multi-branch reasoning; it reaches F1 = 0.85 at a cost of $0.009 and agrees with developer preference in 67% of cases.

Comments Accepted at FSE'26: Industry Track, Full-Length, Peer-Reviewed

Abstract

Large language models (LLMs) have shown strong capabilities in code review automation, such as review comment generation, yet they suffer from hallucinations, where the generated review comments are ungrounded in the actual code, which poses a significant challenge to the adoption of LLMs in code review workflows. To address this, we explore effective and scalable methods for reference-free hallucination detection in LLM-generated code review comments. In this work, we design HalluJudge, which aims to assess the grounding of generated review comments based on context alignment. HalluJudge includes four key strategies ranging from direct assessment to structured multi-branch reasoning (e.g., Tree-of-Thoughts). We conduct a comprehensive evaluation of these assessment strategies across Atlassian's enterprise-scale software projects to examine the effectiveness and cost-efficiency of HalluJudge. Furthermore, we analyze the alignment between HalluJudge's judgment and developer preference for the actual LLM-generated code review comments in real-world production. Our results show that the hallucination assessment in HalluJudge is cost-effective, with an F1 score of 0.85 and an average cost of $0.009. On average, 67% of the HalluJudge assessments are aligned with the developer preference for the actual LLM-generated review comments in online production. Our results suggest that HalluJudge can serve as a practical safeguard to reduce developers' exposure to hallucinated comments, fostering trust in AI-assisted code reviews.

2603.17527 2026-06-12 stat.ML cs.LG math.OC version update

Mirror Descent on Riemannian Manifolds

Jiaxin Jiang, Lei Shi, Jiyuan Tan

Affiliations School of Mathematical Sciences, Fudan University, Shanghai 200433, China; Shanghai Key Laboratory for Contemporary Applied Mathematics, Fudan University, Shanghai 200433, China

AI summary Generalizes mirror descent to Riemannian manifolds, proposing Riemannian Mirror Descent (RMD) and a stochastic variant via reparameterization, and establishes non-asymptotic convergence guarantees; on the Stiefel manifold the framework reduces to Curvilinear Gradient Descent (CGD).

Abstract

Mirror Descent (MD) is a scalable first-order method widely used in large-scale optimization, with applications in image processing, policy optimization, and neural network training. This paper generalizes MD to optimization on Riemannian manifolds. In particular, we develop a Riemannian Mirror Descent (RMD) framework via reparameterization and further propose a stochastic variant of RMD. We also establish non-asymptotic convergence guarantees for both RMD and stochastic RMD. As an application to the Stiefel manifold, our RMD framework reduces to the Curvilinear Gradient Descent (CGD) method proposed in [26]. Moreover, when specializing the stochastic RMD framework to the Stiefel setting, we obtain a stochastic extension of CGD, which effectively addresses large-scale manifold optimization problems.
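As background for the method being generalized, here is a sketch of classical Euclidean mirror descent on the probability simplex with the negative-entropy mirror map, which yields the multiplicative (exponentiated-gradient) update. This is the textbook special case, not the Riemannian framework (RMD) or CGD from the abstract, and the function name, objective, and step size are assumptions chosen for illustration.

```python
import math


def entropic_mirror_descent(grad, x0, step, iters):
    """Mirror descent on the simplex with the negative-entropy mirror map.

    Each iteration scales coordinates by exp(-step * gradient) and then
    renormalizes so the iterate stays on the simplex.
    """
    x = list(x0)
    for _ in range(iters):
        g = grad(x)
        w = [xi * math.exp(-step * gi) for xi, gi in zip(x, g)]
        total = sum(w)
        x = [wi / total for wi in w]
    return x


# Minimize the linear objective f(x) = <c, x> over the simplex; the minimizer
# puts all mass on the coordinate with the smallest cost.
c = [0.9, 0.2, 0.5]
x = entropic_mirror_descent(lambda _x: c, [1 / 3, 1 / 3, 1 / 3], step=0.5, iters=200)
```

The iterate stays feasible without an explicit projection because the mirror map matches the geometry of the constraint set, which is the property the paper extends to manifolds.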

2603.11242 2026-06-12 stat.ML cs.LG version update

A Unified Latent Space Disentanglement VAE Framework with Robust Disentanglement Effectiveness Evaluation

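Xiaoan Lang, Md Mostafizer Rahman, Fang Liu

Affiliations Department of Applied and Computational Mathematics and Statistics; Lucy Family Institute for Data & Society

AI summary Proposes bfVAE, a unified framework integrating several disentangled VAE approaches, develops FVH-LT and DBSR-LS to evaluate disentanglement effectiveness, and introduces the LSSI index to quantify latent structural separation without ground-truth generative factors.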

Abstract

Evaluating and interpreting latent representations, such as those learned by variational autoencoders (VAEs), remains a significant challenge for diverse data types, especially when ground-truth generative factors are unknown. To address this, we unify several state-of-the-art disentangled VAE approaches for latent space disentanglement into one framework, bfVAE. To assess the effectiveness of a disentangled VAE model and enhance latent space interpretability, we propose Feature Variance Heterogeneity via Latent Traversal (FVH-LT) and Dirty Block Sparse Regression in Latent Space (DBSR-LS). To ensure robust interpretability of the learned latent space, we develop a greedy alignment strategy (GAS) that mitigates label switching and aligns latent dimensions across runs, laying the foundation for result aggregation. We also introduce a convenient scalar latent space separation index (LSSI), based on the GAS-aligned outputs of FVH-LT and DBSR-LS, to summarize the overall latent structural separation without knowledge of the ground-truth generative factors. We compare bfVAE to five VAE models and validate the effectiveness of FVH-LT, DBSR-LS, and LSSI on seven tabular and image datasets. Under our examined experimental settings, bfVAE provides a more flexible disentanglement framework and achieves a more favorable overall trade-off between disentanglement and reconstruction than the benchmark VAE models; FVH-LT and DBSR-LS reliably uncover semantically meaningful and domain-relevant latent structures and generally yield consistent results; and LSSI makes an effective quantitative summary of latent structural separation.

2512.24787 2026-06-12 cs.IR cs.AI 版本更新

HiGR: Industrial-Scale Hierarchical Generative Slate Recommendation Framework in Tencent

HiGR:腾讯工业级层次化生成式推荐框架

Yunsheng Pang, Zijian Liu, Yudong Li, Shaojie Zhu, Zijian Luo, Chenyun Yu, Sikai Wu, Shichen Shen, Cong Xu, Bin Wang, Kai Jiang, Chengxiang Zhuo, Zang Li

发表机构 * Platform and Content Group, Tencent(腾讯平台与内容组) Sun Yat-sen University(中山大学)

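AI summary: Proposes HiGR, a framework that uses structured semantic IDs and a hierarchical decoder to address planning efficiency and slate-quality alignment in industrial-scale generative recommendation, improving offline quality by over 10% with a 5x inference speedup.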

详情
AI中文摘要

Slate推荐(在单个展示中向用户呈现排序项目列表)在主流在线平台中无处不在。虽然最近的生成式推荐方法在利用语义ID建模项目序列方面显示出强大潜力,但直接将其应用于工业规模的slate推荐面临根本性脱节:纠缠的SID空间混淆了高级列表规划,长序列上的细粒度自回归解码限制了语义规划效率,而令牌级目标与整体slate质量不一致。在本文中,我们提出HiGR,一个工业规模的层次化生成式slate推荐框架,通过协同设计的流水线弥合这一脱节。首先,HiGR通过前缀对比残差量化VAE(PCRQ-VAE)学习结构化SID。通过强制高级前缀捕获共享语义,PCRQ-VAE创建了一个可控的离散空间,作为高效规划的前提。利用这一结构化空间,我们的层次化Slate解码器(HSD)将自回归建模从纠缠的令牌级解码转变为粗粒度偏好嵌入。该设计显著降低了推理延迟,同时允许显式的全局slate结构规划。最后,这一稳定的规划空间使得基于ORPO的列表级对齐机制能够优化三重目标隐式反馈——排序保真度、真实用户兴趣和多样性。广泛的离线实验表明,HiGR在离线推荐质量上优于最先进的基线超过10%,同时实现了5倍的推理加速。腾讯平台上的在线A/B测试进一步将观看时间提高了1.22%,视频播放量提高了1.73%。HiGR已在多个腾讯平台表面部署,服务数亿用户,证明了其工业规模的适用性。

英文摘要

Slate recommendation, which presents users with a ranked item list in a single display, is ubiquitous across mainstream online platforms. While recent generative recommendation methods have shown strong potential in modeling item sequences with semantic IDs (SIDs), directly applying them to industrial-scale slate recommendation faces a fundamental disconnect: entangled SID spaces confound high-level list planning, fine-grained autoregressive decoding over long sequences limits semantic planning efficiency, and token-level objectives misalign with holistic slate quality. In this paper, we propose HiGR, an industrial-scale hierarchical generative framework for slate recommendation that bridges this disconnect through a co-designed pipeline. First, HiGR learns structured SIDs via a Prefix-Contrastive Residual Quantized VAE (PCRQ-VAE). By enforcing high-level prefixes to capture shared semantics, PCRQ-VAE creates a controllable discrete space that acts as a prerequisite for efficient planning. Leveraging this structured space, our Hierarchical Slate Decoder (HSD) shifts autoregressive modeling from entangled token-level decoding to coarse-grained preference embeddings. This design significantly reduces inference latency while allowing explicit global slate structure planning. Finally, this stable planning space enables an ORPO-based listwise alignment mechanism to optimize triple-objective implicit feedback: ranking fidelity, genuine user interest, and diversity. Extensive offline experiments show that HiGR outperforms state-of-the-art baselines by over 10% in offline recommendation quality while achieving a $5\times$ inference speedup. Online A/B tests on Tencent platforms further improve watch time by 1.22% and video plays by 1.73%. HiGR has been deployed on multiple Tencent platform surfaces, serving hundreds of millions of users and proving its industrial-scale applicability.
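To make the idea of a hierarchical semantic ID concrete, the sketch below shows plain residual quantization, the generic mechanism that residual-quantized VAEs build on. It is not the PCRQ-VAE itself: the prefix-contrastive training is omitted, and the codebooks, vector, and function names are hypothetical values chosen for illustration.

```python
def nearest(codebook, vector):
    """Index of the codeword closest to `vector` in squared Euclidean distance."""
    def sq_dist(codeword):
        return sum((c - v) ** 2 for c, v in zip(codeword, vector))
    return min(range(len(codebook)), key=lambda k: sq_dist(codebook[k]))


def residual_quantize(vector, codebooks):
    """Return one code index per level plus the final residual.

    Each level quantizes whatever the previous levels left unexplained,
    so early codes are coarse and later codes refine them.
    """
    residual = list(vector)
    codes = []
    for codebook in codebooks:
        k = nearest(codebook, residual)
        codes.append(k)
        residual = [r - c for r, c in zip(residual, codebook[k])]
    return codes, residual


# Hypothetical two-level codebooks: a coarse level and a fine level.
codebooks = [
    [[0.0, 0.0], [10.0, 10.0]],
    [[0.0, 0.0], [1.0, -1.0]],
]
codes, residual = residual_quantize([11.0, 9.0], codebooks)
```

Here the list `codes` plays the role of a semantic ID: items that share the first code share a coarse region of the embedding space, which is the property a prefix-structured ID space relies on.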

2602.13379 2026-06-12 cs.CR cs.AI cs.CL cs.LG cs.SE 版本更新

Unsafer in Many Turns: Benchmarking and Defending Multi-Turn Safety Risks in Tool-Using Agents

多轮交互中的安全隐患:工具使用智能体的多轮安全风险基准与防御

Xu Li, Simon Yu, Minzhou Pan, Yiyou Sun, Bo Li, Dawn Song, Xue Lin, Weiyan Shi

发表机构 * Stanford University(斯坦福大学) UC Berkeley(加州大学伯克利分校)

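AI summary: Introduces MT-AgentRisk, a multi-turn tool-use safety benchmark, and finds that attack success rate rises by 16% on average in multi-turn settings; also proposes ToolShield, a training-free, tool-agnostic self-exploration defense that reduces attack success rate by 30% on average.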

详情
AI中文摘要

基于LLM的智能体能力日益增强,但其安全性滞后。这造成了智能体能够做什么和应该做什么之间的差距。随着智能体进行多轮交互并使用多样化的工具,这一差距扩大,引入了现有基准忽视的新风险。为了系统地将安全测试扩展到多轮、工具真实的设置,我们提出一个原则性的分类法,将单轮有害任务转化为多轮攻击序列。利用该分类法,我们构建了MT-AgentRisk(多轮智能体风险基准),这是首个评估多轮工具使用智能体安全性的基准。我们的实验揭示了显著的安全退化:在开放和封闭模型的多轮设置中,攻击成功率(ASR)平均增加16%。为了缩小这一差距,我们提出了ToolShield,一种无需训练、与工具无关的自我探索防御方法:当遇到新工具时,智能体自主生成测试用例,执行它们以观察下游效果,并提炼安全经验用于部署。实验表明,ToolShield在多轮交互中平均有效降低ASR 30%。我们的代码可在 https://github.com/CHATS-lab/ToolShield 获取。

英文摘要

LLM-based agents are becoming increasingly capable, yet their safety lags behind. This creates a gap between what agents can do and should do. This gap widens as agents engage in multi-turn interactions and employ diverse tools, introducing new risks overlooked by existing benchmarks. To systematically scale safety testing into multi-turn, tool-realistic settings, we propose a principled taxonomy that transforms single-turn harmful tasks into multi-turn attack sequences. Using this taxonomy, we construct MT-AgentRisk (Multi-Turn Agent Risk Benchmark), the first benchmark to evaluate multi-turn tool-using agent safety. Our experiments reveal substantial safety degradation: the Attack Success Rate (ASR) increases by 16% on average across open and closed models in multi-turn settings. To close this gap, we propose ToolShield, a training-free, tool-agnostic, self-exploration defense: when encountering a new tool, the agent autonomously generates test cases, executes them to observe downstream effects, and distills safety experiences for deployment. Experiments show that ToolShield effectively reduces ASR by 30% on average in multi-turn interactions. Our code is available at https://github.com/CHATS-lab/ToolShield.

2602.07294 2026-06-12 cs.CE cs.AI 版本更新

Fin-RATE: A Real-world Financial Analytics and Tracking Evaluation Benchmark for LLMs on SEC Filings

Fin-RATE:面向SEC文件的金融分析与追踪评估基准

Yidong Jiang, Junrong Chen, Eftychia Makri, Jialin Chen, Peiwen Li, Ali Maatouk, Leandros Tassiulas, Eliot Brenner, Bing Xiang, Rex Ying

发表机构 * Tongji University(同济大学) University of California, San Diego(加州大学圣地亚哥分校) Yale University(耶鲁大学) Goldman Sachs(高盛集团)

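AI summary: To meet the need for LLMs to analyze complex regulatory filings in finance, proposes Fin-RATE, a benchmark built on SEC filings that evaluates models along three task pathways and finds substantial performance degradation in cross-document and cross-period analysis.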

详情
Journal ref
Proceedings of the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2026)
AI中文摘要

随着大型语言模型(LLM)在金融领域的部署日益增多,LLM越来越需要解析复杂的监管披露文件。然而,现有基准通常关注孤立细节,未能反映专业分析的复杂性——这种分析需要综合多个文档、报告期和公司实体的信息。此外,这些基准无法区分错误源于检索失败、生成不准确、领域特定推理错误还是对查询或上下文的误解,从而难以精确诊断性能瓶颈。为弥补这些不足,我们引入Fin-RATE,这是一个基于美国证券交易委员会(SEC)文件构建的基准,通过三条路径模拟金融分析师的工作流程:单个披露文件内的细节导向推理、共享主题下的跨实体比较,以及同一公司在多个报告期内的纵向跟踪。我们在真实上下文和检索增强设置下,对17个领先的LLM(包括开源、闭源和金融专用模型)进行了基准测试。结果显示,当任务从单文档推理转向纵向和跨实体分析时,性能显著下降,准确率分别下降18.60%和14.35%。这种下降与比较幻觉、时间和实体不匹配的增加有关,并进一步反映在推理质量和事实一致性的下降上——这些局限性是现有基准尚未正式分类或量化的。

英文摘要

With the increasing deployment of Large Language Models (LLMs) in the finance domain, LLMs are increasingly expected to parse complex regulatory disclosures. However, existing benchmarks often focus on isolated details, failing to reflect the complexity of professional analysis that requires synthesizing information across multiple documents, reporting periods, and corporate entities. Furthermore, these benchmarks do not disentangle whether errors arise from retrieval failures, generation inaccuracies, domain-specific reasoning mistakes, or misinterpretation of the query or context, making it difficult to precisely diagnose performance bottlenecks. To bridge these gaps, we introduce Fin-RATE, a benchmark built on U.S. Securities and Exchange Commission (SEC) filings and mirroring financial analyst workflows through three pathways: detail-oriented reasoning within individual disclosures, cross-entity comparison under shared topics, and longitudinal tracking of the same firm across reporting periods. We benchmark 17 leading LLMs, spanning open-source, closed-source, and finance-specialized models, under both ground-truth context and retrieval-augmented settings. Results show substantial performance degradation, with accuracy dropping by 18.60% and 14.35% as tasks shift from single-document reasoning to longitudinal and cross-entity analysis. This degradation is associated with increased comparison hallucinations, temporal and entity mismatches, and is further reflected in declines in reasoning quality and factual consistency--limitations that existing benchmarks have yet to formally categorize or quantify.

2602.10132 2026-06-12 physics.plasm-ph cs.AI 版本更新

TokaMark: A Comprehensive Benchmark for MAST Tokamak Plasma Models

TokaMark:MAST托卡马克等离子体模型的综合基准

Cécile Rousseau, Samuel Jackson, Rodrigo H. Ordonez-Hurtado, Nicola C. Amorisco, Tobia Boschi, George K. Holt, Andrea Loreti, Eszter Székely, Alexander Whittle, Adriano Agnello, Stanislas Pamela, Alessandra Pascale, Robert Akers, Juan Bernabe Moreno, Sue Thorne, Mykhaylo Zayats

发表机构 * IBM Research Europe(IBM欧洲研究院) UK Atomic Energy Authority(英国原子能局) STFC Hartree Centre(STFC哈特里中心)

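AI summary: To address fusion data that is scarce, fragmented, and inconsistently annotated, proposes the TokaMark benchmark, which comprises 14 tasks, unifies access to multi-modal fusion data and evaluation protocols, and provides a baseline model to accelerate data-driven AI plasma modeling.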

详情
AI中文摘要

开发和运行托卡马克等商业可行的聚变能源反应堆,需要从稀疏、有噪声且不完整的传感器读数中准确预测等离子体动力学。底层物理的复杂性和实验数据的异质性给传统数值方法带来了巨大挑战,并凸显了现代数据驱动方法的潜力。然而,实现这一潜力的主要障碍是缺乏经过整理、公开可用的数据集和标准化基准。现有的聚变数据集稀缺、分散在不同机构、特定于设施且标注不一致,这限制了可重复性,并阻碍了AI方法的公平和可扩展比较。在本文中,我们介绍了TokaMark,一个结构化基准,用于评估在Mega Ampere Spherical Tokamak(MAST)收集的真实实验数据上的AI模型。TokaMark提供了一套全面的工具,旨在统一多模态聚变数据的访问并标准化评估协议。该基准包括14项精心策划的任务,涵盖一系列物理机制,利用多种诊断手段并覆盖多个操作用例。提供了一个基线模型,以便在统一框架内进行透明的比较和验证。通过建立统一的基准,TokaMark旨在加速数据驱动的AI等离子体建模的进展,为最终实现可持续和稳定的聚变能源做出贡献。数据集、基准、文档和工具已在 https://github.com/UKAEA-IBM-STFC-Fusion-FMs/tokamark_baseline 开源。

英文摘要

Development and operation of commercially viable fusion energy reactors such as tokamaks require accurate predictions of plasma dynamics from sparse, noisy, and incomplete sensor readings. The complexity of the underlying physics and the heterogeneity of experimental data pose formidable challenges for conventional numerical methods, and highlight the promise of modern data-native approaches. A major obstacle in realizing this potential is, however, the lack of curated, openly available datasets and standardized benchmarks. Existing fusion datasets are scarce, fragmented across institutions, facility-specific, and inconsistently annotated, which limits reproducibility and prevents a fair and scalable comparison of AI approaches. In this paper, we introduce TokaMark, a structured benchmark to evaluate AI models on real experimental data collected from the Mega Ampere Spherical Tokamak (MAST). TokaMark provides a comprehensive suite of tools designed to unify access to multi-modal fusion data and standardize evaluation protocols. The benchmark includes a curated list of 14 tasks spanning a range of physical mechanisms, exploiting a variety of diagnostics and covering multiple operational use cases. A baseline model is provided to facilitate transparent comparison and validation within a unified framework. By establishing a unified benchmark, TokaMark aims to accelerate progress in data-driven AI-based plasma modeling, contributing to the broader goal of achieving sustainable and stable fusion energy. The dataset, benchmark, documentation, and tooling are open-sourced at https://github.com/UKAEA-IBM-STFC-Fusion-FMs/tokamark_baseline.

2602.09379 2026-06-12 cs.MA cs.CL 版本更新

LingxiDiagBench: A Multi-Agent Framework for Benchmarking LLMs in Chinese Psychiatric Consultation and Diagnosis

LingxiDiagBench: 用于基准测试大语言模型在中文精神科咨询与诊断中的多智能体框架

Shihao Xu, Tiancheng Zhou, Jiatong Ma, Yanli Ding, Yiming Yan, Ming Xiao, Guoyi Li, Haiyang Geng, Yunyun Han, Jianhua Chen, Yafeng Deng

发表机构 * Tianqiao and Chrissy Chen Institute(天桥和克里斯西·陈研究所) EverMind AI Inc.(EverMind AI公司) Shanghai Mental Health Center, Shanghai Jiao Tong University School of Medicine(上海精神卫生中心,上海交通大学医学院)

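AI summary: Proposes LingxiDiagBench, a multi-agent framework with a dataset of 16K EMR-aligned synthetic consultation dialogues, to evaluate LLMs on static diagnosis and dynamic consultation; finds low accuracy on depression-anxiety comorbidity recognition and 12-way differential diagnosis, and that dynamic consultation often underperforms static evaluation.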

详情
AI中文摘要

精神障碍在全球范围内高度流行,但精神科医生的短缺以及基于访谈诊断固有的主观性,对及时、一致的心理健康评估造成了重大障碍。AI辅助精神科诊断的进展受到缺乏基准测试的限制,这些基准测试需同时提供逼真的患者模拟、临床医生验证的诊断标签,并支持动态多轮咨询。我们提出LingxiDiagBench,一个大规模多智能体基准测试,评估LLM在中文静态诊断推理和动态多轮精神科咨询中的表现。其核心是LingxiDiag-16K,一个包含16,000个电子病历对齐的合成咨询对话数据集,旨在再现12个ICD-10精神科类别中真实的临床人口统计和诊断分布。通过对最先进LLM的大量实验,我们建立了关键发现:(1)尽管LLM在二元抑郁-焦虑分类上达到高准确率(高达92.3%),但在抑郁-焦虑共病识别(43.0%)和12类鉴别诊断(28.5%)上性能显著下降;(2)动态咨询通常不如静态评估,表明无效的信息收集策略显著损害下游诊断推理;(3)由LLM作为评判者评估的咨询质量与诊断准确性仅呈中等相关性,表明结构良好的提问本身并不能确保正确的诊断决策。我们发布LingxiDiag-16K和完整的评估框架,以支持可重复的研究,网址为:https://github.com/Lingxi-mental-health/LingxiDiagBench。

英文摘要

Mental disorders are highly prevalent worldwide, but the shortage of psychiatrists and the inherent subjectivity of interview-based diagnosis create substantial barriers to timely and consistent mental-health assessment. Progress in AI-assisted psychiatric diagnosis is constrained by the absence of benchmarks that simultaneously provide realistic patient simulation, clinician-verified diagnostic labels, and support for dynamic multi-turn consultation. We present LingxiDiagBench, a large-scale multi-agent benchmark that evaluates LLMs on both static diagnostic inference and dynamic multi-turn psychiatric consultation in Chinese. At its core is LingxiDiag-16K, a dataset of 16,000 EMR-aligned synthetic consultation dialogues designed to reproduce real clinical demographic and diagnostic distributions across 12 ICD-10 psychiatric categories. Through extensive experiments across state-of-the-art LLMs, we establish key findings: (1) although LLMs achieve high accuracy on binary depression--anxiety classification (up to 92.3%), performance deteriorates substantially for depression--anxiety comorbidity recognition (43.0%) and 12-way differential diagnosis (28.5%); (2) dynamic consultation often underperforms static evaluation, indicating that ineffective information-gathering strategies significantly impair downstream diagnostic reasoning; (3) consultation quality assessed by LLM-as-a-Judge shows only moderate correlation with diagnostic accuracy, suggesting that well-structured questioning alone does not ensure correct diagnostic decisions. We release LingxiDiag-16K and the full evaluation framework to support reproducible research at https://github.com/Lingxi-mental-health/LingxiDiagBench.

2602.07698 2026-06-12 cs.SE cs.CL 版本更新

On Sequence-to-Sequence Models for Automated Log Parsing

关于自动化日志解析的序列到序列模型

Adam Sorrenti, Andriy Miranskyy

发表机构 * Toronto University(多伦多大学)

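AI summary: Systematically evaluates four sequence modelling architectures (Transformer, Mamba, mono- and bidirectional LSTM) for automated log parsing, finding that Transformer performs best and that Mamba is competitive at lower computational cost, and analyzes the effects of representation choice, sequence length, and data efficiency.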

Comments Added a comparison with large language models

详情
AI中文摘要

上下文:日志解析是软件系统中的关键标准操作流程,支持监控、异常检测和故障诊断。然而,由于日志格式异构、训练与部署数据之间的分布偏移以及基于规则的方法的脆弱性,自动化日志解析仍然具有挑战性。目标:本研究旨在系统评估序列建模架构、表示选择、序列长度和训练数据可用性如何影响自动化日志解析性能和计算成本。方法:我们进行了一项受控实证研究,比较了四种序列建模架构:Transformer、Mamba状态空间、单向LSTM和双向LSTM模型。总共在多个数据集配置下训练了396个模型,并使用相对Levenshtein编辑距离进行统计显著性检验评估。结果:Transformer实现了最低的平均相对编辑距离(0.111),其次是Mamba(0.145)、单LSTM(0.186)和双LSTM(0.265),数值越低越好。Mamba在计算成本大幅降低的情况下提供了有竞争力的准确性。字符级分词通常能提升性能,序列长度对Transformer准确性的实际影响可忽略不计,Mamba和Transformer均表现出比循环模型更强的样本效率。结论:总体而言,Transformer将解析错误降低了23.4%,而Mamba在数据或计算受限的情况下是一个强有力的替代方案。这些结果还阐明了表示选择、序列长度和样本效率的作用,为研究人员和从业者提供了实用指导。

英文摘要

Context: Log parsing is a critical standard operating procedure in software systems, enabling monitoring, anomaly detection, and failure diagnosis. However, automated log parsing remains challenging due to heterogeneous log formats, distribution shifts between training and deployment data, and the brittleness of rule-based approaches. Objectives: This study aims to systematically evaluate how sequence modelling architecture, representation choice, sequence length, and training data availability influence automated log parsing performance and computational cost. Methods: We conduct a controlled empirical study comparing four sequence modelling architectures: Transformer, Mamba state-space, monodirectional LSTM, and bidirectional LSTM models. In total, 396 models are trained across multiple dataset configurations and evaluated using relative Levenshtein edit distance with statistical significance testing. Results: Transformer achieves the lowest mean relative edit distance (0.111), followed by Mamba (0.145), mono-LSTM (0.186), and bi-LSTM (0.265), where lower values are better. Mamba provides competitive accuracy with substantially lower computational cost. Character-level tokenization generally improves performance, sequence length has negligible practical impact on Transformer accuracy, and both Mamba and Transformer demonstrate stronger sample efficiency than recurrent models. Conclusion: Overall, Transformers reduce parsing error by 23.4%, while Mamba is a strong alternative under data or compute constraints. These results also clarify the roles of representation choice, sequence length, and sample efficiency, providing practical guidance for researchers and practitioners.
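The evaluation metric above, relative Levenshtein edit distance, can be sketched as follows. This is a minimal illustration, assuming the distance is normalised by the length of the longer string; the abstract does not state the exact normalisation, so treat that choice and the function names as assumptions.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def relative_edit_distance(predicted: str, reference: str) -> float:
    """Edit distance divided by the longer length (assumed normalisation)."""
    longest = max(len(predicted), len(reference))
    if longest == 0:
        return 0.0
    return levenshtein(predicted, reference) / longest
```

Under this normalisation a score of 0 means the predicted template matches the reference exactly and 1 means nothing aligns, which is consistent with the abstract's statement that lower values are better.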

2602.05121 2026-06-12 eess.SY cs.RO cs.SY 版本更新

Trojan Attacks on Neural Network Controllers for Robotic Systems

针对机器人系统神经网络控制器的木马攻击

Farbod Younesi, Walter Lucia, Amr Youssef

发表机构 * Concordia University(康考迪亚大学) Concordia Institute for Information Systems Engineering(康考迪亚信息系统工程研究所) Fonds de recherche du Québec – Nature et Technologies(魁北克自然与技术研究基金) National Cybersecurity Consortium(国家网络安全联盟)

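AI summary: Designs a lightweight, parallel Trojan network for robotic neural network controllers that tampers with control commands under a specific trigger condition, and validates the attack's effectiveness in simulation.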

Comments Paper submitted to the 2026 IEEE Conference on Control Technology and Applications (CCTA)

详情
AI中文摘要

神经网络控制器越来越多地应用于机器人系统中,用于轨迹跟踪和姿态稳定等任务。然而,它们对可能不可信的训练流程或供应链的依赖引入了显著的安全漏洞。本文以差速驱动移动机器人平台为案例,研究针对神经控制器的后门(木马)攻击。具体来说,假设机器人的跟踪控制器实现为神经网络,我们设计了一个轻量级的并行木马网络,可以嵌入到控制器中。该恶意模块在正常操作期间保持休眠,但在检测到由机器人姿态和目标参数定义的高度特定触发条件时,会破坏主控制器的轮速命令,导致不良且可能不安全的机器人行为。我们提供了所提出的木马网络的概念验证实现,并通过两种不同攻击场景下的仿真进行了验证。结果证实了所提出攻击的有效性,并表明基于神经网络的机器人控制系统面临潜在的关键安全威胁。

英文摘要

Neural network controllers are increasingly deployed in robotic systems for tasks such as trajectory tracking and pose stabilization. However, their reliance on potentially untrusted training pipelines or supply chains introduces significant security vulnerabilities. This paper investigates backdoor (Trojan) attacks against neural controllers, using a differential-drive mobile robot platform as a case study. In particular, assuming that the robot's tracking controller is implemented as a neural network, we design a lightweight, parallel Trojan network that can be embedded within the controller. This malicious module remains dormant during normal operation but, upon detecting a highly specific trigger condition defined by the robot's pose and goal parameters, compromises the primary controller's wheel velocity commands, resulting in undesired and potentially unsafe robot behaviours. We provide a proof-of-concept implementation of the proposed Trojan network, which is validated through simulation under two different attack scenarios. The results confirm the effectiveness of the proposed attack and demonstrate that neural network-based robotic control systems are subject to potentially critical security threats.

2602.04075 2026-06-12 cond-mat.mtrl-sci cs.LG 版本更新

Thermodynamic assessment of machine learning models for solid-state synthesis prediction

固态合成预测机器学习模型的热力学评估

Jane Schlesinger, Simon Hjaltason, Nathan J. Szymanski, Christopher J. Bartel

发表机构 * University of Minnesota(明尼苏达大学)

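AI summary: Assesses the thermodynamic consistency of machine learning models that predict the synthesizability of solid-state materials, finding that the models generally overpredict the likelihood of synthesis, although some scores trend with thermodynamic heuristics.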

详情
AI中文摘要

机器学习模型最近被用于预测假设的固态材料是否可合成。这些模型旨在绕过固态相变的直接第一性原理建模,而是从成功合成材料的大型数据库中学习。在这里,我们评估了几个最近引入的合成预测模型与材料和反应热力学的对齐程度,通过相对于凸包的能量和考虑枚举合成反应的热力学选择性的度量来量化。使用成功合成配方的数据集确定了这两个量的可能界限,超出该界限的材料被认为不太可能被合成。以这些界限为背景,使用CHGNet基础势计算了通过Chemeleon生成模型生成的数千种新假设材料的热力学量。将四个最近发表的用于可合成性预测的机器学习模型应用于同一数据集,并将所得预测与计算的热力学进行比较。我们发现这些模型普遍高估了合成的可能性,但一些模型分数确实与热力学启发式趋势一致,对稳定性较差或没有计算为热力学选择性的可用合成配方的材料分配较低的分数。总的来说,这项工作识别了机器学习模型在材料合成中存在的差距,并引入了一种在缺乏大量负例(失败合成)的情况下评估其质量的新方法。

英文摘要

Machine learning models have recently emerged to predict whether hypothetical solid-state materials can be synthesized. These models aim to circumvent direct first-principles modeling of solid-state phase transformations, instead learning from large databases of successfully synthesized materials. Here, we assess the alignment of several recently introduced synthesis prediction models with material and reaction thermodynamics, quantified by the energy with respect to the convex hull and a metric accounting for thermodynamic selectivity of enumerated synthesis reactions. A dataset of successful synthesis recipes was used to determine the likely bounds on both quantities beyond which materials can be deemed unlikely to be synthesized. With these bounds as context, thermodynamic quantities were computed using the CHGNet foundation potential for thousands of new hypothetical materials generated using the Chemeleon generative model. Four recently published machine learning models for synthesizability prediction were applied to this same dataset, and the resultant predictions were considered against computed thermodynamics. We find these models generally overpredict the likelihood of synthesis, but some model scores do trend with thermodynamic heuristics, assigning lower scores to materials that are less stable or do not have an available synthesis recipe that is calculated to be thermodynamically selective. In total, this work identifies existing gaps in machine learning models for materials synthesis and introduces a new approach to assess their quality in the absence of extensive negative examples (failed syntheses).

2602.00343 2026-06-12 cs.DC cs.AI cs.PF 版本更新

Standardized Methods and Recommendations for Green Federated Learning

绿色联邦学习的标准化方法与建议

Austin Tapp, Holger R. Roth, Ziyue Xu, Abhijeet Parida, Hareem Nisar, Marius George Linguraru

发表机构 * Children’s National Hospital(儿童医院) NVIDIA(英伟达) Children’s National Hospital George Washington University(儿童医院乔治华盛顿大学)

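AI summary: Proposes a carbon-accounting methodology for federated learning based on NVFlare and CodeCarbon, shows experimentally that system slowdowns and coordination effects can substantially increase carbon emissions, and stresses that standardized carbon accounting is necessary for reproducible green FL evaluation.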

详情
AI中文摘要

联邦学习(FL)能够在隐私敏感的分布式数据上进行协作模型训练,但由于不一致的测量边界和异构的报告方式,其环境影响难以跨研究进行比较。我们提出了一种实用的碳核算方法,用于FL的CO2e跟踪,使用NVIDIA NVFlare和CodeCarbon进行显式的、阶段感知的任务(初始化、每轮训练、评估和空闲/协调)。为了捕捉非计算效应,我们还根据网络可配置能量模型,从传输的模型更新大小估计通信排放。我们在两个代表性工作负载上验证了所提出的方法:CIFAR-10图像分类和视网膜视盘分割。在CIFAR-10中,受控的客户端效率场景表明,在原本固定的FL协议下,系统级慢速和协调效应可能对碳足迹产生显著影响,相对于高效率基线,总CO2e增加了8.34倍(中等)和21.73倍(低)。在视网膜分割中,交换GPU层级(H100 vs. V100)产生了1.7倍的运行时间差距(290 vs. 503分钟),同时在不同站点间总能量和CO2e的变化不均匀,强调了按站点和按轮报告的必要性。总体而言,我们的结果支持一种标准化的碳核算方法,作为可复现的“绿色”FL评估的前提。我们的代码可在以下网址获取:https://github.com/Pediatric-Accelerated-Intelligence-Lab/carbon_footprint。

英文摘要

Federated learning (FL) enables collaborative model training over privacy-sensitive, distributed data, but its environmental impact is difficult to compare across studies due to inconsistent measurement boundaries and heterogeneous reporting. We present a practical carbon-accounting methodology for FL CO2e tracking using NVIDIA NVFlare and CodeCarbon for explicit, phase-aware tasks (initialization, per-round training, evaluation, and idle/coordination). To capture non-compute effects, we additionally estimate communication emissions from transmitted model-update sizes under a network-configurable energy model. We validate the proposed approach on two representative workloads: CIFAR-10 image classification and retinal optic disk segmentation. In CIFAR-10, controlled client-efficiency scenarios show that system-level slowdowns and coordination effects can contribute meaningfully to carbon footprint under an otherwise fixed FL protocol, increasing total CO2e by 8.34x (medium) and 21.73x (low) relative to the high-efficiency baseline. In retinal segmentation, swapping GPU tiers (H100 vs. V100) yields a consistent 1.7x runtime gap (290 vs. 503 minutes) while producing non-uniform changes in total energy and CO2e across sites, underscoring the need for per-site and per-round reporting. Overall, our results support a standardized carbon accounting method that acts as a prerequisite for reproducible 'green' FL evaluation. Our code is available at https://github.com/Pediatric-Accelerated-Intelligence-Lab/carbon_footprint.
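The communication-emissions estimate mentioned above can be illustrated with a simple back-of-the-envelope model. This is a sketch under stated assumptions, not the authors' implementation: it assumes each client uploads and downloads one model update per round, and the function name, the energy intensity (`kwh_per_gb`), and the grid carbon intensity (`grams_co2e_per_kwh`) are hypothetical placeholders that a real study would configure per network and per site.

```python
def communication_co2e_grams(update_bytes: int, rounds: int, clients: int,
                             kwh_per_gb: float, grams_co2e_per_kwh: float) -> float:
    """Estimate CO2e (grams) attributable to exchanging model updates.

    Assumes one upload and one download of `update_bytes` per client per round.
    """
    total_bytes = 2 * update_bytes * rounds * clients  # upload + download
    total_gb = total_bytes / 1e9
    energy_kwh = total_gb * kwh_per_gb                 # network energy model
    return energy_kwh * grams_co2e_per_kwh             # grid carbon intensity


# Placeholder values: 500 MB update, 10 rounds, 4 clients.
estimate = communication_co2e_grams(
    update_bytes=500_000_000, rounds=10, clients=4,
    kwh_per_gb=0.05, grams_co2e_per_kwh=400.0,
)
```

With these placeholder numbers, 40 GB of traffic maps to about 2 kWh and roughly 800 g CO2e. The point of keeping both intensities as parameters is that the result is only as meaningful as the network and grid assumptions behind it.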

2601.22003 2026-06-12 stat.ML cs.LG stat.CO 版本更新

Efficient Stochastic Optimisation via Sequential Monte Carlo

通过序贯蒙特卡洛实现高效随机优化

James Cuin, Davide Carbone, Yanbo Tang, O. Deniz Akyildiz

发表机构 * University of California, Berkeley(加州大学伯克利分校)

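AI summary: For optimisation problems with intractable gradients, proposes replacing expensive inner sampling loops with sequential Monte Carlo (SMC) samplers for efficient stochastic optimisation, and validates the approach on reward tuning of energy-based models.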

Comments Accepted to ICML 2026

详情
AI中文摘要

在机器学习和统计学中,从最大边际似然估计过程到生成模型的微调,经常出现优化具有难处理梯度函数的问题。针对这类问题的随机近似方法通常需要内部采样循环来获得(有偏的)随机梯度估计,这很快会变得计算昂贵。在这项工作中,我们开发了用于优化具有难处理梯度函数的序贯蒙特卡洛(SMC)采样器。我们的方法用高效的SMC近似替代昂贵的内部采样方法,这可以带来显著的计算收益。我们为我们的方法所定义的基本递归建立了收敛结果,这些递归由SMC采样器近似。我们在各种设置下对能量模型的奖励调优展示了我们方法的有效性。

英文摘要

The problem of optimising functions with intractable gradients frequently arises in machine learning and statistics, ranging from maximum marginal likelihood estimation procedures to fine-tuning of generative models. Stochastic approximation methods for this class of problems typically require inner sampling loops to obtain (biased) stochastic gradient estimates, which rapidly becomes computationally expensive. In this work, we develop sequential Monte Carlo (SMC) samplers for optimisation of functions with intractable gradients. Our approach replaces expensive inner sampling methods with efficient SMC approximations, which can result in significant computational gains. We establish convergence results for the basic recursions defined by our methodology which SMC samplers approximate. We demonstrate the effectiveness of our approach on the reward-tuning of energy-based models within various settings.

2601.21324 2026-06-12 stat.ML cs.LG 版本更新

Bulk-Calibrated Credal Ambiguity Sets: Fast, Tractable Decision Making under Out-of-Sample Contamination

批量校准的置信模糊集:样本外污染下的快速、可处理决策

Mengqi Chen, Thomas B. Berrett, Theodoros Damoulas, Michele Caprio

发表机构 * University of Bristol(布里斯托大学) University of Cambridge(剑桥大学) University of California, Berkeley(加州大学伯克利分校) University of Oxford(牛津大学)

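AI summary: Proposes bulk-calibrated credal ambiguity sets that separate contamination inside the bulk from the tail contribution, yielding a closed-form, finite risk objective that reduces to linear or second-order cone programs for efficient robust optimisation.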

Comments Accepted for publication (spotlight) at ICML 2026

详情
AI中文摘要

分布鲁棒优化(DRO)在模糊集上最小化最坏情况期望损失,该模糊集可捕捉样本外环境中的分布偏移。虽然Huber(线性-空)污染是$\varepsilon$分数任意扰动的经典最小假设模型,但将其纳入模糊集可能导致最坏情况风险无穷大,且DRO目标变得无意义,除非施加强有界性或支撑假设。我们通过引入批量校准的置信模糊集来解决这些挑战:我们从数据中学习一个高质量批量集,同时考虑批量内的污染,并分别约束剩余尾部贡献。这导致一个闭式、有限的$\mathrm{mean}+\sup$鲁棒目标,以及针对常见损失和批量几何结构的可处理线性或二阶锥规划。通过该框架,我们强调并利用上期望(不精确概率概念)与最坏情况风险之间的等价性,展示IP置信集如何转化为具有可解释容忍水平的DRO目标。在重尾库存控制、地理偏移房价回归和人口偏移文本分类上的实验显示了竞争性的鲁棒性-准确性权衡和高效的优化时间,使用了贝叶斯、频率学派或经验参考分布。

英文摘要

Distributionally robust optimisation (DRO) minimises the worst-case expected loss over an ambiguity set that can capture distributional shifts in out-of-sample environments. While Huber (linear-vacuous) contamination is a classical minimal-assumption model for an $\varepsilon$-fraction of arbitrary perturbations, including it in an ambiguity set can make the worst-case risk infinite and the DRO objective vacuous unless one imposes strong boundedness or support assumptions. We address these challenges by introducing bulk-calibrated credal ambiguity sets: we learn a high-mass bulk set from data while considering contamination inside the bulk and bounding the remaining tail contribution separately. This leads to a closed-form, finite $\mathrm{mean}+\sup$ robust objective and tractable linear or second-order cone programs for common losses and bulk geometries. Through this framework, we highlight and exploit the equivalence between the imprecise probability (IP) notion of upper expectation and the worst-case risk, demonstrating how IP credal sets translate into DRO objectives with interpretable tolerance levels. Experiments on heavy-tailed inventory control, geographically shifted house-price regression, and demographically shifted text classification show competitive robustness-accuracy trade-offs and efficient optimisation times, using Bayesian, frequentist, or empirical reference distributions.
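The reason plain Huber contamination can make the DRO objective vacuous follows from a standard identity, written out below. The notation ($P$ for the reference distribution, $\ell$ for the loss, $B$ for the bulk set) is introduced here for illustration. The second line shows only what restricting contamination to $B$ does; the paper's separate bound on the tail contribution is not reproduced.

```latex
% Huber (linear-vacuous) contamination of a reference distribution P:
\mathcal{P}_{\varepsilon}(P) = \{(1-\varepsilon)P + \varepsilon Q : Q \text{ any distribution}\},
\qquad
\sup_{P' \in \mathcal{P}_{\varepsilon}(P)} \mathbb{E}_{P'}[\ell]
  = (1-\varepsilon)\,\mathbb{E}_{P}[\ell] + \varepsilon \sup_{z} \ell(z).

% If Q is restricted to a bulk set B on which the loss is bounded:
\sup_{Q :\, Q(B) = 1} \mathbb{E}_{(1-\varepsilon)P + \varepsilon Q}[\ell]
  = (1-\varepsilon)\,\mathbb{E}_{P}[\ell] + \varepsilon \sup_{z \in B} \ell(z).
```

The first supremum is infinite whenever $\ell$ is unbounded on the support, which is the vacuity problem the abstract describes. The second is finite and has the $\mathrm{mean}+\sup$ shape.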

2512.23566 2026-06-12 math.DS cond-mat.stat-mech cs.LG math.OC stat.ML 版本更新

From geometry to dynamics: Learning overdamped Langevin dynamics from sparse observations with geometric constraints

从几何到动力学:基于几何约束从稀疏观测学习过阻尼朗之万动力学

Dimitra Maoutsa

发表机构 * Dimitra Maoutsa(迪米特拉·马乌茨)

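AI summary: Proposes a stochastic control framework that uses the geometric structure of the system's invariant density for path augmentation, recovering overdamped Langevin dynamics from sparsely time-sampled data without assuming a parametric model.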

Comments 10+54 pages, 14 figures; accepted at ICML 2026. An earlier account of this work previously appeared in arXiv:2301.08102 and arXiv:2304.00423; the main methodology remains the same, and this version includes additional numerical experiments and theory

详情
AI中文摘要

当随机系统的轨迹在时间上稀疏采样时,我们如何学习其动力学背后的规律?现有方法要么需要时间分辨的高频观测,要么依赖于仅适用于保守系统的几何论证,限制了它们能恢复的动力学范围。在这里,我们提出一个新的框架,通过将推断重新表述为随机控制问题来调和这两种观点。我们的方法使用几何驱动的路径增强,以系统不变密度的几何结构为指导,重构可能的轨迹并推断底层动力学,而不假设特定的参数模型。应用于过阻尼朗之万系统,我们的方法即使在极度欠采样数据下也能准确恢复随机动力学,在合成基准测试中优于现有方法。这项工作证明了将几何归纳偏差纳入随机系统识别方法的有效性。

英文摘要

How can we learn the laws underlying the dynamics of stochastic systems when their trajectories are sampled sparsely in time? Existing methods either require temporally resolved high-frequency observations, or rely on geometric arguments that apply only to conservative systems, limiting the range of dynamics they can recover. Here, we present a new framework that reconciles these two perspectives by reformulating inference as a stochastic control problem. Our method uses geometry-driven path augmentation, guided by the geometry in the system's invariant density to reconstruct likely trajectories and infer the underlying dynamics without assuming specific parametric models. Applied to overdamped Langevin systems, our approach accurately recovers stochastic dynamics even from extremely undersampled data, outperforming existing methods in synthetic benchmarks. This work demonstrates the effectiveness of incorporating geometric inductive biases into stochastic system identification methods.
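To illustrate the data setting described above, the sketch below simulates a one-dimensional overdamped Langevin system with the standard Euler-Maruyama scheme and then keeps only every k-th sample. It shows the kind of sparsely sampled trajectory the method takes as input, not the inference method itself; the drift, step size, and function names are assumptions made for this example.

```python
import math
import random


def simulate_langevin(drift, x0, dt, steps, sigma, seed=0):
    """Euler-Maruyama discretisation of dX = drift(X) dt + sigma dW."""
    rng = random.Random(seed)
    x = x0
    path = [x]
    for _ in range(steps):
        noise = sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)
        x = x + drift(x) * dt + noise
        path.append(x)
    return path


def subsample(path, every):
    """Keep every `every`-th point to mimic sparse-in-time observations."""
    return path[::every]


# Hypothetical example: linear restoring drift, observed once per 100 steps.
fine_path = simulate_langevin(lambda x: -x, x0=1.0, dt=0.01, steps=1000, sigma=0.5)
sparse_obs = subsample(fine_path, every=100)
```

Between consecutive entries of `sparse_obs` the system has taken 100 unobserved steps, so finite-difference drift estimates on the sparse data are badly biased. That gap is what path augmentation is meant to fill.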

2512.21227 2026-06-12 cond-mat.mtrl-sci cs.AI 版本更新

PhononBench: A Large-Scale Phonon-Based Benchmark for Dynamical Stability in Crystal Generation

PhononBench:面向晶体生成中动态稳定性的基于声子的大规模基准

Xiao-Qi Han, Ze-Feng Gao, Wen-Kao Li, Peng-Jie Guo, Zhong-Yi Lu

发表机构 * School of Physics, Renmin University of China(中国人民大学物理学院)

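AI summary: Introduces PhononBench, the first large-scale benchmark for dynamical stability of AI-generated crystals, which uses the MatterSim potential for efficient phonon calculations to evaluate 133,838 structures generated by 7 models and finds an average dynamical-stability rate of only 32.15%.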

Comments 53 pages, 6 figures

详情
AI中文摘要

近年来,生成式人工智能在晶体材料设计方面取得了显著进展,催生了基于图神经网络、扩散模型和大语言模型的方法。现有评估通常遵循稳定性-唯一性-新颖性(S.U.N.)框架,其中稳定性主要使用热力学标准评估,这未能完全捕捉材料实际存在所必需的动态稳定性。动态稳定性是决定材料能否被合成并持续存在的关键因素,声子谱计算是其评估标准。然而,此类计算的高计算成本阻碍了对生成晶体动态稳定性的大规模评估。在这项工作中,我们引入了PhononBench,这是首个针对AI生成晶体动态稳定性的大规模基准。利用最近开发的MatterSim原子间势,该势能在超过10,000种材料中实现了密度泛函理论(DFT)级别的声子预测精度,PhononBench能够对7个领先晶体生成模型生成的133,838个晶体结构进行高效的声子计算和动态稳定性分析。PhononBench揭示了当前生成模型的一个普遍局限性:在-0.1 THz的声子频率阈值下(除非另有说明,所有报告的动态稳定性指标均采用该阈值),所有生成结构的平均动态稳定性率仅为32.15%,表现最佳的模型MatterGen也仅达到45.05%。此外,我们识别出32,995个在-0.001 THz严格阈值下整个布里渊区声子稳定的晶体结构。另外,一个基于网页的服务可通过 http://phononbench.cn/ 访问,实现分钟级的超快声子预测。

英文摘要

In recent years, generative artificial intelligence has made significant advances in the design of crystalline materials, giving rise to approaches based on graph neural networks, diffusion models, and large language models. Existing evaluations commonly follow the stability-uniqueness-novelty (S.U.N.) framework, where stability is primarily assessed using thermodynamic criteria, which do not fully capture the dynamical stability essential for a material's practical existence. Dynamical stability is a key determinant of whether a material can be synthesized and persist, with phonon spectrum calculations serving as the standard for its evaluation. However, the high computational cost of such calculations has prevented large-scale assessment of dynamical stability in generated crystals. In this work, we introduce PhononBench, the first large-scale benchmark for dynamical stability in AI-generated crystals. Leveraging the recently developed MatterSim interatomic potential, which achieves density-functional-theory (DFT)-level accuracy in phonon predictions across more than 10,000 materials, PhononBench enables efficient phonon calculations and dynamical-stability analysis for 133,838 crystal structures generated by 7 leading crystal generation models. PhononBench reveals a widespread limitation of current generative models. At a phonon-frequency threshold of -0.1 THz (used for all reported dynamical-stability metrics unless otherwise specified), the average dynamical-stability rate across all generated structures is only 32.15%, and the top-performing model, MatterGen, reaches just 45.05%. In addition, we identify 32,995 crystal structures that are phonon-stable across the entire Brillouin zone under a strict threshold of -0.001 THz. A web-based service is also accessible at http://phononbench.cn/, enabling minute-level ultra-fast phonon predictions.
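The thresholded stability criterion described above can be sketched in a few lines. This is an illustrative reading of the abstract rather than the benchmark's code: it assumes imaginary phonon modes are reported as negative frequencies (a common convention), and the function names and toy spectra are hypothetical.

```python
def is_dynamically_stable(frequencies_thz, threshold_thz=-0.1):
    """Stable if the lowest sampled phonon frequency is not below the threshold.

    Imaginary modes are assumed to appear as negative frequencies.
    """
    return min(frequencies_thz) >= threshold_thz


def stability_rate(structures, threshold_thz=-0.1):
    """Fraction of structures whose sampled spectrum passes the threshold."""
    flags = [is_dynamically_stable(freqs, threshold_thz) for freqs in structures]
    return sum(flags) / len(flags)


# Hypothetical phonon frequencies (THz), one list per generated structure.
toy_spectra = [
    [-0.05, 0.0, 1.2, 3.4],  # small negative mode, within the -0.1 THz tolerance
    [-0.80, 0.3, 2.1, 4.0],  # clearly unstable mode
    [0.0, 0.9, 2.5, 5.1],    # no negative modes
]
loose = stability_rate(toy_spectra)           # threshold -0.1 THz
strict = stability_rate(toy_spectra, -0.001)  # threshold -0.001 THz
```

Tightening the threshold from -0.1 THz to -0.001 THz reclassifies the first toy structure as unstable, which shows why a reported stability rate needs its threshold stated alongside it.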

2511.19716 2026-06-12 math.NA cs.LG cs.NA 版本更新

Design Criteria for SGD Preconditioners: Local Conditioning, Noise Floors, and Basin Stability

SGD预条件子的设计准则:局部条件数、噪声基底与盆地稳定性

Mitchell Scott, Tianshi Xu, Ziyuan Tang, Alexandra Pichette-Emmons, Qiang Ye, Yousef Saad, Yuanzhe Xi

发表机构 * Department of Mathematics, Emory University(埃默里大学数学系) Department of Mathematics, University of Minnesota Twin Cities(明尼苏达大学双城分校数学系) Department of Computer Science, University of Minnesota Twin Cities(明尼苏达大学双城分校计算机科学系) Department of Mathematics, University of Kentucky(肯塔基大学数学系)

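AI summary: To address the slow late-stage convergence of SGD caused by anisotropic curvature and gradient noise, the paper proposes an analysis framework for preconditioned SGD based on a symmetric positive definite matrix $\mathbf{M}$, derives bounds in which the convergence rate and the noise floor are governed by $\mathbf{M}$-dependent quantities, gives a basin-stability guarantee for nonconvex objectives, and provides design criteria for scientific machine learning.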

Comments 31 pages, 11 Figures

详情
Journal ref Trans. of Mach. Learning Research, 06/2026
AI中文摘要

随机梯度下降(SGD)在训练后期常因各向异性曲率和梯度噪声而变慢。我们在对称正定矩阵$\mathbf{M}$诱导的几何中分析预条件SGD,推导出收敛速率和随机噪声基底均受$\mathbf{M}$相关量控制的界:速率通过$\mathbf{M}$度量下的有效条件数,基底通过该条件数与预条件噪声水平的乘积。对于非凸目标,我们建立了依赖于预条件子的盆地稳定性保证:当光滑性和盆地大小以$\mathbf{M}$范数度量时,迭代停留在良好局部区域的概率有显式下界。这一视角在科学机器学习(SciML)中尤为重要,其中在随机更新下实现小训练损失与物理保真度、数值稳定性和约束满足密切相关。该框架适用于对角/自适应和曲率感知预条件子,并给出一个简单的设计原则:选择$\mathbf{M}$以改善局部条件同时衰减噪声。在二次诊断问题和三个SciML基准上的实验验证了预测的速率-基底行为。

英文摘要

Stochastic Gradient Descent (SGD) often slows in the late stage of training due to anisotropic curvature and gradient noise. We analyze preconditioned SGD in the geometry induced by a symmetric positive definite matrix $\mathbf{M}$, deriving bounds in which both the convergence rate and the stochastic noise floor are governed by $\mathbf{M}$-dependent quantities: the rate through an effective condition number in the $\mathbf{M}$-metric, and the floor through the product of that condition number and the preconditioned noise level. For nonconvex objectives, we establish a preconditioner-dependent basin-stability guarantee: when smoothness and basin size are measured in the $\mathbf{M}$-norm, the probability that the iterates remain in a well-behaved local region admits an explicit lower bound. This perspective is particularly relevant in Scientific Machine Learning (SciML), where achieving small training loss under stochastic updates is closely tied to physical fidelity, numerical stability, and constraint satisfaction. The framework applies to both diagonal/adaptive and curvature-aware preconditioners and yields a simple design principle: choose $\mathbf{M}$ to improve local conditioning while attenuating noise. Experiments on a quadratic diagnostic and three SciML benchmarks validate the predicted rate-floor behavior.
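
The design principle in this abstract (choose $\mathbf{M}$ to improve local conditioning) can be illustrated with a minimal NumPy sketch on a noisy quadratic. This is not the authors' code or experimental setup. The quadratic $f(x) = \tfrac{1}{2} x^\top A x$, the additive Gaussian gradient noise, the step sizes, and all function names are assumptions chosen for illustration. The effective condition number is computed here as the condition number of $\mathbf{M}^{-1/2} A \mathbf{M}^{-1/2}$, which is one common way to express conditioning in the $\mathbf{M}$-metric.

```python
import numpy as np


def effective_condition_number(A: np.ndarray, M: np.ndarray) -> float:
    """Condition number of L^{-1} A L^{-T}, where M = L L^T (Cholesky).

    This matrix has the same eigenvalues as M^{-1} A.
    """
    L = np.linalg.cholesky(M)
    # A is symmetric, so solve(L, A).T equals A L^{-T}.
    B = np.linalg.solve(L, np.linalg.solve(L, A).T)
    eigenvalues = np.linalg.eigvalsh(B)  # returned in ascending order
    return float(eigenvalues[-1] / eigenvalues[0])


def preconditioned_sgd(A, M, x0, step, sigma, n_steps, rng) -> float:
    """Run x <- x - step * M^{-1} g with noisy gradients g = A x + sigma * xi.

    Returns the final loss 0.5 * x^T A x.
    """
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        noisy_grad = A @ x + sigma * rng.standard_normal(x.shape)
        x = x - step * np.linalg.solve(M, noisy_grad)
    return float(0.5 * x @ A @ x)


# Hypothetical ill-conditioned quadratic with curvatures 1 and 100.
A = np.diag([1.0, 100.0])
x0 = np.ones(2)
identity = np.eye(2)

kappa_plain = effective_condition_number(A, identity)  # plain SGD, M = I
kappa_prec = effective_condition_number(A, A)          # idealized M = A

# Plain SGD needs a small step (about 1 / largest curvature) to stay stable;
# the preconditioned run can use a step of order one.
loss_plain = preconditioned_sgd(A, identity, x0, step=0.01, sigma=0.01,
                                n_steps=200, rng=np.random.default_rng(0))
loss_prec = preconditioned_sgd(A, A, x0, step=0.5, sigma=0.01,
                               n_steps=200, rng=np.random.default_rng(0))

print(f"condition number: plain={kappa_plain:.1f}, preconditioned={kappa_prec:.1f}")
print(f"final loss:       plain={loss_plain:.2e}, preconditioned={loss_prec:.2e}")
```

In this toy setting the plain run is still limited by the slow, low-curvature direction after 200 steps, while the preconditioned run has already reached its noise-dominated regime. Note that $\mathbf{M}^{-1}$ also rescales the noise, which is why the abstract stresses attenuating noise alongside improving conditioning.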

2511.13271 2026-06-12 cs.SE cs.AI cs.IR 版本更新

Examining the Usage of Generative AI Models in Student Learning Activities for Software Programming

生成式AI模型在学生软件编程学习活动中的使用研究

Rufeng Chen, Shuaishuai Jiang, Jiyun Shen, AJung Moon, Lili Wei

发表机构 * University of Science and Technology of China(中国科学技术大学)

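AI summary: Comparing generative AI with conventional online resources for programming learning, the study finds that AI improves task performance but does not necessarily yield knowledge gains; beginners over-rely on it while intermediate students use it selectively. The authors call for treating AI as a learning tool rather than a problem-solving tool.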

Comments 9 pages, 4 figures, published at AIWARE 2025

详情
AI中文摘要

生成式AI(GenAI)工具如ChatGPT的兴起为计算教育带来了新的机遇和挑战。现有研究主要关注GenAI完成教育任务的能力及其对学生表现的影响,往往忽视了其对知识获取的作用。在本研究中,我们调查了GenAI辅助与传统在线资源在不同熟练水平下对知识获取的支持效果。我们进行了一项受控用户实验,涉及24名具有两种不同编程经验水平(初学者、中级)的本科生,以考察学生在解决编程任务时如何与ChatGPT互动。我们分析了任务表现、概念理解和交互行为。我们的发现表明,使用GenAI生成完整解决方案显著提高了任务表现,尤其是对初学者而言,但并未持续带来知识增益。重要的是,使用策略因经验而异:初学者倾向于过度依赖GenAI以完成任务,过程中往往没有知识增益,而中级生则采用更具选择性的方法。我们发现,过度依赖和极少使用都会导致整体知识增益较弱。基于我们的结果,我们呼吁学生和教育工作者将GenAI作为学习工具而非解题工具。我们的研究强调了在将GenAI整合到编程教育中时,迫切需要指导以促进更深层次的理解。

英文摘要

The rise of Generative AI (GenAI) tools like ChatGPT has created new opportunities and challenges for computing education. Existing research has primarily focused on GenAI's ability to complete educational tasks and its impact on student performance, often overlooking its effects on knowledge gains. In this study, we investigate how GenAI assistance compares to conventional online resources in supporting knowledge gains across different proficiency levels. We conducted a controlled user experiment with 24 undergraduate students at two different levels of programming experience (beginner, intermediate) to examine how students interact with ChatGPT while solving programming tasks. We analyzed task performance, conceptual understanding, and interaction behaviors. Our findings reveal that generating complete solutions with GenAI significantly improves task performance, especially for beginners, but does not consistently result in knowledge gains. Importantly, usage strategies differ by experience: beginners tend to rely heavily on GenAI to complete tasks, often without gaining knowledge in the process, while intermediates adopt more selective approaches. We find that both over-reliance and minimal use result in weaker knowledge gains overall. Based on our results, we call on students and educators to adopt GenAI as a learning tool rather than a problem-solving tool. Our study highlights the urgent need for guidance when integrating GenAI into programming education to foster deeper understanding.

2412.08610 2026-06-12 cs.GT cs.AI cs.CY 版本更新

Competition and Diversity in Generative AI

生成式人工智能中的竞争与多样性

Manish Raghavan

发表机构 * MIT Sloan School of Management & Department of Electrical Engineering and Computer Science(麻省理工学院斯隆管理学院及电气工程与计算机科学系)

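AI summary: Using a game-theoretic model and experiments with the game Scattergories, the paper studies how competition pushes generative AI models to diversify, mitigating homogenization and improving social welfare.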

详情
AI中文摘要

最近的实验和现实证据表明,使用生成式人工智能会降低所产生内容的多样性。使用相同或相似的AI模型似乎会导致更同质化的行为。我们的工作从观察到存在一股相反方向的推动力开始:竞争。当生产者相互竞争(例如,争夺客户或注意力)时,他们被激励去创造新颖或独特的内容。我们探讨了竞争对内容多样性和整体社会福利的影响。通过一个正式的博弈论模型,我们表明竞争市场会选择多样化的AI模型,从而缓解单一文化。我们进一步表明,一个在孤立环境中表现良好(即根据基准)的生成式AI模型可能在竞争市场中无法提供价值。我们的结果强调了在生成式AI模型输出分布的广度上评估它们的重要性,特别是当它们将被部署在竞争环境中时。我们通过使用语言模型玩Scattergories(一个奖励正确且独特答案的文字游戏)来实证验证我们的结果。总体而言,我们的结果表明,由生成式AI导致的同质化不太可能在竞争市场中持续存在,相反,下游市场的竞争可能会推动AI模型开发的多样化。

英文摘要

Recent evidence, both in the lab and in the wild, suggests that the use of generative artificial intelligence reduces the diversity of content produced. The use of the same or similar AI models appears to lead to more homogeneous behavior. Our work begins with the observation that there is a force pushing in the opposite direction: competition. When producers compete with one another (e.g., for customers or attention), they are incentivized to create novel or unique content. We explore the impact competition has on both content diversity and overall social welfare. Through a formal game-theoretic model, we show that competitive markets select for diverse AI models, mitigating monoculture. We further show that a generative AI model that performs well in isolation (i.e., according to a benchmark) may fail to provide value in a competitive market. Our results highlight the importance of evaluating generative AI models across the breadth of their output distributions, particularly when they will be deployed in competitive environments. We validate our results empirically by using language models to play Scattergories, a word game in which players are rewarded for answers that are both correct and unique. Overall, our results suggest that homogenization due to generative AI is unlikely to persist in competitive markets, and instead, competition in downstream markets may drive diversification in AI model development.
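
The incentive described in this abstract (answers are rewarded only when they are both correct and unique) can be illustrated with a minimal Python sketch of a simplified Scattergories-style scoring rule. It is not the paper's experimental code. The one-point-per-answer rule, the case-insensitive matching, the player names, and the answer set are all assumptions for illustration.

```python
from collections import Counter
from typing import Dict, Set


def score_round(answers: Dict[str, str], valid_answers: Set[str]) -> Dict[str, int]:
    """Give one point to each answer that is valid and given by no one else.

    Matching is case-insensitive and ignores surrounding whitespace.
    """
    normalized = {player: ans.strip().lower() for player, ans in answers.items()}
    counts = Counter(normalized.values())
    return {
        player: int(ans in valid_answers and counts[ans] == 1)
        for player, ans in normalized.items()
    }


# Hypothetical round: name a fruit that starts with "A".
valid = {"apple", "apricot", "avocado"}
answers = {
    "player_1": "Apple",    # same modal answer as player_2
    "player_2": "apple",
    "player_3": "Avocado",  # correct and unique
    "player_4": "Brick",    # unique but incorrect
}

scores = score_round(answers, valid)
print(scores)
```

Under this rule, two players who both return the single most likely correct answer score nothing, while a player with a less likely but still correct answer is rewarded. This is the sense in which a model that performs well in isolation may provide little value under competition.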

2509.21548 2026-06-12 cs.CY cs.CL 版本更新

C-QUERI: Congressional Questions, Exchanges, and Responses in Institutions Dataset

C-QUERI:国会机构中的问题、交流与回答数据集

Manjari Rudra, Daniel Magleby, Sujoy Sikdar

发表机构 * School of Computing, Binghamton University(宾汉姆顿大学计算机学院) Department of Political Science, Binghamton University(宾汉姆顿大学政治学系)

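AI summary: Proposes a pipeline for extracting question-answer pairs from hearing transcripts and builds a dataset of committee hearings from the 108th through 117th Congresses; the analysis shows that a questioner's party can be predicted from the questions alone, providing a framework for studying political discourse.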

详情
AI中文摘要

政治采访和听证中的问题除了信息收集外,还具有战略目的,包括推进党派叙事和塑造公众认知。然而,由于缺乏大规模数据集来研究此类话语,这些战略方面仍未得到充分研究。国会听证会为研究政治提问提供了一个特别丰富且易于处理的地点:互动由正式规则组织,证人必须回答,不同政治派别的成员保证有机会提问,从而能够比较跨政治光谱的行为。我们开发了一个流程,从非结构化听证记录中提取问答对,并构建了一个包含第108至117届国会委员会听证的新数据集。我们的分析揭示了跨党派的提问策略的系统性差异,表明仅从问题本身即可预测提问者的党派归属。我们的数据集和方法不仅推进了国会政治研究,还为分析类似采访环境中的问答提供了通用框架。

英文摘要

Questions in political interviews and hearings serve strategic purposes beyond information gathering, including advancing partisan narratives and shaping public perceptions. However, these strategic aspects remain understudied due to the lack of large-scale datasets for studying such discourse. Congressional hearings provide an especially rich and tractable site for studying political questioning: interactions are structured by formal rules, witnesses are obliged to respond, and members with different political affiliations are guaranteed opportunities to ask questions, enabling comparisons of behaviors across the political spectrum. We develop a pipeline to extract question-answer pairs from unstructured hearing transcripts and construct a novel dataset of committee hearings from the 108th through 117th Congresses. Our analysis reveals systematic differences in questioning strategies across parties, showing that the party affiliation of questioners can be predicted from their questions alone. Our dataset and methods not only advance the study of congressional politics, but also provide a general framework for analyzing question-answering across interview-like settings.