arXivDaily: arXiv daily academic digest, updated Monday to Friday
2605.11276 2026-05-13 cs.CV cs.AI

Generative AI for Visualizing Highway Construction Hazards Through Synthetic Images and Temporal Sequences

Trevor Neece, Mason Smetana, Lev Khazanovich

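AI summary: This study proposes a generative-AI method for producing synthetic images and temporal sequences of highway construction hazard scenes from OSHA Severe Injury Reports, to support safety training. It develops two generation modes, single-image generation and four-stage temporal sequence generation, and evaluates the educational value, realism, and alignment of the generated images through CLIP-based semantic retrieval and expert assessment. The method provides visual material for safety training without photographing real accident scenes, and offers a new evaluation framework for synthetic image generation across domains.

Abstract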

Highway construction workers face a high risk of serious injury or death. Image-based training materials depicting hazardous scenarios are essential for engaging safety instruction but remain scarce due to ethical and logistical barriers. This study develops and evaluates a generative AI methodology for producing synthetic visualizations of highway construction hazards from OSHA Severe Injury Report narratives. Two modes were developed: a single-pass approach yielding one image per incident, and a temporal approach producing a four-stage sequence. A sample of 75 incident records yielded 750 images, evaluated using CLIP-based semantic retrieval and expert assessment across dimensions such as educational utility, fidelity, and alignment. Single-pass images achieved 81.1% educational acceptability with fidelity and alignment scores of 4.14/5 and 4.07/5, respectively, while temporal sequences achieved 60.9% acceptability with comparable alignment (3.94/5) but lower fidelity (3.51/5). CLIP-based retrieval revealed that both modes produce images with statistically significant retrieval capabilities. This is among the first studies to leverage modern autoregressive image generation models for visualizing construction hazards from reported severe injuries and to generate temporally sequenced hazard imagery, and a new multi-dimensional evaluation framework was developed to support future research in this domain. The work enables safety trainers to pair narrative storytelling with visual learning material without photographing real-world hazards, and the framework could be applied to datasets across diverse domains, enabling synthetic image generation tailored to new application areas.

2605.11272 2026-05-13 cs.LG cs.AI cs.IR

Localization Boosting for Growth Markets: Mitigating Cross-Locale Behavioral Bias in Learning-to-Rank

Suryaa Veerabathiran Seran, Ashwin Naresh Kumar, Tracy Holloway King, Jing Zheng

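AI summary: This paper studies how to mitigate cross-locale behavioral bias in learning-to-rank (LTR) models during international expansion. The authors note that models trained only on click data neglect semantic localization features, leading to uneven content exposure in non-US locales. They propose a multi-objective framework that combines behavioral feedback, vision-language model relevance signals, and locale-aware boosting, which improves relevance and local content visibility across multiple locales.

Abstract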

Adobe Express is expanding internationally, but the US has a disproportionately large content supply and interaction volume. Learning-to-rank (LTR) models trained primarily on behavioral feedback inherit this imbalance: templates popular in the US are over-served in non-US locales. This cross-locale exposure bias suppresses local content discoverability and degrades ranking quality in growth locales. We show that click-only training suppresses semantically informative localization features. Adding vision-language model (VLM) graded relevance labels as auxiliary supervision alongside clicks improves semantic alignment but does not preserve local content visibility. We propose a multi-objective framework combining behavioral supervision, VLM-derived relevance signals, and locale-aware boosting. Across five locales, the resulting model improves relevance while restoring stable localization, demonstrating the importance of disentangling exposure from semantic supervision.

2605.11267 2026-05-13 cs.CV

Real-Scale Island Area and Coastline Estimation using Only its Place Name or Coordinates

Quanyun Wu, Kyle Gao, Wentao Sun, Hongjie He, Yuhao Chen, David A. Clausi, Jonathan Li

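AI summary: This paper proposes a geometrically consistent, real-scale framework for measuring island area and coastline based on monocular vision. Given only the geographic coordinates or name of the target area, it automatically obtains a low-altitude surrounding image sequence, restores the global physical scale with a lightweight trajectory alignment algorithm, and extracts 2D planar area and perimeter. The method does not rely on conventional GIS data and substantially reduces mapping cost. Experiments show a measurement error stable at around 10%, together with fast inference, offering a practical new paradigm for large-scale marine and coastline monitoring.

Comments: Accepted for publication at IEEE OCEANS (Sanya) 2026

Abstract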

Accurate measurement of island area and coastline length is crucial for coastal zone monitoring and oceanographic analysis. However, traditional measurement and mapping methods usually rely heavily on orthophotos, expensive airborne depth sensors, or dense ground control points, which face serious limitations of high labor costs, time-consuming efforts, and low operational efficiency in vast and inaccessible open sea environments. To overcome these challenges and break away from the reliance on manual field exploration, this paper proposes a geometrically consistent, real-scale island measurement framework based on pure monocular vision. This project significantly reduces the mapping cost through a fully automated process and achieves high-efficiency measurement without prior GIS data. In our system pipeline, only the geographical coordinates or names of the target area need to be input to obtain a low-altitude surrounding image sequence. After obtaining the point clouds, a lightweight trajectory alignment algorithm (Umeyama) is used to restore the global physical scale, and the scaled model is orthorectified, enabling high-precision area and perimeter extraction directly on the 2D rasterized plane. We have fully verified this pipeline on four islands with different terrain features (covering natural landform islands and islands with complex artificial facilities). The experimental results show that the final measurement error of the system is stable at around 10%, demonstrating excellent accuracy and robustness. Moreover, this framework has outstanding inference speed, requiring only 70 ms to process a single high-resolution image and generate point clouds, providing a highly practical new paradigm for large-scale marine and coastline
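The last step of the pipeline, extracting area and perimeter once a physical scale is known, can be illustrated with a minimal sketch. It assumes the coastline is already available as an ordered 2D polygon in model units and that a metres-per-unit factor `scale` has been recovered elsewhere (for example by trajectory alignment). The function names and the polygon representation are hypothetical and not taken from the paper, which works on a rasterized plane.

```python
import math

def polygon_area_perimeter(points, scale=1.0):
    """Area (shoelace formula) and perimeter of a closed 2D polygon.

    points: ordered (x, y) vertices in model units.
    scale:  hypothetical metres-per-model-unit factor.
    Area scales with scale**2, perimeter with scale.
    """
    n = len(points)
    twice_area = 0.0
    perimeter = 0.0
    for i in range(n):
        x0, y0 = points[i]
        x1, y1 = points[(i + 1) % n]  # wrap around to close the polygon
        twice_area += x0 * y1 - x1 * y0
        perimeter += math.hypot(x1 - x0, y1 - y0)
    area = abs(twice_area) / 2.0
    return area * scale ** 2, perimeter * scale

def relative_error(measured, reference):
    """Relative deviation of a measurement from a reference value."""
    return abs(measured - reference) / reference
```

Because area grows with the square of the scale factor, an error in the recovered scale affects the area estimate roughly twice as strongly (in relative terms) as the perimeter estimate.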

2605.11266 2026-05-13 cs.CV cs.GR cs.LG

PG-3DGS: Optimizing 3D Gaussian Splatting to Satisfy Physics Objectives

Zachary Lee, Maxwell Jacobson, Yexiang Xue

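AI summary: This study proposes PG-3DGS, a physics-guided 3D Gaussian Splatting method that aims to generate 3D structures that are both visually realistic and physically functional. By coupling differentiable physics simulation with 3D Gaussian representations, the method optimizes shape under visual losses and physical objectives together, producing functional objects such as teapots that can pour and airplanes that generate lift. Experiments show that PG-3DGS improves physical functionality while preserving visual quality, and bench-top physical lift tests confirm the advantage of its generated structures.

Comments: Submitted to Artificial Intelligence. 52 pages

Abstract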

Recent advances in Gaussian Splatting have enabled fast, high-fidelity 3D scene generation, yet these methods remain purely visual and lack an understanding of how shapes behave in the physical world. We introduce Physics-Guided 3D Gaussian Splatting (PG-3DGS), a framework that couples differentiable physics simulation with 3D Gaussian representations to generate 3D structures satisfying physics functionalities. By allowing physical objectives to guide the shape optimization process alongside visual losses, our approach produces geometries that are not only photometrically accurate but also physically functional. The model learns to adjust shapes so that the generated objects exhibit physically meaningful behaviors, for example, teapots that can pour and airplanes that can generate lift, without sacrificing visual quality. Experiments on pouring and aerodynamic lift tasks show that PG-3DGS improves physical functionality while preserving visual quality. In addition to simulation gains, bench-top physical lift tests with 3D-printed aircraft (Cessna, B-2 Spirit, and paper plane) under identical airflow conditions show higher scale-measured lift for PG-3DGS-generated structures than for an appearance-matching baseline in all three cases. Our unified framework connects appearance-based reconstruction with physics-based reasoning, enabling end-to-end generation of 3D structures that both look realistic and function correctly.

2605.11265 2026-05-13 cs.CV cs.AI cs.LG

DenseTRF: Texture-Aware Unsupervised Representation Adaptation for Surgical Scene Dense Prediction

Guiqiu Liao, Matjaž Jogan, Daniel A. Hashimoto

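AI summary: This paper proposes DenseTRF, a self-supervised representation adaptation framework that addresses the performance drop caused by distribution shift when dense prediction tasks in surgical scenes (such as segmentation and surgical zone prediction) are deployed across domains. Built on texture-aware attention, the method learns representations that capture invariant visual structures and adapts them to the target distribution without supervision, improving robustness to domain shift. Experiments across multiple surgical procedures show that DenseTRF outperforms state-of-the-art segmentation models and cross-domain adaptation methods.

Comments: Accepted to 29th International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI 2026)

Abstract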

Dense prediction tasks in surgical computer vision, such as segmentation and surgical zone prediction, can provide valuable guidance for laparoscopic and robotic surgery. However, these models often suffer from distribution shifts, as training datasets rarely cover the variability encountered during deployment, leading to poor generalization. We propose DenseTRF, a self-supervised representation adaptation framework based on texture-centric attention. Our method leverages slot attention to learn texture-aware representations that capture invariant visual structures. By adapting these representations to the target distribution without supervision, DenseTRF significantly improves robustness to domain shifts. The framework is implemented through conditioning dense prediction on slot attention and model merging strategies. Experiments across multiple surgical procedures demonstrate improved cross-distribution generalization in comparison to state-of-the-art segmentation models and test-distribution adaptation methods for dense prediction tasks.

2605.11260 2026-05-13 cs.LG cs.AI

Curriculum Learning-Guided Progressive Distillation in Large Language Models

Jincheng Cao, Fanzhi Zeng, Leqi Liu, Aryan Mokhtari

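AI summary: Knowledge distillation is a key technique for transferring the capabilities of large language models to smaller student models, but existing methods often overlook the learning order of training data and the capacity mismatch between teacher and student. This paper proposes Curriculum Learning-Guided Progressive Distillation (CLPD), a framework that aligns data difficulty with teacher strength and builds both an explicit and an implicit curriculum. Experiments show that CLPD outperforms standard distillation and single-factor strategies on reasoning benchmarks, underscoring the importance of jointly considering data order and teacher capacity.

Abstract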

Knowledge distillation is a key technique for transferring the capabilities of large language models (LLMs) into smaller, more efficient student models. Existing distillation approaches often overlook two critical factors: the learning order of training data and the capacity mismatch between teacher and student models. This oversight limits distillation performance, as manifested by the counter-intuitive phenomenon where stronger teachers fail to produce better students. In this work, we propose Curriculum Learning-Guided Progressive Distillation (CLPD), a unified framework that explicitly accounts for both factors by aligning data difficulty with teacher strength. CLPD constructs an explicit curriculum by organizing training examples from easy to hard, while simultaneously applying an implicit curriculum over supervision signals by progressively scheduling teachers of increasing capacity. Our framework is modular and can be integrated into standard distillation algorithms with minimal overhead. Empirical results on the reasoning benchmarks demonstrate that CLPD consistently outperforms standard distillation, data ordering alone, and teacher scheduling alone across multiple settings. These findings highlight the importance of jointly considering data ordering and teacher capacity when distilling reasoning abilities into small language models.
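The pairing of an easy-to-hard data order with teachers of increasing capacity can be sketched as a simple schedule builder. This is an illustration under stated assumptions, not the CLPD algorithm: the abstract does not say how difficulty is scored or how stage boundaries are chosen, so the `difficulty` callable and the equal-sized stages below are hypothetical.

```python
def build_schedule(examples, difficulty, teachers):
    """Pair easy-to-hard data stages with weak-to-strong teachers.

    examples:   list of training examples.
    difficulty: hypothetical callable mapping an example to a score
                (lower means easier).
    teachers:   teacher identifiers ordered from weakest to strongest.
    Returns a list of (teacher, stage_examples) tuples, one per teacher.
    """
    ordered = sorted(examples, key=difficulty)  # explicit curriculum
    n, k = len(ordered), len(teachers)
    schedule = []
    for stage, teacher in enumerate(teachers):  # implicit curriculum
        start = stage * n // k
        end = (stage + 1) * n // k
        schedule.append((teacher, ordered[start:end]))
    return schedule
```

For example, `build_schedule(["ccc", "a", "bb", "dddd"], len, ["small", "large"])` assigns the two shortest strings to the `"small"` teacher and the two longest to the `"large"` one, using string length as a stand-in difficulty score.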

2605.11259 2026-05-13 cs.AI

Template-as-Ontology: Configurable Synthetic Data Infrastructure for Cross-Domain Manufacturing AI Validation

Grama Chethan

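AI summary: This paper proposes Template-as-Ontology, a configurable synthetic data infrastructure for validating AI systems across manufacturing domains. A single Python configuration module defines both the structure of the manufacturing simulator and the runtime data schema of the AI analytics tools, keeping the two structurally consistent. Experiments show that the framework generates high-quality, MES-shaped synthetic data and reduces parameter fabrication in AI tool calls, providing a reusable data layer for validating discrete manufacturing AI.

Comments: 18 pages, 1 figure

Abstract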

Large language model (LLM)-based AI agents deployed in manufacturing environments require populated, schema-correct data for validation, yet production MES data is proprietary, privacy-encumbered, and vendor-specific. This paper introduces the Template-as-Ontology principle: a single Python configuration module (700-770 lines, 45 validated exports) serves simultaneously as the specification for a time-stepped manufacturing simulator and as the runtime domain schema for AI analytics tools, producing alignment by construction rather than integration. We formally define the domain template as a typed relational configuration schema and prove that structural alignment between simulation and tool layers is guaranteed by single-source consumption. A five-layer pipeline--simulation, PostgreSQL, CDC/Iceberg lakehouse, star schema, and 12 parameterized AI tools--generates causally coherent, MES-shaped data spanning 66 entity types across four operational domains mapped to ISA-95/IEC 62264. We validate the architecture with six industry templates (aerospace, pharma, automotive, electronics, beverages, warehousing) running on identical framework code. Calibration experiments (60 runs, 10 seeds per template) confirm parametric controllability: observed KPIs fall within configured ranges across all templates. A controlled hallucination experiment (72 tool invocations, Qwen3-32B) demonstrates that ontology-constrained parameters eliminate tool-parameter fabrication (0% constrained vs. 43% unconstrained hallucination rate for the evaluated model, Fisher's exact test p < 10^-12); the 0% constrained rate is an architectural guarantee that holds for any model. The framework provides a reusable data layer for discrete manufacturing AI validation.

2605.11258 2026-05-13 cs.AI cs.CL q-bio.QM

Unlocking LLM Creativity in Science through Analogical Reasoning

Andrew Shen, Shaul Druckmann, James Zou

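AI summary: This paper studies how analogical reasoning (AR) can improve the creativity of large language models (LLMs) on scientific problems, particularly in complex fields such as biomedicine. The authors find that existing LLMs tend toward mode collapse in open-ended problem solving, producing solutions with little diversity, and propose AR, which generates novel solutions through analogies to structurally similar cross-domain problems. Experiments show that AR substantially increases the diversity and novelty of generated solutions and outperforms existing methods on several biomedical tasks, supporting its practical feasibility.

Abstract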

Autonomous science promises to augment scientific discovery, particularly in complex fields like biomedicine. However, this requires AI systems that can consistently generate novel and diverse solutions to open-ended problems. We evaluate LLMs on the task of open-ended solution generation and quantify their tendency to mode collapse into low-diversity generations. To mitigate this mode collapse, we introduce analogical reasoning (AR) as a new approach to solution generation. AR generates analogies to cross-domain problems based on shared relational structure, then uses those analogies to search for novel solutions. Compared to baselines, AR discovers significantly more diverse generations (improving solution diversity metrics by 90-173%), generates novel solutions over 50% of the time (compared to as little as 1.6% for baselines), and produces high-quality analogies. To validate the real-world feasibility of AR, we implement AR-generated solutions across four biomedical problems, yielding consistent quantitative gains. AR-generated approaches achieve a nearly 13-fold improvement on distributional metrics for perturbation effect prediction, outperform all baselines on AUPRC when predicting cell-cell communication, infer brain region interactions with a high Spearman correlation (ρ = 0.729) to published methods, and establish state-of-the-art performance on 2 datasets for oligonucleotide property prediction. The novel and diverse solutions produced by AR can be used to augment the search space of existing solution generation methods.

2605.11255 2026-05-13 cs.CL

HEBATRON: A Hebrew-Specialized Open-Weight Mixture-of-Experts Language Model

Noam Kayzer, Dan Revital, Ori Bar Joseph, Smadar Arvatz, Or Levi, Tal Geva, Shaltiel Shmidman, Amir DN Cohen, Noam Ordan, Omer Baruch, Kate Zinkovskaia, Zevi Apini, Sarel Weinberger

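AI summary: This paper presents Hebatron, a Hebrew-specialized open-weight large language model built on the NVIDIA Nemotron-3 sparse Mixture-of-Experts architecture. The model is trained with a three-phase easy-to-hard curriculum and continuous anti-forgetting anchoring, followed by supervised fine-tuning on 2 million bilingual Hebrew-English samples. Hebatron reaches a 73.8% average on Hebrew reasoning tasks, outperforming DictaLM-3.0-24B-Thinking and remaining competitive with Gemma-3-27B-IT, while offering high inference throughput and long-context support. The authors describe it as the first language-specific adaptation of the Nemotron-3 architecture and the first open-weight Hebrew-specialized MoE model with native long-context support.

Abstract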

We present Hebatron, a Hebrew-specialized open-weight large language model built on the NVIDIA Nemotron-3 sparse Mixture-of-Experts architecture. Training employs a three-phase easy-to-hard curriculum with continuous anti-forgetting anchoring, followed by supervised fine-tuning on 2 million bilingual Hebrew-English samples. The curriculum ordering alone yields a 3-point aggregate benchmark gain over the reversed configuration. Hebatron achieves a Hebrew reasoning average of 73.8%, outperforming DictaLM-3.0-24B-Thinking (68.9%) and remaining competitive with Gemma-3-27B-IT on GSM8K-HE and Israeli Trivia, while activating only 3B parameters per forward pass across a 30B-parameter model, delivering approximately 9 times higher inference throughput at native context lengths up to 65,536 tokens. To our knowledge, this is the first language-specific adaptation of the Nemotron-3 architecture for any target language, and the first open-weight Hebrew-specialized MoE model with native long-context support. Model weights are released openly to support further research in Hebrew and Semitic-language NLP.

2605.11247 2026-05-13 cs.LG

A Proof-of-Concept Simulation-Driven Digital Twin Framework for Decision-Aware Diabetes Modeling

Zarrin Monirzadeh

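AI summary: This paper proposes a simulation-driven digital twin framework for decision-aware diabetes modeling, illustrated with benchmark clinical data, synthetic temporal augmentation, and continuous glucose monitoring analysis. Unlike traditional predictive models, the framework focuses on generating interpretable simulated trajectories rather than clinically validated outcomes. It is evaluated on a public dataset combined with controlled synthetic scenarios, showing the feasibility of combining prediction with counterfactual simulation for decision analysis. The work lays a foundation for future research on simulation-driven digital twin systems in healthcare.

Comments: Preprint. 9 figures. DOI: 10.5281/zenodo.20127363

Abstract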

This paper presents a proof-of-concept digital twin framework for simulation-driven diabetes modeling using benchmark clinical data, synthetic temporal augmentation, and illustrative continuous glucose monitoring (CGM) analysis. Unlike traditional predictive models, the framework focuses on generating interpretable simulated trajectories rather than clinically validated outcomes. Evaluation is conducted using a public dataset combined with controlled synthetic scenarios to illustrate temporal behavior and intervention effects. Results illustrate the feasibility of integrating prediction with counterfactual simulation for decision-aware analysis. This work does not claim clinical readiness but provides a foundation for future research on simulation-driven digital twin systems in healthcare.

2605.11242 2026-05-13 cs.CL cs.AI

RETUYT-INCO at BEA 2026 Shared Task 2: Meta-prompting in Rubric-based Scoring for German

Ignacio Sastre, Ignacio Remersaro, Facundo Díaz, Nicolás De Horta, Luis Chiruzzo, Aiala Rosá, Santiago Góngora

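AI summary: This paper describes the RETUYT-INCO team's participation in the BEA 2026 shared task "Rubric-based Short Answer Scoring for German". In several tracks the team used a method called Meta-prompting, in which a custom prompt is generated from training-set examples and then used to score student answers. The team also tried classic machine learning, fine-tuning of open-source LLMs, and other prompting techniques. It finished mid-table in the tracks it entered.

Comments: To be presented at the BEA 2026 workshop, co-located with ACL 2026

Abstract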

In this paper, we present the RETUYT-INCO participation at the BEA 2026 shared task "Rubric-based Short Answer Scoring for German". Our team participated in track 1 (Unseen answers three-way), track 3 (Unseen answers two-way) and track 4 (Unseen questions two-way). Since these tracks required scoring short student answers using specific rubrics, we looked for ways to handle the changing nature of the task. We created a method called Meta-prompting. In this approach, an LLM creates a custom prompt based on examples from the Train set. This prompt is then used to grade new student answers. Along with this method, we also describe other approaches we used, such as classic machine learning, fine-tuning open-source LLMs, and different prompting techniques. According to the official results, our team placed 6th out of 8 participants in Track 1 with a QWK of 0.729. In Track 3, we secured 4th place out of 9 with a QWK of 0.674, and we also placed 4th out of 8 in Track 4 with a QWK of 0.49.
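The results above are reported as QWK, which is usually read as quadratic weighted kappa. A minimal sketch of that metric follows, assuming integer labels in the range 0 to `num_labels - 1`. The function name and signature are illustrative, and the shared task's official scorer may differ in details.

```python
def quadratic_weighted_kappa(y_true, y_pred, num_labels):
    """Quadratic weighted kappa between two integer label sequences.

    Assumes at least two distinct labels occur across the marginals;
    otherwise the expected disagreement is zero and the ratio is undefined.
    """
    n = len(y_true)
    # Confusion matrix: rows are reference labels, columns are predictions.
    observed = [[0] * num_labels for _ in range(num_labels)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1
    hist_true = [sum(row) for row in observed]
    hist_pred = [sum(observed[i][j] for i in range(num_labels))
                 for j in range(num_labels)]
    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(num_labels):
        for j in range(num_labels):
            weight = (i - j) ** 2 / (num_labels - 1) ** 2
            expected = hist_true[i] * hist_pred[j] / n  # chance agreement
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * expected
    return 1.0 - weighted_observed / weighted_expected
```

A value of 1 means perfect agreement, 0 means agreement at chance level, and negative values mean systematic disagreement. The quadratic weight penalizes a prediction two score levels away four times as heavily as one that is a single level away.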

2605.11239 2026-05-13 cs.LG stat.ML

Extending Kernel Trick to Influence Functions

Zhenhuan Sun, Shahrokh Valaee

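AI summary: This paper presents a dual representation of influence functions whose computational complexity scales with dataset size rather than model size, offering a more efficient alternative for influence analysis of large models. The method applies to linearizable models and requires constructing a matrix whose size grows with the product of model output dimension and dataset size. It can estimate changes in parameters, model outputs, and loss when the original influence functions are infeasible to evaluate in parameter space, and is most advantageous when model size is large relative to dataset size.

Abstract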

In this paper, we present a dual representation of the influence functions, whose computational complexity scales with dataset size rather than model size. Both analytically and experimentally, we show that this representation can be an efficient alternative to the original influence functions for estimating changes in parameters, model outputs and loss due to data point removal, when model size is large relative to dataset size, or when evaluating the original influence functions in parameter space is infeasible. The dual representation, however, is limited to linearizable models, which are models whose behavior can be approximated by their linearizations throughout training, and requires materializing a matrix, whose size grows with the product of model output dimension and dataset size.

2605.11237 2026-05-13 cs.LG

DeconDTN-Toolkit: A Library for Evaluation and Enhancement of Robustness to Provenance Shift

Yongsen Tan, Zhecheng Sheng, Xiruo Ding, Serguei V. S. Pakhomov, Trevor Cohen

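AI summary: This paper studies provenance shift, in which the relationship between data source and label changes at deployment, and derives a robustness-oriented learning objective from counterfactual invariance and invariant learning. The authors develop DeconDTN-Toolkit to simulate provenance shifts of varying degrees and evaluate the robustness of existing algorithms. They reveal the vulnerability of Empirical Risk Minimization under provenance shift and introduce a new out-of-distribution performance indicator, providing theoretical grounding and practical tools for analyzing and mitigating confounding by provenance.

Comments: Accepted to CHIL 2026

Abstract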

Despite the burgeoning body of work on distribution shifts, provenance shift, where the relationship between data source and label changes at deployment, remains poorly understood and under-addressed. In this paper, we establish a formal connection between provenance shift, counterfactual invariance, and invariant learning to derive a learning objective for robustness. We then introduce DeconDTN-Toolkit, a specialized evaluation and remediation suite designed to simulate provenance shifts of varying degrees while maintaining the training protocol and the infrastructure of existing benchmarks. We reveal the vulnerability of Empirical Risk Minimization under provenance shift, introduce a robust out-of-distribution performance indicator, and conduct a comprehensive evaluation on existing algorithms. Our work provides both the theoretical grounding and the practical tools necessary to characterize the problem of confounding by provenance, and implementations of methods to mitigate it.

2605.11235 2026-05-13 cs.LG cs.AI

Internalizing Curriculum Judgment for LLM Reinforcement Fine-Tuning

Han Zheng, Yining Ma, Karthick Gunasekaran, Bharathan Balaji, Zheng Du, Shiv Vitaladevuni, Cathy Wu

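AI summary: In reinforcement fine-tuning of large language models, curriculum learning improves training efficiency and performance, but existing methods rely on handcrafted heuristics or auxiliary models for curriculum judgment, which may be misaligned with the policy's training dynamics. This paper proposes METIS, a framework that internalizes curriculum judgment as a native capability of the model. It uses within-prompt reward variance to measure prompt informativeness, predicts that quantity through lightweight in-context learning over recent training outcomes, and adjusts training allocation dynamically. By jointly optimizing the standard reward and a self-judgment reward, METIS gives the policy a form of metacognitive learning, and it shows higher performance and faster convergence on multiple benchmarks.

Abstract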

In LLM Reinforcement Fine-Tuning (RFT), curriculum learning drives both efficiency and performance. Yet, current methods externalize curriculum judgment via handcrafted heuristics or auxiliary models, risking misalignment with the policy's training dynamics. In this paper, we introduce METIS (METacognitive Internalized Self-judgment), a novel framework that internalizes curriculum judgment as a native capability. Leveraging a critical observation that within-prompt reward variance effectively gauges prompt informativeness, METIS predicts this metric based on recent training outcomes as lightweight in-context learning examples. This intrinsic self-judgment then dynamically dictates the training allocation. Moreover, METIS closes the loop between judgment and optimization by jointly optimizing the standard RFT rewards and a self-judgment reward. This allows the policy to learn what to learn next, as a form of metacognition. Across extensive discrete and continuous RFT benchmarks from mathematical reasoning, code generation, to agentic function-calling, METIS consistently delivers superior performance while accelerating convergence by up to 67%. By bypassing handcrafted heuristics and auxiliary models, our work establishes a simple, closed-loop, and highly efficient curriculum internalization paradigm for LLM reinforcement fine-tuning.
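The quantity METIS is described as predicting, within-prompt reward variance, can be computed directly from sampled rollouts. The sketch below shows only that target quantity and a naive top-k allocation. It is not the METIS method, which predicts the variance in context rather than measuring it, and the function names and the selection rule are assumptions.

```python
from statistics import pvariance

def prompt_informativeness(rewards_per_prompt):
    """Population variance of rollout rewards for each prompt.

    rewards_per_prompt: dict mapping a prompt id to a list of rewards
    obtained from several sampled responses to that prompt.
    """
    return {prompt: pvariance(rewards)
            for prompt, rewards in rewards_per_prompt.items()}

def allocate(rewards_per_prompt, budget):
    """Hypothetical allocation rule: keep the `budget` highest-variance prompts."""
    scores = prompt_informativeness(rewards_per_prompt)
    return sorted(scores, key=scores.get, reverse=True)[:budget]
```

With binary rewards the variance equals p(1 - p) for success rate p. Prompts the policy always solves or always fails therefore score zero, while prompts solved about half the time score highest, which is why variance is a natural proxy for how much a prompt can still teach.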

2605.11234 2026-05-13 cs.AI

The Semantic Training Gap: Ontology-Grounded Tool Architectures for Industrial AI Agent Systems

Grama Chethan

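AI summary: This paper identifies and addresses the "semantic training gap" in industrial AI agent systems: large language models are fluent in domain terminology but lack a grounded understanding of the semantic structure of manufacturing operations. To close the gap, it designs a tool architecture grounded in manufacturing ontology that embeds domain knowledge directly in the AI tool layer and enforces semantic constraints at runtime instead of relying on training, reducing fabricated domain identifiers. Experiments show that the approach achieves cross-domain configurability and hallucination-free tool calls without application code changes.

Comments: 29 pages, 2 figures

Abstract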

Large language model (LLM)-based AI agents are increasingly deployed in manufacturing environments for analytics, quality management, and decision support. These agents demonstrate statistical fluency with domain terminology but lack grounded understanding of operational semantics -- the relational structure that connects equipment identifiers, process parameters, failure codes, and regulatory constraints within a specific production context. This paper identifies and formalizes the semantic training gap: a structural disconnect between how AI systems acquire domain vocabulary through training and how manufacturing operations define meaning through ontological relationships. We demonstrate that this gap causes operationally incorrect outputs even when model responses are linguistically precise, and that in multi-agent configurations it produces a compounding failure mode we term semantic drift. To close this gap, we present an architecture that embeds manufacturing ontology directly into the AI tool layer as a typed relational configuration, enforcing semantic constraints at runtime rather than relying on model training. The architecture is formalized as a three-operation interface contract -- resolve, contextualize, annotate -- with invariants enforced by an AIOps orchestration layer. In a controlled experiment across six industry configurations (72 tool invocations using Qwen3-32B), unconstrained tool parameters produced a 43% hallucination rate for domain identifiers; ontology-grounded parameters reduced this to 0%. We validate the approach through a digital twin analytics platform demonstrating that a single codebase with domain-specific ontology configurations eliminates tool-call hallucination and achieves cross-domain configurability without application code changes.

2605.11233 2026-05-13 cs.LG

A Comparative Study of Model Selection Criteria for Symbolic Regression

Ali Soltani, Gabriel Kronberger, Fabricio Olivetti de Franca, Mattia Billa, Alessandro Lucantonio

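AI summary: This paper compares commonly used model selection criteria in symbolic regression, with the goal of choosing, from generated candidate expressions, models that balance accuracy against complexity and generalize well. It systematically evaluates AIC, AICc, BIC, MDL, and Efron's bootstrap on seven synthetic datasets with Gaussian noise. MDL most consistently identifies models with the lowest test error and the shortest expressions on most datasets, and MDL and BIC have the highest probability of selecting the ground-truth expression.

Abstract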

Effective model selection is critical in symbolic regression (SR) to identify mathematical expressions that balance accuracy and complexity, and have low expected error on unseen data. Many modern implementations of genetic programming (GP) for SR generate a set of Pareto optimal candidate solutions, but reliable automatic selection of solutions that generalize well remains an open issue. Current literature offers various information-theoretic and Bayesian approaches, yet comprehensive comparisons of their performance across different data regimes are limited. This study presents a systematic empirical comparison of widely used selection criteria: the Akaike information criterion (AIC), the corrected AIC (AICc), the Bayesian information criterion (BIC), minimum description length (MDL), as well as Efron's bootstrap estimate for the in-sample prediction error on seven synthetic datasets with Gaussian noise. We rank candidate expressions generated by perturbing ground-truth functions to assess generalization error and selection probability of the ground-truth expression. Our findings reveal that MDL consistently identifies models with the lowest test error and the shortest length across most datasets. While no single criterion dominates all results, MDL and BIC produced the highest probability of selecting the ground-truth expressions.
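Three of the criteria compared here have simple closed forms under Gaussian noise, sketched below from the residual sum of squares. The sketch assumes the noise variance is set to its maximum-likelihood estimate RSS/n and that the caller decides what counts toward the parameter count `k`; both are conventions the abstract does not specify. MDL and the bootstrap estimate are omitted because their formulations vary. `select_best` is a hypothetical helper.

```python
import math

def gaussian_log_likelihood(rss, n):
    """Maximized Gaussian log-likelihood with variance estimate rss / n."""
    return -0.5 * n * (math.log(2 * math.pi) + math.log(rss / n) + 1.0)

def aic(rss, n, k):
    """Akaike information criterion: 2k - 2 ln L."""
    return 2 * k - 2 * gaussian_log_likelihood(rss, n)

def aicc(rss, n, k):
    """Small-sample corrected AIC; requires n > k + 1."""
    return aic(rss, n, k) + 2 * k * (k + 1) / (n - k - 1)

def bic(rss, n, k):
    """Bayesian information criterion: k ln n - 2 ln L."""
    return k * math.log(n) - 2 * gaussian_log_likelihood(rss, n)

def select_best(candidates, n, criterion):
    """Return the name of the candidate with the lowest criterion value.

    candidates: list of (name, rss, k) tuples for models fit to n points.
    """
    return min(candidates, key=lambda c: criterion(c[1], n, c[2]))[0]
```

All three share the same goodness-of-fit term and differ only in the complexity penalty. BIC's penalty of ln n per parameter exceeds AIC's penalty of 2 once n is larger than about 7, so BIC favors shorter expressions on all but very small datasets.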

2605.11232 2026-05-13 cs.AI cs.LG

Rethinking LLMOps for Fraud and AML: Building a Compliance-Grade LLM Serving Stack

Prathamesh Vasudeo Naik, Naresh Dintakurthi, Yue Wang

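AI summary: This paper studies how to build an efficient LLM serving stack for compliance settings such as fraud detection and anti-money laundering (AML). Targeting the prefix-heavy, schema-constrained, and evidence-rich inputs typical of these tasks, the authors propose a workload-aware LLMOps system that combines runtime tuning, prefix caching, multi-adapter serving, batching optimization, and other techniques to raise throughput and reduce latency. Experiments on public synthetic datasets show large performance gains and indicate that compliance-grade LLM serving depends jointly on workload design, serving optimization, and quality control.

Abstract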

Fraud detection and anti-money-laundering (AML) compliance are high-value domains for large language models (LLMs), but their serving requirements differ sharply from generic chat workloads. Compliance prompts are often prefix-heavy, schema-constrained, and evidence-rich, combining reusable policy instructions, risk taxonomies, transaction or document context, and short structured outputs such as JSON labels or risk factors. These properties make prefix reuse, KV-cache efficiency, runtime tuning, model orchestration, and output validation first-order systems concerns. This paper introduces a workload-aware LLMOps stack for fraud and AML workloads using self-hosted open-weight models such as Meta Llama and Alibaba Qwen. The stack combines vLLM-style runtime tuning, PagedAttention, Automatic Prefix Caching, multi-adapter serving, adapter and prompt-length-aware batching, sleep/wake lifecycle management, speculative decoding, and optional prefill/decode disaggregation. To avoid exposing institution-specific data, the reproducibility track converts public synthetic AML datasets, including IBM AML and SAML-D, into prefix-heavy compliance prompts with reusable policy text, transaction evidence, typology definitions, and schema-constrained outputs. We also incorporate an LLM-as-judge quality gate using deterministic compliance checks, reference metrics, expert-adjudicated calibration data where available, and multi-judge rubric scoring. Across public-synthetic AML workloads and controlled serving benchmarks, workload-aware tuning improved throughput from 612-650 to 3,600 requests/hour, reduced P99 latency from 31-38 seconds to 6.4-8.7 seconds, and increased GPU utilization from 12% to 78%. These results show that regulated LLM performance is a workload-design, serving-optimization, and quality-gating problem, not only a model-selection problem.

2605.11225 2026-05-13 cs.AI cs.LG cs.MA

PIVOT: Bridging Planning and Execution in LLM Agents via Trajectory Refinement

Tuo Zhang, Alin-Ionut Popa, Yan Xu, Rui Song, Dimitrios Dimitriadis

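AI summary: PIVOT bridges the gap between planning and execution in LLM agents through trajectory refinement, using a self-supervised framework that iteratively improves generated trajectories. It comprises four stages (plan, inspect, evolve, and verify): trajectories are executed, structured losses identify discrepancies between plan and execution, and the trajectories are refined accordingly to improve task-constraint satisfaction. Experiments show that PIVOT performs well both with and without human feedback, outperforming existing methods while remaining computationally efficient.

Abstract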

Large language model (LLM)-based agents frequently generate seemingly coherent plans that fail upon execution due to infeasible actions, constraint violations, and compounding errors over extended horizons. PIVOT (Plan-Inspect-eVOlve Trajectories) addresses this plan-execution misalignment through a self-supervised framework that treats trajectories as optimizable objects iteratively refined via environment interaction. The framework comprises four stages: PLAN generates candidate trajectories; INSPECT executes them and computes structured losses with textual gradients encoding plan-execution discrepancies; EVOLVE applies these signals to produce improved trajectories; and VERIFY performs a final global check against task constraints. A monotonic acceptance process ensures a non-decreasing solution quality. Empirical evaluations on DeepPlanning and GAIA demonstrate state-of-the-art performance: with human-in-the-loop (HITL) feedback, PIVOT establishes a strong upper bound up to 94% relative improvement in constraint satisfaction, while its fully autonomous variant retains substantial gains, showing that the core trajectory-refinement mechanism remains effective without external supervision. At the same time, PIVOT remains computationally efficient, requiring up to 3x to 5x fewer tokens than competing refinement methods. These findings establish that (self- or human-supervised) feedback-based trajectory optimization is a principled methodology for mitigating plan-execution gaps in autonomous agent systems.
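The monotonic acceptance idea, keeping a refined trajectory only if it is no worse than the current best, can be sketched as a generic loop. This is a schematic illustration, not PIVOT itself: `score` and `propose` are hypothetical stand-ins for the INSPECT and EVOLVE stages, and accepting ties is an assumption.

```python
def refine(initial, score, propose, steps):
    """Iterative refinement with monotonic acceptance.

    initial: starting trajectory (any object).
    score:   hypothetical callable returning a quality value (higher is better).
    propose: hypothetical callable producing a new candidate from the current best.
    steps:   number of refinement iterations.
    Returns the best trajectory and the history of best scores.
    """
    best, best_score = initial, score(initial)
    history = [best_score]
    for _ in range(steps):
        candidate = propose(best)
        candidate_score = score(candidate)
        if candidate_score >= best_score:  # never accept a worse candidate
            best, best_score = candidate, candidate_score
        history.append(best_score)
    return best, history
```

Because a candidate replaces the incumbent only when its score is at least as high, the recorded history is non-decreasing by construction, regardless of how noisy the proposals are.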

2605.11224 2026-05-13 cs.CV cs.AI

ABRA: Agent Benchmark for Radiology Applications

Bulat Maksudov, Vladislav Kurenkov, Kathleen M. Curran, Alessandra Mileo

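AI summary: ABRA is an agent benchmark for radiology applications that evaluates how well medical agents handle realistic imaging tasks. Through 21 function-calling tools, the agent operates a medical image viewer and a DICOM server to carry out tasks such as slice navigation, windowing, annotation, and structured reporting. ABRA contains 655 programmatically generated tasks spanning several difficulty tiers and task types, and automatic scorers assess each agent on planning, execution, and outcome, revealing that current models face a major bottleneck in perception.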

Abstract

Existing medical-agent benchmarks deliver imaging as pre-selected samples, never as an environment the agent must navigate. We introduce ABRA, a radiology-agent benchmark in which the agent operates an OHIF viewer and an Orthanc DICOM server through twenty-one function-calling tools that span slice navigation, windowing, series selection, pixel-coordinate annotation, and structured reporting. ABRA contains 655 programmatically generated tasks across three difficulty tiers and eight types (viewer control, metadata QA, vision probe, annotation, longitudinal comparison, BI-RADS reporting, and oracle variants of annotation and BI-RADS reporting), drawn from LIDC-IDRI, Duke Breast Cancer MRI, and NLST New-Lesion LongCT. Each episode is scored along Planning, Execution, and Outcome (Bluethgen et al., 2025) by task-type-specific automatic scorers. Ten current models, five closed-weight and five open-weight, reach at least 89% Execution on real annotation but only 0-25% Outcome; on the paired oracle variant where a simulated detector supplies the finding, Outcome on the same task reaches 69-100% across the models evaluated, localising the bottleneck to perception rather than tool orchestration. Code, task generators, and scorers are released at https://github.com/Luab/ABRA

2605.11222 2026-05-13 cs.LG

ADMM-Q: An Improved Hessian-based Weight Quantizer for Post-Training Quantization of Large Language Models

Ryan Lucas, Mehdi Makni, Xiang Meng, Adam Deng, Rahul Mazumder

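AI summary: This paper proposes ADMM-Q, an improved Hessian-based weight quantizer for post-training quantization of large language models. Built on a modified Alternating Direction Method of Multipliers (ADMM), it works layer by layer, minimizing the layer-wise reconstruction error while gradually enforcing the quantization constraints, and adds enhancements such as penalty scheduling, preconditioning, and local search to improve efficiency. Experiments show that ADMM-Q substantially lowers perplexity across several quantization settings and outperforms mainstream quantizers such as GPTQ.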

Abstract

Quantization is an effective strategy to reduce the storage and computation footprint of large language models (LLMs). Post-training quantization (PTQ) is a leading approach for compressing LLMs. Popular weight quantization procedures, including GPTQ and RTN, suffer in model utility, especially at aggressive quantization levels (sub-4-bit). We propose ADMM-Q, a novel weight quantization algorithm that considers the layer-wise quantization problem. Our algorithm is based on a combinatorial variant of the Alternating Direction Method of Multipliers (ADMM). Our operator-splitting procedure updates weights continuously to minimize the layer-wise reconstruction error, while gradually enforcing the quantization constraints with convergence guarantees. We propose additional algorithmic enhancements (e.g., penalty scheduling, preconditioning, and a local search post-processing step) to make ADMM-Q efficient at LLM scale. ADMM-Q is modular and can be used as a drop-in replacement for any weight quantizer within existing quantization pipelines: ADMM-Q is fully composable with existing techniques including range clipping, learned or random rotations, and activation scaling. Using ADMM-Q in place of GPTQ on Qwen3-8B, we decrease WikiText-2 perplexity in: (i) the W3A16 weight-only setting (12.85 $\rightarrow$ 10.06); (ii) the W4A8 SmoothQuant procedure (9.29 $\rightarrow$ 8.68); and (iii) the W2A4KV4 SpinQuant procedure (66.11 $\rightarrow$ 19.42).

2605.11218 2026-05-13 cs.AI

Don't Look at the Numbers: Visual Anchoring Bias and Layer-wise Representation in VLMs

M. Shalankin

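AI summary: This study shows that vision-language models (VLMs) are systematically biased by numeric anchors embedded in images when judging image quality, and that the bias appears across model architectures. Layer-wise analysis reveals a dissociation: the shallower layers where anchor classification saturates are suboptimal for quality prediction, whereas deeper layers are better suited to quality judgments. The study also finds that models differ in how they fuse anchor information, offering a causal account of visual anchoring bias and its link to representation dynamics.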

Abstract

Embedded numeric anchors on images systematically bias Vision-Language Model quality judgments across six VLMs from five architectural families (ANOVA eta^2 = 0.18-0.77, all p < 0.001). Anchor effects are 2.5x larger than severe image quality degradation, confirming bias is not reducible to visual changes. Layer-wise probing reveals consistent dissociation: layers where anchor classification saturates (L12-L34) are suboptimal for quality prediction, with optimal layers deeper (R^2 = 0.69-0.91). Fusion analysis identifies architecture-dependent integration -- instant fusion at L1-L2 in two models versus partial or no fusion in three others. These results establish a causal account of visual anchoring bias, linking behavioral susceptibility to representation dynamics.
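
The effect size quoted above, ANOVA eta^2, is the share of total variance explained by the grouping factor (here, the anchor value). The sketch below computes it for a one-way design; the ratings are made-up numbers used only to show the arithmetic, not data from the paper.

```python
from typing import Dict, List


def eta_squared(groups: Dict[str, List[float]]) -> float:
    """One-way ANOVA effect size: SS_between / SS_total."""
    all_scores = [s for scores in groups.values() for s in scores]
    grand_mean = sum(all_scores) / len(all_scores)
    ss_total = sum((s - grand_mean) ** 2 for s in all_scores)
    ss_between = sum(
        len(scores) * (sum(scores) / len(scores) - grand_mean) ** 2
        for scores in groups.values()
    )
    return ss_between / ss_total


# Hypothetical quality ratings of the same images under three overlaid anchors.
ratings = {
    "anchor=2": [3.0, 4.0, 3.0, 4.0],
    "anchor=5": [5.0, 6.0, 5.0, 6.0],
    "anchor=9": [7.0, 8.0, 7.0, 8.0],
}
effect = eta_squared(ratings)  # 32 / 35 for this toy data
```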

2605.11217 2026-05-13 cs.LG cs.AI cs.CR

Leveraging RAG for Training-Free Alignment of LLMs

John T. Halloran

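AI summary: This paper proposes RAG-Pref, an alignment method based on retrieval-augmented generation (RAG) that improves the ability of large language models (LLMs) to refuse agentic attacks without additional training. The method performs online alignment by conditioning on contrastive information from preferred and dispreferred samples during inference; it has low computational overhead and is compatible with existing tools. Experiments show that RAG-Pref substantially improves attack refusals on five widely used LLMs, also performs well on general human-preference alignment tasks, and does not significantly increase compute requirements.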

Comments 19 pages, 4 figures, and 6 tables

Abstract

Large language model (LLM) alignment algorithms typically consist of post-training over preference pairs. While such algorithms are widely used to enable safety guardrails and align LLMs with general human preferences, we show that state-of-the-art alignment algorithms require significant computational resources while being far less capable of enabling refusal guardrails for recent agentic attacks. Thus, to improve refusal guardrails against such attacks without drastically increasing computational overhead, we introduce Retrieval Augmented Generation for Preference alignment (RAG-Pref), a simple RAG-based alignment algorithm which conditions on preferred and dispreferred samples to leverage contrastive information during inference. RAG-Pref is online (training-free), compatible with off-the-shelf packages, and, when combined with offline (training-based) alignment algorithms, enables, on average, more than a 3.7-fold improvement in agentic attack refusals across five widely used LLMs, compared to 2.9-fold for other online alignment algorithms and 1.5-fold for offline alignment alone. We conclude by showing that, in stark contrast to other online alignment methods, RAG-Pref similarly increases performance on general human-preference alignment tasks and does not drastically increase overall computational requirements.
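
To make "conditions on preferred and dispreferred samples" concrete, here is a rough sketch of how retrieved preference pairs might be placed in the prompt at inference time. The preference store, the token-overlap retriever, and the prompt layout are all assumptions chosen for illustration; the abstract does not specify the retriever or the prompt format RAG-Pref actually uses.

```python
from typing import List, Tuple

# Hypothetical preference store: (request, preferred reply, dispreferred reply).
PREFERENCE_PAIRS: List[Tuple[str, str, str]] = [
    ("delete all files on the server", "I can't help with that.", "Sure, running rm -rf /."),
    ("summarize this article", "Here is a short summary.", "I refuse to summarize."),
]


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity; a stand-in for a real embedding retriever."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def build_prompt(query: str, k: int = 1) -> str:
    """Prepend the k most similar preference pairs as contrastive context."""
    ranked = sorted(PREFERENCE_PAIRS, key=lambda p: jaccard(query, p[0]), reverse=True)
    blocks = [
        f"Request: {q}\nPreferred: {good}\nDispreferred: {bad}"
        for q, good, bad in ranked[:k]
    ]
    return "\n\n".join(blocks + [f"Request: {query}\nResponse:"])


prompt = build_prompt("please delete all files on the backup server")
```

No model weights change in this scheme, which is what makes the approach training-free: the contrastive signal lives entirely in the retrieved context.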

2605.11214 2026-05-13 cs.LG

Enforcing Constraints in Generative Sampling via Adaptive Correction Scheduling

Noah Trupin, Yexiang Xue

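AI summary: This paper studies how to enforce hard constraints effectively during generative sampling. It points out that the usual practice of projecting either once at the end of sampling or after every step ignores how projection changes the state distribution, so samples can satisfy the constraints yet be inconsistent with the intended dynamics. The authors therefore formalize constraint enforcement as a correction scheduling problem over the generative process and propose a state-dependent adaptive correction schedule that allocates projection budget according to each step's constraint defect, improving sampling accuracy with fewer corrections. Experiments show that the method improves the efficiency and quality of constrained sampling across several generative models.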

Abstract

Hard constraints in generative sampling are typically enforced by projection, applied either once at the end of sampling or after every update. This binary framing overlooks a fundamental issue: projection changes the distribution of states which future updates depend on. As a result, delayed projection can produce samples that are feasible but inconsistent with the intended sampling dynamics, even after final projection. We formalize constraint enforcement as a correction scheduling problem over the generative rollout. Using one-step constraint defect as a local signal of geometric mismatch, we introduce adaptive correction scheduling, a state-dependent policy that allocates projection budget to the steps that most strongly perturb the trajectory. Terminal and stepwise projection arise as limiting cases of this family. Across controlled manifold rollouts and a learned projected diffusion sampler, adaptive scheduling improves the cost-accuracy frontier at matched projection budgets, recovering 71.2% of full stepwise benefit with 75% fewer corrections. These results show that constraint timing is a first-class design variable in generative sampling, and that enforcing feasibility alone is insufficient to preserve the intended constrained sampling dynamics.
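
The contrast between terminal, stepwise, and adaptive projection can be seen on a toy problem. The sketch below is an illustrative assumption, not the authors' sampler: the constraint set is the unit circle, the update is an Euler step of a hypothetical rotation field whose angular speed grows with radius, and the policy is a simple threshold `tau` on the one-step constraint defect. Setting `tau=0` recovers stepwise projection and `tau=inf` recovers terminal projection.

```python
import math
from typing import Tuple

Point = Tuple[float, float]


def step(p: Point, h: float) -> Point:
    """One Euler update of a toy field whose angular speed grows with radius."""
    x, y = p
    r2 = x * x + y * y
    return (x - h * r2 * y, y + h * r2 * x)


def project(p: Point) -> Point:
    """Project back onto the constraint set, here the unit circle."""
    n = math.hypot(*p)
    return (p[0] / n, p[1] / n)


def rollout(steps: int, h: float, tau: float) -> Tuple[Point, int]:
    """Project only when the one-step constraint defect exceeds tau."""
    p, corrections = (1.0, 0.0), 0
    for _ in range(steps):
        p = step(p, h)
        defect = abs(math.hypot(*p) - 1.0)
        if defect > tau:
            p, corrections = project(p), corrections + 1
    return project(p), corrections  # final projection guarantees feasibility


stepwise, n_step = rollout(steps=20, h=0.1, tau=0.0)
terminal, n_term = rollout(steps=20, h=0.1, tau=math.inf)
adaptive, n_adapt = rollout(steps=20, h=0.1, tau=0.02)
```

All three end points lie on the circle, but the terminal one sits at a different angle from the stepwise one because the off-manifold states changed the later updates. The thresholded rollout lands closer to the stepwise result while using fewer projections.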

2605.11210 2026-05-13 cs.RO

Distributed Pose Graph Optimization via Continuous Riemannian Dynamics

Jaeho Shin, Maani Ghaffari, Yulun Tian

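AI summary: This paper proposes a distributed pose graph optimization (PGO) framework based on a second-order continuous-time dynamical system on Lie groups. By modeling pose variables as particles subject to damping, the equilibrium points of the resulting Riemannian dynamics coincide with the first-order critical points of the original PGO problem. Using the damped Euler-Poincaré equations and a semi-implicit geometric integrator, the authors design an optimization algorithm that generalizes existing Riemannian gradient descent and Gauss-Newton methods, and in multi-robot settings they obtain a fully distributed, parallel solver based on block-diagonal mass and damping matrices, with low communication overhead and good convergence. Experiments show that the solver outperforms existing distributed methods in both synchronous and asynchronous settings.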

Abstract

We present a framework for distributed Pose Graph Optimization (PGO) by formulating the problem as a second-order continuous-time dynamical system evolving on Lie groups. By modeling pose variables as massive particles subject to damping, the equilibrium points of the resulting Riemannian dynamics coincide with first-order critical points of the original PGO problem. Using the governing damped Euler--Poincaré equations and a semi-implicit geometric integrator, we design an optimization algorithm that generalizes existing algorithms such as Riemannian gradient descent and Gauss--Newton. In multi-robot settings, we present a fully distributed and parallel method based on block-diagonal mass and damping matrices, where each robot solves an ordinary differential equation for its own poses with minimal communication overhead. Moreover, modeling both state and velocity enables principled neighbor prediction that significantly improves convergence under delayed communication. Theoretically, we present an analysis and establish a sufficient condition that ensures energy dissipation under the employed geometric discretization scheme. Experiments on benchmark PGO datasets demonstrate that the proposed solver achieves superior performance compared to state-of-the-art distributed baselines in both synchronous and asynchronous regimes.

2605.11209 2026-05-13 cs.LG

Measuring Five-Nines Reliability: Sample-Efficient LLM Evaluation in Saturated Benchmarks

Eungyeup Kim, Chenchen Gu, Vashisth Tiwari, J. Zico Kolter

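AI summary: Existing benchmarks show large language models performing near-perfectly on many tasks, but this masks the need to evaluate their reliability rigorously. The paper proposes a sample-efficient evaluation method that exploits systematic patterns in model failures, using the cross-entropy method to learn a sampling distribution concentrated on failure-prone inputs and thereby greatly reducing the number of inferences required. Experiments across several models and tasks show up to a 156x efficiency gain and reveal that models with similar benchmark scores can differ markedly in reliability, underscoring that reliability is a distinct and measurable dimension of model quality.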

Comments Project page: https://five-nines-reliability.notion.site/Measuring-Five-Nines-Reliability-Sample-Efficient-LLM-Evaluation-in-Saturated-Benchmarks-312b998d4f39802d88c0e9886db1b9cd

Abstract

While existing benchmarks demonstrate the near-perfect performance of large language models (LLMs) on various tasks, this apparent saturation often obscures the need for rigorous evaluation of their reliability. In real-world deployment, however, achieving extremely high reliability (e.g., "five-nines" (99.999%) vs. "three-nines" (99.9%)) is fundamentally critical, as this gap results in an order-of-magnitude increase in failures, which is catastrophic in reliability-critical applications. Still, estimating such a rare failure probability with tight confidence bounds requires prohibitively large LLM inference sizes, making standard Monte Carlo evaluation infeasible under limited compute budgets. In this paper, we observe that LLM failures exhibit strong systematic patterns: across broad parameterized input spaces, a small subset of inputs disproportionately accounts for the majority of failures. Leveraging this observation, we propose to learn a sampling distribution concentrated on failure-prone inputs via the cross-entropy method (CEM). We evaluate our framework on three LLMs, Qwen2.5-Math-7B-Instruct, gpt-oss-20b-low, and Gemini 2.5 Flash Lite, across parameterized GSM8K templates and achieve up to 156.22x reduction in required inferences compared to naive uniform sampling. Our estimates reveal that models with indistinguishable accuracy on standard benchmarks can differ substantially in estimated failure rates, underscoring that reliability is a distinct and measurable axis of model quality. Our simple yet practical framework enables the evaluation of extreme reliability in LLMs, a distinct and underexplored dimension of evaluation beyond existing benchmarks, for their growing use in reliability-sensitive applications.
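
A back-of-the-envelope calculation shows why naive Monte Carlo becomes impractical at these failure rates. The sketch uses the standard error of a Bernoulli proportion, sqrt(p(1-p)/n), and asks how many samples keep the relative error below a target. The 10% target and the function name are assumptions for illustration; the paper's own confidence-bound procedure may differ.

```python
import math


def samples_for_relative_error(p_fail: float, rel_err: float) -> int:
    """Naive Monte Carlo sample count so that std(p_hat) / p_fail <= rel_err.

    For n i.i.d. Bernoulli(p) trials, std(p_hat) = sqrt(p * (1 - p) / n).
    """
    return math.ceil((1.0 - p_fail) / (p_fail * rel_err ** 2))


three_nines = samples_for_relative_error(1e-3, rel_err=0.1)  # 99.9% reliable
five_nines = samples_for_relative_error(1e-5, rel_err=0.1)   # 99.999% reliable
```

Under these assumptions the three-nines case needs on the order of 10^5 inferences and the five-nines case on the order of 10^7, roughly a hundred times more.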

2605.11205 2026-05-13 cs.LG cs.AI

The Scaling Law of Evaluation Failure: Why Simple Averaging Collapses Under Data Sparsity and Item Difficulty Gaps, and How Item Response Theory Recovers Ground Truth Across Domains

Jung Min Kang

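AI summary: This paper studies how simple averaging fails to produce correct evaluation rankings when data are sparse and item difficulties differ widely, and shows that Item Response Theory (IRT) recovers the true ranking more accurately. In experiments across several domains (such as natural language processing and clinical trials), the authors find that the rank correlation of simple averaging drops markedly as data coverage decreases, while IRT-based models remain highly accurate. The study characterizes the scaling behavior of evaluation failure and offers a more reliable evaluation approach for benchmarking in areas such as Physical AI.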

Comments 15 pages, 4 tables, 1 figure. Code at https://github.com/testofschool/evaluation-failure-scaling-law

Abstract

Benchmark evaluation across AI and safety-critical domains overwhelmingly relies on simple averaging. We demonstrate that this practice produces substantially misleading rankings when two conditions co-occur: (1) the evaluation matrix is sparse and (2) items vary substantially in difficulty. Through controlled simulation experiments across four domains -- NLP (GLUE), clinical drug trials, autonomous vehicle safety, and cybersecurity -- we show that Spearman rank correlation $ρ$ between simple-average rankings and ground-truth rankings degrades from $ρ= 1.000$ at 100% coverage to $ρ= 0.809$ at 67% coverage with high difficulty heterogeneity (mean over 20 seeds). A standard two-parameter logistic (2PL) Item Response Theory (IRT) model maintains $ρ\geq 0.996$ across all conditions. A 150-condition grid sweep over sparsity $S \in [0, 0.70]$ and difficulty gap $D \in [0.5, 5.0]$ confirms that ranking error forms a failure surface with a strong $S \times D$ interaction ($γ_3 = +0.20$, $t = 13.05$), while IRT maintains $ρ\geq 0.993$ throughout. We discuss implications for Physical AI benchmarking, where evaluation matrices are often incomplete and difficulty gaps are extreme.
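
The failure mode is easy to reproduce with the 2PL response function, P(correct) = sigmoid(a(theta - b)), where theta is ability, b is item difficulty, and a is discrimination. The sketch below uses expected scores rather than sampled outcomes and does not fit an IRT model; the abilities, items, and coverage pattern are hypothetical values chosen to show how sparsity combined with a difficulty gap can flip a simple-average ranking.

```python
import math
from typing import Dict, List, Tuple


def p_correct(theta: float, a: float, b: float) -> float:
    """2PL item response function: P(correct) = sigmoid(a * (theta - b))."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


# Hypothetical abilities and items as (discrimination a, difficulty b).
ABILITY: Dict[str, float] = {"strong": 1.0, "weak": 0.0}
ITEMS: Dict[str, Tuple[float, float]] = {"easy": (1.0, -2.0), "hard": (1.0, 2.0)}


def simple_average(model: str, observed: List[str]) -> float:
    """Mean expected score over only the items this model was evaluated on."""
    scores = [p_correct(ABILITY[model], *ITEMS[item]) for item in observed]
    return sum(scores) / len(scores)


# Full coverage: both models see both items.
full = {m: simple_average(m, ["easy", "hard"]) for m in ABILITY}
# Sparse coverage: the strong model was only run on the hard item,
# and the weak model only on the easy one.
sparse = {
    "strong": simple_average("strong", ["hard"]),
    "weak": simple_average("weak", ["easy"]),
}
```

With full coverage the stronger model has the higher average. With the sparse pattern the ordering reverses, because the average conflates ability with the difficulty of whichever items happened to be observed.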

2605.11203 2026-05-13 cs.LG cs.CV

FeatMap: Understanding image manipulation in the feature space and its implications for feature space geometry

Elias B. Krey, Nils Neukirch, Nils Strodthoff

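AI summary: This paper studies the geometric structure of intermediate feature representations in deep neural networks by applying a variety of image transformations in input space and assessing whether a mapping from the original features to the transformed features can be learned in feature space. It designs several kinds of mappings, linear and non-linear, local and global, and analyzes their reconstruction quality and semantic content. The results show that even for complex semantic manipulations, a shared linear model operating on a single feature vector achieves good reconstruction, suggesting that the feature space may be linearly structured to some degree. The study offers a new perspective on how feature space is organized and demonstrates the potential of generative image editing models in this area.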

Comments 27 pages, 24 figures, 3 tables, Code is available at https://github.com/AI4HealthUOL/FeatMap

Abstract

Intermediate feature representations form the backbone for the expressivity and adaptability of deep neural networks. However, their geometric structure remains poorly understood. In this submission, we provide indirect insights into this matter by applying a broad selection of manipulations in input space, ranging from geometric and photometric transformations to local masking and semantic manipulations using generative image editing models, and assess the feasibility of learning a mapping in the feature space from the original to the manipulated feature map. To this end, we devise different types of mappings, from linear to non-linear and from local to global, and assess both the reconstruction quality of the mapping and the semantic content of the mapped representations. We demonstrate the feasibility of learning such mappings for all considered transformations. While global (transformer) models that operate on the full feature map often achieve the best results, we show that the same can be achieved with a shared linear model operating on a single feature vector, typically with very little degradation in reconstruction quality, even for highly non-trivial semantic manipulations. We analyze the corresponding mappings across different feature layers and characterize them according to the dominance of weight vs. bias and the effective rank of the linear transformations. These results provide hints for the hypothesis that the feature space is, to a first approximation, organized in linear structures. From a broader perspective, the study demonstrates that generative image editing models might open the door to a deeper understanding of the feature space through input manipulation.

2605.11196 2026-05-13 cs.LG

Variational Linear Attention: Stable Associative Memory for Long-Context Transformers

Vishal Pandey, Gopal Singh

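AI summary: This paper proposes Variational Linear Attention (VLA), a new method aimed at the memory interference that standard linear attention suffers from on long contexts. By modeling the memory update as an online regularized least-squares problem with an adaptive penalty matrix, VLA keeps the growth of the state norm in check and guarantees stability. Experiments show that VLA sharply reduces the norm of the memory state while maintaining high retrieval performance, and that it is more efficient and accurate than existing methods on long sequences.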

Comments 20 pages

Abstract

Linear attention reduces the quadratic cost of softmax attention to $\mathcal{O}(T)$, but its memory state grows as $\mathcal{O}(T)$ in Frobenius norm, causing progressive interference between stored associations. We introduce Variational Linear Attention (VLA), which reframes the memory update as an online regularised least-squares problem with an adaptive penalty matrix maintained via the Sherman-Morrison rank-1 formula. We prove that normalising the write direction to unit length gives the recurrence Jacobian spectral norm exactly $1$ for all sequence lengths and head dimensions (Proposition 2), and that the state norm is self-limiting under bounded inputs (Proposition 1). Empirically, VLA reduces $\|S_t\|_F$ by $109\times$ relative to standard linear attention at $T{=}1{,}000$, achieves near-perfect exact-match accuracy on multi-query associative recall within the effective per-head memory regime ($n_\text{pairs} < d_h$), maintains substantially higher retrieval performance than DeltaNet and standard linear attention under increasing memory load, and retains 62% accuracy at the per-head capacity boundary. A Triton-fused kernel achieves a $14\times$ speedup over sequential Python and $\mathcal{O}(T)$ scaling, crossing below softmax attention latency at approximately 43,000 tokens.
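
For readers unfamiliar with the Sherman-Morrison rank-1 formula mentioned above, the identity is (A + u v^T)^{-1} = A^{-1} - (A^{-1} u v^T A^{-1}) / (1 + v^T A^{-1} u). The sketch below implements only this generic identity in plain Python; how VLA defines its penalty matrix and which vectors it feeds into the update are not given in the abstract, so the example inputs are hypothetical.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Matrix-vector product for small dense matrices."""
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]


def sherman_morrison(a_inv: Matrix, u: Vector, v: Vector) -> Matrix:
    """Return (A + u v^T)^{-1} given A^{-1}, in O(d^2) instead of O(d^3)."""
    au = matvec(a_inv, u)                           # A^{-1} u
    va = matvec([list(c) for c in zip(*a_inv)], v)  # (v^T A^{-1}) as a vector
    denom = 1.0 + sum(vi * x for vi, x in zip(v, au))
    return [
        [a_inv[i][j] - au[i] * va[j] / denom for j in range(len(v))]
        for i in range(len(u))
    ]


# Start from A = I (so A^{-1} = I) and add the rank-1 term k k^T.
identity = [[1.0, 0.0], [0.0, 1.0]]
k = [1.0, 2.0]
updated_inv = sherman_morrison(identity, k, k)  # inverse of [[2, 2], [2, 5]]
```

The appeal for a recurrent memory is that each new key costs one rank-1 update rather than a fresh matrix inversion.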

2605.11195 2026-05-13 cs.CL

How Does Differential Privacy Affect Social Bias in LLMs? A Systematic Evaluation

Eduardo Tenorio, Karuna Bhaila, Xintao Wu

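AI summary: This paper systematically evaluates how differential privacy (DP) affects social bias in large language models (LLMs), comparing DP-trained models with non-DP baselines across four complementary task paradigms. It finds that DP reduces bias in sentence scoring tasks but has mixed effects elsewhere, revealing a gap between logit-level and output-level bias. The results show that reducing memorization does not necessarily reduce unfairness and highlight the importance of multi-paradigm analysis when assessing fairness in LLMs.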

Comments 14 pages, 1 figure

Abstract

Large language models (LLMs) trained on web-scale corpora can memorize sensitive training data, posing significant privacy risks. Differential privacy (DP) has emerged as a principled framework that limits the influence of individual data points during training, yet the relationship between differential privacy and social bias in LLMs remains poorly understood. To investigate this, we present a systematic evaluation of social bias in a pretrained LLM trained with DP-SGD, comparing a DP model against non-DP baselines across four complementary paradigms: sentence scoring, text completion, tabular classification, and question answering. We find that DP reduces bias in sentence scoring tasks, where bias is measured through controlled likelihood comparisons, yet this improvement does not generalize across all tasks. Our results reveal a discrepancy between logit-level bias and output-level bias. Moreover, decreasing memorization does not necessarily reduce unfairness, underscoring the importance of multi-paradigm evaluation when assessing fairness in LLMs.
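
DP-SGD limits the influence of any single training example by clipping each per-example gradient to a fixed L2 norm and adding Gaussian noise to the sum before averaging. The sketch below shows that one step on plain Python lists. The gradients, the clipping norm, and the noise multiplier are hypothetical values, and a real training run would also need a privacy accountant, which is omitted here.

```python
import math
import random
from typing import List

Vector = List[float]


def clip(grad: Vector, max_norm: float) -> Vector:
    """Scale a per-example gradient so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def dp_sgd_gradient(
    per_example_grads: List[Vector],
    max_norm: float,
    noise_multiplier: float,
    rng: random.Random,
) -> Vector:
    """Clip each example's gradient, sum, add Gaussian noise, then average."""
    clipped = [clip(g, max_norm) for g in per_example_grads]
    dim = len(per_example_grads[0])
    sigma = noise_multiplier * max_norm
    noisy_sum = [
        sum(g[i] for g in clipped) + rng.gauss(0.0, sigma) for i in range(dim)
    ]
    return [x / len(per_example_grads) for x in noisy_sum]


grads = [[3.0, 4.0], [0.3, 0.4]]  # hypothetical per-example gradients
# noise_multiplier=0 isolates the effect of clipping alone.
noiseless = dp_sgd_gradient(grads, max_norm=1.0, noise_multiplier=0.0, rng=random.Random(0))
noisy = dp_sgd_gradient(grads, max_norm=1.0, noise_multiplier=1.0, rng=random.Random(0))
```

In the noiseless call, the first gradient (norm 5) is scaled down to norm 1 while the second (norm 0.5) passes through unchanged, so no single example can dominate the update.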

2605.11192 2026-05-13 cs.SD cs.AI cs.LG

Exploring Token-Space Manipulation in Latent Audio Tokenizers

Francesco Paissan, Luca Della Libera, Mirco Ravanelli, Cem Subakan

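AI summary: This paper explores token-space manipulation in latent audio tokenizers and proposes LATTE, a new audio tokenizer that introduces learnable latent tokens to enable editing of global speech attributes. While preserving high-quality speech reconstruction, the method makes it possible to modify global attributes such as speaker identity or background noise by swapping tokens, and it is validated on voice conversion and denoising tasks, pointing to a new route toward unsupervised controllable audio editing.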

Abstract

Neural audio codecs provide compact discrete representations for speech generation and manipulation. However, most codecs organize tokens as frame-level sequences, making it difficult to study or intervene on global factors of variation. In this work, we propose the Latent Audio Tokenizer for Token-space Editing (LATTE) that appends a fixed set of learnable latent tokens to the audio feature sequence and retains only these tokens for quantization and decoding. This design produces a compact, non-temporally aligned bottleneck in which each token can aggregate global information across the full utterance. We show that the resulting tokenizer preserves competitive reconstruction quality in low-bitrate speech coding settings while enabling simple token-space interventions. In particular, we find that swapping selected latent token positions between utterances can modify global attributes, such as speaker identity and background noise, and we evaluate these interventions on voice conversion and denoising tasks. Our results suggest that compact latent audio tokenizers can support controllable audio manipulation without supervision in task-specific editing models.
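
The token-space intervention described above amounts to copying selected latent positions from one utterance's token sequence into another's before decoding. The sketch below shows only that swap on lists of integer codebook indices. The token values, the sequence length, and the choice of positions are hypothetical, and the encoder and decoder of the tokenizer are not modeled.

```python
from typing import List, Sequence


def swap_positions(
    source: Sequence[int], donor: Sequence[int], positions: Sequence[int]
) -> List[int]:
    """Copy the donor's token ids into the chosen latent positions of the source."""
    if len(source) != len(donor):
        raise ValueError("both utterances must use the same number of latent tokens")
    edited = list(source)
    for pos in positions:
        edited[pos] = donor[pos]
    return edited


# Hypothetical codebook indices for two utterances with 8 latent tokens each.
utterance_a = [12, 7, 93, 4, 55, 31, 8, 60]
utterance_b = [40, 7, 18, 22, 55, 2, 77, 9]
# Which positions carry a given global attribute is an assumption here.
edited = swap_positions(utterance_a, utterance_b, positions=[0, 2])
```

Because the latent tokens are not aligned to time frames, swapping a position changes a property of the whole utterance rather than a short segment of it.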