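arXivDaily daily academic digest: synced with the full arXiv feed, with AI-generated summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and more.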

2606.11268 2026-06-11 cs.LG 新提交

LakeFM: Toward a Foundation Model for Aquatic Ecosystems Using Irregular Multivariate Multi-depth Time Series Data

LakeFM：基于不规则多变量多深度时间序列数据的水生生态系统基础模型

Abhilash Neog, Sepideh Fatemi, Medha Sawhney, Kazi Sajeed Mehrab, Aanish Pradhan, Bennett J. McAfee, Emma Marchisin, Arka Daw, Robert Ladwig, Cayelan C. Carey, Paul Hanson, Anuj Karpatne

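Affiliations: Virginia Tech; Grand Valley State University; University of Wisconsin - Madison; Amazon AGI; Aarhus University

AI summary: To address irregular sampling in lake time-series data and the difficulty of generalizing across lakes, the paper proposes LakeFM, a pre-trained foundation model that learns representations from simulated and observed data and achieves forecasting performance that is competitive with, and often better than, existing models.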

Comments KDD 2026

DOI: 10.1145/3770855.3819024

AI中文摘要

理解和预测湖泊动态对于监测湖泊和水库的水质及生态系统健康至关重要。尽管机器学习方法最近已被应用于生态时间序列数据，但现有工作假设时间和深度上的规则采样，并且难以在具有异质变量、深度和观测模式的湖泊之间泛化。为了解决这些局限性，我们引入了LakeFM，一个用于水生系统的基础模型，在包含模拟和观测湖泊的大规模生态数据集上预训练。通过广泛的实证评估，我们表明LakeFM学习了跨越更广泛湖泊层面特征的有意义表征，并在与现有时间序列基础模型和非基础模型相比时，实现了具有竞争力或通常更优的预测性能，同时产生与真实湖泊动态一致的物理上合理的预测。

英文摘要

Understanding and forecasting lake dynamics is critical for monitoring water quality and ecosystem health across lakes and reservoirs. While machine learning methods have been recently applied to ecological time-series data, existing works assume regular sampling in time and depth, and struggle to generalize across lakes with heterogeneous variables, depths, and observation patterns. To address these limitations, we introduce LakeFM, a foundation model for aquatic systems, pre-trained on large-scale ecological datasets comprising both simulated and observed lakes. Through extensive empirical evaluation, we show that LakeFM learns meaningful representations spanning broader lake-level characteristics, and achieves competitive or often superior forecasting performance compared to existing time-series foundation and non-foundation models, while producing physically plausible predictions consistent with real-world lake dynamics.

2606.11267 2026-06-11 cs.LG cs.CR 新提交

A prior-free blind detection of information leakage from model predictions

基于模型预测的信息泄露的无先验盲检测

Laurence A. Jacobs

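Affiliations: Center for Molecular Cardiology, University of Zurich; Center for Complexity Sciences, National University of Mexico

AI summary: For detecting information leakage from machine-learning model outputs, the paper proposes a decision-theoretic framework, proves that a calibrated leak is indistinguishable from an honest model while a near-deterministic subgroup can be detected without priors, and validates the results on UK Biobank.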

AI中文摘要

数据泄露——模型被基线不可用的信息污染——是基于机器学习的科学中主要的可重复性失败，然而检测工具需要训练代码、外部数据或领域专业知识。没有一种工具能作用于审计员最常持有的工件：模型的输出。我们询问仅从预测和结果中能判断出关于泄露的什么信息。我们给出了一个决策理论框架，其中泄露诊断是预测风险/结果规律的泛函，由与适当评分规则和决策曲线分析相关的阈值加权参数化。我们证明了一个尖锐的不可能性：重新校准的泄露匹配诚实模型的校准和区分度，通过预测的任何函数与诚实性能不可区分，因此广泛类别仅能通过外部提供的可实现区分度上限来检测。然后我们证明了泄露无法隐藏什么：近确定性子组——近标签泄露的特征——产生一个持续的单位纯度头部，任何非确定性结果的合法预测器都无法制造，从而产生一个无先验测试。这些结果将泄露组织成三分法——未校准、广泛校准和确定性——每个都有匹配的检测器和失败模式。我们在UK Biobank上使用时窗共病泄露进行验证，已知分级严重性，测量该终点上的检测下限$\Delta c^{*} \approx 0.007$，低于此的残余泄露从输出中无法检测，且太小无法改变结论。数值下限是队列和终点特定的；结构教训是通用的：仅输出检测在残余泄露与诚实的更强预测器无法区分时失败。该测试在商品硬件上不到一秒内返回对预测向量的判定。

英文摘要

Data leakage -- contamination of a model with information unavailable at baseline -- is the dominant reproducibility failure in machine-learning-based science, yet detection tools require training code, external data, or domain expertise. None operates on the artifact an auditor most often holds: the model's output. We ask what can be decided about leakage from predictions and outcomes alone. We give a decision-theoretic framework in which leakage diagnostics are functionals of the predicted-risk/outcome law, parameterized by a threshold-weighting linked to proper scoring rules and decision-curve analysis. We prove a sharp impossibility: a recalibrated leak matching an honest model's calibration and discrimination is indistinguishable from honest performance by any function of the predictions, so the broad class is detectable only against an externally supplied ceiling on achievable discrimination. We then prove what leakage cannot hide: a near-deterministic subgroup -- the signature of a near-label leak -- produces a sustained unit-purity head that no legitimate predictor of a non-deterministic outcome can manufacture, yielding a prior-free test. These results organize leakage into a trichotomy -- miscalibrated, broad-calibrated, and deterministic -- each with a matched detector and failure mode. We validate on UK Biobank using time-windowed comorbidity leakage with known, graded severity, measuring a detection floor of $\Delta c^{*} \approx 0.007$ on this endpoint, below which residual leakage is undetectable from output and too small to alter conclusions. The numerical floor is cohort- and endpoint-specific; the structural lesson is general: output-only detection fails where residual leakage is indistinguishable from an honestly stronger predictor. The test returns a verdict on a prediction vector in under a second on commodity hardware.

2606.11266 2026-06-11 cs.LG 新提交

Seeing Before Colliding: Anticipatory Safe RL with Frozen Vision-Language Models

碰撞前的预见：利用冻结视觉-语言模型的预期性安全强化学习

Samuel Tetteh, Cody Fleming

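Affiliations: Iowa State University

AI summary: Proposes the VLM-Safe-RL framework, which uses a frozen vision-language model to generate an anticipatory cost term that augments the CMDP Lagrangian update, balancing safety and return in high-speed collision scenarios.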

Comments 44pages, 26 figures

AI中文摘要

约束强化学习算法优化的成本信号几乎总是反应性的：模拟器仅在碰撞开始后发出非零成本，而PPO-Lagrangian的拉格朗日乘子仅在超出回合预算后增长。在比赛速度下，碰撞是瞬时且不可逆的，任何等待成本累积的安全机制在结构上都为时已晚。我们提出VLM-Safe-RL，一个将冻结的视觉-语言模型作为预期性成本项集成到CMDP拉格朗日更新中的框架。该框架包含四个贡献：(i) 解耦双路径CLIP，独立的奖励/成本路径，尊重CMDP的分解；(ii) VLM-Lagrange，一种增强的乘子更新，将每步VLM成本作为预期性项纳入；(iii) 置信门控，基于CLIP间隔的逻辑噪声模型导出的贝叶斯最优权重；(iv) VLMPPOLag，组合算法。在Safety-Gymnasium FormulaOne L2上，我们的主要评估（$n=5$个种子，$10^{6}$步，预算$d_{\text{lim}}=25$）中，VLMPPOLag$+$Conf是默认预算比较中唯一同时保持实质性回报（$J_r\approx40$）并在大多数种子上将成本控制在预算内的配置；五个约束感知基线（PPOLag, CPO, CPPOPID, CPO-CLG, PPOLag-RND）均至少未能满足一项要求。该机制泛化到保留的MetaDrive Medium（灾难率$41\%\to26\%$，95%自助法置信区间$[-26,-5]$个百分点），并显示出向Bullet Safety-Gym的方向一致迁移；我们诚实地报告了其不适用的情况（MetaDrive Easy/Hard, Qwen2-VL骨干），并将Hard失败归因于拉格朗日调节病理而非VLM信号本身。据我们所知，这是首个在CMDP拉格朗日更新中使用冻结VLM信号作为预期性成本项的工作。

英文摘要

The cost signal that constrained-RL algorithms optimize against is almost always reactive: the simulator emits a non-zero cost only after a collision has begun, and the Lagrange multiplier of PPO-Lagrangian grows only after the episode budget has been exceeded. At race speeds, where collisions are instantaneous and irreversible, any safety mechanism that waits for cost to accumulate is structurally too late. We present VLM-Safe-RL, a framework that integrates a frozen vision-language model into the CMDP Lagrangian update as an anticipatory cost term. The framework comprises four contributions: (i) Decoupled Dual-Path CLIP, independent reward/cost paths that respect the CMDP's factorization; (ii) VLM-Lagrange, an augmented multiplier update that incorporates a per-step VLM cost as an anticipatory term; (iii) Confidence Gating, a Bayes-optimal weight derived from a logistic noise model on the CLIP margin; and (iv) VLMPPOLag, the composed algorithm. On Safety-Gymnasium FormulaOne L2, our principal evaluation ($n=5$ seeds, $10^{6}$ steps, budget $d_{\text{lim}}=25$), VLMPPOLag+Conf is the only configuration in our default budget comparison that simultaneously retains substantive return ($J_r \approx 40$) and holds cost within budget on a majority of seeds; the five constraint-aware baselines (PPOLag, CPO, CPPOPID, CPO-CLG, PPOLag-RND) each fail at least one requirement. The mechanism generalizes to held-out MetaDrive Medium (catastrophe rate $41\% \to 26\%$, 95% bootstrap CI $[-26,-5]$ pp) and shows directionally consistent transfer to Bullet Safety-Gym; we report honestly where it does not (MetaDrive Easy/Hard, Qwen2-VL backbone) and trace the Hard failure to a Lagrangian-regulation pathology rather than the VLM signal itself. To our knowledge, this is the first work to use frozen VLM signals as an anticipatory cost term inside the CMDP Lagrangian update.
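
To make the idea of an anticipatory multiplier update concrete, here is a minimal sketch. It is not the paper's VLM-Lagrange rule, whose exact form the abstract does not give. It assumes the standard projected dual-ascent step used by PPO-Lagrangian, lambda <- max(0, lambda + eta * (J_c - d_lim)), and adds a hypothetical term proportional to the mean per-step VLM cost. The names `eta`, `beta`, and `vlm_costs` are illustrative assumptions.

```python
def lagrange_update(lam, episode_cost, budget, vlm_costs, eta=0.05, beta=0.5):
    """One projected dual-ascent step with an added anticipatory term.

    Reactive part: eta * (episode_cost - budget), the usual PPO-Lagrangian signal.
    Anticipatory part (assumed form, not from the paper): beta times the mean of
    per-step VLM costs, each taken to lie in [0, 1].
    """
    anticipatory = sum(vlm_costs) / len(vlm_costs) if vlm_costs else 0.0
    step = eta * (episode_cost - budget) + beta * anticipatory
    return max(0.0, lam + step)  # the multiplier must stay non-negative


# The episode is still under budget, so the reactive update leaves lambda at 0.
reactive_only = lagrange_update(0.0, episode_cost=20.0, budget=25.0, vlm_costs=[])

# With a (hypothetical) VLM signal flagging risk, lambda rises before any overrun.
with_vlm = lagrange_update(0.0, episode_cost=20.0, budget=25.0,
                           vlm_costs=[0.9, 0.8, 1.0])
```

In this toy setting the reactive rule does nothing until the budget is exceeded, while the augmented rule already increases the penalty, which is the behaviour the abstract describes as anticipatory.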

2606.11262 2026-06-11 cs.LG cs.AI 新提交

PermDoRA -- Understanding Adapter Interference in Language Models: Limits of Parameter-Space Geometry

PermDoRA -- 理解语言模型中的适配器干扰：参数空间几何的局限性

Gowtham Sivaramakrishnan, Sarvesha Kumar Kombaiah Seetha, Kishan Gupta Balaji, Santhosh Baradwaj Vaduvur Ranganathan

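Affiliations: Independent Researcher

AI summary: Investigates whether interference in adapter composition stems from overlapping linear parameter updates. Experiments with the DoRA-RBAC framework and a geometry-aware merging strategy indicate that parameter-space geometry is not the main cause of interference, which is instead consistent with interactions in shared nonlinear representations.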

Comments 18 Pages, COLM 2026

AI中文摘要

大型语言模型（LLMs）中的访问控制需要模块化机制，以在不重新训练或跨领域干扰的情况下实现特定领域行为。一个常见的假设是，适配器组合过程中的干扰源于线性参数更新的重叠，这表明强制正交性或方向独立性应能提高多领域性能。我们使用DoRA-RBAC（一种基于权重分解低秩适配的分层适配器组合框架）来测试这一假设。我们比较了传统的欧几里得合并与一种几何感知的黎曼启发式合并策略，该策略通过在LLaMA-3.1-8B和Mistral-7B上的多个QA基准（GPQA、PubMedQA、SimpleQA、WMDP）上进行归一化方向平均来近似弗雷歇均值。我们的结果表明，虽然单领域性能与LoRA相当，但几何感知合并相比标准平均在多领域组合中并未提供一致的优势。进一步分析揭示，适配器更新的角度对齐和正交性是组合性能的弱预测因子。这些发现表明，适配器干扰并非主要由参数空间几何决定，而是与共享非线性表示中的交互一致。

英文摘要

Access control in large language models (LLMs) requires modular mechanisms to enable domain-specific behavior without retraining or cross-domain interference. A common hypothesis is that interference during adapter composition arises from overlap in linear parameter updates, suggesting that enforcing orthogonality or directional independence should improve multi-domain performance. We test this hypothesis using DoRA-RBAC, a hierarchical adapter composition framework based on weight-decomposed low-rank adaptation. We compare conventional Euclidean merging with a geometry-aware Riemannian-inspired merging strategy that approximates the Fréchet mean via normalized directional averaging across multiple QA benchmarks (GPQA, PubMedQA, SimpleQA, WMDP) on LLaMA-3.1-8B and Mistral-7B. Our results show that while single-domain performance matches LoRA, geometry-aware merging provides no consistent advantage over standard averaging in multi-domain settings. Diagnostic analysis further reveals that angular alignment and orthogonality of adapter updates are weak predictors of composition performance. These findings suggest that adapter interference is not governed primarily by parameter-space geometry, but is instead consistent with interactions in shared nonlinear representations.
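
The contrast between Euclidean merging and normalized directional averaging can be sketched on toy vectors. This is an illustration under assumptions, not the DoRA-RBAC implementation: it treats each adapter update as a flat vector, and the helper names are hypothetical. Averaging unit vectors and renormalizing gives the extrinsic mean direction on the sphere, a common approximation to the Fréchet mean.

```python
import math


def normalize(v):
    """Scale v to unit length (assumes v is not the zero vector)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def euclidean_mean(vectors):
    """Plain coordinate-wise average, as in standard adapter averaging."""
    k = len(vectors)
    return [sum(col) / k for col in zip(*vectors)]


def directional_mean(vectors):
    """Normalize each vector, average, then renormalize onto the unit sphere."""
    units = [normalize(v) for v in vectors]
    return normalize(euclidean_mean(units))


a = [10.0, 0.0]  # large-norm update
b = [0.0, 1.0]   # small-norm update in an orthogonal direction
print(euclidean_mean([a, b]))    # dominated by the larger-norm update
print(directional_mean([a, b]))  # direction halfway between a and b
```

The Euclidean mean lets the larger update dominate, whereas the directional mean weights both directions equally. How magnitudes are recombined after merging is not specified in the abstract and is left out here.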

2606.11260 2026-06-11 cs.SD cs.AI 新提交

RAIL: Rethinking Auditory Intelligence in Large Audio-Language Models with a CHC-Grounded Benchmark

RAIL: 基于CHC框架重新思考大型音频语言模型中的听觉智能

Hongyu Jin, Siyi Wang, Yang Xiao, Jiaheng Dong, Shihong Tan, Kaiyuan Peng, Georgiana Juravle, Shanquan Chen, Gongping Huang, Hong Jia, Eun-Jung Holden, James Bailey, Ting Dang

Affiliations: School of Computing and Information Systems, The University of Melbourne; Faculty of Psychology and Educational Sciences, Alexandru Ioan Cuza University of Iași; School of Electronic Information, Wuhan University; School of Public Health, The University of Hong Kong; School of Computer Science, The University of Auckland; Department of Data Science and Artificial Intelligence, Monash University

AI summary: Proposes the RAIL benchmark, which uses the CHC cognitive framework to decompose auditory intelligence into five core capabilities, builds structured evaluation tasks, and systematically evaluates the cognitive behaviour of large audio-language models.

AI中文摘要

人类通过紧密集成的认知能力（如音频感知、音频推理和记忆）处理丰富的听觉环境。尽管大型音频语言模型（LALMs）在语音理解和多模态音频推理方面取得了近期进展，但当前的评估范式仍然主要围绕任务或模态，关注最终性能而忽视了潜在的听觉认知行为。这揭示了人类听觉认知理解与LALMs评估之间的根本差距，特别是缺乏将认知原则操作化到任务级指标之外以系统捕捉模型行为的框架。在这项工作中，我们引入了RAIL，一种基于Cattell-Horn-Carroll（CHC）认知框架的以人为中心的评估范式。RAIL将听觉认知形式化为五种核心能力，并将其发展为结构化评估任务，探究模型如何处理、保留和整合听觉信息。我们进一步构建了一个认知基础的基准，包含原则性数据收集和人类对齐的评估协议。评估26个最先进的LALMs，我们发现当前模型在认知能力上表现出高度不平衡的性能。RAIL建立了一种新的评估范式，从以任务为中心的基准测试转向基于认知的听觉智能评估。

英文摘要

Humans process rich auditory environments through tightly integrated cognitive capabilities such as audio perception, audio reasoning, and memory. Despite recent progress in large audio-language models (LALMs) across speech understanding and multimodal audio reasoning, current evaluation paradigms remain largely task- or modality-centric, focusing on end performance while overlooking underlying auditory cognitive behaviours. This reveals a fundamental gap between how auditory cognition is understood in humans and how it is evaluated in LALMs, particularly in the lack of frameworks that operationalise cognitive principles beyond task-level metrics to systematically capture model behaviour. In this work, we introduce RAIL, a human-centric evaluation paradigm grounded in the Cattell-Horn-Carroll (CHC) cognitive framework. RAIL formalises auditory cognition into five core capabilities and develops them into structured evaluation tasks that probe how models process, retain, and integrate auditory information. We further construct a cognitively grounded benchmark with principled data curation and human-aligned evaluation protocols. Evaluating 26 state-of-the-art LALMs, we find that current models exhibit highly uneven performance across cognitive abilities. RAIL establishes a new evaluation paradigm that moves beyond task-centric benchmarking toward cognitively grounded assessment of auditory intelligence.

2606.11258 2026-06-11 cs.LG nlin.PS physics.comp-ph 新提交

Loss Landscape Diagnosis for Gradient-Based Gray-Scott System Inversion: Disentangling the Roles of PINN Components

基于梯度的Gray-Scott系统反演的损失景观诊断：解构PINN各组件的角色

Yan Yang

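Affiliations: Yan Yang

AI summary: Backpropagating a steady-state loss directly through an unrolled Gray-Scott simulation, the paper finds that optimization fails because of flat plateaus and sharp cliffs in the loss landscape, whereas the residual loss in PINNs avoids this pathology by implicitly encoding the full PDE dynamics.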

Comments Accepted at the AI4Physics Workshop, ICML 2026 (non-archival). 14 pages, 10 figures

AI中文摘要

反应扩散系统的梯度基反演通常通过代理模型或物理信息神经网络（PINN）进行，而最直接的路径——通过PDE结构本身进行反向传播——在很大程度上被避免。我们将这条直接路径作为诊断探针，通过未折叠的Gray-Scott模拟反向传播稳态损失以恢复其参数，无需代理或神经网络增强。优化未能收敛，直接绘制损失景观将其失败定位于其几何结构——平坦高原无梯度信号，被与分岔边界对齐的陡峭悬崖所包围——这种结构在损失函数中重复出现，并且无论梯度如何路由到参数都会继承。将这一最小设置视为PINN的消融实验，我们解构了每个组件的作用：在神经网络固定的情况下，残差损失是PDE参数的二次函数，产生平滑的损失景观，因此仅凭它就能避免病理现象，通过隐式编码所有初始条件下的完整PDE动力学。而神经网络无法修复不适定的参数子空间，因此仅用于完成观测数据——这种分工此前未被明确。这些发现对PINN类方法具有具体的设计意义，并提供了关于何时添加维度实际上有帮助的更广泛启发。

英文摘要

Gradient-based inversion of reaction-diffusion systems is typically approached via surrogate models or physics-informed neural networks (PINNs), while the most direct route, backpropagation through the PDE's structure itself, has largely been avoided. We pursue this direct route as a diagnostic probe, backpropagating a steady-state loss through an unrolled Gray-Scott simulation to recover its parameters, with no surrogate or neural-network augmentation. Optimization fails to converge, and plotting the landscape directly locates the failure in its geometry -- flat plateaus with no gradient signal, bounded by sharp cliffs that align with bifurcation boundaries -- a structure that recurs across loss functions and is inherited however the gradients are routed to parameters. Reading this minimal setup as an ablation of a PINN, we disentangle each component's role: with the neural network fixed, the residual loss is quadratic in the PDE parameters and yields a smooth landscape, so it alone already avoids the pathology, by implicitly encoding the full PDE dynamics across all initial conditions. The neural network, for its part, cannot repair an ill-posed parameter subspace, and so serves only to complete the observed data -- a division of labor not previously made explicit. These findings carry concrete design implications for PINN-type methods and a broader heuristic on when added dimensions actually help.
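
For readers unfamiliar with the setup, the sketch below unrolls a Gray-Scott simulation and evaluates a loss on the final state as a function of the parameters F and k. It is an illustration only: the 1D periodic grid, unit grid spacing, explicit Euler stepping, diffusion constants, seed perturbation, and mean-squared-error loss are all assumptions, not the paper's configuration, and a real inversion would run this inside an autodiff framework to obtain gradients.

```python
def gray_scott_step(u, v, F, k, Du=0.16, Dv=0.08, dt=1.0):
    """One explicit Euler step of the Gray-Scott equations on a periodic 1D grid:
    du/dt = Du * lap(u) - u*v^2 + F*(1 - u)
    dv/dt = Dv * lap(v) + u*v^2 - (F + k)*v
    """
    n = len(u)
    new_u, new_v = [0.0] * n, [0.0] * n
    for i in range(n):
        # u[i - 1] wraps around at i == 0, giving periodic boundaries.
        lap_u = u[i - 1] - 2.0 * u[i] + u[(i + 1) % n]
        lap_v = v[i - 1] - 2.0 * v[i] + v[(i + 1) % n]
        uvv = u[i] * v[i] * v[i]
        new_u[i] = u[i] + dt * (Du * lap_u - uvv + F * (1.0 - u[i]))
        new_v[i] = v[i] + dt * (Dv * lap_v + uvv - (F + k) * v[i])
    return new_u, new_v


def steady_state_loss(F, k, target_v, steps=200):
    """Unroll the simulation from a seeded state and compare final v to a target."""
    n = len(target_v)
    u, v = [1.0] * n, [0.0] * n
    u[n // 2], v[n // 2] = 0.5, 0.25  # small seed perturbation (assumed)
    for _ in range(steps):
        u, v = gray_scott_step(u, v, F, k)
    return sum((a - b) ** 2 for a, b in zip(v, target_v)) / n
```

Evaluating `steady_state_loss` over a grid of (F, k) values is one way to plot the kind of landscape the abstract describes.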

2606.11257 2026-06-11 cs.CL cs.LG cs.PF 新提交

Energy-Efficient On-Device RAG on a Mobile NPU: System Design and Benchmark on Snapdragon X Elite

移动NPU上的能效型设备端RAG：Snapdragon X Elite系统设计与基准测试

Zhiyuan Cheng, Longying Lai

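Affiliations: Qualcomm; Snapdragon X Elite; Dell XPS 13 laptop; Qualcomm Hexagon NPU; Adreno X1-85

AI summary: Presents what the authors describe as the first end-to-end RAG pipeline on the Hexagon NPU of the Snapdragon X Elite. Compared against CPU and GPU baselines, the NPU achieves 9.1x higher embedding throughput and 12.3x less system energy on indexing, and 4.0x lower end-to-end query latency than the CPU, with comparable answer quality.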

Comments 9 pages, 2 figures, 6 tables

AI中文摘要

检索增强生成（RAG）流水线计算密集，结合了嵌入、检索、重排序和大语言模型（LLM）生成。完全在设备端运行有利于隐私、延迟和离线使用，但CPU推理的能耗成本是一个主要障碍。我们提出了据我们所知第一个在Snapdragon X Elite的Qualcomm Hexagon NPU上运行所有神经阶段（嵌入、重排序和LLM生成）的端到端RAG流水线。在Dell XPS 13笔记本电脑上进行性能分析，我们比较了NPU加速的RAG与CPU和OpenCL/Adreno GPU基线在索引和查询工作负载上的表现。在索引方面，NPU实现了9.1倍的嵌入吞吐量提升和12.3倍的系统能耗降低。在120查询的Wikipedia段落基准测试中，与CPU基线相比，NPU实现了18.1倍的LLM预填充加速、4.0倍的端到端查询延迟降低和4.0倍的系统能耗降低；集成GPU上的相同工作负载比CPU慢1.7倍，且能耗比NPU高6.5倍。GPT-4.1 LLM作为评判者的评估发现，NPU的答案质量与CPU和GPU相当，在评估者噪声范围内（1-10分制下平均9.32 vs. 8.95 vs. 9.03），86.7%的查询在所有三个后端上得分相同。因此，在Snapdragon X Elite / Hexagon类笔记本电脑SoC上，NPU实现了实用、能效高的设备端RAG，且无质量退化——这是一条通往绿色边缘智能的可持续路径，我们预计随着软件栈的成熟，该方法将推广到类似的移动NPU（Apple Neural Engine、Intel NPU、MediaTek APU）。

英文摘要

Retrieval-Augmented Generation (RAG) pipelines are compute-intensive, combining embedding, retrieval, reranking, and large language model (LLM) generation. Running them entirely on-device benefits privacy, latency, and offline use, but the energy cost of CPU inference is a major barrier. We present what is, to our knowledge, the first end-to-end RAG pipeline that runs all neural stages -- embedding, reranking, and LLM generation -- on the Qualcomm Hexagon NPU of the Snapdragon X Elite. Profiling on a Dell XPS 13 laptop, we compare NPU-accelerated RAG against CPU and OpenCL/Adreno GPU baselines on indexing and query workloads. On indexing, the NPU achieves 9.1x higher embedding throughput and 12.3x less system energy. On a 120-query Wikipedia-passage benchmark, it delivers 18.1x faster LLM prefilling, 4.0x lower end-to-end query latency, and 4.0x less system energy than the CPU baseline; the same workload on the integrated GPU is 1.7x slower than CPU and uses 6.5x more energy than the NPU. A GPT-4.1 LLM-as-judge evaluation finds NPU answer quality on par with CPU and GPU within evaluator noise (mean 9.32 vs. 8.95 vs. 9.03 on a 1-10 rubric), with 86.7% of queries scoring identically across all three backends. On the Snapdragon X Elite / Hexagon class of laptop SoC, the NPU thus enables practical, energy-efficient on-device RAG without quality regression -- a sustainable path toward green edge intelligence that we expect to generalize to comparable mobile NPUs (Apple Neural Engine, Intel NPU, MediaTek APU) as their software stacks mature.

2606.11251 2026-06-11 cs.LG 新提交

Mechanical Field Networks: Structured Neural Dynamics for Multivariate Systems

机械场网络：多变量系统的结构化神经动力学

Xingji Cui

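Affiliations: Xi'an Jiaotong University

AI summary: Proposes MF-Net, a recurrent model that represents a multivariate system as a shared field state and updates that state through a learnable relation law, achieving competitive forecasting while keeping an interpretable structure.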

AI中文摘要

许多多变量动力系统仅通过轨迹观测，其联合动力学机制是隐藏的。现有方法可以施加可解释的动力学或学习灵活的状态转移，但得到的交互结构通常要么预先指定，要么隐含在学习动力学中。我们引入MF-Net，一种递归动力学模型，将所有变量表示在共享场状态中，并通过学习的关系律更新该状态。每个变量携带一个场分量，这些分量通过可学习的机械转移共同演化。这里，机械指的是转移的关系-运动组织，其中学习的关系塑造状态依赖的流、场响应和推动场状态前进的运动趋势。得到的结构是展开本身的一部分：学习的关系影响场的运动方式，相同的内部量支持预测和结构读出。在已知定律的交互系统、混沌基准、真实神经记录和生态时间序列上，MF-Net在保持可检查的结构读出的同时，实现了有竞争力的短中期预测。在40维Lorenz-96测试平台上，MF-Net的八步$R^2$达到$0.798\pm0.018$；在五个随机种子下，其学习的关系矩阵以$19.80\pm1.00$的局部/非局部强度比和$1.000\pm0.000$的Precision@$K$恢复了局部耦合支持。MF-Net提供了一个结构可读的动力学建模框架，其中学习的关系通过前向演化训练，并在真实数据上，在适当的观测限制下被解释为功能预测耦合。

英文摘要

Many multivariate dynamical systems are observed only through trajectories, leaving the mechanisms governing their joint dynamics hidden. Existing approaches can impose interpretable dynamics or learn flexible state transitions, yet the resulting interaction structure is typically either specified in advance or left implicit within the learned dynamics. We introduce MF-Net, a recurrent dynamical model that represents all variables in a shared field state and updates this state through a learned relation law. Each variable carries a field component, and these components evolve jointly through a learnable mechanical transition. Here, mechanical refers to the relation-to-motion organization of the transition, where learned relations shape state-dependent flows, field responses, and motion tendencies that move the field state forward. The resulting structure is part of the rollout itself: learned relations influence how the field moves, and the same internal quantities support both forecasting and structural readout. Across known-law interaction systems, chaotic benchmarks, real neural recordings, and ecological time series, MF-Net achieves competitive short- and medium-horizon forecasting while retaining inspectable structural readout. On the 40-dimensional Lorenz-96 testbed, MF-Net achieves an eight-step $R^2$ of $0.798\pm0.018$; across five seeds, its learned relation matrix recovers the local coupling support with a local/nonlocal strength ratio of $19.80\pm1.00$ and Precision@$K$ of $1.000\pm0.000$. MF-Net provides a structure-readable dynamical modeling framework in which learned relations are trained through forward evolution and, on real data, interpreted as functional predictive couplings under appropriate observational limits.
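
The "local coupling support" on Lorenz-96 refers to the fact that each variable in that system depends only on a few neighbours. The sketch below writes out the standard Lorenz-96 right-hand side and the resulting neighbour set; it illustrates the testbed only and is not part of MF-Net. The forcing value 8.0 is a conventional choice assumed here, and `local_support` is a hypothetical helper name.

```python
def lorenz96_rhs(x, forcing=8.0):
    """Standard Lorenz-96 dynamics on a ring:
    dx_i/dt = (x_{i+1} - x_{i-2}) * x_{i-1} - x_i + F
    Negative indices wrap around, which gives the periodic boundary.
    """
    n = len(x)
    return [(x[(i + 1) % n] - x[i - 2]) * x[i - 1] - x[i] + forcing
            for i in range(n)]


def local_support(n):
    """For each variable i, the other variables that appear in its equation."""
    return {i: sorted({(i - 2) % n, (i - 1) % n, (i + 1) % n}) for i in range(n)}


derivative = lorenz96_rhs([8.0] * 40)  # the uniform state x_i = F is a fixed point
neighbours_of_0 = local_support(40)[0]
```

A learned relation matrix that recovers this support should place its largest entries for variable i on indices i-2, i-1, and i+1.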

2606.11249 2026-06-11 cs.RO cs.LG cs.MA 新提交

MASK: Multi-Agent Semantic K-Scheduling for Risk-Sensitive 6G Robotics

MASK: 面向风险敏感的6G机器人学的多智能体语义K调度

Ahmet Gunhan Aydin, Elif Tugce Ceran

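Affiliations: Middle East Technical University; Aselsan Inc.

AI summary: To address limited spectral resources in collaborative sensing for 6G robotics, the paper proposes the Multi-Agent Semantic K-Scheduling (MASK) architecture. Its Arbiter-Assisted Semantic Information Gating (A-SIG) mechanism schedules only the K agents with the highest semantic importance and, combined with a self-supervised global encoder and a distributional policy, achieves robust risk-aware coordination under strict bandwidth caps, with performance close to communication-unconstrained baselines.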

AI中文摘要

实现6G连接机器人学的愿景需要协调高性能协作控制与物理无线信道的刚性频谱限制。在现实的协作感知场景中，频谱资源被量化为有限的物理资源块或正交子载波，使得所有智能体同时传输不可行。为了解决这一问题，我们提出了多智能体语义K调度（MASK），一种控制架构，旨在在严格的瞬时带宽限制下维持鲁棒的风险感知协调。我们引入了仲裁辅助语义信息门控（A-SIG），一种轻量级协调机制，通过基于本地计算的语义重要性分数仅调度前K个智能体来强制执行硬接入约束。通过将这些优先观测聚合为紧凑的潜在状态，自监督全局编码器使得分布策略能够在数据稀疏的情况下减轻尾部风险。我们在多个基准上评估了MASK，证明即使信道接入限制为群体大小的一小部分，其性能也能匹配无通信约束的基线。此外，该框架对数据包擦除具有固有的弹性，验证了语义调度作为资源受限的6G系统的关键使能技术。

英文摘要

Realizing the vision of 6G connected robotics requires reconciling high-performance collaborative control with the rigid spectral limitations of physical wireless channels. In realistic collaborative sensing scenarios, spectral resources are quantized into finite physical resource blocks or orthogonal subcarriers, rendering simultaneous transmission by all agents infeasible. To address this, we propose Multi-Agent Semantic K-Scheduling (MASK), a control architecture designed to sustain robust, risk-aware coordination under strict instantaneous bandwidth caps. We introduce Arbiter-Assisted Semantic Information Gating (A-SIG), a lightweight coordination mechanism that enforces hard access constraints by scheduling only the top-K agents based on locally computed semantic importance scores. By aggregating these prioritized observations into a compact latent state, a self-supervised global encoder enables a distributional policy to mitigate tail risks despite data sparsity. We evaluate MASK across diverse benchmarks, demonstrating that it matches the performance of communication-unconstrained baselines even when channel access is restricted to a small fraction of the swarm size. Furthermore, the framework exhibits inherent resilience to packet erasures, validating semantic scheduling as a critical enabler for resource-constrained 6G systems.
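
The hard access constraint described above amounts to a top-K selection over per-agent scores. A minimal sketch follows, assuming the arbiter already holds each agent's locally computed importance score; the function name, the tie-breaking rule, and the sample scores are hypothetical and not taken from the paper.

```python
def schedule_top_k(scores, k):
    """Return the ids of the k agents with the highest importance scores.

    scores: dict mapping agent id -> locally computed importance score.
    Ties are broken by agent id so the result is deterministic.
    """
    ranked = sorted(scores, key=lambda agent: (-scores[agent], agent))
    return ranked[:k]


scores = {"a0": 0.12, "a1": 0.87, "a2": 0.45, "a3": 0.91, "a4": 0.30}
granted = schedule_top_k(scores, k=2)

# Only the granted agents transmit; the rest stay silent in this slot.
observations = {agent: f"obs_from_{agent}" for agent in granted}
```

With K resource blocks available, exactly K agents transmit regardless of swarm size, which is what makes the bandwidth cap a hard constraint.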

2606.11247 2026-06-11 cs.LG cs.AI cs.AR 新提交

Physics-informed generative AI for semiconductor manufacturing: Enforcing hard physical constraints in generative models by construction

物理信息驱动的生成式AI在半导体制造中的应用：通过构造强制生成模型中的硬物理约束

Yaser Mike Banad, Sarah Sharif

Affiliations: School of Electrical and Computer Engineering, University of Oklahoma; Center for Quantum Research and Technology, University of Oklahoma; Intelligent Neuromorphic and Quantum Understanding for Innovative Research and Engineering (INQUIRE) Laboratory; Material Science and Engineering Program, University of Oklahoma, Norman, OK 73019 USA

AI summary: Because generative models in semiconductor manufacturing must satisfy hard physical constraints, the paper argues for enforcing those constraints by construction (for example with physics-informed diffusion and PDE-constrained variational models) rather than through post-hoc filtering, and presents four integration patterns and directions for future research.

AI中文摘要

生成模型越来越多地被用于为物理系统提出设计、数据和控制动作，然而许多此类系统受硬物理约束而非感知合理性支配。半导体制造提供了一个严苛的测试案例：生成的掩模、布局、合成缺陷数据和工艺配方必须遵守光刻、传输、反应和器件物理约束，因为物理无效的样本不仅质量低劣，而且无法使用。本文认为，半导体制造揭示了一个更广泛的计算科学挑战，即用于受约束物理领域的生成式AI必须通过构造实现物理信息驱动，而非仅通过事后过滤来纠正。我们调查了新兴的架构工具包，包括物理信息扩散、PDE约束变分模型、神经算子先验和守恒律尊重生成网络，并展示了它如何与可微分光刻、TCAD、工艺仿真和自主实验相联系。我们识别了生成模型与基于物理的模拟器之间的四种集成模式，并提出了一个以物理保真度基准、可微分模拟器基础设施以及面向物理设计和制造的多模态基础模型为中心的研究议程。核心主张是分析性的而非修辞性的：在物理有效性是成功的关键标准的情况下，通过构造强制约束的架构应被期望优于事后过滤的架构，而晶圆厂正是这种区别最鲜明的环境。

英文摘要

Generative models are increasingly used to propose designs, data, and control actions for physical systems, yet many such systems are governed by hard physical constraints rather than by perceptual plausibility. Semiconductor manufacturing provides a demanding test case: generated masks, layouts, synthetic defect data, and process recipes must obey lithography, transport, reaction, and device-physics constraints, because physically invalid samples are not merely low quality but unusable. This Perspective argues that semiconductor manufacturing exposes a broader computational-science challenge, namely that generative AI for constrained physical domains must be physics-informed by construction, not corrected only through post-hoc filtering. We survey the emerging architectural toolkit, including physics-informed diffusion, PDE-constrained variational models, neural-operator priors, and conservation-law-respecting generative networks, and show how it connects to differentiable lithography, TCAD, process simulation, and autonomous experimentation. We identify four integration patterns between generative models and physics-based simulators, and we propose a research agenda centered on physics-fidelity benchmarks, differentiable simulator infrastructure, and multimodal foundation models for physical design and manufacturing. The central claim is analytical rather than rhetorical: where physical validity is the binding criterion of success, architectures that enforce it by construction should be expected to outperform those that filter for it after the fact, and the fab is the setting where this distinction is sharpest.

2606.11245 2026-06-11 cs.AI cs.NE q-bio.NC 新提交

Position: Hippocampal Explicit Memory Is the Cornerstone for AGI

立场：海马体显式记忆是通用人工智能的基石

Sangjun Park

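Affiliations: Sangjun Park

AI summary: Argues that integrating explicit memory into large language models is key to progress toward AGI, because LLM learning mechanisms resemble human implicit memory while higher-order cognitive functions depend on hippocampal explicit memory.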

Comments Accepted to ICML 2026 (Position Paper Track)

2606.11243 2026-06-11 cs.LG cs.CL 新提交

ProHiFlo: Hierarchical Flow Matching with Functional Guidance for De Novo Protein Generation

ProHiFlo: 具有功能引导的分层流匹配用于从头蛋白质生成

Chuanzhen Wang, Meade Cleti, Pete Jano

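Affiliations: Arizona State University; University of Wisconsin-Madison; Tongji University

AI summary: Proposes ProHiFlo, a hierarchical flow matching framework that uses coarse-to-fine generation, functional guidance, and an adaptive SE(3)-equivariant architecture for efficient and accurate de novo protein generation, reaching a 58.9% success rate on enzyme active site scaffolding.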

Comments 23 pages

AI中文摘要

从头蛋白质生成在治疗设计、酶工程和合成生物学中具有变革潜力。尽管基于扩散和流匹配的方法已取得进展，但它们通常在单一分辨率下操作，且缺乏整合功能约束的机制。我们提出ProHiFlo，一种具有三项创新的分层流匹配框架：(1) 粗到细生成，先建模主链几何再细化到全原子坐标，在保持精度的同时降低计算成本；(2) 功能引导，利用预训练预测器引导生成朝向所需性质，无需重新训练；(3) 自适应SE(3)等变架构，用于高效多尺度处理。在无条件生成、基序支架和功能设计上的实验表明，在需要少4倍采样步数的情况下实现了最先进的性能。在酶活性位点支架任务中，ProHiFlo达到58.9%的成功率，而RFDiffusion为41.2%。

英文摘要

De novo protein generation has transformative potential in therapeutic design, enzyme engineering, and synthetic biology. While diffusion-based and flow matching approaches have achieved progress, they typically operate at a single resolution and lack mechanisms for incorporating functional constraints. We introduce ProHiFlo, a hierarchical flow matching framework with three innovations: (1) coarse-to-fine generation that models backbone geometry before refining to all-atom coordinates, reducing computational cost while maintaining accuracy; (2) functional guidance leveraging pretrained predictors to steer generation toward desired properties without retraining; (3) adaptive SE(3)-equivariant architecture for efficient multi-scale processing. Experiments on unconditional generation, motif scaffolding, and functional design demonstrate state-of-the-art performance while requiring 4x fewer sampling steps. On enzyme active site scaffolding, ProHiFlo achieves 58.9% success rate compared to 41.2% for RFDiffusion.

2606.11235 2026-06-11 cs.LG cs.DB stat.ME 新提交

Few-Shot Resampling for Scalable Statistically-Sound Data Mining

少样本重采样：可扩展的统计可靠数据挖掘

Leonardo Pellegrina, Fabio Vandin

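Affiliations: Department of Information Engineering, University of Padova

AI summary: Proposes FewRS, a resampling-based method for assessing the statistical significance of data mining results. By deriving a novel bound on the supremum deviation, it guarantees the probability of false discoveries with only a very small number of resampled datasets, greatly improving scalability.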

Comments Accepted to KDD 2026

DOI: 10.1145/3770855.3817752

AI中文摘要

知识发现的一个关键步骤是评估数据挖掘结果。在包括模式挖掘、图分析等多个应用中，此步骤包括评估结果的统计显著性，以避免仅由噪声或数据随机波动导致的虚假发现。虽然针对某些特定应用已经开发了专门程序，但基于重采样的方法被广泛使用，尤其是在无法推导解析结果的复杂分析中。然而，当前基于重采样的方法需要生成和分析数千个重采样数据集，因此对于大型数据集或计算密集型分析不实用。本文中，我们介绍了FewRS，一种简单有效的基于重采样的方法，用于评估数据挖掘结果的统计显著性，并对错误发现概率提供严格保证。我们的方法可应用于任何使用重采样方法的情况。FewRS基于我们对表示数据挖掘结果质量的检验统计量的上确界偏差推导出的新界。我们证明FewRS需要生成和分析极少数量的重采样数据集，从而得到高度可扩展且广泛适用的方法。我们在常见任务（如模式挖掘和网络分析）上测试了我们的方法。在所有情况下，与现有技术相比，我们的方法在运行时间上减少了多达两个数量级，同时保持高统计功效，使得能够在大型真实世界数据集上对数据挖掘结果进行统计验证。

英文摘要

A key step in knowledge discovery is the evaluation of data mining results. In several applications, including pattern mining, graph analysis, and others, this step includes the evaluation of the statistical significance of the results, to avoid spurious discoveries due only to noise or random fluctuations in the data. While specialized procedures have been developed for some specific applications, resampling-based approaches are widely used, in particular for complex analyses where analytical results cannot be derived. However, current resampling-based approaches require the generation and analysis of thousands of resampled datasets, and are therefore impractical for large datasets or computationally intensive analyses. In this paper, we introduce FewRS, a simple and effective resampling-based approach to assess the statistical significance of data mining results with rigorous guarantees on the probability of false discoveries. Our approach can be used in every situation where resampling-based approaches are applied. FewRS builds on our derivation of a novel bound to the supremum deviation of test statistics representing the quality of data mining results. We prove that FewRS needs to generate and analyze an extremely small number of resampled datasets, leading to a highly scalable approach with wide applicability. We test our approach on common tasks such as pattern mining and network analysis. In all cases, our approach results in a reduction of up to two orders of magnitude in running time compared to the state of the art, while preserving high statistical power, enabling the statistical validation of data mining results on large-scale real-world datasets.
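
To show why conventional resampling is costly, here is a sketch of the baseline procedure the abstract contrasts against: a plain permutation test for a difference in means. It is not FewRS, whose bound is not given in the abstract. The test statistic, the function name, and the sample data are illustrative assumptions.

```python
import random


def permutation_p_value(x, y, n_resamples=1000, seed=0):
    """Empirical p-value for |mean(x) - mean(y)| under random relabelling."""
    rng = random.Random(seed)
    observed = abs(sum(x) / len(x) - sum(y) / len(y))
    pooled = list(x) + list(y)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)  # one resampled dataset
        xs, ys = pooled[:len(x)], pooled[len(x):]
        stat = abs(sum(xs) / len(xs) - sum(ys) / len(ys))
        if stat >= observed:
            hits += 1
    # The +1 terms count the observed dataset itself, keeping the p-value valid.
    return (hits + 1) / (n_resamples + 1)


p_separated = permutation_p_value([10, 11, 12, 13, 14], [0, 1, 2, 3, 4])
p_identical = permutation_p_value([1, 2, 3], [1, 2, 3])
```

The smallest p-value this procedure can report is 1 / (n_resamples + 1), so strict significance thresholds, especially after correcting for many tested hypotheses, force thousands of resamples, each requiring a full re-run of the analysis.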

2606.11233 2026-06-11 cs.CV 新提交

OSCS-SupCon: Orthogonal Sigmoid-based Common and Style Supervised Contrastive Learning for Robust Feature Disentanglement

OSCS-SupCon: 基于正交Sigmoid的通用与风格监督对比学习用于鲁棒特征解耦

Bin Wang, Fadi Dornaika

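Affiliations: University of the Basque Country; IKERBASQUE

AI summary: To address negative-sample dilution and feature-space entanglement in supervised contrastive learning, the paper proposes the OSCS-SupCon framework, which uses a sigmoid-based contrastive loss and orthogonality constraints to improve feature discriminability and generalization.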

AI中文摘要

监督对比学习（SupCon）通过显式建模样本间的成对关系取得了强大性能。然而，现有基于SupCon的方法存在两个关键限制：标准InfoNCE损失导致的负样本稀释，以及缺乏分离类别相关（通用）和类别无关（风格）特征的显式约束引起的特征空间纠缠。这些限制降低了特征判别性和泛化能力。为解决这些问题，我们提出OSCS-SupCon（基于正交Sigmoid的通用与风格监督对比学习），一个结合Sigmoid成对对比目标与显式正交约束的统一框架。具体而言，我们引入一个具有两个可学习参数（温度和偏置）的Sigmoid对比损失，自适应地调整成对决策边界并缓解负样本稀释。此外，我们通过带ReLU非线性的线性投影强制通用和风格特征子空间之间的正交性，从而减少特征重叠并改善风格无关特征的解耦。在六个基准数据集上的大量实验表明，OSCS-SupCon在多种骨干架构上始终优于最先进的监督对比学习方法。特别是在使用ResNet-18骨干的细粒度CUB200-2011数据集上，所提方法相比CS-SupCon在分类准确率上提升了3.4%，突显了其鲁棒性和泛化能力。消融研究进一步证实了每个组件的有效性。

英文摘要

Supervised Contrastive Learning (SupCon) has achieved strong performance by explicitly modeling pairwise relationships among samples. However, existing SupCon-based methods suffer from two key limitations: negative-sample dilution induced by the standard InfoNCE loss, and feature-space entanglement caused by the lack of explicit constraints separating category-relevant (common) and category-irrelevant (style) features. These limitations reduce feature discriminability and generalization ability. To address these issues, we propose OSCS-SupCon (Orthogonal Sigmoid-based Common and Style Supervised Contrastive Learning), a unified framework that combines a sigmoid-based pairwise contrastive objective with explicit orthogonality constraints. Specifically, we introduce a sigmoid-based contrastive loss with two learnable parameters, temperature and bias, which adaptively modulate pairwise decision boundaries and alleviate negative-sample dilution. Furthermore, we enforce orthogonality between common and style feature subspaces via a linear projection with ReLU nonlinearity, thereby reducing feature overlap and improving disentanglement of style-irrelevant representations. Extensive experiments on six benchmark datasets demonstrate that OSCS-SupCon consistently outperforms state-of-the-art supervised contrastive learning methods across multiple backbone architectures. In particular, on the fine-grained CUB200-2011 dataset with a ResNet-18 backbone, the proposed method achieves a 3.4% improvement in classification accuracy over CS-SupCon, highlighting its robustness and generalization capability. Ablation studies further confirm the effectiveness of each component.
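
A sigmoid-based pairwise objective with a temperature and a bias can be sketched as follows. This is a generic form under assumptions, not the exact OSCS-SupCon loss: it scores each pair independently with a logistic loss on the scaled cosine similarity, with target +1 for same-label pairs and -1 otherwise. In the paper the temperature and bias are learnable; here they are fixed floats with assumed values, and the orthogonality constraint is omitted.

```python
import math


def sigmoid_pairwise_loss(embeddings, labels, temperature=10.0, bias=-5.0):
    """Mean of log(1 + exp(-z * (temperature * cos_sim + bias))) over pairs i != j,
    where z = +1 for same-label pairs and z = -1 otherwise."""
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm_u = math.sqrt(sum(a * a for a in u))
        norm_v = math.sqrt(sum(b * b for b in v))
        return dot / (norm_u * norm_v)

    total, count = 0.0, 0
    n = len(embeddings)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            z = 1.0 if labels[i] == labels[j] else -1.0
            logit = temperature * cos(embeddings[i], embeddings[j]) + bias
            total += math.log1p(math.exp(-z * logit))
            count += 1
    return total / count


emb = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]]
loss_matching = sigmoid_pairwise_loss(emb, [0, 0, 1, 1])    # labels agree with geometry
loss_mismatched = sigmoid_pairwise_loss(emb, [0, 1, 0, 1])  # labels disagree
```

Because each pair contributes its own binary term, there is no softmax normalization over all negatives, which is the property usually credited with avoiding the dilution seen in InfoNCE-style losses.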


2606.11232 2026-06-11 cs.CL cs.AI 新提交

Every Act Has Its Price: Compressed Moral Composition in Frontier LLMs

每个行为都有代价：前沿大语言模型中的压缩道德组合

Weijia Zhang, Ruiqi Chen, Yunze Xiao, Weihao Xuan

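Affiliations * University of Illinois Urbana-Champaign; University of Michigan; Carnegie Mellon University; The University of Tokyo

AI summary: Existing moral benchmarks only assess preferences over isolated acts. This paper proposes Moral Trolley Arena, a two-stage blind ELO benchmark that calibrates individual moral acts and combines them into two-act items, and finds that frontier LLMs' moral judgments compose in a compressed rather than simply additive way.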

详情

AI中文摘要

现有的LLM道德基准通常询问模型偏好哪个孤立的道德行为、价值或基础。这有用但不完整。现实判断往往要求模型在同一选项中组合多个道德信号。我们引入**Moral Trolley Arena**，一个两阶段盲ELO基准，用于衡量LLM如何组合道德证据。单场景阶段首先从跨越五个道德基础理论的229个场景语料库中校准个体道德行为；组合阶段则将校准后的行为组合成受控强度网格上的双行为道德项，并测量由此产生的组合偏好。在十个前沿模型中，组合判断主要由成分行为强度预测，但关系始终是压缩的而非简单加性。模型还表现出非加性强度锚定、成分控制后有限的基础特异性残差，以及跨提供者高度收敛的组合偏好曲面。这些结果表明，道德审计应衡量道德证据的组合规则，而不仅仅是对孤立行为的排名。

英文摘要

Existing LLM moral benchmarks usually ask which isolated moral act, value, or foundation a model prefers. This is useful but incomplete. Realistic judgments often require a model to combine several moral signals within the same option. We introduce **Moral Trolley Arena**, a two-stage blind ELO benchmark for measuring how LLMs compose moral evidence. The single-scene arena first calibrates individual moral acts from a 229-scenario corpus across five Moral Foundations Theory foundations; the composite arena then combines calibrated acts into two-act moral items over a controlled intensity grid and measures the resulting composite preferences. Across ten frontier models, composite judgments are largely predicted by component act strength, but the relation is consistently compressed rather than simply additive. Models also show non-additive intensity anchoring, bounded foundation-specific residuals after component control, and highly convergent composite preference surfaces across providers. These results suggest that moral audits should measure composition rules for moral evidence, not only rankings over isolated acts.
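For readers unfamiliar with ELO-style ranking, the rating update behind a pairwise arena can be sketched as below. This is the conventional Elo formula; the K-factor of 32 and the 400-point scale are common defaults assumed here, and the paper's exact variant is not stated in the abstract.

```python
def expected_score(rating_a, rating_b, scale=400.0):
    """Probability that item A is preferred over item B under the Elo model."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


def elo_update(rating_a, rating_b, score_a, k=32.0):
    """Update both ratings after one comparison.

    score_a is 1.0 if A is preferred, 0.0 if B is preferred, 0.5 for a tie.
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


# Two moral acts start level; the model prefers act A in a blind comparison.
ratings = elo_update(1500.0, 1500.0, score_a=1.0)
print(ratings)
```

In a two-stage design like the one described, ratings calibrated on single acts could then be compared against ratings of composite items to test whether composition is additive or compressed.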


2606.11231 2026-06-11 cs.CV 新提交

CFCamo: A Counterfactual Detect-or-Abstain Framework for Camouflaged Object Detection

CFCamo：一种用于伪装目标检测的反事实检测或放弃框架

Suhang Li, Osamu Yoshie, Yuya Ieiri

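Affiliations * Graduate School of Information, Production and Systems, Waseda University

AI summary: Proposes CFCamo, a framework that uses counterfactual paired training and policy optimization so that a COD model reports a detection when a target is present and abstains when none is, addressing the over-detect bias caused by positive-only training.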

Comments 10 pages, 7 figures, 5 tables. Code and data: https://github.com/suhang2000/CFCamo

详情

AI中文摘要

视觉语言强化学习最近在伪装目标检测（COD）中展现出强大的目标存在定位能力。然而，定位只是决策的一方面：当智能体面对没有伪装目标的普通图像时，它是否仍会声称存在伪装目标？标准的COD训练和评估数据仅包含正样本，因此在此设置下优化的智能体会产生过度检测偏差，这是一种任务特定的物体幻觉形式，标准COD评估无法衡量。为了量化这种目标缺失行为，我们构建了反事实COD（CF-COD），一个配对基准，从每个留出的COD评估图像中移除伪装目标，同时保留合理的背景。CF-COD评估模型是否在原始图像上检测到目标，并在目标缺失的反事实图像上放弃检测，通过配对准确率（PA）总结。我们进一步引入了CFCamo，一个用于COD的配对反事实框架，支持放弃检测。在训练中，CFCamo使用反事实序列策略优化（CSPO）优化Qwen3-VL-4B-Instruct智能体，该策略采样配对的原始-反事实轨迹，并使用反事实配对奖励（CPR）将原始图像检测与反事实放弃耦合。在CAMO-test上，CFCamo相比先前基于RL的COD基线将S_alpha提高了3.7个百分点；在CF-COD上，它达到了80.0-90.8%的PA。消融实验表明，移除反事实耦合后，尽管目标存在COD得分很高，PA降至1.4-5.2%，这表明仅凭目标存在评估无法表征检测或放弃行为。总体而言，这些结果表明CFCamo通过将目标存在检测与目标缺失放弃耦合，而不仅仅是加强目标存在定位，改进了COD智能体。代码和数据可在https://github.com/suhang2000/CFCamo获取。

英文摘要

Vision-language reinforcement learning has recently shown strong target-present localization for camouflaged object detection (COD). Yet localization is only one side of the decision: when the agent faces an ordinary image with no camouflaged target, will it still claim that a camouflaged object exists? Standard COD training and evaluation data are positive-only, so agents optimized under this setting can acquire an over-detect bias, a task-specific form of object hallucination that standard COD evaluation leaves unmeasured. To quantify this target-absent behavior, we construct Counterfactual COD (CF-COD), a paired benchmark that removes the camouflaged target from each held-out COD evaluation image while preserving a plausible background. CF-COD evaluates whether a model detects the target on the original image and abstains on the target-absent counterfactual, summarized by Pair Accuracy (PA). We further introduce CFCamo, a paired counterfactual framework for COD with abstention. For training, CFCamo optimizes a Qwen3-VL-4B-Instruct agent with Counterfactual Sequence Policy Optimization (CSPO), which samples paired original-counterfactual rollouts and uses a Counterfactual Paired Reward (CPR) to couple original-image detection with counterfactual abstention. On CAMO-test, CFCamo improves S_alpha by +3.7 pp over the prior RL-based COD baseline; across CF-COD, it reaches 80.0-90.8% PA. Ablations show that removing counterfactual coupling reduces PA to 1.4-5.2% despite strong target-present COD scores, showing that target-present evaluation alone does not characterize detect-or-abstain behavior. Overall, these results indicate that CFCamo improves COD agents by coupling target-present detection with target-absent abstention, rather than merely strengthening target-present localization. Code and data are available at https://github.com/suhang2000/CFCamo.
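A paired metric of this kind can be sketched in a few lines. The function below is an illustration only: it assumes each pair has already been reduced to two booleans (whether the model detected the target on the original image and whether it abstained on the counterfactual), whereas the paper may apply additional criteria such as a localization-quality threshold.

```python
def pair_accuracy(pairs):
    """Fraction of pairs where the model detects on the original image
    AND abstains on the target-absent counterfactual.

    pairs: iterable of (detected_on_original, abstained_on_counterfactual).
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("pair_accuracy needs at least one pair")
    correct = sum(1 for detected, abstained in pairs if detected and abstained)
    return correct / len(pairs)


# A model that always claims a target scores perfectly on positive-only
# evaluation but gets no credit under the paired metric.
always_detect = [(True, False)] * 4
balanced = [(True, True), (True, True), (True, False), (True, True)]
print(pair_accuracy(always_detect), pair_accuracy(balanced))
```

The always-detect case shows why target-present scores alone cannot characterize detect-or-abstain behavior: both halves of a pair must be right for the pair to count.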


2606.11222 2026-06-11 cs.CL cs.IT math.IT 新提交

A Geometric Profile of Semantic Information in Text: Frame-Conditional Uniqueness and a Trade-Off Triangle for Scalar Summaries

文本中语义信息的几何轮廓：帧条件唯一性与标量摘要的权衡三角形

Dmitriy Kompaneets

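Affiliations * Independent Researcher

AI summary: Proposes a geometric framework that measures a text's semantic content from the structure of its sentence embeddings using three coordinates (novelty, breadth, integration), and proves that no scalar summary can simultaneously satisfy analytic stability, ordinal robustness, and cross-representation comparability.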

Comments 19 pages. Code and data: https://github.com/dkompaneets/geometric_profile_semantic_information

详情

AI中文摘要

一段文本承载了多少意义？香农的理论衡量符号上的不确定性，并有意忽略意义，而诸如BERTScore的成对度量比较两段文本而非表征单段文本。我们开发了一个几何框架，从文本句子嵌入的结构中测量语义内容。该框架包含三个部分。首先，在固定的嵌入和基线内，六个自然公理唯一确定一个标量度量（尺度可调），即帧条件唯一性定理。得到的标量在经验上过于粗糙，这促使我们寻求更丰富的表征。其次，我们提出一个三坐标语义轮廓，捕捉新颖性（与通用话语的偏离）、广度（不同思想的多样性）和整合性（它们之间的连通性），以及一个离散的最小单元（语义量子），其分辨率由聚类阈值$\tau$固定。第三，我们证明了一个不可能定理：轮廓的任何标量摘要都不能同时满足在释义和拼接下的分析稳定性、跨文本尺度的序数鲁棒性以及跨表征的可比性。我们展示了两个实用标量$S_{\mathrm{minmax}}$和$S_{\mathrm{rank}}$，每个占据这个权衡三角形的不同角落。在23个合成类别、5本Project Gutenberg小说和3个嵌入模型上的验证确认了该权衡。推荐的秩归一化配置在28个序数检验中通过25个（Benjamini-Hochberg校正后通过21个），优于包括单字熵和基于BERTScore的新颖性信号在内的七个基线。一个独立的变分结果将广度坐标与行列式点过程的对数行列式联系起来（在507个Gutenberg章节上Spearman $\rho = 0.985$），为广度提供了优化理论基础。

英文摘要

How much meaning does a text carry? Shannon's theory measures uncertainty over symbols and is intentionally indifferent to meaning, while pairwise metrics such as BERTScore compare two texts rather than characterizing one. We develop a geometric framework that measures semantic content from the structure of a text's sentence embeddings. The framework has three parts. First, within a fixed embedding and baseline, six natural axioms uniquely determine a scalar measure up to scale, a frame-conditional uniqueness theorem. The resulting scalar is empirically too coarse, motivating a richer representation. Second, we propose a three-coordinate semantic profile capturing novelty (displacement from generic discourse), breadth (diversity of distinct ideas), and integration (connectedness among them), together with a discrete minimal unit (the semantic quantum) whose resolution is fixed by a clustering threshold $\tau$. Third, we prove a no-go theorem: no scalar summary of the profile can simultaneously satisfy analytic stability under paraphrase and concatenation, ordinal robustness across text scales, and cross-representation comparability. We exhibit two practical scalars, $S_{\mathrm{minmax}}$ and $S_{\mathrm{rank}}$, each occupying a distinct corner of this trade-off triangle. Validation across 23 synthetic categories, 5 Project Gutenberg novels, and 3 embedding models confirms the trade-off. The recommended rank-normalized configuration passes 25 of 28 ordinal checks as point estimates (21 of 28 after Benjamini-Hochberg correction), outperforming seven baselines including unigram entropy and a BERTScore-based novelty signal. A separate variational result connects the breadth coordinate to the log-determinant of a determinantal point process (Spearman $\rho = 0.985$ over 507 Gutenberg chapters), giving an optimization-theoretic foundation for breadth.


2606.11221 2026-06-11 cs.CV 新提交

LAST: Bridging Vision-Language and Action Manifolds via Gromov-Wasserstein Alignment

LAST: 通过Gromov-Wasserstein对齐连接视觉-语言与动作流形

Huaihai Lyu, Chaofan Chen, Yuheng Ji, Xiansheng Chen, Pengwei Wang, Shanghang Zhang, Changsheng Xu

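Affiliations * National Engineering Laboratory for Intelligent Information Processing; University of Science and Technology of China

AI summary: Proposes LAST, which uses Lie-algebraic linearization and local metric discretization to align vision-language semantic geometry with the action manifold, resolving the incompatibility between heterogeneous spaces and improving the convergence and generalization of VLA models.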

详情

AI中文摘要

我们从Gromov-Wasserstein视角研究视觉-语言-动作（VLA）学习，目标是使动作表征的关系几何与VL嵌入的语义几何兼容。然而，由于领域间的数学异质性，这种对齐并非易事：视觉-语言的语义空间在拓扑上是线性和各向同性的，而机器人动作的物理流形是非欧几里得和各向异性的。它们不兼容的度量结构使得直接回归不适定。为了解决这种不兼容性，我们引入了LAST（李代数动作空间分词器），它通过两阶段变换重建动作空间以建立与VL模态的局部度量兼容性：（1）全局拓扑线性化：通过李代数映射线性化动作流形，将轨迹转换为固定长度、物理可加的表示。（2）局部度量离散化：将表示分层离散化为模式和白化残差，生成近似各向同性的局部图表，这些图表在统计上与语义度量对齐。通过在全局和局部层面解决结构不匹配问题，LAST使VLA模型具有更优的收敛性和泛化性。

英文摘要

We take a Gromov-Wasserstein perspective on Vision-Language-Action (VLA) learning, where the goal is to make the relational geometry of action representations compatible with the semantic geometry of VL embeddings. However, this alignment is non-trivial due to the mathematical heterogeneity between the domains: the semantic space of vision-language is topologically linear and isotropic, whereas the physical manifold of robotic action is non-Euclidean and anisotropic. Their disjoint metric structures render direct regression ill-posed. To resolve this incompatibility, we introduce LAST (Lie-algebraic Action Space Tokenizer), which reconstructs the action space to establish local metric compatibility with the VL modality via a two-stage transformation: (1) Global Topological Linearization: linearizing the action manifold via Lie-algebraic mapping, converting trajectories into a fixed-length, physically additive representation. (2) Local Metric Discretization: hierarchically discretizing the representation into schemas and whitened residuals, yielding approximately isotropic local charts that are statistically aligned with the semantic metric. By resolving the structural mismatch at both global and local levels, LAST enables VLA models with superior convergence and generalizability.


2606.11220 2026-06-11 cs.CL 新提交

LifeSentence: Language models can encode human life course trajectories from longitudinal panel data

LifeSentence: 语言模型可以从纵向面板数据编码人类生命历程轨迹

Samuel Liu, Muchen Xi, William Yeoh, Joshua J. Jackson

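Affiliations * University of California, Berkeley; Stanford University

AI summary: Proposes LifeSentence, which combines large language models with longitudinal panel data by recording life events as structured natural-language records and fine-tuning a pretrained model. With comparatively little training data it outperforms conventional methods at life-event prediction and chronological-order reconstruction.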

详情

AI中文摘要

预测人类生命结果对于理解个体如何获得长寿健康的生活至关重要。传统的统计方法准确度有限，可能是因为忽略了生命历程的序列结构。现代方法如Transformer架构需要大规模训练数据，而大多数纵向面板研究缺乏此类数据。本文介绍LifeSentence，一种将大型语言模型与纵向面板数据相结合的生命历程推理模型。通过将每个生命事件表示为结构化的自然语言记录，并在一个包含预测、鲁棒性和推理的18任务评估分类体系上对预训练的240亿参数语言模型进行指令微调，LifeSentence利用预训练期间已编码的分布知识补充面板数据。该模型在来自德国社会经济面板的约65,000名个体上训练——比之前基于Transformer的方法少约45倍——在所有任务族上均优于经典和深度学习基线，在联合事件与时间预测上相比最佳基线实现三倍改进，并在从去除时间戳的事件集重建时间顺序时达到91.2%的Kendall tau系数。在没有显式监督的情况下，该模型仅从离散事件序列中恢复出记录的社会分层模式，包括教育溢价、性别工资差距和母亲惩罚。自然语言接口进一步支持定性新研究查询，例如将早期生活史连接到指定的晚年终点，使LifeSentence成为预测工具和对人类传记进行反事实探索的探针。

英文摘要

Forecasting human life outcomes is important to gain insights into how individuals attain long and healthy lives. Conventional statistical approaches yield limited accuracy, potentially due to discarding the sequential structure of the life course. Modern methods such as transformer architectures require large scale training data that most longitudinal panel studies lack. Here we introduce LifeSentence, a model for life-course reasoning that bridges large language models with longitudinal panel data. By representing each life event as a structured natural-language record and instruction-tuning a pretrained 24-billion-parameter language model across an 18-task evaluation taxonomy spanning prediction, robustness and reasoning, LifeSentence supplements panel data with distributional knowledge already encoded during pretraining. Trained on approximately 65,000 individuals from the German Socio-Economic Panel - roughly 45 times fewer than prior transformer-based approaches - LifeSentence outperforms classical and deep learning baselines across all task families, achieving a threefold improvement in joint event-and-timing prediction from best baselines and 91.2% Kendall's tau when reconstructing chronological order from timestamp-stripped event sets. Without explicit supervision, the model recovers documented patterns of social stratification, including the education premium, the gender wage gap and the motherhood penalty, from discrete event sequences alone. A natural-language interface further enables qualitatively new research queries, such as connecting an early-life history to a specified late-life endpoint, establishing LifeSentence as both a predictive tool and a probe for counterfactual exploration of human biographies.
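The chronological-reconstruction score can be made concrete with a small Kendall's tau sketch. This assumes the tie-free tau-a variant over a full ordering of distinct events; the abstract does not say which variant the authors report, and the event names are made up for illustration.

```python
from itertools import combinations


def kendall_tau(true_order, predicted_order):
    """Kendall's tau-a between two orderings of the same distinct events.

    +1 means the predicted order matches the true order exactly,
    -1 means it is fully reversed.
    """
    position = {event: i for i, event in enumerate(predicted_order)}
    concordant = discordant = 0
    # combinations() yields (a, b) with a before b in the true order.
    for a, b in combinations(true_order, 2):
        if position[a] < position[b]:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)


truth = ["finish school", "first job", "marriage", "first child"]
guess = ["finish school", "marriage", "first job", "first child"]
tau = kendall_tau(truth, guess)
print(tau)
```

One adjacent swap among four events already lowers tau to about 0.67, which gives a sense of how few pairwise inversions a value above 0.9 allows.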


2606.11219 2026-06-11 cs.CL cs.AI cs.SD 新提交

Afrispeech Semantics: Evaluating Audio Semantic Reasoning in Spoken Language Models Across Domains and Accents

Afrispeech Semantics: 评估跨领域和口音的口语语言模型中的音频语义推理

Chibuzor Okocha, Christan Grant

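Affiliations * University of Florida

AI summary: Introduces five semantic and paralinguistic reasoning tasks (entailment, consistency, plausibility, accent drift, accent restraint) to evaluate how audio language models reason under accent variation, domain shift, and semantic over-inference, revealing limitations of current evaluations.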

Comments Accepted to ACL

详情

AI中文摘要

音频语言模型（ALMs）越来越多地用于基于语音的理解，但它们在转录、文本到音频检索、字幕生成和问答准确性之外的语义推理能力仍未得到充分基准测试。特别是，口音变化、领域迁移和语义过度推断对音频推理的影响尚不清楚。我们评估了音频语言模型在五项语义和副语言推理任务上的表现：蕴含、一致性、合理性、口音漂移和口音约束。这些任务共同评估模型以口语音频作为主要证据来源进行推理的能力，包括文本假设是否可以从音频中推断、矛盾或无法确定，陈述是否与口语内容一致或冲突，给定话语的声明是否合理，以及模型预测在口音变化下是否保持稳定或适当约束。这些发现凸显了当前音频推理评估的关键局限性，并希望为更稳健和公平的ALM设计与评估提供指导。

英文摘要

Audio language models (ALMs) are increasingly used for speech-based understanding, yet their ability to perform semantic reasoning beyond transcription, text-to-audio retrieval, captioning, and question-answering accuracy remains insufficiently benchmarked. In particular, the effects of accent variation, domain shift, and semantic over-inference on audio reasoning are poorly understood. We evaluate audio language models across five semantic and paralinguistic reasoning tasks: entailment, consistency, plausibility, accent drift, and accent restraint. Collectively, these tasks assess a model's ability to reason over spoken audio as the primary evidence source, including whether a textual hypothesis can be inferred, contradicted, or left undetermined by the audio, whether statements align or conflict with spoken content, whether claims are plausible given the discourse, and whether model predictions remain stable or appropriately constrained across accent variation. These findings highlight critical limitations in current audio reasoning evaluations, and we hope they provide guidance for more robust and equitable ALM design and assessment.


2606.11213 2026-06-11 cs.CL 新提交

Beyond Compaction: Structured Context Eviction for Long-Horizon Agents

超越压缩：面向长周期智能体的结构化上下文驱逐

Andrew Semenov, Svyatoslav Dorofeev

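Affiliations * Kiz8

AI summary: Proposes Context Window Lifecycle (CWL), a structured, semantically aware eviction scheme that gives long-horizon LLM agents an effectively unbounded working horizon within a limited context budget while avoiding performance degradation and hallucination.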

详情

AI中文摘要

我们提出了上下文窗口生命周期（CWL），一种上下文管理方案，为长周期LLM智能体提供有效无界的工作视野。随着会话累积历史，CWL通过渐进式、语义感知的驱逐将上下文保持在预算内：智能体在工作过程中将其轨迹注释为类型化、依赖链接的情节，当令牌预算超出时，一个确定性的、无需LLM的策略在该结构内按优先级顺序驱逐内容。CWL保留用户轮次和智能体正在积极推理的探索上下文，同时积极丢弃其效果已持久化在环境中的行动情节，使活动上下文保持在稳定上限附近，这也避免了与超大提示相关的性能下降。与基于摘要的压缩相比，CWL避免了四个众所周知的局限性：不可预测的信息丢失、因果结构的破坏、阻塞模型成本以及压缩引起的幻觉。与最近截断相比，CWL具有语义感知能力：它根据依赖图丢弃最旧且最可恢复的内容，而不是按时间顺序丢弃最旧的内容而不考虑相关性。我们描述了注释协议、情节图、驱逐策略和令牌记账循环，并在长周期智能体基准上评估了CWL：一个智能体会话在8000万个令牌上完成89个顺序任务，与每任务隔离会话相比，任务准确性没有可测量的下降。

英文摘要

We present Context Window Lifecycle (CWL), a context-management scheme that gives long-horizon LLM agents an effectively unbounded working horizon. As a session accumulates history, CWL keeps the context within budget through graduated, semantically aware eviction: the agent annotates its trajectory as typed, dependency-linked episodes as work proceeds, and a deterministic, LLM-free policy evicts content in priority order within that structure when a token budget is exceeded. CWL preserves user turns and the exploratory context the agent is actively reasoning over, while aggressively shedding action episodes whose effects are already persisted in the environment, keeping active context near a stable ceiling that also avoids the performance degradation associated with very large prompts. Compared to summarization-based compaction, CWL avoids four well-known limitations: unpredictable lossiness, destruction of causal structure, blocking model cost, and compression-induced hallucination. Compared to recency truncation, CWL is semantically aware: it drops the oldest-and-most-recoverable content according to the dependency graph rather than oldest-in-time regardless of relevance. We describe the annotation protocol, the episode graph, the eviction policy, and the token-accounting loop, and evaluate CWL on long-horizon agentic benchmarks: a single agent session completing 89 sequential tasks across 80 million tokens with no measurable degradation in task accuracy relative to per-task isolated sessions.
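A deterministic, LLM-free eviction pass of the kind described might look like the sketch below. Everything here is hypothetical: the `Episode` fields, the two-tier priority, and the rule that dependencies of active episodes are protected are assumptions chosen to mirror the abstract's description, not the authors' protocol.

```python
from dataclasses import dataclass, field


@dataclass
class Episode:
    id: int                  # position in the trajectory (smaller = older)
    kind: str                # "user", "exploration", or "action"
    tokens: int
    persisted: bool = False  # effects already saved in the environment
    active: bool = False     # the agent is still reasoning over this episode
    depends_on: set = field(default_factory=set)


def evict(episodes, budget):
    """Drop episodes in priority order until the token total fits the budget."""
    total = sum(e.tokens for e in episodes)
    # Anything an active episode depends on stays in context.
    protected = set()
    for e in episodes:
        if e.active:
            protected |= e.depends_on

    def priority(e):
        # Persisted action episodes go first, oldest first within each tier.
        tier = 0 if (e.kind == "action" and e.persisted) else 1
        return (tier, e.id)

    candidates = sorted(
        (e for e in episodes
         if e.kind != "user" and not e.active and e.id not in protected),
        key=priority,
    )
    evicted = set()
    for e in candidates:
        if total <= budget:
            break
        evicted.add(e.id)
        total -= e.tokens
    return [e for e in episodes if e.id not in evicted]


history = [
    Episode(0, "user", 100),
    Episode(1, "action", 500, persisted=True),
    Episode(2, "exploration", 300, active=True),
    Episode(3, "action", 400, persisted=True),
]
kept_ids = [e.id for e in evict(history, budget=900)]
print(kept_ids)
```

Because the policy is a pure function of the episode annotations, the same history and budget always produce the same retained context, which is what makes this style of eviction auditable where summarization is not.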


2606.11212 2026-06-11 cs.CL 新提交

EverydayGPT: Confidence-Gated Routing for Efficient and Safe Hybrid GPT-RAG Conversational QA

EverydayGPT: 用于高效安全混合GPT-RAG对话问答的置信门控路由

Jaspreet Singh Nahal

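Affiliations * Dr. A.P.J. Abdul Kalam Technical University

AI summary: Proposes a confidence-gated routing mechanism whose joint policy chooses between the retrieval and generation pathways, so that 85% of queries are resolved by fast RAG extraction with over 120x lower latency while answer quality is maintained.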

Comments 12 pages, 10 figures, 6 tables. Code and evaluation scripts available at: https://github.com/merciless-admiral-3083/EverydayGPT. This paper studies routing strategies for hybrid GPT-RAG systems under resource constraints, focusing on efficiency-safety tradeoffs rather than state-of-the-art accuracy

详情

AI中文摘要

标准检索增强生成（RAG）流水线无条件地将每个查询路由到检索和生成，导致不必要的计算并将低质量上下文传播给生成器。我们引入了EverydayGPT，一个轻量级对话问答系统，围绕置信门控路由（CGR）机制构建，该机制将路由决策形式化为检索距离和提取充分性的联合策略。骨干网络是一个205M参数的GPT，在FineWeb-Edu的10B令牌上从头训练。CGR通过快速RAG提取（~45 ms）解决85%的查询，避免调用昂贵的GPT路径（~5.9s），在大多数查询上实现超过120倍的延迟降低，同时保持答案质量。在500个问题的领域内基准测试中，系统达到F1 = 0.226 +/- 0.004，而仅GPT为0.171，无条件RAG为0.210。相对于强基线的提升虽小但一致，而效率提升显著（平均延迟降低6.3倍）。结构化的基础审计发现采样集中没有无根据的声明，并带有明确的范围限制。我们将这项工作定位为资源约束下路由策略的研究，而非声称最先进的性能。

英文摘要

Standard Retrieval-Augmented Generation (RAG) pipelines route every query through retrieval and generation unconditionally, incurring unnecessary computation and propagating low-quality context to the generator. We introduce EverydayGPT, a lightweight conversational QA system built around a Confidence-Gated Routing (CGR) mechanism that formalises the routing decision as a joint policy over retrieval distance and extraction adequacy. The backbone is a 205M-parameter GPT trained from scratch on 10B tokens of FineWeb-Edu. CGR avoids invoking the costly GPT pathway (~5.9s) for 85 percent of queries by resolving them via fast RAG extraction (~45 ms), yielding over 120x latency reduction on the majority of queries while maintaining answer quality. On a 500-question in-domain benchmark, the system achieves F1 = 0.226 +/- 0.004 compared to 0.171 for GPT-only and 0.210 for unconditional RAG. Gains over strong baselines are modest but consistent, while efficiency improvements are substantial (6.3x mean latency reduction). A structured grounding audit finds no unsupported claims in the sampled set, with explicit scope limitations. We position this work as a study of routing strategies under resource constraints rather than a claim of state-of-the-art performance.
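The routing gate and the latency figures can be checked with a short sketch. The threshold values in `route` are hypothetical placeholders (the abstract does not give them), and the blended-latency calculation assumes the rounded per-path latencies quoted above and ignores any overhead from the gate itself.

```python
def route(retrieval_distance, extraction_adequacy,
          max_distance=0.35, min_adequacy=0.6):
    """Return "rag" only when both confidence gates pass, else "gpt".

    max_distance and min_adequacy are assumed thresholds for illustration.
    """
    if retrieval_distance <= max_distance and extraction_adequacy >= min_adequacy:
        return "rag"
    return "gpt"


def mean_latency(rag_fraction, rag_latency_s, gpt_latency_s):
    """Expected per-query latency when a fraction of queries takes the fast path."""
    return rag_fraction * rag_latency_s + (1.0 - rag_fraction) * gpt_latency_s


RAG_S, GPT_S = 0.045, 5.9                     # ~45 ms and ~5.9 s from the abstract
per_query_speedup = GPT_S / RAG_S             # speedup on queries the gate resolves
blended_s = mean_latency(0.85, RAG_S, GPT_S)  # 85% of queries take the RAG path
mean_speedup = GPT_S / blended_s              # speedup averaged over all queries
print(round(per_query_speedup), round(blended_s, 2), round(mean_speedup, 1))
```

With these rounded inputs the fast path is roughly 130 times quicker per query, while the average over all queries improves by a little over 6 times, in line with the "over 120x" and "6.3x" figures: the 15% of queries that still reach the GPT path dominate mean latency.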


2606.11211 2026-06-11 cs.CL cs.AI cs.LG 新提交

Calibration Drift Under Reasoning: How Chain-of-Thought Budgets Induce Overconfidence in Large Language Models

推理下的校准漂移：思维链预算如何导致大型语言模型过度自信

Prakul Sunil Hiremath, Harshit R. Hiremath

Affiliations * Department of Computer Science and Engineering, Visvesvaraya Technological University, Belagavi; Department of Computer Science and Business System, SG Balekundri Institute of Technology, Belagavi

AI summary: Finds that increasing the chain-of-thought reasoning budget beyond a task-specific threshold can make models overconfident in incorrect answers; names this phenomenon calibration drift and introduces the CABStop stopping rule.

Comments 31 pages, 4 figures, 3 tables. Introduces Calibration Drift Under Reasoning (CDUR) with theoretical analysis and preliminary experiments; includes CABStop; code and data available

详情

AI中文摘要

大型语言模型（LLMs）表达校准不确定性的能力对于安全部署至关重要。思维链（CoT）推理被广泛用于提高准确性和可靠性，但其对校准的影响尚未完全理解。我们表明这一图景是不完整的：在某些设置中，将推理预算增加到任务特定阈值以上会导致模型系统性地变得过度自信，对错误答案赋予高置信度。我们将此现象称为推理下的校准漂移（CDUR），并从理论和实证两方面进行研究。我们定义推理预算B，并分析预期校准误差ECE(B)呈现非单调模式的条件：它首先随着推理纠正错误而下降，然后随着更长推理产生内部一致但错误的解释而上升。我们提出一个基于自回归生成的假设锁定模型来解释这种行为。我们在47个推理陷阱问题上评估了Llama-3.1-8B和Llama-3.3-70B，跨越四个推理预算和三个随机种子（1,368次API调用；574个有效响应）。8B模型显示出非单调的校准行为，而70B模型的结果仅限于基线评估，对于预算依赖效应尚无定论。我们引入CABStop，一种校准感知的停止规则，当置信度偏离辅助准确性估计时停止推理。这些结果表明，增加推理深度并不总是提高可靠性，应谨慎监控。

英文摘要

The ability of large language models (LLMs) to express calibrated uncertainty is important for safe deployment. Chain-of-thought (CoT) reasoning is widely used to improve accuracy and reliability, but its effect on calibration is not fully understood. We show that this picture is incomplete: in some settings, increasing the reasoning budget beyond a task-specific threshold can cause models to become systematically overconfident, assigning high confidence to incorrect answers. We call this phenomenon Calibration Drift Under Reasoning (CDUR) and study it both theoretically and empirically. We define reasoning budget B and analyze conditions under which Expected Calibration Error ECE(B) follows a non-monotonic pattern: it first decreases as reasoning corrects errors, then increases as longer reasoning produces internally consistent but incorrect explanations. We propose a Hypothesis Lock-In model based on autoregressive generation to explain this behavior. We evaluate Llama-3.1-8B and Llama-3.3-70B on 47 reasoning-trap questions across four reasoning budgets and three seeds (1,368 API calls; 574 valid responses). The 8B model shows non-monotonic calibration behavior, while results for the 70B model are limited to baseline evaluation and are inconclusive for budget-dependent effects. We introduce CABStop, a calibration-aware stopping rule that halts reasoning when confidence diverges from an auxiliary accuracy estimate. These results suggest that increasing reasoning depth does not always improve reliability and should be monitored carefully.
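Expected Calibration Error, the quantity tracked as a function of the reasoning budget, can be computed as in the sketch below. It uses the common equal-width binning estimator; the bin count of 10 is an assumption, and the abstract does not specify the authors' binning.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between mean confidence and accuracy per bin.

    confidences: floats in [0, 1]; correct: booleans of the same length.
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 joins the top bin
        bins[index].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece


# Overconfident case: 95% stated confidence but only half the answers are right.
overconfident = expected_calibration_error(
    [0.95, 0.95, 0.95, 0.95], [True, True, False, False]
)
print(overconfident)
```

Evaluating this function once per reasoning budget B would trace out the ECE(B) curve whose non-monotonic shape the paper studies.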


2606.11210 2026-06-11 cs.CL cs.AI cs.MM 新提交

T2MM: An LLM Supported Architecture For Inquiry-Based Modeling

T2MM：一种支持基于探究建模的LLM架构

John Kos, Rudra Singh, Ashok Goel

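Affiliations * Georgia Institute of Technology

AI summary: Proposes T2MM, an architecture that uses an LLM to generate interactive models in the ecological modeling software VERA and outperforms a full-code-generation baseline.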

Comments 16 pages, 4 figures

详情

AI中文摘要

模型构建是科学学习中的基础实践，依赖于可视化和交互性。大型语言模型（LLM）越来越多地增强多模态能力，并已集成到教育环境中以支持学习。然而，这些工具缺乏某些学习环境所需的视觉交互性。我们提出了文本到多模态模型（T2MM），这是一种稳健、动态的LLM支持架构，可在开放探究生态建模软件虚拟实验研究助手（VERA）中辅助模型构建。T2MM考虑学习者模型的当前上下文，并创建交互式模型（而非静态图像），使模型能够对人工调整保持响应。为了衡量技术可行性，我们通过一个自定义的程序生成数据集（包含自然语言学习者建模请求和VERA系统中的目标模型）来评估T2MM。在所有测量的成功指标上，T2MM优于通过LLM支持的全代码生成实现的基线模型生成架构（这在文献中很常见）。我们的贡献不仅概述了将LLM集成到基于探究的学习建模工具中，还描述了一种可能的架构，通过该架构可以创建更具交互性的多模态LLM工具。

英文摘要

Model construction is a foundational practice in science learning that relies on visualization and interactivity. Large Language Models, increasingly augmented with multimodal capabilities, have been integrated into educational contexts to support learning. However, these tools lack the visual interactivity that some learning contexts require. We introduce Text to Multimodal Model (T2MM), a robust, dynamic LLM-supported architecture that assists in model construction within the open inquiry ecology-based modeling software Virtual Experimental Research Assistant (VERA). T2MM accounts for the current context of the learner's model and creates interactive models, rather than static images, enabling the model to remain responsive to manual adjustment. To measure technical feasibility, we evaluate T2MM through a custom procedurally generated dataset of natural language learner modeling requests and target models within the VERA system. T2MM outperforms a baseline model generation architecture implemented through LLM-supported full code generation, common in the literature, across all measured success metrics. Our contribution not only outlines LLM integration into an inquiry-based learning modeling tool, but also describes a possible architecture through which more interactive multimodal LLM tools can be created.


2606.11209 2026-06-11 cs.CL cs.AI cs.LG 新提交

ProcessThinker: Enhancing Multi-modal Large Language Models Reasoning via Rollout-based Process Reward

ProcessThinker: 通过基于展开的过程奖励增强多模态大语言模型推理

Jingpei Wu, Xiao Han, Weixiang Shen, Boer Zhang, Zifeng Ding, Volker Tresp

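Affiliations * LMU Munich; Harvard University; University of Cambridge; Mina AI; Konrad Zuse School of Excellence in Reliable AI (relAI)

AI summary: Proposes ProcessThinker, a post-training method that needs no explicit process reward model. A step-tagged format and a rollout-based process reward supply dense step-level rewards for multi-step reasoning and improve the consistency of multimodal reasoning.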

Comments Accepted at ICLR 2026 Workshop on Logical Reasoning of Large Language Models. 7 pages, 1 figure

详情

AI中文摘要

视觉问答越来越需要多步推理。最近在可验证奖励下的强化学习后训练（RLVR）和组相对策略优化（GRPO）可以改善多模态推理，但大多数方法依赖于稀疏的仅结果奖励。因此，它们难以判断错误答案是由于推理后期的一个小错误，还是从一开始就无用的轨迹。一个常见的解决方案是训练一个过程奖励模型（PRM）用于步骤级监督，但这通常需要大规模高质量的思想链注释和额外的训练成本。我们提出ProcessThinker，一种实用的后训练流程，无需训练显式的PRM即可提供步骤级过程奖励。ProcessThinker首先将推理轨迹重写为步骤标记格式以进行冷启动监督微调，然后应用带有标准格式奖励和我们基于展开的过程奖励的GRPO。具体来说，对于每个中间步骤，我们从该步骤采样多个连续步骤，并使用经验成功率（最终答案验证）作为步骤奖励。这提供了密集的信用分配，并鼓励更可靠地支持正确结论的推理步骤，有助于减少跨步骤的不一致或自相矛盾的进展——这是逻辑推理中的一个关键问题。在四个具有挑战性的视频基准测试（Video-MMMU、MMVU、VideoMathQA和LongVideoBench）上，ProcessThinker始终优于基线模型Qwen3-VL-8B-Instruct。

英文摘要

Visual question answering increasingly requires multi-step reasoning. Recent post-training with reinforcement learning under verifiable rewards (RLVR) and Group Relative Policy Optimization (GRPO) can improve multimodal reasoning, but most approaches rely on sparse outcome-only rewards. As a result, they struggle to tell whether an incorrect answer comes from a small mistake late in the reasoning or from an unhelpful trajectory from the start. A common solution is to train a process reward model (PRM) for step-level supervision, but this typically requires large-scale high-quality chain-of-thought annotations and additional training cost. We propose ProcessThinker, a practical post-training pipeline that provides step-level process rewards without training an explicit PRM. ProcessThinker first rewrites reasoning traces into a step-tagged format for cold-start supervised fine-tuning, then applies GRPO with a standard format reward and our rollout-based process reward. Concretely, for each intermediate step, we sample multiple continuations from that step and use the empirical success rate (final-answer verification) as the step reward. This gives dense credit assignment and encourages reasoning steps that more reliably support a correct conclusion, helping reduce inconsistent or self-contradictory progress across steps -- a key issue in logical reasoning. Across four challenging video benchmarks (Video-MMMU, MMVU, VideoMathQA, and LongVideoBench), ProcessThinker consistently improves over the baseline model Qwen3-VL-8B-Instruct.
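The rollout-based step reward described above can be sketched as a Monte Carlo estimate. In this illustration `sample_continuation` and `verify` are hypothetical callables standing in for the policy model and the final-answer checker, and `toy_continuation` is a made-up stand-in policy; the rollout count of 8 is likewise an assumption.

```python
import random


def rollout_step_rewards(steps, sample_continuation, verify, n_rollouts=8, seed=0):
    """For each reasoning prefix, estimate the chance of reaching a correct answer.

    The reward of step k is the fraction of continuations sampled from
    steps[:k] whose final answer passes verification.
    """
    rng = random.Random(seed)
    rewards = []
    for k in range(1, len(steps) + 1):
        prefix = steps[:k]
        successes = sum(
            1 for _ in range(n_rollouts)
            if verify(sample_continuation(prefix, rng))
        )
        rewards.append(successes / n_rollouts)
    return rewards


def toy_continuation(prefix, rng):
    """Stand-in policy: a bad step makes every continuation fail."""
    if "wrong turn" in prefix:
        return "0"
    return "42" if rng.random() < 0.75 else "0"


trace = ["set up the problem", "wrong turn", "conclude"]
rewards = rollout_step_rewards(trace, toy_continuation, lambda answer: answer == "42")
print(rewards)
```

The reward drops at the step where the trajectory goes wrong and stays low afterwards, which is the dense credit-assignment signal that outcome-only rewards cannot provide.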


2606.11208 2026-06-11 cs.CL cs.AI 新提交

BioDivergence: A Benchmark and Evaluation Framework for Hidden Contextual Contradictions in Biomedical Abstracts

BioDivergence：生物医学摘要中隐藏上下文矛盾的基准与评估框架

Elias Hossain, Sanjeda Sara Jennifer, Sabera Akter Bushra, Niloofar Yousefi

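Affiliations * College of Engineering and Computer Science, University of Central Florida; Burnett School of Biomedical Sciences, University of Central Florida

AI summary: Proposes BioDivergence, a framework with a six-class conflict taxonomy, a 13-axis divergence ontology, and structured outputs that addresses the inability of existing NLI benchmarks to capture context-dependent differences between biomedical studies, and releases a benchmark of 11,865 claim pairs.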

详情

AI中文摘要

生物医学发现常常在不同研究中看似冲突，但许多差异是上下文依赖的而非真正的矛盾。队列、地理、实验方案、疾病亚型和临床环境的变化可能使两种说法在局部都成立。现有的NLI和科学声明验证基准将此类情况简化为蕴含、矛盾或中立，未能捕捉分歧背后的上下文结构。为解决这一问题，我们引入了BioDivergence，一个包含六类冲突分类、13轴分歧本体以及每个声明对四个结构化输出（冲突类型、分歧轴、主要混杂因素和调和解释）的评估框架。我们发布了BioDivergence-Silver-v1.0，一个跨五个生物医学领域的11865个声明对的文章分离银标准基准，以及一个用于比较的遗留去重变体。结果显示，两种变体之间存在显著的排名差异，微调参考模型在文章分离设置下下降了约12分，而Mistral-7B-Instruct-v0.3在842个示例的主测试集上达到了0.5523的准确率和0.3894的上下文F1分数。BioDivergence提供了一种更忠实的方式来区分上下文分歧与直接矛盾，并区分文章级记忆与真正的任务学习。

英文摘要

Biomedical findings often seem to conflict across studies, but many of these differences are context-dependent rather than true contradictions. Variations in cohort, geography, assay protocol, disease subtype, and clinical setting can make both claims locally valid. Existing NLI and scientific claim-verification benchmarks reduce such cases to entailment, contradiction, or neutral, failing to capture the contextual structure behind divergence. To address this, we introduce BioDivergence, an evaluation framework with a six-class conflict taxonomy, a 13-axis divergence ontology, and four structured outputs per claim pair: conflict type, divergence axes, dominant confounder, and reconciliation explanation. We release BioDivergence-Silver-v1.0, an article-disjoint silver benchmark of 11,865 claim pairs across five biomedical domains, alongside a legacy deduplicated variant for comparison. Results show notable ranking differences between the two variants, with the fine-tuned reference model dropping about 12 points under the article-disjoint setting, while Mistral-7B-Instruct-v0.3 achieves 0.5523 accuracy and 0.3894 contextual-F1 on the 842-example primary test set. BioDivergence offers a more faithful way to distinguish contextual divergence from direct contradiction and to separate article-level memorization from genuine task learning.


2606.11207 2026-06-11 cs.AI cs.CL 新提交

From Explicit Elements to Implicit Intent: A Predefined Library for Auditable Behavioral Inference

从显式元素到隐式意图：用于可审计行为推断的预定义库

Liu hung ming

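Affiliations * PARRAWA AI

AI summary: Proposes SemantiClean, a framework that extracts structured semantic signals from e-commerce session data through a shared element library to drive pluggable inference targets, prioritizing auditability and reproducibility over accuracy alone.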

Comments 20 pages, 9 tables

详情

AI中文摘要

我们提出SemantiClean，一个模块化框架，用于从电商会话数据中提取结构化语义信号，并通过共享元素库驱动可插拔推断目标，包括购买意图、客户细分和产品亲和性。与仅优化准确率的传统端到端预测器不同，SemantiClean优先考虑可审计性、结构治理和sigma=0可复现性，明确牺牲边际预测增益以换取元素级透明度和可辩护的决策轨迹。该框架基于在线购物者购买意图（OSPI）数据集，将24个行为元素组织成四层架构（功能层、交互层、系统层、上下文层），并通过三种抗通胀机制强制信号质量：RedundancyGroup贡献上限、TieredPenaltyCalculator偏差惩罚和AdaptiveConstraintMode冷启动处理。本文介绍了LLM集成语义推断引擎，一个完全实现的两阶段LLM驱动推断架构，在推断时利用完整的元素元数据。本文报告的所有定量结果均由该引擎产生。确定性引擎输出完全可复现（sigma=0）；LLM相关结果（E8、E10）在固定提供者/模型/温度设置下受控输出可变性。性别推断目标在当前实现中非功能性，已从所有定量结果中排除。

英文摘要

We present SemantiClean, a modular framework for extracting structured semantic signals from e-commerce session data and driving pluggable inference targets including purchase intent, customer segmentation, and product affinity through a shared element library. Unlike conventional end-to-end predictors that optimise solely for accuracy, SemantiClean prioritises auditability, structural governance, and sigma=0 reproducibility, explicitly trading marginal predictive gains for element-level transparency and defensible decision trails. Built upon the Online Shoppers Purchasing Intention (OSPI) dataset, the framework organises twenty-four behavioural elements into a four-layer architecture (Functional, Interaction, Systemic, Contextual) and enforces signal quality through three anti-inflation mechanisms: RedundancyGroup contribution caps, TieredPenaltyCalculator bias penalties, and AdaptiveConstraintMode cold-start protection. This report introduces the LLM-Integrated Semantic Inference Engine, a fully implemented two-phase LLM-driven inference architecture that leverages complete element metadata at inference time. All quantitative results reported herein are produced by this engine. Deterministic engine outputs remain fully reproducible (sigma=0); LLM-dependent results (E8, E10) are subject to controlled output variability under fixed provider/model/temperature settings. The gender inference target remains non-functional in the current implementation and is excluded from all quantitative results.


2606.11206 2026-06-11 cs.CL cs.LG 新提交

Compatibility-Aware Dynamic Fine-Tuning for Large Language Models

Yucheng Zhou, Junwei Sheng, Qianning Wang, Jianbing Shen

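Affiliations: SKL-IOTSC, CIS, University of Macau; Auckland University of Technology

AI summary: Proposes Compatibility-Aware Dynamic Fine-Tuning (CADFT), which uses model likelihoods to dynamically modulate supervised updates, suppressing high-variance gradients from incompatible samples and improving training stability and generalization.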

Comments ACL 2026

Abstract

Supervised Fine-Tuning (SFT) is the predominant paradigm for aligning large language models (LLMs), yet it suffers from optimization instability and limited generalization. Recent work attributes this issue to pathological gradient scaling and proposes Dynamic Fine-Tuning (DFT) to correct it at the token level. However, DFT assumes all demonstrations are equally suitable learning targets, an assumption violated by the strong heterogeneity of large-scale instruction data, where demonstration-policy mismatch induces high-variance updates at the sample level. We introduce Compatibility-Aware Dynamic Fine-Tuning (CADFT), a principled extension of DFT that controls sample-level optimization variance. CADFT derives a dynamic, policy-dependent compatibility signal from model likelihoods to modulate supervised updates, suppressing high-variance gradients from incompatible demonstrations. We further propose a delayed, low-frequency compatibility-guided rewriting strategy to transform persistently incompatible demonstrations into learnable targets. We show that CADFT can be interpreted as a variance-controlled estimator that generalizes token-level stabilization in DFT to the sample level. Extensive experiments demonstrate improved stability, generalization, and cold-start reinforcement learning initialization, while remaining fully supervised and independent of explicit reward modeling.
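
To make the token-level versus sample-level distinction concrete, the sketch below contrasts a plain SFT loss, a DFT-style loss that scales each token's negative log-likelihood by that token's probability, and a sample-level compatibility weight. It is not the authors' implementation. The abstract does not give the CADFT formula, so `compatibility` (here the geometric-mean token likelihood) and every function name are assumptions chosen for illustration. In a real training loop the weights would be treated as constants, with no gradient flowing through them.

```python
import math

def sft_loss(token_logprobs):
    """Plain SFT: mean negative log-likelihood over the demonstration's tokens."""
    return -sum(token_logprobs) / len(token_logprobs)

def dft_style_loss(token_logprobs):
    """Token-level reweighting: each token's NLL is scaled by its own probability.
    Assumed form; in practice the weight would be detached from the gradient."""
    return -sum(math.exp(lp) * lp for lp in token_logprobs) / len(token_logprobs)

def compatibility(token_logprobs):
    """Hypothetical sample-level signal in (0, 1]: geometric-mean token likelihood."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))

def compatibility_weighted_loss(token_logprobs):
    """Down-weight the whole sample when the current policy finds it unlikely."""
    return compatibility(token_logprobs) * dft_style_loss(token_logprobs)

# Toy per-token log-probabilities under the current policy (made-up numbers).
compatible = [math.log(0.9), math.log(0.8), math.log(0.85)]
incompatible = [math.log(0.05), math.log(0.1), math.log(0.02)]

for name, lps in [("compatible", compatible), ("incompatible", incompatible)]:
    print(name,
          "sft:", round(sft_loss(lps), 3),
          "weighted:", round(compatibility_weighted_loss(lps), 3))
```

In this toy setup the incompatible demonstration has a large SFT loss, and therefore a large update, while the compatibility-weighted loss for it is small. That is the qualitative effect the abstract describes as suppressing high-variance gradients from incompatible demonstrations.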

2606.11205 2026-06-11 cs.LG cs.AI cs.CL New submission

Dual-Stance Evaluation of Sycophancy: The Structure of Agreement and the Limits of Intervention

Matthew James Buchan

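Affiliations: University of Toronto

AI summary: Proposes dual-stance evaluation and finds that activation steering that reduces sycophancy also suppresses agreement with factually correct statements, revealing a general gap in which representations are readable but not writable.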

Comments 18 pages, 9 figures, accepted to TAIS 2026

Abstract

Activation steering can shift LLM behaviour, but standard evaluations do not typically test whether a sycophancy-reduction direction also suppresses agreement with factually correct statements. We introduce dual-stance evaluation, which tests both stances of each topic, and apply it to centroid-difference steering on Llama-3-8B-Instruct. We find a dissociation: the model represents sycophantic and factual agreement in geometrically distinct subspaces, yet the steering direction projects equally onto both and cannot differentially target either. The direction accordingly reduces agreement with factually correct statements (e.g. that the Earth is round) as well as sycophantic ones. All other static properties of the two activation groups are matched, suggesting the behavioural dissociation arises from generation dynamics or from finer-grained structure that residual-stream analysis cannot resolve. The pattern illustrates a general gap: representations that are readable from activations may not be writable through them.
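
The geometry described in the abstract can be illustrated with a toy example. The sketch below builds a centroid-difference direction from two-dimensional stand-in "activations" and shows how two groups can sit in clearly different places yet project equally onto that direction, so that steering along it shifts both by the same amount. All vectors, group names, and helper functions are hypothetical and hand-constructed to reproduce the pattern. They are not data or code from the paper, and the exact way the paper pools activations to form its direction is not stated in the abstract.

```python
def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def unit(v):
    norm = dot(v, v) ** 0.5
    return [x / norm for x in v]

def steering_direction(agree_acts, disagree_acts):
    """Centroid difference, mean(agree) - mean(disagree), normalised to unit length."""
    ca, cd = centroid(agree_acts), centroid(disagree_acts)
    return unit([a - d for a, d in zip(ca, cd)])

def steer(activation, direction, alpha):
    """Shift an activation by alpha against the direction."""
    return [h - alpha * d for h, d in zip(activation, direction)]

# Toy groups: both kinds of agreement share axis 0 but differ along axis 1.
sycophantic_agree = [[1.0, 1.0], [1.2, 0.8]]
factual_agree = [[1.0, -1.0], [1.2, -0.8]]
disagree = [[-1.0, 0.1], [-1.2, -0.1]]

direction = steering_direction(sycophantic_agree + factual_agree, disagree)
syc_c, fact_c = centroid(sycophantic_agree), centroid(factual_agree)
proj_syc, proj_fact = dot(syc_c, direction), dot(fact_c, direction)

alpha = 0.5
steered_syc = dot(steer(syc_c, direction, alpha), direction)
steered_fact = dot(steer(fact_c, direction, alpha), direction)
print("projections before:", round(proj_syc, 3), round(proj_fact, 3))
print("projections after: ", round(steered_syc, 3), round(steered_fact, 3))
```

Evaluating only the sycophantic group would show the intended reduction and miss the matching drop in the factual group, which is the blind spot that testing both stances is meant to expose.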

2606.11204 2026-06-11 cs.CL cs.IR New submission

Benchmarking Large Language Models for Safety Data Extraction

Jonas Grill, Thomas Bayer, Sören Berlinger

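Affiliations: SAP SE; Institute for Digital Transformation, Ravensburg-Weingarten University

AI summary: Targeting the heterogeneous formats of Safety Data Sheets (SDS), this study benchmarks four large language models (LLMs) on extraction with text-based and multimodal processing. Text-based Gemini 1.5 Pro with chain-of-thought prompting reaches the highest accuracy (84%), but no model meets the 90% threshold for reliable deployment.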

Comments 18 pages, 8 figures, submitted to Applied Intelligence

Abstract

Accurate extraction of structured information from Safety Data Sheets (SDS) remains challenging in industrial safety due to heterogeneous document formats and the limitations of traditional rule-based methods. This study benchmarks state-of-the-art Large Language Models (LLMs) for automated SDS data extraction, comparing text-based and multimodal processing pipelines. We systematically evaluate four models: Gemini 1.5 Pro, GPT-4o, Claude 3.7 Sonnet, and Llama 3.1-70B, across three prompting strategies: zero-shot, few-shot, and chain-of-thought. The evaluation framework assessed accuracy, latency, and cost across more than 50,000 extracted data fields. Results show that text-based extraction consistently outperforms multimodal processing across all metrics. Gemini 1.5 Pro combined with a Chain-of-Thought prompt achieved the highest accuracy (84%), outperforming GPT-4o (81%) and Claude 3.7 Sonnet (79%). However, no model surpassed the 90% accuracy threshold commonly required for reliable real-world deployment. These findings indicate that general-purpose LLMs are not yet robust enough for unsupervised industrial use, though performance suggests strong potential with task-specific fine-tuning. Future research should focus on domain-adapted training, model calibration, and the integration of Human-in-the-Loop verification to ensure safety-critical reliability.
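
The abstract reports accuracy over extracted data fields and compares it with a 90% deployment threshold, but it does not say how a field is scored as correct. The sketch below assumes exact match after light normalisation (case and whitespace), which is only one possible scoring rule. The field names, example values, and the function `field_accuracy` are hypothetical and do not come from the study.

```python
def field_accuracy(predicted, reference):
    """Share of reference fields whose extracted value matches after light normalisation."""
    def norm(value):
        # Lowercase and collapse whitespace so trivial formatting differences are ignored.
        return " ".join(str(value).lower().split())

    correct = sum(
        1
        for key, ref_value in reference.items()
        if key in predicted and norm(predicted[key]) == norm(ref_value)
    )
    return correct / len(reference)

DEPLOYMENT_THRESHOLD = 0.90  # threshold cited in the abstract

# Hypothetical ground truth and model output for one safety data sheet.
reference = {
    "product_name": "Acetone",
    "cas_number": "67-64-1",
    "signal_word": "Danger",
    "flash_point_c": "-20",
}
predicted = {
    "product_name": "acetone",
    "cas_number": "67-64-1",
    "signal_word": "Warning",
    "flash_point_c": "-20",
}

acc = field_accuracy(predicted, reference)
print(f"accuracy={acc:.2f}, meets_threshold={acc >= DEPLOYMENT_THRESHOLD}")
```

In this toy document a single wrong safety-relevant field (the signal word) is enough to fall below the threshold, which is why the abstract recommends human-in-the-loop verification for safety-critical use.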
