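arXivDaily: a daily academic digest synced with the full arXiv feed, with AI-generated summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, and electrical engineering & systems.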

2606.02048 2026-06-02 cs.AI cs.CV physics.bio-ph

Topological texture analysis of microscopy images of dynamic casein gelation and its relation to rheological properties

Zahra Tabatabaei, Diana Soto Aguilar, Jose C. Bonilla, Mathias P. Clausen, Jon Sporring

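Affiliations: Department of Computer Science, University of Copenhagen, Denmark; Department of Green Technology, University of Southern Denmark, Denmark; Department of Food Science, University of Copenhagen, Denmark

AI summary: Proposes a toolbox combining topological data analysis, differential box counting, multifractal partition, and local binary patterns to analyze the topological and textural features of casein gelation in STED microscopy images, revealing microstructural transitions related to rheological properties.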

Abstract

We propose a novel computational toolbox that integrates Topological Data Analysis (TDA), Differential Box Counting (DBC), Multifractal Partition (MFP), and Local Binary Patterns (LBP), applied to time-lapse super-resolution STED microscopy images of sodium caseinate gelation induced by glucono-delta-lactone (GDL) at 30 °C and 40 °C and two GDL concentrations (1.8% and 3.5% w/v). TDA tracked topological loops, closed ring-like structures reflecting protein network interconnectivity, via max-Betti-1 curves, which revealed a lag phase of dispersed aggregates, a sharp decay coinciding with network percolation and the rheologically observed sol-gel transition, and a post-gelation increase corresponding to network rearrangements. These topological transitions were corroborated by DBC and MFP as these methods were able to resolve changes in structural complexity and spatial heterogeneity. The toolbox was validated on simulated fractal images prior to experimental application. Together, these descriptors provided sensitivity to subtle microstructural transitions that bulk rheology captured as averaged bulk mechanical responses. This integrated approach provides a robust quantitative tool for characterizing complex microstructure in food and material science with evolving microstructural dynamics. Code is available at https://github.com/Zahratabatabaei/Delifood_CV_paper.git
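
The abstract names differential box counting (DBC) as one of the four descriptors. As a rough illustration, and not the authors' implementation (their code lives in the linked repository), the Python sketch below estimates a DBC fractal dimension for a square grayscale image. The function name `dbc_dimension`, the grid sizes, and the default of 256 grey levels are assumptions made for this example.

```python
import math

def dbc_dimension(img, sizes, grey_levels=256):
    """Estimate a differential box-counting dimension.

    img is a square list of lists of grey values; every size in `sizes`
    is assumed to divide the image side length exactly.
    """
    m = len(img)
    xs, ys = [], []
    for s in sizes:
        h = s * grey_levels / m  # box height in grey-level units
        total = 0
        for i in range(0, m, s):
            for j in range(0, m, s):
                block = [img[a][b] for a in range(i, i + s) for b in range(j, j + s)]
                # Number of stacked boxes needed to cover the block's intensity range.
                total += int(max(block) // h) - int(min(block) // h) + 1
        xs.append(math.log(m / s))   # log(1/r) with r = s / m
        ys.append(math.log(total))   # log N_r
    # Least-squares slope of log N_r against log(1/r).
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den
```

A perfectly flat image behaves like a smooth surface and gives a dimension of 2, while rougher textures push the estimate toward 3.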

2606.02042 2026-06-02 cs.CV

Normality-Preserving Continual Industrial Anomaly Detection via Orthogonal LoRA Banks

Weibai Fang, Haijun Che, Feiyang Ren, Qiancheng Lao

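Affiliations: Yisu University (Yorkshire University)

AI summary: Proposes a framework built on a History Frozen Orthogonal LoRA Bank and a Hierarchical Novelty Adaptive Bank Growth module to address historical normality prior drift and catastrophic forgetting of diffusion models in continual industrial anomaly detection.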

Comments 33 pages, 6 figures, submitted to Advanced Engineering Informatics

Abstract

Continual industrial anomaly detection with diffusion models suffers from historical normality prior drift and catastrophic forgetting. Existing continual diffusion methods preserve previous knowledge through replay or constrained optimization, but they lack an explicit mechanism for isolating and protecting category-specific normality priors during sequential adaptation. Although low-rank adaptation provides modular residual updates, standard LoRA neither freezes historical normality subspaces nor prevents new adapters from interfering with previous ones. To address this issue, we propose a normality-preserving continual anomaly detection framework based on two modules: History Frozen Orthogonal LoRA Bank (HF-OLB) and Hierarchical Novelty Adaptive Bank Growth module (HNABG). HF-OLB freezes both the pre-trained U-Net backbone and the learned LoRA banks, and constrains new task-specific normality residuals to the orthogonal complement of historical LoRA subspaces. HNABG further allocates layer-dependent residual capacity and expands the bank only when the residual normality novelty exceeds the expressive capacity of existing banks. Extensive experiments on MVTec and VisA demonstrate the effectiveness of the proposed method. On the challenging VisA 2x6 setting, our method achieves 83.6/91.8 image and pixel level A-AUROC with 3.8/3.9 FM, improving pixel level A-AUROC over the state of the art by 3.2 points while reducing pixel level FM by 1.3. These results show that our method effectively preserves historical normality priors in long horizon continual category sequences.
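
The core constraint in HF-OLB, keeping a new update inside the orthogonal complement of a historical subspace, can be illustrated on plain vectors. The sketch below is an assumption-laden toy, not the paper's method: it uses Gram-Schmidt on lists of floats, whereas the paper works with LoRA banks inside a diffusion U-Net. The names `orthonormal_basis` and `project_to_complement` are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def orthonormal_basis(vectors, tol=1e-12):
    """Gram-Schmidt: orthonormal basis of the span of `vectors`."""
    basis = []
    for v in vectors:
        w = list(v)
        for q in basis:
            c = dot(w, q)
            w = [wi - c * qi for wi, qi in zip(w, q)]
        norm = dot(w, w) ** 0.5
        if norm > tol:  # skip directions already covered by the basis
            basis.append([wi / norm for wi in w])
    return basis

def project_to_complement(update, history):
    """Remove from `update` every component lying in the span of `history`."""
    out = list(update)
    for q in orthonormal_basis(history):
        c = dot(out, q)
        out = [oi - c * qi for oi, qi in zip(out, q)]
    return out

# Hypothetical frozen historical directions and a new task update.
history = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
new_update = project_to_complement([3.0, 4.0, 5.0], history)
```

Because the projected update has zero inner product with every historical direction, applying it cannot change the model's response along those directions, which is the intuition behind freezing earlier banks.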

2606.02041 2026-06-02 cs.CL

SentGuard: Sentence-Level Streaming Guardrails for Large Language Models

Jiaqi Yu, Xin Wang, Yixu Wang, Jie Li, Yan Teng, Xingjun Ma, Yingchun Wang

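Affiliations: Fudan University; Shanghai AI Laboratory

AI summary: Proposes SentGuard, a sentence-level streaming guardrail that runs in parallel with generation; a lightweight waiting buffer groups streamed tokens into sentence chunks and releases only verified chunks, detecting unsafe content with high accuracy at low latency.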

Comments 16 pages, 5 figures, submitted to ARR

Abstract

Large language models increasingly stream long, reasoning-intensive responses in real time, making when to moderate as critical as whether to moderate. Existing guardrails fall into two unsatisfactory extremes: response-level methods delay intervention until the full output is generated, whereas token-level methods act on incomplete semantics, often producing unstable decisions and excessive guard invocations. To address this challenge, we propose SentGuard, a sentence-level streaming guardrail that operates in parallel with generation. A lightweight waiting buffer groups streamed tokens into sentence chunks and releases only verified chunks to the user, introducing a small offset that enables SentGuard to assess the current prefix while the target LLM decodes subsequent content. To support this, we construct StreamSafe, a benchmark with structured per-sentence annotations across 8 harm categories, capturing the evolution of safety risks across both reasoning and response segments. We further train SentGuard with a coarse-to-fine objective to detect unsafe intent as soon as it emerges at sentence boundaries. Experiments on 5 safety benchmarks show that SentGuard outperforms existing baselines, detecting 90.5% of unsafe cases within two sentences while maintaining a low streaming false-positive rate of 7.41%.
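
The waiting-buffer idea can be sketched in a few lines of Python. This is a simplified, sequential illustration under stated assumptions: the real system checks in parallel with decoding and uses a trained guard model, while here `is_safe` is any caller-supplied predicate on the text prefix, and sentence boundaries are detected by a naive punctuation rule. The function name `release_verified_sentences` is hypothetical.

```python
def release_verified_sentences(token_stream, is_safe, terminators=(".", "!", "?")):
    """Yield sentence chunks only after the prefix ending in them passes the guard."""
    released = ""  # text already shown to the user
    pending = []   # tokens of the sentence still held in the buffer
    for token in token_stream:
        pending.append(token)
        if token.rstrip().endswith(terminators):
            sentence = "".join(pending)
            pending = []
            if not is_safe(released + sentence):
                return  # unsafe prefix: stop before releasing this sentence
            released += sentence
            yield sentence
    if pending:  # trailing text without a terminator
        tail = "".join(pending)
        if is_safe(released + tail):
            yield tail

# Toy guard: a keyword check stands in for the trained guardrail.
tokens = ["Hello", " world", ".", " Do", " bad", " thing", ".", " More", "."]
shown = list(release_verified_sentences(tokens, lambda prefix: "bad" not in prefix))
```

The user never sees a token from a sentence that has not been checked, at the cost of a delay of at most one sentence.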

2606.02035 2026-06-02 cs.AI cs.LG

RL-ACRGNet: Reinforcement Learning-Based Chest Radiology Report Generation Network

Yogesh Kumar Meena, Saurabh Agarwal, K. V. Arya

Affiliations: Human-AI Interaction (HAIx) Lab, Indian Institute of Technology Gandhinagar; Department of Computer Science and Engineering, Madhav Institute of Technology and Science Deemed University (MITS-DU); Multimedia and Information Security Research Group, Department of Computer Science and Engineering, ABV-Indian Institute of Information Technology and Management

AI summary: Proposes RL-ACRGNet, an off-policy reinforcement learning framework that combines a pre-trained DenseNet encoder with a multilevel LSTM decoder and refines visual-semantic embeddings through a metric-based reward mechanism; it outperforms baselines on the IU-Xray and MIMIC-CXR datasets and generates high-quality clinical reports.

Comments This work has been submitted to the IEEE for possible publication

Abstract

Medical imaging interpretation is a foundational pillar of modern clinical diagnostics, yet the manual generation of radiology reports remains a time-consuming process prone to interpretation inconsistencies. Within the field of medical AI, automating these descriptions through deep learning promises to streamline clinical workflows and standardise diagnostic output. However, accurate disease detection and precise report generation remain significant challenges due to limitations in capturing fine-grained visual features and ensuring clinical coherence. To address these issues, we propose RL-ACRGNet, an improved encoder-decoder model that integrates a pre-trained DenseNet encoder with a multilevel LSTM decoder within an off-policy reinforcement learning framework. Using a dual-network approach to refine visual-semantic embeddings through a metric-based reward mechanism, we demonstrate that RL-ACRGNet consistently outperforms state-of-the-art baselines on the IU-Xray dataset, achieving quantitative improvements in BLEU-4 (0.47%), METEOR (0.17%) and ROUGE-L (0.518). Furthermore, comprehensive evaluations on the large-scale MIMIC-CXR dataset confirm the robust generalisation of the model and its ability to generate high-quality, clinically relevant reports.

2606.02027 2026-06-02 cs.RO cs.LG cs.MA

World-Task Factorization for Robot Learning

Eduardo Sebastián, Adrian Pfisterer, Vito Mengers, Oliver Brock, Amanda Prorok

Affiliations: Department of Computer Science and Technology, University of Cambridge, United Kingdom; Robotics and Biology Laboratory, Technische Universität Berlin; Science of Intelligence (SCIoI), Cluster of Excellence, Berlin, Germany; Robotics Institute Germany

AI summary: Proposes factoring the policy into world factors and task factors by pairing the differentiable graph model AICON with a compact learned policy, achieving zero-shot generalization to new configurations and transfer to real hardware.

Abstract

Robot learning must produce policies that generalize to new combinations of constraints, teammates, and environments. To achieve this, we must structurally factor the policy, which is a choice that dictates what generalizes, what requires retraining, and what remains entangled. Existing methods span a wide spectrum, from expecting structure to emerge from data scaling, to hand-designing it via hierarchies, skill libraries or learned specializations. In this paper, we study what we argue is the most fundamental factorization in robotics: separating the world from the task. We investigate the conditions under which this factorization is principled. World factors are properties of the embodied system and the environment; they exist independently of intent. Task factors are defined by the task's logic over what the world admits. We formalize this asymmetry through Bayesian model evidence: it aligns with the data-generating process, maintains high likelihood through an analytical world model, and reduces the Occam razor's penalty on task parameters. We instantiate this factorization by pairing AICON, a differentiable graph of recursive estimators and interconnections that is compositional, operates without task-specific data, and propagates cost gradients to actuators, with a compact, learned policy that modulates gradient paths. Gradients serve as the interface between the two factors: they carry world structure through the graph and task structure through costs, enabling low-dimensional learning while preserving structural generalization. We test the world/task factorization across three problems that encompass heterogeneous robots, environments, task logic and sensorimotor modalities. Our framework outperforms end-to-end baselines and analytical heuristics in all settings, generalizes zero-shot to out-of-distribution configurations, and transfers to real hardware without retraining.

2606.02022 2026-06-02 cs.CV cs.AI cs.LG

Ranking vs. Assignment: The Metric Mismatch in Multi-View Object Association

Matvei Shelukhan, Timur Mamedov, Aleksandr Chukhrov, Karina Kvanchiani

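Affiliations: Tevian Moscow; Lomonosov Moscow State University

AI summary: Reveals a fundamental mismatch between the ranking metrics commonly used in multi-view object association (such as AP and FPR-95) and the assignment objective, and uses Sinkhorn-based normalization as a post-processing stress test to demonstrate it.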

Abstract

Multi-view object association is an important computer vision problem that underlies many multi-camera perception tasks. While this task is naturally formulated as a constrained one-to-one matching problem, recent works heavily rely on pairwise ranking metrics like AP and FPR-95 for model evaluation. We highlight a fundamental mismatch between these metrics and the actual assignment objective. Theoretically, we show that AP and FPR-95 can be imperfect even when the assignment is already correct, and that Sinkhorn-based normalization can make them perfect. Conversely, optimal pairwise ranking can still lead to incorrect assignments. We validate this mismatch in practice by using our Sinkhorn-based normalization as a controlled post-processing stress test. We show that optimizing just a few post-processing parameters significantly boosts AP and FPR-95 without corresponding improvements in assignment-level metrics such as ACC and IPAA.
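
The mismatch the abstract describes can be reproduced on a toy score matrix. The sketch below, written under assumptions rather than taken from the paper, applies a basic Sinkhorn normalization (alternating row and column normalization of exponentiated scores). The function name `sinkhorn`, the temperature, and the iteration count are hypothetical choices; the paper's actual post-processing parameters are not given in the abstract.

```python
import math

def sinkhorn(scores, temperature=1.0, iters=50):
    """Alternately normalize rows and columns of exp(scores / temperature)."""
    k = [[math.exp(s / temperature) for s in row] for row in scores]
    for _ in range(iters):
        k = [[v / sum(row) for v in row] for row in k]            # rows sum to 1
        col = [sum(row[j] for row in k) for j in range(len(k[0]))]
        k = [[v / col[j] for j, v in enumerate(row)] for row in k]  # columns sum to 1
    return k

# Toy similarity matrix: true matches are on the diagonal.
raw = [[0.9, 0.8],
       [0.1, 0.2]]
normalized = sinkhorn(raw)
```

In the toy matrix the per-row argmax already picks the diagonal, so the assignment is correct, yet the negative pair scored 0.8 outranks the positive pair scored 0.2. After normalization both diagonal entries exceed both off-diagonal entries, so pairwise ranking looks perfect although the assignment has not changed.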

2606.02021 2026-06-02 cs.CV

PerBite: A Curated Diagnostic Workflow for Bite-Aware Food Volume Estimation

Ahmad AlMughrabi, Farid Al-Areqi, David Fernández Gómez, Umair Haroon, Marc Bolaños, Ricardo Marques, Petia Radeva

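Affiliations: University of Barcelona; LogMeal; Universitat Pompeu Fabra

AI summary: Proposes the PerBite workflow, which estimates food volume from before- and after-consumption states through segmentation, 3D reconstruction, scale calibration, and mesh post-processing, and ranks first in the MetaFood challenge.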

Abstract

Can a visually plausible food mesh be trusted to estimate the volume of consumed food? PerBite investigates this question using selected paired before- and after-consumption states from the MetaFood CVPR 2026 Continuous 3D Reconstruction While Eating Challenge. The submitted workflow follows a curated reconstruction protocol: SAM 3 segments the food and plate regions; Hunyuan3D/SAM 3D generates a dimensionless food mesh; the plate diameter provides the metric scale; the plate geometry is removed in Blender; and the remaining mesh is hole-filled, made watertight, and integrated to estimate volume. MoGe-2 is used only as an auxiliary cue for initial dish-diameter estimation when direct plate measurement is uncertain; it is not the primary scale source for the reported challenge result. PerBite ranks first, with an average Chamfer distance of 8.31 across 34 meshes using rigid ICP without scale correction. On 17 before-and-after pairs, it achieves 33.87% state-level volume MAPE and zero monotonicity violations, while consumed-volume MAPE remains 53.74%. The results show that surface reconstruction, metric scale, controlled mesh cleanup, watertight volume integration, and physical depletion consistency should be evaluated separately for dietary assessment. Source code and evaluation scripts will be available at https://github.com/GCVCG/PerBite-CVPR-MetaFood-2026.
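
The last two steps of the workflow, metric scaling from the plate diameter and volume integration over a watertight mesh, reduce to a short calculation. The sketch below is a generic illustration and not the submitted pipeline: it sums signed tetrahedron volumes over the triangles of a closed, consistently oriented mesh and rescales by the cube of a diameter ratio. The names `mesh_volume` and `metric_scale` and the example diameters are assumptions.

```python
def mesh_volume(vertices, faces, scale=1.0):
    """Volume of a closed, consistently oriented triangle mesh after uniform scaling."""
    six_vol = 0.0
    for i, j, k in faces:
        ax, ay, az = vertices[i]
        bx, by, bz = vertices[j]
        cx, cy, cz = vertices[k]
        # a . (b x c) is six times the signed volume of the tetrahedron
        # spanned by the origin and the triangle (a, b, c).
        six_vol += (ax * (by * cz - bz * cy)
                    + ay * (bz * cx - bx * cz)
                    + az * (bx * cy - by * cx))
    # Volume grows with the cube of a uniform length scale.
    return abs(six_vol) / 6.0 * scale ** 3

def metric_scale(real_plate_diameter, mesh_plate_diameter):
    """Length scale that maps dimensionless mesh units to real units."""
    return real_plate_diameter / mesh_plate_diameter

# Unit right tetrahedron with outward-facing triangles.
verts = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1)]
tris = [(0, 2, 1), (0, 1, 3), (0, 3, 2), (1, 2, 3)]
unit_volume = mesh_volume(verts, tris)
scaled_volume = mesh_volume(verts, tris, scale=metric_scale(26.0, 13.0))
```

The cubic dependence on scale is why a modest error in the plate diameter translates into a much larger error in estimated volume.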

2606.02020 2026-06-02 cs.CL cs.LG

Unveiling the Entropy Dynamics of Chain-of-Thought Reasoning

Ting Xu, Xu He, Yupu Lu, Jiankai Sun, Dong Li, Wai Lam, Jianye Hao

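Affiliations: University of Science and Technology of China

AI summary: Uses entropy dynamics to reveal a two-phase structure in chain-of-thought reasoning (an Uncertainty Region and a Confidence Region), and proposes a training-free framework based on CUSUM change-point detection that enables early exit and test-time scaling to improve inference efficiency and reliability.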

Comments 21 pages, 10 figures, accepted at ICML 2026

Abstract

This paper investigates the entropy dynamics of Chain-of-Thought (CoT) and uncovers a consistent two-phase structure: an Uncertainty Region of exploration transitioning sharply to a Confidence Region of convergence. We demonstrate that the Confidence Region possesses two critical properties: 1) High Reliability -- answers in the confidence region become highly accurate and stable, and 2) High Redundancy -- models generate unnecessary tokens long after reaching the correct answer. These properties unlock more efficient and reliable inference strategies: 1) Early Exit leverages reliability and redundancy to terminate computation safely when returns diminish, and 2) Test-Time Scaling uses the Confidence Region signal to prioritize converged trajectories. To operationalize these insights, we formulate Confidence Region detection as a sequential change-point detection problem, being the first to apply classical change-point methods to monitor CoT reasoning. Using the Cumulative Sum (CUSUM) algorithm, a statistically optimal change-point detector, we develop a training-free framework for real-time inference control. Experiments show our approach establishes a superior Pareto frontier for early exit. CUSUM achieves 63.06% accuracy with 11.1% token reduction, outperforming DEER and Dynasor by 3.28% and 4.36% in accuracy respectively. For test-time scaling, CUSUM-weighted voting consistently outperforms self-consistency.
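
To make the change-point formulation concrete, here is a minimal one-sided CUSUM detector for a sustained drop in a per-step entropy signal. It is a textbook sketch under assumptions, not the paper's framework: the function name `cusum_drop_index`, the reference mean, the slack, and the threshold are all hypothetical, and the abstract does not say how the authors set them.

```python
def cusum_drop_index(entropies, reference_mean, slack=0.5, threshold=3.0):
    """Return the first index where a sustained entropy drop is flagged, or None."""
    stat = 0.0
    for t, h in enumerate(entropies):
        # Accumulate evidence that entropy sits more than `slack`
        # below the reference level; reset to zero when it does not.
        stat = max(0.0, stat + (reference_mean - h - slack))
        if stat > threshold:
            return t
    return None

# Synthetic trace: high entropy (exploration) followed by low entropy (convergence).
trace = [2.0] * 5 + [0.2] * 5
alarm = cusum_drop_index(trace, reference_mean=2.0)
```

On this synthetic trace the drop starts at index 5 and the alarm fires at index 7, since three consecutive low-entropy steps are needed before the accumulated statistic exceeds the threshold. An early-exit controller could stop decoding at the alarm.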

2606.02016 2026-06-02 cs.LG

Evaluating Real-World Generalizability of Algorithm Selection Models

Gjorgjina Cenikj, Jakub Kudela, Eva Tuba, Tome Eftimov

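Affiliations: Computer Systems Department, Jožef Stefan Institute; Brno University of Technology

AI summary: Systematically evaluates, through cross-benchmark testing, how well algorithm selection models generalize across synthetic and real-world optimization problems, analyzing their transfer performance and pointing out the challenges of domain-specific applications.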

Comments 10 pages, 12 figures

DOI: 10.1145/3795101.3805348

Abstract

Algorithm Selection (AS) aims to automatically identify the most suitable optimization algorithm for a given problem instance by leveraging measurable problem characteristics and historical performance data. In this study, we investigate the generalization ability of AS models across both synthetic and real-world optimization landscapes. We consider two widely used academic benchmark suites (BBOB and CEC) and two real-world problem sets (robotics trajectory optimization tasks and unmanned aerial vehicle path-planning problems). Through a systematic cross-benchmark evaluation, we analyze how AS models transfer between domains, identify where generalization succeeds or breaks down, and highlight the challenges that arise when applying AS in realistic, domain-specific contexts. Our findings provide insights into the robustness of current AS approaches and inform the development of more reliable, broadly applicable AS systems for real-world optimization.

2606.02011 2026-06-02 cs.AI cs.LG

Extreme Low-Bit Inference in Reasoning Models: Failure Modes and Targeted Recovery

Ekaterina Alimaskina, Darya Rudas, Denis Shveykin, Gleb Molodtsov, Pavel Vasiliev, Aleksandr Beznosikov

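Affiliations: University of Washington

AI summary: Addresses the failure of 2-bit quantized inference in large reasoning models to deliver end-to-end speedup, caused by generation instability that inflates the total token count, and proposes two lightweight controls, FP16 planning and loop rescue, that substantially recover accuracy while preserving real-world speed.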

Abstract

Large Reasoning Models (LRMs) rely on long reasoning traces, making inference expensive. While low-bit quantization reduces per-token decoding cost, we show that aggressive 2-bit inference can fail to deliver end-to-end speedup because instability in the generation process inflates total token count. Instead of merely lowering answer accuracy, 2-bit quantization often produces much longer traces with repetitive loops, budget exhaustion, delayed commitment, and unclosed reasoning segments. We analyze full reasoning traces of Qwen3 reasoning models across mathematical and commonsense benchmarks and show that accuracy degradation is tightly linked to these process-level failures. To address them, we introduce two lightweight controls: FP16 planning, which gives the 2-bit model a short high-precision outline, and loop rescue, which detects repetitive traces and either commits to an earlier answer or falls back to FP16. On MATH-500, loop rescue improves Qwen3-8B accuracy from 17.2% to 74.2%, while planning plus loop rescue improves Qwen3-32B from 65.0% to 87.2%. Overall, our results show that extreme low-bit reasoning becomes practical when its failures are treated as controllable generation pathologies: with lightweight detection and selective FP16 support, 2-bit inference can recover accuracy while preserving real end-to-end speed. Our code is available at: https://github.com/brain-lab-research/quantized-reasoning.
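
Loop rescue, as described, starts by detecting repetitive traces. One simple way to do that is to check whether the tail of the token sequence is the same block repeated back to back. The sketch below assumes exactly that and nothing more: `find_loop_period`, `max_period`, and `repeats` are hypothetical names and settings, and the authors' detector (available in their repository) may work differently.

```python
def find_loop_period(tokens, max_period=8, repeats=3):
    """Return the shortest period p such that the last p * repeats tokens
    consist of one block of length p repeated `repeats` times, or None."""
    for period in range(1, max_period + 1):
        need = period * repeats
        if len(tokens) < need:
            break  # longer periods need even more tokens
        tail = tokens[-need:]
        if all(tail[i] == tail[i % period] for i in range(need)):
            return period
    return None

# A trace that keeps restating the same step.
looping = "so x = 2 .".split() * 3
period = find_loop_period(looping)
```

Once a period is found, a controller could either commit to the most recent answer or hand the prompt to a higher-precision model, which are the two options the abstract mentions.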

2606.02010 2026-06-02 cs.CL cs.AI

PlanarBench: Evaluating LLM Spatial Reasoning via Planar Graph Drawing

Oleksandr Nikitin

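Affiliations: tvori.info

AI summary: Proposes the PlanarBench benchmark, which evaluates the spatial reasoning of LLMs by having them draw planar graphs as ASCII art from edge lists, and finds that the number of edges is the main predictor of difficulty.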

Comments 12 pages, 4 figures, https://github.com/wizzard0/planar-bench-as1073

2606.02009 2026-06-02 cs.CL

Automated Essay Scoring and Language Certification: Assessing Generalizability, Agreement and Validity for French

Rodrigo Wilkens, Rémi Cardon, Vincent Folny, Thomas François

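Affiliations: University of Exeter; France Éducation international; Cental, IL&C, UCLouvain; Computer Science and Engineering Department, Universidad Carlos III de Madrid

AI summary: Proposes an enhanced argument-based validation framework that evaluates 8 model architectures on French essay scoring along multiple dimensions (fairness analysis, correlations with linguistic features, prediction error evaluation, and agreement with human raters), advancing the state of the art in French automated essay scoring.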

Abstract

In Automated Essay Scoring (AES), benchmarking practices have fostered minimalist evaluation practices, in contrast with the broader-view recommendations of evaluation frameworks, such as the argument-based validation framework (ABV), which argued in favor of a multidimensional assessment of systems, especially in the context of high-stakes language tests. In this paper, we introduce an enhanced and more practical version of the ABV framework, incorporating fairness analysis, correlations with linguistic features, prediction error evaluation, and model agreement compared with human raters. Applying this framework to French AES, we compare 8 model architectures on a corpus of 27k exam essays (2 raters each) and a generalization corpus of 961 essays (at least nine raters each). Our analyses illustrate the benefits of applying the ABV framework to better understand the capabilities and pitfalls of AES models, while also advancing the state-of-the-art for French AES.

2606.02002 2026-06-02 cs.CV

Distortion-Aware Fusion of Statistical and Vision-Language Features for Blind Image Quality Assessment

Bishr Omer Abdelrahman Adam, Xu Li

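Affiliations: Northwestern Polytechnical University

AI summary: Proposes a distortion-aware fusion framework that dynamically weights NSS statistical features and VLM embeddings through a multiplicative gating mechanism, achieves state-of-the-art or competitive performance on three benchmarks, and reveals how the NSS contribution varies across distortion types.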

Abstract

Blind image quality assessment (BIQA) aims to predict perceived image quality without access to a reference image. Classical natural scene statistics (NSS) descriptors and modern vision-language model (VLM) embeddings address this problem from fundamentally different perspectives, yet whether combining them yields complementary benefits and how to weight their contributions per input image remains unexplored. We propose a distortion-aware fusion framework that integrates a 138-dimensional NSS descriptor with two complementary VLM embeddings, SigLIP and CLIP-H, through a multiplicative gating mechanism that learns per-input stream weights conditioned on image content. Unlike static concatenation fusion, the proposed gating network suppresses or amplifies each stream's contribution based on the input, producing weights that correlate positively (Spearman rank correlation rho=0.33) with the per-distortion NSS contribution measured by independent ablation on KADID-10k. The framework requires no end-to-end fine-tuning of the VLM backbones and is trained with a hybrid loss combining mean squared error, Pearson linear correlation, and pairwise ranking objectives. We evaluate on three standard benchmarks: KonIQ-10k (SROCC=0.9142, PLCC=0.9279), KADID-10k (SROCC=0.9715, PLCC=0.9733, surpassing recent state-of-the-art methods), and LIVE Challenge in-the-Wild (SROCC=0.8527, PLCC=0.8802 with cross-dataset pretraining and fine-tuning). A per-distortion analysis on KADID-10k reveals that NSS features contribute most on noise and color-shift distortions where pixel statistics are directly affected, and least on perceptual distortions such as color saturation changes. The learned gate values validate these findings, confirming that the model autonomously discovers distortion-stream affinity patterns consistent with the manual per-distortion study.

2606.02001 2026-06-02 cs.CL

Scaling Agentic Capabilities via Grounded Interaction Synthesis

Wenhang Shi, Jinhao Dong, Yiren Chen, Zhe Zhao, Shuqing Bian, Wei Lu, Xiaoyong Du

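Affiliations: Renmin University of China; Peking University; Tencent

AI summary: Proposes the GAIS framework, which automatically generates diverse environments and complex tasks through a two-phase grounding mechanism (protocol-anchored environments and structure-guided planning), significantly improving agent performance on BFCL, τ²-Bench, and ACEBench.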

Abstract

General agentic intelligence hinges on the ability to interact with diverse real-world tools to complete complex tasks, a capability fundamentally tied to the quality of interaction data. To bypass the prohibitive costs of human annotation, prevailing paradigms depend entirely on Large Language Models (LLMs) to scale the synthesis of agentic environments and tasks. However, such unconstrained generation often degenerates into biased random sampling of LLMs' internal priors, failing to capture the diversity and difficulty of real-world domains or construct high-fidelity, long-horizon tasks. In this work, we introduce Grounded Agentic Interaction Synthesis (GAIS), a framework that automates the scalable construction of diverse environments and complex tasks via a two-phase grounding mechanism. Specifically, we construct protocol-anchored environments derived from real-world Model Context Protocol (MCP) servers to ensure functional diversity and difficulty. Subsequently, we employ structure-guided planning to navigate these environments, actively enforcing logical dependencies and adversarial policies to generate complex tasks. Experiments on BFCL, τ²-Bench, and ACEBench demonstrate that GAIS-synthesized data significantly outperforms state-of-the-art baselines, enabling base models to match or even surpass their official instruction-tuned counterparts. Furthermore, GAIS exhibits superior data efficiency and scalability, achieving exceptional capabilities with significantly less data while maintaining continuous growth where baselines stagnate. Our code and dataset are publicly available at https://github.com/Eric8932/GAIS.

2606.02000 2026-06-02 cs.CV cs.AI eess.IV

Towards 3D-Aware Video Diffusion Models: Render-Free Human Motion Control with Mesh Tokenization

Jingyun Liang, Min Wei, Shikai Li, Yizeng Han, Hangjie Yuan, Lei Sun, Weihua Chen, Fan Wang

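Affiliations: DAMO Academy, Alibaba Group; Hupan Lab; Zhejiang University; INSAIT

AI summary: Proposes a render-free framework that conditions video generation directly on compressed 3D human mesh tokens, enabling precise human motion control, reducing artifacts from 2D guidance, and improving 3D structure modeling.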

Comments Project page: https://jingyunliang.github.io/MeshToken/

Abstract

Diffusion models have shown remarkable success in video generation. However, whether such models are truly aware of the 3D structure underlying visual observations, rather than simply reproducing plausible 2D projections, remains an open question. In this work, we investigate this question through human motion control, a task that requires precise modelling of 3D human geometry, motion, camera viewpoint, and scene context. Unlike prior methods that rely on rendered 2D motion guidance videos, we propose a render-free framework that conditions video generation directly on compressed 3D human mesh tokens. This representation preserves full 3D geometric information while enabling a unified token-based generation pipeline that processes video tokens jointly with motion tokens in a DiT-based architecture. This design requires the model to reason jointly about appearance, 3D structure, and camera viewpoint during video generation. Experimental results demonstrate strong performance on human motion control benchmarks, while reducing artifacts induced by view-dependent 2D guidance and trajectory-pose mismatches during editing. These findings suggest that video diffusion models, when equipped with mesh tokenization, can better capture complex 3D human structures and their interactions with the surrounding environment.

2606.01999 2026-06-02 cs.LG cs.AI

Why Do Time Series Models Need Long Context Windows?

为什么时间序列模型需要长上下文窗口？

Luca Butera, Giovanni De Felice, Andrea Cini, Cesare Alippi

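Affiliations: Università della Svizzera Italiana; EPFL; Politecnico di Milano

AI summary: Framing forecasting as two objectives, generative process identification and conditional forecasting, the paper argues that long context windows improve predictions by reducing uncertainty about the generating process, and proves that even for processes with memory length $P$ the input window must be strictly larger than $P$ to reach the minimum attainable error.

AI Chinese abstract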

现代用于预测时间序列组的深度学习模型依赖于越来越长的观测窗口。然而，增加窗口大小的好处通常被简单地归因于捕捉长程依赖，而关于全局预测模型如何利用输入观测的更广泛讨论一直有限。在本文中，我们表明预测时间序列组涉及两个目标：(i) 生成过程识别（GPI），即推断生成输入序列的具体过程，以及 (ii) 条件预测（CF），即根据输入观测预测未来值。从这个角度来看，最优预测可以解释为对所有可能数据生成过程的平均，并按输入窗口给定的似然加权。这为长上下文窗口的好处提供了另一种解释：它们降低了运行过程中输入时间序列由哪个具体过程生成的不确定性。我们证明，即使对于记忆长度为 $P$ 的过程，严格大于 $P$ 的输入窗口大小对于达到最小可实现误差是必要的。最后，我们展示了如何将 GPI 和 CF 解耦，以在不牺牲准确性的情况下提高计算可扩展性。在合成和真实数据上的实验验证了我们的见解及其对设计预测架构的相关性。

English abstract

Modern deep learning models for forecasting groups of time series rely on increasingly longer observation windows. However, the benefit of increasing the window size is often simply attributed to capturing long-range dependencies, and broader discussion on how global forecasting models leverage input observations has been limited. In this paper, we show that forecasting groups of time series involves two objectives: (i) generative process identification (GPI), i.e., inferring the specific process generating the input sequence, and (ii) conditional forecasting (CF), i.e., predicting future values given input observations. From this perspective, optimal predictions can be interpreted as an average over plausible data-generating processes, weighted by their likelihood given the input window. This suggests another explanation for the benefits of long context windows: they reduce the uncertainty about which specific process is generating the input time series during operation. We prove that even for processes with memory length $P$, an input window size strictly larger than $P$ is necessary to achieve the minimum attainable error. Finally, we show how decoupling GPI and CF can improve computational scalability without compromising accuracy. Experiments on synthetic and real-world data validate our insights and their relevance for designing forecasting architectures.
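
The statement that the optimal prediction averages over plausible generating processes can be written out as follows. This is a sketch in notation that is not the paper's: it assumes a finite family of processes indexed by $k$, squared-error loss, and an input window $x_{t-W+1:t}$ of length $W$.

```latex
\hat{y}^{*}(x_{t-W+1:t})
  = \mathbb{E}\left[\, y_{t+1} \mid x_{t-W+1:t} \,\right]
  = \sum_{k} \underbrace{p\left(k \mid x_{t-W+1:t}\right)}_{\text{GPI}}
             \;\underbrace{\mathbb{E}\left[\, y_{t+1} \mid x_{t-W+1:t},\, k \,\right]}_{\text{CF}}
```

Under these assumptions, a longer window can sharpen the posterior $p(k \mid x_{t-W+1:t})$ even when each conditional forecast depends only on the last $P$ observations, which is one way to read the result that $W > P$ may be required.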

2606.01995 2026-06-02 cs.CL

CARTE: A Benchmark for Mapping Language Model Knowledge Across France

CARTE：法国语言模型知识映射基准

Sarah Almeida Carneiro, Christos Xypolopoulos, Xiao Fei, Yang Zhang, Michalis Vazirgiannis

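Affiliations: École Polytechnique, Institut Polytechnique de Paris; National Technical University of Athens; Mohamed bin Zayed University of Artificial Intelligence

AI summary: Introduces CARTE, a benchmark of 2,431 multiple-choice questions that evaluates LLMs' fine-grained regional knowledge across France's 13 metropolitan regions and 14 thematic domains, plus the CARTE-LV subset focused on linguistic variation; experiments show performance disparities across regions and model scales.

AI Chinese abstract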

我们推出了CARTE（文化锚定的区域-领土评估），这是一个多项选择基准，用于评估大语言模型（LLMs）在法国境内基于地理和区域差异的知识上进行细粒度推理的能力。虽然先前的基准侧重于国家层面的文化理解，但它们很大程度上忽略了国内差异以及区分密切相关区域背景的需求。CARTE通过引入涵盖法国13个大区和14个主题领域（包括文化、语言、人口、经济、环境和流动性）的2431个问题来填补这一空白。我们进一步推出了CARTE-LV，这是一个针对法国区域语言变异的子集，能够对语言相关差异进行集中评估。我们在少样本设置下评估了27个参数从1B到12B的LLMs。我们的实验揭示了跨区域和模型规模的性能差异，表明预训练覆盖存在系统性差距，且对国内变异的鲁棒性有限。

English abstract

We introduce CARTE (Culturally Anchored Regional-Territorial Evaluation), a multiple-choice benchmark for evaluating the ability of large language models (LLMs) to perform fine-grained reasoning over geographically grounded and regionally differentiated knowledge within France. While prior benchmarks focus on national-level cultural understanding, they largely overlook intra-country variation and the need to distinguish between closely related regional contexts. CARTE addresses this gap by introducing 2,431 questions spanning the 13 metropolitan regions of France and covering 14 thematic domains, including culture, language, demographics, economy, environment, and mobility. We further introduce CARTE-LV, a subset targeting Linguistic Variation across French regions, enabling focused evaluation of language-related differences. We evaluate 27 LLMs ranging from 1B to 12B parameters under few-shot settings. Our experiments reveal performance disparities across regions and model scales, suggesting systematic gaps in pretraining coverage and limited robustness to intra-national variation.

2606.01993 2026-06-02 cs.CL cs.AI cs.LG

MMG2Skill: Can Agents Distill In-the-Wild Guides into Self-Evolving Skills?

MMG2Skill: 智能体能否从野外指南中提炼出自我进化的技能？

Xinyu Che, Junqi Xiong, Yunfei Ge, Xinping Lei, Shihao Li, Hang Yan, Han Li, Yuanxing Zhang, Zhiqi Bai, Jinhua Hao, Ming Sun, Han Li, Jiaheng Liu

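Affiliations: Nanjing University; Kuaishou Technology

AI summary: Proposes MMG2Skill, a framework that compiles multimodal, heterogeneous in-the-wild guides into editable skills and keeps revising them from trajectory-level root-cause feedback, substantially improving VLM agents on GUI control, open-ended gameplay, and strategic card play.

Comments: 35 pages, 12 figures, 13 tables. Code: https://github.com/NJU-LINK/MMG2Skill

AI Chinese abstract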

网络上丰富的程序性知识对于帮助智能体解决长程任务具有巨大潜力。然而，这些知识通常是多模态、异构、有噪声的，并且隐含地假设人类执行者，使得它们难以直接用作智能体所需的技能。为了弥合人类导向指南与智能体可执行技能之间的差距，我们将此问题形式化为指南到技能学习：将野外指南转换为可执行技能，并从智能体可观察的轨迹中持续改进它们。为了评估现有智能体在此任务上的能力，我们引入了MMG2Skill-Bench，这是针对该问题的首个基准测试。我们进一步提出了MMG2Skill，一个闭环框架，它将指南编译为可编辑技能，在执行过程中将固定的视觉语言模型（VLM）智能体条件化于这些技能，并从轨迹级根因反馈中修正技能，而不使用基准测试分数。在GUI控制、开放式游戏和策略卡牌游戏中，使用六个VLM骨干网络，MMG2Skill在每个模型-领域设置中始终优于普通基线智能体，在骨干网络上实现了宏观平均增益+12.8到+25.3个百分点。消融研究表明，直接用原始指南提示智能体会降低性能，而结构化技能构建和轨迹驱动修订对于观察到的改进都是必要的。在成功可推断的任务中，当成功信号适当校准时，基于分析器的提前停止进一步防止了后期性能退化，并节省了25%-53%的尝试次数。

English abstract

Abundant procedural knowledge on the Web holds great potential for helping agents solve long-horizon tasks. However, such knowledge is often multimodal, heterogeneous, noisy, and implicitly assumes human executors, making it difficult to use directly as the skills required by agents. To bridge the gap between human-oriented guides and agent-executable skills, we formalize this problem as guide-to-skill learning: converting in-the-wild guides into executable skills and continuously improving them from trajectories observable to the agent. To evaluate the capability of existing agents on this task, we introduce MMG2Skill-Bench, the first benchmark designed for this problem. We further propose MMG2Skill, a closed-loop framework that compiles guides into editable skills, conditions a fixed vision-language model (VLM) agent on these skills during execution, and revises the skills from trajectory-level root-cause feedback without using benchmark scores. Across GUI control, open-ended gameplay, and strategic card play with six VLM backbones, MMG2Skill consistently outperforms vanilla baseline agents in every model-domain setting, achieving macro-average gains of +12.8 to +25.3 percentage points across backbones. Ablation studies show that directly prompting agents with raw guides can degrade performance, while both structured skill construction and trajectory-driven revision are necessary for the observed improvements. On success-inferable tasks, analyzer-based early stopping further prevents late-stage performance regressions and saves 25%-53% of attempts when the success signal is properly calibrated.

2606.01992 2026-06-02 cs.CV cs.AI cs.LG

A Structured Benchmark for Text-Guided Anomaly Detection: When Language Stops Conditioning the Decision

文本引导异常检测的结构化基准：当语言停止条件化决策时

Stefano Samele, Eugenio Lomurno, Teodora Jovanovic, Sanjay Shivakumar Manohar, Alberto Crivellaro, Matteo Matteucci

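Affiliations: Politecnico di Milano, AIRLab; S&H – Software & Hardware

AI summary: Proposes TGAD, a structured benchmark whose three scenarios progressively increase the functional role of language, to test how far text actually guides multimodal anomaly detection systems; it finds that language conditions current systems only superficially and that standard benchmarks overstate their capabilities.

AI Chinese abstract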

工业异常检测历来是单模态任务。最近的多模态视觉-语言模型产生了接受文本输入和图像的系统，并被呈现为支持文本引导的零样本和少样本检测。然而，这些方法使用继承自单模态基准的协议进行评估，这些协议保持文本条件不变，因此无法衡量语言是否条件化决策；报告的性能提升是否反映文本引导或强大的预训练视觉特征仍是开放问题。我们引入文本引导异常检测（TGAD），这是一个结构化基准，通过三个场景逐步增加语言的功能角色：MVTec AD上的受控提示敏感性设置；MVTec AD的组件标记扩展，要求模型将其评估限制在指定部件；以及新的组装面板数据集（APD），这是一个需要缺陷类型和组件位置知识的现实工业场景。我们评估每个范式的代表性模型：生成式大视觉-语言、无训练判别式和嵌入自适应判别式。在所有三个模型中，文本接口仅表面条件化决策：除非移除对象名词，否则提示内容被吸收（生成模型的I-AUROC从97.4降至82.6）；一旦指令部件外的缺陷被视为正常，组件级指令不约束决策（从90.3降至66.3）；当两者在APD上结合时，图像级判别崩溃至MVTec水平以下，一种情况低于随机水平（71.2、50.5、31.5）。这些结果表明，标准基准夸大了当前多模态异常检测系统的文本引导能力，并且此类协议是能够通过语言可靠控制以用于工业部署的模型的先决条件。

English abstract

Industrial anomaly detection has historically been a unimodal task. Recent multimodal vision-language models have produced systems that admit textual input alongside the image and are presented as enabling text-guided zero- and few-shot inspection. Yet these methods are evaluated with protocols inherited from unimodal benchmarks that hold the textual condition constant and therefore cannot measure whether language conditions the decision; whether reported gains reflect text guidance or strong pretrained visual features remains open. We introduce Text-Guided Anomaly Detection (TGAD), a structured benchmark that progressively increases the functional role of language across three scenarios: a controlled prompt-sensitivity setting on MVTec AD; a component-tagged extension of MVTec AD that requires the model to restrict its assessment to an instructed part; and the new Assembled Panel Dataset (APD), a realistic industrial setting that requires both defect-type and component-location knowledge. We evaluate one representative model per paradigm: generative large vision-language, training-free discriminative, and embedding-adaptive discriminative. In all three, the textual interface conditions the decision only superficially: prompt content is absorbed unless the object noun is removed (the generative model's I-AUROC drops from 97.4 to 82.6); component-level instructions do not constrain the decision once defects outside the instructed part are admitted as normal (from 90.3 to 66.3); and when both combine on APD, image-level discrimination collapses below the MVTec level, in one case below chance (71.2, 50.5, 31.5). These results suggest that standard benchmarks overstate the text-guided capabilities of current multimodal anomaly detection systems, and that a protocol of this kind is a prerequisite for models that can be reliably controlled through language for industrial deployment.

2606.01991 2026-06-02 cs.AI cs.CL cs.CY

SafeMCP: Proactive Power Regulation for LLM Agent Defense via Environment-Grounded Look-Ahead Reasoning

SafeMCP：基于环境接地前瞻推理的LLM智能体防御主动权力调节

Lichao Wang, Zhaoxing Ren, Tianzhuo Yang, Jiaming Ji, Chi Harold Liu, Yaodong Yang, Juntao Dai

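Affiliations: Beijing Institute of Technology; Beijing Academy of Artificial Intelligence; Institute for Artificial Intelligence, Peking University

AI summary: To counter the power-seeking risk that LLM agents face as their action spaces expand, proposes SafeMCP, a server-side defense plugin that uses an internal world model for look-ahead reasoning to provide a two-tier defense, proactive tool filtering and immediate intervention, which mitigates risk while preserving agent utility.

Comments: Accepted to the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026), Main Conference

AI Chinese abstract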

随着大语言模型（LLM）智能体越来越多地利用模型上下文协议（MCP）在复杂环境中运行，其动作空间的扩展赋予了智能体不安全的能力，并凸显了权力寻求的风险。虽然广阔的动作空间和更大的环境影响对于任务完成至关重要，但它们也创造了一个脆弱的风险面，其中微小的错误或幻觉会被放大为灾难性故障。为此，我们提出了SafeMCP，一种服务器端防御插件，通过关于未来安全风险的预测推理来约束工具获取。SafeMCP利用内部世界模型进行前瞻推理，实现两级防御：主动工具过滤以限制危险的权力扩张，以及即时干预作为故障安全机制。为了训练SafeMCP，我们引入了一个三阶段流程，包括环境动态接地、安全策略初始化和具有双重可验证奖励的强化学习（RL）。在PowerSeeking Bench、ToolEmu和AgentHarm上的实验表明，SafeMCP实现了安全平衡，在有效缓解风险的同时保持了智能体的效用。

English abstract

As Large Language Model (LLM) agents increasingly leverage the Model Context Protocol (MCP) to operate in complex environments, the expansion of their action spaces offers agents unsafe capabilities and underscores the risk of power-seeking. While broad action space and greater environment influence are essential for task fulfillment, they create a fragile risk surface where minor errors or hallucinations are magnified into catastrophic failures. In response, we propose SafeMCP, a server-side defense plugin that constrains tool acquisition via predictive reasoning regarding future safety risks. SafeMCP utilizes an internal world model for look-ahead reasoning to implement a two-tier defense: proactive tool filtering to constrain hazardous power expansion and immediate intervention as a fail-safe. To train SafeMCP, we introduce a three-stage pipeline comprising environmental dynamic grounding, safe policy initialization, and reinforcement learning (RL) with dual verifiable rewards. Experiments on PowerSeeking Bench, ToolEmu, and AgentHarm show that SafeMCP achieves a safe equilibrium, effectively mitigating risks while preserving agent utility.

2606.01985 2026-06-02 cs.CV

MT-EditFlow: Reinforcement Learning for Multi-Turn Image Editing with Flow Matching

MT-EditFlow：基于流匹配的多轮图像编辑强化学习

Jiahui Huang, Yasi Zhang, Tianyu Chen, Shu Wang, Jianwen Xie, Oscar Leong, Mingyuan Zhou, Nanzhu Wang, Ying Nian Wu

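Affiliations: Apple; University of California, Los Angeles; University of Texas at Austin; Lambda, Inc

AI summary: Proposes MT-EditFlow, a flow-matching reinforcement learning framework that optimizes reward signals for multi-turn image editing, addressing the failures and error propagation of single-turn editing models in multi-turn interaction and markedly improving multi-turn performance.

AI Chinese abstract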

近年来，基于指令的图像编辑取得了重大突破，模型现在能够处理现实世界中的编辑需求，满足日常用户的实用性要求。然而，主要为单轮编辑训练的编辑模型在多轮编辑中常常失败——在这种自然的交互设置中，用户基于模型自身之前的输出迭代地细化图像。这种失败源于“全有或全无”的要求，即单次失败会破坏整个序列，以及误差传播，即暴露偏差导致编辑误差累积。为了解决这些挑战，我们引入了MT-EditFlow，一个流匹配强化学习框架，旨在优化序列图像编辑的奖励信号。MT-EditFlow整合了多轮视角和多奖励公式，为基于GRPO和NFT的强化学习方法提供了统一的结构。我们通过研究有效的轮次级聚合评分策略、权衡奖励偏差与方差的VLM推理模式以及防止奖励破解的优势融合级别，系统地分析和优化了奖励信号。我们的发现表明，将聚合优势广播到整个编辑轨迹中，有效地弥合了局部规划与全局多轮任务成功之间的差距。大量实验表明，MT-EditFlow在多种基础模型上显著提升了性能。值得注意的是，它在FLUX.1-Kontext-dev上将第3轮整体性能提升了6.85分，超越了Qwen-Image-Edit等最先进的开源模型。通过保持高边际成功率和减少暴露偏差，MT-EditFlow为视觉内容创作中更可靠、更自然的人机协作奠定了基础。

English abstract

Recent breakthroughs in instruction-based image editing have captured significant attention, as models are now capable of handling real-world editing demands with the practicality required by everyday users. However, editing models trained primarily for single-turn edits often break down in multi-turn editing--the natural interactive setting where a user iteratively refines an image based on the model's own previous outputs. This failure stems from the all-or-nothing requirement, where a single failed turn compromises the entire sequence, and error propagation, where exposure bias leads to compounding editing errors. To address these challenges, we introduce MT-EditFlow, a flow-matching reinforcement learning framework designed to optimize reward signals for sequential image editing. MT-EditFlow integrates a multi-turn perspective with a multi-reward formulation to provide a unified structure applicable to both GRPO and NFT-based reinforcement learning methods. We systematically analyze and optimize the reward signal by investigating effective scoring strategies for turn-level aggregation, VLM reasoning modes to trade off reward bias and variance, and advantage fusion levels to prevent reward hacking. Our findings reveal that broadcasting the aggregated advantage across the entire editing trajectory effectively bridges the gap between local planning and global multi-turn task success. Extensive experiments demonstrate that MT-EditFlow significantly improves performance across diverse base models. Notably, it boosts FLUX.1-Kontext-dev by 6.85 points in turn-3 overall performance, surpassing state-of-the-art open-source models such as Qwen-Image-Edit. By maintaining high marginal success rates and reducing exposure bias, MT-EditFlow provides a foundation for more reliable and natural human-AI collaboration in visual content creation.

2606.01975 2026-06-02 cs.AI cs.SE

Algorithmic algorithm development with LLMs: A Case Study on LLM-Usage for Contraction Order Optimization in Tensor Networks

基于LLM的算法开发：以张量网络收缩顺序优化中LLM使用为例

Fabian Hoppe, Melven Röhrig-Zöllner, Philipp Knechtges

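Affiliations: German Aerospace Center (DLR), Institute of Software Technology, High-Performance Computing department

AI summary: Through a case study that applies OpenEvolve to contraction order optimization in tensor networks, examines LLM-based algorithm development, focusing on design factors such as LLM choice, evaluation metrics, and test instances; it highlights the potential of verification-guided evolutionary coding agents and the importance, and difficulty, of the human scientist's role in evaluation, verification, and interpretation.

Comments: Submitted to the proceedings of the deRSE26 conference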

2606.01973 2026-06-02 cs.LG cs.CV

A Closer Look at In-Distribution vs. Out-of-Distribution Accuracy for Open-Set Test-time Adaptation

开放集测试时自适应中分布内与分布外准确率的深入分析

Zefeng Li, Evan Shelhamer

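Affiliations: University of British Columbia and Vector Institute

AI summary: By benchmarking existing methods and proposing a new baseline, shows that current open-set test-time adaptation methods fall short in balancing in-distribution accuracy against out-of-distribution detection.

Comments: TMLR 2026

AI Chinese abstract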

开放集测试时自适应（TTA）在存在输入偏移和未知输出类别的情况下更新模型。尽管近期方法在提高已知类别的分布内（InD）准确率方面取得了进展，但它们准确检测分布外（OOD）未知类别的能力仍未得到充分探索。我们在小规模CIFAR-10-C和大规模ImageNet-C的标准损坏基准上，对鲁棒和开放集TTA方法（SAR、OSTTA、UniEnt和SoTTA）进行了基准测试。对于CIFAR-10-C，我们使用来自SVHN和CIFAR-100的OOD数据，分别对应其损坏形式SVHN-C和CIFAR-100-C。对于ImageNet-C，我们使用来自ImageNet-O和Textures的OOD数据，分别对应其损坏形式ImageNet-O-C和Textures-C。ImageNet-O更接近ImageNet，包含未知但相关的物体类别（如食物类的“蒜香面包”与“热狗”，基础设施类的“高速公路”与“水坝”），而Textures则远离ImageNet，包含非物体图案（如“裂纹”泥土、“多孔”海绵、“纹理”树叶）。我们评估了TTA方法在CIFAR-10-C和ImageNet-C上对InD与OOD识别的准确率和置信度。我们在CIFAR-10-C上验证了每种方法自身OOD检测技术的准确率。我们还在ImageNet-C上进行了评估，并报告了准确率和标准OOD检测指标。我们进一步考察了更现实的设置，其中OOD数据的比例和速率可以变化。为了探索InD识别与OOD拒绝之间的权衡，我们提出了一种新的基线，将softmax/多类输出替换为sigmoid/多标签输出。我们的分析首次表明，当前的开放集TTA方法难以平衡InD和OOD准确率，并且它们仅能不完全地过滤OOD数据以进行自身的自适应更新。

English abstract

Open-set test-time adaptation (TTA) updates models on new data in the presence of input shifts and unknown output classes. While recent methods have made progress on improving in-distribution (InD) accuracy for known classes, their ability to accurately detect out-of-distribution (OOD) unknown classes remains underexplored. We benchmark robust and open-set TTA methods (SAR, OSTTA, UniEnt, and SoTTA) on the standard corruption benchmarks of CIFAR-10-C at the small scale and ImageNet-C at the large scale. For CIFAR-10-C, we use OOD data from SVHN and CIFAR-100 in their respective corrupted forms of SVHN-C and CIFAR-100-C. For ImageNet-C, we use OOD data from ImageNet-O and Textures in their respective corrupted forms of ImageNet-O-C and Textures-C. ImageNet-O is nearer to ImageNet, as unknown but related object classes (like "garlic bread" vs. "hot dog" for food, or "highway" vs. "dam" for infrastructure), while Textures is farther from ImageNet, as non-object patterns (like "cracked" mud, "porous" sponge, "veined" leaves). We evaluate the accuracy and confidence of TTA methods for InD vs. OOD recognition on CIFAR-10-C and ImageNet-C. We verify the accuracy of each method's own OOD detection technique on CIFAR-10-C. We also evaluate on ImageNet-C and report both accuracy and standard OOD detection metrics. We further examine more realistic settings, in which the proportions and rates of OOD data can vary. To explore the trade-off between InD recognition and OOD rejection, we propose a new baseline that replaces softmax/multi-class output with sigmoid/multi-label output. Our analysis shows for the first time that current open-set TTA methods struggle to balance InD and OOD accuracy and that they only imperfectly filter OOD data for their own adaptation updates.
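
The softmax-versus-sigmoid contrast mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' baseline: the helper names, the rejection rule, and the 0.5 threshold are all assumptions made here for illustration. The point it shows is only that softmax scores must sum to one, whereas independent per-class sigmoid scores can all be low at once, which leaves room to reject an input.

```python
import math

def softmax(logits):
    """Multi-class scores: always sum to 1, so some class always 'wins'."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(logits):
    """Multi-label scores: each class is scored independently in (0, 1)."""
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]

def predict_or_reject(logits, threshold=0.5):
    """Return the top class index, or None if no sigmoid score reaches
    the (hypothetical) threshold, i.e. the input is treated as OOD."""
    scores = sigmoid(logits)
    best = max(range(len(scores)), key=scores.__getitem__)
    return best if scores[best] >= threshold else None
```

With uniformly low logits such as `[-3.0, -2.5, -3.2]`, `softmax` still assigns a sizeable share of probability to the top class, while `predict_or_reject` returns `None`.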

2606.01970 2026-06-02 cs.RO cs.MA cs.SY eess.SY

Market-Based Replanning for Safety-Critical UAV Swarms in Search and Rescue Missions

搜救任务中安全关键无人机群的基于市场的重规划

Luiz Giacomossi, Andrea Haglund, Claire Namatovu, Emily Zainali, Esaias Målqvist, Yonatan M. Beyene, Ivan Tomasic, Baran Çürüklü, Håkan Forsberg

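Affiliations: KTH Royal Institute of Technology; Swedish Defence Research Agency

AI summary: Proposes IRDS, a distributed coordination architecture that uses a reverse-auction market mechanism and a geometric consensus protocol to reallocate tasks autonomously when UAVs fail, maintaining a 93% mission success rate under 25% workforce degradation.

Comments: 6 pages, 4 figures, accepted at MIPRO 2026

AI Chinese abstract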

搜救任务中可靠自主无人机群需要能够容忍代理退化并维持操作的容错协调。本文介绍了智能重规划无人机群（IRDS），一种为资源受限环境设计的分布式协调架构。所提出的框架采用反向拍卖市场机制，其中代理基于距离加权成本函数竞标服务搜索区域，并结合几何共识协议进行目标验证。我们通过物理仿真（N=8个代理，8x8网格）评估该方法，并施加随机故障注入。结果表明，无人机群能够以相对于总任务持续时间较低的延迟自主重新分配来自故障代理的任务，在25%劳动力退化下保持93%的任务成功率。所提出的框架展示了一种稳健的、经过实证测试的空中机器人自愈协调方法。

English abstract

Reliable autonomous UAV swarms in Search and Rescue (SAR) missions require fault-tolerant coordination capable of sustaining operations despite agent degradation. This paper introduces the Intelligent Replanning Drone Swarm (IRDS), a distributed coordination architecture designed for resource-constrained environments. The proposed framework employs a Reverse-Auction market mechanism where agents bid to service search sectors based on a distance-weighted cost function, coupled with a geometric consensus protocol for target verification. We evaluate the approach through physics-based simulations (N=8 agents, 8x8 grid) subjected to stochastic fault injection. Results indicate that the swarm autonomously reallocates tasks from failed agents with low latency relative to the total mission duration, maintaining a mission success rate of 93% under 25% workforce degradation. The proposed framework demonstrates a robust, empirically tested method for self-healing aerial robotic coordination.
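
A reverse auction with a distance-weighted cost, and re-auctioning after a failure, could look roughly like the sketch below. It is an illustration under stated assumptions, not the IRDS implementation: the function names, the greedy sector order, and the load penalty of 2.0 per held sector are hypothetical, and the geometric consensus protocol is not modelled.

```python
import math

def assign_sectors(agents, sectors, load=None):
    """Run a greedy reverse auction: each sector goes to the lowest bidder.

    agents:  dict name -> (x, y) position of an operational UAV
    sectors: dict name -> (x, y) centre of a search sector
    load:    dict name -> number of sectors already held (optional)
    """
    load = {name: (load or {}).get(name, 0) for name in agents}
    allocation = {}
    for sector, centre in sorted(sectors.items()):
        # Hypothetical bid: travel distance plus a penalty per sector held.
        bids = {name: math.dist(pos, centre) + 2.0 * load[name]
                for name, pos in agents.items()}
        winner = min(bids, key=bids.get)  # reverse auction: lowest cost wins
        allocation[sector] = winner
        load[winner] += 1
    return allocation

def replan(agents, sectors, allocation, failed):
    """Re-auction only the failed agent's sectors among the survivors."""
    survivors = {n: p for n, p in agents.items() if n != failed}
    kept = {s: a for s, a in allocation.items() if a != failed}
    orphaned = {s: sectors[s] for s, a in allocation.items() if a == failed}
    load = {n: sum(1 for a in kept.values() if a == n) for n in survivors}
    kept.update(assign_sectors(survivors, orphaned, load))
    return kept
```

Re-auctioning only the orphaned sectors keeps the surviving agents' existing assignments untouched, which is one plausible way to keep replanning latency low; whether IRDS does exactly this is not stated in the abstract.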

2606.01967 2026-06-02 cs.CL

Training Prompt Matters: State-Adaptive Optimization for Robust Fine-Tuning

训练提示至关重要：面向鲁棒微调的状态自适应优化

Wenhang Shi, Yiren Chen, Shuqing Bian, Zhe Zhao, Jinhao Dong, Pengfei Hu, Wei Lu, Xiaoyong Du

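Affiliations: University of Science and Technology of China

AI summary: Proposes State-Adaptive Prompt Optimization (SAPO), a training strategy that turns the task formulation from a static input into a dynamic, state-adaptive variable, mitigating catastrophic forgetting and improving generalization, with substantial gains on multiple benchmarks.

AI Chinese abstract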

虽然提示工程在推理过程中对最大化大型语言模型（LLM）的能力至关重要，但提示在训练过程中的作用仍未得到充分探索。现有的微调范式通常将训练提示视为表面形式，假设语义等价的指令会产生相同的学习结果。然而，我们揭示这种等价性具有欺骗性：虽然释义后的提示通常会导致类似的任务内性能，但它们在灾难性遗忘和泛化方面会引发截然不同的跨任务影响。关键的是，这些影响在任务间呈正相关，表明存在始终产生更好性能的优越提示。此外，我们发现这些优越提示可以在学习之前通过任务损失稳健地识别。利用这些见解，我们引入了状态自适应提示优化（SAPO），这是一种轻量级但有效的训练策略，它将任务公式从静态输入转变为动态的、状态自适应的变量。在多种基准上的全面实验证实了其有效性，它显著减轻了遗忘，同时提高了泛化能力，相比于最先进的方法取得了显著的性能提升。这些结果提供了关于训练提示如何塑造学习动态的见解，并为鲁棒微调提供了实用的方法。我们的代码可在 https://github.com/Eric8932/SAPO 获取。

English abstract

While prompt engineering is instrumental in maximizing the capabilities of Large Language Models (LLMs) during inference, the role of prompts during training remains critically underexplored. Prevailing fine-tuning paradigms typically treat training prompts as mere surface forms, assuming that semantically equivalent instructions yield identical learning outcomes. However, we reveal that this equivalence is deceptive: while paraphrased prompts often lead to comparable in-task performance, they induce drastically different cross-task impacts regarding catastrophic forgetting and generalization. Crucially, these impacts are positively correlated across tasks, indicating the existence of superior prompts that consistently yield better performance. Furthermore, we discover that these superior prompts can be robustly identified by task loss prior to learning. Leveraging these insights, we introduce State-Adaptive Prompt Optimization (SAPO), a lightweight yet effective training strategy that shifts task formulation from a static input to a dynamic, state-adaptive variable. Comprehensive experiments on diverse benchmarks confirm its effectiveness, which significantly mitigates forgetting while improving generalization, achieving substantial performance gains over state-of-the-art methods. These results provide insights into how training prompts shape learning dynamics and offer a practical recipe for robust fine-tuning. Our code is available at https://github.com/Eric8932/SAPO.

2605.04948 2026-06-02 cs.CL

Adapting Large Language Models to a Low-Resource Agglutinative Language: A Comparative Study of LoRA and QLoRA for Bashkir

将大型语言模型适配到低资源黏着语：LoRA与QLoRA在巴什基尔语上的比较研究

Mullosharaf K. Arabov, Svetlana S. Khaybullina

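Affiliations: Institute of Computational Mathematics and Information Technologies, Kazan Federal University

AI summary: Compares two parameter-efficient fine-tuning methods, LoRA and QLoRA, for adapting LLMs to Bashkir, a low-resource agglutinative language, and finds that QLoRA on 7B-scale models offers an effective balance between quality and computational cost.

Comments: Accepted to CLIB 2026

AI Chinese abstract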

本文对参数高效微调方法（包括LoRA和QLoRA）在将大型语言模型适配到巴什基尔语（突厥语族的一种低资源黏着语）任务上进行了比较研究。实验在包含71k文档（4690万个token）的巴什基尔语文本语料库上进行，使用了多种架构的模型：DistilGPT2、GPT-2（base、medium）、Phi-2、Qwen2.5-7B、DeepSeek-7B和Mistral-7B。为提高结果可靠性，每种配置使用三种不同的随机种子进行训练。测试集上最低困惑度由全微调的GPT-2 medium获得（3.34）。同时，应用于Mistral-7B（3.79）和Phi-2（3.81）的QLoRA在可训练参数减少40倍以上的情况下达到了相当的质量。然而，我们也观察到某些架构使用PEFT时质量显著下降的情况（例如，DeepSeek-7B，秩为8，困惑度=129.55），这表明结果关键取决于基础模型及其分词器的选择。此外，基于巴什基尔语提示的生成文本定性分析显示，具有最佳困惑度的模型不一定产生最连贯的输出：QLoRA微调的模型生成了单语巴什基尔语续写，而具有最低困惑度的全微调模型则频繁切换到英语。结果表明，对于巴什基尔语，7B规模模型上的QLoRA在质量和计算成本之间提供了有效的折中。为确保可重复性，开放数据、代码和训练好的适配器将在论文被接收后发布。

English abstract

This paper presents a comparative study of parameter-efficient fine-tuning (PEFT) methods, including LoRA and QLoRA, applied to the task of adapting large language models to the Bashkir language, a low-resource agglutinative language of the Turkic family. Experimental evaluation is conducted on a Bashkir text corpus of 71k documents (46.9M tokens) using models of various architectures: DistilGPT2, GPT-2 (base, medium), Phi-2, Qwen2.5-7B, DeepSeek-7B, and Mistral-7B. To improve the reliability of results, each configuration was trained with three different random seeds. The lowest perplexity on the test set was obtained for GPT-2 medium with full fine-tuning (3.34). Meanwhile, QLoRA applied to Mistral-7B (3.79) and Phi-2 (3.81) achieved comparable quality with over 40 times fewer trainable parameters. However, we also observed cases of significant quality degradation when using PEFT for certain architectures (e.g., DeepSeek-7B with rank 8, perplexity = 129.55), indicating that the outcome depends critically on the choice of the base model and its tokenizer. Additionally, a qualitative analysis of generated texts based on Bashkir prompts revealed that models with the best perplexity do not necessarily produce the most coherent outputs: QLoRA-tuned models generated monolingual Bashkir continuations, whereas the fully fine-tuned model with the lowest perplexity frequently switched to English. The results suggest that QLoRA on 7B-scale models offers an effective compromise between quality and computational cost for Bashkir. To ensure reproducibility, open data, code, and trained adapters will be released upon acceptance.
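
For reference, the perplexity figures quoted above are presumably the standard token-level quantity, written out here under that assumption. The symbols are introduced for illustration: $N$ is the number of test-set tokens under the model's own tokenizer and $p_\theta$ is the fine-tuned model.

```latex
\mathrm{PPL} = \exp\left( -\frac{1}{N} \sum_{i=1}^{N} \log p_\theta\left(x_i \mid x_{<i}\right) \right)
```

Because $N$ depends on the tokenizer, perplexities from models with different tokenizers are not directly comparable under this definition, which fits the abstract's remark that the outcome depends on the base model and its tokenizer.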

2605.02640 2026-06-02 cs.AI

Trustworthy AI Suffers from Invariance Conflicts and Causality is The Solution

可信人工智能面临不变性冲突，因果性是解决方案

Ruta Binkyte, Ivaxi Sheth, Zhijing Jin, Mohammad Havaei, Bernhard Schölkopf, Mario Fritz

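Affiliations: Max Planck Institute for Informatics

AI summary: Reinterpreting trustworthy AI objectives as incompatible invariance requirements under changes to the data-generating process, argues that causality is a necessary framework for understanding and balancing the trade-offs between performance and multiple trustworthiness objectives.

Journal ref: Proceedings of the 43rd International Conference on Machine Learning, Seoul, South Korea. PMLR 306, 2026

AI Chinese abstract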

随着人工智能（包括机器学习模型和基础模型）在高风险领域的部署日益增多，确保其可信度已成为一个核心挑战。然而，可信人工智能的核心目标，如公平性、鲁棒性、隐私性和可解释性，很难同时实现，尤其是在保持效用的同时。这篇立场论文认为，因果性对于理解和平衡性能与可信人工智能多个目标之间的权衡是必要的。我们将可信人工智能的权衡重新解释为数据生成过程不同变化下的不相容不变性要求，从而为我们的论点奠定基础。然后，我们通过文献中的案例研究和风格化的合成数据模拟来说明这一论点，表明因果性提供了一个统一的框架，用于理解可信人工智能中的权衡如何产生，以及如何通过选择性不变性来缓解或解决这些权衡。这一视角既适用于经典机器学习模型，也适用于大规模基础模型。最后，我们概述了利用因果性构建既可信又高性能的人工智能所面临的开放挑战和机遇。

English abstract

As artificial intelligence (AI), including machine learning (ML) models and foundation models (FMs), are increasingly deployed in high-stakes domains, ensuring their trustworthiness has become a central challenge. However, the core trustworthy AI objectives, such as fairness, robustness, privacy, and explainability, are hard to achieve simultaneously, especially while preserving utility. This position paper argues that causality is necessary to understand and balance trade-offs in performance and multiple objectives of trustworthy AI. We ground our arguments in re-interpreting trustworthy AI trade-offs as incompatible invariance requirements under different changes to the data-generating process. We then illustrate this argument through case-study analyses from the literature and a stylized synthetic-data simulation, showing that causality provides a unifying framework for understanding how trade-offs in trustworthy AI arise and how they can be softened or resolved through selective invariance. This perspective applies to both classical ML models and large-scale FMs. Finally, we outline open challenges and opportunities for using causality to build both trustworthy and high-performing AI.

2605.02270 2026-06-02 cs.CL

A Systematic Benchmark of Machine Transliteration Models for the Tajik-Farsi Language Pair: A Comparative Study from Rule-Based to Transformer Architectures

塔吉克-波斯语对机器音译模型的系统基准测试：从基于规则到Transformer架构的比较研究

Mullosharaf K. Arabov

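Affiliations: Institute of Computational Mathematics and Information Technologies; Kazan Federal University

AI summary: Builds a multi-source parallel corpus and systematically compares six classes of models, from rule-based to Transformer, finding that byte-level ByT5 performs best on Tajik-Farsi transliteration (chrF++ 87.4 / 80.1), while the model using subword tokenization fails completely.

Comments: Accepted to CLIB 2026

AI Chinese abstract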

本文首次对塔吉克语（西里尔字母）和波斯语（阿拉伯字母）之间音译的现代机器学习架构进行了全面比较分析。一个关键贡献是创建并验证了一个独特的平行语料库，该语料库汇集了多个异构来源，包括众包项目、词典对、《列王纪》平行文本、外交文章、《玛斯纳维》文本、官方术语列表和音译对应关系。初始数据集包含328,253个句子对；通过分层随机抽样形成了40,000个句对的代表性子集。实验比较了六类模型：基于规则的基线、带注意力的LSTM、字符级Transformer、G2P Transformer（从头训练）、预训练多语言模型（mBART、带LoRA的mT5）以及字节级ByT5。结果表明ByT5具有压倒性优势（塔吉克语到波斯语chrF++ 87.4，反向80.1）。尽管数据有限，G2P Transformer显著优于mBART（72.3 vs. 62.2 chrF++）。使用子词分词（mT5）的模型完全失败（chrF++低于18.5）。研究结果表明，对于塔吉克-波斯语对的准确音译，在字节或字符级别操作的架构明确优于依赖子词分词的传统多语言Seq2Seq模型。

English abstract

This paper presents the first comprehensive comparative analysis of modern machine learning architectures for transliteration between Tajik (Cyrillic script) and Persian (Arabic script). A key contribution is the creation and validation of a unique parallel corpus aggregated from multiple heterogeneous sources, including crowdsourced projects, lexicographic pairs, parallel texts of "Shahnameh", diplomatic articles, texts of "Masnavi-i Ma'navi", official terminology lists, and transliterated correspondences. The initial dataset comprised 328,253 sentence pairs; a representative subset of 40,000 pairs was formed using stratified random sampling. The experiment compared six classes of models: rule-based baseline, LSTM with attention, character-level Transformer, G2P Transformer (trained from scratch), pre-trained multilingual models (mBART, mT5 with LoRA), and byte-level ByT5. Results demonstrate the overwhelming superiority of ByT5 (chrF++ 87.4 for Tajik to Farsi, 80.1 for reverse). The G2P Transformer significantly outperformed mBART (72.3 vs. 62.2 chrF++) despite limited data. Models using subword tokenization (mT5) failed completely (chrF++ less than 18.5). The findings demonstrate that for accurate transliteration of the Tajik-Farsi pair, architectures operating at the byte or character level are unequivocally more effective than traditional multilingual Seq2Seq models relying on subword tokenization.

2605.02122 2026-06-02 cs.LG cs.AI

STABLEVAL: Disagreement-Aware and Stable Evaluation of AI Systems

STABLEVAL: 面向AI系统的分歧感知与稳定评估

Akash Bonagiri, Gerard Janno Anderias, Saee Patil, Angelina Lai, Devang Borkar, Gezheng Kang, Ishant Gandhi, Setareh Rafatirad, Houman Homayoun

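Affiliations: University of California, Berkeley

AI summary: To address the unstable rankings that majority voting produces under annotator disagreement, proposes STABLEVAL, a framework that models latent correctness and annotator confusion patterns to deliver stable, uncertainty-aware system evaluation.

AI Chinese abstract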

人类评估仍然是评估现代AI系统的主要标准，然而标注者的分歧、偏见和变异性使得在标准多数投票聚合下系统排名变得脆弱。多数投票忽略了标注者可靠性和项目级别的模糊性，往往在标注者子集之间产生不稳定的比较。我们引入了STABLEVAL，一个分歧感知的评估框架，该框架对潜在项目正确性和标注者特定的混淆模式进行建模，以产生后验期望项目得分和校准的智能体级别分数。与Dawid-Skene等标签去噪方法不同，STABLEVAL明确设计用于稳定和不确定性感知的系统评估，而不是硬标签恢复。我们将排名稳定性形式化为首要评估目标，并分析聚合方法如何保留或扭曲底层标注者行为。在受控的合成实验和多个真实世界人工标注基准上，多数投票在标注者异质性和对抗性噪声下表现出增加的得分误差和排名不稳定性，而STABLEVAL产生了更稳定和统计上更合理的系统排名。这些结果表明，对分歧进行建模对于稳健和可复现的AI评估至关重要。

English abstract

Human evaluation remains the primary standard for assessing modern AI systems, yet annotator disagreement, bias, and variability make system rankings fragile under standard majority vote aggregation. Majority vote discards annotator reliability and item-level ambiguity, often yielding unstable comparisons across annotator subsets. We introduce STABLEVAL, a disagreement-aware evaluation framework that models latent item correctness and annotator-specific confusion patterns to produce posterior expected item credit and calibrated agent-level scores. Unlike label-denoising approaches such as Dawid-Skene, STABLEVAL is explicitly designed for stable and uncertainty-aware system evaluation rather than hard label recovery. We formalize ranking stability as a first-class evaluation objective and analyze how aggregation methods preserve or distort underlying annotator behavior. Across controlled synthetic experiments and multiple real-world human-annotated benchmarks, majority vote exhibits increasing score error and ranking instability under annotator heterogeneity and adversarial noise, while STABLEVAL yields more stable and statistically grounded system rankings. These results demonstrate that modeling disagreement is essential for robust and reproducible AI evaluation.

2606.01964 2026-06-02 cs.CL

Eyettention II: A Dual-Sequence Architecture for Modeling Fixation Location, Within-Word Landing Position, and Fixation Duration in Reading

Shuwen Deng, Cui Ding, David R. Reich, Paul Prasse, Lena A. Jäger

Affiliations: Department of Computer Science, University of Potsdam; Department of Computational Linguistics, University of Zurich; Department of Informatics, University of Zurich

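AI summary: Proposes Eyettention II, a lightweight, end-to-end trained deep-learning model whose dual-sequence architecture generates complete scanpaths comprising fixation location, within-word landing position, and fixation duration. It outperforms existing models in scanpath prediction and captures key psycholinguistic phenomena.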

Abstract

The way our eyes move while reading provides valuable insights into both the reader's cognitive processes and the properties of the text. In particular, eye-tracking-while-reading data has been shown to be highly beneficial in various technological applications, such as enhancing and interpreting language models and inferring a reader's characteristics. However, these applications often rely on large-scale, data-driven models, which demand extensive eye-tracking datasets that are challenging to obtain due to the resource-intensive nature of data collection. To address the challenge of data scarcity, we develop Eyettention II, an end-to-end trained deep-learning model capable of generating realistic scanpaths consisting of a complete set of fixation attributes in chronological order, including fixation location, within-word landing position, and fixation duration. Our model is lightweight, efficiently trainable on limited GPU resources, and closely aligned with cognitive theories. We demonstrate that Eyettention II surpasses state-of-the-art models in scanpath prediction and mirrors human-like gaze behavior by capturing key psycholinguistic phenomena. With its robust performance, Eyettention II holds the potential to drive advancements in natural language processing, facilitate piloting the materials of psycholinguistic experiments, and uncover new insights beyond what is explicitly encoded in theoretical cognitive models.
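As a rough illustration of the output the abstract describes, a scanpath can be represented as a chronologically ordered list of fixations, each carrying the three attributes. The sketch below is a hypothetical data structure, not the Eyettention II code. The class and field names, the character-offset encoding of landing position, the millisecond unit, and the helper `saccade_lengths` are all assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Fixation:
    word_index: int        # fixation location: index of the fixated word
    landing_position: int  # within-word landing position (assumed: character offset)
    duration_ms: float     # fixation duration (assumed unit: milliseconds)


def saccade_lengths(scanpath: List[Fixation]) -> List[int]:
    """Word-level jump between consecutive fixations; negative values are regressions."""
    return [b.word_index - a.word_index for a, b in zip(scanpath, scanpath[1:])]


def total_duration_ms(scanpath: List[Fixation]) -> float:
    """Summed fixation time over the whole scanpath."""
    return sum(f.duration_ms for f in scanpath)


# Hypothetical scanpath: forward reading, one skipped word, then a regression.
scanpath = [
    Fixation(0, 2, 210.0),
    Fixation(1, 1, 180.0),
    Fixation(3, 0, 250.0),
    Fixation(2, 3, 160.0),
]
print(saccade_lengths(scanpath))
print(total_duration_ms(scanpath))
```

Word-level saccade length is only one illustrative quantity that can be derived from such a structure. The abstract does not name the specific phenomena the model is evaluated on.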
