arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.30589 2026-06-01 cs.CL cs.AI

ImmigrationQA: A Source-Grounded Dataset and Small-Model Adaptation for U.S. Immigration Law

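Nazarii Shportun

Affiliations: Independent Researcher

AI summary: The paper builds ImmigrationQA, a source-grounded question-answering dataset (17,058 pairs across 13 immigration subdomains), and fine-tunes Llama 3.2 3B Instruct on it with parameter-efficient LoRA. The fine-tuned model improves markedly in procedural subdomains but remains weak at complex legal reasoning.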

Comments 12 pages, 4 tables. Dataset (17,058 QA pairs), fine-tuned model, and code are publicly released

Abstract

U.S. immigration law spans thousands of pages of official policy, federal regulations, and procedural guidance that change frequently and carry high stakes for petitioners who lack legal representation. We describe the construction of ImmigrationQA, a source-grounded question-answering dataset of 17,058 pairs across 13 immigration subdomains, and the fine-tuning of a Llama 3.2 3B Instruct model on that dataset using parameter-efficient LoRA. The corpus was assembled from 11 primary and secondary sources -- including the USCIS Policy Manual, 8 CFR, BIA precedent decisions, and community Q&A -- yielding 10,056 validated canonical documents and 18,308 text chunks. Structured QA pairs were generated from these chunks using Claude Sonnet 4.6 via five mode-specific prompts, with 22 pairs rejected for insufficient source-span overlap. The fine-tuned model was evaluated against a held-out split of 993 pairs using LLM-as-judge scoring on a 101-example stratified sample. The fine-tuned model scored a mean of 1.08/3.0 (16.8% fully correct; 101-example stratified eval) versus the Llama 3 8B base model at 0.85/3.0 (4% fully correct), a relative improvement of 27% in mean score; a zero-shot Claude Sonnet baseline scored 1.52/3.0 (25% fully correct). The fine-tuned model shows concentrated improvement in procedural subdomains (travel documents, adjustment of status, nonimmigrant visas) while remaining weak on complex legal reasoning and time-sensitive statistics. The full pipeline ran for approximately $29 in cloud compute. All artifacts -- dataset, model, code, and prompt templates -- are publicly released. The system is not a substitute for legal counsel and does not reflect regulatory changes after the corpus crawl date.
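The 27% figure above is a relative gain in mean judge score, and it can be checked from the two means quoted in the abstract. A minimal sketch; the function name is illustrative and not part of the released code:

```python
def relative_improvement(new: float, baseline: float) -> float:
    """Relative change of `new` over `baseline`, as a fraction."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (new - baseline) / baseline


# Mean LLM-as-judge scores quoted in the abstract (out of 3.0).
fine_tuned_mean = 1.08
base_mean = 0.85

gain = relative_improvement(fine_tuned_mean, base_mean)
print(f"{gain:.1%}")  # prints 27.1%
```

The fully-correct rates (16.8% versus 4%) are a separate measure and are not part of this calculation.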

2605.30587 2026-06-01 cs.CV

ReGuLaR: Relation-Grounded Latent Reasoning for Large Vision-Language Models

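Zihu Wang, Karthik Somayaji N. S, Peng Li

Affiliations: University of California, Santa Barbara

AI summary: Proposes the ReGuLaR framework, which uses a training-time ReGFormer to explicitly ground latent reasoning in the objects and relations of the visual evidence, and reports state-of-the-art performance across multiple benchmarks.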

Abstract

Chain-of-thought (CoT) reasoning has significantly improved the reasoning ability of large vision-language models (LVLMs) by verbalizing intermediate reasoning steps in natural language. However, such discrete textual rationales are often insufficient for encoding continuous visual evidence. Recent work addresses this limitation by moving reasoning into continuous latent space. Despite promising progress, existing methods leave latent reasoning insufficiently connected to the compositional and relational structure of visual evidence. To address this gap, we introduce ReGuLaR, a relation-grounded latent reasoning framework that explicitly grounds latent states in this critical yet overlooked visual evidence. ReGuLaR uses a training-time ReGFormer to focus latent reasoning on question-relevant objects and inter-object relations, while at inference time the model reasons and generates answers without invoking the ReGFormer. To support training ReGuLaR, we construct RGROUNDING-351K, a real-world vision-language dataset annotated with key object bounding boxes and inter-object relations. Extensive experiments across diverse benchmarks show that ReGuLaR consistently outperforms existing approaches and achieves state-of-the-art performance. We include our code in the submission and will release the code and training data publicly upon acceptance.

2605.30585 2026-06-01 cs.LG cs.AI cs.CE

Benchmarking Machine Learning Uncertainty Quantification Methodologies for Predicting Turbine Gas Temperature Degradation

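Jostein Barry-Straume, Changmin Son, Adrian Sandu, Gavan Burke, Rekha Sundararajan, Andrew Rimell, James G. Steinrock

Affiliations: Computational Science Laboratory, Department of Computer Science, Virginia Tech

AI summary: Studies five methods for constructing prediction intervals (the Delta method, Bayesian Monte Carlo Dropout, the Bootstrap method, Lower-Upper Bound Estimation, and Mean-Variance Estimation) and evaluates, within a unified experimental framework, how well each captures the uncertainty of neural network predictions of turbine gas temperature. Reliability, sharpness, and their trade-offs are compared using Coverage Probability, Normalized Mean Prediction Interval Width, and the Coverage Width-based Criterion, yielding practical guidance for selecting and tuning prediction interval methods in engine health management.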

Abstract

Effective prognostics and health management of modern engines relies on accurate turbine gas temperature predictions and robust uncertainty quantification to ensure reliability and safety. This paper investigates five major approaches for constructing prediction intervals -- namely the Delta method, Bayesian Monte Carlo Dropout, Bootstrap method, Lower-Upper Bound Estimation, and Mean-Variance Estimation -- as a means of capturing the uncertainty in neural network predictions of turbine gas temperature. Each approach is implemented within a unified experimental framework that employs cross-validation for hyperparameter selection, repeated train-test splits for performance robustness, and multiple metrics to evaluate both the accuracy and tightness of the intervals. In particular, Coverage Probability, Normalized Mean Prediction Interval Width, and the Coverage Width-based Criterion are measured to comprehensively assess each method's reliability and sharpness. Experiments conducted on a representative turbine gas temperature dataset reveal distinct trade-offs among the five methods in terms of interval coverage, width, and stability. These findings provide a practical guide for selecting and tuning prediction interval methods in engine health management and prognostics, ensuring both interpretability and precision in real-world applications.
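The three interval metrics named above can be sketched in a few lines. The abstract does not give formulas, so the definitions below are assumptions based on common usage: coverage as the fraction of targets inside their interval, width normalized by the target range, and one frequently used exponential-penalty form of the Coverage Width-based Criterion with hypothetical parameters `nominal` and `eta`.

```python
import math
from typing import Sequence


def coverage_probability(y: Sequence[float], lower: Sequence[float],
                         upper: Sequence[float]) -> float:
    """Fraction of targets that fall inside their prediction interval."""
    hits = sum(1 for t, lo, hi in zip(y, lower, upper) if lo <= t <= hi)
    return hits / len(y)


def normalized_mean_width(y: Sequence[float], lower: Sequence[float],
                          upper: Sequence[float]) -> float:
    """Mean interval width divided by the range of the targets."""
    target_range = max(y) - min(y)
    mean_width = sum(hi - lo for lo, hi in zip(lower, upper)) / len(y)
    return mean_width / target_range


def coverage_width_criterion(picp: float, nmpiw: float,
                             nominal: float = 0.95, eta: float = 50.0) -> float:
    """Assumed CWC form: inflate the width when coverage misses `nominal`."""
    penalty = math.exp(-eta * (picp - nominal)) if picp < nominal else 0.0
    return nmpiw * (1.0 + penalty)


# Made-up temperatures and intervals, purely for illustration.
y_true = [10.0, 12.0, 14.0, 16.0, 20.0]
pi_low = [9.0, 11.0, 15.0, 15.0, 18.0]
pi_high = [11.0, 13.0, 17.0, 17.0, 22.0]

picp = coverage_probability(y_true, pi_low, pi_high)
nmpiw = normalized_mean_width(y_true, pi_low, pi_high)
cwc = coverage_width_criterion(picp, nmpiw)
print(picp, nmpiw, cwc)
```

Under this form, an interval set that meets the nominal coverage is scored by its width alone, while one that falls short is penalized heavily, which is what makes the criterion a joint measure of reliability and sharpness.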

2605.30583 2026-06-01 cs.RO cs.PF

Caspar: CUDA Accelerator for Symbolic Programming with Adaptive Reordering

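Emil Martens, Aaron Miller, Matias Varnum, Annette Stahl

Affiliations: Norwegian University of Science and Technology; Skydio

AI summary: Proposes Caspar, a library that automatically generates optimized CUDA kernels, bridging Python symbolic expressions and high-performance GPU runtimes, and reports 5 to 20 times speed-ups on the large-scale BAL bundle adjustment dataset.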

Comments Accepted at ICRA 2026

Abstract

We present Caspar, a library that makes the power of modern GPUs more accessible in robotics and provides a state-of-the-art nonlinear GPU solver that can be applied to a wide range of different optimization problems. Caspar bridges the gap between expressive symbolic programming in Python and high-performance GPU runtimes in C++ by automatically generating optimized CUDA kernels from symbolic expressions. Building on the SymForce library, users can easily define and combine symbolic expressions, including Lie group operations, to generate custom CUDA kernels. To use Caspar as a solver, users need only define the symbolic residual functions; Caspar then uses symbolic differentiation to generate the necessary GPU kernels and interfaces to perform nonlinear optimization. In this paper, we present the core components of Caspar and showcase its performance by performing bundle adjustment on the Bundle Adjustment in the Large (BAL) dataset. We benchmark Caspar against other state-of-the-art bundle adjusters and show that it is 5 to 20 times faster than the best alternative, requires less memory, and achieves similar accuracy. This illustrates the benefit of our symbolic GPU programming approach. Caspar is released as part of SymForce and is freely available at https://github.com/symforce-org/symforce

2605.30582 2026-06-01 cs.CL

AI for Monitoring and Classifying Data Used in Research Literature

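Rafael Macalaba, Aivin V. Solatorio

Affiliations: World Bank; Office of the World Bank Group; Development Data Group

AI summary: Proposes a GLiNER-based multitask framework that combines synthetic data generation with LLM-based revalidation to extract dataset mentions, identify relations, and classify usage context in research literature, addressing the label scarcity and inconsistent citation practices that hinder dataset-usage monitoring.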

Abstract

While platforms like Google Scholar and Semantic Scholar track citations for academic papers, no comparable infrastructure exists for monitoring dataset usage in research literature, leaving the landscape of data use largely opaque. Addressing this gap is critical for transparency, reproducibility, and monitoring of impact, yet progress is hindered by inconsistent citation practices, scarce labeled data, and ambiguous references to datasets in the wild. Traditional NLP approaches struggle with these challenges, motivating the shift toward more adaptive, semantically rich models. Building on prior work using LLMs for data mention detection and synthetic data for bootstrapping training, this paper presents an updated methodology for scalable dataset monitoring. We introduce a multitask GLiNER-based framework that jointly performs dataset mention extraction, relation identification, and usage-context classification. To address label scarcity, the pipeline leverages synthetic data generation to produce training examples and LLM-based revalidation to filter incorrect mentions and enforce labeling consistency, together improving reliability, coverage, and output consistency across the training pipeline. This work advances the development of open-source tools for monitoring data use in research literature, contributing to the broader goal of generalizable, unconstrained dataset citation tracking.

2605.30580 2026-06-01 cs.CL cs.LG

Speculative Decoding Across Languages

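Nirajan Paudel, Michael Ginn, Luc De Nardi, Alexis Palmer

Affiliations: University of Colorado

AI summary: Studies strategies for improving speculative decoding efficiency in non-English languages by fine-tuning the draft model or using n-gram draft models. Task-specific distillation improves efficiency but generalizes poorly, whereas n-gram models, despite lower acceptance rates, consistently deliver substantial speed-ups because drafting is so fast.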

Comments 10 pages, 11 figures, submitted to ACL ARR May 2026

Abstract

Speculative decoding has become a crucial component of large language model (LLM) inference, enabling faster generation by drafting multiple tokens and verifying them in parallel. However, small draft models tend to suffer from disproportionately poor multilingual capabilities. Thus, when generating text in a non-English language, speculative decoding is far less effective. We compare three strategies to improve speculative decoding efficiency for eleven languages: finetuning the draft model on task-specific data (translation); finetuning the draft model on unlabeled monolingual corpora; and training simple n-gram draft models on the same monolingual corpora. We evaluate efficiency on translation (from English into the target language) and the held-out task of story generation. We find that while task-specific distillation can significantly improve efficiency, distilled models generalize poorly to a new task. Meanwhile, n-gram draft models, despite lower acceptance rates, consistently provide large speed-ups due to much faster draft generation.
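The draft-then-verify loop described above can be illustrated with a toy greedy variant. This is a sketch under stated assumptions, not the authors' setup: the draft is a bigram table, the "target model" is a stand-in function (`toy_target`) that cycles through a fixed sentence, and verification is written as a sequential loop where a real system would score all drafted positions in one parallel pass. All names are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Callable, Dict, List, Tuple


def train_bigram(corpus: List[str]) -> Dict[str, str]:
    """Bigram draft model: map each token to its most frequent successor."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for prev, nxt in zip(corpus, corpus[1:]):
        counts[prev][nxt] += 1
    return {tok: c.most_common(1)[0][0] for tok, c in counts.items()}


def speculative_generate(
    target_next: Callable[[List[str]], str],
    draft: Dict[str, str],
    prompt: List[str],
    max_new: int,
    k: int = 4,
) -> Tuple[List[str], float]:
    """Greedy speculative decoding; `prompt` must be non-empty."""
    out = list(prompt)
    proposed = accepted = 0
    while len(out) - len(prompt) < max_new:
        # Draft phase: cheaply propose up to k tokens.
        drafted, last = [], out[-1]
        for _ in range(k):
            if last not in draft:
                break
            last = draft[last]
            drafted.append(last)
        proposed += len(drafted)
        # Verify phase: keep the longest prefix the target agrees with.
        for tok in drafted:
            if len(out) - len(prompt) >= max_new or target_next(out) != tok:
                break
            out.append(tok)
            accepted += 1
        # The target always contributes one token: a correction after a
        # mismatch, or a bonus token after full acceptance.
        if len(out) - len(prompt) < max_new:
            out.append(target_next(out))
    rate = accepted / proposed if proposed else 0.0
    return out[len(prompt):], rate


# Toy target "model": deterministically cycles through a fixed sentence.
SENTENCE = "the cat sat on the mat".split()


def toy_target(context: List[str]) -> str:
    return SENTENCE[len(context) % len(SENTENCE)]


draft_model = train_bigram(SENTENCE * 3)
tokens, acceptance_rate = speculative_generate(
    toy_target, draft_model, ["the"], max_new=10
)
print(tokens, f"acceptance rate = {acceptance_rate:.2f}")
```

Because every emitted token is either confirmed or supplied by the target, the output matches plain greedy decoding with the target alone. The draft affects only speed, which is why a fast draft with a modest acceptance rate can still pay off.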

2605.30576 2026-06-01 cs.AI

Uncertainty-Aware and Temporally Regulated Expert Advice in Reinforcement Learning for Autonomous Driving

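Ahmed Abouelazm, Felix Klingebiel, Philip Schörner, J. Marius Zöllner

Affiliations: FZI Research Center for Information Technology; Karlsruhe Institute of Technology

AI summary: Proposes an uncertainty-aware framework that triggers expert advice through adaptive thresholds, regulates the duration of guidance with a commitment-cooldown strategy, and builds on an off-policy implicit quantile network for safe and efficient exploration, improving success rates by 5-7% in CARLA.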

Comments Accepted in The IEEE International Conference on Intelligent Transportation Systems (ITSC) September 15-18, 2026 -- Naples, Italy

Abstract

Exploration in reinforcement learning for autonomous driving is inherently unsafe: agents must experience novel behaviors to learn, yet exploration can lead to collisions or off-road driving. We propose an uncertainty-aware framework that leverages expert advice to guide exploration while avoiding long-term dependence. Advice is triggered when epistemic or aleatoric uncertainty exceeds adaptive thresholds derived from rolling buffers, ensuring advice evolves with the agent's confidence. A commitment-cooldown strategy with a stochastic early-stop heuristic regulates the duration and frequency of guidance, exposing the agent to coherent maneuvers without exhausting the advice budget. Expert and agent experiences are combined in a shared replay buffer within an off-policy implicit quantile network (IQN) backbone, enabling efficient reuse of expert trajectories. Experiments in CARLA show that our method outperforms the IQN baseline, improving success by 5-7% and reducing failures, demonstrating that risk-sensitive uncertainty coupled with regulated expert integration enables safer and more efficient exploration for sensor-based RL policy learning in unsignalized intersection navigation.
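One way to picture the adaptive trigger is a rolling buffer per uncertainty type, with advice requested when the current value exceeds a statistic of the recent history. The abstract does not state which statistic is used, so the quantile rule, window size, and warm-up length below are assumptions, and the class and function names are hypothetical.

```python
from collections import deque


class AdaptiveThreshold:
    """Flags a value that exceeds a quantile of the recent history.

    The quantile rule, window size, and warm-up length are illustrative
    assumptions, not values taken from the paper.
    """

    def __init__(self, window: int = 100, quantile: float = 0.9, warmup: int = 10):
        self.buffer = deque(maxlen=window)  # rolling buffer of past values
        self.quantile = quantile
        self.warmup = warmup

    def update(self, value: float) -> bool:
        """Compare `value` against the current threshold, then record it."""
        triggered = False
        if len(self.buffer) >= self.warmup:
            ordered = sorted(self.buffer)
            idx = min(int(self.quantile * len(ordered)), len(ordered) - 1)
            triggered = value > ordered[idx]
        self.buffer.append(value)
        return triggered


def should_request_advice(epistemic: float, aleatoric: float,
                          epistemic_gate: AdaptiveThreshold,
                          aleatoric_gate: AdaptiveThreshold) -> bool:
    # Update both buffers before combining, so neither is skipped
    # by short-circuit evaluation.
    epistemic_high = epistemic_gate.update(epistemic)
    aleatoric_high = aleatoric_gate.update(aleatoric)
    return epistemic_high or aleatoric_high
```

Since the threshold is recomputed from recent values, it moves with the agent's own uncertainty over training, which is the property the abstract attributes to the rolling buffers. The commitment-cooldown logic and advice budget are not modeled here.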

2605.30574 2026-06-01 cs.CL

Probing the Prompt KV Cache: Where It Becomes Dispensable

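Vinayshekhar Bannihatti Kumar, Manoj Ghuhan Arivazhagan, Disha Makhija, Rashmi Gangadharaiah

Affiliations: AWS AI Labs

AI summary: Controlled experiments show that redundancy in the prompt KV cache stems mainly from the form of the chat template rather than from its content: the upper-layer cache can be replaced by the KV cache of a neutral-filler template without losing accuracy.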

Abstract

Prior KV cache compression schemes demonstrate empirically that the prompt cache is partially redundant during decoding: they drop or summarise entries with little accuracy loss. We ask when this redundancy arises and what kind it is: at which layers, after how many decoding steps, and in what form the prompt-span KV cache can be replaced without breaking the task. A controlled splice intervention, swept over layer cutoff and decoding steps, shows that this redundancy is about form (chat template scaffolding) rather than content. Replacing the upper-layer prompt-span KV cache with the KV cache from a chat template scaffold whose user content is a neutral filler recovers near-clean accuracy, while zeroing the same slots collapses accuracy. The dissociation replicates across the Qwen3, Gemma 3, and Llama 3 families on multiple datasets.

2605.30573 2026-06-01 cs.LG

Zeroth-Order Non-Log-Concave Sampling with Variance Reduction and Applications to Inverse Problems

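M. Berk Sahin, Behzad Sharif, Abolfazl Hashemi

Affiliations: Elmore School of Electrical and Computer Engineering, Purdue University, West Lafayette, USA; Weldon School of Biomedical Engineering, Purdue University, West Lafayette, USA

AI summary: Sampling from non-log-concave distributions in black-box settings is hampered by inaccessible gradients and the high variance of classical zeroth-order estimators. The paper proposes a variance-reduced zeroth-order Langevin sampling method, establishes the first non-asymptotic convergence guarantees, and applies the method to posterior sampling for inverse problems.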

Comments Accepted to ICML 2026

Abstract

Sampling from high-dimensional, non-log-concave distributions with unnormalized densities remains a fundamental challenge in machine learning, particularly in black-box settings where gradient information is inaccessible or computationally prohibitive. While Langevin dynamics provides a principled framework for sampling when gradients are accessible, its extension to the black-box settings suffers from high variance and lacks non-asymptotic convergence guarantees for non-log-concave sampling. To address these limitations, we propose a variance-reduced zeroth-order Langevin sampling method. Our method employs a gradient estimator that substantially reduces the variance of the classical batched zeroth-order estimator and eliminates the unfavorable dimensional dependence of the batch size required for accurate estimation, enabling practical and stable sampling. We establish the first non-asymptotic convergence guarantees for zeroth-order non-log-concave sampling in terms of $\varepsilon$-relative Fisher information, and, under a Poincaré inequality assumption, squared total variation distance. We further propose ZO-APMC, a posterior sampling algorithm for black-box inverse problems with pre-trained score-based generative priors, establishing the first non-asymptotic convergence guarantees for such methods. We validate our theory through synthetic experiments and demonstrate strong empirical performance on practical linear and nonlinear inverse problems.
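For context, the classical batched zeroth-order estimator that the abstract takes as its baseline can be sketched as follows. This is the standard Gaussian-smoothing form, not the paper's variance-reduced estimator, whose construction the abstract does not describe. The smoothing radius `mu`, the batch size, and the function names are assumptions chosen for illustration.

```python
import math
import random
from typing import Callable, List, Optional, Sequence


def zo_gradient(
    f: Callable[[Sequence[float]], float],
    x: Sequence[float],
    mu: float = 1e-3,
    batch: int = 64,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Classical batched estimator: average of (f(x + mu*u) - f(x)) / mu * u."""
    rng = rng or random.Random(0)
    fx = f(x)
    grad = [0.0] * len(x)
    for _ in range(batch):
        u = [rng.gauss(0.0, 1.0) for _ in x]  # random Gaussian direction
        shifted = [xi + mu * ui for xi, ui in zip(x, u)]
        slope = (f(shifted) - fx) / mu  # directional finite difference
        for i, ui in enumerate(u):
            grad[i] += slope * ui / batch
    return grad


def zo_langevin_step(
    potential: Callable[[Sequence[float]], float],
    x: Sequence[float],
    step: float,
    rng: random.Random,
    **zo_kwargs,
) -> List[float]:
    """One unadjusted Langevin step targeting exp(-potential),
    with the gradient replaced by the zeroth-order estimate."""
    grad = zo_gradient(potential, x, rng=rng, **zo_kwargs)
    noise_scale = math.sqrt(2.0 * step)
    return [
        xi - step * gi + noise_scale * rng.gauss(0.0, 1.0)
        for xi, gi in zip(x, grad)
    ]


# Quadratic potential: the true gradient at x is x itself.
def quadratic(v: Sequence[float]) -> float:
    return 0.5 * sum(c * c for c in v)


x0 = [1.0, -2.0, 0.5]
print(zo_gradient(quadratic, x0, batch=20000, rng=random.Random(1)))
```

The estimate only becomes accurate with a large batch, and the batch needed grows with dimension. That cost is the limitation the variance-reduced estimator in the paper is said to remove.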

2605.30570 2026-06-01 cs.AI

Procedural Generation of First Person Shooter Maps using Map-Elites

Simone de Donato, Pier Luca Lanzi, Daniele Loiacono

Affiliations: Politecnico di Milano, DEIB

AI summary: Applies the MAP-Elites algorithm to generate first-person shooter maps and proposes two new representations (Point-Line and Spatial-Layout) that improve map diversity and quality.

Abstract

We investigate the application of MAP-Elites (a well-known quality diversity algorithm) to design levels for First-Person Shooter (FPS) games. We consider two well-known map representations (All-Black and Grid-Graph) and introduce two novel representations (Point-Line and Spatial-Layout) that improve the characterization of FPS maps. We define a series of metrics to describe the maps' topological properties (which depend solely on the map layout) and emergent properties (which must be evaluated through actual gameplay). We perform an in-depth analysis to identify the most suitable features to guide the MAP-Elites illumination process. We apply MAP-Elites with Sliding Boundaries (MESB) to evolve populations of FPS maps. Our results show that the new representations can generate maps with higher diversity and quality than the representations previously used for evolving FPS maps.
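The core MAP-Elites loop is short enough to sketch. The version below uses a fixed grid over descriptors in [0, 1], not the sliding boundaries of MESB, and a toy two-number "solution" in place of an FPS map. The evaluation, mutation, and parameter values are all placeholders, not the paper's.

```python
import random
from typing import Callable, Dict, List, Tuple

Solution = List[float]
Archive = Dict[Tuple[int, ...], Tuple[float, Solution]]


def map_elites(
    evaluate: Callable[[Solution], Tuple[float, Tuple[float, ...]]],
    random_solution: Callable[[random.Random], Solution],
    mutate: Callable[[Solution, random.Random], Solution],
    n_bins: int = 10,
    iterations: int = 2000,
    n_init: int = 100,
    seed: int = 0,
) -> Archive:
    """Fixed-grid MAP-Elites: keep the fittest solution found per cell."""
    rng = random.Random(seed)
    archive: Archive = {}
    for i in range(iterations):
        if i < n_init or not archive:
            candidate = random_solution(rng)  # bootstrap phase
        else:
            _, parent = rng.choice(list(archive.values()))
            candidate = mutate(parent, rng)  # vary a stored elite
        fitness, descriptor = evaluate(candidate)
        # Discretize each descriptor (assumed to lie in [0, 1]) into a bin.
        cell = tuple(min(int(d * n_bins), n_bins - 1) for d in descriptor)
        if cell not in archive or fitness > archive[cell][0]:
            archive[cell] = (fitness, candidate)
    return archive


# Toy problem: the descriptor is the solution itself, and fitness
# rewards closeness to the centre of the unit square.
def toy_evaluate(sol: Solution) -> Tuple[float, Tuple[float, ...]]:
    fitness = -sum((v - 0.5) ** 2 for v in sol)
    return fitness, (sol[0], sol[1])


def toy_random(rng: random.Random) -> Solution:
    return [rng.random(), rng.random()]


def toy_mutate(sol: Solution, rng: random.Random) -> Solution:
    return [min(1.0, max(0.0, v + rng.gauss(0.0, 0.1))) for v in sol]


archive = map_elites(toy_evaluate, toy_random, toy_mutate, n_bins=5, iterations=1000)
print(f"{len(archive)} of 25 cells filled")
```

In the paper's setting, the descriptors would be the topological or emergent map metrics chosen to guide illumination, and fitness would come from the map evaluation. The archive then holds one best map per combination of feature values.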

2605.30569 2026-06-01 cs.RO

Any-ttach: Quick End-effector Swapping Enables Manipulation Dexterity with Simplicity

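Weizhe Ni, Jinzhou Li, Haoyu Li, Cody Andres Alessio-Bunnell, Wenjing Pan, Xianyi Cheng

Affiliations: Department of Mechanical Engineering and Materials Science, Duke University

AI summary: Proposes the Any-ttach framework, which pairs a low-cost quick end-effector swapping mechanism with task planning to achieve dexterous manipulation with a variety of tools and end-effector modules, and validates gains in reliability and efficiency on long-horizon tasks.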

Abstract

Robotic manipulation dexterity is often pursued by building increasingly complex high-DoF multifingered hands. While many robotic hands are designed to replicate human morphology, the functional role of human hands suggests a different perspective: much of their complexity may exist to enable tool use and tool making. This observation motivates Any-ttach, a tool-centric manipulation framework that treats quick end-effector swapping as a mechanism for dexterity with simplicity. Any-ttach combines a low-cost automatic swapping mechanism for an open-close robot interface, a handheld device for collecting human demonstrations, and a task planning framework that composes learned, parameterized, and planned tool-use skills. The system supports diverse tools and end-effector modules, including daily tools, articulated tools such as scissors, Fin Ray fingers, and a low-cost anthropomorphic hand, through the same shared interface. Our experiments show that Any-ttach improves tool-swapping reliability, increases demonstration efficiency, reduces tool-pose variability, and supports diverse tool-use skills. In two long-horizon tasks, making a sandwich and preparing a cucumber, Any-ttach executes six tool-use subskills through end-effector switching and execution monitoring. These results suggest that robots can expand manipulation capability not only through more complex end-effectors, but also through rapidly exchangeable tools and end-effector modules. More details and videos are available at https://any-ttach.github.io/.

2605.30568 2026-06-01 cs.CL

Generating and Refining Dynamic Evaluation Rubrics for LLM-as-a-Judge

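Zijie Wang, Eduardo Blanco

Affiliations: University of Arizona; Department of Computer Science

AI summary: Proposes a method for automatically generating fine-grained evaluation rubrics without human annotation, at dataset-level and instance-level granularity, and iteratively fine-tunes the rubric generator with meta-judge reward signals, outperforming existing baselines in both pairwise and pointwise evaluation.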

Abstract

LLM-as-a-Judge is a scalable alternative to human evaluation, yet existing rubric-based methods rely on human-annotated data such as reference answers or expert-crafted rubrics. We propose to automatically generate fine-grained evaluation rubrics without any human annotation. Our training-free method generates rubrics at dataset-specific and instance-specific granularities, achieving performance competitive with existing methods across four benchmarks. We further present a method that iteratively fine-tunes a rubric generator model via meta-judge reward signals. The fine-tuned generator outperforms all existing baselines in both pairwise and pointwise evaluation. Notably, a fine-tuned 14B rubric generator outperforms a much larger proprietary model at rubric generation, showing the effectiveness of our fine-tuning strategy.

2605.30563 2026-06-01 cs.AI

Transforming and Encoding FTS for SAT Solving: What Helps, What Hurts (Extended Version)

João Filipe, Álvaro Torralba, Gregor Behnke

Affiliations: University of Amsterdam, Institute for Logic, Language and Computation; Aalborg University

AI summary: Studies how to encode factored tasks as SAT problems, proposes several encoding strategies, and analyzes how parallelism and task transformations affect the performance of SAT-based planners.

Abstract

Factored tasks are a classical planning representation that extends SAS+ with limited forms of disjunctive preconditions, conditional effects, and angelic nondeterminism. This allows for a more compact representation of tasks than traditional formalisms such as STRIPS or SAS+, and supports a wide range of task transformations. However, existing planning approaches for factored tasks have been limited to heuristic search methods. In this work, we investigate how to encode factored tasks in SAT. We propose several ways to encode the tasks, focusing on different strategies for translating the factored transition relation into propositional logic. We also analyze how to exploit parallelism at various levels in this setting and study the impact of common task transformations on the performance of SAT-based planners.

2605.30561 2026-06-01 cs.CV cs.AI

VLM3: Vision Language Models Are Native 3D Learners

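Zhipeng Cai, Zhuang Liu, Yunyang Xiong, Zechun Liu, Vikas Chandra, Yangyang Shi

Affiliations: Meta; Princeton University

AI summary: Proposes VLM3, which uses focal length unification, text-based pixel reference, and data mixture and scaling to let standard vision-language models master diverse 3D tasks efficiently, without complex architectures or loss functions.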

Abstract

Vision Language Models (VLMs) enable a unified model to solve various vision tasks through prompting, and they have shown promising performance in semantic understanding. However, 3D understanding still largely relies on expert vision models with complex task-specific designs. The key argument of this work is that VLMs are native 3D learners. Our in-depth, large-scale study shows that 1) focal length unification, 2) text-based pixel reference, and 3) data mixture and scaling are all you need for effective 3D learning. Model architecture changes, large models, heavy data augmentations, and complex losses including the regression formulation, many of which form the foundation of expert vision models, are not actually necessary. As a result, we propose VLM3, a scalable method with the simplest design that enables standard VLMs to master diverse 3D tasks. VLM3 not only advances VLM depth estimation accuracy by a large margin (0.84 -> 0.9), but also enables diverse 3D tasks such as pixel correspondence, camera pose estimation, and object-level 3D understanding, matching expert vision model accuracy while maintaining standard architectures and text-based training. We believe VLM3 opens up a new paradigm for simple and scalable 3D learning.

2605.30557 2026-06-01 cs.CV cs.AI cs.CL

Seeing Isn't Knowing: Do VLMs Know When Not to Answer Spatial Questions (and Why)?

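Yue Zhang, Zun Wang, Han Lin, Yonatan Bitton, Idan Szpektor, Mohit Bansal

Affiliations: UNC Chapel Hill; Google Research

AI summary: To address overconfident answering by vision-language models in spatial reasoning, the paper proposes the SpatialUncertain framework, which uses two challenges, occlusion and perspective ambiguity, to evaluate whether models know when to abstain and how to seek reliable evidence.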

Comments Website: https://zhangyuejoslin.github.io/spatialuncertain/

Abstract

Spatial reasoning is a fundamental capability for vision-language models (VLMs) deployed in real-world environments. However, visual observations are inherently limited representations of a 3D world: occlusion can render objects invisible, and perspective can make geometric properties misleading. Despite this, existing spatial reasoning benchmarks typically assume that observations are sufficient and reliable, focusing on whether models produce correct answers rather than whether they recognize when a question cannot be answered and what additional observations would be needed. In this work, we challenge this assumption by constructing a controlled evaluation framework, SpatialUncertain, and introducing two types of observation challenges: (1) occlusion, which hides target information, and (2) perspective ambiguity, which produces misleading visual cues. For each configuration, we design spatial questions that are answerable under clean observations but require abstention under the introduced challenges. We further evaluate whether models can identify which additional viewpoints would resolve perspective ambiguity. Our results across a diverse set of frontier open- and closed-source VLMs reveal two consistent failure modes. First, models are prone to overconfident answering, attempting to solve spatial reasoning tasks even when visual evidence is incomplete or misleading, with average accuracy around 30% under occlusion and below 10% under perspective ambiguity. Second, even when additional views are available, some models perform near random chance in identifying which would provide reliable evidence. Together, our findings call for moving beyond answer correctness toward evaluating whether models know when to abstain and how to seek reliable evidence.

2605.30556 2026-06-01 cs.LG q-bio.NC

Supervised Training Rapidly Degrades Early Visual Cortex Alignment Across Biologically Plausible Learning Rules

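Nils Leutenegger

Affiliations: Independent Researcher

AI summary: Finds that untrained networks match or exceed trained networks in representational similarity to early visual cortex. By comparing how alignment with human fMRI data changes during training under four learning rules (BP, FA, PC, STDP), the paper shows that global error signals (BP) reshape early representations more drastically than local learning rules (PC, STDP).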

Comments 7 pages, 4 figures

Abstract

Random, untrained neural networks consistently match or exceed trained networks in representational similarity to early visual cortex. This puzzling finding challenges the assumption that learning improves brain alignment. We investigate it by tracking representational similarity analysis (RSA) alignment to human fMRI data across training for four learning rules: backpropagation (BP), feedback alignment (FA), predictive coding (PC), and spike-timing-dependent plasticity (STDP). Using 720 object images from the THINGS database and fMRI data from three subjects across six visual ROIs, we measure Spearman correlations between model and brain representational dissimilarity matrices at eight training checkpoints (epochs 0-40). We find that (1) a single epoch of training reduces V1 alignment by 25-90%, depending on the learning rule; (2) backpropagation reduces V1 alignment most severely (delta r = -0.080), while predictive coding and STDP preserve substantially more (delta r ~ -0.04); and (3) a weaker, opposite tendency appears in object-selective cortex (LOC), where BP shows the largest increase in alignment during training, although the absolute change is small. These results suggest that untrained architectures capture low-level visual statistics through inductive biases alone, and that global error signals (BP) reshape early representations more aggressively than local learning rules (PC, STDP), which better preserve brain-like structure.
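The alignment measure described above, a Spearman correlation between model and brain representational dissimilarity matrices (RDMs), can be sketched with the standard library alone. The dissimilarity used here (1 minus Pearson correlation between stimulus patterns) is a common choice and an assumption. The abstract does not say which distance the paper uses, and the tiny feature lists are made up.

```python
from statistics import mean
from typing import List, Sequence


def pearson(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson correlation of two equal-length, non-constant sequences."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5


def rdm_upper(features: Sequence[Sequence[float]]) -> List[float]:
    """Upper triangle of the RDM: 1 - Pearson r for each stimulus pair."""
    n = len(features)
    return [
        1.0 - pearson(features[i], features[j])
        for i in range(n)
        for j in range(i + 1, n)
    ]


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def rsa_alignment(model_feats: Sequence[Sequence[float]],
                  brain_feats: Sequence[Sequence[float]]) -> float:
    """Spearman correlation between the two RDM upper triangles."""
    return pearson(ranks(rdm_upper(model_feats)), ranks(rdm_upper(brain_feats)))


# Four stimuli, three units/voxels each (made-up numbers).
model_patterns = [[1, 2, 3], [3, 1, 2], [2, 3, 1], [1, 3, 2]]
# An affine copy of the model patterns has the same correlation geometry.
brain_patterns = [[2 * v + 1 for v in row] for row in model_patterns]

print(rsa_alignment(model_patterns, brain_patterns))
```

Because only the rank order of pairwise dissimilarities matters, the model and brain responses can have different dimensionality and scale. In the study, this score would be computed per ROI and per training checkpoint to trace how alignment changes.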

2605.30553 2026-06-01 cs.LG cs.IT math.IT

Destruction is a General Strategy to Learn Generation; Diffusion's Strength is to Take it Seriously; Exploration is the Future

破坏是学习生成的一般策略;扩散的优势在于认真对待它;探索是未来

Pierre-André Noël

发表机构 * ServiceNow AI Research(ServiceNow AI研究院)

AI总结 本文将扩散模型视为“向模型输入隐藏信息并训练其猜测被隐藏信息”这一技术家族的一员,论证其破坏式信息隐藏比手工设计的隐藏方式更灵活,在数据稀缺等场景可能具有优势,并讨论将强化学习技术移植到扩散上下文时的微妙问题及更扩散原生的探索方向。

Comments Published April 27th, 2026 as an ICLR blogpost https://iclr-blogposts.github.io/2026/blog/2026/destruction/

Journal ref Noël, Pierre-André. "Destruction is a General Strategy to Learn Generation; Diffusion's Strength is to Take it Seriously; Exploration is the Future", ICLR Blogposts, 2026

详情
AI中文摘要

我将扩散模型视为机器学习技术家族的一部分,这些技术从模型输入中隐藏信息,并训练模型猜测被隐藏的信息。我认为扩散的破坏式信息隐藏方法比典型的手工设计信息隐藏技术更灵活,提供了一个丰富的训练环境,在某些场景(尤其是数据稀缺场景)中可能具有优势。然后,我讨论了将强化学习技术移植到扩散上下文时可能出现的微妙问题,并思考如何以更扩散原生的方式解决这些探索问题。我没有确定的答案,但我指出了我认为有趣的方向。本文之后附有一篇教程,进一步阐述了先破坏后生成的观点。为了便于教程的阐述,引入了一种新型的概率图模型。

英文摘要

I present diffusion models as part of a family of machine learning techniques that withhold information from a model's input and train it to guess the withheld information. I argue that diffusion's destroying approach to withholding is more flexible than typical hand-crafted information withholding techniques, providing a rich training playground that could be advantageous in some settings, notably data-scarce ones. I then address subtle issues that may arise when porting reinforcement learning techniques to the diffusion context, and wonder how such exploration problems could be addressed in more diffusion-native ways. I do not have definitive answers, but I do point my fingers in directions I deem interesting. A tutorial follows this thesis, expanding on the destroy-then-generate perspective. A novel kind of probabilistic graphical models is introduced to facilitate the tutorial's exposition.

2605.30550 2026-06-01 cs.LG

Early Prediction of Future Behavioral Strategy from Process Traces

从过程轨迹早期预测未来行为策略

Robert Kasumba, Dennis Barbour, Chien-Ju Ho

发表机构 * Division of Computational and Data Sciences(计算与数据科学系) Department of Biomedical Engineering(生物医学工程系) Department of Computer Science(计算机科学系)

AI总结 提出过程级潜变量模型(PLVM),通过跨任务过程轨迹融合共享人级潜在表示,实现早期跨任务行为策略预测。

详情
AI中文摘要

自适应系统通常需要从有限的证据中做出关于人的特定任务决策:导师可能需要预测学习者将如何解决新问题,游戏可能需要适应玩家进入新关卡,人机系统可能需要推断合作伙伴是会坚持计划还是切换目标。这些决策依赖于塑造人们如何解决相关任务的人级倾向,但这类倾向难以从标准行为证据中推断。一种方法是使用聚合结果摘要,如分数、完成率或生产率;这些摘要紧凑且跨任务可用,但可能将不同的行为过程压缩为相似的结果。另一种方法是使用过程级轨迹,记录行为如何展开;然而,单一任务内的过程建模可能将稳定的人级倾向与任务特定布局和可供性纠缠在一起。在本工作中,我们研究早期跨任务行为推断:部分源任务过程轨迹是否能揭示可迁移的人级结构,从而预测保留目标任务中的策略。我们引入过程级潜变量模型(PLVM),该模型编码任务特定轨迹并将其融合为共享的人级潜在表示以进行跨任务预测。在自然主义的人类游戏遥测数据集PowerWash Simulator中,PLVM使用来自两个清洁任务的部分轨迹,预测保留的消防站关卡中局部持久的区域规划者行为与频繁的区域跳跃者行为。具有已知潜在类型的受控模拟表明,当源任务揭示共享潜在过程的互补维度时,跨任务融合有所帮助。这些结果表明,当观察足够的目标任务行为不切实际时,过程级跨任务建模可以支持目标任务策略的早期预测。

英文摘要

Adaptive systems often need to make task-specific decisions about people from limited evidence: a tutor may need to anticipate how a learner will approach a new problem, a game may need to adapt when a player enters a new level, and a human-AI system may need to infer whether a partner will persist with a plan or switch goals. These decisions depend on person-level tendencies that shape how people solve related tasks, but such tendencies are difficult to infer from standard behavioral evidence. One approach is to use aggregate outcome summaries, such as scores, completion rates, or productivity; these summaries are compact and available across tasks, but can collapse distinct behavioral processes into similar outcomes. Another approach is to use process-level traces, which record how behavior unfolds; however, process modeling within one task can entangle stable person-level tendencies with task-specific layout and affordances. In this work, we study early cross-task behavioral inference: whether partial source-task process traces can reveal transferable person-level structure that predicts strategy in a held-out target task. We introduce a Process-Level Latent Variable Model (PLVM), which encodes task-specific traces and fuses them into a shared person-level latent representation for cross-task prediction. In PowerWash Simulator, a naturalistic telemetry dataset of human gameplay, PLVM uses partial traces from two cleaning tasks to predict locally persistent Zone Planner behavior versus frequent Zone Hopper behavior in the held-out Fire Station level. Controlled simulations with known latent types show that cross-task fusion helps when source tasks reveal complementary dimensions of a shared latent process. These results suggest that process-level cross-task modeling can support early prediction of target-task strategy when observing sufficient target-task behavior is impractical.

2605.30545 2026-06-01 cs.CL

Refining Word-Based Grammatical Error Annotation for L2 Korean

针对L2韩语的基于词的语法错误标注改进

Jungyeul Park, Kyungtae Lim, Wonjun Oh, Benjamin Nguyen, Zihao Huang, Mengyang Qiu, Jayoung Song

发表机构 * Korea Advanced Institute of Science & Technology(韩国科学技术院) The University of British Columbia(不列颠哥伦比亚大学) Saint Elizabeth University(圣伊丽莎白大学) The Pennsylvania State University(宾夕法尼亚州立大学)

AI总结 通过解决现有资源中的三个问题(表面目标实现、韩语特定编辑标注和单参考评估),改进基于词的语法错误标注,并验证了改进资源在困惑度、编辑一致性和多参考评估上的有效性。

详情
AI中文摘要

韩语语法错误纠正(K-GEC)在基于词的评估与许多学习者错误的语素层面之间存在结构不匹配。后置词和动词结尾附着于词汇宿主,但它们编码的语法关系必须在纠正和评估中体现。本文通过解决现有资源中的三个相关问题来改进L2韩语的基于词语法错误标注:表面目标实现、韩语特定编辑标注和单参考评估。我们在形态约束实现规则下从国立韩语研究院(NIKL)L2语料库重建目标句子,并将其语素级标注转换为词级\texttt{m2}编辑。然后我们定义了一个韩语ERRANT风格的标注方案,保留MRU核心的同时区分功能语素错误、拼写错误、词边界错误和词序错误。我们还为KoLLA语料库增加了一个额外的参考纠正,为韩语GEC提供了多参考评估设置。实证验证表明,改进的NIKL目标具有更低的困惑度,转换后的\texttt{m2}文件与源-目标编辑表示的一致性更高,并且改进的资源在相同模型设置下提高了基于KoBART的纠正性能。多参考KoLLA评估进一步减少了对偏离单一参考的有效纠正的惩罚,尤其对于神经和提示式GEC系统。这些结果表明,韩语GEC评估不仅依赖于纠正模型,还依赖于反映韩语形态、空格和纠正变异性的参考数据和编辑标注。

英文摘要

Korean grammatical error correction (K-GEC) presents a structural mismatch between word-based evaluation and the morpheme-level locus of many learner errors. Postpositions and verbal endings are bound to lexical hosts, but they encode grammatical relations that must be represented in correction and evaluation. This paper refines word-based grammatical error annotation for L2 Korean by addressing three connected problems in existing resources: surface target realization, Korean-specific edit annotation, and single-reference evaluation. We reconstruct target sentences from the National Institute of Korean Language (NIKL) L2 corpus under morphologically constrained realization rules and convert its morpheme-level annotations into word-level \texttt{m2} edits. We then define a Korean ERRANT-style annotation scheme that preserves the MRU core while distinguishing functional morpheme errors, spelling errors, word boundary errors, and word order errors. We also augment the KoLLA corpus with an additional reference correction, yielding a multi-reference evaluation setting for Korean GEC. Empirical validation shows that the refined NIKL targets yield lower perplexity, the converted \texttt{m2} files achieve higher agreement with source-target edit representations, and the refined resources improve KoBART-based correction under the same model setting. Multi-reference KoLLA evaluation further reduces the penalty imposed on valid corrections that diverge from a single reference, especially for neural and prompted GEC systems. These results show that Korean GEC evaluation depends not only on correction models, but also on reference data and edit annotations that reflect Korean morphology, spacing, and correction variability.

2605.30544 2026-06-01 cs.CV cs.CR

On-Device Generative AI for GDPR-Compliant Visual Monitoring: Natural Language Alerts from Local Object Detection

面向GDPR合规的视觉监控的端侧生成式AI:来自本地目标检测的自然语言警报

Gudrun Schappacher-Tilp, Nicoletta Kaehling, Jan Kornberger, Egon Teiniker

发表机构 * Hailo-8L AI accelerator(Hailo-8L人工智能加速器) Raspberry Pi 5(树莓派5) Phi-3 Mini

AI总结 提出一种隐私设计管道,通过将推理完全限制在边缘设备上,结合YOLOv5n-seg目标检测和Phi-3 Mini语言模型,生成自然语言警报,实现GDPR合规的视觉监控。

Comments 6 pages, 4 figures, 3 tables, 1 listing

详情
AI中文摘要

依赖云端AI推理的视觉监控系统会将原始图像数据暴露给外部服务,这与《通用数据保护条例》(GDPR)的数据最小化原则产生根本冲突。本文提出了一种隐私设计的概念验证管道,通过将所有推理完全限制在边缘设备上来解决这一冲突。为Hailo-8L AI加速器编译的YOLOv5n-seg模型在Raspberry Pi 5上实现实时目标检测,推理后立即丢弃原始像素缓冲区。一个状态触发引擎将最小的JSON事件负载转发到本地托管的Phi-3 Mini(3.8B参数,Q4_0量化)实例,该实例为操作员合成一到两句的自然语言警报。任何图像数据都不会跨越网络边界;仅传输生成的文本警报。我们描述了完整的系统架构和实现,报告了目标硬件上的测量推理延迟和资源利用率,并展示了代表性的生成警报。结果表明,在单板计算机上结合专用神经网络加速器和端侧大型语言模型不仅是可行的,而且能产生实际可部署、人类可读的监控输出,同时通过设计符合GDPR第5(1)(c)条。

英文摘要

Visual monitoring systems that rely on cloud-based AI inference expose raw image data to external services, creating fundamental tensions with the data-minimisation principle of the General Data Protection Regulation (GDPR). This paper presents a proof-of-concept privacy-by-design pipeline that resolves this tension by confining all inference entirely to the edge device. A YOLOv5n-seg model compiled for a Hailo-8L AI accelerator delivers real-time object detection on a Raspberry Pi 5, from which raw pixel buffers are immediately discarded after inference. A stateful trigger engine forwards minimal JSON event payloads to a locally hosted instance of Phi-3 Mini (3.8B parameters, Q4_0 quantisation), which synthesises one-to-two sentence natural-language alerts for a human operator. No image data crosses the network boundary at any point; only the generated text alert is transmitted. We describe the full system architecture and implementation, report measured inference latency and resource utilisation on the target hardware, and present representative generated alerts. The results demonstrate that combining a dedicated neural-network accelerator with an on-device large language model on a single-board computer is not only feasible but produces practically deployable, human-readable monitoring output while aligning with GDPR Art. 5(1)(c) by design.

2605.30542 2026-06-01 cs.AI

Physically Viable World Models: A Case for Query-Conditioned Embodied AI

物理可行的世界模型:论查询条件化具身AI的必要性

Adam J. Thorpe, Stepan Tretiakov, Cheng-Hsi Hsiao, Su Ann Low, Xingjian Li, Hassan Iqbal, Neel P. Bhatt, Ufuk Topcu, Krishna Kumar

发表机构 * The University of Texas at Austin(德克萨斯大学奥斯汀分校)

AI总结 针对具身AI中现有世界模型预测未来观测但物理不可行的问题,提出应构建基于查询条件、识别最简物理抽象的世界模型,通过模块化分解确保可解释性和可验证性。

Comments 21 pages; Adam J. Thorpe and Stepan Tretiakov contributed equally

详情
AI中文摘要

具身AI的世界模型必须是物理可行的:其构建应能通过表示支配动作结果的物理结构来回答干预查询,而不仅仅是预测未来观测。现有的观测预测世界模型可以产生视觉上合理但物理上错误的展开。这种失败是结构性的;不同的物理系统可能看起来相同,但在干预下却产生分歧。我们通过控制基准实验暴露了这个问题,这些基准固定可见场景同时变化潜在物理属性。我们表明,此类模型可能推荐不可行的动作、错误预测交互结果或认证不安全行为。我们认为,具身AI需要能够识别足以回答干预查询的最简物理抽象的世界模型。这样的模型由模块化组件组成,包括环境表示、潜在状态和参数估计、动作规范、干预动力学和查询级响应。一个自主编排器应识别相关抽象,并为每个查询组合兼容的学习和结构化组件。当封闭形式的物理不可用、不确定或成本高昂时,转移模型可以是解析的、模拟的、学习的或混合的,但它必须保留决定干预结果的结构。这种分解使模型可解释、其组件可验证,其输出可针对查询进行审计。它还为新的世界模型提供了设计原则,为现有模型提供了可行性测试:正确的抽象不是最详细的世界模型,而是保留与查询相关区分的最简模型。我们在现有系统无法正确回答的查询上展示了这种方法,并概述了编排器如何动态组装和调整物理可行的模型以用于规划、控制和验证。

英文摘要

World models for embodied AI must be physically viable: constructed to answer intervention queries by representing the physical structure governing action outcomes, rather than merely predicting future observations. Existing observation-predictive world models can produce visually plausible but physically wrong rollouts. This failure is structural; distinct physical systems can look identical yet diverge under intervention. We expose this problem with controlled benchmarks that fix the visible scene while varying latent physics. We show that such models may recommend infeasible actions, mispredict interaction outcomes, or certify unsafe behavior. We argue that embodied AI requires world models that identify the simplest physical abstraction sufficient to answer an intervention query. Such a model comprises modular components, including environment representation, latent state and parameter estimation, action specification, interventional dynamics, and query-level response. An autonomous orchestrator should identify the relevant abstraction and compose compatible learned and structured components per query. When closed-form physics is unavailable, uncertain, or costly, the transition model may be analytic, simulated, learned, or hybrid, but it must preserve the structure that determines interventional outcomes. This decomposition makes the model interpretable, its components verifiable, and its outputs auditable against the query. It also provides a design principle for new world models and a feasibility test for existing ones: the right abstraction is not the most detailed model of the world, but the simplest model that preserves the distinctions relevant to the query. We demonstrate this approach on queries that existing systems fail to answer correctly, and outline how an orchestrator can dynamically assemble and adapt physically viable models for planning, control, and verification.

2605.30541 2026-06-01 cs.LG physics.geo-ph

SubsurfaceGen: Procedural Generation of Field-Scale Earth Models and Seismic Data

SubsurfaceGen: 野外尺度地球模型与地震数据的程序化生成

Joseph Stitt, Pratik Rathore, Madeleine Udell, Ching-Yao Lai

发表机构 * Stanford University(斯坦福大学)

AI总结 提出SubsurfaceGen,一个GPU加速的3D速度模型与地震数据生成器,并发布包含4276个2D速度切片、5秒波场和8秒炮集记录的数据集,用于评估机器学习在全波形反演中的表现。

Comments 38 pages

详情
AI中文摘要

全波形反演(FWI)是地下成像的黄金标准,应用范围从碳封存到能源和矿产勘探再到地震灾害评估。机器学习方法进行FWI需要野外尺度、地质多样性和物理真实的训练数据,但现有资源如Marmousi、SEAM和OpenFWI在空间范围、时间范围、地质多样性和物理真实性方面存在不足。我们通过SubsurfaceGen(一个用于3D速度模型和地震数据的GPU加速生成器)来解决这些限制。与SubsurfaceGen一起,我们发布了一个配对数据集,包含来自42个真实、野外尺度的3D速度模型的4276个2D速度切片、5秒波场和8秒炮集记录,每个模型横向跨度10 km x 10 km,深度6.19 km,分辨率为10 m。该数据集涵盖六种地质环境——四种由SubsurfaceGen构建,两种来自先前来源——与碳封存和碳氢化合物勘探相关。我们使用该数据集评估神经算子进行波场预测和编码器-解码器进行端到端速度反演,并保留一种地质环境用于分布外测试。这些实验揭示了野外尺度的失败模式,并展示了SubsurfaceGen及相关数据集如何影响基于机器学习的FWI。

英文摘要

Full waveform inversion (FWI) is the gold standard for subsurface imaging, with applications from carbon sequestration to energy and mineral exploration to earthquake hazard assessment. Machine learning approaches to FWI need field-scale, geologically diverse, and physically realistic training data, but existing resources such as Marmousi, SEAM, and OpenFWI fall short on spatial extent, temporal extent, geological diversity, and physical realism. We address these limitations with SubsurfaceGen, a GPU-accelerated generator for 3D velocity models and seismic data. Along with SubsurfaceGen, we release a paired dataset of 4,276 2D velocity slices, 5 s wavefields, and 8 s shot gathers drawn from 42 realistic, field-scale 3D velocity models, each spanning 10 km x 10 km laterally and 6.19 km deep at 10 m resolution. The dataset spans six geological settings -- four built with SubsurfaceGen and two drawn from prior sources -- relevant for carbon sequestration and hydrocarbon exploration. We use this dataset to evaluate neural operators on wavefield prediction and encoder-decoders on end-to-end velocity inversion, holding out one geological setting for out-of-distribution testing. These experiments surface failure modes at field-scale and demonstrate how SubsurfaceGen and the associated dataset can impact ML-based FWI.

2605.30538 2026-06-01 cs.LG

DisasterLex: An Expert Concept-to-Schema Knowledge Graph for Geospatial Reasoning in Disaster Analytics

DisasterLex:面向灾害分析中地理空间推理的专家概念到模式知识图谱

Yiming Xiao, Ankit Basu, Kai Yin, Sahil Vartak, Christian Swords, Ali Mostafavi

发表机构 * Texas A&M University(德克萨斯农工大学)

AI总结 提出DisasterLex框架,通过插入专家知识图谱(EKG)将用户查询与数据库模式桥接,在灾害分析场景中实现文本到SQL的准确转换,性能优于现有方法1.4-2.75倍。

详情
AI中文摘要

灾害不可避免且日益昂贵,有效响应依赖于查询结构化表格数据:支撑灾害管理的精确、信息密集的危害、暴露度、脆弱性和生命线基础设施记录。当前的文本到SQL方法允许自然语言访问此类表格,但迁移到灾害领域时效果不佳,因为查询跨越异构地理空间模式,并需要对因果关系进行推理。我们引入DisasterLex,一个知识图谱中介的框架,在用户查询和数据库之间插入一个包含精选概念和类型化因果边的专家知识图谱(EKG),并通过概念到表格链接与模式桥接。该编排运行四个阶段(识别查询实体、路由到操作域、在因果边上规划、以及生成SQL),在每个步骤限制传递给模型的模式。我们在一个灾害分析数据库(36个地理空间表,150列)上实例化,该数据库具有包含107个概念、117条因果边和52个概念到模式链接的EKG,并在75个查询的测试集上评估。在所有七个涵盖专有和开源权重系列的基础模型上,DisasterLex以1.65到3.56(满分5.0)的绝对分数,比四个最先进的基线(LightRAG、HippoRAG 2、ReFoRCE、CHESS)高出1.4到2.75倍。错误分析显示基线失败集中在路由和多表SQL组合上,这正是我们的编排明确解决的操作。代码、数据和EKG工件可在https://github.com/YimingXiao98/DisasterLex 和Zenodo https://doi.org/10.5281/zenodo.20388029 获取。

英文摘要

Disasters are inevitable and increasingly costly, and effective response depends on querying structured tabular data: precise, information-dense records of hazard, exposure, vulnerability, and lifeline infrastructure that underpin disaster management. Current text-to-SQL methods enable natural-language access to such tables but transfer poorly to the disaster domain, where queries span heterogeneous geospatial schemas and require reasoning over causal relations. We introduce DisasterLex, a knowledge-graph-mediated framework that inserts an Expert Knowledge Graph (EKG) of curated concepts and typed causal edges between the user query and the database, bridged to schema by concept-to-table links. The orchestration runs four stages (identifying query entities, routing to the operational domain, planning over causal edges, and grounding the SQL), restricting the schema passed to the model at each step. We instantiate it on a disaster-analytics database (36 geospatial tables, 150 columns) with an EKG of 107 concepts, 117 causal edges, and 52 concept-to-schema links, evaluated on a 75-query test set. On all seven base models spanning proprietary and open-weight families, DisasterLex beats four state-of-the-art baselines (LightRAG, HippoRAG 2, ReFoRCE, CHESS) by 1.4x to 2.75x, with absolute scores of 1.65 to 3.56 (of 5.0). Error analysis shows baseline failures cluster in routing and multi-table SQL composition, the operations our orchestration explicitly addresses. Code, data, and the EKG artifact are available at https://github.com/YimingXiao98/DisasterLex and on Zenodo at https://doi.org/10.5281/zenodo.20388029.

2605.30537 2026-06-01 cs.LG

The Long-Term Effects of Data Selection in LLM Fine-Tuning

LLM微调中数据选择的长期影响

Yuxin Yang, Aoxiong Zeng, Xiangquan Yang

发表机构 * Shanghai University(上海大学) East China Normal University(华东师范大学)

AI总结 研究多阶段LLM微调中,短视数据选择策略(如基于当前效用)可能导致后续学习变慢、遗忘加剧和排名反转,提出长视距感知选择(LHAS)目标函数以缓解此问题。

Comments work in progress

详情
AI中文摘要

数据选择越来越多地被用于降低大型语言模型(LLM)微调的成本,近期方法根据当前效用、多样性、质量或影响力对样本进行优先级排序。本文研究一个不同的问题:当微调在多个阶段进行时,当前看起来最优的选择策略是否会使模型后续适应性变差?我们引入LLM数据选择的长期视角,其中选择器不仅通过即时任务性能评估,还通过未来适应速度、遗忘、能力不平衡和分布外鲁棒性评估。我们在统一的多阶段协议下比较了代表性的随机、基于损失、基于梯度、基于多样性、基于质量和基于效用-多样性的选择家族。通过旨在实例化该协议的控制实验,我们展示了短期选择器如何表现出排名反转:它们改善了当前阶段,同时减慢了后续学习并增加了遗忘。我们将这种行为形式化为“短视选择”,提供了其可能发生的简单局部分析,并提出了一个诊断性的长视距感知选择(LHAS)目标函数,该函数在即时效用基础上增加了覆盖度、未来代理迁移和反集中项。该研究认为,数据选择应被评估为一种塑造模型学习轨迹的训练干预,而不仅仅是一种局部数据效率机制。

英文摘要

Data selection is increasingly used to reduce the cost of large language model (LLM) fine-tuning, with recent methods prioritizing samples by current utility, diversity, quality, or influence. This paper studies a different question: when fine-tuning occurs over multiple stages, can selection strategies that look optimal now make the model less adaptable later? We introduce a long-horizon view of LLM data selection in which a selector is evaluated not only by immediate task performance, but also by future adaptation speed, forgetting, capability imbalance, and out-of-distribution robustness. We compare representative random, loss-based, gradient-based, diversity-based, quality-based, and utility-diversity selection families under a unified multi-stage protocol. Through controlled experiments designed to instantiate this protocol, we show how short-term selectors can exhibit rank reversal: they improve the current stage while slowing subsequent learning and increasing forgetting. We formalize this behavior as \emph{myopic selection}, provide a simple local analysis of why it can occur, and propose a diagnostic Long-Horizon Aware Selection (LHAS) objective that augments immediate utility with coverage, future-proxy transfer, and anti-concentration terms. The study argues that data selection should be evaluated as a training intervention that shapes the model's learning trajectory, rather than only as a local data-efficiency mechanism.

2605.30529 2026-06-01 cs.CL cs.AI cs.LG

Generalistic or Specific Embeddings, Which is Better? An Empirical Study on Search for Clinical Coding in Non-English Languages

通用嵌入还是特定嵌入,哪个更好?非英语语言临床编码搜索的实证研究

David Rey-Blanco, Roberto Cruz

发表机构 * TietAI

AI总结 本研究利用大型生成语言模型生成的合成数据微调西班牙语生物医学编码器,构建由双编码器和交叉编码器重排序器组成的两阶段检索器,以缓解非英语临床编码检索中的召回率下降;在多语言评估中,其MRR与BioBERT-ST相当,R@3和R@5更高。

Comments 24 pages, 12 figures, 6 tables

详情
AI中文摘要

用于语义搜索的句子嵌入模型绝大多数是在英语语料库上开发和评估的。当应用于其他语言的临床检索——特别是ICD-10-CM/CIE-10代码的检索——召回率会下降,而这种下降往往被聚合基准所掩盖。我们研究大型生成语言模型是否可以作为数据工厂来缩小这一差距。我们构建了一个两阶段检索器(双编码器后接交叉编码器重排序器),该检索器在Gemini生成的合成数据(涵盖英语、西班牙语、加泰罗尼亚语、意大利语、葡萄牙语和法语)上对西班牙生物医学编码器(PlanTL-GOB-ES/bsc-bio-ehr-es)进行微调,并与BioBERT-ST和未调优的西班牙编码器进行评估。仅双编码器在MRR(0.876 vs. 0.866)上匹配BioBERT-ST,并在R@3(0.650 vs. 0.626)和R@5(0.804 vs. 0.790)上超越它,且无需英语生物医学预训练。添加交叉编码器重排序器将聚合R@5提升至0.822,并在五种语言中的四种上占据主导地位(西班牙语+0.017,加泰罗尼亚语+0.033,法语+0.018,葡萄牙语+0.037),但以英语的小幅回归为代价。这种权衡在临床上是可接受的:葡萄牙语的R@5达到0.829,而BioBERT-ST为0.714。贡献:一个基于LLM生成数据构建领域特定医学检索器的开放配方;学习增益的量化(MRR从0.755到0.876,+15.9%,使用约19,500个合成对);以及按语言和排名对增益集中区域的刻画。

英文摘要

Sentence-embedding models for semantic search are overwhelmingly developed and evaluated on English corpora. When applied to clinical retrieval in other languages -- particularly retrieval of ICD-10-CM / CIE-10 codes -- recall degrades in ways often masked by aggregate benchmarks. We study whether large generative language models can serve as data factories to close this gap. We build a two-stage retriever (bi-encoder followed by cross-encoder reranker), fine-tuned from a Spanish biomedical encoder (PlanTL-GOB-ES/bsc-bio-ehr-es) on Gemini-generated synthetic data covering English, Spanish, Catalan, Italian, Portuguese and French, and evaluate against BioBERT-ST and the un-tuned Spanish encoder. The bi-encoder alone matches BioBERT-ST on MRR (0.876 vs. 0.866) and overtakes it on R@3 (0.650 vs. 0.626) and R@5 (0.804 vs. 0.790) without English biomedical pretraining. Adding a cross-encoder reranker lifts aggregate R@5 to 0.822 and dominates on four of five languages (+0.017 Spanish, +0.033 Catalan, +0.018 French, +0.037 Portuguese) at the cost of a small English regression. The trade-off is clinically acceptable: Portuguese reaches R@5 = 0.829 vs. BioBERT-ST's 0.714. Contributions: an open recipe for building domain-specific medical retrievers from LLM-generated data; quantification of the learning gain (MRR 0.755 to 0.876, +15.9% with ~19,500 synthetic pairs); and a characterisation of where gains concentrate by language and rank.
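The retrieval metrics quoted in this abstract (MRR, R@3, R@5) can be computed as in the sketch below. The abstract does not say how queries with several gold codes are scored, so the definition used here (fraction of relevant codes found in the top k) is an assumption, and the codes and rankings are toy values rather than data from the paper.

```python
def recall_at_k(ranked_codes, relevant_codes, k):
    """Fraction of the relevant codes that appear in the top-k results."""
    top_k = set(ranked_codes[:k])
    return len(top_k & set(relevant_codes)) / len(relevant_codes)


def reciprocal_rank(ranked_codes, relevant_codes):
    """1 / rank of the first relevant code, or 0.0 if none is retrieved."""
    relevant = set(relevant_codes)
    for rank, code in enumerate(ranked_codes, start=1):
        if code in relevant:
            return 1.0 / rank
    return 0.0


def mean_over_queries(values):
    """Average a per-query metric, e.g. reciprocal ranks into MRR."""
    return sum(values) / len(values)


# Toy query: the retriever's ranking and the gold codes (illustrative only).
toy_ranking = ["I10", "E11.9", "J45.909", "E11.65"]
toy_gold = ["E11.9", "E11.65"]
```

Averaging `reciprocal_rank` over all queries with `mean_over_queries` gives MRR; averaging `recall_at_k` per language would give the kind of per-language R@5 breakdown the abstract reports.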

2605.30526 2026-06-01 cs.LG cs.CL

Measuring, Localizing, and Ablating Alignment Signatures in LLMs

测量、定位和消融LLMs中的对齐特征

Aniket Anand, Janvijay Singh, Zhewei Sun, Dilek Hakkani-Tür, Nick Feamster

发表机构 * University of Chicago(芝加哥大学) University of Illinois at Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) Toyota Technological Institute at Chicago(芝加哥丰田技术研究所)

AI总结 研究通过对比人类文本、基模型和对齐模型生成,发现对齐训练引入AI风格特征,并提出PASTA方法通过消融对齐方向来降低AI检测率。

详情
AI中文摘要

对齐语言模型通常表现出可识别的AI风格,但其与后训练和内部表示的联系尚不清楚。本文研究后训练是否引入或放大了AI风格规律,以及这些规律是否具有局部内部特征。为此,我们在匹配的人类源前缀下比较人类文本、基模型生成和对齐模型生成。对齐生成显示出比基生成更低的人类语料库亲和力和更高的AI检测率,表明后训练使生成文本偏离人类语料库风格,转向检测器可见的AI风格文本。然后我们引入PASTA(后训练对齐特征目标消融),一种无需训练的方法,通过对齐-基残差对比估计后训练对齐特征,并在解码过程中消融相应方向。在11个对齐模型和6个AI检测器上,PASTA降低了对大多数对齐模型的检测率;该效果在检测器间良好迁移,且不被随机方向复现。定性分析表明,PASTA生成保持相关性和连贯性,同时表现出更大的风格变化。这些结果共同表明,后训练的AI风格效果可以通过激活消融进行测量、定位和因果测试。

英文摘要

Aligned language models often exhibit a recognizable AI-like style, yet its connection to post-training and internal representations remains poorly understood. In this work, we study whether post-training introduces or amplifies AI-like stylistic regularities and whether these regularities have a localized internal signature. To this end, we compare human text, base-model generations, and aligned-model generations under matched human-source prefixes. Aligned generations show lower human-corpus affinity and higher AI-detection rates than base generations, suggesting that post-training shifts generated text away from human-corpus style and toward detector-visible AI-like text. We then introduce PASTA (Post-training Alignment Signature Targeted Ablation), a training-free method that estimates a post-training alignment signature from aligned-base residual contrasts and ablates the corresponding direction during decoding. Across 11 aligned models and 6 AI detectors, PASTA lowers the detection rate for most aligned models; this effect transfers well across detectors and is not reproduced by random directions. Qualitative analysis suggests that PASTA generations remain relevant and coherent while exhibiting greater stylistic variation. Together, these results show that AI-like stylistic effects of post-training can be measured, localized, and causally tested through activation ablation.
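The core operation behind direction ablation can be sketched in a few lines. The abstract says only that a signature is estimated from aligned-base residual contrasts and that the corresponding direction is ablated during decoding; estimating the direction as a difference of mean hidden states, and every name below, are assumptions for illustration and not the PASTA implementation.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def signature_direction(aligned_states, base_states):
    """Unit vector along mean(aligned) - mean(base).

    Assumption: the contrast is a simple difference of means.
    """
    diff = [a - b for a, b in zip(mean_vector(aligned_states),
                                  mean_vector(base_states))]
    norm = math.sqrt(sum(d * d for d in diff))
    return [d / norm for d in diff]


def ablate(hidden, direction):
    """Remove the component of `hidden` along the unit vector `direction`."""
    coeff = sum(h * d for h, d in zip(hidden, direction))
    return [h - coeff * d for h, d in zip(hidden, direction)]


# Toy residual-stream states (made-up numbers).
toy_aligned = [[2.0, 0.0, 1.0], [4.0, 0.0, 1.0]]
toy_base = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]
toy_direction = signature_direction(toy_aligned, toy_base)
```

In a real model this projection would be applied to hidden states at each decoding step; the random-direction control mentioned in the abstract corresponds to calling `ablate` with a random unit vector instead of the estimated one.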

2605.30524 2026-06-01 cs.LG

Representation Collapse in Sequential Post-Training of Large Language Models

大型语言模型顺序后训练中的表示坍缩

Yichen Liu, Mingyu Chen, Hao Wang, Xiaoran Xu, Chenxi Lin, Rui Zhang, Yutong Zhou, Yuxin Yang, Jiarui Wu, Wei Sun

发表机构 * Hangzhou Dianzi University(杭州电子科技大学) Zhejiang Gongshang University(浙江工商大学) Ningbo University(宁波大学) Shanghai University(上海大学)

AI总结 研究大型语言模型在顺序后训练阶段中内部表示逐渐压缩为低秩、各向异性且同质的特征空间,并提出轻量级干预措施以保持未来可学习性。

Comments work in progress

详情
AI中文摘要

大型语言模型现在通过一系列后训练阶段进行适配,而不是通过单次指令微调。本文研究这种顺序后训练是否逐渐将内部表示压缩为低秩、各向异性且同质的特征空间。我们定义了一套针对隐藏状态、logits、token轨迹和LoRA更新的测量方法,并利用它来分析在受控阶段顺序下的监督微调、偏好优化、安全/拒绝调优、数学和代码专业化以及长思维链调优。中心假设是,过度的表示集中不仅仅是几何上的奇特性:它预示着后期适配中可塑性降低、域外泛化能力减弱以及校准效果变差。我们进一步评估了轻量级干预措施,包括混合域重放、特征刷新、表示多样性正则化和LoRA更新去相关,作为在不放弃后训练行为收益的情况下保持未来可学习性的方法。

英文摘要

Large language models are now adapted through chains of post-training stages rather than through a single instruction-tuning pass. This paper studies whether such sequential post-training gradually compresses internal representations into low-rank, anisotropic, and homogeneous feature spaces. We define a measurement suite for hidden states, logits, token trajectories, and LoRA updates, and we use it to analyze supervised fine-tuning, preference optimization, safety/refusal tuning, math and code specialization, and long chain-of-thought tuning under controlled stage orderings. The central hypothesis is that excessive representation concentration is not merely a geometric curiosity: it predicts reduced plasticity during later adaptation, weaker out-of-domain generalization, and poorer calibration. We further evaluate lightweight interventions, including mixed-domain replay, feature refresh, representation diversity regularization, and LoRA update decorrelation, as ways to preserve future learnability without giving up the behavioral gains of post-training.

2605.30523 2026-06-01 cs.LG cs.AI cs.CC cs.CL cs.FL

Revisiting Padded Transformer Expressivity: Which Architectural Choices Matter and Which Don't

重新审视填充Transformer的表达能力:哪些架构选择重要,哪些不重要

Anej Svete, William Merrill, Ryan Cotterell, Ashish Sabharwal

发表机构 * ETH Zürich(苏黎世联邦理工学院) Allen Institute for AI(艾伦人工智能研究所)

AI总结 本文通过连接布尔电路,系统研究了填充Transformer的表达能力,发现数值精度和模型深度是影响表达能力的主要因素,而注意力类型、模型宽度和均匀性等架构选择对表达能力影响不大。

详情
AI中文摘要

近期工作通过连接布尔电路描述了Transformer能计算和不能计算的内容,但现有结果缺乏精确刻画,且对建模选择敏感。填充Transformer——在其输入后附加填充符号如“...”——通过为自适应并行计算提供多项式空间,成为建立与电路类等价关系的有用工具。然而,目前仅研究了有限的填充Transformer理想化模型,这些等价关系在注意力类型、模型宽度和均匀性变化下的稳健性仍待探索。我们发现,在实际假设下,填充Transformer对所有这些变化都出奇地稳健,并确定数值精度和模型深度是影响表达能力的主要因素。具体地,我们证明多项式填充的L-均匀常数精度Transformer等价于L-均匀AC⁰,而增长精度的Transformer达到L-均匀TC⁰,与宽度无关。此外,循环机制允许类似电路的顺序处理:log^d N次循环的常数精度Transformer达到FO-均匀AC^d,增长精度的达到FO-均匀TC^d。有趣的是,宽度或精度超过对数增长并不会增加表达能力,且我们所有结果对softmax和平均硬注意力Transformer均成立。

英文摘要

Recent work describes what transformers can and cannot compute through connections to boolean circuits, but existing results lack exact characterizations and are sensitive to modeling choices. Padded transformers -- to whose input filler symbols such as ``...'' are appended -- emerge as a useful gadget for establishing equivalences to circuit classes by providing polynomial space for adaptive parallel computation. However, only a limited set of padded transformer idealizations has been studied, leaving open how robustly these equivalences hold under changes to attention type, model width, and uniformity. We find that, under practical assumptions, padded transformers are surprisingly robust to all of these, and identify numeric precision and model depth as the main factors affecting expressivity. Concretely, we prove that polynomially padded $\text{L-uniform}$ constant-precision transformers are equivalent to $\text{L-uniform AC}^0$, while growing-precision ones achieve $\text{L-uniform TC}^0$ regardless of width. Furthermore, looping enables sequential processing analogous to circuits: $\log^d N$-looped constant-precision transformers reach $\text{FO-uniform AC}^d$, and growing-precision ones reach $\text{FO-uniform TC}^d$. Interestingly, growing width or precision beyond logarithmic does not increase expressivity, and all our results hold for both softmax and average hard attention transformers.

2605.30519 2026-06-01 cs.CV

OmniMem: Scalable and Adaptive Memory Retrieval for Long Video Generation

OmniMem: 用于长视频生成的可扩展自适应记忆检索

Lin Zhao, Yushu Wu, Yifan Gong, Yanzhi Wang, Pu Zhao

发表机构 * Northeastern University(东北大学) Adobe Research

AI总结 提出OmniMem框架,通过自适应窗口排除和查询共享KV选择等机制,在自回归视频生成中实现显式全范围稀疏KV检索,显著提升长视频动态程度并保持一致性。

Comments 22 pages, 14 figures; project page: https://wuyushuwys.github.io/OmniMem/

详情
AI中文摘要

自回归(AR)视频生成通过顺序生成潜在块来扩展视频,但扩展到长视频需要重复访问不断增长的历史KV缓存。现有方法通过截断KV缓存或将其压缩为隐式记忆来降低这一成本,但两者都失去了对查询相关历史细节的显式访问。我们提出OmniMem,一个显式全范围记忆检索框架,对历史缓存执行稀疏KV检索。为了使其在基于块的自回归视频生成中实用,OmniMem解决了两个问题:(i)稀疏KV选择中的局部偏差和(ii)记忆访问中的并集爆炸(Union Explosion)。自适应窗口排除在存在足够长距离历史时从选择候选者中移除局部窗口块,为信息丰富的长距离检索保留稀疏预算。查询共享KV选择减少了跨查询的多样性,而每头分散KV访问避免了将特定于头的选择扩展为大的选定KV缓冲区。这使得每个注意力头可以根据自己的选择模式检索非连续的KV块。长视频生成实验表明,相比强基线,OmniMem将动态程度(Dynamic Degree)提高了52.3%并保持了较强的一致性,同时内存使用量相当。

英文摘要

Autoregressive (AR) video generation extends videos by producing latent chunks sequentially, but scaling to long videos requires repeated access to a growing historical KV cache. Existing methods reduce this cost by truncating the KV cache or compressing it into implicit memory, but both lose explicit access to query-relevant historical details. We propose OmniMem, an explicit full-range memory retrieval framework that performs sparse KV retrieval over the historical cache. To make this practical for chunk-based AR video generation, OmniMem addresses two issues: (i) local bias in sparse KV selection and (ii) Union Explosion in memory access. Adaptive Window Exclusion removes local-window blocks from the selection candidates when sufficient long-range history is available, preserving the sparse budget for informative long-range retrieval. Query-Shared KV Selection reduces cross-query diversity, while Per-Head Scattered KV Access avoids expanding head-specific selections into a large selected KV buffer. This allows each attention head to retrieve non-contiguous KV blocks according to its own selection pattern. Experiments on long-video generation show that OmniMem improves Dynamic Degree by 52.3% and preserves strong consistency over strong baselines, while maintaining comparable memory usage.
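The idea behind Adaptive Window Exclusion can be illustrated with a small block-selection sketch. The abstract does not define when long-range history counts as "sufficient" or how blocks are scored, so the threshold rule, the function name, and the toy scores below are all assumptions; the sketch covers candidate selection only, not the attention computation.

```python
def select_kv_blocks(block_scores, local_window, budget):
    """Pick up to `budget` historical KV blocks by relevance score.

    block_scores: one score per historical block, oldest first.
    local_window: number of most recent blocks forming the local window.
    Assumption: history is "sufficient" when at least `budget` blocks
    lie outside the local window.
    """
    n = len(block_scores)
    long_range = n - local_window
    if long_range >= budget:
        candidates = range(long_range)  # exclude local-window blocks
    else:
        candidates = range(n)           # too little history: keep all blocks
    top = sorted(candidates, key=lambda i: block_scores[i], reverse=True)
    return sorted(top[:budget])         # indices may be non-contiguous


# Toy scores: the two newest blocks score highest (local bias).
toy_scores = [0.1, 0.9, 0.3, 0.8, 0.95, 0.99]
```

With the toy scores, plain top-2 selection would spend the whole budget on the two newest blocks, while excluding the local window redirects it to older blocks 1 and 3.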

2605.30514 2026-06-01 cs.LG cs.CL

MAAT: Multi-phase Adapter-Aware Targeted Unlearning

MAAT: 多阶段适配器感知的定向遗忘学习

Suryash Yagnik, Shubham Gaur, Saksham Thakur, Vinija Jain, Aman Chadha, Amitava Das

发表机构 * Indian Institute of Information Technology, Bhopal, India(印度信息技术学院博帕尔分校) University of California, Santa Cruz, USA(加州大学圣克鲁兹分校) Independent Researcher(独立研究者) Stanford University, USA(斯坦福大学) BITS Pilani Goa, India(比尔拉理工学院皮拉尼果阿校区)

AI总结 针对现有机器遗忘评估中因果知识(Why类)样本极少导致评估失衡的问题,提出5WBENCH平衡基准和MAAT多阶段框架,首次在Why类知识上同时实现高遗忘与高保留。

Comments 16 pages, 4 figures, 10 tables

详情
AI中文摘要

机器遗忘评估在结构上存在偏差:Why类问题(探究因果和关系知识)在CounterFact中占比不足0.06%,在ZSRE中占0.6%,在TOFU、MUSE和WMDP-Cyber中占不到1.3%。这种近乎为零的占比意味着,在因果知识上失败的方法仍可以在整体上获得高分,而这种失败在没有平衡评估的情况下无法被检测到。我们提出了5WBENCH,一个平衡的5000样本基准,每个5W类别(谁、什么、何时、何地、为什么)包含1000个样本,首次使因果遗忘失败变得可量化。使用5WBENCH,我们表明现有基线方法无法在Why类问题上同时实现高遗忘和高保留:激进的遗忘会损害保留知识,而保守的方法则无法遗忘因果事实。Why类问题的难度源于多跳推理链(Why条目占44%,其他类别≤2%)以及在40.1个token的答案跨度上的梯度稀释。我们提出了MAAT(多阶段适配器感知的定向遗忘学习),一个三阶段框架,作用于LoRA适配器权重,结合梯度投影上升、SVD秩维度剪枝、任务向量否定以及混合KL-隐藏状态保留修复。MAAT是第一个在Why类因果知识上同时实现高遗忘和高保留的方法,在遗忘-保留帕累托前沿上达到了新的操作点。我们公开了代码。

英文摘要

Machine unlearning evaluation is structurally skewed: Why-type questions, which probe causal and relational knowledge, comprise less than 0.06% of CounterFact, 0.6% of ZSRE, and less than 1.3% of TOFU, MUSE, and WMDP-Cyber. This near-zero representation means that methods that fail on causal knowledge can score highly in aggregate, and this failure is undetectable without balanced evaluation. We present 5WBENCH, a balanced 5,000-sample benchmark with 1,000 examples per 5W category (Who, What, When, Where, Why), making causal unlearning failures quantifiable for the first time. Using 5WBENCH, we show that no existing baseline simultaneously achieves high forgetting and high retention on Why-type questions: aggressive forgetting degrades retained knowledge, while conservative methods fail to forget causal facts. Why-type difficulty stems from multi-hop reasoning chains (44% of Why entries vs. less than or equal to 2% for others) and gradient dilution over 40.1-token answer spans. We present MAAT (Multi-phase Adapter-Aware Targeted Unlearning), a three-phase framework operating on LoRA adapter weights, combining gradient-projected ascent, SVD rank-dimension pruning, task vector negation, and hybrid KL-hidden-state retain repair. MAAT is the first method to simultaneously achieve high forgetting and high retention on Why-type causal knowledge, reaching a new operating point on the forget-retain Pareto frontier. We make our code publicly available.