2606.03005 2026-06-03 cs.CV cs.AI

Pretraining Language Models on Historical Text

Xiaoxi Luo, Zachary Shinnick, Niclas Griesshaber, Yixuan Wang, Junchi Yu, Freda Shi, Philip Torr, Yao Lu

Affiliations: University of Waterloo; Vector Institute; AIML, Adelaide University; Department of Engineering Science, University of Oxford; Oxford Centre for Economic and Social History, University of Oxford; Department of Computer Science, University College London

AI summary: Proposes TypewriterLM, a 7.24B historical language model trained only on English text predating 1913. It addresses challenges in data quality, temporal leakage, training, and evaluation by building the TypewriterCorpus corpus, introducing a lexically grounded instruction tuning framework, and providing the History-Event benchmark suite.

Abstract

We introduce TypewriterLM, a 7.24B historical language model (LM) trained exclusively on English text predating 1913. Developing historical LMs requires addressing challenges in data quality and availability, preventing temporal leakage, designing temporally consistent post-training pipelines, and constructing reliable evaluations. To address these issues, we construct TypewriterCorpus, a 54B-token historical corpus collected from diverse archival and linguistically annotated sources with extensive data cleaning and leakage mitigation procedures. Furthermore, we introduce lexically grounded instruction tuning, a post-training framework that constrains responses to remain directly grounded in historical source documents. Using this framework, we construct two historical instruction tuning datasets: History-LIMA and History-SelfInstruct. To evaluate capability and temporal consistency, we introduce History-Event, a benchmark suite for evaluating competence, temporal grounding, and data leakage. We release TypewriterLM and all associated resources to support future research on historical language models.

2606.02981 2026-06-03 cs.CL

Predicting Inference-Time Scaling Gains from Labeled Validation-Set Output Statistics

Luyang Zhang, Jingyan Li

Affiliations: Carnegie Mellon University; Johns Hopkins University

AI summary: Proposes a lightweight method based on labeled validation-set output statistics. It combines three core features (prompt-level agreement spread, label-assisted first-correct-sample position, and completion-length variance) with an entropy feature and uses ridge regression to predict best-of-N inference scaling gains, reaching a Spearman correlation of ρ = 0.90.

Abstract

Best-of-$N$ inference scaling (drawing $N$ candidate answers from a language model and returning the one a reward model ranks highest) improves accuracy by an amount that varies across models, but predicting that amount in advance currently requires running the procedure end-to-end. Prior work links cheap statistics of a model's sampled outputs and validation-set correctness (how often samples agree, how diverse they are, how confident the model is, and where correct samples appear) to model behavior, but does not isolate which of these form a stable, compact predictor of best-of-$N$ gain. We fit ridge predictors on features computed from a single labeled validation-set sampling pass, use bootstrap-Lasso as a stability analysis of the candidate feature set, and give a concentration analysis with an explicit linear-approximation residual. Across three base-model families, six post-training methods, and math and reasoning task domains, the stability analysis identifies a strict three-feature core spanning prompt-level agreement spread, label-assisted first-correct-sample position, and completion-length variance; a compact ridge predictor built from this core plus an entropy add-on reaches Spearman $\rho = 0.90$ with actual best-of-$N$ gain under a reward-model verifier. The intended use is labeled validation-set screening of candidate configurations before paying the full reward-model scoring cost.
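The prediction step in this abstract (a ridge regressor on a handful of features, scored by Spearman rank correlation) can be sketched without any dependencies. This is only an illustration under stated assumptions: each candidate configuration is assumed to be already summarised as a feature vector, and the function names, the `alpha` regularisation strength, and the feature layout are hypothetical rather than taken from the paper.

```python
import math
from typing import List, Sequence, Tuple


def solve_linear(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_ridge(X: Sequence[Sequence[float]], y: Sequence[float],
              alpha: float = 1.0) -> Tuple[List[float], float]:
    """Closed-form ridge fit on centred data; the intercept is not penalised."""
    n, d = len(X), len(X[0])
    mean_x = [sum(row[j] for row in X) / n for j in range(d)]
    mean_y = sum(y) / n
    xc = [[row[j] - mean_x[j] for j in range(d)] for row in X]
    yc = [v - mean_y for v in y]
    gram = [[sum(r[i] * r[j] for r in xc) + (alpha if i == j else 0.0)
             for j in range(d)] for i in range(d)]
    rhs = [sum(r[i] * t for r, t in zip(xc, yc)) for i in range(d)]
    w = solve_linear(gram, rhs)
    b = mean_y - sum(wi * mi for wi, mi in zip(w, mean_x))
    return w, b


def predict(X: Sequence[Sequence[float]], w: Sequence[float], b: float) -> List[float]:
    return [sum(wi * xi for wi, xi in zip(w, row)) + b for row in X]


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out


def spearman(a: Sequence[float], b: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the rank vectors."""
    ra, rb = ranks(a), ranks(b)
    ma, mb = sum(ra) / len(ra), sum(rb) / len(rb)
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    va = sum((x - ma) ** 2 for x in ra)
    vb = sum((y - mb) ** 2 for y in rb)
    if va == 0 or vb == 0:
        return float("nan")  # undefined when one input is constant
    return cov / math.sqrt(va * vb)
```

In a screening workflow along these lines, one would fit on configurations whose best-of-$N$ gain has already been measured, then rank new configurations by `predict` and check agreement on held-out configurations with `spearman`.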

2606.02980 2026-06-03 cs.SD cs.CY

A Training-Efficient Transformer-Based Anti-Spoofing Network for Logical Access in ASVspoof 5

Sidan Yin, Bo Zhao

Affiliations: Nanyang Technological University

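AI summary: For the ASVspoof 5 Track 1 closed condition, proposes TFPARN, a network that combines a focal classification loss with a pairwise ranking loss and uses a Transformer encoder with attention pooling for efficient anti-spoofing. It outperforms AASIST and RawNet2 on minDCF and EER, with lower inference memory and shorter training time.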

Comments 11 pages, 2 figures

Abstract

Synthetic and manipulated speech can reduce the reliability of automatic speaker verification systems, so anti-spoofing methods need to be both accurate and efficient in training and inference. This paper focuses on the ASVspoof 5 Track 1 closed condition, where standard cross-entropy training may not give enough attention to hard trials and is not directly aligned with ranking- and threshold-based evaluation metrics. We propose TFPARN, a Transformer-based focal-pairwise attentive ranking network. The system extracts log-Mel features from speech, uses a Transformer encoder to model frame-level information, applies attention pooling to obtain utterance-level representations, and is trained with a combination of focal classification loss and pairwise ranking loss. RawBoost augmentation is used during training, and test-time augmentation is applied during evaluation to improve robustness. Compared with re-implemented AASIST and RawNet2 baselines under the same protocol, TFPARN achieves the best results, with a minDCF of 0.2430 and an EER of 12.52%. Ablation experiments further show that the pairwise loss, focal loss, and attention pooling all improve performance. TFPARN also uses the lowest inference memory among the compared systems, at 1.4 GB, runs at about 0.79 ms per utterance, and reaches its best checkpoint in less training time than AASIST. These results show that TFPARN provides a good balance between detection accuracy and computational cost for logical access anti-spoofing.
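The training objective named in this abstract, a focal classification loss combined with a pairwise ranking loss, can be illustrated with a small sketch. The abstract does not give the exact formulas, so everything specific here is an assumption: label 1 is taken to mean bona fide speech and 0 spoofed speech, the ranking term is a hinge over all (bona fide, spoof) pairs, and `gamma`, `margin`, and the mixing weight `lam` are hypothetical parameters.

```python
import math
from typing import Sequence


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def focal_loss(logits: Sequence[float], labels: Sequence[int], gamma: float = 2.0) -> float:
    """Mean binary focal loss; gamma = 0 reduces to plain cross-entropy."""
    total = 0.0
    for z, y in zip(logits, labels):
        p = sigmoid(z)
        p_t = p if y == 1 else 1.0 - p  # probability assigned to the true class
        total += -((1.0 - p_t) ** gamma) * math.log(max(p_t, 1e-12))
    return total / len(logits)


def pairwise_ranking_loss(logits: Sequence[float], labels: Sequence[int],
                          margin: float = 1.0) -> float:
    """Mean hinge loss over all (positive, negative) pairs in the batch.

    Each positive should out-score each negative by at least `margin`.
    """
    pos = [z for z, y in zip(logits, labels) if y == 1]
    neg = [z for z, y in zip(logits, labels) if y == 0]
    if not pos or not neg:
        return 0.0  # no pairs can be formed from a single-class batch
    total = sum(max(0.0, margin - (p - n)) for p in pos for n in neg)
    return total / (len(pos) * len(neg))


def combined_loss(logits: Sequence[float], labels: Sequence[int],
                  lam: float = 0.5, gamma: float = 2.0, margin: float = 1.0) -> float:
    """Focal loss plus a weighted pairwise ranking term (weighting is an assumption)."""
    return focal_loss(logits, labels, gamma) + lam * pairwise_ranking_loss(logits, labels, margin)
```

The focal factor down-weights trials the model already classifies confidently, while the pairwise term pushes score ordering directly, which is the property that ranking- and threshold-based metrics such as EER depend on.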

2606.02979 2026-06-03 cs.CV cs.AI cs.RO

Towards Compact Autonomous Driving Perception with Balanced Learning and Multi-sensor Fusion

Oskar Natan, Jun Miura

Affiliations: Department of Computer Science and Engineering, Toyohashi University of Technology; Department of Computer Science and Electronics, Gadjah Mada University

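AI summary: Proposes a compact deep multi-task learning model that uses adaptive loss weighting and intermediate sensor fusion to handle semantic segmentation, depth estimation, LiDAR segmentation, and bird's-eye-view projection in a single forward pass for efficient autonomous driving perception.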

Comments This work has been accepted for publication in IEEE Transactions on Intelligent Transportation Systems. https://ieeexplore.ieee.org/document/9712213

DOI: 10.1109/TITS.2022.3149370

Abstract

We present a novel compact deep multi-task learning model that handles various autonomous driving perception tasks in one forward pass. The model simultaneously performs multi-view semantic segmentation, depth estimation, light detection and ranging (LiDAR) segmentation, and bird's eye view projection without support from other models. We also provide an adaptive loss weighting algorithm to tackle the imbalanced learning issue that arises from the large number of tasks. Through data pre-processing and intermediate sensor fusion techniques, the model can process and combine multiple input modalities retrieved from RGB cameras, dynamic vision sensors (DVS), and LiDAR placed at several positions on the ego vehicle, which gives it a better understanding of a dynamically changing environment. In the ablation study, the model variant trained with our proposed method achieves better performance. A comparative study against combinations of some recent models further clarifies its performance and effectiveness: our model maintains better performance even with far fewer parameters, so it can run inference faster with less GPU memory. Moreover, the results tend to be consistent across 3 different CARLA simulation datasets and 1 real-world nuScenes-lidarseg dataset. To support future research, we share code and other files publicly at https://github.com/oskarnatan/compact-perception.

2606.02976 2026-06-03 cs.CL

Memory Retrieval for Changing Preferences

Yuehan Qin, Li Li, Linxin Song, Wei Yang, Jiate Li, Yuqing Yang, Yue Zhao

Affiliations: University of Southern California

AI summary: Proposes a unified Bayes-factor-based framework for memory access and selection in long-context dialogue systems, which quantifies how much evidence each historical turn provides about the user's latent preference state.

Abstract

Long-context dialogue systems must decide both when to access memory and which parts of the interaction history are relevant. Existing approaches typically rely on heuristic retrieval signals or always-on memory usage, failing to account for the changing and potentially inconsistent nature of user preferences. In this work, we propose a unified framework for memory access and selection based on changing preferences. We formulate personalized memory retrieval as identifying which historical turns provide evidence about a user's latent preference state, rather than relying on surface-level semantic similarity. To this end, we quantify the utility of each memory turn using a Bayes factor, defined as the improvement in the model's likelihood of the reference response when the turn is included in context. This provides a principled measure of evidence strength and a unified signal for both memory access and selection. By framing memory retrieval as utility estimation, the model learns to identify salient turns and regulate memory usage based on expected utility. Experiments on four heterogeneous memory benchmarks show that our approach outperforms existing embedding-based retrieval on long-context, preference-intensive tasks where modeling changing preferences is essential, while remaining competitive in low-density regimes where semantic similarity suffices.
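The Bayes-factor utility described in this abstract can be written down as a short sketch. It assumes the log form, i.e. the utility of a turn is log p(reference | query, turn) minus log p(reference | query), and it assumes some scorer that returns the log-likelihood of a reference response given a context. The `toy_loglik` word-overlap scorer, the function names, and the `threshold` and `top_k` parameters are hypothetical stand-ins; in practice the scorer would be a language model.

```python
from typing import Callable, List, Sequence, Tuple

# (memory turns placed in context, query, reference response) -> log-likelihood
LogLik = Callable[[Sequence[str], str, str], float]


def log_bayes_factor(loglik: LogLik, turn: str, query: str, reference: str) -> float:
    """Gain in reference log-likelihood from adding one memory turn to the context."""
    return loglik([turn], query, reference) - loglik([], query, reference)


def select_turns(loglik: LogLik, history: Sequence[str], query: str, reference: str,
                 threshold: float = 0.0, top_k: int = 3) -> List[Tuple[float, str]]:
    """Keep the turns whose utility exceeds `threshold`, best first.

    An empty result can be read as "do not access memory for this query",
    so one signal covers both access and selection.
    """
    scored = [(log_bayes_factor(loglik, turn, query, reference), turn) for turn in history]
    useful = [item for item in scored if item[0] > threshold]
    useful.sort(key=lambda item: item[0], reverse=True)
    return useful[:top_k]


def toy_loglik(memory: Sequence[str], query: str, reference: str) -> float:
    """Stand-in scorer: minus the number of reference words absent from the context."""
    context = set(" ".join(list(memory) + [query]).lower().split())
    return -float(sum(1 for word in reference.lower().split() if word not in context))


history = ["I now prefer tea over coffee", "The weather was nice yesterday"]
selected = select_turns(toy_loglik, history, "What should I drink?", "You prefer tea")
```

Because the utility needs a reference response, scores of this kind are presumably available only where references exist (for example, when building training signals), which matches the abstract's framing that the model learns to estimate utility.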

2606.02973 2026-06-03 cs.CL

Chatbots Output Meaningful (but Problematic) Language

Matthew Stone, Una Stojnić

Affiliations: University of Parma; University of Cambridge; University of Pittsburgh; Franklin and Marshall College

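AI summary: Argues that the outputs of large language models (LLMs) are meaningful without any need to posit mental states or intentions, and discusses the implications of this view for theories of language and AI ethics.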

Comments 49 pages

Abstract

Are utterances by AI chatbots meaningful? Concretely, if a user asks, say, Anthropic's agent Claude, "What is the capital of Spain?" and Claude answers, "Madrid is the capital of Spain," does that sentence have its ordinary meaning -- and does it express a true proposition? Most ordinary users, as well as AI engineers, take the answer to be trivially "yes." However, many cognitive scientists, linguists, and philosophers of language argue that dominant intentionalist accounts of language and meaning deliver the opposite conclusion. Theorists more sympathetic to ordinary users' intuitions have therefore advocated a radical "de-anthropomorphization" of language, revising our understanding of mental states, intentions, and semantic content to capture the intuition that the outputs of LLMs are meaningful. We take a different approach. While we, too, argue that LLM outputs are meaningful, we contend that a proper theory of human language already applies, as is, to current chatbots. Meaning is a low bar: claiming that LLM outputs are meaningful does not require positing mental states, intentions, rationality, or the cognitive capacities requisite for communication in LLMs -- or, indeed, making any other anthropomorphic assumptions. People do have communicative intentions (typically successful ones), but nevertheless, even in humans, language production can depart from what the speaker has in mind. Our view has important consequences for how we should theorize about -- and critically engage with -- both human linguistic output and synthetically generated text. In particular, to say that chatbots produce meaningful text is not by any means to endorse what they output, or to assume that the technology is (or is not) good, powerful, appropriate, or useful.

2606.02971 2026-06-03 cs.CL

EURO-5K: When Does Domain Pretraining Matter? Benchmarking Transformers for EU Reporting Obligation Extraction

Marios Koniaris, Vasileios Kotronis, Eugenia Giannini, Panayiotis Tsanakas

Affiliations: Division of Computer Science, School of Electrical and Computer Engineering; National Technical University of Athens; Department of Humanities Social Sciences and Law, School of Applied Mathematical and Physical Sciences

AI summary: Builds the EURO-5K dataset and compares discriminative and generative models on EU reporting obligation extraction, finding that domain pretraining helps clearly under parameter-efficient fine-tuning and that the models act as specialised extractors.

Abstract

Extracting reporting obligations from EU legislation is critical for assessing and reducing regulatory reporting burden. However, distinguishing reporting requirements from structurally similar provisions requires specialised legal understanding. Current legal NLP methods lack specialised datasets with clear guidelines and comparative evaluation of extraction paradigms and domain adaptation strategies. We curate EURO-5K, a corpus of sentence-level reporting obligations and challenging negative examples from 136 EU legislative acts. On this dataset, we train and compare discriminative token-classification models (BERT-style) and generative span-extraction models (LLMs), evaluating both full fine-tuning and parameter-efficient QLoRA against baselines (pattern and dependency-based extraction, few-shot prompting). Results show that fully fine-tuned generic and legal BERT models achieve similar performance (0.89 F1), while fine-tuned LLMs match encoder accuracy for sentence-level extraction. Legal pretraining offers only small gains for generative models. In contrast, it is clearly beneficial when adaptation capacity is constrained, as parameter-efficient tuning of Legal-BERT outperforms its generic counterpart. Learning curve analysis demonstrates that legal pretraining accelerates early learning with minimal data. All approaches converge around 3K samples with diminishing returns thereafter, validating dataset sufficiency. Cross-dataset evaluation on two external regulatory corpora shows that our models behave as specialised reporting obligation extractors rather than generic regulatory classifiers. We release EURO-5K, trained models, and an interactive demo with explainability visualizations and structured RDF export. These demonstrate that both paradigms and parameter-efficient training provide practical tools for regulatory compliance automation.

2606.02969 2026-06-03 cs.RO math.OC

Hand Trajectory Fusion for Egocentric Natural Language Query Localization

Enmin Zhong, Carlos R. del-Blanco, Fernando Jaureguizar, Narciso García

Affiliations: Grupo de Tratamiento de Imágenes (GTI), Information Processing and Telecommunications Center, ETSI Telecomunicación, Universidad Politécnica de Madrid, Spain

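AI summary: For natural language query localization in egocentric video, proposes a hand trajectory encoder with adaptive gated cross-attention fusion, using hand motion information to improve query localization performance.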

Comments Accepted for the poster session at the Egocentric Vision (EgoVis) Workshop in Conjunction with CVPR 2026

2606.02959 2026-06-03 cs.LG cs.CR

Gate AI: LLM Security Benchmark Evaluation Methodology and Results

Ryle Goehausen, Marcus Sousa

Affiliations: Constellation Network

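AI summary: To address per-dataset threshold tuning and undisclosed operating points in evaluations of prompt-injection and jailbreak detectors, proposes an evaluation harness that uses 5-fold cross-validation, a single global operating point, and a range of generalisation diagnostics, tested on 16 public benchmarks.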

Comments 17 pages, 23 figures, 2 tables. Working preprint; subsequent versions may update benchmark numbers as the framework evolves

Abstract

Published evaluations of prompt-injection and jailbreak detectors for Large Language Models often suffer from two systematic weaknesses: per-dataset threshold tuning and undisclosed operating points. We describe an evaluation harness that addresses both. The detector under evaluation is scored across 16 public benchmarks (12,111 samples) using 5-fold cross-validation. StratifiedKFold (by row) is the headline pass; a parallel StratifiedGroupKFold pass over a composite key (parent-prompt id plus MinHash + LSH near-duplicate clusters at Jaccard $\gtrsim 0.8$) runs alongside it as a leakage-premium diagnostic. A single global operating point is selected on the held-out folds (max F1 subject to FPR $\leq 1\%$) and applied uniformly to every dataset, so per-dataset results reflect one threshold rather than per-benchmark optimisation. Generalisation is examined through a battery of diagnostics (leave-one-dataset-out cross-validation, a random-label control, adversarial validation, permutation feature importance, length-bias correlation, classifier-head agreement, cross-source near-duplicate detection, threshold transferability, train-vs-OOF agreement, and a paraphrase-invariance probe), most with a quantitative pass threshold and the remainder with a stated failure mode. For every external comparison, the detector's threshold is re-tuned to the competitor's published false-positive rate so head-to-head values are evaluated at matched operating points.
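The operating-point rule stated in this abstract (maximise F1 subject to FPR ≤ 1%, then apply that one threshold to every dataset) can be sketched as follows. This is a minimal illustration, not the harness itself: the function names are hypothetical, scores are assumed to be higher for attacks, candidate thresholds are simply the observed scores, and the pooled held-out predictions are assumed to be given.

```python
from typing import Optional, Sequence, Tuple


def f1_and_fpr(scores: Sequence[float], labels: Sequence[int],
               threshold: float) -> Tuple[float, float]:
    """F1 and false-positive rate when predicting positive for score >= threshold."""
    tp = fp = fn = tn = 0
    for score, label in zip(scores, labels):
        predicted = score >= threshold
        if predicted and label == 1:
            tp += 1
        elif predicted:
            fp += 1
        elif label == 1:
            fn += 1
        else:
            tn += 1
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return f1, fpr


def select_global_threshold(scores: Sequence[float], labels: Sequence[int],
                            max_fpr: float = 0.01) -> Optional[Tuple[float, float, float]]:
    """Return (threshold, f1, fpr) with the highest F1 among thresholds meeting the FPR cap.

    Returns None when no candidate threshold satisfies the cap.
    Ties in F1 keep the lowest qualifying threshold.
    """
    best = None
    for threshold in sorted(set(scores)):
        f1, fpr = f1_and_fpr(scores, labels, threshold)
        if fpr <= max_fpr and (best is None or f1 > best[1]):
            best = (threshold, f1, fpr)
    return best


# Pooled held-out scores (toy values); the chosen threshold is then reused as-is.
pooled_scores = [0.1, 0.2, 0.3, 0.9, 0.8, 0.7, 0.4]
pooled_labels = [0, 0, 0, 1, 1, 1, 0]
operating_point = select_global_threshold(pooled_scores, pooled_labels)
```

Per-dataset numbers would then come from calling `f1_and_fpr` on each dataset with `operating_point[0]`, never from re-running the selection per benchmark.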

2606.02956 2026-06-03 cs.CV cs.LG cs.RO

The Road Ahead in Autonomous Driving: The KITScenes Multimodal Dataset

Richard Schwarzkopf, Fabian Immel, Alexander Blumberg, Jonas Merkert, Nils Rack, Kaiwen Wang, Fabian Konstantinidis, Julian Truetsch, Carlos Fernandez, Annika Bätz, Kevin Rösch, Marlon Steiner, Willi Poh, Yinzhe Shen, Royden Wagner, Felix Hauser, Dominik Strutz, Jaime Villa, Gleb Stepanov, Holger Caesar, Ömer Şahin Taş, Frank Bieder, Jan-Hendrik Pauls, Christoph Stiller

Affiliations: FZI Research Center for Information Technology; Karlsruhe Institute of Technology; University Charles III of Madrid; Delft University of Technology

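AI summary: Presents the KITScenes Multimodal dataset, which uses high-fidelity sensors and complete HD maps to address the shortfalls of existing datasets in sensor fidelity, map completeness, and geographic diversity, and introduces four benchmarks to advance spatial learning.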

Comments 28 pages, 21 figures

Abstract

Existing autonomous driving datasets have enabled major progress, but fall short in sensor fidelity, map completeness, or geographic diversity. We present KITScenes Multimodal, a European dataset built around high-fidelity sensors and maps. Our fully synchronized sensor suite combines high-resolution global-shutter cameras, long-range lidar beyond 400m, 4D imaging radar, and redundant GNSS/INS localization. Our HD maps are, to our knowledge, the most complete of any sensor dataset, validated through autonomous driving trials on open-source software. For the first time in a public dataset, all driving-relevant traffic elements, such as traffic lights, are mapped in 3D to a reprojection-accurate level with full topological connectivity. Recorded in cities with irregular street layouts and mixed traffic modes, our dataset complements existing datasets by broadening the available geographic diversity. We also introduce four benchmarks, each advancing spatial learning for embodied AI: online HD map construction, long-range depth estimation, novel view synthesis, and end-to-end driving. Project page: https://kitscenes.com/

2606.02953 2026-06-03 cs.CL

Linguistic Productivity in Large Language Models: Models Coerce, but do not Preempt

Claire Bonial, Claire Benet Post, Laura Michaelis, Harish Tayyar Madabushi

Affiliations: Georgetown University; University of Colorado Boulder; University of Bath

AI summary: Tests whether large language models are shaped by two statistical signals, entrenchment (high-frequency usage) and preemption (never observing a structure where it would be expected), and finds that models recognize constructional productivity in cases of coercion but cannot use negative evidence to avoid overgeneralization.

Abstract

Usage-based theories of grammars posit that creative productivity of the structures of language is both bolstered and constrained by two distinct frequency signals: entrenchment, stemming from high frequency usage, and preemption, stemming from having never observed a particular linguistic structure in a context where one might expect that structure to appear. Large Language Models are also usage-based, in the sense that the structures of language are learned through exposure to vast amounts of text. Here, we test whether or not the opposing statistical forces of entrenchment and preemption also encourage and constrain linguistic productivity in LLMs. We demonstrate across model architectures that larger models recognize and can reproduce with nonce words constructional productivity (entrenchment) in cases of coercion, wherein the broader constructional context coerces an atypical interpretation of a lexical item. However, we also show that even the largest models do not extend negative evidence to novel language, and statistical preemption does not enable models to avoid overgeneralization of patterns that are semantically felicitous, but never observed in data.

2606.02951 2026-06-03 cs.RO cs.AI cs.CL cs.CV cs.HC

SCOPE: Real-Time Natural Language Camera Agent at the Edge

Nikolaj Hindsbo, Sina Ehsani, Pragyana Mishra

Affiliations: Armada AI

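AI summary: Proposes SCOPE, a modular agent for natural-language control of PTZ cameras that performs real-time perception, planning, and control in edge deployments, and evaluates latency, accuracy, and error modes in simulation and on physical hardware.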

Comments 9 pages, 4 figures, 6 tables. Accepted at HRI '26 (21st ACM/IEEE International Conference on Human-Robot Interaction), Edinburgh, Scotland, March 16--19, 2026. Code: https://github.com/HindsboNikolaj/SCOPE

DOI: 10.1145/3757279.3785641
Journal ref: Proceedings of the 21st ACM/IEEE International Conference on Human-Robot Interaction (HRI '26), ACM, 2026

Abstract

Deploying language-driven agents in robotics requires evaluations that reflect real-world task demands: natural-language instructions with reproducible outcomes. Such agents must connect language models to callable perception and control tools, and be assessed using deployment-critical metrics including latency, accuracy, and error modes. We present SCOPE (Simulation and Camera Operations for Perception and Evaluation), a modular agent for natural-language, open-vocabulary pan-tilt-zoom (PTZ) camera control and visual scene understanding, designed explicitly for edge deployment. SCOPE operates both in a Blender-based simulation environment and on a physical PTZ camera, executing all perception, planning, and control locally at the deployment site using edge-accessible compute. We release a 536-task benchmark spanning QA, single- and multi-step commands, counting, spatial reasoning, descriptions, and optical character recognition in a Blender-based simulation environment that exposes realistic PTZ control affordances. Execution traces are combined with an LM-as-Judge to evaluate latency, accuracy, and error modes. We evaluate 19 planner-perception model combinations pairing Qwen3 small language models (SLMs) with Moondream and Qwen vision-language models (VLMs). Stronger SLMs substantially reduce hallucinations and improve tool routing, leading to more reliable closed-loop behavior. Once a sufficiently capable SLM is used, perception becomes the dominant performance bottleneck. Mixture-of-Experts models on both the planning and perception side consistently match or exceed dense alternatives at latencies and memory footprints comparable to much smaller networks. Quantization provides additional efficiency gains with minimal accuracy degradation, identifying a practical, sim-to-real validated design point for real-time, edge-feasible language-driven PTZ control.

2606.02948 2026-06-03 cs.LG cs.DS

ERP-XTTN: Interpretable Prototype-Guided Cross-Attention for Cross-Subject ERP Classification

Charlotte Genevier Wyman, Leanne Hirshfield

Affiliations: University of Colorado Boulder

AI summary: Proposes ERP-XTTN, a prototype-guided cross-attention architecture that provides interpretable ERP classification under calibration-free, cross-subject conditions and shows that classification errors are neurophysiologically explicable.

Abstract

Interpretable brain-computer interface classifiers that generalize across subjects without calibration remain an open challenge. We test whether prototype-based cross-attention can provide competitive, interpretable event-related potential (ERP) classification under deployment-compatible conditions. We propose ERP-XTTN, a cross-attention architecture that routes input EEG patches to fixed difference-wave prototypes via query-key-only cross-attention with no value projection, so classification depends entirely on attention routing and attention faithfulness is structural rather than post-hoc. Prototypes are derived automatically from extrema in the training-fold difference wave. We evaluate across three public sources (BNCI Horizon 2020, HRI Cursor, and ERP CORE) spanning eight ERP components (ERN, LRP, ErrP, N170, P300, N2pc, MMN, N400), using leave-one-subject-out (LOSO) evaluation with causal filtering at two channel counts (3-channel and full montage), against EEGNet and xDAWN with Riemannian geometry (xDAWN+RG). The mean gap between the best baseline and ERP-XTTN was .018 AUROC at 3 channels and .034 at full montage, arising from two largely distinct sources: a temporal-flexibility cost relative to EEGNet and a spatial-exploitation cost relative to xDAWN+RG, the latter driven by signal-to-noise ratio at full montage. Beyond accuracy, the transparent routing reveals cross-subject signal structure that black-box models cannot: false positives resembled true positives more than true negatives did, indicating that classification errors are neurophysiologically explicable. ERP-XTTN generalizes across diverse ERPs under causal, calibration-free conditions with a small interpretability cost at minimal montages. To our knowledge, this is the first epoch-level LOSO benchmark on ERP CORE.
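The idea of query-key-only cross-attention with no value projection, where the attention weights themselves carry the decision, can be illustrated with a toy sketch. The abstract does not specify the architecture's details, so the following is an assumption-laden reading rather than the authors' model: each prototype is tagged with a class, the softmax runs over prototypes for each patch, and a class score is the average attention mass routed to that class's prototypes. All names and the projection matrices `w_q` and `w_k` are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]
Matrix = Sequence[Sequence[float]]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def matvec(matrix: Matrix, vector: Vector) -> List[float]:
    return [dot(row, vector) for row in matrix]


def softmax(values: Sequence[float]) -> List[float]:
    peak = max(values)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def attention_routing(patches: Sequence[Vector], prototypes: Sequence[Vector],
                      w_q: Matrix, w_k: Matrix) -> List[List[float]]:
    """Per patch, softmax attention weights over prototypes (queries and keys only)."""
    keys = [matvec(w_k, proto) for proto in prototypes]
    scale = math.sqrt(len(keys[0]))
    routing = []
    for patch in patches:
        query = matvec(w_q, patch)
        routing.append(softmax([dot(query, key) / scale for key in keys]))
    return routing


def class_scores(routing: Sequence[Sequence[float]], proto_class: Sequence[int],
                 n_classes: int) -> List[float]:
    """Average attention mass sent to each class's prototypes; no values are mixed."""
    scores = [0.0] * n_classes
    for weights in routing:
        for weight, cls in zip(weights, proto_class):
            scores[cls] += weight / len(routing)
    return scores


identity = [[1.0, 0.0], [0.0, 1.0]]
prototypes = [[1.0, 0.0], [0.0, 1.0]]  # toy stand-ins for difference-wave prototypes
routing = attention_routing([[3.0, 0.0]], prototypes, identity, identity)
scores = class_scores(routing, proto_class=[1, 0], n_classes=2)
```

Because nothing passes through a value projection, the routing matrix is the complete explanation of the output in this sketch, which is the sense in which faithfulness would be structural rather than post-hoc.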

2606.02936 2026-06-03 cs.LG

Hierarchical RBF-KAN and RBF-SKAN Architectures for Multidimensional Function Approximation and Random Field Learning

Mingtao Xia, Qijing Shen

Affiliations: University of Houston; University of Birmingham; University of Oxford

AI summary: Proposes and analyzes hierarchical Kolmogorov-Arnold neural network architectures with radial basis function activations for approximating deterministic functions and random field models, proving universal approximation properties and the potential to alleviate the curse of dimensionality.

Abstract

In this manuscript, we propose and analyze hierarchical Kolmogorov--Arnold neural network architectures employing radial basis functions as activation functions for approximating deterministic functions and random field models. Specifically, we develop a hierarchical radial-basis-function Kolmogorov--Arnold network (hierarchical RBF-KAN) for multidimensional deterministic function approximation and a hierarchical radial-basis-function stochastic Kolmogorov--Arnold network (hierarchical RBF-SKAN) for random field learning. From a theoretical perspective, we establish universal approximation results for both architectures. In particular, we derive quantitative approximation estimates for the hierarchical RBF-KAN, showing that the proposed framework has the potential to partially alleviate the curse of dimensionality in learning high-dimensional functions by reducing the effective dimensionality of the approximation problem. Furthermore, we show that the hierarchical RBF-SKAN can approximate random field models under the Wasserstein-2 metric. Empirically, we show that our proposed radial-basis-function-based neural network structure could effectively learn multivariate functions and random field models.
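For readers unfamiliar with the building block, a generic Kolmogorov-Arnold layer with radial basis function edges can be sketched as below. This is not the paper's hierarchical RBF-KAN or its stochastic RBF-SKAN variant, whose construction the abstract does not detail; it only shows the common pattern in which every input-to-output edge carries its own learnable univariate function, here assumed to be a weighted sum of Gaussian bumps on shared centers. The names, the Gaussian form, and the `width` parameter are illustrative assumptions.

```python
import math
from typing import List, Sequence


def gaussian_rbf(x: float, center: float, width: float) -> float:
    return math.exp(-((x - center) / width) ** 2)


def rbf_kan_layer(x: Sequence[float],
                  weights: Sequence[Sequence[Sequence[float]]],
                  centers: Sequence[float],
                  width: float = 1.0) -> List[float]:
    """One KAN-style layer.

    weights[j][i][k] is the coefficient of RBF k on the edge from input i to
    output j, so out[j] = sum_i sum_k weights[j][i][k] * rbf(x[i], centers[k]).
    """
    outputs = []
    for edges in weights:  # one entry per output unit
        total = 0.0
        for x_i, coeffs in zip(x, edges):  # one univariate function per input
            total += sum(c * gaussian_rbf(x_i, center, width)
                         for c, center in zip(coeffs, centers))
        outputs.append(total)
    return outputs


def rbf_kan_network(x: Sequence[float],
                    layers: Sequence[Sequence[Sequence[Sequence[float]]]],
                    centers: Sequence[float],
                    width: float = 1.0) -> List[float]:
    """Compose layers; each layer's outputs become the next layer's inputs."""
    hidden = list(x)
    for weights in layers:
        hidden = rbf_kan_layer(hidden, weights, centers, width)
    return hidden


centers = [0.0, 1.0]
layer = [[[1.0, 0.0], [0.0, 2.0]]]  # one output, two inputs, two RBFs per edge
output = rbf_kan_layer([0.0, 1.0], layer, centers)
```

Unlike a standard multilayer perceptron, there is no fixed nonlinearity applied after a linear map; the nonlinearity lives on the edges, and training would adjust the RBF coefficients.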

2606.02935 2026-06-03 cs.CV cs.CE

CAD-to-CT Registration of Cylindrical Objects via Ellipse-Based Axis Estimation

Aleksander Ogonowski, Mikołaj Mrozowski, Daniel Więcek, Arkadiusz Ćwiek, Konrad Klimaszewski, Rafał Możdżonek, Adam Padee, Lech Raczyński, Piotr Wasiuk, Wojciech Wiślicki, Michał Matusiak, Sławomir Wronka

Affiliations: Department of Complex Systems, National Centre for Nuclear Research; ImagineRT sp. z o.o.; National Centre for Nuclear Research

AI summary: Proposes a two-stage geometric registration method that estimates the rotation axis from elliptical cross-sections detected in CT slices, then voxelizes the CAD model and maximizes its volumetric overlap with the CT scan. It registers cylindrical objects (ionization chambers) with tilt and orientation errors below 0.1°, without intensity calibration or feature matching.

Abstract

Accurate registration of CAD models to CT scans is essential for establishing ground truth geometry in volumetric imaging. Obtaining reliable object masks is of growing importance in machine learning settings; as recent architectures grow more capable, huge datasets are required to fully utilise their capabilities. Traditional intensity-based methods fail when CT grayscale values lack calibration references, while point-based algorithms (e.g., ICP, RANSAC) require feature correspondence unavailable between idealized CAD geometry and noisy volumetric CT data. We propose a two-stage geometric registration method for cylindrical objects (ionization chambers) that takes advantage of the distinctive geometric features of the objects. First, we estimate the 3D rotation axis by detecting elliptical cross-sections across CT slices, fitting ellipses to edge-detected contours, and performing PCA on the fitted ellipse centers after RANSAC outlier removal. Second, we voxelize the CAD model, orient it along the detected axis, and maximize volumetric overlap with the CT scan through translational adjustment. This approach achieves robust registration with tilt and orientation errors below $0.1^\circ$ without intensity calibration or feature matching. Once registered, the aligned CAD model provides ground truth geometry for applications including machine learning-based object localization and automated analysis in industrial CT workflows.
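The axis-estimation step in the first stage, PCA on the fitted ellipse centers, can be sketched on its own. The sketch assumes the per-slice ellipse centers are already available as 3D points and that outliers have already been removed, so ellipse fitting and RANSAC are out of scope here. It uses power iteration on the 3×3 covariance matrix as a stand-in for a full eigendecomposition; the function names and the choice of power iteration are assumptions, not the authors' implementation.

```python
import math
from typing import List, Sequence, Tuple

Point = Sequence[float]


def principal_axis(points: Sequence[Point], iterations: int = 200) -> Tuple[List[float], List[float]]:
    """Return (centroid, unit direction) of the dominant principal component.

    Power iteration converges to the leading eigenvector of the covariance
    matrix unless the start vector is orthogonal to it; the fixed start
    (1, 1, 1) is a simplification that a robust version would randomise.
    """
    n = len(points)
    centroid = [sum(p[i] for p in points) / n for i in range(3)]
    centered = [[p[i] - centroid[i] for i in range(3)] for p in points]
    cov = [[sum(c[i] * c[j] for c in centered) / n for j in range(3)] for i in range(3)]
    v = [1.0, 1.0, 1.0]
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(3)) for i in range(3)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break  # degenerate input (all centers coincide)
        v = [x / norm for x in w]
    if v[2] < 0:
        v = [-x for x in v]  # fix the sign so the axis points toward +z
    return centroid, v


def tilt_degrees(axis: Sequence[float]) -> float:
    """Angle between the estimated axis and the scanner z-axis, in degrees."""
    return math.degrees(math.atan2(math.hypot(axis[0], axis[1]), abs(axis[2])))


# Toy ellipse centers drifting 0.1 units in x per unit of z (one center per slice).
centers = [(0.1 * z, 0.0, float(z)) for z in range(10)]
centroid, axis = principal_axis(centers)
tilt = tilt_degrees(axis)
```

Using the centers rather than the ellipse shapes keeps the estimate independent of CT intensity calibration, since only the geometric location of each cross-section enters the computation.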
