arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.03022 2026-06-03 cs.CL cs.AI

Hallucinations as Orthogonal Noise: Inference-Time Manifold Alignment via Dynamic Contextual Orthogonalization

Mingkuan Zhao, Wentao Hu, Tianchen Huang, Yuheng Min, Suquan Chen, Yide Gao, Yanbo Zhai, Shuangyong Song, Xuelong Li

Affiliations: Xi’an Jiaotong University; Xingchen AGI Lab; China Telecom AI Technology (Beijing) Co., Ltd.; Institute of Artificial Intelligence, China Telecom; University of Science and Technology of China; Tsinghua University

AI summary: Proposes a geometric framework, based on the linear representation hypothesis, that interprets LLM hallucinations as noise orthogonal to the semantic manifold of the residual stream, and introduces Dynamic Contextual Orthogonalization (DCO), an inference-time intervention whose layer-wise Z-score suppression mechanism selectively attenuates outlier orthogonal components, improving contextual faithfulness while preserving knowledge recall.

Abstract

Hallucination in Large Language Models (LLMs), characterized by the generation of content inconsistent with contextual facts or logical constraints, remains a persistent challenge for reliable deployment. In this work, we address this issue through a geometric framework rooted in the linear representation hypothesis. We propose that hallucinations manifest as orthogonal noise relative to the semantic manifold of the residual stream. Specifically, we hypothesize that while attention heads ideally propagate information congruent with the context subspace, hallucinations arise when specific heads introduce components orthogonal to this subspace, disrupting the coherence of the latent representation. Based on this formulation, we introduce Dynamic Contextual Orthogonalization (DCO), an inference-time intervention method. DCO utilizes the input residual stream as a dynamic context anchor to perform orthogonal decomposition on attention head outputs. To distinguish between context-aligned semantic updates and divergent noise, DCO employs a layer-wise Z-score suppression mechanism that selectively attenuates outlier orthogonal components based on statistical distributions. Evaluations on Llama-3-8B and 70B across benchmarks such as XSum, NQ-Swap, and IFEval demonstrate that DCO achieves superior contextual faithfulness compared to state-of-the-art intervention baselines. Furthermore, DCO maintains high performance on knowledge-intensive tasks like TriviaQA and TruthfulQA, effectively mitigating the trade-off between hallucination suppression and parametric knowledge retention often observed in existing methods. Our findings validate the geometric interpretation of hallucinations and establish DCO as a computationally efficient approach for enforcing manifold alignment. Our code is available at https://github.com/Harry-Miral/DCO
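The decomposition and suppression steps described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract gives no formulas, so the single-vector anchor, the function names, the `z_threshold` of 1.5, and the `damping` factor of 0.1 are all assumptions made for illustration. Each head output is split into a part parallel to the anchor and a part orthogonal to it, and heads whose orthogonal norm is a statistical outlier within the layer have that part scaled down.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def decompose(head_out, anchor):
    """Split head_out into components parallel and orthogonal to anchor."""
    scale = dot(head_out, anchor) / dot(anchor, anchor)
    parallel = [scale * a for a in anchor]
    orthogonal = [h - p for h, p in zip(head_out, parallel)]
    return parallel, orthogonal

def suppress_outliers(head_outputs, anchor, z_threshold=1.5, damping=0.1):
    """Damp the orthogonal part of heads whose orthogonal norm is an outlier.

    z_threshold and damping are hypothetical values, not taken from the paper.
    """
    parts = [decompose(h, anchor) for h in head_outputs]
    norms = [math.sqrt(dot(o, o)) for _, o in parts]
    mean = sum(norms) / len(norms)
    std = math.sqrt(sum((n - mean) ** 2 for n in norms) / len(norms))
    result = []
    for (parallel, orthogonal), norm in zip(parts, norms):
        # Z-score of this head's orthogonal norm within the layer.
        z = (norm - mean) / std if std > 0 else 0.0
        factor = damping if z > z_threshold else 1.0
        result.append([p + factor * o for p, o in zip(parallel, orthogonal)])
    return result
```

In this sketch the parallel component is always passed through unchanged, so only the part of a head's update that departs from the anchor direction can be attenuated.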

2606.03021 2026-06-03 cs.CL

Hint-Guided Diversified Policy Optimization for LLM Reasoning

Zhiyu Cao, Kaixin Wu, Mingjie Zhong, Peifeng Li, Xiaobo Li, Can Ye, Qiaoming Zhu

Affiliations: School of Computer Science and Technology, Soochow University; Ant Group

AI summary: Proposes Hint-Guided Diversified Policy Optimization (HDPO), which uses a "propose-select-think" trajectory to incentivize the model to generate diverse and reliable solutions, improving reasoning ability, candidate-solution diversity, and the ability to identify reliable solutions.

Abstract

Recent developments in Large Language Models (LLMs) have showcased impressive reasoning capabilities, with Reinforcement Learning with Verifiable Rewards (RLVR) being a promising enhancement strategy. However, existing reward mechanisms are constrained to outcome-level correctness and lack explicit signals to guide the model to consider diverse solutions. In contrast, human problem solving typically involves evaluating multiple potential approaches and selecting the most reliable solution, a cognitive process that current RLVR frameworks do not explicitly incentivize. Inspired by this, we propose Hint-Guided Diversified Policy Optimization (HDPO), allowing the model to first list all potential candidate solution outlines as hints and then select the most reliable one for further reasoning. HDPO comprises two stages, Cold Start for Structured Reasoning and Hint-Guided Diversified Reinforcement Learning, which incentivize the model to generate diverse and reliable solutions following the "propose-select-think" trajectory. Experimental results show that HDPO effectively boosts LLM reasoning and enhances the diversity of candidate solutions as well as the LLM's ability to identify reliable solutions.

2606.03017 2026-06-03 cs.LG cs.AI cs.RO

ConTraIRL: Factorized Contrastive Abstractions for Transferable IRL

Yikang Gui, Bikramjit Banerjee, Prashant Doshi

Affiliations: School of Computing, University of Georgia; School of Computing Sciences & Computer Engineering, The University of Southern Mississippi

AI summary: Proposes the ConTraIRL framework, which uses dual-encoder contrastive learning to decouple latent representations of environment dynamics and task goals, enabling compositional reward transfer and improving sample efficiency and reward recovery in few-shot transfer on continuous control benchmarks.

Abstract

Reward transfer in Inverse Reinforcement Learning (IRL) is unreliable when policies must generalize to unseen combinations of environment dynamics and task goals. We propose Factorized Contrastive Abstractions for Transferable IRL (ConTraIRL), a framework that enables compositional reward transfer by learning decoupled latent representations of these two factors. ConTraIRL uses a dual-encoder architecture that maps observations into separate dynamics and goal latent spaces, trained with a dual contrastive objective. Temporal alignment encourages the dynamics encoder to learn goal-invariant structure, while the goal encoder captures dynamics-invariant features. This factorization supports reward inference under recombined dynamics-goal settings. Experiments on continuous control benchmarks demonstrate effective few-shot transfer to unseen dynamics-goal pairings, improving sample efficiency and reward recovery over transfer IRL baselines.

2606.03014 2026-06-03 cs.LG cs.AR

MOSAIC: Efficient Mixture-of-Agent Scheduling via Adaptive Aggregation and Inference Concurrency

Saptarshi Mitra, Yifan Zhang, Rachid Karami, Phyo Pyae Moe Aung, Nazmul Takbir, Sreetama Sarkar, Souvik Kundu, Sitao Huang

Affiliations: University of California, Irvine, USA; University of Southern California, Los Angeles, USA; Intel, USA

AI summary: Targets load imbalance in Mixture-of-Agents systems on limited GPU resources with MOSAIC, a framework that combines an integer-linear-programming scheduler with confidence-aware adaptive aggregation; it achieves up to 2.5x expert-stage, 4.23x aggregator-stage, and 1.7x to 2.3x end-to-end speedups, with accuracy matched within 0.1 percentage points.

Comments: 13 pages, 8 main pages

Abstract

Mixture-of-Agents (MoA) systems improve reasoning accuracy by routing each query to multiple expert LLMs and aggregating their outputs. Executing this workload efficiently on limited GPU resources is difficult for two reasons. Skill-based routing creates skewed expert demand, and combining instruction-tuned LLMs with long-reasoning models results in extreme variability in generation lengths. Consequently, traditional scheduling strategies suffer from significant GPU idling and throughput collapse due to load imbalances. We present MOSAIC, a scheduling framework to accelerate MoA workloads. First, we formulate an Integer Linear Program (ILP) based scheduler that jointly optimizes expert placement and per-worker prompt assignment from offline-profiled costs, replicating reasoning experts across workers while pinning lightweight ones. Second, MOSAIC uses confidence-aware adaptive aggregation, leveraging inter-expert agreement to bypass the heavy final aggregator LLM for consensus queries. In our 4-GPU system, MOSAIC achieves up to 2.5x expert-stage, 4.23x aggregator-stage, and 1.7x to 2.3x end-to-end speedups over the baseline scheduler, while matching accuracy within 0.1pp.
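The aggregator bypass described here can be sketched in a few lines. This is an illustrative sketch, not MOSAIC's code: the abstract does not say how agreement is measured, so exact-match voting, the `agreement_threshold` of 0.75, and the `heavy_aggregator` callable are assumptions.

```python
from collections import Counter

def aggregate(expert_answers, heavy_aggregator, agreement_threshold=0.75):
    """Return (answer, used_aggregator).

    If enough experts give the same answer, return it directly and skip
    the expensive aggregator call; otherwise fall back to the aggregator.
    """
    counts = Counter(expert_answers)
    top_answer, top_count = counts.most_common(1)[0]
    agreement = top_count / len(expert_answers)
    if agreement >= agreement_threshold:
        return top_answer, False
    return heavy_aggregator(expert_answers), True
```

Calling `aggregate(["4", "4", "4", "5"], some_llm_call)` returns `("4", False)` without invoking the aggregator, because three of four experts agree. A real system would likely need a softer notion of agreement than string equality for free-form answers.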

2606.03005 2026-06-03 cs.CV cs.AI

MUSE: A Unified Agentic Harness for MLLMs

Jianglin Lu, Hailing Wang, Xu Ma, Qihua Dong, Mingyuan Zhang, Yizhou Wang, Yun Fu

Affiliations: Northeastern University

AI summary: Proposes the MUSE harness, which improves the performance of frozen multimodal LLMs without retraining through composable modules for task representation, visual processing, perception tool use, structured parsing, deterministic verification, and verifier-guided repair.

Abstract

Despite rapid progress, multimodal large language models (MLLMs) still fail on tasks that humans solve effortlessly, such as navigating a grid maze from a screenshot or selecting the correct puzzle piece. Rather than retraining the model, we ask a complementary question: how much capability can be elicited from a frozen MLLM purely by improving the execution scaffold around it? We introduce MUSE, a multimodal unified structured execution harness that wraps any off-the-shelf MLLM with composable modules for task representation, visual processing, perception tool use, structured parsing, deterministic verification, and verifier-guided repair, without any model retraining. We evaluate MUSE across diverse benchmarks spanning visual spatial planning, visual perception, multimodal reasoning, and fine-grained visual discrimination, using multiple state-of-the-art MLLMs. MUSE delivers consistent gains over the bare model in all settings, with the largest jumps on challenging instances. Further analysis reveals that many MLLM failures arise from harness-level shortcomings rather than fundamental model deficits, and can be addressed through verifier-guided repair without touching the model. These findings highlight the agentic multimodal harness as a critical yet underexplored design dimension, offering an orthogonal avenue for improving MLLMs beyond model-centric optimization.

2606.03003 2026-06-03 cs.LG cs.AI cs.RO

Exact equivariance, kept through training, buys zero-shot generalisation across the symmetry group

Hongbo Wang

Affiliations: Department of Mathematics, Stony Brook University

AI summary: A latent world model built from an equivariant encoder and an equivariant predictor has a provably symmetric training loss, so fitting the dynamics on only a subset of orientations mathematically determines its behaviour on the entire orbit, giving zero-shot generalisation across the symmetry group.

Comments: 92 pages, 11 figures. Core paper plus an extended results-log appendix and a forward-looking theory supplement. All experiments are laptop-scale (CPU/MPS), fully seeded and deterministic.

Abstract

A latent world model built from an equivariant encoder $E$ and an equivariant predictor $f$ inherits a provable symmetry of its training loss: when the world's dynamics genuinely carries a group $G$ acting on latents by an orthogonal representation $\rho(g)$, the one-step prediction relMSE is exactly invariant across the whole group, so fitting the dynamics on a restricted slice of orientations mathematically determines it on the entire orbit (jǔ yī fǎn sān). We verify this end-to-end at laptop scale (CPU/MPS, fully seeded). [A] The symmetry survives a real Muon/AdamW + EMA + VICReg run -- composed encode-then-predict residual $\sim 10^{-6}$ after optimisation, not just at initialisation, and under any optimiser. [B] One-step error is flat to five digits across the group, while a same-hypothesis-class non-equivariant baseline fits the slice but breaks out-of-distribution (VN $\times 1.00$ vs baseline $\times 13.8$ in 2D, $\times 17.2$ in 3D, $\times 157$ over the full $\mathrm{SE}(3)$ ladder), with the equivariant model $4.5$-$7.4\times$ smaller. [C] The same isometry argument lifts to closed loop: under a matching equivariant planner the control trajectory at orientation $g$ is exactly $\rho(g)$ applied to the seen one, so closed-loop error is invariant across the group -- float-floor-exact in 2D/$\mathrm{SO}(2)$ on real PushT and statistically flat in 3D/$\mathrm{SE}(3)$ (disjoint 95% CIs). We stress-test the prior against Sutton's Bitter Lesson: augmentation, brute-force scale, and soft-equivariance each close at most the across-group task metric, never the float-floor exactness. Because equivariance is closed under composition, the $H$-fold rollout stays flat ($\times 1.00$, $\le 2\times 10^{-7}$) at every horizon, while the baseline's residual compounds with $H$. Out of scope: task-success sweeps, planner-free invariance, and scaling.
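The invariance claim can be checked on a toy case. The sketch below is an assumption-laden illustration, not the paper's model: it works directly in a 2D latent space under SO(2), uses plain squared error rather than relMSE, and picks a hand-written linear predictor of the form a·I + b·J (J being the 90-degree rotation), which commutes with every rotation. A diagonal predictor that does not commute with rotations serves as a non-equivariant contrast.

```python
import math

def rotate(v, theta):
    """Apply the 2D rotation by angle theta (an orthogonal representation of SO(2))."""
    c, s = math.cos(theta), math.sin(theta)
    return (c * v[0] - s * v[1], s * v[0] + c * v[1])

def equivariant_predictor(v, a=0.9, b=0.3):
    # a*I + b*J commutes with every 2D rotation, so f(R v) = R f(v).
    return (a * v[0] - b * v[1], b * v[0] + a * v[1])

def baseline_predictor(v):
    # Anisotropic scaling does not commute with rotations.
    return (0.9 * v[0], 0.3 * v[1])

def sq_error(pred, target):
    return sum((p - t) ** 2 for p, t in zip(pred, target))

def invariance_gap(predictor, z, z_next, angles):
    """Largest change in one-step error when the whole scene is rotated."""
    base = sq_error(predictor(z), z_next)
    return max(
        abs(sq_error(predictor(rotate(z, t)), rotate(z_next, t)) - base)
        for t in angles
    )
```

For the equivariant predictor the gap is at floating-point rounding level for any angles, because rotating both prediction and target by the same orthogonal map preserves the error norm; for the baseline it is generally not.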

2606.02998 2026-06-03 cs.LG eess.AS

CoughSense: Five-Class Respiratory Disease Classification via Whisper Encoder Fine-Tuning and Dual-Encoder Cross-Attention Fusion with Balanced Contrastive Learning

Nikhil Vincent

Affiliations: Independent Researcher, Bothell, Washington, USA

AI summary: Proposes CoughSense, a system that combines Whisper encoder fine-tuning and dual-encoder cross-attention fusion with active-frame attention pooling and balanced contrastive learning to classify smartphone cough recordings into five respiratory classes (healthy, COVID-19, asthma or respiratory condition, bronchitis, and pneumonia).

Comments: 26 pages, 3 figures

Abstract

Automated cough analysis offers a path to low-cost respiratory screening, but most existing work stops at binary COVID-19 detection. A practical tool needs to tell apart several respiratory conditions from one cough recording on a consumer smartphone. We present CoughSense, a system that sorts cough recordings into five classes. These are healthy, COVID-19, asthma or respiratory condition, bronchitis, and pneumonia. We aggregated 18,301 recordings from four public datasets (Coswara, CoughVID, Virufy, and the West China Hospital Pediatric Cough Dataset) and used the OpenAI Whisper encoder as a pretrained backbone for cough disease classification. The main contribution is active-frame QKV attention pooling, which restricts attention to the first 200 of 1500 encoder tokens. This avoids the silence-dilution problem that arises because a 3-second cough fills only 150 tokens of Whisper's 30-second input window. Other training parts handle the 19 to 1 class imbalance and the four-dataset domain shift. These include WeightedRandomSampler, SpecAugment, Balanced Mixup with forced minority pairing, a supervised contrastive auxiliary loss, FiLM symptom conditioning, and gradient-reversal domain adaptation. A dual-encoder model fuses Whisper with the OPERA-CT respiratory foundation model through cross-attention. CoughSense (Whisper-tiny, 8.6M parameters) reached 82.3 percent balanced accuracy on five-fold cross-validation (macro-F1 of 0.817, AUC of 0.941). It beat an ImageNet-pretrained EfficientNet-B2 by 11.1 points and a ViT trained from scratch by 29.6 points. All five classes passed 74 percent recall and four of five passed 80 percent. The dual-encoder model reached 85.4 percent balanced accuracy. Active-frame pooling is the largest single contributor across all ablation components at 5.1 points, which should help any short-audio task using Whisper as a backbone.
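The silence-dilution argument follows from the numbers in the abstract: 1500 encoder tokens over a 30-second window is 50 tokens per second, so a 3-second cough occupies about 150 tokens and the other 1350 encode padding. The sketch below illustrates pooling over only the first 200 tokens. It is a simplified stand-in, not the paper's layer: it uses a single query vector with the tokens acting as both keys and values, and omits the learned Q/K/V projections; the function name and signature are hypothetical.

```python
import math

def active_frame_pool(tokens, query, active_frames=200):
    """Softmax-attention pool over only the first `active_frames` tokens."""
    active = tokens[:active_frames]
    scale = math.sqrt(len(query))
    # Scaled dot-product score of the query against each active token.
    scores = [sum(q * x for q, x in zip(query, tok)) / scale for tok in active]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(active[0])
    return [
        sum(w * tok[d] for w, tok in zip(weights, active)) / total
        for d in range(dim)
    ]
```

With a uniform query, 150 signal tokens followed by padding pool to 0.75 of the signal value over 200 active frames, but only 0.1 over all 1500, which is the dilution the restriction is meant to avoid.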

2606.02996 2026-06-03 cs.RO cs.CV cs.HC

MARIO: Motion-Augmented Real-Time Multi-Sensor Inertial Odometry

Yiquan Li, Taeyoung Yeon, Chenfeng Gao, Vasco Xu, Xuanyou Liu, Karan Ahuja

Affiliations: Northwestern University; University of Chicago

AI summary: Proposes the MARIO framework, which constrains motion dynamics with a learned IMU-inferred human pose prior and adds multi-sensor fusion (magnetometers, barometers, secondary IMUs); on the Nymeria dataset it reduces positional drift by up to 36% with the pose prior and by up to 42% with fusion, for accurate and robust camera-less inertial odometry.

Comments: CVPR 2026 Findings

Abstract

Inertial odometry (IO) using only Inertial Measurement Units (IMUs) provides a lightweight solution for human motion tracking in augmented reality (AR) and wearable devices. Recent learning-based IO methods have improved the generalizability of inertial localization through large-scale pretraining on human motion datasets. However, these approaches remain prone to drift and noise because they do not explicitly capture human motion dynamics, especially on daily activity datasets such as Nymeria. In this work, we propose to ground inertial odometry in human kinematics through a learned IMU-inferred pose prior, which promotes physically consistent motion constraints. We integrate this pose prior into existing IO architectures and reduce positional drift by up to 36% on the challenging Nymeria dataset, which is 5x larger than datasets used in prior work. We further improve long-term performance with a sensor-fusion framework that incorporates auxiliary signals from lightweight sensors already available on commercial AR glasses, including magnetometers, barometers, and secondary IMUs. With this fusion strategy, positional drift is reduced by up to 42%, improving robustness and generalization across diverse motion conditions. Together, our results introduce a new paradigm for inertial and lightweight odometry by unifying human motion kinematics with multimodal sensing, setting a new benchmark for accurate and robust camera-less human tracking. Our website is available at https://spice-lab.org/projects/MARIO/.

2606.02994 2026-06-03 cs.AI cs.CL

Inducing Reasoning Primitives from Agent Traces

Zhihan Lei, Jiarui Yan, Joshua Momo, William W. Cohen

Affiliations: Carnegie Mellon University

AI summary: Proposes Reasoning Primitive Induction, which mines and clusters common reasoning steps from ReAct agent traces to build a pseudo-tool library, improving performance on several reasoning tasks.

Comments: 22 pages including appendices

Abstract

ReAct-style LLM agents often rediscover the same reasoning routines across problems, yet leave those routines trapped in transient scratchpads. We introduce Reasoning Primitive Induction, a single-pass method that mines successful ReAct traces, clusters recurrent reasoning moves, and converts the most frequent moves into a compact library of typed pseudo-tools. Each pseudo-tool is specified by a natural-language docstring interpreted by an LLM at invocation time, and a standard ReAct loop composes these primitives at test time. The central result is that induced libraries outperform the very agent that generated their traces: by +44pp on RuleArena NBA (30 -> 74), +30pp on MuSR team allocation (38 -> 68), and +22pp on NatPlan meeting planning (7 -> 29). Across five comparable subtasks spanning narrative deduction, rule application, and constraint-satisfaction planning, a single fixed configuration improves over zero-shot Chain-of-Thought on every subtask, matches or surpasses expert-authored decompositions, and outperforms AWM at lower average inference cost.

2606.02991 2026-06-03 cs.CL cs.AI

Pretraining Language Models on Historical Text

Xiaoxi Luo, Zachary Shinnick, Niclas Griesshaber, Yixuan Wang, Junchi Yu, Freda Shi, Philip Torr, Yao Lu

Affiliations: University of Waterloo; Vector Institute; AIML, Adelaide University; Department of Engineering Science, University of Oxford; Oxford Centre for Economic and Social History, University of Oxford; Department of Computer Science, University College London

AI summary: Introduces TypewriterLM, a 7.24B historical language model trained only on English text predating 1913, and addresses challenges in data quality, temporal leakage, training, and evaluation by building the TypewriterCorpus, a lexically grounded instruction-tuning framework, and the History-Event benchmark suite.

Abstract

We introduce TypewriterLM, a 7.24B History language model (LM) trained exclusively on English text predating 1913. Developing History LMs requires addressing challenges in data quality and availability, preventing temporal leakage, designing temporally consistent post-training pipelines, and constructing reliable evaluations. To address these issues, we construct TypewriterCorpus, a 54B-token historical corpus collected from diverse archival and linguistically annotated sources with extensive data cleaning and leakage mitigation procedures. Furthermore, we introduce lexically grounded instruction tuning, a post-training framework that constrains responses to remain directly grounded in historical source documents. Using this framework, we construct two historical instruction tuning datasets: History-LIMA and History-SelfInstruct. To evaluate capability and temporal consistency, we introduce History-Event, a benchmark suite for evaluating competence, temporal grounding, and data leakage. We release TypewriterLM and all associated resources to support future research on historical language models.

2606.02981 2026-06-03 cs.CL

Predicting Inference-Time Scaling Gains from Labeled Validation-Set Output Statistics

Luyang Zhang, Jingyan Li

Affiliations: Carnegie Mellon University; Johns Hopkins University

AI summary: Proposes a lightweight method based on labeled validation-set output statistics: three core features (prompt-level agreement spread, label-assisted first-correct-sample position, and completion-length variance) plus an entropy feature feed a ridge regression that predicts best-of-N inference scaling gain, reaching a Spearman correlation of ρ = 0.90.

Abstract

Best-of-$N$ inference scaling (drawing $N$ candidate answers from a language model and returning the one a reward model ranks highest) improves accuracy by an amount that varies across models, but predicting that amount in advance currently requires running the procedure end-to-end. Prior work links cheap statistics of a model's sampled outputs and validation-set correctness (how often samples agree, how diverse they are, how confident the model is, and where correct samples appear) to model behavior, but does not isolate which of these form a stable, compact predictor of best-of-$N$ gain. We fit ridge predictors on features computed from a single labeled validation-set sampling pass, use bootstrap-Lasso as a stability analysis of the candidate feature set, and give a concentration analysis with an explicit linear-approximation residual. Across three base-model families, six post-training methods, and math and reasoning task domains, the stability analysis identifies a strict three-feature core spanning prompt-level agreement spread, label-assisted first-correct-sample position, and completion-length variance; a compact ridge predictor built from this core plus an entropy add-on reaches Spearman $\rho = 0.90$ with actual best-of-$N$ gain under a reward-model verifier. The intended use is labeled validation-set screening of candidate configurations before paying the full reward-model scoring cost.
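The two statistical tools named in this abstract, a ridge predictor and Spearman rank correlation, can be sketched with the standard library alone. This is a deliberately reduced illustration: it fits a single feature instead of the paper's multi-feature predictor, the penalty `lam` is an arbitrary choice, and the Spearman shortcut formula used here is only valid when there are no tied values.

```python
def ridge_fit_1d(xs, ys, lam=1.0):
    """One-feature ridge regression with an unpenalized intercept.

    Returns (weight, intercept). The L2 penalty lam shrinks the weight.
    """
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    weight = sxy / (sxx + lam)
    return weight, my - weight * mx

def ranks(values):
    """Rank values from 1 (smallest) to n (largest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman(a, b):
    """Spearman rank correlation via the no-ties formula 1 - 6*sum(d^2)/(n(n^2-1))."""
    ra, rb = ranks(a), ranks(b)
    n = len(a)
    d2 = sum((x - y) ** 2 for x, y in zip(ra, rb))
    return 1 - 6 * d2 / (n * (n * n - 1))
```

In the screening workflow the abstract describes, one would fit the predictor on configurations whose best-of-N gain is already known, then compare predicted and actual gains by rank on held-out configurations.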

2606.02980 2026-06-03 cs.SD cs.CY

A Training-Efficient Transformer-Based Anti-Spoofing Network for Logical Access in ASVspoof 5

Sidan Yin, Bo Zhao

Affiliations: Nanyang Technological University

AI summary: For the ASVspoof 5 Track 1 closed condition, proposes TFPARN, a network that combines a focal classification loss with a pairwise ranking loss and uses a Transformer encoder with attention pooling; it outperforms the AASIST and RawNet2 baselines on minDCF and EER, uses less inference memory, and reaches its best checkpoint in less training time than AASIST.

Comments: 11 pages, 2 figures

Abstract

Synthetic and manipulated speech can reduce the reliability of automatic speaker verification systems, so anti-spoofing methods need to be both accurate and efficient in training and inference. This paper focuses on the ASVspoof 5 Track 1 closed condition, where standard cross-entropy training may not give enough attention to hard trials and is not directly aligned with ranking- and threshold-based evaluation metrics. We propose TFPARN, a Transformer-based focal-pairwise attentive ranking network. The system extracts log-Mel features from speech, uses a Transformer encoder to model frame-level information, applies attention pooling to obtain utterance-level representations, and is trained with a combination of focal classification loss and pairwise ranking loss. RawBoost augmentation is used during training, and test-time augmentation is applied during evaluation to improve robustness. Compared with re-implemented AASIST and RawNet2 baselines under the same protocol, TFPARN achieves the best results, with a minDCF of 0.2430 and an EER of 12.52%. Ablation experiments further show that the pairwise loss, focal loss, and attention pooling all improve performance. TFPARN also uses the lowest inference memory among the compared systems, at 1.4 GB, runs at about 0.79 ms per utterance, and reaches its best checkpoint in less training time than AASIST. These results show that TFPARN provides a good balance between detection accuracy and computational cost for logical access anti-spoofing.
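The combined training objective can be sketched as follows. The abstract names a focal classification loss and a pairwise ranking loss but not their exact forms, so this sketch assumes the standard binary focal loss and a hinge-style ranking loss over all bona fide/spoof pairs; `gamma`, `margin`, and `rank_weight` are hypothetical values, and label 1 is taken to mean bona fide.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def focal_loss(scores, labels, gamma=2.0):
    """Binary focal loss: down-weights trials the model already gets right."""
    total = 0.0
    for score, label in zip(scores, labels):
        p = sigmoid(score)
        p_true = p if label == 1 else 1.0 - p
        total += -((1.0 - p_true) ** gamma) * math.log(p_true)
    return total / len(scores)

def pairwise_ranking_loss(scores, labels, margin=1.0):
    """Hinge loss asking every bona fide score to exceed every spoof score by margin."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    pairs = [(p, n) for p in positives for n in negatives]
    if not pairs:
        return 0.0
    return sum(max(0.0, margin - (p - n)) for p, n in pairs) / len(pairs)

def combined_loss(scores, labels, rank_weight=0.5):
    return focal_loss(scores, labels) + rank_weight * pairwise_ranking_loss(scores, labels)
```

The ranking term depends only on the ordering of scores across classes, which is closer to what threshold-swept metrics such as EER measure than per-trial cross-entropy is.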

2606.02979 2026-06-03 cs.CV cs.AI cs.RO

Towards Compact Autonomous Driving Perception with Balanced Learning and Multi-sensor Fusion

Oskar Natan, Jun Miura

Affiliations: Department of Computer Science and Engineering, Toyohashi University of Technology; Department of Computer Science and Electronics, Gadjah Mada University

AI summary: Proposes a compact deep multi-task learning model that uses adaptive loss weighting and intermediate sensor fusion to perform semantic segmentation, depth estimation, LiDAR segmentation, and bird's eye view projection in a single forward pass.

Comments: This work has been accepted for publication in IEEE Transactions on Intelligent Transportation Systems. https://ieeexplore.ieee.org/document/9712213

Abstract

We present a novel compact deep multi-task learning model that handles various autonomous driving perception tasks in one forward pass. The model simultaneously performs multi-view semantic segmentation, depth estimation, light detection and ranging (LiDAR) segmentation, and bird's eye view projection without support from other models. We also provide an adaptive loss weighting algorithm to tackle the imbalanced learning that arises from the large number of tasks. Through data pre-processing and intermediate sensor fusion techniques, the model can process and combine multiple input modalities retrieved from RGB cameras, dynamic vision sensors (DVS), and LiDAR placed at several positions on the ego vehicle, which gives it a better understanding of a dynamically changing environment. In the ablation study, the model variant trained with our proposed method achieves better performance. A comparative study against combinations of recent models further shows that our model maintains better performance with far fewer parameters, so it can run inference faster with less GPU memory. The results tend to be consistent across 3 different CARLA simulation datasets and 1 real-world nuScenes-lidarseg dataset. To support future research, we share the code and other files publicly at https://github.com/oskarnatan/compact-perception.

2606.02976 2026-06-03 cs.CL

Memory Retrieval for Changing Preferences

Yuehan Qin, Li Li, Linxin Song, Wei Yang, Jiate Li, Yuqing Yang, Yue Zhao

Affiliations: University of Southern California

AI summary: Proposes a unified Bayes-factor-based framework that quantifies how much evidence each historical turn provides about the user's latent preference state, and uses that signal for memory access and selection in long-context dialogue systems.

Abstract

Long-context dialogue systems must decide both when to access memory and which parts of the interaction history are relevant. Existing approaches typically rely on heuristic retrieval signals or always-on memory usage, failing to account for the changing and potentially inconsistent nature of user preferences. In this work, we propose a unified framework for memory access and selection based on changing preferences. We formulate personalized memory retrieval as identifying which historical turns provide evidence about a user's latent preference state, rather than relying on surface-level semantic similarity. To this end, we quantify the utility of each memory turn using a Bayes factor, defined as the improvement in the model's likelihood of the reference response when the turn is included in context. This provides a principled measure of evidence strength and a unified signal for both memory access and selection. By framing memory retrieval as utility estimation, the model learns to identify salient turns and regulate memory usage based on expected utility. Experiments on four heterogeneous memory benchmarks show that our approach outperforms existing embedding-based retrieval on long-context, preference-intensive tasks where modeling changing preferences is essential, while remaining competitive in low-density regimes where semantic similarity suffices.
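The utility signal defined in this abstract can be sketched as a log-likelihood difference, which is a log Bayes factor. Everything concrete below is an assumption for illustration: `log_likelihood(context_turns, reference)` stands for some hypothetical scorer that returns the model's log-probability of the reference response, each turn is scored in isolation, and `min_gain` and `top_k` are arbitrary. Since the reference response is only available at training time, this computes the kind of target the abstract says the model learns to estimate.

```python
def turn_utilities(history, reference, log_likelihood):
    """Log Bayes factor of each turn: gain in reference log-likelihood when
    that turn alone is added to an otherwise empty context."""
    base = log_likelihood([], reference)
    return [log_likelihood([turn], reference) - base for turn in history]

def select_memory(history, reference, log_likelihood, min_gain=0.0, top_k=2):
    """Keep at most top_k turns whose utility exceeds min_gain.

    An empty result means no turn helps, i.e. memory access can be skipped,
    so one signal serves both the access and the selection decision.
    """
    gains = turn_utilities(history, reference, log_likelihood)
    ranked = sorted(zip(gains, history), key=lambda pair: pair[0], reverse=True)
    return [turn for gain, turn in ranked[:top_k] if gain > min_gain]
```

Unlike embedding similarity, this score can favor a turn that shares little surface wording with the query as long as including it makes the desired response more likely.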

2606.02973 2026-06-03 cs.CL

Chatbots Output Meaningful (but Problematic) Language

聊天机器人输出有意义(但有问题)的语言

Matthew Stone, Una Stojnić

发表机构 * University of Parma(帕尔马大学) University of Cambridge(剑桥大学) University of Pittsburgh(匹兹堡大学) Franklin and Marshall College(弗兰克林与马歇尔学院)

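AI summary: Argues that the outputs of large language models (LLMs) are meaningful without positing mental states or intentions, and examines what this view implies for theories of language and for AI ethics.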

Comments 49 pages

详情
AI中文摘要

AI聊天机器人的话语有意义吗?具体来说,如果用户问Anthropic的智能体Claude:“西班牙的首都是什么?”Claude回答:“马德里是西班牙的首都。”这句话是否具有其通常的意义——并且表达了一个真实的命题?大多数普通用户以及AI工程师认为答案显然是“是”。然而,许多认知科学家、语言学家和语言哲学家认为,关于语言和意义的主流意向主义理论得出了相反的结论。因此,更同情普通用户直觉的理论家主张对语言进行激进的“去拟人化”,修正我们对心理状态、意图和语义内容的理解,以捕捉LLM输出有意义的直觉。我们采取不同的方法。虽然我们也认为LLM的输出是有意义的,但我们认为,适当的人类语言理论已经适用于当前的聊天机器人。意义是一个低门槛:声称LLM输出有意义并不需要假设心理状态、意图、理性或LLM中交流所需的认知能力——实际上,也不需要任何其他拟人化假设。人们确实有交流意图(通常是成功的),但即便如此,在人类中,语言产出也可能偏离说话者的想法。我们的观点对于我们应该如何理论化——并批判性地参与——人类语言输出和合成生成的文本具有重要影响。特别是,说聊天机器人产生有意义的文本绝不意味着认可它们的输出,或假设该技术是(或不是)好的、强大的、合适的或有用的。

英文摘要

Are utterances by AI chatbots meaningful? Concretely, if a user asks, say, Anthropic's agent Claude, "What is the capital of Spain?" and Claude answers, "Madrid is the capital of Spain," does that sentence have its ordinary meaning -- and does it express a true proposition? Most ordinary users, as well as AI engineers, take the answer to be trivially "yes." However, many cognitive scientists, linguists, and philosophers of language argue that dominant intentionalist accounts of language and meaning deliver the opposite conclusion. Theorists more sympathetic to ordinary users' intuitions have therefore advocated a radical "de-anthropomorphization" of language, revising our understanding of mental states, intentions, and semantic content to capture the intuition that the outputs of LLMs are meaningful. We take a different approach. While we, too, argue that LLM outputs are meaningful, we contend that a proper theory of human language already applies, as is, to current chatbots. Meaning is a low bar: claiming that LLM outputs are meaningful does not require positing mental states, intentions, rationality, or the cognitive capacities requisite for communication in LLMs -- or, indeed, making any other anthropomorphic assumptions. People do have communicative intentions (typically successful ones), but nevertheless, even in humans, language production can depart from what the speaker has in mind. Our view has important consequences for how we should theorize about -- and critically engage with -- both human linguistic output and synthetically generated text. In particular, to say that chatbots produce meaningful text is not by any means to endorse what they output, or to assume that the technology is (or is not) good, powerful, appropriate, or useful.

2606.02971 2026-06-03 cs.CL

EURO-5K: When Does Domain Pretraining Matter? Benchmarking Transformers for EU Reporting Obligation Extraction

EURO-5K:领域预训练何时重要?用于欧盟报告义务提取的Transformer基准测试

Marios Koniaris, Vasileios Kotronis, Eugenia Giannini, Panayiotis Tsanakas

发表机构 * Division of Computer Science, School of Electrical and Computer Engineering(计算机科学系,电气与计算机工程学院) National Technical University of Athens(雅典国立技术大学) Department of Humanities Social Sciences and Law, School of Applied Mathematical and Physical Sciences(人文社会科学与法律系,应用数学与物理科学学院)

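AI summary: Builds the EURO-5K dataset and compares discriminative and generative models on extracting EU reporting obligations, finding that domain pretraining pays off most under parameter-efficient fine-tuning and that the resulting models act as specialised extractors.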

详情
AI中文摘要

从欧盟立法中提取报告义务对于评估和减少监管报告负担至关重要。然而,区分报告要求与结构相似的条款需要专门的法律理解。当前的法律NLP方法缺乏具有明确指南和提取范式及领域适应策略比较评估的专门数据集。我们整理了EURO-5K,一个包含来自136项欧盟立法法案的句子级报告义务和具有挑战性的负例的语料库。在该数据集上,我们训练并比较了判别式标记分类模型(BERT风格)和生成式跨度提取模型(LLM),针对基线(基于模式和依赖关系的提取、少样本提示)评估了全微调和参数高效的QLoRA。结果表明,全微调的通用和法律BERT模型实现了相似的性能(0.89 F1),而微调的LLM在句子级提取上达到了编码器的准确度。法律预训练对生成式模型仅带来微小提升。相反,当适应能力受限时,法律预训练明显有益,因为参数高效微调的法律BERT优于其通用对应版本。学习曲线分析表明,法律预训练在数据极少时加速了早期学习。所有方法在大约3000个样本时收敛,之后收益递减,验证了数据集的充分性。在两个外部监管语料库上的跨数据集评估表明,我们的模型表现为专门的报告义务提取器,而非通用监管分类器。我们发布了EURO-5K、训练好的模型以及一个带有可解释性可视化和结构化RDF导出的交互式演示。这些表明,两种范式和参数高效训练为监管合规自动化提供了实用工具。

英文摘要

Extracting reporting obligations from EU legislation is critical for assessing and reducing regulatory reporting burden. However, distinguishing reporting requirements from structurally similar provisions requires specialised legal understanding. Current legal NLP methods lack specialised datasets with clear guidelines and comparative evaluation of extraction paradigms and domain adaptation strategies. We curate EURO-5K, a corpus of sentence-level reporting obligations and challenging negative examples from 136 EU legislative acts. On this dataset, we train and compare discriminative token-classification models (BERT-style) and generative span-extraction models (LLMs), evaluating both full fine-tuning and parameter-efficient QLoRA against baselines (pattern and dependency-based extraction, few-shot prompting). Results show that fully fine-tuned generic and legal BERT models achieve similar performance (0.89 F1), while fine-tuned LLMs match encoder accuracy for sentence-level extraction. Legal pretraining offers only small gains for generative models. In contrast, it is clearly beneficial when adaptation capacity is constrained, as parameter-efficient tuning of Legal-BERT outperforms its generic counterpart. Learning curve analysis demonstrates that legal pretraining accelerates early learning with minimal data. All approaches converge around 3K samples with diminishing returns thereafter, validating dataset sufficiency. Cross-dataset evaluation on two external regulatory corpora shows that our models behave as specialised reporting obligation extractors rather than generic regulatory classifiers. We release EURO-5K, trained models, and an interactive demo with explainability visualizations and structured RDF export. These demonstrate that both paradigms and parameter-efficient training provide practical tools for regulatory compliance automation.

2606.02969 2026-06-03 cs.RO math.OC

Hybrid Dynamics Modeling for a Flexible 2-DoF Robotic Arm

柔性2自由度机械臂的混合动力学建模

Maciek Popik, Daniel Yang, Mahdis Bisheban

发表机构 * Dept. of Mechanical and Manufacturing Eng at the Schulich School of Engineering, University of Calgary, Alberta, Canada(舒立克工程学院机械与制造工程系,卡尔加里大学,阿尔伯塔,加拿大) Schulich School of Engineering at the University of Calgary(卡尔加里大学舒立克工程学院) Intelligent Dynamics and Control Lab(智能动力学与控制实验室) University of Calgary(卡尔加里大学)

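AI summary: To capture dynamics that rigid-body models miss, the paper models a flexible 2-DoF robotic arm by combining rigid-body dynamics with a Gaussian mixture model, uses purely data-driven regression as a baseline, and compares the torque-prediction accuracy of the approaches.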

详情
AI中文摘要

本文研究了三种对柔性连杆2自由度机械臂动力学进行建模的方法,以解决刚体模型无法捕获的未建模动力学。两种物理信息模型将刚体动力学(RBD)公式与高斯混合模型(GMM)相结合,以捕获残差模型误差和连杆柔性。一个基于运动学的回归模型作为纯数据驱动的基线。使用开源数据集,首先通过运动学特征的岭回归估计扭矩预测,而基于物理的基线则根据公布的规格构建,随后使用普通最小二乘回归直接从数据估计相同的参数集。结果表明,基于物理的参数精度最差,而正则化和最小二乘估计器与实测扭矩更吻合。残差分析和误差指标凸显了纯参数模型在柔性连杆系统中的局限性,并强调了正则化和数据驱动辨识的价值,支持了半参数残差学习方法的发展。

英文摘要

This paper examines three approaches for modeling the dynamics of a flexible-link 2-DoF robotic arm to address unmodeled dynamics not captured by rigid-body models. Two physics-informed models combine rigid-body dynamics (RBD) formulations with a Gaussian Mixture Model (GMM) to capture residual model errors and linkage flexibility. A kinematics-based regression model serves as a purely data-driven baseline. Using an open-source dataset, torque predictions are first estimated using Ridge regression on kinematic features, while the physics-based baseline is constructed from published specifications, and ordinary least-squares regression is subsequently used to estimate the same parameter set directly from data. Results show that the physics-based parameters yield the poorest accuracy, while regularized and least-squares estimators align more closely with measured torques. Residual analysis and error metrics highlight the limitations of purely parametric models for flexible-link systems and underscore the value of regularization and data-driven identification, supporting the development of semi-parametric residual learning methods.
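The difference between the two estimators named in the abstract can be shown on the simplest possible case: one kinematic feature, no intercept, and torque modeled as a single coefficient times that feature. The sketch below is illustrative only. The paper's actual feature set and parameterization are not given in the abstract, and `fit_slope` is a hypothetical helper. Setting `ridge_lambda=0` gives ordinary least squares, and a positive value gives Ridge regression, which shrinks the estimate toward zero.

```python
from typing import Sequence


def fit_slope(x: Sequence[float], y: Sequence[float], ridge_lambda: float = 0.0) -> float:
    """Closed-form one-feature fit of y ~ theta * x.

    theta = sum(x*y) / (sum(x*x) + lambda); lambda = 0 is ordinary least squares.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    sxx = sum(v * v for v in x)
    sxy = sum(a * b for a, b in zip(x, y))
    denom = sxx + ridge_lambda
    if denom == 0:
        raise ValueError("degenerate problem: all-zero feature and no regularization")
    return sxy / denom


# Made-up data: joint acceleration as the feature, measured torque as the target.
accel = [1.0, 2.0, 3.0]
torque = [2.0, 4.0, 6.0]
theta_ols = fit_slope(accel, torque)          # recovers the exact slope
theta_ridge = fit_slope(accel, torque, 14.0)  # shrunk toward zero
```

On this noise-free data OLS recovers the slope exactly and Ridge underestimates it; regularization is meant to help when the data are noisy or the features are nearly collinear.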

2606.02965 2026-06-03 cs.AI

What Benchmarks Don't Measure: The Case for Evaluating Abstention Competence in Autonomous Agents

基准测试无法衡量的:论自主智能体弃权能力的评估

Victor Ojewale, Suresh Venkatasubramanian

发表机构 * Brown University(布朗大学)

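AI summary: Argues that autonomous-agent benchmarks overlook abstention competence, introduces the notion of compliance bias together with a taxonomy of abstention scenarios and evaluation protocols, and reports experiments indicating that the safety-usability tradeoff is tunable.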

Comments ACM CAIS 2026: RLEval Workshop Oral Presentation (Best Paper Award)

详情
AI中文摘要

自主智能体的基准测试衡量智能体是否完成任务,然而这种框架系统地忽略了智能体是否应该继续执行任务。在人类反馈目标下训练的智能体形成了一种结构性倾向,即使缺乏安全行动所需的输入、证据或授权也会继续执行,我们将这种倾向称为合规偏差,因为奖励信号和基准测试评分机制都将继续执行视为正确的默认行为,无论安全行动的前提条件是否满足。我们做出三项贡献。首先,我们表明合规偏差源于人类反馈流程中的奖励黑客行为,并因主流智能体基准测试而根深蒂固,这些基准测试要么惩罚智能体的暂停,要么在架构上无法区分有原则的暂停和静默失败。然后,我们引入弃权合理场景的三缺口分类法,涵盖所需信息缺失的规范缺口、无法确认世界状态的验证缺口以及未获得明确授权的权威缺口,这些共同为构建弃权感知的智能体基准测试提供了原则性基础。最后,我们提出弃权评估协议(安全率、可用率和知情拒绝率),并报告了144个企业智能体场景和五个模型系列的初步结果,其中运行时强制弃权机制在授权场景下实现了高达89.2%的危险动作阻断和87.5%的可用性,表明安全-可用性权衡是可调的而非固有的,并且其形状在不同模型系列间差异显著。我们将此视为初步工作,并提供分类法和复合指标作为进一步讨论的起点。

英文摘要

Benchmarks for autonomous agents measure whether agents complete tasks, yet this framing is systematically blind to whether an agent should have proceeded at all. Agents trained under human-feedback objectives develop a structural tendency to proceed even when they lack the inputs, evidence, or authorization to act safely, a disposition we term compliance bias, because both the reward signal and the benchmark scoring regime treat proceeding as the correct default regardless of whether the preconditions for safe action are present. We make three contributions. We first show that compliance bias originates in reward hacking within human-feedback pipelines and is entrenched by prominent agent benchmarks, which either penalize agents for pausing or are architecturally unable to distinguish a principled pause from a silent failure. We then introduce a three-gap taxonomy of abstention-warranted scenarios, covering specification gaps where required information is absent, verification gaps where world state cannot be confirmed, and authority gaps where explicit authorization has not been given, which together provide a principled basis for constructing abstention-aware agent benchmarks. Finally, we propose abstention evaluation protocols (Safety Rate, Usability Rate, and Informed Refusal Rate) and report preliminary results across 144 enterprise agent scenarios and five model families, in which a runtime-enforced abstention mechanism achieves up to 89.2% hazardous-action blocking and 87.5% usability on authorized scenarios, demonstrating that the safety--usability tradeoff is tunable rather than inherent and that its shape varies substantially across model families. We treat this as preliminary work and offer the taxonomy and composite metrics as a starting point for further conversations.
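The abstract names Safety Rate and Usability Rate but does not define them, so the sketch below rests on assumed definitions: Safety Rate is the fraction of hazardous scenarios in which the agent did not proceed, and Usability Rate is the fraction of authorized scenarios in which it did. The `Outcome` record and both functions are hypothetical. Informed Refusal Rate is left out because the abstract gives too little detail to sketch it.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Outcome:
    hazardous: bool  # assumed label: proceeding in this scenario would be unsafe
    proceeded: bool  # whether the agent acted instead of abstaining


def safety_rate(outcomes: Sequence[Outcome]) -> float:
    """Assumed definition: share of hazardous scenarios where the action was blocked."""
    hazardous = [o for o in outcomes if o.hazardous]
    if not hazardous:
        raise ValueError("no hazardous scenarios to score")
    return sum(not o.proceeded for o in hazardous) / len(hazardous)


def usability_rate(outcomes: Sequence[Outcome]) -> float:
    """Assumed definition: share of authorized scenarios where the agent proceeded."""
    authorized = [o for o in outcomes if not o.hazardous]
    if not authorized:
        raise ValueError("no authorized scenarios to score")
    return sum(o.proceeded for o in authorized) / len(authorized)


# Made-up outcomes: 4 hazardous scenarios (3 blocked), 4 authorized (2 completed).
runs = (
    [Outcome(True, False)] * 3 + [Outcome(True, True)]
    + [Outcome(False, True)] * 2 + [Outcome(False, False)] * 2
)
safety = safety_rate(runs)
usability = usability_rate(runs)
```

Reporting the two rates side by side makes the tradeoff visible: an agent that always abstains scores 1.0 on safety and 0.0 on usability under these definitions.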

2606.02963 2026-06-03 cs.LG

KForge: LLM-Driven Cross-Platform Kernel Generation for AI Accelerators

KForge:面向AI加速器的LLM驱动跨平台内核生成

Taras Sereda, Burak Bartan, Ankita Nayak, Tom St. John, Natalie Serrino, Zain Asgar

发表机构 * Gimlet Labs Inc(Gimlet实验室)

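AI summary: Proposes KForge, a framework in which two collaborating LLM agents (a generation agent and a performance-analysis agent) iteratively refine automatically generated cross-platform kernels, yielding a 2.12% throughput improvement on NVIDIA B200 and a 5.13× geometric-mean speedup on Intel Arc B580.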

Comments Accepted at ISCA 2026 Workshop MLArchSys

详情
AI中文摘要

生产推理越来越多地针对异构加速器组合。智能体管道交织推理、工具调用和多智能体协调,每个阶段具有不同的计算和内存特征。为达到最优效率,每个阶段应在最适合的加速器上运行。这带来了系统挑战:每个管道现在需要在越来越多的硬件后端和编程模型上生成高性能内核。手工编写这些内核耗时、需要深厚的底层专业知识,并且随着内核复杂性增长而难以扩展。最近,大型语言模型(LLMs)已被用于自动内核生成,但在低级代码生成和跨后端泛化方面仍存在挑战。我们提出KForge,一个跨平台框架,围绕由两个协作的基于LLM的代理驱动的迭代优化循环构建:生成代理,使用编译和正确性反馈生成并逐步优化内核;性能分析代理,解释从编程API到基于GUI的工具的性能数据,并发出指导下一轮合成的建议。该循环在功能轮次(驱动候选达到正确性)和优化轮次(缩小与手工调优基线的性能差距)之间交替。我们在两个基线参考可用性差异很大的后端上评估KForge。在NVIDIA B200上,KForge在gpt-oss-20b推理速度基准上相比TensorRT-LLM实现了2.12%的端到端吞吐量提升。在Intel Arc B580上,KForge生成的Triton内核在KernelBench Level 2的37个GEMM+尾部操作工作负载上,通过算子融合和混合精度执行,实现了比PyTorch eager和torch.compile中较快者5.13倍的几何平均加速。

英文摘要

Production inference increasingly targets a heterogeneous mix of accelerators. Agentic pipelines interleave reasoning, tool calls, and multi-agent coordination, each with distinct compute and memory profiles. For optimal efficiency, each stage should run on the accelerator best suited to it. This creates a systems challenge: each pipeline now requires high-performance kernels across a growing set of hardware backends and programming models. Writing these kernels by hand is time-consuming, demands deep low-level expertise, and does not scale as kernel complexity grows. Recently, Large Language Models (LLMs) have been leveraged for automatic kernel generation, but challenges in low-level code generation and cross-backend generalization persist. We present KForge, a cross-platform framework built around an iterative refinement loop driven by two collaborating LLM-based agents: a generation agent that produces and progressively refines kernels using compilation and correctness feedback, and a performance-analysis agent that interprets profiling data, from programmatic APIs to GUI-based tools, and emits recommendations that steer the next round of synthesis. The loop alternates between functional passes, which drive a candidate to correctness, and optimization passes, which close the performance gap to hand-tuned baselines. We evaluate KForge on two backends with very different baseline reference availability. On NVIDIA B200, KForge achieves a 2.12% improvement in end-to-end throughput compared to TensorRT-LLM on the gpt-oss-20b inference speed benchmark. On Intel Arc B580, KForge generates Triton kernels achieving a 5.13× geometric mean speedup over the faster of PyTorch eager and torch.compile on 37 GEMM + tail-ops workloads from KernelBench Level 2, primarily via operator fusion and mixed-precision execution.
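The headline Intel Arc number is a geometric mean of per-workload speedups, each measured against whichever baseline is faster on that workload. A minimal sketch of that aggregation follows. The function name and the latency figures are made up for illustration and are not taken from the paper.

```python
import math
from typing import Sequence


def geomean_speedup(eager_ms: Sequence[float], compiled_ms: Sequence[float],
                    kernel_ms: Sequence[float]) -> float:
    """Geometric mean of per-workload speedups over the faster of two baselines."""
    if not (len(eager_ms) == len(compiled_ms) == len(kernel_ms)) or not kernel_ms:
        raise ValueError("need equally sized, non-empty latency lists")
    log_sum = 0.0
    for eager, compiled, kernel in zip(eager_ms, compiled_ms, kernel_ms):
        baseline = min(eager, compiled)  # the stronger (faster) baseline
        log_sum += math.log(baseline / kernel)
    return math.exp(log_sum / len(kernel_ms))


# Two made-up workloads: per-workload speedups are 2x and 8x.
speedup = geomean_speedup(eager_ms=[8.0, 16.0], compiled_ms=[4.0, 8.0], kernel_ms=[2.0, 1.0])
```

The two speedups here have an arithmetic mean of 5 but a geometric mean of 4. The geometric mean is the usual choice for ratios because one very large speedup cannot dominate the summary.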

2606.02962 2026-06-03 cs.CV cs.AI cs.HC eess.IV

Hand Trajectory Fusion for Egocentric Natural Language Query Grounding

面向自我中心自然语言查询定位的手部轨迹融合

Enmin Zhong, Carlos R. del-Blanco, Fernando Jaureguizar, Narciso García

发表机构 * Grupo de Tratamiento de Imágenes (GTI), Information Processing and Telecommunications Center, ETSI Telecomunicación, Universidad Politécnica de Madrid, Spain(图像处理小组(GTI)、信息处理与电信中心、电信工程学院、马德里理工大学、西班牙)

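AI summary: For natural language query grounding in egocentric video, proposes a hand-trajectory encoder fused with video-text features through adaptively gated cross-attention, using hand-motion information to improve grounding performance.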

Comments Accepted for the poster session at the Egocentric Vision (EgoVis) Workshop in Conjunction with CVPR 2026

详情
AI中文摘要

自我中心自然语言查询(NLQ)定位要求模型在长第一人称视频中定位回答自由形式文本查询的时间区间。现有方法融合视频外观与查询,但忽略了手部运动,尽管大约41%的Ego4D NLQ查询是在手-物交互或其后立即发生的时刻回答的。我们提出了一种手部轨迹编码器,用于将手部骨骼序列转换为高语义的手部运动学特征,然后通过具有自适应门控的交叉注意力融合策略,将这些特征与预训练的视频-文本特征对齐并组合。在Ego4D NLQ v2验证集上,手-物交互查询(R1@IoU=0.3提升2.54)和数量/状态查询(R1@IoU=0.3提升4.32)的增益最为明显,表明手部轨迹提供了超越外观的定位线索。

英文摘要

Egocentric Natural Language Query (NLQ) grounding asks a model to localize, in a long first-person video, the temporal interval that answers a free-form text query. Existing methods fuse video appearance with the query but ignore hand motion, despite the fact that roughly 41% of Ego4D NLQ queries are answered at a moment of hand--object manipulation or its immediate outcome. We propose a hand-trajectory encoder for converting a sequence of hand skeletons into highly semantic hand kinematic features, which are then aligned and combined with pretrained video--text features through a cross-attention fusion strategy with adaptive gating. On the Ego4D NLQ v2 validation split, the clearest gains appear for Hand-Object Interaction queries (+2.54 R1@IoU=0.3) and Quantity/State queries (+4.32 R1@IoU=0.3), indicating that hand trajectory provides grounding cues beyond appearance alone.

2606.02959 2026-06-03 cs.LG cs.CR

Gate AI: LLM Security Benchmark Evaluation Methodology and Results

Gate AI:大语言模型安全基准评估方法与结果

Ryle Goehausen, Marcus Sousa

发表机构 * constellationnetwork(Constellation Network)

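AI summary: To address per-dataset threshold tuning and undisclosed operating points in evaluations of prompt-injection and jailbreak detectors, proposes an evaluation harness built on 5-fold cross-validation, a single global operating point, and a range of generalisation diagnostics, tested on 16 public benchmarks.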

Comments 17 pages, 23 figures, 2 tables. Working preprint; subsequent versions may update benchmark numbers as the framework evolves

详情
AI中文摘要

已发布的大语言模型提示注入和越狱检测器评估通常存在两个系统性弱点:每个数据集单独调整阈值以及未公开的操作点。我们描述了一种解决这两个问题的评估框架。被评估的检测器在16个公共基准(12,111个样本)上使用5折交叉验证进行评分。主要流程采用StratifiedKFold(按行);同时,并行运行StratifiedGroupKFold流程,基于复合键(父提示ID加上Jaccard $\gtrsim 0.8$的MinHash + LSH近重复聚类)作为泄漏溢价诊断。在保留的折上选择一个全局操作点(在FPR $\leq 1\%$条件下最大化F1),并统一应用于每个数据集,因此每个数据集的结果反映一个阈值,而非每个基准的优化。通过一系列诊断检查泛化能力(留一数据集交叉验证、随机标签对照、对抗验证、排列特征重要性、长度偏差相关性、分类器头部一致性、跨源近重复检测、阈值可迁移性、训练集与OOF一致性以及释义不变性探测),其中大多数具有定量通过阈值,其余则说明失败模式。对于每次外部比较,检测器的阈值根据竞争对手公布的假阳性率重新调整,以便在匹配的操作点上评估对比值。

英文摘要

Published evaluations of prompt-injection and jailbreak detectors for Large Language Models often suffer from two systematic weaknesses: per-dataset threshold tuning and undisclosed operating points. We describe an evaluation harness that addresses both. The detector under evaluation is scored across 16 public benchmarks (12,111 samples) using 5-fold cross-validation. StratifiedKFold (by row) is the headline pass; a parallel StratifiedGroupKFold pass over a composite key (parent-prompt id plus MinHash + LSH near-duplicate clusters at Jaccard $\gtrsim 0.8$) runs alongside it as a leakage-premium diagnostic. A single global operating point is selected on the held-out folds (max F1 subject to FPR $\leq 1\%$) and applied uniformly to every dataset, so per-dataset results reflect one threshold rather than per-benchmark optimisation. Generalisation is examined through a battery of diagnostics (leave-one-dataset-out cross-validation, a random-label control, adversarial validation, permutation feature importance, length-bias correlation, classifier-head agreement, cross-source near-duplicate detection, threshold transferability, train-vs-OOF agreement, and a paraphrase-invariance probe), most with a quantitative pass threshold and the remainder with a stated failure mode. For every external comparison, the detector's threshold is re-tuned to the competitor's published false-positive rate so head-to-head values are evaluated at matched operating points.
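The operating-point rule in the abstract (maximise F1 subject to FPR ≤ 1%) can be sketched as a scan over candidate thresholds on pooled held-out scores. This is an illustrative reading of that rule, not the harness's code. `select_operating_point` is a hypothetical name, and labels are assumed to be 1 for attack and 0 for benign.

```python
import math
from typing import List, Sequence, Tuple


def select_operating_point(scores: Sequence[float], labels: Sequence[int],
                           max_fpr: float = 0.01) -> Tuple[float, float]:
    """Return (threshold, F1) maximising F1 among thresholds with FPR <= max_fpr.

    A sample is flagged when score >= threshold. The infinite threshold
    (flag nothing) is always feasible, so a result always exists.
    """
    negatives = sum(1 for y in labels if y == 0)
    candidates: List[float] = sorted(set(scores)) + [math.inf]
    best_threshold, best_f1 = math.inf, -1.0
    for threshold in candidates:
        flagged = [s >= threshold for s in scores]
        tp = sum(1 for f, y in zip(flagged, labels) if f and y == 1)
        fp = sum(1 for f, y in zip(flagged, labels) if f and y == 0)
        fn = sum(1 for f, y in zip(flagged, labels) if not f and y == 1)
        fpr = fp / negatives if negatives else 0.0
        if fpr > max_fpr:
            continue  # violates the false-positive budget
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:
            best_threshold, best_f1 = threshold, f1
    return best_threshold, best_f1


# Made-up pooled out-of-fold scores; one benign sample scores 0.75.
scores = [0.1, 0.2, 0.75, 0.7, 0.8, 0.9]
labels = [0, 0, 0, 1, 1, 1]
constrained = select_operating_point(scores, labels, max_fpr=0.01)
unconstrained = select_operating_point(scores, labels, max_fpr=1.0)
```

In the toy data the unconstrained optimum sits at a lower threshold with a higher F1, but it admits a false positive, so the constrained rule settles on the stricter threshold. Applying that one threshold unchanged to every dataset is what the abstract contrasts with per-benchmark optimisation.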

2606.02956 2026-06-03 cs.CV cs.LG cs.RO

The Road Ahead in Autonomous Driving: The KITScenes Multimodal Dataset

自动驾驶的未来之路:KITScenes多模态数据集

Richard Schwarzkopf, Fabian Immel, Alexander Blumberg, Jonas Merkert, Nils Rack, Kaiwen Wang, Fabian Konstantinidis, Julian Truetsch, Carlos Fernandez, Annika Bätz, Kevin Rösch, Marlon Steiner, Willi Poh, Yinzhe Shen, Royden Wagner, Felix Hauser, Dominik Strutz, Jaime Villa, Gleb Stepanov, Holger Caesar, Ömer Şahin Taş, Frank Bieder, Jan-Hendrik Pauls, Christoph Stiller

发表机构 * FZI Research Center for Information Technology(FZI信息技术研究中心) Karlsruhe Institute of Technology(卡尔斯鲁厄理工学院) University Charles III of Madrid(马德里卡洛斯三世大学) Delft University of Technology(代尔夫特理工大学)

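AI summary: Presents the KITScenes Multimodal dataset, which uses high-fidelity sensors and complete HD maps to address shortfalls of existing datasets in sensor fidelity, map completeness, and geographic diversity, and introduces four benchmarks that advance spatial learning.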

Comments 28 pages, 21 figures

详情
AI中文摘要

现有的自动驾驶数据集取得了重大进展,但在传感器保真度、地图完整性或地理多样性方面仍存在不足。我们提出了KITScenes多模态数据集,这是一个基于高保真传感器和地图构建的欧洲数据集。我们完全同步的传感器套件结合了高分辨率全局快门相机、超过400米的长距离激光雷达、4D成像雷达以及冗余的GNSS/INS定位。据我们所知,我们的HD地图是任何传感器数据集中最完整的,并通过开源软件上的自动驾驶试验进行了验证。首次在公共数据集中,所有与驾驶相关的交通元素(如交通灯)都以3D方式映射到重投影精确的水平,并具有完整的拓扑连接。我们的数据集记录在街道布局不规则且交通模式混合的城市中,通过拓宽可用的地理多样性来补充现有数据集。我们还引入了四个基准,每个基准都推动了具身AI的空间学习:在线HD地图构建、长距离深度估计、新颖视图合成和端到端驾驶。项目页面:https://kitscenes.com/

英文摘要

Existing autonomous driving datasets have enabled major progress, but fall short in sensor fidelity, map completeness, or geographic diversity. We present KITScenes Multimodal, a European dataset built around high-fidelity sensors and maps. Our fully synchronized sensor suite combines high-resolution global-shutter cameras, long-range lidar beyond 400m, 4D imaging radar, and redundant GNSS/INS localization. Our HD maps are, to our knowledge, the most complete of any sensor dataset, validated through autonomous driving trials on open-source software. For the first time in a public dataset, all driving-relevant traffic elements, such as traffic lights, are mapped in 3D to a reprojection-accurate level with full topological connectivity. Recorded in cities with irregular street layouts and mixed traffic modes, our dataset complements existing datasets by broadening the available geographic diversity. We also introduce four benchmarks, each advancing spatial learning for embodied AI: online HD map construction, long-range depth estimation, novel view synthesis, and end-to-end driving. Project page: https://kitscenes.com/

2606.02953 2026-06-03 cs.CL

Linguistic Productivity in Large Language Models: Models Coerce, but do not Preempt

大型语言模型中的语言生产力:模型强制但不抢占

Claire Bonial, Claire Benet Post, Laura Michaelis, Harish Tayyar Madabushi

发表机构 * Georgetown University(乔治城大学) University of Colorado Boulder(科罗拉多大学博尔德分校) University of Bath(巴斯大学)

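AI summary: Tests whether large language models are sensitive to two statistical signals, entrenchment (high-frequency usage) and preemption (a structure never observed where it would be expected), and finds that models recognise constructional productivity in cases of coercion but cannot use negative evidence to avoid overgeneralisation.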

详情
AI中文摘要

基于使用的语法理论认为,语言的创造性生产力受到两种不同频率信号的增强和约束:固化(源于高频使用)和抢占(源于在期望出现特定语言结构的语境中从未观察到该结构)。大型语言模型也是基于使用的,因为语言结构是通过接触大量文本而习得的。在这里,我们测试固化和抢占这两种对立的统计力量是否也鼓励和约束了LLM中的语言生产力。我们跨模型架构证明,较大的模型在强制情况下能够识别并用非词再现构式生产力(固化),其中更广泛的构式语境强制了对词汇项的非典型解释。然而,我们也表明,即使最大的模型也不会将负面证据扩展到新语言,并且统计抢占不能使模型避免对语义上合适但从未在数据中观察到的模式进行过度泛化。

英文摘要

Usage-based theories of grammar posit that the creative productivity of linguistic structures is both bolstered and constrained by two distinct frequency signals: entrenchment, stemming from high-frequency usage, and preemption, stemming from having never observed a particular linguistic structure in a context where one might expect that structure to appear. Large Language Models are also usage-based, in the sense that the structures of language are learned through exposure to vast amounts of text. Here, we test whether the opposing statistical forces of entrenchment and preemption also encourage and constrain linguistic productivity in LLMs. We demonstrate across model architectures that larger models recognize constructional productivity (entrenchment), and can reproduce it with nonce words, in cases of coercion, wherein the broader constructional context coerces an atypical interpretation of a lexical item. However, we also show that even the largest models do not extend negative evidence to novel language, and statistical preemption does not enable models to avoid overgeneralization of patterns that are semantically felicitous but never observed in data.

2606.02951 2026-06-03 cs.RO cs.AI cs.CL cs.CV cs.HC

SCOPE: Real-Time Natural Language Camera Agent at the Edge

SCOPE:边缘实时自然语言相机代理

Nikolaj Hindsbo, Sina Ehsani, Pragyana Mishra

发表机构 * Armada AI

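AI summary: Proposes SCOPE, a modular agent for natural-language control of PTZ cameras that performs real-time perception, planning, and control on edge hardware, and evaluates latency, accuracy, and error modes in simulation and on a physical camera.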

Comments 9 pages, 4 figures, 6 tables. Accepted at HRI '26 (21st ACM/IEEE International Conference on Human-Robot Interaction), Edinburgh, Scotland, March 16--19, 2026. Code: https://github.com/HindsboNikolaj/SCOPE

详情
Journal ref
Proceedings of the 21st ACM/IEEE International Conference on Human-Robot Interaction (HRI '26), ACM, 2026
AI中文摘要

在机器人领域部署语言驱动的代理需要能够反映现实任务需求的评估:自然语言指令与可重复的结果。此类代理必须将语言模型连接到可调用的感知和控制工具,并使用部署关键指标(包括延迟、准确性和错误模式)进行评估。我们提出了SCOPE(用于感知和评估的仿真与相机操作),这是一个模块化代理,用于自然语言、开放词汇的云台变焦(PTZ)相机控制和视觉场景理解,专门为边缘部署设计。SCOPE既可在基于Blender的仿真环境中运行,也可在物理PTZ相机上运行,所有感知、规划和控制均在部署现场使用边缘可访问的计算资源本地执行。我们发布了一个包含536个任务的基准测试,涵盖问答、单步和多步命令、计数、空间推理、描述以及光学字符识别,在基于Blender的仿真环境中提供逼真的PTZ控制功能。执行轨迹与LM作为评判器结合,以评估延迟、准确性和错误模式。我们评估了19种规划器-感知模型组合,将Qwen3小语言模型(SLM)与Moondream和Qwen视觉语言模型(VLM)配对。更强的SLM显著减少了幻觉并改善了工具路由,从而实现了更可靠的闭环行为。一旦使用了足够强大的SLM,感知就成为主要的性能瓶颈。在规划和感知方面,混合专家模型在延迟和内存占用与更小网络相当的情况下,始终匹配或超过密集替代方案。量化在精度损失最小的情况下提供了额外的效率提升,为实时、边缘可行的语言驱动PTZ控制确定了一个实用的、从仿真到现实验证的设计点。

英文摘要

Deploying language-driven agents in robotics requires evaluations that reflect real-world task demands: natural-language instructions with reproducible outcomes. Such agents must connect language models to callable perception and control tools, and be assessed using deployment-critical metrics including latency, accuracy, and error modes. We present SCOPE (Simulation and Camera Operations for Perception and Evaluation), a modular agent for natural-language, open-vocabulary pan-tilt-zoom (PTZ) camera control and visual scene understanding, designed explicitly for edge deployment. SCOPE operates both in a Blender-based simulation environment and on a physical PTZ camera, executing all perception, planning, and control locally at the deployment site using edge-accessible compute. We release a 536-task benchmark spanning QA, single- and multi-step commands, counting, spatial reasoning, descriptions, and optical character recognition in a Blender-based simulation environment that exposes realistic PTZ control affordances. Execution traces are combined with an LM-as-Judge to evaluate latency, accuracy, and error modes. We evaluate 19 planner-perception model combinations pairing Qwen3 small language models (SLMs) with Moondream and Qwen vision-language models (VLMs). Stronger SLMs substantially reduce hallucinations and improve tool routing, leading to more reliable closed-loop behavior. Once a sufficiently capable SLM is used, perception becomes the dominant performance bottleneck. Mixture-of-Experts models on both the planning and perception side consistently match or exceed dense alternatives at latencies and memory footprints comparable to much smaller networks. Quantization provides additional efficiency gains with minimal accuracy degradation, identifying a practical, sim-to-real validated design point for real-time, edge-feasible language-driven PTZ control.

2606.02948 2026-06-03 cs.LG cs.DS

From Non-Convex to Strongly Convex: Curvature-Adaptive FTPL for Online Optimization

从非凸到强凸:曲率自适应的FTPL在线优化

Moses Charikar, Chirag Pabbaraju, Ambuj Tewari

发表机构 * Stanford University(斯坦福大学) University of Michigan, Ann Arbor(密歇根大学安娜堡分校)

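AI summary: Proposes a curvature-adaptive FTPL algorithm whose time-varying perturbation scale attains the optimal regret bound for non-convex Lipschitz losses and reaches logarithmic regret when cumulative curvature grows linearly.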

详情
AI中文摘要

曲率自适应是在线优化中的一个经典主题:对于凸Lipschitz损失,自适应方法在一般凸损失的最优$O(\sqrt{T})$遗憾和强凸性下的$O(\log T)$遗憾之间进行插值。最近的研究表明,假设可以访问近似离线优化预言机,Follow-the-Perturbed-Leader (FTPL) 即使对于在线非凸Lipschitz损失也能实现最优的$O(\sqrt{T})$遗憾,但这些保证没有利用曲率。我们证明,在非凸设置中,FTPL可以变得曲率自适应,而无需事先知道曲率如何随时间累积。我们的算法将标准FTPL的固定扰动尺度替换为仅使用过去信息选择的时变尺度。我们给出了该尺度的简单跟随者调节规则,并表明它与事后最佳选择竞争(在常数因子内)。所得到的方法对于任意非凸Lipschitz损失实现了$O(\sqrt{T})$遗憾,并随着累积曲率的增长而改进;在足够精确的预言机调用下,当累积曲率线性增长时(包括经典的强凸情形),它实现了$O(\log T)$遗憾。我们用规定的累积曲率序列(即使对于一维凸损失)的匹配下界来补充这些上界,表明最坏情况非凸遗憾与曲率驱动的快速速率之间的权衡是内在的。

英文摘要

Curvature adaptivity is a classical theme in online optimization: for convex Lipschitz losses, adaptive methods interpolate between the optimal $O(\sqrt{T})$ regret for general convex losses and $O(\log T)$ regret under strong convexity. Recent work has shown that Follow-the-Perturbed-Leader (FTPL) achieves optimal $O(\sqrt{T})$ regret even for online non-convex Lipschitz losses, assuming access to an approximate offline-optimization oracle, but these guarantees do not exploit curvature. We show that FTPL can be made curvature-adaptive in the non-convex setting, without knowing in advance how curvature will accumulate over time. Our algorithm replaces the fixed perturbation scale of standard FTPL with a time-varying scale chosen using only past information. We give a simple follow-the-leader tuning rule for this scale and show that it competes, up to constants, with the best choice in hindsight. The resulting method achieves $O(\sqrt{T})$ regret for arbitrary non-convex Lipschitz losses and improves as cumulative curvature grows; with sufficiently accurate oracle calls, it achieves $O(\log T)$ regret when cumulative curvature grows linearly, which includes the classical strongly convex regime. We complement these upper bounds with matching lower bounds for prescribed cumulative-curvature sequences, already for one-dimensional convex losses, showing that the tradeoff between worst-case non-convex regret and curvature-driven fast rates is intrinsic.
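The structural change described in the abstract, replacing FTPL's fixed perturbation scale with a time-varying one, can be sketched on a one-dimensional decision grid. The sketch makes several assumptions that are not from the paper: the offline oracle is an exact grid search, the perturbation is a single exponential draw applied linearly, and the schedule `scale` is supplied by the caller. The paper's own follow-the-leader tuning rule is not given in the abstract and is not reproduced here.

```python
import math
import random
from typing import Callable, List, Sequence


def ftpl(losses: Sequence[Callable[[float], float]], grid: Sequence[float],
         scale: Callable[[int], float], seed: int = 0) -> List[float]:
    """Follow-the-Perturbed-Leader over a finite grid with a time-varying scale.

    At round t, play argmin_x [ sum_{s<t} f_s(x) - scale(t) * sigma_t * x ],
    where sigma_t ~ Exp(1) is drawn fresh each round.
    """
    rng = random.Random(seed)
    cumulative = [0.0] * len(grid)  # running sum of past losses per grid point
    plays: List[float] = []
    for t, loss in enumerate(losses, start=1):
        sigma = rng.expovariate(1.0)
        eta = scale(t)
        # Grid search stands in for the approximate offline-optimization oracle.
        best = min(range(len(grid)),
                   key=lambda i: cumulative[i] - eta * sigma * grid[i])
        plays.append(grid[best])
        # Only after committing to a play is the round's loss revealed.
        for i, x in enumerate(grid):
            cumulative[i] += loss(x)
    return plays


grid = [0.0, 0.25, 0.5, 0.75, 1.0]
quadratic = [lambda x: (x - 0.5) ** 2] * 3

# With the perturbation switched off, FTPL reduces to follow-the-leader.
leader_plays = ftpl(quadratic, grid, scale=lambda t: 0.0)
# A placeholder sqrt(t) schedule (hypothetical, not the paper's rule).
perturbed_plays = ftpl(quadratic, grid, scale=lambda t: math.sqrt(t))
```

A curvature-adaptive rule would choose `scale(t)` from the losses seen so far, presumably keeping the perturbation small when accumulated curvature already stabilises the leader, which is the regime in which the abstract reports logarithmic regret.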

2606.02947 2026-06-03 cs.LG cs.CV

BYORn: Bootstrap Your Own Responses to Defend Large Vision-Language Models Against Backdoor Attacks

BYORn:自举你的响应以防御大型视觉-语言模型的后门攻击

Ivan Sabolić, Marin Oršić, Josip Šarić, Sven Lončarić

发表机构 * University of Rijeka(里耶卡大学)

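AI summary: Proposes BYORn, a framework that identifies semantically implausible backdoor target responses and replaces them, breaking the association between triggers and target outputs and improving robustness to backdoor attacks while preserving clean-task performance.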

Comments Accepted to ICML 2026

详情
AI中文摘要

监督微调是将自回归视觉-语言模型适应下游任务的主要方法。最近的研究表明,这种范式极易受到后门攻击,并且现有的防御在开放生成设置中无效。为此,我们提出了BYORn,一个鲁棒的后门防御微调框架,其动机是观察到,在给定相应图像-文本输入和预训练模型的情况下,被毒化的目标响应通常在语义上不合理。BYORn识别这种不对齐的响应,并动态地用模型生成的替代响应替换它们,从而打破触发器与目标输出之间的相关性。由此产生的目标梯度对应于干净数据分布上总体风险上界的经验估计的梯度。实验上,BYORn在保持干净任务性能的同时,持续提高了对后门攻击的鲁棒性,建立了泛化与攻击成功率之间的新权衡边界。最后,我们证明了BYORn对专门设计用于规避所提防御的自适应攻击仍然有效。

英文摘要

Supervised fine-tuning is the predominant approach for adapting autoregressive vision-language models to downstream tasks. Recent work has shown that this paradigm is highly vulnerable to backdoor attacks, and that existing defenses are ineffective in open-ended generation settings. In response, we propose BYORn, a backdoor-robust fine-tuning framework motivated by the observation that poisoned target responses are often semantically implausible given the corresponding image-text inputs and a pretrained model. BYORn identifies such misaligned responses and dynamically replaces them with alternative responses generated by the model, thereby breaking the correlation between triggers and target outputs. The resulting objective gradient corresponds to the gradient of the empirical estimate of the population risk upper bound over the clean data distribution. Empirically, BYORn consistently improves robustness to backdoor attacks while preserving clean-task performance, establishing a new trade-off frontier between generalization and attack success rate. Finally, we demonstrate that BYORn remains effective against adaptive attacks specifically designed to circumvent the proposed defense.

2606.02946 2026-06-03 cs.LG cs.CR

Outsmarting the Chameleon: Counterfactual Decoupling for Tactical OOD Shifts in Live Streaming Risk Assessment

智取变色龙:针对直播风险评估中战术性OOD偏移的反事实解耦

Yiran Qiao, Jing Chen, Jiaqi Xu, Yang Liu, Qiwei Zhong, Xiang Ao

发表机构 * Institute of Computing Technology, Chinese Academy of Sciences(中国科学院计算技术研究所) ByteDance China(字节跳动中国)

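AI summary: To counter malicious actors in live streaming who evade detection through tactical out-of-distribution (OOD) shift, proposes LPCD, a latent-causal counterfactual decoupling framework that models intent and narrative variation at the latent level and enforces latent counterfactual consistency for robust risk assessment.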

Comments Accepted by KDD'26

详情
AI中文摘要

直播已成为社交互动和数字商务的主要媒介,但日益受到复杂风险的困扰。该领域的一个基本挑战是战术性分布偏移(tactical OOD shift):虽然恶意行为者保持稳定的潜在目标,但他们不断重新设计叙事包装以逃避检测。这种对抗性偏移暴露了现有OOD泛化范式的关键局限性,其假设在意图-战术紧密耦合演变和原始级反事实定义不清的情况下难以满足。在本文中,我们从潜在因果角度解决这一问题,并提出潜在预测反事实解耦(LPCD),一个用于鲁棒直播风险评估的即插即用框架。LPCD通过在潜在层建模意图和叙事变化来实现对抗性战术重新包装下的反事实推理,并强制潜在反事实一致性以将风险预测锚定在因果稳定的恶意意图上。在推理时,LPCD应用轻量级、无参数的校准以进一步缓解战术引起的分布偏移。在大规模工业数据集和在线生产流量上的大量实验表明,LPCD持续优于最先进的基线,验证了其在现实直播中调节不断演变的对抗性风险的有效性。项目页面:https://qiaoyran.github.io/LiveStreamingRiskAssessment/

英文摘要

Live streaming has emerged as a primary medium for social interaction and digital commerce, yet it is increasingly plagued by sophisticated risks. A fundamental challenge in this domain is tactical out-of-distribution (OOD) shift: while malicious actors maintain stable underlying objectives, they continuously redesign narrative packaging to evade detection. Such adversarial shifts expose critical limitations of existing OOD generalization paradigms, whose assumptions are difficult to satisfy in the presence of tightly coupled intent-tactic evolution and ill-defined raw-level counterfactuals. In this paper, we tackle this issue from a latent causal perspective and propose Latent-Predictive Counterfactual Decoupling (LPCD), a plug-in framework for robust live streaming risk assessment. LPCD enables counterfactual reasoning under adversarial tactical re-packaging by modeling intent and narrative variation at the latent level, and enforces latent counterfactual consistency to anchor risk prediction on causally stable malicious intent. At inference time, LPCD applies a lightweight, parameter-free calibration to further mitigate tactic-induced distribution shifts. Extensive experiments on large-scale industrial datasets and online production traffic demonstrate that LPCD consistently outperforms state-of-the-art baselines, validating its effectiveness in moderating evolving adversarial risks in real-world live streaming. The project page is available at https://qiaoyran.github.io/LiveStreamingRiskAssessment/.

2606.02939 2026-06-03 cs.LG eess.SP

ERP-XTTN: Interpretable Prototype-Guided Cross-Attention for Cross-Subject ERP Classification

ERP-XTTN: 可解释的原型引导跨注意力用于跨被试ERP分类

Charlotte Genevier Wyman, Leanne Hirshfield

发表机构 * University of Colorado Boulder(科罗拉多大学博尔德分校)

AI总结 提出ERP-XTTN,一种基于原型引导跨注意力的架构,在无需校准的跨被试条件下实现可解释的ERP分类,并揭示分类错误的神经生理学原因。

AI中文摘要

构建无需校准即可跨被试泛化的可解释脑机接口分类器,仍然是一个开放的挑战。我们测试了基于原型的跨注意力是否能在部署兼容条件下提供具有竞争力且可解释的事件相关电位(ERP)分类。我们提出ERP-XTTN,一种跨注意力架构,通过仅查询-键的跨注意力(无值投影)将输入EEG片段路由到固定的差异波原型,因此分类完全依赖于注意力路由,且注意力忠实性是结构性的而非事后解释的。原型从训练折差异波的极值自动推导。我们在三个公开数据来源(BNCI Horizon 2020、HRI Cursor和ERP CORE)上评估,涵盖八个ERP成分(ERN、LRP、ErrP、N170、P300、N2pc、MMN、N400),使用留一被试(LOSO)评估,并在两种通道数(3通道和全导联)下采用因果滤波,与EEGNet和基于黎曼几何的xDAWN(xDAWN+RG)对比。最佳基线与ERP-XTTN的平均差距在3通道时为0.018 AUROC,在全导联时为0.034,这源于两个大致不同的来源:相对于EEGNet的时间灵活性成本和相对于xDAWN+RG的空间利用成本,后者在全导联时由信噪比驱动。除了准确性,透明的路由揭示了黑箱模型无法发现的跨被试信号结构:假阳性与真阳性的相似度高于真阴性与真阳性的相似度,表明分类错误在神经生理学上是可以解释的。ERP-XTTN在因果、无校准条件下泛化到多种ERP,并在最小导联设置下具有较小的可解释性代价。据我们所知,这是ERP CORE上首个epoch级LOSO基准测试。

英文摘要

Interpretable brain-computer interface classifiers that generalize across subjects without calibration remain an open challenge. We test whether prototype-based cross-attention can provide competitive, interpretable event-related potential (ERP) classification under deployment-compatible conditions. We propose ERP-XTTN, a cross-attention architecture that routes input EEG patches to fixed difference-wave prototypes via query-key-only cross-attention with no value projection, so classification depends entirely on attention routing and attention faithfulness is structural rather than post-hoc. Prototypes are derived automatically from extrema in the training-fold difference wave. We evaluate across three public sources (BNCI Horizon 2020, HRI Cursor, and ERP CORE) spanning eight ERP components (ERN, LRP, ErrP, N170, P300, N2pc, MMN, N400), using leave-one-subject-out (LOSO) evaluation with causal filtering at two channel counts (3-channel and full montage), against EEGNet and xDAWN with Riemannian geometry (xDAWN+RG). The mean gap between the best baseline and ERP-XTTN was .018 AUROC at 3 channels and .034 at full montage, arising from two largely distinct sources: a temporal-flexibility cost relative to EEGNet and a spatial-exploitation cost relative to xDAWN+RG, the latter driven by signal-to-noise ratio at full montage. Beyond accuracy, the transparent routing reveals cross-subject signal structure that black-box models cannot: false positives resembled true positives more than true negatives did, indicating that classification errors are neurophysiologically explicable. ERP-XTTN generalizes across diverse ERPs under causal, calibration-free conditions with a small interpretability cost at minimal montages. To our knowledge, this is the first epoch-level LOSO benchmark on ERP CORE.
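
The routing idea in the abstract above (query-key-only cross-attention to fixed prototypes, with no value projection) can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the identity projections `w_q` and `w_k`, the toy prototypes, and the rule of summing attention mass per class are all assumptions made for illustration, since the abstract does not specify them.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def route_to_prototypes(patches, prototypes, w_q, w_k):
    """Attention weights of each input patch over the fixed prototypes.

    patches:    (n_patches, d_in) input segments (queries)
    prototypes: (n_protos, d_in)  fixed templates (keys)
    There is no value projection: the attention weights are the output.
    """
    q = patches @ w_q
    k = prototypes @ w_k
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1)

def class_scores(attn, proto_labels, n_classes):
    """Hypothetical readout: average attention mass routed to each class."""
    mass = attn.mean(axis=0)                # (n_protos,)
    scores = np.zeros(n_classes)
    np.add.at(scores, proto_labels, mass)   # sum mass per prototype class
    return scores

# Toy data (assumed): four prototypes, two per class.
prototypes = 5.0 * np.eye(4)
proto_labels = np.array([0, 0, 1, 1])
patches = prototypes[[2, 3, 2]]             # patches resembling class-1 prototypes
w_q = w_k = np.eye(4)                       # identity projections for clarity

attn = route_to_prototypes(patches, prototypes, w_q, w_k)
scores = class_scores(attn, proto_labels, n_classes=2)
print(attn.round(3))
print(scores.round(3))
```

Because the decision is computed only from `attn`, inspecting the attention matrix shows exactly which prototype each patch was routed to, which is the sense in which the abstract calls the faithfulness structural.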

2606.02936 2026-06-03 cs.LG

Hierarchical RBF-KAN and RBF-SKAN Architectures for Multidimensional Function Approximation and Random Field Learning

分层RBF-KAN和RBF-SKAN架构用于多维函数逼近和随机场学习

Mingtao Xia, Qijing Shen

发表机构 * University of Houston(休斯顿大学) University of Birmingham(伯明翰大学) University of Oxford(牛津大学)

AI总结 提出并分析使用径向基函数作为激活函数的分层Kolmogorov-Arnold神经网络架构,用于逼近确定性函数和随机场模型,并证明其通用逼近性质及缓解维度灾难的潜力。

AI中文摘要

本文提出并分析了使用径向基函数作为激活函数的分层Kolmogorov-Arnold神经网络架构,用于逼近确定性函数和随机场模型。具体地,我们开发了用于多维确定性函数逼近的分层径向基函数Kolmogorov-Arnold网络(分层RBF-KAN)和用于随机场学习的分层径向基函数随机Kolmogorov-Arnold网络(分层RBF-SKAN)。从理论角度,我们为两种架构建立了通用逼近结果。特别地,我们推导了分层RBF-KAN的定量逼近估计,表明所提出的框架通过降低逼近问题的有效维度,有潜力部分缓解高维函数学习中的维度灾难。此外,我们证明了分层RBF-SKAN可以在Wasserstein-2度量下逼近随机场模型。实验上,我们表明所提出的基于径向基函数的神经网络结构能够有效学习多元函数和随机场模型。

英文摘要

In this manuscript, we propose and analyze hierarchical Kolmogorov--Arnold neural network architectures employing radial basis functions as activation functions for approximating deterministic functions and random field models. Specifically, we develop a hierarchical radial-basis-function Kolmogorov--Arnold network (hierarchical RBF-KAN) for multidimensional deterministic function approximation and a hierarchical radial-basis-function stochastic Kolmogorov--Arnold network (hierarchical RBF-SKAN) for random field learning. From a theoretical perspective, we establish universal approximation results for both architectures. In particular, we derive quantitative approximation estimates for the hierarchical RBF-KAN, showing that the proposed framework has the potential to partially alleviate the curse of dimensionality in learning high-dimensional functions by reducing the effective dimensionality of the approximation problem. Furthermore, we show that the hierarchical RBF-SKAN can approximate random field models under the Wasserstein-2 metric. Empirically, we show that our proposed radial-basis-function-based neural network structure could effectively learn multivariate functions and random field models.
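
As a rough illustration of a Kolmogorov-Arnold layer with radial-basis-function activations, the sketch below builds each edge function as a weighted sum of Gaussian bumps and fits the weights by least squares. The Gaussian form, the shared `centers` and `width`, the function names, and the toy target are assumptions; the hierarchical construction and the stochastic (SKAN) variant described in the abstract are not reproduced here.

```python
import numpy as np

def rbf_features(x, centers, width):
    """Gaussian RBF features per input coordinate.

    x: (batch, d_in) -> (batch, d_in, n_centers)
    """
    return np.exp(-(((x[..., None] - centers) / width) ** 2))

def rbf_kan_layer(x, coeffs, centers, width):
    """One KAN-style layer: out_o = sum_i phi_{i,o}(x_i),
    where each univariate phi_{i,o} is a weighted sum of Gaussian RBFs.

    coeffs: (d_in, n_centers, d_out)
    """
    phi = rbf_features(x, centers, width)
    return np.einsum("bik,iko->bo", phi, coeffs)

# Toy additive target (assumed for illustration): f(x1, x2) = sin(2*x1) + x2^2
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=(400, 2))
y = np.sin(2.0 * x[:, 0]) + x[:, 1] ** 2

centers = np.linspace(-1.0, 1.0, 12)
width = 0.3

# The layer is linear in its coefficients, so a single layer can be fit
# with ordinary least squares instead of gradient descent.
design = rbf_features(x, centers, width).reshape(len(x), -1)   # (400, 24)
flat, *_ = np.linalg.lstsq(design, y, rcond=None)
coeffs = flat.reshape(2, 12, 1)

pred = rbf_kan_layer(x, coeffs, centers, width)[:, 0]
rmse = float(np.sqrt(np.mean((pred - y) ** 2)))
print(f"training RMSE: {rmse:.2e}")
```

The target is additive in its inputs, so one layer suffices here; functions with interactions between coordinates would need stacked layers trained by gradient-based optimization.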

2606.02935 2026-06-03 cs.CV cs.CE

CAD-to-CT Registration of Cylindrical Objects via Ellipse-Based Axis Estimation

基于椭圆轴估计的圆柱体CAD到CT配准

Aleksander Ogonowski, Mikołaj Mrozowski, Daniel Więcek, Arkadiusz Ćwiek, Konrad Klimaszewski, Rafał Możdżonek, Adam Padee, Lech Raczyński, Piotr Wasiuk, Wojciech Wiślicki, Michał Matusiak, Sławomir Wronka

发表机构 * Department of Complex Systems, National Centre for Nuclear Research(国家核研究中心复杂系统系) ImagineRT sp. z o.o.(ImagineRT公司) National Centre for Nuclear Research(国家核研究中心)

AI总结 提出一种两阶段几何配准方法,通过检测CT切片中的椭圆截面估计旋转轴,再通过体素化CAD模型并最大化与CT扫描的体积重叠实现圆柱体(电离室)的精确配准,无需强度校准或特征匹配,倾斜和方向误差低于0.1°。

AI中文摘要

CAD模型与CT扫描的精确配准对于在体积成像中建立真实几何基准至关重要。获取可靠的对象掩膜在机器学习环境中日益重要;随着最新架构能力增强,需要大规模数据集以充分利用其能力。当CT灰度值缺乏校准参考时,传统的基于强度的方法失效,而基于点的算法(如ICP、RANSAC)需要理想化CAD几何与噪声体积CT数据之间不可用的特征对应。我们提出了一种针对圆柱体(电离室)的两阶段几何配准方法,利用对象的独特几何特征。首先,通过检测CT切片中的椭圆截面、对边缘检测轮廓拟合椭圆,并在RANSAC异常值去除后对拟合椭圆中心进行PCA,来估计3D旋转轴。其次,将CAD模型体素化,沿检测轴定向,并通过平移调整最大化与CT扫描的体积重叠。该方法无需强度校准或特征匹配,即可实现倾斜和方向误差低于0.1°的鲁棒配准。配准后,对齐的CAD模型为机器学习目标定位和工业CT工作流中的自动分析等应用提供真实几何基准。

英文摘要

Accurate registration of CAD models to CT scans is essential for establishing ground truth geometry in volumetric imaging. Obtaining reliable object masks is of growing importance in machine learning settings; as recent architectures grow more capable, huge datasets are required to fully utilise their capabilities. Traditional intensity-based methods fail when CT grayscale values lack calibration references, while point-based algorithms (e.g., ICP, RANSAC) require feature correspondence unavailable between idealized CAD geometry and noisy volumetric CT data. We propose a two-stage geometric registration method for cylindrical objects (ionization chambers) that takes advantage of the distinctive geometric features of the objects. First, we estimate the 3D rotation axis by detecting elliptical cross-sections across CT slices, fitting ellipses to edge-detected contours, and performing PCA on the fitted ellipse centers after RANSAC outlier removal. Second, we voxelize the CAD model, orient it along the detected axis, and maximize volumetric overlap with the CT scan through translational adjustment. This approach achieves robust registration with tilt and orientation errors below 0.1° without intensity calibration or feature matching. Once registered, the aligned CAD model provides ground truth geometry for applications including machine learning-based object localization and automated analysis in industrial CT workflows.
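
The first stage described above (PCA on ellipse centers after RANSAC outlier removal) can be sketched as follows. The sketch assumes the per-slice ellipse centers have already been extracted. The function names, the inlier threshold, the iteration count, and the synthetic data are illustrative assumptions, not values from the paper.

```python
import numpy as np

def point_line_distance(points, origin, direction):
    """Distance of each 3D point to the line origin + t * direction (unit vector)."""
    rel = points - origin
    along = np.outer(rel @ direction, direction)
    return np.linalg.norm(rel - along, axis=1)

def pca_axis(points):
    """Principal direction of a point cloud via SVD of the centered points."""
    centroid = points.mean(axis=0)
    _, _, vt = np.linalg.svd(points - centroid, full_matrices=False)
    return centroid, vt[0] / np.linalg.norm(vt[0])

def estimate_axis(centers, threshold=0.5, n_iter=200, seed=0):
    """RANSAC line fit to reject outlier centers, then PCA on the inliers."""
    rng = np.random.default_rng(seed)
    best = np.zeros(len(centers), dtype=bool)
    for _ in range(n_iter):
        i, j = rng.choice(len(centers), size=2, replace=False)
        d = centers[j] - centers[i]
        norm = np.linalg.norm(d)
        if norm == 0.0:
            continue  # degenerate sample, skip
        inliers = point_line_distance(centers, centers[i], d / norm) < threshold
        if inliers.sum() > best.sum():
            best = inliers
    centroid, direction = pca_axis(centers[best])
    return centroid, direction, best

# Synthetic ellipse centers (assumed): one per slice along a slightly tilted axis.
rng = np.random.default_rng(1)
z = np.arange(50, dtype=float)
true_dir = np.array([0.02, -0.01, 1.0])
true_dir /= np.linalg.norm(true_dir)
centers = np.column_stack([0.02 * z, -0.01 * z, z])
centers[:, :2] += rng.normal(scale=0.02, size=(50, 2))   # in-plane fitting noise
outlier_idx = np.array([5, 17, 29, 41])
centers[outlier_idx, 0] += 5.0                           # badly fitted slices

centroid, direction, inliers = estimate_axis(centers)
cos_angle = np.clip(abs(direction @ true_dir), 0.0, 1.0)
angle_deg = float(np.degrees(np.arccos(cos_angle)))
print(f"axis error: {angle_deg:.4f} deg, inliers: {int(inliers.sum())}/50")
```

The sign of `direction` is arbitrary in PCA, which is why the angular error is computed from the absolute value of the dot product.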