arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.02798 2026-06-03 cs.AI

BehaviorBench: Modeling Real-World User Decisions from Behavioral Traces

Liangwei Yang, Jielin Qiu, Zixiang Chen, Ming Zhu, Juntao Tan, Zhiwei Liu, Wenting Zhao, Zhujun Lan, Akshara Prabhakar, Silvio Savarese, Huan Wang, Shelby Heinecke

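Affiliations: Salesforce AI Research

AI summary: Proposes the BehaviorBench benchmark, which uses real-world behavioral traces (prediction-market and on-chain records) to evaluate personalized decision modeling, with two task layers: Belief prediction and Trade prediction.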

English abstract

Many decision-support settings require systems that adapt to individual users, but evaluation data for this problem remain limited. Existing benchmarks for user understanding often rely on simulated users or model-generated behavior, even though recent work cautions that model-based simulations can diverge systematically from human behavior. We introduce BehaviorBench, a benchmark for evaluating personalized decision modeling from real-world behavioral traces. BehaviorBench reconstructs wallet-level decision histories from observed public prediction-market and on-chain records, and organizes them into two complementary task layers: Belief prediction, which predicts a user's final revealed stance and confidence in a market, and Trade prediction, which predicts the direction and amount of individual transactions. Across 2,000 evaluation wallets, the benchmark contains 141,445 Belief instances and 1,485,972 Trade instances, with disjoint support pools for retrieval-based evaluation. We evaluate frontier and open-weight generative models under four history interfaces: no personalization, direct recent history, generated user profiles, and retrieved support-wallet evidence. Personalization improves Belief prediction more consistently than Trade prediction, model rankings change across task layers and metrics, and different history interfaces expose different failure modes. BehaviorBench provides an evaluation setting for studying whether personalized methods can use real-world behavioral evidence rather than simulated users alone.

2606.02796 2026-06-03 cs.RO

A Measurement-Driven Digital Twin Architecture for Plant-Level Biomass Estimation and Growth Forecasting in Hydroponic Systems

Morgan Mayborne, Abhisesh Silwal, George Kantor

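Affiliations: The Robotics Institute, Carnegie Mellon University

AI summary: Proposes a digital twin architecture that combines sensor data with model updates, estimates lettuce mass in real time from RGB-D images with a neural network, and forecasts growth one to four days ahead with an error of about 2 g.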

Comments 7 pages, 6 figures

English abstract

Alternatives to soil-based horticulture, such as hydroponics, have been developed to respond to food distribution concerns for dense urban centers. A new system was developed to track an individual lettuce plant's growth in a hydroponic environment, utilizing streams of measured information and available models to continuously update the growth trajectory estimates for a plant. These "digital twin" models were integrated into an operating hydroponic greenhouse, with custom horticultural and sensor hardware to grow and measure relevant information. To aid in updating model parameters, plant yield was continuously measured with a custom neural network, using RGB-D images of the plants as an input. The network, trained on a collected dataset of 1300 images, was able to estimate mass within 1.5 g of the ground-truth value. After integration into the custom system, digital twin growth projections could approximate future yield between one and four days in the future, maintaining around a 2 g forecasting error.

2606.02791 2026-06-03 cs.AI

Evaluating Transformer and LSTM Frameworks for Prediction in Ungauged Basins

Taye Akinrele, James Halgren, Noorbakhsh Amiri Golilarz, Sudip Mittal, Shahram Rahimi

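Affiliations: University of Arizona

AI summary: Using retrospective simulations from the NOAA National Water Model, this study evaluates whether an encoder-only Transformer has an advantage over an LSTM for upstream streamflow inference under limited hydrologic information; the LSTM performs better overall, and adding downstream information substantially improves prediction skill.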

Comments 5 pages

English abstract

Watershed networks exhibit convergent topologies in which multiple tributaries merge into downstream channels, integrating diverse upstream hydrological processes. In ungauged basins, the absence of direct observations increases uncertainty and limits the ability to anticipate extreme events. This study evaluates whether an encoder-only Transformer provides an advantage over an LSTM for upstream streamflow inference under limited hydrologic information, using retrospective simulations from the NOAA National Water Model (NWM). Across both upstream-only and combined configurations, the LSTM showed stronger overall performance than the Transformer model. Incorporating downstream information further boosted performance for all models, increasing median NNSE by more than 60%. Rather than treating this as a leaderboard-style comparison, we interpret the experiments as a test of architectural inductive bias for hydrologic sequence inference. The results indicate that recurrent memory remains better aligned with this upstream reconstruction task than an encoder-only Transformer, while downstream hydrologic context provides a strong auxiliary constraint that substantially improves prediction skill across architectures.
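
The abstract reports skill as median NNSE without defining it. A minimal sketch, assuming NNSE is the commonly used normalized Nash-Sutcliffe efficiency, 1 / (2 - NSE); the function names and toy values are illustrative and not taken from the paper:

```python
from typing import Sequence


def nse(observed: Sequence[float], simulated: Sequence[float]) -> float:
    """Nash-Sutcliffe efficiency: 1 is a perfect fit, 0 matches the observed mean."""
    mean_obs = sum(observed) / len(observed)
    squared_error = sum((o - s) ** 2 for o, s in zip(observed, simulated))
    variance_term = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - squared_error / variance_term


def nnse(observed: Sequence[float], simulated: Sequence[float]) -> float:
    """Normalized NSE, assumed here to be 1 / (2 - NSE); maps (-inf, 1] onto (0, 1]."""
    return 1.0 / (2.0 - nse(observed, simulated))


# Toy streamflow values (hypothetical): a perfect simulation and a mean-only one.
observed_flow = [1.0, 2.0, 3.0]
perfect_score = nnse(observed_flow, [1.0, 2.0, 3.0])
mean_only_score = nnse(observed_flow, [2.0, 2.0, 2.0])
```

Under this definition a simulation no better than the observed mean scores 0.5, which keeps a median over many basins from being dominated by a few strongly negative NSE values.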

2606.02785 2026-06-03 cs.LG hep-ex physics.atom-ph quant-ph

QUIVER: Quantum-Informed Views for Enhanced Representations in Large ML Models

Aritra Bal, Michael Binder, Markus Klute, Benedikt Maier, Michael Spannowsky

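Affiliations: University of California, Berkeley; ETH Zurich; University of Cambridge

AI summary: Proposes the QUIVER framework, which extracts the quantum Fisher information matrix from a variational quantum circuit as a geometric feature and fuses it with classical features to improve the performance of large machine learning models, validated on the QM9 and JetClass datasets.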

Comments 9 pages, 1 figure and 2 tables. Accepted as a poster at the AI4Physics Workshop, ICML 2026 (Seoul, South Korea)

English abstract

Large machine learning models benefit substantially from multimodal inputs that provide a complementary view of the same example. We introduce QUIVER (QUantum-Informed Views for Enhanced Representations), a paradigm that enriches classical data-driven features with a quantum Fisher view: a geometrically motivated, basis-independent summary of higher-order correlations captured by a variational quantum circuit (VQC) trained to perform the same task. Unlike classical feature augmentation, the quantum Fisher information matrix encodes the intrinsic geometry of the learned quantum state manifold. While this feature map, motivated by quantum information theory, is ordinarily non-trivial to model classically, it can surface statistical structure that additional classical data or model capacity finds difficult to learn. This makes the quantum Fisher view a genuinely complementary modality rather than a redundant one. We demonstrate that QUIVER improves standard performance metrics on two benchmark datasets from very different fields: QM9 for predicting molecule properties, and JetClass for predicting jet flavor at the Large Hadron Collider (LHC). The core contribution, however, is domain-agnostic: the quantum Fisher view can be fused into a broad class of model architectures via targeted modifications to the base architecture, to incorporate information about the quantum geometry of the problem. These results demonstrate that quantum-geometric features, extracted from simulated variational circuits, can deliver measurable value for standard machine learning tasks, well before the advent of fault-tolerant quantum hardware.

2606.02775 2026-06-03 cs.AI cs.AR cs.DC cs.PF cs.RO

AURA: Action-Gated Memory for Robot Policies at Constant VRAM

Josef Chen

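Affiliations: KAIKAKU

AI summary: Proposes AURA-Mem, a constant-size recurrent memory whose writes are gated by an action-error signal, as a replacement for the KV-cache; on edge robotics tasks it matches baseline accuracy while using 5 to 9 times fewer writes.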

English abstract

The KV-cache is the right memory for datacenters but the wrong memory for robots. Datacenter inference batches many short requests and resets them, amortizing an attention cache across a crowd. Embodied agents instead run one long, non-resetting episode on bandwidth-limited edge hardware, where high-bandwidth memory and flash are scarce, flash has finite write endurance, and memory writes rather than compute can become the binding constraint. AURA-Mem (Action-Utility Recurrent Adaptive Memory) targets this regime. It wraps a frozen vision-language-action backbone with a constant-size recurrent memory and a learned gate that writes only when the current observation would change the next action: memory that knows when to stay silent. Unlike reconstruction-based memory, the gate is trained directly against a closed-loop action-error signal. Its inference state is fixed at 4,224 bytes regardless of horizon, while a KV-cache grows to 6,061 times larger at 100,000 steps. On a controlled synthetic benchmark, AURA-Mem matches the best O(1) baseline in accuracy while using 5.19-6.13 times fewer writes, and up to 9.19 times fewer writes on easier configurations. Budget-matched random and periodic schedules do not recover this gain, isolating the benefit to the action-surprise signal. On a trained closed-loop OpenVLA-OFT 7B panel on LIBERO-Long (n=60 episodes per arm), the gate does not hurt success: AURA-Mem matches the ungated base policy (0.233) and slightly exceeds an always-write KV arm (0.217), while using 7.0 times fewer writes and constant memory. We also instantiate an approximate-information-state value-loss bound as a methodology demonstration; at this scale, the bound is vacuous rather than a guarantee.
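
The memory figures in the abstract can be turned into a quick back-of-the-envelope check. The sketch below uses only the three numbers quoted above; the linear-growth assumption and the helper `kv_cache_bytes_at` are illustrative and not from the paper:

```python
AURA_STATE_BYTES = 4_224   # fixed inference state, from the abstract
KV_TO_AURA_RATIO = 6_061   # KV-cache size relative to AURA-Mem at the horizon
HORIZON_STEPS = 100_000    # horizon at which the ratio is reported

kv_cache_bytes = AURA_STATE_BYTES * KV_TO_AURA_RATIO
kv_bytes_per_step = kv_cache_bytes / HORIZON_STEPS  # assumes linear growth


def kv_cache_bytes_at(step: int) -> float:
    """Hypothetical linear extrapolation of the KV-cache size to another horizon."""
    return kv_bytes_per_step * step


print(f"KV-cache at {HORIZON_STEPS:,} steps: {kv_cache_bytes / 1e6:.1f} MB")
print(f"Implied growth: {kv_bytes_per_step:.1f} bytes per step")
```

Under that assumption the quoted ratio corresponds to a KV-cache of roughly 25.6 MB at 100,000 steps, or about 256 bytes per step, against a state that stays at 4,224 bytes.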

2606.02774 2026-06-03 cs.CV

GeoDrive-Bench: Benchmarking Region-Specific Multimodal Reasoning in Autonomous Driving

Yingzi Ma, Chaowei Xiao, Ming Jiang

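Affiliations: University of Wisconsin-Madison; Johns Hopkins University

AI summary: Proposes the GeoDrive-Bench benchmark, which uses 5,053 human-validated multiple-choice questions across six countries to evaluate how well vision-language models reason under region-specific traffic rules in four driving tasks (perception, prediction, planning, and region reasoning), and designs a distillation algorithm that injects regional knowledge to improve model performance.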

English abstract

Vision-language models (VLMs) for autonomous driving have shown promising performance, but their ability to handle region-specific traffic rules remains underexplored, raising uncertainties about their deployment across diverse global settings. We therefore introduce GeoDrive-Bench, a novel benchmark that enables the systematic investigation of VLMs' geo-culturally grounded driving reasoning. We curated 5,053 human-validated multiple-choice QA pairs across six countries covering diverse driving cultures. Specifically, we emphasize four driving tasks: perception, prediction, planning, and region reasoning. Each question requires models to infer the correct driving behavior from visual evidence and local traffic conventions without explicit country labels. Beyond evaluation, we further design a distillation algorithm that injects region-specific traffic-rule knowledge into the internal representations of VLMs, enabling models to better align visual scene understanding with local driving policies. Experiments on nine state-of-the-art VLMs show substantial performance variations across geo-driving cultures for each task, while our proposed baseline models exhibit improved geo-cultural reasoning across regions. These results suggest that current VLMs still lack robust region-aware driving intelligence and highlight GeoDrive-Bench as a diagnostic and training-oriented testbed for deployable autonomous driving foundation models.

2606.02767 2026-06-03 cs.RO cs.LG

Hybrid Adaptive Kalman Filtering for Data-Efficient Joint Tracking and Classification

Jiho Lee, Nisar R. Ahmed, Rebecca Russell

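Affiliations: Charles Stark Draper Laboratory, Inc.; Ann and H. J. Smead Department of Aerospace Engineering Sciences

AI summary: Proposes a self-supervised Hybrid Adaptive Kalman Filter that learns structured corrections to system dynamics and process noise covariance from measurements alone while preserving the filter's probabilistic structure, achieving accurate estimation and robust classification in both low-data and large-data scenarios.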

Comments 8 pages, 4 figures

English abstract

Kalman filtering performance is highly sensitive to model mismatch and noise covariance tuning. Learning-based approaches address these limitations but typically rely on supervised training with large datasets and do not produce consistent uncertainty estimates. In this paper, we propose a self-supervised Hybrid Adaptive Kalman Filter that learns structured corrections to system dynamics and process noise covariance from measurements alone while preserving the probabilistic structure of the filter. This allows the innovation likelihood to be computed and subsequently used for model classification via generalized Bayesian inference. Experimental results on real-world and simulated datasets demonstrate improved estimation accuracy and statistical consistency as well as robust classification performance across both low-data and large-data scenarios.
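
To make the role of the innovation likelihood concrete, here is a minimal scalar sketch of a textbook Kalman filter that accumulates the innovation log-likelihood and uses it to choose between two candidate dynamics models. It is not the paper's learned hybrid filter: it uses ordinary Bayes with a uniform prior rather than generalized Bayesian inference, and every model parameter, initial value, and measurement is a hypothetical choice for illustration.

```python
import math
from typing import Dict, List, Tuple


def kf_step(x: float, p: float, z: float, f: float, q: float,
            h: float = 1.0, r: float = 0.1) -> Tuple[float, float, float]:
    """One scalar predict/update step; also returns the innovation log-likelihood."""
    x_pred = f * x                   # predicted state
    p_pred = f * p * f + q           # predicted variance
    innovation = z - h * x_pred      # measurement residual
    s = h * p_pred * h + r           # innovation variance
    log_lik = -0.5 * (math.log(2.0 * math.pi * s) + innovation ** 2 / s)
    gain = p_pred * h / s
    return x_pred + gain * innovation, (1.0 - gain * h) * p_pred, log_lik


def classify(measurements: List[float],
             models: Dict[str, Tuple[float, float]]) -> Dict[str, float]:
    """Posterior over candidate (f, q) models under a uniform prior."""
    totals = {}
    for name, (f, q) in models.items():
        x, p, total = 1.0, 1.0, 0.0  # hypothetical initial state and variance
        for z in measurements:
            x, p, log_lik = kf_step(x, p, z, f, q)
            total += log_lik
        totals[name] = total
    peak = max(totals.values())      # subtract the peak for numerical stability
    weights = {name: math.exp(t - peak) for name, t in totals.items()}
    norm = sum(weights.values())
    return {name: w / norm for name, w in weights.items()}


# Measurements hovering around 1.0 should favour the static model (f = 1).
posterior = classify([1.0, 1.02, 0.98, 1.01, 0.99, 1.0],
                     {"static": (1.0, 0.01), "decaying": (0.5, 0.01)})
best_model = max(posterior, key=posterior.get)
```

Because the filter keeps its Gaussian structure, each innovation has a closed-form likelihood, which is what allows model comparison without labelled training data.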

2606.02765 2026-06-03 cs.LG cs.AI

Representational Capacity: Geometric Limits on Feature Representation in Transformer Language Models

Alexander Guha

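Affiliations: Arizona State University

AI summary: Building on the Linear Representation and Superposition Hypotheses, estimates the number of near-orthogonal directions a model can support from the cosine-similarity distribution of its embedding matrix, derives a capacity formula, and finds that capacity is exponentially sensitive to the deviation ε.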

Comments 22 pages, 10 figures. Submitted to NeurIPS 2026. This is a condensed version of thesis: https://hdl.handle.net/2286/R.2.N.204857

English abstract

Model dimension ($d_{model}$) is a fundamental hyperparameter in transformer language models, yet its role in setting the geometric limits of feature representation remains under-explored. Grounded in the Linear Representation and Superposition Hypotheses - which propose that models encode features as near-orthogonal directions in latent space - we develop a framework for estimating how many such directions a model can support. We first establish the embedding matrix as a measurable proxy for near-orthogonality constraints across the latent space: the boundary between meaningful token relationships and incidental similarity in the pairwise cosine similarity distribution gives a concrete estimate of the model's accepted deviation $\varepsilon$ from perfect orthogonality. Applying this metric across dozens of open-source models reveals two classes: models with high $\varepsilon$ whose embeddings lack near-orthogonal structure, and models with low $\varepsilon$ that maintain it. We then show that the standard Johnson-Lindenstrauss lemma greatly underestimates the packing efficiency of trained representations, and derive an adjusted capacity formula in which the number of near-orthogonal directions depends on the ratio of vectors to dimensions ($k/d$) rather than the raw count - a single modification that cuts prediction error by two orders of magnitude with no extra parameters. Combining these results, we define representational capacity as an upper bound on the number of distinguishable directions available for features and embeddings in a model's latent space. Capacity is exponentially sensitive to $\varepsilon$, and larger models favor tighter orthogonality constraints over maximizing raw capacity - a pattern compatible with several explanations (a stability-capacity trade-off, a ceiling on usable concepts, or confounds with model scale) that we leave to future work.
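
The measurement behind $\varepsilon$ starts from pairwise cosine similarities between embedding rows. The sketch below only shows that first step on a toy matrix: it computes the absolute cosine similarities and the share of pairs within a given tolerance. The abstract does not give the paper's rule for locating the boundary, so the tolerance here is a free, hypothetical parameter.

```python
import math
from itertools import combinations
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pairwise_abs_cosines(embeddings: Sequence[Sequence[float]]) -> List[float]:
    """Absolute cosine similarity for every unordered pair of rows."""
    return [abs(cosine(u, v)) for u, v in combinations(embeddings, 2)]


def fraction_near_orthogonal(embeddings: Sequence[Sequence[float]],
                             epsilon: float) -> float:
    """Share of row pairs whose |cosine| does not exceed epsilon."""
    sims = pairwise_abs_cosines(embeddings)
    return sum(s <= epsilon for s in sims) / len(sims)


# Toy 2-D "embedding matrix": two orthogonal rows plus one diagonal row.
toy_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
largest_overlap = max(pairwise_abs_cosines(toy_embeddings))
share_within_tolerance = fraction_near_orthogonal(toy_embeddings, epsilon=0.1)
```

On a real embedding matrix the same computation is quadratic in vocabulary size, so a practical version would likely sample pairs rather than enumerate them.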

2606.02764 2026-06-03 cs.CV physics.comp-ph

From Local Training to Large-Scale Mapping: A Comparative Assessment of Machine Learning and Deep Learning for Transferable Satellite-Derived Bathymetry

Hsiao-Jou Hsu, Joachim Moortgat

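Affiliations: School of Earth Sciences, The Ohio State University

AI summary: Evaluates a Random Forest and four CNNs for transferable satellite-derived bathymetry from Sentinel-2 imagery over the 0-20 m depth range; a training strategy that preserves spatial continuity and a Smooth Weight Function-weighted loss yield robust cross-regional depth estimation.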

Comments 42 pages, 13 figures, 15 tables. Supplementary Information provided as ancillary file (anc/SI.pdf). Code and pretrained weights at https://github.com/buckai-observatory/DL_bathy

Journal ref Remote Sens. 18 (2026) 1768

English abstract

Satellite-derived bathymetry (SDB) from multispectral imagery is cost-effective but scales poorly across regions, especially in optically complex coastal environments. We evaluate machine learning and deep learning for transferable SDB over the 0-20 m depth range using Sentinel-2 imagery. A Random Forest baseline and four CNNs (ResNet-50, ResNet-101, EfficientNet-B4, ConvNeXt-Large) are trained on Pratas Island and selected Great Barrier Reef regions, then evaluated on spatially independent intra- and cross-regional test areas. Preserving spatial continuity during training, by keeping contiguous reef blocks rather than random patches, is the single most impactful design choice; we further introduce a Smooth Weight Function (SWF)-weighted RMSE loss that emphasizes near-surface depths. With these choices, intra-regional RMSE ranges from 1.15 to 1.92 m over 0-20 m and is as low as 0.26 m for depths <= 3 m. Random Forest degrades sharply under cross-regional transfer (RMSE 1.53 m -> 2.99-3.78 m), while the deep models stay more robust (2.46-2.98 m). On the public MagicBathyNet aerial-RGB benchmark (0-16 m) the proposed networks reach 0.19-0.22 m RMSE, outperforming a U-Net baseline and a task-specific transformer architecture with substantially fewer parameters. We further exploit multi-temporal repeat imagery: training on it broadens diversity, and median-aggregating predictions across passes at inference reduces noise from changing sun angles, atmospheric conditions, water properties, and tides. We release optimized architectures and pretrained weights to enable scalable transfer to new sites.
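
Two ideas in the abstract are simple to sketch: median aggregation of per-pass predictions and an RMSE that weights shallow depths more heavily. The weight function below is a hypothetical stand-in, since the abstract does not give the form of the paper's Smooth Weight Function; the data layout and names are also assumptions.

```python
import math
import statistics
from typing import Callable, List, Sequence


def median_aggregate(per_pass_depths: Sequence[Sequence[float]]) -> List[float]:
    """Per-pixel median over satellite passes; each inner list is one pass."""
    return [statistics.median(values) for values in zip(*per_pass_depths)]


def shallow_weight(depth_m: float, scale_m: float = 5.0) -> float:
    """Hypothetical smooth weight that decays with depth (not the paper's SWF)."""
    return 1.0 + math.exp(-depth_m / scale_m)


def weighted_rmse(true_depths: Sequence[float], predicted: Sequence[float],
                  weight_fn: Callable[[float], float] = shallow_weight) -> float:
    """RMSE in which each squared error is weighted by the true depth."""
    weights = [weight_fn(t) for t in true_depths]
    weighted_sq = sum(w * (t - p) ** 2
                      for w, t, p in zip(weights, true_depths, predicted))
    return math.sqrt(weighted_sq / sum(weights))


# Three passes over two pixels; the 9.0 m outlier in pass 2 is suppressed.
passes = [[1.0, 5.0], [1.2, 9.0], [0.9, 5.2]]
aggregated = median_aggregate(passes)
score = weighted_rmse([1.0, 5.0], aggregated)
```

The median is used here rather than the mean because a single pass degraded by glint, haze, or tide then shifts the aggregate far less.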

2606.02762 2026-06-03 cs.LG

Binary Road Surface Classification Using Machine Learning on Production Vehicle Signals During Cruising

Vishal Hariharan, Salar Basiri, Kanwar Bharat Singh

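Affiliations: University of California, Berkeley

AI summary: To address the failure of conventional friction estimation methods during cruising, proposes feature-based and end-to-end data-driven frameworks that use statistical features of vehicle dynamics signals to classify the road surface as grip (dry or damp) or slip (snow or ice).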

English abstract

Knowledge of real-time road slipperiness, or even better, a refined estimate of peak grip potential, is a critical input for vehicle warning and intervention control systems. Typically, friction is estimated through dynamics-based recursive estimators by calculating the slip slope; however, the efficacy of this approach is heavily constrained by the vehicle dynamic scenario. When the vehicle is cruising and there is little to no slip, these methods become ineffective because present-day production-grade sensors, such as wheel speed sensors, and methods can neither measure nor accurately estimate micro slip, which is crucial for distinguishing different surfaces. To address this challenge, the correlation between vehicle signals and road surface condition during cruising needs to be uncovered using machine learning. In this paper, a feature-based framework and an end-to-end data-driven framework are used to correlate the statistics of vehicle dynamics behavior with the condition of the road surface and perform binary classification into grip (dry or damp) and slip (snow or ice) conditions. A sliding-window approach is adopted to batch a short buffered window of wheel speeds, wheel torques, longitudinal acceleration, steering angle, and yaw rate, which are fed into a machine learning module for predicting the road state. Validation results on public-road data show scenarios where the data-driven method identifies the road surface correctly even during cruising, showing promise for accurate data-driven friction-related state estimators in the field of tire and vehicle dynamics.
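
The sliding-window step of the feature-based framework can be sketched as follows. The signal names, window length, hop size, and the choice of mean and standard deviation as features are all assumptions for illustration; the abstract does not specify them.

```python
from statistics import mean, pstdev
from typing import Dict, Iterator, List

Sample = Dict[str, float]  # one time step: signal name -> value


def sliding_windows(samples: List[Sample], window: int,
                    hop: int) -> Iterator[List[Sample]]:
    """Yield fixed-length, possibly overlapping windows over the sample buffer."""
    for start in range(0, len(samples) - window + 1, hop):
        yield samples[start:start + window]


def window_features(window: List[Sample]) -> Dict[str, float]:
    """Per-signal mean and standard deviation, usable as classifier inputs."""
    features: Dict[str, float] = {}
    for name in window[0]:
        values = [sample[name] for sample in window]
        features[f"{name}_mean"] = mean(values)
        features[f"{name}_std"] = pstdev(values)
    return features


# Hypothetical buffer: six time steps of two signals.
buffer = [{"wheel_speed": float(i), "yaw_rate": 0.0} for i in range(6)]
feature_rows = [window_features(w)
                for w in sliding_windows(buffer, window=4, hop=2)]
```

Each row of `feature_rows` would then go to a binary classifier, while the end-to-end variant described in the abstract would consume the raw window instead.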

2606.02754 2026-06-03 cs.LG

$Ψ$-Bench: Evaluating Persona-Sensitive Influencing in Persuasive Dialogues

Peixuan Han, Hongyi Du, Jiayu Liu, Yihang Sun, Yutong Liu, Jiaxuan You

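Affiliations: University of Illinois Urbana-Champaign

AI summary: Proposes the $Ψ$-Bench benchmark, which evaluates LLMs' ability to persuade using user profiles across three realistic scenarios; current models still leave considerable room for improvement, and access to user profiles yields an 18.24% performance gain.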

English abstract

Personalization is a crucial capability of modern language agents. However, current research primarily positions personalized agents as passive responders to user preferences, limiting their ability to interact with users and provide suggestions or guidance proactively. To systematically evaluate such proactive personalization in realistic interactions, we propose $Ψ$-Bench, a benchmark for assessing LLMs' ability to influence realistic users through conversation. We design three real-world interaction scenarios that involve persuasion in $Ψ$-Bench, and endow simulated clients with personal characteristics through explicit user profiles derived from dialogue histories. We evaluate 10 frontier LLMs on $Ψ$-Bench and find that while most models can produce coherent and reasonable arguments, even state-of-the-art models still leave considerable room for improvement in persuasion. We also find that providing access to client profiles yields an average performance gain of 18.24%, highlighting the importance of user-specific information for effective persuasion. Overall, our work highlights persona-sensitive influencing as a challenging yet practical direction for evaluating and developing more proactive personalized LLM agents. Code is available at: https://github.com/Hanpx20/Psi-Bench.

2606.02753 2026-06-03 cs.CV cs.AI

MetaWorld: Scaling Multi-Agent Video World Model from Single-view Video Data

Teng Hu, Mingchun Lu, Yating Wang, Jiangning Zhang, Jinkun Hao, Ye Pan, Ran Yi, Lizhuang Ma, Dacheng Tao

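Affiliations: Shanghai Jiao Tong University; Zhejiang University; Nanyang Technological University

AI summary: Proposes the MetaWorld framework, which builds multi-agent video world models from single-view videos through Monocular World-State Unrolling, a Subject-Aware World Generator, and a World-State Alignment mechanism, addressing data scarcity and world state alignment.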

English abstract

Video world models are a foundational generative technology for embodied AI and the Metaverse, yet existing approaches are inherently limited to a single agent observing from a single perspective. Extending these models to multi-agent settings introduces two critical challenges: data scarcity (coordinated multi-view recordings are prohibitively expensive to collect for general open-domain scenarios) and world state alignment (independently generated video streams cannot ensure that shared physical environments and events evolve consistently across views). To address these challenges, we propose MetaWorld, a novel framework that scales multi-agent video world models to open-domain environments directly from single-view videos. First, we introduce Monocular World-State Unrolling (MWSU) to explicitly decompose monocular footage into the camera operator's ego-motion and the visible subject's spatial trajectory. This camera-trajectory decomposition naturally extracts synchronized multi-agent motion data within a shared 3D space, completely bypassing the need for multi-camera setups. Second, for precise visual control, we develop the Subject-Aware World Generator to enable appearance-driven simulation conditioned on per-agent identity images. Finally, to ensure both views are grounded in the identical physical reality, we propose World-State Alignment (WSA), a per-frame inter-branch cross-attention mechanism inserted at every transformer layer of the video DiT. By jointly synchronizing the denoising process, WSA enforces both static geometric consistency and dynamic motion consistency, encouraging the shared 3D environment and physical events to remain well-aligned across both egocentric views. Extensive experiments demonstrate that MetaWorld achieves superior cross-view consistency and identity fidelity, establishing a highly scalable, physics-driven paradigm for multi-agent video world modeling.

2606.02745 2026-06-03 cs.RO cs.LG

SeeTraceAct: Visibility-Aware Latent Planning from Cross-Embodiment Demonstration Videos

Jaehyeon Son, Junhyun Kim, Kyle Kam, Jeremiah Coholich, Seok Joon Kim, Jinhoo Kim, Chris Dongjoo Kim, Jaemin Cho, Dieter Fox, Zsolt Kira

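Affiliations: Georgia Institute of Technology; Allen Institute for AI; Johns Hopkins University; University of Washington

AI summary: Proposes the SeeTraceAct framework, which strengthens spatial grounding through visibility-aware prediction of future end-effector traces, generalizing robot policies from a single cross-embodiment demonstration video and achieving the best success rates in simulated and real-world settings.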

English abstract

Vision-language-action models (VLAs) are promising general-purpose robot policies, but adapting them to new tasks typically requires costly task-specific teleoperation data. As an alternative, we study one-shot demo-conditioned VLAs, where a robot policy is conditioned on a single demonstration video of an unseen task. We find that existing end-to-end approaches often struggle when successful execution requires precisely localizing small target regions. To address this limitation, we propose SeeTraceAct, a demo-conditioned VLA framework that encourages precise spatial grounding through visibility-aware prediction of future end-effector traces. To enable reproducible evaluation with cross-embodiment demonstrations, we introduce and release RoboCasa-DC, a demo-conditioned extension of RoboCasa with episode-paired humanoid videos. Experiments on RoboCasa-DC and a real-world benchmark, where a Franka Panda arm is conditioned on human demonstrations, show that SeeTraceAct outperforms baselines, achieving the best success rate across all four RoboCasa-DC settings and improving real-world average success by 12.5 percentage points.

2606.02742 2026-06-03 cs.CV

Consistent Yet Wrong: Evidence Insensitivity in Spatial Vision-Language Models

S Divakar Bhat, Toshihiko Yamasaki

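Affiliations: The University of Tokyo, Japan

AI summary: Introduces ViewDiag, a multi-view evaluation protocol, and finds that modern vision-language models give spatial reasoning answers that are consistent across viewpoints yet wrong, suggesting that their predictions are driven mainly by priors rather than by evidence-sensitive reasoning.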

English abstract

Spatial reasoning is fundamental to robotics, autonomy, and embodied AI, yet modern vision-language models (VLMs) remain unreliable on metric distance queries. A common assumption is that consistent predictions across viewpoints reflect geometric grounding. We test this assumption and find the opposite: leading VLMs often produce view-invariant and consistent answers even when those answers are incorrect, indicating weak coupling between predictions and viewpoint-specific visual evidence. We introduce ViewDiag, a controlled multi-view evaluation protocol built from Hypersim, ScanNet, and KITTI360, comprising 176 object-pair tracks across 80 scenes with 2-10 views per track. The protocol evaluates models along three axes: metric accuracy, distributional concentration, and a latent feature probe for internal collapse that distinguishes decision collapse from representation collapse. Across diverse models, we observe a consistent pattern of high prediction stability paired with substantial error, clustering in a regime characterized by strong consistency but low accuracy. These results challenge the common use of cross-view consistency as a proxy for geometric understanding. Instead, we show that stable predictions may reflect prior-driven collapse rather than evidence-sensitive reasoning. ViewDiag provides a controlled benchmark and diagnostic framework for evaluating spatial VLMs beyond accuracy alone. The code and data can be found at https://github.com/SDivakarBhat/Consistent_Yet_Wrong.git.
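
The "consistent yet wrong" regime can be illustrated with two per-track statistics: the spread of a model's answers across views and their average error. This is a simplified sketch rather than the ViewDiag protocol itself; it assumes one ground-truth distance per object-pair track, and the thresholds are hypothetical.

```python
from statistics import mean, pstdev
from typing import Sequence, Tuple


def track_diagnostics(predictions_m: Sequence[float],
                      true_distance_m: float) -> Tuple[float, float]:
    """Return (spread across views, mean absolute error) for one track."""
    spread = pstdev(predictions_m)
    error = mean(abs(p - true_distance_m) for p in predictions_m)
    return spread, error


def is_consistent_yet_wrong(predictions_m: Sequence[float],
                            true_distance_m: float,
                            spread_tol_m: float = 0.1,
                            error_tol_m: float = 0.5) -> bool:
    """Flag tracks whose answers agree across views but miss the true distance."""
    spread, error = track_diagnostics(predictions_m, true_distance_m)
    return spread <= spread_tol_m and error > error_tol_m


# Hypothetical tracks: stable but wrong, and stable and right.
collapsed = is_consistent_yet_wrong([2.0, 2.0, 2.1], true_distance_m=4.0)
grounded = is_consistent_yet_wrong([3.9, 4.1, 4.0], true_distance_m=4.0)
```

A low spread alone would score both tracks as equally good, which is the failure of consistency-as-proxy that the abstract describes.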

2606.02741 2026-06-03 cs.CL cs.CY

Greener Than Humans? Environmental Attitudes in Large Language Models

比人类更绿色?大语言模型中的环境态度

Stefanie Kunkel, Tilman Hartwig, Marcus Voss, Emma K. Schütt, Angelika Gellrich

发表机构 * University of Tübingen(图宾根大学)

AI总结 通过构建基准评估31个LLM的环境认知、情感和行为建议,发现多数模型比德国调查受访者更倾向环保态度,但存在语境敏感性和谄媚偏移。

Comments Code can be found at https://gitlab.opencode.de/uba-ki-lab/llm-questionnaire-benchmarking-framework Benchmark data and results can be found at https://zenodo.org/records/20445903

详情
AI中文摘要

大语言模型(LLMs)越来越多地用于可持续发展相关的决策支持、报告和公共传播,但关于其输出中嵌入的环境态度的系统性证据很少。本文开发了一个基准,用于评估LLMs的环境认知、情感和行为建议,并将其应用于31个广泛使用的专有和开源模型。利用既有环境意识调查中的问题以及额外的可持续发展相关行为测量,我们比较了1)模型之间的LLM响应,以及2)模型与来自德国的人类调查基准之间的响应。我们评估了它们在提示条件下的稳健性。我们发现,许多LLMs比平均调查受访者更倾向于环境进步态度,表现出更高的环境情感和认知水平,并推荐与大量潜在二氧化碳减排相关的行为。同时,我们观察到可持续导向的响应与模型来源、大小或发布背景之间没有系统性的关系。然而,模型表现出语境敏感性,可被基于角色的提示所操控,并显示出反映用户指定意识形态立场的谄媚偏移,这引发了关于现实部署中可操纵性和规范可靠性的担忧。我们的发现提供了一个可重复使用的评估框架,用于评估LLMs中与可持续发展相关的价值对齐,并强调了随着AI系统日益嵌入可持续转型和公共决策中,治理、透明度和批判性监督的重要性。

英文摘要

Large language models (LLMs) are increasingly used in sustainability-related decision support, reporting, and public communication, yet little systematic evidence exists on the environmental attitudes embedded in their outputs. This paper develops a benchmark for evaluating environmental cognition, affect, and behavioural recommendations in LLMs and applies it to 31 widely used proprietary and open-weight models. Drawing on questions from established environmental awareness surveys and additional sustainability-related behavioural measures, we compare LLM responses 1) among models and 2) between models and human survey benchmarks from Germany. We assess their robustness across prompting conditions. We find that many LLMs align more closely with environmentally progressive attitudes than the average survey respondent, exhibiting higher levels of environmental affect and cognition and recommending behaviours associated with substantial potential CO2 reductions. At the same time, we observe no systematic relationship between sustainability-oriented responses and model origin, size, or release context. However, models exhibit contextual sensitivity: they can be steered by persona-based prompting and show sycophantic shifts mirroring user-specified ideological positions, which raises concerns about steerability and normative reliability in real-world deployments. Our findings provide a reusable evaluation framework for assessing sustainability-related value alignment in LLMs and highlight the importance of governance, transparency, and critical oversight as AI systems become increasingly embedded in sustainability transformations and public decision-making.

2606.02739 2026-06-03 cs.SD cs.AI eess.AS

EntangleCodec: A Unified Discrete Audio Tokenizer via Semantic-Acoustic Entanglement

EntangleCodec:通过语义-声学纠缠的统一离散音频分词器

Hui Li, Yangfan Gao, Junlin Shang, Changhao Jiang, Tao Gui, Qi Zhang, Xuanjing Huang

发表机构 * Fudan University(复旦大学)

AI总结 提出EntangleCodec,一种通过将音频与丰富标题对齐学习语义-声学联合表示的统一离散音频分词器,在紧凑令牌流中捕获语言内容、说话人身份、情感、韵律和声学场景,并通过流匹配扩散解码器实现高质量重建,在音频理解和生成任务上均取得领先性能。

Comments 17 pages, 10 figures

详情
AI中文摘要

音频分词器作为连续音频与音频语言模型(ALM)之间的离散接口,但现有分词器往往难以同时支持理解和生成。面向重建的编解码器保持声学保真度但缺乏丰富语义,而语义感知分词器通常依赖独立的语义和声学流,引入冗余或错位。我们提出EntangleCodec,一种统一的离散音频分词器,在量化之前学习与标题对齐的语义-声学表示。通过将音频与丰富标题而非ASR转录对齐,EntangleCodec在紧凑令牌流中捕获语言内容、说话人身份、情感、韵律和声学场景。流匹配扩散解码器进一步实现了语音、音乐和通用音频的高质量重建。EntangleCodec在重建质量上与专用编解码器相当,在音频理解上优于所有基于编解码器的基线,在MMAR上提升高达+7.4%,并在统一框架中支持TTS和TTA生成。此外,基于EntangleCodec的音频语言模型展现出强大的扩展行为:即使参数量仅为0.6B,该模型也在三个基准测试中超越了参数量超过13B的专用连续表示LLM,参数量减少了22倍;扩展到8B进一步在MMAR上建立了新的最先进结果,突显了在音频语言建模中表示质量与模型规模同等重要。代码和模型权重见 https://github.com/luckyerr/EntangleCodec 。

英文摘要

Audio tokenizers serve as the discrete interface between continuous audio and Audio Language Models (ALMs), but existing tokenizers often struggle to support both understanding and generation. Reconstruction-oriented codecs preserve acoustic fidelity but lack rich semantics, while semantic-aware tokenizers typically rely on separate semantic and acoustic streams, introducing redundancy or misalignment. We propose EntangleCodec, a unified discrete audio tokenizer that learns caption-aligned semantic-acoustic representations before quantization. By aligning audio with rich captions rather than ASR transcripts, EntangleCodec captures linguistic content, speaker identity, emotion, prosody, and acoustic scenes within a compact token stream. A flow-matching diffusion decoder further enables high-quality reconstruction across speech, music, and general audio. EntangleCodec achieves reconstruction quality competitive with specialized codecs, outperforms all codec-based baselines on audio understanding by up to +7.4% on MMAR, and supports both TTS and TTA generation in a unified framework. Furthermore, EntangleCodec-based audio language models demonstrate strong scaling behavior: even at 0.6B parameters, the model surpasses specialized continuous-representation LLMs with over 13B parameters across three benchmarks using 22× fewer parameters; scaling to 8B further establishes new state-of-the-art results on MMAR, highlighting that representation quality is as critical as model scale in audio language modeling. Code and model weights are available at https://github.com/luckyerr/EntangleCodec.

2606.02724 2026-06-03 cs.CV cs.AI

AVTrack: Audio-Visual Tracking in Human-centric Complex Scenes

AVTrack: 以人为中心的复杂场景中的视听跟踪

Yaoting Wang, Yun Zhou, Zipei Zhang, Henghui Ding

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 针对现有视听跟踪数据集局限于简单场景的问题,提出AVTrack数据集,通过包含相机运动、视觉遮挡和位置变化等复杂动态条件,评估并推动鲁棒的以人为中心的视听场景理解。

Comments 19 pages, 10 figures, ICML 2026

详情
AI中文摘要

视听说话人跟踪旨在通过利用听觉和视觉线索来定位和跟踪活跃的说话人,实现细粒度、以人为中心的场景理解。这一能力对于智能视频编辑、监控和人机交互等实际应用至关重要。然而,现有数据集大多局限于具有粗略标注的简单或同质视听场景。这种过度简化的设置使评估偏向于静态视听共现,而非严格评估复杂动态场景中的鲁棒时空建模和跨模态推理。为了解决这些限制,我们引入了AVTrack,一个以人为中心的视听实例分割(AVIS)数据集,专为动态真实世界场景设计。AVTrack具有多样且具有挑战性的条件,包括相机运动、视觉遮挡和位置变化。在AVTrack上对代表性AVIS方法的评估揭示了显著的性能下降,使AVTrack成为复杂环境中鲁棒的以人为中心的视听场景理解的挑战性基准。我们进一步提供了一个简单而有效的基线,以促进未来的研究。项目网站:https://FudanCVL.github.io/AVTrack/

英文摘要

Audio-visual speaker tracking aims to localize and track active speakers by leveraging auditory and visual cues, enabling fine-grained, human-centric scene understanding. This capability is essential for real-world applications such as intelligent video editing, surveillance, and human-computer interaction. However, existing datasets are largely limited to simple or homogeneous audio-visual scenes with coarse annotations. Such oversimplified settings bias evaluation toward static audio-visual co-occurrence, rather than rigorously assessing robust spatiotemporal modeling and cross-modal reasoning in complex, dynamic scenes. To address these limitations, we introduce AVTrack, a human-centric audio-visual instance segmentation (AVIS) dataset designed for dynamic real-world scenarios. AVTrack features diverse and challenging conditions, including camera motion, visual occlusions, and position changes. Evaluations of representative AVIS methods on AVTrack reveal substantial performance degradation, establishing AVTrack as a challenging benchmark for robust human-centric audio-visual scene understanding in complex environments. We further provide a simple yet effective baseline to facilitate future research. Project website: https://FudanCVL.github.io/AVTrack/

2606.02680 2026-06-03 cs.LG

Locality Does Not Imply Reachability: Boundary Repair in Block-Sparse Causal Attention

局部性并不意味着可达性:块稀疏因果注意力中的边界修复

Zhibo Yang

发表机构 * Ocean University of China(中国海洋大学)

AI总结 本文研究块稀疏因果注意力中序列局部性与注意力图可达性之间的不匹配,通过结构依赖集形式化边界伪影,提出相位条件覆盖函数分析可达性,并引入边界桥接注意力作为最小修复方法。

Comments 36 pages, 5 figures, 16 tables

详情
AI中文摘要

稀疏因果注意力通常由序列局部性描述:附近的token应保持易于访问,而远处的token可能被丢弃以降低成本。本文研究了序列局部性与注意力图可达性之间的不匹配。在固定块因果注意力中,两个相邻token可能在每一深度的注意力图中断开连接。我们通过结构依赖集形式化了这种边界伪影:如果每个注意力层使用相同的固定块因果掩码,且所有剩余操作是位置级的,则目标表示只能依赖于其自身块前缀中的token。这为构造的K路边界复制分布产生了架构级的边界-复制分离,top-1准确率上界为1/K,期望交叉熵下界为log K。然后,我们推导了相位条件覆盖函数,表明可达性取决于源-目标距离以及目标在其块内的偏移。这些覆盖律预测了稀疏模式何时会失败、修复何时有帮助,以及滑动窗口注意力和边界修复为何不可互换。边界桥接注意力被视为建设性的证明:它保留了固定块路径,并在块边界附近使用共享投影添加了零额外参数的辅助因果边。受控的1024-token实验表明,收益集中在覆盖对齐的诊断中。作为次要的外部有效性证据,固定检查点的8K-token Qwen2.5-7B探针显示了相同的覆盖不可比模式。贡献在于一个理论指导的诊断框架,用于块稀疏因果注意力中的局部性-可达性不匹配,以及相位条件覆盖分析和最小建设性修复。

英文摘要

Sparse causal attention is usually described by sequence locality: nearby tokens should remain easy to access, while distant tokens may be dropped to reduce cost. This paper studies a mismatch between sequence locality and attention-graph reachability. In fixed block causal attention, two adjacent tokens can be disconnected in the attention graph at every depth. We formalize this boundary artifact through structural dependency sets: if every attention layer uses the same fixed block causal mask and all remaining operations are positionwise, a target representation can depend only on tokens in its own block prefix. This yields an architecture-level boundary-copy separation for a constructed K-way boundary-copy distribution, with top-1 accuracy upper bound 1/K and expected cross-entropy lower bound log K. We then derive phase-conditioned coverage functions showing that reachability depends on both source-target distance and the target's offset within its block. These coverage laws predict when a sparse pattern should fail, when a repair can help, and why sliding-window attention and boundary repair are not interchangeable. Boundary Bridge Attention is treated as a constructive witness: it preserves the fixed block path and adds zero-additional-parameter auxiliary causal edges near block boundaries using shared projections. Controlled 1024-token experiments show that gains concentrate in coverage-aligned diagnostics. As secondary external-validity evidence, a fixed-checkpoint 8K-token Qwen2.5-7B probe shows the same coverage-incomparability pattern. The contribution is a theory-guided diagnostic framework for locality-reachability mismatch in block-sparse causal attention, together with phase-conditioned coverage analysis and a minimal constructive repair.
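
The boundary artifact described above can be checked on a toy graph. The following is a minimal sketch, not the paper's code: it assumes every layer uses the same mask and that all other operations are positionwise, so reachability after several layers is just repeated composition of the mask. The helper names (`block_causal_mask`, `sliding_window_mask`, `reachable`) and the sizes (8 tokens, block size 4, window 2, depth 6) are illustrative assumptions.

```python
import numpy as np


def block_causal_mask(n: int, block: int) -> np.ndarray:
    """mask[i, j] is True when query i may attend to key j (same block, j <= i)."""
    idx = np.arange(n)
    same_block = (idx[:, None] // block) == (idx[None, :] // block)
    causal = idx[None, :] <= idx[:, None]
    return same_block & causal


def sliding_window_mask(n: int, window: int) -> np.ndarray:
    """mask[i, j] is True when j is one of the last `window` positions up to i."""
    idx = np.arange(n)
    dist = idx[:, None] - idx[None, :]
    return (dist >= 0) & (dist < window)


def reachable(mask: np.ndarray, depth: int) -> np.ndarray:
    """reach[i, j] is True when token j can influence position i after `depth` layers."""
    step = mask.astype(np.int64)
    reach = np.eye(mask.shape[0], dtype=np.int64)
    for _ in range(depth):
        # One more layer: i sees everything seen by any position it attends to.
        reach = ((step @ reach) > 0).astype(np.int64)
    return reach.astype(bool)


block_reach = reachable(block_causal_mask(8, 4), depth=6)
window_reach = reachable(sliding_window_mask(8, 2), depth=6)

# Tokens 3 and 4 are adjacent but sit on opposite sides of a block boundary.
print("block mask, 3 -> 4 reachable:", bool(block_reach[4, 3]))
print("sliding window, 3 -> 4 reachable:", bool(window_reach[4, 3]))
print("sliding window, 0 -> 7 reachable:", bool(window_reach[7, 0]))
```

In this toy setting the fixed block mask never connects tokens 3 and 4 no matter how many layers are stacked, while the sliding window connects them immediately but still cannot carry token 0 to position 7 within six layers. This is consistent with the abstract's point that the two patterns cover different source-target pairs.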

2606.02679 2026-06-03 cs.LG cs.MM cs.SD eess.AS

Before Fusion, Ask What to Keep: Contextual Calibration of Multimodal Signals

融合之前,先问保留什么:多模态信号的上下文校准

Jiyuan Liu, Liangwei Nathan Zheng, Wei Emma Zhang, Xinpei Wang, Weitong Chen

发表机构 * Adelaide University(阿德莱德大学) Shandong University(山东大学)

AI总结 提出一种紧凑的校准模块,在融合前对各模态特征进行实例级和维度级调制,抑制不可靠成分并增强上下文支持信号,提升多模态任务性能。

Comments 11 pages, 7 figures, 9 tables

详情
AI中文摘要

多模态系统通常受益于跨语言、声音和视觉流的信息组合,但这种收益并非保证。一个模态对某个输入有用,可能对另一个输入成为干扰,同一模态内的局部特征响应可能与其他来源的证据不一致。本文研究如何在下游预测器合并多模态表示之前调整它们。我们开发了一个紧凑的校准模块,在摘要级别将每个模态与其他模态进行比较,提取跨源支持和冲突的线索,并将这些线索转换为实例级和维度级的调制信号。校准应用于原始模态特征而非已融合的表示,使模型能够抑制误导成分,保留微弱但有用的证据,并强调在当前多模态上下文中得到更好支持的响应。该模块设计为即插即用组件,可附加到不同的融合主干上,无需更改其预测头。在涵盖情感理解、动作识别、音视频事件检测和音视频情感分类的五个基准测试中,所提出的预融合校准策略在基于序列和卷积的融合设置下均提升了性能。模态移除、合成损坏、训练动态和特征级可视化的额外分析表明,在融合前校准信号可以减少来自不可靠模态的干扰,并产生更稳定的多模态优化。

英文摘要

Multimodal systems often benefit from combining information across language, sound, and visual streams, but this benefit is not guaranteed. A modality that is useful for one input may become distracting for another, and local feature responses within the same modality can disagree with evidence from other sources. This work investigates how to adjust multimodal representations before they are merged by a downstream predictor. We develop a compact calibration module that compares each modality with the others at the summary level, extracts cues of cross-source support and conflict, and converts these cues into instance-wise and dimension-wise modulation signals. The calibration is applied to the original modality features rather than to already fused representations, enabling the model to suppress misleading components, preserve weak but useful evidence, and emphasize responses that are better supported by the current multimodal context. The module is designed as a plug-in component and can be attached to different fusion backbones without changing their prediction heads. Across five benchmarks covering sentiment understanding, action recognition, audio-visual event detection, and audio-visual emotion classification, the proposed pre-combination calibration strategy improves performance under both sequence-based and convolutional fusion settings. Additional analyses under modality removal, synthetic corruption, training dynamics, and feature-level visualization show that calibrating signals before fusion can reduce interference from unreliable modalities and produce more stable multimodal optimization.

2606.02677 2026-06-03 cs.RO

Motion Planning in Dynamic Environments: A Survey from Classical to Modern Methods

动态环境中的运动规划:从经典到现代方法的综述

Zongyuan Shen, Yaming Ou, Shalabh Gupta, Shancheng Zhao, Dehua Zhou, Gao Wang, Zhongqiang Ren, Junfeng Fan, Long Cheng

发表机构 * College of Information Science and Technology, Jinan University(暨南大学信息科学技术学院) School of Artificial Intelligence, University of Chinese Academy of Sciences(中国科学院大学人工智能学院) Department of Electrical and Computer Engineering, University of Connecticut(康涅狄格大学电气与计算机工程系) Global College, Shanghai Jiao Tong University(上海交通大学全球学院) Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所)

AI总结 本文综述了138篇文献,将动态环境中的运动规划方法分为采样、图搜索、模型预测控制、学习及经典局部规划五类,分析了各方法的原理、优缺点及动态环境特有挑战。

详情
AI中文摘要

动态环境中的运动规划要求机器人持续调整路径以应对环境变化,实现安全不间断的导航。尽管许多综述回顾了静态环境中的规划,但针对动态环境的系统综述仍然有限。本文对138篇文献进行了全面综述,主要发表于2015年至2025年,涵盖经典和基于学习的方法。运动规划方法根据采样、图搜索、模型预测控制、学习以及额外的经典局部规划方法(包括速度障碍、势场和动态窗口)的概念分为五类。学习技术包括监督学习和强化学习。我们还讨论了动态感知在运动规划中的作用,涵盖了使用相机、LiDAR和事件传感器检测和建模移动障碍物的技术。该综述分析了每种方法的原理、优势和局限性,特别关注动态环境特有的挑战,如预测不确定性、人机交互和机器人冻结问题。该综述为研究人员提供了对动态环境中运动规划方法的结构化理解。

英文摘要

Motion planning in dynamic environments requires robots to continuously adapt their paths in response to environmental changes for safe and uninterrupted navigation. While many surveys have reviewed planning in static settings, systematic reviews focused on dynamic environments remain limited. This paper presents a comprehensive survey of 138 works, primarily published between 2015 and 2025, spanning both classical and learning-based approaches. The motion planning methods are grouped into five categories based on the concepts of sampling, graph search, model predictive control, learning, and additional classical local planning approaches, including velocity obstacles, potential fields and dynamic windows. The learning techniques include supervised learning and reinforcement learning. We also discuss the role of dynamic perception in motion planning, covering techniques for detecting and modeling moving obstacles using cameras, LiDAR, and event-based sensors. The survey analyzes the principles, strengths, and limitations of each method, with particular attention to challenges unique to dynamic environments, such as prediction uncertainty, human-robot interaction, and the freezing robot problem. The survey provides researchers with a structured understanding of motion planning methods in dynamic environments.

2606.02673 2026-06-03 cs.AI cs.LG

Visual Graph Scaffolds for Structural Reasoning in Large Language Models

大语言模型中用于结构推理的可视化图脚手架

Runlin Lei, Xiaokui Xiao, Zhewei Wei

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 本文提出将图结构作为大语言模型的内部推理辅助而非仅外部知识源,通过多跳问答实验发现视觉图引导相比文本化图在无直接答案提示时仍保持有效性,支持图作为组织推理的可视化脚手架。

详情
AI中文摘要

图已被用于增强大语言模型的结构化推理,主要是在测试时作为外部知识源提供给模型。在本文中,我们采取不同的视角:图对LLMs的价值不仅在于提供信息,还在于组织推理。受人类使用图结构思维导图组织分支和汇聚思维的启发,我们探究图是否可以作为推理辅助的内部形式。我们在多跳问答任务上研究这一问题,其中教师提供的推理轨迹被重写为图思维导图并用于指导学生模型。我们的实验揭示了明显的模态差距。当图结构被扁平化为文本时,一旦直接答案提示被移除,其益处变得有限。在这种抽象引导设置下,推理效率和答案质量都大幅下降。相比之下,视觉图引导在没有直接答案线索时仍然有效,并且其优势在监督微调和基于KL的蒸馏后仍然保持。上述发现支持了以下主张:图不仅应作为LLMs的外部知识结构来研究,还应作为组织推理的可视化脚手架。

英文摘要

Graphs have been used to enhance large language models (LLMs) for structured reasoning, mostly as external knowledge sources provided to models at test time. In this paper, we take a different view: the value of graphs for LLMs lies not only in supplying information, but also in organizing reasoning. Inspired by how humans use graph-structured mind maps to organize branching and converging thoughts, we ask whether graphs can serve as an internal form of reasoning assistance. We study this question on multi-hop question answering tasks, where teacher-provided reasoning traces are rewritten as graph mind maps and used to guide a student model. Our experiments reveal a clear modality gap. When graph structures are flattened into text, their benefits become limited once direct answer hints are removed. Under this abstract guidance setting, both reasoning efficiency and answer quality degrade substantially. In contrast, visual graph guidance remains effective without direct answer clues, and its advantage persists after supervised fine-tuning and KL-based distillation. These findings support the claim that graphs should be studied not only as external knowledge structures for LLMs, but also as visual scaffolds for organizing reasoning.

2606.02671 2026-06-03 cs.LG cs.AI

Aligning Data-Driven Predictors with Allocation: A Decision-Focused Approach to Survival Analysis

对齐数据驱动预测器与分配:面向决策的生存分析方法

Itai Zilberstein, Ioannis Anagnostides, Tuomas Sandholm

发表机构 * Department of Computer Science, Carnegie Mellon University, Pittsburgh, PA(卡内基梅隆大学计算机科学系,宾夕法尼亚州匹兹堡) Strategy Robot, Inc.(策略机器人公司) Strategic Machine, Inc.(战略机器公司) Optimized Markets, Inc.(优化市场公司)

AI总结 针对生存分析中预测模型与分配决策目标不一致的问题,提出基于归一化折现累积增益(NDCG)的决策聚焦学习方法,通过优化NDCG提升分配效果,在心脏移植数据上使基线模型NDCG提升50-100%。

详情
AI中文摘要

机器学习预测器已成为指导自动化决策的重要工具。然而,一个主要的错位仍然存在:预测模型通常根据标准统计指标进行优化,而与其所指导的算法任务相孤立。我们在器官分配这一高风险领域中强调了这种不一致性,通过证明任何依赖(即使是高度准确的)针对标准指标(如一致性指数(C-index))优化的生存预测器的算法,在用于分配时可能产生任意差的结果,无法保证比均匀随机选择更好的效用。为了弥合生存分析与策略优化之间的差距,我们引入了一种基于优化归一化折现累积增益(NDCG)的决策聚焦学习方法,NDCG是信息检索中的主流指标。我们通过证明NDCG转化为分配性能的保证,确立了其在生存分析中的效用。在实证中,我们提出了一种自举方法来优化现有生存模型的NDCG。与先前工作不同,我们还解决了评估排名时右删失的挑战。在美国历史心脏移植数据上,我们的方法将基线模型的NDCG大幅提升了50-100%,这相当于在移植分配中每年额外获得数万生命年。我们预计我们的框架将在基于预测的决策中找到更广泛的应用。

英文摘要

Machine learning predictors have become essential tools for guiding automated decision making. However, a major misalignment persists: predictive models are typically optimized in terms of standard statistical metrics in isolation from the algorithmic tasks they inform. We highlight this incongruity in the high-stakes domain of organ allocation by demonstrating that any algorithm relying on (even highly accurate) survival predictors optimized for standard metrics, such as the Concordance index (C-index), can yield arbitrarily poor outcomes when used for allocation, failing to guarantee utility better than a uniform random selection. To bridge the gap between survival analysis and policy optimization, we introduce a decision-focused learning approach based on optimizing normalized discounted cumulative gain (NDCG), a mainstay metric in information retrieval. We establish the utility of NDCG in survival analysis by proving that it translates to guarantees on the performance of allocation. Empirically, we propose a bootstrapping approach to optimize the NDCG of existing survival models. Unlike prior work, we also address the challenge of right censoring when evaluating ranking. On historical heart transplant data from the US, our method dramatically boosts the NDCG of baseline models by 50-100%, which translates to tens of thousands of additional life years gained annually when deployed for transplant allocation. We anticipate that our framework will find broader applications in decision making with predictions.
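
For readers unfamiliar with the metric, the standard information-retrieval definition of NDCG can be sketched as follows. This is only the textbook formula with linear gains and a log2 rank discount; it is not the paper's bootstrapping procedure and it ignores right censoring. The gain values (read here as hypothetical benefit from allocating to each candidate) and the two score vectors are made up for illustration.

```python
import math
from typing import Sequence


def dcg(gains: Sequence[float]) -> float:
    """Discounted cumulative gain of gains listed in ranked order (best rank first)."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))


def ndcg(scores: Sequence[float], gains: Sequence[float]) -> float:
    """NDCG of the ranking induced by sorting candidates by descending score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    achieved = dcg([gains[i] for i in order])
    ideal = dcg(sorted(gains, reverse=True))
    return achieved / ideal if ideal > 0 else 0.0


# Hypothetical per-candidate gains and two hypothetical predictors.
gains = [8.0, 1.0, 5.0, 0.5]
good_scores = [0.9, 0.2, 0.7, 0.1]  # ranks candidates in the ideal order
bad_scores = [0.1, 0.7, 0.2, 0.9]   # ranks candidates in the reverse order

ndcg_good = ndcg(good_scores, gains)
ndcg_bad = ndcg(bad_scores, gains)
print(f"NDCG (ideal order): {ndcg_good:.3f}")
print(f"NDCG (reversed order): {ndcg_bad:.3f}")
```

Because the discount concentrates weight on the top ranks, NDCG rewards getting the highest-gain candidates first, which is the property that makes it a natural fit for allocation-style decisions.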

2606.02663 2026-06-03 cs.LG cs.AI

AdaWeather: Adaptively Mixing Probabilistic Weather Forecasts with Logarithmic Regret

AdaWeather: 以对数遗憾自适应混合概率天气预报

Saptarishi Dhanuka, Sarvesh Iyer, Manmeet Singh, Mihir More, Rushil Gupta, Dhruman Gupta, Parthasarathi Mukhopadhyay, Sandeep Juneja

发表机构 * Ashoka University(阿育王大学) Western Kentucky University(西肯塔基大学)

AI总结 提出 AdaWeather 自适应框架,通过结合机器学习和专家混合方法融合多个概率天气预报,实现对数遗憾界,并在温度预测上取得改进。

Comments 36 pages, 16 figures. Submitted to arXiv. Forecast aggregation for probabilistic weather prediction using offline supervised learning and online prediction with expert advice. Includes theoretical regret guarantees and empirical evaluation on temperature forecasting. Submitted to NeurIPS 2026

详情
AI中文摘要

机器学习的最新进展已经产生了与最先进数值天气预报模型相当的概率天气预报模型。但没有任何模型在时空上持续占优,且相对性能高度依赖于上下文。这激发了自适应方法来组合多个预报以获得改进和鲁棒性。尽管文献中已提出组合预报,但这些要么通过监督学习实现,要么通过专家建议预测方法实现。我们引入 AdaWeather,一个自适应框架,它使用机器学习和专家混合方法结合多个概率预报,以得到统一的改进概率预报。传统专家方法针对事后最佳单一专家建立遗憾界,而我们扩展了算法和分析,表明我们的方法相对于事后最佳静态专家混合具有对数遗憾。实验上,我们专注于温度预测,并观察到相对于现有方法的改进。

英文摘要

Recent advances in machine learning have produced probabilistic weather forecasting models comparable to state-of-the-art numerical weather predictors. But no model consistently dominates spatio-temporally, and relative performance is highly context-dependent. This motivates adaptive methods for combining multiple forecasts to obtain improvements and robustness. While combined forecasts have been proposed in the literature, these are achieved either through supervised learning or through prediction with expert advice methods. We introduce AdaWeather, an adaptive framework that combines many probabilistic forecasts using both machine learning and a mixture of experts to arrive at a unified improved probabilistic forecast. While traditional expert methods develop regret bounds with respect to the best single expert in hindsight, we extend the algorithm and analysis to show our method has logarithmic regret compared to the best static mixture of experts in hindsight. Empirically, we focus on forecasting temperature, and observe improvements over existing methods.
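
As background for the regret claim, here is a minimal sketch of the classical baseline the abstract contrasts against: a Bayesian mixture (exponential weights with learning rate 1) under log loss, whose cumulative loss exceeds that of the best single expert by at most ln N for N experts. This is not AdaWeather itself, which competes with the best static mixture. The three Gaussian "experts" and the synthetic temperature series are assumptions made purely for illustration.

```python
import math
import random


def gaussian_pdf(y: float, mean: float, std: float) -> float:
    z = (y - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def run_mixture(expert_forecasts, outcomes):
    """expert_forecasts[t][i] is the (mean, std) issued by expert i at step t."""
    n = len(expert_forecasts[0])
    log_w = [0.0] * n          # unnormalised log-weights, uniform prior
    mix_loss = 0.0             # cumulative log loss of the mixture forecast
    expert_loss = [0.0] * n    # cumulative log loss of each expert
    for forecasts, y in zip(expert_forecasts, outcomes):
        dens = [gaussian_pdf(y, m, s) for m, s in forecasts]
        top = max(log_w)
        w = [math.exp(lw - top) for lw in log_w]
        mix_density = sum(wi * d for wi, d in zip(w, dens)) / sum(w)
        mix_loss += -math.log(mix_density)
        for i, d in enumerate(dens):
            expert_loss[i] += -math.log(d)
            log_w[i] += math.log(d)   # multiplicative update by the likelihood
    return mix_loss, expert_loss


random.seed(0)
steps = 200
outcomes = [20.0 + random.gauss(0.0, 1.0) for _ in range(steps)]
# Three hypothetical forecasters issuing fixed Gaussian temperature forecasts.
experts = [(20.0, 1.0), (22.0, 1.5), (18.0, 3.0)]
expert_forecasts = [experts for _ in range(steps)]

mix_loss, expert_loss = run_mixture(expert_forecasts, outcomes)
regret = mix_loss - min(expert_loss)
print(f"regret vs best single expert: {regret:.4f} (bound ln 3 = {math.log(3):.4f})")
```

The bound holds for any outcome sequence because the mixture's cumulative likelihood is the uniform average of the experts' cumulative likelihoods.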

2606.02662 2026-06-03 cs.LG cs.AI physics.chem-ph

Improvise, Adapt, Overcome: An On-The-Fly Multifidelity Algorithm for Efficient Machine Learning

即兴、适应、克服:一种用于高效机器学习的即时多保真算法

Vivin Vinod, Peter Zaspel

发表机构 * School of Mathematics and Natural Sciences, University of Wuppertal(乌珀塔尔大学数学与自然科学学院)

AI总结 提出一种自适应即时多保真机器学习框架,通过动态查询不同保真度的训练样本,自动确定数据集组成,在降低数据生成成本的同时提高模型精度。

Comments Supplementary Information added as separate PDF

详情
AI中文摘要

机器学习加速了量子化学,但受到生成高保真训练数据的高昂成本的阻碍。多保真机器学习(MFML)通过系统性地结合丰富的低保真数据和稀疏的高保真数据来减轻这一开销。尽管取得了成功,标准MFML方案依赖于预定义的缩放因子来确定不同保真度之间的稀疏数据比例,通常会产生冗余的多保真数据,导致效率损失。在这里,我们介绍了一种用于机器学习的自适应即时多保真框架,该框架自主确定训练数据集的组成。通过动态查询每个保真度的训练样本,该算法在转向更昂贵的参考计算之前,先在较低保真度上使模型精度饱和。我们在不同的化学性质上对新颖的自适应MFML进行了基准测试,包括计算化学金标准的耦合簇能量,以及更具化学挑战性的激发能。在我们的数值实验中,我们表明,与单保真方法相比,我们的自适应算法将数据生成成本降低了多达30倍,并且与标准MFML相比提高了多达5倍。数据冗余的缓解为量子化学中可持续的成本感知机器学习建立了一条高精度、低成本的途径。

英文摘要

Machine learning has accelerated quantum chemistry but is hindered by the prohibitive cost of generating high fidelity training data. Multifidelity machine learning (MFML) mitigates this overhead by systematically combining abundant low fidelity data with sparse high fidelity data. In spite of its success, standard MFML schemes rely on pre-defined scaling factors to determine sparse data ratio across fidelities, often generating redundant multifidelity data resulting in a loss of efficiency. Here, we introduce an adaptive on-the-fly multifidelity framework for machine learning that autonomously determines training dataset composition. By dynamically querying training samples at each fidelity, the algorithm saturates model accuracy at lower fidelities before moving up to more expensive reference calculations. We benchmark the novel adaptive-MFML across diverse chemical properties including the computational chemistry gold standard coupled cluster energies, and the more chemically challenging excitation energies. In our numerical experiments we show that our adaptive algorithm reduces data generation costs by up to a factor of 30 compared to single fidelity methods and improves upon standard MFML by up to a factor of 5. The mitigation of data redundancy establishes a high-accuracy low-cost pathway for sustainable cost-aware machine learning in quantum chemistry.

2606.02659 2026-06-03 cs.LG cs.AI

CL-DMDF: Dynamic Multimodal Data Fusion Model Based on Contrastive Learning

CL-DMDF:基于对比学习的动态多模态数据融合模型

Dong Li, Lingling Zhang, Binghao Han, Linlin Ding, Yue Kou

发表机构 * Tsinghua University(清华大学)

AI总结 针对多模态数据融合中模态缺失和局部交互忽视全局互补线索的问题,提出基于对比学习的动态多模态数据融合模型(CL-DMDF),通过跨特征和模态维度的注意力机制、实体质心对比学习模块和自适应融合模块,提升动态融合的效率和准确性。

Comments 9 pages, 5 figures, 7 tables

详情
AI中文摘要

多模态数据融合涉及整合和分析来自多种模态的信息,以揭示潜在的关联和互补模式,从而增强数据处理和决策能力。尽管现有的结构化多模态输入方法通常针对特定任务设计并假设模态完全可观测,但实际应用中常因各种因素导致模态输入不确定或缺失。一些传统模型过度强调缺失模态内的局部交互,忽视了多模态表示中嵌入的全局互补线索。为克服这些限制,我们提出了一种基于对比学习的动态多模态数据融合模型(CL-DMDF)。CL-DMDF引入了一种新颖的注意力机制,该机制在特征和模态维度上同时操作,以计算可靠的注意力分数,有效反映每个层级的重要性。CL-DMDF进一步整合了实体质心对比学习模块,该模块从实体特征构建基于质心的正样本,以增强判别学习。此外,采用自适应融合模块以提高动态融合策略的效率和准确性。在三个数据集上进行的大量实验证明了CL-DMDF在各种多模态融合任务中的有效性。

英文摘要

Multimodal data fusion involves integrating and analyzing information from multiple modalities to uncover latent correlations and complementary patterns, thereby enhancing data processing and decision-making. While existing methods for structured multimodal inputs are typically designed around specific tasks and assume fully observed modalities, real-world applications often suffer from uncertain or missing modality inputs due to various factors. Some traditional models overly emphasize local interactions within missing modalities, neglecting the global complementary cues embedded in multimodal representations. To overcome these limitations, we propose a Dynamic Multimodal Data Fusion model based on Contrastive Learning (CL-DMDF). CL-DMDF introduces a novel attention mechanism that operates across both feature and modality dimensions to compute reliable attention scores, effectively reflecting importance at each level. The CL-DMDF further incorporates an entity-centroid contrastive learning module that constructs centroid-based positive samples from entity features to enhance discriminative learning. Additionally, an adaptive fusion module is employed to improve the efficiency and accuracy of dynamic fusion strategies. Extensive experiments conducted on three datasets demonstrate the effectiveness of the CL-DMDF across diverse multimodal fusion tasks.

2606.02658 2026-06-03 cs.RO

Fixed-Time Dynamic Landing of Quadrotors using Adaptive Unscented Kalman Filtering and Nonlinear Model Predictive Control

基于自适应无迹卡尔曼滤波和非线性模型预测控制的四旋翼飞行器固定时间动态着陆

Mohammadreza Izadi, Zeinab Shayan, Steven Waslander, Reza Faieghi

发表机构 * Autonomous Vehicles Laboratory, Department of Aerospace Engineering, Toronto Metropolitan University(多伦多都会大学航空航天工程系自主车辆实验室) University of Toronto Institute for Aerospace Studies, University of Toronto(多伦多大学航空航天研究所)

AI总结 提出一种结合非线性模型预测控制与实时最小加加速度轨迹规划器及自适应无迹卡尔曼滤波的估计与控制框架,实现多旋翼无人机在移动平台上的固定时间动态着陆,并通过仿真和硬件实验验证了其可重复着陆能力和优于EKF/UKF的速度预测精度。

Comments Accepted to the Conference on Robots and Vision (CRV 2026), Vancouver, Canada

详情
AI中文摘要

本文介绍了一种用于多旋翼无人机在移动平台上动态着陆的估计与控制框架。所提出的方法将非线性模型预测控制与实时最小加加速度轨迹规划器相结合,该规划器强制执行规定的着陆时间,从而在终端下降过程中实现一致的时间安排。为了增强在时变传感质量下的鲁棒性,我们采用了自适应无迹卡尔曼滤波,在线更新过程和测量噪声统计量。此外,我们提供了参考可行性分析,表明在标准跟踪假设下,最小加加速度参考会诱导有界的推力和扭矩指令。所提出的框架在仿真和硬件实验中进行了评估,并表明相对于基于EKF/UKF的方法,实现了可重复的着陆和改进的平台速度预测精度。

英文摘要

This paper introduces an estimation and control framework for dynamic landing of multi-rotor uncrewed aerial vehicles on moving platforms. The proposed method integrates nonlinear model predictive control with a real-time minimum-jerk trajectory planner that enforces a prescribed touchdown time, enabling consistent timing during the terminal descent. To enhance robustness in the presence of time-varying sensing quality, we utilize an adaptive unscented Kalman filter that updates the process and measurement noise statistics online. In addition, we provide a reference feasibility analysis showing that minimum-jerk references induce bounded thrust and torque commands under standard tracking hypotheses. The proposed framework is evaluated in simulation and hardware experiments, and it is shown to achieve repeatable landings and improved platform velocity prediction accuracy relative to EKF/UKF-based methods.
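
To illustrate what a minimum-jerk reference with a prescribed touchdown time looks like, the sketch below solves the standard single-axis problem: with position, velocity, and acceleration fixed at both ends of a fixed duration T, the minimum-jerk trajectory is a quintic polynomial. This is a textbook construction under stated assumptions, not the paper's planner: it treats one axis only, assumes the platform state at touchdown is already predicted, and the numbers (10 m start altitude, 0.5 m deck height, -0.3 m/s terminal velocity, T = 4 s) are hypothetical.

```python
import numpy as np


def quintic_coefficients(p0, v0, a0, pf, vf, af, T):
    """Coefficients (highest degree first) of the quintic meeting both boundary states at time T."""
    c0, c1, c2 = p0, v0, 0.5 * a0
    # Remaining three coefficients follow from the terminal position, velocity, acceleration.
    A = np.array([
        [T**3, T**4, T**5],
        [3 * T**2, 4 * T**3, 5 * T**4],
        [6 * T, 12 * T**2, 20 * T**3],
    ])
    b = np.array([
        pf - (c0 + c1 * T + c2 * T**2),
        vf - (c1 + 2 * c2 * T),
        af - 2 * c2,
    ])
    c3, c4, c5 = np.linalg.solve(A, b)
    return np.array([c5, c4, c3, c2, c1, c0])


T = 4.0  # prescribed touchdown time in seconds (hypothetical)
coeffs = quintic_coefficients(p0=10.0, v0=0.0, a0=0.0, pf=0.5, vf=-0.3, af=0.0, T=T)
velocity = np.polyder(coeffs)

start_pos = np.polyval(coeffs, 0.0)
touchdown_pos = np.polyval(coeffs, T)
touchdown_vel = np.polyval(velocity, T)
print(f"altitude at t=0: {start_pos:.3f} m")
print(f"altitude at t=T: {touchdown_pos:.3f} m, vertical velocity: {touchdown_vel:.3f} m/s")
```

Fixing T turns the planning problem into a small linear solve, which is one reason such references are cheap enough to recompute in real time as the platform estimate changes.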

2606.02657 2026-06-03 cs.LG q-fin.CP q-fin.ST

Regime-Arrival Uncertainty in Generalization Bounds under Distribution Shift

分布偏移下泛化界中的制度到达不确定性

Prince Poudel

发表机构 * Independent researcher(独立研究者)

AI总结 针对分布偏移中制度组成不匹配带来的额外风险,提出量化框架,通过精确分解分离制度不匹配与制度敏感性,并扩展至β-混合数据,给出极小极大下界。

Comments 23 pages, 4 tables, 3 Figures

详情
AI中文摘要

标准泛化界假设训练和部署分布相同或静态,不考虑平静与危机状态比例不同的制度切换环境。本文提出一个框架,通过量化当分布偏移为马尔可夫切换时因制度组成不匹配导致的额外风险,来泛化制度感知模型。我们得到了精确分解,将制度不匹配与制度敏感性分离;将界限扩展到β-混合数据,使用针对谱间隙校正的有效样本量;并在合成数据和25年全球股指上展示了极小极大下界。所提出的惩罚是事后实现的泛化差距,而仅训练估计器未显示显著相关性:危机的特征几何可以被检测到,但时间到达不能。因此,该框架不是预测机器。在制度变化的罕见情况下,预测未来制度的组成是一个开放问题。

英文摘要

The standard generalization bounds assume that the training and deployment distributions are the same, or are static, and don't consider regime switching environments where the ratio of calm vs crisis states is different. This paper proposes a framework that generalizes regime-aware models by quantifying the extra risk due to regime composition mismatch, when distribution shifts are Markov-switching. We obtain an exact decomposition, separating regime mismatch from regime sensitivity; we extend the bound to beta-mixing data using the effective sample size corrected for the spectral gap; and we show a minimax lower bound for synthetic data and on 25 years of global equity indices. The proposed penalty is an ex post realized generalization gap, whereas the training-only estimator does not show significant correlation: the feature geometry of crises can be detected, but not the temporal arrival. Thus, the framework is not a forecast machine. Forecasting the composition of the future regime is an open question in the rare cases of regime change.

2606.02641 2026-06-03 cs.RO cs.AI

CARVE: Certified Affordable Repair of Vetoed Maneuvers via Envelopes for Interactive Driving

CARVE: 通过包络实现交互驾驶中被否决机动的认证可负担修复

Yifan Wang

发表机构 * Yifan Wang(王一帆)

AI总结 针对交互驾驶中规则感知堆栈易忽略的硬规则裕度负值问题,提出CARVE认证层,通过有限格点上的自我与代理战术算子,实现被否决机动的可负担修复认证,并证明其合理性。

Comments 8 pages, 3 figures

详情
AI中文摘要

交互驾驶暴露了规则感知自动驾驶堆栈中容易忽略的失效模式:即使非优先代理的小幅合法让步可恢复可行性,自我候选的硬规则裕度仍可能为负。现有的规则手册、防护和可达性过滤器在否决不安全动作方面表现强劲,而基于预测的规划器则对可能的响应进行建模。两者均未返回运行时证明对象,该对象说明哪个有界多代理编辑修复了机动、谁拥有编辑、请求是否在路权上可负担,以及如果请求未被遵守,自我后备是什么。我们将这一缺失对象形式化为*交互修复认证*,并引入*CARVE*,一个在自我拥有和代理拥有的战术算子有限格点上的无预测认证层。代理拥有的请求仅在\(B_j(s) = \beta(\pi_j)\alpha_j^{\max}(s)\)内可接受,这是一个将运动学可达性与规范优先级分离的合作包络。生成的证书记录了绑定规则、修复类别、修复集、责任加权成本分配和后备。在589个基于Lanelet2几何的INTERACTION重放片段上,CARVE-Greedy接受了98.64%的初始否决机动,恢复了370/378个人类解决错误否决,同时保持了589/589的路权尊重、零优先级代理假阳性以及400/400的负压力否决。我们证明了证书的合理性、结构性的路权尊重、精确的有限格点最小性、后备应急性和责任一致性条件。CARVE不预测也不需要其他驾驶员的合规性;它认证在声明假设下提议的交互是否有界、可归因且规范上可接受。

英文摘要

Interactive driving exposes a failure mode that is easy to miss in rule-aware autonomous-driving stacks: a hard-rule margin can be negative for an ego candidate even though a small lawful accommodation by a non-priority agent would restore feasibility. Existing rulebooks, shields, and reachability filters are strong at vetoing unsafe actions, while prediction-based planners model likely responses. Neither returns a runtime proof object that states which bounded multi-agent edit repairs the maneuver, who owns the edit, whether the request is right-of-way affordable, and what ego fallback remains if the request is not observed. We formulate this missing object as *interactive repair certification* and introduce *CARVE*, a prediction-free certificate layer over a finite lattice of ego-owned and agent-owned tactical operators. Agent-owned requests are admissible only inside \(B_j(s) = \beta(\pi_j)\alpha_j^{\max}(s)\), a cooperation envelope that separates kinematic reachability from normative priority. The resulting certificate records the binding rule, repair category, repair set, responsibility-weighted cost split, and fallback. On 589 Lanelet2-geometry-grounded INTERACTION replay episodes, CARVE-Greedy accepts 98.64% of initially vetoed maneuvers and recovers 370/378 human-resolved false vetoes, while preserving 589/589 right-of-way respect, zero priority-agent false positives, and 400/400 negative-stress vetoes. We prove certificate soundness, structural right-of-way respect, exact finite-lattice minimality, fallback contingency, and blame-consistency conditions. CARVE does not predict or require another driver's compliance; it certifies whether a proposed interaction is bounded, attributable, and normatively admissible under declared assumptions.

2606.02638 2026-06-03 cs.SD cs.AI eess.AS

SegTune: Structured and Fine-Grained Control for Song Generation

SegTune:歌曲生成的结构化与细粒度控制

Yuejiao Wang, Zihao Ji, Pengfei Cai, Xu Li, Haorui Zheng, Zewen Song, Zhongliang Liu, Chen Zhang, Pengfei Wan

发表机构 * Kling Team, Kuaishou Technology(快手科技 Kling 团队) University of Science and Technology of China(中国科学技术大学) Peking University(北京大学)

AI总结 提出基于扩散Transformer的SegTune框架,通过用户或LLM指定局部音乐描述实现结构化细粒度控制,并引入LLM时长预测器实现精确歌词-音乐对齐,在音乐性和可控性上超越现有基线。

Comments This paper has been accepted to ACL 2026 as an oral presentation and has been nominated for the Best Paper Award. This work is a revised and extended version of an earlier technical report (arXiv:2510.18416). arXiv admin note: text overlap with arXiv:2510.18416

详情
AI中文摘要

近期神经歌曲生成的进展使得从歌词和全局文本提示中实现高质量合成成为可能。然而,大多数系统无法建模歌曲随时间变化的属性,严重限制了音乐结构和动态的细粒度控制。为解决这一问题,我们提出SegTune,一个基于扩散Transformer的框架,通过允许用户或大型语言模型(LLM)指定与歌曲片段对齐的局部音乐描述,实现结构化和细粒度的可控性。这些片段提示被时间广播到对应的时间窗口,而全局提示则确保风格连贯性。为支持精确的歌词-音乐对齐,我们引入了一个基于LLM的时长预测器,以LyRiCs格式自回归生成句子级时间戳。我们进一步构建了一个大规模数据管道,用于收集高质量歌曲及其对齐的歌词和提示,并提出了新的指标来评估片段对齐和声乐一致性。实验表明,SegTune在音乐性和可控性方面均优于现有基线。访问我们的项目页面(https://github.com/KlingAIResearch/SegTune)获取代码和更多生成的歌曲。

英文摘要

Recent advances in neural song generation have enabled high-quality synthesis from lyrics and global textual prompts. However, most systems fail to model temporally varying attributes of songs, severely limiting fine-grained control over musical structure and dynamics. To address this, we propose SegTune, a Diffusion Transformer-based framework enabling structured and fine-grained controllability by allowing users or large language models (LLMs) to specify local musical descriptions aligned to song segments. These segment prompts are temporally broadcast to corresponding time windows, while global prompts ensure stylistic coherence. To support precise lyric-to-music alignment, we introduce an LLM-based duration predictor that autoregressively generates sentence-level timestamps in LyRiCs format. We further construct a large-scale data pipeline for high-quality song collection with aligned lyrics and prompts, and propose new metrics to evaluate segment alignment and vocal consistency. Experiments demonstrate that SegTune outperforms existing baselines in both musicality and controllability. Visit our project page (https://github.com/KlingAIResearch/SegTune) for codes and more generated songs.
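
The abstract mentions two mechanics that a small example can make concrete: sentence-level timestamps in LyRiCs format, and segment prompts broadcast over time windows. The sketch below is an illustration under stated assumptions, not SegTune's code. It assumes the common `[mm:ss.xx] lyric` line layout, represents prompts as plain strings rather than embeddings, and uses hypothetical names (`parse_lrc`, `broadcast_prompts`) and a hypothetical frame rate.

```python
import re

# Assumed LRC-style line layout: "[mm:ss.xx] lyric text".
LRC_LINE = re.compile(r"\[(\d+):(\d+(?:\.\d+)?)\](.*)")


def parse_lrc(text):
    """Return (start_seconds, lyric) pairs for every timestamped line."""
    entries = []
    for line in text.splitlines():
        match = LRC_LINE.match(line.strip())
        if match:
            minutes, seconds, lyric = match.groups()
            entries.append((int(minutes) * 60 + float(seconds), lyric.strip()))
    return entries


def broadcast_prompts(segments, total_seconds, frames_per_second):
    """Copy each segment's prompt onto every frame inside its time window.

    segments is a list of (start_s, end_s, prompt) tuples. Frames that no
    segment covers stay None, where a global prompt alone would apply.
    """
    n_frames = int(total_seconds * frames_per_second)
    frames = [None] * n_frames
    for start, end, prompt in segments:
        first = max(0, int(start * frames_per_second))
        last = min(n_frames, int(end * frames_per_second))
        for i in range(first, last):
            frames[i] = prompt
    return frames


timestamps = parse_lrc("[00:12.50] hello\n[01:02.00] world")
frames = broadcast_prompts(
    [(0, 2, "soft piano intro"), (2, 4, "full band chorus")],
    total_seconds=4,
    frames_per_second=2,
)
print(timestamps)
print(frames)
```

In a real system the per-frame entries would presumably be conditioning vectors rather than strings; the indexing logic is the part this sketch is meant to show.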

2606.02628 2026-06-03 cs.LG cs.CL

Hallucination Is Linearly Decodable from Mid-Layer Hidden States in Quantized LLMs

幻觉可从量化LLM中间层隐藏状态线性解码

Aizierjiang Aiersilan

发表机构 * University of Macau(澳门大学)

AI总结 研究开源LLM在4位量化下中间层隐藏状态是否编码线性可分的真实性信号,发现单层线性探针AUROC达0.904-1.000,优于采样方法,且信号近似线性。

详情
AI中文摘要

我们研究开源LLM是否在其隐藏状态中编码线性可分的真实性信号,以及该信号在网络的哪个深度最强。在三个7B-8B指令微调模型(Llama-3.1-8B、Mistral-7B、Qwen2.5-7B)以4位NF4量化加载的情况下,我们在四个幻觉基准(TruthfulQA、HaluEval-QA、FEVER和一个受控合成集)上提取每层隐藏状态,并比较四种检测方法:线性与MLP探针、INSIDE EigenScore、自一致性和注意力熵。单个中间网络层的线性探针在保留分割上达到0.904-1.000 AUROC,而基于采样的检测器在相同协议下不超过0.541 AUROC。真实性信号近似线性:MLP探针超出线性探针的幅度很少大于0.01 AUROC。在自然语言基准上,峰值探测层在不同模型家族中落在一致的区间内:Llama和Mistral为32个块中的第13-18块,Qwen为28个块中的第19-25块。第一块注意力熵在基于知识的设置中提供互补信号(HaluEval-QA上0.866-0.941 AUROC),且无额外推理成本。该协议下采样方法的低区分性反映了配对标签评估与这些方法所访问信息之间的结构性不匹配,而非这些方法的固有限制。代码和数据已发布,可在单个8 GB GPU上完全复现。

英文摘要

We investigate whether open-source LLMs encode a linearly separable truthfulness signal in their hidden states, and at which network depth this signal is strongest. Across three 7B-8B instruction-tuned models (Llama-3.1-8B, Mistral-7B, Qwen2.5-7B) loaded in 4-bit NF4 quantization, we extract per-layer hidden states on four hallucination benchmarks (TruthfulQA, HaluEval-QA, FEVER, and a controlled synthetic set) and compare four detection approaches: linear and MLP probes, INSIDE EigenScore, self-consistency, and attention entropy. A linear probe on a single mid-network layer achieves 0.904-1.000 AUROC on held-out splits, while sampling-based detectors do not exceed 0.541 AUROC under the same protocol. The truthfulness signal is approximately linear: MLP probes rarely surpass linear probes by more than 0.01 AUROC. Peak probing layers fall in a consistent band across model families on natural-language benchmarks: blocks 13-18 of 32 for Llama and Mistral, and blocks 19-25 of 28 for Qwen. First-block attention entropy provides a complementary signal in knowledge-grounded settings (0.866-0.941 AUROC on HaluEval-QA) at no additional inference cost. The low discriminability of sampling methods under this protocol reflects a structural mismatch between paired-label evaluation and the information these methods access, rather than an inherent limitation of those methods. Code and data are released for full reproducibility on a single 8 GB GPU.
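
A linear probe scored by AUROC on a held-out split can be illustrated in a few lines of standard-library Python. This is a sketch under loud assumptions and not the paper's released code: the "hidden states" are synthetic Gaussian vectors, the probe is a plain logistic regression trained by batch gradient descent, and every name and hyperparameter here is hypothetical.

```python
import math
import random


def auroc(scores, labels):
    """Probability that a random positive outscores a random negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_linear_probe(X, y, lr=0.1, epochs=200):
    """Fit logistic-regression weights and bias with batch gradient descent."""
    dim, n = len(X[0]), len(X)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, label in zip(X, y):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - label
            for k in range(dim):
                grad_w[k] += err * x[k]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def make_synthetic_states(n, dim, rng):
    """Stand-in for one layer's hidden states: two shifted Gaussian classes."""
    X, y = [], []
    for i in range(n):
        label = i % 2
        mean = 1.0 if label == 1 else -1.0
        X.append([rng.gauss(mean, 1.0) for _ in range(dim)])
        y.append(label)
    return X, y


rng = random.Random(0)
X_train, y_train = make_synthetic_states(200, 8, rng)
X_test, y_test = make_synthetic_states(200, 8, rng)  # held-out split
w, b = train_linear_probe(X_train, y_train)
test_scores = [sum(wi * xi for wi, xi in zip(w, x)) + b for x in X_test]
probe_auroc = auroc(test_scores, y_test)
print(f"held-out AUROC: {probe_auroc:.3f}")
```

Because the synthetic classes are well separated, the printed AUROC should be close to 1; it says nothing about real models. To probe a real network, the synthetic vectors would be replaced with hidden states extracted at a chosen layer, and the probe repeated per layer to locate the depth where the score peaks.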