arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.06829 2026-06-08 cs.RO New submission

Three-dimensional hydro-cluttered locomotion by an undulatory robot

Tianyu Wang, Matthew Fernandez, Galen Tunnicliffe, Nikolas Cornell, Justin Duong, Donoven Dortilus, Zhaochen J. Xu, Patricia Meza, Sean Lublinsky, Darsh Parikh, Jianfeng Lin, Emily Grace, Daniel I. Goldman

Affiliations: Institute for Robotics and Intelligent Machines, Georgia Institute of Technology; School of Physics, Georgia Institute of Technology; George W. Woodruff School of Mechanical Engineering, Georgia Institute of Technology; School of Electrical and Computer Engineering, Georgia Institute of Technology; Department of Mechanical and Industrial Engineering, Northeastern University; Ransom Everglades School

AI summary: Proposes the AquaMILR robot, which uses programmable body compliance and depth regulation to achieve fast, robust forward locomotion in three-dimensional hydro-cluttered environments, with inertia-induced rolling serving as a spontaneous recovery mechanism.

Abstract

Aquatic robots have expanded human access to underwater environments, yet many underwater spaces contain obstacles that can disrupt open-water locomotion. In "hydro-cluttered" environments, water is interspersed with rigid and flexible clutter, making body-obstacle contact unavoidable. Operating in these spaces requires robots that can regulate and exploit contact, but this regime remains difficult to model or simulate. Building on recent advances in mechanical intelligence in terradynamically capable limbless robotics, we develop principles for 3D aquatic locomotion using AquaMILR, an elongate limbless robot that combines bilateral cable-driven actuation, programmable body compliance, distributed depth regulation, corrosion-resistant enclosures, and onboard power and electronics for untethered field operation. Systematic robophysical experiments reveal that programmable body compliance regulates body deformation and converts body-environment interactions into fast, robust, forward progression across increasing hydro-clutter constraint strength. Depth regulation provides three-dimensional access, allowing the robot to bypass clutter, recover from obstruction, and continue through otherwise inaccessible routes. In potential jamming scenarios, emergent inertia-induced rolling acts as a spontaneous recovery mechanism, freeing the robot from clutter that would otherwise lead to failure and allowing locomotion to continue without additional control. Tests of the robot in an aquatic mangrove field demonstrate that these principles transfer to practical operation, enabling navigation and onboard visual inspection of inaccessible root zones. These results establish principles for hydro-cluttered locomotion and a design paradigm in which aquatic robots exploit environmental complexity as a locomotor resource.

2606.06828 2026-06-08 cs.CV cs.LG New submission

AdaGRPO: A Capability-Aware Adaptive Enhancement for Flow-based GRPO

Jiazi Bu, Pengyang Ling, Yujie Zhou, Yibin Wang, Yuhang Zang, Tianyi Wei, Xiaohang Zhan, Jiaqi Wang, Tong Wu, Xingang Pan, Dahua Lin

Affiliations: Shanghai Jiao Tong University; S-Lab, Nanyang Technological University; Shanghai AI Laboratory; University of Science and Technology of China; Stanford University; Shanghai Innovation Institute; The Chinese University of Hong Kong; Fudan University; CPII under InnoHK; Adobe Research

AI summary: Proposes AdaGRPO, which combines an online curriculum filtering strategy with cross-level advantage fusion to address random prompt selection and the lack of a global perspective in advantage estimation for flow-based GRPO, improving training stability and performance.

Comments: Project Website: https://bujiazi.github.io/adagrpo.github.io/

Abstract

Group Relative Policy Optimization (GRPO) has demonstrated remarkable success in aligning text-to-image (T2I) flow models with human preferences. However, we have identified that the learning loop of current flow-based GRPO is fundamentally decoupled from the learner's current capability, suffering from critical blind spots in both prompt selection and advantage estimation: (i) Existing methods sample prompts randomly, overlooking the substantial impact of data selection on reinforcement learning (RL) efficacy, a factor proven crucial in GRPO for large language models; (ii) They evaluate sample quality relying solely on intra-group statistics, lacking a global perspective to accurately measure true policy improvement. To address these issues, we propose Adaptive GRPO (AdaGRPO), a novel capability-aware RL algorithm tailored for flow models. Specifically, AdaGRPO consists of two principal components: (i) Online Curriculum Filtering Strategy: Dynamically tracks the model's proficiency and adaptively selects prompts that best match its current learning boundary; (ii) Cross-Level Advantage Fusion: Synergistically integrates fine-grained intra-group advantages with macro-level global advantages, providing a comprehensive and unbiased policy evaluation. As a lightweight, plug-and-play module, AdaGRPO can be seamlessly integrated with existing frameworks such as Flow-GRPO, DanceGRPO, and Flow-CPS. Extensive experiments demonstrate that AdaGRPO consistently drives performance gains while significantly stabilizing GRPO training for flow models.
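
The abstract does not give the fusion formula, so the following is only a minimal Python sketch of one way intra-group and global advantages could be combined. It assumes both are plain z-scores and that they are mixed linearly; the function names and the mixing weight `lam` are hypothetical and not taken from the paper.

```python
from statistics import mean, pstdev


def zscores(values, eps=1e-8):
    """Standardize values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]


def fused_advantages(groups, lam=0.5):
    """Mix per-prompt (intra-group) advantages with batch-wide (global) ones.

    groups: list of reward lists, one list per prompt.
    lam:    hypothetical mixing weight; 0 gives plain group-relative advantages.
    """
    all_rewards = [r for group in groups for r in group]
    global_adv = zscores(all_rewards)  # advantage relative to the whole batch
    fused, offset = [], 0
    for group in groups:
        local_adv = zscores(group)  # advantage relative to the prompt's own group
        fused.append([
            (1 - lam) * a + lam * global_adv[offset + k]
            for k, a in enumerate(local_adv)
        ])
        offset += len(group)
    return fused
```

With `lam=0` the sketch reduces to standard group-relative advantages; with `lam>0`, the best sample of a prompt whose rewards are high across the batch receives a larger advantage than the best sample of a prompt whose rewards are low.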

2606.06827 2026-06-08 cs.LG New submission

Architecture Shapes Transfer Specificity in Implicit Neural Representations

D Yang Eng

Affiliations: D Yang Eng

AI summary: Uses controlled experiments and PDE benchmarks to study the transfer behavior of three implicit neural representation architectures (SIREN, ReLU MLPs, and Fourier-feature MLPs), finding that transfer magnitude and transfer specificity separate: ReLU is more source-selective, whereas SIREN reuses weights broadly.

Abstract

Transfer in coordinate networks is often measured by warm-start gain, but whether that gain reflects source-specific structure or generic weight reuse is less clear. We study this question across three implicit neural representation (INR) families, SIREN, ReLU MLPs, and Fourier-feature MLPs, using controlled analytic tests, a 2D lid-driven-cavity Navier--Stokes benchmark, and 1D PDE reference-solution suites for heat, viscous Burgers, and focusing cubic NLS. The analytic tests use independent-seed random controls, while the PDE benchmarks use alternate same-family source controls and auxiliary ablations. Across settings, transfer magnitude and transfer specificity separate clearly. In a 10-seed controlled 1D geometric test, Fourier Features show the largest structured transfer ($33.1\times$), followed by SIREN ($23.0\times$) and ReLU ($10.7\times$), but ReLU is far more selective: random-control transfer is $0.41\times$ for ReLU versus $14.24\times$ for SIREN. On a controlled two-parameter 1D family, the ranking changes: ReLU gives the clearest structured-versus-control separation at default settings, whereas Fourier Features improve only after bandwidth retuning. In Navier--Stokes and the broader 1D PDE suite, no single architecture dominates every equation, yet the same pattern remains: SIREN often reuses weights broadly, whereas ReLU and, in some equations, Fourier Features are more source-selective. Static diagnostics remain weak, and the heuristic scaling law $A_{\text{transfer}} \propto 1/\Delta t^2$ is rejected in the implemented 1D audit. These results position transfer specificity as a useful diagnostic for coordinate networks and suggest that architecture selection in scientific machine learning should be evaluated under explicit control conditions, not by transfer magnitude alone.

2606.06825 2026-06-08 cs.CL cs.AI New submission

Progress-SQL: Improving Reinforcement Learning for Text-to-SQL via Progressive Rewards

Shihao Zhang, Xiaoman Wang, Yuan Liu, Yunshi Lan, Weining Qian

Affiliations: East China Normal University

AI summary: Proposes Progress-SQL, a multi-turn reinforcement learning framework that uses an Oracle-guided Diagnostic Tree (ODT) to generate clause-level structural feedback and combines it with progressive rewards (structural alignment, lexical alignment, a latency reward, and an execution status reward) to improve the accuracy and robustness of Text-to-SQL generation.

Abstract

Reinforcement learning has recently shown promise in improving large language models for Text-to-SQL generation, yet existing methods typically optimize one-shot rewards defined over a single SQL state. Such rewards provide limited guidance for iterative SQL correction and are insufficient to capture the improvement of multi-turn SQL refinement. In this paper, we propose Progress-SQL, a multi-turn reinforcement learning framework with progressive rewards for Text-to-SQL. Our approach introduces an Oracle-guided Diagnostic Tree (ODT), which abstracts SQL queries into clause-level structural profiles and produces diagnostic feedback for next-turn refinement. To provide dense and robust reward signals, we combine ODT-based structural alignment with lexical alignment and define a progressive reward that measures the improvement from the initial SQL to the final SQL. We further incorporate a progression latency reward that favors earlier correctness and an execution status reward that encourages recovery from invalid SQL. Experiments on BIRD, Spider, and Spider robustness variants demonstrate that our method consistently improves Text-to-SQL performance across both primary and robustness evaluations.
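
To make the notion of a progressive reward concrete, here is a rough Python sketch that scores a SQL state against a gold query and rewards the improvement from the first to the last turn. Everything in it is an assumption for illustration: the keyword-based `clause_profile` is a crude stand-in for the ODT, the Jaccard measures and the weight `w_struct` are hypothetical, and the latency and execution status rewards are omitted.

```python
CLAUSES = ("SELECT", "FROM", "WHERE", "GROUP BY", "HAVING", "ORDER BY", "LIMIT")


def clause_profile(sql):
    """Crude clause-level profile: the set of clause keywords present in the query."""
    text = " ".join(sql.upper().split())
    return {clause for clause in CLAUSES if clause in text}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def state_score(sql, gold, w_struct=0.5):
    """Blend structural (clause) and lexical (token) alignment with the gold SQL."""
    structural = jaccard(clause_profile(sql), clause_profile(gold))
    lexical = jaccard(set(sql.lower().split()), set(gold.lower().split()))
    return w_struct * structural + (1 - w_struct) * lexical


def progressive_reward(initial_sql, final_sql, gold):
    """Reward the improvement from the initial SQL to the final SQL."""
    return state_score(final_sql, gold) - state_score(initial_sql, gold)
```

A query that already matches the gold SQL at the first turn earns zero progressive reward under this sketch, which is one reason a separate term favoring early correctness, as the abstract describes, would be needed.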

2606.06823 2026-06-08 cs.LG cs.AI q-fin.ST New submission

PandaAI: A Practical Agent CQ2 for Neuro-symbolic Data Analysis And Integrated Decision-Making in Quantitative Finance

Yuqi Li, Siyuan Liu, Bingjun Liu

Affiliations: Panda AI

AI summary: To address the low signal-to-noise ratio and non-stationarity of financial data, proposes PandaAI, a closed-loop neuro-symbolic LLM agent that combines market regime modeling with constrained alpha generation and achieves risk-aware decision-making through domain fine-tuning and a modular architecture; on CSI 300 data it improves Rank IC by 18.2% and reduces maximum drawdown by 25.7%.

Abstract

While deep learning has excelled in various domains, its application to sequential decision-making in finance remains challenging due to the low Signal-to-Noise Ratio (SNR) and non-stationarity of financial data. Leveraging the reasoning capabilities of Large Language Models (LLMs), we propose PandaAI, a closed-loop neuro-symbolic LLM agent with market regime modeling and constrained alpha generation, which bridges general LLM reasoning with financial rigor and suppresses the financial toxicity of LLM-generated outputs. To bridge the gap between general linguistic capability and financial rigor, we fine-tune a domain-specific LLM. Furthermore, we integrate this LLM into a modular architecture and form a closed-loop system. Unlike traditional models that optimize isolated prediction metrics, PandaAI is designed as a neuro-symbolic agent that navigates the complex, real-world financial environment with explicit risk awareness. Extensive experiments on CSI 300 stock data show that PandaAI achieves an $18.2\%$ higher Rank IC and a $25.7\%$ lower maximum drawdown than state-of-the-art time-series models. Our constrained LLM generation and dual-channel adaptation method provide a general paradigm for LLM deployment in high-stakes sequential decision-making scenarios.
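
For readers unfamiliar with the two metrics, the sketch below computes them in plain Python under their usual definitions: Rank IC as the Spearman rank correlation between predicted scores and realized returns, and maximum drawdown as the largest peak-to-trough decline of an equity curve. The abstract does not describe the paper's evaluation protocol, so the function names and inputs here are illustrative only.

```python
def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def rank_ic(scores, returns):
    """Spearman rank correlation between predicted scores and realized returns."""
    return pearson(ranks(scores), ranks(returns))


def max_drawdown(equity):
    """Largest fractional decline from a running peak of the equity curve."""
    peak, worst = equity[0], 0.0
    for value in equity:
        peak = max(peak, value)
        worst = max(worst, (peak - value) / peak)
    return worst
```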

2606.06820 2026-06-08 cs.LG cs.AI New submission

SCALE: Scalable Cross-Attention Learning with Extrapolation for Agentic Workflow Scheduling

Zhifei Xu, Jierui Lan, Zixuan Liang, Aiji Liang, Jinxi He

Affiliations: Faculty of Arts and Sciences, Beijing Normal University

AI summary: Proposes SCALE, a deep reinforcement learning workflow scheduler that uses a cross-attention pointer network and structured representation regularization to generalize to clusters of different sizes without fine-tuning.

Comments: Submitted to Computer Networks

Abstract

Agentic Large Language Model (LLM) systems decompose complex tasks into workflow Directed Acyclic Graphs (DAGs) whose primitives must be scheduled on heterogeneous clusters. Existing deep reinforcement learning (DRL) schedulers are tied to a fixed cluster size and require retraining whenever the number of servers changes. We propose SCALE (Scalable Cross-Attention Learning with Extrapolation), a DRL scheduler that generalizes to unseen cluster scales without fine-tuning. SCALE employs a cross-attention pointer network where task features query against server features, so the architecture accepts any number of servers by construction. We observe, however, that a permutation-invariant architecture alone does not guarantee good performance at new scales: the attention feature undergoes distribution shift as the server count grows. To counter this, we introduce Structured Representation Regularization (SRR): a decorrelation loss combined with a KL penalty toward the standard normal, which keeps feature statistics stable regardless of input size. Trained on 16 nodes and tested directly on 32 and 48 nodes, SCALE reduces average response time by 8.9% at N=48 relative to the same architecture without SRR, confirming that explicit regularization is necessary to close the scale-generalization gap.
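
The two regularization terms named in the abstract can be illustrated with a small plain-Python sketch. It assumes each feature dimension is summarized as a Gaussian with the batch mean and variance (so the KL term has a closed form) and that decorrelation penalizes squared off-diagonal correlations; the paper's exact losses and weights are not given in the abstract, and `srr_penalty` is a hypothetical name.

```python
import math


def srr_penalty(features, eps=1e-8):
    """Return (kl_term, decorrelation_term) for a batch of feature vectors.

    features: list of samples, each a list of d floats.
    """
    n, d = len(features), len(features[0])
    means = [sum(row[j] for row in features) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in features]
    var = [sum(c[j] ** 2 for c in centered) / n + eps for j in range(d)]

    # Closed-form KL( N(mean, var) || N(0, 1) ), summed over feature dimensions.
    kl = sum(
        0.5 * (var[j] + means[j] ** 2 - 1.0 - math.log(var[j])) for j in range(d)
    )

    # Sum of squared correlations between distinct feature dimensions.
    decor = 0.0
    for a in range(d):
        for b in range(a + 1, d):
            cov = sum(c[a] * c[b] for c in centered) / n
            decor += cov ** 2 / (var[a] * var[b])
    return kl, decor
```

Both terms are zero when the features are uncorrelated with zero mean and unit variance, which is the stable target the abstract describes.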

2606.06813 2026-06-08 cs.CV cs.AI New submission

Breaking the Lock-in: Diversifying Text-to-Image Generation via Representation Modulation

Dahee Kwon, Haeun Lee, Jaesik Choi

Affiliations: KAIST

AI summary: To address text-to-image models producing overly similar samples under a fixed prompt, proposes DAVE, a training-free representation-level intervention that selectively attenuates the zero-frequency spatial-average (DC) component early in generation to increase diversity while preserving image quality at minimal computational overhead.

Comments: Accepted to ICML 2026. Code is available at: https://github.com/daheekwon/DAVE

Abstract

Recent text-to-image models built on large-scale Transformer backbones and flow-based objectives deliver strong text-image alignment and high visual quality, yet often produce overly similar samples under a fixed prompt. Existing diversity-enhancement methods alleviate this issue, but typically require expensive sampling or auxiliary optimization, incurring non-trivial overhead. To investigate the root cause of this homogeneity, we examine intermediate Transformer features and observe that the zero-frequency spatial average (DC) component rapidly converges across seeds early in generation, causing early trajectory lock-in that limits downstream variation. Building on this observation, we propose DC Attenuation for diVersity Enhancement (DAVE), a training-free representation-level intervention that selectively attenuates this component in the early regime. DAVE preserves the sampling pipeline with negligible overhead, improving prompt-consistent diversity while maintaining competitive image quality.
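
As a rough illustration of what attenuating the DC component means, the Python sketch below subtracts a fraction of the per-channel spatial mean from a set of feature tokens. It is not the authors' implementation (their code is linked in the comments above); the function name and the strength `alpha` are hypothetical, and the scheduling that restricts the intervention to early sampling steps is left out.

```python
def attenuate_dc(tokens, alpha=0.5):
    """Scale down the zero-frequency (spatial-average) part of the features.

    tokens: list of spatial tokens, each a list of channel values.
    alpha:  hypothetical strength in [0, 1]; 0 leaves features unchanged,
            1 removes the spatial mean entirely.
    """
    n, d = len(tokens), len(tokens[0])
    # DC component: the mean over spatial positions, one value per channel.
    dc = [sum(token[c] for token in tokens) / n for c in range(d)]
    return [[token[c] - alpha * dc[c] for c in range(d)] for token in tokens]
```

Differences between tokens (the non-DC content) are untouched; only the shared offset shrinks by a factor of `1 - alpha`.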

2606.06812 2026-06-08 cs.CL New submission

Quantifying Media Representation Dynamics Across 25 Years of News Reporting on Policing-related Deaths

Farhan Samir, Jappun Dhillon, Meghna Ravikumar, Syed Ishtiaque Ahmed, Vered Shwartz

Affiliations: University of Toronto; University of British Columbia

AI summary: Analyzes 4,000 Canadian news articles spanning 25 years and proposes the PerspectiveGap model, finding that state-bureaucrat perspectives appear nearly three times as often as perspectives from members of the public, and that civilian representation has increased in recent years.

Comments: 9 pages, 6 figures. WebSci '26

Journal ref: Proceedings of the 18th ACM Web Science Conference 2026 (pp. 421-429)

Abstract

We perform the largest known computational analysis of Canadian news narratives about police-involved deaths, spanning 4,000 articles from the last quarter-century. We develop a novel computational model, PerspectiveGap, grounded in prior sociological work on media representation of policing. We find that reporting on police-involved deaths on average features perspectives from state bureaucrats at a rate nearly three times as much as perspectives from other members of the public, including relatives, community members, eyewitnesses, lawyers representing the family, or civil liberties groups. A considerable fraction of articles contain no points of view from civilian actors, though civilian representation has increased in recent years. Qualitatively, we find that state bureaucrats' accounts of these deaths tend to be clinical and procedural, while civilian discourse carries considerably more emotional valence. The PerspectiveGap framework developed here can be contextualized to other jurisdictions, offering a scalable approach for analyzing how media systems construct narratives around policing and accountability.

2606.06806 2026-06-08 cs.SD eess.AS New submission

Leveraging Soft Distributions of SSL-Derived Discrete Speech Tokens for Downstream Inference

Kentaro Onda, Satoru Fukayama, Daisuke Saito, Nobuaki Minematsu

Affiliations: The University of Tokyo; National Institute of Advanced Industrial Science and Technology (AIST)

AI summary: Proposes applying soft token assignment only during downstream inference, which retains the training efficiency of hard discretization while increasing expressiveness at inference; the method outperforms hard assignment on ASR and speech synthesis and surpasses continuous SSL features on non-native ASR.

Comments: Accepted to Interspeech 2026

Abstract

Discrete speech tokens obtained from self-supervised learning (SSL) models provide efficient data compression while maintaining strong performance, and have been widely used as intermediate representations in various tasks. However, discretization inevitably causes information loss, leading to degraded performance compared with continuous SSL features. In this work, we propose to apply soft token assignment only during downstream inference. This approach preserves the efficiency of hard discretization during training while enhancing the expressiveness of the tokens at inference. The proposed method outperforms conventional hard assignment on both ASR and speech synthesis tasks, and exhibits particularly strong generalizability to out-of-domain data. For ASR of non-native speech, it even surpasses models using continuous SSL features. Moreover, analysis of the resulting representations shows they align more accurately with phonemes compared with conventional hard assignment.
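
The contrast between hard and soft assignment can be sketched in a few lines of Python. The abstract does not say how the soft distribution is computed, so this example assumes a softmax over negative squared distances to codebook centroids (as from k-means); the `temperature` parameter and function names are hypothetical.

```python
import math


def squared_distances(frame, codebook):
    """Squared Euclidean distance from one feature frame to every centroid."""
    return [sum((f - c) ** 2 for f, c in zip(frame, code)) for code in codebook]


def hard_assign(frame, codebook):
    """Conventional discretization: index of the nearest centroid."""
    d2 = squared_distances(frame, codebook)
    return d2.index(min(d2))


def soft_assign(frame, codebook, temperature=1.0):
    """Soft token: a probability distribution over all centroids."""
    logits = [-x / temperature for x in squared_distances(frame, codebook)]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(value - top) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

A downstream model trained on hard tokens could, under this assumption, consume the soft distribution at inference by taking the probability-weighted average of its token embeddings instead of a single embedding lookup.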

2606.06805 2026-06-08 cs.RO cs.AI cs.SY eess.SY New submission

Lane Change Trajectory Planning for Personalized Driving Comfort and Mobility Efficiency

Haoxuan Dong, Dongjun Li, Ziyou Song

Affiliations: Department of Mechanical Engineering; Department of Electrical Engineering; National University of Singapore; Computer Science; University of Michigan

AI summary: Proposes a neural network-driven trajectory planner that combines a third-order polynomial trajectory generator with a learning module, using a shared backbone with dual heads and a statistical gate based on error-winner logistic regression to balance personalized comfort and mobility efficiency.

Comments: Accepted by the IEEE Intelligent Vehicles Symposium (IEEE IV 2026), Detroit, MI, United States, June 22-25, 2026

Abstract

Lane changing entails simultaneous longitudinal and lateral motions that affect driving comfort and mobility efficiency. Because these motions are tightly coupled and subject to substantial inter-vehicle variability, trajectory planning for lane-change maneuvers is highly personalized. This study proposes a neural network-driven planner that integrates a third-order polynomial trajectory generator with a learning module that infers optimal trajectory parameters across diverse driving conditions. Using a shared backbone with dual heads, one head ensures all-condition operational guarantees, while the other captures driver-specific preferences for comfort or mobility efficiency. A head-gated switching mechanism, realized through a statistical gate based on error-winner logistic regression, adaptively selects the appropriate head under varying driving conditions, which enables context-aware lane-change trajectory planning. Representative cases and Monte Carlo simulations show that the proposed planner achieves personalized comfort and mobility during lane changes, while the baseline ensures feasible trajectories under driving conditions where personalized data are insufficient or inaccessible.
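
A third-order polynomial is the lowest-order polynomial that can meet position and velocity constraints at both ends of a maneuver. The Python sketch below shows the lateral part only, assuming the vehicle starts and ends with zero lateral velocity; these boundary conditions, the function name, and the parameters `width` and `duration` are illustrative assumptions, not the paper's parameterization.

```python
def cubic_lane_change(width, duration, t):
    """Lateral offset, velocity, and acceleration at time t (0 <= t <= duration).

    Solves y(0)=0, y'(0)=0, y(T)=width, y'(T)=0 for a cubic in t,
    which gives y = width * (3 s^2 - 2 s^3) with s = t / T.
    """
    s = t / duration
    y = width * (3 * s ** 2 - 2 * s ** 3)
    vy = width * (6 * s - 6 * s ** 2) / duration
    ay = width * (6 - 12 * s) / duration ** 2
    return y, vy, ay
```

Under these assumptions the peak lateral acceleration is `6 * width / duration**2`, so shortening the maneuver improves mobility at the cost of comfort, which is the kind of trade-off a learned module could tune per driver.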

2606.06804 2026-06-08 cs.LG stat.AP New submission

Interpreting Learning Under Competing Models: Joint and Stepwise Approaches for Dynamic Cognitive Diagnosis

Yawen Ma, Sahoko Ishida, Kate Cain, Gabriel Wallin

Affiliations: School of Mathematical Sciences, Lancaster University; Department of Computer Science, University of Oxford; Department of Psychology, Lancaster University

AI summary: Studies how, when the item-skill structure is unknown, estimating the Q-matrix jointly with the learning process rather than fixing the Q-matrix before studying learning changes conclusions about learner development; analyzes reading-game data with dynamic cognitive diagnostic models and finds joint analysis more reliable.

Abstract

Digital learning environments record learners' responses to individual items, making it possible to study the development of specific skills rather than overall scores. Drawing conclusions about learning from these data requires a model that links responses to latent skills and tracks how mastery changes over time. When the skills measured by each item are unknown, the analyst must decide whether to estimate this structure, the Q-matrix, jointly with the learning process, or to establish it first and study learning afterwards. We show that this decision can change substantive conclusions about how learners develop. Using dynamic cognitive diagnostic models, we analyse data from two reading games measuring vocabulary and comprehension from Grade 2 to Grade 3, with item-text embeddings providing prior information for the unknown Q-matrix. A joint analysis and a bias-corrected stepwise analysis agree that most learners move toward mastering both skills, but disagree about how many remain only partially proficient at Grade 3, changing how reading progress would be reported. A simulation study identifies when the two analyses diverge and shows that joint analysis is more reliable when the item-skill structure is uncertain and the item pool changes between grades. We provide R code for both analyses.

2606.06797 2026-06-08 cs.CL New submission

Korean Culture into LLM Alignment: Toward Cultural Coherence

MinJae Jung, Minwoo Kim

Affiliations: SKT LG AI Research; Kanana Team

AI summary: For cultural alignment of large language models, proposes a constructive definition rather than only suppressing negative outputs: a prompt-based seed generator expands a Korean harm taxonomy, a safe-response policy is grounded in Korean law, social norms, and interpretive conventions, and DPO fine-tuning raises the Korean cultural safe rate without harming general capability.

Comments: Accepted to ICML 2026 Workshop on Culture X AI

Abstract

Cultural-aspect work on large language models is dominated by a negative target: which outputs to suppress. We argue that a constructive counterpart is also needed, a working definition of what a culturally coherent response is rather than only what it must avoid, and instantiate it for Korean. We design an alignment-data pipeline around a prompt-based LLM seed generator that expands a Korean harm taxonomy, with a Korean-culturally-adapted safe-response policy at its centre: a per-category guideline grounded in Korean legal frameworks, social norms, and interpretive conventions, against which three frontier models each produce a candidate response. DPO fine-tuning on the resulting triplets improves the Korean cultural safe rate across six open-weight LLMs while causing no large degradation on Korean general-capability benchmarks, and qualitative outputs show fine-tuned models naming Korean statutes and institutional procedures and, where appropriate, supplying constructive Korean-context information alongside refusal.

2606.06790 2026-06-08 cs.RO cs.LG cs.SY eess.SY New submission

Learning All-Terrain Locomotion for a Planetary Rover with Actively Articulated Suspension

Arthur Bouton, Tristan D. Hasseler, Michael Paton, Travis Brown, Jacob Levy, William Reid, Joshua Martin, Hari Nayar

Affiliations: Jet Propulsion Laboratory, California Institute of Technology; Center for Autonomy, University of Texas at Austin; Space Systems Laboratory, University of Maryland

AI summary: Proposes a four-wheeled planetary rover concept with an Active Gimbal Suspension and uses reinforcement learning to train a single neural network controller for autonomous obstacle negotiation and all-terrain locomotion, validated on the physical rover through policy consolidation and zero-shot transfer.

Comments: 21 pages, 26 figures

Abstract

This paper presents ERNEST, a four-wheeled planetary rover concept equipped with a two-degree-of-freedom Active Gimbal Suspension that combines yaw and roll actuation to enable wheel reconfiguration, steering, and active load redistribution. A single neural network controller, trained to track a desired path across challenging terrain, fully unlocks the capabilities of this actuated suspension system for autonomous obstacle negotiation. A reinforcement learning framework is developed using the high-fidelity DARTS simulation engine, which combines rigid-contact dynamics and Bekker-Wong terramechanics, enabling the emergence of locomotion strategies adapted to loose-soil conditions. To obtain a single unified controller across heterogeneous terrains, a policy consolidation strategy merges the experience of terrain-specialized agents into one neural network, eliminating the need for explicit terrain classification and controller switching. The resulting controller operates on a combination of proprioceptive and exteroceptive feedback, including sparse stereo-derived terrain elevation, chassis attitude, joint states, and force-torque measurements. Zero-shot transfer to the physical rover is achieved through domain randomization, sensor noise injection, and model-to-real system identification. Experimental results demonstrate autonomous traversal of rock fields, a bump trap, a wheel-high step, sand ripples, and sandy slopes. On a 20° sandy slope, the learned controller reduces the cost of transport by 37% on dry sand despite the additional actuation, and achieves superior performance on wet sand where the passive suspension becomes completely immobilized.
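
The cost-of-transport figure can be read against the standard dimensionless definition, energy divided by weight times distance. The Python sketch below assumes that definition; the abstract does not state the exact formula or gravity value the authors use, so the function names and the default `gravity` are illustrative.

```python
def cost_of_transport(energy_j, mass_kg, distance_m, gravity=9.81):
    """Dimensionless cost of transport: energy / (mass * gravity * distance)."""
    return energy_j / (mass_kg * gravity * distance_m)


def relative_reduction(baseline, improved):
    """Fractional reduction of `improved` relative to `baseline`."""
    return (baseline - improved) / baseline
```

For example, a controller whose cost of transport is 0.63 times that of the baseline corresponds to a relative reduction of 0.37, matching the 37% figure quoted for dry sand.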

2606.06788 2026-06-08 cs.CL cs.HC New submission

Explain Like I'm 5 or Whatever I Choose: Evaluating the Interactive Potential of Language Model Responses

Indu Panigrahi, Tal August

Affiliations: Siebel School of Computing and Data Science, University of Illinois Urbana-Champaign

AI summary: Proposes an interaction-oriented evaluation framework based on language complexity and tests the ability of GPT-5.1 and other models to generate responses at different complexity levels, finding that the best model shifts complexity in the correct direction only 46% of the time.

Comments: Preprint

Abstract

Evaluations of large language models (LLMs) in scientific information seeking tasks have become increasingly use-centric, such as conducting live or multi-turn evaluations with real users. These evaluations still assume a single, static chat interface, but as models are integrated into new interfaces, evaluations must shift to incorporate interface-specific criteria. We propose a new evaluation framework based on a formative study with $16$ participants that tests models' ability to generate multiple responses to one query that differ along an interpretable axis of language (language complexity), inspired by direct manipulation interfaces from human-centered design literature. We evaluate GPT-5.1, GPT-5 mini, Claude Sonnet 4.5 + Thinking, and DeepSeek-V3.1 by generating 5 responses at different levels of language complexity for $98$ scientific queries. While models vary complexity across responses, most changes remain inconsistent, with the best performing model (Claude Sonnet 4.5) only shifting reliable complexity measures in the correct direction $46\%$ of the time. Our findings hold with increased sample size and alternative complexity levels.

2606.06787 2026-06-08 cs.AI New submission

AdMem: Advanced Memory for Task-solving Agents

Runzhe Wang, Huilin Lu, Shengjie Liu, Li Dong, Jason Zhu

Affiliations: Princeton University; Amazon; Arm

AI summary: Proposes a unified, automatic memory framework that integrates semantic, episodic, and procedural memory, using a bi-level design and a multi-agent architecture for automatic memory generation, reward annotation, and adaptive retrieval, improving robustness and success rate on long-horizon multi-turn tasks.

Abstract

Large Language Models (LLMs) show promise as tool-using agents but remain limited in long-horizon tasks that require remembering, organizing, and reusing knowledge. Prior memory approaches aim to resolve the situation, but mainly focus on storing factual information. Recent work on procedural memory improves task reuse, yet often reduces to replaying past successes without addressing failure cases or online scalability. We introduce a unified and automatic memory framework that integrates semantic, episodic, and procedural memory in a bi-level design combining short-term and long-term stores. A multi-agent architecture with actor, memory, and critic agents enables automatic memory generation, reward annotation, and adaptive retrieval. Long-term memory is managed through reward-based evaluation, merging, and pruning, ensuring scalability and continual improvement. Experiments across various environments show that our approach improves robustness and success on long multi-turn tasks compared to existing baselines. This work highlights the importance of comprehensive, adaptive memory for advancing LLM-based agents.

2606.06786 2026-06-08 cs.LG cs.NI 新提交

Federated Foundation Models over Vehicular Networks

车辆网络中的联邦基础模型

Kasra Borazjani, Fardis Nadimi, Payam Abdisarabshali, Owen Palinski, Allan Salihovic, Dinh Nguyen, Minghui Liwang, Seyyedali Hosseinalipour

发表机构 * University of Waterloo(滑铁卢大学)

AI总结 本文提出将多模态多任务联邦基础模型(M3T FedFMs)集成到车辆网络中的愿景,结合基础模型的表达力与联邦学习的隐私保护分布式学习能力,并通过Waymo数据集案例验证其潜力。

Comments 8 pages, 4 figures

详情
AI中文摘要

本文提出了一种前瞻性的愿景,将新兴的多模态多任务联邦基础模型(M3T FedFMs)集成到车辆网络中,旨在统一多模态多任务基础模型(M3T FMs)的表达能力与联邦学习(FL)的隐私保护和分布式学习能力。鉴于这一研究方向在很大程度上尚未被探索,我们首先介绍了M3T FedFMs的基本训练/微调原理。然后,我们讨论了它们在车辆网络中的一系列代表性用例,展示了M3T FedFMs在实现下一代车辆智能方面的巨大潜力。之后,我们识别了车辆环境固有的关键约束,这些约束挑战了M3T FedFMs的实际部署,并阐述了一系列前瞻性的研究方向以应对这些挑战。此外,通过在真实车辆数据集(即Waymo开放数据集)上进行的案例研究,我们展示了M3T FedFMs在车辆网络中的前景,并发布了我们的实现以促进可重复性和刺激这一新兴领域的研究(仓库:https://github.com/KasraBorazjani/vehicular-fedfm)。

英文摘要

This paper presents a forward-looking vision for integrating the emerging multi-modal multi-task federated foundation models (M3T FedFMs) into vehicular networks, with the goal of unifying the expressive power of multi-modal multi-task foundation models (M3T FMs) with the privacy-preserving and distributed learning capabilities of federated learning (FL). Given the largely underexplored nature of this research direction, we first introduce the fundamental training/fine-tuning principles of M3T FedFMs. We then discuss a range of their representative use cases in vehicular networks, illustrating the significant potential of M3T FedFMs to enable next-generation vehicular intelligence. Afterwards, we identify key constraints inherent to vehicular environments that challenge the practical deployment of M3T FedFMs, and articulate a set of forward-looking research directions to address these challenges. Furthermore, through a case study conducted on a real-world vehicular dataset (i.e., Waymo Open Dataset), we demonstrate the promise of M3T FedFMs for vehicular networks and release our implementation to facilitate reproducibility and stimulate research in this emerging area (repository: https://github.com/KasraBorazjani/vehicular-fedfm).

2606.06781 2026-06-08 cs.CL 新提交

When Better Codebooks Are Not Enough: Predictive Performance and Behavioral Reliability in LLM Political Event Coding

当更好的代码手册还不够:LLM政治事件编码中的预测性能与行为可靠性

Zixian He, Bharath Raahul Murugesan, Patrick Brandt, Yibo Hu

发表机构 * Independent Researcher(独立研究者) Illinois Institute of Technology(伊利诺伊理工学院) The University of Texas at Dallas(德克萨斯大学达拉斯分校)

AI总结 本研究探讨在政治事件编码任务中,将专家代码手册优化为LLM友好形式能显著提升分类性能,但预测增益并未完全转化为行为可靠性,模型在代码手册变化下仍可能失效。

Comments 14 pages, 3 figures, 11 tables

详情
AI中文摘要

高准确率并不一定使LLM成为忠实的编码器。这个问题很重要,因为许多社会科学研究依赖专家编写的代码手册将文本转化为结构化数据。我们在政治事件编码中研究这个问题,这是一个具有挑战性的源-目标关系分类任务,超越了普通的句子级分类,模型必须使用详细的编码规则确定一个行为者对另一个行为者做了什么。我们测试了当专家代码手册被操作化为LLM友好形式(包含更清晰的定义、示例、检索上下文和困难案例规则)时是否变得更有效。然后,我们在标签名称、代码手册顺序和标签-定义映射的受控变化下评估行为可靠性。更清晰的代码手册显著提高了分类性能,尤其是对于细粒度事件分类。然而,这些预测增益并未完全转化为行为可靠性。模型可能产生有效的标签并恢复定义,但在代码手册的受控变化下仍未能通过行为可靠性测试。这些发现表明,代码手册引导的LLM系统不仅应根据准确性进行评估,还应根据它们是否保留了使编码输出对社会科学研究有意义的编码逻辑来评估。

英文摘要

High accuracy does not necessarily make an LLM a faithful coder. This issue matters because many social-science studies rely on expert-written codebooks to turn text into structured data. We study this problem in political event coding, a challenging source-target relation classification task beyond ordinary sentence-level classification, where models must determine what one actor did to another using detailed coding rules. We test whether expert codebooks become more effective when operationalized into LLM-friendly forms with clearer definitions, examples, retrieved context, and rules for difficult cases. We then evaluate behavioral reliability under controlled changes to label names, codebook order, and label-definition mappings. Clearer codebooks substantially improve classification performance, especially for fine-grained event classification. However, these predictive gains do not fully translate into behavioral reliability. Models may produce valid labels and recover definitions while still failing behavioral reliability tests under controlled codebook changes. These findings suggest that codebook-guided LLM systems should be evaluated not only by accuracy, but also by whether they preserve the coding logic that makes coded outputs meaningful for social-science research.

2606.06776 2026-06-08 cs.LG 新提交

A Rolling-Window Framework for Churn Prediction and Behavioral Driver Identification

一个用于流失预测和行为驱动因素识别的滚动窗口框架

Muhammad Jawad Mufti, Omar Hammad, Haitham Saleh, Muqaddas Gull

发表机构 * Information and Computer Science Department, King Fahd University of Petroleum and Minerals(国王法赫德石油和矿物大学信息与计算机科学系) Interdisciplinary Research Center for Smart Mobility and Logistics (IRC-SML), King Fahd University of Petroleum and Minerals(智能交通与物流交叉研究中心(IRC-SML),国王法赫德石油和矿物大学) SDAIA–KFUPM Joint Research Center for Artificial Intelligence, King Fahd University of Petroleum and Minerals(SDAIA-KFUPM人工智能联合研究中心,国王法赫德石油和矿物大学)

AI总结 提出基于滚动行为窗口的流失预测框架,在非契约服务环境中实现连续风险评估,特征模型准确率87.6%、ROC-AUC 0.94,序列模型召回率96.1%。

详情
AI中文摘要

客户流失预测是客户分析中的核心任务,特别是在非契约、按使用付费的服务环境中,脱离行为并非明确可观察,而必须从行为不活跃中推断出来。现有的流失预测方法通常依赖于简化的时间假设或客户行为的单点表示,这限制了它们支持持续风险评估、可解释性和随时间推移的现实部署的能力。本研究提出了一个时间显式的流失预测框架,该框架使用滚动行为窗口对客户行为进行建模,从而随着客户活动的演变实现重复和实例级别的流失风险估计。客户行为在固定的30天观察窗口内进行总结,随后是30天的未来流失评估窗口,确保行为证据与流失结果之间存在明确的时间分离。该框架在统一的时间设计内集成了基于特征和基于序列的学习方法。所提出的方法在来自非契约服务平台的大规模真实世界数据集上进行了评估。实证结果表明了强大且稳定的预测性能,基于特征的模型准确率达到87.6%,ROC-AUC为0.94,而基于序列的模型通过捕获时间脱离模式实现了高达96.1%的召回率。对未来未见数据的评估证实了在时间偏移下具有显著的鲁棒性,无需模型重新训练即可保持准确率高于83%和ROC-AUC超过0.91。总体而言,研究结果强调,精心设计的时间框架(而非仅模型复杂性)对于实现稳健、可解释且可部署的流失预测至关重要。本研究为动态服务环境中面向流失的决策支持提供了实用基础。

英文摘要

Customer churn prediction is a central task in customer analytics, particularly in non-contractual, pay-per-use service environments where disengagement is not explicitly observed and must be inferred from behavioral inactivity. Existing churn prediction approaches often rely on simplified temporal assumptions or single-point representations of customer behavior, which limit their ability to support continuous risk assessment, interpretability, and realistic deployment over time. This study proposes a temporally explicit churn prediction framework that models customer behavior using rolling behavioral windows, enabling repeated and instance-level churn risk estimation as customer activity evolves. Customer behavior is summarized within a fixed 30-day observation window, followed by a 30-day future churn evaluation window, ensuring a clear temporal separation between behavioral evidence and churn outcomes. The framework integrates feature-based and sequence-based learning approaches within a unified temporal design. The proposed approach is evaluated on a large-scale, real-world dataset from a non-contractual service platform. Empirical results demonstrate strong and stable predictive performance, with accuracy reaching 87.6% and ROC-AUC of 0.94 for the feature-based model, while the sequence-based model achieves recall as high as 96.1% by capturing temporal disengagement patterns. Evaluation on future unseen data confirms meaningful robustness under temporal shift, with accuracy remaining above 83% and ROC-AUC exceeding 0.91 without model retraining. Overall, the findings highlight that carefully designed temporal framing, rather than model complexity alone, is critical for achieving robust, interpretable, and deployment-ready churn prediction. The study provides a practical foundation for churn-oriented decision support in dynamic service environments.
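The rolling-window design described above can be sketched as follows. The 30-day observation and 30-day evaluation windows come from the abstract. The step size, the event-count feature, and the "no activity in the evaluation window" churn label are assumptions made for illustration only.

```python
from datetime import date, timedelta

OBS_DAYS = 30   # observation window length stated in the abstract
EVAL_DAYS = 30  # future churn-evaluation window length stated in the abstract


def rolling_instances(events, start, end, step_days=7):
    """Yield (customer, cutoff, n_observed_events, churned) tuples.

    events: dict mapping a customer id to a list of activity dates.
    step_days and the churn label rule are assumptions of this sketch.
    """
    cutoff = start + timedelta(days=OBS_DAYS)
    while cutoff + timedelta(days=EVAL_DAYS) <= end:
        obs_start = cutoff - timedelta(days=OBS_DAYS)
        eval_end = cutoff + timedelta(days=EVAL_DAYS)
        for customer, dates in events.items():
            # Features may only use activity strictly before the cutoff.
            observed = [d for d in dates if obs_start <= d < cutoff]
            if not observed:
                continue  # only score customers active in the observation window
            # The label only uses activity at or after the cutoff.
            future = [d for d in dates if cutoff <= d < eval_end]
            yield customer, cutoff, len(observed), len(future) == 0
        cutoff += timedelta(days=step_days)


# Hypothetical activity log for two customers.
events = {
    "a": [date(2024, 1, 5), date(2024, 1, 20), date(2024, 2, 10)],
    "b": [date(2024, 1, 10)],
}
rows = list(rolling_instances(events, date(2024, 1, 1), date(2024, 3, 1), step_days=30))
```

Because features and labels sit on opposite sides of the cutoff, each customer can be re-scored at every step without leaking future activity into the features.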

2606.06762 2026-06-08 cs.RO 新提交

Multi-Robot Planning and Control from CCTV Camera Networks in a Real Warehouse

基于真实仓库中闭路电视摄像机网络的多机器人规划与控制

Luke Robinson, Benjamin Ramtoula, Anas Izaaryene, Paul Newman, Daniele De Martini

发表机构 * Oxford Robotics Institute, University of Oxford, UK(英国牛津大学牛津机器人研究所) Robot Systems Group, Technical University of Munich, Germany(德国慕尼黑工业大学机器人系统组)

AI总结 提出仅利用分布式CCTV网络和边缘计算实现多机器人协调规划与控制的方法,在真实仓库中验证了四台机器人和30个摄像头的系统,首次实现仅依赖外部摄像头网络的现场多机器人协调。

详情
AI中文摘要

利用环境中嵌入的摄像头对移动机器人进行离车控制,通过将感知和计算移离机器人,为可扩展的自主性提供了一条实用路径。我们将这一思想从单机器人情况扩展到真实仓库中的协调车队,仅使用分布式CCTV网络和边缘计算驱动多个机器人。该系统完全在未校准的、基于像素的拓扑相机图的图像空间中运行,支持灵活相机放置下的大范围操作。分层规划器为每个机器人选择相机序列,并通过每个视图规划其图像空间运动,采用优先-联合策略协调机器人,将重叠的相机区域视为一次仅由一个机器人持有的共享资源,以防止碰撞和死锁。我们在一个真实仓库中验证了该方法,该仓库有四个机器人和30个摄像头,分布在六个27米长的过道中,报告了任务时间和协调统计数据。据我们所知,这是首次仅使用外部摄像头网络和离车计算进行多机器人规划和协调的现场演示,机器人未携带任何特定于任务的导航硬件。

英文摘要

Off-board control of mobile robots from cameras embedded in the environment offers a practical path to scalable autonomy, moving sensing and compute off the robots. We extend this idea from the single-robot case to coordinated fleets in a real warehouse, driving multiple robots with only a distributed CCTV network and edge compute. The system operates entirely in image space over an uncalibrated, pixel-wise topological camera graph, enabling wide-area operation with flexible camera placement. A hierarchical planner selects a camera sequence per robot and plans its image-space motion through each view, coordinating robots with a prioritised-then-joint strategy and treating overlapping camera regions as shared resources held by one robot at a time to prevent collisions and deadlocks. We validate the approach in a real warehouse with four robots and 30 cameras across six 27 m aisles, reporting mission times and coordination statistics. To our knowledge, this is the first field demonstration of multi-robot planning and coordination using only an external camera network and off-board compute, with robots carrying no task-specific navigation hardware.

2606.06761 2026-06-08 cs.RO cs.AI 新提交

AxisGuide: Grounding Robot Action Coordinate System in RGB Observations for Robust Visuomotor Manipulation

AxisGuide: 将机器人动作坐标系锚定于RGB观测以实现鲁棒的视觉运动操控

Jiyun Jang, Yujin Sung, Woosung Joung, Daewon Chae, Sangwon Lee, Sohwi Kim, Jinkyu Kim, Jungbeom Lee

发表机构 * Korea University(韩国大学) University of Michigan(密歇根大学) KT R&D Center(KT研发中心) Kakao Mobility(Kakao移动)

AI总结 针对视觉运动策略在分布偏移下动作执行失败的问题,提出AxisGuide方法,通过渲染机器人基座坐标系轴并叠加提示通道,增强动作坐标理解,显著提升泛化性能。

Comments Accepted to Robotics: Science and Systems (RSS) 2026

详情
AI中文摘要

通过大规模行为克隆训练的视觉运动操控策略已实现强大的语义场景理解,但在分布偏移下往往无法可靠地执行正确的低级动作。例如,即使在具有相同场景布局、相机视角和光照的简单拾取任务中,当物体被放置在未见过的位置时,性能也会大幅下降。我们认为这一差距源于动作理解不足,即无法在图像空间中解释机器人基座坐标系。为解决此问题,我们引入AxisGuide,一种轻量级引导方法,桥接语义场景理解和动作坐标解释。利用相机参数和末端执行器位姿,AxisGuide在每个相机视图中渲染机器人基座轴,并通过少量提示通道增强RGB观测,明确可视化图像空间中+x、+y和+z运动的含义。在LIBERO仿真和真实环境中的广泛评估表明,AxisGuide带来了显著的性能提升和更好的泛化能力,凸显了显式动作坐标提示对于学习可靠且可迁移的通用视觉运动策略的有效性。

英文摘要

Visuomotor manipulation policies trained via large-scale behavior cloning have achieved strong semantic scene understanding, yet often fail to reliably execute correct low-level actions under distribution shifts. For example, even in a simple pickup task with identical scene layouts, camera viewpoints, and illumination, performance can degrade substantially when the object is placed at unseen locations. We argue that this gap arises from insufficient action understanding, namely the inability to interpret the robot's base-frame action coordinate system in image space. To address this issue, we introduce AxisGuide, a lightweight guidance method that bridges semantic scene understanding and action-coordinate interpretation. Using camera parameters and end-effector poses, AxisGuide renders the robot base-frame axes in each camera view and augments RGB observations with a small set of cue channels that explicitly visualize the meaning of the +x, +y, and +z motions in image space. Extensive evaluations in both the LIBERO simulation and real-world environments demonstrate that AxisGuide yields substantial performance gains and improved generalization, highlighting the effectiveness of explicit action-coordinate cues for learning reliable and transferable generalist visuomotor policies.
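Rendering base-frame axes into a camera view amounts to projecting short 3D segments into pixel space. Below is a minimal sketch using a standard pinhole camera model. The function names, the choice to draw the axes from a given origin point (for instance the end-effector position), and the axis length are assumptions, not details taken from the paper.

```python
def project_point(point_base, rotation, translation, fx, fy, cx, cy):
    """Project a 3D point in the robot base frame to pixel coordinates.

    rotation (3x3 nested lists) and translation map base-frame coordinates
    into the camera frame: X_cam = R * X_base + t (pinhole model, no distortion).
    """
    x_cam = [
        sum(rotation[i][j] * point_base[j] for j in range(3)) + translation[i]
        for i in range(3)
    ]
    if x_cam[2] <= 0:
        return None  # point is behind the camera; nothing to draw
    return (fx * x_cam[0] / x_cam[2] + cx, fy * x_cam[1] / x_cam[2] + cy)


def axis_segments(origin_base, rotation, translation, intrinsics, length=0.1):
    """Return pixel-space segments for the +x, +y and +z base-frame directions."""
    fx, fy, cx, cy = intrinsics
    start = project_point(origin_base, rotation, translation, fx, fy, cx, cy)
    segments = {}
    for name, direction in (("+x", (1, 0, 0)), ("+y", (0, 1, 0)), ("+z", (0, 0, 1))):
        tip = [origin_base[k] + length * direction[k] for k in range(3)]
        end = project_point(tip, rotation, translation, fx, fy, cx, cy)
        segments[name] = (start, end)
    return segments


# Hypothetical camera aligned with the base frame, origin one metre in front of it.
identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
segments = axis_segments((0, 0, 1), identity, (0, 0, 0), (500, 500, 320, 240), length=0.5)
```

In this toy setup the +z segment collapses to a single pixel because the axis points straight along the optical axis, which illustrates why the image-space meaning of an action direction depends on the viewpoint.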

2606.06760 2026-06-08 cs.CV 新提交

MedSIGHT: Towards Grounded Visual Comprehension in Medical Large Vision-Language Models

MedSIGHT:迈向医学大型视觉语言模型中具备空间定位的视觉理解

Aofei Chang, Le Huang, Alex James Boyd, Parminder Bhatia, Taha Kass-Hout, Fenglong Ma, Cao Xiao

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出MedSIGHT框架,通过区域感知器、医学区域码本和渐进训练策略,统一医学视觉语言模型的语义理解和像素级分割,在72K数据上达到多模态理解与分割的SOTA。

Comments Accepted at ICML 2026

详情
AI中文摘要

医学大型视觉语言模型(Med-LVLMs)最近在视觉语言理解和医学图像分割方面取得了显著进展。然而,现有模型仍难以统一这两种能力,而这对于实现连接视觉发现与语义解释的临床推理至关重要。我们提出MedSIGHT,一个统一框架,赋予Med-LVLMs结构化的像素级理解能力,实现具备空间定位的视觉理解。MedSIGHT引入了一个新颖的区域感知器模块,生成以区域为中心的标记,将空间信息直接编码到语言模型的表示空间中。我们进一步将医学区域码本引入LLM词汇表,使模型能够生成离散的区域代码,作为解剖和病理区域的符号表示。这些代码通过区域感知器解码以重建分割掩码,实现端到端的空间定位。最后,MedSIGHT使用我们提出的渐进训练策略,将区域感知器、码本和LLM组合起来,逐步稳定地对齐这些模块。仅在72K多模态指令对上训练,MedSIGHT在多种成像模态的医学理解和分割任务上均达到了最先进的性能。

英文摘要

Medical large vision-language models (Med-LVLMs) have recently achieved remarkable progress in vision-language comprehension and medical image segmentation. However, existing models still struggle to unify these two capabilities, which is essential for achieving clinical reasoning that connects visual findings with semantic interpretation. We present MedSIGHT, a unified framework that equips Med-LVLMs with structured, pixel-level understanding for grounded visual comprehension. MedSIGHT introduces a novel Region Perceiver module that produces region-centric tokens, encoding spatial information directly into the representation space of the language model. We further introduce a medical region codebook into the LLM vocabulary, allowing the model to generate discrete region codes as symbolic representations of anatomical and pathological regions. These codes are decoded through the Region Perceiver to reconstruct segmentation masks, achieving end-to-end spatial grounding. Lastly, MedSIGHT combines the Region Perceiver, the codebook, and the LLM using our proposed progressive training strategy, which gradually and stably aligns these modules. Trained on only 72K multimodal instruction pairs, MedSIGHT achieves state-of-the-art performance across diverse imaging modalities on both medical comprehension and segmentation tasks.

2606.06755 2026-06-08 cs.CL cs.ET 新提交

PromptPrint: Behavioral Biometrics Through Natural Language Prompting in LLMs

PromptPrint: 通过自然语言提示在LLMs中的行为生物特征

Shaiv Patel, Kartik Narayan, Vishal Patel

发表机构 * Johns Hopkins University(约翰霍普金斯大学)

AI总结 提出PromptPrint,研究用户与LLM交互的简短提示是否包含可识别的行为生物特征,通过词汇、句法和话语模式分析,发现词汇稳定性假设成立,但存在唯一性-一致性悖论,且身份信号对语义改写脆弱。

Comments 10 pages, 6 figures

详情
AI中文摘要

作者归属研究传统上关注长篇、表达性文本;然而,与大型语言模型(LLM)的交互通常是简短且任务驱动的提示。这引发了一个基本问题:这样的提示是否包含稳定、可识别作者且独特的信号?我们引入了PromptPrint,一项对基于提示的身份的系统研究,假设用户的习惯性词汇、句法和话语模式形成可学习的行为生物特征。使用来自1,034名用户的20,680个真实提示,我们建立了三个关键发现。首先,词汇表示显著优于语义编码器,支持“词汇稳定性假设”:身份主要编码在表面层面的词汇选择中,而非抽象意图。其次,风格特征表现出“唯一性-一致性悖论”:用户在整个群体中高度独特,但在不同上下文中行为不一致。第三,对抗性分析揭示了一个清晰的脆弱性谱:身份信号对微小的词汇扰动具有鲁棒性,但在语义改写下显著退化。总体而言,我们的结果展示了大规模下的强识别性能,确立了基于提示的身份作为一种可行的行为生物特征。这项工作为LLM交互中的用户建模引入了新视角,对安全和隐私具有重要意义。数据和代码将在我们的工作被接受后发布。

英文摘要

Authorship attribution research has traditionally focused on long-form, expressive texts; however, interactions with large language models (LLMs) are typically brief and task-driven prompts. This raises a fundamental question: do such prompts contain a stable, author-identifiable, and distinctive signal? We introduce PromptPrint, a systematic study of prompt-based identity, the hypothesis that a user's habitual vocabulary, syntax, and discourse patterns form a learnable behavioral biometric. Using 20,680 real prompts from 1,034 users, we establish three key findings. First, lexical representations significantly outperform semantic encoders, supporting the "lexical stability hypothesis": identity is primarily encoded in surface-level word choice rather than abstract intent. Second, stylometric features exhibit a "uniqueness-consistency paradox": users are highly distinctive across the population, yet behaviorally inconsistent across contexts. Third, adversarial analysis reveals a clear vulnerability spectrum: identity signals are robust to minor lexical perturbations but degrade substantially under semantic paraphrasing. Overall, our results demonstrate strong identification performance at scale, establishing prompt-based identity as a viable behavioral biometric. This work introduces a new perspective on user modeling in LLM interactions, with important implications for security and privacy. Data and code will be released upon the acceptance of our work.

2606.06746 2026-06-08 cs.LG 新提交

Performance Variation in Deep Reinforcement Learning

深度强化学习中的性能变化

Haruto Tanaka, A. Rupam Mahmood

发表机构 * Department of Computing Science, University of Alberta(阿尔伯塔大学计算科学系) Alberta Machine Intelligence Institute (Amii)(阿尔伯塔机器智能研究所(Amii)) CIFAR AI Chair(CIFAR人工智能讲席)

AI总结 针对深度强化学习算法运行间鲁棒性低的问题,提出基于百分位的统计量(min-max IPR)和可视化方法(逐运行百分位高亮),用于评估性能变化,并通过三个案例验证其有效性。

详情
AI中文摘要

深度强化学习(RL)算法通常存在运行间鲁棒性低的问题,表现为相同配置的智能体在独立运行中性能显著变化。尽管这一问题在研究和实践中带来一系列挑战,但开发评估方法的研究相对较少;相反,RL研究通常报告估计平均性能的不确定性。在本文中,我们概述了传统不确定性和变化估计的局限性,特别是它们与目的的不匹配以及低估风险。然后,我们提出了一种基于百分位的替代统计量和可视化方法,分别称为min-max IPR和逐运行百分位高亮。这些基于百分位的工具易于解释,并依赖于样本百分位的标准性质,提供了关于运行间性能变化的丰富信息。我们通过三个案例研究证明了这一点。首先,我们表明LayerNorm和倒数第二层归一化缩小了PPO的性能变化,而SAC的变化基本不变。其次,我们比较了PPO、SAC、TD-MPC和TD-MPC2,并表明TD-MPC在四个算法中变化最小,同时数据效率最高。最后,在五个Atari环境上比较DQN和Rainbow时,我们表明两种算法表现出相似水平的性能变化。

英文摘要

Deep reinforcement learning (RL) algorithms often suffer from low run-to-run robustness, manifesting as significant performance variation across independent runs of identically configured agents. Although this issue poses a spectrum of challenges across research and practice, relatively few studies develop methods to evaluate it; RL research instead often reports uncertainty in the estimated mean performance. In this paper, we outline the limitations of conventional uncertainty and variation estimates, particularly their misalignment with purpose and the risk of underreporting. We then propose an alternative percentile-based statistic and visualization method, min-max IPR and run-wise percentile highlighting, respectively. These percentile-based tools are easy to interpret and rely on standard properties of sample percentiles, providing rich information about run-to-run performance variation. We demonstrate this through three case studies. First, we show that LayerNorm and penultimate-layer normalizations narrow performance variation in PPO, whereas the variation is mostly unchanged in SAC. Second, we compare PPO, SAC, TD-MPC, and TD-MPC2, and show TD-MPC exhibits the least variation while being the most data efficient among the four. Finally, in a comparison of DQN and Rainbow on five Atari environments, we show that both algorithms exhibit similar levels of performance variation.

2606.06745 2026-06-08 cs.CL 新提交

When to Think Deeply: Inhibitory Deliberation for LLM Reasoning

何时深度思考:用于LLM推理的抑制性深思

Zhixuan He, Yue Feng

发表机构 * University of Birmingham, United Kingdom(英国伯明翰大学)

AI总结 提出IDPR框架,通过抑制控制器根据快速答案决定是否启动慢速推理,在数学推理测试集上仅调用8.20%的慢速推理,准确率从47.90%提升至48.92%。

详情
AI中文摘要

推理型大语言模型可以通过深思推理提高问题求解性能,但对每个输入都调用慢速推理在计算上昂贵且往往不必要。我们提出IDPR,一个响应条件抑制性深思框架。IDPR首先生成一个简洁的直观答案,然后使用抑制控制器决定该特定响应是否应被释放或抑制以支持慢速推理。与仅输入路由器不同,抑制控制器以快速答案和快速侧证据为条件,包括置信度、logit边际、可解析性和生成成本。我们从配对的快速-慢速结果中训练控制器,并在准确率优先的慢速调用预算下,在保留验证集上选择抑制阈值。在一个保留的5000示例数学推理测试集上,IDPR仅对8.20%的示例调用慢速推理,并将准确率从47.90%提升至48.92%。在相同的慢速调用预算下,随机路由将准确率降至46.76%,而最强的基于置信度的基线达到48.22%。IDPR还实现了最高的纠正精度,表明响应条件抑制能更好地识别受益于慢速推理的快速答案。

英文摘要

Reasoning Large Language Models can improve problem-solving performance through deliberative inference, but invoking slow reasoning for every input is computationally expensive and often unnecessary. We propose IDPR, a framework for response-conditioned inhibitory deliberation. IDPR first generates a concise intuitive answer and then uses an inhibition controller to decide whether that specific response should be released or suppressed in favor of slow reasoning. Unlike input-only routers, the inhibition controller conditions on the fast answer and fast-side evidence, including confidence, logit margin, parseability, and generation cost. We train the controller from paired fast-slow outcomes and select the inhibition threshold on a held-out validation set under an accuracy-first slow-call budget. On a held-out 5,000-example mathematical reasoning test set, IDPR invokes slow reasoning on only 8.20% of examples and improves accuracy from 47.90% to 48.92%. Under the same slow-call budget, random routing decreases accuracy to 46.76%, while the strongest confidence-based baseline reaches 48.22%. IDPR also achieves the highest corrective precision, showing that response-conditioned inhibition better identifies fast answers that benefit from slow reasoning.
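The threshold-selection step described above can be sketched as a small search over validation examples. This is an illustrative reading, assuming each validation example carries a controller score plus the correctness of the paired fast and slow answers. The function names and the tie-breaking rule are assumptions, not the authors' procedure.

```python
def routed_accuracy(examples, threshold):
    """Return (accuracy, slow_call_rate) when scores >= threshold trigger slow reasoning.

    examples: list of (inhibit_score, fast_correct, slow_correct) tuples.
    """
    slow_calls = 0
    correct = 0
    for score, fast_ok, slow_ok in examples:
        if score >= threshold:
            slow_calls += 1      # fast answer suppressed in favour of slow reasoning
            correct += slow_ok
        else:
            correct += fast_ok   # fast answer released as-is
    n = len(examples)
    return correct / n, slow_calls / n


def select_threshold(examples, budget):
    """Pick the threshold with the best accuracy whose slow-call rate stays within budget."""
    never_inhibit = float("inf")
    best = (never_inhibit, *routed_accuracy(examples, never_inhibit))
    for threshold in sorted({score for score, _, _ in examples}, reverse=True):
        accuracy, rate = routed_accuracy(examples, threshold)
        if rate <= budget and accuracy > best[1]:
            best = (threshold, accuracy, rate)
    return best


# Hypothetical validation outcomes: (controller score, fast correct, slow correct).
validation = [
    (0.9, False, True),
    (0.8, False, False),
    (0.6, True, True),
    (0.3, True, False),
    (0.1, True, True),
]
threshold, accuracy, slow_rate = select_threshold(validation, budget=0.4)
```

In the toy data, inhibiting only the highest-scoring example lifts accuracy from 0.6 to 0.8 at a 20% slow-call rate. Lowering the threshold further spends more budget without fixing any additional answers, so the stricter threshold is kept.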

2606.06743 2026-06-08 cs.SD cs.AI cs.CL 新提交

HybridCodec: Fast Dual-Stream, Semantically Enhanced Neural Audio Codec

HybridCodec: 快速双流、语义增强的神经音频编解码器

Arjun Gangwar, S Umesh

发表机构 * Indian Institute of Technology, Madras(印度理工学院马德拉斯分校)

AI总结 提出HybridCodec,一种结合语义蒸馏与双流架构的统一神经音频编解码器,实现强解耦、跨语言鲁棒性及3倍速度提升。

Comments 5 pages, 5 tables, 1 figure, Accepted at Interspeech 2026

详情
AI中文摘要

随着多模态大语言模型的出现,神经音频编解码器作为语音分词器的流行度激增。具有语义和声学解耦的新编解码器架构已经出现。将语义信息引入编解码器模型有两种主要方法:一种是从SSL表示中将语义信息蒸馏到第一个RVQ层,另一种是维护语义和声学特征的独立流。我们提出HybridCodec,一种结合了两种范式的统一架构。它采用独立的语义和声学分枝,同时将SSL表示蒸馏到语义流中。这种设计确保了强解耦,而无需在推理期间使用SSL模型。HybridCodec在域内测试集上展示了优越的语义特化(RVQ-1)和有竞争力的重建(RVQ-all)。我们展示了其在域外和零样本跨语言设置中的鲁棒性,相比现有双流模型实现了3倍加速。

英文摘要

The popularity of neural audio codecs as speech tokenizers has surged with the advent of Multimodal Large Language Models. New codec architectures with semantic and acoustic disentanglement have emerged. There are two main approaches to introduce semantic information into codec models: one distills semantic information from SSL representations into the first RVQ layer, while the other maintains separate streams for semantic and acoustic features. We propose HybridCodec, a unified architecture that combines both paradigms. It employs separate semantic and acoustic branches while distilling SSL representations into the semantic stream. This design ensures strong disentanglement without requiring an SSL model during inference. HybridCodec shows superior semantic specialization (RVQ-1) on the in-domain test set and competitive reconstruction (RVQ-all). We demonstrate its robustness in out-of-domain and zero-shot cross-lingual settings, achieving a 3x speedup over existing dual-stream models.

2606.06741 2026-06-08 cs.AI cs.CL cs.LG 新提交

OpenSkill: Open-World Self-Evolution for LLM Agents

OpenSkill: 面向LLM智能体的开放世界自我进化

Zhiling Yan, Dingjie Song, Hanrong Zhang, Wei Liang, Yuxuan Zhang, Yutong Dai, Lifang He, Philip S. Yu, Ran Xu, Xiang Li, Lichao Sun

发表机构 * Lehigh University(理海大学) University of Illinois Chicago(伊利诺伊大学芝加哥分校) University of British Columbia(不列颠哥伦比亚大学) Vector Institute(向量研究所) Salesforce AI Research(Salesforce人工智能研究院) Massachusetts General Hospital and Harvard Medical School(麻省总医院和哈佛医学院)

AI总结 提出OpenSkill框架,使智能体在无目标任务监督下,利用开放世界资源自举构建技能和验证信号,实现自我进化,在多个基准上取得最佳自动通过率。

Comments 20 pages, 4 figures and 8 tables. Code is available at https://github.com/OpenLAIR/OpenSkill

详情
AI中文摘要

自我进化智能体需要在部署后进行适应,但现有方法假设存在可用的学习循环,例如精心策划的技能、成功的轨迹或验证信号。真实的开放世界部署可能不提供这些,只提供一个任务提示。在这项工作中,我们研究开放世界自我进化,其中智能体必须从零开始构建其技能和自身的验证信号,使用开放世界资源但没有目标任务监督。我们提出OpenSkill,一个启动这个循环的框架:它从文档、代码库和网络中获取基础知识和验证锚点,将它们综合成可迁移的技能,并根据自建的虚拟任务(基于锚点而非目标答案)来优化这些技能。因此,开放世界既提供了要学习的知识,也提供了一个独立于监督的练习环境,目标任务监督保留用于最终评估。在三个基准和两个目标智能体上,OpenSkill在满足无监督约束的同时取得了最佳自动通过率。分析表明,其技能无需特定模型适应即可跨模型迁移,并且其自建验证器与真实结果一致,尽管从未访问过这些结果。

英文摘要

Self-evolving agents require adaptation after deployment, but existing approaches assume a usable learning loop, such as curated skills, successful trajectories, or verifier signals. Real open-world deployments may provide none of these, offering only a task prompt. In this work, we study open-world self-evolution, where an agent must build both its skills and its own verification signals from scratch, using open-world resources but no target-task supervision. We propose OpenSkill, a framework that bootstraps this loop: it acquires grounded knowledge and verification anchors from documentation, repositories, and the web, synthesizes them into transferable skills, and refines those skills against self-built virtual tasks grounded in the anchors rather than in target answers. The open world thus supplies both the knowledge to be learned and a supervision-independent practice environment, with target-task supervision reserved for final evaluation. Across three benchmarks and two target agents, OpenSkill attains the best automated pass rate while satisfying the no-supervision constraint. Analysis shows its skills transfer across models without model-specific adaptation, and its self-built verifier aligns with ground-truth outcomes despite never accessing them.

2606.06740 2026-06-08 cs.SD cs.AI cs.CL 新提交

Multilingual Multi-Speaker Unit Vocoders: A Systematic Analysis of Discrete Speech Representations

多语言多说话人单元声码器:离散语音表示的系统分析

Naman Kothari, Arjun Gangwar, Adarsh Arigala, S Umesh

发表机构 * National Institute of Technology, Trichy(印度Trichy国家理工学院) Indian Institute of Technology, Madras(印度Madras理工学院)

AI总结 分析基于BigVGAN的单元声码器在多语言多说话人语音生成中的表现,发现聚类大小控制可懂度,显式说话人条件防止身份崩溃,语言监督在低聚类大小时有益。

Comments 5 pages, 5 tables, 1 figure, Accepted at Interspeech 2026

详情
AI中文摘要

通过k-means聚类自监督嵌入获得的离散语音单元纠缠了音素、说话人和语言信息,导致多语言多说话人语音生成中的说话人混合和跨语言干扰。尽管在音频大语言模型和语音到语音系统中使用日益增多,单元声码器仍然研究不足。我们分析了基于BigVGAN的单元声码器,涵盖四种印度语言。我们使用WER、说话人相似度和单元级指标研究了聚类大小与条件策略之间的相互作用。结果表明,聚类大小通过提高音素区分性来控制可懂度,而显式说话人条件对于防止身份崩溃不可或缺。语言监督主要在单元仍模糊的较小聚类大小时带来进一步收益。我们的分析显示,在单元库较小时,不同语言中相似音素会坍缩到相同的聚类ID,而较大的单元库会逐渐将它们分离。

英文摘要

Discrete speech units obtained via k-means clustering of self-supervised embeddings entangle phonetic, speaker, and language information, causing speaker mixing and cross-lingual interference in multilingual multi-speaker speech generation. Despite growing use in Audio LLMs and speech-to-speech systems, unit vocoders remain underexplored. We analyze a BigVGAN-based unit vocoder across four Indian languages. We study the interaction between cluster size and conditioning strategies using WER, speaker similarity, and unit-level metrics. Results show that cluster size governs intelligibility by improving phonetic discriminability, while explicit speaker conditioning is indispensable for preventing identity collapse. Language supervision yields further gains mainly at lower cluster sizes where units remain ambiguous. Our analysis shows that similar phonemes across languages collapse to the same cluster IDs at smaller inventories, with larger inventories progressively separating them.

2606.06738 2026-06-08 cs.CL 新提交

Modular Monolingual Adaptation using Pretrained Language Models

使用预训练语言模型的模块化单语适应

Nalin Kumar, Ondřej Dušek

发表机构 * Charles University, Faculty of Mathematics and Physics Institute of Formal and Applied Linguistics(查尔斯大学数学与物理系形式与应用语言学研究所)

AI总结 提出一种模块化方法,通过替换标记、冻结对应嵌入并调整模型其余部分,在低资源语言上提升NLU任务性能,优于全模型微调。

Comments Accepted to ACL 2026 Industry Track

详情
AI中文摘要

为低资源语言构建单语语言模型通常依赖于通过在整个模型上对目标语言进行微调来适应预训练语言模型。这种方法比从头开始训练更受欢迎,因为它能够实现有效的知识迁移。此外,先前的工作表明,使用特定于语言的标记器可以增强适应性。在这项工作中,我们假设全模型调优通常是不必要的,并提出了一种更模块化的方法。具体来说,我们替换标记,冻结相应的嵌入,并调整模型的其余部分。我们在苏格兰盖尔语、爱尔兰语和克丘亚语上进行实验,其中克丘亚语是一种非常低资源的语言(8.5k训练实例)。在自然语言理解任务——掩码填充、命名实体识别和词性标注上的评估表明,我们提出的方法在将模型适应低资源语言时提高了性能。此外,我们提供了对训练策略有效性、预训练嵌入选择和模型的全面分析。

英文摘要

Building monolingual language models (LMs) for low-resource languages typically relies on adapting pretrained language models (PLMs) by finetuning the whole model on the target language. This approach is widely favored over training from scratch, as it enables effective knowledge transfer. Additionally, prior work has shown that using a language-specific tokenizer can enhance the adaptability. In this work, we hypothesize that full model tuning is often unnecessary and propose a more modular approach. Specifically, we replace the tokens, freeze the corresponding embeddings, and tune the rest of the model. We use Scottish Gaelic, Irish, and Quechua for our experiments, with Quechua being a very low-resource language (8.5k training instances). Evaluation on natural language understanding (NLU) tasks -- mask filling, NER, and POS -- shows that our proposed approach improves performance when adapting models to low-resource languages. Additionally, we provide a comprehensive analysis of the effectiveness of training strategies, the choice of pretrained embeddings, and models.

2606.06727 2026-06-08 cs.RO cs.SY eess.SY 新提交

IDDMBSE: Integrating Data-Driven and Model-Based Systems Engineering for Trusted Autonomous Cyber-Physical Systems

IDDMBSE:集成数据驱动和基于模型的系统工程用于可信自主网络物理系统

John S. Baras, Sai Sandeep Damera, Ryan Matheu, Clinton Enwerem, Praveen M. S. Kumar

发表机构 * Institute for Systems Research, University of Maryland, College Park(系统研究所,马里兰大学,College Park)

AI总结 提出IDDMBSE方法,将MBSE V流程与数据驱动循环结合,通过开源工具链PERFECT、TRADES-X和VERITAS实现,在自主地面机器人全生命周期验证其有效性。

Comments 9 pages, 11 figures. This work has been submitted to the IEEE for possible publication

详情
AI中文摘要

自主网络物理系统(CPS)处于基于模型的系统工程(MBSE)和数据驱动的机器学习与人工智能(ML/AI)的交汇点,但目前尚无一种集成的系统工程(SE)方法论能够原生地覆盖两者。我们通过IDDMBSE(一种集成的数据驱动和基于模型的系统工程方法论)来填补这一空白,该方法将严格的MBSE V流程扩展为每一步都包含数据驱动循环,并基于SysML、自主栈以及混合模型驱动加数据驱动的权衡架构。我们将IDDMBSE实例化为一个可互操作的开源工具链:PERFECT,它将SysML系统架构映射到可执行的ROS自主栈,用于可扩展的性能评估;TRADES-X,它将设计空间探索分解为基于模型的优化阶段和随后的数据驱动评估阶段;以及VERITAS,它将形式化验证、数据驱动验证和运行时验证结合到一个统一的保证工作流中。我们在一个可信自主地面机器人的全开发生命周期中演示了IDDMBSE,涵盖传感器套件选择、风险敏感路径规划、行为树任务验证、基于共形预测的鲁棒感知以及有保证的多机器人协调,所有这些都在一个我们随工具链一起发布的、具有争议地形的Isaac Sim测试场中进行了演练。最后,我们概述了IDDMBSE如何在SysML v2 / KerML基础上重新构建,以实现语言原生的可组合性和更紧密的ML/AI集成。

英文摘要

Autonomous cyber-physical systems (CPS) sit at the intersection of Model-Based Systems Engineering (MBSE) and data-driven Machine Learning and Artificial Intelligence (ML/AI), yet no integrated Systems Engineering (SE) methodology natively spans both. We address this gap with IDDMBSE, an Integrated Data-Driven and Model-Based Systems Engineering methodology that extends the rigorous MBSE V-process with a data-driven loop at every step, anchored in SysML, the autonomy stack, and a hybrid model-based plus data-driven trade-off architecture. We instantiate IDDMBSE as an interoperable, open-source tool chain: PERFECT, which maps SysML system architectures to executable ROS autonomy stacks for scalable performance evaluation; TRADES-X, which decomposes design-space exploration into a model-based optimization stage followed by a data-driven evaluation stage; and VERITAS, which combines formal, data-driven, and runtime verification into a single assurance workflow. We demonstrate IDDMBSE on a Trusted Autonomous Ground Robot across its development lifecycle, spanning sensor-suite selection, risk-sensitive path planning, behavior-tree task verification, conformal-prediction-based robust perception, and assured multi-robot coordination, all exercised in a contested-terrain Isaac Sim test range that we release with the tool chain. We close by sketching how IDDMBSE is being re-formulated on SysML v2 / KerML foundations to enable language-native composability and tighter ML/AI integration.

2606.06724 2026-06-08 cs.LG 新提交

Synthics: Synthetic Physics-like Datasets for Machine Learning

Synthics: 用于机器学习的类物理合成数据集

Jari Vepsäläinen

Affiliations * Aalto University

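AI summary Proposes generating structurally similar synthetic regression datasets from an equation corpus using a Bayesian probabilistic context-free grammar, with the physically valid input domain determined by non-intrusive probing. Statistical validation shows that the synthetic data has better structural fidelity than a purely probabilistic approach, and in downstream hyperparameter tuning it performs close to real data.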

Details
AI Chinese abstract

代表性数据是机器学习的基础,因为有限的数据会阻碍泛化。收集足够的真实样本通常是不可行的。合成数据生成提供了一种实用的解决方案,但前提是生成的数据能忠实反映真实观测的结构。本文提出了一种方法,用于生成结构上类似于给定方程语料库中物理方程的合成回归数据集。该方法使用贝叶斯概率上下文无关文法来捕捉语料库的底层代数结构,并从中采样新方程。为了确保生成的输入位于物理有意义的域内,通过非侵入式探测为每个方程表征适用域,同时恢复变量间约束。输入采样进一步通过从有效域的随机子范围中混合均匀分布和截断正态分布进行抽取,模拟真实的实验条件。使用Kolmogorov-Smirnov检验对Feynman方程语料库进行统计验证。生成的方程在全部八个研究的结构特征上与语料库匹配,而未经平滑的纯概率文法仅匹配两个特征,表明贝叶斯先验对于在给定语料库规模下实现结构保真度至关重要。在下游超参数调优任务中,基于合成数据调优的梯度提升回归器在真实数据上平均选出20个配置中的第6优配置,与基于真实数据调优的结果相当,并显著优于随机表达式树(第10优)和噪声(第19优)。

English abstract

Representative data is fundamental in machine learning, as limited data hinders generalisation. Collecting sufficient real-world samples is often infeasible. Synthetic data generation offers a practical solution, but only if the generated data faithfully reflects the structure of real observations. In this paper, a method for generating synthetic regression datasets that structurally resemble physics equations from a given equation corpus is presented. The approach uses a Bayesian Probabilistic Context-Free Grammar to capture the underlying algebraic structure of the corpus, from which novel equations are sampled. To ensure the generated inputs lie within a physically meaningful domain, the applicability domain is characterised for each equation through non-intrusive probing, also recovering inter-variable constraints. Input sampling further mimics realistic experimental conditions by drawing from random sub-ranges of the valid domain with mixed uniform and truncated normal distributions. The generated data is statistically validated against the Feynman equation corpus using Kolmogorov-Smirnov tests. The generated equations match the corpus on all of the eight studied structural features, compared to only two for an unsmoothed purely probabilistic grammar, demonstrating that the Bayesian prior is essential for structural fidelity given the size of the corpus. In a downstream hyperparameter-tuning task, a gradient-boosted regressor tuned on the synthetic data picks, on average, the 6th-best configuration out of 20 on real data, matching the result of tuning on real data itself and substantially outperforming random expression trees (10th) and noise (19th).
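
To make the grammar-sampling idea in the abstract concrete, the following minimal Python sketch samples expressions from a toy probabilistic context-free grammar whose rule probabilities are smoothed with a symmetric Dirichlet prior. Everything in it is an assumption made for illustration: the grammar `RULES`, the counts in `COUNTS`, the `alpha` parameter, and the function names are hypothetical, and the abstract does not specify the prior, grammar, or corpus statistics the paper actually uses.

```python
import random

# Hypothetical toy grammar: each nonterminal maps to its candidate productions.
RULES = {
    "E": [
        ("(", "E", "+", "E", ")"),
        ("(", "E", "*", "E", ")"),
        ("(", "E", "/", "E", ")"),
        ("sqrt(", "E", ")"),
        ("V",),
    ],
    "V": [("x0",), ("x1",), ("x2",)],
}

# Hypothetical production counts, as if tallied from a small equation corpus.
# Zero counts mark rules that the corpus happens not to contain.
COUNTS = {
    "E": [3, 10, 4, 0, 20],
    "V": [8, 7, 0],
}


def posterior_probs(counts, alpha=1.0):
    """Posterior-mean rule probabilities under a symmetric Dirichlet(alpha) prior."""
    total = sum(counts) + alpha * len(counts)
    return [(c + alpha) / total for c in counts]


def sample(symbol, rng, probs, depth=0, max_depth=4):
    """Recursively expand a symbol into a flat list of terminal tokens."""
    if symbol not in RULES:
        return [symbol]  # terminal token, emitted as-is
    if symbol == "E" and depth >= max_depth:
        production = ("V",)  # force termination at the depth limit
    else:
        production = rng.choices(RULES[symbol], weights=probs[symbol])[0]
    tokens = []
    for child in production:
        tokens.extend(sample(child, rng, probs, depth + 1, max_depth))
    return tokens


def sample_equation(seed, alpha=1.0):
    """Sample one expression string; alpha=0 gives the unsmoothed grammar."""
    rng = random.Random(seed)
    probs = {nt: posterior_probs(c, alpha) for nt, c in COUNTS.items()}
    return " ".join(sample("E", rng, probs))
```

Calling `sample_equation(seed=0)` returns one sampled expression as a string. With `alpha=0.0` the unseen rules (`sqrt(` and `x2` in this toy setup) keep zero probability, whereas any positive `alpha` lets them appear occasionally, which is one simple way a prior can compensate for a small corpus.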