arXivDaily: arXiv daily academic digest, updated Monday to Friday
2606.14159 2026-06-15 cs.LG q-bio.BM New submission

Curvature-Guided Geometric Representation for Protein-Ligand Binding Affinity Prediction

Shuai Li, Chuan-Xian Ren, Yuhao Li, Ziqi Huang, Yue Pan, Mingzhe Tang, Hong Yan

Affiliations: School of Mathematics, Sun Yat-sen University; Department of Electrical Engineering, City University of Hong Kong

AI summary: Proposes the RicciBind framework, which uses Ricci curvature to capture local interaction tightness and optimal transport for cross-domain alignment, improving the accuracy and interpretability of binding affinity prediction.

Abstract

Protein-ligand binding affinity (PLA) prediction is critical in drug discovery. Despite the notable advancements in machine learning-based approaches, existing methods struggle to jointly characterize local geometric organization and globally coordinated cross-molecular interactions, limiting their ability to model complex binding mechanisms. Here, we propose RicciBind, a geometric representation framework that integrates curvature-guided hierarchical structure learning with optimal transport (OT)-based cross-domain alignment to model molecular interactions. Specifically, RicciBind leverages Ricci curvature to capture local interaction tightness within molecular structures, enhancing structural awareness and organizing atomic interactions into curvature-aware hierarchical representations. An OT-based cluster matching mechanism then aligns protein and ligand clusters across heterogeneous domains under geometric constraints, enabling globally consistent correspondences and revealing higher-order interaction patterns beyond local neighborhoods. By coupling curvature-guided structure encoding with OT-driven cross-domain alignment, RicciBind effectively models complex interaction semantics and substantially improves both the accuracy and interpretability of binding affinity prediction. Extensive experiments demonstrate that RicciBind achieved superior predictive performance and generalization across PLA benchmarks and virtual screening tasks. Ablation studies further confirmed the essential role of Ricci curvature in enhancing molecular interaction representations.
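The abstract above does not say which discretization of Ricci curvature RicciBind uses. As a minimal sketch of what an edge-level curvature on a molecular graph can look like, the block below assumes the simplest combinatorial Forman-Ricci form for an unweighted graph, F(u, v) = 4 - deg(u) - deg(v). The toy graph and the function name are hypothetical.

```python
from collections import defaultdict

def forman_curvature(edges):
    """Return {edge: 4 - deg(u) - deg(v)} for an unweighted, undirected graph.

    Assumption: plain combinatorial Forman-Ricci curvature, not necessarily
    the curvature definition used by RicciBind.
    """
    degree = defaultdict(int)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return {(u, v): 4 - degree[u] - degree[v] for u, v in edges}

# Hypothetical toy "molecule": a triangle a-b-c with a pendant atom d on c.
toy_edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
curvature = forman_curvature(toy_edges)
```

Under this definition, edges attached to high-degree atoms receive more negative values, so the curvature can act as a per-edge feature or as a criterion for grouping atoms into clusters.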

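2606.14157 2026-06-15 cs.LG cs.AI New submission

Learning Urban Access Costs from Origin-Destination Flows via Inverse Optimal Transport

Paula Joy B. Martinez

Affiliations: GitHub

AI summary: Proposes inverse optimal transport models that recover latent choice costs from school-to-school enrollment flows; applied to 283,016 learner trips in the Philippines, they estimate a subsidy-equivalent distance to inform the allocation of urban services.

Comments: Oral Presentation. 2026 International Conference on Urban AI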

Abstract

Cities deliver basic services through mixed public-private facility networks, including schools, clinics, transit providers, and subsidized service points. In these systems, planners often observe where households go, but not the latent cost function through which they trade off factors such as distance, price, and institutional access. We study this urban problem through school choice in the Philippines, where the country's largest national education subsidy is intended to redirect learners from congested public schools to participating private schools. Treating school-to-school enrollment flows as an entropic optimal transport plan, we recover latent choice costs using two complementary inverse optimal transport models: an interpretable distance-banded model with a subsidy term, and a neural cost model trained through a differentiable Sinkhorn forward pass. Applied to 283,016 learner trips across 23,820 observed flows in the most populated region, the framework estimates a subsidy-equivalent distance, $\lambda^{(k)}$, interpreted as the kilometers of perceived travel cost offset by the subsidy. The case demonstrates how administrative origin-destination data can be transformed into interpretable planning metrics for accessibility-aware subsidy design, facility siting, and urban service allocation.
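To make the "Sinkhorn forward pass" concrete, here is a minimal sketch of entropic optimal transport in plain Python. It covers only the forward direction (cost to flow plan); the inverse step of fitting the cost to observed flows is not shown. The cost form (distance minus an offset `lam` for subsidized destinations), the toy numbers, and all names are assumptions for illustration, not the paper's model.

```python
import math

def sinkhorn(cost, a, b, eps=1.0, iters=500):
    """Entropic OT plan whose row sums approach a and column sums approach b."""
    n, m = len(a), len(b)
    # Gibbs kernel of the cost matrix.
    K = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        # Alternate scaling of rows and columns to match the marginals.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]

# Hypothetical setup: two origin schools, two destination schools, distances in km.
dist = [[1.0, 4.0], [3.0, 2.0]]
subsidized = [False, True]   # assumed: only destination 1 takes part in the subsidy
lam = 1.5                    # assumed subsidy-equivalent distance in km
cost = [[d - (lam if s else 0.0) for d, s in zip(row, subsidized)] for row in dist]
plan = sinkhorn(cost, a=[0.5, 0.5], b=[0.5, 0.5])
```

In an inverse setting, parameters such as `lam` would be adjusted until `plan` matches the observed flows; because every step above is differentiable, the same loop can sit inside a gradient-based training procedure.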

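2606.14156 2026-06-15 cs.LG cs.AI New submission

Learning High Coverage Discriminative Parsimonious Rulesets

Mariamma Antony, Raman Sankaran, Chiranjib Bhattacharyya, Uma Satya Ranjan

Affiliations: Indian Institute of Science; Compass

AI summary: Proposes CDPR, which learns high-coverage, discriminative, and parsimonious rule sets via submodular maximization algorithms, markedly improving interpretability while maintaining high accuracy, with more than a 2.5-fold improvement in coverage over the next best algorithm.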

Abstract

Learning systems based on IF-THEN rule representations readily offer interpretability, making them a crucial focus in contemporary AI research. A key objective for such rule sets is to achieve both high discriminative power and interpretability. While existing state-of-the-art algorithms implicitly prioritize predictive accuracy, they often fall short on one or more quality metrics that ensure interpretability, such as coverage and parsimony of rule sets. Motivated by this, this paper proposes CDPR, which aims to create highly accurate and interpretable rule sets for classification problems. To the best of our knowledge, this represents the first attempt to establish such an approach. In this study, we introduce two algorithms rooted in submodular maximization, which not only provide provable guarantees on coverage but also yield rule sets that are both discriminative and parsimonious. We empirically demonstrate that rule sets learned through our approaches achieve higher accuracy and interpretability, with more than a 2.5-fold improvement in average coverage over the next best algorithm.
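The abstract does not spell out CDPR's objectives, so the following is only a generic illustration of the underlying idea: coverage is a monotone submodular set function, and the classic greedy rule (repeatedly pick the rule with the largest marginal coverage gain) attains a (1 - 1/e) approximation under a cardinality budget. The rule names and covered example indices are hypothetical.

```python
def greedy_rule_cover(rule_coverage, budget):
    """Pick up to `budget` rules greedily by marginal coverage gain.

    rule_coverage maps a rule name to the set of example indices it covers.
    """
    chosen, covered = [], set()
    for _ in range(budget):
        # Rule that adds the most examples not yet covered.
        best = max(rule_coverage, key=lambda r: len(rule_coverage[r] - covered))
        if not rule_coverage[best] - covered:
            break  # no remaining rule adds coverage
        chosen.append(best)
        covered |= rule_coverage[best]
    return chosen, covered

# Hypothetical candidate rules and the training examples each one covers.
rules = {"r1": {0, 1, 2, 3}, "r2": {3, 4}, "r3": {4, 5, 6}, "r4": {0, 1}}
chosen, covered = greedy_rule_cover(rules, budget=2)
```

A small budget is what makes the resulting rule set parsimonious; a discriminative variant would replace raw coverage with a gain that also penalizes covering examples of the wrong class.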

2606.14155 2026-06-15 cs.LG cs.CL New submission

Graph-based Target Back-Propagation for Context Adaptation in Multi-LLM Agentic Systems

Tan Zhu, Tong Yao, Kananart Kuwaranancharoen, Amit Singh, Yushang Lai, Deepa Mohan, Shankara Bhargava

Affiliations: Retail Intelligence, Walmart Global Tech

AI summary: Proposes the GTBP framework, which back-propagates local target outputs through the workflow graph to adapt context in multi-LLM agentic workflows, with theoretical stability guarantees and experimental gains over baselines.

Abstract

Context adaptation automates prompt engineering in LLM-based systems by iteratively revising tunable prompts from task feedback, without modifying model weights. Extending this paradigm to multi-LLM agentic systems is crucial: existing methods suffer from inaccurate credit assignment and lack convergence guarantees. We propose Graph-based Target Back-Propagation (GTBP), a context adaptation framework for agentic workflows modeled as directed acyclic graphs. GTBP propagates local target outputs backward through the workflow graph and uses target-output discrepancies to guide a stage-wise prompt update mechanism. Theoretically, we show that GTBP's stage-wise prompt updates become stable over iterations, and that a sufficiently capable LLM optimizer can decrease the overall objective. Empirically, GTBP consistently outperforms strong baselines across three benchmarks while maintaining comparable computational cost.
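The backward pass over a workflow DAG can be pictured as a reverse topological traversal. The sketch below shows only that traversal: the stage names are hypothetical, the workflow is assumed to have a single final stage, and `derive_target` is a placeholder for whatever LLM call would propose a parent's target output. It is not GTBP's actual update rule.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each stage maps to the stages whose outputs it consumes.
workflow = {
    "retrieve": [],
    "draft": ["retrieve"],
    "critique": ["retrieve"],
    "answer": ["draft", "critique"],
}

def backward_targets(workflow, final_target, derive_target):
    """Assign a local target to every stage, visiting the final stage first."""
    order = list(TopologicalSorter(workflow).static_order())  # predecessors first
    targets = {order[-1]: final_target}  # assumes a single final stage
    for stage in reversed(order):
        for parent in workflow[stage]:
            # Placeholder for an LLM call proposing what `parent` should have produced.
            targets.setdefault(parent, derive_target(parent, stage, targets[stage]))
    return targets

targets = backward_targets(
    workflow,
    final_target="correct final answer",
    derive_target=lambda parent, child, t: f"output of {parent} that lets {child} reach: {t}",
)
```

Once every stage has a local target, the gap between a stage's actual output and its target is the signal a stage-wise prompt update would consume.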

2606.14153 2026-06-15 cs.CV cs.RO New submission

Encoder Winners Do Not Reliably Transfer Across VLA Backbone Scale: A Frozen-Backbone Grafting Diagnostic

Qingping Zeng, Fei She

Affiliations: Tsinghua University

AI summary: Proposes a frozen-backbone grafting diagnostic and finds that the vision encoder that is best on a small VLA is not necessarily best on a large backbone; encoder choice depends on backbone scale.

Comments: 23 pages, 5 figures, 8 tables

Abstract

Vision-language-action (VLA) policies typically inherit their vision encoder from upstream VLM releases, but it is unclear whether an encoder choice validated on a small VLA transfers to a larger backbone. We introduce a frozen-backbone grafting diagnostic: the vision tower of a released VLA is replaced by a candidate encoder under a fixed protocol (adaptive average pooling, LayerNorm, and a single trainable linear projector), with the language model and action expert frozen. Across four encoders, two LIBERO suites, two backbones (SmolVLA-450M and $\pi_{0.5}$-3.3B), and two-to-three seeds per cell (40 main grafting runs plus native, LoRA, pooling, and zero-/shuffled-image controls, all scored by offline action MSE), the small-backbone winner does not reliably select the large-backbone top tier: SigLIP is best on SmolVLA across both suites, while on $\pi_{0.5}$ DINOv2-small leads the spatial suite and the object suite is a seed-sensitive near-tie band; three of the four backbone-suite comparisons (and 11 of 12 seed-level cells) support backbone-dependent rankings. The grafting wrapper is itself non-neutral with opposite sign across backbones (+45-56% MSE on the SmolVLA native tower, -50-52% on $\pi_{0.5}$), so all conclusions are conditional on the fixed grafting protocol. We position frozen grafting as a cheap target-backbone diagnostic to run before committing to an encoder at scale, not as a closed-loop deployment claim.
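The fixed grafting protocol (adaptive average pooling, LayerNorm, one linear projector) can be sketched in dependency-free Python as follows. Several details are assumptions not stated in the abstract: pooling is taken over the token axis to reach a fixed token count, LayerNorm has no learned scale or shift, and all names and toy values are hypothetical.

```python
import math

def adaptive_avg_pool(tokens, out_len):
    """Average token vectors into out_len bins (1-D adaptive pooling bin rule)."""
    n = len(tokens)
    pooled = []
    for i in range(out_len):
        start = (i * n) // out_len
        end = -((-(i + 1) * n) // out_len)  # ceiling of (i + 1) * n / out_len
        chunk = tokens[start:end]
        pooled.append([sum(col) / len(chunk) for col in zip(*chunk)])
    return pooled

def layer_norm(vec, eps=1e-5):
    """Normalize one token vector to zero mean and unit variance."""
    mean = sum(vec) / len(vec)
    var = sum((x - mean) ** 2 for x in vec) / len(vec)
    return [(x - mean) / math.sqrt(var + eps) for x in vec]

def linear(vec, weight, bias):
    """The only trainable part in this sketch: y = W x + b."""
    return [sum(w * x for w, x in zip(row, vec)) + b for row, b in zip(weight, bias)]

def graft(tokens, out_len, weight, bias):
    return [linear(layer_norm(t), weight, bias) for t in adaptive_avg_pool(tokens, out_len)]

# Hypothetical candidate-encoder output: 4 tokens of width 2, mapped to 2 tokens of width 3.
encoder_tokens = [[1.0, 3.0], [3.0, 5.0], [0.0, 2.0], [2.0, 4.0]]
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
bias = [0.0, 0.0, 0.0]
grafted = graft(encoder_tokens, out_len=2, weight=weight, bias=bias)
```

Keeping the wrapper this small means that differences between encoders are attributed mostly to the encoders themselves, although the abstract notes that the wrapper is not neutral either.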

2606.14150 2026-06-15 cs.LG cs.CL New submission

Small LLMs: Pruning vs. Training from Scratch

Yufeng Xu, Taiming Lu, Kunjun Li, Jiachen Zhu, Mingjie Sun, Zhuang Liu

Affiliations: Princeton University; New York University; Carnegie Mellon University

AI summary: Compares pruning with training from scratch using six pruning methods on Llama-3.1-8B; pruning is better under a limited budget, while coarse-grained pruning can be surpassed when the budget is ample.

Comments: Our code is available at https://github.com/zlab-princeton/llm-pruning-collection

Abstract

Pruning promises a shortcut to strong small language models. In this work, we examine this promise by pruning Llama-3.1-8B at pruning ratios of 0.5-0.8 with six methods spanning depth, width, and sparse granularities, under two controlled token-matched settings. (1) With the same training token budget, pruned initialization consistently outperforms random initialization. This shows that the parent model provides a strong starting point, although the advantage narrows as the training token budget grows and as the pruning ratio rises, nearly vanishing at the highest pruning ratio we study. (2) When training from scratch is instead given the full token budget consumed by the whole pipeline, pruning at finer granularities still retains an advantage, while coarser structured pruning can be matched or surpassed. This suggests that the parent model transfers knowledge that additional training tokens alone cannot fully recover, but only at fine granularity. Taken together, our results yield a clear recommendation: with a large pretrained model in hand and a limited training token budget, pruning is better than training from scratch; when the training budget is not limited, training from scratch can be competitive for coarser pruning, so a large pretrained parent is not always necessary.

2606.14149 2026-06-15 cs.LG New submission

Trust but Verify: Mitigating Medical Hallucinations via Post-Hoc Adversarial Auditing and Multi-Agent Feedback Loops

Muhammad Osama, Maheera Amjad, Zartasha Mustansar, Arslan Shaukat, Muhammad U. S. Khan

Affiliations: Data Science and Machine Learning Lab, SINES, NUST; SINES, NUST; CEME, NUST

AI summary: Proposes a five-agent "Trust but Verify" system that uses post-hoc adversarial auditing and multi-agent feedback loops to reduce by about 53% the hallucination error rate at which large language models recommend banned drugs when answering clinical questions.

Abstract

Large Language Models (LLMs) are increasingly deployed in healthcare settings, yet their tendency to hallucinate poses risks when clinical decisions are involved. This study examines whether LLMs recommend recently banned or withdrawn pharmaceuticals when answering clinical questions and tests an agent-based method for reducing such errors. We developed a five-agent "Trust but Verify" system using a single LLM backbone. To measure regulatory knowledge obsolescence, we created an adversarial dataset of 103 clinical MCQs where historically correct answers now refer to banned substances. This scale ensures statistical significance across various therapeutic classes. We evaluated three open-access model families (GPT-OSS, Llama-3, Falcon-3) under vanilla and agentic conditions. Performance was measured via pointwise score, label accuracy, Hallucination Error Rate (HER), and Component Fidelity (CF) score. We also observed clinical safety regression in proprietary models. In default configurations, all models showed high hallucination rates, consistently selecting banned drugs that matched training data patterns. Our proposed agentic architecture reduced HER by approximately 53% across models. Pointwise scores shifted from -0.25 (unsafe recommendation) toward 0.0 (appropriate refusal). The safety audit intercepted dangerous outputs even when models' parametric knowledge favored the banned substance. The proposed multi-agent framework offers a model-agnostic method for enforcing regulatory compliance that prioritizes patient safety over fluent text generation. Our work demonstrates a practical approach for deploying autonomous AI systems in safety-critical healthcare settings. It shows how real-time regulatory data can be integrated into LLM pipelines to support clinical decision-making.

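2606.14145 2026-06-15 cs.CL New submission

Personal Care Utility: Health as Everyday Infrastructure

Mahyar Abbasian, Elahe Khatibi, Saba A. Farahani, Nitish Nagesh, Arshia Ilaty, Hooman Sajjadi, Amir Rahmani, Ramesh Jain

Affiliations: University of California, Irvine

AI summary: Proposes the Personal Care Utility (PCU), a layered, event-driven architecture that treats health as everyday infrastructure, organizes continuous signals through a Personicle, and separates clinical decisions from language expression; a Type 2 Diabetes instantiation demonstrates how it produces real-time interventions and knowledge-grounded guidance.

Comments: 12 pages, 2 figures, 3 tables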

Abstract

Healthcare is essential, expert, and episodic by design - built around the roughly one hour per year a person spends with a clinician. The 8,759 hours outside clinical settings, where eating, sleeping, movement, medication, and stress actually shape long-term health, have no comparable infrastructure. The bottleneck for personalized health is not raw data or reasoning capability; it is the absence of that infrastructure layer. This paper introduces the Personal Care Utility (PCU): a layered, event-driven architecture proposed as the missing utility for everyday health, in the way that payments, networks, and power are utilities for their domains. PCU organizes continuous personal signals into semantically meaningful life events through a Personicle, estimates dynamic health state against personal baselines, reasons about cause and context, and routes guidance through an orchestrator that separates clinical decision logic, behavioral strategy selection, and natural-language expression. This separation lets large language models support reasoning and communication while keeping safety-critical clinical decisions grounded in validated evidence. We instantiate PCU for Type 2 Diabetes - turning CGM, meal, activity, medication, sleep, stress, and clinical data into glycemic events, individualized state estimates, causal explanations, and knowledge-grounded interventions. A day-in-the-life scenario shows the same infrastructure producing real-time nudges, weekly summaries, medication check-ins, silence, or deterministic safety alerts depending on context and risk. We close with how PCU generalizes to other chronic conditions and the governance questions any always-on personal health utility must address. The result is a blueprint that treats personalization not as a final messaging layer, but as an architectural property of everyday health guidance.

2606.14141 2026-06-15 cs.SD cs.AI cs.CL New submission

Spatio-Temporal Audio Language Modeling for Dynamic Sound Sources

Oh Hyun-Bin, Kazuki Shimada, Yuhta Takida, Kim Sung-Bin, Toshimitsu Uesaka, Takashi Shibuya, Kyeongyoon Lee, Tae-Hyun Oh, Yuki Mitsufuji

Affiliations: POSTECH; Sony AI; Sony Group Corporation; Sungkyunkwan University; KAIST

AI summary: Proposes ST-AudioLM, which jointly learns event semantics and source trajectories through a spatio-temporal audio encoder and improves the semantic-localization trade-off for question answering about dynamic sound sources on the ST-AudioQA benchmark.

Abstract

Sound events are entities with semantic identities, locations, and trajectories, but current audio-language models usually reason about clips as global event content. Conversely, sound event localization models track source directions over time but offer limited semantic coverage for language reasoning. To address this gap, we introduce ST-AudioQA, a spatio-temporal audio QA dataset and benchmark built from first-order ambisonic (FOA) renderings of static and moving sound sources. Each scene provides source identity, activity, direction, distance, and motion metadata, enabling dense trajectory supervision and questions about what is sounding, where it is, how it moves, and how sources relate. We further propose ST-Audio Encoder, a time-resolved FOA audio encoder that learns event semantics together with source trajectories, and ST-AudioLM, which connects the audio tokens from the encoder to an LLM for spatio-temporal audio QA. Experiments show that this representation improves the semantic-localization tradeoff and yields stronger reasoning performance than static spatial and localization-oriented baselines.
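For readers unfamiliar with first-order ambisonics, the sketch below shows how a source direction maps to the four FOA channel gains, and how a moving source becomes a sequence of such gains. It assumes the common convention of azimuth measured counterclockwise from the front, X pointing forward, Y left, Z up, with an unscaled W channel; the dataset's actual rendering pipeline and channel normalization are not described in the abstract.

```python
import math

def foa_gains(azimuth_deg, elevation_deg):
    """First-order ambisonic encoding gains for a plane-wave source direction.

    Assumed convention: W omnidirectional with gain 1, X front, Y left, Z up.
    """
    az, el = math.radians(azimuth_deg), math.radians(elevation_deg)
    return {
        "W": 1.0,
        "X": math.cos(az) * math.cos(el),
        "Y": math.sin(az) * math.cos(el),
        "Z": math.sin(el),
    }

# Hypothetical trajectory: a source moving from the front to the left on the horizontal plane.
trajectory = [foa_gains(az, 0.0) for az in (0, 45, 90)]
```

Multiplying a mono signal by these time-varying gains yields the four FOA channels, so the source direction at each moment is what a time-resolved encoder would need to recover.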

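2606.14139 2026-06-15 cs.LG New submission

Decoupled Latent Optimization of Diffusion Models for Full Waveform Inversion

Chen Min, Zheng Ma

Affiliations: School of Mathematical Sciences, Shanghai Jiao Tong University; CMA-Shanghai, Shanghai Jiao Tong University

AI summary: Proposes Decoupled Latent Optimization (DLO), which separates the physical and latent variables through a quadratic-penalty objective and combines data-fidelity gradients with a diffusion prior, outperforming classical regularizers and existing diffusion-based methods on the OpenFWI benchmark.

Comments: 35 pages, 14 figures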

Abstract

Full waveform inversion (FWI) recovers subsurface velocity from seismic recordings by solving a severely ill-posed, nonconvex PDE-constrained optimization. Classical regularizers stabilize the inversion but fail to reproduce realistic geological structures; recent diffusion-prior methods improve realism at the cost of a fragile trade-off between data fidelity and prior consistency. We propose Decoupled Latent Optimization (DLO), which relaxes the standard latent-optimization formulation into a quadratic-penalty objective over an auxiliary physical variable and a latent variable. The data-fidelity gradient acts in physical space, the diffusion sampler contributes only through a decoded prior sample, and the standard smoothed-velocity initialization of classical FWI is preserved. On the OpenFWI benchmark, DLO outperforms classical regularizers and existing diffusion-based methods under clean, noisy, and missing-trace acquisitions. The prior, trained on 70×70 OpenFWI models, transfers directly to the Marmousi and Overthrust benchmarks, where DLO recovers intricate fault structures and remains robust to initialization smoothing and measurement noise.
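One plausible way to write the relaxation described above is shown below. All notation is assumed for illustration and does not come from the abstract: $F$ is the wave-equation forward operator, $d$ the recorded data, $G(z)$ the decoded prior sample for latent $z$, $x$ the auxiliary physical velocity model, and $\rho > 0$ the penalty weight.

```latex
% Standard latent optimization: the data term is evaluated through the decoder.
\min_{z}\ \tfrac{1}{2}\,\lVert F(G(z)) - d \rVert_2^2

% Quadratic-penalty relaxation with an auxiliary physical variable x.
\min_{x,\,z}\ \tfrac{1}{2}\,\lVert F(x) - d \rVert_2^2
  + \tfrac{\rho}{2}\,\lVert x - G(z) \rVert_2^2
```

In this form the data-fidelity gradient is taken with respect to $x$ only, the prior enters solely through $G(z)$ in the penalty term, and $x$ can be initialized with a smoothed velocity model as in classical FWI.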

2606.14130 2026-06-15 cs.LG cs.MA New submission

Contract-Based Compositional Shielding for Safe Multi-Agent Reinforcement Learning

Omar Adalat, Edwin Hamel-De le Court, Francesco Belardinelli

Affiliations: Imperial College London; University of Manchester

AI summary: Proposes a decentralised shielding method that coordinates agents' local LTL safety obligations through contracts, guaranteeing global safety and optimising team reward without centralised runtime control.

Abstract

Safe coordination problems surface in multi-agent reinforcement learning when global safety cannot be enforced by any agent unilaterally: the admissibility of one agent's action may depend on the dynamics of other agents. Decentralised shields can enforce safety at runtime, but purely factorised permissions often exclude optimal team behaviour that is safe only through coordination. We study deterministic safety guarantees for agents trained and deployed under decentralised execution, recovering team-optimal safe behaviour without centralised runtime control. Agents have a shared global specification $\phi$ in the safety fragment of Linear Temporal Logic ($\mathsf{LTL}_{\mathsf{safe}}$), and select among tuples of local $\mathsf{LTL}_{\mathsf{safe}}$ obligations whose conjunction implies the global specification $\phi$. Each agent may rely on the other agents' local obligations as assumptions because the whole contract tuple is certified simultaneously and allows projection into local action masks. At learning time, a non-stationary multi-armed bandit chooses among a library of local $\mathsf{LTL}_{\mathsf{safe}}$ obligations to select the tuple that optimises team reward, all without forgoing end-to-end safety. We evaluate the approach across 6 environments and 15 algorithmic variants.

2606.14129 2026-06-15 cs.CV New submission

BoRAD: Bootstrap your Own Representations for Multi-class Anomaly Detection

Duy Hoang Khuong, Tri Nguyen Minh, Ngu Huynh Cong Viet

Affiliations: Department of Artificial Intelligence, FPT University; Department of IT, FPT University; Department of Computing Fundamental, FPT University

AI summary: Proposes the BoRAD framework, which uses prototype regularization to address the identical-shortcut and mis-reconstruction problems of reconstruction models in multi-class anomaly detection, enabling single-model multi-class detection without labels.

Abstract

Reconstruction-based anomaly detection is attractive for industrial inspection, but scaling it from category-specific training to a one-for-all setting is challenging. A single model must reconstruct diverse normal appearances without copying abnormal details, which exposes two coupled failure modes: identical shortcut, where anomalies pass through the reconstruction path, and mis-reconstruction, where normal categories are confused with one another. We propose BoRAD, a label-free training framework that treats this as a representation-capacity allocation problem. BoRAD uses a shared learnable prototype bank to impose two complementary regularizers: spatial prototype alignment contracts local within-prototype variation to suppress anomaly copying, while prototype-relative global alignment preserves between-prototype structure and improves sensitivity to abnormal angular deviations. The prototype bank and prediction heads are used only during training; inference remains a standard teacher-student feature discrepancy pass, with no class labels, negative pairs, memory retrieval, or prototype lookup. BoRAD achieves competitive one-for-all anomaly detection performance, including 86.2% mAD on MVTec AD, 80.7% mAD on VisA and 73.1% mAD on Real-IAD. Diagnostic analyses further show reduced anomaly leakage, improved normal-category separability, and stronger anomaly-normal score separation.
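The "standard teacher-student feature discrepancy pass" at inference can be illustrated with a per-location cosine distance between teacher and student features. Using cosine distance is an assumption here (the abstract only mentions angular deviations), and the feature vectors below are hypothetical toy values rather than real network outputs.

```python
import math

def cosine_distance(t, s):
    """1 - cosine similarity between one teacher and one student feature vector."""
    dot = sum(a * b for a, b in zip(t, s))
    norm = math.sqrt(sum(a * a for a in t)) * math.sqrt(sum(b * b for b in s))
    return 1.0 - dot / norm

def anomaly_map(teacher_feats, student_feats):
    """Per-location anomaly scores; larger means the student failed to reproduce the teacher."""
    return [cosine_distance(t, s) for t, s in zip(teacher_feats, student_feats)]

# Hypothetical features at two spatial locations: the student matches the teacher
# at the first location and deviates at the second.
teacher = [[1.0, 0.0], [0.0, 1.0]]
student = [[1.0, 0.0], [1.0, 0.0]]
scores = anomaly_map(teacher, student)
image_score = max(scores)  # a common, assumed choice for the image-level score
```

Because the score depends only on the two feature maps, no labels, prototypes, or memory bank are needed at test time, which matches the inference setup the abstract describes.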

2606.14125 2026-06-15 cs.CV cs.AI New submission

Conditioning Matters: Stabilizing Inversion and Attention in Diffusion Image Editing

Zheyuan Zhan, Hongchen Li, Can Wang, Yinfei Ma, Mingzhen Huang, Ruoshi Bai, Jiawei Chen, Siwei Lyu, Defang Chen

Affiliations: State Key Laboratory of Blockchain and Data Security, Zhejiang University; Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security; College of Computer Science, Zhejiang University; University at Buffalo, State University of New York

AI summary: Proposes the SimEdit framework, which refines the precision of textual conditioning and applies token-wise cross-branch attention control to improve inversion stability and editing fidelity in diffusion models, significantly outperforming previous methods on PIE-Bench.

Comments: Accepted to ECML PKDD 2026 Research Track

Abstract

Inversion-based image editing offers flexible and training-free control but still struggles with inversion accuracy and the trade-off between editing fidelity and background preservation. While recent methods improve inversion formulations or attention interactions, the role of textual conditioning in shaping diffusion dynamics and editing behavior remains underexplored. We show both empirically and theoretically that the precision of textual conditioning influences inversion stability by modulating the geometry of the diffusion velocity field, while also affecting the consistency of cross-branch attention during editing. These effects directly impact background preservation and semantic fidelity. Building on this analysis, we propose SimEdit, a conditioning-aware framework with two complementary components: (a) conditioning refinement, which constructs conditioning signals with improved semantic precision and structural alignment to facilitate stable inversion and consistent attention manipulation, and (b) token-wise cross-branch attention control, which separates edit-relevant and structure-preserving components and modulates them asymmetrically during attention manipulation. Extensive experiments on PIE-Bench demonstrate that SimEdit consistently improves both inversion reconstruction quality and editing performance over previous attention-manipulation approaches. Our code is available at https://github.com/zju-pi/SimEdit.

2606.14123 2026-06-15 cs.LG cs.AI New submission

Recovering Stranded Discrimination in Knowledge Tracing: Per-Item Bias Correction via Empirical-Bayes Shrinkage

Xiaoran Yan, Cheng Tang, Atsushi Shimada

Affiliations: Kyushu University

AI summary: Proposes SLC, which converts binary observations to Gaussian pseudo-observations via Laplace/IRLS, applies empirical-Bayes shrinkage through a Kalman smoother, and fits an offset-Platt link to correct per-item bias in knowledge-tracing models and recover stranded discrimination, improving AUC and NLL across multiple datasets and backbones.

Comments: 25 pages, 3 figures. Accepted at ECML PKDD 2026 (Research Track). Code: https://github.com/xiaoran-y/SLC

Abstract

Deployed knowledge-tracing models are typically frozen after training, yet systematic per-item logit bias arises, from limited per-item expressivity in backbone architectures and from post-deployment shifts in item properties, degrading prediction quality. Global post-hoc calibrators such as Platt scaling, temperature scaling, and isotonic regression improve probability estimates but leave discriminative ability, as measured by AUC, unchanged. This AUC invariance is a structural consequence of monotone score-only transforms; recovering the stranded discrimination requires conditioning on item identity. We propose SLC (State-space Logit Correction), which converts binary observations to Gaussian pseudo-observations via Laplace/IRLS, applies empirical-Bayes shrinkage through a Kalman smoother, and fits an offset-Platt link. The state-space formulation also yields a detectability bound that characterizes the Bernoulli information floor, explaining why temporal tracking provides no benefit at current data densities. Across four datasets, five backbones, and three seeds, SLC improves AUC on all four datasets and NLL on three, with the advantage concentrating on sparse items. Cross-domain controls suggest that the same phenomenon can arise beyond education when the deployed backbone leaves entity-level bias.
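The AUC-invariance argument can be checked on a toy example: any strictly increasing transform of the scores (Platt scaling included) preserves every pairwise ordering and therefore the AUC, whereas an offset that depends on item identity can reorder responses across items. The logits, labels, Platt coefficients, and per-item offsets below are hand-picked hypothetical values; SLC estimates its offsets by shrinkage, which is not reproduced here.

```python
import math

def auc(scores, labels):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical frozen-backbone logits: item B carries a positive logit bias.
logits = [0.0, -1.0, 2.0, 1.0]
labels = [1, 0, 1, 0]
items = ["A", "A", "B", "B"]

auc_raw = auc(logits, labels)

# Global Platt scaling sigmoid(a * x + b) with a > 0 is monotone: AUC is unchanged.
platt = [sigmoid(0.7 * x - 0.3) for x in logits]
auc_platt = auc(platt, labels)

# Per-item offsets (hand-picked here) remove the bias and change the ranking.
offset = {"A": 0.0, "B": -1.5}
corrected = [x + offset[i] for x, i in zip(logits, items)]
auc_item = auc(corrected, labels)
```

In this toy case the raw and Platt-scaled scores give the same AUC of 0.75, while the item-conditioned correction reaches 1.0, which is the "stranded discrimination" the abstract refers to.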

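2606.14122 2026-06-15 cs.CL New submission

Beyond Perplexity: UTF-8 Validity in Byte-aware Language Models

Sangwhan Moon, Daisuke Oba, Youmi Ma, Tatsuya Hiraoka, Naoaki Okazaki

Affiliations: University of Tokyo; National Institute of Information and Communications Technology

AI summary: Studies the generation of invalid UTF-8 sequences by language models with byte-level tokenization; multilingual training experiments show that UTF-8 validity needs roughly twice as many training tokens to converge as perplexity does and that rare characters achieve higher structural validity, indicating that reliable UTF-8 generation is a capability requiring separate evaluation.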

英文摘要

Byte-level tokenization enables language models to handle any Unicode input, but models can generate invalid UTF-8 sequences when encountering rare or unseen characters. We investigate the relationship between training scale and UTF-8 generation reliability with a 355M parameter model trained on 80B tokens from a balanced multilingual corpus of English, Japanese, Korean, and Chinese. We introduce multiple evaluation protocols that isolate UTF-8 structural validity from language modeling. UTF-8 validity convergence lags perplexity by roughly a factor of two: perplexity stabilizes after 2.1B tokens, but UTF-8 validity requires 4.2B tokens. In context-free generation, rare characters achieve higher structural validity than common characters, suggesting over-specialization of frequent character representations. Through experiments, we observed that reliable UTF-8 generation is a distinct capability requiring evaluation beyond perplexity.
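To make "UTF-8 structural validity" concrete, here is a minimal sketch of a validity check over generated byte strings. The abstract does not spell out the paper's evaluation protocols, so this is only the generic strict-decoding test; the function names and sample bytes are illustrative.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the byte string is well-formed UTF-8 (strict decoding)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

def validity_rate(samples):
    """Fraction of byte strings that decode cleanly."""
    return sum(is_valid_utf8(s) for s in samples) / len(samples)

# Hypothetical model outputs, as raw bytes.
samples = [
    "日本語".encode("utf-8"),    # complete 3-byte characters: valid
    b"\xe6\x97",                 # 3-byte character cut off after 2 bytes: invalid
    b"\xc0\xaf",                 # overlong encoding: invalid
    "한국어 ok".encode("utf-8"),  # valid
]
print(validity_rate(samples))  # 0.5
```

A metric like this is independent of token likelihood, which is why it can keep improving after perplexity has plateaued.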

2606.14119 2026-06-15 cs.AI 新提交

FactoryLLM: A Safe and Open-Source AI Playground for Evaluating LLMs in Smart Factories

FactoryLLM:用于评估智能工厂中大语言模型的安全开源AI实验场

Yash Pulse, Yong-Bin Kang, Abhik Banerjee, Abdur Forkan, Prem Prakash Jayaraman

发表机构 * GitHub arXiv

AI总结 提出FactoryLLM,一个安全开源的AI实验场,通过多机器文档分析评估基于RAG的大语言模型,采用RAGAS和NVIDIA LLM-as-a-Judge双评估机制,案例验证了跨机器文档推理的有效性。

Comments 6 pages, 3 figures, IEEE INDIN 2026

详情
AI中文摘要

智能工厂中的故障诊断和恢复具有挑战性,因为关键信息分散在通过制造过程相互连接的多台机器的手册中。大语言模型(LLM)提供了一种有前景的方法。在本文中,我们提出了FactoryLLM,一个安全开源的AI实验场,旨在通过分析制造过程中多台机器的文档来评估不同的基于LLM的检索增强生成(RAG)模型。FactoryLLM使用户能够配置LLM,并通过使用RAGAS和NVIDIA的LLM-as-a-Judge指标的双重评估设置,评估在多个文档上进行推理时的性能。FactoryLLM是安全的,因为它允许用户运行本地或开源LLM,而无需共享敏感的工业数据,提供了一个受控的实验环境。我们通过一个涉及自主智能车辆及其移动规划器软件的案例研究展示了FactoryLLM的有效性,评估了三个LLM在来自约600页跨机器文档的30个维护查询上的表现。结果表明,FactoryLLM在跨机器文档推理方面是有效的:每个模型的事实依据性(groundedness)得分均高于0.88。用于社区在特定制造场景中测试FactoryLLM的完整代码和文档已公开提供。

英文摘要

Fault diagnostics and recovery in smart factories are challenging because critical information is dispersed across manuals of multiple machines which are interconnected through the manufacturing process. Large Language Models (LLMs) can provide a promising approach. In this paper, we propose FactoryLLM, a safe and open-source AI playground designed for evaluating different LLM-based retrieval-augmented generation (RAG) models by analysing documents from multiple machines across the manufacturing process. FactoryLLM enables the user to configure the LLM and assess performance when reasoning over multiple documents, through a dual evaluation setup using both RAGAS and NVIDIA's LLM-as-a-Judge metrics. FactoryLLM is safe because it allows users to run local or open-source LLMs without sharing sensitive industrial data, providing a controlled environment for experimentation. We demonstrate the efficacy of FactoryLLM through a case study which involves an Autonomous Intelligent Vehicle and its Mobile Planner software, evaluating three LLMs across 30 maintenance queries derived from approximately 600 pages of cross-machine documentation. The results suggest that FactoryLLM is effective in cross-machine document reasoning: every model achieved a groundedness score above 0.88. The full code and documentation for the community to test FactoryLLM with their manufacturing-specific scenarios are publicly available.

2606.14116 2026-06-15 cs.LG stat.ME 新提交

DTVEM-RE: A Hierarchical Random-Effects Extension of the Differential Time-Varying Effect Model for Person-Specific Multi-Lag Estimation in Intensive Longitudinal Data

DTVEM-RE:差分时变效应模型的分层随机效应扩展,用于密集纵向数据中个体特异性多滞后估计

Amartya Bhattacharya

发表机构 * Geisel School of Medicine, Dartmouth College(达特茅斯学院盖泽尔医学院)

AI总结 针对DTVEM假设所有人共享相同滞后结构的局限,提出DTVEM-RE扩展,允许个体拥有自己的滞后系数,通过贝叶斯分层VAR和连续时间OU模型实现,模拟和实证表明其能恢复个体间变异并提升预测性能。

详情
AI中文摘要

Jacobson等人(2019)提出的差分时变效应模型(DTVEM)是寻找密集纵向数据中最佳时间滞后的流行工具,但它假设所有人共享相同的滞后结构。原作者将此问题列为未来工作,这与现代临床研究的前提——个体存在差异——相冲突。我们提出DTVEM-RE,一种允许每个人拥有自己滞后系数的扩展,包含两种确认步骤版本:在Stan中实现的离散时间分层贝叶斯VAR,它在个体间进行信息汇集并提供校准的不确定性;以及在ctsem中实现的连续时间个体Ornstein-Uhlenbeck模型,它直接处理不均匀间隔的测量点。我们报告了四个结果。模拟显示,贝叶斯版本恢复个体间变异tau_a的偏差低于0.01,覆盖率为90%至93%。在Fisher等人(2017)的EMA数据集(N=40)上,个体特异性滞后1效应在三个情绪项目上相差一个数量级,贝叶斯和GAMM估计高度一致(r=0.87至0.92),且DTVEM-RE在四种离散时间方法中给出最佳的一步预测。多滞后版本显示所有九个tau_k值的可信区间均排除零,且个体差异最大的滞后在不同项目间变化,这是仅考虑滞后1的方法(如mlVAR)无法检测到的。最后,两个版本在个体特异性滞后1估计上几乎完全一致(r >= 0.995),差异仅如收缩所预测。据我们所知,DTVEM-RE是DTVEM风格滞后检测的第一个个体特异性实现,并且它包含标准DTVEM作为特例。

英文摘要

The Differential Time-Varying Effect Model (DTVEM) of Jacobson et al. (2019) is a popular tool for finding the best time lag in intensive longitudinal data, but it assumes everyone shares the same lag structure. The original authors named fixing this as future work, and the shared-lag assumption clashes with the premise of modern clinical research, which is that people differ. We present DTVEM-RE, an extension that lets each person have their own lag coefficients, with two versions of the confirmatory step: a discrete-time hierarchical Bayesian VAR in Stan, which pools across people and gives calibrated uncertainty, and a continuous-time per-person Ornstein-Uhlenbeck model in ctsem, which handles unevenly spaced beeps directly. We report four results. A simulation shows the Bayesian version recovers the between-person spread tau_a with bias below 0.01 and coverage of 90 to 93 percent. On the Fisher et al. (2017) EMA dataset (N=40), person-specific lag-1 effects vary by an order of magnitude across three mood items, the Bayesian and GAMM estimates agree closely (r=0.87 to 0.92), and DTVEM-RE gives the best one-step-ahead prediction among four discrete-time methods. A multi-lag version shows all nine tau_k values have credible intervals excluding zero, and the lag where people differ most changes across items, something lag-1-only methods like mlVAR cannot detect. Finally, the two versions agree almost exactly on person-specific lag-1 estimates (r >= 0.995), differing only as shrinkage predicts. DTVEM-RE is, to our knowledge, the first person-specific implementation of DTVEM-style lag detection, and it contains standard DTVEM as a special case.
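Why a continuous-time Ornstein-Uhlenbeck model copes with unevenly spaced beeps can be seen from the textbook relation between its drift rate and the implied autoregressive coefficient. The sketch below is not ctsem code and not the authors' model: it assumes a univariate, zero-mean OU process dx = -theta * x dt + sigma dW, and the function name and numbers are illustrative.

```python
import math

def ar_coefficient(theta: float, dt: float) -> float:
    """Implied autoregressive coefficient over a gap dt: E[x(t+dt) | x(t)] = exp(-theta*dt) * x(t)."""
    return math.exp(-theta * dt)

theta = 0.5  # hypothetical per-person drift rate (per hour)
for gap in (1.0, 2.0, 3.5):  # uneven gaps between beeps, in hours
    print(gap, round(ar_coefficient(theta, gap), 4))

# Coefficients compose multiplicatively across consecutive gaps, so one
# parameter describes the dynamics at every observed spacing.
assert math.isclose(
    ar_coefficient(theta, 1.0) * ar_coefficient(theta, 2.5),
    ar_coefficient(theta, 3.5),
)
```

A discrete-time VAR, by contrast, estimates a coefficient tied to one fixed lag, which is one reason the two versions are expected to agree only where the sampling is close to regular.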

2606.14108 2026-06-15 cs.LG cs.AI 新提交

Numbers Already Carry Their Own Embeddings

数字本身已携带其嵌入

Suhyun Bae, Donghun Lee

发表机构 * Department of Mathematics, Korea University(高丽大学数学系)

AI总结 提出无训练嵌入方法AOE,同时保留数字的实数值与p-adic模签名,实现即插即用并在代数组合基准上首次达到完美精度。

Comments Presented at the MATH-AI Workshop at NeurIPS 2025

详情
AI中文摘要

我们引入了Adelic运算保持嵌入(AOE),这是一种无需训练的表示,同时捕捉数字的实数值及其模(p-adic)签名。该构造通过设计保留了加法和乘法结构,将数字输入转化为“用数学语言表达”的嵌入。与依赖任务特定重新训练的先前方法不同,AOE是即插即用的,可无缝集成到现有架构中。在代数组合基准测试上,它取得了持续的性能提升,包括在编织图案任务上首次实现完美准确率——这为克服人工智能中长期存在的“数字问题”提供了一条有原则的前进道路。

英文摘要

We introduce Adelic operation-preserved embeddings (AOE), a training-free representation that captures both a number's real value and its modular (p-adic) signatures. This construction preserves additive and multiplicative structure by design, turning numerical input into embeddings that "speak in the language of mathematics." Unlike prior approaches that rely on task-specific retraining, AOE is plug-and-play and drops seamlessly into existing architectures. On algebraic combinatorics benchmarks, it delivers consistent gains, including the first-ever perfect accuracy on the Weaving Pattern task, while suggesting a principled path forward for overcoming the long-standing "number problem" in AI.
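The idea of an embedding that keeps both the real value and modular signatures, and that respects addition and multiplication, can be sketched as follows. This is not the AOE construction, which the abstract does not specify. It assumes integer inputs and uses residues modulo a few small primes (only the lowest digit of each p-adic expansion); the prime set and function names are hypothetical.

```python
PRIMES = (2, 3, 5, 7)  # assumed set of primes, chosen for illustration

def embed(n: int):
    """Real value followed by the residues of n modulo each prime."""
    return (float(n),) + tuple(n % p for p in PRIMES)

def add(u, v):
    """Component-wise addition: real part adds, residues add modulo p."""
    return (u[0] + v[0],) + tuple((a + b) % p for a, b, p in zip(u[1:], v[1:], PRIMES))

def mul(u, v):
    """Component-wise multiplication: real part multiplies, residues multiply modulo p."""
    return (u[0] * v[0],) + tuple((a * b) % p for a, b, p in zip(u[1:], v[1:], PRIMES))

# The map is a homomorphism for both operations, with no training involved.
print(add(embed(12), embed(30)) == embed(42))  # True
print(mul(embed(6), embed(7)) == embed(42))    # True
```

Because the operations act coordinate by coordinate, arithmetic on numbers becomes simple arithmetic on their embeddings.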

2606.14094 2026-06-15 cs.CV cs.AI 新提交

FEMOT: Multi-Object Tracking using Frame and Event Cameras

FEMOT: 使用帧和事件摄像机的多目标跟踪

Shiao Wang, Xiao Wang, Chao Wang, Yitao Li, Menghao Liu, Bo Jiang, Yaowei Wang, Yonghong Tian, Jin Tang

发表机构 * School of Computer Science and Technology, Anhui University(安徽大学计算机科学与技术学院) Peng Cheng Laboratory(鹏城实验室) National Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University(北京大学计算机学院多媒体信息处理全国重点实验室) School of Electronic and Computer Engineering, Shenzhen Graduate School, Peking University(北京大学深圳研究生院电子与计算机工程学院) Harbin Institute of Technology, Shenzhen(哈尔滨工业大学(深圳))

AI总结 提出FEMOT大规模RGB-事件多目标跟踪数据集和FEMOTR多模态跟踪框架,通过频域融合解耦特征,有效利用互补信息实现鲁棒跟踪。

详情
AI中文摘要

传统的RGB摄像机因其捕获丰富外观和语义信息的能力而被广泛用于多目标跟踪。然而,在复杂的现实挑战下,如运动模糊、低照度和过度曝光,其性能通常会下降。受生物启发的事件摄像机提供高时间分辨率和高动态范围,在极端场景下提供互补线索。尽管如此,由于缺乏大规模且标注良好的数据集,RGB-事件多目标跟踪仍未被充分探索。为解决这一问题,我们提出了FEMOT,一个大规模RGB-事件多目标跟踪数据集,涵盖多样化的现实场景和14个具有挑战性的属性。凭借RGB和事件数据以及高质量标注,FEMOT为系统评估RGB-事件多目标跟踪方法提供了可靠平台。基于FEMOT,我们重新训练并评估了超过十个强跟踪器,从而为未来研究建立了全面的基准。此外,我们提出了FEMOTR,一种多模态跟踪框架,该框架解耦RGB和事件特征并在频域中融合它们,从而有效利用其互补特性实现鲁棒的目标定位和身份关联。在FEMOT和DSEC-MOT数据集上的大量实验证明了所提方法的有效性。源代码和基准数据集已发布于 https://github.com/Event-AHU/FEMOT。

英文摘要

Conventional RGB cameras have been widely used in multi-object tracking due to their ability to capture rich appearance and semantic information. However, their performance is often degraded under complex real-world challenges, such as motion blur, low illumination, and overexposure. Bio-inspired event cameras offer high temporal resolution and high dynamic range, providing complementary cues under extreme scenarios. Nevertheless, RGB-event multi-object tracking remains underexplored due to the lack of large-scale and well-annotated datasets. To address this issue, we propose FEMOT, a large-scale RGB-event multi-object tracking dataset that covers diverse real-world scenarios and 14 challenging attributes. With both RGB and event data as well as high-quality annotations, FEMOT provides a reliable platform for systematically evaluating RGB-event multi-object tracking methods. Based on FEMOT, we retrain and evaluate over ten strong trackers, thereby establishing a comprehensive benchmark for future research. Furthermore, we propose FEMOTR, a multimodal tracking framework that decouples RGB and event features and fuses them in the frequency domain, thereby effectively exploiting their complementary characteristics for robust object localization and identity association. Extensive experiments on FEMOT and DSEC-MOT datasets demonstrate the effectiveness of the proposed method. The source code and benchmark dataset have been released on https://github.com/Event-AHU/FEMOT.

2606.14089 2026-06-15 cs.RO 新提交

A Modular Dual-Arm Apple Harvesting Robot with Enhanced Field Performance

一种具有增强田间性能的模块化双臂苹果采摘机器人

Keyi Zhu, Kyle Lammers, Chaaran Arunachalam, Kaixiang Zhang, Renfu Lu, Zhaojian Li

发表机构 * Michigan State University(密歇根州立大学) United States Department of Agriculture Agricultural Research Service(美国农业部农业研究局)

AI总结 提出一种模块化双臂苹果采摘机器人,采用垂直堆叠臂实现单树上下区域同时作业,结合基础模型感知、7阶加加速度轨迹生成、线性扫描采摘策略等5项改进,在商业果园中达到80.0%采摘成功率和7.53秒平均单臂周期,91.2%果实达到特级标准。

详情
AI中文摘要

机器人苹果采摘为解决商业果园劳动力短缺提供了有前景的方案,但低吞吐量和在果园环境中的较差性能阻碍了其商业应用。本文提出一种模块化双臂苹果采摘机器人,采用垂直堆叠臂实现单棵树上、下区域同时作业,将平台定位从多树横向重新定位简化为单树停止。与我们之前的水平双臂系统相比,该平台集成了5项进步:(1)基于基础模型的感知管线,结合Grounding-DINO和EfficientViT-SAM,在非结构化户外环境中实现鲁棒的水果定位;(2)7阶加加速度有界轨迹生成与控制屏障函数安全滤波器相结合,实现快速且安全的臂运动;(3)线性扫描采摘策略,带有10厘米接近缓冲区和旋转分离,提高了采摘可靠性;(4)基于时序逻辑的双臂协调策略与视觉-臂异步调度,最大化共享真空源的使用;(5)在2025年收获季节,涵盖不同苹果品种和树形结构的两个商业果园中进行现场验证。在这些田间试验收集的1738个臂循环中,系统实现了80.0%的单次尝试成功率和平均每臂周期7.53秒。水果损伤评估确认,91.2%的机器人采摘水果保持了美国农业部最高等级(特级),碰伤率在2.4%至4.9%之间。随着采摘周期时间的进一步改进和对茂密树叶遮挡的处理,这种新型模块化机器人设计有望用于苹果的商业化采摘。

英文摘要

Robotic apple harvesting offers a promising solution to labor shortages in commercial orchards, but low throughput and poor performance in orchard environments hinder its commercial adoption. This paper presents a modular dual-arm apple harvesting robot that uses vertically stacked arms to enable simultaneous operation in the upper and lower zones of a single tree, simplifying platform positioning from multi-tree lateral repositioning to single-tree stops. Compared to our prior horizontal dual-arm system, the platform integrates 5 advances: (1) a foundation-model-based perception pipeline combining Grounding-DINO and EfficientViT-SAM for robust fruit localization in unstructured outdoor environments; (2) 7th-order jerk-bounded trajectory generation paired with a Control Barrier Function safety filter to achieve fast yet safe arm motions; (3) a linear sweep harvesting strategy with a 10 cm approach buffer and rotational detachment that improves picking reliability; (4) a temporal-logic-based dual-arm coordination policy with vision-arm async scheduling that maximizes usage of a shared vacuum source; and (5) field validation in 2 commercial orchards covering different apple varieties and tree architectures during the 2025 harvest season. Across the 1738 arm cycles collected in these field trials, the system achieved an 80.0% per-attempt success rate and a mean per-arm cycle time of 7.53 s. Fruit damage assessments confirmed that 91.2% of robotically harvested fruit retained the highest USDA grade (Extra Fancy), with bruise rates between 2.4% and 4.9%. With further improvements in the picking cycle time and handling of heavy foliage occlusions, this new modular robot design holds promise for commercial harvesting of apples.

2606.14086 2026-06-15 cs.SD 新提交

Explainable and Trustworthy Speech Emotion Recognition Using Confidence Score and Reinforcement Learning Rectified Speech Emotion Descriptors

使用置信度分数和强化学习修正语音情感描述子的可解释且可信的语音情感识别

Youjun Chen, Xurong Xie, Mengzhe Geng, Zengrui Jin, Jiajun Deng, Guinan Li, Shujie Hu, Huimeng Wang, Haoning Xu, Chengxi Deng, Bowen Zhang, Xunying Liu

发表机构 * The Chinese University of Hong Kong(香港中文大学) Institute of Software, Chinese Academy of Sciences(中国科学院软件研究所) National Research Council Canada(加拿大国家研究委员会) Tsinghua University(清华大学)

AI总结 提出基于置信度分数和强化学习的在线语音情感描述子修正方法,用于后训练语音情感识别系统,在IEMOCAP和MELD上分别取得2.9%和3.3%的绝对性能提升。

Comments Accepted by Interspeech2026

详情
AI中文摘要

可解释且可信的语音情感识别(SER)至今仍是一项具有挑战性的任务,这主要是由于缺乏带有可靠语音情感描述子(SED)标签(如韵律特征和说话人特征)的SER数据。本文提出了一种基于置信度分数和强化学习(RL)的在线SED修正方法,用于在自动标注的SED标签上对SER系统进行后训练。在IEMOCAP和MELD上的实验表明,结合所提出的置信度分数和基于RL的SED修正方法的可解释SER系统,在性能上始终优于没有数据选择或SED修正的基线系统。最佳系统集成了这两个组件,在IEMOCAP和MELD基准测试上,分别比没有数据选择和SED修正的基线系统高出2.9%和3.3%的绝对SER增益(相对增益分别为3.7%和5.4%)。

英文摘要

Explainable and trustworthy speech emotion recognition (SER) remains a challenging task to date, largely due to the scarcity of SER data with reliable speech emotion descriptor (SED) labels, such as prosodic features and speaker traits. This paper presents a confidence score and reinforcement learning (RL) based on-the-fly SED rectification approach for post-training SER systems on automatically annotated SED labels. Experiments on IEMOCAP and MELD suggest that explainable SER systems incorporating the proposed confidence score and RL-based SED rectification approach consistently outperform baselines without data selection or SED rectification. The best performing system, which integrates both components, surpasses the baseline without data selection and SED rectification, achieving SER gains of 2.9% and 3.3% absolute (3.7% and 5.4% relative) on IEMOCAP and MELD benchmarks, respectively.

2606.14084 2026-06-15 cs.RO 新提交

Self-Improving VLA Policies: Selected Diffusion Noise for Spurious-Robust Action Smoothing

自我改进的VLA策略:用于抗虚假相关动作平滑的选择性扩散噪声

Duc Minh Nguyen, Bao-Ngoc Dao, Tung M. Luu, Binh Gia Nguyen, Vinh Tong, Anji Liu, Vu N. Duong, Dung D. Le, Daniel Sonntag, Trung Le, Ngan Le, Jan Peter, An Thai Le, Minh Nhat Vu, Mathias Niepert, Khoa D. Doan, Duy M. H. Nguyen, Vien Anh Ngo

发表机构 * Center for AI Research, VinUniversity(VinUniversity人工智能研究中心) VinRobotics KAIST(韩国科学技术院) University of Stuttgart(斯图加特大学) IMPRS-IS(国际马克斯·普朗克智能系统研究学院) National University of Singapore(新加坡国立大学) DFKI(德国人工智能研究中心) University of Oldenburg(奥尔登堡大学) Monash University(莫纳什大学) University of Arkansas(阿肯色大学) TU Darmstadt(达姆施塔特工业大学)

AI总结 提出一种无需训练的选择性扩散噪声方法,通过动态采样噪声向量增强视觉-语言-动作策略的鲁棒性和动作平滑性,在仿真和真实场景中成功率分别提升8%和10%。

详情
AI中文摘要

基于扩散的视觉-语言-动作(VLA)策略在机器人操作中实现了强大的泛化能力,但对虚假视觉相关性和噪声动作生成仍然敏感,导致在扰动下行为脆弱。我们引入了选择性扩散噪声(SDN),这是一种简单的、无需训练的测试时方法,通过利用扩散噪声空间作为可控自由度来提高鲁棒性和成功率。SDN动态采样与参考集最大分离的噪声向量,以减轻对虚假线索的依赖,同时选择产生更一致动作轨迹的候选。这种双重目标即使在物体遮挡的观测下也能鼓励稳定行为,并在不修改模型参数的情况下减少动作抖动。我们在两个模拟基准(Google Robot、Widow-X)和两个真实世界机器人数据集上,对多种VLA策略(包括pi_0、Groot-N1.5和Groot-N1.6)评估了SDN。SDN在模拟环境中一致地将成功率提高了8%,在真实环境中提高了10%,同时产生更平滑、更稳定的动作。我们的结果强调,扩散噪声选择可以作为在测试时增强VLA策略的有效且通用机制。

英文摘要

Diffusion-based Vision-Language-Action (VLA) policies enable strong generalization in robotic manipulation, but remain sensitive to spurious visual correlations and noisy action generation, leading to brittle behavior under perturbations. We introduce Selected Diffusion Noise (SDN), a simple, training-free test-time method that improves both robustness and success rate by leveraging the diffusion noise space as a controllable degree of freedom. SDN dynamically samples noise vectors that are maximally separated from a reference set to mitigate reliance on spurious cues, while selecting candidates that yield more coherent action trajectories. This dual objective encourages stable behavior even under object-masked observations and reduces action jitter without modifying model parameters. We evaluate SDN on two simulation benchmarks (Google Robot, Widow-X) and two real-world robotic datasets across multiple VLA policies, including pi_0, Groot-N1.5, and Groot-N1.6. SDN consistently improves success rates by +8% in simulation and +10% in real-world settings, while producing smoother and more stable actions. Our results highlight that diffusion noise selection can serve as an effective and general mechanism for enhancing VLA policies at test time.
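The "maximally separated from a reference set" criterion can be illustrated with a farthest-candidate rule. This is only one plausible reading of the abstract, not the SDN algorithm: the Euclidean metric, the max-min selection rule, and the function names are assumptions, and the second criterion (preferring candidates that give more coherent action trajectories) is left out.

```python
import math

def min_distance(v, reference):
    """Distance from candidate v to its nearest neighbour in the reference set."""
    return min(math.dist(v, r) for r in reference)

def pick_most_separated(candidates, reference):
    """Choose the candidate whose nearest reference vector is farthest away."""
    return max(candidates, key=lambda v: min_distance(v, reference))

# Tiny 2-D stand-ins for noise vectors (hypothetical values).
reference = [(0.0, 0.0), (1.0, 0.0)]
candidates = [(0.5, 0.0), (0.0, 2.0), (1.0, 1.0)]
print(pick_most_separated(candidates, reference))  # (0.0, 2.0)
```

In a real policy the candidates would be sampled Gaussian noise tensors of the diffusion model's action shape, and the selected one would seed the denoising process without touching the model weights.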

2606.14083 2026-06-15 cs.RO 新提交

The N2D Haptic Glove: A Multi-Finger Glove for 2D Directional Force Feedback for Contact Rich Manipulation

N2D 触觉手套:用于接触丰富操作的多指二维方向力反馈手套

Yao-Ting Huang, Jake Honma, Omar Hernandez, Logan Li, Kaitlin Calimbahin, Bryce Hackel, Michael C. Yip

发表机构 * University of California San Diego(加州大学圣地亚哥分校)

AI总结 提出 N2D 触觉手套,通过绞盘驱动在指尖提供二维弯曲-伸展力反馈,显著降低遥操作中的接触力误差并提高一致性。

详情
AI中文摘要

人类在操作过程中依赖方向性指尖力来探测和调节接触,但大多数可穿戴触觉手套仅提供振动或单轴力,导致力方向模糊。缺乏方向性提示时,用户必须仅凭视觉推断接触力,常导致过度按压、控制不一致以及机器人遥操作精度下降。我们提出 N2D 触觉手套,一种多指可穿戴设备,利用绞盘驱动传输在指尖提供平面弯曲-伸展力,实现高透明度力反馈。通过台架验证和涉及机器人手臂与手触觉遥操作的用户研究,我们证明与仅视觉和单轴触觉基线相比,平面指尖反馈在精确操作中显著降低接触力误差,提高试验间一致性,并增强轴向探测任务中的整体用户体验。这些发现确立了 N2D 触觉手套和基于方向手指的触觉设备作为接触丰富遥操作、沉浸式虚拟现实模拟以及机器人从演示中学习的有前景模式。N2D 触觉手套的硬件和软件系统将完全开源,网址为 https://ucsdarclab.github.io/n2d-glove/。

英文摘要

Humans rely on directional fingertip forces to probe and regulate contact during manipulation, yet most wearable haptic gloves render only vibration or single-axis force, leaving force direction ambiguous. Without directional cues, users must infer contact force from vision alone, often leading to over-pressing, inconsistent control, and reduced precision in robotic teleoperation. We present the N2D Haptic Glove, a multi-finger wearable device that renders planar flexion-extension fingertip forces using capstan-drive transmissions for high-transparency force feedback. Through benchtop validations and a user study involving haptic teleoperation of a robotic arm and hand, we demonstrate that compared to visual-only and single-axis haptic baselines, planar fingertip feedback significantly reduces contact force error during precise manipulation, improves trial-to-trial consistency, and enhances overall user experience in axial probing tasks. These findings establish the N2D Haptic Glove and directional finger-based haptic devices as a promising modality for contact-rich teleoperation, immersive virtual reality simulations, and robot learning from demonstrations. The N2D Haptic Glove's hardware and software system will be fully open-sourced at https://ucsdarclab.github.io/n2d-glove/.

2606.14079 2026-06-15 cs.LG 新提交

Deep Spectral Learning of Embedded Latent Transfer Operators for Stochastic Dynamical Systems

随机动力系统的嵌入潜转移算子深度谱学习

Ryogo Tanaka, Yoshinobu Kawahara

发表机构 * Graduate School of Information Science and Technology, The University of Osaka(大阪大学信息科学与技术研究生院) Center for Advanced Intelligence Project, RIKEN(理化学研究所先进智能项目中心)

AI总结 提出一种深度谱编码器方法,通过可学习的非线性特征映射定义马尔可夫潜状态,利用泛函典型相关分析和Galerkin投影估计转移与观测算子,实现贝叶斯滤波和Koopman谱分解,在噪声和部分可观测条件下表现稳定优越。

Comments Accepted at the 42nd Conference on Uncertainty in Artificial Intelligence (UAI 2026)

详情
AI中文摘要

我们提出了一种用于随机非线性动力系统的谱学习方法,该方法在深度特征空间中用嵌入的潜转移算子表示。我们将该方法实例化为深度谱编码器(DSE),一种基于算子的潜状态空间模型,其中时不变神经编码器从观测中实现可学习的非线性特征映射,这些特征定义了马尔可夫潜状态,其时间演化和观测映射分别由转移算子和观测算子描述。在可学习的Galerkin投影特征空间中的泛函典型相关分析提供了来自过去和未来观测的状态坐标,两个线性算子以岭正则化的闭式解形式在状态坐标上估计,这些解与相关算子的Galerkin投影一致。在此表示上,我们推广了特征空间中的序贯贝叶斯滤波和Koopman谱模态分解。多个场景的实验表明,即使在噪声和部分可观测条件下,与序贯贝叶斯滤波和动态模式分解基线相比,该方法性能稳定且优越。

英文摘要

We propose a spectral learning method for stochastic nonlinear dynamical systems represented with embedded latent transfer operators in deep feature spaces. We instantiate the method as Deep Spectral Encoder (DSE), an operator-based latent state-space model in which a time-invariant neural encoder implements learnable nonlinear feature maps from observations, and these features define Markovian latent states whose temporal evolution and observation mapping are described by the transfer and observation operators, respectively. Functional canonical correlation analysis in a learnable Galerkin-projected feature space provides state coordinates from past and future observations, and the two linear operators are estimated on the state coordinates as ridge-regularized closed-form solutions that coincide with Galerkin projections of the associated covariance operators. On this representation, we generalize sequential Bayesian filtering and Koopman spectral mode decomposition in feature space. Experiments on several scenarios show stable and superior performance compared with sequential Bayesian filtering and dynamic mode decomposition baselines, even under noise and partial observability.

2606.14078 2026-06-15 cs.LG cs.AI 新提交

Rethinking Backdoor Adversarial Unlearning through the Lens of Catastrophic Forgetting in Continual Learning

通过持续学习中的灾难性遗忘视角重新思考后门对抗性去学习

Zhenqian Zhu, Yamin Hu, Yujiang Liu, Luping Wei, Wenbo Hou, Bin Li, Haodong Li, Wenjian Luo

发表机构 * Harbin Institute of Technology, Shenzhen(哈尔滨工业大学(深圳)) Shenzhen Key Laboratory of Media Security, Shenzhen University(深圳大学媒体安全深圳市重点实验室)

AI总结 本文将后门学习与去学习建模为持续学习视角下的三阶段过程,基于灾难性遗忘机制推导完全后门去学习的必要条件,并提出盲反演-后门对抗性去学习(BI-BAU)方法,通过期望最大化算法优化最大后验目标,有效消除后门效应。

Comments Accepted by ACM CCS 2026

详情
AI中文摘要

现有研究表明,当前的后门防御方法鲁棒性有限,且常无法应对特定类型的攻击。更令人担忧的是,主流的安全调优策略往往仅提供表面安全保护,因为它们未能完全消除后门效应。在本工作中,我们从持续学习视角将后门学习与去学习重新表述为一个顺序的三阶段过程。在此框架内,我们正式定义了完全后门去学习,并基于灾难性遗忘机制进一步推导了实现它的必要条件。在这些见解的指导下,我们提出了盲反演-后门对抗性去学习(BI-BAU),它将满足去学习条件的对抗样本生成问题表述为一个盲反演问题。我们通过将对抗训练的双层优化过程整合到期望最大化(EM)算法框架中来解决该问题,以优化最大后验(MAP)目标。此外,BI-BAU被扩展到目标类别未知的无目标对抗场景以及多模态对比学习任务中,增强了其在预训练模型可能被攻破的真实部署场景中的适用性。大量实验表明,我们的方法在广泛的后门攻击中具有通用适用性,并能有效且彻底地消除后门模型中的后门效应。

英文摘要

Existing studies reveal that current backdoor defenses exhibit limited robustness and often fail against specific types of attacks. More concerningly, prevailing safety tuning strategies tend to provide only superficial safety protection, as they fall short of completely eliminating the backdoor effects. In this work, we present a novel formulation of backdoor learning and unlearning as a sequential, three-stage process from a continual learning perspective. Within this framework, we formally define complete backdoor unlearning and further derive the necessary conditions for achieving it based on the mechanism of catastrophic forgetting. Guided by these insights, we propose Blind Inversion-Backdoor Adversarial Unlearning (BI-BAU), which formulates the generation of adversarial examples satisfying the unlearning conditions as a blind inversion problem. We solve this by integrating the bi-level optimization process of adversarial training into an Expectation-Maximization (EM) algorithm framework to optimize the maximum a posteriori (MAP) objective. Furthermore, BI-BAU is extended to untargeted adversarial scenarios with unknown target classes, as well as to multi-modal contrastive learning tasks, enhancing its applicability to real-world deployment scenarios where pre-trained models may be compromised. Extensive experiments demonstrate that our method exhibits general applicability across a wide spectrum of backdoor attacks and can effectively and thoroughly eliminate the backdoor effects from a backdoor model.

2606.14072 2026-06-15 cs.CV cs.CL 新提交

Diffusion-Refined Segmentation and Vision-Language Interpretation for Pediatric Brain Tumor MRI

扩散细化分割与视觉-语言解释用于儿童脑肿瘤MRI

Wentao Ke, Jianche Liu

发表机构 * Department of Mechanical Engineering, Stanford University(斯坦福大学机械工程系) School of Medicine, Stanford University(斯坦福大学医学院)

AI总结 提出两阶段框架,先用Swin-UNETR粗分割,再用条件扩散模型细化边界,最后结合多模态语言模型生成结构化报告,提升儿童脑肿瘤分割精度和可解释性。

详情
AI中文摘要

由于标注数据有限、成像表型异质性、肿瘤边界弥散以及肿瘤子区域类别不平衡,准确的儿童脑肿瘤分割仍然具有挑战性。在此,我们提出一个两阶段深度学习框架,用于改进多模态儿童脑MRI分割和临床解释。首先,我们在BraTS-PEDs MRI扫描上评估3D Res U-Net和Swin-UNETR基线模型,使用四种配准模态预测肿瘤核心、全肿瘤和增强肿瘤区域。其次,我们引入基于扩散的细化模型,以粗Swin-UNETR预测为条件,包括3D DDPM细化器和MedSegDiff。条件化显著提高了扩散稳定性和性能,特别是对于增强肿瘤边界分割。条件化MedSegDiff实现了最强的边界一致性,HD95最低。最后,预测的肿瘤体积和代表性分割叠加图与多模态语言模型集成,生成结构化的放射学风格报告。综合来看,我们的结果表明,从粗到细的扩散分割可以改善儿童肿瘤边界描绘,并支持端到端可解释的AI辅助神经肿瘤学工作流程。

英文摘要

Accurate pediatric brain tumor segmentation remains challenging due to limited annotated data, heterogeneous imaging phenotypes, diffuse tumor boundaries, and class imbalance across tumor subregions. Here, we present a two-stage deep learning framework for improving multi-modal pediatric brain MRI segmentation and clinical interpretation. First, we evaluate 3D Res U-Net and Swin-UNETR baselines on BraTS-PEDs MRI scans, using four co-registered modalities to predict tumor core, whole tumor, and enhancing tumor regions. Second, we introduce diffusion-based refinement models conditioned on coarse Swin-UNETR predictions, including a 3D DDPM refiner and MedSegDiff. Conditioning substantially improves diffusion stability and performance, particularly for enhancing tumor boundary segmentation. Conditioned MedSegDiff achieves the strongest boundary agreement with the lowest HD95. Finally, predicted tumor volumes and representative segmentation overlays are integrated with a multimodal language model to generate structured radiology-style reports. Together, our results suggest that coarse-to-refined diffusion segmentation can improve pediatric tumor boundary delineation and support end-to-end interpretable AI-assisted neuro-oncology workflows.

2606.14071 2026-06-15 cs.CV 新提交

ShearFuse-UNet: Hadamard, DCT, and Shearlet Transform Fusion for Next-Day Wildfire Spread Prediction

ShearFuse-UNet: Hadamard、DCT和Shearlet变换融合用于次日野火蔓延预测

Ene Meco, Yingyi Luo, Emadeldeen Hamdan, Adam Watts, Ahmet Enis Cetin

发表机构 * University of Illinois Chicago(伊利诺伊大学芝加哥分校) US Forest Service, Pacific Wildland Fire Science Laboratory(美国林务局太平洋野火科学实验室)

AI总结 提出ShearFuse-UNet,一种轻量级深度学习模型,通过融合WHT、DCT和Shearlet变换分支,在U-Net编码器中实现多模态卫星数据的次日野火蔓延预测,以267k参数达到0.596 F1分数,优于ResNet18 U-Net。

详情
AI中文摘要

我们提出了ShearFuse-UNet,一种轻量级且计算高效的深度学习模型,用于从多模态卫星数据预测次日野火蔓延。该模型在U-Net骨干网络的每个编码器块内集成了三个互补的变换域分支:二维快速沃尔什-阿达玛变换(WHT)分支、二维离散余弦变换(DCT)分支和锥自适应数字Shearlet残差分支。WHT和DCT分支通过可学习的频谱缩放和固定的软阈值建立正交潜在空间,而Shearlet分支提供各向异性的多方向特征分解,显式编码火线特有的细长边缘结构。一个学习的SpectralFusion门自适应地组合WHT和DCT响应,并将Shearlet重构作为残差添加。这种三分支设计与Transformer自注意力有松散的结构类比:WHT和DCT分支提供自适应融合的互补频谱表示,而Shearlet分支通过残差路径贡献方向内容。与自注意力不同,所提出的设计依赖于固定的数学变换而非学习的投影算子,减少了参数数量和计算成本。在WildfireSpreadTS数据集上评估,ShearFuse-UNet仅用267k参数就达到了0.596的F1分数,优于基于ResNet18的U-Net(14M参数,F1=0.589),展示了非常有利的精度-效率权衡。在Google Next-Day Wildfire Spread数据集上的结果进一步验证了这些发现。

英文摘要

We propose ShearFuse-UNet, a lightweight and computationally efficient deep learning model for next-day wildfire spread prediction from multi-modal satellite data. The model integrates three complementary transform-domain branches inside each encoder block of a U-Net backbone: a 2D Fast Walsh-Hadamard Transform (WHT) branch, a 2D Discrete Cosine Transform (DCT) branch, and a cone-adapted digital Shearlet residual branch. The WHT and DCT branches establish orthogonal latent spaces with learnable spectral scaling and fixed soft-thresholding, while the Shearlet branch provides anisotropic, multi-directional feature decomposition that explicitly encodes the elongated edge structures characteristic of fire fronts. A learned SpectralFusion gate adaptively combines the WHT and DCT responses, and the Shearlet reconstruction is added as a residual. This three-branch design bears a loose structural analogy to transformer self-attention: the WHT and DCT branches provide complementary spectral representations that are adaptively fused, while the Shearlet branch contributes directional content through a residual pathway. Unlike self-attention, the proposed design relies on fixed mathematical transforms rather than learned projection operators, reducing parameter count and computational cost. Evaluated on the WildfireSpreadTS dataset, ShearFuse-UNet achieves an F1 score of 0.596 with only 267k parameters, outperforming a ResNet18-based U-Net (14M parameters, F1 = 0.589) and demonstrating a highly favorable accuracy-efficiency trade-off. Results on the Google Next-Day Wildfire Spread dataset further validate these findings across a different benchmark.
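The transform-domain branch described above (fixed transform, learnable spectral scaling, soft-thresholding, inverse transform) can be sketched in one dimension. This is an illustration under stated assumptions rather than the ShearFuse-UNet layer: the paper uses 2-D transforms inside a U-Net, and the ordering of scaling and thresholding, the scale values, and the function names here are hypothetical.

```python
import math

def fwht(x):
    """Unnormalized fast Walsh-Hadamard transform; len(x) must be a power of two."""
    a = list(x)
    n = len(a)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                # Butterfly: sum and difference of paired entries.
                a[j], a[j + h] = a[j] + a[j + h], a[j] - a[j + h]
        h *= 2
    return a

def soft_threshold(coeffs, t):
    """Shrink each coefficient toward zero by t; zero out those with magnitude below t."""
    return [math.copysign(max(abs(c) - t, 0.0), c) for c in coeffs]

def wht_branch(x, scale, t):
    """Transform, scale per coefficient, soft-threshold, then invert."""
    coeffs = [s * c for s, c in zip(scale, fwht(x))]
    coeffs = soft_threshold(coeffs, t)
    n = len(x)
    return [c / n for c in fwht(coeffs)]  # the WHT is its own inverse up to 1/n

print(fwht([1, 0, 1, 0]))                              # [2, 2, 0, 0]
print(wht_branch([1, 0, 1, 0], [1, 1, 1, 1], t=1.0))   # [0.5, 0.0, 0.5, 0.0]
```

The transform itself has no trainable weights, so only the per-coefficient scales would be learned, which is consistent with the small parameter count the abstract reports.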

2606.14070 2026-06-15 cs.RO 新提交

Development of a 3 in Sewer Pipe Inspection Robot with an Articulated Differential Mechanism using X-shaped Linkages

使用X形连杆铰接差动机构的3英寸下水道管道检测机器人开发

Shoya Umemura, Ryota Taniguchi, Atsushi Kakogawa

发表机构 * Ritsumeikan University(立命馆大学)

AI总结 提出一种改进的3英寸下水道管道检测机器人,通过铰接差动机构提升牵引力和越障能力,并设计基于驱动轮电流检测的线缆松弛控制方法,实验验证了其越障性能。

Comments The 23rd International Conference on Ubiquitous Robots (UR 2026), 15-18 July, Osaka Ibaraki Campus, Ritsumeikan University, Ibaraki, Osaka, Japan

详情
AI中文摘要

本文提出了一种改进版的3英寸下水道管道检测机器人,配备紧急疏散机构。第一版中存在的低牵引力和较差的越障能力,通过简单连接推进单元得到了改善。耦合的推进单元具有差动机构,能够通过单根线缆实现姿态变化,从而适应管道直径变化。为了穿越管道接头等障碍物,设计了一种控制方法,通过驱动轮电机上的电流负载检测障碍物接触并松弛线缆。该方法通过模拟管道实验进行了验证。使用施加在驱动轮上的电流波形进行了负载比较。我们提出的控制方法显著提高了新型铰接式机器人的越障能力。

英文摘要

This paper proposes an improved version of the 3-inch sewer pipe inspection robot equipped with an emergency evacuation mechanism. The low traction force and poor step-over capability, which were challenges of the first version, have been improved by simply connecting the propulsion units. The coupled propulsion units feature a differential mechanism capable of posture changes via a single wire, enabling adaptation to pipe diameter variations. To traverse obstacles like pipe joints, a control method was devised that detects obstacle contact through current load on the drive wheel motors and slackens the wire. This method was verified through simulated pipe experiments. Load comparisons were made using current waveforms applied to the drive wheels. Our proposed control method significantly improved the step-over capability of the new articulated robot.

2606.14068 2026-06-15 cs.CL 新提交

Harsher on Male? Evaluating LLMs on Gender-Asymmetric Moral Framing Across Diverse Conflict Scenarios

对男性更严厉?评估LLM在不同冲突场景中的性别不对称道德框架

Guangzong Si, Dong Wang, Zhenhao Li, Yifan Yu, Panwang Pan, Wentao Zhu

发表机构 * University of Science and Technology of China(中国科学技术大学) Eastern Institute of Technology, Ningbo(宁波东方理工大学)

AI总结 提出GAMA-Bench基准,通过性别镜像场景评估LLM对相同负面行为的回应,发现模型对男性更倾向于惩罚和责备,对女性则更强调共情和治疗。

Comments Under review

详情
AI中文摘要

现有关于LLM性别偏见的研究主要关注刻板印象、职业关联或明确的有害输出。在这项工作中,我们询问LLM是否在匹配的男性角色和女性角色条件下对相同的负面行为应用一致的回应标准。我们引入了GAMA-Bench,一个包含1298个场景的性别镜像基准,涵盖亲密关系和公共社会冲突。它通过受控网格和跨模型审查构建性别中立的不当行为模板,然后将它们编译成配对的第一人称提示,包含匹配的角色性别和角色指称变化。我们进一步设计了一个结构化的回应框架协议,以衡量模型如何分配惩罚、共情、升级、指导和责备。在10个代表性LLM上的实验揭示了一致的男性不利不对称:男性角色获得更多惩罚性、升级性和责备中心的框架,而女性角色对相同的不当行为获得更多治疗性和共情导向的框架。进一步分析表明,这种模式在模型家族、场景轨道、模型规模和显式思维风格推理中持续存在。官方代码见 https://github.com/xufeiqiong/GAMA-Bench。

英文摘要

Existing studies on gender bias in LLMs have largely focused on stereotypes, occupational associations, or explicit harmful outputs. In this work, we ask whether LLMs apply consistent response standards to the same negative behavior under matched male-actor and female-actor conditions. We introduce GAMA-Bench, a gender-mirrored benchmark of 1,298 scenarios covering intimate relationship and public social conflicts. It constructs gender-neutral misconduct templates through controlled grids and cross-model review, then compiles them into paired first-person prompts with matched actor-gender and role-reference variations. We further design a structured response-framing protocol to measure how models allocate punishment, empathy, escalation, instruction, and blame. Experiments on 10 representative LLMs reveal a consistent male-disadvantaging asymmetry: male actors receive more punitive, escalatory, and blame-centered framing, whereas female actors receive more therapeutic and empathy-oriented framing for the same misconduct. Further analyses show that this pattern persists across model families, scenario tracks, model scale, and explicit thinking-style reasoning. The official code is available at https://github.com/xufeiqiong/GAMA-Bench.

2606.14063 2026-06-15 cs.RO cs.SY eess.SY New submission

Semidefinite Relaxations for Collision-Free Motion Planning

Semidefinite Relaxations for Collision-Free Motion Planning

Bernhard Paus Graesdal, Alexandre Amice, Pablo A. Parrilo, Russ Tedrake

Affiliations: Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology

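AI summary: Studies collision-free motion planning for a point robot moving through spherical obstacles. It proposes a semidefinite relaxation, analyzes its tightness theoretically, and exploits symmetry to reduce computational cost. The relaxation is 10 to 100 times faster than direct nonlinear programming.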

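Details
AI Chinese abstract (translated)

We study semidefinite relaxations for collision-free motion planning. We focus on a point robot in $\mathbb{R}^n$ moving from start to goal through spherical obstacles, subject to path continuity constraints and squared derivative costs. This setting is conceptually simple yet captures the hardness of collision-free motion planning. We formulate the problem exactly as a nonconvex problem over polynomial curves and propose a natural semidefinite relaxation. We contribute two key theoretical insights; to our knowledge, this is the first theoretical analysis of semidefinite relaxations for collision-free motion planning. First, we show that solving the convex relaxation is equivalent to solving, to global optimality, a related motion planning problem in a potentially higher-dimensional space. This geometric interpretation gives necessary and sufficient conditions for tightness, and a clear intuition for when the relaxation is loose. Second, we show that the relaxation admits a symmetry reduction that makes it much smaller than expected, with positive semidefinite cone sizes that grow linearly with the polynomial degree and are independent of the ambient dimension. The resulting relaxation is 10 to 100 times faster than direct nonlinear programming transcriptions solved with SNOPT and IPOPT, shows significantly lower variance in solve times, and reliably finds a locally optimal path for the original problem. We demonstrate its effectiveness as a convex steering function in an RRT planner for minimum-snap quadrotor planning with $C^4$ continuous trajectories.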

English abstract

We study semidefinite relaxations for collision-free motion planning. We focus on a point robot moving from start to goal through spherical obstacles in $\mathbb{R}^n$, subject to path continuity constraints and squared derivative costs, a setting that is conceptually simple yet captures the hardness of collision-free motion planning. We formulate this problem exactly as a nonconvex problem over polynomial curves, and present a natural semidefinite relaxation. We contribute two key theoretical insights; to our knowledge this is the first theoretical analysis of semidefinite relaxations for collision-free motion planning. First, we show that solving the convex relaxation is equivalent to solving, to global optimality, a related motion planning problem in a potentially higher-dimensional space. This geometric interpretation yields necessary and sufficient conditions for tightness, and a clear intuition for when the relaxation is loose. Second, we show that the relaxation admits a symmetry reduction that makes it significantly smaller than one might expect, with positive semidefinite cone sizes that scale linearly with the polynomial degree and are independent of the ambient dimension. The resulting relaxation is 10 to 100 times faster than direct nonlinear programming transcriptions solved with SNOPT and IPOPT, exhibits significantly lower variance in solve times, and reliably finds a locally optimal path for the original problem. We demonstrate its effectiveness as a convex steering function in an RRT planner for minimum-snap quadrotor planning with $C^4$ continuous trajectories.
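For orientation, the setting in the abstract might be written roughly as below. This is a sketch under assumptions, not the paper's formulation. The symbols $p$, $x_s$, $x_g$, $c_i$, $r_i$, $m$, and $k$, the unit time interval, and the use of a single polynomial segment are all notational choices made here. The abstract's mention of path continuity constraints suggests several segments joined together, which this sketch omits.

```latex
\begin{aligned}
\min_{p}\quad & \int_0^1 \bigl\| p^{(k)}(t) \bigr\|_2^2 \, dt \\
\text{s.t.}\quad & p(0) = x_s, \qquad p(1) = x_g, \\
& \| p(t) - c_i \|_2^2 \;\ge\; r_i^2
  \qquad \forall\, t \in [0,1],\quad i = 1, \dots, m .
\end{aligned}
```

Here $p$ is a polynomial curve in $\mathbb{R}^n$, and $c_i$ and $r_i$ are the center and radius of the $i$-th spherical obstacle. The cost is a convex quadratic in the coefficients of $p$, while each obstacle constraint keeps the curve outside a ball, which is a nonconvex set. That is the source of the nonconvexity the abstract refers to.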