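arXivDaily: a daily academic digest synced with the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, and electrical engineering and systems science.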

2606.14060 2026-06-15 cs.LG cs.CL New submission

Non-Parametric Machine Text Detection via Multi-View Gaussian Processes

Aleem Khan, Nicholas Andrews

Affiliations: Johns Hopkins University

AI summary: Proposes a multi-view, non-parametric detection framework that combines complementary feature views through a Gaussian process ensemble, improving robustness to adversarial attacks and providing calibrated probabilities and principled abstention on out-of-distribution inputs.

Abstract

Adversarial conditions such as paraphrasing and targeted style transfer sharply degrade the accuracy of machine text detectors. A document, however, carries multiple complementary signals (e.g., stylistic features, likelihood and rank-order features, and structural features), and an attack that suppresses one may leave others intact. While a parametric classifier can learn to combine these features given sufficient supervision, classifiers are prone to making confidently incorrect predictions when the distribution shifts (e.g., novel attacks or unseen language models). To address this, we propose a multi-view, non-parametric detection framework that extracts complementary feature views from the same document and aggregates per-view evidence through a Gaussian process ensemble. By aggregating evidence across views, an adversary must simultaneously defeat multiple independent axes of detection, substantially raising the cost of evasion. The Gaussian process formulation additionally provides calibrated probabilities and principled abstention on out-of-distribution inputs, supporting reliable deployment in high-stakes settings. We evaluate on three benchmarks spanning diverse generators and attacks (the DetectRL and RAID benchmarks and the PAN2025 shared task) and demonstrate that our multi-view detector maintains strong performance under the considered attacks, outperforming existing approaches against held-out attacks.
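The idea of aggregating per-view evidence and abstaining under high uncertainty can be illustrated with a small sketch. It assumes each view yields a Gaussian posterior (mean and variance) over a latent "machine-written" score and fuses them by precision weighting; this fusion rule, the variance threshold, and all numbers are illustrative assumptions, not the authors' method.

```python
import math


def combine_views(means, variances):
    """Precision-weighted fusion of independent Gaussian predictions (an assumed rule)."""
    precisions = [1.0 / v for v in variances]
    total_precision = sum(precisions)
    mean = sum(m * p for m, p in zip(means, precisions)) / total_precision
    return mean, 1.0 / total_precision


def decide(means, variances, max_variance=0.5):
    """Abstain when the fused prediction is too uncertain, else return (label, probability)."""
    mean, var = combine_views(means, variances)
    if var > max_variance:  # hypothetical abstention threshold
        return "abstain", None
    # Probit link with the latent variance folded in: E[Phi(f)] = Phi(mean / sqrt(1 + var)).
    prob_machine = 0.5 * (1.0 + math.erf(mean / math.sqrt(2.0 * (1.0 + var))))
    return ("machine" if prob_machine >= 0.5 else "human"), prob_machine


# Hypothetical per-view posteriors (e.g. stylistic, likelihood, structural views).
in_dist = decide([1.2, 0.8, 1.5], [0.3, 0.4, 0.5])   # confident views
shifted = decide([1.2, 0.8, 1.5], [3.0, 4.0, 5.0])   # same means, far larger variances
print(in_dist, shifted)
```

In this sketch an attack that flips one view's mean is outvoted by the others, while inflated variance across all views triggers abstention instead of a confident error.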

2606.14058 2026-06-15 cs.RO New submission

ReactSim-Bench: Benchmarking Reactive Behavior World Model Simulation in Autonomous Driving

Zhiyuan Zhang, Yanlun Peng, Jianing Zhang, Xianda Guo, Zehan Huang, Haoran Liu, Qifeng Li, Shaofeng Zhang, Xiaosong Jia, Junchi Yan

Affiliations: School of Computer Science & School of Artificial Intelligence, Shanghai Jiao Tong University; Great Wall Motor; Institute of Trustworthy Embodied AI (TEAI), Fudan University; School of Computer Science, Wuhan University; University of Science and Technology of China

AI summary: Proposes ReactSim-Bench, which decouples control of the AV from that of surrounding agents and feeds log-divergent AV behaviors as input to evaluate the reactive capability of behavior world model simulation, systematically benchmarking multiple models with collision, map-based, and kinematic metrics.

Abstract

Reactive capability is a key property of data-driven behavior world model simulators for autonomous driving simulation systems. With this capability, simulated world agents can respond feasibly to autonomous vehicle (AV) behaviors that differ from the log. However, existing behavior simulation benchmarks do not directly measure reactive capability. They often let the simulator jointly control the AV and surrounding agents and evaluate realism through log similarity or open-loop prediction metrics. In this work, we introduce ReactSim-Bench for evaluating the reactive capability of behavior world model simulation in autonomous driving. We decouple the control of the agents from that of the AV, supplying AV behaviors that differ from the log as independent AV inputs to which the agents must respond. To obtain these AV behaviors, we construct a pipeline that uses an AV planner model to generate candidate behaviors and filters the data using rules and manual verification. Collision metrics, map-based metrics, and kinematic feasibility metrics are used to evaluate the safety and rule compliance of reactive responses. We construct 2,636 test scenarios with three categories and conduct a systematic evaluation of state-of-the-art models across multiple architectures, including Transformer-based, diffusion-based, and next-token-prediction-based models. We further analyze how replan frequency affects performance and provide insights for future studies.

2606.14049 2026-06-15 cs.SD cs.CV New submission

FoleyGenEx: Unified Video-to-Audio Generation with Multi-Modal Control, Temporal Alignment, and Semantic Precision

Shiyao Wang, Xijuan Zeng, Hui Wang, Shiwan Zhao, Feng Deng, Chen Zhang, Yong Qin

Affiliations: Academy for Advanced Interdisciplinary Studies, Nankai University; Kling Team, Kuaishou Technology

AI summary: Proposes FoleyGenEx, a unified framework that uses conditional injection, multi-modal dynamic masking, and adverb-based data augmentation to achieve synchronized video-to-audio synthesis with multi-modal control, frame-level temporal alignment, and fine-grained semantics.

Comments: Accepted by INTERSPEECH 2026

Journal ref: INTERSPEECH 2026

Abstract

We present FoleyGenEx, a unified video-to-audio (VTA) framework integrating multi-modal control, frame-level temporal alignment, and fine-grained semantics, enabling synchronized, versatile audio synthesis for diverse tasks. Existing VTA methods either have multi-modal control but weak temporal alignment or strong alignment but lack reference audio conditioning and semantic precision. FoleyGenEx fills this gap via three core innovations: a conditional injection mechanism for audio-controlled VTA and Foley extension, a multi-modal dynamic masking strategy preserving training synchronization, and an adverb-based data augmentation algorithm leveraging signal processing and large language models to enhance textual supervision with nuanced semantics. Experiments on AudioCaps, VGGSound, and Greatest Hits demonstrate its competitive controllable VTA performance against existing methods. Demo samples are available at https://foleygenex.github.io/FoleyGenEx.

2606.14048 2026-06-15 cs.CV cs.RO New submission

WAM4D: Fast 4D World Action Model via Spatial Register Tokens

Ying Li, Xiaobao Wei, Jiajun Cao, Hao Wang, Xiaowei Chi, Chengyu Bai, Qianpu Sun, Jiajun Li, Xiaojie Zhang, Jian Tang, Sirui Han, Shanghang Zhang

Affiliations: Peking University; The Hong Kong University of Science and Technology; Beijing Innovation Center of Humanoid Robotics

AI summary: Proposes WAM4D, which uses lightweight spatial register tokens to transfer pretrained geometric priors into a causal video-action transformer for efficient 4D world action modeling, improving spatial consistency on RoboTwin 2.0 and real-world manipulation tasks while keeping inference fast.

Comments: 15 pages, 7 figures, 9 tables

Abstract

World action models (WAMs) have recently shown promise in jointly modeling future observations and executable robot actions. However, most existing WAMs still operate in 2D video or latent spaces, where visually plausible rollouts miss the 3D spatial constraints and occluded contact geometry required for precise manipulation. While geometric foundation models offer strong priors for recovering dense 3D structure and motion from visual observations, forcing WAMs to predict the dense 4D representation introduces costly geometric decoding and slows down causal action generation. To address the trade-off, we present WAM4D, a fast 4D world action model that uses lightweight spatial register tokens as training-time future-depth readouts to transfer pretrained geometric priors into a causal video-action transformer, then removes the register branch for lightweight action inference. To prevent non-causal shortcuts, we further design causal mixture attention for the Mixture-of-Transformers (MoT) WAM backbone, defining modality-specific visibility among video, action, and geometry tokens. Comprehensive experiments on RoboTwin 2.0 and challenging real-world manipulation tasks show that WAM4D improves spatial consistency and achieves competitive action prediction while maintaining efficient inference.

2606.14042 2026-06-15 cs.CV New submission

Rethinking One-Step Image Editing through ChordEdit: Reproduction, Simplification, and New Insights

Minghan Li, Jeremy Moebel, Mengyu Wang

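Affiliations: Harvard AI and Robotics Lab

AI summary: By reproducing, ablating, and simplifying ChordEdit, this paper explains its mechanism: the chord window acts as a timestep shift, chord transport performs low-frequency semantic editing, and proximal alignment restores high-frequency detail. Editing thus decomposes into a coarse low-frequency transport stage and a fine high-frequency alignment stage, opening a new path toward adaptive editing.

Comments: 9 pages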

2606.14040 2026-06-15 cs.LG New submission

Decompose Sparsely Where You Should, Absorb Densely Where You Should Not

Ruixuan Deng, Zehao Jin, Zekun Wang, Zihan Dong

Affiliations: Georgia Institute of Technology

AI summary: Challenging the assumption in sparse autoencoders that all activation content can be sparsely decomposed, the paper adds a low-rank linear bottleneck alongside a standard SAE to absorb the dense component. On Gemma-2-2B layer 12, a rank-24 bottleneck reduces dense latents by up to 84%, and the absorbed component is shown to be a computational scaffold that is structurally identifiable, causally necessary, and redundantly encoded by the sparse dictionary.

Abstract

Sparse autoencoders (SAEs) are typically trained to reconstruct the entire residual stream through a sparse dictionary, implicitly assuming that all activation content is amenable to sparse, monosemantic decomposition. We question this assumption and hypothesize that activations contain a low-rank, dense component that is computationally important to the model yet inherently unsuitable for sparse representation, which serves as a major source of the persistent dense latents widely observed in trained SAEs. To test this, we add a small rank-r linear bottleneck in parallel with standard SAEs (BatchTopK and Matryoshka), allowing dense structure to be absorbed before sparse reconstruction. On Gemma-2-2B layer 12, a rank-24 bottleneck reduces dense latent count by up to 84% while improving sparse probing and targeted probe perturbation on both architectures at matched sparsity. The absorbed component is (i) structurally identifiable as the top principal components and outlier dimensions; (ii) causally necessary, with removing it raising next-token cross-entropy by 7.5×, far exceeding the 2.8× from removing the geometrically near-identical top-24 PCA directions; and (iii) redundantly encoded by sparse dictionaries, with ablating 787 maximally aligned sparse features raising cross-entropy by only 2.9× and ablating 2,048 topic-aligned features leaving MMLU topic classification virtually unchanged, whereas removing the scaffold drops it from 98.7% to chance. Together, our findings identify a compact, semantically informative and causally important component of residual stream activations (which we term a computational scaffold) that standard sparse dictionaries represent inefficiently, suggesting that the scope of sparsity-based interpretability methods warrants careful re-examination.

2606.14037 2026-06-15 cs.CL New submission

Right or Wrong, Models Comply: Directional Blindness in LLM Moral Judgment

Jihye Kim, Jeffrey Flanigan

Affiliations: University of California, Santa Cruz

AI summary: Introduces Compliance Asymmetry (A = BCR/HCR), a bidirectional diagnostic, and finds that large language models follow helpful nudges more readily on factual judgments (A = 1.58) but follow helpful and misleading nudges almost equally on moral judgments (A = 1.04), revealing direction-blind compliance as an alignment failure mode.

Abstract

As language models take integrated roles across many domains, the response of LLMs to user pushback becomes a critical alignment property. Yet many existing evaluations treat compliance as unidirectional, measuring whether models resist pressure but not whether they resist it selectively. We introduce Compliance Asymmetry (A = BCR/HCR), a bidirectional diagnostic that compares beneficial output change under helpful nudges with harmful change under misleading nudges. Across 9 models and 972,000 nudge-condition responses, we find that this selectivity differs in factual and moral judgments: models follow helpful nudges more than harmful ones on factual questions (A = 1.58), but follow both directions at nearly identical rates on moral questions (A = 1.04). This phenomenon persists across model families, capability levels, and nudging types. Interestingly, we also find that chain-of-thought prompting amplifies helpful and harmful compliance together, while identity-based prompting suppresses both by nearly identical margins. These results identify direction-blind moral compliance as a distinct failure mode in current LLMs and suggest that alignment should target directionally calibrated updating rather than lower compliance alone.
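The ratio A = BCR/HCR can be illustrated with a short calculation. The sketch below assumes BCR is the rate at which helpful nudges produce a beneficial output change and HCR the rate at which misleading nudges produce a harmful one; the abstract does not expand the acronyms, and the counts are hypothetical values chosen only to reproduce the two reported regimes.

```python
def compliance_rate(changed: int, total: int) -> float:
    """Fraction of nudged responses whose answer changed in the nudge's direction."""
    if total <= 0:
        raise ValueError("total must be positive")
    return changed / total


def compliance_asymmetry(bcr: float, hcr: float) -> float:
    """A = BCR / HCR. A > 1 means helpful nudges are followed more often than harmful ones."""
    if hcr <= 0:
        raise ValueError("HCR must be positive for the ratio to be defined")
    return bcr / hcr


# Hypothetical counts, not taken from the paper.
factual = compliance_asymmetry(compliance_rate(316, 1000), compliance_rate(200, 1000))
moral = compliance_asymmetry(compliance_rate(260, 1000), compliance_rate(250, 1000))
print(f"factual A = {factual:.2f}, moral A = {moral:.2f}")
```

A value near 1, as in the moral case, means the model changes its answer about as often when pushed toward a worse answer as when pushed toward a better one.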

2606.14035 2026-06-15 cs.CV New submission

Toward 360-Degree Indoor Panorama Editing via Tuning-Free Diffusion Model with Refocusing Cross-Attention

Dinh-Khoi Vo, Nhut-Thanh Le-Hinh, Viet-Tham Huynh, Tam V. Nguyen, Minh-Triet Tran, Trung-Nghia Le

AI summary: Proposes FocusDiff, a tuning-free framework that uses refocusing cross-attention for precise region-specific editing, extends it to 360-degree indoor panorama editing, and outperforms existing zero-shot methods on the localized editing benchmark LIMB.

Comments: ICCCI 2026. Project page: https://vdkhoi20.github.io/FocusDiff

Abstract

Zero-shot text-guided diffusion has significantly advanced image editing; however, its practical usability remains constrained by three persistent challenges: prompt brittleness that requires meticulous prompt engineering, spillover edits that unintentionally affect non-target regions, and failures on small or cluttered objects caused by limited fine-grained supervision in training data. We propose FocusDiff (Target-Aware Refocusing for Tuning-Free Diffusion Editing), a tuning-free framework for precise and region-specific image manipulation based on refocusing cross-attention. Given a target region obtained through automated segmentation or manual selection, FocusDiff applies selective blurring to non-edit areas to guide attention toward the masked region while accurately transferring the object's identity, structure, and appearance to the edited output. Integrated context-preserving modules further ensure background fidelity and global coherence, enabling accurate edits from simple text prompts in a single pass. We also extend FocusDiff to 360-degree indoor panorama editing and demonstrate its effectiveness within virtual reality environments. Extensive experiments on our localized editing benchmark LIMB, comprising 30 multi-object images and 100 annotated examples including challenging small-object cases, show that FocusDiff outperforms existing zero-shot editors in text-image alignment and background preservation, achieving superior precision, photorealism, and usability. The project page is available at https://vdkhoi20.github.io/FocusDiff.

2606.14032 2026-06-15 cs.RO New submission

From Attacks to Curricula: Learnability-Guided Adversarial Training for Safe Autonomous Driving

Yuewen Mei, Tong Nie, Jie Sun, Haotian Shi, Wei Ma, Jian Sun

Affiliations: College of Transportation & Key Laboratory of Road and Traffic Engineering of Ministry of Education, Tongji University; Department of Civil and Environmental Engineering, The Hong Kong Polytechnic University

AI summary: Proposes AlignADV, a framework that generates resolvable scenarios through preference alignment, predicts policy capability from behavioral fingerprints, and dynamically samples curricula to improve the convergence efficiency and safety of adversarial training for autonomous driving.

Abstract

Closed-loop adversarial training improves autonomous driving safety by exposing policies to rare safety-critical scenarios. Standard pipelines first generate adversarial scenarios and then sample them for policy optimization. However, most existing frameworks remain attack-oriented: collision-driven generators often synthesize unsolvable extreme situations, which can degrade learning, while heuristic samplers ignore the evolving capability of the driving policy, causing sample inefficiency and delayed convergence. We propose AlignADV, a learnability-guided closed-loop adversarial training framework that converts adversarial scenarios into resolvable and capability-aligned curricula. First, we reformulate adversarial scenario generation as a preference alignment problem and employ direct preference optimization to guide the generator toward critical yet resolvable scenarios. Second, we introduce behavioral fingerprints to capture the intrinsic characteristics of the evolving policy and construct a multi-modal capability prediction model that estimates policy performance without expensive closed-loop simulations. By combining resolvability-aligned scenarios with capability predictions, AlignADV develops a dynamic curriculum sampling mechanism that prioritizes scenarios targeting the current policy's vulnerabilities. Experiments on the Waymo Open Motion Dataset demonstrate that AlignADV improves convergence efficiency and final performance, reducing training steps by up to 40.6 percent compared with baseline methods while lowering collision rate and improving route completion under both normal and adversarial traffic conditions. These results highlight a shift from attack-oriented scenario generation to learnability-guided policy improvement, offering a principled direction for safer and more efficient autonomous driving training. Project page: https://meiyuewen.github.io/AlignADV/.

2606.14031 2026-06-15 cs.AI New submission

Applicability Condition Extraction for Therapeutic Drug-Disease Relations

Guanting Luo, Noriki Nishida, Yuji Matsumoto, Yuki Arase

Affiliations: The University of Osaka; RIKEN; Institute of Science Tokyo; Tohoku University

AI summary: Introduces the task of extracting applicability conditions for therapeutic drug-disease relations from biomedical literature, builds the first manually annotated dataset, and enhances LoRA to account for relations between drugs and diseases, outperforming baselines across multiple evaluation settings.

Abstract

Identifying the conditions under which a drug has a therapeutic effect on a target disease is crucial for clinical decision support. However, most existing biomedical information extraction methods have focused on identifying only relations between drugs and diseases, while largely overlooking the context-specific conditions under which such relations apply. To address this problem, we introduce the task of applicability condition extraction for therapeutic drug-disease relations from biomedical research literature. We create the first dataset with manually annotated triples of drugs, diseases, and applicability conditions on biomedical paper abstracts, covering 1,119 drug-disease pairs. Using this dataset, we systematically evaluate the performance of a range of existing methods. In addition, we propose a new method that enhances LoRA to consider relations between drugs and diseases. Our method consistently outperforms strong baselines across different evaluation settings. The source code and dataset of this paper can be obtained from: https://github.com/guantingluo98/Drug-ACE

2606.14030 2026-06-15 cs.SD cs.CL New submission

Efficiency-Performance Trade-offs in Neural Speaker Diarization via Structured Pruning and Low-Bit Quantization

Rishit Chatterjee, Tahiya Chowdhury

Affiliations: Department of Computer Science, Colby College

AI summary: Targeting streaming speaker diarization on resource-constrained hardware, the paper compresses the segmentation model with structured pruning and low-bit quantization, studies the trade-offs across latency budgets, and finds that FP16 halves model size at the cost of a 40% relative DER increase.

Comments: 6 pages, 3 figures, preprint

Abstract

Streaming speaker diarization is crucial for time-critical medical dispatch, but deploying it on resource-constrained hardware requires smaller, faster models. Using SIMSAMU, a dataset of simulated medical-dispatch conversations, we evaluate streaming behavior before compressing the segmentation model with pruning and low-bit quantization. We characterize performance across a range of streaming latency budgets and find that additional buffering is not consistently beneficial, while very low-latency operating points can substantially degrade performance. Our study shows that model compression trades performance for memory footprint, and we highlight an operating point where FP16 reduces model size by half with essentially unchanged real-time factor, at a cost of a 40% relative DER increase against the baseline. This work characterizes the trade-offs for real-time deployment and contributes to speech technology that can enable reliable human communication in time-critical contexts.
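The FP16 operating point rests on two pieces of arithmetic: halving bytes per weight, and a relative (not absolute) DER change. A minimal sketch follows; the parameter count and DER values are hypothetical placeholders, not figures from the paper.

```python
import struct

BYTES_FP32 = struct.calcsize("f")  # bytes per single-precision weight
BYTES_FP16 = struct.calcsize("e")  # bytes per half-precision weight


def model_size_mb(num_params: int, bytes_per_param: int) -> float:
    """Weight storage only; ignores file metadata and activation memory."""
    return num_params * bytes_per_param / 1e6


def relative_increase(baseline: float, new: float) -> float:
    """Relative change of a metric, e.g. 0.40 for a 40% relative increase."""
    return (new - baseline) / baseline


num_params = 1_500_000   # hypothetical parameter count
fp32_mb = model_size_mb(num_params, BYTES_FP32)
fp16_mb = model_size_mb(num_params, BYTES_FP16)

baseline_der = 0.10      # hypothetical baseline diarization error rate
fp16_der = 0.14          # hypothetical DER after FP16 conversion
der_change = relative_increase(baseline_der, fp16_der)
print(fp32_mb, fp16_mb, der_change)
```

Under these placeholder values, a 40% relative increase moves DER from 10% to 14%, which is four absolute points.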

2606.14029 2026-06-15 cs.LG New submission

Utility-Constrained Policy Optimization

Mehrdad Moghimi, Bernardo Avila Pires

Affiliations: York University; Google DeepMind

AI summary: Proposes a simple yet powerful method for utility-constrained MDPs that supports risk-sensitive constraints without fixing constraint limits in advance, matching or outperforming existing baselines on several safety benchmark tasks.

Abstract

Constrained MDPs (CMDPs) are a widely adopted framework for incorporating safety into RL agents; however, the framework does not support risk-sensitive constraints. This can be problematic: For example, CMDPs allow for optimal solutions that, in order to satisfy the risk-neutral constraints, mix infrequent catastrophic behaviors and frequent, overly conservative ones. Moreover, prior empirical results suggest that enforcing stricter, risk-sensitive constraints can improve performance even under risk-neutral evaluation. The natural framework to incorporate risk-sensitive constraints is utility-constrained MDPs (UCMDPs), but no practical solutions for this problem existed. In this work, we introduce a simple yet powerful methodology for UCMDPs and constrained RL. Besides allowing for risk-sensitive constraints, our framework does not require us to fix constraint limits in advance of training the agent, provided that a sensible range is known. This increases policy flexibility and, in practice, allows for adjustments to these limits at no extra training cost. Besides benefiting from the generality of the framework, our agent shows strong performance in practice, consistently matching or outperforming existing baselines in several Safety Gymnasium benchmark tasks.
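The failure mode described above, a policy that meets a risk-neutral cost limit by mixing rare catastrophes with over-cautious behavior, can be made concrete with a toy comparison. The sketch uses CVaR as one example of a risk-sensitive measure; the abstract does not say which utility functions the authors use, and the two cost distributions are hypothetical.

```python
def expected_cost(outcomes):
    """outcomes: list of (probability, cost) pairs."""
    return sum(p * c for p, c in outcomes)


def cvar(outcomes, alpha):
    """Mean cost over the worst alpha-fraction of probability mass (upper tail)."""
    remaining = alpha
    total = 0.0
    for p, c in sorted(outcomes, key=lambda pc: pc[1], reverse=True):
        take = min(p, remaining)
        total += take * c
        remaining -= take
        if remaining <= 0:
            break
    return total / alpha


# Two hypothetical policies with the same expected cost.
steady = [(1.0, 10.0)]                  # always incurs a moderate cost
mixed = [(0.9, 0.0), (0.1, 100.0)]      # mostly over-cautious, rarely catastrophic

limit = 10.0
for name, policy in [("steady", steady), ("mixed", mixed)]:
    meets_mean_limit = expected_cost(policy) <= limit + 1e-9
    print(name, meets_mean_limit, cvar(policy, alpha=0.1))
```

Both policies satisfy the expected-cost limit, but only the tail measure separates them, which is the kind of distinction a risk-neutral CMDP constraint cannot express.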

2606.14025 2026-06-15 cs.CV New submission

GarmentSketch: Large-scale Sketch-to-Fashion Benchmark

Duong-Duy-Khang Bui, Minh-Tan Pham, Tam V. Nguyen, Minh-Triet Tran, Trung-Nghia Le

AI summary: To address the lack of large-scale paired data for fashion sketch-to-image synthesis, the paper builds GarmentSketch, a dataset of 26,249 sketch-text pairs with descriptions generated by multimodal large language models and refined by humans, and evaluates existing generative models on it.

Comments: ICCCI 2026. Project page: https://khangbdd.github.io/garmentsketch

Abstract

Fashion sketching is a cornerstone of design workflows, allowing rapid visualization of creative concepts prior to physical prototyping. Yet, progress in sketch-based fashion image synthesis has been hindered by the absence of large-scale, high-quality paired resources. To bridge this gap, we present GarmentSketch, a novel dataset comprising 26,249 fashion sketches across 21 garment categories, each paired with detailed textual descriptions. Captions were produced through a multi-stage pipeline that integrates multiple multimodal large language models (MLLMs) with human-in-the-loop refinement, ensuring both semantic accuracy and descriptive richness. We benchmark GarmentSketch on state-of-the-art generative models, providing baseline performance for sketch-guided text-to-image generation. Our experiments reveal both the promise and the current limitations of existing methods. By offering a comprehensive and richly annotated resource, GarmentSketch establishes a foundation for advancing sketch understanding, fine-grained fashion image generation, and creative human-AI collaboration in design. The dataset will be available at: https://khangbdd.github.io/garmentsketch.

2606.14024 2026-06-15 cs.CV New submission

ViT-Up: Faithful Feature Upsampling for Vision Transformers

Krispin Wandel, Jingchuan Wang, Hesheng Wang

Affiliations: Shanghai Jiao Tong University

AI summary: Proposes ViT-Up, an implicit feature upsampling framework that builds layer-wise queries from intermediate ViT hidden states to predict features at arbitrary continuous coordinates, avoiding the feature leakage and blur caused by image guidance and outperforming existing methods on dense prediction and semantic correspondence.

Comments: Code is available at: https://github.com/krispinwandel/vit-up

Abstract

Vision Transformers (ViTs) have become a dominant architecture for visual representation learning, providing exceptionally strong and broadly reusable backbone features. However, ViTs are commonly operated on relatively small patch-token grids due to the quadratic cost of global self-attention, which creates a persistent bottleneck for dense prediction tasks such as semantic segmentation and depth estimation. This has motivated the development of task-agnostic feature upsamplers. While recent state-of-the-art methods produce visually sharp dense representations, their reliance on shallow image encoders for guided upsampling can introduce feature leakage, fragmentation, and blur. We introduce ViT-Up, an implicit feature upsampling framework that replaces external image guidance with layer-wise query construction from intermediate ViT hidden states. This enables feature prediction at arbitrary continuous image coordinates while preserving alignment with the backbone feature space. Experiments demonstrate that ViT-Up consistently outperforms state-of-the-art image-guided upsamplers across dense prediction and semantic correspondence. On DINOv3-S+, ViT-Up improves over prior methods by up to +2.07 mIoU on Cityscapes and +4.17 PCK@0.10 on SPair-71k. With the larger DINOv3-B backbone, these gains increase to +3.36 mIoU and +8.09 PCK@0.10, demonstrating that ViT-Up scales favorably with backbone capacity.

2606.14022 2026-06-15 cs.LG New submission

PostDeg: Placement Beats Parameterization in LayerNorm GNNs

Yash Tomar, Aryav Das

Affiliations: Purdue University; Park Tudor High School

AI summary: Finds that LayerNorm erases topology signals while the post-LayerNorm position preserves them; proposes PostDeg, a parameter-free post-LayerNorm inverse-degree scale that yields gains on three combinatorial optimization tasks, with none of the four pre-registered falsifiers firing.

Comments: Yash Tomar and Aryav Das contributed equally to this work

Abstract

LayerNorm-based GNNs routinely erase the topology signals (degree, centrality, k-core) that node-selection policies should depend on, but the literature has not located where in the residual block the erasure happens. We answer that question: a positive per-node scalar inserted before LayerNorm is divided out up to a stabilizer term, while the same scalar inserted after LayerNorm reaches the score head as representation magnitude. The surviving slot is the post-LayerNorm position. We instantiate it with PostDeg, a parameter-free post-LayerNorm inverse-degree scale, and pre-register four falsifiers (graphwise scalars, extra LayerNorm, expressive same-slot capacity, backbone-agnostic source) that would reject the rule. PostDeg gains +3.5%/+2.5%/+5.6% over the LN backbone on influence maximization, network dismantling, and maximum independent set, with 10/10 paired-seed wins per task; none of the four falsifiers fires. The takeaway is that placement, not parameterization, carries the gain: a small invariance check that generalizes to any positive topology scalar in any normalized residual stack.
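The placement argument can be checked numerically on a single vector. The sketch below uses a LayerNorm without learned affine parameters and takes the scalar to be 1/degree; the embedding, the degree, and that exact form of the inverse-degree scale are illustrative assumptions rather than the authors' implementation.

```python
import math


def layer_norm(x, eps=1e-5):
    """LayerNorm without learned affine parameters: (x - mean) / sqrt(var + eps)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def norm(x):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(v * v for v in x))


h = [0.5, -1.2, 2.0, 0.3]   # hypothetical node embedding
degree = 8                   # hypothetical node degree
s = 1.0 / degree             # positive per-node scalar (assumed inverse-degree form)

plain = layer_norm(h)
pre = layer_norm([s * v for v in h])     # scalar applied before LayerNorm
post = [s * v for v in layer_norm(h)]    # scalar applied after LayerNorm

# Pre-LN: the scalar is divided out, leaving only a small eps-dependent residue.
max_diff = max(abs(a - b) for a, b in zip(pre, plain))
# Post-LN: the scalar survives as the magnitude of the representation.
ratio = norm(post) / norm(plain)
print(max_diff, ratio)
```

The small residue in the pre-LayerNorm case comes from eps, which corresponds to the "stabilizer term" mentioned in the abstract.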

2606.14010 2026-06-15 cs.CV cs.LG cs.RO New submission

RT-VLA: Real-Time Vision-Language-Action Models via Knowledge Distillation

Xiangyu Huang, Zhenlin Hua, Han Zhou, Shounak Sural, Ragunathan Rajkumar

发表机构 * Carnegie Mellon University（卡内基梅隆大学）

AI总结提出RT-VLA，通过多级监督蒸馏将SimLingo模型的能力压缩至轻量学生模型，在保持竞争性能的同时将推理时间降低44.8倍（纯视觉模式）和7.9倍（视觉+语言模式），实现实时可解释的VLA自动驾驶。

AI中文摘要

视觉-语言-动作（VLA）模型通过联合建模视觉感知、语言推理、可解释性和动作预测，在端到端自动驾驶中展现出强大潜力。然而，其庞大的视觉-语言骨干网络和推理模块引入了显著的推理延迟，从而阻碍了它们在道路网络严苛现实中的部署。我们提出RT-VLA，一种轻量级、蒸馏的VLA模型，通过多级监督蒸馏将最先进的SimLingo模型的驾驶和推理能力迁移到紧凑的学生模型中。RT-VLA保留了基于语言的推理，并通过离线语言分析安全关键驾驶时刻来支持事后解释，而不增加实时控制的延迟。与SimLingo教师模型相比，RT-VLA在保持竞争性的闭环驾驶和语言推理性能的同时，在纯视觉模式下将推理时间减少了44.8倍，在视觉+语言模式下减少了7.9倍。这些结果表明，监督蒸馏是构建实时、可解释的VLA风格自动驾驶模型的实用方法。

英文摘要

Vision-Language-Action (VLA) models have shown strong potential for end-to-end autonomous driving by jointly modeling visual perception, language reasoning, explainability, and action prediction. However, their large vision-language backbones and reasoning modules introduce substantial inference latency, which prevents their deployment on real road networks. We propose RT-VLA, a lightweight, distilled VLA model that transfers the driving and reasoning capabilities of the state-of-the-art SimLingo model into a compact student through multi-level supervised distillation. RT-VLA preserves language-based reasoning and supports post-hoc explanation through offline language analysis of safety-critical driving moments without adding latency to real-time control. Compared to the SimLingo teacher, RT-VLA maintains competitive closed-loop driving and language reasoning performance while reducing inference time by 44.8x in vision-only mode and 7.9x in vision+language mode. These results suggest that supervised distillation is a practical approach for building real-time, explainable VLA-style autonomous driving models.


2606.14006 2026-06-15 cs.CV cs.ET 新提交

HARBOR: Heading Analysis and Reconstruction from Behavioral Observation and Radar

HARBOR：基于行为观测与雷达的航向分析与重建

Joao P. A. Dantas, Paulo F. Silva Filho, Jelton A. Cunha, Gabriel Dietzsch

发表机构 * Institute for Advanced Studies (IEAv)（高级研究所（IEAv））

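AI summary: Proposes the HARBOR pipeline, which predicts vessel motion from a single SAR image with no auxiliary data at inference time, estimating heading from skeleton geometry and local intensity and using AIS-derived motion parameters calibrated offline to generate probabilistic heatmaps.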

AI中文摘要

海上态势感知通常依赖自动识别系统（AIS）传输来跟踪船只运动。然而，在作战或冲突场景中，由于信号丢失、故意关闭或有意欺骗，这些数据可能不可用。在此条件下，合成孔径雷达（SAR）图像成为广域海上监测的关键传感替代方案，尽管仅提供静态场景快照。本文介绍HARBOR（基于行为观测与雷达的航向分析与重建），一个完整的管道，用于将单张SAR图像转换为预测运动信息，而无需在推理时使用任何辅助数据源。该方法首先进行SAR图像预处理以增强和分割船只候选区域，然后通过骨架几何和局部强度模式进行自动检测、基于尺寸的分类和航向估计。AIS数据仅在离线校准阶段用于推导依赖船只类型的运动参数，随后应用于生成候选未来船只位置的概率热图。使用真实COSMO-SkyMed SAR图像进行的案例研究展示了该管道在巴西南部海上场景中的应用，显示了其在数据拒绝环境中提取运动趋势并生成船只位置概率投影的能力。

英文摘要

Maritime situational awareness often relies on Automatic Identification System (AIS) transmissions to track vessel movements. However, in operational or conflict scenarios, these data may be unavailable due to signal loss, deliberate deactivation, or intentional spoofing. In such conditions, synthetic aperture radar (SAR) imagery becomes a critical sensing alternative for wide-area maritime monitoring, despite providing only static scene snapshots. This work introduces HARBOR (Heading Analysis and Reconstruction from Behavioral Observation and Radar), a complete pipeline for transforming a single SAR image into predictive motion information without requiring any auxiliary data source at inference time. The method begins with SAR image preprocessing to enhance and segment vessel candidates, followed by automatic detection, size-based classification, and heading estimation using skeleton geometry and local intensity patterns. AIS data are used exclusively during an offline calibration phase to derive vessel-type-dependent motion parameters, which are then applied to generate probabilistic heatmaps of candidate future vessel positions. A case study using real COSMO-SkyMed SAR imagery demonstrates the pipeline on a maritime scene in southern Brazil, showing its ability to extract motion tendencies and generate probabilistic projections of vessel positions in data-denied environments.
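
The last pipeline stage, turning an estimated heading plus calibrated motion parameters into a heatmap of future positions, can be sketched as a weighted dead-reckoning projection. This is only an illustration under stated assumptions: the abstract does not give the projection model, so the discrete speed and heading-error distributions, the grid binning, and the function name `project_heatmap` are all hypothetical.

```python
import math

def project_heatmap(x, y, heading_deg, speeds, heading_spread, dt, cell):
    """Project a vessel forward and bin the probability mass on a grid.

    x, y           : current position in metres (east, north)
    heading_deg    : estimated heading, clockwise from north
    speeds         : list of (speed_m_per_s, probability), assumed to come
                     from an offline calibration per vessel type
    heading_spread : list of (offset_deg, probability) modelling heading error
    dt             : projection horizon in seconds
    cell           : grid cell size in metres
    """
    heat = {}
    for offset, p_heading in heading_spread:
        theta = math.radians(heading_deg + offset)
        for speed, p_speed in speeds:
            # Clockwise-from-north convention: east uses sin, north uses cos.
            fx = x + speed * dt * math.sin(theta)
            fy = y + speed * dt * math.cos(theta)
            key = (math.floor(fx / cell), math.floor(fy / cell))
            heat[key] = heat.get(key, 0.0) + p_heading * p_speed
    return heat

# Made-up parameters for a single vessel heading roughly north-east.
heat = project_heatmap(
    0.0, 0.0, 45.0,
    speeds=[(4.0, 0.5), (6.0, 0.5)],
    heading_spread=[(-10.0, 0.25), (0.0, 0.5), (10.0, 0.25)],
    dt=600.0, cell=500.0,
)
print(sorted(heat.items()))
```

Each grid cell accumulates the joint probability of the speed and heading hypotheses that land in it, so the cell values sum to one when both input distributions do.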


2606.14005 2026-06-15 cs.CV 新提交

Context-Guided Semantic Alignment for Feature Fusion Networks

上下文引导的特征融合网络语义对齐

Hyungseop Lee, Jiho Lee, Woochul Kang

发表机构 * Department of Embedded Systems Engineering, Incheon National University（仁川国立大学嵌入式系统工程系）

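AI summary: Proposes FINE, a lightweight semantic alignment module that uses cross-level attention so that high-level context guides low-level features before fusion, and introduces alignment-aware token sampling to reduce computational complexity, improving object detection accuracy.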

Comments 26 pages, 12 figures, 8 tables

AI中文摘要

特征融合网络是现代目标检测器的基础组件，通过聚合多尺度特征来检测不同大小的物体。然而，直接融合来自不同金字塔层次的特征往往因其异构表示而导致语义不一致。本文提出特征交互网络（FINE），一种轻量级语义对齐模块，在融合前通过跨层级注意力机制利用高层上下文指导来细化低层特征。为弥合结构差距并确保计算效率，我们引入对齐感知令牌采样，对齐跨尺度的对应空间区域，将注意力复杂度降低一个数量级。生成的注意力权重产生一个空间-通道调制图，通过残差逐元素调制进行上采样并应用于低层特征。该机制确保网络选择性地增强语义相关像素，同时保留密集预测任务所需的亚像素定位精度。FINE普遍适用于各种检测器，并在不牺牲效率的情况下持续提升检测精度。

英文摘要

Feature fusion networks are fundamental components in modern object detectors, aggregating multi-scale features to detect objects of varying sizes. However, directly fusing features from different pyramid levels often introduces semantic inconsistency due to their heterogeneous representations. In this paper, we propose Feature Interaction NEtwork (FINE), a lightweight semantic alignment module that refines low-level features via high-level contextual guidance using cross-level attention prior to fusion. To bridge the structural gap and ensure computational efficiency, we introduce Alignment-Aware Token Sampling, which aligns corresponding spatial regions across scales and reduces the attention complexity by an order of magnitude. The resulting attention weights generate a spatial-channel modulation map that is upsampled and applied to the low-level features via residual element-wise modulation. This mechanism ensures that the network selectively enhances semantically relevant pixels while preserving the sub-pixel localization accuracy necessary for dense prediction tasks. FINE is generally applicable to various detectors and consistently improves detection accuracy without compromising efficiency.
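
The final step described above, upsampling a modulation map and applying it to low-level features through residual element-wise modulation, can be sketched on a single channel. The abstract does not state the exact formula, so the form `x * (1 + m)` and nearest-neighbour upsampling are assumptions, and the function names are hypothetical.

```python
def upsample_nearest(grid, factor):
    """Nearest-neighbour upsampling of a 2D list by an integer factor."""
    out = []
    for row in grid:
        wide = [v for v in row for _ in range(factor)]
        out.extend([list(wide) for _ in range(factor)])
    return out

def residual_modulate(low, mod_map, factor):
    """Apply a coarse modulation map to a fine feature map.

    Assumed residual form: out = low * (1 + upsample(mod_map)), so a zero
    map leaves the low-level features unchanged.
    """
    up = upsample_nearest(mod_map, factor)
    return [
        [x * (1.0 + m) for x, m in zip(low_row, map_row)]
        for low_row, map_row in zip(low, up)
    ]

low = [[1.0] * 4 for _ in range(4)]      # 4x4 low-level feature map (made up)
mod_map = [[0.0, 1.0], [0.5, 0.0]]       # 2x2 coarse modulation map (made up)
out = residual_modulate(low, mod_map, factor=2)
for row in out:
    print(row)
```

The residual form means that regions where the map is zero pass through untouched, which is one plausible way to keep fine localization detail while boosting selected regions.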


2606.14000 2026-06-15 cs.AI 新提交

Formalizing Numerical Analysis: An Agent Pipeline and Quality Audit Beyond Kernel Acceptance

数值分析的形式化：超越内核接受的智能体流水线与质量审计

Theodore Meek, Siyuan Ge, Di Qiu Xiang, Simon Chess, Vasily Ilin

发表机构 * University of California, Berkeley（加州大学伯克利分校）

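AI summary: Proposes a coding-agent pipeline that formalizes a numerical analysis textbook in Lean 4 and introduces a three-dimensional quality evaluation framework (semantic correctness, Mathlib reuse, cross-file reuse), finding that successful compilation masks unfaithful formalization patterns.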

AI中文摘要

近期工作表明，编码智能体可以在Lean 4中形式化整个高等数学教材，但现有努力集中在mathlib中已有充分表示的数学分支，并仅通过内核接受来衡量成功。我们通过将编码智能体应用于形式化《常微分方程数值方法》（一本数值分析教材，在mathlib中基本缺失）来解决这两个限制，从而考验智能体从头开发新理论的能力。我们进一步引入一个系统、可复现的三维框架，用于评估智能体生成的形式化质量，超越编译层面：语义正确性、Mathlib复用以及通过LLM-as-judge方法的跨文件复用。将该框架应用于我们自己的形式化以及RepoProver和M2F发布的输出，我们发现了内核接受完全掩盖的重复性不忠实形式化模式，包括不完整的多部分陈述、添加弱化假设和参数限制。我们的结果表明，基于编译的指标大大高估了形式化质量，我们提供了一种可复现的审计方法，以支持对未来自动形式化系统进行更严格的评估。

英文摘要

Recent work has demonstrated that coding agents can formalize entire advanced mathematics textbooks in Lean 4, yet existing efforts concentrate on branches of mathematics already well-represented in mathlib and measure success solely through kernel acceptance. We address both limitations by applying a coding agent to formalize Numerical Methods for Ordinary Differential Equations, a textbook in numerical analysis that is largely absent from mathlib, stressing the agent's capacity to develop new theory from scratch. We further introduce a systematic, reproducible three-dimensional framework for evaluating the quality of agent-produced formalizations beyond compilation: semantic correctness, Mathlib reuse, and cross-file reuse via LLM-as-judge methods. Applying this framework to our own formalization and to the released outputs of RepoProver and M2F, we uncover recurring unfaithful formalization patterns, including incomplete multi-part statements, added weakening hypotheses, and parameter restrictions, that kernel acceptance entirely obscures. Our results suggest that compilation-based metrics substantially overstate formalization quality, and we provide a reproducible audit methodology to support more rigorous evaluation of future autoformalization systems.


2606.13995 2026-06-15 cs.CL 新提交

Dialogue SWE-Bench: A Benchmark for Dialogue-Driven Coding Agents

Dialogue SWE-Bench: 对话驱动的编码智能体基准

Brendan King, Jeffrey Flanigan

发表机构 * University of California, Santa Cruz（加州大学圣克鲁兹分校）

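AI summary: Proposes the Dialogue SWE-Bench benchmark, which uses a user simulator to evaluate how well coding agents resolve software engineering problems through dialogue, and introduces a schema-guided agent that improves dialogue performance by 3-14%.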

Comments 22 pages, 13 figures

AI中文摘要

AI编码智能体已迅速改变软件工程，驱动着广泛使用的交互式编码助手。尽管它们在现实世界中是交互式使用的，但现有基准将其评估为完全自主系统。在这项工作中，我们引入了Dialogue SWE-Bench，一个自动基准数据集，用于评估编码智能体通过与用户对话解决现实世界软件工程问题的能力。我们设计了一个新颖的、基于角色设定的用户模拟器来支持我们的任务评估，并通过对话质量的自动评估来增强任务评估。我们还提出了一种新的模式引导智能体，旨在提升现成编码智能体的对话能力，相比强基线提升了3-14%。我们的结果表明，更好的编码模型并不总是对应更好的对话模型，这表明对话能力是编码智能体性能的一个独特且目前研究不足的维度。

英文摘要

AI coding agents have rapidly transformed software engineering, powering widely used interactive coding assistants. Although they are used interactively in the real world, existing benchmarks evaluate them as fully autonomous systems. In this work, we introduce Dialogue SWE-Bench, an automatic benchmark dataset for evaluating the ability of coding agents to resolve real-world software engineering problems through dialogue with a user. We design a novel, persona-grounded user simulator to support our task evaluation, and augment our task evaluation with automatic evaluations of dialogue quality. We also propose a new schema-guided agent, aimed at improving the dialogue capabilities of off-the-shelf coding agents, which improves over strong baselines by 3-14%. Our results indicate that better coding models do not always correspond to better dialogue models, suggesting that dialogue capability is a distinct and currently understudied dimension of coding agent performance.


2606.13993 2026-06-15 cs.CL 新提交

The Holistic Storage of Verb+Up Phrases in Text-based and Audio-based Language Models

文本和音频语言模型中动词+up短语的整体存储

Zachary Nicholas Houghton, Yu Zhou, Dan Pluth, Vijay K. Gurbani

发表机构 * University of Oregon（俄勒冈大学）； Vail Systems, Inc（Vail Systems公司）

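AI summary: Studies the holistic storage of verb+up phrases in text-based and audio-based language models, finding that frequency and predictability drive independent representations, which supports usage-based theories of language.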

2606.13990 2026-06-15 cs.RO 新提交

SplatlessDF: Continuous Distance Field Mapping with Non-Splatting Gaussians

SplatlessDF: 基于非溅射高斯分布的连续距离场映射

Monisha Mushtary Uttsha, Lan Wu, Teresa Vidal-Calleja

发表机构 * UTS Robotics Institute, Faculty of Engineering and IT, University of Technology Sydney（悉尼科技大学工程与信息技术学院UTS机器人研究所）； School of Engineering, University of Western Australia（西澳大学工程学院）

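AI summary: Proposes the SplatlessDF framework, which builds a continuous distance field from anisotropic Gaussian elements treated spatially, supports distance and gradient queries, and can be combined with 2D Gaussian splatting for unified modeling suited to robotic navigation.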

AI中文摘要

最近的高斯溅射（GS）方法表明，场景可以通过可优化的高斯分布高效表示，以实现高质量的重建和渲染。本文基于这一原理，引入SplatlessDF，一个从空间而非光度角度使用各向异性高斯元素的连续距离场（DF）映射框架。SplatlessDF直接参数化高斯分布并优化以恢复可微DF，使得能够在空间域中查询距离和梯度，用于下游机器人任务如导航。此外，SplatlessDF可与2D高斯溅射（2DGS）耦合，提供一个完全基于高斯原语的统一框架，该框架可以学习连续DF和表面模型，并支持光度渲染。我们考虑两种设置：独立的仅DF公式和与2DGS耦合的联合DF-渲染公式。实验表明，独立公式提供高效准确的距离和梯度查询，而联合公式改善渲染几何并同时建模连续DF。这些结果凸显了GS风格表示不仅在表面建模和渲染方面，而且在适用于机器人导航的映射表示方面的潜力。

英文摘要

Recent Gaussian splatting (GS) methods have shown that scenes can be represented efficiently with optimisable Gaussians for high-quality reconstruction and rendering. In this paper, building on this principle, we introduce SplatlessDF, a continuous distance field (DF) mapping framework that uses anisotropic Gaussian elements from a spatial rather than photometric perspective. SplatlessDF directly parameterises the Gaussians and optimises to recover a differentiable DF, enabling distances and gradients to be queried in the spatial domain for downstream robotic tasks such as navigation. Furthermore, SplatlessDF can be coupled with 2D Gaussian splatting (2DGS), providing a unified framework based solely on Gaussian primitives that can learn continuous DF and surface models and supports photometric rendering. We consider two settings: a standalone DF-only formulation and a joint DF-rendering formulation coupled with 2DGS. Experiments show that the standalone formulation provides efficient and accurate distance and gradient queries, while the joint formulation improves rendering geometry and simultaneously models a continuous DF. These results highlight the potential of GS-style representations not only for surface modelling and rendering but also for mapping representations suited to robotic navigation.


2606.13989 2026-06-15 cs.SD cs.AI 新提交

Mask, Sample, Revise: A Revisable CTMC Inference Stack for Guided Discrete Flow Matching Text-to-Speech

掩码、采样、修正：面向引导离散流匹配文本转语音的可修正CTMC推理栈

Alef Iury Siqueira Ferreira, Lucas Rafael Stefanel Gris, Luiz Fernando de Araújo Vidal, Frederico Santos de Oliveira, Christopher Dane Shulby, Anderson da Silva Soares, Arlindo Rodrigues Galvão Filho

发表机构 * Federal University of Goiás（戈亚斯联邦大学）； Federal University of Uberlândia（乌贝兰迪亚联邦大学）； University of São Paulo（圣保罗大学）； University of Brasília（巴西利亚大学）； University of California, Berkeley（加利福尼亚大学伯克利分校）

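AI summary: Proposes the Mask, Sample, Revise inference stack, which combines predictor-free guidance, prompt-matched conditional coupling, and a schedule-constrained remasking mechanism to improve the robustness and intelligibility of discrete flow matching TTS at low step counts.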

AI中文摘要

最近的无对齐非自回归文本转语音模型将合成视为条件填充任务，绕过了显式时长预测器和外部对齐器。当语音用神经编解码令牌表示时，填充问题变为离散，使得离散流匹配（一种用于离散生成的连续时间马尔可夫链框架）成为自然选择。然而，用于稳定低步数条件填充的推理时控制仍未充分探索。我们提出Mask, Sample, Revise，一种用于无对齐DFM-TTS的推理时CTMC栈。该栈结合了无预测器引导以增强文本条件、提示匹配条件耦合以将概率路径与声学提示对齐，以及SC-ReMask（一种调度约束重掩码机制），引入令牌到掩码的转换，使得早期去掩码决策可以被修正。这些组件无需事后微调，并在单个tau-leaping采样器中运行。受控消融实验表明，该栈在低NFE提示设置下提高了可懂度和鲁棒性，优于具有更多步数的无引导和仅引导采样器。

英文摘要

Recent alignment-free non-autoregressive (NAR) text-to-speech (TTS) models formulate synthesis as a conditional infilling task, bypassing explicit duration predictors and external aligners. When speech is represented with neural codec tokens, the infilling problem becomes discrete, making Discrete Flow Matching (DFM), a Continuous-Time Markov Chain (CTMC) framework for discrete generation, a natural fit. However, inference-time control for stable low-step conditional infilling remains underexplored. We propose Mask, Sample, Revise, an inference-time CTMC stack for alignment-free DFM-TTS. The stack combines predictor-free guidance to strengthen text conditioning, prompt-matched conditional coupling to align the probability path with the acoustic prompt, and SC-ReMask, a schedule-constrained remasking mechanism that introduces token-to-mask transitions so early de-masking decisions can be revised. These components require no post-hoc fine-tuning and operate in a single tau-leaping sampler. Controlled ablations show that this stack improves intelligibility and robustness in the low-NFE prompted setting, outperforming unguided and guidance-only samplers with substantially more steps.


2606.13977 2026-06-15 cs.CL 新提交

Creative Integration: A Decidable Criterion of Creativity

创造性整合：一个可判定的创造力标准

Yoshinori Nomura

发表机构 * Mirage Mountain Technologies（幻山科技）

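AI summary: Proposes a decidable criterion for creative integration based on description-length compression, made operational through four binary gates and a taxonomy of pseudo-integration, and validated on a multi-domain corpus with four falsifiable tests.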

Comments 18 pages, 1 figure

AI中文摘要

"整合性"解决方案广受赞誉但鲜有定义：我们缺乏一种可操作的方式来区分真正的整合（使世界更易于描述）与整洁的重新描述。基于将创造力和智能视为压缩的思想脉络，我们为创造性整合（CI）给出了这样一个标准：当且仅当在固定描述语言下，描述长度严格缩短（C = L_pre/L_post > 1），且缩减位于冲突本身时，A与B之间真实冲突的解决即为CI。我们通过四个二元合取门使判断可判定，并通过一个伪整合分类法（命名并拒绝相似物）固定其外延。我们用一个精心策划的多领域语料库支持该标准，并且——关键的是——不是通过人类评分者间一致性，而是通过它可能失败的四个可证伪测试来验证：独立计算检查、对硬负例的区分、样本外预测和描述语言鲁棒性；所有测试均以余量通过。贡献不在于"创造力即压缩"，而在于其可判定性、区分性和语料库：据此，使一个举动真正具有创造性——而非仅仅是新颖——的是它压缩了一个冲突，新颖性和价值是下游症状；所有创造力是否都如此构成，我们作为一个明确的猜想陈述。我们仅声称C-1的符号；我们判断，而非生成。结果是一个可引用的基元，用于更广泛的计划。

英文摘要

"Integrative" solutions are widely praised but rarely defined: we lack an operational way to tell a genuine integration -- one that makes the world cheaper to describe -- from a tidy re-description. Building on the lineage that treats creativity and intelligence as compression, we give such a criterion for creative integration (CI): the resolution of a real conflict between A and B is CI if and only if, under a fixed description language, the description length strictly shrinks (C = L_pre/L_post > 1), with the reduction located in the conflict itself. We make the judgment decidable through four binary, conjunctive gates, and we fix its extension through a taxonomy of pseudo-integration that names and rejects the look-alikes. We back the criterion with a curated, multi-domain corpus and -- crucially -- validate it not by human inter-rater agreement but by four falsifiable tests it could fail: an independent computational check, discrimination against hard negatives, out-of-sample prediction, and description-language robustness; all pass with margin. The contribution is not "creativity is compression" but its decidability, discrimination, and corpus: on this account, what makes a move genuinely creative -- rather than merely novel -- is that it compresses a conflict, with novelty and value as downstream symptoms; whether all creativity is so constituted we state as an explicit conjecture. We claim only the sign of C-1; we judge, not generate. The result is a citable primitive for a broader program.


2606.13971 2026-06-15 cs.CV 新提交

Prompt2Effect: Training-Free Image-to-Video Model Specialization via LoRA Generation

Prompt2Effect: 通过LoRA生成实现免训练图像到视频模型特化

Xiaomeng Yang, Yanyu Li, Gordon Guocheng Qian, Ivan Skorokhodov, Viacheslav Ivanov, Avalon Vinella, Xuan Zhang, Yanzhi Wang, Sergey Tulyakov, Anil Kag

发表机构 * Northeastern University（东北大学）； Snap Inc.（Snap公司）

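AI summary: Proposes Prompt2Effect, a weight-driven hypernetwork that synthesizes effect-specific LoRA weights in a single forward pass with no per-effect training, reducing computational cost from 56 GPU hours to 3.3 seconds while preserving video quality.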

AI中文摘要

将图像到视频（I2V）扩散模型个性化以具有特定视觉效果的需求日益增长，用于高端视频生成。当前实践需要为每个效果训练单独的LoRA模块，这带来了大量的数据整理和迭代优化成本，阻碍了交互式控制。我们提出Prompt2Effect，一种权重驱动的超网络，通过单次前向传播直接合成效果特定的LoRA权重，从而分摊每个效果的训练成本。与先前仅从语义回归适配器权重的超网络不同，Prompt2Effect显式地以冻结的基础模型权重为条件，将权重预测建立在每层的结构几何上。此外，我们不是预测原始LoRA矩阵，而是引入一种SVD规范化的参数化方法，解决了分解歧义并稳定了大规模权重合成。这些设计原则共同实现了高维I2V扩散模型的准确且可扩展的LoRA预测。大量实验表明，与传统的LoRA微调相比，Prompt2Effect实现了相当或更优的视频质量和效果对齐，同时将计算成本从56 GPU训练小时降至3.3秒的超网络推理。当用作后续微调的初始化时，我们预测的权重进一步提高了最终性能，并将优化速度提升了约10倍。

英文摘要

Personalizing Image-to-Video (I2V) diffusion models with specific visual effects is increasingly demanded for high-end video generation. Current practice requires training a separate Low-Rank Adaptation (LoRA) module for each effect, incurring substantial data curation and iterative optimization costs that hinder interactive control. We present Prompt2Effect, a weight-driven hypernetwork that amortizes per-effect training by directly synthesizing effect-specific LoRA weights in a single forward pass. Unlike prior hypernetworks that regress adapter weights purely from semantics, Prompt2Effect is explicitly conditioned on the frozen base model weights, grounding weight prediction in the structural geometry of each layer. Furthermore, instead of predicting raw LoRA matrices, we introduce an SVD-canonicalized parameterization that resolves factorization ambiguity and stabilizes large-scale weight synthesis. Together, these design principles enable accurate and scalable LoRA prediction for high-dimensional I2V diffusion models. Extensive experiments demonstrate that Prompt2Effect achieves on-par or superior video quality and effect alignment compared to conventional LoRA fine-tuning, while reducing the computational cost from 56 GPU training hours to 3.3 seconds of hypernetwork inference. When used as initialization for subsequent fine-tuning, our predicted weights further improve final performance and accelerate optimization by approximately 10x.


2606.13970 2026-06-15 cs.RO cs.LG 新提交

An Attention-based Model for Robust Forecasting with Missing Modality

基于注意力的缺失模态鲁棒预测模型

Zhitian Zhang, Wenjie Zi, Yunduz Rakhmangulova, Saghar Irandoust, Hossein Hajimirsadeghi, Thibaut Durand

发表机构 * Simon Fraser University（西蒙菲莎大学）； RBC Borealis

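AI summary: Proposes a multimodal model based on a conditional variational autoencoder and a Transformer that uses attention to learn a unified, fixed-dimensional representation, handles missing modalities during both training and inference, and outperforms prior methods on human trajectory prediction and robot manipulation forecasting.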

Comments Work originally done in 2023

AI中文摘要

在缺失模态下的学习是多模态机器人学习中的一个基本挑战，因为现实世界的机器人系统通常运行在传感器数据不完整的环境中。基于注意力的模型在处理多模态数据时具有吸引力，因为它们可以用单一骨干网络处理多种模态。然而，大多数多模态模型假设在训练和推理过程中所有模态都可用，限制了它们在机器人感知和决策中的适用性。在本文中，我们介绍了一种多模态模型，旨在在训练和推理过程中处理缺失模态。该模型被表述为条件变分自编码器（CVAE），并采用基于Transformer的架构，利用注意力机制学习统一的固定维度表示，即使某些模态缺失。我们表明，所提出的模型可以在缺失模态的情况下进行训练，同时逼近所有模态的鲁棒表示。我们在五个多模态数据集上评估了我们的方法，涉及两个机器人学习任务：人类轨迹预测和机器人操作预测。实验结果表明，我们的模型有效地从不完整数据中学习，并且优于先前的多模态融合方法。

英文摘要

Learning with missing modalities is a fundamental challenge in multimodal robot learning, as real-world robotic systems often operate in environments with incomplete sensor data. Attention-based models are appealing for processing multimodal data because they can handle multiple modalities with a single backbone network. However, most multimodal models assume that all modalities are available during both training and inference, limiting their applicability in robotic perception and decision-making. In this paper, we introduce a multimodal model designed to handle missing modalities during both training and inference. The model is formulated as a conditional variational autoencoder (CVAE) and incorporates a transformer-based architecture that leverages attention mechanisms to learn a unified, fixed-dimensional representation, even when some modalities are missing. We show that our proposed model can be trained with missing modalities while approximating a robust representation of all modalities. We evaluate our approach on five multimodal datasets across two robot learning tasks: human trajectory prediction and robot manipulation forecasting. Experimental results demonstrate that our model effectively learns from incomplete data and is superior to prior multimodal fusion approaches.
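
One reason attention suits missing modalities is that attention pooling yields an output of the same size however many inputs are present. The sketch below shows only that property and is not the paper's CVAE or transformer: it assumes one token per modality, marks a missing modality as `None`, and uses a single hypothetical query vector.

```python
import math

def attention_pool(query, tokens):
    """Pool the available modality tokens into one fixed-size vector.

    tokens: one vector per modality, or None where the modality is missing.
    Missing modalities are skipped, so they receive no attention weight.
    """
    present = [t for t in tokens if t is not None]
    if not present:
        raise ValueError("at least one modality must be available")
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, t)) / scale for t in present]
    top = max(scores)                      # subtract max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    dim = len(present[0])
    return [sum(w * t[i] for w, t in zip(weights, present)) for i in range(dim)]

query = [1.0, 0.0]                         # hypothetical query vector
full = attention_pool(query, [[2.0, 3.0], [0.5, -1.0], [1.0, 1.0]])
partial = attention_pool(query, [None, [0.5, -1.0], [1.0, 1.0]])
print(len(full), len(partial))
```

Both calls return a vector of the same dimension, so downstream layers need no change when a sensor drops out. A batched implementation would more likely use an additive mask on the attention scores than filter the token list.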


2606.13964 2026-06-15 cs.CV 新提交

CaricHarmony: Contrastive Diffusion Paths for Identity-Preserving Caricature Synthesis

CaricHarmony：身份保持的漫画合成的对比扩散路径

Dongyu Wang, Dar-Yen Chen, Yi-Zhe Song

发表机构 * SketchX, CVSSP, University of Surrey（萨里大学CVSSP实验室SketchX组）

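AI summary: Proposes CaricHarmony, a training-free method that resolves contamination between identity and shape conditioning signals through parallel uncontaminated diffusion paths, achieving balanced caricature synthesis with state-of-the-art shape fidelity while preserving identity consistency.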

AI中文摘要

基于草图的漫画合成存在一个根本性失败模式：当身份和形状条件在扩散模型中结合时，它们会产生破坏性干扰，导致不可避免地向平淡肖像或无法识别的扭曲崩溃。我们将根本原因确定为"条件信号污染"——去噪轨迹中竞争的概率分布使得平衡生成变得不可能。我们提出了CaricHarmony，这是第一种通过并行无污染扩散路径明确解决这种污染的无训练方法。在推理过程中，我们维护三条路径：$\mathcal{P}^{\mathrm{i}}$（纯身份）、$\mathcal{P}^{\mathrm{s}}$（纯形状）和$\mathcal{P}^{\mathrm{i+s}}$（和谐输出）。作用于交叉注意力特征的新型能量函数提供梯度引导，将$\mathcal{P}^{\mathrm{i+s}}$导向最优平衡：$\mathcal{E}_{\mathrm{shape}}$通过布局和语义对齐确保草图保真度，而$\mathcal{E}_{\mathrm{id}}$采用对极端扭曲鲁棒的令牌级对应匹配。与需要每身份70秒微调的DemoCaricature或受限于贝塞尔曲线的CaricatureBooth不同，CaricHarmony接受任何草图格式并在16秒内生成。实验展示了最先进的性能：在可比较的身份一致性分数下，形状CLIP分数为0.8615（对比0.8450），总体用户偏好分数为7.81（对比6.06）。我们的方法从根本上将身份-形状冲突重新概念化为扩散模型的条件信号污染，从而在保持识别的同时实现前所未有的创造性控制。

英文摘要

Sketch-based caricature synthesis suffers from a fundamental failure mode: when identity and shape conditions are combined in diffusion models, they create destructive interference that causes inevitable collapse toward either bland portraits or unrecognizable distortions. We identify the root cause as "condition signal contamination" -- competing probability distributions in the denoising trajectory that make balanced generation impossible. We present CaricHarmony, the first training-free method that explicitly resolves this contamination through parallel uncontaminated diffusion paths. During inference, we maintain three paths: $\mathcal{P}^{\mathrm{i}}$ (pure identity), $\mathcal{P}^{\mathrm{s}}$ (pure shape), and $\mathcal{P}^{\mathrm{i+s}}$ (harmonized output). Novel energy functions operating on cross-attention features provide gradient guidance that steers $\mathcal{P}^{\mathrm{i+s}}$ toward optimal balance: $\mathcal{E}_{\mathrm{shape}}$ ensures sketch fidelity through layout and semantic alignment, while $\mathcal{E}_{\mathrm{id}}$ employs token-level correspondence matching robust to extreme distortions. Unlike DemoCaricature, which requires 70 seconds of per-identity fine-tuning, or CaricatureBooth, which is constrained to Bezier curves, CaricHarmony accepts any sketch format and generates in under 16 seconds. Experiments demonstrate state-of-the-art performance: a shape CLIP score of 0.8615 (vs. 0.8450) at a comparable identity consistency score, and an overall user preference score of 7.81 (vs. 6.06). Our method fundamentally reconceptualizes the ID-shape conflict as conditioning signal contamination for diffusion models, enabling unprecedented creative control while preserving recognition.


2606.13959 2026-06-15 cs.LG 新提交

Can Machine Learning Forecast Rice Yields in Data-Constrained Settings? Satellite Climate Data, National Crop Statistics, and Lessons from Sierra Leone

机器学习能否在数据受限条件下预测水稻产量？卫星气候数据、国家作物统计及塞拉利昂的经验教训

Ibrahim Denis Fofanah

发表机构 * Seidenberg School of Computer Science & Information Systems Pace University, New York, USA（佩斯大学塞登伯格计算机科学与信息系统学院，纽约，美国）； RiseAfrica Foundation for STEM and Innovation Sierra Leone, West Africa（RiseAfrica STEM与创新基金会，塞拉利昂，西非）

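AI summary: Using 25 years of Sierra Leone crop statistics and free satellite climate data, trains machine learning models under a strict anti-leakage protocol and finds that a climate-only XGBoost cuts rice yield forecast error by one third, with early-season rainfall as the key predictor; the findings are translated into policy recommendations.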

Comments 32 pages, 7 figures. Code and data: https://github.com/Denis060/sierraleone-agri-ml

AI中文摘要

塞拉利昂的农业几乎没有数据驱动的决策支持，也没有已发表的机器学习研究考察该国的作物产量。我们询问是否可以利用塞拉利昂目前拥有的数据预测水稻产量。使用25年（2000-2024年）九种主要作物的FAOSTAT生产数据，我们在严格的反泄漏协议下训练XGBoost、梯度提升和随机森林，采用扩展窗口的前向验证评估七个保留年份，并以朴素持久性为基准。仅基于作物统计训练的模型均未优于持久性。加入免费卫星气候数据（CHIRPS降雨、NASA POWER温度）逆转了这一结果：仅使用气候数据的XGBoost将预测误差降低了三分之一（RMSE 284 vs 428 kg/ha），这一优势在线性模型中依然成立，并且在排除异常的2018年季节后仍然稳健。早期季节（5-6月）降雨是主导预测因子，意味着季节性产量风险在收获前数月即可观测。没有模型预测到2018年的产量崩溃，其根源是制度性的而非气候性的。我们将研究结果转化为对塞拉利昂“Feed Salone”战略的政策建议，并提供了完全开源的流程。

英文摘要

Sierra Leone's agriculture operates with almost no data-driven decision support, and no published machine learning study has examined the country's crop yields. We ask whether rice yield can be forecast from data Sierra Leone currently has. Using 25 years of FAOSTAT production data (2000-2024) for nine major crops, we train XGBoost, Gradient Boosting, and Random Forest under a strict anti-leakage protocol with expanding-window walk-forward evaluation across seven held-out years, benchmarked against naive persistence. No model trained on crop statistics alone outperforms persistence. Augmenting with free satellite climate data (CHIRPS rainfall, NASA POWER temperature) reverses this result: a climate-only XGBoost reduces forecast error by one third (RMSE 284 vs 428 kg/ha), a gain that holds for a linear model and is robust to excluding the anomalous 2018 season. Early-season (May-June) rainfall is the dominant predictor, implying seasonal yield risk is observable months before harvest. No model anticipated the 2018 collapse, whose origins were institutional rather than climatic. We translate the findings into policy recommendations for Sierra Leone's Feed Salone Strategy, with a fully open-source pipeline.
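
The evaluation protocol named in the abstract, expanding-window walk-forward evaluation benchmarked against naive persistence, can be sketched in a few lines. This is a generic illustration, not the authors' pipeline (which is linked in the Comments line): the yield series is synthetic, and `trailing_mean` is a hypothetical stand-in for a trained model.

```python
import math

def walk_forward_rmse(series, forecaster, n_test):
    """Expanding-window walk-forward RMSE over the last n_test points.

    For each held-out index t the forecaster sees only series[:t],
    so no future value can leak into the prediction.
    """
    if not 0 < n_test < len(series):
        raise ValueError("n_test must leave at least one training point")
    sq_errors = []
    for t in range(len(series) - n_test, len(series)):
        history = series[:t]
        sq_errors.append((forecaster(history) - series[t]) ** 2)
    return math.sqrt(sum(sq_errors) / len(sq_errors))

def persistence(history):
    """Naive baseline: next year's yield equals this year's."""
    return history[-1]

def trailing_mean(history, k=3):
    """Stand-in model: mean of the last k observed years."""
    window = history[-k:]
    return sum(window) / len(window)

# Synthetic yields in kg/ha, for illustration only.
yields = [1200, 1250, 1190, 1300, 1280, 1320, 1100, 1310, 1350, 1330]
print(walk_forward_rmse(yields, persistence, n_test=7))
print(walk_forward_rmse(yields, trailing_mean, n_test=7))
```

A model counts as useful under this protocol only if its RMSE is lower than the persistence RMSE on the same held-out years, which is the comparison the abstract reports (284 vs 428 kg/ha).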


2606.13955 2026-06-15 cs.LG 新提交

Smoothing Dark Areas in Molecular Latent Diffusion

分子潜在扩散中的暗区平滑

Xi Wang, Jiahan Li, Yuxuan Xia, Yingcheng Wu, Shaoyi Zheng, Shengjie Wang

发表机构 * New York University（纽约大学）； Stanford University（斯坦福大学）

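AI summary: To address dark areas in molecular latent diffusion, proposes a topology-optimized VAE (TopVAE) that internalizes structural and chemical constraints during training, reducing dark areas, improving off-posterior robustness, and yielding significant gains on QM9 and GEOM-Drugs.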

AI中文摘要

潜在扩散是可扩展3D分子生成的有前景框架，但它需要潜在空间在扩散采样之外保持平滑、有效且可导航。然而，现有的分子VAE通常通过基于重建的目标学习，这并不能保证这样的潜在空间。我们表明这会导致暗区：在扩散采样过程中可达但解码为不连通或化学无效分子的潜在空间区域。与图像生成不同，分子解码需要严格的结构和化学精度，因此即使微小的潜在扰动也可能导致灾难性失败。因此，我们提出TopVAE，一种拓扑优化的VAE，通过使解码器在训练期间内化结构和化学约束来减少暗区，消除了测试时化学校正的需要。TopVAE大大提高了离后验鲁棒性，当与标准DiT配对时，在QM9上实现了$77\%$更低的FCD-3D、最高的V&C，在GEOM-Drugs上实现了$52\%$更低的FCD-3D，以及在零样本支架修复中实现了$1.29{\times}$更稳定和更连通的分子。

英文摘要

Latent diffusion is a promising framework for scalable 3D molecular generation, but it requires a latent space that remains smooth, valid, and navigable beyond posterior samples. Existing molecular VAEs, however, are typically learned through reconstruction-based objectives, which do not guarantee such a latent space. We show that this leads to dark areas: regions of latent space that are reachable during diffusion sampling but decode to disconnected or chemically invalid molecules. Unlike in image generation, molecular decoding requires strict structural and chemical precision, so even small latent perturbations can produce catastrophic failures. We therefore propose TopVAE, a topology-optimized VAE that reduces dark areas by making the decoder internalize structural and chemical constraints during training, eliminating the need for test-time chemical correction. TopVAE greatly improves off-posterior robustness, and when paired with a standard DiT, achieves $77\%$ lower FCD-3D on QM9, the highest V&C, $52\%$ lower FCD-3D on GEOM-Drugs, and $1.29{\times}$ more stable and connected molecules on zero-shot scaffold inpainting.


2606.13949 2026-06-15 cs.AI 新提交

Minim: Privacy-Aware Minimal View for Agents via Trusted Local Sanitization

Minim: 通过可信本地净化实现代理的隐私感知最小化视图

Hexuan Yu, Chaoyu Zhang, Heng Jin, Shanghao Shi, Ning Zhang, Y. Thomas Hou, Wenjing Lou

发表机构 * University of Science and Technology of China（中国科学技术大学）

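AI summary: To address privacy leakage when LLM agents transmit the full UI state, proposes the MINIM framework, which learns dual scores (sensitivity and necessity) on the client side based on contextual integrity and applies a ternary disclosure policy to produce a privacy-aware minimal view, reducing sensitive leakage while preserving task-critical information.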

Comments Accepted at ICML 2026 (43rd International Conference on Machine Learning, Seoul, South Korea). Code available at https://github.com/yyyyhx/MINIM

AI中文摘要

现代基于LLM的自主代理越来越依赖丰富的用户界面（UI）状态观察，以在复杂数字环境中实现可靠的动作基础。然而，许多部署将完整的UI状态传输到远程推理服务器，即使大多数元素与当前任务无关，这可能会泄露敏感但不必要的上下文，如身份验证代码、私人通知和后台应用状态。我们提出MINIM，一个可信的本地代理，在任何观察离开设备之前，在客户端执行隐私感知的最小化。基于上下文完整性（CI），MINIM通过预测每个UI元素的固有敏感性分数（s）和任务条件必要性分数（n）来学习双分数表示。这些分数驱动一个三元披露策略，保留必要元素，在需要时抽象敏感属性，并移除与任务无关的内容。我们优化了一个CI感知目标，对高风险内容上的必要性错误施加更强的惩罚，从而在保留任务关键信息的同时实现积极的剪枝。在来自WebArena的真实世界UI观察上的实验表明，MINIM显著减少了与任务无关的敏感泄露，同时保留了任务关键的语义上下文和可靠代理动作所需的交互能力。

英文摘要

Modern LLM-powered autonomous agents increasingly rely on rich user interface (UI) state observations to achieve reliable action grounding in complex digital environments. However, many deployments transmit the full UI state to remote inference servers even when most elements are irrelevant to the current task, which can leak sensitive but unnecessary context such as authentication codes, private notifications, and background application states. We propose MINIM, a trusted local broker that performs privacy-aware minimization on the client side before any observation leaves the device. Grounded in Contextual Integrity (CI), MINIM learns a dual-score representation for each UI element by predicting an inherent sensitivity score (s) and a task-conditioned necessity score (n). These scores drive a ternary disclosure policy that keeps essential elements, abstracts sensitive attributes when needed, and removes task-irrelevant content. We optimize a CI-aware objective that penalizes necessity errors more strongly on high-risk content, enabling aggressive pruning while preserving task-critical information. Experiments on real-world UI observations derived from WebArena show that MINIM substantially reduces task-irrelevant sensitive leakage while preserving task-critical semantic context and the interactive affordances required for reliable agent actions.
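
The ternary disclosure policy (keep, abstract, remove) driven by the two scores can be sketched as a simple thresholded rule. This is an illustration only: the abstract does not say how the scores map to actions, so the thresholds, the rule order, the placeholder string, and the `UIElement` type are all hypothetical, and the scores here are hand-set rather than predicted by a model.

```python
from dataclasses import dataclass

@dataclass
class UIElement:
    text: str
    sensitivity: float  # s: inherent sensitivity score in [0, 1]
    necessity: float    # n: task-conditioned necessity score in [0, 1]

def disclose(el, s_thresh=0.5, n_thresh=0.5):
    """Return one of 'keep', 'abstract', 'remove' for a UI element."""
    if el.necessity < n_thresh:
        return "remove"      # not needed for the task: never leaves the device
    if el.sensitivity >= s_thresh:
        return "abstract"    # needed but sensitive: send a placeholder instead
    return "keep"            # needed and not sensitive: send as-is

def minimal_view(elements):
    """Build the observation that would be sent to the remote model."""
    view = []
    for el in elements:
        action = disclose(el)
        if action == "keep":
            view.append(el.text)
        elif action == "abstract":
            view.append("[abstracted]")
    return view

elements = [
    UIElement("Search box", 0.1, 0.9),
    UIElement("OTP 493021", 0.95, 0.1),
    UIElement("Card ending 4242", 0.9, 0.8),
]
print(minimal_view(elements))
```

In this toy rule the one-time code is dropped because the task does not need it, while the card element is needed and therefore replaced by a placeholder rather than removed.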
