arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.12316 2026-05-13 cs.LG

Autoregressive Learning in Joint KL: Sharp Oracle Bounds and Lower Bounds

Yunbei Xu, Yuzhe Yuan, Ruohan Zhan

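Affiliations: National University of Singapore; University College London

AI summary: This paper studies how the sequence length affects approximation and estimation errors in autoregressive modeling and next-token prediction under model misspecification, measured by the joint KL divergence. By establishing matching upper and lower bounds, the authors give the first complete characterization of long-sequence error behavior and improve on the convergence rates and optimality analysis of existing work. The study shows that joint KL has a sequence-length-independent advantage in approximation error, and proves an Ω(H) lower bound on how the estimation error depends on sequence length, consistent with the upper bounds of efficient algorithms, thereby unifying the training objective, the evaluation metric, and the approximation metric.

Abstract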

We study the fundamental and timely problem of learning long sequences in autoregressive modeling and next-token prediction under model misspecification, measured by the joint Kullback--Leibler (KL) divergence. Our goal is to characterize how the sequence horizon \(H\) affects both approximation and estimation errors in this joint-distribution, sequence-level regime. By establishing matching upper and lower bounds, we provide, to our knowledge, the first complete characterization of long-horizon error behavior under the natural joint KL objective, with improved rates and optimality justification relative to existing work. On the approximation side, we show that joint KL admits a horizon-free approximation factor, in sharp contrast to Hellinger-based analyses that exhibit an \(\Omega(H)\) dependence for computationally efficient methods; this isolates the choice of divergence as the source of approximation amplification. On the estimation side, we prove a fundamental information-theoretic lower bound of order \(\Omega(H)\) that holds for both decomposable policy classes and fully shared policies, matching the \(\widetilde O(H)\) upper bounds achieved by computationally efficient algorithms. Our analysis clarifies the landscape of recent autoregressive learning results by aligning the log-loss training objective, the sequence-level evaluation metric, and the approximation metric through a sharp joint-KL oracle theory. We further show that these joint-KL guarantees imply policy learning regret bounds at rates matching prior imitation learning literature.
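
For orientation, the standard chain rule for KL divergence shows why a next-token log-loss lines up with a sequence-level joint KL metric: the joint divergence splits into per-step conditional terms. The notation below (data distribution \(P\), model \(Q\), sequence \(x_{1:H}\)) is ours and is not taken from the paper.

```latex
\[
\mathrm{KL}\bigl(P(x_{1:H}) \,\|\, Q(x_{1:H})\bigr)
= \sum_{h=1}^{H} \mathbb{E}_{x_{<h} \sim P}
\Bigl[ \mathrm{KL}\bigl(P(\cdot \mid x_{<h}) \,\|\, Q(\cdot \mid x_{<h})\bigr) \Bigr]
\]
```

Each summand is the expected next-token KL at step \(h\), with the prefix drawn from the data distribution.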

2605.12313 2026-05-13 cs.CL cs.IR

Overview of the MedHopQA track at BioCreative IX: track description, participation and evaluation of systems for multi-hop medical question answering

Rezarta Islamaj, Joey Chan, Robert Leaman, Jongmyung Jung, Hyeongsoon Hwang, Quoc-An Nguyen, Hoang-Quynh Le, Harikrishnan Gurushankar Saisudha, Ganesh Chandrasekar, Rustam R. Taktashov, Nadezhda Yu. Bizyukova, Sofia I. R. Conceição, Paulo R. C. Lopes, Reem Abdel Salam, Mary Adewunmi, Zhiyong Lu

Affiliations: National Library of Medicine (NLM), National Institutes of Health (NIH); University of Illinois at Urbana-Champaign; Korea University; VNU University of Engineering and Technology, Hanoi, Vietnam; Concordia University, Montreal, QC, CA; Institute of Biomedical Chemistry (IBMC), 10 bld. 8, Pogodinskaya str., 119121 Moscow, Russia; LASIGE, Departamento de Informática, Faculdade de Ciências, Universidade de Lisboa, 1749-016 Lisbon, Portugal; Faculty of Engineering, Computer Engineering Department, Cairo University; Menzies School of Health Research, Charles Darwin University, NT, Australia; CaresAI, Australia

AI summary: The MedHopQA shared task at BioCreative IX aims to evaluate the reasoning ability of large language models on multi-hop medical question answering. It introduces a new dataset of 1,000 complex QA pairs, each requiring two-hop reasoning over two distinct Wikipedia pages, with particular attention to rare diseases. The task attracted 48 submissions from 13 teams. Results show that systems based on retrieval-augmented generation (RAG) and related strategies substantially outperform baseline models, with the best system reaching 89.30% on concept accuracy (MedCPT) and 87.30% on exact match (EM). The dataset is publicly available to advance research on medical multi-hop question answering.

Abstract

Multi-hop question answering (QA) remains a significant challenge in the biomedical domain, requiring systems to integrate information across multiple sources to answer complex questions. To address this problem, the BioCreative IX MedHopQA shared task was designed to benchmark multi-hop reasoning in large language models (LLMs). We developed a novel dataset of 1,000 challenging QA pairs spanning diseases, genes, and chemicals, with particular emphasis on rare diseases. Each question was constructed to require two-hop reasoning through the integration of information from two distinct Wikipedia pages. The challenge attracted 48 submissions from 13 teams. Systems were evaluated using both surface string comparison and conceptual accuracy (MedCPT score). The results showed a substantial performance gap between baseline LLMs and enhanced systems. The top-ranked submission achieved an 89.30% F1 score on the MedCPT metric and an 87.30% exact match (EM) score, compared with 67.40% and 60.20%, respectively, for the zero-shot baseline. A central finding of the challenge was that retrieval-augmented generation (RAG) and related retrieval-based strategies were critical for strong performance. In addition, concept-level evaluation improved answer assessment when correct responses differed in surface form. The MedHopQA dataset is publicly available to support continued progress in this important area. Challenge materials: https://www.ncbi.nlm.nih.gov/research/bionlp/medhopqa. Benchmark: https://www.codabench.org/competitions/7609/
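
To make the "surface string comparison" side of the evaluation concrete, here is a minimal sketch of an exact match (EM) scorer. The normalization rules (lowercasing, stripping punctuation and English articles) are a common convention that we assume for illustration; they are not the track's official scorer, and the function names are hypothetical.

```python
import re
import string


def normalize(answer: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace (assumed rules)."""
    text = answer.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> bool:
    """True when the two answers are identical after normalization."""
    return normalize(prediction) == normalize(gold)


def em_score(predictions, golds) -> float:
    """Percentage of predictions that match their gold answer."""
    pairs = list(zip(predictions, golds))
    return 100.0 * sum(exact_match(p, g) for p, g in pairs) / len(pairs)
```

A scorer like this still rejects a correct answer given under a synonym, which is the gap that concept-level scoring is meant to close.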

2605.12312 2026-05-13 cs.LG cs.AI

Transferable Delay-Aware Reinforcement Learning via Implicit Causal Graph Modeling

Chenran Zhao, Dianxi Shi, Yaowen Zhang, Chunping Qiu, Shaowu Yang

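Affiliations: College of Computer Science and Technology, National University of Defense Technology; Intelligent Game and Decision Lab (IGDL); Institute of Military Transportation; School of Artificial Intelligence, Hebei University of Technology

AI summary: This paper studies how to improve policy transferability and adaptability in cross-task reinforcement learning with random delays. To address the temporal misalignment between actions and state feedback caused by delays, and the difficulty of reusing knowledge when task objectives change, the authors propose a transferable delay-aware reinforcement learning method based on implicit causal graph modeling. The method uses a field-node encoder to turn high-dimensional observations into latent states with node-level semantics and uses message passing to learn dynamic causal dependencies among nodes, yielding transferable structured representations and environment-dynamics knowledge that improve the efficiency and performance of cross-task learning.

Abstract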

Random delays weaken the temporal correspondence between actions and subsequent state feedback, making it difficult for agents to identify the true propagation process of action effects. In cross-task scenarios, changes in task objectives and reward formulations further reduce the reusability of previously acquired task knowledge. To address this problem, this paper proposes a transferable delay-aware reinforcement learning method based on implicit causal graph modeling. The proposed method uses a field-node encoder to represent high-dimensional observations as latent states with node-level semantics, and employs a message-passing mechanism to characterize dynamic causal dependencies among nodes, thereby learning transferable structured representations and environment dynamics knowledge. On this basis, imagination-driven behavior learning and planning are incorporated to optimize policies in the latent space, enabling cross-task knowledge transfer and rapid adaptation. Experimental results show that the proposed method outperforms baseline methods on DMC continuous control tasks with random delays. Cross-task transfer experiments further demonstrate that the learned structured representations and dynamics knowledge can be effectively transferred to new tasks and significantly accelerate policy adaptation.

2605.12310 2026-05-13 cs.SD

Poly-SVC: Polyphony-Aware Singing Voice Conversion with Harmonic Modeling

Chen Geng, Meng Chen, Ruohua Zhou, Ruolan Liu, Weifeng Zhao

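Affiliations: School of Intelligence Science and Technology; Beijing University of Civil Engineering and Architecture; Lyra Lab, Tencent Music Entertainment; Beijing Key Laboratory of Super Intelligent Technology for Urban Architecture

AI summary: This paper proposes Poly-SVC, a polyphony-aware singing voice conversion system that converts a source singer's voice into a target singer's voice while preserving lyrics and melody. The method handles residual harmonies in accompanied recordings by combining a Constant-Q Transform based pitch extractor, a random sampler, and a diffusion decoder based on conditional flow matching, fusing melody and harmony features to generate natural, expressive polyphonic output. Experiments show that Poly-SVC outperforms existing baselines in naturalness, timbre similarity, and harmony reconstruction.

Comments: Accepted by ICASSP 2026

Abstract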

Singing Voice Conversion (SVC) aims to transform a source singing voice into that of a target singer while preserving lyrics and melody. Most existing SVC methods depend on F0 extractors to capture the lead melody from clean vocals. However, no existing method can reliably extract clean vocals from accompanied recordings without leaving residual harmonies behind. In this paper, we propose Poly-SVC, a zero-shot, cross-lingual singing voice conversion system designed to process residual harmonies. Poly-SVC is composed of three key components: a Constant-Q Transform (CQT)-based pitch extractor to preserve both the lead melody and residual harmony, a random sampler to reduce interference information from the CQT, and a diffusion decoder based on Conditional Flow Matching (CFM) that fuses pitch, content, and timbre features into natural-sounding polyphonic outputs. Experiments demonstrate that Poly-SVC surpasses the baseline models in naturalness, timbre similarity, and harmony reconstruction across both harmony-rich and single-melody recordings.

2605.12308 2026-05-13 cs.LG

In-context learning to predict critical transitions in dynamical systems

Yunus Sevinchan, Juan Nathaniel, Kai Ueltzhöffer, Carla Roesch, Tobias Weber, Vaios Laschos, Hang Fan, Gregor Ramien, Johannes Haux, Pierre Gentine, Benjamin Herdeanu

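Affiliations: Columbia University; University of Edinburgh; kausable Heidelberg

AI summary: This study addresses early warning of critical transitions in dynamical systems; such transitions are typically abrupt and irreversible, and real-world observations of them are scarce. The authors propose TipPFN, a deep learning framework based on in-context learning, trained with a synthetic data generator so that it adapts flexibly to contexts of different sizes, complexity, and dimensionality. The method shows state-of-the-art early detection on unseen tipping scenarios, sim-to-real cases, and real observations, offering a new route to reliable early warning systems.

Comments: 14+38 pages, 5+23 figures

Abstract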

Critical transitions (abrupt, often irreversible changes in system dynamics) arise across human and natural systems, often with catastrophic consequences. Real-world observations of such shifts remain scarce, preventing the development of reliable early warning systems. Conventional statistical and spectral indicators, such as increasing variance, tend to fail under realistic conditions of limited data and correlated noise, whereas existing deep learning classifiers do not extrapolate beyond their training data distribution. In this work, we introduce TipPFN, an in-context learning (ICL) framework that uses a prior-data fitted network to infer a system's proximity to a critical transition. Trained on our novel synthetic data generator, which is based on canonical bifurcation scenarios coupled to diverse, randomized stochastic dynamics, TipPFN flexibly capitalizes on contexts of various sizes, complexities, and dimensionalities. We demonstrate robust, state-of-the-art early detection of critical transitions in previously unseen tipping regimes, sim-to-real examples, and real-world observations in both ICL and zero-shot settings.
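
As a point of reference for the "increasing variance" indicator the abstract calls conventional, the sketch below computes a rolling variance on a toy series. The generator is an assumption of ours: an AR(1) process whose memory coefficient drifts toward 1, a common stand-in for critical slowing down. It is not the paper's synthetic data generator and has nothing to do with TipPFN itself.

```python
import random
from statistics import pvariance


def rolling_variance(series, window):
    """Population variance over each sliding window of the given length."""
    return [pvariance(series[i:i + window]) for i in range(len(series) - window + 1)]


def simulate_slowing_down(steps=2000, phi_start=0.2, phi_end=0.97, seed=0):
    """AR(1) series whose memory coefficient drifts toward 1 (toy example)."""
    rng = random.Random(seed)
    x, out = 0.0, []
    for t in range(steps):
        phi = phi_start + (phi_end - phi_start) * t / (steps - 1)
        x = phi * x + rng.gauss(0.0, 1.0)
        out.append(x)
    return out


series = simulate_slowing_down()
variances = rolling_variance(series, window=100)
quarter = len(variances) // 4
early = sum(variances[:quarter]) / quarter
late = sum(variances[-quarter:]) / quarter
print(f"mean rolling variance: early={early:.2f}, late={late:.2f}")
```

On a long, clean series like this one the variance rises as the coefficient approaches 1; the abstract's point is that such indicators become unreliable once data are short or the noise is correlated.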

2605.12306 2026-05-13 cs.LG cs.AI cs.CV

KAN-CL: Per-Knot Importance Regularization for Continual Learning with Kolmogorov-Arnold Networks

Minjong Cheon

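Affiliations: Sejong University Department of Computer Science and Engineering

AI summary: This paper proposes KAN-CL, a continual learning framework that targets catastrophic forgetting caused by parameter interference across tasks. The method exploits the compact-support spline parameterization of Kolmogorov-Arnold Networks (KANs) to apply importance-weighted anchoring at the level of individual spline knots, giving finer-grained parameter regularization. Experiments show that KAN-CL substantially reduces forgetting on several benchmark datasets while maintaining high classification accuracy, and a neural tangent kernel analysis further explains its theoretical advantage.

Abstract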

Catastrophic forgetting remains the central obstacle in continual learning (CL): parameters shared across tasks interfere with one another, and existing regularization methods such as EWC and SI apply uniform penalties without awareness of which input region a parameter serves. We propose KAN-CL, a continual learning framework that exploits the compact-support spline parameterization of Kolmogorov-Arnold Networks (KANs) to perform importance-weighted anchoring at per-knot granularity. Deployed as a classification head on a convolutional backbone with standard EWC regularization on the backbone (bbEWC), KAN-CL achieves forgetting reductions of 88% and 93% over a head-only KAN baseline on Split-CIFAR-10/5T and Split-CIFAR-100/10T, respectively, while matching or exceeding the accuracy of all baselines on both benchmarks. We further provide a Neural Tangent Kernel (NTK) analysis showing that KAN's spline locality induces a structural rank deficit in the cross-task NTK, yielding a forgetting bound that holds even in the feature-learning regime. These results establish that combining an architecture with natural parameter locality (KAN head) with a complementary backbone regularizer (bbEWC) yields a compositional and principled approach to catastrophic forgetting.
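
The "standard EWC regularization" mentioned for the backbone has a simple closed form: a quadratic pull toward the parameters learned on earlier tasks, weighted per parameter by an importance estimate (usually the diagonal Fisher information). The sketch below shows only that standard penalty on plain Python lists. The per-knot variant used in KAN-CL is not specified in the abstract and is not reproduced here; the function name and arguments are ours.

```python
def ewc_penalty(params, anchor, fisher, strength=1.0):
    """Standard EWC term: (strength / 2) * sum_i F_i * (theta_i - theta_star_i) ** 2."""
    if not (len(params) == len(anchor) == len(fisher)):
        raise ValueError("params, anchor and fisher must have the same length")
    return 0.5 * strength * sum(
        f * (p - a) ** 2 for p, a, f in zip(params, anchor, fisher)
    )


# The first parameter moved by 1.0 and has importance 4.0; the second did not move.
penalty = ewc_penalty(params=[1.0, 2.0], anchor=[0.0, 2.0], fisher=[4.0, 10.0])
```

The penalty is added to the loss of the new task, so parameters with high importance are the most expensive to change.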

2605.12305 2026-05-13 cs.CV

Images in Sentences: Scaling Interleaved Instructions for Unified Visual Generation

Yabo Zhang, Kunchang Li, Dewei Zhou, Xinyu Huang, Xun Wang

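Affiliations: ByteDance Seed

AI summary: To address the weak image-generation performance of multimodal language models under complex interleaved instructions, this study proposes INSET, a unified visual generation model that embeds images as native vocabulary within textual instructions so that descriptions are matched to visual targets more precisely. A scalable data engine generates large numbers of high-quality interleaved samples, and across several tasks the model shows better multi-image consistency and text alignment than existing methods, while also supporting extensions such as multimodal image editing.

Abstract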

While recent advancements in multimodal language models have enabled image generation from expressive multi-image instructions, existing methods struggle to maintain performance under complex interleaved instructions. This limitation stems from the structural separation of images and text in current paradigms, which forces models to bridge difficult long-range dependencies to match descriptions with visual targets. To address these challenges, we propose Images iN SEnTences (a.k.a. INSET), a unified generation model that seamlessly embeds images as native vocabulary within textual instructions. By positioning visual features directly at their corresponding semantic slots, INSET leverages the contextual locality of transformers for precise object binding, effectively treating images as dense, expressive language tokens. Furthermore, we introduce a scalable data engine that synthesizes 15M high-quality interleaved samples from standard image and video datasets, utilizing VLMs and LLMs to construct rich, long-horizon sequences. Evaluation results on InterleaveBench demonstrate that INSET significantly outperforms state-of-the-art methods in multi-image consistency and text alignment, with performance gaps widening as input complexity increases. Beyond standard generation, our approach inherently extends to multimodal image editing, integrating visual content as part of the instruction to facilitate highly expressive and creative visual manipulations.

2605.12301 2026-05-13 cs.LG math.ST stat.TH

Approximation of Maximally Monotone Operators: A Graph Convergence Perspective

Takashi Furuya, Yury Korolev, Takaharu Yaguchi

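Affiliations: Doshisha University; RIKEN AIP; University of Bath; Kyushu University

AI summary: This paper studies the approximation of maximally monotone operators, which have important applications in mathematics and machine learning, through graph convergence. Traditional uniform or $L^p$ approximation is limited for such operators. The authors propose graph convergence (Painlevé-Kuratowski convergence) as the approximation framework, prove that any maximally monotone operator can be approximated in the sense of local graph convergence by encoder-decoder architectures, and construct structure-preserving approximations that retain maximal monotonicity. The result provides a new theoretical basis and method for operator learning with discontinuous or set-valued operators.

Abstract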

Operator learning has been highly successful for continuous mappings between infinite-dimensional spaces, such as PDE solution operators. However, many operators of interest, including differential operators, are discontinuous or set-valued, and lie outside classical approximation frameworks. We propose a paradigm shift by formulating approximation via graph convergence (Painlevé-Kuratowski convergence), which is well-suited for closed operators. We show that uniform and $L^p$ approximation are fundamentally inadequate in this setting. Focusing on maximally monotone operators, we prove that any such operator can be approximated in the sense of local graph convergence by continuous encoder-decoder architectures, and further construct structure-preserving approximations that retain maximal monotonicity via resolvent-based parameterizations.
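
To make "resolvent-based" concrete, the block below recalls the textbook definition of the resolvent and works it out for one set-valued operator on the real line. This is standard convex analysis offered as background; the paper's actual parameterization is not described in the abstract, and the symbols \(A\), \(J_\lambda\), and \(A_\lambda\) are our notation.

```latex
% Resolvent of a maximally monotone operator A on a Hilbert space, for \lambda > 0:
\[
J_\lambda = (I + \lambda A)^{-1},
\]
% which is single-valued, defined everywhere, and firmly nonexpansive.

% Example: A = \partial |\cdot| on \mathbb{R}, which is set-valued at 0,
% since \partial|0| = [-1, 1]. Its resolvent is soft-thresholding:
\[
J_\lambda(x) = \operatorname{sign}(x)\,\max\bigl(|x| - \lambda,\ 0\bigr).
\]

% The Yosida approximation is a single-valued, Lipschitz surrogate for A:
\[
A_\lambda = \frac{1}{\lambda}\,(I - J_\lambda).
\]
```

The example shows why resolvents are convenient to parameterize: the operator itself is discontinuous and set-valued, yet its resolvent is an ordinary continuous map.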

2605.12299 2026-05-13 cs.CL

GKnow: Measuring the Entanglement of Gender Bias and Factual Gender

Leonor Veloso, Hinrich Schütze

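Affiliations: Center for Information and Language Processing, LMU Munich; Munich Center for Machine Learning (MCML)

AI summary: This study proposes GKnow, a benchmark for assessing gender knowledge and gender bias in language models across different gender-related prediction tasks. It finds that gender bias and factual gender are heavily entangled at the level of both circuits and individual neurons, which makes debiasing methods such as neuron ablation unreliable. GKnow helps identify and analyze the model components responsible for gendered predictions, and reveals that existing gender-bias benchmarks can mask a decline in factual gender knowledge.

Comments: Accepted to ACL 2026

Abstract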

Recent works have analyzed the impact of individual components of neural networks on gendered predictions, often with a focus on mitigating gender bias. However, mechanistic interpretations of gender tend to (i) focus on a very specific gender-related task, such as gendered pronoun prediction, or (ii) fail to distinguish between the production of factually gendered outputs (the correct assumption of gender given a word that carries gender as a semantic property) and gender biased outputs (based on a stereotype). To address these issues, we curate GKnow, a benchmark to assess gender knowledge and gender bias in language models across different types of gender-related predictions. GKnow allows us to identify and analyze circuits and individual neurons responsible for gendered predictions. We test the impact of neuron ablation on benchmarks for disentangling stereotypical and factual gender (DiFair and the test set of GKnow), as well as StereoSet. Results show that gender bias and factual gender are severely entangled on the level of both circuits and neurons, entailing that ablation is an unreliable debiasing method. Furthermore, we show that benchmarks for evaluating gender bias can hide the decrease in factual gender knowledge that accompanies neuron ablation. We curate GKnow as a contribution to the continuous development of robust gender bias benchmarks.

2605.12297 2026-05-13 cs.CV cs.RO eess.IV

EgoEV-HandPose: Egocentric 3D Hand Pose Estimation and Gesture Recognition with Stereo Event Cameras

Luming Wang, Hao Shi, Jiajun Zhai, Kailun Yang, Kaiwei Wang

Affiliations: National Research Center for Optical Instrumentation, Zhejiang University; School of Artificial Intelligence and Robotics and the National Engineering Research Center of Robot Visual Perception and Control Technology, Hunan University; Ant Group Company Ltd.

AI summary: This paper proposes EgoEV-HandPose, an end-to-end framework based on stereo event cameras for egocentric 3D bimanual pose estimation and gesture recognition. Its core module, KeypointBEV, lifts features into a unified bird's-eye view and combines this with an iterative reprojection-guided refinement loop to address depth uncertainty and motion blur. The study also releases EgoEVHands, the first large-scale real-world stereo event-camera dataset, and markedly improves performance in low-light and bimanual-occlusion scenes, providing a new benchmark for event-based egocentric perception.

Comments: Extended version of SMC 2025 paper arXiv:2503.12419. The established dataset and source code will be publicly released at https://github.com/ZJUWang01/EgoEV-HandPose

Abstract

Egocentric 3D hand pose estimation and gesture recognition are essential for immersive augmented/virtual reality, human-computer interaction, and robotics. However, conventional frame-based cameras suffer from motion blur and limited dynamic range, while existing event-based methods are hindered by ego-motion interference, monocular depth ambiguity, and the lack of large-scale real-world stereo datasets. To overcome these limitations, we propose EgoEV-HandPose, an end-to-end framework for joint 3D bimanual pose estimation and gesture recognition from stereo event streams. Central to our approach is KeypointBEV, a flexible stereo fusion module that lifts features into a canonical bird's-eye-view space and employs an iterative reprojection-guided refinement loop to progressively resolve depth uncertainty and enforce kinematic consistency. In addition, we introduce EgoEVHands, the first large-scale real-world stereo event-camera dataset for egocentric hand perception, containing 5,419 annotated sequences with dense 3D/2D keypoints across 38 gesture classes under varying illumination. Extensive experiments demonstrate that EgoEV-HandPose achieves state-of-the-art performance with an MPJPE of 30.54mm and 86.87% Top-1 gesture recognition accuracy, significantly outperforming RGB-based stereo and prior event-camera methods, particularly in low-light and bimanual occlusion scenarios, thereby setting a new benchmark for event-based egocentric perception. The established dataset and source code will be publicly released at https://github.com/ZJUWang01/EgoEV-HandPose.
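
For readers unfamiliar with the MPJPE figure quoted above, the metric is the mean Euclidean distance between predicted and ground-truth joint positions, reported in the units of the coordinates (millimetres here). The sketch below is a minimal version under that definition. Whether the paper aligns poses (for example, at the wrist) before measuring is not stated in the abstract, so no alignment is applied; the function name is ours.

```python
import math


def mpjpe(predicted, target):
    """Mean Euclidean distance between matching 3D joints, in the input units."""
    if len(predicted) != len(target) or not predicted:
        raise ValueError("joint lists must be non-empty and equally long")
    return sum(math.dist(p, t) for p, t in zip(predicted, target)) / len(predicted)


# Two joints: the first is off by 5 units, the second is exact.
error = mpjpe([(0, 0, 0), (1, 1, 1)], [(3, 4, 0), (1, 1, 1)])
```

Because the metric averages over joints, a few badly placed fingertips can hide behind many well-placed joints, which is why it is usually read alongside task-level accuracy.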

2605.12294 2026-05-13 cs.AI

Executable Agentic Memory for GUI Agent

Zerui Qin, Sheng Yue, Xingyuan Hua, Yongjian Fu, Ju Ren

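Affiliations: Tsinghua University, China; Sun Yat-sen University, China

AI summary: This paper proposes Executable Agentic Memory (EAM), a new method for improving the stability and efficiency of graphical user interface (GUI) agents on long-horizon tasks. EAM builds a structured knowledge graph that turns free-form planning into a retrieval-and-execution process, and combines state-aware depth-first search with action-group mining for efficient memory construction. It also introduces value-guided graph search, in which a lightweight Q-function model guides Monte Carlo Tree Search, keeping planning efficient while markedly improving task success rate and cost-effectiveness.

Abstract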

Modern GUI agents typically rely on a model-centric and step-wise interaction paradigm, where LLMs must re-interpret the UI and re-decide actions at every screen, which is fragile in long-horizon tasks. In this paper, we propose Executable Agentic Memory (EAM), a structured Knowledge Graph (KG) that shifts GUI planning from free-form generation to a robust retrieval-and-execution process. Our approach includes a sample-efficient memory construction pipeline using state-aware DFS and action-group mining to compress multi-step routines. To ensure efficient planning, we introduce a value-guided graph search where a lightweight Q-function model steers Monte Carlo Tree Search (MCTS) over the KG. We theoretically establish bias-consistency for the Q-model and derive sample complexity bounds for path recovery. Empirically, EAM outperforms state-of-the-art baselines such as UI-TARS-7B by up to $19.6\%$ on AndroidWorld, while reducing token costs by $6\times$ relative to GPT-4o. With a $2.8$s average latency, EAM enables reliable, quick, and long-horizon GUI automation.

2605.12292 2026-05-13 cs.LG

STRABLE: Benchmarking Tabular Machine Learning with Strings

Gioia Blayer, Myung Jun Kim, Félix Lefebvre, Lennart Purucker, Alan Arazi, Eilam Shapira, Roi Reichart, Frank Hutter, Marine Le Morvan, David Holzmüller, Gaël Varoquaux

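Affiliations: SODA Team, INRIA Saclay; Probabl; University of Freiburg; Prior Labs; ELLIS Institute Tübingen; Technion – Israel Institute of Technology

AI summary: This paper presents STRABLE, a benchmark of 108 real-world tables for evaluating tabular machine learning methods on data that contain both strings and numbers. It asks whether such tables need models that handle strings specifically or whether encoding strings as numbers suffices, and compares the alternatives. Experiments show that on tables dominated by categorical variables, simple string embeddings paired with advanced tabular learners work well, whereas tables dominated by free text are better served by large language model encoders. STRABLE provides a reliable benchmark for tabular learning with strings and helps advance research in this area.

Abstract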

Benchmarking tabular learning has revealed the benefit of dedicated architectures, pushing the state of the art. But real-world tables often contain string entries, beyond numbers, and these settings have been understudied due to a lack of a solid benchmarking suite. They lead to new research questions: Are dedicated learners needed, with end-to-end modeling of strings and numbers? Or does it suffice to encode strings as numbers, as with a categorical encoding? And if so, do the resulting tables resemble numerical tabular data, calling for the same learners? To enable these studies, we contribute STRABLE, a benchmarking corpus of 108 tables, all real-world learning problems with strings and numbers across diverse application fields. We run the first large-scale empirical study of tabular learning with strings, evaluating 445 pipelines. These pipelines span end-to-end architectures and modular pipelines, where strings are first encoded, then post-processed, and finally passed to a tabular learner. We find that, because most tables in the wild are categorical-dominant, advanced tabular learners paired with simple string embeddings achieve good predictions at low computational cost. On free-text-dominant tables, large LLM encoders become competitive. Their performance also appears sensitive to post-processing, with differences across LLM families. Finally, we show that STRABLE is a good set of tables to study "string tabular" learning as it leads to generalizable pipeline rankings that are close to the oracle rankings. We thus establish STRABLE as a foundation for research on tabular learning with strings, an important yet understudied area.

2605.12290 2026-05-13 cs.LG

Targeted Neuron Modulation via Contrastive Pair Search

Sam Herring, Jake Naviasky, Karan Malhotra

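Affiliations: Nous Research

AI summary: This study examines the mechanism by which instruction-tuned language models refuse harmful requests and proposes contrastive neuron attribution (CNA), a method that identifies a small set of neurons that play a key role in distinguishing harmful from benign prompts. Experiments show that intervening on these neurons lowers the model's refusal rate while preserving output quality, whereas base models lack such an intervenable refusal mechanism. The study shows how alignment fine-tuning turns pre-existing discrimination structure into a steerable refusal gate, offering a more reliable method for behavioral steering.

Abstract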

Language models are instruction-tuned to refuse harmful requests, but the mechanisms underlying this behavior remain poorly understood. Popular steering methods operate on the residual stream and degrade output coherence at high intervention strengths, limiting their practical use. We introduce contrastive neuron attribution (CNA), which identifies the 0.1% of MLP neurons whose activations most distinguish harmful from benign prompts, requiring only forward passes with no gradients or auxiliary training. In instruct models, ablating the discovered circuit reduces refusal rates by over 50% on a standard jailbreak benchmark while preserving fluency and non-degeneracy across all steering strengths. Applying CNA to matched base and instruct models across Llama and Qwen architectures (from 1B to 72B parameters), we find that base models contain similar late-layer discrimination structures but steering these neurons produces only content shifts, not behavioral change. These results demonstrate that neuron-level intervention enables reliable behavioral steering without the quality tradeoffs of residual-stream methods. More broadly, our findings suggest that alignment fine-tuning transforms pre-existing discrimination structure into a sparse, targetable refusal gate.
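
The selection step described above (pick the small fraction of neurons whose activations best separate two prompt sets, using forward passes only) can be sketched as follows. The scoring rule here, the absolute difference of mean activations, is our assumption for illustration; the abstract does not give the exact CNA score. Activations are plain lists standing in for recorded MLP outputs, and all names are hypothetical.

```python
def top_contrastive_neurons(harmful_acts, benign_acts, fraction=0.001):
    """Rank neurons by |mean harmful activation - mean benign activation|.

    Each argument is a list of per-prompt activation vectors of equal width.
    The scoring rule is an assumption made for illustration.
    """
    width = len(harmful_acts[0])

    def column_mean(rows, j):
        return sum(row[j] for row in rows) / len(rows)

    scores = [
        abs(column_mean(harmful_acts, j) - column_mean(benign_acts, j))
        for j in range(width)
    ]
    keep = max(1, round(fraction * width))
    ranked = sorted(range(width), key=lambda j: scores[j], reverse=True)
    return ranked[:keep]


def ablate(activations, neuron_ids):
    """Return a copy of one activation vector with the chosen neurons zeroed."""
    chosen = set(neuron_ids)
    return [0.0 if j in chosen else a for j, a in enumerate(activations)]


harmful = [[1.0, 5.0, 0.0], [1.0, 7.0, 0.2]]
benign = [[1.0, 0.0, 0.1], [1.0, 1.0, 0.1]]
selected = top_contrastive_neurons(harmful, benign, fraction=0.34)
```

In a real model the ablation would be applied inside the forward pass (for example, through a hook on the MLP layer) rather than on a stored vector.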

2605.12289 2026-05-13 cs.LG cs.AI

PriorZero: Bridging Language Priors and World Models for Decision Making

Junyu Xiong, Yuan Pu, Jia Tang, Yazhe Niu

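Affiliations: University of Science and Technology of China; Shanghai Artificial Intelligence Laboratory; Nanjing University of Aeronautics and Astronautics; The Chinese University of Hong Kong MMLab

AI summary: This paper proposes PriorZero, a unified framework that combines the language priors of large language models (LLMs) with world-model-based planning to improve the decision making of reinforcement learning agents on long-horizon tasks. Through a decoupled rollout-training design, the method injects the LLM's conceptual priors only at the root node of Monte Carlo Tree Search (MCTS), steering search toward semantically promising actions while keeping the world model's deep lookahead. Experiments show that PriorZero markedly improves exploration efficiency and final performance on several benchmark tasks, offering a promising framework for LLM-based decision making.

Comments: 30 pages, 12 figures

Abstract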

Leveraging the rich world knowledge of Large Language Models (LLMs) to enhance Reinforcement Learning (RL) agents offers a promising path toward general intelligence. However, a fundamental prior-dynamics mismatch hinders existing approaches: static LLM knowledge cannot directly adapt to the complex transition dynamics of long-horizon tasks. Using LLM priors as fixed policies limits exploration diversity, as the prior is blind to environment-specific dynamics, while end-to-end fine-tuning suffers from optimization instability and credit assignment issues. To bridge this gap, we propose PriorZero, a unified framework that integrates LLM-derived conceptual priors into world-model-based planning through a decoupled rollout-training design. During rollout, a novel root-prior injection mechanism incorporates LLM priors exclusively at the root node of Monte Carlo Tree Search (MCTS), focusing search on semantically promising actions while preserving the world model's deep lookahead capability. During training, PriorZero decouples world-model learning from LLM adaptation: the world model is continuously refined on interaction data to jointly improve its dynamics, policy, and value predictions; its value estimates are then leveraged to provide fine-grained credit assignment signals for stable LLM fine-tuning via alternating optimization. Experiments across diverse benchmarks, including text-based adventure games in Jericho and instruction-following gridworld tasks in BabyAI, demonstrate that PriorZero consistently improves both exploration efficiency and asymptotic performance, establishing a promising framework for LLM-empowered decision-making. Our code is available at https://github.com/opendilab/LightZero.
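
One way to picture root-prior injection is through the PUCT selection rule used in AlphaZero-style MCTS, where each child's exploration bonus is scaled by a prior probability. The sketch below blends a world-model prior with an LLM prior at the root only. The blending rule, the `weight` parameter, and all function names are assumptions for illustration; the abstract does not specify how PriorZero combines the two, and this is not code from the LightZero repository.

```python
import math


def puct_score(q_value, prior, parent_visits, child_visits, c_puct=1.25):
    """AlphaZero-style PUCT: value estimate plus a prior-weighted exploration bonus."""
    return q_value + c_puct * prior * math.sqrt(parent_visits) / (1 + child_visits)


def mix_root_prior(model_prior, llm_prior, weight=0.5):
    """Blend two action distributions at the root and renormalize (assumed rule)."""
    mixed = {
        action: (1 - weight) * model_prior[action] + weight * llm_prior.get(action, 0.0)
        for action in model_prior
    }
    total = sum(mixed.values())
    return {action: p / total for action, p in mixed.items()}


def select_root_action(stats, prior, c_puct=1.25):
    """Pick the root child with the highest PUCT score.

    stats maps action -> (mean value, visit count).
    """
    parent_visits = sum(n for _, n in stats.values())
    return max(
        stats,
        key=lambda a: puct_score(stats[a][0], prior[a], parent_visits, stats[a][1], c_puct),
    )


model_prior = {"north": 0.5, "open door": 0.5}
llm_prior = {"north": 0.1, "open door": 0.9}
root_prior = mix_root_prior(model_prior, llm_prior)
stats = {"north": (0.0, 1), "open door": (0.0, 1)}
choice = select_root_action(stats, root_prior)
```

Below the root, search would keep using the world model's own policy prior, which is how the lookahead stays grounded in learned dynamics.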

2605.12282 2026-05-13 cs.CV

Large-Small Model Collaboration for Farmland Semantic Change Detection

Xinjia Li, Rui Wang, Qiurong Peng, Lingfei Ye, Dengrong Zhang, Haoyu Zhang

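Affiliations: College of Information Science and Technology, Hangzhou Normal University

AI summary: To address insufficient annotation and pseudo-change interference in fine-grained farmland semantic change detection (SCD), this paper builds HZNU-FCD, a large-scale fine-grained farmland change detection benchmark, and proposes a detection framework in which a large model and a small model collaborate. The framework combines the task-driven small visual model FD-Mamba with a frozen large vision-language model and, through cross-modal logical arbitration and a hard-region co-training strategy, improves boundary preservation and the accuracy of small-region change detection. Experiments show strong performance on several datasets, with good robustness and generalization.

Abstract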

Farmland Semantic Change Detection (SCD) is essential for cultivated land protection, yet existing benchmarks and models remain insufficient for fine-grained farmland conversion monitoring. Current datasets often lack dedicated "from-to" annotations, while visual change detection models are easily disturbed by phenology-induced pseudo-changes caused by crop rotation, seasonal variation, and illumination differences. To address these challenges, we construct HZNU-FCD, a large-scale fine-grained farmland SCD benchmark with a unified five-class farmland-to-non-farmland annotation protocol. It contains 4,588 bitemporal image pairs with pixel-level labels for practical farmland protection. Based on this benchmark, we propose a large-small collaborative SCD framework that integrates a task-driven small visual model with a frozen large vision-language model. The small model, Fine-grained Difference-aware Mamba (FD-Mamba), learns dense change representations for boundary preservation and small-region localization. The large-model pathway, Cross-modal Logical Arbitration (CMLA), introduces CLIP-based textual priors for prompt-guided semantic arbitration and pseudo-change suppression. To enable effective collaboration, we design a hard-region co-training strategy that supervises the CMLA semantic score map only on low-confidence pixels. Experiments show that our method achieves 97.63% F1, 96.32% IoU, and 96.35% SCD_IoU_mean on HZNU-FCD with only 6.65M trainable parameters. Compared with the multimodal ChangeCLIP-ViT, which leverages vision-language information for change detection, our method improves F1 by 10.19 percentage points on HZNU-FCD. It also achieves 91.43% F1 and 84.21% IoU on LEVIR-CD, and 93.85% F1 and 88.41% IoU on WHU-CD, demonstrating strong robustness and generalization. The code is available at https://github.com/Lovelymili/FD-Mamba.
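
The F1 and IoU figures above are closely linked. For a single binary class both are functions of the same true-positive, false-positive, and false-negative counts, and IoU = F1 / (2 - F1). The sketch below computes both and applies the identity to the LEVIR-CD and WHU-CD F1 scores quoted in the abstract. The function names are ours, and the identity is a general fact about the two metrics rather than something taken from the paper.

```python
def f1_and_iou(tp, fp, fn):
    """F1 and IoU of one binary class from pixel counts."""
    f1 = 2 * tp / (2 * tp + fp + fn)
    iou = tp / (tp + fp + fn)
    return f1, iou


def iou_from_f1(f1):
    """For a single binary class, IoU = F1 / (2 - F1)."""
    return f1 / (2 - f1)


levir_iou = round(100 * iou_from_f1(0.9143), 2)  # LEVIR-CD F1 from the abstract
whu_iou = round(100 * iou_from_f1(0.9385), 2)    # WHU-CD F1 from the abstract
print(levir_iou, whu_iou)
```

The two derived values agree with the reported 84.21% and 88.41% IoU. Scores averaged over several classes, as the multi-class HZNU-FCD results may be, need not satisfy the identity.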

2605.12281 2026-05-13 cs.CL cs.LG

What makes a word hard to learn? Modeling L1 influence on English vocabulary difficulty

Jonas Mayer Martins, Zhuojing Huang, Aaricia Herygers, Lisa Beinborn

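Affiliations: University of Göttingen

AI summary: This study investigates why English words are hard to learn for learners whose native language is Spanish, German, or Chinese, and computationally models vocabulary difficulty using word familiarity, meaning, surface form, and cross-linguistic transfer. A Shapley-value analysis finds that word familiarity is the main factor shared by all three learner groups; Spanish- and German-speaking learners are additionally influenced by orthographic transfer, while Chinese learners rely mainly on familiarity and surface features. The model provides interpretable difficulty estimates tailored to the learner's native language, which can help design more effective vocabulary teaching.

Comments: Submitted to BEA 2026 at ACL. 18 pages, 13 figures

Abstract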

What makes a word difficult to learn, and how does the difficulty depend on the learner's native language? We computationally model vocabulary difficulty for English learners whose first language is Spanish, German, or Chinese with gradient-boosted models trained on features related to a word's familiarity (e.g., frequency), meaning, surface form, and cross-linguistic transfer. Using Shapley values, we determine the importance of each feature group. Word familiarity is the dominant feature group shared by all three languages. However, predictions for Spanish- and German-speaking learners rely additionally on orthographic transfer. This transfer mechanism is unavailable to Chinese learners, whose difficulty is shaped by a combination of familiarity and surface features alone. Our models provide interpretable, L1-tailored difficulty estimates that can be used to design vocabulary curricula.
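
To show what a Shapley value attributes to a feature group, the sketch below computes exact Shapley values by enumerating coalitions in a three-group toy game. The payoff numbers are made up for illustration and are not results from the paper. The paper's analysis runs on gradient-boosted models, where values are normally estimated with a dedicated library rather than by brute-force enumeration.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley values for a set function `value` defined on frozensets of players."""
    n = len(players)
    result = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                coalition = frozenset(subset)
                total += weight * (value(coalition | {p}) - value(coalition))
        result[p] = total
    return result


def toy_value(coalition):
    """Made-up model quality for a coalition of feature groups."""
    score = 0.0
    if "familiarity" in coalition:
        score += 0.6
    if "surface" in coalition:
        score += 0.1
    if {"familiarity", "transfer"} <= coalition:
        score += 0.2  # transfer only helps together with familiarity
    return score


groups = ["familiarity", "surface", "transfer"]
values = shapley_values(groups, toy_value)
```

The interaction bonus is split evenly between the two groups that create it, and the values sum to the payoff of the full coalition, which is what makes group importances comparable across models.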

2605.12278 2026-05-13 cs.LG

Hypernetworks for Dynamic Feature Selection

Javier Fumanal-Idocin, Raquel Fernandez-Peralta, Javier Andreu-Perez

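Affiliations: University of Essex; Slovak Academy of Sciences

AI summary: This paper studies structural limitations of the dynamic feature selection (DFS) framework and proposes Hyper-DFS, a hypernetwork-based method that generates classifier parameters for a given feature subset on demand, reducing structural complexity while maintaining generalization. By adding a Set Transformer encoding, Hyper-DFS also builds a smooth conditioning space in which functionally similar tasks are geometrically closer. Experiments show that Hyper-DFS outperforms existing methods on synthetic and real tabular data, is competitive or better on the image datasets tested, and generalizes better zero-shot to unseen feature subsets.

Abstract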

Dynamic feature selection (DFS) is a machine learning framework in which features are acquired sequentially for individual samples under budget constraints. The exponential growth in the number of possible feature acquisition paths forces a DFS model to balance fitting specific scenarios against maintaining general performance, even when the feature space is moderate in size. In this paper, we study the structural limitations that keep existing DFS approaches from achieving an optimal solution. Then, we propose Hyper-DFS, a hypernetwork-based DFS approach that generates feature subset-specific classifier parameters on demand. We show that using hypernetworks instead of mask-embedding methods results in a smaller structural complexity bound. We also use a Set Transformer encoding to create a smooth conditioning space for the hypernetwork, so that functionally similar tasks are also geometrically close. In our benchmarks, Hyper-DFS outperforms all state-of-the-art approaches on synthetic and real-life tabular data. It is also competitive or superior across all image datasets tested, and shows substantially stronger zero-shot generalisation to feature subsets never seen during training than existing DFS approaches.

2605.12276 2026-05-13 cs.AI

NARA: Anchor-Conditioned Relation-Aware Contextualization of Heterogeneous Geoentities

Jina Kim, Gengchen Mai, Lingyi Zhao, Khurram Shafique, Yao-Yi Chiang

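Affiliations: Department of Computer Science and Engineering, University of Minnesota; Department of Geography and the Environment, University of Texas at Austin; Novateur Research Solutions

AI summary: This work proposes NARA, a self-supervised learning framework for heterogeneous vector geoentities that addresses the inability of existing methods to model geometry, semantics, and spatial relations in a unified way. By jointly modeling semantics, geometry, and spatial relations, NARA learns context-aware representations of points, polylines, and polygons. Experiments on building function classification, traffic speed prediction, and next point-of-interest recommendation show consistent improvements over prior methods, supporting the value of unified relational modeling.

Abstract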

Geospatial foundation models have primarily focused on raster data such as satellite imagery, where self-supervised learning has been widely studied. Vector geospatial data instead represent the world as discrete geoentities with explicit geometry, semantics, and structured spatial relations, including metric proximity and topological relationships. These relations jointly determine how entities interact within space, yet existing representation learning methods remain fragmented, often restricted to specific geometry types or partial spatial relations, limiting their ability to capture unified spatial context across heterogeneous geoentities. We propose NARA (Neural Anchor-conditioned Relation-Aware representation learning), a self-supervised framework for vector geoentities. NARA learns context-dependent representations by jointly modeling semantics, geometry, and spatial relations within a unified framework and captures relational spatial structure beyond proximity alone, enabling rich contextualized representations across heterogeneous geoentities of points, polylines, and polygons. Evaluation on building function classification, traffic speed prediction, and next point-of-interest recommendation shows consistent improvements over prior methods, highlighting the benefit of unified relational modeling for vector geospatial data.

2605.12266 2026-05-13 cs.CV

CAD-feature enhanced machine learning for manufacturing effort estimation on sheet metal bending parts

Matteo Ballegeer, Toon Van Camp, Willem Jaspers, Alp Bayar, Aung Nyein Soe, Martin Roelfs, Dries F. Benoit, Bieke Decraemer, Joost R. Duflou

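Affiliations: Data Analytics Research Group, Ghent University; Corelab CodesignS, Flanders Make; Department of Mechanical Engineering, KU Leuven/Flanders Make

AI summary: This study addresses manufacturing effort estimation for sheet metal bending parts with a hybrid approach that combines CAD features with graph-based machine learning. B-rep attributed adjacency graphs are enriched with manufacturing features recognized by a rule-based module, such as bend characteristics, flange lengths, and surface roles, which focuses learning on process-relevant geometric patterns. Experiments on a synthetic benchmark and a real-world industrial dataset show improved prediction accuracy, supporting the combination of domain knowledge with graph-based learning for manufacturability assessment.

Abstract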

Graph-based machine learning has emerged as a promising approach for manufacturability analysis by learning directly from CAD models represented as Boundary Representations (B-reps), exploiting both surface geometry and topological connectivity. However, purely geometric representations often lack the process-specific semantics required for accurate manufacturability prediction: many manufacturing factors, such as surface roles or bend intent, are not explicitly encoded in shape alone and are difficult for data-driven models to infer reliably. We propose a hybrid approach that addresses this challenge by enriching B-rep attributed adjacency graphs with manufacturing features recognized through a rule-based module. Applied to sheet metal bending, recognized features, such as bend characteristics, flange lengths, and surface roles, are integrated as node attributes, concentrating the learning signal on process-relevant geometric patterns. Experiments on both a large-scale synthetic manufacturability benchmark and a real-world industrial dataset with measured bending times, one of the first such validations on genuine production data, demonstrate that combining domain knowledge with graph-based learning improves prediction accuracy across both tasks. The results demonstrate that hybrid modeling offers a feasible and effective path toward deployable tools for manufacturability assessment and effort estimation in industrial CAD environments.

2605.12265 2026-05-13 cs.AI

How Useful Is Cross-Domain Generalization for Training LLM Monitors?

Sam Martin, Fabien Roger

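Affiliations: Anthropic Fellows Program

AI summary: This paper studies the use of prompted language models as classifiers when training data is limited, and asks how much cross-domain generalization helps when training LLM classifiers. Training on multiple classification tasks, each with its own prompt, improves classification in adjacent domains, but in certain edge cases the fine-tuned models fail to follow prompts, for example when the prompt changes completely while the data domain stays the same. Mixing classification training with general instruction-following training retains the classification gains while mitigating these generalization failures. The no-thinking classification training also generalizes to with-thinking classification and summarization, suggesting that it may be useful for building other classifiers and monitoring systems.

Abstract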

Using prompted language models as classifiers enables classification in domains with limited training data, but misses some of the robustness and performance benefits that fine-tuning can bring. We study whether training on multiple classification tasks, each with its own prompt, improves performance on new domains with new classification prompts. We show that such training partially generalizes to adjacent domains, improving classification performance on tasks that are unseen during training. However, we identify specific edge cases where the fine-tuned models fail to follow prompts, such as when the classification prompt changes completely while the data domain remains the same as during training. We show that classification training can be mixed with general instruction-following training, and that (when done well) such training keeps the benefits of classification training and mitigates its generalization failures. Surprisingly, we see that this no-thinking supervised classification training can generalize to with-thinking classification and summarization, suggesting that no-thinking classification training might be instrumentally useful in building other kinds of classifiers and monitoring systems.

2605.12262 2026-05-13 cs.AI cs.LG

Missingness-MDPs: Bridging the Theory of Missing Data and POMDPs

Joshua Wendland, Markel Zubia, Roman Andriushchenko, Maris F. L. Galesloot, Milan Ceska, Henrik von Kleist, Thiago D. Simao, Maximilian Weininger, Nils Jansen

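Affiliations: Ruhr University Bochum; Brno University of Technology; Radboud University Nijmegen; Harvard University; Eindhoven University of Technology

AI summary: This paper introduces missingness-MDPs (miss-MDPs), a new subclass of partially observable Markov decision processes (POMDPs) that incorporates the theory of missing data. A missingness function specifies the probability that individual state features are missing at each time step. For the case of an unknown missingness function, the authors exploit the structural properties of the different missingness types to learn the function from observed data and then compute near-optimal policies. They prove that the resulting policies are ε-optimal in the true miss-MDP with high probability, and experiments confirm the effectiveness of the approach.

Abstract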

We introduce missingness-MDPs (miss-MDPs), a novel subclass of partially observable Markov decision processes (POMDPs) that incorporates the theory of missing data. A miss-MDP is a POMDP whose observation function is a missingness function, specifying the probability that individual state features are missing (i.e., unobserved) at a time step. The literature distinguishes three canonical missingness types: missing (1) completely at random (MCAR), (2) at random (MAR), and (3) not at random (MNAR). Our planning problem is to compute near-optimal policies for a miss-MDP with an unknown missingness function, given a dataset of action-observation trajectories. Achieving such optimality guarantees for policies requires learning the missingness function from data, which is infeasible for general POMDPs. To overcome this challenge, we exploit the structural properties of different missingness types to derive probably approximately correct (PAC) algorithms for learning the missingness function. These algorithms yield an approximate but fully specified miss-MDP that we solve using off-the-shelf planning methods. We prove that, with high probability, the resulting policies are epsilon-optimal in the true miss-MDP. Empirical results confirm the theory and demonstrate superior performance of our approach over two model-free POMDP methods.
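
As a rough illustration of what learning a missingness function from data can look like, the sketch below covers only the simplest case, MCAR, where each feature is assumed to be missing with a fixed probability that does not depend on the state. It estimates per-feature missingness rates and picks a sample size from Hoeffding's inequality with a union bound. The function names and the use of `None` as the missing-value marker are assumptions made for this example; this is not the authors' algorithm, which also handles MAR and MNAR.

```python
import math


def hoeffding_sample_size(epsilon, delta, num_features):
    """Number of samples so that every per-feature rate estimate is within
    epsilon of the truth with probability at least 1 - delta.

    Uses Hoeffding's bound P(|p_hat - p| > eps) <= 2 * exp(-2 * n * eps**2)
    together with a union bound over the features.
    """
    return math.ceil(math.log(2 * num_features / delta) / (2 * epsilon ** 2))


def estimate_mcar_rates(observations):
    """Empirical missingness rate of each feature.

    observations: list of equal-length tuples, where None marks a missing
    feature (a hypothetical encoding chosen for this sketch).
    """
    n = len(observations)
    k = len(observations[0])
    return [sum(obs[i] is None for obs in observations) / n for i in range(k)]


if __name__ == "__main__":
    data = [(1, None), (None, None), (3, 4), (5, None)]
    print(estimate_mcar_rates(data))
    print(hoeffding_sample_size(epsilon=0.1, delta=0.05, num_features=2))
```

Under the MCAR assumption the estimated rates fully specify the observation function, so an approximate model of this kind could then be handed to a standard planner.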

2605.12261 2026-05-13 cs.LG

Delay-Empowered Causal Hierarchical Reinforcement Learning

Chenran Zhao, Dianxi Shi, Haotian Wang, Mengzhu Wang, Yaowen Zhang, Chunping Qiu, Shaowu Yang

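Affiliations: College of Computer Science and Technology, National University of Defense Technology; Intelligent Game and Decision Lab (IGDL); Institute of Military Transportation; School of Artificial Intelligence, Hebei University of Technology

AI summary: Many real-world tasks involve delayed effects, where the consequences of actions appear after varying time lags. Existing delay-aware reinforcement learning methods typically rely on state augmentation, prior knowledge of delay distributions, or non-delayed data, which limits their generalization. This paper proposes Delay-Empowered Causal Hierarchical Reinforcement Learning (DECHRL), which explicitly models the causal structure of state transitions and their stochastic delay distributions, and incorporates both into a delay-aware empowerment objective that steers the agent toward highly controllable states, improving performance under temporal uncertainty. In modified 2D-Minecraft and MiniGrid environments with stochastic delays, DECHRL significantly outperforms baselines.

Abstract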

Many real-world tasks involve delayed effects, where the outcomes of actions emerge after varying time lags. Existing delay-aware reinforcement learning methods often rely on state augmentation, prior knowledge of delay distributions, or access to non-delayed data, limiting their generalization. Hierarchical reinforcement learning, by contrast, inherently offers advantages in handling delays due to its hierarchical structure, yet existing methods are restricted to fixed delays. To address these limitations, we propose Delay-Empowered Causal Hierarchical Reinforcement Learning (DECHRL). DECHRL explicitly models both the causal structure of state transitions and their associated stochastic delay distributions. These are then incorporated into a delay-aware empowerment objective that drives proactive exploration toward highly controllable states, thereby improving performance under temporal uncertainty. We evaluate DECHRL in modified 2D-Minecraft and MiniGrid environments featuring stochastic delays. Experimental results show that DECHRL effectively models temporal delays and significantly outperforms baselines in decision-making under temporal uncertainty.

2605.12259 2026-05-13 cs.CV

From Image Hashing to Scene Change Detection

Anh-Kiet Duong, Marie-Claire Iatrides, Petra Gomez-Krämer, Jean-Michel Carozza

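Affiliations: L3i Laboratory; La Rochelle University; LIENSs Laboratory; Association Ferrocampus

AI summary: Image hashing enables efficient storage and retrieval, but because it compares images globally it cannot localize where changes occur, which limits its use in scene change detection. This paper revisits image hashing from a scene change detection perspective and proposes HashSCD, a patch-wise hashing framework that performs both global change detection and local change localization directly in Hamming space, without repeated inference on previous images. The model is trained without supervision using contrastive learning and achieves competitive performance while significantly reducing computational and storage costs.

Comments: 18 pages; accepted to ICPR 2026

Abstract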

Image hashing provides compact representations for efficient storage and retrieval but is inherently limited to global comparison and cannot reason about where changes occur. This limitation prevents hashing from being directly applicable to scene change detection, where spatial localization is essential. In this work, we revisit hashing from a scene change detection perspective and propose HashSCD, a patch-wise hashing framework that enables both efficient global change detection and localized change identification. HashSCD encodes spatially aligned patches into compact hash codes and aggregates them through an XOR-like operation, allowing change detection and localization to be performed directly in the Hamming space without repeated inference on previous images. The model is trained in an unsupervised manner using contrastive learning at both patch and global levels. Experiments demonstrate that HashSCD achieves competitive performance compared to state-of-the-art unsupervised hashing and scene change detection methods, while significantly reducing computational cost and storage requirements.
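
The idea of comparing patch codes directly in Hamming space can be sketched in a few lines. The example below assumes each spatially aligned patch has already been encoded into an integer hash code (the learned encoder is out of scope here), XORs corresponding codes, and counts the differing bits. The function names and both thresholds are hypothetical choices for illustration, and the abstract's "XOR-like" aggregation is not reproduced.

```python
def patch_hamming(codes_a, codes_b):
    """Hamming distance between corresponding patch hash codes.

    codes_a, codes_b: lists of non-negative ints, one code per aligned patch.
    XOR leaves a 1 bit wherever the two codes disagree.
    """
    return [bin(a ^ b).count("1") for a, b in zip(codes_a, codes_b)]


def detect_change(codes_a, codes_b, patch_thresh=2):
    """Return (scene_changed, per_patch_flags).

    A patch is flagged when its Hamming distance reaches patch_thresh
    (an assumed threshold); the scene is flagged if any patch is.
    """
    changed = [d >= patch_thresh for d in patch_hamming(codes_a, codes_b)]
    return any(changed), changed


if __name__ == "__main__":
    before = [0b0101, 0b1100, 0b0000]
    after = [0b0101, 0b0010, 0b0001]
    print(patch_hamming(before, after))
    print(detect_change(before, after))
```

Because only the stored codes of the earlier image are needed, a comparison of this kind involves no repeated inference on previous images, which is the property the abstract highlights.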

2605.12258 2026-05-13 cs.LG

Instruction Lens Score: Your Instruction Contributes a Powerful Object Hallucination Detector for Multimodal Large Language Models

Runhe Lai, Xinhua Lu, Yanqi Wu, Jinlun Ye, Weijiang Yu, Ruixuan Wang

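Affiliations: School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China; Peng Cheng Laboratory, Shenzhen, China; Key Laboratory of Machine Intelligence and Advanced Computing, MOE, Guangzhou, China

AI summary: Multimodal large language models still suffer from object hallucination in practical use. This paper analyzes instruction token embeddings in depth and finds that they implicitly encode visual information and effectively filter out erroneous information introduced by misleading visual embeddings. Building on this, it proposes the Instruction Lens Score (InsLen), an object hallucination detection method that requires no additional training or auxiliary models. InsLen combines a Calibrated Local Score with a Context Consistency Score, and experiments across multiple benchmarks and model architectures show that it outperforms existing methods.

Comments: Accepted by ICML-2026

Abstract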

Multimodal large language models (MLLMs) have achieved remarkable progress, yet object hallucination remains a critical challenge for reliable deployment. In this paper, we present an in-depth analysis of instruction token embeddings and reveal that they implicitly encode visual information while effectively filtering erroneous information introduced by misleading visual embeddings. Building on this insight, we propose the Instruction Lens Score (InsLen), which combines a Calibrated Local Score with a Context Consistency Score that measures the context consistency of object tokens. The proposed approach serves as a plug-and-play object hallucination detector without relying on auxiliary models or additional training. Extensive experiments across multiple benchmarks and diverse MLLM architectures demonstrate that InsLen consistently outperforms existing hallucination detection methods, highlighting its effectiveness and robustness. The code is available at https://github.com/Fraserlairh/Instruction-Lens-Score.

2605.12255 2026-05-13 cs.AI cs.CY cs.LG

Why Conclusions Diverge from the Same Observations: Formalizing World-Model Non-Identifiability via an Inference

Toru Takahashi

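Affiliations: Human Informatics and Systems Laboratory, Doshisha University, Kyoto, Japan; Linked Open Data Initiative, NPO, Tokyo, Japan; Keio Research Institute at SFC, Fujisawa, Japan; Stroly Inc., Kyoto, Japan

AI summary: This paper asks why people reach different conclusions from the same observations and argues that such divergence stems from non-identifiability in inference and learning rather than from cognitive defects in the other party. It distinguishes two levels: conclusions that differ under the same world model because inference settings differ, and inference settings that themselves bias data exposure and update rules, causing the learned world models to diverge. The paper introduces the notion of an inference profile, analyzes how disagreement is shaped by computational, observational, and coordination constraints, relates the framework to concepts in deep representation learning, and illustrates it with a case study on AI regulation debates.

Comments: 12 pages, 2 figures, 1 table. Extended English version of a paper accepted for presentation at JSAI 2026

Abstract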

When people share the same documents and observations yet reach different conclusions, the disagreement often shifts into a judgment that the other party is cognitively defective, irrational, or acting in bad faith. This paper argues that such divergence is better described as a form of non-identifiability inherent in inference and learning, rather than as a defect of the other party. We organize the phenomenon into two levels: (i) $θ$-level non-identifiability, where conclusions diverge under the same world model $W$ because inference settings differ; and (ii) $W$-level non-identifiability, where repeated use of an inference setting $θ$ biases data exposure and update rules, causing the learned world model $W$ itself to diverge. We introduce an inference profile $θ= (R, E, S, D)$, consisting of Reference, Exploration, Stabilization, and Horizon, and show how outputs can split even for the same observation $o$ and the same $W$. We further explain why disagreements tend to project onto a small number of bases -- abstract versus concrete, externalizability, and order versus freedom -- as a consequence of general constraints on learning systems: computational, observational, and coordination constraints. Finally, we relate the framework to deep representation learning, including representation hierarchy, latent-state estimation, and regularization-exploration trade-offs, and illustrate the framework through a case study on AI regulation debates.

2605.12252 2026-05-13 cs.CV

H3D-MarNet: Wavelet-Guided Dual-Path Learning for Metal Artifact Suppression and CT Modality Transformation for Radiotherapy Workflows

Mubashara Rehman, Niki Martinel, Michele Avanzo, Riccardo Spizzo, Christian Micheloni

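Affiliations: Machine Learning and Perception Lab, Università degli Studi di Udine; Centro di Riferimento Oncologico di Aviano IRCCS

AI summary: This study proposes H3D-MarNet, a two-stage framework for metal artifact suppression and CT modality transformation from kilo-voltage CT (kVCT) to mega-voltage CT (MVCT), intended to improve image quality in radiotherapy workflows. The first stage uses a wavelet-based preprocessing module to suppress metal artifacts while preserving anatomical structures. The second stage, Domain-TransNet, combines a CNN encoder with a transformer encoder and fuses local detail with global context through attention to perform the modality transformation. Experiments report 28.14 dB PSNR and 0.717 SSIM on artifact-affected slices, indicating potential for use in clinical radiotherapy.

Comments: Accepted for publication at the 28th International Conference on Pattern Recognition, Lyon, France, August 17-22, 2026

Abstract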

Metal artifacts in computed tomography (CT) severely degrade image quality, compromising diagnostic accuracy and radiotherapy planning, especially in cancer patients with high-density implants. We propose H3D-MarNet, a two-stage framework for artifact-aware CT domain transformation from kilo-voltage CT (kVCT) to mega-voltage CT (MVCT). In the first stage, a wavelet-based preprocessing module suppresses metal-induced artifacts through frequency-aware denoising while preserving anatomical structures. In the second stage, Domain-TransNet performs kVCT-to-MVCT domain transformation using a hybrid volumetric learning architecture. Domain-TransNet integrates a CNN-based encoder to capture fine-grained local anatomical details and a transformer-based encoder to model long-range volumetric dependencies. The complementary representations are fused through an attention-based feature fusion mechanism to ensure spatial and contextual coherence across slices. A multi-stage, attention-guided decoder, supported by deep supervision, progressively reconstructs artifact-suppressed MVCT volumes. Extensive experiments demonstrate that H3D-MarNet achieves 28.14 dB PSNR and 0.717 SSIM on artifact-affected slices from the full dataset, indicating effective metal artifact suppression and anatomical preservation and highlighting its potential for reliable CT modality transformation in clinical radiotherapy workflows.

2605.12247 2026-05-13 cs.RO

SI-Diff: A Framework for Learning Search and High-Precision Insertion with a Force-Domain Diffusion Policy

Yibo Liu, Stanko Oparnica, Simon Shewchun-Jakaitis, Guoyi Fu, Jie Wang, Jun Yang, Anand Jagannathan, Tony Hong-Yau Lo

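Affiliations: Epson Canada; Queen’s University

AI summary: In contact-rich robotic assembly, uncertainty in relative pose, such as misalignment and small clearances, makes search and high-precision insertion challenging. This paper proposes SI-Diff, a framework that learns both search and high-precision insertion with a single force-domain diffusion policy. A new mode-conditioning mechanism lets one model capture distinct action behaviors, and a new search teacher policy generates diverse trajectories, which increases tolerance to initial pose misalignment and improves generalization to unseen shapes.

Comments: 9 pages, 8 figures

Abstract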

Contact-rich assembly is fundamental in robotics but poses significant challenges due to uncertainties in relative poses, such as misalignments and small clearances in peg-in-hole tasks. Existing approaches typically address search and high-precision insertion separately, because these tasks involve distinct action patterns. However, supporting both tasks within a single model, without switching models or weights, is desirable for intelligent assembly systems. In this work, we propose SI-Diff, a framework that learns both search and high-precision insertion through a force-domain diffusion policy. To this end, we introduce a new mode-conditioning mechanism that enables the policy to capture distinct action behaviors under a single framework. Moreover, we develop a new search teacher policy that can generate diverse trajectories. By training on successful and efficient demonstrations provided by the teacher policy, the model learns the mapping from tactile and end-effector velocity observations to effective action behaviors. We conduct thorough experiments to show that SI-Diff extends the tolerance to x-y misalignments from 2 mm to 5 mm compared to the state-of-the-art baseline, TacDiffusion, while also demonstrating strong zero-shot transferability to unseen shapes.

2605.12245 2026-05-13 cs.LG

SOAR: Scale Optimization for Accurate Reconstruction in NVFP4 Quantization

Chengzhu Bao, Xianglong Yan, Zhiteng Li, Guangshuo Qin, Guanghua Yu, Yulun Zhang

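Affiliations: Shanghai Jiao Tong University; Tencent Hunyuan

AI summary: SOAR is a post-training quantization framework for the NVFP4 format that aims to improve reconstruction accuracy for large language models under 4-bit microscaling quantization. It proposes Closed-form Joint Scale Optimization (CJSO) and Decoupled Scale Search (DSS), which jointly optimize global and block-wise scales and decouple the quantization scale from the dequantization scale, addressing the inflexible scale selection and precision loss of existing methods. Experiments on multiple large language models show that SOAR outperforms existing NVFP4 quantization methods, achieving higher accuracy at the same memory footprint.

Abstract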

NVFP4 has recently emerged as an efficient 4-bit microscaling format for large language models (LLMs), offering superior numerical fidelity with native hardware support. However, existing methods often yield suboptimal performance due to inflexible scale selection and the coupled treatment of quantization and dequantization scales. To address these issues, we propose Scale Optimization for Accurate Reconstruction (SOAR), a novel post-training quantization framework that improves the accuracy of NVFP4 quantization. At its core, SOAR features Closed-form Joint Scale Optimization (CJSO), which jointly optimizes global and block-wise scales via analytical solutions derived from reconstruction error minimization. Furthermore, it incorporates Decoupled Scale Search (DSS). DSS decouples the high-precision quantization scale from its constrained dequantization counterpart, and performs discrete search to mitigate precision loss from scale quantization. Extensive experiments across multiple LLMs show that our method consistently outperforms existing NVFP4 quantization baselines, achieving superior accuracy under the same memory footprint with no additional hardware overhead. The code and models will be available at https://github.com/steven-bao1/SOAR.
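
To make "analytical solutions derived from reconstruction error minimization" concrete, here is a minimal single-block sketch. It quantizes a block to the FP4 (E2M1) magnitude grid using the usual abs-max scale, then refits the scale in closed form: for fixed codes q, the scale minimizing ||x - s·q||² is s = ⟨x, q⟩ / ⟨q, q⟩. This is a generic least-squares step chosen for illustration, not the paper's CJSO. The two-level global and block-wise scales, the quantization of the scale itself, and DSS are all omitted, and every function name is hypothetical.

```python
# Representable magnitudes of a 4-bit E2M1 float (sign handled separately).
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def quantize(block, scale):
    """Round each x / scale to the nearest FP4 grid value, keeping the sign."""
    codes = []
    for x in block:
        mag = min(FP4_GRID, key=lambda g: abs(abs(x) / scale - g))
        codes.append(mag if x >= 0 else -mag)
    return codes


def refit_scale(block, codes):
    """Closed-form least-squares scale for fixed codes: <x, q> / <q, q>."""
    denom = sum(q * q for q in codes)
    if denom == 0:
        return 1.0  # all-zero codes: any scale gives the same reconstruction
    return sum(x * q for x, q in zip(block, codes)) / denom


def sq_error(block, codes, scale):
    """Squared reconstruction error of scale * codes against the block."""
    return sum((x - scale * q) ** 2 for x, q in zip(block, codes))


if __name__ == "__main__":
    block = [0.1, -0.7, 2.9, 1.3]
    absmax_scale = max(abs(x) for x in block) / FP4_GRID[-1]
    codes = quantize(block, absmax_scale)
    refit = refit_scale(block, codes)
    print(codes)
    print(sq_error(block, codes, absmax_scale), sq_error(block, codes, refit))
```

For fixed codes the refit scale can never increase the squared error relative to the abs-max scale. Whether it still helps once the scale must itself be stored in a low-precision format is the kind of question the decoupled search in the abstract is aimed at.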

2605.12243 2026-05-13 cs.CL

PreScam: A Benchmark for Predicting Scam Progression from Early Conversations

Weixiang Sun, Shang Ma, Yiyang Li, Tianyi Ma, Zehong Wang, Colby Nelson, Xusheng Xiao, Yanfang Ye

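Affiliations: University of Notre Dame; Arizona State University

AI summary: PreScam is a benchmark for predicting scam progression from early conversations, designed to study how multi-turn conversational scams such as romance and investment scams unfold. Built from user-submitted scam reports, it contains 11,573 conversational instances across 20 scam categories, structured according to the scam lifecycle and annotated with scammer psychological actions and victim responses. Models are evaluated on two tasks, and the results show that current models capture some scam-related cues but still struggle to track risk escalation and manipulation across turns.

Abstract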

Conversational scams, such as romance and investment scams, are emerging as a major form of online fraud. Unlike one-shot scam lures such as fake lottery or unpaid toll messages, they unfold through multi-turn conversations in which scammers gradually manipulate victims using evolving psychological techniques. However, existing research mainly focuses on static scam detection or synthetic scams, leaving open whether language models can understand how real-world scams progress over time. We introduce PreScam, a benchmark for modeling scam progression from early conversations. Built from user-submitted scam reports, PreScam filters and structures 177,989 raw reports into 11,573 conversational scam instances spanning 20 scam categories. Each instance is hierarchically structured according to the scam lifecycle defined by the proposed scam kill chain, and further annotated at the turn level with scammer psychological actions and victim responses. We benchmark models on two tasks: real-time termination prediction, which estimates whether a conversation is approaching the termination stage, and scammer action prediction, which forecasts the scammer's subsequent actions. Results show a clear gap between surface-level fluency and progression modeling: supervised encoders substantially outperform zero-shot LLMs on real-time termination prediction, while next-action prediction remains only moderately successful even for strong LLMs. Taken together, these results show that current models can capture some scam-related cues, yet still struggle to track how risk escalates and how manipulation unfolds across turns.

2605.12242 2026-05-13 cs.CL cs.AI

Mind the Pause: Disfluency-Aware Objective Tuning for Multilingual Speech Correction with LLMs

Deepak Kumar, Baban Gain, Asif Ekbal

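Affiliations: Department of Computer Science and Engineering; Indian Institute of Technology Patna

AI summary: Automatic speech recognition (ASR) transcripts often contain disfluencies such as fillers, repetitions, and false starts, which reduce readability and hurt downstream applications. This paper proposes an LLM-based multilingual disfluency correction method: a sequence tagger identifies disfluent tokens, and instruction fine-tuning combined with contrastive learning trains the model to remove disfluent content while preserving meaning and grammar. Experiments on Hindi, Bengali, and Marathi show consistent improvements over strong baselines.

Comments: Accepted to ACL 2026 (Main)

Abstract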

Automatic Speech Recognition (ASR) transcripts often contain disfluencies, such as fillers, repetitions, and false starts, which reduce readability and hinder downstream applications like chatbots and voice assistants. If left unaddressed, such disfluencies can significantly degrade the reliability of downstream systems. Most existing approaches rely on classical models that focus on identifying disfluent tokens for removal. While this strategy is effective to some extent, it often disrupts grammatical structure and semantic coherence, leading to incomplete or unnatural sentences. Recent work has explored the use of large language models (LLMs); however, these efforts have primarily focused on disfluency detection or data augmentation rather than on comprehensive correction. We propose a multilingual correction pipeline in which a sequence tagger first marks disfluent tokens, and these signals guide instruction fine-tuning of an LLM to rewrite transcripts into fluent text. To further improve reliability, we add a contrastive learning objective that penalizes the reproduction of disfluent tokens, encouraging the model to preserve grammar and meaning while removing disfluent artifacts. Our experiments across three Indian languages, namely Hindi, Bengali, and Marathi, show consistent improvements over strong baselines, including multilingual sequence-to-sequence models. These results highlight that detection-only strategies are insufficient. Combining token-level cues with instruction tuning and contrastive learning provides a practical and scalable solution for multilingual disfluency correction in speech-driven NLP systems. The code is publicly available at https://github.com/deepak-kumar-98/Mind-the-Pause.