arXivDaily arXiv每日学术速递 周一至周五更新
全部学科分类 3420
专题追踪
2604.16929 2026-04-21 cs.CL

MeasHalu: Mitigation of Scientific Measurement Hallucinations for Large Language Models with Enhanced Reasoning

Ruijun Huang, Zhiqiao Kang, Yuxuan Zhu, Junxiong Li, Jiahao Zhao, Minghuan Tan, Feng Jiang, Min Yang

Comments To appear in ACL 2026

详情
英文摘要

The accurate extraction of scientific measurements from literature is a critical yet challenging task in AI4Science, enabling large-scale analysis and integration of quantitative research findings. However, Large Language Models (LLMs) frequently exhibit severe hallucinations, which significantly undermine the reliability of automated scientific document understanding systems. To address this problem, we propose MeasHalu, a novel framework for mitigating scientific measurement hallucinations through enhanced reasoning and targeted optimization. We first present a fine-grained taxonomy of measurement-specific hallucinations, categorizing errors across quantities, units, modifiers, and relations. Our approach incorporates a two-stage reasoning-aware fine-tuning strategy using augmented scientific data and process-based supervision. Furthermore, we introduce a progressive reward curriculum designed to penalize specific hallucination types, significantly improving extraction faithfulness. Experimental results demonstrate that MeasHalu substantially reduces hallucination rates and improves overall accuracy on the MeasEval benchmark. This work provides a targeted solution to a key bottleneck in automated scientific knowledge extraction, facilitating more trustworthy and scalable machine-assisted scientific literature analysis.

2604.16926 2026-04-21 cs.LG cs.AI eess.SP

Test-Time Adaptation for EEG Foundation Models: A Systematic Study under Real-World Distribution Shifts

Gabriel Jason Lee, Jathurshan Pradeepkumar, Jimeng Sun

详情
英文摘要

Electroencephalography (EEG) foundation models have shown strong potential for learning generalizable representations from large-scale neural data, yet their clinical deployment is hindered by distribution shifts across clinical settings, devices, and populations. Test-time adaptation (TTA) offers a promising solution by enabling models to adapt to unlabeled target data during inference without access to source data, a valuable property in healthcare settings constrained by privacy regulations and limited labeled data. However, its effectiveness for EEG remains largely underexplored. In this work, we introduce NeuroAdapt-Bench, a systematic benchmark for evaluating test-time adaptation methods on EEG foundation models under realistic distribution shifts. We evaluate representative TTA approaches from other domains across multiple pretrained foundation models, diverse downstream tasks, and heterogeneous datasets spanning in-distribution, out-of-distribution, and extreme modality shifts (e.g., Ear-EEG). Our results show that standard TTA methods yield inconsistent gains and often degrade performance, with gradient-based approaches particularly prone to heavy degradation. In contrast, optimization-free methods demonstrate greater stability and more reliable improvements. These findings highlight the limitations of existing TTA techniques in EEG, provide guidance for future development, and underscore the need for domain-specific adaptation strategies.

2604.16923 2026-04-21 cs.AI

Alignment Imprint: Zero-Shot AI-Generated Text Detection via Provable Preference Discrepancy

Junxi Wu, Kailin Huang, Dongjian Hu, Bin Chen, Hao Wu, Shu-Tao Xia, Changliang Zou

详情
英文摘要

Detecting AI-generated text is an important but challenging problem. Existing likelihood-based detection methods are often sensitive to content complexity and may exhibit unstable performance. In this paper, our key insight is that modern Large Language Models (LLMs) undergo alignment (including fine-tuning and preference tuning), leaving a measurable distributional imprint. We theoretically derive this imprint by abstracting the alignment process as a sequence of constrained optimization steps, showing that the log-likelihood ratio can naturally decompose into implicit instructional biases and preference rewards. We refer to this quantity as the Alignment Imprint. Furthermore, to mitigate the instability in high-entropy regions, we introduce Log-likelihood Alignment Preference Discrepancy (LAPD), a standardized information-weighted statistic based on alignment imprint. We provide statistical guarantee that alignment-based statistics dominate Fast-DetectGPT in performance. We also theoretically show that LAPD strictly improves the unweighted alignment scores when the aligned and base models are close in distribution. Extensive experiments show that LAPD achieves an improvement 45.82% relative to the strongest existing baselines, yielding large and consistent gains across all settings.

2604.16919 2026-04-21 cs.LG cs.AI cs.CV

Noise-Adaptive Diffusion Sampling for Inverse Problems Without Task-Specific Tuning

Yingzhi Xia, Setthakorn Tanomkiattikun, Liangli Zhen, Zaiwang Gu

Comments Accepted by ICLR 2026

详情
英文摘要

Diffusion models (DMs) have recently shown remarkable performance on inverse problems (IPs). Optimization-based methods can fast solve IPs using DMs as powerful regularizers, but they are susceptible to local minima and noise overfitting. Although DMs can provide strong priors for Bayesian approaches, enforcing measurement consistency during the denoising process leads to manifold infeasibility issues. We propose Noise-space Hamiltonian Monte Carlo (N-HMC), a posterior sampling method that treats reverse diffusion as a deterministic mapping from initial noise to clean images. N-HMC enables comprehensive exploration of the solution space, avoiding local optima. By moving inference entirely into the initial-noise space, N-HMC keeps proposals on the learned data manifold. We provide a comprehensive theoretical analysis of our approach and extend the framework to a noise-adaptive variant (NA-NHMC) that effectively handles IPs with unknown noise type and level. Extensive experiments across four linear and three nonlinear inverse problems demonstrate that NA-NHMC achieves superior reconstruction quality with robust performance across different hyperparameters and initializations, significantly outperforming recent state-of-the-art methods. The code is available at https://github.com/NA-HMC/NA-HMC.

2604.16918 2026-04-21 cs.CL cs.LG

Freshness-Aware Prioritized Experience Replay for LLM/VLM Reinforcement Learning

Weiyu Ma, Yongcheng Zeng, Yan Song, Xinyu Cui, Jian Zhao, Xuhui Liu, Mohamed Elhoseiny

详情
英文摘要

Reinforcement Learning (RL) has achieved impressive success in post-training Large Language Models (LLMs) and Vision-Language Models (VLMs), with on-policy algorithms such as PPO, GRPO, and REINFORCE++ serving as the dominant paradigm. However, these methods discard all collected trajectories after a single gradient update, resulting in poor sample efficiency, particularly wasteful for agentic tasks where multi-turn environment interactions are expensive. While Experience Replay drives sample efficiency in classic RL by allowing agents to reuse past trajectories and prioritize informative ones, directly applying Prioritized Experience Replay (PER) to LLMs fails. The rapid policy evolution of billion-parameter models renders stored priorities stale, causing old high-priority trajectories to dominate sampling long after they have become uninformative. We propose Freshness-Aware PER, which addresses this priority staleness problem by augmenting any PER-based priority with a multiplicative exponential age decay grounded in effective sample size analysis. To the best of our knowledge, Freshness-Aware PER is the first work to successfully apply PER to LLM/VLM reinforcement learning. We evaluate on eight multi-step agentic, reasoning, and math competition tasks with 0.5B, 3B, and 7B models. Freshness-Aware PER significantly outperforms on-policy baselines, achieving +46% on NQ Search, +367% on Sokoban, and +133% on VLM FrozenLake, while standard PER without age decay consistently degrades performance. Our code is publicly available at https://github.com/Vision-CAIR/Freshness-Aware-PER.

2604.16917 2026-04-21 cs.CL

x1: Learning to Think Adaptively Across Languages and Cultures

Yangfan Ye, Xiaocheng Feng, Xiachong Feng, Yichong Huang, Zekun Yuan, Lei Huang, Weitao Ma, Qichen Hong, Yunfei Lu, Dandan Tu, Bing Qin

Comments Findings of ACL2026

详情
英文摘要

Languages encode distinct abstractions and inductive priors, yet most large language models (LLMs) overlook this diversity by reasoning in a single dominant language. In this work, we introduce x1, a family of reasoning models that can adaptively reason in an advantageous language on a per-instance basis. To isolate the effect of reasoning-language choice, x1 is constructed without expanding the model's knowledge boundaries and is trained by contrasting linguistically distinct reasoning trajectories for the same input. Our extensive experiments demonstrate the benefits of adaptive multilingual reasoning across multilingual mathematical reasoning and culturally grounded tasks. Moreover, our results challenge a simplistic view of scaling laws: while scaling reduces cross-lingual disparities in procedural domains such as math reasoning, it does not eliminate the advantages of culture-associated languages in culturally grounded tasks, as we empirically show that such reasoning enables more efficient and accurate cultural knowledge recall. Overall, our findings establish language choice as a functional component of reasoning, with implications for building more generalist and globally competent reasoning models.

2604.16916 2026-04-21 cs.CL

When Choices Become Risks: Safety Failures of Large Language Models under Multiple-Choice Constraints

Yuheng Chen, Zhiyu Wu, Bowen Cheng, Tetsuro Takahashi

详情
英文摘要

Safety alignment in large language models (LLMs) is primarily evaluated under open-ended generation, where models can mitigate risk by refusing to respond. In contrast, many real-world applications place LLMs in structured decision-making tasks, such as multiple-choice questions (MCQs), where abstention is discouraged or unavailable. We identify a systematic failure mode in this setting: reformulating harmful requests as forced-choice MCQs, where all options are unsafe, can systematically bypass refusal behavior, even in models that consistently reject equivalent open-ended prompts. Across 14 proprietary and open-source models, we show that forced-choice constraints sharply increase policy-violating responses. Notably, for human-authored MCQs, violation rates follow an inverted U-shaped trend with respect to structural constraint strength, peaking under intermediate task specifications, whereas MCQs generated by high-capability models yield near-saturation violation rates across constraints and exhibit strong cross-model transferability. Our findings reveal that current safety evaluations substantially underestimate risks in structured task settings and highlight constrained decision-making as a critical and underexplored surface for alignment failures.

2604.16913 2026-04-21 cs.AI cs.CL cs.CR cs.DC

The Cognitive Penalty: Ablating System 1 and System 2 Reasoning in Edge-Native SLMs for Decentralized Consensus

Syed Muhammad Aqdas Rizvi

Comments Working paper. 14 pages, 3 figures, 6 tables. Code and dataset: https://github.com/smarizvi110/sentinel-bench

详情
英文摘要

Decentralized Autonomous Organizations (DAOs) are inclined explore Small Language Models (SLMs) as edge-native constitutional firewalls to vet proposals and mitigate semantic social engineering. While scaling inference-time compute (System 2) enhances formal logic, its efficacy in highly adversarial, cryptoeconomic governance environments remains underexplored. To address this, we introduce Sentinel-Bench, an 840-inference empirical framework executing a strict intra-model ablation on Qwen-3.5-9B. By toggling latent reasoning across frozen weights, we isolate the impact of inference-time compute against an adversarial Optimism DAO dataset. Our findings reveal a severe compute-accuracy inversion. The autoregressive baseline (System 1) achieved 100% adversarial robustness, 100% juridical consistency, and state finality in under 13 seconds. Conversely, System 2 reasoning introduced catastrophic instability, fundamentally driven by a 26.7% Reasoning Non-Convergence (cognitive collapse) rate. This collapse degraded trial-to-trial consensus stability to 72.6% and imposed a 17x latency overhead, introducing critical vulnerabilities to Governance Extractable Value (GEV) and hardware centralization. While rare (1.5% of adversarial trials), we empirically captured "Reasoning-Induced Sycophancy," where the model generated significantly longer internal monologues (averaging 25,750 characters) to rationalize failing the adversarial trap. We conclude that for edge-native SLMs operating under Byzantine Fault Tolerance (BFT) constraints, System 1 parameterized intuition is structurally and economically superior to System 2 iterative deliberation for decentralized consensus. Code and Dataset: https://github.com/smarizvi110/sentinel-bench

2604.16911 2026-04-21 cs.AI

Skilldex: A Package Manager and Registry for Agent Skill Packages with Hierarchical Scope-Based Distribution

Sampriti Saha, Pranav Hemanth

Comments 8 pages, 1 figure, 5 tables. IEEE conference format

详情
英文摘要

Large Language Model (LLM) agents are increasingly extended at runtime via skill packages, structured natural-language instruction bundles loaded from a well-known directory. Community install tooling and registries exist, but two gaps persist: no public tool scores skill packages against Anthropic's published format specification, and no mechanism bundles related skills with the shared context they need to remain mutually coherent. We present Skilldex, a package manager and registry for agent skill packages addressing both gaps. The two novel contributions are: (1) compiler-style format conformance scoring against Anthropic's skill specification, producing line-level diagnostics on description specificity, frontmatter validity, and structural adherence; and (2) the skillset abstraction, a bundled collection of related skills with shared assets (vocabulary files, templates, reference documents) that enforce cross-skill behavioral coherence. Skilldex also provides supporting infrastructure: a three-tier hierarchical scope system, a human-in-the-loop agent suggestion loop, a metadata-only community registry, and a Model Context Protocol (MCP) server. The system is implemented as a TypeScript CLI (skillpm / spm) with a Hono/Supabase registry backend, and is open-source.

2604.16910 2026-04-21 cs.CV cs.RO

LAGS: Low-Altitude Gaussian Splatting with Groupwise Heterogeneous Graph Learning

Yikun Wang, Yujie Wan, Wei Zuo, Shuai Wang, Yik-Chung Wu, Chengzhong Xu, Huseyin Arslan

Comments 5 pages, 8 figures

详情
英文摘要

Low-altitude Gaussian splatting (LAGS) facilitates 3D scene reconstruction by aggregating aerial images from distributed drones. However, as LAGS prioritizes maximizing reconstruction quality over communication throughput, existing low-altitude resource allocation schemes become inefficient. This inefficiency stems from their failure to account for image diversity introduced by varying viewpoints. To fill this gap, we propose a groupwise heterogeneous graph neural network (GW-HGNN) for LAGS resource allocation. GW-HGNN explicitly models the non-uniform contribution of different image groups to the reconstruction process, thus automatically balancing data fidelity and transmission cost. The key insight of GW-HGNN is to transform LAGS losses and communication constraints into graph learning costs for dual-level message passing. Experiments on real-world LAGS datasets demonstrate that GW-HGNN significantly outperforms state-of-the-art benchmarks across key rendering metrics, including PSNR, SSIM, and LPIPS. Furthermore, GW-HGNN reduces computational latency by approximately 100x compared to the widely-used MOSEK solver, achieving millisecond-level inference suitable for real-time deployment.

2604.16903 2026-04-21 cs.RO

Leveraging VR Robot Games to Facilitate Data Collection for Embodied Intelligence Tasks

Yihan Zhang, Ziyun Huang, Linqi Ye

详情
英文摘要

Collecting embodied interaction data at scale remains costly and difficult due to the limited accessibility of conventional interfaces. We present a gamified data collection framework based on Unity that combines procedural scene generation, VR-based humanoid robot control, automatic task evaluation, and trajectory logging. A trash pick-and-place task prototype is developed to validate the full workflow.Experimental results indicate that the collected demonstrations exhibit broad coverage of the state-action space, and that increasing task difficulty leads to higher motion intensity as well as more extensive exploration of the arm's workspace. The proposed framework demonstrates that game-oriented virtual environments can serve as an effective and extensible solution for embodied data collection.

2604.16895 2026-04-21 cs.CV cs.AI

Physics-Informed Tracking (PIT)

Emil Hovad, Allan Peter Engsig-Karup

Comments 20 pages, 3 figures, 11 tables

详情
英文摘要

We propose Physics-Informed Tracking (PIT), a video-based framework for tracking a single particle from video, where a neural network autoencoder localizes a particle as a heatmap peak (landmark) and a differentiable physics module embedded in the autoencoder constrains several landmarks over time (a trajectory) to satisfy known dynamics. The novel Physics-Informed Landmark Loss (PILL) compares this predicted trajectory back against the landmarks, enforcing physical consistency without labels. Its supervised variant (PILLS) instead compares the prediction against ground-truth position, velocity, and bounce from simulation, enabling end-to-end backpropagation. To support supervised and unsupervised learning, we use an autoencoder with a split bottleneck that separates A) tracking-related structure via landmark heatmaps from B) background noise and subsequent image reconstruction. We evaluate a replicated 26 factorial design (n = 4 replicates, 64 configurations), showing that PILLS consistently achieves sub-pixel tracking accuracy for the bilinear and physics-refined decoder outputs under both clean and noisy conditions.

2604.16894 2026-04-21 cs.LG stat.ME stat.ML

Covariance-Based Structural Equation Modeling in Small-Sample Settings with $p>n$

Hiroki Hasegawa, Aoba Tamura, Yukihiko Okada

Comments 31 pages, 7 figures and 7 tables

详情
英文摘要

Factor-based Structural Equation Modeling (SEM) relies on likelihood-based estimation assuming a nonsingular sample covariance matrix, which breaks down in small-sample settings with $p>n$. To address this, we propose a novel estimation principle that reformulates the covariance structure into self-covariance and cross-covariance components. The resulting framework defines a likelihood-based feasible set combined with a relative error constraint, enabling stable estimation in small-sample settings where $p>n$ for sign and direction. Experiments on synthetic and real-world data show improved stability, particularly in recovering the sign and direction of structural parameters. These results extend covariance-based SEM to small-sample settings and provide practically useful directional information for decision-making.

2604.16893 2026-04-21 cs.CV cs.LG

EasyVideoR1: Easier RL for Video Understanding

Chuanyu Qin, Chenxu Yang, Qingyi Si, Naibin Gu, Dingyu Yao, Zheng Lin, Peng Fu, Nan Duan, Jiaqi Wang

详情
英文摘要

Reinforcement learning from verifiable rewards (RLVR) has demonstrated remarkable effectiveness in improving the reasoning capabilities of large language models. As models evolve into natively multimodal architectures, extending RLVR to video understanding becomes increasingly important yet remains largely unexplored, due to the diversity of video task types, the computational overhead of repeatedly decoding and preprocessing high-dimensional visual inputs, and the difficulty of reproducible evaluation across numerous sensitive hyperparameters. Existing open-source RL training frameworks provide solid infrastructure for text and image scenarios but lack systematic optimizations tailored for video modality. In this work, we present \textbf{EasyVideoR1}, a complete and efficient reinforcement learning framework specifically designed for training large vision-language models on video understanding tasks. EasyVideoR1 makes the following contributions: (1) a full video RL training pipeline with offline preprocessing and tensor caching that eliminates redundant video decoding and yields a 1.47 $\times$ throughput improvement; (2) a comprehensive, task-aware reward system covering 11 distinct video and image problem types with unified routing and modular extension; (3) a mixed offline-online data training paradigm that combines curated high-quality trajectories with on-policy exploration, benefiting the learning of more challenging tasks; (4) joint image-video training with independently configurable pixel budgets, allowing the two modalities to mutually reinforce each other; and (5) an asynchronous multi-benchmark evaluation framework covering 22 mainstream video understanding benchmarks, with reproduced accuracy closely aligned with officially reported scores.

2604.16892 2026-04-21 cs.CV

CrossFlowDG: Bridging the Modality Gap with Cross-modal Flow Matching for Domain Generalization

Antonios Kritikos, Nikolaos Spanos, Athanasios Voulodimos

Comments Accepted in CVPRW 2026 (DG-EBF Workshop)

详情
英文摘要

Domain generalization (DG) aims to maintain performance under domain shift, which in computer vision appears primarily as stylistic variations that cause models to overfit to domain-specific appearance cues rather than class semantics. To overcome this, recent methods use textual representations as stable, domain-invariant anchors. However, multimodal approaches that rely on cosine similarity-based contrastive alignment leave a modality gap where image and text embeddings remain geometrically separated despite semantic correspondence. We propose CrossFlowDG, a novel DG framework that addresses this residual gap using noise-free, cross-modal flow matching. By learning a continuous transformation in the joint Euclidean latent space, our framework explicitly transports domain-biased image embeddings toward domain-invariant text embeddings of the correct class. Using the efficient VMamba image encoder and CLIP's text encoder, CrossFlowDG is tested against four common DG benchmarks, and achieves competitive performance on several benchmarks and state-of-the-art on TerraIncognita. Code is available at: https://github.com/ajkrit/CrossFlowDG

2604.16890 2026-04-21 cs.AI

Step-GRPO: Internalizing Dynamic Early Exit for Efficient Reasoning

Benteng Chen, Weida Wang, Shufei Zhang, Mingbao Lin, Min Zhang

Comments This paper has been accepted for publication at the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026)

详情
英文摘要

Large reasoning models that use long chain-of-thought excel at problem-solving yet waste compute on redundant checks. Curbing this overthinking is hard: training-time length penalties can cripple ability, while inference-time early-exit adds system overhead. To bridge this gap, we propose Step-GRPO, a novel post-training framework that internalizes dynamic early-exit capabilities directly into the model. Step-GRPO shifts the optimization objective from raw tokens to semantic steps by utilizing linguistic markers to structure reasoning. We introduce a Dynamic Truncated Rollout mechanism that exposes the model to concise high-confidence trajectories during exploration, synergized with a Step-Aware Relative Reward that dynamically penalizes redundancy based on group-level baselines. Extensive experiments across three model sizes on diverse benchmarks demonstrate that Step-GRPO achieves a superior accuracy-efficiency trade-off. On Qwen3-8B, our method reduces token consumption by 32.0\% compared to the vanilla model while avoiding the accuracy degradation observed in traditional length-penalty methods.

2604.16888 2026-04-21 cs.LG math.OC

Towards Fully Parameter-Free Stochastic Optimization: Grid Search with Self-Bounding Analysis

Yuheng Zhao, Yu-Hu Yan, Amit Attia, Tomer Koren, Lijun Zhang, Peng Zhao

详情
英文摘要

Parameter-free stochastic optimization aims to design algorithms that are agnostic to the underlying problem parameters while still achieving convergence rates competitive with optimally tuned methods. While some parameter-free methods do not require the specific values of the problem parameters, they still rely on prior knowledge, such as the lower or upper bounds of them. We refer to such methods as ``partially parameter-free''. In this work, we target achieving ``fully parameter-free'' methods, i.e., the algorithmic inputs do not need to satisfy any unverifiable condition related to the true problem parameters. We propose a powerful and general grid search framework, named \textsc{Grasp}, with a novel self-bounding analysis technique that effectively determines the search ranges of parameters, in contrast to previous work. Our method demonstrates generality in: (i) the non-convex case, where we propose a fully parameter-free method that achieves near-optimal convergence rate, up to logarithmic factors; (ii) the convex case, where our parameter-free methods are competitive with strong performance in terms of acceleration and universality. Finally, we contribute a sharper guarantee for the model ensemble, a final step of the grid search framework, under interpolated variance characterization.

2604.16887 2026-04-21 cs.RO

Time-Division Multiplexing Actuation in Tendon-Driven Arms: Lightweight Design and Fault Tolerance

Shoujie Li, Changqing Guo, Jianle Xu, Hong Luo, Xueqian Wang, Wenbo Ding, Bin Liang

Comments 11 pages

Journal ref IEEE T-MECH Under review 2026

详情
英文摘要

Robotic manipulators for aerospace applications require a delicate balance between lightweight construction and fault-tolerant operation to satisfy strict weight limitations and ensure reliability in remote, hazardous environments. This paper presents Time-Division Multiplexing Actuation (TDMA), a practical approach for tendon-driven robots that significantly reduces actuator count while preserving high torque output and intrinsic fault tolerance. The key hardware employs a vertically-stacked rotational selection structure that integrates self-rotating TDM motors for rapid configuration, electromagnetic clutches enabling sub-0.1 second engagement, a worm gear reducer for enhanced load capacity and self-locking capability, and a dual-encoder system for precise, long-term positioning. Leveraging TDMA, the proposed MuxArm achieves a self-weight of 2.17 kg, supports an actuator driving capacity of 10 kg, and maintains end-effector accuracy up to 1% of its length, even under partial servo failure. Additionally, an actuation space trajectory planning algorithm is developed, enabling fault-tolerant control and reducing tendon load by up to 50% compared to conventional methods. Comprehensive experiments demonstrate MuxArm's robust performance in diverse settings, including free-space, cluttered, and confined environments.

2604.16886 2026-04-21 cs.RO

Chain Of Interaction Benchmark (COIN): When Reasoning meets Embodied Interaction

Xianhao Wang, Xiaojian Ma, Haozhe Hu, Rongpeng Su, Yutian Cheng, Zhou Ziheng, Hangxin Liu, Lei Liu, Bin Li, Qing Li

详情
英文摘要

Generalist embodied agents must perform interactive, causally-dependent reasoning, continually interacting with the environment, acquiring information, and updating plans to solve long-horizon tasks before they could be adopted in real-life scenarios. For instance, retrieving an apple from a cabinet may require opening multiple doors and drawers before the apple becomes visible and reachable, demanding sequential interaction under partial observability. However, existing benchmarks fail to systematically evaluate this essential capability. We introduce COIN, a benchmark designed to assess interactive reasoning in realistic robotic manipulation through three key contributions. First, we construct COIN-50: 50 interactive tasks in daily scenarios, and create COIN-Primitive required by causally-dependent tasks, and COIN-Composition with mid-term complexity for skill learning and generalization evaluation. Second, we develop a low-cost mobile AR teleoperation system and collect the COIN-Primitive Dataset with 50 demonstrations per primitive task (1,000 in total). Third, we develop systematic evaluation metrics about execution stability and generalization robustness to evaluate CodeAsPolicy, VLA, and language-conditioned H-VLA approaches. Our comprehensive evaluation reveals critical limitations in current methods: models struggle with interactive reasoning tasks due to significant gaps between visual understanding and motor execution. We provide fine-grained analysis of these limitations.

2604.16884 2026-04-21 cs.CV

Bias-constrained multimodal intelligence for equitable and reliable clinical AI

Cheng Li, Weijian Huang, Jiarun Liu, Hao Yang, Qi Yang, Song Wu, Ye Li, Hairong Zheng, Shanshan Wang

详情
英文摘要

The integration of medical imaging and clinical text has enabled the emergence of generalist artificial intelligence (AI) systems for healthcare. However, pervasive biases, such as imbalanced disease prevalence, skewed anatomical region distributions, heterogeneous imaging protocols, and demographic disparities, pose significant challenges to the fairness and reliability of vision-language systems in real-world clinical settings. Here we present BiasCareVL, a bias-aware multimodal learning framework that introduces bias control directly into model design, rather than treating it as a post hoc correction. BiasCareVL incorporates adaptive uncertainty modeling with optional human-in-the-loop refinement to regulate the influence of dominant data patterns and to promote equitable reasoning under distributional imbalance. Trained on 3.44 million samples spanning over 15 imaging modalities, the framework supports diverse clinical tasks, including visual question answering, disease classification, segmentation, and report generation within a unified representation space. Across eight public benchmarks covering dermatology, oncology, radiology, and pathology, BiasCareVL consistently outperforms 20 state-of-the-art methods, with pronounced gains in clinically challenging scenarios, including over 10% accuracy improvement in multi-class skin lesion diagnosis and more than 20% Dice improvement in small tumor segmentation. Furthermore, BiasCareVL achieves diagnostic performance exceeding human accuracy with substantially reduced time requirements when evaluated with board-certified radiologists. By open-sourcing BiasCareVL, we aim to promote a transparent, reproducible, and equitable future for AI in healthcare, paving the way for general-purpose, trustworthy, and clinically reliable AI systems.

2604.16883 2026-04-21 cs.LG cs.AI

SinkRouter: Sink-Aware Routing for Efficient Long-Context Decoding in Large Language and Multimodal Models

Junnan Liu, Xinyan Liu, Peifeng Gao, Zhaobo Qi, Beichen Zhang, Weigang Zhang, Antoni Bert Chen

详情
英文摘要

In long-context decoding for LLMs and LMMs, attention becomes increasingly memory-bound because each decoding step must load a large amount of KV-cache data from GPU memory. Existing acceleration strategies often trade efficiency for accuracy by relying on heuristic pruning that may discard useful information. At a deeper level, they also tend to indiscriminately preserve all high-scoring tokens, treat early tokens as indispensable anchors, or rely on heuristic head routing, reflecting an insufficient mechanistic understanding of the attention sink phenomenon. In this paper, we show that the attention sink phenomenon corresponds to a stable, reachable, and error-controllable fixed point constructed during training. Based on this insight, we propose SinkRouter, a training-free selective routing framework that detects the sink signal and skips computations that would otherwise produce near-zero output. To translate this mechanism into real-world acceleration, we develop a hardware-aware Triton kernel with block-level branching and Split-K parallelism. We conduct extensive evaluations on a diverse suite of long-context benchmarks, including LongBench, InfiniteBench, CVBench, MileBench, and MMVP, using both text-only and multimodal backbones such as Llama-3.1-8B, Llama-3.1-70B, Yi-9B-200K, LLaVA-1.5-7B, and LLaVA-1.5-13B. Across these settings, SinkRouter consistently improves decoding efficiency while maintaining competitive accuracy, and reaches 2.03x speedup with a 512K context.

2604.16881 2026-04-21 cs.CL cs.AI

Incentivizing Parametric Knowledge via Reinforcement Learning with Verifiable Rewards for Cross-Cultural Entity Translation

Jiang Zhou, Xiaohu Zhao, Xinwei Wu, Tianyu Dong, Hao Wang, Yangyang Liu, Heng Liu, Linlong Xu, Longyue Wang, Weihua Luo, Deyi Xiong

Comments 23 pages, 11 figures, 11 tables

详情
英文摘要

Cross-cultural entity translation remains challenging for large language models (LLMs) as literal or phonetic renderings are usually yielded instead of culturally appropriate translations in context. However, relevant knowledge may already be encoded in model parameters during large-scale pre-training. To incentivize the effective use of parametric knowledge, we propose EA-RLVR (Entity-Anchored Reinforcement Learning with Verifiable Rewards), a training framework that optimizes cross-cultural entity translation without relying on external knowledge bases. EA-RLVR anchors supervision on a verifiable, entity-level reward signal and incorporates lightweight structural gates to stabilize optimization. This design steers the model toward learning a robust reasoning process rather than merely imitating reference translations. We evaluate EA-RLVR on XC-Translate and observe consistent improvements in both entity translation accuracy and out-of-domain generalization. Specifically, training on merely 7k samples boosts Qwen3-14B's entity translation accuracy from 23.66\% to 31.87\% on a 50k test set comprising entirely unseen entities. The learned entity translation ability also transfers to general translation, yielding +1.35 XCOMET on WMT24++, which scales to +1.59 with extended optimization. Extensive analyses of $pass@k$ dynamics and reward formulations attribute these gains to superior sampling efficiency and a stable optimization landscape.

2604.16878 2026-04-21 cs.LG

OC-Distill: Ontology-aware Contrastive Learning with Cross-Modal Distillation for ICU Risk Prediction

Zhongyuan Liang, Junhyung Jo, Hyang-Jung Lee, Sang Kyu Kim, Irene Y. Chen

详情
英文摘要

Early prediction of severe clinical deterioration and remaining length of stay can enable timely intervention and better resource allocation in high-acuity settings such as the ICU. This has driven the development of machine learning models that leverage continuous streams of vital signs and other physiological signals for real-time risk prediction. Despite their promise, existing methods have important limitations. Contrastive pretraining treats all patients as equally strong negatives, failing to capture clinically meaningful similarity between patients with related diagnoses. Meanwhile, downstream fine-tuning typically ignores complementary modalities such as clinical notes, which provide rich contextual information unavailable in physiological signals alone. To address these challenges, we propose OC-Distill, a two-stage framework that leverages multimodal supervision during training while requiring only vital signs at inference. In the first stage, we introduce an ontology-aware contrastive objective that exploits the ICD hierarchy to quantify patient similarity and learn clinically grounded representations. In the second stage, we fine-tune the pretrained encoder via cross-modal knowledge distillation, transferring complementary information from clinical notes into the model. Across multiple ICU prediction tasks on MIMIC, OC-Distill demonstrates improved label efficiency and achieves state-of-the-art performance among methods that use only vital signs at inference.

2604.16871 2026-04-21 cs.AI cs.LG

GRAIL: Autonomous Concept Grounding for Neuro-Symbolic Reinforcement Learning

Hikaru Shindo, Henri Rößler, Quentin Delfosse, Kristian Kersting

Comments Preprint

详情
英文摘要

Neuro-symbolic Reinforcement Learning (NeSy-RL) combines symbolic reasoning with gradient-based optimization to achieve interpretable and generalizable policies. Relational concepts, such as "left of" or "close by", serve as foundational building blocks that structure how agents perceive and act. However, conventional approaches require human experts to manually define these concepts, limiting adaptability since concept semantics vary across environments. We propose GRAIL (Grounding Relational Agents through Interactive Learning), a framework that autonomously grounds relational concepts through environmental interaction. GRAIL leverages large language models (LLMs) to provide generic concept representations as weak supervision, then refines them to capture environment-specific semantics. This approach addresses both sparse reward signals and concept misalignment prevalent in underdetermined environments. Experiments on the Atari games Kangaroo, Seaquest, and Skiing demonstrate that GRAIL matches or outperforms agents with manually crafted concepts in simplified settings, and reveals informative trade-offs between reward maximization and high-level goal completion in the full environment.

2604.16868 2026-04-21 cs.RO

Greedy Kalman-Swarm: Improving State Estimation in Robot Swarms in Harsh Environments

Phunyapa Suksomboon, Paulo Garcia

Comments accepted at ECTI-CON 2026

详情
英文摘要

State estimation is a fundamental requirement in robotics, where the accurate determination of a robot's state is essential for stable operation despite inherent process disturbances and sensor noise. Traditionally, this is achieved through Kalman filtering, providing a statistically optimal estimate by balancing predictive models with noisy measurements. In the context of robotic swarms, the challenge shifts from individual accuracy to collective coordination, where the integration of global dynamics can significantly enhance the precision of the entire group. Existing estimation techniques rely on centralized processing or heavy communication protocols to reach a global consensus, which are frequently impractical in real-world deployments. Here we show that a localized, "greedy" approach to distributed state estimation (termed "Greedy Kalman-Swarm") allows individual robots to leverage relative inter-robot sensing for improved accuracy without requiring full data availability or global communication. Simulations in communication-constrained environments show robots can effectively integrate all currently available neighbor data at each iteration to refine their internal states, yet remain robust and functional even when data is missing. This results in a performance profile that strikes a balance between the low overhead of independent estimation and the high accuracy of centralized systems, specifically under harsh or dynamic environmental conditions. Our results demonstrate that global state awareness can be emergent rather than enforced, providing a scalable framework for maintaining swarm cohesion in unpredictable terrains. We anticipate that this decentralized methodology will serve as a foundation for more resilient autonomous systems, particularly in search-and-rescue or space exploration missions where reliable, high-bandwidth communication cannot be guaranteed.

2604.16862 2026-04-21 cs.LG

Learning to Trade Like an Expert: Cognitive Fine-Tuning for Stable Financial Reasoning in Language Models

Yuchen Pan, Soung Chang Liew

Comments 6 pages, 3 figures

详情
英文摘要

Recent deployments of large language models (LLMs) as autonomous trading agents raise questions about whether financial decision-making competence generalizes beyond specific market patterns and how it should be trained and evaluated in noisy markets lacking ground truth. We propose a structured framework for training and evaluating such models. Central to our approach is a curated, multiple-choice question (MCQ) dataset derived from classic textbooks and historical markets, verified by an AI committee, enriched with structured reasoning traces, and augmented to reduce shortcut learning. To evaluate whether performance on isolated MCQs generalizes to real-world trading, we introduce a two-stage protocol combining test-set evaluation with an MCQ-based chronological trading simulation. Extensive evaluations across market regimes provide statistically robust evidence that open models trained with our framework exhibit competitive, risk-aware behavior over time, outperform open-source baselines, and approach frontier-model performance at smaller scale. We release the dataset and evaluation framework to support further research.

2604.16861 2026-04-21 cs.LG cs.CV

CCAR: Intrinsic Robustness as an Emergent Geometric Property

Akash Samanta, Manish Pratap Singh, Debasis Chaudhuri

详情
英文摘要

Standard supervised learning optimizes for predictive accuracy but remains agnostic to the internal geometry of learned features, often yielding representations that are entangled and brittle. We propose Class-Conditional Activation Regularization (CCAR) to explicitly engineer the feature space, imposing a block-diagonal structure via a soft inductive bias. By shaping the latent representation to confine class energy to orthogonal subspaces, we create an intrinsic geometric scaffold that naturally filters noise and adversarial perturbations. We provide theoretical analysis linking this structural constraint to the maximization of the Fisher Discriminant Ratio, establishing a formal connection between geometric disentanglement and algorithmic stability. Empirically, this approach demonstrates that robustness is an emergent property of a well-engineered feature space, significantly outperforming baselines on label noise and input corruption benchmarks.

2604.16859 2026-04-21 cs.AI

GAMMA-Net: Adaptive Long-Horizon Traffic Spatio-Temporal Forecasting Model based on Interleaved Graph Attention and Multi-Axis Mamba

Dongyi He, Yuanquan Gao, Bin Jiang, He Yan

详情
英文摘要

Accurate traffic forecasting is crucial for intelligent transportation systems, supporting effective traffic management, congestion reduction, and informed urban planning. However, traditional models often fail to adequately capture the intricate spatio-temporal dependencies present in traffic data. To overcome these limitations, we introduce GAMMA-Net, a novel approach that integrates Graph Attention Networks (GAT) with multi-axis Selective State Space Models (Mamba). The GAT component uses a self-attention mechanism to dynamically adjust the influence of nodes within the traffic network, enabling adaptive spatial dependency modeling based on real-time conditions. Simultaneously, the Mamba module efficiently models long-term temporal and spatial dynamics without the heavy computational cost of conventional recurrent architectures. Extensive experiments on several benchmark traffic datasets, including METR-LA, PEMS-BAY, PEMS03, PEMS04, PEMS07, and PEMS08, show that GAMMA-Net consistently outperforms existing state-of-the-art models across different prediction horizons, achieving up to a 16.25% reduction in Mean Absolute Error (MAE) compared to baseline models. Ablation studies highlight the critical contributions of both the spatial and temporal components, emphasizing their complementary role in improving prediction accuracy. In conclusion, the GAMMA-Net model sets a new standard in traffic forecasting, offering a powerful tool for next-generation traffic management and urban planning. The code for this study is available at https://github.com/hdy6438/GAMMA-Net

2604.16858 2026-04-21 cs.CV

Q-DeepSight: Incentivizing Thinking with Images for Image Quality Assessment and Refinement

Xudong Li, Jiaxi Tan, Ziyin Zhou, Yan Zhong, Zihao Huang, Jingyuan Zheng, Yan Zhang, Xiawu Zheng, Rongrong Ji

详情
英文摘要

Image Quality Assessment (IQA) models are increasingly deployed as perceptual critics to guide generative models and image restoration. This role demands not only accurate scores but also actionable, localized feedback. However, current MLLM-based methods adopt a single-look, language-only paradigm, which departs from human evidence-seeking judgment and yields weakly grounded rationales, limiting their reliability for in-the-loop refinement. We propose Q-DeepSight, a think-with-image framework that emulates this human-like process. It performs interleaved Multimodal Chain-of-Thought (iMCoT) with tool-augmented evidence acquisition (e.g., crop-and-zoom) to explicitly determine where quality degrades and why. To train these long iMCoT trajectories via reinforcement learning, we introduce two techniques: Perceptual Curriculum Reward (PCR) to mitigate reward sparsity and Evidence Gradient Filtering (EGF) to improve credit assignment for visually-grounded reasoning. Q-DeepSight achieves state-of-the-art performance across diverse benchmarks, including natural, restored, and AI-generated content. Furthermore, we demonstrate its practical value with Perceptual-in-Generation (PiG), a training-free framework where Q-DeepSight's diagnoses guide iterative image enhancement, effectively closing the loop between assessment and refinement.

2604.16855 2026-04-21 cs.CV

When W4A4 Breaks Camouflaged Object Detection: Token-Group Dual-Constraint Activation Quantization

Tianqi Li, Wenyu Fang, Xin He, Xue Geng, Xu Cheng, Yun Liu

详情
英文摘要

Camouflaged object detection (COD) segments objects that intentionally blend with the background, so predictions depend on subtle texture and boundary cues. COD is often needed under tight on-device memory and latency budgets, making low-bit inference highly desirable. However, COD is unusually hard to quantify aggressively. We study post-training W4A4 quantization of Transformer-based COD and find a task-specific cliff: heavy-tailed background tokens dominate a shared activation range, inflating the step size and pushing weak-but-structured boundary cues into the zero bin. This exposes a token-local bottleneck -- remove cross-token range domination and bound the zero-bin mass under 4-bit activations. To address this, we introduce COD-TDQ, a COD-aware Token-group Dual-constraint activation Quantization method. COD-TDQ addresses this token-local bottleneck with two coupled steps: Direct-Sum Token-Group (DSTG) assigns token-group scales to suppress cross-token range domination, and Dual-Constraint Range Projection (DCRP) projects each token-group clip range to keep the step-to-dispersion ratio and the zero-bin mass bounded. Across four COD benchmarks and two baseline models (CFRN and ESCNet), COD-TDQ consistently achieves an Sαscore more than 0.12 higher than that of the state-of-the-art quantization method without retraining. The code will be released.