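arXivDaily daily academic digest: synchronized with the full arXiv dataset, with AI-generated summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering & systems, and other fields.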

2604.17535 2026-04-21 cs.CL cs.AI

OPSDL: On-Policy Self-Distillation for Long-Context Language Models

Xinsen Zhang, Zhenkai Ding, Tianjun Pan, Run Yang, Chun Kang, Xue Xiong, Jingnan Gu

Comments 9 pages, 1 figure

Abstract

Extending the effective context length of large language models (LLMs) remains a central challenge for real-world applications. While recent post-training methods have made progress in long-context scaling, they either rely on high-quality supervision data or sparse sequence-level rewards, leading to unstable and inefficient optimization. We propose OPSDL, an On-Policy Self-Distillation method for enhancing the Long-context capabilities of LLMs. Unlike other recent self-distillation methods that inject privileged information and rely on the model's in-context learning ability to act as a teacher, OPSDL leverages the model's own inherently strong short-context capability as a self-teacher to supervise its own generation in long-context scenarios. The model first generates responses conditioned on the full long-context, then the self-teacher provides per-token supervision signals via point-wise reverse KL divergence under the relevant extracted short-context. This dense token-level signal encourages faithful use of relevant evidence and mitigates hallucinations induced by irrelevant context. We evaluate OPSDL on long-context benchmarks across a range of models from 7B to 32B parameters. Results show consistent and substantial improvements across varying context lengths, outperforming standard post-training approaches such as SFT and DPO with higher sample efficiency. Notably, these gains are achieved without degrading general short-context performance. These findings highlight the effectiveness of OPSDL as a scalable and stable approach for long-context learning.
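The per-token supervision described in this abstract can be illustrated with a small sketch. It assumes that the student is the model conditioned on the full long context, that the self-teacher is the same model conditioned on the extracted short context, and that both produce logits over the same vocabulary at every position of the generated response. The function names and the toy logits are hypothetical, and the full-vocabulary reverse KL used here is one plausible reading of "point-wise reverse KL", not necessarily the authors' exact loss.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def reverse_kl_per_token(student_logits, teacher_logits):
    """Return KL(student || teacher) for each response position.

    student_logits: logits under the full long context (hypothetical input).
    teacher_logits: logits under the extracted short context (hypothetical input).
    """
    losses = []
    for s_row, t_row in zip(student_logits, teacher_logits):
        log_s = log_softmax(s_row)
        log_t = log_softmax(t_row)
        # Reverse KL weights the log-ratio by the student's own probabilities.
        losses.append(sum(math.exp(ls) * (ls - lt) for ls, lt in zip(log_s, log_t)))
    return losses


# Toy example: two response tokens, vocabulary of three entries.
student = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
teacher = [[2.0, 0.0, 0.0], [3.0, 0.0, 0.0]]
losses = reverse_kl_per_token(student, teacher)
print(losses)
```

In this toy run the first position has zero loss because student and teacher agree, while the second is positive because the teacher concentrates on one token and the student does not. That is the dense, per-position signal the abstract contrasts with sparse sequence-level rewards.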


2604.17513 2026-04-21 cs.RO

FLASH: Fast Learning via GPU-Accelerated Simulation for High-Fidelity Deformable Manipulation in Minutes

Siyuan Luo, Bingyang Zhou, Chong Zhang, Xin Liu, Zhenhao Huang, Gang Yang, Zhengtao Han, Xiaotian Hu, Eric Yang, Rymon Yu, Ziqiu Zeng, Fan Shi

2604.17512 2026-04-21 cs.CL cs.LG

ONTO: A Token-Efficient Columnar Notation for LLM Input Optimization

Harshavardhanan Deekeswar

Comments 8 pages, 5 tables, 1 figure. Code, benchmarks, and specification at https://github.com/harsh-aranga/onto

2604.17504 2026-04-21 cs.CV cs.AI

RS-HyRe-R1: A Hybrid Reward Mechanism to Overcome Perceptual Inertia for Remote Sensing Images Understanding

Gaozhi Zhou, Hu He, Peng Shen, Jipeng Zhang, Liujue Zhang, Linrui Xu, Zeyuan Wang, Ziyu Li, Xuezhi Cui, Wang Guo, Haifeng Li

2604.17503 2026-04-21 cs.AI cs.MA

SkillGraph: Self-Evolving Multi-Agent Collaboration with Multimodal Graph Topology

Zheng Nie, Ruolin Shen, Xinlei Yu, Bo Yin, Jiangning Zhang, Xiaobin Hu

2604.17501 2026-04-21 cs.CL

CoAct: Co-Active LLM Preference Learning with Human-AI Synergy

Ruiyao Xu, Mihir Parmar, Tiankai Yang, Zhengyu Hu, Yue Zhao, Kaize Ding

Comments ACL 2026

2604.17500 2026-04-21 cs.CV

Edit Fidelity Field: Semantics-Aware Region Isolation for Training-Free Scene Text Editing

Guandong Li, Mengxia Ye

2604.17494 2026-04-21 cs.LG cs.AI

A Probabilistic Consensus-Driven Approach for Robust Counterfactual Explanations

Marcin Kostrzewa, Maciej Zięba, Jerzy Stefanowski

2604.17492 2026-04-21 cs.CV

Coevolving Representations in Joint Image-Feature Diffusion

Theodoros Kouzelis, Spyros Gidaris, Nikos Komodakis

2604.17480 2026-04-21 cs.LG

Trustworthy deep domain adaptation for wearable photoplethysmography signal analysis with decision-theoretic uncertainty quantification

Ciaran Bench

2604.17477 2026-04-21 cs.CV cs.LG

Unveiling Deepfakes: A Frequency-Aware Triple Branch Network for Deepfake Detection

Qihao Shen, Jiaxing Xuan, Zhenguang Liu, Sifan Wu, Yutong Xie, Zhaoyan Ming, Yingying Jiao, Kui Ren

Abstract

Advanced deepfake technologies are blurring the line between real and fake, presenting both revolutionary opportunities and alarming threats. While they unlock novel applications in fields like entertainment and education, their malicious use has sparked urgent ethical and societal concerns ranging from identity theft to the dissemination of misinformation. To tackle these challenges, feature analysis using frequency features has emerged as a promising direction for deepfake detection. However, one aspect that has been overlooked so far is that existing methods tend to concentrate on one or a few specific frequency domains, which risks overfitting to particular artifacts and significantly undermines their robustness when facing diverse forgery patterns. Another underexplored aspect we observe is that different features often attend to the same forged region, resulting in redundant feature representations and limiting the diversity of the extracted clues. This may undermine the ability of a model to capture complementary information across different facets, thereby compromising its generalization capability to diverse manipulations. In this paper, we seek to tackle these challenges from two aspects: (1) we propose a triple-branch network that jointly captures spatial and frequency features by learning from both the original image and the image reconstructed from different frequency channels, and (2) we mathematically derive feature decoupling and fusion losses grounded in mutual information theory, which encourage the model to focus on task-relevant features across the original image and the image reconstructed from different frequency channels. Extensive experiments on six large-scale benchmark datasets demonstrate that our method consistently achieves state-of-the-art performance. Our code is released at https://github.com/injooker/Unveiling Deepfake.


2604.17475 2026-04-21 cs.AI cs.CL cs.LG

Waking Up Blind: Cold-Start Optimization of Supervision-Free Agentic Trajectories for Grounded Visual Perception

Ashutosh Bajpai, Tamal Majumder, Akshay Nambi, Tanmoy Chakraborty

Comments ACL 2026 Findings

2604.17472 2026-04-21 cs.CV

UniMesh: Unifying 3D Mesh Understanding and Generation

Peng Huang, Yifeng Chen, Zeyu Zhang, Hao Tang

2604.17470 2026-04-21 cs.LG

Machine Learning Hamiltonian Dynamical Systems with Sparse and Noisy Data

Vedanta Thapar, Abhinav Gupta

2604.17455 2026-04-21 cs.CV

From Adaptation to Generalization: Adaptive Visual Prompting for Medical Image Segmentation

Evren Çetinkaya, Sangmin Lee, Jung Uk Kim, Hong Joo Lee, Nassir Navab

Comments CVPR 2026 Findings

2604.17454 2026-04-21 cs.CV

HSG: Hyperbolic Scene Graph

Liyang Wang, Zeyu Zhang, Hao Tang

2604.17451 2026-04-21 cs.CV

SegTTA: Training-Free Test-Time Augmentation for Zero-Shot Medical Imaging Segmentation

Yihong Yao, Chunlei Li, Canxuan Gang, Wenzhi Hu, Zeyu Zhang, Hao Zhang, Xiaoyan Li

2604.17446 2026-04-21 cs.CV

HyKey: Hyperspectral Keypoint Detection and Matching in Minimally Invasive Surgery

Alexander Saikia, Chiara Di Vece, Zhehua Mao, Sierra Bonilla, Chloe He, Joao Ramalhinho, Tobias Czempiel, Sophia Bano, Danail Stoyanov

Comments 15 pages, 5 figures, IPCAI/IJCARS

2604.17439 2026-04-21 cs.CV

Attention Is not Everything: Efficient Alternatives for Vision

Nur Mohammad Kazi, Ibteshum Khaled, Md. Luthful Hasan Galib, Ali Faruk Shihab, Md. Rakibul Islam

Comments Preprint, manuscript under review

2604.17436 2026-04-21 cs.CV

DEM Refinement and Validation on the Lunar Surface Using Shape-from-Shading with Chandrayaan-2 OHRC Imagery

Aaranay Aadi, Jai Gopal Singla, Nitant Dube

Comments 6 pages, 6 figures

2604.17435 2026-04-21 cs.CL cs.AI cs.SD eess.AS

MoVE: Translating Laughter and Tears via Mixture of Vocalization Experts in Speech-to-Speech Translation

Szu-Chi Chen, I-Ning Tsai, Yi-Cheng Lin, Sung-Feng Huang, Hung-yi Lee

Comments Submitted to Interspeech. Audio Demo and Dataset: https://47zzz.github.io/MoVE/

2604.17429 2026-04-21 cs.CL cs.AI

Jupiter-N Technical Report

George Drayson

2604.17428 2026-04-21 cs.CV cs.AI

Long-CODE: Isolating Pure Long-Context as an Orthogonal Dimension in Video Evaluation

Zhijiang Tang, Jiaxin Qi, Bing Zhao, Jianqiang Huang

2604.17425 2026-04-21 cs.LG physics.optics

Neural Adjoint Method for Meta-optics: Accelerating Volumetric Inverse Design via Fourier Neural Operators

Chanik Kang, Hyewon Suk, Haejun Chung

Comments 10 pages, 6 figures, 3 tables

2604.17422 2026-04-21 cs.CV cs.MM

Where to Focus: Query-Modulated Multimodal Keyframe Selection for Long Video Understanding

Shaoguang Wang, Weiyu Guo, Ziyang Chen, Xuming Hu, Hui Xiong

Comments 9 pages, 7 figures, 9 tables. Preprint

2604.17420 2026-04-21 cs.LG cs.AI cs.SI

TransXion: A High-Fidelity Graph Benchmark for Realistic Anti-Money Laundering

Keyang Chen, Mingxuan Jiang, Yongsheng Zhao, Zeping Li, Zaiyuan Chen, Weiqi Luo, Zhixin Li, Sen Liu, Yinan Jing, Guangnan Ye, Xihong Wu, Hongfeng Chai

2604.17411 2026-04-21 cs.CL cs.AI

DuConTE: Dual-Granularity Text Encoder with Topology-Constrained Attention for Text-attributed Graphs

Lexuan Liang, Tao Zou, Xuxiang Ta, Zekun Qiu

Comments 25 pages, 4 figures

2604.17407 2026-04-21 cs.RO

Think before Go: Hierarchical Reasoning for Image-goal Navigation

Pengna Li, Kangyi Wu, Shaoqing Xu, Fang Li, Lin Zhao, Long Chen, Zhi-Xin Yang, Nanning Zheng

Comments Accepted by ACL 2026 (main conference)

2604.17405 2026-04-21 cs.AI

STRIDE: Strategic Iterative Decision-Making for Retrieval-Augmented Multi-Hop Question Answering

Wei Chen, Lili Zhao, Zhi Zheng, HuiJun Hou, Tong Xu

Comments Accepted by SIGIR 2026 Full Paper. The code repository is available at https://github.com/fanshu6hao/STRIDE

Abstract

Multi-hop question answering (MHQA) enables accurate answers to complex queries by retrieving and reasoning over evidence dispersed across multiple documents. Existing MHQA approaches mainly rely on iterative retrieval-augmented generation, which suffers from two major issues. 1) Existing methods prematurely commit to surface-level entities rather than underlying reasoning structures, making question decomposition highly vulnerable to lexical ambiguity. 2) Existing methods overlook the logical dependencies among reasoning steps, resulting in uncoordinated execution. To address these issues, we propose STRIDE, a framework that separates strategic planning, dynamic control, and grounded execution. At its core, a Meta-Planner first constructs an entity-agnostic reasoning skeleton to capture the abstract logic of the query, thereby deferring entity grounding until after the reasoning structure is established, which mitigates disambiguation errors caused by premature lexical commitment. A Supervisor then orchestrates sub-question execution in a dependency-aware manner, enabling efficient parallelization where possible and sequential coordination when necessary. By dynamically deciding whether to retrieve new evidence or infer from existing facts, it avoids redundant queries and error propagation, while fusing cross-branch information and reformulating failed queries to enhance robustness. Grounded fact extraction and logical inference are delegated to specialized execution modules, ensuring faithfulness through explicit separation of retrieval and reasoning. We further propose STRIDE-FT, a modular fine-tuning framework that uses self-generated execution trajectories from STRIDE, requiring neither human annotations nor stronger teacher models. Experiments show that STRIDE achieves robust and accurate reasoning, while STRIDE-FT effectively enhances open-source LLMs.
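The dependency-aware orchestration attributed to the Supervisor can be pictured as grouping sub-questions into waves: everything in a wave may run in parallel, and a wave starts only after the waves it depends on have finished. The sketch below uses Python's standard `graphlib` module for this. The sub-question names and the dependency map are hypothetical, and this is only an illustration of the scheduling idea, not the STRIDE implementation.

```python
from graphlib import TopologicalSorter


def execution_waves(dependencies):
    """Group sub-questions into waves that respect their dependencies.

    dependencies maps each sub-question id to the set of ids it depends on.
    Items within one wave have no unmet dependencies and could run in parallel.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the dependencies are cyclic
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        waves.append(ready)
        sorter.done(*ready)  # mark the wave finished, unlocking its successors
    return waves


# Hypothetical skeleton: q3 needs the answers to q1 and q2, and q4 needs q3.
skeleton = {"q1": set(), "q2": set(), "q3": {"q1", "q2"}, "q4": {"q3"}}
waves = execution_waves(skeleton)
print(waves)
```

Here `q1` and `q2` fall into the same wave because neither depends on the other, while `q3` and `q4` are forced into later, sequential waves.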


2604.17400 2026-04-21 cs.AI math.AT

Phase-Scheduled Multi-Agent Systems for Token-Efficient Coordination

Mohit Dubey

Comments 8 pages, preprint, 3 figures

Abstract

Multi-agent systems (MAS) powered by large language models suffer from severe token inefficiency arising from two compounding sources: (i) unstructured parallel execution, where all agents activate simultaneously irrespective of input readiness; and (ii) unrestricted context sharing, where every agent receives the full accumulated context regardless of relevance. Existing mitigation strategies - static pruning, hierarchical decomposition, and learned routing - treat coordination as a structural allocation problem and fundamentally ignore its temporal dimension. We propose Phase-Scheduled Multi-Agent Systems (PSMAS), a framework that reconceptualizes agent activation as continuous control over a shared attention space modeled on a circular manifold. Each agent i is assigned a fixed angular phase theta_i in the range [0, 2*pi], derived from the task dependency topology; a global sweep signal phi(t) rotates at velocity omega, activating only agents within an angular window epsilon. Idle agents receive compressed context summaries, reducing per-step token consumption. We implement PSMAS on LangGraph, evaluate on four structured benchmarks (HotPotQA-MAS, HumanEval-MAS, ALFWorld-Multi, WebArena-Coord) and two unstructured conversational settings, and prove stability, convergence, and optimality results for the sweep dynamics. PSMAS achieves a mean token reduction of 27.3 percent (range 21.4-34.8 percent) while maintaining task performance within 2.1 percentage points of a fully activated baseline (p < 0.01, n = 500 per configuration), and outperforms the strongest learned routing baseline by 5.6 percentage points in token reduction with 2.0 percentage points less performance drop. Crucially, we show that scheduling and compression are independent sources of gain: scheduling alone accounts for 18-20 percentage points of reduction, robust to compression degradation up to alpha = 0.40.
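The sweep-based activation rule in this abstract can be sketched in a few lines. The sketch assumes a sweep signal of the form phi(t) = omega * t (mod 2*pi) starting at zero, and it treats epsilon as a half-width, so an agent is active when its phase lies within epsilon of the sweep in either direction. The abstract specifies neither choice. The agent names and phase values are hypothetical, and the context compression applied to idle agents is not modeled.

```python
import math

TWO_PI = 2 * math.pi


def angular_distance(a, b):
    """Shortest distance between two angles on the circle, in radians."""
    d = (a - b) % TWO_PI
    return min(d, TWO_PI - d)


def active_agents(phases, t, omega, epsilon):
    """Return the agents whose fixed phase lies within epsilon of the sweep.

    phases: mapping from agent name to its fixed phase theta_i in [0, 2*pi].
    t, omega: time step and angular velocity; phi(t) = omega * t is an assumption.
    epsilon: half-width of the activation window (an assumption).
    """
    phi = (omega * t) % TWO_PI
    return [name for name, theta in phases.items()
            if angular_distance(theta, phi) <= epsilon]


# Hypothetical three-agent pipeline spaced a quarter turn apart.
phases = {"planner": 0.0, "coder": math.pi / 2, "tester": math.pi}
schedule = [active_agents(phases, t, omega=math.pi / 2, epsilon=0.1) for t in range(4)]
print(schedule)
```

With these values only one agent is active at each of the first three steps and none at the fourth. Under the scheme described in the abstract, agents outside the window would receive compressed context summaries in place of the full accumulated context.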
