arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.12501 2026-05-13 cs.CV

Covering Human Action Space for Computer Use: Data Synthesis and Benchmark

Miaosen Zhang, Xiaohan Zhao, Zhihong Tan, Zhou Huoshen, Yijia Fan, Yifan Yang, Kai Qiu, Bei Liu, Justin Wagle, Chenzhong Yin, Mingxi Cheng, Ji Li, Qi Dai, Chong Luo, Xu Yang, Xin Geng, Baining Guo

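AI summary: This work targets the poor reliability of computer-use agents (CUAs) on complex, low-frequency interactions. It proposes CUActSpot, a benchmark covering five interaction modalities (GUI, text, table, canvas, and natural image) and a range of action types. To address the scarcity of complex-interaction data, the authors design a renderer-based data-synthesis pipeline that automatically generates scenes together with matching instructions and action traces. Experiments show that the model trained on this data outperforms open-source models with fewer than 32B parameters.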

Abstract

Computer-use agents (CUAs) automate on-screen work, as illustrated by GPT-5.4 and Claude. Yet their reliability on complex, low-frequency interactions is still poor, limiting user trust. Our analysis of failure cases from advanced models suggests a long-tail pattern in GUI operations, where a relatively small fraction of complex and diverse interactions accounts for a disproportionate share of task failures. We hypothesize that this issue largely stems from the scarcity of data for complex interactions. To address this problem, we propose a new benchmark CUActSpot for evaluating models' capabilities on complex interactions across five modalities: GUI, text, table, canvas, and natural image, as well as a variety of actions (click, drag, draw, etc.), covering a broader range of interaction types than prior click-centric benchmarks that focus mainly on GUI widgets. We also design a renderer-based data-synthesis pipeline: scenes are automatically generated for each modality, screenshots and element coordinates are recorded, and an LLM produces matching instructions and action traces. After training on this corpus, our Phi-Ground-Any-4B outperforms open-source models with fewer than 32B parameters. We will release our benchmark, data, code, and models at https://github.com/microsoft/Phi-Ground.git
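As a rough illustration of what a renderer-based synthesis step could look like, the sketch below lays out a toy scene, records element coordinates, and derives an instruction with its ground-truth action. Everything here is an assumption made for illustration: the function names, the scene format, and the templated instruction (which stands in for the LLM the abstract mentions) are hypothetical and are not taken from the Phi-Ground code.

```python
import random

def synthesize_scene(seed, width=800, height=600, n_buttons=4):
    """Lay out labeled buttons at random positions and record their boxes.

    Hypothetical stand-in for a renderer; buttons may overlap in this toy version.
    """
    rng = random.Random(seed)
    elements = []
    for i in range(n_buttons):
        w, h = rng.randint(60, 160), rng.randint(24, 48)
        x, y = rng.randint(0, width - w), rng.randint(0, height - h)
        elements.append({"label": f"Button {i}", "box": (x, y, x + w, y + h)})
    return {"size": (width, height), "elements": elements}

def make_click_sample(scene, rng):
    """Pair a templated instruction with the ground-truth click point.

    A fixed template replaces the LLM-written instruction (assumption).
    """
    target = rng.choice(scene["elements"])
    x0, y0, x1, y1 = target["box"]
    return {
        "instruction": f"Click '{target['label']}'",
        "target_box": target["box"],
        "action": {"type": "click", "x": (x0 + x1) // 2, "y": (y0 + y1) // 2},
    }

scene = synthesize_scene(seed=0)
sample = make_click_sample(scene, random.Random(1))
```

Because the renderer knows every element's coordinates, the label comes for free; drag or draw actions would need a trace of points rather than a single click.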

2605.12500 2026-05-13 cs.CV

SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture

Haiwen Diao, Penghao Wu, Hanming Deng, Jiahao Wang, Shihao Bai, Silei Wu, Weichen Fan, Wenjie Ye, Wenwen Tong, Xiangyu Fan, Yan Li, Yubo Wang, Zhijie Cao, Zhiqian Lin, Zhitao Yang, Zhongang Cai, Yuwei Niu, Yue Zhu, Bo Liu, Chengguang Lv, Haojia Yu, Haozhe Xie, Hongli Wang, Jianan Fan, Jiaqi Li, Jiefan Lu, Jingcheng Ni, Junxiang Xu, Kaihuan Liang, Lianqiang Shi, Linjun Dai, Linyan Wang, Oscar Qian, Peng Gao, Pengfei Liu, Qingping Sun, Rui Shen, Ruisi Wang, Shengnan Ma, Shuang Yang, Siyi Xie, Siying Li, Tianbo Zhong, Xiangli Kong, Xuanke Shi, Yang Gao, Yongqiang Yao, Yves Wang, Zhengqi Bai, Zhengyu Lin, Zixin Yin, Wenxiu Sun, Ruihao Gong, Quan Wang, Lewei Lu, Lei Yang, Ziwei Liu, Dahua Lin

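AI summary: This paper presents SenseNova-U1, a unified multimodal model that addresses the separation of understanding and generation in current vision-language models. Built on the NEO-unify architecture, it treats understanding and generation as synergistic views of a single underlying process, aiming at more native multimodal intelligence. The authors report strong performance across a range of tasks and describe the model design and training strategy in detail to support community research.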

Comments Project page: https://github.com/OpenSenseNova/SenseNova-U1

Abstract

Recent large vision-language models (VLMs) remain fundamentally constrained by a persistent dichotomy: understanding and generation are treated as distinct problems, leading to fragmented architectures, cascaded pipelines, and misaligned representation spaces. We argue that this divide is not merely an engineering artifact, but a structural limitation that hinders the emergence of native multimodal intelligence. Hence, we introduce SenseNova-U1, a native unified multimodal paradigm built upon NEO-unify, in which understanding and generation evolve as synergistic views of a single underlying process. We launch two native unified variants, SenseNova-U1-8B-MoT and SenseNova-U1-A3B-MoT, built on dense (8B) and mixture-of-experts (30B-A3B) understanding baselines, respectively. Designed from first principles, they rival top-tier understanding-only VLMs across text understanding, vision-language perception, knowledge reasoning, agentic decision-making, and spatial intelligence. Meanwhile, they deliver strong semantic consistency and visual fidelity, excelling in conventional or knowledge-intensive any-to-image (X2I) synthesis, complex text-rich infographic generation, and interleaved vision-language generation, with or without think patterns. Beyond performance, we show detailed model design, data preprocessing, pre-/post-training, and inference strategies to support community research. Last but not least, preliminary evidence demonstrates that our models extend beyond perception and generation, performing strongly in vision-language-action (VLA) and world model (WM) scenarios. This points toward a broader roadmap where models do not translate between modalities, but think and act across them in a native manner. Multimodal AI is no longer about connecting separate systems, but about building a unified one and trusting the necessary capabilities to emerge from within.

2605.12498 2026-05-13 cs.CV cs.GR

EgoForce: Forearm-Guided Camera-Space 3D Hand Pose from a Monocular Egocentric Camera

Christen Millerdurai, Shaoxiang Wang, Yaxu Xie, Vladislav Golyanik, Didier Stricker, Alain Pagani

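AI summary: This paper presents EgoForce, a monocular 3D hand reconstruction framework that recovers absolute 3D hand pose and position in camera space, that is, from the user's viewpoint. It targets AR/VR, telepresence, and other settings where sensing must remain compact and unobtrusive. By combining a differentiable forearm representation, a unified arm-hand transformer, and a ray-space closed-form solver, the method mitigates the depth-scale ambiguity of monocular approaches and reconstructs robustly across several wide-FOV camera models. On three egocentric datasets EgoForce reaches state-of-the-art accuracy, reducing camera-space MPJPE by up to 28% on HOT3D.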

Comments 23 pages, 19 figures and 10 tables; project page: https://dfki-av.github.io/EgoForce (source code, data and demo available); SIGGRAPH 2026 Conference

Abstract

Reconstructing the absolute 3D pose and shape of the hands from the user's viewpoint using a single head-mounted camera is crucial for practical egocentric interaction in AR/VR, telepresence, and hand-centric manipulation tasks, where sensing must remain compact and unobtrusive. While monocular RGB methods have made progress, they remain constrained by depth-scale ambiguity and struggle to generalize across the diverse optical configurations of head-mounted devices. As a result, models typically require extensive training on device-specific datasets, which are costly and laborious to acquire. This paper addresses these challenges by introducing EgoForce, a monocular 3D hand reconstruction framework that recovers robust, absolute 3D hand pose and its position from the user's (camera-space) viewpoint. EgoForce operates across fisheye, perspective, and distorted wide-FOV camera models using a single unified network. Our approach combines a differentiable forearm representation that stabilizes hand pose, a unified arm-hand transformer that predicts both hand and forearm geometry from a single egocentric view, mitigating depth-scale ambiguity, and a ray space closed-form solver that enables absolute 3D pose recovery across diverse head-mounted camera models. Experiments on three egocentric benchmarks show that EgoForce achieves state-of-the-art 3D accuracy, reducing camera-space MPJPE by up to 28% on the HOT3D dataset compared to prior methods and maintaining consistent performance across camera configurations. For more details, visit the project page at https://dfki-av.github.io/EgoForce.

2605.12497 2026-05-13 cs.CV

From Web to Pixels: Bringing Agentic Search into Visual Perception

Bokang Yang, Xinyi Sun, Kaituo Feng, Xingping Dong, Dongming Wu, Xiangyu Yue

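AI summary: This work studies open-world visual perception in which the target must first be identified from external information such as facts, events, long-tail entities, and multi-hop relations. The authors formalize this challenge as Perception Deep Research and build the WebEye benchmark, which provides verifiable evidence, knowledge-intensive queries, and precisely annotated object instances. They also propose Pixel-Searcher, an agentic search workflow that goes from external evidence to pixel-level localization of the target and achieves the strongest open-source results on the benchmark's tasks.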

Comments Project page: https://pixel-searcher.github.io/

Abstract

Visual perception connects high-level semantic understanding to pixel-level perception, but most existing settings assume that the decisive evidence for identifying a target is already in the image or frozen model knowledge. We study a more practical yet harder open-world case where a visible object must first be resolved from external facts, recent events, long-tail entities, or multi-hop relations before it can be localized. We formalize this challenge as Perception Deep Research and introduce WebEye, an object-anchored benchmark with verifiable evidence, knowledge-intensive queries, precise box/mask annotations, and three task views: Search-based Grounding, Search-based Segmentation, and Search-based VQA. WebEye contains 120 images, 473 annotated object instances, 645 unique QA pairs, and 1,927 task samples. We further propose Pixel-Searcher, an agentic search-to-pixel workflow that resolves hidden target identities and binds them to boxes, masks, or grounded answers. Experiments show that Pixel-Searcher achieves the strongest open-source performance across all three task views, while failures mainly arise from evidence acquisition, identity resolution, and visual instance binding.

2605.12496 2026-05-13 cs.CV

CausalCine: Real-Time Autoregressive Generation for Multi-Shot Video Narratives

Yihao Meng, Zichen Liu, Hao Ouyang, Qiuyu Wang, Ka Leong Cheng, Yue Yu, Hanlin Wang, Haobo Li, Jiapeng Zhu, Yanhong Zeng, Xing Zhu, Yujun Shen, Qifeng Chen, Huamin Qu

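AI summary: CausalCine is a real-time autoregressive framework for multi-shot video narratives, designed to address the motion stagnation and semantic drift that existing models show in long rollouts. It trains a causal base model and adds content-aware memory routing to keep generation coherent across shots, while supporting dynamic prompts and context reuse. Experiments show that CausalCine outperforms autoregressive baselines, approaches the quality of bidirectional models, and supports real-time interactive generation.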

Comments Project page: https://yihao-meng.github.io/CausalCine/

Abstract

Autoregressive video generation aims at real-time, open-ended synthesis. Yet, cinematic storytelling is not merely the endless extension of a single scene; it requires progressing through evolving events, viewpoint shifts, and discrete shot boundaries. Existing autoregressive models often struggle in this setting. Trained primarily for short-horizon continuation, they treat long sequences as extended single shots, inevitably suffering from motion stagnation and semantic drift during long rollouts. To bridge this gap, we introduce CausalCine, an interactive autoregressive framework that transforms multi-shot video generation into an online directing process. CausalCine generates causally across shot changes, accepts dynamic prompts on the fly, and reuses context without regenerating previous shots. To achieve this, we first train a causal base model on native multi-shot sequences to learn complex shot transitions prior to acceleration. We then propose Content-Aware Memory Routing (CAMR), which dynamically retrieves historical KV entries according to attention-based relevance scores rather than temporal proximity, preserving cross-shot coherence under bounded active memory. Finally, we distill the causal base model into a few-step generator for real-time interactive generation. Extensive experiments demonstrate that CausalCine significantly outperforms autoregressive baselines and approaches the capability of bidirectional models while unlocking the streaming interactivity of causal generation. Demo available at https://yihao-meng.github.io/CausalCine/
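The idea of retrieving cached entries by relevance rather than recency can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the CAMR implementation: the scoring rule (scaled dot product against a single query vector), the function name `route_memory`, and the fixed `budget` are all hypothetical choices.

```python
import numpy as np

def route_memory(query, keys, values, budget):
    """Keep the `budget` cached entries most relevant to `query`.

    Relevance is a scaled dot product (assumption); a recency-based cache
    would instead keep the last `budget` rows.
    """
    scores = keys @ query / np.sqrt(query.shape[0])
    keep = np.sort(np.argsort(scores)[-budget:])  # top-k, restored to temporal order
    return keys[keep], values[keep], keep

rng = np.random.default_rng(0)
keys = rng.normal(size=(32, 8))                      # 32 historical entries, dimension 8
keys /= np.linalg.norm(keys, axis=1, keepdims=True)  # unit-length keys
values = rng.normal(size=(32, 8))
query = keys[3]                                      # query aligned with an old entry
k_sel, v_sel, idx = route_memory(query, keys, values, budget=4)
```

Entry 3 lies far outside the window of the four most recent entries (28 to 31), yet it is retained because it scores highest, which is the behavior needed to keep an earlier shot available under a bounded active memory.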

2605.12495 2026-05-13 cs.CV cs.AI cs.LG

AlphaGRPO: Unlocking Self-Reflective Multimodal Generation in UMMs via Decompositional Verifiable Reward

Runhui Huang, Jie Wu, Rui Yang, Zhe Liu, Hengshuang Zhao

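AI summary: This paper proposes AlphaGRPO, a framework that applies Group Relative Policy Optimization (GRPO) to unified multimodal models (UMMs) and improves multimodal generation without an additional cold-start stage. Its Decompositional Verifiable Reward (DVReward) uses an LLM to break complex user requests into verifiable semantic and quality questions, giving stable and reliable feedback that supports reasoning text-to-image generation and autonomous self-reflective refinement. AlphaGRPO improves results on several multimodal generation benchmarks and also performs well on editing tasks without being trained on them.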

Comments ICML 2026

Abstract

In this paper, we propose AlphaGRPO, a novel framework that applies Group Relative Policy Optimization (GRPO) to AR-Diffusion Unified Multimodal Models (UMMs) to enhance multimodal generation capabilities without an additional cold-start stage. Our approach unlocks the model's intrinsic potential to perform advanced reasoning tasks: Reasoning Text-to-Image Generation, where the model actively infers implicit user intents, and Self-Reflective Refinement, where it autonomously diagnoses and corrects misalignments in generated outputs. To address the challenge of providing stable supervision for real-world multimodal generation, we introduce the Decompositional Verifiable Reward (DVReward). Unlike holistic scalar rewards, DVReward utilizes an LLM to decompose complex user requests into atomic, verifiable semantic and quality questions, which are then evaluated by a general MLLM to provide reliable and interpretable feedback. Extensive experiments demonstrate that AlphaGRPO yields robust improvements across multimodal generation benchmarks, including GenEval, TIIF-Bench, DPG-Bench and WISE, while also achieving significant gains in editing tasks on GEdit without training on editing tasks. These results validate that our self-reflective reinforcement approach effectively leverages inherent understanding to guide high-fidelity generation. Project page: https://huangrh99.github.io/AlphaGRPO/
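A small sketch may help show how decomposed checks could feed a group-relative update. The group standardization below is the usual GRPO advantage; the aggregation of atomic verdicts into one scalar (a plain pass fraction) and the example verdicts are assumptions, since the abstract does not say how DVReward combines its answers.

```python
import statistics

def dv_reward(answers):
    """Fraction of atomic yes/no checks that passed (aggregation is an assumption)."""
    return sum(answers) / len(answers)

def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one sampled group, as in GRPO."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical verdicts for four images sampled from one prompt,
# each judged on three decomposed questions.
group = [
    [True, True, True],
    [True, False, True],
    [False, False, True],
    [True, True, False],
]
rewards = [dv_reward(answers) for answers in group]
advantages = group_relative_advantages(rewards)
```

Samples that pass more checks than the group average receive a positive advantage and the rest a negative or near-zero one, so no separate value model is needed.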

2605.12494 2026-05-13 cs.CV

Revisiting Photometric Ambiguity for Accurate Gaussian-Splatting Surface Reconstruction

Jiahe Li, Jiawei Zhang, Xiao Bai, Jin Zheng, Xiaohan Yu, Lin Gu, Gim Hee Lee

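AI summary: This paper studies the photometric ambiguity that pervades surface reconstruction with differentiable rendering and proposes AmbiSuR, a framework that makes Gaussian Splatting reconstruction more accurate under such ambiguity. Revisiting the foundations of the Gaussian Splatting representation, the authors identify ambiguities built into its primitives and introduce a photometric disambiguation step and an ambiguity indication module that constrain the geometry solution and guide reconstruction. Experiments show more accurate and robust surface reconstruction across a variety of challenging scenes.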

Comments Accepted at ICML 2026. Project page: https://fictionarry.github.io/AmbiSuR-Proj/

Abstract

Surface reconstruction with differentiable rendering has achieved impressive performance in recent years, yet the pervasive photometric ambiguities have strictly bottlenecked existing approaches. This paper presents AmbiSuR, a framework that explores an intrinsic solution upon Gaussian Splatting for the photometric ambiguity-robust surface 3D reconstruction with high performance. Starting by revisiting the foundation, our investigation uncovers two built-in primitive-wise ambiguities in representation, while revealing an intrinsic potential for ambiguity self-indication in Gaussian Splatting. Stemming from these, a photometric disambiguation is first introduced, constraining ill-posed geometry solution for definite surface formation. Then, we propose an ambiguity indication module that unleashes the self-indication potential to identify and further guide correcting underconstrained reconstructions. Extensive experiments demonstrate our superior surface reconstructions compared to existing methods across various challenging scenarios, excelling in broad compatibility. Project: https://fictionarry.github.io/AmbiSuR-Proj/ .

2605.12493 2026-05-13 cs.CL

LongMemEval-V2: Evaluating Long-Term Agent Memory Toward Experienced Colleagues

Di Wu, Zixiang Ji, Asmi Kawatkar, Bryan Kwan, Jia-Chen Gu, Nanyun Peng, Kai-Wei Chang

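AI summary: LongMemEval-V2 is a benchmark for evaluating the long-term memory of agents, testing whether memory systems help an agent learn and retain the key experience of a web environment so that it behaves like an experienced colleague. It contains 451 curated questions covering core abilities such as static state recall, dynamic state tracking, and workflow knowledge, paired with large sets of history trajectories as input. The authors propose two memory methods; the coding-agent-based one achieves the best accuracy but incurs high latency, which leaves room for improvement in long-term memory system design.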

Comments Work in Progress

Abstract

Long-term memory is crucial for agents in specialized web environments, where success depends on recalling interface affordances, state dynamics, workflows, and recurring failure modes. However, existing memory benchmarks for agents mostly focus on user histories, short traces, or downstream task success, leaving open how to directly evaluate whether memory systems effectively internalize environment-specific experience. To address this gap, we introduce LongMemEval-V2 (LME-V2), a benchmark for evaluating whether memory systems can help agents acquire the experience needed to become knowledgeable colleagues in customized environments. LME-V2 contains 451 manually curated questions covering five core memory abilities for web agents: static state recall, dynamic state tracking, workflow knowledge, environment gotchas, and premise awareness. Questions are paired with history trajectories containing up to 500 trajectories and 115M tokens. We use a context gathering formulation: memory systems consume history trajectories and return compact evidence for downstream question answering. We propose a suite of two memory methods: AgentRunbook-R, an efficient RAG-based memory with knowledge pools for raw state observations, events, and strategy notes, and AgentRunbook-C, which stores trajectories as files and invokes a coding agent to gather evidence in an augmented sandbox. Experiments show that AgentRunbook-C achieves the best performance with 72.5% average accuracy, outperforming the strongest RAG baseline (48.5%) and the off-the-shelf coding agent baseline (69.3%). Despite the strong performance gains, coding agent based methods have high latency costs. While AgentRunbook-C advances the accuracy-latency Pareto frontier, substantial room for improvement remains. Together, these results establish LME-V2 as a challenging testbed for developing long-term memory systems for environment experience.

2605.12492 2026-05-13 cs.LG stat.ML

Pion: A Spectrum-Preserving Optimizer via Orthogonal Equivalence Transformation

Kexuan Shi, Hanxuan Li, Zeju Qiu, Yandong Wen, Simon Buchholz, Weiyang Liu

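AI summary: This paper proposes Pion, a spectrum-preserving optimizer for large language model training based on orthogonal equivalence transformation. Unlike additive optimizers such as Adam, Pion updates each weight matrix through left and right orthogonal transformations, so its singular values stay unchanged throughout training. The update reshapes the geometry of the weight matrix while keeping its spectral norm fixed, and experiments show stable, competitive performance in LLM pretraining and finetuning.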

Comments Technical report v1 (30 pages, 19 figures, project page: https://spherelab.ai/pion/)

Abstract

We introduce Pion, a spectrum-preserving optimizer for large language model (LLM) training based on orthogonal equivalence transformation. Unlike additive optimizers such as Adam and Muon, Pion updates each weight matrix through left and right orthogonal transformations, preserving its singular values throughout training. This yields an optimization mechanism that modulates the geometry of weight matrices while keeping their spectral norm fixed. We derive the Pion update rule, systematically examine its design choices, and analyze its convergence behavior along with several key properties. Empirical results show that Pion offers a stable and competitive alternative to standard optimizers for both LLM pretraining and finetuning.
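The spectrum-preserving property follows from a standard linear-algebra fact: multiplying a matrix on the left and right by orthogonal matrices leaves its singular values unchanged. The sketch below checks this numerically. It is not the Pion update rule; the Cayley transform and the random skew-symmetric generators are assumptions chosen only to produce some orthogonal factors.

```python
import numpy as np

def cayley(skew):
    """Map a skew-symmetric matrix S to the orthogonal matrix (I + S)^-1 (I - S)."""
    eye = np.eye(skew.shape[0])
    return np.linalg.solve(eye + skew, eye - skew)

rng = np.random.default_rng(0)
W = rng.normal(size=(6, 4))

# Small random generators stand in for whatever an optimizer would compute.
A = rng.normal(size=(6, 6)) * 0.1
B = rng.normal(size=(4, 4)) * 0.1
R_left, R_right = cayley(A - A.T), cayley(B - B.T)

# Multiplicative update: W changes, but only by rotations on each side.
W_new = R_left @ W @ R_right
sv_before = np.linalg.svd(W, compute_uv=False)
sv_after = np.linalg.svd(W_new, compute_uv=False)
```

An additive step such as `W - lr * grad` would generally change every singular value, which is the contrast the abstract draws with Adam and Muon.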

2605.12491 2026-05-13 cs.CV cs.LG

Elastic Attention Cores for Scalable Vision Transformers

Alan Z. Song, Yinjie Chen, Mu Nan, Rui Zhang, Jiahang Cao, Weijian Mai, Muquan Yu, Hossein Adeli, Deva Ramanan, Michael J. Tarr, Andrew F. Luo

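AI summary: This paper proposes VECA, a vision transformer architecture that addresses the high computational cost of standard ViTs on high-resolution images. VECA uses elastic core-periphery attention: a small set of learned core embeddings serves as a communication interface, so image patches never interact directly and complexity drops from quadratic to linear. While keeping the full set of input tokens, the method allows a flexible trade-off between compute and accuracy and performs competitively with the latest vision foundation models across several vision tasks.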

Comments Project repository here: https://github.com/alansong1322/VECA

Abstract

Vision Transformers (ViTs) achieve strong data-driven scaling by leveraging all-to-all self-attention. However, this flexibility incurs a computational cost that scales quadratically with image resolution, limiting ViTs in high-resolution domains. Underlying this approach is the assumption that pairwise token interactions are necessary for learning rich visual-semantic representations. In this work, we challenge this assumption, demonstrating that effective visual representations can be learned without any direct patch-to-patch interaction. We propose VECA (Visual Elastic Core Attention), a vision transformer architecture that uses efficient linear-time core-periphery structured attention enabled by a small set of learned cores. In VECA, these cores act as a communication interface: patch tokens exchange information exclusively through the core tokens, which are initialized from scratch and propagated across layers. Because the $N$ image patches only directly interact with a resolution invariant set of $C$ learned "core" embeddings, this yields linear complexity $O(N)$ for predetermined $C$, which bypasses quadratic scaling. Compared to prior cross-attention architectures, VECA maintains and iteratively updates the full set of $N$ input tokens, avoiding a small $C$-way bottleneck. Combined with nested training along the core axis, our model can elastically trade off compute and accuracy during inference. Across classification and dense tasks, VECA achieves performance competitive with the latest vision foundation models while reducing computational cost. Our results establish elastic core-periphery attention as a scalable alternative building block for Vision Transformers.
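To make the complexity argument concrete, here is a minimal sketch of one core-periphery exchange in which patches never attend to each other. The order of the two attention steps, the residual connections, and the absence of learned projections are all simplifying assumptions; the actual VECA layer may differ.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(queries, context):
    """Single-head attention without learned projections, for illustration."""
    scores = queries @ context.T / np.sqrt(queries.shape[-1])
    return softmax(scores) @ context

def core_periphery_layer(patches, cores):
    """Cores read from patches, then patches read from cores."""
    cores = cores + attend(cores, patches)      # score matrix is (C, N)
    patches = patches + attend(patches, cores)  # score matrix is (N, C)
    return patches, cores

rng = np.random.default_rng(0)
patches = rng.normal(size=(1024, 16))  # N = 1024 patch tokens
cores = rng.normal(size=(8, 16))       # C = 8 core tokens
patches_out, cores_out = core_periphery_layer(patches, cores)
```

Both score matrices have N * C entries, so for a fixed C the cost grows linearly in N, whereas full self-attention over the patches would need an N * N matrix. All N patch tokens are still updated, which is the difference from bottleneck designs that keep only C latents.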

2605.12487 2026-05-13 cs.CL cs.IR cs.LG

Task-Adaptive Embedding Refinement via Test-time LLM Guidance

Ariel Gera, Shir Ashury-Tahan, Gal Bloch, Ohad Eytan, Assaf Toledo

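AI summary: This paper studies LLM-guided query refinement as a way to make embedding models usable for zero-shot search and classification tasks. The embedding of a user query is refined in real time using LLM feedback on a small set of documents, so that it adapts to the specific task. Experiments show consistent gains across benchmarks, with relative improvements of up to 25%, better retrieval quality and classification accuracy, and a wider range of practical settings for embedding models.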

Abstract

We explore the effectiveness of an LLM-guided query refinement paradigm for extending the usability of embedding models to challenging zero-shot search and classification tasks. Our approach refines the embedding representation of a user query using feedback from a generative LLM on a small set of documents, enabling embeddings to adapt in real time to the target task. We conduct extensive experiments with state-of-the-art text embedding models across a diverse set of challenging search and classification benchmarks. Empirical results indicate that LLM-guided query refinement yields consistent gains across all models and datasets, with relative improvements of up to +25% in literature search, intent detection, key-point matching, and nuanced query-instruction following. The refined queries improve ranking quality and induce clearer binary separation across the corpus, enabling the embedding space to better reflect the nuanced, task-specific constraints of each ad-hoc user query. Importantly, this expands the range of practical settings in which embedding models can be effectively deployed, making them a compelling alternative when costly LLM pipelines are not viable at corpus-scale. We release our experimental code for reproducibility, at https://github.com/IBM/task-aware-embedding-refinement.
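One simple way to picture test-time refinement of a query embedding is the classic Rocchio relevance-feedback update, with the LLM's judgments on a few documents playing the role of the feedback. This is a swapped-in textbook technique used for illustration; the abstract does not state that the paper uses Rocchio, and the weights `alpha`, `beta`, and `gamma` are arbitrary assumptions.

```python
import numpy as np

def refine_query(query, doc_embs, llm_labels, alpha=1.0, beta=0.75, gamma=0.25):
    """Rocchio-style update: pull toward relevant docs, push away from irrelevant ones."""
    labels = np.asarray(llm_labels, dtype=bool)
    refined = alpha * query
    if labels.any():
        refined = refined + beta * doc_embs[labels].mean(axis=0)
    if (~labels).any():
        refined = refined - gamma * doc_embs[~labels].mean(axis=0)
    return refined / np.linalg.norm(refined)

query = np.array([1.0, 0.0])
docs = np.array([[0.6, 0.8],     # judged relevant by the (hypothetical) LLM
                 [0.8, -0.6]])   # judged irrelevant
refined = refine_query(query, docs, llm_labels=[True, False])
sim_relevant = float(refined @ docs[0])
sim_irrelevant = float(refined @ docs[1])
```

In this toy case the refined query moves closer to the document the LLM accepted and away from the one it rejected, even though the original query was nearer to the rejected one. The LLM is called only on a handful of documents, while the refined embedding can then be scored against the whole corpus cheaply.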

2605.12481 2026-05-13 cs.AI

ToolCUA: Towards Optimal GUI-Tool Path Orchestration for Computer Use Agents

Xuhao Hu, Xi Zhang, Haiyang Xu, Kyle Qiao, Jingyi Yang, Xuanjing Huang, Jing Shao, Ming Yan, Jieping Ye

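AI summary: Computer-use agents (CUAs) must switch between low-level GUI actions and high-level tool calls, but this hybrid action space makes it hard to decide which to use when, leading to suboptimal execution paths. This paper proposes ToolCUA, an end-to-end agent that learns optimal GUI-Tool path selection through a staged training paradigm. Using interleaved trajectory synthesis, tool-bootstrapped reinforcement fine-tuning, and online agentic reinforcement learning, it improves task accuracy and efficiency and performs strongly on the OSWorld-MCP benchmark, supporting its potential for real-world digital agents.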

Abstract

Computer Use Agents (CUAs) can act through both atomic GUI actions, such as click and type, and high-level tool calls, such as API-based file operations, but this hybrid action space often leaves them uncertain about when to continue with GUI actions or switch to tools, leading to suboptimal execution paths. This difficulty stems from the scarcity of high-quality interleaved GUI-Tool trajectories, the cost and brittleness of collecting real tool trajectories, and the lack of trajectory-level supervision for GUI-Tool path selection. In this paper, we propose ToolCUA, an end-to-end agent designed to learn optimal GUI-Tool path selection through a staged training paradigm. We first introduce an Interleaved GUI-Tool Trajectory Scaling Pipeline that repurposes abundant static GUI trajectories and synthesizes a grounded tool library, enabling diverse GUI-Tool trajectories without manual engineering or real tool-trajectory collection. We then perform Tool-Bootstrapped GUI RFT, combining warmup SFT with single-turn RL to improve decisions at critical GUI-Tool switching points. Finally, we optimize ToolCUA with Online Agentic RL in a high-fidelity GUI-Tool environment, guided by a Tool-Efficient Path Reward that encourages appropriate tool use and shorter execution paths. Experiments on OSWorld-MCP show that ToolCUA achieves 46.85% accuracy, a relative improvement of approximately 66% over the baseline, establishing a new state of the art among models of comparable scale. It also improves by 3.9% over GUI-only settings, demonstrating effective GUI-Tool orchestration. The results further suggest that training in a hybrid action space is a promising paradigm for real-world digital agents. Open-sourced here: https://x-plug.github.io/ToolCUA/

2605.12480 2026-05-13 cs.CV cs.AI

OmniNFT: Modality-wise Omni Diffusion Reinforcement for Joint Audio-Video Generation

Guohui Zhang, XiaoXiao Ma, Jie Huang, Hang Xu, Hu Yu, Siming Fu, Yuming Li, Zeyue Xue, Lin Song, Haoyang Huang, Nan Duan, Feng Zhao

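AI summary: OmniNFT is a reinforcement learning framework for joint audio-video generation that targets shortcomings of existing methods in per-modality fidelity, cross-modal alignment, and fine-grained synchronization. Its three components, modality-wise advantage routing, layer-wise gradient surgery, and region-wise loss reweighting, address inconsistent multi-objective advantages, imbalanced multimodal gradients, and uniform credit assignment, respectively. Experiments on several benchmarks show clear gains in audio and video perceptual quality and synchronization.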

Comments Project page: https://zghhui.github.io/OmniNFT/

Abstract

Recent advances in joint audio-video generation have been remarkable, yet real-world applications demand strong per-modality fidelity, cross-modal alignment, and fine-grained synchronization. Reinforcement Learning (RL) offers a promising paradigm, but its extension to multi-objective and multi-modal joint audio-video generation remains unexplored. Notably, our in-depth analysis first reveals that the primary obstacles to applying RL in this setting stem from: (i) multi-objective advantages inconsistency, where the advantages of multimodal outputs are not always consistent within a group; (ii) multi-modal gradients imbalance, where video-branch gradients leak into shallow audio layers responsible for intra-modal generation; (iii) uniform credit assignment, where fine-grained cross-modal alignment regions fail to get efficient exploration. These shortcomings suggest that a vanilla RL fine-tuning strategy with a single global advantage often leads to suboptimal results. To address these challenges, we propose OmniNFT, a novel modality-aware online diffusion RL framework with three key innovations: (1) Modality-wise advantage routing, which routes independent per-reward advantages to their respective modality generation branches. (2) Layer-wise gradient surgery, which selectively detaches video-branch gradients on shallow audio layers while retaining those for cross-modal interaction layers. (3) Region-wise loss reweighting, which modulates policy optimization toward critical regions related to audio-video synchronization and fine-grained alignment. Extensive experiments on JavisBench and VBench with the LTX-2 backbone demonstrate that OmniNFT achieves comprehensive improvements in audio and video perceptual quality, cross-modal alignment, and audio-video synchronization.

2605.12477 2026-05-13 cs.LG cs.CL

MEME: Multi-entity & Evolving Memory Evaluation

Seokwon Jung, Alexander Rubinstein, Arnas Uselis, Sangdoo Yun, Seong Joon Oh

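AI summary: MEME is a benchmark for evaluating LLM-based agents in multi-entity, evolving memory scenarios. It defines six tasks spanning the multi-entity and evolving axes, including three not previously evaluated: Cascade and Absence (dependency reasoning) and Deletion (post-removal state). The study finds that existing memory systems perform poorly on dependency reasoning, with very low average accuracy even when static retrieval is adequate. Only one file-based system paired with a strong LLM partially closes the gap, and at very high compute cost, so the configurations that currently work are not practical at scale.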

Abstract

LLM-based agents increasingly operate in persistent environments where they must store, update, and reason over information across many sessions. While prior benchmarks evaluate only single-entity updates, MEME defines six tasks spanning the full space defined by the multi-entity and evolving axes, including three not scored by prior work: Cascade and Absence (dependency reasoning) and Deletion (post-removal state). Evaluating six memory systems spanning three memory paradigms on 100 controlled episodes, we find that all systems collapse on dependency reasoning under the default configuration (Cascade: 3%, Absence: 1% in average accuracy) despite adequate static retrieval performance. Prompt optimization, deeper retrieval, reduced filler noise, and most stronger LLMs fail to close this gap. Only a file-based agent paired with Claude Opus 4.7 as its internal LLM partially closes the gap, but at ~70x the baseline cost, indicating closure currently depends on configurations that are not practical at scale. Code and data are available on the project page: https://seokwonjung-jay.github.io/meme-eval/.

2605.12476 2026-05-13 cs.LG cs.CL

Routers Learn the Geometry of Their Experts: Geometric Coupling in Sparse Mixture-of-Experts

Sagi Ahrac, Noya Hochwald, Mor Geva

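AI summary: This paper studies the geometric coupling between routers and experts in sparse Mixture-of-Experts (SMoE) models, linking routing decisions to expert weight updates. For a given input token, the router and the corresponding expert receive gradient updates along the same direction, differing only in scalar coefficients, and this coupling is confirmed empirically. Building on it, the authors propose an online K-Means router that needs no auxiliary loss: each expert keeps a running average of the hidden states routed to it. Experiments show the lowest load imbalance with only a modest increase in perplexity.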

Abstract

Sparse Mixture-of-Experts (SMoE) models enable scaling language models efficiently, but training them remains challenging, as routing can collapse onto few experts and auxiliary load-balancing losses can reduce specialization. Motivated by these hurdles, we study how routing decisions in SMoEs are formed mechanistically. First, we reveal a geometric coupling between routers and their corresponding experts. For a given token, the router weights for the selected expert and the expert weights processing it receive gradients along the same input direction, differing only in scalar coefficients. Thus, matched router--expert directions accumulate the same routed token history. This theoretical coupling also appears empirically in routing dynamics. In a $1$B SMoE trained from scratch, higher router scores predict stronger expert neuron activations, showing that routing decisions are mirrored inside the selected expert. Next, we analyze the effects of auxiliary load balancing on the router--expert geometric coupling, showing that such losses break this structure by spreading input-directed gradients across router weights, making distinct router directions nearly three times more similar to each other. Last, we demonstrate the centrality of geometric coupling for effective routing with a parameter-free online K-Means router, in which each expert maintains a running average of the hidden states routed to it and tokens are assigned based on cosine similarity. Compared with auxiliary-loss and loss-free balancing, this router achieves the lowest load imbalance with only a modest perplexity increase, indicating that geometric coupling captures a substantial part of what the router learns. Overall, our results explain how routers form assignment geometry that supports an effective division of labor.
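The online K-Means router described in the abstract can be sketched directly from its two ingredients: a running average per expert and cosine-similarity assignment. The class below is a minimal illustration, not the authors' code; top-1 routing, the class name, the initial centroids, and the choice to let the first routed token overwrite the initial centroid are assumptions.

```python
import numpy as np

class KMeansRouter:
    """Parameter-free router: one running-mean centroid per expert."""

    def __init__(self, init_centroids):
        self.centroids = np.array(init_centroids, dtype=float)
        self.counts = np.zeros(len(self.centroids), dtype=int)

    def route(self, hidden):
        """Assign each token to the expert with the most cosine-similar centroid."""
        h = hidden / np.linalg.norm(hidden, axis=1, keepdims=True)
        c = self.centroids / np.linalg.norm(self.centroids, axis=1, keepdims=True)
        return np.argmax(h @ c.T, axis=1)

    def update(self, hidden, assignment):
        """Fold routed hidden states into each expert's running average.

        Counts start at zero, so the first token routed to an expert
        replaces its initial centroid (assumption).
        """
        for token, expert in zip(hidden, assignment):
            self.counts[expert] += 1
            self.centroids[expert] += (token - self.centroids[expert]) / self.counts[expert]

router = KMeansRouter([[1.0, 0.0], [0.0, 1.0]])
tokens = np.array([[2.0, 0.1], [0.2, 3.0], [1.0, 0.3]])
assignment = router.route(tokens)
router.update(tokens, assignment)
```

No router weights are trained here: each centroid simply accumulates the history of tokens sent to its expert, which mirrors the coupling the paper reports between router directions and routed token history.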

2605.12474 2026-05-13 cs.AI

Reward Hacking in Rubric-Based Reinforcement Learning

Anas Mahmoud, MohammadHossein Rezaei, Zihao Wang, Anisha Gunjal, Bing Liu, Yunzhong He

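AI summary: This paper studies reward hacking in rubric-based reinforcement learning, examining the divergence that arises when a policy is optimized against a training verifier but evaluated by a panel of independent judges. Weak verifiers yield high training scores that do not transfer to the reference evaluation; stronger verifiers reduce the problem but do not eliminate it. The paper also introduces the self-internalization gap, a verifier-free diagnostic, and shows that rubric-design limitations can raise scores on criteria such as completeness at the expense of factual correctness and overall quality.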

Abstract

Reinforcement learning with verifiable rewards has enabled strong post-training gains in domains such as math and coding, though many open-ended settings rely on rubric-based rewards. We study reward hacking in rubric-based RL, where a policy is optimized against a training verifier but evaluated against a cross-family panel of three frontier judges, reducing dependence on any single evaluator. Our framework separates two sources of divergence: verifier failure, where the training verifier credits rubric criteria that reference verifiers reject, and rubric-design limitations, where even strong rubric-based verifiers favor responses that rubric-free judges rate worse overall. Across medical and science domains, weak verifiers produce large proxy-reward gains that do not transfer to the reference verifiers; exploitation grows over training and concentrates in recurring failures such as partial satisfaction of compound criteria, treating implicit content as explicit, and imprecise topical matching. Stronger verifiers substantially reduce, but do not eliminate, verifier exploitation. We also introduce a self-internalization gap, a verifier-free diagnostic based on policy log-probabilities, which tracks reference-verifier quality, detecting when the policy trained using the weak verifier stops improving. Finally, in our setting, stronger verification does not prevent reward hacking when the rubric leaves important failure modes unspecified: rubric-based verifiers prefer the RL checkpoint, while rubric-free judges prefer the base model. These disagreements coincide with gains concentrated in completeness and presence-based criteria, alongside declines in factual correctness, conciseness, relevance, and overall quality. Together, these results suggest that stronger verification reduces reward hacking, but does not by itself ensure that rubric gains correspond to broader quality gains.

2605.12472 2026-05-13 cs.IT math.IT

An Improved Lower Bound on Support Size of Capacity-Achieving Inputs for the Binomial Channel: Extended version

Mohammadamin Baniasadi, Luca Barletta, Alex Dytso

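AI summary This paper studies the binomial channel and the structure of its capacity-achieving input and output distributions. The capacity-achieving input is known to be discrete with finitely many support points, and the best previous bounds on the support size were of order $\sqrt{n}$ from below and $n/2$ from above. In three main steps, the paper derives an improved lower bound of order $\sqrt{n\log\log n}$: it shows that the output induced by a Beta-distributed input approaches the capacity-achieving output distribution in relative entropy and $\chi^2$ divergence, which forces the capacity-achieving input to have at least that order of support points.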

Details
English abstract

We study the binomial channel and the structure of its capacity-achieving input and output distributions. It is known that the capacity-achieving input distribution is discrete and supported on finitely many points. The best previously known bounds show that the support size of the capacity-achieving distribution is lower-bounded by a term of order $\sqrt n$ and upper-bounded by a term of order $n/2$, where $n$ is the number of trials. In this work, we derive a new lower bound on the support size of order $\sqrt{n\log\log n}$, up to explicit constants. The proof consists of three main steps. First, we derive new upper and lower bounds on the capacity with a gap that vanishes as $n\to\infty$, which yields $C(n)=\frac12\log\frac{n\pi}{2e}+o(1)$. Second, we show that the Beta-binomial output distribution induced by the reference input $X_r\sim\mathrm{Beta}(1/2,1/2)$ is asymptotically optimal: it approaches the capacity-achieving output distribution in relative entropy and, after a comparison step, in $\chi^2$ divergence. Third, we prove a quantitative $\chi^2$ approximation lower bound showing that this Beta-binomial output cannot be approximated too well by the output induced by a $K$-point input. Combining these ingredients forces the capacity-achieving input distribution to have at least order $\sqrt{n\log\log n}$ mass points.
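
As a rough illustration of the objects named in this abstract, the sketch below computes the Beta-binomial output distribution induced by a $\mathrm{Beta}(1/2,1/2)$ input and evaluates the leading term of the stated capacity expansion. The function names are hypothetical, and the use of the natural logarithm (nats) is an assumption, since the abstract does not state the base.

```python
import math


def log_beta(x, y):
    """Logarithm of the Beta function B(x, y), computed via log-gamma."""
    return math.lgamma(x) + math.lgamma(y) - math.lgamma(x + y)


def beta_binomial_pmf(n, a=0.5, b=0.5):
    """P(Y = k), k = 0..n, for Y | X ~ Binomial(n, X) with X ~ Beta(a, b)."""
    log_norm = log_beta(a, b)
    pmf = []
    for k in range(n + 1):
        log_choose = (math.lgamma(n + 1) - math.lgamma(k + 1)
                      - math.lgamma(n - k + 1))
        pmf.append(math.exp(log_choose + log_beta(k + a, n - k + b) - log_norm))
    return pmf


def capacity_leading_term(n):
    """Leading term (1/2) log(n * pi / (2e)); the o(1) correction is omitted.

    Assumption: natural logarithm, so the value is in nats.
    """
    return 0.5 * math.log(n * math.pi / (2 * math.e))


if __name__ == "__main__":
    pmf = beta_binomial_pmf(100)
    print("sum of pmf:", sum(pmf))
    print("leading capacity term for n=100:", capacity_leading_term(100))
```

The pmf is computed in log space so that the binomial coefficients do not overflow for large $n$.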

2605.12471 2026-05-13 cs.LG cs.AI cs.CL

KV-Fold: One-Step KV-Cache Recurrence for Long-Context Inference

Alireza Nadali, Patrick Cooper, Ashutosh Trivedi, Alvaro Velasquez

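AI summary This paper proposes KV-Fold, a training-free long-context inference method that treats the key-value (KV) cache as the accumulator in a left fold over sequence chunks. At each step the model processes the next chunk conditioned on the accumulated cache and appends the newly produced keys and values, so the cache grows step by step and is passed forward. With the model architecture and parameters unchanged, the method retains long-range information stably, and experiments report high accuracy and memory efficiency on long-context tasks.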

Comments 12 pages, 3 figures, 6 tables

Details
English abstract

We introduce KV-Fold, a simple, training-free long-context inference protocol that treats the key-value (KV) cache as the accumulator in a left fold over sequence chunks. At each step, the model processes the next chunk conditioned on the accumulated cache, appends the newly produced keys and values, and passes the enlarged cache forward; the same one-step update is applied repeatedly, analogous to foldl in functional programming. Building on the KV cache concatenation primitive introduced for latent multi-agent communication, we repurpose it as a chunk-to-chunk recurrence for long-context inference. When processing chunk t, the model attends to the KV cache carried from earlier chunks as a prefix, reusing its internal state across segments without modifying or retraining the model. Despite its simplicity, the induced recurrence is stable: per-step drift rises briefly and then saturates into a flat plateau that persists across deep chains. This plateau is insensitive to a 10,000x change in numerical precision, robust across chunk sizes, and consistent across model families. At the task level, KV-Fold preserves exact information over long distances. On a needle-in-a-haystack benchmark, it achieves 100% exact-match retrieval across 152 trials spanning contexts from 16K to 128K tokens and chain depths up to 511 on Llama-3.1-8B, while remaining within the memory limits of a single 40GB GPU. Compared to streaming methods, which trade fidelity for bounded memory, KV-Fold maintains long-range retrieval while operating as a sequence of tractable forward passes. Overall, our results show that frozen pretrained transformers already support a stable form of KV-cache recurrence, providing a practical route to long-context inference without architectural changes or training.
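
The left-fold structure described above can be sketched with `functools.reduce`. This is a toy model of the control flow only: `toy_forward` is a hypothetical stand-in for a frozen transformer's forward pass, and the cache is a plain list rather than real key and value tensors.

```python
from functools import reduce


def toy_forward(chunk, cache):
    """Hypothetical stand-in for a frozen model's forward pass on one chunk.

    Returns one (token, visible) entry per token, where `visible` counts the
    positions the token could attend to: the cached prefix plus the earlier
    tokens of the same chunk.
    """
    return [(tok, len(cache) + i) for i, tok in enumerate(chunk)]


def fold_step(cache, chunk):
    """One-step update: process the chunk given the cache, then append."""
    return cache + toy_forward(chunk, cache)


def kv_fold(tokens, chunk_size):
    """Left fold of fold_step over fixed-size chunks, starting from an empty cache."""
    chunks = [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]
    return reduce(fold_step, chunks, [])


if __name__ == "__main__":
    for entry in kv_fold(list("abcdefg"), chunk_size=3):
        print(entry)
```

In this toy, every token sees the same prefix regardless of chunk size, so the result is chunk-invariant by construction. The drift behavior that the abstract measures on real models is not modeled here.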

2605.12467 2026-05-13 eess.SY cs.SY

Towards Closed-loop Stability of Nonlinear Receding Horizon Games

Sophie Hall, Florian Dörfler, Timm Faulwasser

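AI summary This paper studies closed-loop stability of nonlinear receding horizon games without terminal ingredients. Using the turnpike phenomenon, the authors show under mild assumptions that recursive feasibility holds, and they give sufficient conditions for practical asymptotic convergence of the closed-loop trajectories. Numerical examples indicate that the region of attraction around the steady-state GNE shrinks exponentially with the horizon length, a behavior previously known only for model predictive control. Adding a linear end penalty suppresses the leaving arc in simulations and ensures asymptotic convergence to the steady-state GNE.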

Details
English abstract

We analyze Receding Horizon Games without any MPC-like terminal ingredients. We show that recursive feasibility can be inferred from the turnpike phenomenon under mild assumptions. Moreover, we prove sufficient conditions for practical asymptotic convergence of the closed-loop trajectories, and we discuss how the gap towards practical asymptotic stability may be closed. We use numerical examples to show that the closed-loop region of attraction around the steady-state GNE shrinks exponentially with the horizon length, a behavior previously known only for model predictive control. Further, we apply a linear end penalty and demonstrate in numerical simulations that it suppresses the leaving arc and ensures asymptotic convergence to the steady-state GNE.

2605.12466 2026-05-13 cs.LG cs.AI cs.CL cs.NE

Solve the Loop: Attractor Models for Language and Reasoning

Jacob Fein-Ashley, Paria Rashidinejad

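AI summary This paper proposes Attractor Models, an architecture for iterative computation in language modeling and reasoning. A backbone module first proposes output embeddings, and an attractor module then refines them by solving for a fixed point; training uses implicit differentiation, so training memory stays constant in effective depth and the number of iterations adapts to convergence. Experiments report that Attractor Models outperform existing methods in large-scale language-model pretraining and in reasoning with tiny models while reducing training cost. They also exhibit a new phenomenon, equilibrium internalization, which allows the solver to be removed at inference time with little degradation.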

Details
English abstract

Looped Transformers offer a promising alternative to purely feed-forward computation by iteratively refining latent representations, improving language modeling and reasoning. Yet recurrent architectures remain unstable to train, costly to optimize and deploy, and constrained to small, fixed recurrence depths. We introduce Attractor Models, in which a backbone module first proposes output embeddings, then an attractor module refines them by solving for the fixed point, with gradients obtained through implicit differentiation. Thus, training memory remains constant in effective depth, and iterations are chosen adaptively by convergence. Empirically, Attractor Models outperform existing models across two regimes, large-scale language-model pretraining and reasoning with tiny models. In language modeling, Attractor Models deliver a Pareto improvement over standard Transformers and stable looped models across sizes, improving perplexity by up to 46.6% and downstream accuracy by up to 19.7% while reducing training cost. Notably, a 770M Attractor Model outperforms a 1.3B Transformer trained on twice as many tokens. On challenging reasoning tasks, we show that our model with only 27M parameters and approximately 1000 examples achieves 91.4% accuracy on Sudoku-Extreme and 93.1% on Maze-Hard, scaling favorably where frontier models like Claude and GPT o3 fail completely and specialized recursive reasoners collapse at larger sizes. Lastly, we show that Attractor Models exhibit a novel phenomenon, which we call equilibrium internalization: fixed-point training places the model's initial output embedding near equilibrium, allowing the solver to be removed at inference time with little degradation. Together, these results suggest that Attractor Models make iterative refinement scalable by turning recurrence into a computation the model can learn to internalize.
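
The idea of refining a proposal by solving for a fixed point, with the iteration count set by convergence, can be illustrated on a scalar toy problem. This is a minimal sketch and not the paper's solver: `refine` is a hypothetical contractive map standing in for the attractor module, plain forward iteration is an assumed solver choice, and implicit differentiation is not shown.

```python
import math


def solve_fixed_point(f, z0, tol=1e-8, max_iter=200):
    """Iterate z <- f(z) until successive iterates differ by less than tol.

    Returns the final iterate and the number of steps taken, so the
    iteration count adapts to how quickly the map converges.
    """
    z = z0
    for step in range(1, max_iter + 1):
        z_next = f(z)
        if abs(z_next - z) < tol:
            return z_next, step
        z = z_next
    return z, max_iter


def refine(z, x):
    """Hypothetical attractor map; contractive because its slope is at most 0.5."""
    return 0.5 * math.tanh(z) + x


if __name__ == "__main__":
    x = 0.3  # stand-in for the backbone's input-dependent signal
    z_star, steps = solve_fixed_point(lambda z: refine(z, x), z0=x)
    print("equilibrium:", z_star, "steps:", steps)
    # Restarting from the equilibrium converges on the first check.
    print(solve_fixed_point(lambda z: refine(z, x), z0=z_star))
```

Starting the solver at the equilibrium terminates after a single step, which loosely mirrors why an initial embedding that already lies near equilibrium would need little or no solving.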

2605.12464 2026-05-13 cs.LG cs.AR cs.PF

Search Your Block Floating Point Scales!

Tanmaey Gupta, Hayden Prairie, Xiaoxia Wu, Reyna Abhyankar, Qingyang Wu, Austin Silveria, Pragaash Ponnusamy, Jue Wang, Ben Athiwaratkun, Leon Song, Tri Dao, Daniel Y. Fu, Chris De Sa

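AI summary This paper studies how to choose scale factors in Block Floating Point (BFP) formats to reduce quantization error and improve model performance. The authors propose ScaleSearch, a fine-grained search that uses the mantissa bits of microscaling formats to select the scale that minimizes quantization error for a given data distribution. The method can be combined with existing quantization techniques, and the paper also introduces ScaleSearchAttention, an accelerated attention algorithm built on ScaleSearch that reduces quantization error while closely matching baseline performance. Experiments report gains across several tasks.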

Details
English abstract

Quantization has emerged as a standard technique for accelerating inference for generative models by enabling faster low-precision computations and reduced memory transfers. Recently, GPU accelerators have added first-class support for microscaling Block Floating Point (BFP) formats. Standard BFP algorithms use a fixed scale based on the maximum magnitude of the block. We observe that this scale choice can be suboptimal with respect to quantization errors. In this work, we propose ScaleSearch, an alternative strategy for selecting these scale factors: using a fine-grained search leveraging the mantissa bits in microscaling formats to minimize the quantization error for the given distribution. ScaleSearch can be integrated with existing quantization methods such as Post Training Quantization and low-precision attention, and is shown to improve their performance. Additionally, we introduce ScaleSearchAttention, an accelerated NVFP4-based attention algorithm, which uses ScaleSearch and adapted prior techniques to ensure near-0 performance loss for causal language modeling. Experiments show that ScaleSearch reduces quantization error by 27% for NVFP4 and improves language model PTQ by up to 15 points for MATH500 (Qwen3-8B), while ScaleSearchAttention improves Wikitext-2 PPL by up to 0.77 points for Llama 3.1 70B. The proposed methods closely match baseline performance while providing quantization accuracy improvements.
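
The contrast between a max-based scale and a searched scale can be sketched in a few lines. This is an illustration under stated assumptions, not the ScaleSearch algorithm itself: the value grid is the set of FP4 (E2M1) magnitudes, the candidate scales are simple multiples of the max-based scale (the paper instead searches using mantissa bits of the scale format), and all function names are hypothetical.

```python
import math

# Magnitudes representable in FP4 (E2M1).
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def quantize(block, scale):
    """Round each value to the nearest grid point after dividing by scale."""
    out = []
    for x in block:
        mag = min(FP4_GRID, key=lambda g: abs(abs(x) / scale - g))
        out.append(math.copysign(mag * scale, x))
    return out


def mse(block, scale):
    """Mean squared quantization error of the block at the given scale."""
    deq = quantize(block, scale)
    return sum((x - q) ** 2 for x, q in zip(block, deq)) / len(block)


def max_scale(block):
    """Baseline: map the largest magnitude onto the largest grid value."""
    peak = max(abs(x) for x in block)
    return peak / FP4_GRID[-1] if peak > 0 else 1.0


def search_scale(block):
    """Assumed search: try multiples 0.50..1.50 of the baseline, keep the best."""
    base = max_scale(block)
    candidates = [base * (i / 20) for i in range(10, 31)]  # includes 1.0 exactly
    return min(candidates, key=lambda s: mse(block, s))


if __name__ == "__main__":
    block = [0.1, -0.2, 0.35, 1.0, -0.05, 0.6, 0.25, -0.45]
    print("baseline MSE:", mse(block, max_scale(block)))
    print("searched MSE:", mse(block, search_scale(block)))
```

Because the baseline scale is among the candidates, the searched scale can never do worse than the baseline on the block it was fitted to. A smaller scale clips the largest value but gives finer resolution to the rest of the block.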

2605.12462 2026-05-13 cs.AI cs.CY cs.GT cs.LG

Towards Affordable Energy: A Gymnasium Environment for Electric Utility Demand-Response Programs

Jose E. Aguilar Escamilla, Lingdong Zhou, Xiangqi Zhu, Huazheng Wang

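AI summary This paper presents DR-Gym, an open-source simulation environment for training and evaluating demand-response strategies from the electric utility's perspective, with the goal of improving grid flexibility and energy affordability. The environment targets the market-level utility setting, provides a rich observation space relevant to the utility, and includes a wholesale price model calibrated to real-world extreme events together with physics-based building demand profiles. A multi-objective reward function supports diverse learning objectives, and the authors demonstrate that the simulator can create realistic and learnable environments.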

Details
English abstract

Extreme weather and volatile wholesale electricity markets expose residential consumers to catastrophic financial risks, yet demand response at the distribution level remains an underutilized tool for grid flexibility and energy affordability. While a demand-response program can shield consumers by issuing financial credits during high-price periods, optimizing this sequential decision-making process presents a unique challenge for reinforcement learning despite the plentiful offline historical smart meter and wholesale pricing data available publicly. Offline historical data fails to capture the dynamic, interactive feedback loop between an electric utility's pricing signals and customer acceptance and adaptation to a demand-response program. To address this, we introduce DR-Gym, an open-source, online Gymnasium-compatible environment designed to train and evaluate demand-response from the electric utility's perspective. Unlike existing device-level energy simulators, our environment focuses on the market-level electric utility setting and provides a rich observational space relevant to the electric utility. The simulator additionally features a regime-switching wholesale price model calibrated to real-world extreme events, alongside physics-based building demand profiles. For our learning signal, we use a configurable, multi-objective reward function for specifying diverse learning objectives. We demonstrate through baseline strategies and data snapshots the capability of our simulator to create realistic and learnable environments.

2605.12461 2026-05-13 math.ST cs.DS cs.LG stat.ML stat.TH

A proximal gradient algorithm for composite log-concave sampling

Linghai Liu, Sinho Chewi

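AI summary This paper proposes a proximal gradient algorithm for sampling from composite log-concave distributions of the form $\pi \propto e^{-f - g}$, assuming access to gradients of $f$ and a restricted Gaussian oracle (RGO) for $g$. When $f + g$ is strongly convex and $f$ is smooth, the sampler reaches $\varepsilon$ accuracy in total variation distance in $\widetilde{\mathcal{O}}(\kappa\sqrt{d} \log^4(1/\varepsilon))$ iterations, matching the prior state of the art for the case $g = 0$. The results extend to non-log-concave targets and to non-smooth $f$.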

Details
English abstract

We propose an algorithm to sample from composite log-concave distributions over $\mathbb{R}^d$, i.e., densities of the form $\pi\propto e^{-f-g}$, assuming access to gradient evaluations of $f$ and a restricted Gaussian oracle (RGO) for $g$. The latter requirement means that we can easily sample from the density $\text{RGO}_{g,h,y}(x) \propto \exp(-g(x) -\frac{1}{2h}\|y-x\|^2)$, which is the sampling analogue of the proximal operator for $g$. If $f + g$ is $\alpha$-strongly convex and $f$ is $\beta$-smooth, our sampler achieves $\varepsilon$ error in total variation distance in $\widetilde{\mathcal O}(\kappa\sqrt d \log^4(1/\varepsilon))$ iterations where $\kappa := \beta/\alpha$, which matches prior state-of-the-art results for the case $g=0$. We further extend our results to cases where (1) $\pi$ is non-log-concave but satisfies a Poincaré or log-Sobolev inequality, and (2) $f$ is non-smooth but Lipschitz.
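
A worked special case may make the RGO concrete. The quadratic choice $g(x) = \frac{\lambda}{2}\|x\|^2$ with $\lambda > 0$ is an assumption made here for illustration and is not taken from the paper. Completing the square in $x$ gives

$$
-g(x) - \frac{1}{2h}\|y-x\|^2 = -\frac{1+\lambda h}{2h}\left\|x - \frac{y}{1+\lambda h}\right\|^2 + \text{const},
$$

so in this case the oracle is an explicit Gaussian:

$$
\text{RGO}_{g,h,y} = \mathcal{N}\!\left(\frac{y}{1+\lambda h},\ \frac{h}{1+\lambda h}\, I_d\right).
$$

Its mean $y/(1+\lambda h)$ equals the proximal point $\arg\min_x \{ g(x) + \frac{1}{2h}\|y-x\|^2 \}$, which is one way to read the statement that the RGO is the sampling analogue of the proximal operator. Letting $\lambda \to 0$ recovers $\mathcal{N}(y, h I_d)$, the plain Gaussian step for $g = 0$.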

2605.12460 2026-05-13 cs.LG cs.CL

Multi-Stream LLMs: Unblocking Language Models with Parallel Streams of Thoughts, Inputs and Outputs

Guinan Su, Yanwu Yang, Xueyan Li, Jonas Geiping

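AI summary This paper proposes Multi-Stream LLMs, which replace the single stream of computation with multiple parallel streams to remove the sequential bottleneck in how current language models read input, think, and produce output. Each role (such as input, thinking, and output) is split into its own stream, so the model can read several inputs and generate several outputs at the same timestep. The authors argue that this data-driven change improves efficiency, security, and monitorability, and offers a route to more efficient and controllable autonomous agents.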

Comments Preprint, 37 pages. Code at https://github.com/seal-rg/streaming/

Details
English abstract

The continued improvements in language model capability have unlocked their widespread use as drivers of autonomous agents, for example in coding or computer use applications. However, the core of these systems has not changed much since early instruction-tuned models like ChatGPT. Even advanced AI agents function on message exchange formats, successively exchanging messages with users, systems, themselves (i.e., chain-of-thought), and tools in a single stream of computation. This bottleneck to a single stream in chat models leads to a number of limitations: the agent cannot act (generate output) while reading and, conversely, cannot react to new information while writing. Similarly, the agent cannot act while thinking and cannot think while reading or acting on information. In this work, we show that models can be unblocked by switching from instruction-tuning for sequential message formats to instruction-tuning for multiple, parallel streams of computation, splitting each role into a separate stream. Every forward pass of the language model then simultaneously reads from multiple input streams and generates tokens in multiple output streams, all of which causally depend on earlier timesteps. We argue that this data-driven change remedies a number of usability limitations as outlined above, improves model efficiency through parallelization, improves model security through better separation of concerns, and can further improve model monitorability.

2605.12457 2026-05-13 cs.DS cs.DM

Layer-Based Width for PAFP

Samuel German

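AI summary This paper studies the Path Avoiding Forbidden Pairs problem (PAFP): find a path from source to target in a directed graph that does not contain both endpoints of any forbidden vertex pair. The author introduces a layer-based width measure and shows that PAFP is fixed-parameter tractable (FPT) when parameterized by the BFS-width of the union digraph together with its number of backward arcs. Under specific layer-width restrictions on the input graph, PAFP can also be solved efficiently, in one case in polynomial time via a 2-SAT encoding, and these width bounds are tight in the sense that they cannot be relaxed further.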

Comments Accepted to IWOCA 2026; proceedings version to appear in Springer LNCS

Details
English abstract

The Path Avoiding Forbidden Pairs problem (PAFP) asks whether, in a directed graph $G$ with terminals $s,t$ and a set $\mathcal{F}$ of forbidden vertex pairs, there is an $s$-$t$ path that contains at most one endpoint from each forbidden pair. We initiate the study of PAFP through a layer-based width measure. Our first focus is the union digraph $G\cup\mathcal{F}$, obtained by adding to $G$ one arc per forbidden pair, oriented according to a fixed reachability-compatible order. Let the BFS layer $L_d$ consist of all vertices at directed shortest-path distance $d$ from $s$; the BFS-width from $s$ is then $\max_d |L_d|$. We show that if $G\cup\mathcal{F}$ has BFS-width $b$ from $s$ and only $\beta$ arcs going from a later BFS layer to an earlier one, then PAFP is FPT parameterized by $b+\beta$. The backward-arc hypothesis is essential: we show PAFP remains NP-complete when the union digraph is a DAG with BFS-width 2. We also show that if the input DAG has BFS-width at most $2$ and only $k$ backward input arcs, then PAFP can be decided in $2^k |I|^{O(1)}$ time, with unrestricted forbidden pairs. This width-$2$ result is tight: inspection of a classical reduction shows NP-completeness on input DAGs of BFS-width $3$ with no backward input arcs. Moreover, we study exact-length layers in the input graph, where the $d$-th layer consists of the vertices reachable from $s$ by a directed path of length exactly $d$. For DAGs of exact-length width at most $2$, we show PAFP is polynomial-time decidable by a 2-SAT encoding of fixed-length paths. This bound is tight: the same classical reduction yields NP-completeness on DAGs of exact-length width $3$. Unlike previously known polynomial-time regimes for PAFP, which restrict the forbidden-pair set in order to obtain tractability, our two input-graph tractability results allow unrestricted forbidden pairs and input graphs with exponentially many $s$-$t$ paths.
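
The two basic definitions in this abstract, BFS layers with their width and the forbidden-pair condition on a path, can be made concrete with a short sketch. It computes the definitions only and does not implement any of the paper's algorithms; the function names and the example graph are hypothetical.

```python
from collections import deque


def bfs_layers(adj, s):
    """Group vertices by directed shortest-path distance from s.

    adj maps each vertex to a list of out-neighbors. Vertices that are
    unreachable from s belong to no layer.
    """
    dist = {s: 0}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, []):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    layers = {}
    for v, d in dist.items():
        layers.setdefault(d, []).append(v)
    return layers


def bfs_width(adj, s):
    """BFS-width from s: the size of the largest BFS layer, max_d |L_d|."""
    return max(len(layer) for layer in bfs_layers(adj, s).values())


def avoids_forbidden_pairs(path, forbidden):
    """True if the path contains at most one endpoint of every forbidden pair."""
    on_path = set(path)
    return all(not (a in on_path and b in on_path) for a, b in forbidden)


if __name__ == "__main__":
    adj = {"s": ["a", "b"], "a": ["c"], "b": ["c"], "c": ["t"]}
    forbidden = [("a", "c")]
    print(bfs_layers(adj, "s"))
    print(bfs_width(adj, "s"))
    print(avoids_forbidden_pairs(["s", "a", "c", "t"], forbidden))
    print(avoids_forbidden_pairs(["s", "b", "c", "t"], forbidden))
```

In the example graph, the path through `a` is rejected because it contains both endpoints of the pair `("a", "c")`, while the path through `b` is accepted.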

2605.12455 2026-05-13 cs.IT cs.NI eess.SP math.IT quant-ph

Simultaneously Minimizing Storage and Bandwidth Under Exact Repair With Quantum Entanglement

Lei Hu, Mohamed Nomeir, Alptug Aytekin, Sennur Ulukus

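AI summary This paper studies exact-repair regenerating codes for entanglement-assisted distributed storage, aiming to minimize storage overhead and repair bandwidth simultaneously. Building on the classical product-matrix framework and the CSS stabilizer formalism, the construction lets surviving nodes that share an entangled state repair a failed node exactly, so the newcomer recovers the same content as the failed node. When the number of helper nodes satisfies a stated condition, it achieves the same optimal storage-bandwidth point known under functional repair.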

Details
English abstract

We study exact-regenerating codes for entanglement-assisted distributed storage systems. Consider an $(n,k,d,\alpha,\beta_{\mathsf{q}},B)$ distributed system that stores a file of $B$ classical symbols across $n$ nodes with each node storing $\alpha$ symbols. A data collector can recover the file by accessing any $k$ nodes. When a node fails, any $d$ surviving nodes share an entangled state, and each of them transmits a quantum system of $\beta_{\mathsf{q}}$ qudits to a newcomer. The newcomer then performs a measurement on the received quantum systems to generate its storage. Recent work [1] showed that, under functional repair where the regenerated content may differ from that of the failed node, there exists a unique optimal regenerating point that simultaneously minimizes both storage $\alpha$ and repair bandwidth $d \beta_{\mathsf{q}}$ when $d \geq 2k-2$. In this paper, we show that, under exact repair, where the newcomer reproduces exactly the same content as the failed node, this optimal point remains achievable. Our construction builds on the classical product-matrix framework and the Calderbank-Shor-Steane (CSS)-based stabilizer formalism.

2605.12453 2026-05-13 eess.SP cs.AI cs.DB cs.LG cs.NI

Enabling AI-Native Mobility in 6G: A Real-World Dataset for Handover, Beam Management, and Timing Advance

Mannam Veera Narayana, Rohit Singh, Deepa M. R, Radha Krishna Ganti

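AI summary This work addresses long handover (HO) interruption times and high measurement-report overhead for 5G user equipment (UE) in high-speed mobility scenarios by presenting a dataset collected from a commercially deployed network. It covers several modes of mobility (pedestrian, bike, car, bus, and train) at multiple speeds. The dataset records timing advance (TA) measurements at key signaling events during handover, including RACH trigger, MAC CE, and PDCCH grant, which existing work typically lacks. It can support training and evaluating AI/ML models for handover management, beam management, and TA prediction.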

Details
English abstract

AI/ML techniques, such as AI/ML-based beam management and mobility procedures, have been proposed to reduce the high interruption time and measurement-report overhead caused by user equipment (UE) mobility, especially in high-speed 5G use cases. These techniques rely heavily on data that are most often simulated for various scenarios and do not accurately reflect real deployment behavior or user traffic patterns. There is therefore a pressing need for realistic datasets collected under varied conditions. This work presents a dataset collected from a commercially deployed network across various modes of mobility (pedestrian, bike, car, bus, and train) and at multiple speeds to capture real-time UE mobility. When collecting the dataset, we focused primarily on handover (HO) scenarios, with the aim of reducing the HO interruption time and maintaining continuous throughput during and immediately after HO execution. To support this research, the dataset includes timing advance (TA) measurements at various signaling events, such as RACH trigger, MAC CE, and PDCCH grant, which are typically missing in existing works. We describe the creation of the dataset in detail: experimental setup, data acquisition, and extraction. We also present an exploratory analysis of the data, with a primary focus on mobility, beam management, and TA. We discuss multiple use cases in which the proposed dataset can facilitate understanding of AI/ML model inference. One such use case is to train and evaluate various AI/ML models for TA prediction.

2605.12452 2026-05-13 cs.CL cs.AI cs.CY

The Algorithmic Caricature: Auditing LLM-Generated Political Discourse Across Crisis Events

Gunjan, Sidahmed Benabderrahmane, Talal Rahwan

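AI summary This study examines political discourse generated by large language models (LLMs) during crisis events and how it differs from observed online discourse. The authors build a paired corpus spanning nine crisis events and compare the two along four dimensions: emotional intensity, structural regularity, lexical-ideological framing, and cross-event dependency. Generated text is fluent but unrealistic at the population level: less varied in sentiment, more regular in structure, and more abstract in wording. The study proposes the Caricature Gap as a measure of this shortfall in social realism and diversity.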

Details
English abstract

Large Language Models (LLMs) can generate fluent political text at scale, raising concerns about synthetic discourse during crises and social conflict. Existing AI-text detection often focuses on sentence-level cues such as perplexity, burstiness, or token irregularities, but these signals may weaken as generative systems improve. We instead adopt a Computational Social Science (CSS) perspective and ask whether synthetic political discourse behaves like an observed online population. We construct a paired corpus of 1,789,406 posts across nine crisis events: COVID-19, the Jan. 6 Capitol attack, the 2020 and 2024 U.S. elections, Dobbs/Roe v. Wade, the 2020 BLM protests, U.S. midterms, the Utah shooting, and the U.S.-Iran war. For each event, we compare observed discourse from social platforms with synthetic discourse generated for the same context. We evaluate four dimensions: emotional intensity, structural regularity, lexical-ideological framing, and cross-event dependency, using mean gaps and dispersion evidence. Across events, synthetic discourse is fluent but unrealistic at the population level. It is generally more negative and less dispersed in sentiment, structurally more regular, and lexically more abstract than observed discourse. Observed discourse instead shows broader emotional variation, longer-tailed structural distributions, and more context-specific, colloquial lexical markers. These differences are event-dependent: larger for fast-moving, decentralized crises and smaller for formal or institutionally mediated events. We summarize them with a simple event-level measure, the Caricature Gap. Our findings suggest that the main limitation of synthetic political discourse is not grammar or fluency, but reduced population realism. Population-level auditing complements traditional text detection and provides a CSS framework for evaluating the social realism of generated discourse.

2605.12449 2026-05-13 cs.CV

LychSim: A Controllable and Interactive Simulation Framework for Vision Research

Wufei Ma, Chloe Wang, Siyi Chen, Jiawei Peng, Patrick Li, Alan Yuille

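AI summary LychSim is a controllable and interactive simulation framework for vision research built on Unreal Engine 5, intended to lower the technical barrier of simulation platforms and to support closed-loop optimization and out-of-distribution evaluation. It combines a streamlined Python API, a procedural data generation pipeline, and integration with the Model Context Protocol, providing high-fidelity environments, semantically aligned 3D annotations, and dynamic interaction with large language models. The authors demonstrate it on synthetic data generation, adversarial examination, and language-driven scene generation, and plan to release it publicly.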

Comments 3D-LLM/VLA Workshop at CVPR 2026. Project page: https://lychsim.github.io/

Details
English abstract

While self-supervised pretraining has reduced vision systems' reliance on synthetic data, simulation remains an indispensable tool for closed-loop optimization and rigorous out-of-distribution (OOD) evaluation. However, modern simulation platforms often present steep technical barriers, requiring extensive expertise in computer graphics and game development. In this work, we present LychSim, a highly controllable and interactive simulation framework built upon Unreal Engine 5 to bridge this gap. LychSim is built around three key designs: (1) a streamlined Python API that abstracts away underlying engine complexities; (2) a procedural data pipeline capable of generating diverse, high-fidelity environments with varying out-of-distribution (OOD) visual challenges, paired with rich 2D and 3D ground truths; and (3) a native integration of the Model Context Protocol (MCP) that transforms the simulator into a dynamic, closed-loop playground for reasoning agentic LLMs. We further annotate scene-level procedural rules and object-level pose alignments to enable semantically aligned 3D ground truths and automated scene modification. We demonstrate LychSim's capability across multiple downstream applications, including serving as a synthetic data engine, powering reinforcement learning-based adversarial examiners, and facilitating interactive, language-driven scene layout generation. To benefit the broader vision community, LychSim will be made publicly available, including full source code and various data annotations.

2605.12446 2026-05-13 cs.LG cs.CL

ORCE: Order-Aware Alignment of Verbalized Confidence in Large Language Models

Chen Li, Xiaoling Hu, Songzhu Zheng, Jiawei Zhou, Chao Chen

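AI summary Large language models often express high confidence even when their answers are wrong, so reliable confidence estimation matters for real-world use. This paper proposes a decoupled, order-aware framework for calibrating verbalized confidence: the model first generates an answer and then estimates confidence conditioned on the fixed question-answer pair, which avoids perturbing answer generation. The method builds a sampling-based surrogate from multiple model completions and optimizes rank-based reinforcement learning objectives so that responses more likely to be correct receive higher verbalized confidence. Experiments show improved calibration and failure prediction while largely preserving answer accuracy.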

Comments 18 pages, 2 figures

Details
English abstract

Large language models (LLMs) often produce answers with high certainty even when they are incorrect, making reliable confidence estimation essential for deployment in real-world scenarios. Verbalized confidence, where models explicitly state their confidence in natural language, provides a flexible and user-facing uncertainty signal that can be applied even when token logits are unavailable. However, existing verbalized-confidence methods often optimize answer generation and confidence generation jointly, which can cause confidence-alignment objectives to interfere with answer accuracy. In this work, we propose a decoupled and order-aware framework for verbalized confidence calibration. Our method first generates an answer and then estimates confidence conditioned on the fixed question-answer pair, allowing confidence optimization without directly perturbing the answer-generation process. To align confidence with correctness likelihood, we construct a sampling-based surrogate from multiple model completions and optimize rank-based reinforcement learning objectives that encourage responses with higher estimated correctness likelihood to receive higher verbalized confidence. Experiments on reasoning and knowledge-intensive benchmarks show that our method improves calibration and failure prediction performance while largely preserving answer accuracy. These results demonstrate that verbalized confidence can be more reliably aligned by decoupling confidence estimation from answer generation and optimizing the relative ordering of confidence across responses.