arXivDaily: arXiv Daily Academic Digest, updated Monday through Friday
2605.11131 2026-05-13 cs.CV

USEMA: a Scalable Efficient Mamba Like Attention for Medical Image Segmentation

Elisha Dayag, Nhat Thanh Tran, Jack Xin

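Affiliations: University of California, Irvine

AI summary: This paper proposes USEMA, a scalable and efficient Mamba-like attention mechanism for medical image segmentation, aimed at the efficiency problem that quadratic computational complexity causes in conventional vision transformers. USEMA combines local window attention with theoretically consistent arithmetic averaging to balance local feature extraction and global information capture, and merges this attention with convolutional neural networks in a hybrid UNet architecture. Experiments across modalities and image sizes show improved computational efficiency over transformer models that use full self-attention, and better segmentation performance than purely convolutional and Mamba-based models.

Abstract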

Accurate medical image segmentation is an integral part of the medical image analysis pipeline that requires the ability to merge local and global information. While vision transformers are able to capture global interactions using vanilla self-attention, their quadratic computational complexity in the input size remains a struggle for medical image segmentation tasks. Motivated by the dispersion property of vanilla self-attention and recent development of Mamba form of attention, Scalable and Efficient Mamba like Attention (SEMA) utilizes token localization via local window attention to avoid dispersion and maintain focusing, complemented by theoretically consistent arithmetic averaging to capture global aspect of attention. In this work, we present USEMA, a hybrid UNet architecture that merges the local feature extraction ability of convolutional neural networks (CNNs) with SEMA attention. We conduct experiments with USEMA across a variety of modalities and image sizes, demonstrating improved computational efficiency compared to transformer based models using full self-attention, and superior segmentation performance relative to purely convolution and Mamba-based models.
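To make the pairing of local window attention with global arithmetic averaging more concrete, here is a minimal Python sketch. It is not the authors' SEMA implementation: the function name `window_plus_mean_attention`, the non-overlapping 1D windows, and the choice to simply add the global mean of the value vectors to the local output are all assumptions made for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def window_plus_mean_attention(q, k, v, window):
    """Hypothetical sketch: softmax attention inside non-overlapping windows,
    plus the arithmetic mean of all value vectors as a cheap global term.
    q, k, v are lists of vectors, one per token."""
    n, d = len(q), len(v[0])
    # Global branch (assumed form): one mean over all tokens, linear in n.
    global_mean = [sum(v[j][c] for j in range(n)) / n for c in range(d)]
    out = []
    for i in range(n):
        start = (i // window) * window
        idx = list(range(start, min(start + window, n)))
        scale = math.sqrt(len(q[i]))
        weights = softmax([dot(q[i], k[j]) / scale for j in idx])
        local = [sum(w * v[j][c] for w, j in zip(weights, idx)) for c in range(d)]
        out.append([lo + g for lo, g in zip(local, global_mean)])
    return out

# Toy input: zero queries and keys give uniform weights inside each window.
q = k = [[0.0]] * 4
v = [[1.0], [3.0], [5.0], [7.0]]
out = window_plus_mean_attention(q, k, v, window=2)
```

In this sketch each token scores only `window` keys, so the attention cost grows with `n * window` instead of `n * n`, while the mean term still lets every token see a summary of the whole sequence.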

2605.11128 2026-05-13 cs.CL

Sampling More, Getting Less: Calibration is the Diversity Bottleneck in LLMs

Amin Banayeeanzade, Qingchuan Yang, Dhruv Tarsadiya, Fatemeh Bahrani, Leonardo Blas, Alfy Samuel, Robin Jia, Meisam Razaviyayn, Sai Praneeth Karimireddy

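Affiliations: University of Southern California

AI summary: This study examines the loss of diversity in large language model (LLM) generation and traces it to miscalibration of the probability distributions at inference time. It proposes a validity-diversity framework that attributes diversity collapse to how the model allocates probability mass across valid and invalid continuations during decoding, and decomposes the problem into two forms of miscalibration: order calibration and shape calibration. Experiments show that this miscalibration is widespread across language models of multiple families and scales and is not merely a product of the sampling strategy.

Abstract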

Diversity is essential for language-model applications ranging from creative generation to scientific discovery, yet modern LLMs often collapse into a narrow subset of plausible outputs. While prior work has developed benchmarks for measuring this lack of diversity, less is known about how the step-by-step probability distributions at inference time cause the problem. We introduce a validity--diversity framework that attributes diversity collapse to how an LLM allocates probability mass across valid and invalid continuations during decoding. This framework decomposes the bottleneck into two complementary forms of miscalibration. First, order calibration: valid tokens are not reliably ranked above invalid tokens, so rank-based cutoff rules must trade off between recovering valid continuations and admitting invalid ones. Second, shape calibration: probability mass is overly concentrated only on few valid continuations while having a heavy-tail of mixed valid and invalid tokens, so maintaining high validity limits diversity. We formalize both mechanisms and show that local failures compound across decoding steps, producing strong sequence-level losses in diversity. Empirically, we develop controlled diagnostics for probing these bottlenecks, including tasks with exactly known valid sets and oracle cutoff baselines. Across 14 language models spanning multiple families and scales, we find that diversity collapse is not merely a limitation of particular sampling heuristics, but a consequence of order and shape miscalibration in the LLM distribution.
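The two forms of miscalibration described in this abstract can be illustrated on a single made-up next-token distribution. The sketch below is only a toy: the token names, probabilities, valid set, and the helper names `topk_tradeoff` and `effective_valid_choices` are assumptions, and the exponentiated-entropy measure is a generic stand-in rather than the paper's own metric.

```python
import math

def topk_tradeoff(probs, valid, k):
    """Keep the k most probable tokens; return (fraction of valid tokens kept,
    share of the kept probability mass that sits on invalid tokens)."""
    ranked = sorted(probs, key=probs.get, reverse=True)
    kept = ranked[:k]
    kept_mass = sum(probs[t] for t in kept)
    recall = sum(1 for t in kept if t in valid) / len(valid)
    invalid_share = sum(probs[t] for t in kept if t not in valid) / kept_mass
    return recall, invalid_share

def effective_valid_choices(probs, valid):
    """exp(entropy) of the distribution renormalized over valid tokens:
    a rough count of how many valid options are actually used."""
    mass = sum(probs[t] for t in valid)
    entropy = -sum((probs[t] / mass) * math.log(probs[t] / mass) for t in valid)
    return math.exp(entropy)

# Hypothetical step: "x" is invalid but ranked above the valid token "c".
probs = {"a": 0.50, "b": 0.20, "x": 0.15, "c": 0.10, "y": 0.05}
valid = {"a", "b", "c"}

recall_k2, invalid_k2 = topk_tradeoff(probs, valid, k=2)  # safe but misses "c"
recall_k4, invalid_k4 = topk_tradeoff(probs, valid, k=4)  # recovers "c", admits "x"
effective = effective_valid_choices(probs, valid)         # fewer than 3 in effect
```

With `k=2` the cutoff admits no invalid token but drops one valid continuation; with `k=4` it recovers every valid continuation only by admitting an invalid one. That is the order-calibration trade-off in miniature, and the effective count below three reflects the concentration that shape calibration refers to.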

2605.11119 2026-05-13 cs.RO

ASIP-Planner: Adaptive Planning for UAV Surface Inspection in Partially Known Indoor Environments

Hanyu Jin, Zhefan Xu, Haoyu Shen, Xinming Han, Kanlong Ye, Kenji Shimada

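Affiliations: Department of Mechanical Engineering, Carnegie Mellon University

AI summary: This paper proposes ASIP-Planner, a UAV surface inspection planning framework for partially known indoor environments, aimed at the occluded views and degraded inspection quality caused by temporary obstacles. The method combines a segment-based global coverage planner with an inspection-oriented local view-angle adaptation module, generating collision-free trajectories and adjusting the viewing direction online to reduce occlusion while preserving the planned trajectory structure. In simulation, the global planner achieves near-complete coverage with shorter trajectories than representative baselines, and real-world flights confirm that the framework produces usable inspection data.

Abstract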

Indoor infrastructure inspection, such as tunnels and industrial facilities, requires systematic surface coverage to ensure that all inspection targets are properly observed. Unmanned Aerial Vehicles (UAVs) offer an alternative to manual inspection by conducting map-guided surface inspection using prior structural models. However, in practice, indoor inspection often relies on floorplan-derived reference maps that may not reflect unforeseen obstacles, such as temporary structures or equipment, leading to occluded viewpoints and degraded inspection quality. Existing coverage planning methods typically assume a fully known inspection environment and perform deterministic global viewpoint optimization based on accurate prior maps, making them vulnerable to environmental discrepancies during execution. This work presents an adaptive UAV inspection framework for partially known structured indoor environments. The proposed method integrates a segment-based global coverage planner with an inspection-oriented local view-angle adaptation module. The global planner organizes planar inspection targets into surface-aligned clusters to generate compact viewpoint sequences with improved orientation consistency. The local planner generates collision-free trajectories and adjusts the viewing direction online to mitigate occlusion-induced coverage loss while preserving the planned trajectory structure. The simulation results across randomized scene configurations demonstrate that the proposed global planner achieves near-complete coverage while reducing trajectory length compared to representative baselines. Real-world flight experiments further validate that the framework produces usable inspection data for downstream analysis. These results indicate that the proposed framework improves inspection efficiency and adaptability in partially known structured indoor environments.

2605.11117 2026-05-13 cs.LG cs.MA math.PR

GRAFT-ATHENA: Self-Improving Agentic Teams for Autonomous Discovery and Evolutionary Numerical Algorithms

Juan Diego Toscano, Zhaojie Chai, George Em Karniadakis

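Affiliations: Division of Applied Mathematics, Brown University

AI summary: GRAFT-ATHENA is a self-improving agentic team framework for autonomous scientific discovery and evolutionary numerical algorithms. By projecting combinatorial decision spaces into factored probabilistic trees, it reduces the parameter footprint from exponential to linear and can accumulate and share methodological experience across domains. The study reports that GRAFT-ATHENA improves over human and prior agentic baselines on physics-informed machine learning benchmarks and tackles real engineering problems, including autonomously proposing regularization constraints and discovering new numerical methods, which provides a foundation for autonomous laboratories.

Abstract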

Scientific discovery can be modeled as a sequence of probabilistic decisions that map physical problems to numerical solutions. Recent agentic AI systems automate individual scientific tasks by orchestrating LLM-driven planners, solvers, and evaluators. Each method is a combination of methodological actions, with many viable combinations for any given problem and structural dependencies between choices. However, existing frameworks treat each problem in isolation, with no shared substrate to accumulate methodological experience across domains. Here we show that GRAFT-ATHENA, a self-improving agentic framework, learns from past problems and autonomously expands its own action space across diverse domains. GRAFT (Graph Reduction to Adaptive Factored Trees) projects combinatorial decision spaces into factored probabilistic trees in which each method is a single path, taking the parameter footprint from exponential to linear. In the lineage of classical Bayesian networks, the factorization is an $I$-map of the policy, and the resulting paths embed as unique fingerprints in a metric space whose closeness lets each new problem learn from similar past ones. On canonical physics-informed machine learning (PIML) benchmarks, GRAFT-ATHENA improves over human and prior agentic baselines, and on production solvers, it tackles complex engineering problems such as reconstructing Mach-10 flow over the Apollo Command Module from a 1968 report and recovering shear-thinning blood-cell rheology. Notably, the system grows its own knowledge substrate, autonomously proposing regularization constraints for ill-posed inverse problems and discovering new numerical methods such as a spectral PINN with exponential convergence. These results provide a foundation for autonomous laboratories that grow more capable with every problem they solve.

2605.11115 2026-05-13 cs.CV cs.GR cs.LG

LatentHDR: Decoupling Exposure from Diffusion via Conditional Latent-to-Latent Mapping for Text/Image-to-Panoramic HDR

Pedram Fekri, WenChen Li, William Chen, Peter Altamirano

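Affiliations: Monks AI Research Lab

AI summary: This paper proposes LatentHDR, a framework for generating high-quality high dynamic range (HDR) images. It decouples scene generation from exposure modeling in latent space: a pretrained diffusion model produces a single coherent scene representation, and a lightweight conditional latent-to-latent mapping module deterministically maps it to exposure-specific representations, yielding a structurally consistent multi-exposure stack in a single pass. The method reduces computation by an order of magnitude and achieves state-of-the-art dynamic range with competitive perceptual quality on synthetic data and the SI-HDR benchmark.

Abstract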

High Dynamic Range (HDR) generation remains challenging for generative models, which are largely limited to low dynamic range outputs. Recent diffusion-based approaches approximate HDR by generating multiple exposure-conditioned samples, incurring high computational cost and structural inconsistencies across exposures. We propose LatentHDR, a framework that decouples scene generation from exposure modeling in latent space. A pretrained diffusion backbone produces a single coherent scene representation, while a lightweight conditional latent-to-latent head deterministically maps it to exposure-specific representations. This enables the generation of a dense, structurally consistent exposure stack in a single pass. This design eliminates multi-pass diffusion, ensures cross-exposure alignment, and enables scalable HDR synthesis. LatentHDR supports both text- and image-conditioned HDR generation for perspective and panoramic scenes. Experiments on synthetic data and the SI-HDR benchmark show that LatentHDR achieves state-of-the-art dynamic range with competitive perceptual quality, while reducing computation by an order of magnitude. Our results demonstrate that high-quality HDR generation can be achieved through structured latent modeling, challenging the need for stochastic multi-exposure generation.

2605.11114 2026-05-13 cs.RO cs.AI

SEVO: Semantic-Enhanced Virtual Observation for Robust VLA Manipulation via Active Illumination and Data-Centric Collection

Tianchonghui Fang, Yuan Zhuang, Fei Miao

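Affiliations: School of Computing, University of Connecticut

AI summary: This study proposes SEVO, a semantic-enhanced virtual observation method that improves the robustness of vision-language-action (VLA) manipulation on low-cost robots across environments. SEVO uses body-fixed cameras that cover the manipulation workspace, active red-spectrum illumination that normalizes object appearance, and real-time semantic segmentation that provides a background-invariant cue, combined with a diversified data collection protocol. With the policy architecture unchanged, SEVO raises grasp success in both training and novel environments, supporting the importance of observation design and data diversity for reliable low-cost manipulation.

Abstract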

Vision-Language-Action (VLA) and imitation-learning policies trained via community toolchains on low-cost hardware frequently fail when deployed outside the training environment. Existing evaluations, including the original ACT and SmolVLA benchmarks, demonstrate high success rates under controlled, fixed backgrounds, yet community practitioners report near-zero transfer to new environments. We present SEVO (Semantic-Enhanced Virtual Observation), a data-centric approach that improves cross-environment manipulation robustness without modifying the policy architecture. SEVO transforms the raw RGB camera stream through three mechanisms: (1) body-fixed cameras whose combined fields of view cover the full manipulation workspace, (2) active red-spectrum illumination that physically normalizes object appearance, and (3) real-time YOLO segmentation overlay that provides a background-invariant semantic cue. Critically, we show that a diversified data collection protocol (systematically varying lighting, backgrounds, and distractors during teleoperation) is the single most important factor for generalization. We target transparent water bottles, objects that visually blend with their surroundings, and select a simple pick-and-place task to enable hundreds of controlled real-robot trials across two mobile platforms. The full pipeline achieves 95% grasp success with ACT and 83% with SmolVLA in the training environment, transferring to novel environments at 85% and 75%. Without SEVO, the same policies achieve only 75%/70% in training and collapse to 30-35% in novel environments. Our results demonstrate that principled observation design and environmental diversity during data collection, not model scaling, enable low-cost robots to operate reliably in everyday household environments.

2605.11107 2026-05-13 cs.CV cs.AI

Birds of a Feather Flock Together: Background-Invariant Representations via Linear Structure in VLMs

Youssef Zaazou, Mark Thomas

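Affiliations: Independent Researcher

AI summary: This study addresses the susceptibility of vision-language models (VLMs) to background interference in image classification. It proposes a method based on the linear additivity of the embedding space that decomposes scene representations into foreground and background components in order to construct background-invariant representations. Pre-trained with synthetic data, the method achieves, to the authors' knowledge, the first worst-group accuracy above 90% on Waterbirds under perfect spurious correlation. It requires no real-world debiased data and shows strong sim-to-real transfer, which makes it practical for deployment.

Comments: 36 pages, 7 figures

Abstract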

Vision-language models (VLMs), such as CLIP and SigLIP 2, are widely used for image classification, yet their vision encoders remain vulnerable to systematic biases that undermine robustness. In particular, correlations between foreground objects and their backgrounds constitute a salient and practically important class of spurious dependencies. In this work, we revisit the well-known property of high linear additivity in VLM embedding spaces and show that it enables a decomposition of scene representations into foreground and background components. Leveraging this insight, we introduce a pre-training approach that exploits this property to construct background-invariant representations using synthetic data. Our method achieves, to our knowledge, the first worst-group accuracy exceeding $90\%$ on Waterbirds under perfect ($100\%$) spurious correlation (i.e., no minority-group examples in the training data). Furthermore, it demonstrates strong sim-to-real transfer and requires no access to real-world debiased data, making it practical for real-world deployment.

2605.11102 2026-05-13 cs.LG cs.AI cs.SY eess.SY

Newton's Lantern: A Reinforcement Learning Framework for Finetuning AC Power Flow Warm Start Models

Shourya Bose, Helgi Hilmarsson, Dhruv Suri

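Affiliations: Pravah

AI summary: This study proposes Newton's Lantern, a reinforcement learning framework for finetuning warm-start models for the AC power flow problem. By analyzing a lower bound on the Newton-Raphson iteration count, it explains why existing supervised methods generalize poorly on heavily loaded instances near voltage collapse. On that basis it designs a finetuning pipeline that combines group relative policy optimization with a learned reward model and uses the iteration count as the supervisory signal. On the IEEE 118-bus, GOC 500-bus, and GOC 2000-bus benchmarks, it is the only method that converges on every test snapshot, and it attains the smallest mean iteration count.

Abstract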

Neural warm starts can sharply reduce the number of Newton-Raphson iterations required to solve the AC power flow problem, but existing supervised approaches generalize poorly on heavily loaded instances near voltage collapse. We prove a lower bound on the Newton-Raphson iteration count that depends on the direction of the warm start error rather than on its magnitude, and show as a corollary that the bound becomes vacuous as the smallest singular value of the power-flow Jacobian shrinks, identifying the failure mode of supervised regression near the saddle-node bifurcation. Motivated by this analysis, we introduce Newton's Lantern, a finetuning pipeline that combines group relative policy optimization with a learned reward model trained on perturbations of the base model's predictions, using the iteration count itself as the supervisory signal. Across IEEE 118-bus, GOC 500-bus, and GOC 2000-bus benchmarks, Newton's Lantern is the only method that converges on every test snapshot while attaining the smallest mean iteration count.
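The quantity this abstract optimizes, the number of Newton-Raphson iterations needed from a given starting point, can be illustrated on a scalar toy equation. The sketch below is not an AC power flow solver and does not reproduce the paper's bound: the cubic, the two starting points, and the helper name `newton_iterations` are assumptions chosen only to show how a better initial guess lowers the iteration count.

```python
def newton_iterations(f, df, x0, tol=1e-10, max_iter=100):
    """Run Newton-Raphson from x0; return (root, iterations used)."""
    x = x0
    for i in range(max_iter):
        if abs(f(x)) < tol:
            return x, i
        x = x - f(x) / df(x)  # standard Newton update
    raise RuntimeError("Newton-Raphson did not converge")

def f(x):
    # Toy residual standing in for the power-flow mismatch equations.
    return x**3 - 2 * x - 5

def df(x):
    return 3 * x**2 - 2

# "Cold" start far from the root versus a "warm" start near it.
root_cold, iters_cold = newton_iterations(f, df, x0=10.0)
root_warm, iters_warm = newton_iterations(f, df, x0=2.1)
```

Both runs reach the same root, but the warm start needs fewer updates. A scalar example cannot show the abstract's point that the direction of the warm-start error matters more than its magnitude, since that is a property of the multivariate problem.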

2605.11098 2026-05-13 cs.SD

AffectCodec: Emotion-Preserving Neural Speech Codec for Expressive Speech Modeling

Jiacheng Shi, Hongfei Du, Xinyuan Song, Y. Alicia Hong, Yanfu Zhang, Ye Gao

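Affiliations: College of William & Mary; Emory University; George Mason University

AI summary: AffectCodec is an emotion-aware neural speech codec for expressive speech modeling, designed to preserve the emotional information in speech during quantization. It combines emotion-semantic guided latent modulation, relation-preserving emotional-semantic distillation, and emotion-weighted semantic alignment to retain emotionally salient cues under compression while maintaining semantic fidelity and prosodic naturalness. Experiments on speech reconstruction, emotion recognition, and downstream text-to-speech generation show improved emotion consistency and perceptual quality.

Comments: Accepted to ACL Findings 2026

Abstract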

Neural speech codecs provide discrete representations for speech language models, but emotional cues are often degraded during quantization. Existing codecs mainly optimize acoustic reconstruction, leaving emotion expressiveness insufficiently modeled at the representation level. We propose an emotion-guided neural speech codec that explicitly preserves emotional information while maintaining semantic fidelity and prosodic naturalness. Our framework combines emotion-semantic guided latent modulation, relation-preserving emotional-semantic distillation, and emotion-weighted semantic alignment to retain emotionally salient cues under compression. Extensive evaluations across speech reconstruction, emotion recognition, and downstream text-to-speech generation demonstrate improved emotion consistency and perceptual quality without sacrificing content accuracy.

2605.11093 2026-05-13 cs.LG cs.AI cs.PF cs.SE cs.SY eess.SY

Enabling Performant and Flexible Model-Internal Observability for LLM Inference

Nengneng Yu, Sixian Xiong, Yibo Zhao, Wei Wang, Zaoxing Liu

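Affiliations: Department of Computer Science

AI summary: Inference-time workloads for large language models increasingly depend on timely access to a model's internal states. This paper presents DMI-Lib, a high-speed deep model inspector that treats internal observability as a first-class systems primitive and decouples it from the inference hot path through an asynchronous observability substrate, the Ring^2 GPU-CPU memory abstraction, and a policy-controlled host backend. Experiments show that DMI-Lib preserves serving optimizations and respects tight GPU memory budgets while reducing latency overhead by 2x to 15x compared with existing baselines.

Abstract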

Today's inference-time workloads increasingly depend on timely access to a model's internal states. We present DMI-Lib, a high-speed deep model inspector that treats internal observability as a first-class systems primitive, decoupling it from the inference hot path via an asynchronous observability substrate built from Ring^2, a GPU-CPU memory abstraction for capturing and staging tensors, and a policy-controlled host backend that exports them. DMI-Lib enables the placement of observation points across a rich space of internal signals and diverse inference backends while preserving serving optimizations and adhering to tight GPU memory budgets. Our experiments demonstrate that DMI-Lib incurs only 0.4%--6.8% overhead in offline batch inference and an average of 6% in moderate online serving, reducing latency overhead by 2x-15x compared to existing baselines with similar observability features. DMI-Lib is open-sourced at https://github.com/ProjectDMX/DMI.

2605.11091 2026-05-13 cs.LG cs.AI

ASD-Bench: A Four-Axis Comprehensive Benchmark of AI Models for Autism Spectrum Disorder

Shubhankit Singh, Hassan Shaikh, Kuldeep Raghuwanshi, Keshav Bulia

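Affiliations: Research Commons AI; IIT Bombay; IIT Delhi

AI summary: This paper presents ASD-Bench, a four-axis comprehensive benchmark for autism spectrum disorder (ASD) that evaluates AI models across age cohorts. The benchmark covers predictive performance, calibration, interpretability, and adversarial robustness, and tests a range of classical machine learning and deep learning models on 4,068 AQ-10 questionnaire records. The study finds that feature importance differs markedly across age cohorts and that a single performance metric is insufficient for assessing the reliability of clinical AI systems.

Comments: 20 pages, 12 figures, 8 tables

Abstract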

Automated ASD screening tools remain limited by single-architecture evaluations, axis-restricted assessment, and near-exclusive focus on adult cohorts, obscuring age-specific diagnostic patterns critical for early intervention. We introduce ASD-Bench, a systematic tabular benchmark evaluating ML, deep learning, and foundation model configurations across three age cohorts (children 1-11 yr, adolescents 12-16 yr, adults 17-64 yr) on four axes: predictive performance, calibration, interpretability, and adversarial robustness. Applied to a curated v3 dataset of 4,068 AQ-10 records, our benchmark spans classical models (XGBoost, AdaBoost, Random Forest, Logistic Regression), neural networks (MLP), deep tabular transformers (TabNet, TabTransformer, FT-Transformer), and TabPFN v2. We introduce the Heuristic Aggregate Penalty (HAP): a cost-sensitive metric penalising false negatives more heavily and incorporating cross-validation variance for deployment stability. Adult classification yields high performance (10/17 models achieve perfect F1 and AUC), while adolescents present a harder task (F1 ceiling 0.837 vs. 0.915 for children). Feature hierarchies shift across cohorts: A9 (social motivation) dominates for children, A5 (pattern recognition) leads for adolescents, and adults exhibit a flatter importance profile consistent with developmental social masking. Accuracy and calibration are dissociated: AdaBoost achieves F1=1.000 on adults with ECE=0.302, confirming single-metric evaluation is insufficient for clinical AI. Cohort-specific deployment recommendations are provided. All findings should be interpreted as proof-of-concept evidence on questionnaire-derived labels rather than clinically validated diagnostic performance.
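The dissociation between accuracy and calibration noted in this abstract is easier to see with the expected calibration error (ECE) written out. The sketch below uses the common equal-width binning form of ECE; the bin count, the helper name `expected_calibration_error`, and the toy inputs are assumptions, and the benchmark's exact ECE configuration is not given in the abstract.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: weighted mean of |confidence - accuracy| per bin.
    confidences: predicted probability of the chosen class, in [0, 1].
    correct: 1 if the prediction was right, else 0."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        b = min(int(conf * n_bins), n_bins - 1)  # clamp conf == 1.0 into last bin
        bins[b].append((conf, ok))
    ece = 0.0
    for items in bins:
        if not items:
            continue
        avg_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(ok for _, ok in items) / len(items)
        ece += (len(items) / n) * abs(avg_conf - accuracy)
    return ece

# Toy case: every prediction is correct, but the model reports only 0.7
# confidence, so accuracy is perfect while calibration is poor.
confidences = [0.7] * 10
correct = [1] * 10
ece = expected_calibration_error(confidences, correct)  # about 0.3
```

A classifier can therefore score a perfect F1 and still carry a large ECE, which is the pattern the abstract reports for AdaBoost on the adult cohort.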

2605.11055 2026-05-13 cs.CV cs.LG

The first global agricultural field boundary map at 10m resolution

Caleb Robinson, Gedeon Muhawenayo, Subash Khanal, Zhanpei Fang, Isaac Corley, Ana M. Tárano, Lyndon Estes, Jennifer Marcus, Nathan Jacobs, Hannah Kerner, Inbal Becker-Reshef, Juan M. Lavista Ferres

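Affiliations: Microsoft AI for Good Research Lab; Arizona State University; Washington University in St. Louis; Oregon State University; Clark University; Taylor Geospatial

AI summary: This paper presents the first global agricultural field boundary map at 10 m resolution, covering 241 countries and territories for 2024 and 2025 and comprising 3.17 billion remote-sensing field polygons. The map is produced by applying a U-Net segmentation model trained on the Fields of The World dataset to cloud-free Sentinel-2 mosaics, and it is validated against ground-truth field boundaries in 24 countries. The dataset is released openly in three forms and provides the first globally consistent field-level unit of analysis for crop monitoring, food security, and related agricultural research.

Abstract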

The agricultural field is the natural unit at which crops are planted, managed, regulated, and reported, yet most global remote-sensing products for agriculture are only available at the pixel level. While some high-quality field-level data products exist, they come from parcel registries covering only parts of Europe or from ML-derived products for individual countries. No openly available, globally consistent map of agricultural field boundaries exists to date. Here we present the first global field boundary dataset at 10 m resolution for the years 2024 and 2025, comprising 3.17 billion remote-sensing field polygons (1.62 B in 2024 and 1.55 B in 2025) across 241 countries and territories, produced by applying a U-Net segmentation model trained on the Fields of The World dataset to cloud-free Sentinel-2 mosaics. Validated against ground-truth field boundaries in 24 countries, the map achieved a mean pixel-level recall of 0.85 with 14 countries exceeding 0.90. Evaluation against full-country ground-truth datasets in Austria, Latvia, and Finland yielded F1 scores of 0.89, 0.88, and 0.74, respectively. Because reference data for global validation is inherently incomplete, we accompanied the map with a 500 m confidence layer that identifies regions where predictions are reliable. We release the dataset openly as three global maps: the confidence-thresholded default field boundary dataset, the full unfiltered dataset, and the continuous-valued confidence raster. These maps provide the first globally consistent field-level unit of analysis for crop monitoring, food security, and downstream agricultural science.

2605.11048 2026-05-13 cs.RO cs.AI

ForceFlow: Learning to Feel and Act via Contact-Driven Flow Matching

Shuoheng Zhang, Yifu Yuan, Hongyao Tang, Yan Zheng, Qiaojun Yu, Pengyi Li, Guowei Huang, Helong Huang, Xingyue Quan, Jianye Hao

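Affiliations: Tianjin University; Huawei Noah's Ark Lab; Shanghai AI Lab

AI summary: This paper proposes ForceFlow, a force-aware reactive framework for robot manipulation in complex contact-rich scenarios. Built on flow matching, the method fuses force signals with multimodal observations to couple force and motion closely, and splits manipulation into a vision-dominant approach stage and a touch-dominant interaction stage to improve robustness and generalization. Across six real-world contact-rich tasks, ForceFlow achieves higher success rates at lower cost and performs well in contact force self-regulation and out-of-distribution generalization.

Abstract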

Existing imitation learning methods enable robots to interact autonomously with the physical environment. However, contact-rich manipulation tasks remain a significant challenge due to complex contact dynamics that demand high-precision force feedback and control. Although recent efforts have attempted to integrate force/torque sensing into policies, how to build a simple yet effective framework that achieves robust generalization under multimodal observations remains an open question. In this paper, we propose ForceFlow, a force-aware reactive framework built upon flow matching. For contact-stage policy design, we investigate force signal fusion mechanisms and adopt an asymmetric multimodal fusion architecture that treats force as a global regulatory signal, combined with a joint prediction paradigm that enhances the policy's understanding of instantaneous force and historical information, thereby achieving deep coupling between force and motion. For task-level hierarchical decomposition, we divide manipulation into a vision-dominant approach stage (VLM-based pointing for target localization) and a touch-dominant interaction stage (force-driven contact execution), with a Vision-to-Force (V2F) handover mechanism that explicitly decouples spatial generalization from contact regulation. Experimental results across six real-world contact-rich tasks demonstrate that ForceFlow achieves a 37% success rate improvement over the strong baseline ForceVLA while maintaining significantly lower cost. Moreover, ForceFlow exhibits accurate force signal prediction and demonstrates superior performance in contact force self-regulation and zero-shot out-of-distribution (OOD) generalization.

2605.11020 2026-05-13 cs.LG cs.AI cs.RO

Trust Region Inverse Reinforcement Learning: Explicit Dual Ascent using Local Policy Updates

Anish Diwan, Davide Tateo, Christopher E. Mower, Haitham Bou-Ammar, Jan Peters, Oleg Arenz

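Affiliations: Technical University of Darmstadt; Lund University; German Research Center for AI (DFKI); Robotics Institute Germany (RIG); Huawei, Noah’s Ark Lab; University College London

AI summary: This paper proposes Trust Region Inverse Reinforcement Learning (TRIRL), an inverse reinforcement learning method that achieves monotonic improvement of the reward function and policy without fully solving a reinforcement learning problem at every iteration. Its core idea is to optimize the policy within a trust region, searching locally around the current policy, so that the dual objective can be optimized explicitly. The method retains monotonic dual improvement while avoiding the training instabilities of adversarial methods, performs strongly on multiple challenging tasks, and recovers reward functions that are robust to shifts in system dynamics.

Comments: Accepted as a conference paper at the International Conference on Machine Learning (ICML) 2026

Abstract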

Inverse reinforcement learning (IRL) is typically formulated as maximizing entropy subject to matching the distribution of expert trajectories. Classical (dual-ascent) IRL guarantees monotonic performance improvement but requires fully solving an RL problem each iteration to compute dual gradients. More recent adversarial methods avoid this cost at the expense of stability and monotonic dual improvement, by directly optimizing the primal problem and using a discriminator to provide rewards. In this work, we bridge the gap between these approaches by enabling monotonic improvement of the reward function and policy without having to fully solve an RL problem at every iteration. Our key theoretical insight is that a trust-region-optimal policy for a reward function update can be globally optimal for a smaller update in the same direction. This smaller update allows us to explicitly optimize the dual objective while only relying on a local search around the current policy. In doing so, our approach avoids the training instabilities of adversarial methods, offers monotonic performance improvement, and learns a reward function in the traditional sense of IRL--one that can be globally optimized to match expert demonstrations. Our proposed algorithm, Trust Region Inverse Reinforcement Learning (TRIRL), outperforms state-of-the-art imitation learning methods across multiple challenging tasks by a factor of 2.4x in terms of aggregate inter-quartile mean, while recovering reward functions that generalize to system dynamics shifts.

2605.11019 2026-05-13 cs.LG cs.AI

Efficient LLM Reasoning via Variational Posterior Guidance with Efficiency Awareness

Zizhao Chen, Yuying Li, Siting Lin, Lianxi Wang

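Affiliations: Guangdong University of Foreign Studies

AI summary: Although large language models rely on chain-of-thought for complex reasoning, overthinking severely degrades inference efficiency. Inspired by cognitive science, this paper proposes VPG-EA, an efficient reasoning framework based on variational posterior guidance. It introduces an efficiency-aware evidence lower bound to cast efficient reasoning as a variational inference problem, and uses a parameter-shared dual-stream architecture to transfer the efficient patterns of the posterior distribution to the prior policy through variational distillation. Experiments show improvements in the comprehensive efficiency metric at both model scales tested.

Abstract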

Although large language models rely on chain-of-thought for complex reasoning, the overthinking phenomenon severely degrades inference efficiency. Existing reinforcement learning methods compress reasoning chains by designing elaborate reward functions, which renders high-quality samples extremely sparse in the exploration space and creates a sampling bottleneck for the prior policy. Inspired by cognitive science, we theoretically prove that a posterior distribution guided by reference answers achieves higher expected utility than the prior distribution, thus capable of breaking through the sampling bottleneck of high-quality samples. However, the posterior distribution is unavailable during inference. To this end, we formalize efficient reasoning as a variational inference problem and introduce an efficiency-aware evidence lower bound as the theoretical foundation. Based on this, we propose the VPG-EA framework. It adopts a parameter-shared dual-stream architecture to instantiate both the posterior distribution and the prior policy; after filtering out pseudo-efficient paths via cross-view evaluation, it unidirectionally transfers the posterior's efficient patterns to the prior policy through variational distillation. Experiments on DeepSeek-R1-Distill-Qwen-1.5B and 7B scales demonstrate that VPG-EA improves the comprehensive efficiency metric $\epsilon^3$ by 8.73% and 12.37% over the strongest baselines on each model size, respectively.

2605.11017 2026-05-13 cs.LG cs.AI cs.IR

Simpson's Paradox in Behavioral Curves: How Aggregation Distorts Parametric Models of User Dynamics

Chao Zhou

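Affiliations: Meta Platforms, Inc.

AI summary: This paper studies the systematic bias that data aggregation introduces into parametric models of user behavioral curves, that is, Simpson's paradox in behavioral curves. It finds that the engagement peak of individual users differs markedly from the peak of the aggregate curve, a distortion driven mainly by survival bias. The paper proposes Synthetic Null Calibration to reduce false positives in per-user classification, and notes that the phenomenon has broad implications for recommendation, advertising, and clinical dosing.

Comments: Submitted to NeurIPS 2026

Abstract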

Behavioral curve modeling -- fitting parametric functions to engagement-versus-exposure data -- is standard practice in recommendation, advertising, and clinical dosing. We show that aggregation introduces a systematic distortion: Simpson's paradox in behavioral curves. On Goodreads (3.3M users, 9 genres), individual users peak at $n^* \approx 11$ exposures while the aggregate peaks at $n^* \approx 34$ -- a 3x gap driven by survival bias. Amazon Electronics (18M reviews) shows a 5.3x distortion. MovieLens-25M ($D \approx 1$) serves as a negative control, confirming that survival bias -- not aggregation per se -- is the operative mechanism. The distortion is robust to category granularity, engagement operationalization, and classifier calibration. We develop Synthetic Null Calibration to address a 32% false positive rate in per-user classification. Our findings apply wherever individual behavioral parameters are estimated from aggregate curves under differential attrition.
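The mechanism in this abstract, an aggregate curve that peaks later than any individual curve because of differential attrition, can be reproduced with a small deterministic toy. Everything in the sketch is invented for illustration: the curve shape, the two cohorts, their baselines and lifetimes, and the names `per_user_curve` and `aggregate_curve` are assumptions, not the paper's model or data.

```python
import math

def per_user_curve(n, scale=11.0):
    """Shared per-user engagement shape; n * exp(-n / scale) peaks at n = scale."""
    return n * math.exp(-n / scale)

# Hypothetical cohorts: low-engagement users leave early, high-engagement users stay.
cohorts = [
    {"users": 90, "baseline": 1.0, "lifetime": 20},
    {"users": 10, "baseline": 5.0, "lifetime": 60},
]

def aggregate_curve(n):
    """Mean engagement at exposure n over the users who are still active."""
    alive = [c for c in cohorts if n <= c["lifetime"]]
    total_users = sum(c["users"] for c in alive)
    if total_users == 0:
        return 0.0
    total = sum(c["users"] * c["baseline"] * per_user_curve(n) for c in alive)
    return total / total_users

exposures = range(1, 61)
individual_peak = max(exposures, key=per_user_curve)  # 11 for every user
aggregate_peak = max(exposures, key=aggregate_curve)  # 21, just after the early cohort leaves
```

Every simulated user peaks at 11 exposures, yet the aggregate peaks at 21 because the surviving population shifts toward high-baseline users once the first cohort drops out. Fitting a parametric curve to the aggregate would therefore misstate the individual-level peak.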

2605.11014 2026-05-13 cs.LG cs.AI

Backbone-Equated Diffusion OOD via Sparse Internal Snapshots

Yadang Alexis Rouzoumka, Jean Pinsolle, Eugénie Terreaux, Christèle Morisseau, Jean-Philippe Ovarlez, Chengfang Ren

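Affiliations: ONERA; SONDRA; Université Paris-Saclay; CentraleSupélec

AI summary: This paper proposes MBE, a fair comparison protocol that addresses the inconsistent evaluation of diffusion-based out-of-distribution (OOD) detectors caused by differences in backbone, corruption parameterization, and test-time budget. It introduces Canonical Feature Snapshots (CFS), a family of detectors that need only a small number of internal activations from a frozen diffusion backbone. On a CIFAR-scale benchmark, CFS performs strongly, and its signal is concentrated in a few sparse internal states rather than in full denoising trajectories or high-capacity downstream heads. The paper also provides a local diagnostic theory that relates these observations to encoder-decoder complementarity and the stability of low-noise corruption.

Abstract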

Fair comparison between diffusion-based OOD detectors is challenging, as conclusions can vary with backbone choice, corruption parameterization, and test-time budget. We address this issue through a Mutualized Backbone-Equated (MBE) protocol that aligns canonical corruption levels and logical test-time cost across diffusion backbones. Within this setting, we introduce Canonical Feature Snapshots (CFS), a family of detectors that probes a frozen diffusion backbone using only a tiny number of native internal activations at canonical low-noise levels. On a controlled CIFAR-scale benchmark, the strongest one-forward CFS variant is CFS(1x2), while an even smaller decoder-only variant remains highly competitive. This shows that much of the relative-OOD signal exposed by frozen diffusion backbones is concentrated in a small number of sparse internal states, rather than requiring full denoising trajectories or high-capacity downstream heads. We further provide a local diagnostic theory explaining these observations through conditional encoder-decoder complementarity, diagonal-score separation, and low-noise corruption stability. The official implementation is available at https://github.com/RouzAY/cfs-diffusion-ood/.

2605.11011 2026-05-13 cs.LG cs.AI

LoopUS: Recasting Pretrained LLMs into Looped Latent Refinement Models

Taekhyun Park, Yongjae Lee, Dohee Kim, Hyerim Bae

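Institutions * Department of Data Science; Department of Industrial Engineering; Pusan National University; Changwon National University

AI summary LoopUS is a post-training framework that converts a pretrained large language model (LLM) into a looped latent refinement model in order to improve its reasoning ability. Through core components including block decomposition, a selective gating mechanism, random deep supervision, and a confidence head, it recasts a standard model as a stable looped architecture without damaging its existing capabilities. LoopUS effectively mitigates computational bottlenecks and representation collapse and significantly improves reasoning performance.

Details
English abstract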

Looped computation shows promise in improving the reasoning-oriented performance of LLMs by scaling test-time compute. However, existing approaches typically require either training recurrent models from scratch or applying disruptive retrofits, which involve substantial computational costs and may compromise pretrained capabilities. To address these limitations, we introduce Looped Depth Up-Scaling (LoopUS), a post-training framework that converts a standard pretrained LLM into a looped architecture. As a key technical contribution, LoopUS recasts the pretrained LLM into an encoder, a looped reasoning block, and a decoder. It operationalizes this latent-refinement architecture through four core components: (1) block decomposition, guided by staged representation dynamics; (2) an input-dependent selective gate to mitigate hidden-state drift; (3) random deep supervision for memory-efficient learning over long recursive horizons; and (4) a confidence head for adaptive early exiting. Collectively, these mechanisms transform a standard non-looped model into a looped form while stabilizing it against both computational bottlenecks and representation collapse. Through stable latent looping, LoopUS improves reasoning-oriented performance without extending the generated traces or requiring recurrent training from scratch. For more details, see https://thrillcrazyer.github.io/LoopUS

2605.11010 2026-05-13 cs.LG

A Comparative Study of Federated Learning Aggregation Strategies under Homogeneous and Heterogeneous Data Distributions

Antonios Makris, Christos Dousis, Emmanouil Kritharakis, Stavros Bouras, Konstantinos Tserpes

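Institutions * School of Electrical and Computer Engineering, National Technical University of Athens, Greece; Department of Informatics and Telematics, Harokopio University of Athens, Greece

AI summary This paper compares the performance of commonly used federated learning aggregation strategies under homogeneous and heterogeneous data distributions. It experimentally analyzes how different aggregation methods affect model accuracy, loss, and system efficiency (aggregation, training, and communication time) as data heterogeneity varies, revealing the relative strengths and weaknesses of each strategy under different data distributions and task conditions. The study offers a useful reference for choosing an aggregation method suited to practical scenarios.

Details
English abstract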

Federated Learning has emerged as a transformative paradigm for collaborative machine learning across distributed environments. However, its performance is strongly influenced by the aggregation strategy used to combine local model updates at the server, which directly affects learning performance, robustness, and system behavior. This work presents a comprehensive experimental comparison of widely used federated aggregation strategies under both homogeneous and heterogeneous data distributions. Using benchmark image classification datasets, we analyze how different aggregation mechanisms respond to varying degrees of data heterogeneity, examining their impact on centralized accuracy and loss, and system-level efficiency metrics, including aggregation, training, and communication time. The results demonstrate that aggregation strategies exhibit distinct trade-offs across datasets and data distributions, with their effectiveness varying according to dataset characteristics and operating conditions.
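
As a concrete instance of what a server-side aggregation strategy does, the sketch below implements a sample-size-weighted average in the style of FedAvg. The abstract does not list which strategies are compared, so FedAvg is only an assumed, commonly used example; the function name `fedavg` and the flat-list parameter format are illustrative.

```python
from typing import List, Sequence


def fedavg(client_params: Sequence[Sequence[float]],
           client_sizes: Sequence[int]) -> List[float]:
    """Average client parameter vectors, weighting each client by its
    number of local training samples (FedAvg-style; assumed example)."""
    if not client_params or len(client_params) != len(client_sizes):
        raise ValueError("need one sample count per client")
    total = sum(client_sizes)
    if total <= 0:
        raise ValueError("total sample count must be positive")
    dim = len(client_params[0])
    aggregated = [0.0] * dim
    for params, n_samples in zip(client_params, client_sizes):
        if len(params) != dim:
            raise ValueError("all clients must share one parameter shape")
        weight = n_samples / total
        for i, value in enumerate(params):
            aggregated[i] += weight * value
    return aggregated


# Two hypothetical clients: the second holds three times as much data,
# so the aggregate sits closer to its parameters.
global_params = fedavg([[1.0, 2.0], [3.0, 4.0]], [1, 3])
```

Under heterogeneous data the client vectors drift apart, which is where alternative weightings or robust aggregators would behave differently from this plain average.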

2605.11009 2026-05-13 cs.LG cs.RO

ACSAC: Adaptive Chunk Size Actor-Critic with Causal Transformer Q-Network

Qian Chen, Junqiao Zhao, Hongtu Zhou, Hang Yu, Yanping Zhao, Chen Ye, Guang Chen

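Institutions * Tongji University

AI summary This paper proposes ACSAC, an actor-critic method with adaptive chunk size, to address the reinforcement learning challenges of long-horizon, sparse-reward tasks. By using a causal Transformer as the Q-network, the method can flexibly choose the best chunk length among action chunks of different sizes, adaptively balancing reactivity against temporal consistency without task-specific tuning. Experiments show that ACSAC achieves state-of-the-art performance on several long-horizon, sparse-reward manipulation tasks.

Details
English abstract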

Long-horizon, sparse-reward tasks pose a fundamental challenge for reinforcement learning, since single-step TD learning suffers from bootstrapping error accumulation across successive Bellman updates. Actor-critic methods with action chunking address this by operating over temporally extended actions, which reduce the effective horizon, enable fast value backups, and support temporally consistent exploration. However, existing methods rely on a fixed chunk size and therefore cannot adaptively balance reactivity against temporal consistency. A large fixed chunk size reduces responsiveness to new observations, while a small one produces incoherent motions, forcing task-specific tuning of the chunk size. To address this limitation, we propose Adaptive Chunk Size Actor-Critic (ACSAC). ACSAC leverages a causal Transformer critic to evaluate expected returns for action chunks of different sizes. At each chunk boundary, it adaptively selects the chunk size that maximizes the expected return, supporting flexible, state-dependent chunk sizes without task-specific tuning. We prove that the ACSAC Bellman operator is a contraction whose unique fixed point is the action-value function of the adaptive policy. Experiments on OGBench demonstrate that ACSAC achieves state-of-the-art performance on long-horizon, sparse-reward manipulation tasks across both offline RL and offline-to-online RL settings.
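
The selection rule at a chunk boundary can be sketched as an argmax over candidate chunk sizes. This is a toy illustration, not the authors' implementation: `q_values` stands in for the causal Transformer critic's return estimates for prefixes of a proposed action chunk, and the tie-break toward shorter chunks is an assumption made here, not something the abstract states.

```python
from typing import Dict, List, Tuple


def select_chunk(actions: List[float],
                 q_values: Dict[int, float]) -> Tuple[int, List[float]]:
    """Return the chunk size with the highest estimated return and the
    prefix of `actions` to execute before the next decision point."""
    valid = {k: q for k, q in q_values.items() if 1 <= k <= len(actions)}
    if not valid:
        raise ValueError("no candidate chunk size fits the proposed actions")
    # Assumed tie-break: prefer the shorter (more reactive) chunk.
    best_size = max(valid, key=lambda k: (valid[k], -k))
    return best_size, actions[:best_size]


# Hypothetical critic estimates for chunk sizes 1, 2 and 4.
size, chunk = select_chunk([0.1, 0.2, 0.3, 0.4], {1: 0.5, 2: 0.9, 4: 0.7})
```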

2605.11008 2026-05-13 cs.LG cs.AI

When and How to Canonize: A Generalization Perspective

Yonatan Sverdlov, Benjamin Friedman, Snir Hordan, Nadav Dym

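Institutions * Technion – Israel Institute of Technology

AI summary This paper theoretically analyzes the generalization performance of achieving invariance through canonization when processing symmetric data. It introduces a theoretical framework based on covering-number bounds, shows that the error bounds of canonized models lie between those of structurally invariant models and non-invariant baselines, and proves that the effectiveness of a canonization depends on its regularity. For point cloud processing, the authors further prove that the covering number of lexicographical sorting grows exponentially with dimension, whereas Hilbert curve canonization guarantees polynomial growth, providing theoretical grounding for the method's success in point cloud architectures.

Details
English abstract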

While invariant architectures are standard for processing symmetric data, there is growing interest in achieving invariance by applying group averaging or canonization to non-invariant backbones. However, the theoretical generalization properties of these alternative strategies remain poorly understood. We introduce a theoretical framework to analyze the generalization error of these methods by bounding their covering numbers. We establish a rigorous generalization hierarchy: the error bounds of canonized models are at best equal to the error bounds of structurally invariant and group-averaged models, and at worst equal to the bounds of non-invariant baselines. Furthermore, we show that there exist optimal canonizations which attain the optimal error bounds, and poor canonizations which attain the non-invariant error bounds, and that this depends on the regularity of the canonization. Finally, applying this framework to permutation groups in point cloud processing, we rigorously prove that the covering number of lexicographical sorting grows exponentially with point cloud dimension, whereas Hilbert curve canonization guarantees polynomial growth. This provides the first formal theoretical justification for the empirical success of Hilbert curve serialization in state-of-the-art point cloud architectures. We conclude with experiments that support our theoretical claims. Code is available at https://github.com/yonatansverdlov/Canonization
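
To make "canonization of a non-invariant backbone" concrete, the sketch below sorts a point cloud lexicographically before passing it to an order-dependent toy function. It shows only that canonization yields permutation invariance; it says nothing about generalization, and the abstract in fact argues that lexicographic sorting is a poor canonization compared with Hilbert curve ordering. The names `backbone` and `canonized_model` are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def canonize_lex(points: Sequence[Point]) -> List[Point]:
    """Map every ordering of the same point set to one representative
    by sorting points lexicographically."""
    return sorted(points)


def backbone(points: Sequence[Point]) -> float:
    """Toy non-invariant model: the output depends on point order."""
    return sum((i + 1) * sum(p) for i, p in enumerate(points))


def canonized_model(points: Sequence[Point]) -> float:
    """Invariant model obtained by canonizing the input first."""
    return backbone(canonize_lex(points))


cloud = [(2.0, 1.0), (0.0, 3.0), (1.0, 1.0)]
shuffled = [cloud[2], cloud[0], cloud[1]]
```

Here `backbone(cloud)` and `backbone(shuffled)` differ, while `canonized_model` returns the same value for both orderings.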

2605.11007 2026-05-13 cs.LG cs.AI

RT-Transformer: The Transformer Block as a Spherical State Estimator

Peter Racioppo

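Institutions * Independent Researcher

AI summary This paper treats the Transformer block as a state estimator on the sphere and shows that its core components (attention, residual connections, and normalization) arise from a single geometric estimation problem. Modeling the latent state as a direction on the hypersphere, with noise defined in the tangent plane at the current estimate, yields a precision-weighted directional inference procedure in which attention aggregates evidence, residual connections implement state updates, and normalization projects the updated state back onto the hypersphere. The work shows that these components follow naturally from the geometry of the estimation problem rather than being independent architectural design choices.

Details
English abstract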

We show that the core components of the Transformer block -- attention, residual connections, and normalization -- arise naturally from a single geometric estimation problem. Modeling the latent state as a direction on the hypersphere, with noise defined in the tangent plane at the current estimate, yields a precision-weighted directional inference procedure in which attention aggregates evidence, residual connections implement incremental state updates, and normalization retracts the updated state back onto the hypersphere. Together, these components follow from the geometry of the estimation problem rather than being introduced as independent architectural choices.
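
The three roles named in the abstract can be illustrated with a minimal update on the unit sphere. This is a sketch under stated assumptions, not the paper's derivation: softmax weights stand in for precision weighting, the aggregated evidence is projected onto the tangent plane at the current state, and all function names are hypothetical.

```python
import math
from typing import List, Sequence


def normalize(v: Sequence[float]) -> List[float]:
    """Retract a nonzero vector onto the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize the zero vector")
    return [x / norm for x in v]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax, used here as assumed attention weights."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def block_step(state: Sequence[float], evidence: Sequence[Sequence[float]],
               scores: Sequence[float], step: float = 1.0) -> List[float]:
    """One update: aggregate evidence, add a tangent residual, renormalize."""
    weights = softmax(scores)
    aggregated = [sum(w * e[i] for w, e in zip(weights, evidence))
                  for i in range(len(state))]
    # Keep only the component in the tangent plane at the current state.
    radial = sum(s * a for s, a in zip(state, aggregated))
    tangent = [a - radial * s for s, a in zip(state, aggregated)]
    moved = [s + step * t for s, t in zip(state, tangent)]  # residual update
    return normalize(moved)                                  # retraction


new_state = block_step([1.0, 0.0], [[0.0, 1.0]], [0.0])
```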

2605.11005 2026-05-13 cs.LG cs.AI cs.DC

DisagMoE: Computation-Communication overlapped MoE Training via Disaggregated AF-Pipe Parallelism

Zhichen Zeng, Chi-Chih Chang, Jiayi Wang, Zezhou Wang, Ningxin Zheng, Zheng Zhong, Cesar A. Stuardo, Dongyang Wang, Mohamed S. Abdelfattah, Haibin Lin, Banghua Zhu, Ang Li, Ziheng Jiang

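Institutions * University of Washington; Cornell University

AI summary This paper proposes DisagMoE, a mixture-of-experts (MoE) training system that targets the communication bottleneck of expert parallelism in large language model training. It places attention layers and feed-forward network layers in separate GPU groups and introduces a multi-stage pipeline with uni-directional, many-to-many communication, which effectively overlaps computation with communication. Experiments show that DisagMoE improves training efficiency across multiple MoE models, with up to 1.8x speedup on 16-node 8xH800 clusters.

Details
English abstract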

Mixture-of-experts (MoE) architectures enable trillion-parameter LLMs with sparsely activated experts. Expert parallelism (EP) is a widely adopted MoE training strategy, but it suffers from severe all-to-all communication bottlenecks, which are exacerbated by limited inter-node network bandwidth as growing model sizes require distributing experts across GPU nodes. Prior work focused on overlapping these all-to-all communications with feed-forward network (FFN) and self-attention computations, which often leaves residual network-bound stalls due to the inherent imbalance in the computation-communication ratios of attention and FFN layers. We present DisagMoE, a disaggregated MoE training system that jointly optimizes model placement and scheduling for maximal efficiency. DisagMoE separates attention and FFN layers into disjoint GPU groups, introduces a multi-stage pipeline with uni-directional, many-to-many communications, and employs a computation-communication roofline model to balance GPU and network bandwidth allocation among the attention and FFN groups. DisagMoE is implemented on Megatron-LM, and evaluation shows that DisagMoE improves training efficiency across multiple MoE models with up to 1.8x speedup on 16-node 8xH800 clusters.

2605.11001 2026-05-13 cs.LG

Finite Volume-Informed Neural Network Framework for 2D Shallow Water Equations: Rugged Loss Landscapes and the Importance of Data Guidance

Xiaofeng Liu

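Institutions * Department of Civil and Environmental Engineering, Pennsylvania State University, University Park, PA, USA; Institute for Computational and Data Sciences, Pennsylvania State University, University Park, PA, USA

AI summary This paper proposes a finite-volume-based physics-informed neural network (FVM-PINN) framework for solving the 2D shallow water equations, addressing the shortcomings of conventional PINNs in handling conservation, discontinuities, and unstructured meshes. It finds that an FVM-PINN trained on the physics equations alone tends to collapse to a trivial low-momentum solution on realistic problems, while sparse data guidance markedly improves performance, reducing velocity-field error by up to 22x. Experiments show that the framework can build an accurate surrogate model in a real river scenario.

Details
English abstract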

Physics-informed neural networks (PINNs) are a simple surrogate-modelling paradigm for partial differential equations, but their standard strong-form residual formulation is ill suited to the shallow water equations (SWE). It cannot enforce local conservation, handle discontinuities, or leverage the boundary-conforming unstructured meshes used in real-world applications. We introduce "Data-Guided FVM-PINN", a framework that replaces the strong-form residual with a differentiable, well-balanced Roe Riemann-solver finite-volume (FVM) loss evaluated on unstructured meshes. The major finding is that physics-only FVM-PINN training often fails on realistic 2D problems: the network collapses to a trivial low-momentum state that nearly satisfies the FVM-PINN residual but bears no resemblance to the true flow. A loss-landscape diagnostic shows that the FVM-PINN loss at zero momentum is only about $7\times$ larger than at the trained solution, a shallow basin that an ordinary optimizer falls into; adding even sparse data turns this into a $310\times$ separation, breaking the degeneracy. On a 2D block-in-channel benchmark, just $200$ random velocity measurements drop the velocity-field $L_2$ error by $22\times$ versus physics-only; $50$ measurements still deliver a $7\times$ reduction. A controlled ablation isolates the contribution of the FVM-PINN loss: it reduces velocity-field $L_2$ error by $\sim 23\%$ in the sparse-data regime and is essentially neutral when dense reference data is available. On a real-world Savannah River reach ($1306$ cells, $3600$ s simulation, five Manning zones), the framework constructs an accurate surrogate from SRH-2D anchor data, with time-window decomposition reducing error monotonically via progressive initial-condition handoff.

2605.10999 2026-05-13 cs.LG cs.AI cs.MA

SkillGen: Verified Inference-Time Agent Skill Synthesis

Yuchen Ma, Yue Huang, Han Bao, Haomin Zhuang, Swadheen Shukla, Michel Galley, Xiangliang Zhang, Stefan Feuerriegel

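Institutions * Munich Center for Machine Learning, LMU Munich; University of Notre Dame; Microsoft Research

AI summary SkillGen is a multi-agent framework that synthesizes a single auditable skill from trajectories generated by a base agent, improving agent performance without retraining the model. By contrasting successful and failed trajectories, it identifies reusable success patterns and causes of failure and produces a human-readable skill description that supports manual verification. Its key novelty is to model skills as interventions on agent behavior and to evaluate their overall effect by comparing performance with and without the skill, which improves results across multiple tasks and datasets.

Details
English abstract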

Skills are a promising way to improve LLM agent capabilities without retraining, while keeping the added procedure reusable and controllable. However, high-quality skills are still largely written by hand. We introduce SkillGen, a multi-agent framework that synthesizes a single auditable skill from trajectories generated by a base agent. The output is a human-readable artifact that can be inspected before use. Rather than merely summarizing trajectories, SkillGen leverages contrastive induction over both successful and failed trajectories to identify reusable success patterns, recurring failure modes, and behaviors that appear in nearby successes but are missing from failures. SkillGen then generates candidate skills and iteratively refines the skill. A key novelty in SkillGen is that we model agent skills as interventions to empirically verify the net effect of skills on the overall performance. Specifically, we compare outcomes on the same instances with and without the skill, so that we account for both repairs (cases where the skill fixes a baseline failure) and regressions (cases where the skill breaks a baseline success). Across a broad range of agents and datasets, SkillGen consistently improves held-out performance, outperforms existing skill-generation baselines, and produces skills that transfer across models.
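
The paired with/without comparison described above can be sketched as a small counting routine. The function name `net_effect` and the boolean success lists are illustrative assumptions; the abstract specifies only that repairs and regressions are both counted on the same instances.

```python
from typing import Dict, Sequence


def net_effect(baseline: Sequence[bool],
               with_skill: Sequence[bool]) -> Dict[str, int]:
    """Compare per-instance success with and without a skill.

    repairs:     baseline failed, skill run succeeded
    regressions: baseline succeeded, skill run failed
    """
    if len(baseline) != len(with_skill):
        raise ValueError("both runs must cover the same instances")
    repairs = sum(1 for b, s in zip(baseline, with_skill) if not b and s)
    regressions = sum(1 for b, s in zip(baseline, with_skill) if b and not s)
    return {"repairs": repairs, "regressions": regressions,
            "net": repairs - regressions}


# Hypothetical outcomes on four held-out instances.
report = net_effect([True, False, False, True], [True, True, True, False])
```

Comparing aggregate success rates alone would report the same net gain but hide the regression, which is why the per-instance pairing matters.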

2605.10993 2026-05-13 cs.RO

ECHO: Continuous Hierarchical Memory for Vision-Language-Action Models

Yanbin Hu, Jin Cui, Jiayi Lu, Ruixuan Yang, Jun Ye, Boran Zhao, Xingyu Chen, Xuguang Lan, Pengju Ren

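Institutions * School of Software, Xi’an Jiaotong University; School of Artificial Intelligence, Xi’an Jiaotong University; State Key Laboratory of Human-Machine Hybrid Augmented Intelligence, Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University

AI summary ECHO is a continuous hierarchical memory framework for vision-language-action (VLA) models that aims to improve performance on long-horizon manipulation tasks. Inspired by the hierarchical organization of human experience, it uses a hyperbolic autoencoder to map VLA hidden states into a continuous hierarchical space and builds a semantic memory tree using hyperbolic metrics and entailment constraints, enabling efficient top-down experience retrieval. A background consolidation mechanism continuously refines the memory tree through geometric interpolation and structural splitting and supports virtual memory synthesis in the continuous space, significantly improving generalization to long task compositions and unseen scenarios.

Details
English abstract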

Memory capacity is a critical factor determining the performance of Vision-Language-Action (VLA) models in long-horizon manipulation tasks. Existing memory-augmented architectures primarily rely on linear or flat storage, lacking structural priors for manipulation categories and hierarchical organization. This deficiency hinders efficient experience retrieval and limits generalization to unseen long-horizon task compositions. Inspired by the hierarchical organization of human experience, we propose ECHO (Experience Consolidation and Hierarchical Organization), a novel memory framework operating within a Continuous Hierarchical Space. By employing a hyperbolic autoencoder, ECHO maps VLA hidden states into this space. Leveraging hyperbolic metrics and entailment constraint mechanisms, experience vectors are organized into a semantic memory tree that supports efficient top-down retrieval. In parallel, a background consolidation mechanism continuously refines the memory tree through geometric interpolation and structural splitting, supporting virtual memory synthesis in the continuous space. We integrate ECHO into the $π_0$ foundation model. Evaluations on LIBERO and preliminary real-world experiments demonstrate the effectiveness of our approach, notably achieving a 12.8% absolute improvement in execution success rate over the $π_0$ baseline on LIBERO-Long, while improving compositional generalization on cross-suite unseen long-horizon tasks.
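
For readers unfamiliar with hyperbolic metrics, the sketch below computes the standard geodesic distance in the Poincaré ball, a common model of hyperbolic space. The abstract does not say which hyperbolic model ECHO uses, so the choice of the Poincaré ball is an assumption, and `poincare_distance` is a hypothetical helper.

```python
import math
from typing import Sequence


def poincare_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Geodesic distance between two points strictly inside the unit ball:
    d(u, v) = arcosh(1 + 2 |u - v|^2 / ((1 - |u|^2) (1 - |v|^2)))."""
    sq_u = sum(x * x for x in u)
    sq_v = sum(x * x for x in v)
    if sq_u >= 1.0 or sq_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_u) * (1.0 - sq_v)))


# Distances from the origin grow quickly near the boundary, which is what
# lets tree-like hierarchies embed with low distortion.
near = poincare_distance([0.0, 0.0], [0.5, 0.0])
far = poincare_distance([0.0, 0.0], [0.9, 0.0])
```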

2605.10991 2026-05-13 cs.LG cs.AI

Test-Time Personalization: A Diagnostic Framework and Probabilistic Fix for Scaling Failures

Linhai Zhang, Yulan He

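Institutions * King’s College London; The Alan Turing Institute

AI summary This paper studies Test-Time Personalization (TTP), an emerging direction that scales inference-time computation by sampling multiple candidates from a personalized policy model and selecting the best one with a personalized reward model. It proves that under ideal (oracle) selection the expected utility grows logarithmically with the number of samples, but existing reward models fail to realize this potential. The authors derive a unified scaling law that reveals two failure modes and propose a probabilistic personalized reward model that effectively mitigates them. Experiments show that the framework scales consistently across multiple policy models and text generation tasks.

Details
English abstract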

Existing approaches to LLM personalization focus on constructing better personalized models or inputs, while treating inference as a single-shot process. In this work, we study Test-Time Personalization (TTP) along an unexplored axis: scaling inference-time computation by sampling N candidates from a personalized policy model and selecting the best with a personalized reward model. We prove that oracle selection yields expected utility growing logarithmically with the number of sampled candidates, establishing a theoretical ceiling for test-time scaling. However, standard reward models fail to realize this potential. To diagnose why, we derive a unified scaling law that decomposes any reward model's Best-of-N curve into four measurable quantities and reveals two failure modes, user-level collapse (near-constant prediction for some users) and query-level reward hacking (negative correlation with true quality for some queries). Guided by this law, we propose a probabilistic personalized reward model whose learned variance effectively mitigates both failure modes. Experiments confirm both elements of our framework: TTP delivers consistent scaling across multiple policy models and personalized text generation tasks, and our scaling law closely matches observed scaling curves across reward-model variants.
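
The Best-of-N procedure studied here reduces to sampling N candidates and keeping the one the reward model scores highest. The sketch below uses toy stand-ins: `sample` and `reward` are hypothetical callables, not the paper's policy or reward models, and the example does not reproduce the scaling law.

```python
import random
from typing import Callable, TypeVar

T = TypeVar("T")


def best_of_n(sample: Callable[[random.Random], T],
              reward: Callable[[T], float],
              n: int,
              rng: random.Random) -> T:
    """Draw n candidates from the policy and return the one with the
    highest reward-model score."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates = [sample(rng) for _ in range(n)]
    return max(candidates, key=reward)


# Toy setup: a candidate is a number in [0, 1) and the reward is its value.
choice = best_of_n(lambda r: r.random(), lambda x: x, 8, random.Random(0))
```

If the reward model is nearly constant for a user, or anti-correlated with true quality for a query, the `max` picks an arbitrary or a bad candidate, which corresponds to the two failure modes named in the abstract.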

2605.10988 2026-05-13 cs.LG cs.AI

Seeing the Needle in the Haystack: Towards Weakly-Supervised Log Instance Anomaly Localization via Counterfactual Perturbation

Yutszyuk Wong, Wentai Wu, Yuen-Ying Yeung, Weiwei Lin

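Institutions * Jinan University, Guangzhou, China; South China University of Technology, Guangzhou, China

AI summary This paper studies how to localize anomalies at the level of individual log instances in large-scale networked systems when only bag-level labels are available. The authors propose LogMILP, which combines multi-instance learning, prototype guidance, and counterfactual perturbation consistency regularization to perform anomaly detection and localization under weak supervision. Experiments on several public datasets show strong detection performance and more reliable instance-level localization.

Comments 6 pages, 2 figures

Details
English abstract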

Log anomaly detection is a critical task for system operations and security assurance. However, in networked systems at scale, log data are generated at massive scale while instance-level annotations are prohibitively expensive, posing great difficulties to fine-grained anomaly localization. To address this challenge, we propose LogMILP (Log anomaly localization based on Multi-Instance Learning enhanced by prototypes and Perturbation), a weakly supervised framework that enables both bag-level anomaly detection and instance-level anomaly localization using only bag-level labels. Our method guides the model to pinpoint the critical log entries using prototype-guided structural modeling with counterfactual perturbation consistency regularization, thereby improving localization reliability and interpretability under coarse-grained supervision. Experimental results on three public datasets demonstrate that LogMILP achieves competitive detection performance while yielding significantly more reliable instance-level localization. Our code is open-sourced at https://github.com/YUK1207/LogMILP.
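
The bag-versus-instance distinction can be illustrated with the simplest multi-instance learning rule, in which a bag of log entries is scored by its most anomalous instance. Max pooling is an assumption chosen for clarity; the abstract does not state how LogMILP pools instance scores, and the scores and threshold below are made up.

```python
from typing import List, Sequence


def bag_score(instance_scores: Sequence[float]) -> float:
    """Bag-level anomaly score under max pooling: a bag is as anomalous
    as its most anomalous log entry."""
    if not instance_scores:
        raise ValueError("a bag must contain at least one instance")
    return max(instance_scores)


def localize(instance_scores: Sequence[float], threshold: float) -> List[int]:
    """Indices of log entries whose score reaches the threshold."""
    return [i for i, s in enumerate(instance_scores) if s >= threshold]


# Hypothetical per-entry scores for one log window labeled anomalous.
scores = [0.10, 0.05, 0.92, 0.20]
flagged = localize(scores, threshold=0.5)
```

Only the bag label supervises training in this setting, so the per-entry scores are learned indirectly through the pooling step.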

2605.10987 2026-05-13 cs.LG cs.AI cs.CR

AESOP: Adversarial Execution-path Selection to Overload Deep Learning Pipelines

Tingxi Li, Mingfang Ji, Ravishka Shemal Rathnasuriya, Simin Chen, Yitao Hu, Wei Yang

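Institutions * The University of Texas at Dallas; Tianjin University

AI summary This paper studies efficiency attacks that arise from dynamic path selection in deep learning inference pipelines and proposes AESOP, an adversarial path-selection framework. By combining vulnerability-guided path ranking with adaptive loss weighting, the method effectively inflates the computation and latency of the pipeline, and experiments show that it substantially strengthens attacks in both white-box and gray-box settings. The study shows that existing attacks targeting a single model have a large performance gap in dynamic pipeline settings and that system-level defenses mitigate the attack but cannot fully prevent its impact.

Details
English abstract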

Modern machine learning deployments increasingly compose specialized models into dynamic inference pipelines, where upstream components produce intermediate predictions that determine the workload and inputs of downstream components. The cost of processing an input is therefore not determined by any single model, but by two coupled factors: the per-inference cost of each invoked component and its workload volume. Because these pipelines run under hard real-time constraints, efficiency is a fundamental requirement for system availability. We show that this structure creates an efficiency-attack surface that existing methods targeting single models cannot exploit: on identical inputs and budgets, path-aware targeting inflates FLOPs by $2,407\times$ while the strongest single-model baseline achieves $117\times$ -- a $20\times$ gap attributable entirely to where the attack is directed. We formalize this as the adversarial path-selection problem and present AESOP, a framework combining vulnerability-guided path ranking with adaptive loss weighting. We evaluate AESOP on five pipelines plus a production-realistic deployment variant with batching, bounded buffering, and confidence-threshold defenses. AESOP achieves up to $2,407\times$ FLOPs and $419\times$ latency inflation in the white-box setting, and $58\times$ FLOPs and $17\times$ latency inflation in gray-box settings. Under system-level defenses, the attack is not neutralized but redirected: pipelines are forced to choose between throughput collapse ($0.578 \to 0.006$ input/s) and $96.7\%$ data loss to sustain throughput.
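
The two coupled cost factors can be made concrete with a toy cost model in which each stage contributes its per-inference cost times the number of items it processes. The stage costs and workloads below are invented for illustration and are not taken from the paper. As a check on the reported figures, 2407 / 117 is about 20.6, consistent with the stated gap of roughly 20x.

```python
from typing import Sequence, Tuple

# (per-inference cost in FLOPs, number of items the stage processes)
Stage = Tuple[float, int]


def pipeline_flops(stages: Sequence[Stage]) -> float:
    """Total cost of one input: sum over stages of cost times workload."""
    return sum(cost * workload for cost, workload in stages)


# Hypothetical two-stage pipeline: a detector followed by a per-crop
# classifier. A benign input yields 2 crops; an adversarial input makes
# the detector emit 100 crops, multiplying the downstream workload.
benign = pipeline_flops([(10.0, 1), (5.0, 2)])
attacked = pipeline_flops([(10.0, 1), (5.0, 100)])
inflation = attacked / benign
```

In this toy case the upstream model's own cost is unchanged; the inflation comes entirely from the workload it hands downstream, which is the surface a single-model attack does not target.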

2605.10985 2026-05-13 cs.LG cs.AI q-bio.BM

Structural Interpretations of Protein Language Model Representations via Differentiable Graph Partitioning

Siddhant Dutta, Edward Tan Beng Wai, Soumick Sarker, Pasan Gunawardane, Jagath C. Rajapakse

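Institutions * Nanyang Technological University

AI summary This study proposes an interpretable approach to protein language model representations: it maps ESM-2 representations onto protein contact graphs via differentiable graph partitioning and uses the SoftBlobGIN network to learn functional substructures, improving both performance and interpretability on prediction tasks. Without retraining the language model and with only a small number of added parameters, the method performs strongly on tasks such as enzyme classification and function prediction and automatically identifies biologically meaningful functional regions such as active-site residues and catalytic contact patterns. Experiments show that the framework markedly improves the accuracy and auditability of structural explanations, providing structure-level transparency for protein language models.

Comments 19 pages, 8 figures, 11 tables, Submitted to NeurIPS 2026

Details
English abstract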

Protein language models such as ESM-2 learn rich residue representations that achieve strong performance on protein function prediction, but their features remain difficult to interpret as structural and evolutionary signals are encoded in dense latent spaces. We propose a plug-and-play framework that projects ESM-2 representations onto protein contact graphs and applies SoftBlobGIN, a lightweight Graph Isomorphism Network with differentiable Gumbel-softmax substructure pooling, to perform structure-aware message passing and learn coarse functional substructures for downstream prediction tasks. Across enzyme classification, SoftBlobGIN achieves 92.8% accuracy and 0.898 macro-F1. Unlike post hoc analysis of protein language models alone, our method produces directly auditable structural explanations: GNNExplainer recovers biologically meaningful active-site residues, spatially localized functional clusters, and catalytic contact patterns. On binding-site detection, SoftBlobGIN improves residue AUROC from $0.885$ using an ESM-2 linear probe to $0.983$, indicating that these structural explanations are not recoverable from language-model features alone. Learned blob partitions provide an additional layer of interpretability by automatically grouping residues into functional substructures, with blobs containing annotated active-site residues showing $1.85\times$ higher importance than other blobs ($ρ = 0.339$, $p = 0.009$), without any active-site supervision. Our framework requires no retraining of the language model, adds only $\sim$1.1M parameters, and generalises across ProteinShake tasks, achieving $F_{\max}$ of $0.733$ on Gene Ontology prediction and AUROC of $0.969$ on binding-site detection. We position this as an interpretable structural companion to protein language models that makes their predictions more transparent and auditable.