arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.12088 2026-05-14 cs.CV

UniCustom: Unified Visual Conditioning for Multi-Reference Image Generation

Yiyan Xu, Qiulin Wang, Wenjie Wang, Yunyao Mao, Xintao Wang, Pengfei Wan, Kun Gai, Fuli Feng

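Affiliations: University of Science and Technology of China; Kling Team, Kuaishou Technology

AI summary: This paper studies multi-reference image generation: generating an image under a text instruction while faithfully preserving the subject identities and appearance details from multiple reference images. Existing methods usually process semantic and appearance features separately, which makes it hard for the model to associate each subject with the details of its own reference image and leads to attribute leakage and cross-reference confusion. The authors propose UniCustom, a framework that fuses ViT and VAE features before vision-language model encoding so that the model learns subject semantics and appearance jointly, and further improve generation quality with a two-stage training strategy and slot-wise binding regularization. Experiments show that UniCustom significantly outperforms existing methods on multiple benchmarks.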

Abstract

Multi-reference image generation aims to synthesize images from textual instructions while faithfully preserving subject identities from multiple reference images. Existing VLM-enhanced diffusion models commonly rely on decoupled visual conditioning: semantic ViT features are processed by the VLM for instruction understanding, whereas appearance-rich VAE features are injected later into the diffusion backbone. Despite its intuitive design, this separation makes it difficult for the model to associate each semantically grounded subject with visual details from the correct reference image. As a result, the model may recognize which subject is being referred to, but fail to preserve its identity and fine-grained appearance, leading to attribute leakage and cross-reference confusion in complex multi-reference settings. To address this issue, we propose UniCustom, a unified visual conditioning framework that fuses ViT and VAE features before VLM encoding. This early fusion exposes the VLM to both semantic cues and appearance-rich details, enabling its hidden states to jointly encode the referred subject and corresponding visual appearance with only a lightweight linear fusion layer. To learn such unified representations, we adopt a two-stage training strategy: reconstruction-oriented pretraining that preserves reference-specific appearance details in the fused hidden states, followed by supervised finetuning on single- and multi-reference generation tasks. We further introduce a slot-wise binding regularization that encourages each image slot to preserve low-level details of its corresponding reference, thereby reducing cross-reference entanglement. Experiments on two multi-reference generation benchmarks demonstrate that UniCustom consistently improves subject consistency, instruction following, and compositional fidelity over strong baselines.
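
The early-fusion step described above can be illustrated with a minimal sketch. It assumes that the ViT and VAE features are spatially aligned token lists and that fusion is a concatenation followed by one linear projection; the abstract only says "a lightweight linear fusion layer", so the concatenation, the function names `linear` and `fuse_tokens`, and the toy dimensions are all assumptions, not the authors' implementation.

```python
def linear(vector, weights, bias):
    """Apply y = W x + b to one token; `weights` is a list of rows."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def fuse_tokens(vit_tokens, vae_tokens, weights, bias):
    """Concatenate per-token ViT and VAE features, then project once.

    Assumption (not from the paper): both inputs have one feature vector
    per spatial position, in the same order.
    """
    if len(vit_tokens) != len(vae_tokens):
        raise ValueError("sketch assumes spatially aligned token grids")
    return [linear(vit + vae, weights, bias)
            for vit, vae in zip(vit_tokens, vae_tokens)]


# Toy usage: 2-dim ViT feature, 1-dim VAE feature, 2-dim fused token.
fused = fuse_tokens(
    vit_tokens=[[1.0, 2.0]],
    vae_tokens=[[3.0]],
    weights=[[1, 0, 1], [0, 1, 0]],  # hypothetical projection matrix
    bias=[0, 0],
)
```

The fused tokens would then be what the VLM encodes, so that semantic and appearance information for a reference share one representation from the start.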

2605.12072 2026-05-14 cs.CV

PairDropGS: Paired Dropout-Induced Consistency Regularization for Sparse-View Gaussian Splatting

Hantang Li, Qiang Zhu, Xiandong Meng, Xingtao Wang, Debin Zhao, Xiaopeng Fan

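Affiliations: School of Computer Science, Harbin Institute of Technology, Shenzhen, China; Pengcheng Laboratory, Shenzhen, China; Smart Coding Institute, Pengcheng Laboratory, Shenzhen, China; School of Computer Science, Harbin Institute of Technology, Harbin, China

AI summary: PairDropGS is a paired-dropout consistency regularization method that aims to improve the stability and quality of sparse-view Gaussian Splatting reconstruction. It constructs paired dropout subsets from a shared Gaussian field and applies a low-frequency consistency regularization that keeps scene layout and coarse geometry stable while avoiding over-constraining high-frequency details. PairDropGS also uses a progressive consistency schedule to improve robustness during training, and experiments show better reconstruction than existing methods on several benchmark datasets.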

Comments 11 pages, 8 figures

Abstract

Dropout-based sparse-view 3D Gaussian Splatting (3DGS) methods alleviate overfitting by randomly suppressing Gaussian primitives during training. Existing methods mainly focus on designing increasingly sophisticated dropout strategies, while they overlook the resulting inconsistencies among different dropped Gaussian subsets. This oversight often leads to unstable reconstruction and suboptimal Gaussian representation learning. In this paper, we revisit dropout-based sparse-view 3DGS from a consistency regularization perspective and propose PairDropGS, a Paired Dropout-induced Consistency Regularization framework for sparse-view Gaussian splatting. Specifically, PairDropGS first constructs a pair of dropped Gaussian subsets from a shared Gaussian field and designs a low-frequency consistency regularization to constrain their low-frequency rendered structures. This design encourages the shared Gaussian field to preserve stable scene layout and coarse geometry under different random dropouts, while avoiding excessive constraints on ambiguous high-frequency details. Moreover, we introduce a progressive consistency scheduling strategy to gradually strengthen the consistency regularization during training for stability and robustness of reconstruction. Extensive experiments on widely used sparse-view benchmarks demonstrate that PairDropGS achieves superior training stability and significantly outperforms existing dropout-based 3DGS methods in reconstruction quality, while remaining simple and plug-and-play for improving dropout-based optimization.
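
The three ingredients named in the abstract (paired dropout from a shared field, a low-frequency consistency term, and a progressive schedule) can be sketched on a toy 1D problem. Everything concrete here is an assumption made for illustration: the 1D "renderer", the box filter used as the low-pass operator, the mean-squared-error form of the loss, the linear ramp, and all function names. The paper's actual renderer, filter, and schedule are not specified in the abstract.

```python
import math
import random


def render(primitives, keep_mask, grid):
    """Toy 1D renderer: sum of the kept Gaussian bumps on a grid."""
    image = []
    for x in grid:
        value = 0.0
        for (mu, sigma, weight), keep in zip(primitives, keep_mask):
            if keep:
                value += weight * math.exp(-0.5 * ((x - mu) / sigma) ** 2)
        image.append(value)
    return image


def low_pass(signal, radius=2):
    """Box filter, a stand-in for whatever low-frequency extraction is used."""
    out = []
    for i in range(len(signal)):
        lo, hi = max(0, i - radius), min(len(signal), i + radius + 1)
        out.append(sum(signal[lo:hi]) / (hi - lo))
    return out


def dropout_mask(n, drop_rate, rng):
    """Keep each primitive independently with probability 1 - drop_rate."""
    return [rng.random() >= drop_rate for _ in range(n)]


def paired_consistency_loss(primitives, grid, drop_rate, rng):
    """MSE between low-passed renders of two dropout subsets of one field."""
    mask_a = dropout_mask(len(primitives), drop_rate, rng)
    mask_b = dropout_mask(len(primitives), drop_rate, rng)
    lf_a = low_pass(render(primitives, mask_a, grid))
    lf_b = low_pass(render(primitives, mask_b, grid))
    return sum((a - b) ** 2 for a, b in zip(lf_a, lf_b)) / len(grid)


def consistency_weight(step, total_steps, max_weight=1.0):
    """Hypothetical linear ramp that strengthens the regularizer over training."""
    return max_weight * min(1.0, step / total_steps)
```

In a training loop, one would add `consistency_weight(step, total_steps) * paired_consistency_loss(...)` to the usual photometric loss; because only the low-passed renders are compared, the two subsets remain free to disagree on fine detail.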

2605.11726 2026-05-14 cs.LG

Block-R1: Rethinking the Role of Block Size in Multi-domain Reinforcement Learning for Diffusion Large Language Models

Yan Jiang, Ruihong Qiu, Zi Huang

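Affiliations: The University of Queensland

AI summary: This paper studies how block size affects performance in multi-domain reinforcement learning (RL) post-training of diffusion large language models (dLLMs) and, from a domain-conflict perspective, introduces the notion of block size conflict. It builds a new dataset, Block-R1-41K, and a benchmark, Block-R1, that support single-domain and cross-domain RL post-training, and proposes a cross-domain training method based on sample-level best block sizes. Experiments cover 13 datasets, 7 recent RL algorithms, and several dLLM models, and validate the method.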

Abstract

Recently, reinforcement learning (RL) has been widely applied during post-training for diffusion large language models (dLLMs) to enhance reasoning with block-wise semi-autoregressive generation. Block size has therefore become a vital factor in dLLMs, since it determines the parallel decoding granularity and affects the rollout trajectories during RL optimisation, e.g., GRPO. Instead of investigating the effect of block size during inference on individual domains, this paper studies block size from a domain conflict perspective for dLLM RL post-training in multi-domain scenarios. The main contributions are: (1) a formulation of domain block size conflict in multi-domain RL for dLLMs, which largely affects the post-training effectiveness of rollout-based RL methods; (2) a novel dataset, Block-R1-41K, constructed with a best-improved training block size for each sample, which also induces a Block Size Conflict Score to quantitatively measure the domain conflict; (3) a new benchmark, Block-R1, for flexible RL post-training for dLLMs in both single-domain and cross-domain settings; and (4) a simple yet powerful cross-domain post-training method with sample-level best-improved training block sizes. Block-R1 covers extensive experiments on 13 distinct datasets, 7 recent RL algorithms, and diverse dLLM backbones. The benchmark is open-sourced at https://github.com/YanJiangJerry/Block-R1 with the dataset released at https://huggingface.co/datasets/YanJiangJerry/Block-R1-41K.

2605.11231 2026-05-14 cs.LG cs.AI

LiBaGS: Lightweight Boundary Gap Synthesis for Targeted Synthetic Data Selection

Abhishek Moturu, Anna Goldenberg, Babak Taati

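Affiliations: Department of Computer Science; University of Toronto; The Hospital for Sick Children; UHN; KITE Research Institute; T-CAIREM; Vector Institute

AI summary: This paper proposes LiBaGS, a lightweight synthetic data selection method that picks task-relevant synthetic samples to fill gaps in the training distribution. It combines decision-boundary distance, predictive uncertainty, real-data density, and support validity to select samples that are informative and close to the real data manifold. With a boundary-gap allocation rule and a marginal-value stopping criterion, LiBaGS efficiently selects samples from sparse but realistic boundary neighborhoods and improves accuracy on downstream tasks.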

Abstract

Synthetic data is useful only when the added samples fill missing parts of the training distribution that matter for the downstream task. We introduce LiBaGS, a lightweight, generator-agnostic method for targeted synthetic training data selection. LiBaGS scores candidate synthetic samples by combining decision-boundary proximity, predictive uncertainty, real-data density, and support validity, so that selected samples are both informative and likely to remain on the real data manifold. We then use a boundary-gap allocation rule that targets sparse but realistic decision-boundary neighborhoods, rather than simply adding more data or selecting only the most uncertain candidates. LiBaGS also learns when enough synthetic samples have been added through a marginal-value stopping rule, assigns softer labels near ambiguous boundaries, and uses a diversity objective to avoid redundant near-duplicate selections. Experiments show that LiBaGS improves accuracy over classical oversampling, hard augmentation, uncertainty and density ablations, and targeted-generation selection criteria.

2605.10556 2026-05-14 cs.CV cs.LG

EnergyLens: Interpretable Closed-Form Energy Models for Multimodal LLM Inference Serving

Vittorio Palladino, Gianluca Palermo, Michael E. Papka, Zhiling Lan

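Affiliations: University of Illinois Chicago

AI summary: As large language model architectures diversify and handle multimodal workloads on heterogeneous accelerators, optimizing inference energy has become as critical as optimizing latency and throughput. Existing approaches either use latency as an energy proxy or rely on data-hungry black-box models, and neither adapts well to different parallelism strategies. The paper proposes EnergyLens, which uses symbolic regression over profiling data to derive a 12-parameter closed-form energy model that describes how system properties such as degree of parallelism, batch size, and sequence length affect energy. Its predictions are physically interpretable, and it needs only a small number of profiling samples to achieve accurate configuration selection and to generalize across hardware platforms.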

Comments 10 pages

Abstract

As large language models span dense, mixture-of-experts, and state-space architectures and are deployed on heterogeneous accelerators under increasingly diverse multimodal workloads, optimizing inference energy has become as critical as optimizing latency and throughput. Existing approaches either treat latency as an energy proxy or rely on data-hungry black-box surrogates. Both fail under varying parallelism strategies: latency and energy optima diverge in over 20% of configurations we tested, and black-box surrogates require hundreds of profiling samples to generalize across model families and hardware. We present EnergyLens, which uses symbolic regression as a structure-discovery tool over profiling data to derive a single twelve-parameter closed-form energy model expressed in terms of system properties such as degree of parallelism, batch size, and sequence length. Unlike black-box surrogates, EnergyLens decouples tensor and pipeline parallelism contributions and separates prefill from decode energy, making its predictions physically interpretable and actionable. Fitted from as few as 50 profiling measurements, EnergyLens achieves 88.2% Top-1 configuration selection accuracy across many evaluation scenarios compared to 60.9% for the closest prior analytical baseline, matches the predictive accuracy of ensemble ML methods with 10x fewer profiling samples, and extrapolates reliably to unseen batch sizes and hardware platforms without structural modification, making it a practical, interpretable tool for energy-optimal LLM deployment.

2605.10415 2026-05-14 cs.CL

Aligning LLM Uncertainty with Human Disagreement in Subjectivity Analysis

Junyu Lu, Deyi Ji, Xuanyi Liu, Lanyun Zhu, Bo Xu, Liang Yang, Xian-Sheng Hua, Hongfei Lin

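Affiliations: School of Computer Science and Technology, Dalian University of Technology, China; Tencent; Peking University; Tongji University

AI summary: This paper studies how to align the uncertainty of large language models with disagreement in human judgments for subjectivity analysis tasks. Conventional training on aggregated labels ignores the intrinsic uncertainty of low-agreement samples and makes predictions overconfident. The authors propose the two-phase DPUA framework, which perceives human disagreement and adjusts model uncertainty, preserving task performance while improving reliability on boundary samples and out-of-distribution generalization.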

Abstract

Large language models for subjectivity analysis are typically trained with aggregated labels, which compress variations in human judgment into a single supervision signal. This paradigm overlooks the intrinsic uncertainty of low-agreement samples and often induces overconfident predictions, undermining reliability and generalization in complex subjective settings. In this work, we advocate uncertainty-aware subjectivity analysis, where models are expected to make predictions while expressing uncertainty that reflects human disagreement. To operationalize this perspective, we propose a two-phase Disagreement Perception and Uncertainty Alignment (DPUA) framework. Specifically, DPUA jointly models label prediction, rationale generation, and uncertainty expression under an uncertainty-aware setting. In the disagreement perception phase, adaptive decoupled learning enhances the model's sensitivity to disagreement-related cues while preserving task performance. In the uncertainty alignment phase, GRPO-based reward optimization further improves uncertainty-aware reasoning and aligns the model's confidence expression with the human disagreement distribution. Experiments on three subjectivity analysis tasks show that DPUA preserves task performance while better aligning model uncertainty with human disagreement, mitigating overconfidence on boundary samples, and improving out-of-distribution generalization.
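
A small sketch may make the contrast between aggregated labels and the human disagreement distribution concrete. The reward below, one minus the total variation distance between the model's expressed label distribution and the annotators' empirical distribution, is a hypothetical choice for illustration; the abstract does not state the reward used in DPUA's GRPO phase, and the function names are made up here.

```python
from collections import Counter


def label_distribution(annotations, labels):
    """Empirical distribution of annotator votes over a fixed label order."""
    counts = Counter(annotations)
    return [counts[label] / len(annotations) for label in labels]


def majority_label(annotations):
    """Aggregated label: what standard training keeps, discarding disagreement."""
    return Counter(annotations).most_common(1)[0][0]


def alignment_reward(model_probs, human_probs):
    """Hypothetical reward: 1 - total variation distance, in [0, 1]."""
    tv = 0.5 * sum(abs(m - h) for m, h in zip(model_probs, human_probs))
    return 1.0 - tv


# Four annotators, one of whom disagrees.
votes = ["subjective", "subjective", "objective", "subjective"]
labels = ["subjective", "objective"]
human = label_distribution(votes, labels)            # [0.75, 0.25]
overconfident = alignment_reward([1.0, 0.0], human)  # 0.75
calibrated = alignment_reward([0.75, 0.25], human)   # 1.0
```

Under this toy reward, a model that predicts the majority label with full confidence is penalized relative to one whose expressed confidence matches the 3-to-1 split among annotators.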

2605.09505 2026-05-14 cs.AI

EpiGraph: Building Generalists for Evidence-Intensive Epilepsy Reasoning in the Wild

Yuyang Dai, Zheng Chen, Jathurshan Pradeepkumar, Yasuko Matsubara, Jimeng Sun, Yasushi Sakurai, Yushun Dong

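Affiliations: Florida State University; The University of Osaka; University of Illinois Urbana-Champaign

AI summary: The study presents EpiGraph, a large-scale epilepsy knowledge graph and evaluation benchmark aimed at improving evidence-based clinical reasoning. EpiGraph integrates 48,166 peer-reviewed papers and seven clinical resources into a heterogeneous graph with 24,324 entities and 32,009 evidence-supported triplets, and defines five clinical tasks on top of it to evaluate models. Experiments show that large language models combined with EpiGraph improve markedly on several tasks, with gains of 30% to 41% in pharmacogenomic reasoning, supporting the value of structured knowledge for clinical reasoning.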

Abstract

Epilepsy diagnosis and treatment require evidence-intensive reasoning across heterogeneous clinical knowledge, including biosignal patterns, genetic mechanisms, pharmacogenomics, treatment strategies, and patient outcomes. In this work, we present EpiGraph, a large-scale epilepsy knowledge graph and benchmark for evaluating knowledge-augmented clinical reasoning. EpiGraph integrates 48,166 peer-reviewed papers and seven clinical resources into a heterogeneous graph containing 24,324 entities and 32,009 evidence-grounded triplets across five clinical layers. Built upon this graph, EpiBench defines five clinically motivated tasks spanning clinical decision-making, EEG report generation, pharmacogenomic precision medicine, treatment recommendation, and deep research planning. We evaluate six LLMs under both standard and Graph-RAG settings. Results show that integrating EpiGraph consistently improves performance across all tasks, with the largest gains observed in pharmacogenomic reasoning (+30 to 41%). Our findings demonstrate that structured epilepsy knowledge substantially enhances evidence-grounded clinical reasoning and provides a practical benchmark framework for evaluating knowledge-augmented LLMs in real-world neurological settings. Our code is available at: https://github.com/LabRAI/EEG-KG.

2605.09415 2026-05-14 cs.AI

Strategic commitments shape collective cybersecurity under AI inequality

Adeela Bashir, Zia Ush Shamszaman, Zhao Song, Matjaz Perc, The Anh Han

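Affiliations: Government agencies; Cybersecurity vendors; Regulated institutions

AI summary: As AI is widely adopted in cybersecurity, the balance of power between attackers and defenders is shifting. This paper studies the security risk that arises when access to AI defence tools is uneven and resource-limited defenders cannot protect their systems effectively, and proposes that introducing committed defenders together with targeted subsidies can significantly raise overall defence capability and system resilience. The study also shows that this strategy improves defenders' security payoffs while limiting attackers' gains, providing theoretical support for cybersecurity policy in AI-driven environments.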

Comments 26 pages, 16 figures

Abstract

The growing integration of AI into cybersecurity is reshaping the balance between attackers and defenders. When access to advanced AI-enabled defence tools is uneven, resource-limited defenders may be unable to adopt effective protection, creating persistent system vulnerabilities. We study the impact of differential AI access using an evolutionary game-theoretic model in a finite population. We first show that when high-capability defence is costly, the population is driven toward low-cost, weak-defence behaviour, sustaining attacks and weakening long-run security. To address this problem, we introduce differential access to AI defence tools by allowing defenders to choose between low- and high-capability protection based on their resources. We then examine the role of a small group of committed defenders who always adopt strong defence and influence others through social learning. Although commitment increases the prevalence of strong defence, it alone cannot stabilise secure outcomes due to high defence costs. We therefore incorporate a targeted subsidy to remove the cost disadvantage from committed defenders. Our analysis shows that subsidised commitment significantly increases strong defence adoption, suppresses successful attacks, and improves overall system resilience. Simulations across a broad parameter space confirm that subsidies consistently outperform commitment alone. In addition, social-welfare analysis shows improved defender outcomes while keeping attacker gains low. These findings suggest that targeted support for key defenders can be an effective mechanism for stabilising cybersecurity in AI-driven environments and provide a theoretical bridge between cybersecurity policy, AI governance, and strategic allocation of defensive AI capabilities.

2605.09134 2026-05-14 cs.AI cs.SE

BoostAPR: Boosting Automated Program Repair via Execution-Grounded Reinforcement Learning with Dual Reward Models

Yuanhao Li, Hongbo Wang, Xiaotang Shang, Xunzhu Tang, Yiming Cao, Xuhong Chen

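Affiliations: State Key Laboratory of Networking and Switching Technology; Beijing University of Posts and Telecommunications; University of Luxembourg

AI summary: BoostAPR is an execution-grounded reinforcement learning framework that addresses sparse feedback and overly coarse rewards in program repair. It proceeds in three stages: supervised fine-tuning on execution-verified demonstrations with reasoning traces; training dual reward models from execution outcomes to assess repairs at the sequence level and at the line level; and PPO optimization in which line-level rewards are redistributed to the critical code-edit regions. Experiments show strong repair results on several benchmarks and good cross-language generalization.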

Comments 21 pages, 2 figures. Accepted at ICML 2026

Abstract

Reinforcement learning for program repair is hindered by sparse execution feedback and coarse sequence-level rewards that obscure which edits actually fix bugs. We present BoostAPR, a three-stage framework addressing these challenges: (1) supervised fine-tuning on execution-verified demonstrations with reasoning traces, (2) training dual reward models (a sequence-level assessor and a line-level credit allocator) from execution outcomes, and (3) PPO optimization where the line-level model redistributes rewards to critical edit regions. This line-level credit assignment operates at an intermediate granularity naturally suited to code changes. Trained on SWE-Gym and evaluated on four benchmarks, BoostAPR achieves 40.7% on SWE-bench Verified (+22.9pp over base model), 24.8% on Defects4J (Python-to-Java transfer), 84.5% on HumanEval-Java, and 95.0% on QuixBugs. These results are competitive among open-source models and show strong cross-language generalization.
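
The credit-assignment idea in stage (3) can be sketched in a few lines. The sketch assumes that the sequence-level reward is split across edited lines in proportion to non-negative line-level credit scores; this proportional rule and the name `redistribute_reward` are assumptions made for illustration, since the abstract does not say how BoostAPR performs the redistribution.

```python
def redistribute_reward(sequence_reward, line_credits):
    """Split one sequence-level reward across lines by their credit scores.

    Assumes non-negative credits (hypothetical outputs of a line-level
    model). Falls back to a uniform split if every credit is zero.
    """
    total = sum(line_credits)
    if total == 0:
        share = sequence_reward / len(line_credits)
        return [share] * len(line_credits)
    return [sequence_reward * credit / total for credit in line_credits]


# A patch with two edited lines; the second is judged three times as
# important for fixing the bug.
per_line = redistribute_reward(1.0, [1, 3])
```

Compared with giving every token the same sequence-level reward, a split like this concentrates the learning signal on the lines the credit model considers responsible for the fix.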

2605.09020 2026-05-14 cs.CV

The Direct Integration Theorem: A Rigorous Framework for Consistent Discrete Solutions of the Inverse Radon Problem

Mikhail G. Mozerov

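Affiliations: Institute for Information Transmission Problems, Russian Academy of Sciences

AI summary: This paper proposes a new Direct Integration Theorem (DIT) as a non-trivial corollary of the classical Central Slice Theorem (CST), giving a rigorous framework for a mathematically consistent transition from the continuous to the discrete domain, a fundamental difficulty in computed tomography. The method needs neither conventional ramp filtering nor frequency-domain interpolation, avoids the zero-frequency singularity and spectral distortion, and achieves quasi-exact reconstruction determined by the sampling parameters and grid geometry. Experiments show that it outperforms conventional filtered back projection (FBP) in preserving image variance, reconstruction quality, and reprojection fidelity, and better restores the image's statistical characteristics.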

Comments Submitted to IEEE TPAMI. Code and data available at https://github.com/Mozerov-iitp/radon-dit/

Abstract

This paper presents a novel Direct Integration Theorem (DIT), derived as a non-trivial corollary of the classical Central Slice Theorem (CST). The DIT provides a mathematically consistent transition from the continuous to the discrete domain, a fundamental challenge in computed tomography, thereby eliminating the need for frequency-domain interpolation without resorting to conventional ramp-filtering. The proposed approach circumvents two principal limitations inherent in traditional methods: (i) the zero-frequency singularity and spectral distortions introduced by the mandatory ramp-filtering step, and (ii) discretization inaccuracies associated with frequency-domain interpolation. Based on the DIT, we develop a rigorous framework for consistent discrete solutions of the inverse Radon problem. Mathematical modeling demonstrates that this approach achieves quasi-exact reconstruction, with errors constrained solely by sampling parameters and grid geometry. Furthermore, while Filtered Back Projection (FBP) inherently distorts the variance of the reconstructed image, the DIT-based algorithm preserves it. Comparative simulations confirm that the proposed method eliminates common artifacts, such as intensity cupping, and consistently outperforms FBP in terms of PSNR, SSIM, and reprojection fidelity, faithfully restoring the original image's statistical characteristics.

2605.07653 2026-05-14 cs.CV eess.IV

Aquatic Neuromorphic Optical Flow

Pei Zhang, Yunkai Liang, Kaiqiang Wang

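Affiliations: School of Electrical Engineering, Guangxi University; Baise Artificial Intelligence Innovation and Development Center; School of Physical Science and Technology, Northwestern Polytechnical University

AI summary: This paper studies optical flow estimation with neuromorphic vision in underwater environments. It proposes a self-supervised framework based on spiking neural networks that efficiently estimates per-pixel optical flow from asynchronous event streams and sidesteps the bottleneck of scarce underwater data. The method maintains visual and quantitative performance while substantially improving computational efficiency, offering a lightweight, real-time, low-cost perception solution for resource-constrained underwater edge platforms.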

Comments This work is under review. Project page: https://github.com/pz-even/event_underwater_optical_flow

Abstract

Underwater environments impose severe constraints on conventional imaging systems and demand solutions that balance high-quality sensing with strict resource efficiency. While emerging event cameras offer a promising alternative, their potential in aquatic scenarios remains largely unexplored. Through the lens of neuromorphic vision, this work pioneers the investigation of motion fields that serve as key media for agile underwater perception. Built upon spiking neural networks, we introduce a self-supervised framework to estimate per-pixel optical flow from asynchronous event streams, elegantly bypassing the long-standing bottleneck of underwater data scarcity. Extensive evaluations demonstrate that our method achieves competitive visual and quantitative results against leading techniques while operating with superior computational efficiency. By bridging neuromorphic sensing and aquatic intelligence, this work opens new frontiers for lightweight, real-time, and low-cost perception on resource-constrained underwater edge platforms.

2605.06651 2026-05-14 cs.AI

AI co-mathematician: Accelerating mathematicians with agentic AI

Daniel Zheng, Ingrid von Glehn, Yori Zwols, Iuliya Beloshapka, Lars Buesing, Daniel M. Roy, Martin Wattenberg, Bogdan Georgiev, Tatiana Schmidt, Andrew Cowie, Fernanda Viegas, Dimitri Kanevsky, Vineet Kahlon, Hartmut Maennel, Sophia Alj, George Holland, Alex Davies, Pushmeet Kohli

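Affiliations: Google

AI summary: This paper introduces the "AI co-mathematician", an intelligent workbench that assists mathematicians in open-ended research. Through asynchronous, stateful interaction, the system supports stages of mathematical research such as ideation, literature search, computational exploration, and theorem proving, and it manages uncertainty, tracks failed hypotheses, and produces native mathematical artifacts. Experiments show that it improves research efficiency and achieves strong results on several hard problem-solving benchmarks.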

Comments 23 pages; several citations added

Abstract

We introduce the AI co-mathematician, a workbench for mathematicians to interactively leverage AI agents to pursue open-ended research. The AI co-mathematician is optimized to provide holistic support for the exploratory and iterative reality of mathematical workflows, including ideation, literature search, computational exploration, theorem proving and theory building. By providing an asynchronous, stateful workspace that manages uncertainty, refines user intent, tracks failed hypotheses, and outputs native mathematical artifacts, the system mirrors human collaborative workflows. In early tests, the AI co-mathematician helped researchers solve open problems, identify new research directions, and uncover overlooked literature references. Besides demonstrating a highly interactive paradigm for AI-assisted mathematical discovery, the AI co-mathematician also achieves state-of-the-art results on hard problem-solving benchmarks, including scoring 48% on FrontierMath Tier 4, a new high score among all AI systems evaluated.

2605.04557 2026-05-14 cs.CV cs.AI

Efficient Geometry-Controlled High-Resolution Satellite Image Synthesis

Vlad Vasilescu, Daniela Faur, Teodor Costachioiu

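Affiliations: Univ. POLITEHNICA Bucharest, SIGMA Lab, CAMPUS Institute; Univ. POLITEHNICA Bucharest, GEOSENSE, CAMPUS Institute

AI summary: This paper studies how to efficiently generate geometry-controlled high-resolution satellite images, addressing the scarcity and cost of such imagery, which hampers the development and testing of models for land-cover classification, change detection, and disaster monitoring. The authors propose a method built on existing pre-trained diffusion models that adds windowed cross-attention modules and controls generation using only skip-connection features, which keeps it simple and efficient. Experiments show performance comparable to existing control techniques with better alignment to the geometry control map. The paper also points out limitations of current evaluation methods and stresses the importance of consistent alignment assessment.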

Comments 2026 IEEE International Geoscience and Remote Sensing Symposium (IGARSS)

Abstract

High-resolution satellite images are often scarce and costly, especially for remote areas or infrequent events. This shortage hampers the development and testing of machine learning models for land-cover classification, change detection, and disaster monitoring. In this paper, we tackle the problem of geometry-controlled high-resolution satellite image synthesis by adding control over existing pre-trained diffusion models. We propose a simple yet efficient method for controlling the synthesis process by leveraging only skip connection features using windowed cross-attention modules. Several previously established control techniques are compared, indicating that our method achieves comparable performance while leading to a better alignment with the geometry control map. We also discuss the limitations of current evaluation approaches, underscoring the need for a consistent alignment assessment.

2605.02752 2026-05-14 cs.CV

Does it Really Count? Assessing Semantic Grounding in Text-Guided Class-Agnostic Counting

Giacomo Pacini, Luca Ciampi, Nicola Messina, Nicola Tonellotto, Giuseppe Amato, Fabrizio Falchi

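Affiliations: Institute of Information Science and Technologies of the National Research Council (ISTI-CNR); University of Pisa, Department of Information Engineering

AI summary: This paper studies semantic grounding in open-world text-guided class-agnostic counting (CAC) and finds that current models understand the relationship between the text prompt and the visual scene poorly, which makes counts unreliable. The authors propose a new evaluation framework, PrACo++, with new protocols including a negative-label test and a distractor test, and build MUCCA, a dataset with multi-category annotations. Experiments show that although existing models perform well on standard metrics, they have clear weaknesses in semantic understanding and grounding, underscoring the need for more semantically aware models.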

Comments Code available at https://github.com/ciampluca/PrACo

Abstract

Open-world text-guided class-agnostic counting (CAC) has emerged as a flexible paradigm for counting arbitrary object classes by using natural language prompts. However, current evaluation protocols primarily focus on standard counting errors within single-category images, overlooking a fundamental requirement: the ability to correctly ground the textual prompt in the visual scene. In this paper, we show that several state-of-the-art CAC models often struggle to determine which object class should be counted based on the given prompt, revealing a misalignment between textual semantics and visual object representations. This limitation leads to spurious counting responses and reduced reliability in real-world scenarios. To systematically address these limitations, we propose a new evaluation framework focused on model robustness and trustworthiness. Our contribution is two-fold: (i) we introduce PrACo++ (Prompt-Aware Counting++), a novel test suite featuring two dedicated evaluation protocols, the negative-label test and the distractor test, paired with new specialized metrics; and (ii) we present the MUCCA (MUlti-Category Class-Agnostic counting) evaluation dataset, a new collection of real-world images featuring multiple annotated object categories per scene, unlike existing CAC benchmarks that typically include a single category per image. Our extensive experimental evaluation of 10 state-of-the-art methods shows that, despite strong performance under standard counting metrics, current models exhibit significant weaknesses in understanding and grounding object class descriptions. Finally, we provide a quantitative analysis of how semantic similarity between prompts influences these failures. Overall, our results underscore the need for more semantically grounded architectures and offer a reliable framework for future assessment in open-world text-guided CAC methods.

2605.02521 2026-05-14 cs.CV

MooD: Perception-Enhanced Efficient Affective Image Editing via Continuous Valence-Arousal Modeling

Xinyi Yin, Yiduo Wang, Tingqi Hu, Meicong Si, Yunyun Shi, Shi Chen, Hao Wang, Junxiao Xue, Xuecheng Wu

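Affiliations: School of Cyber Science and Engineering, Zhengzhou University; School of Computer Science and Technology, Xi'an Jiaotong University; School of Journalism and New Media, Xi'an Jiaotong University; Research Center for Space Computing System, Zhejiang Lab

AI summary: This paper proposes MooD, a perception-enhanced, efficient affective image editing framework based on a continuous Valence-Arousal (VA) model, targeting the shortcomings of existing affective image editing methods in inference efficiency and continuous emotion modeling. MooD introduces a VA-aware retrieval strategy and combines visual transfer with perception-enhanced semantic guidance to achieve fine-grained, efficient, and controllable affective editing. To address the limited coverage of natural scenes in existing datasets, the authors also build AffectSet, a dataset covering diverse scenarios, which further improves model performance and generalization.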

Abstract

Affective Image Editing (AIE) aims to modify visual content to evoke targeted emotions. Although current approaches achieve impressive editing quality, they often overlook inference efficiency, which limits their applicability in computational social scenarios. Moreover, most methods depend on discrete emotion representations, which hinder the continuous modeling of complex human emotions and constrain expressive capabilities in interactive scenarios. To address these gaps, we propose MooD, the first framework that directly leverages continuous Valence-Arousal (VA) values as editing instructions for fine-grained and efficient AIE in computational social systems. Specifically, we first introduce a VA-aware retrieval strategy to bridge vague affective values and detailed visual semantics. Building upon this, MooD integrates visual transfer and perception-enhanced semantic guidance to achieve controllable AIE. Furthermore, since existing VA-annotated datasets mainly focus on social scenarios and largely overlook natural scenes, we construct AffectSet, a comprehensive VA-annotated dataset covering diverse scenarios, to support model optimization and evaluation. Extensive qualitative and quantitative experimental results demonstrate that MooD achieves superior performance in both affective controllability and visual fidelity while maintaining high efficiency. A series of ablation studies further reveal the crucial factors of our design.

2605.02350 2026-05-14 cs.LG

A Near-optimal SQ Lower Bound for Smoothed Agnostic Learning of Boolean Halfspaces

Tim Sinen

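Affiliations: University of Bonn; Lamarr Institute for Machine Learning and Artificial Intelligence

AI summary: This paper studies the complexity of smoothed agnostic learning of Boolean halfspaces under uniform marginals. In the model where each input coordinate is independently flipped with probability $\sigma$, the author proves that $L^1$ polynomial regression has runtime and sample complexity $\tilde{O}(n^{O(\log(1/\varepsilon)/\sigma)})$ and gives a nearly matching Statistical Query complexity lower bound of $n^{\Omega(\log(1+\sigma/\varepsilon^2)/\sigma)}$. The result complements recent work on the continuous setting under Gaussian marginals.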

Comments Fixed several typos and minor proof issues

Abstract

We study the complexity of smoothed agnostic learning of halfspaces on $\{\pm 1\}^n$ under uniform marginals in the model of [KM25], where each input coordinate is independently flipped with probability $\sigma \in (0, 1/2)$. We show that $L^1$ polynomial regression achieves runtime and sample complexity $\tilde{O}(n^{O(\log(1/\varepsilon)/\sigma)})$, and prove a nearly matching Statistical Query complexity lower bound of $n^{\Omega(\log(1+\sigma/\varepsilon^2)/\sigma)}$. This complements the recent work of [DK26], which established analogous bounds in the continuous setting under Gaussian marginals.
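
A short worked comparison of the two exponents may help show why the bounds are described as nearly matching. This is a reading of the stated bounds only, not an argument taken from the paper, and the restriction $\varepsilon \le \sigma$ is an assumption made here to keep the inequalities clean. The logarithm in the lower bound can be rewritten as

$$
\log\!\left(1 + \frac{\sigma}{\varepsilon^2}\right) = 2\log\frac{1}{\varepsilon} + \log\left(\sigma + \varepsilon^2\right).
$$

Since $\sigma < 1/2$ and $\varepsilon \le \sigma$, we have $\sigma \le \sigma + \varepsilon^2 < 1$, so $-\log(1/\varepsilon) \le -\log(1/\sigma) \le \log(\sigma + \varepsilon^2) < 0$, and hence

$$
\log\frac{1}{\varepsilon} \;\le\; \log\!\left(1 + \frac{\sigma}{\varepsilon^2}\right) \;<\; 2\log\frac{1}{\varepsilon}.
$$

In this regime both exponents are of order $\log(1/\varepsilon)/\sigma$, so the upper and lower bounds agree up to the constants hidden in $O(\cdot)$ and $\Omega(\cdot)$.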

2604.26070 2026-05-14 cs.LG math.OC math.ST q-bio.QM stat.TH

Observable Neural ODEs for Identifiable Causal Forecasting in Continuous Time

Jennifer Wendland, Nicolas Freitag, Maik Kschischo

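Affiliations: Department of Computer Science

AI summary: This paper studies identifiability in continuous-time causal inference and, for dynamic decision settings with hidden confounders, proposes the Observable Neural ODE (ObsNODE) model. By linking observability from control theory to causal identifiability, the paper derives a continuous-time adjustment formula and designs Neural ODE models that can reconstruct latent states from observed data, enabling outcome prediction under different intervention paths. Experiments show strong performance on synthetic cancer data, semi-synthetic data based on MIMIC-IV, and real-world sepsis data.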

Comments 20 pages, 5 figures

Abstract

Causal inference in continuous-time sequential decision problems is challenged by hidden confounders. We show that, in latent state-space models with time-varying interventions, observability of the latent dynamics from observed data is necessary for identifying dynamic treatment effects, linking control-theoretic observability to causal identifiability, even when hidden confounders affect both treatments and outcomes. We derive a continuous-time adjustment formula expressing potential outcome distributions under treatment trajectories via the measurement model, latent dynamics, and the filtering distribution over latent states given observed histories. We propose Observable Neural ODEs (ObsNODEs), Neural ODE models in observable normal form for causal forecasting. ObsNODEs learn continuous-time dynamics with states reconstructible from observations, enabling outcome prediction under alternative treatment paths. Experiments on synthetic cancer data, semi-synthetic data based on MIMIC-IV, and real-world sepsis data show strong performance over recent sequence models.

2604.25774 2026-05-14 cs.CL cs.AI

CGU-ILALab at FoodBench-QA 2026: Comparing Traditional and LLM-based Approaches for Recipe Nutrient Estimation

Wei-Chun Chen, Yu-Xuan Chen, I-Fang Chung, Ying-Jia Lin

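Affiliations * CGU-ILALab

AI summary This paper studies the challenging problem of accurately estimating nutrient content from unstructured recipe text, comparing several techniques based on traditional methods and large language models (LLMs). It finds that traditional methods such as TF-IDF are fast at inference but limited in accuracy, whereas few-shot LLM inference and hybrid methods give the best nutrient estimation accuracy, mainly because they can handle ambiguous terminology and non-standard units. These methods also incur higher computational latency, highlighting a practical deployment trade-off between real-time efficiency and precision.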

Comments Accepted by the Third Workshop on Patient-oriented Language Processing (CL4Health) at LREC 2026

Journal ref http://lrec-conf.org/proceedings/lrec2026/workshops/cl4health/2026.cl4health-1.0.pdf

English abstract

Accurate nutrient estimation from unstructured recipe text is an important yet challenging problem in dietary monitoring, due to ambiguous ingredient terminology and highly variable quantity expressions. We systematically evaluate models spanning a wide range of representational capacity, from lexical matching methods (TF-IDF with Ridge Regression), to deep semantic encoders (DeBERTa-v3), to generative reasoning with large language models (LLMs). Under the strict tolerance criteria defined by EU Regulation 1169/2011, our empirical results reveal a clear trade-off between predictive accuracy and computational efficiency. The TF-IDF baseline achieves moderate nutrient estimation performance with near-instantaneous inference, whereas the DeBERTa-v3 encoder performs poorly under task-specific data scarcity. In contrast, few-shot LLM inference (e.g., Gemini 2.5 Flash) and a hybrid LLM refinement pipeline (TF-IDF combined with Gemini 2.5 Flash) deliver the highest validation accuracy across all nutrient categories. These improvements likely arise from the ability of LLMs to leverage pre-trained world knowledge to resolve ambiguous terminology and normalize non-standard units, which remain difficult for purely lexical approaches. However, these gains come at the cost of substantially higher inference latency, highlighting a practical deployment trade-off between real-time efficiency and nutritional precision in dietary monitoring systems.

2604.23018 2026-05-14 cs.CV cs.AI cs.LG

AmaraSpatial-10K: A Spatially and Semantically Aligned 3D Dataset for Spatial Computing and Embodied AI

Mohammad Sadegh Salehi, Alex Perkins, Igor Maurell, Ashkan Dabbagh, Raymond Wong

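Affiliations * Zero One Creative

AI summary This study presents AmaraSpatial-10K, a 3D dataset intended to address the difficulty of deploying existing large-scale 3D assets in spatial computing and embodied AI applications. The dataset contains over 10,000 optimised synthetic 3D assets, each with accurate metric scale, deterministic anchoring, separated physically based material maps, and multi-sentence text metadata, so that assets can be used directly. The study also introduces a reusable evaluation suite, under which the dataset shows markedly better results in image retrieval, physics simulation, and cross-modal alignment.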

English abstract

Web-scale 3D asset collections are abundant but rarely deployment-ready: they suffer from arbitrary metric scaling, incorrect pivots, brittle geometry, and incomplete textures, defects that limit their use in embodied AI, robotics, and spatial computing. We present AmaraSpatial-10K, a dataset of over 10,000 synthetic 3D assets optimised for zero-shot deployment. Each asset ships as a metric-scaled, deterministically anchored .glb with separated PBR maps, a convex collision hull, a paired reference image, and multi-sentence text metadata. Alongside the dataset we introduce a reusable evaluation suite for 3D asset banks, comprising a continuous Scale Plausibility Score (SPS), an LLM Concept Density metric, anchor-error auditing, and a cross-modal CLIP coherence protocol, and apply it to AmaraSpatial-10K alongside matched subsets of Objaverse, HSSD, ABO, and GSO. AmaraSpatial-10K improves CLIP Recall@5 by $3.4\times$ over Objaverse ($0.612$ vs. $0.181$, median rank $267 \rightarrow 3$), achieves a $99.1\%$ physics-stability rate under Habitat-Sim with $\sim 20\times$ wall-time speed-up, and produces zero-overlap scenes when used as a drop-in asset bank for Holodeck. Controlled ablations on the same asset bank attribute the retrieval gain to description richness.

2604.22686 2026-05-14 cs.CV

SS3D: End2End Self-Supervised 3D from Web Videos

Marwane Hariat, Gianni Franchi, David Filliat, Antoine Manzanera

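Affiliations * U2IS, ENSTA – Institut Polytechnique de Paris; Pôle Recherche, Agence Ministérielle pour l’IA de Défense

AI summary This paper presents SS3D, a large-scale SfM-based self-supervised pretraining method for end-to-end 3D estimation from monocular video. The method jointly predicts depth, camera motion, and intrinsics in a single forward pass, and is trained and evaluated under a unified single-checkpoint evaluation protocol. To address weak multi-view observability and strong data heterogeneity in web video, the authors introduce a multi-view signal proxy (MVS) for filtering and curriculum sampling, and distill expert training into a single student model, markedly improving performance.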

English abstract

We present SS3D, a web-scale SfM-based self-supervision pretraining pipeline for feed-forward 3D estimation from monocular video. Our model jointly predicts depth, ego-motion, and intrinsics in a single forward pass and is trained/evaluated as a coherent end-to-end 3D estimator. To stabilize joint learning, we use an intrinsics-first two-stage schedule and a unified single-checkpoint evaluation protocol. Scaling SfM self-supervision to unconstrained web video is challenging due to weak multi-view observability and strong corpus heterogeneity; we address these with a multi-view signal proxy (MVS) used for filtering and curriculum sampling, and with expert training distilled into a single student. Pretraining on YouTube-8M (~100M frames after filtering) yields strong cross-domain zero-shot transfer and improved fine-tuning performance over prior self-supervised baselines. We release the pretrained checkpoint and code.

2604.21496 2026-05-14 cs.AI cs.CL cs.CY

How English Print Media Frames Human-Elephant Conflicts in India

Bonala Sai Punith, Salveru Jayati, Garima Shakya, Shubham Kumar Nigam

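Affiliations * Chhattisgarh Forest Department

AI summary This paper studies how English-language print media in India report on human-elephant conflict (HEC). Analysing 28,986 sentences from 1,968 news articles published between January 2022 and September 2025, it shows that coverage commonly uses fear-inducing and aggression-related language, which may intensify public hostility toward elephants and undermine human-wildlife coexistence efforts. The study uses a multi-model sentiment analysis framework that combines long-context transformers, large language models, and a domain-specific lexicon to quantify sentiment, extract key sentences, and identify linguistic patterns, offering a scalable methodology in support of responsible wildlife reporting.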

English abstract

Human-elephant conflict (HEC) is rising across India as habitat loss and expanding human settlements force elephants into closer contact with people. While the ecological drivers of conflict are well-studied, how the news media portrays them remains largely unexplored. This work presents the first large-scale computational analysis of media framing of HEC in India, examining 1,968 full-length news articles consisting of 28,986 sentences, from a major English-language outlet published between January 2022 and September 2025. Using a multi-model sentiment framework that combines long-context transformers, large language models, and a domain-specific Negative Elephant Portrayal Lexicon, we quantify sentiment, extract rationale sentences, and identify linguistic patterns that contribute to negative portrayals of elephants. Our findings reveal a dominance of fear-inducing and aggression-related language. Since the media framing can shape public attitudes toward wildlife and conservation policy, such narratives risk reinforcing public hostility and undermining coexistence efforts. By providing a transparent, scalable methodology and releasing all resources through an anonymized repository, this study highlights how Web-scale text analysis can support responsible wildlife reporting and promote socially beneficial media practices.

2604.21360 2026-05-14 cs.CV

Prototype-Based Test-Time Adaptation of Vision-Language Models

Zhaohong Huang, Yuxin Zhang, Wenjing Liu, Fei Chao, Rongrong Ji

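Affiliations * Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China

AI summary This paper proposes Prototype-Based Test-Time Adaptation (PTA) to improve vision-language models at test time. The method builds class-specific knowledge prototypes that accumulate information from test samples, adaptively weighting the prototypes by each sample's zero-shot classification confidence to improve adaptation to new data. Unlike cache-based adaptation methods, PTA does not need to maintain or query a cache, which markedly improves inference efficiency, and it outperforms existing methods on multiple image recognition and point cloud analysis benchmarks.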

English abstract

Test-time adaptation (TTA) has emerged as a promising paradigm for vision-language models (VLMs) to bridge the distribution gap between pre-training and test data. Recent works have focused on backpropagation-free TTA methods that rely on cache-based designs, but these introduce two key limitations. First, inference latency increases as the cache grows with the number of classes, leading to inefficiencies in large-scale settings. Second, suboptimal performance occurs when the cache contains insufficient or incorrect samples. In this paper, we present Prototype-Based Test-Time Adaptation (PTA), an efficient and effective TTA paradigm that uses a set of class-specific knowledge prototypes to accumulate knowledge from test samples. Particularly, knowledge prototypes are adaptively weighted based on the zero-shot class confidence of each test sample, incorporating the sample's visual features into the corresponding class-specific prototype. It is worth highlighting that the knowledge from past test samples is integrated and utilized solely in the prototypes, eliminating the overhead of cache population and retrieval that hinders the efficiency of existing TTA methods. This endows PTA with extremely high efficiency while achieving state-of-the-art performance on 15 image recognition benchmarks and 4 robust point cloud analysis benchmarks. For example, PTA improves CLIP's accuracy from 65.64% to 69.38% on 10 cross-domain benchmarks, while retaining 92% of CLIP's inference speed on large-scale ImageNet-1K. In contrast, the cache-based TDA achieves a lower accuracy of 67.97% and operates at only 50% of CLIP's inference speed.
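
A minimal sketch of the kind of cache-free prototype update the abstract describes, assuming a confidence-weighted running mean. The abstract does not state the update rule, so the `PrototypeBank` class, the softmax over zero-shot logits, and the choice to update only the top-scoring class are illustrative assumptions, not the authors' implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class PrototypeBank:
    """Hypothetical store holding one running-mean feature vector per class."""

    def __init__(self, num_classes, dim):
        self.prototypes = [[0.0] * dim for _ in range(num_classes)]
        self.weight_sums = [0.0] * num_classes

    def update(self, feature, zero_shot_logits):
        """Fold one test sample into the prototype of its predicted class.

        Assumption: the sample is weighted by its zero-shot confidence,
        i.e. the softmax probability of the top-scoring class.
        """
        probs = softmax(zero_shot_logits)
        cls = max(range(len(probs)), key=probs.__getitem__)
        conf = probs[cls]
        old_weight = self.weight_sums[cls]
        new_weight = old_weight + conf
        self.prototypes[cls] = [
            (old_weight * p + conf * f) / new_weight
            for p, f in zip(self.prototypes[cls], feature)
        ]
        self.weight_sums[cls] = new_weight
        return cls, conf


bank = PrototypeBank(num_classes=2, dim=2)
bank.update([1.0, 0.0], zero_shot_logits=[5.0, 0.0])
bank.update([0.0, 1.0], zero_shot_logits=[5.0, 0.0])
print(bank.prototypes[0])  # confidence-weighted mean of the two features
```

Memory stays at one vector per class however many test samples arrive, which is the property that separates a prototype scheme from a growing cache.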

2604.13986 2026-05-14 cs.LG

PRiMeFlow: Capturing Complex Expression Heterogeneity in Perturbation Response Modelling

Zichao Yan, Yan Wu, Mica Xu Ji, Chaitra Agrahar, Esther Wershof, Marcel Nassar, Mehrshad Sadria, Ridvan Eksi, Vladimir Trifonov, Ignacio Ibarra, Telmo Felgueira, Błażej Osiński, Rory Stark

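Affiliations * Altos Labs

AI summary PRiMeFlow is an end-to-end flow-matching approach that directly models the effects of genetic and small-molecule perturbations in gene expression space, addressing the modelling challenges posed by single-cell expression heterogeneity and latent gene dependencies. Through distribution fitting, it accurately approximates the empirical distribution of single-cell gene expression, and its effectiveness is validated by extensive benchmarking in PerturBench. Ablation studies validate key design choices, and after scaling across multiple datasets the method shows outstanding performance on a human embryonic stem cell perturbation prediction task.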

English abstract

Predicting the effects of perturbations in-silico on cell state can identify drivers of cell behavior at scale and accelerate drug discovery. However, modeling challenges remain due to the inherent heterogeneity of single cell gene expression and the complex, latent gene dependencies. Here, we present PRiMeFlow, an end-to-end flow matching based approach to directly model the effects of genetic and small molecule perturbations in the gene expression space. The distribution-fitting approach taken by PRiMeFlow enables it to accurately approximate the empirical distribution of single-cell gene expression, which we demonstrate through extensive benchmarking inside PerturBench. Through ablation studies, we also validate important model design choices such as operating in gene expression space and parameterizing the velocity field with a U-Net architecture. Finally, by scaling PRiMeFlow to a broad perturbation data atlas spanning multiple datasets and employing a carefully designed pretraining-finetuning strategy, we demonstrate its outstanding performance on the H1 human embryonic stem cells from the ARC Virtual Cell Challenge benchmark.

2604.11581 2026-05-14 cs.CL

Hidden Measurement Error in LLM Pipelines Distorts Annotation, Evaluation, and Benchmarking

Solomon Messing

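Affiliations * Center for Social Media, AI, and Politics

AI summary Large language model (LLM) evaluation results strongly influence model deployment, safety standards, research conclusions, and projections of AI's labor-market impact. However, existing evaluation methods usually ignore the uncertainty introduced by judge model choice, model temperature, and prompt phrasing, so confidence intervals under-cover, and the problem worsens as data grows. This paper analyses the sources of uncertainty in LLM evaluation pipelines, distinguishes variance that shrinks with more data from sensitivity to researcher design choices, and uses design-study projections to reduce total evaluation error, markedly improving the accuracy and reliability of evaluation results.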

English abstract

LLM evaluations drive which models get deployed, what safety standards get adopted, which research conclusions get published, and how projections of AI's labor-market impact get made. Yet standard confidence intervals ignore variability from judge model choice, model temperature, and prompt phrasing, producing under-coverage that worsens with more data. The omitted variance can shift results enough to reverse conclusions \citep{baumann2025llmhacking, huang2026dropping}; pipelines that fail to average over it leave the surface that "benchmark hacking" exploits \citep{singh2025leaderboard}. This paper decomposes LLM pipeline uncertainty into its sources, distinguishes variance that shrinks with more data from sensitivity to researcher design choices, and uses design-study projections to reduce total evaluation error (TEE). Across the demonstrations, naive standard errors are 40-60% smaller than the TEE-corrected SE. Using Chatbot Arena data, we show naive 95% CI coverage drops as $n$ grows while TEE-corrected coverage holds at 95%, and TEE-guided pipelines restrict the benchmark gaming surface from 56 to 32 Elo ($K=27$), below the human-leaderboard baseline. We show further that a small pilot recovers honest CIs and projects which design changes most improve precision. Acting on those projections halves MMLU estimation error against the answer key at equivalent cost, and raises per-match agreement with human votes by 7.9 percentage points on Chatbot Arena.

2604.10755 2026-05-14 cs.CV

MMRareBench: A Rare-Disease Multimodal and Multi-Image Medical Benchmark

Junzhi Ning, Jiashi Lin, Yingying Fang, Wei Li, Jiyao Liu, Cheng Tang, Chenglong Ma, Wenhao Tang, Tianbin Li, Ziyan Huang, Guang Yang, Junjun He

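Affiliations * Shanghai AI Laboratory; Imperial College London; Shanghai Jiao Tong University; Fudan University

AI summary This paper presents MMRareBench, to the authors' knowledge the first multimodal, multi-image medical benchmark for rare diseases, designed to assess models across four clinical workflows: diagnosis, treatment planning, cross-image evidence alignment, and examination suggestion. The benchmark contains 1,756 question-answer pairs and 7,958 medical images, with Orphanet-based ontology alignment and a rigorous evaluation protocol. It systematically shows that existing multimodal large language models handle multi-image information poorly in rare-disease settings and perform especially badly on treatment planning. The results indicate that although medical-domain models do relatively well on diagnosis, they still lag markedly behind general-purpose models on multi-image tasks.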

English abstract

Multimodal large language models (MLLMs) have advanced clinical tasks for common conditions, but their performance on rare diseases remains largely untested. In rare-disease scenarios, clinicians often lack prior clinical knowledge, forcing them to rely strictly on case-level evidence for clinical judgments. Existing benchmarks predominantly evaluate common-condition, single-image settings, leaving multimodal and multi-image evidence integration under rare-disease data scarcity systematically unevaluated. We introduce MMRareBench, to our knowledge the first rare-disease benchmark jointly evaluating multimodal and multi-image clinical capability across four workflow-aligned tracks: diagnosis, treatment planning, cross-image evidence alignment, and examination suggestion. The benchmark comprises 1,756 question-answer pairs with 7,958 associated medical images curated from PMC case reports, with Orphanet-anchored ontology alignment, track-specific leakage control, evidence-grounded annotations, and a two-level evaluation protocol. A systematic evaluation of 23 MLLMs reveals fragmented capability profiles and universally low treatment-planning performance, with medical-domain models trailing general-purpose MLLMs substantially on multi-image tracks despite competitive diagnostic scores. These patterns are consistent with a capacity dilution effect: medical fine-tuning can narrow the diagnostic gap but may erode the compositional multi-image capability that rare-disease evidence integration demands.

2604.10547 2026-05-14 cs.AI

Agent^2 RL-Bench: Can LLM Agents Engineer Agentic RL Post-Training?

Wanyi Chen, Xiao Yang, Xu Yang, Tianming Sha, Qizheng Li, Zhuo Wang, Bowen Xian, Fang Kong, Weiqing Liu, Jiang Bian

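Affiliations * Soochow University; Microsoft Research Asia; Peking University; Stony Brook University; The University of Chicago

AI summary This paper presents Agent^2 RL-Bench, a compact diagnostic benchmark for evaluating whether large language model (LLM) agents can autonomously design and optimise reinforcement learning (RL) post-training. Agents must train, debug, and evaluate models on their own under a limited budget, across tasks ranging from static rule-based training to closed-loop online RL. Experiments show that although some agents can effectively improve model performance, stable and autonomous RL post-training under a fixed budget remains challenging overall, and the benchmark offers an evaluation framework for future research.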

Comments 37 pages, 7 figures, 20 tables

English abstract

We introduce Agent^2 RL-Bench, a compact diagnostic benchmark for evaluating agentic RL post-training, which tests whether LLM agents can autonomously design, implement, debug, and execute post-training pipelines that improve foundation models. RL post-training increasingly drives model alignment and specialization, yet existing benchmarks are largely static, rewarding supervised fine-tuning or script generation without assessing an agent's ability to close an interactive RL loop. Agent^2 RL-Bench provides a unified agent-facing interface: each run starts from an isolated workspace containing a base model, task data, instructions, and a grading API, and agents must iterate within a fixed budget by training models and submitting artifacts for evaluation. The benchmark spans six tasks across three levels, from static rule-based training to judge-based optimization and closed-loop online RL with trajectory collection. Two diagnostic skills, namely runtime recording and post-hoc summarization, enable structured analysis of agent behavior, facilitating smooth and effective iteration of the benchmark's evaluation framework. Across five agent systems and six driver LLMs, agents show intelligent behavior but clear limitations: one RL-oriented run improves ALFWorld from 4.85 to 93.28 via SFT warm-up and GRPO with online rollouts, yet DeepSearchQA remains difficult, most successful routes rely on supervised pipelines, and interactive outcomes show large single-run differences across agent stacks. Overall, Agent^2 RL-Bench shows that current agents can sometimes engineer online RL, but stable agent-driven RL post-training remains rare under fixed budgets. It also demonstrates that our benchmark provides a strong and effective evaluation framework for future research in this direction. Code is available at https://github.com/microsoft/RD-Agent/blob/main/rdagent/scenarios/rl/autorl_bench/README.md

2604.09543 2026-05-14 cs.LG

ANTIC: Adaptive Neural Temporal In-situ Compressor

Sandeep S. Cranganore, Andrei Bodnar, Gianluca Galletti, Fabian Paischer, Johannes Brandstetter

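Affiliations * Institute for Machine Learning, Ellis Unit, JKU Linz, Austria; University of Manchester, United Kingdom

AI summary This paper proposes ANTIC, an adaptive neural temporal in situ compression method, to address the massive data volumes produced when storing high-resolution spatiotemporal fields governed by high-dimensional partial differential equations. The method combines an adaptive temporal selector with a neural spatial compression module based on continual fine-tuning. It selects key snapshots while the simulation runs and learns residual updates between adjacent snapshots, achieving joint spatiotemporal compression in a single streaming pass and greatly reducing storage requirements without significantly affecting the accuracy of the physical simulation. Experiments show storage reductions of several orders of magnitude.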

Comments 31 pages, 19 figures, 9 Tables; Accepted at ICML 2026; First authors contributed equally

Journal ref The Forty-Third International Conference on Machine Learning 2026

English abstract

The persistent storage requirements for high-resolution, spatiotemporally evolving fields governed by large-scale and high-dimensional partial differential equations (PDEs) have reached the petabyte-to-exabyte scale. Transient simulations modeling Navier-Stokes equations, magnetohydrodynamics, plasma physics, or binary black hole mergers generate data volumes that are prohibitive for modern high-performance computing (HPC) infrastructures. To address this bottleneck, we introduce ANTIC (Adaptive Neural Temporal in situ Compressor), an end-to-end in situ compression pipeline. ANTIC consists of an adaptive temporal selector tailored to high-dimensional physics that identifies and filters informative snapshots at simulation time, combined with a spatial neural compression module based on continual fine-tuning that learns residual updates between adjacent snapshots using neural fields. By operating in a single streaming pass, ANTIC enables a combined compression of temporal and spatial components and effectively alleviates the need for explicit on-disk storage of entire time-evolved trajectories. Experimental results demonstrate how storage reductions of several orders of magnitude relate to physics accuracy.

2604.07969 2026-05-14 cs.CL

Kathleen: Oscillator-Based Byte-Level Text Classification Without Tokenization or Attention

George Fountzoulas

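Affiliations * Department of Computer Engineering & Informatics

AI summary This paper presents Kathleen, a text classification architecture that operates directly on raw UTF-8 bytes, needs no tokenizer or attention mechanism, and has under 470K parameters. Its core components include oscillator-based sequence processing, an FFT-based encoder, a phase-harmonic non-linearity, and a content-dependent reverb mechanism. Experiments show that Kathleen matches or exceeds the pretrained baseline on several benchmark datasets with substantially fewer parameters.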

Comments 15 pages, 10 tables. v2: Added V9 architecture with Positional Decay Modulation. Pretraining eliminated. SST-2 improved from 83.3% to 85.8%

English abstract

We present Kathleen, a text classification architecture that operates directly on raw UTF-8 bytes using frequency-domain processing -- requiring no tokenizer, no attention mechanism, and under 470K parameters. Kathleen introduces several novel components: (1) RecurrentOscillatorBanks -- damped sinusoid convolutions with temporal memory for O(L) sequence processing; (2) an FFT-Rotate Wavetable Encoder that maps all 256 byte values using a single learnable vector (256 floats); (3) PhaseHarmonics -- a sinusoidal non-linearity with just 6 learnable phase parameters (+2.6% accuracy, <0.001% of model parameters); (4) Content-Dependent Reverb with Positional Decay Modulation -- a temporal memory mechanism whose decay rate is jointly conditioned on input content and a learned position-indexed bias vector; (5) Token-Level Module Sequencer with consonance and dissonance interference channels. Through iterative architecture evolution from an initial 733K-parameter baseline (Kathleen-Clean) to the current Kathleen-V9 (469K parameters), we demonstrate that pretraining can be entirely eliminated while improving accuracy. Kathleen-V9 achieves 88.5% +/- 0.2% on IMDB, 92.4% +/- 0.2% on AG News, and 85.8% +/- 0.5% on SST-2 (3-seed averages) -- matching or exceeding the pretrained baseline on all benchmarks with 36% fewer parameters. On SST-2, the improvement is +2.5% absolute over the pretrained predecessor. Kathleen processes sequences in O(L) time and memory.
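
The O(L) claim for damped-sinusoid convolutions can be illustrated with a standard signal-processing identity: convolving with the kernel r^k cos(ωk) equals the real part of a one-pole complex recurrence. The sketch below checks that identity on raw UTF-8 bytes. The function names and the values of `r` and `omega` are hypothetical, and this is not the paper's RecurrentOscillatorBanks module.

```python
import math


def damped_cosine_direct(x, r, omega):
    """O(L^2) reference: y[t] = sum_k r^k * cos(omega * k) * x[t - k]."""
    return [
        sum((r ** k) * math.cos(omega * k) * x[t - k] for k in range(t + 1))
        for t in range(len(x))
    ]


def damped_cosine_recurrent(x, r, omega):
    """O(L) version: carry one complex state, z[t] = pole * z[t-1] + x[t]."""
    pole = complex(r * math.cos(omega), r * math.sin(omega))
    state = 0j
    out = []
    for sample in x:
        state = pole * state + sample
        out.append(state.real)  # Re(pole^k) = r^k * cos(omega * k)
    return out


# Raw UTF-8 bytes scaled to [0, 1]; no tokenizer involved.
signal = [b / 255.0 for b in "good film".encode("utf-8")]
direct = damped_cosine_direct(signal, r=0.9, omega=0.5)
fast = damped_cosine_recurrent(signal, r=0.9, omega=0.5)
max_gap = max(abs(a - b) for a, b in zip(direct, fast))
print(max_gap)  # the two outputs agree up to floating-point rounding
```

The recurrent form keeps a single complex number of state per oscillator, so time and memory grow linearly with sequence length.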

2604.02753 2026-05-14 cs.CV

DeCo-DETR: Decoupled Cognition DETR for efficient Open-Vocabulary Object Detection

Siheng Wang, Yanshu Li, Bohan Hu, Zhengdao Li, Haibo Zhan, Linshan Li, Weiming Liu, Ruizhi Qian, Guangxin Wu, Hao Zhang, Jifeng Shen, Piotr Koniusz, Zhengtao Yao, Junhao Dong, Qiang Sun

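Affiliations * Jiangsu University; Brown University; Nanyang Technological University; MBZUAI; University of New South Wales; USC; University of Toronto; Data61 CSIRO

AI summary This paper proposes DeCo-DETR, a decoupled cognition DETR framework that targets the efficiency and performance problems of open-vocabulary object detection (OVOD) in practical use. By building a hierarchical semantic prototype space from pre-trained multimodal models, the method avoids relying on a text encoder at inference time, which improves detection efficiency. A training strategy that decouples semantic reasoning from localization balances detection accuracy against open-world generalization, and experiments show competitive zero-shot detection performance on several benchmarks.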

Comments Accepted at ICLR 2026

English abstract

Open-vocabulary object detection (OVOD) enables models to recognize objects beyond predefined categories, but existing approaches remain limited in practical deployment. On the one hand, multimodal designs often incur substantial computational overhead due to their reliance on text encoders at inference time. On the other hand, tightly coupled training objectives introduce a trade-off between closed-set detection accuracy and open-world generalization. Thus, we propose Decoupled Cognition DETR (DeCo-DETR), a vision-centric framework that addresses these challenges through a unified decoupling paradigm. Instead of depending on online text encoding, DeCo-DETR constructs a hierarchical semantic prototype space from region-level descriptions generated by pre-trained LVLMs and aligned via CLIP, enabling efficient and reusable semantic representation. Building upon this representation, the framework further disentangles semantic reasoning from localization through a decoupled training strategy, which separates alignment and detection into parallel optimization streams. Extensive experiments on standard OVOD benchmarks demonstrate that DeCo-DETR achieves competitive zero-shot detection performance while significantly improving inference efficiency. These results highlight the effectiveness of decoupling semantic cognition from detection, offering a practical direction for scalable OVOD systems.

2604.01938 2026-05-14 cs.CL cond-mat.stat-mech physics.soc-ph

How to measure the optimality of word or gesture order with respect to the principle of swap distance minimization

Ramon Ferrer-i-Cancho

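Affiliations * Computational Linguistics Research Group (LQMC), Departament de Ciències de la Computació, Universitat Politècnica de Catalunya, Campus Nord, Edifici Omega, Jordi Girona Salgado 1-3, 08034 Barcelona, Catalonia, Spain

AI summary This paper studies how to measure the optimality of word order in languages, or of gesture order across languages, from the perspective of swap distance minimization. The author proposes a mathematical framework for assessing how optimal word orders are in the permutohedron and finds that crosslinguistic gestures are at least 77% optimal, suggesting that this optimality is not due to chance. The paper also introduces the quadratic assignment problem (QAP) as a unifying framework for multiple optimization problems in linguistics and proposes a general principle of optimal assignment that integrates various linguistic principles, including swap distance minimization.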

Comments Many corrections in Appendix C, especially in the proofs

English abstract

The structure of all the permutations of a sequence can be represented as a permutohedron, a graph where vertices are permutations and two vertices are linked if a swap of adjacent elements in the permutation of one of the vertices produces the permutation of the other vertex. It has been hypothesized that word orders in languages minimize the swap distance in the permutohedron: given a source order, word orders that are closer in the permutohedron should be less costly and thus more likely. Here we explain how to measure the degree of optimality of word order variation with respect to swap distance minimization. We illustrate the power of our novel mathematical framework by showing that crosslinguistic gestures are at least $77\%$ optimal. It is unlikely that the multiple times where crosslinguistic gestures hit optimality are due to chance. We establish the theoretical foundations for research on the optimality of word or gesture order with respect to swap distance minimization in communication systems. Finally, we introduce the quadratic assignment problem (QAP) into language research as an umbrella for multiple optimization problems and, accordingly, postulate a general principle of optimal assignment that unifies various linguistic principles including swap distance minimization.
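
The swap distance defined above, the shortest path between two orders in the permutohedron, equals the number of pairs that appear in opposite relative order (the Kendall tau distance). A minimal sketch follows. The function name and the S/O/V example are illustrative assumptions, and the paper's optimality score itself is not reproduced here.

```python
from itertools import permutations


def swap_distance(source, target):
    """Minimum number of adjacent swaps that turn `source` into `target`.

    Both arguments must be orderings of the same distinct items.
    """
    if len(source) != len(target) or len(set(source)) != len(source):
        raise ValueError("orders must have equal length and distinct items")
    if set(source) != set(target):
        raise ValueError("orders must contain the same items")
    position = {item: i for i, item in enumerate(target)}
    ranks = [position[item] for item in source]
    # Count inversions: pairs whose relative order differs between the orders.
    return sum(
        1
        for i in range(len(ranks))
        for j in range(i + 1, len(ranks))
        if ranks[i] > ranks[j]
    )


# Distances from a hypothetical source order SOV to all six orders.
distances = {"".join(p): swap_distance("SOV", p) for p in permutations("SOV")}
print(distances)
```

For three elements the permutohedron is a hexagon, so the distances from any source order range from 0 to 3.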