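arXivDaily daily academic digest: synced with the full arXiv dataset, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and more.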

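2603.03855 2026-06-11 cs.SD Version update

A Sensitivity Analysis of Multi-Event Audio Grounding in Audio LLMs

Taehan Lee, Jaehan Jung, Hyukjun Lee

Affiliation: Sogang University

AI summary: A large-scale evaluation finds that, for audio LLMs in complex acoustic scenes, a higher event count lowers the true-positive rate and raises the false-positive rate, that prompts introduce a trade-off between the two, and that models are more uncertain on multi-event audio.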

Comments 6 pages, Accepted to Interspeech 2026

Abstract

Audio LLMs have shown a strong ability to understand audio samples, yet their reliability in complex acoustic scenes remains under-explored. Unlike prior work limited to small scale or less controlled query construction, we present a large-scale evaluation of event grounding and false alarms as auditory scene complexity increases. Using 71K AudioCapsV2 clips, we extract normalized (source, attribute) events and build two query types: present-event queries for ground-truth detection and absent-event queries to probe hallucinations, using similarity-filtered negative sampling in an audio-aligned text embedding space. We evaluate four SOTA Audio LLMs with 12 prompt variants over 500K yes/no queries per model. Across models, increasing event count consistently lowers true-positive rate and raises false-positive rate, while prompts induce a strong trade-off between the two. Our confidence analysis shows that models become more uncertain on multi-event audio, revealing room for improvement.
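The two rates the abstract tracks can be tallied per event count from yes/no answers. The following is a minimal sketch, not the authors' evaluation code: the record layout `(event_count, is_present, answered_yes)`, the function name, and the toy responses are all assumptions made for illustration.

```python
from collections import defaultdict


def rates_by_event_count(records):
    """records: iterable of (event_count, is_present, answered_yes) tuples.

    Returns {event_count: (tpr, fpr)}; a rate is None when its denominator is zero.
    """
    counts = defaultdict(lambda: {"tp": 0, "pos": 0, "fp": 0, "neg": 0})
    for event_count, is_present, answered_yes in records:
        bucket = counts[event_count]
        if is_present:
            # Present-event query: a "yes" is a true positive.
            bucket["pos"] += 1
            bucket["tp"] += int(answered_yes)
        else:
            # Absent-event query: a "yes" is a false positive.
            bucket["neg"] += 1
            bucket["fp"] += int(answered_yes)
    result = {}
    for event_count, b in sorted(counts.items()):
        tpr = b["tp"] / b["pos"] if b["pos"] else None
        fpr = b["fp"] / b["neg"] if b["neg"] else None
        result[event_count] = (tpr, fpr)
    return result


# Toy, made-up responses (not data from the paper).
toy = [
    (1, True, True), (1, True, True), (1, False, False), (1, False, False),
    (3, True, True), (3, True, False), (3, False, True), (3, False, False),
]
rates = rates_by_event_count(toy)
```

Grouping by event count before computing the rates is what lets a trend across scene complexity show up; pooling all clips would hide it.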

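2505.10018 2026-06-11 cs.RO Version update

CoVar: Confidence-Variance-Guided Pseudo-Label Selection for Semi-Supervised Learning

Jinshi Liu, Lei He, Pan Liu

Affiliations: College of Artificial Intelligence, Shenzhen University; School of Information and Electrical Engineering, Hunan University of Science and Technology; Information Hub, Hong Kong University of Science and Technology (Guangzhou)

AI summary: Proposes CoVar, a framework that assesses pseudo-label reliability by jointly modeling maximum confidence and residual-class variance, uses SVD-based spectral relaxation to separate reliable from unreliable predictions without hand-tuned thresholds, and reports gains on segmentation and classification tasks.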

Abstract

Pseudo-label selection in semi-supervised learning is commonly driven by maximum-confidence thresholds, yet confidence alone can be unreliable under model overconfidence and class imbalance. We propose CoVar, a confidence-variance framework that assesses pseudo-label reliability by jointly modeling Maximum Confidence (MC) and Residual-Class Variance (RCV). Starting from entropy minimization, we derive a second-order cross-entropy approximation showing that low-loss pseudo-labels are favored when MC is high and RCV is low, with a confidence-dependent penalty that becomes stronger for near-certain predictions. Based on this criterion, CoVar embeds predictions into a two-dimensional confidence-variance space and uses SVD-based spectral relaxation to separate reliable and unreliable predictions without hand-tuned confidence thresholds. Cluster-wise Gaussian weighting then converts this separation into per-sample training weights. The resulting weights can be integrated into existing semi-supervised segmentation and classification pipelines during training and introduce no inference-time overhead. Experiments on PASCAL VOC 2012, Cityscapes, CIFAR-10, CIFAR-100, SVHN, and STL-10 show clear gains on VOC and Cityscapes under matched backbones, as well as competitive or improved error rates on standard classification benchmarks. These results indicate that residual-class dispersion provides a useful signal complementary to confidence for robust pseudo-label selection.
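To make the two quantities concrete, here is a small sketch that computes a maximum confidence and a residual-class variance for one softmax vector. The abstract does not give the exact RCV formula, so the definition used here (population variance of the non-maximum class probabilities, without renormalization) is an assumption, as are the function name and the example vectors. The spectral separation and Gaussian weighting steps are not shown.

```python
from statistics import pvariance


def confidence_and_residual_variance(probs):
    """Return (max confidence, variance of the remaining class probabilities).

    Assumed definition for illustration; the paper's RCV may differ.
    """
    top = max(range(len(probs)), key=probs.__getitem__)
    residual = [p for i, p in enumerate(probs) if i != top]
    return probs[top], pvariance(residual)


# Two predictions with the same confidence but different residual dispersion.
mc_a, rcv_a = confidence_and_residual_variance([0.7, 0.1, 0.1, 0.1])
mc_b, rcv_b = confidence_and_residual_variance([0.7, 0.3, 0.0, 0.0])
```

Both vectors would pass the same confidence threshold, yet the second concentrates its remaining mass on a single competing class, which is the kind of difference a confidence-only rule cannot see.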

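2602.22638 2026-06-11 cs.AI Version update

Diffusing to Coordinate: Efficient Online Multi-Agent Diffusion Policies

Zhuoran Li, Hai Zhong, Xun Wang, Qingxin Xia, Lihua Zhang, Longbo Huang

Affiliation: University of Science and Technology of China

AI summary: Proposes OMAD, one of the first online off-policy multi-agent reinforcement learning frameworks to use diffusion policies; a relaxed policy objective maximizes scaled joint entropy for effective exploration and coordination, and sample efficiency improves by 2.5× to 5× on MPE and MAMuJoCo.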

Abstract

Online Multi-Agent Reinforcement Learning (MARL) is a prominent framework for efficient agent coordination. Crucially, enhancing policy expressiveness is pivotal for achieving superior performance. Diffusion-based generative models are well-positioned to meet this demand, having demonstrated remarkable expressiveness and multimodal representation in image generation and offline settings. Yet, their potential in online MARL remains largely under-explored. A major obstacle is that the intractable likelihoods of diffusion models impede entropy-based exploration and coordination. To tackle this challenge, we propose one of the first Online off-policy MARL frameworks using Diffusion policies (OMAD) to orchestrate coordination. Our key innovation is a relaxed policy objective that maximizes scaled joint entropy, facilitating effective exploration without relying on tractable likelihood. Complementing this, within the centralized training with decentralized execution (CTDE) paradigm, we employ a joint distributional value function to optimize decentralized diffusion policies. It leverages tractable entropy-augmented targets to guide the simultaneous updates of diffusion policies, thereby ensuring stable coordination. Extensive evaluations on MPE and MAMuJoCo establish our method as the new state-of-the-art across 10 diverse tasks, demonstrating a remarkable 2.5× to 5× improvement in sample efficiency.

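2602.05746 2026-06-11 cs.LG cs.AI Version update

Composing Linear Layers from Irreducibles

Travis Pence, Daisuke Yamada, Vikas Singh

Affiliation: University of Wisconsin-Madison

AI summary: Uses Clifford algebra to decompose linear layers into compositions of bivectors (geometric primitives), requiring only O(log^2 d) parameters, and matches strong baselines when applied to LLM attention projections.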

Comments 35 Pages, 11 Tables, 6 Figures, Appearing in NeurIPS 2025

Journal ref Advances in Neural Information Processing Systems 38 (2025)

Abstract

Contemporary large models often exhibit behaviors suggesting the presence of low-level primitives that compose into modules with richer functionality, but these fundamental building blocks remain poorly understood. We investigate this compositional structure in linear layers by asking: can we identify/synthesize linear transformations from a minimal set of geometric primitives? Using Clifford algebra, we show that linear layers can be expressed as compositions of bivectors -- geometric objects encoding oriented planes -- and introduce a differentiable algorithm that decomposes them into products of rotors. This construction uses only O(log^2 d) parameters, versus O(d^2) required by dense matrices. Applied to the key, query, and value projections in LLM attention layers, our rotor-based layers match the performance of strong baselines such as block-Hadamard and low-rank approximations. Our findings provide an algebraic perspective on how these geometric primitives can compose into higher-level functions within deep models.
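As a rough intuition for "products of rotors": a rotor built from a single coordinate-plane bivector acts on a vector as a rotation within that plane, and composing several such rotations gives a richer orthogonal map. The sketch below shows only that matrix-level action for axis-aligned planes; it is an illustration under those assumptions, not the paper's differentiable decomposition or its O(log^2 d) parameterization, and the function names are hypothetical.

```python
import math


def rotate_in_plane(x, i, j, theta):
    """Rotate vector x by angle theta in the coordinate plane spanned by axes i and j."""
    y = list(x)
    c, s = math.cos(theta), math.sin(theta)
    y[i] = c * x[i] - s * x[j]
    y[j] = s * x[i] + c * x[j]
    return y


def apply_rotations(x, planes):
    """Apply a sequence of (i, j, theta) planar rotations, left to right."""
    for i, j, theta in planes:
        x = rotate_in_plane(x, i, j, theta)
    return x


# Two quarter turns: first in the (0, 1) plane, then in the (1, 2) plane.
v = apply_rotations([1.0, 0.0, 0.0], [(0, 1, math.pi / 2), (1, 2, math.pi / 2)])
```

Each planar rotation carries one angle, so a product of k of them uses k parameters, compared with d^2 entries for a dense d-by-d matrix.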

2602.11995 2026-06-11 cs.LG Version update

Momentum LMS Theory beyond Stationarity: Stability, Tracking, and Regret

Yifei Jin, Xin Zheng, Lei Guo

Affiliations: School of Advanced Interdisciplinary Sciences, University of Chinese Academy of Sciences; State Key Laboratory of Mathematical Sciences, Academy of Mathematics and Systems Science, Chinese Academy of Sciences; School of Mathematical Sciences, University of Chinese Academy of Sciences

AI summary: Studies the tracking performance and regret bounds of the momentum least mean squares (LMS) algorithm for non-stationary, time-varying linear systems; by analyzing a second-order time-varying stochastic vector difference equation, it establishes fast adaptation and robust tracking.

Comments 9 pages, 3 figures

Abstract

The rapid advancement of large language models (LLMs) has enabled the emergence of agentic artificial intelligence (AI) with powerful reasoning and autonomous decision-making capabilities. This integration with edge computing has led to the development of Mobile Edge General Intelligence (MEGI), which brings real-time, privacy-preserving reasoning to the network edge. However, deploying LLM-based agentic AI reasoning in MEGI environments poses significant challenges due to the high computational demands of reasoning and the limited resources of edge devices. To address these challenges, we propose a joint optimization framework for efficient LLM reasoning deployment in MEGI. First, we systematically review enhancement methods to identify mechanisms suitable for edge adaptation. Subsequently, we present a distributed framework that synergizes reasoning enhancement via adaptive CoT prompting with scalable deployment through a distributed MoE architecture. An important innovation of this approach involves modeling reasoning depth as a dynamic network resource variable, which is optimized jointly with expert activation and transmission power. This mechanism allows the system to dynamically regulate expert networks and reasoning complexity according to task requirements and device capabilities. Experimental evaluations in mobile edge environments demonstrate that the proposed framework effectively balances reasoning quality and resource efficiency. The results show that with less than one second of additional inference time, both accuracy and latency satisfaction rate can reach 90%, validating the practical viability of deploying sophisticated LLM reasoning in resource-constrained MEGI systems.

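2601.00181 2026-06-11 cs.CL cs.AI Version update

Causal Emotion Recognition in Conversation: Context Saturation and Discourse-Marker Evidence

Cheonkam Jeong, Adeline Nyamathi

Affiliation: University of California, Irvine

AI summary: Systematic ablations show that conversational context dominates emotion-recognition performance but saturates quickly, and that sadness is associated with reduced use of left-periphery discourse markers and greater dependence on context.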

Abstract

We address two persistent gaps in Emotion Recognition in Conversation: which modeling choices materially affect performance, and how recognition findings connect to interpretable discourse-level patterns. We study both through a systematic investigation on IEMOCAP with cross-dataset validation on MELD. For recognition, we run controlled ablations with 10 random seeds and paired significance tests with multiple-comparisons correction, yielding three findings. First, conversational context is the dominant factor, but performance saturates quickly: roughly 90% of the gain is captured within the most recent 10-30 preceding turns, depending on the label set. Second, hierarchical sentence representations help most in utterance-only settings and show a clear advantage on MELD, but their benefit disappears once turn-level context is available, suggesting that conversational history subsumes much of the intra-utterance structure. Third, integrating an external affective lexicon does not improve results, consistent with pretrained encoders already capturing most of the affective signal needed for ERC. Under a strictly causal setting, our simple models achieve strong performance (82.69% 4-way; 67.07% 6-way weighted F1), showing that competitive accuracy is achievable without future turns. For linguistic analysis, we examine 5,286 discourse-marker occurrences and find a reliable association between emotion and marker position (p < .0001). Sad utterances show reduced left-periphery marker usage (21.9%) relative to other emotions (28-32%), consistent with accounts linking left-periphery markers to active discourse management. This aligns with our recognition results, where Sad benefits most from conversational context (+22 percentage points), suggesting sadness may be more context-dependent than emotions with stronger local pragmatic cues.

2602.06868 2026-06-11 cs.RO Version update

Consensus-based optimization (CBO): Towards Global Optimality in Robotics

Xudong Sun, Armand Jordana, Massimo Fornasier, Jalal Etesami, Majid Khadiv

Affiliation: Munich Center for Machine Learning (MCML), Munich, Germany

AI summary: Introduces consensus-based optimization (CBO) to robotics, with guaranteed convergence to a global optimum under mild assumptions, and reports lower costs than existing methods on three challenging trajectory optimization scenarios.

Abstract

Zero-order optimization has recently received significant attention for designing optimal trajectories and policies for robotic systems. However, most existing methods (e.g., MPPI, CEM, and CMA-ES) are local in nature, as they rely on gradient estimation. In this paper, we introduce consensus-based optimization (CBO) to robotics, which is guaranteed to converge to a global optimum under mild assumptions. We provide theoretical analysis and illustrative examples that give intuition into the fundamental differences between CBO and existing methods. To demonstrate the scalability of CBO for robotics problems, we consider three challenging trajectory optimization scenarios: (1) a long-horizon problem for a simple system, (2) a dynamic balance problem for a highly underactuated system, and (3) a high-dimensional problem with only a terminal cost. Our results show that CBO is able to achieve lower costs with respect to existing methods on all three challenging settings. This opens a new framework to study global trajectory optimization in robotics.
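For readers unfamiliar with CBO, the sketch below runs a textbook-style isotropic CBO iteration on a one-dimensional multimodal test function: particles are pulled toward a Gibbs-weighted consensus point and perturbed by noise proportional to their distance from it. This is an illustrative toy under assumed hyperparameters (`alpha`, `lam`, `sigma`, `dt` and the test function are all choices made here), not the authors' trajectory-optimization implementation.

```python
import math
import random


def cbo_minimize(f, n_particles=200, steps=400, alpha=30.0, lam=1.0,
                 sigma=0.5, dt=0.05, low=-3.0, high=3.0, seed=0):
    """Minimal 1-D consensus-based optimization; returns the last consensus point."""
    rng = random.Random(seed)
    xs = [rng.uniform(low, high) for _ in range(n_particles)]
    consensus = xs[0]
    for _ in range(steps):
        costs = [f(x) for x in xs]
        best = min(costs)
        # Gibbs weights; subtracting the best cost keeps the weights from all underflowing.
        weights = [math.exp(-alpha * (c - best)) for c in costs]
        consensus = sum(w * x for w, x in zip(weights, xs)) / sum(weights)
        # Drift toward the consensus plus noise scaled by the distance to it.
        xs = [
            x - lam * dt * (x - consensus)
            + sigma * math.sqrt(dt) * abs(x - consensus) * rng.gauss(0.0, 1.0)
            for x in xs
        ]
    return consensus


def rastrigin_1d(x):
    # Multimodal toy cost with its global minimum at x = 0.
    return x * x + 2.0 * (1.0 - math.cos(2.0 * math.pi * x))


x_star = cbo_minimize(rastrigin_1d)
```

No gradient of `f` is evaluated anywhere; the only information used is the cost at each particle, which is what makes the method zero-order.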

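2602.03282 2026-06-11 cs.CV cs.AI Version update

Global Geometry Is Not Enough for Vision Representations

Jiwan Chung, Seon Joo Kim

Affiliation: University of California, Berkeley

AI summary: Experiments show that global embedding geometry has near-zero correlation with compositional binding, whereas functional sensitivity measured by the input-output Jacobian reliably tracks it; an analysis attributes this to existing losses explicitly constraining embedding geometry but not the local input-output mapping.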

Abstract

A common assumption in representation learning is that globally well-distributed embeddings support robust and generalizable representations. This focus has shaped both training objectives and evaluation protocols, implicitly treating global geometry as a proxy for representational competence. While global geometry effectively encodes which elements are present, it is often insensitive to how they are composed. We investigate this limitation by testing the ability of geometric metrics to predict compositional binding across a diverse suite of vision encoders. We find that standard geometry-based statistics exhibit near-zero correlation with compositional binding. In contrast, functional sensitivity, as measured by the input-output Jacobian, reliably tracks this capability. We further provide an analytic account showing that this disparity arises from objective design, as existing losses explicitly constrain embedding geometry but leave the local input-output mapping unconstrained. These results suggest that global embedding geometry captures only a partial view of representational competence and establish functional sensitivity as a critical complementary axis for modeling composite structure.
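One simple way to put a number on input-output sensitivity is the Frobenius norm of the Jacobian at a given input. The sketch below estimates it with central finite differences on a toy function; the abstract does not say which Jacobian statistic the authors use, so the choice of norm, the step size `eps`, and the stand-in "encoder" are assumptions for illustration only.

```python
import math


def jacobian_frobenius_norm(f, x, eps=1e-5):
    """Estimate the Frobenius norm of df/dx at x with central finite differences."""
    total = 0.0
    for k in range(len(x)):
        plus = list(x)
        minus = list(x)
        plus[k] += eps
        minus[k] -= eps
        # One Jacobian column: how every output moves when input k moves.
        column = [(a - b) / (2 * eps) for a, b in zip(f(plus), f(minus))]
        total += sum(c * c for c in column)
    return math.sqrt(total)


def toy_encoder(x):
    # Hypothetical stand-in for a vision encoder: a fixed diagonal linear map.
    return [3.0 * x[0], 4.0 * x[1]]


sensitivity = jacobian_frobenius_norm(toy_encoder, [0.2, -0.7])
```

For this linear toy map the Jacobian is diag(3, 4), so the estimate should sit at 5 up to floating-point error. For a real encoder, automatic differentiation would be the more practical route than finite differences.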

2602.02726 2026-06-11 cs.LG cs.CL Version update

Vector Quantized Latent Concepts: A Scalable Alternative to Clustering-Based Concept Discovery

Xuemin Yu, Ankur Garg, Samira Ebrahimi Kahou, Hassan Sajjad

Affiliations: Dalhousie University, Canada; University of Calgary, Canada

AI summary: Proposes VQLC, a framework that learns discrete latent concepts through vector quantization, staying interpretable while keeping computational cost close to K-Means and scaling better than hierarchical clustering on large data.

Abstract

Large language models (LLMs) encode rich semantic information in their hidden states, yet it remains difficult to understand what information these internal representations capture. Latent concepts extracted from hidden states offer a promising direction for interpreting LLMs, but existing clustering-based methods face a trade-off: hierarchical clustering produces coherent concepts but is limited to small datasets due to its quadratic memory cost, while K-Means scales efficiently but may yield less semantically coherent concepts. We propose Vector Quantized Latent Concepts (VQLC), a discrete concept learning framework that learns a codebook of latent concepts on frozen hidden states. Across 12 dataset-model settings, VQLC stays close to K-Means in computational cost, scales better than hierarchical clustering, and remains competitive in faithfulness, with the clearest gains on decoder-only models. LLM-based evaluation, qualitative analysis, and a Sparse Autoencoder (SAE) comparison demonstrate that the learned concepts are interpretable and task-relevant.
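The core lookup in any vector-quantization scheme is assigning each vector to its nearest codebook entry. The sketch below shows only that assignment step on made-up two-dimensional "hidden states"; how VQLC actually learns its codebook is not described in the abstract and is not reproduced here, and the codebook values and function name are hypothetical.

```python
def nearest_code(vector, codebook):
    """Return the index of the codebook entry closest to vector (squared Euclidean distance)."""
    def sq_dist(code):
        return sum((v - c) ** 2 for v, c in zip(vector, code))
    return min(range(len(codebook)), key=lambda k: sq_dist(codebook[k]))


# Hypothetical codebook of three "concepts" and four toy hidden states.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]
hidden_states = [[0.1, -0.2], [0.9, 1.2], [-0.8, 1.7], [0.6, 0.6]]
assignments = [nearest_code(h, codebook) for h in hidden_states]
```

Each assignment needs only a pass over the codebook, so memory grows with the number of codes rather than with the square of the number of data points, which is the scaling contrast with hierarchical clustering that the abstract draws.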

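2512.16415 2026-06-11 cs.CV Version update

CountZES: Counting via Zero-Shot Exemplar Selection

Muhammad Ibraheem Siddiqui, Muhammad Haris Khan

Affiliation: Mohamed Bin Zayed University of Artificial Intelligence

AI summary: To address inaccurate counts caused by poor exemplar quality in zero-shot counting, proposes CountZES, which selects diverse exemplars through three synergistic stages (detection-anchored, density-guided, and feature-consensus) to improve counting accuracy.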

Abstract

Object counting in complex scenes is particularly challenging in the zero-shot (ZS) setting, where instances of unseen categories are counted using only a class name. Existing ZS counting methods that infer exemplars from text often rely on off-the-shelf open-vocabulary detectors (OVDs), which in dense scenes suffer from semantic noise, appearance variability, and multi-instance proposals. Alternatively, random image-patch sampling is employed, which fails to accurately delineate object instances. Since counting is sensitive to exemplar quality, such selection strategies often yield poorly representative exemplars, leading to inaccurate count estimation. To address these issues, we propose CountZES, an inference-only approach for object counting via ZS exemplar selection. CountZES discovers diverse exemplars through three synergistic stages: Detection-Anchored Exemplar (DAE), Density-Guided Exemplar (DGE), and Feature-Consensus Exemplar (FCE). DAE refines OVD detections to isolate precise single-instance exemplars. DGE introduces a density-driven, self-supervised paradigm to identify statistically consistent and semantically compact exemplars, while FCE reinforces visual coherence through feature-space clustering. Together, these stages yield a complementary exemplar set that balances textual grounding, count consistency, and feature representativeness. Experiments on diverse datasets demonstrate the superior performance of CountZES among ZS counting methods, and it generalizes effectively across domains.

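2602.02465 2026-06-11 cs.AI cs.CV cs.LG Version update

MentisOculi: Revealing the Limits of Reasoning with Mental Imagery

Jana Zeller, Thaddäus Wiedemer, Fanfei Li, Thomas Klein, Prasanna Mayilvahanan, Matthias Bethge, Felix Wichmann, Ryan Cotterell, Wieland Brendel

Affiliation: Max Planck Institute for Informatics

AI summary: Introduces the MentisOculi benchmark of multi-step reasoning problems to test whether frontier models can use visual representations as a reasoning aid; visual strategies generally fail to improve performance, and unified multimodal models suffer from compounding generation errors and fail to leverage even ground-truth visualizations.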

Comments 9 pages, 8 figures, Accepted at ICML 2026

Abstract

Frontier models are transitioning from multimodal large language models (MLLMs) that merely ingest visual information to unified multimodal models (UMMs) capable of native interleaved generation. This shift has sparked interest in using intermediate visualizations as a reasoning aid, akin to human mental imagery. Central to this idea is the ability to form, maintain, and manipulate visual representations in a goal-oriented manner. To evaluate and probe this capability, we develop MentisOculi, a procedural, stratified suite of multi-step reasoning problems amenable to visual solution, tuned to challenge frontier models. Evaluating visual strategies ranging from latent tokens to explicit generated imagery, we find they generally fail to improve performance. Analysis of UMMs specifically exposes a critical limitation: While they possess the textual reasoning capacity to solve a task and can sometimes generate correct visuals, they suffer from compounding generation errors and fail to leverage even ground-truth visualizations. Our findings suggest that despite their inherent appeal, visual thoughts do not yet benefit model reasoning. MentisOculi establishes the necessary foundation to analyze and close this gap across diverse model families.

2602.02285 2026-06-11 cs.LG cs.CL math.ST stat.TH Version update

AI4SLT: Empirical Processes in Lean 4 for Formal Statistical Learning Theory

Yuanhe Zhang, Jason D. Lee, Fanghui Liu

Affiliation: University of California, Berkeley

AI summary: Presents the first comprehensive Lean 4 formalization of statistical learning theory, grounded in empirical process theory; a human-AI collaborative workflow yields a verifiable theorem-proving toolbox and exposes implicit assumptions in textbooks.

Comments Accepted by ICML 2026

Abstract

We present the first comprehensive Lean 4 formalization of statistical learning theory (SLT) grounded in empirical process theory. Our end-to-end formal infrastructure implements content missing from the latest Lean library, including a complete development of Gaussian Lipschitz concentration, Dudley's entropy integral theorem for sub-Gaussian processes, and an application to least-squares (sparse) regression with a sharp rate. The project was carried out using a human-AI collaborative workflow, in which humans design proof strategies and AI agents execute tactical proof construction, leading to a human-verified Lean 4 toolbox for SLT. Beyond implementation, the formalization process exposes and resolves implicit assumptions and missing details in standard SLT textbooks, enforcing a granular, line-by-line understanding of the theory. This work establishes a reusable formal foundation and opens the door for future developments in machine learning theory. The code is provided at https://github.com/YuanheZ/lean-stat-learning-theory.

AI and large models

Vision and robotics

Science and healthcare

A Sensitivity Analysis of Multi-Event Audio Grounding in Audio LLMs

LEMON-Mapping: Loop-Enhanced Large-Scale Multi-Session Point Cloud Merging and Optimization for Globally Consistent Mapping

Planning under Distribution Shifts with Causal POMDPs

Scaling Laws of Global Weather Models

CoVar: Confidence-Variance-Guided Pseudo-Label Selection for Semi-Supervised Learning

MobilityBench: A Benchmark for Evaluating Route-Planning Agents in Real-World Mobility Scenarios

EKF-Based Depth Camera and Deep Learning Fusion for UAV-Person Distance Estimation and Following in SAR Operations

Human-Guided Agentic AI for Multimodal Clinical Prediction: Lessons from the AgentDS Healthcare Benchmark

Diffusing to Coordinate: Efficient Online Multi-Agent Diffusion Policies

Learning to Inject: Automated Prompt Injection via Reinforcement Learning

FOCUS on Contamination: Hydrology-Informed Noise-Aware Learning for Geospatial PFAS Mapping

Coverage Guarantees for Pseudo-Calibrated Conformal Prediction under Distribution Shift

Composing Linear Layers from Irreducibles

Momentum LMS Theory beyond Stationarity: Stability, Tracking, and Regret

SpaTeoGL: Spatiotemporal Graph Learning for Interpretable Seizure Onset Zone Analysis from Intracranial EEG

SoftMatcha 2: A Fast and Soft Pattern Matcher for Trillion-Scale Corpora

Kalman Linear Attention: Parallel Bayesian Filtering For Efficient Language Modelling and State Tracking

Tensor Methods: A Unified and Interpretable Approach for Material Design

On the Optimal Reasoning Length for RL-Trained Language Models

Autoregressive Direct Preference Optimization

From Correspondence to Actions: Human-Like Multi-Image Spatial Reasoning in Multi-modal Large Language Models

Improving Detection of Rare Nodes in Hierarchical Multi-Label Learning

Resource-Aware LLM Reasoning for Mobile Edge General Intelligence

Causal Emotion Recognition in Conversation: Context Saturation and Discourse-Marker Evidence

Consensus-based optimization (CBO): Towards Global Optimality in Robotics

Global Geometry Is Not Enough for Vision Representations

Vector Quantized Latent Concepts: A Scalable Alternative to Clustering-Based Concept Discovery

CountZES: Counting via Zero-Shot Exemplar Selection

MentisOculi: Revealing the Limits of Reasoning with Mental Imagery

AI4SLT: Empirical Processes in Lean 4 for Formal Statistical Learning Theory