2606.05760 2026-06-05 cs.CV

ExpSpeech-Net: Multimodal Fusion of Expression and Speech for Deepfake Detection

Ruchika Sharma, Rudresh Dwivedi

Affiliations: University of California, Berkeley; Stanford University

AI summary: Proposes the lightweight ExpSpeech-Net model, which fuses facial expressions and speech patterns and uses SqueezeNet and RNN backbones with intelligent feature selection for efficient deepfake detection, reaching 94.5% accuracy.

Abstract

Deepfake videos are increasingly challenging the credibility of online content. Many existing detection methodologies rely on complex, resource-intensive models, which limits their practical use. The study introduces the ExpSpeech-Net deepfake detection (SqN-R-DFD) model, which utilizes SqueezeNet and an RNN (Recurrent Neural Network) as its backbone, providing a lightweight and efficient deepfake detection framework that simultaneously analyzes facial expressions and speech patterns. The approach incorporates advanced feature extraction, such as ISLBT-based features for images and MPNCC for signals, along with a smart feature-selection strategy using SASMA (Sandpiper-Assisted Slime Mould Algorithm), ensuring optimal and balanced input to the detection models. By combining SqueezeNet and an RNN, subtle inconsistencies in deepfake videos are captured effectively. The framework achieves 94.5% accuracy, precision of 99.3%, and F-measure of 96.8%, outperforming conventional methods. This demonstrates that integrating multiple modalities with intelligent preprocessing and feature selection enables practical, real-time deepfake detection suitable for everyday applications.

2606.05758 2026-06-05 cs.CV cs.AI cs.LG

DRIFT: A Residual Flow Adapter for Decoding Continuous Outputs in Vision-Language Models

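Zhuoming Liu, Jinhong Lin, Kwan Man Cheng, Lin Zhang, Shayok Bagchi, Yin Li

Affiliations: University of Wisconsin–Madison; West Lafayette Jr./Sr. High School

AI summary: Proposes the DRIFT framework, which pairs a base predictor with a flow-matching-based generative refinement module to adapt pretrained vision-language models to continuous decoding tasks, outperforming regression and generative methods on tasks such as visual grounding and robotic control.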

Abstract

Many modern vision-language models (VLMs) build on autoregressive decoding of discrete tokens. While text-based output interfaces enable scalable pretraining and strong zero-shot generalization across diverse tasks, they are poorly suited for problems that require precise continuous outputs, such as localizing temporal boundaries of events or generating robotic control actions. To address this challenge, we propose DRIFT, a general framework for adapting pretrained VLMs to continuous decoding tasks. DRIFT combines a base predictor, which provides a coarse estimate of the target output, with a generative refinement module based on flow matching that iteratively improves the prediction. This residual formulation transforms the generative modeling problem from learning a global output distribution to modeling a localized residual distribution around a strong prior, substantially simplifying optimization. We evaluate DRIFT on both perception and planning tasks, including visual grounding and robotic control. Across multiple tasks and architectures spanning MLLMs, VLAs, and WAMs, DRIFT consistently outperforms a strong set of regression- and generative-based solutions.
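The residual formulation described in this abstract can be pictured with a small toy, written here as a minimal sketch rather than the authors' implementation. The names `base_predictor`, `toy_velocity`, and `refine` are hypothetical, and the "learned" velocity field is replaced by a closed-form stand-in so the example stays self-contained: a coarse estimate is produced first, and a residual is then integrated from noise toward its target with Euler steps.

```python
import random

def base_predictor(x):
    """Hypothetical coarse regressor: right trend, systematically biased."""
    return 2.0 * x

def toy_velocity(residual, t, x):
    """Stand-in for a learned velocity field over the residual.

    For the straight-line path r_t = (1 - t) * r_0 + t * r_1, the ideal
    velocity at (r_t, t) is (r_1 - r_t) / (1 - t). The target residual r_1
    is hard-coded as 0.5 * x purely for illustration.
    """
    target_residual = 0.5 * x
    return (target_residual - residual) / (1.0 - t)

def refine(x, num_steps=8, seed=0):
    """Return (coarse estimate, refined estimate) for input x."""
    rng = random.Random(seed)
    coarse = base_predictor(x)
    residual = rng.gauss(0.0, 1.0)  # the residual starts from noise
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays strictly below 1, so no division by zero
        residual += dt * toy_velocity(residual, t, x)
    return coarse, coarse + residual

coarse, refined = refine(x=3.0)
print(coarse, refined)
```

In this toy the true output is `2.5 * x`, so the generative part only has to model the small gap left by the base predictor, which is the intuition the abstract gives for why the residual problem is easier to optimize.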

2606.05756 2026-06-05 cs.LG cs.AI cs.IT math.IT

Beyond Soft Masks: Hard-Perturbation Mixup Explainer for Robust GNN Explainability

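Jialiang Yin, Zheng Zhao, Linsey Pang, Bo Dong, Bin Shi, Jiaxing Zhang

Affiliations: Xi'an Jiaotong University; PayPal; Bellevue, USA

AI summary: Proposes HPME, a hard-perturbation mixup explanation framework based on a generalized Graph Information Bottleneck. It extracts discrete explanatory subgraphs via graph pooling and uses a mixup strategy based on structure-level replacement, addressing the label-irrelevant information leakage and distribution shift of soft-mask methods and improving explanation fidelity.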

Abstract

Graph Neural Networks (GNNs) have demonstrated remarkable performance across a range of applications involving graph-structured data, particularly in high-stakes domains. However, the opaque nature of their decision-making processes limits their trustworthiness and broader adoption. Existing post-hoc explanation methods aim to improve explainability by identifying subgraphs that influence GNN predictions and adopt mixup strategies to alleviate the out-of-distribution (OOD) issue caused by using subgraphs for prediction. Yet, these approaches typically rely on soft masks, which are inherently unable to fully eliminate label-irrelevant information, allowing redundant structures to leak into the mixup process and hindering the resolution of the OOD problem, thereby degrading explanation fidelity. In this work, we propose HPME, a Hard-Perturbation Mixup Explanation framework grounded in a generalized Graph Information Bottleneck, which leverages graph pooling to extract discrete explanatory subgraphs and to yield an information-capacity bound to thoroughly compress label-irrelevant components. Furthermore, we introduce a novel mixup strategy built upon structure-level replacement, generating in-distribution explanations to effectively mitigate the distribution shift. Extensive experiments on diverse tasks demonstrate that HPME achieves state-of-the-art performance in generating robust and interpretable explanations across both synthetic and real-world datasets.

2606.05754 2026-06-05 cs.SD cs.AI eess.AS

Sagnac-Assisted Enhanced OTDR for Distributed Acoustic Sensing: A Standardized Benchmark and Engineering Evaluation Framework

Weiguang Wang, Fugen Wu, Hailing Wang, Xuechen Liang, Xiaobin Li, Ru Han, Tianchang Xie

Affiliations: East China Jiaotong University; School of Materials and Energy, Guangdong University of Technology; Jiangxi Tonghui Technology Group Co., Ltd.; School of Artificial Intelligence and Big Data, Guangzhou Vocational University of Science and Technology

AI summary: Proposes a Sagnac-assisted enhanced ϕ-OTDR sensing architecture and a standardized benchmark framework; a dual-branch fusion model reaches 89.79% accuracy and a 5.00% nuisance alarm rate on a 10-km fiber, addressing polarization fading and interference.

Abstract

Phase-sensitive optical time-domain reflectometry (ϕ-OTDR) is widely used in large-scale distributed acoustic sensing (DAS) because it provides distributed spatiotemporal monitoring over long sensing distances. Its field performance can still deteriorate because of polarization-induced fading (PIF), local signal degradation, and strong environmental interference. This study develops a Sagnac-assisted enhanced ϕ-OTDR sensing architecture and a standardized benchmark framework for engineering-oriented DAS event recognition. The Sagnac interferometer provides a continuous phase response that supplements fading-prone observations in the ϕ-OTDR channel, and heterogeneous signal alignment is achieved using a cross-correlation procedure implemented on an FPGA platform. The benchmark protocol compares conventional feature-engineering methods, probabilistic shallow classifiers, single-branch deep models, and dual-branch fusion models under consistent data partitioning, preprocessing, and metric definitions. Experiments on a 10-km sensing fiber with six representative acoustic event classes show that the dual-branch fusion model provides the most favorable trade-off among the evaluated methods, reaching 89.79% accuracy, 89.83% macro-F1, and a nuisance alarm rate of 5.00% on the balanced test set. The results also show that channel grouping strongly affects dual-branch evaluation, indicating that deployment-oriented conclusions should be based on accuracy, macro-F1, nuisance alarm rate, false negative rate, and latency rather than accuracy alone. This work provides a physically motivated enhancement strategy for ϕ-OTDR-based DAS and a reproducible benchmark protocol for future fusion-oriented sensing research. The implementation and scripts for reproducing the DAS event-recognition experiments are publicly available at https://github.com/wawa-abc/das.
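The cross-correlation alignment mentioned in this abstract runs on an FPGA in the paper; the following is only a software sketch of the general idea, assuming two equally sampled one-dimensional traces. The function `best_lag` and the toy signals `phi_otdr` and `sagnac` are hypothetical names chosen for illustration.

```python
def best_lag(reference, delayed, max_lag):
    """Return the lag (in samples) that maximizes the cross-correlation."""
    def corr_at(lag):
        total = 0.0
        for i, value in enumerate(reference):
            j = i + lag
            if 0 <= j < len(delayed):
                total += value * delayed[j]
        return total
    return max(range(-max_lag, max_lag + 1), key=corr_at)

# Toy traces: the same pulse, seen four samples later on the second channel.
pulse = [0.0, 1.0, 3.0, 1.0, 0.0]
phi_otdr = pulse + [0.0] * 10
sagnac = [0.0] * 4 + pulse + [0.0] * 6

lag = best_lag(phi_otdr, sagnac, max_lag=8)

# Shift the delayed channel back by the estimated lag (non-negative here)
# and zero-pad the tail so both traces keep the same length.
aligned = sagnac[lag:] + [0.0] * lag
print(lag, aligned[:5])
```

Once the two channels share a time base, features from each can be paired sample by sample, which is what a dual-branch fusion model needs as input.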

2606.05753 2026-06-05 cs.CV

Cosine Misleads: Auxiliary Losses Reshape Vision Language Models, Not Their Latents

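XiuYu Zhang, Junfeng Fang, Zhenkai Liang

Affiliations: National University of Singapore

AI summary: Shows experimentally that, in latent visual reasoning for vision-language models, alignment losses such as cosine similarity are negatively correlated with accuracy, and introduces the PRISM diagnostics, which reveal that the latents are bypassed and that auxiliary losses reshape the language model mainly through shared parameters.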

Abstract

Latent visual reasoning (LVR) inserts supervised latent tokens between perception and answer generation in vision-language models (VLMs). The field uses alignment between these latents and their visual targets, i.e., cosine similarity or mean squared error (MSE), as both the training loss and the quality metric, assuming that better alignment yields a better answer. We test this with a designed matrix of five LVR variants and find the assumption inverted: cosine alignment is negatively correlated with accuracy across all five (r=-0.94). To explain this, we introduce PRISM, a pair of inference-time diagnostics: a linear probe that asks where the answer is decodable, and a corruption test that asks whether the latent is load-bearing. The supervised latents are largely bypassed. Corrupting them shifts accuracy by at most four points. The answer is decodable downstream of the latent but not at it, and the size of this decodability gap predicts how much each variant relies on its latent under perturbation. Consistent with an Information Bottleneck reading of the loss, the auxiliary objective reshapes the language model via shared parameters rather than via the latent variable it nominally optimizes.
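The two quantities this abstract relates, cosine alignment per variant and the Pearson correlation between alignment and accuracy, can be computed as in the sketch below. The numbers are invented for illustration and are not results from the paper; `cosine_similarity` and `pearson_r` are plain reference implementations rather than the authors' code.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Invented per-variant values: higher alignment paired with lower accuracy.
alignment = [0.2, 0.4, 0.6, 0.8]
accuracy = [0.9, 0.8, 0.7, 0.6]

r = pearson_r(alignment, accuracy)
print(round(r, 3))
```

A strongly negative `r` on real measurements would mean that the variants whose latents match their visual targets best are the ones that answer worst, which is the inversion the abstract reports.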

2606.05749 2026-06-05 cs.CL cs.AI

MARDoc: A Memory-Aware Refinement Agent Framework for Multimodal Long Document QA

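Kaifeng Chen, Hongtao Liu, Qiyao Peng, Jian Yang, Yongqiang Liu, Xiaochen Zhang, Qing Yang

Affiliations: Tianjin University; Qifu Technology; Beihang University; Jiangnan University

AI summary: Proposes the MARDoc framework, which decouples the task into three agents (exploration, refinement, and reflection) and replaces the full interaction history with structured memory, reducing context noise and improving multimodal long-document QA performance.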

Abstract

Iterative retrieval-reasoning agents have recently shown promise for multimodal long-document question answering. However, most existing systems maintain a single growing context that mixes retrieval traces, observations, and intermediate reasoning. As interactions accumulate, key evidence becomes scattered and diluted, making multi-hop reasoning noisy. We propose MARDoc, a Memory-Aware Refinement Agent framework that decouples long-document QA into three specialized agents: an Explorer for multi-granularity multimodal retrieval, a Refiner for distilling interaction traces into structured evidence and reasoning memories, and a Reflector for checking evidence sufficiency and providing targeted feedback. Across iterations, the agents rely on a dynamically updated structured memory rather than a full accumulated interaction history. This design reduces context noise while preserving answer-critical facts and their logical dependencies. Experiments on MMLongBench-Doc and DocBench show that MARDoc achieves strong results, outperforming same-backbone baselines and demonstrating the effectiveness of structured memory for agentic document QA.

2606.05744 2026-06-05 cs.CL

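TextWand: A Unified Framework for Scene Text Editing

Shuyu Wang, Zhile Guan, Hongxiu Chen, Yule Duan, Weiqi Li, Xin Shan, Ronggang Wang, Jian Zhang

Affiliations: School of Electronic and Computer Engineering, Peking University

AI summary: Proposes the unified TextWand framework, which decomposes complex editing tasks into the atomic operations of rendering and erasure and combines ORPE encoding with the RAS strategy to remove, generate, and replace scene text, outperforming existing models on the new TextWand-Bench benchmark.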

Abstract

We propose TextWand, a general-purpose framework that unifies scene text removal, generation, and replacement into a single model. By decomposing complex editing tasks into the atomic primitives of rendering and erasure, TextWand achieves precise control over both text appearance and background integrity. Specifically, we introduce a novel design, Overlay-Reference Positional Encoding (ORPE), to enforce pixel-level layout fidelity and exemplar-driven style control, alongside a new strategy, Region-Adaptive Suppression (RAS), to ensure clean text erasure. To address the absence of a comprehensive benchmark for general-purpose scene text editing among existing single-task datasets, we construct TextWand-Bench. Extensive experiments demonstrate that TextWand outperforms existing leading open-source and closed-source models by delivering superior text content accuracy, layout and style consistency, and overall image quality across scene text removal, generation and replacement tasks.

2606.05728 2026-06-05 cs.AI cs.CL

DiG-Plan: Mitigating Early Commitment for Tool-Graph Planning via Diffusion Guidance

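Yansi Li, Zhuosheng Zhang

Affiliations: School of Computer Science, Shanghai Jiao Tong University

AI summary: Targets the early-commitment problem of autoregressive decoding in tool-graph planning and proposes DiG-Plan, a framework that decouples a diffusion-based generator from an autoregressive refiner, substantially improving combinatorial search coverage and task performance.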

Comments Accepted at IJCAI-ECAI 2026. This is an author preprint; the final version will appear in the IJCAI Proceedings

Abstract

Generating executable tool plans requires selecting appropriate subsets from tool libraries, a combinatorial search problem with an exponentially large solution space. However, we identify a critical misalignment in predominant approaches: standard autoregressive (AR) decoding suffers from early commitment, where initial token choices rigidly constrain the search trajectory. A controlled study shows that masked denoising raises Pass@10 solution coverage from 0.320 to 0.943 over AR sampling under matched compute. Motivated by this, we propose DiG-Plan, a framework that decouples combinatorial exploration from structural refinement. DiG-Plan employs a diffusion-based proposer to generate diverse tool sets via iterative refinement, followed by an AR refiner for dependency prediction. On TaskBench, DiG-Plan improves over AR baselines by a 10% relative margin, with the largest gains on complex compositional tasks; API-Bank results show that the propose-refine-select design remains effective across domains. Code is available at https://github.com/puddingyeah/DiG-Plan.
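One plausible reading of the Pass@k solution coverage cited in this abstract is the fraction of tasks for which at least one of the first k sampled tool sets equals the gold set. The sketch below implements that reading; the paper's exact definition may differ, and the function name `pass_at_k` and the toy tool names are hypothetical.

```python
def pass_at_k(samples_per_task, gold_per_task, k):
    """Fraction of tasks whose gold tool set appears among the first k samples."""
    hits = 0
    for samples, gold in zip(samples_per_task, gold_per_task):
        gold_set = frozenset(gold)
        if any(frozenset(sample) == gold_set for sample in samples[:k]):
            hits += 1
    return hits / len(gold_per_task)

gold = [["ocr", "translate"], ["search", "summarize"]]

# Toy sampler that commits early: every sample for task 1 repeats one mistake.
committed = [
    [["ocr", "caption"], ["ocr", "caption"]],
    [["search", "summarize"], ["search", "summarize"]],
]

# Toy sampler with more diverse proposals for the same tasks.
diverse = [
    [["ocr", "caption"], ["translate", "ocr"]],
    [["search", "rank"], ["search", "summarize"]],
]

print(pass_at_k(committed, gold, k=2), pass_at_k(diverse, gold, k=2))
```

Under this metric, drawing more samples helps only if the samples differ from each other, which is why a sampler that locks in its first choices can plateau well below a more exploratory one.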

2606.05724 2026-06-05 cs.CL cs.AI

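Machine Learning-Based Real-Time Threat Detection for Surveillance Cameras

Gajendra Mandal, J. P. Patra, Priyansh Mahant

Affiliations: GitHub

AI summary: Proposes a YOLOv8-based real-time object detection framework, trained on a custom blunt-object dataset combined with a public gun and knife dataset, to detect guns, knives, and blunt objects in surveillance scenes.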

Abstract

Ensuring public safety in densely populated urban environments remains a critical challenge, necessitating the deployment of intelligent and automated video surveillance systems. Traditional surveillance approaches rely heavily on manual monitoring, which is inefficient and susceptible to human fatigue, delayed response, and observational errors. To overcome these limitations, this work presents a real-time object detection-based surveillance framework. The proposed system focuses on detecting guns, knives, and region-specific blunt objects commonly involved in violent activities in Indian surveillance scenarios. A key contribution of this work is the use of a custom-created dataset collected using a mobile camera, consisting of 336 labeled images of blunt objects such as iron rods, wooden sticks, and plastic rods. This dataset is combined with a publicly available dataset of 7,623 images of guns and knives, forming a consolidated dataset of 7,959 images across three classes: gun, knife, and blunt object. The combined dataset is used to train a YOLOv8-based object detection model for real-time performance. Experimental evaluation shows that increasing the training duration significantly improves recall and average precision for the blunt object class without signs of overfitting. Overall, the proposed framework achieves an effective balance between accuracy and efficiency, making it suitable for deployment in real-world surveillance environments such as campuses, public spaces, and transportation areas.
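The dataset sizes quoted in this abstract can be checked with a few lines of arithmetic. The sketch uses only the counts stated above; the variable names are illustrative, and the split between gun and knife images is not given in the abstract, so it is not modeled.

```python
blunt_images = 336        # custom blunt-object images (from the abstract)
gun_knife_images = 7623   # public gun and knife images (from the abstract)

total_images = blunt_images + gun_knife_images
blunt_share = blunt_images / total_images

print(total_images, f"{blunt_share:.1%}")
```

The blunt-object class makes up only about 4% of the combined data, which is consistent with the abstract singling out that class when discussing recall and the effect of longer training.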

2606.05704 2026-06-05 cs.AI cs.LG

Critic-Guided Heterogeneous Multi-Agent Reasoning for Reliable Mathematical Problem Solving

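Muhammad Talha Sharif, Abdul Rehman

Affiliations: University of Science and Technology of China

AI summary: Proposes a critic-based heterogeneous multi-agent framework that uses a generator-validator structure and an adaptive learning system to assess and guide reasoning from intermediate feedback, achieving up to 13% accuracy improvement on the GSM8K benchmark and reducing reliance on large models.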

Comments 6 pages

Abstract

Recent Large Language Models (LLMs) have shown impressive reasoning abilities, but they are still susceptible to hallucinations, intermediate reasoning mistakes, and unreliable reasoning results in complex mathematical reasoning problems. In this study, we introduce a critic-based heterogeneous multi-agent approach to improve the dependability of mathematical reasoning. This framework incorporates several LLM agents of different specialties and employs a critic-driven adaptive learning system to assess and guide the reasoning process based on intermediate feedback. The system adopts a generator-validator framework, with the validator not only determining correctness but also offering critiques to guide regeneration of solutions. This allows for adaptive error correction and prevents error cascading. Our experiments on the GSM8K benchmark show that the proposed method achieves up to 13% accuracy improvement over single-shot and non-critic models. Additionally, findings suggest that heterogeneity and critique reduce the need for large models, allowing smaller models to perform on par. Ablation studies reveal the main performance gains are due to the critic-based feedback loop and not model size. In summary, the proposed approach showcases the benefits of combining heterogeneous multi-agent collaboration and critique to obtain reliable and interpretable reasoning systems.
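The generator-validator loop described in this abstract can be sketched with deterministic stubs in place of LLM agents. Everything here is an assumption made for illustration: `generate`, `validate`, and `solve` are hypothetical names, and the stub generator simply makes one arithmetic slip until it receives a critique.

```python
def generate(question, critique=None):
    """Stub generator: slips by one unless a critique is supplied."""
    a, b = question
    return a * b if critique else a * b - 1

def validate(question, answer):
    """Stub validator: returns (is_correct, critique_or_None)."""
    a, b = question
    if answer == a * b:
        return True, None
    return False, f"{answer} is not {a} x {b}; recompute the product."

def solve(question, max_rounds=3):
    """Regenerate with the validator's critique until it accepts or rounds run out."""
    critique = None
    history = []
    answer = None
    for _ in range(max_rounds):
        answer = generate(question, critique)
        ok, critique = validate(question, answer)
        history.append((answer, ok))
        if ok:
            break
    return answer, history

answer, history = solve((17, 3))
print(answer, history)
```

The point of the structure is that the validator returns a critique rather than a bare verdict, so the next generation attempt is conditioned on what went wrong instead of being an independent retry.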

2606.05703 2026-06-05 cs.CV

Parallel Jacobi Decoding for Fast Autoregressive Image Generation

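Boya Liao, Ying Li, Siyong Jian, Huan Wang

Affiliations: Westlake University

AI summary: Proposes Parallel Jacobi Decoding (PJD), which expands draft tokens in the two-dimensional spatial domain and adjusts the attention mask to accelerate autoregressive image generation without training, achieving 4.8x to 6.4x speedups while maintaining generation quality.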

Comments Accepted by CVPR 2026

Abstract

Autoregressive (AR) models have demonstrated remarkable performance in generating high-fidelity images. However, their inherently sequential next-token prediction leads to significantly slower inference. Recent studies have introduced Jacobi-style decoding to accelerate autoregressive image generation. Extending the draft sequence initially improves efficiency, yet the acceleration quickly saturates as error propagation in the one-dimensional sequence hinders convergence. Observing that images exhibit strong local spatial correlations, we propose Parallel Jacobi Decoding (PJD), a training-free decoding approach that expands draft tokens in the two-dimensional spatial domain to enable efficient spatially parallel refinement. PJD adjusts the attention mask to mitigate error accumulation and improve convergence stability. Extensive experiments on diverse datasets show that PJD achieves 4.8x-6.4x acceleration across multiple autoregressive image generation models while maintaining competitive generation quality.
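For readers unfamiliar with the one-dimensional Jacobi-style decoding that this abstract builds on, the sketch below shows the fixed-point idea with a toy deterministic "model". It is not PJD itself: the two-dimensional draft expansion and the attention-mask changes are not modeled, and `next_token` is a hypothetical stand-in for a greedy autoregressive step.

```python
def next_token(prefix):
    """Toy deterministic 'model': the next token depends on the last two."""
    return (prefix[-1] * 3 + prefix[-2] + 1) % 7

def sequential_decode(prompt, n):
    """Standard greedy decoding: one token per step."""
    seq = list(prompt)
    for _ in range(n):
        seq.append(next_token(seq))
    return seq[len(prompt):]

def jacobi_decode(prompt, n):
    """Refine a whole draft at once until it stops changing (a fixed point)."""
    draft = [0] * n
    iterations = 0
    while True:
        iterations += 1
        # In a real model this is a single parallel forward pass over the draft.
        new_draft = [next_token(list(prompt) + draft[:i]) for i in range(n)]
        if new_draft == draft:
            return draft, iterations
        draft = new_draft

prompt = [1, 2]
tokens, iterations = jacobi_decode(prompt, n=6)
print(tokens, iterations)
```

Each iteration fixes at least one more leading token, so the loop needs at most `n + 1` passes and returns the same tokens as greedy sequential decoding; any speedup comes from iterations in which several tokens settle at once, and errors early in the draft delay everything after them, which is the saturation effect the abstract describes.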

2606.05702 2026-06-05 cs.AI cs.CV

Seeing Time: Benchmarking Chronological Reasoning and Shortcut Biases in Vision-Language Models

Haoyu Zhou, Qing Qing, Caichong Li, Qixin Zhang, Yongcheng Jing, Ziqi Xu, Juncheng Hu, Xikun Zhang, Renqiang Luo

Affiliations: College of Computer Science and Technology, Jilin University; College of Computing and Data Science, Nanyang Technological University; School of Computer Science, Wuhan University; School of Computing Technologies, RMIT University

AI summary: Presents a new benchmark with three dedicated datasets that evaluates how vision-language models reason about chronology within and across images, and shows that models often exploit surface cues such as color rather than genuine temporal features.

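AI-generated abstract

Recent vision-language models (VLMs) have made remarkable progress in interpreting complex visual semantics, yet their chronological reasoning ability remains underexplored. This paper introduces a novel benchmark designed to evaluate how VLMs perceive and reason about chronological information within and across images. Unlike existing video-based benchmarks, which focus on frame sequences, our work examines the basic logic of temporal judgment and its extension to multimodal integration. To this end, we construct three dedicated datasets: one containing visually similar objects spanning long historical periods, another categorized by different event and object types, and a third that pairs images with time-sensitive news text for cross-modal alignment. Through extensive experiments, we analyze whether models show performance differences across categories and, crucially, whether they rely on "false shortcuts" such as image color rather than genuine temporal features. Our results show that although VLMs show promise, they often exploit surface cues such as grayscale versus color filters to bypass genuine chronological reasoning. By providing these high-quality datasets and a rigorous evaluation framework, we offer a diagnostic tool for identifying current limitations and guiding the development of more robust and logically rigorous multimodal models. Source code is available at https://github.com/LuoRenqiang/ChronoVision.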

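Revisiting Prototype Rehearsal for Exemplar-Free Continual Learning: Manifold-Aware Boundary Sampling and Adaptive Class-Balanced Loss

Hongye Xu, Bartosz Krawczyk

Affiliations: Chester F. Carlson Center for Imaging Science; Rochester Institute of Technology

AI summary: For exemplar-free class-incremental learning, proposes manifold-aware boundary sampling and an adaptive class-balanced loss; generating boundary-aware rehearsal samples and dynamically adjusting class weights makes prototype rehearsal competitive again and reaches state-of-the-art performance.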

Comments Published in CVPR 2026 Findings. 10 pages, 6 figures. CVF version: https://openaccess.thecvf.com/content/CVPR2026F/html/Xu_Revisiting_Prototype_Rehearsal_for_Exemplar-Free_Continual_Learning_Manifold-Aware_Boundary_Sampling_CVPRF_2026_paper.html. Code: https://github.com/HXuSz11/ACB_CEOS_CVPR2026_Findings

Abstract

Exemplar-free class-incremental learning (EFCIL) aims to acquire new classes over time without storing raw data. Historically, prototype rehearsal, which samples around stored class prototypes and mixes them with current-task data, has been a popular strategy to reduce catastrophic forgetting. However, recent drift-compensation methods that explicitly realign prototypes in the evolving feature space consistently outperform prototype-based rehearsal, raising the question of whether rehearsal itself is fundamentally limited. We argue that the performance gap stems not from the idea of prototype rehearsal per se, but from how it is typically instantiated: existing approaches treat prototypes as isolated class summaries that ignore information from nearby enemy classes, and fail to correct the emerging class imbalance between a handful of synthetic old-class samples and hundreds of real instances from newly introduced classes. Building on this hypothesis, we revisit prototype rehearsal and propose a manifold-aware variant that restores its competitiveness in EFCIL. First, we introduce Constrained Expansive Over-Sampling, which interpolates each old-class prototype toward its nearest enemy features from new classes, generating boundary-aware rehearsal samples that better follow the underlying data manifold while preserving inter-class separation. Second, we design an Adaptive Class-Balanced loss that performs time-based class weighting, amplifying gradients from older prototypes when they are most informative and gradually annealing their influence as richer supervision from later tasks accumulates. Together, these components turn prototype rehearsal into a drift-resilient, imbalance-aware mechanism that closes, and often reverses, the gap to recent drift-compensation methods, achieving state-of-the-art performance across multiple EFCIL benchmarks.
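The interpolation step that this abstract attributes to Constrained Expansive Over-Sampling can be illustrated as follows. This is a sketch under stated assumptions, not the authors' algorithm: the cap `max_ratio` (kept below 0.5 so samples stay on the old-class side) is a guess at what "constrained" could mean, and `nearest_enemy` and `boundary_samples` are hypothetical names.

```python
import math
import random

def nearest_enemy(prototype, enemy_features):
    """Closest new-class feature to the old-class prototype (Euclidean)."""
    return min(enemy_features, key=lambda f: math.dist(prototype, f))

def boundary_samples(prototype, enemy_features, num_samples,
                     max_ratio=0.4, seed=0):
    """Interpolate the prototype part of the way toward its nearest enemy."""
    rng = random.Random(seed)
    enemy = nearest_enemy(prototype, enemy_features)
    samples = []
    for _ in range(num_samples):
        lam = rng.uniform(0.0, max_ratio)  # assumed cap: stay nearer the prototype
        samples.append([p + lam * (e - p) for p, e in zip(prototype, enemy)])
    return samples

old_prototype = [0.0, 0.0]
new_class_features = [[4.0, 0.0], [0.0, 1.0], [-5.0, 5.0]]

enemy = nearest_enemy(old_prototype, new_class_features)
samples = boundary_samples(old_prototype, new_class_features, num_samples=5)
print(enemy, samples)
```

Compared with sampling isotropic noise around the prototype, this places rehearsal samples along the direction where the old class is most likely to be confused with a new one, which is the boundary-aware behavior the abstract describes.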
