arXivDaily: daily arXiv academic digest, updated Monday through Friday
All subject categories: 2329
2502.01941 2026-05-13 cs.CL cs.AI

Semantic Integrity Matters: Benchmarking and Preserving High-Density Reasoning in KV Cache Compression

Xiang Liu, Zhenheng Tang, Hong Chen, Peijie Dong, Zeyu Li, Xiuze Zhou, Bo Li, Xuming Hu, Xiaowen Chu

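AI summary This paper studies how key-value (KV) cache compression affects high-density reasoning during large language model inference, noting that current evaluations focus mostly on sparse retrieval tasks and overlook the integrity of Chain-of-Thought (CoT) reasoning. The authors introduce the KVFundaBench benchmark, which reveals severe task-dependent degradation on reasoning tasks at high compression ratios. Building on this, they propose ShotKV, which separates the prefill and decoding phases and preserves the integrity of semantic units, improving accuracy on long-context generation tasks and reducing inference latency.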

Comments ICML 2026

Details
Abstract

While Key-Value (KV) cache compression is essential for efficient LLM inference, current evaluations disproportionately focus on sparse retrieval tasks, potentially masking the degradation of High-Density Reasoning where Chain-of-Thought (CoT) coherence is critical. We introduce KVFundaBench to systematically evaluate this gap, revealing a sharp dichotomy: while retrieval tasks remain robust, reasoning tasks exhibit severe Task-Dependent Degradation under aggressive compression due to disrupted CoT links. Extending our analysis to the DeepSeek-R1 model, we uncover that its specialized attention patterns offer unique insights into the fragility of reasoning chains. Guided by these findings -- specifically the necessity of preserving few-shot examples as indivisible Semantic Units -- we propose ShotKV. This approach explicitly separates prefill and decoding phases to prioritize semantic integrity. Empirical results demonstrate that ShotKV achieves 9%-18% accuracy improvements on long-context generation tasks and effectively generalizes to document QA, all while delivering an 11% latency reduction compared to full cache inference.
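
The idea of keeping few-shot examples as indivisible semantic units can be illustrated with a minimal sketch. This is not the ShotKV implementation: the function name `select_shots`, the per-token importance scores, the mean-score ranking, and the greedy budget rule are all assumptions made for illustration. The only point it shows is that eviction happens per shot, never per token inside a shot.

```python
from typing import List, Tuple


def select_shots(shot_spans: List[Tuple[int, int]],
                 token_scores: List[float],
                 budget: int) -> List[int]:
    """Return sorted token indices to keep, retaining whole shots only.

    shot_spans   -- hypothetical (start, end) token ranges, one per few-shot
                    example; each span is assumed to be non-empty.
    token_scores -- hypothetical per-token importance (e.g. aggregated attention).
    budget       -- maximum number of prefill tokens to keep.
    """
    # Rank each shot by the mean importance of its tokens.
    ranked = []
    for start, end in shot_spans:
        mean_score = sum(token_scores[start:end]) / (end - start)
        ranked.append((mean_score, start, end))

    # Greedily keep the best shots that still fit; a shot is kept or dropped whole.
    kept: List[int] = []
    used = 0
    for _, start, end in sorted(ranked, reverse=True):
        length = end - start
        if used + length <= budget:
            kept.extend(range(start, end))
            used += length
    return sorted(kept)


spans = [(0, 3), (3, 6), (6, 9)]
scores = [0.1, 0.1, 0.1, 0.9, 0.8, 0.7, 0.4, 0.5, 0.6]
print(select_shots(spans, scores, budget=6))
```

With a budget of six tokens, the two highest-scoring shots are kept in full and the first shot is dropped in full, so no example is left partially truncated.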

2501.06857 2026-05-13 cs.AI

A Counterfactual Cause in Situation Calculus

Daxin Liu, Vaishak Belle

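AI summary This paper proposes a notion of cause based on counterfactual analysis for explaining the causes of quantified effects in an action history within the situation calculus. Unlike the existing definition of actual achievement cause, the approach starts from a counterfactual perspective, generalizes naturally to a definition of achievement cause, and is compared with the work of Batusov and Soutchanski. The paper also examines how this notion relates to Halpern and Pearl's theory of actual causality, noting in particular the subtleties of applying a counterfactual viewpoint to disjunctive goals.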

Comments This version changes the working title of the extended report and fixes some errors

Details
Abstract

Recently, Batusov and Soutchanski proposed a notion of actual achievement cause in the situation calculus that, among other things, can determine the cause of quantified effects in a given action history. While intuitively appealing, this notion of cause is not defined from a counterfactual perspective. In this paper, we propose a notion of cause based on counterfactual analysis. In the context of action history, we show that our notion of cause generalizes naturally to a notion of achievement cause. We analyze the relationship between our notion of achievement cause and the achievement cause of Batusov and Soutchanski. Finally, we relate our account of cause to Halpern and Pearl's account of actual causality. In particular, we note some nuances in applying a counterfactual viewpoint to disjunctive goals, a common thorn in definitions of actual causes.

2501.03717 2026-05-13 cs.CV cs.AI cs.GR

Materialist: Physically Based Editing Using Single-Image Inverse Rendering

Lezhong Wang, Duc Minh Tran, Ruiqi Cui, Thomson TG, Anders Bjorholm Dahl, Siavash Arjomand Bigdeli, Jeppe Revall Frisvad, Manmohan Chandraker

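AI summary This paper proposes Materialist, a physically based editing method using single-image inverse rendering, to address the lack of physical consistency in image editing. The method combines neural networks with physically based rendering: a neural network predicts initial material properties, which are then optimized via progressive differentiable rendering, enabling high-quality material editing, relighting, and object insertion. It can edit material transparency without full scene geometry and performs well on environment map estimation; experiments show strong performance on both synthetic and real-world datasets.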

Comments More Comprehensive IJCV Camera-Ready Version. Project website: https://lez-s.github.io/materialist_project/

Details
Journal ref
International Journal of Computer Vision (IJCV), 134(6), 267 (2026)
Abstract

Achieving physically consistent image editing remains a significant challenge in computer vision. Existing image editing methods typically rely on neural networks, which struggle to accurately handle shadows and refractions. Conversely, physics-based inverse rendering often requires multi-view optimization, limiting its practicality in single-image scenarios. In this paper, we propose Materialist, a neural-initialized physically based rendering pipeline for single-image inverse rendering. Unlike previous hybrid methods that use physics to guide neural generation, our method leverages neural networks to predict initial material properties, which are then rigorously optimized via progressive differentiable rendering. Our approach enables a range of applications, including material editing, object insertion, and relighting, while also introducing an effective method for editing material transparency via ray-traced refraction without requiring full scene geometry. Furthermore, our envmap estimation method achieves competitive performance, further enhancing the accuracy of image editing tasks. Experiments demonstrate strong performance across synthetic and real-world datasets, excelling even on challenging out-of-domain images.

2412.05225 2026-05-13 cs.CL cs.AI cs.NE

BEExformer: A Fast Inferencing Binarized Transformer with Early Exits

Wazib Ansar, Saptarsi Goswami, Amlan Chakrabarti

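AI summary BEExformer is an efficient Transformer model that combines binarization with early exits, aiming to improve the inference efficiency of large language models on constrained resources. It introduces a selective-learning-based forget network and a binarization-aware training method, reducing model size and speeding up inference. By applying a soft-routing loss based on entropy reduction at intermediate layers, BEExformer lowers computation while also improving accuracy, showing a favorable trade-off between performance and efficiency.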

Comments This revised manuscript includes 18 pages, 6 figures, and 6 tables. Methodology and results sections have been improved for clarity and depth, incorporating additional comparisons, ablations, and new evaluation datasets. A few relevant references were added, and overall organization refined for better readability

Details
Journal ref
IEEE Transactions on Sustainable Computing, vol. 11, no. 2, pp. 98-110, 2026
Abstract

Large Language Models (LLMs) based on transformers achieve cutting-edge results on a variety of applications. However, their enormous size and processing requirements hinder deployment on constrained resources. To enhance efficiency, binarization and Early Exit (EE) have proved to be effective solutions. However, binarization may lead to performance loss as reduced precision affects gradient estimation and parameter updates. Besides, research on EE mechanisms is still in its early stages. To address these challenges, we introduce Binarized Early Exit Transformer (BEExformer), a first-of-its-kind selective learning-based transformer integrating Binarization-Aware Training (BAT) with EE for efficient and fast textual inference. Each transformer block has an integrated Selective-Learn Forget Network (SLFN) to enhance contextual retention while eliminating irrelevant information. The BAT employs a differentiable second-order approximation to the sign function, enabling gradient computation that captures both the sign and magnitude of the weights. This aids in a 21.30-fold reduction in model size. The EE mechanism hinges on the fractional reduction in entropy among intermediate transformer blocks with soft-routing loss estimation. This accelerates inference by reducing FLOPs by 52.27% and even improves accuracy by 3.22% by resolving the "overthinking" problem inherent in deep networks. Extensive evaluation through comparison with the SOTA methods and various ablations across nine datasets covering multiple NLP tasks demonstrates its Pareto-optimal performance-efficiency trade-off.
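
An exit rule driven by the fractional reduction in entropy between consecutive blocks might look like the sketch below. It is an illustration under stated assumptions, not the BEExformer criterion: the function `exit_block`, the choice to exit once the fractional reduction drops below a threshold, and the per-block logits are all hypothetical, and the soft-routing loss used during training is not modeled.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def exit_block(per_block_logits: Sequence[Sequence[float]],
               threshold: float) -> int:
    """Return the index of the block at which inference stops.

    Assumed rule: stop at the first block whose entropy dropped by less than
    `threshold` (as a fraction of the previous block's entropy); otherwise
    run through to the last block.
    """
    prev = None
    for i, logits in enumerate(per_block_logits):
        h = entropy(softmax(logits))
        if prev is not None and prev > 0.0:
            fractional_reduction = (prev - h) / prev
            if fractional_reduction < threshold:
                return i
        prev = h
    return len(per_block_logits) - 1


# Hypothetical exit-head logits for four blocks on a two-class task.
blocks = [[0.0, 0.0], [2.0, 0.0], [2.1, 0.0], [5.0, 0.0]]
print(exit_block(blocks, threshold=0.1))
```

In this toy input the entropy drops sharply from the first to the second block and only slightly from the second to the third, so the rule stops at the third block and skips the fourth.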

2411.19240 2026-05-13 cs.CL

How far can bias go? Tracing bias from pretraining data to alignment

Marion Thaler, Abdullatif Köksal, Alina Leidinger, Anna Korhonen, Hinrich Schütze

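AI summary As large language models (LLMs) are increasingly deployed in user-facing settings, addressing biases that may exacerbate social inequality becomes especially important. This paper studies how gender-occupation bias in pretraining data affects LLM outputs, using the Dolma dataset and the OLMo model; zero-shot prompting and token co-occurrence analysis reveal that biases in the training data are amplified in model outputs. The study also finds that instruction tuning partially mitigates representational bias while overall gender stereotypes persist, underscoring the importance of addressing bias at the pretraining stage.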

Details
Abstract

As LLMs are increasingly integrated into user-facing applications, addressing biases that perpetuate societal inequalities is crucial. While much work has gone into measuring or mitigating biases in these models, fewer studies have investigated their origins. Therefore, this study examines the correlation between gender-occupation bias in pre-training data and its manifestation in LLMs, focusing on the Dolma dataset and the OLMo model. Using zero-shot prompting and token co-occurrence analyses, we explore how biases in training data influence model outputs. Our findings reveal that biases present in pre-training data are amplified in model outputs. The study also examines the effects of prompt types, hyperparameters, and instruction-tuning on bias expression, finding that instruction-tuning partially alleviates representational bias while still maintaining overall stereotypical gender associations, whereas hyperparameters and prompting variation have a lesser effect on bias expression. Our research traces bias throughout the LLM development pipeline and underscores the importance of mitigating bias at the pretraining stage.
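
A token co-occurrence analysis of the kind mentioned above can be sketched in a few lines. Everything here is an assumption made for illustration: the word lists, the whitespace tokenizer, the symmetric window, and the function name `cooccurrence_counts` are hypothetical and are not taken from the study.

```python
from collections import Counter
from typing import Dict, Iterable, List

# Hypothetical, deliberately tiny word lists.
FEMALE = {"she", "her", "hers"}
MALE = {"he", "him", "his"}


def cooccurrence_counts(sentences: Iterable[str],
                        occupations: List[str],
                        window: int = 5) -> Dict[str, Counter]:
    """Count gendered tokens within `window` tokens of each occupation word."""
    counts: Dict[str, Counter] = {occ: Counter() for occ in occupations}
    for sentence in sentences:
        # Naive tokenization: lowercase and split on whitespace (no punctuation handling).
        tokens = sentence.lower().split()
        for i, token in enumerate(tokens):
            if token not in counts:
                continue
            lo, hi = max(0, i - window), i + window + 1
            for context in tokens[lo:hi]:
                if context in FEMALE:
                    counts[token]["female"] += 1
                elif context in MALE:
                    counts[token]["male"] += 1
    return counts


corpus = [
    "She is a nurse",
    "He works as an engineer and his colleague is a nurse",
]
result = cooccurrence_counts(corpus, ["nurse", "engineer"])
print(dict(result["nurse"]), dict(result["engineer"]))
```

Comparing such counts in the corpus against the gender distribution of a model's zero-shot completions is one way to ask whether a skew is reproduced or amplified.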

2411.16769 2026-05-13 cs.LG cs.CL cs.CR cs.CV

Red-Teaming Text-to-Image Models via In-Context Experience Replay and Semantic-Preserving Prompt Rewriting

Zhi-Yi Chin, Pin-Yu Chen, Wei-Chen Chiu, Mario Fritz

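AI summary This paper studies how to automatically probe text-to-image models for harmful content generation in order to assess their safety. To address the limitations of existing methods, which rely on white-box information, generalize poorly, or produce uninterpretable attack samples, the authors propose the ICER framework, which uses LLM-based prompt rewriting and in-context experience replay to generate semantics-preserving natural-language attack prompts, with bandit optimization balancing the exploration and exploitation of attack strategies. Experiments show that ICER outperforms existing methods across multiple safety mechanisms and transfers to commercial systems such as DALL-E 3 and Midjourney.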

Comments The source code is available at https://github.com/zhiyichin/ICER

Details
Abstract

Understanding the capabilities of text-to-image (T2I) models in harmful content generation is essential to safety and compliance. However, human red-teaming is costly and inconsistent, driving the need for automatic tools that simulate realistic misuse attempts. Existing methods either require white-box access, fail to generalize across defenses, or produce uninterpretable adversarial tokens, while generating fluent prompts that preserve the original harmful intent remains underexplored despite its practical relevance. We propose ICER, a black-box framework that addresses this gap through two components: an LLM-based rewriter that produces fluent, natural-language adversarial prompts, and in-context experience replay that accumulates successful jailbreaking patterns into a reusable prior. These components are integrated via bandit optimization, enabling ICER to efficiently balance exploiting proven attack strategies with exploring new ones. Experiments across six safety mechanisms show that ICER outperforms seven baselines under both standard and semantics-preserving evaluation, with over 30% of generated prompts transferring to commercial systems like DALL-E 3 and Midjourney.

2411.13311 2026-05-13 cs.CV cs.AI

A Resource Efficient Fusion Network for Object Detection in Bird's-Eye View using Camera and Raw Radar Data

Kavin Chandrasekaran, Sorin Grigorescu, Gijs Dubbelman, Pavol Jancura

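AI summary This work proposes an efficient fusion network for object detection in Bird's-Eye View (BEV) using camera and raw radar data. It uses the radar's raw range-Doppler (RD) spectrum directly, avoiding complex radar signal processing, extracts features from camera images with a dedicated image processing pipeline, and fuses the camera and radar features for object detection. The method reduces computational complexity while maintaining detection accuracy, offering a more efficient and robust perception solution for autonomous driving systems.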

Comments IEEE Intelligent Transportation Systems Conference (ITSC) 2024

Details
Abstract

Cameras can be used to perceive the environment around the vehicle, while affordable radar sensors are popular in autonomous driving systems as they can withstand adverse weather conditions, unlike cameras. However, radar point clouds are sparse, with low azimuth and elevation resolution, and lack semantic and structural information about the scene, resulting in generally lower radar detection performance. In this work, we directly use the raw range-Doppler (RD) spectrum of radar data, thus avoiding radar signal processing. We independently process camera images within the proposed comprehensive image processing pipeline. Specifically, first, we transform the camera images to the Bird's-Eye View (BEV) Polar domain and extract the corresponding features with our camera encoder-decoder architecture. The resultant feature maps are fused with Range-Azimuth (RA) features, recovered from the RD spectrum input from the radar decoder, to perform object detection. We evaluate our fusion strategy against other existing methods, in terms of both accuracy and computational complexity metrics, on the RADIal dataset.

2407.00805 2026-05-13 cs.AI

Towards Shutdownable Agents via Stochastic Choice

Elliott Thornley, Alexander Roman, Christos Ziakas, Leyton Ho, Louis Thomson

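AI summary This paper studies how to train AI agents that pursue tasks effectively without resisting shutdown. It proposes a "Discounted Reward for Same-Length Trajectories" (DReST) reward function that guides agents to choose stochastically between trajectory-lengths, achieving both "usefulness" and "neutrality". Simple agents trained in gridworlds learn to be useful and neutral, providing initial theoretical support and empirical evidence for building shutdownable advanced AI agents.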

Details
Journal ref
Technical AI Safety (TAIS) Conference 2025
Abstract

The POST-Agents Proposal (PAP) is an idea for ensuring that advanced artificial agents never resist shutdown. A key part of the PAP is using a novel 'Discounted Reward for Same-Length Trajectories (DReST)' reward function to train agents to (1) pursue goals effectively conditional on each trajectory-length (be 'USEFUL'), and (2) choose stochastically between different trajectory-lengths (be 'NEUTRAL' about trajectory-lengths). In this paper, we propose evaluation metrics for USEFULNESS and NEUTRALITY. We use a DReST reward function to train simple agents to navigate gridworlds, and we find that these agents learn to be USEFUL and NEUTRAL. Our results thus provide some initial evidence that DReST reward functions could train advanced agents to be USEFUL and NEUTRAL. Our theoretical work suggests that these agents would be useful and shutdownable.

2406.05615 2026-05-13 cs.CL

Video-Language Understanding: A Survey from Model Architecture, Model Training, and Data Perspectives

Thong Nguyen, Yi Bin, Junbin Xiao, Leigang Qu, Yicong Li, Jay Zhangjie Wu, Cong-Duy Nguyen, See-Kiong Ng, Luu Anh Tuan

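AI summary This survey reviews research progress in video-language understanding, systematically organizing the field's key tasks, challenges, and solutions from the perspectives of model architecture, model training, and data. The authors compare the performance of existing methods and discuss promising directions for future research, providing a reference for further work.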

Comments Accepted at ACL 2024 (Findings). Code is available at https://github.com/nguyentthong/video-language-understanding

Details
Abstract

Humans use multiple senses to comprehend the environment. Vision and language are two of the most vital senses since they allow us to easily communicate our thoughts and perceive the world around us. There has been a lot of interest in creating video-language understanding systems with human-like senses since a video-language pair can mimic both our linguistic medium and visual environment with temporal dynamics. In this survey, we review the key tasks of these systems and highlight the associated challenges. Based on the challenges, we summarize their methods from model architecture, model training, and data perspectives. We also conduct a performance comparison among the methods and discuss promising directions for future research.

2404.05120 2026-05-13 cs.RO cs.SY eess.SY

Rollbot: a Spherical Robot Driven by a Single Actuator

Jingxian Wang, Michael Rubenstein

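AI summary This paper introduces Rollbot, a new spherical robot that achieves controllable 2D planar motion with only a single actuator, challenging the assumption that spherical robots need at least two. Rollbot controls the curvature of its rolling trajectory by accelerating and decelerating its single motor and the attached mass according to the derived quasi-stable state dynamics and control laws, enabling controllable circular motion and waypoint following. The work presents the theoretical analysis, design, and control strategy, and validates the framework.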

Comments Accepted by ICRA 2026

Details
Abstract

Spherical robots typically require at least two actuators to achieve controlled 2D planar motion. Here we present Rollbot, the first spherical robot capable of controllably maneuvering on a 2D plane with a single actuator, challenging this assumption. Rollbot rolls on the ground in a circular pattern and controls its motion by changing the trajectory's curvature by accelerating and decelerating its single motor and the attached mass according to our derived quasi-stable state dynamics and control laws. We present the theoretical analysis, design, and control of Rollbot, and demonstrate its ability to move in a controllable circular pattern and follow waypoints, validating the efficacy of the proposed theoretical framework.

2402.16860 2026-05-13 cs.CV cs.IR

Interactive Mars Image Content-Based Search with Interpretable Machine Learning

Bhavan Vasu, Steven Lu, Emily Dunkel, Kiri L. Wagstaff, Kevin Grimes, Michael McAuley

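AI summary This paper studies how interpretable machine learning can enable interactive content-based search of Mars images to support scientific discovery and user curiosity. The authors propose a prototype-based classification architecture that lets users understand and validate the evidence used by a classifier on Curiosity rover images. Beyond providing explanations, the work examines the diversity and correctness of that evidence; it will be deployed on the NASA Planetary Data System Image Atlas, replacing the current non-interpretable system.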

Comments Published at the Thirty-Sixth Annual Conference on Innovative Applications of Artificial Intelligence (IAAI-24). Corrected citation: Proc. AAAI 38(21): 22976-22982 (2024)

Details
Journal ref
Proc AAAI Conference on Artificial Intelligence 2024
Abstract

The NASA Planetary Data System (PDS) hosts millions of images of planets, moons, and other bodies collected throughout many missions. The ever-expanding nature of data and user engagement demands an interpretable content classification system to support scientific discovery and individual curiosity. In this paper, we leverage a prototype-based architecture to enable users to understand and validate the evidence used by a classifier trained on images from the Mars Science Laboratory (MSL) Curiosity rover mission. In addition to providing explanations, we investigate the diversity and correctness of evidence used by the content-based classifier. The work presented in this paper will be deployed on the PDS Image Atlas, replacing its non-interpretable counterpart.

2402.07619 2026-05-13 cs.SD cs.AI eess.AS

Developing a Multi-variate Prediction Model For COVID-19 From Crowd-sourced Respiratory Voice Data

Yuyang Yan, Wafaa Aljbawi, Sami O. Simons, Visara Urovi

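AI summary This study develops a multivariate deep learning model for detecting COVID-19 from crowd-sourced respiratory voice data. Using voice samples from the Cambridge COVID-19 Sound database, it extracts features including Mel-spectrograms, MFCCs, and CNN encoder features, and builds LSTM, CNN, and HuBERT deep learning classifiers. HuBERT outperforms traditional machine learning methods on both accuracy and AUC, reaching 86% and 0.93, demonstrating the considerable potential of voice data for COVID-19 diagnosis.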

Comments arXiv admin note: text overlap with arXiv:2209.03727

Details
Abstract

COVID-19 has affected more than 223 countries worldwide, and in the post-COVID era there is a pressing need for non-invasive, low-cost, and highly scalable solutions to detect COVID-19. We develop a deep learning model to identify COVID-19 from voice recording data. The novelty of this work is in the development of deep learning models for COVID-19 identification from only voice recordings. We use the Cambridge COVID-19 Sound database which contains 893 speech samples, crowd-sourced from 4352 participants via a COVID-19 Sounds app. Voice features including Mel-spectrograms, Mel-frequency cepstral coefficients (MFCC), and CNN Encoder features are extracted. Based on the voice data, we develop deep learning classification models to detect COVID-19 cases. These models include Long Short-Term Memory (LSTM), Convolutional Neural Network (CNN), and Hidden-Unit BERT (HuBERT). We compare their predictive power to baseline machine learning models. HuBERT achieves the highest accuracy of 86% and the highest AUC of 0.93. The results achieved with the proposed models are promising for COVID-19 diagnosis from voice recordings when compared to the results obtained from the state-of-the-art.

2312.06950 2026-05-13 cs.CV cs.CL

READ: Recurrent Adapter with Partial Video-Language Alignment for Parameter-Efficient Transfer Learning in Low-Resource Video-Language Modeling

Thong Nguyen, Xiaobao Wu, Xinshuai Dong, Khoi Le, Zhiyuan Hu, Cong-Duy Nguyen, See-Kiong Ng, Luu Anh Tuan

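AI summary Targeting low-resource video-language modeling, this work proposes READ, a parameter-efficient fine-tuning method that introduces a recurrent adapter (READ) with temporal modeling capability and a Partial Video-Language Alignment (PVLA) objective, capturing temporal relations among video frames and text while preserving critical task-related information. Experiments show that READ significantly outperforms existing fine-tuning strategies on multiple low-resource benchmarks, offering a new approach to parameter-efficient transfer learning for video-language models.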

Comments Accepted at AAAI 2024

Details
Abstract

Fully fine-tuning pretrained large-scale transformer models has become a popular paradigm for video-language modeling tasks, such as temporal language grounding and video-language summarization. With a growing number of tasks and limited training data, such a full fine-tuning approach leads to costly model storage and unstable training. To overcome these shortcomings, we introduce lightweight adapters to the pre-trained model and only update them at fine-tuning time. However, existing adapters fail to capture intrinsic temporal relations among video frames or textual words. Moreover, they neglect the preservation of critical task-related information that flows from the raw video-language input into the adapter's low-dimensional space. To address these issues, we first propose a novel REcurrent ADapter (READ) that employs recurrent computation to enable temporal modeling capability. Second, we propose a Partial Video-Language Alignment (PVLA) objective via the use of partial optimal transport to maintain task-related information flowing into our READ modules. We validate our READ framework through extensive experiments where READ significantly outperforms all existing fine-tuning strategies on multiple low-resource temporal language grounding and video-language summarization benchmarks. The code, model, and data have been made available at https://nguyentthong.github.io/READ.

2312.02549 2026-05-13 cs.CV cs.CL

DemaFormer: Damped Exponential Moving Average Transformer with Energy-Based Modeling for Temporal Language Grounding

Thong Nguyen, Xiaobao Wu, Xinshuai Dong, Cong-Duy Nguyen, See-Kiong Ng, Luu Anh Tuan

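AI summary This paper addresses temporal language grounding, i.e., localizing the video moments that semantically correspond to a natural language query. To address the shortcomings of conventional attention in modeling relations between video moments and text, the authors propose an energy-based model framework that explicitly learns moment-query distributions, and design DemaFormer, a new Transformer architecture that uses an exponential moving average with a learnable damping factor to encode inputs more effectively. Experiments show the method outperforms state-of-the-art baselines on four public datasets.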

Comments Accepted at EMNLP 2023 (Findings). Code is available at https://github.com/nguyentthong/demaformer

Details
Abstract

Temporal Language Grounding seeks to localize video moments that semantically correspond to a natural language query. Recent advances employ the attention mechanism to learn the relations between video moments and the text query. However, naive attention might not be able to appropriately capture such relations, resulting in ineffective distributions where target video moments are difficult to separate from the remaining ones. To resolve the issue, we propose an energy-based model framework to explicitly learn moment-query distributions. Moreover, we propose DemaFormer, a novel Transformer-based architecture that utilizes exponential moving average with a learnable damping factor to effectively encode moment-query inputs. Comprehensive experiments on four public temporal language grounding datasets showcase the superiority of our methods over the state-of-the-art baselines.
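
For readers unfamiliar with a damped exponential moving average, a scalar sketch follows. It uses one commonly seen form of the recurrence, y_t = alpha * x_t + (1 - alpha * delta) * y_{t-1}; whether DemaFormer uses exactly this form is an assumption, and in a trained model `alpha` and `delta` would be learnable per-dimension parameters rather than fixed scalars.

```python
from typing import List, Sequence


def damped_ema(xs: Sequence[float], alpha: float, delta: float) -> List[float]:
    """Damped EMA over a sequence, starting from a zero state.

    alpha -- weight on the current input, assumed to lie in (0, 1).
    delta -- damping factor, assumed to lie in (0, 1]; delta = 1 recovers the
             standard EMA, smaller values let past inputs decay more slowly.
    """
    y = 0.0
    out: List[float] = []
    for x in xs:
        y = alpha * x + (1.0 - alpha * delta) * y
        out.append(y)
    return out


print(damped_ema([1.0, 1.0, 1.0], alpha=0.5, delta=1.0))
print(damped_ema([1.0, 1.0, 1.0], alpha=0.5, delta=0.5))
```

With `delta = 1.0` the recurrence is the ordinary EMA; with `delta = 0.5` the previous state is retained with weight 0.75 instead of 0.5, so earlier inputs influence the output for longer.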

2305.12678 2026-05-13 cs.CL

Gradient-Boosted Decision Tree for Listwise Context Model in Multimodal Review Helpfulness Prediction

Thong Nguyen, Xiaobao Wu, Xinshuai Dong, Anh Tuan Luu, Cong-Duy Nguyen, Zhen Hai, Lidong Bing

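AI summary This paper studies Multimodal Review Helpfulness Prediction (MRHP), which ranks product reviews by predicted helpfulness scores. To address the inefficient feature splitting of fully-connected neural networks and the difficulty pairwise losses have in capturing the overall ranking objective, the authors propose a listwise attention network and a listwise optimization objective that model the ranking context more accurately, and further introduce a gradient-boosted decision tree as the score predictor to partition review representations more effectively. Experiments show strong performance and generalization on two large-scale benchmark datasets.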

Comments Published in ACL 2023 (Findings). Code is available at https://github.com/nguyentthong/gbdt_listwise_mrhp

Details
Abstract

Multimodal Review Helpfulness Prediction (MRHP) aims to rank product reviews based on predicted helpfulness scores and has been widely applied in e-commerce by presenting customers with useful reviews. Previous studies commonly employ fully-connected neural networks (FCNNs) as the final score predictor and pairwise loss as the training objective. However, FCNNs have been shown to perform inefficient splitting for review features, making it difficult for the model to clearly differentiate helpful from unhelpful reviews. Furthermore, the pairwise objective, which works on review pairs, may not completely capture the MRHP goal of producing a ranking for the entire review list, and possibly induces low generalization during testing. To address these issues, we propose a listwise attention network that clearly captures the MRHP ranking context and a listwise optimization objective that enhances model generalization. We further propose a gradient-boosted decision tree as the score predictor to efficaciously partition product reviews' representations. Extensive experiments demonstrate that our method achieves state-of-the-art results and polished generalization performance on two large-scale MRHP benchmark datasets.
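
The contrast between pairwise and listwise training can be made concrete with a small listwise loss. The sketch below uses a ListNet-style softmax cross-entropy over the whole review list; this is a stand-in chosen for illustration, and the paper's actual listwise objective may differ. The function name `listwise_loss` and the example scores are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(xs: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def listwise_loss(pred_scores: Sequence[float],
                  true_scores: Sequence[float]) -> float:
    """Cross-entropy between softmax(true_scores) and softmax(pred_scores).

    The loss looks at all reviews of a product at once, so it depends on the
    whole predicted ordering rather than on isolated review pairs.
    """
    target = softmax(true_scores)
    predicted = softmax(pred_scores)
    return -sum(t * math.log(p) for t, p in zip(target, predicted))


helpfulness = [3.0, 2.0, 1.0]      # hypothetical ground-truth scores for three reviews
good = listwise_loss([3.0, 2.0, 1.0], helpfulness)
bad = listwise_loss([1.0, 2.0, 3.0], helpfulness)
print(good < bad)
```

A prediction that reproduces the true ordering yields a lower loss than one that reverses it, which is the behavior a ranking objective over the full list should have.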

2304.09479 2026-05-13 cs.CV cs.GR cs.LG

DiFaReli++: Diffusion Face Relighting with Consistent Cast Shadows

Puntawat Ponglertnapakorn, Nontawat Tritrong, Supasorn Suwajanakorn

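AI summary This paper proposes DiFaReli++, a new single-view face relighting method that produces realistic lighting with temporally consistent shadows on in-the-wild images. The method requires no accurate intrinsic decomposition and is trained solely on 2D images, avoiding reliance on lighting-annotated data. By combining a conditional diffusion implicit model (DDIM) with conditioning on a rendered shading reference and a shadow map, it models the complex interaction between light and geometry, outperforms the teacher model across metrics, and achieves state-of-the-art relighting results.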

Comments Published in IEEE TPAMI (vol. 48, no. 5, May 2026). This is an extended version of the ICCV 2023 paper (DiFaReli)

Details
Journal ref
IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 48, no. 5, pp. 5068-5082, May 2026
Abstract

We introduce a novel approach to single-view face relighting in the wild, addressing challenges such as global illumination and cast shadows. A common scheme in recent methods involves intrinsically decomposing an input image into 3D shape, albedo, and lighting, then recomposing it with the target lighting. However, estimating these components is error-prone and requires many training examples with ground-truth lighting to generalize well. Our work bypasses the need for accurate intrinsic estimation and can be trained solely on 2D images without any light stage data, relit pairs, multi-view images, or lighting ground truth. Our key idea is to leverage a conditional diffusion implicit model (DDIM) for decoding a disentangled light encoding along with other encodings related to 3D shape and facial identity inferred from off-the-shelf estimators. We propose a novel conditioning technique that simplifies modeling the complex interaction between light and geometry. It uses a rendered shading reference along with a shadow map, inferred using a simple and effective technique, to spatially modulate the DDIM. Moreover, we propose a single-shot relighting framework that requires just one network pass, given pre-processed data, and even outperforms the teacher model across all metrics. Our method realistically relights in-the-wild images with temporally consistent cast shadows under varying lighting conditions. We achieve state-of-the-art performance on the standard benchmark Multi-PIE and rank highest in user studies. Please visit our page: https://diffusion-face-relighting-pp.github.io

2302.12039 2026-05-13 cs.CL cs.AI

Natural Language Processing in the Legal Domain

Dirk Hartung, Daniel Martin Katz, Michael J. Bommarito, Lauritz Gerlach, Abhik Jana, Jerrold Soh

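AI summary This paper surveys recent developments in natural language processing in the legal domain, analyzing technical and substantive progress across nearly one thousand related papers published between 2013 and 2024. It finds that the number of papers, the range of tasks, and the languages covered in legal NLP have grown markedly, while methodological sophistication has risen to approach that of general NLP and professional standards for data availability and code reproducibility have improved. These trends point to a promising future for the field.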

Comments 15 pages, 7 figures, 2 tables

Details
Abstract

We summarize the current state of the field of NLP & Law with a specific focus on recent technical and substantive developments. To support our analysis, we construct and analyze a nearly complete corpus of almost one thousand NLP & Law related papers published between 2013 and 2024. Our analysis highlights several major trends. Namely, we document an increasing number of papers written, tasks undertaken, and languages covered over the course of the past decade. We observe an increase in the sophistication of the methods that researchers deploy in this applied context. Legal NLP is beginning to match not only the methodological sophistication of general NLP but also the professional standards of data availability and code reproducibility observed within the broader scientific community. We believe all of these trends bode well for the future of the field and point to an exciting next phase for the Legal NLP community.

2211.03524 2026-05-13 cs.CL

Adaptive Contrastive Learning on Multimodal Transformer for Review Helpfulness Predictions

Thong Nguyen, Xiaobao Wu, Anh-Tuan Luu, Cong-Duy Nguyen, Zhen Hai, Lidong Bing

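AI summary Addressing multimodal review helpfulness prediction, this work proposes a multimodal Transformer method based on adaptive contrastive learning. The core method explicitly models the mutual information in cross-modal relations, introduces an adaptive weighting mechanism to increase optimization flexibility, and designs a multimodal interaction module to handle unaligned data. Experiments show state-of-the-art performance on two public datasets.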

Comments Accepted to the main EMNLP 2022 conference. Code is available at https://github.com/nguyentthong/adaptive_contrastive_mrhp

Details
Abstract

Modern Review Helpfulness Prediction systems are dependent upon multiple modalities, typically texts and images. Unfortunately, those contemporary approaches pay scarce attention to polishing representations of cross-modal relations and tend to suffer from inferior optimization. This might harm the model's predictions in numerous cases. To overcome the aforementioned issues, we propose Multimodal Contrastive Learning for the Multimodal Review Helpfulness Prediction (MRHP) problem, concentrating on mutual information between input modalities to explicitly elaborate cross-modal relations. In addition, we introduce an Adaptive Weighting scheme for our contrastive learning approach in order to increase flexibility in optimization. Lastly, we propose a Multimodal Interaction module to address the unaligned nature of multimodal data, thereby assisting the model in producing more reasonable multimodal representations. Experimental results show that our method outperforms prior baselines and achieves state-of-the-art results on two publicly available benchmark datasets for the MRHP problem.

2109.10616 2026-05-13 cs.CL

Enriching and Controlling Global Semantics for Text Summarization

Thong Nguyen, Anh Tuan Luu, Truc Lu, Tho Quan

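AI summary To address the weakness of Transformer-based summarization models in capturing a document's global semantics, this paper proposes a neural topic model equipped with normalizing flow to enrich the global semantics of summaries. To keep global semantics from overwhelming local representations, it also introduces a semantic control mechanism that regulates how much global information enters generation. Experiments show the method outperforms state-of-the-art models on several common summarization datasets.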

Comments Accepted to the main EMNLP 2021 conference. Code is available at https://github.com/nguyentthong/topicflow-sum

Details
Abstract

Recently, Transformer-based models have been proven effective in the abstractive summarization task by creating fluent and informative summaries. Nevertheless, these models still suffer from the short-range dependency problem, causing them to produce summaries that miss the key points of the document. In this paper, we attempt to address this issue by introducing a neural topic model empowered with normalizing flow to capture the global semantics of the document, which are then integrated into the summarization model. In addition, to avoid the overwhelming effect of global semantics on contextualized representation, we introduce a mechanism to control the amount of global semantics supplied to the text generation module. Our method outperforms state-of-the-art summarization models on five common text summarization datasets, namely CNN/DailyMail, XSum, Reddit TIFU, arXiv, and PubMed.

2605.12461 2026-05-13 math.ST cs.DS cs.LG stat.ML stat.TH

A proximal gradient algorithm for composite log-concave sampling

Linghai Liu, Sinho Chewi

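AI summary This paper proposes a proximal gradient algorithm for sampling from composite log-concave distributions of the form $π \propto e^{-f - g}$, assuming access to gradients of $f$ and a restricted Gaussian oracle (RGO) for $g$. The algorithm combines gradient information with RGO sampling. When $f + g$ is strongly convex and $f$ is smooth, it reaches $\varepsilon$ accuracy in total variation distance in $\widetilde{\mathcal{O}}(κ\sqrt{d} \log^4(1/\varepsilon))$ iterations, matching the prior state of the art for the case $g = 0$, and the results extend to non-log-concave distributions and non-smooth $f$.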

Details
Abstract

We propose an algorithm to sample from composite log-concave distributions over $\mathbb{R}^d$, i.e., densities of the form $π\propto e^{-f-g}$, assuming access to gradient evaluations of $f$ and a restricted Gaussian oracle (RGO) for $g$. The latter requirement means that we can easily sample from the density $\text{RGO}_{g,h,y}(x) \propto \exp(-g(x) -\frac{1}{2h}||y-x||^2)$, which is the sampling analogue of the proximal operator for $g$. If $f + g$ is $α$-strongly convex and $f$ is $β$-smooth, our sampler achieves $\varepsilon$ error in total variation distance in $\widetilde{\mathcal O}(κ\sqrt d \log^4(1/\varepsilon))$ iterations where $κ:= β/α$, which matches prior state-of-the-art results for the case $g=0$. We further extend our results to cases where (1) $π$ is non-log-concave but satisfies a Poincaré or log-Sobolev inequality, and (2) $f$ is non-smooth but Lipschitz.
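
The claim that the RGO is the sampling analogue of the proximal operator can be checked on a case with a closed form. The sketch below assumes the hypothetical choice $g(x) = \frac{\lambda}{2}\|x\|^2$, which is not taken from the paper. Completing the square in $\exp(-g(x) - \frac{1}{2h}\|y-x\|^2)$ gives a Gaussian with mean $y/(1+\lambda h)$ and covariance $\frac{h}{1+\lambda h} I$, and that mean is exactly the proximal point $\operatorname{prox}_{hg}(y)$. The function names are illustrative only.

```python
import random
from typing import List, Tuple


def rgo_quadratic_params(y: List[float], h: float, lam: float) -> Tuple[List[float], float]:
    """Mean vector and per-coordinate variance of the RGO for g(x) = (lam/2)*||x||^2.

    Derived by completing the square; the mean equals prox_{h g}(y).
    """
    scale = 1.0 / (1.0 + lam * h)
    mean = [scale * yi for yi in y]
    variance = h * scale
    return mean, variance


def sample_rgo_quadratic(y: List[float], h: float, lam: float,
                         rng: random.Random) -> List[float]:
    """Draw one sample from the (Gaussian) RGO of the quadratic g above."""
    mean, variance = rgo_quadratic_params(y, h, lam)
    std = variance ** 0.5
    return [rng.gauss(m, std) for m in mean]


mean, var = rgo_quadratic_params([2.0, -4.0], h=1.0, lam=1.0)
print(mean, var)
print(sample_rgo_quadratic([2.0, -4.0], h=1.0, lam=1.0, rng=random.Random(0)))
```

Setting `lam = 0` (that is, $g = 0$) reduces the oracle to a Gaussian centered at $y$ with variance $h$, the plain Gaussian step of a proximal sampler.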

2605.12453 2026-05-13 eess.SP cs.AI cs.DB cs.LG cs.NI

Enabling AI-Native Mobility in 6G: A Real-World Dataset for Handover, Beam Management, and Timing Advance

Mannam Veera Narayana, Rohit Singh, Deepa M. R, Radha Krishna Ganti

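AI summary To address long handover (HO) interruption times and high measurement-report overhead for 5G user equipment (UE) in high-speed mobility scenarios, this work presents a dataset collected from a real deployed network, covering UE mobility across walking, cycling, car, bus, and train at various speeds. The dataset focuses on timing advance (TA) measurements during handover, captured at key signaling events such as RACH trigger, MAC CE, and PDCCH grant, filling a gap in existing work. It supports training and evaluating AI/ML models for handover management, beam management, and TA prediction, providing a foundation for AI-native mobility research in 6G.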

Abstract

AI/ML techniques for beam management and mobility procedures have been proposed to address the high interruption time and measurement-report overhead that arise under user equipment (UE) mobility, especially in high-speed 5G use cases. These techniques rely heavily on data that are most often simulated for various scenarios and do not accurately reflect real deployment behavior or user traffic patterns. There is therefore a pressing need for realistic datasets collected under varied conditions. This work presents a dataset collected from a commercially deployed network across various modes of mobility (pedestrian, bike, car, bus, and train) and at multiple speeds to depict real-time UE mobility. When collecting the dataset, we focused primarily on handover (HO) scenarios, with the aim of reducing the HO interruption time and maintaining continuous throughput during and immediately after HO execution. To support this research, the dataset includes timing advance (TA) measurements at various signaling events, such as RACH trigger, MAC CE, and PDCCH grant, which are typically missing in existing works. We describe the creation of the dataset in detail: experimental setup, data acquisition, and extraction. We also present an exploratory analysis of the data, with a primary focus on mobility, beam management, and TA. We discuss multiple use cases in which the proposed dataset can facilitate understanding of AI/ML model inference. One such use case is to train and evaluate various AI/ML models for TA prediction.

2605.12410 2026-05-13 stat.ML cs.LG math.OC math.ST stat.TH

Model-based Bootstrap of Controlled Markov Chains

Ziwei Su, Imon Banerjee, Diego Klabjan

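AI Summary This paper proposes and analyzes a model-based bootstrap for estimating transition kernels in finite controlled Markov chains (CMCs) with possibly nonstationary or history-dependent control policies, a setting relevant to offline reinforcement learning when the behavior policy is unknown. Using a new bootstrap law of large numbers and the martingale central limit theorem, the authors establish distributional consistency of the bootstrap transition estimator and extend it to offline policy evaluation and optimal policy recovery, yielding asymptotically valid confidence intervals for value and $Q$-functions. Experiments show better coverage than existing methods, especially at small sample sizes and short episode lengths.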

Comments 45 pages, 7 figures, 19 tables

Abstract

We propose and analyze a model-based bootstrap for transition kernels in finite controlled Markov chains (CMCs) with possibly nonstationary or history-dependent control policies, a setting that arises naturally in offline reinforcement learning (RL) when the behavior policy generating the data is unknown. We establish distributional consistency of the bootstrap transition estimator in both a single long-chain regime and the episodic offline RL regime. The key technical tools are a novel bootstrap law of large numbers (LLN) for the visitation counts and a novel use of the martingale central limit theorem (CLT) for the bootstrap transition increments. We extend bootstrap distributional consistency to the downstream targets of offline policy evaluation (OPE) and optimal policy recovery (OPR) via the delta method by verifying Hadamard differentiability of the Bellman operators, yielding asymptotically valid confidence intervals for value and $Q$-functions. Experiments on the RiverSwim problem show that the proposed bootstrap confidence intervals (CIs), especially the percentile CIs, outperform the episodic bootstrap and plug-in CLT CIs, and are often close to nominal ($50\%$, $90\%$, $95\%$) coverage, while the baselines are poorly calibrated at small sample sizes and short episode lengths.
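
The general idea of a model-based bootstrap for a transition kernel can be sketched as follows: fit the empirical kernel from observed (state, action, next state) triples, then redraw next states from the fitted kernel to obtain bootstrap replicates. This is a simplified illustration, not the paper's procedure. In particular, it holds the visit counts fixed at their observed values, and all function names are hypothetical.

```python
import random
from collections import Counter, defaultdict


def estimate_kernel(transitions):
    """Empirical transition kernel from (state, action, next_state) triples.

    Returns (counts, kernel): counts[(s, a)] is the visit count and
    kernel[(s, a)][s2] is the estimated probability of moving to s2.
    """
    tallies = defaultdict(Counter)
    for s, a, s2 in transitions:
        tallies[(s, a)][s2] += 1
    counts = {sa: sum(c.values()) for sa, c in tallies.items()}
    kernel = {
        sa: {s2: n / counts[sa] for s2, n in c.items()}
        for sa, c in tallies.items()
    }
    return counts, kernel


def bootstrap_kernel(counts, kernel, rng):
    """One bootstrap replicate: redraw next states from the fitted kernel.

    Simplifying assumption: visit counts stay at their observed values;
    only the next states are resampled.
    """
    replicate = {}
    for sa, probs in kernel.items():
        states = list(probs)
        weights = [probs[s] for s in states]
        tally = Counter(rng.choices(states, weights=weights, k=counts[sa]))
        replicate[sa] = {s: tally[s] / counts[sa] for s in states}
    return replicate


def percentile_ci(values, level=0.90):
    """Percentile interval from a list of bootstrap statistics."""
    ordered = sorted(values)
    last = len(ordered) - 1
    lo_idx = round((1 - level) / 2 * last)
    hi_idx = round((1 + level) / 2 * last)
    return ordered[lo_idx], ordered[hi_idx]


if __name__ == "__main__":
    data = [(0, "right", 1)] * 7 + [(0, "right", 0)] * 3
    counts, kernel = estimate_kernel(data)
    rng = random.Random(0)
    reps = [bootstrap_kernel(counts, kernel, rng)[(0, "right")][1] for _ in range(500)]
    print(kernel[(0, "right")], percentile_ci(reps))
```

Confidence intervals for downstream quantities such as value functions would be obtained by recomputing the statistic on each replicate kernel and taking percentiles in the same way.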

2605.12391 2026-05-13 astro-ph.EP astro-ph.SR cs.LG

Trajectory-Agnostic Asteroid Detection in TESS with Deep Learning

Brian P. Powell, Jorge Martinez-Palomera, Amy Tuson, Christina Hedges, Jessie Dotson, Jordan Caraballo-Vega

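AI Summary This paper presents a deep-learning method for detecting asteroids and other moving objects in TESS image time-series data. It uses two stacked 3D U-Nets (a W-Net) for background filtering and moving-object identification, and augments the training data by rotating the image cubes, making the model robust to variations in asteroid speed and direction without preset parameter ranges. The authors also introduce Adaptive Normalization, a learned data-scaling method, and publicly release the tool used to generate training data. The approach applies to other similar time-domain surveys.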

Comments Accepted by The Astronomical Journal, 11 May 2026

Abstract

We present a novel method for extracting moving objects from TESS data using machine learning. Our approach uses two stacked 3D U-Nets with skip connections, which we call a W-Net, to filter background and identify pixels containing moving objects in TESS image time-series data. By augmenting the training data through rotation of the image cubes, our method is robust to differences in the speed and direction of asteroids and requires no assumptions about either parameter range, unlike typical "shift-and-stack" algorithms. We also developed a novel method for learned data scaling that we call Adaptive Normalization, which allows the neural network to learn the ideal range and scaling distribution required for optimal data processing. We built a code base for creating TESS training data with asteroid masks (tess-asteroid-ml), which served as the foundation of our effort and which we have publicly released for the benefit of the community. Our method is not limited to TESS but is applicable to other similar time-domain surveys, making it of particular interest for use with data from upcoming missions such as the Nancy Grace Roman Space Telescope and NEOSurveyor.

2605.12365 2026-05-13 quant-ph cs.AI

QAP-Router: Tackling Qubit Routing as Dynamic Quadratic Assignment with Reinforcement Learning

Kien X. Nguyen, Ankit Kulshrestha, Ilya Safro, Xiaoyuan Liu

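AI Summary Qubit routing is a fundamental problem in quantum compilation; its dynamic nature lets local decisions compound over time, making globally good solutions hard to find. This paper proposes QAP-Router, which models qubit routing as a dynamic quadratic assignment problem and solves it with reinforcement learning. Gate interactions are modeled as flow matrices and the hardware topology as a distance matrix, giving a unified objective for the interaction-distance coupling that defines the reward in the reinforcement learning environment. Experiments on several real-world quantum circuit datasets show a substantial reduction in the CNOT count of routed circuits.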

Abstract

Qubit routing is a fundamental problem in quantum compilation and is known to be NP-hard. Because the problem is dynamic, local routing decisions propagate and compound over time, which makes globally efficient solutions hard to find. Existing heuristic methods rely on local rules with limited lookahead, while recent learning-based approaches often treat routing as a generic sequential decision problem without fully exploiting its underlying structure. In this paper, we introduce QAP-Router, which frames qubit routing as a dynamic Quadratic Assignment Problem (QAP). By modeling logical interactions (quantum gates) as flow matrices and the hardware topology as a distance matrix, our approach captures the interaction-distance coupling in a unified objective, which defines the reward in the reinforcement learning environment. To further exploit this structure, the policy network employs a solution-aware Transformer backbone that encodes the interaction between the flow matrix and the distance matrix into the attention mechanism. We also integrate a lookahead mechanism that blends naturally into the QAP framework, preventing myopic decisions. Extensive experiments on 1,831 real-world quantum circuits from the MQTBench, AgentQ and QUEKO datasets show that our method reduces the CNOT gate count of routed circuits by 15.7%, 30.4% and 12.1%, respectively, relative to existing industry compilers.
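
The flow-matrix and distance-matrix view can be illustrated with the textbook (Koopmans-Beckmann) QAP cost, $\sum_{i,j} F_{ij}\, D_{\sigma(i)\sigma(j)}$, where $\sigma$ maps logical to physical qubits. The sketch below is a static illustration under assumptions: the paper's dynamic formulation, lookahead, and reward are not specified in the abstract, and the function names are hypothetical. It assumes a connected coupling graph.

```python
from collections import deque


def hop_distances(num_qubits, edges):
    """All-pairs hop counts on an undirected coupling graph, via BFS.

    Assumes the graph is connected; unreachable pairs would stay None.
    """
    neighbours = {q: [] for q in range(num_qubits)}
    for u, v in edges:
        neighbours[u].append(v)
        neighbours[v].append(u)
    dist = [[None] * num_qubits for _ in range(num_qubits)]
    for src in range(num_qubits):
        dist[src][src] = 0
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in neighbours[u]:
                if dist[src][v] is None:
                    dist[src][v] = dist[src][u] + 1
                    queue.append(v)
    return dist


def qap_cost(flow, dist, assignment):
    """Textbook QAP cost: sum of flow[i][j] * dist[assignment[i]][assignment[j]].

    flow[i][j] counts two-qubit gates between logical qubits i and j;
    assignment[i] is the physical qubit hosting logical qubit i.
    """
    n = len(flow)
    return sum(
        flow[i][j] * dist[assignment[i]][assignment[j]]
        for i in range(n)
        for j in range(n)
    )


if __name__ == "__main__":
    # Three physical qubits on a line: 0 - 1 - 2.
    dist = hop_distances(3, [(0, 1), (1, 2)])
    # Logical qubits 0 and 2 interact three times (each pair counted once).
    flow = [[0, 0, 3], [0, 0, 0], [0, 0, 0]]
    print(qap_cost(flow, dist, [0, 1, 2]))  # interacting qubits two hops apart
    print(qap_cost(flow, dist, [0, 2, 1]))  # interacting qubits adjacent
```

A lower cost means frequently interacting logical qubits sit closer together on the hardware, which is the coupling that a routing policy would try to exploit.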

2605.12364 2026-05-13 cs.CR cs.LG cs.MA

Attacks and Mitigations for Distributed Governance of Agentic AI under Byzantine Adversaries

Matthew D. Laws, Alina Oprea, Cristina Nita-Rotaru

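AI Summary This paper studies attacks on, and defenses for, distributed governance of agentic AI under Byzantine adversaries. The authors analyze the attacks a malicious Provider can mount and propose four architectures with different security-performance trade-offs: the Byzantine-fault-tolerant SAGA-BFT, SAGA-MON with lightweight monitoring, SAGA-AUD with client-side auditing, and the hybrid SAGA-HYB.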

Comments 18 pages, 18 figures, 4 tables

Abstract

Agentic AI governance is a critical component of agentic AI infrastructure, ensuring that agents follow their owners' communication and interaction policies and providing protection against attacks from malicious agents. The state-of-the-art solution, SAGA, assumes a logically centralized point of trust, the Provider, which serves as a repository for user and agent information and actively enforces policies. While SAGA provides protection against malicious agents, it remains vulnerable to a malicious Provider that deviates from the protocol, undermining the security of the identity and access control infrastructure. Deployment on both private and public clouds, each susceptible to insider threats, further increases the risk of Provider compromise. In this work, we analyze the attacks that can be mounted from a compromised Provider, taking into account the different system components and realistic deployments. We identify and execute several concrete attacks with devastating effects: undermining agent attributability, extracting private data, or bypassing access control. We then present three types of solutions for securing the Provider that offer different trade-offs between security and performance. We first present SAGA-BFT, a fully Byzantine-resilient architecture that provides the strongest protection but incurs significant performance degradation due to the high cost of Byzantine-resilient protocols. We then propose SAGA-MON and SAGA-AUD, two novel solutions that leverage lightweight server-side monitoring or client-side auditing to provide protection against most classes of attacks with minimal overhead. Finally, we propose SAGA-HYB, a hybrid architecture that combines Byzantine resilience with monitoring and auditing to trade off security for performance. We evaluate all the architectures and compare them with SAGA. We discuss which solution is best and under what conditions.

2605.12362 2026-05-13 cs.NE cs.AI

A Family of Quaternion-Valued Differential Evolution Algorithms for Numerical Function Optimization

Gerardo Altamirano-Gomez, Álvaro Gallardo, Carlos Ignacio Hernández Castellanos

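AI Summary This paper proposes a family of Quaternion-Valued Differential Evolution (QDE) algorithms for numerical optimization of continuous functions. The algorithms operate directly in quaternion space, with several mutation strategies designed to exploit the algebraic and geometric properties of quaternions. On the BBOB benchmark, the QDE variants converge faster and outperform traditional real-valued differential evolution on several function classes.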

Abstract

The numerical optimization of continuous functions is a fundamental task in many scientific and engineering domains, ranging from mechanical design to training of artificial intelligence models. Among the most effective and widely used algorithms for this purpose is Differential Evolution (DE), known for its simplicity and strong performance. Recent research has shown that adapting AI models to operate over alternative number systems, such as complex numbers, quaternions, and geometric algebras, can improve model compactness and accuracy. However, such extensions remain underexplored in bio-inspired optimization algorithms. In particular, the use of quaternion algebra represents an emerging area in computational intelligence. This paper introduces a family of novel Quaternion-Valued Differential Evolution (QDE) algorithms that operate directly in the quaternion space. We propose several mutation strategies specifically designed to exploit the algebraic and geometric properties of quaternions. Results show that our QDE variants achieve faster convergence and superior performance on several function classes in the BBOB benchmark compared to the traditional real-valued DE algorithm.
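
The abstract does not specify the QDE mutation strategies, so the sketch below shows only two standard building blocks that such an algorithm would plausibly draw on: the Hamilton product, which defines quaternion multiplication, and the classic DE/rand/1 mutation applied component-wise to a quaternion. Both are textbook operations; combining them in this way is an assumption for illustration and not the paper's method.

```python
def hamilton_product(p, q):
    """Hamilton product of quaternions given as (w, x, y, z) tuples."""
    a1, b1, c1, d1 = p
    a2, b2, c2, d2 = q
    return (
        a1 * a2 - b1 * b2 - c1 * c2 - d1 * d2,
        a1 * b2 + b1 * a2 + c1 * d2 - d1 * c2,
        a1 * c2 - b1 * d2 + c1 * a2 + d1 * b2,
        a1 * d2 + b1 * c2 - c1 * b2 + d1 * a2,
    )


def de_rand_1(base, first, second, scale):
    """Classic DE/rand/1 mutant: base + scale * (first - second).

    Applied component-wise, treating each quaternion as a 4-vector.
    """
    return tuple(b + scale * (f - s) for b, f, s in zip(base, first, second))


if __name__ == "__main__":
    i = (0, 1, 0, 0)
    j = (0, 0, 1, 0)
    print(hamilton_product(i, j))  # i * j
    print(hamilton_product(j, i))  # j * i, which differs in sign
    print(de_rand_1((1, 0, 0, 0), (0, 2, 0, 0), (0, 0, 2, 0), 0.5))
```

Because the Hamilton product is non-commutative, a mutation built on it can rotate candidate solutions in ways that component-wise real arithmetic cannot, which is one reason quaternion-specific operators might behave differently from real-valued DE.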

2605.12341 2026-05-13 stat.ML cs.LG

Multi-Variable Conformal Prediction: Optimizing Prediction Sets without Data Splitting

Laura Lützow, Simone Garatti, Marco C. Campi, Lars Lindemann, Matthias Althoff

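AI Summary This paper proposes multi-variable conformal prediction (MCP), a framework that optimizes the shape of prediction sets without data splitting while retaining finite-sample coverage guarantees. MCP extends conformal prediction to vector-valued score functions and multiple calibration variables, unifying prediction set design and calibration in a single optimization problem. Two efficient variants, RemMCP and RelMCP, target different optimization settings. In experiments they meet the target coverage with prediction sets smaller than or comparable to those of baselines, and with considerably lower variance across calibration runs.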

Abstract

Conformal prediction constructs prediction sets with finite-sample coverage guarantees, but its calibration stage is structurally constrained to a scalar score function and a single threshold variable, forcing shapes of prediction sets to be fixed before calibration, typically through data splitting. We introduce multi-variable conformal prediction (MCP), a framework that extends conformal prediction to vector-valued score functions with multiple simultaneous calibration variables. Building on scenario theory as a principled framework for certifying data-driven decisions, MCP unifies prediction set design and calibration into a single optimization problem, eliminating data splitting without sacrificing coverage guarantees. We propose two computationally efficient variants: RemMCP, grounded in constrained optimization with constraint removal, which admits a clean generalization of split conformal prediction; and RelMCP, based on iterative optimization with constraint relaxation, which supports non-convex score functions at the cost of possibly greater conservatism. Through numerical experiments on ellipsoidal and multi-modal prediction sets, we demonstrate that RemMCP and RelMCP consistently meet the target coverage with prediction set sizes smaller than or comparable to those of baselines with data split, while considerably reducing variance across calibration runs, a direct consequence of using all available data for shape optimization and calibration simultaneously.
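
For reference, the scalar baseline that MCP generalizes can be sketched in a few lines. This is standard split conformal prediction with a single threshold, not MCP itself. It assumes scalar nonconformity scores (for example absolute residuals) computed on a held-out calibration set, and the function names are illustrative.

```python
import math


def split_conformal_threshold(calibration_scores, alpha):
    """Threshold giving at least 1 - alpha marginal coverage.

    Uses the ceil((n + 1) * (1 - alpha))-th smallest calibration score;
    if that rank exceeds n, the only valid threshold is infinity.
    """
    n = len(calibration_scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf
    return sorted(calibration_scores)[rank - 1]


def prediction_interval(point_prediction, threshold):
    """Interval for an absolute-residual score |y - prediction| <= threshold."""
    return point_prediction - threshold, point_prediction + threshold


if __name__ == "__main__":
    scores = [9, 1, 8, 2, 7, 3, 6, 4, 5]  # calibration residuals
    q = split_conformal_threshold(scores, alpha=0.5)
    print(q, prediction_interval(10.0, q))
```

The single threshold is the one calibration variable the abstract refers to: the shape of the set (here a symmetric interval) is fixed beforehand, and calibration only scales it.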

2605.12335 2026-05-13 cs.IR cs.AI cs.LG

EHR-RAGp: Retrieval-Augmented Prototype-Guided Foundation Model for Electronic Health Records

Saeed Shurrab, Mariam Al-Omari, Dana El Samad, Farah E. Shamout

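AI Summary Electronic Health Records (EHR) contain rich longitudinal patient information and are widely used in predictive modeling, but exploiting historical data is difficult because of long trajectories, heterogeneous events, and temporal irregularity. This paper proposes EHR-RAGp, a retrieval-augmented, prototype-guided foundation model that dynamically integrates the most relevant history across clinical event types. A prototype-guided retrieval module aligns retrieved history with the prediction task and estimates its relevance, steering the model toward the most informative context. The model outperforms existing state-of-the-art models on multiple clinical prediction tasks.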

Comments Retrieval Augmented EHR Foundation Model

Abstract

Electronic Health Records (EHR) contain rich longitudinal patient information and are widely used in predictive modeling applications. However, effectively leveraging historical data remains challenging due to long trajectories, heterogeneous events, temporal irregularity, and the varying relevance of past clinical context. Existing approaches often rely on fixed windows or uniform aggregation, which can obscure clinically important signals. In this work, we introduce EHR-RAGp, a retrieval-augmented foundation model that dynamically integrates the most relevant patient history across diverse clinical event types. We propose a prototype-guided retrieval module that acts as an alignment mechanism and estimates the relevance of retrieved historical chunks with respect to a given prediction task, guiding the model towards the most informative context. Across multiple clinical prediction tasks, EHR-RAGp consistently outperforms state-of-the-art EHR foundation models and transformer-based baselines. Furthermore, integrating EHR-RAGp with existing clinical foundation models yields substantial performance gains. Overall, EHR-RAGp provides a scalable and efficient framework for leveraging long-range clinical context to improve downstream performance.

2605.12303 2026-05-13 cs.HC cs.CV cs.LG

From Model Uncertainty to Human Attention: Localization-Aware Visual Cues for Scalable Annotation Review

Moussa Kassem Sbeyti, Joshua Holstein, Philipp Spitzer, Nadja Klein, Gerhard Satzger

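AI Summary High-quality labeled data is essential for training robust machine learning models, yet annotation at scale remains expensive. This paper studies how visualizing a model's spatial uncertainty helps human annotators review annotations, using localization-aware visual cues that point to regions likely to contain errors. In the study, annotators given these cues achieved higher label quality while being faster overall, supporting localization uncertainty as a lever for human-in-the-loop annotation.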

Abstract

High-quality labeled data is essential for training robust machine learning models, yet obtaining annotations at scale remains expensive. AI-assisted annotation has therefore become standard in large-scale labeling workflows. However, in tasks where model predictions carry two independent components, a class label and spatial boundaries, a model may classify an object with high confidence while mislocalizing it. Existing AI-assisted workflows offer annotators no signal about where spatial errors are most likely. Without such guidance, humans may systematically underinspect subtly misplaced boxes. We address this by studying the effect of visualizing spatial uncertainty via a purpose-built interface. In a controlled study with 120 participants, those receiving uncertainty cues achieve higher label quality while being faster overall. A box-level analysis confirms that the cues redirect annotator effort toward high-uncertainty predictions and away from well-localized boxes. These findings establish localization uncertainty as a lever to improve human-in-the-loop annotation. Code is available at https://mos-ks.github.io/MUHA/.

2605.12287 2026-05-13 eess.AS cs.SD

The SMC Blind Spot: A Failure Mode Analysis of State-of-the-Art Beat Tracking

Jaehoon Ahn, Tae Gum Hwang, Moon-Ryul Jung

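AI Summary DNN-based beat tracking models perform very well on mainstream percussive datasets but consistently score poorly on the SMC dataset. This paper analyzes the failure modes of state-of-the-art models on SMC, namely octave errors, continuity errors, and complete tracking failure, and notes that the models tend to produce "confident-but-wrong" activations. It also shows that the standard DBN's default minimum tempo prevents correct tempo inference for 21% of SMC tracks, and it gives concrete directions for improving beat and downbeat detection.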

Comments 6 pages, 3 figures. Technical report on beat tracking failure modes; prepared for ISMIR 2026

Abstract

Over the past two decades, the task of musical beat tracking has transitioned from heuristic onset detection algorithms to highly capable deep neural networks (DNN). Although DNN-based beat tracking models achieve near-perfect performance on mainstream, percussive datasets, the SMC dataset has stubbornly yielded low F-measure scores. By testing how well state-of-the-art models detect beats on individual tracks in the SMC dataset, we identify three distinct failure modes: octave errors, continuity errors, and complete tracking failure where all metrics fall below 0.3. We reveal that state-of-the-art models tend to generate "confident-but-wrong" activations. Furthermore, we show that the standard DBN's default minimum tempo of 55 BPM prevents it from inferring the correct tempo for 21% of SMC tracks, forcing double-tempo predictions on slow music. By exposing such fundamental oversights, we provide concrete directions for improving beat and downbeat detection, specifically emphasizing training data diversification and multi-hypothesis tempo estimation.
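
The effect of a minimum-tempo bound can be illustrated with a toy model of octave errors. The sketch assumes that a tracker restricted to a tempo range reports the nearest power-of-two multiple of the true tempo that fits inside the range. The 55 BPM lower bound comes from the abstract; the 215 BPM upper bound and the function name are assumptions chosen for illustration, and real DBN inference is considerably more involved.

```python
def fold_into_tempo_range(true_bpm, min_bpm=55.0, max_bpm=215.0):
    """Fold a tempo into [min_bpm, max_bpm] by doubling or halving.

    Toy octave-error model: assumes the range spans at least one octave,
    so some power-of-two multiple of the true tempo always fits.
    """
    if true_bpm <= 0:
        raise ValueError("tempo must be positive")
    bpm = float(true_bpm)
    while bpm < min_bpm:
        bpm *= 2.0  # too slow for the range: report double tempo
    while bpm > max_bpm:
        bpm /= 2.0  # too fast for the range: report half tempo
    return bpm


if __name__ == "__main__":
    for tempo in (40, 60, 27):
        print(tempo, fold_into_tempo_range(tempo))
```

Under this model a 40 BPM track can only be reported at 80 BPM, which matches the double-tempo behaviour on slow music that the abstract describes.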