arXivDaily: daily arXiv research digest, updated Monday through Friday
2503.06139 2026-05-13 cs.CL

GRP: Goal-Reversed Prompting for Zero-Shot Evaluation with LLMs

Mingyang Song, Mao Zheng, Xuan Luo

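AI summary This paper proposes Goal-Reversed Prompting (GRP), a new method for improving zero-shot evaluation with large language models (LLMs). The judge model is asked to identify the worse of two candidate answers, and the preference is then recovered by elimination, which improves evaluation accuracy. Experiments show that GRP significantly improves judge performance across several evaluation tasks, most notably on reasoning and mathematics, and that it is compatible with multiple prompt templates and requires no extra inference rounds.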

Comments Ongoing Work

Abstract

Pairwise LLM-as-a-judge evaluation asks the judge to identify the better of two candidate answers. We study a one-line modification that asks for the worse answer instead and recovers the preference by elimination, a procedure we call Goal-Reversed Prompting (GRP). GRP introduces no extra inference rounds, composes with any prompt template (direct, chain-of-thought, or Arena-Hard SOP), and leaves the rest of the evaluation pipeline untouched. Two observations motivate the reversal. Reverse reasoning is a recurring strategy in human problem solving, and modern instruction-tuned judges exhibit a positive-leaning bias that asking for the worse answer can counteract. On JudgeBench under a strict consistency protocol that counts a judgment as correct only when both response orderings agree with the gold preference, GRP improves all three closed-source judges we test across both response-pair sources. With GPT-4o-generated pairs, the Arena-Hard SOP baseline improves from 61.71% to 66.23% for GPT-4o (+4.52) and from 60.00% to 66.00% for Claude-3.5-Sonnet (+6.00), with the largest absolute gains on Reasoning and Mathematics. The lift persists when response pairs come from Claude-3.5-Sonnet and when the SOP scaffolding is stripped to a minimal direct-prompting template, suggesting that goal reversal acts on the underlying judging behavior rather than on a particular rubric. Stronger judges benefit more than weaker ones, suggesting that goal reversal exposes additional reasoning capacity rather than compensating for its absence.
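
The elimination step and the strict consistency protocol described in this abstract can be sketched in a few lines of Python. This is a minimal illustration and not the authors' code: the `judge` callable and its "A"/"B" return convention are hypothetical stand-ins for an LLM call that has been prompted to name the worse answer.

```python
from typing import Callable

# Hypothetical judge interface (an assumption for this sketch): it receives
# (question, answer_a, answer_b) and returns "A" or "B" for the WORSE answer.
WorseJudge = Callable[[str, str, str], str]


def flip(label: str) -> str:
    """Return the other label of the pair."""
    return "B" if label == "A" else "A"


def grp_preference(judge: WorseJudge, question: str, a: str, b: str) -> str:
    """Goal reversal: ask for the worse answer, prefer the remaining one."""
    return flip(judge(question, a, b))


def strictly_consistent(judge: WorseJudge, question: str,
                        a: str, b: str, gold: str) -> bool:
    """Correct only if both response orderings agree with the gold preference.

    `gold` is "A" or "B" with respect to the original (a, b) ordering.
    """
    first = grp_preference(judge, question, a, b)
    # In the swapped call, label "A" refers to b, so map the result back.
    swapped = flip(grp_preference(judge, question, b, a))
    return first == gold and swapped == gold
```

A judge with pure position bias (always naming the first answer as worse) can never pass the strict check, which is the behavior the protocol is meant to penalize.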

2502.11981 2026-05-13 cs.LG cs.AI cs.CY

Welfare as a Guiding Principle for Machine Learning -- From Compass, to Lens, to Roadmap

Nir Rosenfeld, Haifeng Xu

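AI summary This paper proposes adopting social welfare as a core guiding principle in the design and application of machine learning, in order to maximize social well-being. Drawing on welfare economics' theory of resource allocation, the authors argue that in social settings machine learning models should not only pursue predictive accuracy but also attend to their effect on society's overall interest. The paper advocates welfare as a fourth core criterion alongside optimization, generalization, and expressivity, offering new directions and evaluation criteria for both theoretical research and practical applications.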

Abstract

Decades of research in machine learning have given us powerful tools for making accurate predictions. But when used in social settings and on human inputs, better accuracy does not immediately translate to better social outcomes. To effectively promote social well-being through machine learning, this position article advocates for the wide adoption of social welfare as a guiding principle. The field of welfare economics asks: how should we allocate limited resources to self-interested agents in a way that maximizes social benefit? We argue that this perspective applies to many modern applications of machine learning in social contexts. As such, we propose that welfare serves as an additional core criterion in the design, study, and use of learning algorithms, complementing the conventional pillars of optimization, generalization, and expressivity, and as a compass guiding both theory and practice.

2502.03061 2026-05-13 cs.LG

Pure Exploration Beyond Reward Feedback: The Role of Post-Action Context

Mohammad Shahverdikondori, Amir Mohammad Abouei, Alireza Rezaeimoghadam, Negar Kiyavash

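AI summary This paper studies a new variant of best arm identification (BAI) in a stochastic multi-armed bandit environment with post-action context. In this problem, after each action the learner receives additional context information besides the reward, which helps it decide more effectively. The paper analyzes two types of post-action context and proposes corresponding algorithms, proves theoretically that their sample complexity is optimal, and shows experimentally that exploiting the context significantly improves performance.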

Comments 46 pages, 8 figures

Abstract

We introduce the problem of best arm identification (BAI) with post-action context, a new BAI problem in a stochastic multi-armed bandit environment and the fixed-confidence setting. The problem addresses the scenarios in which the learner receives a post-action context in addition to the reward after playing each action. This post-action context provides additional information that can significantly facilitate the decision process. We analyze two different types of the post-action context: (i) separator, where the reward depends solely on the context, and (ii) non-separator, where the reward depends on both the action and the context. For both cases, we derive instance-dependent lower bounds on the sample complexity and propose algorithms that asymptotically achieve the optimal sample complexity. For the separator setting, we propose a novel sampling rule called G-tracking, which uses the geometry of the context space to directly track the contexts rather than the actions. For the non-separator setting, we do so by demonstrating that the Track-and-Stop algorithm can be extended to this setting. Moreover, in both settings, we theoretically and empirically show that algorithms that ignore the post-action context are sub-optimal. Finally, our empirical results showcase the advantage of our approaches compared to the state of the art.

2501.19403 2026-05-13 cs.LG cs.AI

Tackling Fake Forgetting through Uncertainty Quantification

Yingdan Shi, Sijia Liu, Kaize Ding, Ren Wang

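AI summary This paper studies "fake forgetting" in machine unlearning: a model may score well on unlearning metrics while still retaining the true-label information of the data to be forgotten. To address this, the authors propose CR, a new evaluation metric based on conformal prediction that measures forgetting quality more reliably, and further design CPU, an unlearning framework that incorporates conformal prediction and effectively improves forgetting. Experiments show that the method achieves superior unlearning performance on image classification tasks.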

Abstract

Machine unlearning seeks to remove the influence of specified data from a trained model. While the unlearning accuracy provides a widely used metric for assessing unlearning performance, it falls short in assessing the reliability of forgetting. In this paper, we find that the forgetting data points misclassified by unlearning accuracy still have their ground truth labels included in the conformal prediction set from the uncertainty quantification perspective, leading to a phenomenon we term fake forgetting. To address this issue, we propose a novel metric CR, inspired by conformal prediction, that offers a more reliable assessment of forgetting quality. Building on these insights, we further propose an unlearning framework CPU that incorporates conformal prediction into the Carlini & Wagner adversarial attack loss, enabling the ground truth label to be effectively removed from the conformal prediction set. Through extensive experiments on image classification tasks, we demonstrate both the effectiveness of our proposed metric and the superior forgetting quality achieved by our framework. Code is available at https://github.com/TIML-Group/Conformal-Prediction-Unlearning.
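
The fake-forgetting phenomenon can be illustrated with a generic split conformal prediction set. The sketch below is an assumption-laden illustration, not the paper's CR metric or CPU framework: it uses the common nonconformity score 1 - p(true label) and flags a forgotten sample when it is misclassified yet its true label is still inside the prediction set. All function names are hypothetical.

```python
import math


def conformal_threshold(cal_scores, alpha):
    """Split-conformal quantile of calibration nonconformity scores."""
    n = len(cal_scores)
    rank = math.ceil((n + 1) * (1 - alpha))  # 1-indexed order statistic
    if rank > n:
        return float("inf")  # too few calibration points: set covers all labels
    return sorted(cal_scores)[rank - 1]


def prediction_set(probs, qhat):
    """Labels whose score 1 - p(label) does not exceed the threshold."""
    return {label for label, p in enumerate(probs) if 1.0 - p <= qhat}


def is_fake_forgotten(probs, true_label, qhat):
    """Misclassified, but the true label survives in the prediction set."""
    predicted = max(range(len(probs)), key=probs.__getitem__)
    return predicted != true_label and true_label in prediction_set(probs, qhat)
```

With calibration scores [0.1, 0.3, 0.6, 0.8] and alpha = 0.5, the threshold is the third order statistic, 0.6. A sample with probabilities [0.5, 0.45, 0.05] and true label 1 is misclassified as label 0, but label 1 stays in the set, so plain unlearning accuracy would count it as forgotten while the set-based view would not.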

2501.16931 2026-05-13 cs.LG stat.AP

Beyond Point Estimates: Distributional Uncertainty in Machine Learning Performance Evaluation

Christoph Lehmann, Yahor Paromau

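AI summary The paper proposes a distribution-based approach to evaluating machine learning models, stressing that performance metrics should be treated as random variables rather than fixed values so as to reflect the uncertainty of the training process more fully. It analyzes the empirical distribution of performance metrics, using quantiles and confidence intervals for point and interval estimation, with particular attention to the feasibility of statistical inference at small sample sizes. Compared with conventional mean-based evaluation, the approach characterizes the variability and uncertainty of model performance in finer detail, suits practical applications that require reliability, and is easy to implement and generalize.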

Comments 21 pages, 9 figures

Abstract

Machine learning models are often evaluated using point estimates of performance metrics such as accuracy, F1 score, or mean squared error. Such summaries fail to capture the inherent variability induced by stochastic elements of the training process, including data splitting, initialization, and hyperparameter optimization. This work proposes a distributional perspective on model evaluation by treating performance metrics as random quantities rather than fixed values. Instead of focusing solely on aggregate measures, empirical distributions of performance metrics are analyzed using quantiles and corresponding confidence intervals. The study investigates point and interval estimation of quantiles based on real-data use cases for classification and regression tasks, complemented by simulation studies for validation. Special emphasis is placed on small sample sizes, reflecting practical constraints in machine learning, where repeated training is computationally expensive. The results show that meaningful statistical inference on the underlying performance distribution is feasible even with sample sizes in the range of 10-25, while standard nonparametric confidence intervals remain applicable under these conditions. The proposed approach provides a more detailed characterization of variability and uncertainty compared to mean-based evaluation and enables a more differentiated comparison of models. In particular, it supports a risk-oriented interpretation of model performance, which is relevant in applications where reliability is critical. The presented methods are easy to implement and broadly applicable, making them a practical extension to standard performance evaluation procedures in machine learning.
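
As an illustration of a standard nonparametric confidence interval for a quantile, the sketch below uses the textbook order-statistic construction: if B counts the observations below the p-quantile, then B follows Binomial(n, p), and the interval between the l-th and u-th order statistics covers the quantile with probability P(l <= B <= u - 1). This may not be the exact variant used in the paper, and the function names are hypothetical.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(B <= k) for B ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def quantile_ci_ranks(n, p, alpha=0.05):
    """1-indexed ranks (l, u) with coverage >= 1 - alpha, or None."""
    lower = None
    for l in range(1, n + 1):  # largest l with P(B < l) <= alpha / 2
        if binom_cdf(l - 1, n, p) <= alpha / 2:
            lower = l
        else:
            break
    upper = None
    for u in range(n, 0, -1):  # smallest u with P(B >= u) <= alpha / 2
        if 1 - binom_cdf(u - 1, n, p) <= alpha / 2:
            upper = u
        else:
            break
    if lower is None or upper is None:
        return None  # sample too small to bracket this quantile
    return lower, upper


def quantile_ci(sample, p, alpha=0.05):
    """Equal-tailed order-statistic interval for the p-quantile."""
    ranks = quantile_ci_ranks(len(sample), p, alpha)
    if ranks is None:
        return None
    xs = sorted(sample)
    return xs[ranks[0] - 1], xs[ranks[1] - 1]
```

For ten repeated training runs, the 95% interval for the median is bounded by the 2nd and 9th smallest scores, whereas the 0.9-quantile cannot be bracketed at that level with only ten runs, so the function returns None.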

2501.08083 2026-05-13 cs.CV

Benchmarking Vision Foundation Models for Input Monitoring in Autonomous Driving

Mert Keser, Halil Ibrahim Orhan, Niki Amini-Naieni, Gesina Schwalbe, Alois Knoll, Matthias Rottmann

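AI summary The paper studies how to use vision foundation models (VFMs) for input monitoring in complex open-world domains such as autonomous driving, in order to detect scenarios outside the training data distribution (OOD). The authors propose an unsupervised, model-agnostic method that combines VFMs as feature extractors with density modeling techniques to detect semantic and covariate shifts in a unified way. Experiments show that the method outperforms existing OOD classification methods under a variety of conditions and effectively identifies high-risk inputs likely to cause errors in downstream tasks, offering a new approach to safety monitoring in complex vision tasks.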

Abstract

Deep neural networks (DNNs) remain challenged by distribution shifts in complex open-world domains like automated driving (AD): Robustness against yet unknown novel objects (semantic shift) or styles like lighting conditions (covariate shift) cannot be guaranteed. Hence, reliable operation-time monitors for identification of out-of-training-data-distribution (OOD) scenarios are imperative. Current approaches for OOD classification are untested for complex domains like AD, are limited in the kinds of shifts they detect, or even require supervision with OOD samples. To prepare for unanticipated shifts, we instead establish a framework around a principled, unsupervised and model-agnostic method that unifies detection of semantic and covariate shifts: Find a full model of the training data's feature distribution, to then use its density at new points as an in-distribution (ID) score. To implement this, we propose to combine Vision Foundation Models (VFMs) as feature extractors with density modeling techniques. Through a comprehensive benchmark of 4 VFMs with different backbone architectures and 5 density-modeling techniques against established baselines, we provide the first systematic evaluation of OOD classification capabilities of VFMs across diverse conditions. A comparison with state-of-the-art binary OOD classification methods reveals that VFM embeddings with density estimation outperform existing approaches in identifying OOD inputs. Additionally, we show that our method detects high-risk inputs likely to cause errors in downstream tasks, thereby improving overall performance. Overall, VFMs, when coupled with robust density modeling techniques, are promising to realize model-agnostic, unsupervised, reliable safety monitors in complex vision tasks.
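
The idea of using the training feature density as an ID score can be sketched with a deliberately simple density model. The abstract does not name the five density-modeling techniques that were benchmarked, so the diagonal Gaussian below is only a stand-in, and the plain Python lists stand in for VFM embeddings; every name here is hypothetical.

```python
import math


def fit_diag_gaussian(features, eps=1e-6):
    """Fit per-dimension mean and variance to training feature vectors."""
    n, d = len(features), len(features[0])
    mean = [sum(f[j] for f in features) / n for j in range(d)]
    var = [sum((f[j] - mean[j]) ** 2 for f in features) / n + eps
           for j in range(d)]
    return mean, var


def id_score(x, model):
    """Log-density of x under the fitted model; lower means more OOD-like."""
    mean, var = model
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))


def is_ood(x, model, threshold):
    """Flag an input whose ID score falls below a chosen threshold."""
    return id_score(x, model) < threshold
```

In practice the threshold would be picked from held-out in-distribution scores, for example a low percentile, so that no OOD samples are needed for supervision.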

2501.02955 2026-05-13 cs.CV

MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models

Wenyi Hong, Yean Cheng, Zhuoyi Yang, Weihan Wang, Lefan Wang, Xiaotao Gu, Shiyu Huang, Yuxiao Dong, Jie Tang

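AI summary In recent years vision language models (VLMs) have made notable progress in video understanding, but fine-grained motion understanding still lacks systematic study. The paper therefore proposes MotionBench, a benchmark that comprehensively evaluates the fine-grained motion understanding of video models, comprising six categories of motion-related questions and video data from multiple sources. Experiments show that existing VLMs perform poorly on fine-grained motion understanding. By analyzing architectures for video feature compression and proposing an efficient Through-Encoder fusion method, the authors improve the models' motion perception, while showing that substantial room for improvement remains.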

Comments 20 pages

Abstract

In recent years, vision language models (VLMs) have made significant advancements in video understanding. However, a crucial capability - fine-grained motion comprehension - remains under-explored in current benchmarks. To address this gap, we propose MotionBench, a comprehensive evaluation benchmark designed to assess the fine-grained motion comprehension of video understanding models. MotionBench evaluates models' motion-level perception through six primary categories of motion-oriented question types and includes data collected from diverse sources, ensuring a broad representation of real-world video content. Experimental results reveal that existing VLMs perform poorly in understanding fine-grained motions. To enhance VLMs' ability to perceive fine-grained motion within the limited sequence length of an LLM, we conduct extensive experiments reviewing VLM architectures optimized for video feature compression and propose a novel and efficient Through-Encoder (TE) Fusion method. Experiments show that higher frame rate inputs and TE Fusion yield improvements in motion understanding, yet there is still substantial room for enhancement. Our benchmark aims to guide and motivate the development of more capable video understanding models, emphasizing the importance of fine-grained motion comprehension. Project page: https://motion-bench.github.io .

2412.18594 2026-05-13 cs.LG stat.ML

Local and Mixing-Based Algorithms for Gaussian Graphical Model Selection from Glauber Dynamics

Vignesh Tirukkonda, Anirudh Rayas, Gautam Dasarathy

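AI summary The paper studies structure learning of Gaussian graphical models from dependent samples generated by Gaussian Glauber dynamics. The authors propose two complementary approaches. The first is a local edge-detection algorithm based on correlation tests, which can run in parallel without waiting for the chain to mix. The second applies when a Dobrushin contraction condition holds: subsampling the Gaussian Gibbs trajectory makes it close in total variation distance to i.i.d. samples, so standard graphical model learners for independent samples can be used directly. The study also provides finite-sample recovery guarantees and analyzes information-theoretic lower bounds on the observation time.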

Comments Major revision. Corrects the earlier local ratio-estimator analysis by replacing it with a local product estimator; adds a burn-in/thinning estimator based on total-variation decoupling for Gaussian Gibbs samplers; strengthens the lower bounds; adds experiments; and compares with the related ICML 2026 work of Shen, Wu, Majid, and Moitra

Abstract

Gaussian graphical model selection is usually studied under independent sampling, but in many applications observations arise from dependent dynamics. We study structure learning when the data consist of a single trajectory of Gaussian Glauber dynamics. We develop two complementary approaches. The first is a local edge-testing estimator based on an appropriately designed correlation test that reveals edges. This estimator does not require waiting for the chain to mix and admits an embarrassingly parallel edgewise implementation. The second is a burn-in/thinning reduction: under a Dobrushin contraction condition, we prove that a suitably subsampled Gaussian Gibbs trajectory is close in total variation to an i.i.d. product sample, allowing standard i.i.d. Gaussian graphical model learners to be used as black boxes. The key technical ingredient, which may be of independent interest, is a high-dimensional total-variation bound for random-scan Gaussian Gibbs samplers, obtained by combining Wasserstein contraction with an approximate Lipschitz smoothing argument. We prove finite-sample recovery guarantees for both approaches, establish information-theoretic lower bounds on the observation time, and empirically compare the resulting sample-computation tradeoffs.
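
To make the data model and the burn-in/thinning reduction concrete, here is a small Python sketch. It assumes the dynamics are the usual random-scan single-site updates for a zero-mean Gaussian with precision matrix `theta`, using the standard conditional x_i | x_rest ~ N(-sum_{j != i} theta_ij x_j / theta_ii, 1 / theta_ii). The burn-in and gap values passed in are placeholders: the paper ties suitable values to its Dobrushin contraction condition, which this sketch does not compute.

```python
import random


def glauber_trajectory(theta, steps, seed=0):
    """Single trajectory of random-scan Gaussian Gibbs (Glauber) updates."""
    rng = random.Random(seed)
    d = len(theta)
    x = [0.0] * d
    trajectory = []
    for _ in range(steps):
        i = rng.randrange(d)  # pick one coordinate uniformly at random
        cond_mean = -sum(theta[i][j] * x[j]
                         for j in range(d) if j != i) / theta[i][i]
        cond_std = (1.0 / theta[i][i]) ** 0.5
        x[i] = rng.gauss(cond_mean, cond_std)  # resample x_i given the rest
        trajectory.append(list(x))
    return trajectory


def burn_in_and_thin(trajectory, burn_in, gap):
    """Drop the first `burn_in` states, then keep every `gap`-th state."""
    return trajectory[burn_in::gap]
```

The thinned states could then be handed to any off-the-shelf i.i.d. Gaussian graphical model learner, which is the black-box use the abstract describes.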

2412.14461 2026-05-13 cs.CL

To Err Is Human; To Annotate, SILICON? Toward Robust Reproducibility in LLM Annotation

Xiang Cheng, Raveesh Mayya, João Sedoc

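AI summary This paper examines the reproducibility challenges of text annotation with large language models (LLMs), in particular how to keep annotation results reproducible in the long term once the original model has been deprecated. It proposes an analytical framework named SILICON that decomposes measurement error into four sources and designs targeted interventions to reduce each. Empirical validation shows that the approach effectively improves annotation quality, and a regression-based method establishes open-weight models that remain usable in the long term, giving annotation tasks in management research a more robust solution.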

Abstract

Unstructured text data annotation is foundational to management research. LLMs offer a cost-effective and scalable alternative to human annotation, but they introduce a novel challenge: the annotator itself can be retired. Proprietary models undergo regular deprecation cycles, threatening long-term reproducibility. Hence, the ability to reproduce annotation results when the original model becomes unavailable, i.e., robust reproducibility, is a central methodological challenge for LLM-based annotation. Achieving robust reproducibility requires first controlling measurement error. We develop an analytical framework that decomposes measurement error into four sources: guideline-induced error from inconsistent annotation criteria, baseline-induced error from unreliable human references, prompt-induced error from suboptimal meta-instruction, and model-induced error from architectural differences across LLMs. We develop the SILICON workflow that instantiates the analytical framework, prescribing targeted interventions at each error source. Empirical validation across nine management research tasks confirms that these interventions reduce measurement error, and simulations show that the resulting error reduction yields more accurate downstream statistical estimates. With measurement error controlled, we address two further aspects of robust reproducibility. First, we propose a regression-based methodology to establish backup open-weight models, which are permanently accessible. Every tested task has at least one open-weight model with no statistically detectable performance difference. Second, we quantify the upper bound of annotation quality attainable from the current set of available models by proposing a routing procedure that selectively sends low-confidence items to auxiliary models, revealing when model aggregation improves performance and when that may adversely affect labeling quality.

2412.13050 2026-05-13 cs.LG cs.AI cs.CL cs.CV cs.SD eess.AS

Modality-Inconsistent Continual Learning of Multimodal Large Language Models

Weiguo Pian, Shijian Deng, Shentong Mo, Mingrui Liu, Yunhui Guo, Yapeng Tian

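AI summary This paper proposes Modality-Inconsistent Continual Learning (MICL), a new continual learning scenario for multimodal large language models in which tasks involve inconsistent modalities such as image, audio, or video and different task types such as captioning or question answering. To counter the catastrophic forgetting caused by shifts in modality and task type, the study proposes MoInCL, which uses a pseudo-target generation module and instruction-based knowledge distillation to mitigate the effect of these shifts on model performance. Experimental results show that MoInCL clearly outperforms existing continual learning methods across multiple tasks.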

Comments Accepted at Transactions on Machine Learning Research (TMLR), 2026

Abstract

In this paper, we introduce Modality-Inconsistent Continual Learning (MICL), a new continual learning scenario for Multimodal Large Language Models (MLLMs) that involves tasks with inconsistent modalities (image, audio, or video) and varying task types (captioning or question-answering). Unlike existing vision-only or modality-incremental settings, MICL combines modality and task type shifts, both of which drive catastrophic forgetting. To address these challenges, we propose MoInCL, which employs a Pseudo Targets Generation Module to mitigate forgetting caused by task type shifts in previously seen modalities. It also incorporates Instruction-based Knowledge Distillation to preserve the model's ability to handle previously learned modalities when new ones are introduced. We benchmark MICL using a total of six tasks and conduct experiments to validate the effectiveness of our MoInCL. The experimental results highlight the superiority of MoInCL, showing significant improvements over representative and state-of-the-art continual learning baselines.

2411.19517 2026-05-13 cs.LG cs.AI

RL-SPH: Learning to Achieve Feasible Solutions for Integer Linear Programs

Tae-Hoon Lee, Min-Soo Kim

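AI summary The study proposes RL-SPH, a reinforcement learning-based start primal heuristic that aims to quickly generate feasible solutions for integer linear programming problems. Unlike existing methods, RL-SPH can independently generate high-quality feasible solutions and performs well even on problems with non-binary integers. Experiments show that RL-SPH outperforms existing methods on feasibility rate, primal gap, and primal integral, a significant performance improvement.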

Comments Accepted at ICML 2026. 30 pages, 12 figures, 22 tables

Abstract

Primal heuristics play a crucial role in quickly finding feasible solutions for NP-hard integer linear programming (ILP). Although end-to-end learning-based primal heuristics (E2EPH) have recently been proposed, they are typically unable to independently generate feasible solutions. To address this challenge, we propose RL-SPH, a novel reinforcement learning-based start primal heuristic capable of independently generating feasible solutions, even for ILP involving non-binary integers. Empirically, RL-SPH rapidly obtains high-quality feasible solutions with a 100% feasibility rate, achieving on average a 28.6× lower primal gap and a 2.6× lower primal integral compared to existing start primal heuristics.

2405.10271 2026-05-13 cs.LG cs.AI cs.DC cs.ET

Pruning Federated Models through Loss Landscape Analysis and Client Agreement Scoring

Christian Internò, Elena Raponi, Markus Olhofer, Ali Raza, Thomas Bäck, Niki van Stein, Yaochu Jin, Barbara Hammer

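AI summary Deploying federated learning on resource-constrained devices faces the high cost of training large models and the instability caused by data heterogeneity. To address this, the paper proposes AutoFLIP, an automated model pruning framework based on loss landscape analysis and client agreement scoring. The method treats client data diversity as a feature to be exploited: a one-time federated loss exploration builds a global map of the loss landscape, which then guides a dynamic pruning strategy, significantly improving model efficiency and robustness. Experiments show that in non-IID settings AutoFLIP reduces computational overhead by 52% on average and communication costs by 65% while maintaining state-of-the-art accuracy.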

Journal ref Published in IEEE Internet of Things Journal, 2026

Abstract

The practical deployment of Federated Learning (FL) on resource-constrained devices is fundamentally limited by the high cost of training large models and the instability caused by heterogeneous (non-IID) client data. Conventional pruning methods often treat data heterogeneity as a problem to be mitigated. In this work, we introduce a paradigm shift: we reframe client diversity as a feature to be harnessed. We propose AutoFLIP, a framework that begins not with training, but with a one-time federated loss exploration. During this phase, clients collaboratively build a map of the collective loss landscape, using their diverse data to reveal the problem's essential structure. This shared intelligence then guides an adaptive pruning strategy that is dynamically refined by client agreement throughout training. This approach allows AutoFLIP to identify robust and efficient sub-networks from the outset. Our extensive experiments show that AutoFLIP reduces computational overhead by an average of 52% and communication costs by over 65% while simultaneously achieving state-of-the-art accuracy in challenging non-IID settings.

2212.02011 2026-05-13 cs.CV

PointCaM: Cut-and-Mix for Open-Set Point Cloud Learning

Jie Hong, Shi Qiu, Weihao Li, Saeed Anwar, Mehrtash Harandi, Nick Barnes, Lars Petersson

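AI summary This paper studies open-set point cloud learning, in which the model is trained without data from unknown classes and must identify unknown objects at inference time. The authors propose PointCaM, a new cut-and-mix mechanism for point clouds consisting of two modules, an Unknown-Point Simulator and an Unknown-Point Estimator, which simulates out-of-distribution data and uses multi-level feature context to distinguish known from unknown point clouds. Experiments show that the method significantly improves open-set recognition performance on several datasets, confirming its effectiveness.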

Comments Accepted in CVIU

Abstract

Point cloud learning is receiving increasing attention. However, most existing point cloud models lack the practical ability to deal with the unavoidable presence of unknown objects. This paper primarily discusses point cloud learning in open-set settings, where we train the model without data from unknown classes and identify them during the inference stage. In essence, we propose a novel Point Cut-and-Mix mechanism for solving open-set point cloud learning, comprising an Unknown-Point Simulator and an Unknown-Point Estimator module. Specifically, we use the Unknown-Point Simulator to simulate out-of-distribution data in the training stage by manipulating the geometric context of partially known data. Based on this, the Unknown-Point Estimator module learns to exploit the point cloud's feature context to discriminate between known and unknown data. Unlike existing methods that only consider classifier features, our proposed solution leverages multi-level feature contexts to recognize unknown point cloud objects more effectively. We test the proposed approach on several datasets, including customized S3DIS, ModelNet40, and ScanObjectNN. The improved open-set performances over comparative baselines show the effectiveness of our PointCaM method. Our code is available at https://github.com/JHome1/pointcam.

2605.11832 2026-05-13 cs.RO

Learning Action Manifold with Multi-view Latent Priors for Robotic Manipulation

Junjin Xiao, Dongyang Li, Yandan Yang, Shuang Zeng, Tong Lin, Xinyuan Chang, Feng Xiong, Mu Xu, Xing Wei, Zhiheng Ma, Qing Zhang, Wei-Shi Zheng

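AI summary This paper studies the challenges that vision-language-action (VLA) models face in spatial perception and manipulation, and proposes an action manifold learning method based on multi-view latent priors. A pre-trained multi-view diffusion model generates novel views, and a Geometry-Guided Gated Transformer (G3T) aligns the multi-view features and filters occlusion noise. Action Manifold Learning (AML) is introduced to predict valid actions directly, improving manipulation efficiency. Experiments show that the method outperforms existing state-of-the-art methods on multiple benchmarks and real-robot tasks.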

Abstract

This paper tackles spatial perception and manipulation challenges in Vision-Language-Action (VLA) models. To address depth ambiguity from monocular input, we leverage a pre-trained multi-view diffusion model to synthesize latent novel views and propose a Geometry-Guided Gated Transformer (G3T) that aligns multi-view features under 3D geometric guidance while adaptively filtering occlusion noise. To improve action learning efficiency, we introduce Action Manifold Learning (AML), which directly predicts actions on the valid action manifold, bypassing inefficient regression of unstructured targets like noise or velocity. Experiments on LIBERO, RoboTwin 2.0, and real-robot tasks show our method achieves superior success rate and robustness over SOTA baselines. Project page: https://junjxiao.github.io/Multi-view-VLA.github.io/.

2605.11825 2026-05-13 cs.RO

Mapping Embodied Affective Touch Strategies on a Humanoid Robot

Qiaoqiao Ren, Omar Eldardeer, Francesca Cocchella, Rea Francesco, Alessandra Sciutti, Tony Belpaeme

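AI summary The study explores affective touch strategies in interaction between humans and a humanoid robot, analyzing how touch expression is constrained by the robot's body structure and by touch location. In an experiment in which 32 participants interacted with an iCub robot equipped with full-body tactile sensors, the study finds that body region and spatial constraints significantly affect the location and dynamics of touch, and that emotional expression differs across body parts and constraint conditions. The results indicate that affective touch strategies depend strongly on body region and are significantly shaped by the robot's embodiment.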

Abstract

Affective touch in human-robot interaction is shaped not only by emotional intent, but also by robot embodiment, including touch location, physical constraints, and perceived agency or social role. Existing HRI studies typically focus on one or two isolated body parts, limiting understanding of how affective touch generalises across the full humanoid body. We present a study with 32 participants interacting with the iCub robot, which is equipped with full-body distributed tactile sensors. Participants expressed eight emotions under three conditions: free touch, arm-only touch, and torso-only touch. Results show that body region and spatial constraints jointly shaped both touch location and dynamics. In free touch, participants preferred socially accessible upper-body regions, while less frequently touched areas showed stronger emotion-specific selectivity. Emotion-related variation was more evident in motion features for arm-only touch and pressure features for torso-only touch. Touch strategies also did not transfer directly between free and constrained conditions, even within the same coarse body region. Participants reported increased closeness to the robot after interaction, with around 30 percent reporting a change in perceived social relationship. Together, these findings show that affective touch expression is strongly body-region dependent and shaped by embodiment constraints.

2605.11824 2026-05-13 cs.CV cs.AI

REFNet++: Multi-Task Efficient Fusion of Camera and Radar Sensor Data in Bird's-Eye Polar View

Kavin Chandrasekaran, Sorin Grigorescu, Gijs Dubbelman, Pavol Jancura

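AI summary The paper proposes REFNet++, an efficient multi-task fusion method that fuses camera and radar sensor data in a bird's-eye polar view. Using a variational encoder-decoder architecture, it transforms camera images into the polar domain and extracts angle information from the radar range-Doppler spectrum to produce range-azimuth features, aligning the two modalities in a unified domain. The method improves computational efficiency while preserving fusion accuracy, and outperforms existing methods on vehicle detection and free space segmentation.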

Comments IEEE Intelligent Transportation Systems Conference (ITSC) 2025

Abstract

A realistic view of the vehicle's surroundings is generally offered by camera sensors, which is crucial for environmental perception. Affordable radar sensors, on the other hand, are becoming invaluable due to their robustness in variable weather conditions. However, because of their noisy output and reduced classification capability, they work best when combined with other sensor data. Specifically, we address the challenge of multimodal sensor fusion by aligning radar and camera data in a unified domain, prioritizing not only accuracy, but also computational efficiency. Our work leverages the raw range-Doppler (RD) spectrum from radar and front-view camera images as inputs. To enable effective fusion, we employ a variational encoder-decoder architecture that learns the transformation of front-view camera data into the Bird's-Eye View (BEV) polar domain. Concurrently, a radar encoder-decoder learns to recover the angle information from the RD data that produce Range-Azimuth (RA) features. This alignment ensures that both modalities are represented in a compatible domain, facilitating robust and efficient sensor fusion. We evaluated our fusion strategy for vehicle detection and free space segmentation against state-of-the-art methods using the RADIal dataset.

2605.11818 2026-05-13 cs.CV

RevealLayer: Disentangling Hidden and Visible Layers via Occlusion-Aware Image Decomposition

Binhao Wang, Shihao Zhao, Bo Cheng, Qiuyu Ji, Yuhang Ma, Liebucha Wu, Shanyuan Liu, Dawei Leng, Yuhui Yin

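AI summary The paper proposes RevealLayer, a diffusion-based image layer decomposition method that targets the difficulty of separating hidden and visible layers in complex natural images and of recovering content in occluded regions. The method introduces a region-aware attention module, an occlusion-guided adapter, and a composite loss to achieve more precise layer separation and reconstruction of occluded content. The team also builds the high-quality RevealLayer-100K dataset and the RevealLayerBench evaluation benchmark, and experiments show that the method outperforms existing approaches on layer decomposition.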

Abstract

Recent diffusion-based approaches have made substantial progress in image layer decomposition. However, accurately decomposing complex natural images remains challenging due to difficulties in occlusion completion, robust layer disentanglement, and precise foreground boundaries. Moreover, the scarcity of high-quality multi-layer natural image datasets limits advancement. To address these challenges, we propose RevealLayer, a diffusion-based framework that decomposes an RGB image into multiple RGBA layers, enabling precise layer separation and reliable recovery of occluded content in natural images. RevealLayer incorporates three key components: (1) a Region-Aware Attention module to disentangle hidden and visible layers; (2) an Occlusion-Guided Adapter to leverage contextual information to enhance overlapping regions; and (3) a composite loss to enforce sharp alpha boundaries and suppress residual artifacts. To support training and evaluation, we introduce RevealLayer-100K, a high-quality multi-layer natural image dataset constructed through a collaboration between automated algorithms and human annotation, and further establish RevealLayerBench for benchmarking layer decomposition in general natural scenes. Extensive experiments demonstrate that RevealLayer consistently outperforms existing approaches in layer decomposition.

2605.11815 2026-05-13 cs.LG

Fed-BAC: Federated Bandit-Guided Additive Clustering in Hierarchical Federated Learning

Satwat Bashir, Tasos Dagiuklas, Muddesar Iqbal

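AI summary This paper proposes Fed-BAC, a federated learning method that addresses the joint optimization of cluster assignment and client selection under data heterogeneity in hierarchical federated learning. Fed-BAC combines additive cluster personalization with a two-level bandit framework: contextual bandits at the cloud handle cluster assignment, while edge servers use Thompson Sampling to select high-contributing clients. Experiments show that, compared with existing methods, Fed-BAC significantly improves accuracy and convergence speed on several classification tasks and strengthens cross-server fairness.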

Comments 9 pages, 5 figures. Accepted at the 2nd International Conference on Federated Learning and Intelligent Computing Systems (FLICS 2026), Valencia, Spain, June 9-12, 2026. To appear in IEEE proceedings

Abstract

Hierarchical federated learning (HFL) leverages edge servers for partial aggregation in edge computing. Yet existing FL methods lack mechanisms for jointly optimizing cluster assignment and client selection under data heterogeneity. This paper proposes Fed-BAC, which integrates additive cluster personalization with a two-level bandit framework: contextual bandits at the cloud learn server-to-cluster assignments, while Thompson Sampling at each edge server identifies high-contributing clients. The additive decomposition enables the sharing of knowledge between groups through a globally aggregated network, while cluster-specific networks capture distribution variations. Across three classification benchmarks (CIFAR-10, SVHN, Fashion-MNIST) under moderate (α = 0.5) and severe (α = 0.1) Dirichlet non-IID partitioning, Fed-BAC achieves distributed accuracy gains of up to +35.5pp over HierFAVG and +8.4pp over IFCA, while requiring only 80% client participation, converging 1.5 to 4.8× faster depending on dataset and accuracy target, and improving cross-server fairness. These gains are further validated at 5× deployment scale on CIFAR-10. The advantage of Fed-BAC increases with heterogeneity severity, confirming that additive cluster personalization becomes increasingly valuable as data distributions diverge.
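
The edge-level client selection can be pictured with a generic Beta-Bernoulli Thompson Sampling loop. This is a sketch under stated assumptions rather than Fed-BAC itself: the abstract does not specify the reward signal or the posterior family, so the binary "contributed" feedback, the Beta(1, 1) priors, and the `ClientSelector` class are all hypothetical.

```python
import random


class ClientSelector:
    """Beta-Bernoulli Thompson Sampling over the clients of one edge server."""

    def __init__(self, num_clients, seed=0):
        self.successes = [1.0] * num_clients  # Beta prior parameter alpha
        self.failures = [1.0] * num_clients   # Beta prior parameter beta
        self.rng = random.Random(seed)

    def select(self, k):
        """Sample one plausible contribution rate per client, keep the top k."""
        draws = [self.rng.betavariate(a, b)
                 for a, b in zip(self.successes, self.failures)]
        ranked = sorted(range(len(draws)), key=draws.__getitem__, reverse=True)
        return ranked[:k]

    def update(self, client, contributed):
        """Record whether the client's update helped (assumed binary signal)."""
        if contributed:
            self.successes[client] += 1.0
        else:
            self.failures[client] += 1.0
```

Selecting `k = 4` out of five clients per round would correspond to the 80% participation level mentioned in the abstract; clients whose updates rarely help are sampled with low rates and gradually drop out of the top k.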

2605.11814 2026-05-13 cs.AI

MedMemoryBench: Benchmarking Agent Memory in Personalized Healthcare

Yihao Wang, Haoran Xu, Renjie Gu, Yixuan Ye, Xinyi Chen, Xinyu Mu, Yuan Gao, Chunxiao Guo, Peng Wei, Jinjie Gu, Huan Li, Ke Chen, Lidan Shou

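AI summary MedMemoryBench is a benchmark for evaluating the memory capabilities of personalized healthcare agents, built to meet the demand in large-scale medical settings for memory mechanisms that are highly precise, safe, and capable of long-term tracking. The study constructs a high-fidelity dataset of medical interactions based on clinically realistic patient models and introduces a dynamic "evaluate-while-constructing" assessment method, revealing severe shortcomings of mainstream models in complex medical reasoning and noise robustness and laying a foundation for developing reliable, practical medical agents.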

Abstract

The large-scale deployment of personalized healthcare agents demands memory mechanisms that are exceptionally precise, safe, and capable of long-term clinical tracking. However, existing benchmarks primarily focus on daily open-domain conversations, failing to capture the high-stakes complexity of real-world medical applications. Motivated by the stringent production requirements of an industry-leading health management agent serving tens of millions of active users, we introduce MedMemoryBench. We develop a human-agent collaborative pipeline to synthesize highly realistic, long-horizon medical trajectories based on clinically grounded, synthetic patient archetypes. This process yields a massive, expertly validated dataset comprising approximately 2,000 sessions and 16,000 interaction turns. Crucially, MedMemoryBench departs from traditional static evaluations by pioneering an "evaluate-while-constructing" streaming assessment protocol, which precisely mirrors dynamic memory accumulation in production environments. Furthermore, we formalize and systematically investigate the critical phenomenon of memory saturation, where sustained information influx actively degrades retrieval and reasoning robustness. Comprehensive benchmarking reveals severe bottlenecks in mainstream architectures, particularly concerning complex medical reasoning and noise resilience. By exposing these fundamental flaws, MedMemoryBench establishes a vital foundation for developing robust, production-ready medical agents.

2605.11813 2026-05-13 cs.AI

Automated Reformulation of Robust Optimization via Memory-Augmented Large Language Models

Jinbiao Chen, Shuang Jin, Guoyun Zhang, Junyu Zhang, Guanyi Wang, Hanzhang Qin

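AI summary This work addresses the need in robust optimization (RO) to manually reformulate uncertain optimization models into tractable deterministic ones. The authors propose AutoREM, an experience-memory-based automated reformulation framework that requires neither parameter updates nor domain expert knowledge and builds a structured textual memory through an offline adaptation procedure to improve reformulation. They also construct the AutoRO-Bench benchmark for systematically evaluating LLM-based RO reformulation, and validate AutoREM's advantages in accuracy and efficiency across multiple datasets and base models.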

Details
English abstract

Robust optimization (RO) provides a principled framework for decision-making under uncertainty, but its practical use is often limited by the need to manually reformulate uncertain optimization models into tractable deterministic counterparts. Recent large language models (LLMs) have shown promise for automating optimization formulation, yet RO reformulation remains challenging because it requires precise multi-step reasoning and mathematically consistent transformations. To facilitate systematic evaluation of LLM-based reformulation, for which no dedicated benchmark currently exists, we develop AutoRO-Bench, a benchmark featuring an automated data generation pipeline for the core RO reformulation task and a curated dataset for the RO application task. To address the reformulation challenge, we propose Automated Reformulation with Experience Memory (AutoREM), a tuning-free memory-augmented framework that autonomously builds a structured textual experience memory by reflecting on past failed trajectories through a tailored offline adaptation procedure. AutoREM requires neither domain-specific expert knowledge nor parameter updates, and the resulting memory readily transfers across different base LLMs. Experimental results show that AutoREM consistently improves the accuracy and efficiency of RO reformulation across in-distribution datasets, out-of-distribution datasets, and diverse base LLMs.
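
To make "reformulating an uncertain model into a tractable deterministic counterpart" concrete, here is the standard textbook case of a single linear constraint under box uncertainty. It is an illustrative instance of the kind of transformation the abstract refers to, not an example taken from AutoRO-Bench; the symbols $\bar a$, $\delta$, $\rho$, and $u$ are introduced here for the illustration.

```latex
\begin{align*}
&\text{Uncertain constraint:} && (\bar a + \delta)^\top x \le b \quad \forall\, \delta \in \mathcal{U} = \{\delta : |\delta_i| \le \rho_i,\ i = 1,\dots,n\} \\
&\text{Worst case:} && \max_{\delta \in \mathcal{U}} \delta^\top x = \sum_{i=1}^{n} \rho_i |x_i| \\
&\text{Robust counterpart:} && \bar a^\top x + \sum_{i=1}^{n} \rho_i |x_i| \le b \\
&\text{Linear form:} && \bar a^\top x + \rho^\top u \le b, \quad -u \le x \le u
\end{align*}
```

Each line follows from the previous one, and a sign or index slip at any step yields a counterpart that is no longer equivalent, which is why the abstract stresses mathematically consistent multi-step transformations.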

2605.11809 2026-05-13 cs.AI

Beyond World-Frame Action Heads: Motion-Centric Action Frames for Vision-Language-Action Models

Huoren Yang, Jianchao Zhao, Hu Yusong, Qiguan Ou, Yuyang Gao, Wei Ke, Yuhang He, SongLin Dong, Zhiheng Ma, Yihong Gong

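AI summary This paper proposes MCF-Proto, a lightweight action head that improves action prediction in vision-language-action (VLA) models. Unlike conventional approaches that predict action commands directly in a fixed world coordinate frame, it introduces a Motion-Centric Action Frame (MCF): a rotation moves action prediction into a local frame, actions are parameterized with prototypes, and the result is mapped back to the world frame for end-to-end training. The method needs no additional supervision, yields stable geometric structure that emerges on its own, and makes action representations more compact and robust, particularly under geometric perturbations.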

Details
English abstract

Vision-Language-Action (VLA) models have advanced rapidly with stronger backbones, broader pre-training, and larger demonstration datasets, yet their action heads remain largely homogeneous: most directly predict action commands in a fixed world coordinate frame. We propose \textbf{MCF-Proto}, a lightweight action head that equips VLA policies with a Motion-Centric Action Frame (MCF) and a prototype-based action parameterization. At each step, the policy predicts a rotation $R_t \in SO(3)$, composes actions in the transformed local frame from a set of prototypes, and maps them back to the world frame for end-to-end training, using only standard demonstrations without auxiliary supervision. This simple design induces stable emergent structure. Without explicit directional labels, the learned local frames develop a stable geometric structure whose axes are strongly compatible with demonstrated end-effector motion. Meanwhile, actions in the learned representation become substantially more compact, with variation captured by fewer dominant directions and more regularly organized by shared prototypes. These structural properties translate into improved robustness, especially under geometric perturbations. Our results suggest that adding lightweight geometric and compositional structure to the action head can materially improve how VLA policies organize and generalize robotic manipulation behavior. An anonymized code repository is provided in the supplementary material.
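
The per-step computation described above (predict a rotation, compose an action from prototypes in the local frame, map it back to the world frame) can be sketched as follows. The abstract does not say how prototypes are combined, so the softmax-weighted sum, the translation-only 3D action, and the names `rot_z`, `compose_local`, and `to_world` are assumptions for illustration only.

```python
import math


def rot_z(theta):
    """Rotation matrix in SO(3) about the z-axis (stand-in for a predicted R_t)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def compose_local(logits, prototypes):
    """Assumed composition rule: convex combination of 3D prototype actions."""
    weights = softmax(logits)
    return [sum(w * p[d] for w, p in zip(weights, prototypes)) for d in range(3)]


def to_world(rotation, local_action):
    """Map a local-frame action back to the world frame: a_world = R_t a_local."""
    return [sum(rotation[r][c] * local_action[c] for c in range(3)) for r in range(3)]


if __name__ == "__main__":
    prototypes = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    local = compose_local([50.0, 0.0, 0.0], prototypes)  # nearly the first prototype
    print(to_world(rot_z(math.pi / 2), local))            # local x maps to world y
```

Because the rotation is applied after composition, the same prototype can express motion along different world directions, which is one plausible reading of why the learned representation becomes more compact.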

2605.11808 2026-05-13 cs.CV

Mitigating Action-Relation Hallucinations in LVLMs via Relation-aware Visual Enhancement

Zhenxin Qin, Qiang Li, Qingzhuo Wang, Ruiyang Qin, Zhihua Wei, Wen Shen

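AI summary This paper studies action-relation hallucinations in large vision-language models (LVLMs), where generated text is inconsistent with the action relations shown in the visual input. To address this, the authors propose Relation-aware Visual Enhancement (RVE), which uses an Action-Relation Sensitivity (ARS) score to locate action-relevant regions containing key visual cues and strengthens the model's attention to those regions. Experiments show that the method outperforms existing approaches at mitigating action-relation hallucinations with almost no added inference cost, and that it also generalizes well to spatial-relation and object hallucination tasks.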

Details
English abstract

Large Vision-Language Models (LVLMs) have achieved remarkable performance on diverse vision-language tasks. However, LVLMs still suffer from hallucinations, generating text that contradicts the visual input. Existing research has primarily focused on mitigating object hallucinations, but often overlooks more complex relation hallucinations, particularly action relations involving interactions between objects. In this study, we empirically observe that the primary cause of action-relation hallucinations in LVLMs is the insufficient attention allocated to visual information. Thus, we propose a framework to locate action-relevant image regions and enhance the LVLM's attention to those regions. Specifically, we define the Action-Relation Sensitivity (ARS) score to identify attention heads that are most sensitive to action-relation changes, thereby localizing action-relevant image regions that contain key visual cues. Then, we propose the Relation-aware Visual Enhancement (RVE) method to enhance the LVLM's attention to these action-relevant image regions. Extensive experiments demonstrate that, compared to existing baselines, our method achieves superior performance in mitigating action-relation hallucinations with negligible additional inference cost. Furthermore, it effectively generalizes to spatial-relation hallucinations and object hallucinations.

2605.11807 2026-05-13 cs.AI

Why Users Go There: World Knowledge-Augmented Generative Next POI Recommendation

Qiuyu Ding, Heng-Da Xu, Wei Zhang, Dongyi Lv, Changda Xia, Feng Xiong, Mu Xu

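AI summary This work targets the inability of generative point-of-interest (POI) recommendation models to perceive changing real-world conditions and proposes AWARE, an LLM-based augmentation method. It uses an LLM agent to generate location- and time-aware contextual narratives that capture regional cultural characteristics, seasonal trends, and ongoing events, and combines them with user behavioral features for personalized recommendation. Experiments on three real-world datasets show that AWARE outperforms existing methods, with up to 12.4% relative improvement.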

Details
English abstract

Generative point-of-interest (POI) recommendation models based on large language models (LLMs) have shown promising results by formulating next POI prediction as a sequence generation task. However, the knowledge encoded in these models remains fixed after training, making them unable to perceive evolving real-world conditions that shape user mobility decisions, such as local events and cultural trends. To bridge this gap, we propose AWARE (Agent-based World knowledge Augmented REcommendation), which employs an LLM agent to generate location- and time-aware contextual narratives that capture regional cultural characteristics, seasonal trends, and ongoing events relevant to each user. Rather than introducing generic or noisy information, AWARE further anchors these narratives in each user's behavioral context, grounding external world knowledge in personalized spatial-temporal patterns. Extensive experiments on three real-world datasets demonstrate that AWARE consistently outperforms competitive baselines, achieving up to 12.4% relative improvement.

2605.11804 2026-05-13 cs.LG cs.CV

Stop Marginalizing My Dreams: Model Inversion via Laplace Kernel for Continual Learning

Patryk Krukowski, Jacek Tabor, Przemysław Spurek, Marek Śmieja, Łukasz Struski

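AI summary This paper studies model inversion in data-free continual learning (DFCIL), aiming to generate high-quality pseudo-samples that mitigate catastrophic forgetting. Existing methods typically assume a diagonal covariance for feature distributions, ignoring correlations between features and producing low-quality samples. The authors propose REMIX, which models structured covariance through a Laplace kernel parameterization, capturing feature dependencies while remaining computationally efficient and improving both the fidelity of synthesized samples and DFCIL performance.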

Details
English abstract

Data-free continual learning (DFCIL) relies on model inversion to synthesize pseudo-samples and mitigate catastrophic forgetting. However, existing inversion methods are fundamentally limited by a simplifying assumption: they model feature distributions using diagonal covariance, effectively ignoring correlations that define the geometry of learned representations. As a result, synthesized samples often lack fidelity, limiting knowledge retention. In this work, we show that modeling feature dependencies is a key ingredient for effective DFCIL. We introduce REMIX, a structured covariance modeling framework that enables scalable full-covariance modeling without the prohibitive cost of dense matrix inversion and log-determinant computation. By leveraging a Laplace kernel parameterization, REMIX captures structured feature dependencies using memory that scales linearly with the feature dimensionality, while requiring only an additional logarithmic factor in computation. Modeling these correlations produces more coherent synthetic samples and consistently improves performance across standard DFCIL benchmarks. Our results demonstrate that moving beyond diagonal assumptions is essential for effective and scalable data-free continual learning. Our code is available at https://github.com/pkrukowski1/REMIX-Model-Inversion-via-Laplace-Kernel.
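
One way to see why a Laplace kernel can avoid dense inversion and log-determinant computation is the simplest one-dimensional case, where features are ordered on a line and the covariance is $K_{ij} = \exp(-|i-j|/\ell)$. That matrix is the covariance of a stationary AR(1) process, so both the quadratic form and the log-determinant have closed forms computable in linear time. The sketch below covers only this special case under that assumption; it is not REMIX's algorithm, and the names `laplace_rho`, `log_det`, `quad_form`, and `gaussian_nll` are hypothetical.

```python
import math


def laplace_rho(length_scale):
    """Correlation between neighbouring features for K_ij = exp(-|i - j| / l)."""
    return math.exp(-1.0 / length_scale)


def log_det(n, rho):
    """log det K for the n x n matrix K_ij = rho**|i - j| (closed form)."""
    return (n - 1) * math.log(1.0 - rho * rho)


def quad_form(x, rho):
    """x^T K^{-1} x in O(n), using the AR(1) factorisation of K."""
    total = x[0] * x[0]
    for prev, cur in zip(x, x[1:]):
        resid = cur - rho * prev
        total += resid * resid / (1.0 - rho * rho)
    return total


def gaussian_nll(x, rho):
    """Negative log-likelihood of x under N(0, K) without forming K."""
    n = len(x)
    return 0.5 * (quad_form(x, rho) + log_det(n, rho) + n * math.log(2.0 * math.pi))


if __name__ == "__main__":
    rho = laplace_rho(2.0)
    print(gaussian_nll([0.5, 0.7, 0.1, -0.3], rho))
```

Only the vector `x` and the scalar `rho` are stored, so memory grows linearly with the feature dimension, in contrast to the quadratic storage of a dense covariance.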

2605.11803 2026-05-13 cs.CV cs.AI

OTT-Vid: Optimal Transport Temporal Token Compression for Video Large Language Models

Minseok Kang, Minhyeok Lee, Jungho Lee, Minjung Kim, Donghyeong Kim, Dayeon Lee, Heeseung Choi, Ig-jae Kim, Sangyoun Lee

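AI summary As video large language models (Video-LLMs) handle longer and more complex videos, their inference cost rises rapidly with the number of visual tokens accumulated across frames. To address this, the paper proposes OTT-Vid, an optimal-transport-based temporal token compression method. It uses spatial pruning to identify key content in each frame and an optimal transport formulation with non-uniform token mass to estimate the compressibility of adjacent frame pairs, allocating compression budgets dynamically while protecting semantically important tokens. Experiments show that OTT-Vid preserves 95.8% of video question answering and 73.9% of temporal grounding performance while retaining only 10% of tokens, outperforming existing training-free compression methods.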

Comments 22pages, 9 figures. Code available at https://github.com/minseokii/OTT-Vid

Details
English abstract

As Video Large Language Models (Video-LLMs) scale to longer and more complex videos, their inference cost grows rapidly due to the large volume of visual tokens accumulated across frames. Training-free token compression has emerged as a practical solution to this bottleneck. However, existing temporal compression methods rely primarily on cross-frame token similarity or segmentation heuristics, overlooking each token's semantic role within its frame and failing to adapt compression strength to the compressibility of each frame pair. In this work, we propose OTT-Vid, a transport-derived allocation framework for temporal token compression. Our approach consists of two stages: spatial pruning identifies representative content within each frame, and optimal transport (OT) is then solved between neighboring frames to estimate temporal compressibility. We formulate this OT with non-uniform token mass, which protects semantically important tokens from aggressive compression, and a locality-aware cost that captures both feature and spatial disparities. The resulting transport plan jointly balances token importance and matching cost, while its total cost defines the transport difficulty of each frame pair, which we use to allocate compression budgets dynamically. Experiments on six benchmarks spanning video question answering and temporal grounding show that OTT-Vid preserves 95.8% of VQA and 73.9% of VTG performance while retaining only 10% of tokens, consistently outperforming existing state-of-the-art training-free compression methods.
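
The second stage described above (solve OT between neighboring frames with non-uniform token mass and a locality-aware cost, then use the total cost to set budgets) can be sketched with a small entropic (Sinkhorn) solver. Everything specific here is an assumption rather than the paper's formulation: the Sinkhorn regularization, the cost as feature distance plus `lam` times spatial distance, the token layout `(feature, position, importance)`, and the rule that budgets are proportional to transport difficulty.

```python
import math


def pair_cost(tok_a, tok_b, lam=0.5):
    """Assumed locality-aware cost: feature distance + lam * spatial distance."""
    return math.dist(tok_a[0], tok_b[0]) + lam * math.dist(tok_a[1], tok_b[1])


def normalized_mass(tokens):
    """Non-uniform mass: each token's importance, normalised to sum to 1."""
    total = sum(t[2] for t in tokens)
    return [t[2] / total for t in tokens]


def sinkhorn_plan(cost, mass_a, mass_b, eps=0.1, iters=200):
    """Entropic OT via Sinkhorn scaling; returns the transport plan."""
    n, m = len(mass_a), len(mass_b)
    kernel = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        u = [mass_a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [mass_b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


def transport_difficulty(frame_a, frame_b):
    """Total transport cost between two frames of (feature, position, importance)."""
    cost = [[pair_cost(a, b) for b in frame_b] for a in frame_a]
    plan = sinkhorn_plan(cost, normalized_mass(frame_a), normalized_mass(frame_b))
    return sum(plan[i][j] * cost[i][j] for i in range(len(frame_a)) for j in range(len(frame_b)))


def allocate_budgets(difficulties, total_keep):
    """Assumed rule: tokens kept per frame pair proportional to difficulty.

    Flooring may leave a few tokens of the total budget unassigned.
    """
    s = sum(difficulties)
    return [int(total_keep * d / s) for d in difficulties]


if __name__ == "__main__":
    frame_a = [((0.0, 0.0), (0.0, 0.0), 1.0), ((1.0, 0.0), (1.0, 0.0), 3.0)]
    frame_b = [((5.0, 5.0), (0.0, 0.0), 1.0), ((6.0, 5.0), (1.0, 0.0), 3.0)]
    easy = transport_difficulty(frame_a, frame_a)  # near-duplicate frames
    hard = transport_difficulty(frame_a, frame_b)  # content changed
    print(easy, hard, allocate_budgets([1.0, 3.0], 8))
```

Under this assumed rule, a frame pair that is cheap to transport is treated as redundant and receives a smaller share of the retained tokens.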

2605.11800 2026-05-13 cs.LG cs.CL

ROMER: Expert Replacement and Router Calibration for Robust MoE LLMs on Analog Compute-in-Memory Systems

Wenyong Zhou, Yuannuo Feng, Yizhe Chen, Taiqiang Wu, Wendong Xu, Wenbo Qi, Zhengwu Liu, Wang Kang, Ngai Wong

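AI summary This work addresses the robustness of mixture-of-experts (MoE) large language models on analog compute-in-memory (CIM) systems and proposes ROMER, a post-training calibration framework. The study finds that hardware noise significantly disrupts expert load balance and degrades routing decisions; ROMER replaces underactivated experts and recalibrates router logits to improve performance under noise. Experiments show that ROMER substantially reduces perplexity on several MoE models, supporting its effectiveness and broad applicability.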

Comments 11 pages, 5 figures, 4 tables

Details
English abstract

Large language models (LLMs) with mixture-of-experts (MoE) architectures achieve remarkable scalability by sparsely activating a subset of experts per token, yet their frequent expert switching creates memory bandwidth bottlenecks that compute-in-memory (CIM) architectures are well-suited to mitigate. However, analog CIM systems suffer from inherent hardware imperfections that perturb stored weights, and their negative impact on MoE-based LLMs in noisy CIM environments remains unexplored. In this work, we present the first systematic investigation of MoE-based LLMs under a noise model calibrated with real chip measurements, revealing that hardware noise critically disrupts expert load balance and renders clean-trained routing decisions consistently suboptimal. Based on these findings, we propose ROMER, a post-training calibration framework that (1) replaces underactivated experts with high-frequency ones to restore load balance, and (2) recalibrates router logits via percentile-based normalization to stabilize routing under noise. Extensive experiments across multiple benchmarks demonstrate that ROMER achieves up to 58.6\%, 58.8\%, and 59.8\% reduction in perplexity under real-chip noise conditions for DeepSeek-MoE, Qwen-MoE, and OLMoE, respectively, establishing its effectiveness and generalizability across diverse MoE architectures.

2605.11799 2026-05-13 cs.CV

SB-BEVFusion: Enhancing the Robustness against Sensor Malfunction and Corruptions

Markus Essl, Marta Moscati, Mubashir Noman, Muhammad Zaigham Zaheer, Usman Naseem, Shah Nawaz, Markus Schedl

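AI summary This paper proposes a method for making multimodal sensor fusion more robust in 3D object detection for autonomous vehicles, designing a framework-agnostic fusion module for cases where camera or LiDAR data are missing or corrupted. The module handles the failure or corruption of a single modality and is instantiated and validated in the BEVFusion framework. Experiments show strong performance across a range of sensor degradation scenarios, reaching state-of-the-art results under extreme weather and sensor failure.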

Comments Accepted at ICIP 2026

Details
English abstract

Multimodal sensor fusion has demonstrated remarkable performance improvements over unimodal approaches in 3D object detection for autonomous vehicles. Typically, existing methods transform multimodal data from independent sensors, such as camera and LiDAR, into a unified bird's-eye view (BEV) representation for fusion. Although effective in ideal conditions, this strategy suffers from substantial performance deterioration when camera or LiDAR data are missing, corrupted, or noisy. To address this vulnerability, we develop a framework-agnostic fusion module for camera and LiDAR data that allows for handling cases when one of the two modalities is missing or corrupted. To demonstrate the effectiveness of our module, we instantiate it in BEVFusion [1], a well-established framework to combine camera and LiDAR data for 3D object detection. By means of quantitative experiments on the MultiCorrupt dataset, we demonstrate that our module achieves favorable performance improvements under scenarios of missing and corrupted modalities, substantially outperforming existing unified representation approaches across a wide range of sensor deterioration scenarios and reaching state-of-the-art performance in scenarios of corrupted modality due to extreme weather conditions and sensor failure.

2605.11789 2026-05-13 cs.AI

Beyond Inefficiency: Systemic Costs of Incivility in Multi-Agent Monte Carlo Simulations

Alison Moldovan-Mauer, Benedikt Mangold

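AI summary This study examines how uncivil communication affects collaborative efficiency in multi-agent systems, building an LLM-based multi-agent system and running large-scale experiments with Monte Carlo simulation. It finds that incivility significantly increases the time agents need to reach a conclusion, and that this delay is more pronounced for models with fewer parameters. It also reveals a first-mover advantage: the agent that speaks first is more likely to win the discussion regardless of toxicity condition.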

Details
English abstract

Unconstructive debate and uncivil communication carry well-documented costs for productivity and cohesion, yet isolating their effect on operational efficiency has proven difficult. Human subject research in this domain is constrained by ethical oversight, limited reproducibility, and the inherent unpredictability of naturalistic settings. We address this gap by leveraging Large Language Model (LLM) based Multi-Agent Systems as a controlled sociological sandbox, enabling systematic manipulation of communicative behavior at scale. Using a Monte Carlo simulation framework, we generate thousands of structured 1-on-1 adversarial debates across varying toxicity conditions, measuring convergence time, defined as the number of rounds required to reach a conclusion, as a proxy for interactional efficiency. Building on a prior study, we replicate and extend its findings across two additional LLM agents of varying parameter size, allowing us to assess whether the effects of toxic behavior on debate dynamics generalize across model scale. We confirm the 25% convergence latency reported in the previous study and find that this latency is significantly larger for models with fewer parameters. We further identify a significant first-mover advantage, whereby the agent initiating the discussion wins significantly above chance regardless of toxicity condition.

2605.11782 2026-05-13 cs.CV

Urban Risk-Aware Navigation via VQA-Based Event Maps for People with Low Vision

Antoni Valls, Jordi Sanchez-Riera

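AI summary This study addresses safe, independent urban navigation for people with visual impairments, proposing a visual question answering (VQA) based event map framework that uses vision-language models (VLMs) for pedestrian scene description and hazard identification. A three-level hierarchical query structure enables fine-grained scene understanding without task-specific retraining, and model responses are aggregated into a weighted risk scoring system that produces navigable risk maps with four safety categories. The study also builds a diverse dataset covering 20 cities across six continents and shows that generative multimodal large language models perform best on this task.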

Comments 10 pages, 6 figures, submitted to IEEE T-ITS

Details
English abstract

Visual impairment affects hundreds of millions of people worldwide, severely limiting their ability to navigate urban environments safely and independently. While wearable assistive devices offer a promising platform for real-time hazard detection, existing approaches rely on task-specific vision pipelines that lack flexibility and generalizability. In this work, we propose an event map framework based on visual question answering that leverages Vision-Language Models (VLMs) for pedestrian scene description and hazard identification across diverse real-world environments, using a three-level hierarchical query structure to enable fine-grained scene understanding without task-specific retraining. Model responses are aggregated into a weighted risk scoring system that maps street segments into four discrete safety categories, producing navigable risk-aware event maps for route planning. To support evaluation and future research, we introduce a geographically diverse dataset spanning 20 cities across six continents, comprising over 800 annotated images and 18,000 answered questions. We benchmark four VQA architectures (ViLT, LLaVA, InstructBLIP, and Qwen-VL) and find that generative Multimodal Large Language Models (MLLMs) substantially outperform classification-based approaches, with Qwen-VL achieving the best overall balance of precision and recall. These results demonstrate the viability of MLLMs as a flexible and generalizable foundation for assistive navigation systems for visually impaired people.
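
The aggregation step described above (weighted risk score, then four discrete safety categories) can be sketched as follows. The hazard names, weights, thresholds, and category labels are all hypothetical placeholders; the abstract states only that the scoring is weighted and that there are four categories.

```python
# Hypothetical labels for the four safety categories, lowest risk first.
CATEGORIES = ["safe", "caution", "risky", "dangerous"]


def risk_score(answers, weights):
    """Weighted share of hazards answered 'yes' for one street segment.

    answers: dict mapping hazard name -> bool (a yes/no VQA response).
    weights: dict mapping hazard name -> positive weight (assumed values).
    Returns a score in [0, 1].
    """
    total = sum(weights.values())
    hit = sum(w for hazard, w in weights.items() if answers.get(hazard, False))
    return hit / total


def categorize(score, thresholds=(0.25, 0.5, 0.75)):
    """Map a score in [0, 1] to one of four categories via assumed thresholds."""
    for label, limit in zip(CATEGORIES, thresholds):
        if score < limit:
            return label
    return CATEGORIES[-1]


if __name__ == "__main__":
    weights = {"construction": 3.0, "no_sidewalk": 2.0, "obstacle": 1.0}
    segment = {"construction": True, "obstacle": False}
    score = risk_score(segment, weights)
    print(score, categorize(score))
```

A route planner could then treat the category as an edge cost on the street graph, so that lower-risk segments are preferred when alternatives exist.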

2605.11779 2026-05-13 cs.CL

Choosing features for classifying multiword expressions

Eric Laporte

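AI summary This paper examines how to choose suitable features for classifying multiword expressions (MWEs) effectively. The author notes that although many features are available, not all of them support reliable classification, so features must be chosen with care. The author outlines an enhanced classification and draws on previous work covering various languages to make the scheme more suitable for multilingual use.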

Details
Journal ref
Multiword expressions: Insights from a multi-lingual perspective, 2018, Language Science Press, pp.143-186
English abstract

Multiword expressions (MWEs) are a heterogeneous set with a glaring need for classifications. Designing a satisfactory classification involves choosing features. In the case of MWEs, many features are a priori available. Not all features are equal in terms of how reliably MWEs can be assigned to classes. Accordingly, resulting classifications may be more or less fruitful for computational use. I outline an enhanced classification. In order to increase its suitability for many languages, I use previous works taking into account various languages.