arXivDaily: arXiv daily academic digest, updated Monday through Friday
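2603.22017 2026-06-10 cs.LG Version update

Domain Adapted Large Language Models for Additive Manufacturing

Peter Pak, Amir Barati Farimani

Affiliations: Department of Mechanical Engineering, Carnegie Mellon University

AI summary: This paper applies domain-adaptive pretraining and instruction tuning to open-weight large language models using a small dataset of about 50 million tokens, building multimodal domain-adapted models that reach over 90% accuracy on an additive manufacturing benchmark.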

Abstract

This work presents a collection of multi-modal, domain-adapted large language models built upon the instruction-tuned variants of open-weight models (Gemma 3, Qwen 3, Gemma 4) using a relatively small dataset of around 50 million tokens. The dataset consists of open-access additive manufacturing journal articles, with data extracted for the domain-adaptive pretraining and visual instruction tuning processes. Various stages of the developed model are evaluated with the Additive-Manufacturing-Benchmark, which consists of additive manufacturing domain-specific tasks compiled from published resources. Domain-adapted and instruction-tuned models exhibit proficiency in both language- and vision-based tasks, achieving accuracies upwards of 90% in general additive manufacturing knowledge. This domain-adaptive pretraining and instruction tuning strategy outlines an accessible method for specializing large language models to a domain such as additive manufacturing.

2603.21350 2026-06-10 cs.CL Version update

Beyond Memorization: Distinguishing Between Pattern-Based and Epistemic Reasoning in LLMs Using Epistemic Puzzles

Adi Gabay, Gabriel Stanovsky, Liat Peterfreund

Affiliations: School of Computer Science and Engineering, The Hebrew University of Jerusalem

AI summary: This paper designs a two-dimensional puzzle benchmark that separates narrative familiarity from inference complexity to distinguish pattern matching from genuine epistemic reasoning in LLMs, finding that models are robust to surface changes but struggle in asymmetric settings.

Abstract

Epistemic reasoning requires agents to infer the state of the world from partial observations and information about other agents' knowledge. Prior work evaluating LLMs on epistemic puzzles often frames failures as memorization rather than reasoning. We argue that this dichotomy is too coarse for newer models: memorization is a limiting case of pattern-based reasoning, where a model matches a task to a familiar template and applies the corresponding solution. We introduce a two-dimensional benchmark over DEL-style puzzles, separating narrative familiarity from inference complexity, allowing us to distinguish pattern-based from epistemic reasoning. We find that models are substantially more robust to surface form changes than prior work suggested, yet consistently struggle in asymmetric settings where familiar patterns no longer apply and success requires tracking fragmented epistemic states.

2603.21050 2026-06-10 cs.SD Version update

ERM-MinMaxGAP: Benchmarking and Mitigating Gender Bias in Multilingual Multimodal Speech-LLM Emotion Recognition

Zi Haur Pang, Xiaoxue Gao, Tatsuya Kawahara, Nancy F. Chen

Affiliations: Kyoto University, Japan; Agency for Science, Technology, and Research (A*STAR), Singapore

AI summary: To address gender bias in emotion recognition by multilingual speech LLMs, the paper proposes a multilingual, multimodal benchmark built on MELD-ST and designs the ERM-MinMaxGAP training objective, which uses an adaptive fairness weight and a MinMaxGAP regularizer to improve performance and narrow the gender gap across English, Japanese, and German.

Comments This paper has been accepted for presentation at INTERSPEECH 2026

Abstract

Speech emotion recognition (SER) systems can exhibit gender-related performance disparities, but how such bias manifests in multilingual speech LLMs across languages and modalities is unclear. We introduce a novel multilingual, multimodal benchmark built on MELD-ST, spanning English, Japanese, and German, to quantify language-specific SER performance and gender gaps. We find that bias is strongly language-dependent and that multimodal fusion does not reliably improve fairness. To address these issues, we propose ERM-MinMaxGAP, a fairness-informed training objective that augments empirical risk minimization (ERM) with a proposed adaptive fairness weight mechanism and a novel MinMaxGAP regularizer on the maximum male-female loss gap within each language and modality. Building upon the Qwen2-Audio backbone, our ERM-MinMaxGAP approach improves multilingual SER performance by 5.5% and 5.0% while reducing the overall gender bias gap by 0.1% and 1.4% in the unimodal and multimodal settings, respectively.
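
The objective described above can be illustrated with a rough sketch. The abstract does not give the exact formula, so everything here is an assumption made for illustration: the function name `erm_minmax_gap`, the fixed `fairness_weight` (the paper uses an adaptive mechanism instead), and the choice to take the gap between mean male and mean female loss within each (language, modality) group.

```python
from collections import defaultdict

def erm_minmax_gap(losses, groups, genders, fairness_weight=1.0):
    """Mean loss plus a penalty on the largest male-female loss gap.

    losses:  per-sample loss values
    groups:  per-sample (language, modality) keys
    genders: per-sample "m" or "f" labels
    fairness_weight: hypothetical fixed weight (the paper adapts this value)
    """
    erm = sum(losses) / len(losses)

    # Collect losses per (group, gender) cell.
    cells = defaultdict(list)
    for loss, group, gender in zip(losses, groups, genders):
        cells[(group, gender)].append(loss)

    # Absolute gap between mean male and mean female loss within each group.
    gaps = []
    for group in set(groups):
        male, female = cells[(group, "m")], cells[(group, "f")]
        if male and female:
            gaps.append(abs(sum(male) / len(male) - sum(female) / len(female)))

    max_gap = max(gaps, default=0.0)
    return erm + fairness_weight * max_gap, max_gap
```

Penalizing only the largest gap focuses the regularizer on the worst-off language and modality, rather than averaging disparities away.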

2603.20850 2026-06-10 cs.CV cs.RO Version update

Glove2Hand: Synthesizing Natural Hand-Object Interaction from Multi-Modal Sensing Gloves

Xinyu Zhang, Ziyi Kou, Chuan Qin, Mia Huang, Ergys Ristani, Ankit Kumar, Lele Chen, Kun He, Abdeslam Boularias, Li Guan

Affiliations: Meta Reality Labs; Rutgers University

AI summary: Proposes the Glove2Hand framework, which translates multi-modal sensing glove videos into photorealistic bare hands while preserving physical interaction dynamics; introduces a 3D Gaussian hand model and a diffusion-based hand restorer, and creates the HandSense dataset to improve downstream task performance.

Comments CVPR 2026 Highlight. This version includes the motion retarget process in the appendix

Abstract

Understanding hand-object interaction (HOI) is fundamental to computer vision, robotics, and AR/VR. However, conventional hand videos often lack essential physical information such as contact forces and motion signals, and are prone to frequent occlusions. To address the challenges, we present Glove2Hand, a framework that translates multi-modal sensing glove HOI videos into photorealistic bare hands, while faithfully preserving the underlying physical interaction dynamics. We introduce a novel 3D Gaussian hand model that ensures temporal rendering consistency. The rendered hand is seamlessly integrated into the scene using a diffusion-based hand restorer, which effectively handles complex hand-object interactions and non-rigid deformations. Leveraging Glove2Hand, we create HandSense, the first multi-modal HOI dataset featuring glove-to-hand videos with synchronized tactile and IMU signals. We demonstrate that HandSense significantly enhances downstream bare-hand applications, including video-based contact estimation and hand tracking under severe occlusion.

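2602.01135 2026-06-10 cs.LG Version update

Your Autoregressive Model Already Reveals the Causal Graph

Hugo Math, Rainer Lienhart

Affiliations: Department of Machine Learning & Computer Vision, University of Augsburg, Augsburg, Germany

AI summary: This paper proposes the TRACE framework, which uses a pretrained autoregressive model as a density estimator for conditional independence testing to recover temporal causal graphs from a single sequence of discrete events, with significant gains on large-scale nonlinear SCMs and real-world vehicle diagnostic logs.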

Comments 8 pages

Journal ref
Structured Probabilistic Inference & Generative Modeling workshop ICML 2026

Abstract

Autoregressive models trained via next-token prediction implicitly learn the conditional independence structure of their data-generating process. We exploit this observation to perform scalable causal discovery from a single observed sequence of discrete events -- without any task-specific retraining. Such single-stream settings arise naturally in vehicle diagnostics, manufacturing systems, and patient trajectories, yet they remain largely unsolved: the absence of repeated samples, massive event vocabularies, and long-range temporal dependencies render existing methods either inaccurate or computationally intractable. We introduce TRACE, a framework that repurposes any pretrained autoregressive model as a density estimator for conditional mutual information, the fundamental primitive for conditional independence testing. By constructing parallelized CI tests on GPUs, TRACE recovers both the sample-level time causal graph and its summary projection, scaling linearly with the vocabulary size while naturally handling delayed causal effects. Crucially, we prove that minimizing the standard cross-entropy pretraining loss directly minimizes an upper bound on the causal identification error, establishing a duality between sequence prediction and causal discovery. On nonlinear SCMs (|X| = 8000) and real-world vehicle diagnostic logs (|X| = 29100), TRACE is the first applicable method at this scale, outperforming the strongest baseline by over 20 F1 points.
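
As a rough illustration of using a sequence model as a density estimator for conditional mutual information, the sketch below averages log-probability differences. It is a generic plug-in estimator, not the TRACE implementation: the function names, the `threshold` value, and the assumption that the model can score the same event with and without the candidate cause in its context are all hypothetical.

```python
import math

def cmi_estimate(logp_with_x, logp_without_x):
    """Plug-in estimate of I(X; Y | Z) in nats.

    logp_with_x[i]    = log p(y_i | x_i, z_i)
    logp_without_x[i] = log p(y_i | z_i)
    Both lists would come from scoring the same events with a pretrained
    autoregressive model, with and without the candidate cause X in context.
    """
    diffs = [a - b for a, b in zip(logp_with_x, logp_without_x)]
    return sum(diffs) / len(diffs)

def is_dependent(logp_with_x, logp_without_x, threshold=0.05):
    """Treat X and Y as conditionally dependent if the estimate exceeds a threshold."""
    return cmi_estimate(logp_with_x, logp_without_x) > threshold

# Toy usage: knowing X doubles the probability the model assigns to Y.
with_x = [math.log(0.8)] * 4
without_x = [math.log(0.4)] * 4
estimate = cmi_estimate(with_x, without_x)  # log(0.8 / 0.4) = log 2 nats
```

An estimate near zero means the extra context did not help predict the event, which is the signature of conditional independence.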

2510.04491 2026-06-10 cs.AI cs.CL Version update

Impatient Users Confuse AI Agents: High-fidelity Simulations of Human Traits for Testing Agents

Muyu He, Anand Kumar, Tsach Mackey, Meghana Rajeev, James Zou, Nazneen Rajani

Affiliations: University of California, Berkeley

AI summary: Proposes TraitBasis, a method that stress-tests AI agents by controlling user trait vectors (such as impatience or incoherence); it finds performance drops of 2%-30%, revealing how brittle current agents are to changes in user behavior.

Comments ACL 2026 [Oral]

Abstract

Despite rapid progress in building conversational AI agents, robustness is still largely untested. Small shifts in user behavior, such as being more impatient, incoherent, or skeptical, can cause sharp drops in agent performance, revealing how brittle current AI agents are. Today's benchmarks fail to capture this fragility: agents may perform well under standard evaluations but degrade spectacularly in more realistic and varied settings. We address this robustness testing gap by introducing TraitBasis, a lightweight, model-agnostic method for systematically stress testing AI agents. TraitBasis learns directions in activation space corresponding to steerable user traits (e.g., impatience or incoherence), which can be controlled, scaled, composed, and applied at inference time without any fine-tuning or extra data. Using TraitBasis, we extend τ-Bench to τ-Trait, where user behaviors are altered via controlled trait vectors. We observe on average a 2%-30% performance degradation on τ-Trait across frontier models, highlighting the lack of robustness of current AI agents to variations in user behavior. Together, these results underscore both the critical role of robustness testing and the promise of TraitBasis as a simple, data-efficient, and compositional tool. By powering simulation-driven stress tests and training loops, TraitBasis opens the door to building AI agents that remain reliable in the unpredictable dynamics of real-world human interactions. We have open-sourced τ-Trait across four domains (airline, retail, telecom, and telehealth) so the community can systematically QA their agents under realistic, behaviorally diverse intents and trait scenarios: https://github.com/collinear-ai/tau-trait.
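
The idea of scaling and composing trait directions at inference time can be sketched as simple vector arithmetic on one hidden activation. This is a generic activation-steering illustration, not the TraitBasis code: the function `steer`, the dictionary layout, and the two-dimensional toy vectors are assumptions, and how the directions are learned is not shown.

```python
def steer(hidden, trait_vectors, strengths):
    """Return hidden + sum over traits of strength * direction.

    hidden:        activation vector at one layer and token position
    trait_vectors: dict mapping trait name -> direction (same length as hidden)
    strengths:     dict mapping trait name -> scalar scale (0 disables a trait)
    """
    steered = list(hidden)
    for name, alpha in strengths.items():
        direction = trait_vectors[name]
        for i, component in enumerate(direction):
            steered[i] += alpha * component
    return steered

# Toy usage with hypothetical two-dimensional trait directions.
traits = {"impatience": [1.0, 0.0], "incoherence": [0.0, 1.0]}
mixed = steer([1.0, 0.0], traits, {"impatience": 2.0, "incoherence": -1.0})
```

Because the update is additive, several traits can be combined in one pass and each can be dialed up or down independently.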

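2507.02513 2026-06-10 cs.CV Version update

Automatic Labelling for Low-Light Pedestrian Detection

Dimitrios Bouzoulas, Eerik Alamikkotervo, Risto Ojala

Affiliations: Energy and Mechanical Engineering, Aalto University

AI summary: Proposes an automated infrared-RGB pipeline that uses infrared detections to generate labels for training low-light pedestrian detection models; on the KAIST dataset, models trained on these labels outperform models trained on ground-truth labels in most cases.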

Abstract

Pedestrian detection in RGB images is a key task in pedestrian safety, as the most common sensor in autonomous vehicles and advanced driver assistance systems is the RGB camera. Low-light pedestrian detection lacks large public datasets and auto-labelling pipelines. This research proposes a solution in the form of an automated infrared-RGB pipeline. The pipeline consists of 1) infrared detection, where a fine-tuned model for infrared pedestrian detection is used; 2) a label transfer process from the infrared detections to their RGB counterparts; and 3) training object detection models using the generated labels for low-light RGB pedestrian detection. The research was performed using the KAIST dataset. For evaluation, three object detection models, DETR, YOLO, and RCNN, were trained on generated and ground-truth labels. When compared on previously unseen images, the results showed that the models trained on generated labels outperformed the ones trained on ground-truth labels in 5 out of 6 cases for the mAP@50 and LAMR metrics, and outperformed them on mAP@50-95 in all cases. The results indicate that the proposed auto-labelling pipeline could be used for scalable annotation of low-light datasets for pedestrian detection. The source code for this research is available on GitHub: https://github.com/BouzoulasDimitrios/IR-RGB-autoamed-low-light-pedestrian-labelling
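
The label transfer step can be pictured with a small sketch. The abstract does not say how the transfer is done, so this assumes the infrared and RGB frames are spatially aligned pairs that differ only in resolution; the function `transfer_boxes` and the `min_score` cut-off are hypothetical and are not taken from the authors' repository.

```python
def transfer_boxes(ir_detections, ir_size, rgb_size, min_score=0.5):
    """Map infrared detections onto the paired RGB frame.

    ir_detections: list of (x1, y1, x2, y2, score) in infrared pixel coordinates
    ir_size, rgb_size: (width, height) of each frame
    min_score: hypothetical confidence cut-off for keeping a label
    Assumes the two frames are spatially aligned, so only scaling is needed.
    """
    sx = rgb_size[0] / ir_size[0]
    sy = rgb_size[1] / ir_size[1]
    labels = []
    for x1, y1, x2, y2, score in ir_detections:
        if score >= min_score:
            labels.append((x1 * sx, y1 * sy, x2 * sx, y2 * sy))
    return labels

# Toy usage: one confident detection is kept, one weak detection is dropped.
detections = [(10, 20, 30, 40, 0.9), (0, 0, 5, 5, 0.2)]
rgb_labels = transfer_boxes(detections, ir_size=(320, 256), rgb_size=(640, 512))
```

If the cameras were not registered, a calibration-based mapping would be needed instead of plain scaling.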

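2603.12785 2026-06-10 cs.LG math.ST stat.TH Version update

Upper Bounds for Local Learning Coefficients of Three-Layer Neural Networks

Yuki Kurumadani

Affiliations: sigmath.es.osaka-u.ac.jp (Osaka University)

AI summary: For singular parameter points of three-layer neural networks, proposes a counting rule based on budget, demand, and supply constraints to derive upper bounds on local learning coefficients; the result covers activation functions such as swish and agrees with known exact values for one-dimensional input.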

Abstract

Three-layer neural networks are known to form singular learning models, and their Bayesian asymptotic behavior is governed by the learning coefficient, or real log canonical threshold. Although this quantity has been clarified for regular models and for some special singular models, broadly applicable methods for evaluating it in neural networks remain limited. Recently, a formula for the local learning coefficient of semiregular models was proposed, yielding an upper bound on the learning coefficient. However, this formula applies only to nonsingular points in the set of realization parameters and cannot be used at singular points. In particular, for three-layer neural networks, the resulting upper bound has been shown to differ substantially from learning coefficient values already known in some cases. In this paper, we derive a formula for an upper bound on local learning coefficients at a class of singular realization parameters in three-layer neural networks. This formula can be interpreted as a counting rule under budget, demand, and supply constraints. In the non-polynomial real-analytic case, the formula applies in general settings, whereas in the polynomial case it applies under the restriction that the true distribution has no hidden units. In particular, our result covers activation functions such as the swish function and also includes polynomial activation functions under the above restriction, thereby extending previous results to a broader class of activation functions. We further show that, when the input dimension is one, the numerical value given by the right-hand side of our upper-bound formula agrees with the previously known learning coefficient, thereby providing a useful comparison with known exact results. Our result also provides a systematic perspective on how the weight parameters of three-layer neural networks affect the learning coefficient.
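
For orientation, the role the learning coefficient plays in Bayesian asymptotics can be written with the standard free-energy expansion from singular learning theory. The symbols below ($F_n$ for the Bayesian free energy on $n$ samples, $S_n$ for the empirical entropy of the true distribution, $\lambda$ for the learning coefficient, $m$ for its multiplicity, $d$ for the parameter count) are notation introduced here, not taken from the abstract.

```latex
\begin{align}
F_n &= n S_n + \lambda \log n - (m - 1)\log\log n + O_p(1), \\
\lambda &= \tfrac{d}{2}, \quad m = 1 \qquad \text{(regular model with $d$ parameters)}.
\end{align}
```

In singular models such as three-layer networks, $\lambda$ is in general no larger than $d/2$, so an upper bound on $\lambda$ bounds the leading $\log n$ term of the free energy.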

2512.06628 2026-06-10 cs.RO cs.CV Version update

MIND-V: Hierarchical World Model for Long-Horizon Robotic Manipulation with RL-based Physical Alignment

Ruicheng Zhang, Mingyang Zhang, Jun Zhou, Xiaofan Liu, Zunnan Xu, Zhizhou Zhong, Puxin Yan, Haocheng Luo, Xiu Li

Affiliations: Tsinghua University; X Square Robot; Sun Yat-sen University; HKUST

AI summary: Proposes MIND-V, a hierarchical world model that combines semantic reasoning, a behavioral semantic bridge, and motor video generation with RL-based physical alignment to synthesize physically plausible long-horizon robotic manipulation videos.

Abstract

Scalable embodied intelligence is constrained by the scarcity of diverse, long-horizon robotic manipulation data. Existing video world models in this domain are limited to synthesizing short clips of simple actions and often rely on manually defined trajectories. To this end, we introduce MIND-V, a cognitive hierarchical world model designed to synthesize physically plausible and logically coherent videos of long-horizon robotic manipulation. Inspired by cognitive science, MIND-V bridges high-level reasoning with pixel-level synthesis through three core components: a Semantic Reasoning Hub (SRH) that leverages a pre-trained vision-language model for task planning; a Behavioral Semantic Bridge (BSB) that translates abstract instructions into domain-invariant representations; and a Motor Video Generator (MVG) for conditional video rendering. MIND-V employs Staged Visual Future Rollouts, a test-time optimization strategy to enhance long-horizon robustness. To enforce adherence to physical laws, we introduce a GRPO reinforcement learning post-training phase guided by a novel Physical Foresight Coherence (PFC) reward. PFC leverages the V-JEPA2 world model as a physics referee to penalize implausible dynamics in the latent feature space. Experiments confirm MIND-V's SOTA performance in long-horizon simulation and its significant value for policy learning, introducing a scalable and fully autonomous framework for embodied data synthesis.

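2603.11917 2026-06-10 cs.CV Version update

PicoSAM3: Real-Time In-Sensor Region-of-Interest Segmentation

Pietro Bonazzi, Nicola Farronato, Stefan Zihlmann, Haotong Qin, Michele Magno

Affiliations: ETH Zürich; IBM Research

AI summary: PicoSAM3 is a lightweight, real-time in-sensor region-of-interest segmentation model that combines a dense CNN architecture, region-of-interest prompt encoding, and knowledge distillation to achieve low-latency, high-accuracy segmentation.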

Abstract

Real-time, on-device segmentation is critical for latency-sensitive and privacy-aware applications such as smart glasses and Internet-of-Things devices. We introduce PicoSAM3, a lightweight promptable visual segmentation model optimized for edge and in-sensor execution, including deployment on the Sony IMX500 vision sensor. PicoSAM3 has 1.3M parameters and combines a dense CNN architecture with region of interest prompt encoding, Efficient Channel Attention, and knowledge distillation from SAM2 and SAM3. On COCO and LVIS, PicoSAM3 achieves 65.45% and 64.01% mIoU, respectively, outperforming existing SAM-based and edge-oriented baselines at similar or lower complexity. The INT8 quantized model preserves accuracy with negligible degradation while enabling real-time in-sensor inference at 11.82ms latency on the IMX500, fully complying with its memory and operator constraints. Ablation studies show that distillation from large SAM models yields up to +14.5% mIoU improvement over supervised training and demonstrate that high-quality, spatially flexible promptable segmentation is feasible directly at the sensor level.

2603.11482 2026-06-10 cs.SD cs.CL eess.AS Version update

AnimeScore: A Preference-Based Dataset and Framework for Evaluating Anime-Like Speech Style

Joonyong Park, Jerry Li

Affiliations: Spellbrush, USA

AI summary: To address the lack of objective metrics for anime-like speech, proposes AnimeScore, a preference-ranking framework; using 15,000 pairwise judgments from 187 evaluators, acoustic analysis, and SSL-based ranking models, it achieves automatic evaluation with up to 90.8% AUC.

Comments Accepted to INTERSPEECH 2026

Abstract

Evaluating 'anime-like' voices currently relies on costly subjective judgments, yet no standardized objective metric exists. A key challenge is that anime-likeness, unlike naturalness, lacks a shared absolute scale, making conventional Mean Opinion Score (MOS) protocols unreliable. To address this gap, we propose AnimeScore, a preference-based framework for automatic anime-likeness evaluation via pairwise ranking. We collect 15,000 pairwise judgments from 187 evaluators with free-form descriptions, and acoustic analysis reveals that perceived anime-likeness is driven by controlled resonance shaping, prosodic continuity, and deliberate articulation rather than simple heuristics such as high pitch. We show that handcrafted acoustic features reach a 69.3% AUC ceiling, while SSL-based ranking models achieve up to 90.8% AUC, providing a practical metric that can also serve as a reward signal for preference-based optimization of generative speech models.
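
Learning a score from pairwise judgments, rather than from absolute ratings, can be sketched with a Bradley-Terry style preference model. The abstract only says "pairwise ranking", so the choice of Bradley-Terry, the function names, and the scalar scores are assumptions for illustration and may differ from the authors' training objective.

```python
import math

def preference_probability(score_a, score_b):
    """Bradley-Terry probability that clip A is judged more anime-like than clip B."""
    return 1.0 / (1.0 + math.exp(score_b - score_a))

def pairwise_loss(pairs):
    """Mean negative log-likelihood over (score_winner, score_loser) pairs."""
    total = 0.0
    for score_winner, score_loser in pairs:
        total -= math.log(preference_probability(score_winner, score_loser))
    return total / len(pairs)

# Toy usage: the loss shrinks as winners are scored further above losers.
loss_uninformed = pairwise_loss([(0.0, 0.0)])
loss_separated = pairwise_loss([(2.0, 0.0)])
```

Only score differences matter in this model, which matches the observation that anime-likeness has no shared absolute scale.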

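2603.10676 2026-06-10 cs.LG cs.CE Version update

Spatio-Temporal Attention Graph Neural Network: Explaining Causalities With Attention

Kosti Koistinen, Kirsi Hellsten, Joni Herttuainen, Kimmo K. Kaski

Affiliations: Aalto University School of Science, Computer Science Department

AI summary: Proposes a Spatio-Temporal Attention Graph Neural Network (STA-GNN) for unsupervised, explainable anomaly detection in industrial control systems; it models spatio-temporal dependencies with a dynamic graph, uses attention to reveal causal relationships, and applies conformal prediction to control the false alarm rate.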

Comments 33 pages, 7 figures

Abstract

Industrial Control Systems (ICS) underpin critical infrastructure and face growing cyber-physical threats due to the convergence of operational technology and networked environments. While machine learning-based anomaly detection approaches in ICS show strong theoretical performance, deployment is often limited by poor explainability, high false-positive rates, and sensitivity to evolving system behavior, i.e., baseline drift. We propose a Spatio-Temporal Attention Graph Neural Network (STA-GNN) for unsupervised and explainable anomaly detection in ICS that models both the temporal dynamics and the relational structure of the system. Sensors, controllers, and network entities are represented as nodes in a dynamically learned graph, enabling the model to capture inter-dependencies across physical processes and communication patterns. Attention mechanisms surface influential relationships, supporting inspection of correlations and potential causal pathways behind detected events. The approach supports multiple data modalities, including SCADA point measurements, network flow features, and payload features, and thus enables unified cyber-physical analysis. To address operational requirements, we incorporate a conformal prediction strategy to control false alarm rates and monitor performance degradation under environmental drift. Our findings highlight the possibilities and limitations of model evaluation and common pitfalls in anomaly detection in ICS, and they emphasise the importance of explainable, drift-aware evaluation for reliable deployment of learning-based security monitoring systems.
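
How conformal prediction can cap a false alarm rate is easiest to see in the split-conformal setting. The sketch below is the textbook construction, not necessarily the strategy used in the paper: it assumes a held-out set of anomaly scores from normal operation that is exchangeable with future normal data, and the function names are hypothetical.

```python
import math

def conformal_threshold(calibration_scores, alpha):
    """Score threshold that flags normal data with probability at most alpha.

    calibration_scores: anomaly scores from held-out normal operation
    alpha: target false alarm rate, for example 0.05
    The guarantee holds when calibration and future normal scores are exchangeable.
    """
    n = len(calibration_scores)
    rank = math.ceil((n + 1) * (1 - alpha))  # 1-based order statistic
    if rank > n:
        return math.inf  # too few calibration points for this alpha
    return sorted(calibration_scores)[rank - 1]

def is_anomaly(score, threshold):
    """Raise an alarm only when the score exceeds the calibrated threshold."""
    return score > threshold
```

Under drift the exchangeability assumption breaks, so the observed alarm rate on known-normal data drifting above `alpha` is itself a useful degradation signal.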

2603.09979 2026-06-10 cs.CL Version update

GhazalBench: Evaluating LLM Understanding and Canonical Surface-Form Access in Persian Ghazals

Ghazal Kalhor, Yadollah Yaghoobzadeh

Affiliations: School of Electrical and Computer Engineering, College of Engineering, University of Tehran; Tehran Institute for Advanced Studies, Khatam University

AI summary: Proposes the GhazalBench benchmark to evaluate LLMs' poetic understanding and canonical surface-form access in Persian ghazals; models generally understand poetic meaning but struggle to generate exact verses, recognition tasks narrow the gap, and performance is better in English, indicating that differences in training data are the key factor.

Abstract

Persian poetry plays an active role in Iranian cultural practice, where verses by canonical poets such as Hafez are frequently quoted, paraphrased, or completed from partial cues. Supporting such interactions requires language models to engage not only with poetic meaning but also with culturally canonical surface form. We introduce GhazalBench, a benchmark for evaluating how large language models (LLMs) interact with Persian ghazals under usage-grounded conditions. Unlike prior work that primarily studies memorization as a liability, GhazalBench examines settings where access to exact surface form is functionally important for culturally grounded interaction. The benchmark evaluates two complementary abilities: poem-to-prose understanding and canonical surface-form access under varying semantic and lexical cues. Across several proprietary and open-weight multilingual LLMs, we observe a consistent dissociation: models generally capture poetic meaning but struggle to produce exact verse completions in open-ended settings, while recognition-based settings substantially reduce this gap. Parallel experiments on English sonnets show markedly stronger completion performance, suggesting that these limitations are tied more to differences in training exposure than to inherent architectural constraints. Our findings highlight the need for evaluation frameworks that jointly assess meaning, form, and cue-dependent access to culturally significant texts. GhazalBench is available at https://anonymous.4open.science/r/GhazalBench/.

2603.07238 2026-06-10 cs.CL eess.AS Version update

Scaling Self-Supervised Speech Models Uncovers Deep Linguistic Relationships: Evidence from the Pacific Cluster

Minu Kim, Hoirin Kim, David R. Mortensen

Affiliations: School of Electrical Engineering, KAIST, Republic of Korea; Thomas Lord Department of Computer Science, University of Southern California, USA; Language Technologies Institute, Carnegie Mellon University, USA

AI summary: By scaling a language identification system based on self-supervised speech models from 126 to 4,017 languages, the paper finds a qualitative shift at the 4K scale that reveals a macro-cluster of genealogically unrelated Pacific languages, suggesting that large-scale models internalize multiple layers of language history.

Comments Accepted to Interspeech 2026

Abstract

Similarities between language representations derived from Self-Supervised Speech Models (S3Ms) have been observed to primarily reflect geographic proximity or surface typological similarities driven by recent expansion or contact, potentially missing deeper genealogical signals. We investigate how scaling an S3M-based language identification system from 126 to 4,017 languages reshapes this topology, and find a non-linear effect: phylogenetic recovery stays flat up to the 1K scale, but the 4K model undergoes a qualitative shift, resolving both clear lineages and long-term linguistic contact. Most strikingly, a robust Pacific macro-cluster emerges, grouping genealogically unrelated Papuan, Oceanic, and Australian languages, and we trace its driver to a concentrated encoding that captures shared acoustic signatures such as global energy dynamics. These results suggest that massive S3Ms internalize multiple layers of language history, offering a promising perspective for computational phylogenetics and the study of language contact.

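2603.05291 2026-06-10 cs.RO Version update

Online Self-Training for Co-Adaptation in Hierarchical Diffusion Policies

Clemence Grislain, Mathilde Kappel, Olivier Sigaud, Mohamed Chetouani

Affiliations: ISIR, Sorbonne Université, CNRS

AI summary: Proposes ORCHID, a self-training framework that filters trajectories via environment feedback and distills them back into the planner and controller, enabling stable online improvement of hierarchical diffusion policies; on the CALVIN benchmark, a lightweight model outperforms pure offline methods.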

Comments Accepted at ICML 2026 Workshop on Decision-Making from Offline Datasets to Online Adaptation (DEMO)

Abstract

Hierarchical policies decompose language-conditioned long-horizon robotic manipulation into a high-level (HL) planner and a low-level (LL) controller. However, effective coordination between the HL and LL components requires that both operate on compatible subgoal distributions. We propose ORCHID, a self-training framework that enables stable online improvement of hierarchical diffusion policies by aligning planning and control through iterative refinement. By filtering policy samples via environment feedback, ORCHID identifies trajectories where the planner and controller are jointly successful and distills them back into both modules via supervised learning. This process induces a bidirectional co-adaptation: the planner grounds its subgoals in the actual reaching capabilities of the controller, while the controller specializes in the trajectory structures the planner produces. By relying on supervised distillation of filtered on-policy samples, ORCHID avoids the instability typical of online hierarchical gradient-based RL training with diffusion models. On the CALVIN benchmark, ORCHID allows a lightweight, initially weak model to outperform pure offline methods, including a Vision-Language-Action model twice its size.
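
The filter-then-distill step can be outlined as a data selection routine. This is only a schematic of the idea in the abstract, not the ORCHID code: the rollout dictionary keys, the function name `filter_successful`, and the way training pairs are formed are assumptions, and the supervised update of the diffusion models is not shown.

```python
def filter_successful(rollouts):
    """Split jointly successful rollouts into planner and controller training sets.

    rollouts: iterable of dicts with keys "instruction", "subgoals",
              "actions", and "success" (the environment's feedback).
    """
    planner_set, controller_set = [], []
    for rollout in rollouts:
        if not rollout["success"]:
            continue  # keep only trajectories where planning and control worked together
        planner_set.append((rollout["instruction"], rollout["subgoals"]))
        controller_set.append((rollout["subgoals"], rollout["actions"]))
    return planner_set, controller_set

# Toy usage with hypothetical rollouts.
rollouts = [
    {"instruction": "open drawer", "subgoals": ["g1"], "actions": ["a1"], "success": True},
    {"instruction": "push block", "subgoals": ["g2"], "actions": ["a2"], "success": False},
]
planner_data, controller_data = filter_successful(rollouts)
```

Because both training sets come from the same successful trajectories, the planner only sees subgoals the controller actually reached, and the controller only sees subgoals the planner actually produced.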

2603.04852 2026-06-10 cs.AI cs.CV Version update

Non-Parametric Structural Priors for Geometry Theorem Prediction

Junbo Zhao, Ting Zhang, Can Li, Wei He, Jingdong Wang, Hua Huang

Affiliations: School of Artificial Intelligence, Beijing Normal University, Beijing, China; Engineering Research Center of Intelligent Technology and Educational Application; Beijing Key Laboratory of Artificial Intelligence for Education, Beijing, China; Baidu, Beijing, China

AI summary: To address the poor generalization of parametric models in geometry theorem prediction, proposes Theorem Precedence Graphs as non-parametric structural priors, enabling training-free theorem prediction through in-context learning and reaching 89.29% accuracy on FormalGeo7k.

详情
AI中文摘要

多步定理预测是几何问题求解中的核心挑战。现有的神经符号方法严重依赖有监督参数模型,这些模型对不断发展的定理库泛化能力有限。在这项工作中,我们通过上下文学习(ICL)的视角探索无训练定理预测。我们识别出一个关键的可扩展性瓶颈,称为结构漂移:随着推理深度的增加,普通ICL的性能急剧下降,通常降至接近零。我们将这种失败归因于LLM无法恢复潜在拓扑依赖关系,导致无结构探索。为解决此问题,我们提出定理前驱图,将历史解轨迹中的时间依赖关系编码为有向图,并施加显式拓扑约束,从而在推理过程中有效剪枝搜索空间。结合检索增强的图构建和逐步符号执行器,我们的方法使LLM能够在没有任何基于梯度的优化的情况下充当结构化规划器。在FormalGeo7k基准上的实验表明,我们的方法达到了89.29%的准确率,显著优于ICL基线,并与最先进的有监督模型相匹配。这些结果表明,显式结构先验为扩展基于LLM的符号推理提供了一个有前景的方向。

英文摘要

Multi-step theorem prediction is a central challenge in geometry problem solving. Existing neural-symbolic approaches rely heavily on supervised parametric models, which exhibit limited generalization to evolving theorem libraries. In this work, we explore training-free theorem prediction through the lens of in-context learning (ICL). We identify a critical scalability bottleneck, termed Structural Drift: as reasoning depth increases, the performance of vanilla ICL degrades sharply, often collapsing to near zero. We attribute this failure to the LLM's inability to recover latent topological dependencies, leading to unstructured exploration. To address this issue, we propose Theorem Precedence Graphs, which encode temporal dependencies from historical solution traces as directed graphs, and impose explicit topological constraints that effectively prune the search space during inference. Coupled with retrieval-augmented graph construction and a stepwise symbolic executor, our approach enables LLMs to act as structured planners without any gradient-based optimization. Experiments on the FormalGeo7k benchmark show that our method achieves 89.29% accuracy, substantially outperforming ICL baselines and matching state-of-the-art supervised models. These results indicate that explicit structural priors offer a promising direction for scaling LLM-based symbolic reasoning.
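
A minimal sketch of the precedence-graph idea, assuming that an edge is added whenever one theorem directly precedes another in a historical solution trace. The abstract does not specify the edge definition, and the theorem names below are hypothetical. The graph is then used to prune the candidate theorems an LLM may propose next.

```python
from collections import defaultdict


def build_precedence_graph(traces):
    """Directed graph: earlier theorem -> set of theorems seen right after it."""
    graph = defaultdict(set)
    for trace in traces:
        for earlier, later in zip(trace, trace[1:]):
            graph[earlier].add(later)
    return graph


def prune_candidates(graph, last_theorem, candidates):
    """Keep only candidates that have followed `last_theorem` in some trace."""
    allowed = graph.get(last_theorem, set())
    return [c for c in candidates if c in allowed]


# Hypothetical solution traces (theorem names are illustrative only).
traces = [
    ["parallel_property", "triangle_angle_sum", "sine_rule"],
    ["parallel_property", "similar_triangles"],
]
graph = build_precedence_graph(traces)
pruned = prune_candidates(
    graph,
    "parallel_property",
    ["sine_rule", "similar_triangles", "triangle_angle_sum"],
)
```

Here `sine_rule` is pruned because it never directly follows `parallel_property` in the stored traces. A retrieval step, as the abstract mentions, would choose which traces feed the graph for a given problem.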

2512.07352 2026-06-10 cs.SD 版本更新

MultiAPI Spoof: A Multi-API Dataset and Local-Attention Network for Speech Anti-spoofing Detection

MultiAPI Spoof:用于语音反欺骗检测的多API数据集和局部注意力网络

Xueping Zhang, Zhenshan Zhang, Yechen Wang, Linxi Li, Liwei Jin, Ming Li

发表机构 * Duke Kunshan University(昆山杜克大学) The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳)) OfSpectrum, Inc.(OfSpectrum公司) Digital Innovation Research Center(数字创新研究中心) School of Artificial Intelligence(人工智能学院)

AI总结 针对现有反欺骗基准与真实场景差距大的问题,构建包含30种API生成约230小时合成语音的多API数据集,并提出局部注意力增强网络Nes2Net-LA,实现最先进性能与强鲁棒性。

Comments Accepted to Interspeech 2026

详情
AI中文摘要

现有的语音反欺骗基准依赖于一组狭窄的公共模型,与商业系统使用多样化、通常专有API的真实场景之间存在巨大差距。为解决这一问题,我们引入了MultiAPI Spoof,一个多API音频反欺骗数据集,包含由30种不同API(包括商业服务、开源模型和在线平台)生成的约230小时合成语音。此外,我们提出了Nes2Net-LA,一种局部注意力增强的Nes2Net变体,改进了局部上下文建模和细粒度欺骗特征提取。基于该数据集,我们还定义了API追踪任务,能够对欺骗音频进行细粒度的生成源归因。实验表明,Nes2Net-LA实现了最先进的性能,并在多样化和未见过的欺骗条件下表现出卓越的鲁棒性。代码和数据集已发布。

英文摘要

Existing speech anti-spoofing benchmarks rely on a narrow set of public models, creating a substantial gap from real-world scenarios in which commercial systems employ diverse, often proprietary APIs. To address this issue, we introduce MultiAPI Spoof, a multi-API audio anti-spoofing dataset comprising about 230 hours of synthetic speech generated by 30 distinct APIs, including commercial services, open-source models, and online platforms. Furthermore, we propose Nes2Net-LA, a local-attention enhanced variant of Nes2Net that improves local context modeling and fine-grained spoofing feature extraction. Based on this dataset, we also define the API tracing task, enabling fine-grained attribution of spoofed audio to its generation source. Experiments show that Nes2Net-LA achieves state-of-the-art performance and offers superior robustness, particularly under diverse and unseen spoofing conditions. Code (https://github.com/XuepingZhang/MultiAPI-Spoof) and dataset (https://xuepingzhang.github.io/MultiAPI-Spoof-Dataset/) have been released.

2603.02221 2026-06-10 cs.LG cs.AI 版本更新

MedFeat: Model-Aware and Explainability-Driven Feature Engineering with LLMs for Clinical Tabular Prediction

MedFeat: 基于模型感知与可解释性驱动的LLM特征工程用于临床表格预测

Zizheng Zhang, Yiming Li, Justin Xu, Jinyu Wang, Rui Wang, Lei Song, Jiang Bian, David W Eyre, Jingjing Fu

发表机构 * Microsoft Research(微软研究院) University of Oxford(牛津大学)

AI总结 提出MedFeat框架,利用模型感知和特征重要性信号迭代引导LLM生成针对性特征,在临床表格预测中平均提升超10%。

详情
AI中文摘要

在临床表格预测中,带有特征工程的经典机器学习模型通常优于神经方法。LLM越来越多地被用于自动化这一过程,作为领域专家提出多样化的特征变换以提升下游性能。然而,现有的基于LLM的方法将特征生成与下游模型解耦:LLM未接收到关于哪些特征当前驱动预测或模型表示能力不足的信号,因此提议既不针对特征空间中有前景的区域,也不适应学习器的归纳偏差。这一缺陷在医疗数据中尤为突出,医疗数据同时表现出类别不平衡、异质特征空间和严格的可解释性要求。本文提出MedFeat,这是首个受机器学习从业者工作流程启发的特征工程框架,利用模型感知和特征重要性信号迭代地指导临床表格学习的特征发现。我们在广泛的具有挑战性的真实临床任务上评估MedFeat,并表明它在统计上显著优于最先进的基线,在具有不同归纳偏差的模型上平均提升超过10%。

英文摘要

In clinical tabular prediction, classical machine learning models with feature engineering often outperform neural methods. LLMs are increasingly used to automate this process, acting as domain experts that propose diverse feature transformations to boost downstream performance. However, existing LLM-based methods decouple feature generation from the downstream model: the LLM receives no signal about which features currently drive predictions or where the model's representational capacity falls short, so proposals are neither targeted to promising regions of the feature space nor tailored to the learner's inductive bias. This shortcoming is amplified in healthcare data, which simultaneously exhibits class imbalance, heterogeneous feature spaces, and strict interpretability requirements. In this paper, we propose MedFeat, the first feature engineering framework inspired by the workflow of machine learning practitioners, leveraging model-awareness and feature importance signals to iteratively guide feature discovery for clinical tabular learning. We evaluate MedFeat on a broad range of challenging real-world clinical tasks and show that it statistically significantly outperforms state-of-the-art baselines, with an average improvement of more than 10% over the baseline across models with distinct inductive biases.

2407.05886 2026-06-10 cs.RO 版本更新

Rod models in continuum and soft robot control: a review

连续体和软体机器人控制中的杆模型:综述

Carlo Alessi, Camilla Agabiti, Daniele Caradonna, Cecilia Laschi, Federico Renda, Egidio Falotico

发表机构 * Istituto Italiano di Tecnologia(意大利技术研究院) The BioRobotics Institute(生物机器人研究所) Department of Excellence in Robotics and AI(机器人与人工智能卓越部门)

AI总结 本文综述了杆模型在连续体和软体机器人建模与控制中的应用,涵盖数学基础、机器人建模及控制策略,并讨论了其优势、局限和未来方向。

详情
AI中文摘要

连续体和软体机器人可以变革在受限或非结构化环境中需要柔顺交互的自动化任务,包括医疗、农业、海洋和太空应用。然而,其复杂的力学特性给建模和控制带来了重大挑战。低维连续介质力学模型,如杆理论,能够有效捕捉细长体在接触丰富场景中的大变形,同时平衡精度和计算效率。本文对连续体和软体机器人的杆模型进行了纵向综述,涵盖其数学基础、机器人建模和控制应用。我们回顾了软体机器人中采用的主要杆理论,并引入了一种基于变形的杆模型分类方法。此外,我们调查了近期基于模型和基于学习的利用杆模型的控制策略,强调了它们在操作和物理交互任务中的作用。最后,我们讨论了基于杆的方法的优势、局限性、研究空白和新兴方向。本文旨在为开发连续体和软体机器人的模型和控制策略提供参考。

英文摘要

Continuum and soft robots can transform automation tasks requiring compliant interaction in constrained or unstructured environments, including healthcare, agriculture, marine, and space applications. However, their complex mechanics introduce significant challenges in modeling and control. Low-dimensional continuum mechanical models, such as rod theories, effectively capture the large deformations of slender bodies in contact-rich scenarios while balancing accuracy and computational efficiency. This paper presents a vertical survey of rod models for continuum and soft robots, spanning their mathematical foundations, robot modeling, and control applications. We review the main rod theories adopted in soft robotics and introduce a deformation-based classification of rod models for continuum and soft robots. Furthermore, we survey recent model-based and learning-based control strategies leveraging rod models, highlighting their role in manipulation and physical interaction tasks. Finally, we discuss advantages, limitations, research gaps, and emerging directions of rod-based approaches. This paper aims to serve as a reference for developing models and control strategies for continuum and soft robots.

2602.21331 2026-06-10 cs.RO 版本更新

CableRobotGraphSim: A Graph Neural Network for Modeling Partially Observable Cable-Driven Robot Dynamics

CableRobotGraphSim:一种用于建模部分可观测缆索驱动机器人动力学的图神经网络

Nelson Chen, William R. Johnson, Rebecca Kramer-Bottiglio, Kostas Bekris, Mridul Aanjaneya

发表机构 * Rutgers University(罗格斯大学) Yale University(耶鲁大学)

AI总结 提出CableRobotGraphSim,一种图神经网络模型,通过将缆索驱动机器人表示为图(刚体为节点,缆索和接触为边),仅利用部分可观测输入即可快速准确匹配其他仿真和真实机器人,并采用仿真-真实联合训练提升鲁棒性,最后集成MPPI控制器实现闭环导航。

详情
AI中文摘要

通用仿真器加速了机器人的发展。然而,基于第一性原理的传统仿真器通常需要全状态可观测性或依赖参数搜索进行系统辨识。本文提出CableRobotGraphSim,一种用于缆索驱动机器人的新型图神经网络(GNN)模型,旨在解决先前仿真方案的不足。通过将缆索驱动机器人表示为图,其中刚体作为节点,缆索和接触作为边,该模型能够快速准确地匹配其他仿真模型和真实机器人的特性,同时仅接收部分可观测输入。伴随GNN模型的是一个仿真-真实联合训练过程,该过程促进了对噪声真实数据的泛化能力和鲁棒性。该模型进一步与模型预测路径积分(MPPI)控制器集成,用于闭环导航,展示了模型的速度和准确性。

英文摘要

General-purpose simulators have accelerated the development of robots. Traditional simulators based on first principles, however, typically require full-state observability or depend on parameter search for system identification. This work presents CableRobotGraphSim, a novel Graph Neural Network (GNN) model for cable-driven robots that aims to address shortcomings of prior simulation solutions. By representing cable-driven robots as graphs, with the rigid bodies as nodes and the cables and contacts as edges, this model can quickly and accurately match the properties of other simulation models and real robots, while ingesting only partially observable inputs. Accompanying the GNN model is a sim-and-real co-training procedure that promotes generalization and robustness to noisy real data. This model is further integrated with a Model Predictive Path Integral (MPPI) controller for closed-loop navigation, which showcases the model's speed and accuracy.

2602.19086 2026-06-10 cs.CV 版本更新

Seal-Robust KCR: A Robust Kuzushiji Character Recognition Framework under Seal Interference

抗印章干扰的KCR:一种印章干扰下的鲁棒Kuzushiji(崩し字)字符识别框架

Rui-Yang Ju, Kohei Yamashita, Hirotaka Kameko, Shinsuke Mori

发表机构 * Kyoto University(京都大学)

AI总结 针对印章干扰导致Kuzushiji(崩し字)字符识别性能下降的问题,提出一种结合文档修复和合成数据增强的抗印章干扰框架,在真实和合成测试集上相对常规基线分别降低字符错误率39.7%和50.1%。

Comments Supplementary material is available at https://ruiyangju.github.io/Seal-Robust-KCR

详情
AI中文摘要

Kuzushiji(崩し字)是前现代日本最广泛使用的草书书写系统之一。由于其高度草书的形态和广泛的字形变化,大多数现代日本读者无法阅读Kuzushiji字符。因此,近年来的研究集中在开发自动化Kuzushiji字符识别(KCR)方法,这些方法在相对干净的日本历史文档图像上取得了强劲性能。尽管印章经常出现在日本历史文档中,现有方法在印章干扰下,特别是当印章与字符重叠时,往往无法保持识别精度。为了应对这一挑战,我们提出了一种抗印章干扰的KCR框架。基于字符检测、分类和排序,所提出的框架额外引入了文档修复以减轻印章干扰,从而提升整体识别性能。此外,我们引入了一种新颖的合成数据增强策略来增强字符检测模型的性能。我们进一步纠正了标注错误,重构了数据集,并创建了一个合成测试集以模拟严重的印章干扰。实验结果表明,所提出的框架在减轻印章干扰对KCR的影响方面是有效的。与常规基线和NDLkotenOCR相比,它在真实测试集上分别实现了39.7%和5.9%的相对字符错误率(CER)降低,在合成测试集上分别实现了50.1%和41.7%的降低。

英文摘要

Kuzushiji was one of the most widely used cursive writing systems in pre-modern Japan. Due to its highly cursive forms and extensive glyph variations, most modern Japanese readers are unable to read Kuzushiji characters. Consequently, recent studies have focused on developing automated Kuzushiji character recognition (KCR) methods, which have achieved strong performance on relatively clean Japanese historical document images. Although seals frequently appear in Japanese historical documents, existing methods often fail to maintain recognition accuracy under seal interference, particularly when seals overlap with characters. To address this challenge, we propose a seal-robust KCR framework. Based on character detection, classification, and ordering, the proposed framework additionally incorporates document restoration to mitigate seal interference, thereby improving overall recognition performance. In addition, we introduce a novel synthetic data augmentation strategy to enhance the performance of character detection models. We further correct annotation errors, reconstruct the dataset, and create a synthetic test set to simulate severe seal interference. Experimental results demonstrate the effectiveness of the proposed framework in mitigating the impact of seal interference on KCR. Compared with a conventional baseline and NDLkotenOCR, it achieves relative character error rate (CER) reductions of 39.7% and 5.9%, respectively, on the real test set, and 50.1% and 41.7%, respectively, on the synthetic test set.

2502.11034 2026-06-10 cs.LG 版本更新

AdaGC: Enhancing LLM Pretraining Stability via Adaptive Gradient Clipping

AdaGC: 通过自适应梯度裁剪增强LLM预训练稳定性

Guoxia Wang, Shuai Li, Congliang Chen, Jinle Zeng, Jiabin Yang, Dianhai Yu, Yanjun Ma, Li Shen

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出自适应逐张量梯度裁剪方法AdaGC,通过限制梯度范数相对于历史裁剪值的指数移动平均来消除损失尖峰,在Llama-2 7B等模型上实现零尖峰并提升下游准确率。

Comments Accepted by ICML 2026

详情
AI中文摘要

损失尖峰仍然是大规模语言模型预训练中的一个持续障碍。虽然先前的研究试图通过调查单个因素来识别损失尖峰的根本原因,但我们观察到,在实践中,这种尖峰通常是由异质因素的汇合触发的。根据经验,损失尖峰可能源于数据异常值、硬件或瞬时计算故障、数值精度问题和超参数设置的组合。无论根本原因如何,这些尖峰表现为不稳定的优化器更新,因为异常梯度污染了第一和第二矩状态。在本文中,我们提出了一种基于梯度的原则性补救措施:AdaGC,一种自适应逐张量梯度裁剪方案,通过将梯度范数限制在它们历史裁剪值的张量级指数移动平均附近来减轻这种污染。AdaGC与优化器无关,引入可忽略的内存开销,并且与GlobalGC相比降低了通信成本,特别是在混合并行分布式训练中。在Llama-2 7B、Mixtral 8x1B和ERNIE 10B-A1.4B上的实验表明,AdaGC稳健地消除了训练不稳定性,一致地将所有模型的尖峰分数降至零,并且相对于GlobalGC分别将下游准确率提高了1.32%、1.27%和2.48%。此外,AdaGC与Muon和Lion等优化器无缝集成,一致地产生更高的平均准确率和零尖峰分数。代码可在以下网址获得:https://github.com/PaddlePaddle/PaddleFleet(参见Research/AdaGC)。

英文摘要

Loss spikes remain a persistent obstacle in large-scale language model pretraining. While previous research has attempted to identify the root cause of loss spikes by investigating individual factors, we observe that, in practice, such spikes are typically triggered by the confluence of heterogeneous factors. Empirically, loss spikes may arise from a combination of data outliers, hardware or transient computational faults, numerical precision issues, and hyperparameter settings. Regardless of the underlying cause, these spikes manifest as unstable optimizer updates, as abnormal gradients contaminate both first- and second-moment states. In this paper, we propose a principled gradient-centric remedy: AdaGC, an adaptive per-tensor gradient clipping scheme that mitigates such contamination by bounding gradient norms relative to a tensor-wise exponential moving average of their historical clipped values. AdaGC is optimizer-agnostic, introduces negligible memory overhead, and reduces communication costs compared to GlobalGC, particularly in hybrid-parallel distributed training. Experiments on Llama-2 7B, Mixtral 8x1B, and ERNIE 10B-A1.4B demonstrate that AdaGC robustly eliminates training instabilities, consistently reducing spike scores to zero for all models and improving downstream accuracy over GlobalGC by 1.32%, 1.27%, and 2.48%, respectively. Furthermore, AdaGC seamlessly integrates with optimizers such as Muon and Lion, consistently yielding higher average accuracy and zero spike scores. The code is available at https://github.com/PaddlePaddle/PaddleFleet (see Research/AdaGC).
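
The per-tensor rule in this abstract (bound each gradient norm relative to an exponential moving average of that tensor's past clipped norms) can be sketched in plain Python. This is an illustrative reading, not the released code: the class name, the `rel_threshold` and `beta` values, and the choice to initialise the average from the first observed norm are all assumptions.

```python
import math


def l2_norm(vec):
    return math.sqrt(sum(x * x for x in vec))


class PerTensorAdaptiveClipper:
    """Clip each tensor's gradient against an EMA of its own clipped norms."""

    def __init__(self, rel_threshold=1.05, beta=0.98):
        self.rel_threshold = rel_threshold  # assumed hyperparameter
        self.beta = beta                    # assumed EMA decay
        self.ema = {}                       # tensor name -> EMA of clipped norm

    def clip(self, name, grad):
        norm = l2_norm(grad)
        if name not in self.ema:
            # Assumption: seed the EMA with the first observed norm, no clipping.
            self.ema[name] = norm
            return list(grad)
        limit = self.rel_threshold * self.ema[name]
        if norm > limit and norm > 0:
            scale = limit / norm
            grad = [g * scale for g in grad]
            norm = limit
        # The EMA tracks the clipped norm, so a spike cannot inflate the limit.
        self.ema[name] = self.beta * self.ema[name] + (1 - self.beta) * norm
        return list(grad)


clipper = PerTensorAdaptiveClipper()
first = clipper.clip("layer0.weight", [3.0, 4.0])    # norm 5, seeds the EMA
spike = clipper.clip("layer0.weight", [30.0, 40.0])  # norm 50, gets clipped
```

Because each tensor keeps only one scalar of state and needs no global norm, a scheme of this shape avoids the cross-device reduction that global clipping requires, which is consistent with the communication saving the abstract claims.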

2602.17907 2026-06-10 cs.CL cs.AI 版本更新

Improving Topic Modeling by Distilling Soft Labels from Language Models

通过从语言模型中蒸馏软标签改进主题建模

Raymond Li, Amirhossein Abaskohi, Chuyuan Li, Gabriel Murray, Giuseppe Carenini

发表机构 * University of Washington(华盛顿大学)

AI总结 提出DSL框架,通过从语言模型蒸馏软标签来增强主题模型训练,利用上下文感知的软标签重构信号,显著提升主题连贯性和分配准确性。

Comments 22 pages, 5 figures. Camera-ready version for ICML 2026

详情
AI中文摘要

传统的神经主题模型通常通过重构文档的词袋表示来优化,忽略了上下文信息并面临数据稀疏性问题。在这项工作中,我们引入了一种新颖的主题模型训练框架,通过从语言模型中蒸馏软标签(DSL)。为了构建上下文丰富的重构信号,我们将基于特定提示的下一个词概率投影到预定义词汇表上,并使用语言模型隐藏状态训练主题模型重构软标签。这产生了更高质量的主题,与语料库的潜在主题结构更加紧密对齐。大量实验表明,DSL在主题连贯性和分配准确性上相比现有基线取得了显著改进。此外,我们还引入了一种基于检索的指标,显示我们的方法在识别语义相似文档方面显著优于现有方法,突显了其在面向检索应用中的有效性。

英文摘要

Traditional neural topic models are typically optimized by reconstructing the document's Bag-of-Words (BoW) representations, overlooking contextual information and struggling with data sparsity. In this work, we introduce a novel topic model training framework by Distilling Soft Labels (DSL) from Language Models (LMs). To construct the contextually enriched reconstruction signals, we project the next token probabilities, conditioned on a specialized prompt, onto a pre-defined vocabulary, and train the topic models to reconstruct the soft labels using the LM hidden states. This produces higher-quality topics that are more closely aligned with the underlying thematic structure of the corpus. Extensive experiments demonstrate that DSL achieves substantial improvements in topic coherence and assignment accuracy over existing baselines. Additionally, we also introduce a retrieval-based metric, which shows that our approach significantly outperforms existing methods in identifying semantically similar documents, highlighting its effectiveness for retrieval-oriented applications.
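
As a rough illustration of the soft-label construction, the sketch below projects a next-token distribution onto a fixed vocabulary and scores a reconstruction with a soft cross-entropy. It assumes, for simplicity, that each vocabulary word corresponds to exactly one LM token; the token probabilities and vocabulary are invented, and the paper's actual projection and loss may differ.

```python
import math


def project_to_vocab(token_probs, vocab):
    """Restrict next-token probabilities to `vocab` and renormalise."""
    mass = [token_probs.get(word, 0.0) for word in vocab]
    total = sum(mass)
    if total == 0:
        raise ValueError("no probability mass falls on the vocabulary")
    return [m / total for m in mass]


def soft_cross_entropy(soft_label, predicted, eps=1e-12):
    """Cross-entropy of a predicted distribution against a soft target."""
    return -sum(p * math.log(q + eps) for p, q in zip(soft_label, predicted))


# Hypothetical next-token probabilities from an LM prompted about a document.
token_probs = {"goal": 0.3, "match": 0.1, "the": 0.4, "vote": 0.2}
vocab = ["goal", "match", "vote"]  # pre-defined topic-model vocabulary

soft_label = project_to_vocab(token_probs, vocab)
uniform = [1 / len(vocab)] * len(vocab)
loss_perfect = soft_cross_entropy(soft_label, soft_label)
loss_uniform = soft_cross_entropy(soft_label, uniform)
```

Unlike a Bag-of-Words target, the soft label can place mass on words that never occur in the document, which is one way the reconstruction signal becomes "contextually enriched".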

2602.13807 2026-06-10 cs.LG 版本更新

AnomaMind: Agentic Time Series Anomaly Detection with Tool-Augmented Reasoning

AnomaMind:基于工具增强推理的智能体时间序列异常检测

Xiaoyu Tao, Yuchong Wu, Mingyue Cheng, Ze Guo, Tian Gao

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出AnomaMind框架,将时间序列异常检测重构为顺序决策过程,通过粗到细的工作流(定位可疑区间、工具交互构建诊断证据、自我反思细化决策)结合知识记忆与数值诊断工具包,并采用混合推理机制,显著提升域内和跨域异常检测性能与泛化能力。

详情
AI中文摘要

时间序列异常检测在许多实际应用中至关重要,有效的解决方案必须定位异常区域并在复杂设置下支持可靠的决策。然而,现有大多数方法将异常检测视为具有固定特征表示的纯判别预测任务,而非基于证据的诊断过程。因此,当异常表现出强上下文依赖性、多样化模式或跨数据集领域偏移时,它们往往难以应对。为应对这些挑战,我们提出AnomaMind,一个智能体时间序列异常检测框架,将异常检测重构为顺序决策过程。AnomaMind通过粗到细的工作流运行:首先定位可疑区间,然后通过工具交互构建诊断证据,最后通过自我反思细化异常决策。该工作流由一个结合知识记忆和数值诊断的工具箱支持:从训练数据中挖掘的视觉异常模式和领域知识提供上下文指导,而统计、基于值、基于变化和区域级别的算子提供可测量的验证证据。AnomaMind进一步采用混合推理机制,其中通用模型处理灵活推理、工具调用和细化,而检测特定策略通过基于规则的奖励进行优化,以实现可解析输出、F1分数对齐和假阳性控制。在域内和跨域设置下的广泛实验表明,AnomaMind持续改善异常检测性能并增强跨异质异常模式的泛化能力,验证了工具增强推理在异常检测中的有效性。代码可在以下网址获取:https://github.com/Xiaoyu-Tao/AnomaMind-TS。

英文摘要

Time series anomaly detection is critical in many real-world applications, where effective solutions must localize anomalous regions and support reliable decision-making under complex settings. However, most existing methods frame anomaly detection as a purely discriminative prediction task with fixed feature representations, rather than an evidence-driven diagnostic process. As a result, they often struggle when anomalies exhibit strong context dependence, diverse patterns, or domain shifts across datasets. To address these challenges, we propose AnomaMind, an agentic time series anomaly detection framework that reformulates anomaly detection as a sequential decision-making process. AnomaMind operates through a coarse-to-fine workflow that first localizes suspicious intervals, then constructs diagnostic evidence through tool interaction, and finally refines anomaly decisions through self-reflection. The workflow is supported by a toolbox that combines knowledge memory and numerical diagnostics: visual anomaly patterns mined from training data and domain knowledge provide contextual guidance, while statistical, value-based, change-based, and region-level operators provide measurable evidence for verification. AnomaMind further adopts a hybrid inference mechanism in which general-purpose models handle flexible reasoning, tool invocation, and refinement, while a detection-specific policy is optimized with rule-based rewards for parsable outputs, F1-score alignment, and false-positive control. Extensive experiments under both in-domain and cross-domain settings demonstrate that AnomaMind consistently improves anomaly detection performance and enhances generalization across heterogeneous anomaly patterns, validating the effectiveness of tool-augmented reasoning for anomaly detection. The code is available at https://github.com/Xiaoyu-Tao/AnomaMind-TS.

2602.12966 2026-06-10 cs.CL cs.SE 版本更新

ProbeLLM: Automating Principled Diagnosis of LLM Failures

ProbeLLM:自动化的大语言模型故障原则性诊断

Yue Huang, Zhengzhe Jiang, Yuchen Ma, Yu Jiang, Xiangqi Wang, Yujun Zhou, Yuexing Hao, Kehan Guo, Pin-Yu Chen, Stefan Feuerriegel, Xiangliang Zhang

发表机构 * University of Notre Dame(圣母大学) LMU Munich(慕尼黑大学) Massachusetts Institute of Technology(麻省理工学院) IBM Research(IBM研究院)

AI总结 提出ProbeLLM框架,通过分层蒙特卡洛树搜索在全局探索与局部细化间分配预算,结合工具增强生成与验证,将故障发现从孤立案例提升为结构化故障模式,揭示更广泛、清晰、细粒度的故障景观。

详情
AI中文摘要

理解大语言模型(LLM)如何以及为何失败正成为一个核心挑战,因为模型快速演进而静态评估滞后。虽然动态测试生成已实现自动化探测,但现有方法常发现孤立的失败案例,缺乏对探索的原则性控制,且对模型弱点的底层结构洞察有限。我们提出ProbeLLM,一个基准无关的自动化探测框架,将弱点发现从个体失败提升到结构化故障模式。ProbeLLM将探测形式化为分层蒙特卡洛树搜索,在新故障区域的全局探索与重复错误模式的局部细化之间明确分配有限的探测预算。通过将探测限制在可验证的测试用例,并利用工具增强生成与验证,ProbeLLM将故障发现建立在可靠证据之上。发现的失败进一步通过失败感知嵌入和边界感知归纳整合为可解释的故障模式。在多种基准和LLM上,ProbeLLM揭示了比静态基准和先前自动化方法更广泛、更清晰、更细粒度的故障景观,支持从以案例为中心的评估向原则性弱点发现的转变。

英文摘要

Understanding how and why large language models (LLMs) fail is becoming a central challenge as models rapidly evolve and static evaluations fall behind. While automated probing has been enabled by dynamic test generation, existing approaches often discover isolated failure cases, lack principled control over exploration, and provide limited insight into the underlying structure of model weaknesses. We propose ProbeLLM, a benchmark-agnostic automated probing framework that elevates weakness discovery from individual failures to structured failure modes. ProbeLLM formulates probing as a hierarchical Monte Carlo Tree Search, explicitly allocating limited probing budgets between global exploration of new failure regions and local refinement of recurring error patterns. By restricting probing to verifiable test cases and leveraging tool-augmented generation and verification, ProbeLLM grounds failure discovery in reliable evidence. Discovered failures are further consolidated into interpretable failure modes via failure-aware embeddings and boundary-aware induction. Across diverse benchmarks and LLMs, ProbeLLM reveals substantially broader, cleaner, and more fine-grained failure landscapes than static benchmarks and prior automated methods, supporting a shift from case-centric evaluation toward principled weakness discovery.

2602.12542 2026-06-10 cs.LG cs.AI 版本更新

Exploring Accurate and Transparent Domain Adaptation in Predictive Healthcare via Concept-Grounded Orthogonal Inference

探索预测性医疗中基于概念正交推理的准确且透明的域适应

Pengfei Hu, Chang Lu, Feifan Liu, Yue Ning

发表机构 * Department of Computer Science, Stevens Institute of Technology, Hoboken, NJ, United States(斯蒂文斯理工学院计算机科学系) UMass Chan Medical School, University of Massachusetts Amherst, Amherst, MA, United States(马萨诸塞大学阿默斯特分校UMass Chan医学院)

AI总结 提出ExtraCare模型,通过将患者表示分解为不变和协变分量并强制正交,在保留标签信息的同时暴露域特定变异,实现准确预测并提供基于医疗概念的可解释性。

Comments Accepted by ICML 2026 Main Conference

详情
AI中文摘要

用于电子健康记录(EHR)临床事件预测的深度学习模型在不同数据分布下部署时,常常性能下降。虽然域适应(DA)方法可以缓解这种偏移,但其“黑箱”性质阻碍了在临床实践中的广泛采用,而临床实践中透明度对于信任和安全至关重要。我们提出ExtraCare,将患者表示分解为不变和协变分量。通过监督这两个分量并在训练中强制其正交性,我们的模型在保留标签信息的同时暴露域特定变异,从而实现比大多数特征对齐模型更准确的预测。更重要的是,它通过将稀疏的潜在维度映射到医疗概念,并通过目标消融量化其贡献,提供人类可理解的解释。ExtraCare在两个真实EHR数据集上,跨多个域划分设置进行评估,展示了优越的性能以及增强的透明度,其准确预测和来自广泛案例研究的解释证明了这一点。

英文摘要

Deep learning models for clinical event prediction on electronic health records (EHR) often suffer performance degradation when deployed under different data distributions. While domain adaptation (DA) methods can mitigate such shifts, their "black-box" nature prevents widespread adoption in clinical practice where transparency is essential for trust and safety. We propose ExtraCare to decompose patient representations into invariant and covariant components. By supervising these two components and enforcing their orthogonality during training, our model preserves label information while exposing domain-specific variation at the same time for more accurate predictions than most feature alignment models. More importantly, it offers human-understandable explanations by mapping sparse latent dimensions to medical concepts and quantifying their contributions via targeted ablations. ExtraCare is evaluated on two real-world EHR datasets across multiple domain partition settings, demonstrating superior performance along with enhanced transparency, as evidenced by its accurate predictions and explanations from extensive case studies.
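
One common way to encourage two representation components to be orthogonal is to penalise their squared inner product. The sketch below shows that generic penalty; the abstract does not state ExtraCare's exact loss, so the function name and the choice of a mean squared dot product are assumptions.

```python
def orthogonality_penalty(invariant, covariant):
    """Mean squared inner product between paired representation components."""
    total = 0.0
    for z_inv, z_cov in zip(invariant, covariant):
        dot = sum(a * b for a, b in zip(z_inv, z_cov))
        total += dot * dot
    return total / len(invariant)


# Two hypothetical patients with 2-dimensional components.
orthogonal = orthogonality_penalty([[1.0, 0.0], [0.0, 2.0]],
                                   [[0.0, 3.0], [1.0, 0.0]])
aligned = orthogonality_penalty([[1.0, 0.0]], [[1.0, 0.0]])
```

The penalty is zero when every invariant vector is perpendicular to its covariant counterpart and grows as the two components start encoding the same direction. In training it would be added to the supervised losses on both components.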

2602.12424 2026-06-10 cs.CL cs.AI 版本更新

RankLLM: Weighted Ranking of LLMs by Quantifying Question Difficulty

RankLLM: 通过量化问题难度对大型语言模型进行加权排名

Ziqian Zhang, Xingjian Hu, Yue Huang, Kai Zhang, Ruoxi Chen, Yixin Liu, Qingsong Wen, Kaidi Xu, Xiangliang Zhang, Neil Zhenqiang Gong, Lichao Sun

发表机构 * Lehigh University(理海大学) University of Notre Dame(圣母大学) Zhejiang Wanli University(浙江万里大学) Squirrel Ai Learning City University of Hong Kong(香港城市大学) Duke University(杜克大学)

AI总结 提出RankLLM框架,通过量化问题难度和模型能力实现细粒度评估,在35550个问题上对30个模型进行评测,与人类判断一致性达90%。

Comments 32 pages, 9 figures. Accepted by ICLR 2026

详情
AI中文摘要

基准测试建立了标准化的评估框架,以系统评估大型语言模型(LLM)的性能,促进客观比较并推动该领域的进步。然而,现有基准测试未能区分问题难度,限制了其有效区分模型能力的能力。为解决这一局限,我们提出了RankLLM,一种旨在量化问题难度和模型能力的新框架。RankLLM引入难度作为区分的主要标准,实现了对LLM能力的更细粒度评估。RankLLM的核心机制促进了模型与问题之间的双向分数传播。RankLLM的核心直觉是:当模型正确回答一个问题时,它获得一个能力分数;而当一个问题难倒模型时,其难度分数增加。利用该框架,我们在多个领域的35550个问题上评估了30个模型。RankLLM与人类判断的一致性达到90%,并且始终优于IRT等强基线。它还表现出强大的稳定性、快速收敛和高计算效率,使其成为大规模、难度感知的LLM评估的实用解决方案。

英文摘要

Benchmarks establish a standardized evaluation framework to systematically assess the performance of large language models (LLMs), facilitating objective comparisons and driving advancements in the field. However, existing benchmarks fail to differentiate question difficulty, limiting their ability to effectively distinguish models' capabilities. To address this limitation, we propose RankLLM, a novel framework designed to quantify both question difficulty and model competency. RankLLM introduces difficulty as the primary criterion for differentiation, enabling a more fine-grained evaluation of LLM capabilities. RankLLM's core mechanism facilitates bidirectional score propagation between models and questions. The core intuition of RankLLM is that a model earns a competency score when it correctly answers a question, while a question's difficulty score increases when it challenges a model. Using this framework, we evaluate 30 models on 35,550 questions across multiple domains. RankLLM achieves 90% agreement with human judgments and consistently outperforms strong baselines such as IRT. It also exhibits strong stability, fast convergence, and high computational efficiency, making it a practical solution for large-scale, difficulty-aware LLM evaluation.
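
The bidirectional intuition (a model earns competency from the questions it answers, and a question gains difficulty from the models it defeats) can be sketched as an alternating, normalised update. This is a simplified illustration, not RankLLM's published update rule: the additive smoothing term, the normalisation, and the iteration count are assumptions.

```python
def rank_models(correct, n_iters=50):
    """Alternate difficulty and competency updates on a 0/1 correctness matrix.

    correct[m][q] is truthy when model m answered question q correctly.
    """
    n_models, n_questions = len(correct), len(correct[0])
    competency = [1.0] * n_models
    for _ in range(n_iters):
        # A question is harder when more competent models fail it.
        difficulty = [
            1.0 + sum(competency[m] for m in range(n_models) if not correct[m][q])
            for q in range(n_questions)
        ]
        d_total = sum(difficulty)
        difficulty = [d / d_total for d in difficulty]
        # A model is more competent when it solves harder questions.
        competency = [
            sum(difficulty[q] for q in range(n_questions) if correct[m][q])
            for m in range(n_models)
        ]
        c_total = sum(competency) or 1.0
        competency = [c / c_total for c in competency]
    return competency, difficulty


# Models 0 and 1 both answer two questions, but model 0 solves the one
# that every other model misses.
correct = [
    [1, 0, 1],
    [1, 1, 0],
    [1, 1, 0],
    [1, 1, 0],
]
competency, difficulty = rank_models(correct)
```

Plain accuracy would tie models 0 and 1, while the propagated scores rank model 0 higher because question 2 ends up with the largest difficulty weight.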

2602.10796 2026-06-10 cs.LG 版本更新

PRISM: Parallel Residual Iterative Sequence Model

PRISM: 并行残差迭代序列模型

Jie Jiang, Ke Cheng, Xin Xu, Mengyang Pang, Tianhao Lu, Jiaheng Li, Yue Liu, Yuan Wang, Jun Zhang, Huan Yu, Zhouchen Lin

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出PRISM模型,通过写-遗忘解耦和两阶段代理架构,将迭代优化并行化,在保持表达能力的同时实现174倍吞吐量提升。

Comments 21 pages, 2 figures

详情
AI中文摘要

生成式序列建模面临Transformer的表达能力与线性序列模型的效率之间的基本矛盾。现有的高效架构在理论上受限于浅层单步线性更新,而像测试时训练(TTT)这样的强大迭代方法由于两个维度的串行依赖(token级状态依赖和步级迭代循环)破坏了硬件并行性。我们提出PRISM(并行残差迭代序列模型)来解决这一矛盾。PRISM以可并行化的形式显式重构了TTT的表达性门控×残差×方向迭代模式。我们采用写-遗忘解耦策略,将非线性隔离在注入算子内。为了绕过显式求解器的串行依赖,PRISM利用两阶段代理架构:短卷积利用局部历史能量锚定初始残差,而学习预测器直接从输入估计细化更新。该设计将与迭代校正相关的结构模式蒸馏为可并行化的前馈算子。理论上,我们证明该公式实现了Rank-$L$累积,在结构上扩展了更新流形,超越了单步Rank-$1$瓶颈。实验上,它实现了与显式优化方法相当的性能,同时实现了174倍更高的吞吐量。代码见 https://github.com/gpr-prism/prism/。

英文摘要

Generative sequence modeling faces a fundamental tension between the expressivity of Transformers and the efficiency of linear sequence models. Existing efficient architectures are theoretically bounded by shallow, single-step linear updates, while powerful iterative methods like Test-Time Training (TTT) break hardware parallelism due to two dimensions of serial dependency: token-level state reliance and step-level iteration loops. We propose PRISM (Parallel Residual Iterative Sequence Model) to resolve this tension. PRISM explicitly reconstructs the expressive gate × residual × direction iteration pattern of TTT in a parallelizable form. We employ a Write-Forget Decoupling strategy that isolates non-linearity within the injection operator. To bypass the serial dependency of explicit solvers, PRISM utilizes a two-stage proxy architecture: a short convolution anchors the initial residual using local history energy, while a learned predictor estimates the refinement updates directly from the input. This design distills structural patterns associated with iterative correction into a parallelizable feedforward operator. Theoretically, we prove that this formulation achieves Rank-$L$ accumulation, structurally expanding the update manifold beyond the single-step Rank-$1$ bottleneck. Empirically, it matches the performance of explicit optimization methods while delivering 174× higher throughput. Code is available at https://github.com/gpr-prism/prism/.

2602.07413 2026-06-10 cs.RO 版本更新

Going with the Flow: Koopman Behavioral Models as Pseudo Planners for Visuo-Motor Dexterity

随流而行:Koopman行为模型作为视觉运动灵巧性的伪规划器

Yunhai Han, Jiaqi Fu, Linhao Bai, Ziyu Xiao, Zhaodong Yang, Yogita Choudhary, Krishna Jha, Chuizheng Kong, Shreyas Kousik, Harish Ravichandar

发表机构 * Georgia Institute of Technology(佐治亚理工学院)

AI总结 提出统一行为模型(UBM),将灵巧技能建模为耦合动力系统,确保时间一致性;基于Koopman算子实现线性潜空间,通过在线重规划实现反应性和适应性,在模拟和真实任务中达到或超越现有方法。

Comments Website: https://k-ubm.github.io/

详情
AI中文摘要

当代视觉运动灵巧性模型通常依赖于具有扩散和Transformer骨干的表达性策略类来实现强性能。然而,这些架构需要大量数据和计算资源,并且远未达到可靠,特别是对于多指灵巧性。重要的是,它们将技能建模为反应性映射,并依赖于固定视界的动作分块,在时间一致性和反应性之间造成了刚性权衡。为了解决这些问题,我们首先引入统一行为模型(UBMs),这是一个将灵巧技能表示为耦合动力系统的框架,捕捉环境视觉特征(视觉流)和机器人本体感受状态(动作流)如何共同演化。因此,UBMs通过构造而非启发式平均来确保时间一致性。与试图预测任意机器人动作对环境影响的 world models 不同,UBMs 针对行为动力学,编码演示的机器人行为如何与对环境期望的影响相关。UBM 可以视为一个伪规划器:给定初始条件,它计算整个技能视界上的期望机器人行为,同时“想象”视觉特征的流。为了实现UBMs,我们提出Koopman-UBM,作为UBMs的第一个实例化,即结构化的潜在线性系统。K-UBM计算高效,通过在线重规划策略实现反应性和适应性:模型充当自身的运行时监控器,当预测和观察到的视觉流偏离超过阈值时自动触发重规划。在七个模拟任务和四个真实世界任务中,我们的方法匹配或超过了最先进基线的性能,同时提供了更快的推理、平滑的执行、对遮挡的鲁棒性和灵活的重规划。

英文摘要

Contemporary visuo-motor dexterity models often rely on expressive policy classes with diffusion and transformer backbones to achieve strong performance. However, these architectures require significant data and computational resources, and remain far from reliable, particularly for multi-fingered dexterity. Importantly, they model skills as reactive mappings and rely on fixed-horizon action chunking, creating a rigid trade-off between temporal coherence and reactivity. To address these issues, we first introduce Unified Behavioral Models (UBMs), a framework to represent dexterous skills as coupled dynamical systems that capture how visual features of the environment (visual flow) and proprioceptive states of the robot (action flow) co-evolve. As such, UBMs ensure temporal coherence by construction rather than heuristic averaging. Unlike world models that attempt to predict the impact of arbitrary robot actions on the environment, UBMs target behavioral dynamics that encode how demonstrated robot behavior is related to desired impacts on the environment. A UBM can be viewed as a pseudo planner: given an initial condition, it computes the desired robot behavior over the entire skill horizon, while simultaneously "imagining" the resulting flow of visual features. To operationalize UBMs, we propose Koopman-UBM, a first instantiation of UBMs as a structured latent linear system. K-UBM is computationally efficient, enabling reactivity and adaptation via an online replanning strategy: the model acts as its own runtime monitor, automatically triggering replanning when predicted and observed visual flow diverge beyond a threshold. Across seven simulated tasks and four real-world tasks, our approach matches or exceeds the performance of state-of-the-art baselines, while offering considerably faster inference, smooth execution, robustness to occlusions, and flexible replanning.
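
The two mechanisms named in this abstract, a latent linear rollout over the whole horizon and a divergence-triggered replan, can be sketched as follows. The matrix `K`, the two-dimensional latent state, and the threshold are made-up values, and the learned lifting from images and proprioception into the latent space is omitted; this only illustrates the control flow.

```python
import math


def matvec(K, z):
    return [sum(k * x for k, x in zip(row, z)) for row in K]


def rollout(K, z0, horizon):
    """Predict the full latent trajectory z_{t+1} = K z_t from one initial state."""
    trajectory = [list(z0)]
    for _ in range(horizon):
        trajectory.append(matvec(K, trajectory[-1]))
    return trajectory


def needs_replan(predicted, observed, threshold):
    """Runtime monitor: replan when prediction and observation diverge."""
    return math.dist(predicted, observed) > threshold


# Hypothetical 2-D latent system (position-like and velocity-like coordinates).
K = [[1.0, 0.1],
     [0.0, 1.0]]
plan = rollout(K, [0.0, 1.0], horizon=3)

on_track = needs_replan(plan[1], [0.1, 1.0], threshold=0.05)
disturbed = needs_replan(plan[1], [0.5, 1.0], threshold=0.05)
```

When `needs_replan` fires, the same `rollout` would be rerun from the newly observed state. Because the dynamics are linear, each replan costs only a few matrix-vector products, which fits the fast-inference claim.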

2602.09809 2026-06-10 cs.CV 版本更新

SciFlow-Bench: Evaluating Structure-Aware Scientific Diagram Generation via Inverse Parsing

SciFlow-Bench:通过逆解析评估结构感知的科学图表生成

Tong Zhang, Honglin Lin, Zhou Liu, Chong Chen, Wentao Zhang

发表机构 * Peking University(北京大学) Shanghai Jiao Tong University(上海交通大学) Huawei Cloud BU(华为云业务部) Zhongguancun Academy(中关村学院) Beijing Key Laboratory of Data Intelligence and Security (Peking University)(北京数据智能与安全重点实验室(北京大学))

AI总结 提出SciFlow-Bench基准,通过逆解析将生成的图表图像转换为结构化图进行比较,以结构可恢复性而非视觉相似性评估科学图表生成。

详情
AI中文摘要

科学图表传达显式的结构信息,然而现代文本到图像模型通常生成视觉上合理但结构错误的结果。现有基准要么依赖图像中心或主观指标,对结构不敏感,要么评估中间符号表示而非最终渲染图像,导致基于像素的图表生成研究不足。我们引入SciFlow-Bench,一个结构优先的基准,用于直接从像素级输出评估科学图表生成。基于真实科学PDF构建,SciFlow-Bench将每个源框架图与规范真值图配对,并在闭环往返协议下将模型作为黑盒图像生成器进行评估,该协议将生成的图表图像逆解析回结构化图以进行比较。该设计通过结构可恢复性而非仅视觉相似性进行强制评估,并由一个协调规划、感知和结构推理的分层多智能体系统实现。实验表明,保持结构正确性仍然是一个基本挑战,特别是对于具有复杂拓扑的图表,强调了结构感知评估的必要性。

英文摘要

Scientific diagrams convey explicit structural information, yet modern text-to-image models often produce visually plausible but structurally incorrect results. Existing benchmarks either rely on image-centric or subjective metrics insensitive to structure, or evaluate intermediate symbolic representations rather than final rendered images, leaving pixel-based diagram generation underexplored. We introduce SciFlow-Bench, a structure-first benchmark for evaluating scientific diagram generation directly from pixel-level outputs. Built from real scientific PDFs, SciFlow-Bench pairs each source framework figure with a canonical ground-truth graph and evaluates models as black-box image generators under a closed-loop, round-trip protocol that inverse-parses generated diagram images back into structured graphs for comparison. This design enforces evaluation by structural recoverability rather than visual similarity alone, and is enabled by a hierarchical multi-agent system that coordinates planning, perception, and structural reasoning. Experiments show that preserving structural correctness remains a fundamental challenge, particularly for diagrams with complex topology, underscoring the need for structure-aware evaluation.
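
Once a generated image has been inverse-parsed into a graph, it can be compared with the ground-truth graph structurally. The abstract does not name the benchmark's metric, so the edge-level F1 below, along with the label normalisation and the example node names, is an assumed, simplified stand-in.

```python
def edge_f1(predicted_edges, reference_edges):
    """F1 over directed edges after lower-casing and trimming node labels."""
    def normalise(edges):
        return {(src.strip().lower(), dst.strip().lower()) for src, dst in edges}

    pred, ref = normalise(predicted_edges), normalise(reference_edges)
    true_positives = len(pred & ref)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(ref)
    return 2 * precision * recall / (precision + recall)


# Hypothetical ground-truth graph and a graph parsed from a generated image.
reference = [("Encoder", "Fusion"), ("Fusion", "Decoder"), ("Decoder", "Output")]
parsed = [("encoder", "fusion"), ("fusion", "decoder"), ("encoder", "output")]
score = edge_f1(parsed, reference)
```

The parsed diagram here may look plausible, yet it wires `encoder` straight to `output`, so one of its three edges is wrong and one reference edge is missing. A pixel-similarity metric would not necessarily register that error.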