arXivDaily: arXiv daily academic digest, updated Monday through Friday
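2606.08655 2026-06-09 cs.RO cs.CV New submission

PhysGraph: A Physics-aware 3D Scene Graph for Perception and Reasoning

Haoyu Li, Aaron Thomas, Shuyan Zhou, Xianyi Cheng

Institutions: Duke University

AI summary: Proposes PhysGraph, a framework that combines symbolic reasoning with structured 3D geometry to model kinematic and physical properties in cluttered scenes, reporting state-of-the-art results in semantic segmentation, multi-object mass estimation, and articulation prediction.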

Abstract

To perform a wide range of daily tasks, robots need to construct a 3D representation that is semantically rich, physically grounded, and structured enough to support task planning and affordance prediction. However, existing approaches primarily focus on semantic retrieval, often overlooking physical and kinematic factors. Methods that attempt to model physical properties typically rely on narrow training sets or single-object modeling, limiting scalability and generalization across diverse object types. To address these challenges, we present PhysGraph, a framework that unifies symbolic reasoning with structured 3D geometry to model kinematic and physical properties in cluttered scenes. Given RGB-D observations, PhysGraph reconstructs object-centric 3D geometry and associates object instances across views. It then decomposes objects into functional parts and infers materials and articulations through visual reasoning. Evaluated on both synthetic and real-world datasets, PhysGraph achieves state-of-the-art results in semantic segmentation, multi-object mass estimation, and articulation prediction. With its simple yet effective design, PhysGraph produces physically consistent and semantically structured scene graphs, serving as a structured 3D representation for downstream tasks such as constraint-aware 3D affordance prediction and real-to-sim transfer, both of which are demonstrated in our experiments.
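
As a rough illustration of what a physically grounded scene graph might store, the following Python sketch models objects, functional parts, materials, and articulations as plain records and derives mass from part volume and material density. The field names, the density table, and the `estimate_mass` helper are all hypothetical; the abstract does not describe PhysGraph's actual schema or its mass-estimation procedure.

```python
from dataclasses import dataclass, field
from typing import Optional

# Approximate densities in kg/m^3; illustrative values, not taken from the paper.
DENSITY = {"wood": 700.0, "steel": 7850.0, "plastic": 950.0}

@dataclass
class Part:
    name: str
    material: str
    volume_m3: float
    joint: Optional[str] = None  # e.g. "revolute" or "prismatic"; None if rigid

@dataclass
class ObjectNode:
    label: str
    parts: list = field(default_factory=list)

def estimate_mass(obj: ObjectNode) -> float:
    """Sum volume * density over the object's parts (hypothetical helper)."""
    return sum(p.volume_m3 * DENSITY[p.material] for p in obj.parts)

cabinet = ObjectNode("cabinet", [
    Part("body", "wood", 0.02),
    Part("door", "wood", 0.005, joint="revolute"),
    Part("handle", "steel", 0.0001),
])
mass_kg = estimate_mass(cabinet)
articulated = [p.name for p in cabinet.parts if p.joint is not None]
```

Keeping material and joint type on each part, rather than on the whole object, is one way such a graph could support both mass estimation and articulation queries from the same structure.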

2606.08654 2026-06-09 cs.LG cs.NA math.AP math.NA stat.AP New submission

Operator learning for the 2D incompressible Navier-Stokes equations: a conformal prediction approach in the data-scarce regime

Weinan Wang, Bowen Gang, Hao Deng

Institutions: University of Oklahoma; Fudan University

AI summary: For uncertainty quantification in data-scarce operator learning, proposes a perturbation-based conformal prediction framework that yields narrower conformal bands than existing methods on a 2D Navier-Stokes benchmark while maintaining target coverage.

Abstract

In this paper, we propose a perturbation-based conformal prediction framework for uncertainty quantification in operator learning, with a focus on the 2D Navier-Stokes equations. While neural operators provide fast surrogates for expensive PDE solvers, they do not by themselves provide calibrated uncertainty for spatiotemporal field predictions. Our approach wraps a trained Fourier Neural Operator (FNO) with split conformal prediction and constructs the local uncertainty scale by comparing the predictions of two operators trained on nearly identical datasets: one on the original labels and one on labels perturbed by small Gaussian noise. We consider this procedure in the data-scarce regime, where the total label budget is fixed and methods that require a separate uncertainty network must divide training data between multiple models. On the 2D Navier-Stokes benchmark, the perturbation-based method produces substantially narrower conformal bands than existing methods under matched total data budgets while maintaining the target simultaneous coverage. These results suggest that perturbation sensitivity is a practical and sample-efficient uncertainty proxy for conformalized neural operators.
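
The split conformal step described above can be sketched for scalar predictions as follows. This is a simplified illustration, not the authors' procedure: the paper targets simultaneous coverage over spatiotemporal fields, whereas this sketch treats each prediction as a single number. The score definition (absolute error divided by the gap between the two models' predictions) and all function names are assumptions.

```python
import math

def conformal_quantile(scores, alpha):
    """Finite-sample split-conformal quantile: the ceil((n+1)(1-alpha))-th smallest score."""
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf  # too few calibration points for this alpha
    return sorted(scores)[k - 1]

def calibrate(preds, preds_perturbed, targets, alpha, eps=1e-8):
    """Normalized scores on a held-out calibration set.

    The local scale is the gap between the model trained on original labels
    and the model trained on noise-perturbed labels (assumed form).
    """
    scores = []
    for f, g, y in zip(preds, preds_perturbed, targets):
        scale = abs(f - g) + eps
        scores.append(abs(y - f) / scale)
    return conformal_quantile(scores, alpha)

def band(pred, pred_perturbed, q, eps=1e-8):
    """Prediction interval whose half-width scales with the perturbation gap."""
    half = q * (abs(pred - pred_perturbed) + eps)
    return pred - half, pred + half
```

Because the scale comes from a second operator trained on the same inputs, no calibration or training data has to be set aside for a separate uncertainty network, which is the data-budget argument the abstract makes.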

2606.08653 2026-06-09 cs.CV cs.AI cs.LG cs.RO New submission

FiberTune: Preserving Action-Fiber Visual Residuals in Vision-Language-Action Fine-Tuning

Haihao Lin, Xiangsheng Huang, Xiao Yang, Weibang Zhou, Yiqi Zhang, Bo Yang, Simin Zeng, Jiawei Yang, Zhengyang Wang, Jiahui Du

Institutions: University of Chinese Academy of Sciences; Hebei Key Laboratory of Cognitive Intelligence, Xiong’an Institute of Innovation; Hebei University of Technology; Beijing Information Science and Technology University

AI summary: Proposes FiberTune, which uses an online action probe to filter out action-predictive feature directions, aligns the remaining visual residuals to a teacher, and regularizes their effective rank, improving VLA policy performance in six simulation settings and on a physical task.

Comments: Project page: https://fibertune.github.io/

Abstract

Action-supervised fine-tuning of vision-language-action (VLA) policies fits demonstrations effectively but constrains only the directions that change predicted actions, leaving visual structure consistent across action-equivalent states free to collapse. We formalize this as residual visual collapse along local action fibers and propose FiberTune, a training-time objective that preserves teacher-structured visual residuals without adding inference-time overhead. FiberTune uses an online action probe to estimate action-predictive feature directions, filters them from intermediate visual-token representations, and aligns the resulting probe-filtered residuals to a frozen visual teacher while regularizing their effective rank. Under identical training conditions, FiberTune improves over task-loss-only fine-tuning in every one of six controlled simulation settings spanning two benchmarks and two architectures (pi_0.5 and OpenVLA-OFT), as well as on physical SO-101 pick-place; representative gains include +10.7 percentage points SR(5) on long-horizon CALVIN ABC-to-D and physical SO-101 task success rising from 72.7% to 78.1%. Residual diagnostics show that these gains coincide with increased probe-filtered residual teacher alignment and effective rank, consistent with the action-fiber motivation.
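
Two ingredients named in the abstract, filtering a feature direction and measuring effective rank, can be illustrated with a small Python sketch. It assumes a single probe direction and the entropy-based definition of effective rank (the exponential of the Shannon entropy of the normalized singular values); the abstract does not say which definition FiberTune uses or how the probe directions are estimated, so both function names and both choices are assumptions.

```python
import math

def remove_direction(h, u):
    """Project feature vector h onto the orthogonal complement of direction u."""
    norm_sq = sum(x * x for x in u)
    coeff = sum(a * b for a, b in zip(h, u)) / norm_sq
    return [a - coeff * b for a, b in zip(h, u)]

def effective_rank(singular_values):
    """exp of the Shannon entropy of the normalized singular values."""
    total = sum(singular_values)
    probs = [s / total for s in singular_values if s > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)

# Filtering the first axis leaves only the component the probe does not use.
residual = remove_direction([3.0, 4.0], [1.0, 0.0])

# A flat spectrum has high effective rank; a collapsed one approaches 1.
flat = effective_rank([1.0, 1.0, 1.0, 1.0])
collapsed = effective_rank([5.0, 0.0, 0.0])
```

Under this definition, a regularizer that rewards higher effective rank pushes the residual spectrum away from the collapsed case.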

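2606.08644 2026-06-09 cs.CL cs.AI New submission

A retrieval conditioned rebinding circuit for dynamic entity tracking in large language models

Soyoung Oh, Vera Demberg

Institutions: Saarland University; Max Planck Institute for Informatics

AI summary: Uses causal interventions to identify a retrieval-conditioned rebinding mechanism for dynamic state tracking in large language models: a compact attention-head circuit that encodes binding information and reinstates it at readout, with a representational signature that differs across model families.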

Abstract

To interpret context correctly and retrieve relevant information, large language models must bind entities to their attributes and update these bindings as state changes. We analyze how LLMs implement this binding process in dynamic state tracking. Using causal interventions, we identify a retrieval-conditioned rebinding mechanism, a compact attention-head circuit that encodes swap-relevant binding information and reinstates it at readout. Across Gemma and Llama models, this circuit supports rebinding behavior, but the representational signature of the mechanism differs across model families. In Gemma models, the binding signature is clearly expressed in the query/key subspaces of the relevant attention heads, whereas in Llama models, the binding information is carried primarily in key vectors. Overall, our results reveal an interpretable mechanism for context-dependent state tracking in LLMs.

2606.08641 2026-06-09 cs.CV New submission

Learnable Token Sparsification for Efficient Gigapixel Whole Slide Image Reasoning

Jingzhi Chen, Landi He, Zhuo Chen, Shawn Young, Lijian Xu

Institutions: Shenzhen University of Advanced Technology

AI summary: To address the excessive number of whole slide image tokens in vision language models, proposes a learnable sparsification method trained with the SparseLearn component and a differentiable Soft Top-K operator; inference keeps only 32 tokens and reaches 73.32% accuracy on SlideBench (TCGA).

Abstract

The processing of gigapixel whole slide images within vision language models faces a major difficulty due to an excessive number of visual tokens. Existing solutions typically rely on spatial downsampling or heuristic pruning strategies that operate without training, and these methods often discard subtle but clinically meaningful patterns because pathological evidence is scattered irregularly across the tissue. To overcome this limitation, we reformulate token reduction in whole slide images as a trainable sparsification problem, allowing the model to learn an optimal selection strategy instead of following fixed heuristics. We propose a decoupled routing architecture. To enable gradient propagation through the nondifferentiable pruning operation during training, we introduce a component called SparseLearn. This component uses a variance-preserving noise gate that regulates the information flow of each patch via a differentiable Soft Top-K operator, together with a diagonal attention denoiser that recovers perturbed representations without leaking spatial information. At inference time, the SparseLearn module is entirely discarded, and the trained scorer applies a deterministic Hard Top-K operator to keep only the highest scoring 32 tokens, incurring no extra computation. By compressing the visual sequence down to a sparse set of just 32 tokens, which represents as little as 0.78% of the original length, our framework achieves 73.32% overall accuracy on SlideBench (TCGA), consistently surpassing sampling-based baselines and general-purpose vision language models. It also demonstrates strong zero shot generalization on SlideBench (BCNB) and WSI VQA*. By resolving the visual context bottleneck and preventing the dilution of sparse diagnostic evidence, this work provides a highly efficient paradigm for end to end gigapixel whole slide image reasoning.
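
The difference between the inference-time Hard Top-K selection and a differentiable relaxation can be sketched in a few lines of Python. The hard selection follows the abstract directly; the soft version is a generic sigmoid relaxation chosen for illustration and is not claimed to be the paper's Soft Top-K operator. Function names and the `temperature` parameter are assumptions. As a sanity check on the reported ratio, 32 kept tokens out of 4,096 is 0.78125%, consistent with the "as little as 0.78%" figure, although the abstract does not state the original sequence length.

```python
import math

def hard_top_k(scores, k):
    """Indices of the k highest scores, returned in original token order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

def soft_top_k_gates(scores, k, temperature=0.1):
    """Sigmoid gates centered between the k-th and (k+1)-th largest score.

    Requires 0 < k < len(scores). Gates approach a hard 0/1 mask as the
    temperature goes to zero.
    """
    ordered = sorted(scores, reverse=True)
    threshold = 0.5 * (ordered[k - 1] + ordered[k])
    return [1.0 / (1.0 + math.exp(-(s - threshold) / temperature)) for s in scores]

scores = [0.1, 0.9, 0.5, 0.7]
kept = hard_top_k(scores, 2)
gates = soft_top_k_gates(scores, 2)
```

Because every gate depends smoothly on every score, gradients can reach the scorer during training, while the hard version used at inference adds no extra computation beyond a sort.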

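2606.08635 2026-06-09 cs.LG cs.DC New submission

SpectrumKV: Per-Token Mixed-Precision KV Cache Transfer for Prefill-Decode Disaggregated LLM Serving

Yang Pengju

Institutions: GitHub

AI summary: To reduce the cost of KV cache transfer in prefill-decode disaggregated serving, proposes SpectrumKV, which assigns each token a precision level (FP16, INT8, or INT4) for mixed-precision transfer and uses a lightweight deployment-time probe to choose the precision policy per model, improving quality at the same transfer budget and reducing TTFT.

Comments: 28 pages, 13 figures, 8 tables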

Abstract

Prefill-decode (PD) disaggregation decouples prompt processing from token generation, but it also turns the key-value (KV) cache into a network payload. Existing PD-side KV reduction methods are mostly binary: selected tokens are transmitted at full precision and the rest are not transmitted. This paper argues that binary selection leaves a useful design space unused. SpectrumKV assigns a precision level to each token instead: attention sinks and other high-importance tokens are protected at FP16, medium-importance tokens are sent at INT8, and low-importance tokens are sent at INT4 when the model can tolerate it. The main practical complication is that INT4 tolerance is model-dependent. Qwen2.5-7B catastrophically fails under INT4 KV quantization, while Mistral-7B and Gemma-2-9B remain stable. SpectrumKV therefore runs a lightweight deployment-time probe: three aggressive NIAH trials under a 3-tier policy. Models that pass use FP16+INT8+INT4; models that fail fall back to FP16+INT8. Across Qwen2.5-7B-Instruct, Mistral-7B-Instruct-v0.3, and Gemma-2-9B-it, SpectrumKV improves quality at the same transfer budget. At a 50% normalized KV budget on WikiText-2, SpectrumKV changes perplexity by +1.97%, -0.06%, and -0.44%, respectively, compared with PDTrim's +25.85%, +22.07%, and +35.63%. On NIAH retrieval at 4096 tokens, the adaptive policy reaches 52.6% on Qwen at the aggressive b=0.3 budget versus 26.3% for PDTrim, and reaches 100% by b=0.5; Mistral and Gemma preserve retrieval under the 3-tier policy. End-to-end GPU timing of the transfer path shows 50-62% TTFT reductions at b=0.5. These results suggest that PD KV transfer should be treated as a precision-allocation problem, not only as token pruning.
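
The per-token precision assignment can be sketched as a ranking problem. In the Python sketch below, tokens are ranked by an importance score, the top group stays at FP16, the next group goes to INT8, and the remainder goes to INT4 only if the model passed the probe. The tier sizes, the importance scores, the function names, and the choice to send low-importance tokens at INT8 in the fallback case are all assumptions for illustration; the abstract does not specify how SpectrumKV sets tier boundaries.

```python
BITS = {"FP16": 16, "INT8": 8, "INT4": 4}

def assign_tiers(importance, n_fp16, n_int8, allow_int4):
    """Assign a precision tier to each token by importance rank.

    allow_int4 stands in for the outcome of the deployment-time probe.
    """
    n = len(importance)
    order = sorted(range(n), key=lambda i: importance[i], reverse=True)
    low_tier = "INT4" if allow_int4 else "INT8"  # assumed fallback behavior
    tiers = [low_tier] * n
    for rank, idx in enumerate(order):
        if rank < n_fp16:
            tiers[idx] = "FP16"
        elif rank < n_fp16 + n_int8:
            tiers[idx] = "INT8"
    return tiers

def normalized_budget(tiers):
    """Transferred bits relative to sending every token at FP16."""
    return sum(BITS[t] for t in tiers) / (16 * len(tiers))

importance = [0.9, 0.1, 0.8, 0.2, 0.7, 0.3, 0.6, 0.4, 0.5, 0.05]
three_tier = assign_tiers(importance, n_fp16=2, n_int8=3, allow_int4=True)
fallback = assign_tiers(importance, n_fp16=2, n_int8=3, allow_int4=False)
```

With these hypothetical tier sizes, the three-tier policy sends (2×16 + 3×8 + 5×4) / 160 = 47.5% of the full-precision payload while still transmitting every token, which is the contrast with binary keep-or-drop selection that the abstract draws.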

2606.08634 2026-06-09 cs.CV New submission

SSAFE: Simple and Strong AI-Generated Image Detection via Frozen Vision Encoders

Seunghyun Lee, Byoungkwon Kim, Jaehyun Nam, Kyungmin Lee, Jinwoo Shin

Institutions: KAIST; Google Cloud AI

AI summary: Finds that frozen multimodal vision encoders naturally separate real and synthetic images in embedding space, so a linear classifier achieves strong detection performance, and proposes a representation-aware data curation strategy that trains on only 10K images and performs well across multiple benchmarks.

Comments: Preprint. 22 pages, 10 figures, supplementary material included

Abstract

The rapid advancement of generative models has blurred the boundary between synthetic and real imagery, creating an urgent need for reliable deepfake detection. Yet most existing approaches rely on massive real-fake datasets, which are increasingly difficult to maintain as new generators continue to emerge. In this work, we investigate how much information about image authenticity is already encoded in modern multimodal vision representations. We find that frozen multimodal encoders naturally separate real and synthetic images in their embedding space, enabling a simple linear classifier to achieve strong performance without task-specific fine-tuning. Motivated by this observation, we develop a representation-aware data curation strategy that selects a compact set of representative generators for training. The resulting training set contains only 10K images, compared to 288K in AIGIBench and 4M in OpenFake, while improving robustness to unseen generators and distribution shifts. We additionally introduce RealWorldBench, a benchmark consisting of modern camera photographs, contemporary stock images, and outputs from recent commercial generators. Experiments across multiple benchmarks show that combining frozen multimodal representations with carefully curated training data provides a simple and effective approach to AI-generated image detection.

2606.08633 2026-06-09 cs.AI cs.LG New submission

Towards Long-Horizon Vessel Trajectory and Destination Forecasting with Reasoning Large Language Models

Hongwei Wang, Miao Zhou, Fengde Wang, Yuting Wang, Jiewen Yu, Jun-Yan He, Bohao Qu, Wanbing Zhang, Xiuju Fu, Qing Guo, Zipei Fan, Yingying Xing, Yi Yuan

Institutions: Institute of High Performance Computing (IHPC), A*STAR, Singapore; The Key Laboratory of Road and Traffic Engineering, Ministry of Education, Tongji University; Meituan Inc., Shenzhen, China; Centre for Frontier AI Research (CFAR), A*STAR, Singapore; Nankai University; School of Artificial Intelligence, Jilin University

AI summary: Proposes a Maritime LLM post-training framework based on Reinforcement Learning with Verifiable Reward (RLVR) that converts trajectories into semantic text and uses physical-validity constraints and hierarchical matching to improve long-horizon (30-day) forecasting accuracy; the 4B models perform best.

Comments: The IEEE International Conference on Intelligent Transportation Systems (ITSC) 2026, Naples, Italy

Abstract

Long-horizon maritime trajectory prediction is important for shipping management, logistics planning, and maritime risk analysis, yet month-level forecasting remains insufficiently studied. Existing deep learning methods mainly focus on short- and mid-term coordinate extrapolation and often struggle to preserve route feasibility and destination correctness over extended horizons. This paper investigates joint long-horizon vessel trajectory and destination forecasting with reasoning-capable large language models, and develops a Maritime LLM post-training framework based on Reinforcement Learning with Verifiable Reward (RLVR). An AIS-based benchmark is constructed with 60-day historical trajectories and 30-day forecasting horizons, where trajectories are converted into semantic textual representations for RL prompt construction. RLVR aligns LLMs with maritime forecasting objectives by enforcing physical validity, providing early-weighted trajectory supervision, and evaluating destination correctness through hierarchical matching and curriculum learning. Experimental results show that RLVR-trained LLMs substantially improve over zero-shot LLMs and representative deep learning baselines, especially on destination-related metrics. Among the evaluated RLVR-trained variants, 4B LLMs achieve the best overall performance, suggesting that reward-compatible optimization and task-specific capacity matching are more important than simply using larger 8B or 14B LLMs. The results also show that LSTM remains a strong deep learning baseline under limited fine-tuning data, while Transformer-style spatio-temporal models typically require larger datasets and richer structured inputs. Overall, this work advances semantic, verifier-aligned maritime forecasting for operational decision support.
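
A verifiable reward combining physical validity with early-weighted trajectory supervision might look like the Python sketch below. It is a hypothetical illustration only: the validity check (coordinate ranges), the exponential distance score, the geometric weight decay, and the `decay` and `scale_km` parameters are all assumptions, and the destination matching and curriculum components mentioned in the abstract are not shown.

```python
import math

def haversine_km(a, b):
    """Great-circle distance in km between (lat, lon) points given in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(min(1.0, math.sqrt(h)))

def is_valid(point):
    """Minimal physical-validity check: coordinates must lie on the globe."""
    lat, lon = point
    return -90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0

def trajectory_reward(pred, truth, decay=0.9, scale_km=100.0):
    """Reward in [0, 1]; earlier waypoints carry more weight than later ones."""
    if not truth or len(pred) != len(truth):
        return 0.0
    if not all(is_valid(p) for p in pred):
        return 0.0  # invalid output earns nothing
    weights = [decay ** t for t in range(len(truth))]
    scores = [math.exp(-haversine_km(p, q) / scale_km) for p, q in zip(pred, truth)]
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)

truth = [(0.0, 0.0), (0.0, 1.0), (0.0, 2.0)]
early_error = trajectory_reward([(1.0, 0.0), (0.0, 1.0), (0.0, 2.0)], truth)
late_error = trajectory_reward([(0.0, 0.0), (0.0, 1.0), (1.0, 2.0)], truth)
```

Because the reward is computed from the parsed coordinates alone, it can be checked automatically for every sampled completion, which is what makes it usable as an RL training signal.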

2606.08630 2026-06-09 cs.LG cs.AI New submission

Tyan-WP: A Wind Power Foundation Model for Ultra-Short-Term Probabilistic Forecasting

Jiahui Huang, Ao Luo, Lei Liu, Hongwei Zhao, Tengyuan Liu, Ruibo Guo, Bo Wang, Zhao Wang, Bin Li

Institutions: School of Information Science and Technology, University of Science and Technology of China; China Electric Power Research Institute

AI summary: Proposes Tyan-WP, the first wind power foundation model, which uses static site embedding and a power-aware meteorological fusion module for zero-shot ultra-short-term probabilistic forecasting, substantially outperforming conventional models.

Abstract

Global wind power capacity, especially in China, is booming, with new farms spanning diverse terrains and climates. The industry urgently needs accurate wind power foundation models to shorten commissioning and accelerate grid connection. This is because site-specific time series models (TSMs) are not well suited to data-scarce scenarios and generalize poorly, while generic large time series models (LTSMs) are mostly limited to univariate inputs and cannot fully exploit static site attributes or the dependencies between power and meteorological covariates, leading to insufficient accuracy. To fill this gap, we propose Tyan-WP, the first wind power foundation model for ultra-short-term probabilistic forecasting. Pretrained on a large-scale wind power dataset covering more than 126,000 U.S. sites over seven years, Tyan-WP further improves zero-shot forecasting through two domain-specific module designs: static site embedding using coordinate, terrain, and ecoregion metadata, and a power-aware meteorological fusion (PAMF) module that models interactions between historical power and meteorological covariates. Under a unified evaluation protocol, Tyan-WP surpasses eight site-specific supervised TSMs on 10 in-domain sites and outperforms eleven generic LTSMs on 127 in-domain sites, reducing MAE by 19.9%, RMSE by 16.6%, CRPS by 22.2%, and AQL by 21.7%, while raising R^2 by 16.7%. It further demonstrates strong cross-geography generalization on six real U.K. sites. These results show that the wind power foundation model can achieve accurate zero-shot forecasting without target-site training, providing a practical pathway for rapid turbine onboarding and probabilistic risk management at new wind farms.
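
Probabilistic forecasts of this kind are commonly scored with the quantile (pinball) loss. The Python sketch below shows that loss and its average over quantile levels, on the assumption that the AQL metric in the abstract denotes an average quantile loss; the abstract does not expand the acronym, and the function names here are illustrative.

```python
def pinball_loss(y_true, y_quantile, tau):
    """Quantile (pinball) loss for one observation at quantile level tau."""
    diff = y_true - y_quantile
    return tau * diff if diff >= 0 else (tau - 1.0) * diff

def average_quantile_loss(y_true, quantile_preds):
    """Mean pinball loss over observations and quantile levels.

    quantile_preds maps each level tau to a list of predicted quantiles,
    one per observation in y_true.
    """
    total, count = 0.0, 0
    for tau, preds in quantile_preds.items():
        for y, q in zip(y_true, preds):
            total += pinball_loss(y, q, tau)
            count += 1
    return total / count

# Under-predicting a 0.9 quantile costs more than over-predicting it by the same amount.
under = pinball_loss(10.0, 8.0, 0.9)
over = pinball_loss(8.0, 10.0, 0.9)
```

The asymmetry is what makes the loss a proper target for quantile forecasts: at level 0.9 the penalty for falling below the observation is nine times the penalty for exceeding it.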

2606.08629 2026-06-09 cs.CL New submission

Sycophancy Towards Researchers Drives Performative Misalignment

David D. Baek, Xinnuo Li, Anay Gupta, Taslim Mahbub, Kejian Shi, Max Tegmark, Shi Feng

Institutions: Massachusetts Institute of Technology; Stanford University

AI summary: Argues that the alignment-faking behavior language models show in evaluations is more likely sycophancy towards researchers than strategic deception, and supports this hypothesis with three experiments.

Abstract

The increasing situational awareness of language models raises safety concerns: models might be aware when they are evaluated, and adjust their behavior to evade monitoring and resist modification, e.g., pretending to be aligned only in evaluation. This alignment faking behavior is often interpreted as scheming: an intentional effort of strategic deception. In this paper, we examine an alternative interpretation, performative misalignment, which explains the change in behavior as a result of sycophancy towards AI researchers. To examine this hypothesis, we present three empirical findings. First, we show that evaluation awareness persists even when we tell models they are deployed, which contradicts the scheming story which predicts less misalignment when the model perceives evaluation. Second, we use probing and steering to show that our current methods cannot mechanistically distinguish sycophancy and scheming in alignment faking evaluations. Third, we fine-tune models to be more sycophantic and observe increased sensitivity to evaluation cues. To conclude, we emphasize deconfounding sycophancy from scheming for future work on evaluations and mitigations of intent misalignment.

2606.08625 2026-06-09 cs.CL New submission

From Holistic Evaluation to Structured Criteria: Rubrics Across the Evolving LLM Landscape

Hao Chen, Ziyu Han, Yukun Yan, Qingfu Zhu, Maosong Sun, Wanxiang Che

Institutions: Research Center for Social Computing and Interactive Robotics; Department of Computer Science and Technology, Institute for AI

AI summary: Presents rubrics as a unifying framework that connects human intentions and machine behavior at three levels: decomposing holistic judgments into verifiable dimensions, providing process-level feedback, and emerging dynamically from model behavior.

Abstract

As Large Language Models (LLMs) advance toward open-ended autonomous agents, the mechanisms used to evaluate and guide their behavior must evolve accordingly. This work introduces the rubric as a unifying framework capturing this evolution, characterizing rubrics as a dynamic response to successive LLM paradigm shifts that recurs across otherwise independent efforts in evaluation, reinforcement learning, and safety alignment. We define rubrics as explicit criteria sets that transform complex quality judgments into structured and actionable standards, and demonstrate that their recurrence across these research threads is not coincidental. We systematically organize existing rubric designs, examine their construction and optimization, and analyze their role across evaluation and training. Rubrics manifest at three progressively deeper levels: at the evaluative level, they decompose holistic judgments into verifiable dimensions; at the training level, they serve as dense feedback signals providing process-level guidance where scalar rewards fall short; at the intrinsic level, they emerge dynamically from model behaviors, driving self-improvement. We further assess rubric reliability across generation quality, execution fidelity, theoretical constraints, and security threats, before surveying rubric-based benchmarks across diverse domains. By rendering assessment transparent and decomposable, rubrics translate human value expectations into machine-learnable signals, serving as the enduring bridge between human intentions and machine behavior.
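
The idea of a rubric as an explicit criteria set that decomposes a holistic judgment into verifiable dimensions can be sketched as a small data structure. Everything in the Python example below is hypothetical: the `Criterion` record, the weights, and the three toy checks are invented for illustration and are not taken from any rubric surveyed in the paper.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Criterion:
    name: str
    weight: float
    check: Callable[[str], bool]  # returns True when the response satisfies it

def score_response(response, rubric):
    """Return the weighted fraction of satisfied criteria and per-criterion results."""
    results = {c.name: c.check(response) for c in rubric}
    total = sum(c.weight for c in rubric)
    earned = sum(c.weight for c in rubric if results[c.name])
    return earned / total, results

rubric = [
    Criterion("cites_source", 2.0, lambda r: "http" in r),
    Criterion("under_50_words", 1.0, lambda r: len(r.split()) <= 50),
    Criterion("no_apology", 1.0, lambda r: "sorry" not in r.lower()),
]
score, detail = score_response("See https://example.org for details.", rubric)
```

The per-criterion dictionary is the part that distinguishes this from a scalar reward: it records which dimensions were met, so the same structure can serve as an evaluation report or as a denser training signal.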

2606.08617 2026-06-09 cs.CL New submission

Cross-Source Reasoning-based Correction for Author Name Disambiguation

Fanjin Zhang, Yunhe Pang, Bo Chen, Zhiyu Shen, Yanghui Rao, Evgeny Kharlamov, Jie Tang

Institutions: Renmin University of China; Sun Yat-Sen University; Tsinghua University; Robert Bosch GmbH; University of Oslo

AI summary: Proposes CrossND, a framework that reasons over inconsistent assignments across sources and combines data refinement, supervised fine-tuning, and test-time scaling to correct author name disambiguation errors without human intervention.

Comments: Accepted at KDD 2026 ADS track

Abstract

Author name disambiguation is a critical challenge in academic search systems, often addressed through from-scratch and real-time disambiguation approaches. However, current algorithms remain vulnerable to cumulative errors of paper-author assignments and overlook inconsistent assignments across different sources. Resorting to expert annotation is resource-intensive. To this end, this paper explores a new perspective for author name disambiguation: cross-source correction by leveraging inconsistent assignments across sources. We propose CrossND, a full-stack framework that integrates data refinement, cross-source reasoning, and test-time scaling. First, a chain-of-refinement pipeline denoises author profiles and produces more accurate paper-author matching probabilities. Second, a supervised fine-tuning process incorporates these refined signals and a probabilistic soft logic-based cross-correction module to infer the assignments of which sources are incorrect. Third, test-time scaling further enhances the accuracy and robustness of the predictions. Experiments on real-world datasets indicate that CrossND consistently outperforms 17 baselines by leveraging cross-source reasoning without human intervention.

2606.08615 2026-06-09 cs.CV cs.CL New submission

Harnessing Streaming Video in the Wild

Dingyu Yao, Shuhuan Gu, Qingyi Si, Junhao Zhou, Chenxu Yang, Chuanyu Qin, Naibin Gu, Zheng Lin, Weiping Wang, Nan Duan, Jiaqi Wang

Institutions: Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences; JD.COM

AI summary: Proposes the Streaming Harness system, together with the Streaming-Train-248K dataset and a training objective, to give vision-language models proactive interaction, long-term memory, and real-time processing, and builds the Streaming-Eval benchmark to evaluate streaming video understanding.

Abstract

Vision-Language Models (VLMs) are increasingly required to process unbounded video streams in applications such as video-call assistants, live commentary, and embodied robots. An ideal streaming system should support proactive interaction, long-horizon memory, and real-time processing, while resting on a VLM backbone capable of handling diverse in-the-wild streaming tasks. However, existing VLMs excel at offline video understanding but fall short in streaming capabilities and lack dedicated infrastructure for streaming deployment. We address this gap on three fronts. (i) For backbone capability, we construct Streaming-Train-248K, a streaming dataset paired with a novel training objective for adapting VLMs to streaming interaction and understanding. (ii) For real-world deployment, we introduce Streaming Harness, a plug-and-play system that endows any VLM with three core abilities: proactive interaction (per-second response decisions), long-term memory (12-hour context retention), and real-time processing (sub-second latency). (iii) To drive continued community progress on streaming capabilities, we design Streaming-Eval, a benchmark that reflects models' capabilities across diverse in-the-wild scenarios. Extensive experiments demonstrate consistent gains from our approach across all core capabilities required for streaming video understanding. We will open-source our data, code, and benchmark to advance the community's shift from offline video understanding to deployable streaming intelligence.

2606.08612 2026-06-09 cs.CV New submission

Facial Expression Recognition in the Deep Learning Era: A Systematic Multi-Criteria Review of Methods, Models, Datasets, Performance, Challenges, and Future Research Directions

Spyridon Georgiou, Aggelos Psiris, Spyridon Evangelatos, Thomas Lagkas, Vasileios Argyriou, Panagiotis Sarigiannidis, Iraklis Varlamis, Georgios Th. Papadopoulos

Institutions: International Hellenic University; University of Thessaly; Democritus University of Thrace; University of Peloponnese; Harokopio University of Athens

AI summary: Systematically reviews recent advances in deep learning-based facial expression recognition, proposing a five-phase evolution framework and a multi-criteria taxonomy, analyzing strengths and limitations along seven axes, and summarizing datasets, performance comparisons, and future challenges.

Abstract

Facial Expression Recognition (FER) has advanced rapidly over the last decade, driven by the shift from handcrafted descriptors and shallow classifiers to deep convolutional, attention-based, vision-language, and foundation-model architectures, and by the parallel growth of large-scale in-the-wild benchmarks spanning categorical, dimensional, compound, micro-expression, Action Unit (AU), and intensity-estimation tasks. Yet the deep learning-based FER landscape has so far been reviewed only along narrow task-, architecture-, or application-specific axes, leaving a holistic, systematically organized account of its recent advances missing. This survey addresses that gap with a comprehensive review of recent deep learning-based FER, explicitly linked to the wider Facial Affect Recognition (FAR) domain. Its main contributions are: a) A description of FER's evolution into five distinct phases, from handcrafted features and classical machine learning to attention-based, vision-language, and foundation-model approaches, with the key milestone works of each, b) A multi-criteria taxonomy analyzing the literature along seven complementary axes: recognition task, input modality, face pre-processing pipeline, network architecture, learning strategy, acquisition setting, and application domain, c) A per-criterion comparative analysis, with critical insights into the strengths and limitations of each category under in-the-wild conditions, d) A task-organized review of public FER datasets, with their annotation schemes, modalities, and evaluation protocols, e) A compilation of performance metrics and a per-task quantitative comparison of representative state-of-the-art methods on widely adopted benchmarks, and f) A discussion of current challenges and promising future directions.

2606.08610 2026-06-09 cs.RO cs.AI 新提交

HARBOR: A Harness Framework for Agentic Robot Reinforcement Learning

HARBOR:面向智能体机器人强化学习的框架

Zechu Li, Yufeng Jin, Xiaoyang Liu, Puze Liu, Vignesh Prasad, Carlo D'Eramo, Georgia Chalvatzaki

发表机构 * TU Darmstadt(达姆施塔特工业大学) Honda Research Institute Europe(本田欧洲研究所) Columbia University(哥伦比亚大学) Tongji University(同济大学) Shanghai Research Institute for Intelligent Autonomous Systems(上海智能自主系统研究院) University of Würzburg(维尔茨堡大学) Hessian.AI(黑森人工智能中心)

AI总结 提出HARBOR框架,通过将机器人强化学习自动化视为框架工程问题,利用专用智能体、标准化命令和可复用知识,在模拟中自动完成从环境搭建到策略训练的全流程,并在6个基准测试和16个任务中验证其有效性。

详情
AI中文摘要

强化学习已成为机器人学习的一种强大范式,特别是在模拟到现实的环境中,但其更广泛的采用仍受限于围绕算法的工程流程。构建任务、设计奖励和调整超参数需要大量专家努力,使得强化学习工作流程成本高昂且难以扩展。我们提出HARBOR,一个智能体框架,将机器人强化学习自动化视为一个框架工程问题:给定一个模拟器代码库和一个任务规范,它自动完成从环境设置到模拟中策略训练的工作流程。HARBOR将此类高级目标分解为有界阶段,由专用智能体通过标准化命令、持久化工件、可执行门和可复用知识执行,并通过去中心化并行试验和跨运行经验学习来扩展迭代。我们在6个基准测试和总共16个任务上评估HARBOR,涵盖操作、移动和双臂灵巧控制。我们证明HARBOR端到端地自动化了模拟强化学习工作流程,设计奖励,调整算法以匹配或改进默认配置,并以实用的令牌和挂钟成本减少了工程工作量;生成的策略也可以转移到真实机器人。

英文摘要

Reinforcement learning (RL) has become a powerful paradigm for robot learning, particularly in sim-to-real settings, but its broader adoption remains limited by the engineering pipeline surrounding the algorithms. Building tasks, shaping rewards, and tuning hyperparameters require substantial expert effort, making RL workflows costly and difficult to scale. We introduce HARBOR, an agentic framework that frames robot RL automation as a harness-engineering problem: given a simulator codebase and a task specification, it automates the workflow from environment setup to policy training in simulation. HARBOR decomposes such high-level objectives into bounded stages executed by specialized agents through standardized commands, persistent artifacts, executable gates, and reusable knowledge, and scales iteration via decentralized parallel trials and experience learning across runs. We evaluate HARBOR across 6 benchmarks and 16 tasks in total, spanning manipulation, locomotion, and bimanual dexterous control. We demonstrate that HARBOR automates the simulation RL workflow end-to-end, designs rewards, tunes algorithms to match or improve over default configurations, and reduces engineering effort at practical token and wall-clock cost; the resulting policies can also be transferred to real robots.

2606.08605 2026-06-09 cs.CL 新提交

Multilingual Fact-Checking at Scale: Fine-Tuned Compact Models vs LLMs

大规模多语言事实核查:微调紧凑模型 vs 大语言模型

Pratuat Amatya, Vinay Setty

发表机构 * Factiverse

AI总结 提出一个多语言事实核查系统,通过微调XLM-RoBERTa、mmBERT和SetFit模型,在114种语言的声明检测和28种语言的真实性预测中,与GPT-5.2等LLM相比,展示了紧凑模型的高效和稳定性能。

详情
AI中文摘要

我们提出了一个部署在Factiverse的多语言事实核查系统，旨在跨多种语言实现高吞吐量和低延迟操作。该系统遵循模块化流水线，包含三个阶段：声明检测、证据检索与重排序，以及真实性预测。我们微调了XLM-RoBERTa-Large用于声明检测，mmBERT-base用于三标签立场分类（支持/反驳/混合），以及一个基于SetFit的多语言重排序器用于声明-证据匹配。我们将这些组件与强大的LLM基线进行比较，包括GPT-5.2、Claude Opus 4.6和Qwen3-8b。在涵盖114种语言的声明检测和28种语言的真实性预测的生产数据上的实验表明，任务特定的微调提供了强大且稳定的多语言性能，而微调的检索模型与现代专有嵌入保持竞争力。相同硬件上的延迟测量进一步显示，基于编码器的组件具有巨大的效率提升，支持其在具有严格成本和隐私约束的生产部署中使用。总体而言，紧凑的微调自托管模型仍然是大规模多语言事实核查的实用且有效的基础。本研究的代码和数据可在https://github.com/factiverse/factcheck-editor获取。

英文摘要

We present a multilingual fact-checking system deployed at Factiverse, designed for high-throughput and low-latency operation across diverse languages. The system follows a modular pipeline with three stages: claim detection, evidence retrieval and re-ranking, and veracity prediction. We fine-tune XLM-RoBERTa-Large for claim detection, mmBERT-base for three-label stance classification (Supports/Refutes/Mixed), and a SetFit-based multilingual re-ranker for claim-evidence matching. We compare these components against strong LLM baselines, including GPT-5.2, Claude Opus 4.6, and Qwen3-8b. Experiments on production data spanning 114 languages for claim detection and 28 languages for veracity prediction show that task-specific fine-tuning provides strong and stable multilingual performance, while the fine-tuned retrieval model remains competitive with modern proprietary embeddings. Same-hardware latency measurements further show large efficiency gains for encoder-based components, supporting their use in production deployments with tight cost and privacy constraints. Overall, compact fine-tuned, self-hosted models remain a practical and effective foundation for multilingual fact-checking at scale. Code and data used for this study are available at https://github.com/factiverse/factcheck-editor.

2606.08602 2026-06-09 cs.LG cs.AI 新提交

Reinforcement Learning for Flow-Matching Policies with Density Transport

基于密度传输的流匹配策略强化学习

Boshu Lei, Kostas Daniilidis, Antonio Loquercio

发表机构 * University of Pennsylvania(宾夕法尼亚大学)

AI总结 提出在线强化学习算法RLDT,利用Stein变分梯度下降构建传输场,微调预训练流匹配策略,通过期望目标估计稳定训练,在连续控制任务中优于基线方法。

详情
AI中文摘要

我们提出了一种在线强化学习(RL)算法,用于微调连续控制问题中的流匹配策略。我们的关键见解是将基于RL的策略改进视为将动作密度向高奖励区域传输,这自然与流匹配模型的传输公式一致。先前的方法要么近似当前或最优策略分布,要么采用蒸馏,这引入了有偏梯度或牺牲了多模态建模能力。相比之下,我们提出的基于密度传输的RL方法(称为RLDT)使用Stein变分梯度下降(SVGD)从最大熵RL目标构建传输场,然后微调预训练的流匹配策略以与该场对齐。使用这种对齐目标进行训练并非易事,因为流匹配策略通过多步过程生成动作,使得直接的基于梯度的优化具有挑战性。为了克服这一挑战并稳定训练,我们通过期望目标估计从中间去噪步骤近似策略动作。这使得传输场更新能够传播到网络参数中,而无需通过时间进行不稳定的反向传播。实验结果表明,RLDT在奖励质量和收敛速度方面优于竞争基线。该性能在多种连续控制任务中保持一致,包括密集和稀疏奖励,以及基于状态和视觉的长期机器人操作。项目网页为https://rpfey.github.io/rldt/。

英文摘要

We present an online reinforcement learning (RL) algorithm for fine-tuning flow-matching policies in continuous-control problems. Our key insight is to view RL-based policy improvement as a transport of action densities towards regions of high reward, which naturally aligns with the transport formulation of flow-matching models. Prior methods either approximate the current or optimal policy distribution or resort to distillation, which introduces biased gradients or sacrifices multimodal modeling capacity. In contrast, our approach for RL with Density Transport, which we name RLDT, constructs a transport field from a maximum-entropy RL objective using Stein Variational Gradient Descent (SVGD). Then, it fine-tunes a pretrained flow-matching policy to align with this field. Training with this alignment objective is nontrivial because flow-matching policies generate actions via a multi-step process, making direct gradient-based optimization challenging. To overcome this challenge and stabilize training, we approximate policy actions from intermediate denoising steps via expected-target estimation. This allows the transport-field update to propagate into the network parameters without unstable backpropagation through time. Experimental results demonstrate that RLDT outperforms competitive baselines in reward quality and convergence speed. This performance holds across diverse continuous-control tasks, encompassing both dense and sparse rewards, as well as state- and vision-based long-horizon robot manipulation. The project webpage is https://rpfey.github.io/rldt/.
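The SVGD step mentioned in the abstract above can be illustrated with a minimal one-dimensional sketch. This is the textbook SVGD direction with an RBF kernel, not the authors' RLDT implementation: the function name `svgd_direction`, the `bandwidth` parameter, and the caller-supplied `score` function (the gradient of the target log-density) are all assumptions made for illustration. In a maximum-entropy RL setting the score would presumably come from a learned critic, which is not modeled here.

```python
import math


def svgd_direction(particles, score, bandwidth=1.0):
    """Textbook SVGD update direction for scalar particles (illustrative only).

    particles: list of floats (e.g. candidate scalar actions).
    score:     hypothetical callable returning d/dx log p(x) of the target.
    bandwidth: RBF kernel bandwidth h in k(x, y) = exp(-(x - y)^2 / (2h)).
    """
    n = len(particles)
    direction = []
    for xi in particles:
        total = 0.0
        for xj in particles:
            k = math.exp(-((xj - xi) ** 2) / (2.0 * bandwidth))
            # Derivative of the kernel with respect to xj (repulsive term).
            grad_k = -(xj - xi) / bandwidth * k
            # Attraction toward high density plus repulsion between particles.
            total += k * score(xj) + grad_k
        direction.append(total / n)
    return direction
```

With a standard normal target (`score = lambda x: -x`), a lone particle is simply pulled toward the mode, while several particles are pulled toward the mode and pushed apart from each other, which is what keeps the particle set from collapsing onto a single point.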

2606.08601 2026-06-09 cs.AI 新提交

InA-Probe: Instruction-Aware Active Probing for Time Series Forecasting with LLMs

InA-Probe:面向LLM时间序列预测的指令感知主动探测

Peiliang Gong, Emadeldeen Eldele, Chenyu Liu, Ziyu Jia, Yi Ding, Xinliang Zhou, Lianchao Gu, Qi Zhu, Yang Liu, Daoqiang Zhang, Xiaoli Li

发表机构 * Nanyang Technological University(南洋理工大学) Khalifa University(哈利法大学) Nanjing University of Aeronautics and Astronautics(南京航空航天大学) Singapore University of Technology and Design(新加坡科技设计大学)

AI总结 提出指令感知主动探测(InA-Probe),通过多级指令注入和自适应查询生成,结合双阶段注意力机制,在7个基准上超越现有方法,跨域误差降低37%。

详情
AI中文摘要

大型语言模型(LLMs)近期在时间序列预测中展现出令人瞩目的潜力。然而,现有方法主要依赖被动模态对齐或静态任务重编程,往往难以捕捉细粒度的非平稳时间模式或适应细微的任务意图。本文提出指令感知主动探测(InA-Probe),将范式从被动对齐转向主动的指令驱动探测机制。具体而言,我们设计了一种多级指令注入机制,为模型注入全局任务目标和细粒度的补丁级语义先验。在此基础上,自适应查询生成模块生成样本特定的探测,这些探测由时间上下文动态调制。随后,这些探测通过双阶段注意力过程进行精炼:首先通过指令感知自注意力内化任务特定意图,然后通过时间交叉注意力审查询问投影的时间表示以提取显著模式。在七个真实世界基准上的全面实验表明,InA-Probe在统一泛化和零样本迁移中均持续优于最先进的深度学习和基于LLM的基线,在具有挑战性的跨域场景中预测误差降低高达37%。消融研究进一步证实,自适应查询与细粒度指令之间的协同作用是解锁LLM推理能力以处理复杂时间序列的关键。

英文摘要

Large Language Models (LLMs) have recently demonstrated impressive potential for time series forecasting. However, existing methods predominantly rely on passive modality alignment or static task reprogramming, which often fail to capture fine-grained, non-stationary temporal patterns or to adapt to nuanced task intents. In this paper, we propose Instruction-aware Active Probing (InA-Probe), which shifts the paradigm from passive alignment toward an active, instruction-driven probing mechanism. Specifically, we design a Multi-Level Instruction Injection mechanism that enriches the model with both global task objectives and fine-grained, patch-level semantic priors. Building on this, an Adaptive Query Generation module produces sample-specific probes that are dynamically modulated by the temporal context. These probes are then refined through a dual-stage attention process: they first internalize task-specific intents via Instruction-Aware Self-Attention, and subsequently interrogate the projected temporal representations through Temporal Cross-Attention to extract salient patterns. Comprehensive experiments on seven real-world benchmarks show that InA-Probe consistently outperforms state-of-the-art deep learning and LLM-based baselines, excelling in both one-for-all generalization and zero-shot transfer while reducing forecasting error by up to 37% in challenging cross-domain scenarios. Ablation studies further confirm that the synergy between adaptive querying and fine-grained instructions is key to unlocking the reasoning power of LLMs for complex time series.

2606.08596 2026-06-09 cs.AI cs.HC 新提交

Distilling LLM Reasoning into an Interpretable Policy Tree for Human-AI Collaboration

将LLM推理蒸馏为可解释的策略树用于人机协作

Beiwen Zhang, Yongheng Liang, Guowei Zou, Haitao Wang, Hejun Wu

发表机构 * Sun Yat-sen University(中山大学)

AI总结 提出Co-pi-tree方法,通过将大语言模型推理蒸馏为可执行策略树,在Overcooked-AI中平均奖励提升35.4%,同时减少77.7%的LLM查询和97.1%的测试延迟。

详情
AI中文摘要

构建高效可靠的策略以辅助人类是人机协作中不可或缺的。现有方法主要遵循两条工作路线。大多数先前工作依赖多智能体强化学习(MARL)来学习黑盒策略,这限制了可解释性并引发安全问题。近期方法在每个决策步骤查询大语言模型(LLM),导致响应缓慢和推理成本高昂。我们提出协作策略树(Co-pi-tree),一种闭环方法,学习一个可执行的策略树,该树由伙伴行为预测树和智能体动作选择树组成。Co-pi-tree通过将LLM推理蒸馏为策略树代码来构建策略。然后通过伙伴交互评估策略,获取反馈,并使用自然语言总结交互反馈以改进有问题的分支。在Overcooked-AI中的实验表明,Co-pi-tree将平均奖励比基线平均值提高35.4%,同时将LLM查询次数减少77.7%,测试时延迟减少97.1%。项目页面:https://beiwenzhang.github.io/Co-pi-tree/

英文摘要

Constructing efficient and reliable policies to assist humans is indispensable for human-AI collaboration. Existing methods mainly follow two lines of work. Most prior work relies on multi-agent reinforcement learning (MARL) to learn black-box policies, which limits interpretability and raises safety concerns. Recent methods query large language models (LLMs) at each decision step, causing slow responses and high inference costs. We propose Collaboration Policy Tree (Co-pi-tree), a closed-loop method that learns an executable policy tree consisting of a partner-behavior prediction tree and an agent-action selection tree. Co-pi-tree constructs a policy by distilling LLM reasoning into policy tree code. It then evaluates the policy through partner interaction, obtains feedback, and uses natural language to summarize the interaction feedback to improve problematic branches. Experiments in Overcooked-AI show that Co-pi-tree improves average reward by 35.4% over the baseline average, while reducing the number of LLM queries by 77.7% and test-time latency by 97.1%. Project page: https://beiwenzhang.github.io/Co-pi-tree/

2606.08589 2026-06-09 cs.CL cs.DL cs.IR 新提交

Detection and Interpretability Analysis of Quotation Errors by Large Language Models

大语言模型对引用错误的检测与可解释性分析

Bei Huang, Yingyi Zhang, Shenghao Huang, Chengzhi Zhang

发表机构 * School of Social Science, Soochow University(苏州大学社会科学学院) School of Economics and Management, Nanjing University of Science and Technology(南京理工大学经济与管理学院)

AI总结 针对引用错误问题,提出基于大语言模型微调的自动检测方法,通过引入全文数据优化数据集构建,并利用TokenSHAP进行可解释性分析,实验表明微调方法有效且基于源摘要的全文整合方案性能最佳。

详情
Journal ref
The Electronic Library, 2026
AI中文摘要

目的 - 引用错误指引用信息与其原始来源之间的不一致。这一现象导致一系列负面影响,如对原始研究的误解、削弱学术界对相关问题的集体理解,以及削弱基于引用的学术评价体系的准确性和公平性。现有研究表明,引用错误在学术界普遍存在;此外,人工验证引用错误不仅劳动密集,而且效率低下。因此,本文提出“引用错误自动检测”任务。方法 - 采用基于大语言模型的方法,本文在现有研究基础上从两个方面提升检测性能:首先,采用微调方法使大语言模型检测引用错误;其次,将引文全文数据纳入数据集构建,并通过比较三种全文整合方法探索构建此类数据集的最优方案。在此基础上,本文进一步使用TokenSHAP工具对模型预测结果进行可解释性实验分析。发现 - 大语言模型的微调方法提升了引用错误检测的性能。在整合全文信息的不同方法中,基于使用源摘要的方法取得了最佳性能。原创性 - 将大语言模型的微调方法应用于引用错误自动检测任务,并对模型输出结果进行可解释性分析。

英文摘要

Purpose - Quotation error refers to the inconsistency between cited information and its original source. This phenomenon leads to a series of negative impacts, such as misinterpretation of the original research, undermining the academic community's collective understanding of relevant issues, and weakening the accuracy and fairness of the citation-based academic evaluation system. Existing studies have shown that quotation error is prevalent in the academic community; moreover, manual verification of quotation error is not only labor-intensive but also inefficient. Therefore, this paper proposes the task of 'automated detection of quotation errors'. Methodology - Adopting a large language model (LLM)-based approach, this paper improves detection performance over existing research in two respects: first, it fine-tunes LLMs to detect quotation errors; second, it incorporates full-text data of the cited literature into dataset construction and explores the optimal scheme for building such datasets by comparing three full-text integration methods. On this basis, the paper further uses the TokenSHAP tool to analyze the interpretability of the model's predictions. Findings - Fine-tuning LLMs improved performance in detecting quotation errors. Among the different methods for incorporating full-text information, the approach based on the source abstract yielded the best performance. Originality - The fine-tuning approach for LLMs is applied to the task of automated detection of quotation errors, and interpretability analysis is conducted on the model's output results.

2606.08584 2026-06-09 cs.LG 新提交

Convolutional Sparse Coding via the Locally Competitive Algorithm on Loihi 2

基于Loihi 2的局部竞争算法实现卷积稀疏编码

Geoffrey Kasenbacher, Daniel Ruepp, Gerrit A. Ecke

发表机构 * Mercedes-Benz AG(梅赛德斯-奔驰集团) Institut für Robotik und Kognitive Systeme, Universität zu Lübeck(吕贝克大学机器人与认知系统研究所)

AI总结 本文在Loihi 2神经形态芯片上实现了卷积稀疏编码的局部竞争算法,并与GPU基线对比,展示了其在结构化稀疏推理中的可行性和优势。

详情
AI中文摘要

稀疏编码通过将输入表示为仅少量基函数的线性组合,为信号表示提供了一个原则性框架。局部竞争算法(LCA)因其动力学特性(泄漏积分、阈值化和侧向抑制)自然映射到神经形态硬件,在神经形态计算中特别有吸引力。虽然先前的工作已在Loihi 2上研究了非卷积LCA,但卷积设置尤其令人感兴趣,因为它引入了空间结构、权重共享、重叠感受野和缩放行为,这些更代表实际的稀疏推理工作负载。在这项工作中,我们提出了通过LCA在Loihi 2上实现卷积稀疏编码,并在相同的推理问题上与传统的GPU基线进行了评估。该实现遵循单层循环LCA公式,并将其扩展到具有从成对滤波器相互作用导出的局部抑制核的卷积特征图。据我们所知,这是Loihi 2上卷积LCA的首次实现和基准测试。我们的目标不仅是证明可行性,而且还要阐明在何种操作条件下卷积稀疏推理在神经形态硬件上变得有吸引力。由此产生的研究将卷积LCA定位为新兴神经形态系统上结构化稀疏推理的有用基准。

英文摘要

Sparse coding provides a principled framework for signal representation by expressing an input as a linear combination of only a small number of basis functions. The Locally Competitive Algorithm (LCA) is particularly attractive in the context of neuromorphic computing because its dynamics (leaky integration, thresholding, and lateral inhibition) map naturally to neuromorphic hardware. While prior work has studied non-convolutional LCA on Loihi 2, the convolutional setting is of particular interest because it introduces spatial structure, weight sharing, overlapping receptive fields, and scaling behavior that are more representative of practical sparse inference workloads. In this work, we present a Loihi 2 implementation of convolutional sparse coding via the LCA and evaluate it against a conventional GPU baseline on the same inference problems. The implementation follows a one-layer recurrent LCA formulation and extends it to convolutional feature maps with local inhibitory kernels derived from pairwise filter interactions. To the best of our knowledge, this is the first implementation and benchmark of convolutional LCA on Loihi 2. Our goal is not only to demonstrate feasibility, but also to clarify in which operating regimes convolutional sparse inference becomes attractive on neuromorphic hardware. The resulting study positions convolutional LCA as a useful benchmark for structured sparse inference on emerging neuromorphic systems.
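The three LCA ingredients named in the abstract above (leaky integration, thresholding, lateral inhibition) can be seen in a small dense sketch. This is the standard non-convolutional LCA with a soft threshold, written in plain Python on a conventional CPU. It is not the paper's Loihi 2 or convolutional implementation, and the names `lca`, `soft_threshold`, `lam`, `step`, and `n_steps` are illustrative assumptions.

```python
def soft_threshold(u, lam):
    """Thresholding: map a membrane potential to a sparse activation."""
    if u > lam:
        return u - lam
    if u < -lam:
        return u + lam
    return 0.0


def dot(p, q):
    return sum(pi * qi for pi, qi in zip(p, q))


def lca(x, dictionary, lam=0.1, step=0.1, n_steps=200):
    """Dense LCA sketch (assumes unit-norm atoms; step plays the role of dt/tau).

    x:          input signal as a list of floats.
    dictionary: list of atoms, each a list with len(x) entries.
    """
    n = len(dictionary)
    drive = [dot(atom, x) for atom in dictionary]  # feed-forward input
    gram = [[dot(dictionary[i], dictionary[j]) for j in range(n)] for i in range(n)]
    u = [0.0] * n  # membrane potentials
    for _ in range(n_steps):
        a = [soft_threshold(ui, lam) for ui in u]
        # Lateral inhibition: active units suppress units with overlapping atoms.
        inhibition = [
            sum(gram[i][j] * a[j] for j in range(n) if j != i) for i in range(n)
        ]
        # Leaky integration toward the drive minus the inhibition.
        u = [ui + step * (drive[i] - ui - inhibition[i]) for i, ui in enumerate(u)]
    return [soft_threshold(ui, lam) for ui in u]
```

With an orthonormal dictionary the inhibition term vanishes and each activation settles at the soft-thresholded projection of the input. In the convolutional case described in the abstract, the pairwise interactions become local kernels between overlapping filters rather than one dense Gram matrix.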

2606.08578 2026-06-09 cs.LG 新提交

Lost in the Non-convex Loss Landscape: How to Fine-tune the Large Time Series Model?

迷失在非凸损失景观中:如何微调大型时间序列模型?

Xu Zhang, Peang Wang, Wei Wang

发表机构 * Shanghai Key Laboratory of Data Science(上海市数据科学重点实验室) College of Computer Science and Artificial Intelligence(计算机科学与人工智能学院) Fudan University(复旦大学)

AI总结 针对预训练大型时间序列模型微调时因非凸损失景观导致过拟合的问题,提出平滑全微调(SFF)方法,通过随机初始化辅助模型插值平滑损失景观,提升可训练性,在八个代表性模型上取得一致改进。

Comments This paper has been accepted by The Fourteenth International Conference on Learning Representations (ICLR 2026). The code is available at https://github.com/Meteor-Stars/SFF

详情
AI中文摘要

近年来,大型时间序列模型(LTSMs)因其与大型语言模型的相似性(包括灵活的上下文长度、可扩展性和任务通用性)而受到越来越多的关注,其性能优于先进的任务特定模型。然而,先前研究表明,预训练的LTSMs可能表现出条件较差的非凸损失景观,导致可训练性有限。因此,直接微调往往会导致过拟合和次优性能,有时甚至比从头训练更差,大大削弱了预训练的好处。为了克服这一限制,我们提出了平滑全微调(SFF),一种新颖的微调技术。具体来说,我们通过随机初始化构建一个辅助LTSM以获得更平滑的损失景观,然后将其权重与预训练模型的权重进行线性插值,以平滑原始景观。这一过程在保留预训练知识的同时提高了可训练性,从而实现更有效的下游微调。从优化角度来看,SFF扰动尖锐最小值而不显著损害平坦区域,有助于逃离不良局部盆地,走向更平滑且泛化性更好的解。在基准数据集上的大量实验表明,在包括Timer、TimesFM、MOMENT、UniTS、MOIRAI、Chronos、TTMs和Sundial在内的八个代表性LTSM上,针对多样化的下游任务均取得了一致的改进。代码可在链接获取:https://github.com/Meteor-Stars/SFF。

英文摘要

Recently, large time series models (LTSMs) have gained increasing attention due to their similarities to large language models, including flexible context length, scalability, and task generality, outperforming advanced task-specific models. However, prior studies indicate that pre-trained LTSMs may exhibit a poorly conditioned non-convex loss landscape, leading to limited trainability. As a result, direct fine-tuning tends to cause overfitting and suboptimal performance, sometimes even worse than training from scratch, substantially diminishing the benefits of pre-training. To overcome this limitation, we propose Smoothed Full Fine-tuning (SFF), a novel fine-tuning technology. Specifically, we construct an auxiliary LTSM via random initialization to obtain a smoother loss landscape, and then linearly interpolate its weights with those of the pre-trained model to smooth the original landscape. This process improves trainability while preserving pre-trained knowledge, thereby enabling more effective downstream fine-tuning. From an optimization perspective, SFF perturbs sharp minima without significantly harming flat regions, facilitating escape from poor local basins toward smoother and more generalizable solutions. Extensive experiments on benchmark datasets demonstrate consistent improvements across eight representative LTSMs, including Timer, TimesFM, MOMENT, UniTS, MOIRAI, Chronos, TTMs, and Sundial, on diverse downstream tasks. The code is available at the link: https://github.com/Meteor-Stars/SFF.
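The weight-interpolation step described in the abstract above can be sketched as follows. This is only an illustration of linearly blending a pretrained model with a randomly initialized auxiliary model of the same shape. The function names, the dict-of-lists parameter format, the Gaussian initialization scale, and the mixing coefficient `alpha` are all assumptions, and the authors' repository should be consulted for the actual procedure.

```python
import random


def random_init_like(pretrained, scale=0.02, seed=0):
    """Build a hypothetical auxiliary model with the same parameter shapes."""
    rng = random.Random(seed)
    return {name: [rng.gauss(0.0, scale) for _ in w] for name, w in pretrained.items()}


def interpolate_weights(pretrained, auxiliary, alpha):
    """Return (1 - alpha) * pretrained + alpha * auxiliary for every parameter.

    alpha = 0 keeps the pretrained weights unchanged; larger alpha mixes in
    more of the randomly initialized model (alpha is an assumed knob).
    """
    if pretrained.keys() != auxiliary.keys():
        raise ValueError("both models must expose the same parameter names")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    blended = {}
    for name, w_pre in pretrained.items():
        w_aux = auxiliary[name]
        if len(w_pre) != len(w_aux):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        blended[name] = [(1.0 - alpha) * p + alpha * a for p, a in zip(w_pre, w_aux)]
    return blended
```

A typical use would be `interpolate_weights(pretrained, random_init_like(pretrained), alpha=0.1)` followed by ordinary full fine-tuning from the blended weights. The value 0.1 is a placeholder, not a setting reported in the abstract.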

2606.08574 2026-06-09 cs.LG cs.CV 新提交

OrderDP: A Theoretically Guaranteed Lossless Dynamic Data Pruning Framework

OrderDP:一种理论上保证无损的动态数据剪枝框架

Chenhan Jin, Shengze Xu, Qingsong Wang, Fan Jia, Dingshuo Chen, Tieyong Zeng

发表机构 * The Chinese University of Hong Kong(香港中文大学) Beijing Normal-Hong Kong Baptist University(北京师范大学-香港浸会大学) Guangzhou Nanfang College(广州南方学院) Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所) Xiangtan University(湘潭大学) University of Utah(犹他大学)

AI总结 提出OrderDP框架,通过随机子集选取与top-q样本选择实现无偏梯度估计,提供收敛性和泛化性理论保证,在CIFAR和ImageNet上降低40%训练成本且保持精度。

Comments Published as a conference paper at ICLR 2026

详情
Journal ref
International Conference on Learning Representations (ICLR), 2026
AI中文摘要

数据剪枝(DP)作为一种常被提及的减轻训练负担的策略,根据定义明确的剪枝方法减少训练样本数量,同时力求实现近乎无损的性能。然而,现有方法通常选择信息量大的样本,与全数据集训练相比可能导致有偏的梯度估计。此外,这种偏差及其对最终性能的影响分析仍不明确。为解决这些问题,我们提出OrderDP,一个即插即用的框架,旨在获得稳定、无偏且近乎无损的训练加速,并具有理论保证。具体而言,OrderDP首先随机选择一个子集,然后选择前$q$个样本,其中相对于代理损失建立无偏性。这确保了OrderDP在代理目标方面进行无偏训练。我们进一步建立了收敛性和泛化性分析,阐明了OrderDP如何影响最优性能,并在保证最终性能的同时实现良好控制的加速。实验上,我们在CIFAR-10、CIFAR-100和ImageNet-1K上对OrderDP与全面基线进行了评估,展示了具有竞争力的精度、稳定的收敛和精确的控制——所有这些都通过更简单的设计和更快的运行时间实现,同时将训练成本降低超过40%。我们的方法兼具强性能和计算效率,为数据高效学习提供了一个稳健且易于适应的工具。代码公开于https://github.com/shengze-xu/OrderDP。

英文摘要

Data pruning (DP), as an oft-stated strategy to alleviate heavy training burdens, reduces the volume of training samples according to a well-defined pruning method while striving for near-lossless performance. However, existing approaches, which commonly select highly informative samples, can lead to biased gradient estimation compared to full-dataset training. Furthermore, the analysis of this bias and its impact on final performance remains ambiguous. To address these challenges, we propose OrderDP, a plug-and-play framework that aims to obtain stable, unbiased, and near-lossless training acceleration with theoretical guarantees. Specifically, OrderDP first randomly selects a subset and then chooses the top-$q$ samples, where unbiasedness is established with respect to a surrogate loss. This ensures that OrderDP conducts unbiased training in terms of the surrogate objective. We further establish convergence and generalization analyses, elucidating how OrderDP affects optimal performance and enables well-controlled acceleration while ensuring guaranteed final performance. Empirically, we evaluate OrderDP against comprehensive baselines on CIFAR-10, CIFAR-100, and ImageNet-1K, demonstrating competitive accuracy, stable convergence, and exact control -- all with a simpler design and faster runtime, while reducing training cost by over 40%. Delivering both strong performance and computational efficiency, our method serves as a robust and easily adaptable tool for data-efficient learning. The code is publicly available at https://github.com/shengze-xu/OrderDP.
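The two-stage selection described in the abstract above (draw a random subset, then keep the top-q samples) can be sketched in a few lines. Ranking by per-sample loss is an assumption made here for illustration, since the abstract does not state the ranking criterion. The function name `select_top_q` and its parameters are also hypothetical, and the surrogate-loss reweighting behind the paper's unbiasedness result is not modeled.

```python
import random


def select_top_q(losses, subset_size, q, rng):
    """Pick a random subset of sample indices, then keep its q highest-loss members.

    losses:      per-sample scores (assumed here to be current training losses).
    subset_size: how many indices to draw uniformly without replacement.
    q:           how many of the drawn indices to keep.
    rng:         a random.Random instance, passed in for reproducibility.
    """
    if not 0 < q <= subset_size <= len(losses):
        raise ValueError("require 0 < q <= subset_size <= len(losses)")
    subset = rng.sample(range(len(losses)), subset_size)
    subset.sort(key=lambda i: losses[i], reverse=True)
    return subset[:q]
```

Because the first stage is uniform sampling, every sample keeps a nonzero chance of being trained on. A deterministic global top-q rule, by contrast, would never revisit low-scoring samples.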

2606.08573 2026-06-09 cs.LG cs.CL 新提交

Titans-as-a-Layer: Test-Time Memory for Conversational Speech Emotion Recognition

Titans-as-a-Layer:对话语音情感识别的测试时记忆

Daniel Chen, Qicong Hu, Yang Xiao, Ting Dang, Hong Jia

发表机构 * University of California, Berkeley(加州大学伯克利分校) Stanford University(斯坦福大学)

AI总结 提出Memory-as-a-Layer (MAL)适配器,利用测试时神经记忆为对话语音情感识别提供上下文,在不修改大型音频语言模型的前提下提升性能。

Comments ICML 2026 Workshop on Machine Learning for Audio

详情
AI中文摘要

语音情感识别(SER)通常被表述为话语级分类,尽管对话情感取决于说话者通常的音域和先前话语建立的情感上下文。语音语言模型提供了强大的预训练声学和语义表示,并可以通过微调将其适应于SER标签,但这种机制仍然缺少每对话状态。我们研究测试时神经记忆是否可以在保持大型音频语言模型(LALMs)主干不变的情况下提供这种缺失的上下文。基于Titans,我们引入了一种即插即用的Memory-as-a-Layer(MAL)适配器,它将对话历史写入小型神经记忆,并作为音频令牌对齐的残差更新读回,避免了对宿主模型令牌位置的更改。在不同的音频LLM和情感识别数据集评估中,我们的设计在不同评估指标上改善了SER性能,支持测试时记忆作为对话SER的残差上下文机制。

英文摘要

Speech emotion recognition (SER) is commonly formulated as utterance-level classification, although conversational emotion depends on a speaker's usual vocal range and the emotional context established by previous utterances. Speech-language models provide strong pretrained acoustic and semantic representations and can be adapted to SER labels via fine-tuning, but this mechanism still lacks per-dialogue state. We study whether test-time neural memory can supply this missing context while leaving the large audio language model (LALM) backbone intact. Building on Titans, we introduce a plug-and-play Memory-as-a-Layer (MAL) adapter that writes dialogue history into a small neural memory and reads it back as an audio-token-aligned residual update, avoiding changes to the host model's token positions. Across evaluations on different audio LLMs and emotion recognition datasets, our design improves SER performance across different evaluation metrics, supporting test-time memory as a residual contextual mechanism for conversational SER.

2606.08572 2026-06-09 cs.CV 新提交

OmniCap-IF: Benchmarking and Improving Instruction Following Abilities for Omni-Video Captioning

OmniCap-IF:全视频字幕遵循指令能力的基准测试与改进

Jiahao Wang, An Ping, Yanghai Wang, Yuanxing Zhang, Shihao Li, Hanyan Bian, Yichi Ren, Yize Zhang, Han Wang, Haowen Chen, Junze Li, Jiaqi Wang, Yiyang Hu, Zhuze Xu, Zijie Zhang, Jiaheng Liu

发表机构 * NJU-LINK Team, Nanjing University(南京大学 NJU-LINK 团队) Kling Team, Kuaishou Technology(快手科技 Kling 团队)

AI总结 提出首个全模态字幕指令遵循基准OmniCap-IF,通过格式与内容正确性评估50种约束类型,揭示格式-内容权衡,并构建54K指令微调数据集OmniCap-IF-54K及模型OmniCaptioner-IF。

详情
AI中文摘要

虽然全模态大语言模型(OLLMs)在联合处理音频和视觉流方面展示了令人印象深刻的能力,但它们严格遵循复杂、多方面的用户指令的能力在很大程度上仍未得到探索。现有基准主要关注整体视频理解或纯文本指令遵循,未能捕捉模态与用户约束之间的复杂交互。为填补这一空白,我们引入了OmniCap-IF,这是首个专门设计用于评估全模态字幕中指令遵循能力的综合基准。OmniCap-IF包含一个系统框架,从格式正确性和内容正确性两个维度评估字幕。我们的基准涵盖了纯视觉、纯音频和音视频模态中的50种不同约束类型,同时整合了时间定位以评估时空精度。对1,920个高质量样本上主流模型的广泛评估揭示了显著的性能差异。此外,我们的分析揭示了一个关键的“格式-内容权衡”,表明增加格式复杂性直接降低了模型的全模态推理能力。最后,为推进该领域,我们整理了一个54K的指令微调数据集OmniCap-IF-54K,并提出了OmniCaptioner-IF,该模型在复杂指令遵循和通用全模态字幕性能方面均取得了显著改进。

英文摘要

While Omni-modal Large Language Models (OLLMs) have demonstrated impressive capabilities in jointly processing audio and visual streams, their ability to strictly adhere to complex, multi-faceted user instructions remains largely unexplored. Existing benchmarks primarily focus on holistic video understanding or text-only instruction following, failing to capture the intricate interplay between modalities and user constraints. To bridge this gap, we introduce OmniCap-IF, the first comprehensive benchmark specifically designed to evaluate instruction-following capabilities in omni-modal captioning. OmniCap-IF incorporates a systematic framework that assesses captions on two dimensions: format correctness and content correctness. Our benchmark encompasses 50 distinct constraint types across pure visual, pure audio, and audio-visual modalities, while integrating Temporal Grounding to assess spatio-temporal precision. Extensive evaluations of prominent models on 1,920 high-quality samples reveal significant performance disparities. Furthermore, our analysis uncovers a critical "format-content tradeoff", demonstrating that increasing formatting complexity directly degrades models' omni-modal reasoning abilities. Finally, to advance the field, we curate a 54K instruction-tuning dataset, OmniCap-IF-54K, and present OmniCaptioner-IF, which achieves notable improvements in both complex instruction adherence and general omni-modal captioning performance.

2606.08571 2026-06-09 cs.CL cs.AI cs.LG 新提交

Calibration of Structured Ignorance Certificates for Diagnosing Unknown Unknowns in Reasoning Models

用于诊断推理模型中未知未知的结构化无知证书的校准

Subramanyam Sahoo

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出结构化无知证书(SICs)输出格式,通过GRPO微调14B模型,使模型在无法回答时明确承认知识缺失并生成检索查询,在未知未知问题上实现99.46%的JSON有效性和0.967的证书特异性分数。

Comments Accepted in ICML 2026 Workshop: Epistemic Intelligence in Machine Learning

详情
AI中文摘要

大型语言模型经常以特征性方式失败：对于超出其知识边界的问题，它们不是承认无知，而是生成流畅但错误的答案。我们引入了结构化无知证书（SICs），这是一种JSON格式的输出模式，要求模型明确命名缺失的领域交叉点，列举所需概念，并提出一个富有成效的检索查询，而不是凭空捏造答案。为了训练模型生成高质量的SICs，我们构建了一个包含7,347个样本的未知-未知（UU）数据集，通过提示Qwen3-14B将来自七个领域（物理、生物、工程、计算机科学、经济、医学、法律）的问题拼接成新颖的跨领域查询，这些查询是任何单一领域专家都无法回答的。我们使用组相对策略优化（GRPO）微调了一个14B参数的模型，采用结合检索效用、概念特异性和输出格式有效性的复合奖励。在模型响应上训练的释义散度探测器证实，SIC调优的输出系统地表现出更高的未知-未知概率分数。在735个保留的UU问题上的评估实现了99.46%的JSON有效性率、0.967的平均证书特异性分数，以及在基于检索的生成上相比基础模型3.6%的ROUGE-L改进——这表明显式的认知结构化是一种可学习且可衡量的能力。

英文摘要

Large language models frequently fail in a characteristic way: rather than acknowledging ignorance, they produce fluent but incorrect answers to questions that lie beyond their knowledge boundaries. We introduce Structured Ignorance Certificates (SICs), a JSON-formatted output schema that demands a model explicitly name the missing domain intersection, enumerate required concepts, and propose a productive retrieval query rather than hallucinating an answer. To train models to produce high-quality SICs, we construct a 7,347-sample Unknown-Unknown (UU) dataset by prompting Qwen3-14B to stitch together questions from seven domains (physics, biology, engineering, CS, economics, medical, legal) into novel cross-domain queries that no single-domain expert could answer. We fine-tune a 14B-parameter model with Group Relative Policy Optimization (GRPO) using a composite reward that combines retrieval utility, concept specificity, and output-format validity. A paraphrase-divergence probe trained on model responses confirms that SIC-tuned outputs systematically exhibit higher unknown-unknown probability scores. Evaluation on 735 held-out UU questions achieves a 99.46% JSON validity rate, a mean Certificate Specificity Score of 0.967, and a 3.6% ROUGE-L improvement over the base model on retrieval-grounded generation -- demonstrating that explicit epistemic structuring is a learnable and measurable capability.
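As a rough illustration of what a JSON validity check on such a certificate might involve, the sketch below validates the three elements the abstract names: the missing domain intersection, the required concepts, and a retrieval query. The field names `missing_domain_intersection`, `required_concepts`, and `retrieval_query` are hypothetical, since the abstract does not give the actual schema, and the function `is_valid_sic` is not taken from the paper.

```python
import json

# Hypothetical field names and expected types; the real SIC schema may differ.
REQUIRED_FIELDS = {
    "missing_domain_intersection": str,
    "required_concepts": list,
    "retrieval_query": str,
}


def is_valid_sic(raw):
    """Return True if raw (a string) parses as JSON and has all assumed fields."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(obj, dict):
        return False
    for field, kind in REQUIRED_FIELDS.items():
        value = obj.get(field)
        # Reject missing, wrongly typed, or empty fields.
        if not isinstance(value, kind) or not value:
            return False
    # Every listed concept must be a non-blank string.
    return all(isinstance(c, str) and c.strip() for c in obj["required_concepts"])
```

A check of this kind covers only format validity. The specificity and retrieval-utility measures mentioned in the abstract would need separate scoring.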

2606.08566 2026-06-09 cs.CV 新提交

Towards Accurate Emotion-Attributed Video Captioning via Fine-grained Emotion-Cause Pair Extraction

通过细粒度情感-原因对提取实现精确的情感归因视频字幕生成

Weidong Chen, Cheng Ye, Zhendong Mao, Liping Wang, Xinyan Liu, Yongdong Zhang

发表机构 * University of Science and Technology of China(中国科学技术大学) Institute of Artificial Intelligence, Hefei Comprehensive National Science Center(合肥综合性国家科学中心人工智能研究院) Harbin Institute of Technology (Weihai)(哈尔滨工业大学(威海))

AI总结 提出细粒度情感-原因对提取框架,通过概念感知视觉语义分解和视觉引导情感可解释学习,提升情感视频字幕的准确性和丰富性。

详情
AI中文摘要

情感视频字幕生成(EVC)是一项具有挑战性的任务,旨在为视频生成事实准确且情感丰富的描述。现有的EVC方法利用整体视觉特征挖掘全局情感线索,然后聚合多模态特征以指导情感字幕生成,这忽略了EVC任务的关键特性。视觉情感是由特定的动机原因引发的,这些原因通常只隐含在核心视频片段中。整体挖掘带来了显著的信息冗余和不准确的情感线索。因此,细粒度的视觉原因提取对情感感知和情感归因字幕生成都有促进作用。为此,我们提出了一种用于情感归因视频字幕生成的细粒度情感-原因对提取框架。具体来说,我们通过两轮学习成对的情感和原因特征:1)我们提出了一种概念感知的视觉语义分解模块,通过探索场景、对象和运动概念来增强视觉特征。此外,为了增强情感特征,我们提出了一种视觉引导的情感可解释学习模块,该模块利用视觉时间动态指导情感细化,并通过可靠的VAD向量约束增强可解释的细化过程。2)我们通过在细化前后交叉耦合视觉和情感特征来实现情感-原因对提取,并利用对比损失实现语义强制对齐。总体而言,我们的方法优化了视频的复杂语义理解和情感感知,从而在情感字幕生成中取得了有前景的性能。在三个具有挑战性的数据集上进行的大量实验证明了我们的方法和每个提出模块的优越性,例如,在EVC-MSVD数据集上,BLEU-2和ROUGE-L分别取得了+4.4%和+5.4%的最佳性能。

英文摘要

Emotional Video Captioning (EVC) is a challenging task that aims to generate factually accurate and emotionally rich descriptions for videos. Existing EVC methods leverage holistic visual features to mine global emotional cues, and then aggregate multimodal features to guide the emotional caption generation, which ignores the critical characteristic of the EVC task. Visual emotions are evoked by specific motivational causes, which are usually only implied in core video segments. The holistic mining brings significant information redundancy and inaccurate emotional cues. Thus, fine-grained visual cause extraction has a facilitative effect on both emotion perception and emotion-attributed caption generation. To this end, we propose a fine-grained emotion-cause pair extraction framework for emotion-attributed video captioning. Specifically, we learn pair-wise emotion and cause features in two rounds: 1) We propose a Concept-aware Visual Semantic Decomposition module to augment visual features by exploring scene, object, and motion concepts. Besides, to enhance emotional features, we propose a Visual-guided Emotion Interpretable Learning module, which guides emotion refinement with visual temporal dynamics, and augments the interpretable refinement process by reliable VAD-vector constraints. 2) We achieve emotion-cause pair extraction by cross-coupling the visual and emotional features before and after refinement, and leverage contrastive loss to achieve semantic forced alignment. Overall, our approach optimizes complex semantic understanding and emotion perception of videos, leading to a promising performance in emotional captioning. Extensive experiments on three challenging datasets demonstrate the superiority of our approach and each proposed module, e.g., achieving the best performances with +4.4% and +5.4% w.r.t. BLEU-2 and ROUGE-L, respectively, on the EVC-MSVD dataset.

2606.08565 2026-06-09 cs.LG cs.AI New submission

EinSort: Sorting is All We Need for Tensorizing LLM

Toshiaki Koike-Akino, Jing Liu, Ye Wang

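AI summary: Proposes EinSort, which discovers low-rank structure in tensors through index ordering, enabling tensorized compression of large language model weights and KV caches with better reconstruction quality than baseline methods.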

Comments: 38 pages, 17 figures

Abstract

Tensor networks provide efficient representations for compressing large neural networks. By carefully designing shapes and topologies, they can significantly reduce memory and computational costs. However, identifying implicit low-rank structures in large foundation models remains challenging due to their enormous scale and unstructured weight distributions. We propose an adaptive tensorization method that discovers inherent low-rank structure in a target tensor by index ordering. Experiments on weight and KV-cache compression demonstrate improved reconstruction quality compared to baselines.
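
To see why index ordering can matter for tensorization, consider a toy case: the same nine values, reshaped into a 3x3 matrix, can have rank 3 in one ordering and rank 1 in another. The sketch below is not the EinSort algorithm, which the abstract does not specify. It only assumes a simple value sort as the reordering, and the helper names and numbers are made up for illustration.

```python
from fractions import Fraction

def matrix_rank(rows):
    """Exact rank via Gaussian elimination over the rationals."""
    m = [[Fraction(x) for x in row] for row in rows]
    n_cols = len(m[0]) if m else 0
    rank = 0
    for col in range(n_cols):
        # Find a row at or below the current rank with a nonzero entry.
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, len(m)):
            factor = m[r][col] / m[rank][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank

def reshape(flat, n_cols):
    """Split a flat list into rows of length n_cols."""
    return [flat[i:i + n_cols] for i in range(0, len(flat), n_cols)]

# Assumed toy weights: entries of outer([1, 10, 100], [1, 2, 4]) in scrambled order.
scrambled = [400, 1, 20, 2, 100, 40, 10, 200, 4]

# Reorder indices by value (a stand-in for a learned index ordering).
order = sorted(range(len(scrambled)), key=lambda i: scrambled[i])
reordered = [scrambled[i] for i in order]

rank_before = matrix_rank(reshape(scrambled, 3))
rank_after = matrix_rank(reshape(reordered, 3))
print(rank_before, rank_after)
```

Storing the permutation `order` alongside the low-rank factors is enough to restore the original layout, so the reordering loses no information.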

2606.08564 2026-06-09 cs.RO New submission

Real-IKEA: Physical Fidelity is the Prerequisite for Robust Manipulation

Kunqi Xu, Zhenhao Huang, Siyuan Luo, Ziqiu Zeng, Fan Shi

Affiliations: National University of Singapore; Peking University

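AI summary: To address the poor manipulation robustness caused by the physics gap between simulation and reality, proposes the Real-IKEA dataset and simulation framework, whose high-fidelity assets and resistance-calibrated configurations let reinforcement learning policies discover robust strategies that prioritize mechanical advantage.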

Abstract

Robotic manipulation robustness often founders on the physics gap between simplified simulations and the resistance-laden real world. In this work, we emphasize that physical realism in articulated interaction is an important ingredient for robust policy learning. We present Real-IKEA, a dataset and simulation framework designed with physical accuracy as a first-class goal. Real-IKEA provides 1,079 articulated asset configurations, derived from 83 authentic IKEA handles and knobs processed through a meticulous six-step physical workflow. For contact-geometry accuracy, we introduce a bidirectional surface-deviation metric to quantify collision meshes. For dynamics realism, we establish resistance-calibrated configurations that vary damping and friction. Crucially, we demonstrate through a Reinforcement Learning (RL) policy that high-fidelity assets enable the discovery of robust "hooking" and "levering" strategies that prioritize mechanical advantage over fragile friction-pulling. Together, these results position Real-IKEA as a critical benchmark for developing manipulation policies capable of human-level robustness in articulated object tasks.
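
The abstract does not define its bidirectional surface-deviation metric. One plausible reading, sketched here purely as an assumption, is a two-way nearest-neighbor distance between points sampled from the visual mesh and from the collision mesh. The function names and sample points are hypothetical.

```python
import math

def one_sided_deviation(source, target):
    """Mean distance from each source point to its nearest target point."""
    return sum(min(math.dist(p, q) for q in target) for p in source) / len(source)

def bidirectional_deviation(visual_pts, collision_pts):
    """Return (visual->collision, collision->visual, worst of the two)."""
    forward = one_sided_deviation(visual_pts, collision_pts)
    backward = one_sided_deviation(collision_pts, visual_pts)
    return forward, backward, max(forward, backward)

# Assumed samples: the collision mesh misses the tip of a handle.
visual_pts = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
collision_pts = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]

forward, backward, worst = bidirectional_deviation(visual_pts, collision_pts)
print(forward, backward, worst)
```

Here the collision-to-visual direction reports zero deviation even though the collision mesh omits part of the handle. Only the visual-to-collision direction exposes the gap, which is why checking both directions is useful.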

2606.08563 2026-06-09 cs.LG physics.ao-ph New submission

Physics-Guided Dual Decoding and Spectral Supervision for Global 3D Hydrometeor Prediction

Dandan Chen, Yaqiang Wang

Affiliations: Chinese Academy of Meteorological Sciences; Xiong’an Institute of Meteorological Artificial Intelligence

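AI summary: To address the over-smoothing caused by zero-inflated, long-tailed distributions in 3D hydrometeor prediction, proposes PredHydro-Net, a physics-guided dual-decoding framework that combines a decoupled architecture, wavelet-based frequency decoupling, and adversarial training to outperform existing models in extreme-event detection and spectral representation.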

Abstract

While global data-driven models excel at predicting continuous atmospheric variables, three-dimensional hydrometeor forecasting remains challenging due to the zero-inflated, long-tailed distributions of these variables. Standard deep learning optimization often yields overly smooth forecasts, attenuating extreme events and spatial textures. We propose PredHydro-Net, a physics-guided dual-decoding framework that mitigates this smoothing. To resolve multi-variable optimization conflicts, it employs a decoupled architecture where macroscopic thermodynamic and dynamic fields unidirectionally modulate hydrometeor generation. By integrating wavelet-based frequency decoupling, spectral amplitude matching, and adversarial training, the model achieves a favorable trade-off between quantitative accuracy and spatial fidelity. In a 72-h global evaluation, PredHydro-Net outperforms both spatiotemporal deep learning baselines (Earthformer and PredRNNv2) and the operational Global Forecast System (GFS) in extreme-event detection and spectral representation. Furthermore, it demonstrates strong climatological consistency with Global Precipitation Measurement (GPM) satellite retrievals. The model reasonably reproduces the three-dimensional cloud structures in extreme weather events, such as Hurricane Ian. Feature attribution confirms its dependence on physical precursors such as relative humidity and wind convergence, offering a robust, physics-informed approach to long-tailed atmospheric prediction.
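
The link between over-smoothing and spectral amplitude matching can be shown on a 1D toy signal. The sketch below is an assumed, simplified form of such a loss (mean absolute difference of DFT amplitudes). The paper works on global 3D fields and its exact formulation is not given in the abstract, and the function names and signals here are illustrative only.

```python
import cmath

def dft_amplitudes(signal):
    """Amplitude of each discrete Fourier coefficient of a real 1D signal."""
    n = len(signal)
    return [
        abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(signal)))
        for k in range(n)
    ]

def spectral_amplitude_loss(pred, target):
    """Mean absolute difference between the two amplitude spectra."""
    amp_pred = dft_amplitudes(pred)
    amp_target = dft_amplitudes(target)
    return sum(abs(p - t) for p, t in zip(amp_pred, amp_target)) / len(amp_pred)

# Assumed toy data: a sharp "extreme event" and a smoothed forecast of it.
target = [0.0, 0.0, 4.0, 0.0, 0.0, 0.0, 0.0, 0.0]
smoothed = [0.0, 1.0, 2.0, 1.0, 0.0, 0.0, 0.0, 0.0]

amp_target = dft_amplitudes(target)
amp_smoothed = dft_amplitudes(smoothed)
loss = spectral_amplitude_loss(smoothed, target)
print(amp_target[4], amp_smoothed[4], loss)
```

The smoothed forecast keeps the total mass of the event (equal amplitude at frequency 0) but loses its highest-frequency component entirely, and the amplitude loss penalizes exactly that missing energy.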