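arXivDaily daily academic digest: synced with the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering & systems, and other fields.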

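2606.11200 2026-06-11 cs.CL cs.CV New submission

Detecting AI-Generated Content on Social Media with Multi-modal Language Models

Chenyang Yang, Shen Yan, Yibo Yang, Litao Hu, Yuchen Liu, Yuan Zeng, Hanchao Yu, Yinan Zhu, Sumedha Singla, Brian Vanover, Huijun Qian, Zihao Wang, Fujun Liu, Aashu Singh, Jianyu Wang, Xuewen Zhang

Institutions: Carnegie Mellon University; Meta

AI summary: To address the poor generalization, single-modality dependence, and lack of interpretability of AI-generated content detection, the paper proposes a compact vision-language model trained on multi-modal data that performs both detection and explanation, achieving state-of-the-art performance on public benchmarks and internal datasets.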

Abstract

Generative AI has enabled the creation of photorealistic images and videos that are increasingly disseminated on social media, often used for spam, misinformation, manipulation, and fraud. Existing AI-generated content (AIGC) detection methods face challenges including poor generalization to new generation models, reliance on single modalities, and lack of interpretable explanations. We present our pipeline that mitigates these issues by continuously curating diverse multi-modal social media data and training a compact vision-language model for detection and explanation. Our model achieves state-of-the-art detection performance on public benchmarks and demonstrates robust detection and explanation capabilities on internal social media datasets across multiple platforms. We deployed our model for post recommendation on social media platforms and observed positive downstream impacts on user engagement, demonstrating that it is feasible to perform effective AIGC detection in dynamic, real-world social media environments.

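2606.11198 2026-06-11 cs.CL cs.AI New submission

The Structural Attention Tax: How Retrieval Format Hijacks In-Context Learning Independent of Content

Yuqi Zhang, Di Zhang

Institutions: Xi’an Jiaotong-Liverpool University

AI summary: The study finds that knowledge graph triples, because of their format structure, attract 2-3x more attention than natural language and compress demonstration attention by up to 42%. It proposes a framework that decomposes attention into semantic and structural components, along with mitigation strategies.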

Comments 10 pages, 5 figures

Abstract

Retrieval-augmented generation (RAG) systems inject external knowledge to improve LLM outputs, yet the format of injected content -- distinct from its semantic relevance -- can independently distort the model's attention distribution. We identify and formalise a phenomenon we term the structural attention tax: knowledge graph (KG) triples, due to their relational delimiters and repeated slot patterns, capture 2-3x more attention per token than semantically equivalent natural-language text ($\hat{o}$(KG) $\approx$ 0.70 vs. $\hat{o}$(neutral) $\approx$ 0.25), compressing demonstration attention by up to 42% -- regardless of whether the triples are relevant or noise. We develop a formal framework decomposing attention scores into semantic and structural components (Eq. 2), derive a compression bound (Proposition 1) connecting token-level format bias to demonstration attention loss, and show that the structural term governs how much attention is diverted while the semantic term governs whether this helps or hurts. This decoupling reveals two orthogonal axes for improving retrieval-augmented ICL: optimising retrieval quality (semantic axis) and reducing format-driven attention capture (structural axis). Empirically, across two model families (Mistral-7B, LLaMA-3-8B) and three QA benchmarks, we observe that source-task alignment dominates: task-matched BM25 retrieval achieves 58-62% on HotpotQA vs. ConceptNet's 25-27%, a >30 pp gap that dwarfs all gating strategies ($\leq$2 pp). We derive five structure-aware mitigation strategies from the framework, ranging from zero-cost prompt modifications to training-time regularisation; format flattening (S3) is validated by both accuracy and attention-level evidence from a verbalized-triple control, while structural dispersal (S1) yields mixed results that illuminate the challenges of format-level intervention.

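2606.11196 2026-06-11 cs.CL cs.AI cs.CR cs.LG New submission

Hybrid Iterative Neural Low-Regularity Integrator for Nonlinear Dispersive Equations

Zhangyong Liang, Huanhuan Gao

Institutions: National Center for Applied Mathematics, Tianjin University; School of Mechanical and Aerospace Engineering, Jilin University

AI summary: Proposes HIN-LRI, a hybrid framework in which a lightweight neural network learns to correct the structured truncation error of a classical low-regularity integrator. Explicit time-step scaling guarantees stability, improving accuracy on dispersive equations with rough data while retaining generalization.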

Abstract

We propose HIN-LRI, a hybrid framework that augments a classical numerical solver with a neural operator trained to correct the solver's structured truncation error. A base low-regularity integrator provides a consistent first-order approximation to nonlinear dispersive PDEs, while a lightweight neural network, operating on a low-dimensional latent manifold, learns the residual defect that analytical methods cannot close. An explicit time-step scaling on the neural correction ensures that its Lipschitz contribution remains $\mathcal{O}(\tau)$, yielding a Gronwall stability factor bounded uniformly in the step size and independent of the spatial resolution. The network is trained end-to-end through a solver-in-the-loop objective that unrolls the full iteration and penalises trajectory error in a Bourgain-type norm, aligning learning with multi-step solver dynamics rather than isolated one-step targets. Under stated assumptions, the global error satisfies $C(\varepsilon_{net}+\delta)\,\tau^\gamma\ln(1/\tau)$, where $\varepsilon_{net}$ measures the network approximation quality and $\delta$ the training shortfall. Experiments on three dispersive benchmarks with rough data show that HIN-LRI improves accuracy over analytical integrators, splitting methods, and neural PDE surrogates, with stable spatial refinement, effective out-of-distribution transfer, and modest online overhead.
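
The time-step scaling mentioned in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: `base_step` stands in for the low-regularity integrator, `correction` for the trained network, and the update form (classical step plus tau times the correction) is one natural reading of the abstract, not necessarily the authors' exact scheme.

```python
from typing import Callable, List

Vector = List[float]


def hybrid_step(
    u: Vector,
    tau: float,
    base_step: Callable[[Vector, float], Vector],
    correction: Callable[[Vector], Vector],
) -> Vector:
    """Advance one step: classical update plus a tau-scaled learned correction."""
    base = base_step(u, tau)
    corr = correction(u)
    # Scaling by tau keeps the correction's per-step contribution of order tau.
    return [b + tau * c for b, c in zip(base, corr)]


# Hypothetical stand-ins (not from the paper): linear decay and a linear "network".
def toy_base_step(u: Vector, tau: float) -> Vector:
    return [(1.0 - tau) * x for x in u]


def toy_correction(u: Vector) -> Vector:
    return [0.5 * x for x in u]


u0 = [1.0, -2.0]
u1 = hybrid_step(u0, 0.1, toy_base_step, toy_correction)
```

Because the correction enters multiplied by `tau`, halving the step size also halves how much the learned term can move the state in a single step, which is the intuition behind a stability factor that stays bounded as the step size shrinks.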

2605.04221 2026-06-11 cs.CL cs.AI Updated version

Self-Prompting Small Language Models for Privacy-Sensitive Clinical Information Extraction

Yao-Shun Chuang, Tushti Mody, Uday Pratap Singh, Shirindokht Shiraz, Chun-Teh Lee, Ryan Brandon, Muhammad F Walji, Xiaoqian Jiang, Bunmi Tokede

Institutions: McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston; School of Public Health, The University of Texas Health Science Center at Houston; School of Dentistry, The University of Texas Health Science Center at Houston; Willamette Dental and Skourtes Institute

AI summary: To address named entity recognition in dental records that are unstructured, domain-specific, and privacy-sensitive, the paper proposes a locally deployable self-prompting framework that combines multi-prompt ensemble inference with QLoRA-based fine-tuning and direct preference optimization, reaching micro/macro F1 scores of 0.864/0.837 with Qwen2.5-14B-Instruct.

Abstract

Clinical named entity recognition from dental progress notes is challenging because documentation is highly unstructured, domain-specific, and often privacy-sensitive. We developed a locally deployable framework that enables small language models to self-generate, verify, refine, and evaluate entity-specific prompts for extracting multiple clinical entities from dental notes. Using 1,200 annotated notes, we evaluated candidate open-weight models with multi-prompt ensemble inference and further adapted selected models using QLoRA-based supervised fine-tuning and direct preference optimization. Model performance varied substantially, highlighting the need for task-specific evaluation rather than reliance on generic benchmarks. Qwen2.5-14B-Instruct achieved the strongest baseline performance. After DPO, Qwen2.5-14B-Instruct and Llama-3.1-8B-Instruct achieved micro/macro F1 scores of 0.864/0.837 and 0.806/0.797, respectively. These findings suggest that automated prompt optimization combined with lightweight preference-based post-training can support scalable clinical information extraction using locally deployed small language models.
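
The abstract reports both micro and macro F1, which can diverge when entity types differ in frequency. The sketch below shows the standard definitions on made-up counts; the entity names and numbers are hypothetical and are not taken from the paper.

```python
from typing import Dict, Tuple

Counts = Tuple[int, int, int]  # (true positives, false positives, false negatives)


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 = 2TP / (2TP + FP + FN); defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(per_entity: Dict[str, Counts]) -> Tuple[float, float]:
    """Micro pools counts over entity types; macro averages per-type F1 scores."""
    tp = sum(c[0] for c in per_entity.values())
    fp = sum(c[1] for c in per_entity.values())
    fn = sum(c[2] for c in per_entity.values())
    micro = f1(tp, fp, fn)
    macro = sum(f1(*c) for c in per_entity.values()) / len(per_entity)
    return micro, macro


# Hypothetical counts for two entity types (illustrative only).
counts = {"diagnosis": (90, 10, 10), "procedure": (10, 10, 10)}
micro, macro = micro_macro_f1(counts)
```

In this toy case the frequent entity type dominates the micro score, while the macro score weights the rarer, harder type equally, which is why reporting both is informative.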

2605.02849 2026-06-11 cs.CV Updated version

Active Sampling for Ultra-Low-Bit-Rate Video Compression via Conditional Controlled Diffusion

Amirhosein Javadi, Shirin Saeedi Bidokhti, Tara Javidi

Institutions: Department of Electrical and Computer Engineering, University of California San Diego; Department of Electrical and Systems Engineering, University of Pennsylvania

AI summary: Proposes ActDiff-VC, a framework that uses a conditional diffusion model and active sampling strategies (adaptive keyframe selection and budget-aware sparse trajectory selection) to achieve high-perceptual-quality video compression at ultra-low bitrates.

Comments 21 pages, 11 figures, 3 tables

Abstract

Diffusion models provide a powerful generative prior for perceptual reconstruction at ultra-low bitrates, but effective video compression requires controlling the generative process using highly compact conditioning signals. In this work, we present ActDiff-VC, a diffusion-based video compression framework for the ultra-low-bitrate regime. Our method partitions videos into variable-length segments, transmits keyframes only when needed, and summarizes temporal dynamics using a compact set of tracked point trajectories. Conditioned on these sparse signals, a conditional diffusion decoder synthesizes the remaining frames, enabling perceptually realistic reconstruction under severe rate constraints. To support this design, we introduce two mechanisms: content-adaptive keyframe selection and budget-aware sparse trajectory selection, which together enable compact yet effective conditioning for generative reconstruction. Experiments on the UVG and MCL-JCV benchmarks show that ActDiff-VC achieves up to 64.6% bitrate reduction at matched NIQE, improves KID by up to 64.6% and FID by up to 37.7% at comparable bitrates against strong learned codecs, and delivers favorable perceptual rate-distortion trade-offs relative to learned and diffusion-based baselines in the ultra-low-bitrate regime.

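2605.02411 2026-06-11 cs.AI cs.IR cs.LG cs.MA Updated version

SCAIL-2: Unifying Controlled Character Animation with End-to-end In-Context Conditioning

Wenhao Yan, Fengjia Guo, Zhuoyi Yang, Jie Tang

Institutions: Z.ai; Tsinghua University

AI summary: Proposes SCAIL-2, a framework that unifies controlled character animation through end-to-end in-context conditioning, bypassing intermediate representations to use driving videos directly. It synthesizes the MotionPair-60K dataset, uses in-context mask conditioning and mode-specific RoPE for unification, and applies Bias-Aware DPO to reduce errors, substantially outperforming existing methods.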

Abstract

Controlled character animation requires transferring motion from a driving sequence to a reference character. Prior works heavily rely on intermediate representations, including pose skeletons to represent motion or masked background to represent environment, which inevitably leads to information loss. To address this, we present SCAIL-2, a framework that bypasses those intermediates and achieves end-to-end character animation. By directly concatenating driving videos to the sequence, the model can obtain all the required visual information from the input video. To address the lack of end-to-end data, we unify sub-tasks of character animation with decoupled conditions and then curate a pipeline to synthesize MotionPair-60K, an end-to-end motion transfer dataset containing heterogeneous tasks of character animation. To achieve the unification, we utilize in-context mask conditioning and mode-specific RoPE as soft guidance beyond textual instructions and raw visual information. To address synthetic discrepancy in detailed regions, we propose Bias-Aware DPO to construct preference items to mitigate the errors. Extensive experiments demonstrate that our method substantially outperforms existing state-of-the-art approaches in various character animation tasks. A large subset of synthetic data as well as model weights will be released at our project page: https://teal024.github.io/SCAIL-2/.

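2606.10794 2026-06-11 cs.AI Updated version

CoCoSI: Collaborative Cognitive Map Construction for Spatial Intelligence

Yiming Zhang, Ruoxuan Cao, Zhihang Zhong

Institutions: Shanghai Jiao Tong University; Cornell University

AI summary: Proposes a plug-and-play multi-agent framework that collaboratively builds structured cognitive maps as spatial memory, enhancing the spatial understanding of pretrained multimodal large models without architectural changes or additional training.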

Abstract

Spatial intelligence is a key frontier for multimodal large language models (MLLMs), enabling them to reason about the physical world from visual experience. Inspired by human spatial cognition, recent approaches construct grid-based cognitive maps from multi-frame visual inputs to maintain coherent spatial representations over time. However, limited context lengths still challenge spatial understanding, while existing methods, such as long-context modeling and external memory, often require architectural changes, memory modules, or finetuning, limiting their applicability to off-the-shelf pretrained MLLMs. This motivates a lightweight, model-agnostic method for preserving spatial information beyond the native context window. To this end, we propose a plug-and-play multi-agent framework that collaboratively constructs cognitive maps as structured spatial memory, enhancing the spatial understanding of arbitrary pretrained MLLMs without architectural modification or additional training. Our framework features local-global agent coordination, cognitive map construction with atomic commits, and cross-agent verification. Extensive experiments demonstrate that our method achieves superior performance on spatial understanding tasks while remaining fully training-free. Code will be released.

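2606.10360 2026-06-11 cs.SD Updated version

ViP-VL: Vietnamese Self-supervised Speech Pretraining Model with Vector-Quantization Learning

Khanh Le, Kiet Anh Hoang, Bao Nguyen, Duy Vo, Dung Vo, Thai Tran, Linh Pham, Khoa D Doan

Institutions: VinUniversity

AI summary: Proposes ViP-VL, a model that achieves efficient self-supervised pretraining on the BEST-RQ framework through acoustic stacking, receptive-field alignment, and a mask selection strategy, with state-of-the-art results on four Vietnamese tasks: ASR, emotion recognition, dialect classification, and speaker verification.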

Comments Accepted to INTERSPEECH 2026

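2606.10198 2026-06-11 cs.LG cs.AI cs.CV Updated version

Density Ridge Selective Prediction for LLM and VLM Hallucination Detection under Calibration Label Scarcity

Nina I. Shamsi

Institutions: Northeastern University Boston, United States

AI summary: For hallucination detection in large language and vision-language models when calibration labels are scarce, the paper proposes a density-ridge method based on kernel density estimation. It builds the response manifold from a six-dimensional kinematic feature map of hidden-state generation trajectories and scores generations by Euclidean distance to the nearest ridge vertex, improving AUROC by 5-20 points under a label-scarce protocol.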

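AI Chinese abstract (translated)

Hallucination detection in large language and vision-language models is increasingly framed as selective prediction, in which a detector assigns a confidence score and abstains when confidence is low. Unsupervised sampling detectors (Semantic Entropy, EigenScore) avoid labels but plateau in quality, while supervised probes (SAPLMA) attain stronger in-distribution scores yet degrade sharply when calibration labels are scarce. We recover the response manifold of an LLM as the density ridge of a kernel density estimate built on a six-dimensional kinematic feature map of hidden-state generation trajectories. A test generation is scored by the negated Euclidean distance from its projected feature point to the nearest ridge vertex, yielding a low-dimensional geometric skeleton of the stochastic output distribution. We evaluate against Semantic Entropy, SAR, EigenScore, SAPLMA, and log-probability on seven QA benchmarks (HaluEval-QA, TriviaQA, GSM8K, POPE, ScienceQA, A-OKVQA) using nine text and vision LLMs in a deliberately label-scarce protocol ($n_{\text{cal}}{=}200$ queries, $N{=}5$ generations). Our ridge-based score wins on AUROC by 5-20 points while showing tempered degradation under calibration-label scarcity.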

Abstract

Hallucination detection in large language and vision-language models is increasingly framed as selective prediction, where a detector assigns a confidence score and abstains when confidence is low. Unsupervised sampling detectors (Semantic Entropy) avoid labels but plateau in quality, while supervised probes attain stronger in-distribution scores yet degrade sharply when calibration labels are scarce. We recover the response manifold of an LLM as the density ridge of a kernel density estimate built on a six-dimensional kinematic feature map of hidden state generation trajectories. A test generation is scored by the negated Euclidean distance from its projected feature point to the nearest ridge vertex, yielding a low-dimensional geometric skeleton of the stochastic output distribution. We evaluate against Semantic Entropy, topological methods, and log-probability on six QA benchmarks (HaluEval-QA, TriviaQA, GSM8K, POPE, ScienceQA, A-OKVQA) using eight text and vision LLMs in a deliberately label-scarce protocol ($n_{\text{cal}}{=}200$ queries, $N{=}5$ generations). Our ridge-based score outperforms these baselines on AUROC by 5-20 points, while showing tempered degradation under calibration-label scarcity.
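
The scoring rule in the abstract (negated Euclidean distance to the nearest ridge vertex) is simple enough to sketch. The ridge vertices below are hypothetical 2-D points chosen for readability; the paper describes a six-dimensional feature map, and the ridge extraction itself (kernel density estimation and ridge finding) is not shown here.

```python
import math
from typing import List, Sequence


def ridge_score(point: Sequence[float], ridge_vertices: List[Sequence[float]]) -> float:
    """Negated Euclidean distance from a feature point to the nearest ridge vertex.

    Higher (closer to zero) means the point lies nearer the ridge.
    """
    return -min(math.dist(point, v) for v in ridge_vertices)


# Hypothetical ridge vertices along the x-axis (illustrative only).
ridge = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]

on_ridge = ridge_score((1.0, 0.0), ridge)   # point coincides with a vertex
off_ridge = ridge_score((1.0, 3.0), ridge)  # point is 3 units from the nearest vertex
```

A selective predictor would then abstain whenever the score falls below a threshold chosen on the calibration set; how that threshold is set is outside what the abstract specifies.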

2606.10135 2026-06-11 cs.CV cs.AI Updated version

BiWM: Advancing Open-Source Interactive Video World Models with Bidirectional Autoregression

Shaohao Rui, Xiaofeng Mao, Zhanyu Zhang, Peijia Lin, Yansong Zhu, Yibo Zhang, Haibin Wan, Weijie Ma

Institutions: LynnReal AI; Shanghai Innovation Institute; Shanghai Jiao Tong University; Fudan University

AI summary: Proposes BiWM, a framework that converts pretrained video backbones into interactive world models under a bidirectional autoregressive paradigm with only two training stages (fine-tuning plus distribution matching distillation), supporting multiple model scales and long-horizon generation and outperforming existing causal pipelines.

Comments After the paper was posted, we discovered that several visualization results were produced using wrong configuration settings during runtime. This error affects the reliability of the presented visual comparisons. Additionally, further optimization of the design is needed. We therefore request to withdraw this version and will submit a corrected and improved version later

Abstract

Transitioning bidirectional video diffusion models into an autoregressive paradigm improves the interactivity of video world models, but existing causal pipelines need many stages (control fine-tuning, autoregressive training, causal initialization, few-step distillation) and still trail bidirectional models in quality due to error accumulation. Recent world models such as Yume-1.5 and Matrix-Game-3.0 instead adopt a bidirectional autoregressive approach, gaining fidelity and stable long-horizon rollout from self-correcting error propagation, yet open-source frameworks (e.g., minWM) support only causal models. We present BiWM, the first full-stack framework for interactive video world models under the bidirectional autoregressive paradigm, jointly optimizing generation quality and inference speed. From a pretrained video backbone, BiWM injects camera control by fine-tuning, then runs a few-step Distribution Matching Distillation (DMD) stage that turns the backbone into an action/camera-controllable world model: just two training stages instead of four in minWM, converging in a few hundred steps on 8xH200 GPUs. A single recipe spans Wan2.1-1.3B, Wan2.2-5B, HunyuanVideo-1.5-8B, and LTX-2.3-22B, and also supports secondary fine-tuning of existing bidirectional models. BiWM enables real-world camera control where minWM loses controllability, integrates pluggable history compression (FramePack-style and PackForcing-style) for long rollouts, and offers an optional NVFP4 4-bit training/inference pipeline. To counter DMD's mode-seeking degradation, we add GAN and mass-covering forward-KL objectives that preserve scene dynamics. We open-source BiWM for resource-constrained research and high-fidelity environment simulation.

2606.10046 2026-06-11 cs.SD cs.AI Updated version

Inside the Latent Flow: Causal Deciphering of Attention Dynamics in Audio Separation Foundation Models

Yuxuan Chen, Haoyuan Yu, Peize He

Institutions: Jilin University; Hunan University; University of Electronic Science and Technology of China

AI summary: Using a causal-intervention protocol, the paper reveals the dual-pathway attention mechanism of flow-matching Transformers in audio separation and proposes LSAC, a training-free acceleration method that cuts self-attention computation by about 25% while preserving quality.

Abstract

Flow-matching transformers achieve strong audio separation, yet their attention dynamics are opaque. We adapt established causal-intervention principles into a deterministic, inference-time probing protocol for SAM Audio. Orthogonal probing uncovers a dual-pathway text-conditioning mechanism: additive injections control semantic identity, while cross-attention refines acoustic structure. We observe an asynchronous layerwise convergence: stable layers build temporal scaffolds early, whereas fast layers continue resolving artifacts during sampling. The model also attenuates temporal segmentation cues to maintain continuous-flow stability. Using these insights, we propose Layer-Selective Attention Caching (LSAC), a training-free acceleration method that caches attention in stable layers. Across acoustic complexities, LSAC cuts self-attention computation by about 25% with negligible quality loss and yields up to 6.7x higher quality retention than naive step reduction.
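
The idea of caching attention only in stable layers can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' implementation: the class and function names are hypothetical, the attention call is a placeholder, and the cache is simply filled on the first sampling step and never refreshed, which the abstract does not specify.

```python
from typing import Callable, Dict, Set


class LayerSelectiveCache:
    """Reuse attention outputs for layers marked stable; recompute the rest."""

    def __init__(self, stable_layers: Set[int]) -> None:
        self.stable_layers = stable_layers
        self.cache: Dict[int, object] = {}
        self.computed = 0  # number of attention evaluations actually run

    def attention(self, layer: int, compute: Callable[[], object]) -> object:
        # Serve a stable layer from the cache once it has been computed.
        if layer in self.stable_layers and layer in self.cache:
            return self.cache[layer]
        out = compute()
        self.computed += 1
        if layer in self.stable_layers:
            self.cache[layer] = out
        return out


def run_sampling(num_layers: int, num_steps: int, stable_layers: Set[int]) -> int:
    """Run a mock sampling loop and return how many attention calls were executed."""
    cache = LayerSelectiveCache(stable_layers)
    for step in range(num_steps):
        for layer in range(num_layers):
            # Placeholder standing in for a real self-attention computation.
            cache.attention(layer, lambda: (step, layer))
    return cache.computed


baseline_calls = run_sampling(num_layers=4, num_steps=4, stable_layers=set())
cached_calls = run_sampling(num_layers=4, num_steps=4, stable_layers={0})
```

In this toy setup the saving depends entirely on how many layers are treated as stable and on the number of sampling steps; the figure reported in the abstract comes from the authors' own layer selection and is not reproduced by this sketch.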

2606.10040 2026-06-11 cs.RO Updated version

Efficient-WAM: A 1B-Parameter World-Action Model with Low-Cost Future Imagination

Jiajun Li, Tiecheng Guo, Yifan Ye, Rongyu Zhang, Xiaowei Chi, Qianpu Sun, Ying Li, Yunfan Lou, Yan Huang, Zhihe Lu, Meng Guo, Shanghang Zhang

Institutions: The University of Hong Kong; Peking University; Muka Robotics; Institute of Automation, Chinese Academy of Sciences; Nanjing University

AI summary: Proposes Efficient-WAM, which lowers the cost of future imagination through a compact video expert, sparse video latents, and asymmetric denoising, achieving a 30x inference speedup while maintaining control performance.

Abstract

World-Action Models (WAMs) have emerged as a promising paradigm for embodied control by coupling future visual prediction with action generation. However, most existing WAMs rely on photorealistic future prediction, which incurs high inference latency and makes real-time robot deployment difficult. This motivates a more efficient WAM design that preserves the control benefits of future visual prediction while reducing its inference cost. We introduce Efficient-WAM, a World-Action Model that reduces the cost of future imagination while preserving its control benefit. Efficient-WAM improves inference efficiency via a compact video expert transferred from WAN-2.2-5B, token-sparse video latents, and asymmetric video-action denoising that allocates fewer sampling steps to video than to actions. Instead of optimizing the future branch for visual fidelity, Efficient-WAM treats future video prediction as a compact guidance signal for action generation. Comprehensive experiments on RoboTwin 2.0 and real-world manipulation tasks show that Efficient-WAM maintains strong action performance despite visibly coarse future predictions. While maintaining competitive control capabilities, our 1B-parameter model can reduce per-chunk latency to around 100 ms during physical deployment, achieving a 30x speedup over existing WAMs.

2606.11118 2026-06-11 cs.LG math.OC math.PR stat.AP stat.ML Updated version

Data-Driven Dynamic Assortment in Online Platforms: Learning about Two Sides

Rahul Roy, Nur Sunar, Jayashankar M. Swaminathan

Institutions: IE Business School, IE University; Kenan-Flagler Business School, The University of North Carolina at Chapel Hill

AI summary: For two-sided service platforms, the paper proposes a data-driven algorithm that dynamically optimizes assortments when customer and seller choice parameters are unknown, and proves that its regret grows polylogarithmically in time and is rate-optimal.

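AI Chinese abstract (translated)

We study a dynamic assortment problem on a two-sided service platform with incomplete information and heterogeneous customers in a discrete-time setting. In each period, a customer arrives seeking service, and the platform selects a set of sellers to display. The customer proposes a transaction to at most one seller in the assortment according to a multinomial logit (MNL) choice model. After a fixed number of periods, sellers review the proposals they have received, and each seller selects at most one customer according to another MNL choice model; the cycle then repeats. A key challenge is that the platform does not know the parameters of either the customers' or the sellers' choice models in advance. To our knowledge, this is the first study of a dynamic assortment problem in which the choice parameters on both sides are unknown. We develop a data-driven algorithm that learns these parameters while optimizing the platform's objective. We evaluate performance using regret, which measures the revenue loss relative to a clairvoyant benchmark that knows all parameters and customer arrival times in advance. We prove that the algorithm's worst-case regret grows polylogarithmically in time and derive a matching lower bound, establishing its rate optimality.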
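
The multinomial logit (MNL) choice step on the customer side can be sketched with the standard formula, in which each displayed seller is chosen with probability proportional to its preference weight and the no-proposal option has weight 1. The seller names and weights below are hypothetical; in the paper's setting these parameters are unknown and must be learned.

```python
from typing import Dict, Iterable


def mnl_choice_probabilities(weights: Dict[str, float], assortment: Iterable[str]) -> Dict[str, float]:
    """Return MNL choice probabilities for a displayed assortment.

    weights[s] is the positive preference weight exp(utility) of seller s;
    the outside option (no proposal) is normalized to weight 1.
    """
    shown = list(assortment)
    denom = 1.0 + sum(weights[s] for s in shown)
    probs = {s: weights[s] / denom for s in shown}
    probs["no_choice"] = 1.0 / denom
    return probs


# Hypothetical preference weights (illustrative only).
weights = {"seller_a": 2.0, "seller_b": 1.0, "seller_c": 0.5}
probs = mnl_choice_probabilities(weights, ["seller_a", "seller_b"])
```

Adding a seller to the assortment raises the chance that some proposal is made but lowers every other seller's share, which is the trade-off an assortment policy has to balance.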

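Intention Driven Identification of In-Possession Match Phases in Association Football through Temporal Graph Learning

Yuesen Li, Daniel Link

Institutions: Technical University of Munich

AI summary: Proposes a framework based on a Temporal Graph Attention Network (T-GAN) that identifies in-possession phases of football matches from spatiotemporal tracking data, classifying tactical intentions (Invade Opponent Space, Keep Possession, Scoring) and six sub-phases, with F1 scores of 0.87 (intention level) and 0.79 (scoring phases).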

Comments 27 pages, 10 figures

Abstract

Understanding tactical organisation of association football, hereafter referred to as football, requires identifying distinct match phases. Yet in-possession phases are rarely directly observable and are shaped by evolving tactical intentions, rather than spatial patterns alone. This study proposes a data-driven framework for identifying in-possession match phases from spatiotemporal tracking data. Seven German Bundesliga matches recorded at 25 Hz with TRACAB were analysed. A hierarchical phase model was defined with three tactical intentions (Invade Opponent Space, Keep Possession, Scoring) and six phases (Build Up, Progression, Counter Attack, Maintenance, Sustained Threat, Finishing). A Temporal Graph Attention Network (T-GAN) was developed to combine frame-level player-interaction graphs, contextual features, and Transformer-based temporal modelling. Performance was evaluated using frame-level F1 and a sequence-aware Intersection over Truth-Dominance (IoT-D) metric. T-GAN achieved macro-average frame-level F1 scores of 0.87 at the intention level, 0.76 for invasion-related phases, and 0.79 for scoring phases. At the sequence level, mean diagonal IoT-D F1 increased from 0.68 to 0.79 for intentions and from 0.61 to 0.71 for phases after post-processing, indicating improved temporal coherence. Model comparisons showed that sequence modelling was the main driver of segmentation quality, while graph-based relational modelling was particularly beneficial for Counter Attack recognition. Exploratory player attention analysis further suggested that wide and midfield positional groups contributed strongly to phase discrimination. Overall, the framework translates continuous tracking data into tactically interpretable in-possession phase representations, with potential applications in automated match annotation, tactical analysis, and playing-style profiling.

Detecting AI-Generated Content on Social Media with Multi-modal Language Models

The Structural Attention Tax: How Retrieval Format Hijacks In-Context Learning Independent of Content

PoQ-Judge: A Multi-Architecture Evaluation Framework for Cost-Aware Proof-of-Quality in Decentralized LLM Inference

Restless bandits with imperfect binary feedback: PCL-indexability analysis and computation

From Architecture to Output: Structural Origins of Hallucination in Large Language Models and the Amplifying Role of Data

Self-Attention as Transport: Limits of Symmetric Spectral Diagnostics

Hybrid Iterative Neural Low-Regularity Integrator for Nonlinear Dispersive Equations

Self-Prompting Small Language Models for Privacy-Sensitive Clinical Information Extraction

Active Sampling for Ultra-Low-Bit-Rate Video Compression via Conditional Controlled Diffusion

FitText: Evolving Agent Tool Ecologies via Memetic Retrieval

P3D-Bench: Benchmarking MLLMs for Parametric 3D Generation and Structural Reasoning

Modeling Complex Behaviors: Multi-Personality Composition and Dynamic Switching in Vision-Language Models

Beyond Uniform Token-Level Trust Region in LLM Reinforcement Learning

K-Forcing: Joint Next-K-Token Decoding via Push-Forward Language Modeling

SCAIL-2: Unifying Controlled Character Animation with End-to-end In-Context Conditioning

READER: Robust Evidence-based Authorship Decoding via Extracted Representations

Spatially Selective Self-Training for Unsupervised Building Change Detection

Pre-AF 13: An Interpretable Atrial Fibrillation Risk Score Mined from Discharge Reports

Planar-Sector LOS Guidance for Interception of Agile Targets with Lifting-Wing Quadcopters

CoCoSI: Collaborative Cognitive Map Construction for Spatial Intelligence

ViP-VL: Vietnamese Self-supervised Speech Pretraining Model with Vector-Quantization Learning

Density Ridge Selective Prediction for LLM and VLM Hallucination Detection under Calibration Label Scarcity

BiWM: Advancing Open-Source Interactive Video World Models with Bidirectional Autoregression

Inside the Latent Flow: Causal Deciphering of Attention Dynamics in Audio Separation Foundation Models

Efficient-WAM: A 1B-Parameter World-Action Model with Low-Cost Future Imagination

Data-Driven Dynamic Assortment in Online Platforms: Learning about Two Sides

Learning Dynamics Reveal a Hierarchy of Weight-Induced Layerwise Gram Metrics

WeaveBench: A Long-Horizon, Real-World Benchmark for Computer-Use Agents with Hybrid Interfaces

IB-HFN: Information Bottleneck-Driven SAR-Optical Fusion Network for High-Fidelity Cloud Removal

Intention Driven Identification of In-Possession Match Phases in Association Football through Temporal Graph Learning