2605.29430 2026-05-29 cs.AI cs.CL

Towards Human-Like Interactive Speech Recognition With Agentic Correction and Semantic Evaluation

Zixuan Jiang, Yanqiao Zhu, Peng Wang, Qinyuan Chen, Xinjian Zhao, Xipeng Qiu, Wupeng Wang, Zhifu Gao, Xiangang Li, Kai Yu, Xie Chen

Affiliations: College of Artificial Intelligence, Xi’an Jiaotong University; X-LANCE Lab, School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University; The Chinese University of Hong Kong, Shenzhen; Fudan University; Tongyi Fun Team, Alibaba Group

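AI summary: Proposes Agentic ASR, a closed-loop framework that reduces semantic errors through multi-turn interaction and semantic correction, and introduces the Sentence-level Semantic Error Rate (S^2ER) as an evaluation metric.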

Abstract

Automatic speech recognition (ASR) is a core component of human-computer interaction and an increasingly important front-end for LLM-based assistants and agents. However, most current ASR systems still follow a single-pass paradigm, which is poorly aligned with human communication, where misunderstandings are resolved through iterative clarification and refinement. This mismatch makes it difficult to correct meaning-critical errors once they occur. Meanwhile, token-level metrics such as WER or CER cannot adequately reflect such problems. To address these limitations, we formulate Interactive ASR as a multi-turn refinement task and propose Agentic ASR, a closed-loop framework that combines a single-pass ASR front-end with semantic correction, intent routing, and reasoning-based editing. We further introduce the Sentence-level Semantic Error Rate ($S^2ER$), an LLM-based semantic evaluation metric, together with an Interactive Simulation System for scalable and reproducible benchmarking. Experiments on multilingual, named-entity-intensive, and code-switching benchmarks show that iterative interaction consistently reduces semantic errors, with much larger gains in $S^2ER$ than in conventional token-level metrics. Human-AI alignment and ablation studies further validate the reliability of the semantic judge and the robustness of the proposed framework. The code is available at: https://interactiveasr.github.io/ and the live demo is available at https://i-asr.sjtuxlance.com/
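To make the contrast between a token-level and a sentence-level metric concrete, here is a minimal sketch. It assumes S^2ER is the fraction of utterances whose hypothesis a judge marks as semantically different from the reference; the abstract does not give the exact definition. The `toy_judge` function is a hypothetical stand-in for the LLM judge, and all names are illustrative rather than taken from the paper's code.

```python
from typing import Callable, List, Tuple


def word_error_rate(ref: str, hyp: str) -> float:
    """Word-level edit distance divided by the reference length."""
    r, h = ref.split(), hyp.split()
    prev = list(range(len(h) + 1))  # distances for the empty reference prefix
    for i in range(1, len(r) + 1):
        cur = [i] + [0] * len(h)
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost)
        prev = cur
    return prev[-1] / max(len(r), 1)


def toy_judge(ref: str, hyp: str) -> bool:
    """Hypothetical judge: meaning is kept if the words match once fillers are dropped."""
    fillers = {"uh", "um"}

    def norm(s: str) -> List[str]:
        return [w for w in s.lower().split() if w not in fillers]

    return norm(ref) == norm(hyp)


def sentence_semantic_error_rate(
    pairs: List[Tuple[str, str]], judge: Callable[[str, str], bool]
) -> float:
    """Assumed definition: share of sentences the judge flags as meaning-changing."""
    errors = sum(0 if judge(ref, hyp) else 1 for ref, hyp in pairs)
    return errors / len(pairs)
```

In this toy setting, a harmless filler insertion and a wrong name both cost one word edit, so their WER is identical, while only the wrong name counts as a sentence-level semantic error.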

2605.29429 2026-05-29 cs.CV

One Click per Cell Type Suffices: Training-free Group Interaction for Cell Instance Segmentation

Sanghyun Jo, Seo Jin Lee, Seohyung Hong, Yoorim Gang, Hyeongsub Kim, Hyungseok Seo, Kyungsu Kim

Affiliations: OGQ, Korea; Seoul National University, Korea; LG CNS, Korea

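AI summary: Proposes Group Prompting, a paradigm in which one click per cell type segments all instances of that type. Building on the feature-clustering behaviour of SAM's frozen encoder, it designs the training-free Chain-of-Prompts framework, which recursively expands the click and maintains high performance on multiple benchmarks.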

Comments Accepted to MICCAI 2026 (Early Accept)

Abstract

Cell instance segmentation models trained on cell-specific datasets suffer severe performance drops on out-of-distribution cell types, while interactive foundation models overcome this through per-instance prompting at a cost that is prohibitively expensive for histopathology images containing hundreds to thousands of densely packed instances. We introduce Group Prompting, a new paradigm that shifts interactive segmentation from per-instance $O(N)$ to per-type $O(T)$, where a single click per cell type suffices to segment all instances of that type. Our key observation is that the frozen image encoder of the Segment Anything Model (SAM) already clusters same-type cells in its feature space before any prompt is given. Exploiting this property, we propose Chain-of-Prompts (CoP), a training-free framework that recursively expands a single user click by (1) identifying reliable same-type locations through non-parametric gating of multi-scale encoder features, and (2) selecting the most spatially distant reliable point as the next prompt to maximize coverage. On three cell-type-annotated benchmarks, CoP with one click per type retains over 90% of per-instance performance and surpasses fully-supervised methods without any additional training. On four morphologically homogeneous benchmarks, a single click retains over 99%. Project Page: https://shjo-april.github.io/Chain-of-Prompts/
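The two expansion steps described in the abstract can be sketched roughly as follows. This is not the authors' implementation: the gating rule (a cosine-similarity threshold `tau` against the clicked feature), the max-min distance criterion, and the dictionary of per-location feature vectors are all assumptions made for illustration.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[int, int]


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def expand_click(
    features: Dict[Point, List[float]],
    click: Point,
    max_prompts: int,
    tau: float = 0.9,
) -> List[Point]:
    """Grow a list of point prompts from a single user click."""
    # Step 1 (assumed gating rule): keep locations whose feature resembles the click.
    reliable = [
        p for p, f in features.items()
        if p != click and cosine(f, features[click]) >= tau
    ]
    prompts = [click]
    while reliable and len(prompts) < max_prompts:
        # Step 2: take the reliable point farthest from every prompt chosen so far.
        nxt = max(reliable, key=lambda p: min(math.dist(p, q) for q in prompts))
        prompts.append(nxt)
        reliable.remove(nxt)
    return prompts
```

Choosing the farthest reliable point first spreads prompts across the image, which is one plausible reading of "maximize coverage"; locations with dissimilar features are never proposed, whatever their position.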

2605.29427 2026-05-29 cs.CL

FinGuard: Detecting Financial Regulatory Non-Compliance in LLM Interactions

Huaixia Dou, Jie Zhu, Minghao Wu, Shuo Jiang, Junhui Li, Lifan Guo, Feng Chen, Chi Zhang

Affiliations: Qwen DianJin Team, Alibaba Cloud Computing; Tongyi Lab, Alibaba Group; School of Computer Science and Technology, Soochow University

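AI summary: Addresses the detection of regulatory non-compliance in financial LLM interactions with an automated pipeline built on regulatory documents, constructs FinGuard-Bench, the first financial compliance detection benchmark, and trains the FinGuard model, which substantially outperforms existing methods on the benchmark.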

Abstract

As large language models (LLMs) are increasingly deployed in financial services, a single non-compliant interaction can expose institutions to regulatory penalties and direct consumer harm. Existing guard models are built around general harm taxonomies and overlook violations grounded in specific financial regulations. We address this gap with a regulation-driven pipeline that operates directly on regulatory documents, inducing a financial compliance risk taxonomy and synthesizing grounded training data without any predefined violation categories. Instantiating the pipeline on Chinese financial regulations, we release FinGuard-Bench, to our knowledge the first benchmark for financial regulatory compliance detection, with expert-annotated labels at both the query and response levels. We further train FinGuard, a financial compliance detection model built on Qwen3-8B and trained on the regulation-grounded data via supervised fine-tuning and self-play reinforcement learning. On FinGuard-Bench, FinGuard substantially outperforms all baselines, including dedicated guard models and much larger general-purpose LLMs such as Qwen3.5-397B-A17B and GPT-5.1. Furthermore, FinGuard preserves general safety capabilities and adapts to unseen institution-specific policies using policy documents alone. We will publicly release the code, prompts, and resources used in this work on GitHub.

2605.29425 2026-05-29 cs.AI

3DVLA: Enhancing Vision-Language-Action Models via 3D Spatial and Instance Understanding

Zhongyu Xia, Yousen Tang, Bingqing Wei, Yongtao Wang

Affiliations: Wangxuan Institute of Computer Technology, Peking University

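AI summary: Proposes the 3DVLA framework, which addresses the lack of 3D scene understanding in VLA models through multi-view-consistent 3D feature encoding, an instance estimation module, and masked self-supervised 3D encoding, substantially improving manipulation performance on LIBERO-Plus and RoboTwin 2.0.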

Information-Directed Offline-to-Online Reinforcement Learning

Keru Chen

Affiliations: School of Electrical, Computer and Energy Engineering, Arizona State University

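AI summary: Proposes an information-directed sampling (IDS) approach that quantifies the uncertainty remaining after offline data via conditional mutual information, balances instantaneous regret against information gain in offline-to-online reinforcement learning, and proves a Bayesian regret bound along with an advantage in settings with biased residual uncertainty.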

Abstract

Decision-making from offline datasets typically warm-starts a policy or score model from fixed offline data and then refines it with limited online interaction. Offline data reduces uncertainty, but it does not remove the need for exploration; it changes what remains to be explored. We formalise this residual uncertainty by the conditional mutual information $I(χ;τ_{1:T}\mid\mathcal{D}_N)$ between a learning target $χ$ and the online trajectories after conditioning on the offline dataset. This view leads naturally to information-directed sampling (IDS), a family parameterised by $η\ge 0$ that selects actions by trading off instantaneous regret against information gain. We prove a generic offline-to-online Bayesian regret bound for IDS through a ratio certificate: any information-ratio bound satisfied by a reference Thompson-sampling policy over the same randomised policy class is inherited by IDS. In a known-dynamics Bayesian linear-reward model, the conditional mutual information has a log-determinant form, and vanilla IDS ($η=0$) satisfies $\widetilde O\!\left(Hd\min\left\{\sqrt T,\,T\sqrt{C^\dagger_{β,\mathrm{IDS}_0}(N,T)/N}\right\}\right),$ where the coverage coefficient is tied to the visitation distribution induced by vanilla IDS itself. We also identify a warm-start regime with a dominated but informative probe in which vanilla IDS selects the probe while Thompson sampling never does, giving a constant-factor Bayesian regret separation. Controlled bandit experiments and D4RL offline-to-online RL experiments validate this mechanism: IDS is most beneficial when offline data is informative but leaves biased or low-probability residual uncertainty that targeted online actions can resolve, a regime shared by offline RL, offline black-box optimization, and Bayesian optimization.
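The probe behaviour mentioned in the abstract can be illustrated with a small sketch. It is restricted to deterministic action choices and to the textbook information ratio (squared expected regret divided by expected information gain); the paper's η-parameterised family and its randomised policy class are not reproduced here. The regret and information-gain numbers are invented for illustration.

```python
from typing import Sequence


def greedy_action(regret: Sequence[float]) -> int:
    """Pick the action with the smallest expected instantaneous regret."""
    return min(range(len(regret)), key=lambda a: regret[a])


def ids_action(
    regret: Sequence[float], info_gain: Sequence[float], eps: float = 1e-12
) -> int:
    """Pick the action minimising regret^2 / information gain (deterministic variant)."""
    # eps guards against division by zero for actions that reveal nothing.
    ratios = [d * d / max(g, eps) for d, g in zip(regret, info_gain)]
    return min(range(len(ratios)), key=lambda a: ratios[a])
```

With made-up expected regrets `[0.10, 0.12, 0.30]` and information gains `[0.001, 0.002, 0.5]`, the greedy rule picks action 0, whereas the ratio rule picks action 2: the costly but informative probe.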

2605.29402 2026-05-29 cs.CV cs.AI

Semantic and Visual Evidence for Efficient Long-Video Reasoning: A Solution for the HD-EPIC VQA Challenge

Yinsong Xu, Wei Jing, Liuxin Zhang, Wanjun Lv, Hui Li

Affiliations: Lenovo, China

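AI summary: Proposes a unified framework that decouples long-video reasoning into semantic evidence (coarse-to-fine extraction of global procedural structure) and visual evidence (object-centric fine-grained grounding), combined through query-conditioned evidence retrieval and integration, and achieves competitive performance in the HD-EPIC VQA Challenge.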

Abstract

Understanding long-form egocentric videos remains challenging for multimodal large language models (MLLMs) due to limited context length and insufficient grounding of fine-grained visual details. The recently proposed HD-EPIC benchmark highlights these limitations: even strong long-context models achieve relatively low performance across diverse video question answering tasks. In this paper, we propose a unified framework that decouples long-video reasoning into two complementary forms of evidence: semantic evidence and visual evidence. Semantic evidence captures global procedural structure through a coarse-to-fine extraction pipeline, while object-centric visual evidence preserves fine-grained grounding through bounding boxes and visual embeddings. During inference, we formulate reasoning as a query-conditioned evidence retrieval and integration process, dynamically selecting relevant information from both sources. Our approach achieves competitive performance in the HD-EPIC-VQA Challenge across multiple task categories. More broadly, our results demonstrate that explicitly structuring, retrieving, and integrating semantic and visual evidence is critical for effective long-video understanding with MLLMs.

2605.29401 2026-05-29 cs.LG

Aligned but Fragile: Enhancing LLM Safety Robustness via Zeroth-Order Optimization

Zhihao Liu, Yifan Wu, Jian Lou, Di Wang, Yuxi Zhou, Yuke Hu

Affiliations: The State Key Laboratory of Blockchain and Data Security; Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security; Sun Yat-sen University; KAUST

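AI summary: Targets the vulnerability of safety-aligned LLMs to lightweight post-alignment manipulations (parameter noise, activation noise, or quantization). Proposes a hybrid framework based on zeroth-order optimization that improves robustness by performing standard first-order safety alignment followed by zeroth-order refinement, and uses the perturbation-based evaluations to estimate layer-wise robustness sensitivity so that updates focus efficiently on critical layers.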

Abstract

Safety alignment for large language models (LLMs) aims to reduce harmful or unsafe behavior while preserving general utility. However, recent findings reveal that alignment effects can be fragile: lightweight post-alignment manipulations, such as parameter noise, activation noise, or quantization, can easily weaken the intended safety behavior. Prior efforts to improve robustness have primarily focused on data curation, modified alignment objectives, and safety-critical parameter identification, leaving the role of the optimizer itself largely unexplored. In this paper, we are the first to study the robustness of safety alignment from the perspective of the base optimizer. This optimizer-centric view naturally points to zeroth-order optimization, which provides a robustness-oriented signal by evaluating safety alignment under perturbations. Based on this insight, we propose a hybrid framework that first performs standard first-order safety alignment and then applies zeroth-order refinement to improve robustness. Both theoretically and empirically, we show that only a few zeroth-order refinement steps can enhance robustness while preserving safety alignment. We further improve the efficiency of zeroth-order refinement by exploiting its inherent perturbation-based evaluations to estimate layer-wise robustness sensitivity, enabling the refinement process to concentrate updates on robustness-critical layers with modest training overhead.
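For readers unfamiliar with zeroth-order optimization, the sketch below shows a generic two-point estimator: the loss is evaluated at two randomly perturbed copies of the parameters, and the difference drives the update. It is a standard textbook form, not the paper's refinement procedure; `toy_loss` is a hypothetical quadratic standing in for a safety-alignment loss, and the step size and perturbation scale are arbitrary.

```python
import random
from typing import Callable, List


def zo_step(
    params: List[float],
    loss_fn: Callable[[List[float]], float],
    rng: random.Random,
    lr: float = 0.05,
    eps: float = 1e-3,
) -> List[float]:
    """One zeroth-order update using only two loss evaluations (no gradients)."""
    z = [rng.gauss(0.0, 1.0) for _ in params]            # random search direction
    plus = [p + eps * zi for p, zi in zip(params, z)]    # perturbed copy, +eps
    minus = [p - eps * zi for p, zi in zip(params, z)]   # perturbed copy, -eps
    slope = (loss_fn(plus) - loss_fn(minus)) / (2.0 * eps)
    return [p - lr * slope * zi for p, zi in zip(params, z)]


def toy_loss(params: List[float]) -> float:
    """Hypothetical stand-in loss: squared distance to a fixed target."""
    target = [1.0, -2.0, 0.5]
    return sum((p - t) ** 2 for p, t in zip(params, target))
```

Because every update is computed from the loss under perturbed parameters, the signal reflects how the model behaves in a neighbourhood of the current weights, which is the robustness-oriented property the abstract points to.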

2605.29394 2026-05-29 cs.AI

EvoMD-LLM: Learning the Language of Species Evolution in Reactive Molecular Dynamics

Zhichen Tang, Zhengzheng Dang, Yulin Chen, Jixin Wu, Haiwen Li, Yanming Wang

Affiliations: Global College, Shanghai Jiao Tong University; Global Institute of Future Technology, Shanghai Jiao Tong University

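AI summary: Proposes the EvoMD-LLM framework, which discretizes reactive molecular dynamics trajectories into symbolic temporal sequences and uses a temporal scaffolding mechanism so that autoregressive LLMs learn the evolution of species composition; it outperforms baselines on multiple temporal prediction tasks and can generate interpretable predictions.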

Comments 17 pages, ACL Findings

Abstract

While large language models (LLMs) excel at static scientific reasoning, they struggle to model the temporal structure of dynamic physical processes. We present EvoMD-LLM (Evolutionary Molecular Dynamics Large Language Model), a framework that reformulates species-level molecular dynamics as a symbolic temporal language modeling problem. Reactive MD trajectories are discretized into sequences of molecular events, where each token represents a chemical species augmented with its persistence duration, enabling standard autoregressive LLMs to learn compositional evolution over time through efficient fine-tuning. A key component of EvoMD-LLM is temporal scaffolding, which treats event duration as an explicit linguistic token and serves as a structured inductive bias, significantly reducing invalid or hallucinated molecular outputs compared to conventional sequence modeling approaches. We evaluate EvoMD-LLM on multiple temporal prediction tasks, achieving up to 66.14% accuracy and consistently outperforming sequential neural networks and language-based baselines. Beyond quantitative improvements, we qualitatively observe that the model is capable of generating interpretations for its own predictions by incorporating relevant chemical knowledge, even though it was not explicitly supervised with paired trajectory-explanation data. These results demonstrate that symbolic temporal language modeling provides an effective framework for grounding LLMs in dynamic physical simulations.
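One way to picture the discretization step is run-length encoding: consecutive frames showing the same species collapse into a single event carrying its persistence duration. The sketch below assumes a per-frame list of species labels and a made-up token format `SPECIES<dur=...>`; the paper's actual vocabulary and duration encoding are not specified in the abstract.

```python
from itertools import groupby
from typing import List


def to_event_tokens(frames: List[str], dt: float = 1.0) -> List[str]:
    """Collapse per-frame species labels into species-plus-duration tokens."""
    tokens = []
    for species, run in groupby(frames):
        n_frames = sum(1 for _ in run)  # how long this species persisted
        tokens.append(f"{species}<dur={n_frames * dt:g}>")
    return tokens
```

For example, the frame sequence `CH4, CH4, CH3, CH3, CH3, CH2O` with `dt=0.5` becomes three tokens, so the duration is carried explicitly in the text the language model sees.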

2605.29390 2026-05-29 cs.CV

Orthogonal Negative Guidance in Attention Feature Space for Text-to-Image Generation

Jungmin Ko, Jungwon Park, Jimyeong Kim, Changin Choi, Wonseok Lee, Wonjong Rhee

Affiliations: Interdisciplinary Program in Artificial Intelligence, Seoul National University; Research Institute for Convergence Science, Seoul National University; Artificial Intelligence Institute, Seoul National University; Department of Intelligence and Information, Seoul National University; Daegu Gyeongbuk Institute of Science and Technology; Samsung Advanced Institute of Technology, Samsung Electronics Co., Ltd

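AI summary: Proposes orthogonal negative guidance in attention feature space, a training-free method that orthogonalizes negative-prompt attention features against positive-prompt features and subtracts only the orthogonal component, suppressing unwanted concepts while preserving image quality and prompt alignment.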

Comments Preprint

Abstract

Text-to-image (T2I) models have become increasingly capable of generating high-quality images. Yet, enforcing the explicit absence of a specified object or attribute remains a fundamentally challenging problem. Existing approaches, including prompt negation, post-hoc editing, and negative guidance, remain insufficient for explicit concept suppression, often failing to remove the target concept or degrading overall image quality. To this end, we propose Orthogonal Negative Guidance in attention feature space, a training-free method that operates in the attention output space of MM-DiT-based T2I transformers. Our method orthogonalizes negative-prompt attention features with respect to positive-prompt features and subtracts only the orthogonal component, suppressing unwanted concepts while preserving desired semantics. Experiments on FLUX-dev and FLUX-schnell show that our method achieves favorable trade-offs between concept suppression, prompt alignment, and image quality. In human evaluation, our method outperforms the second-best baseline by 18.78%. We further show that our method supports multi-concept suppression and adjustable concept suppression.
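The orthogonalization described in the abstract amounts to a vector projection. The sketch below applies it to a single feature vector, assuming plain Euclidean projection and a scalar `scale` for the guidance strength; where in the MM-DiT attention stack this happens and how the strength is set are not covered, and the function name is illustrative.

```python
from typing import List


def orthogonal_negative_guidance(
    pos: List[float], neg: List[float], scale: float = 1.0
) -> List[float]:
    """Subtract from `pos` only the part of `neg` that is orthogonal to `pos`."""
    pos_sq = sum(p * p for p in pos)
    if pos_sq == 0.0:
        return list(pos)  # nothing to project onto; leave the feature unchanged
    # Component of neg along pos, which is left out so desired semantics survive.
    coef = sum(n * p for n, p in zip(neg, pos)) / pos_sq
    neg_orth = [n - coef * p for n, p in zip(neg, pos)]
    return [p - scale * q for p, q in zip(pos, neg_orth)]
```

With `pos = [1, 0]` and `neg = [1, 1]`, the shared direction is left alone and only the orthogonal part `[0, 1]` is subtracted, giving `[1, -1]`; the projection of the result onto `pos` is unchanged.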

2605.29387 2026-05-29 cs.LG cs.AI stat.ML

Attention Asymmetry in AI Layoff Discourse on X: A Computational Analysis of Capital vs Labour Amplification

Joy Bose

Affiliations: Independent Researcher

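AI summary: Using tweets collected from X, the account-based collection method shows that capital discourse is amplified 3.12x as much as labour discourse; a 2.69x asymmetry remains after normalising for follower count. The paper introduces the Amplification Ratio and the Amplification Normalisation Index as metrics of platform discourse inequality.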

Comments 18 pages, 3 figures, 9 tables

Abstract

When workers lose jobs to AI-driven restructuring, two very different conversations happen on X (formerly Twitter) at the same time. Tech executives and AI researchers talk about productivity, transformation, and opportunity. Laid-off workers and labour critics talk about job loss, uncertainty, and fear. This paper asks a simple question: which conversation gets more reach? We report three studies using two collection methods and 763 tweets from 20 named public accounts. Study 1 used keyword-based collection (n=392) and found no significant difference between corpora (p=0.891), revealing that keyword search is too noisy for this task. Study 2 used account-based collection (n=96) and found a 3.12x mean amplification advantage for capital discourse over labour discourse (p=0.000003, Cohen's d=0.555). Study 3 combined both methods (n=763) and confirmed the finding at 4.18x mean and 10.77x median amplification ratio (p<0.000001). Critically, after normalising for follower count, the asymmetry persists at 2.69x (p=0.000009, Cohen's d=0.491), demonstrating that the effect is not simply a consequence of capital accounts having larger audiences. The finding is robust across all tested amplification metric weightings. We introduce the Amplification Ratio and Amplification Normalisation Index as simple metrics for measuring platform-level discourse inequality. A cross-platform replication on Reddit (n=647 posts) did not replicate the finding, suggesting the asymmetry may be specific to X's account-based amplification architecture. We discuss the methodological implications for cross-platform discourse analysis.
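A ratio of this kind could be computed along the following lines. The abstract does not define the engagement score or the Amplification Normalisation Index, so the weighted sum of likes, reposts, and replies, the weights themselves, and the per-follower variant below are assumptions chosen for illustration, as are the field names.

```python
from statistics import mean
from typing import Dict, List

Tweet = Dict[str, float]


def engagement(
    t: Tweet, w_like: float = 1.0, w_repost: float = 2.0, w_reply: float = 1.0
) -> float:
    """Hypothetical amplification score: a weighted sum of interactions."""
    return w_like * t["likes"] + w_repost * t["reposts"] + w_reply * t["replies"]


def amplification_ratio(
    capital: List[Tweet], labour: List[Tweet], per_follower: bool = False
) -> float:
    """Mean score of the capital corpus divided by that of the labour corpus."""
    def score(t: Tweet) -> float:
        s = engagement(t)
        return s / t["followers"] if per_follower else s

    return mean(score(t) for t in capital) / mean(score(t) for t in labour)
```

Calling the function twice, once with `per_follower=False` and once with `per_follower=True`, separates the raw reach gap from the part that remains after audience size is accounted for.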

Recovering Policy-Induced Errors: Benchmarking and Trajectory Synthesis for Robust GUI Agents

CrystalXRD-Bench: Benchmarking Vision-Language Models for XRD Peak Indexing Across Diverse Crystalline Materials

SkillBrew: Multi-Objective Curation of Skill Banks for LLM Agents

ElegantVLA: Learning When to Think for Efficient Vision-Language-Action Models

Towards Human-Like Interactive Speech Recognition With Agentic Correction and Semantic Evaluation

One Click per Cell Type Suffices: Training-free Group Interaction for Cell Instance Segmentation

FinGuard: Detecting Financial Regulatory Non-Compliance in LLM Interactions

ReasonLight: A Multimodal Foundation Model-Enhanced Reinforcement Learning Framework for Zero-Shot Traffic Signal Control

Learning Design Skills as Memory Policies for Agentic Photonic Inverse Design

When Does Persona Prompting Actually Help? A Retrieval and Metric Analysis of Expert Role Injection in LLMs

3DVLA: Enhancing Vision-Language-Action Models via 3D Spatial and Instance Understanding

Beyond Bilingual Transfer: Multilingual Code-Switching in Instruction Tuning

The Good, the Bad, and the Ugly of Markov Boundary for Tabular Prediction

A Progress-Aware Leader-Follower Midair Docking System for Dual-Drone Aerial Manipulation

Phase-Conditioned Imitation Learning with Autonomous Failure Recovery for Robust Deformable Object Manipulation

Information-Directed Offline-to-Online Reinforcement Learning

Semantic and Visual Evidence for Efficient Long-Video Reasoning: A Solution for the HD-EPIC VQA Challenge

Rethinking Post-Training Recipes for Multimodal Time-Series Forecasting

Architecture-Sensitive Supervised Fine-Tuning for Screen-Conditioned Action Prediction: A PiSAR Benchmark

GDSD: Reinforcement Learning as Guided Denoiser Self-Distillation for Diffusion Language Models

Revisiting Observation Reduction for Web Agents: Comprehensive Evaluation with a Lightweight Framework

Aligned but Fragile: Enhancing LLM Safety Robustness via Zeroth-Order Optimization

EvoMD-LLM: Learning the Language of Species Evolution in Reactive Molecular Dynamics

Orthogonal Negative Guidance in Attention Feature Space for Text-to-Image Generation

On the Optimizer Dependence of Neural Scaling Laws

TRACER: Persistent Regularization for Robust Multimodal Finetuning

BrahmicTokenizer-131K: An Indic-Capable Drop-In Replacement for o200k_base

Decentralized LLM-Driven Coordination of Acoustic Robots for Contactless Object Manipulation

SURGENT: A Surgical Multi-Agent Assistance System Across the Perioperative Workflow

Attention Asymmetry in AI Layoff Discourse on X: A Computational Analysis of Capital vs Labour Amplification