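arXivDaily is a daily academic digest that syncs the full arXiv dataset and provides AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and other fields.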

2603.12766 2026-04-01 cs.CV

Catalyst4D: High-Fidelity 3D-to-4D Scene Editing via Dynamic Propagation

Shifeng Chen, Yihui Li, Jun Liao, Hongyu Yang, Di Huang

Comments https://junliao2025.github.io/Catalyst4D-ProjectPage/

2603.08533 2026-04-01 cs.CV

SecAgent: Efficient Mobile GUI Agent with Semantic Context

Yiping Xie, Song Chen, Jingxuan Xing, Wei Jiang, Zekun Zhu, Yingyao Wang, Pi Bu, Jun Song, Yuning Jiang, Bo Zheng

2603.06753 2026-04-01 cs.CV

EarthBridge: A Solution for 4th Multi-modal Aerial View Image Challenge Translation Track

Zhenyuan Chen, Guanyuan Shen, Feng Zhang

Comments accepted by CVPRW 2026

2603.06679 2026-04-01 cs.AI cs.CV cs.GR

MultiGen: Level-Design for Editable Multiplayer Worlds in Diffusion Game Engines

Ryan Po, David Junhao Zhang, Amir Hertz, Gordon Wetzstein, Neal Wadhwa, Nataniel Ruiz

Comments Project page here: https://ryanpo.com/multigen/

2603.06561 2026-04-01 cs.CV

EgoReasoner: Learning Egocentric 4D Reasoning via Task-Adaptive Structured Thinking

Fangrui Zhu, Yunfeng Xi, Jianmo Ni, Mu Cai, Boqing Gong, Long Zhao, Chen Qu, Ian Miao, Yi Li, Cheng Zhong, Huaizu Jiang, Shwetak Patel

Comments preprint

2603.05659 2026-04-01 cs.CV cs.AI cs.LG

When Rubrics Fail: Error Enumeration as Reward in Reference-Free RL Post-Training for Virtual Try-On

Wisdom Ikezogwo, Mehmet Saygin Seyfioglu, Ranjay Krishna, Karim Bouyarmane

2603.02413 2026-04-01 cs.CV

TruckDrive: Long-Range Autonomous Highway Driving Dataset

Filippo Ghilotti, Edoardo Palladin, Samuel Brucker, Adam Sigal, Mario Bijelic, Felix Heide

2603.01228 2026-04-01 cs.CV

Towards Policy-Adaptive Image Guardrail: Benchmark and Method

Caiyong Piao, Zhiyuan Yan, Haoming Xu, Yunzhen Zhao, Kaiqing Lin, Feiyang Xu, Shuigeng Zhou

2603.01142 2026-04-01 cs.CV

ArtLLM: Generating Articulated Assets via 3D LLM

Penghao Wang, Siyuan Xie, Hongyu Yan, Xianghui Yang, Jingwei Huang, Chunchao Guo, Jiayuan Gu

Comments CVPR 2026. Project page: https://authoritywang.github.io/artllm/

2603.00314 2026-04-01 cs.CL cs.AI

When Metrics Disagree: Automatic Similarity vs. LLM-as-a-Judge for Clinical Dialogue Evaluation

Bian Sun, Zhenjian Wang, Orvill de la Torre, Zirui Wang

2602.23574 2026-04-01 cs.CV cs.AI cs.LG

Evidential Neural Radiance Fields

Ruxiao Duan, Alex Wong

Comments The IEEE/CVF Conference on Computer Vision and Pattern Recognition 2026

2602.22743 2026-04-01 cs.AI

Generative Data Transformation: From Mixed to Unified Data

Jiaqing Zhang, Mingjia Yin, Hao Wang, Yuxin Tian, Yuyang Ye, Yawen Li, Wei Guo, Yong Liu, Enhong Chen

Comments Accepted by The Web Conference 2026 (WWW '26)

2602.19910 2026-04-01 cs.CV

Multi-Modal Representation Learning via Semi-Supervised Rate Reduction for Generalized Category Discovery

Wei He, Xianghan Meng, Zhiyuan Huang, Xianbiao Qi, Rong Xiao, Chun-Guang Li

Comments 15 pages, accepted by CVPR 2026

2602.19261 2026-04-01 cs.LG cs.AI cs.NE

DGPO: RL-Steered Graph Diffusion for Neural Architecture Generation

Aleksei Liuliakov, Luca Hermes, Barbara Hammer

Comments Submitted to IJCNN 2026 (IEEE WCCI). 7 pages, 4 figures

2602.17650 2026-04-01 cs.CV

Human-level 3D shape perception emerges from multi-view learning

Tyler Bonnen, Jitendra Malik, Angjoo Kanazawa

Comments Project page: https://tzler.github.io/human_multiview Code: https://github.com/tzler/human_multiview Huggingface dataset: https://huggingface.co/datasets/tzler/MOCHI

2602.15257 2026-04-01 cs.CV cs.AI cs.CL

How to Train Your Long-Context Visual Document Model

Austin Veselka

2602.14265 2026-04-01 cs.CL cs.LG

STATe-of-Thoughts: Structured Action Templates for Tree-of-Thoughts

Zachary Bamberger, Till R. Saenger, Gilad Morad, Ofra Amir, Brandon M. Stewart, Amir Feder

Comments v2, 10 pages main, 80 pages total, 19 tables, 20 figures

2602.14157 2026-04-01 cs.CV cs.AI cs.LG

When Test-Time Guidance Is Enough: Fast Image and Video Editing with Diffusion Guidance

Ahmed Ghorbel, Badr Moufad, Navid Bagheri Shouraki, Alain Oliviero Durmus, Thomas Hirtz, Eric Moulines, Jimmy Olsson, Yazid Janati

Journal ref ICLR 2026, ReALM-GEN workshop

2602.09826 2026-04-01 cs.CL

From FusHa to Folk: Exploring Cross-Lingual Transfer in Arabic Language Models

Abdulmuizz Khalak, Abderrahmane Issam, Gerasimos Spanakis

Comments Accepted to VarDial 2026

2602.03006 2026-04-01 cs.AI

Distilling LLM Reasoning into Graph of Concept Predictors

Ziyang Yu, Liang Zhao

2602.01838 2026-04-01 cs.CL

AXE: Low-Cost Cross-Domain Web Structured Information Extraction

Abdelrahman Mansour, Khaled W. Alshaer, Moataz Elsaban

2602.00702 2026-04-01 cs.CV

JoyStreamer: Unlocking Highly Expressive Avatars via Harmonized Text-Audio Conditioning

Ruikui Wang, Jinheng Feng, Lang Tian, Huaishao Luo, Chaochao Li, Liangbo Zhou, Huan Zhang, Youzheng Wu, Xiaodong He

2601.22150 2026-04-01 cs.CV

Do VLMs Perceive or Recall? Probing Visual Perception vs. Memory with Classic Visual Illusions

Xiaoxiao Sun, Mingyang Li, Kun Yuan, Min Woo Sun, Mark Endo, Shengguang Wu, Changlin Li, Yuhui Zhang, Zeyu Wang, Serena Yeung-Levy

Comments 26 pages, 31 figures, 13 tables. Project Page: https://sites.google.com/view/vi-probe/

2601.17094 2026-04-01 cs.LG cs.AI cs.CL

The Mouth is Not the Brain: Bridging Energy-Based World Models and Language Generation

Junichiro Niimi

Comments ICLR 2026 The 2nd Workshop on World Models: Understanding, Modelling, and Scaling

2601.15968 2026-04-01 cs.CV

HyperAlign: Hypernetwork for Efficient Test-Time Alignment of Diffusion Models

Xin Xie, Jiaxian Guo, Dong Gong

2601.13358 2026-04-01 cs.AI cs.LG

The Geometry of Thought: How Scale Restructures Reasoning In Large Language Models

Samuel Cyrenius Anderson

Comments The theoretical framework has been shown to be wrong and should not be followed for future research direction

2601.09176 2026-04-01 cs.LG

$D^2Prune$: Sparsifying Large Language Models via Dual Taylor Expansion and Attention Distribution Awareness

Lang Xiong, Ning Liu, Ao Ren, Yuheng Bai, Haining Fang, BinYan Zhang, Zhe Jiang, Yujuan Tan, Duo Liu

Journal ref Proceedings of the AAAI Conference on Artificial Intelligence, 40(32), 27171-27179, 2026

2601.06874 2026-04-01 cs.CV

MVGGT: Multimodal Visual Geometry Grounded Transformer for Multiview 3D Referring Expression Segmentation

Changli Wu, Haodong Wang, Jiayi Ji, Yutian Yao, Chunsai Du, Jihua Kang, Yanwei Fu, Liujuan Cao

Comments Accepted to CVPR 2026; Project Website: https://mvggt.github.io/

2601.02536 2026-04-01 cs.CV

MovieRecapsQA: A Multimodal Open-Ended Video Question-Answering Benchmark

Shaden Shaar, Bradon Thymes, Sirawut Chaixanien, Claire Cardie, Bharath Hariharan

Comments CVPR 2026

Abstract

Understanding real-world videos such as movies requires integrating visual and dialogue cues. Yet existing VideoQA benchmarks struggle to capture this multimodal reasoning and, given the difficulty of evaluating free-form answers, largely resort to simple multiple-choice questions. We introduce a novel open-ended multimodal VideoQA benchmark, MovieRecapsQA, created using movie recap videos -- a distinctive type of YouTube content that summarizes a film via a voiceover description of key clips from the movie (recap video). From the transcribed voiceover (recap summary) of 60 recap videos, we generate $\approx$8.2K questions along with the necessary "facts" expected in each answer; the former facilitates the creation of questions that require multimodal reasoning and the latter allows the construction of a reference-free evaluation metric that can be applied to open-ended responses. To our knowledge, this is the first reference-free open-ended VideoQA benchmark. The benchmark allows each question to be evaluated in different input video settings: given (a) the full-length movie, (b) the full ($\approx$11 min) recap video (visual only), (c) $\approx$14 min of aligned movie scenes, i.e., movie scenes relevant to the question, and (d) $\approx$1.2 min of aligned recap video scenes. In all cases, the text of any associated movie dialogue is provided. Each question is categorized by the modality required to answer it -- visual, dialogue, or both -- enabling fine-grained evaluation of multimodal capabilities. We benchmark (setting (d)) seven state-of-the-art MLLMs and find that (i) only our reference-free metric produces meaningful human-aligned model separation; (ii) vision-centric questions yield the lowest scores across all models; (iii) removing visual input often improves model factuality; and (iv) the primary bottleneck is visual perception, not visual reasoning.

2512.24176 2026-04-01 cs.CV cs.LG

Guiding a Diffusion Transformer with the Internal Dynamics of Itself

Xingyu Zhou, Qifan Li, Xiaobin Hu, Hai Chen, Shuhang Gu

Comments Accepted to CVPR 2026. Project Page: https://zhouxingyu13.github.io/Internal-Guidance/
