arXivDaily arXiv每日学术速递 周一至周五更新

AI 大模型

大模型推理能力

大模型数学、逻辑、规划、多步推理和测试时计算能力。

今日/当前日期收录 1 信号源:cs.CL, cs.AI, cs.LG
2605.20531 2026-06-19 cs.LO cs.LG 版本更新 80%

Pseudo-Formalization for Automatic Proof Verification

伪形式化用于自动证明验证

Slim Barkallah, Luke Bailey, Kaiyue Wen, Mohammed Abouzaid, Tengyu Ma

发表机构 * GitHub

专题命中 数学推理 :伪形式化用于自动证明验证

AI总结 本文提出了一种名为伪形式化的证明格式,该格式在保持自然语言灵活性的同时,保留了形式证明的模块性和精确性,通过块验证算法实现了对自然语言证明的高效验证,其在错误发现的精度和召回率上优于现有基线方法。

Comments 31 pages, code available at https://github.com/Slim205/pseudo-formalization

详情
AI中文摘要

可靠的证明验证仍然是训练和评估在复杂数学推理上的人工智能系统的主要瓶颈。在像Lean这样的语言中,完全形式化的证明容易验证,因为它们是无歧义且模块化的。大多数证明,尤其是由人工智能系统编写证明,既没有这种属性,将它们翻译成形式语言在许多前沿数学领域仍然具有挑战性。我们提出了伪形式化(PF),一种证明格式,它捕捉了形式证明的模块性和精确性,同时保留了自然语言的灵活性。一个伪形式化证明被分解成自包含的模块,每个模块陈述其前提、结论和证明,用自然语言。为了验证一个常规的自然语言证明的正确性,一个LLM将其翻译成伪形式化,然后独立验证每个模块,我们称之为块验证(BV)。我们在两个涵盖竞赛和研究级数学的基准上评估PF+BV,其中它在错误发现的精度和召回率上优于LLM-as-judge基线。为了支持未来的工作,我们发布了我们的研究级证明验证基准ArxivMathGradingBench。

英文摘要

Reliable verification of proofs remains a bottleneck for training and evaluating AI systems on hard mathematical reasoning. Fully formal proofs, in languages like Lean, are easy to verify because they are unambiguous and modular. Most proofs, particularly those written by AI systems, have neither property, and translating them into formal languages remains challenging in many frontier math settings. We propose Pseudo-Formalization (PF), a proof format that captures the modularity and precision of formal proofs while retaining the flexibility of natural language. A Pseudo-Formal proof is decomposed into self-contained modules, each stating its premises, conclusion, and proof in natural language. To verify the correctness of a regular natural language proof, an LLM translates it to Pseudo-Formal and then verifies each module independently, an algorithm we call Block Verification (BV). We evaluate PF+BV on two benchmarks spanning olympiad and research-level mathematics, where it pareto-dominates LLM-as-judge baselines on error-finding precision and recall. To support future work, we release our research-level proof verification benchmark ArxivMathGradingBench.