2606.09809
2026-06-10
cs.AI
Version update
Evaluation Cards: An Interpretive Layer for AI Evaluation Reporting
Avijit Ghosh, Anka Reuel, Jenny Chim, Wm. Matthew Kennedy, Srishti Yadav, Jennifer Mickel, Yanan Long, Andrew Tran, Anastassia Kornilova, Damian Stachura, Kevin Klyman, Felix Friedrich, Jeba Sania, Jan Batzner, Anoop Mishra, Eliya Habba, Yixiong Hao, Nathan Heath, Shalaleh Rismani, Usman Gohar, Andrea Loehr, David Manheim, Ruchira Dhar, Sree Harsha Nelaturu, Aarush Sinha, Leshem Choshen, Drishti Sharma, Ishan Khire, Amit Saha, Subramanyam Sahoo, Michael Hardy, Michael Alexander Riegler, Kabir Manghnani, Michelle Lin, Yanan Jiang, Yilin Huang, Asaf Yehudai, Jessica Ji, Aris Hofmann, Mubashara Akhtar, Max Lamparth, Nuno Moniz, Yacine Jernite, Stella Biderman, Zeerak Talat, Sanmi Koyejo, Mykel Kochenderfer, Irene Solaiman
Affiliations
Hugging Face
Stanford University
Queen Mary University of London
University of Copenhagen
Trustible
EleutherAI
TU Darmstadt
Weizenbaum Institute & Technical University of Munich
Harvard University
The Hebrew University of Jerusalem
Iowa State University
IBM Research
University of Chicago
Independent
Berkeley AI Safety Institute (BASIS)
Simula
University of Edinburgh
ETH Zurich & ETH AI Center
Oxford Internet Institute
Amherst College
University of Nebraska
Syntony Research
McGill University
Evals Consensus
Israel Institute of Technology
IOL.Learn & Zuse Institute Berlin
Georgia Institute of Technology
Quebec AI Institute, Université de Montréal
University of Notre Dame
Georgetown University
DHBW Stuttgart
Massachusetts Institute of Technology
AI summary
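To address inconsistency in AI evaluation reporting, the paper proposes EvalCards as a unified record layer. Using a structured schema, four interpretive signals, and monitoring tools, it covers 5,816 models and 635 benchmarks and reveals systematic gaps in reporting practices.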