---
title: "programbench agent benchmark"
source_url: https://programbench.com/
ingested: 2026-05-08
sha256: pending-rehash-2026-06-01
review_value: 8
review_confidence: 8
review_recommendation: strong
review_stars: 4
source_feed: TLDR AI (newsletter)
source_published: 2026-05-07
type: raw
created: 2026-05-10
updated: 2026-05-10
tags: [raw-status:stub]
---

# ProgramBench: Benchmarking Programs, Not Prompts

An agent benchmark released by Meta Superintelligence Labs / Stanford / Harvard. Task: given only a compiled binary and its documentation, the agent must implement the program from scratch (no source code, no decompilation, no network access). 200 tasks (jq → FFmpeg → SQLite), 248K+ behavioral tests. The best model (Claude Opus 4.7) reaches only 3% almost-resolved.
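
The note does not describe how ProgramBench runs its behavioral tests, so the following is only a minimal sketch of what one such check could look like, assuming a test amounts to feeding the same input to the reference binary and to the agent's reimplementation and comparing what can be observed from outside. The names `Observation`, `observe`, `behaviors_match`, and the three stand-in commands are hypothetical and are not part of the benchmark.

```python
from __future__ import annotations

import subprocess
import sys
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """Externally visible behavior of one program run (assumed: exit code + stdout)."""
    exit_code: int
    stdout: bytes


def observe(command: list[str], stdin_data: bytes, timeout_s: float = 10.0) -> Observation:
    """Run a command with the given stdin and record its exit code and stdout."""
    completed = subprocess.run(
        command,
        input=stdin_data,
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,  # assumption: stderr is not part of the compared behavior
        timeout=timeout_s,
        check=False,  # a non-zero exit code is data to compare, not an error
    )
    return Observation(completed.returncode, completed.stdout)


def behaviors_match(reference_cmd: list[str], candidate_cmd: list[str], stdin_data: bytes) -> bool:
    """Return True when both programs produce the same observation for one input."""
    return observe(reference_cmd, stdin_data) == observe(candidate_cmd, stdin_data)


# Hypothetical stand-ins for a reference binary and two reimplementations.
# Each is a Python one-liner so the sketch runs without any external program.
REFERENCE_CMD = [sys.executable, "-c", "import sys; sys.stdout.write(sys.stdin.read().upper())"]
FAITHFUL_CMD = [sys.executable, "-c", "import sys; print(sys.stdin.read().upper(), end='')"]
DIVERGENT_CMD = [sys.executable, "-c", "import sys; sys.stdout.write(sys.stdin.read().lower())"]
```

In this sketch, `behaviors_match(REFERENCE_CMD, FAITHFUL_CMD, b"abc")` compares two differently written programs that behave identically on that input, while `DIVERGENT_CMD` stands in for a reimplementation that fails the check. A real harness would presumably also need to cover command-line arguments, files, and stderr, none of which the note specifies.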