# AGIEval This repository contains information about AGIEval, data, code and output of baseline systems for the benchmark. # Introduction AGIEval is a human-centric benchmark specifically designed to evaluate the general abilities of foundation models in tasks pertinent to human cognition and problem-solving. This benchmark is derived from 20 official, public, and high-standard admission and qualification exams intended for general human test-takers, such as general college admission tests (e.g., Chinese College Entrance Exam (Gaokao) and American SAT), law school admission tests, math competitions, lawyer qualification tests, and national civil service exams. For a full description of the benchmark, please refer to our paper: [AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models](https://arxiv.org/pdf/2304.06364.pdf). # Tasks and Data [We have updated the dataset to version 1.1.](data/v1_1) The new version updated Chinese Gaokao (chemistry, biology, physics) datasets with questions from 2023 and addressed annotation issues. To facilitate evaluation, now all multi-choice question (MCQ) tasks have one answer only (Gaokao-Physics and JEC-QA used to have multi-label answers). AGIEval-en datasets remain the same as Verison 1.0. The new version's statistics are as follows: AGIEval v1.1 contains 20 tasks, including 18 MCQ tasks and two cloze tasks (Gaokao-Math-Cloze and MATH). You can find the full list of tasks in the table below. ![The datasets used in AGIEVal](AGIEval_tasks.png) You can download all post-processed data in the [data/v1_1](data/v1_1) folder. All usage of the data should follow the license of the original datasets. The data format for all datasets is as follows: ``` { "passage": null, "question": "设集合 $A=\\{x \\mid x \\geq 1\\}, B=\\{x \\mid-1-1\\}$", "(B)$\\{x \\mid x \\geq 1\\}$", "(C)$\\{x \\mid-1