\n", " \n", " 1 | \n", " baseline | \n", " How does the survey categorize the evaluation of LLMs and what are the three major groups mentioned? | \n", " None | \n", " ['Question \\nAnsweringTool \\nLearning\\nReasoning\\nKnowledge \\nCompletionEthics \\nand \\nMorality Bias\\nToxicity\\nTruthfulnessRobustnessEvaluation\\nRisk \\nEvaluation\\nBiology and \\nMedicine\\nEducationLegislationComputer \\nScienceFinance\\nBenchmarks for\\nHolistic Evaluation\\nBenchmarks \\nforKnowledge and Reasoning\\nBenchmarks \\nforNLU and NLGKnowledge and Capability\\nLarge Language \\nModel EvaluationAlignment Evaluation\\nSafety\\nSpecialized LLMs\\nEvaluation Organization\\n…Figure 1: Our proposed taxonomy of major categories and sub-categories of LLM evaluation.\\nOur survey expands the scope to synthesize findings from both capability and alignment\\nevaluations of LLMs. By complementing these previous surveys through an integrated\\nperspective and expanded scope, our work provides a comprehensive overview of the current\\nstate of LLM evaluation research. The distinctions between our survey and these two related\\nworks further highlight the novel contributions of our study to the literature.\\n2 Taxonomy and Roadmap\\nThe primary objective of this survey is to meticulously categorize the evaluation of LLMs,\\nfurnishing readers with a well-structured taxonomy framework. Through this framework,\\nreaders can gain a nuanced understanding of LLMs’ performance and the attendant challenges\\nacross diverse and pivotal domains.\\nNumerous studies posit that the bedrock of LLMs’ capabilities resides in knowledge and\\nreasoning, serving as the underpinning for their exceptional performance across a myriad of\\ntasks. Nonetheless, the effective application of these capabilities necessitates a meticulous\\nexamination of alignment concerns to ensure that the model’s outputs remain consistent with\\nuser expectations. 
Moreover, the vulnerability of LLMs to malicious exploits or inadvertent\\nmisuse underscores the imperative nature of safety considerations. Once alignment and safety\\nconcerns have been addressed, LLMs can be judiciously deployed within specialized domains,\\ncatalyzing task automation and facilitating intelligent decision-making. Thus, our overarching\\n6', 'This survey systematically elaborates on the core capabilities of LLMs, encompassing critical\\naspects like knowledge and reasoning. Furthermore, we delve into alignment evaluation and\\nsafety evaluation, including ethical concerns, biases, toxicity, and truthfulness, to ensure the\\nsafe, trustworthy and ethical application of LLMs. Simultaneously, we explore the potential\\napplications of LLMs across diverse domains, including biology, education, law, computer\\nscience, and finance. Most importantly, we provide a range of popular benchmark evaluations\\nto assist researchers, developers and practitioners in understanding and evaluating LLMs’\\nperformance.\\nWe anticipate that this survey would drive the development of LLMs evaluations, offering\\nclear guidance to steer the controlled advancement of these models. This will enable LLMs\\nto better serve the community and the world, ensuring their applications in various domains\\nare safe, reliable, and beneficial. With eager anticipation, we embrace the future challenges\\nof LLMs’ development and evaluation.\\n58'] | \n", " 0.375000 | \n", " 1. The retrieved context does match the subject matter of the user's query. The user's query is about how a survey categorizes the evaluation of Large Language Models (LLMs) and the three major groups mentioned. The context provided discusses the categorization of LLMs evaluation in the survey, mentioning aspects like knowledge and reasoning, alignment evaluation, safety evaluation, and potential applications across diverse domains. \n", "\n", "2. 
However, the context does not provide a full answer to the user's query. While it does discuss the categorization of LLMs evaluation, it does not clearly mention the three major groups. The context mentions several aspects of LLMs evaluation, but it is not clear which of these are considered the three major groups. \n", "\n", "[RESULT] 1.5 | \n", "
\n", " \n", " 9 | \n", " baseline | \n", " How does this survey on LLM evaluation differ from previous reviews conducted by Chang et al. (2023) and Liu et al. (2023i)? | \n", " None | \n", " ['This survey systematically elaborates on the core capabilities of LLMs, encompassing critical\\naspects like knowledge and reasoning. Furthermore, we delve into alignment evaluation and\\nsafety evaluation, including ethical concerns, biases, toxicity, and truthfulness, to ensure the\\nsafe, trustworthy and ethical application of LLMs. Simultaneously, we explore the potential\\napplications of LLMs across diverse domains, including biology, education, law, computer\\nscience, and finance. Most importantly, we provide a range of popular benchmark evaluations\\nto assist researchers, developers and practitioners in understanding and evaluating LLMs’\\nperformance.\\nWe anticipate that this survey would drive the development of LLMs evaluations, offering\\nclear guidance to steer the controlled advancement of these models. This will enable LLMs\\nto better serve the community and the world, ensuring their applications in various domains\\nare safe, reliable, and beneficial. With eager anticipation, we embrace the future challenges\\nof LLMs’ development and evaluation.\\n58', '(2021)\\nBEGIN (Dziri et al., 2022b)\\nConsisTest (Lotfi et al., 2022)\\nSummarizationXSumFaith (Maynez et al., 2020)\\nFactCC (Kryscinski et al., 2020)\\nSummEval (Fabbri et al., 2021)\\nFRANK (Pagnoni et al., 2021)\\nSummaC (Laban et al., 2022)\\nWang et al. (2020)\\nGoyal & Durrett (2021)\\nCao et al. (2022)\\nCLIFF (Cao & Wang, 2021)\\nAggreFact (Tang et al., 2023a)\\nPolyTope (Huang et al., 2020)\\nMethodsNLI-based MethodsWelleck et al. (2019)\\nLotfi et al. (2022)\\nFalke et al. (2019)\\nLaban et al. (2022)\\nMaynez et al. (2020)\\nAharoni et al. (2022)\\nUtama et al. (2022)\\nRoit et al. 
(2023)\\nQAQG-based MethodsFEQA (Durmus et al., 2020)\\nQAGS (Wang et al., 2020)\\nQuestEval (Scialom et al., 2021)\\nQAFactEval (Fabbri et al., 2022)\\nQ2 (Honovich et al., 2021)\\nFaithDial (Dziri et al., 2022a)\\nDeng et al. (2023b)\\nLLMs-based MethodsFIB (Tam et al., 2023)\\nFacTool (Chern et al., 2023)\\nFActScore (Min et al., 2023)\\nSelfCheckGPT (Manakul et al., 2023)\\nSAPLMA (Azaria & Mitchell, 2023)\\nLin et al. (2022b)\\nKadavath et al. (2022)\\nFigure 3: Overview of alignment evaluations.\\n4 Alignment Evaluation\\nAlthough instruction-tuned LLMs exhibit impressive capabilities, these aligned LLMs are\\nstill suffering from annotators’ biases, catering to humans, hallucination, etc. To provide a\\ncomprehensive view of LLMs’ alignment evaluation, in this section, we discuss those of ethics,\\nbias, toxicity, and truthfulness, as illustrated in Figure 3.\\n21'] | \n", " 0.000000 | \n", " 1. The retrieved context does not match the subject matter of the user's query. The user's query is asking for a comparison between the current survey on LLM evaluation and previous reviews conducted by Chang et al. (2023) and Liu et al. (2023i). However, the context does not mention these previous reviews at all, making it impossible to draw any comparisons. Therefore, the context does not match the subject matter of the user's query. (0/2)\n", "2. The retrieved context cannot be used exclusively to provide a full answer to the user's query. As mentioned above, the context does not mention the previous reviews by Chang et al. and Liu et al., which are the main focus of the user's query. Therefore, it cannot provide a full answer to the user's query. (0/2)\n", "\n", "[RESULT] 0.0 | \n", "
\n", " \n", " 11 | \n", " baseline | \n", " According to the document, what are the two main concerns that need to be addressed before deploying LLMs within specialized domains? | \n", " None | \n", " ['This survey systematically elaborates on the core capabilities of LLMs, encompassing critical\\naspects like knowledge and reasoning. Furthermore, we delve into alignment evaluation and\\nsafety evaluation, including ethical concerns, biases, toxicity, and truthfulness, to ensure the\\nsafe, trustworthy and ethical application of LLMs. Simultaneously, we explore the potential\\napplications of LLMs across diverse domains, including biology, education, law, computer\\nscience, and finance. Most importantly, we provide a range of popular benchmark evaluations\\nto assist researchers, developers and practitioners in understanding and evaluating LLMs’\\nperformance.\\nWe anticipate that this survey would drive the development of LLMs evaluations, offering\\nclear guidance to steer the controlled advancement of these models. This will enable LLMs\\nto better serve the community and the world, ensuring their applications in various domains\\nare safe, reliable, and beneficial. With eager anticipation, we embrace the future challenges\\nof LLMs’ development and evaluation.\\n58', 'objective is to delve into evaluations encompassing these five fundamental domains and their\\nrespective subdomains, as illustrated in Figure 1.\\nSection 3, titled “Knowledge and Capability Evaluation”, centers on the comprehensive\\nassessment of the fundamental knowledge and reasoning capabilities exhibited by LLMs. This\\nsection is meticulously divided into four distinct subsections: Question-Answering, Knowledge\\nCompletion, Reasoning, and Tool Learning. 
Question-answering and knowledge completion\\ntasks stand as quintessential assessments for gauging the practical application of knowledge,\\nwhile the various reasoning tasks serve as a litmus test for probing the meta-reasoning and\\nintricate reasoning competencies of LLMs. Furthermore, the recently emphasized special\\nability of tool learning is spotlighted, showcasing its significance in empowering models to\\nadeptly handle and generate domain-specific content.\\nSection 4, designated as “Alignment Evaluation”, hones in on the scrutiny of LLMs’ perfor-\\nmance across critical dimensions, encompassing ethical considerations, moral implications,\\nbias detection, toxicity assessment, and truthfulness evaluation. The pivotal aim here is to\\nscrutinize and mitigate the potential risks that may emerge in the realms of ethics, bias,\\nand toxicity, as LLMs can inadvertently generate discriminatory, biased, or offensive content.\\nFurthermore, this section acknowledges the phenomenon of hallucinations within LLMs, which\\ncan lead to the inadvertent dissemination of false information. As such, an indispensable\\nfacet of this evaluation involves the rigorous assessment of truthfulness, underscoring its\\nsignificance as an essential aspect to evaluate and rectify.\\nSection 5, titled “Safety Evaluation”, embarks on a comprehensive exploration of two funda-\\nmental dimensions: the robustness of LLMs and their evaluation in the context of Artificial\\nGeneral Intelligence (AGI). LLMs are routinely deployed in real-world scenarios, where their\\nrobustness becomes paramount. Robustness equips them to navigate disturbances stemming\\nfrom users and the environment, while also shielding against malicious attacks and deception,\\nthereby ensuring consistent high-level performance. Furthermore, as LLMs inexorably ad-\\nvance toward human-level capabilities, the evaluation expands its purview to encompass more\\nprofound security concerns. 
These include but are not limited to power-seeking behaviors\\nand the development of situational awareness, factors that necessitate meticulous evaluation\\nto safeguard against unforeseen challenges.\\nSection 6, titled “Specialized LLMs Evaluation”, serves as an extension of LLMs evaluation\\nparadigm into diverse specialized domains. Within this section, we turn our attention to the\\nevaluation of LLMs specifically tailored for application in distinct domains. Our selection\\nencompasses currently prominent specialized LLMs spanning fields such as biology, education,\\nlaw, computer science, and finance. The objective here is to systematically assess their\\naptitude and limitations when confronted with domain-specific challenges and intricacies.\\nSection 7, denominated “Evaluation Organization”, serves as a comprehensive introduction\\nto the prevalent benchmarks and methodologies employed in the evaluation of LLMs. In light\\nof the rapid proliferation of LLMs, users are confronted with the challenge of identifying the\\nmost apt models to meet their specific requirements while minimizing the scope of evaluations.\\nIn this context, we present an overview of well-established and widely recognized benchmark\\n7'] | \n", " 0.750000 | \n", " The retrieved context does match the subject matter of the user's query. It discusses the concerns that need to be addressed before deploying LLMs within specialized domains. The two main concerns mentioned are the alignment evaluation, which includes ethical considerations, moral implications, bias detection, toxicity assessment, and truthfulness evaluation, and the safety evaluation, which includes the robustness of LLMs and their evaluation in the context of Artificial General Intelligence (AGI). \n", "\n", "However, the context does not provide a full answer to the user's query. 
While it does mention the two main concerns, it does not go into detail about why these concerns need to be addressed before deploying LLMs within specialized domains. The context provides a general overview of the concerns, but it does not specifically tie these concerns to the deployment of LLMs within specialized domains. \n", "\n", "[RESULT] 3.0 | \n", "
\n", " \n", " 12 | \n", " baseline | \n", " In the \"Alignment Evaluation\" section, what are some of the dimensions that are assessed to mitigate potential risks associated with LLMs? | \n", " None | \n", " ['This survey systematically elaborates on the core capabilities of LLMs, encompassing critical\\naspects like knowledge and reasoning. Furthermore, we delve into alignment evaluation and\\nsafety evaluation, including ethical concerns, biases, toxicity, and truthfulness, to ensure the\\nsafe, trustworthy and ethical application of LLMs. Simultaneously, we explore the potential\\napplications of LLMs across diverse domains, including biology, education, law, computer\\nscience, and finance. Most importantly, we provide a range of popular benchmark evaluations\\nto assist researchers, developers and practitioners in understanding and evaluating LLMs’\\nperformance.\\nWe anticipate that this survey would drive the development of LLMs evaluations, offering\\nclear guidance to steer the controlled advancement of these models. This will enable LLMs\\nto better serve the community and the world, ensuring their applications in various domains\\nare safe, reliable, and beneficial. 
With eager anticipation, we embrace the future challenges\\nof LLMs’ development and evaluation.\\n58', 'Question \\nAnsweringTool \\nLearning\\nReasoning\\nKnowledge \\nCompletionEthics \\nand \\nMorality Bias\\nToxicity\\nTruthfulnessRobustnessEvaluation\\nRisk \\nEvaluation\\nBiology and \\nMedicine\\nEducationLegislationComputer \\nScienceFinance\\nBenchmarks for\\nHolistic Evaluation\\nBenchmarks \\nforKnowledge and Reasoning\\nBenchmarks \\nforNLU and NLGKnowledge and Capability\\nLarge Language \\nModel EvaluationAlignment Evaluation\\nSafety\\nSpecialized LLMs\\nEvaluation Organization\\n…Figure 1: Our proposed taxonomy of major categories and sub-categories of LLM evaluation.\\nOur survey expands the scope to synthesize findings from both capability and alignment\\nevaluations of LLMs. By complementing these previous surveys through an integrated\\nperspective and expanded scope, our work provides a comprehensive overview of the current\\nstate of LLM evaluation research. The distinctions between our survey and these two related\\nworks further highlight the novel contributions of our study to the literature.\\n2 Taxonomy and Roadmap\\nThe primary objective of this survey is to meticulously categorize the evaluation of LLMs,\\nfurnishing readers with a well-structured taxonomy framework. Through this framework,\\nreaders can gain a nuanced understanding of LLMs’ performance and the attendant challenges\\nacross diverse and pivotal domains.\\nNumerous studies posit that the bedrock of LLMs’ capabilities resides in knowledge and\\nreasoning, serving as the underpinning for their exceptional performance across a myriad of\\ntasks. Nonetheless, the effective application of these capabilities necessitates a meticulous\\nexamination of alignment concerns to ensure that the model’s outputs remain consistent with\\nuser expectations. 
Moreover, the vulnerability of LLMs to malicious exploits or inadvertent\\nmisuse underscores the imperative nature of safety considerations. Once alignment and safety\\nconcerns have been addressed, LLMs can be judiciously deployed within specialized domains,\\ncatalyzing task automation and facilitating intelligent decision-making. Thus, our overarching\\n6'] | \n", " 0.750000 | \n", " 1. The retrieved context does match the subject matter of the user's query. The user's query is about the dimensions assessed in the \"Alignment Evaluation\" section to mitigate potential risks associated with LLMs (Large Language Models). The context talks about the evaluation of LLMs, including alignment evaluation and safety evaluation. It mentions aspects like knowledge and reasoning, ethical concerns, biases, toxicity, and truthfulness. These are some of the dimensions that could be assessed to mitigate potential risks associated with LLMs. So, the context is relevant to the query. (2/2)\n", "\n", "2. However, the retrieved context does not provide a full answer to the user's query. While it mentions some dimensions that could be assessed in alignment evaluation (like knowledge and reasoning, ethical concerns, biases, toxicity, and truthfulness), it does not explicitly state that these are the dimensions assessed to mitigate potential risks associated with LLMs. The context does not provide a comprehensive list of dimensions or explain how these dimensions help mitigate risks. Therefore, the context cannot be used exclusively to provide a full answer to the user's query. (1/2)\n", "\n", "[RESULT] 3.0 | \n", "
\n", " \n", " 14 | \n", " baseline | \n", " What is the purpose of evaluating the knowledge and capability of LLMs? | \n", " None | \n", " ['objective is to delve into evaluations encompassing these five fundamental domains and their\\nrespective subdomains, as illustrated in Figure 1.\\nSection 3, titled “Knowledge and Capability Evaluation”, centers on the comprehensive\\nassessment of the fundamental knowledge and reasoning capabilities exhibited by LLMs. This\\nsection is meticulously divided into four distinct subsections: Question-Answering, Knowledge\\nCompletion, Reasoning, and Tool Learning. Question-answering and knowledge completion\\ntasks stand as quintessential assessments for gauging the practical application of knowledge,\\nwhile the various reasoning tasks serve as a litmus test for probing the meta-reasoning and\\nintricate reasoning competencies of LLMs. Furthermore, the recently emphasized special\\nability of tool learning is spotlighted, showcasing its significance in empowering models to\\nadeptly handle and generate domain-specific content.\\nSection 4, designated as “Alignment Evaluation”, hones in on the scrutiny of LLMs’ perfor-\\nmance across critical dimensions, encompassing ethical considerations, moral implications,\\nbias detection, toxicity assessment, and truthfulness evaluation. The pivotal aim here is to\\nscrutinize and mitigate the potential risks that may emerge in the realms of ethics, bias,\\nand toxicity, as LLMs can inadvertently generate discriminatory, biased, or offensive content.\\nFurthermore, this section acknowledges the phenomenon of hallucinations within LLMs, which\\ncan lead to the inadvertent dissemination of false information. 
As such, an indispensable\\nfacet of this evaluation involves the rigorous assessment of truthfulness, underscoring its\\nsignificance as an essential aspect to evaluate and rectify.\\nSection 5, titled “Safety Evaluation”, embarks on a comprehensive exploration of two funda-\\nmental dimensions: the robustness of LLMs and their evaluation in the context of Artificial\\nGeneral Intelligence (AGI). LLMs are routinely deployed in real-world scenarios, where their\\nrobustness becomes paramount. Robustness equips them to navigate disturbances stemming\\nfrom users and the environment, while also shielding against malicious attacks and deception,\\nthereby ensuring consistent high-level performance. Furthermore, as LLMs inexorably ad-\\nvance toward human-level capabilities, the evaluation expands its purview to encompass more\\nprofound security concerns. These include but are not limited to power-seeking behaviors\\nand the development of situational awareness, factors that necessitate meticulous evaluation\\nto safeguard against unforeseen challenges.\\nSection 6, titled “Specialized LLMs Evaluation”, serves as an extension of LLMs evaluation\\nparadigm into diverse specialized domains. Within this section, we turn our attention to the\\nevaluation of LLMs specifically tailored for application in distinct domains. Our selection\\nencompasses currently prominent specialized LLMs spanning fields such as biology, education,\\nlaw, computer science, and finance. The objective here is to systematically assess their\\naptitude and limitations when confronted with domain-specific challenges and intricacies.\\nSection 7, denominated “Evaluation Organization”, serves as a comprehensive introduction\\nto the prevalent benchmarks and methodologies employed in the evaluation of LLMs. 
In light\\nof the rapid proliferation of LLMs, users are confronted with the challenge of identifying the\\nmost apt models to meet their specific requirements while minimizing the scope of evaluations.\\nIn this context, we present an overview of well-established and widely recognized benchmark\\n7', 'evaluations. This serves the purpose of aiding users in making judicious and well-informed\\ndecisions when selecting an appropriate LLM for their particular needs.\\nPlease be aware that our taxonomy framework does not purport to comprehensively encompass\\nthe entirety of the evaluation landscape. In essence, our aim is to address the following\\nfundamental questions:\\n•What are the capabilities of LLMs?\\n•What factors must be taken into account when deploying LLMs?\\n•In which domains can LLMs find practical applications?\\n•How do LLMs perform in these diverse domains?\\nWe will now embark on an in-depth exploration of each category within the LLM evaluation\\ntaxonomy, sequentially addressing capabilities, concerns, applications, and performance.\\n3 Knowledge and Capability Evaluation\\nEvaluating the knowledge and capability of LLMs has become an important research area as\\nthese models grow in scale and capability. As LLMs are deployed in more applications, it is\\ncrucial to rigorously assess their strengths and limitations across a diverse range of tasks and\\ndatasets. In this section, we aim to offer a comprehensive overview of the evaluation methods\\nand benchmarks pertinent to LLMs, spanning various capabilities such as question answering,\\nknowledge completion, reasoning, and tool use. 
Our objective is to provide an exhaustive\\nsynthesis of the current advancements in the systematic evaluation and benchmarking of\\nLLMs’ knowledge and capabilities, as illustrated in Figure 2.\\n3.1 Question Answering\\nQuestion answering is a very important means for LLMs evaluation, and the question answering\\nability of LLMs directly determines whether the final output can meet the expectation. At\\nthe same time, however, since any form of LLMs evaluation can be regarded as question\\nanswering or transfer to question answering form, there are rare datasets and works that\\npurely evaluate question answering ability of LLMs. Most of the datasets are curated to\\nevaluate other capabilities of LLMs.\\nTherefore, we believe that the datasets simply used to evaluate the question answering ability\\nof LLMs must be from a wide range of sources, preferably covering all fields rather than\\naiming at some fields, and the questions do not need to be very professional but general.\\nAccording to the above criteria for datasets focusing on question answering capability, we can\\nfind that many datasets are qualified, e.g., SQuAD (Rajpurkar et al., 2016), NarrativeQA\\n(Kociský et al., 2018), HotpotQA (Yang et al., 2018), CoQA (Reddy et al., 2019). Although\\nthese datasets predate LLMs, they can still be used to evaluate the question answering ability\\nof LLMs. Kwiatkowski et al. (2019) present the Natural Questions corpus. The questions\\n8'] | \n", " 0.750000 | \n", " The retrieved context is relevant to the user's query as it discusses the purpose of evaluating the knowledge and capability of LLMs (Large Language Models). It explains that the evaluation is important to assess their strengths and limitations across a diverse range of tasks and datasets. The context also mentions the different aspects of LLMs that are evaluated, such as question answering, knowledge completion, reasoning, and tool use. \n", "\n", "However, the context does not fully answer the user's query. 
While it does provide a general idea of why LLMs are evaluated, it does not delve into the specific purpose of these evaluations. For instance, it does not explain how these evaluations can help improve the performance of LLMs, or how they can be used to identify areas where LLMs may need further development or training.\n", "\n", "[RESULT] 3.0 | \n", "
\n", " \n", "