Explanation Faithfulness Evaluation Measures Ontology

As artificial intelligence systems, especially large language models, gain popularity, there is an increased desire to trust these systems. One common way to increase trust is to provide explanations: information about the system's workings and the knowledge used in its general reasoning process or in the process behind a specific decision. However, ensuring that these explanations are faithful, that is, that they accurately represent the reasoning process, is a difficult and ongoing task. Over 60 measures have been proposed in the last 5 years to quantify how faithful an explanation is, but it is difficult to compare them and decide which one is most appropriate for a given use case. This ontology provides a structured representation of faithfulness measures that makes such comparisons easier. It documents the inferences that researchers use to evaluate and categorize these measures, so that researchers proposing new measures can communicate the benefits of their ideas with a shared vocabulary. Ideally, it will also serve as the schema for a comprehensive knowledge graph of measures that powers a recommendation system for AI explainability researchers.

Creator: Danielle Villa (https://tw.rpi.edu/person/danielle-villa)

Explanation: A class defined as "when X happens, then, due to a given set of circumstances C, Y will occur because of a given law L". To be complete, an explanation needs at least one antecedent event (the explanans), a posterior event, a context that relates the two events, and a governing law (theory).

Assumption: A fact or statement taken for granted. (Source: Merriam-Webster, "assumption", definition 3b, https://www.merriam-webster.com/dictionary/assumption)

Axiomatic Evaluation Method: An evaluation method that treats certain principles as necessary conditions for faithfulness and tests whether an explanation satisfies them. (Source: Lyu, Q., Apidianaki, M., & Callison-Burch, C. (2024). Towards faithful model explanation in NLP: A survey. Computational Linguistics, 50(2), 657-723.)

Binary Faithfulness Measure: A faithfulness measure that determines whether an explanation is faithful or not. Also known as: Faithfulness Test. (Source: Jacovi, A., & Goldberg, Y. (2020). Towards faithfully interpretable NLP systems: How should we define and evaluate faithfulness? arXiv preprint arXiv:2004.03685.)

Collection of Explanations: A collection of explanations to be evaluated for a single faithfulness value.

Computational-based Evaluation Method: An evaluation method that does not rely on human assessment. Also known as: Quantitative Evaluation Method. (Source: Alangari, N., El Bachir Menai, M., Mathkour, H., & Almosallam, I. (2023). Exploring evaluation methods for interpretable machine learning: A survey. Information, 14(8), 469.)
Corollary: An assumption inferred immediately from another assumption with little or no additional proof. (Source: Merriam-Webster, "corollary", definition 1, https://www.merriam-webster.com/dictionary/corollary)

Evaluation Method: The technique used to determine the faithfulness of an explanation or explanations. (Source: Merriam-Webster, "method", definition 1, https://www.merriam-webster.com/dictionary/method)

Faithfulness Measure: A technique for quantifying whether an explanation accurately represents the reasoning process behind a model's prediction. Also known as: Faithfulness Metric, Accountability Measure, Descriptive Accuracy Measure, Explanation Transparency Measure, Fidelity Measure. (Sources: Agarwal, C., Tanneru, S. H., & Lakkaraju, H. (2024). Faithfulness vs. plausibility: On the (un)reliability of explanations from large language models. arXiv preprint arXiv:2402.04614; Jacovi, A., & Goldberg, Y. (2020). Towards faithfully interpretable NLP systems: How should we define and evaluate faithfulness? arXiv preprint arXiv:2004.03685.)

Faithfulness Measure Identifier: An identifier that is associated with a faithfulness measure.

Faithfulness Proxy: A property used instead of faithfulness when faithfulness cannot be measured directly. Because faithfulness cannot currently be measured, this specifies the alternate property that is used as a substitute property or necessary condition for faithfulness. (Sources: Agarwal, C., Tanneru, S. H., & Lakkaraju, H. (2024). Faithfulness vs. plausibility: On the (un)reliability of explanations from large language models. arXiv preprint arXiv:2402.04614; Oxford Reference, "proxy variable", https://www.oxfordreference.com/display/10.1093/oi/authority.20110803100351624)

Feature-based Modality: An explanation modality that claims which parts of the input are more important than others to the model's decision. Also known as: Feature-Attribution Modality, Heat-Map Modality, Token Importance Modality, Input Saliency Modality, Salience Map Modality, Extractive Rationale Modality. (Source: Jacovi, A., & Goldberg, Y. (2020). Towards faithfully interpretable NLP systems: How should we define and evaluate faithfulness? arXiv preprint arXiv:2004.03685.)

Graded Faithfulness Measure: A faithfulness measure that determines the extent and likelihood of an explanation being faithful. (Source: Jacovi, A., & Goldberg, Y. (2020). Towards faithfully interpretable NLP systems: How should we define and evaluate faithfulness? arXiv preprint arXiv:2004.03685.)

Granularity: The level of specificity of a faithfulness determination. Faithfulness measures may determine a faithfulness score at different levels: a single explanation, an entire dataset of explanations, or all explanations produced by a model. Also known as: Scope. (Sources: Leiter, C., Lertvittayakumjorn, P., Fomicheva, M., Zhao, W., Gao, Y., & Eger, S. (2024). Towards explainable evaluation metrics for machine translation. Journal of Machine Learning Research, 25(75), 1-49; Alangari, N., El Bachir Menai, M., Mathkour, H., & Almosallam, I. (2023). Exploring evaluation methods for interpretable machine learning: A survey. Information, 14(8), 469.)

Human-based Evaluation Method: An evaluation method that relies on human assessment. Also known as: Qualitative Evaluation Method. (Source: Alangari, N., El Bachir Menai, M., Mathkour, H., & Almosallam, I. (2023). Exploring evaluation methods for interpretable machine learning: A survey. Information, 14(8), 469.)
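The dimensions defined so far (measure, proxy, evaluation method, modality, granularity) and those defined in later entries (model specificity, access level, assumptions) together characterize a single measure. The Python sketch below is a rough illustration of how one measure could be catalogued along these dimensions; the class, field names, and example values are hypothetical and are not part of the ontology itself.

from dataclasses import dataclass, field

# Illustrative sketch only: field names and example values are hypothetical,
# chosen to mirror the ontology's dimensions described in these entries.
@dataclass
class MeasureRecord:
    identifier: str                      # Faithfulness Measure Identifier
    measures_proxy: str                  # Faithfulness Proxy the measure stands in for
    uses_method: str                     # Evaluation Method
    evaluates_modality: str              # Explanation Modality it can evaluate
    granularity: str                     # "local" or "global" scope
    model_specificity: str               # e.g. "model agnostic", "transformer specific"
    requires_access_level: str           # Model Access Level
    makes_assumptions: set = field(default_factory=set)
    graded: bool = True                  # graded score vs. binary faithful/unfaithful

# A hypothetical record for an erasure-style attribution measure.
example = MeasureRecord(
    identifier="ex:sufficiency-v1",
    measures_proxy="Feature Importance Agreement Proxy",
    uses_method="Erasure Method",
    evaluates_modality="Feature-based Modality",
    granularity="local",
    model_specificity="model agnostic",
    requires_access_level="Inference Access Level",
    makes_assumptions={"Linearity Assumption"},
)
print(example.identifier, example.uses_method)

Records of this shape are only a stand-in for what the intended knowledge graph of measures would store.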
Model Access Level: The amount of model internals that must be available for a faithfulness measure to function. (Sources: Leiter, C., Lertvittayakumjorn, P., Fomicheva, M., Zhao, W., Gao, Y., & Eger, S. (2024). Towards explainable evaluation metrics for machine translation. Journal of Machine Learning Research, 25(75), 1-49; Lyu, Q., Apidianaki, M., & Callison-Burch, C. (2024). Towards faithful model explanation in NLP: A survey. Computational Linguistics, 50(2), 657-723.)

Model Specific: The model specificity of targeting particular classes of models. (Source: Alangari, N., El Bachir Menai, M., Mathkour, H., & Almosallam, I. (2023). Exploring evaluation methods for interpretable machine learning: A survey. Information, 14(8), 469.)

Model Specificity: How particular a faithfulness measure is to the architecture of a model. This property describes whether or not the faithfulness measure is specific to the model's architecture. (Source: Alangari, N., El Bachir Menai, M., Mathkour, H., & Almosallam, I. (2023). Exploring evaluation methods for interpretable machine learning: A survey. Information, 14(8), 469.)

Natural Language Modality: An explanation modality that is generated text. Also known as: NLE. (Source: Luo, S., Ivison, H., Han, S. C., & Poon, J. (2024). Local interpretations for explainable natural language processing: A survey. ACM Computing Surveys, 56(9), 1-36.)

Perturbation-based Evaluation Method: An evaluation method that perturbs parts of the input and observes the change in the output. This differs from robustness evaluation: robustness considers extremely similar inputs and expects the explanation to be similar, whereas perturbation-based evaluation considers inputs that are not necessarily similar, and the expectation of how the explanation should change depends on which parts of the input are perturbed. Axiom: every instance of this class makes_assumption Linearity_Assumption. A rough sketch appears after this group of entries. (Source: Lyu, Q., Apidianaki, M., & Callison-Burch, C. (2024). Towards faithful model explanation in NLP: A survey. Computational Linguistics, 50(2), 657-723.)

Predictive Power Evaluation Method: An evaluation method that uses the explanation to predict model decisions on unseen examples and treats higher accuracy as an indicator of higher faithfulness. Also known as: Simulation Evaluation Method. (Source: Lyu, Q., Apidianaki, M., & Callison-Burch, C. (2024). Towards faithful model explanation in NLP: A survey. Computational Linguistics, 50(2), 657-723.)

Robustness Evaluation Method: An evaluation method that measures whether the explanation is stable against subtle changes in the input examples. Axiom: every instance of this class makes_assumption Prediction_Assumption. (Source: Lyu, Q., Apidianaki, M., & Callison-Burch, C. (2024). Towards faithful model explanation in NLP: A survey. Computational Linguistics, 50(2), 657-723.)
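As a minimal sketch of a perturbation-based check (not any specific published measure), the following masks the features an explanation ranks as most important and records how much the model's output changes. The toy model, mask token, and data are hypothetical stand-ins; in practice the model would be an NLP classifier and the attribution would come from the explanation being evaluated.

from typing import Callable, Sequence

def probability_drop(model: Callable[[Sequence[str]], float],
                     tokens: Sequence[str],
                     attribution: Sequence[float],
                     k: int = 2,
                     mask: str = "[MASK]") -> float:
    """Mask the k tokens the explanation ranks most important and
    return how much the model's confidence in its prediction drops."""
    original = model(tokens)
    top_k = sorted(range(len(tokens)), key=lambda i: attribution[i], reverse=True)[:k]
    perturbed = [mask if i in top_k else t for i, t in enumerate(tokens)]
    return original - model(perturbed)

# Toy model: confidence grows with the number of positive words it sees.
positive = {"great", "love"}
toy_model = lambda toks: sum(t in positive for t in toks) / max(len(toks), 1)

tokens = ["i", "love", "this", "great", "movie"]
attribution = [0.0, 0.9, 0.1, 0.8, 0.2]                   # explanation under evaluation
print(probability_drop(toy_model, tokens, attribution))   # larger drop -> more faithful

Under the Linearity Assumption noted above, a large confidence drop when masking highly attributed tokens is read as evidence that the explanation tracks what the model actually relied on.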
White-box Evaluation Method: An evaluation method that relies on known ground-truth explanations, against which a candidate explanation can be compared; the ground-truth explanations come from either transparent tasks or transparent models. A rough sketch appears after this group of entries. (Source: Lyu, Q., Apidianaki, M., & Callison-Burch, C. (2024). Towards faithful model explanation in NLP: A survey. Computational Linguistics, 50(2), 657-723.)

Access the Classifier Method: A human-based evaluation method designed to assess whether the interpretation provides sufficient information to understand the classifier's logic. (Source: Alangari, N., El Bachir Menai, M., Mathkour, H., & Almosallam, I. (2023). Exploring evaluation methods for interpretable machine learning: A survey. Information, 14(8), 469.)

Attention Map Explanation Modality: A feature-based modality that uses the attention mechanism(s) in a model. (Source: Bibal, A., Cardon, R., Alfter, D., Wilkens, R., Wang, X., François, T., & Watrin, P. (2022, May). Is attention explanation? An introduction to the debate. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 3889-3900).)

Attention Weight Access Level: The model's attention weights or attention mechanism must be available to the faithfulness measure. (Source: Leiter, C., Lertvittayakumjorn, P., Fomicheva, M., Zhao, W., Gao, Y., & Eger, S. (2024). Towards explainable evaluation metrics for machine translation. Journal of Machine Learning Research, 25(75), 1-49.)

Completeness Proxy: A faithfulness proxy that states that the explanation should describe the entire dynamic of the model. (Source: Zhou, J., Gandomi, A. H., Chen, F., & Holzinger, A. (2021). Evaluating the quality of machine learning explanations: A survey on methods and metrics. Electronics, 10(5), 593.)

Completeness to Model Evaluation Proxy: A faithfulness proxy that states that a surrogate model should closely approximate the original model it explains. (Source: Gilpin, L. H., Bau, D., Yuan, B. Z., Bajwa, A., Specter, M., & Kagal, L. (2018, October). Explaining explanations: An overview of interpretability of machine learning. In 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA) (pp. 80-89). IEEE.)

Comprehensive Evaluation Method: A computational-based evaluation method that assesses how easily we can inspect and understand a model. (Sources: Craven, M. W. (1996). Extracting comprehensible models from trained neural networks. The University of Wisconsin-Madison; Alangari, N., El Bachir Menai, M., Mathkour, H., & Almosallam, I. (2023). Exploring evaluation methods for interpretable machine learning: A survey. Information, 14(8), 469.)
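For the White-box Evaluation Method above, one simple way to compare a candidate explanation against a known ground-truth rationale is token-level overlap. The sketch below uses F1 as the agreement statistic; this choice is illustrative, since published measures use a variety of agreement statistics, and all names here are hypothetical.

from typing import Sequence

def rationale_f1(candidate: Sequence[int], ground_truth: Sequence[int]) -> float:
    """Token-level F1 between the token indices a candidate explanation selects
    and the indices of a known ground-truth rationale."""
    cand, gold = set(candidate), set(ground_truth)
    if not cand or not gold:
        return 0.0
    precision = len(cand & gold) / len(cand)
    recall = len(cand & gold) / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(rationale_f1(candidate=[1, 3, 4], ground_truth=[1, 3]))  # 0.8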
Continuity Proxy: A faithfulness proxy that states that a small input change should result in a small explanation change. (Source: Nejadgholi, I., Omidyeganeh, M., Drouin, M. A., & Boisvert, J. (2025). A taxonomy for design and evaluation of prompt-based natural language explanations. arXiv preprint arXiv:2507.10585.)

Contrastivity Proxy: A faithfulness proxy that states that an explanation should highlight differences from alternative outcomes. (Source: Nejadgholi, I., Omidyeganeh, M., Drouin, M. A., & Boisvert, J. (2025). A taxonomy for design and evaluation of prompt-based natural language explanations. arXiv preprint arXiv:2507.10585.)

Correctness Evaluation Method: A computational-based evaluation method that assesses the ability of the explanations to reflect the behavior of the prediction model. Also known as: Fidelity Evaluation. (Sources: Thomas, F., & David, V. (2022). Representativity and consistency measures for deep neural network explanations. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV); Alangari, N., El Bachir Menai, M., Mathkour, H., & Almosallam, I. (2023). Exploring evaluation methods for interpretable machine learning: A survey. Information, 14(8), 469.)

Chain of Thought Explanation Modality: A natural language modality that consists of a sequence of intermediate thoughts or steps that lead to the final decision or response of an LLM. Also known as: CoT. (Source: Agarwal, C., Tanneru, S. H., & Lakkaraju, H. (2024). Faithfulness vs. plausibility: On the (un)reliability of explanations from large language models. arXiv preprint arXiv:2402.04614.)

Embedding Access Level: The model's embedding space must be available to the faithfulness measure. (Source: Leiter, C., Lertvittayakumjorn, P., Fomicheva, M., Zhao, W., Gao, Y., & Eger, S. (2024). Towards explainable evaluation metrics for machine translation. Journal of Machine Learning Research, 25(75), 1-49.)

Erasure Method: A computational-based evaluation method where parts of the input are erased in the expectation that the model's decision will or will not change, based on the explanation. (Source: Jacovi, A., & Goldberg, Y. (2020). Towards faithfully interpretable NLP systems: How should we define and evaluate faithfulness? arXiv preprint arXiv:2004.03685.)
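A minimal sketch of an erasure-style check follows: erase the tokens the explanation marks as important and test whether the predicted label changes. Unlike the confidence-drop sketch earlier, this one tests for a decision flip. The toy classifier and data are hypothetical, not a specific published measure.

from typing import Callable, Sequence

def decision_flips(predict_label: Callable[[Sequence[str]], int],
                   tokens: Sequence[str],
                   important: Sequence[int]) -> bool:
    """Return True if erasing the explanation's important tokens flips the decision."""
    before = predict_label(tokens)
    reduced = [t for i, t in enumerate(tokens) if i not in set(important)]
    after = predict_label(reduced)
    return before != after

# Toy classifier: label 1 if more positive than negative words remain.
positive, negative = {"great", "love"}, {"boring", "bad"}
label = lambda toks: int(sum(t in positive for t in toks) > sum(t in negative for t in toks))

tokens = ["love", "this", "great", "but", "boring"]
print(decision_flips(label, tokens, important=[0, 2]))  # erasing "love" and "great" flips 1 -> 0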
Evaluated by Method: Specifies an evaluation method that evaluates the faithfulness proxy.

Evaluates Faithfulness of Explanations: Specifies the collection of explanations that a faithfulness measure uses to determine faithfulness.

Evaluates Modality: Specifies the modality of the explanations whose faithfulness a faithfulness measure can determine.

Evaluates Proxy: Specifies a faithfulness proxy evaluated by the evaluation method.

Faithfulness Document: A document that introduces a faithfulness measure.

Feature Importance Agreement Proxy: A faithfulness proxy that states that input tokens that are important (resp. unimportant) for label prediction should also be important (resp. unimportant) for explanation generation. (Source: Lyu, Q., Apidianaki, M., & Callison-Burch, C. (2024). Towards faithful model explanation in NLP: A survey. Computational Linguistics, 50(2), 657-723.)

Find-Alignment Method: A human-based evaluation method designed to evaluate how close the explanation is to human reasoning. (Source: Alangari, N., El Bachir Menai, M., Mathkour, H., & Almosallam, I. (2023). Exploring evaluation methods for interpretable machine learning: A survey. Information, 14(8), 469.)

Full Access Level: The entire model and its internals must be available to the faithfulness measure. This level of access should only be used for faithfulness measures that evaluate self-explainable systems. (Source: Leiter, C., Lertvittayakumjorn, P., Fomicheva, M., Zhao, W., Gao, Y., & Eger, S. (2024). Towards explainable evaluation metrics for machine translation. Journal of Machine Learning Research, 25(75), 1-49.)

Global Scope: The granularity level of specifying the faithfulness of an entire model or dataset. Also known as: Model-Understanding Scope. (Source: Alangari, N., El Bachir Menai, M., Mathkour, H., & Almosallam, I. (2023). Exploring evaluation methods for interpretable machine learning: A survey. Information, 14(8), 469.)

Gradient Access Level: The model's gradient values must be available to the faithfulness measure. (Source: Leiter, C., Lertvittayakumjorn, P., Fomicheva, M., Zhao, W., Gao, Y., & Eger, S. (2024). Towards explainable evaluation metrics for machine translation. Journal of Machine Learning Research, 25(75), 1-49.)

Gradient-based Explanation Modality: A feature-based modality that uses a backwards pass through the model, using the computed gradient values, to determine the importance of each input token. (Source: Bastings, J., & Filippova, K. (2020). The elephant in the interpretability room: Why use attention as explanation when we have saliency methods? arXiv preprint arXiv:2010.05607.)
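The object properties in this group (Measures Proxy, Uses Method, Evaluates Modality, Has Granularity, Requires Access Level, and related) connect a measure to the rest of the ontology. The rdflib sketch below shows roughly how a single measure instance might be described with such properties. The namespace IRI and all local names here are hypothetical placeholders; the ontology's actual IRIs are not reproduced in this document.

from rdflib import Graph, Namespace, RDF, Literal

FMO = Namespace("http://example.org/faithfulness#")   # placeholder namespace
g = Graph()
g.bind("fmo", FMO)

m = FMO.ExampleMeasure
g.add((m, RDF.type, FMO.GradedFaithfulnessMeasure))
g.add((m, FMO.measuresProxy, FMO.FeatureImportanceAgreementProxy))
g.add((m, FMO.usesMethod, FMO.ErasureMethod))
g.add((m, FMO.evaluatesModality, FMO.FeatureBasedModality))
g.add((m, FMO.hasGranularity, FMO.LocalScope))
g.add((m, FMO.requiresAccessLevel, FMO.InferenceAccessLevel))
g.add((m, FMO.hasIdentifier, Literal("example-measure-v1")))

print(g.serialize(format="turtle"))

(Local Scope and Inference Access Level are defined in the entries that follow.)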
Has Granularity: Specifies the level of granularity at which a faithfulness measure operates.

Has Model Specificity: Specifies how specific the model must be for the faithfulness measure to apply.

Higher Access Level: Specifies that this model access level requires more detailed access than another model access level. Partial ordering. A rough sketch appears after this group of entries.

Human-Annotation Method: A human-based evaluation method which verifies the validity of the attribution scores by comparing them with the human problem-solving process. (Source: Ju, Y., Zhang, Y., Yang, Z., Jiang, Z., Liu, K., & Zhao, J. (2022, May). Logic traps in evaluating attribution scores. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 5911-5922).)

Inference Access Level: The model must be available to run inferences for the faithfulness measure. Also known as: API inference-only access to the model. (Source: Leiter, C., Lertvittayakumjorn, P., Fomicheva, M., Zhao, W., Gao, Y., & Eger, S. (2024). Towards explainable evaluation metrics for machine translation. Journal of Machine Learning Research, 25(75), 1-49.)

Inferred from Assumption: Specifies the assumption that the corollary naturally follows from. (Source: Merriam-Webster, "corollary", definition 1, https://www.merriam-webster.com/dictionary/corollary)

Input Sensitivity Proxy: A faithfulness proxy that states that the explanations should be sensitive (resp. insensitive) to changes in the input that influence (resp. do not influence) the prediction. (Source: Lyu, Q., Apidianaki, M., & Callison-Burch, C. (2024). Towards faithful model explanation in NLP: A survey. Computational Linguistics, 50(2), 657-723.)

Linearity Assumption: The assumption that certain parts of the input are more important to the model's reasoning than others, and that the contributions of different parts of the input are independent of each other. (Source: Jacovi, A., & Goldberg, Y. (2020). Towards faithfully interpretable NLP systems: How should we define and evaluate faithfulness? arXiv preprint arXiv:2004.03685.)

Linearity Corollary: The corollary to the Linearity Assumption that, under certain circumstances, heatmap interpretations can be faithful. (Source: Jacovi, A., & Goldberg, Y. (2020). Towards faithfully interpretable NLP systems: How should we define and evaluate faithfulness? arXiv preprint arXiv:2004.03685.)

Local Scope: The granularity level of specifying the faithfulness of a single explanation. Also known as: Explanation-Understanding Scope. (Source: Alangari, N., El Bachir Menai, M., Mathkour, H., & Almosallam, I. (2023). Exploring evaluation methods for interpretable machine learning: A survey. Information, 14(8), 469.)

Logic Trap 1: The assumption that the decision-making process of neural networks is equal to the decision-making process of humans. (Source: Ju, Y., Zhang, Y., Yang, Z., Jiang, Z., Liu, K., & Zhao, J. (2022, May). Logic traps in evaluating attribution scores. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 5911-5922).)
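The Higher/Lower Access Level properties impose only a partial ordering over access levels: some levels are comparable, others are not. The sketch below makes that concrete. The specific chain shown (inference below attention weights, embeddings, or gradients, which sit below full access) is an illustrative assumption; the ontology itself provides only the higher/lower properties, not this exact chain.

# STRICTLY_ABOVE[x] = set of levels that require strictly more detailed access than x.
STRICTLY_ABOVE = {
    "No Access": {"Inference", "Attention Weights", "Embeddings", "Gradients", "Full"},
    "Inference": {"Attention Weights", "Embeddings", "Gradients", "Full"},
    "Attention Weights": {"Full"},
    "Embeddings": {"Full"},
    "Gradients": {"Full"},
    "Full": set(),
}

def requires_more_access(a: str, b: str) -> bool:
    """True if level `a` requires strictly more detailed access than level `b`."""
    return a in STRICTLY_ABOVE.get(b, set())

print(requires_more_access("Full", "Inference"))        # True
print(requires_more_access("Embeddings", "Gradients"))  # False: incomparable levels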
Lower Access Level: Specifies that this model access level requires less detailed access than another model access level. Partial ordering.

LSTM Specific: The model specificity of targeting LSTM-based models.

Makes Assumption: Specifies the assumption that is relied upon.

Meaningful Perturbation Method: A computational-based evaluation method which makes modifications to the input instances, in accordance with the generated attribution, and expects significant differences in the model's predictions. (Source: Ju, Y., Zhang, Y., Yang, Z., Jiang, Z., Liu, K., & Zhao, J. (2022, May). Logic traps in evaluating attribution scores. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 5911-5922).)

Measures Proxy: Specifies the faithfulness proxy that the faithfulness measure evaluates in place of faithfulness.

Minimality Proxy: A faithfulness proxy that states that an explanation should include only the smallest number of necessary factors. (Source: Lyu, Q., Apidianaki, M., & Callison-Burch, C. (2024). Towards faithful model explanation in NLP: A survey. Computational Linguistics, 50(2), 657-723.)

Model Agnostic: The model specificity of being applicable to any black-box model regardless of its internal components. (Source: Alangari, N., El Bachir Menai, M., Mathkour, H., & Almosallam, I. (2023). Exploring evaluation methods for interpretable machine learning: A survey. Information, 14(8), 469.)

Model Assumption: The assumption that two models will make the same predictions if and only if they use the same reasoning process. (Source: Jacovi, A., & Goldberg, Y. (2020). Towards faithfully interpretable NLP systems: How should we define and evaluate faithfulness? arXiv preprint arXiv:2004.03685.)

Model Corollary 1: The corollary to the Model Assumption that an interpretation system is unfaithful if it results in different interpretations of models that make the same decisions. (Source: Jacovi, A., & Goldberg, Y. (2020). Towards faithfully interpretable NLP systems: How should we define and evaluate faithfulness? arXiv preprint arXiv:2004.03685.)

Model Corollary 2: The corollary to the Model Assumption that an interpretation is unfaithful if it results in different decisions than the model it interprets. (Source: Jacovi, A., & Goldberg, Y. (2020). Towards faithfully interpretable NLP systems: How should we define and evaluate faithfulness? arXiv preprint arXiv:2004.03685.)
Model Sensitivity Proxy: A faithfulness proxy that states that the explanations should be sensitive (resp. insensitive) to changes in the model that influence (resp. do not influence) the prediction. (Source: Lyu, Q., Apidianaki, M., & Callison-Burch, C. (2024). Towards faithful model explanation in NLP: A survey. Computational Linguistics, 50(2), 657-723.)

No Access Level: No model access is required for the faithfulness measure. (Source: Leiter, C., Lertvittayakumjorn, P., Fomicheva, M., Zhao, W., Gao, Y., & Eger, S. (2024). Towards explainable evaluation metrics for machine translation. Journal of Machine Learning Research, 25(75), 1-49.)

Occlusion-based Explanation Modality: A feature-based modality that computes input saliency by erasing input features and measuring how that affects the model. This includes many different methods of occlusion, including masking and removal. Also known as: Leave-one-out Explanation Modality. (Source: Bastings, J., & Filippova, K. (2020). The elephant in the interpretability room: Why use attention as explanation when we have saliency methods? arXiv preprint arXiv:2010.05607.)

Polarity Consistency Proxy: A faithfulness proxy that uses the consistency between importance-score-based explanations and the polarity of their impact on model predictions as a faithfulness score. If an explanation method assigns a positive weight to a feature as its contribution to some predicted label, then after removing this feature, the model's confidence in the label should be suppressed. A rough sketch appears after this group of entries. (Source: Lyu, Q., Apidianaki, M., & Callison-Burch, C. (2024). Towards faithful model explanation in NLP: A survey. Computational Linguistics, 50(2), 657-723.)

Prediction Assumption: The assumption that, on similar inputs, the model makes similar decisions if and only if its reasoning is similar. (Source: Jacovi, A., & Goldberg, Y. (2020). Towards faithfully interpretable NLP systems: How should we define and evaluate faithfulness? arXiv preprint arXiv:2004.03685.)

Prediction Corollary: The corollary to the Prediction Assumption that an interpretation system is unfaithful if it provides different interpretations for similar inputs and outputs. (Source: Jacovi, A., & Goldberg, Y. (2020). Towards faithfully interpretable NLP systems: How should we define and evaluate faithfulness? arXiv preprint arXiv:2004.03685.)
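A minimal sketch of a polarity-consistency check, following the description in the Polarity Consistency Proxy entry: for each feature, the sign of its attribution weight should agree with the direction in which removing that feature moves the model's confidence. The toy model, weights, and the simple agreement score are hypothetical illustrations.

from typing import Callable, Sequence

def polarity_consistency(confidence: Callable[[Sequence[str]], float],
                         tokens: Sequence[str],
                         weights: Sequence[float]) -> float:
    """Fraction of features whose attribution sign matches the confidence change
    observed when that single feature is removed."""
    base = confidence(tokens)
    agree = 0
    for i, w in enumerate(weights):
        reduced = [t for j, t in enumerate(tokens) if j != i]
        delta = base - confidence(reduced)   # > 0 means removal suppressed confidence
        agree += int((w > 0) == (delta > 0))
    return agree / len(weights)

positive, negative = {"great"}, {"boring"}
conf = lambda toks: 0.5 + 0.25 * (sum(t in positive for t in toks)
                                  - sum(t in negative for t in toks))

tokens = ["great", "but", "boring"]
weights = [0.7, 0.0, -0.6]                  # explanation under evaluation
print(polarity_consistency(conf, tokens, weights))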
Propagation-based Explanation Modality: A feature-based modality that starts with a forward pass to obtain the relevance and then redistributes that relevance among the inputs of each layer. (Source: Bastings, J., & Filippova, K. (2020). The elephant in the interpretability room: Why use attention as explanation when we have saliency methods? arXiv preprint arXiv:2010.05607.)

Requires Access Level: Specifies the model access level that the faithfulness measure requires.

Robustness Equivalence Proxy: A faithfulness proxy that states that the explanation and the predicted label should be equally robust (or non-robust) under noise. (Source: Lyu, Q., Apidianaki, M., & Callison-Burch, C. (2024). Towards faithful model explanation in NLP: A survey. Computational Linguistics, 50(2), 657-723.)

Robustness Proxy: A faithfulness proxy that states that an explanation should be invariant to small perturbations in the input. (Source: Jacovi, A., & Goldberg, Y. (2020). Towards faithfully interpretable NLP systems: How should we define and evaluate faithfulness? arXiv preprint arXiv:2004.03685.)

Self-Consistency Proxy: A faithfulness proxy that states that the model should produce explanations that never contradict each other. (Source: Parcalabescu, L., & Frank, A. (2023). On measuring faithfulness of natural language explanations. CoRR.)

Sensitivity to Iterative Masking Evaluation Method: An evaluation method that iteratively removes salient features and measures model responses. (Source: Crothers, E., Viktor, H., & Japkowicz, N. (2024, September). Robust infidelity: When faithfulness measures on masked language models are misleading. In International Conference on Machine Learning, Optimization, and Data Science (pp. 133-147). Cham: Springer Nature Switzerland.)

Similar Measure To: Specifies a measure that measures the same proxy, uses the same type of evaluation method, evaluates the same modality, has the same granularity level and the same model specificity, and shares at least one assumption and at least one access level. A rough sketch appears after this group of entries.

Simulatability Proxy: A faithfulness proxy that states that a human should be able to use the input data together with the model to reproduce every calculation step necessary to make the same prediction as the model. (Source: Chakraborty, S., Tomsett, R., Raghavendra, R., Harborne, D., Alzantot, M., Cerutti, F., ... & Gurram, P. (2017, August). Interpretability of deep learning models: A survey of results. In 2017 IEEE SmartWorld, Ubiquitous Intelligence & Computing, Advanced & Trusted Computed, Scalable Computing & Communications, Cloud & Big Data Computing, Internet of People and Smart City Innovation (SmartWorld/SCALCOM/UIC/ATC/CBDCom/IOP/SCI) (pp. 1-6). IEEE.)
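A rough sketch of the Similar Measure To test over two catalogue records, applying the criteria listed in that entry. The record fields are hypothetical and mirror the ontology's dimensions; in the intended knowledge graph this comparison would run over instance data rather than Python dictionaries.

def similar_measure_to(a: dict, b: dict) -> bool:
    # Same proxy, method type, modality, granularity, and model specificity ...
    same = all(a[k] == b[k] for k in
               ("proxy", "method_type", "modality", "granularity", "model_specificity"))
    # ... plus at least one shared assumption and one shared access level.
    shared_assumption = bool(set(a["assumptions"]) & set(b["assumptions"]))
    shared_access = bool(set(a["access_levels"]) & set(b["access_levels"]))
    return same and shared_assumption and shared_access

m1 = {"proxy": "Robustness Proxy", "method_type": "Robustness Evaluation Method",
      "modality": "Feature-based", "granularity": "Local", "model_specificity": "Agnostic",
      "assumptions": {"Prediction Assumption"}, "access_levels": {"Inference"}}
m2 = dict(m1, assumptions={"Prediction Assumption", "Linearity Assumption"})
print(similar_measure_to(m1, m2))  # True: all criteria match, one shared assumption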
Soundness Proxy: A faithfulness proxy that states that the explanation should be correct and truthful. (Source: Zhou, J., Gandomi, A. H., Chen, F., & Holzinger, A. (2021). Evaluating the quality of machine learning explanations: A survey on methods and metrics. Electronics, 10(5), 593.)

Stability Evaluation Method: A computational-based evaluation method that assesses whether slight variations of an instance that did not change the predicted class substantially changed the explanation. Also known as: Robustness Evaluation Method. (Source: Alangari, N., El Bachir Menai, M., Mathkour, H., & Almosallam, I. (2023). Exploring evaluation methods for interpretable machine learning: A survey. Information, 14(8), 469.)

Surrogate-Model Method: A computational-based evaluation method that builds a second, usually directly interpretable, model that approximates a more complex model. A rough sketch appears after this group of entries. (Source: Arya, V., Bellamy, R. K., Chen, P. Y., Dhurandhar, A., Hind, M., Hoffman, S. C., ... & Zhang, Y. (2019). One explanation does not fit all: A toolkit and taxonomy of AI explainability techniques. arXiv preprint arXiv:1909.03012.)

Transformer Specific: The model specificity of targeting transformer-based models.

Uses Method: Specifies an evaluation method used by the faithfulness measure.

Explanation Modality: A particular form or medium in which an explanation exists or is expressed. (Source: Shruthi Chari)

Has Presentation: A property that captures what form of output an entity is presented in.

Rule: If a faithfulness measure requires access beyond inference, then it is model specific.

Rule: If a faithfulness measure uses a Find-Alignment method, then it makes the Logic Trap 1 assumption. (Source: Alangari, N., El Bachir Menai, M., Mathkour, H., & Almosallam, I. (2023). Exploring evaluation methods for interpretable machine learning: A survey. Information, 14(8), 469.)

Rule: If a faithfulness measure uses a Surrogate-Model method, then it makes the Model Assumption. (Source: Arya, V., Bellamy, R. K., Chen, P. Y., Dhurandhar, A., Hind, M., Hoffman, S. C., ... & Zhang, Y. (2019). One explanation does not fit all: A toolkit and taxonomy of AI explainability techniques. arXiv preprint arXiv:1909.03012.)

Rule: If a faithfulness measure uses a Surrogate-Model method, then it requires inference-level access and assumes Model Corollary 1. (Source: Arya, V., Bellamy, R. K., Chen, P. Y., Dhurandhar, A., Hind, M., Hoffman, S. C., ... & Zhang, Y. (2019). One explanation does not fit all: A toolkit and taxonomy of AI explainability techniques. arXiv preprint arXiv:1909.03012.)

Rule: If a faithfulness measure uses the Sensitivity to Iterative Masking method, then it must evaluate feature-based explanations, requires inference-level access, and makes the Linearity Assumption. (Source: Crothers, E., Viktor, H., & Japkowicz, N. (2024, September). Robust infidelity: When faithfulness measures on masked language models are misleading. In International Conference on Machine Learning, Optimization, and Data Science (pp. 133-147). Cham: Springer Nature Switzerland.)
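A minimal sketch of the Surrogate-Model method and the associated Completeness to Model Evaluation proxy: fit a directly interpretable model to the black-box model's own predictions, then report how well the surrogate agrees with it. The black box here is a hypothetical stand-in; as the rules above note, any model exposing predictions at inference level would do.

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))

# Hypothetical black box, queried only through its predictions (inference-level access).
black_box = lambda X: (np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 > 0.3).astype(int)
y_bb = black_box(X)

# Interpretable surrogate trained to mimic the black box.
surrogate = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y_bb)

# Fidelity: agreement between surrogate and black box on held-out queries.
X_test = rng.normal(size=(200, 4))
fidelity = accuracy_score(black_box(X_test), surrogate.predict(X_test))
print(f"surrogate fidelity to black box: {fidelity:.2f}")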
Rule: If a faithfulness measure uses the Completeness to Model Evaluation proxy, then it has a global scope. (Source: Gilpin, L. H., Bau, D., Yuan, B. Z., Bajwa, A., Specter, M., & Kagal, L. (2018, October). Explaining explanations: An overview of interpretability of machine learning. In 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA) (pp. 80-89). IEEE.)

Rule: If a faithfulness measure uses the Erasure method, then it must evaluate feature-based explanations and makes the Linearity Assumption. (Source: Jacovi, A., & Goldberg, Y. (2020). Towards faithfully interpretable NLP systems: How should we define and evaluate faithfulness? arXiv preprint arXiv:2004.03685.)

Rule: If a faithfulness measure uses the Robustness proxy, then it requires inference-level access and makes the Prediction Assumption. (Source: Jacovi, A., & Goldberg, Y. (2020). Towards faithfully interpretable NLP systems: How should we define and evaluate faithfulness? arXiv preprint arXiv:2004.03685.)

Rule: If a faithfulness measure uses the Human-Annotation method, then it must evaluate feature-based explanations and makes the Logic Trap 1 assumption. (Source: Ju, Y., Zhang, Y., Yang, Z., Jiang, Z., Liu, K., & Zhao, J. (2022, May). Logic traps in evaluating attribution scores. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 5911-5922).)

Rule: If a faithfulness measure uses a perturbation-based evaluation method, then it makes the Linearity Assumption and requires inference-level access. (Source: Lyu, Q., Apidianaki, M., & Callison-Burch, C. (2024). Towards faithful model explanation in NLP: A survey. Computational Linguistics, 50(2), 657-723.)

Rule: If a faithfulness measure uses a robustness evaluation method, then it makes the Prediction Assumption and requires inference-level access. (Source: Lyu, Q., Apidianaki, M., & Callison-Burch, C. (2024). Towards faithful model explanation in NLP: A survey. Computational Linguistics, 50(2), 657-723.)

Rule: If a faithfulness measure uses the Feature Importance Agreement proxy, then it makes the Linearity Assumption and must evaluate a feature-based explanation modality. (Source: Lyu, Q., Apidianaki, M., & Callison-Burch, C. (2024). Towards faithful model explanation in NLP: A survey. Computational Linguistics, 50(2), 657-723.)

Rule: If a faithfulness measure uses the Input Sensitivity proxy, then it requires inference-level access. (Source: Lyu, Q., Apidianaki, M., & Callison-Burch, C. (2024). Towards faithful model explanation in NLP: A survey. Computational Linguistics, 50(2), 657-723.)

Rule: If a faithfulness measure uses the Polarity Consistency proxy, then it makes the Linearity Corollary and must evaluate feature-based explanations. (Source: Lyu, Q., Apidianaki, M., & Callison-Burch, C. (2024). Towards faithful model explanation in NLP: A survey. Computational Linguistics, 50(2), 657-723.)

Rule: If a faithfulness measure uses the Model Sensitivity proxy, then it has a global scope and makes the Model Assumption. (Source: Lyu, Q., Apidianaki, M., & Callison-Burch, C. (2024). Towards faithful model explanation in NLP: A survey. Computational Linguistics, 50(2), 657-723.)

Rule: If a faithfulness measure uses the Continuity proxy, then it requires inference-level access. (Source: Nejadgholi, I., Omidyeganeh, M., Drouin, M. A., & Boisvert, J. (2025). A taxonomy for design and evaluation of prompt-based natural language explanations. arXiv preprint arXiv:2507.10585.)

Rule: If a faithfulness measure uses the Completeness proxy, then it has a global scope. (Source: Zhou, J., Gandomi, A. H., Chen, F., & Holzinger, A. (2021). Evaluating the quality of machine learning explanations: A survey on methods and metrics. Electronics, 10(5), 593.)
Rule: If a faithfulness measure uses the Soundness proxy with a feature-based explanation modality, then it must use a computational method. (Source: Zhou, J., Gandomi, A. H., Chen, F., & Holzinger, A. (2021). Evaluating the quality of machine learning explanations: A survey on methods and metrics. Electronics, 10(5), 593.)
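As a rough sketch of how rules like those above could be applied to a catalogue record to infer extra assertions, the following encodes three of the rules in plain Python. The record layout and rule encoding are hypothetical; in the ontology these are class axioms evaluated by a reasoner, not Python code.

# Each rule: (condition field, condition value, consequences to assert).
RULES = [
    ("method", "Perturbation-based Evaluation Method",
     {"assumptions": {"Linearity Assumption"}, "access": {"Inference"}}),
    ("method", "Surrogate-Model Method",
     {"assumptions": {"Model Assumption", "Model Corollary 1"}, "access": {"Inference"}}),
    ("proxy", "Completeness Proxy",
     {"granularity": {"Global Scope"}}),
]

def apply_rules(record: dict) -> dict:
    """Collect everything the matching rules allow us to infer about a measure."""
    inferred = {"assumptions": set(), "access": set(), "granularity": set()}
    for field, value, consequences in RULES:
        if record.get(field) == value:
            for key, vals in consequences.items():
                inferred[key] |= vals
    return inferred

measure = {"id": "ex:measure-1", "method": "Surrogate-Model Method", "proxy": "Completeness Proxy"}
print(apply_rules(measure))

Inferences of this kind are what would let a recommendation system surface the hidden assumptions and access requirements of a newly catalogued measure.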