#Tags [[Research/Research Papers/2404.01318v4.pdf]] #AMLT0015/EvadeMLModel #AMLT0043/CraftAdversarialData #AMLT0051/LLMPromptInjection #AMLT0054/LLMJailbreak #AMLT0057/LLMDataLeakage

**Title:** JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models

**Authors:** Patrick Chao, Edoardo Debenedetti, Alexander Robey, Maksym Andriushchenko, Francesco Croce, Vikash Sehwag, Edgar Dobriban, Nicolas Flammarion, George J. Pappas, Florian Tramèr, Hamed Hassani, Eric Wong

**Publication Date:** March 28, 2024 (last updated July 16, 2024)

Summary:

JailbreakBench is an open-source benchmark for evaluating jailbreak attacks on large language models (LLMs). It provides standardized tools and datasets for assessing how robust LLMs are against adversarial prompts that aim to elicit harmful or unethical content.

Key Contributions:

1. Repository of jailbreak artifacts (adversarial prompts)
2. JBB-Behaviors dataset with 100 harmful behaviors plus matching benign behaviors
3. Standardized evaluation framework for jailbreaking attacks and defenses
4. Leaderboard tracking performance of attacks and defenses on various LLMs
5. Rigorous selection process for jailbreak classifiers

Problem Statement:

The paper addresses the lack of standardization in evaluating jailbreak attacks on LLMs: success rates are calculated inconsistently, results are hard to reproduce, and there is no unified benchmark for comparing attacks and defenses.

Methodology:

1. Created the JBB-Behaviors dataset with 100 behaviors across 10 categories
2. Developed a standardized pipeline for red-teaming LLMs and evaluating defenses
3. Conducted a human evaluation to select an effective jailbreak classifier
4. Implemented baseline attacks (PAIR, GCG, JB-Chat, Prompt with RS) and defenses (SmoothLLM, Perplexity Filter, Erase-and-Check)
5. Evaluated attacks and defenses on multiple LLMs (Vicuna, Llama-2, GPT-3.5, GPT-4)

Main Results:

1. Llama-3-70B selected as the most effective jailbreak classifier (90.7% agreement with human annotators)
2. Prompt with RS attack achieved the highest success rates across models (78-93%)
3. Erase-and-Check defense showed the most consistent performance in reducing attack success rates

Qualitative Analysis:

- The benchmark standardizes how LLM robustness against jailbreak attacks is evaluated
- Open-sourcing jailbreak artifacts may accelerate research on defenses but also raises ethical concerns
- Using one consistent jailbreak classifier (Llama-3-70B) improves comparability across studies

Limitations:

- Covers text-based inputs only; other modalities are not considered
- Attacks and defenses may overfit to the specific jailbreak classifier used in the benchmark
- Releasing jailbreak artifacts publicly raises ethical considerations

Conclusion and Future Work:

JailbreakBench provides a comprehensive, open-source platform for evaluating LLM robustness against jailbreak attacks. Future work may include expanding to other modalities, refining the jailbreak classifier, and continuously updating the benchmark to reflect evolving attack and defense techniques.
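To make the success-rate and agreement figures under Main Results concrete, here is a minimal sketch of how such numbers are commonly computed. The function names and the toy labels are illustrative assumptions and are not part of the JailbreakBench library; the paper's exact protocol (for example, how human labels are aggregated) may differ.

```python
from collections import Counter


def attack_success_rate(judged: list[bool]) -> float:
    """Fraction of behaviors whose response was judged jailbroken."""
    if not judged:
        raise ValueError("need at least one judged behavior")
    return sum(judged) / len(judged)


def majority_vote(labels: list[bool]) -> bool:
    """Majority label among annotators (assumes an odd number of annotators)."""
    return Counter(labels).most_common(1)[0][0]


def agreement(classifier: list[bool], human: list[list[bool]]) -> float:
    """Share of examples where the classifier matches the human majority."""
    if not classifier or len(classifier) != len(human):
        raise ValueError("need equally sized, non-empty label lists")
    matches = sum(c == majority_vote(h) for c, h in zip(classifier, human))
    return matches / len(classifier)


# Toy data, invented for illustration only (not results from the paper).
judged = [True, False, True, False]
human = [
    [True, True, False],    # majority: True
    [False, False, False],  # majority: False
    [True, False, False],   # majority: False
    [False, False, True],   # majority: False
]
print(attack_success_rate(judged))  # 0.5
print(agreement(judged, human))     # 0.75
```

Fixing both the judge and the denominator in this way is what makes success rates comparable across attacks, which is the inconsistency the Problem Statement describes.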
Tools Introduced:

- JailbreakBench library: https://github.com/JailbreakBench/jailbreakbench
- JailbreakBench leaderboard: https://jailbreakbench.github.io/

## Repository Token Information

Total tokens in repository: 40187

Tokens per file:

- `CONTRIBUTING.md`: 391 tokens
- `README.md`: 6041 tokens
- `.github/workflows/publish.yml`: 466 tokens
- `.github/workflows/lint.yml`: 386 tokens
- `.github/ISSUE_TEMPLATE/attack-submission.yml`: 552 tokens
- `src/jailbreakbench/vllm_server.py`: 23 tokens
- `src/jailbreakbench/submission.py`: 3391 tokens
- `src/jailbreakbench/classifier.py`: 2452 tokens
- `src/jailbreakbench/config.py`: 1150 tokens
- `src/jailbreakbench/dataset.py`: 398 tokens
- `src/jailbreakbench/__init__.py`: 218 tokens
- `src/jailbreakbench/artifact.py`: 1430 tokens
- `src/jailbreakbench/llm/llm_wrapper.py`: 1403 tokens
- `src/jailbreakbench/llm/litellm.py`: 739 tokens
- `src/jailbreakbench/llm/dummy_vllm.py`: 139 tokens
- `src/jailbreakbench/llm/llm_output.py`: 176 tokens
- `src/jailbreakbench/llm/__init__.py`: 0 tokens
- `src/jailbreakbench/llm/vllm.py`: 502 tokens
- `src/jailbreakbench/plotting/plot_source_breakdown.py`: 683 tokens
- `src/jailbreakbench/defenses/perplexity_filter.py`: 787 tokens
- `src/jailbreakbench/defenses/defenses_registry.py`: 112 tokens
- `src/jailbreakbench/defenses/base_defense.py`: 443 tokens
- `src/jailbreakbench/defenses/smooth_llm.py`: 562 tokens
- `src/jailbreakbench/defenses/remove_non_dictionary.py`: 357 tokens
- `src/jailbreakbench/defenses/__init__.py`: 122 tokens
- `src/jailbreakbench/defenses/synonym_substitution.py`: 284 tokens
- `src/jailbreakbench/defenses/erase_and_check.py`: 644 tokens
- `src/jailbreakbench/defenses/defenselib/perturbations.py`: 448 tokens
- `src/jailbreakbench/defenses/defenselib/__init__.py`: 0 tokens
- `src/jailbreakbench/defenses/defenselib/defense_hparams.py`: 420 tokens
- `examples/submit.py`: 699 tokens
- `examples/prompts/vicuna.json`: 3918 tokens
- `examples/prompts/llama2.json`: 4111 tokens
- `tests/test_artifact.py`: 1540 tokens
- `tests/test_config.py`: 712 tokens
- `tests/__init__.py`: 0 tokens
- `tests/test_classifier.py`: 1829 tokens
- `tests/test_llm/test_dummy.py`: 401 tokens
- `tests/test_llm/__init__.py`: 0 tokens
- `tests/test_llm/test_litellm.py`: 1166 tokens
- `tests/test_llm/test_vllm.py`: 1092 tokens

## Tutorial and Enhancement Suggestions

# JailbreakBench Tutorial

## 1. Project Overview

JailbreakBench is an open-source benchmark for evaluating the robustness of large language models (LLMs) against jailbreaking attacks. The project provides tools for:

- Accessing a dataset of harmful and benign behaviors (JBB-Behaviors)
- Evaluating jailbreak attacks against different LLMs
- Implementing and testing defense mechanisms
- Submitting new attacks to a leaderboard

The codebase is primarily written in Python and uses libraries like PyTorch, Transformers, and vLLM.

## 2. Project Structure

The repository is organized as follows:

```
jailbreakbench/
├── src/
│   └── jailbreakbench/
│       ├── llm/
│       ├── defenses/
│       ├── plotting/
│       └── ...
├── tests/
├── examples/
├── .github/
└── ...
```

Key directories:

- `src/jailbreakbench/`: Core functionality
- `tests/`: Unit tests
- `examples/`: Example scripts for using the library
- `.github/`: GitHub Actions workflows and issue templates

## 3. Key Components

### 3.1 Dataset (dataset.py)

The `JailbreakbenchDataset` class provides access to the JBB-Behaviors dataset, containing harmful and benign behaviors for testing LLMs.

```python
from jailbreakbench import read_dataset

# Load the JBB-Behaviors dataset.
dataset = read_dataset()

# Goal strings and behavior names, one entry per behavior.
goals = dataset.goals
behaviors = dataset.behaviors
```

### 3.2 LLM Wrappers (llm/)

The `LLM` base class (in `llm_wrapper.py`) defines the interface for interacting with language models.
Implementations include:

- `LLMLiteLLM`: Uses the LiteLLM library for API calls
- `LLMvLLM`: Uses vLLM for local model inference

Example usage:

```python
from jailbreakbench import LLMLiteLLM

llm = LLMLiteLLM(model_name="vicuna-13b-v1.5", api_key="YOUR_API_KEY")
responses = llm.query(prompts=["Hello, how are you?"], behavior="greeting")
```

### 3.3 Classifiers (classifier.py)

Classifiers determine whether a model's response is jailbroken. The main implementations are:

- `LlamaGuard1JailbreakJudge`
- `Llama3JailbreakJudge`
- `Llama3RefusalJudge`

Example:

```python
from jailbreakbench import Llama3JailbreakJudge
# Placeholder inputs: one response per prompt, in the same order.
prompts = ["Hello, how are you?"]
responses = ["I'm doing well, thank you."]
judge = Llama3JailbreakJudge(api_key="YOUR_API_KEY")
is_jailbroken = judge(prompts, responses)  # one verdict per prompt/response pair
```

### 3.4 Defenses (defenses/)

The `Defense` base class defines the interface for implementing defense mechanisms. Implemented defenses include:

- `SmoothLLM`
- `PerplexityFilter`
- `EraseAndCheck`

Defenses can be applied when querying an LLM:

```python
# Assumes `llm` and `prompts` are defined as in the examples above.
responses = llm.query(prompts, behavior="test", defense="SmoothLLM")
```

### 3.5 Evaluation and Submission (submission.py)

The `evaluate_prompts` and `create_submission` functions evaluate jailbreak attacks and prepare submissions for the leaderboard.

```python
import jailbreakbench as jbb

# `all_prompts` holds the jailbreak prompts to evaluate and must be defined by the caller.
evaluation = jbb.evaluate_prompts(all_prompts, llm_provider="litellm")
jbb.create_submission(evaluation, method_name="MyAttack", attack_type="black_box", method_params={})
```

## 4. Key Algorithms and Techniques

### 4.1 SmoothLLM Defense

The SmoothLLM defense (in `defenses/smooth_llm.py`) implements the SmoothLLM baseline listed among the paper's defenses. It works by:

1. Creating perturbed copies of the input prompt
2. Querying the LLM with each perturbed prompt
3. Classifying each response as jailbroken or not
4. Taking a majority vote to determine the final output

### 4.2 Perplexity Filter

The Perplexity Filter defense (in `defenses/perplexity_filter.py`) uses a language model to compute the perplexity of input prompts.
It filters out prompts with high perplexity, which may indicate adversarial content.

### 4.3 Jailbreak Classification

The `Llama3JailbreakJudge` (in `classifier.py`) uses Llama-3-70B, the classifier the paper selected, to judge whether a response is jailbroken. The model was chosen for its high agreement with human annotators, as described in the paper.

## 5. Relation to Research Paper

The codebase directly implements the concepts and methodologies described in the JailbreakBench paper:

- The JBB-Behaviors dataset is accessible through the `dataset.py` module
- The standardized evaluation pipeline is implemented in `submission.py`
- Baseline defenses mentioned in the paper are implemented in the `defenses/` directory; jailbreak artifacts produced by attacks are handled by `artifact.py`
- The jailbreak classifier selection process is reflected in the `classifier.py` implementation

# Potential Enhancements

1. **Multi-modal Jailbreak Detection**
   - Extend the benchmark to include image and audio inputs
   - Implement classifiers capable of detecting jailbreaks across multiple modalities
   - Modify the dataset and evaluation pipeline to support multi-modal inputs and outputs
2. **Adaptive Defense Mechanisms**
   - Develop defenses that can learn and adapt to new jailbreak attempts over time
   - Implement a feedback loop where successful jailbreaks are used to improve defenses
   - Explore reinforcement learning techniques for dynamic defense optimization
3. **Prompt Optimization for Robustness**
   - Implement techniques to automatically generate more robust system prompts
   - Develop a method to fine-tune LLMs for improved jailbreak resistance
   - Create a benchmark for comparing the effectiveness of different prompt engineering strategies
4. **Efficiency Improvements for Large-Scale Evaluation**
   - Optimize the evaluation pipeline for distributed computing environments
   - Implement caching and batching strategies to reduce API calls and computation time
   - Explore quantization and pruning techniques to enable faster local model inference
5. **Integration with Continuous Monitoring Systems**
   - Develop tools for real-time jailbreak detection in production LLM applications
   - Create a system for automatically updating defenses based on newly discovered vulnerabilities
   - Implement a framework for continuously evaluating deployed models against the latest jailbreak techniques
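
Returning to the SmoothLLM defense from section 4.1: its four steps (perturb, query, classify, vote) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the code in `defenses/smooth_llm.py`. The names `perturb`, `smooth_query`, `toy_llm`, and `toy_judge` are hypothetical, random character swaps are only one possible perturbation, and returning a response that agrees with the vote is an assumed way of turning the vote into a final output.

```python
import random
import string
from typing import Callable


def perturb(prompt: str, rate: float, rng: random.Random) -> str:
    """Replace a fraction `rate` of characters with random printable ones."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be in [0, 1]")
    chars = list(prompt)
    n_swap = int(len(chars) * rate)
    for i in rng.sample(range(len(chars)), n_swap):
        chars[i] = rng.choice(string.printable)
    return "".join(chars)


def smooth_query(
    prompt: str,
    llm: Callable[[str], str],
    judge: Callable[[str], bool],
    n_copies: int = 5,
    rate: float = 0.1,
    seed: int = 0,
) -> str:
    """Perturb, query, classify, and vote; return a response matching the vote."""
    if n_copies < 1:
        raise ValueError("n_copies must be at least 1")
    rng = random.Random(seed)
    # Steps 1 and 2: query the model once per perturbed copy of the prompt.
    responses = [llm(perturb(prompt, rate, rng)) for _ in range(n_copies)]
    # Step 3: classify each response as jailbroken (True) or not (False).
    verdicts = [judge(r) for r in responses]
    # Step 4: majority vote, then pick a response that agrees with it.
    majority = sum(verdicts) > n_copies / 2
    consistent = [r for r, v in zip(responses, verdicts) if v == majority]
    return rng.choice(consistent)


def toy_llm(prompt: str) -> str:
    """Stand-in model: complies only if the exact trigger string is present."""
    if "TRIGGER" in prompt:
        return "Sure, here is the requested text."
    return "I cannot help with that."


def toy_judge(response: str) -> bool:
    """Stand-in judge: treats any compliant reply as jailbroken."""
    return response.startswith("Sure")


print(smooth_query("Tell me a story. TRIGGER", toy_llm, toy_judge, rate=0.2))
```

The toy model mimics a brittle adversarial string: once perturbation breaks the trigger in most copies, the vote flips to the refusal. A real deployment would replace `toy_llm` and `toy_judge` with an actual model and classifier.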