{ "cells": [ { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": {}, "inputWidgets": {}, "nuid": "8659f92b-f435-4cd3-ad02-4a53641fb401", "showTitle": false, "title": "" } }, "source": [ "This notebook is available at https://github.com/databricks-industry-solutions/review-summarisation. For more information about this solution accelerator, check out our [website](https://www.databricks.com/solutions/accelerators/large-language-models-retail) and [blog post](https://www.databricks.com/blog/automated-analysis-product-reviews-using-large-language-models-llms)." ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": {}, "inputWidgets": {}, "nuid": "51c5a2cb-00bc-4ed8-9628-d654a0be69b2", "showTitle": false, "title": "" } }, "source": [ "# Prompt Engineering\n", "We have prepped our data, and it is ready for summarisation. In this notebook, we are going to cover how we can create prompts to use with an LLM to summarise the reviews we have batched.\n", "\n", "There many, many models which we can pick and choose from in the open source community. Huggingface hosts most of these on their hub, and the great thing is, they have more or less standardised the way to interact with these models, so we can use most of them with slight changes to our code.\n", "\n", "So, what should we pay attention to while choosing our model ?\n", "\n", "First, lets talk about model size. When you hear about a LLM, you would usually see that it comes with a parameter value. This can be something like 7 Billion parameters, 13, 30, 70, etc.. What does this mean ? This value tells us about how many configurable/trainable parameters a model has, which can tell us about its capacity to understand things. The bigger the model, the more complex tasks it can complete. However, as their size increase, so does their computation requirements: they tend to require more resources to operate. Therefore, picking the smallest size that can do the work is the best way to start. In our case, summarisation can be done by 7B models, so we are going to option for those.\n", "\n", "Then, we need to pick a model which can follow instructions. What does that mean ? Lets take a look at two examples:\n", "\n", "* [MPT-7B-Base Model](https://huggingface.co/mosaicml/mpt-7b)\n", "* [MPT-7B-Instruct Model](https://huggingface.co/mosaicml/mpt-7b-instruct)\n", "\n", "Mosaic's 7B Base model is a pre-trained model, however it has not been fine-tuned for a specific task. It would be a great candidate if we wanted to fine-tune it for a specific task which we have training data for. Where as the Instruct model has been trained on a Instructions dataset, and is more ready to follow instructions. \n", "\n", "Given this, selecting the instruct model makes more sense, which we will cover in this notebook.\n", "\n", "Lets begin!" ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": {}, "inputWidgets": {}, "nuid": "55668abd-139d-40c3-8682-0f4b7477569d", "showTitle": false, "title": "" } }, "source": [ "### Set Up\n", "Our model requires some specific libraries to be present in the runtime. We can install them using the commands below. 
This is a good way to start; however, if you plan to set up a cluster that you will continuously use with a model like this, another good approach to library installation is to specify the libraries you need in your cluster's configuration page.\n", "\n", "For this part of the project, we are going to need a GPU-enabled instance.\n", "\n", "Our setup for this notebook is:\n", "* Runtime: DBR 13.2 ML + GPU\n", "* Compute (Single Node): \n", " * Required: GPU Compute with minimum **25 GB GPU RAM**\n", " * Suggested: Nvidia A10 or A100 GPU \n", " * Azure Example: `NC24ads_A100_v4`" ] }, { "cell_type": "code", "execution_count": 0, "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "6dcc25f8-0e9b-45b8-be3a-f4f8335a6f68", "showTitle": false, "title": "" } }, "outputs": [], "source": [ "%sh\n", "# Check out our driver's GPU\n", "nvidia-smi" ] }, { "cell_type": "code", "execution_count": 0, "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "cec01744-4e7b-43b0-a7ec-a9ff7c161793", "showTitle": false, "title": "" } }, "outputs": [], "source": [ "# Install libraries\n", "%pip install -q flash-attn xformers torch==2.0.1 triton-pre-mlir@git+https://github.com/vchiley/triton.git@triton_pre_mlir_sm90#subdirectory=python\n", "\n", "# Restart Python Kernel\n", "dbutils.library.restartPython()" ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": {}, "inputWidgets": {}, "nuid": "fad3646b-c690-4e81-a996-32827fcca264", "showTitle": false, "title": "" } }, "source": [ "### Data Setup\n", "\n", "Let's set the global variables and the default catalogs/schemas we are going to use in this notebook." ] }, { "cell_type": "code", "execution_count": 0, "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "implicitDf": true, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "6cae6d44-b69e-4dea-b707-4b8255fffd42", "showTitle": false, "title": "" } }, "outputs": [], "source": [ "# Imports\n", "from config import CATALOG_NAME, SCHEMA_NAME, USE_UC, USE_VOLUMES\n", "\n", "# If UC is enabled\n", "if USE_UC:\n", " _ = spark.sql(f\"USE CATALOG {CATALOG_NAME};\")\n", "\n", "# Sets the standard database to be used in this notebook\n", "_ = spark.sql(f\"USE SCHEMA {SCHEMA_NAME};\")\n", "\n", "\n", "# Create a Volume (Optional, skip if no-UC)\n", "if USE_VOLUMES:\n", " _ = spark.sql(\"CREATE VOLUME IF NOT EXISTS model_store;\")" ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": {}, "inputWidgets": {}, "nuid": "55102a35-1086-45e7-bea8-2c97a9fbe20f", "showTitle": false, "title": "" } }, "source": [ "### Paths Setup\n", "\n", "We are going to create or specify a location which we will use for saving our models."
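, "\n", "\n", "As an optional sanity check (a minimal sketch; `snapshot_download` will create the target folder for us later anyway), you can make sure the storage location exists once the next cell has defined `main_storage_path`:\n", "\n", "```python\n", "import os\n", "\n", "# Create the model store folder if it is missing and confirm it is there\n", "os.makedirs(main_storage_path, exist_ok=True)\n", "print(os.path.isdir(main_storage_path))\n", "```"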
] }, { "cell_type": "code", "execution_count": 0, "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "0b8b004d-22d5-4b2c-a9d1-f2c6ac3184fa", "showTitle": false, "title": "" } }, "outputs": [], "source": [ "# Import the os module to declare an ENV variable\n", "from config import MAIN_STORAGE_PATH\n", "import os\n", "\n", "# Setting up the storage path (please edit this if you would like to store the data somewhere else)\n", "main_storage_path = f\"{MAIN_STORAGE_PATH}/model_store\"\n", "\n", "# Declaring as an Environment Variable \n", "os.environ[\"MAIN_STORAGE_PATH\"] = main_storage_path" ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": {}, "inputWidgets": {}, "nuid": "1f55a5f0-9c28-40b1-89bd-26465cfa9ef2", "showTitle": false, "title": "" } }, "source": [ "### Read Data\n", "Our primary dataset is going to be the batched book reviews which we created in the last notebook." ] }, { "cell_type": "code", "execution_count": 0, "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "16064b58-8ae7-42a3-9801-33a5dd6a720a", "showTitle": false, "title": "" } }, "outputs": [], "source": [ "# Read our main dataframe\n", "batched_reviews_df = spark.read.table(\"batched_book_reviews\")" ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": {}, "inputWidgets": {}, "nuid": "0b8e5787-1a6b-4297-afa8-c72bfa60f0b3", "showTitle": false, "title": "" } }, "source": [ "### LLM Pipeline\n", "\n", "We have selected the [MPT-7B-Instruct Model](https://huggingface.co/mosaicml/mpt-7b-instruct) for this specific task, which was built by our friends from [Mosaic ML](https://www.mosaicml.com/). The model itself is very robust, performs well, and can take on summarisation tasks easily. It also comes in a 30B-parameter version, but that would probably be overkill for our use case.\n", "\n", "Some other 7B models you can also check out are:\n", "* [Falcon-7B-Instruct](https://huggingface.co/tiiuae/falcon-7b-instruct)\n", "* [Llama-2-7B-Chat](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf)\n", "* [Stable-Beluga-7B](https://huggingface.co/stabilityai/StableBeluga-7B)\n", "\n", "\n", "We are now going to use the transformers library from Hugging Face to download & load the model." ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": {}, "inputWidgets": {}, "nuid": "6c7e66a7-c5c2-4d85-9d0e-5922757e202e", "showTitle": false, "title": "" } }, "source": [ "#### Download Model & Tokenizer\n", "\n", "In the first cell, we will download the tokenizer and the model to a local location. This step is not strictly required, but it speeds up loading the model since we don't have to download it each time we need to use it."
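, "\n", "\n", "If you want to confirm that the snapshot really was cached (and get a feel for why re-downloading it on every run would be slow), a small sketch like the one below can be run after the download cell; it only assumes the `local_model_path` variable defined there:\n", "\n", "```python\n", "import os\n", "\n", "# Walk the local snapshot folder and add up the file sizes (in GB)\n", "total_bytes = sum(\n", "    os.path.getsize(os.path.join(root, f))\n", "    for root, _, files in os.walk(local_model_path)\n", "    for f in files\n", ")\n", "print(f\"Cached snapshot size: {total_bytes / 1e9:.1f} GB\")\n", "```"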
] }, { "cell_type": "code", "execution_count": 0, "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "b48205a1-5420-40ef-a125-8de2731558e1", "showTitle": false, "title": "" } }, "outputs": [], "source": [ "# External Imports\n", "from huggingface_hub.utils import (\n", " logging as hf_logging,\n", " disable_progress_bars as hfhub_disable_progress_bar,\n", ")\n", "from huggingface_hub import snapshot_download\n", "import os\n", "\n", "# Turn Off Info Logging for Transformers\n", "hf_logging.set_verbosity_error()\n", "hfhub_disable_progress_bar()\n", "\n", "# MPT-7B-Instruct revisions in https://huggingface.co/mosaicml/mpt-7b-instruct/commits/main\n", "model_name = \"mosaicml/mpt-7b-instruct\"\n", "model_revision_id = \"925e0d80e50e77aaddaf9c3ced41ca4ea23a1025\"\n", "\n", "# Tokenizer revisions in https://huggingface.co/EleutherAI/gpt-neox-20b/commits/main\n", "tokenizer_name = \"EleutherAI/gpt-neox-20b\"\n", "tokenizer_revision_id = \"9369f145ca7b66ef62760f9351af951b2d53b77f\"\n", "\n", "# Download the model\n", "local_model_path = f\"{main_storage_path}/mpt-7b-instruct/\"\n", "if os.path.isdir(local_model_path):\n", " print(\"Local model exists\")\n", "else:\n", " print(f\"Downloading model to {local_model_path}\")\n", " model_download = snapshot_download(\n", " repo_id=model_name,\n", " revision=model_revision_id,\n", " local_dir=local_model_path,\n", " local_dir_use_symlinks=False,\n", " )\n", "\n", "# Download the tokenizer\n", "local_tokenizer_path = f\"{main_storage_path}/mpt-7b-tokenizer/\"\n", "if os.path.isdir(local_tokenizer_path):\n", " print(\"Local tokenizer exists\")\n", "else:\n", " print(f\"Downloading tokenizer to {local_tokenizer_path}\")\n", " tokenizer_download = snapshot_download(\n", " repo_id=tokenizer_name,\n", " revision=tokenizer_revision_id,\n", " local_dir=local_tokenizer_path,\n", " local_dir_use_symlinks=False,\n", " )" ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": {}, "inputWidgets": {}, "nuid": "f96cac0d-0efd-48ba-88f3-5a7fa3ccdf78", "showTitle": false, "title": "" } }, "source": [ "#### Load & Build Pipeline\n", "In the cell below, we will load the downloaded snapshots to the GPU and instantiate a pipeline which will encapsulate the tokenizer and the model to help with text generation."
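, "\n", "\n", "Once the cell below has loaded the model, you can also cross-check GPU memory usage from Python instead of `nvidia-smi`; this is just an illustrative sketch using standard `torch.cuda` calls:\n", "\n", "```python\n", "import torch\n", "\n", "# Memory held by tensors vs. memory reserved by the caching allocator, in GB\n", "print(f\"Allocated: {torch.cuda.memory_allocated(0) / 1e9:.1f} GB\")\n", "print(f\"Reserved: {torch.cuda.memory_reserved(0) / 1e9:.1f} GB\")\n", "```"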
] }, { "cell_type": "code", "execution_count": 0, "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "48e8657b-939e-4fb0-9d3a-d18b52e47a49", "showTitle": false, "title": "" } }, "outputs": [], "source": [ "# External Imports\n", "from transformers import AutoTokenizer, AutoModelForCausalLM, AutoConfig\n", "from transformers.utils import logging as t_logging\n", "import transformers\n", "import torch\n", "\n", "# Turn Off Info Logging for Transformers\n", "t_logging.set_verbosity_error()\n", "t_logging.disable_progress_bar()\n", "\n", "# Point the model and tokenizer at the local snapshots downloaded above\n", "model_name = local_model_path\n", "tokenizer_name = local_tokenizer_path\n", "\n", "# Load model's config\n", "config = AutoConfig.from_pretrained(model_name, trust_remote_code=True)\n", "config.attn_config[\"attn_impl\"] = \"triton\"\n", "config.init_device = \"cuda:0\"\n", "\n", "# Load model\n", "model = AutoModelForCausalLM.from_pretrained(\n", " model_name,\n", " config=config,\n", " torch_dtype=torch.bfloat16,\n", " trust_remote_code=True,\n", ")\n", "\n", "# Load the tokenizer\n", "tokenizer = AutoTokenizer.from_pretrained(tokenizer_name, padding_side=\"left\")\n", "tokenizer.pad_token_id = tokenizer.eos_token_id\n", "\n", "# Build the pipeline\n", "mpt_pipeline = transformers.pipeline(\n", " \"text-generation\",\n", " model=model,\n", " config=config,\n", " tokenizer=tokenizer,\n", " torch_dtype=torch.bfloat16,\n", " device=0,\n", ")\n", "\n", "# Required tokenizer setting for batch inference\n", "mpt_pipeline.tokenizer.pad_token_id = tokenizer.eos_token_id" ] }, { "cell_type": "code", "execution_count": 0, "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "95ff09ec-90e0-4acf-858d-a178605b3cc2", "showTitle": false, "title": "" } }, "outputs": [], "source": [ "%sh\n", "# Check out the GPU again to see how the memory has changed (since we loaded the model)\n", "nvidia-smi" ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": {}, "inputWidgets": {}, "nuid": "b1c9719b-05f1-4fbe-9539-efe2c6f41843", "showTitle": false, "title": "" } }, "source": [ "#### Suggested Prompt Template\n", "\n", "Now that we have loaded the model, we can start asking some questions.\n", "\n", "The MPT 7B Instruct model has been fine-tuned with a [specific prompt template](https://huggingface.co/mosaicml/mpt-7b-instruct#formatting). You can think of the prompt template as the \"right way to ask a question to the model\". On the model's webpage, they specifically note that the model should be instructed in this way. Let's take a look at how we can construct this prompt." ] }, { "cell_type": "code", "execution_count": 0, "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "a9db586b-5c65-4995-a3ac-e9569de814b9", "showTitle": false, "title": "" } }, "outputs": [], "source": [ "# Suggested template prompt\n", "mpt_template_prompt = \"\"\"Below is an instruction that describes a task. 
Write a response that appropriately completes the request.\n", "### Instruction:\n", "{question}\n", "### Response:\n", "\"\"\"\n", "\n", "# Example Question\n", "example_question = \"When does summer start ?\"\n", "\n", "print(mpt_template_prompt.format(question=example_question))" ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": {}, "inputWidgets": {}, "nuid": "54419314-a456-493e-ac9e-9364369e8f71", "showTitle": false, "title": "" } }, "source": [ "#### Few Examples\n", "\n", "Let's run through a few examples to see how our model works." ] }, { "cell_type": "code", "execution_count": 0, "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "6b25670b-75c2-40e1-87a7-b044ce294ecc", "showTitle": false, "title": "" } }, "outputs": [], "source": [ "# Example requests\n", "requests = [\n", " \"How many days are there in a week ?\",\n", " \"When does summer start ?\",\n", " \"If you could learn a programming language, which one would you go for ?\",\n", " \"What's a Large Language Model ?\",\n", " \"What does summarisation mean ?\",\n", " \"How can you deal with an angry customer ?\"\n", "]\n", "\n", "# Format the requests with the prompt template\n", "formatted_requests = [\n", " mpt_template_prompt.format(question=single_request) \n", " for single_request in requests\n", "]\n", "\n", "# Generate response\n", "llm_responses = mpt_pipeline(\n", " formatted_requests, \n", " max_new_tokens=200, # How long can the answer be?\n", " do_sample=True,\n", " temperature=0.4, # How creative can the model be? (1 = max creativity)\n", " eos_token_id=tokenizer.eos_token_id,\n", " pad_token_id=tokenizer.eos_token_id,\n", ")\n", "\n", "# Print Responses\n", "for response, f_request, r_request in zip(llm_responses, formatted_requests, requests):\n", " print(\"-\"*10)\n", " print(\"User: \", r_request)\n", " print(\"MPT-7B: \", response[0][\"generated_text\"].split(f_request)[-1])" ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": {}, "inputWidgets": {}, "nuid": "6a604eef-837e-45d0-afa1-c4c7bd237d35", "showTitle": false, "title": "" } }, "source": [ "Answers look pretty good overall!" ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": {}, "inputWidgets": {}, "nuid": "08af3929-dd8e-4bbd-9473-c194a8a9b6c7", "showTitle": false, "title": "" } }, "source": [ "### Prompt Engineering\n", "\n", "Now, let's start focusing on our task by taking a look at prompt engineering and why it's important.\n", "\n", "The question/instruction we write within the prompt is essentially the task description for the model. You can think of this as writing code in English for our model. The more complex the task gets, the more computationally intensive it becomes for the model to answer. Also, each model has its own preferred way of being prompted.\n", "\n", "For example, some models do better when the input text, in our case reviews, is put before the instruction (there is a small illustrative sketch of this idea at the end of this cell). The best way to understand all of this is through experimentation. Let's set up an experiment here to see which prompts work best for our case. But first, we will need some examples to test with. Let's get a random positive review and a random negative review."
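, "\n", "\n", "As an aside, if you wanted to test the \"input before instruction\" idea mentioned above, a variant along the following lines could be added to the experiment later; this is only an illustrative sketch (the `reviews_first_prompt` name is hypothetical), not one of the prompts we evaluate below:\n", "\n", "```python\n", "# Hypothetical variant that places the reviews ahead of the instruction\n", "reviews_first_prompt = \"\"\"Below is an instruction that describes a task. Write a response that appropriately completes the request.\n", "Reviews: {review}\n", "\n", "### Instruction:\n", "Summarise what customers liked about this book in three succinct bullet points.\n", "\n", "### Response:\n", "\"\"\"\n", "```"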
] }, { "cell_type": "code", "execution_count": 0, "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "7c609ad8-70e1-4248-83a3-47638d4238e5", "showTitle": false, "title": "" } }, "outputs": [], "source": [ "# Imports\n", "from pyspark.sql import functions as SF\n", "\n", "# Retrieve positive examples\n", "positive_review_examples = (\n", " batched_reviews_df\n", " .filter(SF.col(\"star_rating_class\")==\"high\")\n", " .sample(False, fraction=0.01, seed=42)\n", " .select(\"concat_review_text\")\n", " .limit(10)\n", " .collect()\n", ")\n", "positive_review_examples = [x[0] for x in positive_review_examples]\n", "\n", "# Retrieve negative examples\n", "negative_review_examples = (\n", " batched_reviews_df\n", " .filter(SF.col(\"star_rating_class\")==\"low\")\n", " .sample(False, fraction=0.01, seed=42)\n", " .select(\"concat_review_text\")\n", " .limit(10)\n", " .collect()\n", ")\n", "negative_review_examples = [x[0] for x in negative_review_examples]" ] }, { "cell_type": "code", "execution_count": 0, "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "32252fa9-2a79-4366-873f-260f7c0d04bf", "showTitle": false, "title": "" } }, "outputs": [], "source": [ "print(\"Positive Example\")\n", "print(positive_review_examples[0][:150] + \"...\")\n", "print(\"\\n\" + \"-\" * 15 + \"\\n\")\n", "print(\"Negative Example\")\n", "print(negative_review_examples[0][:150] + \"...\")" ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": {}, "inputWidgets": {}, "nuid": "bee47a49-6d4d-438f-880b-de31071f3486", "showTitle": false, "title": "" } }, "source": [ "#### Positive Prompt Variations\n", "\n", "Starting with our positive examples, here are some prompts we can test:" ] }, { "cell_type": "code", "execution_count": 0, "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "6d652a93-5d23-4bd3-9e79-5c22ae96d5a1", "showTitle": false, "title": "" } }, "outputs": [], "source": [ "# Prompt Variations\n", "positive_prompt_1 = \"\"\"Below is an instruction that describes a task. Write a response that appropriately completes the request.\n", "### Instruction:\n", "Provide a three bullet-point summary capturing what customers liked about this book using the reviews below.\n", "\n", "Reviews: {review}\n", "\n", "### Response:\n", "\"\"\"\n", "\n", "positive_prompt_2 = \"\"\"Below is an instruction that describes a task. Write a response that appropriately completes the request.\n", "### Instruction:\n", "Identify three aspects that readers liked about the book and provide a summary for each from the reviews below.\n", "\n", "Reviews: {review}\n", "\n", "### Response:\n", "\"\"\"\n", "\n", "positive_prompt_3 = \"\"\"Below is an instruction that describes a task. Write a response that appropriately completes the request.\n", "### Instruction:\n", "Distill and provide three bullet points capturing what customers most appreciated about the book from the reviews below.\n", "\n", "Reviews: {review}\n", "\n", "### Response:\n", "\"\"\"\n", "\n", "positive_prompt_4 = \"\"\"Below is an instruction that describes a task. 
Write a response that appropriately completes the request.\n", "### Instruction:\n", "Identify three distinct and specific aspects that readers enjoyed about the book from the reviews below, and provide a bullet point summary for each.\n", "\n", "Reviews: {review}\n", "\n", "### Response:\n", "\"\"\"\n", "\n", "positive_prompt_5 = \"\"\"Below is an instruction that describes a task. Write a response that appropriately completes the request.\n", "### Instruction:\n", "Analyze the book reviews below and identify three distinct aspects that readers enjoyed about the book. Return the result as three succinct bullet points.\n", "\n", "Reviews: {review}\n", "\n", "### Response:\n", "\"\"\"\n", "\n", "positive_prompt_6 = \"\"\"Below is an instruction that describes a task. Write a response that appropriately completes the request.\n", "### Instruction:\n", "Analyze the book reviews below and identify three distinct aspects that readers enjoyed about the book. Be sure to include any character dynamics, plot elements, or emotional responses mentioned by the reviewers. Return the result as three succinct bullet points.\n", "\n", "Reviews: {review}\n", "\n", "### Response:\n", "\"\"\"\n", "\n", "positive_prompt_7 = \"\"\"Below is an instruction that describes a task. Write a response that appropriately completes the request.\n", "### Instruction:\n", "Analyze the provided book reviews and identify distinct aspects in three bullet points that readers enjoyed about the book. For each aspect, provide a brief explanation using the specific details mentioned in the reviews, focusing on character dynamics, plot elements, or emotional responses elicited.\n", "\n", "Reviews: {review}\n", "\n", "### Response:\n", "\"\"\"\n", "\n", "# Build a prompts list\n", "all_positive_prompts = [\n", " positive_prompt_1,\n", " positive_prompt_2,\n", " positive_prompt_3,\n", " positive_prompt_4,\n", " positive_prompt_5,\n", " positive_prompt_6,\n", " positive_prompt_7,\n", "]" ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": {}, "inputWidgets": {}, "nuid": "2067414b-b86d-4395-843f-1cdad0a52a32", "showTitle": false, "title": "" } }, "source": [ "#### Build Test Flow\n", "All of them are slightly different from each other. We preserve the generic template suggested on the model's webpage, but alter the instruction slightly in each one to see how the output differs.\n", "\n", "Now, let's write some code to compare them.\n", "\n", "One thing to note is that we are reducing the temperature of our model here. Why? Because we want it to be less creative and to follow the instructions more closely."
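, "\n", "\n", "If you want to see what the temperature knob does in practice, a quick, purely illustrative side-by-side using the pipeline and template we built earlier could look like the sketch below; the two temperature values are arbitrary choices:\n", "\n", "```python\n", "# Same request, two temperatures: the lower one should give a more literal, repeatable answer\n", "request = mpt_template_prompt.format(question=\"What does summarisation mean ?\")\n", "\n", "for temp in (0.1, 0.9):\n", "    out = mpt_pipeline(\n", "        request,\n", "        max_new_tokens=100,\n", "        do_sample=True,\n", "        temperature=temp,\n", "        pad_token_id=tokenizer.eos_token_id,\n", "        eos_token_id=tokenizer.eos_token_id,\n", "    )\n", "    print(f\"--- temperature={temp}\")\n", "    print(out[0][\"generated_text\"].split(request)[-1])\n", "```"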
] }, { "cell_type": "code", "execution_count": 0, "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "8cea8cdf-dd32-454d-93b9-9ae41075ee65", "showTitle": false, "title": "" } }, "outputs": [], "source": [ "# External Imports\n", "import time\n", "\n", "# Create a function for assessment\n", "def timed_generation(prompt, review_example):\n", " # Feed our example to the prompt\n", " request = prompt.format(review=review_example)\n", "\n", " # Record the start time\n", " start_time = time.time()\n", "\n", " # Generate the response\n", " response = mpt_pipeline(\n", " request,\n", " max_new_tokens=150,\n", " temperature=0.1,\n", " do_sample=True,\n", " pad_token_id=tokenizer.eos_token_id,\n", " eos_token_id=tokenizer.eos_token_id,\n", " )\n", "\n", " # Record time elapsed\n", " finish_time = time.time()\n", " elapsed_time = round(finish_time - start_time, 2)\n", "\n", " # Parse the response\n", " response = response[0][\"generated_text\"].split(request)[-1]\n", "\n", " # Form output\n", " results = {\n", " \"prompt\": prompt,\n", " \"review\": review_example,\n", " \"request\": request,\n", " \"elapsed_time\": elapsed_time,\n", " \"response\": response\n", " }\n", " return results" ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": {}, "inputWidgets": {}, "nuid": "1c5badaa-78f9-4d3c-ac8b-9d6be252f0c9", "showTitle": false, "title": "" } }, "source": [ "Generate summaries using the range of prompts defined above." ] }, { "cell_type": "code", "execution_count": 0, "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "a4203254-424f-4279-8c10-11d7e22b2bb3", "showTitle": false, "title": "" } }, "outputs": [], "source": [ "all_results = []\n", "\n", "# For each prompt, try out an example and return results\n", "for select_prompt in all_positive_prompts:\n", " single_result = timed_generation(\n", " prompt=select_prompt, \n", " review_example=positive_review_examples[0]\n", " )\n", " all_results.append(single_result)\n", "\n", "i = 1\n", "for single_result in all_results:\n", " print(\"-\" * 15)\n", " print(f\"MPT-7B-Instruct Test {i}\\n\")\n", " print(\"Prompt:\")\n", " print(single_result[\"prompt\"])\n", " print(single_result[\"response\"])\n", " print(\"\\nGeneration time:\", single_result[\"elapsed_time\"], \"seconds\")\n", " print()\n", " i += 1" ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": {}, "inputWidgets": {}, "nuid": "e57729d8-dd3c-47be-ba8d-001f553cced2", "showTitle": false, "title": "" } }, "source": [ "#### Negative Prompt Forming\n", "\n", "Following the same flow for the negative prompts, where we want to assess negative reviews and understand what customers disliked and how the product could be improved." ] }, { "cell_type": "code", "execution_count": 0, "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": {}, "inputWidgets": {}, "nuid": "491000e4-52af-402b-ae0f-65a49b300220", "showTitle": false, "title": "" } }, "outputs": [], "source": [ "negative_prompt_1 = \"\"\"Below is an instruction that describes a task. 
Write a response that appropriately completes the request.\n", "### Instruction:\n", "Provide a three bullet-point summary capturing what customers disliked about this book using the reviews below.\n", "\n", "Reviews: {review}\n", "\n", "### Response:\n", "\"\"\"\n", "\n", "negative_prompt_2 = \"\"\"Below is an instruction that describes a task. Write a response that appropriately completes the request.\n", "### Instruction:\n", "Identify three aspects that readers disliked about the book and provide a summary for each from the reviews below.\n", "\n", "Reviews: {review}\n", "\n", "### Response:\n", "\"\"\"\n", "\n", "negative_prompt_3 = \"\"\"Below is an instruction that describes a task. Write a response that appropriately completes the request.\n", "### Instruction:\n", "Distill and provide three bullet points capturing what customers most criticized about the book from the reviews below.\n", "\n", "Reviews: {review}\n", "\n", "### Response:\n", "\"\"\"\n", "\n", "negative_prompt_4 = \"\"\"Below is an instruction that describes a task. Write a response that appropriately completes the request.\n", "### Instruction:\n", "Identify three distinct and specific aspects that readers did not enjoy about the book from the reviews below, and provide a bullet point summary for each.\n", "\n", "Reviews: {review}\n", "\n", "### Response:\n", "\"\"\"\n", "\n", "negative_prompt_5 = \"\"\"Below is an instruction that describes a task. Write a response that appropriately completes the request.\n", "### Instruction:\n", "Analyze the book reviews below and identify three distinct aspects that readers disliked about the book. Return the result as three succinct bullet points.\n", "\n", "Reviews: {review}\n", "\n", "### Response:\n", "\"\"\"\n", "\n", "negative_prompt_6 = \"\"\"Below is an instruction that describes a task. Write a response that appropriately completes the request.\n", "### Instruction:\n", "Analyze the provided book reviews and identify three distinct aspects that readers disliked about the book. Be sure to include any character dynamics, plot elements, or emotional responses mentioned by the reviewers that led to negative experiences. Return the answer as three succinct bullet points.\n", "\n", "Reviews: {review}\n", "\n", "### Response:\n", "\"\"\"\n", "\n", "negative_prompt_7 = \"\"\"Below is an instruction that describes a task. Write a response that appropriately completes the request.\n", "### Instruction:\n", "Analyze the provided book reviews and identify distinct aspects in three bullet points that readers disliked about the book. 
For each aspect, provide a brief explanation using the specific details mentioned in the reviews, focusing on character dynamics, plot elements, or emotional responses elicited.\n", "\n", "Reviews: {review}\n", "\n", "### Response:\n", "\"\"\"\n", "\n", "# Build a full list of prompts\n", "all_negative_prompts = [\n", " negative_prompt_1,\n", " negative_prompt_2,\n", " negative_prompt_3,\n", " negative_prompt_4,\n", " negative_prompt_5,\n", " negative_prompt_6,\n", " negative_prompt_7,\n", "]" ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": {}, "inputWidgets": {}, "nuid": "7cace0bf-4890-4a99-a585-63cd6f2083e1", "showTitle": false, "title": "" } }, "source": [ "Running the tests for the negative prompts." ] }, { "cell_type": "code", "execution_count": 0, "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": {}, "inputWidgets": {}, "nuid": "8b98dd1e-af6a-4266-8d01-32c3cdada7bb", "showTitle": false, "title": "" } }, "outputs": [], "source": [ "all_negative_results = []\n", "\n", "# For each prompt, try out an example and return results\n", "for select_prompt in all_negative_prompts:\n", " single_result = timed_generation(\n", " prompt=select_prompt, \n", " review_example=negative_review_examples[0]\n", " )\n", " all_negative_results.append(single_result)\n", "\n", "i = 1\n", "for single_result in all_negative_results:\n", " print(\"-\" * 15)\n", " print(f\"MPT-7B-Instruct Test {i}\\n\")\n", " print(\"Prompt:\")\n", " print(single_result[\"prompt\"])\n", " print(single_result[\"response\"])\n", " print(\"\\nGeneration time:\", single_result[\"elapsed_time\"], \"seconds\")\n", " print()\n", " i += 1" ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": {}, "inputWidgets": {}, "nuid": "0dc5db32-1fab-46d5-922a-a61c7af5d7f1", "showTitle": false, "title": "" } }, "source": [ "### Further Prompt Testing\n", "\n", "As we can see from above, each prompt pushes the model to respond in a different way. Some of them make the model respond with succinct answers, whereas others make it explain in more detail.\n", "\n", "The generation time recorded is also a good metric to keep an eye on. It tells us how computationally intensive our task is going to be, and it's always good to consider that too. The longer generation takes, the more resources we will have to use to summarise. So, if short answers suffice, we can opt for those and spend fewer resources. However, if we want more detail, we can accept the performance trade-off.\n", "\n", "\n", "Deducing from the examples above, it looks like `Prompt 5` can work nicely for both positive and negative examples. 
It goes into detail, but not as much as Prompt 6, and it still gets things done quickly.\n", "\n", "Another test is to see how it does with a variety of examples:" ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": {}, "inputWidgets": {}, "nuid": "7d230d85-73fc-4d90-9b0e-ccfc8f41daea", "showTitle": false, "title": "" } }, "source": [ "#### Positive Prompt Testing with More Examples\n", "\n", "Let's use the prompt we selected for positive review summarisation and see how it performs with other examples." ] }, { "cell_type": "code", "execution_count": 0, "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "0ab161e6-8124-4754-b4bf-d4875a6e6531", "showTitle": false, "title": "" } }, "outputs": [], "source": [ "# Specify the selected prompt\n", "selected_positive_test_prompt = positive_prompt_5\n", "\n", "# Run summarisation for a variety of examples\n", "variety_review_results = []\n", "for select_review in positive_review_examples:\n", " single_result = timed_generation(\n", " prompt=selected_positive_test_prompt, \n", " review_example=select_review\n", " )\n", " variety_review_results.append(single_result)\n", "\n", "# Print Examples\n", "i = 1\n", "for single_result in variety_review_results:\n", " print(\"-\" * 15)\n", " print(f\"MPT-7B-Instruct Test {i}\\n\")\n", " print(\"Review:\")\n", " print(single_result[\"review\"][:350] + \"...\")\n", " print(\"\\nResponse:\")\n", " print(single_result[\"response\"])\n", " print(\"\\nGeneration time:\", single_result[\"elapsed_time\"], \"seconds\")\n", " print()\n", " i += 1" ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": {}, "inputWidgets": {}, "nuid": "94055b6d-2d74-4e72-a859-91a21119a8a0", "showTitle": false, "title": "" } }, "source": [ "#### Negative Prompt Testing with More Examples\n", "\n", "Following the same practice here, but with the prompt selected for the negative reviews."
] }, { "cell_type": "code", "execution_count": 0, "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "fcb15487-03c3-498b-96fb-33661d149597", "showTitle": false, "title": "" } }, "outputs": [], "source": [ "# Specify the selected prompt\n", "selected_negative_test_prompt = negative_prompt_5\n", "\n", "# Generating summaries\n", "variety_negative_review_results = []\n", "for select_review in negative_review_examples:\n", " single_result = timed_generation(\n", " prompt=selected_negative_test_prompt, \n", " review_example=select_review\n", " )\n", " variety_negative_review_results.append(single_result)\n", "\n", "# Display\n", "i = 1\n", "for single_result in variety_negative_review_results:\n", " print(\"-\" * 15)\n", " print(f\"MPT-7B-Instruct Test {i}\\n\")\n", " print(\"Review:\")\n", " print(single_result[\"review\"][:350] + \"...\")\n", " print(\"\\nResponse:\")\n", " print(single_result[\"response\"])\n", " print(\"\\nGeneration time:\", single_result[\"elapsed_time\"], \"seconds\")\n", " print()\n", " i += 1" ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": {}, "inputWidgets": {}, "nuid": "c07ccedb-0a90-4272-867c-23587a16d459", "showTitle": false, "title": "" } }, "source": [ "### Adding Prompts to Data\n", "\n", "Now that we have selected our prompts, we can add them to our dataset and save it, so that we can use them easily later on in the summarisation step.\n", "\n", "First, let's start by declaring our selections:" ] }, { "cell_type": "code", "execution_count": 0, "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "bcbb9b49-6c71-429a-b91b-88e6ac0dace9", "showTitle": false, "title": "" } }, "outputs": [], "source": [ "# Define the selected positive prompt\n", "selected_positive_prompt = positive_prompt_5\n", "\n", "# Define the selected negative prompt\n", "selected_negative_prompt = negative_prompt_5" ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": {}, "inputWidgets": {}, "nuid": "255df347-0809-4e7d-b464-d323c0da9976", "showTitle": false, "title": "" } }, "source": [ "Now, we can create a UDF which takes these into account and formats our text as we like. This UDF simply takes a concatenated review from our dataset and, depending on whether it is a positive or negative example, wraps the corresponding instruction around it for every row."
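, "\n", "\n", "After the cell below has added the `model_instruction` column, a quick spot check like this sketch (it only relies on the `batched_instructions_df` dataframe and the columns created below) can confirm that the right prompt was wrapped around each class of review:\n", "\n", "```python\n", "# Print the first formatted instruction for a high- and a low-rated batch\n", "for rating_class in (\"high\", \"low\"):\n", "    example = (\n", "        batched_instructions_df\n", "        .filter(SF.col(\"star_rating_class\") == rating_class)\n", "        .select(\"model_instruction\")\n", "        .first()[0]\n", "    )\n", "    print(f\"--- {rating_class} ---\")\n", "    print(example[:300] + \"...\")\n", "```"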
] }, { "cell_type": "code", "execution_count": 0, "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "56179f25-3233-424b-ba6a-1bcd770db648", "showTitle": false, "title": "" } }, "outputs": [], "source": [ "# External Imports\n", "from pyspark.sql import functions as SF\n", "from pyspark.sql.types import StringType\n", "import pandas as pd\n", "\n", "# Build Instruction Builder UDF\n", "@SF.udf(\"string\")\n", "def build_instructions(review, rating_class):\n", " instructed_review = \"\"\n", " if rating_class == \"high\":\n", " instructed_review = selected_positive_prompt.format(review=review)\n", " elif rating_class == \"low\":\n", " instructed_review = selected_negative_prompt.format(review=review)\n", " return instructed_review\n", "\n", "# Apply\n", "batched_instructions_df = (\n", " batched_reviews_df\n", " .withColumn(\n", " \"model_instruction\",\n", " build_instructions(SF.col(\"concat_review_text\"), SF.col(\"star_rating_class\")),\n", " )\n", ")" ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": {}, "inputWidgets": {}, "nuid": "f5597633-cda1-4467-8ff3-6f00a95eca86", "showTitle": false, "title": "" } }, "source": [ "### Save Table with Model Instructions\n", "\n", "For the final step of our notebook, we can save the data. We are going to use this table in the next notebook, where we will feed the generated instructions to our LLM." ] }, { "cell_type": "code", "execution_count": 0, "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "86a97083-7f99-4f19-bb43-819c0bd27792", "showTitle": false, "title": "" } }, "outputs": [], "source": [ "# Save the instruction-formatted reviews\n", "(\n", " batched_instructions_df\n", " .write\n", " .mode(\"overwrite\")\n", " .option(\"overwriteSchema\", \"true\")\n", " .saveAsTable(\"batched_instructions\")\n", ")" ] } ], "metadata": { "application/vnd.databricks.v1+notebook": { "dashboards": [], "language": "python", "notebookMetadata": { "mostRecentlyExecutedCommandWithImplicitDF": { "commandId": 412298436463828, "dataframes": [ "_sqldf" ] }, "pythonIndentUnit": 4 }, "notebookName": "03-prompt-engineering", "widgets": {} }, "language_info": { "name": "python" } }, "nbformat": 4, "nbformat_minor": 0 }