{ "cells": [ { "cell_type": "markdown", "id": "02e90d5c", "metadata": {}, "source": [ "# Accelerate BERT Inference with Hugging Face Transformers and AWS inferentia" ] }, { "cell_type": "markdown", "id": "5a1644f1", "metadata": {}, "source": [ "In this end-to-end tutorial, you will learn how to speed up BERT inference for text classification with Hugging Face Transformers, Amazon SageMaker, and AWS Inferentia. \n", "\n", "You will learn how to: \n", "\n", "1. Convert your Hugging Face Transformer to AWS Neuron (Inferentia)\n", "2. Create a custom `inference.py` script for `text-classification`\n", "3. Create and upload the neuron model and inference script to Amazon S3\n", "4. Deploy a Real-time Inference Endpoint on Amazon SageMaker\n", "5. Run and evaluate Inference performance of BERT on Inferentia\n", "\n", "Let's get started! 🚀\n", "\n", "---\n", "\n", "*If you are going to use Sagemaker in a local environment (not SageMaker Studio or Notebook Instances). You need access to an IAM Role with the required permissions for Sagemaker. You can find [here](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html) more about it.*" ] }, { "cell_type": "markdown", "id": "e3db68e5", "metadata": {}, "source": [ "# 1. Convert your Hugging Face Transformer to AWS Neuron\n", "\n", "We are going to use the [AWS Neuron SDK for AWS Inferentia](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/index.html). The Neuron SDK includes a deep learning compiler, runtime, and tools for converting and compiling PyTorch and TensorFlow models to neuron compatible models, which can be run on [EC2 Inf1 instances](https://aws.amazon.com/ec2/instance-types/inf1/). \n", "\n", "As a first step, we need to install the [Neuron SDK](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-intro/neuron-install-guide.html) and the required packages.\n", "\n", "*Tip: If you are using Amazon SageMaker Notebook Instances or Studio you can go with the `conda_python3` conda kernel.*\n" ] }, { "cell_type": "code", "execution_count": null, "id": "69c59d90", "metadata": {}, "outputs": [], "source": [ "# Set Pip repository to point to the Neuron repository\n", "!pip config set global.extra-index-url https://pip.repos.neuron.amazonaws.com\n", "\n", "# Install Neuron PyTorch\n", "!pip install torch-neuron==1.9.1.* neuron-cc[tensorflow] sagemaker>=2.79.0 transformers==4.12.3 --upgrade" ] }, { "cell_type": "markdown", "id": "ce0ef431", "metadata": {}, "source": [ "After we have installed the Neuron SDK we can convert load and convert our model. Neuron models are converted using `torch_neuron` with its `trace` method similar to `torchscript`. You can find more information in our [documentation](https://huggingface.co/docs/transformers/serialization#torchscript).\n", "\n", "To be able to convert our model we first need to select the model we want to use for our text classification pipeline from [hf.co/models](http://hf.co/models). For this example lets go with [distilbert-base-uncased-finetuned-sst-2-english](https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english) but this can be easily adjusted with other BERT-like models." 
] }, { "cell_type": "code", "execution_count": 1, "id": "96d8dfea", "metadata": {}, "outputs": [], "source": [ "model_id = \"distilbert-base-uncased-finetuned-sst-2-english\"" ] }, { "cell_type": "markdown", "id": "9e4386d9", "metadata": {}, "source": [ "At the time of writing, the [AWS Neuron SDK does not support dynamic shapes](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-guide/models/models-inferentia.html#dynamic-shapes), which means that the input size needs to be static for compiling and inference. \n", "\n", "In simpler terms, this means when the model is compiled with an input of batch size 1 and sequence length of 16. The model can only run inference on inputs with the same shape.\n", "\n", "_When using a `t2.medium` instance the compiling takes around 2-3 minutes_ " ] }, { "cell_type": "code", "execution_count": 7, "id": "1c22e8d5", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Couldn't call 'get_role' to get Role ARN from role name philippschmid to get Role path.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "sagemaker role arn: arn:aws:iam::558105141721:role/sagemaker_execution_role\n", "sagemaker bucket: sagemaker-us-east-1-558105141721\n", "sagemaker session region: us-east-1\n" ] } ], "source": [ "import os\n", "import tensorflow # to workaround a protobuf version conflict issue\n", "import torch\n", "import torch.neuron\n", "from transformers import AutoTokenizer, AutoModelForSequenceClassification\n", "\n", "\n", "# load tokenizer and model\n", "tokenizer = AutoTokenizer.from_pretrained(model_id)\n", "model = AutoModelForSequenceClassification.from_pretrained(model_id, torchscript=True)\n", "\n", "# create dummy input for max length 128\n", "dummy_input = \"dummy input which will be padded later\"\n", "max_length = 128\n", "embeddings = tokenizer(dummy_input, max_length=max_length, padding=\"max_length\",return_tensors=\"pt\")\n", "neuron_inputs = tuple(embeddings.values())\n", "\n", "# compile model with torch.neuron.trace and update config\n", "model_neuron = torch.neuron.trace(model, neuron_inputs)\n", "model.config.update({\"traced_sequence_length\": max_length})\n", "\n", "# save tokenizer, neuron model and config for later use\n", "save_dir=\"tmp\"\n", "os.makedirs(\"tmp\",exist_ok=True)\n", "model_neuron.save(os.path.join(save_dir,\"neuron_model.pt\"))\n", "tokenizer.save_pretrained(save_dir)\n", "model.config.save_pretrained(save_dir)" ] }, { "cell_type": "markdown", "id": "9997e9db", "metadata": {}, "source": [ "# 2. Create a custom inference.py script for text-classification\n", "\n", "The [Hugging Face Inference Toolkit](https://github.com/aws/sagemaker-huggingface-inference-toolkit) supports zero-code deployments on top of the [pipeline feature](https://huggingface.co/transformers/main_classes/pipelines.html) from 🤗 Transformers. This allows users to deploy Hugging Face transformers without an inference script [[Example](https://github.com/huggingface/notebooks/blob/master/sagemaker/11_deploy_model_from_hf_hub/deploy_transformer_model_from_hf_hub.ipynb)]. \n", "\n", "Currently is this feature not supported with AWS Inferentia, which means we need to provide an `inference.py` for running inference. \n", "\n", "*If you would be interested in support for zero-code deployments for inferentia let us know on the [forum](https://discuss.huggingface.co/c/sagemaker/17).*\n", "\n", "---\n", "\n", "To use the inference script, we need to create an `inference.py` script. 
{ "cell_type": "markdown", "id": "9997e9db", "metadata": {}, "source": [ "# 2. Create a custom inference.py script for text-classification\n", "\n", "The [Hugging Face Inference Toolkit](https://github.com/aws/sagemaker-huggingface-inference-toolkit) supports zero-code deployments on top of the [pipeline feature](https://huggingface.co/transformers/main_classes/pipelines.html) from 🤗 Transformers. This allows users to deploy Hugging Face transformers without an inference script [[Example](https://github.com/huggingface/notebooks/blob/master/sagemaker/11_deploy_model_from_hf_hub/deploy_transformer_model_from_hf_hub.ipynb)]. \n", "\n", "Currently, this feature is not supported with AWS Inferentia, which means we need to provide an `inference.py` script for running inference. \n", "\n", "*If you are interested in support for zero-code deployments for Inferentia, let us know on the [forum](https://discuss.huggingface.co/c/sagemaker/17).*\n", "\n", "---\n", "\n", "To provide the inference logic, we need to create an `inference.py` script. In our example, we are going to overwrite the `model_fn` to load our neuron model and the `predict_fn` to create a text-classification pipeline. \n", "\n", "If you want to know more about the `inference.py` script, check out this [example](https://github.com/huggingface/notebooks/blob/master/sagemaker/17_custom_inference_script/sagemaker-notebook.ipynb). It explains, amongst other things, what `model_fn` and `predict_fn` do." ] }, { "cell_type": "code", "execution_count": 17, "id": "b4246c06", "metadata": {}, "outputs": [], "source": [ "!mkdir -p code" ] }, { "cell_type": "markdown", "id": "ce675df9", "metadata": {}, "source": [ "We are setting `NEURON_RT_NUM_CORES=1` to make sure that each HTTP worker uses one Neuron core to maximize throughput." ] }, { "cell_type": "code", "execution_count": 38, "id": "3ce41529", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Overwriting code/inference.py\n" ] } ], "source": [ "%%writefile code/inference.py\n", "\n", "import os\n", "from transformers import AutoConfig, AutoTokenizer\n", "import torch\n", "import torch.neuron\n", "\n", "# To use one neuron core per worker\n", "os.environ[\"NEURON_RT_NUM_CORES\"] = \"1\"\n", "\n", "# saved weights name\n", "AWS_NEURON_TRACED_WEIGHTS_NAME = \"neuron_model.pt\"\n", "\n", "\n", "def model_fn(model_dir):\n", "    # load tokenizer and neuron model from model_dir\n", "    tokenizer = AutoTokenizer.from_pretrained(model_dir)\n", "    model = torch.jit.load(os.path.join(model_dir, AWS_NEURON_TRACED_WEIGHTS_NAME))\n", "    model_config = AutoConfig.from_pretrained(model_dir)\n", "\n", "    return model, tokenizer, model_config\n", "\n", "\n", "def predict_fn(data, model_tokenizer_model_config):\n", "    # unpack model, tokenizer and model config\n", "    model, tokenizer, model_config = model_tokenizer_model_config\n", "\n", "    # create embeddings for inputs\n", "    inputs = data.pop(\"inputs\", data)\n", "    embeddings = tokenizer(\n", "        inputs,\n", "        return_tensors=\"pt\",\n", "        max_length=model_config.traced_sequence_length,\n", "        padding=\"max_length\",\n", "        truncation=True,\n", "    )\n", "    # convert to tuple for neuron model\n", "    neuron_inputs = tuple(embeddings.values())\n", "\n", "    # run prediction\n", "    with torch.no_grad():\n", "        predictions = model(*neuron_inputs)[0]\n", "        scores = torch.nn.Softmax(dim=1)(predictions)\n", "\n", "    # return dictionary, which will be json serializable\n", "    return [{\"label\": model_config.id2label[item.argmax().item()], \"score\": item.max().item()} for item in scores]" ] }, { "cell_type": "markdown", "id": "144d8ccb", "metadata": {}, "source": [ "# 3. Create and upload the neuron model and inference script to Amazon S3\n", "\n", "Before we can deploy our neuron model to Amazon SageMaker, we need to create a `model.tar.gz` archive with all our model artifacts saved in `tmp/`, e.g. `neuron_model.pt`, and upload it to Amazon S3.\n", "\n", "To do this, we first need to set up our permissions." ] }, { "cell_type": "code", "execution_count": 39, "id": "952983b5", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Couldn't call 'get_role' to get Role ARN from role name philippschmid to get Role path.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "sagemaker role arn: arn:aws:iam::558105141721:role/sagemaker_execution_role\n", "sagemaker bucket: sagemaker-us-east-1-558105141721\n", "sagemaker session region: us-east-1\n" ] } ], "source": [ "import sagemaker\n", "import boto3\n", "\n", "sess = sagemaker.Session()\n", "# sagemaker session bucket -> used for uploading data, models and logs\n", "# sagemaker will automatically create this bucket if it does not exist\n", "sagemaker_session_bucket = None\n", "if sagemaker_session_bucket is None and sess is not None:\n", "    # set to default bucket if a bucket name is not given\n", "    sagemaker_session_bucket = sess.default_bucket()\n", "\n", "try:\n", "    role = sagemaker.get_execution_role()\n", "except ValueError:\n", "    iam = boto3.client('iam')\n", "    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']\n", "\n", "sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)\n", "\n", "print(f\"sagemaker role arn: {role}\")\n", "print(f\"sagemaker bucket: {sess.default_bucket()}\")\n", "print(f\"sagemaker session region: {sess.boto_region_name}\")" ] }, { "cell_type": "markdown", "id": "374ff630", "metadata": {}, "source": [ "Next, we create our `model.tar.gz`. The `inference.py` script will be placed into a `code/` folder." ] }, { "cell_type": "code", "execution_count": null, "id": "a3808b9e", "metadata": {}, "outputs": [], "source": [ "# copy inference.py into the code/ directory of the model directory.\n", "!cp -r code/ tmp/code/\n", "# create a model.tar.gz archive with all the model artifacts and the inference.py script.\n", "%cd tmp\n", "!tar zcvf model.tar.gz *\n", "%cd .." ] },
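{ "cell_type": "markdown", "id": "e7a5c1b9", "metadata": {}, "source": [ "As an optional check, we can list the contents of the archive to make sure the traced model and the `code/inference.py` script are included; a missing `inference.py` is a common reason for failed deployments. This is just a sanity-check sketch, not a required step." ] }, { "cell_type": "code", "execution_count": null, "id": "f1c9d3a7", "metadata": {}, "outputs": [], "source": [ "# list the files packaged into the model archive\n", "!tar -tzf tmp/model.tar.gz" ] },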
] }, { "cell_type": "code", "execution_count": 39, "id": "952983b5", "metadata": {}, "outputs": [], "source": [ "import sagemaker\n", "import boto3\n", "sess = sagemaker.Session()\n", "# sagemaker session bucket -> used for uploading data, models and logs\n", "# sagemaker will automatically create this bucket if it not exists\n", "sagemaker_session_bucket=None\n", "if sagemaker_session_bucket is None and sess is not None:\n", " # set to default bucket if a bucket name is not given\n", " sagemaker_session_bucket = sess.default_bucket()\n", "\n", "try:\n", " role = sagemaker.get_execution_role()\n", "except ValueError:\n", " iam = boto3.client('iam')\n", " role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']\n", "\n", "sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)\n", "\n", "print(f\"sagemaker role arn: {role}\")\n", "print(f\"sagemaker bucket: {sess.default_bucket()}\")\n", "print(f\"sagemaker session region: {sess.boto_region_name}\")" ] }, { "cell_type": "markdown", "id": "374ff630", "metadata": {}, "source": [ "Next, we create our `model.tar.gz`.The `inference.py` script will be placed into a `code/` folder." ] }, { "cell_type": "code", "execution_count": null, "id": "a3808b9e", "metadata": {}, "outputs": [], "source": [ "# copy inference.py into the code/ directory of the model directory.\n", "!cp -r code/ tmp/code/\n", "# create a model.tar.gz archive with all the model artifacts and the inference.py script.\n", "%cd tmp\n", "!tar zcvf model.tar.gz *\n", "%cd .." ] }, { "cell_type": "markdown", "id": "09a6f330", "metadata": {}, "source": [ "Now we can upload our `model.tar.gz` to our session S3 bucket with `sagemaker`." ] }, { "cell_type": "code", "execution_count": 41, "id": "6146af09", "metadata": {}, "outputs": [], "source": [ "from sagemaker.s3 import S3Uploader\n", "\n", "# create s3 uri\n", "s3_model_path = f\"s3://{sess.default_bucket()}/{model_id}\"\n", "\n", "# upload model.tar.gz\n", "s3_model_uri = S3Uploader.upload(local_path=\"tmp/model.tar.gz\",desired_s3_uri=s3_model_path)\n", "print(f\"model artifcats uploaded to {s3_model_uri}\")" ] }, { "cell_type": "markdown", "id": "04e1395a", "metadata": {}, "source": [ "# 4. Deploy a Real-time Inference Endpoint on Amazon SageMaker\n", "\n", "After we have uploaded our `model.tar.gz` to Amazon S3 can we create a custom `HuggingfaceModel`. This class will be used to create and deploy our real-time inference endpoint on Amazon SageMaker." ] }, { "cell_type": "code", "execution_count": null, "id": "41522ef6", "metadata": {}, "outputs": [], "source": [ "from sagemaker.huggingface.model import HuggingFaceModel\n", "\n", "\n", "# create Hugging Face Model Class\n", "huggingface_model = HuggingFaceModel(\n", " model_data=s3_model_uri, # path to your model and script\n", " role=role, # iam role with permissions to create an Endpoint\n", " transformers_version=\"4.12\", # transformers version used\n", " pytorch_version=\"1.9\", # pytorch version used\n", " py_version='py37', # python version used\n", ")\n", "\n", "# Let SageMaker know that we've already compiled the model via neuron-cc\n", "huggingface_model._is_compiled_model = True\n", "\n", "# deploy the endpoint endpoint\n", "predictor = huggingface_model.deploy(\n", " initial_instance_count=1, # number of instances\n", " instance_type=\"ml.inf1.xlarge\" # AWS Inferentia Instance\n", ")" ] }, { "cell_type": "markdown", "id": "1c858560", "metadata": {}, "source": [ "# 5. 
{ "cell_type": "markdown", "id": "0a146346", "metadata": {}, "source": [ "We managed to deploy our neuron-compiled BERT model to AWS Inferentia on Amazon SageMaker. Now, let's test its performance. As a dummy load test, we will loop and send 10,000 synchronous requests to our endpoint." ] }, { "cell_type": "code", "execution_count": null, "id": "b1dcfd37", "metadata": {}, "outputs": [], "source": [ "# send 10000 requests\n", "for i in range(10000):\n", "    resp = predictor.predict(\n", "        data={\"inputs\": \"it 's a charming and often affecting journey .\"}\n", "    )" ] }, { "cell_type": "markdown", "id": "b6b3812f", "metadata": {}, "source": [ "Let's inspect the performance in CloudWatch." ] }, { "cell_type": "code", "execution_count": null, "id": "2a4d916b", "metadata": {}, "outputs": [], "source": [ "print(f\"https://console.aws.amazon.com/cloudwatch/home?region={sess.boto_region_name}#metricsV2:graph=~(metrics~(~(~'AWS*2fSageMaker~'ModelLatency~'EndpointName~'{predictor.endpoint_name}~'VariantName~'AllTraffic))~view~'timeSeries~stacked~false~region~'{sess.boto_region_name}~start~'-PT5M~end~'P0D~stat~'Average~period~30);query=~'*7bAWS*2fSageMaker*2cEndpointName*2cVariantName*7d*20{predictor.endpoint_name}\")" ] }, { "cell_type": "markdown", "id": "af0f26d0", "metadata": {}, "source": [ "The average latency for our BERT model is `5-6ms` for a sequence length of 128. \n", "\n", "![performance](./imgs/performance.png)\n" ] }, { "cell_type": "markdown", "id": "1030c87f", "metadata": {}, "source": [ "### **Delete model and endpoint**\n", "\n", "To clean up, we can delete the model and endpoint." ] }, { "cell_type": "code", "execution_count": null, "id": "d8917d5e", "metadata": {}, "outputs": [], "source": [ "predictor.delete_model()\n", "predictor.delete_endpoint()" ] } ], "metadata": { "interpreter": { "hash": "c281c456f1b8161c8906f4af2c08ed2c40c50136979eaae69688b01f70e9f4a9" }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.12" } }, "nbformat": 4, "nbformat_minor": 5 }