{"cells": [{"cell_type": "markdown", "metadata": {}, "source": ["# Entity Recognition\n", "\n", "In this notebook, we'll deploy and use an entity recognition model\n", "from the [spaCy](spacy.io) library.\n", "\n", "**Note**: When running this notebook on SageMaker Studio, you should make\n", "sure the 'SageMaker JumpStart PyTorch 1.0' image/kernel is used. When\n", "running this notebook on SageMaker Notebook Instance, you should make\n", "sure the 'sagemaker-soln' kernel is used."]}, {"cell_type": "markdown", "metadata": {}, "source": ["This solution relies on a config file to run the provisioned AWS resources. Run the cell below to generate that file."]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["import boto3\n", "import os\n", "import json\n", "\n", "client = boto3.client('servicecatalog')\n", "cwd = os.getcwd().split('/')\n", "i= cwd.index('S3Downloads')\n", "pp_name = cwd[i + 1]\n", "pp = client.describe_provisioned_product(Name=pp_name)\n", "record_id = pp['ProvisionedProductDetail']['LastSuccessfulProvisioningRecordId']\n", "record = client.describe_record(Id=record_id)\n", "\n", "keys = [ x['OutputKey'] for x in record['RecordOutputs'] if 'OutputKey' and 'OutputValue' in x]\n", "values = [ x['OutputValue'] for x in record['RecordOutputs'] if 'OutputKey' and 'OutputValue' in x]\n", "stack_output = dict(zip(keys, values))\n", "\n", "with open(f'/root/S3Downloads/{pp_name}/stack_outputs.json', 'w') as f:\n", " json.dump(stack_output, f)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["We start by importing a variety of packages that will be used throughout\n", "the notebook. One of the most important packages is the Amazon SageMaker\n", "Python SDK (i.e. `import sagemaker`). We also import modules from our own\n", "custom (and editable) package that can be found at `../package`."]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["import boto3\n", "import sagemaker\n", "from sagemaker.pytorch import PyTorchModel\n", "import sys\n", "\n", "sys.path.insert(0, '../package')\n", "from package import config, utils"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Up next, we define the current folder and create a SageMaker client (from\n", "`boto3`). We can use the SageMaker client to call SageMaker APIs\n", "directly, as an alternative to using the Amazon SageMaker SDK. We'll use\n", "it at the end of the notebook to delete certain resources that are\n", "created in this notebook."]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["current_folder = utils.get_current_folder(globals())\n", "sagemaker_client = boto3.client('sagemaker')"]}, {"cell_type": "markdown", "metadata": {}, "source": ["We'll use the unique solution prefix to name the model and endpoint."]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["model_name = \"{}-entity-recognition\".format(config.SOLUTION_PREFIX)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Up next, we need to define the Amazon SageMaker Model which references\n", "the source code and the specifies which container to use. Our pre-trained\n", "model is from the spaCy library which doesn't rely on a specific deep\n", "learning framework. 
"\n", "Just for consistency with the other notebooks, we'll\n", "continue to use the `PyTorchModel` from the Amazon SageMaker Python SDK.\n", "Using `PyTorchModel` and setting the `framework_version` argument means that\n", "our deployed model will run inside a container that has PyTorch\n", "pre-installed. Other requirements can be installed by defining a\n", "`requirements.txt` file at the specified `source_dir` location. We use the\n", "`entry_point` argument to reference the code (within `source_dir`) that\n", "should be run for model inference: functions called `model_fn`, `input_fn`,\n", "`predict_fn` and `output_fn` are expected to be defined there. Lastly, you can\n", "pass `model_data` from a training job, but here we load the\n", "pre-trained model in the source code running on the endpoint. We still\n", "need to provide `model_data`, so we pass an empty archive."]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["model = PyTorchModel(\n", "    name=model_name,\n", "    model_data=f'{config.SOURCE_S3_PATH}/models/empty.tar.gz',\n", "    entry_point='entry_point.py',\n", "    source_dir='../containers/entity_recognition',\n", "    role=config.IAM_ROLE,\n", "    framework_version='1.5.0',\n", "    py_version='py3',\n", "    code_location='s3://' + config.S3_BUCKET + '/code'\n", ")"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Using this Amazon SageMaker Model, we can deploy an HTTPS endpoint on a\n", "dedicated instance. We choose to deploy the endpoint on a single\n", "ml.p3.2xlarge instance (or an ml.g4dn.2xlarge instance if ml.p3.2xlarge is\n", "unavailable in this region). You can expect this deployment step to take\n", "around 5 minutes. After approximately 15 dashes, you should see an\n", "exclamation mark, which indicates a successful deployment."]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["from sagemaker.serializers import JSONSerializer\n", "from sagemaker.deserializers import JSONDeserializer\n", "\n", "predictor = model.deploy(\n", "    endpoint_name=model_name,\n", "    instance_type=config.HOSTING_INSTANCE_TYPE,\n", "    initial_instance_count=1,\n", "    serializer=JSONSerializer(),\n", "    deserializer=JSONDeserializer()\n", ")"]}, {"cell_type": "markdown", "metadata": {}, "source": ["If you're updating the model during development and run into issues\n", "because the model, endpoint config or endpoint already exists, you can\n", "delete the existing resources by uncommenting and running the following\n", "commands:"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["# sagemaker_client.delete_endpoint(EndpointName=model_name)\n", "# sagemaker_client.delete_endpoint_config(EndpointConfigName=model_name)\n", "# sagemaker_client.delete_model(ModelName=model_name)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["When calling our new endpoint from the notebook, we use an Amazon\n", "SageMaker SDK\n", "[`Predictor`](https://sagemaker.readthedocs.io/en/stable/predictors.html).\n", "A `Predictor` is used to send data to an endpoint (as part of a request)\n", "and to interpret the response. Our `model.deploy` command returned a\n", "`Predictor` but, by default, it will send and receive numpy arrays. Our\n", "endpoint expects to receive (and also sends) JSON formatted objects, so\n", "we modify the `Predictor` to use JSON instead of the PyTorch endpoint\n", "default of numpy arrays.\n",
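"\n", "Concretely, both the request body and the response body are JSON documents. The sketch below is purely illustrative: the per-entity field names are defined by the entry point code, but the top-level `entities` and `noun_chunks` keys match what we inspect later in this notebook.\n", "\n", "```python\n", "# Hypothetical request/response pair, shown only to illustrate the JSON structure\n", "# (the field names inside each entity are illustrative).\n", "request = {'text': 'Amazon was founded in Seattle.'}\n", "response = {\n", "    'entities': [\n", "        {'text': 'Amazon', 'start': 0, 'end': 6, 'label': 'ORG'},\n", "        {'text': 'Seattle', 'start': 22, 'end': 29, 'label': 'GPE'},\n", "    ],\n", "    'noun_chunks': [\n", "        {'text': 'Amazon', 'start': 0, 'end': 6},\n", "        {'text': 'Seattle', 'start': 22, 'end': 29},\n", "    ],\n", "}\n", "```\n",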
"\n", "JSON is used here because it is a standard\n", "endpoint format and the endpoint response can contain nested data\n", "structures."]}, {"cell_type": "markdown", "metadata": {}, "source": ["With our model successfully deployed and our predictor configured, we can\n", "try the entity recognizer out on example inputs. All we need to do is\n", "construct a dictionary object with a single key called `text` and provide\n", "the input string. We call `predict` on our predictor and we should\n", "get a response from the endpoint that contains our entities."]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["data = {'text': 'Amazon SageMaker is a fully managed service that provides every developer and data scientist with the ability to build, train, and deploy machine learning (ML) models quickly.'}\n", "response = predictor.predict(data=data)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Now that we have the response, we can print out the named entities and noun\n", "chunks that have been extracted from the text above. You will see the\n", "verbatim text of each alongside its location in the original text (given\n", "by start and end character indexes). Usually a document will contain many\n", "more noun chunks than named entities, but named entities have an\n", "additional field called `label` that indicates the class of the named\n", "entity. Since the spaCy model was trained on the OntoNotes 5 corpus, it\n", "uses the following classes:\n", "\n", "| TYPE | DESCRIPTION |\n", "|---|---|\n", "| PERSON | People, including fictional. |\n", "| NORP | Nationalities or religious or political groups. |\n", "| FAC | Buildings, airports, highways, bridges, etc. |\n", "| ORG | Companies, agencies, institutions, etc. |\n", "| GPE | Countries, cities, states. |\n", "| LOC | Non-GPE locations, mountain ranges, bodies of water. |\n", "| PRODUCT | Objects, vehicles, foods, etc. (Not services.) |\n", "| EVENT | Named hurricanes, battles, wars, sports events, etc. |\n", "| WORK_OF_ART | Titles of books, songs, etc. |\n", "| LAW | Named documents made into laws. |\n", "| LANGUAGE | Any named language. |\n", "| DATE | Absolute or relative dates or periods. |\n", "| TIME | Times smaller than a day. |\n", "| PERCENT | Percentage, including \u201c%\u201d. |\n", "| MONEY | Monetary values, including unit. |\n", "| QUANTITY | Measurements, as of weight or distance. |\n", "| ORDINAL | \u201cfirst\u201d, \u201csecond\u201d, etc. |\n", "| CARDINAL | Numerals that do not fall under another type. |"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["print(response['entities'])\n", "print(response['noun_chunks'])"]}, {"cell_type": "markdown", "metadata": {}, "source": ["You can try more examples above, but note that this model has been\n", "pretrained on the OntoNotes 5 dataset.\n",
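"\n", "For example, here is a small, hypothetical follow-up that reuses the same predictor and keeps only organization and location entities from the response, using the `label` field described in the table above:\n", "\n", "```python\n", "# Hypothetical follow-up request, reusing the predictor defined above.\n", "sample = {'text': 'Jeff Bezos founded Amazon in Seattle in 1994.'}\n", "result = predictor.predict(data=sample)\n", "\n", "# Keep only organization (ORG) and location (GPE) entities.\n", "print([e for e in result['entities'] if e['label'] in ('ORG', 'GPE')])\n", "```\n",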
"\n", "If the entity classes above don't match your domain, you may need to fine-tune this\n", "model with your own named entity recognition data to obtain better results."]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Clean Up\n", "\n", "When you've finished with the entity recognition endpoint (and associated\n", "endpoint-config), make sure that you delete it to avoid accidental\n", "charges."]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["sagemaker_client.delete_endpoint(EndpointName=model_name)\n", "sagemaker_client.delete_endpoint_config(EndpointConfigName=model_name)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Next Stage\n", "\n", "We've just looked at how you can extract named entities and noun chunks\n", "from a document. Up next, we'll look at a technique that can be used to\n", "classify relationships between entities.\n", "\n", "[Click here to continue.](./4_relationship_extraction.ipynb)"]}], "metadata": {"jupytext": {"cell_metadata_filter": "-all", "main_language": "python", "notebook_metadata_filter": "-all"}, "kernelspec": {"display_name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-east-1:081325390199:image/1.8.1-cpu-py36", "language": "python", "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-east-1:081325390199:image/1.8.1-cpu-py36"}}, "nbformat": 4, "nbformat_minor": 4}