{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# Initialization boilerplate\n", "from typing import *\n", "\n", "import os\n", "import ibm_watson\n", "import ibm_watson.natural_language_understanding_v1 as nlu\n", "import ibm_cloud_sdk_core\n", "import pandas as pd\n", "import text_extensions_for_pandas as tp\n", "import urllib\n", "\n", "import ray\n", "import spacy\n", "import multiprocessing\n", "import time\n", "import threading\n", "import matplotlib.pyplot as plt\n", "\n", "# Remove silly SpaCy warnings about not having a GPU\n", "def fix_spacy_warnings():\n", " import warnings\n", " warnings.filterwarnings(action='ignore', \\\n", " category=UserWarning, message='.*User provided device_type.*')\n", "fix_spacy_warnings()\n", "\n", "# Remove silly Huggingface warnings about not using a useless parallel tokenizer.\n", "os.environ[\"TOKENIZERS_PARALLELISM\"] = \"false\"\n", "\n", "api_key = os.environ.get(\"IBM_API_KEY\")\n", "service_url = os.environ.get(\"IBM_SERVICE_URL\") " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Part 5: Scale up some more with Pandas and Ray\n", "\n", "Let's start by summarizing our progress so far.\n", "\n", "In [Part 1](./Market_Intelligence_Part1.ipynb), we defined an NLP pipeline that used Watson Natural Language Understanding and Text Extensions for Pandas to identify the names of executives in corporate press releases.\n", "\n", "In [Part 2](./Market_Intelligence_Part2.ipynb), we extended our NLP pipeline by using SpaCy and Text Extensions for Pandas to associate each executive name with a job title.\n", "\n", "If we separate the expensive NLP model evaluations from the rest of our code, we can think of this processing pipeline as having five distinct stages, as shown in the following diagram.\n", "\n", "![First version of our processing pipeline](images/pipeline_v1.png)\n", "\n", "In [Part 3](./Market_Intelligence_Part3.ipynb), we applied the semijoin trick 
with Text Extensions for Pandas to reduce the cost of running the SpaCy dependency parser. This change decreased the cost of parsing by a factor of 9 and improved end-to-end running time by a factor of 3.\n", "\n", "In [Part 4](./Market_Intelligence_Part4.ipynb), we showed how you can use [Ray](https://ray.io) to rate-limit requests to a remote web service such as Watson Natural Language Understanding.\n", "\n", "In this final part of the tutorial, we use Ray to take the performance improvements from Part 3 to the next level by applying parallel processing. By the time we're done, we'll have a version of our pipeline that runs **300 times faster** than the pipeline we started with at the beginning of Part 3.\n", "\n" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
urlhtml
0https://newsroom.ibm.com/2020-02-04-The-Avril-...<!DOCTYPE html public \"-//W3C//DTD HTML 4.01 T...
1https://newsroom.ibm.com/2020-02-11-IBM-X-Forc...<!DOCTYPE html public \"-//W3C//DTD HTML 4.01 T...
2https://newsroom.ibm.com/2020-02-18-IBM-Study-...<!DOCTYPE html public \"-//W3C//DTD HTML 4.01 T...
3https://newsroom.ibm.com/2020-02-19-IBM-Power-...<!DOCTYPE html public \"-//W3C//DTD HTML 4.01 T...
4https://newsroom.ibm.com/2020-02-20-Centotrent...<!DOCTYPE html public \"-//W3C//DTD HTML 4.01 T...
.........
186https://newsroom.ibm.com/2021-01-25-OVHcloud-t...<!DOCTYPE html public \"-//W3C//DTD HTML 4.01 T...
187https://newsroom.ibm.com/2021-01-26-Luminor-Ba...<!DOCTYPE html public \"-//W3C//DTD HTML 4.01 T...
188https://newsroom.ibm.com/2021-01-26-DIA-Levera...<!DOCTYPE html public \"-//W3C//DTD HTML 4.01 T...
189https://newsroom.ibm.com/2021-01-26-IBM-Board-...<!DOCTYPE html public \"-//W3C//DTD HTML 4.01 T...
190https://newsroom.ibm.com/2021-01-26-Latin-Amer...<!DOCTYPE html public \"-//W3C//DTD HTML 4.01 T...
\n", "

191 rows × 2 columns

\n", "
" ], "text/plain": [ " url \\\n", "0 https://newsroom.ibm.com/2020-02-04-The-Avril-... \n", "1 https://newsroom.ibm.com/2020-02-11-IBM-X-Forc... \n", "2 https://newsroom.ibm.com/2020-02-18-IBM-Study-... \n", "3 https://newsroom.ibm.com/2020-02-19-IBM-Power-... \n", "4 https://newsroom.ibm.com/2020-02-20-Centotrent... \n", ".. ... \n", "186 https://newsroom.ibm.com/2021-01-25-OVHcloud-t... \n", "187 https://newsroom.ibm.com/2021-01-26-Luminor-Ba... \n", "188 https://newsroom.ibm.com/2021-01-26-DIA-Levera... \n", "189 https://newsroom.ibm.com/2021-01-26-IBM-Board-... \n", "190 https://newsroom.ibm.com/2021-01-26-Latin-Amer... \n", "\n", " html \n", "0 \n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
persontitle
0[1977, 1991): 'Wendi Whitmore'[1993, 2040): 'Vice President, IBM X-Force Thr...
0[1281, 1292): 'Rob DiCicco'[1294, 1348): 'PharmD, Deputy Chief Health Off...
0[1213, 1229): 'Christoph Herman'[1231, 1281): 'SVP and Head of SAP HANA Enterp...
1[2227, 2242): 'Stephen Leonard'[2244, 2282): 'General Manager, IBM Cognitive ...
0[2290, 2298): 'Bob Lord'[2300, 2376): 'IBM Senior Vice President of Co...
.........
0[3113, 3123): 'Mike Doran'[3125, 3156): 'Worldwide Sales Director at IBM'
0[3156, 3170): 'Howard Boville'[3172, 3211): 'Senior Vice President, IBM Hybr...
0[3116, 3139): 'Samuel Brack Co-Founder'[3129, 3154): 'Co-Founder and CTO at DIA'
1[3511, 3525): 'Hillery Hunter'[3527, 3558): 'IBM Fellow, VP & CTO, IBM Cloud'
0[1488, 1498): 'Ana Zamper'[1500, 1535): 'Ecosystem Leader, IBM Latin Ame...
\n", "

260 rows × 2 columns

\n", "" ], "text/plain": [ " person \\\n", "0 [1977, 1991): 'Wendi Whitmore' \n", "0 [1281, 1292): 'Rob DiCicco' \n", "0 [1213, 1229): 'Christoph Herman' \n", "1 [2227, 2242): 'Stephen Leonard' \n", "0 [2290, 2298): 'Bob Lord' \n", ".. ... \n", "0 [3113, 3123): 'Mike Doran' \n", "0 [3156, 3170): 'Howard Boville' \n", "0 [3116, 3139): 'Samuel Brack Co-Founder' \n", "1 [3511, 3525): 'Hillery Hunter' \n", "0 [1488, 1498): 'Ana Zamper' \n", "\n", " title \n", "0 [1993, 2040): 'Vice President, IBM X-Force Thr... \n", "0 [1294, 1348): 'PharmD, Deputy Chief Health Off... \n", "0 [1231, 1281): 'SVP and Head of SAP HANA Enterp... \n", "1 [2244, 2282): 'General Manager, IBM Cognitive ... \n", "0 [2300, 2376): 'IBM Senior Vice President of Co... \n", ".. ... \n", "0 [3125, 3156): 'Worldwide Sales Director at IBM' \n", "0 [3172, 3211): 'Senior Vice President, IBM Hybr... \n", "0 [3129, 3154): 'Co-Founder and CTO at DIA' \n", "1 [3527, 3558): 'IBM Fellow, VP & CTO, IBM Cloud' \n", "0 [1500, 1535): 'Ecosystem Leader, IBM Latin Ame... 
\n", "\n", "[260 rows x 2 columns]" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "\n", "spacy_language_model = spacy.load(\"en_core_web_trf\")\n", "nlu_api = ibm_watson.NaturalLanguageUnderstandingV1(version=\"2021-01-01\", \n", "        authenticator=ibm_cloud_sdk_core.authenticators.IAMAuthenticator(api_key))\n", "nlu_api.set_service_url(service_url)\n", "\n", "def steps_1_through_4(doc_html: str) -> pd.DataFrame:\n", "    # Steps 1 and 2, as implemented in find_persons_quoted_by_name()\n", "    step_1_results = mi.extract_named_entities_and_semantic_roles(doc_html, nlu_api)\n", "    step_2_results = mi.identify_persons_quoted_by_name(step_1_results) \n", "    \n", "    # Steps 3 and 4, as implemented in find_titles_of_persons()\n", "    step_3_results = mi.perform_dependency_parsing(step_1_results[\"analyzed_text\"],\n", "                                                   spacy_language_model)\n", "    step_4_results = mi.extract_titles_of_persons(step_2_results, step_3_results)\n", "    return step_4_results\n", "\n", "# Repeat steps 1-4 on every document\n", "dataframes_to_stack = [\n", "    steps_1_through_4(doc_html) for doc_html in articles[\"html\"]\n", "]\n", "\n", "# Step 5: Merge the results across documents\n", "step_5_results = pd.concat(dataframes_to_stack)\n", "step_5_results" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The exact running time for the above cell varies depending on which machine you use to run the notebook and what other processes are running in the background. On a 2020 MacBook Pro, it takes around 850 seconds. Let's see if we can improve on that running time." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Recap: Accelerating the baseline pipeline with the semijoin trick" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In Part 3, we showed how to improve the end-to-end performance of this pipeline by applying the semijoin trick to the expensive dependency parsing step. 
This change made parsing 9x faster, improving the end-to-end performance by a factor of 3. Here's what our five-stage pipeline looks like after that change." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 1min 52s, sys: 3.12 s, total: 1min 55s\n", "Wall time: 4min 23s\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
persontitle
0[1977, 1991): 'Wendi Whitmore'[1993, 2040): 'Vice President, IBM X-Force Thr...
0[1281, 1292): 'Rob DiCicco'[1294, 1348): 'PharmD, Deputy Chief Health Off...
0[1213, 1229): 'Christoph Herman'[1231, 1281): 'SVP and Head of SAP HANA Enterp...
1[2227, 2242): 'Stephen Leonard'[2244, 2282): 'General Manager, IBM Cognitive ...
0[2290, 2298): 'Bob Lord'[2300, 2376): 'IBM Senior Vice President of Co...
.........
0[3113, 3123): 'Mike Doran'[3125, 3156): 'Worldwide Sales Director at IBM'
0[3156, 3170): 'Howard Boville'[3172, 3211): 'Senior Vice President, IBM Hybr...
0[3116, 3139): 'Samuel Brack Co-Founder'[3131, 3154): '-Founder and CTO at DIA'
1[3511, 3525): 'Hillery Hunter'[3527, 3558): 'IBM Fellow, VP & CTO, IBM Cloud'
0[1488, 1498): 'Ana Zamper'[1500, 1535): 'Ecosystem Leader, IBM Latin Ame...
\n", "

260 rows × 2 columns

\n", "
" ], "text/plain": [ " person \\\n", "0 [1977, 1991): 'Wendi Whitmore' \n", "0 [1281, 1292): 'Rob DiCicco' \n", "0 [1213, 1229): 'Christoph Herman' \n", "1 [2227, 2242): 'Stephen Leonard' \n", "0 [2290, 2298): 'Bob Lord' \n", ".. ... \n", "0 [3113, 3123): 'Mike Doran' \n", "0 [3156, 3170): 'Howard Boville' \n", "0 [3116, 3139): 'Samuel Brack Co-Founder' \n", "1 [3511, 3525): 'Hillery Hunter' \n", "0 [1488, 1498): 'Ana Zamper' \n", "\n", " title \n", "0 [1993, 2040): 'Vice President, IBM X-Force Thr... \n", "0 [1294, 1348): 'PharmD, Deputy Chief Health Off... \n", "0 [1231, 1281): 'SVP and Head of SAP HANA Enterp... \n", "1 [2244, 2282): 'General Manager, IBM Cognitive ... \n", "0 [2300, 2376): 'IBM Senior Vice President of Co... \n", ".. ... \n", "0 [3125, 3156): 'Worldwide Sales Director at IBM' \n", "0 [3172, 3211): 'Senior Vice President, IBM Hybr... \n", "0 [3131, 3154): '-Founder and CTO at DIA' \n", "1 [3527, 3558): 'IBM Fellow, VP & CTO, IBM Cloud' \n", "0 [1500, 1535): 'Ecosystem Leader, IBM Latin Ame... 
\n", "\n", "[260 rows x 2 columns]" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "\n", "spacy_language_model = spacy.load(\"en_core_web_trf\")\n", "nlu_api = ibm_watson.NaturalLanguageUnderstandingV1(version=\"2021-01-01\", \n", " authenticator=ibm_cloud_sdk_core.authenticators.IAMAuthenticator(api_key))\n", "nlu_api.set_service_url(service_url)\n", "\n", "def steps_1_through_4(doc_html: str) -> pd.DataFrame:\n", " # Steps 1 and 2, as implemented in find_persons_quoted_by_name()\n", " step_1_results = mi.extract_named_entities_and_semantic_roles(doc_html, nlu_api)\n", " step_2_results = mi.identify_persons_quoted_by_name(step_1_results) \n", " \n", " # Steps 3 and 4, as implemented in find_titles_of_persons()\n", " step_3_results = mi.perform_targeted_dependency_parsing(\n", " step_2_results[\"person\"],\n", " spacy_language_model)\n", " step_4_results = mi.extract_titles_of_persons(step_2_results, step_3_results)\n", " return step_4_results\n", "\n", "# Repeat steps 1-4 on every document\n", "dataframes_to_stack = [\n", " steps_1_through_4(doc_html) for doc_html in articles[\"html\"]\n", "]\n", "\n", "# Step 5: Merge the results across documents\n", "step_5_results = pd.concat(dataframes_to_stack)\n", "step_5_results" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This code runs in about 280 seconds, a 3x improvement from the initial baseline version.\n", "\n", "Now it's time to deploy parallel processing and improve that performance some more." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## First version: Wrap the entire processing pipeline in a `@ray.remote` decorator." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[Ray](https://ray.io) is a system for parallel processing that is designed to work at a wide variety of scales. As you tune the accuracy of an NLP application, you'll move between different sized inputs. 
You may start out by examining results on individual documents on your laptop, then switch to processing dozens of documents on your laptop. Later you may run through thousands of documents on a server. Eventually, you could deploy to production and process millions of documents on a cluster. And when there's a problem in production, you might find yourself back on your laptop for the next round of tuning.\n", "\n", "Ray lets you code up your processing pipeline once and have it work well across this wide variety of scales. That way you can use the same code at every point throughout this iterative process.\n", "\n", "In the remainder of this part of the tutorial, we're going to show how to parallelize our end-to-end pipeline using Ray. Once we have it running in parallel on a laptop, we'll take the same code over to a large server to get even more parallel speedup.\n", "\n", "The first step is to start up a Ray cluster by calling `ray.init()`." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2022-04-16 14:15:07,175\tINFO services.py:1374 -- View the Ray dashboard at \u001b[1m\u001b[32mhttp://127.0.0.1:8265\u001b[39m\u001b[22m\n" ] } ], "source": [ "def reboot_ray():\n", "    # Shut down any Ray instance left over from an earlier cell, then start fresh\n", "    if ray.is_initialized():\n", "        ray.shutdown()\n", "    ray.init()\n", "    \n", "reboot_ray()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's try the simplest approach possible: Wrap our document processing code in a Ray remote function. 
To create a remote function, we just need to define a Python function and add the `@ray.remote` decorator to the function:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "scrolled": true }, "outputs": [], "source": [ "# NOTE: The blog version of this cell should show this code side-by-side with the \n", "# original code and highlight the changes.\n", "\n", "spacy_language_model = spacy.load(\"en_core_web_trf\")\n", "nlu_api = ibm_watson.NaturalLanguageUnderstandingV1(version=\"2021-01-01\", \n", " authenticator=ibm_cloud_sdk_core.authenticators.IAMAuthenticator(api_key))\n", "nlu_api.set_service_url(service_url)\n", "\n", "# @ray.remote decorator defines a Ray task\n", "@ray.remote\n", "def steps_1_through_4(doc_html: str) -> pd.DataFrame:\n", " step_1_results = mi.extract_named_entities_and_semantic_roles(doc_html, nlu_api)\n", " step_2_results = mi.identify_persons_quoted_by_name(step_1_results) \n", " step_3_results = mi.perform_targeted_dependency_parsing(\n", " step_2_results[\"person\"],\n", " spacy_language_model)\n", " step_4_results = mi.extract_titles_of_persons(step_2_results, step_3_results)\n", " return step_4_results\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can push the processing of a document to the Ray cluster by spawning a copy of the remote function. To spawn a copy of the function, we call the remote function's `remote()` method, which starts running the function in the background and returns a *future* -- a placeholder for the result that the function will produce when it completes.\n", "\n", "If we pass the future to `ray.get()`, then Ray will block until the function has completed, download the result to the calling process, and return the result:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "ename": "TypeError", "evalue": "Could not serialize the function 1686926749.steps_1_through_4. 
Check https://docs.ray.io/en/master/serialization.html#troubleshooting for more information.", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mTypeError\u001b[0m Traceback (most recent call last)", "\u001b[0;32m~/opt/miniconda3/envs/pd/lib/python3.8/site-packages/ray/remote_function.py\u001b[0m in \u001b[0;36m_remote\u001b[0;34m(self, args, kwargs, num_returns, num_cpus, num_gpus, memory, object_store_memory, accelerator_type, resources, max_retries, retry_exceptions, placement_group, placement_group_bundle_index, placement_group_capture_child_tasks, runtime_env, name, scheduling_strategy)\u001b[0m\n\u001b[1;32m 285\u001b[0m \u001b[0;32mtry\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 286\u001b[0;31m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_pickled_function\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mpickle\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mdumps\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_function\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 287\u001b[0m \u001b[0;32mexcept\u001b[0m \u001b[0mTypeError\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0me\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m~/opt/miniconda3/envs/pd/lib/python3.8/site-packages/ray/cloudpickle/cloudpickle_fast.py\u001b[0m in \u001b[0;36mdumps\u001b[0;34m(obj, protocol, buffer_callback)\u001b[0m\n\u001b[1;32m 72\u001b[0m )\n\u001b[0;32m---> 73\u001b[0;31m \u001b[0mcp\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mdump\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mobj\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 74\u001b[0m \u001b[0;32mreturn\u001b[0m 
\u001b[0mfile\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mgetvalue\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m~/opt/miniconda3/envs/pd/lib/python3.8/site-packages/ray/cloudpickle/cloudpickle_fast.py\u001b[0m in \u001b[0;36mdump\u001b[0;34m(self, obj)\u001b[0m\n\u001b[1;32m 619\u001b[0m \u001b[0;32mtry\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 620\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mPickler\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mdump\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mobj\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 621\u001b[0m \u001b[0;32mexcept\u001b[0m \u001b[0mRuntimeError\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0me\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;31mTypeError\u001b[0m: cannot pickle '_thread.RLock' object", "\nThe above exception was the direct cause of the following exception:\n", "\u001b[0;31mTypeError\u001b[0m Traceback (most recent call last)", "\u001b[0;32m/var/folders/bd/k5pyhn0130708d7y9q2pjj380000gn/T/ipykernel_68713/418004903.py\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[0mdoc_html\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0marticles\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0miloc\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m\"html\"\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 3\u001b[0;31m \u001b[0mfuture\u001b[0m \u001b[0;34m=\u001b[0m 
\u001b[0msteps_1_through_4\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mremote\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mdoc_html\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 4\u001b[0m \u001b[0mstep_4_results\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mray\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mfuture\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m~/opt/miniconda3/envs/pd/lib/python3.8/site-packages/ray/remote_function.py\u001b[0m in \u001b[0;36m_remote_proxy\u001b[0;34m(*args, **kwargs)\u001b[0m\n\u001b[1;32m 137\u001b[0m \u001b[0;34m@\u001b[0m\u001b[0mwraps\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mfunction\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 138\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0m_remote_proxy\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 139\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_remote\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mkwargs\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 140\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 141\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mremote\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0m_remote_proxy\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m~/opt/miniconda3/envs/pd/lib/python3.8/site-packages/ray/util/tracing/tracing_helper.py\u001b[0m in \u001b[0;36m_invocation_remote_span\u001b[0;34m(self, args, kwargs, 
*_args, **_kwargs)\u001b[0m\n\u001b[1;32m 293\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mkwargs\u001b[0m \u001b[0;32mis\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 294\u001b[0m \u001b[0;32massert\u001b[0m \u001b[0;34m\"_ray_trace_ctx\"\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mkwargs\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 295\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mmethod\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mkwargs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m*\u001b[0m\u001b[0m_args\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0m_kwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 296\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 297\u001b[0m \u001b[0;32massert\u001b[0m \u001b[0;34m\"_ray_trace_ctx\"\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mkwargs\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m~/opt/miniconda3/envs/pd/lib/python3.8/site-packages/ray/remote_function.py\u001b[0m in \u001b[0;36m_remote\u001b[0;34m(self, args, kwargs, num_returns, num_cpus, num_gpus, memory, object_store_memory, accelerator_type, resources, max_retries, retry_exceptions, placement_group, placement_group_bundle_index, placement_group_capture_child_tasks, runtime_env, name, scheduling_strategy)\u001b[0m\n\u001b[1;32m 291\u001b[0m \u001b[0;34m\"https://docs.ray.io/en/master/serialization.html#troubleshooting \"\u001b[0m \u001b[0;31m# noqa\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 292\u001b[0m \"for more information.\")\n\u001b[0;32m--> 293\u001b[0;31m \u001b[0;32mraise\u001b[0m 
\u001b[0mTypeError\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mmsg\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;32mfrom\u001b[0m \u001b[0me\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 294\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 295\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_last_export_session_and_job\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mworker\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcurrent_session_and_job\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;31mTypeError\u001b[0m: Could not serialize the function 1686926749.steps_1_through_4. Check https://docs.ray.io/en/master/serialization.html#troubleshooting for more information." ] } ], "source": [ "doc_html = articles.iloc[1][\"html\"]\n", "\n", "future = steps_1_through_4.remote(doc_html)\n", "step_4_results = ray.get(future)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Oops, that didn't work! Ray tasks can only operate over Python objects that can be serialized, and that includes the global variables that the task function refers to. A connection to Watson NLU can't be serialized because the connection object has open sockets and lock objects.\n", "\n", "We also can't serialize the language model inside the SpaCy dependency parser (as of SpaCy version 3.0), because that Python object contains locks. 
Even if you could serialize it, doing so would result in Ray shipping a 500 megabyte model to every copy of the task, which would lead to underwhelming performance.\n", "\n", "To make this work, we'll need to pull the initialization code for both models inside the Ray task:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "@ray.remote\n", "def steps_1_through_4(doc_html: str) -> pd.DataFrame:\n", " fix_spacy_warnings() # Workaround for spurious warning messages\n", " spacy_language_model = spacy.load(\"en_core_web_trf\")\n", " nlu_api = ibm_watson.NaturalLanguageUnderstandingV1(version=\"2021-01-01\", \n", " authenticator=ibm_cloud_sdk_core.authenticators.IAMAuthenticator(api_key))\n", " nlu_api.set_service_url(service_url)\n", " \n", " step_1_results = mi.extract_named_entities_and_semantic_roles(doc_html, nlu_api)\n", " step_2_results = mi.identify_persons_quoted_by_name(step_1_results) \n", " step_3_results = mi.perform_targeted_dependency_parsing(\n", " step_2_results[\"person\"],\n", " spacy_language_model)\n", " step_4_results = mi.extract_titles_of_persons(step_2_results, step_3_results)\n", " return step_4_results\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now our Ray task works on a single document:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
persontitle
0[1977, 1991): 'Wendi Whitmore'[1993, 2040): 'Vice President, IBM X-Force Thr...
\n", "
" ], "text/plain": [ " person \\\n", "0 [1977, 1991): 'Wendi Whitmore' \n", "\n", " title \n", "0 [1993, 2040): 'Vice President, IBM X-Force Thr... " ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "step_4_results = ray.get(steps_1_through_4.remote(doc_html))\n", "step_4_results" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To process our entire collection of documents, we modify our `for` loop so that it spawns background tasks for each of the documents instead of processing the documents locally. Then we wrap the `for` loop in a call to `ray.get()`, which tells Ray to wait until all the background tasks have completed and collect the results:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2022-04-16 14:27:28,089\tINFO services.py:1374 -- View the Ray dashboard at \u001b[1m\u001b[32mhttp://127.0.0.1:8265\u001b[39m\u001b[22m\n" ] } ], "source": [ "# Don't include this cell in the blog\n", "reboot_ray()" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[2m\u001b[36m(steps_1_through_4 pid=69306)\u001b[0m Request failed 1 times; retrying in 1 sec\n", "\u001b[2m\u001b[36m(steps_1_through_4 pid=69317)\u001b[0m Request failed 1 times; retrying in 1 sec\n", "\u001b[2m\u001b[36m(steps_1_through_4 pid=69307)\u001b[0m Request failed 1 times; retrying in 1 sec\n", "\u001b[2m\u001b[36m(steps_1_through_4 pid=69312)\u001b[0m Request failed 1 times; retrying in 1 sec\n", "\u001b[2m\u001b[36m(steps_1_through_4 pid=69313)\u001b[0m Request failed 1 times; retrying in 1 sec\n", "\u001b[2m\u001b[36m(steps_1_through_4 pid=69304)\u001b[0m Request failed 1 times; retrying in 1 sec\n", "\u001b[2m\u001b[36m(steps_1_through_4 pid=69315)\u001b[0m Request failed 1 times; retrying in 1 sec\n", "\u001b[2m\u001b[36m(steps_1_through_4 pid=69314)\u001b[0m 
Request failed 1 times; retrying in 1 sec\n", "\u001b[2m\u001b[36m(steps_1_through_4 pid=69304)\u001b[0m Request failed 2 times; retrying in 2 sec\n", "\u001b[2m\u001b[36m(steps_1_through_4 pid=69315)\u001b[0m Request failed 2 times; retrying in 2 sec\n", "\u001b[2m\u001b[36m(steps_1_through_4 pid=69310)\u001b[0m Request failed 1 times; retrying in 1 sec\n", "\u001b[2m\u001b[36m(steps_1_through_4 pid=69314)\u001b[0m Request failed 1 times; retrying in 1 sec\n", "\u001b[2m\u001b[36m(steps_1_through_4 pid=69310)\u001b[0m Request failed 1 times; retrying in 1 sec\n", "CPU times: user 3.04 s, sys: 1.17 s, total: 4.22 s\n", "Wall time: 1min 26s\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
persontitle
0[1977, 1991): 'Wendi Whitmore'[1993, 2040): 'Vice President, IBM X-Force Thr...
0[1281, 1292): 'Rob DiCicco'[1294, 1348): 'PharmD, Deputy Chief Health Off...
0[1213, 1229): 'Christoph Herman'[1231, 1281): 'SVP and Head of SAP HANA Enterp...
1[2227, 2242): 'Stephen Leonard'[2244, 2282): 'General Manager, IBM Cognitive ...
0[2290, 2298): 'Bob Lord'[2300, 2376): 'IBM Senior Vice President of Co...
.........
0[3113, 3123): 'Mike Doran'[3125, 3156): 'Worldwide Sales Director at IBM'
0[3156, 3170): 'Howard Boville'[3172, 3211): 'Senior Vice President, IBM Hybr...
0[3116, 3139): 'Samuel Brack Co-Founder'[3131, 3154): '-Founder and CTO at DIA'
1[3511, 3525): 'Hillery Hunter'[3527, 3558): 'IBM Fellow, VP & CTO, IBM Cloud'
0[1488, 1498): 'Ana Zamper'[1500, 1535): 'Ecosystem Leader, IBM Latin Ame...
\n", "

260 rows × 2 columns

\n", "
" ], "text/plain": [ " person \\\n", "0 [1977, 1991): 'Wendi Whitmore' \n", "0 [1281, 1292): 'Rob DiCicco' \n", "0 [1213, 1229): 'Christoph Herman' \n", "1 [2227, 2242): 'Stephen Leonard' \n", "0 [2290, 2298): 'Bob Lord' \n", ".. ... \n", "0 [3113, 3123): 'Mike Doran' \n", "0 [3156, 3170): 'Howard Boville' \n", "0 [3116, 3139): 'Samuel Brack Co-Founder' \n", "1 [3511, 3525): 'Hillery Hunter' \n", "0 [1488, 1498): 'Ana Zamper' \n", "\n", " title \n", "0 [1993, 2040): 'Vice President, IBM X-Force Thr... \n", "0 [1294, 1348): 'PharmD, Deputy Chief Health Off... \n", "0 [1231, 1281): 'SVP and Head of SAP HANA Enterp... \n", "1 [2244, 2282): 'General Manager, IBM Cognitive ... \n", "0 [2300, 2376): 'IBM Senior Vice President of Co... \n", ".. ... \n", "0 [3125, 3156): 'Worldwide Sales Director at IBM' \n", "0 [3172, 3211): 'Senior Vice President, IBM Hybr... \n", "0 [3131, 3154): '-Founder and CTO at DIA' \n", "1 [3527, 3558): 'IBM Fellow, VP & CTO, IBM Cloud' \n", "0 [1500, 1535): 'Ecosystem Leader, IBM Latin Ame... 
\n", "\n", "[260 rows x 2 columns]" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "\n", "# NOTE: The blog version of this listing should highlight what has changed \n", "# relative to the original code.\n", "\n", "@ray.remote\n", "def steps_1_through_4(doc_html: str) -> pd.DataFrame:\n", "    fix_spacy_warnings()  # Workaround for spurious warning messages\n", "    nlu_api = ibm_watson.NaturalLanguageUnderstandingV1(version=\"2021-01-01\",\n", "        authenticator=ibm_cloud_sdk_core.authenticators.IAMAuthenticator(api_key))\n", "    nlu_api.set_service_url(service_url)\n", "    spacy_language_model = spacy.load(\"en_core_web_trf\")\n", "\n", "    step_1_results = mi.extract_named_entities_and_semantic_roles(doc_html, nlu_api)\n", "    step_2_results = mi.identify_persons_quoted_by_name(step_1_results)\n", "    step_3_results = mi.perform_targeted_dependency_parsing(\n", "        step_2_results[\"person\"],\n", "        spacy_language_model)\n", "    step_4_results = mi.extract_titles_of_persons(step_2_results, step_3_results)\n", "    return step_4_results\n", "\n", "# Repeat steps 1-4 on every document, at the same time\n", "dataframes_to_stack = ray.get([\n", "    steps_1_through_4.remote(doc_html) for doc_html in articles[\"html\"]\n", "])\n", "\n", "# Step 5: Merge the results across documents\n", "step_5_results = pd.concat(dataframes_to_stack)\n", "step_5_results" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This version runs correctly, and it runs in about 90 seconds on our 8-core laptop -- an additional performance improvement of about 3x on top of the 3x improvement from Part 3.\n", "\n", "But there's room for two kinds of additional improvement:\n", "* This code loads a 500-megabyte model for every document, which incurs a significant additional cost. The 8 cores on our test machine are fully occupied, but we see a speedup of only about 3x. 
So we're using more than twice as many CPU cycles per document.\n", "* This code produces several screens of scary log messages such as:\n", " ```\n", " (pid=8210) Request failed 2 times; retrying in 2 sec\n", " ```\n", " because we're exceeding the request rate limit of our free Lite instance of Watson Natural Language Understanding. The limit for a Lite instance is 5 requests per second.\n", " At that rate, we should be able to finish our 190 documents in 38 seconds (190 documents / 5 requests per second) while staying below the limit.\n", " But instead we're taking more than twice as long while continually bumping up against the limit and having to retry requests.\n", " \n", "We can use Ray *actors* to address both of these problems." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Using actors to avoid repeated model startup" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A Ray *actor* is a persistent Python or Java object that lives in a Ray worker process. Actors can maintain arbitrary amounts of state, both mutable and immutable. 
If we attach the SpaCy language model to an actor, then the model will be initialized once in the actor's constructor instead of being loaded each time we process a document.\n", "\n", "Turning our `steps_1_through_4()` function into a Ray actor is a simple matter of restructuring the code slightly.\n", "\n", "Here's the code before we introduce the actor:\n", "```python\n", "@ray.remote\n", "def steps_1_through_4(doc_html: str) -> pd.DataFrame:\n", "    spacy_language_model = spacy.load(\"en_core_web_trf\")\n", "    nlu_api = ibm_watson.NaturalLanguageUnderstandingV1(version=\"2021-01-01\",\n", "        authenticator=ibm_cloud_sdk_core.authenticators.IAMAuthenticator(api_key))\n", "    nlu_api.set_service_url(service_url)\n", "\n", "    step_1_results = mi.extract_named_entities_and_semantic_roles(doc_html, nlu_api)\n", "    step_2_results = mi.identify_persons_quoted_by_name(step_1_results)\n", "    step_3_results = mi.perform_targeted_dependency_parsing(\n", "        step_2_results[\"person\"],\n", "        spacy_language_model)\n", "    step_4_results = mi.extract_titles_of_persons(step_2_results, step_3_results)\n", "    return step_4_results\n", "```\n", "\n", "And here's a version that uses an actor:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "@ray.remote\n", "class ParserModelActor(object):\n", "    def __init__(self, spacy_model_name: str):\n", "        self._language_model = spacy.load(spacy_model_name)\n", "        self._nlu_api = ibm_watson.NaturalLanguageUnderstandingV1(version=\"2021-01-01\",\n", "            authenticator=ibm_cloud_sdk_core.authenticators.IAMAuthenticator(api_key))\n", "        self._nlu_api.set_service_url(service_url)\n", "\n", "    def steps_1_through_4(self, doc_html: str) -> pd.DataFrame:\n", "        fix_spacy_warnings()  # Workaround for spurious warning messages\n", "        step_1_results = mi.extract_named_entities_and_semantic_roles(doc_html,\n", "                                                                      self._nlu_api)\n", "        step_2_results = mi.identify_persons_quoted_by_name(step_1_results)\n", "        step_3_results = mi.perform_targeted_dependency_parsing(\n", "            step_2_results[\"person\"],\n", "            self._language_model)\n", "        step_4_results = mi.extract_titles_of_persons(step_2_results, step_3_results)\n", "        return step_4_results\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Compared with the previous Ray task, this version makes two important changes:\n", "* The model initialization in the first line of `steps_1_through_4` moves to the class's constructor.\n", "* We use the actor's local copy, `self._language_model`, as the input to the processing for step 3.\n", "\n", "Invoking a Ray actor is a two-step process. First, you create an instance of the actor. Then you tell that instance to perform tasks.\n", "\n", "Here's some code that creates an instance of the `ParserModelActor` actor, invokes the actor's `steps_1_through_4` task, blocks until the task completes, and returns the results:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2022-04-16 14:29:03,824\tINFO services.py:1374 -- View the Ray dashboard at \u001b[1m\u001b[32mhttp://127.0.0.1:8265\u001b[39m\u001b[22m\n" ] } ], "source": [ "# Don't include this cell in the blog\n", "reboot_ray()" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
persontitle
0[1977, 1991): 'Wendi Whitmore'[1993, 2040): 'Vice President, IBM X-Force Thr...
\n", "
" ], "text/plain": [ " person \\\n", "0 [1977, 1991): 'Wendi Whitmore' \n", "\n", " title \n", "0 [1993, 2040): 'Vice President, IBM X-Force Thr... " ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "actor = ParserModelActor.remote(\"en_core_web_trf\")\n", "future = actor.steps_1_through_4.remote(doc_html)\n", "step_4_results = ray.get(future)\n", "step_4_results" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This code performs the computation in the background and produces the correct result. The actor remains active as long as the Python variable `actor` is still in scope, so we can pass additional documents to the `steps_1_through_4` task without incurring the overhead of loading a language model each time.\n", "\n", "By default, a Ray actor processes only one request at a time. To process documents in parallel, we can define a Ray [actor pool](https://docs.ray.io/en/latest/ray-core/actors/actor-utils.html#actor-pool), a group of multiple copies of our actor that share the work." 
] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2022-04-16 14:29:20,386\tINFO services.py:1374 -- View the Ray dashboard at \u001b[1m\u001b[32mhttp://127.0.0.1:8265\u001b[39m\u001b[22m\n" ] } ], "source": [ "# Don't include this cell in the blog\n", "reboot_ray()" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[2m\u001b[36m(ParserModelActor pid=69429)\u001b[0m Request failed 1 times; retrying in 1 sec\n", "\u001b[2m\u001b[36m(ParserModelActor pid=69440)\u001b[0m Request failed 1 times; retrying in 1 sec\n", "\u001b[2m\u001b[36m(ParserModelActor pid=69434)\u001b[0m Request failed 1 times; retrying in 1 sec\n", "\u001b[2m\u001b[36m(ParserModelActor pid=69440)\u001b[0m Request failed 1 times; retrying in 1 sec\n", "\u001b[2m\u001b[36m(ParserModelActor pid=69429)\u001b[0m Request failed 1 times; retrying in 1 sec\n", "\u001b[2m\u001b[36m(ParserModelActor pid=69440)\u001b[0m Request failed 1 times; retrying in 1 sec\n", "\u001b[2m\u001b[36m(ParserModelActor pid=69443)\u001b[0m Request failed 1 times; retrying in 1 sec\n", "\u001b[2m\u001b[36m(ParserModelActor pid=69440)\u001b[0m Request failed 2 times; retrying in 2 sec\n", "\u001b[2m\u001b[36m(ParserModelActor pid=69441)\u001b[0m Request failed 1 times; retrying in 1 sec\n", "\u001b[2m\u001b[36m(ParserModelActor pid=69440)\u001b[0m Request failed 1 times; retrying in 1 sec\n", "\u001b[2m\u001b[36m(ParserModelActor pid=69437)\u001b[0m Request failed 1 times; retrying in 1 sec\n", "\u001b[2m\u001b[36m(ParserModelActor pid=69443)\u001b[0m Request failed 1 times; retrying in 1 sec\n", "\u001b[2m\u001b[36m(ParserModelActor pid=69429)\u001b[0m Request failed 1 times; retrying in 1 sec\n", "\u001b[2m\u001b[36m(ParserModelActor pid=69437)\u001b[0m Request failed 1 times; retrying in 1 sec\n", "\u001b[2m\u001b[36m(ParserModelActor 
pid=69441)\u001b[0m Request failed 1 times; retrying in 1 sec\n", "\u001b[2m\u001b[36m(ParserModelActor pid=69440)\u001b[0m Request failed 1 times; retrying in 1 sec\n", "\u001b[2m\u001b[36m(ParserModelActor pid=69442)\u001b[0m Request failed 1 times; retrying in 1 sec\n", "\u001b[2m\u001b[36m(ParserModelActor pid=69440)\u001b[0m Request failed 2 times; retrying in 2 sec\n", "\u001b[2m\u001b[36m(ParserModelActor pid=69437)\u001b[0m Request failed 1 times; retrying in 1 sec\n", "\u001b[2m\u001b[36m(ParserModelActor pid=69440)\u001b[0m Request failed 1 times; retrying in 1 sec\n", "\u001b[2m\u001b[36m(ParserModelActor pid=69437)\u001b[0m Request failed 1 times; retrying in 1 sec\n", "\u001b[2m\u001b[36m(ParserModelActor pid=69437)\u001b[0m \n", "\u001b[2m\u001b[36m(ParserModelActor pid=69440)\u001b[0m Request failed 2 times; retrying in 2 sec\n", "\u001b[2m\u001b[36m(ParserModelActor pid=69441)\u001b[0m Request failed 1 times; retrying in 1 sec\n", "\u001b[2m\u001b[36m(ParserModelActor pid=69443)\u001b[0m Request failed 1 times; retrying in 1 sec\n", "\u001b[2m\u001b[36m(ParserModelActor pid=69438)\u001b[0m Request failed 1 times; retrying in 1 sec\n", "\u001b[2m\u001b[36m(ParserModelActor pid=69442)\u001b[0m Request failed 1 times; retrying in 1 sec\n", "\u001b[2m\u001b[36m(ParserModelActor pid=69429)\u001b[0m Request failed 1 times; retrying in 1 sec\n", "\u001b[2m\u001b[36m(ParserModelActor pid=69429)\u001b[0m Request failed 2 times; retrying in 2 sec\n", "\u001b[2m\u001b[36m(ParserModelActor pid=69440)\u001b[0m Request failed 1 times; retrying in 1 sec\n", "\u001b[2m\u001b[36m(ParserModelActor pid=69442)\u001b[0m Request failed 1 times; retrying in 1 sec\n", "\u001b[2m\u001b[36m(ParserModelActor pid=69429)\u001b[0m Request failed 3 times; retrying in 4 sec\n", "CPU times: user 1.52 s, sys: 651 ms, total: 2.17 s\n", "Wall time: 50.9 s\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
persontitle
0[1289, 1303): 'Wendi Whitmore'[1305, 1344): 'VP of Threat Intelligence, IBM ...
0[1977, 1991): 'Wendi Whitmore'[1993, 2040): 'Vice President, IBM X-Force Thr...
0[1213, 1229): 'Christoph Herman'[1231, 1281): 'SVP and Head of SAP HANA Enterp...
1[2227, 2242): 'Stephen Leonard'[2244, 2282): 'General Manager, IBM Cognitive ...
0[2290, 2298): 'Bob Lord'[2300, 2376): 'IBM Senior Vice President of Co...
.........
1[2119, 2134): 'James Kavanaugh'[2136, 2189): 'IBM senior vice president and c...
0[3116, 3139): 'Samuel Brack Co-Founder'[3131, 3154): '-Founder and CTO at DIA'
1[3511, 3525): 'Hillery Hunter'[3527, 3558): 'IBM Fellow, VP & CTO, IBM Cloud'
0[1488, 1498): 'Ana Zamper'[1500, 1535): 'Ecosystem Leader, IBM Latin Ame...
0[1396, 1411): 'Sridhar Muppidi'[1413, 1451): 'Chief Technology Officer, IBM S...
\n", "

260 rows × 2 columns

\n", "
" ], "text/plain": [ " person \\\n", "0 [1289, 1303): 'Wendi Whitmore' \n", "0 [1977, 1991): 'Wendi Whitmore' \n", "0 [1213, 1229): 'Christoph Herman' \n", "1 [2227, 2242): 'Stephen Leonard' \n", "0 [2290, 2298): 'Bob Lord' \n", ".. ... \n", "1 [2119, 2134): 'James Kavanaugh' \n", "0 [3116, 3139): 'Samuel Brack Co-Founder' \n", "1 [3511, 3525): 'Hillery Hunter' \n", "0 [1488, 1498): 'Ana Zamper' \n", "0 [1396, 1411): 'Sridhar Muppidi' \n", "\n", " title \n", "0 [1305, 1344): 'VP of Threat Intelligence, IBM ... \n", "0 [1993, 2040): 'Vice President, IBM X-Force Thr... \n", "0 [1231, 1281): 'SVP and Head of SAP HANA Enterp... \n", "1 [2244, 2282): 'General Manager, IBM Cognitive ... \n", "0 [2300, 2376): 'IBM Senior Vice President of Co... \n", ".. ... \n", "1 [2136, 2189): 'IBM senior vice president and c... \n", "0 [3131, 3154): '-Founder and CTO at DIA' \n", "1 [3527, 3558): 'IBM Fellow, VP & CTO, IBM Cloud' \n", "0 [1500, 1535): 'Ecosystem Leader, IBM Latin Ame... \n", "0 [1413, 1451): 'Chief Technology Officer, IBM S... 
\n", "\n", "[260 rows x 2 columns]" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "\n", "# This listing should NOT appear in the blog version.\n", "# Run a pool with the actor defined above over the entire document collection, just\n", "# to make sure it works and to determine the running time.\n", "\n", "num_cpus = multiprocessing.cpu_count() // 2\n", "\n", "actors = ray.util.ActorPool([ParserModelActor.remote(\"en_core_web_trf\")\n", "                             for i in range(num_cpus)])\n", "\n", "# Repeat steps 1-4 on every document\n", "dataframes_to_stack = actors.map_unordered(\n", "    lambda actor, value: actor.steps_1_through_4.remote(value), articles[\"html\"]\n", ")\n", "\n", "# Step 5: Merge the results across documents\n", "step_5_results = pd.concat(dataframes_to_stack)\n", "step_5_results" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "## Using actors to manage request rate\n", "\n", "The large model that we attached to our `ParserModelActor` actor is an example of *immutable* actor state. Ray actors can also have *mutable* state that changes in response to tasks that the actor performs.\n", "\n", "For this application, we can use mutable actor state to keep track of how quickly our application is sending requests to the Watson Natural Language Understanding web service.\n", "\n", "Part 4 of this tutorial described in detail how to add such rate-limiting logic with a Ray actor. Here's a quick summary of how it works: We measure the request rate by tracking how much time has elapsed since the most recent request. With that information in hand, our actor can throttle new requests if they would exceed the rate limit. We put the logic for managing this state into an abstract base class, `RateLimitedActor`, the code for which can be found in `market_intelligence.py`. 
With that base class in place, we can define a Ray actor that sends documents to the Watson Natural Language Understanding web service while respecting a request rate limit. Because the Python API for Watson Natural Language Understanding is thread-safe, we can use a multithreaded Python actor to track multiple simultaneous requests." ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "@ray.remote\n", "class NLUClientActor(mi.RateLimitedActor):\n", "    \"\"\"\n", "    Threaded actor to handle multiple simultaneous requests to the IBM Watson\n", "    Natural Language Understanding service while respecting an upper bound on the\n", "    number of requests per second.\n", "    \"\"\"\n", "    def __init__(self, requests_per_sec: float,\n", "                 api_key: str, service_url: str):\n", "        super().__init__(requests_per_sec)\n", "        # One instance of the Python API for all threads\n", "        self._nlu_api = ibm_watson.NaturalLanguageUnderstandingV1(\n", "            version=\"2021-01-01\",\n", "            authenticator=ibm_cloud_sdk_core.authenticators.IAMAuthenticator(api_key))\n", "        self._nlu_api.set_service_url(service_url)\n", "\n", "    def process_internal(self, doc_html: str) -> Any:\n", "        return mi.extract_named_entities_and_semantic_roles(doc_html, self._nlu_api)" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "@ray.remote\n", "class ParserModelActor(object):\n", "    def __init__(self, spacy_model_name: str, nlu_client: NLUClientActor):\n", "        self._language_model = spacy.load(spacy_model_name)\n", "        self._nlu_client = nlu_client\n", "\n", "    def steps_1_through_4(self, doc_html: str) -> pd.DataFrame:\n", "        fix_spacy_warnings()  # Workaround for spurious warning messages\n", "        step_1_results = ray.get(self._nlu_client.process.remote(doc_html))\n", "        step_2_results = mi.identify_persons_quoted_by_name(step_1_results)\n", "        step_3_results = mi.perform_targeted_dependency_parsing(\n", "            step_2_results[\"person\"],\n", "            self._language_model)\n", "        step_4_results = mi.extract_titles_of_persons(step_2_results, step_3_results)\n", "        return step_4_results\n" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2022-04-16 14:30:19,999\tINFO services.py:1374 -- View the Ray dashboard at \u001b[1m\u001b[32mhttp://127.0.0.1:8265\u001b[39m\u001b[22m\n" ] } ], "source": [ "reboot_ray()" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 1.4 s, sys: 592 ms, total: 1.99 s\n", "Wall time: 48.1 s\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
persontitle
0[1597, 1605): 'Bob Lord'[1607, 1673): 'Senior Vice President, Cognitiv...
0[1977, 1991): 'Wendi Whitmore'[1993, 2040): 'Vice President, IBM X-Force Thr...
0[1289, 1303): 'Wendi Whitmore'[1305, 1344): 'VP of Threat Intelligence, IBM ...
0[1213, 1229): 'Christoph Herman'[1231, 1281): 'SVP and Head of SAP HANA Enterp...
1[2227, 2242): 'Stephen Leonard'[2244, 2282): 'General Manager, IBM Cognitive ...
.........
0[3116, 3139): 'Samuel Brack Co-Founder'[3131, 3154): '-Founder and CTO at DIA'
1[3511, 3525): 'Hillery Hunter'[3527, 3558): 'IBM Fellow, VP & CTO, IBM Cloud'
0[497, 509): 'John Granger'[511, 620): 'Senior Vice President, Cloud Appl...
1[2375, 2386): 'Hamilton Yu'[2388, 2399): 'CEO of Taos'
0[1488, 1498): 'Ana Zamper'[1500, 1535): 'Ecosystem Leader, IBM Latin Ame...
\n", "

260 rows × 2 columns

\n", "
" ], "text/plain": [ " person \\\n", "0 [1597, 1605): 'Bob Lord' \n", "0 [1977, 1991): 'Wendi Whitmore' \n", "0 [1289, 1303): 'Wendi Whitmore' \n", "0 [1213, 1229): 'Christoph Herman' \n", "1 [2227, 2242): 'Stephen Leonard' \n", ".. ... \n", "0 [3116, 3139): 'Samuel Brack Co-Founder' \n", "1 [3511, 3525): 'Hillery Hunter' \n", "0 [497, 509): 'John Granger' \n", "1 [2375, 2386): 'Hamilton Yu' \n", "0 [1488, 1498): 'Ana Zamper' \n", "\n", " title \n", "0 [1607, 1673): 'Senior Vice President, Cognitiv... \n", "0 [1993, 2040): 'Vice President, IBM X-Force Thr... \n", "0 [1305, 1344): 'VP of Threat Intelligence, IBM ... \n", "0 [1231, 1281): 'SVP and Head of SAP HANA Enterp... \n", "1 [2244, 2282): 'General Manager, IBM Cognitive ... \n", ".. ... \n", "0 [3131, 3154): '-Founder and CTO at DIA' \n", "1 [3527, 3558): 'IBM Fellow, VP & CTO, IBM Cloud' \n", "0 [511, 620): 'Senior Vice President, Cloud Appl... \n", "1 [2388, 2399): 'CEO of Taos' \n", "0 [1500, 1535): 'Ecosystem Leader, IBM Latin Ame... \n", "\n", "[260 rows x 2 columns]" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "\n", "num_cpus = multiprocessing.cpu_count() // 2\n", "\n", "nlu_client = NLUClientActor.options(max_concurrency=5).remote(5.0, api_key, service_url)\n", "actors = ray.util.ActorPool([ParserModelActor.remote(\"en_core_web_trf\", nlu_client)\n", "                             for i in range(num_cpus)])\n", "\n", "dataframes_to_stack = actors.map_unordered(\n", "    lambda actor, value: actor.steps_1_through_4.remote(value),\n", "    articles[\"html\"]\n", ")\n", "\n", "step_5_results = pd.concat(dataframes_to_stack)\n", "step_5_results" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now our pipeline runs in about 45 seconds without hitting the rate limit.\n", "\n", "And now the 5 documents per second rate limit of our Lite instance of Watson Natural Language Understanding is the chief performance bottleneck. 
We can remove that bottleneck by switching to a Standard instance of the service with a simple one-line change." ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "# Don't show this cell in the blog version\n", "standard_api_key = os.environ.get(\"STANDARD_API_KEY\")\n", "standard_service_url = os.environ.get(\"STANDARD_SERVICE_URL\") " ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2022-04-16 14:31:16,894\tINFO services.py:1374 -- View the Ray dashboard at \u001b[1m\u001b[32mhttp://127.0.0.1:8265\u001b[39m\u001b[22m\n" ] } ], "source": [ "reboot_ray()" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 1.54 s, sys: 592 ms, total: 2.13 s\n", "Wall time: 36.8 s\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
persontitle
0[1066, 1084): 'Guilherme Franklin'[1086, 1099): 'El Ordeño COO'
1[1881, 1898): 'Martín Hagelstrom'[1900, 1943): 'IBM Blockchain Leader for IBM L...
0[1281, 1292): 'Rob DiCicco'[1294, 1348): 'PharmD, Deputy Chief Health Off...
0[1338, 1348): 'Rob Thomas'[1350, 1380): 'general manager, IBM Data & AI'
0[1289, 1303): 'Wendi Whitmore'[1305, 1344): 'VP of Threat Intelligence, IBM ...
.........
0[3116, 3139): 'Samuel Brack Co-Founder'[3131, 3154): '-Founder and CTO at DIA'
1[3511, 3525): 'Hillery Hunter'[3527, 3558): 'IBM Fellow, VP & CTO, IBM Cloud'
0[315, 329): 'Arvind Krishna'[331, 371): 'IBM chairman and chief executive ...
1[2119, 2134): 'James Kavanaugh'[2136, 2189): 'IBM senior vice president and c...
0[1488, 1498): 'Ana Zamper'[1500, 1535): 'Ecosystem Leader, IBM Latin Ame...
\n", "

260 rows × 2 columns

\n", "
" ], "text/plain": [ " person \\\n", "0 [1066, 1084): 'Guilherme Franklin' \n", "1 [1881, 1898): 'Martín Hagelstrom' \n", "0 [1281, 1292): 'Rob DiCicco' \n", "0 [1338, 1348): 'Rob Thomas' \n", "0 [1289, 1303): 'Wendi Whitmore' \n", ".. ... \n", "0 [3116, 3139): 'Samuel Brack Co-Founder' \n", "1 [3511, 3525): 'Hillery Hunter' \n", "0 [315, 329): 'Arvind Krishna' \n", "1 [2119, 2134): 'James Kavanaugh' \n", "0 [1488, 1498): 'Ana Zamper' \n", "\n", " title \n", "0 [1086, 1099): 'El Ordeño COO' \n", "1 [1900, 1943): 'IBM Blockchain Leader for IBM L... \n", "0 [1294, 1348): 'PharmD, Deputy Chief Health Off... \n", "0 [1350, 1380): 'general manager, IBM Data & AI' \n", "0 [1305, 1344): 'VP of Threat Intelligence, IBM ... \n", ".. ... \n", "0 [3131, 3154): '-Founder and CTO at DIA' \n", "1 [3527, 3558): 'IBM Fellow, VP & CTO, IBM Cloud' \n", "0 [331, 371): 'IBM chairman and chief executive ... \n", "1 [2136, 2189): 'IBM senior vice president and c... \n", "0 [1500, 1535): 'Ecosystem Leader, IBM Latin Ame... 
\n", "\n", "[260 rows x 2 columns]" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "\n", "num_cpus = multiprocessing.cpu_count()\n", "\n", "# The blog version of this code listing should only show how the next line changes.\n", "nlu_client = NLUClientActor.options(max_concurrency=num_cpus).remote(\n", " 80.0, standard_api_key, standard_service_url)\n", "\n", "\n", "# Note that this call to remote() will start asynchronously loading the language models.\n", "actors = ray.util.ActorPool([ParserModelActor.remote(\"en_core_web_trf\", nlu_client)\n", " for i in range(num_cpus)])\n", "\n", "dataframes_to_stack = actors.map_unordered(\n", " lambda actor, value: actor.steps_1_through_4.remote(value), \n", " articles[\"html\"]\n", ")\n", "\n", "step_5_results = pd.concat(dataframes_to_stack)\n", "step_5_results" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now our running time is down to 31 seconds, from our original running time of 852 seconds. That's a performance improvement of 27x!\n", "\n", "Our 8-core laptop is now the bottleneck. We can further improve running time by switching to a larger machine or a cluster of machines, with no code changes.\n", "\n", "On a larger machine from the IBM cloud, we can process these documents in **15 seconds**, which is **57 times faster** than the code we started with!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Running time comparison\n", "\n", "Here's a detailed chart that compares the running times across the different ways of performing our task." ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "import matplotlib.pyplot as plt\n", "time_data = {\n", " # All times in seconds\n", " \"for loop\": 852,\n", " \"Pandas\\noptimizations\": 283,\n", " \"+Ray task\": 90,\n", " \"+Ray actors\": 45, \n", " \"+standard\\nNLU\\ninstance\": 30, \n", " # Separate run on IBM Cloud machine with 56 cores.\n", " \"+cloud VM\": 15,\n", "}\n", "\n", "plt.figure(figsize=(8, 5))\n", "plt.bar(time_data.keys(), time_data.values())\n", "plt.ylabel(\"Running time (sec)\")\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# The previous chart, but comparing documents per second\n", "num_docs = len(articles.index)\n", "docs_per_sec = {\n", " k: num_docs/v for k, v in time_data.items()\n", "}\n", "plt.figure(figsize=(8, 5))\n", "plt.bar(docs_per_sec.keys(), docs_per_sec.values())\n", "plt.ylabel(\"Documents per second\")\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.12" } }, "nbformat": 4, "nbformat_minor": 4 }