{ "cells": [ { "cell_type": "markdown", "id": "db5f4f9a-7776-42b3-8758-85624d4c15ea", "metadata": { "id": "db5f4f9a-7776-42b3-8758-85624d4c15ea" }, "source": [ "![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)" ] }, { "cell_type": "markdown", "id": "21e9eafb", "metadata": { "id": "21e9eafb" }, "source": [ "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/09.2.Entity_Resolution_Training.ipynb)" ] }, { "cell_type": "markdown", "id": "gk3kZHmNj51v", "metadata": { "collapsed": false, "id": "gk3kZHmNj51v" }, "source": [ "# Installation" ] }, { "cell_type": "code", "execution_count": null, "id": "_914itZsj51v", "metadata": { "id": "_914itZsj51v", "pycharm": { "is_executing": true } }, "outputs": [], "source": [ "! pip install -q johnsnowlabs" ] }, { "cell_type": "markdown", "id": "YPsbAnNoPt0Z", "metadata": { "id": "YPsbAnNoPt0Z" }, "source": [ "## Automatic Installation\n", "Using my.johnsnowlabs.com SSO" ] }, { "cell_type": "code", "execution_count": null, "id": "fY0lcShkj51w", "metadata": { "id": "fY0lcShkj51w", "pycharm": { "is_executing": true } }, "outputs": [], "source": [ "from johnsnowlabs import nlp, legal\n", "\n", "# nlp.install(force_browser=True)" ] }, { "cell_type": "markdown", "id": "hsJvn_WWM2GL", "metadata": { "id": "hsJvn_WWM2GL" }, "source": [ "## Manual downloading\n", "If you are not registered at my.johnsnowlabs.com, received your license via e-mail, or are using Safari, you may need to upload the license manually.\n", "\n", "- Go to my.johnsnowlabs.com\n", "- Download your license\n", "- Upload it using the following command" ] }, { "cell_type": "code", "execution_count": null, "id": "i57QV3-_P2sQ", "metadata": { "id": "i57QV3-_P2sQ" }, "outputs": [], "source": [ "from google.colab import files\n", "print('Please upload your John Snow Labs license using the button below')\n", "license_keys = files.upload()" ] }, { "cell_type": "markdown", "id": "xGgNdFzZP_hQ", "metadata": { "id": "xGgNdFzZP_hQ" }, "source": [ "- Install it" ] }, { "cell_type": "code", "execution_count": null, "id": "OfmmPqknP4rR", "metadata": { "id": "OfmmPqknP4rR" }, "outputs": [], "source": [ "nlp.install()" ] }, { "cell_type": "markdown", "id": "DCl5ErZkNNLk", "metadata": { "id": "DCl5ErZkNNLk" }, "source": [ "# Starting" ] }, { "cell_type": "code", "execution_count": null, "id": "wRXTnNl3j51w", "metadata": { "id": "wRXTnNl3j51w" }, "outputs": [], "source": [ "spark = nlp.start()" ] }, { "cell_type": "markdown", "id": "6zgaKT7Khkzu", "metadata": { "id": "6zgaKT7Khkzu" }, "source": [ "## Entity Resolution Training\n", "\n", "Here we train a legal entity resolver on a sample dataset: a company name normalization model that maps name variants to a canonical company name. All dataset columns must be of string type (pandas `object` dtype).\n", "\n", "Let's start training." ] }, { "cell_type": "markdown", "id": "d6c6f9d0-b6de-4708-a783-e3af20aace47", "metadata": { "id": "d6c6f9d0-b6de-4708-a783-e3af20aace47" }, "source": [ "## Load Dataset" ] }, { "cell_type": "code", "execution_count": null, "id": "5tEedT614vgo", "metadata": { "id": "5tEedT614vgo" }, "outputs": [], "source": [ "! wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/legal-nlp/data/sample_company_name.csv" ] }, { "cell_type": "code", "execution_count": null, "id": "f9ff314c-5548-4946-971c-1aa45bf82d43", "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 424 }, "id": "f9ff314c-5548-4946-971c-1aa45bf82d43", "outputId": "77e6ce30-651e-405b-e7e0-71744f78487b" }, "outputs": [ { "data": { "text/html": [ "\n", "
\n", " " ], "text/plain": [ " company_name irs_number comp_abbreviation_var\n", "0 StepOne Personal Health, Inc. 900785095 StepOne Personal Health\n", "1 StepOne Personal Health, Inc. 900785095 StepOne Personal Health Inc\n", "2 StepOne Personal Health, Inc. 900785095 STEPONE PERSONAL HEALTH INC\n", "3 StepOne Personal Health, Inc. 900785095 StepOne Personal Health inc\n", "4 StepOne Personal Health, Inc. 900785095 StepOne Personal Health INC\n", "... ... ... ...\n", "9995 INGLES MARKETS INC 560846267 Ingles Markets Inc\n", "9996 INGLES MARKETS INC 560846267 INGLES MARKETS Inc.\n", "9997 INGLES MARKETS INC 560846267 INGLES MARKETS inc.\n", "9998 INGLES MARKETS INC 560846267 INGLES MARKETS INC\n", "9999 INGLES MARKETS INC 560846267 INGLES MARKETS\n", "\n", "[10000 rows x 3 columns]" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "\n", "df = pd.read_csv('sample_company_name.csv')\n", "df" ] }, { "cell_type": "code", "execution_count": null, "id": "f731862c-99cc-4ac1-8473-421743d77ca9", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "f731862c-99cc-4ac1-8473-421743d77ca9", "outputId": "99eff647-b2a3-4b9b-99b4-540784085288" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 10000 entries, 0 to 9999\n", "Data columns (total 3 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 company_name 10000 non-null object\n", " 1 irs_number 10000 non-null int64 \n", " 2 comp_abbreviation_var 10000 non-null object\n", "dtypes: int64(1), object(2)\n", "memory usage: 234.5+ KB\n" ] } ], "source": [ "df.info()" ] }, { "cell_type": "code", "execution_count": null, "id": "i9xiDDJFgK4F", "metadata": { "id": "i9xiDDJFgK4F" }, "outputs": [], "source": [ "# Cast every column to str so that all columns have object dtype (irs_number is read as int64)\n", "df['comp_abbreviation_var'] = df['comp_abbreviation_var'].astype(str)\n", "df['irs_number'] = df['irs_number'].astype(str)\n", "df['company_name'] = df['company_name'].astype(str)" ] }, { "cell_type": "code", "execution_count": null, "id": "3HKru2_3gM5U", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "3HKru2_3gM5U", "outputId": "fc592cbf-b40c-4265-856d-9301c8c1397b" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 10000 entries, 0 to 9999\n", "Data columns (total 3 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 company_name 10000 non-null object\n", " 1 irs_number 10000 non-null object\n", " 2 comp_abbreviation_var 10000 non-null object\n", "dtypes: object(3)\n", "memory usage: 234.5+ KB\n" ] } ], "source": [ "df.info()" ] }, { "cell_type": "code", "execution_count": null, "id": "4a9cf2e1-44fc-4f6a-b60d-da76715568d4", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "4a9cf2e1-44fc-4f6a-b60d-da76715568d4", "outputId": "06c46baa-2884-45ad-e2bf-1d1b81265114" }, "outputs": [ { "data": { "text/plain": [ "(10000, 3)" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.shape" ] }, { "cell_type": "markdown", "id": "F8YY4nDEi9hH", "metadata": { "id": "F8YY4nDEi9hH" }, "source": [ "## Get Embeddings\n", "Next, we compute sentence embeddings for the `comp_abbreviation_var` column." 
] }, { "cell_type": "code", "execution_count": null, "id": "2c2fff38-38fc-4ada-a880-52959211ee50", "metadata": { "id": "2c2fff38-38fc-4ada-a880-52959211ee50", "tags": [] }, "outputs": [], "source": [ "data = spark.createDataFrame(df)" ] }, { "cell_type": "code", "execution_count": null, "id": "12a6b170-bc99-4a40-a7f5-fb902c19ca7d", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "12a6b170-bc99-4a40-a7f5-fb902c19ca7d", "outputId": "250a2d28-1ec1-401b-ff9e-1e6ff74ad7bd", "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "tfhub_use download started this may take some time.\n", "Approximate size to download 923.7 MB\n", "[OK!]\n" ] } ], "source": [ "documentAssembler = nlp.DocumentAssembler()\\\n", " .setInputCol(\"comp_abbreviation_var\")\\\n", " .setOutputCol(\"sentence\")\n", "\n", "embeddings = nlp.UniversalSentenceEncoder.pretrained(\"tfhub_use\", \"en\") \\\n", " .setInputCols(\"sentence\") \\\n", " .setOutputCol(\"sentence_embeddings\")\n", "\n", "training_pipeline = nlp.Pipeline(stages = [\n", " documentAssembler,\n", " embeddings])\n", "\n", "training_model = training_pipeline.fit(data)\n", "\n", "final_data = training_model.transform(data)" ] }, { "cell_type": "code", "execution_count": null, "id": "5VtDRIrunKX5", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "5VtDRIrunKX5", "outputId": "6a069a56-23ef-491a-ffa5-d0794fda8016" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+--------------------+----------+---------------------+--------------------+--------------------+\n", "| company_name|irs_number|comp_abbreviation_var| sentence| sentence_embeddings|\n", "+--------------------+----------+---------------------+--------------------+--------------------+\n", "|StepOne Personal ...| 900785095| StepOne Personal ...|[{document, 0, 22...|[{sentence_embedd...|\n", "|StepOne Personal ...| 900785095| StepOne Personal ...|[{document, 0, 
26...|[{sentence_embedd...|\n", "|StepOne Personal ...| 900785095| STEPONE PERSONAL ...|[{document, 0, 26...|[{sentence_embedd...|\n", "|StepOne Personal ...| 900785095| StepOne Personal ...|[{document, 0, 26...|[{sentence_embedd...|\n", "|StepOne Personal ...| 900785095| StepOne Personal ...|[{document, 0, 26...|[{sentence_embedd...|\n", "|StepOne Personal ...| 900785095| StepOne Personal ...|[{document, 0, 27...|[{sentence_embedd...|\n", "|StepOne Personal ...| 900785095| StepOne Personal ...|[{document, 0, 27...|[{sentence_embedd...|\n", "|StepOne Personal ...| 900785095| Stepone Personal ...|[{document, 0, 26...|[{sentence_embedd...|\n", "|StepOne Personal ...| 900785095| StepOne Personal ...|[{document, 0, 27...|[{sentence_embedd...|\n", "|Equity One Net In...| 320467879| Equity One Net In...|[{document, 0, 24...|[{sentence_embedd...|\n", "|Equity One Net In...| 320467879| Equity One Net In...|[{document, 0, 24...|[{sentence_embedd...|\n", "|Equity One Net In...| 320467879| Equity One Net In...|[{document, 0, 24...|[{sentence_embedd...|\n", "|Equity One Net In...| 320467879| Equity One Net In...|[{document, 0, 25...|[{sentence_embedd...|\n", "|Equity One Net In...| 320467879| Equity One Net In...|[{document, 0, 20...|[{sentence_embedd...|\n", "|Equity One Net In...| 320467879| EQUITY ONE NET IN...|[{document, 0, 24...|[{sentence_embedd...|\n", "|Equity One Net In...| 320467879| Equity One Net In...|[{document, 0, 25...|[{sentence_embedd...|\n", "|Equity One Net In...| 320467879| Equity One Net In...|[{document, 0, 25...|[{sentence_embedd...|\n", "|AmeriCredit Autom...| 880475154| AmeriCredit Autom...|[{document, 0, 39...|[{sentence_embedd...|\n", "|GROUNDFLOOR FINAN...| 463414189| GROUNDFLOOR FINAN...|[{document, 0, 23...|[{sentence_embedd...|\n", "|GROUNDFLOOR FINAN...| 463414189| GROUNDFLOOR FINAN...|[{document, 0, 23...|[{sentence_embedd...|\n", "+--------------------+----------+---------------------+--------------------+--------------------+\n", "only 
showing top 20 rows\n", "\n" ] } ], "source": [ "final_data.show()" ] }, { "cell_type": "markdown", "id": "yRtd3nRHnVMQ", "metadata": { "id": "yRtd3nRHnVMQ" }, "source": [ "The training dataframe now contains a `sentence_embeddings` column, which we will use as input when training the model." ] }, { "cell_type": "markdown", "id": "Mnc5wuJ-ntEo", "metadata": { "id": "Mnc5wuJ-ntEo" }, "source": [ "## Train Model" ] }, { "cell_type": "code", "execution_count": null, "id": "176e74ba-6867-469b-aebc-f510f8d78a08", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "176e74ba-6867-469b-aebc-f510f8d78a08", "outputId": "d8ee8fdc-d23e-466c-d8e9-f2411ca18e57", "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 108 ms, sys: 13.8 ms, total: 121 ms\n", "Wall time: 15.3 s\n" ] } ], "source": [ "%%time\n", "use = legal.SentenceEntityResolverApproach()\\\n", "    .setNeighbours(50)\\\n", "    .setThreshold(10000)\\\n", "    .setInputCols(\"sentence_embeddings\")\\\n", "    .setLabelCol(\"company_name\")\\\n", "    .setOutputCol('original_company_name')\\\n", "    .setNormalizedCol(\"company_name\")\\\n", "    .setDistanceFunction(\"EUCLIDEAN\")\\\n", "    .setCaseSensitive(False)\\\n", "    .setUseAuxLabel(True)\\\n", "    .setAuxLabelCol('irs_number')\n", "\n", "model = use.fit(final_data)\n" ] }, { "cell_type": "code", "execution_count": null, "id": "oELEGmD0V_Th", "metadata": { "id": "oELEGmD0V_Th" }, "outputs": [], "source": [ "# Save model\n", "model.write().overwrite().save(\"use_company_name\")" ] }, { "cell_type": "markdown", "id": "091132bb-e472-444a-bc6d-ef502dc3e1dd", "metadata": { "id": "091132bb-e472-444a-bc6d-ef502dc3e1dd" }, "source": [ "## Test Model" ] }, { "cell_type": "code", "execution_count": null, "id": "b2141e1b-9e7d-4081-812c-6ca606cf1687", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "b2141e1b-9e7d-4081-812c-6ca606cf1687", "outputId": "52077d9a-ba4d-4236-b783-d976b2b7c9d0" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "tfhub_use download started this may take some time.\n", "Approximate size to download 923.7 MB\n", "[OK!]\n" ] } ], "source": [ "documentAssembler = nlp.DocumentAssembler()\\\n", "    .setInputCol(\"text\")\\\n", "    .setOutputCol(\"ner_chunk\")\n", "\n", "embeddings = nlp.UniversalSentenceEncoder.pretrained(\"tfhub_use\", \"en\") \\\n", "    .setInputCols(\"ner_chunk\") \\\n", "    .setOutputCol(\"sentence_embeddings\")\n", "\n", "# Load the resolver saved in the training step\n", "resolver = legal.SentenceEntityResolverModel.load(\"use_company_name\") \\\n", "    .setInputCols([\"sentence_embeddings\"]) \\\n", "    .setOutputCol(\"normalized_name\")\\\n", "    .setDistanceFunction(\"EUCLIDEAN\")\n", "\n", "pipeline = nlp.Pipeline(\n", "    stages = [\n", "        documentAssembler,\n", "        embeddings,\n", "        resolver,\n", "    ])\n", "\n", "# No stage needs training here, so fitting on an empty dataframe is enough\n", "empty_data = spark.createDataFrame([[\"\"]]).toDF(\"text\")\n", "\n", "model = pipeline.fit(empty_data)\n", "\n", "light_model = nlp.LightPipeline(model)" ] }, { "cell_type": "code", "execution_count": null, "id": "7e0ba1ec-13d8-42b3-9a53-24cf85fd8ad7", "metadata": { "id": "7e0ba1ec-13d8-42b3-9a53-24cf85fd8ad7" }, "outputs": [], "source": [ "# Returns the LightPipeline resolution results as a pandas DataFrame\n", "\n", "import pandas as pd\n", "pd.set_option('display.max_colwidth', 0)\n", "\n", "def get_codes(lp, text, vocab='company_name', hcc=False):\n", "\n", "    full_light_result = lp.fullAnnotate(text)\n", "\n", "    chunks = []\n", "    codes = []\n", "    begin = []\n", "    end = []\n", "    resolutions = []\n", "    all_distances = []\n", "    all_codes = []\n", "    all_cosines = []\n", "    all_k_aux_labels = []\n", "\n", "    for chunk, code in zip(full_light_result[0]['ner_chunk'], full_light_result[0][vocab]):\n", "\n", "        begin.append(chunk.begin)\n", "        end.append(chunk.end)\n", "        chunks.append(chunk.result)\n", "        codes.append(code.result)\n", "        all_codes.append(code.metadata['all_k_results'].split(':::'))\n", "        resolutions.append(code.metadata['all_k_resolutions'].split(':::'))\n", "        all_distances.append(code.metadata['all_k_distances'].split(':::'))\n", "        all_cosines.append(code.metadata['all_k_cosine_distances'].split(':::'))\n", "        if hcc:\n", "            try:\n", "                all_k_aux_labels.append(code.metadata['all_k_aux_labels'].split(':::'))\n", "            except KeyError:\n", "                all_k_aux_labels.append([])\n", "        else:\n", "            all_k_aux_labels.append([])\n", "\n", "    # Note: the 'all_distances' column reports the cosine distances (all_k_cosine_distances)\n", "    df = pd.DataFrame({'chunks': chunks, 'begin': begin, 'end': end, 'code': codes, 'all_codes': all_codes,\n", "                       'resolutions': resolutions, 'all_k_aux_labels': all_k_aux_labels, 'all_distances': all_cosines})\n", "\n", "    if hcc:\n", "\n", "        df['billable'] = df['all_k_aux_labels'].apply(lambda x: [i.split('||')[0] for i in x])\n", "        df['hcc_status'] = df['all_k_aux_labels'].apply(lambda x: [i.split('||')[1] for i in x])\n", "        df['hcc_code'] = df['all_k_aux_labels'].apply(lambda x: [i.split('||')[2] for i in x])\n", "\n", "    df = df.drop(['all_k_aux_labels'], axis=1)\n", "\n", "    return df" ] }, { "cell_type": "code", "execution_count": null, "id": "57312793-ba38-4bbd-b042-42a229f61221", "metadata": { "id": "57312793-ba38-4bbd-b042-42a229f61221" }, "outputs": [], "source": [ "text = \"AmeriCann Inc\"" ] }, { "cell_type": "code", "execution_count": null, "id": "zpEhjGo1UiuA", "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 177 }, "id": "zpEhjGo1UiuA", "outputId": "52c653ef-153e-405c-d999-995d1f427bf4" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 2 µs, sys: 1e+03 ns, total: 3 µs\n", "Wall time: 5.01 µs\n" ] }, { "data": { "text/html": [ "\n", "
\n", " " ], "text/plain": [ " chunks begin end code \\\n", "0 AmeriCann Inc 0 12 AmeriCann, Inc. \n", "\n", " all_codes \\\n", "0 [AmeriCann, Inc., LUMIOX, INC., AGILYSYS INC, Ameresco, Inc., IMMUCOR INC, AAON INC, CRYOLIFE INC] \n", "\n", " resolutions \\\n", "0 [AmeriCann, Inc., LUMIOX, INC., AGILYSYS INC, Ameresco, Inc., IMMUCOR INC, AAON INC, CRYOLIFE INC] \n", "\n", " all_distances \n", "0 [0.0000, 0.1080, 0.1110, 0.1133, 0.1145, 0.1165, 0.1170] " ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%time get_codes (light_model, text, vocab='normalized_name')" ] }, { "cell_type": "code", "execution_count": null, "id": "1dc7ce8e-30e4-433f-8ae5-a8645bd21d1b", "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 177 }, "id": "1dc7ce8e-30e4-433f-8ae5-a8645bd21d1b", "outputId": "1d3eff93-6cc7-402e-bc07-5daba68e12b7" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 9.2 ms, sys: 277 µs, total: 9.48 ms\n", "Wall time: 61.2 ms\n" ] }, { "data": { "text/html": [ "\n", "
\n", " " ], "text/plain": [ " chunks begin end code \\\n", "0 AmeriCann inc 0 12 AmeriCann, Inc. \n", "\n", " all_codes \\\n", "0 [AmeriCann, Inc., LUMIOX, INC., AGILYSYS INC, Ameresco, Inc., IMMUCOR INC, AAON INC, CRYOLIFE INC] \n", "\n", " resolutions \\\n", "0 [AmeriCann, Inc., LUMIOX, INC., AGILYSYS INC, Ameresco, Inc., IMMUCOR INC, AAON INC, CRYOLIFE INC] \n", "\n", " all_distances \n", "0 [0.0000, 0.1080, 0.1110, 0.1133, 0.1145, 0.1165, 0.1170] " ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "text = 'AmeriCann inc'\n", "\n", "%time get_codes (light_model, text, vocab='normalized_name')" ] }, { "cell_type": "code", "execution_count": null, "id": "ebd35257-7249-4bb4-b3ed-715fe65ce816", "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 177 }, "id": "ebd35257-7249-4bb4-b3ed-715fe65ce816", "outputId": "84f5c441-a01a-4889-8400-43ceadc99381", "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 4.66 ms, sys: 2.18 ms, total: 6.84 ms\n", "Wall time: 52.7 ms\n" ] }, { "data": { "text/html": [ "\n", "
\n", " " ], "text/plain": [ " chunks begin end code \\\n", "0 StepOne Personal Health inc 0 26 StepOne Personal Health, Inc. \n", "\n", " all_codes \\\n", "0 [StepOne Personal Health, Inc., Kura Oncology, Inc., Axsome Therapeutics, Inc., CVS HEALTH Corp, EDGEWELL PERSONAL CARE Co, Cardiovascular Systems Inc, CESCA THERAPEUTICS INC.] \n", "\n", " resolutions \\\n", "0 [StepOne Personal Health, Inc., Kura Oncology, Inc., Axsome Therapeutics, Inc., CVS HEALTH Corp, EDGEWELL PERSONAL CARE Co, Cardiovascular Systems Inc, CESCA THERAPEUTICS INC.] \n", "\n", " all_distances \n", "0 [0.0000, 0.2224, 0.2714, 0.2729, 0.2802, 0.2868, 0.2874] " ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "text = 'StepOne Personal Health inc'\n", "\n", "%time get_codes (light_model, text, vocab='normalized_name')" ] }, { "cell_type": "code", "execution_count": null, "id": "60765bb9-9b85-412c-bf72-214a4714662f", "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 194 }, "id": "60765bb9-9b85-412c-bf72-214a4714662f", "outputId": "485b8be5-9a0e-4f61-8a38-7ce0459543a1" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 7.07 ms, sys: 732 µs, total: 7.81 ms\n", "Wall time: 67 ms\n" ] }, { "data": { "text/html": [ "\n", "
\n", " " ], "text/plain": [ " chunks begin end code \\\n", "0 Alzamend Neuro INC 0 17 Alzamend Neuro, Inc. \n", "\n", " all_codes \\\n", "0 [Alzamend Neuro, Inc., Kura Oncology, Inc., REGENERON PHARMACEUTICALS INC, Dipexium Pharmaceuticals, Inc., AEOLUS PHARMACEUTICALS, INC., Flex Pharma, Inc., PROTO SCRIPT PHARMACEUTICAL CORP] \n", "\n", " resolutions \\\n", "0 [Alzamend Neuro, Inc., Kura Oncology, Inc., REGENERON PHARMACEUTICALS INC, Dipexium Pharmaceuticals, Inc., AEOLUS PHARMACEUTICALS, INC., Flex Pharma, Inc., PROTO SCRIPT PHARMACEUTICAL CORP] \n", "\n", " all_distances \n", "0 [0.0000, 0.1704, 0.1802, 0.1934, 0.2149, 0.2162, 0.2254] " ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "text = 'Alzamend Neuro INC'\n", "\n", "%time get_codes (light_model, text, vocab='normalized_name')" ] }, { "cell_type": "code", "execution_count": null, "id": "3eda13a5-4b4e-448e-bf74-f09c14789920", "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 194 }, "id": "3eda13a5-4b4e-448e-bf74-f09c14789920", "outputId": "94af733c-7601-432b-bcb7-04996b970204" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 7.98 ms, sys: 2.37 ms, total: 10.3 ms\n", "Wall time: 56.6 ms\n" ] }, { "data": { "text/html": [ "\n", "
\n", " " ], "text/plain": [ " chunks begin end code \\\n", "0 MMEX Resources Corporation 0 25 MMEX Resources Corp \n", "\n", " all_codes \\\n", "0 [MMEX Resources Corp, ANTERO RESOURCES Corp, ARTESIAN RESOURCES CORP, ESTERLINE TECHNOLOGIES CORP, Timberline Resources Corp, CATALYST PAPER CORP, INFRASTRUCTURE DEVELOPMENTS CORP.] \n", "\n", " resolutions \\\n", "0 [MMEX Resources Corp, ANTERO RESOURCES Corp, ARTESIAN RESOURCES CORP, ESTERLINE TECHNOLOGIES CORP, Timberline Resources Corp, CATALYST PAPER CORP, INFRASTRUCTURE DEVELOPMENTS CORP.] \n", "\n", " all_distances \n", "0 [0.1096, 0.1540, 0.1624, 0.2054, 0.2202, 0.2406, 0.2451] " ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "text = 'MMEX Resources Corporation'\n", "\n", "%time get_codes (light_model, text, vocab='normalized_name')" ] }, { "cell_type": "code", "execution_count": null, "id": "40651b68-d4a3-4ade-ae5a-60994c4314d1", "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 194 }, "id": "40651b68-d4a3-4ade-ae5a-60994c4314d1", "outputId": "53c0850c-a470-43c3-d3b2-c791f1708413" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 7.89 ms, sys: 941 µs, total: 8.83 ms\n", "Wall time: 43.4 ms\n" ] }, { "data": { "text/html": [ "\n", "
\n", " " ], "text/plain": [ " chunks begin end code \\\n", "0 Alphadyne Asset Management Lp. 0 29 Alphadyne Asset Management LP \n", "\n", " all_codes \\\n", "0 [Alphadyne Asset Management LP, YACKTMAN ASSET MANAGEMENT LP, TOCQUEVILLE ASSET MANAGEMENT L.P., SYSTEMATIC FINANCIAL MANAGEMENT LP, Madyson Equity Group, LP, AMERIGAS PARTNERS LP, CAPRIN ASSET MANAGEMENT LLC /ADV, ALLIANCEBERNSTEIN HOLDING L.P.] \n", "\n", " resolutions \\\n", "0 [Alphadyne Asset Management LP, YACKTMAN ASSET MANAGEMENT LP, TOCQUEVILLE ASSET MANAGEMENT L.P., SYSTEMATIC FINANCIAL MANAGEMENT LP, Madyson Equity Group, LP, AMERIGAS PARTNERS LP, CAPRIN ASSET MANAGEMENT LLC /ADV, ALLIANCEBERNSTEIN HOLDING L.P.] \n", "\n", " all_distances \n", "0 [0.0000, 0.0724, 0.1040, 0.2378, 0.2470, 0.2570, 0.2614, 0.2722] " ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "text = 'Alphadyne Asset Management Lp.'\n", "\n", "%time get_codes (light_model, text, vocab='normalized_name')" ] } ], "metadata": { "colab": { "provenance": [], "toc_visible": true }, "gpuClass": "standard", "kernelspec": { "display_name": "Python 3", "name": "python3" }, "language_info": { "name": "python" } }, "nbformat": 4, "nbformat_minor": 5 }