{
"cells": [
{
"cell_type": "markdown",
"id": "db5f4f9a-7776-42b3-8758-85624d4c15ea",
"metadata": {
"id": "db5f4f9a-7776-42b3-8758-85624d4c15ea"
},
"source": [
"![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)"
]
},
{
"cell_type": "markdown",
"id": "21e9eafb",
"metadata": {
"id": "21e9eafb"
},
"source": [
"[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/09.2.Entity_Resolution_Training.ipynb)"
]
},
{
"cell_type": "markdown",
"id": "gk3kZHmNj51v",
"metadata": {
"collapsed": false,
"id": "gk3kZHmNj51v"
},
"source": [
"# Installation"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "_914itZsj51v",
"metadata": {
"id": "_914itZsj51v",
"pycharm": {
"is_executing": true
}
},
"outputs": [],
"source": [
"! pip install -q johnsnowlabs"
]
},
{
"cell_type": "markdown",
"id": "YPsbAnNoPt0Z",
"metadata": {
"id": "YPsbAnNoPt0Z"
},
"source": [
"## Automatic Installation\n",
"Using my.johnsnowlabs.com SSO"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "fY0lcShkj51w",
"metadata": {
"id": "fY0lcShkj51w",
"pycharm": {
"is_executing": true
}
},
"outputs": [],
"source": [
"from johnsnowlabs import nlp, legal\n",
"\n",
"# nlp.install(force_browser=True)"
]
},
{
"cell_type": "markdown",
"id": "hsJvn_WWM2GL",
"metadata": {
"id": "hsJvn_WWM2GL"
},
"source": [
"## Manual downloading\n",
"If you are not registered at my.johnsnowlabs.com, received your license by email, or are using Safari, you may need to upload the license manually.\n",
"\n",
"- Go to my.johnsnowlabs.com\n",
"- Download your license\n",
"- Upload it using the following command"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "i57QV3-_P2sQ",
"metadata": {
"id": "i57QV3-_P2sQ"
},
"outputs": [],
"source": [
"from google.colab import files\n",
"print('Please Upload your John Snow Labs License using the button below')\n",
"license_keys = files.upload()"
]
},
{
"cell_type": "markdown",
"id": "xGgNdFzZP_hQ",
"metadata": {
"id": "xGgNdFzZP_hQ"
},
"source": [
"- Install it"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "OfmmPqknP4rR",
"metadata": {
"id": "OfmmPqknP4rR"
},
"outputs": [],
"source": [
"nlp.install()"
]
},
{
"cell_type": "markdown",
"id": "DCl5ErZkNNLk",
"metadata": {
"id": "DCl5ErZkNNLk"
},
"source": [
"# Starting"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "wRXTnNl3j51w",
"metadata": {
"id": "wRXTnNl3j51w"
},
"outputs": [],
"source": [
"spark = nlp.start()"
]
},
{
"cell_type": "markdown",
"id": "6zgaKT7Khkzu",
"metadata": {
"id": "6zgaKT7Khkzu"
},
"source": [
"## Entity Resolution Training\n",
"\n",
"Here, we will train a legal entity resolver on a sample dataset. The model performs company name normalization. All dataset columns must have the `object` (string) dtype.\n",
"\n",
"Let's start training."
]
},
{
"cell_type": "markdown",
"id": "d6c6f9d0-b6de-4708-a783-e3af20aace47",
"metadata": {
"id": "d6c6f9d0-b6de-4708-a783-e3af20aace47"
},
"source": [
"## Load Dataset"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5tEedT614vgo",
"metadata": {
"id": "5tEedT614vgo"
},
"outputs": [],
"source": [
"! wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/legal-nlp/data/sample_company_name.csv"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f9ff314c-5548-4946-971c-1aa45bf82d43",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 424
},
"id": "f9ff314c-5548-4946-971c-1aa45bf82d43",
"outputId": "77e6ce30-651e-405b-e7e0-71744f78487b"
},
"outputs": [
{
"data": {
"text/plain": [
" company_name irs_number comp_abbreviation_var\n",
"0 StepOne Personal Health, Inc. 900785095 StepOne Personal Health\n",
"1 StepOne Personal Health, Inc. 900785095 StepOne Personal Health Inc\n",
"2 StepOne Personal Health, Inc. 900785095 STEPONE PERSONAL HEALTH INC\n",
"3 StepOne Personal Health, Inc. 900785095 StepOne Personal Health inc\n",
"4 StepOne Personal Health, Inc. 900785095 StepOne Personal Health INC\n",
"... ... ... ...\n",
"9995 INGLES MARKETS INC 560846267 Ingles Markets Inc\n",
"9996 INGLES MARKETS INC 560846267 INGLES MARKETS Inc.\n",
"9997 INGLES MARKETS INC 560846267 INGLES MARKETS inc.\n",
"9998 INGLES MARKETS INC 560846267 INGLES MARKETS INC\n",
"9999 INGLES MARKETS INC 560846267 INGLES MARKETS\n",
"\n",
"[10000 rows x 3 columns]"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"\n",
"df = pd.read_csv('sample_company_name.csv')\n",
"df"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f731862c-99cc-4ac1-8473-421743d77ca9",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "f731862c-99cc-4ac1-8473-421743d77ca9",
"outputId": "99eff647-b2a3-4b9b-99b4-540784085288"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 10000 entries, 0 to 9999\n",
"Data columns (total 3 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 company_name 10000 non-null object\n",
" 1 irs_number 10000 non-null int64 \n",
" 2 comp_abbreviation_var 10000 non-null object\n",
"dtypes: int64(1), object(2)\n",
"memory usage: 234.5+ KB\n"
]
}
],
"source": [
"df.info()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "i9xiDDJFgK4F",
"metadata": {
"id": "i9xiDDJFgK4F"
},
"outputs": [],
"source": [
"# Cast every column to str so that all of them have the object dtype\n",
"df['comp_abbreviation_var'] = df['comp_abbreviation_var'].astype(str)\n",
"df['irs_number'] = df['irs_number'].astype(str)\n",
"df['company_name'] = df['company_name'].astype(str)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3HKru2_3gM5U",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "3HKru2_3gM5U",
"outputId": "fc592cbf-b40c-4265-856d-9301c8c1397b"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 10000 entries, 0 to 9999\n",
"Data columns (total 3 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 company_name 10000 non-null object\n",
" 1 irs_number 10000 non-null object\n",
" 2 comp_abbreviation_var 10000 non-null object\n",
"dtypes: object(3)\n",
"memory usage: 234.5+ KB\n"
]
}
],
"source": [
"df.info()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4a9cf2e1-44fc-4f6a-b60d-da76715568d4",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "4a9cf2e1-44fc-4f6a-b60d-da76715568d4",
"outputId": "06c46baa-2884-45ad-e2bf-1d1b81265114"
},
"outputs": [
{
"data": {
"text/plain": [
"(10000, 3)"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.shape"
]
},
{
"cell_type": "markdown",
"id": "F8YY4nDEi9hH",
"metadata": {
"id": "F8YY4nDEi9hH"
},
"source": [
"## Get Embeddings\n",
"Now we will compute sentence embeddings for the `comp_abbreviation_var` column."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2c2fff38-38fc-4ada-a880-52959211ee50",
"metadata": {
"id": "2c2fff38-38fc-4ada-a880-52959211ee50",
"tags": []
},
"outputs": [],
"source": [
"data = spark.createDataFrame(df)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "12a6b170-bc99-4a40-a7f5-fb902c19ca7d",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "12a6b170-bc99-4a40-a7f5-fb902c19ca7d",
"outputId": "250a2d28-1ec1-401b-ff9e-1e6ff74ad7bd",
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"tfhub_use download started this may take some time.\n",
"Approximate size to download 923.7 MB\n",
"[OK!]\n"
]
}
],
"source": [
"documentAssembler = nlp.DocumentAssembler()\\\n",
" .setInputCol(\"comp_abbreviation_var\")\\\n",
" .setOutputCol(\"sentence\")\n",
"\n",
"embeddings = nlp.UniversalSentenceEncoder.pretrained(\"tfhub_use\", \"en\") \\\n",
" .setInputCols(\"sentence\") \\\n",
" .setOutputCol(\"sentence_embeddings\")\n",
"\n",
"training_pipeline = nlp.Pipeline(stages = [\n",
" documentAssembler,\n",
" embeddings])\n",
"\n",
"training_model = training_pipeline.fit(data)\n",
"\n",
"final_data = training_model.transform(data)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5VtDRIrunKX5",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "5VtDRIrunKX5",
"outputId": "6a069a56-23ef-491a-ffa5-d0794fda8016"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+--------------------+----------+---------------------+--------------------+--------------------+\n",
"| company_name|irs_number|comp_abbreviation_var| sentence| sentence_embeddings|\n",
"+--------------------+----------+---------------------+--------------------+--------------------+\n",
"|StepOne Personal ...| 900785095| StepOne Personal ...|[{document, 0, 22...|[{sentence_embedd...|\n",
"|StepOne Personal ...| 900785095| StepOne Personal ...|[{document, 0, 26...|[{sentence_embedd...|\n",
"|StepOne Personal ...| 900785095| STEPONE PERSONAL ...|[{document, 0, 26...|[{sentence_embedd...|\n",
"|StepOne Personal ...| 900785095| StepOne Personal ...|[{document, 0, 26...|[{sentence_embedd...|\n",
"|StepOne Personal ...| 900785095| StepOne Personal ...|[{document, 0, 26...|[{sentence_embedd...|\n",
"|StepOne Personal ...| 900785095| StepOne Personal ...|[{document, 0, 27...|[{sentence_embedd...|\n",
"|StepOne Personal ...| 900785095| StepOne Personal ...|[{document, 0, 27...|[{sentence_embedd...|\n",
"|StepOne Personal ...| 900785095| Stepone Personal ...|[{document, 0, 26...|[{sentence_embedd...|\n",
"|StepOne Personal ...| 900785095| StepOne Personal ...|[{document, 0, 27...|[{sentence_embedd...|\n",
"|Equity One Net In...| 320467879| Equity One Net In...|[{document, 0, 24...|[{sentence_embedd...|\n",
"|Equity One Net In...| 320467879| Equity One Net In...|[{document, 0, 24...|[{sentence_embedd...|\n",
"|Equity One Net In...| 320467879| Equity One Net In...|[{document, 0, 24...|[{sentence_embedd...|\n",
"|Equity One Net In...| 320467879| Equity One Net In...|[{document, 0, 25...|[{sentence_embedd...|\n",
"|Equity One Net In...| 320467879| Equity One Net In...|[{document, 0, 20...|[{sentence_embedd...|\n",
"|Equity One Net In...| 320467879| EQUITY ONE NET IN...|[{document, 0, 24...|[{sentence_embedd...|\n",
"|Equity One Net In...| 320467879| Equity One Net In...|[{document, 0, 25...|[{sentence_embedd...|\n",
"|Equity One Net In...| 320467879| Equity One Net In...|[{document, 0, 25...|[{sentence_embedd...|\n",
"|AmeriCredit Autom...| 880475154| AmeriCredit Autom...|[{document, 0, 39...|[{sentence_embedd...|\n",
"|GROUNDFLOOR FINAN...| 463414189| GROUNDFLOOR FINAN...|[{document, 0, 23...|[{sentence_embedd...|\n",
"|GROUNDFLOOR FINAN...| 463414189| GROUNDFLOOR FINAN...|[{document, 0, 23...|[{sentence_embedd...|\n",
"+--------------------+----------+---------------------+--------------------+--------------------+\n",
"only showing top 20 rows\n",
"\n"
]
}
],
"source": [
"final_data.show()"
]
},
{
"cell_type": "markdown",
"id": "yRtd3nRHnVMQ",
"metadata": {
"id": "yRtd3nRHnVMQ"
},
"source": [
"The training dataframe now has a `sentence_embeddings` column, which we will use as the input when training the model."
]
},
{
"cell_type": "markdown",
"id": "Mnc5wuJ-ntEo",
"metadata": {
"id": "Mnc5wuJ-ntEo"
},
"source": [
"## Train Model"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "176e74ba-6867-469b-aebc-f510f8d78a08",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "176e74ba-6867-469b-aebc-f510f8d78a08",
"outputId": "d8ee8fdc-d23e-466c-d8e9-f2411ca18e57",
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 108 ms, sys: 13.8 ms, total: 121 ms\n",
"Wall time: 15.3 s\n"
]
}
],
"source": [
"%%time\n",
"use = legal.SentenceEntityResolverApproach()\\\n",
" .setNeighbours(50)\\\n",
" .setThreshold(10000)\\\n",
" .setInputCols(\"sentence_embeddings\")\\\n",
" .setLabelCol(\"company_name\")\\\n",
" .setOutputCol('original_company_name')\\\n",
" .setNormalizedCol(\"company_name\")\\\n",
" .setDistanceFunction(\"EUCLIDEAN\")\\\n",
" .setCaseSensitive(False)\\\n",
" .setUseAuxLabel(True)\\\n",
" .setAuxLabelCol('irs_number')\n",
"\n",
"model = use.fit(final_data)\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "oELEGmD0V_Th",
"metadata": {
"id": "oELEGmD0V_Th"
},
"outputs": [],
"source": [
"# Save model\n",
"model.write().overwrite().save(\"use_company_name\")"
]
},
{
"cell_type": "markdown",
"id": "091132bb-e472-444a-bc6d-ef502dc3e1dd",
"metadata": {
"id": "091132bb-e472-444a-bc6d-ef502dc3e1dd"
},
"source": [
"## Test Model"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b2141e1b-9e7d-4081-812c-6ca606cf1687",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "b2141e1b-9e7d-4081-812c-6ca606cf1687",
"outputId": "52077d9a-ba4d-4236-b783-d976b2b7c9d0"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"tfhub_use download started this may take some time.\n",
"Approximate size to download 923.7 MB\n",
"[OK!]\n"
]
}
],
"source": [
"documentAssembler = nlp.DocumentAssembler()\\\n",
" .setInputCol(\"text\")\\\n",
" .setOutputCol(\"ner_chunk\")\n",
"\n",
"embeddings = nlp.UniversalSentenceEncoder.pretrained(\"tfhub_use\", \"en\") \\\n",
" .setInputCols(\"ner_chunk\") \\\n",
" .setOutputCol(\"sentence_embeddings\")\n",
" \n",
"resolver = legal.SentenceEntityResolverModel.load(\"use_company_name\") \\\n",
" .setInputCols([\"sentence_embeddings\"]) \\\n",
" .setOutputCol(\"normalized_name\")\\\n",
" .setDistanceFunction(\"EUCLIDEAN\")\n",
"\n",
"pipeline = nlp.Pipeline(\n",
" stages = [\n",
" documentAssembler,\n",
" embeddings,\n",
" resolver,\n",
" ])\n",
"\n",
"empty_data = spark.createDataFrame([[\"\"]]).toDF(\"text\")\n",
"\n",
"model = pipeline.fit(empty_data)\n",
"\n",
"light_model= nlp.LightPipeline(model)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7e0ba1ec-13d8-42b3-9a53-24cf85fd8ad7",
"metadata": {
"id": "7e0ba1ec-13d8-42b3-9a53-24cf85fd8ad7"
},
"outputs": [],
"source": [
"# Helper: run a LightPipeline on `text` and return the resolution results as a pandas DataFrame\n",
"\n",
"import pandas as pd\n",
"pd.set_option('display.max_colwidth', 0)\n",
"\n",
"def get_codes(lp, text, vocab='company_name', hcc=False):\n",
"\n",
"    full_light_result = lp.fullAnnotate(text)\n",
"\n",
"    chunks = []\n",
"    codes = []\n",
"    begin = []\n",
"    end = []\n",
"    resolutions = []\n",
"    all_distances = []\n",
"    all_codes = []\n",
"    all_cosines = []\n",
"    all_k_aux_labels = []\n",
"\n",
"    # Pair each input chunk with its resolver annotation from the `vocab` output column\n",
"    for chunk, code in zip(full_light_result[0]['ner_chunk'], full_light_result[0][vocab]):\n",
"\n",
"        begin.append(chunk.begin)\n",
"        end.append(chunk.end)\n",
"        chunks.append(chunk.result)\n",
"        codes.append(code.result)\n",
"        # The top-k candidates are stored in the metadata as ':::'-separated strings\n",
"        all_codes.append(code.metadata['all_k_results'].split(':::'))\n",
"        resolutions.append(code.metadata['all_k_resolutions'].split(':::'))\n",
"        all_distances.append(code.metadata['all_k_distances'].split(':::'))\n",
"        all_cosines.append(code.metadata['all_k_cosine_distances'].split(':::'))\n",
"        if hcc:\n",
"            try:\n",
"                all_k_aux_labels.append(code.metadata['all_k_aux_labels'].split(':::'))\n",
"            except KeyError:\n",
"                # No auxiliary labels in the metadata for this annotation\n",
"                all_k_aux_labels.append([])\n",
"        else:\n",
"            all_k_aux_labels.append([])\n",
"\n",
"    # Note: the 'all_distances' column reports the cosine distances (all_cosines)\n",
"    df = pd.DataFrame({'chunks': chunks, 'begin': begin, 'end': end, 'code': codes, 'all_codes': all_codes,\n",
"                       'resolutions': resolutions, 'all_k_aux_labels': all_k_aux_labels, 'all_distances': all_cosines})\n",
"\n",
"    if hcc:\n",
"\n",
"        # Auxiliary labels are '||'-separated triples\n",
"        df['billable'] = df['all_k_aux_labels'].apply(lambda x: [i.split('||')[0] for i in x])\n",
"        df['hcc_status'] = df['all_k_aux_labels'].apply(lambda x: [i.split('||')[1] for i in x])\n",
"        df['hcc_code'] = df['all_k_aux_labels'].apply(lambda x: [i.split('||')[2] for i in x])\n",
"\n",
"    df = df.drop(['all_k_aux_labels'], axis=1)\n",
"\n",
"    return df"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "57312793-ba38-4bbd-b042-42a229f61221",
"metadata": {
"id": "57312793-ba38-4bbd-b042-42a229f61221"
},
"outputs": [],
"source": [
"text = \"AmeriCann Inc\""
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "zpEhjGo1UiuA",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 177
},
"id": "zpEhjGo1UiuA",
"outputId": "52c653ef-153e-405c-d999-995d1f427bf4"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 2 µs, sys: 1e+03 ns, total: 3 µs\n",
"Wall time: 5.01 µs\n"
]
},
{
"data": {
"text/plain": [
" chunks begin end code \\\n",
"0 AmeriCann Inc 0 12 AmeriCann, Inc. \n",
"\n",
" all_codes \\\n",
"0 [AmeriCann, Inc., LUMIOX, INC., AGILYSYS INC, Ameresco, Inc., IMMUCOR INC, AAON INC, CRYOLIFE INC] \n",
"\n",
" resolutions \\\n",
"0 [AmeriCann, Inc., LUMIOX, INC., AGILYSYS INC, Ameresco, Inc., IMMUCOR INC, AAON INC, CRYOLIFE INC] \n",
"\n",
" all_distances \n",
"0 [0.0000, 0.1080, 0.1110, 0.1133, 0.1145, 0.1165, 0.1170] "
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%time get_codes(light_model, text, vocab='normalized_name')"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1dc7ce8e-30e4-433f-8ae5-a8645bd21d1b",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 177
},
"id": "1dc7ce8e-30e4-433f-8ae5-a8645bd21d1b",
"outputId": "1d3eff93-6cc7-402e-bc07-5daba68e12b7"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 9.2 ms, sys: 277 µs, total: 9.48 ms\n",
"Wall time: 61.2 ms\n"
]
},
{
"data": {
"text/plain": [
" chunks begin end code \\\n",
"0 AmeriCann inc 0 12 AmeriCann, Inc. \n",
"\n",
" all_codes \\\n",
"0 [AmeriCann, Inc., LUMIOX, INC., AGILYSYS INC, Ameresco, Inc., IMMUCOR INC, AAON INC, CRYOLIFE INC] \n",
"\n",
" resolutions \\\n",
"0 [AmeriCann, Inc., LUMIOX, INC., AGILYSYS INC, Ameresco, Inc., IMMUCOR INC, AAON INC, CRYOLIFE INC] \n",
"\n",
" all_distances \n",
"0 [0.0000, 0.1080, 0.1110, 0.1133, 0.1145, 0.1165, 0.1170] "
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"text = 'AmeriCann inc'\n",
"\n",
"%time get_codes (light_model, text, vocab='normalized_name')"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ebd35257-7249-4bb4-b3ed-715fe65ce816",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 177
},
"id": "ebd35257-7249-4bb4-b3ed-715fe65ce816",
"outputId": "84f5c441-a01a-4889-8400-43ceadc99381",
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 4.66 ms, sys: 2.18 ms, total: 6.84 ms\n",
"Wall time: 52.7 ms\n"
]
},
{
"data": {
"text/plain": [
" chunks begin end code \\\n",
"0 StepOne Personal Health inc 0 26 StepOne Personal Health, Inc. \n",
"\n",
" all_codes \\\n",
"0 [StepOne Personal Health, Inc., Kura Oncology, Inc., Axsome Therapeutics, Inc., CVS HEALTH Corp, EDGEWELL PERSONAL CARE Co, Cardiovascular Systems Inc, CESCA THERAPEUTICS INC.] \n",
"\n",
" resolutions \\\n",
"0 [StepOne Personal Health, Inc., Kura Oncology, Inc., Axsome Therapeutics, Inc., CVS HEALTH Corp, EDGEWELL PERSONAL CARE Co, Cardiovascular Systems Inc, CESCA THERAPEUTICS INC.] \n",
"\n",
" all_distances \n",
"0 [0.0000, 0.2224, 0.2714, 0.2729, 0.2802, 0.2868, 0.2874] "
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"text = 'StepOne Personal Health inc'\n",
"\n",
"%time get_codes (light_model, text, vocab='normalized_name')"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "60765bb9-9b85-412c-bf72-214a4714662f",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 194
},
"id": "60765bb9-9b85-412c-bf72-214a4714662f",
"outputId": "485b8be5-9a0e-4f61-8a38-7ce0459543a1"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 7.07 ms, sys: 732 µs, total: 7.81 ms\n",
"Wall time: 67 ms\n"
]
},
{
"data": {
"text/plain": [
" chunks begin end code \\\n",
"0 Alzamend Neuro INC 0 17 Alzamend Neuro, Inc. \n",
"\n",
" all_codes \\\n",
"0 [Alzamend Neuro, Inc., Kura Oncology, Inc., REGENERON PHARMACEUTICALS INC, Dipexium Pharmaceuticals, Inc., AEOLUS PHARMACEUTICALS, INC., Flex Pharma, Inc., PROTO SCRIPT PHARMACEUTICAL CORP] \n",
"\n",
" resolutions \\\n",
"0 [Alzamend Neuro, Inc., Kura Oncology, Inc., REGENERON PHARMACEUTICALS INC, Dipexium Pharmaceuticals, Inc., AEOLUS PHARMACEUTICALS, INC., Flex Pharma, Inc., PROTO SCRIPT PHARMACEUTICAL CORP] \n",
"\n",
" all_distances \n",
"0 [0.0000, 0.1704, 0.1802, 0.1934, 0.2149, 0.2162, 0.2254] "
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"text = 'Alzamend Neuro INC'\n",
"\n",
"%time get_codes (light_model, text, vocab='normalized_name')"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3eda13a5-4b4e-448e-bf74-f09c14789920",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 194
},
"id": "3eda13a5-4b4e-448e-bf74-f09c14789920",
"outputId": "94af733c-7601-432b-bcb7-04996b970204"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 7.98 ms, sys: 2.37 ms, total: 10.3 ms\n",
"Wall time: 56.6 ms\n"
]
},
{
"data": {
"text/plain": [
" chunks begin end code \\\n",
"0 MMEX Resources Corporation 0 25 MMEX Resources Corp \n",
"\n",
" all_codes \\\n",
"0 [MMEX Resources Corp, ANTERO RESOURCES Corp, ARTESIAN RESOURCES CORP, ESTERLINE TECHNOLOGIES CORP, Timberline Resources Corp, CATALYST PAPER CORP, INFRASTRUCTURE DEVELOPMENTS CORP.] \n",
"\n",
" resolutions \\\n",
"0 [MMEX Resources Corp, ANTERO RESOURCES Corp, ARTESIAN RESOURCES CORP, ESTERLINE TECHNOLOGIES CORP, Timberline Resources Corp, CATALYST PAPER CORP, INFRASTRUCTURE DEVELOPMENTS CORP.] \n",
"\n",
" all_distances \n",
"0 [0.1096, 0.1540, 0.1624, 0.2054, 0.2202, 0.2406, 0.2451] "
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"text = 'MMEX Resources Corporation'\n",
"\n",
"%time get_codes (light_model, text, vocab='normalized_name')"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "40651b68-d4a3-4ade-ae5a-60994c4314d1",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 194
},
"id": "40651b68-d4a3-4ade-ae5a-60994c4314d1",
"outputId": "53c0850c-a470-43c3-d3b2-c791f1708413"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 7.89 ms, sys: 941 µs, total: 8.83 ms\n",
"Wall time: 43.4 ms\n"
]
},
{
"data": {
"text/plain": [
" chunks begin end code \\\n",
"0 Alphadyne Asset Management Lp. 0 29 Alphadyne Asset Management LP \n",
"\n",
" all_codes \\\n",
"0 [Alphadyne Asset Management LP, YACKTMAN ASSET MANAGEMENT LP, TOCQUEVILLE ASSET MANAGEMENT L.P., SYSTEMATIC FINANCIAL MANAGEMENT LP, Madyson Equity Group, LP, AMERIGAS PARTNERS LP, CAPRIN ASSET MANAGEMENT LLC /ADV, ALLIANCEBERNSTEIN HOLDING L.P.] \n",
"\n",
" resolutions \\\n",
"0 [Alphadyne Asset Management LP, YACKTMAN ASSET MANAGEMENT LP, TOCQUEVILLE ASSET MANAGEMENT L.P., SYSTEMATIC FINANCIAL MANAGEMENT LP, Madyson Equity Group, LP, AMERIGAS PARTNERS LP, CAPRIN ASSET MANAGEMENT LLC /ADV, ALLIANCEBERNSTEIN HOLDING L.P.] \n",
"\n",
" all_distances \n",
"0 [0.0000, 0.0724, 0.1040, 0.2378, 0.2470, 0.2570, 0.2614, 0.2722] "
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"text = 'Alphadyne Asset Management Lp.'\n",
"\n",
"%time get_codes (light_model, text, vocab='normalized_name')"
]
}
],
"metadata": {
"colab": {
"provenance": [],
"toc_visible": true
},
"gpuClass": "standard",
"kernelspec": {
"display_name": "Python 3",
"name": "python3"
},
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 5
}