![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/09.2.Entity_Resolution_Training.ipynb)

# Installation

In [None]:
! pip install -q johnsnowlabs

## Automatic Installation
Using my.johnsnowlabs.com SSO

In [None]:
from johnsnowlabs import nlp, legal

# nlp.install(force_browser=True)

## Manual downloading
If you are not registered at my.johnsnowlabs.com, received your license by email, or are using Safari, you may need to install the license manually.

- Go to my.johnsnowlabs.com
- Download your license
- Upload it using the following command

In [None]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

- Install it

In [None]:
nlp.install()

# Starting

In [None]:
spark = nlp.start()

## Entity Resolution Training

Here we train a legal entity resolver on a sample dataset. Specifically, we train a company name normalization model. All dataset columns must have the `object` (string) dtype.

Let's start training.

## Load Dataset

In [None]:
! wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/legal-nlp/data/sample_company_name.csv

In [None]:
import pandas as pd

df = pd.read_csv('sample_company_name.csv')
df

Unnamed: 0,company_name,irs_number,comp_abbreviation_var
0,"StepOne Personal Health, Inc.",900785095,StepOne Personal Health
1,"StepOne Personal Health, Inc.",900785095,StepOne Personal Health Inc
2,"StepOne Personal Health, Inc.",900785095,STEPONE PERSONAL HEALTH INC
3,"StepOne Personal Health, Inc.",900785095,StepOne Personal Health inc
4,"StepOne Personal Health, Inc.",900785095,StepOne Personal Health INC
...,...,...,...
9995,INGLES MARKETS INC,560846267,Ingles Markets Inc
9996,INGLES MARKETS INC,560846267,INGLES MARKETS Inc.
9997,INGLES MARKETS INC,560846267,INGLES MARKETS inc.
9998,INGLES MARKETS INC,560846267,INGLES MARKETS INC
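
Each canonical `company_name` is paired with several spelling variants in `comp_abbreviation_var`. The notebook does not say how these variants were produced. The sketch below shows one possible way to generate similar casing and punctuation variants for your own data. The helper name `suffix_variants` and its rules are assumptions made for illustration, not part of the John Snow Labs library or of this dataset's actual preparation.

```python
from typing import List


def suffix_variants(canonical: str) -> List[str]:
    """Return simple casing/punctuation variants of a company name (hypothetical helper)."""
    # Drop commas and a trailing period:
    # "StepOne Personal Health, Inc." -> "StepOne Personal Health Inc"
    plain = canonical.replace(",", "").rstrip(".")
    # Split off the last token, which is usually the legal suffix (Inc, Corp, LP, ...)
    stem, _, suffix = plain.rpartition(" ")
    variants = [plain, plain.upper()]
    if stem:
        # Vary only the casing of the legal suffix
        variants += [f"{stem} {suffix.lower()}", f"{stem} {suffix.upper()}"]
    # Remove duplicates while preserving order
    return list(dict.fromkeys(variants))


# Build (company_name, comp_abbreviation_var) pairs for one canonical name
rows = [
    (canonical, variant)
    for canonical in ["StepOne Personal Health, Inc."]
    for variant in suffix_variants(canonical)
]
for row in rows:
    print(row)
```

Pairs built this way could be loaded into a pandas dataframe with the same two text columns used in this notebook.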


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 3 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   company_name           10000 non-null  object
 1   irs_number             10000 non-null  int64 
 2   comp_abbreviation_var  10000 non-null  object
dtypes: int64(1), object(2)
memory usage: 234.5+ KB


In [None]:
# Cast every column to str so that pandas stores it with the object dtype
df['comp_abbreviation_var'] = df['comp_abbreviation_var'].astype(str)
df['irs_number'] = df['irs_number'].astype(str)
df['company_name'] = df['company_name'].astype(str)

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 3 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   company_name           10000 non-null  object
 1   irs_number             10000 non-null  object
 2   comp_abbreviation_var  10000 non-null  object
dtypes: object(3)
memory usage: 234.5+ KB


In [None]:
df.shape

(10000, 3)

## Get Embeddings
Next, we compute sentence embeddings for the `comp_abbreviation_var` column.

In [None]:
data = spark.createDataFrame(df)

In [None]:
documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("comp_abbreviation_var")\
    .setOutputCol("sentence")

embeddings = nlp.UniversalSentenceEncoder.pretrained("tfhub_use", "en") \
    .setInputCols("sentence") \
    .setOutputCol("sentence_embeddings")

training_pipeline = nlp.Pipeline(stages = [
    documentAssembler,
    embeddings])

training_model = training_pipeline.fit(data)

final_data = training_model.transform(data)

tfhub_use download started this may take some time.
Approximate size to download 923.7 MB
[OK!]


In [None]:
final_data.show()

+--------------------+----------+---------------------+--------------------+--------------------+
|        company_name|irs_number|comp_abbreviation_var|            sentence| sentence_embeddings|
+--------------------+----------+---------------------+--------------------+--------------------+
|StepOne Personal ...| 900785095| StepOne Personal ...|[{document, 0, 22...|[{sentence_embedd...|
|StepOne Personal ...| 900785095| StepOne Personal ...|[{document, 0, 26...|[{sentence_embedd...|
|StepOne Personal ...| 900785095| STEPONE PERSONAL ...|[{document, 0, 26...|[{sentence_embedd...|
|StepOne Personal ...| 900785095| StepOne Personal ...|[{document, 0, 26...|[{sentence_embedd...|
|StepOne Personal ...| 900785095| StepOne Personal ...|[{document, 0, 26...|[{sentence_embedd...|
|StepOne Personal ...| 900785095| StepOne Personal ...|[{document, 0, 27...|[{sentence_embedd...|
|StepOne Personal ...| 900785095| StepOne Personal ...|[{document, 0, 27...|[{sentence_embedd...|
|StepOne Personal ..

The training dataframe now has a `sentence_embeddings` column, which we will use as the input for training the model.

## Train Model

In [None]:
%%time
use = legal.SentenceEntityResolverApproach()\
  .setNeighbours(50)\
  .setThreshold(10000)\
  .setInputCols("sentence_embeddings")\
  .setLabelCol("company_name")\
  .setOutputCol('original_company_name')\
  .setNormalizedCol("company_name")\
  .setDistanceFunction("EUCLIDEAN")\
  .setCaseSensitive(False)\
  .setUseAuxLabel(True)\
  .setAuxLabelCol('irs_number')

model = use.fit(final_data)


CPU times: user 108 ms, sys: 13.8 ms, total: 121 ms
Wall time: 15.3 s
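
Conceptually, a resolver of this kind looks up the training embeddings that lie closest to a query embedding and returns their labels. The sketch below illustrates only that idea, using plain Python and made-up three-dimensional vectors in place of real sentence embeddings. It is not how `SentenceEntityResolverApproach` is implemented, and the names `euclidean`, `nearest_labels`, and `toy_index` are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_labels(query, index, k=2):
    """Return the k (label, distance) pairs closest to the query vector."""
    scored = [(label, euclidean(query, vec)) for label, vec in index.items()]
    return sorted(scored, key=lambda pair: pair[1])[:k]


# Toy vectors standing in for real sentence embeddings (values are made up)
toy_index = {
    "AmeriCann, Inc.": [1.0, 0.0, 0.0],
    "AAON INC": [0.0, 1.0, 0.0],
    "CRYOLIFE INC": [0.0, 0.0, 1.0],
}

result = nearest_labels([0.9, 0.1, 0.0], toy_index, k=2)
for label, distance in result:
    print(f"{label}: {distance:.4f}")
```

In this toy setting, `k` plays a role loosely comparable to `setNeighbours(50)` above, and `euclidean` to `setDistanceFunction("EUCLIDEAN")`.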


In [None]:
# Save model
model.write().overwrite().save("use_company_name")

## Test Model

In [None]:
documentAssembler = nlp.DocumentAssembler()\
      .setInputCol("text")\
      .setOutputCol("ner_chunk")

embeddings = nlp.UniversalSentenceEncoder.pretrained("tfhub_use", "en") \
      .setInputCols("ner_chunk") \
      .setOutputCol("sentence_embeddings")
    
resolver = legal.SentenceEntityResolverModel.load("use_company_name") \
      .setInputCols(["sentence_embeddings"]) \
      .setOutputCol("normalized_name")\
      .setDistanceFunction("EUCLIDEAN")

pipeline = nlp.Pipeline(
      stages = [
          documentAssembler,
          embeddings,
          resolver,
      ])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = pipeline.fit(empty_data)

light_model = nlp.LightPipeline(model)

tfhub_use download started this may take some time.
Approximate size to download 923.7 MB
[OK!]


In [None]:
# Helper that returns the LightPipeline (LP) resolution results as a pandas DataFrame

import pandas as pd
pd.set_option('display.max_colwidth', 0)

def get_codes (lp, text, vocab='company_name', hcc=False):
    
    full_light_result = lp.fullAnnotate(text)

    chunks = []
    codes = []
    begin = []
    end = []
    resolutions=[]
    all_distances =[]
    all_codes=[]
    all_cosines = []
    all_k_aux_labels=[]

    for chunk, code in zip(full_light_result[0]['ner_chunk'], full_light_result[0][vocab]):
            
        begin.append(chunk.begin)
        end.append(chunk.end)
        chunks.append(chunk.result)
        codes.append(code.result) 
        all_codes.append(code.metadata['all_k_results'].split(':::'))
        resolutions.append(code.metadata['all_k_resolutions'].split(':::'))
        all_distances.append(code.metadata['all_k_distances'].split(':::'))
        all_cosines.append(code.metadata['all_k_cosine_distances'].split(':::'))
        if hcc:
            try:
                all_k_aux_labels.append(code.metadata['all_k_aux_labels'].split(':::'))
            except KeyError:
                all_k_aux_labels.append([])
        else:
            all_k_aux_labels.append([])

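    # Note: the 'all_distances' column is filled with the cosine distances (all_cosines),
    # not with the all_distances list collected above.
    df = pd.DataFrame({'chunks':chunks, 'begin': begin, 'end':end, 'code':codes, 'all_codes':all_codes, 
                       'resolutions':resolutions, 'all_k_aux_labels':all_k_aux_labels,'all_distances':all_cosines})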
    
    if hcc:

        df['billable'] = df['all_k_aux_labels'].apply(lambda x: [i.split('||')[0] for i in x])
        df['hcc_status'] = df['all_k_aux_labels'].apply(lambda x: [i.split('||')[1] for i in x])
        df['hcc_code'] = df['all_k_aux_labels'].apply(lambda x: [i.split('||')[2] for i in x])

    df = df.drop(['all_k_aux_labels'], axis=1)
    
    return df
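
The helper above relies on the resolver packing its candidate lists into `:::`-separated metadata strings. A minimal sketch of that parsing step, run on a hand-written dictionary instead of a real annotation, looks like this. The function name `split_candidates` is hypothetical, and the example values are copied from the first test result shown later in this notebook.

```python
def split_candidates(metadata, sep=":::"):
    """Pair each candidate resolution with its cosine distance."""
    names = metadata["all_k_resolutions"].split(sep)
    distances = [float(d) for d in metadata["all_k_cosine_distances"].split(sep)]
    return list(zip(names, distances))


# Hand-written stand-in for `code.metadata` (only two of its keys, first three candidates)
example_metadata = {
    "all_k_resolutions": "AmeriCann, Inc.:::LUMIOX, INC.:::AGILYSYS INC",
    "all_k_cosine_distances": "0.0000:::0.1080:::0.1110",
}

pairs = split_candidates(example_metadata)
for name, distance in pairs:
    print(f"{distance:.4f}  {name}")
```

Splitting on `:::` rather than on commas matters here because many company names contain commas themselves.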

In [None]:
text = "AmeriCann Inc"

In [None]:
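# Note: `%time` on a line by itself times an empty statement, so the timing
# printed below does not measure get_codes. The later cells put both on one line.
%time 
get_codes (light_model, text, vocab = 'normalized_name')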

CPU times: user 2 µs, sys: 1e+03 ns, total: 3 µs
Wall time: 5.01 µs


Unnamed: 0,chunks,begin,end,code,all_codes,resolutions,all_distances
0,AmeriCann Inc,0,12,"AmeriCann, Inc.","[AmeriCann, Inc., LUMIOX, INC., AGILYSYS INC, Ameresco, Inc., IMMUCOR INC, AAON INC, CRYOLIFE INC]","[AmeriCann, Inc., LUMIOX, INC., AGILYSYS INC, Ameresco, Inc., IMMUCOR INC, AAON INC, CRYOLIFE INC]","[0.0000, 0.1080, 0.1110, 0.1133, 0.1145, 0.1165, 0.1170]"


In [None]:
text = 'AmeriCann inc'

%time get_codes (light_model, text, vocab='normalized_name')

CPU times: user 9.2 ms, sys: 277 µs, total: 9.48 ms
Wall time: 61.2 ms


Unnamed: 0,chunks,begin,end,code,all_codes,resolutions,all_distances
0,AmeriCann inc,0,12,"AmeriCann, Inc.","[AmeriCann, Inc., LUMIOX, INC., AGILYSYS INC, Ameresco, Inc., IMMUCOR INC, AAON INC, CRYOLIFE INC]","[AmeriCann, Inc., LUMIOX, INC., AGILYSYS INC, Ameresco, Inc., IMMUCOR INC, AAON INC, CRYOLIFE INC]","[0.0000, 0.1080, 0.1110, 0.1133, 0.1145, 0.1165, 0.1170]"


In [None]:
text = 'StepOne Personal Health inc'

%time get_codes (light_model, text, vocab='normalized_name')

CPU times: user 4.66 ms, sys: 2.18 ms, total: 6.84 ms
Wall time: 52.7 ms


Unnamed: 0,chunks,begin,end,code,all_codes,resolutions,all_distances
0,StepOne Personal Health inc,0,26,"StepOne Personal Health, Inc.","[StepOne Personal Health, Inc., Kura Oncology, Inc., Axsome Therapeutics, Inc., CVS HEALTH Corp, EDGEWELL PERSONAL CARE Co, Cardiovascular Systems Inc, CESCA THERAPEUTICS INC.]","[StepOne Personal Health, Inc., Kura Oncology, Inc., Axsome Therapeutics, Inc., CVS HEALTH Corp, EDGEWELL PERSONAL CARE Co, Cardiovascular Systems Inc, CESCA THERAPEUTICS INC.]","[0.0000, 0.2224, 0.2714, 0.2729, 0.2802, 0.2868, 0.2874]"


In [None]:
text = 'Alzamend Neuro INC'

%time get_codes (light_model, text, vocab='normalized_name')

CPU times: user 7.07 ms, sys: 732 µs, total: 7.81 ms
Wall time: 67 ms


Unnamed: 0,chunks,begin,end,code,all_codes,resolutions,all_distances
0,Alzamend Neuro INC,0,17,"Alzamend Neuro, Inc.","[Alzamend Neuro, Inc., Kura Oncology, Inc., REGENERON PHARMACEUTICALS INC, Dipexium Pharmaceuticals, Inc., AEOLUS PHARMACEUTICALS, INC., Flex Pharma, Inc., PROTO SCRIPT PHARMACEUTICAL CORP]","[Alzamend Neuro, Inc., Kura Oncology, Inc., REGENERON PHARMACEUTICALS INC, Dipexium Pharmaceuticals, Inc., AEOLUS PHARMACEUTICALS, INC., Flex Pharma, Inc., PROTO SCRIPT PHARMACEUTICAL CORP]","[0.0000, 0.1704, 0.1802, 0.1934, 0.2149, 0.2162, 0.2254]"


In [None]:
text = 'MMEX Resources Corporation'

%time get_codes (light_model, text, vocab='normalized_name')

CPU times: user 7.98 ms, sys: 2.37 ms, total: 10.3 ms
Wall time: 56.6 ms


Unnamed: 0,chunks,begin,end,code,all_codes,resolutions,all_distances
0,MMEX Resources Corporation,0,25,MMEX Resources Corp,"[MMEX Resources Corp, ANTERO RESOURCES Corp, ARTESIAN RESOURCES CORP, ESTERLINE TECHNOLOGIES CORP, Timberline Resources Corp, CATALYST PAPER CORP, INFRASTRUCTURE DEVELOPMENTS CORP.]","[MMEX Resources Corp, ANTERO RESOURCES Corp, ARTESIAN RESOURCES CORP, ESTERLINE TECHNOLOGIES CORP, Timberline Resources Corp, CATALYST PAPER CORP, INFRASTRUCTURE DEVELOPMENTS CORP.]","[0.1096, 0.1540, 0.1624, 0.2054, 0.2202, 0.2406, 0.2451]"


In [None]:
text = 'Alphadyne Asset Management Lp.'

%time get_codes (light_model, text, vocab='normalized_name')

CPU times: user 7.89 ms, sys: 941 µs, total: 8.83 ms
Wall time: 43.4 ms


Unnamed: 0,chunks,begin,end,code,all_codes,resolutions,all_distances
0,Alphadyne Asset Management Lp.,0,29,Alphadyne Asset Management LP,"[Alphadyne Asset Management LP, YACKTMAN ASSET MANAGEMENT LP, TOCQUEVILLE ASSET MANAGEMENT L.P., SYSTEMATIC FINANCIAL MANAGEMENT LP, Madyson Equity Group, LP, AMERIGAS PARTNERS LP, CAPRIN ASSET MANAGEMENT LLC /ADV, ALLIANCEBERNSTEIN HOLDING L.P.]","[Alphadyne Asset Management LP, YACKTMAN ASSET MANAGEMENT LP, TOCQUEVILLE ASSET MANAGEMENT L.P., SYSTEMATIC FINANCIAL MANAGEMENT LP, Madyson Equity Group, LP, AMERIGAS PARTNERS LP, CAPRIN ASSET MANAGEMENT LLC /ADV, ALLIANCEBERNSTEIN HOLDING L.P.]","[0.0000, 0.0724, 0.1040, 0.2378, 0.2470, 0.2570, 0.2614, 0.2722]"
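
In the results above, the top candidate has a distance of 0.0000 when the input matches a training variant and a small non-zero distance otherwise (0.1096 for 'MMEX Resources Corporation'). One way to use this, sketched below, is to accept a resolution only when its distance stays under a cutoff. The function `accept_resolution` and the cutoff values 0.15 and 0.05 are assumptions chosen for illustration. They are not recommendations from the library and would need tuning on your own data.

```python
def accept_resolution(name, distance, max_distance=0.15):
    """Return the resolved name if its distance is within the cutoff, else None."""
    return name if distance <= max_distance else None


# (input text, top resolution, top distance) copied from the results above
top_hits = [
    ("AmeriCann Inc", "AmeriCann, Inc.", 0.0000),
    ("MMEX Resources Corporation", "MMEX Resources Corp", 0.1096),
]

# Lenient cutoff: both inputs are resolved
accepted = {text: accept_resolution(name, dist) for text, name, dist in top_hits}

# Stricter cutoff: only the exact-variant match is kept
strict = {
    text: accept_resolution(name, dist, max_distance=0.05)
    for text, name, dist in top_hits
}

print(accepted)
print(strict)
```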
