![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/05.0.NER_and_ZeroShotNER.ipynb)

# Legal Named Entity Recognition (NER) and Zero-shot NER

#🎬 Installation

In [None]:
! pip install -q johnsnowlabs

##🔗 Automatic Installation
Using my.johnsnowlabs.com SSO

In [None]:
from johnsnowlabs import nlp, legal

# nlp.install(force_browser=True)

##🔗 Manual downloading
📚If you are not registered at my.johnsnowlabs.com, received your license by email, or are using Safari, you may need to update the license manually.

- Go to my.johnsnowlabs.com
- Download your license
- Upload it using the following command

In [None]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

- Install it

In [None]:
nlp.install()

#📌 Starting

In [None]:
spark = nlp.start()

##🔎 NER Model Implementation in Spark NLP

The NER model in Spark NLP uses a BiLSTM-CNN-Char deep neural network architecture, a slightly modified version of the one proposed by Jason P.C. Chiu and Eric Nichols ([Named Entity Recognition with Bidirectional LSTM-CNNs](https://arxiv.org/abs/1511.08308)). This hybrid bidirectional LSTM and CNN architecture automatically detects word-level and character-level features, eliminating the need for most feature engineering steps.

In the original framework, a CNN extracts a fixed-length feature vector from the character-level features of each word. For each word, this vector is concatenated with the word-level features and fed to the BLSTM network and then to the output layers. The authors use a stacked bidirectional recurrent neural network with long short-term memory units to transform word features into named entity tag scores: the extracted features of each word are fed into a forward LSTM network and a backward LSTM network, and the output of each network at each time step is decoded by a linear layer and a log-softmax layer into log-probabilities for each tag category. These two vectors are then simply added together to produce the final output. The original paper uses 50-dimensional pretrained word embeddings for word features, 25-dimensional character embeddings for character features, and capitalization features (allCaps, upperInitial, lowercase, mixedCaps, noinfo) for case features.
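Two steps of the description above can be sketched in plain Python: assigning a capitalization category to a token, and adding the forward and backward log-probabilities to pick a tag. This is a minimal illustration only. The exact casing rules, the helper names (`case_feature`, `log_softmax`, `combine_directions`), and the toy scores are assumptions, not taken from the paper or from Spark NLP.

```python
import math

def case_feature(token):
    """Map a token to one of the five case categories (rules are an assumption)."""
    letters = [ch for ch in token if ch.isalpha()]
    if not letters:
        return "noinfo"  # digits, punctuation, empty strings
    if all(ch.isupper() for ch in letters):
        return "allCaps"
    if all(ch.islower() for ch in letters):
        return "lowercase"
    if letters[0].isupper() and all(ch.islower() for ch in letters[1:]):
        return "upperInitial"
    return "mixedCaps"

def log_softmax(scores):
    """Numerically stable log-softmax over a list of raw scores."""
    peak = max(scores)
    log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
    return [s - log_norm for s in scores]

def combine_directions(forward_scores, backward_scores):
    """Add the log-probabilities of the forward and backward networks."""
    fwd = log_softmax(forward_scores)
    bwd = log_softmax(backward_scores)
    return [f + b for f, b in zip(fwd, bwd)]

tags = ["O", "B-PARTY", "I-PARTY"]
# Toy linear-layer outputs for a single token (invented for illustration).
combined = combine_directions([0.2, 2.0, 0.1], [0.1, 1.5, 0.3])
best_tag = tags[combined.index(max(combined))]

print([case_feature(t) for t in ["IMRS", "Delaware", "corporation", "1994", "iPhone"]])
print(best_tag)
```

Because the two directions are summed in log space, a tag must be plausible to both networks to win.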

###📌 Legal CuadNER Model

This model uses Named Entity Recognition to extract DOC (Document Type), PARTY (an entity signing a contract), ALIAS (the name by which a company is referred to later in the document) and EFFDATE (Effective Date of the contract).

In [None]:
documentAssembler = nlp.DocumentAssembler()\
        .setInputCol("text")\
        .setOutputCol("document")

sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\
        .setInputCols(["document"])\
        .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
        .setInputCols(["sentence"])\
        .setOutputCol("token")

embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base", "en") \
        .setInputCols("sentence", "token") \
        .setOutputCol("embeddings")

ner_model = legal.NerModel.pretrained("legner_contract_doc_parties", "en", "legal/models")\
        .setInputCols(["sentence", "token", "embeddings"])\
        .setOutputCol("ner")

ner_converter = nlp.NerConverter()\
        .setInputCols(["sentence","token","ner"])\
        .setOutputCol("ner_chunk")

nlpPipeline = nlp.Pipeline(stages=[
        documentAssembler,
        sentenceDetector,
        tokenizer,
        embeddings,
        ner_model,
        ner_converter])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = nlpPipeline.fit(empty_data)

sentence_detector_dl download started this may take some time.
Approximate size to download 514.9 KB
[OK!]
roberta_embeddings_legal_roberta_base download started this may take some time.
Approximate size to download 447.2 MB
[OK!]
legner_contract_doc_parties download started this may take some time.
[OK!]


In [None]:
# you can see pipeline stages with this code

model.stages

[DocumentAssembler_d02822bc3d37,
 SentenceDetectorDLModel_8aaebf7e098e,
 REGEX_TOKENIZER_a8b4485b4dba,
 ROBERTA_EMBEDDINGS_b915dff90901,
 MedicalNerModel_93f728ff96e5,
 NerConverter_c5758600563d]

In [None]:
# With this code, you can see which labels your NER model has.

ner_model.getClasses()

['O',
 'I-DOC',
 'B-EFFDATE',
 'B-ALIAS',
 'I-ALIAS',
 'B-PARTY',
 'I-EFFDATE',
 'I-PARTY',
 'B-DOC']
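The labels above follow the IOB scheme: `B-` marks the first token of an entity, `I-` a continuation, and `O` a token outside any entity. As a small sketch, the distinct entity types can be recovered from the label list with plain Python (the helper name `entity_types` is made up for this example):

```python
# Labels copied from the getClasses() output above.
classes = ['O', 'I-DOC', 'B-EFFDATE', 'B-ALIAS', 'I-ALIAS',
           'B-PARTY', 'I-EFFDATE', 'I-PARTY', 'B-DOC']

def entity_types(labels):
    """Strip the B-/I- prefixes and return the distinct entity types."""
    return sorted({label.split("-", 1)[1] for label in labels if label != "O"})

print(entity_types(classes))  # ['ALIAS', 'DOC', 'EFFDATE', 'PARTY']
```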

In [None]:
ner_model.extractParamMap()

# With extractParamMap() function, you can see the parameters of any annotators you are using.

{Param(parent='MedicalNerModel_93f728ff96e5', name='inferenceBatchSize', doc='number of sentences to process in a single batch during inference'): 1,
 Param(parent='MedicalNerModel_93f728ff96e5', name='labelCasing', doc='Setting all labels of the NER models upper/lower case. values upper|lower'): '',
 Param(parent='MedicalNerModel_93f728ff96e5', name='lazyAnnotator', doc='Whether this AnnotatorModel acts as lazy in RecursivePipelines'): False,
 Param(parent='MedicalNerModel_93f728ff96e5', name='includeConfidence', doc='whether to include confidence scores in annotation metadata'): True,
 Param(parent='MedicalNerModel_93f728ff96e5', name='includeAllConfidenceScores', doc='whether to include all confidence scores in annotation metadata or just the score of the predicted tag'): False,
 Param(parent='MedicalNerModel_93f728ff96e5', name='batchSize', doc='Size of every batch'): 256,
 Param(parent='MedicalNerModel_93f728ff96e5', name='classes', doc='get the tags used to trained this MedicalNe

####✔️ **Sample Text**

In [None]:
text = """EXCLUSIVE DISTRIBUTOR AGREEMENT (" Agreement ") dated as April 15, 1994 by and between IMRS OPERATIONS INC., a Delaware corporation with its principal place of business at 777 Long Ridge Road, Stamford, Connecticut 06902, U.S.A. (hereinafter referred to as " Developer ") and Delteq Pte Ltd, a Singapore company (and a subsidiary of Wuthelam Industries (S) Pte LTD ) with its principal place of business at 215 Henderson Road , #101-03 Henderson Industrial Park , Singapore 0315 ( hereinafter referred to as " Distributor ")."""

df = spark.createDataFrame([[text]]).toDF("text")

result = model.transform(df)

####🖨️ **Getting Result**

In [None]:
from pyspark.sql import functions as F

result.select(F.explode(F.arrays_zip(result.token.result, 
                                     result.ner.result, 
                                     result.ner.metadata)).alias("cols"))\
                  .select(F.expr("cols['0']").alias("token"),
                          F.expr("cols['1']").alias("ner_label"),
                          F.expr("cols['2']['confidence']").alias("confidence")).show(200, truncate=100)

+-----------+---------+----------+
|      token|ner_label|confidence|
+-----------+---------+----------+
|  EXCLUSIVE|    B-DOC|     0.885|
|DISTRIBUTOR|    I-DOC|    0.7397|
|  AGREEMENT|    I-DOC|    0.9926|
|         ("|        O|    0.9998|
|  Agreement|        O|    0.9964|
|         ")|        O|       1.0|
|      dated|        O|       1.0|
|         as|        O|    0.9985|
|      April|B-EFFDATE|    0.9845|
|         15|I-EFFDATE|     0.951|
|          ,|I-EFFDATE|    0.9504|
|       1994|I-EFFDATE|    0.8741|
|         by|        O|       1.0|
|        and|        O|       1.0|
|    between|        O|       1.0|
|       IMRS|  B-PARTY|    0.9898|
| OPERATIONS|  I-PARTY|    0.9987|
|        INC|  I-PARTY|    0.9995|
|          .|        O|    0.9907|
|          ,|        O|    0.9983|
|          a|        O|       1.0|
|   Delaware|        O|    0.9997|
|corporation|        O|    0.9999|
|       with|        O|       1.0|
|        its|        O|       1.0|
|  principal|       

In [None]:
result.select(F.explode(F.arrays_zip(result.ner_chunk.result, result.ner_chunk.metadata)).alias("cols")) \
      .select(F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']['entity']").alias("ner_label"),
              F.expr("cols['1']['confidence']").alias("confidence")).show(truncate=False)

+-------------------------------+---------+----------+
|chunk                          |ner_label|confidence|
+-------------------------------+---------+----------+
|EXCLUSIVE DISTRIBUTOR AGREEMENT|DOC      |0.87243336|
|April 15, 1994                 |EFFDATE  |0.94      |
|IMRS OPERATIONS INC            |PARTY    |0.996     |
|Developer                      |ALIAS    |0.9741    |
|Delteq Pte Ltd                 |PARTY    |0.9505667 |
|Distributor                    |ALIAS    |0.9814    |
+-------------------------------+---------+----------+
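To make the step from token-level tags to chunks concrete, here is a plain-Python sketch that merges IOB-tagged tokens into chunks. It is not the `NerConverter` implementation. Averaging the token confidences is an assumption, chosen because it reproduces the chunk confidences printed above, and joining tokens with spaces is a simplification (the real chunks keep the original spacing, e.g. `April 15, 1994`).

```python
# (token, tag, confidence) rows copied from the token-level output above.
rows = [
    ("EXCLUSIVE", "B-DOC", 0.885), ("DISTRIBUTOR", "I-DOC", 0.7397),
    ("AGREEMENT", "I-DOC", 0.9926), ('("', "O", 0.9998),
    ("Agreement", "O", 0.9964), ('")', "O", 1.0),
    ("dated", "O", 1.0), ("as", "O", 0.9985),
    ("April", "B-EFFDATE", 0.9845), ("15", "I-EFFDATE", 0.951),
    (",", "I-EFFDATE", 0.9504), ("1994", "I-EFFDATE", 0.8741),
    ("by", "O", 1.0), ("and", "O", 1.0), ("between", "O", 1.0),
    ("IMRS", "B-PARTY", 0.9898), ("OPERATIONS", "I-PARTY", 0.9987),
    ("INC", "I-PARTY", 0.9995), (".", "O", 0.9907),
]

def merge_iob(rows):
    """Group consecutive B-/I- tokens of the same entity into chunks."""
    chunks, current = [], None
    for token, tag, conf in rows:
        prefix, _, entity = tag.partition("-")
        starts_new = prefix == "B" or (
            prefix == "I" and (current is None or current["entity"] != entity)
        )
        if starts_new:
            if current:
                chunks.append(current)
            current = {"entity": entity, "tokens": [token], "confs": [conf]}
        elif prefix == "I":
            current["tokens"].append(token)
            current["confs"].append(conf)
        else:  # an "O" tag closes any open chunk
            if current:
                chunks.append(current)
            current = None
    if current:
        chunks.append(current)
    # Chunk confidence = mean of its token confidences (assumption, see lead-in).
    return [
        (" ".join(c["tokens"]), c["entity"], sum(c["confs"]) / len(c["confs"]))
        for c in chunks
    ]

merged = merge_iob(rows)
for text_, entity, conf in merged:
    print(f"{text_:35} {entity:8} {conf:.4f}")
```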



####🖨️ **Getting Result with LightPipeline**

LightPipelines are Spark NLP-specific pipelines, equivalent to a Spark ML Pipeline but meant for smaller amounts of data. They're useful when working with small datasets, debugging results, or running either training or prediction from an API that serves one-off requests.

A LightPipeline is a Spark ML pipeline converted to run on a single machine as a multi-threaded task, which makes it roughly 10x faster for smaller amounts of data (small is relative, but 50k sentences is roughly a good maximum). To use one, we simply plug in a trained (fitted) pipeline and annotate plain text. We don't even need to convert the input text to a DataFrame, even though the underlying pipeline expects a DataFrame as input. This is quite useful for getting predictions on a few lines of text from a trained ML model.

**It is nearly 10x faster than using a Spark ML Pipeline.**

For more details:
[https://medium.com/spark-nlp/spark-nlp-101-lightpipeline-a544e93f20f1](https://medium.com/spark-nlp/spark-nlp-101-lightpipeline-a544e93f20f1)

In [None]:
import pandas as pd

light_model = nlp.LightPipeline(model)

light_result = light_model.fullAnnotate(text)


chunks = []
entities = []
sentence= []
begin = []
end = []

for n in light_result[0]['ner_chunk']:
        
    begin.append(n.begin)
    end.append(n.end)
    chunks.append(n.result)
    entities.append(n.metadata['entity']) 
    sentence.append(n.metadata['sentence'])
    
    

df = pd.DataFrame({'chunks':chunks, 'begin': begin, 'end':end, 
                   'sentence_id':sentence, 'entities':entities})

df.head(20)

| | chunks | begin | end | sentence_id | entities |
|---|---|---|---|---|---|
| 0 | EXCLUSIVE DISTRIBUTOR AGREEMENT | 0 | 30 | 0 | DOC |
| 1 | April 15, 1994 | 57 | 70 | 0 | EFFDATE |
| 2 | IMRS OPERATIONS INC | 87 | 105 | 0 | PARTY |
| 3 | Developer | 259 | 267 | 1 | ALIAS |
| 4 | Delteq Pte Ltd | 276 | 289 | 1 | PARTY |
| 5 | Distributor | 510 | 520 | 1 | ALIAS |
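Judging from the values above, `begin` and `end` are character offsets into the input text and `end` is inclusive, so a chunk is recovered with `text[begin:end + 1]`. The short check below uses the opening of the sample text and the first three rows of the output. The helper name `span_text` is made up for this example.

```python
# Opening of the sample text used above (enough to cover the first three chunks).
snippet = ('EXCLUSIVE DISTRIBUTOR AGREEMENT (" Agreement ") dated as '
           'April 15, 1994 by and between IMRS OPERATIONS INC., a Delaware corporation')

# (entity, begin, end) taken from the LightPipeline output above.
spans = [("DOC", 0, 30), ("EFFDATE", 57, 70), ("PARTY", 87, 105)]

def span_text(text, begin, end):
    """Return the chunk text for an inclusive [begin, end] character span."""
    return text[begin:end + 1]

for entity, begin, end in spans:
    print(entity, "->", span_text(snippet, begin, end))
```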


####📌 NER Visualizer

To save the visualization as HTML, pass the `save_path` parameter to the `display` function.

In [None]:
# from sparknlp_display import NerVisualizer

visualiser = nlp.viz.NerVisualizer()

visualiser.display(light_result[0], label_col='ner_chunk', document_col='document')

##🔎 Create Generic Pipeline for NerDL Models

In [None]:
def base_pipeline():
    
    documentAssembler = nlp.DocumentAssembler()\
        .setInputCol("text")\
        .setOutputCol("document")

    textSplitter = legal.TextSplitter()\
        .setInputCols(["document"])\
        .setOutputCol("sentence")

    tokenizer = nlp.Tokenizer()\
        .setInputCols(["sentence"])\
        .setOutputCol("token")
    
    pipeline = nlp.Pipeline(stages=[
            documentAssembler,
            textSplitter,
            tokenizer])
    
    return pipeline

In [None]:
def generic_ner_pipeline(model_name):
    
    embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base", "en") \
            .setInputCols("sentence", "token") \
            .setOutputCol("embeddings")

    ner_model = legal.NerModel.pretrained(model_name, "en", "legal/models")\
            .setInputCols(["sentence", "token", "embeddings"])\
            .setOutputCol("ner")

    ner_converter = nlp.NerConverter()\
            .setInputCols(["sentence","token","ner"])\
            .setOutputCol("ner_chunk")

    nlpPipeline = nlp.Pipeline(stages=[
            base_pipeline(),
            embeddings,
            ner_model,
            ner_converter])

    empty_data = spark.createDataFrame([[""]]).toDF("text")

    model = nlpPipeline.fit(empty_data)
    
    return model

##📌 Create Generic Result Function

In [None]:
def get_result(result):
    result.select(F.explode(F.arrays_zip(result.ner_chunk.result, 
                                         result.ner_chunk.metadata)).alias("cols")) \
          .select(F.expr("cols['0']").alias("chunk"),
                  F.expr("cols['1']['entity']").alias("ner_label")).show(50, truncate=False)

###✔️ Legal Cuad_NER_Header Model

This model uses Named Entity Recognition to detect **HEADER** and **SUBHEADER** entities, with the aim of identifying the different sections of a legal document.

In [None]:
text = """5. GRANT OF PATENT LICENSE
5.1 Arizona Patent Grant. Subject to the terms and conditions of this Agreement, Arizona hereby grants to the Company a perpetual, non-exclusive, royalty-free license in, to and under the Arizona Licensed Patents for use in the Company Field throughout the world."""

model_name = "legner_headers"
df = spark.createDataFrame([[text]]).toDF("text")

result = generic_ner_pipeline(model_name).transform(df)

roberta_embeddings_legal_roberta_base download started this may take some time.
Approximate size to download 447.2 MB
[OK!]
legner_headers download started this may take some time.
[OK!]


In [None]:
get_result(result)

+--------------------------+---------+
|chunk                     |ner_label|
+--------------------------+---------+
|5. GRANT OF PATENT LICENSE|HEADER   |
|5.1 Arizona Patent Grant  |SUBHEADER|
+--------------------------+---------+
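One way the detected headers could be used is to split a document into sections. The sketch below does this in plain Python for the sample text. It locates each header with `str.find`, which stands in for the character offsets that the annotations carry, and the helper name `split_sections` is hypothetical.

```python
text = """5. GRANT OF PATENT LICENSE
5.1 Arizona Patent Grant. Subject to the terms and conditions of this Agreement, Arizona hereby grants to the Company a perpetual, non-exclusive, royalty-free license in, to and under the Arizona Licensed Patents for use in the Company Field throughout the world."""

# (chunk, label) pairs copied from the output above.
header_chunks = [("5. GRANT OF PATENT LICENSE", "HEADER"),
                 ("5.1 Arizona Patent Grant", "SUBHEADER")]

def split_sections(text, header_chunks):
    """Return (label, title, body) tuples, one per detected header."""
    located, cursor = [], 0
    for title, label in header_chunks:
        start = text.find(title, cursor)  # assumption: stands in for annotation offsets
        if start == -1:
            continue
        located.append((start, start + len(title), title, label))
        cursor = start + len(title)
    sections = []
    for i, (start, stop, title, label) in enumerate(located):
        next_start = located[i + 1][0] if i + 1 < len(located) else len(text)
        body = text[stop:next_start].lstrip(" .\n").rstrip()
        sections.append((label, title, body))
    return sections

sections = split_sections(text, header_chunks)
for label, title, body in sections:
    print(f"[{label}] {title!r}: {body[:40]!r}")
```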



###✔️ Legal Cuad_NER_Obligations Model

📚Entities:
 - OBLIGATION_SUBJECT
 - OBLIGATION_ACTION
 - OBLIGATION
 - OBLIGATION_INDIRECT_OBJECT

In [None]:
tokenClassifier = legal.BertForTokenClassification.pretrained("legner_obligations", "en", "legal/models")\
  .setInputCols("token", "sentence")\
  .setOutputCol("ner")\
  .setCaseSensitive(True)

ner_converter = nlp.NerConverter()\
    .setInputCols(["sentence","token","ner"])\
    .setOutputCol("ner_chunk")

pipeline = nlp.Pipeline(stages=[
    base_pipeline(), 
    tokenClassifier,
    ner_converter])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = pipeline.fit(empty_data)

legner_obligations download started this may take some time.
[OK!]


In [None]:
# Sometimes models work better with lowercase, depending on the vocabulary of the uppercase items
# Sometimes only uncased language models are present.
# This one is mixed but works better with lowercase
text = """PPD may engage VS to perform imaging services""".lower()

df = spark.createDataFrame([[text]]).toDF("text")

result = model.transform(df)


In [None]:
get_result(result)

+------------------+------------------+
|chunk             |ner_label         |
+------------------+------------------+
|ppd               |OBLIGATION_SUBJECT|
|may engage        |OBLIGATION_ACTION |
|vs                |OBLIGATION        |
|to perform imaging|OBLIGATION        |
+------------------+------------------+
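Because the input was lowercased, the chunks above come back in lowercase too. If the original casing is needed downstream, it can be recovered by position, since lowercasing ASCII text keeps the string length unchanged. The sketch below is plain Python: it locates chunks with `str.find` as a stand-in for the annotation offsets, and `restore_casing` is a hypothetical helper.

```python
original = "PPD may engage VS to perform imaging services"
lowered = original.lower()

# Guard: some non-ASCII characters change length when lowercased.
assert len(lowered) == len(original)

# Chunks copied from the output above.
chunks = ["ppd", "may engage", "vs", "to perform imaging"]

def restore_casing(original, lowered, chunks):
    """Map lowercase chunks back to the same character span in the original text."""
    restored, cursor = [], 0
    for chunk in chunks:
        begin = lowered.find(chunk, cursor)
        if begin == -1:
            restored.append(chunk)  # not found: keep the chunk unchanged
            continue
        end = begin + len(chunk)
        restored.append(original[begin:end])
        cursor = end
    return restored

print(restore_casing(original, lowered, chunks))
# ['PPD', 'may engage', 'VS', 'to perform imaging']
```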



###✔️ Legal NER_Law_Money Spanish Model with RoBertaForTokenClassification

📚Entities
 - LAW
 - MONEY

In [None]:
tokenClassifier = nlp.RoBertaForTokenClassification.pretrained("legner_law_money", "es", "legal/models") \
    .setInputCols(["sentence", "token"])\
    .setOutputCol("ner")
ner_converter = nlp.NerConverter()\
    .setInputCols(["sentence","token","ner"])\
    .setOutputCol("ner_chunk")

pipeline = nlp.Pipeline(stages=[
    base_pipeline(), 
    tokenClassifier,
    ner_converter])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = pipeline.fit(empty_data)

legner_law_money download started this may take some time.
Approximate size to download 395.1 MB
[OK!]


In [None]:
text = """La recaudación del ministerio del interior fue de 20,000,000 euros así constatado por el artículo 24 de la Constitución Española."""

df = spark.createDataFrame([[text]]).toDF("text")

result = model.transform(df)

In [None]:
get_result(result)

+---------------------------------------+---------+
|chunk                                  |ner_label|
+---------------------------------------+---------+
|20,000,000 euros                       |MONEY    |
|artículo 24 de la Constitución Española|LAW      |
+---------------------------------------+---------+



#🔎 Zero-shot Legal Example

📚`Zero-shot` is an inference paradigm that lets us use a model for prediction without any previous training step.

To do so, several examples (_hypotheses_) are provided to the language model, which uses `NLI (Natural Language Inference)` to check whether any information found in the text matches the examples (confirms the hypotheses).

NLI usually works by trying to _confirm or reject a hypothesis_. The _hypotheses_ are the `prompts` or examples we provide. If any piece of information confirms a constructed hypothesis (answers one of the given examples), the hypothesis is confirmed and the zero-shot prediction is triggered.

Let's see it in action.

In [None]:
documentAssembler = nlp.DocumentAssembler()\
  .setInputCol("text")\
  .setOutputCol("document")

textSplitter = legal.TextSplitter()\
  .setInputCols(["document"])\
  .setOutputCol("sentence")

sparktokenizer = nlp.Tokenizer()\
  .setInputCols("sentence")\
  .setOutputCol("token")

zero_shot_ner = legal.ZeroShotNerModel.pretrained("legner_roberta_zeroshot", "en", "legal/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("zero_shot_ner")\
    .setEntityDefinitions(
        {
            "DATE": ['When was the company acquisition?', 'When was the company purchase agreement?', "When was the agreement?"],
            "ORG": ["Which company?"],
            "STATE": ["Which state?"],
            "AGREEMENT": ["What kind of agreement?"],
            "LICENSE": ["What kind of license?"],
            "LICENSE_RECIPIENT": ["To whom the license is granted?"]
        })
    

nerconverter = nlp.NerConverter()\
  .setInputCols(["sentence", "token", "zero_shot_ner"])\
  .setOutputCol("ner_chunk")

pipeline =  nlp.Pipeline(stages=[
  documentAssembler,
  textSplitter,
  sparktokenizer,
  zero_shot_ner,
  nerconverter
    ]
)

legner_roberta_zeroshot download started this may take some time.
[OK!]
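The entity definitions above map each label to one or more prompts. The plain-Python sketch below shows one way to think about that mapping: the dictionary flattens into (label, prompt) pairs, and candidate answers are kept only when their score clears a threshold. How the actual model scores prompts and resolves conflicts is not described here, so `flatten_prompts`, `select_entities`, the threshold value, and the candidate scores are all assumptions invented for illustration, not model output.

```python
entity_definitions = {
    "DATE": ["When was the company acquisition?",
             "When was the company purchase agreement?",
             "When was the agreement?"],
    "ORG": ["Which company?"],
    "STATE": ["Which state?"],
    "AGREEMENT": ["What kind of agreement?"],
    "LICENSE": ["What kind of license?"],
    "LICENSE_RECIPIENT": ["To whom the license is granted?"],
}

def flatten_prompts(definitions):
    """Turn {label: [prompts]} into a flat list of (label, prompt) pairs."""
    return [(label, prompt)
            for label, prompts in definitions.items()
            for prompt in prompts]

def select_entities(candidates, threshold):
    """For every span, keep the best-scoring label whose score clears the threshold."""
    best = {}
    for label, span, score in candidates:
        if score < threshold:
            continue
        if span not in best or score > best[span][1]:
            best[span] = (label, score)
    return {span: label for span, (label, _score) in best.items()}

prompts = flatten_prompts(entity_definitions)
print(len(prompts), "prompts for", len(entity_definitions), "labels")

# Hypothetical (label, span, score) candidates; the scores are invented.
candidates = [
    ("DATE", "December 31, 2018", 0.93),
    ("ORG", "Armstrong Flooring, Inc.", 0.81),
    ("STATE", "Armstrong Flooring, Inc.", 0.12),
    ("LICENSE", "Delaware", 0.04),
]
selected = select_entities(candidates, threshold=0.1)
print(selected)
```

Adding more prompts per label, as done for `DATE`, gives the model more ways to match the same kind of entity.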


In [None]:
from pyspark.sql.types import StructType,StructField, StringType

sample_text = ["""In March 2012, as part of a longer-term strategy, the Company acquired Vertro, Inc., which owned and operated the ALOT product portfolio.""",
              """In February 2017, the Company entered into an asset purchase agreement with NetSeer, Inc.""",
              """This INTELLECTUAL PROPERTY AGREEMENT, dated as of December 31, 2018 (the 'Effective Date') is entered into by and between Armstrong Flooring, Inc., a Delaware corporation ('Seller') and AFI Licensing LLC, a Delaware company (the 'Licensee')""",
              """The Company hereby grants to Seller a perpetual, non- exclusive, royalty-free license"""]

p_model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))

res = p_model.transform(spark.createDataFrame(sample_text, StringType()).toDF("text"))

In [None]:
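# from pyspark.sql import functions as F

# Note: this cell only builds the DataFrame, so the notebook displays its schema.
# Append an action such as .show(truncate=False) to print the rows.
res.select(F.explode(F.arrays_zip(res.ner_chunk.result, res.ner_chunk.begin, res.ner_chunk.end, res.ner_chunk.metadata)).alias("cols")) \
   .select(F.expr("cols['0']").alias("chunk"),
           F.expr("cols['3']['entity']").alias("ner_label"))\
           .filter("ner_label!='O'")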

DataFrame[chunk: string, ner_label: string]

In [None]:
lp = nlp.LightPipeline(p_model)
lp_res_1 = lp.fullAnnotate(sample_text[2])
lp_res_2 = lp.fullAnnotate(sample_text[3])

In [None]:
# from sparknlp_display import NerVisualizer

visualiser = nlp.viz.NerVisualizer()

visualiser.display(lp_res_1[0], label_col='ner_chunk', document_col='document')

In [None]:
visualiser.display(lp_res_2[0], label_col='ner_chunk', document_col='document')