![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/80.1.Legal_Subpoenas_NER.ipynb)

# 🎬 Installation

In [None]:
! pip install -q johnsnowlabs

## 🔗 Automatic Installation
Using my.johnsnowlabs.com SSO

In [None]:
from johnsnowlabs import nlp, finance, legal

nlp.install(refresh_install=True, visual=True, force_browser = True)

## 🔗 Manual downloading
If you are not registered at my.johnsnowlabs.com, received your license by email, or are using Safari, you may need to upload the license manually.

- Go to my.johnsnowlabs.com
- Download your license
- Upload it using the following command

In [None]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

- Install it

In [None]:
nlp.install()

# 📌 Starting

In [None]:
spark = nlp.start()

## 🔎 **Legal Subpoenas NER (small)**


✍ Explanation:

"Legal Subpoenas NER (small)" is a pre-trained named entity recognition (NER) model for legal text.

- The `legner_subpoena` model is trained to recognize and extract subpoena-related information from legal documents. A subpoena is a legal document issued by a court that commands an individual or organization to provide specific documents, testimony, or evidence relevant to a legal case. Extracting this information by hand from large volumes of legal text is time-consuming, and `legner_subpoena` is designed to automate it.

📚 Entities:

- `ADDRESS`, `MATTER_VS`, `APPOINTMENT_HOUR`, `DOCUMENT_TOPIC`, `DOCUMENT_PERSON`, `COURT_ADDRESS`, `APPOINTMENT_DATE`, `COUNTY`, `CASE`, `SIGNER`, `COURT`, `DOCUMENT_DATE_TO`, `DOCUMENT_TYPE`, `STATE`, `DOCUMENT_DATE_FROM`, `RECEIVER`, `MATTER`, `SUBPOENA_DATE`, `DOCUMENT_DATE_YEAR`


In [5]:
document = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

textSplitter = legal.TextSplitter()\
    .setInputCols(['document'])\
    .setOutputCol('sentence')

token = nlp.Tokenizer()\
    .setInputCols(['sentence'])\
    .setOutputCol('token')

roberta_embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base","en") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings") \
    .setMaxSentenceLength(512)
  
loaded_ner_model = legal.NerModel.pretrained('legner_subpoena','en','legal/models')\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

converter = nlp.NerConverter()\
    .setInputCols(["document", "token", "ner"])\
    .setOutputCol("ner_span")

ner_prediction_pipeline = nlp.Pipeline(stages = [
                                            document,
                                            textSplitter,
                                            token,
                                            roberta_embeddings,
                                            loaded_ner_model,
                                            converter
                                            ])

empty_data = spark.createDataFrame([['']]).toDF("text")

prediction_model = ner_prediction_pipeline.fit(empty_data)

text = """SUBPOENA TO PRODUCE DOCUMENTS, INFORMATION, OR OBJECTS OR TO PERMIT INSPECTION OF PREMISES IN A CIVIL ACTION

UNITED STATES DISTRICT COURT
DISTRICT OF NEW YORK

Plaintiff: Chang Lee
v.
Defendant: Jie Chen

To: Kim Nguyen
789 Elm Street
New York, NY 10003

You are hereby commanded to produce at the time, date, and place set forth below the following documents, electronically stored information, or tangible things:

All financial records, including bank statements, credit card statements, and tax returns for Jie Chen from January 1, 2017 to present;
All emails and other correspondence between Jie Chen and any business partners, associates or employees related to the above financial records from January 1, 2017 to present;
All contracts and agreements entered into by Jie Chen, including any non-disclosure agreements, from January 1, 2017 to present.
The production shall occur at the following time and location:

Date: August 15, 2023
Time: 10:00 a.m.
Location: Law Office of Lee & Associates, 456 Broadway, Suite 800, New York, NY 10003.

You are further commanded to preserve and protect the confidentiality of any documents, electronically stored information, or tangible things produced or inspected, in accordance with the applicable law or agreement.

You are not required to produce or permit inspection of any privileged or protected documents or information.

This subpoena is issued by the court at the request of the Plaintiff's attorney, and you are hereby ordered to comply with this subpoena as provided by the Federal Rules of Civil Procedure.

You must comply with this subpoena under the penalty of law.

Dated: May 4, 2023

[Signature of Clerk of Court]
By: Sarah Johnson
Deputy Clerk"""

sample_data = spark.createDataFrame([[text]]).toDF("text")

result = prediction_model.transform(sample_data)

roberta_embeddings_legal_roberta_base download started this may take some time.
Approximate size to download 447.2 MB
[OK!]
legner_subpoena download started this may take some time.
[OK!]
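
Each annotator above reads the columns named in `setInputCols` and writes the column named in `setOutputCol`, so the pipeline only works if every input column is produced by an earlier stage. The following is a minimal plain-Python sketch of that dependency check. The `STAGES` list and the `check_wiring` helper are hypothetical illustrations that mirror the column names used above; they are not part of Spark NLP.

```python
# Illustrative only: plain tuples describing the stages above, not Spark NLP objects.
# Each entry is (stage name, input columns, output column).
STAGES = [
    ("DocumentAssembler", ["text"], "document"),
    ("TextSplitter", ["document"], "sentence"),
    ("Tokenizer", ["sentence"], "token"),
    ("RoBertaEmbeddings", ["sentence", "token"], "embeddings"),
    ("NerModel", ["sentence", "token", "embeddings"], "ner"),
    ("NerConverter", ["document", "token", "ner"], "ner_span"),
]


def check_wiring(stages, initial_columns):
    """Return (stage, missing_column) pairs for inputs that no earlier stage produces."""
    available = set(initial_columns)
    problems = []
    for name, inputs, output in stages:
        for column in inputs:
            if column not in available:
                problems.append((name, column))
        # The stage's output becomes available to every later stage.
        available.add(output)
    return problems


problems = check_wiring(STAGES, ["text"])
print(problems or "all input columns are produced upstream")
```

Dropping or reordering a stage in `STAGES` makes the helper report which downstream stage would be left without one of its inputs.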


In [6]:
from pyspark.sql import functions as F

result.select(F.explode(F.arrays_zip(result.token.result, 
                                     result.ner.result, 
                                     result.ner.metadata)).alias("cols"))\
                  .select(F.expr("cols['0']").alias("token"),
                          F.expr("cols['1']").alias("ner_label"),
                          F.expr("cols['2']['confidence']").alias("confidence")).show(200, truncate=100)

+--------------+--------------------+----------+
|         token|           ner_label|confidence|
+--------------+--------------------+----------+
|      SUBPOENA|                   O|       1.0|
|            TO|                   O|       1.0|
|       PRODUCE|                   O|    0.9991|
|     DOCUMENTS|     B-DOCUMENT_TYPE|    0.9844|
|             ,|                   O|       1.0|
|   INFORMATION|     B-DOCUMENT_TYPE|    0.9345|
|             ,|                   O|    0.9993|
|            OR|                   O|    0.9999|
|       OBJECTS|                   O|    0.9624|
|            OR|                   O|       1.0|
|            TO|                   O|       1.0|
|        PERMIT|                   O|       1.0|
|    INSPECTION|                   O|    0.9982|
|            OF|                   O|       1.0|
|      PREMISES|                   O|    0.9966|
|            IN|                   O|    0.9999|
|             A|                   O|    0.9999|
|         CIVIL|    

In [7]:
result.select(F.explode(F.arrays_zip(result.ner_span.result, result.ner_span.metadata)).alias("cols")) \
      .select(F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']['entity']").alias("ner_label"),
              F.expr("cols['1']['confidence']").alias("confidence")).show(truncate=False)

+---------------------------------+------------------+----------+
|chunk                            |ner_label         |confidence|
+---------------------------------+------------------+----------+
|DOCUMENTS                        |DOCUMENT_TYPE     |0.9844    |
|INFORMATION                      |DOCUMENT_TYPE     |0.9345    |
|NEW YORK                         |STATE             |0.50975   |
|Chang Lee                        |MATTER            |0.7286    |
|Jie Chen                         |RECEIVER          |0.69225   |
|Kim Nguyen                       |RECEIVER          |0.7975    |
|789 Elm Street
New York, NY 10003|ADDRESS           |0.99141246|
|documents                        |DOCUMENT_TYPE     |0.9848    |
|electronically stored information|DOCUMENT_TYPE     |0.9720667 |
|financial records                |DOCUMENT_TYPE     |0.98545   |
|bank statements                  |DOCUMENT_TYPE     |0.9024    |
|credit card statements           |DOCUMENT_TYPE     |0.7403    |
|tax retur
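

The token-level output uses IOB tags (`B-` opens an entity, `I-` continues it, `O` is outside any entity), while the chunk-level output shows merged entities. A rough sketch of that grouping step is given here, assuming that a chunk's confidence is the mean of its token confidences, which appears consistent with the values printed above. `iob_to_chunks` is a hypothetical helper, not the `NerConverter` implementation; it joins tokens with spaces, whereas the real converter works from character offsets in the source text.

```python
def iob_to_chunks(tokens, tags, confidences):
    """Group IOB-tagged tokens into (text, label, mean_confidence) chunks."""
    chunks = []
    current = None  # [token list, label, confidence list] of the open chunk
    for token, tag, conf in zip(tokens, tags, confidences):
        prefix, _, label = tag.partition("-")
        # An I- tag with the same label extends the chunk that is currently open.
        if prefix == "I" and current is not None and current[1] == label:
            current[0].append(token)
            current[2].append(conf)
            continue
        # Anything else closes the open chunk, if there is one.
        if current is not None:
            chunks.append(current)
        # B- (or a stray I-) opens a new chunk; O leaves no chunk open.
        current = [[token], label, [conf]] if prefix in ("B", "I") else None
    if current is not None:
        chunks.append(current)
    return [(" ".join(toks), label, sum(confs) / len(confs))
            for toks, label, confs in chunks]


# First six rows of the token-level table above.
tokens = ["SUBPOENA", "TO", "PRODUCE", "DOCUMENTS", ",", "INFORMATION"]
tags = ["O", "O", "O", "B-DOCUMENT_TYPE", "O", "B-DOCUMENT_TYPE"]
confidences = [1.0, 1.0, 0.9991, 0.9844, 1.0, 0.9345]

chunks = iob_to_chunks(tokens, tags, confidences)
print(chunks)
```

For these rows the sketch yields the same two `DOCUMENT_TYPE` chunks, with the same confidences, as the first two rows of the chunk-level table.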

### 🖨️ **Getting Result with LightPipeline**

In [8]:
import pandas as pd

light_model = nlp.LightPipeline(prediction_model)

light_result = light_model.fullAnnotate(text)


# Collect one row per recognized chunk from the first (and only) input text.
rows = [
    {
        'chunks': n.result,
        'begin': n.begin,
        'end': n.end,
        'sentence_id': n.metadata['sentence'],
        'entities': n.metadata['entity'],
    }
    for n in light_result[0]['ner_span']
]

df = pd.DataFrame(rows, columns=['chunks', 'begin', 'end', 'sentence_id', 'entities'])

df.head(20)

| | chunks | begin | end | sentence_id | entities |
|---|---|---|---|---|---|
| 0 | DOCUMENTS | 20 | 28 | 0 | DOCUMENT_TYPE |
| 1 | INFORMATION | 31 | 41 | 0 | DOCUMENT_TYPE |
| 2 | NEW YORK | 151 | 158 | 0 | STATE |
| 3 | Chang Lee | 172 | 180 | 0 | MATTER |
| 4 | Jie Chen | 196 | 203 | 0 | RECEIVER |
| 5 | Kim Nguyen | 210 | 219 | 0 | RECEIVER |
| 6 | 789 Elm Street\nNew York, NY 10003 | 221 | 253 | 0 | ADDRESS |
| 7 | documents | 351 | 359 | 0 | DOCUMENT_TYPE |
| 8 | electronically stored information | 362 | 394 | 0 | DOCUMENT_TYPE |
| 9 | financial records | 422 | 438 | 0 | DOCUMENT_TYPE |
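

In the output above, `begin` and `end` are character offsets into the input text, and `end` is inclusive: `DOCUMENTS` spans 20 to 28, which is nine characters. A small sketch of recovering a chunk from those offsets follows. `span_text` is a hypothetical helper, and `text_head` is just the opening of the sample subpoena, so the example runs without Spark.

```python
# Opening of the sample subpoena text used above.
text_head = "SUBPOENA TO PRODUCE DOCUMENTS, INFORMATION, OR OBJECTS"

# (chunk, begin, end, entity) copied from the first two rows of the output above.
rows = [
    ("DOCUMENTS", 20, 28, "DOCUMENT_TYPE"),
    ("INFORMATION", 31, 41, "DOCUMENT_TYPE"),
]


def span_text(source, begin, end):
    """Slice a chunk out of the source text using inclusive begin/end offsets."""
    # Python slices exclude the stop index, hence the + 1.
    return source[begin:end + 1]


for chunk, begin, end, entity in rows:
    recovered = span_text(text_head, begin, end)
    print(entity, repr(recovered), recovered == chunk)
```

Forgetting the `+ 1` silently drops the last character of every chunk, which is easy to miss when the offsets are later used for highlighting or redaction.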


### 📌 **NER Visualizer**
To save the visualization as HTML, pass the `save_path` parameter to the `display` function.

In [9]:
# from sparknlp_display import NerVisualizer

visualiser = nlp.viz.NerVisualizer()

visualiser.display(light_result[0], label_col='ner_span', document_col='document')
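

The visualizer renders the `ner_span` chunks as highlighted text. As a library-independent illustration of that idea, the sketch below wraps entity spans in `<mark>` tags using inclusive character offsets. It is not how `NerVisualizer` is implemented; `highlight_spans` is a hypothetical helper and it assumes the spans do not overlap.

```python
import html


def highlight_spans(source, spans):
    """Wrap each (begin, end, label) span in <mark>; end offsets are inclusive."""
    parts = []
    cursor = 0
    for begin, end, label in sorted(spans):
        # Plain text between the previous span and this one.
        parts.append(html.escape(source[cursor:begin]))
        chunk = html.escape(source[begin:end + 1])
        parts.append(f'<mark title="{html.escape(label)}">{chunk}</mark>')
        cursor = end + 1
    # Remaining text after the last span.
    parts.append(html.escape(source[cursor:]))
    return "".join(parts)


snippet = "SUBPOENA TO PRODUCE DOCUMENTS, INFORMATION"
markup = highlight_spans(
    snippet,
    [(20, 28, "DOCUMENT_TYPE"), (31, 41, "DOCUMENT_TYPE")],
)
print(markup)
```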