![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/10.1.Chunk_Mappers_Training.ipynb)

# Installation

In [None]:
! pip install -q johnsnowlabs

## Automatic Installation
Using my.johnsnowlabs.com SSO

In [None]:
from johnsnowlabs import nlp, legal

# nlp.install(force_browser=True)

## Manual downloading
If you are not registered at my.johnsnowlabs.com, received your license by email, or are using Safari, you may need to install the license manually.

- Go to my.johnsnowlabs.com
- Download your license
- Upload it using the following command

In [None]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

- Install it

In [None]:
nlp.install()

# Starting

In [None]:
spark = nlp.start()

# Legal Data Augmentation with Chunk Mappers

# About Data Augmentation

__Data Augmentation__ is the process of enriching an extracted data point with information from external sources.

For example, let's suppose we are working with a document that mentions the company _Amazon_. The document could be about stock prices, litigation, or a commercial agreement with a provider, among others.

Using NER, we can extract the company name from the document as an Organization, but that is all the information the document gives us about the company.

With __Data Augmentation__, we can use external sources such as _SEC Edgar, Crunchbase, Nasdaq_ or even _Wikipedia_ to enrich the company with much more information, allowing us to make better decisions.

Let's see how to do it.
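
Conceptually, the enrichment is a lookup keyed by the extracted entity. The following is a minimal plain-Python sketch of that idea, not the library's implementation; the `external_source` dictionary and the `augment` helper are hypothetical stand-ins for a real source.

```python
# Hypothetical stand-in for an external source such as SEC Edgar.
external_source = {
    "AWA Group LP": {"state_location": "NC", "state_incorporation": "DE"},
}

def augment(entity, source):
    """Merge the extracted entity with whatever the source knows about it."""
    record = {"name": entity}
    # Unknown entities simply come back with no extra fields.
    record.update(source.get(entity, {}))
    return record

print(augment("AWA Group LP", external_source))
print(augment("Unknown Corp", external_source))
```

An entity that is missing from the source is returned with only its name, so downstream code can tell enriched records from unenriched ones.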

# Train Your Own ChunkMapper Model

Here, we will train a ChunkMapper model on a dictionary of 1000 sample entries.

In [None]:
! wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/legal-nlp/data/sample_openedgar.json

In [None]:
import json
with open('sample_openedgar.json', 'r') as f:
 company_json = json.load(f)

In [None]:
company_json['mappings'][8]

{'key': 'AWA Group LP',
 'relations': [{'key': 'name', 'values': ['AWA Group LP']},
 {'key': 'sic', 'values': ['INVESTMENT ADVICE [6282]']},
 {'key': 'sic_code', 'values': [6282, 0]},
 {'key': 'irs_number', 'values': [371785232, 0]},
 {'key': 'fiscal_year_end', 'values': [630, 1231, 0]},
 {'key': 'state_location', 'values': ['NC']},
 {'key': 'state_incorporation', 'values': ['DE']},
 {'key': 'business_street', 'values': ['116 SOUTH FRANKLIN STREET']},
 {'key': 'business_city', 'values': ['ROCKY MOUNT']},
 {'key': 'business_state', 'values': ['NC']},
 {'key': 'business_zip', 'values': ['27804']},
 {'key': 'business_phone', 'values': ['952-446-6678']},
 {'key': 'former_name', 'values': ['']},
 {'key': 'former_name_date', 'values': ['']},
 {'key': 'date',
 'values': ['2017-01-23',
 '2017-03-16',
 '2016-01-22',
 '2016-01-19',
 '2015-06-30',
 '2016-04-14',
 '2016-07-27',
 '2016-10-28',
 '2015-06-26',
 '2015-09-02',
 '2015-09-29',
 '2015-12-31']},
 {'key': 'company_id', 'values': [1645148]}]
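
The entry above suggests the layout of the dictionary file: a top-level `mappings` list whose items carry a `key` (the chunk text) and a list of `relations`, each with its own `key` and `values`. Below is a minimal sketch for producing a dictionary in that layout from your own records, assuming this layout is all that is required. The `build_mappings` helper and the `records` sample are hypothetical.

```python
import json

def build_mappings(records):
    """Convert {name: {relation: value-or-list}} into the 'mappings' layout."""
    mappings = []
    for name, fields in records.items():
        relations = [
            # Every relation stores a list of values, even when there is only one.
            {"key": rel, "values": vals if isinstance(vals, list) else [vals]}
            for rel, vals in fields.items()
        ]
        mappings.append({"key": name, "relations": relations})
    return {"mappings": mappings}

# Hypothetical input records, reusing values from the entry shown above.
records = {
    "AWA Group LP": {
        "name": "AWA Group LP",
        "state_location": "NC",
        "sic_code": [6282, 0],
    }
}

dictionary = build_mappings(records)
print(json.dumps(dictionary, indent=2))
```

The printed JSON can then be written to a file and passed to `setDictionary` in the same way as `sample_openedgar.json`.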

### Check a sample company

In [None]:
for x in company_json['mappings']:
    if 'Rayton Solar Inc.' in x['key']:
        print(x)

{'key': 'Rayton Solar Inc.', 'relations': [{'key': 'name', 'values': ['Rayton Solar Inc.']}, {'key': 'sic', 'values': ['SEMICONDUCTORS & RELATED DEVICES [3674]']}, {'key': 'sic_code', 'values': [3674]}, {'key': 'irs_number', 'values': [0]}, {'key': 'fiscal_year_end', 'values': [1231]}, {'key': 'state_location', 'values': ['CA']}, {'key': 'state_incorporation', 'values': ['DE']}, {'key': 'business_street', 'values': ['920 COLORADO AVE.']}, {'key': 'business_city', 'values': ['SANTA MONICA']}, {'key': 'business_state', 'values': ['CA']}, {'key': 'business_zip', 'values': ['90401']}, {'key': 'business_phone', 'values': ['(661) 259-4786']}, {'key': 'former_name', 'values': ['']}, {'key': 'former_name_date', 'values': ['']}, {'key': 'date', 'values': ['2017-01-10', '2017-01-20', '2017-01-06', '2017-05-15', '2017-09-28', '2016-11-29', '2016-12-20', '2016-12-22', '2022-09-21', '2019-06-27', '2018-03-22', '2018-04-30', '2018-12-10', '2021-09-22', '2020-06-08', '2020-09-28']}, {'key': 'company_

### Check all keys

In [None]:
all_rels = [x['key'] for x in company_json['mappings'][0]['relations']]

In [None]:
all_rels

['name',
 'sic',
 'sic_code',
 'irs_number',
 'fiscal_year_end',
 'state_location',
 'state_incorporation',
 'business_street',
 'business_city',
 'business_state',
 'business_zip',
 'business_phone',
 'former_name',
 'former_name_date',
 'date',
 'company_id']
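
The list above is read from the first entry only. If some entries carried relations that the first one lacks, those would be missed. The following minimal sketch collects the keys across every entry while preserving first-seen order; the `collect_relation_keys` helper and the small `sample` dictionary are hypothetical, with `sample` standing in for `company_json`.

```python
def collect_relation_keys(dictionary):
    """Return every relation key found in any mapping, in first-seen order."""
    seen = {}  # dicts preserve insertion order, so this doubles as an ordered set
    for mapping in dictionary["mappings"]:
        for relation in mapping["relations"]:
            seen.setdefault(relation["key"], None)
    return list(seen)

# Hypothetical two-entry dictionary in the same layout as company_json.
sample = {"mappings": [
    {"key": "A", "relations": [{"key": "name", "values": ["A"]},
                               {"key": "sic", "values": ["X"]}]},
    {"key": "B", "relations": [{"key": "name", "values": ["B"]},
                               {"key": "company_id", "values": [1]}]},
]}

print(collect_relation_keys(sample))  # ['name', 'sic', 'company_id']
```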

### Create ChunkMapperApproach

In [None]:
chunkerMapper = legal.ChunkMapperApproach()\
 .setInputCols(["ner_chunk"])\
 .setOutputCol("mappings")\
 .setDictionary("sample_openedgar.json")\
 .setRels(all_rels)

In [None]:
empty_dataset = spark.createDataFrame([[""]]).toDF("text")

In [None]:
fit_CM = chunkerMapper.fit(empty_dataset)

In [None]:
# Save model
fit_CM.write().overwrite().save('openedgar_2000_2022_company_mapper')

### Let's test our ChunkMapper model

In [None]:
text = [""" AWA Group LP intends to pay dividends on the Common Units on a quarterly basis at an annual rate of 8.00% of the Offering Price. """]

First, we extract the company name from the sample text.

In [None]:
documentAssembler = nlp.DocumentAssembler()\
 .setInputCol("text")\
 .setOutputCol("document")
 
sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\
 .setInputCols(["document"])\
 .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
 .setInputCols(["sentence"])\
 .setOutputCol("token")

embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
 .setInputCols(["sentence", "token"]) \
 .setOutputCol("embeddings")

ner_model = legal.NerModel.pretrained("legner_org_per_role_date", "en", "legal/models")\
 .setInputCols(["sentence", "token", "embeddings"])\
 .setOutputCol("ner")

ner_converter = nlp.NerConverter()\
 .setInputCols(["sentence","token","ner"])\
 .setOutputCol("ner_chunk")\
 .setWhiteList(["ORG"]) # Return only ORG entities

nlpPipeline = nlp.Pipeline(stages=[
 documentAssembler,
 sentenceDetector,
 tokenizer,
 embeddings,
 ner_model,
 ner_converter])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = nlpPipeline.fit(empty_data)

light_model = nlp.LightPipeline(model)

sentence_detector_dl download started this may take some time.
Approximate size to download 514.9 KB
[OK!]
bert_embeddings_sec_bert_base download started this may take some time.
Approximate size to download 390.4 MB
[OK!]
legner_org_per_role_date download started this may take some time.
[OK!]


In [None]:
# Extract the company name from the sample text

ner_result = light_model.fullAnnotate(text)

ner_result

[{'document': [Annotation(document, 0, 129, AWA Group LP intends to pay dividends on the Common Units on a quarterly basis at an annual rate of 8.00% of the Offering Price. , {})],
 'ner_chunk': [Annotation(chunk, 1, 12, AWA Group LP, {'entity': 'ORG', 'sentence': '0', 'chunk': '0', 'confidence': '0.9788'})],
 'token': [Annotation(token, 1, 3, AWA, {'sentence': '0'}),
 Annotation(token, 5, 9, Group, {'sentence': '0'}),
 Annotation(token, 11, 12, LP, {'sentence': '0'}),
 Annotation(token, 14, 20, intends, {'sentence': '0'}),
 Annotation(token, 22, 23, to, {'sentence': '0'}),
 Annotation(token, 25, 27, pay, {'sentence': '0'}),
 Annotation(token, 29, 37, dividends, {'sentence': '0'}),
 Annotation(token, 39, 40, on, {'sentence': '0'}),
 Annotation(token, 42, 44, the, {'sentence': '0'}),
 Annotation(token, 46, 51, Common, {'sentence': '0'}),
 Annotation(token, 53, 57, Units, {'sentence': '0'}),
 Annotation(token, 59, 60, on, {'sentence': '0'}),
 Annotation(token, 62, 62, a, {'sentence': '0'

In [None]:
ORG = ner_result[0]["ner_chunk"][0].result

ORG

'AWA Group LP'

In [None]:
embeddings = nlp.UniversalSentenceEncoder.pretrained("tfhub_use", "en") \
 .setInputCols("document") \
 .setOutputCol("sentence_embeddings")
 
resolver = legal.SentenceEntityResolverModel.pretrained("legel_edgar_company_name", "en", "legal/models")\
 .setInputCols(["sentence_embeddings"]) \
 .setOutputCol("resolution")\
 .setDistanceFunction("EUCLIDEAN")

pipelineModel = nlp.PipelineModel(
 stages = [
 documentAssembler,
 embeddings,
 resolver])

lp_res = nlp.LightPipeline(pipelineModel)

tfhub_use download started this may take some time.
Approximate size to download 923.7 MB
[OK!]
legel_edgar_company_name download started this may take some time.
[OK!]


In [None]:
# Normalize the company name

el_res = lp_res.annotate(ORG)

el_res

{'document': ['AWA Group LP'],
 'sentence_embeddings': ['AWA Group LP'],
 'resolution': ['AWA Group LP']}

In [None]:
NORM_ORG = el_res["resolution"][0]

NORM_ORG

'AWA Group LP'

### Let's load our ChunkMapper model

In [None]:
documentAssembler = nlp.DocumentAssembler()\
 .setInputCol("text")\
 .setOutputCol("document")

chunkAssembler = nlp.Doc2Chunk() \
 .setInputCols("document") \
 .setOutputCol("chunk") \
 .setIsArray(False)

CM = legal.ChunkMapperModel().load("openedgar_2000_2022_company_mapper")\
 .setInputCols(["chunk"])\
 .setOutputCol("mappings")

cm_pipeline = nlp.Pipeline(stages=[documentAssembler, 
 chunkAssembler, 
 CM])

fit_cm_pipeline = cm_pipeline.fit(empty_data)

In [None]:
# LightPipelines don't support Doc2Chunk, so we use the regular transform() here

df = spark.createDataFrame([[NORM_ORG]]).toDF("text")

df.show()

+------------+
| text|
+------------+
|AWA Group LP|
+------------+



In [None]:
res = fit_cm_pipeline.transform(df)

res.show()

+------------+--------------------+--------------------+--------------------+
| text| document| chunk| mappings|
+------------+--------------------+--------------------+--------------------+
|AWA Group LP|[{document, 0, 11...|[{chunk, 0, 11, A...|[{labeled_depende...|
+------------+--------------------+--------------------+--------------------+



In [None]:
res.select("mappings.result").show(truncate=False)

+----------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[AWA Group LP, INVESTMENT ADVICE [6282], 6282, 371785232, 630, NC, DE, 116 SOUTH FRANKLIN STREET, ROCKY MOUNT, NC, 27804, 952-446-6678, , , 2017-01-23, 1645148]|
+----------------------------------------------------------------------------------------------------------------------------------------------------------------+



In [None]:
r = res.select("mappings").collect()
r

[Row(mappings=[Row(annotatorType='labeled_dependency', begin=0, end=11, result='AWA Group LP', metadata={'sentence': '0', 'ops': '0.0', 'distance': '-2.220446049250313E-16', 'all_relations': '', 'chunk': '0', '__trained__': 'AWA Group LP', '__distance_function__': 'cosine', '__relation_name__': 'name', 'entity': 'AWA Group LP', 'relation': 'name'}, embeddings=[]), Row(annotatorType='labeled_dependency', begin=0, end=11, result='INVESTMENT ADVICE [6282]', metadata={'sentence': '0', 'ops': '0.0', 'distance': '-2.220446049250313E-16', 'all_relations': '', 'chunk': '0', '__trained__': 'AWA Group LP', '__distance_function__': 'cosine', '__relation_name__': 'sic', 'entity': 'AWA Group LP', 'relation': 'sic'}, embeddings=[]), Row(annotatorType='labeled_dependency', begin=0, end=11, result='6282', metadata={'sentence': '0', 'ops': '0.0', 'distance': '-2.220446049250313E-16', 'all_relations': '0', 'chunk': '0', '__trained__': 'AWA Group LP', '__distance_function__': 'cosine', '__relation_name__

In [None]:
json_dict = dict()
for n in r[0]['mappings']:
 json_dict[n.metadata['relation']] = str(n.result)
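
Note that the loop above keeps one value per relation: if a relation ever appeared more than once in the output, later values would overwrite earlier ones. The following minimal sketch groups values into lists instead. `group_by_relation` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins that only mimic the `result` and `metadata` attributes of the rows collected above.

```python
from collections import defaultdict
from types import SimpleNamespace

def group_by_relation(annotations):
    """Collect every result under its relation name instead of overwriting."""
    grouped = defaultdict(list)
    for ann in annotations:
        grouped[ann.metadata["relation"]].append(str(ann.result))
    return dict(grouped)

# Stand-in rows (not real pipeline output) with a repeated 'date' relation.
rows = [
    SimpleNamespace(result="AWA Group LP", metadata={"relation": "name"}),
    SimpleNamespace(result="2017-01-23", metadata={"relation": "date"}),
    SimpleNamespace(result="2017-03-16", metadata={"relation": "date"}),
]

print(group_by_relation(rows))
```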

In [None]:
import json
print(json.dumps(json_dict, indent=4, sort_keys=True))

{
 "business_city": "ROCKY MOUNT",
 "business_phone": "952-446-6678",
 "business_state": "NC",
 "business_street": "116 SOUTH FRANKLIN STREET",
 "business_zip": "27804",
 "company_id": "1645148",
 "date": "2017-01-23",
 "fiscal_year_end": "630",
 "former_name": "",
 "former_name_date": "",
 "irs_number": "371785232",
 "name": "AWA Group LP",
 "sic": "INVESTMENT ADVICE [6282]",
 "sic_code": "6282",
 "state_incorporation": "DE",
 "state_location": "NC"
}
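
Every value in the dictionary above is a string, because the loop casts each result with `str`. Some values encode more structure than the string shows. For instance, assuming `fiscal_year_end` follows the EDGAR `MMDD` convention (so `630` would mean June 30 and `1231` December 31), a hypothetical helper such as `parse_fiscal_year_end` could decode it:

```python
def parse_fiscal_year_end(value):
    """Split an MMDD-style value such as '630' or '1231' into (month, day).

    Assumes the MMDD convention; leading zeros may be missing, so pad first.
    """
    padded = str(value).zfill(4)
    return int(padded[:2]), int(padded[2:])

print(parse_fiscal_year_end("630"))   # (6, 30)
print(parse_fiscal_year_end("1231"))  # (12, 31)
```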
